Commit Graph

9 Commits

Author SHA1 Message Date
2320925433 Skip PCIe link-speed warnings for disabled devices
Disabled PCIe devices (sysfs enable==0) carry no data traffic; their
link state has no operational impact. Switchtec PCIe switch management
endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers)
train at reduced speed intentionally and were producing spurious warnings.

Check is vendor-agnostic: reads enable attribute via existing helper,
no vendor/device ID hardcoding.

Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-12 03:42:19 +03:00
Mikhail Chusavitin
11d00b9442 Document read-only submodules policy 2026-04-29 09:54:23 +03:00
eb60100297 fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
  of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
  overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
  bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
  base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:25:23 +03:00
Mikhail Chusavitin
2baf3be640 Handle memtest recovery probe under set -e 2026-04-01 17:42:13 +03:00
Mikhail Chusavitin
d92f8f41d0 Fix memtest ISO validation false negatives 2026-04-01 12:22:17 +03:00
Mikhail Chusavitin
f6f4923ac9 fix(iso): recover memtest after live-build 2026-04-01 08:55:57 +03:00
Mikhail Chusavitin
3472afea32 fix(iso): make memtest non-blocking by default 2026-04-01 08:33:36 +03:00
Mikhail Chusavitin
6dee8f3509 Add NVIDIA stress loader selection and DCGM 4 support 2026-03-31 11:15:15 +03:00
871c766194 docs: add bible-local with architecture and decisions, fix PLAN.md versions
- bible-local/architecture/system-overview.md: scope, tech stack, key paths
- bible-local/architecture/runtime-flows.md: boot sequence, ISO build, collector flow
- bible-local/decisions/2026-03-05-nvidia-proprietary-driver.md
- PLAN.md: update KERNEL_VERSION 6.6→6.12, NVIDIA 550.54.15→590.48.01

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 18:15:07 +03:00