ROCm 6.4 does not yet publish a Release file for Debian Bookworm, causing
the live-build chroot hook to fail with "does not have a Release file".
Try each version in ROCM_CANDIDATES order; skip to the next if apt-get update
fails (repo unavailable). Exit gracefully if none are available.
Also rename inner 'candidate' variable to 'smi_path' to avoid collision.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two-phase GPU thermal cycling test with per-second telemetry:
- Phases: baseline → load1 → pause (no cooldown) → load2 → cooldown
- Monitors: fan RPM (ipmitool sdr), CPU/server temps (ipmitool/sensors),
system power (ipmitool dcmi), GPU temp/power/usage/clock/throttle (nvidia-smi)
- Detects throttling via clocks_throttle_reasons.active bitmask
- Measures fan response lag from load start (validates case-04 ~2s lag)
- Exports metrics.csv (wide format, one row/sec) and fan-sensors.csv (long format)
- TUI: adds [F] Fan Stress Test to Health Check screen with Quick/Standard/Express modes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New tui/sat_progress.go: polls {DefaultSATBaseDir}/{prefix}-*/verbose.log every 300ms and parses completed/in-progress steps
- Busy screen now shows each step as PASS lscpu (234ms) / FAIL stress-ng (60.0s) / ... sensors-after instead of just "Working..."
2. Test results shown on screen (instead of just "Archive written to /path")
- RunCPUAcceptancePackResult, RunMemoryAcceptancePackResult, RunStorageAcceptancePackResult, RunAMDAcceptancePackResult now read summary.txt from the run directory and return a formatted per-step result:
Run: 2025-03-25T10:00:00Z
PASS lscpu
PASS sensors-before
FAIL stress-ng
PASS sensors-after
Overall: FAILED (ok=3 failed=1)
3. AMD GPU SAT with auto-detection
- platform.System.DetectGPUVendor(): checks /dev/nvidia0 → "nvidia", /dev/kfd → "amd"
- platform.System.RunAMDAcceptancePack(): runs rocm-smi, rocm-smi --showallinfo, dmidecode
- GPU SAT (G key / GPU row enter) automatically routes to AMD or NVIDIA based on detected vendor
- "Run All" also auto-detects vendor
4. Panel detail view
- GPU detail now shows the most recent (NVIDIA or AMD) SAT result, whichever is newer
- All SAT detail views use the same human-readable formatSATDetail format
- firmware-amd-graphics: Aldebaran firmware blobs (fixes amdgpu IB ring
test errors on MI250/MI250X at boot)
- 9001-amd-rocm.hook.chroot: adds AMD ROCm 6.4 apt repo and installs
rocm-smi-lib for GPU monitoring (analogous to nvidia-smi)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--start is not a valid nvme-cli flag; correct syntax is -s 1 (short test).
Add --wait so the command blocks until the test completes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stress-ng was missing from the LiveCD — CPU acceptance test exited
immediately with rc=1 because the binary was not found.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
isGPUDevice matched all AMD vendor PCIe devices (SATA, crypto coprocessors,
PCIe dummies) because of a broad strings.Contains(vendor,"amd") check.
Remove it — AMD Instinct/Radeon GPUs are caught by ProcessingAccelerator /
DisplayController class. Also exclude ASPEED (BMC VGA adapter).
Add clear before bee-tui to avoid dirty terminal output.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Check PCI vendor 10de before attempting insmod — avoids spurious
nvidia_uvm symbol errors on systems without NVIDIA hardware (e.g. AMD MI350).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The graphical splash had "BEE / HARDWARE AUDIT" baked into the PNG,
overriding the echo ASCII art. Replace with a plain black background
so the EASY-BEE block-char banner from grub.cfg echo commands is visible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Export detected DEBIAN_KERNEL_ABI as BEE_KERNEL_ABI from build.sh so
auto/config can pin linux-packages to the exact versioned package
(e.g. linux-image-6.1.0-31 + flavour amd64 = linux-image-6.1.0-31-amd64).
This prevents nvidia.ko vermagic mismatch if the linux-image-amd64
meta-package is updated between build start and lb build chroot step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
live-build constructs the kernel package as <linux-packages>-<linux-flavours>,
so "linux-image-amd64" + "amd64" = "linux-image-amd64-amd64" (not found).
The correct value is "linux-image" + "amd64" = "linux-image-amd64".
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace "Bee Hardware Audit" branding with EASY-BEE across bootloader
and LiveCD: grub.cfg menu entries, echo ASCII art before menu,
motd banner, iso-volume and iso-application metadata.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Dockerfile: linux-headers-amd64 meta-package instead of pinned ABI;
remove DEBIAN_KERNEL_ABI build-arg (no longer needed at image build time)
- build-in-container.sh: drop --build-arg DEBIAN_KERNEL_ABI
- build.sh: apt-get update + detect ABI from apt-cache at build time;
auto-install linux-headers-<ABI> if kernel changed since image build
Image rebuild is now needed only when changing Go version or lb tools,
not on every Debian kernel point release.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DEBIAN_KERNEL_ABI=auto in VERSIONS — build.sh queries
apt-cache depends linux-image-amd64 to find the current ABI.
lb config now uses linux-image-amd64 meta-package.
This prevents build failures when Debian drops old kernel packages
from the repo (happens with every point release).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
exFAT is the default filesystem on USB drives >32GB sold today.
Without exfatprogs, mount fails silently and export to such drives is broken.
ntfs-3g covers Windows-formatted drives.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BMC:
- collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent
- collector/collector.go: append BMC firmware record to snap.Firmware
- app/panel.go: show BMC version in TUI right-panel header alongside BIOS
CPU SAT:
- platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng
- app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated
- app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data
- tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label
- tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult
- tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration
- cmd/bee/main.go: bee sat cpu [--duration N] subcommand
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace 12-item flat menu with 4-item main menu: Health Check, Export support bundle, Settings, Exit
- Add Health Check screen (Lenovo-style): per-component checkboxes (GPU/MEM/DISK/CPU), Quick/Standard/Express modes, Run All, letter hotkeys G/M/S/C/R/A/1/2/3
- Add two-column main screen: left = menu, right = hardware panel with colored PASS/FAIL/CANCEL/N/A status per component; Tab/→ switches focus, Enter opens component detail
- Add app.LoadHardwarePanel() + ComponentDetailResult() reading audit JSON and SAT summary.txt files
- Move Network/Services/audit actions into Settings submenu
- Export: support bundle only (remove separate audit JSON export)
- Delete screen_acceptance.go; add screen_health_check.go, screen_settings.go, app/panel.go
- Add BMC + CPU stress-ng tests to backlog
- Update bible submodule
- Rewrite tui_test.go for new screen/action structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA's CUDA repo for Debian 12 only has NCCL packages for cuda13.x,
not cuda12.x. Update to the latest available: 2.28.9-1+cuda13.0.
Also pass sha256 from VERSIONS into build-nccl.sh for integrity check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Download libnccl2 .deb from NVIDIA's CUDA apt repo (Debian 12) during ISO
build, extract libnccl.so.* into the overlay at /usr/lib/ alongside
libnvidia-ml and libcuda. Version pinned in VERSIONS, reflected in
/etc/bee-release.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without a keepalive the kernel watchdog timer expires and reboots
the host mid-audit. Configuring RuntimeWatchdogSec lets systemd PID 1
reset /dev/watchdog every 30 s — well within the typical 60 s timeout.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>