Echo messages captured in stdout polluted the return value of
download_verified_pkg(), causing extract_deb() to receive a
multi-line string instead of a file path and silently exit via set -e.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
console=tty0 sent kernel messages to the active VT (tty1), overwriting
the TUI. Changed to console=tty2 so kernel logs land on a dedicated
console. tty1 is now clean; operator can press Alt+F2 to inspect kernel
messages and Alt+F3 for an extra shell.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GPU Platform Stress Test now shows a live in-TUI chart instead of nvtop.
nvidia-smi is polled every second; up to 60 data points per GPU kept.
All three metrics (Usage %, Temp °C, Power W) drawn on a single plot,
each normalised to its own range and rendered in a different colour.
- Memory allocation changed from MemoryMB/16 to MemoryMB-512 (full VRAM
minus 512 MB driver overhead) so bee-gpu-stress actually stresses memory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
loglevel=3 was hiding all kernel messages on tty0/ttyS0 except errors.
Machine crashes (panics, driver oops, module failures) were silent on VGA.
Restored loglevel=7 so kernel messages up to debug are printed to both
tty0 (VGA) and ttyS0 (SOL). Journald MaxLevelConsole reduced to info
(was debug) to reduce noise on SOL while keeping it useful.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the ISO file was named after git describe --match 'audit/v*',
so a new iso/ tag produced names like v1.0.9-1-gXXXXXXX instead of v1.0.17.
Now build.sh has resolve_iso_version() that looks at iso/v* tags separately.
The bee binary inside the ISO still uses AUDIT_VERSION_EFFECTIVE.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- build-nvidia-module.sh: copy libnvidia-ptxjitcompiler.so.* alongside
libcuda/libnvidia-ml — required by cuModuleLoadDataEx for PTX JIT.
Without it: CUDA_ERROR_JIT_COMPILER_NOT_FOUND at runtime.
Cache check updated to force rebuild when ptxjitcompiler is missing.
- bee-nvidia-load: run ldconfig after module load so that NVIDIA/NCCL
libs injected into /usr/lib/ are visible to dlopen() callers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvtop is no longer shown during NVIDIA SAT runs.
[o] Open nvtop shortcut also removed from the running screen.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit d36e844 dropped console=tty0 and added dual-serial + debug logging.
Without console=tty0 the kernel never initialises the VGA console,
leaving the physical screen permanently blank.
- Restore console=tty0 (VGA) as primary, keep console=ttyS0 for SOL
- Drop console=ttyS1 (redundant second serial port)
- Replace loglevel=7 + journald debug flood with loglevel=3 (errors only)
so kernel messages don't overwrite the TUI on the local screen
- Remove systemd.log_target/forward_to_console debug params
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AMD does not publish Debian Bookworm packages at all (only focal/jammy/noble).
Switch ROCM_UBUNTU_DIST to "jammy"; jammy packages install cleanly on
Debian 12 due to compatible glibc. Also expand candidate list to include
point-releases (6.3.4, 6.3.3, …) so we pick the latest actually-published one.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Apply the same pattern as NVIDIA SAT: launch nvtop via tea.ExecProcess
so it occupies the full terminal as a live GPU chart (temp, power, fan,
utilisation lines) while the stress test runs in the background.
- Add screenGPUStressRunning screen + dedicated running/render handlers
- startGPUStressTest: tea.Batch(stress goroutine, tea.ExecProcess(nvtop))
- [o] reopen nvtop at any time; [a] abort (cancels context)
- Graceful degradation: test still runs if nvtop is not on PATH
- gpuStressDoneMsg routes result to screenOutput on completion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ROCm 6.4 does not yet publish a Release file for Debian Bookworm, causing
the live-build chroot hook to fail with "does not have a Release file".
Try each version in ROCM_CANDIDATES order; skip to the next if apt-get update
fails (repo unavailable). Exit gracefully if none are available.
Also rename inner 'candidate' variable to 'smi_path' to avoid collision.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two-phase GPU thermal cycling test with per-second telemetry:
- Phases: baseline → load1 → pause (no cooldown) → load2 → cooldown
- Monitors: fan RPM (ipmitool sdr), CPU/server temps (ipmitool/sensors),
system power (ipmitool dcmi), GPU temp/power/usage/clock/throttle (nvidia-smi)
- Detects throttling via clocks_throttle_reasons.active bitmask
- Measures fan response lag from load start (validates case-04 ~2s lag)
- Exports metrics.csv (wide format, one row/sec) and fan-sensors.csv (long format)
- TUI: adds [F] Fan Stress Test to Health Check screen with Quick/Standard/Express modes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New tui/sat_progress.go: polls {DefaultSATBaseDir}/{prefix}-*/verbose.log every 300ms and parses completed/in-progress steps
- Busy screen now shows each step as PASS lscpu (234ms) / FAIL stress-ng (60.0s) / ... sensors-after instead of just "Working..."
2. Test results shown on screen (instead of just "Archive written to /path")
- RunCPUAcceptancePackResult, RunMemoryAcceptancePackResult, RunStorageAcceptancePackResult, RunAMDAcceptancePackResult now read summary.txt from the run directory and return a formatted per-step result:
Run: 2025-03-25T10:00:00Z
PASS lscpu
PASS sensors-before
FAIL stress-ng
PASS sensors-after
Overall: FAILED (ok=3 failed=1)
3. AMD GPU SAT with auto-detection
- platform.System.DetectGPUVendor(): checks /dev/nvidia0 → "nvidia", /dev/kfd → "amd"
- platform.System.RunAMDAcceptancePack(): runs rocm-smi, rocm-smi --showallinfo, dmidecode
- GPU SAT (G key / GPU row enter) automatically routes to AMD or NVIDIA based on detected vendor
- "Run All" also auto-detects vendor
4. Panel detail view
- GPU detail now shows the most recent (NVIDIA or AMD) SAT result, whichever is newer
- All SAT detail views use the same human-readable formatSATDetail format
- firmware-amd-graphics: Aldebaran firmware blobs (fixes amdgpu IB ring
test errors on MI250/MI250X at boot)
- 9001-amd-rocm.hook.chroot: adds AMD ROCm 6.4 apt repo and installs
rocm-smi-lib for GPU monitoring (analogous to nvidia-smi)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--start is not a valid nvme-cli flag; correct syntax is -s 1 (short test).
Add --wait so the command blocks until the test completes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stress-ng was missing from the LiveCD — CPU acceptance test exited
immediately with rc=1 because the binary was not found.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
isGPUDevice matched all AMD vendor PCIe devices (SATA, crypto coprocessors,
PCIe dummies) because of a broad strings.Contains(vendor,"amd") check.
Remove it — AMD Instinct/Radeon GPUs are caught by ProcessingAccelerator /
DisplayController class. Also exclude ASPEED (BMC VGA adapter).
Add clear before bee-tui to avoid dirty terminal output.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Check PCI vendor 10de before attempting insmod — avoids spurious
nvidia_uvm symbol errors on systems without NVIDIA hardware (e.g. AMD MI350).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The graphical splash had "BEE / HARDWARE AUDIT" baked into the PNG,
overriding the echo ASCII art. Replace with a plain black background
so the EASY-BEE block-char banner from grub.cfg echo commands is visible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Export detected DEBIAN_KERNEL_ABI as BEE_KERNEL_ABI from build.sh so
auto/config can pin linux-packages to the exact versioned package
(e.g. linux-image-6.1.0-31 + flavour amd64 = linux-image-6.1.0-31-amd64).
This prevents nvidia.ko vermagic mismatch if the linux-image-amd64
meta-package is updated between build start and lb build chroot step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
live-build constructs the kernel package as <linux-packages>-<linux-flavours>,
so "linux-image-amd64" + "amd64" = "linux-image-amd64-amd64" (not found).
The correct value is "linux-image" + "amd64" = "linux-image-amd64".
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace "Bee Hardware Audit" branding with EASY-BEE across bootloader
and LiveCD: grub.cfg menu entries, echo ASCII art before menu,
motd banner, iso-volume and iso-application metadata.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Dockerfile: linux-headers-amd64 meta-package instead of pinned ABI;
remove DEBIAN_KERNEL_ABI build-arg (no longer needed at image build time)
- build-in-container.sh: drop --build-arg DEBIAN_KERNEL_ABI
- build.sh: apt-get update + detect ABI from apt-cache at build time;
auto-install linux-headers-<ABI> if kernel changed since image build
Image rebuild is now needed only when changing Go version or lb tools,
not on every Debian kernel point release.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DEBIAN_KERNEL_ABI=auto in VERSIONS — build.sh queries
apt-cache depends linux-image-amd64 to find the current ABI.
lb config now uses linux-image-amd64 meta-package.
This prevents build failures when Debian drops old kernel packages
from the repo (happens with every point release).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
exFAT is the default filesystem on USB drives >32GB sold today.
Without exfatprogs, mount fails silently and export to such drives is broken.
ntfs-3g covers Windows-formatted drives.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BMC:
- collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent
- collector/collector.go: append BMC firmware record to snap.Firmware
- app/panel.go: show BMC version in TUI right-panel header alongside BIOS
CPU SAT:
- platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng
- app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated
- app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data
- tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label
- tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult
- tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration
- cmd/bee/main.go: bee sat cpu [--duration N] subcommand
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace 12-item flat menu with 4-item main menu: Health Check, Export support bundle, Settings, Exit
- Add Health Check screen (Lenovo-style): per-component checkboxes (GPU/MEM/DISK/CPU), Quick/Standard/Express modes, Run All, letter hotkeys G/M/S/C/R/A/1/2/3
- Add two-column main screen: left = menu, right = hardware panel with colored PASS/FAIL/CANCEL/N/A status per component; Tab/→ switches focus, Enter opens component detail
- Add app.LoadHardwarePanel() + ComponentDetailResult() reading audit JSON and SAT summary.txt files
- Move Network/Services/audit actions into Settings submenu
- Export: support bundle only (remove separate audit JSON export)
- Delete screen_acceptance.go; add screen_health_check.go, screen_settings.go, app/panel.go
- Add BMC + CPU stress-ng tests to backlog
- Update bible submodule
- Rewrite tui_test.go for new screen/action structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA's CUDA repo for Debian 12 only has NCCL packages for cuda13.x,
not cuda12.x. Update to the latest available: 2.28.9-1+cuda13.0.
Also pass sha256 from VERSIONS into build-nccl.sh for integrity check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Download libnccl2 .deb from NVIDIA's CUDA apt repo (Debian 12) during ISO
build, extract libnccl.so.* into the overlay at /usr/lib/ alongside
libnvidia-ml and libcuda. Version pinned in VERSIONS, reflected in
/etc/bee-release.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without a keepalive the kernel watchdog timer expires and reboots
the host mid-audit. Configuring RuntimeWatchdogSec lets systemd PID 1
reset /dev/watchdog every 30 s — well within the typical 60 s timeout.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>