NVIDIA's CUDA repo for Debian 12 only has NCCL packages for cuda13.x,
not cuda12.x. Update to the latest available: 2.28.9-1+cuda13.0.
Also pass the sha256 from VERSIONS into build-nccl.sh so the download is integrity-checked.
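A minimal sketch of the kind of check this enables, assuming the expected digest arrives as a plain hex string. The helper name verify_sha256 is illustrative and is not taken from build-nccl.sh.

```shell
# verify_sha256 FILE EXPECTED_HEX   (hypothetical helper)
# Succeeds only if FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() (
    file=$1
    expected=$2
    # Refuse an empty pin so a missing VERSIONS entry cannot pass silently.
    if [ -z "$expected" ]; then
        echo "ERROR: no expected sha256 given for $file" >&2
        exit 1
    fi
    # sha256sum prints "<hex>  <filename>"; keep only the hex digest.
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" != "$expected" ]; then
        echo "ERROR: sha256 mismatch for $file" >&2
        echo "  expected: $expected" >&2
        echo "  actual:   $actual" >&2
        exit 1
    fi
)
```

The function body runs in a subshell, so `exit 1` only fails the check and leaves it to the caller to abort the build.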
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Download libnccl2 .deb from NVIDIA's CUDA apt repo (Debian 12) during ISO
build, extract libnccl.so.* into the overlay at /usr/lib/ alongside
libnvidia-ml and libcuda. Version pinned in VERSIONS, reflected in
/etc/bee-release.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without a keepalive the kernel watchdog timer expires and reboots
the host mid-audit. Configuring RuntimeWatchdogSec lets systemd PID 1
reset /dev/watchdog every 30 s — well within the typical 60 s timeout.
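For illustration, a sketch of writing such a setting as a systemd manager drop-in. The function name, the file name watchdog.conf and the value passed in are assumptions; the value actually used by this commit is not shown here.

```shell
# write_watchdog_dropin CONF_DIR TIMEOUT   (hypothetical helper)
# Writes a [Manager] drop-in that sets RuntimeWatchdogSec.
# On a real system CONF_DIR would be /etc/systemd/system.conf.d
# (or the same path inside the ISO overlay).
write_watchdog_dropin() (
    conf_dir=$1
    timeout=$2
    mkdir -p "$conf_dir" || exit 1
    printf '[Manager]\nRuntimeWatchdogSec=%s\n' "$timeout" \
        > "$conf_dir/watchdog.conf"
)
```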
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Debian 12 splits kernel headers into two packages:
linux-headers-<kver> (arch-specific: generated/, config/)
linux-headers-<kver>-common (source headers: linux/, asm-generic/, etc.)
NVIDIA's conftest.sh derives its include path as HEADERS=$SOURCES/include.
With SYSSRC=amd64 that becomes amd64/include/, which is nearly empty, so
conftest cannot compile any of its kernel header tests. Every compile test
fails silently and NVIDIA then assumes all kernel APIs are present. The
result is link errors for APIs added in kernel 6.3+ (vm_flags_set,
vm_flags_clear) and for removed APIs (phys_to_dma, dma_is_direct, get_dma_ops).
Fix: pass SYSSRC=common (real headers) and SYSOUT=amd64 (generated headers).
NVIDIA Makefile maps SYSSRC→NV_KERNEL_SOURCES, SYSOUT→NV_KERNEL_OUTPUT,
and runs 'make -C common KBUILD_OUTPUT=amd64'. Conftest then correctly
detects which APIs are present in kernel 6.1 and uses proper wrappers.
Tested: 5 .ko files built successfully on Debian 12 kernel 6.1.0-43-amd64.
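A sketch of how the two paths could be derived from the kernel version. It assumes the Debian layout /usr/src/linux-headers-<kver> and /usr/src/linux-headers-<base>-common and a plain "-amd64" flavour suffix; the function name is illustrative, not the one in build-nvidia-module.sh.

```shell
# nvidia_kbuild_args KVER [HEADER_ROOT]   (hypothetical helper)
# Prints the SYSSRC/SYSOUT arguments for the NVIDIA kernel Makefile.
nvidia_kbuild_args() (
    kver=$1
    hdr_root=${2:-/usr/src}
    # Drop the trailing "-<flavour>" (e.g. "-amd64") to name the -common package.
    base=${kver%-*}
    printf 'SYSSRC=%s/linux-headers-%s-common SYSOUT=%s/linux-headers-%s\n' \
        "$hdr_root" "$base" "$hdr_root" "$kver"
)
```

A call such as `make -C "$KERNEL_SRC" $(nvidia_kbuild_args "$KVER")` would then pass both variables. The command substitution is left unquoted on purpose so the two assignments split into separate arguments.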
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
get_dma_ops() return type changed in kernel 6.1, and GCC treats the resulting
int-conversion warning as an error. Pass -Wno-error so the build can complete.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
build-nvidia-module.sh:
- Replace silent glob cp for libnvidia-ml/libcuda with find + explicit error
  if library not found in extract dir (catches installer layout changes)
- Fix circular symlink bug: don't create .so.1 -> .so.1 if versioned file
  is already named .so.1
- Verify .ko count > 0 after build, fail loudly if none produced
- Show lib cache in final summary
bee-nvidia:
- mknod failures are now logged with ewarn instead of silently suppressed
- If nvidia not in /proc/devices (no GPU hardware), log clearly and exit clean
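The find-based copy and the symlink guard from the build-nvidia-module.sh items above can be sketched as follows. Function and argument names are illustrative assumptions, not the script's actual code.

```shell
# install_versioned_lib SRC_DIR DEST_DIR STEM   (hypothetical helper)
# STEM is e.g. "libcuda" or "libnvidia-ml".
install_versioned_lib() (
    src_dir=$1
    dest_dir=$2
    stem=$3
    # find instead of a glob cp: an empty result is detectable.
    lib=$(find "$src_dir" -type f -name "$stem.so.*" | head -n 1)
    if [ -z "$lib" ]; then
        echo "ERROR: $stem.so.* not found under $src_dir" >&2
        exit 1
    fi
    name=$(basename "$lib")
    mkdir -p "$dest_dir" && cp "$lib" "$dest_dir/$name" || exit 1
    # Only add the .so.1 link when the file is not already named .so.1,
    # otherwise the link would point at itself.
    if [ "$name" != "$stem.so.1" ]; then
        ln -sf "$name" "$dest_dir/$stem.so.1"
    fi
)
```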
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gcompat alone provides only the ELF interpreter entry point (/lib64/ld-linux-x86-64.so.2).
It does NOT provide libpthread.so.0, libm.so.6, libdl.so.2, libc.so.6 stubs.
libnvidia-ml.so.590 has NEEDED: libpthread.so.0 etc. When nvidia-smi calls
dlopen("libnvidia-ml.so.1"), musl's linker fails to satisfy these deps
→ NVML_ERROR_LIBRARY_NOT_FOUND (exit 12), "couldn't find libnvidia-ml.so".
libc6-compat provides the missing stubs (libpthread.so.0, libm.so.6, libdl.so.2,
libc.so.6, librt.so.1) as musl redirects, enabling dlopen of glibc shared objects.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Alpine uses mdev which has no rules for NVIDIA devices. Without /dev/nvidiactl
and /dev/nvidia{0-7}, nvidia-smi returns NVML_ERROR_LIBRARY_NOT_FOUND (exit 12)
even though kernel modules are loaded and libraries are present.
Fix: after insmod, read major numbers from /proc/devices and mknod the required
character devices (/dev/nvidiactl, /dev/nvidia{0-7}, /dev/nvidia-uvm).
Add /dev/nvidia* node checks to smoketest for earlier failure detection.
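A rough sketch of the major-number lookup and node creation. It assumes the driver registers as "nvidia" and "nvidia-uvm" in /proc/devices and uses minor 255 for nvidiactl; function names are illustrative, and make_nvidia_nodes needs root.

```shell
# device_major NAME [DEVICES_FILE]   (hypothetical helper)
# Prints the major number registered for NAME; fails if NAME is absent.
device_major() (
    name=$1
    devices=${2:-/proc/devices}
    awk -v n="$name" '
        $2 == n { print $1; found = 1; exit }
        END     { exit !found }
    ' "$devices"
)

# make_nvidia_nodes: create the nodes mdev does not provide (run as root).
make_nvidia_nodes() {
    major=$(device_major nvidia) || {
        echo "nvidia not in /proc/devices" >&2
        return 1
    }
    [ -e /dev/nvidiactl ] || mknod -m 666 /dev/nvidiactl c "$major" 255
    i=0
    while [ "$i" -le 7 ]; do
        [ -e "/dev/nvidia$i" ] || mknod -m 666 "/dev/nvidia$i" c "$major" "$i"
        i=$((i + 1))
    done
    # nvidia-uvm has its own major and only exists if that module is loaded.
    if uvm_major=$(device_major nvidia-uvm); then
        [ -e /dev/nvidia-uvm ] || mknod -m 666 /dev/nvidia-uvm c "$uvm_major" 0
    fi
}
```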
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The static KERNEL_PKG_VERSION pin was the root cause of nvidia-smi never
working: modules were compiled for the pinned version (e.g. 6.12.76-r0) but
the ISO kernel was unpinned (latest from the repo at build time). When Alpine
updated linux-lts, the two diverged silently.
Fix: both steps now use whatever linux-lts is current in Alpine 3.21 main
at build time. build-nvidia-module.sh uses `apk add --update linux-lts-dev`
(no version pin), mkimage gets the same package from the same mirror.
Module cache is still keyed by detected KVER so rebuilds remain fast.
Removed: KERNEL_VERSION, KERNEL_PKG_VERSION from VERSIONS, all pin references
from build.sh and build-nvidia-module.sh.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-audit init.d: use --output file: so "audit output written" is logged
  (stdout mode silently redirects, never emits the slog confirmation)
- build-nvidia-module.sh: use $KERNEL_SRC in find for .ko collection
  (was hardcoded $EXTRACT_DIR/kernel, silent failure if path differs)
- smoketest: add bee-audit to required services (was never checked)
- smoketest: remove legacy bee-audit-debug from service list
- smoketest: internet ping → warn (live CD runs in isolated network, no internet)
- build.sh: auto-copy smoketest.sh → overlay/usr/local/bin/bee-smoketest
  (removes manual sync hazard; smoketest.sh is now single source of truth)
- remove static overlay/usr/local/bin/bee-smoketest (generated by build.sh now)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stale apks_* dirs (from old mirror or previous version pin) cause
"unable to select package" failures. Nuke them on every build.
kernel_*, syslinux_*, grub_* are still preserved — they're large,
stable, and only need to change when KERNEL_PKG_VERSION changes.
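The selective cleanup can be sketched like this, assuming the cache directories sit directly under one mkimage work directory. The function name and the workdir argument are illustrative.

```shell
# clean_stale_apk_dirs WORKDIR   (hypothetical helper)
# Removes WORKDIR/apks_* and leaves kernel_*, syslinux_*, grub_* alone.
clean_stale_apk_dirs() (
    workdir=$1
    # Nothing to clean on a first build.
    [ -d "$workdir" ] || exit 0
    find "$workdir" -mindepth 1 -maxdepth 1 -type d -name 'apks_*' \
        -exec rm -rf {} +
)
```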
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mkimage checks CWD (/var/tmp) before ~/.mkimage/ for genapkovl scripts.
Old genapkovl-bee.sh left in /var/tmp from previous builds was overriding
the updated version, causing bee-audit-debug to persist in runlevel.
Also add gcompat to apk world so it's installed at boot (was in apks cache
but missing from world file, so nvidia-smi failed with missing ld-linux).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
linux-lts in apks conflicted with mkimage's own kernel download via
kernel_flavors="lts". The kernel is embedded in the ISO via modloop,
not via apks. Pinning it in apks caused "unable to select package".
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both build-nvidia-module.sh (apk add) and mkimage.sh (--repository) now
explicitly use dl-cdn. Local builder mirror config is ignored.
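One way to keep both steps on the same mirror is to build the URL in a single place. This is a sketch with a hypothetical helper name; the release and repository values passed in are the caller's choice.

```shell
# alpine_repo_url RELEASE REPO   (hypothetical helper)
# Prints the dl-cdn URL for an Alpine release branch and repository.
alpine_repo_url() (
    release=$1   # e.g. 3.21
    repo=$2      # e.g. main or community
    printf 'https://dl-cdn.alpinelinux.org/alpine/v%s/%s\n' "$release" "$repo"
)
```

Both callers can then take the same value, for example `apk add --repository "$(alpine_repo_url 3.21 main)" linux-lts-dev`.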
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>