iso: improve burn-in, export, and live boot

2026-03-26 18:56:19 +03:00
parent 67a215c66f
commit fc5c2019aa
23 changed files with 1706 additions and 168 deletions
@@ -9,6 +9,8 @@ DHCP is used only for LAN (operator SSH access). Internet is NOT available.

 ## Boot sequence (single ISO)

+The live system is expected to boot with `toram`, so `live-boot` copies the full read-only medium into RAM before mounting the root filesystem. After that point, the runtime must not depend on the original USB/BMC virtual media staying readable.
+
 `systemd` boot order:

 ```
@@ -25,6 +27,7 @@ local-fs.target
 ```

 **Critical invariants:**
+- The live ISO boots with `boot=live toram`. Runtime binaries must continue working even if the original boot media disappears after early boot.
 - OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
 - `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
 - `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
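Reviewer sketch for the `boot=live toram` invariant: a minimal Go check that a kernel command line carries both tokens, plus the RAM arithmetic that `toram` implies. The function names `bootsToRAM` and `fitsInRAM`, the sample command lines, and the byte figures are hypothetical and not part of this commit. Only the bare `toram` token named in the docs is matched.

```go
package main

import (
	"fmt"
	"strings"
)

// bootsToRAM reports whether a kernel command line contains both the
// `boot=live` and the bare `toram` parameters named in the invariant.
// (Hypothetical helper, not taken from the repository.)
func bootsToRAM(cmdline string) bool {
	var live, toram bool
	for _, field := range strings.Fields(cmdline) {
		switch field {
		case "boot=live":
			live = true
		case "toram":
			toram = true
		}
	}
	return live && toram
}

// fitsInRAM reports whether total memory can hold the live medium plus
// an operator-chosen runtime overhead. All values are in bytes.
// The overhead figure is an assumption the operator has to supply.
func fitsInRAM(memTotal, medium, overhead uint64) bool {
	// Subtract first so the comparison cannot overflow.
	return memTotal >= medium && memTotal-medium >= overhead
}

func main() {
	// Illustrative command lines; on a live system the real value
	// would come from /proc/cmdline.
	samples := []string{
		"BOOT_IMAGE=/live/vmlinuz boot=live toram quiet",
		"BOOT_IMAGE=/live/vmlinuz boot=live quiet",
	}
	for _, s := range samples {
		fmt.Printf("%q -> toram boot: %v\n", s, bootsToRAM(s))
	}

	const gib = uint64(1) << 30
	// Made-up sizes: 16 GiB RAM, 3 GiB medium, 4 GiB overhead.
	fmt.Println("fits:", fitsInRAM(16*gib, 3*gib, 4*gib))
}
```

A check like this would run after early boot, since the invariant only requires that nothing reads the original medium once the copy into RAM has finished.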
@@ -71,24 +74,39 @@ build-in-container.sh [--authorized-keys /path/to/keys]
       d. build kernel modules against Debian headers
       e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
       f. cache in `dist/nvidia-<version>-<kver>/`
-  7. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
-  8. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
-  9. inject `libnvidia-ml` + `libcuda` → staged `/usr/lib/`
-  10. write staged `/etc/bee-release` (versions + git commit)
-  11. patch staged `motd` with build metadata
-  12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
-  13. sync staged overlay into workdir `config/includes.chroot/`
-  14. run `lb config && lb build` inside the privileged builder container
+  7. `build-cublas.sh`:
+       a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo
+       b. verify packages against repo `Packages.gz`
+       c. extract headers for `bee-gpu-stress` build
+       d. cache userspace libs in `dist/cublas-<version>+cuda<series>/`
+  8. build `bee-gpu-stress` against extracted cuBLASLt/cudart headers
+  9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
+  10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
+  11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/`
+  12. write staged `/etc/bee-release` (versions + git commit)
+  13. patch staged `motd` with build metadata
+  14. copy `iso/builder/` into a temporary live-build workdir under `dist/`
+  15. sync staged overlay into workdir `config/includes.chroot/`
+  16. run `lb config && lb build` inside the privileged builder container
 ```

+Build host notes:
+- `build-in-container.sh` targets `linux/amd64` builder containers by default, including Docker Desktop on macOS / Apple Silicon.
+- Override with `BEE_BUILDER_PLATFORM=<os/arch>` only if you intentionally need a different container platform.
+- If the local builder image under the same tag was previously built for the wrong architecture, the script rebuilds it automatically.
+
 **Critical invariants:**
 - `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
  1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
  2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
 - NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
+- `bee-gpu-stress` must be built against the cached CUDA userspace headers from `build-cublas.sh`, not against whatever CUDA headers happen to be installed on the build host.
+- The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot.
 - The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
 - The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
 - Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
+- On macOS / Docker Desktop, the builder still must run as `linux/amd64` so the shipped ISO binaries remain `amd64`.
+- Operators must provision enough RAM to hold the full compressed live medium plus normal runtime overhead, because `toram` copies the entire read-only ISO payload into memory before the system reaches steady state.

 ## Post-boot smoke test

@@ -131,10 +149,15 @@ Current validation state:
 Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.

 Acceptance flows:
-- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-stress`
+- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + mixed-precision `bee-gpu-stress`
 - `bee sat memory` → `memtester` archive
 - `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
 - SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
+- `bee-gpu-stress` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
+  - Ampere: `fp16` + `fp32`/TF32 tensor-core load
+  - Ada / Hopper: add `fp8`
+  - Blackwell+: add `fp4`
+  - PTX fallback applies only when cuBLASLt or its userspace dependencies are missing, or when a narrow datatype is unsupported
 - Runtime overrides:
  - `BEE_GPU_STRESS_SECONDS`
  - `BEE_GPU_STRESS_SIZE_MB`
@@ -21,7 +21,8 @@ Fills gaps where Redfish/logpile is blind:
 - Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
 - Machine-readable health summary derived from collector verdicts
 - Operator-triggered acceptance tests for NVIDIA, memory, and storage
-- NVIDIA SAT includes both diagnostic collection and lightweight GPU stress via `bee-gpu-stress`
+- NVIDIA SAT includes both diagnostic collection and mixed-precision GPU stress via `bee-gpu-stress`
+- `bee-gpu-stress` should exercise tensor/inference paths (`fp16`, `fp32`/TF32, `fp8`, `fp4` when supported by the GPU/userspace stack) and fall back to Driver API PTX burn only if cuBLASLt is unavailable
 - Automatic boot audit with operator-facing local console and SSH access
 - NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
 - SSH access (OpenSSH) always available for inspection and debugging
@@ -69,6 +70,7 @@ Fills gaps where Redfish/logpile is blind:
 | SSH | OpenSSH server |
 | NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
 | NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
+| GPU stress backend | `bee-gpu-stress` + cuBLASLt/cuBLAS/cudart mixed-precision GEMM, with Driver API PTX fallback |
 | Builder | Debian 12 host/VM or Debian 12 container image |

 ## Operator UX
@@ -78,6 +80,7 @@ Fills gaps where Redfish/logpile is blind:
 - The TUI itself executes privileged actions as `root` via `sudo -n`
 - SSH remains available independently of the local console path
 - VM-oriented builds also include `qemu-guest-agent` and serial console support for debugging
+- The ISO boots with `toram`, so losing the original USB/BMC virtual media after boot should not break runtime binaries, which by then run from the in-RAM copy

 ## Runtime split

@@ -85,6 +88,7 @@ Fills gaps where Redfish/logpile is blind:
 - Live-ISO-only responsibilities stay in `iso/` integration code
 - Live ISO launches the Go CLI with `--runtime livecd`
 - Local/manual runs use `--runtime auto` or `--runtime local`
+- Live ISO targets must have enough RAM for the full compressed live medium plus the runtime working set, because the boot medium is copied into memory at startup

 ## Key paths