Runtime Flows — bee
Network isolation — CRITICAL
The live CD runs in an isolated network segment with no internet access. All binaries, kernel modules, and tools must be baked into the ISO at build time. No package installation, no downloads, and no package manager calls are allowed at boot. DHCP is used only for LAN (operator SSH access). Internet is NOT available.
Boot sequence (single ISO)
The live system is expected to boot with toram, so live-boot copies the full read-only medium into RAM before mounting the root filesystem. After that point, runtime must not depend on the original USB/BMC virtual media staying readable.
systemd boot order:
local-fs.target
├── bee-sshsetup.service (enables SSH key auth; password fallback only if marker exists)
│ └── ssh.service (OpenSSH on port 22 — starts without network)
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│ creates /dev/nvidia* nodes)
├── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
│ never blocks boot on partial collector failures)
├── bee-web.service (runs `bee web` on :80 — full interactive web UI)
└── bee-desktop.service (startx → openbox + chromium http://localhost/)
Critical invariants:
- The live ISO boots with `boot=live toram`. Runtime binaries must continue working even if the original boot media disappears after early boot.
- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`. Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
- `bee-web.service` binds `0.0.0.0:80` and always renders the current `/var/log/bee-audit.json` contents.
- Audit JSON now includes a `hardware.summary` block with overall verdict and warning/failure counts.
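As a concrete illustration, the `insmod` invariant above could be expressed in a unit file along these lines. This is a hypothetical sketch, not the shipped `bee-nvidia.service`; the module list, ordering, and dependency lines are assumptions:

```ini
# Hypothetical sketch of bee-nvidia.service — NOT the real unit file.
# insmod is used with absolute paths because the modules live in the ISO
# overlay under /usr/local/lib/nvidia/, outside the host module tree.
[Unit]
Description=Load NVIDIA kernel modules shipped in the ISO overlay
After=local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
# nvidia.ko must be loaded before modules that depend on it.
ExecStart=/bin/sh -c 'insmod /usr/local/lib/nvidia/nvidia.ko && insmod /usr/local/lib/nvidia/nvidia-uvm.ko'

[Install]
WantedBy=multi-user.target
```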
Console and login flow
Local-console behavior:
tty1
└── live-config autologin → bee
└── /home/bee/.profile (prints web UI URLs)
display :0
└── bee-desktop.service (User=bee)
└── startx /usr/local/bin/bee-openbox-session -- :0
├── tint2 (taskbar)
├── chromium http://localhost/
└── openbox (WM)
Rules:
- local `tty1` lands in user `bee`, not directly in `root`
- `bee-desktop.service` starts X11 + openbox + Chromium automatically after `bee-web.service`
- Chromium opens `http://localhost/` — the full interactive web UI
- SSH is independent from the desktop path
- serial console support is enabled for VM boot debugging
ISO build sequence
build-in-container.sh [--authorized-keys /path/to/keys]
1. compile `bee` binary (skip if .go files older than binary)
2. create a temporary overlay staging dir under `dist/`
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
4. copy `bee` binary → staged `/usr/local/bin/bee`
5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
(`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli` — optional; `mstflint` comes from the Debian package set)
6. `build-nvidia-module.sh`:
a. install Debian kernel headers if missing
b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
c. extract installer
d. build kernel modules against Debian headers
e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
f. cache in `dist/nvidia-<version>-<kver>/`
7. `build-cublas.sh`:
a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo
b. verify packages against repo `Packages.gz`
c. extract headers for `bee-gpu-stress` build
d. cache userspace libs in `dist/cublas-<version>+cuda<series>/`
8. build `bee-gpu-stress` against extracted cuBLASLt/cudart headers
9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/`
12. write staged `/etc/bee-release` (versions + git commit)
13. patch staged `motd` with build metadata
14. copy `iso/builder/` into a temporary live-build workdir under `dist/`
15. sync staged overlay into workdir `config/includes.chroot/`
16. run `lb config && lb build` inside the privileged builder container
Build host notes:
- `build-in-container.sh` targets `linux/amd64` builder containers by default, including Docker Desktop on macOS / Apple Silicon.
- Override with `BEE_BUILDER_PLATFORM=<os/arch>` only if you intentionally need a different container platform.
- If the local builder image under the same tag was previously built for the wrong architecture, the script rebuilds it automatically.
Critical invariants:
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
  - `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for the module build
  - `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- `bee-gpu-stress` must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers.
- The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
- Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
- On macOS / Docker Desktop, the builder must still run as `linux/amd64` so the shipped ISO binaries remain `amd64`.
- Operators must provision enough RAM to hold the full compressed live medium plus normal runtime overhead, because `toram` copies the entire read-only ISO payload into memory before the system reaches steady state.
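The ABI pinning described above could look like the following. This is a hypothetical sketch of `iso/builder/VERSIONS`; the value shown is illustrative and only `DEBIAN_KERNEL_ABI` is named in this document:

```shell
# Hypothetical VERSIONS fragment — value is illustrative.
# DEBIAN_KERNEL_ABI must match both the header install in
# build-nvidia-module.sh and linux-image-${DEBIAN_KERNEL_ABI} in auto/config.
DEBIAN_KERNEL_ABI=6.1.0-18-amd64
```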
Post-boot smoke test
After booting a live ISO, run the following to verify all critical components:
ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
Exit code 0 means all required checks pass; the FAIL count must be zero before shipping.
Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
Current validation state:
- local/libvirt VM boot path is validated for `systemd`, SSH, `bee audit`, `bee-network`, and Web UI startup
- real hardware validation is still required before treating the ISO as release-ready
Overlay mechanism
live-build copies files from config/includes.chroot/ into the ISO filesystem.
build.sh prepares a staged overlay, then syncs it into a temporary workdir's
config/includes.chroot/ before running lb build.
Collector flow
`bee audit` start
1. board collector (dmidecode -t 0,1,2)
2. cpu collector (dmidecode -t 4)
3. memory collector (dmidecode -t 17)
4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/)
6. psu collector (ipmitool fru + sdr — silent if no /dev/ipmi0)
7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
8. output JSON → /var/log/bee-audit.json
9. QR summary to stdout (qrencode if available)
Every collector returns nil, nil on tool-not-found. Errors are logged, never fatal.
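The tool-not-found contract above can be sketched in Go. This is a minimal illustration of the rule, not the real bee collector API; the function name and signature are assumptions:

```go
package main

import (
	"fmt"
	"os/exec"
)

// collectWith runs an external audit tool if it exists. Per the documented
// contract, a missing binary yields (nil, nil) so the audit skips the
// collector instead of failing; real errors are returned for logging,
// never treated as fatal.
func collectWith(tool string, args ...string) ([]byte, error) {
	path, err := exec.LookPath(tool)
	if err != nil {
		return nil, nil // tool-not-found: skip silently
	}
	out, err := exec.Command(path, args...).Output()
	if err != nil {
		return nil, err // logged by the caller, never fatal
	}
	return out, nil
}

func main() {
	out, err := collectWith("definitely-missing-tool-0xbee", "--version")
	fmt.Printf("out=%v err=%v\n", out, err)
}
```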
Acceptance flows:
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + mixed-precision `bee-gpu-stress`
- `bee sat memory` → `memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- `bee-gpu-stress` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
  - Ampere: `fp16` + `fp32`/TF32 tensor-core load
  - Ada / Hopper: add `fp8`
  - Blackwell+: add `fp4`
  - PTX fallback is only for missing cuBLASLt/userspace or unsupported narrow datatypes
- Runtime overrides: `BEE_GPU_STRESS_SECONDS`, `BEE_GPU_STRESS_SIZE_MB`, `BEE_MEMTESTER_SIZE_MB`, `BEE_MEMTESTER_PASSES`
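The architecture → datatype ladder above can be sketched as a selection function. This is a hypothetical helper: the function name and the compute-capability thresholds (Ada = 8.9, Hopper = 9.x, Blackwell = 10+) are assumptions, not the real `bee-gpu-stress` logic:

```go
package main

import "fmt"

// stressDatatypes picks GEMM datatypes from the GPU compute capability,
// mirroring the documented ladder: Ampere baseline, Ada/Hopper add fp8,
// Blackwell+ adds fp4. Thresholds are illustrative.
func stressDatatypes(ccMajor, ccMinor int) []string {
	dts := []string{"fp16", "tf32"} // Ampere baseline tensor-core load
	if ccMajor > 8 || (ccMajor == 8 && ccMinor >= 9) {
		dts = append(dts, "fp8") // Ada (8.9) and Hopper (9.x)
	}
	if ccMajor >= 10 {
		dts = append(dts, "fp4") // Blackwell and newer
	}
	return dts
}

func main() {
	fmt.Println("Ampere   :", stressDatatypes(8, 0))
	fmt.Println("Hopper   :", stressDatatypes(9, 0))
	fmt.Println("Blackwell:", stressDatatypes(10, 0))
}
```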
NVIDIA SAT Web UI flow
Web UI: Acceptance Tests page → Run Test button
1. POST /api/sat/nvidia/run → returns job_id
2. GET /api/sat/stream?job_id=... (SSE) — streams stdout/stderr lines live
3. After completion — archive written to /appdata/bee/export/bee-sat/
summary.txt contains overall_status (OK / FAILED) and per-job status values
Critical invariants:
- `bee-gpu-stress` uses `exec.CommandContext` — killed on job context cancel.
- Metric goroutine uses a stopCh/doneCh pattern; the main goroutine waits on `<-doneCh` before reading rows (no mutex needed).
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.