# Runtime Flows — bee

## Network isolation — CRITICAL

**The live CD runs in an isolated network segment with no internet access.**

All binaries, kernel modules, and tools must be baked into the ISO at build time.
No package installation, no downloads, and no package manager calls are allowed at boot.
DHCP is used only for LAN (operator SSH access). Internet is NOT available.

## Boot sequence (single ISO)

The live system is expected to boot with `toram`, so `live-boot` copies the full read-only medium into RAM before mounting the root filesystem. After that point, runtime must not depend on the original USB/BMC virtual media staying readable.

`systemd` boot order:

```
local-fs.target
├── bee-sshsetup.service  (enables SSH key auth; password fallback only if marker exists)
│   └── ssh.service       (OpenSSH on port 22 — starts without network)
├── bee-network.service   (starts `dhclient -nw` on all physical interfaces, non-blocking)
├── bee-nvidia.service    (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│                          creates /dev/nvidia* nodes)
├── bee-audit.service     (runs `bee audit` → /var/log/bee-audit.json,
│                          never blocks boot on partial collector failures)
├── bee-web.service       (runs `bee web` on :80 — full interactive web UI)
└── bee-desktop.service   (startx → openbox + chromium http://localhost/)
```

**Critical invariants:**

- The live ISO boots with `boot=live toram`. Runtime binaries must continue working even if the original boot media disappears after early boot.
- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
  Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
- `bee-web.service` binds `0.0.0.0:80` and always renders the current `/var/log/bee-audit.json` contents.
- Audit JSON now includes a `hardware.summary` block with overall verdict and warning/failure counts.

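The insmod-with-absolute-paths rule can be sketched as a unit file. This is a hedged illustration only: the real unit ships in the ISO overlay, and the module list, binary path, and directives shown here are assumptions, not the actual file.

```ini
# Hypothetical sketch of bee-nvidia.service; module names and paths are assumptions.
[Unit]
Description=Load NVIDIA kernel modules shipped in the ISO overlay
After=local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
# insmod takes absolute paths; modprobe would only search /lib/modules/<kver>/
ExecStart=/sbin/insmod /usr/local/lib/nvidia/nvidia.ko
ExecStart=/sbin/insmod /usr/local/lib/nvidia/nvidia-uvm.ko

[Install]
WantedBy=multi-user.target
```
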
## Console and login flow

Local-console behavior:

```text
tty1
└── live-config autologin → bee
    └── /home/bee/.profile (prints web UI URLs)

display :0
└── bee-desktop.service (User=bee)
    └── startx /usr/local/bin/bee-openbox-session -- :0
        ├── tint2 (taskbar)
        ├── chromium http://localhost/
        └── openbox (WM)
```

Rules:

- local `tty1` lands in user `bee`, not directly in `root`
- `bee-desktop.service` starts X11 + openbox + Chromium automatically after `bee-web.service`
- Chromium opens `http://localhost/` — the full interactive web UI
- SSH is independent from the desktop path
- serial console support is enabled for VM boot debugging
- Default boot keeps the server-safe graphics path (`nomodeset` + forced `fbdev`) for IPMI/BMC consoles
- Higher-resolution mode selection is expected only when booting through an explicit `bee.display=kms` menu entry, which disables the forced `fbdev` Xorg config before `lightdm`

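The two graphics paths correspond to separate boot entries; a hedged syslinux-style sketch follows. The entry labels and kernel/initrd paths are assumptions — only `boot=live toram`, `nomodeset`, and `bee.display=kms` come from this document.

```
label bee
  menu label bee (server-safe fbdev)
  kernel /live/vmlinuz
  append initrd=/live/initrd.img boot=live toram nomodeset

label bee-kms
  menu label bee (bee.display=kms, native resolution)
  kernel /live/vmlinuz
  append initrd=/live/initrd.img boot=live toram bee.display=kms
```
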
## ISO build sequence

```
build-in-container.sh [--authorized-keys /path/to/keys]
 1. compile `bee` binary (skip if .go files older than binary)
 2. create a temporary overlay staging dir under `dist/`
 3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
 4. copy `bee` binary → staged `/usr/local/bin/bee`
 5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
    (`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli` — optional; `mstflint` comes from the Debian package set)
 6. `build-nvidia-module.sh`:
    a. install Debian kernel headers if missing
    b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
    c. extract installer
    d. build kernel modules against Debian headers
    e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
    f. cache in `dist/nvidia-<version>-<kver>/`
 7. `build-cublas.sh`:
    a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo
    b. verify packages against repo `Packages.gz`
    c. extract headers for `bee-gpu-burn` worker build
    d. cache userspace libs in `dist/cublas-<version>+cuda<series>/`
 8. build `bee-gpu-burn` worker against extracted cuBLASLt/cudart headers
 9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/`
12. write staged `/etc/bee-release` (versions + git commit)
13. patch staged `motd` with build metadata
14. copy `iso/builder/` into a temporary live-build workdir under `dist/`
15. sync staged overlay into workdir `config/includes.chroot/`
16. run `lb config && lb build` inside the privileged builder container
```

Build host notes:

- `build-in-container.sh` targets `linux/amd64` builder containers by default, including Docker Desktop on macOS / Apple Silicon.
- Override with `BEE_BUILDER_PLATFORM=<os/arch>` only if you intentionally need a different container platform.
- If the local builder image under the same tag was previously built for the wrong architecture, the script rebuilds it automatically.

**Critical invariants:**

- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
  1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for the module build
  2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- `bee-gpu-burn` worker must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers.
- The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
- Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
- On macOS / Docker Desktop, the builder still must run as `linux/amd64` so the shipped ISO binaries remain `amd64`.
- Operators must provision enough RAM to hold the full compressed live medium plus normal runtime overhead, because `toram` copies the entire read-only ISO payload into memory before the system reaches steady state.

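The single-pin rule can be illustrated with a small shell sketch. The KEY=value format of `iso/builder/VERSIONS` and the ABI value below are assumptions; only the `DEBIAN_KERNEL_ABI` name and the `linux-image-${DEBIAN_KERNEL_ABI}` package pattern come from this document.

```shell
# Sketch: both consumers source the same pin, so headers and ISO kernel cannot drift.
cat > /tmp/VERSIONS <<'EOF'
DEBIAN_KERNEL_ABI=6.1.0-18-amd64
EOF

. /tmp/VERSIONS
echo "module build headers: linux-headers-${DEBIAN_KERNEL_ABI}"   # illustrative package name
echo "ISO kernel package:   linux-image-${DEBIAN_KERNEL_ABI}"
```
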
## Post-boot smoke test

After booting a live ISO, run the following to verify that all critical components work:

```sh
ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
```

Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shipping.

Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present, systemd services running, audit completed with NVIDIA enrichment, LAN reachability.

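The "zero `FAIL` lines" rule can be checked mechanically. A hedged sketch: the `PASS`/`FAIL` line prefixes are an assumption about `smoketest.sh` output, and the sample log is invented.

```shell
# Count FAIL lines in a captured smoke-test log; any non-zero count blocks shipping.
cat > /tmp/smoke.log <<'EOF'
PASS nvidia modules loaded
PASS nvidia-smi sees all GPUs
FAIL lan reachability
EOF

fails=$(grep -c '^FAIL' /tmp/smoke.log)
echo "FAIL lines: $fails"
[ "$fails" -eq 0 ] || echo "not release-ready"
```
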
Current validation state:

- local/libvirt VM boot path is validated for `systemd`, SSH, `bee audit`, `bee-network`, and Web UI startup
- real hardware validation is still required before treating the ISO as release-ready

## Overlay mechanism

`live-build` copies files from `config/includes.chroot/` into the ISO filesystem. `build.sh` prepares a staged overlay, then syncs it into a temporary workdir's `config/includes.chroot/` before running `lb build`.

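A minimal sketch of the staging-to-workdir sync, under stated assumptions: the `/tmp/demo` paths and the single placeholder file are illustrative, and `cp -a` stands in for whatever sync tool the real script uses.

```shell
# Stage one file, then sync the whole staged tree into includes.chroot/.
mkdir -p /tmp/demo/staged/usr/local/bin
mkdir -p /tmp/demo/workdir/config/includes.chroot
printf 'placeholder\n' > /tmp/demo/staged/usr/local/bin/bee

cp -a /tmp/demo/staged/. /tmp/demo/workdir/config/includes.chroot/
ls /tmp/demo/workdir/config/includes.chroot/usr/local/bin/
```
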
## Collector flow

```
`bee audit` start
1. board collector   (dmidecode -t 0,1,2)
2. cpu collector     (dmidecode -t 4)
3. memory collector  (dmidecode -t 17)
4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
5. pcie collector    (lspci -vmm -D, /sys/bus/pci/devices/)
6. psu collector     (ipmitool fru + sdr — silent if no /dev/ipmi0)
7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
8. output JSON → /var/log/bee-audit.json
9. QR summary to stdout (qrencode if available)
```

Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.

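The same tolerance can be probed from a shell. The tool list comes from the collector flow above; the probe itself is an illustrative sketch, not bee's actual Go logic.

```shell
# Each collector's external tool is optional: missing means the collector is
# skipped (logged), never a fatal audit error.
for tool in dmidecode lsblk smartctl nvme lspci ipmitool qrencode; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "present: $tool"
  else
    echo "absent (collector skipped, not fatal): $tool"
  fi
done
```
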
Acceptance flows:

- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-burn`
- NVIDIA GPU burn-in can use either `bee-gpu-burn` or `bee-john-gpu-stress` (John the Ripper jumbo via OpenCL)
- `bee sat memory` → `memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- `bee-gpu-burn` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
  - Ampere: `fp16` + `fp32`/TF32 tensor-core load
  - Ada / Hopper: add `fp8`
  - Blackwell+: add `fp4`
  - PTX fallback is only for missing cuBLASLt/userspace or unsupported narrow datatypes
- Runtime overrides:
  - `BEE_MEMTESTER_SIZE_MB`
  - `BEE_MEMTESTER_PASSES`

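A hedged sketch of how the two overrides might shape the `memtester` invocation: `memtester <size>M <passes>` matches memtester's real CLI shape, but the fallback defaults below are illustrative assumptions, not bee's actual values.

```shell
# Overrides win over defaults; unset variables fall back to assumed defaults.
BEE_MEMTESTER_SIZE_MB=1024
BEE_MEMTESTER_PASSES=2

size_mb="${BEE_MEMTESTER_SIZE_MB:-256}"
passes="${BEE_MEMTESTER_PASSES:-1}"
echo "would run: memtester ${size_mb}M ${passes}"
```
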
## NVIDIA SAT Web UI flow

```
Web UI: Acceptance Tests page → Run Test button
1. POST /api/sat/nvidia/run → returns job_id
2. GET /api/sat/stream?job_id=... (SSE) — streams stdout/stderr lines live
3. After completion — archive written to /appdata/bee/export/bee-sat/
   summary.txt contains overall_status (OK / FAILED) and per-job status values
```

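Step 2's stream can be consumed by anything that understands SSE framing. A minimal sketch of unwrapping the standard `data:` prefix; the payload lines are invented samples, since only the endpoint and the line-streaming behavior come from this document.

```shell
# Strip the SSE "data: " framing to recover the raw stdout/stderr lines.
cat > /tmp/sse-sample.log <<'EOF'
data: [bee-gpu-burn] iteration 1 ... OK
data: [bee-gpu-burn] iteration 2 ... OK
EOF

sed -n 's/^data: //p' /tmp/sse-sample.log
```
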
**Critical invariants:**

- `bee-gpu-burn` / `bee-john-gpu-stress` use `exec.CommandContext` — killed on job context cancel.
- Metric goroutine uses a stopCh/doneCh pattern; the main goroutine waits on `<-doneCh` before reading rows (no mutex needed).
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.