# Runtime Flows — bee

## Network isolation — CRITICAL

**The live CD runs in an isolated network segment with no internet access.**

All binaries, kernel modules, and tools must be baked into the ISO at build time.
No package installation, no downloads, and no package manager calls are allowed at boot.
DHCP is used only for LAN (operator SSH access). Internet is NOT available.

## Boot sequence (single ISO)

`systemd` boot order:

```
local-fs.target
├── bee-sshsetup.service (enables SSH key auth; password fallback only if marker exists)
│   └── ssh.service (OpenSSH on port 22 — starts without network)
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│                       creates /dev/nvidia* nodes)
├── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
│                      never blocks boot on partial collector failures)
└── bee-web.service (runs `bee web` on :80,
                     reads the latest audit snapshot on each request)
```

**Critical invariants:**

- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
- `bee-network.service` uses `dhclient -nw` (background), so network bring-up is best effort and non-blocking.
- `bee-nvidia.service` loads modules via `insmod` with absolute paths, NOT `modprobe`.
  Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the kernel's module tree (`/lib/modules/<kver>/`), which is where `modprobe` looks modules up.
- `bee-audit.service` does not wait for `network-online.target`. The audit is local and must run even if DHCP is broken.
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
- `bee-web.service` binds `0.0.0.0:80` and always renders the current `/var/log/bee-audit.json` contents.
- Audit JSON now includes a `hardware.summary` block with the overall verdict and warning/failure counts.
## Console and login flow

Local-console behavior:

```text
tty1
└── live-config autologin → bee
    └── /home/bee/.profile
        └── exec menu
            └── /usr/local/bin/bee-tui
                └── sudo -n /usr/local/bin/bee tui --runtime livecd
```

Rules:

- local `tty1` lands in user `bee`, not directly in `root`
- `menu` must work without typing `sudo`
- TUI actions still run as `root` via `sudo -n` (non-interactive: `sudo` fails instead of prompting, so `bee` needs passwordless sudo for this command)
- SSH is independent of the tty1 path
- serial console support is enabled for VM boot debugging

## ISO build sequence

```
build.sh [--authorized-keys /path/to/keys]
  1. compile `bee` binary (skipped unless a .go file is newer than the binary)
  2. create a temporary overlay staging dir under `dist/`
  3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
  4. copy `bee` binary → staged `/usr/local/bin/bee`
  5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
     (optional: `storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli`; `mstflint` comes from the Debian package set)
  6. `build-nvidia-module.sh`:
     a. install Debian kernel headers if missing
     b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
     c. extract installer
     d. build kernel modules against Debian headers
     e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
     f. cache in `dist/nvidia-<version>-<kver>/`
  7. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
  8. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
  9. inject `libnvidia-ml` + `libcuda` → staged `/usr/lib/`
 10. write staged `/etc/bee-release` (versions + git commit)
 11. patch staged `motd` with build metadata
 12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
 13. sync staged overlay into workdir `config/includes.chroot/`
 14. run `lb config && lb build` inside the temporary workdir
     (either on a Debian host/VM or inside the privileged builder container)
```
**Critical invariants:**

- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places, so the modules built in step 6 match the kernel shipped in the ISO (a mismatched `.ko` refuses to load):
  1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh`: Debian kernel headers for the module build
  2. `auto/config`: `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/`, NOT to `/lib/modules/<kver>/extra/`.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable, so source files under `iso/builder/` stay clean.
- Container builds require `--privileged` because `live-build` uses mounts, chroots, and loop devices during ISO assembly.
## Post-boot smoke test

After booting a live ISO, run the smoke test over SSH to verify all critical components:

```sh
ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
```

Exit code 0 means all required checks passed (`ssh` returns the remote script's exit status). The number of `FAIL` lines must be zero before shipping.

Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
systemd services running, audit completed with NVIDIA enrichment, LAN reachability.

Current validation state:

- local/libvirt VM boot path is validated for `systemd`, SSH, `bee audit`, `bee-network`, and TUI startup
- real hardware validation is still required before treating the ISO as release-ready

## Overlay mechanism

`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
`build.sh` prepares a staged overlay, then syncs it into a temporary workdir's
`config/includes.chroot/` before running `lb build`.

## Collector flow

```
`bee audit` start
  1. board collector      (dmidecode -t 0,1,2)
  2. cpu collector        (dmidecode -t 4)
  3. memory collector     (dmidecode -t 17)
  4. storage collector    (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
  5. pcie collector       (lspci -vmm -D, /sys/bus/pci/devices/)
  6. psu collector        (ipmitool fru + sdr — silent if no /dev/ipmi0)
  7. nvidia enrichment    (nvidia-smi — skipped if binary absent or driver not loaded)
  8. output JSON → /var/log/bee-audit.json
  9. QR summary to stdout (qrencode if available)
```

Every collector returns `nil, nil` when its tool is not found. Errors are logged and never fatal: one failing collector does not stop the others.

Acceptance flows:

- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-stress`
- `bee sat memory` → `memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- Runtime overrides:
  - `BEE_GPU_STRESS_SECONDS`
  - `BEE_GPU_STRESS_SIZE_MB`
  - `BEE_MEMTESTER_SIZE_MB`
  - `BEE_MEMTESTER_PASSES`