Refactor bee CLI and LiveCD integration

Mikhail Chusavitin
2026-03-13 16:52:16 +03:00
parent b7c888edb1
commit 6aca1682b9
47 changed files with 3137 additions and 1201 deletions


@@ -4,100 +4,68 @@
**The live CD runs in an isolated network segment with no internet access.**
All binaries, kernel modules, and tools must be baked into the ISO at build time.
No `apk add`, no downloads, no package manager calls are allowed at boot.
No package installation, no downloads, and no package manager calls are allowed at boot.
DHCP is used only for LAN (operator SSH access). Internet is NOT available.
## Boot sequence (single ISO)
OpenRC default runlevel, service start order:
`systemd` boot order:
```
localmount
├── bee-sshsetup (creates bee user, sets password; runs before dropbear)
│ └── dropbear (SSH on port 22 — starts without network)
├── bee-network (udhcpc -b on all physical interfaces, non-blocking)
│ └── bee-nvidia (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│ creates libnvidia-ml.so.1 symlinks in /usr/lib/)
└── bee-audit (runs audit binary → /var/log/bee-audit.json)
local-fs.target
├── bee-sshsetup.service (enables SSH key auth; password fallback only if marker exists)
│ └── ssh.service (OpenSSH on port 22 — starts without network)
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
│ └── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│ creates /dev/nvidia* nodes)
└── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
never blocks boot on partial collector failures)
```
**Critical invariants:**
- Dropbear MUST start without network. `bee-sshsetup` has `need localmount` only.
- `bee-network` uses `udhcpc -b` (background) — retries indefinitely if no cable.
- `bee-nvidia` loads modules via `insmod` with absolute paths — NOT `modprobe`.
Reason: modloop squashfs mounts over `/lib/modules/<kver>/` at boot, making it
read-only. The overlay's modules at that path are inaccessible. Modules are stored
at `/usr/local/lib/nvidia/` (overlay path, always writable).
- `bee-nvidia` creates `libnvidia-ml.so.1` symlinks in `/usr/lib/` — required because
`nvidia-smi` is a glibc binary that looks for the soname symlink, not the versioned file.
- `gcompat` package provides `/lib64/ld-linux-x86-64.so.2` for glibc compat on Alpine musl.
- `bee-audit` uses `after bee-nvidia` — ensures NVIDIA enrichment succeeds.
- `bee-audit` uses `eend 0` always — never fails boot even if audit errors.
- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
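The `insmod` invariant can be illustrated with a minimal shell sketch. Only the `/usr/local/lib/nvidia/` directory comes from this document. The function name, the module file names, their load order, and the `BEE_DRY_RUN` switch are assumptions for illustration and are not taken from the real `bee-nvidia.service` script.

```sh
# Hypothetical sketch: load NVIDIA modules by absolute path with insmod.
# NVIDIA_MOD_DIR matches the documented overlay path; everything else is assumed.
NVIDIA_MOD_DIR="${NVIDIA_MOD_DIR:-/usr/local/lib/nvidia}"

load_nvidia_modules() {
    # nvidia.ko goes first because the other modules depend on its symbols.
    for mod in nvidia nvidia-modeset nvidia-uvm; do
        ko="$NVIDIA_MOD_DIR/$mod.ko"
        if [ ! -f "$ko" ]; then
            # A missing module is reported and skipped, not treated as fatal.
            echo "skip: $ko not found" >&2
            continue
        fi
        if [ -n "${BEE_DRY_RUN:-}" ]; then
            # Dry run: print the command instead of touching the kernel.
            echo "insmod $ko"
        else
            insmod "$ko" || echo "warn: insmod $ko failed" >&2
        fi
    done
}
```

Because `insmod` takes a file path and does not consult the module tree under `/lib/modules/<kver>/`, this approach works for modules shipped outside that tree, at the cost of having to order dependencies by hand.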
## ISO build sequence
```
build.sh [--authorized-keys /path/to/keys]
1. compile audit binary (skip if .go files older than binary)
2. inject authorized_keys into overlay/root/.ssh/ (or set password fallback)
3. copy audit binary → overlay/usr/local/bin/audit
4. copy vendor binaries from iso/vendor/ → overlay/usr/local/bin/
(storcli64, sas2ircu, sas3ircu, mstflint, gpu_burn — each optional)
5. build-nvidia-module.sh:
a. apk add linux-lts-dev (always, to get current Alpine 3.21 kernel headers)
b. detect KVER from /usr/src/linux-headers-*
c. download NVIDIA .run installer (sha256 verified, cached in dist/)
d. extract installer
e. build kernel modules against linux-lts headers
f. create libnvidia-ml.so.1 / libcuda.so.1 symlinks in cache
g. cache in dist/nvidia-<version>-<kver>/
6. inject NVIDIA .ko → overlay/usr/local/lib/nvidia/
7. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
8. inject libnvidia-ml + libcuda → overlay/usr/lib/
9. write overlay/etc/bee-release (versions + git commit)
10. export BEE_BUILD_INFO for motd substitution
11. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
kernel_* section — cached (linux-lts modloop)
apks_* section — cached (downloaded packages)
syslinux_* / grub_* — cached
apkovl — always regenerated (genapkovl-bee.sh)
final ISO — always assembled
1. compile `bee` binary (skip if .go files older than binary)
2. create a temporary overlay staging dir under `dist/`
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
4. copy `bee` binary → staged `/usr/local/bin/bee`
5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
(`storcli64`, `sas2ircu`, `sas3ircu`, `mstflint` — each optional)
6. `build-nvidia-module.sh`:
a. install Debian kernel headers if missing
b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
c. extract installer
d. build kernel modules against Debian headers
e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
f. cache in `dist/nvidia-<version>-<kver>/`
7. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
8. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
9. inject `libnvidia-ml` + `libcuda` → staged `/usr/lib/`
10. write staged `/etc/bee-release` (versions + git commit)
11. patch staged `motd` with build metadata
12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
13. sync staged overlay into workdir `config/includes.chroot/`
14. run `lb config && lb build` inside the temporary workdir
(either on a Debian host/VM or inside the privileged builder container)
```
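Step 1 skips compilation when no `.go` file is newer than the binary. A minimal sketch of such a check is shown here; the helper name `needs_rebuild` and its arguments are hypothetical, and the real `build.sh` may implement the test differently.

```sh
# Hypothetical helper: succeed (exit 0) when the binary is missing or when
# any .go file under the source dir is newer than the binary.
needs_rebuild() {
    bin="$1"
    src_dir="$2"
    # No binary yet: always build.
    [ -f "$bin" ] || return 0
    # find -newer lists files modified more recently than $bin.
    newer=$(find "$src_dir" -name '*.go' -newer "$bin" -print | head -n 1)
    [ -n "$newer" ]
}
```

A caller would wrap the compile step in `if needs_rebuild <binary> <source-dir>; then ... fi`. Comparing modification times is cheap but ignores changes to non-Go inputs such as `go.mod`, so a forced rebuild option may still be useful.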
**Critical invariants:**
- `KERNEL_PKG_VERSION` in `iso/builder/VERSIONS` pins the exact Alpine package version
(e.g. `6.12.76-r0`). This version is used in THREE places that MUST stay in sync:
1. `build-nvidia-module.sh` — `apk add linux-lts-dev=${KERNEL_PKG_VERSION}` (compile headers)
2. `mkimg.bee.sh` — `linux-lts=${KERNEL_PKG_VERSION}` in apks list (ISO kernel)
3. `build.sh` — build-time verification that headers match pin (fails loudly if not)
When Alpine releases a new linux-lts patch (e.g. r0 → r1), update KERNEL_PKG_VERSION
in VERSIONS — that's the only place to change. The build will fail loudly if the pin
doesn't match the installed headers, so stale pins are caught immediately.
- **All three must use the same APK mirror: `dl-cdn.alpinelinux.org`.** Both
`build-nvidia-module.sh` (apk add) and `mkimage.sh` (--repository) explicitly use
`https://dl-cdn.alpinelinux.org/alpine/v${ALPINE_VERSION}/main|community`.
Never use the builder's local `/etc/apk/repositories` — its mirror may serve
a different package state, causing "unable to select package" failures.
- `linux-lts-dev` is always installed (not conditional) — stale 6.6.x headers on the
builder would cause modules to be built for the wrong kernel and never load at runtime.
- NVIDIA modules go to `overlay/usr/local/lib/nvidia/` — NOT `lib/modules/<kver>/extra/`.
- `genapkovl-bee.sh` must be copied to `/var/tmp/` (CWD when mkimage runs).
- `TMPDIR=/var/tmp` required — tmpfs `/tmp` is only ~1GB, too small for kernel firmware.
- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` cache dirs.
## gpu_burn vendor binary
`gpu_burn` requires CUDA nvcc to build. It is NOT built as part of the main ISO build.
Build separately on the builder VM and place in `iso/vendor/gpu_burn`:
```sh
sh iso/builder/build-gpu-burn.sh dist/
cp dist/gpu_burn iso/vendor/gpu_burn
cp dist/compare.ptx iso/vendor/compare.ptx
```
Requires: CUDA 12.8+ (supports GCC 14, Alpine 3.21), libxml2, g++, make, git.
The `build.sh` will include it automatically if `iso/vendor/gpu_burn` exists.
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
- Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
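The single-pin rule for `DEBIAN_KERNEL_ABI` could be enforced by deriving both package names from one place. The sketch below assumes `VERSIONS` holds plain `KEY=value` lines; the helper name `kernel_packages` and the `linux-headers-` package name are assumptions modeled on the `linux-image-${DEBIAN_KERNEL_ABI}` pattern above.

```sh
# Hypothetical helper: read DEBIAN_KERNEL_ABI from a VERSIONS file and print
# the image and headers package names derived from that single pin.
kernel_packages() {
    versions_file="$1"
    # Take the first DEBIAN_KERNEL_ABI= line and strip optional double quotes.
    abi=$(sed -n 's/^DEBIAN_KERNEL_ABI=//p' "$versions_file" | tr -d '"' | head -n 1)
    if [ -z "$abi" ]; then
        echo "DEBIAN_KERNEL_ABI not set in $versions_file" >&2
        return 1
    fi
    echo "linux-image-$abi"     # kernel that ships in the ISO
    echo "linux-headers-$abi"   # headers used for the module build (assumed name)
}
```

Deriving both names from the same variable means the module build and the ISO kernel cannot drift apart unless someone bypasses the helper.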
## Post-boot smoke test
@@ -109,26 +77,19 @@ ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shipping.
Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
gcompat installed, services running, audit completed with NVIDIA enrichment, internet.
Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
## apkovl mechanism
## Overlay mechanism
The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine initramfs extracts
it at boot, overlaying `/etc`, `/usr`, `/root`, `/lib` on the tmpfs root.
`genapkovl-bee.sh` generates the tarball containing:
- `/etc/apk/world` — package list (apk installs on first boot)
- `/etc/runlevels/*/` — OpenRC service symlinks
- `/etc/conf.d/dropbear` — `DROPBEAR_OPTS="-R -B"`
- `/etc/network/interfaces` — lo only (bee-network handles DHCP)
- `/etc/hostname`
- Everything from `iso/overlay/` (init scripts, binaries, ssh keys, tui)
`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
`build.sh` prepares a staged overlay, then syncs it into a temporary workdir's
`config/includes.chroot/` before running `lb build`.
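The staging flow can be sketched in shell as two copy steps. The function names and arguments are hypothetical; only `config/includes.chroot/` and the rule that the source overlay stays untouched come from this document, and the real `build.sh` may use a different copy tool.

```sh
# Hypothetical sketch: copy the immutable source overlay into a staging dir,
# then sync the staged tree into the live-build workdir.
stage_overlay() {
    src_overlay="$1"   # source overlay, never modified by the build
    staged="$2"        # temporary staging dir
    mkdir -p "$staged"
    # "src/." copies the directory contents, including dotfiles.
    cp -a "$src_overlay/." "$staged/"
}

sync_into_workdir() {
    staged="$1"
    workdir="$2"       # temporary live-build workdir
    mkdir -p "$workdir/config/includes.chroot"
    cp -a "$staged/." "$workdir/config/includes.chroot/"
}
```

Build-time files such as the `bee` binary would be written into the staging dir between the two calls, so the source overlay never picks up generated artifacts.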
## Collector flow
```
audit binary start
`bee audit` start
1. board collector (dmidecode -t 0,1,2)
2. cpu collector (dmidecode -t 4)
3. memory collector (dmidecode -t 17)