# Runtime Flows — bee

## Network isolation — CRITICAL

**The live CD runs in an isolated network segment with no internet access.**
All binaries, kernel modules, and tools must be baked into the ISO at build time.
No `apk add`, no downloads, no package manager calls are allowed at boot.
DHCP is used only for LAN (operator SSH access). Internet is NOT available.

## Boot sequence (single ISO)

OpenRC default runlevel, service start order:

```
localmount
├── bee-sshsetup       (creates bee user, sets password; runs before dropbear)
│   └── dropbear       (SSH on port 22 — starts without network)
├── bee-network        (udhcpc -b on all physical interfaces, non-blocking)
│   └── bee-nvidia     (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│                       creates libnvidia-ml.so.1 symlinks in /usr/lib/)
│       └── bee-audit  (runs audit binary → /var/log/bee-audit.json)
```
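
The ordering above maps onto OpenRC `depend()` declarations. As a minimal sketch, the `bee-audit` service might look roughly like this. The binary path `/usr/local/bin/audit` comes from the build sequence; the `AUDIT_BIN` variable, the description string, and running the binary without flags are assumptions, and the real init script may differ. Note that `after` only orders services and does not make `bee-nvidia` a hard requirement, while `eend 0` reports success regardless of the audit's exit status.

```sh
#!/sbin/openrc-run
# Hypothetical sketch of /etc/init.d/bee-audit; not the project's actual script.

description="Run the hardware audit once at boot"

# Assumed location of the audit binary (overlay/usr/local/bin/audit in the ISO).
AUDIT_BIN="${AUDIT_BIN:-/usr/local/bin/audit}"

depend() {
    # Ordering only: bee-audit still starts if bee-nvidia is absent or failed.
    after bee-nvidia
}

start() {
    ebegin "Running bee audit"
    # Any command-line flags of the real binary are unknown, so none are passed.
    "$AUDIT_BIN" || ewarn "audit exited non-zero"
    # Always report success so a failed audit never fails the boot.
    eend 0
}
```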

**Critical invariants:**
- Dropbear MUST start without network. `bee-sshsetup` has `need localmount` only.
- `bee-network` uses `udhcpc -b` (background) — retries indefinitely if no cable.
- `bee-nvidia` loads modules via `insmod` with absolute paths — NOT `modprobe`.
  Reason: modloop squashfs mounts over `/lib/modules/<kver>/` at boot, making it
  read-only. The overlay's modules at that path are inaccessible. Modules are stored
  at `/usr/local/lib/nvidia/` (overlay path, always writable).
- `bee-nvidia` creates `libnvidia-ml.so.1` symlinks in `/usr/lib/` — required because
  `nvidia-smi` is a glibc binary that looks for the soname symlink, not the versioned file.
- `gcompat` package provides `/lib64/ld-linux-x86-64.so.2` for glibc compat on Alpine musl.
- `bee-audit` uses `after bee-nvidia` — ensures NVIDIA enrichment succeeds.
- `bee-audit` uses `eend 0` always — never fails boot even if audit errors.

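
The `insmod` and symlink invariants can be illustrated with a short sketch. This is not the project's `bee-nvidia` script: the function names, the overridable `NVIDIA_KO_DIR` and `LIB_DIR` variables, and the exact set and order of modules (`nvidia` first, then `nvidia-modeset` and `nvidia-uvm`) are assumptions. The only ordering fact relied on is that `nvidia.ko` has to be loaded before the modules that depend on it.

```sh
#!/bin/sh
# Hypothetical sketch of the bee-nvidia start logic.

NVIDIA_KO_DIR="${NVIDIA_KO_DIR:-/usr/local/lib/nvidia}"
LIB_DIR="${LIB_DIR:-/usr/lib}"

# Load modules by absolute path; modprobe is not used because the overlay's
# copy of /lib/modules/<kver>/ is hidden by the modloop mount.
load_nvidia_modules() {
    for mod in nvidia nvidia-modeset nvidia-uvm; do
        ko="$NVIDIA_KO_DIR/$mod.ko"
        [ -f "$ko" ] || continue      # tolerate modules that were not built
        insmod "$ko" || return 1
    done
    return 0
}

# Create <base>.1 pointing at the versioned library, for example
# libnvidia-ml.so.1 -> libnvidia-ml.so.<driver version>.
link_soname() {
    base="$1"
    for versioned in "$LIB_DIR/$base".*.*; do
        [ -f "$versioned" ] || continue
        ln -sf "$(basename "$versioned")" "$LIB_DIR/$base.1"
        return 0
    done
    return 1                          # no versioned library found
}
```

A service script could then call `load_nvidia_modules` followed by `link_soname libnvidia-ml.so` and `link_soname libcuda.so`.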
## ISO build sequence

```
build.sh [--authorized-keys /path/to/keys]
  1. compile audit binary (skip if .go files older than binary)
  2. inject authorized_keys into overlay/root/.ssh/ (or set password fallback)
  3. copy audit binary → overlay/usr/local/bin/audit
  4. copy vendor binaries from iso/vendor/ → overlay/usr/local/bin/
     (storcli64, sas2ircu, sas3ircu, mstflint, gpu_burn — each optional)
  5. build-nvidia-module.sh:
     a. apk add linux-lts-dev (always, to get current Alpine 3.21 kernel headers)
     b. detect KVER from /usr/src/linux-headers-*
     c. download NVIDIA .run installer (sha256 verified, cached in dist/)
     d. extract installer
     e. build kernel modules against linux-lts headers
     f. create libnvidia-ml.so.1 / libcuda.so.1 symlinks in cache
     g. cache in dist/nvidia-<version>-<kver>/
  6. inject NVIDIA .ko → overlay/usr/local/lib/nvidia/
  7. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
  8. inject libnvidia-ml + libcuda → overlay/usr/lib/
  9. write overlay/etc/bee-release (versions + git commit)
  10. export BEE_BUILD_INFO for motd substitution
  11. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
      kernel_* section        — cached (linux-lts modloop)
      apks_* section          — cached (downloaded packages)
      syslinux_* / grub_*     — cached
      apkovl                  — always regenerated (genapkovl-bee.sh)
      final ISO               — always assembled
```
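
Step 1 skips compilation when no `.go` file is newer than the existing binary. A minimal sketch of such a staleness check follows; the function name `needs_rebuild` and its arguments are hypothetical, and the actual `build.sh` may implement the test differently. A caller would run its compile step only when `needs_rebuild <binary> <source dir>` returns 0.

```sh
#!/bin/sh
# Hypothetical staleness check for the audit binary.

# Returns 0 (rebuild needed) if the binary is missing or any .go file under
# the source directory is newer than it; returns non-zero otherwise.
needs_rebuild() {
    bin="$1"
    src_dir="$2"
    [ -f "$bin" ] || return 0
    newer=$(find "$src_dir" -name '*.go' -newer "$bin" -print | head -n 1)
    [ -n "$newer" ]
}
```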

**Critical invariants:**
- `KERNEL_PKG_VERSION` in `iso/builder/VERSIONS` pins the exact Alpine package version
  (e.g. `6.12.76-r0`). This version is used in THREE places that MUST stay in sync:
  1. `build-nvidia-module.sh` — `apk add linux-lts-dev=${KERNEL_PKG_VERSION}` (compile headers)
  2. `mkimg.bee.sh` — `linux-lts=${KERNEL_PKG_VERSION}` in apks list (ISO kernel)
  3. `build.sh` — build-time verification that headers match pin (fails loudly if not)

  When Alpine releases a new linux-lts patch (e.g. r0 → r1), update KERNEL_PKG_VERSION
  in VERSIONS — that's the only place to change. The build will fail loudly if the pin
  doesn't match the installed headers, so stale pins are caught immediately.
- **All three must use the same APK mirror: `dl-cdn.alpinelinux.org`.** Both
  `build-nvidia-module.sh` (apk add) and `mkimage.sh` (--repository) explicitly use
  `https://dl-cdn.alpinelinux.org/alpine/v${ALPINE_VERSION}/main|community`.
  Never use the builder's local `/etc/apk/repositories` — its mirror may serve
  a different package state, causing "unable to select package" failures.
- `linux-lts-dev` is always installed (not conditional) — stale 6.6.x headers on the
  builder would cause modules to be built for the wrong kernel and never load at runtime.
- NVIDIA modules go to `overlay/usr/local/lib/nvidia/` — NOT `lib/modules/<kver>/extra/`.
- `genapkovl-bee.sh` must be copied to `/var/tmp/` (CWD when mkimage runs).
- `TMPDIR=/var/tmp` required — tmpfs `/tmp` is only ~1GB, too small for kernel firmware.
- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` cache dirs.

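
The pin check and the shared mirror can be sketched as follows. This is illustrative only: the function names are hypothetical, and the mapping from a package version such as `6.12.76-r0` to a headers directory named `linux-headers-6.12.76-0-lts` is an assumption based on Alpine's usual kernel release naming, not something stated in this document.

```sh
#!/bin/sh
# Hypothetical helpers for the kernel pin check and the mirror arguments.

ALPINE_VERSION="${ALPINE_VERSION:-3.21}"
MIRROR="https://dl-cdn.alpinelinux.org/alpine"

# Print one "--repository <url>" line per repository (main, community).
repo_args() {
    for repo in main community; do
        printf '%s %s/v%s/%s\n' --repository "$MIRROR" "$ALPINE_VERSION" "$repo"
    done
}

# Assumed mapping: package version 6.12.76-r0 -> kernel release 6.12.76-0-lts.
expected_kver() {
    pkgver="$1"
    echo "${pkgver%-r*}-${pkgver##*-r}-lts"
}

# Fail loudly if the headers matching the pinned version are not installed.
check_kernel_pin() {
    pin="$1"
    src_root="${2:-/usr/src}"
    want="$src_root/linux-headers-$(expected_kver "$pin")"
    if [ ! -d "$want" ]; then
        echo "ERROR: headers for linux-lts $pin not found at $want" >&2
        return 1
    fi
}
```

For example, `check_kernel_pin "$KERNEL_PKG_VERSION" || exit 1` would stop a build whose installed headers do not match the pin.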
## gpu_burn vendor binary

`gpu_burn` requires CUDA nvcc to build. It is NOT built as part of the main ISO build.
Build separately on the builder VM and place in `iso/vendor/gpu_burn`:

```sh
sh iso/builder/build-gpu-burn.sh dist/
cp dist/gpu_burn iso/vendor/gpu_burn
cp dist/compare.ptx iso/vendor/compare.ptx
```

Requires: CUDA 12.8+ (supports GCC 14, Alpine 3.21), libxml2, g++, make, git.
`build.sh` includes it automatically if `iso/vendor/gpu_burn` exists.

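
The "each optional" behaviour for vendor binaries can be sketched as a loop that copies whatever is present and skips the rest. The function name and log messages are hypothetical, and the sketch does not cover `compare.ptx`, whose handling inside `build.sh` is not described here.

```sh
#!/bin/sh
# Hypothetical sketch of the optional vendor-binary copy step.

copy_vendor_binaries() {
    vendor_dir="$1"   # e.g. iso/vendor
    dest_dir="$2"     # e.g. overlay/usr/local/bin
    mkdir -p "$dest_dir"
    for name in storcli64 sas2ircu sas3ircu mstflint gpu_burn; do
        if [ -f "$vendor_dir/$name" ]; then
            cp "$vendor_dir/$name" "$dest_dir/$name"
            chmod 755 "$dest_dir/$name"
            echo "vendor: included $name"
        else
            # A missing binary is not an error; the ISO is built without it.
            echo "vendor: $name not found, skipping"
        fi
    done
}
```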
## Post-boot smoke test

After booting a live ISO, run the smoke test to verify all critical components:

```sh
ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
```

Exit code 0 = all required checks pass. The output must contain no `FAIL` lines before shipping.

Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
gcompat installed, services running, audit completed with NVIDIA enrichment, internet.

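
A script with this contract (one line per check, exit code 0 only when nothing failed) can be built around a small helper. The sketch below is not the contents of `smoketest.sh`: the `check` and `run_checks` names, the output format, and the specific commands chosen for each check are assumptions.

```sh
#!/bin/sh
# Hypothetical PASS/FAIL helper in the style of a smoke test.

FAILS=0

# check <description> <command...>: run the command quietly, record the result.
check() {
    desc="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $desc"
    else
        echo "FAIL  $desc"
        FAILS=$((FAILS + 1))
    fi
}

# Assumed examples of required checks, based on the key checks listed above.
run_checks() {
    check "nvidia module loaded"      grep -q '^nvidia ' /proc/modules
    check "nvidia-smi lists GPUs"     nvidia-smi -L
    check "libnvidia-ml.so.1 present" test -e /usr/lib/libnvidia-ml.so.1
    check "gcompat loader present"    test -e /lib64/ld-linux-x86-64.so.2
    check "audit output written"      test -s /var/log/bee-audit.json
}
```

A script built this way would end with `run_checks; [ "$FAILS" -eq 0 ]`, so the exit code is non-zero whenever at least one `FAIL` line was printed.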
## apkovl mechanism

The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine initramfs extracts
it at boot, overlaying `/etc`, `/usr`, `/root`, `/lib` on the tmpfs root.

`genapkovl-bee.sh` generates the tarball containing:
- `/etc/apk/world` — package list (apk installs on first boot)
- `/etc/runlevels/*/` — OpenRC service symlinks
- `/etc/conf.d/dropbear` — `DROPBEAR_OPTS="-R -B"`
- `/etc/network/interfaces` — lo only (bee-network handles DHCP)
- `/etc/hostname`
- Everything from `iso/overlay/` (init scripts, binaries, ssh keys, tui)

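
The list above can be illustrated with a reduced sketch of how such a tarball might be assembled. It is not `genapkovl-bee.sh`: the `make_apkovl` function and its arguments are hypothetical, the `<hostname>.apkovl.tar.gz` file name follows the usual Alpine convention rather than anything stated here, and `/etc/apk/world` and the runlevel symlinks are left out for brevity.

```sh
#!/bin/sh
# Hypothetical, reduced sketch of apkovl generation.

make_apkovl() {
    host="$1"         # hostname written into the overlay
    overlay_dir="$2"  # e.g. iso/overlay
    out_dir="$3"      # where the tarball is written
    stage=$(mktemp -d) || return 1

    mkdir -p "$stage/etc/conf.d" "$stage/etc/network"
    echo "$host" > "$stage/etc/hostname"
    echo 'DROPBEAR_OPTS="-R -B"' > "$stage/etc/conf.d/dropbear"
    # Loopback only; DHCP on physical interfaces is left to bee-network.
    printf 'auto lo\niface lo inet loopback\n' > "$stage/etc/network/interfaces"

    # Copy everything from the overlay on top of the generated files.
    cp -a "$overlay_dir/." "$stage/"

    tar -C "$stage" -czf "$out_dir/$host.apkovl.tar.gz" .
    rc=$?
    rm -rf "$stage"
    return "$rc"
}
```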
## Collector flow

```
audit binary start
  1. board collector   (dmidecode -t 0,1,2)
  2. cpu collector     (dmidecode -t 4)
  3. memory collector  (dmidecode -t 17)
  4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
  5. pcie collector    (lspci -vmm -D, /sys/bus/pci/devices/)
  6. psu collector     (ipmitool fru — silent if no /dev/ipmi0)
  7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
  8. output JSON → /var/log/bee-audit.json
  9. QR summary to stdout (qrencode if available)
```

Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.