Refactor bee CLI and LiveCD integration

This commit is contained in:
Mikhail Chusavitin
2026-03-13 16:52:16 +03:00
parent b7c888edb1
commit 6aca1682b9
47 changed files with 3137 additions and 1201 deletions

View File

@@ -4,100 +4,68 @@
**The live CD runs in an isolated network segment with no internet access.**
All binaries, kernel modules, and tools must be baked into the ISO at build time.
No `apk add`, no downloads, no package manager calls are allowed at boot.
No package installation, no downloads, and no package manager calls are allowed at boot.
DHCP is used only for LAN (operator SSH access). Internet is NOT available.
## Boot sequence (single ISO)
OpenRC default runlevel, service start order:
`systemd` boot order:
```
localmount
├── bee-sshsetup (creates bee user, sets password; runs before dropbear)
│ └── dropbear (SSH on port 22 — starts without network)
├── bee-network (udhcpc -b on all physical interfaces, non-blocking)
│ └── bee-nvidia (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│ creates libnvidia-ml.so.1 symlinks in /usr/lib/)
└── bee-audit (runs audit binary → /var/log/bee-audit.json)
local-fs.target
├── bee-sshsetup.service (enables SSH key auth; password fallback only if marker exists)
│ └── ssh.service (OpenSSH on port 22 — starts without network)
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
creates /dev/nvidia* nodes)
└── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
never blocks boot on partial collector failures)
```
**Critical invariants:**
- Dropbear MUST start without network. `bee-sshsetup` has `need localmount` only.
- `bee-network` uses `udhcpc -b` (background) — retries indefinitely if no cable.
- `bee-nvidia` loads modules via `insmod` with absolute paths — NOT `modprobe`.
Reason: modloop squashfs mounts over `/lib/modules/<kver>/` at boot, making it
read-only. The overlay's modules at that path are inaccessible. Modules are stored
at `/usr/local/lib/nvidia/` (overlay path, always writable).
- `bee-nvidia` creates `libnvidia-ml.so.1` symlinks in `/usr/lib/` — required because
`nvidia-smi` is a glibc binary that looks for the soname symlink, not the versioned file.
- `gcompat` package provides `/lib64/ld-linux-x86-64.so.2` for glibc compat on Alpine musl.
- `bee-audit` uses `after bee-nvidia` — ensures NVIDIA enrichment succeeds.
- `bee-audit` uses `eend 0` always — never fails boot even if audit errors.
- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
## ISO build sequence
```
build.sh [--authorized-keys /path/to/keys]
1. compile audit binary (skip if .go files older than binary)
2. inject authorized_keys into overlay/root/.ssh/ (or set password fallback)
3. copy audit binary → overlay/usr/local/bin/audit
4. copy vendor binaries from iso/vendor/ → overlay/usr/local/bin/
(storcli64, sas2ircu, sas3ircu, mstflint, gpu_burn — each optional)
5. build-nvidia-module.sh:
a. apk add linux-lts-dev (always, to get current Alpine 3.21 kernel headers)
b. detect KVER from /usr/src/linux-headers-*
c. download NVIDIA .run installer (sha256 verified, cached in dist/)
d. extract installer
e. build kernel modules against linux-lts headers
f. create libnvidia-ml.so.1 / libcuda.so.1 symlinks in cache
g. cache in dist/nvidia-<version>-<kver>/
6. inject NVIDIA .ko → overlay/usr/local/lib/nvidia/
7. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
8. inject libnvidia-ml + libcuda → overlay/usr/lib/
9. write overlay/etc/bee-release (versions + git commit)
10. export BEE_BUILD_INFO for motd substitution
11. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
kernel_* section — cached (linux-lts modloop)
apks_* section — cached (downloaded packages)
syslinux_* / grub_* — cached
apkovl — always regenerated (genapkovl-bee.sh)
final ISO — always assembled
1. compile `bee` binary (skip if .go files older than binary)
2. create a temporary overlay staging dir under `dist/`
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
4. copy `bee` binary → staged `/usr/local/bin/bee`
5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
(`storcli64`, `sas2ircu`, `sas3ircu`, `mstflint` — each optional)
6. `build-nvidia-module.sh`:
a. install Debian kernel headers if missing
b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
c. extract installer
d. build kernel modules against Debian headers
e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
f. cache in `dist/nvidia-<version>-<kver>/`
7. inject NVIDIA `.ko`staged `/usr/local/lib/nvidia/`
8. inject `nvidia-smi`staged `/usr/local/bin/nvidia-smi`
9. inject `libnvidia-ml` + `libcuda`staged `/usr/lib/`
10. write staged `/etc/bee-release` (versions + git commit)
11. patch staged `motd` with build metadata
12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
13. sync staged overlay into workdir `config/includes.chroot/`
14. run `lb config && lb build` inside the temporary workdir
(either on a Debian host/VM or inside the privileged builder container)
```
**Critical invariants:**
- `KERNEL_PKG_VERSION` in `iso/builder/VERSIONS` pins the exact Alpine package version
(e.g. `6.12.76-r0`). This version is used in THREE places that MUST stay in sync:
1. `build-nvidia-module.sh``apk add linux-lts-dev=${KERNEL_PKG_VERSION}` (compile headers)
2. `mkimg.bee.sh``linux-lts=${KERNEL_PKG_VERSION}` in apks list (ISO kernel)
3. `build.sh` — build-time verification that headers match pin (fails loudly if not)
When Alpine releases a new linux-lts patch (e.g. r0 → r1), update KERNEL_PKG_VERSION
in VERSIONS — that's the only place to change. The build will fail loudly if the pin
doesn't match the installed headers, so stale pins are caught immediately.
- **All three must use the same APK mirror: `dl-cdn.alpinelinux.org`.** Both
`build-nvidia-module.sh` (apk add) and `mkimage.sh` (--repository) explicitly use
`https://dl-cdn.alpinelinux.org/alpine/v${ALPINE_VERSION}/main|community`.
Never use the builder's local `/etc/apk/repositories` — its mirror may serve
a different package state, causing "unable to select package" failures.
- `linux-lts-dev` is always installed (not conditional) — stale 6.6.x headers on the
builder would cause modules to be built for the wrong kernel and never load at runtime.
- NVIDIA modules go to `overlay/usr/local/lib/nvidia/` — NOT `lib/modules/<kver>/extra/`.
- `genapkovl-bee.sh` must be copied to `/var/tmp/` (CWD when mkimage runs).
- `TMPDIR=/var/tmp` required — tmpfs `/tmp` is only ~1GB, too small for kernel firmware.
- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` cache dirs.
## gpu_burn vendor binary
`gpu_burn` requires CUDA nvcc to build. It is NOT built as part of the main ISO build.
Build separately on the builder VM and place in `iso/vendor/gpu_burn`:
```sh
sh iso/builder/build-gpu-burn.sh dist/
cp dist/gpu_burn iso/vendor/gpu_burn
cp dist/compare.ptx iso/vendor/compare.ptx
```
Requires: CUDA 12.8+ (supports GCC 14, Alpine 3.21), libxml2, g++, make, git.
The `build.sh` will include it automatically if `iso/vendor/gpu_burn` exists.
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
2. `auto/config``linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
- Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
## Post-boot smoke test
@@ -109,26 +77,19 @@ ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shipping.
Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
gcompat installed, services running, audit completed with NVIDIA enrichment, internet.
Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
## apkovl mechanism
## Overlay mechanism
The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine initramfs extracts
it at boot, overlaying `/etc`, `/usr`, `/root`, `/lib` on the tmpfs root.
`genapkovl-bee.sh` generates the tarball containing:
- `/etc/apk/world` — package list (apk installs on first boot)
- `/etc/runlevels/*/` — OpenRC service symlinks
- `/etc/conf.d/dropbear``DROPBEAR_OPTS="-R -B"`
- `/etc/network/interfaces` — lo only (bee-network handles DHCP)
- `/etc/hostname`
- Everything from `iso/overlay/` (init scripts, binaries, ssh keys, tui)
`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
`build.sh` prepares a staged overlay, then syncs it into a temporary workdir's
`config/includes.chroot/` before running `lb build`.
## Collector flow
```
audit binary start
`bee audit` start
1. board collector (dmidecode -t 0,1,2)
2. cpu collector (dmidecode -t 4)
3. memory collector (dmidecode -t 17)

View File

@@ -21,16 +21,15 @@ Fills gaps where Redfish/logpile is blind:
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
- Unattended operation — no user interaction required
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
- SSH access (dropbear) always available for inspection and debugging
- Interactive TUI (`bee-tui`) for network setup, service management, GPU tests
- GPU stress testing via `gpu_burn` (vendor binary, optional)
- SSH access (OpenSSH) always available for inspection and debugging
- Interactive Go TUI via `bee tui` for network setup, service management, and acceptance tests
## Network isolation — CRITICAL
**The live CD runs in an isolated network segment with no internet access.**
- All tools, drivers, and binaries MUST be pre-baked into the ISO at build time
- No `apk add` at boot — packages are installed during ISO creation, not at runtime
- No package installation at boot — packages are installed during ISO creation, not at runtime
- No downloads at boot — NVIDIA modules, vendor tools, and all binaries come from the ISO overlay
- DHCP is used only for LAN access (SSH from operator laptop); internet is NOT assumed
- Any feature requiring network downloads cannot be added to the live CD
@@ -49,26 +48,32 @@ Fills gaps where Redfish/logpile is blind:
| Component | Technology |
|---|---|
| Audit binary | Go, static, `CGO_ENABLED=0` |
| LiveCD | Alpine Linux 3.21, linux-lts 6.12.x |
| ISO build | Alpine mkimage + apkovl overlay (`iso/overlay/`) |
| Init system | OpenRC |
| SSH | Dropbear (always included) |
| NVIDIA driver | Proprietary `.run` installer, built against linux-lts headers |
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` (not modloop path) |
| glibc compat | `gcompat` — required for `nvidia-smi` (glibc binary on musl Alpine) |
| Builder VM | Alpine 3.21 |
| Live ISO | Debian 12 (bookworm), amd64 live-build image |
| ISO build | Debian `live-build` + overlay sync into `config/includes.chroot/` |
| Init system | `systemd` |
| SSH | OpenSSH server |
| NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
| Builder | Debian 12 host/VM or Debian 12 container image |
## Runtime split
- The main Go application must run both on a normal Linux host and inside the live ISO
- Live-ISO-only responsibilities stay in `iso/` integration code
- Live ISO launches the Go CLI with `--runtime livecd`
- Local/manual runs use `--runtime auto` or `--runtime local`
## Key paths
| Path | Purpose |
|---|---|
| `audit/cmd/audit/` | CLI entry point |
| `audit/cmd/bee/` | Main CLI entry point |
| `audit/internal/collector/` | Per-subsystem collectors |
| `audit/internal/schema/` | HardwareIngestRequest types |
| `iso/builder/` | ISO build scripts and mkimage profile |
| `iso/overlay/` | Single overlay: files injected into ISO via apkovl |
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, gpu_burn, …) |
| `iso/builder/VERSIONS` | Pinned versions: Alpine, Go, NVIDIA driver, kernel |
| `iso/builder/` | ISO build scripts and `live-build` profile |
| `iso/overlay/` | Source overlay copied into a staged build overlay |
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, mstflint, …) |
| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
| `dist/` | Build outputs (gitignored) |
| `iso/out/` | Downloaded ISO files (gitignored) |

View File

@@ -2,19 +2,20 @@
## GPU stress test (H100)
**Задача:** добавить GPU burn/stress тест в bee-tui без существенного увеличения ISO.
**Статус:** отложено. В текущем ISO `gpu_burn` не включается и не запускается.
**Контекст:**
- `gpu_burn` (wilicc/gpu-burn) не подходит — требует `libcublas.so` (~500MB), что раздует ISO кратно
- `libcuda.so` уже есть в ISO (из NVIDIA .run installer)
**Почему задача всё ещё в backlog:**
- `gpu_burn` остаётся тяжёлым и неудобным с точки зрения зависимостей
- хочется штатный lightweight stress tool без `libcublas.so` и без заметного раздувания ISO
- для H100 нужен предсказуемый offline-инструмент, который можно стабильно возить внутри ISO
**Выбранный подход:** написать минимальный стресс-тул на CUDA Driver API
- Использует только `libcuda.so` (уже в ISO) — никаких новых зависимостей
- Реализует матричное умножение или memory bandwidth через `cuLaunchKernel`
- Бинарь ~100KB, компилируется через `nvcc` на builder VM, кладётся в `iso/vendor/`
- bee-tui вызывает его вместо `gpu_burn`
**Желаемый следующий шаг:** написать минимальный stress tool на CUDA Driver API
- использует только `libcuda.so`, уже присутствующий в ISO
- выполняет простой compute / memory workload через `cuLaunchKernel`
- собирается отдельно на builder VM и кладётся в `iso/vendor/`
- в будущем может вызываться из `bee tui` как предпочтительный встроенный GPU SAT/stress path
**Отклонённые варианты:**
**Отклонённые / проблемные варианты:**
- `gpu_burn` — нужен libcublas (~500MB)
- `nvbandwidth` — только bandwidth, не жжёт FLOPs; нужен libcudart (~8MB)
- DCGM diag — правильный инструмент для H100 но ~100MB установка