Refactor bee CLI and LiveCD integration

2026-03-13 16:52:16 +03:00
parent b7c888edb1
commit 6aca1682b9
47 changed files with 3137 additions and 1201 deletions
--- a/bible-local/architecture/runtime-flows.md
+++ b/bible-local/architecture/runtime-flows.md
@@ -4,100 +4,68 @@

 **The live CD runs in an isolated network segment with no internet access.**
 All binaries, kernel modules, and tools must be baked into the ISO at build time.
-No `apk add`, no downloads, no package manager calls are allowed at boot.
+No package installation, no downloads, and no package manager calls are allowed at boot.
 DHCP is used only for LAN (operator SSH access). Internet is NOT available.

 ## Boot sequence (single ISO)

-OpenRC default runlevel, service start order:
+`systemd` boot order:

 ```
-localmount
-  ├── bee-sshsetup   (creates bee user, sets password; runs before dropbear)
-  │     └── dropbear  (SSH on port 22 — starts without network)
-  ├── bee-network    (udhcpc -b on all physical interfaces, non-blocking)
-  │     └── bee-nvidia  (insmod nvidia*.ko from /usr/local/lib/nvidia/,
-  │                      creates libnvidia-ml.so.1 symlinks in /usr/lib/)
-  │           └── bee-audit  (runs audit binary → /var/log/bee-audit.json)
+local-fs.target
+  ├── bee-sshsetup.service   (enables SSH key auth; password fallback only if marker exists)
+  │     └── ssh.service      (OpenSSH on port 22 — starts without network)
+  ├── bee-network.service    (starts `dhclient -nw` on all physical interfaces, non-blocking)
+  ├── bee-nvidia.service     (insmod nvidia*.ko from /usr/local/lib/nvidia/,
+  │                           creates /dev/nvidia* nodes)
+  └── bee-audit.service      (runs `bee audit` → /var/log/bee-audit.json,
+                               never blocks boot on partial collector failures)
 ```

 **Critical invariants:**
-- Dropbear MUST start without network. `bee-sshsetup` has `need localmount` only.
-- `bee-network` uses `udhcpc -b` (background) — retries indefinitely if no cable.
-- `bee-nvidia` loads modules via `insmod` with absolute paths — NOT `modprobe`.
-  Reason: modloop squashfs mounts over `/lib/modules/<kver>/` at boot, making it
-  read-only. The overlay's modules at that path are inaccessible. Modules are stored
-  at `/usr/local/lib/nvidia/` (overlay path, always writable).
-- `bee-nvidia` creates `libnvidia-ml.so.1` symlinks in `/usr/lib/` — required because
-  `nvidia-smi` is a glibc binary that looks for the soname symlink, not the versioned file.
-- `gcompat` package provides `/lib64/ld-linux-x86-64.so.2` for glibc compat on Alpine musl.
-- `bee-audit` uses `after bee-nvidia` — ensures NVIDIA enrichment succeeds.
-- `bee-audit` uses `eend 0` always — never fails boot even if audit errors.
+- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
+- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
+- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
+  Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
+- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
+- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.

 ## ISO build sequence

 ```
 build.sh [--authorized-keys /path/to/keys]
-  1. compile audit binary (skip if .go files older than binary)
-  2. inject authorized_keys into overlay/root/.ssh/ (or set password fallback)
-  3. copy audit binary → overlay/usr/local/bin/audit
-  4. copy vendor binaries from iso/vendor/ → overlay/usr/local/bin/
-     (storcli64, sas2ircu, sas3ircu, mstflint, gpu_burn — each optional)
-  5. build-nvidia-module.sh:
-       a. apk add linux-lts-dev (always, to get current Alpine 3.21 kernel headers)
-       b. detect KVER from /usr/src/linux-headers-*
-       c. download NVIDIA .run installer (sha256 verified, cached in dist/)
-       d. extract installer
-       e. build kernel modules against linux-lts headers
-       f. create libnvidia-ml.so.1 / libcuda.so.1 symlinks in cache
-       g. cache in dist/nvidia-<version>-<kver>/
-  6. inject NVIDIA .ko → overlay/usr/local/lib/nvidia/
-  7. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
-  8. inject libnvidia-ml + libcuda → overlay/usr/lib/
-  9. write overlay/etc/bee-release (versions + git commit)
-  10. export BEE_BUILD_INFO for motd substitution
-  11. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
-        kernel_* section  — cached (linux-lts modloop)
-        apks_* section    — cached (downloaded packages)
-        syslinux_* / grub_* — cached
-        apkovl            — always regenerated (genapkovl-bee.sh)
-        final ISO         — always assembled
+  1. compile `bee` binary (skip if .go files older than binary)
+  2. create a temporary overlay staging dir under `dist/`
+  3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
+  4. copy `bee` binary → staged `/usr/local/bin/bee`
+  5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
+     (`storcli64`, `sas2ircu`, `sas3ircu`, `mstflint` — each optional)
+  6. `build-nvidia-module.sh`:
+       a. install Debian kernel headers if missing
+       b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
+       c. extract installer
+       d. build kernel modules against Debian headers
+       e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
+       f. cache in `dist/nvidia-<version>-<kver>/`
+  7. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
+  8. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
+  9. inject `libnvidia-ml` + `libcuda` → staged `/usr/lib/`
+  10. write staged `/etc/bee-release` (versions + git commit)
+  11. patch staged `motd` with build metadata
+  12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
+  13. sync staged overlay into workdir `config/includes.chroot/`
+  14. run `lb config && lb build` inside the temporary workdir
+     (either on a Debian host/VM or inside the privileged builder container)
 ```

 **Critical invariants:**
-- `KERNEL_PKG_VERSION` in `iso/builder/VERSIONS` pins the exact Alpine package version
-  (e.g. `6.12.76-r0`). This version is used in THREE places that MUST stay in sync:
-  1. `build-nvidia-module.sh` — `apk add linux-lts-dev=${KERNEL_PKG_VERSION}` (compile headers)
-  2. `mkimg.bee.sh` — `linux-lts=${KERNEL_PKG_VERSION}` in apks list (ISO kernel)
-  3. `build.sh` — build-time verification that headers match pin (fails loudly if not)
-  When Alpine releases a new linux-lts patch (e.g. r0 → r1), update KERNEL_PKG_VERSION
-  in VERSIONS — that's the only place to change. The build will fail loudly if the pin
-  doesn't match the installed headers, so stale pins are caught immediately.
-- **All three must use the same APK mirror: `dl-cdn.alpinelinux.org`.** Both
-  `build-nvidia-module.sh` (apk add) and `mkimage.sh` (--repository) explicitly use
-  `https://dl-cdn.alpinelinux.org/alpine/v${ALPINE_VERSION}/main|community`.
-  Never use the builder's local `/etc/apk/repositories` — its mirror may serve
-  a different package state, causing "unable to select package" failures.
-- `linux-lts-dev` is always installed (not conditional) — stale 6.6.x headers on the
-  builder would cause modules to be built for the wrong kernel and never load at runtime.
-- NVIDIA modules go to `overlay/usr/local/lib/nvidia/` — NOT `lib/modules/<kver>/extra/`.
-- `genapkovl-bee.sh` must be copied to `/var/tmp/` (CWD when mkimage runs).
-- `TMPDIR=/var/tmp` required — tmpfs `/tmp` is only ~1GB, too small for kernel firmware.
-- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` cache dirs.
-
-## gpu_burn vendor binary
-
-`gpu_burn` requires CUDA nvcc to build. It is NOT built as part of the main ISO build.
-Build separately on the builder VM and place in `iso/vendor/gpu_burn`:
-
-```sh
-sh iso/builder/build-gpu-burn.sh dist/
-cp dist/gpu_burn iso/vendor/gpu_burn
-cp dist/compare.ptx iso/vendor/compare.ptx
-```
-
-Requires: CUDA 12.8+ (supports GCC 14, Alpine 3.21), libxml2, g++, make, git.
-The `build.sh` will include it automatically if `iso/vendor/gpu_burn` exists.
+- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
+  1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
+  2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
+- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
+- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
+- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
+- Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.

 ## Post-boot smoke test

@@ -109,26 +77,19 @@ ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh

 Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shipping.

-Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
-gcompat installed, services running, audit completed with NVIDIA enrichment, internet.
+Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
+systemd services running, audit completed with NVIDIA enrichment, LAN reachability.

-## apkovl mechanism
+## Overlay mechanism

-The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine initramfs extracts
-it at boot, overlaying `/etc`, `/usr`, `/root`, `/lib` on the tmpfs root.
-
-`genapkovl-bee.sh` generates the tarball containing:
-- `/etc/apk/world` — package list (apk installs on first boot)
-- `/etc/runlevels/*/` — OpenRC service symlinks
-- `/etc/conf.d/dropbear` — `DROPBEAR_OPTS="-R -B"`
-- `/etc/network/interfaces` — lo only (bee-network handles DHCP)
-- `/etc/hostname`
-- Everything from `iso/overlay/` (init scripts, binaries, ssh keys, tui)
+`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
+`build.sh` prepares a staged overlay, then syncs it into a temporary workdir's
+`config/includes.chroot/` before running `lb build`.

 ## Collector flow

 ```
-audit binary start
+`bee audit` start
  1. board collector   (dmidecode -t 0,1,2)
  2. cpu collector     (dmidecode -t 4)
  3. memory collector  (dmidecode -t 17)
--- a/bible-local/architecture/system-overview.md
+++ b/bible-local/architecture/system-overview.md
@@ -21,16 +21,15 @@ Fills gaps where Redfish/logpile is blind:
 - Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
 - Unattended operation — no user interaction required
 - NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
-- SSH access (dropbear) always available for inspection and debugging
-- Interactive TUI (`bee-tui`) for network setup, service management, GPU tests
-- GPU stress testing via `gpu_burn` (vendor binary, optional)
+- SSH access (OpenSSH) always available for inspection and debugging
+- Interactive Go TUI via `bee tui` for network setup, service management, and acceptance tests

 ## Network isolation — CRITICAL

 **The live CD runs in an isolated network segment with no internet access.**

 - All tools, drivers, and binaries MUST be pre-baked into the ISO at build time
-- No `apk add` at boot — packages are installed during ISO creation, not at runtime
+- No package installation at boot — packages are installed during ISO creation, not at runtime
 - No downloads at boot — NVIDIA modules, vendor tools, and all binaries come from the ISO overlay
 - DHCP is used only for LAN access (SSH from operator laptop); internet is NOT assumed
 - Any feature requiring network downloads cannot be added to the live CD
@@ -49,26 +48,32 @@ Fills gaps where Redfish/logpile is blind:
 | Component | Technology |
 |---|---|
 | Audit binary | Go, static, `CGO_ENABLED=0` |
-| LiveCD | Alpine Linux 3.21, linux-lts 6.12.x |
-| ISO build | Alpine mkimage + apkovl overlay (`iso/overlay/`) |
-| Init system | OpenRC |
-| SSH | Dropbear (always included) |
-| NVIDIA driver | Proprietary `.run` installer, built against linux-lts headers |
-| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` (not modloop path) |
-| glibc compat | `gcompat` — required for `nvidia-smi` (glibc binary on musl Alpine) |
-| Builder VM | Alpine 3.21 |
+| Live ISO | Debian 12 (bookworm), amd64 live-build image |
+| ISO build | Debian `live-build` + overlay sync into `config/includes.chroot/` |
+| Init system | `systemd` |
+| SSH | OpenSSH server |
+| NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
+| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
+| Builder | Debian 12 host/VM or Debian 12 container image |
+
+## Runtime split
+
+- The main Go application must run both on a normal Linux host and inside the live ISO
+- Live-ISO-only responsibilities stay in `iso/` integration code
+- Live ISO launches the Go CLI with `--runtime livecd`
+- Local/manual runs use `--runtime auto` or `--runtime local`

 ## Key paths

 | Path | Purpose |
 |---|---|
-| `audit/cmd/audit/` | CLI entry point |
+| `audit/cmd/bee/` | Main CLI entry point |
 | `audit/internal/collector/` | Per-subsystem collectors |
 | `audit/internal/schema/` | HardwareIngestRequest types |
-| `iso/builder/` | ISO build scripts and mkimage profile |
-| `iso/overlay/` | Single overlay: files injected into ISO via apkovl |
-| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, gpu_burn, …) |
-| `iso/builder/VERSIONS` | Pinned versions: Alpine, Go, NVIDIA driver, kernel |
+| `iso/builder/` | ISO build scripts and `live-build` profile |
+| `iso/overlay/` | Source overlay copied into a staged build overlay |
+| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, mstflint, …) |
+| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
 | `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
 | `dist/` | Build outputs (gitignored) |
 | `iso/out/` | Downloaded ISO files (gitignored) |
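
As an illustration of how a pin from `iso/builder/VERSIONS` (see the key-paths table) could be consumed, here is a minimal shell sketch. It assumes the file holds plain, unquoted `KEY=value` lines, which this page does not show. `read_pin` and `kernel_image_pkg` are hypothetical helper names; only the `DEBIAN_KERNEL_ABI` key and the `linux-image-${DEBIAN_KERNEL_ABI}` package pattern come from the docs changed in this commit.

```sh
#!/bin/sh
# Sketch only: assumes VERSIONS contains simple, unquoted KEY=value lines
# (the real file format is not shown in this commit page).

# read_pin FILE KEY: print the value of KEY (first match wins);
# return 1 if the key is absent, empty, or the file is unreadable.
read_pin() {
    value=$(sed -n "s/^$2=//p" "$1" | head -n 1)
    [ -n "$value" ] || return 1
    printf '%s\n' "$value"
}

# kernel_image_pkg FILE: derive the kernel image package name from the
# pinned ABI, failing loudly when the pin is missing.
kernel_image_pkg() {
    abi=$(read_pin "$1" DEBIAN_KERNEL_ABI) || {
        echo "DEBIAN_KERNEL_ABI is not set in $1" >&2
        return 1
    }
    printf 'linux-image-%s\n' "$abi"
}
```

A build script could call `kernel_image_pkg iso/builder/VERSIONS` once and reuse the result wherever the kernel package is named, so the pin is read from a single place. Failing on a missing pin, rather than falling back to a default, is a deliberate choice in this sketch.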
--- a/bible-local/backlog.md
+++ b/bible-local/backlog.md
@@ -2,19 +2,20 @@

 ## GPU stress test (H100)

-**Task:** add a GPU burn/stress test to bee-tui without significantly increasing the ISO size.
+**Status:** deferred. The current ISO neither includes nor runs `gpu_burn`.

-**Context:**
-- `gpu_burn` (wilicc/gpu-burn) is not suitable: it requires `libcublas.so` (~500MB), which would multiply the ISO size
-- `libcuda.so` is already in the ISO (from the NVIDIA .run installer)
+**Why the task is still in the backlog:**
+- `gpu_burn` remains heavy and awkward in terms of dependencies
+- we want a standard lightweight stress tool without `libcublas.so` and without noticeably bloating the ISO
+- H100 needs a predictable offline tool that can be shipped reliably inside the ISO

-**Chosen approach:** write a minimal stress tool on the CUDA Driver API
-- Uses only `libcuda.so` (already in the ISO), so no new dependencies
-- Implements matrix multiplication or a memory bandwidth test via `cuLaunchKernel`
-- Binary is ~100KB, compiled with `nvcc` on the builder VM, placed in `iso/vendor/`
-- bee-tui calls it instead of `gpu_burn`
+**Desired next step:** write a minimal stress tool on the CUDA Driver API
+- uses only `libcuda.so`, which is already present in the ISO
+- runs a simple compute / memory workload via `cuLaunchKernel`
+- is built separately on the builder VM and placed in `iso/vendor/`
+- may later be invoked from `bee tui` as the preferred built-in GPU SAT/stress path

-**Rejected options:**
+**Rejected / problematic options:**
 - `gpu_burn`: needs libcublas (~500MB)
 - `nvbandwidth`: bandwidth only, does not burn FLOPs; needs libcudart (~8MB)
 - DCGM diag: the right tool for H100, but a ~100MB install