Refactor bee CLI and LiveCD integration
@@ -4,100 +4,68 @@
**The live CD runs in an isolated network segment with no internet access.**

All binaries, kernel modules, and tools must be baked into the ISO at build time.
-No `apk add`, no downloads, no package manager calls are allowed at boot.
+No package installation, no downloads, and no package manager calls are allowed at boot.
DHCP is used only for LAN (operator SSH access). Internet is NOT available.
## Boot sequence (single ISO)

-OpenRC default runlevel, service start order:
+`systemd` boot order:
```
-localmount
-├── bee-sshsetup (creates bee user, sets password; runs before dropbear)
-│   └── dropbear (SSH on port 22 — starts without network)
-├── bee-network (udhcpc -b on all physical interfaces, non-blocking)
-│   └── bee-nvidia (insmod nvidia*.ko from /usr/local/lib/nvidia/,
-│       creates libnvidia-ml.so.1 symlinks in /usr/lib/)
-│       └── bee-audit (runs audit binary → /var/log/bee-audit.json)
+local-fs.target
+├── bee-sshsetup.service (enables SSH key auth; password fallback only if marker exists)
+│   └── ssh.service (OpenSSH on port 22 — starts without network)
+├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
+├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
+│   creates /dev/nvidia* nodes)
+└── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
+    never blocks boot on partial collector failures)
```

**Critical invariants:**
-- Dropbear MUST start without network. `bee-sshsetup` has `need localmount` only.
-- `bee-network` uses `udhcpc -b` (background) — retries indefinitely if no cable.
-- `bee-nvidia` loads modules via `insmod` with absolute paths — NOT `modprobe`.
-  Reason: modloop squashfs mounts over `/lib/modules/<kver>/` at boot, making it
-  read-only. The overlay's modules at that path are inaccessible. Modules are stored
-  at `/usr/local/lib/nvidia/` (overlay path, always writable).
-- `bee-nvidia` creates `libnvidia-ml.so.1` symlinks in `/usr/lib/` — required because
-  `nvidia-smi` is a glibc binary that looks for the soname symlink, not the versioned file.
-- `gcompat` package provides `/lib64/ld-linux-x86-64.so.2` for glibc compat on Alpine musl.
-- `bee-audit` uses `after bee-nvidia` — ensures NVIDIA enrichment succeeds.
-- `bee-audit` uses `eend 0` always — never fails boot even if audit errors.
+- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
+- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
+- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
+  Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
+- `bee-audit.service` does not wait for `network-online.target`; the audit is local and must run even if DHCP is broken.
+- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
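The insmod-with-absolute-paths rule can be sketched as a small loader function. This is a hypothetical helper, not the shipped `bee-nvidia.service` `ExecStart`; the module names and their ordering are assumptions:

```shell
#!/bin/sh
# load_nvidia_modules DIR [INSMOD_CMD]
# Load NVIDIA kernel modules by absolute path with insmod (never modprobe),
# skipping modules that are not shipped and warning instead of failing.
load_nvidia_modules() {
    dir="$1"
    insmod_cmd="${2:-insmod}"   # overridable, e.g. "echo" for a dry run
    # nvidia.ko must load first; the other modules depend on it.
    for m in nvidia nvidia-uvm nvidia-modeset nvidia-drm; do
        ko="$dir/$m.ko"
        [ -f "$ko" ] || continue
        "$insmod_cmd" "$ko" || echo "WARN: failed to load $ko" >&2
    done
}

# On the live ISO this would point at the overlay path baked into the image.
load_nvidia_modules "${MODDIR:-/usr/local/lib/nvidia}"
```

Passing `echo` as the second argument makes the same function dry-runnable on a development machine.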
## ISO build sequence
```
build.sh [--authorized-keys /path/to/keys]
-  1. compile audit binary (skip if .go files older than binary)
-  2. inject authorized_keys into overlay/root/.ssh/ (or set password fallback)
-  3. copy audit binary → overlay/usr/local/bin/audit
-  4. copy vendor binaries from iso/vendor/ → overlay/usr/local/bin/
-     (storcli64, sas2ircu, sas3ircu, mstflint, gpu_burn — each optional)
-  5. build-nvidia-module.sh:
-     a. apk add linux-lts-dev (always, to get current Alpine 3.21 kernel headers)
-     b. detect KVER from /usr/src/linux-headers-*
-     c. download NVIDIA .run installer (sha256 verified, cached in dist/)
-     d. extract installer
-     e. build kernel modules against linux-lts headers
-     f. create libnvidia-ml.so.1 / libcuda.so.1 symlinks in cache
-     g. cache in dist/nvidia-<version>-<kver>/
-  6. inject NVIDIA .ko → overlay/usr/local/lib/nvidia/
-  7. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
-  8. inject libnvidia-ml + libcuda → overlay/usr/lib/
-  9. write overlay/etc/bee-release (versions + git commit)
-  10. export BEE_BUILD_INFO for motd substitution
-  11. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
-      kernel_* section — cached (linux-lts modloop)
-      apks_* section — cached (downloaded packages)
-      syslinux_* / grub_* — cached
-      apkovl — always regenerated (genapkovl-bee.sh)
-      final ISO — always assembled
+  1. compile `bee` binary (skip if .go files older than binary)
+  2. create a temporary overlay staging dir under `dist/`
+  3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
+  4. copy `bee` binary → staged `/usr/local/bin/bee`
+  5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
+     (`storcli64`, `sas2ircu`, `sas3ircu`, `mstflint` — each optional)
+  6. `build-nvidia-module.sh`:
+     a. install Debian kernel headers if missing
+     b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
+     c. extract installer
+     d. build kernel modules against Debian headers
+     e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
+     f. cache in `dist/nvidia-<version>-<kver>/`
+  7. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
+  8. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
+  9. inject `libnvidia-ml` + `libcuda` → staged `/usr/lib/`
+  10. write staged `/etc/bee-release` (versions + git commit)
+  11. patch staged `motd` with build metadata
+  12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
+  13. sync staged overlay into workdir `config/includes.chroot/`
+  14. run `lb config && lb build` inside the temporary workdir
+      (either on a Debian host/VM or inside the privileged builder container)
```
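Step 1's skip condition ("skip if .go files older than binary") can be expressed as a small staleness check. A sketch only; the helper name is invented and the real `build.sh` may implement this differently:

```shell
#!/bin/sh
# needs_rebuild BIN SRCDIR
# Succeed (exit 0) when BIN is missing or any .go file under SRCDIR is
# newer than BIN, i.e. when a recompile is actually required.
needs_rebuild() {
    bin="$1"
    srcdir="$2"
    [ -e "$bin" ] || return 0
    # find -newer compares mtimes; one hit is enough to decide.
    newer=$(find "$srcdir" -name '*.go' -newer "$bin" | head -n 1)
    [ -n "$newer" ]
}
```

Typical use: `needs_rebuild dist/bee audit/ && go build -o dist/bee ./audit/cmd/bee` (paths illustrative).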

**Critical invariants:**
-- `KERNEL_PKG_VERSION` in `iso/builder/VERSIONS` pins the exact Alpine package version
-  (e.g. `6.12.76-r0`). This version is used in THREE places that MUST stay in sync:
-  1. `build-nvidia-module.sh` — `apk add linux-lts-dev=${KERNEL_PKG_VERSION}` (compile headers)
-  2. `mkimg.bee.sh` — `linux-lts=${KERNEL_PKG_VERSION}` in apks list (ISO kernel)
-  3. `build.sh` — build-time verification that headers match the pin (fails loudly if not)
-  When Alpine releases a new linux-lts patch (e.g. r0 → r1), update KERNEL_PKG_VERSION
-  in VERSIONS — that's the only place to change. The build will fail loudly if the pin
-  doesn't match the installed headers, so stale pins are caught immediately.
-- **All three must use the same APK mirror: `dl-cdn.alpinelinux.org`.** Both
-  `build-nvidia-module.sh` (apk add) and `mkimage.sh` (--repository) explicitly use
-  `https://dl-cdn.alpinelinux.org/alpine/v${ALPINE_VERSION}/main|community`.
-  Never use the builder's local `/etc/apk/repositories` — its mirror may serve
-  a different package state, causing "unable to select package" failures.
-- `linux-lts-dev` is always installed (not conditional) — stale 6.6.x headers on the
-  builder would cause modules to be built for the wrong kernel and never load at runtime.
-- NVIDIA modules go to `overlay/usr/local/lib/nvidia/` — NOT `lib/modules/<kver>/extra/`.
-- `genapkovl-bee.sh` must be copied to `/var/tmp/` (CWD when mkimage runs).
-- `TMPDIR=/var/tmp` required — tmpfs `/tmp` is only ~1GB, too small for kernel firmware.
-- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` cache dirs.
-## gpu_burn vendor binary
-
-`gpu_burn` requires CUDA nvcc to build. It is NOT built as part of the main ISO build.
-Build it separately on the builder VM and place it in `iso/vendor/gpu_burn`:
-
-```sh
-sh iso/builder/build-gpu-burn.sh dist/
-cp dist/gpu_burn iso/vendor/gpu_burn
-cp dist/compare.ptx iso/vendor/compare.ptx
-```
-
-Requires: CUDA 12.8+ (supports GCC 14, Alpine 3.21), libxml2, g++, make, git.
-`build.sh` will include it automatically if `iso/vendor/gpu_burn` exists.
+- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
+  1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for the module build
+  2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
+- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
+- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
+- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
+- Container builds require `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
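The ABI pin can be verified at build time with a check along these lines. A sketch: the function name, error text, and the example ABI in the usage note are invented; only the `DEBIAN_KERNEL_ABI` variable and the fail-loudly behavior come from the text above:

```shell
#!/bin/sh
# check_kernel_headers ABI [HDR_ROOT]
# Fail loudly when the pinned ABI from VERSIONS has no matching headers
# installed, so a stale pin stops the build instead of producing modules
# for the wrong kernel.
check_kernel_headers() {
    abi="$1"
    root="${2:-/usr/src}"
    if [ -d "$root/linux-headers-$abi" ]; then
        echo "OK: kernel headers for $abi found under $root"
    else
        echo "ERROR: no linux-headers-$abi under $root (update DEBIAN_KERNEL_ABI?)" >&2
        return 1
    fi
}
```

Called early in the build, e.g. `check_kernel_headers "$DEBIAN_KERNEL_ABI" || exit 1`.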
## Post-boot smoke test

@@ -109,26 +77,19 @@ ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shipping.

-Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
-gcompat installed, services running, audit completed with NVIDIA enrichment, internet.
+Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
+systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
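The PASS/FAIL convention can be sketched as a tiny harness. Illustrative only, not the shipped `smoketest.sh`; the check names and the placeholder commands are invented:

```shell
#!/bin/sh
# check NAME CMD...
# Run CMD silently; print "PASS NAME" or "FAIL NAME" and count failures
# so the caller can exit nonzero when any required check fails.
FAILS=0
check() {
    name="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS $name"
    else
        echo "FAIL $name"
        FAILS=$((FAILS + 1))
    fi
}

# Placeholder checks; a real smoke test would probe nvidia modules,
# systemd unit state, /var/log/bee-audit.json, and so on.
check shell-works true
check modules-dir test -d /nonexistent-placeholder
echo "failures: $FAILS"
```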
-## apkovl mechanism
+## Overlay mechanism
-The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine initramfs extracts
-it at boot, overlaying `/etc`, `/usr`, `/root`, `/lib` on the tmpfs root.
-
-`genapkovl-bee.sh` generates the tarball containing:
-- `/etc/apk/world` — package list (apk installs on first boot)
-- `/etc/runlevels/*/` — OpenRC service symlinks
-- `/etc/conf.d/dropbear` — `DROPBEAR_OPTS="-R -B"`
-- `/etc/network/interfaces` — lo only (bee-network handles DHCP)
-- `/etc/hostname`
-- Everything from `iso/overlay/` (init scripts, binaries, ssh keys, tui)
+`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
+`build.sh` prepares a staged overlay, then syncs it into a temporary workdir's
+`config/includes.chroot/` before running `lb build`.
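The stage-then-sync step can be sketched as follows. A hypothetical helper: the real `build.sh` may well use `rsync` with more options; `cp -R` stands in here to keep the sketch dependency-free:

```shell
#!/bin/sh
# sync_overlay SRC DST
# Mirror the staged overlay into the live-build includes dir, creating
# DST as needed. "SRC/." copies the *contents* of SRC, preserving layout.
sync_overlay() {
    src="$1"
    dst="$2"
    mkdir -p "$dst"
    cp -R "$src/." "$dst/"
}
```

Typical use: `sync_overlay dist/overlay-stage dist/lb-work/config/includes.chroot` (paths illustrative).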
## Collector flow
```
-audit binary start
+`bee audit` start
1. board collector (dmidecode -t 0,1,2)
2. cpu collector (dmidecode -t 4)
3. memory collector (dmidecode -t 17)
```
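The collectors shell out to tools (`dmidecode` and friends) that may be absent or broken on a given box; the tolerant pattern looks roughly like this. The real collectors are Go code inside `bee`; this shell sketch only shows the shape, and the helper name is invented:

```shell
#!/bin/sh
# run_collector NAME CMD...
# Capture output when the tool works; otherwise record "unavailable"
# instead of aborting the whole audit (partial failure is not fatal).
run_collector() {
    name="$1"; shift
    if out=$("$@" 2>/dev/null); then
        printf '%s: ok (%s bytes)\n' "$name" "${#out}"
    else
        printf '%s: unavailable\n' "$name"
    fi
}
```

A missing binary (exit 127) and a failing one are handled the same way: the collector is marked unavailable and the audit moves on.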
@@ -21,16 +21,15 @@ Fills gaps where Redfish/logpile is blind:
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
- Unattended operation — no user interaction required
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
-- SSH access (dropbear) always available for inspection and debugging
-- Interactive TUI (`bee-tui`) for network setup, service management, GPU tests
-- GPU stress testing via `gpu_burn` (vendor binary, optional)
+- SSH access (OpenSSH) always available for inspection and debugging
+- Interactive Go TUI via `bee tui` for network setup, service management, and acceptance tests
## Network isolation — CRITICAL

**The live CD runs in an isolated network segment with no internet access.**
- All tools, drivers, and binaries MUST be pre-baked into the ISO at build time
-- No `apk add` at boot — packages are installed during ISO creation, not at runtime
+- No package installation at boot — packages are installed during ISO creation, not at runtime
- No downloads at boot — NVIDIA modules, vendor tools, and all binaries come from the ISO overlay
- DHCP is used only for LAN access (SSH from operator laptop); internet is NOT assumed
- Any feature requiring network downloads cannot be added to the live CD
@@ -49,26 +48,32 @@ Fills gaps where Redfish/logpile is blind:
| Component | Technology |
|---|---|
| Audit binary | Go, static, `CGO_ENABLED=0` |
-| LiveCD | Alpine Linux 3.21, linux-lts 6.12.x |
-| ISO build | Alpine mkimage + apkovl overlay (`iso/overlay/`) |
-| Init system | OpenRC |
-| SSH | Dropbear (always included) |
-| NVIDIA driver | Proprietary `.run` installer, built against linux-lts headers |
-| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` (not modloop path) |
-| glibc compat | `gcompat` — required for `nvidia-smi` (glibc binary on musl Alpine) |
-| Builder VM | Alpine 3.21 |
+| Live ISO | Debian 12 (bookworm), amd64 live-build image |
+| ISO build | Debian `live-build` + overlay sync into `config/includes.chroot/` |
+| Init system | `systemd` |
+| SSH | OpenSSH server |
+| NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
+| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
+| Builder | Debian 12 host/VM or Debian 12 container image |
## Runtime split

- The main Go application must run both on a normal Linux host and inside the live ISO
- Live-ISO-only responsibilities stay in `iso/` integration code
- Live ISO launches the Go CLI with `--runtime livecd`
- Local/manual runs use `--runtime auto` or `--runtime local`
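One plausible shape for resolving `--runtime auto` is a marker-file probe. A sketch: the real `bee` does this in Go, and using `/etc/bee-release` as the livecd marker is an assumption borrowed from the build steps, not a documented contract:

```shell
#!/bin/sh
# resolve_runtime MODE [MARKER]
# Map --runtime auto to livecd/local by probing for a marker file that
# only exists inside the live ISO; explicit modes pass through unchanged.
resolve_runtime() {
    mode="$1"
    marker="${2:-/etc/bee-release}"
    if [ "$mode" = "auto" ]; then
        if [ -f "$marker" ]; then echo livecd; else echo local; fi
    else
        echo "$mode"
    fi
}
```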
## Key paths

| Path | Purpose |
|---|---|
-| `audit/cmd/audit/` | CLI entry point |
+| `audit/cmd/bee/` | Main CLI entry point |
| `audit/internal/collector/` | Per-subsystem collectors |
| `audit/internal/schema/` | HardwareIngestRequest types |
-| `iso/builder/` | ISO build scripts and mkimage profile |
-| `iso/overlay/` | Single overlay: files injected into ISO via apkovl |
-| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, gpu_burn, …) |
-| `iso/builder/VERSIONS` | Pinned versions: Alpine, Go, NVIDIA driver, kernel |
+| `iso/builder/` | ISO build scripts and `live-build` profile |
+| `iso/overlay/` | Source overlay copied into a staged build overlay |
+| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, mstflint, …) |
+| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
+| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify the live ISO |
| `dist/` | Build outputs (gitignored) |
| `iso/out/` | Downloaded ISO files (gitignored) |
@@ -2,19 +2,20 @@
## GPU stress test (H100)
-**Task:** add a GPU burn/stress test to bee-tui without significantly growing the ISO.
+**Status:** postponed. The current ISO neither includes nor runs `gpu_burn`.

**Context:**
- `gpu_burn` (wilicc/gpu-burn) is unsuitable — it requires `libcublas.so` (~500MB), which would inflate the ISO several times over
- `libcuda.so` is already in the ISO (from the NVIDIA .run installer)
+**Why this task is still in the backlog:**
+- `gpu_burn` remains heavy and awkward in terms of dependencies
+- we want a proper lightweight stress tool without `libcublas.so` and without noticeably bloating the ISO
+- for H100 we need a predictable offline tool that can be shipped reliably inside the ISO
-**Chosen approach:** write a minimal stress tool on the CUDA Driver API
-- uses only `libcuda.so` (already in the ISO) — no new dependencies
-- implements matrix multiplication or a memory-bandwidth workload via `cuLaunchKernel`
-- binary is ~100KB, compiled with `nvcc` on the builder VM, placed in `iso/vendor/`
-- bee-tui invokes it instead of `gpu_burn`
+**Desired next step:** write a minimal stress tool on the CUDA Driver API
+- uses only `libcuda.so`, already present in the ISO
+- runs a simple compute / memory workload via `cuLaunchKernel`
+- built separately on the builder VM and placed in `iso/vendor/`
+- may later be invoked from `bee tui` as the preferred built-in GPU SAT/stress path
-**Rejected options:**
+**Rejected / problematic options:**
- `gpu_burn` — requires libcublas (~500MB)
- `nvbandwidth` — bandwidth only, does not burn FLOPs; requires libcudart (~8MB)
- DCGM diag — the right tool for H100, but a ~100MB install