Merge debug/prod into single ISO build, fix NVIDIA module loading

## ISO build consolidation
- Remove separate debug/prod split: overlay-debug/, build-debug.sh,
  mkimg.bee_debug.sh, genapkovl-bee_debug.sh all deleted
- Single overlay: iso/overlay/ (was overlay-debug content)
- Single build script: build.sh (SSH, TUI, NVIDIA, vendor tools, bee-release)
- Single mkimage profile: bee (with dropbear, dialog, strace, gcompat, etc.)

## NVIDIA fixes
- Modules now stored at /usr/local/lib/nvidia/ instead of
  /lib/modules/<kver>/extra/nvidia/ — modloop squashfs mounts over that
  path at boot making overlay content there inaccessible
- bee-nvidia init: load via insmod (absolute path), not modprobe
- bee-nvidia init: create libnvidia-ml.so.1/libcuda.so.1 symlinks in /usr/lib/
- build-nvidia-module.sh: always install linux-lts-dev (not conditional) —
  stale 6.6.x headers caused wrong-kernel modules that never loaded at runtime
- build-nvidia-module.sh: create soname symlinks in cache
- KERNEL_VERSION in VERSIONS updated 6.6 → 6.12
- gcompat added to ISO packages (nvidia-smi is a glibc binary on musl Alpine)

## Service ordering
- bee-audit: add `after bee-nvidia` so NVIDIA enrichment always succeeds

## New tooling
- iso/builder/smoketest.sh: SSH smoke test for post-boot ISO validation
- iso/builder/build-gpu-burn.sh: builds gpu_burn vendor binary (CUDA 12.8+)
- vendor/gpu_burn included automatically if placed in iso/vendor/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1,64 +1,109 @@
 # Runtime Flows — bee
 
-## Boot sequence (debug ISO)
+## Boot sequence (single ISO)
 
 OpenRC default runlevel, service start order:
 
 ```
 localmount
-└── bee-sshsetup (creates bee user, sets password fallback)
-└── dropbear (SSH on port 22 — starts regardless of network)
-└── bee-network (udhcpc -b on all physical interfaces, non-blocking)
-└── bee-nvidia (depmod -a, modprobe nvidia nvidia-modeset nvidia-uvm)
-└── bee-audit-debug (runs audit binary, logs to /var/log/bee-audit.json)
+├── bee-sshsetup (creates bee user, sets password; runs before dropbear)
+│ └── dropbear (SSH on port 22 — starts without network)
+├── bee-network (udhcpc -b on all physical interfaces, non-blocking)
+│ └── bee-nvidia (insmod nvidia*.ko from /usr/local/lib/nvidia/,
+│ creates libnvidia-ml.so.1 symlinks in /usr/lib/)
+│ └── bee-audit (runs audit binary → /var/log/bee-audit.json)
 ```
 
 **Critical invariants:**
-- Dropbear MUST start without network. Custom init in overlay has `need localmount` only — NOT `need net`.
-- bee-network uses `udhcpc -b` (background daemon) so it retries indefinitely when cable connected later.
-- bee-audit-debug uses `eend 0` always — never fails boot even if audit errors.
+- Dropbear MUST start without network. `bee-sshsetup` has `need localmount` only.
+- `bee-network` uses `udhcpc -b` (background) — retries indefinitely if no cable.
+- `bee-nvidia` loads modules via `insmod` with absolute paths — NOT `modprobe`.
+Reason: modloop squashfs mounts over `/lib/modules/<kver>/` at boot, making it
+read-only. The overlay's modules at that path are inaccessible. Modules are stored
+at `/usr/local/lib/nvidia/` (overlay path, always writable).
+- `bee-nvidia` creates `libnvidia-ml.so.1` symlinks in `/usr/lib/` — required because
+`nvidia-smi` is a glibc binary that looks for the soname symlink, not the versioned file.
+- `gcompat` package provides `/lib64/ld-linux-x86-64.so.2` for glibc compat on Alpine musl.
+- `bee-audit` uses `after bee-nvidia` — ensures NVIDIA enrichment succeeds.
+- `bee-audit` uses `eend 0` always — never fails boot even if audit errors.
 
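To illustrate the `insmod` and soname-symlink invariants in the new boot sequence, here is a minimal sketch of what the load step could look like. It is not the actual `bee-nvidia` init script: the function names `load_nvidia_modules` and `link_sonames`, the `INSMOD` override, and the decision to skip missing modules are all assumptions made for illustration. The module names come from the old `modprobe` line, and the directories from the invariants listed in this hunk.

```sh
#!/bin/sh
# Hypothetical sketch only; the real bee-nvidia init script may differ.
NVIDIA_KO_DIR="${NVIDIA_KO_DIR:-/usr/local/lib/nvidia}"
LIB_DIR="${LIB_DIR:-/usr/lib}"
INSMOD="${INSMOD:-insmod}"   # assumption: overridable so the sketch can be dry-run

# Load modules by absolute path. insmod does not resolve dependencies,
# so nvidia.ko has to be loaded before nvidia-modeset.ko and nvidia-uvm.ko.
load_nvidia_modules() {
    for mod in nvidia nvidia-modeset nvidia-uvm; do
        ko="$NVIDIA_KO_DIR/$mod.ko"
        if [ -f "$ko" ]; then
            $INSMOD "$ko" || return 1
        else
            echo "bee-nvidia: $ko not found, skipping" >&2
        fi
    done
}

# Create soname symlinks such as libnvidia-ml.so.1 -> libnvidia-ml.so.<version>.
# The glob only matches versioned files (at least two dots after ".so").
link_sonames() {
    for base in libnvidia-ml libcuda; do
        for versioned in "$LIB_DIR/$base".so.*.*; do
            [ -f "$versioned" ] || continue
            ln -sf "${versioned##*/}" "$LIB_DIR/$base.so.1"
        done
    done
}
```

Setting `INSMOD=echo` prints the module paths in load order instead of loading them, which is a cheap way to check the ordering on a builder without NVIDIA hardware.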
 ## ISO build sequence
 
 ```
-build-debug.sh
+build.sh [--authorized-keys /path/to/keys]
 1. compile audit binary (skip if .go files older than binary)
-2. build-nvidia-module.sh:
-a. download NVIDIA .run installer (sha256 verified, cached)
-b. extract installer
-c. build kernel modules against linux-lts-dev headers
-d. extract nvidia-smi + libnvidia-ml from installer
-e. cache in dist/nvidia-<version>-<kver>/
-3. inject authorized_keys into overlay
-4. inject audit binary → overlay/usr/local/bin/audit
-5. inject NVIDIA .ko → overlay/lib/modules/<kver>/extra/nvidia/
-6. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
-7. copy mkimg profile + genapkovl to ~/.mkimage/ AND /var/tmp/
-8. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
-kernel_* section — cached (linux-lts modloop, lz4 compressed)
-apks_* section — cached (downloaded packages)
-syslinux_* / grub_* — cached
-apkovl — always regenerated (genapkovl-bee_debug.sh)
-final ISO — always assembled
+2. inject authorized_keys into overlay/root/.ssh/ (or set password fallback)
+3. copy audit binary → overlay/usr/local/bin/audit
+4. copy vendor binaries from iso/vendor/ → overlay/usr/local/bin/
+(storcli64, sas2ircu, sas3ircu, mstflint, gpu_burn — each optional)
+5. build-nvidia-module.sh:
+a. apk add linux-lts-dev (always, to get current Alpine 3.21 kernel headers)
+b. detect KVER from /usr/src/linux-headers-*
+c. download NVIDIA .run installer (sha256 verified, cached in dist/)
+d. extract installer
+e. build kernel modules against linux-lts headers
+f. create libnvidia-ml.so.1 / libcuda.so.1 symlinks in cache
+g. cache in dist/nvidia-<version>-<kver>/
+6. inject NVIDIA .ko → overlay/usr/local/lib/nvidia/
+7. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
+8. inject libnvidia-ml + libcuda → overlay/usr/lib/
+9. write overlay/etc/bee-release (versions + git commit)
+10. export BEE_BUILD_INFO for motd substitution
+11. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
+kernel_* section — cached (linux-lts modloop)
+apks_* section — cached (downloaded packages)
+syslinux_* / grub_* — cached
+apkovl — always regenerated (genapkovl-bee.sh)
+final ISO — always assembled
 ```
 
 **Critical invariants:**
-- `genapkovl-bee_debug.sh` must be in `/var/tmp/` (CWD when mkimage runs), not only `~/.mkimage/`.
-- `TMPDIR=/var/tmp` required — tmpfs /tmp is only ~1GB, too small for kernel firmware.
-- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` — only clears apkovl and final image.
-- `run-builder.sh` runs build in `screen` session to survive SSH disconnects during long NVIDIA downloads.
+- `linux-lts-dev` is always installed (not conditional) — stale 6.6.x headers on the
+builder would cause modules to be built for the wrong kernel and never load at runtime.
+- NVIDIA modules go to `overlay/usr/local/lib/nvidia/` — NOT `lib/modules/<kver>/extra/`.
+- `genapkovl-bee.sh` must be copied to `/var/tmp/` (CWD when mkimage runs).
+- `TMPDIR=/var/tmp` required — tmpfs `/tmp` is only ~1GB, too small for kernel firmware.
+- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` cache dirs.
 
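Step 5b ("detect KVER from /usr/src/linux-headers-*") is what protects the build from stale headers, so a small sketch may help. This is an assumed implementation rather than the code in `build-nvidia-module.sh`: the function name `detect_kver` and its optional directory argument are hypothetical, and it simply takes the first matching directory in glob order.

```sh
#!/bin/sh
# Hypothetical sketch; build-nvidia-module.sh may detect KVER differently.

# detect_kver [SRC_DIR]
# Prints the kernel version taken from the first linux-headers-<kver>
# directory found under SRC_DIR (default: /usr/src).
detect_kver() {
    src_dir="${1:-/usr/src}"
    for d in "$src_dir"/linux-headers-*; do
        [ -d "$d" ] || continue
        name="${d##*/}"                   # strip the leading path
        echo "${name#linux-headers-}"     # strip the fixed prefix
        return 0
    done
    echo "no linux-headers-* directory under $src_dir" >&2
    return 1
}
```

If more than one headers directory is present, this sketch does not choose between them, which is one reason to install `linux-lts-dev` unconditionally and to keep the builder free of old header packages.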
+## gpu_burn vendor binary
+
+`gpu_burn` requires CUDA nvcc to build. It is NOT built as part of the main ISO build.
+Build separately on the builder VM and place in `iso/vendor/gpu_burn`:
+
+```sh
+sh iso/builder/build-gpu-burn.sh dist/
+cp dist/gpu_burn iso/vendor/gpu_burn
+cp dist/compare.ptx iso/vendor/compare.ptx
+```
+
+Requires: CUDA 12.8+ (supports GCC 14, Alpine 3.21), libxml2, g++, make, git.
+The `build.sh` will include it automatically if `iso/vendor/gpu_burn` exists.
+
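The "included automatically if present" behaviour for vendor binaries can be sketched as a loop over the optional tool names from build step 4. This is an assumption about how `build.sh` might do it, not its actual code: `copy_vendor_tools` and its two arguments are hypothetical names, and the handling of `compare.ptx` is left out because the diff does not say where it ends up in the overlay.

```sh
#!/bin/sh
# Hypothetical sketch of the optional vendor-binary step; build.sh may differ.

# copy_vendor_tools VENDOR_DIR DEST_DIR
# Copies each known vendor tool that exists and reports the ones that do not.
copy_vendor_tools() {
    vendor_dir="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir"
    for tool in storcli64 sas2ircu sas3ircu mstflint gpu_burn; do
        if [ -f "$vendor_dir/$tool" ]; then
            cp "$vendor_dir/$tool" "$dest_dir/$tool"
            chmod +x "$dest_dir/$tool"
            echo "vendor: included $tool"
        else
            echo "vendor: $tool not found, skipping"
        fi
    done
}
```

A call along the lines of `copy_vendor_tools iso/vendor iso/overlay/usr/local/bin` would match the paths named in the build sequence. A missing tool is reported but never fails the build.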
+## Post-boot smoke test
+
+After booting a live ISO, run to verify all critical components:
+
+```sh
+ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
+```
+
+Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shipping.
+
+Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
+gcompat installed, services running, audit completed with NVIDIA enrichment, internet.
+
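The PASS/FAIL convention described for the smoke test can be sketched with a small helper. This is not the content of `smoketest.sh`: the `check` function, the `FAILS` counter, and the example checks in the comments are assumptions chosen to mirror the "key checks" listed above.

```sh
#!/bin/sh
# Hypothetical sketch of a PASS/FAIL helper; smoketest.sh may be structured differently.
FAILS=0

# check DESCRIPTION COMMAND [ARGS...]
# Runs the command quietly, prints PASS or FAIL, and counts failures.
check() {
    desc="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $desc"
    else
        echo "FAIL  $desc"
        FAILS=$((FAILS + 1))
    fi
}

# Example checks (assumed, based on the key checks listed above):
#   check "nvidia module loaded"     grep -q '^nvidia ' /proc/modules
#   check "libnvidia-ml soname link" test -L /usr/lib/libnvidia-ml.so.1
#   check "audit output present"     test -s /var/log/bee-audit.json
# A script built on this helper would finish with:
#   [ "$FAILS" -eq 0 ]
```

Because `check` updates `FAILS` in the current shell, it should be called directly rather than inside `$(...)`, otherwise the failure count is lost.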
 ## apkovl mechanism
 
-The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine's initramfs extracts it at boot, overlaying `/etc`, `/usr`, `/root` on the tmpfs root.
+The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine initramfs extracts
+it at boot, overlaying `/etc`, `/usr`, `/root`, `/lib` on the tmpfs root.
 
-`genapkovl-bee_debug.sh` generates the tarball containing:
-- `/etc/apk/world` — package list (apk installs these on first boot)
+`genapkovl-bee.sh` generates the tarball containing:
+- `/etc/apk/world` — package list (apk installs on first boot)
 - `/etc/runlevels/*/` — OpenRC service symlinks
-- `/etc/conf.d/dropbear` — DROPBEAR_OPTS="-R -B"
+- `/etc/conf.d/dropbear` — `DROPBEAR_OPTS="-R -B"`
 - `/etc/network/interfaces` — lo only (bee-network handles DHCP)
 - `/etc/hostname`
-- Everything from `iso/overlay-debug/` (init scripts, binaries, ssh keys)
+- Everything from `iso/overlay/` (init scripts, binaries, ssh keys, tui)
 
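Since the apkovl is an ordinary gzip-compressed tarball, its contents can be checked before the ISO is assembled. The helper below is a sketch under assumptions: `apkovl_has` is a hypothetical name, and the tarball filename in the usage note is illustrative because the diff does not state it.

```sh
#!/bin/sh
# Hypothetical helper for inspecting a generated apkovl tarball.

# apkovl_has TARBALL PATH
# Succeeds if PATH (relative, without a leading "./") is an entry in TARBALL.
apkovl_has() {
    tarball="$1"
    path="$2"
    # Normalise a possible "./" prefix, then match the whole line literally.
    tar -tzf "$tarball" | sed 's|^\./||' | grep -qxF "$path"
}
```

For example, `apkovl_has bee.apkovl.tar.gz etc/apk/world` (filename assumed) would confirm that the package list made it into the overlay.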
 ## Collector flow
 
@@ -70,8 +115,8 @@ audit binary start
 4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
 5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/)
 6. psu collector (ipmitool fru — silent if no /dev/ipmi0)
-7. nvidia enrichment (nvidia-smi — skipped if driver not loaded)
-8. output JSON to stdout / file / usb
+7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
+8. output JSON → /var/log/bee-audit.json
 9. QR summary to stdout (qrencode if available)
 ```
 
@@ -19,15 +19,16 @@ Fills gaps where Redfish/logpile is blind:
 ## In scope
 
 - Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
-- Unattended operation — no user interaction at any stage
-- NVIDIA proprietary driver loaded at boot for GPU enrichment
-- SSH access in debug ISO for development and testing
-- Auto-update of audit binary from Gitea releases (production ISO)
+- Unattended operation — no user interaction required
+- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
+- SSH access (dropbear) always available for inspection and debugging
+- Interactive TUI (`bee-tui`) for network setup, service management, GPU tests
+- GPU stress testing via `gpu_burn` (vendor binary, optional)
 
 ## Out of scope
 
 - Any writes to the server being audited
-- Network configuration changes
+- Network configuration changes (persistent)
 - BMC/IPMI configuration
 - Anything requiring persistent storage on the audited machine
 - Windows support
@@ -38,11 +39,13 @@ Fills gaps where Redfish/logpile is blind:
 |---|---|
 | Audit binary | Go, static, `CGO_ENABLED=0` |
 | LiveCD | Alpine Linux 3.21, linux-lts 6.12.x |
-| ISO build | Alpine mkimage + apkovl overlay |
+| ISO build | Alpine mkimage + apkovl overlay (`iso/overlay/`) |
 | Init system | OpenRC |
-| SSH (debug) | Dropbear |
+| SSH | Dropbear (always included) |
 | NVIDIA driver | Proprietary `.run` installer, built against linux-lts headers |
-| Builder VM | Alpine 3.21, 172.27.0.4 |
+| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` (not modloop path) |
+| glibc compat | `gcompat` — required for `nvidia-smi` (glibc binary on musl Alpine) |
+| Builder VM | Alpine 3.21 |
 
 ## Key paths
 
@@ -52,7 +55,9 @@ Fills gaps where Redfish/logpile is blind:
 | `audit/internal/collector/` | Per-subsystem collectors |
 | `audit/internal/schema/` | HardwareIngestRequest types |
 | `iso/builder/` | ISO build scripts and mkimage profile |
-| `iso/overlay-debug/` | Files injected into debug ISO via apkovl |
-| `iso/builder/VERSIONS` | Pinned versions: Alpine, Go, NVIDIA driver |
+| `iso/overlay/` | Single overlay: files injected into ISO via apkovl |
+| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, gpu_burn, …) |
+| `iso/builder/VERSIONS` | Pinned versions: Alpine, Go, NVIDIA driver, kernel |
+| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
 | `dist/` | Build outputs (gitignored) |
 | `iso/out/` | Downloaded ISO files (gitignored) |