diff --git a/PLAN.md b/PLAN.md
index 6a71bf6..782d43e 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -485,8 +485,8 @@ Release naming convention: binary asset named `bee-audit-linux-amd64` per releas
 ```
 AUDIT_VERSION=1.0
 ALPINE_VERSION=3.21
-KERNEL_VERSION=6.6
-NVIDIA_DRIVER_VERSION=550.54.15
+KERNEL_VERSION=6.12
+NVIDIA_DRIVER_VERSION=590.48.01
 ```
 
 LiveCD release = full ISO rebuild. Binary-only patch = new Gitea release with binary asset.
diff --git a/bible-local/README.md b/bible-local/README.md
new file mode 100644
index 0000000..6bdf907
--- /dev/null
+++ b/bible-local/README.md
@@ -0,0 +1,12 @@
+# bee — Project Bible
+
+Project-specific architecture, decisions, and runtime contracts.
+Generic engineering rules live in `bible/rules/patterns/`.
+
+## Files
+
+| File | Contents |
+|---|---|
+| `architecture/system-overview.md` | What bee does, scope, tech stack |
+| `architecture/runtime-flows.md` | Boot sequence, audit flow, service order |
+| `decisions/` | Architectural decision log |
diff --git a/bible-local/architecture/runtime-flows.md b/bible-local/architecture/runtime-flows.md
new file mode 100644
index 0000000..b5654d5
--- /dev/null
+++ b/bible-local/architecture/runtime-flows.md
@@ -0,0 +1,78 @@
+# Runtime Flows — bee
+
+## Boot sequence (debug ISO)
+
+OpenRC default runlevel, service start order:
+
+```
+localmount
+    └── bee-sshsetup (creates bee user, sets password fallback)
+        └── dropbear (SSH on port 22 — starts regardless of network)
+            └── bee-network (udhcpc -b on all physical interfaces, non-blocking)
+                └── bee-nvidia (depmod -a, modprobe nvidia nvidia-modeset nvidia-uvm)
+                    └── bee-audit-debug (runs audit binary, logs to /var/log/bee-audit.json)
+```
+
+**Critical invariants:**
+- Dropbear MUST start without network. Custom init in overlay has `need localmount` only — NOT `need net`.
+- bee-network uses `udhcpc -b` (background daemon) so it retries indefinitely when cable connected later.
+- bee-audit-debug uses `eend 0` always — never fails boot even if audit errors.
+
+## ISO build sequence
+
+```
+build-debug.sh
+  1. compile audit binary (skip if .go files older than binary)
+  2. build-nvidia-module.sh:
+       a. download NVIDIA .run installer (sha256 verified, cached)
+       b. extract installer
+       c. build kernel modules against linux-lts-dev headers
+       d. extract nvidia-smi + libnvidia-ml from installer
+       e. cache in dist/nvidia--/
+  3. inject authorized_keys into overlay
+  4. inject audit binary → overlay/usr/local/bin/audit
+  5. inject NVIDIA .ko → overlay/lib/modules//extra/nvidia/
+  6. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
+  7. copy mkimg profile + genapkovl to ~/.mkimage/ AND /var/tmp/
+  8. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
+       kernel_* section — cached (linux-lts modloop, lz4 compressed)
+       apks_* section — cached (downloaded packages)
+       syslinux_* / grub_* — cached
+       apkovl — always regenerated (genapkovl-bee_debug.sh)
+       final ISO — always assembled
+```
+
+**Critical invariants:**
+- `genapkovl-bee_debug.sh` must be in `/var/tmp/` (CWD when mkimage runs), not only `~/.mkimage/`.
+- `TMPDIR=/var/tmp` required — tmpfs /tmp is only ~1GB, too small for kernel firmware.
+- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` — only clears apkovl and final image.
+- `run-builder.sh` runs build in `screen` session to survive SSH disconnects during long NVIDIA downloads.
+
+## apkovl mechanism
+
+The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine's initramfs extracts it at boot, overlaying `/etc`, `/usr`, `/root` on the tmpfs root.
+
+`genapkovl-bee_debug.sh` generates the tarball containing:
+- `/etc/apk/world` — package list (apk installs these on first boot)
+- `/etc/runlevels/*/` — OpenRC service symlinks
+- `/etc/conf.d/dropbear` — DROPBEAR_OPTS="-R -B"
+- `/etc/network/interfaces` — lo only (bee-network handles DHCP)
+- `/etc/hostname`
+- Everything from `iso/overlay-debug/` (init scripts, binaries, ssh keys)
+
+## Collector flow
+
+```
+audit binary start
+  1. board collector (dmidecode -t 0,1,2)
+  2. cpu collector (dmidecode -t 4)
+  3. memory collector (dmidecode -t 17)
+  4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
+  5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/)
+  6. psu collector (ipmitool fru — silent if no /dev/ipmi0)
+  7. nvidia enrichment (nvidia-smi — skipped if driver not loaded)
+  8. output JSON to stdout / file / usb
+  9. QR summary to stdout (qrencode if available)
+```
+
+Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
diff --git a/bible-local/architecture/system-overview.md b/bible-local/architecture/system-overview.md
new file mode 100644
index 0000000..312abbf
--- /dev/null
+++ b/bible-local/architecture/system-overview.md
@@ -0,0 +1,58 @@
+# System Overview — bee
+
+## What it does
+
+Hardware audit LiveCD. Boots on a server via BMC virtual media or USB.
+Collects hardware inventory at OS level (not through BMC/Redfish).
+Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
+
+## Why it exists
+
+Fills gaps where Redfish/logpile is blind:
+- NVMe serials and SMART data
+- DIMM serials and slot layout
+- GPU serials and VBIOS versions
+- Physical disks behind RAID controllers
+- Full SMART wear telemetry
+- NIC firmware versions
+
+## In scope
+
+- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
+- Unattended operation — no user interaction at any stage
+- NVIDIA proprietary driver loaded at boot for GPU enrichment
+- SSH access in debug ISO for development and testing
+- Auto-update of audit binary from Gitea releases (production ISO)
+
+## Out of scope
+
+- Any writes to the server being audited
+- Network configuration changes
+- BMC/IPMI configuration
+- Anything requiring persistent storage on the audited machine
+- Windows support
+
+## Tech stack
+
+| Component | Technology |
+|---|---|
+| Audit binary | Go, static, `CGO_ENABLED=0` |
+| LiveCD | Alpine Linux 3.21, linux-lts 6.12.x |
+| ISO build | Alpine mkimage + apkovl overlay |
+| Init system | OpenRC |
+| SSH (debug) | Dropbear |
+| NVIDIA driver | Proprietary `.run` installer, built against linux-lts headers |
+| Builder VM | Alpine 3.21, 172.27.0.4 |
+
+## Key paths
+
+| Path | Purpose |
+|---|---|
+| `audit/cmd/audit/` | CLI entry point |
+| `audit/internal/collector/` | Per-subsystem collectors |
+| `audit/internal/schema/` | HardwareIngestRequest types |
+| `iso/builder/` | ISO build scripts and mkimage profile |
+| `iso/overlay-debug/` | Files injected into debug ISO via apkovl |
+| `iso/builder/VERSIONS` | Pinned versions: Alpine, Go, NVIDIA driver |
+| `dist/` | Build outputs (gitignored) |
+| `iso/out/` | Downloaded ISO files (gitignored) |
diff --git a/bible-local/decisions/2026-03-05-nvidia-proprietary-driver.md b/bible-local/decisions/2026-03-05-nvidia-proprietary-driver.md
new file mode 100644
index 0000000..4f41620
--- /dev/null
+++ b/bible-local/decisions/2026-03-05-nvidia-proprietary-driver.md
@@ -0,0 +1,23 @@
+# Decision: Use NVIDIA proprietary driver, not open kernel modules
+
+**Date:** 2026-03-05
+**Status:** active
+
+## Context
+
+bee needs to collect GPU serial numbers, VBIOS versions, and ECC telemetry via `nvidia-smi`.
+Two options exist: NVIDIA open-gpu-kernel-modules (MIT/GPLv2, GitHub) or the official
+proprietary `.run` installer.
+
+## Decision
+
+Use the official proprietary NVIDIA `.run` installer for both kernel modules and `nvidia-smi`.
+
+## Consequences
+
+- Kernel modules and nvidia-smi come from a single verified source.
+- NVIDIA publishes `.sha256sum` alongside each installer — download and verify before use.
+- Driver version pinned in `iso/builder/VERSIONS` as `NVIDIA_DRIVER_VERSION`.
+- Build process: download `.run`, extract, compile `kernel/` sources against `linux-lts-dev`.
+- Modules cached in `dist/nvidia--/` — rebuild only on version or kernel change.
+- ISO size increases by ~50MB for .ko files + nvidia-smi.
diff --git a/bible-local/decisions/README.md b/bible-local/decisions/README.md
new file mode 100644
index 0000000..72a2042
--- /dev/null
+++ b/bible-local/decisions/README.md
@@ -0,0 +1,7 @@
+# Architectural Decision Log
+
+One file per decision, named `YYYY-MM-DD-short-topic.md`.
+
+| Date | Decision | Status |
+|---|---|---|
+| 2026-03-05 | Use NVIDIA proprietary driver | active |
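
Review note on `runtime-flows.md`: the collector contract ("Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.") could look roughly like the Go sketch below. This is an illustration under assumptions, not the code in `audit/internal/collector/`: the `BoardInfo` type and the `collectBoard` function are hypothetical names, and only the `dmidecode -t 0,1,2` invocation is taken from the collector flow in this diff.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// BoardInfo is a hypothetical result type used only for this sketch;
// the real types live in audit/internal/schema/.
type BoardInfo struct {
	Raw string // unparsed dmidecode output
}

// collectBoard is a hypothetical collector. It follows the documented
// contract: a missing tool yields (nil, nil), so the caller can skip the
// subsystem without treating it as an error.
func collectBoard(tool string) (*BoardInfo, error) {
	path, err := exec.LookPath(tool)
	if err != nil {
		if errors.Is(err, exec.ErrNotFound) {
			return nil, nil // tool not installed: not an error
		}
		return nil, err
	}

	// Arguments match step 1 of the collector flow (dmidecode -t 0,1,2).
	out, err := exec.Command(path, "-t", "0,1,2").Output()
	if err != nil {
		// Returned to the caller, which is expected to log it and continue.
		return nil, fmt.Errorf("%s: %w", tool, err)
	}
	return &BoardInfo{Raw: string(out)}, nil
}
```

A caller would invoke `collectBoard("dmidecode")`, log any non-nil error, and move on to the next collector either way, which keeps a single missing or failing tool from aborting the whole audit.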