# BEE — Build Plan

Hardware audit LiveCD for offline server inventory. Produces `HardwareIngestRequest` JSON compatible with core/reanimator.

**Principle:** OS-level collection — reads hardware directly, not through the BMC.

Automatic boot audit plus operator console. Boot runs the audit immediately, but local/SSH operators can rerun checks through the TUI and CLI. Errors are logged and should not block boot on partial collector failures.

Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.

---

## Status snapshot (2026-03-14)

### Phase 1 — Go Audit Binary

- 1.1 Project scaffold — **DONE**
- 1.2 Board collector — **DONE**
- 1.3 CPU collector — **DONE**
- 1.4 Memory collector — **DONE**
- 1.5 Storage collector — **DONE**
- 1.6 PCIe collector — **DONE** (with noise filtering for system/chipset devices)
- 1.7 PSU collector — **DONE (basic FRU path)**
- 1.7b PSU SDR health — **DONE** (`ipmitool sdr` merged with FRU inventory)
- 1.8 NVIDIA GPU enrichment — **DONE**
- 1.8b Component wear / age telemetry — **DONE** (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
- 1.8c Storage health verdicts — **DONE** (SMART/NVMe warning/failed status derivation)
- 1.9 Mellanox/NVIDIA NIC enrichment — **DONE** (mstflint + ethtool firmware fallback)
- 1.10 RAID controller enrichment — **DONE (initial multi-tool support)** (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
- 1.11 Output and export workflow — **DONE** (explicit file output + manual removable export via TUI)
- 1.12 Integration test (local) — **DONE** (`scripts/test-local.sh`)

### Phase 2 — Debian Live ISO

- Current implementation uses Debian 12 `live-build`, `systemd`, and OpenSSH.
- Network bring-up on boot — **DONE**
- Boot services (`bee-network`, `bee-nvidia`, `bee-audit`, `bee-sshsetup`) — **DONE**
- Local console UX (`bee` autologin on `tty1`, `menu` auto-start, TUI privilege escalation via `sudo -n`) — **DONE**
- VM/debug support (`qemu-guest-agent`, serial console, virtual GPU initramfs modules) — **DONE**
- Vendor utilities in overlay — **DONE**
- Build metadata + staged overlay injection — **DONE**
- Builder container cache persisted outside the container writable layer — **DONE**
- ISO volume label `BEE` — **DONE**
- Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior.
- Real-hardware validation remains **PENDING**; current validation is limited to local/libvirt VM boot + service checks.

---

## Phase 1 — Go Audit Binary

Self-contained static binary. Runs on any Linux (including the Debian live ISO). Calls system utilities, parses their output, and produces `HardwareIngestRequest` JSON.

### 1.1 — Project scaffold

- `audit/go.mod` — module `bee/audit`
- `audit/cmd/bee/main.go` — main CLI entry point: subcommands, runtime selection, JSON output
- `audit/internal/schema/` — copy of `HardwareIngestRequest` types from core (no import dependency)
- `audit/internal/collector/` — empty package stubs for all collectors
- `const Version = "1.0"` in main
- Output modes: stdout (default), file path flag `--output /path/to/file.json`
- Tests: none yet (stubs only)

### 1.2 — Board collector

Source: `dmidecode -t 0` (BIOS), `-t 1` (System), `-t 2` (Baseboard)

Collects:
- `board.serial_number` — from System Information
- `board.manufacturer`, `board.product_name` — from System Information
- `board.part_number` — from Baseboard
- `board.uuid` — from System Information
- `firmware[BIOS]` — vendor, version, release date from BIOS Information

Tests: table tests with `testdata/dmidecode_*.txt` fixtures

### 1.3 — CPU collector

Source: `dmidecode -t 4`

Collects:
- socket index, model, manufacturer, status
- cores, threads, current/max frequency
- firmware: microcode version from `/sys/devices/system/cpu/cpu0/microcode/version`
- serial: not available on Intel Xeon → fallback `-CPU-` (matches core logic)

Tests: table tests with dmidecode fixtures

### 1.4 — Memory collector

Source: `dmidecode -t 17`

Collects:
- slot, location, present flag
- size_mb, type (DDR4/DDR5), max_speed_mhz, current_speed_mhz
- manufacturer, serial_number, part_number
- status from "Data Width" / "No Module Installed" detection

Tests: table tests with dmidecode fixtures (populated + empty slots)

### 1.5 — Storage collector

Sources:
- `lsblk -J -o NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,MOUNTPOINT` — device enumeration
- `smartctl -j -i /dev/X` — serial, model, firmware, interface per device
- `nvme id-ctrl /dev/nvmeX -o json` — NVMe: serial (sn), firmware (fr), model (mn), size

Collects per device:
- type: SSD/HDD/NVMe
- model, serial_number, manufacturer, firmware
- size_gb, interface (SATA/SAS/NVMe)
- slot: from lsblk HCTL where available

Tests: table tests with `smartctl -j` JSON fixtures and `nvme id-ctrl` JSON fixtures

### 1.6 — PCIe collector

Sources:
- `lspci -vmm -D` — slot, vendor, device, class
- `lspci -vvv -D` — link width/speed (LnkSta, LnkCap)
- embedded pci.ids (same submodule as logpile: `third_party/pciids`) — model name lookup
- `/sys/bus/pci/devices/<bdf>/` — actual negotiated link state from the kernel

Collects per device:
- bdf, vendor_id, device_id, device_class
- manufacturer, model (via pciids lookup if empty)
- link_width, link_speed, max_link_width, max_link_speed
- serial_number: device-specific (see per-type enrichment in 1.8, 1.9)

Dedup: by serial → bdf (mirrors logpile canonical device repository logic)

Tests: table tests with lspci fixtures

### 1.7 — PSU collector

Source: `ipmitool fru` — primary (the only source for PSU data from the OS)
Fallback: `dmidecode -t 39` (System Power Supply, limited availability)

Collects:
- slot, present, model, vendor, serial_number, part_number
- wattage_w from FRU or dmidecode
- firmware from ipmitool FRU `Board Extra` fields
- input_power_w, output_power_w, input_voltage from `ipmitool sdr`

Tests: table tests with ipmitool fru text fixtures

### 1.8 — NVIDIA GPU enrichment

Prerequisite: NVIDIA driver loaded (checked via `nvidia-smi -L` exit code)

Sources:
- `nvidia-smi --query-gpu=index,name,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total --format=csv,noheader,nounits`
- BDF correlation: `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader` → match to PCIe collector records

Enriches PCIe records for NVIDIA devices:
- `serial_number` — real GPU serial from nvidia-smi
- `firmware` — VBIOS version
- `status` — derived from ECC uncorrected errors (0 = OK, >0 = WARNING)
- telemetry: temperature_c, power_w added to PCIe record attributes

Fallback (no driver): the PCIe record stays as-is with serial fallback `-PCIE-`

Tests: table tests with nvidia-smi CSV fixtures

### 1.8b — Component wear / age telemetry

Every component that stores its own usage history must have that data collected and placed in the `attributes` / `telemetry` map of the respective record. This is a cross-cutting concern applied on top of the per-collector steps.
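A minimal sketch of this cross-cutting placement rule — numeric wear values go into the record's `telemetry` map, boolean flags into `attributes`. The `Record` type, its field names, and the `attachWear` helper below are illustrative stand-ins, not the actual `audit/internal/schema` types:

```go
package main

import "fmt"

// Record is a simplified stand-in for a HardwareIngestRequest component
// record; the real schema lives in audit/internal/schema, and these field
// names are illustrative only.
type Record struct {
	SerialNumber string
	Telemetry    map[string]float64
	Attributes   map[string]string
}

// attachWear applies the 1.8b placement rule: numeric wear indicators go
// into the telemetry map, boolean flags into attributes, and nothing is
// flattened into new top-level fields. smart holds already-parsed values
// keyed by the names used in this plan (power_on_hours, percentage_used, ...).
func attachWear(r *Record, smart map[string]float64, throttled bool) {
	if r.Telemetry == nil {
		r.Telemetry = map[string]float64{}
	}
	if r.Attributes == nil {
		r.Attributes = map[string]string{}
	}
	for k, v := range smart {
		r.Telemetry[k] = v // numeric → telemetry, never a new schema field
	}
	r.Attributes["throttle_active"] = fmt.Sprintf("%t", throttled)
}

func main() {
	r := Record{SerialNumber: "EXAMPLE123"} // hypothetical serial
	attachWear(&r, map[string]float64{"power_on_hours": 31337, "percentage_used": 12}, false)
	fmt.Println(r.Telemetry["power_on_hours"], r.Attributes["throttle_active"])
}
```

Because each collector only fills the two maps, a new wear source (e.g. a future SAS enclosure counter) needs no schema change on the core side.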
**Storage (SATA/SAS) — smartctl:**
- `Power_On_Hours` (attr 9) — total hours powered on
- `Power_Cycle_Count` (attr 12)
- `Reallocated_Sector_Ct` (attr 5) — wear indicator
- `Wear_Leveling_Count` (attr 177, SSD) — NAND wear
- `Total_LBAs_Written` (attr 241) — LBAs written over the drive's lifetime
- `SSD_Life_Left` (attr 231) — % remaining if reported
- Collected as `telemetry` map keys: `power_on_hours`, `power_cycles`, `reallocated_sectors`, `wear_leveling_pct`, `total_lba_written`, `life_remaining_pct`

**NVMe — nvme smart-log:**
- `power_on_hours` — lifetime hours
- `power_cycles`
- `unsafe_shutdowns` — abnormal power loss count
- `percentage_used` — % of rated lifetime consumed (0–100)
- `data_units_written` — 512,000-byte units written over the lifetime
- `controller_busy_time` — hours the controller was busy
- Collected via `nvme smart-log /dev/nvmeX -o json`

**NVIDIA GPU — nvidia-smi:**
- `ecc.errors.uncorrected.aggregate.total` — lifetime uncorrected ECC errors
- `ecc.errors.corrected.aggregate.total` — lifetime corrected ECC errors
- `clocks_throttle_reasons.hw_slowdown` — thermal/power throttle state
- Stored in PCIe device `telemetry`

**NIC SFP/QSFP transceivers — ethtool:**
- `ethtool -m <iface>` — DOM (Digital Optical Monitoring) if supported
- Extracts: TX power, RX power, temperature, voltage, bias current
- Also: `ethtool -i <iface>` → firmware version
- `ip -s link show <iface>` → tx_packets, rx_packets, tx_errors, rx_errors (uptime proxy)
- Stored in PCIe device `telemetry`: `sfp_temperature_c`, `sfp_tx_power_dbm`, `sfp_rx_power_dbm`

**PSU — ipmitool sdr:**
- Input/output power readings over time are not stored by the BMC (point-in-time only)
- `ipmitool fru` may include a manufacture date for age estimation
- Stored: `input_power_w`, `output_power_w`, `input_voltage` (already in the PSU schema)

**All wear telemetry placement rules:**
- Numeric wear indicators go into the `telemetry` map (machine-readable, importable by core)
- Boolean flags (throttle_active, ecc_errors_present) go into the `attributes` map
- Never flatten into named top-level fields not in the schema — use the maps

Tests: table tests for each SMART parser (JSON fixtures from smartctl/nvme smart-log)

### 1.9 — Mellanox/NVIDIA NIC enrichment

Source: `mstflint -d <bdf> q` — if mstflint is present and the device is Mellanox (vendor_id 0x15b3)
Fallback: `ethtool -i <iface>` — firmware-version field

Enriches PCIe/NIC records:
- `firmware` — from mstflint `FW Version` or ethtool `firmware-version`
- `serial_number` — from mstflint `Board Serial Number` if available

Detection: by PCI vendor_id (0x15b3 = Mellanox/NVIDIA Networking) from the PCIe collector

Tests: table tests with mstflint output fixtures

### 1.10 — RAID controller enrichment

Source: tool selected by PCI vendor_id:

| PCI vendor_id | Tool | Controller |
|---|---|---|
| 0x1000 | `storcli64 /c<n> show all J` | Broadcom MegaRAID |
| 0x1000 (SAS) | `sas2ircu display` / `sas3ircu display` | LSI SAS 2.x/3.x |
| 0x9005 | `arcconf getconfig <ctrl>` | Adaptec |
| 0x103c | `ssacli ctrl slot=<n> pd all show detail` | HPE Smart Array |

Collects physical drives behind the controller (not visible to the OS as block devices):
- serial_number, model, manufacturer, firmware
- size_gb, interface (SAS/SATA), slot/bay
- status (Online/Failed/Rebuilding → OK/CRITICAL/WARNING)

No hardcoded vendor names in the detection logic — a pure PCI vendor_id map.

Tests: table tests with storcli/sas2ircu text fixtures

### 1.11 — Output and export workflow

`--output stdout` (default): pretty-printed JSON to stdout
`--output file:<path>`: write JSON to an explicit path
Live ISO default service output: `/var/log/bee-audit.json`

Removable-media export is manual via `bee tui` (or the LiveCD wrapper `bee-tui`):
- the operator chooses a removable filesystem explicitly
- the TUI mounts it if needed
- the TUI asks for confirmation before copying the JSON
- the TUI unmounts temporary mountpoints after export

No auto-write to arbitrary removable media is allowed.
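The `--output` modes described in 1.11 can be sketched as a small resolver; `resolveOutput` and its signature are illustrative, not the binary's actual code. Unknown modes are rejected outright, mirroring the rule that nothing is auto-written to arbitrary media:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveOutput interprets the --output flag from 1.11:
//   - "" or "stdout" (the default) → pretty-printed JSON on standard output
//   - "file:<path>"               → JSON written to the explicit path
// Anything else is an error rather than a guess, so the caller can never
// write to an unexpected destination. This helper is a sketch, not the
// audit binary's real flag handling.
func resolveOutput(flag string) (path string, toStdout bool, err error) {
	switch {
	case flag == "" || flag == "stdout":
		return "", true, nil
	case strings.HasPrefix(flag, "file:"):
		p := strings.TrimPrefix(flag, "file:")
		if p == "" {
			return "", false, fmt.Errorf("--output file: requires a path")
		}
		return p, false, nil
	default:
		return "", false, fmt.Errorf("unknown --output mode %q", flag)
	}
}

func main() {
	p, _, _ := resolveOutput("file:/var/log/bee-audit.json")
	fmt.Println(p) // /var/log/bee-audit.json
}
```

Keeping removable-media export out of this resolver entirely (it lives only behind the interactive TUI flow) is what enforces the "no auto-write" policy at the type level.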
### 1.12 — Integration test (local)

`scripts/test-local.sh` — runs `bee audit` on a developer machine (Linux), captures the JSON, and validates that required fields are present (board.serial_number non-empty, cpus non-empty, etc.)

Not a unit test — requires real hardware access. Documents how to run it for verification.

---

## Phase 2 — Debian Live ISO

ISO image bootable via BMC virtual media or USB. Runs boot services automatically and writes the audit result to `/var/log/bee-audit.json`.

### 2.1 — Builder environment

`iso/builder/build-in-container.sh` is the only supported builder entrypoint. It builds a Debian 12 builder image with `live-build`, toolchains, and pinned kernel headers, then runs the ISO assembly in a privileged container because `live-build` needs mount/chroot/loop capabilities.

`iso/builder/build.sh` orchestrates the full ISO build:
1. compile the Go `bee` binary
2. create a staged overlay under `dist/overlay-stage`
3. inject SSH auth, vendor tools, NVIDIA artifacts, and build metadata into the staged overlay
4. create a disposable `live-build` workdir under `dist/live-build-work`
5. sync the staged overlay into `config/includes.chroot/`
6. run `lb config && lb build`
7. copy the final ISO into `dist/`

### 2.2 — NVIDIA driver build

`iso/builder/build-nvidia-module.sh`:
- downloads the pinned NVIDIA `.run` installer
- verifies its SHA256
- builds kernel modules against the pinned Debian kernel ABI
- caches modules, userspace tools, and libs in `dist/nvidia-<version>-<abi>/`

`iso/overlay/usr/local/bin/bee-nvidia-load`:
- loads `nvidia`, `nvidia-modeset`, `nvidia-uvm` via `insmod`
- creates `/dev/nvidia*` nodes if the driver registered successfully
- logs failures but does not block the rest of boot

### 2.3 — ISO assembly and overlay policy

`iso/overlay/` is source-only input for the build.
Build-time files are injected into the staged overlay only:
- `bee`
- `bee-smoketest`
- `authorized_keys`
- password-fallback marker
- `/etc/bee-release`
- vendor tools from `iso/vendor/`

The source tree must stay clean after a build.

### 2.4 — Boot services

`systemd` service order:
- `bee-sshsetup.service` → configures SSH auth before `ssh.service`
- `bee-network.service` → starts best-effort DHCP on all physical interfaces
- `bee-nvidia.service` → loads NVIDIA modules if present
- `bee-audit.service` → runs the audit and logs failures without turning partial collector bugs into a boot blocker

### 2.4b — Runtime split

Target split:
- the main Go application works on a normal Linux host and on the live ISO
- live-ISO specifics stay in integration glue under `iso/`
- the live ISO passes `--runtime livecd` to the Go binary
- local runs default to `--runtime auto`, which resolves to `local` unless a live marker is detected

Planned code shape:
- `audit/cmd/bee/` — main CLI entrypoint
- `audit/internal/runtimeenv/` — runtime detection and mode selection
- future `audit/internal/tui/` — host/live shared TUI logic
- `iso/overlay/` — boot-time livecd integration only

### 2.5 — Operator workflows

- Automatic boot audit writes JSON to `/var/log/bee-audit.json`
- `tty1` autologins into `bee` and auto-runs `menu`
- `menu` launches the LiveCD wrapper `bee-tui`, which escalates to `root` via `sudo -n`
- `bee tui` can rerun the audit manually
- `bee tui` can export the latest audit JSON to removable media
- `bee tui` can show a health summary and run NVIDIA/memory/storage acceptance tests
- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-stress`
- SAT summaries now expose `overall_status` plus per-job `OK/FAILED/UNSUPPORTED`
- Memory/GPU SAT runtime defaults can be overridden via `BEE_MEMTESTER_*` and `BEE_GPU_STRESS_*`
- removable export requires explicit target selection, mount, confirmation, copy, and cleanup

### 2.6 — Vendor utilities and optional assets
Optional binaries live in `iso/vendor/` and are included when present:
- `storcli64`
- `sas2ircu`, `sas3ircu`
- `arcconf`
- `ssacli`
- `mstflint` (via Debian package set)

Missing optional tools do not fail the build or boot.

### 2.7 — Release workflow

`iso/builder/VERSIONS` pins the current release inputs:
- audit version
- Debian version / kernel ABI
- Go version
- NVIDIA driver version

Current release model:
- shipping a new ISO means a full rebuild
- build metadata is embedded into `/etc/bee-release` and `motd`
- the current ISO label is `BEE`
- binary self-update remains deferred; no automatic USB/network patching is part of the current runtime

---

## Eating order

The builder environment is set up early (after 1.3) so every subsequent collector is developed and tested directly on real hardware in the actual Debian live ISO environment. No "works on my Mac" drift.

```
1.0  keys repo setup          → git.mchus.pro/mchus/keys, keygen.sh, developer pubkeys
1.1  scaffold + schema types  → binary runs, outputs empty JSON
1.2  board collector          → first real data
1.3  CPU collector            → +CPUs

--- BUILDER + BEE ISO (unblock real-hardware testing) ---
2.1  builder setup            → privileged container with build deps
2.2  debug ISO profile        → minimal Debian ISO: `bee` binary + OpenSSH + all packages
2.3  boot on real server      → SSH in, verify packages present, run audit manually

--- CONTINUE COLLECTORS (tested on real hardware from here) ---
1.4  memory collector         → +DIMMs
1.5  storage collector        → +disks (SATA/SAS/NVMe)
1.6  PCIe collector           → +all PCIe devices
1.7  PSU collector            → +power supplies
1.8  NVIDIA GPU enrichment    → +GPU serial/VBIOS
1.8b wear/age telemetry       → +SMART hours, NVMe % used, SFP DOM, ECC
1.9  Mellanox NIC enrichment  → +NIC firmware/serial
1.10 RAID enrichment          → +physical disks behind RAID
1.11 output + export workflow → file output + explicit removable export

--- PRODUCTION ISO ---
2.4  NVIDIA driver build      → driver compiled into overlay
2.5  network bring-up on boot → DHCP on all interfaces
2.6  systemd boot service     → audit runs on boot automatically
2.7  vendor utilities         → storcli/sas2ircu/arcconf/ssacli in image
2.8  release workflow         → versioning + release notes
2.9  operator export flow     → explicit TUI export to removable media
```