BEE — Build Plan
Hardware audit LiveCD for offline server inventory.
Produces HardwareIngestRequest JSON compatible with core/reanimator.
Principle: OS-level collection — reads hardware directly, not through BMC. Automatic boot audit plus operator console. Boot runs audit immediately, but local/SSH operators can rerun checks through the TUI and CLI. Errors are logged and should not block boot on partial collector failures. Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.
Status snapshot (2026-03-14)
Phase 1 — Go Audit Binary
- 1.1 Project scaffold — DONE
- 1.2 Board collector — DONE
- 1.3 CPU collector — DONE
- 1.4 Memory collector — DONE
- 1.5 Storage collector — DONE
- 1.6 PCIe collector — DONE (with noise filtering for system/chipset devices)
- 1.7 PSU collector — DONE (basic FRU path)
- 1.8 NVIDIA GPU enrichment — DONE
- 1.8b Component wear / age telemetry — DONE (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
- 1.8c Storage health verdicts — DONE (SMART/NVMe warning/failed status derivation)
- 1.9 Mellanox/NVIDIA NIC enrichment — DONE (mstflint + ethtool firmware fallback)
- 1.10 RAID controller enrichment — DONE (initial multi-tool support) (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
- PSU SDR health — DONE (`ipmitool sdr` merged with FRU inventory)
- 1.11 Output and export workflow — DONE (explicit file output + manual removable export via TUI)
- 1.12 Integration test (local) — DONE (`scripts/test-local.sh`)
Phase 2 — Debian Live ISO
- Current implementation uses Debian 12, `live-build`, `systemd`, and OpenSSH.
- Network bring-up on boot — DONE
- Boot services (`bee-network`, `bee-nvidia`, `bee-audit`, `bee-sshsetup`) — DONE
- Local console UX (`bee` autologin on `tty1`, `menu` auto-start, TUI privilege escalation via `sudo -n`) — DONE
- VM/debug support (`qemu-guest-agent`, serial console, virtual GPU initramfs modules) — DONE
- Vendor utilities in overlay — DONE
- Build metadata + staged overlay injection — DONE
- Builder container cache persisted outside container writable layer — DONE
- ISO volume label `BEE` — DONE
- Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior.
- Real-hardware validation remains PENDING; current validation is limited to local/libvirt VM boot + service checks.
Phase 1 — Go Audit Binary
Self-contained static binary. Runs on any Linux (including the Debian live ISO).
Calls system utilities, parses their output, produces HardwareIngestRequest JSON.
1.1 — Project scaffold
- `audit/go.mod` — module `bee/audit`
- `audit/cmd/bee/main.go` — main CLI entry point: subcommands, runtime selection, JSON output
- `audit/internal/schema/` — copy of `HardwareIngestRequest` types from core (no import dependency)
- `audit/internal/collector/` — empty package stubs for all collectors
- `const Version = "1.0"` in main
- Output modes: stdout (default), file path flag `--output /path/to/file.json`
- Tests: none yet (stubs only)
1.2 — Board collector
Source: `dmidecode -t 0` (BIOS), `-t 1` (System), `-t 2` (Baseboard)
Collects:
- `board.serial_number` — from System Information
- `board.manufacturer`, `board.product_name` — from System Information
- `board.part_number` — from Baseboard
- `board.uuid` — from System Information
- `firmware[BIOS]` — vendor, version, release date from BIOS Information
Tests: table tests with testdata/dmidecode_*.txt fixtures
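The table-driven parsing above can be sketched as a small key-value scan over dmidecode's block format. This is a minimal illustration, not the shipped collector: the `Board` struct and function name are hypothetical, and real dmidecode output varies by vendor.

```go
package main

import (
	"fmt"
	"strings"
)

// Board mirrors the fields listed above; names are illustrative,
// not the actual schema types.
type Board struct {
	SerialNumber, Manufacturer, ProductName, UUID string
}

// parseDmidecodeSystem extracts "Key: Value" pairs from the
// "System Information" block of `dmidecode -t 1` output.
func parseDmidecodeSystem(out string) Board {
	var b Board
	inBlock := false
	for _, line := range strings.Split(out, "\n") {
		trimmed := strings.TrimSpace(line)
		if trimmed == "System Information" {
			inBlock = true
			continue
		}
		if !inBlock {
			continue
		}
		key, val, ok := strings.Cut(trimmed, ": ")
		if !ok {
			continue
		}
		switch key {
		case "Serial Number":
			b.SerialNumber = val
		case "Manufacturer":
			b.Manufacturer = val
		case "Product Name":
			b.ProductName = val
		case "UUID":
			b.UUID = val
		}
	}
	return b
}

// fixture is an invented sample in dmidecode's block layout.
const fixture = `# dmidecode 3.4
Handle 0x0001, DMI type 1, 27 bytes
System Information
	Manufacturer: Supermicro
	Product Name: SYS-1029U
	Serial Number: S424242X
	UUID: 00000000-0000-0000-0000-3cecef000000
`

func main() {
	fmt.Printf("%+v\n", parseDmidecodeSystem(fixture))
}
```

The same scan generalizes to the BIOS and Baseboard blocks by switching on the block header line.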
1.3 — CPU collector
Source: `dmidecode -t 4`
Collects:
- socket index, model, manufacturer, status
- cores, threads, current/max frequency
- firmware: microcode version from `/sys/devices/system/cpu/cpu0/microcode/version`
- serial: not available on Intel Xeon → fallback `<board_serial>-CPU-<socket>` (matches core logic)
Tests: table tests with dmidecode fixtures
1.4 — Memory collector
Source: `dmidecode -t 17`
Collects:
- slot, location, present flag
- size_mb, type (DDR4/DDR5), max_speed_mhz, current_speed_mhz
- manufacturer, serial_number, part_number
- status from "Data Width" / "No Module Installed" detection
Tests: table tests with dmidecode fixtures (populated + empty slots)
1.5 — Storage collector
Sources:
- `lsblk -J -o NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,MOUNTPOINT` — device enumeration
- `smartctl -j -i /dev/X` — serial, model, firmware, interface per device
- `nvme id-ctrl /dev/nvmeX -o json` — NVMe: serial (sn), firmware (fr), model (mn), size
Collects per device:
- type: SSD/HDD/NVMe
- model, serial_number, manufacturer, firmware
- size_gb, interface (SATA/SAS/NVMe)
- slot: from lsblk HCTL where available
Tests: table tests with smartctl -j JSON fixtures and nvme id-ctrl JSON fixtures
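The enumeration step might be sketched as below, assuming the JSON shape produced by `lsblk -J` (a `blockdevices` array with lowercase keys); the type names are illustrative and real output varies by lsblk version:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lsblkDevice maps the subset of `lsblk -J` columns the collector
// requests; struct tags follow lsblk's JSON key names.
type lsblkDevice struct {
	Name   string `json:"name"`
	Type   string `json:"type"`
	Size   string `json:"size"`
	Serial string `json:"serial"`
	Model  string `json:"model"`
	Tran   string `json:"tran"`
}

type lsblkOutput struct {
	BlockDevices []lsblkDevice `json:"blockdevices"`
}

// parseLsblk decodes lsblk JSON and keeps only whole disks
// (partitions are skipped; they carry no hardware identity).
func parseLsblk(raw []byte) ([]lsblkDevice, error) {
	var out lsblkOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	disks := out.BlockDevices[:0]
	for _, d := range out.BlockDevices {
		if d.Type == "disk" {
			disks = append(disks, d)
		}
	}
	return disks, nil
}

// fixture is an invented two-entry sample (disk + partition).
const fixture = `{"blockdevices": [
  {"name":"sda","type":"disk","size":"1.8T","serial":"WDZ1X","model":"WDC WD2000","tran":"sata"},
  {"name":"sda1","type":"part","size":"512M","serial":null,"model":null,"tran":null}
]}`

func main() {
	disks, err := parseLsblk([]byte(fixture))
	if err != nil {
		panic(err)
	}
	for _, d := range disks {
		fmt.Println(d.Name, d.Tran, d.Serial)
	}
}
```

Per-device enrichment (`smartctl -j`, `nvme id-ctrl`) would then run over the surviving disk entries.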
1.6 — PCIe collector
Sources:
- `lspci -vmm -D` — slot, vendor, device, class
- `lspci -vvv -D` — link width/speed (LnkSta, LnkCap)
- embedded pci.ids (same submodule as logpile: `third_party/pciids`) — model name lookup
- `/sys/bus/pci/devices/<bdf>/` — actual negotiated link state from kernel
Collects per device:
- bdf, vendor_id, device_id, device_class
- manufacturer, model (via pciids lookup if empty)
- link_width, link_speed, max_link_width, max_link_speed
- serial_number: device-specific (see per-type enrichment in 1.8, 1.9)
Dedup: by serial → bdf (mirrors logpile canonical device repository logic)
Tests: table tests with lspci fixtures
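Reading the negotiated link state from sysfs can look like the sketch below. The attribute file names (`current_link_speed`, `current_link_width`) are standard kernel PCI sysfs attributes; the function name and the tolerate-missing-files policy are illustrative.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// pcieLink reads the negotiated link state for one device from sysfs.
// Missing attributes (virtual functions, old kernels) yield empty
// strings rather than errors, so a partial read never aborts the audit.
func pcieLink(bdf string) (speed, width string) {
	base := filepath.Join("/sys/bus/pci/devices", bdf)
	read := func(attr string) string {
		b, err := os.ReadFile(filepath.Join(base, attr))
		if err != nil {
			return ""
		}
		return strings.TrimSpace(string(b))
	}
	return read("current_link_speed"), read("current_link_width")
}

func main() {
	entries, err := os.ReadDir("/sys/bus/pci/devices")
	if err != nil {
		fmt.Println("no sysfs PCI tree (not Linux?):", err)
		return
	}
	for _, e := range entries {
		speed, width := pcieLink(e.Name())
		fmt.Printf("%s speed=%q width=%q\n", e.Name(), speed, width)
	}
}
```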
1.7 — PSU collector
Source: `ipmitool fru` — primary (only source for PSU data from OS)
Fallback: `dmidecode -t 39` (System Power Supply, limited availability)
Collects:
- slot, present, model, vendor, serial_number, part_number
- wattage_w from FRU or dmidecode
- firmware from ipmitool FRU
Extra fields — `input_power_w`, `output_power_w`, `input_voltage` from `ipmitool sdr`
Tests: table tests with ipmitool fru text fixtures
1.8 — NVIDIA GPU enrichment
Prerequisite: NVIDIA driver loaded (checked via `nvidia-smi -L` exit code)
Sources:
- `nvidia-smi --query-gpu=index,name,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total --format=csv,noheader,nounits`
- BDF correlation: `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader` → match to PCIe collector records
Enriches PCIe records for NVIDIA devices:
- `serial_number` — real GPU serial from nvidia-smi
- `firmware` — VBIOS version
- `status` — derived from ECC uncorrected errors (0 = OK, >0 = WARNING)
- telemetry: `temperature_c`, `power_w` added to PCIe record attributes
Fallback (no driver): PCIe record stays as-is with serial fallback `<board_serial>-PCIE-<slot>`
Tests: table tests with nvidia-smi CSV fixtures
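The CSV parsing and status derivation above might be sketched like this; the row struct and fixture values are invented for illustration, and only the query fields listed above are assumed:

```go
package main

import (
	"fmt"
	"strings"
)

// gpuRow models one line of the nvidia-smi query above
// (--format=csv,noheader,nounits); field names are illustrative.
type gpuRow struct {
	Index, Name, Serial, VBIOS string
	TempC, PowerW, ECCUncorr   string
}

// parseGPUCSV splits rows on ", ", the separator nvidia-smi uses
// in csv,noheader output; malformed lines are skipped, not fatal.
func parseGPUCSV(out string) []gpuRow {
	var rows []gpuRow
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		f := strings.Split(line, ", ")
		if len(f) != 7 {
			continue
		}
		rows = append(rows, gpuRow{f[0], f[1], f[2], f[3], f[4], f[5], f[6]})
	}
	return rows
}

// fixture holds two invented sample rows.
const fixture = `0, NVIDIA A100-SXM4-80GB, 1560921001234, 92.00.36.00.01, 34, 67.21, 0
1, NVIDIA A100-SXM4-80GB, 1560921005678, 92.00.36.00.01, 36, 65.02, 2`

func main() {
	for _, g := range parseGPUCSV(fixture) {
		status := "OK"
		if g.ECCUncorr != "0" {
			status = "WARNING" // >0 uncorrected ECC errors, per the rule above
		}
		fmt.Println(g.Index, g.Serial, status)
	}
}
```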
1.8b — Component wear / age telemetry
Every component that stores its own usage history must have that data collected and placed in the attributes / telemetry map of the respective record. This is a cross-cutting concern applied on top of the per-collector steps.
Storage (SATA/SAS) — smartctl:
- `Power_On_Hours` (attr 9) — total hours powered on
- `Power_Cycle_Count` (attr 12)
- `Reallocated_Sector_Ct` (attr 5) — wear indicator
- `Wear_Leveling_Count` (attr 177, SSD) — NAND wear
- `Total_LBAs_Written` (attr 241) — bytes written lifetime
- `SSD_Life_Left` (attr 231) — % remaining if reported
- Collected as `telemetry` map keys: `power_on_hours`, `power_cycles`, `reallocated_sectors`, `wear_leveling_pct`, `total_lba_written`, `life_remaining_pct`
NVMe — nvme smart-log:
- `power_on_hours` — lifetime hours
- `power_cycles`
- `unsafe_shutdowns` — abnormal power loss count
- `percentage_used` — % of rated lifetime consumed (0–100)
- `data_units_written` — 512KB units written lifetime
- `controller_busy_time` — hours controller was busy
- Collected via `nvme smart-log /dev/nvmeX -o json`
NVIDIA GPU — nvidia-smi:
- `ecc.errors.uncorrected.aggregate.total` — lifetime uncorrected ECC errors
- `ecc.errors.corrected.aggregate.total` — lifetime corrected ECC errors
- `clocks_throttle_reasons.hw_slowdown` — thermal/power throttle state
- Stored in PCIe device `telemetry`
NIC SFP/QSFP transceivers — ethtool:
- `ethtool -m <iface>` — DOM (Digital Optical Monitoring) if supported
- Extracts: TX power, RX power, temperature, voltage, bias current
- Also: `ethtool -i <iface>` → firmware version
- `ip -s link show <iface>` → tx_packets, rx_packets, tx_errors, rx_errors (uptime proxy)
- Stored in PCIe device `telemetry`: `sfp_temperature_c`, `sfp_tx_power_dbm`, `sfp_rx_power_dbm`
PSU — ipmitool sdr:
- Input/output power readings over time not stored by BMC (point-in-time only)
- `ipmitool fru` may include manufacture date for age estimation
- Stored: `input_power_w`, `output_power_w`, `input_voltage` (already in PSU schema)
All wear telemetry placement rules:
- Numeric wear indicators go into the `telemetry` map (machine-readable, importable by core)
- Boolean flags (`throttle_active`, `ecc_errors_present`) go into the `attributes` map
- Never flatten into named top-level fields not in the schema — use maps
Tests: table tests for each SMART parser (JSON fixtures from smartctl/nvme smart-log)
1.9 — Mellanox/NVIDIA NIC enrichment
Source: `mstflint -d <bdf> q` — if mstflint is present and the device is Mellanox (vendor_id 0x15b3)
Fallback: `ethtool -i <iface>` — `firmware-version` field
Enriches PCIe/NIC records:
- `firmware` — from mstflint `FW Version` or ethtool `firmware-version`
- `serial_number` — from mstflint `Board Serial Number` if available
Detection: by PCI vendor_id (0x15b3 = Mellanox/NVIDIA Networking) from PCIe collector
Tests: table tests with mstflint output fixtures
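Extracting those fields is a simple key-value scan over the query output. In the sketch below only the `FW Version` and `Board Serial Number` labels come from the plan above; the other fixture lines are invented for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// parseMstflintQuery splits "Key:   Value" lines from
// `mstflint -d <bdf> q` text output into a map; values keep only
// their trimmed text, so column alignment does not matter.
func parseMstflintQuery(out string) map[string]string {
	kv := map[string]string{}
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		kv[strings.TrimSpace(key)] = strings.TrimSpace(val)
	}
	return kv
}

// fixture is an invented sample in mstflint's query layout.
const fixture = `Image type:            FS4
FW Version:            22.36.1010
Product Version:       22.36.1010
Board Serial Number:   MT2242X01234`

func main() {
	q := parseMstflintQuery(fixture)
	fmt.Println(q["FW Version"], q["Board Serial Number"])
}
```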
1.10 — RAID controller enrichment
Source: tool selected by PCI vendor_id:
| PCI vendor_id | Tool | Controller |
|---|---|---|
| 0x1000 | `storcli64 /c<n> show all J` | Broadcom MegaRAID |
| 0x1000 (SAS) | `sas2ircu <n> display` / `sas3ircu <n> display` | LSI SAS 2.x/3.x |
| 0x9005 | `arcconf getconfig <n>` | Adaptec |
| 0x103c | `ssacli ctrl slot=<n> pd all show detail` | HPE Smart Array |
Collects physical drives behind controller (not visible to OS as block devices):
- serial_number, model, manufacturer, firmware
- size_gb, interface (SAS/SATA), slot/bay
- status (Online/Failed/Rebuilding → OK/CRITICAL/WARNING)
No hardcoded vendor names in detection logic — pure PCI vendor_id map.
Tests: table tests with storcli/sas2ircu text fixtures
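The vendor_id dispatch can be a pure map/switch with no vendor-name matching, per the rule above. In this sketch controller index 0 is hardcoded for illustration; real code would enumerate controllers and try the 0x1000 candidates in order:

```go
package main

import "fmt"

// raidTool describes how to query one controller family; the
// commands mirror the table above.
type raidTool struct {
	Name string
	Args []string
}

// toolForVendor returns candidate tools for a PCI vendor_id.
// Detection is purely numeric: no vendor-name strings anywhere.
func toolForVendor(vendorID uint16) []raidTool {
	switch vendorID {
	case 0x1000: // Broadcom/LSI: MegaRAID first, then IR-mode tools
		return []raidTool{
			{"storcli64", []string{"/c0", "show", "all", "J"}},
			{"sas3ircu", []string{"0", "display"}},
			{"sas2ircu", []string{"0", "display"}},
		}
	case 0x9005: // Adaptec
		return []raidTool{{"arcconf", []string{"getconfig", "1"}}}
	case 0x103c: // HPE Smart Array
		return []raidTool{{"ssacli", []string{"ctrl", "slot=0", "pd", "all", "show", "detail"}}}
	}
	return nil // not a known RAID vendor: no enrichment
}

func main() {
	for _, v := range []uint16{0x1000, 0x9005, 0x103c, 0x8086} {
		fmt.Printf("0x%04x -> %d candidate tool(s)\n", v, len(toolForVendor(v)))
	}
}
```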
1.11 — Output and export workflow
- `--output stdout` (default): pretty-printed JSON to stdout
- `--output file:<path>`: write JSON to explicit path
- Live ISO default service output: `/var/log/bee-audit.json`
Removable-media export is manual via bee tui (or the LiveCD wrapper bee-tui):
- operator chooses a removable filesystem explicitly
- TUI mounts it if needed
- TUI asks for confirmation before copying the JSON
- TUI unmounts temporary mountpoints after export
No auto-write to arbitrary removable media is allowed.
1.12 — Integration test (local)
`scripts/test-local.sh` — runs `bee audit` on a developer machine (Linux), captures the JSON, and validates that required fields are present (`board.serial_number` non-empty, `cpus` non-empty, etc.)
Not a unit test — requires real hardware access. Documents how to run for verification.
Phase 2 — Debian Live ISO
ISO image bootable via BMC virtual media or USB. Runs boot services automatically and writes the audit result to `/var/log/bee-audit.json`.
2.1 — Builder environment
`iso/builder/build-in-container.sh` is the only supported builder entrypoint.
It builds a Debian 12 builder image with live-build, toolchains, and pinned kernel headers,
then runs the ISO assembly in a privileged container because live-build needs
mount/chroot/loop capabilities.
`iso/builder/build.sh` orchestrates the full ISO build:
- compile the Go `bee` binary
- create a staged overlay under `dist/overlay-stage`
- inject SSH auth, vendor tools, NVIDIA artifacts, and build metadata into the staged overlay
- create a disposable `live-build` workdir under `dist/live-build-work`
- sync the staged overlay into `config/includes.chroot/`
- run `lb config && lb build`
- copy the final ISO into `dist/`
2.2 — NVIDIA driver build
`iso/builder/build-nvidia-module.sh`:
- downloads the pinned NVIDIA `.run` installer
- verifies SHA256
- builds kernel modules against the pinned Debian kernel ABI
- caches modules, userspace tools, and libs in `dist/nvidia-<version>-<kver>/`
`iso/overlay/usr/local/bin/bee-nvidia-load`:
- loads `nvidia`, `nvidia-modeset`, `nvidia-uvm` via `insmod`
- creates `/dev/nvidia*` nodes if the driver registered successfully
- logs failures but does not block the rest of boot
2.3 — ISO assembly and overlay policy
`iso/overlay/` is source-only input for the build.
Build-time files are injected into the staged overlay only:
- `bee`
- `bee-smoketest`
- `authorized_keys`
- password-fallback marker
- `/etc/bee-release`
- vendor tools from `iso/vendor/`
The source tree must stay clean after a build.
2.4 — Boot services
systemd service order:
- `bee-sshsetup.service` → configures SSH auth before `ssh.service`
- `bee-network.service` → starts best-effort DHCP on all physical interfaces
- `bee-nvidia.service` → loads NVIDIA modules if present
- `bee-audit.service` → runs audit and logs failures without turning partial collector bugs into a boot blocker
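For orientation, a unit like `bee-audit.service` might look roughly like the sketch below. The directive choices (ordering, `SuccessExitStatus`, the exact `ExecStart` flags) are assumptions consistent with the ordering and failure-tolerance described above, not the shipped unit file:

```ini
# Hypothetical sketch of bee-audit.service; illustrative only.
[Unit]
Description=BEE hardware audit on boot
After=bee-network.service bee-nvidia.service
Wants=bee-network.service bee-nvidia.service

[Service]
Type=oneshot
# Write the result to the documented default path.
ExecStart=/usr/local/bin/bee audit --runtime livecd --output file:/var/log/bee-audit.json
# Treat partial collector failures as logged, not as boot blockers.
SuccessExitStatus=0 1
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```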
2.4b — Runtime split
Target split:
- main Go application works on a normal Linux host and on the live ISO
- live-ISO specifics stay in integration glue under `iso/`
- the live ISO passes `--runtime livecd` to the Go binary
- local runs default to `--runtime auto`, which resolves to `local` unless a live marker is detected
Planned code shape:
- `audit/cmd/bee/` — main CLI entrypoint
- `audit/internal/runtimeenv/` — runtime detection and mode selection
- future `audit/internal/tui/` — host/live shared TUI logic
- `iso/overlay/` — boot-time livecd integration only
2.5 — Operator workflows
- Automatic boot audit writes JSON to `/var/log/bee-audit.json`
- `tty1` autologins into `bee` and auto-runs `menu`
- `menu` launches the LiveCD wrapper `bee-tui`, which escalates to `root` via `sudo -n`
- `bee tui` can rerun the audit manually
- `bee tui` can export the latest audit JSON to removable media
- `bee tui` can show health summary and run NVIDIA/memory/storage acceptance tests
- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-stress`
- SAT summaries now expose `overall_status` plus per-job `OK`/`FAILED`/`UNSUPPORTED`
- Memory/GPU SAT runtime defaults can be overridden via `BEE_MEMTESTER_*` and `BEE_GPU_STRESS_*`
- Removable export requires explicit target selection, mount, confirmation, copy, and cleanup
2.6 — Vendor utilities and optional assets
Optional binaries live in `iso/vendor/` and are included when present:
- `storcli64`
- `sas2ircu`, `sas3ircu`
- `arcconf`
- `ssacli`
- `mstflint` (via Debian package set)
Missing optional tools do not fail the build or boot.
2.7 — Release workflow
`iso/builder/VERSIONS` pins the current release inputs:
- audit version
- Debian version / kernel ABI
- Go version
- NVIDIA driver version
Current release model:
- shipping a new ISO means a full rebuild
- build metadata is embedded into `/etc/bee-release` and `motd`
- current ISO label is `BEE`
- binary self-update remains deferred; no automatic USB/network patching is part of the current runtime
Eating order
Builder environment is set up early (after 1.3) so every subsequent collector is developed and tested directly on real hardware in the actual Debian live ISO environment. No "works on my Mac" drift.
1.0 keys repo setup → git.mchus.pro/mchus/keys, keygen.sh, developer pubkeys
1.1 scaffold + schema types → binary runs, outputs empty JSON
1.2 board collector → first real data
1.3 CPU collector → +CPUs
--- BUILDER + BEE ISO (unblock real-hardware testing) ---
2.1 builder setup → privileged container with build deps
2.2 debug ISO profile → minimal Debian ISO: `bee` binary + OpenSSH + all packages
2.3 boot on real server → SSH in, verify packages present, run audit manually
--- CONTINUE COLLECTORS (tested on real hardware from here) ---
1.4 memory collector → +DIMMs
1.5 storage collector → +disks (SATA/SAS/NVMe)
1.6 PCIe collector → +all PCIe devices
1.7 PSU collector → +power supplies
1.8 NVIDIA GPU enrichment → +GPU serial/VBIOS
1.8b wear/age telemetry → +SMART hours, NVMe % used, SFP DOM, ECC
1.9 Mellanox NIC enrichment → +NIC firmware/serial
1.10 RAID enrichment → +physical disks behind RAID
1.11 output + export workflow → file output + explicit removable export
--- PRODUCTION ISO ---
2.4 NVIDIA driver build → driver compiled into overlay
2.5 network bring-up on boot → DHCP on all interfaces
2.6 systemd boot service → audit runs on boot automatically
2.7 vendor utilities → storcli/sas2ircu/arcconf/ssacli in image
2.8 release workflow → versioning + release notes
2.9 operator export flow → explicit TUI export to removable media