23 KiB
BEE — Build Plan
Hardware audit LiveCD for offline server inventory.
Produces HardwareIngestRequest JSON compatible with core/reanimator.
Principle: OS-level collection — reads hardware directly, not through BMC. Fully unattended — no user interaction required at any stage. Boot → update → audit → output → done. All errors are logged, never presented interactively. Every failure path has a silent fallback. Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.
Status snapshot (2026-03-06)
Phase 1 — Go Audit Binary
- 1.1 Project scaffold — DONE
- 1.2 Board collector — DONE
- 1.3 CPU collector — DONE
- 1.4 Memory collector — DONE
- 1.5 Storage collector — DONE
- 1.6 PCIe collector — DONE (with noise filtering for system/chipset devices)
- 1.7 PSU collector — DONE (basic FRU path)
- 1.8 NVIDIA GPU enrichment — DONE
- 1.8b Component wear / age telemetry — DONE (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
- 1.9 Mellanox/NVIDIA NIC enrichment — DONE (mstflint + ethtool firmware fallback)
- 1.10 RAID controller enrichment — DONE (initial multi-tool support) (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
- 1.11 Output and USB write — DONE (usb + /tmp fallback)
- 1.12 Integration test (local) — DONE (
scripts/test-local.sh)
Phase 2 — Alpine LiveCD
- Debug ISO track is active (builder + overlay-debug + OpenRC services + TUI workflow).
- Production ISO track — IN PROGRESS.
- 2.3 Alpine mkimage profile — DONE (production profile scaffold)
- 2.4 Network bring-up on boot — DONE
- 2.5 OpenRC boot service (bee-audit) — DONE (with explicit bee-nvidia ordering)
- 2.6 Vendor utilities in overlay — DONE (fetch script + iso/vendor scaffold)
- 2.7 Auto-update wiring (USB first, network second) — PARTIAL (shell flow done; strict Ed25519 verification intentionally deferred to final stage)
- 2.8 Release workflow — PARTIAL (production build now injects audit binary, NVIDIA modules/tools, vendor tools, and build metadata)
Phase 1 — Go Audit Binary
Self-contained static binary. Runs on any Linux (including Alpine LiveCD).
Calls system utilities, parses their output, produces HardwareIngestRequest JSON.
1.1 — Project scaffold
audit/go.mod— modulebee/auditaudit/cmd/audit/main.go— CLI entry point: flags, orchestration, JSON outputaudit/internal/schema/— copy ofHardwareIngestRequesttypes from core (no import dependency)audit/internal/collector/— empty package stubs for all collectorsconst Version = "1.0"in main- Output modes: stdout (default), file path flag
--output /path/to/file.json - Tests: none yet (stubs only)
1.2 — Board collector
Source: dmidecode -t 0 (BIOS), -t 1 (System), -t 2 (Baseboard)
Collects:
board.serial_number— from System Informationboard.manufacturer,board.product_name— from System Informationboard.part_number— from Baseboardboard.uuid— from System Informationfirmware[BIOS]— vendor, version, release date from BIOS Information
Tests: table tests with testdata/dmidecode_*.txt fixtures
1.3 — CPU collector
Source: dmidecode -t 4
Collects:
- socket index, model, manufacturer, status
- cores, threads, current/max frequency
- firmware: microcode version from
/sys/devices/system/cpu/cpu0/microcode/version - serial: not available on Intel Xeon → fallback
<board_serial>-CPU-<socket>(matches core logic)
Tests: table tests with dmidecode fixtures
1.4 — Memory collector
Source: dmidecode -t 17
Collects:
- slot, location, present flag
- size_mb, type (DDR4/DDR5), max_speed_mhz, current_speed_mhz
- manufacturer, serial_number, part_number
- status from "Data Width" / "No Module Installed" detection
Tests: table tests with dmidecode fixtures (populated + empty slots)
1.5 — Storage collector
Sources:
lsblk -J -o NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,MOUNTPOINT— device enumerationsmartctl -j -i /dev/X— serial, model, firmware, interface per devicenvme id-ctrl /dev/nvmeX -o json— NVMe: serial (sn), firmware (fr), model (mn), size
Collects per device:
- type: SSD/HDD/NVMe
- model, serial_number, manufacturer, firmware
- size_gb, interface (SATA/SAS/NVMe)
- slot: from lsblk HCTL where available
Tests: table tests with smartctl -j JSON fixtures and nvme id-ctrl JSON fixtures
1.6 — PCIe collector
Sources:
lspci -vmm -D— slot, vendor, device, classlspci -vvv -D— link width/speed (LnkSta, LnkCap)- embedded pci.ids (same submodule as logpile:
third_party/pciids) — model name lookup /sys/bus/pci/devices/<bdf>/— actual negotiated link state from kernel
Collects per device:
- bdf, vendor_id, device_id, device_class
- manufacturer, model (via pciids lookup if empty)
- link_width, link_speed, max_link_width, max_link_speed
- serial_number: device-specific (see per-type enrichment in 1.8, 1.9)
Dedup: by serial → bdf (mirrors logpile canonical device repository logic)
Tests: table tests with lspci fixtures
1.7 — PSU collector
Source: ipmitool fru — primary (only source for PSU data from OS)
Fallback: dmidecode -t 39 (System Power Supply, limited availability)
Collects:
- slot, present, model, vendor, serial_number, part_number
- wattage_w from FRU or dmidecode
- firmware from ipmitool FRU
Board Extrafields - input_power_w, output_power_w, input_voltage from
ipmitool sdr
Tests: table tests with ipmitool fru text fixtures
1.8 — NVIDIA GPU enrichment
Prerequisite: NVIDIA driver loaded (checked via nvidia-smi -L exit code)
Sources:
nvidia-smi --query-gpu=index,name,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total --format=csv,noheader,nounits- BDF correlation:
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader→ match to PCIe collector records
Enriches PCIe records for NVIDIA devices:
serial_number— real GPU serial from nvidia-smifirmware— VBIOS versionstatus— derived from ECC uncorrected errors (0 = OK, >0 = WARNING)- telemetry: temperature_c, power_w added to PCIe record attributes
Fallback (no driver): PCIe record stays as-is with serial fallback <board_serial>-PCIE-<slot>
Tests: table tests with nvidia-smi CSV fixtures
1.8b — Component wear / age telemetry
Every component that stores its own usage history must have that data collected and placed in the attributes / telemetry map of the respective record. This is a cross-cutting concern applied on top of the per-collector steps.
Storage (SATA/SAS) — smartctl:
Power_On_Hours(attr 9) — total hours powered onPower_Cycle_Count(attr 12)Reallocated_Sector_Ct(attr 5) — wear indicatorWear_Leveling_Count(attr 177, SSD) — NAND wearTotal_LBAs_Written(attr 241) — bytes written lifetimeSSD_Life_Left(attr 231) — % remaining if reported- Collected as
telemetrymap keys:power_on_hours,power_cycles,reallocated_sectors,wear_leveling_pct,total_lba_written,life_remaining_pct
NVMe — nvme smart-log:
power_on_hours— lifetime hourspower_cyclesunsafe_shutdowns— abnormal power loss countpercentage_used— % of rated lifetime consumed (0–100)data_units_written— 512KB units written lifetimecontroller_busy_time— hours controller was busy- Collected via
nvme smart-log /dev/nvmeX -o json
NVIDIA GPU — nvidia-smi:
ecc.errors.uncorrected.aggregate.total— lifetime uncorrected ECC errorsecc.errors.corrected.aggregate.total— lifetime corrected ECC errorsclocks_throttle_reasons.hw_slowdown— thermal/power throttle state- Stored in PCIe device
telemetry
NIC SFP/QSFP transceivers — ethtool:
ethtool -m <iface>— DOM (Digital Optical Monitoring) if supported- Extracts: TX power, RX power, temperature, voltage, bias current
- Also:
ethtool -i <iface>→ firmware version ip -s link show <iface>→ tx_packets, rx_packets, tx_errors, rx_errors (uptime proxy)- Stored in PCIe device
telemetry:sfp_temperature_c,sfp_tx_power_dbm,sfp_rx_power_dbm
PSU — ipmitool sdr:
- Input/output power readings over time not stored by BMC (point-in-time only)
ipmitool frumay include manufacture date for age estimation- Stored:
input_power_w,output_power_w,input_voltage(already in PSU schema)
All wear telemetry placement rules:
- Numeric wear indicators go into
telemetrymap (machine-readable, importable by core) - Boolean flags (throttle_active, ecc_errors_present) go into
attributesmap - Never flatten into named top-level fields not in the schema — use maps
Tests: table tests for each SMART parser (JSON fixtures from smartctl/nvme smart-log)
1.9 — Mellanox/NVIDIA NIC enrichment
Source: mstflint -d <bdf> q — if mstflint present and device is Mellanox (vendor_id 0x15b3)
Fallback: ethtool -i <iface> — firmware-version field
Enriches PCIe/NIC records:
firmware— from mstflintFW Versionor ethtoolfirmware-versionserial_number— from mstflintBoard Serial Numberif available
Detection: by PCI vendor_id (0x15b3 = Mellanox/NVIDIA Networking) from PCIe collector
Tests: table tests with mstflint output fixtures
1.10 — RAID controller enrichment
Source: tool selected by PCI vendor_id:
| PCI vendor_id | Tool | Controller |
|---|---|---|
| 0x1000 | storcli64 /c<n> show all J |
Broadcom MegaRAID |
| 0x1000 (SAS) | sas2ircu <n> display / sas3ircu <n> display |
LSI SAS 2.x/3.x |
| 0x9005 | arcconf getconfig <n> |
Adaptec |
| 0x103c | ssacli ctrl slot=<n> pd all show detail |
HPE Smart Array |
Collects physical drives behind controller (not visible to OS as block devices):
- serial_number, model, manufacturer, firmware
- size_gb, interface (SAS/SATA), slot/bay
- status (Online/Failed/Rebuilding → OK/CRITICAL/WARNING)
No hardcoded vendor names in detection logic — pure PCI vendor_id map.
Tests: table tests with storcli/sas2ircu text fixtures
1.11 — Output and USB write
--output stdout (default): pretty-printed JSON to stdout
--output file:<path>: write JSON to explicit path
--output usb: auto-detect first removable block device, mount it, write audit-<board_serial>-<YYYYMMDD-HHMMSS>.json
USB detection: scan /sys/block/*/removable, pick first 1, mount to /tmp/bee-usb
QR summary to stdout (always): board serial + model + component counts — fits in one QR code
Uses qrencode if present, else skips silently
1.12 — Integration test (local)
scripts/test-local.sh — runs audit binary on developer machine (Linux), captures JSON,
validates required fields are present (board.serial_number non-empty, cpus non-empty, etc.)
Not a unit test — requires real hardware access. Documents how to run for verification.
Phase 2 — Alpine LiveCD
ISO image bootable via BMC virtual media. Runs audit binary automatically on boot.
2.1 — Builder environment
iso/builder/Dockerfile — Alpine 3.21 build environment with:
alpine-sdk,abuild,squashfs-tools,xorriso- Go toolchain (for binary compilation inside builder)
- NVIDIA driver
.runpre-fetched during image build
iso/builder/build.sh — orchestrates full ISO build:
- Compile Go binary (static,
CGO_ENABLED=0) - Compile NVIDIA kernel module against Alpine 3.21 LTS kernel headers
- Run
mkimage.shwith bee profile - Output:
dist/bee-<version>.iso
2.2 — NVIDIA driver build
Alpine 3.21, LTS kernel 6.6 — fixed versions in builder.
iso/builder/build-nvidia.sh:
- Download
NVIDIA-Linux-x86_64-<ver>.run(version pinned iniso/builder/VERSIONS) - Extract kernel module sources
- Compile against
linux-lts-devheaders - Strip and package as
nvidia-<ver>-k6.6.ko.tar.gzfor inclusion in overlay
iso/overlay/usr/local/bin/load-nvidia.sh:
insmodsequence: nvidia.ko → nvidia-modeset.ko → nvidia-uvm.ko- Verify:
nvidia-smi -L→ log result - On failure: log warning, continue (audit runs without GPU enrichment)
2.3 — Alpine mkimage profile
iso/builder/mkimg.bee.sh — Alpine mkimage profile:
- Base:
alpine-base - Kernel:
linux-lts - Packages:
dmidecode smartmontools nvme-cli pciutils ipmitool util-linux e2fsprogs qrencode - Overlay:
iso/overlay/included as apkovl
2.4 — Network bring-up on boot
iso/overlay/usr/local/bin/bee-network.sh:
- Enumerate all network interfaces:
ip link show→ filter out loopback and virtual (docker/bridge) - For each physical interface:
ip link set <iface> up+udhcpc -i <iface> -t 5 -T 3 -n - Log each interface result (got IP / timeout / no carrier)
- Continue regardless — network is best-effort for auto-update
iso/overlay/etc/init.d/bee-network:
- runlevel: default, before: bee-update
- Calls bee-network.sh
- Does not block boot if DHCP fails on all interfaces
2.5 — OpenRC boot service (bee-audit)
iso/overlay/etc/init.d/bee-audit:
- runlevel: default, after: bee-update
- start(): load-nvidia.sh → /usr/local/bin/audit --output usb
- on completion: print QR summary to /dev/tty1 (always, even if USB write failed)
- log everything to /var/log/bee-audit.log
- exits 0 regardless of partial failures — unattended, no prompts, no waits
Unattended invariants:
- No TTY prompts ever. All decisions are automatic.
- Missing USB: output goes to /tmp/bee-audit--.json, QR shown on screen.
- Missing NVIDIA driver: GPU records have status UNKNOWN, audit continues.
- Missing ipmitool/storcli/any tool: that collector is skipped, rest continue.
- Timeout on any external command: 30s hard limit via
timeoutwrapper, then skip. - Boot never hangs waiting for user input.
iso/overlay/etc/runlevels/default/bee-audit symlink
2.6 — Vendor utilities in overlay
iso/overlay/usr/local/bin/ includes pre-fetched proprietary tools:
storcli64(Broadcom)sas2ircu,sas3ircu(Broadcom/LSI)mstflint(NVIDIA Networking / Mellanox)
scripts/fetch-vendor.sh — downloads and places these before ISO build.
Checksums verified. Tools not committed to git — fetched at build time.
iso/vendor/.gitkeep — placeholder, directory gitignored except .gitkeep
2.7 — Auto-update of audit binary (USB + network)
Two update paths, tried in order on every boot:
Path A — USB (no network required, higher priority):
bee-update.sh scans mounted removable media for an update package before checking network.
Looks for: <usb>/bee-update/bee-audit-linux-amd64 + <usb>/bee-update/bee-audit-linux-amd64.sha256
Steps:
- Find USB mount point (same detection as audit output:
/sys/block/*/removable) - Check for
bee-update/bee-audit-linux-amd64on the USB root - Read version from
bee-update/VERSIONfile (plain text, e.g.1.3) - Compare with running binary version (
/usr/local/bin/audit --version) - If USB version > running: verify SHA256 checksum, replace binary, log update
- Re-run audit if updated
Authenticity verification — Ed25519 multi-key trust (stdlib only, no external tools):
Problem: SHA256 alone does not prevent a crafted attack — an attacker places their binary and a matching SHA256 next to it. The LiveCD would accept it.
Solution: Ed25519 asymmetric signatures via Go stdlib crypto/ed25519.
Multiple developer public keys are supported. A binary update is accepted if its signature
verifies against ANY of the embedded trusted public keys.
This mirrors the SSH authorized_keys model: add a developer → add their public key. Remove a developer → rebuild without their key.
Key management — centralized across all projects:
Public keys live in a dedicated repo at git.mchus.pro/mchus/keys (or similar):
keys/
developers/
mchusavitin.pub ← Ed25519 public key, base64, one line
developer2.pub
README.md ← how to generate a key pair
Public keys are safe to commit — they are not secret. Private keys stay on each developer's machine, never committed anywhere.
Key generation (one-time per developer, run locally):
# scripts/keygen.sh — also lives in the keys repo
openssl genpkey -algorithm ed25519 -out ~/.bee-release.key
openssl pkey -in ~/.bee-release.key -pubout -outform DER \
| tail -c 32 | base64 > mchusavitin.pub
Embedding public keys at release time (not compile time):
Public keys are injected via -ldflags at build time from the keys repo.
The binary does not hardcode keys — they are provided by the release script.
// audit/internal/updater/trust.go
// trustedKeysRaw is injected at build time via -ldflags
// format: base64(key1):base64(key2):...
var trustedKeysRaw string
func trustedKeys() ([]ed25519.PublicKey, error) {
if trustedKeysRaw == "" {
return nil, fmt.Errorf("binary built without trusted keys — updates disabled")
}
var keys []ed25519.PublicKey
for _, enc := range strings.Split(trustedKeysRaw, ":") {
b, err := base64.StdEncoding.DecodeString(strings.TrimSpace(enc))
if err != nil || len(b) != ed25519.PublicKeySize {
return nil, fmt.Errorf("invalid trusted key: %w", err)
}
keys = append(keys, ed25519.PublicKey(b))
}
return keys, nil
}
func verifySignature(binaryPath, sigPath string) error {
keys, err := trustedKeys()
if err != nil {
return err
}
data, _ := os.ReadFile(binaryPath)
sig, _ := os.ReadFile(sigPath) // 64 bytes raw Ed25519 signature
for _, key := range keys {
if ed25519.Verify(key, data, sig) {
return nil // any trusted key accepts → pass
}
}
return fmt.Errorf("signature verification failed: no trusted key matched")
}
Release build injects keys:
# scripts/build-release.sh
KEYS=$(paste -sd: keys/developers/*.pub)
go build -ldflags "-X bee/audit/internal/updater/trust.trustedKeysRaw=${KEYS}" \
-o dist/bee-audit-linux-amd64 ./cmd/audit
Signing (release engineer signs with their private key):
# scripts/sign-release.sh <binary>
openssl pkeyutl -sign -inkey ~/.bee-release.key \
-rawin -in "$1" -out "$1.sig"
Binary built without -ldflags injection (e.g. local dev build) has trustedKeysRaw=""
→ updates are disabled, logged as INFO, audit continues normally.
Update rejected silently (logged as WARNING, audit continues with current binary) if:
.sigfile missing- Signature does not match any trusted key
trustedKeysRawempty (dev build)
Update package layout on USB:
/bee-update/
bee-audit-linux-amd64 ← new binary (also signed with embedded keys)
bee-audit-linux-amd64.sig ← Ed25519 signature (64 bytes raw)
VERSION ← plain version string e.g. "1.3"
Admin workflow: download bee-audit-linux-amd64 + bee-audit-linux-amd64.sig from Gitea
release assets, place in bee-update/ on USB.
Path B — Network (requires DHCP on at least one interface):
- Check network: ping git.mchus.pro -c 1 -W 3 || skip
- Fetch:
GET https://git.mchus.pro/api/v1/repos/<org>/bee/releases/latest - Parse tag_name, asset URLs for
bee-audit-linux-amd64+bee-audit-linux-amd64.sig - Compare tag with running version
- If newer: download both files to /tmp, verify Ed25519 signature against all trusted keys
- Replace binary on pass, log and skip on fail
- Re-run audit if updated
Ordering: USB update checked first, network checked second. If USB update applied and verified, network check is skipped.
iso/overlay/etc/init.d/bee-update:
- runlevel: default
- after: bee-network (network path needs interfaces up)
- before: bee-audit (audit runs with latest binary)
- Calls bee-update.sh
Triggered after bee-audit completes, only if network is available.
iso/overlay/usr/local/bin/bee-update.sh:
1. Check network: ping git.mchus.pro -c 1 -W 3 || exit 0
2. Fetch latest release metadata:
GET https://git.mchus.pro/api/v1/repos/<org>/bee/releases/latest
3. Parse: extract tag_name, asset URL for bee-audit-linux-amd64
4. Compare tag_name with /usr/local/bin/audit --version output
5. If newer: download to /tmp/bee-audit-new, verify SHA256 checksum from release assets
6. Replace /usr/local/bin/audit (tmpfs — survives until reboot)
7. Log: updated from vX.Y to vX.Z
8. Re-run audit if update happened: /usr/local/bin/audit --output usb
iso/overlay/etc/init.d/bee-update:
- runlevel: default
- after: bee-audit, network
- Calls bee-update.sh
Release naming convention: binary asset named bee-audit-linux-amd64 per release tag.
2.8 — Release workflow
iso/builder/VERSIONS — pinned versions:
AUDIT_VERSION=1.0
ALPINE_VERSION=3.21
KERNEL_VERSION=6.12
NVIDIA_DRIVER_VERSION=590.48.01
LiveCD release = full ISO rebuild. Binary-only patch = new Gitea release with binary asset. On boot with network: ISO auto-patches its binary without full rebuild.
ISO version embedded in /etc/bee-release:
BEE_ISO_VERSION=1.0
BEE_AUDIT_VERSION=1.0
BUILD_DATE=2026-03-05
Eating order
Builder environment is set up early (after 1.3) so every subsequent collector is developed and tested directly on real hardware in the actual Alpine environment. No "works on my Mac" drift.
1.0 keys repo setup → git.mchus.pro/mchus/keys, keygen.sh, developer pubkeys
1.1 scaffold + schema types → binary runs, outputs empty JSON
1.2 board collector → first real data
1.3 CPU collector → +CPUs
--- BUILDER + DEBUG ISO (unblock real-hardware testing) ---
2.1 builder VM setup → Alpine VM with build deps + Go toolchain
2.2 debug ISO profile → minimal Alpine ISO: audit binary + dropbear SSH + all packages
2.3 boot on real server → SSH in, verify packages present, run audit manually
--- CONTINUE COLLECTORS (tested on real hardware from here) ---
1.4 memory collector → +DIMMs
1.5 storage collector → +disks (SATA/SAS/NVMe)
1.6 PCIe collector → +all PCIe devices
1.7 PSU collector → +power supplies
1.8 NVIDIA GPU enrichment → +GPU serial/VBIOS
1.8b wear/age telemetry → +SMART hours, NVMe % used, SFP DOM, ECC
1.9 Mellanox NIC enrichment → +NIC firmware/serial
1.10 RAID enrichment → +physical disks behind RAID
1.11 output + USB write → production-ready output
--- PRODUCTION ISO ---
2.4 NVIDIA driver build → driver compiled into overlay
2.5 network bring-up on boot → DHCP on all interfaces
2.6 OpenRC boot service → audit runs on boot automatically
2.7 vendor utilities → storcli/sas2ircu/mstflint in image
2.8 auto-update → binary self-patches from Gitea
2.9 release workflow → versioning + release notes