bee/PLAN.md
2026-03-16 00:23:55 +03:00


BEE — Build Plan

Hardware audit LiveCD for offline server inventory. Produces HardwareIngestRequest JSON compatible with core/reanimator.

Principle: OS-level collection — reads hardware directly, not through the BMC. Automatic boot audit plus an operator console: boot runs the audit immediately, and local/SSH operators can rerun checks through the TUI and CLI. Partial collector failures are logged and must not block boot. Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.


Status snapshot (2026-03-14)

Phase 1 — Go Audit Binary

  • 1.1 Project scaffold — DONE
  • 1.2 Board collector — DONE
  • 1.3 CPU collector — DONE
  • 1.4 Memory collector — DONE
  • 1.5 Storage collector — DONE
  • 1.6 PCIe collector — DONE (with noise filtering for system/chipset devices)
  • 1.7 PSU collector — DONE (basic FRU path)
  • 1.8 NVIDIA GPU enrichment — DONE
  • 1.8b Component wear / age telemetry — DONE (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
  • 1.8c Storage health verdicts — DONE (SMART/NVMe warning/failed status derivation)
  • 1.9 Mellanox/NVIDIA NIC enrichment — DONE (mstflint + ethtool firmware fallback)
  • 1.10 RAID controller enrichment — DONE (initial multi-tool support: storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
  • 1.7b PSU SDR health — DONE (ipmitool sdr merged with FRU inventory)
  • 1.11 Output and export workflow — DONE (explicit file output + manual removable export via TUI)
  • 1.12 Integration test (local) — DONE (scripts/test-local.sh)

Phase 2 — Debian Live ISO

  • Current implementation uses Debian 12 live-build, systemd, and OpenSSH.
  • Network bring-up on boot — DONE
  • Boot services (bee-network, bee-nvidia, bee-audit, bee-sshsetup) — DONE
  • Local console UX (bee autologin on tty1, menu auto-start, TUI privilege escalation via sudo -n) — DONE
  • VM/debug support (qemu-guest-agent, serial console, virtual GPU initramfs modules) — DONE
  • Vendor utilities in overlay — DONE
  • Build metadata + staged overlay injection — DONE
  • Builder container cache persisted outside container writable layer — DONE
  • ISO volume label BEE — DONE
  • Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior.
  • Real-hardware validation remains PENDING; current validation is limited to local/libvirt VM boot + service checks.

Phase 1 — Go Audit Binary

Self-contained static binary. Runs on any Linux (including the Debian live ISO). Calls system utilities, parses their output, produces HardwareIngestRequest JSON.

1.1 — Project scaffold

  • audit/go.mod — module bee/audit
  • audit/cmd/bee/main.go — main CLI entry point: subcommands, runtime selection, JSON output
  • audit/internal/schema/ — copy of HardwareIngestRequest types from core (no import dependency)
  • audit/internal/collector/ — empty package stubs for all collectors
  • const Version = "1.0" in main
  • Output modes: stdout (default), file path flag --output /path/to/file.json
  • Tests: none yet (stubs only)

1.2 — Board collector

Source: dmidecode -t 0 (BIOS), -t 1 (System), -t 2 (Baseboard)

Collects:

  • board.serial_number — from System Information
  • board.manufacturer, board.product_name — from System Information
  • board.part_number — from Baseboard
  • board.uuid — from System Information
  • firmware[BIOS] — vendor, version, release date from BIOS Information

Tests: table tests with testdata/dmidecode_*.txt fixtures
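The collector shells out to dmidecode and scans its `Key: Value` lines. A minimal sketch of that parsing step (the helper name and sample text are illustrative, not the real collector code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseDMIField extracts a "Key: Value" field from dmidecode section output.
// Hypothetical helper; the real collector runs `dmidecode -t 1` and feeds
// its stdout here.
func parseDMIField(out, key string) string {
	for _, line := range strings.Split(out, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, key+":") {
			return strings.TrimSpace(strings.TrimPrefix(trimmed, key+":"))
		}
	}
	return ""
}

func main() {
	sample := "System Information\n\tManufacturer: Supermicro\n\tSerial Number: S123456"
	fmt.Println(parseDMIField(sample, "Serial Number")) // S123456
}
```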

1.3 — CPU collector

Source: dmidecode -t 4

Collects:

  • socket index, model, manufacturer, status
  • cores, threads, current/max frequency
  • firmware: microcode version from /sys/devices/system/cpu/cpu0/microcode/version
  • serial: not available on Intel Xeon → fallback <board_serial>-CPU-<socket> (matches core logic)

Tests: table tests with dmidecode fixtures
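The serial fallback rule above is mechanical enough to sketch. This is an assumption-laden illustration (function names are hypothetical); only the fallback format `<board_serial>-CPU-<socket>` and the microcode sysfs path come from the plan:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cpuSerialFallback mirrors the plan's rule: Intel Xeon exposes no per-CPU
// serial, so derive one from the board serial and socket index.
func cpuSerialFallback(boardSerial string, socket int) string {
	return fmt.Sprintf("%s-CPU-%d", boardSerial, socket)
}

// readMicrocode reads the running microcode revision from the sysfs path
// named in the plan; returns "" if unreadable (e.g. in tests or non-Linux).
func readMicrocode() string {
	b, err := os.ReadFile("/sys/devices/system/cpu/cpu0/microcode/version")
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(b))
}

func main() {
	fmt.Println(cpuSerialFallback("S123456", 0)) // S123456-CPU-0
}
```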

1.4 — Memory collector

Source: dmidecode -t 17

Collects:

  • slot, location, present flag
  • size_mb, type (DDR4/DDR5), max_speed_mhz, current_speed_mhz
  • manufacturer, serial_number, part_number
  • status from "Data Width" / "No Module Installed" detection

Tests: table tests with dmidecode fixtures (populated + empty slots)

1.5 — Storage collector

Sources:

  • lsblk -J -o NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,MOUNTPOINT — device enumeration
  • smartctl -j -i /dev/X — serial, model, firmware, interface per device
  • nvme id-ctrl /dev/nvmeX -o json — NVMe: serial (sn), firmware (fr), model (mn), size

Collects per device:

  • type: SSD/HDD/NVMe
  • model, serial_number, manufacturer, firmware
  • size_gb, interface (SATA/SAS/NVMe)
  • slot: from lsblk HCTL where available

Tests: table tests with smartctl -j JSON fixtures and nvme id-ctrl JSON fixtures
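Since lsblk is called with `-J`, enumeration is plain JSON decoding. A minimal sketch under the assumption that only `type == "disk"` entries become storage records (struct and function names are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lsblkDevice mirrors a subset of the columns requested via
// `lsblk -J -o NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,MOUNTPOINT`.
type lsblkDevice struct {
	Name   string `json:"name"`
	Type   string `json:"type"`
	Serial string `json:"serial"`
	Model  string `json:"model"`
	Tran   string `json:"tran"`
}

type lsblkOutput struct {
	Blockdevices []lsblkDevice `json:"blockdevices"`
}

// parseLsblk keeps whole disks only; partitions inherit the parent identity.
func parseLsblk(raw []byte) ([]lsblkDevice, error) {
	var out lsblkOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	var disks []lsblkDevice
	for _, d := range out.Blockdevices {
		if d.Type == "disk" {
			disks = append(disks, d)
		}
	}
	return disks, nil
}

func main() {
	raw := []byte(`{"blockdevices":[{"name":"sda","type":"disk","serial":"WD123","model":"WDC","tran":"sata"},{"name":"sda1","type":"part"}]}`)
	disks, _ := parseLsblk(raw)
	fmt.Println(len(disks), disks[0].Serial) // 1 WD123
}
```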

1.6 — PCIe collector

Sources:

  • lspci -vmm -D — slot, vendor, device, class
  • lspci -vvv -D — link width/speed (LnkSta, LnkCap)
  • embedded pci.ids (same submodule as logpile: third_party/pciids) — model name lookup
  • /sys/bus/pci/devices/<bdf>/ — actual negotiated link state from kernel

Collects per device:

  • bdf, vendor_id, device_id, device_class
  • manufacturer, model (via pciids lookup if empty)
  • link_width, link_speed, max_link_width, max_link_speed
  • serial_number: device-specific (see per-type enrichment in 1.8, 1.9)

Dedup: by serial → bdf (mirrors logpile canonical device repository logic)

Tests: table tests with lspci fixtures
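`lspci -vmm -D` emits blank-line-separated records of `Key:\tValue` pairs, which the collector can split directly. A hedged sketch (helper name is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// parseLspciVMM splits `lspci -vmm -D` output into per-device key/value
// maps; records are blank-line-separated blocks of "Key:\tValue" lines.
func parseLspciVMM(out string) []map[string]string {
	var devices []map[string]string
	for _, block := range strings.Split(strings.TrimSpace(out), "\n\n") {
		dev := map[string]string{}
		for _, line := range strings.Split(block, "\n") {
			if k, v, ok := strings.Cut(line, ":"); ok {
				dev[strings.TrimSpace(k)] = strings.TrimSpace(v)
			}
		}
		if len(dev) > 0 {
			devices = append(devices, dev)
		}
	}
	return devices
}

func main() {
	sample := "Slot:\t0000:3b:00.0\nClass:\tEthernet controller\nVendor:\tMellanox Technologies\n\nSlot:\t0000:5e:00.0\nClass:\t3D controller\nVendor:\tNVIDIA Corporation"
	devs := parseLspciVMM(sample)
	fmt.Println(len(devs), devs[0]["Slot"]) // 2 0000:3b:00.0
}
```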

1.7 — PSU collector

Source: ipmitool fru — primary (the only source for PSU data from the OS)
Fallback: dmidecode -t 39 (System Power Supply; limited availability)

Collects:

  • slot, present, model, vendor, serial_number, part_number
  • wattage_w from FRU or dmidecode
  • firmware from ipmitool FRU Board Extra fields
  • input_power_w, output_power_w, input_voltage from ipmitool sdr

Tests: table tests with ipmitool fru text fixtures

1.8 — NVIDIA GPU enrichment

Prerequisite: NVIDIA driver loaded (checked via nvidia-smi -L exit code)

Sources:

  • nvidia-smi --query-gpu=index,name,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total --format=csv,noheader,nounits
  • BDF correlation: nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader → match to PCIe collector records

Enriches PCIe records for NVIDIA devices:

  • serial_number — real GPU serial from nvidia-smi
  • firmware — VBIOS version
  • status — derived from ECC uncorrected errors (0 = OK, >0 = WARNING)
  • telemetry: temperature_c, power_w added to PCIe record attributes

Fallback (no driver): PCIe record stays as-is with serial fallback <board_serial>-PCIE-<slot>

Tests: table tests with nvidia-smi CSV fixtures
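The CSV rows from `--format=csv,noheader,nounits` and the ECC-based status rule split cleanly into two small helpers. A sketch under the plan's rule (0 uncorrected ECC errors = OK, >0 = WARNING); helper names are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// gpuStatus derives the plan's verdict from the aggregate uncorrected ECC
// error count reported by nvidia-smi.
func gpuStatus(uncorrectedECC int) string {
	if uncorrectedECC > 0 {
		return "WARNING"
	}
	return "OK"
}

// parseGPULine splits one csv,noheader,nounits row into trimmed fields.
func parseGPULine(line string) []string {
	parts := strings.Split(line, ",")
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

func main() {
	row := parseGPULine("0, NVIDIA A100, 1562521001234, 92.00.25.00.08, 34, 61.37, 0")
	fmt.Println(row[2], gpuStatus(0)) // 1562521001234 OK
}
```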

1.8b — Component wear / age telemetry

Every component that stores its own usage history must have that data collected and placed in the attributes / telemetry map of the respective record. This is a cross-cutting concern applied on top of the per-collector steps.

Storage (SATA/SAS) — smartctl:

  • Power_On_Hours (attr 9) — total hours powered on
  • Power_Cycle_Count (attr 12)
  • Reallocated_Sector_Ct (attr 5) — wear indicator
  • Wear_Leveling_Count (attr 177, SSD) — NAND wear
  • Total_LBAs_Written (attr 241) — bytes written lifetime
  • SSD_Life_Left (attr 231) — % remaining if reported
  • Collected as telemetry map keys: power_on_hours, power_cycles, reallocated_sectors, wear_leveling_pct, total_lba_written, life_remaining_pct

NVMe — nvme smart-log:

  • power_on_hours — lifetime hours
  • power_cycles
  • unsafe_shutdowns — abnormal power loss count
  • percentage_used — % of rated lifetime consumed (0–100)
  • data_units_written — lifetime writes in NVMe data units (each unit = 1,000 × 512-byte blocks)
  • controller_busy_time — hours controller was busy
  • Collected via nvme smart-log /dev/nvmeX -o json

NVIDIA GPU — nvidia-smi:

  • ecc.errors.uncorrected.aggregate.total — lifetime uncorrected ECC errors
  • ecc.errors.corrected.aggregate.total — lifetime corrected ECC errors
  • clocks_throttle_reasons.hw_slowdown — thermal/power throttle state
  • Stored in PCIe device telemetry

NIC SFP/QSFP transceivers — ethtool:

  • ethtool -m <iface> — DOM (Digital Optical Monitoring) if supported
  • Extracts: TX power, RX power, temperature, voltage, bias current
  • Also: ethtool -i <iface> → firmware version
  • ip -s link show <iface> → tx_packets, rx_packets, tx_errors, rx_errors (uptime proxy)
  • Stored in PCIe device telemetry: sfp_temperature_c, sfp_tx_power_dbm, sfp_rx_power_dbm

PSU — ipmitool sdr:

  • Input/output power readings are point-in-time only; the BMC does not store history
  • ipmitool fru may include manufacture date for age estimation
  • Stored: input_power_w, output_power_w, input_voltage (already in PSU schema)

All wear telemetry placement rules:

  • Numeric wear indicators go into telemetry map (machine-readable, importable by core)
  • Boolean flags (throttle_active, ecc_errors_present) go into attributes map
  • Never flatten into named top-level fields not in the schema — use maps

Tests: table tests for each SMART parser (JSON fixtures from smartctl/nvme smart-log)
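The placement rules above can be sketched with a hypothetical record type — field names here are illustrative, not the real HardwareIngestRequest schema:

```go
package main

import "fmt"

// Record mirrors the plan's rule: numeric wear indicators go into the
// Telemetry map, boolean flags into the Attributes map, and nothing gets
// flattened into ad-hoc top-level fields outside the schema.
type Record struct {
	Telemetry  map[string]float64
	Attributes map[string]string
}

func main() {
	disk := Record{
		Telemetry: map[string]float64{
			"power_on_hours":     41231,
			"life_remaining_pct": 93,
		},
		Attributes: map[string]string{
			"throttle_active": "false",
		},
	}
	fmt.Println(disk.Telemetry["power_on_hours"]) // 41231
}
```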

1.9 — Mellanox/NVIDIA NIC enrichment

Source: mstflint -d <bdf> q — if mstflint is present and the device is Mellanox (vendor_id 0x15b3)
Fallback: ethtool -i <iface> — firmware-version field

Enriches PCIe/NIC records:

  • firmware — from mstflint FW Version or ethtool firmware-version
  • serial_number — from mstflint Board Serial Number if available

Detection: by PCI vendor_id (0x15b3 = Mellanox/NVIDIA Networking) from PCIe collector

Tests: table tests with mstflint output fixtures
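Detection keyed purely on the PCI vendor_id reduces to a constant comparison. A trivial sketch (function name is hypothetical; only the 0x15b3 ID comes from the plan):

```go
package main

import "fmt"

// mellanoxVendorID is the PCI vendor ID for Mellanox/NVIDIA Networking.
const mellanoxVendorID = "15b3"

// isMellanox implements the plan's detection rule: enrichment is keyed on
// PCI vendor_id from the PCIe collector, never on vendor name strings.
func isMellanox(vendorID string) bool {
	return vendorID == mellanoxVendorID
}

func main() {
	fmt.Println(isMellanox("15b3"), isMellanox("8086")) // true false
}
```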

1.10 — RAID controller enrichment

Source: tool selected by PCI vendor_id:

| PCI vendor_id | Tool | Controller |
|---|---|---|
| 0x1000 | storcli64 /c<n> show all J | Broadcom MegaRAID |
| 0x1000 (SAS) | sas2ircu <n> display / sas3ircu <n> display | LSI SAS 2.x/3.x |
| 0x9005 | arcconf getconfig <n> | Adaptec |
| 0x103c | ssacli ctrl slot=<n> pd all show detail | HPE Smart Array |

Collects physical drives behind controller (not visible to OS as block devices):

  • serial_number, model, manufacturer, firmware
  • size_gb, interface (SAS/SATA), slot/bay
  • status (Online/Failed/Rebuilding → OK/CRITICAL/WARNING)

No hardcoded vendor names in detection logic — pure PCI vendor_id map.

Tests: table tests with storcli/sas2ircu text fixtures
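The "pure PCI vendor_id map" rule above can be sketched as a switch. This is an assumption-laden illustration: the real code must also disambiguate 0x1000 MegaRAID vs SAS IR controllers, modeled here only as a boolean flag:

```go
package main

import "fmt"

// raidTool maps PCI vendor_id to the enrichment CLI, per the plan's
// "no hardcoded vendor names" rule. The isIRMode flag stands in for the
// real MegaRAID-vs-IR probe on vendor 0x1000.
func raidTool(vendorID string, isIRMode bool) string {
	switch vendorID {
	case "1000":
		if isIRMode {
			return "sas3ircu"
		}
		return "storcli64"
	case "9005":
		return "arcconf"
	case "103c":
		return "ssacli"
	}
	return "" // no supported controller: skip enrichment
}

func main() {
	fmt.Println(raidTool("9005", false)) // arcconf
}
```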

1.11 — Output and export workflow

--output stdout (default): pretty-printed JSON to stdout
--output file:<path>: write JSON to an explicit path

Live ISO default service output: /var/log/bee-audit.json

Removable-media export is manual via bee tui (or the LiveCD wrapper bee-tui):

  • operator chooses a removable filesystem explicitly
  • TUI mounts it if needed
  • TUI asks for confirmation before copying the JSON
  • TUI unmounts temporary mountpoints after export

No auto-write to arbitrary removable media is allowed.

1.12 — Integration test (local)

scripts/test-local.sh — runs bee audit on developer machine (Linux), captures JSON, validates required fields are present (board.serial_number non-empty, cpus non-empty, etc.)

Not a unit test — requires real hardware access. Documents how to run for verification.


Phase 2 — Debian Live ISO

ISO image bootable via BMC virtual media or USB. Runs boot services automatically and writes the audit result to /var/log/bee-audit.json.

2.1 — Builder environment

iso/builder/build-in-container.sh is the only supported builder entrypoint. It builds a Debian 12 builder image with live-build, toolchains, and pinned kernel headers, then runs the ISO assembly in a privileged container because live-build needs mount/chroot/loop capabilities.

iso/builder/build.sh orchestrates the full ISO build:

  1. compile the Go bee binary
  2. create a staged overlay under dist/overlay-stage
  3. inject SSH auth, vendor tools, NVIDIA artifacts, and build metadata into the staged overlay
  4. create a disposable live-build workdir under dist/live-build-work
  5. sync the staged overlay into config/includes.chroot/
  6. run lb config && lb build
  7. copy the final ISO into dist/

2.2 — NVIDIA driver build

iso/builder/build-nvidia-module.sh:

  • downloads the pinned NVIDIA .run installer
  • verifies SHA256
  • builds kernel modules against the pinned Debian kernel ABI
  • caches modules, userspace tools, and libs in dist/nvidia-<version>-<kver>/

iso/overlay/usr/local/bin/bee-nvidia-load:

  • loads nvidia, nvidia-modeset, nvidia-uvm via insmod
  • creates /dev/nvidia* nodes if the driver registered successfully
  • logs failures but does not block the rest of boot

2.3 — ISO assembly and overlay policy

iso/overlay/ is source-only input for the build.

Build-time files are injected into the staged overlay only:

  • bee
  • bee-smoketest
  • authorized_keys
  • password-fallback marker
  • /etc/bee-release
  • vendor tools from iso/vendor/

The source tree must stay clean after a build.

2.4 — Boot services

systemd service order:

  • bee-sshsetup.service → configures SSH auth before ssh.service
  • bee-network.service → starts best-effort DHCP on all physical interfaces
  • bee-nvidia.service → loads NVIDIA modules if present
  • bee-audit.service → runs audit and logs failures without turning partial collector bugs into a boot blocker
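As a hedged illustration of that ordering, a unit could look roughly like this — directives, paths, and flags are assumptions pieced together from the plan, not the shipped unit file:

```ini
# bee-audit.service — illustrative sketch only
[Unit]
Description=BEE hardware audit on boot
After=bee-network.service bee-nvidia.service
Wants=bee-network.service bee-nvidia.service

[Service]
Type=oneshot
# Output path from the plan's live-ISO default; flag syntax is assumed:
ExecStart=/usr/local/bin/bee audit --output file:/var/log/bee-audit.json
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

A oneshot unit that nothing `Requires=` keeps partial collector failures from blocking the rest of boot, which matches the plan's non-blocking rule.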

2.4b — Runtime split

Target split:

  • main Go application works on a normal Linux host and on the live ISO
  • live-ISO specifics stay in integration glue under iso/
  • the live ISO passes --runtime livecd to the Go binary
  • local runs default to --runtime auto, which resolves to local unless a live marker is detected

Planned code shape:

  • audit/cmd/bee/ — main CLI entrypoint
  • audit/internal/runtimeenv/ — runtime detection and mode selection
  • future audit/internal/tui/ — host/live shared TUI logic
  • iso/overlay/ — boot-time livecd integration only
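The `--runtime auto` resolution rule above can be sketched as follows — the marker path and function name are assumptions, not the actual detection mechanism:

```go
package main

import (
	"fmt"
	"os"
)

// resolveRuntime sketches the plan's rule: an explicit --runtime value wins;
// "auto" resolves to "livecd" only when a live marker file exists, and to
// "local" otherwise. markerPath is hypothetical.
func resolveRuntime(flag, markerPath string) string {
	if flag != "auto" {
		return flag
	}
	if _, err := os.Stat(markerPath); err == nil {
		return "livecd"
	}
	return "local"
}

func main() {
	fmt.Println(resolveRuntime("livecd", ""), resolveRuntime("auto", "/nonexistent-marker")) // livecd local
}
```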

2.5 — Operator workflows

  • Automatic boot audit writes JSON to /var/log/bee-audit.json
  • tty1 auto-logs in as the bee user and auto-runs menu
  • menu launches the LiveCD wrapper bee-tui, which escalates to root via sudo -n
  • bee tui can rerun the audit manually
  • bee tui can export the latest audit JSON to removable media
  • bee tui can show health summary and run NVIDIA/memory/storage acceptance tests
  • NVIDIA SAT now includes a lightweight in-image GPU stress step via bee-gpu-stress
  • SAT summaries now expose overall_status plus per-job OK/FAILED/UNSUPPORTED
  • Memory/GPU SAT runtime defaults can be overridden via BEE_MEMTESTER_* and BEE_GPU_STRESS_*
  • removable export requires explicit target selection, mount, confirmation, copy, and cleanup

2.6 — Vendor utilities and optional assets

Optional binaries live in iso/vendor/ and are included when present:

  • storcli64
  • sas2ircu, sas3ircu
  • arcconf
  • ssacli
  • mstflint (via Debian package set)

Missing optional tools do not fail the build or boot.

2.7 — Release workflow

iso/builder/VERSIONS pins the current release inputs:

  • audit version
  • Debian version / kernel ABI
  • Go version
  • NVIDIA driver version

Current release model:

  • shipping a new ISO means a full rebuild
  • build metadata is embedded into /etc/bee-release and motd
  • current ISO label is BEE
  • binary self-update remains deferred; no automatic USB/network patching is part of the current runtime

Eating order

Builder environment is set up early (after 1.3) so every subsequent collector is developed and tested directly on real hardware in the actual Debian live ISO environment. No "works on my Mac" drift.

1.0  keys repo setup               → git.mchus.pro/mchus/keys, keygen.sh, developer pubkeys
1.1  scaffold + schema types       → binary runs, outputs empty JSON
1.2  board collector               → first real data
1.3  CPU collector                 → +CPUs

--- BUILDER + BEE ISO (unblock real-hardware testing) ---

2.1  builder setup                 → privileged container with build deps
2.2  debug ISO profile             → minimal Debian ISO: `bee` binary + OpenSSH + all packages
2.3  boot on real server           → SSH in, verify packages present, run audit manually

--- CONTINUE COLLECTORS (tested on real hardware from here) ---

1.4  memory collector              → +DIMMs
1.5  storage collector             → +disks (SATA/SAS/NVMe)
1.6  PCIe collector                → +all PCIe devices
1.7  PSU collector                 → +power supplies
1.8  NVIDIA GPU enrichment         → +GPU serial/VBIOS
1.8b wear/age telemetry            → +SMART hours, NVMe % used, SFP DOM, ECC
1.9  Mellanox NIC enrichment       → +NIC firmware/serial
1.10 RAID enrichment               → +physical disks behind RAID
1.11 output + export workflow      → file output + explicit removable export

--- PRODUCTION ISO ---

2.4  NVIDIA driver build           → driver compiled into overlay
2.5  network bring-up on boot      → DHCP on all interfaces
2.6  systemd boot service          → audit runs on boot automatically
2.7  vendor utilities              → storcli/sas2ircu/arcconf/ssacli in image
2.8  release workflow              → versioning + release notes
2.9  operator export flow          → explicit TUI export to removable media