Files
bee/PLAN.md
Michael Chus 871c766194 docs: add bible-local with architecture and decisions, fix PLAN.md versions
- bible-local/architecture/system-overview.md: scope, tech stack, key paths
- bible-local/architecture/runtime-flows.md: boot sequence, ISO build, collector flow
- bible-local/decisions/2026-03-05-nvidia-proprietary-driver.md
- PLAN.md: update KERNEL_VERSION 6.6→6.12, NVIDIA 550.54.15→590.48.01

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 18:15:07 +03:00

21 KiB
Raw Blame History

BEE — Build Plan

Hardware audit LiveCD for offline server inventory. Produces HardwareIngestRequest JSON compatible with core/reanimator.

Principle: OS-level collection — reads hardware directly, not through BMC. Fully unattended — no user interaction required at any stage. Boot → update → audit → output → done. All errors are logged, never presented interactively. Every failure path has a silent fallback. Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.


Phase 1 — Go Audit Binary

Self-contained static binary. Runs on any Linux (including Alpine LiveCD). Calls system utilities, parses their output, produces HardwareIngestRequest JSON.

1.1 — Project scaffold

  • audit/go.mod — module bee/audit
  • audit/cmd/audit/main.go — CLI entry point: flags, orchestration, JSON output
  • audit/internal/schema/ — copy of HardwareIngestRequest types from core (no import dependency)
  • audit/internal/collector/ — empty package stubs for all collectors
  • const Version = "1.0" in main
  • Output modes: stdout (default), file path flag --output /path/to/file.json
  • Tests: none yet (stubs only)

1.2 — Board collector

Source: dmidecode -t 0 (BIOS), -t 1 (System), -t 2 (Baseboard)

Collects:

  • board.serial_number — from System Information
  • board.manufacturer, board.product_name — from System Information
  • board.part_number — from Baseboard
  • board.uuid — from System Information
  • firmware[BIOS] — vendor, version, release date from BIOS Information

Tests: table tests with testdata/dmidecode_*.txt fixtures

1.3 — CPU collector

Source: dmidecode -t 4

Collects:

  • socket index, model, manufacturer, status
  • cores, threads, current/max frequency
  • firmware: microcode version from /sys/devices/system/cpu/cpu0/microcode/version
  • serial: not available on Intel Xeon → fallback <board_serial>-CPU-<socket> (matches core logic)

Tests: table tests with dmidecode fixtures

1.4 — Memory collector

Source: dmidecode -t 17

Collects:

  • slot, location, present flag
  • size_mb, type (DDR4/DDR5), max_speed_mhz, current_speed_mhz
  • manufacturer, serial_number, part_number
  • status from "Data Width" / "No Module Installed" detection

Tests: table tests with dmidecode fixtures (populated + empty slots)

1.5 — Storage collector

Sources:

  • lsblk -J -o NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,MOUNTPOINT — device enumeration
  • smartctl -j -i /dev/X — serial, model, firmware, interface per device
  • nvme id-ctrl /dev/nvmeX -o json — NVMe: serial (sn), firmware (fr), model (mn), size

Collects per device:

  • type: SSD/HDD/NVMe
  • model, serial_number, manufacturer, firmware
  • size_gb, interface (SATA/SAS/NVMe)
  • slot: from lsblk HCTL where available

Tests: table tests with smartctl -j JSON fixtures and nvme id-ctrl JSON fixtures

1.6 — PCIe collector

Sources:

  • lspci -vmm -D — slot, vendor, device, class
  • lspci -vvv -D — link width/speed (LnkSta, LnkCap)
  • embedded pci.ids (same submodule as logpile: third_party/pciids) — model name lookup
  • /sys/bus/pci/devices/<bdf>/ — actual negotiated link state from kernel

Collects per device:

  • bdf, vendor_id, device_id, device_class
  • manufacturer, model (via pciids lookup if empty)
  • link_width, link_speed, max_link_width, max_link_speed
  • serial_number: device-specific (see per-type enrichment in 1.8, 1.9)

Dedup: by serial → bdf (mirrors logpile canonical device repository logic)

Tests: table tests with lspci fixtures

1.7 — PSU collector

Source: ipmitool fru — primary (only source for PSU data from OS) Fallback: dmidecode -t 39 (System Power Supply, limited availability)

Collects:

  • slot, present, model, vendor, serial_number, part_number
  • wattage_w from FRU or dmidecode
  • firmware from ipmitool FRU Board Extra fields
  • input_power_w, output_power_w, input_voltage from ipmitool sdr

Tests: table tests with ipmitool fru text fixtures

1.8 — NVIDIA GPU enrichment

Prerequisite: NVIDIA driver loaded (checked via nvidia-smi -L exit code)

Sources:

  • nvidia-smi --query-gpu=index,name,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total --format=csv,noheader,nounits
  • BDF correlation: nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader → match to PCIe collector records

Enriches PCIe records for NVIDIA devices:

  • serial_number — real GPU serial from nvidia-smi
  • firmware — VBIOS version
  • status — derived from ECC uncorrected errors (0 = OK, >0 = WARNING)
  • telemetry: temperature_c, power_w added to PCIe record attributes

Fallback (no driver): PCIe record stays as-is with serial fallback <board_serial>-PCIE-<slot>

Tests: table tests with nvidia-smi CSV fixtures

1.8b — Component wear / age telemetry

Every component that stores its own usage history must have that data collected and placed in the attributes / telemetry map of the respective record. This is a cross-cutting concern applied on top of the per-collector steps.

Storage (SATA/SAS) — smartctl:

  • Power_On_Hours (attr 9) — total hours powered on
  • Power_Cycle_Count (attr 12)
  • Reallocated_Sector_Ct (attr 5) — wear indicator
  • Wear_Leveling_Count (attr 177, SSD) — NAND wear
  • Total_LBAs_Written (attr 241) — bytes written lifetime
  • SSD_Life_Left (attr 231) — % remaining if reported
  • Collected as telemetry map keys: power_on_hours, power_cycles, reallocated_sectors, wear_leveling_pct, total_lba_written, life_remaining_pct

NVMe — nvme smart-log:

  • power_on_hours — lifetime hours
  • power_cycles
  • unsafe_shutdowns — abnormal power loss count
  • percentage_used — % of rated lifetime consumed (0100)
  • data_units_written — 512KB units written lifetime
  • controller_busy_time — hours controller was busy
  • Collected via nvme smart-log /dev/nvmeX -o json

NVIDIA GPU — nvidia-smi:

  • ecc.errors.uncorrected.aggregate.total — lifetime uncorrected ECC errors
  • ecc.errors.corrected.aggregate.total — lifetime corrected ECC errors
  • clocks_throttle_reasons.hw_slowdown — thermal/power throttle state
  • Stored in PCIe device telemetry

NIC SFP/QSFP transceivers — ethtool:

  • ethtool -m <iface> — DOM (Digital Optical Monitoring) if supported
  • Extracts: TX power, RX power, temperature, voltage, bias current
  • Also: ethtool -i <iface> → firmware version
  • ip -s link show <iface> → tx_packets, rx_packets, tx_errors, rx_errors (uptime proxy)
  • Stored in PCIe device telemetry: sfp_temperature_c, sfp_tx_power_dbm, sfp_rx_power_dbm

PSU — ipmitool sdr:

  • Input/output power readings over time not stored by BMC (point-in-time only)
  • ipmitool fru may include manufacture date for age estimation
  • Stored: input_power_w, output_power_w, input_voltage (already in PSU schema)

All wear telemetry placement rules:

  • Numeric wear indicators go into telemetry map (machine-readable, importable by core)
  • Boolean flags (throttle_active, ecc_errors_present) go into attributes map
  • Never flatten into named top-level fields not in the schema — use maps

Tests: table tests for each SMART parser (JSON fixtures from smartctl/nvme smart-log)

1.9 — Mellanox/NVIDIA NIC enrichment

Source: mstflint -d <bdf> q — if mstflint present and device is Mellanox (vendor_id 0x15b3) Fallback: ethtool -i <iface> — firmware-version field

Enriches PCIe/NIC records:

  • firmware — from mstflint FW Version or ethtool firmware-version
  • serial_number — from mstflint Board Serial Number if available

Detection: by PCI vendor_id (0x15b3 = Mellanox/NVIDIA Networking) from PCIe collector

Tests: table tests with mstflint output fixtures

1.10 — RAID controller enrichment

Source: tool selected by PCI vendor_id:

PCI vendor_id Tool Controller
0x1000 storcli64 /c<n> show all J Broadcom MegaRAID
0x1000 (SAS) sas2ircu <n> display / sas3ircu <n> display LSI SAS 2.x/3.x
0x9005 arcconf getconfig <n> Adaptec
0x103c ssacli ctrl slot=<n> pd all show detail HPE Smart Array

Collects physical drives behind controller (not visible to OS as block devices):

  • serial_number, model, manufacturer, firmware
  • size_gb, interface (SAS/SATA), slot/bay
  • status (Online/Failed/Rebuilding → OK/CRITICAL/WARNING)

No hardcoded vendor names in detection logic — pure PCI vendor_id map.

Tests: table tests with storcli/sas2ircu text fixtures

1.11 — Output and USB write

--output stdout (default): pretty-printed JSON to stdout --output file:<path>: write JSON to explicit path --output usb: auto-detect first removable block device, mount it, write audit-<board_serial>-<YYYYMMDD-HHMMSS>.json

USB detection: scan /sys/block/*/removable, pick first 1, mount to /tmp/bee-usb

QR summary to stdout (always): board serial + model + component counts — fits in one QR code Uses qrencode if present, else skips silently

1.12 — Integration test (local)

scripts/test-local.sh — runs audit binary on developer machine (Linux), captures JSON, validates required fields are present (board.serial_number non-empty, cpus non-empty, etc.)

Not a unit test — requires real hardware access. Documents how to run for verification.


Phase 2 — Alpine LiveCD

ISO image bootable via BMC virtual media. Runs audit binary automatically on boot.

2.1 — Builder environment

iso/builder/Dockerfile — Alpine 3.21 build environment with:

  • alpine-sdk, abuild, squashfs-tools, xorriso
  • Go toolchain (for binary compilation inside builder)
  • NVIDIA driver .run pre-fetched during image build

iso/builder/build.sh — orchestrates full ISO build:

  1. Compile Go binary (static, CGO_ENABLED=0)
  2. Compile NVIDIA kernel module against Alpine 3.21 LTS kernel headers
  3. Run mkimage.sh with bee profile
  4. Output: dist/bee-<version>.iso

2.2 — NVIDIA driver build

Alpine 3.21, LTS kernel 6.6 — fixed versions in builder.

iso/builder/build-nvidia.sh:

  • Download NVIDIA-Linux-x86_64-<ver>.run (version pinned in iso/builder/VERSIONS)
  • Extract kernel module sources
  • Compile against linux-lts-dev headers
  • Strip and package as nvidia-<ver>-k6.6.ko.tar.gz for inclusion in overlay

iso/overlay/usr/local/bin/load-nvidia.sh:

  • insmod sequence: nvidia.ko → nvidia-modeset.ko → nvidia-uvm.ko
  • Verify: nvidia-smi -L → log result
  • On failure: log warning, continue (audit runs without GPU enrichment)

2.3 — Alpine mkimage profile

iso/builder/mkimg.bee.sh — Alpine mkimage profile:

  • Base: alpine-base
  • Kernel: linux-lts
  • Packages: dmidecode smartmontools nvme-cli pciutils ipmitool util-linux e2fsprogs qrencode
  • Overlay: iso/overlay/ included as apkovl

2.4 — Network bring-up on boot

iso/overlay/usr/local/bin/bee-network.sh:

  • Enumerate all network interfaces: ip link show → filter out loopback and virtual (docker/bridge)
  • For each physical interface: ip link set <iface> up + udhcpc -i <iface> -t 5 -T 3 -n
  • Log each interface result (got IP / timeout / no carrier)
  • Continue regardless — network is best-effort for auto-update

iso/overlay/etc/init.d/bee-network:

  • runlevel: default, before: bee-update
  • Calls bee-network.sh
  • Does not block boot if DHCP fails on all interfaces

2.5 — OpenRC boot service (bee-audit)

iso/overlay/etc/init.d/bee-audit:

  • runlevel: default, after: bee-update
  • start(): load-nvidia.sh → /usr/local/bin/audit --output usb
  • on completion: print QR summary to /dev/tty1 (always, even if USB write failed)
  • log everything to /var/log/bee-audit.log
  • exits 0 regardless of partial failures — unattended, no prompts, no waits

Unattended invariants:

  • No TTY prompts ever. All decisions are automatic.
  • Missing USB: output goes to /tmp/bee-audit--.json, QR shown on screen.
  • Missing NVIDIA driver: GPU records have status UNKNOWN, audit continues.
  • Missing ipmitool/storcli/any tool: that collector is skipped, rest continue.
  • Timeout on any external command: 30s hard limit via timeout wrapper, then skip.
  • Boot never hangs waiting for user input.

iso/overlay/etc/runlevels/default/bee-audit symlink

2.6 — Vendor utilities in overlay

iso/overlay/usr/local/bin/ includes pre-fetched proprietary tools:

  • storcli64 (Broadcom)
  • sas2ircu, sas3ircu (Broadcom/LSI)
  • mstflint (NVIDIA Networking / Mellanox)

scripts/fetch-vendor.sh — downloads and places these before ISO build. Checksums verified. Tools not committed to git — fetched at build time.

iso/vendor/.gitkeep — placeholder, directory gitignored except .gitkeep

2.7 — Auto-update of audit binary (USB + network)

Two update paths, tried in order on every boot:

Path A — USB (no network required, higher priority):

bee-update.sh scans mounted removable media for an update package before checking network.

Looks for: <usb>/bee-update/bee-audit-linux-amd64 + <usb>/bee-update/bee-audit-linux-amd64.sha256

Steps:

  1. Find USB mount point (same detection as audit output: /sys/block/*/removable)
  2. Check for bee-update/bee-audit-linux-amd64 on the USB root
  3. Read version from bee-update/VERSION file (plain text, e.g. 1.3)
  4. Compare with running binary version (/usr/local/bin/audit --version)
  5. If USB version > running: verify SHA256 checksum, replace binary, log update
  6. Re-run audit if updated

Authenticity verification — Ed25519 multi-key trust (stdlib only, no external tools):

Problem: SHA256 alone does not prevent a crafted attack — an attacker places their binary and a matching SHA256 next to it. The LiveCD would accept it.

Solution: Ed25519 asymmetric signatures via Go stdlib crypto/ed25519. Multiple developer public keys are supported. A binary update is accepted if its signature verifies against ANY of the embedded trusted public keys.

This mirrors the SSH authorized_keys model: add a developer → add their public key. Remove a developer → rebuild without their key.

Key management — centralized across all projects:

Public keys live in a dedicated repo at git.mchus.pro/mchus/keys (or similar):

keys/
  developers/
    mchusavitin.pub    ← Ed25519 public key, base64, one line
    developer2.pub
  README.md            ← how to generate a key pair

Public keys are safe to commit — they are not secret. Private keys stay on each developer's machine, never committed anywhere.

Key generation (one-time per developer, run locally):

# scripts/keygen.sh — also lives in the keys repo
openssl genpkey -algorithm ed25519 -out ~/.bee-release.key
openssl pkey -in ~/.bee-release.key -pubout -outform DER \
  | tail -c 32 | base64 > mchusavitin.pub

Embedding public keys at release time (not compile time):

Public keys are injected via -ldflags at build time from the keys repo. The binary does not hardcode keys — they are provided by the release script.

// audit/internal/updater/trust.go
// trustedKeysRaw is injected at build time via -ldflags
// format: base64(key1):base64(key2):...
var trustedKeysRaw string

func trustedKeys() ([]ed25519.PublicKey, error) {
    if trustedKeysRaw == "" {
        return nil, fmt.Errorf("binary built without trusted keys — updates disabled")
    }
    var keys []ed25519.PublicKey
    for _, enc := range strings.Split(trustedKeysRaw, ":") {
        b, err := base64.StdEncoding.DecodeString(strings.TrimSpace(enc))
        if err != nil || len(b) != ed25519.PublicKeySize {
            return nil, fmt.Errorf("invalid trusted key: %w", err)
        }
        keys = append(keys, ed25519.PublicKey(b))
    }
    return keys, nil
}

func verifySignature(binaryPath, sigPath string) error {
    keys, err := trustedKeys()
    if err != nil {
        return err
    }
    data, _ := os.ReadFile(binaryPath)
    sig, _ := os.ReadFile(sigPath)  // 64 bytes raw Ed25519 signature
    for _, key := range keys {
        if ed25519.Verify(key, data, sig) {
            return nil  // any trusted key accepts → pass
        }
    }
    return fmt.Errorf("signature verification failed: no trusted key matched")
}

Release build injects keys:

# scripts/build-release.sh
KEYS=$(paste -sd: keys/developers/*.pub)
go build -ldflags "-X bee/audit/internal/updater/trust.trustedKeysRaw=${KEYS}" \
  -o dist/bee-audit-linux-amd64 ./cmd/audit

Signing (release engineer signs with their private key):

# scripts/sign-release.sh <binary>
openssl pkeyutl -sign -inkey ~/.bee-release.key \
  -rawin -in "$1" -out "$1.sig"

Binary built without -ldflags injection (e.g. local dev build) has trustedKeysRaw="" → updates are disabled, logged as INFO, audit continues normally.

Update rejected silently (logged as WARNING, audit continues with current binary) if:

  • .sig file missing
  • Signature does not match any trusted key
  • trustedKeysRaw empty (dev build)

Update package layout on USB:

/bee-update/
  bee-audit-linux-amd64        ← new binary (also signed with embedded keys)
  bee-audit-linux-amd64.sig    ← Ed25519 signature (64 bytes raw)
  VERSION                      ← plain version string e.g. "1.3"

Admin workflow: download bee-audit-linux-amd64 + bee-audit-linux-amd64.sig from Gitea release assets, place in bee-update/ on USB.

Path B — Network (requires DHCP on at least one interface):

  1. Check network: ping git.mchus.pro -c 1 -W 3 || skip
  2. Fetch: GET https://git.mchus.pro/api/v1/repos/<org>/bee/releases/latest
  3. Parse tag_name, asset URLs for bee-audit-linux-amd64 + bee-audit-linux-amd64.sig
  4. Compare tag with running version
  5. If newer: download both files to /tmp, verify Ed25519 signature against all trusted keys
  6. Replace binary on pass, log and skip on fail
  7. Re-run audit if updated

Ordering: USB update checked first, network checked second. If USB update applied and verified, network check is skipped.

iso/overlay/etc/init.d/bee-update:

  • runlevel: default
  • after: bee-network (network path needs interfaces up)
  • before: bee-audit (audit runs with latest binary)
  • Calls bee-update.sh

Triggered after bee-audit completes, only if network is available.

iso/overlay/usr/local/bin/bee-update.sh:

1. Check network: ping git.mchus.pro -c 1 -W 3 || exit 0
2. Fetch latest release metadata:
   GET https://git.mchus.pro/api/v1/repos/<org>/bee/releases/latest
3. Parse: extract tag_name, asset URL for bee-audit-linux-amd64
4. Compare tag_name with /usr/local/bin/audit --version output
5. If newer: download to /tmp/bee-audit-new, verify SHA256 checksum from release assets
6. Replace /usr/local/bin/audit (tmpfs — survives until reboot)
7. Log: updated from vX.Y to vX.Z
8. Re-run audit if update happened: /usr/local/bin/audit --output usb

iso/overlay/etc/init.d/bee-update:

  • runlevel: default
  • after: bee-audit, network
  • Calls bee-update.sh

Release naming convention: binary asset named bee-audit-linux-amd64 per release tag.

2.8 — Release workflow

iso/builder/VERSIONS — pinned versions:

AUDIT_VERSION=1.0
ALPINE_VERSION=3.21
KERNEL_VERSION=6.12
NVIDIA_DRIVER_VERSION=590.48.01

LiveCD release = full ISO rebuild. Binary-only patch = new Gitea release with binary asset. On boot with network: ISO auto-patches its binary without full rebuild.

ISO version embedded in /etc/bee-release:

BEE_ISO_VERSION=1.0
BEE_AUDIT_VERSION=1.0
BUILD_DATE=2026-03-05

Eating order

Builder environment is set up early (after 1.3) so every subsequent collector is developed and tested directly on real hardware in the actual Alpine environment. No "works on my Mac" drift.

1.0  keys repo setup               → git.mchus.pro/mchus/keys, keygen.sh, developer pubkeys
1.1  scaffold + schema types       → binary runs, outputs empty JSON
1.2  board collector               → first real data
1.3  CPU collector                 → +CPUs

--- BUILDER + DEBUG ISO (unblock real-hardware testing) ---

2.1  builder VM setup              → Alpine VM with build deps + Go toolchain
2.2  debug ISO profile             → minimal Alpine ISO: audit binary + dropbear SSH + all packages
2.3  boot on real server           → SSH in, verify packages present, run audit manually

--- CONTINUE COLLECTORS (tested on real hardware from here) ---

1.4  memory collector              → +DIMMs
1.5  storage collector             → +disks (SATA/SAS/NVMe)
1.6  PCIe collector                → +all PCIe devices
1.7  PSU collector                 → +power supplies
1.8  NVIDIA GPU enrichment         → +GPU serial/VBIOS
1.8b wear/age telemetry            → +SMART hours, NVMe % used, SFP DOM, ECC
1.9  Mellanox NIC enrichment       → +NIC firmware/serial
1.10 RAID enrichment               → +physical disks behind RAID
1.11 output + USB write            → production-ready output

--- PRODUCTION ISO ---

2.4  NVIDIA driver build           → driver compiled into overlay
2.5  network bring-up on boot      → DHCP on all interfaces
2.6  OpenRC boot service           → audit runs on boot automatically
2.7  vendor utilities              → storcli/sas2ircu/mstflint in image
2.8  auto-update                   → binary self-patches from Gitea
2.9  release workflow              → versioning + release notes