Compare commits


48 Commits

Author SHA1 Message Date
Mikhail Chusavitin
4669f14f4f feat(tui): GPU Platform Stress Test — live nvtop chart during test
Apply the same pattern as NVIDIA SAT: launch nvtop via tea.ExecProcess
so it occupies the full terminal as a live GPU chart (temp, power, fan,
utilisation lines) while the stress test runs in the background.

- Add screenGPUStressRunning screen + dedicated running/render handlers
- startGPUStressTest: tea.Batch(stress goroutine, tea.ExecProcess(nvtop))
- [o] reopen nvtop at any time; [a] abort (cancels context)
- Graceful degradation: test still runs if nvtop is not on PATH
- gpuStressDoneMsg routes result to screenOutput on completion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:01:31 +03:00
Mikhail Chusavitin
540a9e39b8 refactor(audit): rename Fan Stress Test → GPU Platform Stress Test
Update all user-facing strings in TUI and ActionResult title.
Internal identifiers (types, functions, file name) unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:56:25 +03:00
Mikhail Chusavitin
58510207fa fix(iso): fall back through ROCm 6.4→6.3→6.2 if repo Release file missing
ROCm 6.4 does not yet publish a Release file for Debian Bookworm, causing
the live-build chroot hook to fail with "does not have a Release file".

Try each version in ROCM_CANDIDATES order; skip to the next if apt-get update
fails (repo unavailable). Exit gracefully if none are available.
Also rename inner 'candidate' variable to 'smi_path' to avoid collision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:52:17 +03:00
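The fallback order described in this commit can be sketched as a "first candidate that passes a probe" loop. The actual hook is a shell script; the Go version below only illustrates the selection logic. `firstAvailable` and `demoProbe` are hypothetical names, and the probe stands in for running `apt-get update` against each repository.

```go
package main

import (
	"errors"
	"fmt"
)

// firstAvailable returns the first candidate for which probe succeeds.
// It returns an error when every candidate fails.
func firstAvailable(candidates []string, probe func(version string) error) (string, error) {
	for _, v := range candidates {
		if err := probe(v); err != nil {
			fmt.Printf("ROCm %s unavailable: %v\n", v, err)
			continue // skip to the next candidate
		}
		return v, nil
	}
	return "", errors.New("no ROCm repository available")
}

// demoProbe is a hypothetical stand-in for "apt-get update" against the repo:
// it pretends that only the 6.4 repository lacks a Release file.
func demoProbe(version string) error {
	if version == "6.4" {
		return errors.New("does not have a Release file")
	}
	return nil
}

func main() {
	v, err := firstAvailable([]string{"6.4", "6.3", "6.2"}, demoProbe)
	if err != nil {
		fmt.Println("skipping ROCm install:", err)
		return
	}
	fmt.Println("using ROCm", v)
}
```

Returning an error instead of aborting lets the caller exit gracefully when no repository is reachable, which matches the behaviour the commit message describes.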
Mikhail Chusavitin
4cd7c9ab4e feat(audit): fan-stress SAT for MSI case-04 fan lag & thermal throttle detection
Two-phase GPU thermal cycling test with per-second telemetry:
- Phases: baseline → load1 → pause (no cooldown) → load2 → cooldown
- Monitors: fan RPM (ipmitool sdr), CPU/server temps (ipmitool/sensors),
  system power (ipmitool dcmi), GPU temp/power/usage/clock/throttle (nvidia-smi)
- Detects throttling via clocks_throttle_reasons.active bitmask
- Measures fan response lag from load start (validates case-04 ~2s lag)
- Exports metrics.csv (wide format, one row/sec) and fan-sensors.csv (long format)
- TUI: adds [F] Fan Stress Test to Health Check screen with Quick/Standard/Express modes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:51:03 +03:00
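A minimal sketch of the bitmask check mentioned above, assuming the `clocks_throttle_reasons.active` field arrives as a hex string such as `0x0000000000000040`. The bit values follow NVML's `nvmlClocksThrottleReason*` constants; which bits count as "throttled" (`throttleMask`) is an assumption made for illustration, not necessarily the set used by the commit.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Bit values as defined by NVML's nvmlClocksThrottleReason* constants.
const (
	reasonHWSlowdown        uint64 = 0x08
	reasonSWThermalSlowdown uint64 = 0x20
	reasonHWThermalSlowdown uint64 = 0x40
	reasonHWPowerBrake      uint64 = 0x80
)

// throttleMask is an assumption: the reasons treated as real throttling here.
const throttleMask = reasonHWSlowdown | reasonSWThermalSlowdown |
	reasonHWThermalSlowdown | reasonHWPowerBrake

// parseThrottleReasons converts the hex field printed by nvidia-smi into a bitmask.
func parseThrottleReasons(field string) (uint64, error) {
	s := strings.ToLower(strings.TrimSpace(field))
	s = strings.TrimPrefix(s, "0x")
	return strconv.ParseUint(s, 16, 64)
}

// isThrottled reports whether any bit of throttleMask is set.
func isThrottled(mask uint64) bool {
	return mask&throttleMask != 0
}

func main() {
	for _, field := range []string{"0x0000000000000001", "0x0000000000000040"} {
		mask, err := parseThrottleReasons(field)
		if err != nil {
			fmt.Println("unparseable field:", field)
			continue
		}
		fmt.Printf("%s throttled=%v\n", field, isThrottled(mask))
	}
}
```

Bit `0x1` (GPU idle) is deliberately left out of the mask, since an idle GPU running at reduced clocks is not a fault condition.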
Mikhail Chusavitin
cfe255f6e4 Release audit/v1.0.5 2026-03-26 09:41:19 +03:00
Mikhail Chusavitin
8b9d3447d7 Overlay SAT results into audit JSON 2026-03-25 20:11:03 +03:00
Mikhail Chusavitin
614b7cad61 Improve PCIe inventory and hardware identity collection 2026-03-25 20:00:38 +03:00
Mikhail Chusavitin
9a1df9b1ba Tighten support bundles and fix AMD runtime checks 2026-03-25 19:35:25 +03:00
Mikhail Chusavitin
30cf014d58 Rename NVIDIA bootloader modes 2026-03-25 19:12:26 +03:00
Mikhail Chusavitin
27d478aed6 Add bootloader choice for safe vs full NVIDIA boot 2026-03-25 19:11:15 +03:00
Mikhail Chusavitin
d36e8442a9 Stabilize live ISO consoles and NVIDIA boot path 2026-03-25 19:05:18 +03:00
Mikhail Chusavitin
b345b0d14d Derive ISO version from git tags 2026-03-25 18:40:48 +03:00
Mikhail Chusavitin
0a1ac2ab9f Bootstrap ROCm hook prerequisites in ISO build 2026-03-25 18:38:19 +03:00
Mikhail Chusavitin
1e62f828c6 Embed MOTD banner into TUI 2026-03-25 18:11:17 +03:00
Mikhail Chusavitin
f8c997d272 Add missing SAT progress TUI helpers 2026-03-25 18:03:45 +03:00
Mikhail Chusavitin
0c16616cc9 1. Verbose live progress during SAT tests (CPU, Memory, Storage, AMD GPU)
- New tui/sat_progress.go: polls {DefaultSATBaseDir}/{prefix}-*/verbose.log every 300ms and parses completed/in-progress steps
  - Busy screen now shows each step as PASS  lscpu (234ms) / FAIL  stress-ng (60.0s) / ...   sensors-after instead of just "Working..."

  2. Test results shown on screen (instead of just "Archive written to /path")
  - RunCPUAcceptancePackResult, RunMemoryAcceptancePackResult, RunStorageAcceptancePackResult, RunAMDAcceptancePackResult now read summary.txt from the run directory and return a formatted per-step result:
  Run: 2025-03-25T10:00:00Z

  PASS  lscpu
  PASS  sensors-before
  FAIL  stress-ng
  PASS  sensors-after

  Overall: FAILED  (ok=3  failed=1)

  3. AMD GPU SAT with auto-detection
  - platform.System.DetectGPUVendor(): checks /dev/nvidia0 → "nvidia", /dev/kfd → "amd"
  - platform.System.RunAMDAcceptancePack(): runs rocm-smi, rocm-smi --showallinfo, dmidecode
  - GPU SAT (G key / GPU row enter) automatically routes to AMD or NVIDIA based on detected vendor
  - "Run All" also auto-detects vendor

  4. Panel detail view
  - GPU detail now shows the most recent (NVIDIA or AMD) SAT result, whichever is newer
  - All SAT detail views use the same human-readable formatSATDetail format
2026-03-25 17:54:27 +03:00
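The vendor auto-detection in item 3 can be sketched as follows. The device paths and return values come from the commit message; the injected `exists` callback and the empty-string result when neither node is present are assumptions made so the logic can be exercised without real hardware.

```go
package main

import (
	"fmt"
	"os"
)

// detectGPUVendor checks the NVIDIA device node first, then the AMD KFD node.
// It returns "" when neither is present (an assumed convention).
func detectGPUVendor(exists func(path string) bool) string {
	switch {
	case exists("/dev/nvidia0"):
		return "nvidia"
	case exists("/dev/kfd"):
		return "amd"
	default:
		return ""
	}
}

// pathExists reports whether the path can be stat'ed.
func pathExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	fmt.Printf("detected GPU vendor: %q\n", detectGPUVendor(pathExists))
}
```

Checking `/dev/nvidia0` first means a host exposing both nodes is routed to the NVIDIA path, which is the order the commit message lists.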
Mikhail Chusavitin
adcc147b32 feat(iso): add AMD Instinct MI250X/MI250 driver support
- firmware-amd-graphics: Aldebaran firmware blobs (fixes amdgpu IB ring
  test errors on MI250/MI250X at boot)
- 9001-amd-rocm.hook.chroot: adds AMD ROCm 6.4 apt repo and installs
  rocm-smi-lib for GPU monitoring (analogous to nvidia-smi)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 15:42:10 +03:00
Mikhail Chusavitin
94e233651e fix(sat): fix nvme device-self-test command flags
--start is not a valid nvme-cli flag; correct syntax is -s 1 (short test).
Add --wait so the command blocks until the test completes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 15:24:52 +03:00
Mikhail Chusavitin
03c36f6cb2 fix(iso): add stress-ng to package list for CPU SAT
stress-ng was missing from the LiveCD — CPU acceptance test exited
immediately with rc=1 because the binary was not found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:50:30 +03:00
Mikhail Chusavitin
a221814797 fix(tui): fix GPU panel row showing AMD chipset devices, clear screen before TUI
isGPUDevice matched all AMD vendor PCIe devices (SATA, crypto coprocessors,
PCIe dummies) because of a broad strings.Contains(vendor,"amd") check.
Remove it — AMD Instinct/Radeon GPUs are caught by ProcessingAccelerator /
DisplayController class. Also exclude ASPEED (BMC VGA adapter).

Add clear before bee-tui to avoid dirty terminal output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:49:09 +03:00
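A rough sketch of the class-based filter this fix describes. The `pciDevice` struct and its string fields are hypothetical; only the class names and the ASPEED exclusion are taken from the commit message, and the real `isGPUDevice` may apply further checks.

```go
package main

import (
	"fmt"
	"strings"
)

// pciDevice is a hypothetical, simplified view of one PCIe inventory entry.
type pciDevice struct {
	Vendor string
	Class  string
}

// isGPUDevice decides by device class, not by vendor name, so AMD SATA
// controllers or crypto coprocessors are no longer reported as GPUs.
func isGPUDevice(d pciDevice) bool {
	// The ASPEED BMC exposes a VGA adapter that is not a compute GPU.
	if strings.Contains(strings.ToLower(d.Vendor), "aspeed") {
		return false
	}
	switch d.Class {
	case "ProcessingAccelerator", "DisplayController":
		return true
	}
	return false
}

func main() {
	devices := []pciDevice{
		{Vendor: "Advanced Micro Devices", Class: "SATAController"},
		{Vendor: "Advanced Micro Devices", Class: "ProcessingAccelerator"},
		{Vendor: "ASPEED Technology", Class: "DisplayController"},
	}
	for _, d := range devices {
		fmt.Printf("%-24s %-22s gpu=%v\n", d.Vendor, d.Class, isGPUDevice(d))
	}
}
```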
Mikhail Chusavitin
b6619d5ccc fix(iso): skip NVIDIA module load when no NVIDIA GPU present
Check PCI vendor 10de before attempting insmod — avoids spurious
nvidia_uvm symbol errors on systems without NVIDIA hardware (e.g. AMD MI350).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:38:31 +03:00
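The vendor check can be illustrated by scanning sysfs for a PCI device whose vendor ID is `10de`. The boot script in the ISO is shell; this Go sketch only shows the idea. `readVendorIDs` and `containsVendor` are hypothetical helper names, and the sysfs layout (`<device>/vendor` containing a value like `0x10de`) is the standard Linux one.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readVendorIDs returns the raw contents of every <root>/<device>/vendor file.
func readVendorIDs(sysfsRoot string) []string {
	paths, _ := filepath.Glob(filepath.Join(sysfsRoot, "*", "vendor"))
	var ids []string
	for _, p := range paths {
		if raw, err := os.ReadFile(p); err == nil {
			ids = append(ids, string(raw))
		}
	}
	return ids
}

// containsVendor reports whether any ID equals "0x<want>", ignoring
// surrounding whitespace and letter case.
func containsVendor(ids []string, want string) bool {
	for _, id := range ids {
		if strings.EqualFold(strings.TrimSpace(id), "0x"+want) {
			return true
		}
	}
	return false
}

func main() {
	ids := readVendorIDs("/sys/bus/pci/devices")
	if containsVendor(ids, "10de") {
		fmt.Println("NVIDIA PCI device present: load modules")
	} else {
		fmt.Println("no NVIDIA PCI device: skip module load")
	}
}
```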
Mikhail Chusavitin
450193b063 feat(iso): remove splash.png, show EASY-BEE ASCII art in GRUB text mode
The graphical splash had "BEE / HARDWARE AUDIT" baked into the PNG,
overriding the echo ASCII art. Replace with a plain black background
so the EASY-BEE block-char banner from grub.cfg echo commands is visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:32:23 +03:00
Mikhail Chusavitin
ee8931f171 fix(iso): pin ISO kernel to same ABI as compiled NVIDIA modules
Export detected DEBIAN_KERNEL_ABI as BEE_KERNEL_ABI from build.sh so
auto/config can pin linux-packages to the exact versioned package
(e.g. linux-image-6.1.0-31 + flavour amd64 = linux-image-6.1.0-31-amd64).
This prevents nvidia.ko vermagic mismatch if the linux-image-amd64
meta-package is updated between build start and lb build chroot step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 12:26:59 +03:00
Mikhail Chusavitin
b771d95894 fix(iso): fix linux-packages to "linux-image" so lb appends flavour correctly
live-build constructs the kernel package as <linux-packages>-<linux-flavours>,
so "linux-image-amd64" + "amd64" = "linux-image-amd64-amd64" (not found).
The correct value is "linux-image" + "amd64" = "linux-image-amd64".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:45:41 +03:00
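The naming rule quoted in this commit is plain concatenation, so the failure and both working configurations can be reproduced in a few lines. `kernelPackage` is a hypothetical helper that mirrors the `<linux-packages>-<linux-flavours>` rule as the commit message states it.

```go
package main

import "fmt"

// kernelPackage mirrors how live-build is described to build the package name.
func kernelPackage(linuxPackages, flavour string) string {
	return linuxPackages + "-" + flavour
}

func main() {
	fmt.Println(kernelPackage("linux-image-amd64", "amd64"))    // linux-image-amd64-amd64 (not found)
	fmt.Println(kernelPackage("linux-image", "amd64"))          // linux-image-amd64
	fmt.Println(kernelPackage("linux-image-6.1.0-31", "amd64")) // linux-image-6.1.0-31-amd64
}
```

The third call corresponds to the ABI-pinned form introduced by the later commit ee8931f171.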
Mikhail Chusavitin
8e60e474dc feat(iso): rebrand to EASY-BEE with ASCII art banner
Replace "Bee Hardware Audit" branding with EASY-BEE across bootloader
and LiveCD: grub.cfg menu entries, echo ASCII art before menu,
motd banner, iso-volume and iso-application metadata.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:45:12 +03:00
Mikhail Chusavitin
2f4ec2acda fix(iso): auto-detect and install kernel headers at build time
- Dockerfile: linux-headers-amd64 meta-package instead of pinned ABI;
  remove DEBIAN_KERNEL_ABI build-arg (no longer needed at image build time)
- build-in-container.sh: drop --build-arg DEBIAN_KERNEL_ABI
- build.sh: apt-get update + detect ABI from apt-cache at build time;
  auto-install linux-headers-<ABI> if kernel changed since image build

Image rebuild is now needed only when changing Go version or lb tools,
not on every Debian kernel point release.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:25:29 +03:00
Mikhail Chusavitin
7ed5cb0306 fix(iso): auto-detect kernel ABI at build time instead of pinning
DEBIAN_KERNEL_ABI=auto in VERSIONS — build.sh queries
apt-cache depends linux-image-amd64 to find the current ABI.
lb config now uses linux-image-amd64 meta-package.

This prevents build failures when Debian drops old kernel packages
from the repo (happens with every point release).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:17:29 +03:00
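One way to picture the ABI detection: take the text printed by `apt-cache depends linux-image-amd64` and extract the versioned package it depends on. The build script does this in shell; the Go sketch below assumes the usual output shape (a `Depends: linux-image-<ABI>-amd64` line), and `kernelABI` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// abiRe matches a versioned kernel image package and captures the ABI part,
// e.g. "6.1.0-44" in "linux-image-6.1.0-44-amd64".
var abiRe = regexp.MustCompile(`linux-image-(\d+\.\d+\.\d+-\d+)-amd64`)

// kernelABI extracts the ABI from apt-cache depends output.
func kernelABI(aptCacheDepends string) (string, bool) {
	m := abiRe.FindStringSubmatch(aptCacheDepends)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Sample text in the assumed output format, not captured from a real run.
	sample := "linux-image-amd64\n  Depends: linux-image-6.1.0-44-amd64\n"
	abi, ok := kernelABI(sample)
	fmt.Println(abi, ok)
}
```

The meta-package name on the first line does not match because the pattern requires a numeric version after `linux-image-`.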
Mikhail Chusavitin
6df7ac68f5 fix(iso): bump kernel ABI to 6.1.0-44 (6.1.164-1 in bookworm)
6.1.0-43 is no longer available in Debian repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:16:09 +03:00
Mikhail Chusavitin
0ce23aea4f feat(iso): add exfatprogs and ntfs-3g for USB export support
exFAT is the default filesystem on USB drives >32GB sold today.
Without exfatprogs, mount fails silently and export to such drives is broken.
ntfs-3g covers Windows-formatted drives.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:12:51 +03:00
Mikhail Chusavitin
36dff6e584 feat: CPU SAT via stress-ng + BMC version via ipmitool
BMC:
- collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent
- collector/collector.go: append BMC firmware record to snap.Firmware
- app/panel.go: show BMC version in TUI right-panel header alongside BIOS

CPU SAT:
- platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng
- app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated
- app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data
- tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label
- tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult
- tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration
- cmd/bee/main.go: bee sat cpu [--duration N] subcommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:06:12 +03:00
Mikhail Chusavitin
1c80906c1f feat(tui): rebuild TUI around hardware diagnostics (Health Check + two-column layout)
- Replace 12-item flat menu with 4-item main menu: Health Check, Export support bundle, Settings, Exit
- Add Health Check screen (Lenovo-style): per-component checkboxes (GPU/MEM/DISK/CPU), Quick/Standard/Express modes, Run All, letter hotkeys G/M/S/C/R/A/1/2/3
- Add two-column main screen: left = menu, right = hardware panel with colored PASS/FAIL/CANCEL/N/A status per component; Tab/→ switches focus, Enter opens component detail
- Add app.LoadHardwarePanel() + ComponentDetailResult() reading audit JSON and SAT summary.txt files
- Move Network/Services/audit actions into Settings submenu
- Export: support bundle only (remove separate audit JSON export)
- Delete screen_acceptance.go; add screen_health_check.go, screen_settings.go, app/panel.go
- Add BMC + CPU stress-ng tests to backlog
- Update bible submodule
- Rewrite tui_test.go for new screen/action structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 10:59:21 +03:00
Mikhail Chusavitin
2abe2ce3aa fix(iso): fix NCCL version to 2.28.9+cuda13.0, add sha256 verification
NVIDIA's CUDA repo for Debian 12 only has NCCL packages for cuda13.x,
not cuda12.x. Update to the latest available: 2.28.9-1+cuda13.0.
Also pass sha256 from VERSIONS into build-nccl.sh for integrity check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 12:04:03 +03:00
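The integrity check added here boils down to comparing a computed SHA-256 digest with the pinned value. `build-nccl.sh` is a shell script, so the following is only a Go illustration of the comparison; `digestMatches` is a hypothetical helper, and the demo input is the well-known test vector for the string `abc`, not the NCCL package hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// digestMatches reports whether the SHA-256 of data equals wantHex.
// The comparison ignores letter case and surrounding whitespace.
func digestMatches(data []byte, wantHex string) bool {
	sum := sha256.Sum256(data)
	return strings.EqualFold(hex.EncodeToString(sum[:]), strings.TrimSpace(wantHex))
}

func main() {
	const want = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(digestMatches([]byte("abc"), want))
	fmt.Println(digestMatches([]byte("abd"), want))
}
```

For a large `.deb`, streaming the file through `sha256.New()` with `io.Copy` would avoid holding the whole download in memory.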
Mikhail Chusavitin
8233c9ee85 feat(iso): add NCCL 2.26.2 to LiveCD
Download libnccl2 .deb from NVIDIA's CUDA apt repo (Debian 12) during ISO
build, extract libnccl.so.* into the overlay at /usr/lib/ alongside
libnvidia-ml and libcuda. Version pinned in VERSIONS, reflected in
/etc/bee-release.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 09:51:28 +03:00
Mikhail Chusavitin
13189e2683 fix(iso): pet hardware watchdog via systemd RuntimeWatchdogSec=30s
Without a keepalive the kernel watchdog timer expires and reboots
the host mid-audit. Configuring RuntimeWatchdogSec lets systemd PID 1
reset /dev/watchdog every 30 s — well within the typical 60 s timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 23:56:42 +03:00
Mikhail Chusavitin
76a17937f3 feat(tui): NVIDIA SAT with nvtop, GPU selection, metrics and chart — v1.0.0
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 15:18:57 +03:00
Mikhail Chusavitin
b965184e71 feat: wrap chart viewer in web shell 2026-03-16 18:26:05 +03:00
Mikhail Chusavitin
b25a2f6d30 feat: add support bundle and raw audit export 2026-03-16 18:20:26 +03:00
Mikhail Chusavitin
d18cde19c1 Drop legacy non-container builders 2026-03-16 00:23:55 +03:00
Mikhail Chusavitin
78c6dfc0ef Sync hardware ingest contract v2.7 2026-03-15 23:03:38 +03:00
Mikhail Chusavitin
72cf482ad3 Embed Reanimator Chart web viewer 2026-03-15 22:07:42 +03:00
Mikhail Chusavitin
a6023372b1 Use microcode as CPU firmware 2026-03-15 21:16:17 +03:00
Mikhail Chusavitin
ab5a4be7ac Align hardware export with ingest contract 2026-03-15 21:04:53 +03:00
Mikhail Chusavitin
b8c235b5ac Add TUI hardware banner and polish SAT summaries 2026-03-15 14:27:01 +03:00
Mikhail Chusavitin
b483e2ce35 Add health verdicts and acceptance tests 2026-03-14 17:53:58 +03:00
Mikhail Chusavitin
17f0bda45e Update docs for current LiveCD flow 2026-03-14 16:28:30 +03:00
Mikhail Chusavitin
591164a251 Rename ISO volume to BEE 2026-03-14 14:58:49 +03:00
Mikhail Chusavitin
ef4ec5695d Remove broken TUI log redirection 2026-03-14 14:57:31 +03:00
Mikhail Chusavitin
f1e096cabe Keep live TUI logs off the console 2026-03-14 14:51:25 +03:00
115 changed files with 13201 additions and 816 deletions

3
.gitmodules vendored

@@ -1,3 +1,6 @@
[submodule "bible"]
path = bible
url = https://git.mchus.pro/mchus/bible.git
[submodule "internal/chart"]
path = internal/chart
url = https://git.mchus.pro/reanimator/chart.git

41
PLAN.md

@@ -4,13 +4,13 @@ Hardware audit LiveCD for offline server inventory.
Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
**Principle:** OS-level collection — reads hardware directly, not through BMC.
Fully unattended — no user interaction required at any stage. Boot → update → audit → output → done.
All errors are logged, never presented interactively. Every failure path has a silent fallback.
Automatic boot audit plus operator console. Boot runs audit immediately, but local/SSH operators can rerun checks through the TUI and CLI.
Errors are logged and should not block boot on partial collector failures.
Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.
---
## Status snapshot (2026-03-06)
## Status snapshot (2026-03-14)
### Phase 1 — Go Audit Binary
@@ -23,8 +23,10 @@ Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials,
- 1.7 PSU collector — **DONE (basic FRU path)**
- 1.8 NVIDIA GPU enrichment — **DONE**
- 1.8b Component wear / age telemetry — **DONE** (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
- 1.8c Storage health verdicts — **DONE** (SMART/NVMe warning/failed status derivation)
- 1.9 Mellanox/NVIDIA NIC enrichment — **DONE** (mstflint + ethtool firmware fallback)
- 1.10 RAID controller enrichment — **DONE (initial multi-tool support)** (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
- 1.11 PSU SDR health — **DONE** (`ipmitool sdr` merged with FRU inventory)
- 1.11 Output and export workflow — **DONE** (explicit file output + manual removable export via TUI)
- 1.12 Integration test (local) — **DONE** (`scripts/test-local.sh`)
@@ -33,9 +35,14 @@ Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials,
- Current implementation uses Debian 12 `live-build`, `systemd`, and OpenSSH.
- Network bring-up on boot — **DONE**
- Boot services (`bee-network`, `bee-nvidia`, `bee-audit`, `bee-sshsetup`) — **DONE**
- Local console UX (`bee` autologin on `tty1`, `menu` auto-start, TUI privilege escalation via `sudo -n`) — **DONE**
- VM/debug support (`qemu-guest-agent`, serial console, virtual GPU initramfs modules) — **DONE**
- Vendor utilities in overlay — **DONE**
- Build metadata + staged overlay injection — **DONE**
- Builder container cache persisted outside container writable layer — **DONE**
- ISO volume label `BEE` — **DONE**
- Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior.
- Real-hardware validation remains **PENDING**; current validation is limited to local/libvirt VM boot + service checks.
---
@@ -265,13 +272,10 @@ ISO image bootable via BMC virtual media or USB. Runs boot services automaticall
### 2.1 — Builder environment
`iso/builder/setup-builder.sh` prepares a Debian 12 host/VM with:
- `live-build`, `debootstrap`, bootloader tooling, kernel headers
- Go toolchain
- everything needed to compile the `bee` binary and NVIDIA modules
`iso/builder/build-in-container.sh` offers the same builder stack in a Debian 12 container image.
The container run is privileged because `live-build` needs mount/chroot/loop capabilities.
`iso/builder/build-in-container.sh` is the only supported builder entrypoint.
It builds a Debian 12 builder image with `live-build`, toolchains, and pinned kernel headers,
then runs the ISO assembly in a privileged container because `live-build` needs
mount/chroot/loop capabilities.
`iso/builder/build.sh` orchestrates the full ISO build:
1. compile the Go `bee` binary
@@ -334,8 +338,14 @@ Planned code shape:
### 2.5 — Operator workflows
- Automatic boot audit writes JSON to `/var/log/bee-audit.json`
- `tty1` autologins into `bee` and auto-runs `menu`
- `menu` launches the LiveCD wrapper `bee-tui`, which escalates to `root` via `sudo -n`
- `bee tui` can rerun the audit manually
- `bee tui` can export the latest audit JSON to removable media
- `bee tui` can show health summary and run NVIDIA/memory/storage acceptance tests
- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-stress`
- SAT summaries now expose `overall_status` plus per-job `OK/FAILED/UNSUPPORTED`
- Memory/GPU SAT runtime defaults can be overridden via `BEE_MEMTESTER_*` and `BEE_GPU_STRESS_*`
- removable export requires explicit target selection, mount, confirmation, copy, and cleanup
### 2.6 — Vendor utilities and optional assets
@@ -343,7 +353,9 @@ Planned code shape:
Optional binaries live in `iso/vendor/` and are included when present:
- `storcli64`
- `sas2ircu`, `sas3ircu`
- `mstflint`
- `arcconf`
- `ssacli`
- `mstflint` (via Debian package set)
Missing optional tools do not fail the build or boot.
@@ -358,6 +370,7 @@ Missing optional tools do not fail the build or boot.
Current release model:
- shipping a new ISO means a full rebuild
- build metadata is embedded into `/etc/bee-release` and `motd`
- current ISO label is `BEE`
- binary self-update remains deferred; no automatic USB/network patching is part of the current runtime
---
@@ -374,9 +387,9 @@ No "works on my Mac" drift.
1.2 board collector → first real data
1.3 CPU collector → +CPUs
--- BUILDER + DEBUG ISO (unblock real-hardware testing) ---
--- BUILDER + BEE ISO (unblock real-hardware testing) ---
2.1 builder setup → Debian host/VM or privileged container with build deps
2.1 builder setup → privileged container with build deps
2.2 debug ISO profile → minimal Debian ISO: `bee` binary + OpenSSH + all packages
2.3 boot on real server → SSH in, verify packages present, run audit manually
@@ -397,7 +410,7 @@ No "works on my Mac" drift.
2.4 NVIDIA driver build → driver compiled into overlay
2.5 network bring-up on boot → DHCP on all interfaces
2.6 systemd boot service → audit runs on boot automatically
2.7 vendor utilities → storcli/sas2ircu/mstflint in image
2.7 vendor utilities → storcli/sas2ircu/arcconf/ssacli in image
2.8 release workflow → versioning + release notes
2.9 operator export flow → explicit TUI export to removable media
```

View File

@@ -12,6 +12,7 @@ import (
"bee/audit/internal/platform"
"bee/audit/internal/runtimeenv"
"bee/audit/internal/tui"
"bee/audit/internal/webui"
)
var Version = "dev"
@@ -27,11 +28,14 @@ func run(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 {
printRootUsage(stderr)
return 1
return 2
}
switch args[0] {
case "help", "--help", "-h":
if len(args) > 1 {
return runHelp(args[1:], stdout, stderr)
}
printRootUsage(stdout)
return 0
case "audit":
@@ -40,6 +44,12 @@ func run(args []string, stdout, stderr io.Writer) int {
return runTUI(args[1:], stdout, stderr)
case "export":
return runExport(args[1:], stdout, stderr)
case "preflight":
return runPreflight(args[1:], stdout, stderr)
case "support-bundle":
return runSupportBundle(args[1:], stdout, stderr)
case "web":
return runWeb(args[1:], stdout, stderr)
case "sat":
return runSAT(args[1:], stdout, stderr)
case "version", "--version", "-version":
@@ -48,17 +58,47 @@ func run(args []string, stdout, stderr io.Writer) int {
default:
fmt.Fprintf(stderr, "bee: unknown command %q\n\n", args[0])
printRootUsage(stderr)
return 1
return 2
}
}
func printRootUsage(w io.Writer) {
fmt.Fprintln(w, `bee commands:
bee audit --runtime auto|local|livecd --output stdout|file:<path>
bee preflight --output stdout|file:<path>
bee tui --runtime auto|local|livecd
bee export --target <device>
bee sat nvidia
bee version`)
bee support-bundle --output stdout|file:<path>
bee web --listen :80 --audit-path `+app.DefaultAuditJSONPath+`
bee sat nvidia|memory|storage|cpu [--duration <seconds>]
bee version
bee help [command]`)
}
func runHelp(args []string, stdout, stderr io.Writer) int {
switch args[0] {
case "audit":
return runAudit([]string{"--help"}, stdout, stdout)
case "tui":
return runTUI([]string{"--help"}, stdout, stdout)
case "export":
return runExport([]string{"--help"}, stdout, stdout)
case "preflight":
return runPreflight([]string{"--help"}, stdout, stdout)
case "support-bundle":
return runSupportBundle([]string{"--help"}, stdout, stdout)
case "web":
return runWeb([]string{"--help"}, stdout, stdout)
case "sat":
return runSAT([]string{"--help"}, stdout, stderr)
case "version":
fmt.Fprintln(stdout, "usage: bee version")
return 0
default:
fmt.Fprintf(stderr, "bee help: unknown command %q\n\n", args[0])
printRootUsage(stderr)
return 2
}
}
func runAudit(args []string, stdout, stderr io.Writer) int {
@@ -72,6 +112,13 @@ func runAudit(args []string, stdout, stderr io.Writer) int {
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
if *showVersion {
@@ -107,6 +154,13 @@ func runTUI(args []string, stdout, stderr io.Writer) int {
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
@@ -116,6 +170,10 @@ func runTUI(args []string, stdout, stderr io.Writer) int {
return 1
}
slog.SetDefault(slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{
Level: slog.LevelInfo,
})))
application := app.New(platform.New())
if err := tui.Run(application, runtimeInfo.Mode); err != nil {
slog.Error("run tui", "err", err)
@@ -133,6 +191,13 @@ func runExport(args []string, stdout, stderr io.Writer) int {
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
if strings.TrimSpace(*targetDevice) == "" {
@@ -164,22 +229,175 @@ func runExport(args []string, stdout, stderr io.Writer) int {
return 1
}
func runSAT(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 || args[0] == "help" || args[0] == "--help" || args[0] == "-h" {
fmt.Fprintln(stderr, "usage: bee sat nvidia")
func runPreflight(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("preflight", flag.ContinueOnError)
fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee preflight [--output stdout|file:%s]\n", app.DefaultRuntimeJSONPath)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if args[0] != "nvidia" {
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", args[0])
fmt.Fprintln(stderr, "usage: bee sat nvidia")
if fs.NArg() != 0 {
fs.Usage()
return 2
}
application := app.New(platform.New())
archive, err := application.RunNvidiaAcceptancePack("")
path, err := application.RunRuntimePreflight(*output)
if err != nil {
slog.Error("run nvidia sat", "err", err)
slog.Error("run preflight", "err", err)
return 1
}
slog.Info("nvidia sat archive written", "path", archive)
if path != "stdout" {
slog.Info("runtime health written", "path", path)
}
return 0
}
func runSupportBundle(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("support-bundle", flag.ContinueOnError)
fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee support-bundle [--output stdout|file:<path>]")
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
path, err := app.BuildSupportBundle(app.DefaultExportDir)
if err != nil {
slog.Error("build support bundle", "err", err)
return 1
}
defer os.Remove(path)
raw, err := os.ReadFile(path)
if err != nil {
slog.Error("read support bundle", "err", err)
return 1
}
switch {
case *output == "stdout":
if _, err := stdout.Write(raw); err != nil {
slog.Error("write support bundle stdout", "err", err)
return 1
}
case strings.HasPrefix(*output, "file:"):
dst := strings.TrimPrefix(*output, "file:")
if err := os.WriteFile(dst, raw, 0644); err != nil {
slog.Error("write support bundle", "err", err)
return 1
}
slog.Info("support bundle written", "path", dst)
default:
fmt.Fprintln(stderr, "bee support-bundle: unknown output destination")
fs.Usage()
return 2
}
return 0
}
func runWeb(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("web", flag.ContinueOnError)
fs.SetOutput(stderr)
listenAddr := fs.String("listen", ":8080", "listen address, e.g. :80")
auditPath := fs.String("audit-path", app.DefaultAuditJSONPath, "path to the latest audit JSON snapshot")
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles")
title := fs.String("title", "Bee Hardware Audit", "page title")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee web [--listen :80] [--audit-path %s] [--export-dir %s] [--title \"Bee Hardware Audit\"]\n", app.DefaultAuditJSONPath, app.DefaultExportDir)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
slog.Info("starting bee web", "listen", *listenAddr, "audit_path", *auditPath)
if err := webui.ListenAndServe(*listenAddr, webui.HandlerOptions{
Title: *title,
AuditPath: *auditPath,
ExportDir: *exportDir,
}); err != nil {
slog.Error("run web", "err", err)
return 1
}
return 0
}
func runSAT(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 {
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
return 2
}
if args[0] == "help" || args[0] == "--help" || args[0] == "-h" {
fmt.Fprintln(stdout, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
return 0
}
fs := flag.NewFlagSet("sat", flag.ContinueOnError)
fs.SetOutput(stderr)
duration := fs.Int("duration", 0, "stress-ng duration in seconds (cpu only; default: 60)")
if err := fs.Parse(args[1:]); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fmt.Fprintf(stderr, "bee sat: unexpected arguments\n")
return 2
}
target := args[0]
if target != "nvidia" && target != "memory" && target != "storage" && target != "cpu" {
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", target)
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
return 2
}
application := app.New(platform.New())
var (
archive string
err error
)
switch target {
case "nvidia":
archive, err = application.RunNvidiaAcceptancePack("")
case "memory":
archive, err = application.RunMemoryAcceptancePack("")
case "storage":
archive, err = application.RunStorageAcceptancePack("")
case "cpu":
dur := *duration
if dur <= 0 {
dur = 60
}
archive, err = application.RunCPUAcceptancePack("", dur)
}
if err != nil {
slog.Error("run sat", "target", target, "err", err)
return 1
}
slog.Info("sat archive written", "target", target, "path", archive)
return 0
}
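The subcommands in this diff share one exit-code convention: `0` for success or an explicit help request, `2` for usage errors (bad flags or unexpected positional arguments), and `1` for runtime failures. A self-contained sketch of that pattern, using a hypothetical `demo` subcommand and an `exitCode` helper that exist only for illustration:

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// runDemo follows the same parse/validate shape as the bee subcommands.
func runDemo(args []string, stdout, stderr io.Writer) int {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	fs.SetOutput(stderr)
	output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
	// Usage must be set explicitly because it is also called directly below.
	fs.Usage = func() {
		fmt.Fprintln(stderr, "usage: demo [--output stdout|file:<path>]")
		fs.PrintDefaults()
	}
	if err := fs.Parse(args); err != nil {
		if errors.Is(err, flag.ErrHelp) {
			return 0 // -h/--help is not an error
		}
		return 2 // malformed or unknown flag
	}
	if fs.NArg() != 0 {
		fs.Usage()
		return 2 // unexpected positional arguments
	}
	fmt.Fprintln(stdout, "output:", *output)
	return 0
}

// exitCode runs runDemo with all output discarded.
func exitCode(args ...string) int {
	return runDemo(args, io.Discard, io.Discard)
}

func main() {
	os.Exit(runDemo(os.Args[1:], os.Stdout, os.Stderr))
}
```

Using `flag.ContinueOnError` instead of `flag.ExitOnError` keeps the exit code under the caller's control, which is what lets tests assert on the return value of `run`.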

View File

@@ -24,8 +24,8 @@ func TestRunNoArgsPrintsUsage(t *testing.T) {
var stdout, stderr bytes.Buffer
rc := run(nil, &stdout, &stderr)
if rc != 1 {
t.Fatalf("rc=%d want 1", rc)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "bee commands:") {
t.Fatalf("stderr missing root usage:\n%s", stderr.String())
@@ -37,8 +37,8 @@ func TestRunUnknownCommand(t *testing.T) {
var stdout, stderr bytes.Buffer
rc := run([]string{"wat"}, &stdout, &stderr)
if rc != 1 {
t.Fatalf("rc=%d want 1", rc)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), `unknown command "wat"`) {
t.Fatalf("stderr missing unknown command message:\n%s", stderr.String())
@@ -86,11 +86,63 @@ func TestRunSATUsage(t *testing.T) {
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee sat nvidia") {
if !strings.Contains(stderr.String(), "usage: bee sat nvidia|memory|storage") {
t.Fatalf("stderr missing sat usage:\n%s", stderr.String())
}
}
func TestRunPreflightRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"preflight", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee preflight") {
t.Fatalf("stderr missing preflight usage:\n%s", stderr.String())
}
}
func TestRunSupportBundleRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"support-bundle", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee support-bundle") {
t.Fatalf("stderr missing support-bundle usage:\n%s", stderr.String())
}
}
func TestRunHelpForSubcommand(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"help", "export"}, &stdout, &stderr)
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if !strings.Contains(stdout.String(), "usage: bee export --target <device>") {
t.Fatalf("stdout missing export help:\n%s", stdout.String())
}
}
func TestRunHelpUnknownSubcommand(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"help", "wat"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), `bee help: unknown command "wat"`) {
t.Fatalf("stderr missing help error:\n%s", stderr.String())
}
}
func TestRunSATUnknownTarget(t *testing.T) {
t.Parallel()
@@ -104,6 +156,32 @@ func TestRunSATUnknownTarget(t *testing.T) {
}
}
func TestRunSATHelp(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"sat", "--help"}, &stdout, &stderr)
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if !strings.Contains(stdout.String(), "usage: bee sat nvidia|memory|storage|cpu") {
t.Fatalf("stdout missing sat help:\n%s", stdout.String())
}
}
func TestRunSATRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"sat", "memory", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "bee sat: unexpected arguments") {
t.Fatalf("stderr missing sat error:\n%s", stderr.String())
}
}
func TestRunAuditInvalidRuntime(t *testing.T) {
t.Parallel()
@@ -113,3 +191,29 @@ func TestRunAuditInvalidRuntime(t *testing.T) {
t.Fatalf("rc=%d want 1", rc)
}
}
func TestRunAuditRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"audit", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee audit") {
t.Fatalf("stderr missing audit usage:\n%s", stderr.String())
}
}
func TestRunExportRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"export", "--target", "/dev/sdb1", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee export --target <device>") {
t.Fatalf("stderr missing export usage:\n%s", stderr.String())
}
}


@@ -1,12 +1,16 @@
module bee/audit
go 1.23
go 1.24.0
replace reanimator/chart => ../internal/chart
require github.com/charmbracelet/bubbletea v1.3.4
require github.com/charmbracelet/lipgloss v1.0.0
require reanimator/chart v0.0.0
require (
github.com/aymanbagabas/go-osc52/v2 v2.0.1 // indirect
github.com/charmbracelet/lipgloss v1.0.0 // indirect
github.com/charmbracelet/lipgloss v1.0.0 // promoted to a direct dependency; used for TUI colors
github.com/charmbracelet/x/ansi v0.8.0 // indirect
github.com/charmbracelet/x/term v0.2.1 // indirect
github.com/erikgeiser/coninput v0.0.0-20211004153227-1c3628e74d0f // indirect


@@ -1,10 +1,13 @@
package app
import (
"context"
"encoding/json"
"fmt"
"log/slog"
"os"
"path/filepath"
"sort"
"strconv"
"strings"
"time"
@@ -12,11 +15,21 @@ import (
"bee/audit/internal/collector"
"bee/audit/internal/platform"
"bee/audit/internal/runtimeenv"
"bee/audit/internal/schema"
)
const (
DefaultAuditJSONPath = "/var/log/bee-audit.json"
DefaultAuditLogPath = "/var/log/bee-audit.log"
var (
DefaultExportDir = "/appdata/bee/export"
DefaultAuditJSONPath = DefaultExportDir + "/bee-audit.json"
DefaultAuditLogPath = DefaultExportDir + "/bee-audit.log"
DefaultWebLogPath = DefaultExportDir + "/bee-web.log"
DefaultNetworkLogPath = DefaultExportDir + "/bee-network.log"
DefaultNvidiaLogPath = DefaultExportDir + "/bee-nvidia.log"
DefaultSSHLogPath = DefaultExportDir + "/bee-sshsetup.log"
DefaultRuntimeJSONPath = DefaultExportDir + "/runtime-health.json"
DefaultRuntimeLogPath = DefaultExportDir + "/runtime-health.log"
DefaultTechDumpDir = DefaultExportDir + "/techdump"
DefaultSATBaseDir = DefaultExportDir + "/bee-sat"
)
type App struct {
@@ -25,6 +38,7 @@ type App struct {
exports exportManager
tools toolManager
sat satRunner
runtime runtimeChecker
}
type ActionResult struct {
@@ -58,6 +72,20 @@ type toolManager interface {
type satRunner interface {
RunNvidiaAcceptancePack(baseDir string) (string, error)
RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (string, error)
RunMemoryAcceptancePack(baseDir string) (string, error)
RunStorageAcceptancePack(baseDir string) (string, error)
RunCPUAcceptancePack(baseDir string, durationSec int) (string, error)
ListNvidiaGPUs() ([]platform.NvidiaGPU, error)
DetectGPUVendor() string
ListAMDGPUs() ([]platform.AMDGPUInfo, error)
RunAMDAcceptancePack(baseDir string) (string, error)
RunFanStressTest(ctx context.Context, baseDir string, opts platform.FanStressOptions) (string, error)
}
type runtimeChecker interface {
CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error)
CaptureTechnicalDump(baseDir string) error
}
func New(platform *platform.System) *App {
@@ -67,11 +95,21 @@ func New(platform *platform.System) *App {
exports: platform,
tools: platform,
sat: platform,
runtime: platform,
}
}
func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, error) {
if runtimeMode == runtimeenv.ModeLiveCD {
if err := a.runtime.CaptureTechnicalDump(DefaultTechDumpDir); err != nil {
slog.Warn("capture technical dump", "err", err)
}
}
result := collector.Run(runtimeMode)
applyLatestSATStatuses(&result.Hardware, DefaultSATBaseDir)
if health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath); err == nil {
result.Runtime = &health
}
data, err := json.MarshalIndent(result, "", " ")
if err != nil {
return "", err
@@ -83,6 +121,9 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
return "stdout", err
case strings.HasPrefix(output, "file:"):
path := strings.TrimPrefix(output, "file:")
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return "", err
}
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
return "", err
}
@@ -92,6 +133,72 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
}
}
func (a *App) RunRuntimePreflight(output string) (string, error) {
health, err := a.runtime.CollectRuntimeHealth(DefaultExportDir)
if err != nil {
return "", err
}
data, err := json.MarshalIndent(health, "", " ")
if err != nil {
return "", err
}
switch {
case output == "stdout":
_, err := os.Stdout.Write(append(data, '\n'))
return "stdout", err
case strings.HasPrefix(output, "file:"):
path := strings.TrimPrefix(output, "file:")
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return "", err
}
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
return "", err
}
return path, nil
default:
return "", fmt.Errorf("unknown output destination %q — use stdout or file:<path>", output)
}
}
func (a *App) RunRuntimePreflightResult() (ActionResult, error) {
path, err := a.RunRuntimePreflight("file:" + DefaultRuntimeJSONPath)
body := "Runtime preflight completed."
if path != "" {
body = "Runtime health written to " + path
}
return ActionResult{Title: "Run self-check", Body: body}, err
}
func (a *App) RuntimeHealthResult() ActionResult {
health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath)
if err != nil {
return ActionResult{Title: "Runtime issues", Body: "No runtime health found."}
}
driverLabel := "Driver ready"
accelLabel := "CUDA ready"
switch a.sat.DetectGPUVendor() {
case "amd":
driverLabel = "AMDGPU ready"
accelLabel = "ROCm SMI ready"
case "nvidia":
driverLabel = "NVIDIA ready"
}
var body strings.Builder
fmt.Fprintf(&body, "Status: %s\n", firstNonEmpty(health.Status, "UNKNOWN"))
fmt.Fprintf(&body, "Export dir: %s\n", firstNonEmpty(health.ExportDir, DefaultExportDir))
fmt.Fprintf(&body, "%s: %t\n", driverLabel, health.DriverReady)
fmt.Fprintf(&body, "%s: %t\n", accelLabel, health.CUDAReady)
fmt.Fprintf(&body, "Network: %s", firstNonEmpty(health.NetworkStatus, "UNKNOWN"))
if len(health.Issues) > 0 {
body.WriteString("\n\nIssues:\n")
for _, issue := range health.Issues {
fmt.Fprintf(&body, "- %s: %s\n", issue.Code, issue.Description)
}
}
return ActionResult{Title: "Runtime issues", Body: strings.TrimSpace(body.String())}
}
func (a *App) RunAuditNow(runtimeMode runtimeenv.Mode) (ActionResult, error) {
path, err := a.RunAudit(runtimeMode, "file:"+DefaultAuditJSONPath)
body := "Audit completed."
@@ -124,7 +231,29 @@ func (a *App) ExportLatestAudit(target platform.RemovableTarget) (string, error)
func (a *App) ExportLatestAuditResult(target platform.RemovableTarget) (ActionResult, error) {
path, err := a.ExportLatestAudit(target)
return ActionResult{Title: "Export audit", Body: "Audit exported to " + path}, err
body := "Audit exported."
if path != "" {
body = "Audit exported to " + path
}
return ActionResult{Title: "Export audit", Body: body}, err
}
func (a *App) ExportSupportBundle(target platform.RemovableTarget) (string, error) {
archive, err := BuildSupportBundle(DefaultExportDir)
if err != nil {
return "", err
}
defer os.Remove(archive)
return a.exports.ExportFileToTarget(archive, target)
}
func (a *App) ExportSupportBundleResult(target platform.RemovableTarget) (ActionResult, error) {
path, err := a.ExportSupportBundle(target)
body := "Support bundle exported. USB target unmounted and safe to remove."
if path != "" {
body = "Support bundle exported to " + path + ".\n\nUSB target unmounted and safe to remove."
}
return ActionResult{Title: "Export support bundle", Body: body}, err
}
func (a *App) ListInterfaces() ([]platform.InterfaceInfo, error) {
@@ -141,7 +270,7 @@ func (a *App) DHCPOne(iface string) (string, error) {
func (a *App) DHCPOneResult(iface string) (ActionResult, error) {
body, err := a.network.DHCPOne(iface)
return ActionResult{Title: "DHCP on " + iface, Body: body}, err
return ActionResult{Title: "DHCP: " + iface, Body: bodyOr(body, "DHCP completed.")}, err
}
func (a *App) DHCPAll() (string, error) {
@@ -150,7 +279,7 @@ func (a *App) DHCPAll() (string, error) {
func (a *App) DHCPAllResult() (ActionResult, error) {
body, err := a.network.DHCPAll()
return ActionResult{Title: "DHCP all interfaces", Body: body}, err
return ActionResult{Title: "DHCP: all interfaces", Body: bodyOr(body, "DHCP completed.")}, err
}
func (a *App) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
@@ -159,7 +288,7 @@ func (a *App) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
func (a *App) SetStaticIPv4Result(cfg platform.StaticIPv4Config) (ActionResult, error) {
body, err := a.network.SetStaticIPv4(cfg)
return ActionResult{Title: "Static IPv4 on " + cfg.Interface, Body: body}, err
return ActionResult{Title: "Static IPv4: " + cfg.Interface, Body: bodyOr(body, "Static IPv4 updated.")}, err
}
func (a *App) NetworkStatus() (ActionResult, error) {
@@ -167,6 +296,9 @@ func (a *App) NetworkStatus() (ActionResult, error) {
if err != nil {
return ActionResult{Title: "Network status"}, err
}
if len(ifaces) == 0 {
return ActionResult{Title: "Network status", Body: "No physical interfaces found."}, nil
}
var body strings.Builder
for _, iface := range ifaces {
ipv4 := "(no IPv4)"
@@ -216,7 +348,7 @@ func (a *App) ServiceStatus(name string) (string, error) {
func (a *App) ServiceStatusResult(name string) (ActionResult, error) {
body, err := a.services.ServiceStatus(name)
return ActionResult{Title: "service: " + name, Body: body}, err
return ActionResult{Title: "service status: " + name, Body: bodyOr(body, "No status output.")}, err
}
func (a *App) ServiceDo(name string, action platform.ServiceAction) (string, error) {
@@ -225,7 +357,7 @@ func (a *App) ServiceDo(name string, action platform.ServiceAction) (string, err
func (a *App) ServiceActionResult(name string, action platform.ServiceAction) (ActionResult, error) {
body, err := a.services.ServiceDo(name, action)
return ActionResult{Title: "service: " + name, Body: body}, err
return ActionResult{Title: "service " + string(action) + ": " + name, Body: bodyOr(body, "Action completed.")}, err
}
func (a *App) ListRemovableTargets() ([]platform.RemovableTarget, error) {
@@ -241,6 +373,9 @@ func (a *App) CheckTools(names []string) []platform.ToolStatus {
}
func (a *App) ToolCheckResult(names []string) ActionResult {
if len(names) == 0 {
return ActionResult{Title: "Required tools", Body: "No tools checked."}
}
var body strings.Builder
for _, tool := range a.tools.CheckTools(names) {
status := "MISSING"
@@ -253,17 +388,250 @@ func (a *App) ToolCheckResult(names []string) ActionResult {
}
func (a *App) AuditLogTailResult() ActionResult {
body := a.tools.TailFile(DefaultAuditLogPath, 40) + "\n\n" + a.tools.TailFile(DefaultAuditJSONPath, 20)
logTail := strings.TrimSpace(a.tools.TailFile(DefaultAuditLogPath, 40))
jsonTail := strings.TrimSpace(a.tools.TailFile(DefaultAuditJSONPath, 20))
body := strings.TrimSpace(logTail + "\n\n" + jsonTail)
if body == "" {
body = "No audit logs found."
}
return ActionResult{Title: "Audit log tail", Body: body}
}
func (a *App) RunNvidiaAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaAcceptancePack(baseDir)
}
func (a *App) RunNvidiaAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.sat.RunNvidiaAcceptancePack(baseDir)
return ActionResult{Title: "NVIDIA SAT", Body: "Archive written to " + path}, err
path, err := a.RunNvidiaAcceptancePack(baseDir)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
}
return ActionResult{Title: "NVIDIA SAT", Body: body}, err
}
func (a *App) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
return a.sat.ListNvidiaGPUs()
}
func (a *App) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (ActionResult, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
path, err := a.sat.RunNvidiaAcceptancePackWithOptions(ctx, baseDir, durationSec, sizeMB, gpuIndices)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
// Include the terminal chart if available (run directory = archive path without .tar.gz).
termPath := filepath.Join(strings.TrimSuffix(path, ".tar.gz"), "gpu-metrics-term.txt")
if chart, readErr := os.ReadFile(termPath); readErr == nil && len(chart) > 0 {
body += "\n\n" + string(chart)
}
}
return ActionResult{Title: "NVIDIA SAT", Body: body}, err
}
func (a *App) RunMemoryAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunMemoryAcceptancePack(baseDir)
}
func (a *App) RunMemoryAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunMemoryAcceptancePack(baseDir)
return ActionResult{Title: "Memory SAT", Body: satResultBody(path)}, err
}
func (a *App) RunCPUAcceptancePack(baseDir string, durationSec int) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunCPUAcceptancePack(baseDir, durationSec)
}
func (a *App) RunCPUAcceptancePackResult(baseDir string, durationSec int) (ActionResult, error) {
path, err := a.RunCPUAcceptancePack(baseDir, durationSec)
return ActionResult{Title: "CPU SAT", Body: satResultBody(path)}, err
}
func (a *App) RunStorageAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunStorageAcceptancePack(baseDir)
}
func (a *App) RunStorageAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunStorageAcceptancePack(baseDir)
return ActionResult{Title: "Storage SAT", Body: satResultBody(path)}, err
}
func (a *App) DetectGPUVendor() string {
return a.sat.DetectGPUVendor()
}
func (a *App) ListAMDGPUs() ([]platform.AMDGPUInfo, error) {
return a.sat.ListAMDGPUs()
}
func (a *App) RunAMDAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunAMDAcceptancePack(baseDir)
}
func (a *App) RunAMDAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunAMDAcceptancePack(baseDir)
return ActionResult{Title: "AMD GPU SAT", Body: satResultBody(path)}, err
}
func (a *App) RunFanStressTest(ctx context.Context, baseDir string, opts platform.FanStressOptions) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunFanStressTest(ctx, baseDir, opts)
}
func (a *App) RunFanStressTestResult(ctx context.Context, opts platform.FanStressOptions) (ActionResult, error) {
path, err := a.RunFanStressTest(ctx, "", opts)
body := formatFanStressResult(path)
if err != nil && err != context.Canceled {
body += "\nERROR: " + err.Error()
}
return ActionResult{Title: "GPU Platform Stress Test", Body: body}, err
}
// formatFanStressResult formats the summary.txt from a fan-stress run, including
// the per-step pass/fail display and the analysis section (throttling, max temps, fan response).
func formatFanStressResult(archivePath string) string {
if archivePath == "" {
return "No output produced."
}
runDir := strings.TrimSuffix(archivePath, ".tar.gz")
raw, err := os.ReadFile(filepath.Join(runDir, "summary.txt"))
if err != nil {
return "Archive written to " + archivePath
}
content := strings.TrimSpace(string(raw))
kv := parseKeyValueSummary(content)
var b strings.Builder
b.WriteString(formatSATDetail(content))
// Append analysis section.
var analysis []string
if v, ok := kv["throttling_detected"]; ok {
label := "NO"
if v == "true" {
label = "YES ← throttling detected during load"
}
analysis = append(analysis, "Throttling: "+label)
}
if v, ok := kv["max_gpu_temp_c"]; ok && v != "0.0" {
analysis = append(analysis, "Max GPU temp: "+v+"°C")
}
if v, ok := kv["max_cpu_temp_c"]; ok && v != "0.0" {
analysis = append(analysis, "Max CPU temp: "+v+"°C")
}
if v, ok := kv["fan_response_sec"]; ok && v != "N/A" && v != "-1.0" {
analysis = append(analysis, "Fan response: "+v+"s")
}
if len(analysis) > 0 {
b.WriteString("\n\n=== Analysis ===\n")
for _, line := range analysis {
b.WriteString(line + "\n")
}
}
return strings.TrimSpace(b.String())
}
// satResultBody reads summary.txt from the SAT run directory (archive path without .tar.gz)
// and returns a formatted human-readable result. Falls back to a plain message if unreadable.
func satResultBody(archivePath string) string {
if archivePath == "" {
return "No output produced."
}
runDir := strings.TrimSuffix(archivePath, ".tar.gz")
raw, err := os.ReadFile(filepath.Join(runDir, "summary.txt"))
if err != nil {
return "Archive written to " + archivePath
}
return formatSATDetail(strings.TrimSpace(string(raw)))
}
func (a *App) HealthSummaryResult() ActionResult {
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return ActionResult{Title: "Health summary", Body: "No audit JSON found."}
}
var snapshot schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snapshot); err != nil {
return ActionResult{Title: "Health summary", Body: "Audit JSON is unreadable."}
}
summary := collector.BuildHealthSummary(snapshot.Hardware)
var body strings.Builder
status := summary.Status
if status == "" {
status = "Unknown"
}
fmt.Fprintf(&body, "Overall: %s\n", status)
fmt.Fprintf(&body, "Storage: warn=%d fail=%d\n", summary.StorageWarn, summary.StorageFail)
fmt.Fprintf(&body, "PCIe: warn=%d fail=%d\n", summary.PCIeWarn, summary.PCIeFail)
fmt.Fprintf(&body, "PSU: warn=%d fail=%d\n", summary.PSUWarn, summary.PSUFail)
fmt.Fprintf(&body, "Memory: warn=%d fail=%d\n", summary.MemoryWarn, summary.MemoryFail)
for _, item := range latestSATSummaries() {
fmt.Fprintf(&body, "\n\n%s", item)
}
if len(summary.Failures) > 0 {
fmt.Fprintf(&body, "\n\nFailures:\n- %s", strings.Join(summary.Failures, "\n- "))
}
if len(summary.Warnings) > 0 {
fmt.Fprintf(&body, "\n\nWarnings:\n- %s", strings.Join(summary.Warnings, "\n- "))
}
return ActionResult{Title: "Health summary", Body: strings.TrimSpace(body.String())}
}
func (a *App) MainBanner() string {
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return ""
}
var snapshot schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snapshot); err != nil {
return ""
}
var lines []string
if system := formatSystemLine(snapshot.Hardware.Board); system != "" {
lines = append(lines, system)
}
if cpu := formatCPULine(snapshot.Hardware.CPUs); cpu != "" {
lines = append(lines, cpu)
}
if memory := formatMemoryLine(snapshot.Hardware.Memory); memory != "" {
lines = append(lines, memory)
}
if storage := formatStorageLine(snapshot.Hardware.Storage); storage != "" {
lines = append(lines, storage)
}
if gpu := formatGPULine(snapshot.Hardware.PCIeDevices); gpu != "" {
lines = append(lines, gpu)
}
if ip := formatIPLine(a.network.ListInterfaces); ip != "" {
lines = append(lines, ip)
}
return strings.TrimSpace(strings.Join(lines, "\n"))
}
func (a *App) FormatToolStatuses(statuses []platform.ToolStatus) string {
@@ -309,3 +677,320 @@ func sanitizeFilename(v string) string {
}
return string(out)
}
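// bodyOr returns body with surrounding whitespace trimmed, or fallback if the trimmed body is empty.
func bodyOr(body, fallback string) string {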
body = strings.TrimSpace(body)
if body == "" {
return fallback
}
return body
}
// ReadRuntimeHealth loads and decodes the runtime health JSON file at path.
func ReadRuntimeHealth(path string) (schema.RuntimeHealth, error) {
raw, err := os.ReadFile(path)
if err != nil {
return schema.RuntimeHealth{}, err
}
var health schema.RuntimeHealth
if err := json.Unmarshal(raw, &health); err != nil {
return schema.RuntimeHealth{}, err
}
return health, nil
}
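// latestSATSummaries returns one formatted summary per SAT type (NVIDIA, memory, storage, CPU),
// read from the lexicographically last matching run directory under DefaultSATBaseDir.
// Types with no readable summary.txt are skipped.
func latestSATSummaries() []string {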
patterns := []struct {
label string
prefix string
}{
{label: "NVIDIA SAT", prefix: "gpu-nvidia-"},
{label: "Memory SAT", prefix: "memory-"},
{label: "Storage SAT", prefix: "storage-"},
{label: "CPU SAT", prefix: "cpu-"},
}
var out []string
for _, item := range patterns {
matches, err := filepath.Glob(filepath.Join(DefaultSATBaseDir, item.prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
continue
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
continue
}
out = append(out, formatSATSummary(item.label, string(raw)))
}
return out
}
func formatSATSummary(label, raw string) string {
values := parseKeyValueSummary(raw)
var body strings.Builder
fmt.Fprintf(&body, "%s:", label)
// The fallbacks guarantee non-empty values, so these three fields are always printed.
fmt.Fprintf(&body, " %s", firstNonEmpty(values["overall_status"], "UNKNOWN"))
fmt.Fprintf(&body, " ok=%s", firstNonEmpty(values["job_ok"], "0"))
fmt.Fprintf(&body, " failed=%s", firstNonEmpty(values["job_failed"], "0"))
if unsupported := firstNonEmpty(values["job_unsupported"], "0"); unsupported != "0" {
fmt.Fprintf(&body, " unsupported=%s", unsupported)
}
if devices := strings.TrimSpace(values["devices"]); devices != "" {
fmt.Fprintf(&body, "\nDevices: %s", devices)
}
return body.String()
}
func formatSystemLine(board schema.HardwareBoard) string {
model := strings.TrimSpace(strings.Join([]string{
trimPtr(board.Manufacturer),
trimPtr(board.ProductName),
}, " "))
serial := strings.TrimSpace(board.SerialNumber)
switch {
case model != "" && serial != "":
return fmt.Sprintf("System: %s | S/N %s", model, serial)
case model != "":
return "System: " + model
case serial != "":
return "System S/N: " + serial
default:
return ""
}
}
func formatCPULine(cpus []schema.HardwareCPU) string {
if len(cpus) == 0 {
return ""
}
modelCounts := map[string]int{}
unknown := 0
for _, cpu := range cpus {
model := trimPtr(cpu.Model)
if model == "" {
unknown++
continue
}
modelCounts[model]++
}
if len(modelCounts) == 1 && unknown == 0 {
for model, count := range modelCounts {
return fmt.Sprintf("CPU: %d x %s", count, model)
}
}
parts := make([]string, 0, len(modelCounts)+1)
if len(modelCounts) > 0 {
keys := make([]string, 0, len(modelCounts))
for key := range modelCounts {
keys = append(keys, key)
}
sort.Strings(keys)
for _, key := range keys {
parts = append(parts, fmt.Sprintf("%d x %s", modelCounts[key], key))
}
}
if unknown > 0 {
parts = append(parts, fmt.Sprintf("%d x unknown", unknown))
}
return "CPU: " + strings.Join(parts, ", ")
}
func formatMemoryLine(dimms []schema.HardwareMemory) string {
totalMB := 0
present := 0
types := map[string]struct{}{}
for _, dimm := range dimms {
if dimm.Present != nil && !*dimm.Present {
continue
}
if dimm.SizeMB == nil || *dimm.SizeMB <= 0 {
continue
}
present++
totalMB += *dimm.SizeMB
if value := trimPtr(dimm.Type); value != "" {
types[value] = struct{}{}
}
}
if totalMB == 0 {
return ""
}
typeText := joinSortedKeys(types)
line := fmt.Sprintf("Memory: %s", humanizeMB(totalMB))
if typeText != "" {
line += " " + typeText
}
if present > 0 {
line += fmt.Sprintf(" (%d DIMMs)", present)
}
return line
}
func formatStorageLine(disks []schema.HardwareStorage) string {
count := 0
totalGB := 0
for _, disk := range disks {
if disk.Present != nil && !*disk.Present {
continue
}
count++
if disk.SizeGB != nil && *disk.SizeGB > 0 {
totalGB += *disk.SizeGB
}
}
if count == 0 {
return ""
}
line := fmt.Sprintf("Storage: %d drives", count)
if totalGB > 0 {
line += fmt.Sprintf(" / %s", humanizeGB(totalGB))
}
return line
}
func formatGPULine(devices []schema.HardwarePCIeDevice) string {
gpus := map[string]int{}
for _, dev := range devices {
if !isGPUDevice(dev) {
continue
}
name := firstNonEmpty(trimPtr(dev.Model), trimPtr(dev.Manufacturer), "unknown")
gpus[name]++
}
if len(gpus) == 0 {
return ""
}
keys := make([]string, 0, len(gpus))
for key := range gpus {
keys = append(keys, key)
}
sort.Strings(keys)
parts := make([]string, 0, len(keys))
for _, key := range keys {
parts = append(parts, fmt.Sprintf("%d x %s", gpus[key], key))
}
return "GPU: " + strings.Join(parts, ", ")
}
func formatIPLine(list func() ([]platform.InterfaceInfo, error)) string {
if list == nil {
return ""
}
ifaces, err := list()
if err != nil {
return ""
}
seen := map[string]struct{}{}
var ips []string
for _, iface := range ifaces {
for _, ip := range iface.IPv4 {
ip = strings.TrimSpace(ip)
if ip == "" {
continue
}
if _, ok := seen[ip]; ok {
continue
}
seen[ip] = struct{}{}
ips = append(ips, ip)
}
}
if len(ips) == 0 {
return ""
}
sort.Strings(ips)
return "IP: " + strings.Join(ips, ", ")
}
func isGPUDevice(dev schema.HardwarePCIeDevice) bool {
class := trimPtr(dev.DeviceClass)
model := strings.ToLower(trimPtr(dev.Model))
vendor := strings.ToLower(trimPtr(dev.Manufacturer))
// Exclude ASPEED (BMC VGA adapter, not a compute GPU)
if strings.Contains(vendor, "aspeed") || strings.Contains(model, "aspeed") {
return false
}
// AMD Instinct / Radeon compute GPUs have class ProcessingAccelerator or DisplayController.
// Do NOT match by AMD vendor alone — chipset/CPU PCIe devices share that vendor.
return class == "VideoController" ||
class == "DisplayController" ||
class == "ProcessingAccelerator" ||
strings.Contains(model, "nvidia") ||
strings.Contains(vendor, "nvidia")
}
func trimPtr(value *string) string {
if value == nil {
return ""
}
return strings.TrimSpace(*value)
}
func joinSortedKeys(values map[string]struct{}) string {
if len(values) == 0 {
return ""
}
keys := make([]string, 0, len(values))
for key := range values {
keys = append(keys, key)
}
sort.Strings(keys)
return strings.Join(keys, "/")
}
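// humanizeMB formats a size given in MB as GB or TB using 1024-based units.
// Whole-GB values print without a decimal; non-positive input yields "".
func humanizeMB(totalMB int) string {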
if totalMB <= 0 {
return ""
}
gb := float64(totalMB) / 1024.0
if gb >= 1024.0 {
tb := gb / 1024.0
return fmt.Sprintf("%.1f TB", tb)
}
if gb == float64(int64(gb)) {
return fmt.Sprintf("%.0f GB", gb)
}
return fmt.Sprintf("%.1f GB", gb)
}
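// humanizeGB formats a size given in GB, switching to TB (1 TB = 1024 GB) at 1024 GB and above.
// Non-positive input yields "".
func humanizeGB(totalGB int) string {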
if totalGB <= 0 {
return ""
}
tb := float64(totalGB) / 1024.0
if tb >= 1.0 {
return fmt.Sprintf("%.1f TB", tb)
}
return fmt.Sprintf("%d GB", totalGB)
}
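// parseKeyValueSummary parses newline-separated key=value pairs.
// Blank lines and lines without "=" are skipped; a repeated key keeps its last value.
func parseKeyValueSummary(raw string) map[string]string {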
out := map[string]string{}
for _, line := range strings.Split(raw, "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
key, value, ok := strings.Cut(line, "=")
if !ok {
continue
}
out[strings.TrimSpace(key)] = strings.TrimSpace(value)
}
return out
}
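// firstNonEmpty returns the first value that is non-empty after trimming whitespace, or "" if none is.
func firstNonEmpty(values ...string) string {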
for _, value := range values {
value = strings.TrimSpace(value)
if value != "" {
return value
}
}
return ""
}
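
The SAT result formatting above relies on `summary.txt` being a flat list of `key=value` lines. As a minimal, self-contained sketch of that parsing step: `parseSummary` below is a hypothetical stand-in that follows the same skip rules as `parseKeyValueSummary`, and the sample content is made up for illustration rather than taken from a real run.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSummary is an illustrative stand-in for the key=value parsing shown
// above: blank lines and lines without "=" are skipped, and whitespace
// around keys and values is trimmed.
func parseSummary(raw string) map[string]string {
	out := map[string]string{}
	for _, line := range strings.Split(raw, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), "=")
		if !ok {
			continue
		}
		out[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return out
}

func main() {
	// Hypothetical summary.txt content; the values are invented.
	sample := "overall_status=OK\njob_ok=3\n\nnot a pair\njob_failed = 0\n"
	kv := parseSummary(sample)
	fmt.Printf("%s ok=%s failed=%s\n", kv["overall_status"], kv["job_ok"], kv["job_failed"])
}
```

Because unknown or malformed lines are silently dropped, a caller that needs a field should supply its own default, as `formatSATSummary` does through `firstNonEmpty`.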


@@ -1,10 +1,18 @@
package app
import (
"archive/tar"
"compress/gzip"
"context"
"encoding/json"
"errors"
"io"
"os"
"path/filepath"
"testing"
"bee/audit/internal/platform"
"bee/audit/internal/schema"
)
type fakeNetwork struct {
@@ -52,16 +60,41 @@ func (f fakeServices) ServiceDo(name string, action platform.ServiceAction) (str
return f.serviceDoFn(name, action)
}
type fakeExports struct{}
type fakeExports struct {
listTargetsFn func() ([]platform.RemovableTarget, error)
exportToTargetFn func(string, platform.RemovableTarget) (string, error)
}
func (f fakeExports) ListRemovableTargets() ([]platform.RemovableTarget, error) {
if f.listTargetsFn != nil {
return f.listTargetsFn()
}
return nil, nil
}
func (f fakeExports) ExportFileToTarget(src string, target platform.RemovableTarget) (string, error) {
if f.exportToTargetFn != nil {
return f.exportToTargetFn(src, target)
}
return "", nil
}
type fakeRuntime struct {
collectFn func(string) (schema.RuntimeHealth, error)
dumpFn func(string) error
}
func (f fakeRuntime) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
return f.collectFn(exportDir)
}
func (f fakeRuntime) CaptureTechnicalDump(baseDir string) error {
if f.dumpFn != nil {
return f.dumpFn(baseDir)
}
return nil
}
type fakeTools struct {
tailFileFn func(string, int) string
checkToolsFn func([]string) []platform.ToolStatus
@@ -76,11 +109,69 @@ func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
}
type fakeSAT struct {
runFn func(string) (string, error)
runNvidiaFn func(string) (string, error)
runMemoryFn func(string) (string, error)
runStorageFn func(string) (string, error)
runCPUFn func(string, int) (string, error)
detectVendorFn func() string
listAMDGPUsFn func() ([]platform.AMDGPUInfo, error)
runAMDPackFn func(string) (string, error)
listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error)
}
func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string) (string, error) {
return f.runFn(baseDir)
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaAcceptancePackWithOptions(_ context.Context, baseDir string, _ int, _ int, _ []int) (string, error) {
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
if f.listNvidiaGPUsFn != nil {
return f.listNvidiaGPUsFn()
}
return nil, nil
}
func (f fakeSAT) RunMemoryAcceptancePack(baseDir string) (string, error) {
return f.runMemoryFn(baseDir)
}
func (f fakeSAT) RunStorageAcceptancePack(baseDir string) (string, error) {
return f.runStorageFn(baseDir)
}
func (f fakeSAT) RunCPUAcceptancePack(baseDir string, durationSec int) (string, error) {
if f.runCPUFn != nil {
return f.runCPUFn(baseDir, durationSec)
}
return "", nil
}
func (f fakeSAT) DetectGPUVendor() string {
if f.detectVendorFn != nil {
return f.detectVendorFn()
}
return ""
}
func (f fakeSAT) ListAMDGPUs() ([]platform.AMDGPUInfo, error) {
if f.listAMDGPUsFn != nil {
return f.listAMDGPUsFn()
}
return nil, nil
}
func (f fakeSAT) RunAMDAcceptancePack(baseDir string) (string, error) {
if f.runAMDPackFn != nil {
return f.runAMDPackFn(baseDir)
}
return "", nil
}
func (f fakeSAT) RunFanStressTest(_ context.Context, _ string, _ platform.FanStressOptions) (string, error) {
return "", nil
}
func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
@@ -96,6 +187,9 @@ func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
},
defaultRouteFn: func() string { return "10.0.0.1" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
@@ -116,6 +210,28 @@ func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
}
}
func TestNetworkStatusHandlesNoInterfaces(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
listInterfacesFn: func() ([]platform.InterfaceInfo, error) { return nil, nil },
defaultRouteFn: func() string { return "" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
if err != nil {
t.Fatalf("NetworkStatus error: %v", err)
}
if result.Body != "No physical interfaces found." {
t.Fatalf("body=%q want %q", result.Body, "No physical interfaces found.")
}
}
func TestNetworkStatusPropagatesListError(t *testing.T) {
t.Parallel()
@@ -126,6 +242,9 @@ func TestNetworkStatusPropagatesListError(t *testing.T) {
},
defaultRouteFn: func() string { return "" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
@@ -150,6 +269,9 @@ func TestParseStaticIPv4ConfigAndDefaults(t *testing.T) {
dhcpAllFn: func() (string, error) { return "", nil },
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
defaults := a.DefaultStaticIPv4FormFields("eth0")
@@ -186,13 +308,16 @@ func TestServiceActionResults(t *testing.T) {
return string(action) + " ok", nil
},
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
statusResult, err := a.ServiceStatusResult("bee-audit")
if err != nil {
t.Fatalf("ServiceStatusResult error: %v", err)
}
if statusResult.Title != "service: bee-audit" || statusResult.Body != "active" {
if statusResult.Title != "service status: bee-audit" || statusResult.Body != "active" {
t.Fatalf("unexpected status result: %#v", statusResult)
}
@@ -200,7 +325,7 @@ func TestServiceActionResults(t *testing.T) {
if err != nil {
t.Fatalf("ServiceActionResult error: %v", err)
}
if actionResult.Title != "service: bee-audit" || actionResult.Body != "restart ok" {
if actionResult.Title != "service restart: bee-audit" || actionResult.Body != "restart ok" {
t.Fatalf("unexpected action result: %#v", actionResult)
}
}
@@ -242,17 +367,125 @@ func TestToolCheckAndLogTailResults(t *testing.T) {
}
}
func TestActionResultsUseFallbackBody(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
dhcpOneFn: func(string) (string, error) { return " ", nil },
dhcpAllFn: func() (string, error) { return "", nil },
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return nil, nil
},
defaultRouteFn: func() string { return "" },
},
services: fakeServices{
serviceStatusFn: func(string) (string, error) { return "", nil },
serviceDoFn: func(string, platform.ServiceAction) (string, error) { return "", nil },
},
tools: fakeTools{
tailFileFn: func(string, int) string { return " " },
checkToolsFn: func([]string) []platform.ToolStatus { return nil },
},
sat: fakeSAT{
runNvidiaFn: func(string) (string, error) { return "", nil },
runMemoryFn: func(string) (string, error) { return "", nil },
runStorageFn: func(string) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) {
return schema.RuntimeHealth{Status: "PARTIAL", ExportDir: "/tmp/export"}, nil
},
},
}
if got, _ := a.DHCPOneResult("eth0"); got.Body != "DHCP completed." {
t.Fatalf("dhcp one body=%q", got.Body)
}
if got, _ := a.DHCPAllResult(); got.Body != "DHCP completed." {
t.Fatalf("dhcp all body=%q", got.Body)
}
if got, _ := a.SetStaticIPv4Result(platform.StaticIPv4Config{Interface: "eth0"}); got.Body != "Static IPv4 updated." {
t.Fatalf("static body=%q", got.Body)
}
if got, _ := a.ServiceStatusResult("bee-audit"); got.Body != "No status output." {
t.Fatalf("status body=%q", got.Body)
}
if got, _ := a.ServiceActionResult("bee-audit", platform.ServiceRestart); got.Body != "Action completed." {
t.Fatalf("action body=%q", got.Body)
}
if got := a.ToolCheckResult(nil); got.Body != "No tools checked." {
t.Fatalf("tool body=%q", got.Body)
}
if got := a.AuditLogTailResult(); got.Body != "No audit logs found." {
t.Fatalf("log body=%q", got.Body)
}
if got, _ := a.RunNvidiaAcceptancePackResult(""); got.Body != "Archive written." {
t.Fatalf("sat body=%q", got.Body)
}
if got, _ := a.RunMemoryAcceptancePackResult(""); got.Body != "No output produced." {
t.Fatalf("memory sat body=%q", got.Body)
}
if got, _ := a.RunStorageAcceptancePackResult(""); got.Body != "No output produced." {
t.Fatalf("storage sat body=%q", got.Body)
}
}
func TestExportSupportBundleResultMentionsUnmountedUSB(t *testing.T) {
// Not parallel: this test mutates the package-level DefaultExportDir.
tmp := t.TempDir()
oldExportDir := DefaultExportDir
DefaultExportDir = tmp
t.Cleanup(func() { DefaultExportDir = oldExportDir })
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.json"), []byte("{}\n"), 0644); err != nil {
t.Fatalf("write bee-audit.json: %v", err)
}
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.log"), []byte("audit ok\n"), 0644); err != nil {
t.Fatalf("write bee-audit.log: %v", err)
}
a := &App{
exports: fakeExports{
exportToTargetFn: func(src string, target platform.RemovableTarget) (string, error) {
if filepath.Base(src) == "" {
t.Fatalf("expected non-empty source path")
}
return "/media/bee/" + filepath.Base(src), nil
},
},
}
result, err := a.ExportSupportBundleResult(platform.RemovableTarget{Device: "/dev/sdb1"})
if err != nil {
t.Fatalf("ExportSupportBundleResult error: %v", err)
}
if result.Title != "Export support bundle" {
t.Fatalf("title=%q want %q", result.Title, "Export support bundle")
}
if want := "USB target unmounted and safe to remove."; !contains(result.Body, want) {
t.Fatalf("body missing %q\nbody=%s", want, result.Body)
}
}
func TestRunNvidiaAcceptancePackResult(t *testing.T) {
t.Parallel()
a := &App{
sat: fakeSAT{
runFn: func(baseDir string) (string, error) {
runNvidiaFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/sat" {
t.Fatalf("baseDir=%q want %q", baseDir, "/tmp/sat")
}
return "/tmp/sat/out.tar.gz", nil
},
runMemoryFn: func(string) (string, error) { return "", nil },
runStorageFn: func(string) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
@@ -265,6 +498,265 @@ func TestRunNvidiaAcceptancePackResult(t *testing.T) {
}
}
func TestRunSATDefaultsToExportDir(t *testing.T) {
// Not parallel: this test mutates the package-level DefaultSATBaseDir.
oldSATBaseDir := DefaultSATBaseDir
DefaultSATBaseDir = "/tmp/export/bee-sat"
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
a := &App{
sat: fakeSAT{
runNvidiaFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("nvidia baseDir=%q", baseDir)
}
return "", nil
},
runMemoryFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("memory baseDir=%q", baseDir)
}
return "", nil
},
runStorageFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("storage baseDir=%q", baseDir)
}
return "", nil
},
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
if _, err := a.RunNvidiaAcceptancePack(""); err != nil {
t.Fatal(err)
}
if _, err := a.RunMemoryAcceptancePack(""); err != nil {
t.Fatal(err)
}
if _, err := a.RunStorageAcceptancePack(""); err != nil {
t.Fatal(err)
}
}
func TestFormatSATSummary(t *testing.T) {
t.Parallel()
got := formatSATSummary("Memory SAT", "overall_status=PARTIAL\njob_ok=2\njob_failed=0\njob_unsupported=1\ndevices=3\n")
want := "Memory SAT: PARTIAL ok=2 failed=0 unsupported=1\nDevices: 3"
if got != want {
t.Fatalf("got %q want %q", got, want)
}
}
func TestHealthSummaryResultIncludesCompactSATSummary(t *testing.T) {
tmp := t.TempDir()
oldAuditPath := DefaultAuditJSONPath
oldSATBaseDir := DefaultSATBaseDir
DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
DefaultSATBaseDir = filepath.Join(tmp, "sat")
t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
satDir := filepath.Join(DefaultSATBaseDir, "memory-testcase")
if err := os.MkdirAll(satDir, 0755); err != nil {
t.Fatalf("mkdir sat dir: %v", err)
}
raw := `{"collected_at":"2026-03-15T10:00:00Z","hardware":{"board":{"serial_number":"SRV123"},"storage":[{"serial_number":"DISK1","status":"Warning"}]}}`
if err := os.WriteFile(DefaultAuditJSONPath, []byte(raw), 0644); err != nil {
t.Fatalf("write audit json: %v", err)
}
if err := os.WriteFile(filepath.Join(satDir, "summary.txt"), []byte("overall_status=OK\njob_ok=3\njob_failed=0\njob_unsupported=0\n"), 0644); err != nil {
t.Fatalf("write sat summary: %v", err)
}
result := (&App{}).HealthSummaryResult()
if !contains(result.Body, "Memory SAT: OK ok=3 failed=0") {
t.Fatalf("body missing compact sat summary:\n%s", result.Body)
}
}
func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
tmp := t.TempDir()
exportDir := filepath.Join(tmp, "export")
if err := os.MkdirAll(filepath.Join(exportDir, "bee-sat", "memory-run"), 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.json"), []byte(`{"ok":true}`), 0644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run", "verbose.log"), []byte("sat verbose"), 0644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run.tar.gz"), []byte("nested sat archive"), 0644); err != nil {
t.Fatal(err)
}
archive, err := BuildSupportBundle(exportDir)
if err != nil {
t.Fatalf("BuildSupportBundle error: %v", err)
}
if _, err := os.Stat(archive); err != nil {
t.Fatalf("archive stat: %v", err)
}
file, err := os.Open(archive)
if err != nil {
t.Fatalf("open archive: %v", err)
}
defer file.Close()
gzr, err := gzip.NewReader(file)
if err != nil {
t.Fatalf("gzip reader: %v", err)
}
defer gzr.Close()
tr := tar.NewReader(gzr)
var names []string
for {
hdr, err := tr.Next()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
t.Fatalf("read tar entry: %v", err)
}
names = append(names, hdr.Name)
}
var foundRaw bool
for _, name := range names {
if contains(name, "/export/bee-sat/memory-run/verbose.log") {
foundRaw = true
}
if contains(name, "/export/bee-sat/memory-run.tar.gz") {
t.Fatalf("support bundle should not contain nested SAT archive: %s", name)
}
}
if !foundRaw {
t.Fatalf("support bundle missing raw SAT log, names=%v", names)
}
}
func TestMainBanner(t *testing.T) {
tmp := t.TempDir()
oldAuditPath := DefaultAuditJSONPath
DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
trueValue := true
manufacturer := "Dell"
product := "PowerEdge R760"
cpuModel := "Intel Xeon Gold 6430"
memoryType := "DDR5"
gpuClass := "VideoController"
gpuModel := "NVIDIA H100"
payload := schema.HardwareIngestRequest{
Hardware: schema.HardwareSnapshot{
Board: schema.HardwareBoard{
Manufacturer: &manufacturer,
ProductName: &product,
SerialNumber: "SRV123",
},
CPUs: []schema.HardwareCPU{
{Model: &cpuModel},
{Model: &cpuModel},
},
Memory: []schema.HardwareMemory{
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
},
Storage: []schema.HardwareStorage{
{Present: &trueValue, SizeGB: intPtr(3840)},
{Present: &trueValue, SizeGB: intPtr(3840)},
},
PCIeDevices: []schema.HardwarePCIeDevice{
{DeviceClass: &gpuClass, Model: &gpuModel},
{DeviceClass: &gpuClass, Model: &gpuModel},
},
},
}
raw, err := json.Marshal(payload)
if err != nil {
t.Fatalf("marshal: %v", err)
}
if err := os.WriteFile(DefaultAuditJSONPath, raw, 0644); err != nil {
t.Fatalf("write audit json: %v", err)
}
a := &App{
network: fakeNetwork{
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return []platform.InterfaceInfo{
{Name: "eth0", IPv4: []string{"10.0.0.10"}},
{Name: "eth1", IPv4: []string{"192.168.1.10"}},
}, nil
},
},
}
got := a.MainBanner()
for _, want := range []string{
"System: Dell PowerEdge R760 | S/N SRV123",
"CPU: 2 x Intel Xeon Gold 6430",
"Memory: 1.0 TB DDR5 (2 DIMMs)",
"Storage: 2 drives / 7.5 TB",
"GPU: 2 x NVIDIA H100",
"IP: 10.0.0.10, 192.168.1.10",
} {
if !contains(got, want) {
t.Fatalf("banner missing %q:\n%s", want, got)
}
}
}
func TestRuntimeHealthResultUsesAMDLabels(t *testing.T) {
tmp := t.TempDir()
oldRuntimePath := DefaultRuntimeJSONPath
DefaultRuntimeJSONPath = filepath.Join(tmp, "runtime-health.json")
t.Cleanup(func() { DefaultRuntimeJSONPath = oldRuntimePath })
raw, err := json.Marshal(schema.RuntimeHealth{
Status: "OK",
ExportDir: "/appdata/bee/export",
DriverReady: true,
CUDAReady: true,
NetworkStatus: "OK",
})
if err != nil {
t.Fatalf("marshal runtime health: %v", err)
}
if err := os.WriteFile(DefaultRuntimeJSONPath, raw, 0644); err != nil {
t.Fatalf("write runtime health: %v", err)
}
a := &App{
sat: fakeSAT{
detectVendorFn: func() string { return "amd" },
},
}
result := a.RuntimeHealthResult()
if !contains(result.Body, "AMDGPU ready: true") {
t.Fatalf("body missing AMD driver label:\n%s", result.Body)
}
if !contains(result.Body, "ROCm SMI ready: true") {
t.Fatalf("body missing ROCm label:\n%s", result.Body)
}
if contains(result.Body, "CUDA ready") {
t.Fatalf("body should not mention CUDA on AMD:\n%s", result.Body)
}
}
func intPtr(v int) *int { return &v }
func contains(haystack, needle string) bool {
return len(needle) == 0 || (len(haystack) >= len(needle) && (haystack == needle || containsAt(haystack, needle)))
}
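
Note on the fakes in this test file: most `fakeSAT` methods check their function field for nil before calling it, so a zero-value fake is safe to use in tests that do not care about that call. A minimal sketch of the pattern follows, assuming a hypothetical one-method interface (`gpuLister`) and fake (`fakeLister`) that are not part of this diff:

```go
package main

import "fmt"

// gpuLister is a hypothetical interface used only for this illustration.
type gpuLister interface {
	ListGPUs() ([]string, error)
}

// fakeLister lets each test override behaviour through an optional function field.
type fakeLister struct {
	listFn func() ([]string, error)
}

// ListGPUs delegates to listFn when it is set and otherwise returns zero values,
// so an unconfigured fake never panics on a nil function call.
func (f fakeLister) ListGPUs() ([]string, error) {
	if f.listFn != nil {
		return f.listFn()
	}
	return nil, nil
}

func main() {
	var l gpuLister = fakeLister{}
	gpus, err := l.ListGPUs()
	fmt.Println(len(gpus), err)

	l = fakeLister{listFn: func() ([]string, error) { return []string{"gpu0"}, nil }}
	gpus, _ = l.ListGPUs()
	fmt.Println(gpus)
}
```

Methods without the nil guard, such as the ones that call `runNvidiaFn`, `runMemoryFn`, and `runStorageFn` directly, require the test to set the field whenever that path is exercised.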

audit/internal/app/panel.go Normal file

@@ -0,0 +1,387 @@
package app
import (
"encoding/json"
"fmt"
"os"
"path/filepath"
"sort"
"strings"
"bee/audit/internal/schema"
)
// ComponentRow is one line in the hardware panel.
type ComponentRow struct {
Key string // "CPU", "MEM", "GPU", "DISK", "PSU"
Status string // "PASS", "FAIL", "CANCEL", "N/A"
Detail string // compact one-liner
}
// HardwarePanelData holds everything the TUI right panel needs.
type HardwarePanelData struct {
Header []string
Rows []ComponentRow
}
// LoadHardwarePanel reads the latest audit JSON and SAT summaries.
// It returns a panel holding only a placeholder header line if the audit JSON is missing or unreadable.
func (a *App) LoadHardwarePanel() HardwarePanelData {
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return HardwarePanelData{Header: []string{"No audit data — run audit first."}}
}
var snap schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snap); err != nil {
return HardwarePanelData{Header: []string{"Audit data unreadable."}}
}
statuses := satStatuses()
var header []string
if sys := formatSystemLine(snap.Hardware.Board); sys != "" {
header = append(header, sys)
}
for _, fw := range snap.Hardware.Firmware {
if fw.DeviceName == "BIOS" && fw.Version != "" {
header = append(header, "BIOS: "+fw.Version)
}
if fw.DeviceName == "BMC" && fw.Version != "" {
header = append(header, "BMC: "+fw.Version)
}
}
if ip := formatIPLine(a.network.ListInterfaces); ip != "" {
header = append(header, ip)
}
var rows []ComponentRow
if cpu := formatCPULine(snap.Hardware.CPUs); cpu != "" {
rows = append(rows, ComponentRow{
Key: "CPU",
Status: statuses["cpu"],
Detail: strings.TrimPrefix(cpu, "CPU: "),
})
}
if mem := formatMemoryLine(snap.Hardware.Memory); mem != "" {
rows = append(rows, ComponentRow{
Key: "MEM",
Status: statuses["memory"],
Detail: strings.TrimPrefix(mem, "Memory: "),
})
}
if gpu := formatGPULine(snap.Hardware.PCIeDevices); gpu != "" {
rows = append(rows, ComponentRow{
Key: "GPU",
Status: statuses["gpu"],
Detail: strings.TrimPrefix(gpu, "GPU: "),
})
}
if disk := formatStorageLine(snap.Hardware.Storage); disk != "" {
rows = append(rows, ComponentRow{
Key: "DISK",
Status: statuses["storage"],
Detail: strings.TrimPrefix(disk, "Storage: "),
})
}
if psu := formatPSULine(snap.Hardware.PowerSupplies); psu != "" {
rows = append(rows, ComponentRow{
Key: "PSU",
Status: "N/A",
Detail: psu,
})
}
return HardwarePanelData{Header: header, Rows: rows}
}
// ComponentDetailResult returns detail text for a component shown in the panel.
func (a *App) ComponentDetailResult(key string) ActionResult {
switch key {
case "CPU":
return a.cpuDetailResult(false)
case "MEM":
return a.satDetailResult("memory", "memory-", "MEM detail")
case "GPU":
// Prefer whichever GPU SAT was run most recently.
nv, _ := filepath.Glob(filepath.Join(DefaultSATBaseDir, "gpu-nvidia-*/summary.txt"))
am, _ := filepath.Glob(filepath.Join(DefaultSATBaseDir, "gpu-amd-*/summary.txt"))
sort.Strings(nv)
sort.Strings(am)
latestNV := ""
if len(nv) > 0 {
latestNV = nv[len(nv)-1]
}
latestAM := ""
if len(am) > 0 {
latestAM = am[len(am)-1]
}
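// Compare the run-directory timestamp suffixes: comparing full paths would
// always rank "gpu-amd-" below "gpu-nvidia-" regardless of run time.
amRun := strings.TrimPrefix(filepath.Base(filepath.Dir(latestAM)), "gpu-amd-")
nvRun := strings.TrimPrefix(filepath.Base(filepath.Dir(latestNV)), "gpu-nvidia-")
if latestAM != "" && (latestNV == "" || amRun > nvRun) {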
return a.satDetailResult("gpu", "gpu-amd-", "GPU detail")
}
return a.satDetailResult("gpu", "gpu-nvidia-", "GPU detail")
case "DISK":
return a.satDetailResult("storage", "storage-", "DISK detail")
case "PSU":
return a.psuDetailResult()
default:
return ActionResult{Title: key, Body: "No detail available."}
}
}
func (a *App) cpuDetailResult(satOnly bool) ActionResult {
var b strings.Builder
// Show latest SAT summary if available.
satResult := a.satDetailResult("cpu", "cpu-", "CPU SAT")
if satResult.Body != "No test results found. Run a test first." {
fmt.Fprintln(&b, "=== Last SAT ===")
fmt.Fprintln(&b, satResult.Body)
fmt.Fprintln(&b)
}
if satOnly {
body := strings.TrimSpace(b.String())
if body == "" {
body = "No CPU SAT results found. Run a test first."
}
return ActionResult{Title: "CPU SAT", Body: body}
}
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return ActionResult{Title: "CPU", Body: strings.TrimSpace(b.String())}
}
var snap schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snap); err != nil {
return ActionResult{Title: "CPU", Body: strings.TrimSpace(b.String())}
}
if len(snap.Hardware.CPUs) == 0 {
return ActionResult{Title: "CPU", Body: strings.TrimSpace(b.String())}
}
fmt.Fprintln(&b, "=== Audit ===")
for i, cpu := range snap.Hardware.CPUs {
fmt.Fprintf(&b, "CPU %d\n", i)
if cpu.Model != nil {
fmt.Fprintf(&b, " Model: %s\n", *cpu.Model)
}
if cpu.Manufacturer != nil {
fmt.Fprintf(&b, " Vendor: %s\n", *cpu.Manufacturer)
}
if cpu.Cores != nil {
fmt.Fprintf(&b, " Cores: %d\n", *cpu.Cores)
}
if cpu.Threads != nil {
fmt.Fprintf(&b, " Threads: %d\n", *cpu.Threads)
}
if cpu.MaxFrequencyMHz != nil {
fmt.Fprintf(&b, " Max freq: %d MHz\n", *cpu.MaxFrequencyMHz)
}
if cpu.TemperatureC != nil {
fmt.Fprintf(&b, " Temp: %.1f°C\n", *cpu.TemperatureC)
}
if cpu.Throttled != nil {
fmt.Fprintf(&b, " Throttled: %v\n", *cpu.Throttled)
}
if cpu.CorrectableErrorCount != nil && *cpu.CorrectableErrorCount > 0 {
fmt.Fprintf(&b, " ECC correctable: %d\n", *cpu.CorrectableErrorCount)
}
if cpu.UncorrectableErrorCount != nil && *cpu.UncorrectableErrorCount > 0 {
fmt.Fprintf(&b, " ECC uncorrectable: %d\n", *cpu.UncorrectableErrorCount)
}
if i < len(snap.Hardware.CPUs)-1 {
fmt.Fprintln(&b)
}
}
return ActionResult{Title: "CPU", Body: strings.TrimSpace(b.String())}
}
func (a *App) satDetailResult(statusKey, prefix, title string) ActionResult {
matches, err := filepath.Glob(filepath.Join(DefaultSATBaseDir, prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
return ActionResult{Title: title, Body: "No test results found. Run a test first."}
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
return ActionResult{Title: title, Body: "Could not read test results."}
}
return ActionResult{Title: title, Body: formatSATDetail(strings.TrimSpace(string(raw)))}
}
// formatSATDetail converts raw summary.txt key=value content to a human-readable per-step display.
func formatSATDetail(raw string) string {
var b strings.Builder
kv := parseKeyValueSummary(raw)
if t, ok := kv["run_at_utc"]; ok {
fmt.Fprintf(&b, "Run: %s\n\n", t)
}
// Collect step names in the order they appear in the file.
lines := strings.Split(raw, "\n")
var stepKeys []string
seenStep := map[string]bool{}
for _, line := range lines {
if idx := strings.Index(line, "_status="); idx >= 0 {
key := line[:idx]
if !seenStep[key] && key != "overall" {
seenStep[key] = true
stepKeys = append(stepKeys, key)
}
}
}
for _, key := range stepKeys {
status := kv[key+"_status"]
display := cleanSummaryKey(key)
switch status {
case "OK":
fmt.Fprintf(&b, "PASS %s\n", display)
case "FAILED":
fmt.Fprintf(&b, "FAIL %s\n", display)
case "UNSUPPORTED":
fmt.Fprintf(&b, "SKIP %s\n", display)
default:
fmt.Fprintf(&b, "? %s\n", display)
}
}
if overall, ok := kv["overall_status"]; ok {
ok2 := kv["job_ok"]
failed := kv["job_failed"]
fmt.Fprintf(&b, "\nOverall: %s (ok=%s failed=%s)", overall, ok2, failed)
}
return strings.TrimSpace(b.String())
}
// cleanSummaryKey strips the leading numeric prefix from a SAT step key.
// "1-lscpu" → "lscpu", "3-stress-ng" → "stress-ng"
func cleanSummaryKey(key string) string {
idx := strings.Index(key, "-")
if idx <= 0 {
return key
}
prefix := key[:idx]
for _, c := range prefix {
if c < '0' || c > '9' {
return key
}
}
return key[idx+1:]
}
func (a *App) psuDetailResult() ActionResult {
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return ActionResult{Title: "PSU", Body: "No audit data."}
}
var snap schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snap); err != nil {
return ActionResult{Title: "PSU", Body: "Audit data unreadable."}
}
if len(snap.Hardware.PowerSupplies) == 0 {
return ActionResult{Title: "PSU", Body: "No PSU data in last audit."}
}
var b strings.Builder
for i, psu := range snap.Hardware.PowerSupplies {
fmt.Fprintf(&b, "PSU %d\n", i)
if psu.Model != nil {
fmt.Fprintf(&b, " Model: %s\n", *psu.Model)
}
if psu.Vendor != nil {
fmt.Fprintf(&b, " Vendor: %s\n", *psu.Vendor)
}
if psu.WattageW != nil {
fmt.Fprintf(&b, " Rated: %d W\n", *psu.WattageW)
}
if psu.InputPowerW != nil {
fmt.Fprintf(&b, " Input: %.1f W\n", *psu.InputPowerW)
}
if psu.OutputPowerW != nil {
fmt.Fprintf(&b, " Output: %.1f W\n", *psu.OutputPowerW)
}
if psu.TemperatureC != nil {
fmt.Fprintf(&b, " Temp: %.1f°C\n", *psu.TemperatureC)
}
if i < len(snap.Hardware.PowerSupplies)-1 {
fmt.Fprintln(&b)
}
}
return ActionResult{Title: "PSU", Body: strings.TrimSpace(b.String())}
}
// satStatuses reads the latest summary.txt for each SAT type and returns
// a map of component key ("gpu","memory","storage","cpu") → status ("PASS","FAIL","CANCEL","N/A").
func satStatuses() map[string]string {
result := map[string]string{
"gpu": "N/A",
"memory": "N/A",
"storage": "N/A",
"cpu": "N/A",
}
patterns := []struct {
key string
prefix string
}{
{"gpu", "gpu-nvidia-"},
{"gpu", "gpu-amd-"},
{"memory", "memory-"},
{"storage", "storage-"},
{"cpu", "cpu-"},
}
for _, item := range patterns {
matches, err := filepath.Glob(filepath.Join(DefaultSATBaseDir, item.prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
continue
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
continue
}
values := parseKeyValueSummary(string(raw))
switch strings.ToUpper(strings.TrimSpace(values["overall_status"])) {
case "OK":
result[item.key] = "PASS"
case "FAILED":
result[item.key] = "FAIL"
case "CANCELED", "CANCELLED":
result[item.key] = "CANCEL"
}
}
return result
}
func formatPSULine(psus []schema.HardwarePowerSupply) string {
var present []schema.HardwarePowerSupply
for _, psu := range psus {
if psu.Present != nil && !*psu.Present {
continue
}
present = append(present, psu)
}
if len(present) == 0 {
return ""
}
firstW := 0
if present[0].WattageW != nil {
firstW = *present[0].WattageW
}
allSame := firstW > 0
for _, p := range present[1:] {
w := 0
if p.WattageW != nil {
w = *p.WattageW
}
if w != firstW {
allSame = false
break
}
}
if allSame && firstW > 0 {
return fmt.Sprintf("%dx %dW", len(present), firstW)
}
return fmt.Sprintf("%d PSU", len(present))
}
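
Note on the summary.txt handling in panel.go: `formatSATDetail` relies on `parseKeyValueSummary`, which this diff calls but does not show. The sketch below assumes it splits each line on the first `=`; `parseKV` and `stepLabel` are hypothetical names, with `stepLabel` following the same rule as `cleanSummaryKey` (drop a leading all-digit prefix such as `3-`, leave other keys untouched).

```go
package main

import (
	"fmt"
	"strings"
)

// parseKV is a hypothetical stand-in for parseKeyValueSummary.
// Assumption: summary.txt holds one key=value pair per line.
func parseKV(raw string) map[string]string {
	out := map[string]string{}
	for _, line := range strings.Split(raw, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), "=")
		if !ok || strings.TrimSpace(key) == "" {
			continue // skip blank lines and lines without '='
		}
		out[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return out
}

// stepLabel strips a leading numeric prefix ("1-lscpu" becomes "lscpu").
// Keys whose text before the first '-' is not all digits are returned unchanged.
func stepLabel(key string) string {
	prefix, rest, ok := strings.Cut(key, "-")
	if !ok || prefix == "" {
		return key
	}
	for _, c := range prefix {
		if c < '0' || c > '9' {
			return key
		}
	}
	return rest
}

func main() {
	raw := "run_at_utc=2026-03-25T16:11:51Z\n1-lscpu_status=OK\n3-stress-ng_status=FAILED\noverall_status=FAILED\n"
	kv := parseKV(raw)
	fmt.Println(kv["overall_status"])
	fmt.Println(stepLabel("1-lscpu"), stepLabel("3-stress-ng"), stepLabel("stress-ng"))
}
```

Because map iteration order is random in Go, the step order has to come from a second pass over the raw lines, which is what `formatSATDetail` does with its `stepKeys` slice.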


@@ -0,0 +1,214 @@
package app
import (
"os"
"path/filepath"
"sort"
"strings"
"bee/audit/internal/schema"
)
func applyLatestSATStatuses(snap *schema.HardwareSnapshot, baseDir string) {
if snap == nil || strings.TrimSpace(baseDir) == "" {
return
}
if summary, ok := loadLatestSATSummary(baseDir, "gpu-amd-"); ok {
applyGPUVendorSAT(snap.PCIeDevices, "amd", summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "gpu-nvidia-"); ok {
applyGPUVendorSAT(snap.PCIeDevices, "nvidia", summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "memory-"); ok {
applyMemorySAT(snap.Memory, summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "cpu-"); ok {
applyCPUSAT(snap.CPUs, summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "storage-"); ok {
applyStorageSAT(snap.Storage, summary)
}
}
type satSummary struct {
runAtUTC string
overall string
kv map[string]string
}
func loadLatestSATSummary(baseDir, prefix string) (satSummary, bool) {
matches, err := filepath.Glob(filepath.Join(baseDir, prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
return satSummary{}, false
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
return satSummary{}, false
}
kv := parseKeyValueSummary(string(raw))
return satSummary{
runAtUTC: strings.TrimSpace(kv["run_at_utc"]),
overall: strings.ToUpper(strings.TrimSpace(kv["overall_status"])),
kv: kv,
}, true
}
func applyGPUVendorSAT(devs []schema.HardwarePCIeDevice, vendor string, summary satSummary) {
status, description, ok := satSummaryStatus(summary, vendor+" GPU SAT")
if !ok {
return
}
for i := range devs {
if !matchesGPUVendor(devs[i], vendor) {
continue
}
mergeComponentStatus(&devs[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyMemorySAT(dimms []schema.HardwareMemory, summary satSummary) {
status, description, ok := satSummaryStatus(summary, "memory SAT")
if !ok {
return
}
for i := range dimms {
mergeComponentStatus(&dimms[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyCPUSAT(cpus []schema.HardwareCPU, summary satSummary) {
status, description, ok := satSummaryStatus(summary, "CPU SAT")
if !ok {
return
}
for i := range cpus {
mergeComponentStatus(&cpus[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyStorageSAT(disks []schema.HardwareStorage, summary satSummary) {
byDevice := parseStorageSATStatus(summary)
for i := range disks {
devPath, _ := disks[i].Telemetry["linux_device"].(string)
devName := filepath.Base(strings.TrimSpace(devPath))
if devName == "" {
continue
}
result, ok := byDevice[devName]
if !ok {
continue
}
mergeComponentStatus(&disks[i].HardwareComponentStatus, summary.runAtUTC, result.status, result.description)
}
}
type satStatusResult struct {
status string
description string
ok bool
}
func parseStorageSATStatus(summary satSummary) map[string]satStatusResult {
result := map[string]satStatusResult{}
for key, value := range summary.kv {
if !strings.HasSuffix(key, "_status") || key == "overall_status" {
continue
}
base := strings.TrimSuffix(key, "_status")
idx := strings.Index(base, "_")
if idx <= 0 {
continue
}
devName := base[:idx]
step := strings.ReplaceAll(base[idx+1:], "_", "-")
stepStatus, desc, ok := satKeyStatus(strings.ToUpper(strings.TrimSpace(value)), "storage "+step)
if !ok {
continue
}
current := result[devName]
if !current.ok || statusSeverity(stepStatus) > statusSeverity(current.status) {
result[devName] = satStatusResult{status: stepStatus, description: desc, ok: true}
}
}
return result
}
func satSummaryStatus(summary satSummary, label string) (string, string, bool) {
return satKeyStatus(summary.overall, label)
}
func satKeyStatus(rawStatus, label string) (string, string, bool) {
switch strings.ToUpper(strings.TrimSpace(rawStatus)) {
case "OK":
return "OK", label + " passed", true
case "PARTIAL", "UNSUPPORTED", "CANCELED", "CANCELLED":
return "Warning", label + " incomplete", true
case "FAILED":
return "Critical", label + " failed", true
default:
return "", "", false
}
}
func mergeComponentStatus(component *schema.HardwareComponentStatus, changedAt, satStatus, description string) {
if component == nil || satStatus == "" {
return
}
current := strings.TrimSpace(ptrString(component.Status))
if current == "" || current == "Unknown" || statusSeverity(satStatus) > statusSeverity(current) {
component.Status = appStringPtr(satStatus)
if strings.TrimSpace(description) != "" {
component.ErrorDescription = appStringPtr(description)
}
if strings.TrimSpace(changedAt) != "" {
component.StatusChangedAt = appStringPtr(changedAt)
component.StatusHistory = append(component.StatusHistory, schema.HardwareStatusHistory{
Status: satStatus,
ChangedAt: changedAt,
Details: appStringPtr(description),
})
}
}
}
func statusSeverity(status string) int {
switch strings.TrimSpace(status) {
case "Critical":
return 3
case "Warning":
return 2
case "OK":
return 1
default:
return 0
}
}
func matchesGPUVendor(dev schema.HardwarePCIeDevice, vendor string) bool {
if dev.DeviceClass == nil {
return false
}
class := strings.TrimSpace(*dev.DeviceClass)
// Accept any class string that names a controller, accelerator, display, or video device.
if !strings.Contains(class, "Controller") && !strings.Contains(class, "Accelerator") &&
!strings.Contains(class, "Display") && !strings.Contains(class, "Video") {
return false
}
manufacturer := strings.ToLower(strings.TrimSpace(ptrString(dev.Manufacturer)))
switch vendor {
case "amd":
return strings.Contains(manufacturer, "advanced micro devices") || strings.Contains(manufacturer, "amd/ati")
case "nvidia":
return strings.Contains(manufacturer, "nvidia")
default:
return false
}
}
func ptrString(v *string) string {
if v == nil {
return ""
}
return *v
}
func appStringPtr(value string) *string {
return &value
}
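
Note on the merge rule in this file: `mergeComponentStatus` and `parseStorageSATStatus` both keep the most severe status seen so far. A minimal sketch of that "worst status wins" rule follows; `severity` uses the same ranking as `statusSeverity`, while `worst` is a hypothetical helper and a simplification (the real merge also records a description, timestamp, and history entry).

```go
package main

import "fmt"

// severity ranks statuses: Critical > Warning > OK > anything else.
func severity(status string) int {
	switch status {
	case "Critical":
		return 3
	case "Warning":
		return 2
	case "OK":
		return 1
	default:
		return 0
	}
}

// worst returns the more severe of two statuses; on a tie the current one is kept.
func worst(current, incoming string) string {
	if severity(incoming) > severity(current) {
		return incoming
	}
	return current
}

func main() {
	status := ""
	for _, step := range []string{"OK", "Critical", "Warning"} {
		status = worst(status, step)
	}
	fmt.Println(status) // prints "Critical"
}
```

One consequence of this rule is that a later passing SAT run never downgrades a component that an earlier source already marked Warning or Critical.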


@@ -0,0 +1,61 @@
package app
import (
"os"
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestApplyLatestSATStatusesMarksStorageByDevice(t *testing.T) {
baseDir := t.TempDir()
runDir := filepath.Join(baseDir, "storage-20260325-161151")
if err := os.MkdirAll(runDir, 0755); err != nil {
t.Fatal(err)
}
raw := "run_at_utc=2026-03-25T16:11:51Z\nnvme0n1_nvme_smart_log_status=OK\nsda_smartctl_health_status=FAILED\noverall_status=FAILED\n"
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(raw), 0644); err != nil {
t.Fatal(err)
}
nvme := schema.HardwareStorage{Telemetry: map[string]any{"linux_device": "/dev/nvme0n1"}}
usb := schema.HardwareStorage{Telemetry: map[string]any{"linux_device": "/dev/sda"}}
snap := schema.HardwareSnapshot{Storage: []schema.HardwareStorage{nvme, usb}}
applyLatestSATStatuses(&snap, baseDir)
if snap.Storage[0].Status == nil || *snap.Storage[0].Status != "OK" {
t.Fatalf("nvme status=%v want OK", snap.Storage[0].Status)
}
if snap.Storage[1].Status == nil || *snap.Storage[1].Status != "Critical" {
t.Fatalf("sda status=%v want Critical", snap.Storage[1].Status)
}
}
func TestApplyLatestSATStatusesMarksAMDGPUs(t *testing.T) {
baseDir := t.TempDir()
runDir := filepath.Join(baseDir, "gpu-amd-20260325-161436")
if err := os.MkdirAll(runDir, 0755); err != nil {
t.Fatal(err)
}
raw := "run_at_utc=2026-03-25T16:14:36Z\noverall_status=FAILED\n"
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(raw), 0644); err != nil {
t.Fatal(err)
}
class := "DisplayController"
manufacturer := "Advanced Micro Devices, Inc. [AMD/ATI]"
snap := schema.HardwareSnapshot{
PCIeDevices: []schema.HardwarePCIeDevice{{
DeviceClass: &class,
Manufacturer: &manufacturer,
}},
}
applyLatestSATStatuses(&snap, baseDir)
if snap.PCIeDevices[0].Status == nil || *snap.PCIeDevices[0].Status != "Critical" {
t.Fatalf("gpu status=%v want Critical", snap.PCIeDevices[0].Status)
}
}
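
Note on the storage test above: it depends on summary keys of the form `<device>_<step>_status`, which `parseStorageSATStatus` splits at the first underscore. The sketch below isolates that split; `splitStorageKey` is a hypothetical helper name, not a function in this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// splitStorageKey breaks "<device>_<step>_status" into device and step.
// Underscores inside the step are turned into hyphens, as in the original loop.
// It reports false for overall_status and for keys without a device part.
func splitStorageKey(key string) (string, string, bool) {
	if !strings.HasSuffix(key, "_status") || key == "overall_status" {
		return "", "", false
	}
	base := strings.TrimSuffix(key, "_status")
	dev, rest, found := strings.Cut(base, "_")
	if !found || dev == "" {
		return "", "", false
	}
	return dev, strings.ReplaceAll(rest, "_", "-"), true
}

func main() {
	for _, key := range []string{
		"nvme0n1_nvme_smart_log_status",
		"sda_smartctl_health_status",
		"overall_status",
	} {
		dev, step, ok := splitStorageKey(key)
		fmt.Printf("%q -> dev=%q step=%q ok=%v\n", key, dev, step, ok)
	}
}
```

Splitting at the first underscore only works while kernel device names contain no underscores, which holds for names like `sda` and `nvme0n1` used in the test data.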


@@ -0,0 +1,364 @@
package app
import (
"archive/tar"
"compress/gzip"
"fmt"
"io"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
"time"
)
var supportBundleServices = []string{
"bee-audit.service",
"bee-web.service",
"bee-network.service",
"bee-nvidia.service",
"bee-preflight.service",
"bee-sshsetup.service",
}
var supportBundleCommands = []struct {
name string
cmd []string
}{
{name: "system/uname.txt", cmd: []string{"uname", "-a"}},
{name: "system/lsmod.txt", cmd: []string{"lsmod"}},
{name: "system/lspci-nn.txt", cmd: []string{"lspci", "-nn"}},
{name: "system/ip-addr.txt", cmd: []string{"ip", "addr"}},
{name: "system/ip-route.txt", cmd: []string{"ip", "route"}},
{name: "system/mount.txt", cmd: []string{"mount"}},
{name: "system/df-h.txt", cmd: []string{"df", "-h"}},
{name: "system/dmesg-tail.txt", cmd: []string{"sh", "-c", "dmesg | tail -n 200"}},
}
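// BuildSupportBundle stages the export directory, systemd status and journal
// output, and basic system command output under a temp directory, packs the
// stage into bee-support-<host>-<timestamp>.tar.gz in os.TempDir(), and
// returns the archive path.
func BuildSupportBundle(exportDir string) (string, error) {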
exportDir = strings.TrimSpace(exportDir)
if exportDir == "" {
exportDir = DefaultExportDir
}
if err := os.MkdirAll(exportDir, 0755); err != nil {
return "", err
}
if err := cleanupOldSupportBundles(os.TempDir()); err != nil {
return "", err
}
host := sanitizeFilename(hostnameOr("unknown"))
ts := time.Now().UTC().Format("20060102-150405")
stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s", host, ts))
if err := os.MkdirAll(stageRoot, 0755); err != nil {
return "", err
}
defer os.RemoveAll(stageRoot)
if err := copyExportDirForSupportBundle(exportDir, filepath.Join(stageRoot, "export")); err != nil {
return "", err
}
if err := writeJournalDump(filepath.Join(stageRoot, "systemd", "combined.journal.log")); err != nil {
return "", err
}
for _, svc := range supportBundleServices {
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".status.txt"), []string{"systemctl", "status", svc, "--no-pager"}); err != nil {
return "", err
}
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".journal.log"), []string{"journalctl", "--no-pager", "-u", svc}); err != nil {
return "", err
}
}
for _, item := range supportBundleCommands {
if err := writeCommandOutput(filepath.Join(stageRoot, item.name), item.cmd); err != nil {
return "", err
}
}
if err := writeManifest(filepath.Join(stageRoot, "manifest.txt"), exportDir, stageRoot); err != nil {
return "", err
}
archivePath := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s.tar.gz", host, ts))
if err := createSupportTarGz(archivePath, stageRoot); err != nil {
return "", err
}
return archivePath, nil
}
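// cleanupOldSupportBundles removes bee-support-*.tar.gz archives in dir that
// are older than 24 hours and then keeps only the three newest of the rest.
// Removal errors are ignored.
func cleanupOldSupportBundles(dir string) error {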
matches, err := filepath.Glob(filepath.Join(dir, "bee-support-*.tar.gz"))
if err != nil {
return err
}
type entry struct {
path string
mod time.Time
}
list := make([]entry, 0, len(matches))
for _, match := range matches {
info, err := os.Stat(match)
if err != nil {
continue
}
if time.Since(info.ModTime()) > 24*time.Hour {
_ = os.Remove(match)
continue
}
list = append(list, entry{path: match, mod: info.ModTime()})
}
sort.Slice(list, func(i, j int) bool { return list[i].mod.After(list[j].mod) })
if len(list) > 3 {
for _, old := range list[3:] {
_ = os.Remove(old.path)
}
}
return nil
}
func writeJournalDump(dst string) error {
args := []string{"--no-pager"}
for _, svc := range supportBundleServices {
args = append(args, "-u", svc)
}
raw, err := exec.Command("journalctl", args...).CombinedOutput()
if len(raw) == 0 && err != nil {
raw = []byte(err.Error() + "\n")
}
if len(raw) == 0 {
raw = []byte("no journal output\n")
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
return os.WriteFile(dst, raw, 0644)
}
func writeCommandOutput(dst string, cmd []string) error {
if len(cmd) == 0 {
return nil
}
raw, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
if len(raw) == 0 {
if err != nil {
raw = []byte(err.Error() + "\n")
} else {
raw = []byte("no output\n")
}
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
return os.WriteFile(dst, raw, 0644)
}
func writeManifest(dst, exportDir, stageRoot string) error {
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
var body strings.Builder
fmt.Fprintf(&body, "bee_version=%s\n", buildVersion())
fmt.Fprintf(&body, "host=%s\n", hostnameOr("unknown"))
fmt.Fprintf(&body, "generated_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
fmt.Fprintf(&body, "export_dir=%s\n", exportDir)
fmt.Fprintf(&body, "\nfiles:\n")
var files []string
if err := filepath.Walk(stageRoot, func(path string, info os.FileInfo, err error) error {
if err != nil || info.IsDir() {
return err
}
if filepath.Clean(path) == filepath.Clean(dst) {
return nil
}
rel, err := filepath.Rel(stageRoot, path)
if err != nil {
return err
}
files = append(files, fmt.Sprintf("%s\t%d", rel, info.Size()))
return nil
}); err != nil {
return err
}
sort.Strings(files)
for _, line := range files {
body.WriteString(line)
body.WriteByte('\n')
}
return os.WriteFile(dst, []byte(body.String()), 0644)
}
func buildVersion() string {
raw, err := exec.Command("bee", "version").CombinedOutput()
if err != nil {
return "unknown"
}
return strings.TrimSpace(string(raw))
}
func copyDirContents(srcDir, dstDir string) error {
entries, err := os.ReadDir(srcDir)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
for _, entry := range entries {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(dstDir, entry.Name())
if err := copyPath(src, dst); err != nil {
return err
}
}
return nil
}
func copyExportDirForSupportBundle(srcDir, dstDir string) error {
return copyDirContentsFiltered(srcDir, dstDir, func(rel string, info os.FileInfo) bool {
cleanRel := filepath.ToSlash(strings.TrimPrefix(filepath.Clean(rel), "./"))
if cleanRel == "" {
return true
}
if strings.HasPrefix(cleanRel, "bee-sat/") && strings.HasSuffix(cleanRel, ".tar.gz") {
return false
}
if strings.HasPrefix(filepath.Base(cleanRel), "bee-support-") && strings.HasSuffix(cleanRel, ".tar.gz") {
return false
}
return true
})
}
func copyDirContentsFiltered(srcDir, dstDir string, keep func(rel string, info os.FileInfo) bool) error {
entries, err := os.ReadDir(srcDir)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
for _, entry := range entries {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(dstDir, entry.Name())
if err := copyPathFiltered(srcDir, src, dst, keep); err != nil {
return err
}
}
return nil
}
func copyPath(src, dst string) error {
info, err := os.Stat(src)
if err != nil {
return err
}
if info.IsDir() {
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
return err
}
entries, err := os.ReadDir(src)
if err != nil {
return err
}
for _, entry := range entries {
if err := copyPath(filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name())); err != nil {
return err
}
}
return nil
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
in, err := os.Open(src)
if err != nil {
return err
}
defer in.Close()
out, err := os.OpenFile(dst, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, info.Mode().Perm())
if err != nil {
return err
}
defer out.Close()
_, err = io.Copy(out, in)
return err
}
func copyPathFiltered(rootSrc, src, dst string, keep func(rel string, info os.FileInfo) bool) error {
info, err := os.Stat(src)
if err != nil {
return err
}
rel, err := filepath.Rel(rootSrc, src)
if err != nil {
return err
}
if keep != nil && !keep(rel, info) {
return nil
}
if info.IsDir() {
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
return err
}
entries, err := os.ReadDir(src)
if err != nil {
return err
}
for _, entry := range entries {
if err := copyPathFiltered(rootSrc, filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name()), keep); err != nil {
return err
}
}
return nil
}
return copyPath(src, dst)
}
func createSupportTarGz(dst, srcDir string) error {
file, err := os.Create(dst)
if err != nil {
return err
}
defer file.Close()
gz := gzip.NewWriter(file)
defer gz.Close()
tw := tar.NewWriter(gz)
defer tw.Close()
base := filepath.Dir(srcDir)
return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if info.IsDir() {
return nil
}
header, err := tar.FileInfoHeader(info, "")
if err != nil {
return err
}
header.Name, err = filepath.Rel(base, path)
if err != nil {
return err
}
if err := tw.WriteHeader(header); err != nil {
return err
}
f, err := os.Open(path)
if err != nil {
return err
}
defer f.Close()
_, err = io.Copy(tw, f)
return err
})
}
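
Reviewer note: a minimal sketch for sanity-checking an archive shaped like the one `createSupportTarGz` writes, by listing its entry names. `listTarGz` and `buildSample` are hypothetical helpers, not part of this change, and the sample archive is built in memory so the sketch runs without touching disk.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// listTarGz returns the entry names stored in a gzip-compressed tar stream,
// in the order they appear in the archive.
func listTarGz(r io.Reader) ([]string, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)
	var names []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}

// buildSample writes a small in-memory .tar.gz (hypothetical test fixture).
func buildSample(files map[string]string) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	for name, body := range files {
		hdr := &tar.Header{Name: name, Mode: 0644, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	// Close the tar writer first (it writes the trailer), then gzip (it flushes).
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	buf, err := buildSample(map[string]string{"bee-support-demo/manifest.txt": "host=demo\n"})
	if err != nil {
		panic(err)
	}
	names, err := listTarGz(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

The sketch checks the `Close` errors explicitly; in the function above they are deferred and discarded, so a failed flush would not be reported to the caller.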

@@ -0,0 +1,252 @@
package collector
import (
"encoding/csv"
"log/slog"
"os/exec"
"path/filepath"
"sort"
"strconv"
"strings"
"bee/audit/internal/schema"
)
var (
amdSMIExecCommand = exec.Command
amdSMILookPath = exec.LookPath
amdSMIGlob = filepath.Glob
)
var amdSMIExecutableGlobs = []string{
"/opt/rocm/bin/rocm-smi",
"/opt/rocm-*/bin/rocm-smi",
"/usr/local/bin/rocm-smi",
}
type amdGPUInfo struct {
BDF string
Serial string
Product string
Firmware string
PowerW *float64
TempC *float64
}
func enrichPCIeWithAMD(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
if !hasAMDGPUDevices(devs) {
return devs
}
infoByBDF, err := queryAMDGPUs()
if err != nil {
slog.Info("amdgpu: enrichment skipped", "err", err)
return devs
}
enriched := 0
for i := range devs {
if !isAMDGPUDevice(devs[i]) || devs[i].BDF == nil {
continue
}
info, ok := infoByBDF[normalizePCIeBDF(*devs[i].BDF)]
if !ok {
continue
}
if strings.TrimSpace(info.Serial) != "" {
devs[i].SerialNumber = &info.Serial
}
if strings.TrimSpace(info.Firmware) != "" {
devs[i].Firmware = &info.Firmware
}
if strings.TrimSpace(info.Product) != "" && devs[i].Model == nil {
devs[i].Model = &info.Product
}
if info.PowerW != nil {
devs[i].PowerW = info.PowerW
}
if info.TempC != nil {
devs[i].TemperatureC = info.TempC
}
enriched++
}
if enriched > 0 {
slog.Info("amdgpu: enriched", "count", enriched)
}
return devs
}
func hasAMDGPUDevices(devs []schema.HardwarePCIeDevice) bool {
for _, dev := range devs {
if isAMDGPUDevice(dev) {
return true
}
}
return false
}
func isAMDGPUDevice(dev schema.HardwarePCIeDevice) bool {
if dev.Manufacturer == nil || dev.DeviceClass == nil {
return false
}
manufacturer := strings.ToLower(strings.TrimSpace(*dev.Manufacturer))
return strings.Contains(manufacturer, "advanced micro devices") && isGPUClass(strings.TrimSpace(*dev.DeviceClass))
}
func queryAMDGPUs() (map[string]amdGPUInfo, error) {
busByCard, err := queryAMDField("--showbus")
if err != nil {
return nil, err
}
infoByCard := map[string]amdGPUInfo{}
for card, bus := range busByCard {
bdf := normalizePCIeBDF(bus)
if bdf == "" {
continue
}
infoByCard[card] = amdGPUInfo{BDF: bdf}
}
if len(infoByCard) == 0 {
return map[string]amdGPUInfo{}, nil
}
mergeAMDField(infoByCard, "--showserial", func(info *amdGPUInfo, value string) { info.Serial = value })
mergeAMDField(infoByCard, "--showproductname", func(info *amdGPUInfo, value string) { info.Product = value })
mergeAMDField(infoByCard, "--showvbios", func(info *amdGPUInfo, value string) { info.Firmware = value })
mergeAMDNumericField(infoByCard, "--showpower", func(info *amdGPUInfo, value float64) { info.PowerW = &value })
mergeAMDNumericField(infoByCard, "--showtemp", func(info *amdGPUInfo, value float64) { info.TempC = &value })
result := make(map[string]amdGPUInfo, len(infoByCard))
for _, info := range infoByCard {
if info.BDF == "" {
continue
}
result[info.BDF] = info
}
return result, nil
}
func mergeAMDField(infoByCard map[string]amdGPUInfo, flag string, apply func(*amdGPUInfo, string)) {
values, err := queryAMDField(flag)
if err != nil {
return
}
for card, value := range values {
info, ok := infoByCard[card]
if !ok {
continue
}
value = strings.TrimSpace(value)
if value == "" {
continue
}
apply(&info, value)
infoByCard[card] = info
}
}
func mergeAMDNumericField(infoByCard map[string]amdGPUInfo, flag string, apply func(*amdGPUInfo, float64)) {
values, err := queryAMDNumericField(flag)
if err != nil {
return
}
for card, value := range values {
info, ok := infoByCard[card]
if !ok {
continue
}
apply(&info, value)
infoByCard[card] = info
}
}
func queryAMDField(flag string) (map[string]string, error) {
cmd, err := resolveAMDSMICmd(flag, "--csv")
if err != nil {
return nil, err
}
out, err := amdSMIExecCommand(cmd[0], cmd[1:]...).CombinedOutput()
if err != nil {
return nil, err
}
return parseROCmSingleValueCSV(string(out)), nil
}
func queryAMDNumericField(flag string) (map[string]float64, error) {
values, err := queryAMDField(flag)
if err != nil {
return nil, err
}
out := map[string]float64{}
for card, raw := range values {
if value, ok := firstFloat(raw); ok {
out[card] = value
}
}
return out, nil
}
func resolveAMDSMICmd(args ...string) ([]string, error) {
if path, err := amdSMILookPath("rocm-smi"); err == nil {
return append([]string{path}, args...), nil
}
for _, pattern := range amdSMIExecutableGlobs {
matches, err := amdSMIGlob(pattern)
if err != nil {
continue
}
sort.Strings(matches)
if len(matches) > 0 {
return append([]string{matches[0]}, args...), nil
}
}
return nil, exec.ErrNotFound
}
func parseROCmSingleValueCSV(raw string) map[string]string {
rows := map[string]string{}
reader := csv.NewReader(strings.NewReader(raw))
reader.FieldsPerRecord = -1
records, err := reader.ReadAll()
if err != nil {
return rows
}
for _, rec := range records {
if len(rec) < 2 {
continue
}
card := normalizeROCmCardKey(rec[0])
if card == "" {
continue
}
value := strings.TrimSpace(strings.Join(rec[1:], ","))
if value == "" || looksLikeCSVHeaderValue(value) {
continue
}
rows[card] = value
}
return rows
}
func normalizeROCmCardKey(raw string) string {
raw = strings.ToLower(strings.TrimSpace(raw))
raw = strings.Trim(raw, "\"")
if raw == "" {
return ""
}
if raw == "device" || raw == "gpu" || raw == "card" {
return ""
}
if strings.HasPrefix(raw, "card") {
return raw
}
if _, err := strconv.Atoi(raw); err == nil {
return "card" + raw
}
return ""
}
func looksLikeCSVHeaderValue(value string) bool {
value = strings.ToLower(strings.TrimSpace(value))
return strings.Contains(value, "product") ||
strings.Contains(value, "serial") ||
strings.Contains(value, "vbios") ||
strings.Contains(value, "bus")
}
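
Reviewer note: `firstFloat` is called by `queryAMDNumericField` but is not defined in this part of the diff. The sketch below shows one way such a helper could behave, given that the accompanying test expects `45.5c` to yield `45.5`. The name `firstFloatSketch` and the regular expression are assumptions, not the project's implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// floatRe matches the first signed decimal number in a string (assumed pattern).
var floatRe = regexp.MustCompile(`[-+]?[0-9]+(?:\.[0-9]+)?`)

// firstFloatSketch extracts the first number from values such as "45.5c"
// or "180.0 W". It reports false when the string contains no digits.
func firstFloatSketch(raw string) (float64, bool) {
	match := floatRe.FindString(raw)
	if match == "" {
		return 0, false
	}
	value, err := strconv.ParseFloat(match, 64)
	if err != nil {
		return 0, false
	}
	return value, true
}

func main() {
	for _, raw := range []string{"45.5c", "180.0 W", "N/A"} {
		value, ok := firstFloatSketch(raw)
		fmt.Println(raw, value, ok)
	}
}
```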

@@ -0,0 +1,56 @@
package collector
import (
"os/exec"
"testing"
)
func TestParseROCmSingleValueCSV(t *testing.T) {
raw := "device,Serial Number\ncard0,ABC123\ncard1,XYZ789\n"
got := parseROCmSingleValueCSV(raw)
if got["card0"] != "ABC123" {
t.Fatalf("card0=%q want ABC123", got["card0"])
}
if got["card1"] != "XYZ789" {
t.Fatalf("card1=%q want XYZ789", got["card1"])
}
}
func TestQueryAMDNumericFieldParsesUnits(t *testing.T) {
origExec := amdSMIExecCommand
origLookPath := amdSMILookPath
t.Cleanup(func() {
amdSMIExecCommand = origExec
amdSMILookPath = origLookPath
})
amdSMILookPath = func(string) (string, error) { return "/usr/bin/rocm-smi", nil }
amdSMIExecCommand = func(name string, args ...string) *exec.Cmd {
return exec.Command("sh", "-c", "printf 'device,Temperature\\ncard0,45.5c\\ncard1,67.0c\\n'")
}
got, err := queryAMDNumericField("--showtemp")
if err != nil {
t.Fatalf("queryAMDNumericField: %v", err)
}
if got["card0"] != 45.5 {
t.Fatalf("card0=%v want 45.5", got["card0"])
}
if got["card1"] != 67.0 {
t.Fatalf("card1=%v want 67.0", got["card1"])
}
}
func TestNormalizeROCmCardKey(t *testing.T) {
tests := map[string]string{
"0": "card0",
"card1": "card1",
"Device": "",
"": "",
}
for input, want := range tests {
if got := normalizeROCmCardKey(input); got != want {
t.Fatalf("normalizeROCmCardKey(%q)=%q want %q", input, got, want)
}
}
}

@@ -4,10 +4,27 @@ import (
"bee/audit/internal/schema"
"bufio"
"log/slog"
"os"
"os/exec"
"strings"
)
var execDmidecode = func(typeNum string) (string, error) {
out, err := exec.Command("dmidecode", "-t", typeNum).Output()
if err != nil {
return "", err
}
return string(out), nil
}
var execIpmitool = func(args ...string) (string, error) {
out, err := exec.Command("ipmitool", args...).Output()
if err != nil {
return "", err
}
return string(out), nil
}
// collectBoard runs dmidecode for types 0, 1, 2 and returns the board record
// plus the BIOS firmware entry. Any failure is logged and returns zero values.
func collectBoard() (schema.HardwareBoard, []schema.HardwareFirmwareRecord) {
@@ -61,6 +78,45 @@ func parseBoard(type1, type2 string) schema.HardwareBoard {
return board
}
// collectBMCFirmware collects BMC firmware version via ipmitool mc info.
// Returns nil if ipmitool is missing, /dev/ipmi0 is absent, or any error occurs.
func collectBMCFirmware() []schema.HardwareFirmwareRecord {
if _, err := exec.LookPath("ipmitool"); err != nil {
return nil
}
if _, err := os.Stat("/dev/ipmi0"); err != nil {
return nil
}
out, err := execIpmitool("mc", "info")
if err != nil {
slog.Info("bmc: ipmitool mc info unavailable", "err", err)
return nil
}
version := parseBMCFirmwareRevision(out)
if version == "" {
return nil
}
slog.Info("bmc: collected", "version", version)
return []schema.HardwareFirmwareRecord{
{DeviceName: "BMC", Version: version},
}
}
// parseBMCFirmwareRevision extracts the "Firmware Revision" field from ipmitool mc info output.
func parseBMCFirmwareRevision(out string) string {
for _, line := range strings.Split(out, "\n") {
line = strings.TrimSpace(line)
key, val, ok := strings.Cut(line, ":")
if !ok {
continue
}
if strings.TrimSpace(key) == "Firmware Revision" {
return strings.TrimSpace(val)
}
}
return ""
}
// parseBIOSFirmware extracts BIOS version from dmidecode type 0 output.
func parseBIOSFirmware(type0 string) []schema.HardwareFirmwareRecord {
fields := parseDMIFields(type0, "BIOS Information")
@@ -141,9 +197,5 @@ func cleanDMIValue(v string) string {
// runDmidecode executes dmidecode -t <typeNum> and returns its stdout.
func runDmidecode(typeNum string) (string, error) {
-out, err := exec.Command("dmidecode", "-t", typeNum).Output()
-if err != nil {
-return "", err
-}
-return string(out), nil
+return execDmidecode(typeNum)
}
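
Reviewer note: a minimal sketch of how `parseBMCFirmwareRevision` behaves on `ipmitool mc info`-style output. The function mirrors the one added in the hunk above; the sample text is an illustrative assumption about the `key : value` layout and was not captured from a real BMC.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBMCFirmwareRevision scans "key : value" lines and returns the value
// of the "Firmware Revision" field, or "" when the field is absent.
func parseBMCFirmwareRevision(out string) string {
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue
		}
		if strings.TrimSpace(key) == "Firmware Revision" {
			return strings.TrimSpace(val)
		}
	}
	return ""
}

func main() {
	// Hypothetical sample; real output varies by BMC vendor.
	sample := "Device ID                 : 32\n" +
		"Firmware Revision         : 1.73\n" +
		"IPMI Version              : 2.0\n"
	fmt.Println(parseBMCFirmwareRevision(sample))
}
```

Because the key comparison is exact after trimming, a line such as `Aux Firmware Rev Info` does not match, which keeps the parser from picking up auxiliary revision fields.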

@@ -7,13 +7,15 @@ import (
"bee/audit/internal/runtimeenv"
"bee/audit/internal/schema"
"log/slog"
"os"
"time"
)
// Run executes all collectors and returns the combined snapshot.
// Partial failures are logged as warnings; collection always completes.
-func Run(runtimeMode runtimeenv.Mode) schema.HardwareIngestRequest {
+func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
start := time.Now()
+collectedAt := time.Now().UTC().Format(time.RFC3339)
slog.Info("audit started")
snap := schema.HardwareSnapshot{}
@@ -21,32 +23,45 @@ func Run(runtimeMode runtimeenv.Mode) schema.HardwareIngestRequest {
board, biosFW := collectBoard()
snap.Board = board
snap.Firmware = append(snap.Firmware, biosFW...)
snap.Firmware = append(snap.Firmware, collectBMCFirmware()...)
cpus, cpuFW := collectCPUs(snap.Board.SerialNumber)
snap.CPUs = cpus
snap.Firmware = append(snap.Firmware, cpuFW...)
snap.CPUs = collectCPUs()
snap.Memory = collectMemory()
sensorDoc, err := readSensorsJSONDoc()
if err != nil {
slog.Info("sensors: unavailable for enrichment", "err", err)
}
snap.CPUs = enrichCPUsWithTelemetry(snap.CPUs, sensorDoc)
snap.Memory = enrichMemoryWithTelemetry(snap.Memory, sensorDoc)
snap.Storage = collectStorage()
snap.PCIeDevices = collectPCIe()
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices, snap.Board.SerialNumber)
snap.PCIeDevices = enrichPCIeWithAMD(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithPCISerials(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithMellanox(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNICTelemetry(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithRAIDTelemetry(snap.PCIeDevices)
snap.Storage = enrichStorageWithVROC(snap.Storage, snap.PCIeDevices)
snap.Storage = appendUniqueStorage(snap.Storage, collectRAIDStorage(snap.PCIeDevices))
snap.PowerSupplies = collectPSUs()
snap.PowerSupplies = enrichPSUsWithTelemetry(snap.PowerSupplies, sensorDoc)
snap.Sensors = buildSensorsFromDoc(sensorDoc)
finalizeSnapshot(&snap, collectedAt)
// remaining collectors added in steps 1.8-1.10
slog.Info("audit completed", "duration", time.Since(start).Round(time.Millisecond))
sourceType := string(runtimeMode)
protocol := "os-direct"
sourceType := "manual"
var targetHost *string
if hostname, err := os.Hostname(); err == nil && hostname != "" {
targetHost = &hostname
}
return schema.HardwareIngestRequest{
SourceType: &sourceType,
Protocol: &protocol,
CollectedAt: time.Now().UTC().Format(time.RFC3339),
TargetHost: targetHost,
CollectedAt: collectedAt,
Hardware: snap,
}
}

@@ -0,0 +1,64 @@
package collector
import "strings"
const (
statusOK = "OK"
statusWarning = "Warning"
statusCritical = "Critical"
statusUnknown = "Unknown"
statusEmpty = "Empty"
)
func mapPCIeDeviceClass(raw string) string {
normalized := strings.ToLower(strings.TrimSpace(raw))
switch {
case normalized == "":
return ""
case strings.Contains(normalized, "ethernet controller"):
return "EthernetController"
case strings.Contains(normalized, "fibre channel"):
return "FibreChannelController"
case strings.Contains(normalized, "network controller"), strings.Contains(normalized, "infiniband controller"):
return "NetworkController"
case strings.Contains(normalized, "serial attached scsi"), strings.Contains(normalized, "storage controller"):
return "StorageController"
case strings.Contains(normalized, "raid"), strings.Contains(normalized, "mass storage"):
return "MassStorageController"
case strings.Contains(normalized, "display controller"):
return "DisplayController"
case strings.Contains(normalized, "vga"), strings.Contains(normalized, "3d controller"), strings.Contains(normalized, "video controller"):
return "VideoController"
case strings.Contains(normalized, "processing accelerators"), strings.Contains(normalized, "processing accelerator"):
return "ProcessingAccelerator"
default:
return raw
}
}
func isNICClass(class string) bool {
switch strings.TrimSpace(class) {
case "EthernetController", "NetworkController":
return true
default:
return false
}
}
func isGPUClass(class string) bool {
switch strings.TrimSpace(class) {
case "VideoController", "DisplayController", "ProcessingAccelerator":
return true
default:
return false
}
}
func isRAIDClass(class string) bool {
switch strings.TrimSpace(class) {
case "MassStorageController", "StorageController":
return true
default:
return false
}
}
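
Reviewer note: the `switch` in `mapPCIeDeviceClass` returns on the first matching case, so case order decides the result for strings that contain more than one keyword. The sketch below copies only the two storage-related cases, in the same order, into a hypothetical `mapStorageClassSketch`. The input strings are assumed examples of lspci-style class names.

```go
package main

import (
	"fmt"
	"strings"
)

// mapStorageClassSketch reproduces the two storage cases of
// mapPCIeDeviceClass in their original order (reduced copy for illustration).
func mapStorageClassSketch(raw string) string {
	normalized := strings.ToLower(strings.TrimSpace(raw))
	switch {
	case strings.Contains(normalized, "serial attached scsi"), strings.Contains(normalized, "storage controller"):
		return "StorageController"
	case strings.Contains(normalized, "raid"), strings.Contains(normalized, "mass storage"):
		return "MassStorageController"
	default:
		return raw
	}
}

func main() {
	for _, class := range []string{
		"RAID bus controller",
		"Serial Attached SCSI controller",
		"Mass storage controller", // contains "storage controller", so the first case wins
	} {
		fmt.Printf("%s -> %s\n", class, mapStorageClassSketch(class))
	}
}
```

With this ordering, `Mass storage controller` maps to `StorageController` because the substring `storage controller` matches first. If `MassStorageController` is the intended result for that class, the two cases would need to be swapped; the diff does not make the intent clear. Both values satisfy `isRAIDClass`, so the difference only affects the reported class name.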

@@ -3,42 +3,39 @@ package collector
import (
"bee/audit/internal/schema"
"bufio"
"fmt"
"log/slog"
"os"
"path/filepath"
"strconv"
"strings"
)
// collectCPUs runs dmidecode -t 4 and reads microcode version from sysfs.
func collectCPUs(boardSerial string) ([]schema.HardwareCPU, []schema.HardwareFirmwareRecord) {
// collectCPUs runs dmidecode -t 4 and enriches CPUs with microcode from sysfs.
func collectCPUs() []schema.HardwareCPU {
out, err := runDmidecode("4")
if err != nil {
slog.Warn("cpu: dmidecode type 4 failed", "err", err)
return nil, nil
return nil
}
cpus := parseCPUs(out, boardSerial)
var firmware []schema.HardwareFirmwareRecord
cpus := parseCPUs(out)
if mc := readMicrocode(); mc != "" {
firmware = append(firmware, schema.HardwareFirmwareRecord{
DeviceName: "CPU Microcode",
Version: mc,
})
for i := range cpus {
cpus[i].Firmware = &mc
}
}
slog.Info("cpu: collected", "count", len(cpus))
return cpus, firmware
return cpus
}
// parseCPUs splits dmidecode output into per-processor sections and parses each.
func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
func parseCPUs(output string) []schema.HardwareCPU {
sections := splitDMISections(output, "Processor Information")
cpus := make([]schema.HardwareCPU, 0, len(sections))
for _, section := range sections {
cpu, ok := parseCPUSection(section, boardSerial)
cpu, ok := parseCPUSection(section)
if !ok {
continue
}
@@ -49,14 +46,16 @@ func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
// parseCPUSection parses one "Processor Information" block into a HardwareCPU.
// Returns false if the socket is unpopulated.
-func parseCPUSection(fields map[string]string, boardSerial string) (schema.HardwareCPU, bool) {
+func parseCPUSection(fields map[string]string) (schema.HardwareCPU, bool) {
status := parseCPUStatus(fields["Status"])
-if status == "EMPTY" {
+if status == statusEmpty {
return schema.HardwareCPU{}, false
}
cpu := schema.HardwareCPU{}
cpu.Status = &status
+present := true
+cpu.Present = &present
if socket, ok := parseSocketIndex(fields["Socket Designation"]); ok {
cpu.Socket = &socket
@@ -70,11 +69,6 @@ func parseCPUSection(fields map[string]string, boardSerial string) (schema.Hardw
}
if v := cleanDMIValue(fields["Serial Number"]); v != "" {
cpu.SerialNumber = &v
-} else if boardSerial != "" && cpu.Socket != nil {
-// Intel Xeon never exposes serial via DMI — generate stable fallback
-// matching core's generateCPUVendorSerial() logic
-fb := fmt.Sprintf("%s-CPU-%d", boardSerial, *cpu.Socket)
-cpu.SerialNumber = &fb
}
if v := parseMHz(fields["Max Speed"]); v > 0 {
@@ -99,15 +93,15 @@ func parseCPUStatus(raw string) string {
upper := strings.ToUpper(raw)
switch {
case upper == "" || upper == "UNKNOWN":
-return "UNKNOWN"
+return statusUnknown
case strings.Contains(upper, "UNPOPULATED") || strings.Contains(upper, "NOT POPULATED"):
-return "EMPTY"
+return statusEmpty
case strings.Contains(upper, "ENABLED"):
-return "OK"
+return statusOK
case strings.Contains(upper, "DISABLED"):
-return "WARNING"
+return statusWarning
default:
-return "UNKNOWN"
+return statusUnknown
}
}
@@ -178,7 +172,7 @@ func parseInt(v string) int {
// readMicrocode reads the CPU microcode revision from sysfs.
// Returns empty string if unavailable.
func readMicrocode() string {
-data, err := os.ReadFile("/sys/devices/system/cpu/cpu0/microcode/version")
+data, err := os.ReadFile(filepath.Join(cpuSysBaseDir, "cpu0", "microcode", "version"))
if err != nil {
return ""
}

@@ -0,0 +1,196 @@
package collector
import (
"bee/audit/internal/schema"
"os"
"path/filepath"
"regexp"
"sort"
"strconv"
"strings"
)
var (
cpuSysBaseDir = "/sys/devices/system/cpu"
socketIndexRe = regexp.MustCompile(`(?i)(?:package id|socket|cpu)\s*([0-9]+)`)
)
func enrichCPUsWithTelemetry(cpus []schema.HardwareCPU, doc sensorsDoc) []schema.HardwareCPU {
if len(cpus) == 0 {
return cpus
}
tempBySocket := cpuTempsFromSensors(doc, len(cpus))
powerBySocket := cpuPowerFromSensors(doc, len(cpus))
throttleBySocket := cpuThrottleBySocket()
for i := range cpus {
socket := 0
if cpus[i].Socket != nil {
socket = *cpus[i].Socket
}
if value, ok := tempBySocket[socket]; ok {
cpus[i].TemperatureC = &value
}
if value, ok := powerBySocket[socket]; ok {
cpus[i].PowerW = &value
}
if value, ok := throttleBySocket[socket]; ok {
cpus[i].Throttled = &value
}
}
return cpus
}
func cpuTempsFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
out := map[int]float64{}
if len(doc) == 0 {
return out
}
var fallback []float64
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if classifySensorFeature(feature) != "temp" {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
if socket, ok := detectCPUSocket(chip, featureName); ok {
if _, exists := out[socket]; !exists {
out[socket] = temp
}
continue
}
if isLikelyCPUTemp(chip, featureName) {
fallback = append(fallback, temp)
}
}
}
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
out[0] = fallback[0]
}
return out
}
func cpuPowerFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
out := map[int]float64{}
if len(doc) == 0 {
return out
}
var fallback []float64
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if classifySensorFeature(feature) != "power" {
continue
}
power, ok := firstFeatureFloatWithContains(feature, []string{"power"})
if !ok {
continue
}
if socket, ok := detectCPUSocket(chip, featureName); ok {
if _, exists := out[socket]; !exists {
out[socket] = power
}
continue
}
if isLikelyCPUPower(chip, featureName) {
fallback = append(fallback, power)
}
}
}
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
out[0] = fallback[0]
}
return out
}
func detectCPUSocket(parts ...string) (int, bool) {
for _, part := range parts {
matches := socketIndexRe.FindStringSubmatch(strings.ToLower(part))
if len(matches) == 2 {
value, err := strconv.Atoi(matches[1])
if err == nil {
return value, true
}
}
}
return 0, false
}
func isLikelyCPUTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "coretemp") ||
strings.Contains(value, "k10temp") ||
strings.Contains(value, "package id") ||
strings.Contains(value, "tdie") ||
strings.Contains(value, "tctl") ||
strings.Contains(value, "cpu temp")
}
func isLikelyCPUPower(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "intel-rapl") ||
strings.Contains(value, "package id") ||
strings.Contains(value, "package-") ||
strings.Contains(value, "cpu power")
}
func cpuThrottleBySocket() map[int]bool {
out := map[int]bool{}
cpuDirs, err := filepath.Glob(filepath.Join(cpuSysBaseDir, "cpu[0-9]*"))
if err != nil {
return out
}
sort.Strings(cpuDirs)
for _, cpuDir := range cpuDirs {
socket, ok := readSocketIndex(cpuDir)
if !ok {
continue
}
if cpuPackageThrottled(cpuDir) {
out[socket] = true
}
}
return out
}
func readSocketIndex(cpuDir string) (int, bool) {
raw, err := os.ReadFile(filepath.Join(cpuDir, "topology", "physical_package_id"))
if err != nil {
return 0, false
}
value, err := strconv.Atoi(strings.TrimSpace(string(raw)))
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func cpuPackageThrottled(cpuDir string) bool {
paths := []string{
filepath.Join(cpuDir, "thermal_throttle", "package_throttle_count"),
filepath.Join(cpuDir, "thermal_throttle", "core_throttle_count"),
}
for _, path := range paths {
raw, err := os.ReadFile(path)
if err != nil {
continue
}
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
if err == nil && value > 0 {
return true
}
}
return false
}
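
Reviewer note: a minimal sketch of how `socketIndexRe` and `detectCPUSocket` resolve a socket number from sensor chip and feature names. The regular expression and function mirror the ones above; the chip and feature strings in `main` are assumed examples, not captured `sensors -j` output.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var socketIndexRe = regexp.MustCompile(`(?i)(?:package id|socket|cpu)\s*([0-9]+)`)

// detectCPUSocket returns the first socket index found in any of the parts,
// checking them in the order given (chip name first, then feature name).
func detectCPUSocket(parts ...string) (int, bool) {
	for _, part := range parts {
		matches := socketIndexRe.FindStringSubmatch(strings.ToLower(part))
		if len(matches) == 2 {
			if value, err := strconv.Atoi(matches[1]); err == nil {
				return value, true
			}
		}
	}
	return 0, false
}

func main() {
	cases := [][2]string{
		{"coretemp-isa-0000", "Package id 1"}, // keyword followed by an index
		{"somechip-isa-0000", "CPU1 Temp"},    // no space between keyword and index
		{"k10temp-pci-00c3", "Tctl"},          // no socket keyword at all
	}
	for _, c := range cases {
		socket, ok := detectCPUSocket(c[0], c[1])
		fmt.Println(c[0], c[1], socket, ok)
	}
}
```

Features without a detectable socket, such as the `Tctl` case, fall through to the `isLikelyCPUTemp` fallback, which is applied only when exactly one CPU is present.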

@@ -0,0 +1,71 @@
package collector
import (
"os"
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestEnrichCPUsWithTelemetry(t *testing.T) {
tmp := t.TempDir()
oldBase := cpuSysBaseDir
cpuSysBaseDir = tmp
t.Cleanup(func() { cpuSysBaseDir = oldBase })
mustWriteFile(t, filepath.Join(tmp, "cpu0", "topology", "physical_package_id"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "cpu0", "thermal_throttle", "package_throttle_count"), "3\n")
mustWriteFile(t, filepath.Join(tmp, "cpu1", "topology", "physical_package_id"), "1\n")
mustWriteFile(t, filepath.Join(tmp, "cpu1", "thermal_throttle", "package_throttle_count"), "0\n")
doc := sensorsDoc{
"coretemp-isa-0000": {
"Package id 0": map[string]any{"temp1_input": 61.5},
"Package id 1": map[string]any{"temp2_input": 58.0},
},
"intel-rapl-mmio-0": {
"Package id 0": map[string]any{"power1_average": 180.0},
"Package id 1": map[string]any{"power2_average": 175.0},
},
}
socket0 := 0
socket1 := 1
status := statusOK
cpus := []schema.HardwareCPU{
{Socket: &socket0, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Socket: &socket1, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
}
got := enrichCPUsWithTelemetry(cpus, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 61.5 {
t.Fatalf("cpu0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[0].PowerW == nil || *got[0].PowerW != 180.0 {
t.Fatalf("cpu0 power mismatch: %#v", got[0].PowerW)
}
if got[0].Throttled == nil || !*got[0].Throttled {
t.Fatalf("cpu0 throttled mismatch: %#v", got[0].Throttled)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 58.0 {
t.Fatalf("cpu1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[1].PowerW == nil || *got[1].PowerW != 175.0 {
t.Fatalf("cpu1 power mismatch: %#v", got[1].PowerW)
}
if got[1].Throttled != nil && *got[1].Throttled {
t.Fatalf("cpu1 throttled mismatch: %#v", got[1].Throttled)
}
}
func mustWriteFile(t *testing.T, path, content string) {
t.Helper()
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
t.Fatalf("mkdir %s: %v", path, err)
}
if err := os.WriteFile(path, []byte(content), 0644); err != nil {
t.Fatalf("write %s: %v", path, err)
}
}

@@ -1,12 +1,14 @@
package collector
import (
"os"
"path/filepath"
"testing"
)
func TestParseCPUs_dual_socket(t *testing.T) {
out := mustReadFile(t, "testdata/dmidecode_type4.txt")
-cpus := parseCPUs(out, "CAR315KA0803B90")
+cpus := parseCPUs(out)
if len(cpus) != 2 {
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
@@ -37,23 +39,22 @@ func TestParseCPUs_dual_socket(t *testing.T) {
if cpu0.Status == nil || *cpu0.Status != "OK" {
t.Errorf("cpu0 status: got %v, want OK", cpu0.Status)
}
// Intel Xeon serial not available → fallback
if cpu0.SerialNumber == nil || *cpu0.SerialNumber != "CAR315KA0803B90-CPU-0" {
t.Errorf("cpu0 serial fallback: got %v, want CAR315KA0803B90-CPU-0", cpu0.SerialNumber)
if cpu0.SerialNumber != nil {
t.Errorf("cpu0 serial should stay nil without source data, got %v", cpu0.SerialNumber)
}
cpu1 := cpus[1]
if cpu1.Socket == nil || *cpu1.Socket != 1 {
t.Errorf("cpu1 socket: got %v, want 1", cpu1.Socket)
}
if cpu1.SerialNumber == nil || *cpu1.SerialNumber != "CAR315KA0803B90-CPU-1" {
t.Errorf("cpu1 serial fallback: got %v, want CAR315KA0803B90-CPU-1", cpu1.SerialNumber)
if cpu1.SerialNumber != nil {
t.Errorf("cpu1 serial should stay nil without source data, got %v", cpu1.SerialNumber)
}
}
func TestParseCPUs_unpopulated_skipped(t *testing.T) {
out := mustReadFile(t, "testdata/dmidecode_type4_disabled.txt")
cpus := parseCPUs(out, "BOARD-001")
cpus := parseCPUs(out)
if len(cpus) != 1 {
t.Fatalf("expected 1 CPU (unpopulated skipped), got %d", len(cpus))
@@ -63,18 +64,51 @@ func TestParseCPUs_unpopulated_skipped(t *testing.T) {
}
}
func TestCollectCPUsSetsFirmwareFromMicrocode(t *testing.T) {
tmp := t.TempDir()
origBase := cpuSysBaseDir
cpuSysBaseDir = tmp
t.Cleanup(func() { cpuSysBaseDir = origBase })
if err := os.MkdirAll(filepath.Join(tmp, "cpu0", "microcode"), 0755); err != nil {
t.Fatalf("mkdir microcode dir: %v", err)
}
if err := os.WriteFile(filepath.Join(tmp, "cpu0", "microcode", "version"), []byte("0x2b000643\n"), 0644); err != nil {
t.Fatalf("write microcode version: %v", err)
}
origRun := execDmidecode
execDmidecode = func(typeNum string) (string, error) {
if typeNum != "4" {
t.Fatalf("unexpected dmidecode type: %s", typeNum)
}
return mustReadFile(t, "testdata/dmidecode_type4.txt"), nil
}
t.Cleanup(func() { execDmidecode = origRun })
cpus := collectCPUs()
if len(cpus) != 2 {
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
}
for i, cpu := range cpus {
if cpu.Firmware == nil || *cpu.Firmware != "0x2b000643" {
t.Fatalf("cpu[%d] firmware=%v want microcode", i, cpu.Firmware)
}
}
}
func TestParseCPUStatus(t *testing.T) {
tests := []struct {
input string
want string
}{
{"Populated, Enabled", "OK"},
{"Populated, Disabled By User", "WARNING"},
{"Populated, Disabled By BIOS", "WARNING"},
{"Unpopulated", "EMPTY"},
{"Not Populated", "EMPTY"},
{"Unknown", "UNKNOWN"},
{"", "UNKNOWN"},
{"Populated, Disabled By User", statusWarning},
{"Populated, Disabled By BIOS", statusWarning},
{"Unpopulated", statusEmpty},
{"Not Populated", statusEmpty},
{"Unknown", statusUnknown},
{"", statusUnknown},
}
for _, tt := range tests {
got := parseCPUStatus(tt.input)

@@ -0,0 +1,88 @@
package collector
import "bee/audit/internal/schema"
func finalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
snap.Memory = filterMemory(snap.Memory)
snap.Storage = filterStorage(snap.Storage)
snap.PowerSupplies = filterPSUs(snap.PowerSupplies)
setComponentStatusMetadata(snap, collectedAt)
}
func filterMemory(dimms []schema.HardwareMemory) []schema.HardwareMemory {
out := make([]schema.HardwareMemory, 0, len(dimms))
for _, dimm := range dimms {
if dimm.Present != nil && !*dimm.Present {
continue
}
if dimm.Status != nil && *dimm.Status == statusEmpty {
continue
}
if dimm.SerialNumber == nil || *dimm.SerialNumber == "" {
continue
}
out = append(out, dimm)
}
return out
}
func filterStorage(disks []schema.HardwareStorage) []schema.HardwareStorage {
out := make([]schema.HardwareStorage, 0, len(disks))
for _, disk := range disks {
if disk.SerialNumber == nil || *disk.SerialNumber == "" {
continue
}
out = append(out, disk)
}
return out
}
func filterPSUs(psus []schema.HardwarePowerSupply) []schema.HardwarePowerSupply {
out := make([]schema.HardwarePowerSupply, 0, len(psus))
for _, psu := range psus {
hasIdentity := false
switch {
case psu.SerialNumber != nil && *psu.SerialNumber != "":
hasIdentity = true
case psu.Slot != nil && *psu.Slot != "":
hasIdentity = true
case psu.Model != nil && *psu.Model != "":
hasIdentity = true
case psu.Vendor != nil && *psu.Vendor != "":
hasIdentity = true
}
if !hasIdentity {
continue
}
out = append(out, psu)
}
return out
}
func setComponentStatusMetadata(snap *schema.HardwareSnapshot, collectedAt string) {
for i := range snap.CPUs {
setStatusCheckedAt(&snap.CPUs[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.Memory {
setStatusCheckedAt(&snap.Memory[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.Storage {
setStatusCheckedAt(&snap.Storage[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.PCIeDevices {
setStatusCheckedAt(&snap.PCIeDevices[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.PowerSupplies {
setStatusCheckedAt(&snap.PowerSupplies[i].HardwareComponentStatus, collectedAt)
}
}
func setStatusCheckedAt(status *schema.HardwareComponentStatus, collectedAt string) {
if status == nil || status.Status == nil || *status.Status == "" {
return
}
if status.StatusCheckedAt == nil {
status.StatusCheckedAt = &collectedAt
}
}
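
The timestamp rule in `setStatusCheckedAt` can be tried in isolation. The following is a minimal sketch, assuming a cut-down `componentStatus` struct as a hypothetical stand-in for `schema.HardwareComponentStatus` (only the two fields used here); it shows that components without a status are left unstamped and that an existing timestamp is never overwritten.

```go
package main

import "fmt"

// componentStatus is a hypothetical stand-in for schema.HardwareComponentStatus,
// reduced to the two fields the stamping rule touches.
type componentStatus struct {
	Status          *string
	StatusCheckedAt *string
}

// setStatusCheckedAt stamps collectedAt only when a non-empty status exists
// and no timestamp has been recorded yet.
func setStatusCheckedAt(status *componentStatus, collectedAt string) {
	if status == nil || status.Status == nil || *status.Status == "" {
		return
	}
	if status.StatusCheckedAt == nil {
		status.StatusCheckedAt = &collectedAt
	}
}

func main() {
	ok := "OK"
	withStatus := componentStatus{Status: &ok}
	withoutStatus := componentStatus{}

	setStatusCheckedAt(&withStatus, "2026-03-15T12:00:00Z")
	setStatusCheckedAt(&withStatus, "2026-03-16T00:00:00Z") // ignored: already stamped
	setStatusCheckedAt(&withoutStatus, "2026-03-15T12:00:00Z")

	fmt.Println(*withStatus.StatusCheckedAt)           // 2026-03-15T12:00:00Z
	fmt.Println(withoutStatus.StatusCheckedAt == nil)  // true
}
```

Because the first timestamp wins, a collector that has already recorded its own check time keeps it when the snapshot is finalized.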

@@ -0,0 +1,80 @@
package collector
import (
"bee/audit/internal/schema"
"testing"
)
func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
collectedAt := "2026-03-15T12:00:00Z"
present := true
status := statusOK
serial := "SN-1"
snap := schema.HardwareSnapshot{
Memory: []schema.HardwareMemory{
{Present: &present, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Present: &present, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
Storage: []schema.HardwareStorage{
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
PowerSupplies: []schema.HardwarePowerSupply{
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
}
finalizeSnapshot(&snap, collectedAt)
if len(snap.Memory) != 1 || snap.Memory[0].StatusCheckedAt == nil || *snap.Memory[0].StatusCheckedAt != collectedAt {
t.Fatalf("memory finalize mismatch: %+v", snap.Memory)
}
if len(snap.Storage) != 1 || snap.Storage[0].StatusCheckedAt == nil || *snap.Storage[0].StatusCheckedAt != collectedAt {
t.Fatalf("storage finalize mismatch: %+v", snap.Storage)
}
if len(snap.PowerSupplies) != 1 || snap.PowerSupplies[0].StatusCheckedAt == nil || *snap.PowerSupplies[0].StatusCheckedAt != collectedAt {
t.Fatalf("psu finalize mismatch: %+v", snap.PowerSupplies)
}
}
func TestFinalizeSnapshotPreservesDuplicateSerials(t *testing.T) {
collectedAt := "2026-03-15T12:00:00Z"
status := statusOK
model := "Device"
serial := "DUPLICATE"
snap := schema.HardwareSnapshot{
Storage: []schema.HardwareStorage{
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
}
finalizeSnapshot(&snap, collectedAt)
if got := *snap.Storage[0].SerialNumber; got != serial {
t.Fatalf("first serial changed: %q", got)
}
if got := *snap.Storage[1].SerialNumber; got != serial {
t.Fatalf("duplicate serial should stay unchanged: %q", got)
}
}
func TestFilterPSUsKeepsSlotOnlyEntries(t *testing.T) {
slot := "0"
status := statusOK
got := filterPSUs([]schema.HardwarePowerSupply{
{Slot: &slot, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
})
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].Slot == nil || *got[0].Slot != "0" {
t.Fatalf("unexpected kept PSU: %+v", got[0])
}
}

@@ -47,12 +47,12 @@ func parseMemorySection(fields map[string]string) schema.HardwareMemory {
dimm.Present = &present
if !present {
status := "EMPTY"
status := statusEmpty
dimm.Status = &status
return dimm
}
status := "OK"
status := statusOK
dimm.Status = &status
if mb := parseMemorySizeMB(rawSize); mb > 0 {

@@ -0,0 +1,203 @@
package collector
import (
"bee/audit/internal/schema"
"os"
"path/filepath"
"sort"
"strconv"
"strings"
)
var edacBaseDir = "/sys/devices/system/edac/mc"
type edacDIMMStats struct {
Label string
CECount *int64
UECount *int64
}
func enrichMemoryWithTelemetry(dimms []schema.HardwareMemory, doc sensorsDoc) []schema.HardwareMemory {
if len(dimms) == 0 {
return dimms
}
tempByLabel := memoryTempsFromSensors(doc)
stats := readEDACStats()
for i := range dimms {
labelKeys := dimmMatchKeys(dimms[i].Slot, dimms[i].Location)
for _, key := range labelKeys {
if temp, ok := tempByLabel[key]; ok {
dimms[i].TemperatureC = &temp
break
}
}
for _, key := range labelKeys {
if stat, ok := stats[key]; ok {
if stat.CECount != nil {
dimms[i].CorrectableECCErrorCount = stat.CECount
}
if stat.UECount != nil {
dimms[i].UncorrectableECCErrorCount = stat.UECount
}
if stat.UECount != nil && *stat.UECount > 0 {
dimms[i].DataLossDetected = boolPtr(true)
status := statusCritical
dimms[i].Status = &status
if dimms[i].ErrorDescription == nil {
dimms[i].ErrorDescription = stringPtr("EDAC reports uncorrectable ECC errors")
}
} else if stat.CECount != nil && *stat.CECount > 0 && (dimms[i].Status == nil || *dimms[i].Status == statusOK) {
status := statusWarning
dimms[i].Status = &status
if dimms[i].ErrorDescription == nil {
dimms[i].ErrorDescription = stringPtr("EDAC reports correctable ECC errors")
}
}
break
}
}
}
return dimms
}
func memoryTempsFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
if len(doc) == 0 {
return out
}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok || classifySensorFeature(feature) != "temp" {
continue
}
if !isLikelyMemoryTemp(chip, featureName) {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
key := canonicalLabel(featureName)
if key == "" {
continue
}
if _, exists := out[key]; !exists {
out[key] = temp
}
}
}
return out
}
func readEDACStats() map[string]edacDIMMStats {
out := map[string]edacDIMMStats{}
mcDirs, err := filepath.Glob(filepath.Join(edacBaseDir, "mc*"))
if err != nil {
return out
}
sort.Strings(mcDirs)
for _, mcDir := range mcDirs {
dimmDirs, err := filepath.Glob(filepath.Join(mcDir, "dimm*"))
if err != nil {
continue
}
sort.Strings(dimmDirs)
for _, dimmDir := range dimmDirs {
stat, ok := readEDACDIMMStats(dimmDir)
if !ok {
continue
}
key := canonicalLabel(stat.Label)
if key == "" {
continue
}
out[key] = stat
}
}
return out
}
func readEDACDIMMStats(dimmDir string) (edacDIMMStats, bool) {
labelBytes, err := os.ReadFile(filepath.Join(dimmDir, "dimm_label"))
if err != nil {
labelBytes, err = os.ReadFile(filepath.Join(dimmDir, "label"))
if err != nil {
return edacDIMMStats{}, false
}
}
label := strings.TrimSpace(string(labelBytes))
if label == "" {
return edacDIMMStats{}, false
}
stat := edacDIMMStats{Label: label}
if value, ok := readEDACCount(dimmDir, []string{"dimm_ce_count", "ce_count"}); ok {
stat.CECount = &value
}
if value, ok := readEDACCount(dimmDir, []string{"dimm_ue_count", "ue_count"}); ok {
stat.UECount = &value
}
return stat, true
}
func readEDACCount(dir string, names []string) (int64, bool) {
for _, name := range names {
raw, err := os.ReadFile(filepath.Join(dir, name))
if err != nil {
continue
}
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
if err == nil && value >= 0 {
return value, true
}
}
return 0, false
}
func dimmMatchKeys(slot, location *string) []string {
var out []string
add := func(value *string) {
key := canonicalLabel(derefString(value))
if key == "" {
return
}
for _, existing := range out {
if existing == key {
return
}
}
out = append(out, key)
}
add(slot)
add(location)
return out
}
func canonicalLabel(value string) string {
value = strings.ToUpper(strings.TrimSpace(value))
if value == "" {
return ""
}
var b strings.Builder
for _, r := range value {
if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
b.WriteRune(r)
}
}
return b.String()
}
func isLikelyMemoryTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "dimm") || strings.Contains(value, "sodimm")
}
func boolPtr(value bool) *bool {
return &value
}
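
The DIMM matching in this file depends on `canonicalLabel` to line up slot names with EDAC and sensors labels that use different separators. A minimal, self-contained sketch of that normalization follows; the function body mirrors the one above, and the sample labels are illustrative assumptions rather than output from a real system.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalLabel upper-cases the input and keeps only A-Z and 0-9,
// so separators such as '_', ' ' and '-' no longer affect matching.
func canonicalLabel(value string) string {
	value = strings.ToUpper(strings.TrimSpace(value))
	var b strings.Builder
	for _, r := range value {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Hypothetical spellings of the same slot from three different sources.
	for _, label := range []string{"CPU0_DIMM_A1", "CPU0 DIMM A1", "cpu0-dimm-a1"} {
		fmt.Println(canonicalLabel(label)) // CPU0DIMMA1 each time
	}
}
```

One consequence of this design is that labels differing only in punctuation collapse to the same key, which is what lets the `jc42` feature name `CPU0 DIMM A1` match the slot `CPU0_DIMM_A1` in the test for this file.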

@@ -0,0 +1,61 @@
package collector
import (
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestEnrichMemoryWithTelemetry(t *testing.T) {
tmp := t.TempDir()
oldBase := edacBaseDir
edacBaseDir = tmp
t.Cleanup(func() { edacBaseDir = oldBase })
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_label"), "CPU0_DIMM_A1\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ce_count"), "7\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ue_count"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_label"), "CPU1_DIMM_B2\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ce_count"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ue_count"), "2\n")
doc := sensorsDoc{
"jc42-i2c-0-18": {
"CPU0 DIMM A1": map[string]any{"temp1_input": 43.0},
"CPU1 DIMM B2": map[string]any{"temp2_input": 46.0},
},
}
status := statusOK
slotA := "CPU0_DIMM_A1"
slotB := "CPU1_DIMM_B2"
dimms := []schema.HardwareMemory{
{Slot: &slotA, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Slot: &slotB, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
}
got := enrichMemoryWithTelemetry(dimms, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 43.0 {
t.Fatalf("dimm0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[0].CorrectableECCErrorCount == nil || *got[0].CorrectableECCErrorCount != 7 {
t.Fatalf("dimm0 ce mismatch: %#v", got[0].CorrectableECCErrorCount)
}
if got[0].Status == nil || *got[0].Status != statusWarning {
t.Fatalf("dimm0 status mismatch: %#v", got[0].Status)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 46.0 {
t.Fatalf("dimm1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[1].UncorrectableECCErrorCount == nil || *got[1].UncorrectableECCErrorCount != 2 {
t.Fatalf("dimm1 ue mismatch: %#v", got[1].UncorrectableECCErrorCount)
}
if got[1].Status == nil || *got[1].Status != statusCritical {
t.Fatalf("dimm1 status mismatch: %#v", got[1].Status)
}
if got[1].DataLossDetected == nil || !*got[1].DataLossDetected {
t.Fatalf("dimm1 data_loss_detected mismatch: %#v", got[1].DataLossDetected)
}
}

@@ -18,17 +18,13 @@ var (
}
return string(out), nil
}
readNetStatFile = func(iface, key string) (int64, error) {
path := filepath.Join("/sys/class/net", iface, "statistics", key)
readNetAddressFile = func(iface string) (string, error) {
path := filepath.Join("/sys/class/net", iface, "address")
raw, err := os.ReadFile(path)
if err != nil {
return 0, err
return "", err
}
v, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
if err != nil {
return 0, err
}
return v, nil
return strings.TrimSpace(string(raw)), nil
}
)
@@ -47,6 +43,12 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
continue
}
iface := ifaces[0]
devs[i].MacAddresses = collectInterfaceMACs(ifaces)
if devs[i].SerialNumber == nil {
if serial := queryPCIDeviceSerial(bdf); serial != "" {
devs[i].SerialNumber = &serial
}
}
if devs[i].Firmware == nil {
if out, err := ethtoolInfoQuery(iface); err == nil {
@@ -56,16 +58,13 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
}
}
if devs[i].Telemetry == nil {
devs[i].Telemetry = map[string]any{}
}
injectNICPacketStats(devs[i].Telemetry, iface)
if out, err := ethtoolModuleQuery(iface); err == nil {
injectSFPDOMTelemetry(devs[i].Telemetry, out)
if injectSFPDOMTelemetry(&devs[i], out) {
enriched++
continue
}
}
if len(devs[i].Telemetry) == 0 {
devs[i].Telemetry = nil
} else {
if len(devs[i].MacAddresses) > 0 || devs[i].Firmware != nil {
enriched++
}
}
@@ -77,31 +76,32 @@ func isNICDevice(dev schema.HardwarePCIeDevice) bool {
if dev.DeviceClass == nil {
return false
}
c := strings.ToLower(strings.TrimSpace(*dev.DeviceClass))
return strings.Contains(c, "ethernet controller") ||
strings.Contains(c, "network controller") ||
strings.Contains(c, "infiniband controller")
c := strings.TrimSpace(*dev.DeviceClass)
return isNICClass(c) || strings.EqualFold(c, "FibreChannelController")
}
func injectNICPacketStats(dst map[string]any, iface string) {
for _, key := range []string{"rx_packets", "tx_packets", "rx_errors", "tx_errors"} {
if v, err := readNetStatFile(iface, key); err == nil {
dst[key] = v
func collectInterfaceMACs(ifaces []string) []string {
seen := map[string]struct{}{}
var out []string
for _, iface := range ifaces {
mac, err := readNetAddressFile(iface)
if err != nil || mac == "" {
continue
}
mac = strings.ToLower(strings.TrimSpace(mac))
if _, ok := seen[mac]; ok {
continue
}
seen[mac] = struct{}{}
out = append(out, mac)
}
}
func injectSFPDOMTelemetry(dst map[string]any, raw string) {
parsed := parseSFPDOM(raw)
for k, v := range parsed {
dst[k] = v
}
return out
}
var floatRe = regexp.MustCompile(`[-+]?[0-9]*\.?[0-9]+`)
func parseSFPDOM(raw string) map[string]any {
out := map[string]any{}
func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
var changed bool
for _, line := range strings.Split(raw, "\n") {
trimmed := strings.TrimSpace(line)
if trimmed == "" {
@@ -117,26 +117,55 @@ func parseSFPDOM(raw string) map[string]any {
switch {
case strings.Contains(key, "module temperature"):
if f, ok := firstFloat(val); ok {
out["sfp_temperature_c"] = f
dev.SFPTemperatureC = &f
changed = true
}
case strings.Contains(key, "laser output power"):
if f, ok := dbmValue(val); ok {
out["sfp_tx_power_dbm"] = f
dev.SFPTXPowerDBM = &f
changed = true
}
case strings.Contains(key, "receiver signal"):
if f, ok := dbmValue(val); ok {
out["sfp_rx_power_dbm"] = f
dev.SFPRXPowerDBM = &f
changed = true
}
case strings.Contains(key, "module voltage"):
if f, ok := firstFloat(val); ok {
out["sfp_voltage_v"] = f
dev.SFPVoltageV = &f
changed = true
}
case strings.Contains(key, "laser bias current"):
if f, ok := firstFloat(val); ok {
out["sfp_bias_ma"] = f
dev.SFPBiasMA = &f
changed = true
}
}
}
return changed
}
func parseSFPDOM(raw string) map[string]any {
dev := schema.HardwarePCIeDevice{}
if !injectSFPDOMTelemetry(&dev, raw) {
return map[string]any{}
}
out := map[string]any{}
if dev.SFPTemperatureC != nil {
out["sfp_temperature_c"] = *dev.SFPTemperatureC
}
if dev.SFPTXPowerDBM != nil {
out["sfp_tx_power_dbm"] = *dev.SFPTXPowerDBM
}
if dev.SFPRXPowerDBM != nil {
out["sfp_rx_power_dbm"] = *dev.SFPRXPowerDBM
}
if dev.SFPVoltageV != nil {
out["sfp_voltage_v"] = *dev.SFPVoltageV
}
if dev.SFPBiasMA != nil {
out["sfp_bias_ma"] = *dev.SFPBiasMA
}
return out
}
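
The MAC collection added in this file normalizes and de-duplicates addresses across all interfaces of a device. The sketch below reproduces that logic in a self-contained form; unlike the code above it takes the reader as a parameter instead of using the package-level `readNetAddressFile`, and `collectMACs` and `fakeReader` are hypothetical names introduced only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// collectMACs returns lower-cased, de-duplicated MAC addresses in first-seen order.
// Interfaces whose address cannot be read, or is empty, are skipped.
func collectMACs(ifaces []string, read func(iface string) (string, error)) []string {
	seen := map[string]struct{}{}
	var out []string
	for _, iface := range ifaces {
		mac, err := read(iface)
		if err != nil {
			continue
		}
		mac = strings.ToLower(strings.TrimSpace(mac))
		if mac == "" {
			continue
		}
		if _, ok := seen[mac]; ok {
			continue
		}
		seen[mac] = struct{}{}
		out = append(out, mac)
	}
	return out
}

// fakeReader serves addresses from a table and fails for unknown interfaces.
func fakeReader(table map[string]string) func(string) (string, error) {
	return func(iface string) (string, error) {
		mac, ok := table[iface]
		if !ok {
			return "", errors.New("no such interface: " + iface)
		}
		return mac, nil
	}
}

func main() {
	read := fakeReader(map[string]string{
		"eth0": "AA:BB:CC:DD:EE:FF\n",
		"eth1": "aa:bb:cc:dd:ee:ff", // same address, different case
		"eth2": "00:11:22:33:44:55",
	})
	fmt.Println(collectMACs([]string{"eth0", "eth1", "eth2", "eth3"}, read))
	// [aa:bb:cc:dd:ee:ff 00:11:22:33:44:55]
}
```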

@@ -1,6 +1,10 @@
package collector
import "testing"
import (
"bee/audit/internal/schema"
"fmt"
"testing"
)
func TestParseSFPDOM(t *testing.T) {
raw := `
@@ -29,6 +33,74 @@ func TestParseSFPDOM(t *testing.T) {
}
}
func TestParseLSPCIDetailSerial(t *testing.T) {
raw := `
05:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Serial number: NIC-SN-12345
`
if got := parseLSPCIDetailSerial(raw); got != "NIC-SN-12345" {
t.Fatalf("serial=%q want %q", got, "NIC-SN-12345")
}
}
func TestParsePCIVPDSerial(t *testing.T) {
raw := []byte{0x82, 0x05, 0x00, 'M', 'L', 'X', '5', 0x90, 0x08, 0x00, 'S', 'N', 0x08, 'M', 'T', '1', '2', '3', '4', '5', '6'}
if got := parsePCIVPDSerial(raw); got != "MT123456" {
t.Fatalf("serial=%q want %q", got, "MT123456")
}
}
func TestEnrichPCIeWithNICTelemetryAddsSerialFallback(t *testing.T) {
origDetail := queryPCILSPCIDetail
origVPD := readPCIVPDFile
origIfaces := netIfacesByBDF
origReadMAC := readNetAddressFile
origEth := ethtoolInfoQuery
origModule := ethtoolModuleQuery
t.Cleanup(func() {
queryPCILSPCIDetail = origDetail
readPCIVPDFile = origVPD
netIfacesByBDF = origIfaces
readNetAddressFile = origReadMAC
ethtoolInfoQuery = origEth
ethtoolModuleQuery = origModule
})
queryPCILSPCIDetail = func(bdf string) (string, error) {
if bdf != "0000:18:00.0" {
t.Fatalf("unexpected bdf: %s", bdf)
}
return "Serial number: NIC-SN-98765\n", nil
}
readPCIVPDFile = func(string) ([]byte, error) {
return nil, fmt.Errorf("no vpd needed")
}
netIfacesByBDF = func(string) []string { return []string{"eth0"} }
readNetAddressFile = func(iface string) (string, error) {
if iface != "eth0" {
t.Fatalf("unexpected iface: %s", iface)
}
return "aa:bb:cc:dd:ee:ff", nil
}
ethtoolInfoQuery = func(string) (string, error) { return "", fmt.Errorf("skip firmware") }
ethtoolModuleQuery = func(string) (string, error) { return "", fmt.Errorf("skip optics") }
class := "EthernetController"
bdf := "0000:18:00.0"
devs := []schema.HardwarePCIeDevice{{
DeviceClass: &class,
BDF: &bdf,
}}
out := enrichPCIeWithNICTelemetry(devs)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "NIC-SN-98765" {
t.Fatalf("serial=%v want NIC-SN-98765", out[0].SerialNumber)
}
if len(out[0].MacAddresses) != 1 || out[0].MacAddresses[0] != "aa:bb:cc:dd:ee:ff" {
t.Fatalf("mac_addresses=%v", out[0].MacAddresses)
}
}
func TestDBMValue(t *testing.T) {
tests := []struct {
in string

@@ -24,18 +24,17 @@ type nvidiaGPUInfo struct {
}
// enrichPCIeWithNVIDIA enriches NVIDIA PCIe devices with data from nvidia-smi.
// If the driver/tool is unavailable, NVIDIA devices get UNKNOWN status and
// a stable serial fallback based on board serial + slot.
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice, boardSerial string) []schema.HardwarePCIeDevice {
// If the driver/tool is unavailable, NVIDIA devices get Unknown status.
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
if !hasNVIDIADevices(devs) {
return devs
}
gpuByBDF, err := queryNVIDIAGPUs()
if err != nil {
slog.Info("nvidia: enrichment skipped", "err", err)
return enrichPCIeWithNVIDIAData(devs, nil, boardSerial, false)
return enrichPCIeWithNVIDIAData(devs, nil, false)
}
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, boardSerial, true)
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, true)
}
func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
@@ -47,7 +46,7 @@ func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
return false
}
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, boardSerial string, driverLoaded bool) []schema.HardwarePCIeDevice {
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, driverLoaded bool) []schema.HardwarePCIeDevice {
enriched := 0
for i := range devs {
if !isNVIDIADevice(devs[i]) {
@@ -55,7 +54,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
}
if !driverLoaded {
setPCIeFallback(&devs[i], boardSerial)
setPCIeFallback(&devs[i])
continue
}
@@ -65,22 +64,21 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
}
info, ok := gpuByBDF[bdf]
if !ok {
setPCIeFallback(&devs[i], boardSerial)
setPCIeFallback(&devs[i])
continue
}
if v := strings.TrimSpace(info.Serial); v != "" {
devs[i].SerialNumber = &v
} else {
setPCIeFallbackSerial(&devs[i], boardSerial)
}
if v := strings.TrimSpace(info.VBIOS); v != "" {
devs[i].Firmware = &v
}
status := "OK"
status := statusOK
if info.ECCUncorrected != nil && *info.ECCUncorrected > 0 {
status = "WARNING"
status = statusWarning
devs[i].ErrorDescription = stringPtr("GPU reports uncorrected ECC errors")
}
devs[i].Status = &status
injectNVIDIATelemetry(&devs[i], info)
@@ -212,46 +210,25 @@ func isNVIDIADevice(dev schema.HardwarePCIeDevice) bool {
return false
}
func setPCIeFallback(dev *schema.HardwarePCIeDevice, boardSerial string) {
setPCIeFallbackSerial(dev, boardSerial)
status := "UNKNOWN"
func setPCIeFallback(dev *schema.HardwarePCIeDevice) {
status := statusUnknown
dev.Status = &status
}
func setPCIeFallbackSerial(dev *schema.HardwarePCIeDevice, boardSerial string) {
if strings.TrimSpace(boardSerial) == "" || dev.SerialNumber != nil {
return
}
slot := "unknown"
if dev.BDF != nil && strings.TrimSpace(*dev.BDF) != "" {
slot = strings.TrimSpace(*dev.BDF)
} else if dev.Slot != nil && strings.TrimSpace(*dev.Slot) != "" {
slot = strings.TrimSpace(*dev.Slot)
}
fb := fmt.Sprintf("%s-PCIE-%s", boardSerial, slot)
dev.SerialNumber = &fb
}
func injectNVIDIATelemetry(dev *schema.HardwarePCIeDevice, info nvidiaGPUInfo) {
if dev.Telemetry == nil {
dev.Telemetry = map[string]any{}
}
if info.TemperatureC != nil {
dev.Telemetry["temperature_c"] = *info.TemperatureC
dev.TemperatureC = info.TemperatureC
}
if info.PowerW != nil {
dev.Telemetry["power_w"] = *info.PowerW
dev.PowerW = info.PowerW
}
if info.ECCUncorrected != nil {
dev.Telemetry["ecc_uncorrected_total"] = *info.ECCUncorrected
dev.ECCUncorrectedTotal = info.ECCUncorrected
}
if info.ECCCorrected != nil {
dev.Telemetry["ecc_corrected_total"] = *info.ECCCorrected
dev.ECCCorrectedTotal = info.ECCCorrected
}
if info.HWSlowdown != nil {
dev.Telemetry["hw_slowdown_active"] = *info.HWSlowdown
}
if len(dev.Telemetry) == 0 {
dev.Telemetry = nil
dev.HWSlowdown = info.HWSlowdown
}
}

@@ -54,10 +54,10 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
status := "OK"
devices := []schema.HardwarePCIeDevice{
{
VendorID: &vendorID,
BDF: &bdf,
Manufacturer: &manufacturer,
Status: &status,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
VendorID: &vendorID,
BDF: &bdf,
Manufacturer: &manufacturer,
},
}
@@ -73,21 +73,21 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
},
}
out := enrichPCIeWithNVIDIAData(devices, byBDF, "BOARD-001", true)
out := enrichPCIeWithNVIDIAData(devices, byBDF, true)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-ABC" {
t.Fatalf("serial: got %v", out[0].SerialNumber)
}
if out[0].Firmware == nil || *out[0].Firmware != "96.00.1F.00.02" {
t.Fatalf("firmware: got %v", out[0].Firmware)
}
if out[0].Status == nil || *out[0].Status != "WARNING" {
if out[0].Status == nil || *out[0].Status != statusWarning {
t.Fatalf("status: got %v", out[0].Status)
}
if out[0].Telemetry == nil {
t.Fatal("expected telemetry")
if out[0].ECCUncorrectedTotal == nil || *out[0].ECCUncorrectedTotal != 2 {
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].ECCUncorrectedTotal)
}
if got, ok := out[0].Telemetry["ecc_uncorrected_total"].(int64); !ok || got != 2 {
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].Telemetry["ecc_uncorrected_total"])
if out[0].TemperatureC == nil || *out[0].TemperatureC != 55.5 {
t.Fatalf("temperature_c: got %#v", out[0].TemperatureC)
}
}
@@ -103,11 +103,11 @@ func TestEnrichPCIeWithNVIDIAData_driverMissingFallback(t *testing.T) {
},
}
out := enrichPCIeWithNVIDIAData(devices, nil, "BOARD-123", false)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "BOARD-123-PCIE-0000:17:00.0" {
t.Fatalf("fallback serial: got %v", out[0].SerialNumber)
out := enrichPCIeWithNVIDIAData(devices, nil, false)
if out[0].SerialNumber != nil {
t.Fatalf("serial should stay nil without source data, got %v", out[0].SerialNumber)
}
if out[0].Status == nil || *out[0].Status != "UNKNOWN" {
if out[0].Status == nil || *out[0].Status != statusUnknown {
t.Fatalf("fallback status: got %v", out[0].Status)
}
}

@@ -37,7 +37,7 @@ func parseLspci(output string) []schema.HardwarePCIeDevice {
val := strings.TrimSpace(line[idx+2:])
fields[key] = val
}
if !shouldIncludePCIeDevice(fields["Class"]) {
if !shouldIncludePCIeDevice(fields["Class"], fields["Vendor"], fields["Device"]) {
continue
}
dev := parseLspciDevice(fields)
@@ -46,8 +46,10 @@ func parseLspci(output string) []schema.HardwarePCIeDevice {
return devs
}
func shouldIncludePCIeDevice(class string) bool {
func shouldIncludePCIeDevice(class, vendor, device string) bool {
c := strings.ToLower(strings.TrimSpace(class))
v := strings.ToLower(strings.TrimSpace(vendor))
d := strings.ToLower(strings.TrimSpace(device))
if c == "" {
return true
}
@@ -57,6 +59,8 @@ func shouldIncludePCIeDevice(class string) bool {
"host bridge",
"isa bridge",
"pci bridge",
"performance counter",
"performance counters",
"ram memory",
"system peripheral",
"communication controller",
@@ -66,12 +70,28 @@ func shouldIncludePCIeDevice(class string) bool {
"audio device",
"serial bus controller",
"unassigned class",
"non-essential instrumentation",
}
for _, bad := range excluded {
if strings.Contains(c, bad) {
return false
}
}
if strings.Contains(v, "advanced micro devices") || strings.Contains(v, "[amd]") {
internalAMDPatterns := []string{
"dummy function",
"reserved spp",
"ptdma",
"cryptographic coprocessor pspcpp",
"pspcpp",
}
for _, bad := range internalAMDPatterns {
if strings.Contains(d, bad) {
return false
}
}
}
return true
}
@@ -79,11 +99,12 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
dev := schema.HardwarePCIeDevice{}
present := true
dev.Present = &present
status := "OK"
status := statusOK
dev.Status = &status
// Slot is the BDF: "0000:00:02.0"
if bdf := fields["Slot"]; bdf != "" {
dev.Slot = &bdf
dev.BDF = &bdf
// parse vendor_id and device_id from sysfs
vendorID, deviceID := readPCIIDs(bdf)
@@ -93,10 +114,34 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
if deviceID != 0 {
dev.DeviceID = &deviceID
}
if numaNode, ok := readPCINumaNode(bdf); ok {
dev.NUMANode = &numaNode
} else if numaNode, ok := parsePCINumaNode(fields["NUMANode"]); ok {
dev.NUMANode = &numaNode
}
if width, ok := readPCIIntAttribute(bdf, "current_link_width"); ok {
dev.LinkWidth = &width
}
if width, ok := readPCIIntAttribute(bdf, "max_link_width"); ok {
dev.MaxLinkWidth = &width
}
if speed, ok := readPCIStringAttribute(bdf, "current_link_speed"); ok {
linkSpeed := normalizePCILinkSpeed(speed)
if linkSpeed != "" {
dev.LinkSpeed = &linkSpeed
}
}
if speed, ok := readPCIStringAttribute(bdf, "max_link_speed"); ok {
linkSpeed := normalizePCILinkSpeed(speed)
if linkSpeed != "" {
dev.MaxLinkSpeed = &linkSpeed
}
}
}
if v := fields["Class"]; v != "" {
dev.DeviceClass = &v
class := mapPCIeDeviceClass(v)
dev.DeviceClass = &class
}
if v := fields["Vendor"]; v != "" {
dev.Manufacturer = &v
@@ -131,3 +176,67 @@ func readHexFile(path string) (int, error) {
n, err := strconv.ParseInt(s, 16, 64)
return int(n), err
}
func readPCINumaNode(bdf string) (int, bool) {
value, ok := readPCIIntAttribute(bdf, "numa_node")
if !ok || value < 0 {
return 0, false
}
return value, true
}
func parsePCINumaNode(raw string) (int, bool) {
raw = strings.TrimSpace(raw)
if raw == "" {
return 0, false
}
value, err := strconv.Atoi(raw)
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func readPCIIntAttribute(bdf, attribute string) (int, bool) {
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
if err != nil {
return 0, false
}
value, err := strconv.Atoi(strings.TrimSpace(string(out)))
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func readPCIStringAttribute(bdf, attribute string) (string, bool) {
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
if err != nil {
return "", false
}
value := strings.TrimSpace(string(out))
if value == "" {
return "", false
}
return value, true
}
func normalizePCILinkSpeed(raw string) string {
raw = strings.TrimSpace(strings.ToLower(raw))
switch {
case strings.Contains(raw, "2.5"):
return "Gen1"
case strings.Contains(raw, "5.0"):
return "Gen2"
case strings.Contains(raw, "8.0"):
return "Gen3"
case strings.Contains(raw, "16.0"):
return "Gen4"
case strings.Contains(raw, "32.0"):
return "Gen5"
case strings.Contains(raw, "64.0"):
return "Gen6"
default:
return ""
}
}
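
`normalizePCILinkSpeed` classifies the sysfs value by substring. A stricter variant is sketched below, under the assumption that the transfer rate is the first whitespace-separated token of the attribute (for example `8.0 GT/s PCIe`, or `8 GT/s` as some kernels are assumed to print it). `linkSpeedGeneration` is a hypothetical name and is not part of this change.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// linkSpeedGeneration parses the leading GT/s figure and maps it to a
// PCIe generation label. Unparseable or unrecognized rates yield "".
func linkSpeedGeneration(raw string) string {
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return ""
	}
	rate, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return ""
	}
	switch rate {
	case 2.5:
		return "Gen1"
	case 5:
		return "Gen2"
	case 8:
		return "Gen3"
	case 16:
		return "Gen4"
	case 32:
		return "Gen5"
	case 64:
		return "Gen6"
	default:
		return ""
	}
}

func main() {
	for _, raw := range []string{"2.5 GT/s PCIe", "8 GT/s", "16.0 GT/s PCIe", "Unknown"} {
		fmt.Printf("%q -> %q\n", raw, linkSpeedGeneration(raw))
	}
}
```

Parsing the number avoids any dependence on the order of the substring checks, at the cost of rejecting values whose first token is not numeric.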

@@ -1,41 +1,126 @@
package collector
import "testing"
import (
"encoding/json"
"strings"
"testing"
)
func TestShouldIncludePCIeDevice(t *testing.T) {
tests := []struct {
class string
want bool
name string
class string
vendor string
device string
want bool
}{
{"USB controller", false},
{"System peripheral", false},
{"Audio device", false},
{"Host bridge", false},
{"PCI bridge", false},
{"SMBus", false},
{"Ethernet controller", true},
{"RAID bus controller", true},
{"Non-Volatile memory controller", true},
{"VGA compatible controller", true},
{name: "usb", class: "USB controller", want: false},
{name: "system peripheral", class: "System peripheral", want: false},
{name: "audio", class: "Audio device", want: false},
{name: "host bridge", class: "Host bridge", want: false},
{name: "pci bridge", class: "PCI bridge", want: false},
{name: "smbus", class: "SMBus", want: false},
{name: "perf", class: "Performance counters", want: false},
{name: "non essential instrumentation", class: "Non-Essential Instrumentation", want: false},
{name: "amd dummy function", class: "Encryption controller", vendor: "Advanced Micro Devices, Inc. [AMD]", device: "Starship/Matisse PTDMA", want: false},
{name: "amd pspcpp", class: "Encryption controller", vendor: "Advanced Micro Devices, Inc. [AMD]", device: "Starship/Matisse Cryptographic Coprocessor PSPCPP", want: false},
{name: "ethernet", class: "Ethernet controller", want: true},
{name: "raid", class: "RAID bus controller", want: true},
{name: "nvme", class: "Non-Volatile memory controller", want: true},
{name: "vga", class: "VGA compatible controller", want: true},
{name: "other encryption controller", class: "Encryption controller", vendor: "Intel Corporation", device: "QuickAssist", want: true},
}
for _, tt := range tests {
got := shouldIncludePCIeDevice(tt.class)
if got != tt.want {
t.Fatalf("class %q include=%v want %v", tt.class, got, tt.want)
}
t.Run(tt.name, func(t *testing.T) {
got := shouldIncludePCIeDevice(tt.class, tt.vendor, tt.device)
if got != tt.want {
t.Fatalf("class=%q vendor=%q device=%q include=%v want %v", tt.class, tt.vendor, tt.device, got, tt.want)
}
})
}
}
func TestParseLspci_filtersExcludedClasses(t *testing.T) {
input := "Slot:\t0000:00:14.0\nClass:\tUSB controller\nVendor:\tIntel Corporation\nDevice:\tUSB 3.0\n\n" +
"Slot:\t0000:00:18.0\nClass:\tNon-Essential Instrumentation\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PCIe Dummy Function\n\n" +
"Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 filtered device, got %d", len(devs))
}
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VGA compatible controller" {
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VideoController" {
t.Fatalf("unexpected remaining class: %v", devs[0].DeviceClass)
}
if devs[0].Slot == nil || *devs[0].Slot != "0000:65:00.0" {
t.Fatalf("slot: got %v", devs[0].Slot)
}
if devs[0].BDF == nil || *devs[0].BDF != "0000:65:00.0" {
t.Fatalf("bdf: got %v", devs[0].BDF)
}
}
func TestParseLspci_filtersAMDChipsetNoise(t *testing.T) {
input := "" +
"Slot:\t0000:1a:00.0\nClass:\tNon-Essential Instrumentation\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PCIe Dummy Function\n\n" +
"Slot:\t0000:1a:00.2\nClass:\tEncryption controller\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PTDMA\n\n" +
"Slot:\t0000:05:00.0\nClass:\tEthernet controller\nVendor:\tMellanox Technologies\nDevice:\tMT28908 Family [ConnectX-6]\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 remaining device, got %d", len(devs))
}
if devs[0].Model == nil || *devs[0].Model != "MT28908 Family [ConnectX-6]" {
t.Fatalf("unexpected remaining device: %+v", devs[0])
}
}
func TestPCIeJSONUsesSlotNotBDF(t *testing.T) {
input := "Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
devs := parseLspci(input)
data, err := json.Marshal(devs[0])
if err != nil {
t.Fatalf("marshal: %v", err)
}
text := string(data)
if !strings.Contains(text, `"slot":"0000:65:00.0"`) {
t.Fatalf("json missing slot: %s", text)
}
if strings.Contains(text, `"bdf"`) {
t.Fatalf("json should not emit bdf: %s", text)
}
}
func TestParseLspciUsesNUMANodeFieldWhenSysfsUnavailable(t *testing.T) {
input := "Slot:\t0000:65:00.0\nClass:\tEthernet controller\nVendor:\tIntel Corporation\nDevice:\tX710\nNUMANode:\t1\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 device, got %d", len(devs))
}
if devs[0].NUMANode == nil || *devs[0].NUMANode != 1 {
t.Fatalf("numa_node=%v want 1", devs[0].NUMANode)
}
}
func TestNormalizePCILinkSpeed(t *testing.T) {
tests := []struct {
raw string
want string
}{
{"2.5 GT/s PCIe", "Gen1"},
{"5.0 GT/s PCIe", "Gen2"},
{"8.0 GT/s PCIe", "Gen3"},
{"16.0 GT/s PCIe", "Gen4"},
{"32.0 GT/s PCIe", "Gen5"},
{"64.0 GT/s PCIe", "Gen6"},
{"unknown", ""},
}
for _, tt := range tests {
if got := normalizePCILinkSpeed(tt.raw); got != tt.want {
t.Fatalf("normalizePCILinkSpeed(%q)=%q want %q", tt.raw, got, tt.want)
}
}
}


@@ -0,0 +1,123 @@
package collector
import (
"bee/audit/internal/schema"
"log/slog"
"os"
"os/exec"
"path/filepath"
"strings"
)
var (
queryPCILSPCIDetail = func(bdf string) (string, error) {
out, err := exec.Command("lspci", "-vv", "-s", bdf).Output()
if err != nil {
return "", err
}
return string(out), nil
}
readPCIVPDFile = func(bdf string) ([]byte, error) {
return os.ReadFile(filepath.Join("/sys/bus/pci/devices", bdf, "vpd"))
}
)
func enrichPCIeWithPCISerials(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
enriched := 0
for i := range devs {
if !shouldProbePCIeSerial(devs[i]) {
continue
}
bdf := normalizePCIeBDF(*devs[i].BDF)
if bdf == "" {
continue
}
if serial := queryPCIDeviceSerial(bdf); serial != "" {
devs[i].SerialNumber = &serial
enriched++
}
}
if enriched > 0 {
slog.Info("pcie: serials enriched", "count", enriched)
}
return devs
}
func shouldProbePCIeSerial(dev schema.HardwarePCIeDevice) bool {
if dev.BDF == nil || dev.SerialNumber != nil {
return false
}
if dev.DeviceClass == nil {
return false
}
class := strings.TrimSpace(*dev.DeviceClass)
return isNICClass(class) || isGPUClass(class)
}
func queryPCIDeviceSerial(bdf string) string {
if out, err := queryPCILSPCIDetail(bdf); err == nil {
if serial := parseLSPCIDetailSerial(out); serial != "" {
return serial
}
}
if raw, err := readPCIVPDFile(bdf); err == nil {
return parsePCIVPDSerial(raw)
}
return ""
}
func parseLSPCIDetailSerial(raw string) string {
for _, line := range strings.Split(raw, "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
lower := strings.ToLower(line)
if !strings.Contains(lower, "serial number:") {
continue
}
idx := strings.Index(line, ":")
if idx < 0 {
continue
}
if serial := strings.TrimSpace(line[idx+1:]); serial != "" {
return serial
}
}
return ""
}
func parsePCIVPDSerial(raw []byte) string {
for i := 0; i+3 < len(raw); i++ {
if raw[i] != 'S' || raw[i+1] != 'N' {
continue
}
length := int(raw[i+2])
if length <= 0 || length > 64 || i+3+length > len(raw) {
continue
}
value := strings.TrimSpace(strings.Trim(string(raw[i+3:i+3+length]), "\x00"))
if !looksLikeSerial(value) {
continue
}
return value
}
return ""
}
func looksLikeSerial(value string) bool {
if len(value) < 4 {
return false
}
hasAlphaNum := false
for _, r := range value {
switch {
case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
hasAlphaNum = true
case strings.ContainsRune(" -_./:", r):
default:
return false
}
}
return hasAlphaNum
}
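`parsePCIVPDSerial` scans the raw VPD bytes for a keyword entry laid out as a 2-byte keyword, a 1-byte length and then the data. The sketch below builds such a buffer by hand and extracts the `SN` field, to make that layout concrete. `vpdField` and `findVPDKeyword` are hypothetical helpers, the part and serial strings are made up, and the plausibility check done by `looksLikeSerial` is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// vpdField builds one VPD keyword entry: 2-byte keyword, 1-byte length, data.
// It assumes a 2-character keyword and data shorter than 256 bytes.
func vpdField(keyword, data string) []byte {
	out := []byte(keyword)
	out = append(out, byte(len(data)))
	return append(out, data...)
}

// findVPDKeyword scans raw for a 2-character keyword and returns its data,
// trimmed of NUL padding and whitespace. Simplified from the diff: it returns
// the first match without checking that the value looks like a serial.
func findVPDKeyword(raw []byte, keyword string) string {
	for i := 0; i+3 <= len(raw); i++ {
		if string(raw[i:i+2]) != keyword {
			continue
		}
		length := int(raw[i+2])
		if length == 0 || i+3+length > len(raw) {
			continue
		}
		return strings.TrimSpace(strings.Trim(string(raw[i+3:i+3+length]), "\x00"))
	}
	return ""
}

func main() {
	// Made-up part number and serial, for illustration only.
	raw := append(vpdField("PN", "EXAMPLE-PART"), vpdField("SN", "SER1234567")...)
	fmt.Println(findVPDKeyword(raw, "SN"))
}
```

Because the scan is a plain byte search, the two bytes "SN" can also occur inside another field's data. That is the reason the diff validates each candidate with `looksLikeSerial` and keeps scanning when the check fails.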


@@ -0,0 +1,47 @@
package collector
import (
"bee/audit/internal/schema"
"fmt"
"testing"
)
func TestEnrichPCIeWithPCISerialsAddsGPUFallback(t *testing.T) {
origDetail := queryPCILSPCIDetail
origVPD := readPCIVPDFile
t.Cleanup(func() {
queryPCILSPCIDetail = origDetail
readPCIVPDFile = origVPD
})
queryPCILSPCIDetail = func(bdf string) (string, error) {
if bdf != "0000:11:00.0" {
t.Fatalf("unexpected bdf: %s", bdf)
}
return "Serial number: GPU-SN-12345\n", nil
}
readPCIVPDFile = func(string) ([]byte, error) {
return nil, fmt.Errorf("no vpd needed")
}
class := "DisplayController"
bdf := "0000:11:00.0"
devs := []schema.HardwarePCIeDevice{{
DeviceClass: &class,
BDF: &bdf,
}}
out := enrichPCIeWithPCISerials(devs)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-SN-12345" {
t.Fatalf("serial=%v want GPU-SN-12345", out[0].SerialNumber)
}
}
func TestShouldProbePCIeSerialSkipsNonGPUOrNIC(t *testing.T) {
class := "StorageController"
bdf := "0000:19:00.0"
dev := schema.HardwarePCIeDevice{DeviceClass: &class, BDF: &bdf}
if shouldProbePCIeSerial(dev) {
t.Fatal("unexpected probe for storage controller")
}
}


@@ -4,18 +4,32 @@ import (
"bee/audit/internal/schema"
"log/slog"
"os/exec"
"regexp"
"sort"
"strconv"
"strings"
)
func collectPSUs() []schema.HardwarePowerSupply {
// ipmitool requires /dev/ipmi0 — not available on non-server hardware
out, err := exec.Command("ipmitool", "fru", "print").Output()
if err != nil {
var psus []schema.HardwarePowerSupply
if out, err := exec.Command("ipmitool", "fru", "print").Output(); err == nil {
psus = parseFRU(string(out))
} else {
slog.Info("psu: fru unavailable", "err", err)
}
sdrData := map[int]psuSDR{}
if sdrOut, err := exec.Command("ipmitool", "sdr").Output(); err == nil {
sdrData = parsePSUSDR(string(sdrOut))
if len(psus) == 0 {
psus = synthesizePSUsFromSDR(sdrData)
} else {
mergePSUSDR(psus, sdrData)
}
} else if len(psus) == 0 {
slog.Info("psu: ipmitool unavailable, skipping", "err", err)
return nil
}
psus := parseFRU(string(out))
slog.Info("psu: collected", "count", len(psus))
return psus
}
@@ -75,9 +89,7 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
// Only process PSU FRU records
headerLower := strings.ToLower(header)
if !strings.Contains(headerLower, "psu") &&
!strings.Contains(headerLower, "power supply") &&
!strings.Contains(headerLower, "power_supply") {
if !isPSUHeader(headerLower) {
return schema.HardwarePowerSupply{}, false
}
@@ -85,21 +97,24 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
psu := schema.HardwarePowerSupply{Present: &present}
slotStr := strconv.Itoa(slotIdx)
if slot, ok := parsePSUSlot(header); ok && slot > 0 {
slotStr = strconv.Itoa(slot - 1)
}
psu.Slot = &slotStr
if v := cleanDMIValue(fields["Board Product"]); v != "" {
if v := firstNonEmptyField(fields, "Board Product", "Product Name", "Product Part Number"); v != "" {
psu.Model = &v
}
if v := cleanDMIValue(fields["Board Mfg"]); v != "" {
if v := firstNonEmptyField(fields, "Board Mfg", "Product Manufacturer", "Product Manufacturer Name"); v != "" {
psu.Vendor = &v
}
if v := cleanDMIValue(fields["Board Serial"]); v != "" {
if v := firstNonEmptyField(fields, "Board Serial", "Product Serial", "Product Serial Number"); v != "" {
psu.SerialNumber = &v
}
if v := cleanDMIValue(fields["Board Part Number"]); v != "" {
if v := firstNonEmptyField(fields, "Board Part Number", "Product Part Number", "Part Number"); v != "" {
psu.PartNumber = &v
}
if v := cleanDMIValue(fields["Board Extra"]); v != "" {
if v := firstNonEmptyField(fields, "Board Extra", "Product Version", "Board Version"); v != "" {
psu.Firmware = &v
}
@@ -110,12 +125,230 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
}
}
status := "OK"
status := statusOK
psu.Status = &status
return psu, true
}
func isPSUHeader(headerLower string) bool {
return strings.Contains(headerLower, "psu") ||
strings.Contains(headerLower, "pws") ||
strings.Contains(headerLower, "power supply") ||
strings.Contains(headerLower, "power_supply") ||
strings.Contains(headerLower, "power module")
}
func firstNonEmptyField(fields map[string]string, keys ...string) string {
for _, key := range keys {
if value := cleanDMIValue(fields[key]); value != "" {
return value
}
}
return ""
}
type psuSDR struct {
slot int
status string
reason string
inputPowerW *float64
outputPowerW *float64
inputVoltage *float64
temperatureC *float64
healthPct *float64
}
var psuSlotPatterns = []*regexp.Regexp{
regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bps\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bpws\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bpower\s*supply(?:\s*bay)?\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bbay\s*([0-9]+)\b`),
}
func parsePSUSDR(raw string) map[int]psuSDR {
out := map[int]psuSDR{}
for _, line := range strings.Split(raw, "\n") {
fields := splitSDRFields(line)
if len(fields) < 3 {
continue
}
name := fields[0]
value := fields[1]
state := strings.ToLower(fields[2])
slot, ok := parsePSUSlot(name)
if !ok {
continue
}
entry := out[slot]
entry.slot = slot
if entry.status == "" {
entry.status = statusOK
}
if state != "" && state != "ok" && state != "ns" {
entry.status = statusCritical
entry.reason = "PSU sensor reported non-OK state: " + state
}
lowerName := strings.ToLower(name)
switch {
case strings.Contains(lowerName, "input power"):
entry.inputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "output power"):
entry.outputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "power supply bay"), strings.Contains(lowerName, "psu bay"):
entry.outputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "input voltage"), strings.Contains(lowerName, "ac input"):
entry.inputVoltage = parseFloatPtr(value)
case strings.Contains(lowerName, "temp"):
entry.temperatureC = parseFloatPtr(value)
case strings.Contains(lowerName, "health"), strings.Contains(lowerName, "remaining life"), strings.Contains(lowerName, "life remaining"):
entry.healthPct = parsePercentPtr(value)
}
out[slot] = entry
}
return out
}
func synthesizePSUsFromSDR(sdr map[int]psuSDR) []schema.HardwarePowerSupply {
if len(sdr) == 0 {
return nil
}
slots := make([]int, 0, len(sdr))
for slot := range sdr {
slots = append(slots, slot)
}
sort.Ints(slots)
out := make([]schema.HardwarePowerSupply, 0, len(slots))
for _, slot := range slots {
entry := sdr[slot]
present := true
status := entry.status
if status == "" {
status = statusUnknown
}
slotStr := strconv.Itoa(slot - 1)
model := "PSU"
psu := schema.HardwarePowerSupply{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Slot: &slotStr,
Present: &present,
Model: &model,
InputPowerW: entry.inputPowerW,
OutputPowerW: entry.outputPowerW,
InputVoltage: entry.inputVoltage,
TemperatureC: entry.temperatureC,
}
if entry.healthPct != nil {
psu.LifeRemainingPct = entry.healthPct
lifeUsed := 100 - *entry.healthPct
psu.LifeUsedPct = &lifeUsed
}
if entry.reason != "" {
psu.ErrorDescription = &entry.reason
}
out = append(out, psu)
}
return out
}
func mergePSUSDR(psus []schema.HardwarePowerSupply, sdr map[int]psuSDR) {
for i := range psus {
slotIdx, err := strconv.Atoi(derefPSUSlot(psus[i].Slot))
if err != nil {
continue
}
entry, ok := sdr[slotIdx+1]
if !ok {
continue
}
if entry.inputPowerW != nil {
psus[i].InputPowerW = entry.inputPowerW
}
if entry.outputPowerW != nil {
psus[i].OutputPowerW = entry.outputPowerW
}
if entry.inputVoltage != nil {
psus[i].InputVoltage = entry.inputVoltage
}
if entry.temperatureC != nil {
psus[i].TemperatureC = entry.temperatureC
}
if entry.healthPct != nil {
psus[i].LifeRemainingPct = entry.healthPct
lifeUsed := 100 - *entry.healthPct
psus[i].LifeUsedPct = &lifeUsed
}
if entry.status != "" {
psus[i].Status = &entry.status
}
if entry.reason != "" {
psus[i].ErrorDescription = &entry.reason
}
if psus[i].Status != nil && *psus[i].Status == statusOK {
if (entry.inputPowerW == nil && entry.outputPowerW == nil && entry.inputVoltage == nil) && entry.status == "" {
unknown := statusUnknown
psus[i].Status = &unknown
}
}
}
}
func splitSDRFields(line string) []string {
parts := strings.Split(line, "|")
out := make([]string, 0, len(parts))
for _, part := range parts {
part = strings.TrimSpace(part)
if part != "" {
out = append(out, part)
}
}
return out
}
func parsePSUSlot(name string) (int, bool) {
for _, re := range psuSlotPatterns {
m := re.FindStringSubmatch(strings.ToLower(name))
if len(m) == 0 {
continue
}
for _, group := range m[1:] {
if group == "" {
continue
}
n, err := strconv.Atoi(group)
if err == nil && n > 0 {
return n, true
}
}
}
return 0, false
}
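The PSU code uses two numbering schemes: sensor names are 1-based ("PS1", "PSU2"), while the schema's `Slot` string is 0-based, hence the `slot - 1` and `slotIdx+1` conversions in the hunks above. The sketch below shows the conversion with the first pattern from `psuSlotPatterns`. `zeroBasedSlot` is a hypothetical helper, not part of the diff.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// psuName is the first pattern from psuSlotPatterns in the diff. It matches
// "PS1", "PSU 2" and similar, capturing the 1-based PSU number.
var psuName = regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`)

// zeroBasedSlot is a hypothetical helper. It extracts the 1-based PSU number
// from a sensor name and returns the 0-based slot string used by the schema.
func zeroBasedSlot(sensorName string) (string, bool) {
	m := psuName.FindStringSubmatch(sensorName)
	if m == nil {
		return "", false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil || n <= 0 {
		return "", false
	}
	return strconv.Itoa(n - 1), true
}

func main() {
	for _, name := range []string{"PS1 Input Power", "PSU2 Temp", "Fan 3"} {
		slot, ok := zeroBasedSlot(name)
		fmt.Printf("%q -> slot=%q ok=%v\n", name, slot, ok)
	}
}
```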
func parseFloatPtr(raw string) *float64 {
raw = strings.TrimSpace(raw)
if raw == "" || strings.EqualFold(raw, "na") {
return nil
}
for _, field := range strings.Fields(raw) {
n, err := strconv.ParseFloat(strings.TrimSpace(field), 64)
if err == nil {
return &n
}
}
return nil
}
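Taken together, `splitSDRFields` and `parseFloatPtr` turn one `ipmitool sdr` line of the form `name | reading | state` into a sensor name, a numeric reading and a state. A compact sketch of that pipeline follows, using the sample lines from the test file in this diff. `sdrReading` is a hypothetical function that, unlike `splitSDRFields`, keeps empty columns in place.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sdrReading is a hypothetical helper that parses one "name | reading | state"
// line. ok is false when the line has fewer than three columns or the reading
// column holds no parsable number (for example "na").
func sdrReading(line string) (name string, value float64, state string, ok bool) {
	parts := strings.Split(line, "|")
	if len(parts) < 3 {
		return "", 0, "", false
	}
	name = strings.TrimSpace(parts[0])
	state = strings.ToLower(strings.TrimSpace(parts[2]))
	// The reading column looks like "215 Watts"; take the first numeric token.
	for _, field := range strings.Fields(parts[1]) {
		if n, err := strconv.ParseFloat(field, 64); err == nil {
			return name, n, state, true
		}
	}
	return name, 0, state, false
}

func main() {
	for _, line := range []string{
		"PS1 Input Power | 215 Watts | ok",
		"PS2 Temp | na | ns",
	} {
		name, value, state, ok := sdrReading(line)
		fmt.Printf("name=%q value=%v state=%q ok=%v\n", name, value, state, ok)
	}
}
```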
func derefPSUSlot(slot *string) string {
if slot == nil {
return ""
}
return *slot
}
// parseWattage extracts wattage from strings like "PSU 800W", "1200W PLATINUM".
func parseWattage(s string) int {
s = strings.ToUpper(s)


@@ -0,0 +1,91 @@
package collector
import "testing"
func TestParsePSUSDR(t *testing.T) {
raw := `
PS1 Input Power | 215 Watts | ok
PS1 Output Power | 198 Watts | ok
PS1 Input Voltage | 229 Volts | ok
PS1 Temp | 39 C | ok
PS1 Health | 97 % | ok
PS2 Input Power | 0 Watts | cr
`
got := parsePSUSDR(raw)
if len(got) != 2 {
t.Fatalf("len(got)=%d want 2", len(got))
}
if got[1].status != statusOK {
t.Fatalf("ps1 status=%q", got[1].status)
}
if got[1].inputPowerW == nil || *got[1].inputPowerW != 215 {
t.Fatalf("ps1 input power=%v", got[1].inputPowerW)
}
if got[1].outputPowerW == nil || *got[1].outputPowerW != 198 {
t.Fatalf("ps1 output power=%v", got[1].outputPowerW)
}
if got[1].inputVoltage == nil || *got[1].inputVoltage != 229 {
t.Fatalf("ps1 input voltage=%v", got[1].inputVoltage)
}
if got[1].temperatureC == nil || *got[1].temperatureC != 39 {
t.Fatalf("ps1 temperature=%v", got[1].temperatureC)
}
if got[1].healthPct == nil || *got[1].healthPct != 97 {
t.Fatalf("ps1 health=%v", got[1].healthPct)
}
if got[2].status != statusCritical {
t.Fatalf("ps2 status=%q", got[2].status)
}
}
func TestParsePSUSlotVendorVariants(t *testing.T) {
t.Parallel()
tests := []struct {
name string
want int
}{
{name: "PWS1 Status", want: 1},
{name: "Power Supply Bay 8", want: 8},
{name: "PS 6 Input Power", want: 6},
}
for _, tt := range tests {
got, ok := parsePSUSlot(tt.name)
if !ok || got != tt.want {
t.Fatalf("parsePSUSlot(%q)=(%d,%v) want (%d,true)", tt.name, got, ok, tt.want)
}
}
}
func TestSynthesizePSUsFromSDR(t *testing.T) {
t.Parallel()
health := 97.0
outputPower := 915.0
got := synthesizePSUsFromSDR(map[int]psuSDR{
1: {
slot: 1,
status: statusOK,
outputPowerW: &outputPower,
healthPct: &health,
},
})
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].Slot == nil || *got[0].Slot != "0" {
t.Fatalf("slot=%v want 0", got[0].Slot)
}
if got[0].OutputPowerW == nil || *got[0].OutputPowerW != 915 {
t.Fatalf("output power=%v", got[0].OutputPowerW)
}
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 97 {
t.Fatalf("life remaining=%v", got[0].LifeRemainingPct)
}
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 3 {
t.Fatalf("life used=%v", got[0].LifeUsedPct)
}
}


@@ -0,0 +1,121 @@
package collector
import (
"bee/audit/internal/schema"
"strconv"
"strings"
)
func enrichPSUsWithTelemetry(psus []schema.HardwarePowerSupply, doc sensorsDoc) []schema.HardwarePowerSupply {
if len(psus) == 0 || len(doc) == 0 {
return psus
}
tempBySlot := psuTempsFromSensors(doc)
healthBySlot := psuHealthFromSensors(doc)
for i := range psus {
slot := derefPSUSlot(psus[i].Slot)
if slot == "" {
continue
}
if psus[i].TemperatureC == nil {
if value, ok := tempBySlot[slot]; ok {
psus[i].TemperatureC = &value
}
}
if psus[i].LifeRemainingPct == nil {
if value, ok := healthBySlot[slot]; ok {
psus[i].LifeRemainingPct = &value
used := 100 - value
psus[i].LifeUsedPct = &used
}
}
}
return psus
}
func psuHealthFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if !isLikelyPSUHealth(chip, featureName) {
continue
}
value, ok := firstFeaturePercent(feature)
if !ok {
continue
}
if slot, ok := detectPSUSlot(chip, featureName); ok {
if _, exists := out[slot]; !exists {
out[slot] = value
}
}
}
}
return out
}
func firstFeaturePercent(feature map[string]any) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
lower := strings.ToLower(key)
if strings.HasSuffix(lower, "_alarm") {
continue
}
if strings.Contains(lower, "health") || strings.Contains(lower, "life") || strings.Contains(lower, "remain") {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func isLikelyPSUHealth(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return (strings.Contains(value, "psu") || strings.Contains(value, "power supply")) &&
(strings.Contains(value, "health") || strings.Contains(value, "life") || strings.Contains(value, "remain"))
}
func psuTempsFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok || classifySensorFeature(feature) != "temp" {
continue
}
if !isLikelyPSUTemp(chip, featureName) {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
if slot, ok := detectPSUSlot(chip, featureName); ok {
if _, exists := out[slot]; !exists {
out[slot] = temp
}
}
}
}
return out
}
func isLikelyPSUTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "psu") || strings.Contains(value, "power supply")
}
func detectPSUSlot(parts ...string) (string, bool) {
for _, part := range parts {
if value, ok := parsePSUSlot(part); ok && value > 0 {
return strconv.Itoa(value - 1), true
}
}
return "", false
}


@@ -0,0 +1,42 @@
package collector
import (
"testing"
"bee/audit/internal/schema"
)
func TestEnrichPSUsWithTelemetry(t *testing.T) {
slot0 := "0"
slot1 := "1"
psus := []schema.HardwarePowerSupply{
{Slot: &slot0},
{Slot: &slot1},
}
doc := sensorsDoc{
"psu-hwmon-0": {
"PSU1 Temp": map[string]any{"temp1_input": 39.5},
"PSU2 Temp": map[string]any{"temp2_input": 41.0},
"PSU1 Health": map[string]any{"health1_input": 98.0},
"PSU2 Remaining Life": map[string]any{"life2_input": 95.0},
},
}
got := enrichPSUsWithTelemetry(psus, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 39.5 {
t.Fatalf("psu0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 41.0 {
t.Fatalf("psu1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 98.0 {
t.Fatalf("psu0 life remaining mismatch: %#v", got[0].LifeRemainingPct)
}
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 2.0 {
t.Fatalf("psu0 life used mismatch: %#v", got[0].LifeUsedPct)
}
if got[1].LifeRemainingPct == nil || *got[1].LifeRemainingPct != 95.0 {
t.Fatalf("psu1 life remaining mismatch: %#v", got[1].LifeRemainingPct)
}
}


@@ -83,11 +83,7 @@ func isLikelyRAIDController(dev schema.HardwarePCIeDevice) bool {
if dev.DeviceClass == nil {
return false
}
c := strings.ToLower(*dev.DeviceClass)
return strings.Contains(c, "raid") ||
strings.Contains(c, "sas") ||
strings.Contains(c, "mass storage") ||
strings.Contains(c, "serial attached scsi")
return isRAIDClass(*dev.DeviceClass)
}
func collectStorcliDrives() []schema.HardwareStorage {
@@ -182,7 +178,10 @@ func parseSASIrcuDisplay(raw string) []schema.HardwareStorage {
present := true
status := mapRAIDDriveStatus(b["State"])
s := schema.HardwareStorage{Present: &present, Status: &status}
s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
enclosure := strings.TrimSpace(b["Enclosure #"])
slot := strings.TrimSpace(b["Slot #"])
@@ -281,7 +280,10 @@ func parseArcconfPhysicalDrives(raw string) []schema.HardwareStorage {
for _, b := range blocks {
present := true
status := mapRAIDDriveStatus(b["State"])
s := schema.HardwareStorage{Present: &present, Status: &status}
s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
if v := strings.TrimSpace(b["Reported Location"]); v != "" {
s.Slot = &v
@@ -362,8 +364,11 @@ func parseSSACLIPhysicalDrives(raw string) []schema.HardwareStorage {
if m := ssacliPhysicalDriveLine.FindStringSubmatch(trimmed); len(m) == 3 {
flush()
present := true
status := "UNKNOWN"
s := schema.HardwareStorage{Present: &present, Status: &status}
status := statusUnknown
s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
slot := m[1]
s.Slot = &slot
@@ -475,8 +480,8 @@ func storcliDriveToStorage(d struct {
present := true
status := mapRAIDDriveStatus(d.State)
s := schema.HardwareStorage{
Present: &present,
Status: &status,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
if v := strings.TrimSpace(d.EIDSlt); v != "" {
@@ -527,15 +532,15 @@ func mapRAIDDriveStatus(raw string) string {
u := strings.ToUpper(strings.TrimSpace(raw))
switch {
case strings.Contains(u, "OK"), strings.Contains(u, "OPTIMAL"), strings.Contains(u, "READY"):
return "OK"
return statusOK
case strings.Contains(u, "ONLN"), strings.Contains(u, "ONLINE"):
return "OK"
return statusOK
case strings.Contains(u, "RBLD"), strings.Contains(u, "REBUILD"):
return "WARNING"
return statusWarning
case strings.Contains(u, "FAIL"), strings.Contains(u, "OFFLINE"):
return "CRITICAL"
return statusCritical
default:
return "UNKNOWN"
return statusUnknown
}
}
@@ -641,8 +646,9 @@ func enrichStorageWithVROC(storage []schema.HardwareStorage, pcie []schema.Hardw
storage[i].Telemetry["vroc_array"] = arr.Name
storage[i].Telemetry["vroc_degraded"] = arr.Degraded
if arr.Degraded {
status := "WARNING"
status := statusWarning
storage[i].Status = &status
storage[i].ErrorDescription = stringPtr("VROC array is degraded")
}
updated++
}
@@ -659,14 +665,14 @@ func hasVROCController(pcie []schema.HardwarePCIeDevice) bool {
class := ""
if dev.DeviceClass != nil {
class = strings.ToLower(*dev.DeviceClass)
class = strings.TrimSpace(*dev.DeviceClass)
}
model := ""
if dev.Model != nil {
model = strings.ToLower(*dev.Model)
}
if strings.Contains(class, "raid") ||
if isRAIDClass(class) ||
strings.Contains(model, "vroc") ||
strings.Contains(model, "volume management device") ||
strings.Contains(model, "vmd") {


@@ -0,0 +1,334 @@
package collector
import (
"bee/audit/internal/schema"
"encoding/json"
"log/slog"
"strconv"
"strings"
)
type raidControllerTelemetry struct {
BatteryChargePct *float64
BatteryHealthPct *float64
BatteryTemperatureC *float64
BatteryVoltageV *float64
BatteryReplaceRequired *bool
ErrorDescription *string
}
func enrichPCIeWithRAIDTelemetry(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
byVendor := collectRAIDControllerTelemetry()
if len(byVendor) == 0 {
return devs
}
positions := map[int]int{}
for i := range devs {
if devs[i].VendorID == nil || !isLikelyRAIDController(devs[i]) {
continue
}
vendor := *devs[i].VendorID
list := byVendor[vendor]
if len(list) == 0 {
continue
}
index := positions[vendor]
if index >= len(list) {
continue
}
positions[vendor] = index + 1
applyRAIDControllerTelemetry(&devs[i], list[index])
}
return devs
}
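`enrichPCIeWithRAIDTelemetry` pairs controllers with telemetry by position: the n-th PCIe device of a given vendor receives the n-th telemetry record reported by that vendor's CLI. The sketch below isolates that matching rule, with strings standing in for telemetry records. `assignByVendor` is a hypothetical name, and the vendor IDs are sample inputs only, not the values of the constants used in the diff.

```go
package main

import "fmt"

// assignByVendor is a hypothetical reduction of the matching loop. For each
// device (identified only by its vendor ID) it hands out the next unused
// record from that vendor's list, or "" when the list is exhausted or absent.
func assignByVendor(deviceVendors []int, byVendor map[int][]string) []string {
	positions := map[int]int{}
	out := make([]string, len(deviceVendors))
	for i, vendor := range deviceVendors {
		list := byVendor[vendor]
		index := positions[vendor]
		if index >= len(list) {
			continue
		}
		positions[vendor] = index + 1
		out[i] = list[index]
	}
	return out
}

func main() {
	// Sample vendor IDs used purely as map keys.
	devices := []int{0x1000, 0x9005, 0x1000, 0x1000}
	records := map[int][]string{0x1000: {"ctrl0", "ctrl1"}}
	fmt.Printf("%q\n", assignByVendor(devices, records))
}
```

Positional matching assumes the CLI lists controllers in the same order as the PCIe enumeration. When the two orders differ, records end up attached to the wrong controller, so a stable key such as the PCI address would be a sturdier join where the tool exposes one.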
func applyRAIDControllerTelemetry(dev *schema.HardwarePCIeDevice, tel raidControllerTelemetry) {
if tel.BatteryChargePct != nil {
dev.BatteryChargePct = tel.BatteryChargePct
}
if tel.BatteryHealthPct != nil {
dev.BatteryHealthPct = tel.BatteryHealthPct
}
if tel.BatteryTemperatureC != nil {
dev.BatteryTemperatureC = tel.BatteryTemperatureC
}
if tel.BatteryVoltageV != nil {
dev.BatteryVoltageV = tel.BatteryVoltageV
}
if tel.BatteryReplaceRequired != nil {
dev.BatteryReplaceRequired = tel.BatteryReplaceRequired
}
if tel.ErrorDescription != nil {
dev.ErrorDescription = tel.ErrorDescription
if dev.Status == nil || *dev.Status == statusOK {
status := statusWarning
dev.Status = &status
}
}
}
func collectRAIDControllerTelemetry() map[int][]raidControllerTelemetry {
out := map[int][]raidControllerTelemetry{}
if raw, err := raidToolQuery("storcli64", "/call", "show", "all", "J"); err == nil {
list := parseStorcliControllerTelemetry(raw)
if len(list) > 0 {
out[vendorBroadcomLSI] = append(out[vendorBroadcomLSI], list...)
slog.Info("raid: storcli controller telemetry", "count", len(list))
}
}
if raw, err := raidToolQuery("ssacli", "ctrl", "all", "show", "config", "detail"); err == nil {
list := parseSSACLIControllerTelemetry(string(raw))
if len(list) > 0 {
out[vendorHPE] = append(out[vendorHPE], list...)
slog.Info("raid: ssacli controller telemetry", "count", len(list))
}
}
if raw, err := raidToolQuery("arcconf", "getconfig", "1", "ad"); err == nil {
list := parseArcconfControllerTelemetry(string(raw))
if len(list) > 0 {
out[vendorAdaptec] = append(out[vendorAdaptec], list...)
slog.Info("raid: arcconf controller telemetry", "count", len(list))
}
}
return out
}
func parseStorcliControllerTelemetry(raw []byte) []raidControllerTelemetry {
var doc struct {
Controllers []struct {
ResponseData map[string]any `json:"Response Data"`
} `json:"Controllers"`
}
if err := json.Unmarshal(raw, &doc); err != nil {
slog.Warn("raid: parse storcli controller telemetry failed", "err", err)
return nil
}
var out []raidControllerTelemetry
for _, ctl := range doc.Controllers {
tel := raidControllerTelemetry{}
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info_Details"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info_Details"]))
if hasRAIDControllerTelemetry(tel) {
out = append(out, tel)
}
}
return out
}
func nestedStringMap(raw any) map[string]string {
switch value := raw.(type) {
case map[string]any:
out := map[string]string{}
flattenStringMap("", value, out)
return out
case []any:
out := map[string]string{}
for _, item := range value {
if m, ok := item.(map[string]any); ok {
flattenStringMap("", m, out)
}
}
return out
default:
return nil
}
}
func flattenStringMap(prefix string, in map[string]any, out map[string]string) {
for key, raw := range in {
fullKey := strings.TrimSpace(strings.ToLower(strings.Trim(prefix+" "+key, " ")))
switch value := raw.(type) {
case map[string]any:
flattenStringMap(fullKey, value, out)
case []any:
for _, item := range value {
if m, ok := item.(map[string]any); ok {
flattenStringMap(fullKey, m, out)
}
}
case string:
out[fullKey] = value
case json.Number:
out[fullKey] = value.String()
case float64:
out[fullKey] = strconv.FormatFloat(value, 'f', -1, 64)
case bool:
if value {
out[fullKey] = "true"
} else {
out[fullKey] = "false"
}
}
}
}
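`flattenStringMap` collapses nested JSON objects into a flat map whose keys are the lowercased, space-joined path. The sketch below shows the effect on a small document. `flatten` and `flattenJSON` are hypothetical names, and the inner keys of the sample JSON are invented for illustration rather than taken from real storcli output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// flatten walks nested objects and records leaf values under a lowercased,
// space-joined key path. Arrays are ignored in this reduced sketch.
func flatten(prefix string, in map[string]any, out map[string]string) {
	for key, raw := range in {
		full := strings.ToLower(strings.TrimSpace(prefix + " " + key))
		switch v := raw.(type) {
		case map[string]any:
			flatten(full, v, out)
		case string:
			out[full] = v
		case float64:
			out[full] = strconv.FormatFloat(v, 'f', -1, 64)
		case bool:
			out[full] = strconv.FormatBool(v)
		}
	}
}

// flattenJSON decodes doc with json.Unmarshal and flattens the result.
func flattenJSON(doc string) (map[string]string, error) {
	var in map[string]any
	if err := json.Unmarshal([]byte(doc), &in); err != nil {
		return nil, err
	}
	out := map[string]string{}
	flatten("", in, out)
	return out, nil
}

func main() {
	// Invented sample document; only the "BBU_Info" key appears in the diff.
	out, err := flattenJSON(`{"BBU_Info": {"Temperature": "28 C", "Voltage": 3.9, "Replace": false}}`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(out["bbu_info temperature"], out["bbu_info voltage"], out["bbu_info replace"])
}
```

Two details are worth keeping in mind when reading the diff. `json.Unmarshal` into `any` yields `float64` for numbers, so a `json.Number` branch is only reached when a decoder is configured with `UseNumber`. Go map iteration order is also unspecified, so when several flattened keys satisfy the same `strings.Contains` test, the one that fills a field first can vary between runs.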
func mergeStorcliBatteryMap(tel *raidControllerTelemetry, fields map[string]string) {
if len(fields) == 0 {
return
}
for key, raw := range fields {
lower := strings.ToLower(strings.TrimSpace(key))
switch {
case strings.Contains(lower, "relative state of charge"), strings.Contains(lower, "remaining capacity"), strings.Contains(lower, "charge"):
if tel.BatteryChargePct == nil {
tel.BatteryChargePct = parsePercentPtr(raw)
}
case strings.Contains(lower, "state of health"), strings.Contains(lower, "health"):
if tel.BatteryHealthPct == nil {
tel.BatteryHealthPct = parsePercentPtr(raw)
}
case strings.Contains(lower, "temperature"):
if tel.BatteryTemperatureC == nil {
tel.BatteryTemperatureC = parseFloatPtr(raw)
}
case strings.Contains(lower, "voltage"):
if tel.BatteryVoltageV == nil {
tel.BatteryVoltageV = parseFloatPtr(raw)
}
case strings.Contains(lower, "replace"), strings.Contains(lower, "replacement required"):
if tel.BatteryReplaceRequired == nil {
tel.BatteryReplaceRequired = parseReplaceRequired(raw)
}
case strings.Contains(lower, "learn cycle requested"), strings.Contains(lower, "battery state"), strings.Contains(lower, "capacitance state"):
if desc := batteryStateDescription(raw); desc != nil && tel.ErrorDescription == nil {
tel.ErrorDescription = desc
}
}
}
}
func parseSSACLIControllerTelemetry(raw string) []raidControllerTelemetry {
lines := strings.Split(raw, "\n")
var out []raidControllerTelemetry
var current *raidControllerTelemetry
flush := func() {
if current != nil && hasRAIDControllerTelemetry(*current) {
out = append(out, *current)
}
current = nil
}
for _, line := range lines {
trimmed := strings.TrimSpace(line)
if trimmed == "" {
continue
}
if strings.HasPrefix(strings.ToLower(trimmed), "smart array") || strings.HasPrefix(strings.ToLower(trimmed), "controller ") {
flush()
current = &raidControllerTelemetry{}
continue
}
if current == nil {
continue
}
if idx := strings.Index(trimmed, ":"); idx > 0 {
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
val := strings.TrimSpace(trimmed[idx+1:])
switch {
case strings.Contains(key, "capacitor temperature"), strings.Contains(key, "battery temperature"):
current.BatteryTemperatureC = parseFloatPtr(val)
case strings.Contains(key, "capacitor voltage"), strings.Contains(key, "battery voltage"):
current.BatteryVoltageV = parseFloatPtr(val)
case strings.Contains(key, "capacitor charge"), strings.Contains(key, "battery charge"):
current.BatteryChargePct = parsePercentPtr(val)
case strings.Contains(key, "capacitor health"), strings.Contains(key, "battery health"):
current.BatteryHealthPct = parsePercentPtr(val)
case strings.Contains(key, "replace") || strings.Contains(key, "failed"):
if current.BatteryReplaceRequired == nil {
current.BatteryReplaceRequired = parseReplaceRequired(val)
}
if desc := batteryStateDescription(val); desc != nil && current.ErrorDescription == nil {
current.ErrorDescription = desc
}
}
}
}
flush()
return out
}
func parseArcconfControllerTelemetry(raw string) []raidControllerTelemetry {
lines := strings.Split(raw, "\n")
tel := raidControllerTelemetry{}
for _, line := range lines {
trimmed := strings.TrimSpace(line)
if idx := strings.Index(trimmed, ":"); idx > 0 {
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
val := strings.TrimSpace(trimmed[idx+1:])
switch {
case strings.Contains(key, "battery temperature"), strings.Contains(key, "capacitor temperature"):
tel.BatteryTemperatureC = parseFloatPtr(val)
case strings.Contains(key, "battery voltage"), strings.Contains(key, "capacitor voltage"):
tel.BatteryVoltageV = parseFloatPtr(val)
case strings.Contains(key, "battery charge"), strings.Contains(key, "capacitor charge"):
tel.BatteryChargePct = parsePercentPtr(val)
case strings.Contains(key, "battery health"), strings.Contains(key, "capacitor health"):
tel.BatteryHealthPct = parsePercentPtr(val)
case strings.Contains(key, "replace"), strings.Contains(key, "failed"):
if tel.BatteryReplaceRequired == nil {
tel.BatteryReplaceRequired = parseReplaceRequired(val)
}
if desc := batteryStateDescription(val); desc != nil && tel.ErrorDescription == nil {
tel.ErrorDescription = desc
}
}
}
}
if hasRAIDControllerTelemetry(tel) {
return []raidControllerTelemetry{tel}
}
return nil
}
func hasRAIDControllerTelemetry(tel raidControllerTelemetry) bool {
return tel.BatteryChargePct != nil ||
tel.BatteryHealthPct != nil ||
tel.BatteryTemperatureC != nil ||
tel.BatteryVoltageV != nil ||
tel.BatteryReplaceRequired != nil ||
tel.ErrorDescription != nil
}
func parsePercentPtr(raw string) *float64 {
raw = strings.ReplaceAll(strings.TrimSpace(raw), "%", "")
return parseFloatPtr(raw)
}
func parseReplaceRequired(raw string) *bool {
lower := strings.ToLower(strings.TrimSpace(raw))
switch {
case lower == "":
return nil
case strings.Contains(lower, "replace"), strings.Contains(lower, "failed"), strings.Contains(lower, "yes"), strings.Contains(lower, "required"):
value := true
return &value
case strings.Contains(lower, "no"), strings.Contains(lower, "ok"), strings.Contains(lower, "good"), strings.Contains(lower, "optimal"):
value := false
return &value
default:
return nil
}
}
func batteryStateDescription(raw string) *string {
lower := strings.ToLower(strings.TrimSpace(raw))
if lower == "" {
return nil
}
switch {
case strings.Contains(lower, "failed"), strings.Contains(lower, "fault"), strings.Contains(lower, "replace"), strings.Contains(lower, "warning"), strings.Contains(lower, "degraded"):
return &raw
default:
return nil
}
}
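
To illustrate what `flattenStringMap` produces, the following is a minimal, self-contained sketch. It copies the helper from the diff (minus the `json.Number` case, which only matters when a decoder uses `UseNumber`) and runs it on a small invented document. The `flattenJSON` wrapper and the sample JSON are assumptions made for this example and are not part of the change set.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// flattenStringMap mirrors the helper in the diff: nested objects and
// arrays of objects are flattened into lowercase, space-joined keys.
func flattenStringMap(prefix string, in map[string]any, out map[string]string) {
	for key, raw := range in {
		fullKey := strings.TrimSpace(strings.ToLower(strings.Trim(prefix+" "+key, " ")))
		switch value := raw.(type) {
		case map[string]any:
			flattenStringMap(fullKey, value, out)
		case []any:
			for _, item := range value {
				if m, ok := item.(map[string]any); ok {
					flattenStringMap(fullKey, m, out)
				}
			}
		case string:
			out[fullKey] = value
		case float64:
			out[fullKey] = strconv.FormatFloat(value, 'f', -1, 64)
		case bool:
			out[fullKey] = strconv.FormatBool(value)
		}
	}
}

// flattenJSON is a hypothetical convenience wrapper used only in this example.
func flattenJSON(raw []byte) (map[string]string, error) {
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	out := map[string]string{}
	flattenStringMap("", doc, out)
	return out, nil
}

func main() {
	raw := []byte(`{"BBU_Info": {"State of Health": "98 %", "Cells": 3, "Charging": false}}`)
	fields, err := flattenJSON(raw)
	if err != nil {
		panic(err)
	}
	// Sort keys so the printed order is stable across runs.
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%q = %q\n", k, fields[k])
	}
}
```

The flattened keys keep the full path (for example `bbu_info state of health`), which is why `mergeStorcliBatteryMap` matches with `strings.Contains` rather than comparing whole keys.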


@@ -1,6 +1,10 @@
package collector
import "testing"
import (
"bee/audit/internal/schema"
"errors"
"testing"
)
func TestParseSASIrcuControllerIDs(t *testing.T) {
raw := `LSI Corporation SAS2 IR Configuration Utility.
@@ -90,7 +94,111 @@ physicaldrive 1I:1:2 (894 GB, SAS HDD, Failed)
if drives[0].Status == nil || *drives[0].Status != "OK" {
t.Fatalf("drive0 status: %v", drives[0].Status)
}
if drives[1].Status == nil || *drives[1].Status != "CRITICAL" {
if drives[1].Status == nil || *drives[1].Status != statusCritical {
t.Fatalf("drive1 status: %v", drives[1].Status)
}
}
func TestParseStorcliControllerTelemetry(t *testing.T) {
raw := []byte(`{
"Controllers": [
{
"Response Data": {
"BBU_Info": {
"State of Health": "98 %",
"Relative State of Charge": "76 %",
"Temperature": "41 C",
"Voltage": "12.3 V",
"Replacement required": "No"
}
}
}
]
}`)
got := parseStorcliControllerTelemetry(raw)
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].BatteryHealthPct == nil || *got[0].BatteryHealthPct != 98 {
t.Fatalf("battery health=%v", got[0].BatteryHealthPct)
}
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 76 {
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
}
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 41 {
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
}
if got[0].BatteryVoltageV == nil || *got[0].BatteryVoltageV != 12.3 {
t.Fatalf("battery voltage=%v", got[0].BatteryVoltageV)
}
if got[0].BatteryReplaceRequired == nil || *got[0].BatteryReplaceRequired {
t.Fatalf("battery replace=%v", got[0].BatteryReplaceRequired)
}
}
func TestParseSSACLIControllerTelemetry(t *testing.T) {
raw := `Smart Array P440ar in Slot 0
Battery/Capacitor Count: 1
Capacitor Temperature (C): 37
Capacitor Charge (%): 94
Capacitor Health (%): 96
Capacitor Voltage (V): 9.8
Capacitor Failed: No
`
got := parseSSACLIControllerTelemetry(raw)
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 37 {
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
}
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 94 {
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
}
}
func TestEnrichPCIeWithRAIDTelemetry(t *testing.T) {
orig := raidToolQuery
t.Cleanup(func() { raidToolQuery = orig })
raidToolQuery = func(name string, args ...string) ([]byte, error) {
switch name {
case "storcli64":
return []byte(`{
"Controllers": [
{
"Response Data": {
"CV_Info": {
"State of Health": "99 %",
"Relative State of Charge": "81 %",
"Temperature": "38 C",
"Voltage": "12.1 V",
"Replacement required": "No"
}
}
}
]
}`), nil
default:
return nil, errors.New("skip")
}
}
vendor := vendorBroadcomLSI
class := "MassStorageController"
status := statusOK
devs := []schema.HardwarePCIeDevice{{
VendorID: &vendor,
DeviceClass: &class,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
}}
out := enrichPCIeWithRAIDTelemetry(devs)
if out[0].BatteryHealthPct == nil || *out[0].BatteryHealthPct != 99 {
t.Fatalf("battery health=%v", out[0].BatteryHealthPct)
}
if out[0].BatteryChargePct == nil || *out[0].BatteryChargePct != 81 {
t.Fatalf("battery charge=%v", out[0].BatteryChargePct)
}
if out[0].BatteryVoltageV == nil || *out[0].BatteryVoltageV != 12.1 {
t.Fatalf("battery voltage=%v", out[0].BatteryVoltageV)
}
}
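
The tests above only feed `"No"` into the replace-required parsing. Because `parseReplaceRequired` matches substrings, a value such as `"Unknown"` contains `"no"` and yields `false`, and `"Not required"` contains `"required"` and yields `true`. The sketch below shows one possible stricter variant that compares whole words. The function name `replaceRequiredFromTokens`, the `boolPtr` helper and the word lists are hypothetical and are not taken from the repository.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

func boolPtr(v bool) *bool { return &v }

// replaceRequiredFromTokens is a hypothetical word-based alternative to
// substring matching. It returns nil when the value carries no signal.
func replaceRequiredFromTokens(raw string) *bool {
	// Split on anything that is not a letter, so "Not required." becomes
	// the words "not" and "required".
	words := strings.FieldsFunc(strings.ToLower(raw), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	has := func(targets ...string) bool {
		for _, w := range words {
			for _, t := range targets {
				if w == t {
					return true
				}
			}
		}
		return false
	}
	negated := has("no", "not")
	switch {
	case has("yes", "failed", "replace", "required"):
		return boolPtr(!negated) // "required" is true, "not required" is false
	case has("ok", "good", "optimal"):
		return boolPtr(negated) // "ok" is false, "not ok" is true
	case negated:
		return boolPtr(false) // a bare "No"
	default:
		return nil
	}
}

func main() {
	for _, in := range []string{"No", "Yes", "Not required", "Unknown", "Failed"} {
		got := "nil"
		if v := replaceRequiredFromTokens(in); v != nil {
			got = fmt.Sprint(*v)
		}
		fmt.Printf("%-14q -> %s\n", in, got)
	}
}
```

Whether real controller output ever contains such values is not established by the diff, so this is a hardening idea rather than a confirmed bug fix.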


@@ -0,0 +1,373 @@
package collector
import (
"bee/audit/internal/schema"
"encoding/json"
"log/slog"
"os/exec"
"sort"
"strconv"
"strings"
)
type sensorsDoc map[string]map[string]any
func collectSensors() *schema.HardwareSensors {
doc, err := readSensorsJSONDoc()
if err != nil {
slog.Info("sensors: unavailable, skipping", "err", err)
return nil
}
sensors := buildSensorsFromDoc(doc)
if sensors == nil || (len(sensors.Fans) == 0 && len(sensors.Power) == 0 && len(sensors.Temperatures) == 0 && len(sensors.Other) == 0) {
return nil
}
slog.Info("sensors: collected",
"fans", len(sensors.Fans),
"power", len(sensors.Power),
"temperatures", len(sensors.Temperatures),
"other", len(sensors.Other),
)
return sensors
}
func readSensorsJSONDoc() (sensorsDoc, error) {
out, err := exec.Command("sensors", "-j").Output()
if err != nil {
return nil, err
}
var doc sensorsDoc
if err := json.Unmarshal(out, &doc); err != nil {
return nil, err
}
return doc, nil
}
func buildSensorsFromDoc(doc sensorsDoc) *schema.HardwareSensors {
if len(doc) == 0 {
return nil
}
result := &schema.HardwareSensors{}
seen := map[string]struct{}{}
chips := make([]string, 0, len(doc))
for chip := range doc {
chips = append(chips, chip)
}
sort.Strings(chips)
for _, chip := range chips {
features := doc[chip]
location := sensorLocation(chip)
keys := make([]string, 0, len(features))
for key := range features {
keys = append(keys, key)
}
sort.Strings(keys)
for _, key := range keys {
if strings.EqualFold(key, "Adapter") {
continue
}
feature, ok := features[key].(map[string]any)
if !ok {
continue
}
name := strings.TrimSpace(key)
if name == "" {
continue
}
switch classifySensorFeature(feature) {
case "fan":
item := buildFanSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "fan", item.Name) {
continue
}
result.Fans = append(result.Fans, *item)
case "temp":
item := buildTempSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "temp", item.Name) {
continue
}
result.Temperatures = append(result.Temperatures, *item)
case "power":
item := buildPowerSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "power", item.Name) {
continue
}
result.Power = append(result.Power, *item)
default:
item := buildOtherSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "other", item.Name) {
continue
}
result.Other = append(result.Other, *item)
}
}
}
return result
}
func parseSensorsJSON(raw []byte) (*schema.HardwareSensors, error) {
var doc sensorsDoc
err := json.Unmarshal(raw, &doc)
if err != nil {
return nil, err
}
return buildSensorsFromDoc(doc), nil
}
func duplicateSensor(seen map[string]struct{}, sensorType, name string) bool {
key := sensorType + "\x00" + name
if _, ok := seen[key]; ok {
return true
}
seen[key] = struct{}{}
return false
}
func sensorLocation(chip string) *string {
chip = strings.TrimSpace(chip)
if chip == "" {
return nil
}
return &chip
}
func classifySensorFeature(feature map[string]any) string {
for key := range feature {
switch {
case strings.Contains(key, "fan") && strings.HasSuffix(key, "_input"):
return "fan"
case strings.Contains(key, "temp") && strings.HasSuffix(key, "_input"):
return "temp"
case strings.Contains(key, "power") && (strings.HasSuffix(key, "_input") || strings.HasSuffix(key, "_average")):
return "power"
case strings.Contains(key, "curr") && strings.HasSuffix(key, "_input"):
return "power"
case strings.HasPrefix(key, "in") && strings.HasSuffix(key, "_input"):
return "power"
}
}
return "other"
}
func buildFanSensor(name string, location *string, feature map[string]any) *schema.HardwareFanSensor {
rpm, ok := firstFeatureInt(feature, "_input")
if !ok {
return nil
}
item := &schema.HardwareFanSensor{Name: name, Location: location, RPM: &rpm}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func buildTempSensor(name string, location *string, feature map[string]any) *schema.HardwareTemperatureSensor {
celsius, ok := firstFeatureFloat(feature, "_input")
if !ok {
return nil
}
item := &schema.HardwareTemperatureSensor{Name: name, Location: location, Celsius: &celsius}
if warning, ok := firstFeatureFloatWithSuffixes(feature, []string{"_max", "_high"}); ok {
item.ThresholdWarningCelsius = &warning
}
if critical, ok := firstFeatureFloatWithSuffixes(feature, []string{"_crit", "_emergency"}); ok {
item.ThresholdCriticalCelsius = &critical
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
} else {
item.Status = deriveTemperatureStatus(item.Celsius, item.ThresholdWarningCelsius, item.ThresholdCriticalCelsius)
}
return item
}
func buildPowerSensor(name string, location *string, feature map[string]any) *schema.HardwarePowerSensor {
item := &schema.HardwarePowerSensor{Name: name, Location: location}
if v, ok := firstFeatureFloatWithContains(feature, []string{"power"}); ok {
item.PowerW = &v
}
if v, ok := firstFeatureFloatWithPrefix(feature, "curr"); ok {
item.CurrentA = &v
}
if v, ok := firstFeatureFloatWithPrefix(feature, "in"); ok {
item.VoltageV = &v
}
if item.PowerW == nil && item.CurrentA == nil && item.VoltageV == nil {
return nil
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func buildOtherSensor(name string, location *string, feature map[string]any) *schema.HardwareOtherSensor {
value, unit, ok := firstGenericSensorValue(feature)
if !ok {
return nil
}
item := &schema.HardwareOtherSensor{Name: name, Location: location, Value: &value}
if unit != "" {
item.Unit = &unit
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func sensorStatusFromFeature(feature map[string]any) *string {
for key, raw := range feature {
if !strings.HasSuffix(key, "_alarm") {
continue
}
if number, ok := floatFromAny(raw); ok && number > 0 {
status := statusWarning
return &status
}
}
return nil
}
func deriveTemperatureStatus(current, warning, critical *float64) *string {
if current == nil {
return nil
}
switch {
case critical != nil && *current >= *critical:
status := statusCritical
return &status
case warning != nil && *current >= *warning:
status := statusWarning
return &status
default:
status := statusOK
return &status
}
}
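func firstFeatureInt(feature map[string]any, suffix string) (int, bool) {
// Walk keys in sorted order so the result is deterministic, like the float helpers below.
for _, key := range sortedFeatureKeys(feature) {
if strings.HasSuffix(key, suffix) {
if value, ok := floatFromAny(feature[key]); ok {
return int(value), true
}
}
}
return 0, false
}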
func firstFeatureFloat(feature map[string]any, suffix string) (float64, bool) {
return firstFeatureFloatWithSuffixes(feature, []string{suffix})
}
func firstFeatureFloatWithSuffixes(feature map[string]any, suffixes []string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
for _, suffix := range suffixes {
if strings.HasSuffix(key, suffix) {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
}
return 0, false
}
func firstFeatureFloatWithContains(feature map[string]any, parts []string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
matched := true
for _, part := range parts {
if !strings.Contains(key, part) {
matched = false
break
}
}
if matched {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func firstFeatureFloatWithPrefix(feature map[string]any, prefix string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
if strings.HasPrefix(key, prefix) && strings.HasSuffix(key, "_input") {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func firstGenericSensorValue(feature map[string]any) (float64, string, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
if strings.HasSuffix(key, "_alarm") {
continue
}
value, ok := floatFromAny(feature[key])
if !ok {
continue
}
unit := inferSensorUnit(key)
return value, unit, true
}
return 0, "", false
}
func inferSensorUnit(key string) string {
switch {
case strings.Contains(key, "humidity"):
return "%"
case strings.Contains(key, "intrusion"):
return ""
default:
return ""
}
}
func sortedFeatureKeys(feature map[string]any) []string {
keys := make([]string, 0, len(feature))
for key := range feature {
keys = append(keys, key)
}
sort.Strings(keys)
return keys
}
func floatFromAny(raw any) (float64, bool) {
switch value := raw.(type) {
case float64:
return value, true
case float32:
return float64(value), true
case int:
return float64(value), true
case int64:
return float64(value), true
case json.Number:
if f, err := value.Float64(); err == nil {
return f, true
}
case string:
if value == "" {
return 0, false
}
if f, err := strconv.ParseFloat(value, 64); err == nil {
return f, true
}
}
return 0, false
}
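
The threshold logic in `deriveTemperatureStatus` can be tried in isolation with the sketch below. It returns a plain string instead of `*string` for brevity, and the three status values are assumptions: the diff references `statusOK`, `statusWarning` and `statusCritical` but does not show their definitions.

```go
package main

import "fmt"

// Assumed values; the real constants are defined elsewhere in the package.
const (
	statusOK       = "OK"
	statusWarning  = "Warning"
	statusCritical = "Critical"
)

// temperatureStatus follows the same order as deriveTemperatureStatus:
// the critical threshold is checked first, then the warning threshold.
// A nil threshold is skipped; a nil reading yields an empty string.
func temperatureStatus(current, warning, critical *float64) string {
	if current == nil {
		return ""
	}
	switch {
	case critical != nil && *current >= *critical:
		return statusCritical
	case warning != nil && *current >= *warning:
		return statusWarning
	default:
		return statusOK
	}
}

func f64(v float64) *float64 { return &v }

func main() {
	warn, crit := f64(80), f64(95)
	for _, reading := range []float64{61.5, 85, 95} {
		fmt.Printf("%.1f C -> %s\n", reading, temperatureStatus(f64(reading), warn, crit))
	}
	// Without a warning threshold, 85 C is still below critical.
	fmt.Printf("85.0 C, no warning threshold -> %s\n", temperatureStatus(f64(85), nil, crit))
}
```

Note that both comparisons are inclusive, so a reading exactly at a threshold already counts as having crossed it.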


@@ -0,0 +1,54 @@
package collector
import "testing"
func TestParseSensorsJSON(t *testing.T) {
raw := []byte(`{
"coretemp-isa-0000": {
"Adapter": "ISA adapter",
"Package id 0": {
"temp1_input": 61.5,
"temp1_max": 80.0,
"temp1_crit": 95.0
},
"fan1": {
"fan1_input": 4200
}
},
"acpitz-acpi-0": {
"Adapter": "ACPI interface",
"in0": {
"in0_input": 12.06
},
"curr1": {
"curr1_input": 0.64
},
"power1": {
"power1_average": 137.0
},
"humidity1": {
"humidity1_input": 38.5
}
}
}`)
got, err := parseSensorsJSON(raw)
if err != nil {
t.Fatalf("parseSensorsJSON error: %v", err)
}
if got == nil {
t.Fatal("expected sensors")
}
if len(got.Temperatures) != 1 || got.Temperatures[0].Celsius == nil || *got.Temperatures[0].Celsius != 61.5 {
t.Fatalf("temperatures mismatch: %#v", got.Temperatures)
}
if len(got.Fans) != 1 || got.Fans[0].RPM == nil || *got.Fans[0].RPM != 4200 {
t.Fatalf("fans mismatch: %#v", got.Fans)
}
if len(got.Power) != 3 {
t.Fatalf("power sensors mismatch: %#v", got.Power)
}
if len(got.Other) != 1 || got.Other[0].Unit == nil || *got.Other[0].Unit != "%" {
t.Fatalf("other sensors mismatch: %#v", got.Other)
}
}


@@ -5,11 +5,13 @@ import (
"encoding/json"
"log/slog"
"os/exec"
"path/filepath"
"strconv"
"strings"
)
func collectStorage() []schema.HardwareStorage {
devs := lsblkDevices()
devs := discoverStorageDevices()
result := make([]schema.HardwareStorage, 0, len(devs))
for _, dev := range devs {
var s schema.HardwareStorage
@@ -26,19 +28,60 @@ func collectStorage() []schema.HardwareStorage {
// lsblkDevice is a minimal lsblk JSON record.
type lsblkDevice struct {
Name string `json:"name"`
Type string `json:"type"`
Size string `json:"size"`
Serial string `json:"serial"`
Model string `json:"model"`
Tran string `json:"tran"`
Hctl string `json:"hctl"`
Name string `json:"name"`
Type string `json:"type"`
Size string `json:"size"`
Serial string `json:"serial"`
Model string `json:"model"`
Tran string `json:"tran"`
Hctl string `json:"hctl"`
}
type lsblkRoot struct {
Blockdevices []lsblkDevice `json:"blockdevices"`
}
type nvmeListRoot struct {
Devices []nvmeListDevice `json:"Devices"`
}
type nvmeListDevice struct {
DevicePath string `json:"DevicePath"`
ModelNumber string `json:"ModelNumber"`
SerialNumber string `json:"SerialNumber"`
Firmware string `json:"Firmware"`
PhysicalSize int64 `json:"PhysicalSize"`
}
func discoverStorageDevices() []lsblkDevice {
merged := map[string]lsblkDevice{}
for _, dev := range lsblkDevices() {
if dev.Name == "" {
continue
}
merged[dev.Name] = dev
}
for _, dev := range nvmeListDevices() {
if dev.Name == "" {
continue
}
current := merged[dev.Name]
merged[dev.Name] = mergeStorageDevice(current, dev)
}
disks := make([]lsblkDevice, 0, len(merged))
for _, dev := range merged {
if dev.Type == "" {
dev.Type = "disk"
}
if dev.Type != "disk" {
continue
}
disks = append(disks, dev)
}
return disks
}
func lsblkDevices() []lsblkDevice {
out, err := exec.Command("lsblk", "-J", "-d",
"-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL").Output()
@@ -60,6 +103,59 @@ func lsblkDevices() []lsblkDevice {
return disks
}
func nvmeListDevices() []lsblkDevice {
out, err := exec.Command("nvme", "list", "-o", "json").Output()
if err != nil {
return nil
}
var root nvmeListRoot
if err := json.Unmarshal(out, &root); err != nil {
slog.Warn("storage: nvme list parse failed", "err", err)
return nil
}
devices := make([]lsblkDevice, 0, len(root.Devices))
for _, dev := range root.Devices {
name := filepath.Base(strings.TrimSpace(dev.DevicePath))
if name == "" {
continue
}
devices = append(devices, lsblkDevice{
Name: name,
Type: "disk",
Size: strconv.FormatInt(dev.PhysicalSize, 10),
Serial: strings.TrimSpace(dev.SerialNumber),
Model: strings.TrimSpace(dev.ModelNumber),
Tran: "nvme",
})
}
return devices
}
func mergeStorageDevice(existing, incoming lsblkDevice) lsblkDevice {
if existing.Name == "" {
return incoming
}
if existing.Type == "" {
existing.Type = incoming.Type
}
if strings.TrimSpace(existing.Size) == "" {
existing.Size = incoming.Size
}
if strings.TrimSpace(existing.Serial) == "" {
existing.Serial = incoming.Serial
}
if strings.TrimSpace(existing.Model) == "" {
existing.Model = incoming.Model
}
if strings.TrimSpace(existing.Tran) == "" {
existing.Tran = incoming.Tran
}
if strings.TrimSpace(existing.Hctl) == "" {
existing.Hctl = incoming.Hctl
}
return existing
}
// smartctlInfo is the subset of smartctl -j -a output we care about.
type smartctlInfo struct {
ModelFamily string `json:"model_family"`
@@ -67,14 +163,22 @@ type smartctlInfo struct {
SerialNumber string `json:"serial_number"`
FirmwareVer string `json:"firmware_version"`
RotationRate int `json:"rotation_rate"`
Temperature struct {
Current int `json:"current"`
} `json:"temperature"`
SmartStatus struct {
Passed bool `json:"passed"`
} `json:"smart_status"`
UserCapacity struct {
Bytes int64 `json:"bytes"`
} `json:"user_capacity"`
AtaSmartAttributes struct {
Table []struct {
ID int `json:"id"`
Name string `json:"name"`
Raw struct{ Value int64 `json:"value"` } `json:"raw"`
ID int `json:"id"`
Name string `json:"name"`
Raw struct {
Value int64 `json:"value"`
} `json:"raw"`
} `json:"table"`
} `json:"ata_smart_attributes"`
PowerOnTime struct {
@@ -86,6 +190,7 @@ type smartctlInfo struct {
func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
present := true
s := schema.HardwareStorage{Present: &present}
s.Telemetry = map[string]any{"linux_device": "/dev/" + dev.Name}
tran := strings.ToLower(dev.Tran)
devPath := "/dev/" + dev.Name
@@ -149,69 +254,117 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
} else if info.RotationRate > 0 {
devType = "HDD"
}
s.Type = &devType
// telemetry
tel := map[string]any{}
if info.Temperature.Current > 0 {
t := float64(info.Temperature.Current)
s.TemperatureC = &t
}
if info.PowerOnTime.Hours > 0 {
tel["power_on_hours"] = info.PowerOnTime.Hours
v := int64(info.PowerOnTime.Hours)
s.PowerOnHours = &v
}
if info.PowerCycleCount > 0 {
tel["power_cycles"] = info.PowerCycleCount
v := int64(info.PowerCycleCount)
s.PowerCycles = &v
}
reallocated := int64(0)
pending := int64(0)
uncorrectable := int64(0)
lifeRemaining := int64(0)
for _, attr := range info.AtaSmartAttributes.Table {
switch attr.ID {
case 5:
tel["reallocated_sectors"] = attr.Raw.Value
reallocated = attr.Raw.Value
s.ReallocatedSectors = &reallocated
case 177:
tel["wear_leveling_pct"] = attr.Raw.Value
value := float64(attr.Raw.Value)
s.LifeUsedPct = &value
case 231:
tel["life_remaining_pct"] = attr.Raw.Value
lifeRemaining = attr.Raw.Value
value := float64(attr.Raw.Value)
s.LifeRemainingPct = &value
case 241:
tel["total_lba_written"] = attr.Raw.Value
value := attr.Raw.Value
s.WrittenBytes = &value
case 197:
pending = attr.Raw.Value
s.CurrentPendingSectors = &pending
case 198:
uncorrectable = attr.Raw.Value
s.OfflineUncorrectable = &uncorrectable
}
}
if len(tel) > 0 {
s.Telemetry = tel
status := storageHealthStatus{
overallPassed: info.SmartStatus.Passed,
hasOverall: true,
reallocatedSectors: reallocated,
pendingSectors: pending,
offlineUncorrectable: uncorrectable,
lifeRemainingPct: lifeRemaining,
}
setStorageHealthStatus(&s, status)
return s
}
s.Type = &devType
status := "OK"
status := statusUnknown
s.Status = &status
return s
}
// nvmeSmartLog is the subset of `nvme smart-log -o json` output we care about.
type nvmeSmartLog struct {
CriticalWarning int `json:"critical_warning"`
PercentageUsed int `json:"percentage_used"`
AvailableSpare int `json:"available_spare"`
SpareThreshold int `json:"spare_thresh"`
Temperature int64 `json:"temperature"`
PowerOnHours int64 `json:"power_on_hours"`
PowerCycles int64 `json:"power_cycles"`
UnsafeShutdowns int64 `json:"unsafe_shutdowns"`
DataUnitsRead int64 `json:"data_units_read"`
DataUnitsWritten int64 `json:"data_units_written"`
ControllerBusy int64 `json:"controller_busy_time"`
MediaErrors int64 `json:"media_errors"`
NumErrLogEntries int64 `json:"num_err_log_entries"`
}
// nvmeIDCtrl is the subset of `nvme id-ctrl -o json` output.
type nvmeIDCtrl struct {
ModelNumber string `json:"mn"`
SerialNumber string `json:"sn"`
FirmwareRev string `json:"fr"`
TotalCapacity int64 `json:"tnvmcap"`
ModelNumber string `json:"mn"`
SerialNumber string `json:"sn"`
FirmwareRev string `json:"fr"`
TotalCapacity int64 `json:"tnvmcap"`
}
func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
present := true
devType := "NVMe"
iface := "NVMe"
status := "OK"
status := statusOK
s := schema.HardwareStorage{
Present: &present,
Type: &devType,
Interface: &iface,
Status: &status,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
Type: &devType,
Interface: &iface,
Telemetry: map[string]any{"linux_device": "/dev/" + dev.Name},
}
devPath := "/dev/" + dev.Name
if v := cleanDMIValue(strings.TrimSpace(dev.Model)); v != "" {
s.Model = &v
}
if v := cleanDMIValue(strings.TrimSpace(dev.Serial)); v != "" {
s.SerialNumber = &v
}
if size := parseStorageBytes(dev.Size); size > 0 {
gb := int(size / 1_000_000_000)
if gb > 0 {
s.SizeGB = &gb
}
}
// id-ctrl: model, serial, firmware, capacity
if out, err := exec.Command("nvme", "id-ctrl", devPath, "-o", "json").Output(); err == nil {
@@ -237,30 +390,131 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
if out, err := exec.Command("nvme", "smart-log", devPath, "-o", "json").Output(); err == nil {
var log nvmeSmartLog
if json.Unmarshal(out, &log) == nil {
tel := map[string]any{}
if log.PowerOnHours > 0 {
tel["power_on_hours"] = log.PowerOnHours
s.PowerOnHours = &log.PowerOnHours
}
if log.PowerCycles > 0 {
tel["power_cycles"] = log.PowerCycles
s.PowerCycles = &log.PowerCycles
}
if log.UnsafeShutdowns > 0 {
tel["unsafe_shutdowns"] = log.UnsafeShutdowns
s.UnsafeShutdowns = &log.UnsafeShutdowns
}
if log.PercentageUsed > 0 {
tel["percentage_used"] = log.PercentageUsed
v := float64(log.PercentageUsed)
s.LifeUsedPct = &v
remaining := 100 - v
s.LifeRemainingPct = &remaining
}
if log.DataUnitsWritten > 0 {
tel["data_units_written"] = log.DataUnitsWritten
v := nvmeDataUnitsToBytes(log.DataUnitsWritten)
s.WrittenBytes = &v
}
if log.ControllerBusy > 0 {
tel["controller_busy_time"] = log.ControllerBusy
if log.DataUnitsRead > 0 {
v := nvmeDataUnitsToBytes(log.DataUnitsRead)
s.ReadBytes = &v
}
if len(tel) > 0 {
s.Telemetry = tel
if log.AvailableSpare > 0 {
v := float64(log.AvailableSpare)
s.AvailableSparePct = &v
}
if log.MediaErrors > 0 {
s.MediaErrors = &log.MediaErrors
}
if log.NumErrLogEntries > 0 {
s.ErrorLogEntries = &log.NumErrLogEntries
}
if log.Temperature > 0 {
// The smart-log temperature is reported in Kelvin; convert to whole degrees Celsius.
v := float64(log.Temperature - 273)
s.TemperatureC = &v
}
setStorageHealthStatus(&s, storageHealthStatus{
criticalWarning: log.CriticalWarning,
percentageUsed: int64(log.PercentageUsed),
availableSpare: int64(log.AvailableSpare),
spareThreshold: int64(log.SpareThreshold),
unsafeShutdowns: log.UnsafeShutdowns,
mediaErrors: log.MediaErrors,
errorLogEntries: log.NumErrLogEntries,
})
return s
}
}
status = statusUnknown
s.Status = &status
return s
}
func parseStorageBytes(raw string) int64 {
value, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
if err == nil && value > 0 {
return value
}
return 0
}
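// nvmeDataUnitsToBytes converts NVMe SMART data units to bytes.
// One data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
func nvmeDataUnitsToBytes(units int64) int64 {
if units <= 0 {
return 0
}
return units * 512000
}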
type storageHealthStatus struct {
hasOverall bool
overallPassed bool
reallocatedSectors int64
pendingSectors int64
offlineUncorrectable int64
lifeRemainingPct int64
criticalWarning int
percentageUsed int64
availableSpare int64
spareThreshold int64
unsafeShutdowns int64
mediaErrors int64
errorLogEntries int64
}
func setStorageHealthStatus(s *schema.HardwareStorage, health storageHealthStatus) {
status := statusOK
var description *string
switch {
case health.hasOverall && !health.overallPassed:
status = statusCritical
description = stringPtr("SMART overall self-assessment failed")
case health.criticalWarning > 0:
status = statusCritical
description = stringPtr("NVMe critical warning is set")
case health.pendingSectors > 0 || health.offlineUncorrectable > 0:
status = statusCritical
description = stringPtr("Pending or offline uncorrectable sectors detected")
case health.mediaErrors > 0:
status = statusWarning
description = stringPtr("Media errors reported")
case health.reallocatedSectors > 0:
status = statusWarning
description = stringPtr("Reallocated sectors detected")
case health.errorLogEntries > 0:
status = statusWarning
description = stringPtr("Device error log contains entries")
case health.lifeRemainingPct > 0 && health.lifeRemainingPct <= 10:
status = statusWarning
description = stringPtr("Life remaining is low")
case health.percentageUsed >= 95:
status = statusWarning
description = stringPtr("Drive wear level is high")
case health.availableSpare > 0 && health.spareThreshold > 0 && health.availableSpare <= health.spareThreshold:
status = statusWarning
description = stringPtr("Available spare is at or below threshold")
case health.unsafeShutdowns > 100:
status = statusWarning
description = stringPtr("Unsafe shutdown count is high")
}
s.Status = &status
s.ErrorDescription = description
}
func stringPtr(value string) *string {
return &value
}


@@ -0,0 +1,33 @@
package collector
import "testing"
func TestMergeStorageDevicePrefersNonEmptyFields(t *testing.T) {
t.Parallel()
got := mergeStorageDevice(
lsblkDevice{Name: "nvme0n1", Type: "disk", Tran: "nvme"},
lsblkDevice{Name: "nvme0n1", Type: "disk", Size: "1024", Serial: "SN123", Model: "Kioxia"},
)
if got.Serial != "SN123" {
t.Fatalf("serial=%q want SN123", got.Serial)
}
if got.Model != "Kioxia" {
t.Fatalf("model=%q want Kioxia", got.Model)
}
if got.Size != "1024" {
t.Fatalf("size=%q want 1024", got.Size)
}
}
func TestParseStorageBytes(t *testing.T) {
t.Parallel()
if got := parseStorageBytes(" 2048 "); got != 2048 {
t.Fatalf("parseStorageBytes=%d want 2048", got)
}
if got := parseStorageBytes("1.92 TB"); got != 0 {
t.Fatalf("parseStorageBytes invalid=%d want 0", got)
}
}
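
One detail the tests above do not cover: `discoverStorageDevices` builds its result by ranging over a map, and Go does not guarantee map iteration order, so the returned slice can be ordered differently from run to run. If a stable order matters to consumers, sorting by device name is one option. The sketch below uses a reduced, hypothetical `device` type in place of `lsblkDevice`.

```go
package main

import (
	"fmt"
	"sort"
)

// device is a reduced stand-in for lsblkDevice, used only in this example.
type device struct {
	Name string
	Type string
}

// sortedDisks applies the same filtering as discoverStorageDevices
// (empty type defaults to "disk", non-disks are dropped) and then
// sorts by name so the output order is stable.
func sortedDisks(merged map[string]device) []device {
	disks := make([]device, 0, len(merged))
	for _, dev := range merged {
		if dev.Type == "" {
			dev.Type = "disk"
		}
		if dev.Type != "disk" {
			continue
		}
		disks = append(disks, dev)
	}
	sort.Slice(disks, func(i, j int) bool { return disks[i].Name < disks[j].Name })
	return disks
}

func main() {
	merged := map[string]device{
		"sdb":     {Name: "sdb", Type: "disk"},
		"nvme0n1": {Name: "nvme0n1"},
		"sr0":     {Name: "sr0", Type: "rom"},
		"sda":     {Name: "sda", Type: "disk"},
	}
	for _, dev := range sortedDisks(merged) {
		fmt.Println(dev.Name, dev.Type)
	}
}
```

Applying this in the real file would also require importing `sort`, which the import hunk shown in the diff does not include.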


@@ -0,0 +1,63 @@
package collector
import (
"testing"
"bee/audit/internal/schema"
)
func TestSetStorageHealthStatus(t *testing.T) {
t.Parallel()
tests := []struct {
name string
health storageHealthStatus
want string
}{
{
name: "smart overall failed",
health: storageHealthStatus{hasOverall: true, overallPassed: false},
want: statusCritical,
},
{
name: "nvme critical warning",
health: storageHealthStatus{criticalWarning: 1},
want: statusCritical,
},
{
name: "pending sectors",
health: storageHealthStatus{pendingSectors: 1},
want: statusCritical,
},
{
name: "media errors warning",
health: storageHealthStatus{mediaErrors: 2},
want: statusWarning,
},
{
name: "reallocated warning",
health: storageHealthStatus{reallocatedSectors: 1},
want: statusWarning,
},
{
name: "life remaining low",
health: storageHealthStatus{lifeRemainingPct: 8},
want: statusWarning,
},
{
name: "healthy",
health: storageHealthStatus{},
want: statusOK,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
var disk schema.HardwareStorage
setStorageHealthStatus(&disk, tt.health)
if disk.Status == nil || *disk.Status != tt.want {
t.Fatalf("status=%v want %q", disk.Status, tt.want)
}
})
}
}
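
The table-driven test above checks one condition per case. Since `setStorageHealthStatus` is a single `switch`, the first matching case wins, so a device with several problems reports only the highest-priority one. The sketch below expresses the same first-match idea as an ordered rule list. It covers only three of the conditions, and the `health`, `rule` and `evaluate` names as well as the status strings are assumptions for this example.

```go
package main

import "fmt"

// health is a reduced stand-in for storageHealthStatus.
type health struct {
	criticalWarning int
	mediaErrors     int64
	unsafeShutdowns int64
}

// rule pairs a predicate with the status and description it produces.
type rule struct {
	match       func(health) bool
	status      string
	description string
}

// rules are evaluated top to bottom; order encodes priority.
var rules = []rule{
	{func(h health) bool { return h.criticalWarning > 0 }, "Critical", "NVMe critical warning is set"},
	{func(h health) bool { return h.mediaErrors > 0 }, "Warning", "Media errors reported"},
	{func(h health) bool { return h.unsafeShutdowns > 100 }, "Warning", "Unsafe shutdown count is high"},
}

// evaluate returns the status and description of the first matching rule,
// or "OK" with an empty description when nothing matches.
func evaluate(h health) (string, string) {
	for _, r := range rules {
		if r.match(h) {
			return r.status, r.description
		}
	}
	return "OK", ""
}

func main() {
	// Both a critical warning and media errors are present; only the
	// higher-priority critical warning is reported.
	status, desc := evaluate(health{criticalWarning: 1, mediaErrors: 4})
	fmt.Println(status, "-", desc)

	status, _ = evaluate(health{unsafeShutdowns: 100})
	fmt.Println(status) // 100 is not above the threshold of 100
}
```

A test case that sets two fields at once would pin down this priority order, which the current test table does not exercise.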


@@ -0,0 +1,114 @@
package collector
import (
"bee/audit/internal/schema"
"fmt"
"time"
)
func BuildHealthSummary(snap schema.HardwareSnapshot) *schema.HardwareHealthSummary {
summary := &schema.HardwareHealthSummary{
Status: statusOK,
CollectedAt: time.Now().UTC().Format(time.RFC3339),
}
for _, dimm := range snap.Memory {
switch derefString(dimm.Status) {
case statusWarning:
summary.MemoryWarn++
summary.Warnings = append(summary.Warnings, formatMemorySummary(dimm))
case statusCritical:
summary.MemoryFail++
summary.Failures = append(summary.Failures, formatMemorySummary(dimm))
case statusEmpty:
summary.EmptyDIMMs++
}
}
for _, disk := range snap.Storage {
switch derefString(disk.Status) {
case statusWarning:
summary.StorageWarn++
summary.Warnings = append(summary.Warnings, formatStorageSummary(disk))
case statusCritical:
summary.StorageFail++
summary.Failures = append(summary.Failures, formatStorageSummary(disk))
}
}
for _, dev := range snap.PCIeDevices {
switch derefString(dev.Status) {
case statusWarning:
summary.PCIeWarn++
summary.Warnings = append(summary.Warnings, formatPCIeSummary(dev))
case statusCritical:
summary.PCIeFail++
summary.Failures = append(summary.Failures, formatPCIeSummary(dev))
}
}
for _, psu := range snap.PowerSupplies {
if psu.Present != nil && !*psu.Present {
summary.MissingPSUs++
}
switch derefString(psu.Status) {
case statusWarning:
summary.PSUWarn++
summary.Warnings = append(summary.Warnings, formatPSUSummary(psu))
case statusCritical:
summary.PSUFail++
summary.Failures = append(summary.Failures, formatPSUSummary(psu))
}
}
if len(summary.Failures) > 0 || summary.StorageFail > 0 || summary.PCIeFail > 0 || summary.PSUFail > 0 || summary.MemoryFail > 0 {
summary.Status = statusCritical
} else if len(summary.Warnings) > 0 || summary.StorageWarn > 0 || summary.PCIeWarn > 0 || summary.PSUWarn > 0 || summary.MemoryWarn > 0 {
summary.Status = statusWarning
}
if len(summary.Warnings) == 0 {
summary.Warnings = nil
}
if len(summary.Failures) == 0 {
summary.Failures = nil
}
return summary
}
func derefString(value *string) string {
if value == nil {
return ""
}
return *value
}
func preferredName(model, serial, slot *string) string {
switch {
case model != nil && *model != "":
return *model
case serial != nil && *serial != "":
return *serial
case slot != nil && *slot != "":
return *slot
default:
return "unknown"
}
}
func formatStorageSummary(disk schema.HardwareStorage) string {
return fmt.Sprintf("storage %s status=%s", preferredName(disk.Model, disk.SerialNumber, disk.Slot), derefString(disk.Status))
}
func formatPCIeSummary(dev schema.HardwarePCIeDevice) string {
return fmt.Sprintf("pcie %s status=%s", preferredName(dev.Model, dev.SerialNumber, dev.BDF), derefString(dev.Status))
}
func formatPSUSummary(psu schema.HardwarePowerSupply) string {
return fmt.Sprintf("psu %s status=%s", preferredName(psu.Model, psu.SerialNumber, psu.Slot), derefString(psu.Status))
}
func formatMemorySummary(dimm schema.HardwareMemory) string {
return fmt.Sprintf("memory %s status=%s", preferredName(dimm.PartNumber, dimm.SerialNumber, dimm.Slot), derefString(dimm.Status))
}
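
Review note: BuildHealthSummary rolls per-component statuses up into one overall status, where any critical component outranks any warning and a warning outranks OK. The sketch below shows only that precedence rule in a standalone program. The status string values and the `rollUp` helper are assumptions made for illustration; the real constants live elsewhere in the collector package and are not shown in this diff.

```go
package main

import "fmt"

// Hypothetical status values for illustration only; the actual
// statusOK/statusWarning/statusCritical constants are not visible here.
const (
	statusOK       = "OK"
	statusWarning  = "Warning"
	statusCritical = "Critical"
)

// rollUp returns the worst status found in statuses.
// Critical wins immediately, Warning is remembered, anything else is ignored.
func rollUp(statuses []string) string {
	overall := statusOK
	for _, s := range statuses {
		switch s {
		case statusCritical:
			return statusCritical
		case statusWarning:
			overall = statusWarning
		}
	}
	return overall
}

func main() {
	fmt.Println(rollUp([]string{statusOK, statusWarning}))
	fmt.Println(rollUp([]string{statusWarning, statusCritical}))
}
```

Unknown or empty statuses are ignored here, which matches how the switch statements above skip anything that is not a warning or a failure.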

@@ -31,7 +31,7 @@ md125 : active raid1 nvme2n1[0] nvme3n1[1]
func TestHasVROCController(t *testing.T) {
intel := vendorIntel
model := "Volume Management Device NVMe RAID Controller"
class := "RAID bus controller"
class := "MassStorageController"
tests := []struct {
name string
pcie []schema.HardwarePCIeDevice

@@ -9,8 +9,10 @@ import (
"strings"
)
var exportExecCommand = exec.Command
func (s *System) ListRemovableTargets() ([]RemovableTarget, error) {
raw, err := exec.Command("lsblk", "-P", "-o", "NAME,TYPE,PKNAME,RM,FSTYPE,MOUNTPOINT,SIZE,LABEL,MODEL").Output()
raw, err := exportExecCommand("lsblk", "-P", "-o", "NAME,TYPE,PKNAME,RM,FSTYPE,MOUNTPOINT,SIZE,LABEL,MODEL").Output()
if err != nil {
return nil, err
}
@@ -52,7 +54,7 @@ func (s *System) ListRemovableTargets() ([]RemovableTarget, error) {
return out, nil
}
func (s *System) ExportFileToTarget(src string, target RemovableTarget) (string, error) {
func (s *System) ExportFileToTarget(src string, target RemovableTarget) (dst string, retErr error) {
if src == "" || target.Device == "" {
return "", fmt.Errorf("source and target are required")
}
@@ -62,20 +64,39 @@ func (s *System) ExportFileToTarget(src string, target RemovableTarget) (string,
mountpoint := strings.TrimSpace(target.Mountpoint)
mountedHere := false
mounted := mountpoint != ""
if mountpoint == "" {
mountpoint = filepath.Join("/tmp", "bee-export-"+filepath.Base(target.Device))
if err := os.MkdirAll(mountpoint, 0755); err != nil {
return "", err
}
if raw, err := exec.Command("mount", target.Device, mountpoint).CombinedOutput(); err != nil {
if raw, err := exportExecCommand("mount", target.Device, mountpoint).CombinedOutput(); err != nil {
_ = os.Remove(mountpoint)
return string(raw), err
}
mountedHere = true
mounted = true
}
defer func() {
if !mounted {
return
}
_ = exportExecCommand("sync").Run()
if raw, err := exportExecCommand("umount", mountpoint).CombinedOutput(); err != nil && retErr == nil {
msg := strings.TrimSpace(string(raw))
if msg == "" {
retErr = err
} else {
retErr = fmt.Errorf("%s: %w", msg, err)
}
}
if mountedHere {
_ = os.Remove(mountpoint)
}
}()
filename := filepath.Base(src)
dst := filepath.Join(mountpoint, filename)
dst = filepath.Join(mountpoint, filename)
data, err := os.ReadFile(src)
if err != nil {
return "", err
@@ -83,12 +104,6 @@ func (s *System) ExportFileToTarget(src string, target RemovableTarget) (string,
if err := os.WriteFile(dst, data, 0644); err != nil {
return "", err
}
_ = exec.Command("sync").Run()
if mountedHere {
_ = exec.Command("umount", mountpoint).Run()
_ = os.Remove(mountpoint)
}
return dst, nil
}
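
Review note: the deferred sync/umount in ExportFileToTarget relies on a named return value so that a cleanup failure can still reach the caller without masking an earlier error. A minimal standalone sketch of that pattern follows. `copyWithCleanup`, `succeed`, `fail`, and `errBusy` are hypothetical names used only for illustration and are not part of this change.

```go
package main

import (
	"errors"
	"fmt"
)

// errBusy is a hypothetical error standing in for a failed umount or copy.
var errBusy = errors.New("busy")

func succeed() error { return nil }
func fail() error    { return errBusy }

// copyWithCleanup is a stand-in for ExportFileToTarget: work simulates the
// file copy, cleanup simulates sync + umount.
func copyWithCleanup(work, cleanup func() error) (dst string, retErr error) {
	defer func() {
		// Report the cleanup error only when nothing failed earlier,
		// so the first failure is the one the caller sees.
		if err := cleanup(); err != nil && retErr == nil {
			retErr = fmt.Errorf("cleanup: %w", err)
		}
	}()
	if err := work(); err != nil {
		return "", err
	}
	return "/mnt/file", nil
}

func main() {
	_, err := copyWithCleanup(succeed, fail)
	fmt.Println(err) // cleanup failure surfaces through the named return
	_, err = copyWithCleanup(fail, fail)
	fmt.Println(err) // the original failure is preserved
}
```

Wrapping with `%w` keeps the underlying error available to `errors.Is`, which is the same choice the diff makes when it combines the umount output with the exec error.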

@@ -0,0 +1,56 @@
package platform
import (
"os"
"os/exec"
"path/filepath"
"testing"
)
func TestExportFileToTargetUnmountsExistingMountpoint(t *testing.T) {
// Not parallel: this test swaps the package-level exportExecCommand hook.
tmp := t.TempDir()
src := filepath.Join(tmp, "bundle.tar.gz")
mountpoint := filepath.Join(tmp, "mnt")
if err := os.MkdirAll(mountpoint, 0755); err != nil {
t.Fatalf("mkdir mountpoint: %v", err)
}
if err := os.WriteFile(src, []byte("bundle"), 0644); err != nil {
t.Fatalf("write src: %v", err)
}
var calls [][]string
oldExec := exportExecCommand
exportExecCommand = func(name string, args ...string) *exec.Cmd {
calls = append(calls, append([]string{name}, args...))
return exec.Command("sh", "-c", "exit 0")
}
t.Cleanup(func() { exportExecCommand = oldExec })
s := &System{}
dst, err := s.ExportFileToTarget(src, RemovableTarget{
Device: "/dev/sdb1",
Mountpoint: mountpoint,
})
if err != nil {
t.Fatalf("ExportFileToTarget error: %v", err)
}
if got, want := dst, filepath.Join(mountpoint, "bundle.tar.gz"); got != want {
t.Fatalf("dst=%q want %q", got, want)
}
if _, err := os.Stat(filepath.Join(mountpoint, "bundle.tar.gz")); err != nil {
t.Fatalf("exported file missing: %v", err)
}
foundUmount := false
for _, call := range calls {
if len(call) == 2 && call[0] == "umount" && call[1] == mountpoint {
foundUmount = true
break
}
}
if !foundUmount {
t.Fatalf("expected umount %q call, got %#v", mountpoint, calls)
}
}

@@ -0,0 +1,577 @@
package platform
import (
"bytes"
"fmt"
"math"
"os"
"os/exec"
"strconv"
"strings"
"time"
)
// GPUMetricRow is one telemetry sample from nvidia-smi during a stress test.
type GPUMetricRow struct {
ElapsedSec float64
GPUIndex int
TempC float64
UsagePct float64
PowerW float64
ClockMHz float64
}
// sampleGPUMetrics runs nvidia-smi once and returns current metrics for each GPU.
func sampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
args := []string{
"--query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics",
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
}
out, err := exec.Command("nvidia-smi", args...).Output()
if err != nil {
return nil, err
}
var rows []GPUMetricRow
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.Split(line, ", ")
if len(parts) < 5 {
continue
}
idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
if err != nil {
continue // skip malformed rows instead of attributing them to GPU 0
}
rows = append(rows, GPUMetricRow{
GPUIndex: idx,
TempC: parseGPUFloat(parts[1]),
UsagePct: parseGPUFloat(parts[2]),
PowerW: parseGPUFloat(parts[3]),
ClockMHz: parseGPUFloat(parts[4]),
})
}
return rows, nil
}
func parseGPUFloat(s string) float64 {
s = strings.TrimSpace(s)
if s == "N/A" || s == "[Not Supported]" || s == "" {
return 0
}
v, _ := strconv.ParseFloat(s, 64)
return v
}
// WriteGPUMetricsCSV writes collected rows as a CSV file.
func WriteGPUMetricsCSV(path string, rows []GPUMetricRow) error {
var b bytes.Buffer
b.WriteString("elapsed_sec,gpu_index,temperature_c,usage_pct,power_w,clock_mhz\n")
for _, r := range rows {
fmt.Fprintf(&b, "%.1f,%d,%.1f,%.1f,%.1f,%.0f\n",
r.ElapsedSec, r.GPUIndex, r.TempC, r.UsagePct, r.PowerW, r.ClockMHz)
}
return os.WriteFile(path, b.Bytes(), 0644)
}
// WriteGPUMetricsHTML writes a standalone HTML file with one SVG chart per GPU.
func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
// Group by GPU index preserving order.
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
var svgs strings.Builder
for _, gpuIdx := range order {
svgs.WriteString(drawGPUChartSVG(gpuMap[gpuIdx], gpuIdx))
svgs.WriteString("\n")
}
ts := time.Now().UTC().Format("2006-01-02 15:04:05 UTC")
html := fmt.Sprintf(`<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<title>GPU Stress Test Metrics</title>
<style>
body { font-family: sans-serif; background: #f0f0f0; margin: 0; padding: 20px; }
h1 { text-align: center; color: #333; margin: 0 0 8px; }
p { text-align: center; color: #888; font-size: 13px; margin: 0 0 24px; }
</style>
</head><body>
<h1>GPU Stress Test Metrics</h1>
<p>Generated %s</p>
%s
</body></html>`, ts, svgs.String())
return os.WriteFile(path, []byte(html), 0644)
}
// drawGPUChartSVG generates a self-contained SVG chart for one GPU.
func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
// Layout
const W, H = 960, 520
const plotX1 = 120 // usage axis / chart left border
const plotX2 = 840 // power axis / chart right border
const plotY1 = 70 // top
const plotY2 = 465 // bottom (PH = 395)
const PW = plotX2 - plotX1
const PH = plotY2 - plotY1
// Outer axes
const tempAxisX = 60 // temp axis line
const clockAxisX = 900 // clock axis line
colors := [4]string{"#e74c3c", "#3498db", "#2ecc71", "#f39c12"}
seriesLabel := [4]string{
fmt.Sprintf("GPU %d Temp (°C)", gpuIdx),
fmt.Sprintf("GPU %d Usage (%%)", gpuIdx),
fmt.Sprintf("GPU %d Power (W)", gpuIdx),
fmt.Sprintf("GPU %d Clock (MHz)", gpuIdx),
}
axisLabel := [4]string{"Temperature (°C)", "GPU Usage (%)", "Power (W)", "Clock (MHz)"}
// Extract series
t := make([]float64, len(rows))
vals := [4][]float64{}
for i := range vals {
vals[i] = make([]float64, len(rows))
}
for i, r := range rows {
t[i] = r.ElapsedSec
vals[0][i] = r.TempC
vals[1][i] = r.UsagePct
vals[2][i] = r.PowerW
vals[3][i] = r.ClockMHz
}
tMin, tMax := gpuMinMax(t)
type axisScale struct {
ticks []float64
min, max float64
}
var axes [4]axisScale
for i := 0; i < 4; i++ {
mn, mx := gpuMinMax(vals[i])
tks := gpuNiceTicks(mn, mx, 8)
axes[i] = axisScale{ticks: tks, min: tks[0], max: tks[len(tks)-1]}
}
xv := func(tv float64) float64 {
if tMax == tMin {
return float64(plotX1)
}
return float64(plotX1) + (tv-tMin)/(tMax-tMin)*float64(PW)
}
yv := func(v float64, ai int) float64 {
a := axes[ai]
if a.max == a.min {
return float64(plotY1 + PH/2)
}
return float64(plotY2) - (v-a.min)/(a.max-a.min)*float64(PH)
}
var b strings.Builder
fmt.Fprintf(&b, `<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d"`+
` style="background:#fff;border-radius:8px;display:block;margin:0 auto 24px;`+
`box-shadow:0 2px 12px rgba(0,0,0,.12)">`+"\n", W, H)
// Title
fmt.Fprintf(&b, `<text x="%d" y="22" text-anchor="middle" font-family="sans-serif"`+
` font-size="14" font-weight="bold" fill="#333">GPU Stress Test Metrics — GPU %d</text>`+"\n",
plotX1+PW/2, gpuIdx)
// Horizontal grid (align to temp axis ticks)
b.WriteString(`<g stroke="#e0e0e0" stroke-width="0.5">` + "\n")
for _, tick := range axes[0].ticks {
y := yv(tick, 0)
if y < float64(plotY1) || y > float64(plotY2) {
continue
}
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"/>`+"\n",
plotX1, y, plotX2, y)
}
// Vertical grid
xTicks := gpuNiceTicks(tMin, tMax, 10)
for _, tv := range xTicks {
x := xv(tv)
if x < float64(plotX1) || x > float64(plotX2) {
continue
}
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d"/>`+"\n",
x, plotY1, x, plotY2)
}
b.WriteString("</g>\n")
// Chart border
fmt.Fprintf(&b, `<rect x="%d" y="%d" width="%d" height="%d"`+
` fill="none" stroke="#333" stroke-width="1"/>`+"\n",
plotX1, plotY1, PW, PH)
// X axis ticks and labels
b.WriteString(`<g font-family="sans-serif" font-size="11" fill="#333" text-anchor="middle">` + "\n")
for _, tv := range xTicks {
x := xv(tv)
if x < float64(plotX1) || x > float64(plotX2) {
continue
}
fmt.Fprintf(&b, `<text x="%.1f" y="%d">%s</text>`+"\n", x, plotY2+18, gpuFormatTick(tv))
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d" stroke="#333" stroke-width="1"/>`+"\n",
x, plotY2, x, plotY2+4)
}
b.WriteString("</g>\n")
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="13"`+
` fill="#333" text-anchor="middle">Time (seconds)</text>`+"\n",
plotX1+PW/2, plotY2+38)
// Y axes: [tempAxisX, plotX1, plotX2, clockAxisX]
axisLineX := [4]int{tempAxisX, plotX1, plotX2, clockAxisX}
axisRight := [4]bool{false, false, true, true}
// Label x positions (for rotated vertical text)
axisLabelX := [4]int{10, 68, 868, 950}
for i := 0; i < 4; i++ {
ax := axisLineX[i]
right := axisRight[i]
color := colors[i]
// Axis line
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
` stroke="%s" stroke-width="1"/>`+"\n",
ax, plotY1, ax, plotY2, color)
// Ticks and tick labels
fmt.Fprintf(&b, `<g font-family="sans-serif" font-size="10" fill="%s">`+"\n", color)
for _, tick := range axes[i].ticks {
y := yv(tick, i)
if y < float64(plotY1) || y > float64(plotY2) {
continue
}
dx := -5
textX := ax - 8
anchor := "end"
if right {
dx = 5
textX = ax + 8
anchor = "start"
}
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"`+
` stroke="%s" stroke-width="1"/>`+"\n",
ax, y, ax+dx, y, color)
fmt.Fprintf(&b, `<text x="%d" y="%.1f" text-anchor="%s" dy="4">%s</text>`+"\n",
textX, y, anchor, gpuFormatTick(tick))
}
b.WriteString("</g>\n")
// Axis label (rotated)
lx := axisLabelX[i]
fmt.Fprintf(&b, `<text transform="translate(%d,%d) rotate(-90)"`+
` font-family="sans-serif" font-size="12" fill="%s" text-anchor="middle">%s</text>`+"\n",
lx, plotY1+PH/2, color, axisLabel[i])
}
// Data lines
for i := 0; i < 4; i++ {
var pts strings.Builder
for j := range rows {
x := xv(t[j])
y := yv(vals[i][j], i)
if j == 0 {
fmt.Fprintf(&pts, "%.1f,%.1f", x, y)
} else {
fmt.Fprintf(&pts, " %.1f,%.1f", x, y)
}
}
fmt.Fprintf(&b, `<polyline points="%s" fill="none" stroke="%s" stroke-width="1.5"/>`+"\n",
pts.String(), colors[i])
}
// Legend
const legendY = 42
for i := 0; i < 4; i++ {
lx := plotX1 + i*(PW/4) + 10
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
` stroke="%s" stroke-width="2"/>`+"\n",
lx, legendY, lx+20, legendY, colors[i])
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="12" fill="#333">%s</text>`+"\n",
lx+25, legendY+4, seriesLabel[i])
}
b.WriteString("</svg>\n")
return b.String()
}
const (
ansiRed = "\033[31m"
ansiBlue = "\033[34m"
ansiGreen = "\033[32m"
ansiYellow = "\033[33m"
ansiReset = "\033[0m"
)
const (
termChartWidth = 70
termChartHeight = 12
)
// RenderGPUTerminalChart returns ANSI line charts (asciigraph-style) per GPU.
// Suitable for display in the TUI screenOutput.
func RenderGPUTerminalChart(rows []GPUMetricRow) string {
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
type seriesDef struct {
caption string
color string
fn func(GPUMetricRow) float64
}
defs := []seriesDef{
{"Temperature (°C)", ansiRed, func(r GPUMetricRow) float64 { return r.TempC }},
{"GPU Usage (%)", ansiBlue, func(r GPUMetricRow) float64 { return r.UsagePct }},
{"Power (W)", ansiGreen, func(r GPUMetricRow) float64 { return r.PowerW }},
{"Clock (MHz)", ansiYellow, func(r GPUMetricRow) float64 { return r.ClockMHz }},
}
var b strings.Builder
for _, gpuIdx := range order {
gr := gpuMap[gpuIdx]
if len(gr) == 0 {
continue
}
tMax := gr[len(gr)-1].ElapsedSec - gr[0].ElapsedSec
fmt.Fprintf(&b, "GPU %d — Stress Test Metrics (%.0f seconds)\n\n", gpuIdx, tMax)
for _, d := range defs {
b.WriteString(renderLineChart(extractGPUField(gr, d.fn), d.color, d.caption,
termChartHeight, termChartWidth))
b.WriteRune('\n')
}
}
return strings.TrimRight(b.String(), "\n")
}
// renderLineChart draws a single time-series line chart using box-drawing characters.
// Produces output in the style of asciigraph: ╭─╮ │ ╰─╯ with a Y axis and caption.
func renderLineChart(vals []float64, color, caption string, height, width int) string {
if len(vals) == 0 {
return caption + "\n"
}
mn, mx := gpuMinMax(vals)
if mn == mx {
mx = mn + 1
}
// Use the smaller of width or len(vals) to avoid stretching sparse data.
w := width
if len(vals) < w {
w = len(vals)
}
data := gpuDownsample(vals, w)
// row[i] = display row index: 0 = top = max value, height = bottom = min value.
row := make([]int, w)
for i, v := range data {
r := int(math.Round((mx - v) / (mx - mn) * float64(height)))
if r < 0 {
r = 0
}
if r > height {
r = height
}
row[i] = r
}
// Fill the character grid.
grid := make([][]rune, height+1)
for i := range grid {
grid[i] = make([]rune, w)
for j := range grid[i] {
grid[i][j] = ' '
}
}
for x := 0; x < w; x++ {
r := row[x]
if x == 0 {
grid[r][0] = '─'
continue
}
p := row[x-1]
switch {
case r == p:
grid[r][x] = '─'
case r < p: // value went up (row index decreased toward top)
grid[r][x] = '╭'
grid[p][x] = '╯'
for y := r + 1; y < p; y++ {
grid[y][x] = '│'
}
default: // r > p, value went down
grid[p][x] = '╮'
grid[r][x] = '╰'
for y := p + 1; y < r; y++ {
grid[y][x] = '│'
}
}
}
// Y axis tick labels.
ticks := gpuNiceTicks(mn, mx, height/2)
tickAtRow := make(map[int]string)
labelWidth := 4
for _, t := range ticks {
r := int(math.Round((mx - t) / (mx - mn) * float64(height)))
if r < 0 || r > height {
continue
}
s := gpuFormatTick(t)
tickAtRow[r] = s
if len(s) > labelWidth {
labelWidth = len(s)
}
}
var b strings.Builder
for r := 0; r <= height; r++ {
label := tickAtRow[r]
fmt.Fprintf(&b, "%*s", labelWidth, label)
switch {
case label != "":
b.WriteRune('┤')
case r == height:
b.WriteRune('┼')
default:
b.WriteRune('│')
}
b.WriteString(color)
b.WriteString(string(grid[r]))
b.WriteString(ansiReset)
b.WriteRune('\n')
}
// Bottom axis.
b.WriteString(strings.Repeat(" ", labelWidth))
b.WriteRune('└')
b.WriteString(strings.Repeat("─", w))
b.WriteRune('\n')
// Caption centered under the chart.
if caption != "" {
total := labelWidth + 1 + w
if pad := (total - len([]rune(caption))) / 2; pad > 0 { // count runes, not bytes: captions contain "°"
b.WriteString(strings.Repeat(" ", pad))
}
b.WriteString(caption)
b.WriteRune('\n')
}
return b.String()
}
func extractGPUField(rows []GPUMetricRow, fn func(GPUMetricRow) float64) []float64 {
v := make([]float64, len(rows))
for i, r := range rows {
v[i] = fn(r)
}
return v
}
// gpuDownsample averages vals into w buckets (or nearest-neighbor upsamples if len(vals) < w).
func gpuDownsample(vals []float64, w int) []float64 {
n := len(vals)
if n == 0 {
return make([]float64, w)
}
result := make([]float64, w)
if n >= w {
counts := make([]int, w)
for i, v := range vals {
bucket := i * w / n
if bucket >= w {
bucket = w - 1
}
result[bucket] += v
counts[bucket]++
}
for i := range result {
if counts[i] > 0 {
result[i] /= float64(counts[i])
}
}
} else {
// Nearest-neighbour upsample.
for i := range result {
src := i * (n - 1) / (w - 1)
if src >= n {
src = n - 1
}
result[i] = vals[src]
}
}
return result
}
func gpuMinMax(vals []float64) (float64, float64) {
if len(vals) == 0 {
return 0, 1
}
mn, mx := vals[0], vals[0]
for _, v := range vals[1:] {
if v < mn {
mn = v
}
if v > mx {
mx = v
}
}
return mn, mx
}
func gpuNiceTicks(mn, mx float64, targetCount int) []float64 {
if mn == mx {
mn -= 1
mx += 1
}
r := mx - mn
step := math.Pow(10, math.Floor(math.Log10(r/float64(targetCount))))
for _, f := range []float64{1, 2, 5, 10} {
if r/(f*step) <= float64(targetCount)*1.5 {
step = f * step
break
}
}
lo := math.Floor(mn/step) * step
hi := math.Ceil(mx/step) * step
var ticks []float64
for v := lo; v <= hi+step*0.001; v += step {
ticks = append(ticks, math.Round(v*1e9)/1e9)
}
return ticks
}
func gpuFormatTick(v float64) string {
if v == math.Trunc(v) {
return strconv.Itoa(int(v))
}
return strconv.FormatFloat(v, 'f', 1, 64)
}

@@ -0,0 +1,214 @@
package platform
import (
"os"
"os/exec"
"strings"
"time"
"bee/audit/internal/schema"
)
var runtimeRequiredTools = []string{
"dmidecode",
"lspci",
"lsblk",
"smartctl",
"nvme",
"ipmitool",
"dhclient",
"mount",
}
var runtimeTrackedServices = []string{
"bee-network",
"bee-nvidia",
"bee-preflight",
"bee-audit",
"bee-web",
"bee-sshsetup",
}
func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
checkedAt := time.Now().UTC().Format(time.RFC3339)
health := schema.RuntimeHealth{
Status: "OK",
CheckedAt: checkedAt,
ExportDir: strings.TrimSpace(exportDir),
}
if health.ExportDir != "" {
if err := os.MkdirAll(health.ExportDir, 0755); err != nil {
health.Status = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "export_dir_unavailable",
Severity: "critical",
Description: err.Error(),
})
}
}
interfaces, err := s.ListInterfaces()
if err == nil {
health.Interfaces = make([]schema.RuntimeInterface, 0, len(interfaces))
hasIPv4 := false
missingIPv4 := false
for _, iface := range interfaces {
outcome := "no_offer"
if len(iface.IPv4) > 0 {
outcome = "lease_acquired"
hasIPv4 = true
} else if strings.EqualFold(iface.State, "DOWN") {
outcome = "link_down"
} else {
missingIPv4 = true
}
health.Interfaces = append(health.Interfaces, schema.RuntimeInterface{
Name: iface.Name,
State: iface.State,
IPv4: iface.IPv4,
Outcome: outcome,
})
}
switch {
case hasIPv4 && !missingIPv4:
health.NetworkStatus = "OK"
case hasIPv4:
health.NetworkStatus = "PARTIAL"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_partial",
Severity: "warning",
Description: "At least one interface did not obtain IPv4 connectivity.",
})
default:
health.NetworkStatus = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_failed",
Severity: "warning",
Description: "No physical interface obtained IPv4 connectivity.",
})
}
}
vendor := s.DetectGPUVendor()
for _, tool := range s.runtimeToolStatuses(vendor) {
health.Tools = append(health.Tools, schema.RuntimeToolStatus{
Name: tool.Name,
Path: tool.Path,
OK: tool.OK,
})
if !tool.OK {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "tool_missing",
Severity: "warning",
Description: "Required tool missing: " + tool.Name,
})
}
}
for _, name := range runtimeTrackedServices {
health.Services = append(health.Services, schema.RuntimeServiceStatus{
Name: name,
Status: s.ServiceState(name),
})
}
s.collectGPURuntimeHealth(vendor, &health)
if health.Status != "FAILED" && len(health.Issues) > 0 {
health.Status = "PARTIAL"
}
return health, nil
}
func commandText(name string, args ...string) string {
raw, err := exec.Command(name, args...).CombinedOutput()
if err != nil && len(raw) == 0 {
return ""
}
return string(raw)
}
func (s *System) runtimeToolStatuses(vendor string) []ToolStatus {
tools := s.CheckTools(runtimeRequiredTools)
switch vendor {
case "nvidia":
tools = append(tools, s.CheckTools([]string{
"nvidia-smi",
"nvidia-bug-report.sh",
"bee-gpu-stress",
})...)
case "amd":
tool := ToolStatus{Name: "rocm-smi"}
if cmd, err := resolveROCmSMICommand(); err == nil && len(cmd) > 0 {
tool.Path = cmd[0]
if len(cmd) > 1 && strings.HasSuffix(cmd[1], "rocm_smi.py") {
tool.Path = cmd[1]
}
tool.OK = true
}
tools = append(tools, tool)
}
return tools
}
func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) {
lsmodText := commandText("lsmod")
switch vendor {
case "nvidia":
health.DriverReady = strings.Contains(lsmodText, "nvidia ")
if !health.DriverReady {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_kernel_module_missing",
Severity: "warning",
Description: "NVIDIA kernel module is not loaded.",
})
}
if health.DriverReady && !strings.Contains(lsmodText, "nvidia_modeset") {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_modeset_failed",
Severity: "warning",
Description: "nvidia-modeset is not loaded; display/CUDA stack may be partial.",
})
}
if out, err := exec.Command("nvidia-smi", "-L").CombinedOutput(); err == nil && strings.TrimSpace(string(out)) != "" {
health.DriverReady = true
}
if lookErr := exec.Command("sh", "-c", "command -v bee-gpu-stress >/dev/null 2>&1").Run(); lookErr == nil {
out, err := exec.Command("bee-gpu-stress", "--seconds", "1", "--size-mb", "1").CombinedOutput()
if err == nil {
health.CUDAReady = true
} else if strings.Contains(strings.ToLower(string(out)), "cuda_error_system_not_ready") {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "cuda_runtime_not_ready",
Severity: "warning",
Description: "CUDA runtime is not ready for GPU SAT.",
})
}
}
case "amd":
health.DriverReady = strings.Contains(lsmodText, "amdgpu ") || strings.Contains(lsmodText, "amdkfd")
if !health.DriverReady {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "amdgpu_kernel_module_missing",
Severity: "warning",
Description: "AMD GPU driver is not loaded.",
})
}
out, err := runROCmSMI("--showproductname", "--csv")
if err == nil && strings.TrimSpace(string(out)) != "" {
health.CUDAReady = true
health.DriverReady = true
return
}
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "rocm_smi_unavailable",
Severity: "warning",
Description: "ROCm SMI is not available for AMD GPU SAT.",
})
}
}

@@ -3,60 +3,653 @@ package platform
import (
"archive/tar"
"compress/gzip"
"context"
"errors"
"fmt"
"io"
"os"
"os/exec"
"path/filepath"
"sort"
"strconv"
"strings"
"time"
)
var (
satExecCommand = exec.Command
satLookPath = exec.LookPath
satGlob = filepath.Glob
satStat = os.Stat
rocmSMIExecutableGlobs = []string{
"/opt/rocm/bin/rocm-smi",
"/opt/rocm-*/bin/rocm-smi",
}
rocmSMIScriptGlobs = []string{
"/opt/rocm/libexec/rocm_smi/rocm_smi.py",
"/opt/rocm-*/libexec/rocm_smi/rocm_smi.py",
}
)
// NvidiaGPU holds basic GPU info from nvidia-smi.
type NvidiaGPU struct {
Index int
Name string
MemoryMB int
}
// AMDGPUInfo holds basic info about an AMD GPU from rocm-smi.
type AMDGPUInfo struct {
Index int
Name string
}
// DetectGPUVendor returns "nvidia" if /dev/nvidia0 exists, "amd" if /dev/kfd exists, or "" otherwise.
func (s *System) DetectGPUVendor() string {
if _, err := os.Stat("/dev/nvidia0"); err == nil {
return "nvidia"
}
if _, err := os.Stat("/dev/kfd"); err == nil {
return "amd"
}
return ""
}
// ListAMDGPUs returns AMD GPUs visible to rocm-smi.
func (s *System) ListAMDGPUs() ([]AMDGPUInfo, error) {
out, err := runROCmSMI("--showproductname", "--csv")
if err != nil {
return nil, fmt.Errorf("rocm-smi: %w", err)
}
var gpus []AMDGPUInfo
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" || strings.HasPrefix(strings.ToLower(line), "device") {
continue
}
parts := strings.SplitN(line, ",", 2)
name := ""
if len(parts) >= 2 {
name = strings.TrimSpace(parts[1])
}
idx := len(gpus)
gpus = append(gpus, AMDGPUInfo{Index: idx, Name: name})
}
return gpus, nil
}
// RunAMDAcceptancePack runs an AMD GPU diagnostic pack using rocm-smi.
func (s *System) RunAMDAcceptancePack(baseDir string) (string, error) {
return runAcceptancePack(baseDir, "gpu-amd", []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rocm-smi-showallinfo.log", cmd: []string{"rocm-smi", "--showallinfo"}},
{name: "03-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "04-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
})
}
// ListNvidiaGPUs returns GPUs visible to nvidia-smi.
func (s *System) ListNvidiaGPUs() ([]NvidiaGPU, error) {
out, err := exec.Command("nvidia-smi",
"--query-gpu=index,name,memory.total",
"--format=csv,noheader,nounits").Output()
if err != nil {
return nil, fmt.Errorf("nvidia-smi: %w", err)
}
var gpus []NvidiaGPU
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.SplitN(line, ", ", 3)
if len(parts) != 3 {
continue
}
idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
if err != nil {
continue
}
memMB, _ := strconv.Atoi(strings.TrimSpace(parts[2]))
gpus = append(gpus, NvidiaGPU{
Index: idx,
Name: strings.TrimSpace(parts[1]),
MemoryMB: memMB,
})
}
return gpus, nil
}
func (s *System) RunNvidiaAcceptancePack(baseDir string) (string, error) {
return runAcceptancePack(baseDir, "gpu-nvidia", nvidiaSATJobs())
}
// RunNvidiaAcceptancePackWithOptions runs the NVIDIA SAT with explicit duration,
// GPU memory size, and GPU index selection. ctx cancellation kills the running job.
func (s *System) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (string, error) {
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia", nvidiaSATJobsWithOptions(durationSec, sizeMB, gpuIndices))
}
func (s *System) RunMemoryAcceptancePack(baseDir string) (string, error) {
sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128)
passes := envInt("BEE_MEMTESTER_PASSES", 1)
return runAcceptancePack(baseDir, "memory", []satJob{
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
{name: "02-memtester.log", cmd: []string{"memtester", fmt.Sprintf("%dM", sizeMB), fmt.Sprintf("%d", passes)}},
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
})
}
func (s *System) RunCPUAcceptancePack(baseDir string, durationSec int) (string, error) {
if durationSec <= 0 {
durationSec = 60
}
return runAcceptancePack(baseDir, "cpu", []satJob{
{name: "01-lscpu.log", cmd: []string{"lscpu"}},
{name: "02-sensors-before.log", cmd: []string{"sensors"}},
{name: "03-stress-ng.log", cmd: []string{"stress-ng", "--cpu", "0", "--cpu-method", "all", "--timeout", fmt.Sprintf("%d", durationSec)}},
{name: "04-sensors-after.log", cmd: []string{"sensors"}},
})
}
func (s *System) RunStorageAcceptancePack(baseDir string) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, "gpu-nvidia-"+ts)
runDir := filepath.Join(baseDir, "storage-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
type job struct {
name string
cmd []string
}
jobs := []job{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output", filepath.Join(runDir, "nvidia-bug-report.log")}},
devices, err := listStorageDevices()
if err != nil {
return "", err
}
sort.Strings(devices)
var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
for _, job := range jobs {
out, err := exec.Command(job.cmd[0], job.cmd[1:]...).CombinedOutput()
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
rc := 0
if err != nil {
rc = 1
}
fmt.Fprintf(&summary, "%s_rc=%d\n", strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log"), rc)
if len(devices) == 0 {
fmt.Fprintln(&summary, "devices=0")
stats.Unsupported++
} else {
fmt.Fprintf(&summary, "devices=%d\n", len(devices))
}
for index, devPath := range devices {
prefix := fmt.Sprintf("%02d-%s", index+1, filepath.Base(devPath))
commands := storageSATCommands(devPath)
for cmdIndex, job := range commands {
name := fmt.Sprintf("%s-%02d-%s.log", prefix, cmdIndex+1, job.name)
out, err := runSATCommand(verboseLog, job.name, job.cmd)
if writeErr := os.WriteFile(filepath.Join(runDir, name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := filepath.Base(devPath) + "_" + strings.ReplaceAll(job.name, "-", "_")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, "gpu-nvidia-"+ts+".tar.gz")
archive := filepath.Join(baseDir, "storage-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
type satJob struct {
name string
cmd []string
env []string // extra env vars (appended to os.Environ)
collectGPU bool // collect GPU metrics via nvidia-smi while this job runs
gpuIndices []int // GPU indices to collect metrics for (empty = all)
}
type satStats struct {
OK int
Failed int
Unsupported int
}
func nvidiaSATJobs() []satJob {
seconds := envInt("BEE_GPU_STRESS_SECONDS", 5)
sizeMB := envInt("BEE_GPU_STRESS_SIZE_MB", 64)
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
{name: "05-bee-gpu-stress.log", cmd: []string{"bee-gpu-stress", "--seconds", fmt.Sprintf("%d", seconds), "--size-mb", fmt.Sprintf("%d", sizeMB)}},
}
}
func runAcceptancePack(baseDir, prefix string, jobs []satJob) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, prefix+"-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
for _, job := range jobs {
cmd := make([]string, 0, len(job.cmd))
for _, arg := range job.cmd {
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
}
out, err := runSATCommand(verboseLog, job.name, cmd)
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
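// Illustrative sketch (not part of the commit): the "{{run_dir}}" placeholder in a job
// template is expanded per run, which is why the job lists can be built before the
// run directory exists. The helper name expandRunDir and the timestamped path below
// are hypothetical; only the placeholder string and strings.ReplaceAll come from the code above.

```go
package main

import (
	"fmt"
	"strings"
)

// expandRunDir returns a copy of args with every "{{run_dir}}" replaced by runDir.
// The input slice is left untouched so one job template can be reused across runs.
func expandRunDir(args []string, runDir string) []string {
	out := make([]string, 0, len(args))
	for _, arg := range args {
		out = append(out, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
	}
	return out
}

func main() {
	tmpl := []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}
	// Hypothetical run directory, shaped like the ones built from baseDir and the UTC timestamp.
	fmt.Println(expandRunDir(tmpl, "/var/log/bee-sat/gpu-nvidia-20260326-100131"))
	fmt.Println(tmpl[2]) // the template still contains the placeholder
}
```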
func nvidiaSATJobsWithOptions(durationSec, sizeMB int, gpuIndices []int) []satJob {
var env []string
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
env = []string{"CUDA_VISIBLE_DEVICES=" + strings.Join(ids, ",")}
}
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
{
name: "05-bee-gpu-stress.log",
cmd: []string{"bee-gpu-stress", "--seconds", strconv.Itoa(durationSec), "--size-mb", strconv.Itoa(sizeMB)},
env: env,
collectGPU: true,
gpuIndices: gpuIndices,
},
}
}
func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, prefix+"-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
for _, job := range jobs {
if ctx.Err() != nil {
break
}
cmd := make([]string, 0, len(job.cmd))
for _, arg := range job.cmd {
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
}
var out []byte
var err error
if job.collectGPU {
out, err = runSATCommandWithMetrics(ctx, verboseLog, job.name, cmd, job.env, job.gpuIndices, runDir)
} else {
out, err = runSATCommandCtx(ctx, verboseLog, job.name, cmd, job.env)
}
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
func runSATCommandCtx(ctx context.Context, verboseLog, name string, cmd []string, env []string) ([]byte, error) {
start := time.Now().UTC()
resolvedCmd, err := resolveSATCommand(cmd)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
"cmd: "+strings.Join(resolvedCmd, " "),
)
if err != nil {
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
"rc: 1",
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return []byte(err.Error() + "\n"), err
}
c := exec.CommandContext(ctx, resolvedCmd[0], resolvedCmd[1:]...)
if len(env) > 0 {
c.Env = append(os.Environ(), env...)
}
out, err := c.CombinedOutput()
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
fmt.Sprintf("rc: %d", rc),
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return out, err
}
func listStorageDevices() ([]string, error) {
out, err := satExecCommand("lsblk", "-dn", "-o", "NAME,TYPE,TRAN").Output()
if err != nil {
return nil, err
}
return parseStorageDevices(string(out)), nil
}
func storageSATCommands(devPath string) []satJob {
if strings.Contains(filepath.Base(devPath), "nvme") {
return []satJob{
{name: "nvme-id-ctrl", cmd: []string{"nvme", "id-ctrl", devPath, "-o", "json"}},
{name: "nvme-smart-log", cmd: []string{"nvme", "smart-log", devPath, "-o", "json"}},
{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "-s", "1", "--wait"}},
}
}
return []satJob{
{name: "smartctl-health", cmd: []string{"smartctl", "-H", "-A", devPath}},
{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", "short", devPath}},
}
}
func (s *satStats) Add(status string) {
switch status {
case "OK":
s.OK++
case "UNSUPPORTED":
s.Unsupported++
default:
s.Failed++
}
}
func (s satStats) Overall() string {
if s.Failed > 0 {
return "FAILED"
}
if s.Unsupported > 0 {
return "PARTIAL"
}
return "OK"
}
func writeSATStats(summary *strings.Builder, stats satStats) {
fmt.Fprintf(summary, "overall_status=%s\n", stats.Overall())
fmt.Fprintf(summary, "job_ok=%d\n", stats.OK)
fmt.Fprintf(summary, "job_failed=%d\n", stats.Failed)
fmt.Fprintf(summary, "job_unsupported=%d\n", stats.Unsupported)
}
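// classifySATResult maps a job's outcome to a status ("OK", "UNSUPPORTED", or "FAILED")
// and a 0/1 return code. A failing job counts as UNSUPPORTED when its output shows the
// feature or device is unavailable rather than broken.
func classifySATResult(name string, out []byte, err error) (string, int) {
if err == nil {
return "OK", 0
}
rc := 1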
text := strings.ToLower(string(out))
if strings.Contains(text, "unsupported") ||
strings.Contains(text, "not supported") ||
strings.Contains(text, "invalid opcode") ||
strings.Contains(text, "unknown command") ||
strings.Contains(text, "not implemented") ||
strings.Contains(text, "not available") ||
strings.Contains(text, "cuda_error_system_not_ready") ||
strings.Contains(text, "no such device") ||
(strings.Contains(name, "self-test") && strings.Contains(text, "aborted")) {
return "UNSUPPORTED", rc
}
return "FAILED", rc
}
func runSATCommand(verboseLog, name string, cmd []string) ([]byte, error) {
start := time.Now().UTC()
resolvedCmd, err := resolveSATCommand(cmd)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
"cmd: "+strings.Join(resolvedCmd, " "),
)
if err != nil {
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
"rc: 1",
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return []byte(err.Error() + "\n"), err
}
out, err := satExecCommand(resolvedCmd[0], resolvedCmd[1:]...).CombinedOutput()
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
fmt.Sprintf("rc: %d", rc),
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return out, err
}
func runROCmSMI(args ...string) ([]byte, error) {
cmd, err := resolveROCmSMICommand(args...)
if err != nil {
return nil, err
}
return satExecCommand(cmd[0], cmd[1:]...).CombinedOutput()
}
func resolveSATCommand(cmd []string) ([]string, error) {
if len(cmd) == 0 {
return nil, errors.New("empty SAT command")
}
if cmd[0] != "rocm-smi" {
return cmd, nil
}
return resolveROCmSMICommand(cmd[1:]...)
}
func resolveROCmSMICommand(args ...string) ([]string, error) {
if path, err := satLookPath("rocm-smi"); err == nil {
return append([]string{path}, args...), nil
}
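// Use the first existing executable under the ROCm tree, if any.
if candidates := rocmSMIExecutableCandidates(); len(candidates) > 0 {
return append([]string{candidates[0]}, args...), nil
}
// Otherwise run the first rocm-smi Python script through python3.
if pythonPath, pyErr := satLookPath("python3"); pyErr == nil {
if scripts := rocmSMIScriptCandidates(); len(scripts) > 0 {
cmd := []string{pythonPath, scripts[0]}
return append(cmd, args...), nil
}
}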
return nil, errors.New("rocm-smi not found in PATH or under /opt/rocm")
}
func rocmSMIExecutableCandidates() []string {
return expandExistingPaths(rocmSMIExecutableGlobs)
}
func rocmSMIScriptCandidates() []string {
return expandExistingPaths(rocmSMIScriptGlobs)
}
func expandExistingPaths(patterns []string) []string {
seen := make(map[string]struct{})
var paths []string
for _, pattern := range patterns {
matches, err := satGlob(pattern)
if err != nil {
continue
}
sort.Strings(matches)
for _, match := range matches {
if _, err := satStat(match); err != nil {
continue
}
if _, ok := seen[match]; ok {
continue
}
seen[match] = struct{}{}
paths = append(paths, match)
}
}
return paths
}
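// parseStorageDevices parses `lsblk -dn -o NAME,TYPE,TRAN` output and returns the
// /dev paths of whole disks, skipping USB-attached devices.
func parseStorageDevices(raw string) []string {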
var devices []string
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
fields := strings.Fields(strings.TrimSpace(line))
if len(fields) < 2 || fields[1] != "disk" {
continue
}
if len(fields) >= 3 && strings.EqualFold(fields[2], "usb") {
continue
}
devices = append(devices, "/dev/"+fields[0])
}
return devices
}
// runSATCommandWithMetrics runs a command while collecting GPU metrics in the background.
// On completion it writes gpu-metrics.csv and gpu-metrics.html into runDir.
func runSATCommandWithMetrics(ctx context.Context, verboseLog, name string, cmd []string, env []string, gpuIndices []int, runDir string) ([]byte, error) {
stopCh := make(chan struct{})
doneCh := make(chan struct{})
var metricRows []GPUMetricRow
start := time.Now()
go func() {
defer close(doneCh)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-stopCh:
return
case <-ticker.C:
samples, err := sampleGPUMetrics(gpuIndices)
if err != nil {
continue
}
elapsed := time.Since(start).Seconds()
for i := range samples {
samples[i].ElapsedSec = elapsed
}
metricRows = append(metricRows, samples...)
}
}
}()
out, err := runSATCommandCtx(ctx, verboseLog, name, cmd, env)
close(stopCh)
<-doneCh
if len(metricRows) > 0 {
_ = WriteGPUMetricsCSV(filepath.Join(runDir, "gpu-metrics.csv"), metricRows)
_ = WriteGPUMetricsHTML(filepath.Join(runDir, "gpu-metrics.html"), metricRows)
chart := RenderGPUTerminalChart(metricRows)
_ = os.WriteFile(filepath.Join(runDir, "gpu-metrics-term.txt"), []byte(chart), 0644)
}
return out, err
}
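// Illustrative sketch (not part of the commit): the stop/done channel pair used above,
// reduced to its core. The sampler goroutine is the only writer of the slice, and the
// caller reads it only after receiving from done, so no mutex is needed. The name
// sampleDuring and its callback parameters are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// sampleDuring calls sample once per interval while work runs and returns the
// collected values.
func sampleDuring(interval time.Duration, sample func() float64, work func()) []float64 {
	stop := make(chan struct{})
	done := make(chan struct{})
	var rows []float64
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				rows = append(rows, sample())
			}
		}
	}()
	work()
	close(stop)
	<-done // the goroutine has exited; rows is safe to read from here on
	return rows
}

func main() {
	start := time.Now()
	rows := sampleDuring(10*time.Millisecond,
		func() float64 { return time.Since(start).Seconds() },
		func() { time.Sleep(55 * time.Millisecond) })
	fmt.Printf("collected %d samples\n", len(rows))
}
```

// The sample count depends on scheduling, so the sketch prints it rather than assuming a value.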
func appendSATVerboseLog(path string, lines ...string) {
if path == "" {
return
}
f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
if err != nil {
return
}
defer f.Close()
for _, line := range lines {
_, _ = io.WriteString(f, line+"\n")
}
}
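// envInt returns the positive integer value of the named environment variable,
// or fallback when the variable is unset, unparsable, or not positive.
func envInt(name string, fallback int) int {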
raw := strings.TrimSpace(os.Getenv(name))
if raw == "" {
return fallback
}
value, err := strconv.Atoi(raw)
if err != nil || value <= 0 {
return fallback
}
return value
}
func createTarGz(dst, srcDir string) error {
file, err := os.Create(dst)
if err != nil {

@@ -0,0 +1,587 @@
package platform
import (
"context"
"fmt"
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
"sync"
"time"
)
// FanStressOptions configures the fan-stress / thermal cycling test.
type FanStressOptions struct {
BaselineSec int // idle monitoring before and after load (default 30)
Phase1DurSec int // first load phase duration in seconds (default 300)
PauseSec int // pause between the two load phases (default 60)
Phase2DurSec int // second load phase duration in seconds (default 300)
SizeMB int // GPU memory to allocate per GPU during stress (default 64)
GPUIndices []int // which GPU indices to stress (empty = all detected)
}
// FanReading holds one fan sensor reading.
type FanReading struct {
Name string
RPM float64
}
// GPUStressMetric holds per-GPU metrics during the stress test.
type GPUStressMetric struct {
Index int
TempC float64
UsagePct float64
PowerW float64
ClockMHz float64
Throttled bool // true if any throttle reason is active
}
// FanStressRow is one second-interval telemetry sample covering all monitored dimensions.
type FanStressRow struct {
TimestampUTC string
ElapsedSec float64
Phase string // "baseline", "load1", "pause", "load2", "cooldown"
GPUs []GPUStressMetric
Fans []FanReading
CPUMaxTempC float64 // highest CPU temperature from ipmitool / sensors
SysPowerW float64 // DCMI system power reading
}
// RunFanStressTest runs a two-phase GPU stress test while monitoring fan speeds,
// temperatures, and power draw every second. Exports metrics.csv and fan-sensors.csv.
// Designed to reproduce case-04 fan-speed lag and detect GPU thermal throttling.
func (s *System) RunFanStressTest(ctx context.Context, baseDir string, opts FanStressOptions) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
applyFanStressDefaults(&opts)
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, "fan-stress-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
// Phase name shared between sampler goroutine and main goroutine.
var phaseMu sync.Mutex
currentPhase := "init"
setPhase := func(name string) {
phaseMu.Lock()
currentPhase = name
phaseMu.Unlock()
}
getPhase := func() string {
phaseMu.Lock()
defer phaseMu.Unlock()
return currentPhase
}
start := time.Now()
var rowsMu sync.Mutex
var allRows []FanStressRow
// Start background sampler (every second).
stopCh := make(chan struct{})
doneCh := make(chan struct{})
go func() {
defer close(doneCh)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-stopCh:
return
case <-ticker.C:
row := sampleFanStressRow(opts.GPUIndices, getPhase(), time.Since(start).Seconds())
rowsMu.Lock()
allRows = append(allRows, row)
rowsMu.Unlock()
}
}
}()
var summary strings.Builder
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
stats := satStats{}
// idlePhase sleeps for durSec while the sampler stamps phaseName on each row.
idlePhase := func(phaseName, stepName string, durSec int) {
if ctx.Err() != nil {
return
}
setPhase(phaseName)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s (idle %ds)", time.Now().UTC().Format(time.RFC3339), stepName, durSec),
)
select {
case <-ctx.Done():
case <-time.After(time.Duration(durSec) * time.Second):
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), stepName),
)
fmt.Fprintf(&summary, "%s_status=OK\n", stepName)
stats.OK++
}
// loadPhase runs bee-gpu-stress for durSec; sampler stamps phaseName on each row.
loadPhase := func(phaseName, stepName string, durSec int) {
if ctx.Err() != nil {
return
}
setPhase(phaseName)
var env []string
if len(opts.GPUIndices) > 0 {
ids := make([]string, len(opts.GPUIndices))
for i, idx := range opts.GPUIndices {
ids[i] = strconv.Itoa(idx)
}
env = []string{"CUDA_VISIBLE_DEVICES=" + strings.Join(ids, ",")}
}
cmd := []string{
"bee-gpu-stress",
"--seconds", strconv.Itoa(durSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
}
out, err := runSATCommandCtx(ctx, verboseLog, stepName, cmd, env)
_ = os.WriteFile(filepath.Join(runDir, stepName+".log"), out, 0644)
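// Count the step as failed only when the command itself failed. A process killed
// because ctx was cancelled (user abort) is not a test failure.
if err != nil && ctx.Err() == nil {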
fmt.Fprintf(&summary, "%s_status=FAILED\n", stepName)
stats.Failed++
} else {
fmt.Fprintf(&summary, "%s_status=OK\n", stepName)
stats.OK++
}
}
// Execute test phases.
idlePhase("baseline", "01-baseline", opts.BaselineSec)
loadPhase("load1", "02-load1", opts.Phase1DurSec)
idlePhase("pause", "03-pause", opts.PauseSec)
loadPhase("load2", "04-load2", opts.Phase2DurSec)
idlePhase("cooldown", "05-cooldown", opts.BaselineSec)
// Stop sampler and collect rows.
close(stopCh)
<-doneCh
rowsMu.Lock()
rows := allRows
rowsMu.Unlock()
// Analysis.
throttled := analyzeThrottling(rows)
maxGPUTemp := analyzeMaxTemp(rows, func(r FanStressRow) float64 {
var m float64
for _, g := range r.GPUs {
if g.TempC > m {
m = g.TempC
}
}
return m
})
maxCPUTemp := analyzeMaxTemp(rows, func(r FanStressRow) float64 {
return r.CPUMaxTempC
})
fanResponseSec := analyzeFanResponse(rows)
fmt.Fprintf(&summary, "throttling_detected=%v\n", throttled)
fmt.Fprintf(&summary, "max_gpu_temp_c=%.1f\n", maxGPUTemp)
fmt.Fprintf(&summary, "max_cpu_temp_c=%.1f\n", maxCPUTemp)
if fanResponseSec >= 0 {
fmt.Fprintf(&summary, "fan_response_sec=%.1f\n", fanResponseSec)
} else {
fmt.Fprintf(&summary, "fan_response_sec=N/A\n")
}
// Throttling failure counts against overall result.
if throttled {
stats.Failed++
}
writeSATStats(&summary, stats)
// Write CSV outputs.
if err := WriteFanStressCSV(filepath.Join(runDir, "metrics.csv"), rows, opts.GPUIndices); err != nil {
return "", err
}
_ = WriteFanSensorsCSV(filepath.Join(runDir, "fan-sensors.csv"), rows)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, "fan-stress-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
func applyFanStressDefaults(opts *FanStressOptions) {
if opts.BaselineSec <= 0 {
opts.BaselineSec = 30
}
if opts.Phase1DurSec <= 0 {
opts.Phase1DurSec = 300
}
if opts.PauseSec <= 0 {
opts.PauseSec = 60
}
if opts.Phase2DurSec <= 0 {
opts.Phase2DurSec = 300
}
if opts.SizeMB <= 0 {
opts.SizeMB = 64
}
}
// sampleFanStressRow collects all metrics for one telemetry sample.
func sampleFanStressRow(gpuIndices []int, phase string, elapsed float64) FanStressRow {
row := FanStressRow{
TimestampUTC: time.Now().UTC().Format(time.RFC3339),
ElapsedSec: elapsed,
Phase: phase,
}
row.GPUs = sampleGPUStressMetrics(gpuIndices)
row.Fans, _ = sampleFanSpeeds()
row.CPUMaxTempC = sampleCPUMaxTemp()
row.SysPowerW = sampleSystemPower()
return row
}
// sampleGPUStressMetrics queries nvidia-smi for temperature, utilization, power,
// clock frequency, and active throttle reasons for each GPU.
func sampleGPUStressMetrics(gpuIndices []int) []GPUStressMetric {
args := []string{
"--query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics,clocks_throttle_reasons.active",
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
}
out, err := exec.Command("nvidia-smi", args...).Output()
if err != nil {
return nil
}
var metrics []GPUStressMetric
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.Split(line, ", ")
if len(parts) < 6 {
continue
}
idx, _ := strconv.Atoi(strings.TrimSpace(parts[0]))
throttleVal := strings.TrimSpace(parts[5])
// Throttled if active reasons bitmask is non-zero.
throttled := throttleVal != "0x0000000000000000" &&
throttleVal != "0x0" &&
throttleVal != "0" &&
throttleVal != "" &&
throttleVal != "N/A"
metrics = append(metrics, GPUStressMetric{
Index: idx,
TempC: parseGPUFloat(parts[1]),
UsagePct: parseGPUFloat(parts[2]),
PowerW: parseGPUFloat(parts[3]),
ClockMHz: parseGPUFloat(parts[4]),
Throttled: throttled,
})
}
return metrics
}
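// Illustrative sketch (not part of the commit): the check above treats any non-zero
// bitmask as throttling, which also matches benign reasons such as "GPU idle" or a
// software power cap. The sketch below parses the hex value and tests only the
// slowdown bits. The bit values are taken from the NVML clocks throttle reason
// constants as I understand them and should be verified against the installed driver;
// the function and constant names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Assumed NVML throttle-reason bits (verify against the NVML headers in use).
const (
	reasonHwSlowdown        uint64 = 0x08 // HW slowdown (thermal or power brake)
	reasonSwThermalSlowdown uint64 = 0x20
	reasonHwThermalSlowdown uint64 = 0x40
	thermalMask                    = reasonHwSlowdown | reasonSwThermalSlowdown | reasonHwThermalSlowdown
)

// parseThrottleMask converts a value such as "0x0000000000000060" into a bitmask.
// It reports false for empty or non-hex input such as "N/A".
func parseThrottleMask(raw string) (uint64, bool) {
	s := strings.TrimPrefix(strings.ToLower(strings.TrimSpace(raw)), "0x")
	if s == "" {
		return 0, false
	}
	v, err := strconv.ParseUint(s, 16, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

// isThermalThrottle reports whether any of the assumed slowdown bits is set.
func isThermalThrottle(mask uint64) bool { return mask&thermalMask != 0 }

func main() {
	for _, raw := range []string{"0x0000000000000000", "0x0000000000000004", "0x0000000000000060", "N/A"} {
		mask, ok := parseThrottleMask(raw)
		fmt.Printf("%-20s parsed=%v thermal=%v\n", raw, ok, ok && isThermalThrottle(mask))
	}
}
```

// Whether a software power cap should count as a failure is a policy decision for the
// test; the sketch only shows how the reasons could be told apart.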
// sampleFanSpeeds reads fan RPM values from ipmitool sdr.
func sampleFanSpeeds() ([]FanReading, error) {
out, err := exec.Command("ipmitool", "sdr", "type", "Fan").Output()
if err != nil {
return nil, err
}
return parseFanSpeeds(string(out)), nil
}
// parseFanSpeeds parses "ipmitool sdr type Fan" output.
// Line format: "FAN1 | 2400.000 | RPM | ok"
func parseFanSpeeds(raw string) []FanReading {
var fans []FanReading
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
parts := strings.Split(line, "|")
if len(parts) < 3 {
continue
}
unit := strings.TrimSpace(parts[2])
if !strings.EqualFold(unit, "RPM") {
continue
}
valStr := strings.TrimSpace(parts[1])
if strings.EqualFold(valStr, "na") || strings.EqualFold(valStr, "disabled") || valStr == "" {
continue
}
val, err := strconv.ParseFloat(valStr, 64)
if err != nil {
continue
}
fans = append(fans, FanReading{
Name: strings.TrimSpace(parts[0]),
RPM: val,
})
}
return fans
}
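// Illustrative sketch (not part of the commit): the parser above expects the unit in the
// third column ("FAN1 | 2400.000 | RPM | ok"). As far as I know, some ipmitool
// subcommands instead print value and unit together in the last column, for example
// "FAN2 | 41h | ok | 29.2 | 5160 RPM". Both sample lines are assumptions to check
// against real output on the target BMC. The hypothetical parseFanLine below accepts either layout.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFanLine extracts the sensor name and RPM from one pipe-separated line.
// ok is false when the line carries no numeric RPM reading.
func parseFanLine(line string) (name string, rpm float64, ok bool) {
	cols := strings.Split(line, "|")
	if len(cols) < 2 {
		return "", 0, false
	}
	name = strings.TrimSpace(cols[0])
	for i := 1; i < len(cols); i++ {
		f := strings.Fields(cols[i])
		switch {
		case len(f) == 2 && strings.EqualFold(f[1], "RPM"):
			// Value and unit share one column: "5160 RPM".
			if v, err := strconv.ParseFloat(f[0], 64); err == nil {
				return name, v, true
			}
		case len(f) == 1 && strings.EqualFold(f[0], "RPM") && i > 1:
			// Unit has its own column; the value sits in the previous one.
			if v, err := strconv.ParseFloat(strings.TrimSpace(cols[i-1]), 64); err == nil {
				return name, v, true
			}
		}
	}
	return name, 0, false
}

func main() {
	lines := []string{
		"FAN1 | 2400.000 | RPM | ok",
		"FAN2 | 41h | ok | 29.2 | 5160 RPM",
		"FAN3 | 42h | ns | 29.3 | No Reading",
	}
	for _, l := range lines {
		name, rpm, ok := parseFanLine(l)
		fmt.Printf("%s rpm=%.0f ok=%v\n", name, rpm, ok)
	}
}
```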
// sampleCPUMaxTemp returns the highest CPU/inlet temperature from ipmitool or sensors.
func sampleCPUMaxTemp() float64 {
out, err := exec.Command("ipmitool", "sdr", "type", "Temperature").Output()
if err != nil {
return sampleCPUTempViaSensors()
}
return parseIPMIMaxTemp(string(out))
}
// parseIPMIMaxTemp extracts the maximum temperature from "ipmitool sdr type Temperature".
func parseIPMIMaxTemp(raw string) float64 {
var max float64
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
parts := strings.Split(line, "|")
if len(parts) < 3 {
continue
}
unit := strings.TrimSpace(parts[2])
if !strings.Contains(strings.ToLower(unit), "degrees") {
continue
}
valStr := strings.TrimSpace(parts[1])
if strings.EqualFold(valStr, "na") || valStr == "" {
continue
}
val, err := strconv.ParseFloat(valStr, 64)
if err != nil {
continue
}
if val > max {
max = val
}
}
return max
}
// sampleCPUTempViaSensors falls back to lm-sensors when ipmitool is unavailable.
func sampleCPUTempViaSensors() float64 {
out, err := exec.Command("sensors", "-u").Output()
if err != nil {
return 0
}
var max float64
for _, line := range strings.Split(string(out), "\n") {
line = strings.TrimSpace(line)
fields := strings.Fields(line)
if len(fields) < 2 {
continue
}
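// Only tempN_input lines are temperatures; fanN_input and inN_input also end in "_input:".
if !strings.HasPrefix(fields[0], "temp") || !strings.HasSuffix(fields[0], "_input:") {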
continue
}
val, err := strconv.ParseFloat(fields[1], 64)
if err != nil {
continue
}
if val > 0 && val < 150 && val > max {
max = val
}
}
return max
}
// sampleSystemPower reads system power draw via DCMI.
func sampleSystemPower() float64 {
out, err := exec.Command("ipmitool", "dcmi", "power", "reading").Output()
if err != nil {
return 0
}
return parseDCMIPowerReading(string(out))
}
// parseDCMIPowerReading extracts the instantaneous power reading from ipmitool dcmi output.
// Sample: " Instantaneous power reading: 500 Watts"
func parseDCMIPowerReading(raw string) float64 {
for _, line := range strings.Split(raw, "\n") {
if !strings.Contains(strings.ToLower(line), "instantaneous") {
continue
}
parts := strings.Fields(line)
for i, p := range parts {
if strings.EqualFold(p, "Watts") && i > 0 {
val, err := strconv.ParseFloat(parts[i-1], 64)
if err == nil {
return val
}
}
}
}
return 0
}
// analyzeThrottling returns true if any GPU reported an active throttle reason
// during either load phase.
func analyzeThrottling(rows []FanStressRow) bool {
for _, row := range rows {
if row.Phase != "load1" && row.Phase != "load2" {
continue
}
for _, gpu := range row.GPUs {
if gpu.Throttled {
return true
}
}
}
return false
}
// analyzeMaxTemp returns the maximum value of the given extractor across all rows.
func analyzeMaxTemp(rows []FanStressRow, extract func(FanStressRow) float64) float64 {
var max float64
for _, row := range rows {
if v := extract(row); v > max {
max = v
}
}
return max
}
// analyzeFanResponse returns the seconds from load1 start until fan RPM first
// increased by more than 5% above the baseline average. Returns -1 if undetermined.
func analyzeFanResponse(rows []FanStressRow) float64 {
// Compute baseline average fan RPM.
var baseTotal, baseCount float64
for _, row := range rows {
if row.Phase != "baseline" {
continue
}
for _, f := range row.Fans {
baseTotal += f.RPM
baseCount++
}
}
if baseCount == 0 || baseTotal == 0 {
return -1
}
baseAvg := baseTotal / baseCount
threshold := baseAvg * 1.05 // 5% increase signals fan ramp-up
// Find elapsed time when load1 started.
var load1Start float64 = -1
for _, row := range rows {
if row.Phase == "load1" {
load1Start = row.ElapsedSec
break
}
}
if load1Start < 0 {
return -1
}
// Find first load1 row where average RPM crosses the threshold.
for _, row := range rows {
if row.Phase != "load1" {
continue
}
var total, count float64
for _, f := range row.Fans {
total += f.RPM
count++
}
if count > 0 && total/count >= threshold {
return row.ElapsedSec - load1Start
}
}
return -1
}
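// Illustrative worked example (not part of the commit): the fan-response rule above on
// synthetic data. The type sample is a simplified stand-in for FanStressRow with one
// averaged RPM per tick, so the baseline mean here is per sample rather than per fan
// reading; the names sample and fanResponseSec are hypothetical.

```go
package main

import "fmt"

// sample holds one telemetry tick with the fan RPM already averaged.
type sample struct {
	phase   string
	elapsed float64
	avgRPM  float64
}

// fanResponseSec returns the seconds from the first "load1" sample until the
// average RPM reaches 105% of the baseline mean, or -1 if that never happens.
func fanResponseSec(rows []sample) float64 {
	var total, count float64
	for _, r := range rows {
		if r.phase == "baseline" {
			total += r.avgRPM
			count++
		}
	}
	if count == 0 || total == 0 {
		return -1
	}
	threshold := total / count * 1.05
	start := -1.0
	for _, r := range rows {
		if r.phase != "load1" {
			continue
		}
		if start < 0 {
			start = r.elapsed // first load1 sample marks the start of load
		}
		if r.avgRPM >= threshold {
			return r.elapsed - start
		}
	}
	return -1
}

func main() {
	rows := []sample{
		{"baseline", 28, 2000}, {"baseline", 29, 2000},
		{"load1", 30, 2000}, {"load1", 31, 2050}, {"load1", 32, 2150},
	}
	fmt.Printf("fan_response_sec=%.1f\n", fanResponseSec(rows))
}
```

// With a baseline mean of 2000 RPM the threshold is 2100 RPM. The first load1 sample
// at or above it is at t=32, two seconds after load started at t=30.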
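// WriteFanStressCSV writes the wide-format metrics CSV with one row per second.
// GPU columns are generated per index in gpuIndices order. An empty gpuIndices
// means "all detected GPUs", so the indices are taken from the samples instead.
func WriteFanStressCSV(path string, rows []FanStressRow, gpuIndices []int) error {
if len(rows) == 0 {
return os.WriteFile(path, []byte("no data\n"), 0644)
}
if len(gpuIndices) == 0 {
seen := make(map[int]struct{})
for _, row := range rows {
for _, g := range row.GPUs {
if _, ok := seen[g.Index]; ok {
continue
}
seen[g.Index] = struct{}{}
gpuIndices = append(gpuIndices, g.Index)
}
}
}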
var b strings.Builder
// Header: fixed system columns + per-GPU columns.
b.WriteString("timestamp_utc,elapsed_sec,phase,fan_avg_rpm,fan_min_rpm,fan_max_rpm,cpu_max_temp_c,sys_power_w")
for _, idx := range gpuIndices {
fmt.Fprintf(&b, ",gpu%d_temp_c,gpu%d_usage_pct,gpu%d_power_w,gpu%d_clock_mhz,gpu%d_throttled",
idx, idx, idx, idx, idx)
}
b.WriteRune('\n')
for _, row := range rows {
favg, fmin, fmax := fanRPMStats(row.Fans)
fmt.Fprintf(&b, "%s,%.1f,%s,%.0f,%.0f,%.0f,%.1f,%.1f",
row.TimestampUTC,
row.ElapsedSec,
row.Phase,
favg, fmin, fmax,
row.CPUMaxTempC,
row.SysPowerW,
)
gpuByIdx := make(map[int]GPUStressMetric, len(row.GPUs))
for _, g := range row.GPUs {
gpuByIdx[g.Index] = g
}
for _, idx := range gpuIndices {
g := gpuByIdx[idx]
throttled := 0
if g.Throttled {
throttled = 1
}
fmt.Fprintf(&b, ",%.1f,%.1f,%.1f,%.0f,%d",
g.TempC, g.UsagePct, g.PowerW, g.ClockMHz, throttled)
}
b.WriteRune('\n')
}
return os.WriteFile(path, []byte(b.String()), 0644)
}
// WriteFanSensorsCSV writes individual fan sensor readings in long (tidy) format.
func WriteFanSensorsCSV(path string, rows []FanStressRow) error {
var b strings.Builder
b.WriteString("timestamp_utc,elapsed_sec,phase,fan_name,rpm\n")
for _, row := range rows {
for _, f := range row.Fans {
fmt.Fprintf(&b, "%s,%.1f,%s,%s,%.0f\n",
row.TimestampUTC, row.ElapsedSec, row.Phase, f.Name, f.RPM)
}
}
return os.WriteFile(path, []byte(b.String()), 0644)
}
// fanRPMStats computes average, min, max RPM across all fans in a sample row.
func fanRPMStats(fans []FanReading) (avg, min, max float64) {
if len(fans) == 0 {
return 0, 0, 0
}
min = fans[0].RPM
max = fans[0].RPM
var total float64
for _, f := range fans {
total += f.RPM
if f.RPM < min {
min = f.RPM
}
if f.RPM > max {
max = f.RPM
}
}
return total / float64(len(fans)), min, max
}

@@ -0,0 +1,182 @@
package platform
import (
"errors"
"os"
"os/exec"
"path/filepath"
"testing"
)
func TestStorageSATCommands(t *testing.T) {
t.Parallel()
nvme := storageSATCommands("/dev/nvme0n1")
if len(nvme) != 3 || nvme[2].cmd[0] != "nvme" {
t.Fatalf("unexpected nvme commands: %#v", nvme)
}
sata := storageSATCommands("/dev/sda")
if len(sata) != 2 || sata[0].cmd[0] != "smartctl" {
t.Fatalf("unexpected sata commands: %#v", sata)
}
}
func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
t.Parallel()
jobs := nvidiaSATJobs()
if len(jobs) != 5 {
t.Fatalf("jobs=%d want 5", len(jobs))
}
if got := jobs[4].cmd[0]; got != "bee-gpu-stress" {
t.Fatalf("gpu stress command=%q want bee-gpu-stress", got)
}
if got := jobs[3].cmd[1]; got != "--output-file" {
t.Fatalf("bug report flag=%q want --output-file", got)
}
}
func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
t.Setenv("BEE_GPU_STRESS_SECONDS", "9")
t.Setenv("BEE_GPU_STRESS_SIZE_MB", "96")
jobs := nvidiaSATJobs()
got := jobs[4].cmd
want := []string{"bee-gpu-stress", "--seconds", "9", "--size-mb", "96"}
if len(got) != len(want) {
t.Fatalf("cmd len=%d want %d", len(got), len(want))
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("cmd[%d]=%q want %q", i, got[i], want[i])
}
}
}
func TestEnvIntFallback(t *testing.T) {
os.Unsetenv("BEE_MEMTESTER_SIZE_MB")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
t.Fatalf("got %d want 123", got)
}
t.Setenv("BEE_MEMTESTER_SIZE_MB", "bad")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
t.Fatalf("got %d want 123", got)
}
t.Setenv("BEE_MEMTESTER_SIZE_MB", "256")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 256 {
t.Fatalf("got %d want 256", got)
}
}
func TestClassifySATResult(t *testing.T) {
tests := []struct {
name string
job string
out string
err error
status string
}{
{name: "ok", job: "memtester", out: "done", err: nil, status: "OK"},
{name: "unsupported", job: "smartctl-self-test-short", out: "Self-test not supported", err: errors.New("rc 1"), status: "UNSUPPORTED"},
{name: "failed", job: "bee-gpu-stress", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
{name: "cuda not ready", job: "bee-gpu-stress", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got, _ := classifySATResult(tt.job, []byte(tt.out), tt.err)
if got != tt.status {
t.Fatalf("status=%q want %q", got, tt.status)
}
})
}
}
func TestParseStorageDevicesSkipsUSBDisks(t *testing.T) {
t.Parallel()
raw := "nvme0n1 disk nvme\nsda disk usb\nloop0 loop\nsdb disk sata\n"
got := parseStorageDevices(raw)
want := []string{"/dev/nvme0n1", "/dev/sdb"}
if len(got) != len(want) {
t.Fatalf("len(devices)=%d want %d (%v)", len(got), len(want), got)
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("devices[%d]=%q want %q", i, got[i], want[i])
}
}
}
func TestResolveROCmSMICommandFromPATH(t *testing.T) {
t.Setenv("PATH", t.TempDir())
toolPath := filepath.Join(os.Getenv("PATH"), "rocm-smi")
if err := os.WriteFile(toolPath, []byte("#!/bin/sh\nexit 0\n"), 0755); err != nil {
t.Fatalf("write rocm-smi: %v", err)
}
cmd, err := resolveROCmSMICommand("--showproductname")
if err != nil {
t.Fatalf("resolveROCmSMICommand error: %v", err)
}
if len(cmd) != 2 {
t.Fatalf("cmd len=%d want 2 (%v)", len(cmd), cmd)
}
if cmd[0] != toolPath {
t.Fatalf("cmd[0]=%q want %q", cmd[0], toolPath)
}
}
func TestResolveROCmSMICommandFallsBackToROCmTree(t *testing.T) {
tmp := t.TempDir()
execPath := filepath.Join(tmp, "opt", "rocm", "bin", "rocm-smi")
if err := os.MkdirAll(filepath.Dir(execPath), 0755); err != nil {
t.Fatalf("mkdir: %v", err)
}
if err := os.WriteFile(execPath, []byte("#!/bin/sh\nexit 0\n"), 0755); err != nil {
t.Fatalf("write rocm-smi: %v", err)
}
oldGlob := rocmSMIExecutableGlobs
oldScriptGlobs := rocmSMIScriptGlobs
rocmSMIExecutableGlobs = []string{execPath}
rocmSMIScriptGlobs = nil
t.Cleanup(func() {
rocmSMIExecutableGlobs = oldGlob
rocmSMIScriptGlobs = oldScriptGlobs
})
t.Setenv("PATH", "")
cmd, err := resolveROCmSMICommand("--showallinfo")
if err != nil {
t.Fatalf("resolveROCmSMICommand error: %v", err)
}
if len(cmd) != 2 {
t.Fatalf("cmd len=%d want 2 (%v)", len(cmd), cmd)
}
if cmd[0] != execPath {
t.Fatalf("cmd[0]=%q want %q", cmd[0], execPath)
}
}
func TestRunROCmSMIReportsMissingCommand(t *testing.T) {
oldLookPath := satLookPath
oldExecGlobs := rocmSMIExecutableGlobs
oldScriptGlobs := rocmSMIScriptGlobs
satLookPath = func(string) (string, error) { return "", exec.ErrNotFound }
rocmSMIExecutableGlobs = nil
rocmSMIScriptGlobs = nil
t.Cleanup(func() {
satLookPath = oldLookPath
rocmSMIExecutableGlobs = oldExecGlobs
rocmSMIScriptGlobs = oldScriptGlobs
})
if _, err := runROCmSMI("--showproductname"); err == nil {
t.Fatal("expected missing rocm-smi error")
}
}


@@ -0,0 +1,150 @@
package platform
import (
"encoding/json"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
)
var techDumpFixedCommands = []struct {
Name string
Args []string
File string
}{
{Name: "dmidecode", Args: []string{"-t", "0"}, File: "dmidecode-type0.txt"},
{Name: "dmidecode", Args: []string{"-t", "1"}, File: "dmidecode-type1.txt"},
{Name: "dmidecode", Args: []string{"-t", "2"}, File: "dmidecode-type2.txt"},
{Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"},
{Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"},
{Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"},
{Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"},
{Name: "sensors", Args: []string{"-j"}, File: "sensors.json"},
{Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"},
{Name: "ipmitool", Args: []string{"sdr"}, File: "ipmitool-sdr.txt"},
{Name: "nvme", Args: []string{"list", "-o", "json"}, File: "nvme-list.json"},
}
var techDumpNvidiaCommands = []struct {
Name string
Args []string
File string
}{
{Name: "nvidia-smi", Args: []string{"-q"}, File: "nvidia-smi-q.txt"},
{Name: "nvidia-smi", Args: []string{"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown", "--format=csv,noheader,nounits"}, File: "nvidia-smi-query.csv"},
}
type lsblkDumpRoot struct {
Blockdevices []struct {
Name string `json:"name"`
Type string `json:"type"`
Tran string `json:"tran"`
} `json:"blockdevices"`
}
type nvmeDumpRoot struct {
Devices []struct {
DevicePath string `json:"DevicePath"`
} `json:"Devices"`
}
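// CaptureTechnicalDump writes raw tool output into baseDir on a best-effort basis:
// a tool that is missing or fails without output is skipped, not reported as an error.
func (s *System) CaptureTechnicalDump(baseDir string) error {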
if err := os.MkdirAll(baseDir, 0755); err != nil {
return err
}
for _, cmd := range techDumpFixedCommands {
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
}
switch s.DetectGPUVendor() {
case "nvidia":
for _, cmd := range techDumpNvidiaCommands {
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
}
case "amd":
writeROCmSMIDump(filepath.Join(baseDir, "rocm-smi.txt"))
writeROCmSMIDump(filepath.Join(baseDir, "rocm-smi-showallinfo.txt"), "--showallinfo")
}
for _, dev := range lsblkDumpDevices(filepath.Join(baseDir, "lsblk.json")) {
writeCommandDump(filepath.Join(baseDir, "smartctl-"+sanitizeDumpName(dev)+".json"), "smartctl", "-j", "-a", "/dev/"+dev)
}
for _, dev := range nvmeDumpDevices(filepath.Join(baseDir, "nvme-list.json")) {
writeCommandDump(filepath.Join(baseDir, "nvme-id-ctrl-"+sanitizeDumpName(dev)+".json"), "nvme", "id-ctrl", dev, "-o", "json")
writeCommandDump(filepath.Join(baseDir, "nvme-smart-log-"+sanitizeDumpName(dev)+".json"), "nvme", "smart-log", dev, "-o", "json")
}
return nil
}
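// writeCommandDump runs the command and stores its combined stdout/stderr at path.
// Output from a failed command is still written; nothing is written if there is no output.
func writeCommandDump(path, name string, args ...string) {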
out, err := exec.Command(name, args...).CombinedOutput()
if err != nil && len(out) == 0 {
return
}
_ = os.WriteFile(path, out, 0644)
}
func writeROCmSMIDump(path string, args ...string) {
out, err := runROCmSMI(args...)
if err != nil && len(out) == 0 {
return
}
_ = os.WriteFile(path, out, 0644)
}
func lsblkDumpDevices(path string) []string {
raw, err := os.ReadFile(path)
if err != nil {
return nil
}
var root lsblkDumpRoot
if err := json.Unmarshal(raw, &root); err != nil {
return nil
}
var devices []string
for _, dev := range root.Blockdevices {
if strings.EqualFold(strings.TrimSpace(dev.Tran), "usb") {
continue
}
if dev.Type == "disk" && strings.TrimSpace(dev.Name) != "" {
devices = append(devices, strings.TrimSpace(dev.Name))
}
}
sort.Strings(devices)
return devices
}
func nvmeDumpDevices(path string) []string {
raw, err := os.ReadFile(path)
if err != nil {
return nil
}
var root nvmeDumpRoot
if err := json.Unmarshal(raw, &root); err != nil {
return nil
}
seen := map[string]bool{}
var devices []string
for _, dev := range root.Devices {
name := strings.TrimSpace(dev.DevicePath)
if name == "" || seen[name] {
continue
}
seen[name] = true
devices = append(devices, name)
}
sort.Strings(devices)
return devices
}
func sanitizeDumpName(value string) string {
value = strings.TrimSpace(value)
value = strings.TrimPrefix(value, "/dev/")
value = strings.ReplaceAll(value, "/", "_")
if value == "" {
return "unknown"
}
return value
}
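
The per-device part of the dump can be illustrated in isolation. The following is a minimal sketch, not the code from this change: `smartctlDumpFiles` is a hypothetical helper that combines the filtering done by `lsblkDumpDevices` (whole disks only, USB transport skipped) with the file naming used in `CaptureTechnicalDump`, working on an in-memory JSON document instead of a file.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// lsblkRoot mirrors the subset of `lsblk -J` fields that the dump code reads.
type lsblkRoot struct {
	Blockdevices []struct {
		Name string `json:"name"`
		Type string `json:"type"`
		Tran string `json:"tran"`
	} `json:"blockdevices"`
}

// smartctlDumpFiles is a hypothetical helper: it maps every non-USB whole disk
// to the name of the smartctl output file the dump would write for it.
func smartctlDumpFiles(raw []byte) ([]string, error) {
	var root lsblkRoot
	if err := json.Unmarshal(raw, &root); err != nil {
		return nil, err
	}
	var files []string
	for _, dev := range root.Blockdevices {
		name := strings.TrimSpace(dev.Name)
		if dev.Type != "disk" || name == "" {
			continue // partitions, loop devices and unnamed entries are skipped
		}
		if strings.EqualFold(strings.TrimSpace(dev.Tran), "usb") {
			continue // removable USB media is not part of the audited hardware
		}
		files = append(files, "smartctl-"+strings.ReplaceAll(name, "/", "_")+".json")
	}
	sort.Strings(files)
	return files, nil
}

func main() {
	raw := []byte(`{"blockdevices":[{"name":"sda","type":"disk","tran":"usb"},{"name":"nvme0n1","type":"disk","tran":"nvme"},{"name":"sdb","type":"disk","tran":"sata"}]}`)
	files, err := smartctlDumpFiles(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Unlike the code in the diff, the sketch returns the parse error to the caller; the original deliberately swallows it so that a broken `lsblk.json` only shortens the dump.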


@@ -0,0 +1,48 @@
package platform
import (
"os"
"path/filepath"
"reflect"
"testing"
)
func TestLSBLKDumpDevices(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "lsblk.json")
if err := os.WriteFile(path, []byte(`{"blockdevices":[{"name":"sda","type":"disk","tran":"usb"},{"name":"sda1","type":"part"},{"name":"nvme0n1","type":"disk","tran":"nvme"},{"name":"sdb","type":"disk","tran":"sata"}]}`), 0644); err != nil {
t.Fatalf("write lsblk fixture: %v", err)
}
got := lsblkDumpDevices(path)
want := []string{"nvme0n1", "sdb"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("lsblkDumpDevices=%v want %v", got, want)
}
}
func TestNVMEDumpDevices(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "nvme-list.json")
if err := os.WriteFile(path, []byte(`{"Devices":[{"DevicePath":"/dev/nvme1n1"},{"DevicePath":"/dev/nvme0n1"},{"DevicePath":"/dev/nvme1n1"}]}`), 0644); err != nil {
t.Fatalf("write nvme fixture: %v", err)
}
got := nvmeDumpDevices(path)
want := []string{"/dev/nvme0n1", "/dev/nvme1n1"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("nvmeDumpDevices=%v want %v", got, want)
}
}
func TestSanitizeDumpName(t *testing.T) {
t.Parallel()
if got := sanitizeDumpName("/dev/nvme0n1"); got != "nvme0n1" {
t.Fatalf("sanitizeDumpName=%q want nvme0n1", got)
}
}
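
`TestSanitizeDumpName` covers a single input. As a sketch of what further cases would exercise, the program below copies `sanitizeDumpName` from the diff so it runs standalone; the extra inputs are illustrative and do not come from the change itself.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeDumpName is copied from the diff so the sketch runs standalone.
func sanitizeDumpName(value string) string {
	value = strings.TrimSpace(value)
	value = strings.TrimPrefix(value, "/dev/")
	value = strings.ReplaceAll(value, "/", "_")
	if value == "" {
		return "unknown"
	}
	return value
}

func main() {
	// Only the first input appears in the real test; the rest are assumptions
	// about inputs the function might plausibly receive.
	inputs := []string{"/dev/nvme0n1", " /dev/sda ", "/dev/disk/by-id/wwn-0x5000", "/dev/", ""}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, sanitizeDumpName(in))
	}
}
```

The last two inputs both fall through to the `unknown` placeholder, and nested paths have their remaining slashes replaced so the result is safe to use as a file name component.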


@@ -5,14 +5,52 @@ package schema
// HardwareIngestRequest is the top-level output document produced by `bee audit`.
// It is accepted as-is by the core /api/ingest/hardware endpoint.
type HardwareIngestRequest struct {
Filename *string `json:"filename"`
SourceType *string `json:"source_type"`
Protocol *string `json:"protocol"`
TargetHost string `json:"target_host"`
Filename *string `json:"filename,omitempty"`
SourceType *string `json:"source_type,omitempty"`
Protocol *string `json:"protocol,omitempty"`
TargetHost *string `json:"target_host,omitempty"`
CollectedAt string `json:"collected_at"`
Runtime *RuntimeHealth `json:"runtime,omitempty"`
Hardware HardwareSnapshot `json:"hardware"`
}
type RuntimeHealth struct {
Status string `json:"status"`
CheckedAt string `json:"checked_at"`
ExportDir string `json:"export_dir,omitempty"`
DriverReady bool `json:"driver_ready,omitempty"`
CUDAReady bool `json:"cuda_ready,omitempty"`
NetworkStatus string `json:"network_status,omitempty"`
Issues []RuntimeIssue `json:"issues,omitempty"`
Tools []RuntimeToolStatus `json:"tools,omitempty"`
Services []RuntimeServiceStatus `json:"services,omitempty"`
Interfaces []RuntimeInterface `json:"interfaces,omitempty"`
}
type RuntimeIssue struct {
Code string `json:"code"`
Severity string `json:"severity,omitempty"`
Description string `json:"description"`
}
type RuntimeToolStatus struct {
Name string `json:"name"`
Path string `json:"path,omitempty"`
OK bool `json:"ok"`
}
type RuntimeServiceStatus struct {
Name string `json:"name"`
Status string `json:"status"`
}
type RuntimeInterface struct {
Name string `json:"name"`
State string `json:"state,omitempty"`
IPv4 []string `json:"ipv4,omitempty"`
Outcome string `json:"outcome,omitempty"`
}
type HardwareSnapshot struct {
Board HardwareBoard `json:"board"`
Firmware []HardwareFirmwareRecord `json:"firmware,omitempty"`
@@ -21,14 +59,33 @@ type HardwareSnapshot struct {
Storage []HardwareStorage `json:"storage,omitempty"`
PCIeDevices []HardwarePCIeDevice `json:"pcie_devices,omitempty"`
PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"`
Sensors *HardwareSensors `json:"sensors,omitempty"`
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
}
type HardwareHealthSummary struct {
Status string `json:"status"`
Warnings []string `json:"warnings,omitempty"`
Failures []string `json:"failures,omitempty"`
StorageWarn int `json:"storage_warn,omitempty"`
StorageFail int `json:"storage_fail,omitempty"`
PCIeWarn int `json:"pcie_warn,omitempty"`
PCIeFail int `json:"pcie_fail,omitempty"`
PSUWarn int `json:"psu_warn,omitempty"`
PSUFail int `json:"psu_fail,omitempty"`
MemoryWarn int `json:"memory_warn,omitempty"`
MemoryFail int `json:"memory_fail,omitempty"`
EmptyDIMMs int `json:"empty_dimms,omitempty"`
MissingPSUs int `json:"missing_psus,omitempty"`
CollectedAt string `json:"collected_at,omitempty"`
}
type HardwareBoard struct {
Manufacturer *string `json:"manufacturer"`
ProductName *string `json:"product_name"`
Manufacturer *string `json:"manufacturer,omitempty"`
ProductName *string `json:"product_name,omitempty"`
SerialNumber string `json:"serial_number"`
PartNumber *string `json:"part_number"`
UUID *string `json:"uuid"`
PartNumber *string `json:"part_number,omitempty"`
UUID *string `json:"uuid,omitempty"`
}
type HardwareFirmwareRecord struct {
@@ -37,77 +94,196 @@ type HardwareFirmwareRecord struct {
}
type HardwareCPU struct {
Socket *int `json:"socket"`
Model *string `json:"model"`
Manufacturer *string `json:"manufacturer"`
Status *string `json:"status"`
SerialNumber *string `json:"serial_number"`
Firmware *string `json:"firmware"`
Cores *int `json:"cores"`
Threads *int `json:"threads"`
FrequencyMHz *int `json:"frequency_mhz"`
MaxFrequencyMHz *int `json:"max_frequency_mhz"`
HardwareComponentStatus
Socket *int `json:"socket,omitempty"`
Model *string `json:"model,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Firmware *string `json:"firmware,omitempty"`
Cores *int `json:"cores,omitempty"`
Threads *int `json:"threads,omitempty"`
FrequencyMHz *int `json:"frequency_mhz,omitempty"`
MaxFrequencyMHz *int `json:"max_frequency_mhz,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
Throttled *bool `json:"throttled,omitempty"`
CorrectableErrorCount *int64 `json:"correctable_error_count,omitempty"`
UncorrectableErrorCount *int64 `json:"uncorrectable_error_count,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
Present *bool `json:"present,omitempty"`
}
type HardwareMemory struct {
Slot *string `json:"slot"`
Location *string `json:"location"`
Present *bool `json:"present"`
SizeMB *int `json:"size_mb"`
Type *string `json:"type"`
MaxSpeedMHz *int `json:"max_speed_mhz"`
CurrentSpeedMHz *int `json:"current_speed_mhz"`
Manufacturer *string `json:"manufacturer"`
SerialNumber *string `json:"serial_number"`
PartNumber *string `json:"part_number"`
Status *string `json:"status"`
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
Location *string `json:"location,omitempty"`
Present *bool `json:"present,omitempty"`
SizeMB *int `json:"size_mb,omitempty"`
Type *string `json:"type,omitempty"`
MaxSpeedMHz *int `json:"max_speed_mhz,omitempty"`
CurrentSpeedMHz *int `json:"current_speed_mhz,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
PartNumber *string `json:"part_number,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
CorrectableECCErrorCount *int64 `json:"correctable_ecc_error_count,omitempty"`
UncorrectableECCErrorCount *int64 `json:"uncorrectable_ecc_error_count,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
SpareBlocksRemainingPct *float64 `json:"spare_blocks_remaining_pct,omitempty"`
PerformanceDegraded *bool `json:"performance_degraded,omitempty"`
DataLossDetected *bool `json:"data_loss_detected,omitempty"`
}
type HardwareStorage struct {
Slot *string `json:"slot"`
Type *string `json:"type"`
Model *string `json:"model"`
SizeGB *int `json:"size_gb"`
SerialNumber *string `json:"serial_number"`
Manufacturer *string `json:"manufacturer"`
Firmware *string `json:"firmware"`
Interface *string `json:"interface"`
Present *bool `json:"present"`
Status *string `json:"status"`
Telemetry map[string]any `json:"telemetry,omitempty"`
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
Type *string `json:"type,omitempty"`
Model *string `json:"model,omitempty"`
SizeGB *int `json:"size_gb,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
Firmware *string `json:"firmware,omitempty"`
Interface *string `json:"interface,omitempty"`
Present *bool `json:"present,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerOnHours *int64 `json:"power_on_hours,omitempty"`
PowerCycles *int64 `json:"power_cycles,omitempty"`
UnsafeShutdowns *int64 `json:"unsafe_shutdowns,omitempty"`
MediaErrors *int64 `json:"media_errors,omitempty"`
ErrorLogEntries *int64 `json:"error_log_entries,omitempty"`
WrittenBytes *int64 `json:"written_bytes,omitempty"`
ReadBytes *int64 `json:"read_bytes,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
AvailableSparePct *float64 `json:"available_spare_pct,omitempty"`
ReallocatedSectors *int64 `json:"reallocated_sectors,omitempty"`
CurrentPendingSectors *int64 `json:"current_pending_sectors,omitempty"`
OfflineUncorrectable *int64 `json:"offline_uncorrectable,omitempty"`
Telemetry map[string]any `json:"-"`
}
type HardwarePCIeDevice struct {
Slot *string `json:"slot"`
VendorID *int `json:"vendor_id"`
DeviceID *int `json:"device_id"`
BDF *string `json:"bdf"`
DeviceClass *string `json:"device_class"`
Manufacturer *string `json:"manufacturer"`
Model *string `json:"model"`
LinkWidth *int `json:"link_width"`
LinkSpeed *string `json:"link_speed"`
MaxLinkWidth *int `json:"max_link_width"`
MaxLinkSpeed *string `json:"max_link_speed"`
SerialNumber *string `json:"serial_number"`
Firmware *string `json:"firmware"`
Present *bool `json:"present"`
Status *string `json:"status"`
Telemetry map[string]any `json:"telemetry,omitempty"`
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
VendorID *int `json:"vendor_id,omitempty"`
DeviceID *int `json:"device_id,omitempty"`
NUMANode *int `json:"numa_node,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
ECCCorrectedTotal *int64 `json:"ecc_corrected_total,omitempty"`
ECCUncorrectedTotal *int64 `json:"ecc_uncorrected_total,omitempty"`
HWSlowdown *bool `json:"hw_slowdown,omitempty"`
BatteryChargePct *float64 `json:"battery_charge_pct,omitempty"`
BatteryHealthPct *float64 `json:"battery_health_pct,omitempty"`
BatteryTemperatureC *float64 `json:"battery_temperature_c,omitempty"`
BatteryVoltageV *float64 `json:"battery_voltage_v,omitempty"`
BatteryReplaceRequired *bool `json:"battery_replace_required,omitempty"`
SFPTemperatureC *float64 `json:"sfp_temperature_c,omitempty"`
SFPTXPowerDBM *float64 `json:"sfp_tx_power_dbm,omitempty"`
SFPRXPowerDBM *float64 `json:"sfp_rx_power_dbm,omitempty"`
SFPVoltageV *float64 `json:"sfp_voltage_v,omitempty"`
SFPBiasMA *float64 `json:"sfp_bias_ma,omitempty"`
BDF *string `json:"-"`
DeviceClass *string `json:"device_class,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
Model *string `json:"model,omitempty"`
LinkWidth *int `json:"link_width,omitempty"`
LinkSpeed *string `json:"link_speed,omitempty"`
MaxLinkWidth *int `json:"max_link_width,omitempty"`
MaxLinkSpeed *string `json:"max_link_speed,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Firmware *string `json:"firmware,omitempty"`
MacAddresses []string `json:"mac_addresses,omitempty"`
Present *bool `json:"present,omitempty"`
Telemetry map[string]any `json:"-"`
}
type HardwarePowerSupply struct {
Slot *string `json:"slot"`
Present *bool `json:"present"`
Model *string `json:"model"`
Vendor *string `json:"vendor"`
WattageW *int `json:"wattage_w"`
SerialNumber *string `json:"serial_number"`
PartNumber *string `json:"part_number"`
Firmware *string `json:"firmware"`
Status *string `json:"status"`
InputType *string `json:"input_type"`
InputPowerW *float64 `json:"input_power_w"`
OutputPowerW *float64 `json:"output_power_w"`
InputVoltage *float64 `json:"input_voltage"`
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
Present *bool `json:"present,omitempty"`
Model *string `json:"model,omitempty"`
Vendor *string `json:"vendor,omitempty"`
WattageW *int `json:"wattage_w,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
PartNumber *string `json:"part_number,omitempty"`
Firmware *string `json:"firmware,omitempty"`
InputType *string `json:"input_type,omitempty"`
InputPowerW *float64 `json:"input_power_w,omitempty"`
OutputPowerW *float64 `json:"output_power_w,omitempty"`
InputVoltage *float64 `json:"input_voltage,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
}
type HardwareComponentStatus struct {
Status *string `json:"status,omitempty"`
StatusCheckedAt *string `json:"status_checked_at,omitempty"`
StatusChangedAt *string `json:"status_changed_at,omitempty"`
StatusHistory []HardwareStatusHistory `json:"status_history,omitempty"`
ErrorDescription *string `json:"error_description,omitempty"`
ManufacturedYearWeek *string `json:"manufactured_year_week,omitempty"`
}
type HardwareStatusHistory struct {
Status string `json:"status"`
ChangedAt string `json:"changed_at"`
Details *string `json:"details,omitempty"`
}
type HardwareSensors struct {
Fans []HardwareFanSensor `json:"fans,omitempty"`
Power []HardwarePowerSensor `json:"power,omitempty"`
Temperatures []HardwareTemperatureSensor `json:"temperatures,omitempty"`
Other []HardwareOtherSensor `json:"other,omitempty"`
}
type HardwareFanSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
RPM *int `json:"rpm,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwarePowerSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
VoltageV *float64 `json:"voltage_v,omitempty"`
CurrentA *float64 `json:"current_a,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareTemperatureSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
Celsius *float64 `json:"celsius,omitempty"`
ThresholdWarningCelsius *float64 `json:"threshold_warning_celsius,omitempty"`
ThresholdCriticalCelsius *float64 `json:"threshold_critical_celsius,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareOtherSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
Value *float64 `json:"value,omitempty"`
Unit *string `json:"unit,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareEventLog struct {
Source string `json:"source"`
EventTime *string `json:"event_time,omitempty"`
Severity *string `json:"severity,omitempty"`
MessageID *string `json:"message_id,omitempty"`
Message string `json:"message"`
ComponentRef *string `json:"component_ref,omitempty"`
Fingerprint *string `json:"fingerprint,omitempty"`
IsActive *bool `json:"is_active,omitempty"`
RawPayload map[string]any `json:"raw_payload,omitempty"`
}
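
Most fields in this schema change move to `omitempty`. A small sketch of how `encoding/json` treats the two field shapes used here may help when reading the contract: a nil pointer is omitted, a pointer to `false` is kept, and a plain `bool` with `omitempty` is dropped when it is `false`. The `sample` type is hypothetical and only borrows tag names from the diff; whether dropping a false `driver_ready` is intended cannot be told from the diff alone.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sample is a hypothetical struct, not part of the schema package.
type sample struct {
	Present     *bool   `json:"present,omitempty"`      // pointer: nil is omitted, false is kept
	DriverReady bool    `json:"driver_ready,omitempty"` // value: false is omitted
	Model       *string `json:"model,omitempty"`
}

// encode marshals v and returns the JSON text (or the error text).
func encode(v sample) string {
	data, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(data)
}

func main() {
	no := false
	fmt.Println(encode(sample{}))                  // all fields empty
	fmt.Println(encode(sample{Present: &no}))      // explicit "not present"
	fmt.Println(encode(sample{DriverReady: true})) // only true survives for a plain bool
}
```

A consumer of the document therefore cannot distinguish "driver not ready" from "not checked" for plain `bool` fields, while pointer fields keep that distinction.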


@@ -0,0 +1,46 @@
package schema
import (
"encoding/json"
"strings"
"testing"
)
func TestHardwareSnapshotMarshalsNewContractFields(t *testing.T) {
week := "2024-W07"
eventTime := "2026-03-15T14:03:11Z"
message := "Correctable ECC error threshold exceeded"
payload := HardwareIngestRequest{
CollectedAt: "2026-03-15T15:00:00Z",
Hardware: HardwareSnapshot{
Board: HardwareBoard{SerialNumber: "SRV-001"},
CPUs: []HardwareCPU{
{
HardwareComponentStatus: HardwareComponentStatus{
ManufacturedYearWeek: &week,
},
},
},
EventLogs: []HardwareEventLog{
{
Source: "bmc",
EventTime: &eventTime,
Message: message,
},
},
},
}
data, err := json.Marshal(payload)
if err != nil {
t.Fatalf("marshal: %v", err)
}
text := string(data)
if !strings.Contains(text, `"manufactured_year_week":"2024-W07"`) {
t.Fatalf("missing manufactured_year_week: %s", text)
}
if !strings.Contains(text, `"event_logs":[{"source":"bmc","event_time":"2026-03-15T14:03:11Z","message":"Correctable ECC error threshold exceeded"}]`) {
t.Fatalf("missing event_logs payload: %s", text)
}
}


@@ -1,6 +1,11 @@
package tui
import tea "github.com/charmbracelet/bubbletea"
import (
"time"
"bee/audit/internal/platform"
tea "github.com/charmbracelet/bubbletea"
)
func (m model) updateStaticForm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch msg.String() {
@@ -29,6 +34,7 @@ func (m model) updateStaticForm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
m.formFields[3].Value,
})
m.busy = true
m.busyTitle = "Static IPv4: " + m.selectedIface
return m, func() tea.Msg {
result, err := m.app.SetStaticIPv4Result(cfg)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
@@ -59,26 +65,81 @@ func (m model) updateConfirm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
case "esc":
m.screen = m.confirmCancelTarget()
m.cursor = 0
m.pendingAction = actionNone
return m, nil
case "enter":
if m.cursor == 1 {
if m.cursor == 1 { // Cancel
m.screen = m.confirmCancelTarget()
m.cursor = 0
m.pendingAction = actionNone
return m, nil
}
m.busy = true
switch m.pendingAction {
case actionExportAudit:
case actionExportBundle:
m.busyTitle = "Export support bundle"
target := *m.selectedTarget
return m, func() tea.Msg {
result, err := m.app.ExportLatestAuditResult(target)
result, err := m.app.ExportSupportBundleResult(target)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
}
case actionRunNvidiaSAT:
return m, func() tea.Msg {
result, err := m.app.RunNvidiaAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
}
case actionRunAll:
return m.executeRunAll()
case actionRunMemorySAT:
m.busyTitle = "Memory test"
m.progressPrefix = "memory"
m.progressSince = time.Now()
m.progressLines = nil
since := m.progressSince
return m, tea.Batch(
func() tea.Msg {
result, err := m.app.RunMemoryAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenHealthCheck}
},
pollSATProgress("memory", since),
)
case actionRunStorageSAT:
m.busyTitle = "Storage test"
m.progressPrefix = "storage"
m.progressSince = time.Now()
m.progressLines = nil
since := m.progressSince
return m, tea.Batch(
func() tea.Msg {
result, err := m.app.RunStorageAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenHealthCheck}
},
pollSATProgress("storage", since),
)
case actionRunCPUSAT:
m.busyTitle = "CPU test"
m.progressPrefix = "cpu"
m.progressSince = time.Now()
m.progressLines = nil
since := m.progressSince
durationSec := hcCPUDurations[m.hcMode]
return m, tea.Batch(
func() tea.Msg {
result, err := m.app.RunCPUAcceptancePackResult("", durationSec)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenHealthCheck}
},
pollSATProgress("cpu", since),
)
case actionRunAMDGPUSAT:
m.busyTitle = "AMD GPU test"
m.progressPrefix = "gpu-amd"
m.progressSince = time.Now()
m.progressLines = nil
since := m.progressSince
return m, tea.Batch(
func() tea.Msg {
result, err := m.app.RunAMDAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenHealthCheck}
},
pollSATProgress("gpu-amd", since),
)
case actionRunFanStress:
return m.startGPUStressTest()
}
case "ctrl+c":
return m, tea.Quit
@@ -88,11 +149,55 @@ func (m model) updateConfirm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
func (m model) confirmCancelTarget() screen {
switch m.pendingAction {
case actionExportAudit:
case actionExportBundle:
return screenExportTargets
case actionRunNvidiaSAT:
return screenAcceptance
case actionRunAll, actionRunMemorySAT, actionRunStorageSAT, actionRunCPUSAT, actionRunAMDGPUSAT, actionRunFanStress:
return screenHealthCheck
default:
return screenMain
}
}
// hcFanStressOpts builds FanStressOptions for the selected mode, auto-detecting all GPUs.
func hcFanStressOpts(hcMode int, application interface {
ListNvidiaGPUs() ([]platform.NvidiaGPU, error)
}) platform.FanStressOptions {
// Phase durations per mode, in seconds: [baseline, load1, pause, load2]
type durations struct{ baseline, load1, pause, load2 int }
modes := [3]durations{
{30, 120, 30, 120}, // Quick: ~5 min total
{60, 300, 60, 300}, // Standard: ~12 min total
{60, 600, 120, 600}, // Express: ~23 min total
}
if hcMode < 0 || hcMode >= len(modes) {
hcMode = 0
}
d := modes[hcMode]
// Use all detected NVIDIA GPUs; size the allocation to 1/16 of the smallest VRAM
// so that the same buffer fits on every GPU.
var indices []int
sizeMB := 64 // fallback when no GPU reports its memory size
if gpus, err := application.ListNvidiaGPUs(); err == nil {
minMemMB := 0
for _, g := range gpus {
indices = append(indices, g.Index)
if g.MemoryMB > 0 && (minMemMB == 0 || g.MemoryMB < minMemMB) {
minMemMB = g.MemoryMB
}
}
if minMemMB > 0 {
sizeMB = minMemMB / 16
}
}
return platform.FanStressOptions{
BaselineSec: d.baseline,
Phase1DurSec: d.load1,
PauseSec: d.pause,
Phase2DurSec: d.load2,
SizeMB: sizeMB,
GPUIndices: indices,
}
}
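
The arithmetic behind `hcFanStressOpts` can be checked with a short sketch. It sums the four timed phases per mode (the cooldown phase is not part of the options struct and is left out) and computes the buffer size described by the comment "Use minimum GPU memory size to fit all GPUs". Both `phases` and `vramSliceMB` are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// phases holds the four timed phases in seconds, mirroring the table in hcFanStressOpts.
type phases struct{ baseline, load1, pause, load2 int }

// totalSec returns the combined length of the timed phases.
func (p phases) totalSec() int { return p.baseline + p.load1 + p.pause + p.load2 }

// vramSliceMB is a hypothetical helper: 1/16 of the smallest positive VRAM size,
// or fallbackMB when no GPU reports a positive size.
func vramSliceMB(memoryMB []int, fallbackMB int) int {
	minMB := 0
	for _, mb := range memoryMB {
		if mb > 0 && (minMB == 0 || mb < minMB) {
			minMB = mb
		}
	}
	if minMB == 0 {
		return fallbackMB
	}
	return minMB / 16
}

func main() {
	modes := []struct {
		name string
		p    phases
	}{
		{"Quick", phases{30, 120, 30, 120}},
		{"Standard", phases{60, 300, 60, 300}},
		{"Express", phases{60, 600, 120, 600}},
	}
	for _, m := range modes {
		total := m.p.totalSec()
		fmt.Printf("%-8s %4d s (%d min)\n", m.name, total, total/60)
	}
	// Mixed 80 GB and 40 GB cards: the smaller card decides the buffer size.
	fmt.Println("slice MB:", vramSliceMB([]int{81920, 40960}, 64))
}
```

The totals come to 300 s, 720 s and 1380 s, and the mixed-GPU case yields a 2560 MB buffer, driven by the 40 GB card.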


@@ -1,6 +1,9 @@
package tui
import "bee/audit/internal/platform"
import (
"bee/audit/internal/app"
"bee/audit/internal/platform"
)
type resultMsg struct {
title string
@@ -23,3 +26,27 @@ type exportTargetsMsg struct {
targets []platform.RemovableTarget
err error
}
type snapshotMsg struct {
banner string
panel app.HardwarePanelData
}
type nvidiaGPUsMsg struct {
gpus []platform.NvidiaGPU
err error
}
type nvtopClosedMsg struct{}
type nvidiaSATDoneMsg struct {
title string
body string
err error
}
type gpuStressDoneMsg struct {
title string
body string
err error
}


@@ -0,0 +1,131 @@
package tui
import (
"fmt"
"os"
"path/filepath"
"sort"
"strconv"
"strings"
"time"
"bee/audit/internal/app"
tea "github.com/charmbracelet/bubbletea"
)
type satProgressMsg struct {
lines []string
}
// pollSATProgress returns a Cmd that waits 300ms then reads the latest verbose.log
// for the given SAT prefix and returns parsed step progress lines.
func pollSATProgress(prefix string, since time.Time) tea.Cmd {
return tea.Tick(300*time.Millisecond, func(_ time.Time) tea.Msg {
return satProgressMsg{lines: readSATProgressLines(prefix, since)}
})
}
func readSATProgressLines(prefix string, since time.Time) []string {
pattern := filepath.Join(app.DefaultSATBaseDir, prefix+"-*/verbose.log")
matches, err := filepath.Glob(pattern)
if err != nil || len(matches) == 0 {
return nil
}
sort.Strings(matches)
// Pick the last log (in sorted path order) modified after (since - 5s); the margin allows for clock skew.
cutoff := since.Add(-5 * time.Second)
candidate := ""
for _, m := range matches {
info, statErr := os.Stat(m)
if statErr == nil && info.ModTime().After(cutoff) {
candidate = m
}
}
if candidate == "" {
return nil
}
raw, err := os.ReadFile(candidate)
if err != nil {
return nil
}
return parseSATVerboseProgress(string(raw))
}
// parseSATVerboseProgress parses verbose.log content and returns display lines like:
//
// "PASS lscpu (234ms)"
// "FAIL stress-ng (60.0s)"
// "... sensors-after"
func parseSATVerboseProgress(content string) []string {
type step struct {
name string
rc int
durationMs int
done bool
}
lines := strings.Split(content, "\n")
var steps []step
stepIdx := map[string]int{}
for i, line := range lines {
line = strings.TrimSpace(line)
if idx := strings.Index(line, "] start "); idx >= 0 {
name := strings.TrimSpace(line[idx+len("] start "):])
if _, exists := stepIdx[name]; !exists {
stepIdx[name] = len(steps)
steps = append(steps, step{name: name})
}
} else if idx := strings.Index(line, "] finish "); idx >= 0 {
name := strings.TrimSpace(line[idx+len("] finish "):])
si, exists := stepIdx[name]
if !exists {
continue
}
steps[si].done = true
for j := i + 1; j < len(lines) && j <= i+3; j++ {
l := strings.TrimSpace(lines[j])
if strings.HasPrefix(l, "rc: ") {
steps[si].rc, _ = strconv.Atoi(strings.TrimPrefix(l, "rc: "))
} else if strings.HasPrefix(l, "duration_ms: ") {
steps[si].durationMs, _ = strconv.Atoi(strings.TrimPrefix(l, "duration_ms: "))
}
}
}
}
var result []string
for _, s := range steps {
display := cleanSATStepName(s.name)
if s.done {
status := "PASS"
if s.rc != 0 {
status = "FAIL"
}
result = append(result, fmt.Sprintf("%-4s %s (%s)", status, display, fmtDurMs(s.durationMs)))
} else {
result = append(result, fmt.Sprintf("... %s", display))
}
}
return result
}
// cleanSATStepName strips leading digits and dash: "01-lscpu.log" → "lscpu".
func cleanSATStepName(name string) string {
name = strings.TrimSuffix(name, ".log")
i := 0
for i < len(name) && name[i] >= '0' && name[i] <= '9' {
i++
}
if i < len(name) && name[i] == '-' {
name = name[i+1:]
}
return name
}
func fmtDurMs(ms int) string {
if ms < 1000 {
return fmt.Sprintf("%dms", ms)
}
return fmt.Sprintf("%.1fs", float64(ms)/1000)
}
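
The log layout that `parseSATVerboseProgress` expects can be inferred from its string matching. The sketch below feeds a made-up `verbose.log` fragment to a simplified, hypothetical reader (`finishedSteps`) that reports only completed steps. The timestamp prefix in the sample is an assumption; only the `] start `, `] finish `, `rc: ` and `duration_ms: ` markers come from the parser in the diff.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleLog is an invented fragment; the bracketed timestamp format is assumed.
const sampleLog = `[2026-03-26T10:00:00Z] start 01-lscpu.log
[2026-03-26T10:00:00Z] finish 01-lscpu.log
rc: 0
duration_ms: 234
[2026-03-26T10:00:01Z] start 02-stress-ng.log
[2026-03-26T10:01:01Z] finish 02-stress-ng.log
rc: 1
duration_ms: 60000
[2026-03-26T10:01:01Z] start 03-sensors-after.log
`

// finishedSteps is a simplified variant of the parser: it lists finished steps
// as "name rc=N duration_ms=N" and ignores steps that are still running.
func finishedSteps(content string) []string {
	lines := strings.Split(content, "\n")
	var out []string
	for i, line := range lines {
		idx := strings.Index(line, "] finish ")
		if idx < 0 {
			continue
		}
		name := strings.TrimSpace(line[idx+len("] finish "):])
		rc, ms := 0, 0
		// As in the original, metadata is searched in up to three following lines.
		for j := i + 1; j < len(lines) && j <= i+3; j++ {
			l := strings.TrimSpace(lines[j])
			if strings.HasPrefix(l, "rc: ") {
				rc, _ = strconv.Atoi(strings.TrimPrefix(l, "rc: "))
			} else if strings.HasPrefix(l, "duration_ms: ") {
				ms, _ = strconv.Atoi(strings.TrimPrefix(l, "duration_ms: "))
			}
		}
		out = append(out, fmt.Sprintf("%s rc=%d duration_ms=%d", name, rc, ms))
	}
	return out
}

func main() {
	for _, s := range finishedSteps(sampleLog) {
		fmt.Println(s)
	}
}
```

In the full parser the third step, which has a start line but no finish line, would be rendered as a pending `...` entry.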


@@ -1,14 +0,0 @@
package tui
import tea "github.com/charmbracelet/bubbletea"
func (m model) handleAcceptanceMenu() (tea.Model, tea.Cmd) {
if m.cursor == 1 {
m.screen = screenMain
m.cursor = 0
return m, nil
}
m.pendingAction = actionRunNvidiaSAT
m.screen = screenConfirm
return m, nil
}


@@ -4,11 +4,11 @@ import tea "github.com/charmbracelet/bubbletea"
func (m model) handleExportTargetsMenu() (tea.Model, tea.Cmd) {
if len(m.targets) == 0 {
return m, resultCmd("Export audit", "No removable filesystems found", nil, screenMain)
return m, resultCmd("Export support bundle", "No removable filesystems found", nil, screenMain)
}
target := m.targets[m.cursor]
m.selectedTarget = &target
m.pendingAction = actionExportAudit
m.pendingAction = actionExportBundle
m.screen = screenConfirm
return m, nil
}


@@ -0,0 +1,386 @@
package tui
import (
"context"
"fmt"
"os/exec"
"strings"
tea "github.com/charmbracelet/bubbletea"
)
// Component indices.
const (
hcGPU = 0
hcMemory = 1
hcStorage = 2
hcCPU = 3
)
// Cursor positions in Health Check screen.
const (
hcCurGPU = 0
hcCurMemory = 1
hcCurStorage = 2
hcCurCPU = 3
hcCurSelectAll = 4
hcCurModeQuick = 5
hcCurModeStd = 6
hcCurModeExpr = 7
hcCurRunAll = 8
hcCurFanStress = 9
hcCurTotal = 10
)
// hcModeDurations maps mode index (0=Quick,1=Standard,2=Express) to GPU stress seconds.
var hcModeDurations = [3]int{600, 3600, 28800}
// hcCPUDurations maps mode index to CPU stress-ng seconds.
var hcCPUDurations = [3]int{60, 300, 900}
func (m model) enterHealthCheck() (tea.Model, tea.Cmd) {
m.screen = screenHealthCheck
if !m.hcInitialized {
m.hcSel = [4]bool{true, true, true, true}
m.hcMode = 0
m.hcCursor = 0
m.hcInitialized = true
}
return m, nil
}
func (m model) updateHealthCheck(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch msg.String() {
case "up", "k":
if m.hcCursor > 0 {
m.hcCursor--
}
case "down", "j":
if m.hcCursor < hcCurTotal-1 {
m.hcCursor++
}
case " ":
switch m.hcCursor {
case hcCurGPU, hcCurMemory, hcCurStorage, hcCurCPU:
m.hcSel[m.hcCursor] = !m.hcSel[m.hcCursor]
case hcCurSelectAll:
allOn := m.hcSel[0] && m.hcSel[1] && m.hcSel[2] && m.hcSel[3]
for i := range m.hcSel {
m.hcSel[i] = !allOn
}
case hcCurModeQuick, hcCurModeStd, hcCurModeExpr:
m.hcMode = m.hcCursor - hcCurModeQuick
}
case "enter":
switch m.hcCursor {
case hcCurGPU, hcCurMemory, hcCurStorage, hcCurCPU:
return m.hcRunSingle(m.hcCursor)
case hcCurSelectAll:
allOn := m.hcSel[0] && m.hcSel[1] && m.hcSel[2] && m.hcSel[3]
for i := range m.hcSel {
m.hcSel[i] = !allOn
}
case hcCurModeQuick, hcCurModeStd, hcCurModeExpr:
m.hcMode = m.hcCursor - hcCurModeQuick
case hcCurRunAll:
return m.hcRunAll()
case hcCurFanStress:
return m.hcRunFanStress()
}
case "g", "G":
return m.hcRunSingle(hcGPU)
case "m", "M":
return m.hcRunSingle(hcMemory)
case "s", "S":
return m.hcRunSingle(hcStorage)
case "c", "C":
return m.hcRunSingle(hcCPU)
case "r", "R":
return m.hcRunAll()
case "f", "F":
return m.hcRunFanStress()
case "a", "A":
allOn := m.hcSel[0] && m.hcSel[1] && m.hcSel[2] && m.hcSel[3]
for i := range m.hcSel {
m.hcSel[i] = !allOn
}
case "1":
m.hcMode = 0
case "2":
m.hcMode = 1
case "3":
m.hcMode = 2
case "esc":
m.screen = screenMain
m.cursor = 0
case "q", "ctrl+c":
return m, tea.Quit
}
return m, nil
}
func (m model) hcRunSingle(idx int) (tea.Model, tea.Cmd) {
switch idx {
case hcGPU:
if m.app.DetectGPUVendor() == "amd" {
m.pendingAction = actionRunAMDGPUSAT
m.screen = screenConfirm
m.cursor = 0
return m, nil
}
m.nvidiaDurIdx = m.hcMode
return m.enterNvidiaSATSetup()
case hcMemory:
m.pendingAction = actionRunMemorySAT
m.screen = screenConfirm
m.cursor = 0
return m, nil
case hcStorage:
m.pendingAction = actionRunStorageSAT
m.screen = screenConfirm
m.cursor = 0
return m, nil
case hcCPU:
m.pendingAction = actionRunCPUSAT
m.screen = screenConfirm
m.cursor = 0
return m, nil
}
return m, nil
}
func (m model) hcRunFanStress() (tea.Model, tea.Cmd) {
m.pendingAction = actionRunFanStress
m.screen = screenConfirm
m.cursor = 0
return m, nil
}
// startGPUStressTest launches the GPU Platform Stress Test and nvtop concurrently.
// nvtop occupies the full terminal as a live chart; the stress test runs in the background.
func (m model) startGPUStressTest() (tea.Model, tea.Cmd) {
opts := hcFanStressOpts(m.hcMode, m.app)
ctx, cancel := context.WithCancel(context.Background())
m.gpuStressCancel = cancel
m.gpuStressAborted = false
m.screen = screenGPUStressRunning
m.nvidiaSATCursor = 0
stressCmd := func() tea.Msg {
result, err := m.app.RunFanStressTestResult(ctx, opts)
return gpuStressDoneMsg{title: result.Title, body: result.Body, err: err}
}
nvtopPath, lookErr := exec.LookPath("nvtop")
if lookErr != nil {
return m, stressCmd
}
return m, tea.Batch(
stressCmd,
tea.ExecProcess(exec.Command(nvtopPath), func(_ error) tea.Msg {
return nvtopClosedMsg{}
}),
)
}
// updateGPUStressRunning handles keys on the GPU stress running screen.
func (m model) updateGPUStressRunning(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch msg.String() {
case "o", "O":
nvtopPath, err := exec.LookPath("nvtop")
if err != nil {
return m, nil
}
return m, tea.ExecProcess(exec.Command(nvtopPath), func(_ error) tea.Msg {
return nvtopClosedMsg{}
})
case "a", "A":
if m.gpuStressCancel != nil {
m.gpuStressCancel()
m.gpuStressCancel = nil
}
m.gpuStressAborted = true
m.screen = screenHealthCheck
m.cursor = 0
case "ctrl+c":
return m, tea.Quit
}
return m, nil
}
func renderGPUStressRunning() string {
return "GPU PLATFORM STRESS TEST\n\nTest is running...\n\n[o] Open nvtop [a] Abort test [ctrl+c] quit\n"
}
func (m model) hcRunAll() (tea.Model, tea.Cmd) {
for _, sel := range m.hcSel {
if sel {
m.pendingAction = actionRunAll
m.screen = screenConfirm
m.cursor = 0
return m, nil
}
}
return m, nil
}
func (m model) executeRunAll() (tea.Model, tea.Cmd) {
durationSec := hcModeDurations[m.hcMode]
durationIdx := m.hcMode
sel := m.hcSel
app := m.app
m.busy = true
m.busyTitle = "Health Check"
return m, func() tea.Msg {
var parts []string
if sel[hcGPU] {
vendor := app.DetectGPUVendor()
if vendor == "amd" {
r, err := app.RunAMDAcceptancePackResult("")
body := r.Body
if err != nil {
body += "\nERROR: " + err.Error()
}
parts = append(parts, "=== GPU (AMD) ===\n"+body)
} else {
gpus, err := app.ListNvidiaGPUs()
if err != nil || len(gpus) == 0 {
parts = append(parts, "=== GPU ===\nNo NVIDIA GPUs detected or driver not loaded.")
} else {
var indices []int
sizeMB := 0
for _, g := range gpus {
indices = append(indices, g.Index)
if sizeMB == 0 || g.MemoryMB < sizeMB {
sizeMB = g.MemoryMB
}
}
if sizeMB == 0 {
sizeMB = 64
}
r, err := app.RunNvidiaAcceptancePackWithOptions(context.Background(), "", durationSec, sizeMB, indices)
body := r.Body
if err != nil {
body += "\nERROR: " + err.Error()
}
parts = append(parts, "=== GPU ===\n"+body)
}
}
}
if sel[hcMemory] {
r, err := app.RunMemoryAcceptancePackResult("")
body := r.Body
if err != nil {
body += "\nERROR: " + err.Error()
}
parts = append(parts, "=== MEMORY ===\n"+body)
}
if sel[hcStorage] {
r, err := app.RunStorageAcceptancePackResult("")
body := r.Body
if err != nil {
body += "\nERROR: " + err.Error()
}
parts = append(parts, "=== STORAGE ===\n"+body)
}
if sel[hcCPU] {
cpuDur := hcCPUDurations[durationIdx]
r, err := app.RunCPUAcceptancePackResult("", cpuDur)
body := r.Body
if err != nil {
body += "\nERROR: " + err.Error()
}
parts = append(parts, "=== CPU ===\n"+body)
}
combined := strings.Join(parts, "\n\n")
if combined == "" {
combined = "No components selected."
}
return resultMsg{title: "Health Check", body: combined, back: screenHealthCheck}
}
}
func renderHealthCheck(m model) string {
var b strings.Builder
fmt.Fprintln(&b, "HEALTH CHECK")
fmt.Fprintln(&b)
fmt.Fprintln(&b, " Diagnostics:")
fmt.Fprintln(&b)
type comp struct{ name, desc, key string }
comps := []comp{
{"GPU", "nvidia/amd auto-detect", "G"},
{"MEMORY", "memtester", "M"},
{"STORAGE", "smartctl + NVMe self-test", "S"},
{"CPU", "audit diagnostics", "C"},
}
for i, c := range comps {
pfx := " "
if m.hcCursor == i {
pfx = "> "
}
ch := "[ ]"
if m.hcSel[i] {
ch = "[x]"
}
fmt.Fprintf(&b, "%s%s %-8s %-28s [%s]\n", pfx, ch, c.name, c.desc, c.key)
}
fmt.Fprintln(&b, " ─────────────────────────────────────────────────")
{
pfx := " "
if m.hcCursor == hcCurSelectAll {
pfx = "> "
}
allOn := m.hcSel[0] && m.hcSel[1] && m.hcSel[2] && m.hcSel[3]
ch := "[ ]"
if allOn {
ch = "[x]"
}
fmt.Fprintf(&b, "%s%s Select / Deselect All [A]\n", pfx, ch)
}
fmt.Fprintln(&b)
fmt.Fprintln(&b, " Mode:")
modes := []struct{ label, key string }{
{"Quick", "1"},
{"Standard", "2"},
{"Express", "3"},
}
for i, mode := range modes {
pfx := " "
if m.hcCursor == hcCurModeQuick+i {
pfx = "> "
}
radio := "( )"
if m.hcMode == i {
radio = "(*)"
}
fmt.Fprintf(&b, "%s%s %-10s [%s]\n", pfx, radio, mode.label, mode.key)
}
fmt.Fprintln(&b)
{
pfx := " "
if m.hcCursor == hcCurRunAll {
pfx = "> "
}
fmt.Fprintf(&b, "%s[ RUN ALL [R] ]\n", pfx)
}
{
pfx := " "
if m.hcCursor == hcCurFanStress {
pfx = "> "
}
fmt.Fprintf(&b, "%s[ GPU PLATFORM STRESS TEST [F] ] (thermal cycling, fan lag, throttle check)\n", pfx)
}
fmt.Fprintln(&b)
fmt.Fprintln(&b, "─────────────────────────────────────────────────────────────────")
fmt.Fprint(&b, "[↑↓] move [space/enter] toggle [letter] single test [R] run all [F] gpu stress [Esc] back")
return b.String()
}


@@ -6,45 +6,21 @@ import (
func (m model) handleMainMenu() (tea.Model, tea.Cmd) {
switch m.cursor {
case 0:
m.screen = screenNetwork
m.cursor = 0
return m, nil
case 1:
m.busy = true
return m, func() tea.Msg {
services, err := m.app.ListBeeServices()
return servicesMsg{services: services, err: err}
}
case 2:
m.screen = screenAcceptance
m.cursor = 0
return m, nil
case 3:
m.busy = true
return m, func() tea.Msg {
result, err := m.app.RunAuditNow(m.runtimeMode)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
}
case 4:
case 0: // Health Check
return m.enterHealthCheck()
case 1: // Export support bundle
m.pendingAction = actionExportBundle
m.busy = true
m.busyTitle = "Export support bundle"
return m, func() tea.Msg {
targets, err := m.app.ListRemovableTargets()
return exportTargetsMsg{targets: targets, err: err}
}
case 5:
m.busy = true
return m, func() tea.Msg {
result := m.app.ToolCheckResult([]string{"dmidecode", "smartctl", "nvme", "ipmitool", "lspci", "bee", "nvidia-smi", "dhclient", "lsblk", "mount"})
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
}
case 6:
m.busy = true
return m, func() tea.Msg {
result := m.app.AuditLogTailResult()
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
}
case 7:
case 2: // Settings
m.screen = screenSettings
m.cursor = 0
return m, nil
case 3: // Exit
return m, tea.Quit
}
return m, nil


@@ -10,12 +10,14 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
switch m.cursor {
case 0:
m.busy = true
m.busyTitle = "Network status"
return m, func() tea.Msg {
result, err := m.app.NetworkStatus()
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
}
case 1:
m.busy = true
m.busyTitle = "DHCP all interfaces"
return m, func() tea.Msg {
result, err := m.app.DHCPAllResult()
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
@@ -23,6 +25,7 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
case 2:
m.pendingAction = actionDHCPOne
m.busy = true
m.busyTitle = "Interfaces"
return m, func() tea.Msg {
ifaces, err := m.app.ListInterfaces()
return interfacesMsg{ifaces: ifaces, err: err}
@@ -30,12 +33,13 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
case 3:
m.pendingAction = actionStaticIPv4
m.busy = true
m.busyTitle = "Interfaces"
return m, func() tea.Msg {
ifaces, err := m.app.ListInterfaces()
return interfacesMsg{ifaces: ifaces, err: err}
}
case 4:
m.screen = screenMain
m.screen = screenSettings
m.cursor = 0
return m, nil
}
@@ -50,6 +54,7 @@ func (m model) handleInterfacePickMenu() (tea.Model, tea.Cmd) {
switch m.pendingAction {
case actionDHCPOne:
m.busy = true
m.busyTitle = "DHCP on " + m.selectedIface
return m, func() tea.Msg {
result, err := m.app.DHCPOneResult(m.selectedIface)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}


@@ -0,0 +1,238 @@
package tui
import (
"context"
"fmt"
"os/exec"
"strings"
"bee/audit/internal/platform"
tea "github.com/charmbracelet/bubbletea"
)
var nvidiaDurationOptions = []struct {
label string
seconds int
}{
{"10 minutes", 600},
{"1 hour", 3600},
{"8 hours", 28800},
{"24 hours", 86400},
}
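// enterNvidiaSATSetup resets the setup screen and starts loading the GPU list.
// It also resets nvidiaDurIdx to 0, so a value assigned by the caller beforehand is discarded.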
func (m model) enterNvidiaSATSetup() (tea.Model, tea.Cmd) {
m.screen = screenNvidiaSATSetup
m.nvidiaGPUs = nil
m.nvidiaGPUSel = nil
m.nvidiaDurIdx = 0
m.nvidiaSATCursor = 0
m.busy = true
m.busyTitle = "NVIDIA SAT"
return m, func() tea.Msg {
gpus, err := m.app.ListNvidiaGPUs()
return nvidiaGPUsMsg{gpus: gpus, err: err}
}
}
// handleNvidiaGPUsMsg processes the GPU list response.
func (m model) handleNvidiaGPUsMsg(msg nvidiaGPUsMsg) (tea.Model, tea.Cmd) {
m.busy = false
m.busyTitle = ""
if msg.err != nil {
m.title = "NVIDIA SAT"
m.body = fmt.Sprintf("Failed to list GPUs: %v", msg.err)
m.prevScreen = screenHealthCheck
m.screen = screenOutput
return m, nil
}
m.nvidiaGPUs = msg.gpus
m.nvidiaGPUSel = make([]bool, len(msg.gpus))
for i := range m.nvidiaGPUSel {
m.nvidiaGPUSel[i] = true // all selected by default
}
m.nvidiaSATCursor = 0
return m, nil
}
// updateNvidiaSATSetup handles keys on the setup screen.
func (m model) updateNvidiaSATSetup(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
numDur := len(nvidiaDurationOptions)
numGPU := len(m.nvidiaGPUs)
totalItems := numDur + numGPU + 2 // +2: Start, Cancel
switch msg.String() {
case "up", "k":
if m.nvidiaSATCursor > 0 {
m.nvidiaSATCursor--
}
case "down", "j":
if m.nvidiaSATCursor < totalItems-1 {
m.nvidiaSATCursor++
}
case " ":
switch {
case m.nvidiaSATCursor < numDur:
m.nvidiaDurIdx = m.nvidiaSATCursor
case m.nvidiaSATCursor < numDur+numGPU:
i := m.nvidiaSATCursor - numDur
m.nvidiaGPUSel[i] = !m.nvidiaGPUSel[i]
}
case "enter":
startIdx := numDur + numGPU
cancelIdx := startIdx + 1
switch {
case m.nvidiaSATCursor < numDur:
m.nvidiaDurIdx = m.nvidiaSATCursor
case m.nvidiaSATCursor < startIdx:
i := m.nvidiaSATCursor - numDur
m.nvidiaGPUSel[i] = !m.nvidiaGPUSel[i]
case m.nvidiaSATCursor == startIdx:
return m.startNvidiaSAT()
case m.nvidiaSATCursor == cancelIdx:
m.screen = screenHealthCheck
m.cursor = 0
}
case "esc":
m.screen = screenHealthCheck
m.cursor = 0
case "ctrl+c", "q":
return m, tea.Quit
}
return m, nil
}
// startNvidiaSAT launches the SAT and nvtop.
func (m model) startNvidiaSAT() (tea.Model, tea.Cmd) {
var selectedGPUs []platform.NvidiaGPU
for i, sel := range m.nvidiaGPUSel {
if sel {
selectedGPUs = append(selectedGPUs, m.nvidiaGPUs[i])
}
}
if len(selectedGPUs) == 0 {
selectedGPUs = m.nvidiaGPUs // fallback: use all if none explicitly selected
}
sizeMB := 0
for _, g := range selectedGPUs {
if sizeMB == 0 || g.MemoryMB < sizeMB {
sizeMB = g.MemoryMB
}
}
if sizeMB == 0 {
sizeMB = 64
}
var gpuIndices []int
for _, g := range selectedGPUs {
gpuIndices = append(gpuIndices, g.Index)
}
durationSec := nvidiaDurationOptions[m.nvidiaDurIdx].seconds
ctx, cancel := context.WithCancel(context.Background())
m.nvidiaSATCancel = cancel
m.nvidiaSATAborted = false
m.screen = screenNvidiaSATRunning
m.nvidiaSATCursor = 0
satCmd := func() tea.Msg {
result, err := m.app.RunNvidiaAcceptancePackWithOptions(ctx, "", durationSec, sizeMB, gpuIndices)
return nvidiaSATDoneMsg{title: result.Title, body: result.Body, err: err}
}
nvtopPath, lookErr := exec.LookPath("nvtop")
if lookErr != nil {
// nvtop not available: just run the SAT, show running screen
return m, satCmd
}
return m, tea.Batch(
satCmd,
tea.ExecProcess(exec.Command(nvtopPath), func(_ error) tea.Msg {
return nvtopClosedMsg{}
}),
)
}
// updateNvidiaSATRunning handles keys on the running screen.
func (m model) updateNvidiaSATRunning(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch msg.String() {
case "o", "O":
nvtopPath, err := exec.LookPath("nvtop")
if err != nil {
return m, nil
}
return m, tea.ExecProcess(exec.Command(nvtopPath), func(_ error) tea.Msg {
return nvtopClosedMsg{}
})
case "a", "A":
if m.nvidiaSATCancel != nil {
m.nvidiaSATCancel()
m.nvidiaSATCancel = nil
}
m.nvidiaSATAborted = true
m.screen = screenHealthCheck
m.cursor = 0
case "ctrl+c":
return m, tea.Quit
}
return m, nil
}
// renderNvidiaSATSetup renders the setup screen.
func renderNvidiaSATSetup(m model) string {
var b strings.Builder
fmt.Fprintln(&b, "NVIDIA SAT")
fmt.Fprintln(&b)
fmt.Fprintln(&b, "Duration:")
for i, opt := range nvidiaDurationOptions {
radio := "( )"
if i == m.nvidiaDurIdx {
radio = "(*)"
}
prefix := " "
if m.nvidiaSATCursor == i {
prefix = "> "
}
fmt.Fprintf(&b, "%s%s %s\n", prefix, radio, opt.label)
}
fmt.Fprintln(&b)
if len(m.nvidiaGPUs) == 0 {
fmt.Fprintln(&b, "GPUs: (none detected)")
} else {
fmt.Fprintln(&b, "GPUs:")
for i, gpu := range m.nvidiaGPUs {
check := "[ ]"
if m.nvidiaGPUSel[i] {
check = "[x]"
}
prefix := " "
if m.nvidiaSATCursor == len(nvidiaDurationOptions)+i {
prefix = "> "
}
fmt.Fprintf(&b, "%s%s %d: %s (%d MB)\n", prefix, check, gpu.Index, gpu.Name, gpu.MemoryMB)
}
}
fmt.Fprintln(&b)
startIdx := len(nvidiaDurationOptions) + len(m.nvidiaGPUs)
startPfx := " "
cancelPfx := " "
if m.nvidiaSATCursor == startIdx {
startPfx = "> "
}
if m.nvidiaSATCursor == startIdx+1 {
cancelPfx = "> "
}
fmt.Fprintf(&b, "%sStart\n", startPfx)
fmt.Fprintf(&b, "%sCancel\n", cancelPfx)
fmt.Fprintln(&b)
b.WriteString("[↑/↓] move [space] toggle [enter] select [esc] cancel\n")
return b.String()
}
// renderNvidiaSATRunning renders the running screen.
func renderNvidiaSATRunning() string {
return "NVIDIA SAT\n\nTest is running...\n\n[o] Open nvtop [a] Abort test [ctrl+c] quit\n"
}
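
Two small pieces of logic in this file are easy to check on their own: the test size is the smallest MemoryMB among the selected GPUs (with 64 as a fallback), and the setup screen uses one flat cursor over duration rows, GPU rows, Start and Cancel. The following is a minimal sketch of both. The type gpu and the functions smallestMemoryMB and cursorRegion are hypothetical names introduced for illustration; they are not part of the package.

```go
package main

import "fmt"

// gpu is a hypothetical stand-in for platform.NvidiaGPU; only the fields
// used by the setup screen are modelled.
type gpu struct {
	Index    int
	MemoryMB int
}

// smallestMemoryMB mirrors the loop in startNvidiaSAT: it tracks the
// smallest MemoryMB seen and returns fallback if the result is 0
// (for example when the list is empty).
func smallestMemoryMB(gpus []gpu, fallback int) int {
	size := 0
	for _, g := range gpus {
		if size == 0 || g.MemoryMB < size {
			size = g.MemoryMB
		}
	}
	if size == 0 {
		return fallback
	}
	return size
}

// cursorRegion classifies a flat cursor index for a screen laid out as
// numDur duration rows, then numGPU GPU rows, then Start, then Cancel.
func cursorRegion(cursor, numDur, numGPU int) string {
	switch {
	case cursor < numDur:
		return "duration"
	case cursor < numDur+numGPU:
		return "gpu"
	case cursor == numDur+numGPU:
		return "start"
	case cursor == numDur+numGPU+1:
		return "cancel"
	}
	return "none"
}

func main() {
	gpus := []gpu{{Index: 0, MemoryMB: 81920}, {Index: 1, MemoryMB: 40960}}
	fmt.Println(smallestMemoryMB(gpus, 64))
	fmt.Println(cursorRegion(5, 4, len(gpus)))
}
```

Keeping the region arithmetic in one helper would let the key handler and the renderer share the same layout assumptions, which the diff currently repeats in both places.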


@@ -8,7 +8,7 @@ import (
func (m model) handleServicesMenu() (tea.Model, tea.Cmd) {
if len(m.services) == 0 {
return m, resultCmd("bee services", "No bee-* services found", nil, screenMain)
return m, resultCmd("Services", "No bee-* services found.", nil, screenSettings)
}
m.selectedService = m.services[m.cursor]
m.screen = screenServiceAction
@@ -25,22 +25,23 @@ func (m model) handleServiceActionMenu() (tea.Model, tea.Cmd) {
}
m.busy = true
m.busyTitle = "service: " + m.selectedService
return m, func() tea.Msg {
switch action {
case "status":
case "Status":
result, err := m.app.ServiceStatusResult(m.selectedService)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "restart":
case "Restart":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceRestart)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "start":
case "Start":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStart)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "stop":
case "Stop":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStop)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
default:
return resultMsg{title: "service", body: "unknown action", back: screenServiceAction}
return resultMsg{title: "Service", body: "Unknown action.", back: screenServiceAction}
}
}
}


@@ -0,0 +1,64 @@
package tui
import tea "github.com/charmbracelet/bubbletea"
func (m model) handleSettingsMenu() (tea.Model, tea.Cmd) {
switch m.cursor {
case 0: // Network
m.screen = screenNetwork
m.cursor = 0
return m, nil
case 1: // Services
m.busy = true
m.busyTitle = "Services"
return m, func() tea.Msg {
services, err := m.app.ListBeeServices()
return servicesMsg{services: services, err: err}
}
case 2: // Re-run audit
m.busy = true
m.busyTitle = "Re-run audit"
runtimeMode := m.runtimeMode
return m, func() tea.Msg {
result, err := m.app.RunAuditNow(runtimeMode)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenSettings}
}
case 3: // Run self-check
m.busy = true
m.busyTitle = "Self-check"
return m, func() tea.Msg {
result, err := m.app.RunRuntimePreflightResult()
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenSettings}
}
case 4: // Runtime issues
m.busy = true
m.busyTitle = "Runtime issues"
return m, func() tea.Msg {
result := m.app.RuntimeHealthResult()
return resultMsg{title: result.Title, body: result.Body, back: screenSettings}
}
case 5: // Audit logs
m.busy = true
m.busyTitle = "Audit logs"
return m, func() tea.Msg {
result := m.app.AuditLogTailResult()
return resultMsg{title: result.Title, body: result.Body, back: screenSettings}
}
case 6: // Check tools
m.busy = true
m.busyTitle = "Check tools"
return m, func() tea.Msg {
result := m.app.ToolCheckResult([]string{
"dmidecode", "smartctl", "nvme", "ipmitool", "lspci",
"ethtool", "bee", "nvidia-smi", "bee-gpu-stress",
"memtester", "dhclient", "lsblk", "mount",
})
return resultMsg{title: result.Title, body: result.Body, back: screenSettings}
}
case 7: // Back
m.screen = screenMain
m.cursor = 0
return m, nil
}
return m, nil
}


@@ -0,0 +1,30 @@
package tui
import (
"bee/audit/internal/app"
tea "github.com/charmbracelet/bubbletea"
)
func (m model) refreshSnapshotCmd() tea.Cmd {
if m.app == nil {
return nil
}
return func() tea.Msg {
return snapshotMsg{
banner: m.app.MainBanner(),
panel: m.app.LoadHardwarePanel(),
}
}
}
func shouldRefreshSnapshot(prev, next model) bool {
return prev.screen != next.screen || prev.busy != next.busy
}
func emptySnapshot() snapshotMsg {
return snapshotMsg{
banner: "",
panel: app.HardwarePanelData{},
}
}


@@ -1,6 +1,7 @@
package tui
import (
"strings"
"testing"
"bee/audit/internal/app"
@@ -52,11 +53,10 @@ func TestUpdateMainMenuEnterActions(t *testing.T) {
wantBusy bool
wantCmd bool
}{
{name: "network", cursor: 0, wantScreen: screenNetwork},
{name: "services", cursor: 1, wantScreen: screenMain, wantBusy: true, wantCmd: true},
{name: "acceptance", cursor: 2, wantScreen: screenAcceptance},
{name: "run audit", cursor: 3, wantScreen: screenMain, wantBusy: true, wantCmd: true},
{name: "export", cursor: 4, wantScreen: screenMain, wantBusy: true, wantCmd: true},
{name: "health_check", cursor: 0, wantScreen: screenHealthCheck, wantCmd: true},
{name: "export", cursor: 1, wantScreen: screenMain, wantBusy: true, wantCmd: true},
{name: "settings", cursor: 2, wantScreen: screenSettings, wantCmd: true},
{name: "exit", cursor: 3, wantScreen: screenMain, wantCmd: true},
}
for _, test := range tests {
@@ -88,7 +88,7 @@ func TestUpdateConfirmCancelViaKeys(t *testing.T) {
m := newTestModel()
m.screen = screenConfirm
m.pendingAction = actionRunNvidiaSAT
m.pendingAction = actionRunMemorySAT
next, _ := m.Update(tea.KeyMsg{Type: tea.KeyRight})
got := next.(model)
@@ -98,8 +98,8 @@ func TestUpdateConfirmCancelViaKeys(t *testing.T) {
next, _ = got.Update(tea.KeyMsg{Type: tea.KeyEnter})
got = next.(model)
if got.screen != screenAcceptance {
t.Fatalf("screen=%q want %q", got.screen, screenAcceptance)
if got.screen != screenHealthCheck {
t.Fatalf("screen=%q want %q", got.screen, screenHealthCheck)
}
if got.cursor != 0 {
t.Fatalf("cursor=%d want 0 after cancel", got.cursor)
@@ -114,8 +114,8 @@ func TestMainMenuSimpleTransitions(t *testing.T) {
cursor int
wantScreen screen
}{
{name: "network", cursor: 0, wantScreen: screenNetwork},
{name: "acceptance", cursor: 2, wantScreen: screenAcceptance},
{name: "health_check", cursor: 0, wantScreen: screenHealthCheck},
{name: "settings", cursor: 2, wantScreen: screenSettings},
}
for _, test := range tests {
@@ -142,38 +142,42 @@ func TestMainMenuSimpleTransitions(t *testing.T) {
}
}
func TestMainMenuAsyncActionsSetBusy(t *testing.T) {
func TestMainMenuExportSetsBusy(t *testing.T) {
t.Parallel()
tests := []struct {
name string
cursor int
}{
{name: "services", cursor: 1},
{name: "run audit", cursor: 3},
{name: "export", cursor: 4},
{name: "check tools", cursor: 5},
{name: "log tail", cursor: 6},
m := newTestModel()
m.cursor = 1 // Export support bundle
next, cmd := m.handleMainMenu()
got := next.(model)
if !got.busy {
t.Fatal("busy=false for export")
}
if cmd == nil {
t.Fatal("expected async cmd for export")
}
}
for _, test := range tests {
test := test
t.Run(test.name, func(t *testing.T) {
t.Parallel()
func TestMainViewRendersTwoColumns(t *testing.T) {
t.Parallel()
m := newTestModel()
m.cursor = test.cursor
m := newTestModel()
m.cursor = 1
next, cmd := m.handleMainMenu()
got := next.(model)
if !got.busy {
t.Fatalf("busy=false for %s", test.name)
}
if cmd == nil {
t.Fatalf("expected async cmd for %s", test.name)
}
})
view := m.View()
for _, want := range []string{
"bee",
"Health Check",
"> Export support bundle",
"Settings",
"Exit",
"│",
"[↑↓] move",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
@@ -185,9 +189,9 @@ func TestEscapeNavigation(t *testing.T) {
screen screen
wantScreen screen
}{
{name: "network to main", screen: screenNetwork, wantScreen: screenMain},
{name: "services to main", screen: screenServices, wantScreen: screenMain},
{name: "acceptance to main", screen: screenAcceptance, wantScreen: screenMain},
{name: "network to settings", screen: screenNetwork, wantScreen: screenSettings},
{name: "services to settings", screen: screenServices, wantScreen: screenSettings},
{name: "settings to main", screen: screenSettings, wantScreen: screenMain},
{name: "service action to services", screen: screenServiceAction, wantScreen: screenServices},
{name: "export targets to main", screen: screenExportTargets, wantScreen: screenMain},
{name: "interface pick to network", screen: screenInterfacePick, wantScreen: screenNetwork},
@@ -215,6 +219,24 @@ func TestEscapeNavigation(t *testing.T) {
}
}
func TestHealthCheckEscReturnsToMain(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenHealthCheck
m.hcCursor = 3
next, _ := m.updateHealthCheck(tea.KeyMsg{Type: tea.KeyEsc})
got := next.(model)
if got.screen != screenMain {
t.Fatalf("screen=%q want %q", got.screen, screenMain)
}
if got.cursor != 0 {
t.Fatalf("cursor=%d want 0", got.cursor)
}
}
func TestOutputScreenReturnsToPreviousScreen(t *testing.T) {
t.Parallel()
@@ -235,30 +257,56 @@ func TestOutputScreenReturnsToPreviousScreen(t *testing.T) {
}
}
func TestAcceptanceConfirmFlow(t *testing.T) {
func TestHealthCheckGPUOpensNvidiaSATSetup(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenAcceptance
m.cursor = 0
m.screen = screenHealthCheck
m.hcInitialized = true
m.hcSel = [4]bool{true, true, true, true}
next, cmd := m.handleAcceptanceMenu()
next, cmd := m.hcRunSingle(hcGPU)
got := next.(model)
if cmd != nil {
t.Fatal("expected nil cmd")
if cmd == nil {
t.Fatal("expected non-nil cmd (GPU list loader)")
}
if got.screen != screenConfirm {
t.Fatalf("screen=%q want %q", got.screen, screenConfirm)
}
if got.pendingAction != actionRunNvidiaSAT {
t.Fatalf("pendingAction=%q want %q", got.pendingAction, actionRunNvidiaSAT)
if got.screen != screenNvidiaSATSetup {
t.Fatalf("screen=%q want %q", got.screen, screenNvidiaSATSetup)
}
next, _ = got.updateConfirm(tea.KeyMsg{Type: tea.KeyEsc})
// esc from setup returns to health check
next, _ = got.updateNvidiaSATSetup(tea.KeyMsg{Type: tea.KeyEsc})
got = next.(model)
if got.screen != screenAcceptance {
t.Fatalf("screen after esc=%q want %q", got.screen, screenAcceptance)
if got.screen != screenHealthCheck {
t.Fatalf("screen after esc=%q want %q", got.screen, screenHealthCheck)
}
}
func TestHealthCheckRunSingleMapsActions(t *testing.T) {
t.Parallel()
tests := []struct {
idx int
want actionKind
}{
{idx: hcMemory, want: actionRunMemorySAT},
{idx: hcStorage, want: actionRunStorageSAT},
}
for _, test := range tests {
m := newTestModel()
m.screen = screenHealthCheck
m.hcInitialized = true
next, _ := m.hcRunSingle(test.idx)
got := next.(model)
if got.pendingAction != test.want {
t.Fatalf("idx=%d pendingAction=%q want %q", test.idx, got.pendingAction, test.want)
}
if got.screen != screenConfirm {
t.Fatalf("idx=%d screen=%q want %q", test.idx, got.screen, screenConfirm)
}
}
}
@@ -278,8 +326,8 @@ func TestExportTargetSelectionOpensConfirm(t *testing.T) {
if got.screen != screenConfirm {
t.Fatalf("screen=%q want %q", got.screen, screenConfirm)
}
if got.pendingAction != actionExportAudit {
t.Fatalf("pendingAction=%q want %q", got.pendingAction, actionExportAudit)
if got.pendingAction != actionExportBundle {
t.Fatalf("pendingAction=%q want %q", got.pendingAction, actionExportBundle)
}
if got.selectedTarget == nil || got.selectedTarget.Device != "/dev/sdb1" {
t.Fatalf("selectedTarget=%+v want /dev/sdb1", got.selectedTarget)
@@ -332,14 +380,24 @@ func TestConfirmCancelTarget(t *testing.T) {
m := newTestModel()
m.pendingAction = actionExportAudit
m.pendingAction = actionExportBundle
if got := m.confirmCancelTarget(); got != screenExportTargets {
t.Fatalf("export cancel target=%q want %q", got, screenExportTargets)
}
m.pendingAction = actionRunNvidiaSAT
if got := m.confirmCancelTarget(); got != screenAcceptance {
t.Fatalf("sat cancel target=%q want %q", got, screenAcceptance)
m.pendingAction = actionRunAll
if got := m.confirmCancelTarget(); got != screenHealthCheck {
t.Fatalf("run all cancel target=%q want %q", got, screenHealthCheck)
}
m.pendingAction = actionRunMemorySAT
if got := m.confirmCancelTarget(); got != screenHealthCheck {
t.Fatalf("memory sat cancel target=%q want %q", got, screenHealthCheck)
}
m.pendingAction = actionRunStorageSAT
if got := m.confirmCancelTarget(); got != screenHealthCheck {
t.Fatalf("storage sat cancel target=%q want %q", got, screenHealthCheck)
}
m.pendingAction = actionNone
@@ -347,3 +405,224 @@ func TestConfirmCancelTarget(t *testing.T) {
t.Fatalf("default cancel target=%q want %q", got, screenMain)
}
}
func TestViewBusyStateIsMinimal(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
view := m.View()
want := "bee\n\nWorking...\n\n[ctrl+c] quit\n"
if view != want {
t.Fatalf("busy view mismatch\nwant:\n%s\ngot:\n%s", want, view)
}
}
func TestViewBusyStateUsesBusyTitle(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
m.busyTitle = "Export support bundle"
view := m.View()
for _, want := range []string{
"Export support bundle",
"Working...",
"[ctrl+c] quit",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewOutputScreenRendersBodyAndBackHint(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenOutput
m.title = "Run audit"
m.body = "audit output: /appdata/bee/export/bee-audit.json\n"
view := m.View()
for _, want := range []string{
"Run audit",
"audit output: /appdata/bee/export/bee-audit.json",
"[enter/esc] back [ctrl+c] quit",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewRendersBannerModuleAboveScreenBody(t *testing.T) {
t.Parallel()
m := newTestModel()
m.banner = "System: Demo Server\nIP: 10.0.0.10"
m.width = 60
view := m.View()
for _, want := range []string{
"┌ MOTD ",
"System: Demo Server",
"IP: 10.0.0.10",
"Health Check",
"Export support bundle",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestSnapshotMsgUpdatesBannerAndPanel(t *testing.T) {
t.Parallel()
m := newTestModel()
next, cmd := m.Update(snapshotMsg{
banner: "System: Demo",
panel: app.HardwarePanelData{
Header: []string{"Demo header"},
Rows: []app.ComponentRow{
{Key: "CPU", Status: "PASS", Detail: "ok"},
},
},
})
got := next.(model)
if cmd != nil {
t.Fatal("expected nil cmd")
}
if got.banner != "System: Demo" {
t.Fatalf("banner=%q want %q", got.banner, "System: Demo")
}
if len(got.panel.Rows) != 1 || got.panel.Rows[0].Key != "CPU" {
t.Fatalf("panel rows=%+v", got.panel.Rows)
}
}
func TestViewExportTargetsRendersDeviceMetadata(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenExportTargets
m.targets = []platform.RemovableTarget{
{
Device: "/dev/sdb1",
FSType: "vfat",
Size: "29G",
Label: "BEEUSB",
Mountpoint: "/media/bee",
},
}
view := m.View()
for _, want := range []string{
"Export support bundle",
"Select removable filesystem",
"> /dev/sdb1 [vfat 29G] label=BEEUSB mounted=/media/bee",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewStaticFormRendersFields(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenStaticForm
m.selectedIface = "enp1s0"
m.formFields = []formField{
{Label: "Address", Value: "192.0.2.10/24"},
{Label: "Gateway", Value: "192.0.2.1"},
{Label: "DNS", Value: "1.1.1.1"},
}
m.formIndex = 1
view := m.View()
for _, want := range []string{
"Static IPv4: enp1s0",
" Address: 192.0.2.10/24",
"> Gateway: 192.0.2.1",
" DNS: 1.1.1.1",
"[tab/↑/↓] move [enter] next/submit [backspace] delete [esc] cancel",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewConfirmScreenMatchesPendingExport(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenConfirm
m.pendingAction = actionExportBundle
m.selectedTarget = &platform.RemovableTarget{Device: "/dev/sdb1"}
view := m.View()
for _, want := range []string{
"Export support bundle",
"Copy support bundle to /dev/sdb1?",
"> Confirm",
" Cancel",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestResultMsgClearsBusyAndPendingAction(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
m.busyTitle = "Export support bundle"
m.pendingAction = actionExportBundle
m.screen = screenConfirm
next, _ := m.Update(resultMsg{title: "Export support bundle", body: "done", back: screenMain})
got := next.(model)
if got.busy {
t.Fatal("busy=true want false")
}
if got.busyTitle != "" {
t.Fatalf("busyTitle=%q want empty", got.busyTitle)
}
if got.pendingAction != actionNone {
t.Fatalf("pendingAction=%q want empty", got.pendingAction)
}
}
func TestResultMsgErrorWithoutBodyFormatsCleanly(t *testing.T) {
t.Parallel()
m := newTestModel()
next, _ := m.Update(resultMsg{title: "Export support bundle", err: assertErr("boom"), back: screenMain})
got := next.(model)
if got.body != "ERROR: boom" {
t.Fatalf("body=%q want %q", got.body, "ERROR: boom")
}
}
type assertErr string
func (e assertErr) Error() string { return string(e) }


@@ -1,6 +1,9 @@
package tui
import (
"strings"
"time"
"bee/audit/internal/app"
"bee/audit/internal/platform"
"bee/audit/internal/runtimeenv"
@@ -11,41 +14,52 @@ import (
type screen string
const (
screenMain screen = "main"
screenNetwork screen = "network"
screenInterfacePick screen = "interface_pick"
screenServices screen = "services"
screenServiceAction screen = "service_action"
screenAcceptance screen = "acceptance"
screenExportTargets screen = "export_targets"
screenOutput screen = "output"
screenStaticForm screen = "static_form"
screenConfirm screen = "confirm"
screenMain screen = "main"
screenHealthCheck screen = "health_check"
screenSettings screen = "settings"
screenNetwork screen = "network"
screenInterfacePick screen = "interface_pick"
screenServices screen = "services"
screenServiceAction screen = "service_action"
screenExportTargets screen = "export_targets"
screenOutput screen = "output"
screenStaticForm screen = "static_form"
screenConfirm screen = "confirm"
screenNvidiaSATSetup screen = "nvidia_sat_setup"
screenNvidiaSATRunning screen = "nvidia_sat_running"
screenGPUStressRunning screen = "gpu_stress_running"
)
type actionKind string
const (
actionNone actionKind = ""
actionDHCPOne actionKind = "dhcp_one"
actionStaticIPv4 actionKind = "static_ipv4"
actionExportAudit actionKind = "export_audit"
actionRunNvidiaSAT actionKind = "run_nvidia_sat"
actionNone actionKind = ""
actionDHCPOne actionKind = "dhcp_one"
actionStaticIPv4 actionKind = "static_ipv4"
actionExportBundle actionKind = "export_bundle"
actionRunAll actionKind = "run_all"
actionRunMemorySAT actionKind = "run_memory_sat"
actionRunStorageSAT actionKind = "run_storage_sat"
actionRunCPUSAT actionKind = "run_cpu_sat"
actionRunAMDGPUSAT actionKind = "run_amd_gpu_sat"
actionRunFanStress actionKind = "run_fan_stress"
)
type model struct {
app *app.App
runtimeMode runtimeenv.Mode
screen screen
prevScreen screen
cursor int
busy bool
title string
body string
mainMenu []string
networkMenu []string
serviceMenu []string
screen screen
prevScreen screen
cursor int
busy bool
busyTitle string
title string
body string
mainMenu []string
settingsMenu []string
networkMenu []string
serviceMenu []string
services []string
interfaces []platform.InterfaceInfo
@@ -57,6 +71,40 @@ type model struct {
formFields []formField
formIndex int
// Hardware panel (right column)
panel app.HardwarePanelData
panelFocus bool
panelCursor int
banner string
// Health Check screen
hcSel [4]bool
hcMode int
hcCursor int
hcInitialized bool
// NVIDIA SAT setup
nvidiaGPUs []platform.NvidiaGPU
nvidiaGPUSel []bool
nvidiaDurIdx int
nvidiaSATCursor int
// NVIDIA SAT running
nvidiaSATCancel func()
nvidiaSATAborted bool
// GPU Platform Stress Test running
gpuStressCancel func()
gpuStressAborted bool
// SAT verbose progress (CPU / Memory / Storage / AMD GPU)
progressLines []string
progressPrefix string
progressSince time.Time
// Terminal size
width int
}
type formField struct {
@@ -80,32 +128,78 @@ func newModel(application *app.App, runtimeMode runtimeenv.Mode) model {
runtimeMode: runtimeMode,
screen: screenMain,
mainMenu: []string{
"Network setup",
"bee service management",
"System acceptance tests",
"Run audit now",
"Export audit to removable drive",
"Check required tools",
"Show last audit log tail",
"Health Check",
"Export support bundle",
"Settings",
"Exit",
},
settingsMenu: []string{
"Network",
"Services",
"Re-run audit",
"Run self-check",
"Runtime issues",
"Audit logs",
"Check tools",
"Back",
},
networkMenu: []string{
"Show network status",
"Show status",
"DHCP on all interfaces",
"DHCP on one interface",
"Set static IPv4 on one interface",
"Set static IPv4",
"Back",
},
serviceMenu: []string{
"status",
"restart",
"start",
"stop",
"back",
"Status",
"Restart",
"Start",
"Stop",
"Back",
},
}
}
func (m model) Init() tea.Cmd {
return nil
return m.refreshSnapshotCmd()
}
func (m model) confirmBody() (string, string) {
switch m.pendingAction {
case actionExportBundle:
if m.selectedTarget == nil {
return "Export support bundle", "No target selected"
}
return "Export support bundle", "Copy support bundle to " + m.selectedTarget.Device + "?"
case actionRunAll:
modes := []string{"Quick", "Standard", "Express"}
mode := modes[m.hcMode]
var sel []string
names := []string{"GPU", "Memory", "Storage", "CPU"}
for i, on := range m.hcSel {
if on {
sel = append(sel, names[i])
}
}
if len(sel) == 0 {
return "Health Check", "No components selected."
}
return "Health Check", "Run: " + strings.Join(sel, " + ") + "\nMode: " + mode
case actionRunMemorySAT:
return "Memory test", "Run memtester?"
case actionRunStorageSAT:
return "Storage test", "Run storage diagnostic pack?"
case actionRunCPUSAT:
modes := []string{"Quick (60s)", "Standard (300s)", "Express (900s)"}
return "CPU test", "Run stress-ng? Mode: " + modes[m.hcMode]
case actionRunAMDGPUSAT:
return "AMD GPU test", "Run AMD GPU diagnostic pack (rocm-smi)?"
case actionRunFanStress:
modes := []string{"Quick (2×2min)", "Standard (2×5min)", "Express (2×10min)"}
return "GPU Platform Stress Test", "Two-phase GPU thermal cycling test.\n" +
"Monitors fans, temps, power — detects throttling.\n" +
"Mode: " + modes[m.hcMode] + "\n\nAll NVIDIA GPUs will be stressed."
default:
return "Confirm", "Proceed?"
}
}


@@ -9,24 +9,51 @@ import (
func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
switch msg := msg.(type) {
case tea.WindowSizeMsg:
m.width = msg.Width
return m, nil
case tea.KeyMsg:
if m.busy {
switch msg.String() {
case "ctrl+c":
if msg.String() == "ctrl+c" {
return m, tea.Quit
default:
return m, nil
}
return m, nil
}
return m.updateKey(msg)
next, cmd := m.updateKey(msg)
nextModel := next.(model)
if shouldRefreshSnapshot(m, nextModel) {
return nextModel, tea.Batch(cmd, nextModel.refreshSnapshotCmd())
}
return nextModel, cmd
case satProgressMsg:
if m.busy && m.progressPrefix != "" {
if len(msg.lines) > 0 {
m.progressLines = msg.lines
}
return m, pollSATProgress(m.progressPrefix, m.progressSince)
}
return m, nil
case snapshotMsg:
m.banner = msg.banner
m.panel = msg.panel
return m, nil
case resultMsg:
m.busy = false
m.busyTitle = ""
m.progressLines = nil
m.progressPrefix = ""
m.title = msg.title
if msg.err != nil {
m.body = fmt.Sprintf("%s\n\nERROR: %v", strings.TrimSpace(msg.body), msg.err)
body := strings.TrimSpace(msg.body)
if body == "" {
m.body = fmt.Sprintf("ERROR: %v", msg.err)
} else {
m.body = fmt.Sprintf("%s\n\nERROR: %v", body, msg.err)
}
} else {
m.body = msg.body
}
m.pendingAction = actionNone
if msg.back != "" {
m.prevScreen = msg.back
} else {
@@ -34,63 +61,121 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
}
m.screen = screenOutput
m.cursor = 0
return m, nil
return m, m.refreshSnapshotCmd()
case servicesMsg:
m.busy = false
m.busyTitle = ""
if msg.err != nil {
m.title = "bee services"
m.title = "Services"
m.body = msg.err.Error()
m.prevScreen = screenMain
m.prevScreen = screenSettings
m.screen = screenOutput
return m, nil
return m, m.refreshSnapshotCmd()
}
m.services = msg.services
m.screen = screenServices
m.cursor = 0
return m, nil
return m, m.refreshSnapshotCmd()
case interfacesMsg:
m.busy = false
m.busyTitle = ""
if msg.err != nil {
m.title = "interfaces"
m.body = msg.err.Error()
m.prevScreen = screenMain
m.prevScreen = screenNetwork
m.screen = screenOutput
return m, nil
return m, m.refreshSnapshotCmd()
}
m.interfaces = msg.ifaces
m.screen = screenInterfacePick
m.cursor = 0
return m, nil
return m, m.refreshSnapshotCmd()
case exportTargetsMsg:
m.busy = false
m.busyTitle = ""
if msg.err != nil {
m.title = "export"
m.body = msg.err.Error()
m.prevScreen = screenMain
m.screen = screenOutput
return m, nil
return m, m.refreshSnapshotCmd()
}
m.targets = msg.targets
m.screen = screenExportTargets
m.cursor = 0
return m, m.refreshSnapshotCmd()
case nvidiaGPUsMsg:
return m.handleNvidiaGPUsMsg(msg)
case nvtopClosedMsg:
return m, nil
case gpuStressDoneMsg:
if m.gpuStressAborted {
return m, nil
}
if m.gpuStressCancel != nil {
m.gpuStressCancel()
m.gpuStressCancel = nil
}
m.prevScreen = screenHealthCheck
m.screen = screenOutput
m.title = msg.title
if msg.err != nil {
body := strings.TrimSpace(msg.body)
if body == "" {
m.body = fmt.Sprintf("ERROR: %v", msg.err)
} else {
m.body = fmt.Sprintf("%s\n\nERROR: %v", body, msg.err)
}
} else {
m.body = msg.body
}
return m, m.refreshSnapshotCmd()
case nvidiaSATDoneMsg:
if m.nvidiaSATAborted {
return m, nil
}
if m.nvidiaSATCancel != nil {
m.nvidiaSATCancel()
m.nvidiaSATCancel = nil
}
m.prevScreen = screenHealthCheck
m.screen = screenOutput
m.title = msg.title
if msg.err != nil {
body := strings.TrimSpace(msg.body)
if body == "" {
m.body = fmt.Sprintf("ERROR: %v", msg.err)
} else {
m.body = fmt.Sprintf("%s\n\nERROR: %v", body, msg.err)
}
} else {
m.body = msg.body
}
return m, m.refreshSnapshotCmd()
}
return m, nil
}
func (m model) updateKey(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch m.screen {
case screenMain:
return m.updateMenu(msg, len(m.mainMenu), m.handleMainMenu)
return m.updateMain(msg)
case screenHealthCheck:
return m.updateHealthCheck(msg)
case screenSettings:
return m.updateMenu(msg, len(m.settingsMenu), m.handleSettingsMenu)
case screenNetwork:
return m.updateMenu(msg, len(m.networkMenu), m.handleNetworkMenu)
case screenServices:
return m.updateMenu(msg, len(m.services), m.handleServicesMenu)
case screenServiceAction:
return m.updateMenu(msg, len(m.serviceMenu), m.handleServiceActionMenu)
case screenAcceptance:
return m.updateMenu(msg, 2, m.handleAcceptanceMenu)
case screenNvidiaSATSetup:
return m.updateNvidiaSATSetup(msg)
case screenNvidiaSATRunning:
return m.updateNvidiaSATRunning(msg)
case screenGPUStressRunning:
return m.updateGPUStressRunning(msg)
case screenExportTargets:
return m.updateMenu(msg, len(m.targets), m.handleExportTargetsMenu)
case screenInterfacePick:
@@ -101,6 +186,7 @@ func (m model) updateKey(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
m.screen = m.prevScreen
m.body = ""
m.title = ""
m.pendingAction = actionNone
return m, nil
case "ctrl+c":
return m, tea.Quit
@@ -110,13 +196,54 @@ func (m model) updateKey(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
case screenConfirm:
return m.updateConfirm(msg)
}
if msg.String() == "ctrl+c" {
return m, tea.Quit
}
return m, nil
}
// updateMain handles keys on the main (two-column) screen.
func (m model) updateMain(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
if m.panelFocus {
return m.updateMainPanel(msg)
}
// Switch focus to right panel.
if (msg.String() == "tab" || msg.String() == "right" || msg.String() == "l") && len(m.panel.Rows) > 0 {
m.panelFocus = true
return m, nil
}
return m.updateMenu(msg, len(m.mainMenu), m.handleMainMenu)
}
// updateMainPanel handles keys when right panel has focus.
func (m model) updateMainPanel(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch msg.String() {
case "up", "k":
if m.panelCursor > 0 {
m.panelCursor--
}
case "down", "j":
if m.panelCursor < len(m.panel.Rows)-1 {
m.panelCursor++
}
case "enter":
if m.panelCursor < len(m.panel.Rows) {
key := m.panel.Rows[m.panelCursor].Key
m.busy = true
m.busyTitle = key
return m, func() tea.Msg {
r := m.app.ComponentDetailResult(key)
return resultMsg{title: r.Title, body: r.Body, back: screenMain}
}
}
case "tab", "left", "h", "esc":
m.panelFocus = false
case "q", "ctrl+c":
return m, tea.Quit
}
return m, nil
}
func (m model) updateMenu(msg tea.KeyMsg, size int, onEnter func() (tea.Model, tea.Cmd)) (tea.Model, tea.Cmd) {
if size == 0 {
size = 1
@@ -134,7 +261,10 @@ func (m model) updateMenu(msg tea.KeyMsg, size int, onEnter func() (tea.Model, t
return onEnter()
case "esc":
switch m.screen {
case screenNetwork, screenServices, screenAcceptance:
case screenNetwork, screenServices:
m.screen = screenSettings
m.cursor = 0
case screenSettings:
m.screen = screenMain
m.cursor = 0
case screenServiceAction:


@@ -7,53 +7,155 @@ import (
"bee/audit/internal/platform"
tea "github.com/charmbracelet/bubbletea"
"github.com/charmbracelet/lipgloss"
)
func (m model) View() string {
if m.busy {
return "bee\n\nWorking...\n"
}
switch m.screen {
case screenMain:
return renderMenu("bee", "Select action", m.mainMenu, m.cursor)
case screenNetwork:
return renderMenu("Network", "Select action", m.networkMenu, m.cursor)
case screenServices:
return renderMenu("bee services", "Select service", m.services, m.cursor)
case screenServiceAction:
items := make([]string, len(m.serviceMenu))
copy(items, m.serviceMenu)
return renderMenu("Service: "+m.selectedService, "Select action", items, m.cursor)
case screenAcceptance:
return renderMenu("System acceptance tests", "Select action", []string{"Run NVIDIA command pack", "Back"}, m.cursor)
case screenExportTargets:
return renderMenu("Export audit", "Select removable filesystem", renderTargetItems(m.targets), m.cursor)
case screenInterfacePick:
return renderMenu("Interfaces", "Select interface", renderInterfaceItems(m.interfaces), m.cursor)
case screenStaticForm:
return renderForm("Static IPv4: "+m.selectedIface, m.formFields, m.formIndex)
case screenConfirm:
title, body := m.confirmBody()
return renderConfirm(title, body, m.cursor)
case screenOutput:
return fmt.Sprintf("%s\n\n%s\n\n[enter/esc] back [ctrl+c] quit\n", m.title, strings.TrimSpace(m.body))
// Column widths for two-column main layout.
const leftColWidth = 30
var (
stylePass = lipgloss.NewStyle().Foreground(lipgloss.Color("10")) // bright green
styleFail = lipgloss.NewStyle().Foreground(lipgloss.Color("9")) // bright red
styleCancel = lipgloss.NewStyle().Foreground(lipgloss.Color("11")) // bright yellow
styleNA = lipgloss.NewStyle().Foreground(lipgloss.Color("8")) // dark gray
)
func colorStatus(status string) string {
switch status {
case "PASS":
return stylePass.Render("PASS")
case "FAIL":
return styleFail.Render("FAIL")
case "CANCEL":
return styleCancel.Render("CANC")
default:
return "bee\n"
return styleNA.Render("N/A ")
}
}
func (m model) confirmBody() (string, string) {
switch m.pendingAction {
case actionExportAudit:
if m.selectedTarget == nil {
return "Export audit", "No target selected"
func (m model) View() string {
var body string
if m.busy {
title := "bee"
if m.busyTitle != "" {
title = m.busyTitle
}
if len(m.progressLines) > 0 {
var b strings.Builder
fmt.Fprintf(&b, "%s\n\n", title)
for _, l := range m.progressLines {
fmt.Fprintf(&b, " %s\n", l)
}
b.WriteString("\n[ctrl+c] quit\n")
body = b.String()
} else {
body = fmt.Sprintf("%s\n\nWorking...\n\n[ctrl+c] quit\n", title)
}
} else {
switch m.screen {
case screenMain:
body = renderTwoColumnMain(m)
case screenHealthCheck:
body = renderHealthCheck(m)
case screenSettings:
body = renderMenu("Settings", "Select action", m.settingsMenu, m.cursor)
case screenNetwork:
body = renderMenu("Network", "Select action", m.networkMenu, m.cursor)
case screenServices:
body = renderMenu("Services", "Select service", m.services, m.cursor)
case screenServiceAction:
body = renderMenu("Service: "+m.selectedService, "Select action", m.serviceMenu, m.cursor)
case screenExportTargets:
body = renderMenu("Export support bundle", "Select removable filesystem", renderTargetItems(m.targets), m.cursor)
case screenInterfacePick:
body = renderMenu("Interfaces", "Select interface", renderInterfaceItems(m.interfaces), m.cursor)
case screenStaticForm:
body = renderForm("Static IPv4: "+m.selectedIface, m.formFields, m.formIndex)
case screenConfirm:
title, confirmBody := m.confirmBody()
body = renderConfirm(title, confirmBody, m.cursor)
case screenNvidiaSATSetup:
body = renderNvidiaSATSetup(m)
case screenNvidiaSATRunning:
body = renderNvidiaSATRunning()
case screenGPUStressRunning:
body = renderGPUStressRunning()
case screenOutput:
body = fmt.Sprintf("%s\n\n%s\n\n[enter/esc] back [ctrl+c] quit\n", m.title, strings.TrimSpace(m.body))
default:
body = "bee\n"
}
return "Export audit", fmt.Sprintf("Copy latest audit JSON to %s?", m.selectedTarget.Device)
case actionRunNvidiaSAT:
return "NVIDIA SAT", "Run NVIDIA acceptance command pack?"
default:
return "Confirm", "Proceed?"
}
return m.renderWithBanner(body)
}
// renderTwoColumnMain renders the main screen with menu on the left and hardware panel on the right.
func renderTwoColumnMain(m model) string {
// Left column lines
leftLines := []string{"bee", ""}
for i, item := range m.mainMenu {
pfx := " "
if !m.panelFocus && m.cursor == i {
pfx = "> "
}
leftLines = append(leftLines, pfx+item)
}
// Right column lines
rightLines := buildPanelLines(m)
// Render side by side
var b strings.Builder
maxRows := max(len(leftLines), len(rightLines))
for i := 0; i < maxRows; i++ {
l := ""
if i < len(leftLines) {
l = leftLines[i]
}
r := ""
if i < len(rightLines) {
r = rightLines[i]
}
w := lipgloss.Width(l)
if w < leftColWidth {
l += strings.Repeat(" ", leftColWidth-w)
}
b.WriteString(l + " │ " + r + "\n")
}
sep := strings.Repeat("─", leftColWidth) + "─┴─" + strings.Repeat("─", 46)
b.WriteString(sep + "\n")
if m.panelFocus {
b.WriteString("[↑↓] move [enter] details [tab/←] menu [ctrl+c] quit\n")
} else {
b.WriteString("[↑↓] move [enter] select [tab/→] panel [ctrl+c] quit\n")
}
return b.String()
}
func buildPanelLines(m model) []string {
p := m.panel
var lines []string
lines = append(lines, p.Header...)
if len(p.Header) > 0 && len(p.Rows) > 0 {
lines = append(lines, "")
}
for i, row := range p.Rows {
pfx := " "
if m.panelFocus && m.panelCursor == i {
pfx = "> "
}
status := colorStatus(row.Status)
lines = append(lines, fmt.Sprintf("%s%s %-4s %s", pfx, status, row.Key, row.Detail))
}
return lines
}
func renderTargetItems(targets []platform.RemovableTarget) []string {
@@ -135,3 +237,60 @@ func resultCmd(title, body string, err error, back screen) tea.Cmd {
return resultMsg{title: title, body: body, err: err, back: back}
}
}
func (m model) renderWithBanner(body string) string {
body = strings.TrimRight(body, "\n")
banner := renderBannerModule(m.banner, m.width)
if banner == "" {
if body == "" {
return ""
}
return body + "\n"
}
if body == "" {
return banner + "\n"
}
return banner + "\n\n" + body + "\n"
}
func renderBannerModule(banner string, width int) string {
banner = strings.TrimSpace(banner)
if banner == "" {
return ""
}
lines := strings.Split(banner, "\n")
contentWidth := 0
for _, line := range lines {
if w := lipgloss.Width(line); w > contentWidth {
contentWidth = w
}
}
if width > 0 && width-4 > contentWidth {
contentWidth = width - 4
}
if contentWidth < 20 {
contentWidth = 20
}
label := " MOTD "
topFill := contentWidth + 2 - lipgloss.Width(label)
if topFill < 0 {
topFill = 0
}
var b strings.Builder
b.WriteString("┌" + label + strings.Repeat("─", topFill) + "┐\n")
for _, line := range lines {
b.WriteString("│ " + padRight(line, contentWidth) + " │\n")
}
b.WriteString("└" + strings.Repeat("─", contentWidth+2) + "┘")
return b.String()
}
func padRight(value string, width int) string {
if gap := width - lipgloss.Width(value); gap > 0 {
return value + strings.Repeat(" ", gap)
}
return value
}


@@ -0,0 +1,240 @@
package webui
import (
"errors"
"fmt"
"html"
"net/http"
"net/url"
"os"
"path/filepath"
"sort"
"strings"
"bee/audit/internal/app"
"reanimator/chart/viewer"
chartweb "reanimator/chart/web"
)
const defaultTitle = "Bee Hardware Audit"
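// HandlerOptions configures the web UI handler: page title, audit snapshot path, and export directory.
type HandlerOptions struct {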
Title string
AuditPath string
ExportDir string
}
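// NewHandler returns an http.Handler that serves the audit viewer, the raw JSON snapshots, and the export files.
func NewHandler(opts HandlerOptions) http.Handler {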
title := strings.TrimSpace(opts.Title)
if title == "" {
title = defaultTitle
}
auditPath := strings.TrimSpace(opts.AuditPath)
exportDir := strings.TrimSpace(opts.ExportDir)
if exportDir == "" {
exportDir = app.DefaultExportDir
}
mux := http.NewServeMux()
mux.Handle("GET /static/", http.StripPrefix("/static/", chartweb.Static()))
mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Cache-Control", "no-store")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("ok"))
})
mux.HandleFunc("GET /audit.json", func(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(auditPath)
if err != nil {
if errors.Is(err, os.ErrNotExist) {
http.Error(w, "audit snapshot not found", http.StatusNotFound)
return
}
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, _ = w.Write(data)
})
mux.HandleFunc("GET /export/support.tar.gz", func(w http.ResponseWriter, r *http.Request) {
archive, err := app.BuildSupportBundle(exportDir)
if err != nil {
http.Error(w, fmt.Sprintf("build support bundle: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/gzip")
w.Header().Set("Content-Disposition", fmt.Sprintf("attachment; filename=%q", filepath.Base(archive)))
http.ServeFile(w, r, archive)
})
mux.HandleFunc("GET /runtime-health.json", func(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(filepath.Join(exportDir, "runtime-health.json"))
if err != nil {
if errors.Is(err, os.ErrNotExist) {
http.Error(w, "runtime health not found", http.StatusNotFound)
return
}
http.Error(w, fmt.Sprintf("read runtime health: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, _ = w.Write(data)
})
mux.HandleFunc("GET /export/", func(w http.ResponseWriter, r *http.Request) {
body, err := renderExportIndex(exportDir)
if err != nil {
http.Error(w, fmt.Sprintf("render export index: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(body))
})
mux.HandleFunc("GET /export/file", func(w http.ResponseWriter, r *http.Request) {
rel := strings.TrimSpace(r.URL.Query().Get("path"))
if rel == "" {
http.Error(w, "path is required", http.StatusBadRequest)
return
}
clean := filepath.Clean(rel)
if clean == "." || strings.HasPrefix(clean, "..") {
http.Error(w, "invalid path", http.StatusBadRequest)
return
}
http.ServeFile(w, r, filepath.Join(exportDir, clean))
})
mux.HandleFunc("GET /viewer", func(w http.ResponseWriter, r *http.Request) {
snapshot, err := loadSnapshot(auditPath)
if err != nil && !errors.Is(err, os.ErrNotExist) {
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
return
}
// Named page rather than html to avoid shadowing the imported html package.
page, err := viewer.RenderHTML(snapshot, title)
if err != nil {
http.Error(w, fmt.Sprintf("render snapshot: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write(page)
})
mux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
noticeTitle, noticeBody := runtimeNotice(filepath.Join(exportDir, "runtime-health.json"))
body := renderShellPage(title, noticeTitle, noticeBody)
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(body))
})
return mux
}
// ListenAndServe serves NewHandler(opts) on addr and blocks until the server returns an error.
func ListenAndServe(addr string, opts HandlerOptions) error {
return http.ListenAndServe(addr, NewHandler(opts))
}
func loadSnapshot(path string) ([]byte, error) {
if strings.TrimSpace(path) == "" {
return nil, os.ErrNotExist
}
return os.ReadFile(path)
}
func runtimeNotice(path string) (string, string) {
health, err := app.ReadRuntimeHealth(path)
if err != nil {
return "Runtime Health", "No runtime health snapshot found yet."
}
body := fmt.Sprintf("Status: %s. Export dir: %s. Driver ready: %t. CUDA ready: %t. Network: %s. Export files: /export/",
firstNonEmpty(health.Status, "UNKNOWN"),
firstNonEmpty(health.ExportDir, app.DefaultExportDir),
health.DriverReady,
health.CUDAReady,
firstNonEmpty(health.NetworkStatus, "UNKNOWN"),
)
if len(health.Issues) > 0 {
body += " Issues: "
parts := make([]string, 0, len(health.Issues))
for _, issue := range health.Issues {
parts = append(parts, issue.Code)
}
body += strings.Join(parts, ", ")
}
return "Runtime Health", body
}
func renderExportIndex(exportDir string) (string, error) {
var entries []string
err := filepath.Walk(strings.TrimSpace(exportDir), func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if info.IsDir() {
return nil
}
rel, err := filepath.Rel(exportDir, path)
if err != nil {
return err
}
entries = append(entries, rel)
return nil
})
if err != nil && !errors.Is(err, os.ErrNotExist) {
return "", err
}
sort.Strings(entries)
var body strings.Builder
body.WriteString("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><title>Bee Export Files</title></head><body>")
body.WriteString("<h1>Bee Export Files</h1><ul>")
for _, entry := range entries {
body.WriteString("<li><a href=\"/export/file?path=" + url.QueryEscape(entry) + "\">" + html.EscapeString(entry) + "</a></li>")
}
if len(entries) == 0 {
body.WriteString("<li>No export files found.</li>")
}
body.WriteString("</ul></body></html>")
return body.String(), nil
}
func renderShellPage(title, noticeTitle, noticeBody string) string {
var body strings.Builder
body.WriteString("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">")
body.WriteString("<title>" + html.EscapeString(title) + "</title>")
body.WriteString(`<style>
body{margin:0;font-family:system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;background:#f4f1ea;color:#1b1b18}
.shell{min-height:100vh;display:grid;grid-template-rows:auto auto 1fr}
.header{padding:18px 20px 12px;border-bottom:1px solid rgba(0,0,0,.08);background:#fbf8f2}
.header h1{margin:0;font-size:24px}
.header p{margin:6px 0 0;color:#5a5a52}
.actions{display:flex;flex-wrap:wrap;gap:10px;padding:12px 20px;background:#fbf8f2}
.actions a{display:inline-block;text-decoration:none;padding:10px 14px;border-radius:999px;background:#1f5f4a;color:#fff;font-weight:600}
.actions a.secondary{background:#d8e5dd;color:#17372b}
.notice{margin:16px 20px 0;padding:14px 16px;border-radius:14px;background:#fff7df;border:1px solid #ead9a4}
.notice h2{margin:0 0 6px;font-size:16px}
.notice p{margin:0;color:#4f4a37}
.viewer-wrap{padding:16px 20px 20px}
.viewer{width:100%;height:calc(100vh - 170px);border:0;border-radius:18px;background:#fff;box-shadow:0 12px 40px rgba(0,0,0,.08)}
@media (max-width:720px){.viewer{height:calc(100vh - 240px)}}
</style></head><body><div class="shell">`)
body.WriteString("<header class=\"header\"><h1>" + html.EscapeString(title) + "</h1><p>Audit viewer with support bundle and raw export access.</p></header>")
body.WriteString("<nav class=\"actions\">")
body.WriteString("<a href=\"/export/support.tar.gz\">Download support bundle</a>")
body.WriteString("<a class=\"secondary\" href=\"/audit.json\">Open audit.json</a>")
body.WriteString("<a class=\"secondary\" href=\"/runtime-health.json\">Open runtime-health.json</a>")
body.WriteString("<a class=\"secondary\" href=\"/export/\">Browse export files</a>")
body.WriteString("</nav>")
if strings.TrimSpace(noticeTitle) != "" {
body.WriteString("<section class=\"notice\"><h2>" + html.EscapeString(noticeTitle) + "</h2><p>" + html.EscapeString(noticeBody) + "</p></section>")
}
body.WriteString("<main class=\"viewer-wrap\"><iframe class=\"viewer\" src=\"/viewer\" loading=\"eager\" referrerpolicy=\"same-origin\"></iframe></main>")
body.WriteString("</div></body></html>")
return body.String()
}
func firstNonEmpty(value, fallback string) string {
value = strings.TrimSpace(value)
if value == "" {
return fallback
}
return value
}
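
The `/export/file` handler cleans the requested path and rejects anything whose cleaned form starts with `..` before joining it onto the export directory. As a hedged sketch of that check in isolation, the hypothetical helper `safeExportPath` below applies a slightly narrower rule: it rejects only real parent-directory escapes and absolute paths, so a file literally named `..notes.txt` would still be served. Both the helper name and the `/tmp/export` directory are assumptions made for the example, and the expected results assume a Unix-style path separator.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeExportPath is a hypothetical helper (not part of the diff).
// It returns the path under exportDir for rel, or ok=false when rel is
// empty, absolute, or escapes exportDir through ".." components.
func safeExportPath(exportDir, rel string) (string, bool) {
	rel = strings.TrimSpace(rel)
	if rel == "" {
		return "", false
	}
	clean := filepath.Clean(rel)
	parentPrefix := ".." + string(filepath.Separator)
	if clean == "." || clean == ".." || strings.HasPrefix(clean, parentPrefix) || filepath.IsAbs(clean) {
		return "", false
	}
	return filepath.Join(exportDir, clean), true
}

func main() {
	// "/tmp/export" is an illustrative directory, not the project's default.
	for _, rel := range []string{"bee-audit.log", "sat/../bee-audit.log", "../etc/passwd", "..notes.txt", ""} {
		p, ok := safeExportPath("/tmp/export", rel)
		fmt.Printf("%q -> %q ok=%t\n", rel, p, ok)
	}
}
```

Because `filepath.Clean` collapses interior `..` segments first, a request such as `a/../../b` is already reduced to `../b` by the time the prefix test runs.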


@@ -0,0 +1,167 @@
package webui
import (
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"strings"
"testing"
)
func TestRootRendersShellWithIframe(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{
Title: "Bee Hardware Audit",
AuditPath: path,
ExportDir: exportDir,
})
first := httptest.NewRecorder()
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/", nil))
if first.Code != http.StatusOK {
t.Fatalf("first status=%d", first.Code)
}
if !strings.Contains(first.Body.String(), `iframe`) || !strings.Contains(first.Body.String(), `src="/viewer"`) {
t.Fatalf("first body missing iframe viewer: %s", first.Body.String())
}
if !strings.Contains(first.Body.String(), "/export/support.tar.gz") {
t.Fatalf("first body missing support bundle link: %s", first.Body.String())
}
if got := first.Header().Get("Cache-Control"); got != "no-store" {
t.Fatalf("first cache-control=%q", got)
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
t.Fatal(err)
}
second := httptest.NewRecorder()
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/", nil))
if second.Code != http.StatusOK {
t.Fatalf("second status=%d", second.Code)
}
if !strings.Contains(second.Body.String(), `src="/viewer"`) {
t.Fatalf("second body missing iframe viewer: %s", second.Body.String())
}
}
func TestViewerRendersLatestSnapshot(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
first := httptest.NewRecorder()
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if first.Code != http.StatusOK {
t.Fatalf("first status=%d", first.Code)
}
if !strings.Contains(first.Body.String(), "SERIAL-OLD") {
t.Fatalf("viewer body missing old serial: %s", first.Body.String())
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
t.Fatal(err)
}
second := httptest.NewRecorder()
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if second.Code != http.StatusOK {
t.Fatalf("second status=%d", second.Code)
}
if !strings.Contains(second.Body.String(), "SERIAL-NEW") {
t.Fatalf("viewer body missing new serial: %s", second.Body.String())
}
if strings.Contains(second.Body.String(), "SERIAL-OLD") {
t.Fatalf("viewer body still contains old serial: %s", second.Body.String())
}
}
func TestAuditJSONServesLatestSnapshot(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
body := `{"hardware":{"board":{"serial_number":"SERIAL-API"}}}`
if err := os.WriteFile(path, []byte(body), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
if got := strings.TrimSpace(rec.Body.String()); got != body {
t.Fatalf("body=%q want %q", got, body)
}
if got := rec.Header().Get("Content-Type"); !strings.Contains(got, "application/json") {
t.Fatalf("content-type=%q", got)
}
}
func TestMissingAuditJSONReturnsNotFound(t *testing.T) {
handler := NewHandler(HandlerOptions{AuditPath: "/missing/audit.json"})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
if rec.Code != http.StatusNotFound {
t.Fatalf("status=%d want %d", rec.Code, http.StatusNotFound)
}
}
func TestSupportBundleEndpointReturnsArchive(t *testing.T) {
	dir := t.TempDir()
	exportDir := filepath.Join(dir, "export")
	if err := os.MkdirAll(exportDir, 0755); err != nil {
		t.Fatal(err)
	}
	if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil {
		t.Fatal(err)
	}
	handler := NewHandler(HandlerOptions{ExportDir: exportDir})
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/export/support.tar.gz", nil))
	if rec.Code != http.StatusOK {
		t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
	}
	if got := rec.Header().Get("Content-Disposition"); !strings.Contains(got, "attachment;") {
		t.Fatalf("content-disposition=%q", got)
	}
	if rec.Body.Len() == 0 {
		t.Fatal("empty archive body")
	}
}
func TestRuntimeHealthEndpointReturnsJSON(t *testing.T) {
	dir := t.TempDir()
	exportDir := filepath.Join(dir, "export")
	if err := os.MkdirAll(exportDir, 0755); err != nil {
		t.Fatal(err)
	}
	body := `{"status":"PARTIAL","checked_at":"2026-03-16T10:00:00Z"}`
	if err := os.WriteFile(filepath.Join(exportDir, "runtime-health.json"), []byte(body), 0644); err != nil {
		t.Fatal(err)
	}
	handler := NewHandler(HandlerOptions{ExportDir: exportDir})
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/runtime-health.json", nil))
	if rec.Code != http.StatusOK {
		t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
	}
	if strings.TrimSpace(rec.Body.String()) != body {
		t.Fatalf("body=%q want %q", strings.TrimSpace(rec.Body.String()), body)
	}
}
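
The tests above expect the handler to serve whatever is on disk at request time and to answer 404 when the file is missing. A minimal sketch of that behaviour follows; `auditJSONHandler`, `fetch`, and `demo` are hypothetical names for illustration, and the real `NewHandler` implementation is not shown in this diff.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// auditJSONHandler is a hypothetical helper, not the project's NewHandler.
// It re-reads the file on every request, so a rewritten snapshot is visible
// immediately, and it maps a missing file to 404.
func auditJSONHandler(path string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		data, err := os.ReadFile(path)
		if errors.Is(err, fs.ErrNotExist) {
			http.Error(w, "audit snapshot not found", http.StatusNotFound)
			return
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(data)
	})
}

// fetch performs one in-memory GET against the handler.
func fetch(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
	return rec.Code, rec.Body.String()
}

// demo requests a missing snapshot, then writes two snapshots in turn and
// returns the status codes plus the body served after the second write.
func demo() (missing, served int, body string) {
	dir, err := os.MkdirTemp("", "bee-sketch-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "audit.json")
	h := auditJSONHandler(path)

	missing, _ = fetch(h) // the file does not exist yet
	for _, serial := range []string{"SERIAL-OLD", "SERIAL-NEW"} {
		snapshot := fmt.Sprintf(`{"hardware":{"board":{"serial_number":%q}}}`, serial)
		if err := os.WriteFile(path, []byte(snapshot), 0o644); err != nil {
			panic(err)
		}
		served, body = fetch(h)
	}
	return missing, served, body
}

func main() {
	fmt.Println(demo())
}
```

Reading the file per request keeps the handler stateless, at the cost of one disk read per hit; that trade-off seems reasonable for a small snapshot file.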

bible

Submodule bible updated: 456c1f022c...688b87e98d


@@ -9,4 +9,5 @@ Generic engineering rules live in `bible/rules/patterns/`.
|---|---|
| `architecture/system-overview.md` | What bee does, scope, tech stack |
| `architecture/runtime-flows.md` | Boot sequence, audit flow, service order |
| `docs/hardware-ingest-contract.md` | Current Reanimator hardware ingest JSON contract |
| `decisions/` | Architectural decision log |


@@ -18,8 +18,10 @@ local-fs.target
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│ creates /dev/nvidia* nodes)
└── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
never blocks boot on partial collector failures)
├── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
never blocks boot on partial collector failures)
└── bee-web.service (runs `bee web` on :80,
reads the latest audit snapshot on each request)
```
**Critical invariants:**
@@ -29,17 +31,39 @@ local-fs.target
Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
- `bee-web.service` binds `0.0.0.0:80` and always renders the current `/var/log/bee-audit.json` contents.
- Audit JSON now includes a `hardware.summary` block with overall verdict and warning/failure counts.
## Console and login flow
Local-console behavior:
```text
tty1
└── live-config autologin → bee
└── /home/bee/.profile
└── exec menu
└── /usr/local/bin/bee-tui
└── sudo -n /usr/local/bin/bee tui --runtime livecd
```
Rules:
- local `tty1` lands in user `bee`, not directly in `root`
- `menu` must work without typing `sudo`
- TUI actions still run as `root` via `sudo -n`
- SSH is independent from the tty1 path
- serial console support is enabled for VM boot debugging
## ISO build sequence
```
build.sh [--authorized-keys /path/to/keys]
build-in-container.sh [--authorized-keys /path/to/keys]
1. compile `bee` binary (skip if .go files older than binary)
2. create a temporary overlay staging dir under `dist/`
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
4. copy `bee` binary → staged `/usr/local/bin/bee`
5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
(`storcli64`, `sas2ircu`, `sas3ircu`, `mstflint` — each optional)
(`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli` — optional; `mstflint` comes from the Debian package set)
6. `build-nvidia-module.sh`:
a. install Debian kernel headers if missing
b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
@@ -54,13 +78,12 @@ build.sh [--authorized-keys /path/to/keys]
11. patch staged `motd` with build metadata
12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
13. sync staged overlay into workdir `config/includes.chroot/`
14. run `lb config && lb build` inside the temporary workdir
(either on a Debian host/VM or inside the privileged builder container)
14. run `lb config && lb build` inside the privileged builder container
```
**Critical invariants:**
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
2. `auto/config` → `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
@@ -80,6 +103,10 @@ Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shi
Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
Current validation state:
- local/libvirt VM boot path is validated for `systemd`, SSH, `bee audit`, `bee-network`, and TUI startup
- real hardware validation is still required before treating the ISO as release-ready
## Overlay mechanism
`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
@@ -95,10 +122,52 @@ systemd services running, audit completed with NVIDIA enrichment, LAN reachabili
3. memory collector (dmidecode -t 17)
4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/)
6. psu collector (ipmitool fru — silent if no /dev/ipmi0)
6. psu collector (ipmitool fru + sdr — silent if no /dev/ipmi0)
7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
8. output JSON → /var/log/bee-audit.json
9. QR summary to stdout (qrencode if available)
```
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
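
The `nil, nil` convention can be sketched as follows. This is an illustration only: `runCollector` and `section` are hypothetical names, and the project's real collectors parse tool output rather than returning it raw.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// section is a hypothetical result type used only in this sketch.
type section struct {
	Tool string
	Raw  string
}

// runCollector returns (nil, nil) when the tool is not installed, so the
// caller can skip the section silently. Any other failure is returned as an
// error for the caller to log.
func runCollector(tool string, args ...string) (*section, error) {
	path, err := exec.LookPath(tool)
	if err != nil {
		return nil, nil // tool-not-found: not an error
	}
	out, err := exec.Command(path, args...).Output()
	if err != nil {
		return nil, fmt.Errorf("%s: %w", tool, err)
	}
	return &section{Tool: tool, Raw: string(out)}, nil
}

func main() {
	jobs := [][]string{{"ipmitool", "fru"}, {"bee-sketch-missing-tool"}}
	var collected []*section
	for _, job := range jobs {
		s, err := runCollector(job[0], job[1:]...)
		if err != nil {
			log.Printf("collector failed: %v", err) // logged, never fatal
			continue
		}
		if s == nil {
			continue // tool absent: skip the section
		}
		collected = append(collected, s)
	}
	fmt.Printf("collected %d section(s)\n", len(collected))
}
```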
Acceptance flows:
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-stress`
- `bee sat memory` → `memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- Runtime overrides:
- `BEE_GPU_STRESS_SECONDS`
- `BEE_GPU_STRESS_SIZE_MB`
- `BEE_MEMTESTER_SIZE_MB`
- `BEE_MEMTESTER_PASSES`
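
One way to read such overrides is a small helper that falls back to a default when the variable is unset or invalid. This is a sketch under assumptions: `parsePositive` and `envInt` are hypothetical names, and the default values below are placeholders, not the project's real defaults.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parsePositive returns the integer in raw, or def when raw is empty,
// not a number, or not strictly positive.
func parsePositive(raw string, def int) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || n <= 0 {
		return def
	}
	return n
}

// envInt reads an integer override from the environment.
func envInt(name string, def int) int {
	return parsePositive(os.Getenv(name), def)
}

func main() {
	// The defaults here are illustrative placeholders only.
	fmt.Println("gpu stress seconds:", envInt("BEE_GPU_STRESS_SECONDS", 5))
	fmt.Println("gpu stress size MB:", envInt("BEE_GPU_STRESS_SIZE_MB", 64))
	fmt.Println("memtester size MB: ", envInt("BEE_MEMTESTER_SIZE_MB", 128))
	fmt.Println("memtester passes:  ", envInt("BEE_MEMTESTER_PASSES", 1))
}
```

Falling back silently keeps a mistyped override from aborting an acceptance run; whether the real tool logs or rejects invalid values is not stated in this document.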
## NVIDIA SAT TUI flow (v1.0.0+)
```
TUI: Acceptance tests → NVIDIA command pack
1. screenNvidiaSATSetup
a. enumerate GPUs via `nvidia-smi --query-gpu=index,name,memory.total`
b. user selects duration preset: 10 min / 1 h / 8 h / 24 h
c. user selects GPUs via checkboxes (all selected by default)
d. memory size = max(selected GPU memory) — auto-detected, not exposed to user
2. Start → screenNvidiaSATRunning
a. CUDA_VISIBLE_DEVICES set to selected GPU indices
b. tea.Batch: SAT goroutine + tea.ExecProcess(nvtop) launched concurrently
c. nvtop occupies full terminal; SAT result queues in background
d. [o] reopen nvtop at any time; [a] abort (cancels context → kills bee-gpu-stress)
3. GPU metrics collection (during bee-gpu-stress)
- background goroutine polls `nvidia-smi` every second
- per-second rows: elapsed, GPU index, temp°C, usage%, power W, clock MHz
- outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG chart), gpu-metrics-term.txt
4. After SAT completes
- result shown in screenOutput with terminal line-chart (gpu-metrics-term.txt)
- chart is asciigraph-style: box-drawing chars (╭╮╰╯─│), 4 series per GPU,
Y axis with ticks, ANSI colours (red=temp, blue=usage, green=power, yellow=clock)
```
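
Steps 1a, 1d, and 2a can be sketched as below. The sketch assumes the `nvidia-smi` query was run with `--format=csv,noheader,nounits`, so each line looks like `0, NVIDIA H100, 81559`; the `gpu` type and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpu is a hypothetical row of the setup screen.
type gpu struct {
	Index     int
	Name      string
	MemoryMiB int
	Selected  bool
}

// parseGPUs parses "index, name, memory.total" CSV lines (assumed to be
// produced with noheader,nounits). Every GPU starts out selected.
func parseGPUs(out string) ([]gpu, error) {
	var gpus []gpu
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		f := strings.Split(line, ",")
		if len(f) != 3 {
			return nil, fmt.Errorf("unexpected line %q", line)
		}
		idx, err := strconv.Atoi(strings.TrimSpace(f[0]))
		if err != nil {
			return nil, fmt.Errorf("index in %q: %w", line, err)
		}
		mem, err := strconv.Atoi(strings.TrimSpace(f[2]))
		if err != nil {
			return nil, fmt.Errorf("memory in %q: %w", line, err)
		}
		gpus = append(gpus, gpu{Index: idx, Name: strings.TrimSpace(f[1]), MemoryMiB: mem, Selected: true})
	}
	return gpus, nil
}

// cudaVisibleDevices builds the CUDA_VISIBLE_DEVICES value from the selection.
func cudaVisibleDevices(gpus []gpu) string {
	var ids []string
	for _, g := range gpus {
		if g.Selected {
			ids = append(ids, strconv.Itoa(g.Index))
		}
	}
	return strings.Join(ids, ",")
}

// maxSelectedMemoryMiB returns the largest memory size among selected GPUs.
func maxSelectedMemoryMiB(gpus []gpu) int {
	max := 0
	for _, g := range gpus {
		if g.Selected && g.MemoryMiB > max {
			max = g.MemoryMiB
		}
	}
	return max
}

func main() {
	gpus, err := parseGPUs("0, NVIDIA H100, 81559\n1, NVIDIA A100, 40960\n")
	if err != nil {
		panic(err)
	}
	gpus[1].Selected = false // the user unticks GPU 1
	fmt.Println("CUDA_VISIBLE_DEVICES=" + cudaVisibleDevices(gpus))
	fmt.Println("memory MiB:", maxSelectedMemoryMiB(gpus))
}
```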
**Critical invariants:**
- `nvtop` must be in `iso/builder/config/package-lists/bee.list.chroot` (baked into ISO).
- `bee-gpu-stress` uses `exec.CommandContext` — aborted on cancel.
- Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed).
- If `nvtop` is not found on PATH, SAT still runs without it (graceful degradation).
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.
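
The stopCh/doneCh invariant can be illustrated with a small sketch. It is not the project's code: `metricsCollector`, `sample`, and the `read` callback are hypothetical, and a real poller would call `nvidia-smi` instead of a stub.

```go
package main

import (
	"fmt"
	"time"
)

// sample is one metric row.
type sample struct {
	Elapsed time.Duration
	Value   float64
}

// metricsCollector owns rows inside its goroutine until doneCh is closed.
type metricsCollector struct {
	stopCh chan struct{}
	doneCh chan struct{}
	rows   []sample
}

// startCollector polls read at the given interval until Stop is called.
func startCollector(interval time.Duration, read func() float64) *metricsCollector {
	c := &metricsCollector{stopCh: make(chan struct{}), doneCh: make(chan struct{})}
	go func() {
		defer close(c.doneCh) // signals that rows will no longer be written
		start := time.Now()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-c.stopCh:
				return
			case <-ticker.C:
				c.rows = append(c.rows, sample{Elapsed: time.Since(start), Value: read()})
			}
		}
	}()
	return c
}

// Stop asks the goroutine to exit and waits on doneCh before touching rows.
// The channel close orders the goroutine's writes before this read, so no
// mutex is required.
func (c *metricsCollector) Stop() []sample {
	close(c.stopCh)
	<-c.doneCh
	return c.rows
}

// demoCollect runs the collector briefly with a stub reader.
func demoCollect() int {
	c := startCollector(time.Millisecond, func() float64 { return 42 })
	time.Sleep(30 * time.Millisecond)
	return len(c.Stop())
}

func main() {
	fmt.Println("rows collected:", demoCollect())
}
```

The design relies on a single writer and a single reader that never overlap in time; if rows had to be read while sampling is still running, a mutex or a channel of rows would be needed instead.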


@@ -4,7 +4,7 @@
Hardware audit LiveCD. Boots on a server via BMC virtual media or USB.
Collects hardware inventory at OS level (not through BMC/Redfish).
Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
Produces `HardwareIngestRequest` JSON compatible with the contract in `bible-local/docs/hardware-ingest-contract.md`.
## Why it exists
@@ -19,10 +19,15 @@ Fills gaps where Redfish/logpile is blind:
## In scope
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
- Unattended operation — no user interaction required
- Machine-readable health summary derived from collector verdicts
- Operator-triggered acceptance tests for NVIDIA, memory, and storage
- NVIDIA SAT includes both diagnostic collection and lightweight GPU stress via `bee-gpu-stress`
- Automatic boot audit with operator-facing local console and SSH access
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
- SSH access (OpenSSH) always available for inspection and debugging
- Interactive Go TUI via `bee tui` for network setup, service management, and acceptance tests
- Read-only web viewer via `bee web`, rendering the latest audit snapshot through the embedded Reanimator Chart
- Local `tty1` operator UX: `bee` autologin, `menu` auto-start, privileged actions via `sudo -n`
## Network isolation — CRITICAL
@@ -42,6 +47,16 @@ Fills gaps where Redfish/logpile is blind:
- Anything requiring persistent storage on the audited machine
- Windows support
- Any functionality requiring internet access at boot
- Component lifecycle/history across multiple snapshots
- Status transition history (`status_history`, `status_changed_at`) derived from previous exports
- Replacement detection between two or more audit runs
## Contract boundary
- `bee` is responsible for the current hardware snapshot only.
- `bee` should populate current component state, hardware inventory, telemetry, and `status_checked_at`.
- Historical status transitions and component replacement logic belong to the centralized ingest/lifecycle system, not to `bee`.
- Contract fields that have no honest local source on a generic Linux host may remain empty.
## Tech stack
@@ -56,6 +71,14 @@ Fills gaps where Redfish/logpile is blind:
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
| Builder | Debian 12 host/VM or Debian 12 container image |
## Operator UX
- On the live ISO, `tty1` autologins as `bee`
- The login profile auto-runs `menu`, which enters the Go TUI
- The TUI itself executes privileged actions as `root` via `sudo -n`
- SSH remains available independently of the local console path
- VM-oriented builds also include `qemu-guest-agent` and serial console support for debugging
## Runtime split
- The main Go application must run both on a normal Linux host and inside the live ISO
@@ -72,8 +95,11 @@ Fills gaps where Redfish/logpile is blind:
| `audit/internal/schema/` | HardwareIngestRequest types |
| `iso/builder/` | ISO build scripts and `live-build` profile |
| `iso/overlay/` | Source overlay copied into a staged build overlay |
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, mstflint, …) |
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, arcconf, ssacli, …) |
| `internal/chart/` | Git submodule with `reanimator/chart`, embedded into `bee web` |
| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
| `iso/overlay/etc/profile.d/bee.sh` | `menu` helper + tty1 auto-start policy |
| `iso/overlay/home/bee/.profile` | `bee` shell profile for local console startup |
| `dist/` | Build outputs (gitignored) |
| `iso/out/` | Downloaded ISO files (gitignored) |


@@ -1,22 +1,89 @@
# Backlog
## GPU stress test (H100)
## BMC version via IPMI
**Status:** deferred. The current ISO neither includes nor runs `gpu_burn`.
**Status:** implemented.
**Why the task is still in the backlog:**
- `gpu_burn` remains heavy and awkward in terms of dependencies
- we want a standard lightweight stress tool without `libcublas.so` and without noticeably bloating the ISO
- H100 needs a predictable offline tool that can be shipped reliably inside the ISO
Add BMC firmware version collection to the board collector:
- Command: `ipmitool mc info` → `Firmware Revision` field
- Write it to `hardware.firmware[]` as `{device_name: "BMC", version: "..."}`
- Show it in the right-hand column of the TUI next to the BIOS version
- Graceful skip if `/dev/ipmi0` is absent (silent: same pattern as PSU collector)
**Desired next step:** write a minimal stress tool on the CUDA Driver API
- uses only `libcuda.so`, which is already present in the ISO
- runs a simple compute / memory workload via `cuLaunchKernel`
- is built separately on the builder VM and placed in `iso/vendor/`
- may later be invoked from `bee tui` as the preferred built-in GPU SAT/stress path
## CPU acceptance test via stress-ng
**Rejected / problematic options:**
- `gpu_burn`: requires libcublas (~500MB)
- `nvbandwidth`: bandwidth only, does not burn FLOPs; requires libcudart (~8MB)
- DCGM diag: the right tool for H100, but a ~100MB install
- Download on demand: requires libcublas, so the problem is the same
**Status:** implemented. The CPU entry in Health Check gets PASS/FAIL from summary.txt.
Add a CPU SAT based on `stress-ng`:
- Bake `stress-ng` into the ISO (add it to `bee.list.chroot`)
- New `bee sat cpu`: runs `stress-ng --cpu 0 --cpu-method all --timeout <N>`, where N is the duration for the mode (Quick=60s, Standard=300s, Express=900s)
- In parallel, reads temperatures via `sensors` and throttle flags from the audit JSON
- Result: a SAT archive with summary.txt in the same format as the other SATs (overall_status=OK/FAILED)
- After implementation: the CPU entry in Health Check gets a real PASS/FAIL status
## Real hardware validation
**Status:** waiting for access to hardware.
What still needs to be confirmed in practice:
- `bee sat nvidia` on a real NVIDIA GPU host
- `bee sat storage` on an NVMe/SATA/RAID host
- `ipmitool sdr` parsing on a server with a real BMC/IPMI
- vendor RAID tooling (`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli`) in the live ISO
## SAT result polish
**Status:** partially closed.
What can still be improved after field testing:
- classify vendor-specific self-test outputs more precisely in `storage SAT`
- tune the `memtester` defaults to the RAM size of the target machines
- if needed, extend `bee-gpu-stress` in duration/load
## Hardware Contract backlog
**Status:** clarified and reduced to a `bee`-only snapshot scope.
### Not backlog for `bee`
These tasks must not be implemented in `bee`, because they belong to the centralized ingest/lifecycle layer:
- `status_history`
- `status_changed_at`
- detecting component replacement between snapshots
- timeline/lifecycle/history derived from diffs between exports
`bee` is responsible only for the current hardware snapshot and `status_checked_at`.
### Can be implemented incrementally
These fields can be developed further as real sample outputs and vendor-specific parsers become available:
- `cpus.correctable_error_count`
- `cpus.uncorrectable_error_count`
- `power_supplies.life_remaining_pct`
- `power_supplies.life_used_pct`
- `pcie_devices.battery_charge_pct`
- `pcie_devices.battery_health_pct`
- `pcie_devices.battery_temperature_c`
- `pcie_devices.battery_voltage_v`
- `pcie_devices.battery_replace_required`
### Vendor/platform-specific, often empty
These fields may be left empty on some platforms even after the parsers are implemented:
- `power_supplies.life_remaining_pct`
- `power_supplies.life_used_pct`
- some of the `pcie_devices.battery_*` fields for unsupported RAID/NIC/GPU vendors
### Unsupported in `bee`
These fields are considered unrealistic for a general OS-level hardware snapshotter without synthetic/fake data:
- `cpus.life_remaining_pct`
- `cpus.life_used_pct`
- `memory.life_remaining_pct`
- `memory.life_used_pct`
- `memory.spare_blocks_remaining_pct`
- `memory.performance_degraded`
Reason: an ordinary Linux-host audit usually has no honest vendor-neutral runtime source for these metrics.
These fields are considered dropped from the `bee` backlog and must not return to the work plan unless a new, provable local data source appears on the target machines.


@@ -0,0 +1,793 @@
---
title: Hardware Ingest JSON Contract
version: "2.7"
updated: "2026-03-15"
maintainer: Reanimator Core
audience: external-integrators, ai-agents
language: ru
---
# Integration with Reanimator: hardware JSON import contract
Version: **2.7** · Date: **2026-03-15**
This document describes the JSON format for submitting server hardware data to **Reanimator**, a hardware lifecycle management system.
It is intended for developers of adjacent systems (Redfish collectors, monitoring agents, CMDB exporters) and may be included in the documentation of integrating projects.
> Current version of this document: https://git.mchus.pro/reanimator/core/src/branch/main/bible-local/docs/hardware-ingest-contract.md
---
## Changelog
| Version | Date | Changes |
|--------|------|-----------|
| 2.7 | 2026-03-15 | Data synthesis in `event_logs` is now explicitly forbidden; integrators must not invent component serial numbers when the source did not provide them |
| 2.6 | 2026-03-15 | Added the optional `event_logs` section for dedup/upsert of `host` / `bmc` / `redfish` logs outside the history timeline |
| 2.5 | 2026-03-15 | Added the shared optional field `manufactured_year_week` for component sections (`YYYY-Www`) |
| 2.4 | 2026-03-15 | Added the first wave of component telemetry: health/life fields for `cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies` |
| 2.3 | 2026-03-15 | Added component telemetry fields: `pcie_devices.temperature_c`, `pcie_devices.power_w`, `power_supplies.temperature_c` |
| 2.2 | 2026-03-15 | Added the `numa_node` field on `pcie_devices` for topology/affinity |
| 2.1 | 2026-03-15 | Added the `sensors` section (fans, power, temperatures, other); the `mac_addresses` field on `pcie_devices`; extended the list of `device_class` values |
| 2.0 | 2026-02-01 | Status history (`status_history`, `status_changed_at`); telemetry fields for PSUs; async job response |
| 1.0 | 2026-01-01 | Initial version of the contract |
---
## Principles
1. **Snapshot**: the JSON describes the state of the server at collection time. It may include the history of component status changes.
2. **Idempotency**: resending an identical payload does not create duplicates (deduplication by hash).
3. **Partial payloads**: you may send only the sections for which data is available. An empty array and a missing section are equivalent.
4. **Strict schema**: the endpoint uses a strict JSON decoder; unknown fields result in `400 Bad Request`.
5. **Event-driven**: an import creates timeline events (LOG_COLLECTED, INSTALLED, REMOVED, FIRMWARE_CHANGED, and others).
6. **No synthesis by the integrator**: the collector sends only values it actually collected. Do not invent `serial_number`, `component_ref`, `message`, `message_id`, or other identifiers/attributes if the source did not provide them or the parser could not extract them reliably.
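
A strict decoder of the kind described in principle 4 can be approximated in Go with `json.Decoder.DisallowUnknownFields`. The sketch below is a client-side pre-flight check under assumptions: the struct is deliberately reduced to the required fields, so it would also reject legitimate optional fields of the full contract, and `decodeStrict` is a hypothetical name, not the endpoint's code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// Reduced, hypothetical subset of the payload: required fields only.
type board struct {
	SerialNumber string `json:"serial_number"`
}

type hardware struct {
	Board board `json:"board"`
}

type ingestRequest struct {
	CollectedAt string   `json:"collected_at"`
	Hardware    hardware `json:"hardware"`
}

// decodeStrict rejects unknown fields at any nesting level, then checks the
// two required values named by the contract.
func decodeStrict(payload string) (*ingestRequest, error) {
	dec := json.NewDecoder(strings.NewReader(payload))
	dec.DisallowUnknownFields()
	var req ingestRequest
	if err := dec.Decode(&req); err != nil {
		return nil, err
	}
	if _, err := time.Parse(time.RFC3339, req.CollectedAt); err != nil {
		return nil, fmt.Errorf("collected_at: %w", err)
	}
	if req.Hardware.Board.SerialNumber == "" {
		return nil, errors.New("hardware.board.serial_number is required")
	}
	return &req, nil
}

func main() {
	ok := `{"collected_at":"2026-02-10T15:30:00Z","hardware":{"board":{"serial_number":"21D634101"}}}`
	bad := `{"collected_at":"2026-02-10T15:30:00Z","hardware":{"board":{"serial_number":"21D634101","color":"red"}}}`
	_, err := decodeStrict(ok)
	fmt.Println("valid payload error:", err)
	_, err = decodeStrict(bad)
	fmt.Println("unknown field error:", err)
}
```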
---
## Endpoint
```
POST /ingest/hardware
Content-Type: application/json
```
**Response on acceptance (202 Accepted):**
```json
{
"status": "accepted",
"job_id": "job_01J..."
}
```
The import runs asynchronously. The result is available at:
```
GET /ingest/hardware/jobs/{job_id}
```
**Response when the job succeeds:**
```json
{
"status": "success",
"bundle_id": "lb_01J...",
"asset_id": "mach_01J...",
"collected_at": "2026-02-10T15:30:00Z",
"duplicate": false,
"summary": {
"parts_observed": 15,
"parts_created": 2,
"parts_updated": 13,
"installations_created": 2,
"installations_closed": 1,
"timeline_events_created": 9,
"failure_events_created": 1
}
}
```
**Response for a duplicate:**
```json
{
"status": "success",
"duplicate": true,
"message": "LogBundle with this content hash already exists"
}
```
**Response on error (400 Bad Request):**
```json
{
"status": "error",
"error": "validation_failed",
"details": {
"field": "hardware.board.serial_number",
"message": "serial_number is required"
}
}
```
Common causes of `400`:
- Invalid `collected_at` format (RFC3339 is required).
- Empty `hardware.board.serial_number`.
- An unknown JSON field at any level.
- The request body exceeds the allowed size.
---
## Top-level structure
```json
{
"filename": "redfish://10.10.10.103",
"source_type": "api",
"protocol": "redfish",
"target_host": "10.10.10.103",
"collected_at": "2026-02-10T15:30:00Z",
"hardware": {
"board": { ... },
"firmware": [ ... ],
"cpus": [ ... ],
"memory": [ ... ],
"storage": [ ... ],
"pcie_devices": [ ... ],
"power_supplies": [ ... ],
"sensors": { ... },
"event_logs": [ ... ]
}
}
```
### Top-level fields
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `collected_at` | string RFC3339 | **yes** | Data collection time |
| `hardware` | object | **yes** | Hardware snapshot |
| `hardware.board.serial_number` | string | **yes** | Board/server serial number |
| `target_host` | string | no | IP or hostname |
| `source_type` | string | no | Source type: `api`, `logfile`, `manual` |
| `protocol` | string | no | Protocol: `redfish`, `ipmi`, `snmp`, `ssh` |
| `filename` | string | no | Source identifier |
---
## Common component status fields
These apply to all component sections (`cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies`).
| Field | Type | Description |
|------|-----|----------|
| `status` | string | Current status: `OK`, `Warning`, `Critical`, `Unknown`, `Empty` |
| `status_checked_at` | string RFC3339 | Time of the last status check |
| `status_changed_at` | string RFC3339 | Time of the last status change |
| `status_history` | array | History of status transitions (see below) |
| `error_description` | string | Error/diagnostic text |
| `manufactured_year_week` | string | Manufacturing date in `YYYY-Www` format, for example `2024-W07` |
**`status_history[]` object:**
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `status` | string | **yes** | Status at that moment |
| `changed_at` | string RFC3339 | **yes** | Transition time (an entry without this field is ignored) |
| `details` | string | no | Explanation of the transition |
**Event time priority rules:**
1. `status_changed_at`
2. The last `status_history` entry with a matching status
3. The last parseable `status_history` entry
4. `status_checked_at`
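
The priority order above can be sketched as a fallback chain. This is an illustration of the documented rules, not the ingest implementation: the types and `eventTime` are hypothetical, and "last" is taken to mean last in array order, assuming `status_history` arrives sorted by `changed_at` ascending.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical reduced types for this sketch.
type historyEntry struct {
	Status    string
	ChangedAt string
}

type component struct {
	Status          string
	StatusCheckedAt string
	StatusChangedAt string
	History         []historyEntry
}

// parse reports whether s is a valid RFC3339 timestamp.
func parse(s string) (time.Time, bool) {
	t, err := time.Parse(time.RFC3339, s)
	return t, err == nil
}

// eventTime applies the four rules in order and reports false when none
// of them yields a parseable timestamp.
func eventTime(c component) (time.Time, bool) {
	// 1. status_changed_at
	if t, ok := parse(c.StatusChangedAt); ok {
		return t, true
	}
	// 2. last history entry whose status matches the current status
	for i := len(c.History) - 1; i >= 0; i-- {
		if c.History[i].Status == c.Status {
			if t, ok := parse(c.History[i].ChangedAt); ok {
				return t, true
			}
		}
	}
	// 3. last parseable history entry
	for i := len(c.History) - 1; i >= 0; i-- {
		if t, ok := parse(c.History[i].ChangedAt); ok {
			return t, true
		}
	}
	// 4. status_checked_at
	return parse(c.StatusCheckedAt)
}

// eventTimeString renders the result as RFC3339 in UTC, or "" if unknown.
func eventTimeString(c component) string {
	t, ok := eventTime(c)
	if !ok {
		return ""
	}
	return t.UTC().Format(time.RFC3339)
}

func main() {
	c := component{
		Status:          "Warning",
		StatusCheckedAt: "2026-02-10T15:28:00Z",
		History: []historyEntry{
			{"OK", "2026-01-01T00:00:00Z"},
			{"Warning", "2026-02-01T00:00:00Z"},
		},
	}
	fmt.Println(eventTimeString(c))
}
```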
**Rules for sending statuses:**
- Send `status` as the current state of the component in the snapshot.
- If the source keeps history, send `status_history` sorted by `changed_at` in ascending order.
- Do not include `status_history` entries without `changed_at`.
- All dates are RFC3339; UTC (`Z`) is recommended.
- Use `manufactured_year_week` when the source knows only the year and week of manufacture, not the exact calendar date.
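
A value in the `YYYY-Www` shape can be produced and checked as sketched below. The contract only shows the format and the example `2024-W07`; treating the week as an ISO 8601 week number is an assumption of this sketch, and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// yearWeekRe accepts weeks 01 through 53.
var yearWeekRe = regexp.MustCompile(`^\d{4}-W(0[1-9]|[1-4]\d|5[0-3])$`)

// validYearWeek checks the YYYY-Www shape only; it does not verify that the
// given year actually has a week 53.
func validYearWeek(s string) bool {
	return yearWeekRe.MatchString(s)
}

// yearWeek formats a date using ISO 8601 week numbering (assumed here).
func yearWeek(t time.Time) string {
	y, w := t.ISOWeek()
	return fmt.Sprintf("%04d-W%02d", y, w)
}

// yearWeekOf is a convenience wrapper over calendar components.
func yearWeekOf(year, month, day int) string {
	return yearWeek(time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC))
}

func main() {
	s := yearWeekOf(2024, 2, 14)
	fmt.Println(s, validYearWeek(s))
}
```

Per the "no synthesis" principle, a collector would only emit this field when the source really reports a manufacturing year and week; the helper is for formatting, not for guessing.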
---
## Hardware sections
### board
Basic information about the server. This section is required.
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `serial_number` | string | **yes** | Serial number (the Asset identification key) |
| `manufacturer` | string | no | Manufacturer |
| `product_name` | string | no | Model |
| `part_number` | string | no | Part number |
| `uuid` | string | no | System UUID |
The value `"NULL"` in string fields is treated as missing data.
```json
"board": {
"manufacturer": "Supermicro",
"product_name": "X12DPG-QT6",
"serial_number": "21D634101",
"part_number": "X12DPG-QT6-REV1.01",
"uuid": "d7ef2fe5-2fd0-11f0-910a-346f11040868"
}
```
---
### firmware
Firmware versions of system components (BIOS, BMC, CPLD, and others).
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `device_name` | string | **yes** | Device name (`BIOS`, `BMC`, `CPLD`, …) |
| `version` | string | **yes** | Firmware version |
Entries with an empty `device_name` or `version` are ignored.
A version change creates a `FIRMWARE_CHANGED` event for the Asset.
```json
"firmware": [
{ "device_name": "BIOS", "version": "06.08.05" },
{ "device_name": "BMC", "version": "5.17.00" },
{ "device_name": "CPLD", "version": "01.02.03" }
]
```
---
### cpus
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `socket` | int | **yes** | Socket number (used to generate the serial) |
| `model` | string | no | Processor model |
| `manufacturer` | string | no | Manufacturer |
| `cores` | int | no | Number of cores |
| `threads` | int | no | Number of threads |
| `frequency_mhz` | int | no | Current frequency |
| `max_frequency_mhz` | int | no | Maximum frequency |
| `temperature_c` | float | no | CPU temperature, °C (telemetry) |
| `power_w` | float | no | Current CPU power, W (telemetry) |
| `throttled` | bool | no | Thermal/power throttling was detected |
| `correctable_error_count` | int | no | Number of correctable CPU errors |
| `uncorrectable_error_count` | int | no | Number of uncorrectable CPU errors |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `serial_number` | string | no | Serial number (if available) |
| `firmware` | string | no | Microcode version; if the logger reports `Microcode level`, pass it here as-is |
| `present` | bool | no | Presence (default `true`) |
| + common status fields | | | see the section above |
**serial_number generation when absent:** `{board_serial}-CPU-{socket}`
If the source uses a `Microcode level` field/label, pass its value in `cpus[].firmware` without any additional transformation.
```json
"cpus": [
{
"socket": 0,
"model": "INTEL(R) XEON(R) GOLD 6530",
"cores": 32,
"threads": 64,
"frequency_mhz": 2100,
"max_frequency_mhz": 4000,
"temperature_c": 61.5,
"power_w": 182.0,
"throttled": false,
"manufacturer": "Intel",
"status": "OK",
"status_checked_at": "2026-02-10T15:28:00Z"
}
]
```
---
### memory
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `slot` | string | no | Slot identifier |
| `present` | bool | no | Module presence (default `true`) |
| `serial_number` | string | no | Serial number |
| `part_number` | string | no | Part number (used as the model) |
| `manufacturer` | string | no | Manufacturer |
| `size_mb` | int | no | Size in MB |
| `type` | string | no | Type: `DDR3`, `DDR4`, `DDR5`, … |
| `max_speed_mhz` | int | no | Maximum frequency |
| `current_speed_mhz` | int | no | Current frequency |
| `temperature_c` | float | no | DIMM/module temperature, °C (telemetry) |
| `correctable_ecc_error_count` | int | no | Number of correctable ECC errors |
| `uncorrectable_ecc_error_count` | int | no | Number of uncorrectable ECC errors |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `spare_blocks_remaining_pct` | float | no | Remaining spare blocks, % |
| `performance_degraded` | bool | no | Performance degradation was detected |
| `data_loss_detected` | bool | no | The source signals a risk or fact of data loss |
| + common status fields | | | see the section above |
A module without `serial_number` is ignored. A module with `present=false` or `status=Empty` is ignored.
```json
"memory": [
{
"slot": "CPU0_C0D0",
"present": true,
"size_mb": 32768,
"type": "DDR5",
"max_speed_mhz": 4800,
"current_speed_mhz": 4800,
"temperature_c": 43.0,
"correctable_ecc_error_count": 0,
"manufacturer": "Hynix",
"serial_number": "80AD032419E17CEEC1",
"part_number": "HMCG88AGBRA191N",
"status": "OK"
}
]
```
---
### storage
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `slot` | string | no | Canonical installation address of the PCIe device; pass the BDF (`0000:18:00.0`) |
| `serial_number` | string | no | Serial number |
| `model` | string | no | Model |
| `manufacturer` | string | no | Manufacturer |
| `type` | string | no | Type: `NVMe`, `SSD`, `HDD` |
| `interface` | string | no | Interface: `NVMe`, `SATA`, `SAS` |
| `size_gb` | int | no | Size in GB |
| `temperature_c` | float | no | Drive temperature, °C (telemetry) |
| `power_on_hours` | int64 | no | Power-on time, hours |
| `power_cycles` | int64 | no | Number of power cycles |
| `unsafe_shutdowns` | int64 | no | Unsafe shutdowns |
| `media_errors` | int64 | no | Media errors |
| `error_log_entries` | int64 | no | Number of entries in the error log |
| `written_bytes` | int64 | no | Total bytes written |
| `read_bytes` | int64 | no | Total bytes read |
| `life_used_pct` | float | no | Used life / wear, % |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `available_spare_pct` | float | no | Available spare, % |
| `reallocated_sectors` | int64 | no | Reallocated sectors |
| `current_pending_sectors` | int64 | no | Sectors pending remap |
| `offline_uncorrectable` | int64 | no | Uncorrectable errors found by offline scan |
| `firmware` | string | no | Firmware version |
| `present` | bool | no | Presence (default `true`) |
| + common status fields | | | see the section above |
A disk without `serial_number` is ignored. A change in `firmware` creates a `FIRMWARE_CHANGED` event.
```json
"storage": [
{
"slot": "OB01",
"type": "NVMe",
"model": "INTEL SSDPF2KX076T1",
"size_gb": 7680,
"temperature_c": 38.5,
"power_on_hours": 12450,
"unsafe_shutdowns": 3,
"written_bytes": 9876543210,
"life_remaining_pct": 91.0,
"serial_number": "BTAX41900GF87P6DGN",
"manufacturer": "Intel",
"firmware": "9CV10510",
"interface": "NVMe",
"present": true,
"status": "OK"
}
]
```
---
### pcie_devices
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `slot` | string | no | Slot identifier |
| `vendor_id` | int | no | PCI Vendor ID (decimal) |
| `device_id` | int | no | PCI Device ID (decimal) |
| `numa_node` | int | no | NUMA node / CPU affinity of the device |
| `temperature_c` | float | no | Device temperature, °C (telemetry) |
| `power_w` | float | no | Current device power draw, W (telemetry) |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `ecc_corrected_total` | int64 | no | Total correctable ECC errors |
| `ecc_uncorrected_total` | int64 | no | Total uncorrectable ECC errors |
| `hw_slowdown` | bool | no | Device has entered hardware slowdown / protective mode |
| `battery_charge_pct` | float | no | Battery / supercap charge, % |
| `battery_health_pct` | float | no | Battery / supercap health, % |
| `battery_temperature_c` | float | no | Battery / supercap temperature, °C |
| `battery_voltage_v` | float | no | Battery / supercap voltage, V |
| `battery_replace_required` | bool | no | Battery / supercap replacement required |
| `sfp_temperature_c` | float | no | SFP/optic temperature, °C |
| `sfp_tx_power_dbm` | float | no | TX optical power, dBm |
| `sfp_rx_power_dbm` | float | no | RX optical power, dBm |
| `sfp_voltage_v` | float | no | SFP voltage, V |
| `sfp_bias_ma` | float | no | SFP bias current, mA |
| `bdf` | string | no | Deprecated alias for `slot`; when present, ingest normalizes it into `slot` |
| `device_class` | string | no | Device class (see the list below) |
| `manufacturer` | string | no | Manufacturer |
| `model` | string | no | Model |
| `serial_number` | string | no | Serial number |
| `firmware` | string | no | Firmware version |
| `link_width` | int | no | Current link width |
| `link_speed` | string | no | Current speed: `Gen3`, `Gen4`, `Gen5` |
| `max_link_width` | int | no | Maximum link width |
| `max_link_speed` | string | no | Maximum speed |
| `mac_addresses` | string[] | no | Port MAC addresses (for network devices) |
| `present` | bool | no | Presence (defaults to `true`) |
| + common status fields | | | see the section above |
Pass `numa_node` for NIC / InfiniBand / RAID / GPU devices when the source knows the CPU/NUMA affinity. The field is stored in the snapshot attributes of the PCIe component and duplicated in telemetry for topology use cases.
Use `temperature_c` and `power_w` for device-level telemetry of GPUs / accelerators / smart PCIe devices. They do not affect component identification.
**Generating serial_number when it is missing or `"N/A"`:** `{board_serial}-PCIE-{slot}`, where `slot` for PCIe equals the BDF.
`slot` is the only canonical address of a component. For PCIe, pass the BDF in `slot`. The `bdf` field is kept only as a transitional input alias and must not be used as a separate coordinate alongside `slot`.
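The alias rule above can be sketched as a small normalization step. This is a minimal illustration, not the ingest implementation: the function names are hypothetical, the rule that an explicit `slot` wins over `bdf` when both are sent is an assumption, and so is the `domain:bus:device.function` pattern, which is taken only from the examples in this document.

```python
import re

# Assumption: BDF is written as domain:bus:device.function, as in the examples here.
BDF_RE = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def normalize_slot(device: dict) -> dict:
    """Return a copy of the device with the deprecated `bdf` alias folded into `slot`."""
    out = dict(device)
    bdf = out.pop("bdf", None)  # `bdf` never survives as a separate coordinate
    # Assumption: an explicit `slot` takes precedence when both fields are present.
    if bdf and not out.get("slot"):
        out["slot"] = bdf
    return out


def looks_like_bdf(slot: str) -> bool:
    """Cheap sanity check a collector could run before sending a PCIe `slot`."""
    return bool(BDF_RE.match(slot))
```

A collector that still emits `bdf` would therefore end up with the same `slot` value as one that sends `slot` directly.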
**`device_class` values:**
| Value | Purpose |
|-------|---------|
| `MassStorageController` | RAID controllers |
| `StorageController` | HBAs, SAS controllers |
| `NetworkController` | Network adapters (InfiniBand, generic) |
| `EthernetController` | Ethernet NICs |
| `FibreChannelController` | Fibre Channel HBAs |
| `VideoController` | GPUs, video cards |
| `ProcessingAccelerator` | Compute accelerators (AI/ML) |
| `DisplayController` | Display controllers (BMC VGA) |
The list is open: arbitrary strings are allowed for non-standard classes.
```json
"pcie_devices": [
{
"slot": "0000:3b:00.0",
"vendor_id": 5555,
"device_id": 4401,
"numa_node": 0,
"temperature_c": 48.5,
"power_w": 18.2,
"sfp_temperature_c": 36.2,
"sfp_tx_power_dbm": -1.8,
"sfp_rx_power_dbm": -2.1,
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",
"serial_number": "K65472-003",
"firmware": "9.20 0x8000d4ae",
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
"status": "OK"
}
]
```
---
### power_supplies
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `slot` | string | no | Slot identifier |
| `present` | bool | no | Presence (defaults to `true`) |
| `serial_number` | string | no | Serial number |
| `part_number` | string | no | Part number |
| `model` | string | no | Model |
| `vendor` | string | no | Manufacturer |
| `wattage_w` | int | no | Rated power in watts |
| `firmware` | string | no | Firmware version |
| `input_type` | string | no | Input type (for example `ACWideRange`) |
| `input_voltage` | float | no | Input voltage, V (telemetry) |
| `input_power_w` | float | no | Input power, W (telemetry) |
| `output_power_w` | float | no | Output power, W (telemetry) |
| `temperature_c` | float | no | PSU temperature, °C (telemetry) |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| + common status fields | | | see the section above |
Telemetry fields (`input_voltage`, `input_power_w`, `output_power_w`, `temperature_c`, `life_remaining_pct`, `life_used_pct`) are stored in the component attributes and do not affect its identification.
A PSU without `serial_number` is ignored.
```json
"power_supplies": [
{
"slot": "0",
"present": true,
"model": "GW-CRPS3000LW",
"vendor": "Great Wall",
"wattage_w": 3000,
"serial_number": "2P06C102610",
"firmware": "00.03.05",
"status": "OK",
"input_type": "ACWideRange",
"input_power_w": 137,
"output_power_w": 104,
"input_voltage": 215.25,
"temperature_c": 39.5,
"life_remaining_pct": 97.0
}
]
```
---
### sensors
Server sensor readings. The section is optional and is not tied to components.
Data is stored as the last known value (last-known-value) at the Asset level.
```json
"sensors": {
"fans": [ ... ],
"power": [ ... ],
"temperatures": [ ... ],
"other": [ ... ]
}
```
---
### event_logs
Normalized operational server logs from `host`, `bmc`, or `redfish`.
These records do not enter the history timeline and do not create history events. They are kept in a separate deduplicated log store and shown in a separate asset logs / host logs UI block.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `source` | string | **yes** | Log source: `host`, `bmc`, `redfish` |
| `event_time` | string RFC3339 | no | Event time from the source; if absent, the ingest/collection time is used |
| `severity` | string | no | Level: `OK`, `Info`, `Warning`, `Critical`, `Unknown` |
| `message_id` | string | no | Event identifier/code from the source |
| `message` | string | **yes** | Normalized event text |
| `component_ref` | string | no | Reference to the component/device/slot, if it can be extracted |
| `fingerprint` | string | no | Externally supplied dedup key; if not provided, the system computes its own |
| `is_active` | bool | no | Flag that the event is still active / not cleared, if the source supports a lifecycle |
| `raw_payload` | object | no | Raw vendor-specific payload for diagnostics |
**event_logs rules:**
- Logs are deduplicated within asset + source + fingerprint.
- If `fingerprint` is not provided, the system builds it from normalized fields (`source`, `message_id`, `message`, `component_ref`, time normalization).
- The integrator/log collector must not synthesize event content: do not invent `message`, `message_id`, `component_ref`, serial/device identifiers, or any other fields if they are absent from the source log or were not reliably extracted.
- Receiving the same event again updates `last_seen_at` / the repeat counter and must not create a new timeline/history event.
- `event_logs` are used for a separate UI view of logs and do not change the canonical state of components/asset by default.
```json
"event_logs": [
{
"source": "bmc",
"event_time": "2026-03-15T14:03:11Z",
"severity": "Warning",
"message_id": "0x000F",
"message": "Correctable ECC error threshold exceeded",
"component_ref": "CPU0_C0D0",
"raw_payload": {
"sensor": "DIMM_A1",
"sel_record_id": "0042"
}
},
{
"source": "redfish",
"event_time": "2026-03-15T14:03:20Z",
"severity": "Info",
"message_id": "OpenBMC.0.1.SystemReboot",
"message": "System reboot requested by administrator",
"component_ref": "Mainboard"
}
]
```
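The deduplication rules for `event_logs` can be sketched roughly as follows. The contract does not say how the system computes its own fingerprint, so everything specific here is an assumption made for illustration: SHA-256 as the hash, whitespace and case folding of `message`, truncating `event_time` to the minute as the "time normalization", and the `first_seen_at` / `repeat_count` field names. Only `last_seen_at` and the asset + source + fingerprint key come from the text above.

```python
import hashlib


def fingerprint(event: dict) -> str:
    """Use the supplied dedup key, else derive one from normalized fields."""
    if event.get("fingerprint"):
        return event["fingerprint"]
    parts = [
        event.get("source", ""),
        event.get("message_id", ""),
        " ".join(event.get("message", "").split()).lower(),  # assumed normalization
        event.get("component_ref", ""),
        event.get("event_time", "")[:16],  # assumed: RFC3339 truncated to the minute
    ]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def ingest(store: dict, asset_id: str, event: dict, seen_at: str) -> dict:
    """Insert a new log row or bump the counters of an existing one."""
    key = (asset_id, event["source"], fingerprint(event))
    row = store.get(key)
    if row is None:
        row = {"event": event, "first_seen_at": seen_at,
               "last_seen_at": seen_at, "repeat_count": 1}
        store[key] = row
    else:
        # A repeat only refreshes bookkeeping; no new history event is produced.
        row["last_seen_at"] = seen_at
        row["repeat_count"] += 1
    return row
```

Keeping the store keyed by the full tuple means the same fingerprint under two sources or two assets stays as separate rows.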
#### sensors.fans
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | **yes** | Sensor name, unique within the section |
| `location` | string | no | Physical location |
| `rpm` | int | no | Speed, RPM |
| `status` | string | no | Status: `OK`, `Warning`, `Critical`, `Unknown` |
#### sensors.power
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `voltage_v` | float | no | Voltage, V |
| `current_a` | float | no | Current, A |
| `power_w` | float | no | Power, W |
| `status` | string | no | Status |
#### sensors.temperatures
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `celsius` | float | no | Temperature, °C |
| `threshold_warning_celsius` | float | no | Warning threshold, °C |
| `threshold_critical_celsius` | float | no | Critical threshold, °C |
| `status` | string | no | Status |
#### sensors.other
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `value` | float | no | Value |
| `unit` | string | no | Unit of measurement |
| `status` | string | no | Status |
**sensors rules:**
- Sensor identifier: the pair `(sensor_type, name)`. For duplicates within one payload, the first occurrence is used.
- Sensors without `name` are ignored.
- On every import the values are overwritten (upsert by key).
```json
"sensors": {
"fans": [
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" },
{ "name": "FAN_CPU0", "location": "CPU0", "rpm": 5600, "status": "OK" }
],
"power": [
{ "name": "12V Rail", "location": "Mainboard", "voltage_v": 12.06, "status": "OK" },
{ "name": "PSU0 Input", "location": "PSU0", "voltage_v": 215.25, "current_a": 0.64, "power_w": 137.0, "status": "OK" }
],
"temperatures": [
{ "name": "CPU0 Temp", "location": "CPU0", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" },
{ "name": "Inlet Temp", "location": "Front", "celsius": 22.0, "threshold_warning_celsius": 40.0, "threshold_critical_celsius": 50.0, "status": "OK" }
],
"other": [
{ "name": "System Humidity", "value": 38.5, "unit": "%", "status": "OK" }
]
}
```
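The three sensor rules (key by `(sensor_type, name)`, first occurrence wins inside a payload, overwrite on every import) can be sketched as below. The in-memory `store` dict and the function name are hypothetical stand-ins for whatever the server actually persists.

```python
SENSOR_TYPES = ("fans", "power", "temperatures", "other")


def upsert_sensors(store: dict, sensors: dict) -> dict:
    """Apply one payload's `sensors` section to a last-known-value store."""
    seen = set()  # keys already taken from this payload
    for sensor_type in SENSOR_TYPES:
        for reading in sensors.get(sensor_type, []):
            name = reading.get("name")
            if not name:
                continue  # sensors without a name are ignored
            key = (sensor_type, name)
            if key in seen:
                continue  # duplicate inside one payload: keep the first occurrence
            seen.add(key)
            store[key] = reading  # overwrite whatever an earlier import stored
    return store
```

Because the key includes the sensor type, a fan and a temperature probe may share a name without colliding.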
---
## Component status handling
| Status | Behavior |
|--------|----------|
| `OK` | Normal processing |
| `Warning` | A `COMPONENT_WARNING` event is created |
| `Critical` | A `COMPONENT_FAILED` event is created, plus a record in `failure_events` |
| `Unknown` | The component is treated as working; a `COMPONENT_UNKNOWN` event is created |
| `Empty` | The component is not created/updated |
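The status table can be read as a lookup from status to ingest actions. The sketch below encodes it that way. The event names come from the table. The plan structure and function name are hypothetical. Upserting the component for `Warning` and `Critical`, and rejecting statuses outside the table, are assumptions: the table does not cover either case.

```python
# Event names are from the status table; the plan layout is illustrative only.
STATUS_PLAN = {
    "OK":       {"upsert": True,  "events": [],                    "failure_record": False},
    "Warning":  {"upsert": True,  "events": ["COMPONENT_WARNING"], "failure_record": False},
    "Critical": {"upsert": True,  "events": ["COMPONENT_FAILED"],  "failure_record": True},
    "Unknown":  {"upsert": True,  "events": ["COMPONENT_UNKNOWN"], "failure_record": False},
    "Empty":    {"upsert": False, "events": [],                    "failure_record": False},
}


def plan_for_status(status: str) -> dict:
    """Return the actions ingest would take for a component with this status."""
    try:
        return STATUS_PLAN[status]
    except KeyError:
        # Assumption: unlisted statuses are rejected rather than silently accepted.
        raise ValueError(f"unsupported status: {status!r}") from None
```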
---
## Handling missing serial_number
General rule for all sections: if the source did not return a serial number and the collector could not reliably extract it, the integrator must not substitute invented values, hashes, local placeholder identifiers, or "guessed" serial numbers. Only the server-side ingest fallback rules explicitly listed below are allowed.
| Type | Behavior |
|------|----------|
| CPU | Generated: `{board_serial}-CPU-{socket}` |
| PCIe | Generated: `{board_serial}-PCIE-{slot}` (if serial is `"N/A"` or empty; `slot` for PCIe = BDF) |
| Memory | Component is ignored |
| Storage | Component is ignored |
| PSU | Component is ignored |
If `serial_number` is not unique within one payload for the same `model`:
- The first occurrence keeps the original serial number.
- Each subsequent duplicate gets a placeholder: `NO_SN-XXXXXXXX`.
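The server-side fallback rules can be sketched as two helpers. This is an illustration under stated assumptions, not the actual ingest code. The function names and the `kind` labels are hypothetical. Treating `"N/A"` as missing only for PCIe follows the table literally. The contract does not say how `XXXXXXXX` is produced, so the eight hex characters taken from a SHA-256 digest are an assumption.

```python
import hashlib

# Types that get a generated serial: kind -> (label, field holding the position).
GENERATED = {"cpu": ("CPU", "socket"), "pcie": ("PCIE", "slot")}


def resolve_serial(kind: str, component: dict, board_serial: str):
    """Return the serial to store, or None when the component must be ignored."""
    serial = (component.get("serial_number") or "").strip()
    if kind == "pcie" and serial == "N/A":
        serial = ""  # the table lists "N/A" as missing for PCIe only
    if serial:
        return serial
    if kind in GENERATED:
        label, field = GENERATED[kind]
        return f"{board_serial}-{label}-{component[field]}"
    return None  # memory, storage, PSU: ignored without a serial


def mark_duplicates(components: list) -> list:
    """Replace repeated (model, serial_number) pairs with NO_SN-XXXXXXXX placeholders."""
    seen = set()
    result = []
    for index, comp in enumerate(components):
        comp = dict(comp)
        key = (comp.get("model"), comp["serial_number"])
        if key in seen:
            # Assumed placeholder scheme: 8 uppercase hex chars from a digest.
            digest = hashlib.sha256(f"{key}:{index}".encode("utf-8")).hexdigest()[:8].upper()
            comp["serial_number"] = f"NO_SN-{digest}"
        else:
            seen.add(key)
        result.append(comp)
    return result
```

Both helpers belong on the ingest side only; per the general rule above, a collector should send what the source reported and leave generation to the server.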
---
## Minimal valid example
```json
{
"collected_at": "2026-02-10T15:30:00Z",
"target_host": "192.168.1.100",
"hardware": {
"board": {
"serial_number": "SRV-001"
}
}
}
```
---
## Full example with status history
```json
{
"filename": "redfish://10.10.10.103",
"source_type": "api",
"protocol": "redfish",
"target_host": "10.10.10.103",
"collected_at": "2026-02-10T15:30:00Z",
"hardware": {
"board": {
"manufacturer": "Supermicro",
"product_name": "X12DPG-QT6",
"serial_number": "21D634101"
},
"firmware": [
{ "device_name": "BIOS", "version": "06.08.05" },
{ "device_name": "BMC", "version": "5.17.00" }
],
"cpus": [
{
"socket": 0,
"model": "INTEL(R) XEON(R) GOLD 6530",
"manufacturer": "Intel",
"cores": 32,
"threads": 64,
"status": "OK"
}
],
"storage": [
{
"slot": "OB01",
"type": "NVMe",
"model": "INTEL SSDPF2KX076T1",
"size_gb": 7680,
"serial_number": "BTAX41900GF87P6DGN",
"manufacturer": "Intel",
"firmware": "9CV10510",
"present": true,
"status": "OK",
"status_changed_at": "2026-02-10T15:22:00Z",
"status_history": [
{ "status": "Critical", "changed_at": "2026-02-10T15:10:00Z", "details": "I/O timeout on NVMe queue 3" },
{ "status": "OK", "changed_at": "2026-02-10T15:22:00Z", "details": "Recovered after controller reset" }
]
}
],
"pcie_devices": [
{
"slot": "0000:18:00.0",
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",
"serial_number": "K65472-003",
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
"status": "OK"
}
],
"power_supplies": [
{
"slot": "0",
"present": true,
"model": "GW-CRPS3000LW",
"vendor": "Great Wall",
"wattage_w": 3000,
"serial_number": "2P06C102610",
"firmware": "00.03.05",
"status": "OK",
"input_power_w": 137,
"output_power_w": 104,
"input_voltage": 215.25
}
],
"sensors": {
"fans": [
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" }
],
"power": [
{ "name": "12V Rail", "voltage_v": 12.06, "status": "OK" }
],
"temperatures": [
{ "name": "CPU0 Temp", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" }
],
"other": [
{ "name": "System Humidity", "value": 38.5, "unit": "%" }
]
}
}
}
```

internal/chart Submodule

Submodule internal/chart added at 05db6994d4


@@ -1,7 +1,6 @@
FROM debian:12
-ARG GO_VERSION=1.23.6
-ARG DEBIAN_KERNEL_ABI=6.1.0-43
+ARG GO_VERSION=1.24.0
ENV DEBIAN_FRONTEND=noninteractive
@@ -24,7 +23,7 @@ RUN apt-get update -qq && apt-get install -y \
gcc \
make \
perl \
-"linux-headers-${DEBIAN_KERNEL_ABI}-amd64" \
+linux-headers-amd64 \
&& rm -rf /var/lib/apt/lists/*
RUN arch="$(dpkg --print-architecture)" \


@@ -1,5 +1,8 @@
DEBIAN_VERSION=12
-DEBIAN_KERNEL_ABI=6.1.0-43
+DEBIAN_KERNEL_ABI=auto
NVIDIA_DRIVER_VERSION=590.48.01
-GO_VERSION=1.23.6
-AUDIT_VERSION=0.1.1
+NCCL_VERSION=2.28.9-1
+NCCL_CUDA_VERSION=13.0
+NCCL_SHA256=2e6faafd2c19cffc7738d9283976a3200ea9db9895907f337f0c7e5a25563186
+GO_VERSION=1.24.0
+AUDIT_VERSION=1.0.0


@@ -7,6 +7,15 @@ set -e
. "$(dirname "$0")/../VERSIONS"
# Pin the exact kernel ABI detected by build.sh so the ISO kernel matches
# the kernel headers used to compile NVIDIA modules. Falls back to meta-package
# when lb config is run manually without the environment variable.
if [ -n "${BEE_KERNEL_ABI:-}" ] && [ "${BEE_KERNEL_ABI}" != "auto" ]; then
LB_LINUX_PACKAGES="linux-image-${BEE_KERNEL_ABI}"
else
LB_LINUX_PACKAGES="linux-image"
fi
lb config noauto \
--distribution bookworm \
--architectures amd64 \
@@ -19,10 +28,10 @@ lb config noauto \
--mirror-binary "https://deb.debian.org/debian" \
--security true \
--linux-flavours "amd64" \
---linux-packages "linux-image-${DEBIAN_KERNEL_ABI}" \
+--linux-packages "${LB_LINUX_PACKAGES}" \
--memtest none \
---iso-volume "BEE-DEBUG" \
---iso-application "Bee Hardware Audit" \
---bootappend-live "boot=live components console=tty0 console=ttyS0,115200n8 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
+--iso-volume "EASY-BEE" \
+--iso-application "EASY-BEE" \
+--bootappend-live "boot=live components console=ttyS0,115200n8 console=ttyS1,115200n8 loglevel=7 systemd.log_target=console systemd.journald.forward_to_console=1 systemd.journald.max_level_console=debug username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
--apt-recommends false \
"${@}"


@@ -0,0 +1,314 @@
#define _POSIX_C_SOURCE 200809L
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
typedef int CUdevice;
typedef uint64_t CUdeviceptr;
typedef int CUresult;
typedef void *CUcontext;
typedef void *CUmodule;
typedef void *CUfunction;
typedef void *CUstream;
#define CU_SUCCESS 0
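/*
 * PTX kernel "burn": each thread loads one 32-bit word from the buffer,
 * applies the linear congruential step x = x * 1664525 + 1013904223
 * `rounds` times, and stores the result back to the same location.
 * Threads whose global index is >= `words` exit immediately.
 */
static const char *ptx_source =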
".version 6.0\n"
".target sm_30\n"
".address_size 64\n"
"\n"
".visible .entry burn(\n"
" .param .u64 data,\n"
" .param .u32 words,\n"
" .param .u32 rounds\n"
")\n"
"{\n"
" .reg .pred %p<2>;\n"
" .reg .b32 %r<8>;\n"
" .reg .b64 %rd<5>;\n"
"\n"
" ld.param.u64 %rd1, [data];\n"
" ld.param.u32 %r1, [words];\n"
" ld.param.u32 %r2, [rounds];\n"
" mov.u32 %r3, %ctaid.x;\n"
" mov.u32 %r4, %ntid.x;\n"
" mov.u32 %r5, %tid.x;\n"
" mad.lo.s32 %r0, %r3, %r4, %r5;\n"
" setp.ge.u32 %p0, %r0, %r1;\n"
" @%p0 bra DONE;\n"
" mul.wide.u32 %rd2, %r0, 4;\n"
" add.s64 %rd3, %rd1, %rd2;\n"
" ld.global.u32 %r6, [%rd3];\n"
"LOOP:\n"
" setp.eq.u32 %p1, %r2, 0;\n"
" @%p1 bra STORE;\n"
" mad.lo.u32 %r6, %r6, 1664525, 1013904223;\n"
" sub.u32 %r2, %r2, 1;\n"
" bra LOOP;\n"
"STORE:\n"
" st.global.u32 [%rd3], %r6;\n"
"DONE:\n"
" ret;\n"
"}\n";
typedef CUresult (*cuInit_fn)(unsigned int);
typedef CUresult (*cuDeviceGetCount_fn)(int *);
typedef CUresult (*cuDeviceGet_fn)(CUdevice *, int);
typedef CUresult (*cuDeviceGetName_fn)(char *, int, CUdevice);
typedef CUresult (*cuCtxCreate_fn)(CUcontext *, unsigned int, CUdevice);
typedef CUresult (*cuCtxDestroy_fn)(CUcontext);
typedef CUresult (*cuCtxSynchronize_fn)(void);
typedef CUresult (*cuMemAlloc_fn)(CUdeviceptr *, size_t);
typedef CUresult (*cuMemFree_fn)(CUdeviceptr);
typedef CUresult (*cuMemcpyHtoD_fn)(CUdeviceptr, const void *, size_t);
typedef CUresult (*cuMemcpyDtoH_fn)(void *, CUdeviceptr, size_t);
typedef CUresult (*cuModuleLoadDataEx_fn)(CUmodule *, const void *, unsigned int, void *, void *);
typedef CUresult (*cuModuleGetFunction_fn)(CUfunction *, CUmodule, const char *);
typedef CUresult (*cuLaunchKernel_fn)(CUfunction,
unsigned int,
unsigned int,
unsigned int,
unsigned int,
unsigned int,
unsigned int,
unsigned int,
CUstream,
void **,
void **);
typedef CUresult (*cuGetErrorName_fn)(CUresult, const char **);
typedef CUresult (*cuGetErrorString_fn)(CUresult, const char **);
struct cuda_api {
void *lib;
cuInit_fn cuInit;
cuDeviceGetCount_fn cuDeviceGetCount;
cuDeviceGet_fn cuDeviceGet;
cuDeviceGetName_fn cuDeviceGetName;
cuCtxCreate_fn cuCtxCreate;
cuCtxDestroy_fn cuCtxDestroy;
cuCtxSynchronize_fn cuCtxSynchronize;
cuMemAlloc_fn cuMemAlloc;
cuMemFree_fn cuMemFree;
cuMemcpyHtoD_fn cuMemcpyHtoD;
cuMemcpyDtoH_fn cuMemcpyDtoH;
cuModuleLoadDataEx_fn cuModuleLoadDataEx;
cuModuleGetFunction_fn cuModuleGetFunction;
cuLaunchKernel_fn cuLaunchKernel;
cuGetErrorName_fn cuGetErrorName;
cuGetErrorString_fn cuGetErrorString;
};
static int load_symbol(void *lib, const char *name, void **out) {
*out = dlsym(lib, name);
return *out != NULL;
}
static int load_cuda(struct cuda_api *api) {
memset(api, 0, sizeof(*api));
api->lib = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
if (!api->lib) {
return 0;
}
return
load_symbol(api->lib, "cuInit", (void **)&api->cuInit) &&
load_symbol(api->lib, "cuDeviceGetCount", (void **)&api->cuDeviceGetCount) &&
load_symbol(api->lib, "cuDeviceGet", (void **)&api->cuDeviceGet) &&
load_symbol(api->lib, "cuDeviceGetName", (void **)&api->cuDeviceGetName) &&
load_symbol(api->lib, "cuCtxCreate_v2", (void **)&api->cuCtxCreate) &&
load_symbol(api->lib, "cuCtxDestroy_v2", (void **)&api->cuCtxDestroy) &&
load_symbol(api->lib, "cuCtxSynchronize", (void **)&api->cuCtxSynchronize) &&
load_symbol(api->lib, "cuMemAlloc_v2", (void **)&api->cuMemAlloc) &&
load_symbol(api->lib, "cuMemFree_v2", (void **)&api->cuMemFree) &&
load_symbol(api->lib, "cuMemcpyHtoD_v2", (void **)&api->cuMemcpyHtoD) &&
load_symbol(api->lib, "cuMemcpyDtoH_v2", (void **)&api->cuMemcpyDtoH) &&
load_symbol(api->lib, "cuModuleLoadDataEx", (void **)&api->cuModuleLoadDataEx) &&
load_symbol(api->lib, "cuModuleGetFunction", (void **)&api->cuModuleGetFunction) &&
load_symbol(api->lib, "cuLaunchKernel", (void **)&api->cuLaunchKernel);
}
static const char *cu_error_name(struct cuda_api *api, CUresult rc) {
const char *value = NULL;
if (api->cuGetErrorName && api->cuGetErrorName(rc, &value) == CU_SUCCESS && value) {
return value;
}
return "CUDA_ERROR";
}
static const char *cu_error_string(struct cuda_api *api, CUresult rc) {
const char *value = NULL;
if (api->cuGetErrorString && api->cuGetErrorString(rc, &value) == CU_SUCCESS && value) {
return value;
}
return "unknown";
}
static int check_rc(struct cuda_api *api, const char *step, CUresult rc) {
if (rc == CU_SUCCESS) {
return 1;
}
fprintf(stderr, "%s failed: %s (%s)\n", step, cu_error_name(api, rc), cu_error_string(api, rc));
return 0;
}
static double now_seconds(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return (double)ts.tv_sec + ((double)ts.tv_nsec / 1000000000.0);
}
int main(int argc, char **argv) {
int seconds = 5;
int size_mb = 64;
for (int i = 1; i < argc; i++) {
if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) {
seconds = atoi(argv[++i]);
} else if ((strcmp(argv[i], "--size-mb") == 0 || strcmp(argv[i], "-m") == 0) && i + 1 < argc) {
size_mb = atoi(argv[++i]);
} else {
fprintf(stderr, "usage: %s [--seconds N] [--size-mb N]\n", argv[0]);
return 2;
}
}
if (seconds <= 0) {
seconds = 5;
}
if (size_mb <= 0) {
size_mb = 64;
}
struct cuda_api api;
if (!load_cuda(&api)) {
fprintf(stderr, "failed to load libcuda.so.1 or required Driver API symbols\n");
return 1;
}
load_symbol(api.lib, "cuGetErrorName", (void **)&api.cuGetErrorName);
load_symbol(api.lib, "cuGetErrorString", (void **)&api.cuGetErrorString);
if (!check_rc(&api, "cuInit", api.cuInit(0))) {
return 1;
}
int count = 0;
if (!check_rc(&api, "cuDeviceGetCount", api.cuDeviceGetCount(&count))) {
return 1;
}
if (count <= 0) {
fprintf(stderr, "no CUDA devices found\n");
return 1;
}
CUdevice dev = 0;
if (!check_rc(&api, "cuDeviceGet", api.cuDeviceGet(&dev, 0))) {
return 1;
}
char name[128] = {0};
if (!check_rc(&api, "cuDeviceGetName", api.cuDeviceGetName(name, (int)sizeof(name), dev))) {
return 1;
}
CUcontext ctx = NULL;
if (!check_rc(&api, "cuCtxCreate", api.cuCtxCreate(&ctx, 0, dev))) {
return 1;
}
size_t bytes = (size_t)size_mb * 1024 * 1024;
uint32_t words = (uint32_t)(bytes / sizeof(uint32_t));
if (words < 1024) {
words = 1024;
bytes = (size_t)words * sizeof(uint32_t);
}
uint32_t *host = (uint32_t *)malloc(bytes);
if (!host) {
fprintf(stderr, "malloc failed\n");
api.cuCtxDestroy(ctx);
return 1;
}
for (uint32_t i = 0; i < words; i++) {
host[i] = i ^ 0x12345678u;
}
CUdeviceptr device_mem = 0;
if (!check_rc(&api, "cuMemAlloc", api.cuMemAlloc(&device_mem, bytes))) {
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
if (!check_rc(&api, "cuMemcpyHtoD", api.cuMemcpyHtoD(device_mem, host, bytes))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
CUmodule module = NULL;
if (!check_rc(&api, "cuModuleLoadDataEx", api.cuModuleLoadDataEx(&module, ptx_source, 0, NULL, NULL))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
CUfunction kernel = NULL;
if (!check_rc(&api, "cuModuleGetFunction", api.cuModuleGetFunction(&kernel, module, "burn"))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
unsigned int threads = 256;
unsigned int blocks = (words + threads - 1) / threads;
uint32_t rounds = 256;
void *params[] = {&device_mem, &words, &rounds};
double start = now_seconds();
double deadline = start + (double)seconds;
unsigned long iterations = 0;
while (now_seconds() < deadline) {
if (!check_rc(&api, "cuLaunchKernel",
api.cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1, 0, NULL, params, NULL))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
iterations++;
}
if (!check_rc(&api, "cuCtxSynchronize", api.cuCtxSynchronize())) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
if (!check_rc(&api, "cuMemcpyDtoH", api.cuMemcpyDtoH(host, device_mem, bytes))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
uint64_t checksum = 0;
for (uint32_t i = 0; i < words; i += words / 256 ? words / 256 : 1) {
checksum += host[i];
}
double elapsed = now_seconds() - start;
printf("device=%s\n", name);
printf("duration_s=%.2f\n", elapsed);
printf("buffer_mb=%d\n", size_mb);
printf("iterations=%lu\n", iterations);
printf("checksum=%llu\n", (unsigned long long)checksum);
printf("status=OK\n");
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 0;
}


@@ -1,5 +1,5 @@
#!/bin/sh
-# build-in-container.sh — build the bee ISO inside a Debian container.
+# build-in-container.sh — build the bee ISO inside the Debian builder container.
set -e
@@ -59,7 +59,6 @@ IMAGE_REF="${IMAGE_TAG}:debian${DEBIAN_VERSION}"
if [ "$REBUILD_IMAGE" = "1" ] || ! "$CONTAINER_TOOL" image inspect "${IMAGE_REF}" >/dev/null 2>&1; then
"$CONTAINER_TOOL" build \
--build-arg GO_VERSION="${GO_VERSION}" \
---build-arg DEBIAN_KERNEL_ABI="${DEBIAN_KERNEL_ABI}" \
-t "${IMAGE_REF}" \
"${BUILDER_DIR}"
else
@@ -70,6 +69,7 @@ set -- \
run --rm --privileged \
-v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \
@@ -83,6 +83,7 @@ if [ -n "$AUTH_KEYS" ]; then
-v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \
-v "${AUTH_KEYS_DIR}:/tmp/bee-authkeys:ro" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \

iso/builder/build-nccl.sh Executable file

@@ -0,0 +1,94 @@
#!/bin/sh
# build-nccl.sh — download and extract NCCL shared library for the LiveCD.
#
# Downloads libnccl2 .deb from NVIDIA's CUDA apt repository (Debian 12, x86_64)
# and extracts the shared library. Package integrity verified via sha256.
#
# Output is cached in DIST_DIR/nccl-<version>+cuda<cuda>/ so subsequent builds
# are instant unless NCCL_VERSION or NCCL_CUDA_VERSION changes.
#
# Output layout:
# $CACHE_DIR/lib/ — libnccl.so.* files
set -e
NCCL_VERSION="$1"
NCCL_CUDA_VERSION="$2"
DIST_DIR="$3"
EXPECTED_SHA256="$4"
[ -n "$NCCL_VERSION" ] || { echo "usage: $0 <nccl-version> <cuda-version> <dist-dir> [sha256]"; exit 1; }
[ -n "$NCCL_CUDA_VERSION" ] || { echo "usage: $0 <nccl-version> <cuda-version> <dist-dir> [sha256]"; exit 1; }
[ -n "$DIST_DIR" ] || { echo "usage: $0 <nccl-version> <cuda-version> <dist-dir> [sha256]"; exit 1; }
echo "=== NCCL ${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION} ==="
CACHE_DIR="${DIST_DIR}/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/nccl-downloads"
if [ -d "$CACHE_DIR/lib" ] && [ "$(ls "$CACHE_DIR/lib/"libnccl.so.* 2>/dev/null | wc -l)" -gt 0 ]; then
echo "=== NCCL cached, skipping download ==="
echo "cache: $CACHE_DIR"
echo "libs: $(ls "$CACHE_DIR/lib/" | wc -l) files"
exit 0
fi
REPO_BASE="https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64"
PKG_NAME="libnccl2_${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}_amd64.deb"
PKG_URL="${REPO_BASE}/${PKG_NAME}"
mkdir -p "$DOWNLOAD_CACHE_DIR"
DEB_FILE="${DOWNLOAD_CACHE_DIR}/${PKG_NAME}"
echo "=== downloading NCCL package ==="
echo "URL: ${PKG_URL}"
wget --show-progress -O "$DEB_FILE" "$PKG_URL"
if [ -n "$EXPECTED_SHA256" ]; then
echo "=== verifying sha256 ==="
ACTUAL_SHA256=$(sha256sum "$DEB_FILE" | awk '{print $1}')
if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
echo "ERROR: sha256 mismatch"
echo " expected: $EXPECTED_SHA256"
echo " actual: $ACTUAL_SHA256"
rm -f "$DEB_FILE"
exit 1
fi
echo "sha256 OK"
fi
echo "=== extracting NCCL libraries ==="
EXTRACT_TMP=$(mktemp -d)
trap 'rm -rf "$EXTRACT_TMP"' EXIT INT TERM
# .deb is an ar archive; data.tar.* contains the actual files
cd "$EXTRACT_TMP"
ar x "$DEB_FILE"
# Extract data tarball (xz, gz, or zst)
DATA_TAR=$(ls data.tar.* 2>/dev/null | head -1)
[ -n "$DATA_TAR" ] || { echo "ERROR: data.tar.* not found in .deb"; exit 1; }
tar xf "$DATA_TAR"
# Library lands in ./usr/lib/x86_64-linux-gnu/ or ./usr/lib/
mkdir -p "$CACHE_DIR/lib"
found=0
for f in $(find . -name 'libnccl.so.*' -not -type d 2>/dev/null); do
cp "$f" "$CACHE_DIR/lib/"
found=$((found + 1))
done
[ "$found" -gt 0 ] || { echo "ERROR: libnccl.so.* not found in package"; exit 1; }
# Create soname symlinks: libnccl.so.2 -> libnccl.so.<full>, libnccl.so -> libnccl.so.2
versioned=$(ls "$CACHE_DIR/lib/libnccl.so."[0-9][0-9.]* 2>/dev/null | head -1)
if [ -n "$versioned" ]; then
base=$(basename "$versioned")
ln -sf "$base" "$CACHE_DIR/lib/libnccl.so.2" 2>/dev/null || true
ln -sf "libnccl.so.2" "$CACHE_DIR/lib/libnccl.so" 2>/dev/null || true
fi
echo "=== NCCL extraction complete ==="
echo "cache: $CACHE_DIR"
ls -lh "$CACHE_DIR/lib/"


@@ -1,14 +1,13 @@
#!/bin/sh
-# build.sh — build bee ISO (Debian 12 / live-build)
-#
-# Single build script. Produces a bootable live ISO with SSH access, TUI, NVIDIA drivers.
-#
-# Run on Debian 12 builder VM as root after setup-builder.sh.
-# Usage:
-# sh iso/builder/build.sh [--authorized-keys /path/to/authorized_keys]
+# build.sh — internal ISO build entrypoint executed inside the builder container.
set -e
+if [ "${BEE_CONTAINER_BUILD:-0}" != "1" ]; then
+echo "build.sh must run inside iso/builder/build-in-container.sh" >&2
+exit 1
+fi
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
BUILDER_DIR="${REPO_ROOT}/iso/builder"
OVERLAY_DIR="${REPO_ROOT}/iso/overlay"
@@ -35,12 +34,82 @@ mkdir -p "${CACHE_ROOT}"
: "${GOMODCACHE:=${CACHE_ROOT}/go-mod}"
export GOCACHE GOMODCACHE
resolve_audit_version() {
if [ -n "${BEE_AUDIT_VERSION:-}" ]; then
echo "${BEE_AUDIT_VERSION}"
return 0
fi
tag="$(git -C "${REPO_ROOT}" describe --tags --match 'audit/v*' --abbrev=7 --dirty 2>/dev/null || true)"
if [ -z "${tag}" ]; then
tag="$(git -C "${REPO_ROOT}" describe --tags --match 'v*' --abbrev=7 --dirty 2>/dev/null || true)"
fi
case "${tag}" in
audit/v*)
echo "${tag#audit/v}"
return 0
;;
v*)
echo "${tag#v}"
return 0
;;
"")
;;
*)
echo "${tag}"
return 0
;;
esac
if [ -n "${AUDIT_VERSION:-}" ]; then
echo "${AUDIT_VERSION}"
return 0
fi
date +%Y%m%d
}
AUDIT_VERSION_EFFECTIVE="$(resolve_audit_version)"
# Auto-detect kernel ABI: refresh apt index, then query current linux-image-amd64 dependency.
# If headers for the detected ABI are not yet installed (kernel updated since image build),
# install them on the fly so NVIDIA modules and ISO kernel always match.
if [ -z "${DEBIAN_KERNEL_ABI}" ] || [ "${DEBIAN_KERNEL_ABI}" = "auto" ]; then
echo "=== refreshing apt index to detect current kernel ABI ==="
apt-get update -qq
DEBIAN_KERNEL_ABI=$(apt-cache depends linux-image-amd64 2>/dev/null \
| awk '/Depends:.*linux-image-[0-9]/{print $2}' \
| grep -oE '[0-9]+\.[0-9]+\.[0-9]+-[0-9]+' \
| head -1)
if [ -z "${DEBIAN_KERNEL_ABI}" ]; then
echo "ERROR: could not auto-detect kernel ABI from apt-cache" >&2
exit 1
fi
echo "=== kernel ABI: ${DEBIAN_KERNEL_ABI} ==="
fi
# Export detected ABI so that auto/config can pin the exact kernel package
# (prevents NVIDIA module/kernel mismatch if linux-image-amd64 meta-package
# gets updated between build.sh start and lb build chroot step)
export BEE_KERNEL_ABI="${DEBIAN_KERNEL_ABI}"
KVER="${DEBIAN_KERNEL_ABI}-amd64"
if [ ! -d "/usr/src/linux-headers-${KVER}" ]; then
echo "=== installing linux-headers-${KVER} (kernel updated since image build) ==="
apt-get install -y "linux-headers-${KVER}"
fi
echo "=== bee ISO build ==="
echo "Debian: ${DEBIAN_VERSION}, Kernel ABI: ${DEBIAN_KERNEL_ABI}, Go: ${GO_VERSION}"
echo "Audit version: ${AUDIT_VERSION_EFFECTIVE}"
echo ""
echo "=== syncing git submodules ==="
git -C "${REPO_ROOT}" submodule update --init --recursive
# --- compile bee binary (static, Linux amd64) ---
BEE_BIN="${DIST_DIR}/bee-linux-amd64"
GPU_STRESS_BIN="${DIST_DIR}/bee-gpu-stress-linux-amd64"
NEED_BUILD=1
if [ -f "$BEE_BIN" ]; then
NEWEST_SRC=$(find "${REPO_ROOT}/audit" -name '*.go' -newer "$BEE_BIN" | head -1)
@@ -52,7 +121,7 @@ if [ "$NEED_BUILD" = "1" ]; then
cd "${REPO_ROOT}/audit"
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 \
go build \
--ldflags "-s -w -X main.Version=${AUDIT_VERSION:-$(date +%Y%m%d)}" \
+-ldflags "-s -w -X main.Version=${AUDIT_VERSION_EFFECTIVE}" \
-o "$BEE_BIN" \
./cmd/bee
echo "binary: $BEE_BIN"
@@ -70,6 +139,22 @@ else
echo "=== bee binary up to date, skipping build ==="
fi
GPU_STRESS_NEED_BUILD=1
if [ -f "$GPU_STRESS_BIN" ] && [ "${BUILDER_DIR}/bee-gpu-stress.c" -ot "$GPU_STRESS_BIN" ]; then
GPU_STRESS_NEED_BUILD=0
fi
if [ "$GPU_STRESS_NEED_BUILD" = "1" ]; then
echo "=== building bee-gpu-stress ==="
gcc -O2 -s -Wall -Wextra \
-o "$GPU_STRESS_BIN" \
"${BUILDER_DIR}/bee-gpu-stress.c" \
-ldl
echo "binary: $GPU_STRESS_BIN"
else
echo "=== bee-gpu-stress up to date, skipping build ==="
fi
echo "=== preparing staged overlay ==="
rm -rf "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}"
mkdir -p "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}"
@@ -80,6 +165,7 @@ rm -f \
"${OVERLAY_STAGE_DIR}/etc/bee-release" \
"${OVERLAY_STAGE_DIR}/root/.ssh/authorized_keys" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
# --- inject authorized_keys for SSH access ---
@@ -119,13 +205,15 @@ fi
mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/bin"
cp "${DIST_DIR}/bee-linux-amd64" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
cp "${GPU_STRESS_BIN}" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress"
# --- inject smoketest into overlay so it runs directly on the live CD ---
cp "${BUILDER_DIR}/smoketest.sh" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
# --- vendor utilities (optional pre-fetched binaries) ---
for tool in storcli64 sas2ircu sas3ircu mstflint; do
for tool in storcli64 sas2ircu sas3ircu arcconf ssacli; do
if [ -f "${VENDOR_DIR}/${tool}" ]; then
cp "${VENDOR_DIR}/${tool}" "${OVERLAY_STAGE_DIR}/usr/local/bin/${tool}"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/${tool}" || true
@@ -164,18 +252,31 @@ if [ -d "${NVIDIA_CACHE}/firmware" ] && [ "$(ls -A "${NVIDIA_CACHE}/firmware" 2>
echo "=== firmware: $(ls "${OVERLAY_STAGE_DIR}/lib/firmware/nvidia/${NVIDIA_DRIVER_VERSION}/" | wc -l) files injected ==="
fi
# --- build / download NCCL ---
echo ""
echo "=== downloading NCCL ${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION} ==="
sh "${BUILDER_DIR}/build-nccl.sh" "${NCCL_VERSION}" "${NCCL_CUDA_VERSION}" "${DIST_DIR}" "${NCCL_SHA256:-}"
NCCL_CACHE="${DIST_DIR}/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}"
# Inject libnccl.so.* into overlay alongside other NVIDIA userspace libs
cp "${NCCL_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/"
echo "=== NCCL: $(ls "${NCCL_CACHE}/lib/" | wc -l) files injected into /usr/lib/ ==="
# --- embed build metadata ---
mkdir -p "${OVERLAY_STAGE_DIR}/etc"
BUILD_DATE="$(date +%Y-%m-%d)"
GIT_COMMIT="$(git -C "${REPO_ROOT}" rev-parse --short HEAD 2>/dev/null || echo unknown)"
cat > "${OVERLAY_STAGE_DIR}/etc/bee-release" <<EOF
BEE_ISO_VERSION=${AUDIT_VERSION}
BEE_AUDIT_VERSION=${AUDIT_VERSION}
BEE_ISO_VERSION=${AUDIT_VERSION_EFFECTIVE}
BEE_AUDIT_VERSION=${AUDIT_VERSION_EFFECTIVE}
BUILD_DATE=${BUILD_DATE}
GIT_COMMIT=${GIT_COMMIT}
DEBIAN_VERSION=${DEBIAN_VERSION}
DEBIAN_KERNEL_ABI=${DEBIAN_KERNEL_ABI}
NVIDIA_DRIVER_VERSION=${NVIDIA_DRIVER_VERSION}
NCCL_VERSION=${NCCL_VERSION}
NCCL_CUDA_VERSION=${NCCL_CUDA_VERSION}
EOF
# Patch motd with build info
@@ -209,7 +310,7 @@ lb build 2>&1
# live-build outputs live-image-amd64.hybrid.iso in LB_DIR
ISO_RAW="${LB_DIR}/live-image-amd64.hybrid.iso"
ISO_OUT="${DIST_DIR}/bee-debian${DEBIAN_VERSION}-v${AUDIT_VERSION}-amd64.iso"
ISO_OUT="${DIST_DIR}/bee-debian${DEBIAN_VERSION}-v${AUDIT_VERSION_EFFECTIVE}-amd64.iso"
if [ -f "$ISO_RAW" ]; then
cp "$ISO_RAW" "$ISO_OUT"
echo ""
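
The bee-gpu-stress step above decides whether to rebuild by comparing the C source against the existing binary with `-ot`. A minimal sketch of that check as a standalone helper follows; the name `needs_rebuild` is hypothetical and does not appear in build.sh, and `-ot` is a widely supported extension (dash, bash, busybox) rather than strict POSIX.

```shell
#!/bin/sh
# Hypothetical helper (not part of build.sh): return 0 when the output
# should be (re)built, 1 when it is up to date.
needs_rebuild() {
    src="$1"
    out="$2"
    [ -f "$out" ] || return 0           # no binary yet: build
    [ "$src" -ot "$out" ] && return 1   # source strictly older than binary: skip
    return 0                            # source newer (or same mtime): build
}
```

A call such as `needs_rebuild "${BUILDER_DIR}/bee-gpu-stress.c" "$GPU_STRESS_BIN" && gcc ...` would mirror the flag-based logic in the diff. Like the original, it tracks only a single source file, not headers or compiler flags.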

View File

@@ -1,12 +1,26 @@
source /boot/grub/config.cfg
menuentry "Bee Hardware Audit" {
linux @KERNEL_LIVE@ @APPEND_LIVE@
echo ""
echo " ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗"
echo " ██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝"
echo " █████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗"
echo " ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝"
echo " ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗"
echo " ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝"
echo ""
menuentry "EASY-BEE" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal
initrd @INITRD_LIVE@
}
menuentry "Bee Hardware Audit (fail-safe)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ memtest noapic noapm nodma nomce nolapic nosmp vga=normal
menuentry "EASY-BEE (NVIDIA GSP=off)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=gsp-off
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (fail-safe)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=gsp-off memtest noapic noapm nodma nomce nolapic nosmp vga=normal
initrd @INITRD_LIVE@
}

View File

@@ -1,4 +1,4 @@
desktop-image: "../splash.png"
desktop-color: "#000000"
title-color: "#f5a800"
title-font: "Unifont Regular 16"
title-text: ""

Binary file not shown.

View File

@@ -0,0 +1,18 @@
label live-@FLAVOUR@-normal
menu label ^EASY-BEE
menu default
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=normal
label live-@FLAVOUR@-gsp-off
menu label EASY-BEE (^NVIDIA GSP=off)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=gsp-off
label live-@FLAVOUR@-failsafe
menu label EASY-BEE (^fail-safe)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=gsp-off memtest noapic noapm nodma nomce nolapic nosmp vga=normal

View File

@@ -5,14 +5,33 @@ set -e
echo "=== bee chroot setup ==="
ensure_bee_console_user() {
if id bee >/dev/null 2>&1; then
usermod -d /home/bee -s /bin/sh bee 2>/dev/null || true
else
useradd -d /home/bee -m -s /bin/sh -U bee
fi
mkdir -p /home/bee
chown -R bee:bee /home/bee
echo "bee:eeb" | chpasswd
usermod -aG sudo bee 2>/dev/null || true
}
ensure_bee_console_user
# Enable bee services
systemctl enable bee-network.service
systemctl enable bee-nvidia.service
systemctl enable bee-preflight.service
systemctl enable bee-audit.service
systemctl enable bee-web.service
systemctl enable bee-sshsetup.service
systemctl enable ssh.service
systemctl enable qemu-guest-agent.service 2>/dev/null || true
systemctl enable serial-getty@ttyS0.service 2>/dev/null || true
systemctl enable serial-getty@ttyS1.service 2>/dev/null || true
systemctl enable bee-journal-mirror@ttyS1.service 2>/dev/null || true
# Ensure scripts are executable
chmod +x /usr/local/bin/bee-network.sh 2>/dev/null || true
@@ -21,12 +40,13 @@ chmod +x /usr/local/bin/bee-sshsetup 2>/dev/null || true
chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true
chmod +x /usr/local/bin/bee-tui 2>/dev/null || true
chmod +x /usr/local/bin/bee 2>/dev/null || true
chmod +x /usr/local/bin/bee-log-run 2>/dev/null || true
# Reload udev rules
udevadm control --reload-rules 2>/dev/null || true
# Create log directory
mkdir -p /var/log
# Create export directory
mkdir -p /appdata/bee/export
if [ -f /etc/sudoers.d/bee ]; then
chmod 0440 /etc/sudoers.d/bee

View File

@@ -0,0 +1,93 @@
#!/bin/sh
# 9001-amd-rocm.hook.chroot — install AMD ROCm SMI tool for Instinct GPU monitoring.
# Runs inside the live-build chroot. Adds AMD's apt repository and installs
# rocm-smi-lib which provides the `rocm-smi` CLI (analogous to nvidia-smi).
set -e
# ROCm versions to try in order (newest first). Fall back if a version's
# Release file is missing from the repo (happens with brand-new releases).
ROCM_CANDIDATES="6.4 6.3 6.2"
ROCM_KEYRING="/etc/apt/keyrings/rocm.gpg"
ROCM_LIST="/etc/apt/sources.list.d/rocm.list"
APT_UPDATED=0
mkdir -p /etc/apt/keyrings
ensure_tool() {
tool="$1"
pkg="$2"
if command -v "${tool}" >/dev/null 2>&1; then
return 0
fi
if [ "${APT_UPDATED}" -eq 0 ]; then
apt-get update -qq
APT_UPDATED=1
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends "${pkg}"
}
ensure_cert_bundle() {
if [ -s /etc/ssl/certs/ca-certificates.crt ]; then
return 0
fi
if [ "${APT_UPDATED}" -eq 0 ]; then
apt-get update -qq
APT_UPDATED=1
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ca-certificates
}
# live-build chroot may not include fetch/signing tools yet
if ! ensure_cert_bundle || ! ensure_tool wget wget || ! ensure_tool gpg gpg; then
echo "WARN: failed to install wget/gpg/ca-certificates prerequisites — skipping ROCm install"
exit 0
fi
# Download and import AMD GPG key
if ! wget -qO- "https://repo.radeon.com/rocm/rocm.gpg.key" \
| gpg --dearmor --yes --output "${ROCM_KEYRING}"; then
echo "WARN: failed to fetch AMD ROCm GPG key — skipping ROCm install"
exit 0
fi
# Try each ROCm version until apt-get update succeeds (repo has a Release file).
ROCM_VERSION=""
for candidate in ${ROCM_CANDIDATES}; do
cat > "${ROCM_LIST}" <<EOF
deb [arch=amd64 signed-by=${ROCM_KEYRING}] https://repo.radeon.com/rocm/apt/${candidate} bookworm main
EOF
if apt-get update -qq 2>/dev/null; then
ROCM_VERSION="${candidate}"
echo "=== AMD ROCm ${ROCM_VERSION}: repository available ==="
break
fi
echo "WARN: ROCm ${candidate} repository not available for bookworm, trying next..."
rm -f "${ROCM_LIST}"
done
if [ -z "${ROCM_VERSION}" ]; then
echo "WARN: no ROCm apt repository available for bookworm — skipping ROCm install"
rm -f "${ROCM_KEYRING}"
exit 0
fi
# rocm-smi-lib provides the rocm-smi CLI tool for GPU monitoring
if apt-get install -y --no-install-recommends rocm-smi-lib 2>/dev/null; then
echo "=== AMD ROCm: rocm-smi installed ==="
if [ -x /opt/rocm/bin/rocm-smi ]; then
ln -sf /opt/rocm/bin/rocm-smi /usr/local/bin/rocm-smi
else
smi_path="$(find /opt -path '*/bin/rocm-smi' -type f 2>/dev/null | sort | tail -1)"
if [ -n "${smi_path}" ]; then
ln -sf "${smi_path}" /usr/local/bin/rocm-smi
fi
fi
rocm-smi --version 2>/dev/null || true
else
echo "WARN: rocm-smi-lib install failed — GPU monitoring unavailable"
fi
# Clean up apt lists to keep ISO size down
rm -f "${ROCM_LIST}"
apt-get clean
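
The version fallback in this hook (try each entry of ROCM_CANDIDATES until `apt-get update` succeeds) is an instance of a "first candidate that passes a probe" loop. The sketch below isolates that pattern; `first_available` and `fake_repo_probe` are hypothetical names used for illustration, and the probe here is a stand-in, not a real repository check.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate for which the probe
# command succeeds; return 1 if no candidate passes.
first_available() {
    probe="$1"
    shift
    for candidate in "$@"; do
        if "$probe" "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Stand-in probe for illustration only: pretends that 6.4 has no
# Release file yet while 6.3 and 6.2 are published.
fake_repo_probe() {
    case "$1" in
        6.3|6.2) return 0 ;;
        *) return 1 ;;
    esac
}

first_available fake_repo_probe 6.4 6.3 6.2   # prints 6.3
```

In the hook itself the probe has side effects (it writes the apt source list and runs `apt-get update`), so the inline loop with explicit cleanup of `${ROCM_LIST}` on failure is a reasonable choice over a generic helper.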

View File

@@ -11,11 +11,17 @@ lshw
iproute2
isc-dhcp-client
iputils-ping
ethtool
lm-sensors
qemu-guest-agent
# SSH
openssh-server
# Filesystem support for USB export targets
exfatprogs
ntfs-3g
# Utilities
bash
procps
@@ -25,14 +31,19 @@ less
vim-tiny
mc
htop
nvtop
sudo
zstd
mstflint
memtester
stress-ng
# QR codes (for displaying audit results)
qrencode
# Firmware
firmware-linux-free
firmware-amd-graphics
# glibc compat helpers (for any external binaries that need it)
libc6

View File

@@ -1,75 +0,0 @@
#!/bin/sh
# setup-builder.sh — prepare Debian 12 host/VM as bee ISO builder
#
# Run once on a fresh Debian 12 (Bookworm) host/VM as root.
# After this script completes, the machine can build bee ISO images directly.
# Container alternative: use `iso/builder/build-in-container.sh`.
#
# Usage (on Debian VM):
# wget -O- https://git.mchus.pro/mchus/bee/raw/branch/main/iso/builder/setup-builder.sh | sh
# or: sh setup-builder.sh
set -e
. "$(dirname "$0")/VERSIONS" 2>/dev/null || true
GO_VERSION="${GO_VERSION:-1.23.6}"
DEBIAN_VERSION="${DEBIAN_VERSION:-12}"
DEBIAN_KERNEL_ABI="${DEBIAN_KERNEL_ABI:-6.1.0-28}"
echo "=== bee builder setup ==="
echo "Debian: $(cat /etc/debian_version)"
echo "Go target: ${GO_VERSION}"
echo "Kernel ABI: ${DEBIAN_KERNEL_ABI}"
echo ""
# --- system packages ---
export DEBIAN_FRONTEND=noninteractive
apt-get update -qq
apt-get install -y \
live-build \
debootstrap \
squashfs-tools \
xorriso \
grub-pc-bin \
grub-efi-amd64-bin \
mtools \
git \
wget \
curl \
tar \
xz-utils \
screen \
rsync \
build-essential \
gcc \
make \
perl \
"linux-headers-${DEBIAN_KERNEL_ABI}-amd64"
echo "linux-headers installed: $(dpkg -l "linux-headers-${DEBIAN_KERNEL_ABI}-amd64" | awk '/^ii/{print $3}')"
# --- Go toolchain ---
echo ""
echo "=== installing Go ${GO_VERSION} ==="
if [ -d /usr/local/go ] && /usr/local/go/bin/go version 2>/dev/null | grep -q "${GO_VERSION}"; then
echo "Go ${GO_VERSION} already installed"
else
ARCH=$(uname -m)
case "$ARCH" in
x86_64) GOARCH=amd64 ;;
aarch64) GOARCH=arm64 ;;
*) echo "unsupported arch: $ARCH"; exit 1 ;;
esac
wget -O /tmp/go.tar.gz \
"https://go.dev/dl/go${GO_VERSION}.linux-${GOARCH}.tar.gz"
rm -rf /usr/local/go
tar -C /usr/local -xzf /tmp/go.tar.gz
rm /tmp/go.tar.gz
fi
export PATH="$PATH:/usr/local/go/bin"
echo "Go: $(go version)"
echo ""
echo "=== builder setup complete ==="
echo "Next: sh iso/builder/build.sh"

View File

@@ -26,6 +26,15 @@ echo ""
KVER=$(uname -r)
info "kernel: $KVER"
NVIDIA_BOOT_MODE="normal"
for arg in $(cat /proc/cmdline 2>/dev/null); do
case "$arg" in
bee.nvidia.mode=*)
NVIDIA_BOOT_MODE="${arg#*=}"
;;
esac
done
info "nvidia boot mode: ${NVIDIA_BOOT_MODE}"
# --- PATH & binaries ---
echo "-- PATH & binaries --"
@@ -53,17 +62,25 @@ else
fail "NVIDIA ko dir missing: $KO_DIR"
fi
for mod in nvidia nvidia_modeset nvidia_uvm; do
if /sbin/lsmod 2>/dev/null | grep -q "^nvidia "; then
ok "module loaded: nvidia"
else
fail "module NOT loaded: nvidia"
fi
for mod in nvidia_modeset nvidia_uvm; do
if /sbin/lsmod 2>/dev/null | grep -q "^$mod "; then
ok "module loaded: $mod"
elif [ "${NVIDIA_BOOT_MODE}" = "normal" ] || [ "${NVIDIA_BOOT_MODE}" = "full" ]; then
fail "module NOT loaded in normal mode: $mod"
else
fail "module NOT loaded: $mod"
warn "module not loaded in GSP-off mode: $mod"
fi
done
echo ""
echo "-- NVIDIA device nodes --"
for dev in nvidiactl nvidia0 nvidia-uvm; do
for dev in nvidiactl nvidia0; do
if [ -e "/dev/$dev" ]; then
ok "/dev/$dev exists"
else
@@ -71,6 +88,14 @@ for dev in nvidiactl nvidia0 nvidia-uvm; do
fi
done
if [ -e /dev/nvidia-uvm ]; then
ok "/dev/nvidia-uvm exists"
elif [ "${NVIDIA_BOOT_MODE}" = "normal" ] || [ "${NVIDIA_BOOT_MODE}" = "full" ]; then
fail "/dev/nvidia-uvm missing in normal mode"
else
warn "/dev/nvidia-uvm missing — CUDA stress path may be unavailable until loaded on demand"
fi
echo ""
echo "-- nvidia-smi --"
if PATH="/usr/local/bin:$PATH" command -v nvidia-smi >/dev/null 2>&1; then
@@ -96,7 +121,7 @@ done
echo ""
echo "-- systemd services --"
for svc in bee-nvidia bee-network bee-audit; do
for svc in bee-nvidia bee-network bee-preflight bee-audit bee-web; do
if systemctl is-active --quiet "$svc" 2>/dev/null; then
ok "service active: $svc"
else
@@ -104,6 +129,20 @@ for svc in bee-nvidia bee-network bee-audit; do
fi
done
echo ""
echo "-- runtime health --"
if [ -f /appdata/bee/export/runtime-health.json ] && [ -s /appdata/bee/export/runtime-health.json ]; then
ok "runtime: runtime-health.json present and non-empty"
else
fail "runtime: runtime-health.json missing or empty"
fi
if [ -f /appdata/bee/export/runtime-health.log ]; then
info "last runtime log line: $(tail -1 /appdata/bee/export/runtime-health.log)"
else
warn "runtime: no log found at /appdata/bee/export/runtime-health.log"
fi
for svc in ssh bee-sshsetup; do
if systemctl is-active --quiet "$svc" 2>/dev/null \
|| systemctl show "$svc" --property=ActiveState 2>/dev/null | grep -q "inactive\|exited"; then
@@ -126,29 +165,43 @@ fi
echo ""
echo "-- audit last run --"
if [ -f /var/log/bee-audit.json ] && [ -s /var/log/bee-audit.json ]; then
if [ -f /appdata/bee/export/bee-audit.json ] && [ -s /appdata/bee/export/bee-audit.json ]; then
ok "audit: bee-audit.json present and non-empty"
info "size: $(du -sh /var/log/bee-audit.json | cut -f1)"
info "size: $(du -sh /appdata/bee/export/bee-audit.json | cut -f1)"
else
fail "audit: bee-audit.json missing or empty"
fi
if [ -f /var/log/bee-audit.log ]; then
last_line=$(tail -1 /var/log/bee-audit.log)
if [ -f /appdata/bee/export/bee-audit.log ]; then
last_line=$(tail -1 /appdata/bee/export/bee-audit.log)
info "last log line: $last_line"
if grep -q "audit output written" /var/log/bee-audit.log 2>/dev/null; then
if grep -q "audit output written" /appdata/bee/export/bee-audit.log 2>/dev/null; then
ok "audit: completed successfully"
else
warn "audit: 'audit output written' not found in log — may have failed"
fi
if grep -q "nvidia: enrichment skipped\|nvidia.*skipped\|enrichment skipped" /var/log/bee-audit.log 2>/dev/null; then
reason=$(grep -E "nvidia.*skipped|enrichment skipped" /var/log/bee-audit.log | tail -1)
if grep -q "nvidia: enrichment skipped\|nvidia.*skipped\|enrichment skipped" /appdata/bee/export/bee-audit.log 2>/dev/null; then
reason=$(grep -E "nvidia.*skipped|enrichment skipped" /appdata/bee/export/bee-audit.log | tail -1)
fail "audit: nvidia enrichment skipped — $reason"
else
ok "audit: nvidia enrichment OK (no skip message)"
fi
else
warn "audit: no log found at /var/log/bee-audit.log"
warn "audit: no log found at /appdata/bee/export/bee-audit.log"
fi
echo ""
echo "-- bee web --"
if [ -f /appdata/bee/export/bee-web.log ]; then
info "last web log line: $(tail -1 /appdata/bee/export/bee-web.log)"
else
warn "web: no log found at /appdata/bee/export/bee-web.log"
fi
if bash -c 'exec 3<>/dev/tcp/127.0.0.1/80 && printf "GET /healthz HTTP/1.0\r\nHost: localhost\r\n\r\n" >&3 && grep -q "^ok$" <&3'; then
ok "web: health endpoint reachable on 127.0.0.1:80"
else
fail "web: health endpoint not reachable on 127.0.0.1:80"
fi
echo ""
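
The smoketest reads `bee.nvidia.mode=` from /proc/cmdline and falls back to `normal` when the parameter is absent. A minimal sketch of that lookup as a testable function follows; `cmdline_param` is a hypothetical name, and taking the command line as an argument (instead of reading /proc/cmdline directly) is an assumption made so the logic can be exercised off-target.

```shell
#!/bin/sh
# Hypothetical helper: print the value of key=... from a kernel command
# line string, or the given default if the key is absent.
# As in the smoketest loop, the last occurrence wins.
cmdline_param() {
    key="$1"
    default="$2"
    cmdline="$3"
    value="$default"
    # $cmdline is left unquoted on purpose so it splits on whitespace.
    for arg in $cmdline; do
        case "$arg" in
            "$key"=*) value="${arg#*=}" ;;
        esac
    done
    printf '%s\n' "$value"
}
```

On the live system this could be called as `cmdline_param bee.nvidia.mode normal "$(cat /proc/cmdline 2>/dev/null)"`, which yields `normal`, `gsp-off`, or whatever value the selected boot menu entry appended.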

View File

@@ -1,15 +1,16 @@
██████╗ ███████╗███████╗ ██████╗ ███████╗██████╗ ██╗ ██╗ ██████╗
██╔══██╗██╔════╝██╔════╝ ██╔══██╗██╔════╝██╔══██╗██║ ██║██╔════╝
██████╔╝██████████╗ ██║ ██║█████╗ ██████╔╝██║ ██║██║ ███╗
██╔══██╗██╔══╝ ██╔══╝ ██║ ██║██╔══╝ ██╔══██╗██║ ██║██║ ██
██████╔╝██████████████╗ ██████╔╝███████╗██████╔╝╚██████╔╝╚██████╔╝
╚═════╝ ╚══════╝╚══════╝ ╚═════╝ ╚══════╝╚═════╝ ╚═════╝ ╚═════╝
███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝
████████████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗
██╔══██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝
███████╗██║ █████████║ ██║ ██████╔╝███████╗███████╗
╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╚═════╝ ╚══════╝╚══════╝
Hardware Audit LiveCD
Build: %%BUILD_INFO%%
Logs: /var/log/bee-audit.json /var/log/bee-network.log
Export dir: /appdata/bee/export
Self-check: /appdata/bee/export/runtime-health.json
Open TUI: bee-tui

Some files were not shown because too many files have changed in this diff