Compare commits

..

14 Commits

Author SHA1 Message Date
Mikhail Chusavitin
76a17937f3 feat(tui): NVIDIA SAT with nvtop, GPU selection, metrics and chart — v1.0.0
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 15:18:57 +03:00
Mikhail Chusavitin
b965184e71 feat: wrap chart viewer in web shell 2026-03-16 18:26:05 +03:00
Mikhail Chusavitin
b25a2f6d30 feat: add support bundle and raw audit export 2026-03-16 18:20:26 +03:00
Mikhail Chusavitin
d18cde19c1 Drop legacy non-container builders 2026-03-16 00:23:55 +03:00
Mikhail Chusavitin
78c6dfc0ef Sync hardware ingest contract v2.7 2026-03-15 23:03:38 +03:00
Mikhail Chusavitin
72cf482ad3 Embed Reanimator Chart web viewer 2026-03-15 22:07:42 +03:00
Mikhail Chusavitin
a6023372b1 Use microcode as CPU firmware 2026-03-15 21:16:17 +03:00
Mikhail Chusavitin
ab5a4be7ac Align hardware export with ingest contract 2026-03-15 21:04:53 +03:00
Mikhail Chusavitin
b8c235b5ac Add TUI hardware banner and polish SAT summaries 2026-03-15 14:27:01 +03:00
Mikhail Chusavitin
b483e2ce35 Add health verdicts and acceptance tests 2026-03-14 17:53:58 +03:00
Mikhail Chusavitin
17f0bda45e Update docs for current LiveCD flow 2026-03-14 16:28:30 +03:00
Mikhail Chusavitin
591164a251 Rename ISO volume to BEE 2026-03-14 14:58:49 +03:00
Mikhail Chusavitin
ef4ec5695d Remove broken TUI log redirection 2026-03-14 14:57:31 +03:00
Mikhail Chusavitin
f1e096cabe Keep live TUI logs off the console 2026-03-14 14:51:25 +03:00
87 changed files with 8833 additions and 579 deletions

3
.gitmodules vendored
View File

@@ -1,3 +1,6 @@
[submodule "bible"] [submodule "bible"]
path = bible path = bible
url = https://git.mchus.pro/mchus/bible.git url = https://git.mchus.pro/mchus/bible.git
[submodule "internal/chart"]
path = internal/chart
url = https://git.mchus.pro/reanimator/chart.git

41
PLAN.md
View File

@@ -4,13 +4,13 @@ Hardware audit LiveCD for offline server inventory.
Produces `HardwareIngestRequest` JSON compatible with core/reanimator. Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
**Principle:** OS-level collection — reads hardware directly, not through BMC. **Principle:** OS-level collection — reads hardware directly, not through BMC.
Fully unattended — no user interaction required at any stage. Boot → update → audit → output → done. Automatic boot audit plus operator console. Boot runs audit immediately, but local/SSH operators can rerun checks through the TUI and CLI.
All errors are logged, never presented interactively. Every failure path has a silent fallback. Errors are logged and should not block boot on partial collector failures.
Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware. Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.
--- ---
## Status snapshot (2026-03-06) ## Status snapshot (2026-03-14)
### Phase 1 — Go Audit Binary ### Phase 1 — Go Audit Binary
@@ -23,8 +23,10 @@ Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials,
- 1.7 PSU collector — **DONE (basic FRU path)** - 1.7 PSU collector — **DONE (basic FRU path)**
- 1.8 NVIDIA GPU enrichment — **DONE** - 1.8 NVIDIA GPU enrichment — **DONE**
- 1.8b Component wear / age telemetry — **DONE** (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats) - 1.8b Component wear / age telemetry — **DONE** (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
- 1.8c Storage health verdicts — **DONE** (SMART/NVMe warning/failed status derivation)
- 1.9 Mellanox/NVIDIA NIC enrichment — **DONE** (mstflint + ethtool firmware fallback) - 1.9 Mellanox/NVIDIA NIC enrichment — **DONE** (mstflint + ethtool firmware fallback)
- 1.10 RAID controller enrichment — **DONE (initial multi-tool support)** (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat) - 1.10 RAID controller enrichment — **DONE (initial multi-tool support)** (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
- 1.11 PSU SDR health — **DONE** (`ipmitool sdr` merged with FRU inventory)
- 1.11 Output and export workflow — **DONE** (explicit file output + manual removable export via TUI) - 1.11 Output and export workflow — **DONE** (explicit file output + manual removable export via TUI)
- 1.12 Integration test (local) — **DONE** (`scripts/test-local.sh`) - 1.12 Integration test (local) — **DONE** (`scripts/test-local.sh`)
@@ -33,9 +35,14 @@ Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials,
- Current implementation uses Debian 12 `live-build`, `systemd`, and OpenSSH. - Current implementation uses Debian 12 `live-build`, `systemd`, and OpenSSH.
- Network bring-up on boot — **DONE** - Network bring-up on boot — **DONE**
- Boot services (`bee-network`, `bee-nvidia`, `bee-audit`, `bee-sshsetup`) — **DONE** - Boot services (`bee-network`, `bee-nvidia`, `bee-audit`, `bee-sshsetup`) — **DONE**
- Local console UX (`bee` autologin on `tty1`, `menu` auto-start, TUI privilege escalation via `sudo -n`) — **DONE**
- VM/debug support (`qemu-guest-agent`, serial console, virtual GPU initramfs modules) — **DONE**
- Vendor utilities in overlay — **DONE** - Vendor utilities in overlay — **DONE**
- Build metadata + staged overlay injection — **DONE** - Build metadata + staged overlay injection — **DONE**
- Builder container cache persisted outside container writable layer — **DONE**
- ISO volume label `BEE`**DONE**
- Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior. - Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior.
- Real-hardware validation remains **PENDING**; current validation is limited to local/libvirt VM boot + service checks.
--- ---
@@ -265,13 +272,10 @@ ISO image bootable via BMC virtual media or USB. Runs boot services automaticall
### 2.1 — Builder environment ### 2.1 — Builder environment
`iso/builder/setup-builder.sh` prepares a Debian 12 host/VM with: `iso/builder/build-in-container.sh` is the only supported builder entrypoint.
- `live-build`, `debootstrap`, bootloader tooling, kernel headers It builds a Debian 12 builder image with `live-build`, toolchains, and pinned kernel headers,
- Go toolchain then runs the ISO assembly in a privileged container because `live-build` needs
- everything needed to compile the `bee` binary and NVIDIA modules mount/chroot/loop capabilities.
`iso/builder/build-in-container.sh` offers the same builder stack in a Debian 12 container image.
The container run is privileged because `live-build` needs mount/chroot/loop capabilities.
`iso/builder/build.sh` orchestrates the full ISO build: `iso/builder/build.sh` orchestrates the full ISO build:
1. compile the Go `bee` binary 1. compile the Go `bee` binary
@@ -334,8 +338,14 @@ Planned code shape:
### 2.5 — Operator workflows ### 2.5 — Operator workflows
- Automatic boot audit writes JSON to `/var/log/bee-audit.json` - Automatic boot audit writes JSON to `/var/log/bee-audit.json`
- `tty1` autologins into `bee` and auto-runs `menu`
- `menu` launches the LiveCD wrapper `bee-tui`, which escalates to `root` via `sudo -n`
- `bee tui` can rerun the audit manually - `bee tui` can rerun the audit manually
- `bee tui` can export the latest audit JSON to removable media - `bee tui` can export the latest audit JSON to removable media
- `bee tui` can show health summary and run NVIDIA/memory/storage acceptance tests
- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-stress`
- SAT summaries now expose `overall_status` plus per-job `OK/FAILED/UNSUPPORTED`
- Memory/GPU SAT runtime defaults can be overridden via `BEE_MEMTESTER_*` and `BEE_GPU_STRESS_*`
- removable export requires explicit target selection, mount, confirmation, copy, and cleanup - removable export requires explicit target selection, mount, confirmation, copy, and cleanup
### 2.6 — Vendor utilities and optional assets ### 2.6 — Vendor utilities and optional assets
@@ -343,7 +353,9 @@ Planned code shape:
Optional binaries live in `iso/vendor/` and are included when present: Optional binaries live in `iso/vendor/` and are included when present:
- `storcli64` - `storcli64`
- `sas2ircu`, `sas3ircu` - `sas2ircu`, `sas3ircu`
- `mstflint` - `arcconf`
- `ssacli`
- `mstflint` (via Debian package set)
Missing optional tools do not fail the build or boot. Missing optional tools do not fail the build or boot.
@@ -358,6 +370,7 @@ Missing optional tools do not fail the build or boot.
Current release model: Current release model:
- shipping a new ISO means a full rebuild - shipping a new ISO means a full rebuild
- build metadata is embedded into `/etc/bee-release` and `motd` - build metadata is embedded into `/etc/bee-release` and `motd`
- current ISO label is `BEE`
- binary self-update remains deferred; no automatic USB/network patching is part of the current runtime - binary self-update remains deferred; no automatic USB/network patching is part of the current runtime
--- ---
@@ -374,9 +387,9 @@ No "works on my Mac" drift.
1.2 board collector → first real data 1.2 board collector → first real data
1.3 CPU collector → +CPUs 1.3 CPU collector → +CPUs
--- BUILDER + DEBUG ISO (unblock real-hardware testing) --- --- BUILDER + BEE ISO (unblock real-hardware testing) ---
2.1 builder setup → Debian host/VM or privileged container with build deps 2.1 builder setup → privileged container with build deps
2.2 debug ISO profile → minimal Debian ISO: `bee` binary + OpenSSH + all packages 2.2 debug ISO profile → minimal Debian ISO: `bee` binary + OpenSSH + all packages
2.3 boot on real server → SSH in, verify packages present, run audit manually 2.3 boot on real server → SSH in, verify packages present, run audit manually
@@ -397,7 +410,7 @@ No "works on my Mac" drift.
2.4 NVIDIA driver build → driver compiled into overlay 2.4 NVIDIA driver build → driver compiled into overlay
2.5 network bring-up on boot → DHCP on all interfaces 2.5 network bring-up on boot → DHCP on all interfaces
2.6 systemd boot service → audit runs on boot automatically 2.6 systemd boot service → audit runs on boot automatically
2.7 vendor utilities → storcli/sas2ircu/mstflint in image 2.7 vendor utilities → storcli/sas2ircu/arcconf/ssacli in image
2.8 release workflow → versioning + release notes 2.8 release workflow → versioning + release notes
2.9 operator export flow → explicit TUI export to removable media 2.9 operator export flow → explicit TUI export to removable media
``` ```

View File

@@ -12,6 +12,7 @@ import (
"bee/audit/internal/platform" "bee/audit/internal/platform"
"bee/audit/internal/runtimeenv" "bee/audit/internal/runtimeenv"
"bee/audit/internal/tui" "bee/audit/internal/tui"
"bee/audit/internal/webui"
) )
var Version = "dev" var Version = "dev"
@@ -27,11 +28,14 @@ func run(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 { if len(args) == 0 {
printRootUsage(stderr) printRootUsage(stderr)
return 1 return 2
} }
switch args[0] { switch args[0] {
case "help", "--help", "-h": case "help", "--help", "-h":
if len(args) > 1 {
return runHelp(args[1:], stdout, stderr)
}
printRootUsage(stdout) printRootUsage(stdout)
return 0 return 0
case "audit": case "audit":
@@ -40,6 +44,12 @@ func run(args []string, stdout, stderr io.Writer) int {
return runTUI(args[1:], stdout, stderr) return runTUI(args[1:], stdout, stderr)
case "export": case "export":
return runExport(args[1:], stdout, stderr) return runExport(args[1:], stdout, stderr)
case "preflight":
return runPreflight(args[1:], stdout, stderr)
case "support-bundle":
return runSupportBundle(args[1:], stdout, stderr)
case "web":
return runWeb(args[1:], stdout, stderr)
case "sat": case "sat":
return runSAT(args[1:], stdout, stderr) return runSAT(args[1:], stdout, stderr)
case "version", "--version", "-version": case "version", "--version", "-version":
@@ -48,17 +58,47 @@ func run(args []string, stdout, stderr io.Writer) int {
default: default:
fmt.Fprintf(stderr, "bee: unknown command %q\n\n", args[0]) fmt.Fprintf(stderr, "bee: unknown command %q\n\n", args[0])
printRootUsage(stderr) printRootUsage(stderr)
return 1 return 2
} }
} }
func printRootUsage(w io.Writer) { func printRootUsage(w io.Writer) {
fmt.Fprintln(w, `bee commands: fmt.Fprintln(w, `bee commands:
bee audit --runtime auto|local|livecd --output stdout|file:<path> bee audit --runtime auto|local|livecd --output stdout|file:<path>
bee preflight --output stdout|file:<path>
bee tui --runtime auto|local|livecd bee tui --runtime auto|local|livecd
bee export --target <device> bee export --target <device>
bee sat nvidia bee support-bundle --output stdout|file:<path>
bee version`) bee web --listen :80 --audit-path `+app.DefaultAuditJSONPath+`
bee sat nvidia|memory|storage
bee version
bee help [command]`)
}
func runHelp(args []string, stdout, stderr io.Writer) int {
switch args[0] {
case "audit":
return runAudit([]string{"--help"}, stdout, stdout)
case "tui":
return runTUI([]string{"--help"}, stdout, stdout)
case "export":
return runExport([]string{"--help"}, stdout, stdout)
case "preflight":
return runPreflight([]string{"--help"}, stdout, stdout)
case "support-bundle":
return runSupportBundle([]string{"--help"}, stdout, stdout)
case "web":
return runWeb([]string{"--help"}, stdout, stdout)
case "sat":
return runSAT([]string{"--help"}, stdout, stderr)
case "version":
fmt.Fprintln(stdout, "usage: bee version")
return 0
default:
fmt.Fprintf(stderr, "bee help: unknown command %q\n\n", args[0])
printRootUsage(stderr)
return 2
}
} }
func runAudit(args []string, stdout, stderr io.Writer) int { func runAudit(args []string, stdout, stderr io.Writer) int {
@@ -72,6 +112,13 @@ func runAudit(args []string, stdout, stderr io.Writer) int {
fs.PrintDefaults() fs.PrintDefaults()
} }
if err := fs.Parse(args); err != nil { if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2 return 2
} }
if *showVersion { if *showVersion {
@@ -107,6 +154,13 @@ func runTUI(args []string, stdout, stderr io.Writer) int {
fs.PrintDefaults() fs.PrintDefaults()
} }
if err := fs.Parse(args); err != nil { if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2 return 2
} }
@@ -116,6 +170,10 @@ func runTUI(args []string, stdout, stderr io.Writer) int {
return 1 return 1
} }
slog.SetDefault(slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{
Level: slog.LevelInfo,
})))
application := app.New(platform.New()) application := app.New(platform.New())
if err := tui.Run(application, runtimeInfo.Mode); err != nil { if err := tui.Run(application, runtimeInfo.Mode); err != nil {
slog.Error("run tui", "err", err) slog.Error("run tui", "err", err)
@@ -133,6 +191,13 @@ func runExport(args []string, stdout, stderr io.Writer) int {
fs.PrintDefaults() fs.PrintDefaults()
} }
if err := fs.Parse(args); err != nil { if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2 return 2
} }
if strings.TrimSpace(*targetDevice) == "" { if strings.TrimSpace(*targetDevice) == "" {
@@ -164,22 +229,160 @@ func runExport(args []string, stdout, stderr io.Writer) int {
return 1 return 1
} }
func runSAT(args []string, stdout, stderr io.Writer) int { func runPreflight(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 || args[0] == "help" || args[0] == "--help" || args[0] == "-h" { fs := flag.NewFlagSet("preflight", flag.ContinueOnError)
fmt.Fprintln(stderr, "usage: bee sat nvidia") fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee preflight [--output stdout|file:%s]\n", app.DefaultRuntimeJSONPath)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2 return 2
} }
if args[0] != "nvidia" { if fs.NArg() != 0 {
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", args[0]) fs.Usage()
fmt.Fprintln(stderr, "usage: bee sat nvidia")
return 2 return 2
} }
application := app.New(platform.New()) application := app.New(platform.New())
archive, err := application.RunNvidiaAcceptancePack("") path, err := application.RunRuntimePreflight(*output)
if err != nil { if err != nil {
slog.Error("run nvidia sat", "err", err) slog.Error("run preflight", "err", err)
return 1 return 1
} }
slog.Info("nvidia sat archive written", "path", archive) if path != "stdout" {
slog.Info("runtime health written", "path", path)
}
return 0
}
func runSupportBundle(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("support-bundle", flag.ContinueOnError)
fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee support-bundle [--output stdout|file:<path>]")
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
path, err := app.BuildSupportBundle(app.DefaultExportDir)
if err != nil {
slog.Error("build support bundle", "err", err)
return 1
}
defer os.Remove(path)
raw, err := os.ReadFile(path)
if err != nil {
slog.Error("read support bundle", "err", err)
return 1
}
switch {
case *output == "stdout":
if _, err := stdout.Write(raw); err != nil {
slog.Error("write support bundle stdout", "err", err)
return 1
}
case strings.HasPrefix(*output, "file:"):
dst := strings.TrimPrefix(*output, "file:")
if err := os.WriteFile(dst, raw, 0644); err != nil {
slog.Error("write support bundle", "err", err)
return 1
}
slog.Info("support bundle written", "path", dst)
default:
fmt.Fprintln(stderr, "bee support-bundle: unknown output destination")
fs.Usage()
return 2
}
return 0
}
func runWeb(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("web", flag.ContinueOnError)
fs.SetOutput(stderr)
listenAddr := fs.String("listen", ":8080", "listen address, e.g. :80")
auditPath := fs.String("audit-path", app.DefaultAuditJSONPath, "path to the latest audit JSON snapshot")
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles")
title := fs.String("title", "Bee Hardware Audit", "page title")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee web [--listen :80] [--audit-path %s] [--export-dir %s] [--title \"Bee Hardware Audit\"]\n", app.DefaultAuditJSONPath, app.DefaultExportDir)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
slog.Info("starting bee web", "listen", *listenAddr, "audit_path", *auditPath)
if err := webui.ListenAndServe(*listenAddr, webui.HandlerOptions{
Title: *title,
AuditPath: *auditPath,
ExportDir: *exportDir,
}); err != nil {
slog.Error("run web", "err", err)
return 1
}
return 0
}
func runSAT(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 {
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
return 2
}
if args[0] == "help" || args[0] == "--help" || args[0] == "-h" {
fmt.Fprintln(stdout, "usage: bee sat nvidia|memory|storage")
return 0
}
if args[0] != "nvidia" && args[0] != "memory" && args[0] != "storage" {
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", args[0])
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
return 2
}
if len(args) > 1 {
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
return 2
}
application := app.New(platform.New())
var (
archive string
err error
label string
)
switch args[0] {
case "nvidia":
label = "nvidia"
archive, err = application.RunNvidiaAcceptancePack("")
case "memory":
label = "memory"
archive, err = application.RunMemoryAcceptancePack("")
case "storage":
label = "storage"
archive, err = application.RunStorageAcceptancePack("")
}
if err != nil {
slog.Error("run sat", "target", label, "err", err)
return 1
}
slog.Info("sat archive written", "target", label, "path", archive)
return 0 return 0
} }

View File

@@ -24,8 +24,8 @@ func TestRunNoArgsPrintsUsage(t *testing.T) {
var stdout, stderr bytes.Buffer var stdout, stderr bytes.Buffer
rc := run(nil, &stdout, &stderr) rc := run(nil, &stdout, &stderr)
if rc != 1 { if rc != 2 {
t.Fatalf("rc=%d want 1", rc) t.Fatalf("rc=%d want 2", rc)
} }
if !strings.Contains(stderr.String(), "bee commands:") { if !strings.Contains(stderr.String(), "bee commands:") {
t.Fatalf("stderr missing root usage:\n%s", stderr.String()) t.Fatalf("stderr missing root usage:\n%s", stderr.String())
@@ -37,8 +37,8 @@ func TestRunUnknownCommand(t *testing.T) {
var stdout, stderr bytes.Buffer var stdout, stderr bytes.Buffer
rc := run([]string{"wat"}, &stdout, &stderr) rc := run([]string{"wat"}, &stdout, &stderr)
if rc != 1 { if rc != 2 {
t.Fatalf("rc=%d want 1", rc) t.Fatalf("rc=%d want 2", rc)
} }
if !strings.Contains(stderr.String(), `unknown command "wat"`) { if !strings.Contains(stderr.String(), `unknown command "wat"`) {
t.Fatalf("stderr missing unknown command message:\n%s", stderr.String()) t.Fatalf("stderr missing unknown command message:\n%s", stderr.String())
@@ -86,11 +86,63 @@ func TestRunSATUsage(t *testing.T) {
if rc != 2 { if rc != 2 {
t.Fatalf("rc=%d want 2", rc) t.Fatalf("rc=%d want 2", rc)
} }
if !strings.Contains(stderr.String(), "usage: bee sat nvidia") { if !strings.Contains(stderr.String(), "usage: bee sat nvidia|memory|storage") {
t.Fatalf("stderr missing sat usage:\n%s", stderr.String()) t.Fatalf("stderr missing sat usage:\n%s", stderr.String())
} }
} }
func TestRunPreflightRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"preflight", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee preflight") {
t.Fatalf("stderr missing preflight usage:\n%s", stderr.String())
}
}
func TestRunSupportBundleRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"support-bundle", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee support-bundle") {
t.Fatalf("stderr missing support-bundle usage:\n%s", stderr.String())
}
}
func TestRunHelpForSubcommand(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"help", "export"}, &stdout, &stderr)
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if !strings.Contains(stdout.String(), "usage: bee export --target <device>") {
t.Fatalf("stdout missing export help:\n%s", stdout.String())
}
}
func TestRunHelpUnknownSubcommand(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"help", "wat"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), `bee help: unknown command "wat"`) {
t.Fatalf("stderr missing help error:\n%s", stderr.String())
}
}
func TestRunSATUnknownTarget(t *testing.T) { func TestRunSATUnknownTarget(t *testing.T) {
t.Parallel() t.Parallel()
@@ -104,6 +156,32 @@ func TestRunSATUnknownTarget(t *testing.T) {
} }
} }
func TestRunSATHelp(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"sat", "--help"}, &stdout, &stderr)
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if !strings.Contains(stdout.String(), "usage: bee sat nvidia|memory|storage") {
t.Fatalf("stdout missing sat help:\n%s", stdout.String())
}
}
func TestRunSATRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"sat", "memory", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee sat nvidia|memory|storage") {
t.Fatalf("stderr missing sat usage:\n%s", stderr.String())
}
}
func TestRunAuditInvalidRuntime(t *testing.T) { func TestRunAuditInvalidRuntime(t *testing.T) {
t.Parallel() t.Parallel()
@@ -113,3 +191,29 @@ func TestRunAuditInvalidRuntime(t *testing.T) {
t.Fatalf("rc=%d want 1", rc) t.Fatalf("rc=%d want 1", rc)
} }
} }
func TestRunAuditRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"audit", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee audit") {
t.Fatalf("stderr missing audit usage:\n%s", stderr.String())
}
}
func TestRunExportRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"export", "--target", "/dev/sdb1", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee export --target <device>") {
t.Fatalf("stderr missing export usage:\n%s", stderr.String())
}
}

View File

@@ -1,8 +1,11 @@
module bee/audit module bee/audit
go 1.23 go 1.24.0
replace reanimator/chart => ../internal/chart
require github.com/charmbracelet/bubbletea v1.3.4 require github.com/charmbracelet/bubbletea v1.3.4
require reanimator/chart v0.0.0
require ( require (
github.com/aymanbagabas/go-osc52/v2 v2.0.1 // indirect github.com/aymanbagabas/go-osc52/v2 v2.0.1 // indirect

View File

@@ -1,10 +1,13 @@
package app package app
import ( import (
"context"
"encoding/json" "encoding/json"
"fmt" "fmt"
"log/slog"
"os" "os"
"path/filepath" "path/filepath"
"sort"
"strconv" "strconv"
"strings" "strings"
"time" "time"
@@ -12,11 +15,21 @@ import (
"bee/audit/internal/collector" "bee/audit/internal/collector"
"bee/audit/internal/platform" "bee/audit/internal/platform"
"bee/audit/internal/runtimeenv" "bee/audit/internal/runtimeenv"
"bee/audit/internal/schema"
) )
const ( var (
DefaultAuditJSONPath = "/var/log/bee-audit.json" DefaultExportDir = "/appdata/bee/export"
DefaultAuditLogPath = "/var/log/bee-audit.log" DefaultAuditJSONPath = DefaultExportDir + "/bee-audit.json"
DefaultAuditLogPath = DefaultExportDir + "/bee-audit.log"
DefaultWebLogPath = DefaultExportDir + "/bee-web.log"
DefaultNetworkLogPath = DefaultExportDir + "/bee-network.log"
DefaultNvidiaLogPath = DefaultExportDir + "/bee-nvidia.log"
DefaultSSHLogPath = DefaultExportDir + "/bee-sshsetup.log"
DefaultRuntimeJSONPath = DefaultExportDir + "/runtime-health.json"
DefaultRuntimeLogPath = DefaultExportDir + "/runtime-health.log"
DefaultTechDumpDir = DefaultExportDir + "/techdump"
DefaultSATBaseDir = DefaultExportDir + "/bee-sat"
) )
type App struct { type App struct {
@@ -25,6 +38,7 @@ type App struct {
exports exportManager exports exportManager
tools toolManager tools toolManager
sat satRunner sat satRunner
runtime runtimeChecker
} }
type ActionResult struct { type ActionResult struct {
@@ -58,6 +72,15 @@ type toolManager interface {
type satRunner interface { type satRunner interface {
RunNvidiaAcceptancePack(baseDir string) (string, error) RunNvidiaAcceptancePack(baseDir string) (string, error)
RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (string, error)
RunMemoryAcceptancePack(baseDir string) (string, error)
RunStorageAcceptancePack(baseDir string) (string, error)
ListNvidiaGPUs() ([]platform.NvidiaGPU, error)
}
type runtimeChecker interface {
CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error)
CaptureTechnicalDump(baseDir string) error
} }
func New(platform *platform.System) *App { func New(platform *platform.System) *App {
@@ -67,11 +90,20 @@ func New(platform *platform.System) *App {
exports: platform, exports: platform,
tools: platform, tools: platform,
sat: platform, sat: platform,
runtime: platform,
} }
} }
func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, error) { func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, error) {
if runtimeMode == runtimeenv.ModeLiveCD {
if err := a.runtime.CaptureTechnicalDump(DefaultTechDumpDir); err != nil {
slog.Warn("capture technical dump", "err", err)
}
}
result := collector.Run(runtimeMode) result := collector.Run(runtimeMode)
if health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath); err == nil {
result.Runtime = &health
}
data, err := json.MarshalIndent(result, "", " ") data, err := json.MarshalIndent(result, "", " ")
if err != nil { if err != nil {
return "", err return "", err
@@ -83,6 +115,9 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
return "stdout", err return "stdout", err
case strings.HasPrefix(output, "file:"): case strings.HasPrefix(output, "file:"):
path := strings.TrimPrefix(output, "file:") path := strings.TrimPrefix(output, "file:")
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return "", err
}
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil { if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
return "", err return "", err
} }
@@ -92,6 +127,63 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
} }
} }
func (a *App) RunRuntimePreflight(output string) (string, error) {
health, err := a.runtime.CollectRuntimeHealth(DefaultExportDir)
if err != nil {
return "", err
}
data, err := json.MarshalIndent(health, "", " ")
if err != nil {
return "", err
}
switch {
case output == "stdout":
_, err := os.Stdout.Write(append(data, '\n'))
return "stdout", err
case strings.HasPrefix(output, "file:"):
path := strings.TrimPrefix(output, "file:")
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return "", err
}
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
return "", err
}
return path, nil
default:
return "", fmt.Errorf("unknown output destination %q — use stdout or file:<path>", output)
}
}
func (a *App) RunRuntimePreflightResult() (ActionResult, error) {
path, err := a.RunRuntimePreflight("file:" + DefaultRuntimeJSONPath)
body := "Runtime preflight completed."
if path != "" {
body = "Runtime health written to " + path
}
return ActionResult{Title: "Run self-check", Body: body}, err
}
func (a *App) RuntimeHealthResult() ActionResult {
health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath)
if err != nil {
return ActionResult{Title: "Runtime issues", Body: "No runtime health found."}
}
var body strings.Builder
fmt.Fprintf(&body, "Status: %s\n", firstNonEmpty(health.Status, "UNKNOWN"))
fmt.Fprintf(&body, "Export dir: %s\n", firstNonEmpty(health.ExportDir, DefaultExportDir))
fmt.Fprintf(&body, "Driver ready: %t\n", health.DriverReady)
fmt.Fprintf(&body, "CUDA ready: %t\n", health.CUDAReady)
fmt.Fprintf(&body, "Network: %s", firstNonEmpty(health.NetworkStatus, "UNKNOWN"))
if len(health.Issues) > 0 {
body.WriteString("\n\nIssues:\n")
for _, issue := range health.Issues {
fmt.Fprintf(&body, "- %s: %s\n", issue.Code, issue.Description)
}
}
return ActionResult{Title: "Runtime issues", Body: strings.TrimSpace(body.String())}
}
func (a *App) RunAuditNow(runtimeMode runtimeenv.Mode) (ActionResult, error) { func (a *App) RunAuditNow(runtimeMode runtimeenv.Mode) (ActionResult, error) {
path, err := a.RunAudit(runtimeMode, "file:"+DefaultAuditJSONPath) path, err := a.RunAudit(runtimeMode, "file:"+DefaultAuditJSONPath)
body := "Audit completed." body := "Audit completed."
@@ -124,7 +216,29 @@ func (a *App) ExportLatestAudit(target platform.RemovableTarget) (string, error)
func (a *App) ExportLatestAuditResult(target platform.RemovableTarget) (ActionResult, error) { func (a *App) ExportLatestAuditResult(target platform.RemovableTarget) (ActionResult, error) {
path, err := a.ExportLatestAudit(target) path, err := a.ExportLatestAudit(target)
return ActionResult{Title: "Export audit", Body: "Audit exported to " + path}, err body := "Audit exported."
if path != "" {
body = "Audit exported to " + path
}
return ActionResult{Title: "Export audit", Body: body}, err
}
func (a *App) ExportSupportBundle(target platform.RemovableTarget) (string, error) {
archive, err := BuildSupportBundle(DefaultExportDir)
if err != nil {
return "", err
}
defer os.Remove(archive)
return a.exports.ExportFileToTarget(archive, target)
}
func (a *App) ExportSupportBundleResult(target platform.RemovableTarget) (ActionResult, error) {
path, err := a.ExportSupportBundle(target)
body := "Support bundle exported."
if path != "" {
body = "Support bundle exported to " + path
}
return ActionResult{Title: "Export support bundle", Body: body}, err
} }
func (a *App) ListInterfaces() ([]platform.InterfaceInfo, error) { func (a *App) ListInterfaces() ([]platform.InterfaceInfo, error) {
@@ -141,7 +255,7 @@ func (a *App) DHCPOne(iface string) (string, error) {
func (a *App) DHCPOneResult(iface string) (ActionResult, error) { func (a *App) DHCPOneResult(iface string) (ActionResult, error) {
body, err := a.network.DHCPOne(iface) body, err := a.network.DHCPOne(iface)
return ActionResult{Title: "DHCP on " + iface, Body: body}, err return ActionResult{Title: "DHCP: " + iface, Body: bodyOr(body, "DHCP completed.")}, err
} }
func (a *App) DHCPAll() (string, error) { func (a *App) DHCPAll() (string, error) {
@@ -150,7 +264,7 @@ func (a *App) DHCPAll() (string, error) {
func (a *App) DHCPAllResult() (ActionResult, error) { func (a *App) DHCPAllResult() (ActionResult, error) {
body, err := a.network.DHCPAll() body, err := a.network.DHCPAll()
return ActionResult{Title: "DHCP all interfaces", Body: body}, err return ActionResult{Title: "DHCP: all interfaces", Body: bodyOr(body, "DHCP completed.")}, err
} }
func (a *App) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) { func (a *App) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
@@ -159,7 +273,7 @@ func (a *App) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
func (a *App) SetStaticIPv4Result(cfg platform.StaticIPv4Config) (ActionResult, error) { func (a *App) SetStaticIPv4Result(cfg platform.StaticIPv4Config) (ActionResult, error) {
body, err := a.network.SetStaticIPv4(cfg) body, err := a.network.SetStaticIPv4(cfg)
return ActionResult{Title: "Static IPv4 on " + cfg.Interface, Body: body}, err return ActionResult{Title: "Static IPv4: " + cfg.Interface, Body: bodyOr(body, "Static IPv4 updated.")}, err
} }
func (a *App) NetworkStatus() (ActionResult, error) { func (a *App) NetworkStatus() (ActionResult, error) {
@@ -167,6 +281,9 @@ func (a *App) NetworkStatus() (ActionResult, error) {
if err != nil { if err != nil {
return ActionResult{Title: "Network status"}, err return ActionResult{Title: "Network status"}, err
} }
if len(ifaces) == 0 {
return ActionResult{Title: "Network status", Body: "No physical interfaces found."}, nil
}
var body strings.Builder var body strings.Builder
for _, iface := range ifaces { for _, iface := range ifaces {
ipv4 := "(no IPv4)" ipv4 := "(no IPv4)"
@@ -216,7 +333,7 @@ func (a *App) ServiceStatus(name string) (string, error) {
func (a *App) ServiceStatusResult(name string) (ActionResult, error) { func (a *App) ServiceStatusResult(name string) (ActionResult, error) {
body, err := a.services.ServiceStatus(name) body, err := a.services.ServiceStatus(name)
return ActionResult{Title: "service: " + name, Body: body}, err return ActionResult{Title: "service status: " + name, Body: bodyOr(body, "No status output.")}, err
} }
func (a *App) ServiceDo(name string, action platform.ServiceAction) (string, error) { func (a *App) ServiceDo(name string, action platform.ServiceAction) (string, error) {
@@ -225,7 +342,7 @@ func (a *App) ServiceDo(name string, action platform.ServiceAction) (string, err
func (a *App) ServiceActionResult(name string, action platform.ServiceAction) (ActionResult, error) { func (a *App) ServiceActionResult(name string, action platform.ServiceAction) (ActionResult, error) {
body, err := a.services.ServiceDo(name, action) body, err := a.services.ServiceDo(name, action)
return ActionResult{Title: "service: " + name, Body: body}, err return ActionResult{Title: "service " + string(action) + ": " + name, Body: bodyOr(body, "Action completed.")}, err
} }
func (a *App) ListRemovableTargets() ([]platform.RemovableTarget, error) { func (a *App) ListRemovableTargets() ([]platform.RemovableTarget, error) {
@@ -241,6 +358,9 @@ func (a *App) CheckTools(names []string) []platform.ToolStatus {
} }
func (a *App) ToolCheckResult(names []string) ActionResult { func (a *App) ToolCheckResult(names []string) ActionResult {
if len(names) == 0 {
return ActionResult{Title: "Required tools", Body: "No tools checked."}
}
var body strings.Builder var body strings.Builder
for _, tool := range a.tools.CheckTools(names) { for _, tool := range a.tools.CheckTools(names) {
status := "MISSING" status := "MISSING"
@@ -253,17 +373,151 @@ func (a *App) ToolCheckResult(names []string) ActionResult {
} }
func (a *App) AuditLogTailResult() ActionResult { func (a *App) AuditLogTailResult() ActionResult {
body := a.tools.TailFile(DefaultAuditLogPath, 40) + "\n\n" + a.tools.TailFile(DefaultAuditJSONPath, 20) logTail := strings.TrimSpace(a.tools.TailFile(DefaultAuditLogPath, 40))
jsonTail := strings.TrimSpace(a.tools.TailFile(DefaultAuditJSONPath, 20))
body := strings.TrimSpace(logTail + "\n\n" + jsonTail)
if body == "" {
body = "No audit logs found."
}
return ActionResult{Title: "Audit log tail", Body: body} return ActionResult{Title: "Audit log tail", Body: body}
} }
func (a *App) RunNvidiaAcceptancePack(baseDir string) (string, error) { func (a *App) RunNvidiaAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaAcceptancePack(baseDir) return a.sat.RunNvidiaAcceptancePack(baseDir)
} }
func (a *App) RunNvidiaAcceptancePackResult(baseDir string) (ActionResult, error) { func (a *App) RunNvidiaAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.sat.RunNvidiaAcceptancePack(baseDir) path, err := a.RunNvidiaAcceptancePack(baseDir)
return ActionResult{Title: "NVIDIA SAT", Body: "Archive written to " + path}, err body := "Archive written."
if path != "" {
body = "Archive written to " + path
}
return ActionResult{Title: "NVIDIA SAT", Body: body}, err
}
func (a *App) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
return a.sat.ListNvidiaGPUs()
}
func (a *App) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (ActionResult, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
path, err := a.sat.RunNvidiaAcceptancePackWithOptions(ctx, baseDir, durationSec, sizeMB, gpuIndices)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
}
// Include terminal chart if available (runDir = archive path without .tar.gz).
if path != "" {
termPath := filepath.Join(strings.TrimSuffix(path, ".tar.gz"), "gpu-metrics-term.txt")
if chart, readErr := os.ReadFile(termPath); readErr == nil && len(chart) > 0 {
body += "\n\n" + string(chart)
}
}
return ActionResult{Title: "NVIDIA SAT", Body: body}, err
}
func (a *App) RunMemoryAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunMemoryAcceptancePack(baseDir)
}
func (a *App) RunMemoryAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunMemoryAcceptancePack(baseDir)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
}
return ActionResult{Title: "Memory SAT", Body: body}, err
}
func (a *App) RunStorageAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunStorageAcceptancePack(baseDir)
}
func (a *App) RunStorageAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunStorageAcceptancePack(baseDir)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
}
return ActionResult{Title: "Storage SAT", Body: body}, err
}
func (a *App) HealthSummaryResult() ActionResult {
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return ActionResult{Title: "Health summary", Body: "No audit JSON found."}
}
var snapshot schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snapshot); err != nil {
return ActionResult{Title: "Health summary", Body: "Audit JSON is unreadable."}
}
summary := collector.BuildHealthSummary(snapshot.Hardware)
var body strings.Builder
status := summary.Status
if status == "" {
status = "Unknown"
}
fmt.Fprintf(&body, "Overall: %s\n", status)
fmt.Fprintf(&body, "Storage: warn=%d fail=%d\n", summary.StorageWarn, summary.StorageFail)
fmt.Fprintf(&body, "PCIe: warn=%d fail=%d\n", summary.PCIeWarn, summary.PCIeFail)
fmt.Fprintf(&body, "PSU: warn=%d fail=%d\n", summary.PSUWarn, summary.PSUFail)
fmt.Fprintf(&body, "Memory: warn=%d fail=%d\n", summary.MemoryWarn, summary.MemoryFail)
for _, item := range latestSATSummaries() {
fmt.Fprintf(&body, "\n\n%s", item)
}
if len(summary.Failures) > 0 {
fmt.Fprintf(&body, "\n\nFailures:\n- %s", strings.Join(summary.Failures, "\n- "))
}
if len(summary.Warnings) > 0 {
fmt.Fprintf(&body, "\n\nWarnings:\n- %s", strings.Join(summary.Warnings, "\n- "))
}
return ActionResult{Title: "Health summary", Body: strings.TrimSpace(body.String())}
}
func (a *App) MainBanner() string {
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return ""
}
var snapshot schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snapshot); err != nil {
return ""
}
var lines []string
if system := formatSystemLine(snapshot.Hardware.Board); system != "" {
lines = append(lines, system)
}
if cpu := formatCPULine(snapshot.Hardware.CPUs); cpu != "" {
lines = append(lines, cpu)
}
if memory := formatMemoryLine(snapshot.Hardware.Memory); memory != "" {
lines = append(lines, memory)
}
if storage := formatStorageLine(snapshot.Hardware.Storage); storage != "" {
lines = append(lines, storage)
}
if gpu := formatGPULine(snapshot.Hardware.PCIeDevices); gpu != "" {
lines = append(lines, gpu)
}
if ip := formatIPLine(a.network.ListInterfaces); ip != "" {
lines = append(lines, ip)
}
return strings.TrimSpace(strings.Join(lines, "\n"))
} }
func (a *App) FormatToolStatuses(statuses []platform.ToolStatus) string { func (a *App) FormatToolStatuses(statuses []platform.ToolStatus) string {
@@ -309,3 +563,314 @@ func sanitizeFilename(v string) string {
} }
return string(out) return string(out)
} }
func bodyOr(body, fallback string) string {
body = strings.TrimSpace(body)
if body == "" {
return fallback
}
return body
}
func ReadRuntimeHealth(path string) (schema.RuntimeHealth, error) {
raw, err := os.ReadFile(path)
if err != nil {
return schema.RuntimeHealth{}, err
}
var health schema.RuntimeHealth
if err := json.Unmarshal(raw, &health); err != nil {
return schema.RuntimeHealth{}, err
}
return health, nil
}
func latestSATSummaries() []string {
patterns := []struct {
label string
prefix string
}{
{label: "NVIDIA SAT", prefix: "gpu-nvidia-"},
{label: "Memory SAT", prefix: "memory-"},
{label: "Storage SAT", prefix: "storage-"},
}
var out []string
for _, item := range patterns {
matches, err := filepath.Glob(filepath.Join(DefaultSATBaseDir, item.prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
continue
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
continue
}
out = append(out, formatSATSummary(item.label, string(raw)))
}
return out
}
func formatSATSummary(label, raw string) string {
values := parseKeyValueSummary(raw)
var body strings.Builder
fmt.Fprintf(&body, "%s:", label)
if overall := firstNonEmpty(values["overall_status"], "UNKNOWN"); overall != "" {
fmt.Fprintf(&body, " %s", overall)
}
if ok := firstNonEmpty(values["job_ok"], "0"); ok != "" {
fmt.Fprintf(&body, " ok=%s", ok)
}
if failed := firstNonEmpty(values["job_failed"], "0"); failed != "" {
fmt.Fprintf(&body, " failed=%s", failed)
}
if unsupported := firstNonEmpty(values["job_unsupported"], "0"); unsupported != "" && unsupported != "0" {
fmt.Fprintf(&body, " unsupported=%s", unsupported)
}
if devices := strings.TrimSpace(values["devices"]); devices != "" {
fmt.Fprintf(&body, "\nDevices: %s", devices)
}
return body.String()
}
func formatSystemLine(board schema.HardwareBoard) string {
model := strings.TrimSpace(strings.Join([]string{
trimPtr(board.Manufacturer),
trimPtr(board.ProductName),
}, " "))
serial := strings.TrimSpace(board.SerialNumber)
switch {
case model != "" && serial != "":
return fmt.Sprintf("System: %s | S/N %s", model, serial)
case model != "":
return "System: " + model
case serial != "":
return "System S/N: " + serial
default:
return ""
}
}
func formatCPULine(cpus []schema.HardwareCPU) string {
if len(cpus) == 0 {
return ""
}
modelCounts := map[string]int{}
unknown := 0
for _, cpu := range cpus {
model := trimPtr(cpu.Model)
if model == "" {
unknown++
continue
}
modelCounts[model]++
}
if len(modelCounts) == 1 && unknown == 0 {
for model, count := range modelCounts {
return fmt.Sprintf("CPU: %d x %s", count, model)
}
}
parts := make([]string, 0, len(modelCounts)+1)
if len(modelCounts) > 0 {
keys := make([]string, 0, len(modelCounts))
for key := range modelCounts {
keys = append(keys, key)
}
sort.Strings(keys)
for _, key := range keys {
parts = append(parts, fmt.Sprintf("%d x %s", modelCounts[key], key))
}
}
if unknown > 0 {
parts = append(parts, fmt.Sprintf("%d x unknown", unknown))
}
return "CPU: " + strings.Join(parts, ", ")
}
func formatMemoryLine(dimms []schema.HardwareMemory) string {
totalMB := 0
present := 0
types := map[string]struct{}{}
for _, dimm := range dimms {
if dimm.Present != nil && !*dimm.Present {
continue
}
if dimm.SizeMB == nil || *dimm.SizeMB <= 0 {
continue
}
present++
totalMB += *dimm.SizeMB
if value := trimPtr(dimm.Type); value != "" {
types[value] = struct{}{}
}
}
if totalMB == 0 {
return ""
}
typeText := joinSortedKeys(types)
line := fmt.Sprintf("Memory: %s", humanizeMB(totalMB))
if typeText != "" {
line += " " + typeText
}
if present > 0 {
line += fmt.Sprintf(" (%d DIMMs)", present)
}
return line
}
func formatStorageLine(disks []schema.HardwareStorage) string {
count := 0
totalGB := 0
for _, disk := range disks {
if disk.Present != nil && !*disk.Present {
continue
}
count++
if disk.SizeGB != nil && *disk.SizeGB > 0 {
totalGB += *disk.SizeGB
}
}
if count == 0 {
return ""
}
line := fmt.Sprintf("Storage: %d drives", count)
if totalGB > 0 {
line += fmt.Sprintf(" / %s", humanizeGB(totalGB))
}
return line
}
func formatGPULine(devices []schema.HardwarePCIeDevice) string {
gpus := map[string]int{}
for _, dev := range devices {
if !isGPUDevice(dev) {
continue
}
name := firstNonEmpty(trimPtr(dev.Model), trimPtr(dev.Manufacturer), "unknown")
gpus[name]++
}
if len(gpus) == 0 {
return ""
}
keys := make([]string, 0, len(gpus))
for key := range gpus {
keys = append(keys, key)
}
sort.Strings(keys)
parts := make([]string, 0, len(keys))
for _, key := range keys {
parts = append(parts, fmt.Sprintf("%d x %s", gpus[key], key))
}
return "GPU: " + strings.Join(parts, ", ")
}
func formatIPLine(list func() ([]platform.InterfaceInfo, error)) string {
if list == nil {
return ""
}
ifaces, err := list()
if err != nil {
return ""
}
seen := map[string]struct{}{}
var ips []string
for _, iface := range ifaces {
for _, ip := range iface.IPv4 {
ip = strings.TrimSpace(ip)
if ip == "" {
continue
}
if _, ok := seen[ip]; ok {
continue
}
seen[ip] = struct{}{}
ips = append(ips, ip)
}
}
if len(ips) == 0 {
return ""
}
sort.Strings(ips)
return "IP: " + strings.Join(ips, ", ")
}
func isGPUDevice(dev schema.HardwarePCIeDevice) bool {
class := trimPtr(dev.DeviceClass)
model := strings.ToLower(trimPtr(dev.Model))
vendor := strings.ToLower(trimPtr(dev.Manufacturer))
return class == "VideoController" ||
class == "DisplayController" ||
class == "ProcessingAccelerator" ||
strings.Contains(model, "nvidia") ||
strings.Contains(vendor, "nvidia") ||
strings.Contains(vendor, "amd")
}
func trimPtr(value *string) string {
if value == nil {
return ""
}
return strings.TrimSpace(*value)
}
func joinSortedKeys(values map[string]struct{}) string {
if len(values) == 0 {
return ""
}
keys := make([]string, 0, len(values))
for key := range values {
keys = append(keys, key)
}
sort.Strings(keys)
return strings.Join(keys, "/")
}
func humanizeMB(totalMB int) string {
if totalMB <= 0 {
return ""
}
gb := float64(totalMB) / 1024.0
if gb >= 1024.0 {
tb := gb / 1024.0
return fmt.Sprintf("%.1f TB", tb)
}
if gb == float64(int64(gb)) {
return fmt.Sprintf("%.0f GB", gb)
}
return fmt.Sprintf("%.1f GB", gb)
}
func humanizeGB(totalGB int) string {
if totalGB <= 0 {
return ""
}
tb := float64(totalGB) / 1024.0
if tb >= 1.0 {
return fmt.Sprintf("%.1f TB", tb)
}
return fmt.Sprintf("%d GB", totalGB)
}
func parseKeyValueSummary(raw string) map[string]string {
out := map[string]string{}
for _, line := range strings.Split(raw, "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
key, value, ok := strings.Cut(line, "=")
if !ok {
continue
}
out[strings.TrimSpace(key)] = strings.TrimSpace(value)
}
return out
}
func firstNonEmpty(values ...string) string {
for _, value := range values {
value = strings.TrimSpace(value)
if value != "" {
return value
}
}
return ""
}

View File

@@ -1,10 +1,15 @@
package app package app
import ( import (
"context"
"encoding/json"
"errors" "errors"
"os"
"path/filepath"
"testing" "testing"
"bee/audit/internal/platform" "bee/audit/internal/platform"
"bee/audit/internal/schema"
) )
type fakeNetwork struct { type fakeNetwork struct {
@@ -62,6 +67,22 @@ func (f fakeExports) ExportFileToTarget(src string, target platform.RemovableTar
return "", nil return "", nil
} }
type fakeRuntime struct {
collectFn func(string) (schema.RuntimeHealth, error)
dumpFn func(string) error
}
func (f fakeRuntime) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
return f.collectFn(exportDir)
}
func (f fakeRuntime) CaptureTechnicalDump(baseDir string) error {
if f.dumpFn != nil {
return f.dumpFn(baseDir)
}
return nil
}
type fakeTools struct { type fakeTools struct {
tailFileFn func(string, int) string tailFileFn func(string, int) string
checkToolsFn func([]string) []platform.ToolStatus checkToolsFn func([]string) []platform.ToolStatus
@@ -76,11 +97,29 @@ func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
} }
type fakeSAT struct { type fakeSAT struct {
runFn func(string) (string, error) runNvidiaFn func(string) (string, error)
runMemoryFn func(string) (string, error)
runStorageFn func(string) (string, error)
} }
func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string) (string, error) { func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string) (string, error) {
return f.runFn(baseDir) return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaAcceptancePackWithOptions(_ context.Context, baseDir string, _ int, _ int, _ []int) (string, error) {
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
return nil, nil
}
func (f fakeSAT) RunMemoryAcceptancePack(baseDir string) (string, error) {
return f.runMemoryFn(baseDir)
}
func (f fakeSAT) RunStorageAcceptancePack(baseDir string) (string, error) {
return f.runStorageFn(baseDir)
} }
func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) { func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
@@ -96,6 +135,9 @@ func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
}, },
defaultRouteFn: func() string { return "10.0.0.1" }, defaultRouteFn: func() string { return "10.0.0.1" },
}, },
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
} }
result, err := a.NetworkStatus() result, err := a.NetworkStatus()
@@ -116,6 +158,28 @@ func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
} }
} }
func TestNetworkStatusHandlesNoInterfaces(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
listInterfacesFn: func() ([]platform.InterfaceInfo, error) { return nil, nil },
defaultRouteFn: func() string { return "" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
if err != nil {
t.Fatalf("NetworkStatus error: %v", err)
}
if result.Body != "No physical interfaces found." {
t.Fatalf("body=%q want %q", result.Body, "No physical interfaces found.")
}
}
func TestNetworkStatusPropagatesListError(t *testing.T) { func TestNetworkStatusPropagatesListError(t *testing.T) {
t.Parallel() t.Parallel()
@@ -126,6 +190,9 @@ func TestNetworkStatusPropagatesListError(t *testing.T) {
}, },
defaultRouteFn: func() string { return "" }, defaultRouteFn: func() string { return "" },
}, },
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
} }
result, err := a.NetworkStatus() result, err := a.NetworkStatus()
@@ -150,6 +217,9 @@ func TestParseStaticIPv4ConfigAndDefaults(t *testing.T) {
dhcpAllFn: func() (string, error) { return "", nil }, dhcpAllFn: func() (string, error) { return "", nil },
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil }, setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
}, },
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
} }
defaults := a.DefaultStaticIPv4FormFields("eth0") defaults := a.DefaultStaticIPv4FormFields("eth0")
@@ -186,13 +256,16 @@ func TestServiceActionResults(t *testing.T) {
return string(action) + " ok", nil return string(action) + " ok", nil
}, },
}, },
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
} }
statusResult, err := a.ServiceStatusResult("bee-audit") statusResult, err := a.ServiceStatusResult("bee-audit")
if err != nil { if err != nil {
t.Fatalf("ServiceStatusResult error: %v", err) t.Fatalf("ServiceStatusResult error: %v", err)
} }
if statusResult.Title != "service: bee-audit" || statusResult.Body != "active" { if statusResult.Title != "service status: bee-audit" || statusResult.Body != "active" {
t.Fatalf("unexpected status result: %#v", statusResult) t.Fatalf("unexpected status result: %#v", statusResult)
} }
@@ -200,7 +273,7 @@ func TestServiceActionResults(t *testing.T) {
if err != nil { if err != nil {
t.Fatalf("ServiceActionResult error: %v", err) t.Fatalf("ServiceActionResult error: %v", err)
} }
if actionResult.Title != "service: bee-audit" || actionResult.Body != "restart ok" { if actionResult.Title != "service restart: bee-audit" || actionResult.Body != "restart ok" {
t.Fatalf("unexpected action result: %#v", actionResult) t.Fatalf("unexpected action result: %#v", actionResult)
} }
} }
@@ -242,17 +315,87 @@ func TestToolCheckAndLogTailResults(t *testing.T) {
} }
} }
func TestActionResultsUseFallbackBody(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
dhcpOneFn: func(string) (string, error) { return " ", nil },
dhcpAllFn: func() (string, error) { return "", nil },
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return nil, nil
},
defaultRouteFn: func() string { return "" },
},
services: fakeServices{
serviceStatusFn: func(string) (string, error) { return "", nil },
serviceDoFn: func(string, platform.ServiceAction) (string, error) { return "", nil },
},
tools: fakeTools{
tailFileFn: func(string, int) string { return " " },
checkToolsFn: func([]string) []platform.ToolStatus { return nil },
},
sat: fakeSAT{
runNvidiaFn: func(string) (string, error) { return "", nil },
runMemoryFn: func(string) (string, error) { return "", nil },
runStorageFn: func(string) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) {
return schema.RuntimeHealth{Status: "PARTIAL", ExportDir: "/tmp/export"}, nil
},
},
}
if got, _ := a.DHCPOneResult("eth0"); got.Body != "DHCP completed." {
t.Fatalf("dhcp one body=%q", got.Body)
}
if got, _ := a.DHCPAllResult(); got.Body != "DHCP completed." {
t.Fatalf("dhcp all body=%q", got.Body)
}
if got, _ := a.SetStaticIPv4Result(platform.StaticIPv4Config{Interface: "eth0"}); got.Body != "Static IPv4 updated." {
t.Fatalf("static body=%q", got.Body)
}
if got, _ := a.ServiceStatusResult("bee-audit"); got.Body != "No status output." {
t.Fatalf("status body=%q", got.Body)
}
if got, _ := a.ServiceActionResult("bee-audit", platform.ServiceRestart); got.Body != "Action completed." {
t.Fatalf("action body=%q", got.Body)
}
if got := a.ToolCheckResult(nil); got.Body != "No tools checked." {
t.Fatalf("tool body=%q", got.Body)
}
if got := a.AuditLogTailResult(); got.Body != "No audit logs found." {
t.Fatalf("log body=%q", got.Body)
}
if got, _ := a.RunNvidiaAcceptancePackResult(""); got.Body != "Archive written." {
t.Fatalf("sat body=%q", got.Body)
}
if got, _ := a.RunMemoryAcceptancePackResult(""); got.Body != "Archive written." {
t.Fatalf("memory sat body=%q", got.Body)
}
if got, _ := a.RunStorageAcceptancePackResult(""); got.Body != "Archive written." {
t.Fatalf("storage sat body=%q", got.Body)
}
}
func TestRunNvidiaAcceptancePackResult(t *testing.T) { func TestRunNvidiaAcceptancePackResult(t *testing.T) {
t.Parallel() t.Parallel()
a := &App{ a := &App{
sat: fakeSAT{ sat: fakeSAT{
runFn: func(baseDir string) (string, error) { runNvidiaFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/sat" { if baseDir != "/tmp/sat" {
t.Fatalf("baseDir=%q want %q", baseDir, "/tmp/sat") t.Fatalf("baseDir=%q want %q", baseDir, "/tmp/sat")
} }
return "/tmp/sat/out.tar.gz", nil return "/tmp/sat/out.tar.gz", nil
}, },
runMemoryFn: func(string) (string, error) { return "", nil },
runStorageFn: func(string) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
}, },
} }
@@ -265,6 +408,186 @@ func TestRunNvidiaAcceptancePackResult(t *testing.T) {
} }
} }
func TestRunSATDefaultsToExportDir(t *testing.T) {
t.Parallel()
oldSATBaseDir := DefaultSATBaseDir
DefaultSATBaseDir = "/tmp/export/bee-sat"
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
a := &App{
sat: fakeSAT{
runNvidiaFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("nvidia baseDir=%q", baseDir)
}
return "", nil
},
runMemoryFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("memory baseDir=%q", baseDir)
}
return "", nil
},
runStorageFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("storage baseDir=%q", baseDir)
}
return "", nil
},
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
if _, err := a.RunNvidiaAcceptancePack(""); err != nil {
t.Fatal(err)
}
if _, err := a.RunMemoryAcceptancePack(""); err != nil {
t.Fatal(err)
}
if _, err := a.RunStorageAcceptancePack(""); err != nil {
t.Fatal(err)
}
}
func TestFormatSATSummary(t *testing.T) {
t.Parallel()
got := formatSATSummary("Memory SAT", "overall_status=PARTIAL\njob_ok=2\njob_failed=0\njob_unsupported=1\ndevices=3\n")
want := "Memory SAT: PARTIAL ok=2 failed=0 unsupported=1\nDevices: 3"
if got != want {
t.Fatalf("got %q want %q", got, want)
}
}
func TestHealthSummaryResultIncludesCompactSATSummary(t *testing.T) {
tmp := t.TempDir()
oldAuditPath := DefaultAuditJSONPath
oldSATBaseDir := DefaultSATBaseDir
DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
DefaultSATBaseDir = filepath.Join(tmp, "sat")
t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
satDir := filepath.Join(DefaultSATBaseDir, "memory-testcase")
if err := os.MkdirAll(satDir, 0755); err != nil {
t.Fatalf("mkdir sat dir: %v", err)
}
raw := `{"collected_at":"2026-03-15T10:00:00Z","hardware":{"board":{"serial_number":"SRV123"},"storage":[{"serial_number":"DISK1","status":"Warning"}]}}`
if err := os.WriteFile(DefaultAuditJSONPath, []byte(raw), 0644); err != nil {
t.Fatalf("write audit json: %v", err)
}
if err := os.WriteFile(filepath.Join(satDir, "summary.txt"), []byte("overall_status=OK\njob_ok=3\njob_failed=0\njob_unsupported=0\n"), 0644); err != nil {
t.Fatalf("write sat summary: %v", err)
}
result := (&App{}).HealthSummaryResult()
if !contains(result.Body, "Memory SAT: OK ok=3 failed=0") {
t.Fatalf("body missing compact sat summary:\n%s", result.Body)
}
}
func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
tmp := t.TempDir()
exportDir := filepath.Join(tmp, "export")
if err := os.MkdirAll(filepath.Join(exportDir, "bee-sat", "memory-run"), 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.json"), []byte(`{"ok":true}`), 0644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run", "verbose.log"), []byte("sat verbose"), 0644); err != nil {
t.Fatal(err)
}
archive, err := BuildSupportBundle(exportDir)
if err != nil {
t.Fatalf("BuildSupportBundle error: %v", err)
}
if _, err := os.Stat(archive); err != nil {
t.Fatalf("archive stat: %v", err)
}
}
func TestMainBanner(t *testing.T) {
tmp := t.TempDir()
oldAuditPath := DefaultAuditJSONPath
DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
trueValue := true
manufacturer := "Dell"
product := "PowerEdge R760"
cpuModel := "Intel Xeon Gold 6430"
memoryType := "DDR5"
gpuClass := "VideoController"
gpuModel := "NVIDIA H100"
payload := schema.HardwareIngestRequest{
Hardware: schema.HardwareSnapshot{
Board: schema.HardwareBoard{
Manufacturer: &manufacturer,
ProductName: &product,
SerialNumber: "SRV123",
},
CPUs: []schema.HardwareCPU{
{Model: &cpuModel},
{Model: &cpuModel},
},
Memory: []schema.HardwareMemory{
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
},
Storage: []schema.HardwareStorage{
{Present: &trueValue, SizeGB: intPtr(3840)},
{Present: &trueValue, SizeGB: intPtr(3840)},
},
PCIeDevices: []schema.HardwarePCIeDevice{
{DeviceClass: &gpuClass, Model: &gpuModel},
{DeviceClass: &gpuClass, Model: &gpuModel},
},
},
}
raw, err := json.Marshal(payload)
if err != nil {
t.Fatalf("marshal: %v", err)
}
if err := os.WriteFile(DefaultAuditJSONPath, raw, 0644); err != nil {
t.Fatalf("write audit json: %v", err)
}
a := &App{
network: fakeNetwork{
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return []platform.InterfaceInfo{
{Name: "eth0", IPv4: []string{"10.0.0.10"}},
{Name: "eth1", IPv4: []string{"192.168.1.10"}},
}, nil
},
},
}
got := a.MainBanner()
for _, want := range []string{
"System: Dell PowerEdge R760 | S/N SRV123",
"CPU: 2 x Intel Xeon Gold 6430",
"Memory: 1.0 TB DDR5 (2 DIMMs)",
"Storage: 2 drives / 7.5 TB",
"GPU: 2 x NVIDIA H100",
"IP: 10.0.0.10, 192.168.1.10",
} {
if !contains(got, want) {
t.Fatalf("banner missing %q:\n%s", want, got)
}
}
}
func intPtr(v int) *int { return &v }
func contains(haystack, needle string) bool { func contains(haystack, needle string) bool {
return len(needle) == 0 || (len(haystack) >= len(needle) && (haystack == needle || containsAt(haystack, needle))) return len(needle) == 0 || (len(haystack) >= len(needle) && (haystack == needle || containsAt(haystack, needle)))
} }

View File

@@ -0,0 +1,300 @@
package app
import (
"archive/tar"
"compress/gzip"
"fmt"
"io"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
"time"
)
var supportBundleServices = []string{
"bee-audit.service",
"bee-web.service",
"bee-network.service",
"bee-nvidia.service",
"bee-preflight.service",
"bee-sshsetup.service",
}
var supportBundleCommands = []struct {
name string
cmd []string
}{
{name: "system/uname.txt", cmd: []string{"uname", "-a"}},
{name: "system/lsmod.txt", cmd: []string{"lsmod"}},
{name: "system/lspci-nn.txt", cmd: []string{"lspci", "-nn"}},
{name: "system/ip-addr.txt", cmd: []string{"ip", "addr"}},
{name: "system/ip-route.txt", cmd: []string{"ip", "route"}},
{name: "system/mount.txt", cmd: []string{"mount"}},
{name: "system/df-h.txt", cmd: []string{"df", "-h"}},
{name: "system/dmesg-tail.txt", cmd: []string{"sh", "-c", "dmesg | tail -n 200"}},
}
func BuildSupportBundle(exportDir string) (string, error) {
exportDir = strings.TrimSpace(exportDir)
if exportDir == "" {
exportDir = DefaultExportDir
}
if err := os.MkdirAll(exportDir, 0755); err != nil {
return "", err
}
if err := cleanupOldSupportBundles(os.TempDir()); err != nil {
return "", err
}
host := sanitizeFilename(hostnameOr("unknown"))
ts := time.Now().UTC().Format("20060102-150405")
stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s", host, ts))
if err := os.MkdirAll(stageRoot, 0755); err != nil {
return "", err
}
defer os.RemoveAll(stageRoot)
if err := copyDirContents(exportDir, filepath.Join(stageRoot, "export")); err != nil {
return "", err
}
if err := writeJournalDump(filepath.Join(stageRoot, "systemd", "combined.journal.log")); err != nil {
return "", err
}
for _, svc := range supportBundleServices {
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".status.txt"), []string{"systemctl", "status", svc, "--no-pager"}); err != nil {
return "", err
}
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".journal.log"), []string{"journalctl", "--no-pager", "-u", svc}); err != nil {
return "", err
}
}
for _, item := range supportBundleCommands {
if err := writeCommandOutput(filepath.Join(stageRoot, item.name), item.cmd); err != nil {
return "", err
}
}
if err := writeManifest(filepath.Join(stageRoot, "manifest.txt"), exportDir, stageRoot); err != nil {
return "", err
}
archivePath := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s.tar.gz", host, ts))
if err := createSupportTarGz(archivePath, stageRoot); err != nil {
return "", err
}
return archivePath, nil
}
func cleanupOldSupportBundles(dir string) error {
matches, err := filepath.Glob(filepath.Join(dir, "bee-support-*.tar.gz"))
if err != nil {
return err
}
type entry struct {
path string
mod time.Time
}
list := make([]entry, 0, len(matches))
for _, match := range matches {
info, err := os.Stat(match)
if err != nil {
continue
}
if time.Since(info.ModTime()) > 24*time.Hour {
_ = os.Remove(match)
continue
}
list = append(list, entry{path: match, mod: info.ModTime()})
}
sort.Slice(list, func(i, j int) bool { return list[i].mod.After(list[j].mod) })
if len(list) > 3 {
for _, old := range list[3:] {
_ = os.Remove(old.path)
}
}
return nil
}
func writeJournalDump(dst string) error {
args := []string{"--no-pager"}
for _, svc := range supportBundleServices {
args = append(args, "-u", svc)
}
raw, err := exec.Command("journalctl", args...).CombinedOutput()
if len(raw) == 0 && err != nil {
raw = []byte(err.Error() + "\n")
}
if len(raw) == 0 {
raw = []byte("no journal output\n")
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
return os.WriteFile(dst, raw, 0644)
}
func writeCommandOutput(dst string, cmd []string) error {
if len(cmd) == 0 {
return nil
}
raw, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
if len(raw) == 0 {
if err != nil {
raw = []byte(err.Error() + "\n")
} else {
raw = []byte("no output\n")
}
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
return os.WriteFile(dst, raw, 0644)
}
func writeManifest(dst, exportDir, stageRoot string) error {
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
var body strings.Builder
fmt.Fprintf(&body, "bee_version=%s\n", buildVersion())
fmt.Fprintf(&body, "host=%s\n", hostnameOr("unknown"))
fmt.Fprintf(&body, "generated_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
fmt.Fprintf(&body, "export_dir=%s\n", exportDir)
fmt.Fprintf(&body, "\nfiles:\n")
var files []string
if err := filepath.Walk(stageRoot, func(path string, info os.FileInfo, err error) error {
if err != nil || info.IsDir() {
return err
}
if filepath.Clean(path) == filepath.Clean(dst) {
return nil
}
rel, err := filepath.Rel(stageRoot, path)
if err != nil {
return err
}
files = append(files, fmt.Sprintf("%s\t%d", rel, info.Size()))
return nil
}); err != nil {
return err
}
sort.Strings(files)
for _, line := range files {
body.WriteString(line)
body.WriteByte('\n')
}
return os.WriteFile(dst, []byte(body.String()), 0644)
}
func buildVersion() string {
raw, err := exec.Command("bee", "version").CombinedOutput()
if err != nil {
return "unknown"
}
return strings.TrimSpace(string(raw))
}
func copyDirContents(srcDir, dstDir string) error {
entries, err := os.ReadDir(srcDir)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
for _, entry := range entries {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(dstDir, entry.Name())
if err := copyPath(src, dst); err != nil {
return err
}
}
return nil
}
func copyPath(src, dst string) error {
info, err := os.Stat(src)
if err != nil {
return err
}
if info.IsDir() {
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
return err
}
entries, err := os.ReadDir(src)
if err != nil {
return err
}
for _, entry := range entries {
if err := copyPath(filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name())); err != nil {
return err
}
}
return nil
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
in, err := os.Open(src)
if err != nil {
return err
}
defer in.Close()
out, err := os.OpenFile(dst, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, info.Mode().Perm())
if err != nil {
return err
}
defer out.Close()
_, err = io.Copy(out, in)
return err
}
func createSupportTarGz(dst, srcDir string) error {
file, err := os.Create(dst)
if err != nil {
return err
}
defer file.Close()
gz := gzip.NewWriter(file)
defer gz.Close()
tw := tar.NewWriter(gz)
defer tw.Close()
base := filepath.Dir(srcDir)
return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if info.IsDir() {
return nil
}
header, err := tar.FileInfoHeader(info, "")
if err != nil {
return err
}
header.Name, err = filepath.Rel(base, path)
if err != nil {
return err
}
if err := tw.WriteHeader(header); err != nil {
return err
}
f, err := os.Open(path)
if err != nil {
return err
}
defer f.Close()
_, err = io.Copy(tw, f)
return err
})
}

View File

@@ -8,6 +8,14 @@ import (
"strings" "strings"
) )
var execDmidecode = func(typeNum string) (string, error) {
out, err := exec.Command("dmidecode", "-t", typeNum).Output()
if err != nil {
return "", err
}
return string(out), nil
}
// collectBoard runs dmidecode for types 0, 1, 2 and returns the board record // collectBoard runs dmidecode for types 0, 1, 2 and returns the board record
// plus the BIOS firmware entry. Any failure is logged and returns zero values. // plus the BIOS firmware entry. Any failure is logged and returns zero values.
func collectBoard() (schema.HardwareBoard, []schema.HardwareFirmwareRecord) { func collectBoard() (schema.HardwareBoard, []schema.HardwareFirmwareRecord) {
@@ -141,9 +149,5 @@ func cleanDMIValue(v string) string {
// runDmidecode executes dmidecode -t <typeNum> and returns its stdout. // runDmidecode executes dmidecode -t <typeNum> and returns its stdout.
func runDmidecode(typeNum string) (string, error) { func runDmidecode(typeNum string) (string, error) {
out, err := exec.Command("dmidecode", "-t", typeNum).Output() return execDmidecode(typeNum)
if err != nil {
return "", err
}
return string(out), nil
} }

View File

@@ -7,13 +7,15 @@ import (
"bee/audit/internal/runtimeenv" "bee/audit/internal/runtimeenv"
"bee/audit/internal/schema" "bee/audit/internal/schema"
"log/slog" "log/slog"
"os"
"time" "time"
) )
// Run executes all collectors and returns the combined snapshot. // Run executes all collectors and returns the combined snapshot.
// Partial failures are logged as warnings; collection always completes. // Partial failures are logged as warnings; collection always completes.
func Run(runtimeMode runtimeenv.Mode) schema.HardwareIngestRequest { func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
start := time.Now() start := time.Now()
collectedAt := time.Now().UTC().Format(time.RFC3339)
slog.Info("audit started") slog.Info("audit started")
snap := schema.HardwareSnapshot{} snap := schema.HardwareSnapshot{}
@@ -22,31 +24,41 @@ func Run(runtimeMode runtimeenv.Mode) schema.HardwareIngestRequest {
snap.Board = board snap.Board = board
snap.Firmware = append(snap.Firmware, biosFW...) snap.Firmware = append(snap.Firmware, biosFW...)
cpus, cpuFW := collectCPUs(snap.Board.SerialNumber) snap.CPUs = collectCPUs()
snap.CPUs = cpus
snap.Firmware = append(snap.Firmware, cpuFW...)
snap.Memory = collectMemory() snap.Memory = collectMemory()
sensorDoc, err := readSensorsJSONDoc()
if err != nil {
slog.Info("sensors: unavailable for enrichment", "err", err)
}
snap.CPUs = enrichCPUsWithTelemetry(snap.CPUs, sensorDoc)
snap.Memory = enrichMemoryWithTelemetry(snap.Memory, sensorDoc)
snap.Storage = collectStorage() snap.Storage = collectStorage()
snap.PCIeDevices = collectPCIe() snap.PCIeDevices = collectPCIe()
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices, snap.Board.SerialNumber) snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithMellanox(snap.PCIeDevices) snap.PCIeDevices = enrichPCIeWithMellanox(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNICTelemetry(snap.PCIeDevices) snap.PCIeDevices = enrichPCIeWithNICTelemetry(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithRAIDTelemetry(snap.PCIeDevices)
snap.Storage = enrichStorageWithVROC(snap.Storage, snap.PCIeDevices) snap.Storage = enrichStorageWithVROC(snap.Storage, snap.PCIeDevices)
snap.Storage = appendUniqueStorage(snap.Storage, collectRAIDStorage(snap.PCIeDevices)) snap.Storage = appendUniqueStorage(snap.Storage, collectRAIDStorage(snap.PCIeDevices))
snap.PowerSupplies = collectPSUs() snap.PowerSupplies = collectPSUs()
snap.PowerSupplies = enrichPSUsWithTelemetry(snap.PowerSupplies, sensorDoc)
snap.Sensors = buildSensorsFromDoc(sensorDoc)
finalizeSnapshot(&snap, collectedAt)
// remaining collectors added in steps 1.8 1.10 // remaining collectors added in steps 1.8 1.10
slog.Info("audit completed", "duration", time.Since(start).Round(time.Millisecond)) slog.Info("audit completed", "duration", time.Since(start).Round(time.Millisecond))
sourceType := string(runtimeMode) sourceType := "manual"
protocol := "os-direct" var targetHost *string
if hostname, err := os.Hostname(); err == nil && hostname != "" {
targetHost = &hostname
}
return schema.HardwareIngestRequest{ return schema.HardwareIngestRequest{
SourceType: &sourceType, SourceType: &sourceType,
Protocol: &protocol, TargetHost: targetHost,
CollectedAt: time.Now().UTC().Format(time.RFC3339), CollectedAt: collectedAt,
Hardware: snap, Hardware: snap,
} }
} }

View File

@@ -0,0 +1,64 @@
package collector
import "strings"
const (
statusOK = "OK"
statusWarning = "Warning"
statusCritical = "Critical"
statusUnknown = "Unknown"
statusEmpty = "Empty"
)
func mapPCIeDeviceClass(raw string) string {
normalized := strings.ToLower(strings.TrimSpace(raw))
switch {
case normalized == "":
return ""
case strings.Contains(normalized, "ethernet controller"):
return "EthernetController"
case strings.Contains(normalized, "fibre channel"):
return "FibreChannelController"
case strings.Contains(normalized, "network controller"), strings.Contains(normalized, "infiniband controller"):
return "NetworkController"
case strings.Contains(normalized, "serial attached scsi"), strings.Contains(normalized, "storage controller"):
return "StorageController"
case strings.Contains(normalized, "raid"), strings.Contains(normalized, "mass storage"):
return "MassStorageController"
case strings.Contains(normalized, "display controller"):
return "DisplayController"
case strings.Contains(normalized, "vga"), strings.Contains(normalized, "3d controller"), strings.Contains(normalized, "video controller"):
return "VideoController"
case strings.Contains(normalized, "processing accelerators"), strings.Contains(normalized, "processing accelerator"):
return "ProcessingAccelerator"
default:
return raw
}
}
func isNICClass(class string) bool {
switch strings.TrimSpace(class) {
case "EthernetController", "NetworkController":
return true
default:
return false
}
}
func isGPUClass(class string) bool {
switch strings.TrimSpace(class) {
case "VideoController", "DisplayController", "ProcessingAccelerator":
return true
default:
return false
}
}
func isRAIDClass(class string) bool {
switch strings.TrimSpace(class) {
case "MassStorageController", "StorageController":
return true
default:
return false
}
}

View File

@@ -3,42 +3,39 @@ package collector
import ( import (
"bee/audit/internal/schema" "bee/audit/internal/schema"
"bufio" "bufio"
"fmt"
"log/slog" "log/slog"
"os" "os"
"path/filepath"
"strconv" "strconv"
"strings" "strings"
) )
// collectCPUs runs dmidecode -t 4 and reads microcode version from sysfs. // collectCPUs runs dmidecode -t 4 and enriches CPUs with microcode from sysfs.
func collectCPUs(boardSerial string) ([]schema.HardwareCPU, []schema.HardwareFirmwareRecord) { func collectCPUs() []schema.HardwareCPU {
out, err := runDmidecode("4") out, err := runDmidecode("4")
if err != nil { if err != nil {
slog.Warn("cpu: dmidecode type 4 failed", "err", err) slog.Warn("cpu: dmidecode type 4 failed", "err", err)
return nil, nil return nil
} }
cpus := parseCPUs(out, boardSerial) cpus := parseCPUs(out)
var firmware []schema.HardwareFirmwareRecord
if mc := readMicrocode(); mc != "" { if mc := readMicrocode(); mc != "" {
firmware = append(firmware, schema.HardwareFirmwareRecord{ for i := range cpus {
DeviceName: "CPU Microcode", cpus[i].Firmware = &mc
Version: mc, }
})
} }
slog.Info("cpu: collected", "count", len(cpus)) slog.Info("cpu: collected", "count", len(cpus))
return cpus, firmware return cpus
} }
// parseCPUs splits dmidecode output into per-processor sections and parses each. // parseCPUs splits dmidecode output into per-processor sections and parses each.
func parseCPUs(output, boardSerial string) []schema.HardwareCPU { func parseCPUs(output string) []schema.HardwareCPU {
sections := splitDMISections(output, "Processor Information") sections := splitDMISections(output, "Processor Information")
cpus := make([]schema.HardwareCPU, 0, len(sections)) cpus := make([]schema.HardwareCPU, 0, len(sections))
for _, section := range sections { for _, section := range sections {
cpu, ok := parseCPUSection(section, boardSerial) cpu, ok := parseCPUSection(section)
if !ok { if !ok {
continue continue
} }
@@ -49,14 +46,16 @@ func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
// parseCPUSection parses one "Processor Information" block into a HardwareCPU. // parseCPUSection parses one "Processor Information" block into a HardwareCPU.
// Returns false if the socket is unpopulated. // Returns false if the socket is unpopulated.
func parseCPUSection(fields map[string]string, boardSerial string) (schema.HardwareCPU, bool) { func parseCPUSection(fields map[string]string) (schema.HardwareCPU, bool) {
status := parseCPUStatus(fields["Status"]) status := parseCPUStatus(fields["Status"])
if status == "EMPTY" { if status == statusEmpty {
return schema.HardwareCPU{}, false return schema.HardwareCPU{}, false
} }
cpu := schema.HardwareCPU{} cpu := schema.HardwareCPU{}
cpu.Status = &status cpu.Status = &status
present := true
cpu.Present = &present
if socket, ok := parseSocketIndex(fields["Socket Designation"]); ok { if socket, ok := parseSocketIndex(fields["Socket Designation"]); ok {
cpu.Socket = &socket cpu.Socket = &socket
@@ -70,11 +69,6 @@ func parseCPUSection(fields map[string]string, boardSerial string) (schema.Hardw
} }
if v := cleanDMIValue(fields["Serial Number"]); v != "" { if v := cleanDMIValue(fields["Serial Number"]); v != "" {
cpu.SerialNumber = &v cpu.SerialNumber = &v
} else if boardSerial != "" && cpu.Socket != nil {
// Intel Xeon never exposes serial via DMI — generate stable fallback
// matching core's generateCPUVendorSerial() logic
fb := fmt.Sprintf("%s-CPU-%d", boardSerial, *cpu.Socket)
cpu.SerialNumber = &fb
} }
if v := parseMHz(fields["Max Speed"]); v > 0 { if v := parseMHz(fields["Max Speed"]); v > 0 {
@@ -99,15 +93,15 @@ func parseCPUStatus(raw string) string {
upper := strings.ToUpper(raw) upper := strings.ToUpper(raw)
switch { switch {
case upper == "" || upper == "UNKNOWN": case upper == "" || upper == "UNKNOWN":
return "UNKNOWN" return statusUnknown
case strings.Contains(upper, "UNPOPULATED") || strings.Contains(upper, "NOT POPULATED"): case strings.Contains(upper, "UNPOPULATED") || strings.Contains(upper, "NOT POPULATED"):
return "EMPTY" return statusEmpty
case strings.Contains(upper, "ENABLED"): case strings.Contains(upper, "ENABLED"):
return "OK" return statusOK
case strings.Contains(upper, "DISABLED"): case strings.Contains(upper, "DISABLED"):
return "WARNING" return statusWarning
default: default:
return "UNKNOWN" return statusUnknown
} }
} }
@@ -178,7 +172,7 @@ func parseInt(v string) int {
// readMicrocode reads the CPU microcode revision from sysfs. // readMicrocode reads the CPU microcode revision from sysfs.
// Returns empty string if unavailable. // Returns empty string if unavailable.
func readMicrocode() string { func readMicrocode() string {
data, err := os.ReadFile("/sys/devices/system/cpu/cpu0/microcode/version") data, err := os.ReadFile(filepath.Join(cpuSysBaseDir, "cpu0", "microcode", "version"))
if err != nil { if err != nil {
return "" return ""
} }

View File

@@ -0,0 +1,196 @@
package collector
import (
"bee/audit/internal/schema"
"os"
"path/filepath"
"regexp"
"sort"
"strconv"
"strings"
)
var (
cpuSysBaseDir = "/sys/devices/system/cpu"
socketIndexRe = regexp.MustCompile(`(?i)(?:package id|socket|cpu)\s*([0-9]+)`)
)
func enrichCPUsWithTelemetry(cpus []schema.HardwareCPU, doc sensorsDoc) []schema.HardwareCPU {
if len(cpus) == 0 {
return cpus
}
tempBySocket := cpuTempsFromSensors(doc, len(cpus))
powerBySocket := cpuPowerFromSensors(doc, len(cpus))
throttleBySocket := cpuThrottleBySocket()
for i := range cpus {
socket := 0
if cpus[i].Socket != nil {
socket = *cpus[i].Socket
}
if value, ok := tempBySocket[socket]; ok {
cpus[i].TemperatureC = &value
}
if value, ok := powerBySocket[socket]; ok {
cpus[i].PowerW = &value
}
if value, ok := throttleBySocket[socket]; ok {
cpus[i].Throttled = &value
}
}
return cpus
}
func cpuTempsFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
out := map[int]float64{}
if len(doc) == 0 {
return out
}
var fallback []float64
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if classifySensorFeature(feature) != "temp" {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
if socket, ok := detectCPUSocket(chip, featureName); ok {
if _, exists := out[socket]; !exists {
out[socket] = temp
}
continue
}
if isLikelyCPUTemp(chip, featureName) {
fallback = append(fallback, temp)
}
}
}
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
out[0] = fallback[0]
}
return out
}
func cpuPowerFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
out := map[int]float64{}
if len(doc) == 0 {
return out
}
var fallback []float64
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if classifySensorFeature(feature) != "power" {
continue
}
power, ok := firstFeatureFloatWithContains(feature, []string{"power"})
if !ok {
continue
}
if socket, ok := detectCPUSocket(chip, featureName); ok {
if _, exists := out[socket]; !exists {
out[socket] = power
}
continue
}
if isLikelyCPUPower(chip, featureName) {
fallback = append(fallback, power)
}
}
}
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
out[0] = fallback[0]
}
return out
}
func detectCPUSocket(parts ...string) (int, bool) {
for _, part := range parts {
matches := socketIndexRe.FindStringSubmatch(strings.ToLower(part))
if len(matches) == 2 {
value, err := strconv.Atoi(matches[1])
if err == nil {
return value, true
}
}
}
return 0, false
}
func isLikelyCPUTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "coretemp") ||
strings.Contains(value, "k10temp") ||
strings.Contains(value, "package id") ||
strings.Contains(value, "tdie") ||
strings.Contains(value, "tctl") ||
strings.Contains(value, "cpu temp")
}
func isLikelyCPUPower(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "intel-rapl") ||
strings.Contains(value, "package id") ||
strings.Contains(value, "package-") ||
strings.Contains(value, "cpu power")
}
func cpuThrottleBySocket() map[int]bool {
out := map[int]bool{}
cpuDirs, err := filepath.Glob(filepath.Join(cpuSysBaseDir, "cpu[0-9]*"))
if err != nil {
return out
}
sort.Strings(cpuDirs)
for _, cpuDir := range cpuDirs {
socket, ok := readSocketIndex(cpuDir)
if !ok {
continue
}
if cpuPackageThrottled(cpuDir) {
out[socket] = true
}
}
return out
}
func readSocketIndex(cpuDir string) (int, bool) {
raw, err := os.ReadFile(filepath.Join(cpuDir, "topology", "physical_package_id"))
if err != nil {
return 0, false
}
value, err := strconv.Atoi(strings.TrimSpace(string(raw)))
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func cpuPackageThrottled(cpuDir string) bool {
paths := []string{
filepath.Join(cpuDir, "thermal_throttle", "package_throttle_count"),
filepath.Join(cpuDir, "thermal_throttle", "core_throttle_count"),
}
for _, path := range paths {
raw, err := os.ReadFile(path)
if err != nil {
continue
}
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
if err == nil && value > 0 {
return true
}
}
return false
}

View File

@@ -0,0 +1,71 @@
package collector
import (
"os"
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestEnrichCPUsWithTelemetry(t *testing.T) {
tmp := t.TempDir()
oldBase := cpuSysBaseDir
cpuSysBaseDir = tmp
t.Cleanup(func() { cpuSysBaseDir = oldBase })
mustWriteFile(t, filepath.Join(tmp, "cpu0", "topology", "physical_package_id"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "cpu0", "thermal_throttle", "package_throttle_count"), "3\n")
mustWriteFile(t, filepath.Join(tmp, "cpu1", "topology", "physical_package_id"), "1\n")
mustWriteFile(t, filepath.Join(tmp, "cpu1", "thermal_throttle", "package_throttle_count"), "0\n")
doc := sensorsDoc{
"coretemp-isa-0000": {
"Package id 0": map[string]any{"temp1_input": 61.5},
"Package id 1": map[string]any{"temp2_input": 58.0},
},
"intel-rapl-mmio-0": {
"Package id 0": map[string]any{"power1_average": 180.0},
"Package id 1": map[string]any{"power2_average": 175.0},
},
}
socket0 := 0
socket1 := 1
status := statusOK
cpus := []schema.HardwareCPU{
{Socket: &socket0, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Socket: &socket1, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
}
got := enrichCPUsWithTelemetry(cpus, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 61.5 {
t.Fatalf("cpu0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[0].PowerW == nil || *got[0].PowerW != 180.0 {
t.Fatalf("cpu0 power mismatch: %#v", got[0].PowerW)
}
if got[0].Throttled == nil || !*got[0].Throttled {
t.Fatalf("cpu0 throttled mismatch: %#v", got[0].Throttled)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 58.0 {
t.Fatalf("cpu1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[1].PowerW == nil || *got[1].PowerW != 175.0 {
t.Fatalf("cpu1 power mismatch: %#v", got[1].PowerW)
}
if got[1].Throttled != nil && *got[1].Throttled {
t.Fatalf("cpu1 throttled mismatch: %#v", got[1].Throttled)
}
}
func mustWriteFile(t *testing.T, path, content string) {
t.Helper()
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
t.Fatalf("mkdir %s: %v", path, err)
}
if err := os.WriteFile(path, []byte(content), 0644); err != nil {
t.Fatalf("write %s: %v", path, err)
}
}

View File

@@ -1,12 +1,14 @@
package collector package collector
import ( import (
"os"
"path/filepath"
"testing" "testing"
) )
func TestParseCPUs_dual_socket(t *testing.T) { func TestParseCPUs_dual_socket(t *testing.T) {
out := mustReadFile(t, "testdata/dmidecode_type4.txt") out := mustReadFile(t, "testdata/dmidecode_type4.txt")
cpus := parseCPUs(out, "CAR315KA0803B90") cpus := parseCPUs(out)
if len(cpus) != 2 { if len(cpus) != 2 {
t.Fatalf("expected 2 CPUs, got %d", len(cpus)) t.Fatalf("expected 2 CPUs, got %d", len(cpus))
@@ -37,23 +39,22 @@ func TestParseCPUs_dual_socket(t *testing.T) {
if cpu0.Status == nil || *cpu0.Status != "OK" { if cpu0.Status == nil || *cpu0.Status != "OK" {
t.Errorf("cpu0 status: got %v, want OK", cpu0.Status) t.Errorf("cpu0 status: got %v, want OK", cpu0.Status)
} }
// Intel Xeon serial not available → fallback if cpu0.SerialNumber != nil {
if cpu0.SerialNumber == nil || *cpu0.SerialNumber != "CAR315KA0803B90-CPU-0" { t.Errorf("cpu0 serial should stay nil without source data, got %v", cpu0.SerialNumber)
t.Errorf("cpu0 serial fallback: got %v, want CAR315KA0803B90-CPU-0", cpu0.SerialNumber)
} }
cpu1 := cpus[1] cpu1 := cpus[1]
if cpu1.Socket == nil || *cpu1.Socket != 1 { if cpu1.Socket == nil || *cpu1.Socket != 1 {
t.Errorf("cpu1 socket: got %v, want 1", cpu1.Socket) t.Errorf("cpu1 socket: got %v, want 1", cpu1.Socket)
} }
if cpu1.SerialNumber == nil || *cpu1.SerialNumber != "CAR315KA0803B90-CPU-1" { if cpu1.SerialNumber != nil {
t.Errorf("cpu1 serial fallback: got %v, want CAR315KA0803B90-CPU-1", cpu1.SerialNumber) t.Errorf("cpu1 serial should stay nil without source data, got %v", cpu1.SerialNumber)
} }
} }
func TestParseCPUs_unpopulated_skipped(t *testing.T) { func TestParseCPUs_unpopulated_skipped(t *testing.T) {
out := mustReadFile(t, "testdata/dmidecode_type4_disabled.txt") out := mustReadFile(t, "testdata/dmidecode_type4_disabled.txt")
cpus := parseCPUs(out, "BOARD-001") cpus := parseCPUs(out)
if len(cpus) != 1 { if len(cpus) != 1 {
t.Fatalf("expected 1 CPU (unpopulated skipped), got %d", len(cpus)) t.Fatalf("expected 1 CPU (unpopulated skipped), got %d", len(cpus))
@@ -63,18 +64,51 @@ func TestParseCPUs_unpopulated_skipped(t *testing.T) {
} }
} }
func TestCollectCPUsSetsFirmwareFromMicrocode(t *testing.T) {
tmp := t.TempDir()
origBase := cpuSysBaseDir
cpuSysBaseDir = tmp
t.Cleanup(func() { cpuSysBaseDir = origBase })
if err := os.MkdirAll(filepath.Join(tmp, "cpu0", "microcode"), 0755); err != nil {
t.Fatalf("mkdir microcode dir: %v", err)
}
if err := os.WriteFile(filepath.Join(tmp, "cpu0", "microcode", "version"), []byte("0x2b000643\n"), 0644); err != nil {
t.Fatalf("write microcode version: %v", err)
}
origRun := execDmidecode
execDmidecode = func(typeNum string) (string, error) {
if typeNum != "4" {
t.Fatalf("unexpected dmidecode type: %s", typeNum)
}
return mustReadFile(t, "testdata/dmidecode_type4.txt"), nil
}
t.Cleanup(func() { execDmidecode = origRun })
cpus := collectCPUs()
if len(cpus) != 2 {
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
}
for i, cpu := range cpus {
if cpu.Firmware == nil || *cpu.Firmware != "0x2b000643" {
t.Fatalf("cpu[%d] firmware=%v want microcode", i, cpu.Firmware)
}
}
}
func TestParseCPUStatus(t *testing.T) { func TestParseCPUStatus(t *testing.T) {
tests := []struct { tests := []struct {
input string input string
want string want string
}{ }{
{"Populated, Enabled", "OK"}, {"Populated, Enabled", "OK"},
{"Populated, Disabled By User", "WARNING"}, {"Populated, Disabled By User", statusWarning},
{"Populated, Disabled By BIOS", "WARNING"}, {"Populated, Disabled By BIOS", statusWarning},
{"Unpopulated", "EMPTY"}, {"Unpopulated", statusEmpty},
{"Not Populated", "EMPTY"}, {"Not Populated", statusEmpty},
{"Unknown", "UNKNOWN"}, {"Unknown", statusUnknown},
{"", "UNKNOWN"}, {"", statusUnknown},
} }
for _, tt := range tests { for _, tt := range tests {
got := parseCPUStatus(tt.input) got := parseCPUStatus(tt.input)

View File

@@ -0,0 +1,77 @@
package collector
import "bee/audit/internal/schema"
func finalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
snap.Memory = filterMemory(snap.Memory)
snap.Storage = filterStorage(snap.Storage)
snap.PowerSupplies = filterPSUs(snap.PowerSupplies)
setComponentStatusMetadata(snap, collectedAt)
}
func filterMemory(dimms []schema.HardwareMemory) []schema.HardwareMemory {
out := make([]schema.HardwareMemory, 0, len(dimms))
for _, dimm := range dimms {
if dimm.Present != nil && !*dimm.Present {
continue
}
if dimm.Status != nil && *dimm.Status == statusEmpty {
continue
}
if dimm.SerialNumber == nil || *dimm.SerialNumber == "" {
continue
}
out = append(out, dimm)
}
return out
}
func filterStorage(disks []schema.HardwareStorage) []schema.HardwareStorage {
out := make([]schema.HardwareStorage, 0, len(disks))
for _, disk := range disks {
if disk.SerialNumber == nil || *disk.SerialNumber == "" {
continue
}
out = append(out, disk)
}
return out
}
func filterPSUs(psus []schema.HardwarePowerSupply) []schema.HardwarePowerSupply {
out := make([]schema.HardwarePowerSupply, 0, len(psus))
for _, psu := range psus {
if psu.SerialNumber == nil || *psu.SerialNumber == "" {
continue
}
out = append(out, psu)
}
return out
}
func setComponentStatusMetadata(snap *schema.HardwareSnapshot, collectedAt string) {
for i := range snap.CPUs {
setStatusCheckedAt(&snap.CPUs[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.Memory {
setStatusCheckedAt(&snap.Memory[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.Storage {
setStatusCheckedAt(&snap.Storage[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.PCIeDevices {
setStatusCheckedAt(&snap.PCIeDevices[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.PowerSupplies {
setStatusCheckedAt(&snap.PowerSupplies[i].HardwareComponentStatus, collectedAt)
}
}
func setStatusCheckedAt(status *schema.HardwareComponentStatus, collectedAt string) {
if status == nil || status.Status == nil || *status.Status == "" {
return
}
if status.StatusCheckedAt == nil {
status.StatusCheckedAt = &collectedAt
}
}

View File

@@ -0,0 +1,63 @@
package collector
import (
"bee/audit/internal/schema"
"testing"
)
func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
collectedAt := "2026-03-15T12:00:00Z"
present := true
status := statusOK
serial := "SN-1"
snap := schema.HardwareSnapshot{
Memory: []schema.HardwareMemory{
{Present: &present, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Present: &present, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
Storage: []schema.HardwareStorage{
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
PowerSupplies: []schema.HardwarePowerSupply{
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
}
finalizeSnapshot(&snap, collectedAt)
if len(snap.Memory) != 1 || snap.Memory[0].StatusCheckedAt == nil || *snap.Memory[0].StatusCheckedAt != collectedAt {
t.Fatalf("memory finalize mismatch: %+v", snap.Memory)
}
if len(snap.Storage) != 1 || snap.Storage[0].StatusCheckedAt == nil || *snap.Storage[0].StatusCheckedAt != collectedAt {
t.Fatalf("storage finalize mismatch: %+v", snap.Storage)
}
if len(snap.PowerSupplies) != 1 || snap.PowerSupplies[0].StatusCheckedAt == nil || *snap.PowerSupplies[0].StatusCheckedAt != collectedAt {
t.Fatalf("psu finalize mismatch: %+v", snap.PowerSupplies)
}
}
func TestFinalizeSnapshotPreservesDuplicateSerials(t *testing.T) {
collectedAt := "2026-03-15T12:00:00Z"
status := statusOK
model := "Device"
serial := "DUPLICATE"
snap := schema.HardwareSnapshot{
Storage: []schema.HardwareStorage{
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
}
finalizeSnapshot(&snap, collectedAt)
if got := *snap.Storage[0].SerialNumber; got != serial {
t.Fatalf("first serial changed: %q", got)
}
if got := *snap.Storage[1].SerialNumber; got != serial {
t.Fatalf("duplicate serial should stay unchanged: %q", got)
}
}

View File

@@ -47,12 +47,12 @@ func parseMemorySection(fields map[string]string) schema.HardwareMemory {
dimm.Present = &present dimm.Present = &present
if !present { if !present {
status := "EMPTY" status := statusEmpty
dimm.Status = &status dimm.Status = &status
return dimm return dimm
} }
status := "OK" status := statusOK
dimm.Status = &status dimm.Status = &status
if mb := parseMemorySizeMB(rawSize); mb > 0 { if mb := parseMemorySizeMB(rawSize); mb > 0 {

View File

@@ -0,0 +1,203 @@
package collector
import (
"bee/audit/internal/schema"
"os"
"path/filepath"
"sort"
"strconv"
"strings"
)
var edacBaseDir = "/sys/devices/system/edac/mc"
type edacDIMMStats struct {
Label string
CECount *int64
UECount *int64
}
func enrichMemoryWithTelemetry(dimms []schema.HardwareMemory, doc sensorsDoc) []schema.HardwareMemory {
if len(dimms) == 0 {
return dimms
}
tempByLabel := memoryTempsFromSensors(doc)
stats := readEDACStats()
for i := range dimms {
labelKeys := dimmMatchKeys(dimms[i].Slot, dimms[i].Location)
for _, key := range labelKeys {
if temp, ok := tempByLabel[key]; ok {
dimms[i].TemperatureC = &temp
break
}
}
for _, key := range labelKeys {
if stat, ok := stats[key]; ok {
if stat.CECount != nil {
dimms[i].CorrectableECCErrorCount = stat.CECount
}
if stat.UECount != nil {
dimms[i].UncorrectableECCErrorCount = stat.UECount
}
if stat.UECount != nil && *stat.UECount > 0 {
dimms[i].DataLossDetected = boolPtr(true)
status := statusCritical
dimms[i].Status = &status
if dimms[i].ErrorDescription == nil {
dimms[i].ErrorDescription = stringPtr("EDAC reports uncorrectable ECC errors")
}
} else if stat.CECount != nil && *stat.CECount > 0 && (dimms[i].Status == nil || *dimms[i].Status == statusOK) {
status := statusWarning
dimms[i].Status = &status
if dimms[i].ErrorDescription == nil {
dimms[i].ErrorDescription = stringPtr("EDAC reports correctable ECC errors")
}
}
break
}
}
}
return dimms
}
func memoryTempsFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
if len(doc) == 0 {
return out
}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok || classifySensorFeature(feature) != "temp" {
continue
}
if !isLikelyMemoryTemp(chip, featureName) {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
key := canonicalLabel(featureName)
if key == "" {
continue
}
if _, exists := out[key]; !exists {
out[key] = temp
}
}
}
return out
}
func readEDACStats() map[string]edacDIMMStats {
out := map[string]edacDIMMStats{}
mcDirs, err := filepath.Glob(filepath.Join(edacBaseDir, "mc*"))
if err != nil {
return out
}
sort.Strings(mcDirs)
for _, mcDir := range mcDirs {
dimmDirs, err := filepath.Glob(filepath.Join(mcDir, "dimm*"))
if err != nil {
continue
}
sort.Strings(dimmDirs)
for _, dimmDir := range dimmDirs {
stat, ok := readEDACDIMMStats(dimmDir)
if !ok {
continue
}
key := canonicalLabel(stat.Label)
if key == "" {
continue
}
out[key] = stat
}
}
return out
}
func readEDACDIMMStats(dimmDir string) (edacDIMMStats, bool) {
labelBytes, err := os.ReadFile(filepath.Join(dimmDir, "dimm_label"))
if err != nil {
labelBytes, err = os.ReadFile(filepath.Join(dimmDir, "label"))
if err != nil {
return edacDIMMStats{}, false
}
}
label := strings.TrimSpace(string(labelBytes))
if label == "" {
return edacDIMMStats{}, false
}
stat := edacDIMMStats{Label: label}
if value, ok := readEDACCount(dimmDir, []string{"dimm_ce_count", "ce_count"}); ok {
stat.CECount = &value
}
if value, ok := readEDACCount(dimmDir, []string{"dimm_ue_count", "ue_count"}); ok {
stat.UECount = &value
}
return stat, true
}
func readEDACCount(dir string, names []string) (int64, bool) {
for _, name := range names {
raw, err := os.ReadFile(filepath.Join(dir, name))
if err != nil {
continue
}
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
if err == nil && value >= 0 {
return value, true
}
}
return 0, false
}
func dimmMatchKeys(slot, location *string) []string {
var out []string
add := func(value *string) {
key := canonicalLabel(derefString(value))
if key == "" {
return
}
for _, existing := range out {
if existing == key {
return
}
}
out = append(out, key)
}
add(slot)
add(location)
return out
}
func canonicalLabel(value string) string {
value = strings.ToUpper(strings.TrimSpace(value))
if value == "" {
return ""
}
var b strings.Builder
for _, r := range value {
if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
b.WriteRune(r)
}
}
return b.String()
}
func isLikelyMemoryTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "dimm") || strings.Contains(value, "sodimm")
}
func boolPtr(value bool) *bool {
return &value
}

View File

@@ -0,0 +1,61 @@
package collector
import (
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestEnrichMemoryWithTelemetry(t *testing.T) {
tmp := t.TempDir()
oldBase := edacBaseDir
edacBaseDir = tmp
t.Cleanup(func() { edacBaseDir = oldBase })
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_label"), "CPU0_DIMM_A1\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ce_count"), "7\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ue_count"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_label"), "CPU1_DIMM_B2\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ce_count"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ue_count"), "2\n")
doc := sensorsDoc{
"jc42-i2c-0-18": {
"CPU0 DIMM A1": map[string]any{"temp1_input": 43.0},
"CPU1 DIMM B2": map[string]any{"temp2_input": 46.0},
},
}
status := statusOK
slotA := "CPU0_DIMM_A1"
slotB := "CPU1_DIMM_B2"
dimms := []schema.HardwareMemory{
{Slot: &slotA, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Slot: &slotB, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
}
got := enrichMemoryWithTelemetry(dimms, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 43.0 {
t.Fatalf("dimm0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[0].CorrectableECCErrorCount == nil || *got[0].CorrectableECCErrorCount != 7 {
t.Fatalf("dimm0 ce mismatch: %#v", got[0].CorrectableECCErrorCount)
}
if got[0].Status == nil || *got[0].Status != statusWarning {
t.Fatalf("dimm0 status mismatch: %#v", got[0].Status)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 46.0 {
t.Fatalf("dimm1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[1].UncorrectableECCErrorCount == nil || *got[1].UncorrectableECCErrorCount != 2 {
t.Fatalf("dimm1 ue mismatch: %#v", got[1].UncorrectableECCErrorCount)
}
if got[1].Status == nil || *got[1].Status != statusCritical {
t.Fatalf("dimm1 status mismatch: %#v", got[1].Status)
}
if got[1].DataLossDetected == nil || !*got[1].DataLossDetected {
t.Fatalf("dimm1 data_loss_detected mismatch: %#v", got[1].DataLossDetected)
}
}

View File

@@ -18,17 +18,13 @@ var (
} }
return string(out), nil return string(out), nil
} }
readNetStatFile = func(iface, key string) (int64, error) { readNetAddressFile = func(iface string) (string, error) {
path := filepath.Join("/sys/class/net", iface, "statistics", key) path := filepath.Join("/sys/class/net", iface, "address")
raw, err := os.ReadFile(path) raw, err := os.ReadFile(path)
if err != nil { if err != nil {
return 0, err return "", err
} }
v, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64) return strings.TrimSpace(string(raw)), nil
if err != nil {
return 0, err
}
return v, nil
} }
) )
@@ -47,6 +43,7 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
continue continue
} }
iface := ifaces[0] iface := ifaces[0]
devs[i].MacAddresses = collectInterfaceMACs(ifaces)
if devs[i].Firmware == nil { if devs[i].Firmware == nil {
if out, err := ethtoolInfoQuery(iface); err == nil { if out, err := ethtoolInfoQuery(iface); err == nil {
@@ -56,16 +53,13 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
} }
} }
if devs[i].Telemetry == nil {
devs[i].Telemetry = map[string]any{}
}
injectNICPacketStats(devs[i].Telemetry, iface)
if out, err := ethtoolModuleQuery(iface); err == nil { if out, err := ethtoolModuleQuery(iface); err == nil {
injectSFPDOMTelemetry(devs[i].Telemetry, out) if injectSFPDOMTelemetry(&devs[i], out) {
enriched++
continue
}
} }
if len(devs[i].Telemetry) == 0 { if len(devs[i].MacAddresses) > 0 || devs[i].Firmware != nil {
devs[i].Telemetry = nil
} else {
enriched++ enriched++
} }
} }
@@ -77,31 +71,32 @@ func isNICDevice(dev schema.HardwarePCIeDevice) bool {
if dev.DeviceClass == nil { if dev.DeviceClass == nil {
return false return false
} }
c := strings.ToLower(strings.TrimSpace(*dev.DeviceClass)) c := strings.TrimSpace(*dev.DeviceClass)
return strings.Contains(c, "ethernet controller") || return isNICClass(c) || strings.EqualFold(c, "FibreChannelController")
strings.Contains(c, "network controller") ||
strings.Contains(c, "infiniband controller")
} }
func injectNICPacketStats(dst map[string]any, iface string) { func collectInterfaceMACs(ifaces []string) []string {
for _, key := range []string{"rx_packets", "tx_packets", "rx_errors", "tx_errors"} { seen := map[string]struct{}{}
if v, err := readNetStatFile(iface, key); err == nil { var out []string
dst[key] = v for _, iface := range ifaces {
mac, err := readNetAddressFile(iface)
if err != nil || mac == "" {
continue
} }
mac = strings.ToLower(strings.TrimSpace(mac))
if _, ok := seen[mac]; ok {
continue
}
seen[mac] = struct{}{}
out = append(out, mac)
} }
} return out
func injectSFPDOMTelemetry(dst map[string]any, raw string) {
parsed := parseSFPDOM(raw)
for k, v := range parsed {
dst[k] = v
}
} }
var floatRe = regexp.MustCompile(`[-+]?[0-9]*\.?[0-9]+`) var floatRe = regexp.MustCompile(`[-+]?[0-9]*\.?[0-9]+`)
func parseSFPDOM(raw string) map[string]any { func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
out := map[string]any{} var changed bool
for _, line := range strings.Split(raw, "\n") { for _, line := range strings.Split(raw, "\n") {
trimmed := strings.TrimSpace(line) trimmed := strings.TrimSpace(line)
if trimmed == "" { if trimmed == "" {
@@ -117,26 +112,55 @@ func parseSFPDOM(raw string) map[string]any {
switch { switch {
case strings.Contains(key, "module temperature"): case strings.Contains(key, "module temperature"):
if f, ok := firstFloat(val); ok { if f, ok := firstFloat(val); ok {
out["sfp_temperature_c"] = f dev.SFPTemperatureC = &f
changed = true
} }
case strings.Contains(key, "laser output power"): case strings.Contains(key, "laser output power"):
if f, ok := dbmValue(val); ok { if f, ok := dbmValue(val); ok {
out["sfp_tx_power_dbm"] = f dev.SFPTXPowerDBM = &f
changed = true
} }
case strings.Contains(key, "receiver signal"): case strings.Contains(key, "receiver signal"):
if f, ok := dbmValue(val); ok { if f, ok := dbmValue(val); ok {
out["sfp_rx_power_dbm"] = f dev.SFPRXPowerDBM = &f
changed = true
} }
case strings.Contains(key, "module voltage"): case strings.Contains(key, "module voltage"):
if f, ok := firstFloat(val); ok { if f, ok := firstFloat(val); ok {
out["sfp_voltage_v"] = f dev.SFPVoltageV = &f
changed = true
} }
case strings.Contains(key, "laser bias current"): case strings.Contains(key, "laser bias current"):
if f, ok := firstFloat(val); ok { if f, ok := firstFloat(val); ok {
out["sfp_bias_ma"] = f dev.SFPBiasMA = &f
changed = true
} }
} }
} }
return changed
}
func parseSFPDOM(raw string) map[string]any {
dev := schema.HardwarePCIeDevice{}
if !injectSFPDOMTelemetry(&dev, raw) {
return map[string]any{}
}
out := map[string]any{}
if dev.SFPTemperatureC != nil {
out["sfp_temperature_c"] = *dev.SFPTemperatureC
}
if dev.SFPTXPowerDBM != nil {
out["sfp_tx_power_dbm"] = *dev.SFPTXPowerDBM
}
if dev.SFPRXPowerDBM != nil {
out["sfp_rx_power_dbm"] = *dev.SFPRXPowerDBM
}
if dev.SFPVoltageV != nil {
out["sfp_voltage_v"] = *dev.SFPVoltageV
}
if dev.SFPBiasMA != nil {
out["sfp_bias_ma"] = *dev.SFPBiasMA
}
return out return out
} }

View File

@@ -24,18 +24,17 @@ type nvidiaGPUInfo struct {
} }
// enrichPCIeWithNVIDIA enriches NVIDIA PCIe devices with data from nvidia-smi. // enrichPCIeWithNVIDIA enriches NVIDIA PCIe devices with data from nvidia-smi.
// If the driver/tool is unavailable, NVIDIA devices get UNKNOWN status and // If the driver/tool is unavailable, NVIDIA devices get Unknown status.
// a stable serial fallback based on board serial + slot. func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice, boardSerial string) []schema.HardwarePCIeDevice {
if !hasNVIDIADevices(devs) { if !hasNVIDIADevices(devs) {
return devs return devs
} }
gpuByBDF, err := queryNVIDIAGPUs() gpuByBDF, err := queryNVIDIAGPUs()
if err != nil { if err != nil {
slog.Info("nvidia: enrichment skipped", "err", err) slog.Info("nvidia: enrichment skipped", "err", err)
return enrichPCIeWithNVIDIAData(devs, nil, boardSerial, false) return enrichPCIeWithNVIDIAData(devs, nil, false)
} }
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, boardSerial, true) return enrichPCIeWithNVIDIAData(devs, gpuByBDF, true)
} }
func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool { func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
@@ -47,7 +46,7 @@ func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
return false return false
} }
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, boardSerial string, driverLoaded bool) []schema.HardwarePCIeDevice { func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, driverLoaded bool) []schema.HardwarePCIeDevice {
enriched := 0 enriched := 0
for i := range devs { for i := range devs {
if !isNVIDIADevice(devs[i]) { if !isNVIDIADevice(devs[i]) {
@@ -55,7 +54,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
} }
if !driverLoaded { if !driverLoaded {
setPCIeFallback(&devs[i], boardSerial) setPCIeFallback(&devs[i])
continue continue
} }
@@ -65,22 +64,21 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
} }
info, ok := gpuByBDF[bdf] info, ok := gpuByBDF[bdf]
if !ok { if !ok {
setPCIeFallback(&devs[i], boardSerial) setPCIeFallback(&devs[i])
continue continue
} }
if v := strings.TrimSpace(info.Serial); v != "" { if v := strings.TrimSpace(info.Serial); v != "" {
devs[i].SerialNumber = &v devs[i].SerialNumber = &v
} else {
setPCIeFallbackSerial(&devs[i], boardSerial)
} }
if v := strings.TrimSpace(info.VBIOS); v != "" { if v := strings.TrimSpace(info.VBIOS); v != "" {
devs[i].Firmware = &v devs[i].Firmware = &v
} }
status := "OK" status := statusOK
if info.ECCUncorrected != nil && *info.ECCUncorrected > 0 { if info.ECCUncorrected != nil && *info.ECCUncorrected > 0 {
status = "WARNING" status = statusWarning
devs[i].ErrorDescription = stringPtr("GPU reports uncorrected ECC errors")
} }
devs[i].Status = &status devs[i].Status = &status
injectNVIDIATelemetry(&devs[i], info) injectNVIDIATelemetry(&devs[i], info)
@@ -212,46 +210,25 @@ func isNVIDIADevice(dev schema.HardwarePCIeDevice) bool {
return false return false
} }
func setPCIeFallback(dev *schema.HardwarePCIeDevice, boardSerial string) { func setPCIeFallback(dev *schema.HardwarePCIeDevice) {
setPCIeFallbackSerial(dev, boardSerial) status := statusUnknown
status := "UNKNOWN"
dev.Status = &status dev.Status = &status
} }
func setPCIeFallbackSerial(dev *schema.HardwarePCIeDevice, boardSerial string) {
if strings.TrimSpace(boardSerial) == "" || dev.SerialNumber != nil {
return
}
slot := "unknown"
if dev.BDF != nil && strings.TrimSpace(*dev.BDF) != "" {
slot = strings.TrimSpace(*dev.BDF)
} else if dev.Slot != nil && strings.TrimSpace(*dev.Slot) != "" {
slot = strings.TrimSpace(*dev.Slot)
}
fb := fmt.Sprintf("%s-PCIE-%s", boardSerial, slot)
dev.SerialNumber = &fb
}
func injectNVIDIATelemetry(dev *schema.HardwarePCIeDevice, info nvidiaGPUInfo) { func injectNVIDIATelemetry(dev *schema.HardwarePCIeDevice, info nvidiaGPUInfo) {
if dev.Telemetry == nil {
dev.Telemetry = map[string]any{}
}
if info.TemperatureC != nil { if info.TemperatureC != nil {
dev.Telemetry["temperature_c"] = *info.TemperatureC dev.TemperatureC = info.TemperatureC
} }
if info.PowerW != nil { if info.PowerW != nil {
dev.Telemetry["power_w"] = *info.PowerW dev.PowerW = info.PowerW
} }
if info.ECCUncorrected != nil { if info.ECCUncorrected != nil {
dev.Telemetry["ecc_uncorrected_total"] = *info.ECCUncorrected dev.ECCUncorrectedTotal = info.ECCUncorrected
} }
if info.ECCCorrected != nil { if info.ECCCorrected != nil {
dev.Telemetry["ecc_corrected_total"] = *info.ECCCorrected dev.ECCCorrectedTotal = info.ECCCorrected
} }
if info.HWSlowdown != nil { if info.HWSlowdown != nil {
dev.Telemetry["hw_slowdown_active"] = *info.HWSlowdown dev.HWSlowdown = info.HWSlowdown
}
if len(dev.Telemetry) == 0 {
dev.Telemetry = nil
} }
} }

View File

@@ -54,10 +54,10 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
status := "OK" status := "OK"
devices := []schema.HardwarePCIeDevice{ devices := []schema.HardwarePCIeDevice{
{ {
VendorID: &vendorID, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
BDF: &bdf, VendorID: &vendorID,
Manufacturer: &manufacturer, BDF: &bdf,
Status: &status, Manufacturer: &manufacturer,
}, },
} }
@@ -73,21 +73,21 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
}, },
} }
out := enrichPCIeWithNVIDIAData(devices, byBDF, "BOARD-001", true) out := enrichPCIeWithNVIDIAData(devices, byBDF, true)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-ABC" { if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-ABC" {
t.Fatalf("serial: got %v", out[0].SerialNumber) t.Fatalf("serial: got %v", out[0].SerialNumber)
} }
if out[0].Firmware == nil || *out[0].Firmware != "96.00.1F.00.02" { if out[0].Firmware == nil || *out[0].Firmware != "96.00.1F.00.02" {
t.Fatalf("firmware: got %v", out[0].Firmware) t.Fatalf("firmware: got %v", out[0].Firmware)
} }
if out[0].Status == nil || *out[0].Status != "WARNING" { if out[0].Status == nil || *out[0].Status != statusWarning {
t.Fatalf("status: got %v", out[0].Status) t.Fatalf("status: got %v", out[0].Status)
} }
if out[0].Telemetry == nil { if out[0].ECCUncorrectedTotal == nil || *out[0].ECCUncorrectedTotal != 2 {
t.Fatal("expected telemetry") t.Fatalf("ecc_uncorrected_total: got %#v", out[0].ECCUncorrectedTotal)
} }
if got, ok := out[0].Telemetry["ecc_uncorrected_total"].(int64); !ok || got != 2 { if out[0].TemperatureC == nil || *out[0].TemperatureC != 55.5 {
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].Telemetry["ecc_uncorrected_total"]) t.Fatalf("temperature_c: got %#v", out[0].TemperatureC)
} }
} }
@@ -103,11 +103,11 @@ func TestEnrichPCIeWithNVIDIAData_driverMissingFallback(t *testing.T) {
}, },
} }
out := enrichPCIeWithNVIDIAData(devices, nil, "BOARD-123", false) out := enrichPCIeWithNVIDIAData(devices, nil, false)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "BOARD-123-PCIE-0000:17:00.0" { if out[0].SerialNumber != nil {
t.Fatalf("fallback serial: got %v", out[0].SerialNumber) t.Fatalf("serial should stay nil without source data, got %v", out[0].SerialNumber)
} }
if out[0].Status == nil || *out[0].Status != "UNKNOWN" { if out[0].Status == nil || *out[0].Status != statusUnknown {
t.Fatalf("fallback status: got %v", out[0].Status) t.Fatalf("fallback status: got %v", out[0].Status)
} }
} }

View File

@@ -57,6 +57,8 @@ func shouldIncludePCIeDevice(class string) bool {
"host bridge", "host bridge",
"isa bridge", "isa bridge",
"pci bridge", "pci bridge",
"performance counter",
"performance counters",
"ram memory", "ram memory",
"system peripheral", "system peripheral",
"communication controller", "communication controller",
@@ -79,11 +81,12 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
dev := schema.HardwarePCIeDevice{} dev := schema.HardwarePCIeDevice{}
present := true present := true
dev.Present = &present dev.Present = &present
status := "OK" status := statusOK
dev.Status = &status dev.Status = &status
// Slot is the BDF: "0000:00:02.0" // Slot is the BDF: "0000:00:02.0"
if bdf := fields["Slot"]; bdf != "" { if bdf := fields["Slot"]; bdf != "" {
dev.Slot = &bdf
dev.BDF = &bdf dev.BDF = &bdf
// parse vendor_id and device_id from sysfs // parse vendor_id and device_id from sysfs
vendorID, deviceID := readPCIIDs(bdf) vendorID, deviceID := readPCIIDs(bdf)
@@ -93,10 +96,32 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
if deviceID != 0 { if deviceID != 0 {
dev.DeviceID = &deviceID dev.DeviceID = &deviceID
} }
if numaNode, ok := readPCINumaNode(bdf); ok {
dev.NUMANode = &numaNode
}
if width, ok := readPCIIntAttribute(bdf, "current_link_width"); ok {
dev.LinkWidth = &width
}
if width, ok := readPCIIntAttribute(bdf, "max_link_width"); ok {
dev.MaxLinkWidth = &width
}
if speed, ok := readPCIStringAttribute(bdf, "current_link_speed"); ok {
linkSpeed := normalizePCILinkSpeed(speed)
if linkSpeed != "" {
dev.LinkSpeed = &linkSpeed
}
}
if speed, ok := readPCIStringAttribute(bdf, "max_link_speed"); ok {
linkSpeed := normalizePCILinkSpeed(speed)
if linkSpeed != "" {
dev.MaxLinkSpeed = &linkSpeed
}
}
} }
if v := fields["Class"]; v != "" { if v := fields["Class"]; v != "" {
dev.DeviceClass = &v class := mapPCIeDeviceClass(v)
dev.DeviceClass = &class
} }
if v := fields["Vendor"]; v != "" { if v := fields["Vendor"]; v != "" {
dev.Manufacturer = &v dev.Manufacturer = &v
@@ -131,3 +156,55 @@ func readHexFile(path string) (int, error) {
n, err := strconv.ParseInt(s, 16, 64) n, err := strconv.ParseInt(s, 16, 64)
return int(n), err return int(n), err
} }
func readPCINumaNode(bdf string) (int, bool) {
value, ok := readPCIIntAttribute(bdf, "numa_node")
if !ok || value < 0 {
return 0, false
}
return value, true
}
func readPCIIntAttribute(bdf, attribute string) (int, bool) {
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
if err != nil {
return 0, false
}
value, err := strconv.Atoi(strings.TrimSpace(string(out)))
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func readPCIStringAttribute(bdf, attribute string) (string, bool) {
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
if err != nil {
return "", false
}
value := strings.TrimSpace(string(out))
if value == "" {
return "", false
}
return value, true
}
func normalizePCILinkSpeed(raw string) string {
raw = strings.TrimSpace(strings.ToLower(raw))
switch {
case strings.Contains(raw, "2.5"):
return "Gen1"
case strings.Contains(raw, "5.0"):
return "Gen2"
case strings.Contains(raw, "8.0"):
return "Gen3"
case strings.Contains(raw, "16.0"):
return "Gen4"
case strings.Contains(raw, "32.0"):
return "Gen5"
case strings.Contains(raw, "64.0"):
return "Gen6"
default:
return ""
}
}

View File

@@ -1,6 +1,10 @@
package collector package collector
import "testing" import (
"encoding/json"
"strings"
"testing"
)
func TestShouldIncludePCIeDevice(t *testing.T) { func TestShouldIncludePCIeDevice(t *testing.T) {
tests := []struct { tests := []struct {
@@ -13,6 +17,7 @@ func TestShouldIncludePCIeDevice(t *testing.T) {
{"Host bridge", false}, {"Host bridge", false},
{"PCI bridge", false}, {"PCI bridge", false},
{"SMBus", false}, {"SMBus", false},
{"Performance counters", false},
{"Ethernet controller", true}, {"Ethernet controller", true},
{"RAID bus controller", true}, {"RAID bus controller", true},
{"Non-Volatile memory controller", true}, {"Non-Volatile memory controller", true},
@@ -35,7 +40,50 @@ func TestParseLspci_filtersExcludedClasses(t *testing.T) {
if len(devs) != 1 { if len(devs) != 1 {
t.Fatalf("expected 1 filtered device, got %d", len(devs)) t.Fatalf("expected 1 filtered device, got %d", len(devs))
} }
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VGA compatible controller" { if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VideoController" {
t.Fatalf("unexpected remaining class: %v", devs[0].DeviceClass) t.Fatalf("unexpected remaining class: %v", devs[0].DeviceClass)
} }
if devs[0].Slot == nil || *devs[0].Slot != "0000:65:00.0" {
t.Fatalf("slot: got %v", devs[0].Slot)
}
if devs[0].BDF == nil || *devs[0].BDF != "0000:65:00.0" {
t.Fatalf("bdf: got %v", devs[0].BDF)
}
}
func TestPCIeJSONUsesSlotNotBDF(t *testing.T) {
input := "Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
devs := parseLspci(input)
data, err := json.Marshal(devs[0])
if err != nil {
t.Fatalf("marshal: %v", err)
}
text := string(data)
if !strings.Contains(text, `"slot":"0000:65:00.0"`) {
t.Fatalf("json missing slot: %s", text)
}
if strings.Contains(text, `"bdf"`) {
t.Fatalf("json should not emit bdf: %s", text)
}
}
func TestNormalizePCILinkSpeed(t *testing.T) {
tests := []struct {
raw string
want string
}{
{"2.5 GT/s PCIe", "Gen1"},
{"5.0 GT/s PCIe", "Gen2"},
{"8.0 GT/s PCIe", "Gen3"},
{"16.0 GT/s PCIe", "Gen4"},
{"32.0 GT/s PCIe", "Gen5"},
{"64.0 GT/s PCIe", "Gen6"},
{"unknown", ""},
}
for _, tt := range tests {
if got := normalizePCILinkSpeed(tt.raw); got != tt.want {
t.Fatalf("normalizePCILinkSpeed(%q)=%q want %q", tt.raw, got, tt.want)
}
}
} }

View File

@@ -4,18 +4,32 @@ import (
"bee/audit/internal/schema" "bee/audit/internal/schema"
"log/slog" "log/slog"
"os/exec" "os/exec"
"regexp"
"sort"
"strconv" "strconv"
"strings" "strings"
) )
func collectPSUs() []schema.HardwarePowerSupply { func collectPSUs() []schema.HardwarePowerSupply {
// ipmitool requires /dev/ipmi0 — not available on non-server hardware var psus []schema.HardwarePowerSupply
out, err := exec.Command("ipmitool", "fru", "print").Output() if out, err := exec.Command("ipmitool", "fru", "print").Output(); err == nil {
if err != nil { psus = parseFRU(string(out))
} else {
slog.Info("psu: fru unavailable", "err", err)
}
sdrData := map[int]psuSDR{}
if sdrOut, err := exec.Command("ipmitool", "sdr").Output(); err == nil {
sdrData = parsePSUSDR(string(sdrOut))
if len(psus) == 0 {
psus = synthesizePSUsFromSDR(sdrData)
} else {
mergePSUSDR(psus, sdrData)
}
} else if len(psus) == 0 {
slog.Info("psu: ipmitool unavailable, skipping", "err", err) slog.Info("psu: ipmitool unavailable, skipping", "err", err)
return nil return nil
} }
psus := parseFRU(string(out))
slog.Info("psu: collected", "count", len(psus)) slog.Info("psu: collected", "count", len(psus))
return psus return psus
} }
@@ -75,9 +89,7 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
// Only process PSU FRU records // Only process PSU FRU records
headerLower := strings.ToLower(header) headerLower := strings.ToLower(header)
if !strings.Contains(headerLower, "psu") && if !isPSUHeader(headerLower) {
!strings.Contains(headerLower, "power supply") &&
!strings.Contains(headerLower, "power_supply") {
return schema.HardwarePowerSupply{}, false return schema.HardwarePowerSupply{}, false
} }
@@ -85,21 +97,24 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
psu := schema.HardwarePowerSupply{Present: &present} psu := schema.HardwarePowerSupply{Present: &present}
slotStr := strconv.Itoa(slotIdx) slotStr := strconv.Itoa(slotIdx)
if slot, ok := parsePSUSlot(header); ok && slot > 0 {
slotStr = strconv.Itoa(slot - 1)
}
psu.Slot = &slotStr psu.Slot = &slotStr
if v := cleanDMIValue(fields["Board Product"]); v != "" { if v := firstNonEmptyField(fields, "Board Product", "Product Name", "Product Part Number"); v != "" {
psu.Model = &v psu.Model = &v
} }
if v := cleanDMIValue(fields["Board Mfg"]); v != "" { if v := firstNonEmptyField(fields, "Board Mfg", "Product Manufacturer", "Product Manufacturer Name"); v != "" {
psu.Vendor = &v psu.Vendor = &v
} }
if v := cleanDMIValue(fields["Board Serial"]); v != "" { if v := firstNonEmptyField(fields, "Board Serial", "Product Serial", "Product Serial Number"); v != "" {
psu.SerialNumber = &v psu.SerialNumber = &v
} }
if v := cleanDMIValue(fields["Board Part Number"]); v != "" { if v := firstNonEmptyField(fields, "Board Part Number", "Product Part Number", "Part Number"); v != "" {
psu.PartNumber = &v psu.PartNumber = &v
} }
if v := cleanDMIValue(fields["Board Extra"]); v != "" { if v := firstNonEmptyField(fields, "Board Extra", "Product Version", "Board Version"); v != "" {
psu.Firmware = &v psu.Firmware = &v
} }
@@ -110,12 +125,230 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
} }
} }
status := "OK" status := statusOK
psu.Status = &status psu.Status = &status
return psu, true return psu, true
} }
func isPSUHeader(headerLower string) bool {
return strings.Contains(headerLower, "psu") ||
strings.Contains(headerLower, "pws") ||
strings.Contains(headerLower, "power supply") ||
strings.Contains(headerLower, "power_supply") ||
strings.Contains(headerLower, "power module")
}
func firstNonEmptyField(fields map[string]string, keys ...string) string {
for _, key := range keys {
if value := cleanDMIValue(fields[key]); value != "" {
return value
}
}
return ""
}
type psuSDR struct {
slot int
status string
reason string
inputPowerW *float64
outputPowerW *float64
inputVoltage *float64
temperatureC *float64
healthPct *float64
}
var psuSlotPatterns = []*regexp.Regexp{
regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bps\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bpws\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bpower\s*supply(?:\s*bay)?\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bbay\s*([0-9]+)\b`),
}
func parsePSUSDR(raw string) map[int]psuSDR {
out := map[int]psuSDR{}
for _, line := range strings.Split(raw, "\n") {
fields := splitSDRFields(line)
if len(fields) < 3 {
continue
}
name := fields[0]
value := fields[1]
state := strings.ToLower(fields[2])
slot, ok := parsePSUSlot(name)
if !ok {
continue
}
entry := out[slot]
entry.slot = slot
if entry.status == "" {
entry.status = statusOK
}
if state != "" && state != "ok" && state != "ns" {
entry.status = statusCritical
entry.reason = "PSU sensor reported non-OK state: " + state
}
lowerName := strings.ToLower(name)
switch {
case strings.Contains(lowerName, "input power"):
entry.inputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "output power"):
entry.outputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "power supply bay"), strings.Contains(lowerName, "psu bay"):
entry.outputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "input voltage"), strings.Contains(lowerName, "ac input"):
entry.inputVoltage = parseFloatPtr(value)
case strings.Contains(lowerName, "temp"):
entry.temperatureC = parseFloatPtr(value)
case strings.Contains(lowerName, "health"), strings.Contains(lowerName, "remaining life"), strings.Contains(lowerName, "life remaining"):
entry.healthPct = parsePercentPtr(value)
}
out[slot] = entry
}
return out
}
func synthesizePSUsFromSDR(sdr map[int]psuSDR) []schema.HardwarePowerSupply {
if len(sdr) == 0 {
return nil
}
slots := make([]int, 0, len(sdr))
for slot := range sdr {
slots = append(slots, slot)
}
sort.Ints(slots)
out := make([]schema.HardwarePowerSupply, 0, len(slots))
for _, slot := range slots {
entry := sdr[slot]
present := true
status := entry.status
if status == "" {
status = statusUnknown
}
slotStr := strconv.Itoa(slot - 1)
model := "PSU"
psu := schema.HardwarePowerSupply{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Slot: &slotStr,
Present: &present,
Model: &model,
InputPowerW: entry.inputPowerW,
OutputPowerW: entry.outputPowerW,
InputVoltage: entry.inputVoltage,
TemperatureC: entry.temperatureC,
}
if entry.healthPct != nil {
psu.LifeRemainingPct = entry.healthPct
lifeUsed := 100 - *entry.healthPct
psu.LifeUsedPct = &lifeUsed
}
if entry.reason != "" {
psu.ErrorDescription = &entry.reason
}
out = append(out, psu)
}
return out
}
func mergePSUSDR(psus []schema.HardwarePowerSupply, sdr map[int]psuSDR) {
for i := range psus {
slotIdx, err := strconv.Atoi(derefPSUSlot(psus[i].Slot))
if err != nil {
continue
}
entry, ok := sdr[slotIdx+1]
if !ok {
continue
}
if entry.inputPowerW != nil {
psus[i].InputPowerW = entry.inputPowerW
}
if entry.outputPowerW != nil {
psus[i].OutputPowerW = entry.outputPowerW
}
if entry.inputVoltage != nil {
psus[i].InputVoltage = entry.inputVoltage
}
if entry.temperatureC != nil {
psus[i].TemperatureC = entry.temperatureC
}
if entry.healthPct != nil {
psus[i].LifeRemainingPct = entry.healthPct
lifeUsed := 100 - *entry.healthPct
psus[i].LifeUsedPct = &lifeUsed
}
if entry.status != "" {
psus[i].Status = &entry.status
}
if entry.reason != "" {
psus[i].ErrorDescription = &entry.reason
}
if psus[i].Status != nil && *psus[i].Status == statusOK {
if (entry.inputPowerW == nil && entry.outputPowerW == nil && entry.inputVoltage == nil) && entry.status == "" {
unknown := statusUnknown
psus[i].Status = &unknown
}
}
}
}
func splitSDRFields(line string) []string {
parts := strings.Split(line, "|")
out := make([]string, 0, len(parts))
for _, part := range parts {
part = strings.TrimSpace(part)
if part != "" {
out = append(out, part)
}
}
return out
}
func parsePSUSlot(name string) (int, bool) {
for _, re := range psuSlotPatterns {
m := re.FindStringSubmatch(strings.ToLower(name))
if len(m) == 0 {
continue
}
for _, group := range m[1:] {
if group == "" {
continue
}
n, err := strconv.Atoi(group)
if err == nil && n > 0 {
return n, true
}
}
}
return 0, false
}
func parseFloatPtr(raw string) *float64 {
raw = strings.TrimSpace(raw)
if raw == "" || strings.EqualFold(raw, "na") {
return nil
}
for _, field := range strings.Fields(raw) {
n, err := strconv.ParseFloat(strings.TrimSpace(field), 64)
if err == nil {
return &n
}
}
return nil
}
func derefPSUSlot(slot *string) string {
if slot == nil {
return ""
}
return *slot
}
// parseWattage extracts wattage from strings like "PSU 800W", "1200W PLATINUM". // parseWattage extracts wattage from strings like "PSU 800W", "1200W PLATINUM".
func parseWattage(s string) int { func parseWattage(s string) int {
s = strings.ToUpper(s) s = strings.ToUpper(s)

View File

@@ -0,0 +1,91 @@
package collector
import "testing"
func TestParsePSUSDR(t *testing.T) {
raw := `
PS1 Input Power | 215 Watts | ok
PS1 Output Power | 198 Watts | ok
PS1 Input Voltage | 229 Volts | ok
PS1 Temp | 39 C | ok
PS1 Health | 97 % | ok
PS2 Input Power | 0 Watts | cr
`
got := parsePSUSDR(raw)
if len(got) != 2 {
t.Fatalf("len(got)=%d want 2", len(got))
}
if got[1].status != statusOK {
t.Fatalf("ps1 status=%q", got[1].status)
}
if got[1].inputPowerW == nil || *got[1].inputPowerW != 215 {
t.Fatalf("ps1 input power=%v", got[1].inputPowerW)
}
if got[1].outputPowerW == nil || *got[1].outputPowerW != 198 {
t.Fatalf("ps1 output power=%v", got[1].outputPowerW)
}
if got[1].inputVoltage == nil || *got[1].inputVoltage != 229 {
t.Fatalf("ps1 input voltage=%v", got[1].inputVoltage)
}
if got[1].temperatureC == nil || *got[1].temperatureC != 39 {
t.Fatalf("ps1 temperature=%v", got[1].temperatureC)
}
if got[1].healthPct == nil || *got[1].healthPct != 97 {
t.Fatalf("ps1 health=%v", got[1].healthPct)
}
if got[2].status != statusCritical {
t.Fatalf("ps2 status=%q", got[2].status)
}
}
func TestParsePSUSlotVendorVariants(t *testing.T) {
t.Parallel()
tests := []struct {
name string
want int
}{
{name: "PWS1 Status", want: 1},
{name: "Power Supply Bay 8", want: 8},
{name: "PS 6 Input Power", want: 6},
}
for _, tt := range tests {
got, ok := parsePSUSlot(tt.name)
if !ok || got != tt.want {
t.Fatalf("parsePSUSlot(%q)=(%d,%v) want (%d,true)", tt.name, got, ok, tt.want)
}
}
}
func TestSynthesizePSUsFromSDR(t *testing.T) {
t.Parallel()
health := 97.0
outputPower := 915.0
got := synthesizePSUsFromSDR(map[int]psuSDR{
1: {
slot: 1,
status: statusOK,
outputPowerW: &outputPower,
healthPct: &health,
},
})
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].Slot == nil || *got[0].Slot != "0" {
t.Fatalf("slot=%v want 0", got[0].Slot)
}
if got[0].OutputPowerW == nil || *got[0].OutputPowerW != 915 {
t.Fatalf("output power=%v", got[0].OutputPowerW)
}
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 97 {
t.Fatalf("life remaining=%v", got[0].LifeRemainingPct)
}
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 3 {
t.Fatalf("life used=%v", got[0].LifeUsedPct)
}
}

View File

@@ -0,0 +1,121 @@
package collector
import (
"bee/audit/internal/schema"
"strconv"
"strings"
)
func enrichPSUsWithTelemetry(psus []schema.HardwarePowerSupply, doc sensorsDoc) []schema.HardwarePowerSupply {
if len(psus) == 0 || len(doc) == 0 {
return psus
}
tempBySlot := psuTempsFromSensors(doc)
healthBySlot := psuHealthFromSensors(doc)
for i := range psus {
slot := derefPSUSlot(psus[i].Slot)
if slot == "" {
continue
}
if psus[i].TemperatureC == nil {
if value, ok := tempBySlot[slot]; ok {
psus[i].TemperatureC = &value
}
}
if psus[i].LifeRemainingPct == nil {
if value, ok := healthBySlot[slot]; ok {
psus[i].LifeRemainingPct = &value
used := 100 - value
psus[i].LifeUsedPct = &used
}
}
}
return psus
}
func psuHealthFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if !isLikelyPSUHealth(chip, featureName) {
continue
}
value, ok := firstFeaturePercent(feature)
if !ok {
continue
}
if slot, ok := detectPSUSlot(chip, featureName); ok {
if _, exists := out[slot]; !exists {
out[slot] = value
}
}
}
}
return out
}
func firstFeaturePercent(feature map[string]any) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
lower := strings.ToLower(key)
if strings.HasSuffix(lower, "_alarm") {
continue
}
if strings.Contains(lower, "health") || strings.Contains(lower, "life") || strings.Contains(lower, "remain") {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func isLikelyPSUHealth(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return (strings.Contains(value, "psu") || strings.Contains(value, "power supply")) &&
(strings.Contains(value, "health") || strings.Contains(value, "life") || strings.Contains(value, "remain"))
}
func psuTempsFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok || classifySensorFeature(feature) != "temp" {
continue
}
if !isLikelyPSUTemp(chip, featureName) {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
if slot, ok := detectPSUSlot(chip, featureName); ok {
if _, exists := out[slot]; !exists {
out[slot] = temp
}
}
}
}
return out
}
func isLikelyPSUTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "psu") || strings.Contains(value, "power supply")
}
func detectPSUSlot(parts ...string) (string, bool) {
for _, part := range parts {
if value, ok := parsePSUSlot(part); ok && value > 0 {
return strconv.Itoa(value - 1), true
}
}
return "", false
}

View File

@@ -0,0 +1,42 @@
package collector
import (
"testing"
"bee/audit/internal/schema"
)
func TestEnrichPSUsWithTelemetry(t *testing.T) {
slot0 := "0"
slot1 := "1"
psus := []schema.HardwarePowerSupply{
{Slot: &slot0},
{Slot: &slot1},
}
doc := sensorsDoc{
"psu-hwmon-0": {
"PSU1 Temp": map[string]any{"temp1_input": 39.5},
"PSU2 Temp": map[string]any{"temp2_input": 41.0},
"PSU1 Health": map[string]any{"health1_input": 98.0},
"PSU2 Remaining Life": map[string]any{"life2_input": 95.0},
},
}
got := enrichPSUsWithTelemetry(psus, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 39.5 {
t.Fatalf("psu0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 41.0 {
t.Fatalf("psu1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 98.0 {
t.Fatalf("psu0 life remaining mismatch: %#v", got[0].LifeRemainingPct)
}
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 2.0 {
t.Fatalf("psu0 life used mismatch: %#v", got[0].LifeUsedPct)
}
if got[1].LifeRemainingPct == nil || *got[1].LifeRemainingPct != 95.0 {
t.Fatalf("psu1 life remaining mismatch: %#v", got[1].LifeRemainingPct)
}
}

View File

@@ -83,11 +83,7 @@ func isLikelyRAIDController(dev schema.HardwarePCIeDevice) bool {
if dev.DeviceClass == nil { if dev.DeviceClass == nil {
return false return false
} }
c := strings.ToLower(*dev.DeviceClass) return isRAIDClass(*dev.DeviceClass)
return strings.Contains(c, "raid") ||
strings.Contains(c, "sas") ||
strings.Contains(c, "mass storage") ||
strings.Contains(c, "serial attached scsi")
} }
func collectStorcliDrives() []schema.HardwareStorage { func collectStorcliDrives() []schema.HardwareStorage {
@@ -182,7 +178,10 @@ func parseSASIrcuDisplay(raw string) []schema.HardwareStorage {
present := true present := true
status := mapRAIDDriveStatus(b["State"]) status := mapRAIDDriveStatus(b["State"])
s := schema.HardwareStorage{Present: &present, Status: &status} s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
enclosure := strings.TrimSpace(b["Enclosure #"]) enclosure := strings.TrimSpace(b["Enclosure #"])
slot := strings.TrimSpace(b["Slot #"]) slot := strings.TrimSpace(b["Slot #"])
@@ -281,7 +280,10 @@ func parseArcconfPhysicalDrives(raw string) []schema.HardwareStorage {
for _, b := range blocks { for _, b := range blocks {
present := true present := true
status := mapRAIDDriveStatus(b["State"]) status := mapRAIDDriveStatus(b["State"])
s := schema.HardwareStorage{Present: &present, Status: &status} s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
if v := strings.TrimSpace(b["Reported Location"]); v != "" { if v := strings.TrimSpace(b["Reported Location"]); v != "" {
s.Slot = &v s.Slot = &v
@@ -362,8 +364,11 @@ func parseSSACLIPhysicalDrives(raw string) []schema.HardwareStorage {
if m := ssacliPhysicalDriveLine.FindStringSubmatch(trimmed); len(m) == 3 { if m := ssacliPhysicalDriveLine.FindStringSubmatch(trimmed); len(m) == 3 {
flush() flush()
present := true present := true
status := "UNKNOWN" status := statusUnknown
s := schema.HardwareStorage{Present: &present, Status: &status} s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
slot := m[1] slot := m[1]
s.Slot = &slot s.Slot = &slot
@@ -475,8 +480,8 @@ func storcliDriveToStorage(d struct {
present := true present := true
status := mapRAIDDriveStatus(d.State) status := mapRAIDDriveStatus(d.State)
s := schema.HardwareStorage{ s := schema.HardwareStorage{
Present: &present, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Status: &status, Present: &present,
} }
if v := strings.TrimSpace(d.EIDSlt); v != "" { if v := strings.TrimSpace(d.EIDSlt); v != "" {
@@ -527,15 +532,15 @@ func mapRAIDDriveStatus(raw string) string {
u := strings.ToUpper(strings.TrimSpace(raw)) u := strings.ToUpper(strings.TrimSpace(raw))
switch { switch {
case strings.Contains(u, "OK"), strings.Contains(u, "OPTIMAL"), strings.Contains(u, "READY"): case strings.Contains(u, "OK"), strings.Contains(u, "OPTIMAL"), strings.Contains(u, "READY"):
return "OK" return statusOK
case strings.Contains(u, "ONLN"), strings.Contains(u, "ONLINE"): case strings.Contains(u, "ONLN"), strings.Contains(u, "ONLINE"):
return "OK" return statusOK
case strings.Contains(u, "RBLD"), strings.Contains(u, "REBUILD"): case strings.Contains(u, "RBLD"), strings.Contains(u, "REBUILD"):
return "WARNING" return statusWarning
case strings.Contains(u, "FAIL"), strings.Contains(u, "OFFLINE"): case strings.Contains(u, "FAIL"), strings.Contains(u, "OFFLINE"):
return "CRITICAL" return statusCritical
default: default:
return "UNKNOWN" return statusUnknown
} }
} }
@@ -641,8 +646,9 @@ func enrichStorageWithVROC(storage []schema.HardwareStorage, pcie []schema.Hardw
storage[i].Telemetry["vroc_array"] = arr.Name storage[i].Telemetry["vroc_array"] = arr.Name
storage[i].Telemetry["vroc_degraded"] = arr.Degraded storage[i].Telemetry["vroc_degraded"] = arr.Degraded
if arr.Degraded { if arr.Degraded {
status := "WARNING" status := statusWarning
storage[i].Status = &status storage[i].Status = &status
storage[i].ErrorDescription = stringPtr("VROC array is degraded")
} }
updated++ updated++
} }
@@ -659,14 +665,14 @@ func hasVROCController(pcie []schema.HardwarePCIeDevice) bool {
class := "" class := ""
if dev.DeviceClass != nil { if dev.DeviceClass != nil {
class = strings.ToLower(*dev.DeviceClass) class = strings.TrimSpace(*dev.DeviceClass)
} }
model := "" model := ""
if dev.Model != nil { if dev.Model != nil {
model = strings.ToLower(*dev.Model) model = strings.ToLower(*dev.Model)
} }
if strings.Contains(class, "raid") || if isRAIDClass(class) ||
strings.Contains(model, "vroc") || strings.Contains(model, "vroc") ||
strings.Contains(model, "volume management device") || strings.Contains(model, "volume management device") ||
strings.Contains(model, "vmd") { strings.Contains(model, "vmd") {

View File

@@ -0,0 +1,334 @@
package collector
import (
"bee/audit/internal/schema"
"encoding/json"
"log/slog"
"strconv"
"strings"
)
type raidControllerTelemetry struct {
BatteryChargePct *float64
BatteryHealthPct *float64
BatteryTemperatureC *float64
BatteryVoltageV *float64
BatteryReplaceRequired *bool
ErrorDescription *string
}
func enrichPCIeWithRAIDTelemetry(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
byVendor := collectRAIDControllerTelemetry()
if len(byVendor) == 0 {
return devs
}
positions := map[int]int{}
for i := range devs {
if devs[i].VendorID == nil || !isLikelyRAIDController(devs[i]) {
continue
}
vendor := *devs[i].VendorID
list := byVendor[vendor]
if len(list) == 0 {
continue
}
index := positions[vendor]
if index >= len(list) {
continue
}
positions[vendor] = index + 1
applyRAIDControllerTelemetry(&devs[i], list[index])
}
return devs
}
func applyRAIDControllerTelemetry(dev *schema.HardwarePCIeDevice, tel raidControllerTelemetry) {
if tel.BatteryChargePct != nil {
dev.BatteryChargePct = tel.BatteryChargePct
}
if tel.BatteryHealthPct != nil {
dev.BatteryHealthPct = tel.BatteryHealthPct
}
if tel.BatteryTemperatureC != nil {
dev.BatteryTemperatureC = tel.BatteryTemperatureC
}
if tel.BatteryVoltageV != nil {
dev.BatteryVoltageV = tel.BatteryVoltageV
}
if tel.BatteryReplaceRequired != nil {
dev.BatteryReplaceRequired = tel.BatteryReplaceRequired
}
if tel.ErrorDescription != nil {
dev.ErrorDescription = tel.ErrorDescription
if dev.Status == nil || *dev.Status == statusOK {
status := statusWarning
dev.Status = &status
}
}
}
func collectRAIDControllerTelemetry() map[int][]raidControllerTelemetry {
out := map[int][]raidControllerTelemetry{}
if raw, err := raidToolQuery("storcli64", "/call", "show", "all", "J"); err == nil {
list := parseStorcliControllerTelemetry(raw)
if len(list) > 0 {
out[vendorBroadcomLSI] = append(out[vendorBroadcomLSI], list...)
slog.Info("raid: storcli controller telemetry", "count", len(list))
}
}
if raw, err := raidToolQuery("ssacli", "ctrl", "all", "show", "config", "detail"); err == nil {
list := parseSSACLIControllerTelemetry(string(raw))
if len(list) > 0 {
out[vendorHPE] = append(out[vendorHPE], list...)
slog.Info("raid: ssacli controller telemetry", "count", len(list))
}
}
if raw, err := raidToolQuery("arcconf", "getconfig", "1", "ad"); err == nil {
list := parseArcconfControllerTelemetry(string(raw))
if len(list) > 0 {
out[vendorAdaptec] = append(out[vendorAdaptec], list...)
slog.Info("raid: arcconf controller telemetry", "count", len(list))
}
}
return out
}
func parseStorcliControllerTelemetry(raw []byte) []raidControllerTelemetry {
var doc struct {
Controllers []struct {
ResponseData map[string]any `json:"Response Data"`
} `json:"Controllers"`
}
if err := json.Unmarshal(raw, &doc); err != nil {
slog.Warn("raid: parse storcli controller telemetry failed", "err", err)
return nil
}
var out []raidControllerTelemetry
for _, ctl := range doc.Controllers {
tel := raidControllerTelemetry{}
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info_Details"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info_Details"]))
if hasRAIDControllerTelemetry(tel) {
out = append(out, tel)
}
}
return out
}
func nestedStringMap(raw any) map[string]string {
switch value := raw.(type) {
case map[string]any:
out := map[string]string{}
flattenStringMap("", value, out)
return out
case []any:
out := map[string]string{}
for _, item := range value {
if m, ok := item.(map[string]any); ok {
flattenStringMap("", m, out)
}
}
return out
default:
return nil
}
}
func flattenStringMap(prefix string, in map[string]any, out map[string]string) {
for key, raw := range in {
fullKey := strings.TrimSpace(strings.ToLower(strings.Trim(prefix+" "+key, " ")))
switch value := raw.(type) {
case map[string]any:
flattenStringMap(fullKey, value, out)
case []any:
for _, item := range value {
if m, ok := item.(map[string]any); ok {
flattenStringMap(fullKey, m, out)
}
}
case string:
out[fullKey] = value
case json.Number:
out[fullKey] = value.String()
case float64:
out[fullKey] = strconv.FormatFloat(value, 'f', -1, 64)
case bool:
if value {
out[fullKey] = "true"
} else {
out[fullKey] = "false"
}
}
}
}
func mergeStorcliBatteryMap(tel *raidControllerTelemetry, fields map[string]string) {
if len(fields) == 0 {
return
}
for key, raw := range fields {
lower := strings.ToLower(strings.TrimSpace(key))
switch {
case strings.Contains(lower, "relative state of charge"), strings.Contains(lower, "remaining capacity"), strings.Contains(lower, "charge"):
if tel.BatteryChargePct == nil {
tel.BatteryChargePct = parsePercentPtr(raw)
}
case strings.Contains(lower, "state of health"), strings.Contains(lower, "health"):
if tel.BatteryHealthPct == nil {
tel.BatteryHealthPct = parsePercentPtr(raw)
}
case strings.Contains(lower, "temperature"):
if tel.BatteryTemperatureC == nil {
tel.BatteryTemperatureC = parseFloatPtr(raw)
}
case strings.Contains(lower, "voltage"):
if tel.BatteryVoltageV == nil {
tel.BatteryVoltageV = parseFloatPtr(raw)
}
case strings.Contains(lower, "replace"), strings.Contains(lower, "replacement required"):
if tel.BatteryReplaceRequired == nil {
tel.BatteryReplaceRequired = parseReplaceRequired(raw)
}
case strings.Contains(lower, "learn cycle requested"), strings.Contains(lower, "battery state"), strings.Contains(lower, "capacitance state"):
if desc := batteryStateDescription(raw); desc != nil && tel.ErrorDescription == nil {
tel.ErrorDescription = desc
}
}
}
}
func parseSSACLIControllerTelemetry(raw string) []raidControllerTelemetry {
lines := strings.Split(raw, "\n")
var out []raidControllerTelemetry
var current *raidControllerTelemetry
flush := func() {
if current != nil && hasRAIDControllerTelemetry(*current) {
out = append(out, *current)
}
current = nil
}
for _, line := range lines {
trimmed := strings.TrimSpace(line)
if trimmed == "" {
continue
}
if strings.HasPrefix(strings.ToLower(trimmed), "smart array") || strings.HasPrefix(strings.ToLower(trimmed), "controller ") {
flush()
current = &raidControllerTelemetry{}
continue
}
if current == nil {
continue
}
if idx := strings.Index(trimmed, ":"); idx > 0 {
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
val := strings.TrimSpace(trimmed[idx+1:])
switch {
case strings.Contains(key, "capacitor temperature"), strings.Contains(key, "battery temperature"):
current.BatteryTemperatureC = parseFloatPtr(val)
case strings.Contains(key, "capacitor voltage"), strings.Contains(key, "battery voltage"):
current.BatteryVoltageV = parseFloatPtr(val)
case strings.Contains(key, "capacitor charge"), strings.Contains(key, "battery charge"):
current.BatteryChargePct = parsePercentPtr(val)
case strings.Contains(key, "capacitor health"), strings.Contains(key, "battery health"):
current.BatteryHealthPct = parsePercentPtr(val)
case strings.Contains(key, "replace") || strings.Contains(key, "failed"):
if current.BatteryReplaceRequired == nil {
current.BatteryReplaceRequired = parseReplaceRequired(val)
}
if desc := batteryStateDescription(val); desc != nil && current.ErrorDescription == nil {
current.ErrorDescription = desc
}
}
}
}
flush()
return out
}
func parseArcconfControllerTelemetry(raw string) []raidControllerTelemetry {
lines := strings.Split(raw, "\n")
tel := raidControllerTelemetry{}
for _, line := range lines {
trimmed := strings.TrimSpace(line)
if idx := strings.Index(trimmed, ":"); idx > 0 {
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
val := strings.TrimSpace(trimmed[idx+1:])
switch {
case strings.Contains(key, "battery temperature"), strings.Contains(key, "capacitor temperature"):
tel.BatteryTemperatureC = parseFloatPtr(val)
case strings.Contains(key, "battery voltage"), strings.Contains(key, "capacitor voltage"):
tel.BatteryVoltageV = parseFloatPtr(val)
case strings.Contains(key, "battery charge"), strings.Contains(key, "capacitor charge"):
tel.BatteryChargePct = parsePercentPtr(val)
case strings.Contains(key, "battery health"), strings.Contains(key, "capacitor health"):
tel.BatteryHealthPct = parsePercentPtr(val)
case strings.Contains(key, "replace"), strings.Contains(key, "failed"):
if tel.BatteryReplaceRequired == nil {
tel.BatteryReplaceRequired = parseReplaceRequired(val)
}
if desc := batteryStateDescription(val); desc != nil && tel.ErrorDescription == nil {
tel.ErrorDescription = desc
}
}
}
}
if hasRAIDControllerTelemetry(tel) {
return []raidControllerTelemetry{tel}
}
return nil
}
func hasRAIDControllerTelemetry(tel raidControllerTelemetry) bool {
return tel.BatteryChargePct != nil ||
tel.BatteryHealthPct != nil ||
tel.BatteryTemperatureC != nil ||
tel.BatteryVoltageV != nil ||
tel.BatteryReplaceRequired != nil ||
tel.ErrorDescription != nil
}
func parsePercentPtr(raw string) *float64 {
raw = strings.ReplaceAll(strings.TrimSpace(raw), "%", "")
return parseFloatPtr(raw)
}
func parseReplaceRequired(raw string) *bool {
lower := strings.ToLower(strings.TrimSpace(raw))
switch {
case lower == "":
return nil
case strings.Contains(lower, "replace"), strings.Contains(lower, "failed"), strings.Contains(lower, "yes"), strings.Contains(lower, "required"):
value := true
return &value
case strings.Contains(lower, "no"), strings.Contains(lower, "ok"), strings.Contains(lower, "good"), strings.Contains(lower, "optimal"):
value := false
return &value
default:
return nil
}
}
func batteryStateDescription(raw string) *string {
lower := strings.ToLower(strings.TrimSpace(raw))
if lower == "" {
return nil
}
switch {
case strings.Contains(lower, "failed"), strings.Contains(lower, "fault"), strings.Contains(lower, "replace"), strings.Contains(lower, "warning"), strings.Contains(lower, "degraded"):
return &raw
default:
return nil
}
}

View File

@@ -1,6 +1,10 @@
package collector package collector
import "testing" import (
"bee/audit/internal/schema"
"errors"
"testing"
)
func TestParseSASIrcuControllerIDs(t *testing.T) { func TestParseSASIrcuControllerIDs(t *testing.T) {
raw := `LSI Corporation SAS2 IR Configuration Utility. raw := `LSI Corporation SAS2 IR Configuration Utility.
@@ -90,7 +94,111 @@ physicaldrive 1I:1:2 (894 GB, SAS HDD, Failed)
if drives[0].Status == nil || *drives[0].Status != "OK" { if drives[0].Status == nil || *drives[0].Status != "OK" {
t.Fatalf("drive0 status: %v", drives[0].Status) t.Fatalf("drive0 status: %v", drives[0].Status)
} }
if drives[1].Status == nil || *drives[1].Status != "CRITICAL" { if drives[1].Status == nil || *drives[1].Status != statusCritical {
t.Fatalf("drive1 status: %v", drives[1].Status) t.Fatalf("drive1 status: %v", drives[1].Status)
} }
} }
func TestParseStorcliControllerTelemetry(t *testing.T) {
raw := []byte(`{
"Controllers": [
{
"Response Data": {
"BBU_Info": {
"State of Health": "98 %",
"Relative State of Charge": "76 %",
"Temperature": "41 C",
"Voltage": "12.3 V",
"Replacement required": "No"
}
}
}
]
}`)
got := parseStorcliControllerTelemetry(raw)
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].BatteryHealthPct == nil || *got[0].BatteryHealthPct != 98 {
t.Fatalf("battery health=%v", got[0].BatteryHealthPct)
}
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 76 {
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
}
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 41 {
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
}
if got[0].BatteryVoltageV == nil || *got[0].BatteryVoltageV != 12.3 {
t.Fatalf("battery voltage=%v", got[0].BatteryVoltageV)
}
if got[0].BatteryReplaceRequired == nil || *got[0].BatteryReplaceRequired {
t.Fatalf("battery replace=%v", got[0].BatteryReplaceRequired)
}
}
func TestParseSSACLIControllerTelemetry(t *testing.T) {
raw := `Smart Array P440ar in Slot 0
Battery/Capacitor Count: 1
Capacitor Temperature (C): 37
Capacitor Charge (%): 94
Capacitor Health (%): 96
Capacitor Voltage (V): 9.8
Capacitor Failed: No
`
got := parseSSACLIControllerTelemetry(raw)
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 37 {
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
}
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 94 {
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
}
}
func TestEnrichPCIeWithRAIDTelemetry(t *testing.T) {
orig := raidToolQuery
t.Cleanup(func() { raidToolQuery = orig })
raidToolQuery = func(name string, args ...string) ([]byte, error) {
switch name {
case "storcli64":
return []byte(`{
"Controllers": [
{
"Response Data": {
"CV_Info": {
"State of Health": "99 %",
"Relative State of Charge": "81 %",
"Temperature": "38 C",
"Voltage": "12.1 V",
"Replacement required": "No"
}
}
}
]
}`), nil
default:
return nil, errors.New("skip")
}
}
vendor := vendorBroadcomLSI
class := "MassStorageController"
status := statusOK
devs := []schema.HardwarePCIeDevice{{
VendorID: &vendor,
DeviceClass: &class,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
}}
out := enrichPCIeWithRAIDTelemetry(devs)
if out[0].BatteryHealthPct == nil || *out[0].BatteryHealthPct != 99 {
t.Fatalf("battery health=%v", out[0].BatteryHealthPct)
}
if out[0].BatteryChargePct == nil || *out[0].BatteryChargePct != 81 {
t.Fatalf("battery charge=%v", out[0].BatteryChargePct)
}
if out[0].BatteryVoltageV == nil || *out[0].BatteryVoltageV != 12.1 {
t.Fatalf("battery voltage=%v", out[0].BatteryVoltageV)
}
}

View File

@@ -0,0 +1,373 @@
package collector
import (
"bee/audit/internal/schema"
"encoding/json"
"log/slog"
"os/exec"
"sort"
"strconv"
"strings"
)
type sensorsDoc map[string]map[string]any
func collectSensors() *schema.HardwareSensors {
doc, err := readSensorsJSONDoc()
if err != nil {
slog.Info("sensors: unavailable, skipping", "err", err)
return nil
}
sensors := buildSensorsFromDoc(doc)
if sensors == nil || (len(sensors.Fans) == 0 && len(sensors.Power) == 0 && len(sensors.Temperatures) == 0 && len(sensors.Other) == 0) {
return nil
}
slog.Info("sensors: collected",
"fans", len(sensors.Fans),
"power", len(sensors.Power),
"temperatures", len(sensors.Temperatures),
"other", len(sensors.Other),
)
return sensors
}
func readSensorsJSONDoc() (sensorsDoc, error) {
out, err := exec.Command("sensors", "-j").Output()
if err != nil {
return nil, err
}
var doc sensorsDoc
if err := json.Unmarshal(out, &doc); err != nil {
return nil, err
}
return doc, nil
}
func buildSensorsFromDoc(doc sensorsDoc) *schema.HardwareSensors {
if len(doc) == 0 {
return nil
}
result := &schema.HardwareSensors{}
seen := map[string]struct{}{}
chips := make([]string, 0, len(doc))
for chip := range doc {
chips = append(chips, chip)
}
sort.Strings(chips)
for _, chip := range chips {
features := doc[chip]
location := sensorLocation(chip)
keys := make([]string, 0, len(features))
for key := range features {
keys = append(keys, key)
}
sort.Strings(keys)
for _, key := range keys {
if strings.EqualFold(key, "Adapter") {
continue
}
feature, ok := features[key].(map[string]any)
if !ok {
continue
}
name := strings.TrimSpace(key)
if name == "" {
continue
}
switch classifySensorFeature(feature) {
case "fan":
item := buildFanSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "fan", item.Name) {
continue
}
result.Fans = append(result.Fans, *item)
case "temp":
item := buildTempSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "temp", item.Name) {
continue
}
result.Temperatures = append(result.Temperatures, *item)
case "power":
item := buildPowerSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "power", item.Name) {
continue
}
result.Power = append(result.Power, *item)
default:
item := buildOtherSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "other", item.Name) {
continue
}
result.Other = append(result.Other, *item)
}
}
}
return result
}
func parseSensorsJSON(raw []byte) (*schema.HardwareSensors, error) {
var doc sensorsDoc
err := json.Unmarshal(raw, &doc)
if err != nil {
return nil, err
}
return buildSensorsFromDoc(doc), nil
}
func duplicateSensor(seen map[string]struct{}, sensorType, name string) bool {
key := sensorType + "\x00" + name
if _, ok := seen[key]; ok {
return true
}
seen[key] = struct{}{}
return false
}
func sensorLocation(chip string) *string {
chip = strings.TrimSpace(chip)
if chip == "" {
return nil
}
return &chip
}
func classifySensorFeature(feature map[string]any) string {
for key := range feature {
switch {
case strings.Contains(key, "fan") && strings.HasSuffix(key, "_input"):
return "fan"
case strings.Contains(key, "temp") && strings.HasSuffix(key, "_input"):
return "temp"
case strings.Contains(key, "power") && (strings.HasSuffix(key, "_input") || strings.HasSuffix(key, "_average")):
return "power"
case strings.Contains(key, "curr") && strings.HasSuffix(key, "_input"):
return "power"
case strings.HasPrefix(key, "in") && strings.HasSuffix(key, "_input"):
return "power"
}
}
return "other"
}
func buildFanSensor(name string, location *string, feature map[string]any) *schema.HardwareFanSensor {
rpm, ok := firstFeatureInt(feature, "_input")
if !ok {
return nil
}
item := &schema.HardwareFanSensor{Name: name, Location: location, RPM: &rpm}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func buildTempSensor(name string, location *string, feature map[string]any) *schema.HardwareTemperatureSensor {
celsius, ok := firstFeatureFloat(feature, "_input")
if !ok {
return nil
}
item := &schema.HardwareTemperatureSensor{Name: name, Location: location, Celsius: &celsius}
if warning, ok := firstFeatureFloatWithSuffixes(feature, []string{"_max", "_high"}); ok {
item.ThresholdWarningCelsius = &warning
}
if critical, ok := firstFeatureFloatWithSuffixes(feature, []string{"_crit", "_emergency"}); ok {
item.ThresholdCriticalCelsius = &critical
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
} else {
item.Status = deriveTemperatureStatus(item.Celsius, item.ThresholdWarningCelsius, item.ThresholdCriticalCelsius)
}
return item
}
func buildPowerSensor(name string, location *string, feature map[string]any) *schema.HardwarePowerSensor {
item := &schema.HardwarePowerSensor{Name: name, Location: location}
if v, ok := firstFeatureFloatWithContains(feature, []string{"power"}); ok {
item.PowerW = &v
}
if v, ok := firstFeatureFloatWithPrefix(feature, "curr"); ok {
item.CurrentA = &v
}
if v, ok := firstFeatureFloatWithPrefix(feature, "in"); ok {
item.VoltageV = &v
}
if item.PowerW == nil && item.CurrentA == nil && item.VoltageV == nil {
return nil
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func buildOtherSensor(name string, location *string, feature map[string]any) *schema.HardwareOtherSensor {
value, unit, ok := firstGenericSensorValue(feature)
if !ok {
return nil
}
item := &schema.HardwareOtherSensor{Name: name, Location: location, Value: &value}
if unit != "" {
item.Unit = &unit
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func sensorStatusFromFeature(feature map[string]any) *string {
for key, raw := range feature {
if !strings.HasSuffix(key, "_alarm") {
continue
}
if number, ok := floatFromAny(raw); ok && number > 0 {
status := statusWarning
return &status
}
}
return nil
}
func deriveTemperatureStatus(current, warning, critical *float64) *string {
if current == nil {
return nil
}
switch {
case critical != nil && *current >= *critical:
status := statusCritical
return &status
case warning != nil && *current >= *warning:
status := statusWarning
return &status
default:
status := statusOK
return &status
}
}
func firstFeatureInt(feature map[string]any, suffix string) (int, bool) {
for key, raw := range feature {
if strings.HasSuffix(key, suffix) {
if value, ok := floatFromAny(raw); ok {
return int(value), true
}
}
}
return 0, false
}
func firstFeatureFloat(feature map[string]any, suffix string) (float64, bool) {
return firstFeatureFloatWithSuffixes(feature, []string{suffix})
}
func firstFeatureFloatWithSuffixes(feature map[string]any, suffixes []string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
for _, suffix := range suffixes {
if strings.HasSuffix(key, suffix) {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
}
return 0, false
}
func firstFeatureFloatWithContains(feature map[string]any, parts []string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
matched := true
for _, part := range parts {
if !strings.Contains(key, part) {
matched = false
break
}
}
if matched {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func firstFeatureFloatWithPrefix(feature map[string]any, prefix string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
if strings.HasPrefix(key, prefix) && strings.HasSuffix(key, "_input") {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func firstGenericSensorValue(feature map[string]any) (float64, string, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
if strings.HasSuffix(key, "_alarm") {
continue
}
value, ok := floatFromAny(feature[key])
if !ok {
continue
}
unit := inferSensorUnit(key)
return value, unit, true
}
return 0, "", false
}
func inferSensorUnit(key string) string {
switch {
case strings.Contains(key, "humidity"):
return "%"
case strings.Contains(key, "intrusion"):
return ""
default:
return ""
}
}
func sortedFeatureKeys(feature map[string]any) []string {
keys := make([]string, 0, len(feature))
for key := range feature {
keys = append(keys, key)
}
sort.Strings(keys)
return keys
}
func floatFromAny(raw any) (float64, bool) {
switch value := raw.(type) {
case float64:
return value, true
case float32:
return float64(value), true
case int:
return float64(value), true
case int64:
return float64(value), true
case json.Number:
if f, err := value.Float64(); err == nil {
return f, true
}
case string:
if value == "" {
return 0, false
}
if f, err := strconv.ParseFloat(value, 64); err == nil {
return f, true
}
}
return 0, false
}

View File

@@ -0,0 +1,54 @@
package collector
import "testing"
func TestParseSensorsJSON(t *testing.T) {
raw := []byte(`{
"coretemp-isa-0000": {
"Adapter": "ISA adapter",
"Package id 0": {
"temp1_input": 61.5,
"temp1_max": 80.0,
"temp1_crit": 95.0
},
"fan1": {
"fan1_input": 4200
}
},
"acpitz-acpi-0": {
"Adapter": "ACPI interface",
"in0": {
"in0_input": 12.06
},
"curr1": {
"curr1_input": 0.64
},
"power1": {
"power1_average": 137.0
},
"humidity1": {
"humidity1_input": 38.5
}
}
}`)
got, err := parseSensorsJSON(raw)
if err != nil {
t.Fatalf("parseSensorsJSON error: %v", err)
}
if got == nil {
t.Fatal("expected sensors")
}
if len(got.Temperatures) != 1 || got.Temperatures[0].Celsius == nil || *got.Temperatures[0].Celsius != 61.5 {
t.Fatalf("temperatures mismatch: %#v", got.Temperatures)
}
if len(got.Fans) != 1 || got.Fans[0].RPM == nil || *got.Fans[0].RPM != 4200 {
t.Fatalf("fans mismatch: %#v", got.Fans)
}
if len(got.Power) != 3 {
t.Fatalf("power sensors mismatch: %#v", got.Power)
}
if len(got.Other) != 1 || got.Other[0].Unit == nil || *got.Other[0].Unit != "%" {
t.Fatalf("other sensors mismatch: %#v", got.Other)
}
}

View File

@@ -5,11 +5,13 @@ import (
"encoding/json" "encoding/json"
"log/slog" "log/slog"
"os/exec" "os/exec"
"path/filepath"
"strconv"
"strings" "strings"
) )
func collectStorage() []schema.HardwareStorage { func collectStorage() []schema.HardwareStorage {
devs := lsblkDevices() devs := discoverStorageDevices()
result := make([]schema.HardwareStorage, 0, len(devs)) result := make([]schema.HardwareStorage, 0, len(devs))
for _, dev := range devs { for _, dev := range devs {
var s schema.HardwareStorage var s schema.HardwareStorage
@@ -26,19 +28,60 @@ func collectStorage() []schema.HardwareStorage {
// lsblkDevice is a minimal lsblk JSON record. // lsblkDevice is a minimal lsblk JSON record.
type lsblkDevice struct { type lsblkDevice struct {
Name string `json:"name"` Name string `json:"name"`
Type string `json:"type"` Type string `json:"type"`
Size string `json:"size"` Size string `json:"size"`
Serial string `json:"serial"` Serial string `json:"serial"`
Model string `json:"model"` Model string `json:"model"`
Tran string `json:"tran"` Tran string `json:"tran"`
Hctl string `json:"hctl"` Hctl string `json:"hctl"`
} }
type lsblkRoot struct { type lsblkRoot struct {
Blockdevices []lsblkDevice `json:"blockdevices"` Blockdevices []lsblkDevice `json:"blockdevices"`
} }
type nvmeListRoot struct {
Devices []nvmeListDevice `json:"Devices"`
}
type nvmeListDevice struct {
DevicePath string `json:"DevicePath"`
ModelNumber string `json:"ModelNumber"`
SerialNumber string `json:"SerialNumber"`
Firmware string `json:"Firmware"`
PhysicalSize int64 `json:"PhysicalSize"`
}
func discoverStorageDevices() []lsblkDevice {
merged := map[string]lsblkDevice{}
for _, dev := range lsblkDevices() {
if dev.Name == "" {
continue
}
merged[dev.Name] = dev
}
for _, dev := range nvmeListDevices() {
if dev.Name == "" {
continue
}
current := merged[dev.Name]
merged[dev.Name] = mergeStorageDevice(current, dev)
}
disks := make([]lsblkDevice, 0, len(merged))
for _, dev := range merged {
if dev.Type == "" {
dev.Type = "disk"
}
if dev.Type != "disk" {
continue
}
disks = append(disks, dev)
}
return disks
}
func lsblkDevices() []lsblkDevice { func lsblkDevices() []lsblkDevice {
out, err := exec.Command("lsblk", "-J", "-d", out, err := exec.Command("lsblk", "-J", "-d",
"-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL").Output() "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL").Output()
@@ -60,6 +103,59 @@ func lsblkDevices() []lsblkDevice {
return disks return disks
} }
func nvmeListDevices() []lsblkDevice {
out, err := exec.Command("nvme", "list", "-o", "json").Output()
if err != nil {
return nil
}
var root nvmeListRoot
if err := json.Unmarshal(out, &root); err != nil {
slog.Warn("storage: nvme list parse failed", "err", err)
return nil
}
devices := make([]lsblkDevice, 0, len(root.Devices))
for _, dev := range root.Devices {
name := filepath.Base(strings.TrimSpace(dev.DevicePath))
if name == "" {
continue
}
devices = append(devices, lsblkDevice{
Name: name,
Type: "disk",
Size: strconv.FormatInt(dev.PhysicalSize, 10),
Serial: strings.TrimSpace(dev.SerialNumber),
Model: strings.TrimSpace(dev.ModelNumber),
Tran: "nvme",
})
}
return devices
}
func mergeStorageDevice(existing, incoming lsblkDevice) lsblkDevice {
if existing.Name == "" {
return incoming
}
if existing.Type == "" {
existing.Type = incoming.Type
}
if strings.TrimSpace(existing.Size) == "" {
existing.Size = incoming.Size
}
if strings.TrimSpace(existing.Serial) == "" {
existing.Serial = incoming.Serial
}
if strings.TrimSpace(existing.Model) == "" {
existing.Model = incoming.Model
}
if strings.TrimSpace(existing.Tran) == "" {
existing.Tran = incoming.Tran
}
if strings.TrimSpace(existing.Hctl) == "" {
existing.Hctl = incoming.Hctl
}
return existing
}
// smartctlInfo is the subset of smartctl -j -a output we care about. // smartctlInfo is the subset of smartctl -j -a output we care about.
type smartctlInfo struct { type smartctlInfo struct {
ModelFamily string `json:"model_family"` ModelFamily string `json:"model_family"`
@@ -67,14 +163,22 @@ type smartctlInfo struct {
SerialNumber string `json:"serial_number"` SerialNumber string `json:"serial_number"`
FirmwareVer string `json:"firmware_version"` FirmwareVer string `json:"firmware_version"`
RotationRate int `json:"rotation_rate"` RotationRate int `json:"rotation_rate"`
Temperature struct {
Current int `json:"current"`
} `json:"temperature"`
SmartStatus struct {
Passed bool `json:"passed"`
} `json:"smart_status"`
UserCapacity struct { UserCapacity struct {
Bytes int64 `json:"bytes"` Bytes int64 `json:"bytes"`
} `json:"user_capacity"` } `json:"user_capacity"`
AtaSmartAttributes struct { AtaSmartAttributes struct {
Table []struct { Table []struct {
ID int `json:"id"` ID int `json:"id"`
Name string `json:"name"` Name string `json:"name"`
Raw struct{ Value int64 `json:"value"` } `json:"raw"` Raw struct {
Value int64 `json:"value"`
} `json:"raw"`
} `json:"table"` } `json:"table"`
} `json:"ata_smart_attributes"` } `json:"ata_smart_attributes"`
PowerOnTime struct { PowerOnTime struct {
@@ -149,69 +253,116 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
} else if info.RotationRate > 0 { } else if info.RotationRate > 0 {
devType = "HDD" devType = "HDD"
} }
s.Type = &devType
// telemetry if info.Temperature.Current > 0 {
tel := map[string]any{} t := float64(info.Temperature.Current)
s.TemperatureC = &t
}
if info.PowerOnTime.Hours > 0 { if info.PowerOnTime.Hours > 0 {
tel["power_on_hours"] = info.PowerOnTime.Hours v := int64(info.PowerOnTime.Hours)
s.PowerOnHours = &v
} }
if info.PowerCycleCount > 0 { if info.PowerCycleCount > 0 {
tel["power_cycles"] = info.PowerCycleCount v := int64(info.PowerCycleCount)
s.PowerCycles = &v
} }
reallocated := int64(0)
pending := int64(0)
uncorrectable := int64(0)
lifeRemaining := int64(0)
for _, attr := range info.AtaSmartAttributes.Table { for _, attr := range info.AtaSmartAttributes.Table {
switch attr.ID { switch attr.ID {
case 5: case 5:
tel["reallocated_sectors"] = attr.Raw.Value reallocated = attr.Raw.Value
s.ReallocatedSectors = &reallocated
case 177: case 177:
tel["wear_leveling_pct"] = attr.Raw.Value value := float64(attr.Raw.Value)
s.LifeUsedPct = &value
case 231: case 231:
tel["life_remaining_pct"] = attr.Raw.Value lifeRemaining = attr.Raw.Value
value := float64(attr.Raw.Value)
s.LifeRemainingPct = &value
case 241: case 241:
tel["total_lba_written"] = attr.Raw.Value value := attr.Raw.Value
s.WrittenBytes = &value
case 197:
pending = attr.Raw.Value
s.CurrentPendingSectors = &pending
case 198:
uncorrectable = attr.Raw.Value
s.OfflineUncorrectable = &uncorrectable
} }
} }
if len(tel) > 0 {
s.Telemetry = tel status := storageHealthStatus{
overallPassed: info.SmartStatus.Passed,
hasOverall: true,
reallocatedSectors: reallocated,
pendingSectors: pending,
offlineUncorrectable: uncorrectable,
lifeRemainingPct: lifeRemaining,
} }
setStorageHealthStatus(&s, status)
return s
} }
s.Type = &devType s.Type = &devType
status := "OK" status := statusUnknown
s.Status = &status s.Status = &status
return s return s
} }
// nvmeSmartLog is the subset of `nvme smart-log -o json` output we care about. // nvmeSmartLog is the subset of `nvme smart-log -o json` output we care about.
type nvmeSmartLog struct { type nvmeSmartLog struct {
CriticalWarning int `json:"critical_warning"`
PercentageUsed int `json:"percentage_used"` PercentageUsed int `json:"percentage_used"`
AvailableSpare int `json:"available_spare"`
SpareThreshold int `json:"spare_thresh"`
Temperature int64 `json:"temperature"`
PowerOnHours int64 `json:"power_on_hours"` PowerOnHours int64 `json:"power_on_hours"`
PowerCycles int64 `json:"power_cycles"` PowerCycles int64 `json:"power_cycles"`
UnsafeShutdowns int64 `json:"unsafe_shutdowns"` UnsafeShutdowns int64 `json:"unsafe_shutdowns"`
DataUnitsRead int64 `json:"data_units_read"`
DataUnitsWritten int64 `json:"data_units_written"` DataUnitsWritten int64 `json:"data_units_written"`
ControllerBusy int64 `json:"controller_busy_time"` ControllerBusy int64 `json:"controller_busy_time"`
MediaErrors int64 `json:"media_errors"`
NumErrLogEntries int64 `json:"num_err_log_entries"`
} }
// nvmeIDCtrl is the subset of `nvme id-ctrl -o json` output. // nvmeIDCtrl is the subset of `nvme id-ctrl -o json` output.
type nvmeIDCtrl struct { type nvmeIDCtrl struct {
ModelNumber string `json:"mn"` ModelNumber string `json:"mn"`
SerialNumber string `json:"sn"` SerialNumber string `json:"sn"`
FirmwareRev string `json:"fr"` FirmwareRev string `json:"fr"`
TotalCapacity int64 `json:"tnvmcap"` TotalCapacity int64 `json:"tnvmcap"`
} }
func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage { func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
present := true present := true
devType := "NVMe" devType := "NVMe"
iface := "NVMe" iface := "NVMe"
status := "OK" status := statusOK
s := schema.HardwareStorage{ s := schema.HardwareStorage{
Present: &present, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Type: &devType, Present: &present,
Interface: &iface, Type: &devType,
Status: &status, Interface: &iface,
} }
devPath := "/dev/" + dev.Name devPath := "/dev/" + dev.Name
if v := cleanDMIValue(strings.TrimSpace(dev.Model)); v != "" {
s.Model = &v
}
if v := cleanDMIValue(strings.TrimSpace(dev.Serial)); v != "" {
s.SerialNumber = &v
}
if size := parseStorageBytes(dev.Size); size > 0 {
gb := int(size / 1_000_000_000)
if gb > 0 {
s.SizeGB = &gb
}
}
// id-ctrl: model, serial, firmware, capacity // id-ctrl: model, serial, firmware, capacity
if out, err := exec.Command("nvme", "id-ctrl", devPath, "-o", "json").Output(); err == nil { if out, err := exec.Command("nvme", "id-ctrl", devPath, "-o", "json").Output(); err == nil {
@@ -237,30 +388,131 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
if out, err := exec.Command("nvme", "smart-log", devPath, "-o", "json").Output(); err == nil { if out, err := exec.Command("nvme", "smart-log", devPath, "-o", "json").Output(); err == nil {
var log nvmeSmartLog var log nvmeSmartLog
if json.Unmarshal(out, &log) == nil { if json.Unmarshal(out, &log) == nil {
tel := map[string]any{}
if log.PowerOnHours > 0 { if log.PowerOnHours > 0 {
tel["power_on_hours"] = log.PowerOnHours s.PowerOnHours = &log.PowerOnHours
} }
if log.PowerCycles > 0 { if log.PowerCycles > 0 {
tel["power_cycles"] = log.PowerCycles s.PowerCycles = &log.PowerCycles
} }
if log.UnsafeShutdowns > 0 { if log.UnsafeShutdowns > 0 {
tel["unsafe_shutdowns"] = log.UnsafeShutdowns s.UnsafeShutdowns = &log.UnsafeShutdowns
} }
if log.PercentageUsed > 0 { if log.PercentageUsed > 0 {
tel["percentage_used"] = log.PercentageUsed v := float64(log.PercentageUsed)
s.LifeUsedPct = &v
remaining := 100 - v
s.LifeRemainingPct = &remaining
} }
if log.DataUnitsWritten > 0 { if log.DataUnitsWritten > 0 {
tel["data_units_written"] = log.DataUnitsWritten v := nvmeDataUnitsToBytes(log.DataUnitsWritten)
s.WrittenBytes = &v
} }
if log.ControllerBusy > 0 { if log.DataUnitsRead > 0 {
tel["controller_busy_time"] = log.ControllerBusy v := nvmeDataUnitsToBytes(log.DataUnitsRead)
s.ReadBytes = &v
} }
if len(tel) > 0 { if log.AvailableSpare > 0 {
s.Telemetry = tel v := float64(log.AvailableSpare)
s.AvailableSparePct = &v
} }
if log.MediaErrors > 0 {
s.MediaErrors = &log.MediaErrors
}
if log.NumErrLogEntries > 0 {
s.ErrorLogEntries = &log.NumErrLogEntries
}
if log.Temperature > 0 {
v := float64(log.Temperature - 273)
s.TemperatureC = &v
}
setStorageHealthStatus(&s, storageHealthStatus{
criticalWarning: log.CriticalWarning,
percentageUsed: int64(log.PercentageUsed),
availableSpare: int64(log.AvailableSpare),
spareThreshold: int64(log.SpareThreshold),
unsafeShutdowns: log.UnsafeShutdowns,
mediaErrors: log.MediaErrors,
errorLogEntries: log.NumErrLogEntries,
})
return s
} }
} }
status = statusUnknown
s.Status = &status
return s return s
} }
func parseStorageBytes(raw string) int64 {
value, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
if err == nil && value > 0 {
return value
}
return 0
}
func nvmeDataUnitsToBytes(units int64) int64 {
if units <= 0 {
return 0
}
return units * 512000
}
type storageHealthStatus struct {
hasOverall bool
overallPassed bool
reallocatedSectors int64
pendingSectors int64
offlineUncorrectable int64
lifeRemainingPct int64
criticalWarning int
percentageUsed int64
availableSpare int64
spareThreshold int64
unsafeShutdowns int64
mediaErrors int64
errorLogEntries int64
}
func setStorageHealthStatus(s *schema.HardwareStorage, health storageHealthStatus) {
status := statusOK
var description *string
switch {
case health.hasOverall && !health.overallPassed:
status = statusCritical
description = stringPtr("SMART overall self-assessment failed")
case health.criticalWarning > 0:
status = statusCritical
description = stringPtr("NVMe critical warning is set")
case health.pendingSectors > 0 || health.offlineUncorrectable > 0:
status = statusCritical
description = stringPtr("Pending or offline uncorrectable sectors detected")
case health.mediaErrors > 0:
status = statusWarning
description = stringPtr("Media errors reported")
case health.reallocatedSectors > 0:
status = statusWarning
description = stringPtr("Reallocated sectors detected")
case health.errorLogEntries > 0:
status = statusWarning
description = stringPtr("Device error log contains entries")
case health.lifeRemainingPct > 0 && health.lifeRemainingPct <= 10:
status = statusWarning
description = stringPtr("Life remaining is low")
case health.percentageUsed >= 95:
status = statusWarning
description = stringPtr("Drive wear level is high")
case health.availableSpare > 0 && health.spareThreshold > 0 && health.availableSpare <= health.spareThreshold:
status = statusWarning
description = stringPtr("Available spare is at or below threshold")
case health.unsafeShutdowns > 100:
status = statusWarning
description = stringPtr("Unsafe shutdown count is high")
}
s.Status = &status
s.ErrorDescription = description
}
func stringPtr(value string) *string {
return &value
}

View File

@@ -0,0 +1,33 @@
package collector
import "testing"
func TestMergeStorageDevicePrefersNonEmptyFields(t *testing.T) {
t.Parallel()
got := mergeStorageDevice(
lsblkDevice{Name: "nvme0n1", Type: "disk", Tran: "nvme"},
lsblkDevice{Name: "nvme0n1", Type: "disk", Size: "1024", Serial: "SN123", Model: "Kioxia"},
)
if got.Serial != "SN123" {
t.Fatalf("serial=%q want SN123", got.Serial)
}
if got.Model != "Kioxia" {
t.Fatalf("model=%q want Kioxia", got.Model)
}
if got.Size != "1024" {
t.Fatalf("size=%q want 1024", got.Size)
}
}
func TestParseStorageBytes(t *testing.T) {
t.Parallel()
if got := parseStorageBytes(" 2048 "); got != 2048 {
t.Fatalf("parseStorageBytes=%d want 2048", got)
}
if got := parseStorageBytes("1.92 TB"); got != 0 {
t.Fatalf("parseStorageBytes invalid=%d want 0", got)
}
}

View File

@@ -0,0 +1,63 @@
package collector
import (
"testing"
"bee/audit/internal/schema"
)
func TestSetStorageHealthStatus(t *testing.T) {
t.Parallel()
tests := []struct {
name string
health storageHealthStatus
want string
}{
{
name: "smart overall failed",
health: storageHealthStatus{hasOverall: true, overallPassed: false},
want: statusCritical,
},
{
name: "nvme critical warning",
health: storageHealthStatus{criticalWarning: 1},
want: statusCritical,
},
{
name: "pending sectors",
health: storageHealthStatus{pendingSectors: 1},
want: statusCritical,
},
{
name: "media errors warning",
health: storageHealthStatus{mediaErrors: 2},
want: statusWarning,
},
{
name: "reallocated warning",
health: storageHealthStatus{reallocatedSectors: 1},
want: statusWarning,
},
{
name: "life remaining low",
health: storageHealthStatus{lifeRemainingPct: 8},
want: statusWarning,
},
{
name: "healthy",
health: storageHealthStatus{},
want: statusOK,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
var disk schema.HardwareStorage
setStorageHealthStatus(&disk, tt.health)
if disk.Status == nil || *disk.Status != tt.want {
t.Fatalf("status=%v want %q", disk.Status, tt.want)
}
})
}
}

View File

@@ -0,0 +1,114 @@
package collector
import (
"bee/audit/internal/schema"
"fmt"
"time"
)
func BuildHealthSummary(snap schema.HardwareSnapshot) *schema.HardwareHealthSummary {
summary := &schema.HardwareHealthSummary{
Status: statusOK,
CollectedAt: time.Now().UTC().Format(time.RFC3339),
}
for _, dimm := range snap.Memory {
switch derefString(dimm.Status) {
case statusWarning:
summary.MemoryWarn++
summary.Warnings = append(summary.Warnings, formatMemorySummary(dimm))
case statusCritical:
summary.MemoryFail++
summary.Failures = append(summary.Failures, formatMemorySummary(dimm))
case statusEmpty:
summary.EmptyDIMMs++
}
}
for _, disk := range snap.Storage {
switch derefString(disk.Status) {
case statusWarning:
summary.StorageWarn++
summary.Warnings = append(summary.Warnings, formatStorageSummary(disk))
case statusCritical:
summary.StorageFail++
summary.Failures = append(summary.Failures, formatStorageSummary(disk))
}
}
for _, dev := range snap.PCIeDevices {
switch derefString(dev.Status) {
case statusWarning:
summary.PCIeWarn++
summary.Warnings = append(summary.Warnings, formatPCIeSummary(dev))
case statusCritical:
summary.PCIeFail++
summary.Failures = append(summary.Failures, formatPCIeSummary(dev))
}
}
for _, psu := range snap.PowerSupplies {
if psu.Present != nil && !*psu.Present {
summary.MissingPSUs++
}
switch derefString(psu.Status) {
case statusWarning:
summary.PSUWarn++
summary.Warnings = append(summary.Warnings, formatPSUSummary(psu))
case statusCritical:
summary.PSUFail++
summary.Failures = append(summary.Failures, formatPSUSummary(psu))
}
}
if len(summary.Failures) > 0 || summary.StorageFail > 0 || summary.PCIeFail > 0 || summary.PSUFail > 0 || summary.MemoryFail > 0 {
summary.Status = statusCritical
} else if len(summary.Warnings) > 0 || summary.StorageWarn > 0 || summary.PCIeWarn > 0 || summary.PSUWarn > 0 || summary.MemoryWarn > 0 {
summary.Status = statusWarning
}
if len(summary.Warnings) == 0 {
summary.Warnings = nil
}
if len(summary.Failures) == 0 {
summary.Failures = nil
}
return summary
}
func derefString(value *string) string {
if value == nil {
return ""
}
return *value
}
func preferredName(model, serial, slot *string) string {
switch {
case model != nil && *model != "":
return *model
case serial != nil && *serial != "":
return *serial
case slot != nil && *slot != "":
return *slot
default:
return "unknown"
}
}
func formatStorageSummary(disk schema.HardwareStorage) string {
return fmt.Sprintf("storage %s status=%s", preferredName(disk.Model, disk.SerialNumber, disk.Slot), derefString(disk.Status))
}
func formatPCIeSummary(dev schema.HardwarePCIeDevice) string {
return fmt.Sprintf("pcie %s status=%s", preferredName(dev.Model, dev.SerialNumber, dev.BDF), derefString(dev.Status))
}
func formatPSUSummary(psu schema.HardwarePowerSupply) string {
return fmt.Sprintf("psu %s status=%s", preferredName(psu.Model, psu.SerialNumber, psu.Slot), derefString(psu.Status))
}
func formatMemorySummary(dimm schema.HardwareMemory) string {
return fmt.Sprintf("memory %s status=%s", preferredName(dimm.PartNumber, dimm.SerialNumber, dimm.Slot), derefString(dimm.Status))
}

View File

@@ -31,7 +31,7 @@ md125 : active raid1 nvme2n1[0] nvme3n1[1]
func TestHasVROCController(t *testing.T) { func TestHasVROCController(t *testing.T) {
intel := vendorIntel intel := vendorIntel
model := "Volume Management Device NVMe RAID Controller" model := "Volume Management Device NVMe RAID Controller"
class := "RAID bus controller" class := "MassStorageController"
tests := []struct { tests := []struct {
name string name string
pcie []schema.HardwarePCIeDevice pcie []schema.HardwarePCIeDevice

View File

@@ -0,0 +1,577 @@
package platform
import (
"bytes"
"fmt"
"math"
"os"
"os/exec"
"strconv"
"strings"
"time"
)
// GPUMetricRow is one telemetry sample from nvidia-smi during a stress test.
type GPUMetricRow struct {
ElapsedSec float64
GPUIndex int
TempC float64
UsagePct float64
PowerW float64
ClockMHz float64
}
// sampleGPUMetrics runs nvidia-smi once and returns current metrics for each GPU.
func sampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
args := []string{
"--query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics",
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
}
out, err := exec.Command("nvidia-smi", args...).Output()
if err != nil {
return nil, err
}
var rows []GPUMetricRow
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.Split(line, ", ")
if len(parts) < 5 {
continue
}
idx, _ := strconv.Atoi(strings.TrimSpace(parts[0]))
rows = append(rows, GPUMetricRow{
GPUIndex: idx,
TempC: parseGPUFloat(parts[1]),
UsagePct: parseGPUFloat(parts[2]),
PowerW: parseGPUFloat(parts[3]),
ClockMHz: parseGPUFloat(parts[4]),
})
}
return rows, nil
}
func parseGPUFloat(s string) float64 {
s = strings.TrimSpace(s)
if s == "N/A" || s == "[Not Supported]" || s == "" {
return 0
}
v, _ := strconv.ParseFloat(s, 64)
return v
}
// WriteGPUMetricsCSV writes collected rows as a CSV file.
func WriteGPUMetricsCSV(path string, rows []GPUMetricRow) error {
var b bytes.Buffer
b.WriteString("elapsed_sec,gpu_index,temperature_c,usage_pct,power_w,clock_mhz\n")
for _, r := range rows {
fmt.Fprintf(&b, "%.1f,%d,%.1f,%.1f,%.1f,%.0f\n",
r.ElapsedSec, r.GPUIndex, r.TempC, r.UsagePct, r.PowerW, r.ClockMHz)
}
return os.WriteFile(path, b.Bytes(), 0644)
}
// WriteGPUMetricsHTML writes a standalone HTML file with one SVG chart per GPU.
func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
// Group by GPU index preserving order.
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
var svgs strings.Builder
for _, gpuIdx := range order {
svgs.WriteString(drawGPUChartSVG(gpuMap[gpuIdx], gpuIdx))
svgs.WriteString("\n")
}
ts := time.Now().UTC().Format("2006-01-02 15:04:05 UTC")
html := fmt.Sprintf(`<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<title>GPU Stress Test Metrics</title>
<style>
body { font-family: sans-serif; background: #f0f0f0; margin: 0; padding: 20px; }
h1 { text-align: center; color: #333; margin: 0 0 8px; }
p { text-align: center; color: #888; font-size: 13px; margin: 0 0 24px; }
</style>
</head><body>
<h1>GPU Stress Test Metrics</h1>
<p>Generated %s</p>
%s
</body></html>`, ts, svgs.String())
return os.WriteFile(path, []byte(html), 0644)
}
// drawGPUChartSVG generates a self-contained SVG chart for one GPU.
func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
// Layout
const W, H = 960, 520
const plotX1 = 120 // usage axis / chart left border
const plotX2 = 840 // power axis / chart right border
const plotY1 = 70 // top
const plotY2 = 465 // bottom (PH = 395)
const PW = plotX2 - plotX1
const PH = plotY2 - plotY1
// Outer axes
const tempAxisX = 60 // temp axis line
const clockAxisX = 900 // clock axis line
colors := [4]string{"#e74c3c", "#3498db", "#2ecc71", "#f39c12"}
seriesLabel := [4]string{
fmt.Sprintf("GPU %d Temp (°C)", gpuIdx),
fmt.Sprintf("GPU %d Usage (%%)", gpuIdx),
fmt.Sprintf("GPU %d Power (W)", gpuIdx),
fmt.Sprintf("GPU %d Clock (MHz)", gpuIdx),
}
axisLabel := [4]string{"Temperature (°C)", "GPU Usage (%)", "Power (W)", "Clock (MHz)"}
// Extract series
t := make([]float64, len(rows))
vals := [4][]float64{}
for i := range vals {
vals[i] = make([]float64, len(rows))
}
for i, r := range rows {
t[i] = r.ElapsedSec
vals[0][i] = r.TempC
vals[1][i] = r.UsagePct
vals[2][i] = r.PowerW
vals[3][i] = r.ClockMHz
}
tMin, tMax := gpuMinMax(t)
type axisScale struct {
ticks []float64
min, max float64
}
var axes [4]axisScale
for i := 0; i < 4; i++ {
mn, mx := gpuMinMax(vals[i])
tks := gpuNiceTicks(mn, mx, 8)
axes[i] = axisScale{ticks: tks, min: tks[0], max: tks[len(tks)-1]}
}
xv := func(tv float64) float64 {
if tMax == tMin {
return float64(plotX1)
}
return float64(plotX1) + (tv-tMin)/(tMax-tMin)*float64(PW)
}
yv := func(v float64, ai int) float64 {
a := axes[ai]
if a.max == a.min {
return float64(plotY1 + PH/2)
}
return float64(plotY2) - (v-a.min)/(a.max-a.min)*float64(PH)
}
var b strings.Builder
fmt.Fprintf(&b, `<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d"`+
` style="background:#fff;border-radius:8px;display:block;margin:0 auto 24px;`+
`box-shadow:0 2px 12px rgba(0,0,0,.12)">`+"\n", W, H)
// Title
fmt.Fprintf(&b, `<text x="%d" y="22" text-anchor="middle" font-family="sans-serif"`+
` font-size="14" font-weight="bold" fill="#333">GPU Stress Test Metrics — GPU %d</text>`+"\n",
plotX1+PW/2, gpuIdx)
// Horizontal grid (align to temp axis ticks)
b.WriteString(`<g stroke="#e0e0e0" stroke-width="0.5">` + "\n")
for _, tick := range axes[0].ticks {
y := yv(tick, 0)
if y < float64(plotY1) || y > float64(plotY2) {
continue
}
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"/>`+"\n",
plotX1, y, plotX2, y)
}
// Vertical grid
xTicks := gpuNiceTicks(tMin, tMax, 10)
for _, tv := range xTicks {
x := xv(tv)
if x < float64(plotX1) || x > float64(plotX2) {
continue
}
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d"/>`+"\n",
x, plotY1, x, plotY2)
}
b.WriteString("</g>\n")
// Chart border
fmt.Fprintf(&b, `<rect x="%d" y="%d" width="%d" height="%d"`+
` fill="none" stroke="#333" stroke-width="1"/>`+"\n",
plotX1, plotY1, PW, PH)
// X axis ticks and labels
b.WriteString(`<g font-family="sans-serif" font-size="11" fill="#333" text-anchor="middle">` + "\n")
for _, tv := range xTicks {
x := xv(tv)
if x < float64(plotX1) || x > float64(plotX2) {
continue
}
fmt.Fprintf(&b, `<text x="%.1f" y="%d">%s</text>`+"\n", x, plotY2+18, gpuFormatTick(tv))
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d" stroke="#333" stroke-width="1"/>`+"\n",
x, plotY2, x, plotY2+4)
}
b.WriteString("</g>\n")
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="13"`+
` fill="#333" text-anchor="middle">Time (seconds)</text>`+"\n",
plotX1+PW/2, plotY2+38)
// Y axes: [tempAxisX, plotX1, plotX2, clockAxisX]
axisLineX := [4]int{tempAxisX, plotX1, plotX2, clockAxisX}
axisRight := [4]bool{false, false, true, true}
// Label x positions (for rotated vertical text)
axisLabelX := [4]int{10, 68, 868, 950}
for i := 0; i < 4; i++ {
ax := axisLineX[i]
right := axisRight[i]
color := colors[i]
// Axis line
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
` stroke="%s" stroke-width="1"/>`+"\n",
ax, plotY1, ax, plotY2, color)
// Ticks and tick labels
fmt.Fprintf(&b, `<g font-family="sans-serif" font-size="10" fill="%s">`+"\n", color)
for _, tick := range axes[i].ticks {
y := yv(tick, i)
if y < float64(plotY1) || y > float64(plotY2) {
continue
}
dx := -5
textX := ax - 8
anchor := "end"
if right {
dx = 5
textX = ax + 8
anchor = "start"
}
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"`+
` stroke="%s" stroke-width="1"/>`+"\n",
ax, y, ax+dx, y, color)
fmt.Fprintf(&b, `<text x="%d" y="%.1f" text-anchor="%s" dy="4">%s</text>`+"\n",
textX, y, anchor, gpuFormatTick(tick))
}
b.WriteString("</g>\n")
// Axis label (rotated)
lx := axisLabelX[i]
fmt.Fprintf(&b, `<text transform="translate(%d,%d) rotate(-90)"`+
` font-family="sans-serif" font-size="12" fill="%s" text-anchor="middle">%s</text>`+"\n",
lx, plotY1+PH/2, color, axisLabel[i])
}
// Data lines
for i := 0; i < 4; i++ {
var pts strings.Builder
for j := range rows {
x := xv(t[j])
y := yv(vals[i][j], i)
if j == 0 {
fmt.Fprintf(&pts, "%.1f,%.1f", x, y)
} else {
fmt.Fprintf(&pts, " %.1f,%.1f", x, y)
}
}
fmt.Fprintf(&b, `<polyline points="%s" fill="none" stroke="%s" stroke-width="1.5"/>`+"\n",
pts.String(), colors[i])
}
// Legend
const legendY = 42
for i := 0; i < 4; i++ {
lx := plotX1 + i*(PW/4) + 10
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
` stroke="%s" stroke-width="2"/>`+"\n",
lx, legendY, lx+20, legendY, colors[i])
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="12" fill="#333">%s</text>`+"\n",
lx+25, legendY+4, seriesLabel[i])
}
b.WriteString("</svg>\n")
return b.String()
}
const (
ansiRed = "\033[31m"
ansiBlue = "\033[34m"
ansiGreen = "\033[32m"
ansiYellow = "\033[33m"
ansiReset = "\033[0m"
)
const (
termChartWidth = 70
termChartHeight = 12
)
// RenderGPUTerminalChart returns ANSI line charts (asciigraph-style) per GPU.
// Suitable for display in the TUI screenOutput.
func RenderGPUTerminalChart(rows []GPUMetricRow) string {
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
type seriesDef struct {
caption string
color string
fn func(GPUMetricRow) float64
}
defs := []seriesDef{
{"Temperature (°C)", ansiRed, func(r GPUMetricRow) float64 { return r.TempC }},
{"GPU Usage (%)", ansiBlue, func(r GPUMetricRow) float64 { return r.UsagePct }},
{"Power (W)", ansiGreen, func(r GPUMetricRow) float64 { return r.PowerW }},
{"Clock (MHz)", ansiYellow, func(r GPUMetricRow) float64 { return r.ClockMHz }},
}
var b strings.Builder
for _, gpuIdx := range order {
gr := gpuMap[gpuIdx]
if len(gr) == 0 {
continue
}
tMax := gr[len(gr)-1].ElapsedSec - gr[0].ElapsedSec
fmt.Fprintf(&b, "GPU %d — Stress Test Metrics (%.0f seconds)\n\n", gpuIdx, tMax)
for _, d := range defs {
b.WriteString(renderLineChart(extractGPUField(gr, d.fn), d.color, d.caption,
termChartHeight, termChartWidth))
b.WriteRune('\n')
}
}
return strings.TrimRight(b.String(), "\n")
}
// renderLineChart draws a single time-series line chart using box-drawing characters.
// Produces output in the style of asciigraph: ╭─╮ │ ╰─╯ with a Y axis and caption.
func renderLineChart(vals []float64, color, caption string, height, width int) string {
if len(vals) == 0 {
return caption + "\n"
}
mn, mx := gpuMinMax(vals)
if mn == mx {
mx = mn + 1
}
// Use the smaller of width or len(vals) to avoid stretching sparse data.
w := width
if len(vals) < w {
w = len(vals)
}
data := gpuDownsample(vals, w)
// row[i] = display row index: 0 = top = max value, height = bottom = min value.
row := make([]int, w)
for i, v := range data {
r := int(math.Round((mx - v) / (mx - mn) * float64(height)))
if r < 0 {
r = 0
}
if r > height {
r = height
}
row[i] = r
}
// Fill the character grid.
grid := make([][]rune, height+1)
for i := range grid {
grid[i] = make([]rune, w)
for j := range grid[i] {
grid[i][j] = ' '
}
}
for x := 0; x < w; x++ {
r := row[x]
if x == 0 {
grid[r][0] = '─'
continue
}
p := row[x-1]
switch {
case r == p:
grid[r][x] = '─'
case r < p: // value went up (row index decreased toward top)
grid[r][x] = '╭'
grid[p][x] = '╯'
for y := r + 1; y < p; y++ {
grid[y][x] = '│'
}
default: // r > p, value went down
grid[p][x] = '╮'
grid[r][x] = '╰'
for y := p + 1; y < r; y++ {
grid[y][x] = '│'
}
}
}
// Y axis tick labels.
ticks := gpuNiceTicks(mn, mx, height/2)
tickAtRow := make(map[int]string)
labelWidth := 4
for _, t := range ticks {
r := int(math.Round((mx - t) / (mx - mn) * float64(height)))
if r < 0 || r > height {
continue
}
s := gpuFormatTick(t)
tickAtRow[r] = s
if len(s) > labelWidth {
labelWidth = len(s)
}
}
var b strings.Builder
for r := 0; r <= height; r++ {
label := tickAtRow[r]
fmt.Fprintf(&b, "%*s", labelWidth, label)
switch {
case label != "":
b.WriteRune('┤')
case r == height:
b.WriteRune('┼')
default:
b.WriteRune('│')
}
b.WriteString(color)
b.WriteString(string(grid[r]))
b.WriteString(ansiReset)
b.WriteRune('\n')
}
// Bottom axis.
b.WriteString(strings.Repeat(" ", labelWidth))
b.WriteRune('└')
b.WriteString(strings.Repeat("─", w))
b.WriteRune('\n')
// Caption centered under the chart.
if caption != "" {
total := labelWidth + 1 + w
if pad := (total - len(caption)) / 2; pad > 0 {
b.WriteString(strings.Repeat(" ", pad))
}
b.WriteString(caption)
b.WriteRune('\n')
}
return b.String()
}
func extractGPUField(rows []GPUMetricRow, fn func(GPUMetricRow) float64) []float64 {
v := make([]float64, len(rows))
for i, r := range rows {
v[i] = fn(r)
}
return v
}
// gpuDownsample averages vals into w buckets (or nearest-neighbor upsamples if len(vals) < w).
func gpuDownsample(vals []float64, w int) []float64 {
n := len(vals)
if n == 0 {
return make([]float64, w)
}
result := make([]float64, w)
if n >= w {
counts := make([]int, w)
for i, v := range vals {
bucket := i * w / n
if bucket >= w {
bucket = w - 1
}
result[bucket] += v
counts[bucket]++
}
for i := range result {
if counts[i] > 0 {
result[i] /= float64(counts[i])
}
}
} else {
// Nearest-neighbour upsample.
for i := range result {
src := i * (n - 1) / (w - 1)
if src >= n {
src = n - 1
}
result[i] = vals[src]
}
}
return result
}
func gpuMinMax(vals []float64) (float64, float64) {
if len(vals) == 0 {
return 0, 1
}
mn, mx := vals[0], vals[0]
for _, v := range vals[1:] {
if v < mn {
mn = v
}
if v > mx {
mx = v
}
}
return mn, mx
}
func gpuNiceTicks(mn, mx float64, targetCount int) []float64 {
if mn == mx {
mn -= 1
mx += 1
}
r := mx - mn
step := math.Pow(10, math.Floor(math.Log10(r/float64(targetCount))))
for _, f := range []float64{1, 2, 5, 10} {
if r/(f*step) <= float64(targetCount)*1.5 {
step = f * step
break
}
}
lo := math.Floor(mn/step) * step
hi := math.Ceil(mx/step) * step
var ticks []float64
for v := lo; v <= hi+step*0.001; v += step {
ticks = append(ticks, math.Round(v*1e9)/1e9)
}
return ticks
}
func gpuFormatTick(v float64) string {
if v == math.Trunc(v) {
return strconv.Itoa(int(v))
}
return strconv.FormatFloat(v, 'f', 1, 64)
}

View File

@@ -0,0 +1,164 @@
package platform
import (
"os"
"os/exec"
"strings"
"time"
"bee/audit/internal/schema"
)
var runtimeRequiredTools = []string{
"dmidecode",
"lspci",
"lsblk",
"smartctl",
"nvme",
"ipmitool",
"nvidia-smi",
"nvidia-bug-report.sh",
"bee-gpu-stress",
"dhclient",
"mount",
}
var runtimeTrackedServices = []string{
"bee-network",
"bee-nvidia",
"bee-preflight",
"bee-audit",
"bee-web",
"bee-sshsetup",
}
func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
checkedAt := time.Now().UTC().Format(time.RFC3339)
health := schema.RuntimeHealth{
Status: "OK",
CheckedAt: checkedAt,
ExportDir: strings.TrimSpace(exportDir),
}
if health.ExportDir != "" {
if err := os.MkdirAll(health.ExportDir, 0755); err != nil {
health.Status = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "export_dir_unavailable",
Severity: "critical",
Description: err.Error(),
})
}
}
interfaces, err := s.ListInterfaces()
if err == nil {
health.Interfaces = make([]schema.RuntimeInterface, 0, len(interfaces))
hasIPv4 := false
missingIPv4 := false
for _, iface := range interfaces {
outcome := "no_offer"
if len(iface.IPv4) > 0 {
outcome = "lease_acquired"
hasIPv4 = true
} else if strings.EqualFold(iface.State, "DOWN") {
outcome = "link_down"
} else {
missingIPv4 = true
}
health.Interfaces = append(health.Interfaces, schema.RuntimeInterface{
Name: iface.Name,
State: iface.State,
IPv4: iface.IPv4,
Outcome: outcome,
})
}
switch {
case hasIPv4 && !missingIPv4:
health.NetworkStatus = "OK"
case hasIPv4:
health.NetworkStatus = "PARTIAL"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_partial",
Severity: "warning",
Description: "At least one interface did not obtain IPv4 connectivity.",
})
default:
health.NetworkStatus = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_failed",
Severity: "warning",
Description: "No physical interface obtained IPv4 connectivity.",
})
}
}
for _, tool := range s.CheckTools(runtimeRequiredTools) {
health.Tools = append(health.Tools, schema.RuntimeToolStatus{
Name: tool.Name,
Path: tool.Path,
OK: tool.OK,
})
if !tool.OK {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "tool_missing",
Severity: "warning",
Description: "Required tool missing: " + tool.Name,
})
}
}
for _, name := range runtimeTrackedServices {
health.Services = append(health.Services, schema.RuntimeServiceStatus{
Name: name,
Status: s.ServiceState(name),
})
}
lsmodText := commandText("lsmod")
health.DriverReady = strings.Contains(lsmodText, "nvidia ")
if !health.DriverReady {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_kernel_module_missing",
Severity: "warning",
Description: "NVIDIA kernel module is not loaded.",
})
}
if health.DriverReady && !strings.Contains(lsmodText, "nvidia_modeset") {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_modeset_failed",
Severity: "warning",
Description: "nvidia-modeset is not loaded; display/CUDA stack may be partial.",
})
}
if out, err := exec.Command("nvidia-smi", "-L").CombinedOutput(); err == nil && strings.TrimSpace(string(out)) != "" {
health.DriverReady = true
}
health.CUDAReady = false
if lookErr := exec.Command("sh", "-c", "command -v bee-gpu-stress >/dev/null 2>&1").Run(); lookErr == nil {
out, err := exec.Command("bee-gpu-stress", "--seconds", "1", "--size-mb", "1").CombinedOutput()
if err == nil {
health.CUDAReady = true
} else if strings.Contains(strings.ToLower(string(out)), "cuda_error_system_not_ready") {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "cuda_runtime_not_ready",
Severity: "warning",
Description: "CUDA runtime is not ready for GPU SAT.",
})
}
}
if health.Status != "FAILED" && len(health.Issues) > 0 {
health.Status = "PARTIAL"
}
return health, nil
}
func commandText(name string, args ...string) string {
raw, err := exec.Command(name, args...).CombinedOutput()
if err != nil && len(raw) == 0 {
return ""
}
return string(raw)
}

View File

@@ -3,60 +3,477 @@ package platform
import ( import (
"archive/tar" "archive/tar"
"compress/gzip" "compress/gzip"
"context"
"fmt" "fmt"
"io" "io"
"os" "os"
"os/exec" "os/exec"
"path/filepath" "path/filepath"
"sort"
"strconv"
"strings" "strings"
"time" "time"
) )
// NvidiaGPU holds basic GPU info from nvidia-smi.
type NvidiaGPU struct {
Index int
Name string
MemoryMB int
}
// ListNvidiaGPUs returns GPUs visible to nvidia-smi.
func (s *System) ListNvidiaGPUs() ([]NvidiaGPU, error) {
out, err := exec.Command("nvidia-smi",
"--query-gpu=index,name,memory.total",
"--format=csv,noheader,nounits").Output()
if err != nil {
return nil, fmt.Errorf("nvidia-smi: %w", err)
}
var gpus []NvidiaGPU
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.SplitN(line, ", ", 3)
if len(parts) != 3 {
continue
}
idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
if err != nil {
continue
}
memMB, _ := strconv.Atoi(strings.TrimSpace(parts[2]))
gpus = append(gpus, NvidiaGPU{
Index: idx,
Name: strings.TrimSpace(parts[1]),
MemoryMB: memMB,
})
}
return gpus, nil
}
func (s *System) RunNvidiaAcceptancePack(baseDir string) (string, error) { func (s *System) RunNvidiaAcceptancePack(baseDir string) (string, error) {
return runAcceptancePack(baseDir, "gpu-nvidia", nvidiaSATJobs())
}
// RunNvidiaAcceptancePackWithOptions runs the NVIDIA SAT with explicit duration,
// GPU memory size, and GPU index selection. ctx cancellation kills the running job.
func (s *System) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (string, error) {
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia", nvidiaSATJobsWithOptions(durationSec, sizeMB, gpuIndices))
}
func (s *System) RunMemoryAcceptancePack(baseDir string) (string, error) {
sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128)
passes := envInt("BEE_MEMTESTER_PASSES", 1)
return runAcceptancePack(baseDir, "memory", []satJob{
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
{name: "02-memtester.log", cmd: []string{"memtester", fmt.Sprintf("%dM", sizeMB), fmt.Sprintf("%d", passes)}},
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
})
}
func (s *System) RunStorageAcceptancePack(baseDir string) (string, error) {
if baseDir == "" { if baseDir == "" {
baseDir = "/var/log/bee-sat" baseDir = "/var/log/bee-sat"
} }
ts := time.Now().UTC().Format("20060102-150405") ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, "gpu-nvidia-"+ts) runDir := filepath.Join(baseDir, "storage-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil { if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err return "", err
} }
verboseLog := filepath.Join(runDir, "verbose.log")
type job struct { devices, err := listStorageDevices()
name string if err != nil {
cmd []string return "", err
}
jobs := []job{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output", filepath.Join(runDir, "nvidia-bug-report.log")}},
} }
sort.Strings(devices)
var summary strings.Builder var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339)) fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
for _, job := range jobs { if len(devices) == 0 {
out, err := exec.Command(job.cmd[0], job.cmd[1:]...).CombinedOutput() fmt.Fprintln(&summary, "devices=0")
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil { stats.Unsupported++
return "", writeErr } else {
} fmt.Fprintf(&summary, "devices=%d\n", len(devices))
rc := 0
if err != nil {
rc = 1
}
fmt.Fprintf(&summary, "%s_rc=%d\n", strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log"), rc)
} }
for index, devPath := range devices {
prefix := fmt.Sprintf("%02d-%s", index+1, filepath.Base(devPath))
commands := storageSATCommands(devPath)
for cmdIndex, job := range commands {
name := fmt.Sprintf("%s-%02d-%s.log", prefix, cmdIndex+1, job.name)
out, err := runSATCommand(verboseLog, job.name, job.cmd)
if writeErr := os.WriteFile(filepath.Join(runDir, name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := filepath.Base(devPath) + "_" + strings.ReplaceAll(job.name, "-", "_")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil { if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err return "", err
} }
archive := filepath.Join(baseDir, "storage-"+ts+".tar.gz")
archive := filepath.Join(baseDir, "gpu-nvidia-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil { if err := createTarGz(archive, runDir); err != nil {
return "", err return "", err
} }
return archive, nil return archive, nil
} }
type satJob struct {
name string
cmd []string
env []string // extra env vars (appended to os.Environ)
collectGPU bool // collect GPU metrics via nvidia-smi while this job runs
gpuIndices []int // GPU indices to collect metrics for (empty = all)
}
type satStats struct {
OK int
Failed int
Unsupported int
}
func nvidiaSATJobs() []satJob {
seconds := envInt("BEE_GPU_STRESS_SECONDS", 5)
sizeMB := envInt("BEE_GPU_STRESS_SIZE_MB", 64)
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
{name: "05-bee-gpu-stress.log", cmd: []string{"bee-gpu-stress", "--seconds", fmt.Sprintf("%d", seconds), "--size-mb", fmt.Sprintf("%d", sizeMB)}},
}
}
func runAcceptancePack(baseDir, prefix string, jobs []satJob) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, prefix+"-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
for _, job := range jobs {
cmd := make([]string, 0, len(job.cmd))
for _, arg := range job.cmd {
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
}
out, err := runSATCommand(verboseLog, job.name, cmd)
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
func nvidiaSATJobsWithOptions(durationSec, sizeMB int, gpuIndices []int) []satJob {
var env []string
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
env = []string{"CUDA_VISIBLE_DEVICES=" + strings.Join(ids, ",")}
}
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
{
name: "05-bee-gpu-stress.log",
cmd: []string{"bee-gpu-stress", "--seconds", strconv.Itoa(durationSec), "--size-mb", strconv.Itoa(sizeMB)},
env: env,
collectGPU: true,
gpuIndices: gpuIndices,
},
}
}
func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, prefix+"-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
for _, job := range jobs {
if ctx.Err() != nil {
break
}
cmd := make([]string, 0, len(job.cmd))
for _, arg := range job.cmd {
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
}
var out []byte
var err error
if job.collectGPU {
out, err = runSATCommandWithMetrics(ctx, verboseLog, job.name, cmd, job.env, job.gpuIndices, runDir)
} else {
out, err = runSATCommandCtx(ctx, verboseLog, job.name, cmd, job.env)
}
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
func runSATCommandCtx(ctx context.Context, verboseLog, name string, cmd []string, env []string) ([]byte, error) {
start := time.Now().UTC()
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
"cmd: "+strings.Join(cmd, " "),
)
c := exec.CommandContext(ctx, cmd[0], cmd[1:]...)
if len(env) > 0 {
c.Env = append(os.Environ(), env...)
}
out, err := c.CombinedOutput()
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
fmt.Sprintf("rc: %d", rc),
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return out, err
}
func listStorageDevices() ([]string, error) {
out, err := exec.Command("lsblk", "-dn", "-o", "NAME,TYPE").Output()
if err != nil {
return nil, err
}
var devices []string
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
fields := strings.Fields(strings.TrimSpace(line))
if len(fields) != 2 || fields[1] != "disk" {
continue
}
devices = append(devices, "/dev/"+fields[0])
}
return devices, nil
}
func storageSATCommands(devPath string) []satJob {
if strings.Contains(filepath.Base(devPath), "nvme") {
return []satJob{
{name: "nvme-id-ctrl", cmd: []string{"nvme", "id-ctrl", devPath, "-o", "json"}},
{name: "nvme-smart-log", cmd: []string{"nvme", "smart-log", devPath, "-o", "json"}},
{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "--start", "1"}},
}
}
return []satJob{
{name: "smartctl-health", cmd: []string{"smartctl", "-H", "-A", devPath}},
{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", "short", devPath}},
}
}
func (s *satStats) Add(status string) {
switch status {
case "OK":
s.OK++
case "UNSUPPORTED":
s.Unsupported++
default:
s.Failed++
}
}
func (s satStats) Overall() string {
if s.Failed > 0 {
return "FAILED"
}
if s.Unsupported > 0 {
return "PARTIAL"
}
return "OK"
}
func writeSATStats(summary *strings.Builder, stats satStats) {
fmt.Fprintf(summary, "overall_status=%s\n", stats.Overall())
fmt.Fprintf(summary, "job_ok=%d\n", stats.OK)
fmt.Fprintf(summary, "job_failed=%d\n", stats.Failed)
fmt.Fprintf(summary, "job_unsupported=%d\n", stats.Unsupported)
}
func classifySATResult(name string, out []byte, err error) (string, int) {
rc := 0
if err != nil {
rc = 1
}
if err == nil {
return "OK", rc
}
text := strings.ToLower(string(out))
if strings.Contains(text, "unsupported") ||
strings.Contains(text, "not supported") ||
strings.Contains(text, "invalid opcode") ||
strings.Contains(text, "unknown command") ||
strings.Contains(text, "not implemented") ||
strings.Contains(text, "not available") ||
strings.Contains(text, "cuda_error_system_not_ready") ||
strings.Contains(text, "no such device") ||
(strings.Contains(name, "self-test") && strings.Contains(text, "aborted")) {
return "UNSUPPORTED", rc
}
return "FAILED", rc
}
func runSATCommand(verboseLog, name string, cmd []string) ([]byte, error) {
start := time.Now().UTC()
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
"cmd: "+strings.Join(cmd, " "),
)
out, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
fmt.Sprintf("rc: %d", rc),
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return out, err
}
// runSATCommandWithMetrics runs a command while collecting GPU metrics in the background.
// On completion it writes gpu-metrics.csv and gpu-metrics.html into runDir.
func runSATCommandWithMetrics(ctx context.Context, verboseLog, name string, cmd []string, env []string, gpuIndices []int, runDir string) ([]byte, error) {
stopCh := make(chan struct{})
doneCh := make(chan struct{})
var metricRows []GPUMetricRow
start := time.Now()
go func() {
defer close(doneCh)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-stopCh:
return
case <-ticker.C:
samples, err := sampleGPUMetrics(gpuIndices)
if err != nil {
continue
}
elapsed := time.Since(start).Seconds()
for i := range samples {
samples[i].ElapsedSec = elapsed
}
metricRows = append(metricRows, samples...)
}
}
}()
out, err := runSATCommandCtx(ctx, verboseLog, name, cmd, env)
close(stopCh)
<-doneCh
if len(metricRows) > 0 {
_ = WriteGPUMetricsCSV(filepath.Join(runDir, "gpu-metrics.csv"), metricRows)
_ = WriteGPUMetricsHTML(filepath.Join(runDir, "gpu-metrics.html"), metricRows)
chart := RenderGPUTerminalChart(metricRows)
_ = os.WriteFile(filepath.Join(runDir, "gpu-metrics-term.txt"), []byte(chart), 0644)
}
return out, err
}
func appendSATVerboseLog(path string, lines ...string) {
if path == "" {
return
}
f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
if err != nil {
return
}
defer f.Close()
for _, line := range lines {
_, _ = io.WriteString(f, line+"\n")
}
}
func envInt(name string, fallback int) int {
raw := strings.TrimSpace(os.Getenv(name))
if raw == "" {
return fallback
}
value, err := strconv.Atoi(raw)
if err != nil || value <= 0 {
return fallback
}
return value
}
func createTarGz(dst, srcDir string) error { func createTarGz(dst, srcDir string) error {
file, err := os.Create(dst) file, err := os.Create(dst)
if err != nil { if err != nil {

View File

@@ -0,0 +1,93 @@
package platform
import (
"errors"
"os"
"testing"
)
func TestStorageSATCommands(t *testing.T) {
t.Parallel()
nvme := storageSATCommands("/dev/nvme0n1")
if len(nvme) != 3 || nvme[2].cmd[0] != "nvme" {
t.Fatalf("unexpected nvme commands: %#v", nvme)
}
sata := storageSATCommands("/dev/sda")
if len(sata) != 2 || sata[0].cmd[0] != "smartctl" {
t.Fatalf("unexpected sata commands: %#v", sata)
}
}
func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
t.Parallel()
jobs := nvidiaSATJobs()
if len(jobs) != 5 {
t.Fatalf("jobs=%d want 5", len(jobs))
}
if got := jobs[4].cmd[0]; got != "bee-gpu-stress" {
t.Fatalf("gpu stress command=%q want bee-gpu-stress", got)
}
if got := jobs[3].cmd[1]; got != "--output-file" {
t.Fatalf("bug report flag=%q want --output-file", got)
}
}
func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
t.Setenv("BEE_GPU_STRESS_SECONDS", "9")
t.Setenv("BEE_GPU_STRESS_SIZE_MB", "96")
jobs := nvidiaSATJobs()
got := jobs[4].cmd
want := []string{"bee-gpu-stress", "--seconds", "9", "--size-mb", "96"}
if len(got) != len(want) {
t.Fatalf("cmd len=%d want %d", len(got), len(want))
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("cmd[%d]=%q want %q", i, got[i], want[i])
}
}
}
func TestEnvIntFallback(t *testing.T) {
os.Unsetenv("BEE_MEMTESTER_SIZE_MB")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
t.Fatalf("got %d want 123", got)
}
t.Setenv("BEE_MEMTESTER_SIZE_MB", "bad")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
t.Fatalf("got %d want 123", got)
}
t.Setenv("BEE_MEMTESTER_SIZE_MB", "256")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 256 {
t.Fatalf("got %d want 256", got)
}
}
func TestClassifySATResult(t *testing.T) {
tests := []struct {
name string
job string
out string
err error
status string
}{
{name: "ok", job: "memtester", out: "done", err: nil, status: "OK"},
{name: "unsupported", job: "smartctl-self-test-short", out: "Self-test not supported", err: errors.New("rc 1"), status: "UNSUPPORTED"},
{name: "failed", job: "bee-gpu-stress", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
{name: "cuda not ready", job: "bee-gpu-stress", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got, _ := classifySATResult(tt.job, []byte(tt.out), tt.err)
if got != tt.status {
t.Fatalf("status=%q want %q", got, tt.status)
}
})
}
}

View File

@@ -0,0 +1,122 @@
package platform
import (
"encoding/json"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
)
var techDumpFixedCommands = []struct {
Name string
Args []string
File string
}{
{Name: "dmidecode", Args: []string{"-t", "0"}, File: "dmidecode-type0.txt"},
{Name: "dmidecode", Args: []string{"-t", "1"}, File: "dmidecode-type1.txt"},
{Name: "dmidecode", Args: []string{"-t", "2"}, File: "dmidecode-type2.txt"},
{Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"},
{Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"},
{Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"},
{Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"},
{Name: "sensors", Args: []string{"-j"}, File: "sensors.json"},
{Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"},
{Name: "ipmitool", Args: []string{"sdr"}, File: "ipmitool-sdr.txt"},
{Name: "nvidia-smi", Args: []string{"-q"}, File: "nvidia-smi-q.txt"},
{Name: "nvidia-smi", Args: []string{"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown", "--format=csv,noheader,nounits"}, File: "nvidia-smi-query.csv"},
{Name: "nvme", Args: []string{"list", "-o", "json"}, File: "nvme-list.json"},
}
type lsblkDumpRoot struct {
Blockdevices []struct {
Name string `json:"name"`
Type string `json:"type"`
} `json:"blockdevices"`
}
type nvmeDumpRoot struct {
Devices []struct {
DevicePath string `json:"DevicePath"`
} `json:"Devices"`
}
func (s *System) CaptureTechnicalDump(baseDir string) error {
if err := os.MkdirAll(baseDir, 0755); err != nil {
return err
}
for _, cmd := range techDumpFixedCommands {
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
}
for _, dev := range lsblkDumpDevices(filepath.Join(baseDir, "lsblk.json")) {
writeCommandDump(filepath.Join(baseDir, "smartctl-"+sanitizeDumpName(dev)+".json"), "smartctl", "-j", "-a", "/dev/"+dev)
}
for _, dev := range nvmeDumpDevices(filepath.Join(baseDir, "nvme-list.json")) {
writeCommandDump(filepath.Join(baseDir, "nvme-id-ctrl-"+sanitizeDumpName(dev)+".json"), "nvme", "id-ctrl", dev, "-o", "json")
writeCommandDump(filepath.Join(baseDir, "nvme-smart-log-"+sanitizeDumpName(dev)+".json"), "nvme", "smart-log", dev, "-o", "json")
}
return nil
}
func writeCommandDump(path, name string, args ...string) {
out, err := exec.Command(name, args...).CombinedOutput()
if err != nil && len(out) == 0 {
return
}
_ = os.WriteFile(path, out, 0644)
}
func lsblkDumpDevices(path string) []string {
raw, err := os.ReadFile(path)
if err != nil {
return nil
}
var root lsblkDumpRoot
if err := json.Unmarshal(raw, &root); err != nil {
return nil
}
var devices []string
for _, dev := range root.Blockdevices {
if dev.Type == "disk" && strings.TrimSpace(dev.Name) != "" {
devices = append(devices, strings.TrimSpace(dev.Name))
}
}
sort.Strings(devices)
return devices
}
func nvmeDumpDevices(path string) []string {
raw, err := os.ReadFile(path)
if err != nil {
return nil
}
var root nvmeDumpRoot
if err := json.Unmarshal(raw, &root); err != nil {
return nil
}
seen := map[string]bool{}
var devices []string
for _, dev := range root.Devices {
name := strings.TrimSpace(dev.DevicePath)
if name == "" || seen[name] {
continue
}
seen[name] = true
devices = append(devices, name)
}
sort.Strings(devices)
return devices
}
func sanitizeDumpName(value string) string {
value = strings.TrimSpace(value)
value = strings.TrimPrefix(value, "/dev/")
value = strings.ReplaceAll(value, "/", "_")
if value == "" {
return "unknown"
}
return value
}

View File

@@ -0,0 +1,48 @@
package platform
import (
"os"
"path/filepath"
"reflect"
"testing"
)
func TestLSBLKDumpDevices(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "lsblk.json")
if err := os.WriteFile(path, []byte(`{"blockdevices":[{"name":"sda","type":"disk"},{"name":"sda1","type":"part"},{"name":"nvme0n1","type":"disk"}]}`), 0644); err != nil {
t.Fatalf("write lsblk fixture: %v", err)
}
got := lsblkDumpDevices(path)
want := []string{"nvme0n1", "sda"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("lsblkDumpDevices=%v want %v", got, want)
}
}
func TestNVMEDumpDevices(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "nvme-list.json")
if err := os.WriteFile(path, []byte(`{"Devices":[{"DevicePath":"/dev/nvme1n1"},{"DevicePath":"/dev/nvme0n1"},{"DevicePath":"/dev/nvme1n1"}]}`), 0644); err != nil {
t.Fatalf("write nvme fixture: %v", err)
}
got := nvmeDumpDevices(path)
want := []string{"/dev/nvme0n1", "/dev/nvme1n1"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("nvmeDumpDevices=%v want %v", got, want)
}
}
func TestSanitizeDumpName(t *testing.T) {
t.Parallel()
if got := sanitizeDumpName("/dev/nvme0n1"); got != "nvme0n1" {
t.Fatalf("sanitizeDumpName=%q want nvme0n1", got)
}
}

View File

@@ -5,14 +5,52 @@ package schema
// HardwareIngestRequest is the top-level output document produced by `bee audit`. // HardwareIngestRequest is the top-level output document produced by `bee audit`.
// It is accepted as-is by the core /api/ingest/hardware endpoint. // It is accepted as-is by the core /api/ingest/hardware endpoint.
type HardwareIngestRequest struct { type HardwareIngestRequest struct {
Filename *string `json:"filename"` Filename *string `json:"filename,omitempty"`
SourceType *string `json:"source_type"` SourceType *string `json:"source_type,omitempty"`
Protocol *string `json:"protocol"` Protocol *string `json:"protocol,omitempty"`
TargetHost string `json:"target_host"` TargetHost *string `json:"target_host,omitempty"`
CollectedAt string `json:"collected_at"` CollectedAt string `json:"collected_at"`
Runtime *RuntimeHealth `json:"runtime,omitempty"`
Hardware HardwareSnapshot `json:"hardware"` Hardware HardwareSnapshot `json:"hardware"`
} }
type RuntimeHealth struct {
Status string `json:"status"`
CheckedAt string `json:"checked_at"`
ExportDir string `json:"export_dir,omitempty"`
DriverReady bool `json:"driver_ready,omitempty"`
CUDAReady bool `json:"cuda_ready,omitempty"`
NetworkStatus string `json:"network_status,omitempty"`
Issues []RuntimeIssue `json:"issues,omitempty"`
Tools []RuntimeToolStatus `json:"tools,omitempty"`
Services []RuntimeServiceStatus `json:"services,omitempty"`
Interfaces []RuntimeInterface `json:"interfaces,omitempty"`
}
type RuntimeIssue struct {
Code string `json:"code"`
Severity string `json:"severity,omitempty"`
Description string `json:"description"`
}
type RuntimeToolStatus struct {
Name string `json:"name"`
Path string `json:"path,omitempty"`
OK bool `json:"ok"`
}
type RuntimeServiceStatus struct {
Name string `json:"name"`
Status string `json:"status"`
}
type RuntimeInterface struct {
Name string `json:"name"`
State string `json:"state,omitempty"`
IPv4 []string `json:"ipv4,omitempty"`
Outcome string `json:"outcome,omitempty"`
}
type HardwareSnapshot struct { type HardwareSnapshot struct {
Board HardwareBoard `json:"board"` Board HardwareBoard `json:"board"`
Firmware []HardwareFirmwareRecord `json:"firmware,omitempty"` Firmware []HardwareFirmwareRecord `json:"firmware,omitempty"`
@@ -21,14 +59,33 @@ type HardwareSnapshot struct {
Storage []HardwareStorage `json:"storage,omitempty"` Storage []HardwareStorage `json:"storage,omitempty"`
PCIeDevices []HardwarePCIeDevice `json:"pcie_devices,omitempty"` PCIeDevices []HardwarePCIeDevice `json:"pcie_devices,omitempty"`
PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"` PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"`
Sensors *HardwareSensors `json:"sensors,omitempty"`
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
}
type HardwareHealthSummary struct {
Status string `json:"status"`
Warnings []string `json:"warnings,omitempty"`
Failures []string `json:"failures,omitempty"`
StorageWarn int `json:"storage_warn,omitempty"`
StorageFail int `json:"storage_fail,omitempty"`
PCIeWarn int `json:"pcie_warn,omitempty"`
PCIeFail int `json:"pcie_fail,omitempty"`
PSUWarn int `json:"psu_warn,omitempty"`
PSUFail int `json:"psu_fail,omitempty"`
MemoryWarn int `json:"memory_warn,omitempty"`
MemoryFail int `json:"memory_fail,omitempty"`
EmptyDIMMs int `json:"empty_dimms,omitempty"`
MissingPSUs int `json:"missing_psus,omitempty"`
CollectedAt string `json:"collected_at,omitempty"`
} }
type HardwareBoard struct { type HardwareBoard struct {
Manufacturer *string `json:"manufacturer"` Manufacturer *string `json:"manufacturer,omitempty"`
ProductName *string `json:"product_name"` ProductName *string `json:"product_name,omitempty"`
SerialNumber string `json:"serial_number"` SerialNumber string `json:"serial_number"`
PartNumber *string `json:"part_number"` PartNumber *string `json:"part_number,omitempty"`
UUID *string `json:"uuid"` UUID *string `json:"uuid,omitempty"`
} }
type HardwareFirmwareRecord struct { type HardwareFirmwareRecord struct {
@@ -37,77 +94,196 @@ type HardwareFirmwareRecord struct {
} }
type HardwareCPU struct { type HardwareCPU struct {
Socket *int `json:"socket"` HardwareComponentStatus
Model *string `json:"model"` Socket *int `json:"socket,omitempty"`
Manufacturer *string `json:"manufacturer"` Model *string `json:"model,omitempty"`
Status *string `json:"status"` Manufacturer *string `json:"manufacturer,omitempty"`
SerialNumber *string `json:"serial_number"` SerialNumber *string `json:"serial_number,omitempty"`
Firmware *string `json:"firmware"` Firmware *string `json:"firmware,omitempty"`
Cores *int `json:"cores"` Cores *int `json:"cores,omitempty"`
Threads *int `json:"threads"` Threads *int `json:"threads,omitempty"`
FrequencyMHz *int `json:"frequency_mhz"` FrequencyMHz *int `json:"frequency_mhz,omitempty"`
MaxFrequencyMHz *int `json:"max_frequency_mhz"` MaxFrequencyMHz *int `json:"max_frequency_mhz,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
Throttled *bool `json:"throttled,omitempty"`
CorrectableErrorCount *int64 `json:"correctable_error_count,omitempty"`
UncorrectableErrorCount *int64 `json:"uncorrectable_error_count,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
Present *bool `json:"present,omitempty"`
} }
type HardwareMemory struct { type HardwareMemory struct {
Slot *string `json:"slot"` HardwareComponentStatus
Location *string `json:"location"` Slot *string `json:"slot,omitempty"`
Present *bool `json:"present"` Location *string `json:"location,omitempty"`
SizeMB *int `json:"size_mb"` Present *bool `json:"present,omitempty"`
Type *string `json:"type"` SizeMB *int `json:"size_mb,omitempty"`
MaxSpeedMHz *int `json:"max_speed_mhz"` Type *string `json:"type,omitempty"`
CurrentSpeedMHz *int `json:"current_speed_mhz"` MaxSpeedMHz *int `json:"max_speed_mhz,omitempty"`
Manufacturer *string `json:"manufacturer"` CurrentSpeedMHz *int `json:"current_speed_mhz,omitempty"`
SerialNumber *string `json:"serial_number"` Manufacturer *string `json:"manufacturer,omitempty"`
PartNumber *string `json:"part_number"` SerialNumber *string `json:"serial_number,omitempty"`
Status *string `json:"status"` PartNumber *string `json:"part_number,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
CorrectableECCErrorCount *int64 `json:"correctable_ecc_error_count,omitempty"`
UncorrectableECCErrorCount *int64 `json:"uncorrectable_ecc_error_count,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
SpareBlocksRemainingPct *float64 `json:"spare_blocks_remaining_pct,omitempty"`
PerformanceDegraded *bool `json:"performance_degraded,omitempty"`
DataLossDetected *bool `json:"data_loss_detected,omitempty"`
} }
type HardwareStorage struct { type HardwareStorage struct {
Slot *string `json:"slot"` HardwareComponentStatus
Type *string `json:"type"` Slot *string `json:"slot,omitempty"`
Model *string `json:"model"` Type *string `json:"type,omitempty"`
SizeGB *int `json:"size_gb"` Model *string `json:"model,omitempty"`
SerialNumber *string `json:"serial_number"` SizeGB *int `json:"size_gb,omitempty"`
Manufacturer *string `json:"manufacturer"` SerialNumber *string `json:"serial_number,omitempty"`
Firmware *string `json:"firmware"` Manufacturer *string `json:"manufacturer,omitempty"`
Interface *string `json:"interface"` Firmware *string `json:"firmware,omitempty"`
Present *bool `json:"present"` Interface *string `json:"interface,omitempty"`
Status *string `json:"status"` Present *bool `json:"present,omitempty"`
Telemetry map[string]any `json:"telemetry,omitempty"` TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerOnHours *int64 `json:"power_on_hours,omitempty"`
PowerCycles *int64 `json:"power_cycles,omitempty"`
UnsafeShutdowns *int64 `json:"unsafe_shutdowns,omitempty"`
MediaErrors *int64 `json:"media_errors,omitempty"`
ErrorLogEntries *int64 `json:"error_log_entries,omitempty"`
WrittenBytes *int64 `json:"written_bytes,omitempty"`
ReadBytes *int64 `json:"read_bytes,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
AvailableSparePct *float64 `json:"available_spare_pct,omitempty"`
ReallocatedSectors *int64 `json:"reallocated_sectors,omitempty"`
CurrentPendingSectors *int64 `json:"current_pending_sectors,omitempty"`
OfflineUncorrectable *int64 `json:"offline_uncorrectable,omitempty"`
Telemetry map[string]any `json:"-"`
} }
type HardwarePCIeDevice struct { type HardwarePCIeDevice struct {
Slot *string `json:"slot"` HardwareComponentStatus
VendorID *int `json:"vendor_id"` Slot *string `json:"slot,omitempty"`
DeviceID *int `json:"device_id"` VendorID *int `json:"vendor_id,omitempty"`
BDF *string `json:"bdf"` DeviceID *int `json:"device_id,omitempty"`
DeviceClass *string `json:"device_class"` NUMANode *int `json:"numa_node,omitempty"`
Manufacturer *string `json:"manufacturer"` TemperatureC *float64 `json:"temperature_c,omitempty"`
Model *string `json:"model"` PowerW *float64 `json:"power_w,omitempty"`
LinkWidth *int `json:"link_width"` LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LinkSpeed *string `json:"link_speed"` LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
MaxLinkWidth *int `json:"max_link_width"` ECCCorrectedTotal *int64 `json:"ecc_corrected_total,omitempty"`
MaxLinkSpeed *string `json:"max_link_speed"` ECCUncorrectedTotal *int64 `json:"ecc_uncorrected_total,omitempty"`
SerialNumber *string `json:"serial_number"` HWSlowdown *bool `json:"hw_slowdown,omitempty"`
Firmware *string `json:"firmware"` BatteryChargePct *float64 `json:"battery_charge_pct,omitempty"`
Present *bool `json:"present"` BatteryHealthPct *float64 `json:"battery_health_pct,omitempty"`
Status *string `json:"status"` BatteryTemperatureC *float64 `json:"battery_temperature_c,omitempty"`
Telemetry map[string]any `json:"telemetry,omitempty"` BatteryVoltageV *float64 `json:"battery_voltage_v,omitempty"`
BatteryReplaceRequired *bool `json:"battery_replace_required,omitempty"`
SFPTemperatureC *float64 `json:"sfp_temperature_c,omitempty"`
SFPTXPowerDBM *float64 `json:"sfp_tx_power_dbm,omitempty"`
SFPRXPowerDBM *float64 `json:"sfp_rx_power_dbm,omitempty"`
SFPVoltageV *float64 `json:"sfp_voltage_v,omitempty"`
SFPBiasMA *float64 `json:"sfp_bias_ma,omitempty"`
BDF *string `json:"-"`
DeviceClass *string `json:"device_class,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
Model *string `json:"model,omitempty"`
LinkWidth *int `json:"link_width,omitempty"`
LinkSpeed *string `json:"link_speed,omitempty"`
MaxLinkWidth *int `json:"max_link_width,omitempty"`
MaxLinkSpeed *string `json:"max_link_speed,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Firmware *string `json:"firmware,omitempty"`
MacAddresses []string `json:"mac_addresses,omitempty"`
Present *bool `json:"present,omitempty"`
Telemetry map[string]any `json:"-"`
} }
type HardwarePowerSupply struct { type HardwarePowerSupply struct {
Slot *string `json:"slot"` HardwareComponentStatus
Present *bool `json:"present"` Slot *string `json:"slot,omitempty"`
Model *string `json:"model"` Present *bool `json:"present,omitempty"`
Vendor *string `json:"vendor"` Model *string `json:"model,omitempty"`
WattageW *int `json:"wattage_w"` Vendor *string `json:"vendor,omitempty"`
SerialNumber *string `json:"serial_number"` WattageW *int `json:"wattage_w,omitempty"`
PartNumber *string `json:"part_number"` SerialNumber *string `json:"serial_number,omitempty"`
Firmware *string `json:"firmware"` PartNumber *string `json:"part_number,omitempty"`
Status *string `json:"status"` Firmware *string `json:"firmware,omitempty"`
InputType *string `json:"input_type"` InputType *string `json:"input_type,omitempty"`
InputPowerW *float64 `json:"input_power_w"` InputPowerW *float64 `json:"input_power_w,omitempty"`
OutputPowerW *float64 `json:"output_power_w"` OutputPowerW *float64 `json:"output_power_w,omitempty"`
InputVoltage *float64 `json:"input_voltage"` InputVoltage *float64 `json:"input_voltage,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
}
type HardwareComponentStatus struct {
Status *string `json:"status,omitempty"`
StatusCheckedAt *string `json:"status_checked_at,omitempty"`
StatusChangedAt *string `json:"status_changed_at,omitempty"`
StatusHistory []HardwareStatusHistory `json:"status_history,omitempty"`
ErrorDescription *string `json:"error_description,omitempty"`
ManufacturedYearWeek *string `json:"manufactured_year_week,omitempty"`
}
type HardwareStatusHistory struct {
Status string `json:"status"`
ChangedAt string `json:"changed_at"`
Details *string `json:"details,omitempty"`
}
type HardwareSensors struct {
Fans []HardwareFanSensor `json:"fans,omitempty"`
Power []HardwarePowerSensor `json:"power,omitempty"`
Temperatures []HardwareTemperatureSensor `json:"temperatures,omitempty"`
Other []HardwareOtherSensor `json:"other,omitempty"`
}
type HardwareFanSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
RPM *int `json:"rpm,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwarePowerSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
VoltageV *float64 `json:"voltage_v,omitempty"`
CurrentA *float64 `json:"current_a,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareTemperatureSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
Celsius *float64 `json:"celsius,omitempty"`
ThresholdWarningCelsius *float64 `json:"threshold_warning_celsius,omitempty"`
ThresholdCriticalCelsius *float64 `json:"threshold_critical_celsius,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareOtherSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
Value *float64 `json:"value,omitempty"`
Unit *string `json:"unit,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareEventLog struct {
Source string `json:"source"`
EventTime *string `json:"event_time,omitempty"`
Severity *string `json:"severity,omitempty"`
MessageID *string `json:"message_id,omitempty"`
Message string `json:"message"`
ComponentRef *string `json:"component_ref,omitempty"`
Fingerprint *string `json:"fingerprint,omitempty"`
IsActive *bool `json:"is_active,omitempty"`
RawPayload map[string]any `json:"raw_payload,omitempty"`
} }

View File

@@ -0,0 +1,46 @@
package schema
import (
"encoding/json"
"strings"
"testing"
)
func TestHardwareSnapshotMarshalsNewContractFields(t *testing.T) {
week := "2024-W07"
eventTime := "2026-03-15T14:03:11Z"
message := "Correctable ECC error threshold exceeded"
payload := HardwareIngestRequest{
CollectedAt: "2026-03-15T15:00:00Z",
Hardware: HardwareSnapshot{
Board: HardwareBoard{SerialNumber: "SRV-001"},
CPUs: []HardwareCPU{
{
HardwareComponentStatus: HardwareComponentStatus{
ManufacturedYearWeek: &week,
},
},
},
EventLogs: []HardwareEventLog{
{
Source: "bmc",
EventTime: &eventTime,
Message: message,
},
},
},
}
data, err := json.Marshal(payload)
if err != nil {
t.Fatalf("marshal: %v", err)
}
text := string(data)
if !strings.Contains(text, `"manufactured_year_week":"2024-W07"`) {
t.Fatalf("missing manufactured_year_week: %s", text)
}
if !strings.Contains(text, `"event_logs":[{"source":"bmc","event_time":"2026-03-15T14:03:11Z","message":"Correctable ECC error threshold exceeded"}]`) {
t.Fatalf("missing event_logs payload: %s", text)
}
}

View File

@@ -29,6 +29,7 @@ func (m model) updateStaticForm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
m.formFields[3].Value, m.formFields[3].Value,
}) })
m.busy = true m.busy = true
m.busyTitle = "Static IPv4: " + m.selectedIface
return m, func() tea.Msg { return m, func() tea.Msg {
result, err := m.app.SetStaticIPv4Result(cfg) result, err := m.app.SetStaticIPv4Result(cfg)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
@@ -59,26 +60,49 @@ func (m model) updateConfirm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
case "esc": case "esc":
m.screen = m.confirmCancelTarget() m.screen = m.confirmCancelTarget()
m.cursor = 0 m.cursor = 0
m.pendingAction = actionNone
return m, nil return m, nil
case "enter": case "enter":
if m.cursor == 1 { if m.cursor == 1 {
m.screen = m.confirmCancelTarget() m.screen = m.confirmCancelTarget()
m.cursor = 0 m.cursor = 0
m.pendingAction = actionNone
return m, nil return m, nil
} }
m.busy = true m.busy = true
switch m.pendingAction { switch m.pendingAction {
case actionExportAudit: case actionExportAudit:
m.busyTitle = "Export audit"
target := *m.selectedTarget target := *m.selectedTarget
return m, func() tea.Msg { return m, func() tea.Msg {
result, err := m.app.ExportLatestAuditResult(target) result, err := m.app.ExportLatestAuditResult(target)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
} }
case actionExportBundle:
m.busyTitle = "Export support bundle"
target := *m.selectedTarget
return m, func() tea.Msg {
result, err := m.app.ExportSupportBundleResult(target)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
}
case actionRunNvidiaSAT: case actionRunNvidiaSAT:
m.busyTitle = "NVIDIA SAT"
return m, func() tea.Msg { return m, func() tea.Msg {
result, err := m.app.RunNvidiaAcceptancePackResult("") result, err := m.app.RunNvidiaAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
} }
case actionRunMemorySAT:
m.busyTitle = "Memory SAT"
return m, func() tea.Msg {
result, err := m.app.RunMemoryAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
}
case actionRunStorageSAT:
m.busyTitle = "Storage SAT"
return m, func() tea.Msg {
result, err := m.app.RunStorageAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
}
} }
case "ctrl+c": case "ctrl+c":
return m, tea.Quit return m, tea.Quit
@@ -90,7 +114,13 @@ func (m model) confirmCancelTarget() screen {
switch m.pendingAction { switch m.pendingAction {
case actionExportAudit: case actionExportAudit:
return screenExportTargets return screenExportTargets
case actionExportBundle:
return screenExportTargets
case actionRunNvidiaSAT: case actionRunNvidiaSAT:
fallthrough
case actionRunMemorySAT:
fallthrough
case actionRunStorageSAT:
return screenAcceptance return screenAcceptance
default: default:
return screenMain return screenMain

View File

@@ -23,3 +23,20 @@ type exportTargetsMsg struct {
targets []platform.RemovableTarget targets []platform.RemovableTarget
err error err error
} }
type bannerMsg struct {
text string
}
type nvidiaGPUsMsg struct {
gpus []platform.NvidiaGPU
err error
}
type nvtopClosedMsg struct{}
type nvidiaSATDoneMsg struct {
title string
body string
err error
}

View File

@@ -3,12 +3,19 @@ package tui
import tea "github.com/charmbracelet/bubbletea" import tea "github.com/charmbracelet/bubbletea"
func (m model) handleAcceptanceMenu() (tea.Model, tea.Cmd) { func (m model) handleAcceptanceMenu() (tea.Model, tea.Cmd) {
if m.cursor == 1 { if m.cursor == 3 {
m.screen = screenMain m.screen = screenMain
m.cursor = 0 m.cursor = 0
return m, nil return m, nil
} }
m.pendingAction = actionRunNvidiaSAT switch m.cursor {
case 0:
return m.enterNvidiaSATSetup()
case 1:
m.pendingAction = actionRunMemorySAT
case 2:
m.pendingAction = actionRunStorageSAT
}
m.screen = screenConfirm m.screen = screenConfirm
return m, nil return m, nil
} }

View File

@@ -4,11 +4,17 @@ import tea "github.com/charmbracelet/bubbletea"
func (m model) handleExportTargetsMenu() (tea.Model, tea.Cmd) { func (m model) handleExportTargetsMenu() (tea.Model, tea.Cmd) {
if len(m.targets) == 0 { if len(m.targets) == 0 {
return m, resultCmd("Export audit", "No removable filesystems found", nil, screenMain) title := "Export audit"
if m.pendingAction == actionExportBundle {
title = "Export support bundle"
}
return m, resultCmd(title, "No removable filesystems found", nil, screenMain)
} }
target := m.targets[m.cursor] target := m.targets[m.cursor]
m.selectedTarget = &target m.selectedTarget = &target
m.pendingAction = actionExportAudit if m.pendingAction == actionNone {
m.pendingAction = actionExportAudit
}
m.screen = screenConfirm m.screen = screenConfirm
return m, nil return m, nil
} }

View File

@@ -12,6 +12,7 @@ func (m model) handleMainMenu() (tea.Model, tea.Cmd) {
return m, nil return m, nil
case 1: case 1:
m.busy = true m.busy = true
m.busyTitle = "Services"
return m, func() tea.Msg { return m, func() tea.Msg {
services, err := m.app.ListBeeServices() services, err := m.app.ListBeeServices()
return servicesMsg{services: services, err: err} return servicesMsg{services: services, err: err}
@@ -22,29 +23,63 @@ func (m model) handleMainMenu() (tea.Model, tea.Cmd) {
return m, nil return m, nil
case 3: case 3:
m.busy = true m.busy = true
m.busyTitle = "Run audit"
return m, func() tea.Msg { return m, func() tea.Msg {
result, err := m.app.RunAuditNow(m.runtimeMode) result, err := m.app.RunAuditNow(m.runtimeMode)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
} }
case 4: case 4:
m.busy = true m.busy = true
m.busyTitle = "Run self-check"
return m, func() tea.Msg {
result, err := m.app.RunRuntimePreflightResult()
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
}
case 5:
m.pendingAction = actionExportAudit
m.busy = true
m.busyTitle = "Export audit"
return m, func() tea.Msg { return m, func() tea.Msg {
targets, err := m.app.ListRemovableTargets() targets, err := m.app.ListRemovableTargets()
return exportTargetsMsg{targets: targets, err: err} return exportTargetsMsg{targets: targets, err: err}
} }
case 5: case 6:
m.pendingAction = actionExportBundle
m.busy = true m.busy = true
m.busyTitle = "Export support bundle"
return m, func() tea.Msg { return m, func() tea.Msg {
result := m.app.ToolCheckResult([]string{"dmidecode", "smartctl", "nvme", "ipmitool", "lspci", "bee", "nvidia-smi", "dhclient", "lsblk", "mount"}) targets, err := m.app.ListRemovableTargets()
return exportTargetsMsg{targets: targets, err: err}
}
case 7:
m.busy = true
m.busyTitle = "Required tools"
return m, func() tea.Msg {
result := m.app.ToolCheckResult([]string{"dmidecode", "smartctl", "nvme", "ipmitool", "lspci", "ethtool", "bee", "nvidia-smi", "bee-gpu-stress", "memtester", "dhclient", "lsblk", "mount"})
return resultMsg{title: result.Title, body: result.Body, back: screenMain} return resultMsg{title: result.Title, body: result.Body, back: screenMain}
} }
case 6: case 8:
m.busy = true m.busy = true
m.busyTitle = "Health summary"
return m, func() tea.Msg {
result := m.app.HealthSummaryResult()
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
}
case 9:
m.busy = true
m.busyTitle = "Runtime issues"
return m, func() tea.Msg {
result := m.app.RuntimeHealthResult()
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
}
case 10:
m.busy = true
m.busyTitle = "Audit logs"
return m, func() tea.Msg { return m, func() tea.Msg {
result := m.app.AuditLogTailResult() result := m.app.AuditLogTailResult()
return resultMsg{title: result.Title, body: result.Body, back: screenMain} return resultMsg{title: result.Title, body: result.Body, back: screenMain}
} }
case 7: case 11:
return m, tea.Quit return m, tea.Quit
} }
return m, nil return m, nil

View File

@@ -10,12 +10,14 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
switch m.cursor { switch m.cursor {
case 0: case 0:
m.busy = true m.busy = true
m.busyTitle = "Network status"
return m, func() tea.Msg { return m, func() tea.Msg {
result, err := m.app.NetworkStatus() result, err := m.app.NetworkStatus()
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
} }
case 1: case 1:
m.busy = true m.busy = true
m.busyTitle = "DHCP all interfaces"
return m, func() tea.Msg { return m, func() tea.Msg {
result, err := m.app.DHCPAllResult() result, err := m.app.DHCPAllResult()
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
@@ -23,6 +25,7 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
case 2: case 2:
m.pendingAction = actionDHCPOne m.pendingAction = actionDHCPOne
m.busy = true m.busy = true
m.busyTitle = "Interfaces"
return m, func() tea.Msg { return m, func() tea.Msg {
ifaces, err := m.app.ListInterfaces() ifaces, err := m.app.ListInterfaces()
return interfacesMsg{ifaces: ifaces, err: err} return interfacesMsg{ifaces: ifaces, err: err}
@@ -30,6 +33,7 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
case 3: case 3:
m.pendingAction = actionStaticIPv4 m.pendingAction = actionStaticIPv4
m.busy = true m.busy = true
m.busyTitle = "Interfaces"
return m, func() tea.Msg { return m, func() tea.Msg {
ifaces, err := m.app.ListInterfaces() ifaces, err := m.app.ListInterfaces()
return interfacesMsg{ifaces: ifaces, err: err} return interfacesMsg{ifaces: ifaces, err: err}
@@ -50,6 +54,7 @@ func (m model) handleInterfacePickMenu() (tea.Model, tea.Cmd) {
switch m.pendingAction { switch m.pendingAction {
case actionDHCPOne: case actionDHCPOne:
m.busy = true m.busy = true
m.busyTitle = "DHCP on " + m.selectedIface
return m, func() tea.Msg { return m, func() tea.Msg {
result, err := m.app.DHCPOneResult(m.selectedIface) result, err := m.app.DHCPOneResult(m.selectedIface)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}

View File

@@ -0,0 +1,238 @@
package tui
import (
"context"
"fmt"
"os/exec"
"strings"
"bee/audit/internal/platform"
tea "github.com/charmbracelet/bubbletea"
)
var nvidiaDurationOptions = []struct {
label string
seconds int
}{
{"10 minutes", 600},
{"1 hour", 3600},
{"8 hours", 28800},
{"24 hours", 86400},
}
// enterNvidiaSATSetup resets the setup screen and starts loading GPU list.
func (m model) enterNvidiaSATSetup() (tea.Model, tea.Cmd) {
m.screen = screenNvidiaSATSetup
m.nvidiaGPUs = nil
m.nvidiaGPUSel = nil
m.nvidiaDurIdx = 0
m.nvidiaSATCursor = 0
m.busy = true
m.busyTitle = "NVIDIA SAT"
return m, func() tea.Msg {
gpus, err := m.app.ListNvidiaGPUs()
return nvidiaGPUsMsg{gpus: gpus, err: err}
}
}
// handleNvidiaGPUsMsg processes the GPU list response.
func (m model) handleNvidiaGPUsMsg(msg nvidiaGPUsMsg) (tea.Model, tea.Cmd) {
m.busy = false
m.busyTitle = ""
if msg.err != nil {
m.title = "NVIDIA SAT"
m.body = fmt.Sprintf("Failed to list GPUs: %v", msg.err)
m.prevScreen = screenAcceptance
m.screen = screenOutput
return m, nil
}
m.nvidiaGPUs = msg.gpus
m.nvidiaGPUSel = make([]bool, len(msg.gpus))
for i := range m.nvidiaGPUSel {
m.nvidiaGPUSel[i] = true // all selected by default
}
m.nvidiaSATCursor = 0
return m, nil
}
// updateNvidiaSATSetup handles keys on the setup screen.
func (m model) updateNvidiaSATSetup(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
numDur := len(nvidiaDurationOptions)
numGPU := len(m.nvidiaGPUs)
totalItems := numDur + numGPU + 2 // +2: Start, Cancel
switch msg.String() {
case "up", "k":
if m.nvidiaSATCursor > 0 {
m.nvidiaSATCursor--
}
case "down", "j":
if m.nvidiaSATCursor < totalItems-1 {
m.nvidiaSATCursor++
}
case " ":
switch {
case m.nvidiaSATCursor < numDur:
m.nvidiaDurIdx = m.nvidiaSATCursor
case m.nvidiaSATCursor < numDur+numGPU:
i := m.nvidiaSATCursor - numDur
m.nvidiaGPUSel[i] = !m.nvidiaGPUSel[i]
}
case "enter":
startIdx := numDur + numGPU
cancelIdx := startIdx + 1
switch {
case m.nvidiaSATCursor < numDur:
m.nvidiaDurIdx = m.nvidiaSATCursor
case m.nvidiaSATCursor < startIdx:
i := m.nvidiaSATCursor - numDur
m.nvidiaGPUSel[i] = !m.nvidiaGPUSel[i]
case m.nvidiaSATCursor == startIdx:
return m.startNvidiaSAT()
case m.nvidiaSATCursor == cancelIdx:
m.screen = screenAcceptance
m.cursor = 0
}
case "esc":
m.screen = screenAcceptance
m.cursor = 0
case "ctrl+c", "q":
return m, tea.Quit
}
return m, nil
}
// startNvidiaSAT launches the SAT and nvtop.
func (m model) startNvidiaSAT() (tea.Model, tea.Cmd) {
var selectedGPUs []platform.NvidiaGPU
for i, sel := range m.nvidiaGPUSel {
if sel {
selectedGPUs = append(selectedGPUs, m.nvidiaGPUs[i])
}
}
if len(selectedGPUs) == 0 {
selectedGPUs = m.nvidiaGPUs // fallback: use all if none explicitly selected
}
sizeMB := 0
for _, g := range selectedGPUs {
if sizeMB == 0 || g.MemoryMB < sizeMB {
sizeMB = g.MemoryMB
}
}
if sizeMB == 0 {
sizeMB = 64
}
var gpuIndices []int
for _, g := range selectedGPUs {
gpuIndices = append(gpuIndices, g.Index)
}
durationSec := nvidiaDurationOptions[m.nvidiaDurIdx].seconds
ctx, cancel := context.WithCancel(context.Background())
m.nvidiaSATCancel = cancel
m.nvidiaSATAborted = false
m.screen = screenNvidiaSATRunning
m.nvidiaSATCursor = 0
satCmd := func() tea.Msg {
result, err := m.app.RunNvidiaAcceptancePackWithOptions(ctx, "", durationSec, sizeMB, gpuIndices)
return nvidiaSATDoneMsg{title: result.Title, body: result.Body, err: err}
}
nvtopPath, lookErr := exec.LookPath("nvtop")
if lookErr != nil {
// nvtop not available: just run the SAT, show running screen
return m, satCmd
}
return m, tea.Batch(
satCmd,
tea.ExecProcess(exec.Command(nvtopPath), func(_ error) tea.Msg {
return nvtopClosedMsg{}
}),
)
}
// updateNvidiaSATRunning handles keys on the running screen.
func (m model) updateNvidiaSATRunning(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch msg.String() {
case "o", "O":
nvtopPath, err := exec.LookPath("nvtop")
if err != nil {
return m, nil
}
return m, tea.ExecProcess(exec.Command(nvtopPath), func(_ error) tea.Msg {
return nvtopClosedMsg{}
})
case "a", "A":
if m.nvidiaSATCancel != nil {
m.nvidiaSATCancel()
m.nvidiaSATCancel = nil
}
m.nvidiaSATAborted = true
m.screen = screenAcceptance
m.cursor = 0
case "ctrl+c":
return m, tea.Quit
}
return m, nil
}
// renderNvidiaSATSetup renders the setup screen.
func renderNvidiaSATSetup(m model) string {
var b strings.Builder
fmt.Fprintln(&b, "NVIDIA SAT")
fmt.Fprintln(&b)
fmt.Fprintln(&b, "Duration:")
for i, opt := range nvidiaDurationOptions {
radio := "( )"
if i == m.nvidiaDurIdx {
radio = "(*)"
}
prefix := " "
if m.nvidiaSATCursor == i {
prefix = "> "
}
fmt.Fprintf(&b, "%s%s %s\n", prefix, radio, opt.label)
}
fmt.Fprintln(&b)
if len(m.nvidiaGPUs) == 0 {
fmt.Fprintln(&b, "GPUs: (none detected)")
} else {
fmt.Fprintln(&b, "GPUs:")
for i, gpu := range m.nvidiaGPUs {
check := "[ ]"
if m.nvidiaGPUSel[i] {
check = "[x]"
}
prefix := " "
if m.nvidiaSATCursor == len(nvidiaDurationOptions)+i {
prefix = "> "
}
fmt.Fprintf(&b, "%s%s %d: %s (%d MB)\n", prefix, check, gpu.Index, gpu.Name, gpu.MemoryMB)
}
}
fmt.Fprintln(&b)
startIdx := len(nvidiaDurationOptions) + len(m.nvidiaGPUs)
startPfx := " "
cancelPfx := " "
if m.nvidiaSATCursor == startIdx {
startPfx = "> "
}
if m.nvidiaSATCursor == startIdx+1 {
cancelPfx = "> "
}
fmt.Fprintf(&b, "%sStart\n", startPfx)
fmt.Fprintf(&b, "%sCancel\n", cancelPfx)
fmt.Fprintln(&b)
b.WriteString("[↑/↓] move [space] toggle [enter] select [esc] cancel\n")
return b.String()
}
// renderNvidiaSATRunning renders the running screen.
func renderNvidiaSATRunning() string {
return "NVIDIA SAT\n\nTest is running...\n\n[o] Open nvtop [a] Abort test [ctrl+c] quit\n"
}

View File

@@ -8,7 +8,7 @@ import (
func (m model) handleServicesMenu() (tea.Model, tea.Cmd) { func (m model) handleServicesMenu() (tea.Model, tea.Cmd) {
if len(m.services) == 0 { if len(m.services) == 0 {
return m, resultCmd("bee services", "No bee-* services found", nil, screenMain) return m, resultCmd("Services", "No bee-* services found.", nil, screenMain)
} }
m.selectedService = m.services[m.cursor] m.selectedService = m.services[m.cursor]
m.screen = screenServiceAction m.screen = screenServiceAction
@@ -25,22 +25,23 @@ func (m model) handleServiceActionMenu() (tea.Model, tea.Cmd) {
} }
m.busy = true m.busy = true
m.busyTitle = "service: " + m.selectedService
return m, func() tea.Msg { return m, func() tea.Msg {
switch action { switch action {
case "status": case "Status":
result, err := m.app.ServiceStatusResult(m.selectedService) result, err := m.app.ServiceStatusResult(m.selectedService)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "restart": case "Restart":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceRestart) result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceRestart)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "start": case "Start":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStart) result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStart)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "stop": case "Stop":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStop) result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStop)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction} return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
default: default:
return resultMsg{title: "service", body: "unknown action", back: screenServiceAction} return resultMsg{title: "Service", body: "Unknown action.", back: screenServiceAction}
} }
} }
} }

View File

@@ -1,6 +1,7 @@
package tui package tui
import ( import (
"strings"
"testing" "testing"
"bee/audit/internal/app" "bee/audit/internal/app"
@@ -153,7 +154,8 @@ func TestMainMenuAsyncActionsSetBusy(t *testing.T) {
{name: "run audit", cursor: 3}, {name: "run audit", cursor: 3},
{name: "export", cursor: 4}, {name: "export", cursor: 4},
{name: "check tools", cursor: 5}, {name: "check tools", cursor: 5},
{name: "log tail", cursor: 6}, {name: "health summary", cursor: 6},
{name: "log tail", cursor: 7},
} }
for _, test := range tests { for _, test := range tests {
@@ -177,6 +179,24 @@ func TestMainMenuAsyncActionsSetBusy(t *testing.T) {
} }
} }
func TestMainViewIncludesBanner(t *testing.T) {
t.Parallel()
m := newTestModel()
m.banner = "System: Test Server | S/N ABC123\nIP: 10.0.0.10"
view := m.View()
if !strings.Contains(view, "System: Test Server | S/N ABC123") {
t.Fatalf("view missing system banner:\n%s", view)
}
if !strings.Contains(view, "IP: 10.0.0.10") {
t.Fatalf("view missing ip banner:\n%s", view)
}
if !strings.Contains(view, "Select action") {
t.Fatalf("view missing menu subtitle:\n%s", view)
}
}
func TestEscapeNavigation(t *testing.T) { func TestEscapeNavigation(t *testing.T) {
t.Parallel() t.Parallel()
@@ -235,7 +255,7 @@ func TestOutputScreenReturnsToPreviousScreen(t *testing.T) {
} }
} }
func TestAcceptanceConfirmFlow(t *testing.T) { func TestAcceptanceNvidiaSATOpensSetup(t *testing.T) {
t.Parallel() t.Parallel()
m := newTestModel() m := newTestModel()
@@ -245,23 +265,45 @@ func TestAcceptanceConfirmFlow(t *testing.T) {
next, cmd := m.handleAcceptanceMenu() next, cmd := m.handleAcceptanceMenu()
got := next.(model) got := next.(model)
if cmd != nil { if cmd == nil {
t.Fatal("expected nil cmd") t.Fatal("expected non-nil cmd (GPU list loader)")
} }
if got.screen != screenConfirm { if got.screen != screenNvidiaSATSetup {
t.Fatalf("screen=%q want %q", got.screen, screenConfirm) t.Fatalf("screen=%q want %q", got.screen, screenNvidiaSATSetup)
}
if got.pendingAction != actionRunNvidiaSAT {
t.Fatalf("pendingAction=%q want %q", got.pendingAction, actionRunNvidiaSAT)
} }
next, _ = got.updateConfirm(tea.KeyMsg{Type: tea.KeyEsc}) // esc from setup returns to acceptance
next, _ = got.updateNvidiaSATSetup(tea.KeyMsg{Type: tea.KeyEsc})
got = next.(model) got = next.(model)
if got.screen != screenAcceptance { if got.screen != screenAcceptance {
t.Fatalf("screen after esc=%q want %q", got.screen, screenAcceptance) t.Fatalf("screen after esc=%q want %q", got.screen, screenAcceptance)
} }
} }
func TestAcceptanceMenuMapsNewTargets(t *testing.T) {
t.Parallel()
tests := []struct {
cursor int
want actionKind
}{
{cursor: 1, want: actionRunMemorySAT},
{cursor: 2, want: actionRunStorageSAT},
}
for _, test := range tests {
m := newTestModel()
m.screen = screenAcceptance
m.cursor = test.cursor
next, _ := m.handleAcceptanceMenu()
got := next.(model)
if got.pendingAction != test.want {
t.Fatalf("cursor=%d pendingAction=%q want %q", test.cursor, got.pendingAction, test.want)
}
}
}
func TestExportTargetSelectionOpensConfirm(t *testing.T) { func TestExportTargetSelectionOpensConfirm(t *testing.T) {
t.Parallel() t.Parallel()
@@ -347,3 +389,197 @@ func TestConfirmCancelTarget(t *testing.T) {
t.Fatalf("default cancel target=%q want %q", got, screenMain) t.Fatalf("default cancel target=%q want %q", got, screenMain)
} }
} }
func TestViewMainMenuRendersSelectedItem(t *testing.T) {
t.Parallel()
m := newTestModel()
m.cursor = 1
view := m.View()
for _, want := range []string{
"bee",
"Select action",
" Network",
"> Services",
"Acceptance tests",
"[↑/↓] move [enter] select [esc] back [ctrl+c] quit",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewBusyStateIsMinimal(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
view := m.View()
want := "bee\n\nWorking...\n\n[ctrl+c] quit\n"
if view != want {
t.Fatalf("busy view mismatch\nwant:\n%s\ngot:\n%s", want, view)
}
}
func TestViewBusyStateUsesBusyTitle(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
m.busyTitle = "Export audit"
view := m.View()
for _, want := range []string{
"Export audit",
"Working...",
"[ctrl+c] quit",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewOutputScreenRendersBodyAndBackHint(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenOutput
m.title = "Run audit"
m.body = "audit output: /appdata/bee/export/bee-audit.json\n"
view := m.View()
for _, want := range []string{
"Run audit",
"audit output: /appdata/bee/export/bee-audit.json",
"[enter/esc] back [ctrl+c] quit",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewExportTargetsRendersDeviceMetadata(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenExportTargets
m.targets = []platform.RemovableTarget{
{
Device: "/dev/sdb1",
FSType: "vfat",
Size: "29G",
Label: "BEEUSB",
Mountpoint: "/media/bee",
},
}
view := m.View()
for _, want := range []string{
"Export audit",
"Select removable filesystem",
"> /dev/sdb1 [vfat 29G] label=BEEUSB mounted=/media/bee",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewStaticFormRendersFields(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenStaticForm
m.selectedIface = "enp1s0"
m.formFields = []formField{
{Label: "Address", Value: "192.0.2.10/24"},
{Label: "Gateway", Value: "192.0.2.1"},
{Label: "DNS", Value: "1.1.1.1"},
}
m.formIndex = 1
view := m.View()
for _, want := range []string{
"Static IPv4: enp1s0",
" Address: 192.0.2.10/24",
"> Gateway: 192.0.2.1",
" DNS: 1.1.1.1",
"[tab/↑/↓] move [enter] next/submit [backspace] delete [esc] cancel",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewConfirmScreenMatchesPendingExport(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenConfirm
m.pendingAction = actionExportAudit
m.selectedTarget = &platform.RemovableTarget{Device: "/dev/sdb1"}
view := m.View()
for _, want := range []string{
"Export audit",
"Copy latest audit JSON to /dev/sdb1?",
"> Confirm",
" Cancel",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestResultMsgClearsBusyAndPendingAction(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
m.busyTitle = "Export audit"
m.pendingAction = actionExportAudit
m.screen = screenConfirm
next, _ := m.Update(resultMsg{title: "Export audit", body: "done", back: screenMain})
got := next.(model)
if got.busy {
t.Fatal("busy=true want false")
}
if got.busyTitle != "" {
t.Fatalf("busyTitle=%q want empty", got.busyTitle)
}
if got.pendingAction != actionNone {
t.Fatalf("pendingAction=%q want empty", got.pendingAction)
}
}
func TestResultMsgErrorWithoutBodyFormatsCleanly(t *testing.T) {
t.Parallel()
m := newTestModel()
next, _ := m.Update(resultMsg{title: "Export audit", err: assertErr("boom"), back: screenMain})
got := next.(model)
if got.body != "ERROR: boom" {
t.Fatalf("body=%q want %q", got.body, "ERROR: boom")
}
}
type assertErr string
func (e assertErr) Error() string { return string(e) }

View File

@@ -1,9 +1,12 @@
package tui package tui
import ( import (
"context"
"bee/audit/internal/app" "bee/audit/internal/app"
"bee/audit/internal/platform" "bee/audit/internal/platform"
"bee/audit/internal/runtimeenv" "bee/audit/internal/runtimeenv"
"strings"
tea "github.com/charmbracelet/bubbletea" tea "github.com/charmbracelet/bubbletea"
) )
@@ -11,26 +14,31 @@ import (
type screen string type screen string
const ( const (
screenMain screen = "main" screenMain screen = "main"
screenNetwork screen = "network" screenNetwork screen = "network"
screenInterfacePick screen = "interface_pick" screenInterfacePick screen = "interface_pick"
screenServices screen = "services" screenServices screen = "services"
screenServiceAction screen = "service_action" screenServiceAction screen = "service_action"
screenAcceptance screen = "acceptance" screenAcceptance screen = "acceptance"
screenExportTargets screen = "export_targets" screenExportTargets screen = "export_targets"
screenOutput screen = "output" screenOutput screen = "output"
screenStaticForm screen = "static_form" screenStaticForm screen = "static_form"
screenConfirm screen = "confirm" screenConfirm screen = "confirm"
screenNvidiaSATSetup screen = "nvidia_sat_setup"
screenNvidiaSATRunning screen = "nvidia_sat_running"
) )
type actionKind string type actionKind string
const ( const (
actionNone actionKind = "" actionNone actionKind = ""
actionDHCPOne actionKind = "dhcp_one" actionDHCPOne actionKind = "dhcp_one"
actionStaticIPv4 actionKind = "static_ipv4" actionStaticIPv4 actionKind = "static_ipv4"
actionExportAudit actionKind = "export_audit" actionExportAudit actionKind = "export_audit"
actionRunNvidiaSAT actionKind = "run_nvidia_sat" actionExportBundle actionKind = "export_bundle"
actionRunNvidiaSAT actionKind = "run_nvidia_sat"
actionRunMemorySAT actionKind = "run_memory_sat"
actionRunStorageSAT actionKind = "run_storage_sat"
) )
type model struct { type model struct {
@@ -41,8 +49,10 @@ type model struct {
prevScreen screen prevScreen screen
cursor int cursor int
busy bool busy bool
busyTitle string
title string title string
body string body string
banner string
mainMenu []string mainMenu []string
networkMenu []string networkMenu []string
serviceMenu []string serviceMenu []string
@@ -57,6 +67,16 @@ type model struct {
formFields []formField formFields []formField
formIndex int formIndex int
// NVIDIA SAT setup
nvidiaGPUs []platform.NvidiaGPU
nvidiaGPUSel []bool
nvidiaDurIdx int // index into nvidiaDurationOptions
nvidiaSATCursor int
// NVIDIA SAT running
nvidiaSATCancel context.CancelFunc
nvidiaSATAborted bool
} }
type formField struct { type formField struct {
@@ -80,32 +100,38 @@ func newModel(application *app.App, runtimeMode runtimeenv.Mode) model {
runtimeMode: runtimeMode, runtimeMode: runtimeMode,
screen: screenMain, screen: screenMain,
mainMenu: []string{ mainMenu: []string{
"Network setup", "Network",
"bee service management", "Services",
"System acceptance tests", "Acceptance tests",
"Run audit now", "Run audit",
"Export audit to removable drive", "Run self-check",
"Check required tools", "Export audit",
"Show last audit log tail", "Export support bundle",
"Check tools",
"Show health summary",
"Show runtime issues",
"Show audit logs",
"Exit", "Exit",
}, },
networkMenu: []string{ networkMenu: []string{
"Show network status", "Show status",
"DHCP on all interfaces", "DHCP on all interfaces",
"DHCP on one interface", "DHCP on one interface",
"Set static IPv4 on one interface", "Set static IPv4",
"Back", "Back",
}, },
serviceMenu: []string{ serviceMenu: []string{
"status", "Status",
"restart", "Restart",
"start", "Start",
"stop", "Stop",
"back", "Back",
}, },
} }
} }
func (m model) Init() tea.Cmd { func (m model) Init() tea.Cmd {
return nil return func() tea.Msg {
return bannerMsg{text: strings.TrimSpace(m.app.MainBanner())}
}
} }

View File

@@ -21,12 +21,19 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
return m.updateKey(msg) return m.updateKey(msg)
case resultMsg: case resultMsg:
m.busy = false m.busy = false
m.busyTitle = ""
m.title = msg.title m.title = msg.title
if msg.err != nil { if msg.err != nil {
m.body = fmt.Sprintf("%s\n\nERROR: %v", strings.TrimSpace(msg.body), msg.err) body := strings.TrimSpace(msg.body)
if body == "" {
m.body = fmt.Sprintf("ERROR: %v", msg.err)
} else {
m.body = fmt.Sprintf("%s\n\nERROR: %v", body, msg.err)
}
} else { } else {
m.body = msg.body m.body = msg.body
} }
m.pendingAction = actionNone
if msg.back != "" { if msg.back != "" {
m.prevScreen = msg.back m.prevScreen = msg.back
} else { } else {
@@ -37,8 +44,9 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
return m, nil return m, nil
case servicesMsg: case servicesMsg:
m.busy = false m.busy = false
m.busyTitle = ""
if msg.err != nil { if msg.err != nil {
m.title = "bee services" m.title = "Services"
m.body = msg.err.Error() m.body = msg.err.Error()
m.prevScreen = screenMain m.prevScreen = screenMain
m.screen = screenOutput m.screen = screenOutput
@@ -50,6 +58,7 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
return m, nil return m, nil
case interfacesMsg: case interfacesMsg:
m.busy = false m.busy = false
m.busyTitle = ""
if msg.err != nil { if msg.err != nil {
m.title = "interfaces" m.title = "interfaces"
m.body = msg.err.Error() m.body = msg.err.Error()
@@ -63,6 +72,7 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
return m, nil return m, nil
case exportTargetsMsg: case exportTargetsMsg:
m.busy = false m.busy = false
m.busyTitle = ""
if msg.err != nil { if msg.err != nil {
m.title = "export" m.title = "export"
m.body = msg.err.Error() m.body = msg.err.Error()
@@ -74,6 +84,36 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
m.screen = screenExportTargets m.screen = screenExportTargets
m.cursor = 0 m.cursor = 0
return m, nil return m, nil
case bannerMsg:
m.banner = strings.TrimSpace(msg.text)
return m, nil
case nvidiaGPUsMsg:
return m.handleNvidiaGPUsMsg(msg)
case nvtopClosedMsg:
// nvtop closed — stay on running screen (or result if SAT is already done)
return m, nil
case nvidiaSATDoneMsg:
if m.nvidiaSATAborted {
return m, nil
}
if m.nvidiaSATCancel != nil {
m.nvidiaSATCancel()
m.nvidiaSATCancel = nil
}
m.prevScreen = screenAcceptance
m.screen = screenOutput
m.title = msg.title
if msg.err != nil {
body := strings.TrimSpace(msg.body)
if body == "" {
m.body = fmt.Sprintf("ERROR: %v", msg.err)
} else {
m.body = fmt.Sprintf("%s\n\nERROR: %v", body, msg.err)
}
} else {
m.body = msg.body
}
return m, nil
} }
return m, nil return m, nil
@@ -90,7 +130,11 @@ func (m model) updateKey(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
case screenServiceAction: case screenServiceAction:
return m.updateMenu(msg, len(m.serviceMenu), m.handleServiceActionMenu) return m.updateMenu(msg, len(m.serviceMenu), m.handleServiceActionMenu)
case screenAcceptance: case screenAcceptance:
return m.updateMenu(msg, 2, m.handleAcceptanceMenu) return m.updateMenu(msg, 4, m.handleAcceptanceMenu)
case screenNvidiaSATSetup:
return m.updateNvidiaSATSetup(msg)
case screenNvidiaSATRunning:
return m.updateNvidiaSATRunning(msg)
case screenExportTargets: case screenExportTargets:
return m.updateMenu(msg, len(m.targets), m.handleExportTargetsMenu) return m.updateMenu(msg, len(m.targets), m.handleExportTargetsMenu)
case screenInterfacePick: case screenInterfacePick:
@@ -101,6 +145,7 @@ func (m model) updateKey(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
m.screen = m.prevScreen m.screen = m.prevScreen
m.body = "" m.body = ""
m.title = "" m.title = ""
m.pendingAction = actionNone
return m, nil return m, nil
case "ctrl+c": case "ctrl+c":
return m, tea.Quit return m, tea.Quit

View File

@@ -11,23 +11,31 @@ import (
func (m model) View() string { func (m model) View() string {
if m.busy { if m.busy {
return "bee\n\nWorking...\n" title := "bee"
if m.busyTitle != "" {
title = m.busyTitle
}
return fmt.Sprintf("%s\n\nWorking...\n\n[ctrl+c] quit\n", title)
} }
switch m.screen { switch m.screen {
case screenMain: case screenMain:
return renderMenu("bee", "Select action", m.mainMenu, m.cursor) return renderMainMenu("bee", m.banner, "Select action", m.mainMenu, m.cursor)
case screenNetwork: case screenNetwork:
return renderMenu("Network", "Select action", m.networkMenu, m.cursor) return renderMenu("Network", "Select action", m.networkMenu, m.cursor)
case screenServices: case screenServices:
return renderMenu("bee services", "Select service", m.services, m.cursor) return renderMenu("Services", "Select service", m.services, m.cursor)
case screenServiceAction: case screenServiceAction:
items := make([]string, len(m.serviceMenu)) items := make([]string, len(m.serviceMenu))
copy(items, m.serviceMenu) copy(items, m.serviceMenu)
return renderMenu("Service: "+m.selectedService, "Select action", items, m.cursor) return renderMenu("Service: "+m.selectedService, "Select action", items, m.cursor)
case screenAcceptance: case screenAcceptance:
return renderMenu("System acceptance tests", "Select action", []string{"Run NVIDIA command pack", "Back"}, m.cursor) return renderMenu("Acceptance tests", "Select action", []string{"Run NVIDIA command pack", "Run memory test", "Run storage diagnostic pack", "Back"}, m.cursor)
case screenExportTargets: case screenExportTargets:
return renderMenu("Export audit", "Select removable filesystem", renderTargetItems(m.targets), m.cursor) title := "Export audit"
if m.pendingAction == actionExportBundle {
title = "Export support bundle"
}
return renderMenu(title, "Select removable filesystem", renderTargetItems(m.targets), m.cursor)
case screenInterfacePick: case screenInterfacePick:
return renderMenu("Interfaces", "Select interface", renderInterfaceItems(m.interfaces), m.cursor) return renderMenu("Interfaces", "Select interface", renderInterfaceItems(m.interfaces), m.cursor)
case screenStaticForm: case screenStaticForm:
@@ -35,6 +43,10 @@ func (m model) View() string {
case screenConfirm: case screenConfirm:
title, body := m.confirmBody() title, body := m.confirmBody()
return renderConfirm(title, body, m.cursor) return renderConfirm(title, body, m.cursor)
case screenNvidiaSATSetup:
return renderNvidiaSATSetup(m)
case screenNvidiaSATRunning:
return renderNvidiaSATRunning()
case screenOutput: case screenOutput:
return fmt.Sprintf("%s\n\n%s\n\n[enter/esc] back [ctrl+c] quit\n", m.title, strings.TrimSpace(m.body)) return fmt.Sprintf("%s\n\n%s\n\n[enter/esc] back [ctrl+c] quit\n", m.title, strings.TrimSpace(m.body))
default: default:
@@ -49,8 +61,17 @@ func (m model) confirmBody() (string, string) {
return "Export audit", "No target selected" return "Export audit", "No target selected"
} }
return "Export audit", fmt.Sprintf("Copy latest audit JSON to %s?", m.selectedTarget.Device) return "Export audit", fmt.Sprintf("Copy latest audit JSON to %s?", m.selectedTarget.Device)
case actionExportBundle:
if m.selectedTarget == nil {
return "Export support bundle", "No target selected"
}
return "Export support bundle", fmt.Sprintf("Copy support bundle archive to %s?", m.selectedTarget.Device)
case actionRunNvidiaSAT: case actionRunNvidiaSAT:
return "NVIDIA SAT", "Run NVIDIA acceptance command pack?" return "NVIDIA SAT", "Run NVIDIA acceptance command pack?"
case actionRunMemorySAT:
return "Memory SAT", "Run runtime memory test with memtester?"
case actionRunStorageSAT:
return "Storage SAT", "Run storage diagnostic pack and start short self-tests where supported?"
default: default:
return "Confirm", "Proceed?" return "Confirm", "Proceed?"
} }
@@ -101,6 +122,30 @@ func renderMenu(title, subtitle string, items []string, cursor int) string {
return body.String() return body.String()
} }
func renderMainMenu(title, banner, subtitle string, items []string, cursor int) string {
var body strings.Builder
fmt.Fprintf(&body, "%s\n\n", title)
if banner != "" {
body.WriteString(strings.TrimSpace(banner))
body.WriteString("\n\n")
}
body.WriteString(subtitle)
body.WriteString("\n\n")
if len(items) == 0 {
body.WriteString("(no items)\n")
} else {
for i, item := range items {
prefix := " "
if i == cursor {
prefix = "> "
}
fmt.Fprintf(&body, "%s%s\n", prefix, item)
}
}
body.WriteString("\n[↑/↓] move [enter] select [esc] back [ctrl+c] quit\n")
return body.String()
}
func renderForm(title string, fields []formField, idx int) string { func renderForm(title string, fields []formField, idx int) string {
var body strings.Builder var body strings.Builder
fmt.Fprintf(&body, "%s\n\n", title) fmt.Fprintf(&body, "%s\n\n", title)

View File

@@ -0,0 +1,240 @@
package webui
import (
"errors"
"fmt"
"html"
"net/http"
"net/url"
"os"
"path/filepath"
"sort"
"strings"
"bee/audit/internal/app"
"reanimator/chart/viewer"
chartweb "reanimator/chart/web"
)
const defaultTitle = "Bee Hardware Audit"
type HandlerOptions struct {
Title string
AuditPath string
ExportDir string
}
func NewHandler(opts HandlerOptions) http.Handler {
title := strings.TrimSpace(opts.Title)
if title == "" {
title = defaultTitle
}
auditPath := strings.TrimSpace(opts.AuditPath)
exportDir := strings.TrimSpace(opts.ExportDir)
if exportDir == "" {
exportDir = app.DefaultExportDir
}
mux := http.NewServeMux()
mux.Handle("GET /static/", http.StripPrefix("/static/", chartweb.Static()))
mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Cache-Control", "no-store")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("ok"))
})
mux.HandleFunc("GET /audit.json", func(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(auditPath)
if err != nil {
if errors.Is(err, os.ErrNotExist) {
http.Error(w, "audit snapshot not found", http.StatusNotFound)
return
}
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, _ = w.Write(data)
})
mux.HandleFunc("GET /export/support.tar.gz", func(w http.ResponseWriter, r *http.Request) {
archive, err := app.BuildSupportBundle(exportDir)
if err != nil {
http.Error(w, fmt.Sprintf("build support bundle: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/gzip")
w.Header().Set("Content-Disposition", fmt.Sprintf("attachment; filename=%q", filepath.Base(archive)))
http.ServeFile(w, r, archive)
})
mux.HandleFunc("GET /runtime-health.json", func(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(filepath.Join(exportDir, "runtime-health.json"))
if err != nil {
if errors.Is(err, os.ErrNotExist) {
http.Error(w, "runtime health not found", http.StatusNotFound)
return
}
http.Error(w, fmt.Sprintf("read runtime health: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, _ = w.Write(data)
})
mux.HandleFunc("GET /export/", func(w http.ResponseWriter, r *http.Request) {
body, err := renderExportIndex(exportDir)
if err != nil {
http.Error(w, fmt.Sprintf("render export index: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(body))
})
mux.HandleFunc("GET /export/file", func(w http.ResponseWriter, r *http.Request) {
rel := strings.TrimSpace(r.URL.Query().Get("path"))
if rel == "" {
http.Error(w, "path is required", http.StatusBadRequest)
return
}
clean := filepath.Clean(rel)
if clean == "." || strings.HasPrefix(clean, "..") {
http.Error(w, "invalid path", http.StatusBadRequest)
return
}
http.ServeFile(w, r, filepath.Join(exportDir, clean))
})
mux.HandleFunc("GET /viewer", func(w http.ResponseWriter, r *http.Request) {
snapshot, err := loadSnapshot(auditPath)
if err != nil && !errors.Is(err, os.ErrNotExist) {
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
return
}
html, err := viewer.RenderHTML(snapshot, title)
if err != nil {
http.Error(w, fmt.Sprintf("render snapshot: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write(html)
})
mux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
noticeTitle, noticeBody := runtimeNotice(filepath.Join(exportDir, "runtime-health.json"))
body := renderShellPage(title, noticeTitle, noticeBody)
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(body))
})
return mux
}
func ListenAndServe(addr string, opts HandlerOptions) error {
return http.ListenAndServe(addr, NewHandler(opts))
}
func loadSnapshot(path string) ([]byte, error) {
if strings.TrimSpace(path) == "" {
return nil, os.ErrNotExist
}
return os.ReadFile(path)
}
func runtimeNotice(path string) (string, string) {
health, err := app.ReadRuntimeHealth(path)
if err != nil {
return "Runtime Health", "No runtime health snapshot found yet."
}
body := fmt.Sprintf("Status: %s. Export dir: %s. Driver ready: %t. CUDA ready: %t. Network: %s. Export files: /export/",
firstNonEmpty(health.Status, "UNKNOWN"),
firstNonEmpty(health.ExportDir, app.DefaultExportDir),
health.DriverReady,
health.CUDAReady,
firstNonEmpty(health.NetworkStatus, "UNKNOWN"),
)
if len(health.Issues) > 0 {
body += " Issues: "
parts := make([]string, 0, len(health.Issues))
for _, issue := range health.Issues {
parts = append(parts, issue.Code)
}
body += strings.Join(parts, ", ")
}
return "Runtime Health", body
}
func renderExportIndex(exportDir string) (string, error) {
var entries []string
err := filepath.Walk(strings.TrimSpace(exportDir), func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if info.IsDir() {
return nil
}
rel, err := filepath.Rel(exportDir, path)
if err != nil {
return err
}
entries = append(entries, rel)
return nil
})
if err != nil && !errors.Is(err, os.ErrNotExist) {
return "", err
}
sort.Strings(entries)
var body strings.Builder
body.WriteString("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><title>Bee Export Files</title></head><body>")
body.WriteString("<h1>Bee Export Files</h1><ul>")
for _, entry := range entries {
body.WriteString("<li><a href=\"/export/file?path=" + url.QueryEscape(entry) + "\">" + html.EscapeString(entry) + "</a></li>")
}
if len(entries) == 0 {
body.WriteString("<li>No export files found.</li>")
}
body.WriteString("</ul></body></html>")
return body.String(), nil
}
func renderShellPage(title, noticeTitle, noticeBody string) string {
var body strings.Builder
body.WriteString("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">")
body.WriteString("<title>" + html.EscapeString(title) + "</title>")
body.WriteString(`<style>
body{margin:0;font-family:system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;background:#f4f1ea;color:#1b1b18}
.shell{min-height:100vh;display:grid;grid-template-rows:auto auto 1fr}
.header{padding:18px 20px 12px;border-bottom:1px solid rgba(0,0,0,.08);background:#fbf8f2}
.header h1{margin:0;font-size:24px}
.header p{margin:6px 0 0;color:#5a5a52}
.actions{display:flex;flex-wrap:wrap;gap:10px;padding:12px 20px;background:#fbf8f2}
.actions a{display:inline-block;text-decoration:none;padding:10px 14px;border-radius:999px;background:#1f5f4a;color:#fff;font-weight:600}
.actions a.secondary{background:#d8e5dd;color:#17372b}
.notice{margin:16px 20px 0;padding:14px 16px;border-radius:14px;background:#fff7df;border:1px solid #ead9a4}
.notice h2{margin:0 0 6px;font-size:16px}
.notice p{margin:0;color:#4f4a37}
.viewer-wrap{padding:16px 20px 20px}
.viewer{width:100%;height:calc(100vh - 170px);border:0;border-radius:18px;background:#fff;box-shadow:0 12px 40px rgba(0,0,0,.08)}
@media (max-width:720px){.viewer{height:calc(100vh - 240px)}}
</style></head><body><div class="shell">`)
body.WriteString("<header class=\"header\"><h1>" + html.EscapeString(title) + "</h1><p>Audit viewer with support bundle and raw export access.</p></header>")
body.WriteString("<nav class=\"actions\">")
body.WriteString("<a href=\"/export/support.tar.gz\">Download support bundle</a>")
body.WriteString("<a class=\"secondary\" href=\"/audit.json\">Open audit.json</a>")
body.WriteString("<a class=\"secondary\" href=\"/runtime-health.json\">Open runtime-health.json</a>")
body.WriteString("<a class=\"secondary\" href=\"/export/\">Browse export files</a>")
body.WriteString("</nav>")
if strings.TrimSpace(noticeTitle) != "" {
body.WriteString("<section class=\"notice\"><h2>" + html.EscapeString(noticeTitle) + "</h2><p>" + html.EscapeString(noticeBody) + "</p></section>")
}
body.WriteString("<main class=\"viewer-wrap\"><iframe class=\"viewer\" src=\"/viewer\" loading=\"eager\" referrerpolicy=\"same-origin\"></iframe></main>")
body.WriteString("</div></body></html>")
return body.String()
}
func firstNonEmpty(value, fallback string) string {
value = strings.TrimSpace(value)
if value == "" {
return fallback
}
return value
}

View File

@@ -0,0 +1,167 @@
package webui
import (
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"strings"
"testing"
)
func TestRootRendersShellWithIframe(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{
Title: "Bee Hardware Audit",
AuditPath: path,
ExportDir: exportDir,
})
first := httptest.NewRecorder()
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/", nil))
if first.Code != http.StatusOK {
t.Fatalf("first status=%d", first.Code)
}
if !strings.Contains(first.Body.String(), `iframe`) || !strings.Contains(first.Body.String(), `src="/viewer"`) {
t.Fatalf("first body missing iframe viewer: %s", first.Body.String())
}
if !strings.Contains(first.Body.String(), "/export/support.tar.gz") {
t.Fatalf("first body missing support bundle link: %s", first.Body.String())
}
if got := first.Header().Get("Cache-Control"); got != "no-store" {
t.Fatalf("first cache-control=%q", got)
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
t.Fatal(err)
}
second := httptest.NewRecorder()
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/", nil))
if second.Code != http.StatusOK {
t.Fatalf("second status=%d", second.Code)
}
if !strings.Contains(second.Body.String(), `src="/viewer"`) {
t.Fatalf("second body missing iframe viewer: %s", second.Body.String())
}
}
func TestViewerRendersLatestSnapshot(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
first := httptest.NewRecorder()
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if first.Code != http.StatusOK {
t.Fatalf("first status=%d", first.Code)
}
if !strings.Contains(first.Body.String(), "SERIAL-OLD") {
t.Fatalf("viewer body missing old serial: %s", first.Body.String())
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
t.Fatal(err)
}
second := httptest.NewRecorder()
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if second.Code != http.StatusOK {
t.Fatalf("second status=%d", second.Code)
}
if !strings.Contains(second.Body.String(), "SERIAL-NEW") {
t.Fatalf("viewer body missing new serial: %s", second.Body.String())
}
if strings.Contains(second.Body.String(), "SERIAL-OLD") {
t.Fatalf("viewer body still contains old serial: %s", second.Body.String())
}
}
func TestAuditJSONServesLatestSnapshot(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
body := `{"hardware":{"board":{"serial_number":"SERIAL-API"}}}`
if err := os.WriteFile(path, []byte(body), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
if got := strings.TrimSpace(rec.Body.String()); got != body {
t.Fatalf("body=%q want %q", got, body)
}
if got := rec.Header().Get("Content-Type"); !strings.Contains(got, "application/json") {
t.Fatalf("content-type=%q", got)
}
}
func TestMissingAuditJSONReturnsNotFound(t *testing.T) {
handler := NewHandler(HandlerOptions{AuditPath: "/missing/audit.json"})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
if rec.Code != http.StatusNotFound {
t.Fatalf("status=%d want %d", rec.Code, http.StatusNotFound)
}
}
func TestSupportBundleEndpointReturnsArchive(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/export/support.tar.gz", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
if got := rec.Header().Get("Content-Disposition"); !strings.Contains(got, "attachment;") {
t.Fatalf("content-disposition=%q", got)
}
if rec.Body.Len() == 0 {
t.Fatal("empty archive body")
}
}
func TestRuntimeHealthEndpointReturnsJSON(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
body := `{"status":"PARTIAL","checked_at":"2026-03-16T10:00:00Z"}`
if err := os.WriteFile(filepath.Join(exportDir, "runtime-health.json"), []byte(body), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/runtime-health.json", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
if strings.TrimSpace(rec.Body.String()) != body {
t.Fatalf("body=%q want %q", strings.TrimSpace(rec.Body.String()), body)
}
}

View File

@@ -9,4 +9,5 @@ Generic engineering rules live in `bible/rules/patterns/`.
|---|---| |---|---|
| `architecture/system-overview.md` | What bee does, scope, tech stack | | `architecture/system-overview.md` | What bee does, scope, tech stack |
| `architecture/runtime-flows.md` | Boot sequence, audit flow, service order | | `architecture/runtime-flows.md` | Boot sequence, audit flow, service order |
| `docs/hardware-ingest-contract.md` | Current Reanimator hardware ingest JSON contract |
| `decisions/` | Architectural decision log | | `decisions/` | Architectural decision log |

View File

@@ -18,8 +18,10 @@ local-fs.target
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking) ├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/, ├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│ creates /dev/nvidia* nodes) │ creates /dev/nvidia* nodes)
── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json, ── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
never blocks boot on partial collector failures) never blocks boot on partial collector failures)
└── bee-web.service (runs `bee web` on :80,
reads the latest audit snapshot on each request)
``` ```
**Critical invariants:** **Critical invariants:**
@@ -29,17 +31,39 @@ local-fs.target
Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree. Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken. - `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker. - `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
- `bee-web.service` binds `0.0.0.0:80` and always renders the current `/var/log/bee-audit.json` contents.
- Audit JSON now includes a `hardware.summary` block with overall verdict and warning/failure counts.
## Console and login flow
Local-console behavior:
```text
tty1
└── live-config autologin → bee
└── /home/bee/.profile
└── exec menu
└── /usr/local/bin/bee-tui
└── sudo -n /usr/local/bin/bee tui --runtime livecd
```
Rules:
- local `tty1` lands in user `bee`, not directly in `root`
- `menu` must work without typing `sudo`
- TUI actions still run as `root` via `sudo -n`
- SSH is independent from the tty1 path
- serial console support is enabled for VM boot debugging
## ISO build sequence ## ISO build sequence
``` ```
build.sh [--authorized-keys /path/to/keys] build-in-container.sh [--authorized-keys /path/to/keys]
1. compile `bee` binary (skip if .go files older than binary) 1. compile `bee` binary (skip if .go files older than binary)
2. create a temporary overlay staging dir under `dist/` 2. create a temporary overlay staging dir under `dist/`
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker) 3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
4. copy `bee` binary → staged `/usr/local/bin/bee` 4. copy `bee` binary → staged `/usr/local/bin/bee`
5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/` 5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
(`storcli64`, `sas2ircu`, `sas3ircu`, `mstflint` — each optional) (`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli` — optional; `mstflint` comes from the Debian package set)
6. `build-nvidia-module.sh`: 6. `build-nvidia-module.sh`:
a. install Debian kernel headers if missing a. install Debian kernel headers if missing
b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`) b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
@@ -54,13 +78,12 @@ build.sh [--authorized-keys /path/to/keys]
11. patch staged `motd` with build metadata 11. patch staged `motd` with build metadata
12. copy `iso/builder/` into a temporary live-build workdir under `dist/` 12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
13. sync staged overlay into workdir `config/includes.chroot/` 13. sync staged overlay into workdir `config/includes.chroot/`
14. run `lb config && lb build` inside the temporary workdir 14. run `lb config && lb build` inside the privileged builder container
(either on a Debian host/VM or inside the privileged builder container)
``` ```
**Critical invariants:** **Critical invariants:**
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places: - `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build 1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
2. `auto/config``linux-image-${DEBIAN_KERNEL_ABI}` in the ISO 2. `auto/config``linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`. - NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay. - The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
@@ -80,6 +103,10 @@ Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shi
Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present, Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
systemd services running, audit completed with NVIDIA enrichment, LAN reachability. systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
Current validation state:
- local/libvirt VM boot path is validated for `systemd`, SSH, `bee audit`, `bee-network`, and TUI startup
- real hardware validation is still required before treating the ISO as release-ready
## Overlay mechanism ## Overlay mechanism
`live-build` copies files from `config/includes.chroot/` into the ISO filesystem. `live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
@@ -95,10 +122,52 @@ systemd services running, audit completed with NVIDIA enrichment, LAN reachabili
3. memory collector (dmidecode -t 17) 3. memory collector (dmidecode -t 17)
4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log) 4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/) 5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/)
6. psu collector (ipmitool fru — silent if no /dev/ipmi0) 6. psu collector (ipmitool fru + sdr — silent if no /dev/ipmi0)
7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded) 7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
8. output JSON → /var/log/bee-audit.json 8. output JSON → /var/log/bee-audit.json
9. QR summary to stdout (qrencode if available) 9. QR summary to stdout (qrencode if available)
``` ```
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal. Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
Acceptance flows:
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-stress`
- `bee sat memory``memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- Runtime overrides:
- `BEE_GPU_STRESS_SECONDS`
- `BEE_GPU_STRESS_SIZE_MB`
- `BEE_MEMTESTER_SIZE_MB`
- `BEE_MEMTESTER_PASSES`
## NVIDIA SAT TUI flow (v1.0.0+)
```
TUI: Acceptance tests → NVIDIA command pack
1. screenNvidiaSATSetup
a. enumerate GPUs via `nvidia-smi --query-gpu=index,name,memory.total`
b. user selects duration preset: 10 min / 1 h / 8 h / 24 h
c. user selects GPUs via checkboxes (all selected by default)
d. memory size = max(selected GPU memory) — auto-detected, not exposed to user
2. Start → screenNvidiaSATRunning
a. CUDA_VISIBLE_DEVICES set to selected GPU indices
b. tea.Batch: SAT goroutine + tea.ExecProcess(nvtop) launched concurrently
c. nvtop occupies full terminal; SAT result queues in background
d. [o] reopen nvtop at any time; [a] abort (cancels context → kills bee-gpu-stress)
3. GPU metrics collection (during bee-gpu-stress)
- background goroutine polls `nvidia-smi` every second
- per-second rows: elapsed, GPU index, temp°C, usage%, power W, clock MHz
- outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG chart), gpu-metrics-term.txt
4. After SAT completes
- result shown in screenOutput with terminal line-chart (gpu-metrics-term.txt)
- chart is asciigraph-style: box-drawing chars (╭╮╰╯─│), 4 series per GPU,
Y axis with ticks, ANSI colours (red=temp, blue=usage, green=power, yellow=clock)
```
**Critical invariants:**
- `nvtop` must be in `iso/builder/config/package-lists/bee.list.chroot` (baked into ISO).
- `bee-gpu-stress` uses `exec.CommandContext` — aborted on cancel.
- Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed).
- If `nvtop` is not found on PATH, SAT still runs without it (graceful degradation).
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.

View File

@@ -4,7 +4,7 @@
Hardware audit LiveCD. Boots on a server via BMC virtual media or USB. Hardware audit LiveCD. Boots on a server via BMC virtual media or USB.
Collects hardware inventory at OS level (not through BMC/Redfish). Collects hardware inventory at OS level (not through BMC/Redfish).
Produces `HardwareIngestRequest` JSON compatible with core/reanimator. Produces `HardwareIngestRequest` JSON compatible with the contract in `bible-local/docs/hardware-ingest-contract.md`.
## Why it exists ## Why it exists
@@ -19,10 +19,15 @@ Fills gaps where Redfish/logpile is blind:
## In scope ## In scope
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID - Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
- Unattended operation — no user interaction required - Machine-readable health summary derived from collector verdicts
- Operator-triggered acceptance tests for NVIDIA, memory, and storage
- NVIDIA SAT includes both diagnostic collection and lightweight GPU stress via `bee-gpu-stress`
- Automatic boot audit with operator-facing local console and SSH access
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi` - NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
- SSH access (OpenSSH) always available for inspection and debugging - SSH access (OpenSSH) always available for inspection and debugging
- Interactive Go TUI via `bee tui` for network setup, service management, and acceptance tests - Interactive Go TUI via `bee tui` for network setup, service management, and acceptance tests
- Read-only web viewer via `bee web`, rendering the latest audit snapshot through the embedded Reanimator Chart
- Local `tty1` operator UX: `bee` autologin, `menu` auto-start, privileged actions via `sudo -n`
## Network isolation — CRITICAL ## Network isolation — CRITICAL
@@ -42,6 +47,16 @@ Fills gaps where Redfish/logpile is blind:
- Anything requiring persistent storage on the audited machine - Anything requiring persistent storage on the audited machine
- Windows support - Windows support
- Any functionality requiring internet access at boot - Any functionality requiring internet access at boot
- Component lifecycle/history across multiple snapshots
- Status transition history (`status_history`, `status_changed_at`) derived from previous exports
- Replacement detection between two or more audit runs
## Contract boundary
- `bee` is responsible for the current hardware snapshot only.
- `bee` should populate current component state, hardware inventory, telemetry, and `status_checked_at`.
- Historical status transitions and component replacement logic belong to the centralized ingest/lifecycle system, not to `bee`.
- Contract fields that have no honest local source on a generic Linux host may remain empty.
## Tech stack ## Tech stack
@@ -56,6 +71,14 @@ Fills gaps where Redfish/logpile is blind:
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` | | NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
| Builder | Debian 12 host/VM or Debian 12 container image | | Builder | Debian 12 host/VM or Debian 12 container image |
## Operator UX
- On the live ISO, `tty1` autologins as `bee`
- The login profile auto-runs `menu`, which enters the Go TUI
- The TUI itself executes privileged actions as `root` via `sudo -n`
- SSH remains available independently of the local console path
- VM-oriented builds also include `qemu-guest-agent` and serial console support for debugging
## Runtime split ## Runtime split
- The main Go application must run both on a normal Linux host and inside the live ISO - The main Go application must run both on a normal Linux host and inside the live ISO
@@ -72,8 +95,11 @@ Fills gaps where Redfish/logpile is blind:
| `audit/internal/schema/` | HardwareIngestRequest types | | `audit/internal/schema/` | HardwareIngestRequest types |
| `iso/builder/` | ISO build scripts and `live-build` profile | | `iso/builder/` | ISO build scripts and `live-build` profile |
| `iso/overlay/` | Source overlay copied into a staged build overlay | | `iso/overlay/` | Source overlay copied into a staged build overlay |
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, mstflint, …) | | `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, arcconf, ssacli, …) |
| `internal/chart/` | Git submodule with `reanimator/chart`, embedded into `bee web` |
| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI | | `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO | | `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
| `iso/overlay/etc/profile.d/bee.sh` | `menu` helper + tty1 auto-start policy |
| `iso/overlay/home/bee/.profile` | `bee` shell profile for local console startup |
| `dist/` | Build outputs (gitignored) | | `dist/` | Build outputs (gitignored) |
| `iso/out/` | Downloaded ISO files (gitignored) | | `iso/out/` | Downloaded ISO files (gitignored) |

View File

@@ -1,22 +1,68 @@
# Backlog # Backlog
## GPU stress test (H100) ## Real hardware validation
**Статус:** отложено. В текущем ISO `gpu_burn` не включается и не запускается. **Статус:** ожидает доступа к железу.
**Почему задача всё ещё в backlog:** Что осталось подтвердить на практике:
- `gpu_burn` остаётся тяжёлым и неудобным с точки зрения зависимостей - `bee sat nvidia` на реальном NVIDIA GPU host
- хочется штатный lightweight stress tool без `libcublas.so` и без заметного раздувания ISO - `bee sat storage` на NVMe/SATA/RAID host
- для H100 нужен предсказуемый offline-инструмент, который можно стабильно возить внутри ISO - `ipmitool sdr` parsing на сервере с реальным BMC/IPMI
- vendor RAID tooling (`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli`) в живом ISO
**Желаемый следующий шаг:** написать минимальный stress tool на CUDA Driver API ## SAT result polish
- использует только `libcuda.so`, уже присутствующий в ISO
- выполняет простой compute / memory workload через `cuLaunchKernel`
- собирается отдельно на builder VM и кладётся в `iso/vendor/`
- в будущем может вызываться из `bee tui` как предпочтительный встроенный GPU SAT/stress path
**Отклонённые / проблемные варианты:** **Статус:** частично закрыто.
- `gpu_burn` — нужен libcublas (~500MB)
- `nvbandwidth` — только bandwidth, не жжёт FLOPs; нужен libcudart (~8MB) Что ещё можно улучшить после полевой проверки:
- DCGM diag — правильный инструмент для H100 но ~100MB установка - точнее классифицировать vendor-specific self-test outputs в `storage SAT`
- Download on demand — нужен libcublas, проблема та же - подобрать дефолты `memtester` по объёму RAM на целевых машинах
- при необходимости расширить `bee-gpu-stress` по длительности/нагрузке
## Hardware Contract backlog
**Статус:** уточнён, сокращён до `bee`-only snapshot scope.
### Не backlog для `bee`
Эти задачи не должны реализовываться в `bee`, потому что относятся к централизованному ingest/lifecycle слою:
- `status_history`
- `status_changed_at`
- определение замены компонента между snapshot'ами
- timeline/lifecycle/history по diff между экспортами
`bee` отвечает только за текущий snapshot железа и `status_checked_at`.
### Реализуемо инкрементально
Эти поля можно развивать дальше по мере появления реальных sample outputs и vendor-specific parser'ов:
- `cpus.correctable_error_count`
- `cpus.uncorrectable_error_count`
- `power_supplies.life_remaining_pct`
- `power_supplies.life_used_pct`
- `pcie_devices.battery_charge_pct`
- `pcie_devices.battery_health_pct`
- `pcie_devices.battery_temperature_c`
- `pcie_devices.battery_voltage_v`
- `pcie_devices.battery_replace_required`
### Vendor/platform-specific, часто пустые
Эти поля допустимо оставлять пустыми на части платформ даже после реализации parser'ов:
- `power_supplies.life_remaining_pct`
- `power_supplies.life_used_pct`
- часть `pcie_devices.battery_*` для неподдержанных RAID/NIC/GPU вендоров
### Unsupported в `bee`
Эти поля считаются нереалистичными для общего OS-level hardware snapshotter без synthetic/fake data:
- `cpus.life_remaining_pct`
- `cpus.life_used_pct`
- `memory.life_remaining_pct`
- `memory.life_used_pct`
- `memory.spare_blocks_remaining_pct`
- `memory.performance_degraded`
Причина: у обычного Linux-host audit обычно нет честного vendor-neutral runtime source для этих метрик.
Эти поля считаются дропнутыми из backlog `bee` и не должны возвращаться в план работ без появления нового доказуемого локального источника данных на целевых машинах.

View File

@@ -0,0 +1,793 @@
---
title: Hardware Ingest JSON Contract
version: "2.7"
updated: "2026-03-15"
maintainer: Reanimator Core
audience: external-integrators, ai-agents
language: ru
---
# Интеграция с Reanimator: контракт JSON-импорта аппаратного обеспечения
Версия: **2.7** · Дата: **2026-03-15**
Документ описывает формат JSON для передачи данных об аппаратном обеспечении серверов в систему **Reanimator** (управление жизненным циклом аппаратного обеспечения).
Предназначен для разработчиков смежных систем (Redfish-коллекторов, агентов мониторинга, CMDB-экспортёров) и может быть включён в документацию интегрируемых проектов.
> Актуальная версия документа: https://git.mchus.pro/reanimator/core/src/branch/main/bible-local/docs/hardware-ingest-contract.md
---
## Changelog
| Версия | Дата | Изменения |
|--------|------|-----------|
| 2.7 | 2026-03-15 | Явно запрещён синтез данных в `event_logs`; интеграторы не должны придумывать серийные номера компонентов, если источник их не отдал |
| 2.6 | 2026-03-15 | Добавлена необязательная секция `event_logs` для dedup/upsert логов `host` / `bmc` / `redfish` вне history timeline |
| 2.5 | 2026-03-15 | Добавлено общее необязательное поле `manufactured_year_week` для компонентных секций (`YYYY-Www`) |
| 2.4 | 2026-03-15 | Добавлена первая волна component telemetry: health/life поля для `cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies` |
| 2.3 | 2026-03-15 | Добавлены component telemetry поля: `pcie_devices.temperature_c`, `pcie_devices.power_w`, `power_supplies.temperature_c` |
| 2.2 | 2026-03-15 | Добавлено поле `numa_node` у `pcie_devices` для topology/affinity |
| 2.1 | 2026-03-15 | Добавлена секция `sensors` (fans, power, temperatures, other); поле `mac_addresses` у `pcie_devices`; расширен список значений `device_class` |
| 2.0 | 2026-02-01 | История статусов (`status_history`, `status_changed_at`); поля telemetry у PSU; async job response |
| 1.0 | 2026-01-01 | Начальная версия контракта |
---
## Принципы
1. **Snapshot** — JSON описывает состояние сервера на момент сбора. Может включать историю изменений статуса компонентов.
2. **Идемпотентность** — повторная отправка идентичного payload не создаёт дублей (дедупликация по хешу).
3. **Частичность** — можно передавать только те секции, данные по которым доступны. Пустой массив и отсутствие секции эквивалентны.
4. **Строгая схема** — endpoint использует строгий JSON-декодер; неизвестные поля приводят к `400 Bad Request`.
5. **Event-driven** — импорт создаёт события в timeline (LOG_COLLECTED, INSTALLED, REMOVED, FIRMWARE_CHANGED и др.).
6. **Без синтеза со стороны интегратора** — сборщик передаёт только фактически собранные значения. Нельзя придумывать `serial_number`, `component_ref`, `message`, `message_id` или другие идентификаторы/атрибуты, если источник их не предоставил или парсер не смог их надёжно извлечь.
---
## Endpoint
```
POST /ingest/hardware
Content-Type: application/json
```
**Ответ при приёме (202 Accepted):**
```json
{
"status": "accepted",
"job_id": "job_01J..."
}
```
Импорт выполняется асинхронно. Результат доступен по:
```
GET /ingest/hardware/jobs/{job_id}
```
**Ответ при успехе задачи:**
```json
{
"status": "success",
"bundle_id": "lb_01J...",
"asset_id": "mach_01J...",
"collected_at": "2026-02-10T15:30:00Z",
"duplicate": false,
"summary": {
"parts_observed": 15,
"parts_created": 2,
"parts_updated": 13,
"installations_created": 2,
"installations_closed": 1,
"timeline_events_created": 9,
"failure_events_created": 1
}
}
```
**Ответ при дубликате:**
```json
{
"status": "success",
"duplicate": true,
"message": "LogBundle with this content hash already exists"
}
```
**Ответ при ошибке (400 Bad Request):**
```json
{
"status": "error",
"error": "validation_failed",
"details": {
"field": "hardware.board.serial_number",
"message": "serial_number is required"
}
}
```
Частые причины `400`:
- Неверный формат `collected_at` (требуется RFC3339).
- Пустой `hardware.board.serial_number`.
- Наличие неизвестного JSON-поля на любом уровне.
- Тело запроса превышает допустимый размер.
---
## Структура верхнего уровня
```json
{
"filename": "redfish://10.10.10.103",
"source_type": "api",
"protocol": "redfish",
"target_host": "10.10.10.103",
"collected_at": "2026-02-10T15:30:00Z",
"hardware": {
"board": { ... },
"firmware": [ ... ],
"cpus": [ ... ],
"memory": [ ... ],
"storage": [ ... ],
"pcie_devices": [ ... ],
"power_supplies": [ ... ],
"sensors": { ... },
"event_logs": [ ... ]
}
}
```
### Поля верхнего уровня
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `collected_at` | string RFC3339 | **да** | Время сбора данных |
| `hardware` | object | **да** | Аппаратный снапшот |
| `hardware.board.serial_number` | string | **да** | Серийный номер платы/сервера |
| `target_host` | string | нет | IP или hostname |
| `source_type` | string | нет | Тип источника: `api`, `logfile`, `manual` |
| `protocol` | string | нет | Протокол: `redfish`, `ipmi`, `snmp`, `ssh` |
| `filename` | string | нет | Идентификатор источника |
---
## Общие поля статуса компонентов
Применяются ко всем компонентным секциям (`cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies`).
| Поле | Тип | Описание |
|------|-----|----------|
| `status` | string | Текущий статус: `OK`, `Warning`, `Critical`, `Unknown`, `Empty` |
| `status_checked_at` | string RFC3339 | Время последней проверки статуса |
| `status_changed_at` | string RFC3339 | Время последнего изменения статуса |
| `status_history` | array | История переходов статусов (см. ниже) |
| `error_description` | string | Текст ошибки/диагностики |
| `manufactured_year_week` | string | Дата производства в формате `YYYY-Www`, например `2024-W07` |
**Объект `status_history[]`:**
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `status` | string | **да** | Статус в этот момент |
| `changed_at` | string RFC3339 | **да** | Время перехода (без этого поля запись игнорируется) |
| `details` | string | нет | Пояснение к переходу |
**Правила приоритета времени события:**
1. `status_changed_at`
2. Последняя запись `status_history` с совпадающим статусом
3. Последняя парсируемая запись `status_history`
4. `status_checked_at`
**Правила передачи статусов:**
- Передавайте `status` как текущее состояние компонента в snapshot.
- Если источник хранит историю — передавайте `status_history` отсортированным по `changed_at` по возрастанию.
- Не включайте записи `status_history` без `changed_at`.
- Все даты — RFC3339, рекомендуется UTC (`Z`).
- `manufactured_year_week` используйте, когда источник знает только год и неделю производства, без точной календарной даты.
---
## Секции hardware
### board
Основная информация о сервере. Обязательная секция.
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `serial_number` | string | **да** | Серийный номер (ключ идентификации Asset) |
| `manufacturer` | string | нет | Производитель |
| `product_name` | string | нет | Модель |
| `part_number` | string | нет | Партномер |
| `uuid` | string | нет | UUID системы |
Значения `"NULL"` в строковых полях трактуются как отсутствие данных.
```json
"board": {
"manufacturer": "Supermicro",
"product_name": "X12DPG-QT6",
"serial_number": "21D634101",
"part_number": "X12DPG-QT6-REV1.01",
"uuid": "d7ef2fe5-2fd0-11f0-910a-346f11040868"
}
```
---
### firmware
Версии прошивок системных компонентов (BIOS, BMC, CPLD и др.).
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `device_name` | string | **да** | Название устройства (`BIOS`, `BMC`, `CPLD`, …) |
| `version` | string | **да** | Версия прошивки |
Записи с пустым `device_name` или `version` игнорируются.
Изменение версии создаёт событие `FIRMWARE_CHANGED` для Asset.
```json
"firmware": [
{ "device_name": "BIOS", "version": "06.08.05" },
{ "device_name": "BMC", "version": "5.17.00" },
{ "device_name": "CPLD", "version": "01.02.03" }
]
```
---
### cpus
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `socket` | int | **да** | Номер сокета (используется для генерации serial) |
| `model` | string | нет | Модель процессора |
| `manufacturer` | string | нет | Производитель |
| `cores` | int | нет | Количество ядер |
| `threads` | int | нет | Количество потоков |
| `frequency_mhz` | int | нет | Текущая частота |
| `max_frequency_mhz` | int | нет | Максимальная частота |
| `temperature_c` | float | нет | Температура CPU, °C (telemetry) |
| `power_w` | float | нет | Текущая мощность CPU, Вт (telemetry) |
| `throttled` | bool | нет | Зафиксирован thermal/power throttling |
| `correctable_error_count` | int | нет | Количество корректируемых ошибок CPU |
| `uncorrectable_error_count` | int | нет | Количество некорректируемых ошибок CPU |
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
| `serial_number` | string | нет | Серийный номер (если доступен) |
| `firmware` | string | нет | Версия микрокода; если логгер отдает `Microcode level`, передавайте его сюда как есть |
| `present` | bool | нет | Наличие (по умолчанию `true`) |
| + общие поля статуса | | | см. раздел выше |
**Генерация serial_number при отсутствии:** `{board_serial}-CPU-{socket}`
Если источник использует поле/лейбл `Microcode level`, его значение передавайте в `cpus[].firmware` без дополнительного преобразования.
```json
"cpus": [
{
"socket": 0,
"model": "INTEL(R) XEON(R) GOLD 6530",
"cores": 32,
"threads": 64,
"frequency_mhz": 2100,
"max_frequency_mhz": 4000,
"temperature_c": 61.5,
"power_w": 182.0,
"throttled": false,
"manufacturer": "Intel",
"status": "OK",
"status_checked_at": "2026-02-10T15:28:00Z"
}
]
```
---
### memory
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `slot` | string | нет | Идентификатор слота |
| `present` | bool | нет | Наличие модуля (по умолчанию `true`) |
| `serial_number` | string | нет | Серийный номер |
| `part_number` | string | нет | Партномер (используется как модель) |
| `manufacturer` | string | нет | Производитель |
| `size_mb` | int | нет | Объём в МБ |
| `type` | string | нет | Тип: `DDR3`, `DDR4`, `DDR5`, … |
| `max_speed_mhz` | int | нет | Максимальная частота |
| `current_speed_mhz` | int | нет | Текущая частота |
| `temperature_c` | float | нет | Температура DIMM/модуля, °C (telemetry) |
| `correctable_ecc_error_count` | int | нет | Количество корректируемых ECC-ошибок |
| `uncorrectable_ecc_error_count` | int | нет | Количество некорректируемых ECC-ошибок |
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
| `spare_blocks_remaining_pct` | float | нет | Остаток spare blocks, % |
| `performance_degraded` | bool | нет | Зафиксирована деградация производительности |
| `data_loss_detected` | bool | нет | Источник сигнализирует риск/факт потери данных |
| + общие поля статуса | | | см. раздел выше |
Модуль без `serial_number` игнорируется. Модуль с `present=false` или `status=Empty` игнорируется.
```json
"memory": [
{
"slot": "CPU0_C0D0",
"present": true,
"size_mb": 32768,
"type": "DDR5",
"max_speed_mhz": 4800,
"current_speed_mhz": 4800,
"temperature_c": 43.0,
"correctable_ecc_error_count": 0,
"manufacturer": "Hynix",
"serial_number": "80AD032419E17CEEC1",
"part_number": "HMCG88AGBRA191N",
"status": "OK"
}
]
```
---
### storage
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `slot` | string | нет | Канонический адрес установки PCIe-устройства; передавайте BDF (`0000:18:00.0`) |
| `serial_number` | string | нет | Серийный номер |
| `model` | string | нет | Модель |
| `manufacturer` | string | нет | Производитель |
| `type` | string | нет | Тип: `NVMe`, `SSD`, `HDD` |
| `interface` | string | нет | Интерфейс: `NVMe`, `SATA`, `SAS` |
| `size_gb` | int | нет | Размер в ГБ |
| `temperature_c` | float | нет | Температура накопителя, °C (telemetry) |
| `power_on_hours` | int64 | нет | Время работы, часы |
| `power_cycles` | int64 | нет | Количество циклов питания |
| `unsafe_shutdowns` | int64 | нет | Нештатные выключения |
| `media_errors` | int64 | нет | Ошибки носителя / media errors |
| `error_log_entries` | int64 | нет | Количество записей в error log |
| `written_bytes` | int64 | нет | Всего записано байт |
| `read_bytes` | int64 | нет | Всего прочитано байт |
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
| `available_spare_pct` | float | нет | Доступный spare, % |
| `reallocated_sectors` | int64 | нет | Переназначенные сектора |
| `current_pending_sectors` | int64 | нет | Сектора в ожидании ремапа |
| `offline_uncorrectable` | int64 | нет | Некорректируемые ошибки offline scan |
| `firmware` | string | нет | Версия прошивки |
| `present` | bool | нет | Наличие (по умолчанию `true`) |
| + общие поля статуса | | | см. раздел выше |
Диск без `serial_number` игнорируется. Изменение `firmware` создаёт событие `FIRMWARE_CHANGED`.
```json
"storage": [
{
"slot": "OB01",
"type": "NVMe",
"model": "INTEL SSDPF2KX076T1",
"size_gb": 7680,
"temperature_c": 38.5,
"power_on_hours": 12450,
"unsafe_shutdowns": 3,
"written_bytes": 9876543210,
"life_remaining_pct": 91.0,
"serial_number": "BTAX41900GF87P6DGN",
"manufacturer": "Intel",
"firmware": "9CV10510",
"interface": "NVMe",
"present": true,
"status": "OK"
}
]
```
---
### pcie_devices
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `slot` | string | нет | Идентификатор слота |
| `vendor_id` | int | нет | PCI Vendor ID (decimal) |
| `device_id` | int | нет | PCI Device ID (decimal) |
| `numa_node` | int | нет | NUMA node / CPU affinity устройства |
| `temperature_c` | float | нет | Температура устройства, °C (telemetry) |
| `power_w` | float | нет | Текущее энергопотребление устройства, Вт (telemetry) |
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
| `ecc_corrected_total` | int64 | нет | Всего корректируемых ECC-ошибок |
| `ecc_uncorrected_total` | int64 | нет | Всего некорректируемых ECC-ошибок |
| `hw_slowdown` | bool | нет | Устройство вошло в hardware slowdown / protective mode |
| `battery_charge_pct` | float | нет | Заряд батареи / supercap, % |
| `battery_health_pct` | float | нет | Состояние батареи / supercap, % |
| `battery_temperature_c` | float | нет | Температура батареи / supercap, °C |
| `battery_voltage_v` | float | нет | Напряжение батареи / supercap, В |
| `battery_replace_required` | bool | нет | Требуется замена батареи / supercap |
| `sfp_temperature_c` | float | нет | Температура SFP/optic, °C |
| `sfp_tx_power_dbm` | float | нет | TX optical power, dBm |
| `sfp_rx_power_dbm` | float | нет | RX optical power, dBm |
| `sfp_voltage_v` | float | нет | Напряжение SFP, В |
| `sfp_bias_ma` | float | нет | Bias current SFP, мА |
| `bdf` | string | нет | Deprecated alias для `slot`; при наличии ingest нормализует его в `slot` |
| `device_class` | string | нет | Класс устройства (см. список ниже) |
| `manufacturer` | string | нет | Производитель |
| `model` | string | нет | Модель |
| `serial_number` | string | нет | Серийный номер |
| `firmware` | string | нет | Версия прошивки |
| `link_width` | int | нет | Текущая ширина линка |
| `link_speed` | string | нет | Текущая скорость: `Gen3`, `Gen4`, `Gen5` |
| `max_link_width` | int | нет | Максимальная ширина линка |
| `max_link_speed` | string | нет | Максимальная скорость |
| `mac_addresses` | string[] | нет | MAC-адреса портов (для сетевых устройств) |
| `present` | bool | нет | Наличие (по умолчанию `true`) |
| + общие поля статуса | | | см. раздел выше |
`numa_node` передавайте для NIC / InfiniBand / RAID / GPU, когда источник знает CPU/NUMA affinity. Поле сохраняется в snapshot-атрибутах PCIe-компонента и дублируется в telemetry для topology use cases.
Поля `temperature_c` и `power_w` используйте для device-level telemetry GPU / accelerator / smart PCIe devices. Они не влияют на идентификацию компонента.
**Генерация serial_number при отсутствии или `"N/A"`:** `{board_serial}-PCIE-{slot}`, где `slot` для PCIe равен BDF.
`slot` — единственный канонический адрес компонента. Для PCIe в `slot` передавайте BDF. Поле `bdf` сохраняется только как переходный alias на входе и не должно использоваться как отдельная координата рядом со `slot`.
**Значения `device_class`:**
| Значение | Назначение |
|----------|------------|
| `MassStorageController` | RAID-контроллеры |
| `StorageController` | HBA, SAS-контроллеры |
| `NetworkController` | Сетевые адаптеры (InfiniBand, общий) |
| `EthernetController` | Ethernet NIC |
| `FibreChannelController` | Fibre Channel HBA |
| `VideoController` | GPU, видеокарты |
| `ProcessingAccelerator` | Вычислительные ускорители (AI/ML) |
| `DisplayController` | Контроллеры дисплея (BMC VGA) |
Список открытый: допускаются произвольные строки для нестандартных классов.
```json
"pcie_devices": [
{
"slot": "0000:3b:00.0",
"vendor_id": 5555,
"device_id": 4401,
"numa_node": 0,
"temperature_c": 48.5,
"power_w": 18.2,
"sfp_temperature_c": 36.2,
"sfp_tx_power_dbm": -1.8,
"sfp_rx_power_dbm": -2.1,
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",
"serial_number": "K65472-003",
"firmware": "9.20 0x8000d4ae",
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
"status": "OK"
}
]
```
---
### power_supplies
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `slot` | string | нет | Идентификатор слота |
| `present` | bool | нет | Наличие (по умолчанию `true`) |
| `serial_number` | string | нет | Серийный номер |
| `part_number` | string | нет | Партномер |
| `model` | string | нет | Модель |
| `vendor` | string | нет | Производитель |
| `wattage_w` | int | нет | Мощность в ваттах |
| `firmware` | string | нет | Версия прошивки |
| `input_type` | string | нет | Тип входа (например `ACWideRange`) |
| `input_voltage` | float | нет | Входное напряжение, В (telemetry) |
| `input_power_w` | float | нет | Входная мощность, Вт (telemetry) |
| `output_power_w` | float | нет | Выходная мощность, Вт (telemetry) |
| `temperature_c` | float | нет | Температура PSU, °C (telemetry) |
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
| + общие поля статуса | | | см. раздел выше |
Поля telemetry (`input_voltage`, `input_power_w`, `output_power_w`, `temperature_c`, `life_remaining_pct`, `life_used_pct`) сохраняются в атрибутах компонента и не влияют на его идентификацию.
PSU без `serial_number` игнорируется.
```json
"power_supplies": [
{
"slot": "0",
"present": true,
"model": "GW-CRPS3000LW",
"vendor": "Great Wall",
"wattage_w": 3000,
"serial_number": "2P06C102610",
"firmware": "00.03.05",
"status": "OK",
"input_type": "ACWideRange",
"input_power_w": 137,
"output_power_w": 104,
"input_voltage": 215.25,
"temperature_c": 39.5,
"life_remaining_pct": 97.0
}
]
```
---
### sensors
Показания сенсоров сервера. Секция опциональная, не привязана к компонентам.
Данные хранятся как последнее известное значение (last-known-value) на уровне Asset.
```json
"sensors": {
"fans": [ ... ],
"power": [ ... ],
"temperatures": [ ... ],
"other": [ ... ]
}
```
---
### event_logs
Нормализованные операционные логи сервера из `host`, `bmc` или `redfish`.
Эти записи не попадают в history timeline и не создают history events. Они сохраняются в отдельной deduplicated log store и отображаются в отдельном UI-блоке asset logs / host logs.
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `source` | string | **да** | Источник лога: `host`, `bmc`, `redfish` |
| `event_time` | string RFC3339 | нет | Время события из источника; если отсутствует, используется время ingest/collection |
| `severity` | string | нет | Уровень: `OK`, `Info`, `Warning`, `Critical`, `Unknown` |
| `message_id` | string | нет | Идентификатор/код события источника |
| `message` | string | **да** | Нормализованный текст события |
| `component_ref` | string | нет | Ссылка на компонент/устройство/слот, если извлекается |
| `fingerprint` | string | нет | Внешний готовый dedup-key; если не передан, система вычисляет свой |
| `is_active` | bool | нет | Признак, что событие всё ещё активно/не погашено, если источник умеет lifecycle |
| `raw_payload` | object | нет | Сырой vendor-specific payload для диагностики |
**Правила event_logs:**
- Логи дедуплицируются в рамках asset + source + fingerprint.
- Если `fingerprint` не передан, система строит его из нормализованных полей (`source`, `message_id`, `message`, `component_ref`, временная нормализация).
- Интегратор/сборщик логов не должен синтезировать содержимое событий: не придумывайте `message`, `message_id`, `component_ref`, serial/device identifiers или иные поля, если они отсутствуют в исходном логе или не были надёжно извлечены.
- Повторное получение того же события обновляет `last_seen_at`/счётчик повторов и не должно создавать новый timeline/history event.
- `event_logs` используются для отдельного UI-представления логов и не изменяют canonical state компонентов/asset по умолчанию.
```json
"event_logs": [
{
"source": "bmc",
"event_time": "2026-03-15T14:03:11Z",
"severity": "Warning",
"message_id": "0x000F",
"message": "Correctable ECC error threshold exceeded",
"component_ref": "CPU0_C0D0",
"raw_payload": {
"sensor": "DIMM_A1",
"sel_record_id": "0042"
}
},
{
"source": "redfish",
"event_time": "2026-03-15T14:03:20Z",
"severity": "Info",
"message_id": "OpenBMC.0.1.SystemReboot",
"message": "System reboot requested by administrator",
"component_ref": "Mainboard"
}
]
```
#### sensors.fans
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `name` | string | **да** | Уникальное имя сенсора в рамках секции |
| `location` | string | нет | Физическое расположение |
| `rpm` | int | нет | Обороты, RPM |
| `status` | string | нет | Статус: `OK`, `Warning`, `Critical`, `Unknown` |
#### sensors.power
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `name` | string | **да** | Уникальное имя сенсора |
| `location` | string | нет | Физическое расположение |
| `voltage_v` | float | нет | Напряжение, В |
| `current_a` | float | нет | Ток, А |
| `power_w` | float | нет | Мощность, Вт |
| `status` | string | нет | Статус |
#### sensors.temperatures
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `name` | string | **да** | Уникальное имя сенсора |
| `location` | string | нет | Физическое расположение |
| `celsius` | float | нет | Температура, °C |
| `threshold_warning_celsius` | float | нет | Порог Warning, °C |
| `threshold_critical_celsius` | float | нет | Порог Critical, °C |
| `status` | string | нет | Статус |
#### sensors.other
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `name` | string | **да** | Уникальное имя сенсора |
| `location` | string | нет | Физическое расположение |
| `value` | float | нет | Значение |
| `unit` | string | нет | Единица измерения |
| `status` | string | нет | Статус |
**Правила sensors:**
- Идентификатор сенсора: пара `(sensor_type, name)`. Дубли в одном payload — берётся первое вхождение.
- Сенсоры без `name` игнорируются.
- При каждом импорте значения перезаписываются (upsert по ключу).
```json
"sensors": {
"fans": [
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" },
{ "name": "FAN_CPU0", "location": "CPU0", "rpm": 5600, "status": "OK" }
],
"power": [
{ "name": "12V Rail", "location": "Mainboard", "voltage_v": 12.06, "status": "OK" },
{ "name": "PSU0 Input", "location": "PSU0", "voltage_v": 215.25, "current_a": 0.64, "power_w": 137.0, "status": "OK" }
],
"temperatures": [
{ "name": "CPU0 Temp", "location": "CPU0", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" },
{ "name": "Inlet Temp", "location": "Front", "celsius": 22.0, "threshold_warning_celsius": 40.0, "threshold_critical_celsius": 50.0, "status": "OK" }
],
"other": [
{ "name": "System Humidity", "value": 38.5, "unit": "%", "status": "OK" }
]
}
```
---
## Обработка статусов компонентов
| Статус | Поведение |
|--------|-----------|
| `OK` | Нормальная обработка |
| `Warning` | Создаётся событие `COMPONENT_WARNING` |
| `Critical` | Создаётся событие `COMPONENT_FAILED` + запись в `failure_events` |
| `Unknown` | Компонент считается рабочим, создаётся событие `COMPONENT_UNKNOWN` |
| `Empty` | Компонент не создаётся/не обновляется |
---
## Обработка отсутствующих serial_number
Общее правило для всех секций: если источник не вернул серийный номер и сборщик не смог его надёжно извлечь, интегратор не должен подставлять вымышленные значения, хеши, локальные placeholder-идентификаторы или серийные номера "по догадке". Разрешены только явно оговорённые ниже server-side fallback-правила ingest.
| Тип | Поведение |
|-----|-----------|
| CPU | Генерируется: `{board_serial}-CPU-{socket}` |
| PCIe | Генерируется: `{board_serial}-PCIE-{slot}` (если serial = `"N/A"` или пустой; `slot` для PCIe = BDF) |
| Memory | Компонент игнорируется |
| Storage | Компонент игнорируется |
| PSU | Компонент игнорируется |
Если `serial_number` не уникален внутри одного payload для того же `model`:
- Первое вхождение сохраняет оригинальный серийный номер.
- Каждое следующее дублирующее получает placeholder: `NO_SN-XXXXXXXX`.
---
## Минимальный валидный пример
```json
{
"collected_at": "2026-02-10T15:30:00Z",
"target_host": "192.168.1.100",
"hardware": {
"board": {
"serial_number": "SRV-001"
}
}
}
```
---
## Полный пример с историей статусов
```json
{
"filename": "redfish://10.10.10.103",
"source_type": "api",
"protocol": "redfish",
"target_host": "10.10.10.103",
"collected_at": "2026-02-10T15:30:00Z",
"hardware": {
"board": {
"manufacturer": "Supermicro",
"product_name": "X12DPG-QT6",
"serial_number": "21D634101"
},
"firmware": [
{ "device_name": "BIOS", "version": "06.08.05" },
{ "device_name": "BMC", "version": "5.17.00" }
],
"cpus": [
{
"socket": 0,
"model": "INTEL(R) XEON(R) GOLD 6530",
"manufacturer": "Intel",
"cores": 32,
"threads": 64,
"status": "OK"
}
],
"storage": [
{
"slot": "OB01",
"type": "NVMe",
"model": "INTEL SSDPF2KX076T1",
"size_gb": 7680,
"serial_number": "BTAX41900GF87P6DGN",
"manufacturer": "Intel",
"firmware": "9CV10510",
"present": true,
"status": "OK",
"status_changed_at": "2026-02-10T15:22:00Z",
"status_history": [
{ "status": "Critical", "changed_at": "2026-02-10T15:10:00Z", "details": "I/O timeout on NVMe queue 3" },
{ "status": "OK", "changed_at": "2026-02-10T15:22:00Z", "details": "Recovered after controller reset" }
]
}
],
"pcie_devices": [
{
"slot": "0000:18:00.0",
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",
"serial_number": "K65472-003",
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
"status": "OK"
}
],
"power_supplies": [
{
"slot": "0",
"present": true,
"model": "GW-CRPS3000LW",
"vendor": "Great Wall",
"wattage_w": 3000,
"serial_number": "2P06C102610",
"firmware": "00.03.05",
"status": "OK",
"input_power_w": 137,
"output_power_w": 104,
"input_voltage": 215.25
}
],
"sensors": {
"fans": [
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" }
],
"power": [
{ "name": "12V Rail", "voltage_v": 12.06, "status": "OK" }
],
"temperatures": [
{ "name": "CPU0 Temp", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" }
],
"other": [
{ "name": "System Humidity", "value": 38.5, "unit": "%" }
]
}
}
}
```

1
internal/chart Submodule

Submodule internal/chart added at 05db6994d4

View File

@@ -1,6 +1,6 @@
FROM debian:12 FROM debian:12
ARG GO_VERSION=1.23.6 ARG GO_VERSION=1.24.0
ARG DEBIAN_KERNEL_ABI=6.1.0-43 ARG DEBIAN_KERNEL_ABI=6.1.0-43
ENV DEBIAN_FRONTEND=noninteractive ENV DEBIAN_FRONTEND=noninteractive

View File

@@ -1,5 +1,5 @@
DEBIAN_VERSION=12 DEBIAN_VERSION=12
DEBIAN_KERNEL_ABI=6.1.0-43 DEBIAN_KERNEL_ABI=6.1.0-43
NVIDIA_DRIVER_VERSION=590.48.01 NVIDIA_DRIVER_VERSION=590.48.01
GO_VERSION=1.23.6 GO_VERSION=1.24.0
AUDIT_VERSION=0.1.1 AUDIT_VERSION=1.0.0

View File

@@ -21,7 +21,7 @@ lb config noauto \
--linux-flavours "amd64" \ --linux-flavours "amd64" \
--linux-packages "linux-image-${DEBIAN_KERNEL_ABI}" \ --linux-packages "linux-image-${DEBIAN_KERNEL_ABI}" \
--memtest none \ --memtest none \
--iso-volume "BEE-DEBUG" \ --iso-volume "BEE" \
--iso-application "Bee Hardware Audit" \ --iso-application "Bee Hardware Audit" \
--bootappend-live "boot=live components console=tty0 console=ttyS0,115200n8 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \ --bootappend-live "boot=live components console=tty0 console=ttyS0,115200n8 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
--apt-recommends false \ --apt-recommends false \

View File

@@ -0,0 +1,314 @@
#define _POSIX_C_SOURCE 200809L
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
typedef int CUdevice;
typedef uint64_t CUdeviceptr;
typedef int CUresult;
typedef void *CUcontext;
typedef void *CUmodule;
typedef void *CUfunction;
typedef void *CUstream;
#define CU_SUCCESS 0
static const char *ptx_source =
".version 6.0\n"
".target sm_30\n"
".address_size 64\n"
"\n"
".visible .entry burn(\n"
" .param .u64 data,\n"
" .param .u32 words,\n"
" .param .u32 rounds\n"
")\n"
"{\n"
" .reg .pred %p<2>;\n"
" .reg .b32 %r<8>;\n"
" .reg .b64 %rd<5>;\n"
"\n"
" ld.param.u64 %rd1, [data];\n"
" ld.param.u32 %r1, [words];\n"
" ld.param.u32 %r2, [rounds];\n"
" mov.u32 %r3, %ctaid.x;\n"
" mov.u32 %r4, %ntid.x;\n"
" mov.u32 %r5, %tid.x;\n"
" mad.lo.s32 %r0, %r3, %r4, %r5;\n"
" setp.ge.u32 %p0, %r0, %r1;\n"
" @%p0 bra DONE;\n"
" mul.wide.u32 %rd2, %r0, 4;\n"
" add.s64 %rd3, %rd1, %rd2;\n"
" ld.global.u32 %r6, [%rd3];\n"
"LOOP:\n"
" setp.eq.u32 %p1, %r2, 0;\n"
" @%p1 bra STORE;\n"
" mad.lo.u32 %r6, %r6, 1664525, 1013904223;\n"
" sub.u32 %r2, %r2, 1;\n"
" bra LOOP;\n"
"STORE:\n"
" st.global.u32 [%rd3], %r6;\n"
"DONE:\n"
" ret;\n"
"}\n";
typedef CUresult (*cuInit_fn)(unsigned int);
typedef CUresult (*cuDeviceGetCount_fn)(int *);
typedef CUresult (*cuDeviceGet_fn)(CUdevice *, int);
typedef CUresult (*cuDeviceGetName_fn)(char *, int, CUdevice);
typedef CUresult (*cuCtxCreate_fn)(CUcontext *, unsigned int, CUdevice);
typedef CUresult (*cuCtxDestroy_fn)(CUcontext);
typedef CUresult (*cuCtxSynchronize_fn)(void);
typedef CUresult (*cuMemAlloc_fn)(CUdeviceptr *, size_t);
typedef CUresult (*cuMemFree_fn)(CUdeviceptr);
typedef CUresult (*cuMemcpyHtoD_fn)(CUdeviceptr, const void *, size_t);
typedef CUresult (*cuMemcpyDtoH_fn)(void *, CUdeviceptr, size_t);
typedef CUresult (*cuModuleLoadDataEx_fn)(CUmodule *, const void *, unsigned int, void *, void *);
typedef CUresult (*cuModuleGetFunction_fn)(CUfunction *, CUmodule, const char *);
typedef CUresult (*cuLaunchKernel_fn)(CUfunction,
unsigned int,
unsigned int,
unsigned int,
unsigned int,
unsigned int,
unsigned int,
unsigned int,
CUstream,
void **,
void **);
typedef CUresult (*cuGetErrorName_fn)(CUresult, const char **);
typedef CUresult (*cuGetErrorString_fn)(CUresult, const char **);
struct cuda_api {
void *lib;
cuInit_fn cuInit;
cuDeviceGetCount_fn cuDeviceGetCount;
cuDeviceGet_fn cuDeviceGet;
cuDeviceGetName_fn cuDeviceGetName;
cuCtxCreate_fn cuCtxCreate;
cuCtxDestroy_fn cuCtxDestroy;
cuCtxSynchronize_fn cuCtxSynchronize;
cuMemAlloc_fn cuMemAlloc;
cuMemFree_fn cuMemFree;
cuMemcpyHtoD_fn cuMemcpyHtoD;
cuMemcpyDtoH_fn cuMemcpyDtoH;
cuModuleLoadDataEx_fn cuModuleLoadDataEx;
cuModuleGetFunction_fn cuModuleGetFunction;
cuLaunchKernel_fn cuLaunchKernel;
cuGetErrorName_fn cuGetErrorName;
cuGetErrorString_fn cuGetErrorString;
};
static int load_symbol(void *lib, const char *name, void **out) {
*out = dlsym(lib, name);
return *out != NULL;
}
static int load_cuda(struct cuda_api *api) {
memset(api, 0, sizeof(*api));
api->lib = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
if (!api->lib) {
return 0;
}
return
load_symbol(api->lib, "cuInit", (void **)&api->cuInit) &&
load_symbol(api->lib, "cuDeviceGetCount", (void **)&api->cuDeviceGetCount) &&
load_symbol(api->lib, "cuDeviceGet", (void **)&api->cuDeviceGet) &&
load_symbol(api->lib, "cuDeviceGetName", (void **)&api->cuDeviceGetName) &&
load_symbol(api->lib, "cuCtxCreate_v2", (void **)&api->cuCtxCreate) &&
load_symbol(api->lib, "cuCtxDestroy_v2", (void **)&api->cuCtxDestroy) &&
load_symbol(api->lib, "cuCtxSynchronize", (void **)&api->cuCtxSynchronize) &&
load_symbol(api->lib, "cuMemAlloc_v2", (void **)&api->cuMemAlloc) &&
load_symbol(api->lib, "cuMemFree_v2", (void **)&api->cuMemFree) &&
load_symbol(api->lib, "cuMemcpyHtoD_v2", (void **)&api->cuMemcpyHtoD) &&
load_symbol(api->lib, "cuMemcpyDtoH_v2", (void **)&api->cuMemcpyDtoH) &&
load_symbol(api->lib, "cuModuleLoadDataEx", (void **)&api->cuModuleLoadDataEx) &&
load_symbol(api->lib, "cuModuleGetFunction", (void **)&api->cuModuleGetFunction) &&
load_symbol(api->lib, "cuLaunchKernel", (void **)&api->cuLaunchKernel);
}
static const char *cu_error_name(struct cuda_api *api, CUresult rc) {
const char *value = NULL;
if (api->cuGetErrorName && api->cuGetErrorName(rc, &value) == CU_SUCCESS && value) {
return value;
}
return "CUDA_ERROR";
}
static const char *cu_error_string(struct cuda_api *api, CUresult rc) {
const char *value = NULL;
if (api->cuGetErrorString && api->cuGetErrorString(rc, &value) == CU_SUCCESS && value) {
return value;
}
return "unknown";
}
static int check_rc(struct cuda_api *api, const char *step, CUresult rc) {
if (rc == CU_SUCCESS) {
return 1;
}
fprintf(stderr, "%s failed: %s (%s)\n", step, cu_error_name(api, rc), cu_error_string(api, rc));
return 0;
}
static double now_seconds(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return (double)ts.tv_sec + ((double)ts.tv_nsec / 1000000000.0);
}
int main(int argc, char **argv) {
int seconds = 5;
int size_mb = 64;
for (int i = 1; i < argc; i++) {
if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) {
seconds = atoi(argv[++i]);
} else if ((strcmp(argv[i], "--size-mb") == 0 || strcmp(argv[i], "-m") == 0) && i + 1 < argc) {
size_mb = atoi(argv[++i]);
} else {
fprintf(stderr, "usage: %s [--seconds N] [--size-mb N]\n", argv[0]);
return 2;
}
}
if (seconds <= 0) {
seconds = 5;
}
if (size_mb <= 0) {
size_mb = 64;
}
struct cuda_api api;
if (!load_cuda(&api)) {
fprintf(stderr, "failed to load libcuda.so.1 or required Driver API symbols\n");
return 1;
}
load_symbol(api.lib, "cuGetErrorName", (void **)&api.cuGetErrorName);
load_symbol(api.lib, "cuGetErrorString", (void **)&api.cuGetErrorString);
if (!check_rc(&api, "cuInit", api.cuInit(0))) {
return 1;
}
int count = 0;
if (!check_rc(&api, "cuDeviceGetCount", api.cuDeviceGetCount(&count))) {
return 1;
}
if (count <= 0) {
fprintf(stderr, "no CUDA devices found\n");
return 1;
}
CUdevice dev = 0;
if (!check_rc(&api, "cuDeviceGet", api.cuDeviceGet(&dev, 0))) {
return 1;
}
char name[128] = {0};
if (!check_rc(&api, "cuDeviceGetName", api.cuDeviceGetName(name, (int)sizeof(name), dev))) {
return 1;
}
CUcontext ctx = NULL;
if (!check_rc(&api, "cuCtxCreate", api.cuCtxCreate(&ctx, 0, dev))) {
return 1;
}
size_t bytes = (size_t)size_mb * 1024 * 1024;
uint32_t words = (uint32_t)(bytes / sizeof(uint32_t));
if (words < 1024) {
words = 1024;
bytes = (size_t)words * sizeof(uint32_t);
}
uint32_t *host = (uint32_t *)malloc(bytes);
if (!host) {
fprintf(stderr, "malloc failed\n");
api.cuCtxDestroy(ctx);
return 1;
}
for (uint32_t i = 0; i < words; i++) {
host[i] = i ^ 0x12345678u;
}
CUdeviceptr device_mem = 0;
if (!check_rc(&api, "cuMemAlloc", api.cuMemAlloc(&device_mem, bytes))) {
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
if (!check_rc(&api, "cuMemcpyHtoD", api.cuMemcpyHtoD(device_mem, host, bytes))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
CUmodule module = NULL;
if (!check_rc(&api, "cuModuleLoadDataEx", api.cuModuleLoadDataEx(&module, ptx_source, 0, NULL, NULL))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
CUfunction kernel = NULL;
if (!check_rc(&api, "cuModuleGetFunction", api.cuModuleGetFunction(&kernel, module, "burn"))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
unsigned int threads = 256;
unsigned int blocks = (words + threads - 1) / threads;
uint32_t rounds = 256;
void *params[] = {&device_mem, &words, &rounds};
double start = now_seconds();
double deadline = start + (double)seconds;
unsigned long iterations = 0;
while (now_seconds() < deadline) {
if (!check_rc(&api, "cuLaunchKernel",
api.cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1, 0, NULL, params, NULL))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
iterations++;
}
if (!check_rc(&api, "cuCtxSynchronize", api.cuCtxSynchronize())) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
if (!check_rc(&api, "cuMemcpyDtoH", api.cuMemcpyDtoH(host, device_mem, bytes))) {
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 1;
}
uint64_t checksum = 0;
for (uint32_t i = 0; i < words; i += words / 256 ? words / 256 : 1) {
checksum += host[i];
}
double elapsed = now_seconds() - start;
printf("device=%s\n", name);
printf("duration_s=%.2f\n", elapsed);
printf("buffer_mb=%d\n", size_mb);
printf("iterations=%lu\n", iterations);
printf("checksum=%llu\n", (unsigned long long)checksum);
printf("status=OK\n");
api.cuMemFree(device_mem);
free(host);
api.cuCtxDestroy(ctx);
return 0;
}

View File

@@ -1,5 +1,5 @@
#!/bin/sh #!/bin/sh
# build-in-container.sh — build the bee ISO inside a Debian container. # build-in-container.sh — build the bee ISO inside the Debian builder container.
set -e set -e
@@ -70,6 +70,7 @@ set -- \
run --rm --privileged \ run --rm --privileged \
-v "${REPO_ROOT}:/work" \ -v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \ -v "${CACHE_DIR}:/cache" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \ -e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \ -e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \ -e TMPDIR=/cache/tmp \
@@ -83,6 +84,7 @@ if [ -n "$AUTH_KEYS" ]; then
-v "${REPO_ROOT}:/work" \ -v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \ -v "${CACHE_DIR}:/cache" \
-v "${AUTH_KEYS_DIR}:/tmp/bee-authkeys:ro" \ -v "${AUTH_KEYS_DIR}:/tmp/bee-authkeys:ro" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \ -e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \ -e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \ -e TMPDIR=/cache/tmp \

View File

@@ -1,14 +1,13 @@
#!/bin/sh #!/bin/sh
# build.sh — build bee ISO (Debian 12 / live-build) # build.sh — internal ISO build entrypoint executed inside the builder container.
#
# Single build script. Produces a bootable live ISO with SSH access, TUI, NVIDIA drivers.
#
# Run on Debian 12 builder VM as root after setup-builder.sh.
# Usage:
# sh iso/builder/build.sh [--authorized-keys /path/to/authorized_keys]
set -e set -e
if [ "${BEE_CONTAINER_BUILD:-0}" != "1" ]; then
echo "build.sh must run inside iso/builder/build-in-container.sh" >&2
exit 1
fi
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)" REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
BUILDER_DIR="${REPO_ROOT}/iso/builder" BUILDER_DIR="${REPO_ROOT}/iso/builder"
OVERLAY_DIR="${REPO_ROOT}/iso/overlay" OVERLAY_DIR="${REPO_ROOT}/iso/overlay"
@@ -39,8 +38,12 @@ echo "=== bee ISO build ==="
echo "Debian: ${DEBIAN_VERSION}, Kernel ABI: ${DEBIAN_KERNEL_ABI}, Go: ${GO_VERSION}" echo "Debian: ${DEBIAN_VERSION}, Kernel ABI: ${DEBIAN_KERNEL_ABI}, Go: ${GO_VERSION}"
echo "" echo ""
echo "=== syncing git submodules ==="
git -C "${REPO_ROOT}" submodule update --init --recursive
# --- compile bee binary (static, Linux amd64) --- # --- compile bee binary (static, Linux amd64) ---
BEE_BIN="${DIST_DIR}/bee-linux-amd64" BEE_BIN="${DIST_DIR}/bee-linux-amd64"
GPU_STRESS_BIN="${DIST_DIR}/bee-gpu-stress-linux-amd64"
NEED_BUILD=1 NEED_BUILD=1
if [ -f "$BEE_BIN" ]; then if [ -f "$BEE_BIN" ]; then
NEWEST_SRC=$(find "${REPO_ROOT}/audit" -name '*.go' -newer "$BEE_BIN" | head -1) NEWEST_SRC=$(find "${REPO_ROOT}/audit" -name '*.go' -newer "$BEE_BIN" | head -1)
@@ -70,6 +73,22 @@ else
echo "=== bee binary up to date, skipping build ===" echo "=== bee binary up to date, skipping build ==="
fi fi
GPU_STRESS_NEED_BUILD=1
if [ -f "$GPU_STRESS_BIN" ] && [ "${BUILDER_DIR}/bee-gpu-stress.c" -ot "$GPU_STRESS_BIN" ]; then
GPU_STRESS_NEED_BUILD=0
fi
if [ "$GPU_STRESS_NEED_BUILD" = "1" ]; then
echo "=== building bee-gpu-stress ==="
gcc -O2 -s -Wall -Wextra \
-o "$GPU_STRESS_BIN" \
"${BUILDER_DIR}/bee-gpu-stress.c" \
-ldl
echo "binary: $GPU_STRESS_BIN"
else
echo "=== bee-gpu-stress up to date, skipping build ==="
fi
echo "=== preparing staged overlay ===" echo "=== preparing staged overlay ==="
rm -rf "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}" rm -rf "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}"
mkdir -p "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}" mkdir -p "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}"
@@ -80,6 +99,7 @@ rm -f \
"${OVERLAY_STAGE_DIR}/etc/bee-release" \ "${OVERLAY_STAGE_DIR}/etc/bee-release" \
"${OVERLAY_STAGE_DIR}/root/.ssh/authorized_keys" \ "${OVERLAY_STAGE_DIR}/root/.ssh/authorized_keys" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee" \ "${OVERLAY_STAGE_DIR}/usr/local/bin/bee" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
# --- inject authorized_keys for SSH access --- # --- inject authorized_keys for SSH access ---
@@ -119,13 +139,15 @@ fi
mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/bin" mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/bin"
cp "${DIST_DIR}/bee-linux-amd64" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee" cp "${DIST_DIR}/bee-linux-amd64" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee" chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
cp "${GPU_STRESS_BIN}" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress"
# --- inject smoketest into overlay so it runs directly on the live CD --- # --- inject smoketest into overlay so it runs directly on the live CD ---
cp "${BUILDER_DIR}/smoketest.sh" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest" cp "${BUILDER_DIR}/smoketest.sh" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest" chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
# --- vendor utilities (optional pre-fetched binaries) --- # --- vendor utilities (optional pre-fetched binaries) ---
for tool in storcli64 sas2ircu sas3ircu mstflint; do for tool in storcli64 sas2ircu sas3ircu arcconf ssacli; do
if [ -f "${VENDOR_DIR}/${tool}" ]; then if [ -f "${VENDOR_DIR}/${tool}" ]; then
cp "${VENDOR_DIR}/${tool}" "${OVERLAY_STAGE_DIR}/usr/local/bin/${tool}" cp "${VENDOR_DIR}/${tool}" "${OVERLAY_STAGE_DIR}/usr/local/bin/${tool}"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/${tool}" || true chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/${tool}" || true

View File

@@ -8,7 +8,9 @@ echo "=== bee chroot setup ==="
# Enable bee services # Enable bee services
systemctl enable bee-network.service systemctl enable bee-network.service
systemctl enable bee-nvidia.service systemctl enable bee-nvidia.service
systemctl enable bee-preflight.service
systemctl enable bee-audit.service systemctl enable bee-audit.service
systemctl enable bee-web.service
systemctl enable bee-sshsetup.service systemctl enable bee-sshsetup.service
systemctl enable ssh.service systemctl enable ssh.service
systemctl enable qemu-guest-agent.service 2>/dev/null || true systemctl enable qemu-guest-agent.service 2>/dev/null || true
@@ -25,8 +27,8 @@ chmod +x /usr/local/bin/bee 2>/dev/null || true
# Reload udev rules # Reload udev rules
udevadm control --reload-rules 2>/dev/null || true udevadm control --reload-rules 2>/dev/null || true
# Create log directory # Create export directory
mkdir -p /var/log mkdir -p /appdata/bee/export
if [ -f /etc/sudoers.d/bee ]; then if [ -f /etc/sudoers.d/bee ]; then
chmod 0440 /etc/sudoers.d/bee chmod 0440 /etc/sudoers.d/bee

View File

@@ -11,6 +11,8 @@ lshw
iproute2 iproute2
isc-dhcp-client isc-dhcp-client
iputils-ping iputils-ping
ethtool
lm-sensors
qemu-guest-agent qemu-guest-agent
# SSH # SSH
@@ -25,8 +27,11 @@ less
vim-tiny vim-tiny
mc mc
htop htop
nvtop
sudo sudo
zstd zstd
mstflint
memtester
# QR codes (for displaying audit results) # QR codes (for displaying audit results)
qrencode qrencode

View File

@@ -1,75 +0,0 @@
#!/bin/sh
# setup-builder.sh — prepare Debian 12 host/VM as bee ISO builder
#
# Run once on a fresh Debian 12 (Bookworm) host/VM as root.
# After this script completes, the machine can build bee ISO images directly.
# Container alternative: use `iso/builder/build-in-container.sh`.
#
# Usage (on Debian VM):
# wget -O- https://git.mchus.pro/mchus/bee/raw/branch/main/iso/builder/setup-builder.sh | sh
# or: sh setup-builder.sh
set -e
. "$(dirname "$0")/VERSIONS" 2>/dev/null || true
GO_VERSION="${GO_VERSION:-1.23.6}"
DEBIAN_VERSION="${DEBIAN_VERSION:-12}"
DEBIAN_KERNEL_ABI="${DEBIAN_KERNEL_ABI:-6.1.0-28}"
echo "=== bee builder setup ==="
echo "Debian: $(cat /etc/debian_version)"
echo "Go target: ${GO_VERSION}"
echo "Kernel ABI: ${DEBIAN_KERNEL_ABI}"
echo ""
# --- system packages ---
export DEBIAN_FRONTEND=noninteractive
apt-get update -qq
apt-get install -y \
live-build \
debootstrap \
squashfs-tools \
xorriso \
grub-pc-bin \
grub-efi-amd64-bin \
mtools \
git \
wget \
curl \
tar \
xz-utils \
screen \
rsync \
build-essential \
gcc \
make \
perl \
"linux-headers-${DEBIAN_KERNEL_ABI}-amd64"
echo "linux-headers installed: $(dpkg -l "linux-headers-${DEBIAN_KERNEL_ABI}-amd64" | awk '/^ii/{print $3}')"
# --- Go toolchain ---
echo ""
echo "=== installing Go ${GO_VERSION} ==="
if [ -d /usr/local/go ] && /usr/local/go/bin/go version 2>/dev/null | grep -q "${GO_VERSION}"; then
echo "Go ${GO_VERSION} already installed"
else
ARCH=$(uname -m)
case "$ARCH" in
x86_64) GOARCH=amd64 ;;
aarch64) GOARCH=arm64 ;;
*) echo "unsupported arch: $ARCH"; exit 1 ;;
esac
wget -O /tmp/go.tar.gz \
"https://go.dev/dl/go${GO_VERSION}.linux-${GOARCH}.tar.gz"
rm -rf /usr/local/go
tar -C /usr/local -xzf /tmp/go.tar.gz
rm /tmp/go.tar.gz
fi
export PATH="$PATH:/usr/local/go/bin"
echo "Go: $(go version)"
echo ""
echo "=== builder setup complete ==="
echo "Next: sh iso/builder/build.sh"

View File

@@ -96,7 +96,7 @@ done
echo "" echo ""
echo "-- systemd services --" echo "-- systemd services --"
for svc in bee-nvidia bee-network bee-audit; do for svc in bee-nvidia bee-network bee-preflight bee-audit bee-web; do
if systemctl is-active --quiet "$svc" 2>/dev/null; then if systemctl is-active --quiet "$svc" 2>/dev/null; then
ok "service active: $svc" ok "service active: $svc"
else else
@@ -104,6 +104,20 @@ for svc in bee-nvidia bee-network bee-audit; do
fi fi
done done
echo ""
echo "-- runtime health --"
if [ -f /appdata/bee/export/runtime-health.json ] && [ -s /appdata/bee/export/runtime-health.json ]; then
ok "runtime: runtime-health.json present and non-empty"
else
fail "runtime: runtime-health.json missing or empty"
fi
if [ -f /appdata/bee/export/runtime-health.log ]; then
info "last runtime log line: $(tail -1 /appdata/bee/export/runtime-health.log)"
else
warn "runtime: no log found at /appdata/bee/export/runtime-health.log"
fi
for svc in ssh bee-sshsetup; do for svc in ssh bee-sshsetup; do
if systemctl is-active --quiet "$svc" 2>/dev/null \ if systemctl is-active --quiet "$svc" 2>/dev/null \
|| systemctl show "$svc" --property=ActiveState 2>/dev/null | grep -q "inactive\|exited"; then || systemctl show "$svc" --property=ActiveState 2>/dev/null | grep -q "inactive\|exited"; then
@@ -126,29 +140,43 @@ fi
echo "" echo ""
echo "-- audit last run --" echo "-- audit last run --"
if [ -f /var/log/bee-audit.json ] && [ -s /var/log/bee-audit.json ]; then if [ -f /appdata/bee/export/bee-audit.json ] && [ -s /appdata/bee/export/bee-audit.json ]; then
ok "audit: bee-audit.json present and non-empty" ok "audit: bee-audit.json present and non-empty"
info "size: $(du -sh /var/log/bee-audit.json | cut -f1)" info "size: $(du -sh /appdata/bee/export/bee-audit.json | cut -f1)"
else else
fail "audit: bee-audit.json missing or empty" fail "audit: bee-audit.json missing or empty"
fi fi
if [ -f /var/log/bee-audit.log ]; then if [ -f /appdata/bee/export/bee-audit.log ]; then
last_line=$(tail -1 /var/log/bee-audit.log) last_line=$(tail -1 /appdata/bee/export/bee-audit.log)
info "last log line: $last_line" info "last log line: $last_line"
if grep -q "audit output written" /var/log/bee-audit.log 2>/dev/null; then if grep -q "audit output written" /appdata/bee/export/bee-audit.log 2>/dev/null; then
ok "audit: completed successfully" ok "audit: completed successfully"
else else
warn "audit: 'audit output written' not found in log — may have failed" warn "audit: 'audit output written' not found in log — may have failed"
fi fi
if grep -q "nvidia: enrichment skipped\|nvidia.*skipped\|enrichment skipped" /var/log/bee-audit.log 2>/dev/null; then if grep -q "nvidia: enrichment skipped\|nvidia.*skipped\|enrichment skipped" /appdata/bee/export/bee-audit.log 2>/dev/null; then
reason=$(grep -E "nvidia.*skipped|enrichment skipped" /var/log/bee-audit.log | tail -1) reason=$(grep -E "nvidia.*skipped|enrichment skipped" /appdata/bee/export/bee-audit.log | tail -1)
fail "audit: nvidia enrichment skipped — $reason" fail "audit: nvidia enrichment skipped — $reason"
else else
ok "audit: nvidia enrichment OK (no skip message)" ok "audit: nvidia enrichment OK (no skip message)"
fi fi
else else
warn "audit: no log found at /var/log/bee-audit.log" warn "audit: no log found at /appdata/bee/export/bee-audit.log"
fi
echo ""
echo "-- bee web --"
if [ -f /appdata/bee/export/bee-web.log ]; then
info "last web log line: $(tail -1 /appdata/bee/export/bee-web.log)"
else
warn "web: no log found at /appdata/bee/export/bee-web.log"
fi
if bash -c 'exec 3<>/dev/tcp/127.0.0.1/80 && printf "GET /healthz HTTP/1.0\r\nHost: localhost\r\n\r\n" >&3 && grep -q "^ok$" <&3'; then
ok "web: health endpoint reachable on 127.0.0.1:80"
else
fail "web: health endpoint not reachable on 127.0.0.1:80"
fi fi
echo "" echo ""

View File

@@ -9,7 +9,8 @@
Hardware Audit LiveCD Hardware Audit LiveCD
Build: %%BUILD_INFO%% Build: %%BUILD_INFO%%
Logs: /var/log/bee-audit.json /var/log/bee-network.log Export dir: /appdata/bee/export
Self-check: /appdata/bee/export/runtime-health.json
Open TUI: bee-tui Open TUI: bee-tui

View File

@@ -1,12 +1,13 @@
[Unit] [Unit]
Description=Bee: run hardware audit Description=Bee: run hardware audit
After=bee-network.service bee-nvidia.service After=bee-network.service bee-nvidia.service bee-preflight.service
Before=bee-web.service
[Service] [Service]
Type=oneshot Type=oneshot
ExecStart=/bin/sh -c '/usr/local/bin/bee audit --runtime livecd --output file:/var/log/bee-audit.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-audit] WARN: audit exited with rc=$rc"; fi; exit 0' ExecStart=/bin/sh -c '/usr/local/bin/bee audit --runtime livecd --output file:/appdata/bee/export/bee-audit.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-audit] WARN: audit exited with rc=$rc"; fi; exit 0'
StandardOutput=append:/var/log/bee-audit.log StandardOutput=append:/appdata/bee/export/bee-audit.log
StandardError=append:/var/log/bee-audit.log StandardError=append:/appdata/bee/export/bee-audit.log
RemainAfterExit=yes RemainAfterExit=yes
[Install] [Install]

View File

@@ -6,8 +6,8 @@ Before=network-online.target bee-audit.service
[Service] [Service]
Type=oneshot Type=oneshot
ExecStart=/usr/local/bin/bee-network.sh ExecStart=/usr/local/bin/bee-network.sh
StandardOutput=append:/var/log/bee-network.log StandardOutput=append:/appdata/bee/export/bee-network.log
StandardError=append:/var/log/bee-network.log StandardError=append:/appdata/bee/export/bee-network.log
RemainAfterExit=yes RemainAfterExit=yes
[Install] [Install]

View File

@@ -6,8 +6,8 @@ Before=bee-audit.service
[Service] [Service]
Type=oneshot Type=oneshot
ExecStart=/usr/local/bin/bee-nvidia-load ExecStart=/usr/local/bin/bee-nvidia-load
StandardOutput=journal StandardOutput=append:/appdata/bee/export/bee-nvidia.log
StandardError=journal StandardError=append:/appdata/bee/export/bee-nvidia.log
RemainAfterExit=yes RemainAfterExit=yes
[Install] [Install]

View File

@@ -0,0 +1,14 @@
[Unit]
Description=Bee: runtime preflight self-check
After=bee-network.service bee-nvidia.service
Before=bee-audit.service
[Service]
Type=oneshot
ExecStart=/bin/sh -c '/usr/local/bin/bee preflight --output file:/appdata/bee/export/runtime-health.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-preflight] WARN: preflight exited with rc=$rc"; fi; exit 0'
StandardOutput=append:/appdata/bee/export/runtime-health.log
StandardError=append:/appdata/bee/export/runtime-health.log
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

View File

@@ -6,6 +6,8 @@ Before=ssh.service
[Service] [Service]
Type=oneshot Type=oneshot
ExecStart=/usr/local/bin/bee-sshsetup ExecStart=/usr/local/bin/bee-sshsetup
StandardOutput=append:/appdata/bee/export/bee-sshsetup.log
StandardError=append:/appdata/bee/export/bee-sshsetup.log
RemainAfterExit=yes RemainAfterExit=yes
[Install] [Install]

View File

@@ -0,0 +1,15 @@
[Unit]
Description=Bee: hardware audit web viewer
After=bee-network.service bee-audit.service
Wants=bee-audit.service
[Service]
Type=simple
ExecStart=/usr/local/bin/bee web --listen :80 --audit-path /appdata/bee/export/bee-audit.json --export-dir /appdata/bee/export --title "Bee Hardware Audit"
Restart=always
RestartSec=2
StandardOutput=append:/appdata/bee/export/bee-web.log
StandardError=append:/appdata/bee/export/bee-web.log
[Install]
WantedBy=multi-user.target

View File

@@ -20,10 +20,10 @@ fi
for iface in $interfaces; do for iface in $interfaces; do
log "bringing up $iface" log "bringing up $iface"
ip link set "$iface" up 2>/dev/null || { log "WARN: could not bring up $iface"; continue; } ip link set "$iface" up || { log "WARN: could not bring up $iface"; continue; }
# DHCP in background — non-blocking, retries indefinitely # DHCP in background — non-blocking, keep dhclient verbose output in the service log.
dhclient -nw "$iface" 2>/dev/null & dhclient -4 -v -nw "$iface" &
log "DHCP started for $iface (pid $!)" log "DHCP started for $iface (pid $!)"
done done

View File

@@ -16,12 +16,15 @@ fi
log "module dir: $NVIDIA_KO_DIR" log "module dir: $NVIDIA_KO_DIR"
ls "$NVIDIA_KO_DIR"/*.ko 2>/dev/null | sed 's/^/ /' || true ls "$NVIDIA_KO_DIR"/*.ko 2>/dev/null | sed 's/^/ /' || true
# Some kernels expose backlight helper symbols only after loading `video`.
modprobe video >/dev/null 2>&1 && log "loaded helper module: video" || log "helper module unavailable: video"
# Load modules via insmod (direct load — no depmod needed) # Load modules via insmod (direct load — no depmod needed)
for mod in nvidia nvidia-modeset nvidia-uvm; do for mod in nvidia nvidia-modeset nvidia-uvm; do
ko="$NVIDIA_KO_DIR/${mod}.ko" ko="$NVIDIA_KO_DIR/${mod}.ko"
[ -f "$ko" ] || ko="$NVIDIA_KO_DIR/${mod//-/_}.ko" [ -f "$ko" ] || ko="$NVIDIA_KO_DIR/${mod//-/_}.ko"
if [ -f "$ko" ]; then if [ -f "$ko" ]; then
if insmod "$ko" 2>/dev/null; then if insmod "$ko"; then
log "loaded: $mod" log "loaded: $mod"
else else
log "WARN: failed to load: $mod" log "WARN: failed to load: $mod"
@@ -33,25 +36,25 @@ for mod in nvidia nvidia-modeset nvidia-uvm; do
done done
# Create /dev/nvidia* device nodes (udev rules absent since we use .run installer) # Create /dev/nvidia* device nodes (udev rules absent since we use .run installer)
nvidia_major=$(grep -m1 ' nvidiactl$' /proc/devices 2>/dev/null | awk '{print $1}') nvidia_major=$(grep -m1 ' nvidiactl$' /proc/devices | awk '{print $1}')
if [ -n "$nvidia_major" ]; then if [ -n "$nvidia_major" ]; then
mknod -m 666 /dev/nvidiactl c "$nvidia_major" 255 2>/dev/null \ mknod -m 666 /dev/nvidiactl c "$nvidia_major" 255 \
&& log "created /dev/nvidiactl (major $nvidia_major)" \ && log "created /dev/nvidiactl (major $nvidia_major)" \
|| log "WARN: /dev/nvidiactl already exists or mknod failed" || log "WARN: /dev/nvidiactl already exists or mknod failed"
for i in 0 1 2 3 4 5 6 7; do for i in 0 1 2 3 4 5 6 7; do
mknod -m 666 "/dev/nvidia$i" c "$nvidia_major" "$i" 2>/dev/null || true mknod -m 666 "/dev/nvidia$i" c "$nvidia_major" "$i" || true
done done
log "created /dev/nvidia{0-7}" log "created /dev/nvidia{0-7}"
else else
log "WARN: nvidiactl not in /proc/devices — no GPU hardware present?" log "WARN: nvidiactl not in /proc/devices — no GPU hardware present?"
fi fi
uvm_major=$(grep -m1 ' nvidia-uvm$' /proc/devices 2>/dev/null | awk '{print $1}') uvm_major=$(grep -m1 ' nvidia-uvm$' /proc/devices | awk '{print $1}')
if [ -n "$uvm_major" ]; then if [ -n "$uvm_major" ]; then
mknod -m 666 /dev/nvidia-uvm c "$uvm_major" 0 2>/dev/null \ mknod -m 666 /dev/nvidia-uvm c "$uvm_major" 0 \
&& log "created /dev/nvidia-uvm (major $uvm_major)" \ && log "created /dev/nvidia-uvm (major $uvm_major)" \
|| log "WARN: /dev/nvidia-uvm already exists" || log "WARN: /dev/nvidia-uvm already exists"
mknod -m 666 /dev/nvidia-uvm-tools c "$uvm_major" 1 2>/dev/null || true mknod -m 666 /dev/nvidia-uvm-tools c "$uvm_major" 1 || true
else else
log "WARN: nvidia-uvm not in /proc/devices" log "WARN: nvidia-uvm not in /proc/devices"
fi fi

View File

@@ -1,5 +1,5 @@
#!/bin/sh #!/bin/sh
# run-builder.sh — trigger ISO build on remote Debian 12 builder VM # run-builder.sh — trigger containerized ISO build on a remote builder host
# #
# Usage: # Usage:
# sh scripts/run-builder.sh # sh scripts/run-builder.sh
@@ -79,7 +79,7 @@ screen -S bee-build -X quit 2>/dev/null || true
echo "--- starting build in screen session (survives SSH disconnect) ---" echo "--- starting build in screen session (survives SSH disconnect) ---"
echo "--- log: \$LOG ---" echo "--- log: \$LOG ---"
screen -dmS bee-build sh -c "sudo sh iso/builder/build.sh ${EXTRA_ARGS} > \$LOG 2>&1; echo \$? > /tmp/bee-build-exit" screen -dmS bee-build sh -c "sh iso/builder/build-in-container.sh ${EXTRA_ARGS} > \$LOG 2>&1; echo \$? > /tmp/bee-build-exit"
# Stream log until build finishes # Stream log until build finishes
echo "--- streaming build log (Ctrl+C safe — build continues on VM) ---" echo "--- streaming build log (Ctrl+C safe — build continues on VM) ---"