Compare commits
14 Commits
livecd-v0.
...
audit/v1.0
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
76a17937f3 | ||
|
|
b965184e71 | ||
|
|
b25a2f6d30 | ||
|
|
d18cde19c1 | ||
|
|
78c6dfc0ef | ||
|
|
72cf482ad3 | ||
|
|
a6023372b1 | ||
|
|
ab5a4be7ac | ||
|
|
b8c235b5ac | ||
|
|
b483e2ce35 | ||
|
|
17f0bda45e | ||
|
|
591164a251 | ||
|
|
ef4ec5695d | ||
|
|
f1e096cabe |
3
.gitmodules
vendored
3
.gitmodules
vendored
@@ -1,3 +1,6 @@
|
||||
[submodule "bible"]
|
||||
path = bible
|
||||
url = https://git.mchus.pro/mchus/bible.git
|
||||
[submodule "internal/chart"]
|
||||
path = internal/chart
|
||||
url = https://git.mchus.pro/reanimator/chart.git
|
||||
|
||||
41
PLAN.md
41
PLAN.md
@@ -4,13 +4,13 @@ Hardware audit LiveCD for offline server inventory.
|
||||
Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
|
||||
|
||||
**Principle:** OS-level collection — reads hardware directly, not through BMC.
|
||||
Fully unattended — no user interaction required at any stage. Boot → update → audit → output → done.
|
||||
All errors are logged, never presented interactively. Every failure path has a silent fallback.
|
||||
Automatic boot audit plus operator console. Boot runs audit immediately, but local/SSH operators can rerun checks through the TUI and CLI.
|
||||
Errors are logged and should not block boot on partial collector failures.
|
||||
Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.
|
||||
|
||||
---
|
||||
|
||||
## Status snapshot (2026-03-06)
|
||||
## Status snapshot (2026-03-14)
|
||||
|
||||
### Phase 1 — Go Audit Binary
|
||||
|
||||
@@ -23,8 +23,10 @@ Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials,
|
||||
- 1.7 PSU collector — **DONE (basic FRU path)**
|
||||
- 1.8 NVIDIA GPU enrichment — **DONE**
|
||||
- 1.8b Component wear / age telemetry — **DONE** (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
|
||||
- 1.8c Storage health verdicts — **DONE** (SMART/NVMe warning/failed status derivation)
|
||||
- 1.9 Mellanox/NVIDIA NIC enrichment — **DONE** (mstflint + ethtool firmware fallback)
|
||||
- 1.10 RAID controller enrichment — **DONE (initial multi-tool support)** (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
|
||||
- 1.11 PSU SDR health — **DONE** (`ipmitool sdr` merged with FRU inventory)
|
||||
- 1.11 Output and export workflow — **DONE** (explicit file output + manual removable export via TUI)
|
||||
- 1.12 Integration test (local) — **DONE** (`scripts/test-local.sh`)
|
||||
|
||||
@@ -33,9 +35,14 @@ Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials,
|
||||
- Current implementation uses Debian 12 `live-build`, `systemd`, and OpenSSH.
|
||||
- Network bring-up on boot — **DONE**
|
||||
- Boot services (`bee-network`, `bee-nvidia`, `bee-audit`, `bee-sshsetup`) — **DONE**
|
||||
- Local console UX (`bee` autologin on `tty1`, `menu` auto-start, TUI privilege escalation via `sudo -n`) — **DONE**
|
||||
- VM/debug support (`qemu-guest-agent`, serial console, virtual GPU initramfs modules) — **DONE**
|
||||
- Vendor utilities in overlay — **DONE**
|
||||
- Build metadata + staged overlay injection — **DONE**
|
||||
- Builder container cache persisted outside container writable layer — **DONE**
|
||||
- ISO volume label `BEE` — **DONE**
|
||||
- Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior.
|
||||
- Real-hardware validation remains **PENDING**; current validation is limited to local/libvirt VM boot + service checks.
|
||||
|
||||
---
|
||||
|
||||
@@ -265,13 +272,10 @@ ISO image bootable via BMC virtual media or USB. Runs boot services automaticall
|
||||
|
||||
### 2.1 — Builder environment
|
||||
|
||||
`iso/builder/setup-builder.sh` prepares a Debian 12 host/VM with:
|
||||
- `live-build`, `debootstrap`, bootloader tooling, kernel headers
|
||||
- Go toolchain
|
||||
- everything needed to compile the `bee` binary and NVIDIA modules
|
||||
|
||||
`iso/builder/build-in-container.sh` offers the same builder stack in a Debian 12 container image.
|
||||
The container run is privileged because `live-build` needs mount/chroot/loop capabilities.
|
||||
`iso/builder/build-in-container.sh` is the only supported builder entrypoint.
|
||||
It builds a Debian 12 builder image with `live-build`, toolchains, and pinned kernel headers,
|
||||
then runs the ISO assembly in a privileged container because `live-build` needs
|
||||
mount/chroot/loop capabilities.
|
||||
|
||||
`iso/builder/build.sh` orchestrates the full ISO build:
|
||||
1. compile the Go `bee` binary
|
||||
@@ -334,8 +338,14 @@ Planned code shape:
|
||||
### 2.5 — Operator workflows
|
||||
|
||||
- Automatic boot audit writes JSON to `/var/log/bee-audit.json`
|
||||
- `tty1` autologins into `bee` and auto-runs `menu`
|
||||
- `menu` launches the LiveCD wrapper `bee-tui`, which escalates to `root` via `sudo -n`
|
||||
- `bee tui` can rerun the audit manually
|
||||
- `bee tui` can export the latest audit JSON to removable media
|
||||
- `bee tui` can show health summary and run NVIDIA/memory/storage acceptance tests
|
||||
- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-stress`
|
||||
- SAT summaries now expose `overall_status` plus per-job `OK/FAILED/UNSUPPORTED`
|
||||
- Memory/GPU SAT runtime defaults can be overridden via `BEE_MEMTESTER_*` and `BEE_GPU_STRESS_*`
|
||||
- removable export requires explicit target selection, mount, confirmation, copy, and cleanup
|
||||
|
||||
### 2.6 — Vendor utilities and optional assets
|
||||
@@ -343,7 +353,9 @@ Planned code shape:
|
||||
Optional binaries live in `iso/vendor/` and are included when present:
|
||||
- `storcli64`
|
||||
- `sas2ircu`, `sas3ircu`
|
||||
- `mstflint`
|
||||
- `arcconf`
|
||||
- `ssacli`
|
||||
- `mstflint` (via Debian package set)
|
||||
|
||||
Missing optional tools do not fail the build or boot.
|
||||
|
||||
@@ -358,6 +370,7 @@ Missing optional tools do not fail the build or boot.
|
||||
Current release model:
|
||||
- shipping a new ISO means a full rebuild
|
||||
- build metadata is embedded into `/etc/bee-release` and `motd`
|
||||
- current ISO label is `BEE`
|
||||
- binary self-update remains deferred; no automatic USB/network patching is part of the current runtime
|
||||
|
||||
---
|
||||
@@ -374,9 +387,9 @@ No "works on my Mac" drift.
|
||||
1.2 board collector → first real data
|
||||
1.3 CPU collector → +CPUs
|
||||
|
||||
--- BUILDER + DEBUG ISO (unblock real-hardware testing) ---
|
||||
--- BUILDER + BEE ISO (unblock real-hardware testing) ---
|
||||
|
||||
2.1 builder setup → Debian host/VM or privileged container with build deps
|
||||
2.1 builder setup → privileged container with build deps
|
||||
2.2 debug ISO profile → minimal Debian ISO: `bee` binary + OpenSSH + all packages
|
||||
2.3 boot on real server → SSH in, verify packages present, run audit manually
|
||||
|
||||
@@ -397,7 +410,7 @@ No "works on my Mac" drift.
|
||||
2.4 NVIDIA driver build → driver compiled into overlay
|
||||
2.5 network bring-up on boot → DHCP on all interfaces
|
||||
2.6 systemd boot service → audit runs on boot automatically
|
||||
2.7 vendor utilities → storcli/sas2ircu/mstflint in image
|
||||
2.7 vendor utilities → storcli/sas2ircu/arcconf/ssacli in image
|
||||
2.8 release workflow → versioning + release notes
|
||||
2.9 operator export flow → explicit TUI export to removable media
|
||||
```
|
||||
|
||||
@@ -12,6 +12,7 @@ import (
|
||||
"bee/audit/internal/platform"
|
||||
"bee/audit/internal/runtimeenv"
|
||||
"bee/audit/internal/tui"
|
||||
"bee/audit/internal/webui"
|
||||
)
|
||||
|
||||
var Version = "dev"
|
||||
@@ -27,11 +28,14 @@ func run(args []string, stdout, stderr io.Writer) int {
|
||||
|
||||
if len(args) == 0 {
|
||||
printRootUsage(stderr)
|
||||
return 1
|
||||
return 2
|
||||
}
|
||||
|
||||
switch args[0] {
|
||||
case "help", "--help", "-h":
|
||||
if len(args) > 1 {
|
||||
return runHelp(args[1:], stdout, stderr)
|
||||
}
|
||||
printRootUsage(stdout)
|
||||
return 0
|
||||
case "audit":
|
||||
@@ -40,6 +44,12 @@ func run(args []string, stdout, stderr io.Writer) int {
|
||||
return runTUI(args[1:], stdout, stderr)
|
||||
case "export":
|
||||
return runExport(args[1:], stdout, stderr)
|
||||
case "preflight":
|
||||
return runPreflight(args[1:], stdout, stderr)
|
||||
case "support-bundle":
|
||||
return runSupportBundle(args[1:], stdout, stderr)
|
||||
case "web":
|
||||
return runWeb(args[1:], stdout, stderr)
|
||||
case "sat":
|
||||
return runSAT(args[1:], stdout, stderr)
|
||||
case "version", "--version", "-version":
|
||||
@@ -48,17 +58,47 @@ func run(args []string, stdout, stderr io.Writer) int {
|
||||
default:
|
||||
fmt.Fprintf(stderr, "bee: unknown command %q\n\n", args[0])
|
||||
printRootUsage(stderr)
|
||||
return 1
|
||||
return 2
|
||||
}
|
||||
}
|
||||
|
||||
func printRootUsage(w io.Writer) {
|
||||
fmt.Fprintln(w, `bee commands:
|
||||
bee audit --runtime auto|local|livecd --output stdout|file:<path>
|
||||
bee preflight --output stdout|file:<path>
|
||||
bee tui --runtime auto|local|livecd
|
||||
bee export --target <device>
|
||||
bee sat nvidia
|
||||
bee version`)
|
||||
bee support-bundle --output stdout|file:<path>
|
||||
bee web --listen :80 --audit-path `+app.DefaultAuditJSONPath+`
|
||||
bee sat nvidia|memory|storage
|
||||
bee version
|
||||
bee help [command]`)
|
||||
}
|
||||
|
||||
func runHelp(args []string, stdout, stderr io.Writer) int {
|
||||
switch args[0] {
|
||||
case "audit":
|
||||
return runAudit([]string{"--help"}, stdout, stdout)
|
||||
case "tui":
|
||||
return runTUI([]string{"--help"}, stdout, stdout)
|
||||
case "export":
|
||||
return runExport([]string{"--help"}, stdout, stdout)
|
||||
case "preflight":
|
||||
return runPreflight([]string{"--help"}, stdout, stdout)
|
||||
case "support-bundle":
|
||||
return runSupportBundle([]string{"--help"}, stdout, stdout)
|
||||
case "web":
|
||||
return runWeb([]string{"--help"}, stdout, stdout)
|
||||
case "sat":
|
||||
return runSAT([]string{"--help"}, stdout, stderr)
|
||||
case "version":
|
||||
fmt.Fprintln(stdout, "usage: bee version")
|
||||
return 0
|
||||
default:
|
||||
fmt.Fprintf(stderr, "bee help: unknown command %q\n\n", args[0])
|
||||
printRootUsage(stderr)
|
||||
return 2
|
||||
}
|
||||
}
|
||||
|
||||
func runAudit(args []string, stdout, stderr io.Writer) int {
|
||||
@@ -72,6 +112,13 @@ func runAudit(args []string, stdout, stderr io.Writer) int {
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
if *showVersion {
|
||||
@@ -107,6 +154,13 @@ func runTUI(args []string, stdout, stderr io.Writer) int {
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
|
||||
@@ -116,6 +170,10 @@ func runTUI(args []string, stdout, stderr io.Writer) int {
|
||||
return 1
|
||||
}
|
||||
|
||||
slog.SetDefault(slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{
|
||||
Level: slog.LevelInfo,
|
||||
})))
|
||||
|
||||
application := app.New(platform.New())
|
||||
if err := tui.Run(application, runtimeInfo.Mode); err != nil {
|
||||
slog.Error("run tui", "err", err)
|
||||
@@ -133,6 +191,13 @@ func runExport(args []string, stdout, stderr io.Writer) int {
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
if strings.TrimSpace(*targetDevice) == "" {
|
||||
@@ -164,22 +229,160 @@ func runExport(args []string, stdout, stderr io.Writer) int {
|
||||
return 1
|
||||
}
|
||||
|
||||
func runSAT(args []string, stdout, stderr io.Writer) int {
|
||||
if len(args) == 0 || args[0] == "help" || args[0] == "--help" || args[0] == "-h" {
|
||||
fmt.Fprintln(stderr, "usage: bee sat nvidia")
|
||||
func runPreflight(args []string, stdout, stderr io.Writer) int {
|
||||
fs := flag.NewFlagSet("preflight", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
|
||||
fs.Usage = func() {
|
||||
fmt.Fprintf(stderr, "usage: bee preflight [--output stdout|file:%s]\n", app.DefaultRuntimeJSONPath)
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if args[0] != "nvidia" {
|
||||
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", args[0])
|
||||
fmt.Fprintln(stderr, "usage: bee sat nvidia")
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
application := app.New(platform.New())
|
||||
archive, err := application.RunNvidiaAcceptancePack("")
|
||||
path, err := application.RunRuntimePreflight(*output)
|
||||
if err != nil {
|
||||
slog.Error("run nvidia sat", "err", err)
|
||||
slog.Error("run preflight", "err", err)
|
||||
return 1
|
||||
}
|
||||
slog.Info("nvidia sat archive written", "path", archive)
|
||||
if path != "stdout" {
|
||||
slog.Info("runtime health written", "path", path)
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func runSupportBundle(args []string, stdout, stderr io.Writer) int {
|
||||
fs := flag.NewFlagSet("support-bundle", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
|
||||
fs.Usage = func() {
|
||||
fmt.Fprintln(stderr, "usage: bee support-bundle [--output stdout|file:<path>]")
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
path, err := app.BuildSupportBundle(app.DefaultExportDir)
|
||||
if err != nil {
|
||||
slog.Error("build support bundle", "err", err)
|
||||
return 1
|
||||
}
|
||||
defer os.Remove(path)
|
||||
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
slog.Error("read support bundle", "err", err)
|
||||
return 1
|
||||
}
|
||||
switch {
|
||||
case *output == "stdout":
|
||||
if _, err := stdout.Write(raw); err != nil {
|
||||
slog.Error("write support bundle stdout", "err", err)
|
||||
return 1
|
||||
}
|
||||
case strings.HasPrefix(*output, "file:"):
|
||||
dst := strings.TrimPrefix(*output, "file:")
|
||||
if err := os.WriteFile(dst, raw, 0644); err != nil {
|
||||
slog.Error("write support bundle", "err", err)
|
||||
return 1
|
||||
}
|
||||
slog.Info("support bundle written", "path", dst)
|
||||
default:
|
||||
fmt.Fprintln(stderr, "bee support-bundle: unknown output destination")
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func runWeb(args []string, stdout, stderr io.Writer) int {
|
||||
fs := flag.NewFlagSet("web", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
listenAddr := fs.String("listen", ":8080", "listen address, e.g. :80")
|
||||
auditPath := fs.String("audit-path", app.DefaultAuditJSONPath, "path to the latest audit JSON snapshot")
|
||||
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles")
|
||||
title := fs.String("title", "Bee Hardware Audit", "page title")
|
||||
fs.Usage = func() {
|
||||
fmt.Fprintf(stderr, "usage: bee web [--listen :80] [--audit-path %s] [--export-dir %s] [--title \"Bee Hardware Audit\"]\n", app.DefaultAuditJSONPath, app.DefaultExportDir)
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
|
||||
slog.Info("starting bee web", "listen", *listenAddr, "audit_path", *auditPath)
|
||||
if err := webui.ListenAndServe(*listenAddr, webui.HandlerOptions{
|
||||
Title: *title,
|
||||
AuditPath: *auditPath,
|
||||
ExportDir: *exportDir,
|
||||
}); err != nil {
|
||||
slog.Error("run web", "err", err)
|
||||
return 1
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func runSAT(args []string, stdout, stderr io.Writer) int {
|
||||
if len(args) == 0 {
|
||||
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
|
||||
return 2
|
||||
}
|
||||
if args[0] == "help" || args[0] == "--help" || args[0] == "-h" {
|
||||
fmt.Fprintln(stdout, "usage: bee sat nvidia|memory|storage")
|
||||
return 0
|
||||
}
|
||||
if args[0] != "nvidia" && args[0] != "memory" && args[0] != "storage" {
|
||||
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", args[0])
|
||||
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
|
||||
return 2
|
||||
}
|
||||
if len(args) > 1 {
|
||||
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
|
||||
return 2
|
||||
}
|
||||
application := app.New(platform.New())
|
||||
var (
|
||||
archive string
|
||||
err error
|
||||
label string
|
||||
)
|
||||
switch args[0] {
|
||||
case "nvidia":
|
||||
label = "nvidia"
|
||||
archive, err = application.RunNvidiaAcceptancePack("")
|
||||
case "memory":
|
||||
label = "memory"
|
||||
archive, err = application.RunMemoryAcceptancePack("")
|
||||
case "storage":
|
||||
label = "storage"
|
||||
archive, err = application.RunStorageAcceptancePack("")
|
||||
}
|
||||
if err != nil {
|
||||
slog.Error("run sat", "target", label, "err", err)
|
||||
return 1
|
||||
}
|
||||
slog.Info("sat archive written", "target", label, "path", archive)
|
||||
return 0
|
||||
}
|
||||
|
||||
@@ -24,8 +24,8 @@ func TestRunNoArgsPrintsUsage(t *testing.T) {
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run(nil, &stdout, &stderr)
|
||||
if rc != 1 {
|
||||
t.Fatalf("rc=%d want 1", rc)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "bee commands:") {
|
||||
t.Fatalf("stderr missing root usage:\n%s", stderr.String())
|
||||
@@ -37,8 +37,8 @@ func TestRunUnknownCommand(t *testing.T) {
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"wat"}, &stdout, &stderr)
|
||||
if rc != 1 {
|
||||
t.Fatalf("rc=%d want 1", rc)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), `unknown command "wat"`) {
|
||||
t.Fatalf("stderr missing unknown command message:\n%s", stderr.String())
|
||||
@@ -86,11 +86,63 @@ func TestRunSATUsage(t *testing.T) {
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee sat nvidia") {
|
||||
if !strings.Contains(stderr.String(), "usage: bee sat nvidia|memory|storage") {
|
||||
t.Fatalf("stderr missing sat usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunPreflightRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"preflight", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee preflight") {
|
||||
t.Fatalf("stderr missing preflight usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSupportBundleRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"support-bundle", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee support-bundle") {
|
||||
t.Fatalf("stderr missing support-bundle usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunHelpForSubcommand(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"help", "export"}, &stdout, &stderr)
|
||||
if rc != 0 {
|
||||
t.Fatalf("rc=%d want 0", rc)
|
||||
}
|
||||
if !strings.Contains(stdout.String(), "usage: bee export --target <device>") {
|
||||
t.Fatalf("stdout missing export help:\n%s", stdout.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunHelpUnknownSubcommand(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"help", "wat"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), `bee help: unknown command "wat"`) {
|
||||
t.Fatalf("stderr missing help error:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSATUnknownTarget(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
@@ -104,6 +156,32 @@ func TestRunSATUnknownTarget(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSATHelp(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"sat", "--help"}, &stdout, &stderr)
|
||||
if rc != 0 {
|
||||
t.Fatalf("rc=%d want 0", rc)
|
||||
}
|
||||
if !strings.Contains(stdout.String(), "usage: bee sat nvidia|memory|storage") {
|
||||
t.Fatalf("stdout missing sat help:\n%s", stdout.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSATRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"sat", "memory", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee sat nvidia|memory|storage") {
|
||||
t.Fatalf("stderr missing sat usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunAuditInvalidRuntime(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
@@ -113,3 +191,29 @@ func TestRunAuditInvalidRuntime(t *testing.T) {
|
||||
t.Fatalf("rc=%d want 1", rc)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunAuditRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"audit", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee audit") {
|
||||
t.Fatalf("stderr missing audit usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunExportRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"export", "--target", "/dev/sdb1", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee export --target <device>") {
|
||||
t.Fatalf("stderr missing export usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,8 +1,11 @@
|
||||
module bee/audit
|
||||
|
||||
go 1.23
|
||||
go 1.24.0
|
||||
|
||||
replace reanimator/chart => ../internal/chart
|
||||
|
||||
require github.com/charmbracelet/bubbletea v1.3.4
|
||||
require reanimator/chart v0.0.0
|
||||
|
||||
require (
|
||||
github.com/aymanbagabas/go-osc52/v2 v2.0.1 // indirect
|
||||
|
||||
@@ -1,10 +1,13 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
@@ -12,11 +15,21 @@ import (
|
||||
"bee/audit/internal/collector"
|
||||
"bee/audit/internal/platform"
|
||||
"bee/audit/internal/runtimeenv"
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
const (
|
||||
DefaultAuditJSONPath = "/var/log/bee-audit.json"
|
||||
DefaultAuditLogPath = "/var/log/bee-audit.log"
|
||||
var (
|
||||
DefaultExportDir = "/appdata/bee/export"
|
||||
DefaultAuditJSONPath = DefaultExportDir + "/bee-audit.json"
|
||||
DefaultAuditLogPath = DefaultExportDir + "/bee-audit.log"
|
||||
DefaultWebLogPath = DefaultExportDir + "/bee-web.log"
|
||||
DefaultNetworkLogPath = DefaultExportDir + "/bee-network.log"
|
||||
DefaultNvidiaLogPath = DefaultExportDir + "/bee-nvidia.log"
|
||||
DefaultSSHLogPath = DefaultExportDir + "/bee-sshsetup.log"
|
||||
DefaultRuntimeJSONPath = DefaultExportDir + "/runtime-health.json"
|
||||
DefaultRuntimeLogPath = DefaultExportDir + "/runtime-health.log"
|
||||
DefaultTechDumpDir = DefaultExportDir + "/techdump"
|
||||
DefaultSATBaseDir = DefaultExportDir + "/bee-sat"
|
||||
)
|
||||
|
||||
type App struct {
|
||||
@@ -25,6 +38,7 @@ type App struct {
|
||||
exports exportManager
|
||||
tools toolManager
|
||||
sat satRunner
|
||||
runtime runtimeChecker
|
||||
}
|
||||
|
||||
type ActionResult struct {
|
||||
@@ -58,6 +72,15 @@ type toolManager interface {
|
||||
|
||||
type satRunner interface {
|
||||
RunNvidiaAcceptancePack(baseDir string) (string, error)
|
||||
RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (string, error)
|
||||
RunMemoryAcceptancePack(baseDir string) (string, error)
|
||||
RunStorageAcceptancePack(baseDir string) (string, error)
|
||||
ListNvidiaGPUs() ([]platform.NvidiaGPU, error)
|
||||
}
|
||||
|
||||
type runtimeChecker interface {
|
||||
CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error)
|
||||
CaptureTechnicalDump(baseDir string) error
|
||||
}
|
||||
|
||||
func New(platform *platform.System) *App {
|
||||
@@ -67,11 +90,20 @@ func New(platform *platform.System) *App {
|
||||
exports: platform,
|
||||
tools: platform,
|
||||
sat: platform,
|
||||
runtime: platform,
|
||||
}
|
||||
}
|
||||
|
||||
func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, error) {
|
||||
if runtimeMode == runtimeenv.ModeLiveCD {
|
||||
if err := a.runtime.CaptureTechnicalDump(DefaultTechDumpDir); err != nil {
|
||||
slog.Warn("capture technical dump", "err", err)
|
||||
}
|
||||
}
|
||||
result := collector.Run(runtimeMode)
|
||||
if health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath); err == nil {
|
||||
result.Runtime = &health
|
||||
}
|
||||
data, err := json.MarshalIndent(result, "", " ")
|
||||
if err != nil {
|
||||
return "", err
|
||||
@@ -83,6 +115,9 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
|
||||
return "stdout", err
|
||||
case strings.HasPrefix(output, "file:"):
|
||||
path := strings.TrimPrefix(output, "file:")
|
||||
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
@@ -92,6 +127,63 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
|
||||
}
|
||||
}
|
||||
|
||||
func (a *App) RunRuntimePreflight(output string) (string, error) {
|
||||
health, err := a.runtime.CollectRuntimeHealth(DefaultExportDir)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
data, err := json.MarshalIndent(health, "", " ")
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
switch {
|
||||
case output == "stdout":
|
||||
_, err := os.Stdout.Write(append(data, '\n'))
|
||||
return "stdout", err
|
||||
case strings.HasPrefix(output, "file:"):
|
||||
path := strings.TrimPrefix(output, "file:")
|
||||
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return path, nil
|
||||
default:
|
||||
return "", fmt.Errorf("unknown output destination %q — use stdout or file:<path>", output)
|
||||
}
|
||||
}
|
||||
|
||||
func (a *App) RunRuntimePreflightResult() (ActionResult, error) {
|
||||
path, err := a.RunRuntimePreflight("file:" + DefaultRuntimeJSONPath)
|
||||
body := "Runtime preflight completed."
|
||||
if path != "" {
|
||||
body = "Runtime health written to " + path
|
||||
}
|
||||
return ActionResult{Title: "Run self-check", Body: body}, err
|
||||
}
|
||||
|
||||
func (a *App) RuntimeHealthResult() ActionResult {
|
||||
health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath)
|
||||
if err != nil {
|
||||
return ActionResult{Title: "Runtime issues", Body: "No runtime health found."}
|
||||
}
|
||||
var body strings.Builder
|
||||
fmt.Fprintf(&body, "Status: %s\n", firstNonEmpty(health.Status, "UNKNOWN"))
|
||||
fmt.Fprintf(&body, "Export dir: %s\n", firstNonEmpty(health.ExportDir, DefaultExportDir))
|
||||
fmt.Fprintf(&body, "Driver ready: %t\n", health.DriverReady)
|
||||
fmt.Fprintf(&body, "CUDA ready: %t\n", health.CUDAReady)
|
||||
fmt.Fprintf(&body, "Network: %s", firstNonEmpty(health.NetworkStatus, "UNKNOWN"))
|
||||
if len(health.Issues) > 0 {
|
||||
body.WriteString("\n\nIssues:\n")
|
||||
for _, issue := range health.Issues {
|
||||
fmt.Fprintf(&body, "- %s: %s\n", issue.Code, issue.Description)
|
||||
}
|
||||
}
|
||||
return ActionResult{Title: "Runtime issues", Body: strings.TrimSpace(body.String())}
|
||||
}
|
||||
|
||||
func (a *App) RunAuditNow(runtimeMode runtimeenv.Mode) (ActionResult, error) {
|
||||
path, err := a.RunAudit(runtimeMode, "file:"+DefaultAuditJSONPath)
|
||||
body := "Audit completed."
|
||||
@@ -124,7 +216,29 @@ func (a *App) ExportLatestAudit(target platform.RemovableTarget) (string, error)
|
||||
|
||||
func (a *App) ExportLatestAuditResult(target platform.RemovableTarget) (ActionResult, error) {
|
||||
path, err := a.ExportLatestAudit(target)
|
||||
return ActionResult{Title: "Export audit", Body: "Audit exported to " + path}, err
|
||||
body := "Audit exported."
|
||||
if path != "" {
|
||||
body = "Audit exported to " + path
|
||||
}
|
||||
return ActionResult{Title: "Export audit", Body: body}, err
|
||||
}
|
||||
|
||||
func (a *App) ExportSupportBundle(target platform.RemovableTarget) (string, error) {
|
||||
archive, err := BuildSupportBundle(DefaultExportDir)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
defer os.Remove(archive)
|
||||
return a.exports.ExportFileToTarget(archive, target)
|
||||
}
|
||||
|
||||
func (a *App) ExportSupportBundleResult(target platform.RemovableTarget) (ActionResult, error) {
|
||||
path, err := a.ExportSupportBundle(target)
|
||||
body := "Support bundle exported."
|
||||
if path != "" {
|
||||
body = "Support bundle exported to " + path
|
||||
}
|
||||
return ActionResult{Title: "Export support bundle", Body: body}, err
|
||||
}
|
||||
|
||||
func (a *App) ListInterfaces() ([]platform.InterfaceInfo, error) {
|
||||
@@ -141,7 +255,7 @@ func (a *App) DHCPOne(iface string) (string, error) {
|
||||
|
||||
func (a *App) DHCPOneResult(iface string) (ActionResult, error) {
|
||||
body, err := a.network.DHCPOne(iface)
|
||||
return ActionResult{Title: "DHCP on " + iface, Body: body}, err
|
||||
return ActionResult{Title: "DHCP: " + iface, Body: bodyOr(body, "DHCP completed.")}, err
|
||||
}
|
||||
|
||||
func (a *App) DHCPAll() (string, error) {
|
||||
@@ -150,7 +264,7 @@ func (a *App) DHCPAll() (string, error) {
|
||||
|
||||
func (a *App) DHCPAllResult() (ActionResult, error) {
|
||||
body, err := a.network.DHCPAll()
|
||||
return ActionResult{Title: "DHCP all interfaces", Body: body}, err
|
||||
return ActionResult{Title: "DHCP: all interfaces", Body: bodyOr(body, "DHCP completed.")}, err
|
||||
}
|
||||
|
||||
func (a *App) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
|
||||
@@ -159,7 +273,7 @@ func (a *App) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
|
||||
|
||||
func (a *App) SetStaticIPv4Result(cfg platform.StaticIPv4Config) (ActionResult, error) {
|
||||
body, err := a.network.SetStaticIPv4(cfg)
|
||||
return ActionResult{Title: "Static IPv4 on " + cfg.Interface, Body: body}, err
|
||||
return ActionResult{Title: "Static IPv4: " + cfg.Interface, Body: bodyOr(body, "Static IPv4 updated.")}, err
|
||||
}
|
||||
|
||||
func (a *App) NetworkStatus() (ActionResult, error) {
|
||||
@@ -167,6 +281,9 @@ func (a *App) NetworkStatus() (ActionResult, error) {
|
||||
if err != nil {
|
||||
return ActionResult{Title: "Network status"}, err
|
||||
}
|
||||
if len(ifaces) == 0 {
|
||||
return ActionResult{Title: "Network status", Body: "No physical interfaces found."}, nil
|
||||
}
|
||||
var body strings.Builder
|
||||
for _, iface := range ifaces {
|
||||
ipv4 := "(no IPv4)"
|
||||
@@ -216,7 +333,7 @@ func (a *App) ServiceStatus(name string) (string, error) {
|
||||
|
||||
func (a *App) ServiceStatusResult(name string) (ActionResult, error) {
|
||||
body, err := a.services.ServiceStatus(name)
|
||||
return ActionResult{Title: "service: " + name, Body: body}, err
|
||||
return ActionResult{Title: "service status: " + name, Body: bodyOr(body, "No status output.")}, err
|
||||
}
|
||||
|
||||
func (a *App) ServiceDo(name string, action platform.ServiceAction) (string, error) {
|
||||
@@ -225,7 +342,7 @@ func (a *App) ServiceDo(name string, action platform.ServiceAction) (string, err
|
||||
|
||||
func (a *App) ServiceActionResult(name string, action platform.ServiceAction) (ActionResult, error) {
|
||||
body, err := a.services.ServiceDo(name, action)
|
||||
return ActionResult{Title: "service: " + name, Body: body}, err
|
||||
return ActionResult{Title: "service " + string(action) + ": " + name, Body: bodyOr(body, "Action completed.")}, err
|
||||
}
|
||||
|
||||
func (a *App) ListRemovableTargets() ([]platform.RemovableTarget, error) {
|
||||
@@ -241,6 +358,9 @@ func (a *App) CheckTools(names []string) []platform.ToolStatus {
|
||||
}
|
||||
|
||||
func (a *App) ToolCheckResult(names []string) ActionResult {
|
||||
if len(names) == 0 {
|
||||
return ActionResult{Title: "Required tools", Body: "No tools checked."}
|
||||
}
|
||||
var body strings.Builder
|
||||
for _, tool := range a.tools.CheckTools(names) {
|
||||
status := "MISSING"
|
||||
@@ -253,17 +373,151 @@ func (a *App) ToolCheckResult(names []string) ActionResult {
|
||||
}
|
||||
|
||||
func (a *App) AuditLogTailResult() ActionResult {
|
||||
body := a.tools.TailFile(DefaultAuditLogPath, 40) + "\n\n" + a.tools.TailFile(DefaultAuditJSONPath, 20)
|
||||
logTail := strings.TrimSpace(a.tools.TailFile(DefaultAuditLogPath, 40))
|
||||
jsonTail := strings.TrimSpace(a.tools.TailFile(DefaultAuditJSONPath, 20))
|
||||
body := strings.TrimSpace(logTail + "\n\n" + jsonTail)
|
||||
if body == "" {
|
||||
body = "No audit logs found."
|
||||
}
|
||||
return ActionResult{Title: "Audit log tail", Body: body}
|
||||
}
|
||||
|
||||
func (a *App) RunNvidiaAcceptancePack(baseDir string) (string, error) {
|
||||
if strings.TrimSpace(baseDir) == "" {
|
||||
baseDir = DefaultSATBaseDir
|
||||
}
|
||||
return a.sat.RunNvidiaAcceptancePack(baseDir)
|
||||
}
|
||||
|
||||
func (a *App) RunNvidiaAcceptancePackResult(baseDir string) (ActionResult, error) {
|
||||
path, err := a.sat.RunNvidiaAcceptancePack(baseDir)
|
||||
return ActionResult{Title: "NVIDIA SAT", Body: "Archive written to " + path}, err
|
||||
path, err := a.RunNvidiaAcceptancePack(baseDir)
|
||||
body := "Archive written."
|
||||
if path != "" {
|
||||
body = "Archive written to " + path
|
||||
}
|
||||
return ActionResult{Title: "NVIDIA SAT", Body: body}, err
|
||||
}
|
||||
|
||||
func (a *App) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
|
||||
return a.sat.ListNvidiaGPUs()
|
||||
}
|
||||
|
||||
func (a *App) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (ActionResult, error) {
|
||||
if strings.TrimSpace(baseDir) == "" {
|
||||
baseDir = DefaultSATBaseDir
|
||||
}
|
||||
path, err := a.sat.RunNvidiaAcceptancePackWithOptions(ctx, baseDir, durationSec, sizeMB, gpuIndices)
|
||||
body := "Archive written."
|
||||
if path != "" {
|
||||
body = "Archive written to " + path
|
||||
}
|
||||
// Include terminal chart if available (runDir = archive path without .tar.gz).
|
||||
if path != "" {
|
||||
termPath := filepath.Join(strings.TrimSuffix(path, ".tar.gz"), "gpu-metrics-term.txt")
|
||||
if chart, readErr := os.ReadFile(termPath); readErr == nil && len(chart) > 0 {
|
||||
body += "\n\n" + string(chart)
|
||||
}
|
||||
}
|
||||
return ActionResult{Title: "NVIDIA SAT", Body: body}, err
|
||||
}
|
||||
|
||||
func (a *App) RunMemoryAcceptancePack(baseDir string) (string, error) {
|
||||
if strings.TrimSpace(baseDir) == "" {
|
||||
baseDir = DefaultSATBaseDir
|
||||
}
|
||||
return a.sat.RunMemoryAcceptancePack(baseDir)
|
||||
}
|
||||
|
||||
func (a *App) RunMemoryAcceptancePackResult(baseDir string) (ActionResult, error) {
|
||||
path, err := a.RunMemoryAcceptancePack(baseDir)
|
||||
body := "Archive written."
|
||||
if path != "" {
|
||||
body = "Archive written to " + path
|
||||
}
|
||||
return ActionResult{Title: "Memory SAT", Body: body}, err
|
||||
}
|
||||
|
||||
func (a *App) RunStorageAcceptancePack(baseDir string) (string, error) {
|
||||
if strings.TrimSpace(baseDir) == "" {
|
||||
baseDir = DefaultSATBaseDir
|
||||
}
|
||||
return a.sat.RunStorageAcceptancePack(baseDir)
|
||||
}
|
||||
|
||||
func (a *App) RunStorageAcceptancePackResult(baseDir string) (ActionResult, error) {
|
||||
path, err := a.RunStorageAcceptancePack(baseDir)
|
||||
body := "Archive written."
|
||||
if path != "" {
|
||||
body = "Archive written to " + path
|
||||
}
|
||||
return ActionResult{Title: "Storage SAT", Body: body}, err
|
||||
}
|
||||
|
||||
func (a *App) HealthSummaryResult() ActionResult {
|
||||
raw, err := os.ReadFile(DefaultAuditJSONPath)
|
||||
if err != nil {
|
||||
return ActionResult{Title: "Health summary", Body: "No audit JSON found."}
|
||||
}
|
||||
var snapshot schema.HardwareIngestRequest
|
||||
if err := json.Unmarshal(raw, &snapshot); err != nil {
|
||||
return ActionResult{Title: "Health summary", Body: "Audit JSON is unreadable."}
|
||||
}
|
||||
|
||||
summary := collector.BuildHealthSummary(snapshot.Hardware)
|
||||
var body strings.Builder
|
||||
status := summary.Status
|
||||
if status == "" {
|
||||
status = "Unknown"
|
||||
}
|
||||
fmt.Fprintf(&body, "Overall: %s\n", status)
|
||||
fmt.Fprintf(&body, "Storage: warn=%d fail=%d\n", summary.StorageWarn, summary.StorageFail)
|
||||
fmt.Fprintf(&body, "PCIe: warn=%d fail=%d\n", summary.PCIeWarn, summary.PCIeFail)
|
||||
fmt.Fprintf(&body, "PSU: warn=%d fail=%d\n", summary.PSUWarn, summary.PSUFail)
|
||||
fmt.Fprintf(&body, "Memory: warn=%d fail=%d\n", summary.MemoryWarn, summary.MemoryFail)
|
||||
for _, item := range latestSATSummaries() {
|
||||
fmt.Fprintf(&body, "\n\n%s", item)
|
||||
}
|
||||
if len(summary.Failures) > 0 {
|
||||
fmt.Fprintf(&body, "\n\nFailures:\n- %s", strings.Join(summary.Failures, "\n- "))
|
||||
}
|
||||
if len(summary.Warnings) > 0 {
|
||||
fmt.Fprintf(&body, "\n\nWarnings:\n- %s", strings.Join(summary.Warnings, "\n- "))
|
||||
}
|
||||
return ActionResult{Title: "Health summary", Body: strings.TrimSpace(body.String())}
|
||||
}
|
||||
|
||||
func (a *App) MainBanner() string {
|
||||
raw, err := os.ReadFile(DefaultAuditJSONPath)
|
||||
if err != nil {
|
||||
return ""
|
||||
}
|
||||
|
||||
var snapshot schema.HardwareIngestRequest
|
||||
if err := json.Unmarshal(raw, &snapshot); err != nil {
|
||||
return ""
|
||||
}
|
||||
|
||||
var lines []string
|
||||
if system := formatSystemLine(snapshot.Hardware.Board); system != "" {
|
||||
lines = append(lines, system)
|
||||
}
|
||||
if cpu := formatCPULine(snapshot.Hardware.CPUs); cpu != "" {
|
||||
lines = append(lines, cpu)
|
||||
}
|
||||
if memory := formatMemoryLine(snapshot.Hardware.Memory); memory != "" {
|
||||
lines = append(lines, memory)
|
||||
}
|
||||
if storage := formatStorageLine(snapshot.Hardware.Storage); storage != "" {
|
||||
lines = append(lines, storage)
|
||||
}
|
||||
if gpu := formatGPULine(snapshot.Hardware.PCIeDevices); gpu != "" {
|
||||
lines = append(lines, gpu)
|
||||
}
|
||||
if ip := formatIPLine(a.network.ListInterfaces); ip != "" {
|
||||
lines = append(lines, ip)
|
||||
}
|
||||
|
||||
return strings.TrimSpace(strings.Join(lines, "\n"))
|
||||
}
|
||||
|
||||
func (a *App) FormatToolStatuses(statuses []platform.ToolStatus) string {
|
||||
@@ -309,3 +563,314 @@ func sanitizeFilename(v string) string {
|
||||
}
|
||||
return string(out)
|
||||
}
|
||||
|
||||
func bodyOr(body, fallback string) string {
|
||||
body = strings.TrimSpace(body)
|
||||
if body == "" {
|
||||
return fallback
|
||||
}
|
||||
return body
|
||||
}
|
||||
|
||||
func ReadRuntimeHealth(path string) (schema.RuntimeHealth, error) {
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return schema.RuntimeHealth{}, err
|
||||
}
|
||||
var health schema.RuntimeHealth
|
||||
if err := json.Unmarshal(raw, &health); err != nil {
|
||||
return schema.RuntimeHealth{}, err
|
||||
}
|
||||
return health, nil
|
||||
}
|
||||
|
||||
func latestSATSummaries() []string {
|
||||
patterns := []struct {
|
||||
label string
|
||||
prefix string
|
||||
}{
|
||||
{label: "NVIDIA SAT", prefix: "gpu-nvidia-"},
|
||||
{label: "Memory SAT", prefix: "memory-"},
|
||||
{label: "Storage SAT", prefix: "storage-"},
|
||||
}
|
||||
var out []string
|
||||
for _, item := range patterns {
|
||||
matches, err := filepath.Glob(filepath.Join(DefaultSATBaseDir, item.prefix+"*/summary.txt"))
|
||||
if err != nil || len(matches) == 0 {
|
||||
continue
|
||||
}
|
||||
sort.Strings(matches)
|
||||
raw, err := os.ReadFile(matches[len(matches)-1])
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
out = append(out, formatSATSummary(item.label, string(raw)))
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func formatSATSummary(label, raw string) string {
|
||||
values := parseKeyValueSummary(raw)
|
||||
var body strings.Builder
|
||||
fmt.Fprintf(&body, "%s:", label)
|
||||
if overall := firstNonEmpty(values["overall_status"], "UNKNOWN"); overall != "" {
|
||||
fmt.Fprintf(&body, " %s", overall)
|
||||
}
|
||||
if ok := firstNonEmpty(values["job_ok"], "0"); ok != "" {
|
||||
fmt.Fprintf(&body, " ok=%s", ok)
|
||||
}
|
||||
if failed := firstNonEmpty(values["job_failed"], "0"); failed != "" {
|
||||
fmt.Fprintf(&body, " failed=%s", failed)
|
||||
}
|
||||
if unsupported := firstNonEmpty(values["job_unsupported"], "0"); unsupported != "" && unsupported != "0" {
|
||||
fmt.Fprintf(&body, " unsupported=%s", unsupported)
|
||||
}
|
||||
if devices := strings.TrimSpace(values["devices"]); devices != "" {
|
||||
fmt.Fprintf(&body, "\nDevices: %s", devices)
|
||||
}
|
||||
return body.String()
|
||||
}
|
||||
|
||||
func formatSystemLine(board schema.HardwareBoard) string {
|
||||
model := strings.TrimSpace(strings.Join([]string{
|
||||
trimPtr(board.Manufacturer),
|
||||
trimPtr(board.ProductName),
|
||||
}, " "))
|
||||
serial := strings.TrimSpace(board.SerialNumber)
|
||||
switch {
|
||||
case model != "" && serial != "":
|
||||
return fmt.Sprintf("System: %s | S/N %s", model, serial)
|
||||
case model != "":
|
||||
return "System: " + model
|
||||
case serial != "":
|
||||
return "System S/N: " + serial
|
||||
default:
|
||||
return ""
|
||||
}
|
||||
}
|
||||
|
||||
func formatCPULine(cpus []schema.HardwareCPU) string {
|
||||
if len(cpus) == 0 {
|
||||
return ""
|
||||
}
|
||||
modelCounts := map[string]int{}
|
||||
unknown := 0
|
||||
for _, cpu := range cpus {
|
||||
model := trimPtr(cpu.Model)
|
||||
if model == "" {
|
||||
unknown++
|
||||
continue
|
||||
}
|
||||
modelCounts[model]++
|
||||
}
|
||||
if len(modelCounts) == 1 && unknown == 0 {
|
||||
for model, count := range modelCounts {
|
||||
return fmt.Sprintf("CPU: %d x %s", count, model)
|
||||
}
|
||||
}
|
||||
parts := make([]string, 0, len(modelCounts)+1)
|
||||
if len(modelCounts) > 0 {
|
||||
keys := make([]string, 0, len(modelCounts))
|
||||
for key := range modelCounts {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
for _, key := range keys {
|
||||
parts = append(parts, fmt.Sprintf("%d x %s", modelCounts[key], key))
|
||||
}
|
||||
}
|
||||
if unknown > 0 {
|
||||
parts = append(parts, fmt.Sprintf("%d x unknown", unknown))
|
||||
}
|
||||
return "CPU: " + strings.Join(parts, ", ")
|
||||
}
|
||||
|
||||
func formatMemoryLine(dimms []schema.HardwareMemory) string {
|
||||
totalMB := 0
|
||||
present := 0
|
||||
types := map[string]struct{}{}
|
||||
for _, dimm := range dimms {
|
||||
if dimm.Present != nil && !*dimm.Present {
|
||||
continue
|
||||
}
|
||||
if dimm.SizeMB == nil || *dimm.SizeMB <= 0 {
|
||||
continue
|
||||
}
|
||||
present++
|
||||
totalMB += *dimm.SizeMB
|
||||
if value := trimPtr(dimm.Type); value != "" {
|
||||
types[value] = struct{}{}
|
||||
}
|
||||
}
|
||||
if totalMB == 0 {
|
||||
return ""
|
||||
}
|
||||
typeText := joinSortedKeys(types)
|
||||
line := fmt.Sprintf("Memory: %s", humanizeMB(totalMB))
|
||||
if typeText != "" {
|
||||
line += " " + typeText
|
||||
}
|
||||
if present > 0 {
|
||||
line += fmt.Sprintf(" (%d DIMMs)", present)
|
||||
}
|
||||
return line
|
||||
}
|
||||
|
||||
func formatStorageLine(disks []schema.HardwareStorage) string {
|
||||
count := 0
|
||||
totalGB := 0
|
||||
for _, disk := range disks {
|
||||
if disk.Present != nil && !*disk.Present {
|
||||
continue
|
||||
}
|
||||
count++
|
||||
if disk.SizeGB != nil && *disk.SizeGB > 0 {
|
||||
totalGB += *disk.SizeGB
|
||||
}
|
||||
}
|
||||
if count == 0 {
|
||||
return ""
|
||||
}
|
||||
line := fmt.Sprintf("Storage: %d drives", count)
|
||||
if totalGB > 0 {
|
||||
line += fmt.Sprintf(" / %s", humanizeGB(totalGB))
|
||||
}
|
||||
return line
|
||||
}
|
||||
|
||||
func formatGPULine(devices []schema.HardwarePCIeDevice) string {
|
||||
gpus := map[string]int{}
|
||||
for _, dev := range devices {
|
||||
if !isGPUDevice(dev) {
|
||||
continue
|
||||
}
|
||||
name := firstNonEmpty(trimPtr(dev.Model), trimPtr(dev.Manufacturer), "unknown")
|
||||
gpus[name]++
|
||||
}
|
||||
if len(gpus) == 0 {
|
||||
return ""
|
||||
}
|
||||
keys := make([]string, 0, len(gpus))
|
||||
for key := range gpus {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
parts := make([]string, 0, len(keys))
|
||||
for _, key := range keys {
|
||||
parts = append(parts, fmt.Sprintf("%d x %s", gpus[key], key))
|
||||
}
|
||||
return "GPU: " + strings.Join(parts, ", ")
|
||||
}
|
||||
|
||||
func formatIPLine(list func() ([]platform.InterfaceInfo, error)) string {
|
||||
if list == nil {
|
||||
return ""
|
||||
}
|
||||
ifaces, err := list()
|
||||
if err != nil {
|
||||
return ""
|
||||
}
|
||||
seen := map[string]struct{}{}
|
||||
var ips []string
|
||||
for _, iface := range ifaces {
|
||||
for _, ip := range iface.IPv4 {
|
||||
ip = strings.TrimSpace(ip)
|
||||
if ip == "" {
|
||||
continue
|
||||
}
|
||||
if _, ok := seen[ip]; ok {
|
||||
continue
|
||||
}
|
||||
seen[ip] = struct{}{}
|
||||
ips = append(ips, ip)
|
||||
}
|
||||
}
|
||||
if len(ips) == 0 {
|
||||
return ""
|
||||
}
|
||||
sort.Strings(ips)
|
||||
return "IP: " + strings.Join(ips, ", ")
|
||||
}
|
||||
|
||||
func isGPUDevice(dev schema.HardwarePCIeDevice) bool {
|
||||
class := trimPtr(dev.DeviceClass)
|
||||
model := strings.ToLower(trimPtr(dev.Model))
|
||||
vendor := strings.ToLower(trimPtr(dev.Manufacturer))
|
||||
return class == "VideoController" ||
|
||||
class == "DisplayController" ||
|
||||
class == "ProcessingAccelerator" ||
|
||||
strings.Contains(model, "nvidia") ||
|
||||
strings.Contains(vendor, "nvidia") ||
|
||||
strings.Contains(vendor, "amd")
|
||||
}
|
||||
|
||||
func trimPtr(value *string) string {
|
||||
if value == nil {
|
||||
return ""
|
||||
}
|
||||
return strings.TrimSpace(*value)
|
||||
}
|
||||
|
||||
func joinSortedKeys(values map[string]struct{}) string {
|
||||
if len(values) == 0 {
|
||||
return ""
|
||||
}
|
||||
keys := make([]string, 0, len(values))
|
||||
for key := range values {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
return strings.Join(keys, "/")
|
||||
}
|
||||
|
||||
func humanizeMB(totalMB int) string {
|
||||
if totalMB <= 0 {
|
||||
return ""
|
||||
}
|
||||
gb := float64(totalMB) / 1024.0
|
||||
if gb >= 1024.0 {
|
||||
tb := gb / 1024.0
|
||||
return fmt.Sprintf("%.1f TB", tb)
|
||||
}
|
||||
if gb == float64(int64(gb)) {
|
||||
return fmt.Sprintf("%.0f GB", gb)
|
||||
}
|
||||
return fmt.Sprintf("%.1f GB", gb)
|
||||
}
|
||||
|
||||
func humanizeGB(totalGB int) string {
|
||||
if totalGB <= 0 {
|
||||
return ""
|
||||
}
|
||||
tb := float64(totalGB) / 1024.0
|
||||
if tb >= 1.0 {
|
||||
return fmt.Sprintf("%.1f TB", tb)
|
||||
}
|
||||
return fmt.Sprintf("%d GB", totalGB)
|
||||
}
|
||||
|
||||
func parseKeyValueSummary(raw string) map[string]string {
|
||||
out := map[string]string{}
|
||||
for _, line := range strings.Split(raw, "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" {
|
||||
continue
|
||||
}
|
||||
key, value, ok := strings.Cut(line, "=")
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
out[strings.TrimSpace(key)] = strings.TrimSpace(value)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func firstNonEmpty(values ...string) string {
|
||||
for _, value := range values {
|
||||
value = strings.TrimSpace(value)
|
||||
if value != "" {
|
||||
return value
|
||||
}
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
@@ -1,10 +1,15 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/platform"
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
type fakeNetwork struct {
|
||||
@@ -62,6 +67,22 @@ func (f fakeExports) ExportFileToTarget(src string, target platform.RemovableTar
|
||||
return "", nil
|
||||
}
|
||||
|
||||
type fakeRuntime struct {
|
||||
collectFn func(string) (schema.RuntimeHealth, error)
|
||||
dumpFn func(string) error
|
||||
}
|
||||
|
||||
func (f fakeRuntime) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
|
||||
return f.collectFn(exportDir)
|
||||
}
|
||||
|
||||
func (f fakeRuntime) CaptureTechnicalDump(baseDir string) error {
|
||||
if f.dumpFn != nil {
|
||||
return f.dumpFn(baseDir)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
type fakeTools struct {
|
||||
tailFileFn func(string, int) string
|
||||
checkToolsFn func([]string) []platform.ToolStatus
|
||||
@@ -76,11 +97,29 @@ func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
|
||||
}
|
||||
|
||||
type fakeSAT struct {
|
||||
runFn func(string) (string, error)
|
||||
runNvidiaFn func(string) (string, error)
|
||||
runMemoryFn func(string) (string, error)
|
||||
runStorageFn func(string) (string, error)
|
||||
}
|
||||
|
||||
func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string) (string, error) {
|
||||
return f.runFn(baseDir)
|
||||
return f.runNvidiaFn(baseDir)
|
||||
}
|
||||
|
||||
func (f fakeSAT) RunNvidiaAcceptancePackWithOptions(_ context.Context, baseDir string, _ int, _ int, _ []int) (string, error) {
|
||||
return f.runNvidiaFn(baseDir)
|
||||
}
|
||||
|
||||
func (f fakeSAT) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
|
||||
return nil, nil
|
||||
}
|
||||
|
||||
func (f fakeSAT) RunMemoryAcceptancePack(baseDir string) (string, error) {
|
||||
return f.runMemoryFn(baseDir)
|
||||
}
|
||||
|
||||
func (f fakeSAT) RunStorageAcceptancePack(baseDir string) (string, error) {
|
||||
return f.runStorageFn(baseDir)
|
||||
}
|
||||
|
||||
func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
|
||||
@@ -96,6 +135,9 @@ func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
|
||||
},
|
||||
defaultRouteFn: func() string { return "10.0.0.1" },
|
||||
},
|
||||
runtime: fakeRuntime{
|
||||
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
|
||||
},
|
||||
}
|
||||
|
||||
result, err := a.NetworkStatus()
|
||||
@@ -116,6 +158,28 @@ func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestNetworkStatusHandlesNoInterfaces(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
a := &App{
|
||||
network: fakeNetwork{
|
||||
listInterfacesFn: func() ([]platform.InterfaceInfo, error) { return nil, nil },
|
||||
defaultRouteFn: func() string { return "" },
|
||||
},
|
||||
runtime: fakeRuntime{
|
||||
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
|
||||
},
|
||||
}
|
||||
|
||||
result, err := a.NetworkStatus()
|
||||
if err != nil {
|
||||
t.Fatalf("NetworkStatus error: %v", err)
|
||||
}
|
||||
if result.Body != "No physical interfaces found." {
|
||||
t.Fatalf("body=%q want %q", result.Body, "No physical interfaces found.")
|
||||
}
|
||||
}
|
||||
|
||||
func TestNetworkStatusPropagatesListError(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
@@ -126,6 +190,9 @@ func TestNetworkStatusPropagatesListError(t *testing.T) {
|
||||
},
|
||||
defaultRouteFn: func() string { return "" },
|
||||
},
|
||||
runtime: fakeRuntime{
|
||||
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
|
||||
},
|
||||
}
|
||||
|
||||
result, err := a.NetworkStatus()
|
||||
@@ -150,6 +217,9 @@ func TestParseStaticIPv4ConfigAndDefaults(t *testing.T) {
|
||||
dhcpAllFn: func() (string, error) { return "", nil },
|
||||
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
|
||||
},
|
||||
runtime: fakeRuntime{
|
||||
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
|
||||
},
|
||||
}
|
||||
|
||||
defaults := a.DefaultStaticIPv4FormFields("eth0")
|
||||
@@ -186,13 +256,16 @@ func TestServiceActionResults(t *testing.T) {
|
||||
return string(action) + " ok", nil
|
||||
},
|
||||
},
|
||||
runtime: fakeRuntime{
|
||||
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
|
||||
},
|
||||
}
|
||||
|
||||
statusResult, err := a.ServiceStatusResult("bee-audit")
|
||||
if err != nil {
|
||||
t.Fatalf("ServiceStatusResult error: %v", err)
|
||||
}
|
||||
if statusResult.Title != "service: bee-audit" || statusResult.Body != "active" {
|
||||
if statusResult.Title != "service status: bee-audit" || statusResult.Body != "active" {
|
||||
t.Fatalf("unexpected status result: %#v", statusResult)
|
||||
}
|
||||
|
||||
@@ -200,7 +273,7 @@ func TestServiceActionResults(t *testing.T) {
|
||||
if err != nil {
|
||||
t.Fatalf("ServiceActionResult error: %v", err)
|
||||
}
|
||||
if actionResult.Title != "service: bee-audit" || actionResult.Body != "restart ok" {
|
||||
if actionResult.Title != "service restart: bee-audit" || actionResult.Body != "restart ok" {
|
||||
t.Fatalf("unexpected action result: %#v", actionResult)
|
||||
}
|
||||
}
|
||||
@@ -242,17 +315,87 @@ func TestToolCheckAndLogTailResults(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestActionResultsUseFallbackBody(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
a := &App{
|
||||
network: fakeNetwork{
|
||||
dhcpOneFn: func(string) (string, error) { return " ", nil },
|
||||
dhcpAllFn: func() (string, error) { return "", nil },
|
||||
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
|
||||
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
|
||||
return nil, nil
|
||||
},
|
||||
defaultRouteFn: func() string { return "" },
|
||||
},
|
||||
services: fakeServices{
|
||||
serviceStatusFn: func(string) (string, error) { return "", nil },
|
||||
serviceDoFn: func(string, platform.ServiceAction) (string, error) { return "", nil },
|
||||
},
|
||||
tools: fakeTools{
|
||||
tailFileFn: func(string, int) string { return " " },
|
||||
checkToolsFn: func([]string) []platform.ToolStatus { return nil },
|
||||
},
|
||||
sat: fakeSAT{
|
||||
runNvidiaFn: func(string) (string, error) { return "", nil },
|
||||
runMemoryFn: func(string) (string, error) { return "", nil },
|
||||
runStorageFn: func(string) (string, error) { return "", nil },
|
||||
},
|
||||
runtime: fakeRuntime{
|
||||
collectFn: func(string) (schema.RuntimeHealth, error) {
|
||||
return schema.RuntimeHealth{Status: "PARTIAL", ExportDir: "/tmp/export"}, nil
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
if got, _ := a.DHCPOneResult("eth0"); got.Body != "DHCP completed." {
|
||||
t.Fatalf("dhcp one body=%q", got.Body)
|
||||
}
|
||||
if got, _ := a.DHCPAllResult(); got.Body != "DHCP completed." {
|
||||
t.Fatalf("dhcp all body=%q", got.Body)
|
||||
}
|
||||
if got, _ := a.SetStaticIPv4Result(platform.StaticIPv4Config{Interface: "eth0"}); got.Body != "Static IPv4 updated." {
|
||||
t.Fatalf("static body=%q", got.Body)
|
||||
}
|
||||
if got, _ := a.ServiceStatusResult("bee-audit"); got.Body != "No status output." {
|
||||
t.Fatalf("status body=%q", got.Body)
|
||||
}
|
||||
if got, _ := a.ServiceActionResult("bee-audit", platform.ServiceRestart); got.Body != "Action completed." {
|
||||
t.Fatalf("action body=%q", got.Body)
|
||||
}
|
||||
if got := a.ToolCheckResult(nil); got.Body != "No tools checked." {
|
||||
t.Fatalf("tool body=%q", got.Body)
|
||||
}
|
||||
if got := a.AuditLogTailResult(); got.Body != "No audit logs found." {
|
||||
t.Fatalf("log body=%q", got.Body)
|
||||
}
|
||||
if got, _ := a.RunNvidiaAcceptancePackResult(""); got.Body != "Archive written." {
|
||||
t.Fatalf("sat body=%q", got.Body)
|
||||
}
|
||||
if got, _ := a.RunMemoryAcceptancePackResult(""); got.Body != "Archive written." {
|
||||
t.Fatalf("memory sat body=%q", got.Body)
|
||||
}
|
||||
if got, _ := a.RunStorageAcceptancePackResult(""); got.Body != "Archive written." {
|
||||
t.Fatalf("storage sat body=%q", got.Body)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunNvidiaAcceptancePackResult(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
a := &App{
|
||||
sat: fakeSAT{
|
||||
runFn: func(baseDir string) (string, error) {
|
||||
runNvidiaFn: func(baseDir string) (string, error) {
|
||||
if baseDir != "/tmp/sat" {
|
||||
t.Fatalf("baseDir=%q want %q", baseDir, "/tmp/sat")
|
||||
}
|
||||
return "/tmp/sat/out.tar.gz", nil
|
||||
},
|
||||
runMemoryFn: func(string) (string, error) { return "", nil },
|
||||
runStorageFn: func(string) (string, error) { return "", nil },
|
||||
},
|
||||
runtime: fakeRuntime{
|
||||
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
|
||||
},
|
||||
}
|
||||
|
||||
@@ -265,6 +408,186 @@ func TestRunNvidiaAcceptancePackResult(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSATDefaultsToExportDir(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
oldSATBaseDir := DefaultSATBaseDir
|
||||
DefaultSATBaseDir = "/tmp/export/bee-sat"
|
||||
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
|
||||
|
||||
a := &App{
|
||||
sat: fakeSAT{
|
||||
runNvidiaFn: func(baseDir string) (string, error) {
|
||||
if baseDir != "/tmp/export/bee-sat" {
|
||||
t.Fatalf("nvidia baseDir=%q", baseDir)
|
||||
}
|
||||
return "", nil
|
||||
},
|
||||
runMemoryFn: func(baseDir string) (string, error) {
|
||||
if baseDir != "/tmp/export/bee-sat" {
|
||||
t.Fatalf("memory baseDir=%q", baseDir)
|
||||
}
|
||||
return "", nil
|
||||
},
|
||||
runStorageFn: func(baseDir string) (string, error) {
|
||||
if baseDir != "/tmp/export/bee-sat" {
|
||||
t.Fatalf("storage baseDir=%q", baseDir)
|
||||
}
|
||||
return "", nil
|
||||
},
|
||||
},
|
||||
runtime: fakeRuntime{
|
||||
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
|
||||
},
|
||||
}
|
||||
|
||||
if _, err := a.RunNvidiaAcceptancePack(""); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if _, err := a.RunMemoryAcceptancePack(""); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if _, err := a.RunStorageAcceptancePack(""); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestFormatSATSummary(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
got := formatSATSummary("Memory SAT", "overall_status=PARTIAL\njob_ok=2\njob_failed=0\njob_unsupported=1\ndevices=3\n")
|
||||
want := "Memory SAT: PARTIAL ok=2 failed=0 unsupported=1\nDevices: 3"
|
||||
if got != want {
|
||||
t.Fatalf("got %q want %q", got, want)
|
||||
}
|
||||
}
|
||||
|
||||
func TestHealthSummaryResultIncludesCompactSATSummary(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
oldAuditPath := DefaultAuditJSONPath
|
||||
oldSATBaseDir := DefaultSATBaseDir
|
||||
DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
|
||||
DefaultSATBaseDir = filepath.Join(tmp, "sat")
|
||||
t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
|
||||
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
|
||||
|
||||
satDir := filepath.Join(DefaultSATBaseDir, "memory-testcase")
|
||||
if err := os.MkdirAll(satDir, 0755); err != nil {
|
||||
t.Fatalf("mkdir sat dir: %v", err)
|
||||
}
|
||||
|
||||
raw := `{"collected_at":"2026-03-15T10:00:00Z","hardware":{"board":{"serial_number":"SRV123"},"storage":[{"serial_number":"DISK1","status":"Warning"}]}}`
|
||||
if err := os.WriteFile(DefaultAuditJSONPath, []byte(raw), 0644); err != nil {
|
||||
t.Fatalf("write audit json: %v", err)
|
||||
}
|
||||
if err := os.WriteFile(filepath.Join(satDir, "summary.txt"), []byte("overall_status=OK\njob_ok=3\njob_failed=0\njob_unsupported=0\n"), 0644); err != nil {
|
||||
t.Fatalf("write sat summary: %v", err)
|
||||
}
|
||||
|
||||
result := (&App{}).HealthSummaryResult()
|
||||
if !contains(result.Body, "Memory SAT: OK ok=3 failed=0") {
|
||||
t.Fatalf("body missing compact sat summary:\n%s", result.Body)
|
||||
}
|
||||
}
|
||||
|
||||
func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
exportDir := filepath.Join(tmp, "export")
|
||||
if err := os.MkdirAll(filepath.Join(exportDir, "bee-sat", "memory-run"), 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.json"), []byte(`{"ok":true}`), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run", "verbose.log"), []byte("sat verbose"), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
archive, err := BuildSupportBundle(exportDir)
|
||||
if err != nil {
|
||||
t.Fatalf("BuildSupportBundle error: %v", err)
|
||||
}
|
||||
if _, err := os.Stat(archive); err != nil {
|
||||
t.Fatalf("archive stat: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestMainBanner(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
oldAuditPath := DefaultAuditJSONPath
|
||||
DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
|
||||
t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
|
||||
|
||||
trueValue := true
|
||||
manufacturer := "Dell"
|
||||
product := "PowerEdge R760"
|
||||
cpuModel := "Intel Xeon Gold 6430"
|
||||
memoryType := "DDR5"
|
||||
gpuClass := "VideoController"
|
||||
gpuModel := "NVIDIA H100"
|
||||
|
||||
payload := schema.HardwareIngestRequest{
|
||||
Hardware: schema.HardwareSnapshot{
|
||||
Board: schema.HardwareBoard{
|
||||
Manufacturer: &manufacturer,
|
||||
ProductName: &product,
|
||||
SerialNumber: "SRV123",
|
||||
},
|
||||
CPUs: []schema.HardwareCPU{
|
||||
{Model: &cpuModel},
|
||||
{Model: &cpuModel},
|
||||
},
|
||||
Memory: []schema.HardwareMemory{
|
||||
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
|
||||
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
|
||||
},
|
||||
Storage: []schema.HardwareStorage{
|
||||
{Present: &trueValue, SizeGB: intPtr(3840)},
|
||||
{Present: &trueValue, SizeGB: intPtr(3840)},
|
||||
},
|
||||
PCIeDevices: []schema.HardwarePCIeDevice{
|
||||
{DeviceClass: &gpuClass, Model: &gpuModel},
|
||||
{DeviceClass: &gpuClass, Model: &gpuModel},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
raw, err := json.Marshal(payload)
|
||||
if err != nil {
|
||||
t.Fatalf("marshal: %v", err)
|
||||
}
|
||||
if err := os.WriteFile(DefaultAuditJSONPath, raw, 0644); err != nil {
|
||||
t.Fatalf("write audit json: %v", err)
|
||||
}
|
||||
|
||||
a := &App{
|
||||
network: fakeNetwork{
|
||||
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
|
||||
return []platform.InterfaceInfo{
|
||||
{Name: "eth0", IPv4: []string{"10.0.0.10"}},
|
||||
{Name: "eth1", IPv4: []string{"192.168.1.10"}},
|
||||
}, nil
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
got := a.MainBanner()
|
||||
for _, want := range []string{
|
||||
"System: Dell PowerEdge R760 | S/N SRV123",
|
||||
"CPU: 2 x Intel Xeon Gold 6430",
|
||||
"Memory: 1.0 TB DDR5 (2 DIMMs)",
|
||||
"Storage: 2 drives / 7.5 TB",
|
||||
"GPU: 2 x NVIDIA H100",
|
||||
"IP: 10.0.0.10, 192.168.1.10",
|
||||
} {
|
||||
if !contains(got, want) {
|
||||
t.Fatalf("banner missing %q:\n%s", want, got)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func intPtr(v int) *int { return &v }
|
||||
|
||||
func contains(haystack, needle string) bool {
|
||||
return len(needle) == 0 || (len(haystack) >= len(needle) && (haystack == needle || containsAt(haystack, needle)))
|
||||
}
|
||||
|
||||
300
audit/internal/app/support_bundle.go
Normal file
300
audit/internal/app/support_bundle.go
Normal file
@@ -0,0 +1,300 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"archive/tar"
|
||||
"compress/gzip"
|
||||
"fmt"
|
||||
"io"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
var supportBundleServices = []string{
|
||||
"bee-audit.service",
|
||||
"bee-web.service",
|
||||
"bee-network.service",
|
||||
"bee-nvidia.service",
|
||||
"bee-preflight.service",
|
||||
"bee-sshsetup.service",
|
||||
}
|
||||
|
||||
var supportBundleCommands = []struct {
|
||||
name string
|
||||
cmd []string
|
||||
}{
|
||||
{name: "system/uname.txt", cmd: []string{"uname", "-a"}},
|
||||
{name: "system/lsmod.txt", cmd: []string{"lsmod"}},
|
||||
{name: "system/lspci-nn.txt", cmd: []string{"lspci", "-nn"}},
|
||||
{name: "system/ip-addr.txt", cmd: []string{"ip", "addr"}},
|
||||
{name: "system/ip-route.txt", cmd: []string{"ip", "route"}},
|
||||
{name: "system/mount.txt", cmd: []string{"mount"}},
|
||||
{name: "system/df-h.txt", cmd: []string{"df", "-h"}},
|
||||
{name: "system/dmesg-tail.txt", cmd: []string{"sh", "-c", "dmesg | tail -n 200"}},
|
||||
}
|
||||
|
||||
func BuildSupportBundle(exportDir string) (string, error) {
|
||||
exportDir = strings.TrimSpace(exportDir)
|
||||
if exportDir == "" {
|
||||
exportDir = DefaultExportDir
|
||||
}
|
||||
if err := os.MkdirAll(exportDir, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := cleanupOldSupportBundles(os.TempDir()); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
host := sanitizeFilename(hostnameOr("unknown"))
|
||||
ts := time.Now().UTC().Format("20060102-150405")
|
||||
stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s", host, ts))
|
||||
if err := os.MkdirAll(stageRoot, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
defer os.RemoveAll(stageRoot)
|
||||
|
||||
if err := copyDirContents(exportDir, filepath.Join(stageRoot, "export")); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := writeJournalDump(filepath.Join(stageRoot, "systemd", "combined.journal.log")); err != nil {
|
||||
return "", err
|
||||
}
|
||||
for _, svc := range supportBundleServices {
|
||||
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".status.txt"), []string{"systemctl", "status", svc, "--no-pager"}); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".journal.log"), []string{"journalctl", "--no-pager", "-u", svc}); err != nil {
|
||||
return "", err
|
||||
}
|
||||
}
|
||||
for _, item := range supportBundleCommands {
|
||||
if err := writeCommandOutput(filepath.Join(stageRoot, item.name), item.cmd); err != nil {
|
||||
return "", err
|
||||
}
|
||||
}
|
||||
if err := writeManifest(filepath.Join(stageRoot, "manifest.txt"), exportDir, stageRoot); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
archivePath := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s.tar.gz", host, ts))
|
||||
if err := createSupportTarGz(archivePath, stageRoot); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return archivePath, nil
|
||||
}
|
||||
|
||||
func cleanupOldSupportBundles(dir string) error {
|
||||
matches, err := filepath.Glob(filepath.Join(dir, "bee-support-*.tar.gz"))
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
type entry struct {
|
||||
path string
|
||||
mod time.Time
|
||||
}
|
||||
list := make([]entry, 0, len(matches))
|
||||
for _, match := range matches {
|
||||
info, err := os.Stat(match)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
if time.Since(info.ModTime()) > 24*time.Hour {
|
||||
_ = os.Remove(match)
|
||||
continue
|
||||
}
|
||||
list = append(list, entry{path: match, mod: info.ModTime()})
|
||||
}
|
||||
sort.Slice(list, func(i, j int) bool { return list[i].mod.After(list[j].mod) })
|
||||
if len(list) > 3 {
|
||||
for _, old := range list[3:] {
|
||||
_ = os.Remove(old.path)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func writeJournalDump(dst string) error {
|
||||
args := []string{"--no-pager"}
|
||||
for _, svc := range supportBundleServices {
|
||||
args = append(args, "-u", svc)
|
||||
}
|
||||
raw, err := exec.Command("journalctl", args...).CombinedOutput()
|
||||
if len(raw) == 0 && err != nil {
|
||||
raw = []byte(err.Error() + "\n")
|
||||
}
|
||||
if len(raw) == 0 {
|
||||
raw = []byte("no journal output\n")
|
||||
}
|
||||
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
return os.WriteFile(dst, raw, 0644)
|
||||
}
|
||||
|
||||
func writeCommandOutput(dst string, cmd []string) error {
|
||||
if len(cmd) == 0 {
|
||||
return nil
|
||||
}
|
||||
raw, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
|
||||
if len(raw) == 0 {
|
||||
if err != nil {
|
||||
raw = []byte(err.Error() + "\n")
|
||||
} else {
|
||||
raw = []byte("no output\n")
|
||||
}
|
||||
}
|
||||
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
return os.WriteFile(dst, raw, 0644)
|
||||
}
|
||||
|
||||
func writeManifest(dst, exportDir, stageRoot string) error {
|
||||
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
var body strings.Builder
|
||||
fmt.Fprintf(&body, "bee_version=%s\n", buildVersion())
|
||||
fmt.Fprintf(&body, "host=%s\n", hostnameOr("unknown"))
|
||||
fmt.Fprintf(&body, "generated_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
|
||||
fmt.Fprintf(&body, "export_dir=%s\n", exportDir)
|
||||
fmt.Fprintf(&body, "\nfiles:\n")
|
||||
|
||||
var files []string
|
||||
if err := filepath.Walk(stageRoot, func(path string, info os.FileInfo, err error) error {
|
||||
if err != nil || info.IsDir() {
|
||||
return err
|
||||
}
|
||||
if filepath.Clean(path) == filepath.Clean(dst) {
|
||||
return nil
|
||||
}
|
||||
rel, err := filepath.Rel(stageRoot, path)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
files = append(files, fmt.Sprintf("%s\t%d", rel, info.Size()))
|
||||
return nil
|
||||
}); err != nil {
|
||||
return err
|
||||
}
|
||||
sort.Strings(files)
|
||||
for _, line := range files {
|
||||
body.WriteString(line)
|
||||
body.WriteByte('\n')
|
||||
}
|
||||
return os.WriteFile(dst, []byte(body.String()), 0644)
|
||||
}
|
||||
|
||||
func buildVersion() string {
|
||||
raw, err := exec.Command("bee", "version").CombinedOutput()
|
||||
if err != nil {
|
||||
return "unknown"
|
||||
}
|
||||
return strings.TrimSpace(string(raw))
|
||||
}
|
||||
|
||||
func copyDirContents(srcDir, dstDir string) error {
|
||||
entries, err := os.ReadDir(srcDir)
|
||||
if err != nil {
|
||||
if os.IsNotExist(err) {
|
||||
return nil
|
||||
}
|
||||
return err
|
||||
}
|
||||
for _, entry := range entries {
|
||||
src := filepath.Join(srcDir, entry.Name())
|
||||
dst := filepath.Join(dstDir, entry.Name())
|
||||
if err := copyPath(src, dst); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func copyPath(src, dst string) error {
|
||||
info, err := os.Stat(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if info.IsDir() {
|
||||
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
|
||||
return err
|
||||
}
|
||||
entries, err := os.ReadDir(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
for _, entry := range entries {
|
||||
if err := copyPath(filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name())); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
in, err := os.Open(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer in.Close()
|
||||
|
||||
out, err := os.OpenFile(dst, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, info.Mode().Perm())
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer out.Close()
|
||||
|
||||
_, err = io.Copy(out, in)
|
||||
return err
|
||||
}
|
||||
|
||||
func createSupportTarGz(dst, srcDir string) error {
|
||||
file, err := os.Create(dst)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer file.Close()
|
||||
|
||||
gz := gzip.NewWriter(file)
|
||||
defer gz.Close()
|
||||
|
||||
tw := tar.NewWriter(gz)
|
||||
defer tw.Close()
|
||||
|
||||
base := filepath.Dir(srcDir)
|
||||
return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if info.IsDir() {
|
||||
return nil
|
||||
}
|
||||
|
||||
header, err := tar.FileInfoHeader(info, "")
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
header.Name, err = filepath.Rel(base, path)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if err := tw.WriteHeader(header); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
f, err := os.Open(path)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer f.Close()
|
||||
|
||||
_, err = io.Copy(tw, f)
|
||||
return err
|
||||
})
|
||||
}
|
||||
@@ -8,6 +8,14 @@ import (
|
||||
"strings"
|
||||
)
|
||||
|
||||
var execDmidecode = func(typeNum string) (string, error) {
|
||||
out, err := exec.Command("dmidecode", "-t", typeNum).Output()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
return string(out), nil
|
||||
}
|
||||
|
||||
// collectBoard runs dmidecode for types 0, 1, 2 and returns the board record
|
||||
// plus the BIOS firmware entry. Any failure is logged and returns zero values.
|
||||
func collectBoard() (schema.HardwareBoard, []schema.HardwareFirmwareRecord) {
|
||||
@@ -141,9 +149,5 @@ func cleanDMIValue(v string) string {
|
||||
|
||||
// runDmidecode executes dmidecode -t <typeNum> and returns its stdout.
|
||||
func runDmidecode(typeNum string) (string, error) {
|
||||
out, err := exec.Command("dmidecode", "-t", typeNum).Output()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
return string(out), nil
|
||||
return execDmidecode(typeNum)
|
||||
}
|
||||
|
||||
@@ -7,13 +7,15 @@ import (
|
||||
"bee/audit/internal/runtimeenv"
|
||||
"bee/audit/internal/schema"
|
||||
"log/slog"
|
||||
"os"
|
||||
"time"
|
||||
)
|
||||
|
||||
// Run executes all collectors and returns the combined snapshot.
|
||||
// Partial failures are logged as warnings; collection always completes.
|
||||
func Run(runtimeMode runtimeenv.Mode) schema.HardwareIngestRequest {
|
||||
func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
|
||||
start := time.Now()
|
||||
collectedAt := time.Now().UTC().Format(time.RFC3339)
|
||||
slog.Info("audit started")
|
||||
|
||||
snap := schema.HardwareSnapshot{}
|
||||
@@ -22,31 +24,41 @@ func Run(runtimeMode runtimeenv.Mode) schema.HardwareIngestRequest {
|
||||
snap.Board = board
|
||||
snap.Firmware = append(snap.Firmware, biosFW...)
|
||||
|
||||
cpus, cpuFW := collectCPUs(snap.Board.SerialNumber)
|
||||
snap.CPUs = cpus
|
||||
snap.Firmware = append(snap.Firmware, cpuFW...)
|
||||
snap.CPUs = collectCPUs()
|
||||
|
||||
snap.Memory = collectMemory()
|
||||
sensorDoc, err := readSensorsJSONDoc()
|
||||
if err != nil {
|
||||
slog.Info("sensors: unavailable for enrichment", "err", err)
|
||||
}
|
||||
snap.CPUs = enrichCPUsWithTelemetry(snap.CPUs, sensorDoc)
|
||||
snap.Memory = enrichMemoryWithTelemetry(snap.Memory, sensorDoc)
|
||||
snap.Storage = collectStorage()
|
||||
snap.PCIeDevices = collectPCIe()
|
||||
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices, snap.Board.SerialNumber)
|
||||
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices)
|
||||
snap.PCIeDevices = enrichPCIeWithMellanox(snap.PCIeDevices)
|
||||
snap.PCIeDevices = enrichPCIeWithNICTelemetry(snap.PCIeDevices)
|
||||
snap.PCIeDevices = enrichPCIeWithRAIDTelemetry(snap.PCIeDevices)
|
||||
snap.Storage = enrichStorageWithVROC(snap.Storage, snap.PCIeDevices)
|
||||
snap.Storage = appendUniqueStorage(snap.Storage, collectRAIDStorage(snap.PCIeDevices))
|
||||
snap.PowerSupplies = collectPSUs()
|
||||
snap.PowerSupplies = enrichPSUsWithTelemetry(snap.PowerSupplies, sensorDoc)
|
||||
snap.Sensors = buildSensorsFromDoc(sensorDoc)
|
||||
finalizeSnapshot(&snap, collectedAt)
|
||||
|
||||
// remaining collectors added in steps 1.8 – 1.10
|
||||
|
||||
slog.Info("audit completed", "duration", time.Since(start).Round(time.Millisecond))
|
||||
|
||||
sourceType := string(runtimeMode)
|
||||
protocol := "os-direct"
|
||||
|
||||
sourceType := "manual"
|
||||
var targetHost *string
|
||||
if hostname, err := os.Hostname(); err == nil && hostname != "" {
|
||||
targetHost = &hostname
|
||||
}
|
||||
return schema.HardwareIngestRequest{
|
||||
SourceType: &sourceType,
|
||||
Protocol: &protocol,
|
||||
CollectedAt: time.Now().UTC().Format(time.RFC3339),
|
||||
TargetHost: targetHost,
|
||||
CollectedAt: collectedAt,
|
||||
Hardware: snap,
|
||||
}
|
||||
}
|
||||
|
||||
64
audit/internal/collector/contract.go
Normal file
64
audit/internal/collector/contract.go
Normal file
@@ -0,0 +1,64 @@
|
||||
package collector
|
||||
|
||||
import "strings"
|
||||
|
||||
const (
|
||||
statusOK = "OK"
|
||||
statusWarning = "Warning"
|
||||
statusCritical = "Critical"
|
||||
statusUnknown = "Unknown"
|
||||
statusEmpty = "Empty"
|
||||
)
|
||||
|
||||
func mapPCIeDeviceClass(raw string) string {
|
||||
normalized := strings.ToLower(strings.TrimSpace(raw))
|
||||
switch {
|
||||
case normalized == "":
|
||||
return ""
|
||||
case strings.Contains(normalized, "ethernet controller"):
|
||||
return "EthernetController"
|
||||
case strings.Contains(normalized, "fibre channel"):
|
||||
return "FibreChannelController"
|
||||
case strings.Contains(normalized, "network controller"), strings.Contains(normalized, "infiniband controller"):
|
||||
return "NetworkController"
|
||||
case strings.Contains(normalized, "serial attached scsi"), strings.Contains(normalized, "storage controller"):
|
||||
return "StorageController"
|
||||
case strings.Contains(normalized, "raid"), strings.Contains(normalized, "mass storage"):
|
||||
return "MassStorageController"
|
||||
case strings.Contains(normalized, "display controller"):
|
||||
return "DisplayController"
|
||||
case strings.Contains(normalized, "vga"), strings.Contains(normalized, "3d controller"), strings.Contains(normalized, "video controller"):
|
||||
return "VideoController"
|
||||
case strings.Contains(normalized, "processing accelerators"), strings.Contains(normalized, "processing accelerator"):
|
||||
return "ProcessingAccelerator"
|
||||
default:
|
||||
return raw
|
||||
}
|
||||
}
|
||||
|
||||
func isNICClass(class string) bool {
|
||||
switch strings.TrimSpace(class) {
|
||||
case "EthernetController", "NetworkController":
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
func isGPUClass(class string) bool {
|
||||
switch strings.TrimSpace(class) {
|
||||
case "VideoController", "DisplayController", "ProcessingAccelerator":
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
func isRAIDClass(class string) bool {
|
||||
switch strings.TrimSpace(class) {
|
||||
case "MassStorageController", "StorageController":
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
@@ -3,42 +3,39 @@ package collector
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"bufio"
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// collectCPUs runs dmidecode -t 4 and reads microcode version from sysfs.
|
||||
func collectCPUs(boardSerial string) ([]schema.HardwareCPU, []schema.HardwareFirmwareRecord) {
|
||||
// collectCPUs runs dmidecode -t 4 and enriches CPUs with microcode from sysfs.
|
||||
func collectCPUs() []schema.HardwareCPU {
|
||||
out, err := runDmidecode("4")
|
||||
if err != nil {
|
||||
slog.Warn("cpu: dmidecode type 4 failed", "err", err)
|
||||
return nil, nil
|
||||
return nil
|
||||
}
|
||||
|
||||
cpus := parseCPUs(out, boardSerial)
|
||||
|
||||
var firmware []schema.HardwareFirmwareRecord
|
||||
cpus := parseCPUs(out)
|
||||
if mc := readMicrocode(); mc != "" {
|
||||
firmware = append(firmware, schema.HardwareFirmwareRecord{
|
||||
DeviceName: "CPU Microcode",
|
||||
Version: mc,
|
||||
})
|
||||
for i := range cpus {
|
||||
cpus[i].Firmware = &mc
|
||||
}
|
||||
}
|
||||
|
||||
slog.Info("cpu: collected", "count", len(cpus))
|
||||
return cpus, firmware
|
||||
return cpus
|
||||
}
|
||||
|
||||
// parseCPUs splits dmidecode output into per-processor sections and parses each.
|
||||
func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
|
||||
func parseCPUs(output string) []schema.HardwareCPU {
|
||||
sections := splitDMISections(output, "Processor Information")
|
||||
cpus := make([]schema.HardwareCPU, 0, len(sections))
|
||||
|
||||
for _, section := range sections {
|
||||
cpu, ok := parseCPUSection(section, boardSerial)
|
||||
cpu, ok := parseCPUSection(section)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
@@ -49,14 +46,16 @@ func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
|
||||
|
||||
// parseCPUSection parses one "Processor Information" block into a HardwareCPU.
|
||||
// Returns false if the socket is unpopulated.
|
||||
func parseCPUSection(fields map[string]string, boardSerial string) (schema.HardwareCPU, bool) {
|
||||
func parseCPUSection(fields map[string]string) (schema.HardwareCPU, bool) {
|
||||
status := parseCPUStatus(fields["Status"])
|
||||
if status == "EMPTY" {
|
||||
if status == statusEmpty {
|
||||
return schema.HardwareCPU{}, false
|
||||
}
|
||||
|
||||
cpu := schema.HardwareCPU{}
|
||||
cpu.Status = &status
|
||||
present := true
|
||||
cpu.Present = &present
|
||||
|
||||
if socket, ok := parseSocketIndex(fields["Socket Designation"]); ok {
|
||||
cpu.Socket = &socket
|
||||
@@ -70,11 +69,6 @@ func parseCPUSection(fields map[string]string, boardSerial string) (schema.Hardw
|
||||
}
|
||||
if v := cleanDMIValue(fields["Serial Number"]); v != "" {
|
||||
cpu.SerialNumber = &v
|
||||
} else if boardSerial != "" && cpu.Socket != nil {
|
||||
// Intel Xeon never exposes serial via DMI — generate stable fallback
|
||||
// matching core's generateCPUVendorSerial() logic
|
||||
fb := fmt.Sprintf("%s-CPU-%d", boardSerial, *cpu.Socket)
|
||||
cpu.SerialNumber = &fb
|
||||
}
|
||||
|
||||
if v := parseMHz(fields["Max Speed"]); v > 0 {
|
||||
@@ -99,15 +93,15 @@ func parseCPUStatus(raw string) string {
|
||||
upper := strings.ToUpper(raw)
|
||||
switch {
|
||||
case upper == "" || upper == "UNKNOWN":
|
||||
return "UNKNOWN"
|
||||
return statusUnknown
|
||||
case strings.Contains(upper, "UNPOPULATED") || strings.Contains(upper, "NOT POPULATED"):
|
||||
return "EMPTY"
|
||||
return statusEmpty
|
||||
case strings.Contains(upper, "ENABLED"):
|
||||
return "OK"
|
||||
return statusOK
|
||||
case strings.Contains(upper, "DISABLED"):
|
||||
return "WARNING"
|
||||
return statusWarning
|
||||
default:
|
||||
return "UNKNOWN"
|
||||
return statusUnknown
|
||||
}
|
||||
}
|
||||
|
||||
@@ -178,7 +172,7 @@ func parseInt(v string) int {
|
||||
// readMicrocode reads the CPU microcode revision from sysfs.
|
||||
// Returns empty string if unavailable.
|
||||
func readMicrocode() string {
|
||||
data, err := os.ReadFile("/sys/devices/system/cpu/cpu0/microcode/version")
|
||||
data, err := os.ReadFile(filepath.Join(cpuSysBaseDir, "cpu0", "microcode", "version"))
|
||||
if err != nil {
|
||||
return ""
|
||||
}
|
||||
|
||||
196
audit/internal/collector/cpu_telemetry.go
Normal file
196
audit/internal/collector/cpu_telemetry.go
Normal file
@@ -0,0 +1,196 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"regexp"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var (
|
||||
cpuSysBaseDir = "/sys/devices/system/cpu"
|
||||
socketIndexRe = regexp.MustCompile(`(?i)(?:package id|socket|cpu)\s*([0-9]+)`)
|
||||
)
|
||||
|
||||
func enrichCPUsWithTelemetry(cpus []schema.HardwareCPU, doc sensorsDoc) []schema.HardwareCPU {
|
||||
if len(cpus) == 0 {
|
||||
return cpus
|
||||
}
|
||||
|
||||
tempBySocket := cpuTempsFromSensors(doc, len(cpus))
|
||||
powerBySocket := cpuPowerFromSensors(doc, len(cpus))
|
||||
throttleBySocket := cpuThrottleBySocket()
|
||||
|
||||
for i := range cpus {
|
||||
socket := 0
|
||||
if cpus[i].Socket != nil {
|
||||
socket = *cpus[i].Socket
|
||||
}
|
||||
if value, ok := tempBySocket[socket]; ok {
|
||||
cpus[i].TemperatureC = &value
|
||||
}
|
||||
if value, ok := powerBySocket[socket]; ok {
|
||||
cpus[i].PowerW = &value
|
||||
}
|
||||
if value, ok := throttleBySocket[socket]; ok {
|
||||
cpus[i].Throttled = &value
|
||||
}
|
||||
}
|
||||
|
||||
return cpus
|
||||
}
|
||||
|
||||
func cpuTempsFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
|
||||
out := map[int]float64{}
|
||||
if len(doc) == 0 {
|
||||
return out
|
||||
}
|
||||
var fallback []float64
|
||||
for chip, features := range doc {
|
||||
for featureName, raw := range features {
|
||||
feature, ok := raw.(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if classifySensorFeature(feature) != "temp" {
|
||||
continue
|
||||
}
|
||||
temp, ok := firstFeatureFloat(feature, "_input")
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if socket, ok := detectCPUSocket(chip, featureName); ok {
|
||||
if _, exists := out[socket]; !exists {
|
||||
out[socket] = temp
|
||||
}
|
||||
continue
|
||||
}
|
||||
if isLikelyCPUTemp(chip, featureName) {
|
||||
fallback = append(fallback, temp)
|
||||
}
|
||||
}
|
||||
}
|
||||
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
|
||||
out[0] = fallback[0]
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func cpuPowerFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
|
||||
out := map[int]float64{}
|
||||
if len(doc) == 0 {
|
||||
return out
|
||||
}
|
||||
var fallback []float64
|
||||
for chip, features := range doc {
|
||||
for featureName, raw := range features {
|
||||
feature, ok := raw.(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if classifySensorFeature(feature) != "power" {
|
||||
continue
|
||||
}
|
||||
power, ok := firstFeatureFloatWithContains(feature, []string{"power"})
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if socket, ok := detectCPUSocket(chip, featureName); ok {
|
||||
if _, exists := out[socket]; !exists {
|
||||
out[socket] = power
|
||||
}
|
||||
continue
|
||||
}
|
||||
if isLikelyCPUPower(chip, featureName) {
|
||||
fallback = append(fallback, power)
|
||||
}
|
||||
}
|
||||
}
|
||||
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
|
||||
out[0] = fallback[0]
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func detectCPUSocket(parts ...string) (int, bool) {
|
||||
for _, part := range parts {
|
||||
matches := socketIndexRe.FindStringSubmatch(strings.ToLower(part))
|
||||
if len(matches) == 2 {
|
||||
value, err := strconv.Atoi(matches[1])
|
||||
if err == nil {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func isLikelyCPUTemp(chip, feature string) bool {
|
||||
value := strings.ToLower(chip + " " + feature)
|
||||
return strings.Contains(value, "coretemp") ||
|
||||
strings.Contains(value, "k10temp") ||
|
||||
strings.Contains(value, "package id") ||
|
||||
strings.Contains(value, "tdie") ||
|
||||
strings.Contains(value, "tctl") ||
|
||||
strings.Contains(value, "cpu temp")
|
||||
}
|
||||
|
||||
func isLikelyCPUPower(chip, feature string) bool {
|
||||
value := strings.ToLower(chip + " " + feature)
|
||||
return strings.Contains(value, "intel-rapl") ||
|
||||
strings.Contains(value, "package id") ||
|
||||
strings.Contains(value, "package-") ||
|
||||
strings.Contains(value, "cpu power")
|
||||
}
|
||||
|
||||
func cpuThrottleBySocket() map[int]bool {
|
||||
out := map[int]bool{}
|
||||
cpuDirs, err := filepath.Glob(filepath.Join(cpuSysBaseDir, "cpu[0-9]*"))
|
||||
if err != nil {
|
||||
return out
|
||||
}
|
||||
sort.Strings(cpuDirs)
|
||||
for _, cpuDir := range cpuDirs {
|
||||
socket, ok := readSocketIndex(cpuDir)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if cpuPackageThrottled(cpuDir) {
|
||||
out[socket] = true
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func readSocketIndex(cpuDir string) (int, bool) {
|
||||
raw, err := os.ReadFile(filepath.Join(cpuDir, "topology", "physical_package_id"))
|
||||
if err != nil {
|
||||
return 0, false
|
||||
}
|
||||
value, err := strconv.Atoi(strings.TrimSpace(string(raw)))
|
||||
if err != nil || value < 0 {
|
||||
return 0, false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func cpuPackageThrottled(cpuDir string) bool {
|
||||
paths := []string{
|
||||
filepath.Join(cpuDir, "thermal_throttle", "package_throttle_count"),
|
||||
filepath.Join(cpuDir, "thermal_throttle", "core_throttle_count"),
|
||||
}
|
||||
for _, path := range paths {
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
|
||||
if err == nil && value > 0 {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
71
audit/internal/collector/cpu_telemetry_test.go
Normal file
71
audit/internal/collector/cpu_telemetry_test.go
Normal file
@@ -0,0 +1,71 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
func TestEnrichCPUsWithTelemetry(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
oldBase := cpuSysBaseDir
|
||||
cpuSysBaseDir = tmp
|
||||
t.Cleanup(func() { cpuSysBaseDir = oldBase })
|
||||
|
||||
mustWriteFile(t, filepath.Join(tmp, "cpu0", "topology", "physical_package_id"), "0\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "cpu0", "thermal_throttle", "package_throttle_count"), "3\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "cpu1", "topology", "physical_package_id"), "1\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "cpu1", "thermal_throttle", "package_throttle_count"), "0\n")
|
||||
|
||||
doc := sensorsDoc{
|
||||
"coretemp-isa-0000": {
|
||||
"Package id 0": map[string]any{"temp1_input": 61.5},
|
||||
"Package id 1": map[string]any{"temp2_input": 58.0},
|
||||
},
|
||||
"intel-rapl-mmio-0": {
|
||||
"Package id 0": map[string]any{"power1_average": 180.0},
|
||||
"Package id 1": map[string]any{"power2_average": 175.0},
|
||||
},
|
||||
}
|
||||
|
||||
socket0 := 0
|
||||
socket1 := 1
|
||||
status := statusOK
|
||||
cpus := []schema.HardwareCPU{
|
||||
{Socket: &socket0, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{Socket: &socket1, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
}
|
||||
|
||||
got := enrichCPUsWithTelemetry(cpus, doc)
|
||||
|
||||
if got[0].TemperatureC == nil || *got[0].TemperatureC != 61.5 {
|
||||
t.Fatalf("cpu0 temperature mismatch: %#v", got[0].TemperatureC)
|
||||
}
|
||||
if got[0].PowerW == nil || *got[0].PowerW != 180.0 {
|
||||
t.Fatalf("cpu0 power mismatch: %#v", got[0].PowerW)
|
||||
}
|
||||
if got[0].Throttled == nil || !*got[0].Throttled {
|
||||
t.Fatalf("cpu0 throttled mismatch: %#v", got[0].Throttled)
|
||||
}
|
||||
if got[1].TemperatureC == nil || *got[1].TemperatureC != 58.0 {
|
||||
t.Fatalf("cpu1 temperature mismatch: %#v", got[1].TemperatureC)
|
||||
}
|
||||
if got[1].PowerW == nil || *got[1].PowerW != 175.0 {
|
||||
t.Fatalf("cpu1 power mismatch: %#v", got[1].PowerW)
|
||||
}
|
||||
if got[1].Throttled != nil && *got[1].Throttled {
|
||||
t.Fatalf("cpu1 throttled mismatch: %#v", got[1].Throttled)
|
||||
}
|
||||
}
|
||||
|
||||
func mustWriteFile(t *testing.T, path, content string) {
|
||||
t.Helper()
|
||||
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
|
||||
t.Fatalf("mkdir %s: %v", path, err)
|
||||
}
|
||||
if err := os.WriteFile(path, []byte(content), 0644); err != nil {
|
||||
t.Fatalf("write %s: %v", path, err)
|
||||
}
|
||||
}
|
||||
@@ -1,12 +1,14 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestParseCPUs_dual_socket(t *testing.T) {
|
||||
out := mustReadFile(t, "testdata/dmidecode_type4.txt")
|
||||
cpus := parseCPUs(out, "CAR315KA0803B90")
|
||||
cpus := parseCPUs(out)
|
||||
|
||||
if len(cpus) != 2 {
|
||||
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
|
||||
@@ -37,23 +39,22 @@ func TestParseCPUs_dual_socket(t *testing.T) {
|
||||
if cpu0.Status == nil || *cpu0.Status != "OK" {
|
||||
t.Errorf("cpu0 status: got %v, want OK", cpu0.Status)
|
||||
}
|
||||
// Intel Xeon serial not available → fallback
|
||||
if cpu0.SerialNumber == nil || *cpu0.SerialNumber != "CAR315KA0803B90-CPU-0" {
|
||||
t.Errorf("cpu0 serial fallback: got %v, want CAR315KA0803B90-CPU-0", cpu0.SerialNumber)
|
||||
if cpu0.SerialNumber != nil {
|
||||
t.Errorf("cpu0 serial should stay nil without source data, got %v", cpu0.SerialNumber)
|
||||
}
|
||||
|
||||
cpu1 := cpus[1]
|
||||
if cpu1.Socket == nil || *cpu1.Socket != 1 {
|
||||
t.Errorf("cpu1 socket: got %v, want 1", cpu1.Socket)
|
||||
}
|
||||
if cpu1.SerialNumber == nil || *cpu1.SerialNumber != "CAR315KA0803B90-CPU-1" {
|
||||
t.Errorf("cpu1 serial fallback: got %v, want CAR315KA0803B90-CPU-1", cpu1.SerialNumber)
|
||||
if cpu1.SerialNumber != nil {
|
||||
t.Errorf("cpu1 serial should stay nil without source data, got %v", cpu1.SerialNumber)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseCPUs_unpopulated_skipped(t *testing.T) {
|
||||
out := mustReadFile(t, "testdata/dmidecode_type4_disabled.txt")
|
||||
cpus := parseCPUs(out, "BOARD-001")
|
||||
cpus := parseCPUs(out)
|
||||
|
||||
if len(cpus) != 1 {
|
||||
t.Fatalf("expected 1 CPU (unpopulated skipped), got %d", len(cpus))
|
||||
@@ -63,18 +64,51 @@ func TestParseCPUs_unpopulated_skipped(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestCollectCPUsSetsFirmwareFromMicrocode(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
origBase := cpuSysBaseDir
|
||||
cpuSysBaseDir = tmp
|
||||
t.Cleanup(func() { cpuSysBaseDir = origBase })
|
||||
|
||||
if err := os.MkdirAll(filepath.Join(tmp, "cpu0", "microcode"), 0755); err != nil {
|
||||
t.Fatalf("mkdir microcode dir: %v", err)
|
||||
}
|
||||
if err := os.WriteFile(filepath.Join(tmp, "cpu0", "microcode", "version"), []byte("0x2b000643\n"), 0644); err != nil {
|
||||
t.Fatalf("write microcode version: %v", err)
|
||||
}
|
||||
|
||||
origRun := execDmidecode
|
||||
execDmidecode = func(typeNum string) (string, error) {
|
||||
if typeNum != "4" {
|
||||
t.Fatalf("unexpected dmidecode type: %s", typeNum)
|
||||
}
|
||||
return mustReadFile(t, "testdata/dmidecode_type4.txt"), nil
|
||||
}
|
||||
t.Cleanup(func() { execDmidecode = origRun })
|
||||
|
||||
cpus := collectCPUs()
|
||||
if len(cpus) != 2 {
|
||||
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
|
||||
}
|
||||
for i, cpu := range cpus {
|
||||
if cpu.Firmware == nil || *cpu.Firmware != "0x2b000643" {
|
||||
t.Fatalf("cpu[%d] firmware=%v want microcode", i, cpu.Firmware)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseCPUStatus(t *testing.T) {
|
||||
tests := []struct {
|
||||
input string
|
||||
want string
|
||||
}{
|
||||
{"Populated, Enabled", "OK"},
|
||||
{"Populated, Disabled By User", "WARNING"},
|
||||
{"Populated, Disabled By BIOS", "WARNING"},
|
||||
{"Unpopulated", "EMPTY"},
|
||||
{"Not Populated", "EMPTY"},
|
||||
{"Unknown", "UNKNOWN"},
|
||||
{"", "UNKNOWN"},
|
||||
{"Populated, Disabled By User", statusWarning},
|
||||
{"Populated, Disabled By BIOS", statusWarning},
|
||||
{"Unpopulated", statusEmpty},
|
||||
{"Not Populated", statusEmpty},
|
||||
{"Unknown", statusUnknown},
|
||||
{"", statusUnknown},
|
||||
}
|
||||
for _, tt := range tests {
|
||||
got := parseCPUStatus(tt.input)
|
||||
|
||||
77
audit/internal/collector/finalize.go
Normal file
77
audit/internal/collector/finalize.go
Normal file
@@ -0,0 +1,77 @@
|
||||
package collector
|
||||
|
||||
import "bee/audit/internal/schema"
|
||||
|
||||
func finalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
|
||||
snap.Memory = filterMemory(snap.Memory)
|
||||
snap.Storage = filterStorage(snap.Storage)
|
||||
snap.PowerSupplies = filterPSUs(snap.PowerSupplies)
|
||||
|
||||
setComponentStatusMetadata(snap, collectedAt)
|
||||
}
|
||||
|
||||
func filterMemory(dimms []schema.HardwareMemory) []schema.HardwareMemory {
|
||||
out := make([]schema.HardwareMemory, 0, len(dimms))
|
||||
for _, dimm := range dimms {
|
||||
if dimm.Present != nil && !*dimm.Present {
|
||||
continue
|
||||
}
|
||||
if dimm.Status != nil && *dimm.Status == statusEmpty {
|
||||
continue
|
||||
}
|
||||
if dimm.SerialNumber == nil || *dimm.SerialNumber == "" {
|
||||
continue
|
||||
}
|
||||
out = append(out, dimm)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func filterStorage(disks []schema.HardwareStorage) []schema.HardwareStorage {
|
||||
out := make([]schema.HardwareStorage, 0, len(disks))
|
||||
for _, disk := range disks {
|
||||
if disk.SerialNumber == nil || *disk.SerialNumber == "" {
|
||||
continue
|
||||
}
|
||||
out = append(out, disk)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func filterPSUs(psus []schema.HardwarePowerSupply) []schema.HardwarePowerSupply {
|
||||
out := make([]schema.HardwarePowerSupply, 0, len(psus))
|
||||
for _, psu := range psus {
|
||||
if psu.SerialNumber == nil || *psu.SerialNumber == "" {
|
||||
continue
|
||||
}
|
||||
out = append(out, psu)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func setComponentStatusMetadata(snap *schema.HardwareSnapshot, collectedAt string) {
|
||||
for i := range snap.CPUs {
|
||||
setStatusCheckedAt(&snap.CPUs[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
for i := range snap.Memory {
|
||||
setStatusCheckedAt(&snap.Memory[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
for i := range snap.Storage {
|
||||
setStatusCheckedAt(&snap.Storage[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
for i := range snap.PCIeDevices {
|
||||
setStatusCheckedAt(&snap.PCIeDevices[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
for i := range snap.PowerSupplies {
|
||||
setStatusCheckedAt(&snap.PowerSupplies[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
}
|
||||
|
||||
func setStatusCheckedAt(status *schema.HardwareComponentStatus, collectedAt string) {
|
||||
if status == nil || status.Status == nil || *status.Status == "" {
|
||||
return
|
||||
}
|
||||
if status.StatusCheckedAt == nil {
|
||||
status.StatusCheckedAt = &collectedAt
|
||||
}
|
||||
}
|
||||
63
audit/internal/collector/finalize_test.go
Normal file
63
audit/internal/collector/finalize_test.go
Normal file
@@ -0,0 +1,63 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
|
||||
collectedAt := "2026-03-15T12:00:00Z"
|
||||
present := true
|
||||
status := statusOK
|
||||
serial := "SN-1"
|
||||
|
||||
snap := schema.HardwareSnapshot{
|
||||
Memory: []schema.HardwareMemory{
|
||||
{Present: &present, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{Present: &present, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
},
|
||||
Storage: []schema.HardwareStorage{
|
||||
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
},
|
||||
PowerSupplies: []schema.HardwarePowerSupply{
|
||||
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
},
|
||||
}
|
||||
|
||||
finalizeSnapshot(&snap, collectedAt)
|
||||
|
||||
if len(snap.Memory) != 1 || snap.Memory[0].StatusCheckedAt == nil || *snap.Memory[0].StatusCheckedAt != collectedAt {
|
||||
t.Fatalf("memory finalize mismatch: %+v", snap.Memory)
|
||||
}
|
||||
if len(snap.Storage) != 1 || snap.Storage[0].StatusCheckedAt == nil || *snap.Storage[0].StatusCheckedAt != collectedAt {
|
||||
t.Fatalf("storage finalize mismatch: %+v", snap.Storage)
|
||||
}
|
||||
if len(snap.PowerSupplies) != 1 || snap.PowerSupplies[0].StatusCheckedAt == nil || *snap.PowerSupplies[0].StatusCheckedAt != collectedAt {
|
||||
t.Fatalf("psu finalize mismatch: %+v", snap.PowerSupplies)
|
||||
}
|
||||
}
|
||||
|
||||
func TestFinalizeSnapshotPreservesDuplicateSerials(t *testing.T) {
|
||||
collectedAt := "2026-03-15T12:00:00Z"
|
||||
status := statusOK
|
||||
model := "Device"
|
||||
serial := "DUPLICATE"
|
||||
|
||||
snap := schema.HardwareSnapshot{
|
||||
Storage: []schema.HardwareStorage{
|
||||
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
},
|
||||
}
|
||||
|
||||
finalizeSnapshot(&snap, collectedAt)
|
||||
|
||||
if got := *snap.Storage[0].SerialNumber; got != serial {
|
||||
t.Fatalf("first serial changed: %q", got)
|
||||
}
|
||||
if got := *snap.Storage[1].SerialNumber; got != serial {
|
||||
t.Fatalf("duplicate serial should stay unchanged: %q", got)
|
||||
}
|
||||
}
|
||||
@@ -47,12 +47,12 @@ func parseMemorySection(fields map[string]string) schema.HardwareMemory {
|
||||
dimm.Present = &present
|
||||
|
||||
if !present {
|
||||
status := "EMPTY"
|
||||
status := statusEmpty
|
||||
dimm.Status = &status
|
||||
return dimm
|
||||
}
|
||||
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
dimm.Status = &status
|
||||
|
||||
if mb := parseMemorySizeMB(rawSize); mb > 0 {
|
||||
|
||||
203
audit/internal/collector/memory_telemetry.go
Normal file
203
audit/internal/collector/memory_telemetry.go
Normal file
@@ -0,0 +1,203 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var edacBaseDir = "/sys/devices/system/edac/mc"
|
||||
|
||||
type edacDIMMStats struct {
|
||||
Label string
|
||||
CECount *int64
|
||||
UECount *int64
|
||||
}
|
||||
|
||||
func enrichMemoryWithTelemetry(dimms []schema.HardwareMemory, doc sensorsDoc) []schema.HardwareMemory {
|
||||
if len(dimms) == 0 {
|
||||
return dimms
|
||||
}
|
||||
|
||||
tempByLabel := memoryTempsFromSensors(doc)
|
||||
stats := readEDACStats()
|
||||
|
||||
for i := range dimms {
|
||||
labelKeys := dimmMatchKeys(dimms[i].Slot, dimms[i].Location)
|
||||
|
||||
for _, key := range labelKeys {
|
||||
if temp, ok := tempByLabel[key]; ok {
|
||||
dimms[i].TemperatureC = &temp
|
||||
break
|
||||
}
|
||||
}
|
||||
|
||||
for _, key := range labelKeys {
|
||||
if stat, ok := stats[key]; ok {
|
||||
if stat.CECount != nil {
|
||||
dimms[i].CorrectableECCErrorCount = stat.CECount
|
||||
}
|
||||
if stat.UECount != nil {
|
||||
dimms[i].UncorrectableECCErrorCount = stat.UECount
|
||||
}
|
||||
if stat.UECount != nil && *stat.UECount > 0 {
|
||||
dimms[i].DataLossDetected = boolPtr(true)
|
||||
status := statusCritical
|
||||
dimms[i].Status = &status
|
||||
if dimms[i].ErrorDescription == nil {
|
||||
dimms[i].ErrorDescription = stringPtr("EDAC reports uncorrectable ECC errors")
|
||||
}
|
||||
} else if stat.CECount != nil && *stat.CECount > 0 && (dimms[i].Status == nil || *dimms[i].Status == statusOK) {
|
||||
status := statusWarning
|
||||
dimms[i].Status = &status
|
||||
if dimms[i].ErrorDescription == nil {
|
||||
dimms[i].ErrorDescription = stringPtr("EDAC reports correctable ECC errors")
|
||||
}
|
||||
}
|
||||
break
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return dimms
|
||||
}
|
||||
|
||||
func memoryTempsFromSensors(doc sensorsDoc) map[string]float64 {
|
||||
out := map[string]float64{}
|
||||
if len(doc) == 0 {
|
||||
return out
|
||||
}
|
||||
for chip, features := range doc {
|
||||
for featureName, raw := range features {
|
||||
feature, ok := raw.(map[string]any)
|
||||
if !ok || classifySensorFeature(feature) != "temp" {
|
||||
continue
|
||||
}
|
||||
if !isLikelyMemoryTemp(chip, featureName) {
|
||||
continue
|
||||
}
|
||||
temp, ok := firstFeatureFloat(feature, "_input")
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
key := canonicalLabel(featureName)
|
||||
if key == "" {
|
||||
continue
|
||||
}
|
||||
if _, exists := out[key]; !exists {
|
||||
out[key] = temp
|
||||
}
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func readEDACStats() map[string]edacDIMMStats {
|
||||
out := map[string]edacDIMMStats{}
|
||||
mcDirs, err := filepath.Glob(filepath.Join(edacBaseDir, "mc*"))
|
||||
if err != nil {
|
||||
return out
|
||||
}
|
||||
sort.Strings(mcDirs)
|
||||
for _, mcDir := range mcDirs {
|
||||
dimmDirs, err := filepath.Glob(filepath.Join(mcDir, "dimm*"))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
sort.Strings(dimmDirs)
|
||||
for _, dimmDir := range dimmDirs {
|
||||
stat, ok := readEDACDIMMStats(dimmDir)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
key := canonicalLabel(stat.Label)
|
||||
if key == "" {
|
||||
continue
|
||||
}
|
||||
out[key] = stat
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func readEDACDIMMStats(dimmDir string) (edacDIMMStats, bool) {
|
||||
labelBytes, err := os.ReadFile(filepath.Join(dimmDir, "dimm_label"))
|
||||
if err != nil {
|
||||
labelBytes, err = os.ReadFile(filepath.Join(dimmDir, "label"))
|
||||
if err != nil {
|
||||
return edacDIMMStats{}, false
|
||||
}
|
||||
}
|
||||
label := strings.TrimSpace(string(labelBytes))
|
||||
if label == "" {
|
||||
return edacDIMMStats{}, false
|
||||
}
|
||||
|
||||
stat := edacDIMMStats{Label: label}
|
||||
if value, ok := readEDACCount(dimmDir, []string{"dimm_ce_count", "ce_count"}); ok {
|
||||
stat.CECount = &value
|
||||
}
|
||||
if value, ok := readEDACCount(dimmDir, []string{"dimm_ue_count", "ue_count"}); ok {
|
||||
stat.UECount = &value
|
||||
}
|
||||
return stat, true
|
||||
}
|
||||
|
||||
func readEDACCount(dir string, names []string) (int64, bool) {
|
||||
for _, name := range names {
|
||||
raw, err := os.ReadFile(filepath.Join(dir, name))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
|
||||
if err == nil && value >= 0 {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func dimmMatchKeys(slot, location *string) []string {
|
||||
var out []string
|
||||
add := func(value *string) {
|
||||
key := canonicalLabel(derefString(value))
|
||||
if key == "" {
|
||||
return
|
||||
}
|
||||
for _, existing := range out {
|
||||
if existing == key {
|
||||
return
|
||||
}
|
||||
}
|
||||
out = append(out, key)
|
||||
}
|
||||
add(slot)
|
||||
add(location)
|
||||
return out
|
||||
}
|
||||
|
||||
func canonicalLabel(value string) string {
|
||||
value = strings.ToUpper(strings.TrimSpace(value))
|
||||
if value == "" {
|
||||
return ""
|
||||
}
|
||||
var b strings.Builder
|
||||
for _, r := range value {
|
||||
if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
|
||||
b.WriteRune(r)
|
||||
}
|
||||
}
|
||||
return b.String()
|
||||
}
|
||||
|
||||
func isLikelyMemoryTemp(chip, feature string) bool {
|
||||
value := strings.ToLower(chip + " " + feature)
|
||||
return strings.Contains(value, "dimm") || strings.Contains(value, "sodimm")
|
||||
}
|
||||
|
||||
func boolPtr(value bool) *bool {
|
||||
return &value
|
||||
}
|
||||
61
audit/internal/collector/memory_telemetry_test.go
Normal file
61
audit/internal/collector/memory_telemetry_test.go
Normal file
@@ -0,0 +1,61 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"path/filepath"
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
func TestEnrichMemoryWithTelemetry(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
oldBase := edacBaseDir
|
||||
edacBaseDir = tmp
|
||||
t.Cleanup(func() { edacBaseDir = oldBase })
|
||||
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_label"), "CPU0_DIMM_A1\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ce_count"), "7\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ue_count"), "0\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_label"), "CPU1_DIMM_B2\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ce_count"), "0\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ue_count"), "2\n")
|
||||
|
||||
doc := sensorsDoc{
|
||||
"jc42-i2c-0-18": {
|
||||
"CPU0 DIMM A1": map[string]any{"temp1_input": 43.0},
|
||||
"CPU1 DIMM B2": map[string]any{"temp2_input": 46.0},
|
||||
},
|
||||
}
|
||||
|
||||
status := statusOK
|
||||
slotA := "CPU0_DIMM_A1"
|
||||
slotB := "CPU1_DIMM_B2"
|
||||
dimms := []schema.HardwareMemory{
|
||||
{Slot: &slotA, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{Slot: &slotB, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
}
|
||||
|
||||
got := enrichMemoryWithTelemetry(dimms, doc)
|
||||
|
||||
if got[0].TemperatureC == nil || *got[0].TemperatureC != 43.0 {
|
||||
t.Fatalf("dimm0 temperature mismatch: %#v", got[0].TemperatureC)
|
||||
}
|
||||
if got[0].CorrectableECCErrorCount == nil || *got[0].CorrectableECCErrorCount != 7 {
|
||||
t.Fatalf("dimm0 ce mismatch: %#v", got[0].CorrectableECCErrorCount)
|
||||
}
|
||||
if got[0].Status == nil || *got[0].Status != statusWarning {
|
||||
t.Fatalf("dimm0 status mismatch: %#v", got[0].Status)
|
||||
}
|
||||
if got[1].TemperatureC == nil || *got[1].TemperatureC != 46.0 {
|
||||
t.Fatalf("dimm1 temperature mismatch: %#v", got[1].TemperatureC)
|
||||
}
|
||||
if got[1].UncorrectableECCErrorCount == nil || *got[1].UncorrectableECCErrorCount != 2 {
|
||||
t.Fatalf("dimm1 ue mismatch: %#v", got[1].UncorrectableECCErrorCount)
|
||||
}
|
||||
if got[1].Status == nil || *got[1].Status != statusCritical {
|
||||
t.Fatalf("dimm1 status mismatch: %#v", got[1].Status)
|
||||
}
|
||||
if got[1].DataLossDetected == nil || !*got[1].DataLossDetected {
|
||||
t.Fatalf("dimm1 data_loss_detected mismatch: %#v", got[1].DataLossDetected)
|
||||
}
|
||||
}
|
||||
@@ -18,17 +18,13 @@ var (
|
||||
}
|
||||
return string(out), nil
|
||||
}
|
||||
readNetStatFile = func(iface, key string) (int64, error) {
|
||||
path := filepath.Join("/sys/class/net", iface, "statistics", key)
|
||||
readNetAddressFile = func(iface string) (string, error) {
|
||||
path := filepath.Join("/sys/class/net", iface, "address")
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return 0, err
|
||||
return "", err
|
||||
}
|
||||
v, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
|
||||
if err != nil {
|
||||
return 0, err
|
||||
}
|
||||
return v, nil
|
||||
return strings.TrimSpace(string(raw)), nil
|
||||
}
|
||||
)
|
||||
|
||||
@@ -47,6 +43,7 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
|
||||
continue
|
||||
}
|
||||
iface := ifaces[0]
|
||||
devs[i].MacAddresses = collectInterfaceMACs(ifaces)
|
||||
|
||||
if devs[i].Firmware == nil {
|
||||
if out, err := ethtoolInfoQuery(iface); err == nil {
|
||||
@@ -56,16 +53,13 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
|
||||
}
|
||||
}
|
||||
|
||||
if devs[i].Telemetry == nil {
|
||||
devs[i].Telemetry = map[string]any{}
|
||||
}
|
||||
injectNICPacketStats(devs[i].Telemetry, iface)
|
||||
if out, err := ethtoolModuleQuery(iface); err == nil {
|
||||
injectSFPDOMTelemetry(devs[i].Telemetry, out)
|
||||
if injectSFPDOMTelemetry(&devs[i], out) {
|
||||
enriched++
|
||||
continue
|
||||
}
|
||||
}
|
||||
if len(devs[i].Telemetry) == 0 {
|
||||
devs[i].Telemetry = nil
|
||||
} else {
|
||||
if len(devs[i].MacAddresses) > 0 || devs[i].Firmware != nil {
|
||||
enriched++
|
||||
}
|
||||
}
|
||||
@@ -77,31 +71,32 @@ func isNICDevice(dev schema.HardwarePCIeDevice) bool {
|
||||
if dev.DeviceClass == nil {
|
||||
return false
|
||||
}
|
||||
c := strings.ToLower(strings.TrimSpace(*dev.DeviceClass))
|
||||
return strings.Contains(c, "ethernet controller") ||
|
||||
strings.Contains(c, "network controller") ||
|
||||
strings.Contains(c, "infiniband controller")
|
||||
c := strings.TrimSpace(*dev.DeviceClass)
|
||||
return isNICClass(c) || strings.EqualFold(c, "FibreChannelController")
|
||||
}
|
||||
|
||||
func injectNICPacketStats(dst map[string]any, iface string) {
|
||||
for _, key := range []string{"rx_packets", "tx_packets", "rx_errors", "tx_errors"} {
|
||||
if v, err := readNetStatFile(iface, key); err == nil {
|
||||
dst[key] = v
|
||||
func collectInterfaceMACs(ifaces []string) []string {
|
||||
seen := map[string]struct{}{}
|
||||
var out []string
|
||||
for _, iface := range ifaces {
|
||||
mac, err := readNetAddressFile(iface)
|
||||
if err != nil || mac == "" {
|
||||
continue
|
||||
}
|
||||
mac = strings.ToLower(strings.TrimSpace(mac))
|
||||
if _, ok := seen[mac]; ok {
|
||||
continue
|
||||
}
|
||||
seen[mac] = struct{}{}
|
||||
out = append(out, mac)
|
||||
}
|
||||
}
|
||||
|
||||
func injectSFPDOMTelemetry(dst map[string]any, raw string) {
|
||||
parsed := parseSFPDOM(raw)
|
||||
for k, v := range parsed {
|
||||
dst[k] = v
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
var floatRe = regexp.MustCompile(`[-+]?[0-9]*\.?[0-9]+`)
|
||||
|
||||
func parseSFPDOM(raw string) map[string]any {
|
||||
out := map[string]any{}
|
||||
func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
|
||||
var changed bool
|
||||
for _, line := range strings.Split(raw, "\n") {
|
||||
trimmed := strings.TrimSpace(line)
|
||||
if trimmed == "" {
|
||||
@@ -117,26 +112,55 @@ func parseSFPDOM(raw string) map[string]any {
|
||||
switch {
|
||||
case strings.Contains(key, "module temperature"):
|
||||
if f, ok := firstFloat(val); ok {
|
||||
out["sfp_temperature_c"] = f
|
||||
dev.SFPTemperatureC = &f
|
||||
changed = true
|
||||
}
|
||||
case strings.Contains(key, "laser output power"):
|
||||
if f, ok := dbmValue(val); ok {
|
||||
out["sfp_tx_power_dbm"] = f
|
||||
dev.SFPTXPowerDBM = &f
|
||||
changed = true
|
||||
}
|
||||
case strings.Contains(key, "receiver signal"):
|
||||
if f, ok := dbmValue(val); ok {
|
||||
out["sfp_rx_power_dbm"] = f
|
||||
dev.SFPRXPowerDBM = &f
|
||||
changed = true
|
||||
}
|
||||
case strings.Contains(key, "module voltage"):
|
||||
if f, ok := firstFloat(val); ok {
|
||||
out["sfp_voltage_v"] = f
|
||||
dev.SFPVoltageV = &f
|
||||
changed = true
|
||||
}
|
||||
case strings.Contains(key, "laser bias current"):
|
||||
if f, ok := firstFloat(val); ok {
|
||||
out["sfp_bias_ma"] = f
|
||||
dev.SFPBiasMA = &f
|
||||
changed = true
|
||||
}
|
||||
}
|
||||
}
|
||||
return changed
|
||||
}
|
||||
|
||||
func parseSFPDOM(raw string) map[string]any {
|
||||
dev := schema.HardwarePCIeDevice{}
|
||||
if !injectSFPDOMTelemetry(&dev, raw) {
|
||||
return map[string]any{}
|
||||
}
|
||||
out := map[string]any{}
|
||||
if dev.SFPTemperatureC != nil {
|
||||
out["sfp_temperature_c"] = *dev.SFPTemperatureC
|
||||
}
|
||||
if dev.SFPTXPowerDBM != nil {
|
||||
out["sfp_tx_power_dbm"] = *dev.SFPTXPowerDBM
|
||||
}
|
||||
if dev.SFPRXPowerDBM != nil {
|
||||
out["sfp_rx_power_dbm"] = *dev.SFPRXPowerDBM
|
||||
}
|
||||
if dev.SFPVoltageV != nil {
|
||||
out["sfp_voltage_v"] = *dev.SFPVoltageV
|
||||
}
|
||||
if dev.SFPBiasMA != nil {
|
||||
out["sfp_bias_ma"] = *dev.SFPBiasMA
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
|
||||
@@ -24,18 +24,17 @@ type nvidiaGPUInfo struct {
|
||||
}
|
||||
|
||||
// enrichPCIeWithNVIDIA enriches NVIDIA PCIe devices with data from nvidia-smi.
|
||||
// If the driver/tool is unavailable, NVIDIA devices get UNKNOWN status and
|
||||
// a stable serial fallback based on board serial + slot.
|
||||
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice, boardSerial string) []schema.HardwarePCIeDevice {
|
||||
// If the driver/tool is unavailable, NVIDIA devices get Unknown status.
|
||||
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
|
||||
if !hasNVIDIADevices(devs) {
|
||||
return devs
|
||||
}
|
||||
gpuByBDF, err := queryNVIDIAGPUs()
|
||||
if err != nil {
|
||||
slog.Info("nvidia: enrichment skipped", "err", err)
|
||||
return enrichPCIeWithNVIDIAData(devs, nil, boardSerial, false)
|
||||
return enrichPCIeWithNVIDIAData(devs, nil, false)
|
||||
}
|
||||
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, boardSerial, true)
|
||||
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, true)
|
||||
}
|
||||
|
||||
func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
|
||||
@@ -47,7 +46,7 @@ func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
|
||||
return false
|
||||
}
|
||||
|
||||
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, boardSerial string, driverLoaded bool) []schema.HardwarePCIeDevice {
|
||||
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, driverLoaded bool) []schema.HardwarePCIeDevice {
|
||||
enriched := 0
|
||||
for i := range devs {
|
||||
if !isNVIDIADevice(devs[i]) {
|
||||
@@ -55,7 +54,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
|
||||
}
|
||||
|
||||
if !driverLoaded {
|
||||
setPCIeFallback(&devs[i], boardSerial)
|
||||
setPCIeFallback(&devs[i])
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -65,22 +64,21 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
|
||||
}
|
||||
info, ok := gpuByBDF[bdf]
|
||||
if !ok {
|
||||
setPCIeFallback(&devs[i], boardSerial)
|
||||
setPCIeFallback(&devs[i])
|
||||
continue
|
||||
}
|
||||
|
||||
if v := strings.TrimSpace(info.Serial); v != "" {
|
||||
devs[i].SerialNumber = &v
|
||||
} else {
|
||||
setPCIeFallbackSerial(&devs[i], boardSerial)
|
||||
}
|
||||
if v := strings.TrimSpace(info.VBIOS); v != "" {
|
||||
devs[i].Firmware = &v
|
||||
}
|
||||
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
if info.ECCUncorrected != nil && *info.ECCUncorrected > 0 {
|
||||
status = "WARNING"
|
||||
status = statusWarning
|
||||
devs[i].ErrorDescription = stringPtr("GPU reports uncorrected ECC errors")
|
||||
}
|
||||
devs[i].Status = &status
|
||||
injectNVIDIATelemetry(&devs[i], info)
|
||||
@@ -212,46 +210,25 @@ func isNVIDIADevice(dev schema.HardwarePCIeDevice) bool {
|
||||
return false
|
||||
}
|
||||
|
||||
func setPCIeFallback(dev *schema.HardwarePCIeDevice, boardSerial string) {
|
||||
setPCIeFallbackSerial(dev, boardSerial)
|
||||
status := "UNKNOWN"
|
||||
func setPCIeFallback(dev *schema.HardwarePCIeDevice) {
|
||||
status := statusUnknown
|
||||
dev.Status = &status
|
||||
}
|
||||
|
||||
func setPCIeFallbackSerial(dev *schema.HardwarePCIeDevice, boardSerial string) {
|
||||
if strings.TrimSpace(boardSerial) == "" || dev.SerialNumber != nil {
|
||||
return
|
||||
}
|
||||
slot := "unknown"
|
||||
if dev.BDF != nil && strings.TrimSpace(*dev.BDF) != "" {
|
||||
slot = strings.TrimSpace(*dev.BDF)
|
||||
} else if dev.Slot != nil && strings.TrimSpace(*dev.Slot) != "" {
|
||||
slot = strings.TrimSpace(*dev.Slot)
|
||||
}
|
||||
fb := fmt.Sprintf("%s-PCIE-%s", boardSerial, slot)
|
||||
dev.SerialNumber = &fb
|
||||
}
|
||||
|
||||
func injectNVIDIATelemetry(dev *schema.HardwarePCIeDevice, info nvidiaGPUInfo) {
|
||||
if dev.Telemetry == nil {
|
||||
dev.Telemetry = map[string]any{}
|
||||
}
|
||||
if info.TemperatureC != nil {
|
||||
dev.Telemetry["temperature_c"] = *info.TemperatureC
|
||||
dev.TemperatureC = info.TemperatureC
|
||||
}
|
||||
if info.PowerW != nil {
|
||||
dev.Telemetry["power_w"] = *info.PowerW
|
||||
dev.PowerW = info.PowerW
|
||||
}
|
||||
if info.ECCUncorrected != nil {
|
||||
dev.Telemetry["ecc_uncorrected_total"] = *info.ECCUncorrected
|
||||
dev.ECCUncorrectedTotal = info.ECCUncorrected
|
||||
}
|
||||
if info.ECCCorrected != nil {
|
||||
dev.Telemetry["ecc_corrected_total"] = *info.ECCCorrected
|
||||
dev.ECCCorrectedTotal = info.ECCCorrected
|
||||
}
|
||||
if info.HWSlowdown != nil {
|
||||
dev.Telemetry["hw_slowdown_active"] = *info.HWSlowdown
|
||||
}
|
||||
if len(dev.Telemetry) == 0 {
|
||||
dev.Telemetry = nil
|
||||
dev.HWSlowdown = info.HWSlowdown
|
||||
}
|
||||
}
|
||||
|
||||
@@ -54,10 +54,10 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
|
||||
status := "OK"
|
||||
devices := []schema.HardwarePCIeDevice{
|
||||
{
|
||||
VendorID: &vendorID,
|
||||
BDF: &bdf,
|
||||
Manufacturer: &manufacturer,
|
||||
Status: &status,
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
VendorID: &vendorID,
|
||||
BDF: &bdf,
|
||||
Manufacturer: &manufacturer,
|
||||
},
|
||||
}
|
||||
|
||||
@@ -73,21 +73,21 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
|
||||
},
|
||||
}
|
||||
|
||||
out := enrichPCIeWithNVIDIAData(devices, byBDF, "BOARD-001", true)
|
||||
out := enrichPCIeWithNVIDIAData(devices, byBDF, true)
|
||||
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-ABC" {
|
||||
t.Fatalf("serial: got %v", out[0].SerialNumber)
|
||||
}
|
||||
if out[0].Firmware == nil || *out[0].Firmware != "96.00.1F.00.02" {
|
||||
t.Fatalf("firmware: got %v", out[0].Firmware)
|
||||
}
|
||||
if out[0].Status == nil || *out[0].Status != "WARNING" {
|
||||
if out[0].Status == nil || *out[0].Status != statusWarning {
|
||||
t.Fatalf("status: got %v", out[0].Status)
|
||||
}
|
||||
if out[0].Telemetry == nil {
|
||||
t.Fatal("expected telemetry")
|
||||
if out[0].ECCUncorrectedTotal == nil || *out[0].ECCUncorrectedTotal != 2 {
|
||||
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].ECCUncorrectedTotal)
|
||||
}
|
||||
if got, ok := out[0].Telemetry["ecc_uncorrected_total"].(int64); !ok || got != 2 {
|
||||
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].Telemetry["ecc_uncorrected_total"])
|
||||
if out[0].TemperatureC == nil || *out[0].TemperatureC != 55.5 {
|
||||
t.Fatalf("temperature_c: got %#v", out[0].TemperatureC)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -103,11 +103,11 @@ func TestEnrichPCIeWithNVIDIAData_driverMissingFallback(t *testing.T) {
|
||||
},
|
||||
}
|
||||
|
||||
out := enrichPCIeWithNVIDIAData(devices, nil, "BOARD-123", false)
|
||||
if out[0].SerialNumber == nil || *out[0].SerialNumber != "BOARD-123-PCIE-0000:17:00.0" {
|
||||
t.Fatalf("fallback serial: got %v", out[0].SerialNumber)
|
||||
out := enrichPCIeWithNVIDIAData(devices, nil, false)
|
||||
if out[0].SerialNumber != nil {
|
||||
t.Fatalf("serial should stay nil without source data, got %v", out[0].SerialNumber)
|
||||
}
|
||||
if out[0].Status == nil || *out[0].Status != "UNKNOWN" {
|
||||
if out[0].Status == nil || *out[0].Status != statusUnknown {
|
||||
t.Fatalf("fallback status: got %v", out[0].Status)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -57,6 +57,8 @@ func shouldIncludePCIeDevice(class string) bool {
|
||||
"host bridge",
|
||||
"isa bridge",
|
||||
"pci bridge",
|
||||
"performance counter",
|
||||
"performance counters",
|
||||
"ram memory",
|
||||
"system peripheral",
|
||||
"communication controller",
|
||||
@@ -79,11 +81,12 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
|
||||
dev := schema.HardwarePCIeDevice{}
|
||||
present := true
|
||||
dev.Present = &present
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
dev.Status = &status
|
||||
|
||||
// Slot is the BDF: "0000:00:02.0"
|
||||
if bdf := fields["Slot"]; bdf != "" {
|
||||
dev.Slot = &bdf
|
||||
dev.BDF = &bdf
|
||||
// parse vendor_id and device_id from sysfs
|
||||
vendorID, deviceID := readPCIIDs(bdf)
|
||||
@@ -93,10 +96,32 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
|
||||
if deviceID != 0 {
|
||||
dev.DeviceID = &deviceID
|
||||
}
|
||||
if numaNode, ok := readPCINumaNode(bdf); ok {
|
||||
dev.NUMANode = &numaNode
|
||||
}
|
||||
if width, ok := readPCIIntAttribute(bdf, "current_link_width"); ok {
|
||||
dev.LinkWidth = &width
|
||||
}
|
||||
if width, ok := readPCIIntAttribute(bdf, "max_link_width"); ok {
|
||||
dev.MaxLinkWidth = &width
|
||||
}
|
||||
if speed, ok := readPCIStringAttribute(bdf, "current_link_speed"); ok {
|
||||
linkSpeed := normalizePCILinkSpeed(speed)
|
||||
if linkSpeed != "" {
|
||||
dev.LinkSpeed = &linkSpeed
|
||||
}
|
||||
}
|
||||
if speed, ok := readPCIStringAttribute(bdf, "max_link_speed"); ok {
|
||||
linkSpeed := normalizePCILinkSpeed(speed)
|
||||
if linkSpeed != "" {
|
||||
dev.MaxLinkSpeed = &linkSpeed
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if v := fields["Class"]; v != "" {
|
||||
dev.DeviceClass = &v
|
||||
class := mapPCIeDeviceClass(v)
|
||||
dev.DeviceClass = &class
|
||||
}
|
||||
if v := fields["Vendor"]; v != "" {
|
||||
dev.Manufacturer = &v
|
||||
@@ -131,3 +156,55 @@ func readHexFile(path string) (int, error) {
|
||||
n, err := strconv.ParseInt(s, 16, 64)
|
||||
return int(n), err
|
||||
}
|
||||
|
||||
func readPCINumaNode(bdf string) (int, bool) {
|
||||
value, ok := readPCIIntAttribute(bdf, "numa_node")
|
||||
if !ok || value < 0 {
|
||||
return 0, false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func readPCIIntAttribute(bdf, attribute string) (int, bool) {
|
||||
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
|
||||
if err != nil {
|
||||
return 0, false
|
||||
}
|
||||
value, err := strconv.Atoi(strings.TrimSpace(string(out)))
|
||||
if err != nil || value < 0 {
|
||||
return 0, false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func readPCIStringAttribute(bdf, attribute string) (string, bool) {
|
||||
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
|
||||
if err != nil {
|
||||
return "", false
|
||||
}
|
||||
value := strings.TrimSpace(string(out))
|
||||
if value == "" {
|
||||
return "", false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func normalizePCILinkSpeed(raw string) string {
|
||||
raw = strings.TrimSpace(strings.ToLower(raw))
|
||||
switch {
|
||||
case strings.Contains(raw, "2.5"):
|
||||
return "Gen1"
|
||||
case strings.Contains(raw, "5.0"):
|
||||
return "Gen2"
|
||||
case strings.Contains(raw, "8.0"):
|
||||
return "Gen3"
|
||||
case strings.Contains(raw, "16.0"):
|
||||
return "Gen4"
|
||||
case strings.Contains(raw, "32.0"):
|
||||
return "Gen5"
|
||||
case strings.Contains(raw, "64.0"):
|
||||
return "Gen6"
|
||||
default:
|
||||
return ""
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,6 +1,10 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
import (
|
||||
"encoding/json"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestShouldIncludePCIeDevice(t *testing.T) {
|
||||
tests := []struct {
|
||||
@@ -13,6 +17,7 @@ func TestShouldIncludePCIeDevice(t *testing.T) {
|
||||
{"Host bridge", false},
|
||||
{"PCI bridge", false},
|
||||
{"SMBus", false},
|
||||
{"Performance counters", false},
|
||||
{"Ethernet controller", true},
|
||||
{"RAID bus controller", true},
|
||||
{"Non-Volatile memory controller", true},
|
||||
@@ -35,7 +40,50 @@ func TestParseLspci_filtersExcludedClasses(t *testing.T) {
|
||||
if len(devs) != 1 {
|
||||
t.Fatalf("expected 1 filtered device, got %d", len(devs))
|
||||
}
|
||||
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VGA compatible controller" {
|
||||
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VideoController" {
|
||||
t.Fatalf("unexpected remaining class: %v", devs[0].DeviceClass)
|
||||
}
|
||||
if devs[0].Slot == nil || *devs[0].Slot != "0000:65:00.0" {
|
||||
t.Fatalf("slot: got %v", devs[0].Slot)
|
||||
}
|
||||
if devs[0].BDF == nil || *devs[0].BDF != "0000:65:00.0" {
|
||||
t.Fatalf("bdf: got %v", devs[0].BDF)
|
||||
}
|
||||
}
|
||||
|
||||
func TestPCIeJSONUsesSlotNotBDF(t *testing.T) {
|
||||
input := "Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
|
||||
|
||||
devs := parseLspci(input)
|
||||
data, err := json.Marshal(devs[0])
|
||||
if err != nil {
|
||||
t.Fatalf("marshal: %v", err)
|
||||
}
|
||||
text := string(data)
|
||||
if !strings.Contains(text, `"slot":"0000:65:00.0"`) {
|
||||
t.Fatalf("json missing slot: %s", text)
|
||||
}
|
||||
if strings.Contains(text, `"bdf"`) {
|
||||
t.Fatalf("json should not emit bdf: %s", text)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizePCILinkSpeed(t *testing.T) {
|
||||
tests := []struct {
|
||||
raw string
|
||||
want string
|
||||
}{
|
||||
{"2.5 GT/s PCIe", "Gen1"},
|
||||
{"5.0 GT/s PCIe", "Gen2"},
|
||||
{"8.0 GT/s PCIe", "Gen3"},
|
||||
{"16.0 GT/s PCIe", "Gen4"},
|
||||
{"32.0 GT/s PCIe", "Gen5"},
|
||||
{"64.0 GT/s PCIe", "Gen6"},
|
||||
{"unknown", ""},
|
||||
}
|
||||
for _, tt := range tests {
|
||||
if got := normalizePCILinkSpeed(tt.raw); got != tt.want {
|
||||
t.Fatalf("normalizePCILinkSpeed(%q)=%q want %q", tt.raw, got, tt.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -4,18 +4,32 @@ import (
|
||||
"bee/audit/internal/schema"
|
||||
"log/slog"
|
||||
"os/exec"
|
||||
"regexp"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func collectPSUs() []schema.HardwarePowerSupply {
|
||||
// ipmitool requires /dev/ipmi0 — not available on non-server hardware
|
||||
out, err := exec.Command("ipmitool", "fru", "print").Output()
|
||||
if err != nil {
|
||||
var psus []schema.HardwarePowerSupply
|
||||
if out, err := exec.Command("ipmitool", "fru", "print").Output(); err == nil {
|
||||
psus = parseFRU(string(out))
|
||||
} else {
|
||||
slog.Info("psu: fru unavailable", "err", err)
|
||||
}
|
||||
|
||||
sdrData := map[int]psuSDR{}
|
||||
if sdrOut, err := exec.Command("ipmitool", "sdr").Output(); err == nil {
|
||||
sdrData = parsePSUSDR(string(sdrOut))
|
||||
if len(psus) == 0 {
|
||||
psus = synthesizePSUsFromSDR(sdrData)
|
||||
} else {
|
||||
mergePSUSDR(psus, sdrData)
|
||||
}
|
||||
} else if len(psus) == 0 {
|
||||
slog.Info("psu: ipmitool unavailable, skipping", "err", err)
|
||||
return nil
|
||||
}
|
||||
psus := parseFRU(string(out))
|
||||
slog.Info("psu: collected", "count", len(psus))
|
||||
return psus
|
||||
}
|
||||
@@ -75,9 +89,7 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
|
||||
|
||||
// Only process PSU FRU records
|
||||
headerLower := strings.ToLower(header)
|
||||
if !strings.Contains(headerLower, "psu") &&
|
||||
!strings.Contains(headerLower, "power supply") &&
|
||||
!strings.Contains(headerLower, "power_supply") {
|
||||
if !isPSUHeader(headerLower) {
|
||||
return schema.HardwarePowerSupply{}, false
|
||||
}
|
||||
|
||||
@@ -85,21 +97,24 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
|
||||
psu := schema.HardwarePowerSupply{Present: &present}
|
||||
|
||||
slotStr := strconv.Itoa(slotIdx)
|
||||
if slot, ok := parsePSUSlot(header); ok && slot > 0 {
|
||||
slotStr = strconv.Itoa(slot - 1)
|
||||
}
|
||||
psu.Slot = &slotStr
|
||||
|
||||
if v := cleanDMIValue(fields["Board Product"]); v != "" {
|
||||
if v := firstNonEmptyField(fields, "Board Product", "Product Name", "Product Part Number"); v != "" {
|
||||
psu.Model = &v
|
||||
}
|
||||
if v := cleanDMIValue(fields["Board Mfg"]); v != "" {
|
||||
if v := firstNonEmptyField(fields, "Board Mfg", "Product Manufacturer", "Product Manufacturer Name"); v != "" {
|
||||
psu.Vendor = &v
|
||||
}
|
||||
if v := cleanDMIValue(fields["Board Serial"]); v != "" {
|
||||
if v := firstNonEmptyField(fields, "Board Serial", "Product Serial", "Product Serial Number"); v != "" {
|
||||
psu.SerialNumber = &v
|
||||
}
|
||||
if v := cleanDMIValue(fields["Board Part Number"]); v != "" {
|
||||
if v := firstNonEmptyField(fields, "Board Part Number", "Product Part Number", "Part Number"); v != "" {
|
||||
psu.PartNumber = &v
|
||||
}
|
||||
if v := cleanDMIValue(fields["Board Extra"]); v != "" {
|
||||
if v := firstNonEmptyField(fields, "Board Extra", "Product Version", "Board Version"); v != "" {
|
||||
psu.Firmware = &v
|
||||
}
|
||||
|
||||
@@ -110,12 +125,230 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
|
||||
}
|
||||
}
|
||||
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
psu.Status = &status
|
||||
|
||||
return psu, true
|
||||
}
|
||||
|
||||
func isPSUHeader(headerLower string) bool {
|
||||
return strings.Contains(headerLower, "psu") ||
|
||||
strings.Contains(headerLower, "pws") ||
|
||||
strings.Contains(headerLower, "power supply") ||
|
||||
strings.Contains(headerLower, "power_supply") ||
|
||||
strings.Contains(headerLower, "power module")
|
||||
}
|
||||
|
||||
func firstNonEmptyField(fields map[string]string, keys ...string) string {
|
||||
for _, key := range keys {
|
||||
if value := cleanDMIValue(fields[key]); value != "" {
|
||||
return value
|
||||
}
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
type psuSDR struct {
|
||||
slot int
|
||||
status string
|
||||
reason string
|
||||
inputPowerW *float64
|
||||
outputPowerW *float64
|
||||
inputVoltage *float64
|
||||
temperatureC *float64
|
||||
healthPct *float64
|
||||
}
|
||||
|
||||
var psuSlotPatterns = []*regexp.Regexp{
|
||||
regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`),
|
||||
regexp.MustCompile(`(?i)\bps\s*([0-9]+)\b`),
|
||||
regexp.MustCompile(`(?i)\bpws\s*([0-9]+)\b`),
|
||||
regexp.MustCompile(`(?i)\bpower\s*supply(?:\s*bay)?\s*([0-9]+)\b`),
|
||||
regexp.MustCompile(`(?i)\bbay\s*([0-9]+)\b`),
|
||||
}
|
||||
|
||||
func parsePSUSDR(raw string) map[int]psuSDR {
|
||||
out := map[int]psuSDR{}
|
||||
for _, line := range strings.Split(raw, "\n") {
|
||||
fields := splitSDRFields(line)
|
||||
if len(fields) < 3 {
|
||||
continue
|
||||
}
|
||||
name := fields[0]
|
||||
value := fields[1]
|
||||
state := strings.ToLower(fields[2])
|
||||
slot, ok := parsePSUSlot(name)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
|
||||
entry := out[slot]
|
||||
entry.slot = slot
|
||||
if entry.status == "" {
|
||||
entry.status = statusOK
|
||||
}
|
||||
if state != "" && state != "ok" && state != "ns" {
|
||||
entry.status = statusCritical
|
||||
entry.reason = "PSU sensor reported non-OK state: " + state
|
||||
}
|
||||
|
||||
lowerName := strings.ToLower(name)
|
||||
switch {
|
||||
case strings.Contains(lowerName, "input power"):
|
||||
entry.inputPowerW = parseFloatPtr(value)
|
||||
case strings.Contains(lowerName, "output power"):
|
||||
entry.outputPowerW = parseFloatPtr(value)
|
||||
case strings.Contains(lowerName, "power supply bay"), strings.Contains(lowerName, "psu bay"):
|
||||
entry.outputPowerW = parseFloatPtr(value)
|
||||
case strings.Contains(lowerName, "input voltage"), strings.Contains(lowerName, "ac input"):
|
||||
entry.inputVoltage = parseFloatPtr(value)
|
||||
case strings.Contains(lowerName, "temp"):
|
||||
entry.temperatureC = parseFloatPtr(value)
|
||||
case strings.Contains(lowerName, "health"), strings.Contains(lowerName, "remaining life"), strings.Contains(lowerName, "life remaining"):
|
||||
entry.healthPct = parsePercentPtr(value)
|
||||
}
|
||||
out[slot] = entry
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func synthesizePSUsFromSDR(sdr map[int]psuSDR) []schema.HardwarePowerSupply {
|
||||
if len(sdr) == 0 {
|
||||
return nil
|
||||
}
|
||||
slots := make([]int, 0, len(sdr))
|
||||
for slot := range sdr {
|
||||
slots = append(slots, slot)
|
||||
}
|
||||
sort.Ints(slots)
|
||||
|
||||
out := make([]schema.HardwarePowerSupply, 0, len(slots))
|
||||
for _, slot := range slots {
|
||||
entry := sdr[slot]
|
||||
present := true
|
||||
status := entry.status
|
||||
if status == "" {
|
||||
status = statusUnknown
|
||||
}
|
||||
slotStr := strconv.Itoa(slot - 1)
|
||||
model := "PSU"
|
||||
psu := schema.HardwarePowerSupply{
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
Slot: &slotStr,
|
||||
Present: &present,
|
||||
Model: &model,
|
||||
InputPowerW: entry.inputPowerW,
|
||||
OutputPowerW: entry.outputPowerW,
|
||||
InputVoltage: entry.inputVoltage,
|
||||
TemperatureC: entry.temperatureC,
|
||||
}
|
||||
if entry.healthPct != nil {
|
||||
psu.LifeRemainingPct = entry.healthPct
|
||||
lifeUsed := 100 - *entry.healthPct
|
||||
psu.LifeUsedPct = &lifeUsed
|
||||
}
|
||||
if entry.reason != "" {
|
||||
psu.ErrorDescription = &entry.reason
|
||||
}
|
||||
out = append(out, psu)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func mergePSUSDR(psus []schema.HardwarePowerSupply, sdr map[int]psuSDR) {
|
||||
for i := range psus {
|
||||
slotIdx, err := strconv.Atoi(derefPSUSlot(psus[i].Slot))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
entry, ok := sdr[slotIdx+1]
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if entry.inputPowerW != nil {
|
||||
psus[i].InputPowerW = entry.inputPowerW
|
||||
}
|
||||
if entry.outputPowerW != nil {
|
||||
psus[i].OutputPowerW = entry.outputPowerW
|
||||
}
|
||||
if entry.inputVoltage != nil {
|
||||
psus[i].InputVoltage = entry.inputVoltage
|
||||
}
|
||||
if entry.temperatureC != nil {
|
||||
psus[i].TemperatureC = entry.temperatureC
|
||||
}
|
||||
if entry.healthPct != nil {
|
||||
psus[i].LifeRemainingPct = entry.healthPct
|
||||
lifeUsed := 100 - *entry.healthPct
|
||||
psus[i].LifeUsedPct = &lifeUsed
|
||||
}
|
||||
if entry.status != "" {
|
||||
psus[i].Status = &entry.status
|
||||
}
|
||||
if entry.reason != "" {
|
||||
psus[i].ErrorDescription = &entry.reason
|
||||
}
|
||||
if psus[i].Status != nil && *psus[i].Status == statusOK {
|
||||
if (entry.inputPowerW == nil && entry.outputPowerW == nil && entry.inputVoltage == nil) && entry.status == "" {
|
||||
unknown := statusUnknown
|
||||
psus[i].Status = &unknown
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func splitSDRFields(line string) []string {
|
||||
parts := strings.Split(line, "|")
|
||||
out := make([]string, 0, len(parts))
|
||||
for _, part := range parts {
|
||||
part = strings.TrimSpace(part)
|
||||
if part != "" {
|
||||
out = append(out, part)
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func parsePSUSlot(name string) (int, bool) {
|
||||
for _, re := range psuSlotPatterns {
|
||||
m := re.FindStringSubmatch(strings.ToLower(name))
|
||||
if len(m) == 0 {
|
||||
continue
|
||||
}
|
||||
for _, group := range m[1:] {
|
||||
if group == "" {
|
||||
continue
|
||||
}
|
||||
n, err := strconv.Atoi(group)
|
||||
if err == nil && n > 0 {
|
||||
return n, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func parseFloatPtr(raw string) *float64 {
|
||||
raw = strings.TrimSpace(raw)
|
||||
if raw == "" || strings.EqualFold(raw, "na") {
|
||||
return nil
|
||||
}
|
||||
for _, field := range strings.Fields(raw) {
|
||||
n, err := strconv.ParseFloat(strings.TrimSpace(field), 64)
|
||||
if err == nil {
|
||||
return &n
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func derefPSUSlot(slot *string) string {
|
||||
if slot == nil {
|
||||
return ""
|
||||
}
|
||||
return *slot
|
||||
}
|
||||
|
||||
// parseWattage extracts wattage from strings like "PSU 800W", "1200W PLATINUM".
|
||||
func parseWattage(s string) int {
|
||||
s = strings.ToUpper(s)
|
||||
|
||||
91
audit/internal/collector/psu_sdr_test.go
Normal file
91
audit/internal/collector/psu_sdr_test.go
Normal file
@@ -0,0 +1,91 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
|
||||
func TestParsePSUSDR(t *testing.T) {
|
||||
raw := `
|
||||
PS1 Input Power | 215 Watts | ok
|
||||
PS1 Output Power | 198 Watts | ok
|
||||
PS1 Input Voltage | 229 Volts | ok
|
||||
PS1 Temp | 39 C | ok
|
||||
PS1 Health | 97 % | ok
|
||||
PS2 Input Power | 0 Watts | cr
|
||||
`
|
||||
|
||||
got := parsePSUSDR(raw)
|
||||
if len(got) != 2 {
|
||||
t.Fatalf("len(got)=%d want 2", len(got))
|
||||
}
|
||||
if got[1].status != statusOK {
|
||||
t.Fatalf("ps1 status=%q", got[1].status)
|
||||
}
|
||||
if got[1].inputPowerW == nil || *got[1].inputPowerW != 215 {
|
||||
t.Fatalf("ps1 input power=%v", got[1].inputPowerW)
|
||||
}
|
||||
if got[1].outputPowerW == nil || *got[1].outputPowerW != 198 {
|
||||
t.Fatalf("ps1 output power=%v", got[1].outputPowerW)
|
||||
}
|
||||
if got[1].inputVoltage == nil || *got[1].inputVoltage != 229 {
|
||||
t.Fatalf("ps1 input voltage=%v", got[1].inputVoltage)
|
||||
}
|
||||
if got[1].temperatureC == nil || *got[1].temperatureC != 39 {
|
||||
t.Fatalf("ps1 temperature=%v", got[1].temperatureC)
|
||||
}
|
||||
if got[1].healthPct == nil || *got[1].healthPct != 97 {
|
||||
t.Fatalf("ps1 health=%v", got[1].healthPct)
|
||||
}
|
||||
if got[2].status != statusCritical {
|
||||
t.Fatalf("ps2 status=%q", got[2].status)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParsePSUSlotVendorVariants(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
want int
|
||||
}{
|
||||
{name: "PWS1 Status", want: 1},
|
||||
{name: "Power Supply Bay 8", want: 8},
|
||||
{name: "PS 6 Input Power", want: 6},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
got, ok := parsePSUSlot(tt.name)
|
||||
if !ok || got != tt.want {
|
||||
t.Fatalf("parsePSUSlot(%q)=(%d,%v) want (%d,true)", tt.name, got, ok, tt.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestSynthesizePSUsFromSDR(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
health := 97.0
|
||||
outputPower := 915.0
|
||||
got := synthesizePSUsFromSDR(map[int]psuSDR{
|
||||
1: {
|
||||
slot: 1,
|
||||
status: statusOK,
|
||||
outputPowerW: &outputPower,
|
||||
healthPct: &health,
|
||||
},
|
||||
})
|
||||
|
||||
if len(got) != 1 {
|
||||
t.Fatalf("len(got)=%d want 1", len(got))
|
||||
}
|
||||
if got[0].Slot == nil || *got[0].Slot != "0" {
|
||||
t.Fatalf("slot=%v want 0", got[0].Slot)
|
||||
}
|
||||
if got[0].OutputPowerW == nil || *got[0].OutputPowerW != 915 {
|
||||
t.Fatalf("output power=%v", got[0].OutputPowerW)
|
||||
}
|
||||
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 97 {
|
||||
t.Fatalf("life remaining=%v", got[0].LifeRemainingPct)
|
||||
}
|
||||
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 3 {
|
||||
t.Fatalf("life used=%v", got[0].LifeUsedPct)
|
||||
}
|
||||
}
|
||||
121
audit/internal/collector/psu_telemetry.go
Normal file
121
audit/internal/collector/psu_telemetry.go
Normal file
@@ -0,0 +1,121 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func enrichPSUsWithTelemetry(psus []schema.HardwarePowerSupply, doc sensorsDoc) []schema.HardwarePowerSupply {
|
||||
if len(psus) == 0 || len(doc) == 0 {
|
||||
return psus
|
||||
}
|
||||
|
||||
tempBySlot := psuTempsFromSensors(doc)
|
||||
healthBySlot := psuHealthFromSensors(doc)
|
||||
for i := range psus {
|
||||
slot := derefPSUSlot(psus[i].Slot)
|
||||
if slot == "" {
|
||||
continue
|
||||
}
|
||||
if psus[i].TemperatureC == nil {
|
||||
if value, ok := tempBySlot[slot]; ok {
|
||||
psus[i].TemperatureC = &value
|
||||
}
|
||||
}
|
||||
if psus[i].LifeRemainingPct == nil {
|
||||
if value, ok := healthBySlot[slot]; ok {
|
||||
psus[i].LifeRemainingPct = &value
|
||||
used := 100 - value
|
||||
psus[i].LifeUsedPct = &used
|
||||
}
|
||||
}
|
||||
}
|
||||
return psus
|
||||
}
|
||||
|
||||
func psuHealthFromSensors(doc sensorsDoc) map[string]float64 {
|
||||
out := map[string]float64{}
|
||||
for chip, features := range doc {
|
||||
for featureName, raw := range features {
|
||||
feature, ok := raw.(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if !isLikelyPSUHealth(chip, featureName) {
|
||||
continue
|
||||
}
|
||||
value, ok := firstFeaturePercent(feature)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if slot, ok := detectPSUSlot(chip, featureName); ok {
|
||||
if _, exists := out[slot]; !exists {
|
||||
out[slot] = value
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func firstFeaturePercent(feature map[string]any) (float64, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
lower := strings.ToLower(key)
|
||||
if strings.HasSuffix(lower, "_alarm") {
|
||||
continue
|
||||
}
|
||||
if strings.Contains(lower, "health") || strings.Contains(lower, "life") || strings.Contains(lower, "remain") {
|
||||
if value, ok := floatFromAny(feature[key]); ok {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func isLikelyPSUHealth(chip, feature string) bool {
|
||||
value := strings.ToLower(chip + " " + feature)
|
||||
return (strings.Contains(value, "psu") || strings.Contains(value, "power supply")) &&
|
||||
(strings.Contains(value, "health") || strings.Contains(value, "life") || strings.Contains(value, "remain"))
|
||||
}
|
||||
|
||||
func psuTempsFromSensors(doc sensorsDoc) map[string]float64 {
|
||||
out := map[string]float64{}
|
||||
for chip, features := range doc {
|
||||
for featureName, raw := range features {
|
||||
feature, ok := raw.(map[string]any)
|
||||
if !ok || classifySensorFeature(feature) != "temp" {
|
||||
continue
|
||||
}
|
||||
if !isLikelyPSUTemp(chip, featureName) {
|
||||
continue
|
||||
}
|
||||
temp, ok := firstFeatureFloat(feature, "_input")
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if slot, ok := detectPSUSlot(chip, featureName); ok {
|
||||
if _, exists := out[slot]; !exists {
|
||||
out[slot] = temp
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func isLikelyPSUTemp(chip, feature string) bool {
|
||||
value := strings.ToLower(chip + " " + feature)
|
||||
return strings.Contains(value, "psu") || strings.Contains(value, "power supply")
|
||||
}
|
||||
|
||||
func detectPSUSlot(parts ...string) (string, bool) {
|
||||
for _, part := range parts {
|
||||
if value, ok := parsePSUSlot(part); ok && value > 0 {
|
||||
return strconv.Itoa(value - 1), true
|
||||
}
|
||||
}
|
||||
return "", false
|
||||
}
|
||||
42
audit/internal/collector/psu_telemetry_test.go
Normal file
42
audit/internal/collector/psu_telemetry_test.go
Normal file
@@ -0,0 +1,42 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
func TestEnrichPSUsWithTelemetry(t *testing.T) {
|
||||
slot0 := "0"
|
||||
slot1 := "1"
|
||||
psus := []schema.HardwarePowerSupply{
|
||||
{Slot: &slot0},
|
||||
{Slot: &slot1},
|
||||
}
|
||||
|
||||
doc := sensorsDoc{
|
||||
"psu-hwmon-0": {
|
||||
"PSU1 Temp": map[string]any{"temp1_input": 39.5},
|
||||
"PSU2 Temp": map[string]any{"temp2_input": 41.0},
|
||||
"PSU1 Health": map[string]any{"health1_input": 98.0},
|
||||
"PSU2 Remaining Life": map[string]any{"life2_input": 95.0},
|
||||
},
|
||||
}
|
||||
|
||||
got := enrichPSUsWithTelemetry(psus, doc)
|
||||
if got[0].TemperatureC == nil || *got[0].TemperatureC != 39.5 {
|
||||
t.Fatalf("psu0 temperature mismatch: %#v", got[0].TemperatureC)
|
||||
}
|
||||
if got[1].TemperatureC == nil || *got[1].TemperatureC != 41.0 {
|
||||
t.Fatalf("psu1 temperature mismatch: %#v", got[1].TemperatureC)
|
||||
}
|
||||
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 98.0 {
|
||||
t.Fatalf("psu0 life remaining mismatch: %#v", got[0].LifeRemainingPct)
|
||||
}
|
||||
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 2.0 {
|
||||
t.Fatalf("psu0 life used mismatch: %#v", got[0].LifeUsedPct)
|
||||
}
|
||||
if got[1].LifeRemainingPct == nil || *got[1].LifeRemainingPct != 95.0 {
|
||||
t.Fatalf("psu1 life remaining mismatch: %#v", got[1].LifeRemainingPct)
|
||||
}
|
||||
}
|
||||
@@ -83,11 +83,7 @@ func isLikelyRAIDController(dev schema.HardwarePCIeDevice) bool {
|
||||
if dev.DeviceClass == nil {
|
||||
return false
|
||||
}
|
||||
c := strings.ToLower(*dev.DeviceClass)
|
||||
return strings.Contains(c, "raid") ||
|
||||
strings.Contains(c, "sas") ||
|
||||
strings.Contains(c, "mass storage") ||
|
||||
strings.Contains(c, "serial attached scsi")
|
||||
return isRAIDClass(*dev.DeviceClass)
|
||||
}
|
||||
|
||||
func collectStorcliDrives() []schema.HardwareStorage {
|
||||
@@ -182,7 +178,10 @@ func parseSASIrcuDisplay(raw string) []schema.HardwareStorage {
|
||||
|
||||
present := true
|
||||
status := mapRAIDDriveStatus(b["State"])
|
||||
s := schema.HardwareStorage{Present: &present, Status: &status}
|
||||
s := schema.HardwareStorage{
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
Present: &present,
|
||||
}
|
||||
|
||||
enclosure := strings.TrimSpace(b["Enclosure #"])
|
||||
slot := strings.TrimSpace(b["Slot #"])
|
||||
@@ -281,7 +280,10 @@ func parseArcconfPhysicalDrives(raw string) []schema.HardwareStorage {
|
||||
for _, b := range blocks {
|
||||
present := true
|
||||
status := mapRAIDDriveStatus(b["State"])
|
||||
s := schema.HardwareStorage{Present: &present, Status: &status}
|
||||
s := schema.HardwareStorage{
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
Present: &present,
|
||||
}
|
||||
|
||||
if v := strings.TrimSpace(b["Reported Location"]); v != "" {
|
||||
s.Slot = &v
|
||||
@@ -362,8 +364,11 @@ func parseSSACLIPhysicalDrives(raw string) []schema.HardwareStorage {
|
||||
if m := ssacliPhysicalDriveLine.FindStringSubmatch(trimmed); len(m) == 3 {
|
||||
flush()
|
||||
present := true
|
||||
status := "UNKNOWN"
|
||||
s := schema.HardwareStorage{Present: &present, Status: &status}
|
||||
status := statusUnknown
|
||||
s := schema.HardwareStorage{
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
Present: &present,
|
||||
}
|
||||
slot := m[1]
|
||||
s.Slot = &slot
|
||||
|
||||
@@ -475,8 +480,8 @@ func storcliDriveToStorage(d struct {
|
||||
present := true
|
||||
status := mapRAIDDriveStatus(d.State)
|
||||
s := schema.HardwareStorage{
|
||||
Present: &present,
|
||||
Status: &status,
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
Present: &present,
|
||||
}
|
||||
|
||||
if v := strings.TrimSpace(d.EIDSlt); v != "" {
|
||||
@@ -527,15 +532,15 @@ func mapRAIDDriveStatus(raw string) string {
|
||||
u := strings.ToUpper(strings.TrimSpace(raw))
|
||||
switch {
|
||||
case strings.Contains(u, "OK"), strings.Contains(u, "OPTIMAL"), strings.Contains(u, "READY"):
|
||||
return "OK"
|
||||
return statusOK
|
||||
case strings.Contains(u, "ONLN"), strings.Contains(u, "ONLINE"):
|
||||
return "OK"
|
||||
return statusOK
|
||||
case strings.Contains(u, "RBLD"), strings.Contains(u, "REBUILD"):
|
||||
return "WARNING"
|
||||
return statusWarning
|
||||
case strings.Contains(u, "FAIL"), strings.Contains(u, "OFFLINE"):
|
||||
return "CRITICAL"
|
||||
return statusCritical
|
||||
default:
|
||||
return "UNKNOWN"
|
||||
return statusUnknown
|
||||
}
|
||||
}
|
||||
|
||||
@@ -641,8 +646,9 @@ func enrichStorageWithVROC(storage []schema.HardwareStorage, pcie []schema.Hardw
|
||||
storage[i].Telemetry["vroc_array"] = arr.Name
|
||||
storage[i].Telemetry["vroc_degraded"] = arr.Degraded
|
||||
if arr.Degraded {
|
||||
status := "WARNING"
|
||||
status := statusWarning
|
||||
storage[i].Status = &status
|
||||
storage[i].ErrorDescription = stringPtr("VROC array is degraded")
|
||||
}
|
||||
updated++
|
||||
}
|
||||
@@ -659,14 +665,14 @@ func hasVROCController(pcie []schema.HardwarePCIeDevice) bool {
|
||||
|
||||
class := ""
|
||||
if dev.DeviceClass != nil {
|
||||
class = strings.ToLower(*dev.DeviceClass)
|
||||
class = strings.TrimSpace(*dev.DeviceClass)
|
||||
}
|
||||
model := ""
|
||||
if dev.Model != nil {
|
||||
model = strings.ToLower(*dev.Model)
|
||||
}
|
||||
|
||||
if strings.Contains(class, "raid") ||
|
||||
if isRAIDClass(class) ||
|
||||
strings.Contains(model, "vroc") ||
|
||||
strings.Contains(model, "volume management device") ||
|
||||
strings.Contains(model, "vmd") {
|
||||
|
||||
334
audit/internal/collector/raid_controller_telemetry.go
Normal file
334
audit/internal/collector/raid_controller_telemetry.go
Normal file
@@ -0,0 +1,334 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"encoding/json"
|
||||
"log/slog"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
type raidControllerTelemetry struct {
|
||||
BatteryChargePct *float64
|
||||
BatteryHealthPct *float64
|
||||
BatteryTemperatureC *float64
|
||||
BatteryVoltageV *float64
|
||||
BatteryReplaceRequired *bool
|
||||
ErrorDescription *string
|
||||
}
|
||||
|
||||
func enrichPCIeWithRAIDTelemetry(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
|
||||
byVendor := collectRAIDControllerTelemetry()
|
||||
if len(byVendor) == 0 {
|
||||
return devs
|
||||
}
|
||||
|
||||
positions := map[int]int{}
|
||||
for i := range devs {
|
||||
if devs[i].VendorID == nil || !isLikelyRAIDController(devs[i]) {
|
||||
continue
|
||||
}
|
||||
vendor := *devs[i].VendorID
|
||||
list := byVendor[vendor]
|
||||
if len(list) == 0 {
|
||||
continue
|
||||
}
|
||||
index := positions[vendor]
|
||||
if index >= len(list) {
|
||||
continue
|
||||
}
|
||||
positions[vendor] = index + 1
|
||||
applyRAIDControllerTelemetry(&devs[i], list[index])
|
||||
}
|
||||
|
||||
return devs
|
||||
}
|
||||
|
||||
func applyRAIDControllerTelemetry(dev *schema.HardwarePCIeDevice, tel raidControllerTelemetry) {
|
||||
if tel.BatteryChargePct != nil {
|
||||
dev.BatteryChargePct = tel.BatteryChargePct
|
||||
}
|
||||
if tel.BatteryHealthPct != nil {
|
||||
dev.BatteryHealthPct = tel.BatteryHealthPct
|
||||
}
|
||||
if tel.BatteryTemperatureC != nil {
|
||||
dev.BatteryTemperatureC = tel.BatteryTemperatureC
|
||||
}
|
||||
if tel.BatteryVoltageV != nil {
|
||||
dev.BatteryVoltageV = tel.BatteryVoltageV
|
||||
}
|
||||
if tel.BatteryReplaceRequired != nil {
|
||||
dev.BatteryReplaceRequired = tel.BatteryReplaceRequired
|
||||
}
|
||||
if tel.ErrorDescription != nil {
|
||||
dev.ErrorDescription = tel.ErrorDescription
|
||||
if dev.Status == nil || *dev.Status == statusOK {
|
||||
status := statusWarning
|
||||
dev.Status = &status
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func collectRAIDControllerTelemetry() map[int][]raidControllerTelemetry {
|
||||
out := map[int][]raidControllerTelemetry{}
|
||||
|
||||
if raw, err := raidToolQuery("storcli64", "/call", "show", "all", "J"); err == nil {
|
||||
list := parseStorcliControllerTelemetry(raw)
|
||||
if len(list) > 0 {
|
||||
out[vendorBroadcomLSI] = append(out[vendorBroadcomLSI], list...)
|
||||
slog.Info("raid: storcli controller telemetry", "count", len(list))
|
||||
}
|
||||
}
|
||||
|
||||
if raw, err := raidToolQuery("ssacli", "ctrl", "all", "show", "config", "detail"); err == nil {
|
||||
list := parseSSACLIControllerTelemetry(string(raw))
|
||||
if len(list) > 0 {
|
||||
out[vendorHPE] = append(out[vendorHPE], list...)
|
||||
slog.Info("raid: ssacli controller telemetry", "count", len(list))
|
||||
}
|
||||
}
|
||||
|
||||
if raw, err := raidToolQuery("arcconf", "getconfig", "1", "ad"); err == nil {
|
||||
list := parseArcconfControllerTelemetry(string(raw))
|
||||
if len(list) > 0 {
|
||||
out[vendorAdaptec] = append(out[vendorAdaptec], list...)
|
||||
slog.Info("raid: arcconf controller telemetry", "count", len(list))
|
||||
}
|
||||
}
|
||||
|
||||
return out
|
||||
}
|
||||
|
||||
func parseStorcliControllerTelemetry(raw []byte) []raidControllerTelemetry {
|
||||
var doc struct {
|
||||
Controllers []struct {
|
||||
ResponseData map[string]any `json:"Response Data"`
|
||||
} `json:"Controllers"`
|
||||
}
|
||||
if err := json.Unmarshal(raw, &doc); err != nil {
|
||||
slog.Warn("raid: parse storcli controller telemetry failed", "err", err)
|
||||
return nil
|
||||
}
|
||||
|
||||
var out []raidControllerTelemetry
|
||||
for _, ctl := range doc.Controllers {
|
||||
tel := raidControllerTelemetry{}
|
||||
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info"]))
|
||||
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info_Details"]))
|
||||
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info"]))
|
||||
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info_Details"]))
|
||||
if hasRAIDControllerTelemetry(tel) {
|
||||
out = append(out, tel)
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func nestedStringMap(raw any) map[string]string {
|
||||
switch value := raw.(type) {
|
||||
case map[string]any:
|
||||
out := map[string]string{}
|
||||
flattenStringMap("", value, out)
|
||||
return out
|
||||
case []any:
|
||||
out := map[string]string{}
|
||||
for _, item := range value {
|
||||
if m, ok := item.(map[string]any); ok {
|
||||
flattenStringMap("", m, out)
|
||||
}
|
||||
}
|
||||
return out
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
func flattenStringMap(prefix string, in map[string]any, out map[string]string) {
|
||||
for key, raw := range in {
|
||||
fullKey := strings.TrimSpace(strings.ToLower(strings.Trim(prefix+" "+key, " ")))
|
||||
switch value := raw.(type) {
|
||||
case map[string]any:
|
||||
flattenStringMap(fullKey, value, out)
|
||||
case []any:
|
||||
for _, item := range value {
|
||||
if m, ok := item.(map[string]any); ok {
|
||||
flattenStringMap(fullKey, m, out)
|
||||
}
|
||||
}
|
||||
case string:
|
||||
out[fullKey] = value
|
||||
case json.Number:
|
||||
out[fullKey] = value.String()
|
||||
case float64:
|
||||
out[fullKey] = strconv.FormatFloat(value, 'f', -1, 64)
|
||||
case bool:
|
||||
if value {
|
||||
out[fullKey] = "true"
|
||||
} else {
|
||||
out[fullKey] = "false"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func mergeStorcliBatteryMap(tel *raidControllerTelemetry, fields map[string]string) {
|
||||
if len(fields) == 0 {
|
||||
return
|
||||
}
|
||||
for key, raw := range fields {
|
||||
lower := strings.ToLower(strings.TrimSpace(key))
|
||||
switch {
|
||||
case strings.Contains(lower, "relative state of charge"), strings.Contains(lower, "remaining capacity"), strings.Contains(lower, "charge"):
|
||||
if tel.BatteryChargePct == nil {
|
||||
tel.BatteryChargePct = parsePercentPtr(raw)
|
||||
}
|
||||
case strings.Contains(lower, "state of health"), strings.Contains(lower, "health"):
|
||||
if tel.BatteryHealthPct == nil {
|
||||
tel.BatteryHealthPct = parsePercentPtr(raw)
|
||||
}
|
||||
case strings.Contains(lower, "temperature"):
|
||||
if tel.BatteryTemperatureC == nil {
|
||||
tel.BatteryTemperatureC = parseFloatPtr(raw)
|
||||
}
|
||||
case strings.Contains(lower, "voltage"):
|
||||
if tel.BatteryVoltageV == nil {
|
||||
tel.BatteryVoltageV = parseFloatPtr(raw)
|
||||
}
|
||||
case strings.Contains(lower, "replace"), strings.Contains(lower, "replacement required"):
|
||||
if tel.BatteryReplaceRequired == nil {
|
||||
tel.BatteryReplaceRequired = parseReplaceRequired(raw)
|
||||
}
|
||||
case strings.Contains(lower, "learn cycle requested"), strings.Contains(lower, "battery state"), strings.Contains(lower, "capacitance state"):
|
||||
if desc := batteryStateDescription(raw); desc != nil && tel.ErrorDescription == nil {
|
||||
tel.ErrorDescription = desc
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func parseSSACLIControllerTelemetry(raw string) []raidControllerTelemetry {
|
||||
lines := strings.Split(raw, "\n")
|
||||
var out []raidControllerTelemetry
|
||||
var current *raidControllerTelemetry
|
||||
|
||||
flush := func() {
|
||||
if current != nil && hasRAIDControllerTelemetry(*current) {
|
||||
out = append(out, *current)
|
||||
}
|
||||
current = nil
|
||||
}
|
||||
|
||||
for _, line := range lines {
|
||||
trimmed := strings.TrimSpace(line)
|
||||
if trimmed == "" {
|
||||
continue
|
||||
}
|
||||
if strings.HasPrefix(strings.ToLower(trimmed), "smart array") || strings.HasPrefix(strings.ToLower(trimmed), "controller ") {
|
||||
flush()
|
||||
current = &raidControllerTelemetry{}
|
||||
continue
|
||||
}
|
||||
if current == nil {
|
||||
continue
|
||||
}
|
||||
if idx := strings.Index(trimmed, ":"); idx > 0 {
|
||||
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
|
||||
val := strings.TrimSpace(trimmed[idx+1:])
|
||||
switch {
|
||||
case strings.Contains(key, "capacitor temperature"), strings.Contains(key, "battery temperature"):
|
||||
current.BatteryTemperatureC = parseFloatPtr(val)
|
||||
case strings.Contains(key, "capacitor voltage"), strings.Contains(key, "battery voltage"):
|
||||
current.BatteryVoltageV = parseFloatPtr(val)
|
||||
case strings.Contains(key, "capacitor charge"), strings.Contains(key, "battery charge"):
|
||||
current.BatteryChargePct = parsePercentPtr(val)
|
||||
case strings.Contains(key, "capacitor health"), strings.Contains(key, "battery health"):
|
||||
current.BatteryHealthPct = parsePercentPtr(val)
|
||||
case strings.Contains(key, "replace") || strings.Contains(key, "failed"):
|
||||
if current.BatteryReplaceRequired == nil {
|
||||
current.BatteryReplaceRequired = parseReplaceRequired(val)
|
||||
}
|
||||
if desc := batteryStateDescription(val); desc != nil && current.ErrorDescription == nil {
|
||||
current.ErrorDescription = desc
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
flush()
|
||||
return out
|
||||
}
|
||||
|
||||
func parseArcconfControllerTelemetry(raw string) []raidControllerTelemetry {
|
||||
lines := strings.Split(raw, "\n")
|
||||
tel := raidControllerTelemetry{}
|
||||
for _, line := range lines {
|
||||
trimmed := strings.TrimSpace(line)
|
||||
if idx := strings.Index(trimmed, ":"); idx > 0 {
|
||||
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
|
||||
val := strings.TrimSpace(trimmed[idx+1:])
|
||||
switch {
|
||||
case strings.Contains(key, "battery temperature"), strings.Contains(key, "capacitor temperature"):
|
||||
tel.BatteryTemperatureC = parseFloatPtr(val)
|
||||
case strings.Contains(key, "battery voltage"), strings.Contains(key, "capacitor voltage"):
|
||||
tel.BatteryVoltageV = parseFloatPtr(val)
|
||||
case strings.Contains(key, "battery charge"), strings.Contains(key, "capacitor charge"):
|
||||
tel.BatteryChargePct = parsePercentPtr(val)
|
||||
case strings.Contains(key, "battery health"), strings.Contains(key, "capacitor health"):
|
||||
tel.BatteryHealthPct = parsePercentPtr(val)
|
||||
case strings.Contains(key, "replace"), strings.Contains(key, "failed"):
|
||||
if tel.BatteryReplaceRequired == nil {
|
||||
tel.BatteryReplaceRequired = parseReplaceRequired(val)
|
||||
}
|
||||
if desc := batteryStateDescription(val); desc != nil && tel.ErrorDescription == nil {
|
||||
tel.ErrorDescription = desc
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
if hasRAIDControllerTelemetry(tel) {
|
||||
return []raidControllerTelemetry{tel}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func hasRAIDControllerTelemetry(tel raidControllerTelemetry) bool {
|
||||
return tel.BatteryChargePct != nil ||
|
||||
tel.BatteryHealthPct != nil ||
|
||||
tel.BatteryTemperatureC != nil ||
|
||||
tel.BatteryVoltageV != nil ||
|
||||
tel.BatteryReplaceRequired != nil ||
|
||||
tel.ErrorDescription != nil
|
||||
}
|
||||
|
||||
func parsePercentPtr(raw string) *float64 {
|
||||
raw = strings.ReplaceAll(strings.TrimSpace(raw), "%", "")
|
||||
return parseFloatPtr(raw)
|
||||
}
|
||||
|
||||
func parseReplaceRequired(raw string) *bool {
|
||||
lower := strings.ToLower(strings.TrimSpace(raw))
|
||||
switch {
|
||||
case lower == "":
|
||||
return nil
|
||||
case strings.Contains(lower, "replace"), strings.Contains(lower, "failed"), strings.Contains(lower, "yes"), strings.Contains(lower, "required"):
|
||||
value := true
|
||||
return &value
|
||||
case strings.Contains(lower, "no"), strings.Contains(lower, "ok"), strings.Contains(lower, "good"), strings.Contains(lower, "optimal"):
|
||||
value := false
|
||||
return &value
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
func batteryStateDescription(raw string) *string {
|
||||
lower := strings.ToLower(strings.TrimSpace(raw))
|
||||
if lower == "" {
|
||||
return nil
|
||||
}
|
||||
switch {
|
||||
case strings.Contains(lower, "failed"), strings.Contains(lower, "fault"), strings.Contains(lower, "replace"), strings.Contains(lower, "warning"), strings.Contains(lower, "degraded"):
|
||||
return &raw
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
@@ -1,6 +1,10 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"errors"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestParseSASIrcuControllerIDs(t *testing.T) {
|
||||
raw := `LSI Corporation SAS2 IR Configuration Utility.
|
||||
@@ -90,7 +94,111 @@ physicaldrive 1I:1:2 (894 GB, SAS HDD, Failed)
|
||||
if drives[0].Status == nil || *drives[0].Status != "OK" {
|
||||
t.Fatalf("drive0 status: %v", drives[0].Status)
|
||||
}
|
||||
if drives[1].Status == nil || *drives[1].Status != "CRITICAL" {
|
||||
if drives[1].Status == nil || *drives[1].Status != statusCritical {
|
||||
t.Fatalf("drive1 status: %v", drives[1].Status)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseStorcliControllerTelemetry(t *testing.T) {
|
||||
raw := []byte(`{
|
||||
"Controllers": [
|
||||
{
|
||||
"Response Data": {
|
||||
"BBU_Info": {
|
||||
"State of Health": "98 %",
|
||||
"Relative State of Charge": "76 %",
|
||||
"Temperature": "41 C",
|
||||
"Voltage": "12.3 V",
|
||||
"Replacement required": "No"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}`)
|
||||
got := parseStorcliControllerTelemetry(raw)
|
||||
if len(got) != 1 {
|
||||
t.Fatalf("len(got)=%d want 1", len(got))
|
||||
}
|
||||
if got[0].BatteryHealthPct == nil || *got[0].BatteryHealthPct != 98 {
|
||||
t.Fatalf("battery health=%v", got[0].BatteryHealthPct)
|
||||
}
|
||||
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 76 {
|
||||
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
|
||||
}
|
||||
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 41 {
|
||||
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
|
||||
}
|
||||
if got[0].BatteryVoltageV == nil || *got[0].BatteryVoltageV != 12.3 {
|
||||
t.Fatalf("battery voltage=%v", got[0].BatteryVoltageV)
|
||||
}
|
||||
if got[0].BatteryReplaceRequired == nil || *got[0].BatteryReplaceRequired {
|
||||
t.Fatalf("battery replace=%v", got[0].BatteryReplaceRequired)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseSSACLIControllerTelemetry(t *testing.T) {
|
||||
raw := `Smart Array P440ar in Slot 0
|
||||
Battery/Capacitor Count: 1
|
||||
Capacitor Temperature (C): 37
|
||||
Capacitor Charge (%): 94
|
||||
Capacitor Health (%): 96
|
||||
Capacitor Voltage (V): 9.8
|
||||
Capacitor Failed: No
|
||||
`
|
||||
got := parseSSACLIControllerTelemetry(raw)
|
||||
if len(got) != 1 {
|
||||
t.Fatalf("len(got)=%d want 1", len(got))
|
||||
}
|
||||
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 37 {
|
||||
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
|
||||
}
|
||||
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 94 {
|
||||
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnrichPCIeWithRAIDTelemetry(t *testing.T) {
|
||||
orig := raidToolQuery
|
||||
t.Cleanup(func() { raidToolQuery = orig })
|
||||
raidToolQuery = func(name string, args ...string) ([]byte, error) {
|
||||
switch name {
|
||||
case "storcli64":
|
||||
return []byte(`{
|
||||
"Controllers": [
|
||||
{
|
||||
"Response Data": {
|
||||
"CV_Info": {
|
||||
"State of Health": "99 %",
|
||||
"Relative State of Charge": "81 %",
|
||||
"Temperature": "38 C",
|
||||
"Voltage": "12.1 V",
|
||||
"Replacement required": "No"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}`), nil
|
||||
default:
|
||||
return nil, errors.New("skip")
|
||||
}
|
||||
}
|
||||
|
||||
vendor := vendorBroadcomLSI
|
||||
class := "MassStorageController"
|
||||
status := statusOK
|
||||
devs := []schema.HardwarePCIeDevice{{
|
||||
VendorID: &vendor,
|
||||
DeviceClass: &class,
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
}}
|
||||
out := enrichPCIeWithRAIDTelemetry(devs)
|
||||
if out[0].BatteryHealthPct == nil || *out[0].BatteryHealthPct != 99 {
|
||||
t.Fatalf("battery health=%v", out[0].BatteryHealthPct)
|
||||
}
|
||||
if out[0].BatteryChargePct == nil || *out[0].BatteryChargePct != 81 {
|
||||
t.Fatalf("battery charge=%v", out[0].BatteryChargePct)
|
||||
}
|
||||
if out[0].BatteryVoltageV == nil || *out[0].BatteryVoltageV != 12.1 {
|
||||
t.Fatalf("battery voltage=%v", out[0].BatteryVoltageV)
|
||||
}
|
||||
}
|
||||
|
||||
373
audit/internal/collector/sensors.go
Normal file
373
audit/internal/collector/sensors.go
Normal file
@@ -0,0 +1,373 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"encoding/json"
|
||||
"log/slog"
|
||||
"os/exec"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
type sensorsDoc map[string]map[string]any
|
||||
|
||||
func collectSensors() *schema.HardwareSensors {
|
||||
doc, err := readSensorsJSONDoc()
|
||||
if err != nil {
|
||||
slog.Info("sensors: unavailable, skipping", "err", err)
|
||||
return nil
|
||||
}
|
||||
sensors := buildSensorsFromDoc(doc)
|
||||
if sensors == nil || (len(sensors.Fans) == 0 && len(sensors.Power) == 0 && len(sensors.Temperatures) == 0 && len(sensors.Other) == 0) {
|
||||
return nil
|
||||
}
|
||||
slog.Info("sensors: collected",
|
||||
"fans", len(sensors.Fans),
|
||||
"power", len(sensors.Power),
|
||||
"temperatures", len(sensors.Temperatures),
|
||||
"other", len(sensors.Other),
|
||||
)
|
||||
return sensors
|
||||
}
|
||||
|
||||
func readSensorsJSONDoc() (sensorsDoc, error) {
|
||||
out, err := exec.Command("sensors", "-j").Output()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
var doc sensorsDoc
|
||||
if err := json.Unmarshal(out, &doc); err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return doc, nil
|
||||
}
|
||||
|
||||
func buildSensorsFromDoc(doc sensorsDoc) *schema.HardwareSensors {
|
||||
if len(doc) == 0 {
|
||||
return nil
|
||||
}
|
||||
result := &schema.HardwareSensors{}
|
||||
seen := map[string]struct{}{}
|
||||
|
||||
chips := make([]string, 0, len(doc))
|
||||
for chip := range doc {
|
||||
chips = append(chips, chip)
|
||||
}
|
||||
sort.Strings(chips)
|
||||
|
||||
for _, chip := range chips {
|
||||
features := doc[chip]
|
||||
location := sensorLocation(chip)
|
||||
|
||||
keys := make([]string, 0, len(features))
|
||||
for key := range features {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
|
||||
for _, key := range keys {
|
||||
if strings.EqualFold(key, "Adapter") {
|
||||
continue
|
||||
}
|
||||
feature, ok := features[key].(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
name := strings.TrimSpace(key)
|
||||
if name == "" {
|
||||
continue
|
||||
}
|
||||
switch classifySensorFeature(feature) {
|
||||
case "fan":
|
||||
item := buildFanSensor(name, location, feature)
|
||||
if item == nil || duplicateSensor(seen, "fan", item.Name) {
|
||||
continue
|
||||
}
|
||||
result.Fans = append(result.Fans, *item)
|
||||
case "temp":
|
||||
item := buildTempSensor(name, location, feature)
|
||||
if item == nil || duplicateSensor(seen, "temp", item.Name) {
|
||||
continue
|
||||
}
|
||||
result.Temperatures = append(result.Temperatures, *item)
|
||||
case "power":
|
||||
item := buildPowerSensor(name, location, feature)
|
||||
if item == nil || duplicateSensor(seen, "power", item.Name) {
|
||||
continue
|
||||
}
|
||||
result.Power = append(result.Power, *item)
|
||||
default:
|
||||
item := buildOtherSensor(name, location, feature)
|
||||
if item == nil || duplicateSensor(seen, "other", item.Name) {
|
||||
continue
|
||||
}
|
||||
result.Other = append(result.Other, *item)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return result
|
||||
}
|
||||
|
||||
func parseSensorsJSON(raw []byte) (*schema.HardwareSensors, error) {
|
||||
var doc sensorsDoc
|
||||
err := json.Unmarshal(raw, &doc)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return buildSensorsFromDoc(doc), nil
|
||||
}
|
||||
|
||||
func duplicateSensor(seen map[string]struct{}, sensorType, name string) bool {
|
||||
key := sensorType + "\x00" + name
|
||||
if _, ok := seen[key]; ok {
|
||||
return true
|
||||
}
|
||||
seen[key] = struct{}{}
|
||||
return false
|
||||
}
|
||||
|
||||
func sensorLocation(chip string) *string {
|
||||
chip = strings.TrimSpace(chip)
|
||||
if chip == "" {
|
||||
return nil
|
||||
}
|
||||
return &chip
|
||||
}
|
||||
|
||||
func classifySensorFeature(feature map[string]any) string {
|
||||
for key := range feature {
|
||||
switch {
|
||||
case strings.Contains(key, "fan") && strings.HasSuffix(key, "_input"):
|
||||
return "fan"
|
||||
case strings.Contains(key, "temp") && strings.HasSuffix(key, "_input"):
|
||||
return "temp"
|
||||
case strings.Contains(key, "power") && (strings.HasSuffix(key, "_input") || strings.HasSuffix(key, "_average")):
|
||||
return "power"
|
||||
case strings.Contains(key, "curr") && strings.HasSuffix(key, "_input"):
|
||||
return "power"
|
||||
case strings.HasPrefix(key, "in") && strings.HasSuffix(key, "_input"):
|
||||
return "power"
|
||||
}
|
||||
}
|
||||
return "other"
|
||||
}
|
||||
|
||||
func buildFanSensor(name string, location *string, feature map[string]any) *schema.HardwareFanSensor {
|
||||
rpm, ok := firstFeatureInt(feature, "_input")
|
||||
if !ok {
|
||||
return nil
|
||||
}
|
||||
item := &schema.HardwareFanSensor{Name: name, Location: location, RPM: &rpm}
|
||||
if status := sensorStatusFromFeature(feature); status != nil {
|
||||
item.Status = status
|
||||
}
|
||||
return item
|
||||
}
|
||||
|
||||
func buildTempSensor(name string, location *string, feature map[string]any) *schema.HardwareTemperatureSensor {
|
||||
celsius, ok := firstFeatureFloat(feature, "_input")
|
||||
if !ok {
|
||||
return nil
|
||||
}
|
||||
item := &schema.HardwareTemperatureSensor{Name: name, Location: location, Celsius: &celsius}
|
||||
if warning, ok := firstFeatureFloatWithSuffixes(feature, []string{"_max", "_high"}); ok {
|
||||
item.ThresholdWarningCelsius = &warning
|
||||
}
|
||||
if critical, ok := firstFeatureFloatWithSuffixes(feature, []string{"_crit", "_emergency"}); ok {
|
||||
item.ThresholdCriticalCelsius = &critical
|
||||
}
|
||||
if status := sensorStatusFromFeature(feature); status != nil {
|
||||
item.Status = status
|
||||
} else {
|
||||
item.Status = deriveTemperatureStatus(item.Celsius, item.ThresholdWarningCelsius, item.ThresholdCriticalCelsius)
|
||||
}
|
||||
return item
|
||||
}
|
||||
|
||||
func buildPowerSensor(name string, location *string, feature map[string]any) *schema.HardwarePowerSensor {
|
||||
item := &schema.HardwarePowerSensor{Name: name, Location: location}
|
||||
if v, ok := firstFeatureFloatWithContains(feature, []string{"power"}); ok {
|
||||
item.PowerW = &v
|
||||
}
|
||||
if v, ok := firstFeatureFloatWithPrefix(feature, "curr"); ok {
|
||||
item.CurrentA = &v
|
||||
}
|
||||
if v, ok := firstFeatureFloatWithPrefix(feature, "in"); ok {
|
||||
item.VoltageV = &v
|
||||
}
|
||||
if item.PowerW == nil && item.CurrentA == nil && item.VoltageV == nil {
|
||||
return nil
|
||||
}
|
||||
if status := sensorStatusFromFeature(feature); status != nil {
|
||||
item.Status = status
|
||||
}
|
||||
return item
|
||||
}
|
||||
|
||||
func buildOtherSensor(name string, location *string, feature map[string]any) *schema.HardwareOtherSensor {
|
||||
value, unit, ok := firstGenericSensorValue(feature)
|
||||
if !ok {
|
||||
return nil
|
||||
}
|
||||
item := &schema.HardwareOtherSensor{Name: name, Location: location, Value: &value}
|
||||
if unit != "" {
|
||||
item.Unit = &unit
|
||||
}
|
||||
if status := sensorStatusFromFeature(feature); status != nil {
|
||||
item.Status = status
|
||||
}
|
||||
return item
|
||||
}
|
||||
|
||||
func sensorStatusFromFeature(feature map[string]any) *string {
|
||||
for key, raw := range feature {
|
||||
if !strings.HasSuffix(key, "_alarm") {
|
||||
continue
|
||||
}
|
||||
if number, ok := floatFromAny(raw); ok && number > 0 {
|
||||
status := statusWarning
|
||||
return &status
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func deriveTemperatureStatus(current, warning, critical *float64) *string {
|
||||
if current == nil {
|
||||
return nil
|
||||
}
|
||||
switch {
|
||||
case critical != nil && *current >= *critical:
|
||||
status := statusCritical
|
||||
return &status
|
||||
case warning != nil && *current >= *warning:
|
||||
status := statusWarning
|
||||
return &status
|
||||
default:
|
||||
status := statusOK
|
||||
return &status
|
||||
}
|
||||
}
|
||||
|
||||
func firstFeatureInt(feature map[string]any, suffix string) (int, bool) {
|
||||
for key, raw := range feature {
|
||||
if strings.HasSuffix(key, suffix) {
|
||||
if value, ok := floatFromAny(raw); ok {
|
||||
return int(value), true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func firstFeatureFloat(feature map[string]any, suffix string) (float64, bool) {
|
||||
return firstFeatureFloatWithSuffixes(feature, []string{suffix})
|
||||
}
|
||||
|
||||
func firstFeatureFloatWithSuffixes(feature map[string]any, suffixes []string) (float64, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
for _, suffix := range suffixes {
|
||||
if strings.HasSuffix(key, suffix) {
|
||||
if value, ok := floatFromAny(feature[key]); ok {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func firstFeatureFloatWithContains(feature map[string]any, parts []string) (float64, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
matched := true
|
||||
for _, part := range parts {
|
||||
if !strings.Contains(key, part) {
|
||||
matched = false
|
||||
break
|
||||
}
|
||||
}
|
||||
if matched {
|
||||
if value, ok := floatFromAny(feature[key]); ok {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func firstFeatureFloatWithPrefix(feature map[string]any, prefix string) (float64, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
if strings.HasPrefix(key, prefix) && strings.HasSuffix(key, "_input") {
|
||||
if value, ok := floatFromAny(feature[key]); ok {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func firstGenericSensorValue(feature map[string]any) (float64, string, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
if strings.HasSuffix(key, "_alarm") {
|
||||
continue
|
||||
}
|
||||
value, ok := floatFromAny(feature[key])
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
unit := inferSensorUnit(key)
|
||||
return value, unit, true
|
||||
}
|
||||
return 0, "", false
|
||||
}
|
||||
|
||||
func inferSensorUnit(key string) string {
|
||||
switch {
|
||||
case strings.Contains(key, "humidity"):
|
||||
return "%"
|
||||
case strings.Contains(key, "intrusion"):
|
||||
return ""
|
||||
default:
|
||||
return ""
|
||||
}
|
||||
}
|
||||
|
||||
func sortedFeatureKeys(feature map[string]any) []string {
|
||||
keys := make([]string, 0, len(feature))
|
||||
for key := range feature {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
return keys
|
||||
}
|
||||
|
||||
func floatFromAny(raw any) (float64, bool) {
|
||||
switch value := raw.(type) {
|
||||
case float64:
|
||||
return value, true
|
||||
case float32:
|
||||
return float64(value), true
|
||||
case int:
|
||||
return float64(value), true
|
||||
case int64:
|
||||
return float64(value), true
|
||||
case json.Number:
|
||||
if f, err := value.Float64(); err == nil {
|
||||
return f, true
|
||||
}
|
||||
case string:
|
||||
if value == "" {
|
||||
return 0, false
|
||||
}
|
||||
if f, err := strconv.ParseFloat(value, 64); err == nil {
|
||||
return f, true
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
54
audit/internal/collector/sensors_test.go
Normal file
54
audit/internal/collector/sensors_test.go
Normal file
@@ -0,0 +1,54 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
|
||||
func TestParseSensorsJSON(t *testing.T) {
|
||||
raw := []byte(`{
|
||||
"coretemp-isa-0000": {
|
||||
"Adapter": "ISA adapter",
|
||||
"Package id 0": {
|
||||
"temp1_input": 61.5,
|
||||
"temp1_max": 80.0,
|
||||
"temp1_crit": 95.0
|
||||
},
|
||||
"fan1": {
|
||||
"fan1_input": 4200
|
||||
}
|
||||
},
|
||||
"acpitz-acpi-0": {
|
||||
"Adapter": "ACPI interface",
|
||||
"in0": {
|
||||
"in0_input": 12.06
|
||||
},
|
||||
"curr1": {
|
||||
"curr1_input": 0.64
|
||||
},
|
||||
"power1": {
|
||||
"power1_average": 137.0
|
||||
},
|
||||
"humidity1": {
|
||||
"humidity1_input": 38.5
|
||||
}
|
||||
}
|
||||
}`)
|
||||
|
||||
got, err := parseSensorsJSON(raw)
|
||||
if err != nil {
|
||||
t.Fatalf("parseSensorsJSON error: %v", err)
|
||||
}
|
||||
if got == nil {
|
||||
t.Fatal("expected sensors")
|
||||
}
|
||||
if len(got.Temperatures) != 1 || got.Temperatures[0].Celsius == nil || *got.Temperatures[0].Celsius != 61.5 {
|
||||
t.Fatalf("temperatures mismatch: %#v", got.Temperatures)
|
||||
}
|
||||
if len(got.Fans) != 1 || got.Fans[0].RPM == nil || *got.Fans[0].RPM != 4200 {
|
||||
t.Fatalf("fans mismatch: %#v", got.Fans)
|
||||
}
|
||||
if len(got.Power) != 3 {
|
||||
t.Fatalf("power sensors mismatch: %#v", got.Power)
|
||||
}
|
||||
if len(got.Other) != 1 || got.Other[0].Unit == nil || *got.Other[0].Unit != "%" {
|
||||
t.Fatalf("other sensors mismatch: %#v", got.Other)
|
||||
}
|
||||
}
|
||||
@@ -5,11 +5,13 @@ import (
|
||||
"encoding/json"
|
||||
"log/slog"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func collectStorage() []schema.HardwareStorage {
|
||||
devs := lsblkDevices()
|
||||
devs := discoverStorageDevices()
|
||||
result := make([]schema.HardwareStorage, 0, len(devs))
|
||||
for _, dev := range devs {
|
||||
var s schema.HardwareStorage
|
||||
@@ -26,19 +28,60 @@ func collectStorage() []schema.HardwareStorage {
|
||||
|
||||
// lsblkDevice is a minimal lsblk JSON record.
|
||||
type lsblkDevice struct {
|
||||
Name string `json:"name"`
|
||||
Type string `json:"type"`
|
||||
Size string `json:"size"`
|
||||
Serial string `json:"serial"`
|
||||
Model string `json:"model"`
|
||||
Tran string `json:"tran"`
|
||||
Hctl string `json:"hctl"`
|
||||
Name string `json:"name"`
|
||||
Type string `json:"type"`
|
||||
Size string `json:"size"`
|
||||
Serial string `json:"serial"`
|
||||
Model string `json:"model"`
|
||||
Tran string `json:"tran"`
|
||||
Hctl string `json:"hctl"`
|
||||
}
|
||||
|
||||
type lsblkRoot struct {
|
||||
Blockdevices []lsblkDevice `json:"blockdevices"`
|
||||
}
|
||||
|
||||
type nvmeListRoot struct {
|
||||
Devices []nvmeListDevice `json:"Devices"`
|
||||
}
|
||||
|
||||
type nvmeListDevice struct {
|
||||
DevicePath string `json:"DevicePath"`
|
||||
ModelNumber string `json:"ModelNumber"`
|
||||
SerialNumber string `json:"SerialNumber"`
|
||||
Firmware string `json:"Firmware"`
|
||||
PhysicalSize int64 `json:"PhysicalSize"`
|
||||
}
|
||||
|
||||
func discoverStorageDevices() []lsblkDevice {
|
||||
merged := map[string]lsblkDevice{}
|
||||
for _, dev := range lsblkDevices() {
|
||||
if dev.Name == "" {
|
||||
continue
|
||||
}
|
||||
merged[dev.Name] = dev
|
||||
}
|
||||
for _, dev := range nvmeListDevices() {
|
||||
if dev.Name == "" {
|
||||
continue
|
||||
}
|
||||
current := merged[dev.Name]
|
||||
merged[dev.Name] = mergeStorageDevice(current, dev)
|
||||
}
|
||||
|
||||
disks := make([]lsblkDevice, 0, len(merged))
|
||||
for _, dev := range merged {
|
||||
if dev.Type == "" {
|
||||
dev.Type = "disk"
|
||||
}
|
||||
if dev.Type != "disk" {
|
||||
continue
|
||||
}
|
||||
disks = append(disks, dev)
|
||||
}
|
||||
return disks
|
||||
}
|
||||
|
||||
func lsblkDevices() []lsblkDevice {
|
||||
out, err := exec.Command("lsblk", "-J", "-d",
|
||||
"-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL").Output()
|
||||
@@ -60,6 +103,59 @@ func lsblkDevices() []lsblkDevice {
|
||||
return disks
|
||||
}
|
||||
|
||||
func nvmeListDevices() []lsblkDevice {
|
||||
out, err := exec.Command("nvme", "list", "-o", "json").Output()
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
var root nvmeListRoot
|
||||
if err := json.Unmarshal(out, &root); err != nil {
|
||||
slog.Warn("storage: nvme list parse failed", "err", err)
|
||||
return nil
|
||||
}
|
||||
devices := make([]lsblkDevice, 0, len(root.Devices))
|
||||
for _, dev := range root.Devices {
|
||||
name := filepath.Base(strings.TrimSpace(dev.DevicePath))
|
||||
if name == "" {
|
||||
continue
|
||||
}
|
||||
devices = append(devices, lsblkDevice{
|
||||
Name: name,
|
||||
Type: "disk",
|
||||
Size: strconv.FormatInt(dev.PhysicalSize, 10),
|
||||
Serial: strings.TrimSpace(dev.SerialNumber),
|
||||
Model: strings.TrimSpace(dev.ModelNumber),
|
||||
Tran: "nvme",
|
||||
})
|
||||
}
|
||||
return devices
|
||||
}
|
||||
|
||||
func mergeStorageDevice(existing, incoming lsblkDevice) lsblkDevice {
|
||||
if existing.Name == "" {
|
||||
return incoming
|
||||
}
|
||||
if existing.Type == "" {
|
||||
existing.Type = incoming.Type
|
||||
}
|
||||
if strings.TrimSpace(existing.Size) == "" {
|
||||
existing.Size = incoming.Size
|
||||
}
|
||||
if strings.TrimSpace(existing.Serial) == "" {
|
||||
existing.Serial = incoming.Serial
|
||||
}
|
||||
if strings.TrimSpace(existing.Model) == "" {
|
||||
existing.Model = incoming.Model
|
||||
}
|
||||
if strings.TrimSpace(existing.Tran) == "" {
|
||||
existing.Tran = incoming.Tran
|
||||
}
|
||||
if strings.TrimSpace(existing.Hctl) == "" {
|
||||
existing.Hctl = incoming.Hctl
|
||||
}
|
||||
return existing
|
||||
}
|
||||
|
||||
// smartctlInfo is the subset of smartctl -j -a output we care about.
|
||||
type smartctlInfo struct {
|
||||
ModelFamily string `json:"model_family"`
|
||||
@@ -67,14 +163,22 @@ type smartctlInfo struct {
|
||||
SerialNumber string `json:"serial_number"`
|
||||
FirmwareVer string `json:"firmware_version"`
|
||||
RotationRate int `json:"rotation_rate"`
|
||||
Temperature struct {
|
||||
Current int `json:"current"`
|
||||
} `json:"temperature"`
|
||||
SmartStatus struct {
|
||||
Passed bool `json:"passed"`
|
||||
} `json:"smart_status"`
|
||||
UserCapacity struct {
|
||||
Bytes int64 `json:"bytes"`
|
||||
} `json:"user_capacity"`
|
||||
AtaSmartAttributes struct {
|
||||
Table []struct {
|
||||
ID int `json:"id"`
|
||||
Name string `json:"name"`
|
||||
Raw struct{ Value int64 `json:"value"` } `json:"raw"`
|
||||
ID int `json:"id"`
|
||||
Name string `json:"name"`
|
||||
Raw struct {
|
||||
Value int64 `json:"value"`
|
||||
} `json:"raw"`
|
||||
} `json:"table"`
|
||||
} `json:"ata_smart_attributes"`
|
||||
PowerOnTime struct {
|
||||
@@ -149,69 +253,116 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
|
||||
} else if info.RotationRate > 0 {
|
||||
devType = "HDD"
|
||||
}
|
||||
s.Type = &devType
|
||||
|
||||
// telemetry
|
||||
tel := map[string]any{}
|
||||
if info.Temperature.Current > 0 {
|
||||
t := float64(info.Temperature.Current)
|
||||
s.TemperatureC = &t
|
||||
}
|
||||
if info.PowerOnTime.Hours > 0 {
|
||||
tel["power_on_hours"] = info.PowerOnTime.Hours
|
||||
v := int64(info.PowerOnTime.Hours)
|
||||
s.PowerOnHours = &v
|
||||
}
|
||||
if info.PowerCycleCount > 0 {
|
||||
tel["power_cycles"] = info.PowerCycleCount
|
||||
v := int64(info.PowerCycleCount)
|
||||
s.PowerCycles = &v
|
||||
}
|
||||
reallocated := int64(0)
|
||||
pending := int64(0)
|
||||
uncorrectable := int64(0)
|
||||
lifeRemaining := int64(0)
|
||||
for _, attr := range info.AtaSmartAttributes.Table {
|
||||
switch attr.ID {
|
||||
case 5:
|
||||
tel["reallocated_sectors"] = attr.Raw.Value
|
||||
reallocated = attr.Raw.Value
|
||||
s.ReallocatedSectors = &reallocated
|
||||
case 177:
|
||||
tel["wear_leveling_pct"] = attr.Raw.Value
|
||||
value := float64(attr.Raw.Value)
|
||||
s.LifeUsedPct = &value
|
||||
case 231:
|
||||
tel["life_remaining_pct"] = attr.Raw.Value
|
||||
lifeRemaining = attr.Raw.Value
|
||||
value := float64(attr.Raw.Value)
|
||||
s.LifeRemainingPct = &value
|
||||
case 241:
|
||||
tel["total_lba_written"] = attr.Raw.Value
|
||||
value := attr.Raw.Value
|
||||
s.WrittenBytes = &value
|
||||
case 197:
|
||||
pending = attr.Raw.Value
|
||||
s.CurrentPendingSectors = &pending
|
||||
case 198:
|
||||
uncorrectable = attr.Raw.Value
|
||||
s.OfflineUncorrectable = &uncorrectable
|
||||
}
|
||||
}
|
||||
if len(tel) > 0 {
|
||||
s.Telemetry = tel
|
||||
|
||||
status := storageHealthStatus{
|
||||
overallPassed: info.SmartStatus.Passed,
|
||||
hasOverall: true,
|
||||
reallocatedSectors: reallocated,
|
||||
pendingSectors: pending,
|
||||
offlineUncorrectable: uncorrectable,
|
||||
lifeRemainingPct: lifeRemaining,
|
||||
}
|
||||
setStorageHealthStatus(&s, status)
|
||||
return s
|
||||
}
|
||||
|
||||
s.Type = &devType
|
||||
status := "OK"
|
||||
status := statusUnknown
|
||||
s.Status = &status
|
||||
return s
|
||||
}
|
||||
|
||||
// nvmeSmartLog is the subset of `nvme smart-log -o json` output we care about.
|
||||
type nvmeSmartLog struct {
|
||||
CriticalWarning int `json:"critical_warning"`
|
||||
PercentageUsed int `json:"percentage_used"`
|
||||
AvailableSpare int `json:"available_spare"`
|
||||
SpareThreshold int `json:"spare_thresh"`
|
||||
Temperature int64 `json:"temperature"`
|
||||
PowerOnHours int64 `json:"power_on_hours"`
|
||||
PowerCycles int64 `json:"power_cycles"`
|
||||
UnsafeShutdowns int64 `json:"unsafe_shutdowns"`
|
||||
DataUnitsRead int64 `json:"data_units_read"`
|
||||
DataUnitsWritten int64 `json:"data_units_written"`
|
||||
ControllerBusy int64 `json:"controller_busy_time"`
|
||||
MediaErrors int64 `json:"media_errors"`
|
||||
NumErrLogEntries int64 `json:"num_err_log_entries"`
|
||||
}
|
||||
|
||||
// nvmeIDCtrl is the subset of `nvme id-ctrl -o json` output.
|
||||
type nvmeIDCtrl struct {
|
||||
ModelNumber string `json:"mn"`
|
||||
SerialNumber string `json:"sn"`
|
||||
FirmwareRev string `json:"fr"`
|
||||
TotalCapacity int64 `json:"tnvmcap"`
|
||||
ModelNumber string `json:"mn"`
|
||||
SerialNumber string `json:"sn"`
|
||||
FirmwareRev string `json:"fr"`
|
||||
TotalCapacity int64 `json:"tnvmcap"`
|
||||
}
|
||||
|
||||
func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
|
||||
present := true
|
||||
devType := "NVMe"
|
||||
iface := "NVMe"
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
s := schema.HardwareStorage{
|
||||
Present: &present,
|
||||
Type: &devType,
|
||||
Interface: &iface,
|
||||
Status: &status,
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
Present: &present,
|
||||
Type: &devType,
|
||||
Interface: &iface,
|
||||
}
|
||||
|
||||
devPath := "/dev/" + dev.Name
|
||||
if v := cleanDMIValue(strings.TrimSpace(dev.Model)); v != "" {
|
||||
s.Model = &v
|
||||
}
|
||||
if v := cleanDMIValue(strings.TrimSpace(dev.Serial)); v != "" {
|
||||
s.SerialNumber = &v
|
||||
}
|
||||
if size := parseStorageBytes(dev.Size); size > 0 {
|
||||
gb := int(size / 1_000_000_000)
|
||||
if gb > 0 {
|
||||
s.SizeGB = &gb
|
||||
}
|
||||
}
|
||||
|
||||
// id-ctrl: model, serial, firmware, capacity
|
||||
if out, err := exec.Command("nvme", "id-ctrl", devPath, "-o", "json").Output(); err == nil {
|
||||
@@ -237,30 +388,131 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
|
||||
if out, err := exec.Command("nvme", "smart-log", devPath, "-o", "json").Output(); err == nil {
|
||||
var log nvmeSmartLog
|
||||
if json.Unmarshal(out, &log) == nil {
|
||||
tel := map[string]any{}
|
||||
if log.PowerOnHours > 0 {
|
||||
tel["power_on_hours"] = log.PowerOnHours
|
||||
s.PowerOnHours = &log.PowerOnHours
|
||||
}
|
||||
if log.PowerCycles > 0 {
|
||||
tel["power_cycles"] = log.PowerCycles
|
||||
s.PowerCycles = &log.PowerCycles
|
||||
}
|
||||
if log.UnsafeShutdowns > 0 {
|
||||
tel["unsafe_shutdowns"] = log.UnsafeShutdowns
|
||||
s.UnsafeShutdowns = &log.UnsafeShutdowns
|
||||
}
|
||||
if log.PercentageUsed > 0 {
|
||||
tel["percentage_used"] = log.PercentageUsed
|
||||
v := float64(log.PercentageUsed)
|
||||
s.LifeUsedPct = &v
|
||||
remaining := 100 - v
|
||||
s.LifeRemainingPct = &remaining
|
||||
}
|
||||
if log.DataUnitsWritten > 0 {
|
||||
tel["data_units_written"] = log.DataUnitsWritten
|
||||
v := nvmeDataUnitsToBytes(log.DataUnitsWritten)
|
||||
s.WrittenBytes = &v
|
||||
}
|
||||
if log.ControllerBusy > 0 {
|
||||
tel["controller_busy_time"] = log.ControllerBusy
|
||||
if log.DataUnitsRead > 0 {
|
||||
v := nvmeDataUnitsToBytes(log.DataUnitsRead)
|
||||
s.ReadBytes = &v
|
||||
}
|
||||
if len(tel) > 0 {
|
||||
s.Telemetry = tel
|
||||
if log.AvailableSpare > 0 {
|
||||
v := float64(log.AvailableSpare)
|
||||
s.AvailableSparePct = &v
|
||||
}
|
||||
if log.MediaErrors > 0 {
|
||||
s.MediaErrors = &log.MediaErrors
|
||||
}
|
||||
if log.NumErrLogEntries > 0 {
|
||||
s.ErrorLogEntries = &log.NumErrLogEntries
|
||||
}
|
||||
if log.Temperature > 0 {
|
||||
v := float64(log.Temperature - 273)
|
||||
s.TemperatureC = &v
|
||||
}
|
||||
setStorageHealthStatus(&s, storageHealthStatus{
|
||||
criticalWarning: log.CriticalWarning,
|
||||
percentageUsed: int64(log.PercentageUsed),
|
||||
availableSpare: int64(log.AvailableSpare),
|
||||
spareThreshold: int64(log.SpareThreshold),
|
||||
unsafeShutdowns: log.UnsafeShutdowns,
|
||||
mediaErrors: log.MediaErrors,
|
||||
errorLogEntries: log.NumErrLogEntries,
|
||||
})
|
||||
return s
|
||||
}
|
||||
}
|
||||
|
||||
status = statusUnknown
|
||||
s.Status = &status
|
||||
return s
|
||||
}
|
||||
|
||||
func parseStorageBytes(raw string) int64 {
|
||||
value, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
|
||||
if err == nil && value > 0 {
|
||||
return value
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func nvmeDataUnitsToBytes(units int64) int64 {
|
||||
if units <= 0 {
|
||||
return 0
|
||||
}
|
||||
return units * 512000
|
||||
}
|
||||
|
||||
type storageHealthStatus struct {
|
||||
hasOverall bool
|
||||
overallPassed bool
|
||||
reallocatedSectors int64
|
||||
pendingSectors int64
|
||||
offlineUncorrectable int64
|
||||
lifeRemainingPct int64
|
||||
criticalWarning int
|
||||
percentageUsed int64
|
||||
availableSpare int64
|
||||
spareThreshold int64
|
||||
unsafeShutdowns int64
|
||||
mediaErrors int64
|
||||
errorLogEntries int64
|
||||
}
|
||||
|
||||
func setStorageHealthStatus(s *schema.HardwareStorage, health storageHealthStatus) {
|
||||
status := statusOK
|
||||
var description *string
|
||||
switch {
|
||||
case health.hasOverall && !health.overallPassed:
|
||||
status = statusCritical
|
||||
description = stringPtr("SMART overall self-assessment failed")
|
||||
case health.criticalWarning > 0:
|
||||
status = statusCritical
|
||||
description = stringPtr("NVMe critical warning is set")
|
||||
case health.pendingSectors > 0 || health.offlineUncorrectable > 0:
|
||||
status = statusCritical
|
||||
description = stringPtr("Pending or offline uncorrectable sectors detected")
|
||||
case health.mediaErrors > 0:
|
||||
status = statusWarning
|
||||
description = stringPtr("Media errors reported")
|
||||
case health.reallocatedSectors > 0:
|
||||
status = statusWarning
|
||||
description = stringPtr("Reallocated sectors detected")
|
||||
case health.errorLogEntries > 0:
|
||||
status = statusWarning
|
||||
description = stringPtr("Device error log contains entries")
|
||||
case health.lifeRemainingPct > 0 && health.lifeRemainingPct <= 10:
|
||||
status = statusWarning
|
||||
description = stringPtr("Life remaining is low")
|
||||
case health.percentageUsed >= 95:
|
||||
status = statusWarning
|
||||
description = stringPtr("Drive wear level is high")
|
||||
case health.availableSpare > 0 && health.spareThreshold > 0 && health.availableSpare <= health.spareThreshold:
|
||||
status = statusWarning
|
||||
description = stringPtr("Available spare is at or below threshold")
|
||||
case health.unsafeShutdowns > 100:
|
||||
status = statusWarning
|
||||
description = stringPtr("Unsafe shutdown count is high")
|
||||
}
|
||||
s.Status = &status
|
||||
s.ErrorDescription = description
|
||||
}
|
||||
|
||||
func stringPtr(value string) *string {
|
||||
return &value
|
||||
}
|
||||
|
||||
33
audit/internal/collector/storage_discovery_test.go
Normal file
33
audit/internal/collector/storage_discovery_test.go
Normal file
@@ -0,0 +1,33 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
|
||||
func TestMergeStorageDevicePrefersNonEmptyFields(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
got := mergeStorageDevice(
|
||||
lsblkDevice{Name: "nvme0n1", Type: "disk", Tran: "nvme"},
|
||||
lsblkDevice{Name: "nvme0n1", Type: "disk", Size: "1024", Serial: "SN123", Model: "Kioxia"},
|
||||
)
|
||||
|
||||
if got.Serial != "SN123" {
|
||||
t.Fatalf("serial=%q want SN123", got.Serial)
|
||||
}
|
||||
if got.Model != "Kioxia" {
|
||||
t.Fatalf("model=%q want Kioxia", got.Model)
|
||||
}
|
||||
if got.Size != "1024" {
|
||||
t.Fatalf("size=%q want 1024", got.Size)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseStorageBytes(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
if got := parseStorageBytes(" 2048 "); got != 2048 {
|
||||
t.Fatalf("parseStorageBytes=%d want 2048", got)
|
||||
}
|
||||
if got := parseStorageBytes("1.92 TB"); got != 0 {
|
||||
t.Fatalf("parseStorageBytes invalid=%d want 0", got)
|
||||
}
|
||||
}
|
||||
63
audit/internal/collector/storage_health_test.go
Normal file
63
audit/internal/collector/storage_health_test.go
Normal file
@@ -0,0 +1,63 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
func TestSetStorageHealthStatus(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
health storageHealthStatus
|
||||
want string
|
||||
}{
|
||||
{
|
||||
name: "smart overall failed",
|
||||
health: storageHealthStatus{hasOverall: true, overallPassed: false},
|
||||
want: statusCritical,
|
||||
},
|
||||
{
|
||||
name: "nvme critical warning",
|
||||
health: storageHealthStatus{criticalWarning: 1},
|
||||
want: statusCritical,
|
||||
},
|
||||
{
|
||||
name: "pending sectors",
|
||||
health: storageHealthStatus{pendingSectors: 1},
|
||||
want: statusCritical,
|
||||
},
|
||||
{
|
||||
name: "media errors warning",
|
||||
health: storageHealthStatus{mediaErrors: 2},
|
||||
want: statusWarning,
|
||||
},
|
||||
{
|
||||
name: "reallocated warning",
|
||||
health: storageHealthStatus{reallocatedSectors: 1},
|
||||
want: statusWarning,
|
||||
},
|
||||
{
|
||||
name: "life remaining low",
|
||||
health: storageHealthStatus{lifeRemainingPct: 8},
|
||||
want: statusWarning,
|
||||
},
|
||||
{
|
||||
name: "healthy",
|
||||
health: storageHealthStatus{},
|
||||
want: statusOK,
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
var disk schema.HardwareStorage
|
||||
setStorageHealthStatus(&disk, tt.health)
|
||||
if disk.Status == nil || *disk.Status != tt.want {
|
||||
t.Fatalf("status=%v want %q", disk.Status, tt.want)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
114
audit/internal/collector/summary.go
Normal file
114
audit/internal/collector/summary.go
Normal file
@@ -0,0 +1,114 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"fmt"
|
||||
"time"
|
||||
)
|
||||
|
||||
func BuildHealthSummary(snap schema.HardwareSnapshot) *schema.HardwareHealthSummary {
|
||||
summary := &schema.HardwareHealthSummary{
|
||||
Status: statusOK,
|
||||
CollectedAt: time.Now().UTC().Format(time.RFC3339),
|
||||
}
|
||||
|
||||
for _, dimm := range snap.Memory {
|
||||
switch derefString(dimm.Status) {
|
||||
case statusWarning:
|
||||
summary.MemoryWarn++
|
||||
summary.Warnings = append(summary.Warnings, formatMemorySummary(dimm))
|
||||
case statusCritical:
|
||||
summary.MemoryFail++
|
||||
summary.Failures = append(summary.Failures, formatMemorySummary(dimm))
|
||||
case statusEmpty:
|
||||
summary.EmptyDIMMs++
|
||||
}
|
||||
}
|
||||
|
||||
for _, disk := range snap.Storage {
|
||||
switch derefString(disk.Status) {
|
||||
case statusWarning:
|
||||
summary.StorageWarn++
|
||||
summary.Warnings = append(summary.Warnings, formatStorageSummary(disk))
|
||||
case statusCritical:
|
||||
summary.StorageFail++
|
||||
summary.Failures = append(summary.Failures, formatStorageSummary(disk))
|
||||
}
|
||||
}
|
||||
|
||||
for _, dev := range snap.PCIeDevices {
|
||||
switch derefString(dev.Status) {
|
||||
case statusWarning:
|
||||
summary.PCIeWarn++
|
||||
summary.Warnings = append(summary.Warnings, formatPCIeSummary(dev))
|
||||
case statusCritical:
|
||||
summary.PCIeFail++
|
||||
summary.Failures = append(summary.Failures, formatPCIeSummary(dev))
|
||||
}
|
||||
}
|
||||
|
||||
for _, psu := range snap.PowerSupplies {
|
||||
if psu.Present != nil && !*psu.Present {
|
||||
summary.MissingPSUs++
|
||||
}
|
||||
switch derefString(psu.Status) {
|
||||
case statusWarning:
|
||||
summary.PSUWarn++
|
||||
summary.Warnings = append(summary.Warnings, formatPSUSummary(psu))
|
||||
case statusCritical:
|
||||
summary.PSUFail++
|
||||
summary.Failures = append(summary.Failures, formatPSUSummary(psu))
|
||||
}
|
||||
}
|
||||
|
||||
if len(summary.Failures) > 0 || summary.StorageFail > 0 || summary.PCIeFail > 0 || summary.PSUFail > 0 || summary.MemoryFail > 0 {
|
||||
summary.Status = statusCritical
|
||||
} else if len(summary.Warnings) > 0 || summary.StorageWarn > 0 || summary.PCIeWarn > 0 || summary.PSUWarn > 0 || summary.MemoryWarn > 0 {
|
||||
summary.Status = statusWarning
|
||||
}
|
||||
|
||||
if len(summary.Warnings) == 0 {
|
||||
summary.Warnings = nil
|
||||
}
|
||||
if len(summary.Failures) == 0 {
|
||||
summary.Failures = nil
|
||||
}
|
||||
|
||||
return summary
|
||||
}
|
||||
|
||||
func derefString(value *string) string {
|
||||
if value == nil {
|
||||
return ""
|
||||
}
|
||||
return *value
|
||||
}
|
||||
|
||||
func preferredName(model, serial, slot *string) string {
|
||||
switch {
|
||||
case model != nil && *model != "":
|
||||
return *model
|
||||
case serial != nil && *serial != "":
|
||||
return *serial
|
||||
case slot != nil && *slot != "":
|
||||
return *slot
|
||||
default:
|
||||
return "unknown"
|
||||
}
|
||||
}
|
||||
|
||||
func formatStorageSummary(disk schema.HardwareStorage) string {
|
||||
return fmt.Sprintf("storage %s status=%s", preferredName(disk.Model, disk.SerialNumber, disk.Slot), derefString(disk.Status))
|
||||
}
|
||||
|
||||
func formatPCIeSummary(dev schema.HardwarePCIeDevice) string {
|
||||
return fmt.Sprintf("pcie %s status=%s", preferredName(dev.Model, dev.SerialNumber, dev.BDF), derefString(dev.Status))
|
||||
}
|
||||
|
||||
func formatPSUSummary(psu schema.HardwarePowerSupply) string {
|
||||
return fmt.Sprintf("psu %s status=%s", preferredName(psu.Model, psu.SerialNumber, psu.Slot), derefString(psu.Status))
|
||||
}
|
||||
|
||||
func formatMemorySummary(dimm schema.HardwareMemory) string {
|
||||
return fmt.Sprintf("memory %s status=%s", preferredName(dimm.PartNumber, dimm.SerialNumber, dimm.Slot), derefString(dimm.Status))
|
||||
}
|
||||
@@ -31,7 +31,7 @@ md125 : active raid1 nvme2n1[0] nvme3n1[1]
|
||||
func TestHasVROCController(t *testing.T) {
|
||||
intel := vendorIntel
|
||||
model := "Volume Management Device NVMe RAID Controller"
|
||||
class := "RAID bus controller"
|
||||
class := "MassStorageController"
|
||||
tests := []struct {
|
||||
name string
|
||||
pcie []schema.HardwarePCIeDevice
|
||||
|
||||
577
audit/internal/platform/gpu_metrics.go
Normal file
577
audit/internal/platform/gpu_metrics.go
Normal file
@@ -0,0 +1,577 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"fmt"
|
||||
"math"
|
||||
"os"
|
||||
"os/exec"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
// GPUMetricRow is one telemetry sample from nvidia-smi during a stress test.
|
||||
type GPUMetricRow struct {
|
||||
ElapsedSec float64
|
||||
GPUIndex int
|
||||
TempC float64
|
||||
UsagePct float64
|
||||
PowerW float64
|
||||
ClockMHz float64
|
||||
}
|
||||
|
||||
// sampleGPUMetrics runs nvidia-smi once and returns current metrics for each GPU.
|
||||
func sampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
|
||||
args := []string{
|
||||
"--query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics",
|
||||
"--format=csv,noheader,nounits",
|
||||
}
|
||||
if len(gpuIndices) > 0 {
|
||||
ids := make([]string, len(gpuIndices))
|
||||
for i, idx := range gpuIndices {
|
||||
ids[i] = strconv.Itoa(idx)
|
||||
}
|
||||
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
|
||||
}
|
||||
out, err := exec.Command("nvidia-smi", args...).Output()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
var rows []GPUMetricRow
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" {
|
||||
continue
|
||||
}
|
||||
parts := strings.Split(line, ", ")
|
||||
if len(parts) < 5 {
|
||||
continue
|
||||
}
|
||||
idx, _ := strconv.Atoi(strings.TrimSpace(parts[0]))
|
||||
rows = append(rows, GPUMetricRow{
|
||||
GPUIndex: idx,
|
||||
TempC: parseGPUFloat(parts[1]),
|
||||
UsagePct: parseGPUFloat(parts[2]),
|
||||
PowerW: parseGPUFloat(parts[3]),
|
||||
ClockMHz: parseGPUFloat(parts[4]),
|
||||
})
|
||||
}
|
||||
return rows, nil
|
||||
}
|
||||
|
||||
func parseGPUFloat(s string) float64 {
|
||||
s = strings.TrimSpace(s)
|
||||
if s == "N/A" || s == "[Not Supported]" || s == "" {
|
||||
return 0
|
||||
}
|
||||
v, _ := strconv.ParseFloat(s, 64)
|
||||
return v
|
||||
}
|
||||
|
||||
// WriteGPUMetricsCSV writes collected rows as a CSV file.
|
||||
func WriteGPUMetricsCSV(path string, rows []GPUMetricRow) error {
|
||||
var b bytes.Buffer
|
||||
b.WriteString("elapsed_sec,gpu_index,temperature_c,usage_pct,power_w,clock_mhz\n")
|
||||
for _, r := range rows {
|
||||
fmt.Fprintf(&b, "%.1f,%d,%.1f,%.1f,%.1f,%.0f\n",
|
||||
r.ElapsedSec, r.GPUIndex, r.TempC, r.UsagePct, r.PowerW, r.ClockMHz)
|
||||
}
|
||||
return os.WriteFile(path, b.Bytes(), 0644)
|
||||
}
|
||||
|
||||
// WriteGPUMetricsHTML writes a standalone HTML file with one SVG chart per GPU.
|
||||
func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
|
||||
// Group by GPU index preserving order.
|
||||
seen := make(map[int]bool)
|
||||
var order []int
|
||||
gpuMap := make(map[int][]GPUMetricRow)
|
||||
for _, r := range rows {
|
||||
if !seen[r.GPUIndex] {
|
||||
seen[r.GPUIndex] = true
|
||||
order = append(order, r.GPUIndex)
|
||||
}
|
||||
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
|
||||
}
|
||||
|
||||
var svgs strings.Builder
|
||||
for _, gpuIdx := range order {
|
||||
svgs.WriteString(drawGPUChartSVG(gpuMap[gpuIdx], gpuIdx))
|
||||
svgs.WriteString("\n")
|
||||
}
|
||||
|
||||
ts := time.Now().UTC().Format("2006-01-02 15:04:05 UTC")
|
||||
html := fmt.Sprintf(`<!DOCTYPE html>
|
||||
<html><head>
|
||||
<meta charset="utf-8">
|
||||
<title>GPU Stress Test Metrics</title>
|
||||
<style>
|
||||
body { font-family: sans-serif; background: #f0f0f0; margin: 0; padding: 20px; }
|
||||
h1 { text-align: center; color: #333; margin: 0 0 8px; }
|
||||
p { text-align: center; color: #888; font-size: 13px; margin: 0 0 24px; }
|
||||
</style>
|
||||
</head><body>
|
||||
<h1>GPU Stress Test Metrics</h1>
|
||||
<p>Generated %s</p>
|
||||
%s
|
||||
</body></html>`, ts, svgs.String())
|
||||
|
||||
return os.WriteFile(path, []byte(html), 0644)
|
||||
}
|
||||
|
||||
// drawGPUChartSVG generates a self-contained SVG chart for one GPU.
|
||||
func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
|
||||
// Layout
|
||||
const W, H = 960, 520
|
||||
const plotX1 = 120 // usage axis / chart left border
|
||||
const plotX2 = 840 // power axis / chart right border
|
||||
const plotY1 = 70 // top
|
||||
const plotY2 = 465 // bottom (PH = 395)
|
||||
const PW = plotX2 - plotX1
|
||||
const PH = plotY2 - plotY1
|
||||
// Outer axes
|
||||
const tempAxisX = 60 // temp axis line
|
||||
const clockAxisX = 900 // clock axis line
|
||||
|
||||
colors := [4]string{"#e74c3c", "#3498db", "#2ecc71", "#f39c12"}
|
||||
seriesLabel := [4]string{
|
||||
fmt.Sprintf("GPU %d Temp (°C)", gpuIdx),
|
||||
fmt.Sprintf("GPU %d Usage (%%)", gpuIdx),
|
||||
fmt.Sprintf("GPU %d Power (W)", gpuIdx),
|
||||
fmt.Sprintf("GPU %d Clock (MHz)", gpuIdx),
|
||||
}
|
||||
axisLabel := [4]string{"Temperature (°C)", "GPU Usage (%)", "Power (W)", "Clock (MHz)"}
|
||||
|
||||
// Extract series
|
||||
t := make([]float64, len(rows))
|
||||
vals := [4][]float64{}
|
||||
for i := range vals {
|
||||
vals[i] = make([]float64, len(rows))
|
||||
}
|
||||
for i, r := range rows {
|
||||
t[i] = r.ElapsedSec
|
||||
vals[0][i] = r.TempC
|
||||
vals[1][i] = r.UsagePct
|
||||
vals[2][i] = r.PowerW
|
||||
vals[3][i] = r.ClockMHz
|
||||
}
|
||||
|
||||
tMin, tMax := gpuMinMax(t)
|
||||
type axisScale struct {
|
||||
ticks []float64
|
||||
min, max float64
|
||||
}
|
||||
var axes [4]axisScale
|
||||
for i := 0; i < 4; i++ {
|
||||
mn, mx := gpuMinMax(vals[i])
|
||||
tks := gpuNiceTicks(mn, mx, 8)
|
||||
axes[i] = axisScale{ticks: tks, min: tks[0], max: tks[len(tks)-1]}
|
||||
}
|
||||
|
||||
xv := func(tv float64) float64 {
|
||||
if tMax == tMin {
|
||||
return float64(plotX1)
|
||||
}
|
||||
return float64(plotX1) + (tv-tMin)/(tMax-tMin)*float64(PW)
|
||||
}
|
||||
yv := func(v float64, ai int) float64 {
|
||||
a := axes[ai]
|
||||
if a.max == a.min {
|
||||
return float64(plotY1 + PH/2)
|
||||
}
|
||||
return float64(plotY2) - (v-a.min)/(a.max-a.min)*float64(PH)
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
|
||||
fmt.Fprintf(&b, `<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d"`+
|
||||
` style="background:#fff;border-radius:8px;display:block;margin:0 auto 24px;`+
|
||||
`box-shadow:0 2px 12px rgba(0,0,0,.12)">`+"\n", W, H)
|
||||
|
||||
// Title
|
||||
fmt.Fprintf(&b, `<text x="%d" y="22" text-anchor="middle" font-family="sans-serif"`+
|
||||
` font-size="14" font-weight="bold" fill="#333">GPU Stress Test Metrics — GPU %d</text>`+"\n",
|
||||
plotX1+PW/2, gpuIdx)
|
||||
|
||||
// Horizontal grid (align to temp axis ticks)
|
||||
b.WriteString(`<g stroke="#e0e0e0" stroke-width="0.5">` + "\n")
|
||||
for _, tick := range axes[0].ticks {
|
||||
y := yv(tick, 0)
|
||||
if y < float64(plotY1) || y > float64(plotY2) {
|
||||
continue
|
||||
}
|
||||
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"/>`+"\n",
|
||||
plotX1, y, plotX2, y)
|
||||
}
|
||||
// Vertical grid
|
||||
xTicks := gpuNiceTicks(tMin, tMax, 10)
|
||||
for _, tv := range xTicks {
|
||||
x := xv(tv)
|
||||
if x < float64(plotX1) || x > float64(plotX2) {
|
||||
continue
|
||||
}
|
||||
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d"/>`+"\n",
|
||||
x, plotY1, x, plotY2)
|
||||
}
|
||||
b.WriteString("</g>\n")
|
||||
|
||||
// Chart border
|
||||
fmt.Fprintf(&b, `<rect x="%d" y="%d" width="%d" height="%d"`+
|
||||
` fill="none" stroke="#333" stroke-width="1"/>`+"\n",
|
||||
plotX1, plotY1, PW, PH)
|
||||
|
||||
// X axis ticks and labels
|
||||
b.WriteString(`<g font-family="sans-serif" font-size="11" fill="#333" text-anchor="middle">` + "\n")
|
||||
for _, tv := range xTicks {
|
||||
x := xv(tv)
|
||||
if x < float64(plotX1) || x > float64(plotX2) {
|
||||
continue
|
||||
}
|
||||
fmt.Fprintf(&b, `<text x="%.1f" y="%d">%s</text>`+"\n", x, plotY2+18, gpuFormatTick(tv))
|
||||
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d" stroke="#333" stroke-width="1"/>`+"\n",
|
||||
x, plotY2, x, plotY2+4)
|
||||
}
|
||||
b.WriteString("</g>\n")
|
||||
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="13"`+
|
||||
` fill="#333" text-anchor="middle">Time (seconds)</text>`+"\n",
|
||||
plotX1+PW/2, plotY2+38)
|
||||
|
||||
// Y axes: [tempAxisX, plotX1, plotX2, clockAxisX]
|
||||
axisLineX := [4]int{tempAxisX, plotX1, plotX2, clockAxisX}
|
||||
axisRight := [4]bool{false, false, true, true}
|
||||
// Label x positions (for rotated vertical text)
|
||||
axisLabelX := [4]int{10, 68, 868, 950}
|
||||
|
||||
for i := 0; i < 4; i++ {
|
||||
ax := axisLineX[i]
|
||||
right := axisRight[i]
|
||||
color := colors[i]
|
||||
|
||||
// Axis line
|
||||
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
|
||||
` stroke="%s" stroke-width="1"/>`+"\n",
|
||||
ax, plotY1, ax, plotY2, color)
|
||||
|
||||
// Ticks and tick labels
|
||||
fmt.Fprintf(&b, `<g font-family="sans-serif" font-size="10" fill="%s">`+"\n", color)
|
||||
for _, tick := range axes[i].ticks {
|
||||
y := yv(tick, i)
|
||||
if y < float64(plotY1) || y > float64(plotY2) {
|
||||
continue
|
||||
}
|
||||
dx := -5
|
||||
textX := ax - 8
|
||||
anchor := "end"
|
||||
if right {
|
||||
dx = 5
|
||||
textX = ax + 8
|
||||
anchor = "start"
|
||||
}
|
||||
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"`+
|
||||
` stroke="%s" stroke-width="1"/>`+"\n",
|
||||
ax, y, ax+dx, y, color)
|
||||
fmt.Fprintf(&b, `<text x="%d" y="%.1f" text-anchor="%s" dy="4">%s</text>`+"\n",
|
||||
textX, y, anchor, gpuFormatTick(tick))
|
||||
}
|
||||
b.WriteString("</g>\n")
|
||||
|
||||
// Axis label (rotated)
|
||||
lx := axisLabelX[i]
|
||||
fmt.Fprintf(&b, `<text transform="translate(%d,%d) rotate(-90)"`+
|
||||
` font-family="sans-serif" font-size="12" fill="%s" text-anchor="middle">%s</text>`+"\n",
|
||||
lx, plotY1+PH/2, color, axisLabel[i])
|
||||
}
|
||||
|
||||
// Data lines
|
||||
for i := 0; i < 4; i++ {
|
||||
var pts strings.Builder
|
||||
for j := range rows {
|
||||
x := xv(t[j])
|
||||
y := yv(vals[i][j], i)
|
||||
if j == 0 {
|
||||
fmt.Fprintf(&pts, "%.1f,%.1f", x, y)
|
||||
} else {
|
||||
fmt.Fprintf(&pts, " %.1f,%.1f", x, y)
|
||||
}
|
||||
}
|
||||
fmt.Fprintf(&b, `<polyline points="%s" fill="none" stroke="%s" stroke-width="1.5"/>`+"\n",
|
||||
pts.String(), colors[i])
|
||||
}
|
||||
|
||||
// Legend
|
||||
const legendY = 42
|
||||
for i := 0; i < 4; i++ {
|
||||
lx := plotX1 + i*(PW/4) + 10
|
||||
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
|
||||
` stroke="%s" stroke-width="2"/>`+"\n",
|
||||
lx, legendY, lx+20, legendY, colors[i])
|
||||
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="12" fill="#333">%s</text>`+"\n",
|
||||
lx+25, legendY+4, seriesLabel[i])
|
||||
}
|
||||
|
||||
b.WriteString("</svg>\n")
|
||||
return b.String()
|
||||
}
|
||||
|
||||
const (
|
||||
ansiRed = "\033[31m"
|
||||
ansiBlue = "\033[34m"
|
||||
ansiGreen = "\033[32m"
|
||||
ansiYellow = "\033[33m"
|
||||
ansiReset = "\033[0m"
|
||||
)
|
||||
|
||||
const (
|
||||
termChartWidth = 70
|
||||
termChartHeight = 12
|
||||
)
|
||||
|
||||
// RenderGPUTerminalChart returns ANSI line charts (asciigraph-style) per GPU.
|
||||
// Suitable for display in the TUI screenOutput.
|
||||
func RenderGPUTerminalChart(rows []GPUMetricRow) string {
|
||||
seen := make(map[int]bool)
|
||||
var order []int
|
||||
gpuMap := make(map[int][]GPUMetricRow)
|
||||
for _, r := range rows {
|
||||
if !seen[r.GPUIndex] {
|
||||
seen[r.GPUIndex] = true
|
||||
order = append(order, r.GPUIndex)
|
||||
}
|
||||
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
|
||||
}
|
||||
|
||||
type seriesDef struct {
|
||||
caption string
|
||||
color string
|
||||
fn func(GPUMetricRow) float64
|
||||
}
|
||||
defs := []seriesDef{
|
||||
{"Temperature (°C)", ansiRed, func(r GPUMetricRow) float64 { return r.TempC }},
|
||||
{"GPU Usage (%)", ansiBlue, func(r GPUMetricRow) float64 { return r.UsagePct }},
|
||||
{"Power (W)", ansiGreen, func(r GPUMetricRow) float64 { return r.PowerW }},
|
||||
{"Clock (MHz)", ansiYellow, func(r GPUMetricRow) float64 { return r.ClockMHz }},
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
for _, gpuIdx := range order {
|
||||
gr := gpuMap[gpuIdx]
|
||||
if len(gr) == 0 {
|
||||
continue
|
||||
}
|
||||
tMax := gr[len(gr)-1].ElapsedSec - gr[0].ElapsedSec
|
||||
fmt.Fprintf(&b, "GPU %d — Stress Test Metrics (%.0f seconds)\n\n", gpuIdx, tMax)
|
||||
for _, d := range defs {
|
||||
b.WriteString(renderLineChart(extractGPUField(gr, d.fn), d.color, d.caption,
|
||||
termChartHeight, termChartWidth))
|
||||
b.WriteRune('\n')
|
||||
}
|
||||
}
|
||||
|
||||
return strings.TrimRight(b.String(), "\n")
|
||||
}
|
||||
|
||||
// renderLineChart draws a single time-series line chart using box-drawing characters.
|
||||
// Produces output in the style of asciigraph: ╭─╮ │ ╰─╯ with a Y axis and caption.
|
||||
func renderLineChart(vals []float64, color, caption string, height, width int) string {
|
||||
if len(vals) == 0 {
|
||||
return caption + "\n"
|
||||
}
|
||||
|
||||
mn, mx := gpuMinMax(vals)
|
||||
if mn == mx {
|
||||
mx = mn + 1
|
||||
}
|
||||
|
||||
// Use the smaller of width or len(vals) to avoid stretching sparse data.
|
||||
w := width
|
||||
if len(vals) < w {
|
||||
w = len(vals)
|
||||
}
|
||||
data := gpuDownsample(vals, w)
|
||||
|
||||
// row[i] = display row index: 0 = top = max value, height = bottom = min value.
|
||||
row := make([]int, w)
|
||||
for i, v := range data {
|
||||
r := int(math.Round((mx - v) / (mx - mn) * float64(height)))
|
||||
if r < 0 {
|
||||
r = 0
|
||||
}
|
||||
if r > height {
|
||||
r = height
|
||||
}
|
||||
row[i] = r
|
||||
}
|
||||
|
||||
// Fill the character grid.
|
||||
grid := make([][]rune, height+1)
|
||||
for i := range grid {
|
||||
grid[i] = make([]rune, w)
|
||||
for j := range grid[i] {
|
||||
grid[i][j] = ' '
|
||||
}
|
||||
}
|
||||
for x := 0; x < w; x++ {
|
||||
r := row[x]
|
||||
if x == 0 {
|
||||
grid[r][0] = '─'
|
||||
continue
|
||||
}
|
||||
p := row[x-1]
|
||||
switch {
|
||||
case r == p:
|
||||
grid[r][x] = '─'
|
||||
case r < p: // value went up (row index decreased toward top)
|
||||
grid[r][x] = '╭'
|
||||
grid[p][x] = '╯'
|
||||
for y := r + 1; y < p; y++ {
|
||||
grid[y][x] = '│'
|
||||
}
|
||||
default: // r > p, value went down
|
||||
grid[p][x] = '╮'
|
||||
grid[r][x] = '╰'
|
||||
for y := p + 1; y < r; y++ {
|
||||
grid[y][x] = '│'
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Y axis tick labels.
|
||||
ticks := gpuNiceTicks(mn, mx, height/2)
|
||||
tickAtRow := make(map[int]string)
|
||||
labelWidth := 4
|
||||
for _, t := range ticks {
|
||||
r := int(math.Round((mx - t) / (mx - mn) * float64(height)))
|
||||
if r < 0 || r > height {
|
||||
continue
|
||||
}
|
||||
s := gpuFormatTick(t)
|
||||
tickAtRow[r] = s
|
||||
if len(s) > labelWidth {
|
||||
labelWidth = len(s)
|
||||
}
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
for r := 0; r <= height; r++ {
|
||||
label := tickAtRow[r]
|
||||
fmt.Fprintf(&b, "%*s", labelWidth, label)
|
||||
switch {
|
||||
case label != "":
|
||||
b.WriteRune('┤')
|
||||
case r == height:
|
||||
b.WriteRune('┼')
|
||||
default:
|
||||
b.WriteRune('│')
|
||||
}
|
||||
b.WriteString(color)
|
||||
b.WriteString(string(grid[r]))
|
||||
b.WriteString(ansiReset)
|
||||
b.WriteRune('\n')
|
||||
}
|
||||
|
||||
// Bottom axis.
|
||||
b.WriteString(strings.Repeat(" ", labelWidth))
|
||||
b.WriteRune('└')
|
||||
b.WriteString(strings.Repeat("─", w))
|
||||
b.WriteRune('\n')
|
||||
|
||||
// Caption centered under the chart.
|
||||
if caption != "" {
|
||||
total := labelWidth + 1 + w
|
||||
if pad := (total - len(caption)) / 2; pad > 0 {
|
||||
b.WriteString(strings.Repeat(" ", pad))
|
||||
}
|
||||
b.WriteString(caption)
|
||||
b.WriteRune('\n')
|
||||
}
|
||||
|
||||
return b.String()
|
||||
}
|
||||
|
||||
func extractGPUField(rows []GPUMetricRow, fn func(GPUMetricRow) float64) []float64 {
|
||||
v := make([]float64, len(rows))
|
||||
for i, r := range rows {
|
||||
v[i] = fn(r)
|
||||
}
|
||||
return v
|
||||
}
|
||||
|
||||
// gpuDownsample averages vals into w buckets (or nearest-neighbor upsamples if len(vals) < w).
|
||||
func gpuDownsample(vals []float64, w int) []float64 {
|
||||
n := len(vals)
|
||||
if n == 0 {
|
||||
return make([]float64, w)
|
||||
}
|
||||
result := make([]float64, w)
|
||||
if n >= w {
|
||||
counts := make([]int, w)
|
||||
for i, v := range vals {
|
||||
bucket := i * w / n
|
||||
if bucket >= w {
|
||||
bucket = w - 1
|
||||
}
|
||||
result[bucket] += v
|
||||
counts[bucket]++
|
||||
}
|
||||
for i := range result {
|
||||
if counts[i] > 0 {
|
||||
result[i] /= float64(counts[i])
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// Nearest-neighbour upsample.
|
||||
for i := range result {
|
||||
src := i * (n - 1) / (w - 1)
|
||||
if src >= n {
|
||||
src = n - 1
|
||||
}
|
||||
result[i] = vals[src]
|
||||
}
|
||||
}
|
||||
return result
|
||||
}
|
||||
|
||||
func gpuMinMax(vals []float64) (float64, float64) {
|
||||
if len(vals) == 0 {
|
||||
return 0, 1
|
||||
}
|
||||
mn, mx := vals[0], vals[0]
|
||||
for _, v := range vals[1:] {
|
||||
if v < mn {
|
||||
mn = v
|
||||
}
|
||||
if v > mx {
|
||||
mx = v
|
||||
}
|
||||
}
|
||||
return mn, mx
|
||||
}
|
||||
|
||||
func gpuNiceTicks(mn, mx float64, targetCount int) []float64 {
|
||||
if mn == mx {
|
||||
mn -= 1
|
||||
mx += 1
|
||||
}
|
||||
r := mx - mn
|
||||
step := math.Pow(10, math.Floor(math.Log10(r/float64(targetCount))))
|
||||
for _, f := range []float64{1, 2, 5, 10} {
|
||||
if r/(f*step) <= float64(targetCount)*1.5 {
|
||||
step = f * step
|
||||
break
|
||||
}
|
||||
}
|
||||
lo := math.Floor(mn/step) * step
|
||||
hi := math.Ceil(mx/step) * step
|
||||
var ticks []float64
|
||||
for v := lo; v <= hi+step*0.001; v += step {
|
||||
ticks = append(ticks, math.Round(v*1e9)/1e9)
|
||||
}
|
||||
return ticks
|
||||
}
|
||||
|
||||
func gpuFormatTick(v float64) string {
|
||||
if v == math.Trunc(v) {
|
||||
return strconv.Itoa(int(v))
|
||||
}
|
||||
return strconv.FormatFloat(v, 'f', 1, 64)
|
||||
}
|
||||
164
audit/internal/platform/runtime.go
Normal file
164
audit/internal/platform/runtime.go
Normal file
@@ -0,0 +1,164 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"os"
|
||||
"os/exec"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
var runtimeRequiredTools = []string{
|
||||
"dmidecode",
|
||||
"lspci",
|
||||
"lsblk",
|
||||
"smartctl",
|
||||
"nvme",
|
||||
"ipmitool",
|
||||
"nvidia-smi",
|
||||
"nvidia-bug-report.sh",
|
||||
"bee-gpu-stress",
|
||||
"dhclient",
|
||||
"mount",
|
||||
}
|
||||
|
||||
var runtimeTrackedServices = []string{
|
||||
"bee-network",
|
||||
"bee-nvidia",
|
||||
"bee-preflight",
|
||||
"bee-audit",
|
||||
"bee-web",
|
||||
"bee-sshsetup",
|
||||
}
|
||||
|
||||
func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
|
||||
checkedAt := time.Now().UTC().Format(time.RFC3339)
|
||||
health := schema.RuntimeHealth{
|
||||
Status: "OK",
|
||||
CheckedAt: checkedAt,
|
||||
ExportDir: strings.TrimSpace(exportDir),
|
||||
}
|
||||
|
||||
if health.ExportDir != "" {
|
||||
if err := os.MkdirAll(health.ExportDir, 0755); err != nil {
|
||||
health.Status = "FAILED"
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "export_dir_unavailable",
|
||||
Severity: "critical",
|
||||
Description: err.Error(),
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
interfaces, err := s.ListInterfaces()
|
||||
if err == nil {
|
||||
health.Interfaces = make([]schema.RuntimeInterface, 0, len(interfaces))
|
||||
hasIPv4 := false
|
||||
missingIPv4 := false
|
||||
for _, iface := range interfaces {
|
||||
outcome := "no_offer"
|
||||
if len(iface.IPv4) > 0 {
|
||||
outcome = "lease_acquired"
|
||||
hasIPv4 = true
|
||||
} else if strings.EqualFold(iface.State, "DOWN") {
|
||||
outcome = "link_down"
|
||||
} else {
|
||||
missingIPv4 = true
|
||||
}
|
||||
health.Interfaces = append(health.Interfaces, schema.RuntimeInterface{
|
||||
Name: iface.Name,
|
||||
State: iface.State,
|
||||
IPv4: iface.IPv4,
|
||||
Outcome: outcome,
|
||||
})
|
||||
}
|
||||
switch {
|
||||
case hasIPv4 && !missingIPv4:
|
||||
health.NetworkStatus = "OK"
|
||||
case hasIPv4:
|
||||
health.NetworkStatus = "PARTIAL"
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "dhcp_partial",
|
||||
Severity: "warning",
|
||||
Description: "At least one interface did not obtain IPv4 connectivity.",
|
||||
})
|
||||
default:
|
||||
health.NetworkStatus = "FAILED"
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "dhcp_failed",
|
||||
Severity: "warning",
|
||||
Description: "No physical interface obtained IPv4 connectivity.",
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
for _, tool := range s.CheckTools(runtimeRequiredTools) {
|
||||
health.Tools = append(health.Tools, schema.RuntimeToolStatus{
|
||||
Name: tool.Name,
|
||||
Path: tool.Path,
|
||||
OK: tool.OK,
|
||||
})
|
||||
if !tool.OK {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "tool_missing",
|
||||
Severity: "warning",
|
||||
Description: "Required tool missing: " + tool.Name,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
for _, name := range runtimeTrackedServices {
|
||||
health.Services = append(health.Services, schema.RuntimeServiceStatus{
|
||||
Name: name,
|
||||
Status: s.ServiceState(name),
|
||||
})
|
||||
}
|
||||
|
||||
lsmodText := commandText("lsmod")
|
||||
health.DriverReady = strings.Contains(lsmodText, "nvidia ")
|
||||
if !health.DriverReady {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "nvidia_kernel_module_missing",
|
||||
Severity: "warning",
|
||||
Description: "NVIDIA kernel module is not loaded.",
|
||||
})
|
||||
}
|
||||
if health.DriverReady && !strings.Contains(lsmodText, "nvidia_modeset") {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "nvidia_modeset_failed",
|
||||
Severity: "warning",
|
||||
Description: "nvidia-modeset is not loaded; display/CUDA stack may be partial.",
|
||||
})
|
||||
}
|
||||
if out, err := exec.Command("nvidia-smi", "-L").CombinedOutput(); err == nil && strings.TrimSpace(string(out)) != "" {
|
||||
health.DriverReady = true
|
||||
}
|
||||
|
||||
health.CUDAReady = false
|
||||
if lookErr := exec.Command("sh", "-c", "command -v bee-gpu-stress >/dev/null 2>&1").Run(); lookErr == nil {
|
||||
out, err := exec.Command("bee-gpu-stress", "--seconds", "1", "--size-mb", "1").CombinedOutput()
|
||||
if err == nil {
|
||||
health.CUDAReady = true
|
||||
} else if strings.Contains(strings.ToLower(string(out)), "cuda_error_system_not_ready") {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "cuda_runtime_not_ready",
|
||||
Severity: "warning",
|
||||
Description: "CUDA runtime is not ready for GPU SAT.",
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
if health.Status != "FAILED" && len(health.Issues) > 0 {
|
||||
health.Status = "PARTIAL"
|
||||
}
|
||||
return health, nil
|
||||
}
|
||||
|
||||
func commandText(name string, args ...string) string {
|
||||
raw, err := exec.Command(name, args...).CombinedOutput()
|
||||
if err != nil && len(raw) == 0 {
|
||||
return ""
|
||||
}
|
||||
return string(raw)
|
||||
}
|
||||
@@ -3,60 +3,477 @@ package platform
|
||||
import (
|
||||
"archive/tar"
|
||||
"compress/gzip"
|
||||
"context"
|
||||
"fmt"
|
||||
"io"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
// NvidiaGPU holds basic GPU info from nvidia-smi.
|
||||
type NvidiaGPU struct {
|
||||
Index int
|
||||
Name string
|
||||
MemoryMB int
|
||||
}
|
||||
|
||||
// ListNvidiaGPUs returns GPUs visible to nvidia-smi.
|
||||
func (s *System) ListNvidiaGPUs() ([]NvidiaGPU, error) {
|
||||
out, err := exec.Command("nvidia-smi",
|
||||
"--query-gpu=index,name,memory.total",
|
||||
"--format=csv,noheader,nounits").Output()
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("nvidia-smi: %w", err)
|
||||
}
|
||||
var gpus []NvidiaGPU
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" {
|
||||
continue
|
||||
}
|
||||
parts := strings.SplitN(line, ", ", 3)
|
||||
if len(parts) != 3 {
|
||||
continue
|
||||
}
|
||||
idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
memMB, _ := strconv.Atoi(strings.TrimSpace(parts[2]))
|
||||
gpus = append(gpus, NvidiaGPU{
|
||||
Index: idx,
|
||||
Name: strings.TrimSpace(parts[1]),
|
||||
MemoryMB: memMB,
|
||||
})
|
||||
}
|
||||
return gpus, nil
|
||||
}
|
||||
|
||||
func (s *System) RunNvidiaAcceptancePack(baseDir string) (string, error) {
|
||||
return runAcceptancePack(baseDir, "gpu-nvidia", nvidiaSATJobs())
|
||||
}
|
||||
|
||||
// RunNvidiaAcceptancePackWithOptions runs the NVIDIA SAT with explicit duration,
|
||||
// GPU memory size, and GPU index selection. ctx cancellation kills the running job.
|
||||
func (s *System) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, durationSec int, sizeMB int, gpuIndices []int) (string, error) {
|
||||
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia", nvidiaSATJobsWithOptions(durationSec, sizeMB, gpuIndices))
|
||||
}
|
||||
|
||||
func (s *System) RunMemoryAcceptancePack(baseDir string) (string, error) {
|
||||
sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128)
|
||||
passes := envInt("BEE_MEMTESTER_PASSES", 1)
|
||||
return runAcceptancePack(baseDir, "memory", []satJob{
|
||||
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
|
||||
{name: "02-memtester.log", cmd: []string{"memtester", fmt.Sprintf("%dM", sizeMB), fmt.Sprintf("%d", passes)}},
|
||||
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
|
||||
})
|
||||
}
|
||||
|
||||
func (s *System) RunStorageAcceptancePack(baseDir string) (string, error) {
|
||||
if baseDir == "" {
|
||||
baseDir = "/var/log/bee-sat"
|
||||
}
|
||||
ts := time.Now().UTC().Format("20060102-150405")
|
||||
runDir := filepath.Join(baseDir, "gpu-nvidia-"+ts)
|
||||
runDir := filepath.Join(baseDir, "storage-"+ts)
|
||||
if err := os.MkdirAll(runDir, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
verboseLog := filepath.Join(runDir, "verbose.log")
|
||||
|
||||
type job struct {
|
||||
name string
|
||||
cmd []string
|
||||
}
|
||||
jobs := []job{
|
||||
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
|
||||
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
|
||||
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
|
||||
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output", filepath.Join(runDir, "nvidia-bug-report.log")}},
|
||||
devices, err := listStorageDevices()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
sort.Strings(devices)
|
||||
|
||||
var summary strings.Builder
|
||||
stats := satStats{}
|
||||
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
|
||||
for _, job := range jobs {
|
||||
out, err := exec.Command(job.cmd[0], job.cmd[1:]...).CombinedOutput()
|
||||
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
|
||||
return "", writeErr
|
||||
}
|
||||
rc := 0
|
||||
if err != nil {
|
||||
rc = 1
|
||||
}
|
||||
fmt.Fprintf(&summary, "%s_rc=%d\n", strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log"), rc)
|
||||
if len(devices) == 0 {
|
||||
fmt.Fprintln(&summary, "devices=0")
|
||||
stats.Unsupported++
|
||||
} else {
|
||||
fmt.Fprintf(&summary, "devices=%d\n", len(devices))
|
||||
}
|
||||
|
||||
for index, devPath := range devices {
|
||||
prefix := fmt.Sprintf("%02d-%s", index+1, filepath.Base(devPath))
|
||||
commands := storageSATCommands(devPath)
|
||||
for cmdIndex, job := range commands {
|
||||
name := fmt.Sprintf("%s-%02d-%s.log", prefix, cmdIndex+1, job.name)
|
||||
out, err := runSATCommand(verboseLog, job.name, job.cmd)
|
||||
if writeErr := os.WriteFile(filepath.Join(runDir, name), out, 0644); writeErr != nil {
|
||||
return "", writeErr
|
||||
}
|
||||
status, rc := classifySATResult(job.name, out, err)
|
||||
stats.Add(status)
|
||||
key := filepath.Base(devPath) + "_" + strings.ReplaceAll(job.name, "-", "_")
|
||||
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
|
||||
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
|
||||
}
|
||||
}
|
||||
|
||||
writeSATStats(&summary, stats)
|
||||
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
archive := filepath.Join(baseDir, "gpu-nvidia-"+ts+".tar.gz")
|
||||
archive := filepath.Join(baseDir, "storage-"+ts+".tar.gz")
|
||||
if err := createTarGz(archive, runDir); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return archive, nil
|
||||
}
|
||||
|
||||
type satJob struct {
|
||||
name string
|
||||
cmd []string
|
||||
env []string // extra env vars (appended to os.Environ)
|
||||
collectGPU bool // collect GPU metrics via nvidia-smi while this job runs
|
||||
gpuIndices []int // GPU indices to collect metrics for (empty = all)
|
||||
}
|
||||
|
||||
type satStats struct {
|
||||
OK int
|
||||
Failed int
|
||||
Unsupported int
|
||||
}
|
||||
|
||||
func nvidiaSATJobs() []satJob {
|
||||
seconds := envInt("BEE_GPU_STRESS_SECONDS", 5)
|
||||
sizeMB := envInt("BEE_GPU_STRESS_SIZE_MB", 64)
|
||||
return []satJob{
|
||||
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
|
||||
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
|
||||
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
|
||||
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
|
||||
{name: "05-bee-gpu-stress.log", cmd: []string{"bee-gpu-stress", "--seconds", fmt.Sprintf("%d", seconds), "--size-mb", fmt.Sprintf("%d", sizeMB)}},
|
||||
}
|
||||
}
|
||||
|
||||
func runAcceptancePack(baseDir, prefix string, jobs []satJob) (string, error) {
|
||||
if baseDir == "" {
|
||||
baseDir = "/var/log/bee-sat"
|
||||
}
|
||||
ts := time.Now().UTC().Format("20060102-150405")
|
||||
runDir := filepath.Join(baseDir, prefix+"-"+ts)
|
||||
if err := os.MkdirAll(runDir, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
verboseLog := filepath.Join(runDir, "verbose.log")
|
||||
|
||||
var summary strings.Builder
|
||||
stats := satStats{}
|
||||
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
|
||||
for _, job := range jobs {
|
||||
cmd := make([]string, 0, len(job.cmd))
|
||||
for _, arg := range job.cmd {
|
||||
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
|
||||
}
|
||||
out, err := runSATCommand(verboseLog, job.name, cmd)
|
||||
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
|
||||
return "", writeErr
|
||||
}
|
||||
status, rc := classifySATResult(job.name, out, err)
|
||||
stats.Add(status)
|
||||
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
|
||||
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
|
||||
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
|
||||
}
|
||||
writeSATStats(&summary, stats)
|
||||
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
|
||||
if err := createTarGz(archive, runDir); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return archive, nil
|
||||
}
|
||||
|
||||
func nvidiaSATJobsWithOptions(durationSec, sizeMB int, gpuIndices []int) []satJob {
|
||||
var env []string
|
||||
if len(gpuIndices) > 0 {
|
||||
ids := make([]string, len(gpuIndices))
|
||||
for i, idx := range gpuIndices {
|
||||
ids[i] = strconv.Itoa(idx)
|
||||
}
|
||||
env = []string{"CUDA_VISIBLE_DEVICES=" + strings.Join(ids, ",")}
|
||||
}
|
||||
return []satJob{
|
||||
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
|
||||
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
|
||||
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
|
||||
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
|
||||
{
|
||||
name: "05-bee-gpu-stress.log",
|
||||
cmd: []string{"bee-gpu-stress", "--seconds", strconv.Itoa(durationSec), "--size-mb", strconv.Itoa(sizeMB)},
|
||||
env: env,
|
||||
collectGPU: true,
|
||||
gpuIndices: gpuIndices,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob) (string, error) {
|
||||
if baseDir == "" {
|
||||
baseDir = "/var/log/bee-sat"
|
||||
}
|
||||
ts := time.Now().UTC().Format("20060102-150405")
|
||||
runDir := filepath.Join(baseDir, prefix+"-"+ts)
|
||||
if err := os.MkdirAll(runDir, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
verboseLog := filepath.Join(runDir, "verbose.log")
|
||||
|
||||
var summary strings.Builder
|
||||
stats := satStats{}
|
||||
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
|
||||
for _, job := range jobs {
|
||||
if ctx.Err() != nil {
|
||||
break
|
||||
}
|
||||
cmd := make([]string, 0, len(job.cmd))
|
||||
for _, arg := range job.cmd {
|
||||
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
|
||||
}
|
||||
|
||||
var out []byte
|
||||
var err error
|
||||
|
||||
if job.collectGPU {
|
||||
out, err = runSATCommandWithMetrics(ctx, verboseLog, job.name, cmd, job.env, job.gpuIndices, runDir)
|
||||
} else {
|
||||
out, err = runSATCommandCtx(ctx, verboseLog, job.name, cmd, job.env)
|
||||
}
|
||||
|
||||
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
|
||||
return "", writeErr
|
||||
}
|
||||
status, rc := classifySATResult(job.name, out, err)
|
||||
stats.Add(status)
|
||||
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
|
||||
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
|
||||
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
|
||||
}
|
||||
writeSATStats(&summary, stats)
|
||||
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
|
||||
if err := createTarGz(archive, runDir); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return archive, nil
|
||||
}
|
||||
|
||||
func runSATCommandCtx(ctx context.Context, verboseLog, name string, cmd []string, env []string) ([]byte, error) {
|
||||
start := time.Now().UTC()
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
|
||||
"cmd: "+strings.Join(cmd, " "),
|
||||
)
|
||||
|
||||
c := exec.CommandContext(ctx, cmd[0], cmd[1:]...)
|
||||
if len(env) > 0 {
|
||||
c.Env = append(os.Environ(), env...)
|
||||
}
|
||||
out, err := c.CombinedOutput()
|
||||
|
||||
rc := 0
|
||||
if err != nil {
|
||||
rc = 1
|
||||
}
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
|
||||
fmt.Sprintf("rc: %d", rc),
|
||||
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
|
||||
"",
|
||||
)
|
||||
return out, err
|
||||
}
|
||||
|
||||
func listStorageDevices() ([]string, error) {
|
||||
out, err := exec.Command("lsblk", "-dn", "-o", "NAME,TYPE").Output()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
var devices []string
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
fields := strings.Fields(strings.TrimSpace(line))
|
||||
if len(fields) != 2 || fields[1] != "disk" {
|
||||
continue
|
||||
}
|
||||
devices = append(devices, "/dev/"+fields[0])
|
||||
}
|
||||
return devices, nil
|
||||
}
|
||||
|
||||
func storageSATCommands(devPath string) []satJob {
|
||||
if strings.Contains(filepath.Base(devPath), "nvme") {
|
||||
return []satJob{
|
||||
{name: "nvme-id-ctrl", cmd: []string{"nvme", "id-ctrl", devPath, "-o", "json"}},
|
||||
{name: "nvme-smart-log", cmd: []string{"nvme", "smart-log", devPath, "-o", "json"}},
|
||||
{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "--start", "1"}},
|
||||
}
|
||||
}
|
||||
return []satJob{
|
||||
{name: "smartctl-health", cmd: []string{"smartctl", "-H", "-A", devPath}},
|
||||
{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", "short", devPath}},
|
||||
}
|
||||
}
|
||||
|
||||
func (s *satStats) Add(status string) {
|
||||
switch status {
|
||||
case "OK":
|
||||
s.OK++
|
||||
case "UNSUPPORTED":
|
||||
s.Unsupported++
|
||||
default:
|
||||
s.Failed++
|
||||
}
|
||||
}
|
||||
|
||||
func (s satStats) Overall() string {
|
||||
if s.Failed > 0 {
|
||||
return "FAILED"
|
||||
}
|
||||
if s.Unsupported > 0 {
|
||||
return "PARTIAL"
|
||||
}
|
||||
return "OK"
|
||||
}
|
||||
|
||||
func writeSATStats(summary *strings.Builder, stats satStats) {
|
||||
fmt.Fprintf(summary, "overall_status=%s\n", stats.Overall())
|
||||
fmt.Fprintf(summary, "job_ok=%d\n", stats.OK)
|
||||
fmt.Fprintf(summary, "job_failed=%d\n", stats.Failed)
|
||||
fmt.Fprintf(summary, "job_unsupported=%d\n", stats.Unsupported)
|
||||
}
|
||||
|
||||
func classifySATResult(name string, out []byte, err error) (string, int) {
|
||||
rc := 0
|
||||
if err != nil {
|
||||
rc = 1
|
||||
}
|
||||
if err == nil {
|
||||
return "OK", rc
|
||||
}
|
||||
|
||||
text := strings.ToLower(string(out))
|
||||
if strings.Contains(text, "unsupported") ||
|
||||
strings.Contains(text, "not supported") ||
|
||||
strings.Contains(text, "invalid opcode") ||
|
||||
strings.Contains(text, "unknown command") ||
|
||||
strings.Contains(text, "not implemented") ||
|
||||
strings.Contains(text, "not available") ||
|
||||
strings.Contains(text, "cuda_error_system_not_ready") ||
|
||||
strings.Contains(text, "no such device") ||
|
||||
(strings.Contains(name, "self-test") && strings.Contains(text, "aborted")) {
|
||||
return "UNSUPPORTED", rc
|
||||
}
|
||||
return "FAILED", rc
|
||||
}
|
||||
|
||||
func runSATCommand(verboseLog, name string, cmd []string) ([]byte, error) {
|
||||
start := time.Now().UTC()
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
|
||||
"cmd: "+strings.Join(cmd, " "),
|
||||
)
|
||||
|
||||
out, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
|
||||
|
||||
rc := 0
|
||||
if err != nil {
|
||||
rc = 1
|
||||
}
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
|
||||
fmt.Sprintf("rc: %d", rc),
|
||||
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
|
||||
"",
|
||||
)
|
||||
return out, err
|
||||
}
|
||||
|
||||
// runSATCommandWithMetrics runs a command while collecting GPU metrics in the background.
|
||||
// On completion it writes gpu-metrics.csv and gpu-metrics.html into runDir.
|
||||
func runSATCommandWithMetrics(ctx context.Context, verboseLog, name string, cmd []string, env []string, gpuIndices []int, runDir string) ([]byte, error) {
|
||||
stopCh := make(chan struct{})
|
||||
doneCh := make(chan struct{})
|
||||
var metricRows []GPUMetricRow
|
||||
start := time.Now()
|
||||
|
||||
go func() {
|
||||
defer close(doneCh)
|
||||
ticker := time.NewTicker(time.Second)
|
||||
defer ticker.Stop()
|
||||
for {
|
||||
select {
|
||||
case <-stopCh:
|
||||
return
|
||||
case <-ticker.C:
|
||||
samples, err := sampleGPUMetrics(gpuIndices)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
elapsed := time.Since(start).Seconds()
|
||||
for i := range samples {
|
||||
samples[i].ElapsedSec = elapsed
|
||||
}
|
||||
metricRows = append(metricRows, samples...)
|
||||
}
|
||||
}
|
||||
}()
|
||||
|
||||
out, err := runSATCommandCtx(ctx, verboseLog, name, cmd, env)
|
||||
|
||||
close(stopCh)
|
||||
<-doneCh
|
||||
|
||||
if len(metricRows) > 0 {
|
||||
_ = WriteGPUMetricsCSV(filepath.Join(runDir, "gpu-metrics.csv"), metricRows)
|
||||
_ = WriteGPUMetricsHTML(filepath.Join(runDir, "gpu-metrics.html"), metricRows)
|
||||
chart := RenderGPUTerminalChart(metricRows)
|
||||
_ = os.WriteFile(filepath.Join(runDir, "gpu-metrics-term.txt"), []byte(chart), 0644)
|
||||
}
|
||||
|
||||
return out, err
|
||||
}
|
||||
|
||||
func appendSATVerboseLog(path string, lines ...string) {
|
||||
if path == "" {
|
||||
return
|
||||
}
|
||||
f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
defer f.Close()
|
||||
for _, line := range lines {
|
||||
_, _ = io.WriteString(f, line+"\n")
|
||||
}
|
||||
}
|
||||
|
||||
func envInt(name string, fallback int) int {
|
||||
raw := strings.TrimSpace(os.Getenv(name))
|
||||
if raw == "" {
|
||||
return fallback
|
||||
}
|
||||
value, err := strconv.Atoi(raw)
|
||||
if err != nil || value <= 0 {
|
||||
return fallback
|
||||
}
|
||||
return value
|
||||
}
|
||||
|
||||
func createTarGz(dst, srcDir string) error {
|
||||
file, err := os.Create(dst)
|
||||
if err != nil {
|
||||
|
||||
93
audit/internal/platform/sat_test.go
Normal file
93
audit/internal/platform/sat_test.go
Normal file
@@ -0,0 +1,93 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"errors"
|
||||
"os"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestStorageSATCommands(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
nvme := storageSATCommands("/dev/nvme0n1")
|
||||
if len(nvme) != 3 || nvme[2].cmd[0] != "nvme" {
|
||||
t.Fatalf("unexpected nvme commands: %#v", nvme)
|
||||
}
|
||||
|
||||
sata := storageSATCommands("/dev/sda")
|
||||
if len(sata) != 2 || sata[0].cmd[0] != "smartctl" {
|
||||
t.Fatalf("unexpected sata commands: %#v", sata)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
jobs := nvidiaSATJobs()
|
||||
|
||||
if len(jobs) != 5 {
|
||||
t.Fatalf("jobs=%d want 5", len(jobs))
|
||||
}
|
||||
if got := jobs[4].cmd[0]; got != "bee-gpu-stress" {
|
||||
t.Fatalf("gpu stress command=%q want bee-gpu-stress", got)
|
||||
}
|
||||
if got := jobs[3].cmd[1]; got != "--output-file" {
|
||||
t.Fatalf("bug report flag=%q want --output-file", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
|
||||
t.Setenv("BEE_GPU_STRESS_SECONDS", "9")
|
||||
t.Setenv("BEE_GPU_STRESS_SIZE_MB", "96")
|
||||
|
||||
jobs := nvidiaSATJobs()
|
||||
got := jobs[4].cmd
|
||||
want := []string{"bee-gpu-stress", "--seconds", "9", "--size-mb", "96"}
|
||||
if len(got) != len(want) {
|
||||
t.Fatalf("cmd len=%d want %d", len(got), len(want))
|
||||
}
|
||||
for i := range want {
|
||||
if got[i] != want[i] {
|
||||
t.Fatalf("cmd[%d]=%q want %q", i, got[i], want[i])
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnvIntFallback(t *testing.T) {
|
||||
os.Unsetenv("BEE_MEMTESTER_SIZE_MB")
|
||||
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
|
||||
t.Fatalf("got %d want 123", got)
|
||||
}
|
||||
t.Setenv("BEE_MEMTESTER_SIZE_MB", "bad")
|
||||
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
|
||||
t.Fatalf("got %d want 123", got)
|
||||
}
|
||||
t.Setenv("BEE_MEMTESTER_SIZE_MB", "256")
|
||||
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 256 {
|
||||
t.Fatalf("got %d want 256", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestClassifySATResult(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
job string
|
||||
out string
|
||||
err error
|
||||
status string
|
||||
}{
|
||||
{name: "ok", job: "memtester", out: "done", err: nil, status: "OK"},
|
||||
{name: "unsupported", job: "smartctl-self-test-short", out: "Self-test not supported", err: errors.New("rc 1"), status: "UNSUPPORTED"},
|
||||
{name: "failed", job: "bee-gpu-stress", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
|
||||
{name: "cuda not ready", job: "bee-gpu-stress", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
got, _ := classifySATResult(tt.job, []byte(tt.out), tt.err)
|
||||
if got != tt.status {
|
||||
t.Fatalf("status=%q want %q", got, tt.status)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
122
audit/internal/platform/techdump.go
Normal file
122
audit/internal/platform/techdump.go
Normal file
@@ -0,0 +1,122 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var techDumpFixedCommands = []struct {
|
||||
Name string
|
||||
Args []string
|
||||
File string
|
||||
}{
|
||||
{Name: "dmidecode", Args: []string{"-t", "0"}, File: "dmidecode-type0.txt"},
|
||||
{Name: "dmidecode", Args: []string{"-t", "1"}, File: "dmidecode-type1.txt"},
|
||||
{Name: "dmidecode", Args: []string{"-t", "2"}, File: "dmidecode-type2.txt"},
|
||||
{Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"},
|
||||
{Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"},
|
||||
{Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"},
|
||||
{Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"},
|
||||
{Name: "sensors", Args: []string{"-j"}, File: "sensors.json"},
|
||||
{Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"},
|
||||
{Name: "ipmitool", Args: []string{"sdr"}, File: "ipmitool-sdr.txt"},
|
||||
{Name: "nvidia-smi", Args: []string{"-q"}, File: "nvidia-smi-q.txt"},
|
||||
{Name: "nvidia-smi", Args: []string{"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown", "--format=csv,noheader,nounits"}, File: "nvidia-smi-query.csv"},
|
||||
{Name: "nvme", Args: []string{"list", "-o", "json"}, File: "nvme-list.json"},
|
||||
}
|
||||
|
||||
type lsblkDumpRoot struct {
|
||||
Blockdevices []struct {
|
||||
Name string `json:"name"`
|
||||
Type string `json:"type"`
|
||||
} `json:"blockdevices"`
|
||||
}
|
||||
|
||||
type nvmeDumpRoot struct {
|
||||
Devices []struct {
|
||||
DevicePath string `json:"DevicePath"`
|
||||
} `json:"Devices"`
|
||||
}
|
||||
|
||||
func (s *System) CaptureTechnicalDump(baseDir string) error {
|
||||
if err := os.MkdirAll(baseDir, 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
for _, cmd := range techDumpFixedCommands {
|
||||
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
|
||||
}
|
||||
|
||||
for _, dev := range lsblkDumpDevices(filepath.Join(baseDir, "lsblk.json")) {
|
||||
writeCommandDump(filepath.Join(baseDir, "smartctl-"+sanitizeDumpName(dev)+".json"), "smartctl", "-j", "-a", "/dev/"+dev)
|
||||
}
|
||||
for _, dev := range nvmeDumpDevices(filepath.Join(baseDir, "nvme-list.json")) {
|
||||
writeCommandDump(filepath.Join(baseDir, "nvme-id-ctrl-"+sanitizeDumpName(dev)+".json"), "nvme", "id-ctrl", dev, "-o", "json")
|
||||
writeCommandDump(filepath.Join(baseDir, "nvme-smart-log-"+sanitizeDumpName(dev)+".json"), "nvme", "smart-log", dev, "-o", "json")
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func writeCommandDump(path, name string, args ...string) {
|
||||
out, err := exec.Command(name, args...).CombinedOutput()
|
||||
if err != nil && len(out) == 0 {
|
||||
return
|
||||
}
|
||||
_ = os.WriteFile(path, out, 0644)
|
||||
}
|
||||
|
||||
func lsblkDumpDevices(path string) []string {
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
var root lsblkDumpRoot
|
||||
if err := json.Unmarshal(raw, &root); err != nil {
|
||||
return nil
|
||||
}
|
||||
var devices []string
|
||||
for _, dev := range root.Blockdevices {
|
||||
if dev.Type == "disk" && strings.TrimSpace(dev.Name) != "" {
|
||||
devices = append(devices, strings.TrimSpace(dev.Name))
|
||||
}
|
||||
}
|
||||
sort.Strings(devices)
|
||||
return devices
|
||||
}
|
||||
|
||||
func nvmeDumpDevices(path string) []string {
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
var root nvmeDumpRoot
|
||||
if err := json.Unmarshal(raw, &root); err != nil {
|
||||
return nil
|
||||
}
|
||||
seen := map[string]bool{}
|
||||
var devices []string
|
||||
for _, dev := range root.Devices {
|
||||
name := strings.TrimSpace(dev.DevicePath)
|
||||
if name == "" || seen[name] {
|
||||
continue
|
||||
}
|
||||
seen[name] = true
|
||||
devices = append(devices, name)
|
||||
}
|
||||
sort.Strings(devices)
|
||||
return devices
|
||||
}
|
||||
|
||||
func sanitizeDumpName(value string) string {
|
||||
value = strings.TrimSpace(value)
|
||||
value = strings.TrimPrefix(value, "/dev/")
|
||||
value = strings.ReplaceAll(value, "/", "_")
|
||||
if value == "" {
|
||||
return "unknown"
|
||||
}
|
||||
return value
|
||||
}
|
||||
48
audit/internal/platform/techdump_test.go
Normal file
48
audit/internal/platform/techdump_test.go
Normal file
@@ -0,0 +1,48 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"reflect"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestLSBLKDumpDevices(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "lsblk.json")
|
||||
if err := os.WriteFile(path, []byte(`{"blockdevices":[{"name":"sda","type":"disk"},{"name":"sda1","type":"part"},{"name":"nvme0n1","type":"disk"}]}`), 0644); err != nil {
|
||||
t.Fatalf("write lsblk fixture: %v", err)
|
||||
}
|
||||
|
||||
got := lsblkDumpDevices(path)
|
||||
want := []string{"nvme0n1", "sda"}
|
||||
if !reflect.DeepEqual(got, want) {
|
||||
t.Fatalf("lsblkDumpDevices=%v want %v", got, want)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNVMEDumpDevices(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "nvme-list.json")
|
||||
if err := os.WriteFile(path, []byte(`{"Devices":[{"DevicePath":"/dev/nvme1n1"},{"DevicePath":"/dev/nvme0n1"},{"DevicePath":"/dev/nvme1n1"}]}`), 0644); err != nil {
|
||||
t.Fatalf("write nvme fixture: %v", err)
|
||||
}
|
||||
|
||||
got := nvmeDumpDevices(path)
|
||||
want := []string{"/dev/nvme0n1", "/dev/nvme1n1"}
|
||||
if !reflect.DeepEqual(got, want) {
|
||||
t.Fatalf("nvmeDumpDevices=%v want %v", got, want)
|
||||
}
|
||||
}
|
||||
|
||||
func TestSanitizeDumpName(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
if got := sanitizeDumpName("/dev/nvme0n1"); got != "nvme0n1" {
|
||||
t.Fatalf("sanitizeDumpName=%q want nvme0n1", got)
|
||||
}
|
||||
}
|
||||
@@ -5,14 +5,52 @@ package schema
|
||||
// HardwareIngestRequest is the top-level output document produced by `bee audit`.
|
||||
// It is accepted as-is by the core /api/ingest/hardware endpoint.
|
||||
type HardwareIngestRequest struct {
|
||||
Filename *string `json:"filename"`
|
||||
SourceType *string `json:"source_type"`
|
||||
Protocol *string `json:"protocol"`
|
||||
TargetHost string `json:"target_host"`
|
||||
Filename *string `json:"filename,omitempty"`
|
||||
SourceType *string `json:"source_type,omitempty"`
|
||||
Protocol *string `json:"protocol,omitempty"`
|
||||
TargetHost *string `json:"target_host,omitempty"`
|
||||
CollectedAt string `json:"collected_at"`
|
||||
Runtime *RuntimeHealth `json:"runtime,omitempty"`
|
||||
Hardware HardwareSnapshot `json:"hardware"`
|
||||
}
|
||||
|
||||
type RuntimeHealth struct {
|
||||
Status string `json:"status"`
|
||||
CheckedAt string `json:"checked_at"`
|
||||
ExportDir string `json:"export_dir,omitempty"`
|
||||
DriverReady bool `json:"driver_ready,omitempty"`
|
||||
CUDAReady bool `json:"cuda_ready,omitempty"`
|
||||
NetworkStatus string `json:"network_status,omitempty"`
|
||||
Issues []RuntimeIssue `json:"issues,omitempty"`
|
||||
Tools []RuntimeToolStatus `json:"tools,omitempty"`
|
||||
Services []RuntimeServiceStatus `json:"services,omitempty"`
|
||||
Interfaces []RuntimeInterface `json:"interfaces,omitempty"`
|
||||
}
|
||||
|
||||
type RuntimeIssue struct {
|
||||
Code string `json:"code"`
|
||||
Severity string `json:"severity,omitempty"`
|
||||
Description string `json:"description"`
|
||||
}
|
||||
|
||||
type RuntimeToolStatus struct {
|
||||
Name string `json:"name"`
|
||||
Path string `json:"path,omitempty"`
|
||||
OK bool `json:"ok"`
|
||||
}
|
||||
|
||||
type RuntimeServiceStatus struct {
|
||||
Name string `json:"name"`
|
||||
Status string `json:"status"`
|
||||
}
|
||||
|
||||
type RuntimeInterface struct {
|
||||
Name string `json:"name"`
|
||||
State string `json:"state,omitempty"`
|
||||
IPv4 []string `json:"ipv4,omitempty"`
|
||||
Outcome string `json:"outcome,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareSnapshot struct {
|
||||
Board HardwareBoard `json:"board"`
|
||||
Firmware []HardwareFirmwareRecord `json:"firmware,omitempty"`
|
||||
@@ -21,14 +59,33 @@ type HardwareSnapshot struct {
|
||||
Storage []HardwareStorage `json:"storage,omitempty"`
|
||||
PCIeDevices []HardwarePCIeDevice `json:"pcie_devices,omitempty"`
|
||||
PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"`
|
||||
Sensors *HardwareSensors `json:"sensors,omitempty"`
|
||||
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareHealthSummary struct {
|
||||
Status string `json:"status"`
|
||||
Warnings []string `json:"warnings,omitempty"`
|
||||
Failures []string `json:"failures,omitempty"`
|
||||
StorageWarn int `json:"storage_warn,omitempty"`
|
||||
StorageFail int `json:"storage_fail,omitempty"`
|
||||
PCIeWarn int `json:"pcie_warn,omitempty"`
|
||||
PCIeFail int `json:"pcie_fail,omitempty"`
|
||||
PSUWarn int `json:"psu_warn,omitempty"`
|
||||
PSUFail int `json:"psu_fail,omitempty"`
|
||||
MemoryWarn int `json:"memory_warn,omitempty"`
|
||||
MemoryFail int `json:"memory_fail,omitempty"`
|
||||
EmptyDIMMs int `json:"empty_dimms,omitempty"`
|
||||
MissingPSUs int `json:"missing_psus,omitempty"`
|
||||
CollectedAt string `json:"collected_at,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareBoard struct {
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
ProductName *string `json:"product_name"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
ProductName *string `json:"product_name,omitempty"`
|
||||
SerialNumber string `json:"serial_number"`
|
||||
PartNumber *string `json:"part_number"`
|
||||
UUID *string `json:"uuid"`
|
||||
PartNumber *string `json:"part_number,omitempty"`
|
||||
UUID *string `json:"uuid,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareFirmwareRecord struct {
|
||||
@@ -37,77 +94,196 @@ type HardwareFirmwareRecord struct {
|
||||
}
|
||||
|
||||
type HardwareCPU struct {
|
||||
Socket *int `json:"socket"`
|
||||
Model *string `json:"model"`
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
Status *string `json:"status"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
Firmware *string `json:"firmware"`
|
||||
Cores *int `json:"cores"`
|
||||
Threads *int `json:"threads"`
|
||||
FrequencyMHz *int `json:"frequency_mhz"`
|
||||
MaxFrequencyMHz *int `json:"max_frequency_mhz"`
|
||||
HardwareComponentStatus
|
||||
Socket *int `json:"socket,omitempty"`
|
||||
Model *string `json:"model,omitempty"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
Firmware *string `json:"firmware,omitempty"`
|
||||
Cores *int `json:"cores,omitempty"`
|
||||
Threads *int `json:"threads,omitempty"`
|
||||
FrequencyMHz *int `json:"frequency_mhz,omitempty"`
|
||||
MaxFrequencyMHz *int `json:"max_frequency_mhz,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
PowerW *float64 `json:"power_w,omitempty"`
|
||||
Throttled *bool `json:"throttled,omitempty"`
|
||||
CorrectableErrorCount *int64 `json:"correctable_error_count,omitempty"`
|
||||
UncorrectableErrorCount *int64 `json:"uncorrectable_error_count,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareMemory struct {
|
||||
Slot *string `json:"slot"`
|
||||
Location *string `json:"location"`
|
||||
Present *bool `json:"present"`
|
||||
SizeMB *int `json:"size_mb"`
|
||||
Type *string `json:"type"`
|
||||
MaxSpeedMHz *int `json:"max_speed_mhz"`
|
||||
CurrentSpeedMHz *int `json:"current_speed_mhz"`
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
PartNumber *string `json:"part_number"`
|
||||
Status *string `json:"status"`
|
||||
HardwareComponentStatus
|
||||
Slot *string `json:"slot,omitempty"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
SizeMB *int `json:"size_mb,omitempty"`
|
||||
Type *string `json:"type,omitempty"`
|
||||
MaxSpeedMHz *int `json:"max_speed_mhz,omitempty"`
|
||||
CurrentSpeedMHz *int `json:"current_speed_mhz,omitempty"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
PartNumber *string `json:"part_number,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
CorrectableECCErrorCount *int64 `json:"correctable_ecc_error_count,omitempty"`
|
||||
UncorrectableECCErrorCount *int64 `json:"uncorrectable_ecc_error_count,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
SpareBlocksRemainingPct *float64 `json:"spare_blocks_remaining_pct,omitempty"`
|
||||
PerformanceDegraded *bool `json:"performance_degraded,omitempty"`
|
||||
DataLossDetected *bool `json:"data_loss_detected,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareStorage struct {
|
||||
Slot *string `json:"slot"`
|
||||
Type *string `json:"type"`
|
||||
Model *string `json:"model"`
|
||||
SizeGB *int `json:"size_gb"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
Firmware *string `json:"firmware"`
|
||||
Interface *string `json:"interface"`
|
||||
Present *bool `json:"present"`
|
||||
Status *string `json:"status"`
|
||||
Telemetry map[string]any `json:"telemetry,omitempty"`
|
||||
HardwareComponentStatus
|
||||
Slot *string `json:"slot,omitempty"`
|
||||
Type *string `json:"type,omitempty"`
|
||||
Model *string `json:"model,omitempty"`
|
||||
SizeGB *int `json:"size_gb,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
Firmware *string `json:"firmware,omitempty"`
|
||||
Interface *string `json:"interface,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
PowerOnHours *int64 `json:"power_on_hours,omitempty"`
|
||||
PowerCycles *int64 `json:"power_cycles,omitempty"`
|
||||
UnsafeShutdowns *int64 `json:"unsafe_shutdowns,omitempty"`
|
||||
MediaErrors *int64 `json:"media_errors,omitempty"`
|
||||
ErrorLogEntries *int64 `json:"error_log_entries,omitempty"`
|
||||
WrittenBytes *int64 `json:"written_bytes,omitempty"`
|
||||
ReadBytes *int64 `json:"read_bytes,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
AvailableSparePct *float64 `json:"available_spare_pct,omitempty"`
|
||||
ReallocatedSectors *int64 `json:"reallocated_sectors,omitempty"`
|
||||
CurrentPendingSectors *int64 `json:"current_pending_sectors,omitempty"`
|
||||
OfflineUncorrectable *int64 `json:"offline_uncorrectable,omitempty"`
|
||||
Telemetry map[string]any `json:"-"`
|
||||
}
|
||||
|
||||
type HardwarePCIeDevice struct {
|
||||
Slot *string `json:"slot"`
|
||||
VendorID *int `json:"vendor_id"`
|
||||
DeviceID *int `json:"device_id"`
|
||||
BDF *string `json:"bdf"`
|
||||
DeviceClass *string `json:"device_class"`
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
Model *string `json:"model"`
|
||||
LinkWidth *int `json:"link_width"`
|
||||
LinkSpeed *string `json:"link_speed"`
|
||||
MaxLinkWidth *int `json:"max_link_width"`
|
||||
MaxLinkSpeed *string `json:"max_link_speed"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
Firmware *string `json:"firmware"`
|
||||
Present *bool `json:"present"`
|
||||
Status *string `json:"status"`
|
||||
Telemetry map[string]any `json:"telemetry,omitempty"`
|
||||
HardwareComponentStatus
|
||||
Slot *string `json:"slot,omitempty"`
|
||||
VendorID *int `json:"vendor_id,omitempty"`
|
||||
DeviceID *int `json:"device_id,omitempty"`
|
||||
NUMANode *int `json:"numa_node,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
PowerW *float64 `json:"power_w,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
ECCCorrectedTotal *int64 `json:"ecc_corrected_total,omitempty"`
|
||||
ECCUncorrectedTotal *int64 `json:"ecc_uncorrected_total,omitempty"`
|
||||
HWSlowdown *bool `json:"hw_slowdown,omitempty"`
|
||||
BatteryChargePct *float64 `json:"battery_charge_pct,omitempty"`
|
||||
BatteryHealthPct *float64 `json:"battery_health_pct,omitempty"`
|
||||
BatteryTemperatureC *float64 `json:"battery_temperature_c,omitempty"`
|
||||
BatteryVoltageV *float64 `json:"battery_voltage_v,omitempty"`
|
||||
BatteryReplaceRequired *bool `json:"battery_replace_required,omitempty"`
|
||||
SFPTemperatureC *float64 `json:"sfp_temperature_c,omitempty"`
|
||||
SFPTXPowerDBM *float64 `json:"sfp_tx_power_dbm,omitempty"`
|
||||
SFPRXPowerDBM *float64 `json:"sfp_rx_power_dbm,omitempty"`
|
||||
SFPVoltageV *float64 `json:"sfp_voltage_v,omitempty"`
|
||||
SFPBiasMA *float64 `json:"sfp_bias_ma,omitempty"`
|
||||
BDF *string `json:"-"`
|
||||
DeviceClass *string `json:"device_class,omitempty"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
Model *string `json:"model,omitempty"`
|
||||
LinkWidth *int `json:"link_width,omitempty"`
|
||||
LinkSpeed *string `json:"link_speed,omitempty"`
|
||||
MaxLinkWidth *int `json:"max_link_width,omitempty"`
|
||||
MaxLinkSpeed *string `json:"max_link_speed,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
Firmware *string `json:"firmware,omitempty"`
|
||||
MacAddresses []string `json:"mac_addresses,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
Telemetry map[string]any `json:"-"`
|
||||
}
|
||||
|
||||
type HardwarePowerSupply struct {
|
||||
Slot *string `json:"slot"`
|
||||
Present *bool `json:"present"`
|
||||
Model *string `json:"model"`
|
||||
Vendor *string `json:"vendor"`
|
||||
WattageW *int `json:"wattage_w"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
PartNumber *string `json:"part_number"`
|
||||
Firmware *string `json:"firmware"`
|
||||
Status *string `json:"status"`
|
||||
InputType *string `json:"input_type"`
|
||||
InputPowerW *float64 `json:"input_power_w"`
|
||||
OutputPowerW *float64 `json:"output_power_w"`
|
||||
InputVoltage *float64 `json:"input_voltage"`
|
||||
HardwareComponentStatus
|
||||
Slot *string `json:"slot,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
Model *string `json:"model,omitempty"`
|
||||
Vendor *string `json:"vendor,omitempty"`
|
||||
WattageW *int `json:"wattage_w,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
PartNumber *string `json:"part_number,omitempty"`
|
||||
Firmware *string `json:"firmware,omitempty"`
|
||||
InputType *string `json:"input_type,omitempty"`
|
||||
InputPowerW *float64 `json:"input_power_w,omitempty"`
|
||||
OutputPowerW *float64 `json:"output_power_w,omitempty"`
|
||||
InputVoltage *float64 `json:"input_voltage,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareComponentStatus struct {
|
||||
Status *string `json:"status,omitempty"`
|
||||
StatusCheckedAt *string `json:"status_checked_at,omitempty"`
|
||||
StatusChangedAt *string `json:"status_changed_at,omitempty"`
|
||||
StatusHistory []HardwareStatusHistory `json:"status_history,omitempty"`
|
||||
ErrorDescription *string `json:"error_description,omitempty"`
|
||||
ManufacturedYearWeek *string `json:"manufactured_year_week,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareStatusHistory struct {
|
||||
Status string `json:"status"`
|
||||
ChangedAt string `json:"changed_at"`
|
||||
Details *string `json:"details,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareSensors struct {
|
||||
Fans []HardwareFanSensor `json:"fans,omitempty"`
|
||||
Power []HardwarePowerSensor `json:"power,omitempty"`
|
||||
Temperatures []HardwareTemperatureSensor `json:"temperatures,omitempty"`
|
||||
Other []HardwareOtherSensor `json:"other,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareFanSensor struct {
|
||||
Name string `json:"name"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
RPM *int `json:"rpm,omitempty"`
|
||||
Status *string `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
type HardwarePowerSensor struct {
|
||||
Name string `json:"name"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
VoltageV *float64 `json:"voltage_v,omitempty"`
|
||||
CurrentA *float64 `json:"current_a,omitempty"`
|
||||
PowerW *float64 `json:"power_w,omitempty"`
|
||||
Status *string `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareTemperatureSensor struct {
|
||||
Name string `json:"name"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
Celsius *float64 `json:"celsius,omitempty"`
|
||||
ThresholdWarningCelsius *float64 `json:"threshold_warning_celsius,omitempty"`
|
||||
ThresholdCriticalCelsius *float64 `json:"threshold_critical_celsius,omitempty"`
|
||||
Status *string `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareOtherSensor struct {
|
||||
Name string `json:"name"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
Value *float64 `json:"value,omitempty"`
|
||||
Unit *string `json:"unit,omitempty"`
|
||||
Status *string `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareEventLog struct {
|
||||
Source string `json:"source"`
|
||||
EventTime *string `json:"event_time,omitempty"`
|
||||
Severity *string `json:"severity,omitempty"`
|
||||
MessageID *string `json:"message_id,omitempty"`
|
||||
Message string `json:"message"`
|
||||
ComponentRef *string `json:"component_ref,omitempty"`
|
||||
Fingerprint *string `json:"fingerprint,omitempty"`
|
||||
IsActive *bool `json:"is_active,omitempty"`
|
||||
RawPayload map[string]any `json:"raw_payload,omitempty"`
|
||||
}
|
||||
|
||||
46
audit/internal/schema/hardware_test.go
Normal file
46
audit/internal/schema/hardware_test.go
Normal file
@@ -0,0 +1,46 @@
|
||||
package schema
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestHardwareSnapshotMarshalsNewContractFields(t *testing.T) {
|
||||
week := "2024-W07"
|
||||
eventTime := "2026-03-15T14:03:11Z"
|
||||
message := "Correctable ECC error threshold exceeded"
|
||||
|
||||
payload := HardwareIngestRequest{
|
||||
CollectedAt: "2026-03-15T15:00:00Z",
|
||||
Hardware: HardwareSnapshot{
|
||||
Board: HardwareBoard{SerialNumber: "SRV-001"},
|
||||
CPUs: []HardwareCPU{
|
||||
{
|
||||
HardwareComponentStatus: HardwareComponentStatus{
|
||||
ManufacturedYearWeek: &week,
|
||||
},
|
||||
},
|
||||
},
|
||||
EventLogs: []HardwareEventLog{
|
||||
{
|
||||
Source: "bmc",
|
||||
EventTime: &eventTime,
|
||||
Message: message,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
data, err := json.Marshal(payload)
|
||||
if err != nil {
|
||||
t.Fatalf("marshal: %v", err)
|
||||
}
|
||||
text := string(data)
|
||||
if !strings.Contains(text, `"manufactured_year_week":"2024-W07"`) {
|
||||
t.Fatalf("missing manufactured_year_week: %s", text)
|
||||
}
|
||||
if !strings.Contains(text, `"event_logs":[{"source":"bmc","event_time":"2026-03-15T14:03:11Z","message":"Correctable ECC error threshold exceeded"}]`) {
|
||||
t.Fatalf("missing event_logs payload: %s", text)
|
||||
}
|
||||
}
|
||||
@@ -29,6 +29,7 @@ func (m model) updateStaticForm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
|
||||
m.formFields[3].Value,
|
||||
})
|
||||
m.busy = true
|
||||
m.busyTitle = "Static IPv4: " + m.selectedIface
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.SetStaticIPv4Result(cfg)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
|
||||
@@ -59,26 +60,49 @@ func (m model) updateConfirm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
|
||||
case "esc":
|
||||
m.screen = m.confirmCancelTarget()
|
||||
m.cursor = 0
|
||||
m.pendingAction = actionNone
|
||||
return m, nil
|
||||
case "enter":
|
||||
if m.cursor == 1 {
|
||||
m.screen = m.confirmCancelTarget()
|
||||
m.cursor = 0
|
||||
m.pendingAction = actionNone
|
||||
return m, nil
|
||||
}
|
||||
m.busy = true
|
||||
switch m.pendingAction {
|
||||
case actionExportAudit:
|
||||
m.busyTitle = "Export audit"
|
||||
target := *m.selectedTarget
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.ExportLatestAuditResult(target)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
|
||||
}
|
||||
case actionExportBundle:
|
||||
m.busyTitle = "Export support bundle"
|
||||
target := *m.selectedTarget
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.ExportSupportBundleResult(target)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
|
||||
}
|
||||
case actionRunNvidiaSAT:
|
||||
m.busyTitle = "NVIDIA SAT"
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.RunNvidiaAcceptancePackResult("")
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
|
||||
}
|
||||
case actionRunMemorySAT:
|
||||
m.busyTitle = "Memory SAT"
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.RunMemoryAcceptancePackResult("")
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
|
||||
}
|
||||
case actionRunStorageSAT:
|
||||
m.busyTitle = "Storage SAT"
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.RunStorageAcceptancePackResult("")
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
|
||||
}
|
||||
}
|
||||
case "ctrl+c":
|
||||
return m, tea.Quit
|
||||
@@ -90,7 +114,13 @@ func (m model) confirmCancelTarget() screen {
|
||||
switch m.pendingAction {
|
||||
case actionExportAudit:
|
||||
return screenExportTargets
|
||||
case actionExportBundle:
|
||||
return screenExportTargets
|
||||
case actionRunNvidiaSAT:
|
||||
fallthrough
|
||||
case actionRunMemorySAT:
|
||||
fallthrough
|
||||
case actionRunStorageSAT:
|
||||
return screenAcceptance
|
||||
default:
|
||||
return screenMain
|
||||
|
||||
@@ -23,3 +23,20 @@ type exportTargetsMsg struct {
|
||||
targets []platform.RemovableTarget
|
||||
err error
|
||||
}
|
||||
|
||||
type bannerMsg struct {
|
||||
text string
|
||||
}
|
||||
|
||||
type nvidiaGPUsMsg struct {
|
||||
gpus []platform.NvidiaGPU
|
||||
err error
|
||||
}
|
||||
|
||||
type nvtopClosedMsg struct{}
|
||||
|
||||
type nvidiaSATDoneMsg struct {
|
||||
title string
|
||||
body string
|
||||
err error
|
||||
}
|
||||
|
||||
@@ -3,12 +3,19 @@ package tui
|
||||
import tea "github.com/charmbracelet/bubbletea"
|
||||
|
||||
func (m model) handleAcceptanceMenu() (tea.Model, tea.Cmd) {
|
||||
if m.cursor == 1 {
|
||||
if m.cursor == 3 {
|
||||
m.screen = screenMain
|
||||
m.cursor = 0
|
||||
return m, nil
|
||||
}
|
||||
m.pendingAction = actionRunNvidiaSAT
|
||||
switch m.cursor {
|
||||
case 0:
|
||||
return m.enterNvidiaSATSetup()
|
||||
case 1:
|
||||
m.pendingAction = actionRunMemorySAT
|
||||
case 2:
|
||||
m.pendingAction = actionRunStorageSAT
|
||||
}
|
||||
m.screen = screenConfirm
|
||||
return m, nil
|
||||
}
|
||||
|
||||
@@ -4,11 +4,17 @@ import tea "github.com/charmbracelet/bubbletea"
|
||||
|
||||
func (m model) handleExportTargetsMenu() (tea.Model, tea.Cmd) {
|
||||
if len(m.targets) == 0 {
|
||||
return m, resultCmd("Export audit", "No removable filesystems found", nil, screenMain)
|
||||
title := "Export audit"
|
||||
if m.pendingAction == actionExportBundle {
|
||||
title = "Export support bundle"
|
||||
}
|
||||
return m, resultCmd(title, "No removable filesystems found", nil, screenMain)
|
||||
}
|
||||
target := m.targets[m.cursor]
|
||||
m.selectedTarget = &target
|
||||
m.pendingAction = actionExportAudit
|
||||
if m.pendingAction == actionNone {
|
||||
m.pendingAction = actionExportAudit
|
||||
}
|
||||
m.screen = screenConfirm
|
||||
return m, nil
|
||||
}
|
||||
|
||||
@@ -12,6 +12,7 @@ func (m model) handleMainMenu() (tea.Model, tea.Cmd) {
|
||||
return m, nil
|
||||
case 1:
|
||||
m.busy = true
|
||||
m.busyTitle = "Services"
|
||||
return m, func() tea.Msg {
|
||||
services, err := m.app.ListBeeServices()
|
||||
return servicesMsg{services: services, err: err}
|
||||
@@ -22,29 +23,63 @@ func (m model) handleMainMenu() (tea.Model, tea.Cmd) {
|
||||
return m, nil
|
||||
case 3:
|
||||
m.busy = true
|
||||
m.busyTitle = "Run audit"
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.RunAuditNow(m.runtimeMode)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
|
||||
}
|
||||
case 4:
|
||||
m.busy = true
|
||||
m.busyTitle = "Run self-check"
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.RunRuntimePreflightResult()
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
|
||||
}
|
||||
case 5:
|
||||
m.pendingAction = actionExportAudit
|
||||
m.busy = true
|
||||
m.busyTitle = "Export audit"
|
||||
return m, func() tea.Msg {
|
||||
targets, err := m.app.ListRemovableTargets()
|
||||
return exportTargetsMsg{targets: targets, err: err}
|
||||
}
|
||||
case 5:
|
||||
case 6:
|
||||
m.pendingAction = actionExportBundle
|
||||
m.busy = true
|
||||
m.busyTitle = "Export support bundle"
|
||||
return m, func() tea.Msg {
|
||||
result := m.app.ToolCheckResult([]string{"dmidecode", "smartctl", "nvme", "ipmitool", "lspci", "bee", "nvidia-smi", "dhclient", "lsblk", "mount"})
|
||||
targets, err := m.app.ListRemovableTargets()
|
||||
return exportTargetsMsg{targets: targets, err: err}
|
||||
}
|
||||
case 7:
|
||||
m.busy = true
|
||||
m.busyTitle = "Required tools"
|
||||
return m, func() tea.Msg {
|
||||
result := m.app.ToolCheckResult([]string{"dmidecode", "smartctl", "nvme", "ipmitool", "lspci", "ethtool", "bee", "nvidia-smi", "bee-gpu-stress", "memtester", "dhclient", "lsblk", "mount"})
|
||||
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
|
||||
}
|
||||
case 6:
|
||||
case 8:
|
||||
m.busy = true
|
||||
m.busyTitle = "Health summary"
|
||||
return m, func() tea.Msg {
|
||||
result := m.app.HealthSummaryResult()
|
||||
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
|
||||
}
|
||||
case 9:
|
||||
m.busy = true
|
||||
m.busyTitle = "Runtime issues"
|
||||
return m, func() tea.Msg {
|
||||
result := m.app.RuntimeHealthResult()
|
||||
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
|
||||
}
|
||||
case 10:
|
||||
m.busy = true
|
||||
m.busyTitle = "Audit logs"
|
||||
return m, func() tea.Msg {
|
||||
result := m.app.AuditLogTailResult()
|
||||
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
|
||||
}
|
||||
case 7:
|
||||
case 11:
|
||||
return m, tea.Quit
|
||||
}
|
||||
return m, nil
|
||||
|
||||
@@ -10,12 +10,14 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
|
||||
switch m.cursor {
|
||||
case 0:
|
||||
m.busy = true
|
||||
m.busyTitle = "Network status"
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.NetworkStatus()
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
|
||||
}
|
||||
case 1:
|
||||
m.busy = true
|
||||
m.busyTitle = "DHCP all interfaces"
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.DHCPAllResult()
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
|
||||
@@ -23,6 +25,7 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
|
||||
case 2:
|
||||
m.pendingAction = actionDHCPOne
|
||||
m.busy = true
|
||||
m.busyTitle = "Interfaces"
|
||||
return m, func() tea.Msg {
|
||||
ifaces, err := m.app.ListInterfaces()
|
||||
return interfacesMsg{ifaces: ifaces, err: err}
|
||||
@@ -30,6 +33,7 @@ func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
|
||||
case 3:
|
||||
m.pendingAction = actionStaticIPv4
|
||||
m.busy = true
|
||||
m.busyTitle = "Interfaces"
|
||||
return m, func() tea.Msg {
|
||||
ifaces, err := m.app.ListInterfaces()
|
||||
return interfacesMsg{ifaces: ifaces, err: err}
|
||||
@@ -50,6 +54,7 @@ func (m model) handleInterfacePickMenu() (tea.Model, tea.Cmd) {
|
||||
switch m.pendingAction {
|
||||
case actionDHCPOne:
|
||||
m.busy = true
|
||||
m.busyTitle = "DHCP on " + m.selectedIface
|
||||
return m, func() tea.Msg {
|
||||
result, err := m.app.DHCPOneResult(m.selectedIface)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
|
||||
|
||||
238
audit/internal/tui/screen_nvidia_sat.go
Normal file
238
audit/internal/tui/screen_nvidia_sat.go
Normal file
@@ -0,0 +1,238 @@
|
||||
package tui
|
||||
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"os/exec"
|
||||
"strings"
|
||||
|
||||
"bee/audit/internal/platform"
|
||||
|
||||
tea "github.com/charmbracelet/bubbletea"
|
||||
)
|
||||
|
||||
var nvidiaDurationOptions = []struct {
|
||||
label string
|
||||
seconds int
|
||||
}{
|
||||
{"10 minutes", 600},
|
||||
{"1 hour", 3600},
|
||||
{"8 hours", 28800},
|
||||
{"24 hours", 86400},
|
||||
}
|
||||
|
||||
// enterNvidiaSATSetup resets the setup screen and starts loading GPU list.
|
||||
func (m model) enterNvidiaSATSetup() (tea.Model, tea.Cmd) {
|
||||
m.screen = screenNvidiaSATSetup
|
||||
m.nvidiaGPUs = nil
|
||||
m.nvidiaGPUSel = nil
|
||||
m.nvidiaDurIdx = 0
|
||||
m.nvidiaSATCursor = 0
|
||||
m.busy = true
|
||||
m.busyTitle = "NVIDIA SAT"
|
||||
return m, func() tea.Msg {
|
||||
gpus, err := m.app.ListNvidiaGPUs()
|
||||
return nvidiaGPUsMsg{gpus: gpus, err: err}
|
||||
}
|
||||
}
|
||||
|
||||
// handleNvidiaGPUsMsg processes the GPU list response.
|
||||
func (m model) handleNvidiaGPUsMsg(msg nvidiaGPUsMsg) (tea.Model, tea.Cmd) {
|
||||
m.busy = false
|
||||
m.busyTitle = ""
|
||||
if msg.err != nil {
|
||||
m.title = "NVIDIA SAT"
|
||||
m.body = fmt.Sprintf("Failed to list GPUs: %v", msg.err)
|
||||
m.prevScreen = screenAcceptance
|
||||
m.screen = screenOutput
|
||||
return m, nil
|
||||
}
|
||||
m.nvidiaGPUs = msg.gpus
|
||||
m.nvidiaGPUSel = make([]bool, len(msg.gpus))
|
||||
for i := range m.nvidiaGPUSel {
|
||||
m.nvidiaGPUSel[i] = true // all selected by default
|
||||
}
|
||||
m.nvidiaSATCursor = 0
|
||||
return m, nil
|
||||
}
|
||||
|
||||
// updateNvidiaSATSetup handles keys on the setup screen.
|
||||
func (m model) updateNvidiaSATSetup(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
|
||||
numDur := len(nvidiaDurationOptions)
|
||||
numGPU := len(m.nvidiaGPUs)
|
||||
totalItems := numDur + numGPU + 2 // +2: Start, Cancel
|
||||
switch msg.String() {
|
||||
case "up", "k":
|
||||
if m.nvidiaSATCursor > 0 {
|
||||
m.nvidiaSATCursor--
|
||||
}
|
||||
case "down", "j":
|
||||
if m.nvidiaSATCursor < totalItems-1 {
|
||||
m.nvidiaSATCursor++
|
||||
}
|
||||
case " ":
|
||||
switch {
|
||||
case m.nvidiaSATCursor < numDur:
|
||||
m.nvidiaDurIdx = m.nvidiaSATCursor
|
||||
case m.nvidiaSATCursor < numDur+numGPU:
|
||||
i := m.nvidiaSATCursor - numDur
|
||||
m.nvidiaGPUSel[i] = !m.nvidiaGPUSel[i]
|
||||
}
|
||||
case "enter":
|
||||
startIdx := numDur + numGPU
|
||||
cancelIdx := startIdx + 1
|
||||
switch {
|
||||
case m.nvidiaSATCursor < numDur:
|
||||
m.nvidiaDurIdx = m.nvidiaSATCursor
|
||||
case m.nvidiaSATCursor < startIdx:
|
||||
i := m.nvidiaSATCursor - numDur
|
||||
m.nvidiaGPUSel[i] = !m.nvidiaGPUSel[i]
|
||||
case m.nvidiaSATCursor == startIdx:
|
||||
return m.startNvidiaSAT()
|
||||
case m.nvidiaSATCursor == cancelIdx:
|
||||
m.screen = screenAcceptance
|
||||
m.cursor = 0
|
||||
}
|
||||
case "esc":
|
||||
m.screen = screenAcceptance
|
||||
m.cursor = 0
|
||||
case "ctrl+c", "q":
|
||||
return m, tea.Quit
|
||||
}
|
||||
return m, nil
|
||||
}
|
||||
|
||||
// startNvidiaSAT launches the SAT and nvtop.
|
||||
func (m model) startNvidiaSAT() (tea.Model, tea.Cmd) {
|
||||
var selectedGPUs []platform.NvidiaGPU
|
||||
for i, sel := range m.nvidiaGPUSel {
|
||||
if sel {
|
||||
selectedGPUs = append(selectedGPUs, m.nvidiaGPUs[i])
|
||||
}
|
||||
}
|
||||
if len(selectedGPUs) == 0 {
|
||||
selectedGPUs = m.nvidiaGPUs // fallback: use all if none explicitly selected
|
||||
}
|
||||
|
||||
sizeMB := 0
|
||||
for _, g := range selectedGPUs {
|
||||
if sizeMB == 0 || g.MemoryMB < sizeMB {
|
||||
sizeMB = g.MemoryMB
|
||||
}
|
||||
}
|
||||
if sizeMB == 0 {
|
||||
sizeMB = 64
|
||||
}
|
||||
|
||||
var gpuIndices []int
|
||||
for _, g := range selectedGPUs {
|
||||
gpuIndices = append(gpuIndices, g.Index)
|
||||
}
|
||||
|
||||
durationSec := nvidiaDurationOptions[m.nvidiaDurIdx].seconds
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
m.nvidiaSATCancel = cancel
|
||||
m.nvidiaSATAborted = false
|
||||
m.screen = screenNvidiaSATRunning
|
||||
m.nvidiaSATCursor = 0
|
||||
|
||||
satCmd := func() tea.Msg {
|
||||
result, err := m.app.RunNvidiaAcceptancePackWithOptions(ctx, "", durationSec, sizeMB, gpuIndices)
|
||||
return nvidiaSATDoneMsg{title: result.Title, body: result.Body, err: err}
|
||||
}
|
||||
|
||||
nvtopPath, lookErr := exec.LookPath("nvtop")
|
||||
if lookErr != nil {
|
||||
// nvtop not available: just run the SAT, show running screen
|
||||
return m, satCmd
|
||||
}
|
||||
|
||||
return m, tea.Batch(
|
||||
satCmd,
|
||||
tea.ExecProcess(exec.Command(nvtopPath), func(_ error) tea.Msg {
|
||||
return nvtopClosedMsg{}
|
||||
}),
|
||||
)
|
||||
}
|
||||
|
||||
// updateNvidiaSATRunning handles keys on the running screen.
|
||||
func (m model) updateNvidiaSATRunning(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
|
||||
switch msg.String() {
|
||||
case "o", "O":
|
||||
nvtopPath, err := exec.LookPath("nvtop")
|
||||
if err != nil {
|
||||
return m, nil
|
||||
}
|
||||
return m, tea.ExecProcess(exec.Command(nvtopPath), func(_ error) tea.Msg {
|
||||
return nvtopClosedMsg{}
|
||||
})
|
||||
case "a", "A":
|
||||
if m.nvidiaSATCancel != nil {
|
||||
m.nvidiaSATCancel()
|
||||
m.nvidiaSATCancel = nil
|
||||
}
|
||||
m.nvidiaSATAborted = true
|
||||
m.screen = screenAcceptance
|
||||
m.cursor = 0
|
||||
case "ctrl+c":
|
||||
return m, tea.Quit
|
||||
}
|
||||
return m, nil
|
||||
}
|
||||
|
||||
// renderNvidiaSATSetup renders the setup screen.
|
||||
func renderNvidiaSATSetup(m model) string {
|
||||
var b strings.Builder
|
||||
fmt.Fprintln(&b, "NVIDIA SAT")
|
||||
fmt.Fprintln(&b)
|
||||
fmt.Fprintln(&b, "Duration:")
|
||||
for i, opt := range nvidiaDurationOptions {
|
||||
radio := "( )"
|
||||
if i == m.nvidiaDurIdx {
|
||||
radio = "(*)"
|
||||
}
|
||||
prefix := " "
|
||||
if m.nvidiaSATCursor == i {
|
||||
prefix = "> "
|
||||
}
|
||||
fmt.Fprintf(&b, "%s%s %s\n", prefix, radio, opt.label)
|
||||
}
|
||||
fmt.Fprintln(&b)
|
||||
if len(m.nvidiaGPUs) == 0 {
|
||||
fmt.Fprintln(&b, "GPUs: (none detected)")
|
||||
} else {
|
||||
fmt.Fprintln(&b, "GPUs:")
|
||||
for i, gpu := range m.nvidiaGPUs {
|
||||
check := "[ ]"
|
||||
if m.nvidiaGPUSel[i] {
|
||||
check = "[x]"
|
||||
}
|
||||
prefix := " "
|
||||
if m.nvidiaSATCursor == len(nvidiaDurationOptions)+i {
|
||||
prefix = "> "
|
||||
}
|
||||
fmt.Fprintf(&b, "%s%s %d: %s (%d MB)\n", prefix, check, gpu.Index, gpu.Name, gpu.MemoryMB)
|
||||
}
|
||||
}
|
||||
fmt.Fprintln(&b)
|
||||
startIdx := len(nvidiaDurationOptions) + len(m.nvidiaGPUs)
|
||||
startPfx := " "
|
||||
cancelPfx := " "
|
||||
if m.nvidiaSATCursor == startIdx {
|
||||
startPfx = "> "
|
||||
}
|
||||
if m.nvidiaSATCursor == startIdx+1 {
|
||||
cancelPfx = "> "
|
||||
}
|
||||
fmt.Fprintf(&b, "%sStart\n", startPfx)
|
||||
fmt.Fprintf(&b, "%sCancel\n", cancelPfx)
|
||||
fmt.Fprintln(&b)
|
||||
b.WriteString("[↑/↓] move [space] toggle [enter] select [esc] cancel\n")
|
||||
return b.String()
|
||||
}
|
||||
|
||||
// renderNvidiaSATRunning renders the running screen.
|
||||
func renderNvidiaSATRunning() string {
|
||||
return "NVIDIA SAT\n\nTest is running...\n\n[o] Open nvtop [a] Abort test [ctrl+c] quit\n"
|
||||
}
|
||||
@@ -8,7 +8,7 @@ import (
|
||||
|
||||
func (m model) handleServicesMenu() (tea.Model, tea.Cmd) {
|
||||
if len(m.services) == 0 {
|
||||
return m, resultCmd("bee services", "No bee-* services found", nil, screenMain)
|
||||
return m, resultCmd("Services", "No bee-* services found.", nil, screenMain)
|
||||
}
|
||||
m.selectedService = m.services[m.cursor]
|
||||
m.screen = screenServiceAction
|
||||
@@ -25,22 +25,23 @@ func (m model) handleServiceActionMenu() (tea.Model, tea.Cmd) {
|
||||
}
|
||||
|
||||
m.busy = true
|
||||
m.busyTitle = "service: " + m.selectedService
|
||||
return m, func() tea.Msg {
|
||||
switch action {
|
||||
case "status":
|
||||
case "Status":
|
||||
result, err := m.app.ServiceStatusResult(m.selectedService)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
|
||||
case "restart":
|
||||
case "Restart":
|
||||
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceRestart)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
|
||||
case "start":
|
||||
case "Start":
|
||||
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStart)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
|
||||
case "stop":
|
||||
case "Stop":
|
||||
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStop)
|
||||
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
|
||||
default:
|
||||
return resultMsg{title: "service", body: "unknown action", back: screenServiceAction}
|
||||
return resultMsg{title: "Service", body: "Unknown action.", back: screenServiceAction}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
package tui
|
||||
|
||||
import (
|
||||
"strings"
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/app"
|
||||
@@ -153,7 +154,8 @@ func TestMainMenuAsyncActionsSetBusy(t *testing.T) {
|
||||
{name: "run audit", cursor: 3},
|
||||
{name: "export", cursor: 4},
|
||||
{name: "check tools", cursor: 5},
|
||||
{name: "log tail", cursor: 6},
|
||||
{name: "health summary", cursor: 6},
|
||||
{name: "log tail", cursor: 7},
|
||||
}
|
||||
|
||||
for _, test := range tests {
|
||||
@@ -177,6 +179,24 @@ func TestMainMenuAsyncActionsSetBusy(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestMainViewIncludesBanner(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.banner = "System: Test Server | S/N ABC123\nIP: 10.0.0.10"
|
||||
|
||||
view := m.View()
|
||||
if !strings.Contains(view, "System: Test Server | S/N ABC123") {
|
||||
t.Fatalf("view missing system banner:\n%s", view)
|
||||
}
|
||||
if !strings.Contains(view, "IP: 10.0.0.10") {
|
||||
t.Fatalf("view missing ip banner:\n%s", view)
|
||||
}
|
||||
if !strings.Contains(view, "Select action") {
|
||||
t.Fatalf("view missing menu subtitle:\n%s", view)
|
||||
}
|
||||
}
|
||||
|
||||
func TestEscapeNavigation(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
@@ -235,7 +255,7 @@ func TestOutputScreenReturnsToPreviousScreen(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestAcceptanceConfirmFlow(t *testing.T) {
|
||||
func TestAcceptanceNvidiaSATOpensSetup(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
@@ -245,23 +265,45 @@ func TestAcceptanceConfirmFlow(t *testing.T) {
|
||||
next, cmd := m.handleAcceptanceMenu()
|
||||
got := next.(model)
|
||||
|
||||
if cmd != nil {
|
||||
t.Fatal("expected nil cmd")
|
||||
if cmd == nil {
|
||||
t.Fatal("expected non-nil cmd (GPU list loader)")
|
||||
}
|
||||
if got.screen != screenConfirm {
|
||||
t.Fatalf("screen=%q want %q", got.screen, screenConfirm)
|
||||
}
|
||||
if got.pendingAction != actionRunNvidiaSAT {
|
||||
t.Fatalf("pendingAction=%q want %q", got.pendingAction, actionRunNvidiaSAT)
|
||||
if got.screen != screenNvidiaSATSetup {
|
||||
t.Fatalf("screen=%q want %q", got.screen, screenNvidiaSATSetup)
|
||||
}
|
||||
|
||||
next, _ = got.updateConfirm(tea.KeyMsg{Type: tea.KeyEsc})
|
||||
// esc from setup returns to acceptance
|
||||
next, _ = got.updateNvidiaSATSetup(tea.KeyMsg{Type: tea.KeyEsc})
|
||||
got = next.(model)
|
||||
if got.screen != screenAcceptance {
|
||||
t.Fatalf("screen after esc=%q want %q", got.screen, screenAcceptance)
|
||||
}
|
||||
}
|
||||
|
||||
func TestAcceptanceMenuMapsNewTargets(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
tests := []struct {
|
||||
cursor int
|
||||
want actionKind
|
||||
}{
|
||||
{cursor: 1, want: actionRunMemorySAT},
|
||||
{cursor: 2, want: actionRunStorageSAT},
|
||||
}
|
||||
|
||||
for _, test := range tests {
|
||||
m := newTestModel()
|
||||
m.screen = screenAcceptance
|
||||
m.cursor = test.cursor
|
||||
|
||||
next, _ := m.handleAcceptanceMenu()
|
||||
got := next.(model)
|
||||
if got.pendingAction != test.want {
|
||||
t.Fatalf("cursor=%d pendingAction=%q want %q", test.cursor, got.pendingAction, test.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestExportTargetSelectionOpensConfirm(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
@@ -347,3 +389,197 @@ func TestConfirmCancelTarget(t *testing.T) {
|
||||
t.Fatalf("default cancel target=%q want %q", got, screenMain)
|
||||
}
|
||||
}
|
||||
|
||||
func TestViewMainMenuRendersSelectedItem(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.cursor = 1
|
||||
|
||||
view := m.View()
|
||||
|
||||
for _, want := range []string{
|
||||
"bee",
|
||||
"Select action",
|
||||
" Network",
|
||||
"> Services",
|
||||
"Acceptance tests",
|
||||
"[↑/↓] move [enter] select [esc] back [ctrl+c] quit",
|
||||
} {
|
||||
if !strings.Contains(view, want) {
|
||||
t.Fatalf("view missing %q\nview:\n%s", want, view)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestViewBusyStateIsMinimal(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.busy = true
|
||||
|
||||
view := m.View()
|
||||
want := "bee\n\nWorking...\n\n[ctrl+c] quit\n"
|
||||
if view != want {
|
||||
t.Fatalf("busy view mismatch\nwant:\n%s\ngot:\n%s", want, view)
|
||||
}
|
||||
}
|
||||
|
||||
func TestViewBusyStateUsesBusyTitle(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.busy = true
|
||||
m.busyTitle = "Export audit"
|
||||
|
||||
view := m.View()
|
||||
|
||||
for _, want := range []string{
|
||||
"Export audit",
|
||||
"Working...",
|
||||
"[ctrl+c] quit",
|
||||
} {
|
||||
if !strings.Contains(view, want) {
|
||||
t.Fatalf("view missing %q\nview:\n%s", want, view)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestViewOutputScreenRendersBodyAndBackHint(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.screen = screenOutput
|
||||
m.title = "Run audit"
|
||||
m.body = "audit output: /appdata/bee/export/bee-audit.json\n"
|
||||
|
||||
view := m.View()
|
||||
|
||||
for _, want := range []string{
|
||||
"Run audit",
|
||||
"audit output: /appdata/bee/export/bee-audit.json",
|
||||
"[enter/esc] back [ctrl+c] quit",
|
||||
} {
|
||||
if !strings.Contains(view, want) {
|
||||
t.Fatalf("view missing %q\nview:\n%s", want, view)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestViewExportTargetsRendersDeviceMetadata(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.screen = screenExportTargets
|
||||
m.targets = []platform.RemovableTarget{
|
||||
{
|
||||
Device: "/dev/sdb1",
|
||||
FSType: "vfat",
|
||||
Size: "29G",
|
||||
Label: "BEEUSB",
|
||||
Mountpoint: "/media/bee",
|
||||
},
|
||||
}
|
||||
|
||||
view := m.View()
|
||||
|
||||
for _, want := range []string{
|
||||
"Export audit",
|
||||
"Select removable filesystem",
|
||||
"> /dev/sdb1 [vfat 29G] label=BEEUSB mounted=/media/bee",
|
||||
} {
|
||||
if !strings.Contains(view, want) {
|
||||
t.Fatalf("view missing %q\nview:\n%s", want, view)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestViewStaticFormRendersFields(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.screen = screenStaticForm
|
||||
m.selectedIface = "enp1s0"
|
||||
m.formFields = []formField{
|
||||
{Label: "Address", Value: "192.0.2.10/24"},
|
||||
{Label: "Gateway", Value: "192.0.2.1"},
|
||||
{Label: "DNS", Value: "1.1.1.1"},
|
||||
}
|
||||
m.formIndex = 1
|
||||
|
||||
view := m.View()
|
||||
|
||||
for _, want := range []string{
|
||||
"Static IPv4: enp1s0",
|
||||
" Address: 192.0.2.10/24",
|
||||
"> Gateway: 192.0.2.1",
|
||||
" DNS: 1.1.1.1",
|
||||
"[tab/↑/↓] move [enter] next/submit [backspace] delete [esc] cancel",
|
||||
} {
|
||||
if !strings.Contains(view, want) {
|
||||
t.Fatalf("view missing %q\nview:\n%s", want, view)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestViewConfirmScreenMatchesPendingExport(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.screen = screenConfirm
|
||||
m.pendingAction = actionExportAudit
|
||||
m.selectedTarget = &platform.RemovableTarget{Device: "/dev/sdb1"}
|
||||
|
||||
view := m.View()
|
||||
|
||||
for _, want := range []string{
|
||||
"Export audit",
|
||||
"Copy latest audit JSON to /dev/sdb1?",
|
||||
"> Confirm",
|
||||
" Cancel",
|
||||
} {
|
||||
if !strings.Contains(view, want) {
|
||||
t.Fatalf("view missing %q\nview:\n%s", want, view)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestResultMsgClearsBusyAndPendingAction(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
m.busy = true
|
||||
m.busyTitle = "Export audit"
|
||||
m.pendingAction = actionExportAudit
|
||||
m.screen = screenConfirm
|
||||
|
||||
next, _ := m.Update(resultMsg{title: "Export audit", body: "done", back: screenMain})
|
||||
got := next.(model)
|
||||
|
||||
if got.busy {
|
||||
t.Fatal("busy=true want false")
|
||||
}
|
||||
if got.busyTitle != "" {
|
||||
t.Fatalf("busyTitle=%q want empty", got.busyTitle)
|
||||
}
|
||||
if got.pendingAction != actionNone {
|
||||
t.Fatalf("pendingAction=%q want empty", got.pendingAction)
|
||||
}
|
||||
}
|
||||
|
||||
func TestResultMsgErrorWithoutBodyFormatsCleanly(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
m := newTestModel()
|
||||
|
||||
next, _ := m.Update(resultMsg{title: "Export audit", err: assertErr("boom"), back: screenMain})
|
||||
got := next.(model)
|
||||
|
||||
if got.body != "ERROR: boom" {
|
||||
t.Fatalf("body=%q want %q", got.body, "ERROR: boom")
|
||||
}
|
||||
}
|
||||
|
||||
type assertErr string
|
||||
|
||||
func (e assertErr) Error() string { return string(e) }
|
||||
|
||||
@@ -1,9 +1,12 @@
|
||||
package tui
|
||||
|
||||
import (
|
||||
"context"
|
||||
|
||||
"bee/audit/internal/app"
|
||||
"bee/audit/internal/platform"
|
||||
"bee/audit/internal/runtimeenv"
|
||||
"strings"
|
||||
|
||||
tea "github.com/charmbracelet/bubbletea"
|
||||
)
|
||||
@@ -11,26 +14,31 @@ import (
|
||||
type screen string
|
||||
|
||||
const (
|
||||
screenMain screen = "main"
|
||||
screenNetwork screen = "network"
|
||||
screenInterfacePick screen = "interface_pick"
|
||||
screenServices screen = "services"
|
||||
screenServiceAction screen = "service_action"
|
||||
screenAcceptance screen = "acceptance"
|
||||
screenExportTargets screen = "export_targets"
|
||||
screenOutput screen = "output"
|
||||
screenStaticForm screen = "static_form"
|
||||
screenConfirm screen = "confirm"
|
||||
screenMain screen = "main"
|
||||
screenNetwork screen = "network"
|
||||
screenInterfacePick screen = "interface_pick"
|
||||
screenServices screen = "services"
|
||||
screenServiceAction screen = "service_action"
|
||||
screenAcceptance screen = "acceptance"
|
||||
screenExportTargets screen = "export_targets"
|
||||
screenOutput screen = "output"
|
||||
screenStaticForm screen = "static_form"
|
||||
screenConfirm screen = "confirm"
|
||||
screenNvidiaSATSetup screen = "nvidia_sat_setup"
|
||||
screenNvidiaSATRunning screen = "nvidia_sat_running"
|
||||
)
|
||||
|
||||
type actionKind string
|
||||
|
||||
const (
|
||||
actionNone actionKind = ""
|
||||
actionDHCPOne actionKind = "dhcp_one"
|
||||
actionStaticIPv4 actionKind = "static_ipv4"
|
||||
actionExportAudit actionKind = "export_audit"
|
||||
actionRunNvidiaSAT actionKind = "run_nvidia_sat"
|
||||
actionNone actionKind = ""
|
||||
actionDHCPOne actionKind = "dhcp_one"
|
||||
actionStaticIPv4 actionKind = "static_ipv4"
|
||||
actionExportAudit actionKind = "export_audit"
|
||||
actionExportBundle actionKind = "export_bundle"
|
||||
actionRunNvidiaSAT actionKind = "run_nvidia_sat"
|
||||
actionRunMemorySAT actionKind = "run_memory_sat"
|
||||
actionRunStorageSAT actionKind = "run_storage_sat"
|
||||
)
|
||||
|
||||
type model struct {
|
||||
@@ -41,8 +49,10 @@ type model struct {
|
||||
prevScreen screen
|
||||
cursor int
|
||||
busy bool
|
||||
busyTitle string
|
||||
title string
|
||||
body string
|
||||
banner string
|
||||
mainMenu []string
|
||||
networkMenu []string
|
||||
serviceMenu []string
|
||||
@@ -57,6 +67,16 @@ type model struct {
|
||||
|
||||
formFields []formField
|
||||
formIndex int
|
||||
|
||||
// NVIDIA SAT setup
|
||||
nvidiaGPUs []platform.NvidiaGPU
|
||||
nvidiaGPUSel []bool
|
||||
nvidiaDurIdx int // index into nvidiaDurationOptions
|
||||
nvidiaSATCursor int
|
||||
|
||||
// NVIDIA SAT running
|
||||
nvidiaSATCancel context.CancelFunc
|
||||
nvidiaSATAborted bool
|
||||
}
|
||||
|
||||
type formField struct {
|
||||
@@ -80,32 +100,38 @@ func newModel(application *app.App, runtimeMode runtimeenv.Mode) model {
|
||||
runtimeMode: runtimeMode,
|
||||
screen: screenMain,
|
||||
mainMenu: []string{
|
||||
"Network setup",
|
||||
"bee service management",
|
||||
"System acceptance tests",
|
||||
"Run audit now",
|
||||
"Export audit to removable drive",
|
||||
"Check required tools",
|
||||
"Show last audit log tail",
|
||||
"Network",
|
||||
"Services",
|
||||
"Acceptance tests",
|
||||
"Run audit",
|
||||
"Run self-check",
|
||||
"Export audit",
|
||||
"Export support bundle",
|
||||
"Check tools",
|
||||
"Show health summary",
|
||||
"Show runtime issues",
|
||||
"Show audit logs",
|
||||
"Exit",
|
||||
},
|
||||
networkMenu: []string{
|
||||
"Show network status",
|
||||
"Show status",
|
||||
"DHCP on all interfaces",
|
||||
"DHCP on one interface",
|
||||
"Set static IPv4 on one interface",
|
||||
"Set static IPv4",
|
||||
"Back",
|
||||
},
|
||||
serviceMenu: []string{
|
||||
"status",
|
||||
"restart",
|
||||
"start",
|
||||
"stop",
|
||||
"back",
|
||||
"Status",
|
||||
"Restart",
|
||||
"Start",
|
||||
"Stop",
|
||||
"Back",
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
func (m model) Init() tea.Cmd {
|
||||
return nil
|
||||
return func() tea.Msg {
|
||||
return bannerMsg{text: strings.TrimSpace(m.app.MainBanner())}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -21,12 +21,19 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
|
||||
return m.updateKey(msg)
|
||||
case resultMsg:
|
||||
m.busy = false
|
||||
m.busyTitle = ""
|
||||
m.title = msg.title
|
||||
if msg.err != nil {
|
||||
m.body = fmt.Sprintf("%s\n\nERROR: %v", strings.TrimSpace(msg.body), msg.err)
|
||||
body := strings.TrimSpace(msg.body)
|
||||
if body == "" {
|
||||
m.body = fmt.Sprintf("ERROR: %v", msg.err)
|
||||
} else {
|
||||
m.body = fmt.Sprintf("%s\n\nERROR: %v", body, msg.err)
|
||||
}
|
||||
} else {
|
||||
m.body = msg.body
|
||||
}
|
||||
m.pendingAction = actionNone
|
||||
if msg.back != "" {
|
||||
m.prevScreen = msg.back
|
||||
} else {
|
||||
@@ -37,8 +44,9 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
|
||||
return m, nil
|
||||
case servicesMsg:
|
||||
m.busy = false
|
||||
m.busyTitle = ""
|
||||
if msg.err != nil {
|
||||
m.title = "bee services"
|
||||
m.title = "Services"
|
||||
m.body = msg.err.Error()
|
||||
m.prevScreen = screenMain
|
||||
m.screen = screenOutput
|
||||
@@ -50,6 +58,7 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
|
||||
return m, nil
|
||||
case interfacesMsg:
|
||||
m.busy = false
|
||||
m.busyTitle = ""
|
||||
if msg.err != nil {
|
||||
m.title = "interfaces"
|
||||
m.body = msg.err.Error()
|
||||
@@ -63,6 +72,7 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
|
||||
return m, nil
|
||||
case exportTargetsMsg:
|
||||
m.busy = false
|
||||
m.busyTitle = ""
|
||||
if msg.err != nil {
|
||||
m.title = "export"
|
||||
m.body = msg.err.Error()
|
||||
@@ -74,6 +84,36 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
|
||||
m.screen = screenExportTargets
|
||||
m.cursor = 0
|
||||
return m, nil
|
||||
case bannerMsg:
|
||||
m.banner = strings.TrimSpace(msg.text)
|
||||
return m, nil
|
||||
case nvidiaGPUsMsg:
|
||||
return m.handleNvidiaGPUsMsg(msg)
|
||||
case nvtopClosedMsg:
|
||||
// nvtop closed — stay on running screen (or result if SAT is already done)
|
||||
return m, nil
|
||||
case nvidiaSATDoneMsg:
|
||||
if m.nvidiaSATAborted {
|
||||
return m, nil
|
||||
}
|
||||
if m.nvidiaSATCancel != nil {
|
||||
m.nvidiaSATCancel()
|
||||
m.nvidiaSATCancel = nil
|
||||
}
|
||||
m.prevScreen = screenAcceptance
|
||||
m.screen = screenOutput
|
||||
m.title = msg.title
|
||||
if msg.err != nil {
|
||||
body := strings.TrimSpace(msg.body)
|
||||
if body == "" {
|
||||
m.body = fmt.Sprintf("ERROR: %v", msg.err)
|
||||
} else {
|
||||
m.body = fmt.Sprintf("%s\n\nERROR: %v", body, msg.err)
|
||||
}
|
||||
} else {
|
||||
m.body = msg.body
|
||||
}
|
||||
return m, nil
|
||||
}
|
||||
|
||||
return m, nil
|
||||
@@ -90,7 +130,11 @@ func (m model) updateKey(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
|
||||
case screenServiceAction:
|
||||
return m.updateMenu(msg, len(m.serviceMenu), m.handleServiceActionMenu)
|
||||
case screenAcceptance:
|
||||
return m.updateMenu(msg, 2, m.handleAcceptanceMenu)
|
||||
return m.updateMenu(msg, 4, m.handleAcceptanceMenu)
|
||||
case screenNvidiaSATSetup:
|
||||
return m.updateNvidiaSATSetup(msg)
|
||||
case screenNvidiaSATRunning:
|
||||
return m.updateNvidiaSATRunning(msg)
|
||||
case screenExportTargets:
|
||||
return m.updateMenu(msg, len(m.targets), m.handleExportTargetsMenu)
|
||||
case screenInterfacePick:
|
||||
@@ -101,6 +145,7 @@ func (m model) updateKey(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
|
||||
m.screen = m.prevScreen
|
||||
m.body = ""
|
||||
m.title = ""
|
||||
m.pendingAction = actionNone
|
||||
return m, nil
|
||||
case "ctrl+c":
|
||||
return m, tea.Quit
|
||||
|
||||
@@ -11,23 +11,31 @@ import (
|
||||
|
||||
func (m model) View() string {
|
||||
if m.busy {
|
||||
return "bee\n\nWorking...\n"
|
||||
title := "bee"
|
||||
if m.busyTitle != "" {
|
||||
title = m.busyTitle
|
||||
}
|
||||
return fmt.Sprintf("%s\n\nWorking...\n\n[ctrl+c] quit\n", title)
|
||||
}
|
||||
switch m.screen {
|
||||
case screenMain:
|
||||
return renderMenu("bee", "Select action", m.mainMenu, m.cursor)
|
||||
return renderMainMenu("bee", m.banner, "Select action", m.mainMenu, m.cursor)
|
||||
case screenNetwork:
|
||||
return renderMenu("Network", "Select action", m.networkMenu, m.cursor)
|
||||
case screenServices:
|
||||
return renderMenu("bee services", "Select service", m.services, m.cursor)
|
||||
return renderMenu("Services", "Select service", m.services, m.cursor)
|
||||
case screenServiceAction:
|
||||
items := make([]string, len(m.serviceMenu))
|
||||
copy(items, m.serviceMenu)
|
||||
return renderMenu("Service: "+m.selectedService, "Select action", items, m.cursor)
|
||||
case screenAcceptance:
|
||||
return renderMenu("System acceptance tests", "Select action", []string{"Run NVIDIA command pack", "Back"}, m.cursor)
|
||||
return renderMenu("Acceptance tests", "Select action", []string{"Run NVIDIA command pack", "Run memory test", "Run storage diagnostic pack", "Back"}, m.cursor)
|
||||
case screenExportTargets:
|
||||
return renderMenu("Export audit", "Select removable filesystem", renderTargetItems(m.targets), m.cursor)
|
||||
title := "Export audit"
|
||||
if m.pendingAction == actionExportBundle {
|
||||
title = "Export support bundle"
|
||||
}
|
||||
return renderMenu(title, "Select removable filesystem", renderTargetItems(m.targets), m.cursor)
|
||||
case screenInterfacePick:
|
||||
return renderMenu("Interfaces", "Select interface", renderInterfaceItems(m.interfaces), m.cursor)
|
||||
case screenStaticForm:
|
||||
@@ -35,6 +43,10 @@ func (m model) View() string {
|
||||
case screenConfirm:
|
||||
title, body := m.confirmBody()
|
||||
return renderConfirm(title, body, m.cursor)
|
||||
case screenNvidiaSATSetup:
|
||||
return renderNvidiaSATSetup(m)
|
||||
case screenNvidiaSATRunning:
|
||||
return renderNvidiaSATRunning()
|
||||
case screenOutput:
|
||||
return fmt.Sprintf("%s\n\n%s\n\n[enter/esc] back [ctrl+c] quit\n", m.title, strings.TrimSpace(m.body))
|
||||
default:
|
||||
@@ -49,8 +61,17 @@ func (m model) confirmBody() (string, string) {
|
||||
return "Export audit", "No target selected"
|
||||
}
|
||||
return "Export audit", fmt.Sprintf("Copy latest audit JSON to %s?", m.selectedTarget.Device)
|
||||
case actionExportBundle:
|
||||
if m.selectedTarget == nil {
|
||||
return "Export support bundle", "No target selected"
|
||||
}
|
||||
return "Export support bundle", fmt.Sprintf("Copy support bundle archive to %s?", m.selectedTarget.Device)
|
||||
case actionRunNvidiaSAT:
|
||||
return "NVIDIA SAT", "Run NVIDIA acceptance command pack?"
|
||||
case actionRunMemorySAT:
|
||||
return "Memory SAT", "Run runtime memory test with memtester?"
|
||||
case actionRunStorageSAT:
|
||||
return "Storage SAT", "Run storage diagnostic pack and start short self-tests where supported?"
|
||||
default:
|
||||
return "Confirm", "Proceed?"
|
||||
}
|
||||
@@ -101,6 +122,30 @@ func renderMenu(title, subtitle string, items []string, cursor int) string {
|
||||
return body.String()
|
||||
}
|
||||
|
||||
func renderMainMenu(title, banner, subtitle string, items []string, cursor int) string {
|
||||
var body strings.Builder
|
||||
fmt.Fprintf(&body, "%s\n\n", title)
|
||||
if banner != "" {
|
||||
body.WriteString(strings.TrimSpace(banner))
|
||||
body.WriteString("\n\n")
|
||||
}
|
||||
body.WriteString(subtitle)
|
||||
body.WriteString("\n\n")
|
||||
if len(items) == 0 {
|
||||
body.WriteString("(no items)\n")
|
||||
} else {
|
||||
for i, item := range items {
|
||||
prefix := " "
|
||||
if i == cursor {
|
||||
prefix = "> "
|
||||
}
|
||||
fmt.Fprintf(&body, "%s%s\n", prefix, item)
|
||||
}
|
||||
}
|
||||
body.WriteString("\n[↑/↓] move [enter] select [esc] back [ctrl+c] quit\n")
|
||||
return body.String()
|
||||
}
|
||||
|
||||
func renderForm(title string, fields []formField, idx int) string {
|
||||
var body strings.Builder
|
||||
fmt.Fprintf(&body, "%s\n\n", title)
|
||||
|
||||
240
audit/internal/webui/server.go
Normal file
240
audit/internal/webui/server.go
Normal file
@@ -0,0 +1,240 @@
|
||||
package webui
|
||||
|
||||
import (
|
||||
"errors"
|
||||
"fmt"
|
||||
"html"
|
||||
"net/http"
|
||||
"net/url"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strings"
|
||||
|
||||
"bee/audit/internal/app"
|
||||
"reanimator/chart/viewer"
|
||||
chartweb "reanimator/chart/web"
|
||||
)
|
||||
|
||||
const defaultTitle = "Bee Hardware Audit"
|
||||
|
||||
type HandlerOptions struct {
|
||||
Title string
|
||||
AuditPath string
|
||||
ExportDir string
|
||||
}
|
||||
|
||||
func NewHandler(opts HandlerOptions) http.Handler {
|
||||
title := strings.TrimSpace(opts.Title)
|
||||
if title == "" {
|
||||
title = defaultTitle
|
||||
}
|
||||
|
||||
auditPath := strings.TrimSpace(opts.AuditPath)
|
||||
exportDir := strings.TrimSpace(opts.ExportDir)
|
||||
if exportDir == "" {
|
||||
exportDir = app.DefaultExportDir
|
||||
}
|
||||
mux := http.NewServeMux()
|
||||
mux.Handle("GET /static/", http.StripPrefix("/static/", chartweb.Static()))
|
||||
mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
|
||||
w.Header().Set("Cache-Control", "no-store")
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_, _ = w.Write([]byte("ok"))
|
||||
})
|
||||
mux.HandleFunc("GET /audit.json", func(w http.ResponseWriter, r *http.Request) {
|
||||
data, err := loadSnapshot(auditPath)
|
||||
if err != nil {
|
||||
if errors.Is(err, os.ErrNotExist) {
|
||||
http.Error(w, "audit snapshot not found", http.StatusNotFound)
|
||||
return
|
||||
}
|
||||
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Cache-Control", "no-store")
|
||||
w.Header().Set("Content-Type", "application/json; charset=utf-8")
|
||||
_, _ = w.Write(data)
|
||||
})
|
||||
mux.HandleFunc("GET /export/support.tar.gz", func(w http.ResponseWriter, r *http.Request) {
|
||||
archive, err := app.BuildSupportBundle(exportDir)
|
||||
if err != nil {
|
||||
http.Error(w, fmt.Sprintf("build support bundle: %v", err), http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Cache-Control", "no-store")
|
||||
w.Header().Set("Content-Type", "application/gzip")
|
||||
w.Header().Set("Content-Disposition", fmt.Sprintf("attachment; filename=%q", filepath.Base(archive)))
|
||||
http.ServeFile(w, r, archive)
|
||||
})
|
||||
mux.HandleFunc("GET /runtime-health.json", func(w http.ResponseWriter, r *http.Request) {
|
||||
data, err := loadSnapshot(filepath.Join(exportDir, "runtime-health.json"))
|
||||
if err != nil {
|
||||
if errors.Is(err, os.ErrNotExist) {
|
||||
http.Error(w, "runtime health not found", http.StatusNotFound)
|
||||
return
|
||||
}
|
||||
http.Error(w, fmt.Sprintf("read runtime health: %v", err), http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Cache-Control", "no-store")
|
||||
w.Header().Set("Content-Type", "application/json; charset=utf-8")
|
||||
_, _ = w.Write(data)
|
||||
})
|
||||
mux.HandleFunc("GET /export/", func(w http.ResponseWriter, r *http.Request) {
|
||||
body, err := renderExportIndex(exportDir)
|
||||
if err != nil {
|
||||
http.Error(w, fmt.Sprintf("render export index: %v", err), http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Cache-Control", "no-store")
|
||||
w.Header().Set("Content-Type", "text/html; charset=utf-8")
|
||||
_, _ = w.Write([]byte(body))
|
||||
})
|
||||
mux.HandleFunc("GET /export/file", func(w http.ResponseWriter, r *http.Request) {
|
||||
rel := strings.TrimSpace(r.URL.Query().Get("path"))
|
||||
if rel == "" {
|
||||
http.Error(w, "path is required", http.StatusBadRequest)
|
||||
return
|
||||
}
|
||||
clean := filepath.Clean(rel)
|
||||
if clean == "." || strings.HasPrefix(clean, "..") {
|
||||
http.Error(w, "invalid path", http.StatusBadRequest)
|
||||
return
|
||||
}
|
||||
http.ServeFile(w, r, filepath.Join(exportDir, clean))
|
||||
})
|
||||
mux.HandleFunc("GET /viewer", func(w http.ResponseWriter, r *http.Request) {
|
||||
snapshot, err := loadSnapshot(auditPath)
|
||||
if err != nil && !errors.Is(err, os.ErrNotExist) {
|
||||
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
html, err := viewer.RenderHTML(snapshot, title)
|
||||
if err != nil {
|
||||
http.Error(w, fmt.Sprintf("render snapshot: %v", err), http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Cache-Control", "no-store")
|
||||
w.Header().Set("Content-Type", "text/html; charset=utf-8")
|
||||
_, _ = w.Write(html)
|
||||
})
|
||||
mux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
|
||||
noticeTitle, noticeBody := runtimeNotice(filepath.Join(exportDir, "runtime-health.json"))
|
||||
body := renderShellPage(title, noticeTitle, noticeBody)
|
||||
w.Header().Set("Cache-Control", "no-store")
|
||||
w.Header().Set("Content-Type", "text/html; charset=utf-8")
|
||||
_, _ = w.Write([]byte(body))
|
||||
})
|
||||
return mux
|
||||
}
|
||||
|
||||
func ListenAndServe(addr string, opts HandlerOptions) error {
|
||||
return http.ListenAndServe(addr, NewHandler(opts))
|
||||
}
|
||||
|
||||
func loadSnapshot(path string) ([]byte, error) {
|
||||
if strings.TrimSpace(path) == "" {
|
||||
return nil, os.ErrNotExist
|
||||
}
|
||||
return os.ReadFile(path)
|
||||
}
|
||||
|
||||
func runtimeNotice(path string) (string, string) {
|
||||
health, err := app.ReadRuntimeHealth(path)
|
||||
if err != nil {
|
||||
return "Runtime Health", "No runtime health snapshot found yet."
|
||||
}
|
||||
body := fmt.Sprintf("Status: %s. Export dir: %s. Driver ready: %t. CUDA ready: %t. Network: %s. Export files: /export/",
|
||||
firstNonEmpty(health.Status, "UNKNOWN"),
|
||||
firstNonEmpty(health.ExportDir, app.DefaultExportDir),
|
||||
health.DriverReady,
|
||||
health.CUDAReady,
|
||||
firstNonEmpty(health.NetworkStatus, "UNKNOWN"),
|
||||
)
|
||||
if len(health.Issues) > 0 {
|
||||
body += " Issues: "
|
||||
parts := make([]string, 0, len(health.Issues))
|
||||
for _, issue := range health.Issues {
|
||||
parts = append(parts, issue.Code)
|
||||
}
|
||||
body += strings.Join(parts, ", ")
|
||||
}
|
||||
return "Runtime Health", body
|
||||
}
|
||||
|
||||
func renderExportIndex(exportDir string) (string, error) {
|
||||
var entries []string
|
||||
err := filepath.Walk(strings.TrimSpace(exportDir), func(path string, info os.FileInfo, err error) error {
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if info.IsDir() {
|
||||
return nil
|
||||
}
|
||||
rel, err := filepath.Rel(exportDir, path)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
entries = append(entries, rel)
|
||||
return nil
|
||||
})
|
||||
if err != nil && !errors.Is(err, os.ErrNotExist) {
|
||||
return "", err
|
||||
}
|
||||
sort.Strings(entries)
|
||||
var body strings.Builder
|
||||
body.WriteString("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><title>Bee Export Files</title></head><body>")
|
||||
body.WriteString("<h1>Bee Export Files</h1><ul>")
|
||||
for _, entry := range entries {
|
||||
body.WriteString("<li><a href=\"/export/file?path=" + url.QueryEscape(entry) + "\">" + html.EscapeString(entry) + "</a></li>")
|
||||
}
|
||||
if len(entries) == 0 {
|
||||
body.WriteString("<li>No export files found.</li>")
|
||||
}
|
||||
body.WriteString("</ul></body></html>")
|
||||
return body.String(), nil
|
||||
}
|
||||
|
||||
func renderShellPage(title, noticeTitle, noticeBody string) string {
|
||||
var body strings.Builder
|
||||
body.WriteString("<!DOCTYPE html><html><head><meta charset=\"utf-8\"><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">")
|
||||
body.WriteString("<title>" + html.EscapeString(title) + "</title>")
|
||||
body.WriteString(`<style>
|
||||
body{margin:0;font-family:system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",sans-serif;background:#f4f1ea;color:#1b1b18}
|
||||
.shell{min-height:100vh;display:grid;grid-template-rows:auto auto 1fr}
|
||||
.header{padding:18px 20px 12px;border-bottom:1px solid rgba(0,0,0,.08);background:#fbf8f2}
|
||||
.header h1{margin:0;font-size:24px}
|
||||
.header p{margin:6px 0 0;color:#5a5a52}
|
||||
.actions{display:flex;flex-wrap:wrap;gap:10px;padding:12px 20px;background:#fbf8f2}
|
||||
.actions a{display:inline-block;text-decoration:none;padding:10px 14px;border-radius:999px;background:#1f5f4a;color:#fff;font-weight:600}
|
||||
.actions a.secondary{background:#d8e5dd;color:#17372b}
|
||||
.notice{margin:16px 20px 0;padding:14px 16px;border-radius:14px;background:#fff7df;border:1px solid #ead9a4}
|
||||
.notice h2{margin:0 0 6px;font-size:16px}
|
||||
.notice p{margin:0;color:#4f4a37}
|
||||
.viewer-wrap{padding:16px 20px 20px}
|
||||
.viewer{width:100%;height:calc(100vh - 170px);border:0;border-radius:18px;background:#fff;box-shadow:0 12px 40px rgba(0,0,0,.08)}
|
||||
@media (max-width:720px){.viewer{height:calc(100vh - 240px)}}
|
||||
</style></head><body><div class="shell">`)
|
||||
body.WriteString("<header class=\"header\"><h1>" + html.EscapeString(title) + "</h1><p>Audit viewer with support bundle and raw export access.</p></header>")
|
||||
body.WriteString("<nav class=\"actions\">")
|
||||
body.WriteString("<a href=\"/export/support.tar.gz\">Download support bundle</a>")
|
||||
body.WriteString("<a class=\"secondary\" href=\"/audit.json\">Open audit.json</a>")
|
||||
body.WriteString("<a class=\"secondary\" href=\"/runtime-health.json\">Open runtime-health.json</a>")
|
||||
body.WriteString("<a class=\"secondary\" href=\"/export/\">Browse export files</a>")
|
||||
body.WriteString("</nav>")
|
||||
if strings.TrimSpace(noticeTitle) != "" {
|
||||
body.WriteString("<section class=\"notice\"><h2>" + html.EscapeString(noticeTitle) + "</h2><p>" + html.EscapeString(noticeBody) + "</p></section>")
|
||||
}
|
||||
body.WriteString("<main class=\"viewer-wrap\"><iframe class=\"viewer\" src=\"/viewer\" loading=\"eager\" referrerpolicy=\"same-origin\"></iframe></main>")
|
||||
body.WriteString("</div></body></html>")
|
||||
return body.String()
|
||||
}
|
||||
|
||||
func firstNonEmpty(value, fallback string) string {
|
||||
value = strings.TrimSpace(value)
|
||||
if value == "" {
|
||||
return fallback
|
||||
}
|
||||
return value
|
||||
}
|
||||
167
audit/internal/webui/server_test.go
Normal file
167
audit/internal/webui/server_test.go
Normal file
@@ -0,0 +1,167 @@
|
||||
package webui
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestRootRendersShellWithIframe(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "audit.json")
|
||||
exportDir := filepath.Join(dir, "export")
|
||||
if err := os.MkdirAll(exportDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
handler := NewHandler(HandlerOptions{
|
||||
Title: "Bee Hardware Audit",
|
||||
AuditPath: path,
|
||||
ExportDir: exportDir,
|
||||
})
|
||||
|
||||
first := httptest.NewRecorder()
|
||||
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/", nil))
|
||||
if first.Code != http.StatusOK {
|
||||
t.Fatalf("first status=%d", first.Code)
|
||||
}
|
||||
if !strings.Contains(first.Body.String(), `iframe`) || !strings.Contains(first.Body.String(), `src="/viewer"`) {
|
||||
t.Fatalf("first body missing iframe viewer: %s", first.Body.String())
|
||||
}
|
||||
if !strings.Contains(first.Body.String(), "/export/support.tar.gz") {
|
||||
t.Fatalf("first body missing support bundle link: %s", first.Body.String())
|
||||
}
|
||||
if got := first.Header().Get("Cache-Control"); got != "no-store" {
|
||||
t.Fatalf("first cache-control=%q", got)
|
||||
}
|
||||
|
||||
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
second := httptest.NewRecorder()
|
||||
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/", nil))
|
||||
if second.Code != http.StatusOK {
|
||||
t.Fatalf("second status=%d", second.Code)
|
||||
}
|
||||
if !strings.Contains(second.Body.String(), `src="/viewer"`) {
|
||||
t.Fatalf("second body missing iframe viewer: %s", second.Body.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestViewerRendersLatestSnapshot(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "audit.json")
|
||||
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
handler := NewHandler(HandlerOptions{AuditPath: path})
|
||||
first := httptest.NewRecorder()
|
||||
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/viewer", nil))
|
||||
if first.Code != http.StatusOK {
|
||||
t.Fatalf("first status=%d", first.Code)
|
||||
}
|
||||
if !strings.Contains(first.Body.String(), "SERIAL-OLD") {
|
||||
t.Fatalf("viewer body missing old serial: %s", first.Body.String())
|
||||
}
|
||||
|
||||
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
second := httptest.NewRecorder()
|
||||
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/viewer", nil))
|
||||
if second.Code != http.StatusOK {
|
||||
t.Fatalf("second status=%d", second.Code)
|
||||
}
|
||||
if !strings.Contains(second.Body.String(), "SERIAL-NEW") {
|
||||
t.Fatalf("viewer body missing new serial: %s", second.Body.String())
|
||||
}
|
||||
if strings.Contains(second.Body.String(), "SERIAL-OLD") {
|
||||
t.Fatalf("viewer body still contains old serial: %s", second.Body.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestAuditJSONServesLatestSnapshot(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "audit.json")
|
||||
body := `{"hardware":{"board":{"serial_number":"SERIAL-API"}}}`
|
||||
if err := os.WriteFile(path, []byte(body), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
handler := NewHandler(HandlerOptions{AuditPath: path})
|
||||
rec := httptest.NewRecorder()
|
||||
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
|
||||
if rec.Code != http.StatusOK {
|
||||
t.Fatalf("status=%d", rec.Code)
|
||||
}
|
||||
if got := strings.TrimSpace(rec.Body.String()); got != body {
|
||||
t.Fatalf("body=%q want %q", got, body)
|
||||
}
|
||||
if got := rec.Header().Get("Content-Type"); !strings.Contains(got, "application/json") {
|
||||
t.Fatalf("content-type=%q", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestMissingAuditJSONReturnsNotFound(t *testing.T) {
|
||||
handler := NewHandler(HandlerOptions{AuditPath: "/missing/audit.json"})
|
||||
rec := httptest.NewRecorder()
|
||||
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
|
||||
if rec.Code != http.StatusNotFound {
|
||||
t.Fatalf("status=%d want %d", rec.Code, http.StatusNotFound)
|
||||
}
|
||||
}
|
||||
|
||||
func TestSupportBundleEndpointReturnsArchive(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
exportDir := filepath.Join(dir, "export")
|
||||
if err := os.MkdirAll(exportDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
|
||||
rec := httptest.NewRecorder()
|
||||
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/export/support.tar.gz", nil))
|
||||
if rec.Code != http.StatusOK {
|
||||
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
|
||||
}
|
||||
if got := rec.Header().Get("Content-Disposition"); !strings.Contains(got, "attachment;") {
|
||||
t.Fatalf("content-disposition=%q", got)
|
||||
}
|
||||
if rec.Body.Len() == 0 {
|
||||
t.Fatal("empty archive body")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRuntimeHealthEndpointReturnsJSON(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
exportDir := filepath.Join(dir, "export")
|
||||
if err := os.MkdirAll(exportDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
body := `{"status":"PARTIAL","checked_at":"2026-03-16T10:00:00Z"}`
|
||||
if err := os.WriteFile(filepath.Join(exportDir, "runtime-health.json"), []byte(body), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
|
||||
rec := httptest.NewRecorder()
|
||||
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/runtime-health.json", nil))
|
||||
if rec.Code != http.StatusOK {
|
||||
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
|
||||
}
|
||||
if strings.TrimSpace(rec.Body.String()) != body {
|
||||
t.Fatalf("body=%q want %q", strings.TrimSpace(rec.Body.String()), body)
|
||||
}
|
||||
}
|
||||
@@ -9,4 +9,5 @@ Generic engineering rules live in `bible/rules/patterns/`.
|
||||
|---|---|
|
||||
| `architecture/system-overview.md` | What bee does, scope, tech stack |
|
||||
| `architecture/runtime-flows.md` | Boot sequence, audit flow, service order |
|
||||
| `docs/hardware-ingest-contract.md` | Current Reanimator hardware ingest JSON contract |
|
||||
| `decisions/` | Architectural decision log |
|
||||
|
||||
@@ -18,8 +18,10 @@ local-fs.target
|
||||
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
|
||||
├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
|
||||
│ creates /dev/nvidia* nodes)
|
||||
└── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
|
||||
never blocks boot on partial collector failures)
|
||||
├── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
|
||||
│ never blocks boot on partial collector failures)
|
||||
└── bee-web.service (runs `bee web` on :80,
|
||||
reads the latest audit snapshot on each request)
|
||||
```
|
||||
|
||||
**Critical invariants:**
|
||||
@@ -29,17 +31,39 @@ local-fs.target
|
||||
Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
|
||||
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
|
||||
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
|
||||
- `bee-web.service` binds `0.0.0.0:80` and always renders the current `/var/log/bee-audit.json` contents.
|
||||
- Audit JSON now includes a `hardware.summary` block with overall verdict and warning/failure counts.
|
||||
|
||||
## Console and login flow
|
||||
|
||||
Local-console behavior:
|
||||
|
||||
```text
|
||||
tty1
|
||||
└── live-config autologin → bee
|
||||
└── /home/bee/.profile
|
||||
└── exec menu
|
||||
└── /usr/local/bin/bee-tui
|
||||
└── sudo -n /usr/local/bin/bee tui --runtime livecd
|
||||
```
|
||||
|
||||
Rules:
|
||||
- local `tty1` lands in user `bee`, not directly in `root`
|
||||
- `menu` must work without typing `sudo`
|
||||
- TUI actions still run as `root` via `sudo -n`
|
||||
- SSH is independent from the tty1 path
|
||||
- serial console support is enabled for VM boot debugging
|
||||
|
||||
## ISO build sequence
|
||||
|
||||
```
|
||||
build.sh [--authorized-keys /path/to/keys]
|
||||
build-in-container.sh [--authorized-keys /path/to/keys]
|
||||
1. compile `bee` binary (skip if .go files older than binary)
|
||||
2. create a temporary overlay staging dir under `dist/`
|
||||
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
|
||||
4. copy `bee` binary → staged `/usr/local/bin/bee`
|
||||
5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
|
||||
(`storcli64`, `sas2ircu`, `sas3ircu`, `mstflint` — each optional)
|
||||
(`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli` — optional; `mstflint` comes from the Debian package set)
|
||||
6. `build-nvidia-module.sh`:
|
||||
a. install Debian kernel headers if missing
|
||||
b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
|
||||
@@ -54,13 +78,12 @@ build.sh [--authorized-keys /path/to/keys]
|
||||
11. patch staged `motd` with build metadata
|
||||
12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
|
||||
13. sync staged overlay into workdir `config/includes.chroot/`
|
||||
14. run `lb config && lb build` inside the temporary workdir
|
||||
(either on a Debian host/VM or inside the privileged builder container)
|
||||
14. run `lb config && lb build` inside the privileged builder container
|
||||
```
|
||||
|
||||
**Critical invariants:**
|
||||
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
|
||||
1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
|
||||
1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
|
||||
2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
|
||||
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
|
||||
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
|
||||
@@ -80,6 +103,10 @@ Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shi
|
||||
Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
|
||||
systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
|
||||
|
||||
Current validation state:
|
||||
- local/libvirt VM boot path is validated for `systemd`, SSH, `bee audit`, `bee-network`, and TUI startup
|
||||
- real hardware validation is still required before treating the ISO as release-ready
|
||||
|
||||
## Overlay mechanism
|
||||
|
||||
`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
|
||||
@@ -95,10 +122,52 @@ systemd services running, audit completed with NVIDIA enrichment, LAN reachabili
|
||||
3. memory collector (dmidecode -t 17)
|
||||
4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
|
||||
5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/)
|
||||
6. psu collector (ipmitool fru — silent if no /dev/ipmi0)
|
||||
6. psu collector (ipmitool fru + sdr — silent if no /dev/ipmi0)
|
||||
7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
|
||||
8. output JSON → /var/log/bee-audit.json
|
||||
9. QR summary to stdout (qrencode if available)
|
||||
```
|
||||
|
||||
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
|
||||
|
||||
Acceptance flows:
|
||||
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-stress`
|
||||
- `bee sat memory` → `memtester` archive
|
||||
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
|
||||
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
|
||||
- Runtime overrides:
|
||||
- `BEE_GPU_STRESS_SECONDS`
|
||||
- `BEE_GPU_STRESS_SIZE_MB`
|
||||
- `BEE_MEMTESTER_SIZE_MB`
|
||||
- `BEE_MEMTESTER_PASSES`
|
||||
|
||||
## NVIDIA SAT TUI flow (v1.0.0+)
|
||||
|
||||
```
|
||||
TUI: Acceptance tests → NVIDIA command pack
|
||||
1. screenNvidiaSATSetup
|
||||
a. enumerate GPUs via `nvidia-smi --query-gpu=index,name,memory.total`
|
||||
b. user selects duration preset: 10 min / 1 h / 8 h / 24 h
|
||||
c. user selects GPUs via checkboxes (all selected by default)
|
||||
d. memory size = max(selected GPU memory) — auto-detected, not exposed to user
|
||||
2. Start → screenNvidiaSATRunning
|
||||
a. CUDA_VISIBLE_DEVICES set to selected GPU indices
|
||||
b. tea.Batch: SAT goroutine + tea.ExecProcess(nvtop) launched concurrently
|
||||
c. nvtop occupies full terminal; SAT result queues in background
|
||||
d. [o] reopen nvtop at any time; [a] abort (cancels context → kills bee-gpu-stress)
|
||||
3. GPU metrics collection (during bee-gpu-stress)
|
||||
- background goroutine polls `nvidia-smi` every second
|
||||
- per-second rows: elapsed, GPU index, temp°C, usage%, power W, clock MHz
|
||||
- outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG chart), gpu-metrics-term.txt
|
||||
4. After SAT completes
|
||||
- result shown in screenOutput with terminal line-chart (gpu-metrics-term.txt)
|
||||
- chart is asciigraph-style: box-drawing chars (╭╮╰╯─│), 4 series per GPU,
|
||||
Y axis with ticks, ANSI colours (red=temp, blue=usage, green=power, yellow=clock)
|
||||
```
|
||||
|
||||
**Critical invariants:**
|
||||
- `nvtop` must be in `iso/builder/config/package-lists/bee.list.chroot` (baked into ISO).
|
||||
- `bee-gpu-stress` uses `exec.CommandContext` — aborted on cancel.
|
||||
- Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed).
|
||||
- If `nvtop` is not found on PATH, SAT still runs without it (graceful degradation).
|
||||
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
|
||||
Hardware audit LiveCD. Boots on a server via BMC virtual media or USB.
|
||||
Collects hardware inventory at OS level (not through BMC/Redfish).
|
||||
Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
|
||||
Produces `HardwareIngestRequest` JSON compatible with the contract in `bible-local/docs/hardware-ingest-contract.md`.
|
||||
|
||||
## Why it exists
|
||||
|
||||
@@ -19,10 +19,15 @@ Fills gaps where Redfish/logpile is blind:
|
||||
## In scope
|
||||
|
||||
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
|
||||
- Unattended operation — no user interaction required
|
||||
- Machine-readable health summary derived from collector verdicts
|
||||
- Operator-triggered acceptance tests for NVIDIA, memory, and storage
|
||||
- NVIDIA SAT includes both diagnostic collection and lightweight GPU stress via `bee-gpu-stress`
|
||||
- Automatic boot audit with operator-facing local console and SSH access
|
||||
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
|
||||
- SSH access (OpenSSH) always available for inspection and debugging
|
||||
- Interactive Go TUI via `bee tui` for network setup, service management, and acceptance tests
|
||||
- Read-only web viewer via `bee web`, rendering the latest audit snapshot through the embedded Reanimator Chart
|
||||
- Local `tty1` operator UX: `bee` autologin, `menu` auto-start, privileged actions via `sudo -n`
|
||||
|
||||
## Network isolation — CRITICAL
|
||||
|
||||
@@ -42,6 +47,16 @@ Fills gaps where Redfish/logpile is blind:
|
||||
- Anything requiring persistent storage on the audited machine
|
||||
- Windows support
|
||||
- Any functionality requiring internet access at boot
|
||||
- Component lifecycle/history across multiple snapshots
|
||||
- Status transition history (`status_history`, `status_changed_at`) derived from previous exports
|
||||
- Replacement detection between two or more audit runs
|
||||
|
||||
## Contract boundary
|
||||
|
||||
- `bee` is responsible for the current hardware snapshot only.
|
||||
- `bee` should populate current component state, hardware inventory, telemetry, and `status_checked_at`.
|
||||
- Historical status transitions and component replacement logic belong to the centralized ingest/lifecycle system, not to `bee`.
|
||||
- Contract fields that have no honest local source on a generic Linux host may remain empty.
|
||||
|
||||
## Tech stack
|
||||
|
||||
@@ -56,6 +71,14 @@ Fills gaps where Redfish/logpile is blind:
|
||||
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
|
||||
| Builder | Debian 12 host/VM or Debian 12 container image |
|
||||
|
||||
## Operator UX
|
||||
|
||||
- On the live ISO, `tty1` autologins as `bee`
|
||||
- The login profile auto-runs `menu`, which enters the Go TUI
|
||||
- The TUI itself executes privileged actions as `root` via `sudo -n`
|
||||
- SSH remains available independently of the local console path
|
||||
- VM-oriented builds also include `qemu-guest-agent` and serial console support for debugging
|
||||
|
||||
## Runtime split
|
||||
|
||||
- The main Go application must run both on a normal Linux host and inside the live ISO
|
||||
@@ -72,8 +95,11 @@ Fills gaps where Redfish/logpile is blind:
|
||||
| `audit/internal/schema/` | HardwareIngestRequest types |
|
||||
| `iso/builder/` | ISO build scripts and `live-build` profile |
|
||||
| `iso/overlay/` | Source overlay copied into a staged build overlay |
|
||||
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, mstflint, …) |
|
||||
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, arcconf, ssacli, …) |
|
||||
| `internal/chart/` | Git submodule with `reanimator/chart`, embedded into `bee web` |
|
||||
| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
|
||||
| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
|
||||
| `iso/overlay/etc/profile.d/bee.sh` | `menu` helper + tty1 auto-start policy |
|
||||
| `iso/overlay/home/bee/.profile` | `bee` shell profile for local console startup |
|
||||
| `dist/` | Build outputs (gitignored) |
|
||||
| `iso/out/` | Downloaded ISO files (gitignored) |
|
||||
|
||||
@@ -1,22 +1,68 @@
|
||||
# Backlog
|
||||
|
||||
## GPU stress test (H100)
|
||||
## Real hardware validation
|
||||
|
||||
**Статус:** отложено. В текущем ISO `gpu_burn` не включается и не запускается.
|
||||
**Статус:** ожидает доступа к железу.
|
||||
|
||||
**Почему задача всё ещё в backlog:**
|
||||
- `gpu_burn` остаётся тяжёлым и неудобным с точки зрения зависимостей
|
||||
- хочется штатный lightweight stress tool без `libcublas.so` и без заметного раздувания ISO
|
||||
- для H100 нужен предсказуемый offline-инструмент, который можно стабильно возить внутри ISO
|
||||
Что осталось подтвердить на практике:
|
||||
- `bee sat nvidia` на реальном NVIDIA GPU host
|
||||
- `bee sat storage` на NVMe/SATA/RAID host
|
||||
- `ipmitool sdr` parsing на сервере с реальным BMC/IPMI
|
||||
- vendor RAID tooling (`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli`) в живом ISO
|
||||
|
||||
**Желаемый следующий шаг:** написать минимальный stress tool на CUDA Driver API
|
||||
- использует только `libcuda.so`, уже присутствующий в ISO
|
||||
- выполняет простой compute / memory workload через `cuLaunchKernel`
|
||||
- собирается отдельно на builder VM и кладётся в `iso/vendor/`
|
||||
- в будущем может вызываться из `bee tui` как предпочтительный встроенный GPU SAT/stress path
|
||||
## SAT result polish
|
||||
|
||||
**Отклонённые / проблемные варианты:**
|
||||
- `gpu_burn` — нужен libcublas (~500MB)
|
||||
- `nvbandwidth` — только bandwidth, не жжёт FLOPs; нужен libcudart (~8MB)
|
||||
- DCGM diag — правильный инструмент для H100 но ~100MB установка
|
||||
- Download on demand — нужен libcublas, проблема та же
|
||||
**Статус:** частично закрыто.
|
||||
|
||||
Что ещё можно улучшить после полевой проверки:
|
||||
- точнее классифицировать vendor-specific self-test outputs в `storage SAT`
|
||||
- подобрать дефолты `memtester` по объёму RAM на целевых машинах
|
||||
- при необходимости расширить `bee-gpu-stress` по длительности/нагрузке
|
||||
|
||||
## Hardware Contract backlog
|
||||
|
||||
**Статус:** уточнён, сокращён до `bee`-only snapshot scope.
|
||||
|
||||
### Не backlog для `bee`
|
||||
|
||||
Эти задачи не должны реализовываться в `bee`, потому что относятся к централизованному ingest/lifecycle слою:
|
||||
- `status_history`
|
||||
- `status_changed_at`
|
||||
- определение замены компонента между snapshot'ами
|
||||
- timeline/lifecycle/history по diff между экспортами
|
||||
|
||||
`bee` отвечает только за текущий snapshot железа и `status_checked_at`.
|
||||
|
||||
### Реализуемо инкрементально
|
||||
|
||||
Эти поля можно развивать дальше по мере появления реальных sample outputs и vendor-specific parser'ов:
|
||||
- `cpus.correctable_error_count`
|
||||
- `cpus.uncorrectable_error_count`
|
||||
- `power_supplies.life_remaining_pct`
|
||||
- `power_supplies.life_used_pct`
|
||||
- `pcie_devices.battery_charge_pct`
|
||||
- `pcie_devices.battery_health_pct`
|
||||
- `pcie_devices.battery_temperature_c`
|
||||
- `pcie_devices.battery_voltage_v`
|
||||
- `pcie_devices.battery_replace_required`
|
||||
|
||||
### Vendor/platform-specific, часто пустые
|
||||
|
||||
Эти поля допустимо оставлять пустыми на части платформ даже после реализации parser'ов:
|
||||
- `power_supplies.life_remaining_pct`
|
||||
- `power_supplies.life_used_pct`
|
||||
- часть `pcie_devices.battery_*` для неподдержанных RAID/NIC/GPU вендоров
|
||||
|
||||
### Unsupported в `bee`
|
||||
|
||||
Эти поля считаются нереалистичными для общего OS-level hardware snapshotter без synthetic/fake data:
|
||||
- `cpus.life_remaining_pct`
|
||||
- `cpus.life_used_pct`
|
||||
- `memory.life_remaining_pct`
|
||||
- `memory.life_used_pct`
|
||||
- `memory.spare_blocks_remaining_pct`
|
||||
- `memory.performance_degraded`
|
||||
|
||||
Причина: у обычного Linux-host audit обычно нет честного vendor-neutral runtime source для этих метрик.
|
||||
|
||||
Эти поля считаются дропнутыми из backlog `bee` и не должны возвращаться в план работ без появления нового доказуемого локального источника данных на целевых машинах.
|
||||
|
||||
793
bible-local/docs/hardware-ingest-contract.md
Normal file
793
bible-local/docs/hardware-ingest-contract.md
Normal file
@@ -0,0 +1,793 @@
|
||||
---
|
||||
title: Hardware Ingest JSON Contract
|
||||
version: "2.7"
|
||||
updated: "2026-03-15"
|
||||
maintainer: Reanimator Core
|
||||
audience: external-integrators, ai-agents
|
||||
language: ru
|
||||
---
|
||||
|
||||
# Интеграция с Reanimator: контракт JSON-импорта аппаратного обеспечения
|
||||
|
||||
Версия: **2.7** · Дата: **2026-03-15**
|
||||
|
||||
Документ описывает формат JSON для передачи данных об аппаратном обеспечении серверов в систему **Reanimator** (управление жизненным циклом аппаратного обеспечения).
|
||||
Предназначен для разработчиков смежных систем (Redfish-коллекторов, агентов мониторинга, CMDB-экспортёров) и может быть включён в документацию интегрируемых проектов.
|
||||
|
||||
> Актуальная версия документа: https://git.mchus.pro/reanimator/core/src/branch/main/bible-local/docs/hardware-ingest-contract.md
|
||||
|
||||
---
|
||||
|
||||
## Changelog
|
||||
|
||||
| Версия | Дата | Изменения |
|
||||
|--------|------|-----------|
|
||||
| 2.7 | 2026-03-15 | Явно запрещён синтез данных в `event_logs`; интеграторы не должны придумывать серийные номера компонентов, если источник их не отдал |
|
||||
| 2.6 | 2026-03-15 | Добавлена необязательная секция `event_logs` для dedup/upsert логов `host` / `bmc` / `redfish` вне history timeline |
|
||||
| 2.5 | 2026-03-15 | Добавлено общее необязательное поле `manufactured_year_week` для компонентных секций (`YYYY-Www`) |
|
||||
| 2.4 | 2026-03-15 | Добавлена первая волна component telemetry: health/life поля для `cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies` |
|
||||
| 2.3 | 2026-03-15 | Добавлены component telemetry поля: `pcie_devices.temperature_c`, `pcie_devices.power_w`, `power_supplies.temperature_c` |
|
||||
| 2.2 | 2026-03-15 | Добавлено поле `numa_node` у `pcie_devices` для topology/affinity |
|
||||
| 2.1 | 2026-03-15 | Добавлена секция `sensors` (fans, power, temperatures, other); поле `mac_addresses` у `pcie_devices`; расширен список значений `device_class` |
|
||||
| 2.0 | 2026-02-01 | История статусов (`status_history`, `status_changed_at`); поля telemetry у PSU; async job response |
|
||||
| 1.0 | 2026-01-01 | Начальная версия контракта |
|
||||
|
||||
---
|
||||
|
||||
## Принципы
|
||||
|
||||
1. **Snapshot** — JSON описывает состояние сервера на момент сбора. Может включать историю изменений статуса компонентов.
|
||||
2. **Идемпотентность** — повторная отправка идентичного payload не создаёт дублей (дедупликация по хешу).
|
||||
3. **Частичность** — можно передавать только те секции, данные по которым доступны. Пустой массив и отсутствие секции эквивалентны.
|
||||
4. **Строгая схема** — endpoint использует строгий JSON-декодер; неизвестные поля приводят к `400 Bad Request`.
|
||||
5. **Event-driven** — импорт создаёт события в timeline (LOG_COLLECTED, INSTALLED, REMOVED, FIRMWARE_CHANGED и др.).
|
||||
6. **Без синтеза со стороны интегратора** — сборщик передаёт только фактически собранные значения. Нельзя придумывать `serial_number`, `component_ref`, `message`, `message_id` или другие идентификаторы/атрибуты, если источник их не предоставил или парсер не смог их надёжно извлечь.
|
||||
|
||||
---
|
||||
|
||||
## Endpoint
|
||||
|
||||
```
|
||||
POST /ingest/hardware
|
||||
Content-Type: application/json
|
||||
```
|
||||
|
||||
**Ответ при приёме (202 Accepted):**
|
||||
```json
|
||||
{
|
||||
"status": "accepted",
|
||||
"job_id": "job_01J..."
|
||||
}
|
||||
```
|
||||
|
||||
Импорт выполняется асинхронно. Результат доступен по:
|
||||
```
|
||||
GET /ingest/hardware/jobs/{job_id}
|
||||
```
|
||||
|
||||
**Ответ при успехе задачи:**
|
||||
```json
|
||||
{
|
||||
"status": "success",
|
||||
"bundle_id": "lb_01J...",
|
||||
"asset_id": "mach_01J...",
|
||||
"collected_at": "2026-02-10T15:30:00Z",
|
||||
"duplicate": false,
|
||||
"summary": {
|
||||
"parts_observed": 15,
|
||||
"parts_created": 2,
|
||||
"parts_updated": 13,
|
||||
"installations_created": 2,
|
||||
"installations_closed": 1,
|
||||
"timeline_events_created": 9,
|
||||
"failure_events_created": 1
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Ответ при дубликате:**
|
||||
```json
|
||||
{
|
||||
"status": "success",
|
||||
"duplicate": true,
|
||||
"message": "LogBundle with this content hash already exists"
|
||||
}
|
||||
```
|
||||
|
||||
**Ответ при ошибке (400 Bad Request):**
|
||||
```json
|
||||
{
|
||||
"status": "error",
|
||||
"error": "validation_failed",
|
||||
"details": {
|
||||
"field": "hardware.board.serial_number",
|
||||
"message": "serial_number is required"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Частые причины `400`:
|
||||
- Неверный формат `collected_at` (требуется RFC3339).
|
||||
- Пустой `hardware.board.serial_number`.
|
||||
- Наличие неизвестного JSON-поля на любом уровне.
|
||||
- Тело запроса превышает допустимый размер.
|
||||
|
||||
---
|
||||
|
||||
## Структура верхнего уровня
|
||||
|
||||
```json
|
||||
{
|
||||
"filename": "redfish://10.10.10.103",
|
||||
"source_type": "api",
|
||||
"protocol": "redfish",
|
||||
"target_host": "10.10.10.103",
|
||||
"collected_at": "2026-02-10T15:30:00Z",
|
||||
"hardware": {
|
||||
"board": { ... },
|
||||
"firmware": [ ... ],
|
||||
"cpus": [ ... ],
|
||||
"memory": [ ... ],
|
||||
"storage": [ ... ],
|
||||
"pcie_devices": [ ... ],
|
||||
"power_supplies": [ ... ],
|
||||
"sensors": { ... },
|
||||
"event_logs": [ ... ]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Поля верхнего уровня
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `collected_at` | string RFC3339 | **да** | Время сбора данных |
|
||||
| `hardware` | object | **да** | Аппаратный снапшот |
|
||||
| `hardware.board.serial_number` | string | **да** | Серийный номер платы/сервера |
|
||||
| `target_host` | string | нет | IP или hostname |
|
||||
| `source_type` | string | нет | Тип источника: `api`, `logfile`, `manual` |
|
||||
| `protocol` | string | нет | Протокол: `redfish`, `ipmi`, `snmp`, `ssh` |
|
||||
| `filename` | string | нет | Идентификатор источника |
|
||||
|
||||
---
|
||||
|
||||
## Общие поля статуса компонентов
|
||||
|
||||
Применяются ко всем компонентным секциям (`cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies`).
|
||||
|
||||
| Поле | Тип | Описание |
|
||||
|------|-----|----------|
|
||||
| `status` | string | Текущий статус: `OK`, `Warning`, `Critical`, `Unknown`, `Empty` |
|
||||
| `status_checked_at` | string RFC3339 | Время последней проверки статуса |
|
||||
| `status_changed_at` | string RFC3339 | Время последнего изменения статуса |
|
||||
| `status_history` | array | История переходов статусов (см. ниже) |
|
||||
| `error_description` | string | Текст ошибки/диагностики |
|
||||
| `manufactured_year_week` | string | Дата производства в формате `YYYY-Www`, например `2024-W07` |
|
||||
|
||||
**Объект `status_history[]`:**
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `status` | string | **да** | Статус в этот момент |
|
||||
| `changed_at` | string RFC3339 | **да** | Время перехода (без этого поля запись игнорируется) |
|
||||
| `details` | string | нет | Пояснение к переходу |
|
||||
|
||||
**Правила приоритета времени события:**
|
||||
|
||||
1. `status_changed_at`
|
||||
2. Последняя запись `status_history` с совпадающим статусом
|
||||
3. Последняя парсируемая запись `status_history`
|
||||
4. `status_checked_at`
|
||||
|
||||
**Правила передачи статусов:**
|
||||
- Передавайте `status` как текущее состояние компонента в snapshot.
|
||||
- Если источник хранит историю — передавайте `status_history` отсортированным по `changed_at` по возрастанию.
|
||||
- Не включайте записи `status_history` без `changed_at`.
|
||||
- Все даты — RFC3339, рекомендуется UTC (`Z`).
|
||||
- `manufactured_year_week` используйте, когда источник знает только год и неделю производства, без точной календарной даты.
|
||||
|
||||
---
|
||||
|
||||
## Секции hardware
|
||||
|
||||
### board
|
||||
|
||||
Основная информация о сервере. Обязательная секция.
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `serial_number` | string | **да** | Серийный номер (ключ идентификации Asset) |
|
||||
| `manufacturer` | string | нет | Производитель |
|
||||
| `product_name` | string | нет | Модель |
|
||||
| `part_number` | string | нет | Партномер |
|
||||
| `uuid` | string | нет | UUID системы |
|
||||
|
||||
Значения `"NULL"` в строковых полях трактуются как отсутствие данных.
|
||||
|
||||
```json
|
||||
"board": {
|
||||
"manufacturer": "Supermicro",
|
||||
"product_name": "X12DPG-QT6",
|
||||
"serial_number": "21D634101",
|
||||
"part_number": "X12DPG-QT6-REV1.01",
|
||||
"uuid": "d7ef2fe5-2fd0-11f0-910a-346f11040868"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### firmware
|
||||
|
||||
Версии прошивок системных компонентов (BIOS, BMC, CPLD и др.).
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `device_name` | string | **да** | Название устройства (`BIOS`, `BMC`, `CPLD`, …) |
|
||||
| `version` | string | **да** | Версия прошивки |
|
||||
|
||||
Записи с пустым `device_name` или `version` игнорируются.
|
||||
Изменение версии создаёт событие `FIRMWARE_CHANGED` для Asset.
|
||||
|
||||
```json
|
||||
"firmware": [
|
||||
{ "device_name": "BIOS", "version": "06.08.05" },
|
||||
{ "device_name": "BMC", "version": "5.17.00" },
|
||||
{ "device_name": "CPLD", "version": "01.02.03" }
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### cpus
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `socket` | int | **да** | Номер сокета (используется для генерации serial) |
|
||||
| `model` | string | нет | Модель процессора |
|
||||
| `manufacturer` | string | нет | Производитель |
|
||||
| `cores` | int | нет | Количество ядер |
|
||||
| `threads` | int | нет | Количество потоков |
|
||||
| `frequency_mhz` | int | нет | Текущая частота |
|
||||
| `max_frequency_mhz` | int | нет | Максимальная частота |
|
||||
| `temperature_c` | float | нет | Температура CPU, °C (telemetry) |
|
||||
| `power_w` | float | нет | Текущая мощность CPU, Вт (telemetry) |
|
||||
| `throttled` | bool | нет | Зафиксирован thermal/power throttling |
|
||||
| `correctable_error_count` | int | нет | Количество корректируемых ошибок CPU |
|
||||
| `uncorrectable_error_count` | int | нет | Количество некорректируемых ошибок CPU |
|
||||
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
|
||||
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
|
||||
| `serial_number` | string | нет | Серийный номер (если доступен) |
|
||||
| `firmware` | string | нет | Версия микрокода; если логгер отдает `Microcode level`, передавайте его сюда как есть |
|
||||
| `present` | bool | нет | Наличие (по умолчанию `true`) |
|
||||
| + общие поля статуса | | | см. раздел выше |
|
||||
|
||||
**Генерация serial_number при отсутствии:** `{board_serial}-CPU-{socket}`
|
||||
|
||||
Если источник использует поле/лейбл `Microcode level`, его значение передавайте в `cpus[].firmware` без дополнительного преобразования.
|
||||
|
||||
```json
|
||||
"cpus": [
|
||||
{
|
||||
"socket": 0,
|
||||
"model": "INTEL(R) XEON(R) GOLD 6530",
|
||||
"cores": 32,
|
||||
"threads": 64,
|
||||
"frequency_mhz": 2100,
|
||||
"max_frequency_mhz": 4000,
|
||||
"temperature_c": 61.5,
|
||||
"power_w": 182.0,
|
||||
"throttled": false,
|
||||
"manufacturer": "Intel",
|
||||
"status": "OK",
|
||||
"status_checked_at": "2026-02-10T15:28:00Z"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### memory
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `slot` | string | нет | Идентификатор слота |
|
||||
| `present` | bool | нет | Наличие модуля (по умолчанию `true`) |
|
||||
| `serial_number` | string | нет | Серийный номер |
|
||||
| `part_number` | string | нет | Партномер (используется как модель) |
|
||||
| `manufacturer` | string | нет | Производитель |
|
||||
| `size_mb` | int | нет | Объём в МБ |
|
||||
| `type` | string | нет | Тип: `DDR3`, `DDR4`, `DDR5`, … |
|
||||
| `max_speed_mhz` | int | нет | Максимальная частота |
|
||||
| `current_speed_mhz` | int | нет | Текущая частота |
|
||||
| `temperature_c` | float | нет | Температура DIMM/модуля, °C (telemetry) |
|
||||
| `correctable_ecc_error_count` | int | нет | Количество корректируемых ECC-ошибок |
|
||||
| `uncorrectable_ecc_error_count` | int | нет | Количество некорректируемых ECC-ошибок |
|
||||
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
|
||||
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
|
||||
| `spare_blocks_remaining_pct` | float | нет | Остаток spare blocks, % |
|
||||
| `performance_degraded` | bool | нет | Зафиксирована деградация производительности |
|
||||
| `data_loss_detected` | bool | нет | Источник сигнализирует риск/факт потери данных |
|
||||
| + общие поля статуса | | | см. раздел выше |
|
||||
|
||||
Модуль без `serial_number` игнорируется. Модуль с `present=false` или `status=Empty` игнорируется.
|
||||
|
||||
```json
|
||||
"memory": [
|
||||
{
|
||||
"slot": "CPU0_C0D0",
|
||||
"present": true,
|
||||
"size_mb": 32768,
|
||||
"type": "DDR5",
|
||||
"max_speed_mhz": 4800,
|
||||
"current_speed_mhz": 4800,
|
||||
"temperature_c": 43.0,
|
||||
"correctable_ecc_error_count": 0,
|
||||
"manufacturer": "Hynix",
|
||||
"serial_number": "80AD032419E17CEEC1",
|
||||
"part_number": "HMCG88AGBRA191N",
|
||||
"status": "OK"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### storage
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `slot` | string | нет | Канонический адрес установки PCIe-устройства; передавайте BDF (`0000:18:00.0`) |
|
||||
| `serial_number` | string | нет | Серийный номер |
|
||||
| `model` | string | нет | Модель |
|
||||
| `manufacturer` | string | нет | Производитель |
|
||||
| `type` | string | нет | Тип: `NVMe`, `SSD`, `HDD` |
|
||||
| `interface` | string | нет | Интерфейс: `NVMe`, `SATA`, `SAS` |
|
||||
| `size_gb` | int | нет | Размер в ГБ |
|
||||
| `temperature_c` | float | нет | Температура накопителя, °C (telemetry) |
|
||||
| `power_on_hours` | int64 | нет | Время работы, часы |
|
||||
| `power_cycles` | int64 | нет | Количество циклов питания |
|
||||
| `unsafe_shutdowns` | int64 | нет | Нештатные выключения |
|
||||
| `media_errors` | int64 | нет | Ошибки носителя / media errors |
|
||||
| `error_log_entries` | int64 | нет | Количество записей в error log |
|
||||
| `written_bytes` | int64 | нет | Всего записано байт |
|
||||
| `read_bytes` | int64 | нет | Всего прочитано байт |
|
||||
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
|
||||
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
|
||||
| `available_spare_pct` | float | нет | Доступный spare, % |
|
||||
| `reallocated_sectors` | int64 | нет | Переназначенные сектора |
|
||||
| `current_pending_sectors` | int64 | нет | Сектора в ожидании ремапа |
|
||||
| `offline_uncorrectable` | int64 | нет | Некорректируемые ошибки offline scan |
|
||||
| `firmware` | string | нет | Версия прошивки |
|
||||
| `present` | bool | нет | Наличие (по умолчанию `true`) |
|
||||
| + общие поля статуса | | | см. раздел выше |
|
||||
|
||||
Диск без `serial_number` игнорируется. Изменение `firmware` создаёт событие `FIRMWARE_CHANGED`.
|
||||
|
||||
```json
|
||||
"storage": [
|
||||
{
|
||||
"slot": "OB01",
|
||||
"type": "NVMe",
|
||||
"model": "INTEL SSDPF2KX076T1",
|
||||
"size_gb": 7680,
|
||||
"temperature_c": 38.5,
|
||||
"power_on_hours": 12450,
|
||||
"unsafe_shutdowns": 3,
|
||||
"written_bytes": 9876543210,
|
||||
"life_remaining_pct": 91.0,
|
||||
"serial_number": "BTAX41900GF87P6DGN",
|
||||
"manufacturer": "Intel",
|
||||
"firmware": "9CV10510",
|
||||
"interface": "NVMe",
|
||||
"present": true,
|
||||
"status": "OK"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### pcie_devices
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `slot` | string | нет | Идентификатор слота |
|
||||
| `vendor_id` | int | нет | PCI Vendor ID (decimal) |
|
||||
| `device_id` | int | нет | PCI Device ID (decimal) |
|
||||
| `numa_node` | int | нет | NUMA node / CPU affinity устройства |
|
||||
| `temperature_c` | float | нет | Температура устройства, °C (telemetry) |
|
||||
| `power_w` | float | нет | Текущее энергопотребление устройства, Вт (telemetry) |
|
||||
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
|
||||
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
|
||||
| `ecc_corrected_total` | int64 | нет | Всего корректируемых ECC-ошибок |
|
||||
| `ecc_uncorrected_total` | int64 | нет | Всего некорректируемых ECC-ошибок |
|
||||
| `hw_slowdown` | bool | нет | Устройство вошло в hardware slowdown / protective mode |
|
||||
| `battery_charge_pct` | float | нет | Заряд батареи / supercap, % |
|
||||
| `battery_health_pct` | float | нет | Состояние батареи / supercap, % |
|
||||
| `battery_temperature_c` | float | нет | Температура батареи / supercap, °C |
|
||||
| `battery_voltage_v` | float | нет | Напряжение батареи / supercap, В |
|
||||
| `battery_replace_required` | bool | нет | Требуется замена батареи / supercap |
|
||||
| `sfp_temperature_c` | float | нет | Температура SFP/optic, °C |
|
||||
| `sfp_tx_power_dbm` | float | нет | TX optical power, dBm |
|
||||
| `sfp_rx_power_dbm` | float | нет | RX optical power, dBm |
|
||||
| `sfp_voltage_v` | float | нет | Напряжение SFP, В |
|
||||
| `sfp_bias_ma` | float | нет | Bias current SFP, мА |
|
||||
| `bdf` | string | нет | Deprecated alias для `slot`; при наличии ingest нормализует его в `slot` |
|
||||
| `device_class` | string | нет | Класс устройства (см. список ниже) |
|
||||
| `manufacturer` | string | нет | Производитель |
|
||||
| `model` | string | нет | Модель |
|
||||
| `serial_number` | string | нет | Серийный номер |
|
||||
| `firmware` | string | нет | Версия прошивки |
|
||||
| `link_width` | int | нет | Текущая ширина линка |
|
||||
| `link_speed` | string | нет | Текущая скорость: `Gen3`, `Gen4`, `Gen5` |
|
||||
| `max_link_width` | int | нет | Максимальная ширина линка |
|
||||
| `max_link_speed` | string | нет | Максимальная скорость |
|
||||
| `mac_addresses` | string[] | нет | MAC-адреса портов (для сетевых устройств) |
|
||||
| `present` | bool | нет | Наличие (по умолчанию `true`) |
|
||||
| + общие поля статуса | | | см. раздел выше |
|
||||
|
||||
`numa_node` передавайте для NIC / InfiniBand / RAID / GPU, когда источник знает CPU/NUMA affinity. Поле сохраняется в snapshot-атрибутах PCIe-компонента и дублируется в telemetry для topology use cases.
|
||||
Поля `temperature_c` и `power_w` используйте для device-level telemetry GPU / accelerator / smart PCIe devices. Они не влияют на идентификацию компонента.
|
||||
|
||||
**Генерация serial_number при отсутствии или `"N/A"`:** `{board_serial}-PCIE-{slot}`, где `slot` для PCIe равен BDF.
|
||||
|
||||
`slot` — единственный канонический адрес компонента. Для PCIe в `slot` передавайте BDF. Поле `bdf` сохраняется только как переходный alias на входе и не должно использоваться как отдельная координата рядом со `slot`.
|
||||
|
||||
**Значения `device_class`:**
|
||||
|
||||
| Значение | Назначение |
|
||||
|----------|------------|
|
||||
| `MassStorageController` | RAID-контроллеры |
|
||||
| `StorageController` | HBA, SAS-контроллеры |
|
||||
| `NetworkController` | Сетевые адаптеры (InfiniBand, общий) |
|
||||
| `EthernetController` | Ethernet NIC |
|
||||
| `FibreChannelController` | Fibre Channel HBA |
|
||||
| `VideoController` | GPU, видеокарты |
|
||||
| `ProcessingAccelerator` | Вычислительные ускорители (AI/ML) |
|
||||
| `DisplayController` | Контроллеры дисплея (BMC VGA) |
|
||||
|
||||
Список открытый: допускаются произвольные строки для нестандартных классов.
|
||||
|
||||
```json
|
||||
"pcie_devices": [
|
||||
{
|
||||
"slot": "0000:3b:00.0",
|
||||
"vendor_id": 5555,
|
||||
"device_id": 4401,
|
||||
"numa_node": 0,
|
||||
"temperature_c": 48.5,
|
||||
"power_w": 18.2,
|
||||
"sfp_temperature_c": 36.2,
|
||||
"sfp_tx_power_dbm": -1.8,
|
||||
"sfp_rx_power_dbm": -2.1,
|
||||
"device_class": "EthernetController",
|
||||
"manufacturer": "Intel",
|
||||
"model": "X710 10GbE",
|
||||
"serial_number": "K65472-003",
|
||||
"firmware": "9.20 0x8000d4ae",
|
||||
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
|
||||
"status": "OK"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### power_supplies
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `slot` | string | нет | Идентификатор слота |
|
||||
| `present` | bool | нет | Наличие (по умолчанию `true`) |
|
||||
| `serial_number` | string | нет | Серийный номер |
|
||||
| `part_number` | string | нет | Партномер |
|
||||
| `model` | string | нет | Модель |
|
||||
| `vendor` | string | нет | Производитель |
|
||||
| `wattage_w` | int | нет | Мощность в ваттах |
|
||||
| `firmware` | string | нет | Версия прошивки |
|
||||
| `input_type` | string | нет | Тип входа (например `ACWideRange`) |
|
||||
| `input_voltage` | float | нет | Входное напряжение, В (telemetry) |
|
||||
| `input_power_w` | float | нет | Входная мощность, Вт (telemetry) |
|
||||
| `output_power_w` | float | нет | Выходная мощность, Вт (telemetry) |
|
||||
| `temperature_c` | float | нет | Температура PSU, °C (telemetry) |
|
||||
| `life_remaining_pct` | float | нет | Остаточный ресурс / health, % |
|
||||
| `life_used_pct` | float | нет | Использованный ресурс / wear, % |
|
||||
| + общие поля статуса | | | см. раздел выше |
|
||||
|
||||
Поля telemetry (`input_voltage`, `input_power_w`, `output_power_w`, `temperature_c`, `life_remaining_pct`, `life_used_pct`) сохраняются в атрибутах компонента и не влияют на его идентификацию.
|
||||
|
||||
PSU без `serial_number` игнорируется.
|
||||
|
||||
```json
|
||||
"power_supplies": [
|
||||
{
|
||||
"slot": "0",
|
||||
"present": true,
|
||||
"model": "GW-CRPS3000LW",
|
||||
"vendor": "Great Wall",
|
||||
"wattage_w": 3000,
|
||||
"serial_number": "2P06C102610",
|
||||
"firmware": "00.03.05",
|
||||
"status": "OK",
|
||||
"input_type": "ACWideRange",
|
||||
"input_power_w": 137,
|
||||
"output_power_w": 104,
|
||||
"input_voltage": 215.25,
|
||||
"temperature_c": 39.5,
|
||||
"life_remaining_pct": 97.0
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### sensors
|
||||
|
||||
Показания сенсоров сервера. Секция опциональная, не привязана к компонентам.
|
||||
Данные хранятся как последнее известное значение (last-known-value) на уровне Asset.
|
||||
|
||||
```json
|
||||
"sensors": {
|
||||
"fans": [ ... ],
|
||||
"power": [ ... ],
|
||||
"temperatures": [ ... ],
|
||||
"other": [ ... ]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### event_logs
|
||||
|
||||
Нормализованные операционные логи сервера из `host`, `bmc` или `redfish`.
|
||||
|
||||
Эти записи не попадают в history timeline и не создают history events. Они сохраняются в отдельной deduplicated log store и отображаются в отдельном UI-блоке asset logs / host logs.
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `source` | string | **да** | Источник лога: `host`, `bmc`, `redfish` |
|
||||
| `event_time` | string RFC3339 | нет | Время события из источника; если отсутствует, используется время ingest/collection |
|
||||
| `severity` | string | нет | Уровень: `OK`, `Info`, `Warning`, `Critical`, `Unknown` |
|
||||
| `message_id` | string | нет | Идентификатор/код события источника |
|
||||
| `message` | string | **да** | Нормализованный текст события |
|
||||
| `component_ref` | string | нет | Ссылка на компонент/устройство/слот, если извлекается |
|
||||
| `fingerprint` | string | нет | Внешний готовый dedup-key; если не передан, система вычисляет свой |
|
||||
| `is_active` | bool | нет | Признак, что событие всё ещё активно/не погашено, если источник умеет lifecycle |
|
||||
| `raw_payload` | object | нет | Сырой vendor-specific payload для диагностики |
|
||||
|
||||
**Правила event_logs:**
|
||||
- Логи дедуплицируются в рамках asset + source + fingerprint.
|
||||
- Если `fingerprint` не передан, система строит его из нормализованных полей (`source`, `message_id`, `message`, `component_ref`, временная нормализация).
|
||||
- Интегратор/сборщик логов не должен синтезировать содержимое событий: не придумывайте `message`, `message_id`, `component_ref`, serial/device identifiers или иные поля, если они отсутствуют в исходном логе или не были надёжно извлечены.
|
||||
- Повторное получение того же события обновляет `last_seen_at`/счётчик повторов и не должно создавать новый timeline/history event.
|
||||
- `event_logs` используются для отдельного UI-представления логов и не изменяют canonical state компонентов/asset по умолчанию.
|
||||
|
||||
```json
|
||||
"event_logs": [
|
||||
{
|
||||
"source": "bmc",
|
||||
"event_time": "2026-03-15T14:03:11Z",
|
||||
"severity": "Warning",
|
||||
"message_id": "0x000F",
|
||||
"message": "Correctable ECC error threshold exceeded",
|
||||
"component_ref": "CPU0_C0D0",
|
||||
"raw_payload": {
|
||||
"sensor": "DIMM_A1",
|
||||
"sel_record_id": "0042"
|
||||
}
|
||||
},
|
||||
{
|
||||
"source": "redfish",
|
||||
"event_time": "2026-03-15T14:03:20Z",
|
||||
"severity": "Info",
|
||||
"message_id": "OpenBMC.0.1.SystemReboot",
|
||||
"message": "System reboot requested by administrator",
|
||||
"component_ref": "Mainboard"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
#### sensors.fans
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `name` | string | **да** | Уникальное имя сенсора в рамках секции |
|
||||
| `location` | string | нет | Физическое расположение |
|
||||
| `rpm` | int | нет | Обороты, RPM |
|
||||
| `status` | string | нет | Статус: `OK`, `Warning`, `Critical`, `Unknown` |
|
||||
|
||||
#### sensors.power
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `name` | string | **да** | Уникальное имя сенсора |
|
||||
| `location` | string | нет | Физическое расположение |
|
||||
| `voltage_v` | float | нет | Напряжение, В |
|
||||
| `current_a` | float | нет | Ток, А |
|
||||
| `power_w` | float | нет | Мощность, Вт |
|
||||
| `status` | string | нет | Статус |
|
||||
|
||||
#### sensors.temperatures
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `name` | string | **да** | Уникальное имя сенсора |
|
||||
| `location` | string | нет | Физическое расположение |
|
||||
| `celsius` | float | нет | Температура, °C |
|
||||
| `threshold_warning_celsius` | float | нет | Порог Warning, °C |
|
||||
| `threshold_critical_celsius` | float | нет | Порог Critical, °C |
|
||||
| `status` | string | нет | Статус |
|
||||
|
||||
#### sensors.other
|
||||
|
||||
| Поле | Тип | Обязательно | Описание |
|
||||
|------|-----|-------------|----------|
|
||||
| `name` | string | **да** | Уникальное имя сенсора |
|
||||
| `location` | string | нет | Физическое расположение |
|
||||
| `value` | float | нет | Значение |
|
||||
| `unit` | string | нет | Единица измерения |
|
||||
| `status` | string | нет | Статус |
|
||||
|
||||
**Правила sensors:**
|
||||
- Идентификатор сенсора: пара `(sensor_type, name)`. Дубли в одном payload — берётся первое вхождение.
|
||||
- Сенсоры без `name` игнорируются.
|
||||
- При каждом импорте значения перезаписываются (upsert по ключу).
|
||||
|
||||
```json
|
||||
"sensors": {
|
||||
"fans": [
|
||||
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" },
|
||||
{ "name": "FAN_CPU0", "location": "CPU0", "rpm": 5600, "status": "OK" }
|
||||
],
|
||||
"power": [
|
||||
{ "name": "12V Rail", "location": "Mainboard", "voltage_v": 12.06, "status": "OK" },
|
||||
{ "name": "PSU0 Input", "location": "PSU0", "voltage_v": 215.25, "current_a": 0.64, "power_w": 137.0, "status": "OK" }
|
||||
],
|
||||
"temperatures": [
|
||||
{ "name": "CPU0 Temp", "location": "CPU0", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" },
|
||||
{ "name": "Inlet Temp", "location": "Front", "celsius": 22.0, "threshold_warning_celsius": 40.0, "threshold_critical_celsius": 50.0, "status": "OK" }
|
||||
],
|
||||
"other": [
|
||||
{ "name": "System Humidity", "value": 38.5, "unit": "%", "status": "OK" }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Обработка статусов компонентов
|
||||
|
||||
| Статус | Поведение |
|
||||
|--------|-----------|
|
||||
| `OK` | Нормальная обработка |
|
||||
| `Warning` | Создаётся событие `COMPONENT_WARNING` |
|
||||
| `Critical` | Создаётся событие `COMPONENT_FAILED` + запись в `failure_events` |
|
||||
| `Unknown` | Компонент считается рабочим, создаётся событие `COMPONENT_UNKNOWN` |
|
||||
| `Empty` | Компонент не создаётся/не обновляется |
|
||||
|
||||
---
|
||||
|
||||
## Обработка отсутствующих serial_number
|
||||
|
||||
Общее правило для всех секций: если источник не вернул серийный номер и сборщик не смог его надёжно извлечь, интегратор не должен подставлять вымышленные значения, хеши, локальные placeholder-идентификаторы или серийные номера "по догадке". Разрешены только явно оговорённые ниже server-side fallback-правила ingest.
|
||||
|
||||
| Тип | Поведение |
|
||||
|-----|-----------|
|
||||
| CPU | Генерируется: `{board_serial}-CPU-{socket}` |
|
||||
| PCIe | Генерируется: `{board_serial}-PCIE-{slot}` (если serial = `"N/A"` или пустой; `slot` для PCIe = BDF) |
|
||||
| Memory | Компонент игнорируется |
|
||||
| Storage | Компонент игнорируется |
|
||||
| PSU | Компонент игнорируется |
|
||||
|
||||
Если `serial_number` не уникален внутри одного payload для того же `model`:
|
||||
- Первое вхождение сохраняет оригинальный серийный номер.
|
||||
- Каждое следующее дублирующее получает placeholder: `NO_SN-XXXXXXXX`.
|
||||
|
||||
---
|
||||
|
||||
## Минимальный валидный пример
|
||||
|
||||
```json
|
||||
{
|
||||
"collected_at": "2026-02-10T15:30:00Z",
|
||||
"target_host": "192.168.1.100",
|
||||
"hardware": {
|
||||
"board": {
|
||||
"serial_number": "SRV-001"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Полный пример с историей статусов
|
||||
|
||||
```json
|
||||
{
|
||||
"filename": "redfish://10.10.10.103",
|
||||
"source_type": "api",
|
||||
"protocol": "redfish",
|
||||
"target_host": "10.10.10.103",
|
||||
"collected_at": "2026-02-10T15:30:00Z",
|
||||
"hardware": {
|
||||
"board": {
|
||||
"manufacturer": "Supermicro",
|
||||
"product_name": "X12DPG-QT6",
|
||||
"serial_number": "21D634101"
|
||||
},
|
||||
"firmware": [
|
||||
{ "device_name": "BIOS", "version": "06.08.05" },
|
||||
{ "device_name": "BMC", "version": "5.17.00" }
|
||||
],
|
||||
"cpus": [
|
||||
{
|
||||
"socket": 0,
|
||||
"model": "INTEL(R) XEON(R) GOLD 6530",
|
||||
"manufacturer": "Intel",
|
||||
"cores": 32,
|
||||
"threads": 64,
|
||||
"status": "OK"
|
||||
}
|
||||
],
|
||||
"storage": [
|
||||
{
|
||||
"slot": "OB01",
|
||||
"type": "NVMe",
|
||||
"model": "INTEL SSDPF2KX076T1",
|
||||
"size_gb": 7680,
|
||||
"serial_number": "BTAX41900GF87P6DGN",
|
||||
"manufacturer": "Intel",
|
||||
"firmware": "9CV10510",
|
||||
"present": true,
|
||||
"status": "OK",
|
||||
"status_changed_at": "2026-02-10T15:22:00Z",
|
||||
"status_history": [
|
||||
{ "status": "Critical", "changed_at": "2026-02-10T15:10:00Z", "details": "I/O timeout on NVMe queue 3" },
|
||||
{ "status": "OK", "changed_at": "2026-02-10T15:22:00Z", "details": "Recovered after controller reset" }
|
||||
]
|
||||
}
|
||||
],
|
||||
"pcie_devices": [
|
||||
{
|
||||
"slot": "0000:18:00.0",
|
||||
"device_class": "EthernetController",
|
||||
"manufacturer": "Intel",
|
||||
"model": "X710 10GbE",
|
||||
"serial_number": "K65472-003",
|
||||
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
|
||||
"status": "OK"
|
||||
}
|
||||
],
|
||||
"power_supplies": [
|
||||
{
|
||||
"slot": "0",
|
||||
"present": true,
|
||||
"model": "GW-CRPS3000LW",
|
||||
"vendor": "Great Wall",
|
||||
"wattage_w": 3000,
|
||||
"serial_number": "2P06C102610",
|
||||
"firmware": "00.03.05",
|
||||
"status": "OK",
|
||||
"input_power_w": 137,
|
||||
"output_power_w": 104,
|
||||
"input_voltage": 215.25
|
||||
}
|
||||
],
|
||||
"sensors": {
|
||||
"fans": [
|
||||
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" }
|
||||
],
|
||||
"power": [
|
||||
{ "name": "12V Rail", "voltage_v": 12.06, "status": "OK" }
|
||||
],
|
||||
"temperatures": [
|
||||
{ "name": "CPU0 Temp", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" }
|
||||
],
|
||||
"other": [
|
||||
{ "name": "System Humidity", "value": 38.5, "unit": "%" }
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
1
internal/chart
Submodule
1
internal/chart
Submodule
Submodule internal/chart added at 05db6994d4
@@ -1,6 +1,6 @@
|
||||
FROM debian:12
|
||||
|
||||
ARG GO_VERSION=1.23.6
|
||||
ARG GO_VERSION=1.24.0
|
||||
ARG DEBIAN_KERNEL_ABI=6.1.0-43
|
||||
|
||||
ENV DEBIAN_FRONTEND=noninteractive
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
DEBIAN_VERSION=12
|
||||
DEBIAN_KERNEL_ABI=6.1.0-43
|
||||
NVIDIA_DRIVER_VERSION=590.48.01
|
||||
GO_VERSION=1.23.6
|
||||
AUDIT_VERSION=0.1.1
|
||||
GO_VERSION=1.24.0
|
||||
AUDIT_VERSION=1.0.0
|
||||
|
||||
@@ -21,7 +21,7 @@ lb config noauto \
|
||||
--linux-flavours "amd64" \
|
||||
--linux-packages "linux-image-${DEBIAN_KERNEL_ABI}" \
|
||||
--memtest none \
|
||||
--iso-volume "BEE-DEBUG" \
|
||||
--iso-volume "BEE" \
|
||||
--iso-application "Bee Hardware Audit" \
|
||||
--bootappend-live "boot=live components console=tty0 console=ttyS0,115200n8 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
|
||||
--apt-recommends false \
|
||||
|
||||
314
iso/builder/bee-gpu-stress.c
Normal file
314
iso/builder/bee-gpu-stress.c
Normal file
@@ -0,0 +1,314 @@
|
||||
#define _POSIX_C_SOURCE 200809L
|
||||
|
||||
#include <dlfcn.h>
|
||||
#include <stdint.h>
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <string.h>
|
||||
#include <time.h>
|
||||
|
||||
typedef int CUdevice;
|
||||
typedef uint64_t CUdeviceptr;
|
||||
typedef int CUresult;
|
||||
typedef void *CUcontext;
|
||||
typedef void *CUmodule;
|
||||
typedef void *CUfunction;
|
||||
typedef void *CUstream;
|
||||
|
||||
#define CU_SUCCESS 0
|
||||
|
||||
static const char *ptx_source =
|
||||
".version 6.0\n"
|
||||
".target sm_30\n"
|
||||
".address_size 64\n"
|
||||
"\n"
|
||||
".visible .entry burn(\n"
|
||||
" .param .u64 data,\n"
|
||||
" .param .u32 words,\n"
|
||||
" .param .u32 rounds\n"
|
||||
")\n"
|
||||
"{\n"
|
||||
" .reg .pred %p<2>;\n"
|
||||
" .reg .b32 %r<8>;\n"
|
||||
" .reg .b64 %rd<5>;\n"
|
||||
"\n"
|
||||
" ld.param.u64 %rd1, [data];\n"
|
||||
" ld.param.u32 %r1, [words];\n"
|
||||
" ld.param.u32 %r2, [rounds];\n"
|
||||
" mov.u32 %r3, %ctaid.x;\n"
|
||||
" mov.u32 %r4, %ntid.x;\n"
|
||||
" mov.u32 %r5, %tid.x;\n"
|
||||
" mad.lo.s32 %r0, %r3, %r4, %r5;\n"
|
||||
" setp.ge.u32 %p0, %r0, %r1;\n"
|
||||
" @%p0 bra DONE;\n"
|
||||
" mul.wide.u32 %rd2, %r0, 4;\n"
|
||||
" add.s64 %rd3, %rd1, %rd2;\n"
|
||||
" ld.global.u32 %r6, [%rd3];\n"
|
||||
"LOOP:\n"
|
||||
" setp.eq.u32 %p1, %r2, 0;\n"
|
||||
" @%p1 bra STORE;\n"
|
||||
" mad.lo.u32 %r6, %r6, 1664525, 1013904223;\n"
|
||||
" sub.u32 %r2, %r2, 1;\n"
|
||||
" bra LOOP;\n"
|
||||
"STORE:\n"
|
||||
" st.global.u32 [%rd3], %r6;\n"
|
||||
"DONE:\n"
|
||||
" ret;\n"
|
||||
"}\n";
|
||||
|
||||
typedef CUresult (*cuInit_fn)(unsigned int);
|
||||
typedef CUresult (*cuDeviceGetCount_fn)(int *);
|
||||
typedef CUresult (*cuDeviceGet_fn)(CUdevice *, int);
|
||||
typedef CUresult (*cuDeviceGetName_fn)(char *, int, CUdevice);
|
||||
typedef CUresult (*cuCtxCreate_fn)(CUcontext *, unsigned int, CUdevice);
|
||||
typedef CUresult (*cuCtxDestroy_fn)(CUcontext);
|
||||
typedef CUresult (*cuCtxSynchronize_fn)(void);
|
||||
typedef CUresult (*cuMemAlloc_fn)(CUdeviceptr *, size_t);
|
||||
typedef CUresult (*cuMemFree_fn)(CUdeviceptr);
|
||||
typedef CUresult (*cuMemcpyHtoD_fn)(CUdeviceptr, const void *, size_t);
|
||||
typedef CUresult (*cuMemcpyDtoH_fn)(void *, CUdeviceptr, size_t);
|
||||
typedef CUresult (*cuModuleLoadDataEx_fn)(CUmodule *, const void *, unsigned int, void *, void *);
|
||||
typedef CUresult (*cuModuleGetFunction_fn)(CUfunction *, CUmodule, const char *);
|
||||
typedef CUresult (*cuLaunchKernel_fn)(CUfunction,
|
||||
unsigned int,
|
||||
unsigned int,
|
||||
unsigned int,
|
||||
unsigned int,
|
||||
unsigned int,
|
||||
unsigned int,
|
||||
unsigned int,
|
||||
CUstream,
|
||||
void **,
|
||||
void **);
|
||||
typedef CUresult (*cuGetErrorName_fn)(CUresult, const char **);
|
||||
typedef CUresult (*cuGetErrorString_fn)(CUresult, const char **);
|
||||
|
||||
struct cuda_api {
|
||||
void *lib;
|
||||
cuInit_fn cuInit;
|
||||
cuDeviceGetCount_fn cuDeviceGetCount;
|
||||
cuDeviceGet_fn cuDeviceGet;
|
||||
cuDeviceGetName_fn cuDeviceGetName;
|
||||
cuCtxCreate_fn cuCtxCreate;
|
||||
cuCtxDestroy_fn cuCtxDestroy;
|
||||
cuCtxSynchronize_fn cuCtxSynchronize;
|
||||
cuMemAlloc_fn cuMemAlloc;
|
||||
cuMemFree_fn cuMemFree;
|
||||
cuMemcpyHtoD_fn cuMemcpyHtoD;
|
||||
cuMemcpyDtoH_fn cuMemcpyDtoH;
|
||||
cuModuleLoadDataEx_fn cuModuleLoadDataEx;
|
||||
cuModuleGetFunction_fn cuModuleGetFunction;
|
||||
cuLaunchKernel_fn cuLaunchKernel;
|
||||
cuGetErrorName_fn cuGetErrorName;
|
||||
cuGetErrorString_fn cuGetErrorString;
|
||||
};
|
||||
|
||||
static int load_symbol(void *lib, const char *name, void **out) {
|
||||
*out = dlsym(lib, name);
|
||||
return *out != NULL;
|
||||
}
|
||||
|
||||
static int load_cuda(struct cuda_api *api) {
|
||||
memset(api, 0, sizeof(*api));
|
||||
api->lib = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
|
||||
if (!api->lib) {
|
||||
return 0;
|
||||
}
|
||||
return
|
||||
load_symbol(api->lib, "cuInit", (void **)&api->cuInit) &&
|
||||
load_symbol(api->lib, "cuDeviceGetCount", (void **)&api->cuDeviceGetCount) &&
|
||||
load_symbol(api->lib, "cuDeviceGet", (void **)&api->cuDeviceGet) &&
|
||||
load_symbol(api->lib, "cuDeviceGetName", (void **)&api->cuDeviceGetName) &&
|
||||
load_symbol(api->lib, "cuCtxCreate_v2", (void **)&api->cuCtxCreate) &&
|
||||
load_symbol(api->lib, "cuCtxDestroy_v2", (void **)&api->cuCtxDestroy) &&
|
||||
load_symbol(api->lib, "cuCtxSynchronize", (void **)&api->cuCtxSynchronize) &&
|
||||
load_symbol(api->lib, "cuMemAlloc_v2", (void **)&api->cuMemAlloc) &&
|
||||
load_symbol(api->lib, "cuMemFree_v2", (void **)&api->cuMemFree) &&
|
||||
load_symbol(api->lib, "cuMemcpyHtoD_v2", (void **)&api->cuMemcpyHtoD) &&
|
||||
load_symbol(api->lib, "cuMemcpyDtoH_v2", (void **)&api->cuMemcpyDtoH) &&
|
||||
load_symbol(api->lib, "cuModuleLoadDataEx", (void **)&api->cuModuleLoadDataEx) &&
|
||||
load_symbol(api->lib, "cuModuleGetFunction", (void **)&api->cuModuleGetFunction) &&
|
||||
load_symbol(api->lib, "cuLaunchKernel", (void **)&api->cuLaunchKernel);
|
||||
}
|
||||
|
||||
static const char *cu_error_name(struct cuda_api *api, CUresult rc) {
|
||||
const char *value = NULL;
|
||||
if (api->cuGetErrorName && api->cuGetErrorName(rc, &value) == CU_SUCCESS && value) {
|
||||
return value;
|
||||
}
|
||||
return "CUDA_ERROR";
|
||||
}
|
||||
|
||||
static const char *cu_error_string(struct cuda_api *api, CUresult rc) {
|
||||
const char *value = NULL;
|
||||
if (api->cuGetErrorString && api->cuGetErrorString(rc, &value) == CU_SUCCESS && value) {
|
||||
return value;
|
||||
}
|
||||
return "unknown";
|
||||
}
|
||||
|
||||
static int check_rc(struct cuda_api *api, const char *step, CUresult rc) {
|
||||
if (rc == CU_SUCCESS) {
|
||||
return 1;
|
||||
}
|
||||
fprintf(stderr, "%s failed: %s (%s)\n", step, cu_error_name(api, rc), cu_error_string(api, rc));
|
||||
return 0;
|
||||
}
|
||||
|
||||
static double now_seconds(void) {
|
||||
struct timespec ts;
|
||||
clock_gettime(CLOCK_MONOTONIC, &ts);
|
||||
return (double)ts.tv_sec + ((double)ts.tv_nsec / 1000000000.0);
|
||||
}
|
||||
|
||||
int main(int argc, char **argv) {
|
||||
int seconds = 5;
|
||||
int size_mb = 64;
|
||||
for (int i = 1; i < argc; i++) {
|
||||
if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) {
|
||||
seconds = atoi(argv[++i]);
|
||||
} else if ((strcmp(argv[i], "--size-mb") == 0 || strcmp(argv[i], "-m") == 0) && i + 1 < argc) {
|
||||
size_mb = atoi(argv[++i]);
|
||||
} else {
|
||||
fprintf(stderr, "usage: %s [--seconds N] [--size-mb N]\n", argv[0]);
|
||||
return 2;
|
||||
}
|
||||
}
|
||||
if (seconds <= 0) {
|
||||
seconds = 5;
|
||||
}
|
||||
if (size_mb <= 0) {
|
||||
size_mb = 64;
|
||||
}
|
||||
|
||||
struct cuda_api api;
|
||||
if (!load_cuda(&api)) {
|
||||
fprintf(stderr, "failed to load libcuda.so.1 or required Driver API symbols\n");
|
||||
return 1;
|
||||
}
|
||||
load_symbol(api.lib, "cuGetErrorName", (void **)&api.cuGetErrorName);
|
||||
load_symbol(api.lib, "cuGetErrorString", (void **)&api.cuGetErrorString);
|
||||
|
||||
if (!check_rc(&api, "cuInit", api.cuInit(0))) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
int count = 0;
|
||||
if (!check_rc(&api, "cuDeviceGetCount", api.cuDeviceGetCount(&count))) {
|
||||
return 1;
|
||||
}
|
||||
if (count <= 0) {
|
||||
fprintf(stderr, "no CUDA devices found\n");
|
||||
return 1;
|
||||
}
|
||||
|
||||
CUdevice dev = 0;
|
||||
if (!check_rc(&api, "cuDeviceGet", api.cuDeviceGet(&dev, 0))) {
|
||||
return 1;
|
||||
}
|
||||
char name[128] = {0};
|
||||
if (!check_rc(&api, "cuDeviceGetName", api.cuDeviceGetName(name, (int)sizeof(name), dev))) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
CUcontext ctx = NULL;
|
||||
if (!check_rc(&api, "cuCtxCreate", api.cuCtxCreate(&ctx, 0, dev))) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
size_t bytes = (size_t)size_mb * 1024 * 1024;
|
||||
uint32_t words = (uint32_t)(bytes / sizeof(uint32_t));
|
||||
if (words < 1024) {
|
||||
words = 1024;
|
||||
bytes = (size_t)words * sizeof(uint32_t);
|
||||
}
|
||||
|
||||
uint32_t *host = (uint32_t *)malloc(bytes);
|
||||
if (!host) {
|
||||
fprintf(stderr, "malloc failed\n");
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 1;
|
||||
}
|
||||
for (uint32_t i = 0; i < words; i++) {
|
||||
host[i] = i ^ 0x12345678u;
|
||||
}
|
||||
|
||||
CUdeviceptr device_mem = 0;
|
||||
if (!check_rc(&api, "cuMemAlloc", api.cuMemAlloc(&device_mem, bytes))) {
|
||||
free(host);
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 1;
|
||||
}
|
||||
if (!check_rc(&api, "cuMemcpyHtoD", api.cuMemcpyHtoD(device_mem, host, bytes))) {
|
||||
api.cuMemFree(device_mem);
|
||||
free(host);
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 1;
|
||||
}
|
||||
|
||||
CUmodule module = NULL;
|
||||
if (!check_rc(&api, "cuModuleLoadDataEx", api.cuModuleLoadDataEx(&module, ptx_source, 0, NULL, NULL))) {
|
||||
api.cuMemFree(device_mem);
|
||||
free(host);
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 1;
|
||||
}
|
||||
|
||||
CUfunction kernel = NULL;
|
||||
if (!check_rc(&api, "cuModuleGetFunction", api.cuModuleGetFunction(&kernel, module, "burn"))) {
|
||||
api.cuMemFree(device_mem);
|
||||
free(host);
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 1;
|
||||
}
|
||||
|
||||
unsigned int threads = 256;
|
||||
unsigned int blocks = (words + threads - 1) / threads;
|
||||
uint32_t rounds = 256;
|
||||
void *params[] = {&device_mem, &words, &rounds};
|
||||
|
||||
double start = now_seconds();
|
||||
double deadline = start + (double)seconds;
|
||||
unsigned long iterations = 0;
|
||||
while (now_seconds() < deadline) {
|
||||
if (!check_rc(&api, "cuLaunchKernel",
|
||||
api.cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1, 0, NULL, params, NULL))) {
|
||||
api.cuMemFree(device_mem);
|
||||
free(host);
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 1;
|
||||
}
|
||||
iterations++;
|
||||
}
|
||||
|
||||
if (!check_rc(&api, "cuCtxSynchronize", api.cuCtxSynchronize())) {
|
||||
api.cuMemFree(device_mem);
|
||||
free(host);
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 1;
|
||||
}
|
||||
if (!check_rc(&api, "cuMemcpyDtoH", api.cuMemcpyDtoH(host, device_mem, bytes))) {
|
||||
api.cuMemFree(device_mem);
|
||||
free(host);
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 1;
|
||||
}
|
||||
|
||||
uint64_t checksum = 0;
|
||||
for (uint32_t i = 0; i < words; i += words / 256 ? words / 256 : 1) {
|
||||
checksum += host[i];
|
||||
}
|
||||
|
||||
double elapsed = now_seconds() - start;
|
||||
printf("device=%s\n", name);
|
||||
printf("duration_s=%.2f\n", elapsed);
|
||||
printf("buffer_mb=%d\n", size_mb);
|
||||
printf("iterations=%lu\n", iterations);
|
||||
printf("checksum=%llu\n", (unsigned long long)checksum);
|
||||
printf("status=OK\n");
|
||||
|
||||
api.cuMemFree(device_mem);
|
||||
free(host);
|
||||
api.cuCtxDestroy(ctx);
|
||||
return 0;
|
||||
}
|
||||
@@ -1,5 +1,5 @@
|
||||
#!/bin/sh
|
||||
# build-in-container.sh — build the bee ISO inside a Debian container.
|
||||
# build-in-container.sh — build the bee ISO inside the Debian builder container.
|
||||
|
||||
set -e
|
||||
|
||||
@@ -70,6 +70,7 @@ set -- \
|
||||
run --rm --privileged \
|
||||
-v "${REPO_ROOT}:/work" \
|
||||
-v "${CACHE_DIR}:/cache" \
|
||||
-e BEE_CONTAINER_BUILD=1 \
|
||||
-e GOCACHE=/cache/go-build \
|
||||
-e GOMODCACHE=/cache/go-mod \
|
||||
-e TMPDIR=/cache/tmp \
|
||||
@@ -83,6 +84,7 @@ if [ -n "$AUTH_KEYS" ]; then
|
||||
-v "${REPO_ROOT}:/work" \
|
||||
-v "${CACHE_DIR}:/cache" \
|
||||
-v "${AUTH_KEYS_DIR}:/tmp/bee-authkeys:ro" \
|
||||
-e BEE_CONTAINER_BUILD=1 \
|
||||
-e GOCACHE=/cache/go-build \
|
||||
-e GOMODCACHE=/cache/go-mod \
|
||||
-e TMPDIR=/cache/tmp \
|
||||
|
||||
@@ -1,14 +1,13 @@
|
||||
#!/bin/sh
|
||||
# build.sh — build bee ISO (Debian 12 / live-build)
|
||||
#
|
||||
# Single build script. Produces a bootable live ISO with SSH access, TUI, NVIDIA drivers.
|
||||
#
|
||||
# Run on Debian 12 builder VM as root after setup-builder.sh.
|
||||
# Usage:
|
||||
# sh iso/builder/build.sh [--authorized-keys /path/to/authorized_keys]
|
||||
# build.sh — internal ISO build entrypoint executed inside the builder container.
|
||||
|
||||
set -e
|
||||
|
||||
if [ "${BEE_CONTAINER_BUILD:-0}" != "1" ]; then
|
||||
echo "build.sh must run inside iso/builder/build-in-container.sh" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
|
||||
BUILDER_DIR="${REPO_ROOT}/iso/builder"
|
||||
OVERLAY_DIR="${REPO_ROOT}/iso/overlay"
|
||||
@@ -39,8 +38,12 @@ echo "=== bee ISO build ==="
|
||||
echo "Debian: ${DEBIAN_VERSION}, Kernel ABI: ${DEBIAN_KERNEL_ABI}, Go: ${GO_VERSION}"
|
||||
echo ""
|
||||
|
||||
echo "=== syncing git submodules ==="
|
||||
git -C "${REPO_ROOT}" submodule update --init --recursive
|
||||
|
||||
# --- compile bee binary (static, Linux amd64) ---
|
||||
BEE_BIN="${DIST_DIR}/bee-linux-amd64"
|
||||
GPU_STRESS_BIN="${DIST_DIR}/bee-gpu-stress-linux-amd64"
|
||||
NEED_BUILD=1
|
||||
if [ -f "$BEE_BIN" ]; then
|
||||
NEWEST_SRC=$(find "${REPO_ROOT}/audit" -name '*.go' -newer "$BEE_BIN" | head -1)
|
||||
@@ -70,6 +73,22 @@ else
|
||||
echo "=== bee binary up to date, skipping build ==="
|
||||
fi
|
||||
|
||||
GPU_STRESS_NEED_BUILD=1
|
||||
if [ -f "$GPU_STRESS_BIN" ] && [ "${BUILDER_DIR}/bee-gpu-stress.c" -ot "$GPU_STRESS_BIN" ]; then
|
||||
GPU_STRESS_NEED_BUILD=0
|
||||
fi
|
||||
|
||||
if [ "$GPU_STRESS_NEED_BUILD" = "1" ]; then
|
||||
echo "=== building bee-gpu-stress ==="
|
||||
gcc -O2 -s -Wall -Wextra \
|
||||
-o "$GPU_STRESS_BIN" \
|
||||
"${BUILDER_DIR}/bee-gpu-stress.c" \
|
||||
-ldl
|
||||
echo "binary: $GPU_STRESS_BIN"
|
||||
else
|
||||
echo "=== bee-gpu-stress up to date, skipping build ==="
|
||||
fi
|
||||
|
||||
echo "=== preparing staged overlay ==="
|
||||
rm -rf "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}"
|
||||
mkdir -p "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}"
|
||||
@@ -80,6 +99,7 @@ rm -f \
|
||||
"${OVERLAY_STAGE_DIR}/etc/bee-release" \
|
||||
"${OVERLAY_STAGE_DIR}/root/.ssh/authorized_keys" \
|
||||
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee" \
|
||||
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress" \
|
||||
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
|
||||
|
||||
# --- inject authorized_keys for SSH access ---
|
||||
@@ -119,13 +139,15 @@ fi
|
||||
mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/bin"
|
||||
cp "${DIST_DIR}/bee-linux-amd64" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
|
||||
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
|
||||
cp "${GPU_STRESS_BIN}" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress"
|
||||
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress"
|
||||
|
||||
# --- inject smoketest into overlay so it runs directly on the live CD ---
|
||||
cp "${BUILDER_DIR}/smoketest.sh" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
|
||||
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
|
||||
|
||||
# --- vendor utilities (optional pre-fetched binaries) ---
|
||||
for tool in storcli64 sas2ircu sas3ircu mstflint; do
|
||||
for tool in storcli64 sas2ircu sas3ircu arcconf ssacli; do
|
||||
if [ -f "${VENDOR_DIR}/${tool}" ]; then
|
||||
cp "${VENDOR_DIR}/${tool}" "${OVERLAY_STAGE_DIR}/usr/local/bin/${tool}"
|
||||
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/${tool}" || true
|
||||
|
||||
@@ -8,7 +8,9 @@ echo "=== bee chroot setup ==="
|
||||
# Enable bee services
|
||||
systemctl enable bee-network.service
|
||||
systemctl enable bee-nvidia.service
|
||||
systemctl enable bee-preflight.service
|
||||
systemctl enable bee-audit.service
|
||||
systemctl enable bee-web.service
|
||||
systemctl enable bee-sshsetup.service
|
||||
systemctl enable ssh.service
|
||||
systemctl enable qemu-guest-agent.service 2>/dev/null || true
|
||||
@@ -25,8 +27,8 @@ chmod +x /usr/local/bin/bee 2>/dev/null || true
|
||||
# Reload udev rules
|
||||
udevadm control --reload-rules 2>/dev/null || true
|
||||
|
||||
# Create log directory
|
||||
mkdir -p /var/log
|
||||
# Create export directory
|
||||
mkdir -p /appdata/bee/export
|
||||
|
||||
if [ -f /etc/sudoers.d/bee ]; then
|
||||
chmod 0440 /etc/sudoers.d/bee
|
||||
|
||||
@@ -11,6 +11,8 @@ lshw
|
||||
iproute2
|
||||
isc-dhcp-client
|
||||
iputils-ping
|
||||
ethtool
|
||||
lm-sensors
|
||||
qemu-guest-agent
|
||||
|
||||
# SSH
|
||||
@@ -25,8 +27,11 @@ less
|
||||
vim-tiny
|
||||
mc
|
||||
htop
|
||||
nvtop
|
||||
sudo
|
||||
zstd
|
||||
mstflint
|
||||
memtester
|
||||
|
||||
# QR codes (for displaying audit results)
|
||||
qrencode
|
||||
|
||||
@@ -1,75 +0,0 @@
|
||||
#!/bin/sh
|
||||
# setup-builder.sh — prepare Debian 12 host/VM as bee ISO builder
|
||||
#
|
||||
# Run once on a fresh Debian 12 (Bookworm) host/VM as root.
|
||||
# After this script completes, the machine can build bee ISO images directly.
|
||||
# Container alternative: use `iso/builder/build-in-container.sh`.
|
||||
#
|
||||
# Usage (on Debian VM):
|
||||
# wget -O- https://git.mchus.pro/mchus/bee/raw/branch/main/iso/builder/setup-builder.sh | sh
|
||||
# or: sh setup-builder.sh
|
||||
|
||||
set -e
|
||||
|
||||
. "$(dirname "$0")/VERSIONS" 2>/dev/null || true
|
||||
GO_VERSION="${GO_VERSION:-1.23.6}"
|
||||
DEBIAN_VERSION="${DEBIAN_VERSION:-12}"
|
||||
DEBIAN_KERNEL_ABI="${DEBIAN_KERNEL_ABI:-6.1.0-28}"
|
||||
|
||||
echo "=== bee builder setup ==="
|
||||
echo "Debian: $(cat /etc/debian_version)"
|
||||
echo "Go target: ${GO_VERSION}"
|
||||
echo "Kernel ABI: ${DEBIAN_KERNEL_ABI}"
|
||||
echo ""
|
||||
|
||||
# --- system packages ---
|
||||
export DEBIAN_FRONTEND=noninteractive
|
||||
apt-get update -qq
|
||||
|
||||
apt-get install -y \
|
||||
live-build \
|
||||
debootstrap \
|
||||
squashfs-tools \
|
||||
xorriso \
|
||||
grub-pc-bin \
|
||||
grub-efi-amd64-bin \
|
||||
mtools \
|
||||
git \
|
||||
wget \
|
||||
curl \
|
||||
tar \
|
||||
xz-utils \
|
||||
screen \
|
||||
rsync \
|
||||
build-essential \
|
||||
gcc \
|
||||
make \
|
||||
perl \
|
||||
"linux-headers-${DEBIAN_KERNEL_ABI}-amd64"
|
||||
|
||||
echo "linux-headers installed: $(dpkg -l "linux-headers-${DEBIAN_KERNEL_ABI}-amd64" | awk '/^ii/{print $3}')"
|
||||
|
||||
# --- Go toolchain ---
|
||||
echo ""
|
||||
echo "=== installing Go ${GO_VERSION} ==="
|
||||
if [ -d /usr/local/go ] && /usr/local/go/bin/go version 2>/dev/null | grep -q "${GO_VERSION}"; then
|
||||
echo "Go ${GO_VERSION} already installed"
|
||||
else
|
||||
ARCH=$(uname -m)
|
||||
case "$ARCH" in
|
||||
x86_64) GOARCH=amd64 ;;
|
||||
aarch64) GOARCH=arm64 ;;
|
||||
*) echo "unsupported arch: $ARCH"; exit 1 ;;
|
||||
esac
|
||||
wget -O /tmp/go.tar.gz \
|
||||
"https://go.dev/dl/go${GO_VERSION}.linux-${GOARCH}.tar.gz"
|
||||
rm -rf /usr/local/go
|
||||
tar -C /usr/local -xzf /tmp/go.tar.gz
|
||||
rm /tmp/go.tar.gz
|
||||
fi
|
||||
export PATH="$PATH:/usr/local/go/bin"
|
||||
echo "Go: $(go version)"
|
||||
|
||||
echo ""
|
||||
echo "=== builder setup complete ==="
|
||||
echo "Next: sh iso/builder/build.sh"
|
||||
@@ -96,7 +96,7 @@ done
|
||||
|
||||
echo ""
|
||||
echo "-- systemd services --"
|
||||
for svc in bee-nvidia bee-network bee-audit; do
|
||||
for svc in bee-nvidia bee-network bee-preflight bee-audit bee-web; do
|
||||
if systemctl is-active --quiet "$svc" 2>/dev/null; then
|
||||
ok "service active: $svc"
|
||||
else
|
||||
@@ -104,6 +104,20 @@ for svc in bee-nvidia bee-network bee-audit; do
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "-- runtime health --"
|
||||
if [ -f /appdata/bee/export/runtime-health.json ] && [ -s /appdata/bee/export/runtime-health.json ]; then
|
||||
ok "runtime: runtime-health.json present and non-empty"
|
||||
else
|
||||
fail "runtime: runtime-health.json missing or empty"
|
||||
fi
|
||||
|
||||
if [ -f /appdata/bee/export/runtime-health.log ]; then
|
||||
info "last runtime log line: $(tail -1 /appdata/bee/export/runtime-health.log)"
|
||||
else
|
||||
warn "runtime: no log found at /appdata/bee/export/runtime-health.log"
|
||||
fi
|
||||
|
||||
for svc in ssh bee-sshsetup; do
|
||||
if systemctl is-active --quiet "$svc" 2>/dev/null \
|
||||
|| systemctl show "$svc" --property=ActiveState 2>/dev/null | grep -q "inactive\|exited"; then
|
||||
@@ -126,29 +140,43 @@ fi
|
||||
|
||||
echo ""
|
||||
echo "-- audit last run --"
|
||||
if [ -f /var/log/bee-audit.json ] && [ -s /var/log/bee-audit.json ]; then
|
||||
if [ -f /appdata/bee/export/bee-audit.json ] && [ -s /appdata/bee/export/bee-audit.json ]; then
|
||||
ok "audit: bee-audit.json present and non-empty"
|
||||
info "size: $(du -sh /var/log/bee-audit.json | cut -f1)"
|
||||
info "size: $(du -sh /appdata/bee/export/bee-audit.json | cut -f1)"
|
||||
else
|
||||
fail "audit: bee-audit.json missing or empty"
|
||||
fi
|
||||
|
||||
if [ -f /var/log/bee-audit.log ]; then
|
||||
last_line=$(tail -1 /var/log/bee-audit.log)
|
||||
if [ -f /appdata/bee/export/bee-audit.log ]; then
|
||||
last_line=$(tail -1 /appdata/bee/export/bee-audit.log)
|
||||
info "last log line: $last_line"
|
||||
if grep -q "audit output written" /var/log/bee-audit.log 2>/dev/null; then
|
||||
if grep -q "audit output written" /appdata/bee/export/bee-audit.log 2>/dev/null; then
|
||||
ok "audit: completed successfully"
|
||||
else
|
||||
warn "audit: 'audit output written' not found in log — may have failed"
|
||||
fi
|
||||
if grep -q "nvidia: enrichment skipped\|nvidia.*skipped\|enrichment skipped" /var/log/bee-audit.log 2>/dev/null; then
|
||||
reason=$(grep -E "nvidia.*skipped|enrichment skipped" /var/log/bee-audit.log | tail -1)
|
||||
if grep -q "nvidia: enrichment skipped\|nvidia.*skipped\|enrichment skipped" /appdata/bee/export/bee-audit.log 2>/dev/null; then
|
||||
reason=$(grep -E "nvidia.*skipped|enrichment skipped" /appdata/bee/export/bee-audit.log | tail -1)
|
||||
fail "audit: nvidia enrichment skipped — $reason"
|
||||
else
|
||||
ok "audit: nvidia enrichment OK (no skip message)"
|
||||
fi
|
||||
else
|
||||
warn "audit: no log found at /var/log/bee-audit.log"
|
||||
warn "audit: no log found at /appdata/bee/export/bee-audit.log"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "-- bee web --"
|
||||
if [ -f /appdata/bee/export/bee-web.log ]; then
|
||||
info "last web log line: $(tail -1 /appdata/bee/export/bee-web.log)"
|
||||
else
|
||||
warn "web: no log found at /appdata/bee/export/bee-web.log"
|
||||
fi
|
||||
|
||||
if bash -c 'exec 3<>/dev/tcp/127.0.0.1/80 && printf "GET /healthz HTTP/1.0\r\nHost: localhost\r\n\r\n" >&3 && grep -q "^ok$" <&3'; then
|
||||
ok "web: health endpoint reachable on 127.0.0.1:80"
|
||||
else
|
||||
fail "web: health endpoint not reachable on 127.0.0.1:80"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
|
||||
@@ -9,7 +9,8 @@
|
||||
Hardware Audit LiveCD
|
||||
Build: %%BUILD_INFO%%
|
||||
|
||||
Logs: /var/log/bee-audit.json /var/log/bee-network.log
|
||||
Export dir: /appdata/bee/export
|
||||
Self-check: /appdata/bee/export/runtime-health.json
|
||||
|
||||
Open TUI: bee-tui
|
||||
|
||||
|
||||
@@ -1,12 +1,13 @@
|
||||
[Unit]
|
||||
Description=Bee: run hardware audit
|
||||
After=bee-network.service bee-nvidia.service
|
||||
After=bee-network.service bee-nvidia.service bee-preflight.service
|
||||
Before=bee-web.service
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart=/bin/sh -c '/usr/local/bin/bee audit --runtime livecd --output file:/var/log/bee-audit.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-audit] WARN: audit exited with rc=$rc"; fi; exit 0'
|
||||
StandardOutput=append:/var/log/bee-audit.log
|
||||
StandardError=append:/var/log/bee-audit.log
|
||||
ExecStart=/bin/sh -c '/usr/local/bin/bee audit --runtime livecd --output file:/appdata/bee/export/bee-audit.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-audit] WARN: audit exited with rc=$rc"; fi; exit 0'
|
||||
StandardOutput=append:/appdata/bee/export/bee-audit.log
|
||||
StandardError=append:/appdata/bee/export/bee-audit.log
|
||||
RemainAfterExit=yes
|
||||
|
||||
[Install]
|
||||
|
||||
@@ -6,8 +6,8 @@ Before=network-online.target bee-audit.service
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart=/usr/local/bin/bee-network.sh
|
||||
StandardOutput=append:/var/log/bee-network.log
|
||||
StandardError=append:/var/log/bee-network.log
|
||||
StandardOutput=append:/appdata/bee/export/bee-network.log
|
||||
StandardError=append:/appdata/bee/export/bee-network.log
|
||||
RemainAfterExit=yes
|
||||
|
||||
[Install]
|
||||
|
||||
@@ -6,8 +6,8 @@ Before=bee-audit.service
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart=/usr/local/bin/bee-nvidia-load
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
StandardOutput=append:/appdata/bee/export/bee-nvidia.log
|
||||
StandardError=append:/appdata/bee/export/bee-nvidia.log
|
||||
RemainAfterExit=yes
|
||||
|
||||
[Install]
|
||||
|
||||
14
iso/overlay/etc/systemd/system/bee-preflight.service
Normal file
14
iso/overlay/etc/systemd/system/bee-preflight.service
Normal file
@@ -0,0 +1,14 @@
|
||||
[Unit]
|
||||
Description=Bee: runtime preflight self-check
|
||||
After=bee-network.service bee-nvidia.service
|
||||
Before=bee-audit.service
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart=/bin/sh -c '/usr/local/bin/bee preflight --output file:/appdata/bee/export/runtime-health.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-preflight] WARN: preflight exited with rc=$rc"; fi; exit 0'
|
||||
StandardOutput=append:/appdata/bee/export/runtime-health.log
|
||||
StandardError=append:/appdata/bee/export/runtime-health.log
|
||||
RemainAfterExit=yes
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
@@ -6,6 +6,8 @@ Before=ssh.service
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart=/usr/local/bin/bee-sshsetup
|
||||
StandardOutput=append:/appdata/bee/export/bee-sshsetup.log
|
||||
StandardError=append:/appdata/bee/export/bee-sshsetup.log
|
||||
RemainAfterExit=yes
|
||||
|
||||
[Install]
|
||||
|
||||
15
iso/overlay/etc/systemd/system/bee-web.service
Normal file
15
iso/overlay/etc/systemd/system/bee-web.service
Normal file
@@ -0,0 +1,15 @@
|
||||
[Unit]
|
||||
Description=Bee: hardware audit web viewer
|
||||
After=bee-network.service bee-audit.service
|
||||
Wants=bee-audit.service
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
ExecStart=/usr/local/bin/bee web --listen :80 --audit-path /appdata/bee/export/bee-audit.json --export-dir /appdata/bee/export --title "Bee Hardware Audit"
|
||||
Restart=always
|
||||
RestartSec=2
|
||||
StandardOutput=append:/appdata/bee/export/bee-web.log
|
||||
StandardError=append:/appdata/bee/export/bee-web.log
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
@@ -20,10 +20,10 @@ fi
|
||||
|
||||
for iface in $interfaces; do
|
||||
log "bringing up $iface"
|
||||
ip link set "$iface" up 2>/dev/null || { log "WARN: could not bring up $iface"; continue; }
|
||||
ip link set "$iface" up || { log "WARN: could not bring up $iface"; continue; }
|
||||
|
||||
# DHCP in background — non-blocking, retries indefinitely
|
||||
dhclient -nw "$iface" 2>/dev/null &
|
||||
# DHCP in background — non-blocking, keep dhclient verbose output in the service log.
|
||||
dhclient -4 -v -nw "$iface" &
|
||||
log "DHCP started for $iface (pid $!)"
|
||||
done
|
||||
|
||||
|
||||
@@ -16,12 +16,15 @@ fi
|
||||
log "module dir: $NVIDIA_KO_DIR"
|
||||
ls "$NVIDIA_KO_DIR"/*.ko 2>/dev/null | sed 's/^/ /' || true
|
||||
|
||||
# Some kernels expose backlight helper symbols only after loading `video`.
|
||||
modprobe video >/dev/null 2>&1 && log "loaded helper module: video" || log "helper module unavailable: video"
|
||||
|
||||
# Load modules via insmod (direct load — no depmod needed)
|
||||
for mod in nvidia nvidia-modeset nvidia-uvm; do
|
||||
ko="$NVIDIA_KO_DIR/${mod}.ko"
|
||||
[ -f "$ko" ] || ko="$NVIDIA_KO_DIR/${mod//-/_}.ko"
|
||||
if [ -f "$ko" ]; then
|
||||
if insmod "$ko" 2>/dev/null; then
|
||||
if insmod "$ko"; then
|
||||
log "loaded: $mod"
|
||||
else
|
||||
log "WARN: failed to load: $mod"
|
||||
@@ -33,25 +36,25 @@ for mod in nvidia nvidia-modeset nvidia-uvm; do
|
||||
done
|
||||
|
||||
# Create /dev/nvidia* device nodes (udev rules absent since we use .run installer)
|
||||
nvidia_major=$(grep -m1 ' nvidiactl$' /proc/devices 2>/dev/null | awk '{print $1}')
|
||||
nvidia_major=$(grep -m1 ' nvidiactl$' /proc/devices | awk '{print $1}')
|
||||
if [ -n "$nvidia_major" ]; then
|
||||
mknod -m 666 /dev/nvidiactl c "$nvidia_major" 255 2>/dev/null \
|
||||
mknod -m 666 /dev/nvidiactl c "$nvidia_major" 255 \
|
||||
&& log "created /dev/nvidiactl (major $nvidia_major)" \
|
||||
|| log "WARN: /dev/nvidiactl already exists or mknod failed"
|
||||
for i in 0 1 2 3 4 5 6 7; do
|
||||
mknod -m 666 "/dev/nvidia$i" c "$nvidia_major" "$i" 2>/dev/null || true
|
||||
mknod -m 666 "/dev/nvidia$i" c "$nvidia_major" "$i" || true
|
||||
done
|
||||
log "created /dev/nvidia{0-7}"
|
||||
else
|
||||
log "WARN: nvidiactl not in /proc/devices — no GPU hardware present?"
|
||||
fi
|
||||
|
||||
uvm_major=$(grep -m1 ' nvidia-uvm$' /proc/devices 2>/dev/null | awk '{print $1}')
|
||||
uvm_major=$(grep -m1 ' nvidia-uvm$' /proc/devices | awk '{print $1}')
|
||||
if [ -n "$uvm_major" ]; then
|
||||
mknod -m 666 /dev/nvidia-uvm c "$uvm_major" 0 2>/dev/null \
|
||||
mknod -m 666 /dev/nvidia-uvm c "$uvm_major" 0 \
|
||||
&& log "created /dev/nvidia-uvm (major $uvm_major)" \
|
||||
|| log "WARN: /dev/nvidia-uvm already exists"
|
||||
mknod -m 666 /dev/nvidia-uvm-tools c "$uvm_major" 1 2>/dev/null || true
|
||||
mknod -m 666 /dev/nvidia-uvm-tools c "$uvm_major" 1 || true
|
||||
else
|
||||
log "WARN: nvidia-uvm not in /proc/devices"
|
||||
fi
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
#!/bin/sh
|
||||
# run-builder.sh — trigger ISO build on remote Debian 12 builder VM
|
||||
# run-builder.sh — trigger containerized ISO build on a remote builder host
|
||||
#
|
||||
# Usage:
|
||||
# sh scripts/run-builder.sh
|
||||
@@ -79,7 +79,7 @@ screen -S bee-build -X quit 2>/dev/null || true
|
||||
|
||||
echo "--- starting build in screen session (survives SSH disconnect) ---"
|
||||
echo "--- log: \$LOG ---"
|
||||
screen -dmS bee-build sh -c "sudo sh iso/builder/build.sh ${EXTRA_ARGS} > \$LOG 2>&1; echo \$? > /tmp/bee-build-exit"
|
||||
screen -dmS bee-build sh -c "sh iso/builder/build-in-container.sh ${EXTRA_ARGS} > \$LOG 2>&1; echo \$? > /tmp/bee-build-exit"
|
||||
|
||||
# Stream log until build finishes
|
||||
echo "--- streaming build log (Ctrl+C safe — build continues on VM) ---"
|
||||
|
||||
Reference in New Issue
Block a user