Compare commits
173 Commits: fa553c3f20...v3.3
| Author | SHA1 | Date | |
|---|---|---|---|
| 407c1cd1c4 | |||
| e15bcc91c5 | |||
| 98f0cf0d52 | |||
| 4db89e9773 | |||
| 3fda18f708 | |||
| ea518abf30 | |||
| 744de588bb | |||
| a3ed9473a3 | |||
| a714c45f10 | |||
| 349e026cfa | |||
| 889fe1dc2f | |||
| befdbf3768 | |||
| ec6a0b292d | |||
| a03312c286 | |||
| e69e9109da | |||
| 413869809d | |||
| f9bd38572a | |||
| 662e3d2cdd | |||
| 126af96780 | |||
| ada15ac777 | |||
| dfb94f9ca6 | |||
| 5857805518 | |||
| 59a1d4b209 | |||
| 0dbfaf6121 | |||
| 5d72d48714 | |||
| 096b4a09ca | |||
| 5d42a92e4c | |||
| 3e54763367 | |||
| f91bce8661 | |||
| 585e6d7311 | |||
| 0a98ed8ae9 | |||
| 911745e4da | |||
| acfd2010d7 | |||
| e904c13790 | |||
| 24c5c72cee | |||
| 6ff0bcad56 | |||
| 4fef26000c | |||
| a393dcb731 | |||
| 9e55728053 | |||
| 4b8023c1cb | |||
| 4c8417d20a | |||
| 0755374dd2 | |||
| c70ae274fa | |||
| 23ad7ff534 | |||
| de130966f7 | |||
| c6fbfc8306 | |||
| 35ad1c74d9 | |||
| 4a02e74b17 | |||
| cd2853ad99 | |||
| 6caf771d6e | |||
| 14fa87b7d7 | |||
| 600ece911b | |||
| 2d424c63cb | |||
| 50f28d1ee6 | |||
| 3579747ae3 | |||
| 09dc7d2613 | |||
| ec0b7f7ff9 | |||
| e7a7ff54b9 | |||
| b4371e291e | |||
| c22b53a406 | |||
| ff0acc3698 | |||
| d50760e7c6 | |||
| ed4f8be019 | |||
| 883592d029 | |||
| a6dcaf1c7e | |||
| 88727fb590 | |||
| c9f5224c42 | |||
| 7cb5c02a9b | |||
| c1aa3cf491 | |||
| f7eb75c57c | |||
| 004cc4910d | |||
| ed1cceed8c | |||
| 9fe9f061f8 | |||
| 837a1fb981 | |||
| 1f43b4e050 | |||
| 83bbc8a1bc | |||
| 896bdb6ee8 | |||
| 5407c26e25 | |||
| 4fddaba9c5 | |||
| d2f384b6eb | |||
| 25f0f30aaf | |||
| a57b037a91 | |||
| 5644231f9a | |||
| eea98e6d76 | |||
| 967455194c | |||
| 79dabf3efb | |||
| 1336f5b95c | |||
| 31486a31c1 | |||
| aa3fc332ba | |||
| 62c57b87f2 | |||
| f600261546 | |||
| d7ca04bdfb | |||
| 5433652c70 | |||
| b25f014dbd | |||
| d69a46f211 | |||
| fc5c2019aa | |||
| 67a215c66f | |||
| 8b4bfdf5ad | |||
| 0a52a4f3ba | |||
| b132f7973a | |||
| bd94b6c792 | |||
| 06017eddfd | |||
| 0ac7b6a963 | |||
| 3d2ae4cdcb | |||
| 4669f14f4f | |||
| 540a9e39b8 | |||
| 58510207fa | |||
| 4cd7c9ab4e | |||
| cfe255f6e4 | |||
| 8b9d3447d7 | |||
| 614b7cad61 | |||
| 9a1df9b1ba | |||
| 30cf014d58 | |||
| 27d478aed6 | |||
| d36e8442a9 | |||
| b345b0d14d | |||
| 0a1ac2ab9f | |||
| 1e62f828c6 | |||
| f8c997d272 | |||
| 0c16616cc9 | |||
| adcc147b32 | |||
| 94e233651e | |||
| 03c36f6cb2 | |||
| a221814797 | |||
| b6619d5ccc | |||
| 450193b063 | |||
| ee8931f171 | |||
| b771d95894 | |||
| 8e60e474dc | |||
| 2f4ec2acda | |||
| 7ed5cb0306 | |||
| 6df7ac68f5 | |||
| 0ce23aea4f | |||
| 36dff6e584 | |||
| 1c80906c1f | |||
| 2abe2ce3aa | |||
| 8233c9ee85 | |||
| 13189e2683 | |||
| 76a17937f3 | |||
| b965184e71 | |||
| b25a2f6d30 | |||
| d18cde19c1 | |||
| 78c6dfc0ef | |||
| 72cf482ad3 | |||
| a6023372b1 | |||
| ab5a4be7ac | |||
| b8c235b5ac | |||
| b483e2ce35 | |||
| 17f0bda45e | |||
| 591164a251 | |||
| ef4ec5695d | |||
| f1e096cabe | |||
| 6082c7953e | |||
| f37ef0d844 | |||
| e32fa6e477 | |||
| 20118bb400 | |||
| 55d6876297 | |||
| e8e176ab7f | |||
| caeafa836b | |||
| e8a52562e7 | |||
| 6aca1682b9 | |||
| b7c888edb1 | |||
| 17d5d74a8d | |||
| d487e539bb | |||
| 441ab3adbd | |||
| c91c8d8cf9 | |||
| 83e1910281 | |||
| 2252c5af56 | |||
| 7a4d75c143 | |||
| 7c62d100d4 | |||
| c843ff95a2 | |||
| 0057686769 | |||
| 68b5e02a74 | |||
@@ -1 +1,2 @@
BUILDER_HOST=
BUILDER_USER=
.gitmodules (3 changed lines, vendored)
@@ -1,3 +1,6 @@
 [submodule "bible"]
 	path = bible
 	url = https://git.mchus.pro/mchus/bible.git
+[submodule "internal/chart"]
+	path = internal/chart
+	url = https://git.mchus.pro/reanimator/chart.git
PLAN.md (395 changed lines)
@@ -4,13 +4,13 @@ Hardware audit LiveCD for offline server inventory.
 Produces `HardwareIngestRequest` JSON compatible with core/reanimator.

 **Principle:** OS-level collection — reads hardware directly, not through BMC.
-Fully unattended — no user interaction required at any stage. Boot → update → audit → output → done.
-All errors are logged, never presented interactively. Every failure path has a silent fallback.
+Automatic boot audit plus operator console. Boot runs audit immediately, but local/SSH operators can rerun checks through the TUI and CLI.
+Errors are logged and should not block boot on partial collector failures.
 Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.

 ---

-## Status snapshot (2026-03-06)
+## Status snapshot (2026-03-14)

 ### Phase 1 — Go Audit Binary

@@ -23,33 +23,38 @@ Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials,
 - 1.7 PSU collector — **DONE (basic FRU path)**
 - 1.8 NVIDIA GPU enrichment — **DONE**
 - 1.8b Component wear / age telemetry — **DONE** (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
+- 1.8c Storage health verdicts — **DONE** (SMART/NVMe warning/failed status derivation)
 - 1.9 Mellanox/NVIDIA NIC enrichment — **DONE** (mstflint + ethtool firmware fallback)
 - 1.10 RAID controller enrichment — **DONE (initial multi-tool support)** (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
-- 1.11 Output and USB write — **DONE** (usb + /tmp fallback)
+- 1.11 PSU SDR health — **DONE** (`ipmitool sdr` merged with FRU inventory)
+- 1.11 Output and export workflow — **DONE** (explicit file output + manual removable export via TUI)
 - 1.12 Integration test (local) — **DONE** (`scripts/test-local.sh`)

-### Phase 2 — Alpine LiveCD
+### Phase 2 — Debian Live ISO

-- Debug ISO track is active (builder + overlay-debug + OpenRC services + TUI workflow).
-- Production ISO track — **IN PROGRESS**.
-- 2.3 Alpine mkimage profile — **DONE (production profile scaffold)**
-- 2.4 Network bring-up on boot — **DONE**
-- 2.5 OpenRC boot service (bee-audit) — **DONE** (with explicit bee-nvidia ordering)
-- 2.6 Vendor utilities in overlay — **DONE (fetch script + iso/vendor scaffold)**
-- 2.7 Auto-update wiring (USB first, network second) — **PARTIAL** (shell flow done; strict Ed25519 verification intentionally deferred to final stage)
-- 2.8 Release workflow — **PARTIAL** (production build now injects audit binary, NVIDIA modules/tools, vendor tools, and build metadata)
+- Current implementation uses Debian 12 `live-build`, `systemd`, and OpenSSH.
+- Network bring-up on boot — **DONE**
+- Boot services (`bee-network`, `bee-nvidia`, `bee-audit`, `bee-sshsetup`) — **DONE**
+- Local console UX (`bee` autologin on `tty1`, `menu` auto-start, TUI privilege escalation via `sudo -n`) — **DONE**
+- VM/debug support (`qemu-guest-agent`, serial console, virtual GPU initramfs modules) — **DONE**
+- Vendor utilities in overlay — **DONE**
+- Build metadata + staged overlay injection — **DONE**
+- Builder container cache persisted outside container writable layer — **DONE**
+- ISO volume label `BEE` — **DONE**
+- Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior.
+- Real-hardware validation remains **PENDING**; current validation is limited to local/libvirt VM boot + service checks.

 ---

 ## Phase 1 — Go Audit Binary

-Self-contained static binary. Runs on any Linux (including Alpine LiveCD).
+Self-contained static binary. Runs on any Linux (including the Debian live ISO).
 Calls system utilities, parses their output, produces `HardwareIngestRequest` JSON.

 ### 1.1 — Project scaffold

 - `audit/go.mod` — module `bee/audit`
-- `audit/cmd/audit/main.go` — CLI entry point: flags, orchestration, JSON output
+- `audit/cmd/bee/main.go` — main CLI entry point: subcommands, runtime selection, JSON output
 - `audit/internal/schema/` — copy of `HardwareIngestRequest` types from core (no import dependency)
 - `audit/internal/collector/` — empty package stubs for all collectors
 - `const Version = "1.0"` in main
@@ -237,305 +242,143 @@ No hardcoded vendor names in detection logic — pure PCI vendor_id map.

 Tests: table tests with storcli/sas2ircu text fixtures

-### 1.11 — Output and USB write
+### 1.11 — Output and export workflow

 `--output stdout` (default): pretty-printed JSON to stdout
 `--output file:<path>`: write JSON to explicit path
-`--output usb`: auto-detect first removable block device, mount it, write `audit-<board_serial>-<YYYYMMDD-HHMMSS>.json`

-USB detection: scan `/sys/block/*/removable`, pick first `1`, mount to `/tmp/bee-usb`
+Live ISO default service output: `/var/log/bee-audit.json`

-QR summary to stdout (always): board serial + model + component counts — fits in one QR code
-Uses `qrencode` if present, else skips silently
+Removable-media export is manual via `bee tui` (or the LiveCD wrapper `bee-tui`):
+- operator chooses a removable filesystem explicitly
+- TUI mounts it if needed
+- TUI asks for confirmation before copying the JSON
+- TUI unmounts temporary mountpoints after export
+
+No auto-write to arbitrary removable media is allowed.
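The export rule above implies that the tool only lists candidate devices and leaves the choice to the operator. A minimal Go sketch of that listing step, assuming candidates are recognized by the sysfs `removable` attribute that the older USB-detection text mentions. `removableOnly` and `listRemovable` are hypothetical names for illustration, not the project's TUI code, and nothing here mounts or writes anything.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// removableOnly keeps the device names whose "removable" attribute reads "1".
// readAttr is injected so the filter can be exercised without real sysfs.
func removableOnly(devices []string, readAttr func(dev string) (string, error)) []string {
	var out []string
	for _, dev := range devices {
		val, err := readAttr(dev)
		if err != nil {
			continue // attribute unreadable: not a candidate
		}
		if strings.TrimSpace(val) == "1" {
			out = append(out, dev)
		}
	}
	return out
}

// listRemovable lists candidates under sysBlock (normally "/sys/block").
// It only reports names; selecting and mounting a target stays with the operator.
func listRemovable(sysBlock string) ([]string, error) {
	entries, err := os.ReadDir(sysBlock)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	readAttr := func(dev string) (string, error) {
		raw, err := os.ReadFile(filepath.Join(sysBlock, dev, "removable"))
		return string(raw), err
	}
	return removableOnly(names, readAttr), nil
}
```

A caller would present the result of `listRemovable("/sys/block")` as a menu and proceed only after an explicit selection and confirmation.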

 ### 1.12 — Integration test (local)

-`scripts/test-local.sh` — runs audit binary on developer machine (Linux), captures JSON,
+`scripts/test-local.sh` — runs `bee audit` on developer machine (Linux), captures JSON,
 validates required fields are present (board.serial_number non-empty, cpus non-empty, etc.)

 Not a unit test — requires real hardware access. Documents how to run for verification.
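The required-field check described above could look roughly like the following Go sketch. The `hardware.board.serial_number` path matches the struct used by `extractBoardSerial` in the removed `main.go`; treating `cpus` as an array under `hardware` is an assumption, and `missingRequired` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"strings"
)

// auditDoc models only the fields the check needs.
// The placement of "cpus" under "hardware" is assumed, not confirmed by the plan.
type auditDoc struct {
	Hardware struct {
		Board struct {
			SerialNumber string `json:"serial_number"`
		} `json:"board"`
		CPUs []json.RawMessage `json:"cpus"`
	} `json:"hardware"`
}

// missingRequired returns the JSON paths of required fields that are
// absent or empty, or an error if the input is not valid JSON.
func missingRequired(data []byte) ([]string, error) {
	var doc auditDoc
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	var missing []string
	if strings.TrimSpace(doc.Hardware.Board.SerialNumber) == "" {
		missing = append(missing, "hardware.board.serial_number")
	}
	if len(doc.Hardware.CPUs) == 0 {
		missing = append(missing, "hardware.cpus")
	}
	return missing, nil
}
```

Returning the full list of missing paths, rather than stopping at the first, gives one complete report per run on hardware that is slow to reach.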

 ---

-## Phase 2 — Alpine LiveCD
+## Phase 2 — Debian Live ISO

-ISO image bootable via BMC virtual media. Runs audit binary automatically on boot.
+ISO image bootable via BMC virtual media or USB. Runs boot services automatically and writes the audit result to `/var/log/bee-audit.json`.

 ### 2.1 — Builder environment

-`iso/builder/Dockerfile` — Alpine 3.21 build environment with:
-- `alpine-sdk`, `abuild`, `squashfs-tools`, `xorriso`
-- Go toolchain (for binary compilation inside builder)
-- NVIDIA driver `.run` pre-fetched during image build
+`iso/builder/build-in-container.sh` is the only supported builder entrypoint.
+It builds a Debian 12 builder image with `live-build`, toolchains, and pinned kernel headers,
+then runs the ISO assembly in a privileged container because `live-build` needs
+mount/chroot/loop capabilities.

-`iso/builder/build.sh` — orchestrates full ISO build:
-1. Compile Go binary (static, `CGO_ENABLED=0`)
-2. Compile NVIDIA kernel module against Alpine 3.21 LTS kernel headers
-3. Run `mkimage.sh` with bee profile
-4. Output: `dist/bee-<version>.iso`
+`iso/builder/build.sh` orchestrates the full ISO build:
+1. compile the Go `bee` binary
+2. create a staged overlay under `dist/overlay-stage`
+3. inject SSH auth, vendor tools, NVIDIA artifacts, and build metadata into the staged overlay
+4. create a disposable `live-build` workdir under `dist/live-build-work`
+5. sync the staged overlay into `config/includes.chroot/`
+6. run `lb config && lb build`
+7. copy the final ISO into `dist/`

 ### 2.2 — NVIDIA driver build

-Alpine 3.21, LTS kernel 6.6 — fixed versions in builder.
+`iso/builder/build-nvidia-module.sh`:
+- downloads the pinned NVIDIA `.run` installer
+- verifies SHA256
+- builds kernel modules against the pinned Debian kernel ABI
+- caches modules, userspace tools, and libs in `dist/nvidia-<version>-<kver>/`
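The "verifies SHA256" step above belongs to a shell script, but the same check is easy to state in Go. The sketch below is only an illustration of the technique, assuming the expected digest is available as a hex string; `verifySHA256`, `verifyBytes`, and `verifyFile` are hypothetical names and not part of the builder.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// verifySHA256 streams r through SHA-256 and compares the digest with
// wantHex (case-insensitive, surrounding whitespace ignored).
// Streaming avoids loading a large installer into memory.
func verifySHA256(r io.Reader, wantHex string) error {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return fmt.Errorf("hash input: %w", err)
	}
	got := hex.EncodeToString(h.Sum(nil))
	want := strings.TrimSpace(wantHex)
	if !strings.EqualFold(got, want) {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, want)
	}
	return nil
}

// verifyBytes is a convenience wrapper for in-memory data.
func verifyBytes(data []byte, wantHex string) error {
	return verifySHA256(bytes.NewReader(data), wantHex)
}

// verifyFile checks a file on disk, e.g. a downloaded installer.
func verifyFile(path, wantHex string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return verifySHA256(f, wantHex)
}
```

A checksum only detects corruption or a wrong download; it does not authenticate the source unless the expected digest itself comes from a trusted, pinned location.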

-`iso/builder/build-nvidia.sh`:
-- Download `NVIDIA-Linux-x86_64-<ver>.run` (version pinned in `iso/builder/VERSIONS`)
-- Extract kernel module sources
-- Compile against `linux-lts-dev` headers
-- Strip and package as `nvidia-<ver>-k6.6.ko.tar.gz` for inclusion in overlay
+`iso/overlay/usr/local/bin/bee-nvidia-load`:
+- loads `nvidia`, `nvidia-modeset`, `nvidia-uvm` via `insmod`
+- creates `/dev/nvidia*` nodes if the driver registered successfully
+- logs failures but does not block the rest of boot

-`iso/overlay/usr/local/bin/load-nvidia.sh`:
-- `insmod` sequence: nvidia.ko → nvidia-modeset.ko → nvidia-uvm.ko
-- Verify: `nvidia-smi -L` → log result
-- On failure: log warning, continue (audit runs without GPU enrichment)
+### 2.3 — ISO assembly and overlay policy

-### 2.3 — Alpine mkimage profile
+`iso/overlay/` is source-only input for the build.

-`iso/builder/mkimg.bee.sh` — Alpine mkimage profile:
-- Base: `alpine-base`
-- Kernel: `linux-lts`
-- Packages: `dmidecode smartmontools nvme-cli pciutils ipmitool util-linux e2fsprogs qrencode`
-- Overlay: `iso/overlay/` included as apkovl
+Build-time files are injected into the staged overlay only:
+- `bee`
+- `bee-smoketest`
+- `authorized_keys`
+- password-fallback marker
+- `/etc/bee-release`
+- vendor tools from `iso/vendor/`

-### 2.4 — Network bring-up on boot
+The source tree must stay clean after a build.

-`iso/overlay/usr/local/bin/bee-network.sh`:
-- Enumerate all network interfaces: `ip link show` → filter out loopback and virtual (docker/bridge)
-- For each physical interface: `ip link set <iface> up` + `udhcpc -i <iface> -t 5 -T 3 -n`
-- Log each interface result (got IP / timeout / no carrier)
-- Continue regardless — network is best-effort for auto-update
+### 2.4 — Boot services

-`iso/overlay/etc/init.d/bee-network`:
-- runlevel: default, before: bee-update
-- Calls bee-network.sh
-- Does not block boot if DHCP fails on all interfaces
+`systemd` service order:
+- `bee-sshsetup.service` → configures SSH auth before `ssh.service`
+- `bee-network.service` → starts best-effort DHCP on all physical interfaces
+- `bee-nvidia.service` → loads NVIDIA modules if present
+- `bee-audit.service` → runs audit and logs failures without turning partial collector bugs into a boot blocker

-### 2.5 — OpenRC boot service (bee-audit)
+### 2.4b — Runtime split

-`iso/overlay/etc/init.d/bee-audit`:
-- runlevel: default, after: bee-update
-- start(): load-nvidia.sh → /usr/local/bin/audit --output usb
-- on completion: print QR summary to /dev/tty1 (always, even if USB write failed)
-- log everything to /var/log/bee-audit.log
-- exits 0 regardless of partial failures — unattended, no prompts, no waits
+Target split:
+- main Go application works on a normal Linux host and on the live ISO
+- live-ISO specifics stay in integration glue under `iso/`
+- the live ISO passes `--runtime livecd` to the Go binary
+- local runs default to `--runtime auto`, which resolves to `local` unless a live marker is detected
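The `--runtime` resolution in the target split can be sketched in Go as below. This is an illustration under assumptions: the plan does not say what the live marker is, so the detector is passed in as a function, and `fileMarker` shows one possible detector based on a file's existence. Accepting an explicit `local` value is also assumed; the plan only names `livecd` and `auto`.

```go
package main

import (
	"fmt"
	"os"
)

// Runtime is the execution mode selected by the --runtime flag.
type Runtime string

const (
	RuntimeAuto   Runtime = "auto"
	RuntimeLocal  Runtime = "local"
	RuntimeLiveCD Runtime = "livecd"
)

// resolveRuntime turns the flag value into a concrete mode.
// "auto" (or an empty value) resolves to livecd only if liveMarker reports true.
func resolveRuntime(flagValue string, liveMarker func() bool) (Runtime, error) {
	switch Runtime(flagValue) {
	case RuntimeLocal, RuntimeLiveCD:
		return Runtime(flagValue), nil
	case RuntimeAuto, "":
		if liveMarker() {
			return RuntimeLiveCD, nil
		}
		return RuntimeLocal, nil
	default:
		return "", fmt.Errorf("unknown runtime %q", flagValue)
	}
}

// fileMarker builds a detector that treats the existence of path as the
// live marker. Which path (if any) the project uses is not stated in the plan.
func fileMarker(path string) func() bool {
	return func() bool {
		_, err := os.Stat(path)
		return err == nil
	}
}
```

Keeping detection behind a function value means the resolution logic can be tested on a developer machine without faking a live environment.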

-Unattended invariants:
-- No TTY prompts ever. All decisions are automatic.
-- Missing USB: output goes to /tmp/bee-audit-<serial>-<date>.json, QR shown on screen.
-- Missing NVIDIA driver: GPU records have status UNKNOWN, audit continues.
-- Missing ipmitool/storcli/any tool: that collector is skipped, rest continue.
-- Timeout on any external command: 30s hard limit via `timeout` wrapper, then skip.
-- Boot never hangs waiting for user input.
+Planned code shape:
+- `audit/cmd/bee/` — main CLI entrypoint
+- `audit/internal/runtimeenv/` — runtime detection and mode selection
+- future `audit/internal/tui/` — host/live shared TUI logic
+- `iso/overlay/` — boot-time livecd integration only

-`iso/overlay/etc/runlevels/default/bee-audit` symlink
+### 2.5 — Operator workflows

-### 2.6 — Vendor utilities in overlay
+- Automatic boot audit writes JSON to `/var/log/bee-audit.json`
+- `tty1` autologins into `bee` and auto-runs `menu`
+- `menu` launches the LiveCD wrapper `bee-tui`, which escalates to `root` via `sudo -n`
+- `bee tui` can rerun the audit manually
+- `bee tui` can export the latest audit JSON to removable media
+- `bee tui` can show health summary and run NVIDIA/memory/storage acceptance tests
+- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-stress`
+- SAT summaries now expose `overall_status` plus per-job `OK/FAILED/UNSUPPORTED`
+- Memory/GPU SAT runtime defaults can be overridden via `BEE_MEMTESTER_*` and `BEE_GPU_STRESS_*`
+- removable export requires explicit target selection, mount, confirmation, copy, and cleanup
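One way to derive `overall_status` from the per-job `OK/FAILED/UNSUPPORTED` values is sketched below in Go. The aggregation rule is an assumption (any `FAILED` job fails the run; otherwise at least one `OK` job makes it `OK`; otherwise `UNSUPPORTED`), and the `jobs` field name is hypothetical. Only `overall_status` and the three status strings come from the plan.

```go
package main

import "encoding/json"

// JobStatus is the per-job verdict named in the plan.
type JobStatus string

const (
	StatusOK          JobStatus = "OK"
	StatusFailed      JobStatus = "FAILED"
	StatusUnsupported JobStatus = "UNSUPPORTED"
)

// satSummary is a hypothetical summary shape; only "overall_status"
// is a field name taken from the plan.
type satSummary struct {
	OverallStatus JobStatus            `json:"overall_status"`
	Jobs          map[string]JobStatus `json:"jobs"`
}

// overallStatus applies the assumed rule: FAILED wins, then OK,
// and a run with no OK or FAILED job is reported as UNSUPPORTED.
func overallStatus(jobs map[string]JobStatus) JobStatus {
	sawOK := false
	for _, s := range jobs {
		switch s {
		case StatusFailed:
			return StatusFailed
		case StatusOK:
			sawOK = true
		}
	}
	if sawOK {
		return StatusOK
	}
	return StatusUnsupported
}

// summarize renders the summary as JSON.
func summarize(jobs map[string]JobStatus) ([]byte, error) {
	return json.Marshal(satSummary{OverallStatus: overallStatus(jobs), Jobs: jobs})
}
```

Treating `UNSUPPORTED` as neutral keeps a machine without GPUs from being reported as failed when only the memory and storage jobs can run.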

-`iso/overlay/usr/local/bin/` includes pre-fetched proprietary tools:
-- `storcli64` (Broadcom)
-- `sas2ircu`, `sas3ircu` (Broadcom/LSI)
-- `mstflint` (NVIDIA Networking / Mellanox)
+### 2.6 — Vendor utilities and optional assets

-`scripts/fetch-vendor.sh` — downloads and places these before ISO build.
-Checksums verified. Tools not committed to git — fetched at build time.
+Optional binaries live in `iso/vendor/` and are included when present:
+- `storcli64`
+- `sas2ircu`, `sas3ircu`
+- `arcconf`
+- `ssacli`
+- `mstflint` (via Debian package set)

-`iso/vendor/.gitkeep` — placeholder, directory gitignored except .gitkeep
+Missing optional tools do not fail the build or boot.
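At runtime, the same "optional" policy can be expressed with `exec.LookPath`: a tool that is not installed yields a distinguishable error that the caller logs and skips. The Go sketch below illustrates that pattern under assumptions; `runOptional`, `isMissing`, and `splitByAvailability` are hypothetical names, not the project's collector API.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// errToolMissing marks a skipped collector, as opposed to a tool that ran and failed.
var errToolMissing = errors.New("optional tool not installed")

// runOptional runs tool with args if it is on PATH and returns its stdout.
// When the tool is absent it returns an error wrapping errToolMissing.
func runOptional(tool string, args ...string) ([]byte, error) {
	path, err := exec.LookPath(tool)
	if err != nil {
		return nil, fmt.Errorf("%s: %w", tool, errToolMissing)
	}
	return exec.Command(path, args...).Output()
}

// isMissing reports whether err means "tool absent, skip this collector".
func isMissing(err error) bool {
	return errors.Is(err, errToolMissing)
}

// splitByAvailability partitions tool names into those found on PATH
// and those that are missing, e.g. for a startup log line.
func splitByAvailability(tools []string) (present, missing []string) {
	for _, t := range tools {
		if _, err := exec.LookPath(t); err != nil {
			missing = append(missing, t)
		} else {
			present = append(present, t)
		}
	}
	return present, missing
}
```

A collector would call `runOptional("storcli64", ...)`, skip quietly when `isMissing(err)` is true, and treat any other error as a real failure worth logging.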

-### 2.7 — Auto-update of audit binary (USB + network)
+### 2.7 — Release workflow

-Two update paths, tried in order on every boot:
+`iso/builder/VERSIONS` pins the current release inputs:
+- audit version
+- Debian version / kernel ABI
+- Go version
+- NVIDIA driver version

-**Path A — USB (no network required, higher priority):**
-
-`bee-update.sh` scans mounted removable media for an update package before checking network.
-
-Looks for: `<usb>/bee-update/bee-audit-linux-amd64` + `<usb>/bee-update/bee-audit-linux-amd64.sha256`
-
-Steps:
-1. Find USB mount point (same detection as audit output: `/sys/block/*/removable`)
-2. Check for `bee-update/bee-audit-linux-amd64` on the USB root
-3. Read version from `bee-update/VERSION` file (plain text, e.g. `1.3`)
-4. Compare with running binary version (`/usr/local/bin/audit --version`)
-5. If USB version > running: verify SHA256 checksum, replace binary, log update
-6. Re-run audit if updated
-
-**Authenticity verification — Ed25519 multi-key trust (stdlib only, no external tools):**
-
-Problem: SHA256 alone does not prevent a crafted attack — an attacker places their binary
-and a matching SHA256 next to it. The LiveCD would accept it.
-
-Solution: Ed25519 asymmetric signatures via Go stdlib `crypto/ed25519`.
-Multiple developer public keys are supported. A binary update is accepted if its signature
-verifies against ANY of the embedded trusted public keys.
-
-This mirrors the SSH authorized_keys model: add a developer → add their public key.
-Remove a developer → rebuild without their key.
-
-**Key management — centralized across all projects:**
-
-Public keys live in a dedicated repo at git.mchus.pro/mchus/keys (or similar):
-```
-keys/
-  developers/
-    mchusavitin.pub ← Ed25519 public key, base64, one line
-    developer2.pub
-  README.md ← how to generate a key pair
-```
-
-Public keys are safe to commit — they are not secret.
-Private keys stay on each developer's machine, never committed anywhere.
-
-Key generation (one-time per developer, run locally):
-```sh
-# scripts/keygen.sh — also lives in the keys repo
-openssl genpkey -algorithm ed25519 -out ~/.bee-release.key
-openssl pkey -in ~/.bee-release.key -pubout -outform DER \
-  | tail -c 32 | base64 > mchusavitin.pub
-```
-
-**Embedding public keys at release time (not compile time):**
-
-Public keys are injected via `-ldflags` at build time from the keys repo.
-The binary does not hardcode keys — they are provided by the release script.
-
-```go
-// audit/internal/updater/trust.go
-// trustedKeysRaw is injected at build time via -ldflags
-// format: base64(key1):base64(key2):...
-var trustedKeysRaw string
-
-func trustedKeys() ([]ed25519.PublicKey, error) {
-	if trustedKeysRaw == "" {
-		return nil, fmt.Errorf("binary built without trusted keys — updates disabled")
-	}
-	var keys []ed25519.PublicKey
-	for _, enc := range strings.Split(trustedKeysRaw, ":") {
-		b, err := base64.StdEncoding.DecodeString(strings.TrimSpace(enc))
-		if err != nil || len(b) != ed25519.PublicKeySize {
-			return nil, fmt.Errorf("invalid trusted key: %w", err)
-		}
-		keys = append(keys, ed25519.PublicKey(b))
-	}
-	return keys, nil
-}
-
-func verifySignature(binaryPath, sigPath string) error {
-	keys, err := trustedKeys()
-	if err != nil {
-		return err
-	}
-	data, _ := os.ReadFile(binaryPath)
-	sig, _ := os.ReadFile(sigPath) // 64 bytes raw Ed25519 signature
-	for _, key := range keys {
-		if ed25519.Verify(key, data, sig) {
-			return nil // any trusted key accepts → pass
-		}
-	}
-	return fmt.Errorf("signature verification failed: no trusted key matched")
-}
-```
-
-Release build injects keys:
-```sh
-# scripts/build-release.sh
-KEYS=$(paste -sd: keys/developers/*.pub)
-go build -ldflags "-X bee/audit/internal/updater/trust.trustedKeysRaw=${KEYS}" \
-  -o dist/bee-audit-linux-amd64 ./cmd/audit
-```
-
-Signing (release engineer signs with their private key):
-```sh
-# scripts/sign-release.sh <binary>
-openssl pkeyutl -sign -inkey ~/.bee-release.key \
-  -rawin -in "$1" -out "$1.sig"
-```
-
-Binary built without `-ldflags` injection (e.g. local dev build) has `trustedKeysRaw=""`
-→ updates are disabled, logged as INFO, audit continues normally.
-
-Update rejected silently (logged as WARNING, audit continues with current binary) if:
-- `.sig` file missing
-- Signature does not match any trusted key
-- `trustedKeysRaw` empty (dev build)
-
-Update package layout on USB:
-```
-/bee-update/
-  bee-audit-linux-amd64 ← new binary (also signed with embedded keys)
-  bee-audit-linux-amd64.sig ← Ed25519 signature (64 bytes raw)
-  VERSION ← plain version string e.g. "1.3"
-```
-
-Admin workflow: download `bee-audit-linux-amd64` + `bee-audit-linux-amd64.sig` from Gitea
-release assets, place in `bee-update/` on USB.
-
-**Path B — Network (requires DHCP on at least one interface):**
-1. Check network: ping git.mchus.pro -c 1 -W 3 || skip
-2. Fetch: `GET https://git.mchus.pro/api/v1/repos/<org>/bee/releases/latest`
-3. Parse tag_name, asset URLs for `bee-audit-linux-amd64` + `bee-audit-linux-amd64.sig`
-4. Compare tag with running version
-5. If newer: download both files to /tmp, verify Ed25519 signature against all trusted keys
-6. Replace binary on pass, log and skip on fail
-7. Re-run audit if updated
-
-**Ordering:** USB update checked first, network checked second.
-If USB update applied and verified, network check is skipped.
-
-`iso/overlay/etc/init.d/bee-update`:
-- runlevel: default
-- after: bee-network (network path needs interfaces up)
-- before: bee-audit (audit runs with latest binary)
-- Calls bee-update.sh
-
-Triggered after bee-audit completes, only if network is available.
-
-`iso/overlay/usr/local/bin/bee-update.sh`:
-
-```
-1. Check network: ping git.mchus.pro -c 1 -W 3 || exit 0
-2. Fetch latest release metadata:
-   GET https://git.mchus.pro/api/v1/repos/<org>/bee/releases/latest
-3. Parse: extract tag_name, asset URL for bee-audit-linux-amd64
-4. Compare tag_name with /usr/local/bin/audit --version output
-5. If newer: download to /tmp/bee-audit-new, verify SHA256 checksum from release assets
-6. Replace /usr/local/bin/audit (tmpfs — survives until reboot)
-7. Log: updated from vX.Y to vX.Z
-8. Re-run audit if update happened: /usr/local/bin/audit --output usb
-```
-
-`iso/overlay/etc/init.d/bee-update`:
-- runlevel: default
-- after: bee-audit, network
-- Calls bee-update.sh
-
-Release naming convention: binary asset named `bee-audit-linux-amd64` per release tag.
-
-### 2.8 — Release workflow
-
-`iso/builder/VERSIONS` — pinned versions:
-```
-AUDIT_VERSION=1.0
-ALPINE_VERSION=3.21
-KERNEL_VERSION=6.12
-NVIDIA_DRIVER_VERSION=590.48.01
-```
-
-LiveCD release = full ISO rebuild. Binary-only patch = new Gitea release with binary asset.
-On boot with network: ISO auto-patches its binary without full rebuild.
-
-ISO version embedded in `/etc/bee-release`:
-```
-BEE_ISO_VERSION=1.0
-BEE_AUDIT_VERSION=1.0
-BUILD_DATE=2026-03-05
-```
+Current release model:
+- shipping a new ISO means a full rebuild
+- build metadata is embedded into `/etc/bee-release` and `motd`
+- current ISO label is `BEE`
+- binary self-update remains deferred; no automatic USB/network patching is part of the current runtime
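Reading the build metadata back from `/etc/bee-release` could look like the Go sketch below. It assumes the file keeps the `KEY=VALUE` line format shown in the earlier version of the plan (`BEE_ISO_VERSION=1.0` and so on); the current format is not stated, and `parseRelease` is a hypothetical helper.

```go
package main

import (
	"bufio"
	"strings"
)

// parseRelease parses KEY=VALUE lines, skipping blank lines, lines that
// start with '#', and lines without '='. Later keys overwrite earlier ones.
func parseRelease(content string) map[string]string {
	out := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		out[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return out
}
```

A caller would pass the file's contents (for example from `os.ReadFile("/etc/bee-release")`) and treat a missing key as "unknown" rather than as an error, so a local run without the file still works.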

 ---

 ## Eating order

 Builder environment is set up early (after 1.3) so every subsequent collector
-is developed and tested directly on real hardware in the actual Alpine environment.
+is developed and tested directly on real hardware in the actual Debian live ISO environment.
 No "works on my Mac" drift.

 ```
@@ -544,10 +387,10 @@ No "works on my Mac" drift.
 1.2 board collector → first real data
 1.3 CPU collector → +CPUs

---- BUILDER + DEBUG ISO (unblock real-hardware testing) ---
+--- BUILDER + BEE ISO (unblock real-hardware testing) ---

-2.1 builder VM setup → Alpine VM with build deps + Go toolchain
-2.2 debug ISO profile → minimal Alpine ISO: audit binary + dropbear SSH + all packages
+2.1 builder setup → privileged container with build deps
+2.2 debug ISO profile → minimal Debian ISO: `bee` binary + OpenSSH + all packages
 2.3 boot on real server → SSH in, verify packages present, run audit manually

 --- CONTINUE COLLECTORS (tested on real hardware from here) ---
@@ -560,14 +403,14 @@ No "works on my Mac" drift.
 1.8b wear/age telemetry → +SMART hours, NVMe % used, SFP DOM, ECC
 1.9 Mellanox NIC enrichment → +NIC firmware/serial
 1.10 RAID enrichment → +physical disks behind RAID
-1.11 output + USB write → production-ready output
+1.11 output + export workflow → file output + explicit removable export

 --- PRODUCTION ISO ---

 2.4 NVIDIA driver build → driver compiled into overlay
 2.5 network bring-up on boot → DHCP on all interfaces
-2.6 OpenRC boot service → audit runs on boot automatically
-2.7 vendor utilities → storcli/sas2ircu/mstflint in image
-2.8 auto-update → binary self-patches from Gitea
-2.9 release workflow → versioning + release notes
+2.6 systemd boot service → audit runs on boot automatically
+2.7 vendor utilities → storcli/sas2ircu/arcconf/ssacli in image
+2.8 release workflow → versioning + release notes
+2.9 operator export flow → explicit TUI export to removable media
 ```
audit/Makefile (18 changed lines, normal file)
@@ -0,0 +1,18 @@
+LISTEN ?= :8080
+AUDIT_PATH ?=
+
+RUN_ARGS := web --listen $(LISTEN)
+ifneq ($(AUDIT_PATH),)
+RUN_ARGS += --audit-path $(AUDIT_PATH)
+endif
+
+.PHONY: run build test
+
+run:
+	go run ./cmd/bee $(RUN_ARGS)
+
+build:
+	go build -o bee ./cmd/bee
+
+test:
+	go test ./...
@@ -1,167 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"flag"
-	"fmt"
-	"log/slog"
-	"os"
-	"os/exec"
-	"path/filepath"
-	"sort"
-	"strings"
-	"time"
-
-	"bee/audit/internal/collector"
-)
-
-// Version is the audit binary version.
-// Injected at release build time via:
-//
-// -ldflags "-X main.Version=1.2"
-//
-// Defaults to "dev" in local builds.
-var Version = "dev"
-
-func main() {
-	output := flag.String("output", "stdout", `output destination:
-  stdout — print JSON to stdout (default)
-  file:<path> — write JSON to file
-  usb — auto-detect removable media, write JSON there`)
-	showVersion := flag.Bool("version", false, "print version and exit")
-	flag.Parse()
-
-	if *showVersion {
-		fmt.Println(Version)
-		return
-	}
-
-	slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
-		Level: slog.LevelInfo,
-	})))
-
-	result := collector.Run()
-
-	data, err := json.MarshalIndent(result, "", " ")
-	if err != nil {
-		slog.Error("marshal result", "err", err)
-		os.Exit(1)
-	}
-
-	if err := writeOutput(*output, data); err != nil {
-		slog.Error("write output", "destination", *output, "err", err)
-		os.Exit(1)
-	}
-}
-
-func writeOutput(dest string, data []byte) error {
-	switch {
-	case dest == "stdout":
-		_, err := os.Stdout.Write(append(data, '\n'))
-		return err
-
-	case strings.HasPrefix(dest, "file:"):
-		path := strings.TrimPrefix(dest, "file:")
-		return os.WriteFile(path, append(data, '\n'), 0644)
-
-	case dest == "usb":
-		return writeToUSB(data)
-
-	default:
-		return fmt.Errorf("unknown output destination %q — use stdout, file:<path>, or usb", dest)
-	}
-}
-
-// writeToUSB auto-detects the first removable block device, mounts it,
-// and writes the audit JSON. Falls back to /tmp on any failure.
-func writeToUSB(data []byte) error {
-	boardSerial := extractBoardSerial(data)
-	filename := auditFilename(boardSerial, time.Now().UTC())
-
-	device, err := firstRemovableDevice()
-	if err != nil {
-		slog.Warn("usb output: no removable device, writing to /tmp", "err", err)
-		return writeAuditToPath(filepath.Join("/tmp", filename), data)
-	}
-
-	mountpoint := "/tmp/bee-usb"
-	if err := os.MkdirAll(mountpoint, 0755); err != nil {
-		return err
-	}
-
-	if err := exec.Command("mount", device, mountpoint).Run(); err != nil {
-		slog.Warn("usb output: mount failed, writing to /tmp", "device", device, "err", err)
-		return writeAuditToPath(filepath.Join("/tmp", filename), data)
-	}
-	defer func() {
-		if err := exec.Command("umount", mountpoint).Run(); err != nil {
-			slog.Warn("usb output: umount failed", "mountpoint", mountpoint, "err", err)
-		}
-	}()
-
-	path := filepath.Join(mountpoint, filename)
-	if err := writeAuditToPath(path, data); err != nil {
-		slog.Warn("usb output: write failed, falling back to /tmp", "path", path, "err", err)
-		return writeAuditToPath(filepath.Join("/tmp", filename), data)
-	}
-
-	slog.Info("usb output: written", "path", path)
-	return nil
-}
-
-func writeAuditToPath(path string, data []byte) error {
-	if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
-		return err
-	}
-	slog.Info("audit output written", "path", path)
-	return nil
-}
-
-func extractBoardSerial(data []byte) string {
-	var doc struct {
-		Hardware struct {
-			Board struct {
-				SerialNumber string `json:"serial_number"`
-			} `json:"board"`
-		} `json:"hardware"`
-	}
-	if err := json.Unmarshal(data, &doc); err != nil {
-		return "unknown"
-	}
-	serial := strings.TrimSpace(doc.Hardware.Board.SerialNumber)
-	if serial == "" {
-		return "unknown"
-	}
-	return serial
-}
-
-func auditFilename(boardSerial string, now time.Time) string {
-	boardSerial = strings.TrimSpace(boardSerial)
-	if boardSerial == "" {
-		boardSerial = "unknown"
-	}
-	return fmt.Sprintf("audit-%s-%s.json", boardSerial, now.Format("20060102-150405"))
-}
-
-func firstRemovableDevice() (string, error) {
-	entries, err := os.ReadDir("/sys/block")
-	if err != nil {
-		return "", err
-	}
-	sort.Slice(entries, func(i, j int) bool { return entries[i].Name() < entries[j].Name() })
-
-	for _, e := range entries {
name := e.Name()
|
||||
if strings.HasPrefix(name, "loop") || strings.HasPrefix(name, "ram") {
|
||||
continue
|
||||
}
|
||||
removableFlag, err := os.ReadFile(filepath.Join("/sys/block", name, "removable"))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
if strings.TrimSpace(string(removableFlag)) == "1" {
|
||||
return filepath.Join("/dev", name), nil
|
||||
}
|
||||
}
|
||||
return "", fmt.Errorf("no removable block device found")
|
||||
}
|
||||
403
audit/cmd/bee/main.go
Normal file
@@ -0,0 +1,403 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"context"
|
||||
"flag"
|
||||
"fmt"
|
||||
"io"
|
||||
"log/slog"
|
||||
"os"
|
||||
"runtime/debug"
|
||||
"strings"
|
||||
|
||||
"bee/audit/internal/app"
|
||||
"bee/audit/internal/platform"
|
||||
"bee/audit/internal/runtimeenv"
|
||||
"bee/audit/internal/webui"
|
||||
)
|
||||
|
||||
var Version = "dev"
|
||||
|
||||
func buildLabel() string {
|
||||
label := strings.TrimSpace(Version)
|
||||
if label == "" {
|
||||
label = "dev"
|
||||
}
|
||||
if info, ok := debug.ReadBuildInfo(); ok {
|
||||
var revision string
|
||||
var modified bool
|
||||
for _, setting := range info.Settings {
|
||||
switch setting.Key {
|
||||
case "vcs.revision":
|
||||
revision = setting.Value
|
||||
case "vcs.modified":
|
||||
modified = setting.Value == "true"
|
||||
}
|
||||
}
|
||||
if revision != "" {
|
||||
short := revision
|
||||
if len(short) > 12 {
|
||||
short = short[:12]
|
||||
}
|
||||
label += " (" + short
|
||||
if modified {
|
||||
label += "+"
|
||||
}
|
||||
label += ")"
|
||||
}
|
||||
}
|
||||
return label
|
||||
}
|
||||
|
||||
func main() {
|
||||
os.Exit(run(os.Args[1:], os.Stdout, os.Stderr))
|
||||
}
|
||||
|
||||
func run(args []string, stdout, stderr io.Writer) int {
|
||||
slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
|
||||
Level: slog.LevelInfo,
|
||||
})))
|
||||
|
||||
if len(args) == 0 {
|
||||
printRootUsage(stderr)
|
||||
return 2
|
||||
}
|
||||
|
||||
switch args[0] {
|
||||
case "help", "--help", "-h":
|
||||
if len(args) > 1 {
|
||||
return runHelp(args[1:], stdout, stderr)
|
||||
}
|
||||
printRootUsage(stdout)
|
||||
return 0
|
||||
case "audit":
|
||||
return runAudit(args[1:], stdout, stderr)
|
||||
case "export":
|
||||
return runExport(args[1:], stdout, stderr)
|
||||
case "preflight":
|
||||
return runPreflight(args[1:], stdout, stderr)
|
||||
case "support-bundle":
|
||||
return runSupportBundle(args[1:], stdout, stderr)
|
||||
case "web":
|
||||
return runWeb(args[1:], stdout, stderr)
|
||||
case "sat":
|
||||
return runSAT(args[1:], stdout, stderr)
|
||||
case "version", "--version", "-version":
|
||||
fmt.Fprintln(stdout, Version)
|
||||
return 0
|
||||
default:
|
||||
fmt.Fprintf(stderr, "bee: unknown command %q\n\n", args[0])
|
||||
printRootUsage(stderr)
|
||||
return 2
|
||||
}
|
||||
}
|
||||
|
||||
func printRootUsage(w io.Writer) {
|
||||
fmt.Fprintln(w, `bee commands:
bee audit --runtime auto|local|livecd --output stdout|file:<path>
bee preflight --output stdout|file:<path>
bee export --target <device>
bee support-bundle --output stdout|file:<path>
bee web --listen :80 --audit-path `+app.DefaultAuditJSONPath+`
bee sat nvidia|memory|storage|cpu [--duration <seconds>]
bee version
bee help [command]`)
|
||||
}
|
||||
|
||||
func runHelp(args []string, stdout, stderr io.Writer) int {
|
||||
switch args[0] {
|
||||
case "audit":
|
||||
return runAudit([]string{"--help"}, stdout, stdout)
|
||||
case "export":
|
||||
return runExport([]string{"--help"}, stdout, stdout)
|
||||
case "preflight":
|
||||
return runPreflight([]string{"--help"}, stdout, stdout)
|
||||
case "support-bundle":
|
||||
return runSupportBundle([]string{"--help"}, stdout, stdout)
|
||||
case "web":
|
||||
return runWeb([]string{"--help"}, stdout, stdout)
|
||||
case "sat":
|
||||
return runSAT([]string{"--help"}, stdout, stderr)
|
||||
case "version":
|
||||
fmt.Fprintln(stdout, "usage: bee version")
|
||||
return 0
|
||||
default:
|
||||
fmt.Fprintf(stderr, "bee help: unknown command %q\n\n", args[0])
|
||||
printRootUsage(stderr)
|
||||
return 2
|
||||
}
|
||||
}
|
||||
|
||||
func runAudit(args []string, stdout, stderr io.Writer) int {
|
||||
fs := flag.NewFlagSet("audit", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
|
||||
runtimeFlag := fs.String("runtime", "auto", "runtime environment: auto, local, livecd")
|
||||
showVersion := fs.Bool("version", false, "print version and exit")
|
||||
fs.Usage = func() {
|
||||
fmt.Fprintln(stderr, "usage: bee audit [--runtime auto|local|livecd] [--output stdout|file:<path>]")
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
if *showVersion {
|
||||
fmt.Fprintln(stdout, Version)
|
||||
return 0
|
||||
}
|
||||
|
||||
runtimeInfo, err := runtimeenv.Detect(*runtimeFlag)
|
||||
if err != nil {
|
||||
slog.Error("resolve runtime", "err", err)
|
||||
return 1
|
||||
}
|
||||
slog.Info("runtime resolved", "mode", runtimeInfo.Mode, "reason", runtimeInfo.Reason)
|
||||
|
||||
application := app.New(platform.New())
|
||||
path, err := application.RunAudit(runtimeInfo.Mode, *output)
|
||||
if err != nil {
|
||||
slog.Error("run audit", "err", err)
|
||||
return 1
|
||||
}
|
||||
if path != "stdout" {
|
||||
slog.Info("audit output written", "path", path)
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func runExport(args []string, stdout, stderr io.Writer) int {
|
||||
fs := flag.NewFlagSet("export", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
targetDevice := fs.String("target", "", "removable device path, e.g. /dev/sdb1")
|
||||
fs.Usage = func() {
|
||||
fmt.Fprintln(stderr, "usage: bee export --target <device>")
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
if strings.TrimSpace(*targetDevice) == "" {
|
||||
fmt.Fprintln(stderr, "bee export: --target is required")
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
|
||||
application := app.New(platform.New())
|
||||
targets, err := application.ListRemovableTargets()
|
||||
if err != nil {
|
||||
slog.Error("list removable targets", "err", err)
|
||||
return 1
|
||||
}
|
||||
|
||||
for _, target := range targets {
|
||||
if target.Device == *targetDevice {
|
||||
path, err := application.ExportLatestAudit(target)
|
||||
if err != nil {
|
||||
slog.Error("export latest audit", "err", err)
|
||||
return 1
|
||||
}
|
||||
slog.Info("audit exported", "path", path)
|
||||
return 0
|
||||
}
|
||||
}
|
||||
|
||||
slog.Error("target device not found among removable filesystems", "device", *targetDevice)
|
||||
return 1
|
||||
}
|
||||
|
||||
func runPreflight(args []string, stdout, stderr io.Writer) int {
|
||||
fs := flag.NewFlagSet("preflight", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
|
||||
fs.Usage = func() {
|
||||
fmt.Fprintf(stderr, "usage: bee preflight [--output stdout|file:%s]\n", app.DefaultRuntimeJSONPath)
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
application := app.New(platform.New())
|
||||
path, err := application.RunRuntimePreflight(*output)
|
||||
if err != nil {
|
||||
slog.Error("run preflight", "err", err)
|
||||
return 1
|
||||
}
|
||||
if path != "stdout" {
|
||||
slog.Info("runtime health written", "path", path)
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func runSupportBundle(args []string, stdout, stderr io.Writer) int {
|
||||
fs := flag.NewFlagSet("support-bundle", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
|
||||
fs.Usage = func() {
|
||||
fmt.Fprintln(stderr, "usage: bee support-bundle [--output stdout|file:<path>]")
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
path, err := app.BuildSupportBundle(app.DefaultExportDir)
|
||||
if err != nil {
|
||||
slog.Error("build support bundle", "err", err)
|
||||
return 1
|
||||
}
|
||||
defer os.Remove(path)
|
||||
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
slog.Error("read support bundle", "err", err)
|
||||
return 1
|
||||
}
|
||||
switch {
|
||||
case *output == "stdout":
|
||||
if _, err := stdout.Write(raw); err != nil {
|
||||
slog.Error("write support bundle stdout", "err", err)
|
||||
return 1
|
||||
}
|
||||
case strings.HasPrefix(*output, "file:"):
|
||||
dst := strings.TrimPrefix(*output, "file:")
|
||||
if err := os.WriteFile(dst, raw, 0644); err != nil {
|
||||
slog.Error("write support bundle", "err", err)
|
||||
return 1
|
||||
}
|
||||
slog.Info("support bundle written", "path", dst)
|
||||
default:
|
||||
fmt.Fprintln(stderr, "bee support-bundle: unknown output destination")
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func runWeb(args []string, stdout, stderr io.Writer) int {
|
||||
fs := flag.NewFlagSet("web", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
listenAddr := fs.String("listen", ":8080", "listen address, e.g. :80")
|
||||
auditPath := fs.String("audit-path", app.DefaultAuditJSONPath, "path to the latest audit JSON snapshot")
|
||||
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles")
|
||||
title := fs.String("title", "Bee Hardware Audit", "page title")
|
||||
fs.Usage = func() {
|
||||
fmt.Fprintf(stderr, "usage: bee web [--listen :80] [--audit-path %s] [--export-dir %s] [--title \"Bee Hardware Audit\"]\n", app.DefaultAuditJSONPath, app.DefaultExportDir)
|
||||
fs.PrintDefaults()
|
||||
}
|
||||
if err := fs.Parse(args); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fs.Usage()
|
||||
return 2
|
||||
}
|
||||
|
||||
slog.Info("starting bee web", "listen", *listenAddr, "audit_path", *auditPath)
|
||||
|
||||
runtimeInfo, err := runtimeenv.Detect("auto")
|
||||
if err != nil {
|
||||
slog.Warn("resolve runtime for web", "err", err)
|
||||
}
|
||||
|
||||
if err := webui.ListenAndServe(*listenAddr, webui.HandlerOptions{
|
||||
Title: *title,
|
||||
BuildLabel: buildLabel(),
|
||||
AuditPath: *auditPath,
|
||||
ExportDir: *exportDir,
|
||||
App: app.New(platform.New()),
|
||||
RuntimeMode: runtimeInfo.Mode,
|
||||
}); err != nil {
|
||||
slog.Error("run web", "err", err)
|
||||
return 1
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func runSAT(args []string, stdout, stderr io.Writer) int {
|
||||
if len(args) == 0 {
|
||||
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
|
||||
return 2
|
||||
}
|
||||
if args[0] == "help" || args[0] == "--help" || args[0] == "-h" {
|
||||
fmt.Fprintln(stdout, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
|
||||
return 0
|
||||
}
|
||||
|
||||
fs := flag.NewFlagSet("sat", flag.ContinueOnError)
|
||||
fs.SetOutput(stderr)
|
||||
duration := fs.Int("duration", 0, "stress-ng duration in seconds (cpu only; default: 60)")
|
||||
if err := fs.Parse(args[1:]); err != nil {
|
||||
if err == flag.ErrHelp {
|
||||
return 0
|
||||
}
|
||||
return 2
|
||||
}
|
||||
if fs.NArg() != 0 {
|
||||
fmt.Fprintf(stderr, "bee sat: unexpected arguments\n")
|
||||
return 2
|
||||
}
|
||||
|
||||
target := args[0]
|
||||
if target != "nvidia" && target != "memory" && target != "storage" && target != "cpu" {
|
||||
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", target)
|
||||
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
|
||||
return 2
|
||||
}
|
||||
|
||||
application := app.New(platform.New())
|
||||
var (
|
||||
archive string
|
||||
err error
|
||||
)
|
||||
logLine := func(s string) { fmt.Fprintln(stderr, s) }
|
||||
switch target {
|
||||
case "nvidia":
|
||||
archive, err = application.RunNvidiaAcceptancePack("", logLine)
|
||||
case "memory":
|
||||
archive, err = application.RunMemoryAcceptancePackCtx(context.Background(), "", logLine)
|
||||
case "storage":
|
||||
archive, err = application.RunStorageAcceptancePackCtx(context.Background(), "", logLine)
|
||||
case "cpu":
|
||||
dur := *duration
|
||||
if dur <= 0 {
|
||||
dur = 60
|
||||
}
|
||||
archive, err = application.RunCPUAcceptancePackCtx(context.Background(), "", dur, logLine)
|
||||
}
|
||||
if err != nil {
|
||||
slog.Error("run sat", "target", target, "err", err)
|
||||
return 1
|
||||
}
|
||||
slog.Info("sat archive written", "target", target, "path", archive)
|
||||
return 0
|
||||
}
|
||||
219
audit/cmd/bee/main_test.go
Normal file
@@ -0,0 +1,219 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestRunRootHelp(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"help"}, &stdout, &stderr)
|
||||
if rc != 0 {
|
||||
t.Fatalf("rc=%d want 0", rc)
|
||||
}
|
||||
if !strings.Contains(stdout.String(), "bee commands:") {
|
||||
t.Fatalf("stdout missing root usage:\n%s", stdout.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunNoArgsPrintsUsage(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run(nil, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "bee commands:") {
|
||||
t.Fatalf("stderr missing root usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunUnknownCommand(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"wat"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), `unknown command "wat"`) {
|
||||
t.Fatalf("stderr missing unknown command message:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunVersion(t *testing.T) {
|
||||
// Not parallel: this test mutates the package-level Version variable.
|
||||
|
||||
old := Version
|
||||
Version = "test-version"
|
||||
t.Cleanup(func() { Version = old })
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"version"}, &stdout, &stderr)
|
||||
if rc != 0 {
|
||||
t.Fatalf("rc=%d want 0", rc)
|
||||
}
|
||||
if strings.TrimSpace(stdout.String()) != "test-version" {
|
||||
t.Fatalf("stdout=%q want %q", strings.TrimSpace(stdout.String()), "test-version")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunExportRequiresTarget(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"export"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "--target is required") {
|
||||
t.Fatalf("stderr missing target error:\n%s", stderr.String())
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee export --target <device>") {
|
||||
t.Fatalf("stderr missing export usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSATUsage(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"sat"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee sat nvidia|memory|storage") {
|
||||
t.Fatalf("stderr missing sat usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunPreflightRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"preflight", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee preflight") {
|
||||
t.Fatalf("stderr missing preflight usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSupportBundleRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"support-bundle", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee support-bundle") {
|
||||
t.Fatalf("stderr missing support-bundle usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunHelpForSubcommand(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"help", "export"}, &stdout, &stderr)
|
||||
if rc != 0 {
|
||||
t.Fatalf("rc=%d want 0", rc)
|
||||
}
|
||||
if !strings.Contains(stdout.String(), "usage: bee export --target <device>") {
|
||||
t.Fatalf("stdout missing export help:\n%s", stdout.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunHelpUnknownSubcommand(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"help", "wat"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), `bee help: unknown command "wat"`) {
|
||||
t.Fatalf("stderr missing help error:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSATUnknownTarget(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"sat", "amd"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), `unknown target "amd"`) {
|
||||
t.Fatalf("stderr missing sat target error:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSATHelp(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"sat", "--help"}, &stdout, &stderr)
|
||||
if rc != 0 {
|
||||
t.Fatalf("rc=%d want 0", rc)
|
||||
}
|
||||
if !strings.Contains(stdout.String(), "usage: bee sat nvidia|memory|storage|cpu") {
|
||||
t.Fatalf("stdout missing sat help:\n%s", stdout.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunSATRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"sat", "memory", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "bee sat: unexpected arguments") {
|
||||
t.Fatalf("stderr missing sat error:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunAuditInvalidRuntime(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"audit", "--runtime", "bad"}, &stdout, &stderr)
|
||||
if rc != 1 {
|
||||
t.Fatalf("rc=%d want 1", rc)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunAuditRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"audit", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee audit") {
|
||||
t.Fatalf("stderr missing audit usage:\n%s", stderr.String())
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunExportRejectsExtraArgs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
var stdout, stderr bytes.Buffer
|
||||
rc := run([]string{"export", "--target", "/dev/sdb1", "extra"}, &stdout, &stderr)
|
||||
if rc != 2 {
|
||||
t.Fatalf("rc=%d want 2", rc)
|
||||
}
|
||||
if !strings.Contains(stderr.String(), "usage: bee export --target <device>") {
|
||||
t.Fatalf("stderr missing export usage:\n%s", stderr.String())
|
||||
}
|
||||
}
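The tests above do not exercise `buildLabel`, which reads VCS settings from the running binary via `debug.ReadBuildInfo`. One possible way to make the formatting rules checkable is to separate them into a pure function. The sketch below assumes such a split; `formatBuildLabel` is a hypothetical name and does not exist in the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// formatBuildLabel is a hypothetical pure helper that applies the same
// formatting rules as buildLabel in the diff: default to "dev", shorten
// the revision to 12 characters, and mark a modified tree with "+".
func formatBuildLabel(version, revision string, modified bool) string {
	label := strings.TrimSpace(version)
	if label == "" {
		label = "dev"
	}
	if revision == "" {
		return label
	}
	short := revision
	if len(short) > 12 {
		short = short[:12]
	}
	label += " (" + short
	if modified {
		label += "+"
	}
	return label + ")"
}

func main() {
	// The revision string is a made-up sample value.
	fmt.Println(formatBuildLabel("v3.3", "0123456789abcdef0123", true)) // v3.3 (0123456789ab+)
	fmt.Println(formatBuildLabel("", "", false))                        // dev
}
```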
|
||||
25
audit/go.mod
@@ -1,3 +1,26 @@
|
||||
module bee/audit
|
||||
|
||||
-go 1.23
+go 1.25.0
|
||||
|
||||
replace reanimator/chart => ../internal/chart
|
||||
|
||||
require (
|
||||
github.com/go-analyze/charts v0.5.26
|
||||
reanimator/chart v0.0.0-00010101000000-000000000000
|
||||
)
|
||||
|
||||
require (
|
||||
github.com/dustin/go-humanize v1.0.1 // indirect
|
||||
github.com/go-analyze/bulk v0.1.3 // indirect
|
||||
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 // indirect
|
||||
github.com/google/uuid v1.6.0 // indirect
|
||||
github.com/mattn/go-isatty v0.0.20 // indirect
|
||||
github.com/ncruces/go-strftime v1.0.0 // indirect
|
||||
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec // indirect
|
||||
golang.org/x/image v0.24.0 // indirect
|
||||
golang.org/x/sys v0.42.0 // indirect
|
||||
modernc.org/libc v1.70.0 // indirect
|
||||
modernc.org/mathutil v1.7.1 // indirect
|
||||
modernc.org/memory v1.11.0 // indirect
|
||||
modernc.org/sqlite v1.48.0 // indirect
|
||||
)
|
||||
|
||||
37
audit/go.sum
Normal file
@@ -0,0 +1,37 @@
|
||||
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
|
||||
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
|
||||
github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
|
||||
github.com/go-analyze/bulk v0.1.3 h1:pzRdBqzHDAT9PyROt0SlWE0YqPtdmTcEpIJY0C3vF0c=
|
||||
github.com/go-analyze/bulk v0.1.3/go.mod h1:afon/KtFJYnekIyN20H/+XUvcLFjE8sKR1CfpqfClgM=
|
||||
github.com/go-analyze/charts v0.5.26 h1:rSwZikLQuFX6cJzwI8OAgaWZneG1kDYxD857ms00ZxY=
|
||||
github.com/go-analyze/charts v0.5.26/go.mod h1:s1YvQhjiSwtLx1f2dOKfiV9x2TT49nVSL6v2rlRpTbY=
|
||||
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 h1:DACJavvAHhabrF08vX0COfcOBJRhZ8lUbR+ZWIs0Y5g=
|
||||
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0/go.mod h1:E/TSTwGwJL78qG/PmXZO1EjYhfJinVAhrmmHX6Z8B9k=
|
||||
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
|
||||
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
|
||||
github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY=
|
||||
github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
|
||||
github.com/ncruces/go-strftime v1.0.0 h1:HMFp8mLCTPp341M/ZnA4qaf7ZlsbTc+miZjCLOFAw7w=
|
||||
github.com/ncruces/go-strftime v1.0.0/go.mod h1:Fwc5htZGVVkseilnfgOVb9mKy6w1naJmn9CehxcKcls=
|
||||
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
|
||||
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec h1:W09IVJc94icq4NjY3clb7Lk8O1qJ8BdBEF8z0ibU0rE=
|
||||
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec/go.mod h1:qqbHyh8v60DhA7CoWK5oRCqLrMHRGoxYCSS9EjAz6Eo=
|
||||
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
|
||||
github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
|
||||
golang.org/x/image v0.24.0 h1:AN7zRgVsbvmTfNyqIbbOraYL8mSwcKncEj8ofjgzcMQ=
|
||||
golang.org/x/image v0.24.0/go.mod h1:4b/ITuLfqYq1hqZcjofwctIhi7sZh2WaCjvsBNjjya8=
|
||||
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.42.0 h1:omrd2nAlyT5ESRdCLYdm3+fMfNFE/+Rf4bDIQImRJeo=
|
||||
golang.org/x/sys v0.42.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
|
||||
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
|
||||
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
modernc.org/libc v1.70.0 h1:U58NawXqXbgpZ/dcdS9kMshu08aiA6b7gusEusqzNkw=
|
||||
modernc.org/libc v1.70.0/go.mod h1:OVmxFGP1CI/Z4L3E0Q3Mf1PDE0BucwMkcXjjLntvHJo=
|
||||
modernc.org/mathutil v1.7.1 h1:GCZVGXdaN8gTqB1Mf/usp1Y/hSqgI2vAGGP4jZMCxOU=
|
||||
modernc.org/mathutil v1.7.1/go.mod h1:4p5IwJITfppl0G4sUEDtCr4DthTaT47/N3aT6MhfgJg=
|
||||
modernc.org/memory v1.11.0 h1:o4QC8aMQzmcwCK3t3Ux/ZHmwFPzE6hf2Y5LbkRs+hbI=
|
||||
modernc.org/memory v1.11.0/go.mod h1:/JP4VbVC+K5sU2wZi9bHoq2MAkCnrt2r98UGeSK7Mjw=
|
||||
modernc.org/sqlite v1.48.0 h1:ElZyLop3Q2mHYk5IFPPXADejZrlHu7APbpB0sF78bq4=
|
||||
modernc.org/sqlite v1.48.0/go.mod h1:hWjRO6Tj/5Ik8ieqxQybiEOUXy0NJFNp2tpvVpKlvig=
|
||||
1200
audit/internal/app/app.go
Normal file
File diff suppressed because it is too large
839
audit/internal/app/app_test.go
Normal file
@@ -0,0 +1,839 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"archive/tar"
|
||||
"compress/gzip"
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"io"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/platform"
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
type fakeNetwork struct {
|
||||
listInterfacesFn func() ([]platform.InterfaceInfo, error)
|
||||
defaultRouteFn func() string
|
||||
dhcpOneFn func(string) (string, error)
|
||||
dhcpAllFn func() (string, error)
|
||||
setStaticIPv4Fn func(platform.StaticIPv4Config) (string, error)
|
||||
}
|
||||
|
||||
func (f fakeNetwork) ListInterfaces() ([]platform.InterfaceInfo, error) {
|
||||
return f.listInterfacesFn()
|
||||
}
|
||||
|
||||
func (f fakeNetwork) DefaultRoute() string {
|
||||
return f.defaultRouteFn()
|
||||
}
|
||||
|
||||
func (f fakeNetwork) DHCPOne(iface string) (string, error) {
|
||||
return f.dhcpOneFn(iface)
|
||||
}
|
||||
|
||||
func (f fakeNetwork) DHCPAll() (string, error) {
|
||||
return f.dhcpAllFn()
|
||||
}
|
||||
|
||||
func (f fakeNetwork) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
|
||||
return f.setStaticIPv4Fn(cfg)
|
||||
}
|
||||
|
||||
func (f fakeNetwork) SetInterfaceState(_ string, _ bool) error { return nil }
|
||||
func (f fakeNetwork) GetInterfaceState(_ string) (bool, error) { return true, nil }
|
||||
func (f fakeNetwork) CaptureNetworkSnapshot() (platform.NetworkSnapshot, error) {
|
||||
return platform.NetworkSnapshot{}, nil
|
||||
}
|
||||
func (f fakeNetwork) RestoreNetworkSnapshot(platform.NetworkSnapshot) error { return nil }
|
||||
|
||||
type fakeServices struct {
|
||||
serviceStatusFn func(string) (string, error)
|
||||
serviceDoFn func(string, platform.ServiceAction) (string, error)
|
||||
}
|
||||
|
||||
func (f fakeServices) ListBeeServices() ([]string, error) {
|
||||
return nil, nil
|
||||
}
|
||||
|
||||
func (f fakeServices) ServiceState(name string) string {
|
||||
return "active"
|
||||
}
|
||||
|
||||
func (f fakeServices) ServiceStatus(name string) (string, error) {
|
||||
return f.serviceStatusFn(name)
|
||||
}
|
||||
|
||||
func (f fakeServices) ServiceDo(name string, action platform.ServiceAction) (string, error) {
|
||||
return f.serviceDoFn(name, action)
|
||||
}
|
||||
|
||||
type fakeExports struct {
|
||||
listTargetsFn func() ([]platform.RemovableTarget, error)
|
||||
exportToTargetFn func(string, platform.RemovableTarget) (string, error)
|
||||
}
|
||||
|
||||
func (f fakeExports) ListRemovableTargets() ([]platform.RemovableTarget, error) {
|
||||
if f.listTargetsFn != nil {
|
||||
return f.listTargetsFn()
|
||||
}
|
||||
return nil, nil
|
||||
}
|
||||
|
||||
func (f fakeExports) ExportFileToTarget(src string, target platform.RemovableTarget) (string, error) {
|
||||
if f.exportToTargetFn != nil {
|
||||
return f.exportToTargetFn(src, target)
|
||||
}
|
||||
return "", nil
|
||||
}
|
||||
|
||||
type fakeRuntime struct {
|
||||
collectFn func(string) (schema.RuntimeHealth, error)
|
||||
dumpFn func(string) error
|
||||
}
|
||||
|
||||
func (f fakeRuntime) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
|
||||
return f.collectFn(exportDir)
|
||||
}
|
||||
|
||||
func (f fakeRuntime) CaptureTechnicalDump(baseDir string) error {
|
||||
if f.dumpFn != nil {
|
||||
return f.dumpFn(baseDir)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
type fakeTools struct {
|
||||
tailFileFn func(string, int) string
|
||||
checkToolsFn func([]string) []platform.ToolStatus
|
||||
}
|
||||
|
||||
func (f fakeTools) TailFile(path string, lines int) string {
|
||||
return f.tailFileFn(path, lines)
|
||||
}
|
||||
|
||||
func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
|
||||
return f.checkToolsFn(names)
|
||||
}
|
||||
|
||||
type fakeSAT struct {
	runNvidiaFn      func(string) (string, error)
	runMemoryFn      func(string) (string, error)
	runStorageFn     func(string) (string, error)
	runCPUFn         func(string, int) (string, error)
	detectVendorFn   func() string
	listAMDGPUsFn    func() ([]platform.AMDGPUInfo, error)
	runAMDPackFn     func(string) (string, error)
	listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error)
}

func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string, _ func(string)) (string, error) {
	return f.runNvidiaFn(baseDir)
}

func (f fakeSAT) RunNvidiaAcceptancePackWithOptions(_ context.Context, baseDir string, _ int, _ []int, _ func(string)) (string, error) {
	return f.runNvidiaFn(baseDir)
}

func (f fakeSAT) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
	if f.listNvidiaGPUsFn != nil {
		return f.listNvidiaGPUsFn()
	}
	return nil, nil
}

func (f fakeSAT) RunMemoryAcceptancePack(_ context.Context, baseDir string, _ func(string)) (string, error) {
	return f.runMemoryFn(baseDir)
}

func (f fakeSAT) RunStorageAcceptancePack(_ context.Context, baseDir string, _ func(string)) (string, error) {
	return f.runStorageFn(baseDir)
}

func (f fakeSAT) RunCPUAcceptancePack(_ context.Context, baseDir string, durationSec int, _ func(string)) (string, error) {
	if f.runCPUFn != nil {
		return f.runCPUFn(baseDir, durationSec)
	}
	return "", nil
}

func (f fakeSAT) DetectGPUVendor() string {
	if f.detectVendorFn != nil {
		return f.detectVendorFn()
	}
	return ""
}

func (f fakeSAT) ListAMDGPUs() ([]platform.AMDGPUInfo, error) {
	if f.listAMDGPUsFn != nil {
		return f.listAMDGPUsFn()
	}
	return nil, nil
}

func (f fakeSAT) RunAMDAcceptancePack(_ context.Context, baseDir string, _ func(string)) (string, error) {
	if f.runAMDPackFn != nil {
		return f.runAMDPackFn(baseDir)
	}
	return "", nil
}

func (f fakeSAT) RunAMDMemIntegrityPack(_ context.Context, _ string, _ func(string)) (string, error) {
	return "", nil
}

func (f fakeSAT) RunAMDMemBandwidthPack(_ context.Context, _ string, _ func(string)) (string, error) {
	return "", nil
}

func (f fakeSAT) RunAMDStressPack(_ context.Context, _ string, _ int, _ func(string)) (string, error) {
	return "", nil
}

func (f fakeSAT) RunMemoryStressPack(_ context.Context, _ string, _ int, _ func(string)) (string, error) {
	return "", nil
}

func (f fakeSAT) RunSATStressPack(_ context.Context, _ string, _ int, _ func(string)) (string, error) {
	return "", nil
}

func (f fakeSAT) RunFanStressTest(_ context.Context, _ string, _ platform.FanStressOptions) (string, error) {
	return "", nil
}

func (f fakeSAT) RunNCCLTests(_ context.Context, _ string, _ func(string)) (string, error) {
	return "", nil
}

func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
	t.Parallel()

	a := &App{
		network: fakeNetwork{
			listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
				return []platform.InterfaceInfo{
					{Name: "eth0", State: "UP", IPv4: []string{"10.0.0.2/24"}},
					{Name: "eth1", State: "DOWN", IPv4: nil},
				}, nil
			},
			defaultRouteFn: func() string { return "10.0.0.1" },
		},
		runtime: fakeRuntime{
			collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
		},
	}

	result, err := a.NetworkStatus()
	if err != nil {
		t.Fatalf("NetworkStatus error: %v", err)
	}
	if result.Title != "Network status" {
		t.Fatalf("title=%q want %q", result.Title, "Network status")
	}
	if want := "- eth0: state=UP ip=10.0.0.2/24"; !contains(result.Body, want) {
		t.Fatalf("body missing %q\nbody=%s", want, result.Body)
	}
	if want := "- eth1: state=DOWN ip=(no IPv4)"; !contains(result.Body, want) {
		t.Fatalf("body missing %q\nbody=%s", want, result.Body)
	}
	if want := "Default route: 10.0.0.1"; !contains(result.Body, want) {
		t.Fatalf("body missing %q\nbody=%s", want, result.Body)
	}
}

func TestNetworkStatusHandlesNoInterfaces(t *testing.T) {
	t.Parallel()

	a := &App{
		network: fakeNetwork{
			listInterfacesFn: func() ([]platform.InterfaceInfo, error) { return nil, nil },
			defaultRouteFn:   func() string { return "" },
		},
		runtime: fakeRuntime{
			collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
		},
	}

	result, err := a.NetworkStatus()
	if err != nil {
		t.Fatalf("NetworkStatus error: %v", err)
	}
	if result.Body != "No physical interfaces found." {
		t.Fatalf("body=%q want %q", result.Body, "No physical interfaces found.")
	}
}

func TestNetworkStatusPropagatesListError(t *testing.T) {
	t.Parallel()

	a := &App{
		network: fakeNetwork{
			listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
				return nil, errors.New("boom")
			},
			defaultRouteFn: func() string { return "" },
		},
		runtime: fakeRuntime{
			collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
		},
	}

	result, err := a.NetworkStatus()
	if err == nil {
		t.Fatal("expected error")
	}
	if result.Title != "Network status" {
		t.Fatalf("title=%q want %q", result.Title, "Network status")
	}
}

func TestParseStaticIPv4ConfigAndDefaults(t *testing.T) {
	t.Parallel()

	a := &App{
		network: fakeNetwork{
			defaultRouteFn: func() string { return " 192.168.1.1 " },
			listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
				return nil, nil
			},
			dhcpOneFn:       func(string) (string, error) { return "", nil },
			dhcpAllFn:       func() (string, error) { return "", nil },
			setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
		},
		runtime: fakeRuntime{
			collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
		},
	}

	defaults := a.DefaultStaticIPv4FormFields("eth0")
	if len(defaults) != 4 {
		t.Fatalf("len(defaults)=%d want 4", len(defaults))
	}
	if defaults[1] != "24" || defaults[2] != "192.168.1.1" {
		t.Fatalf("unexpected defaults: %#v", defaults)
	}

	cfg := a.ParseStaticIPv4Config("eth0", []string{
		" 10.10.0.5 ",
		" 23 ",
		" 10.10.0.1 ",
		" 1.1.1.1 8.8.8.8 ",
	})
	if cfg.Interface != "eth0" || cfg.Address != "10.10.0.5" || cfg.Prefix != "23" || cfg.Gateway != "10.10.0.1" {
		t.Fatalf("unexpected cfg: %#v", cfg)
	}
	if len(cfg.DNS) != 2 || cfg.DNS[0] != "1.1.1.1" || cfg.DNS[1] != "8.8.8.8" {
		t.Fatalf("unexpected dns: %#v", cfg.DNS)
	}
}

func TestServiceActionResults(t *testing.T) {
	t.Parallel()

	a := &App{
		services: fakeServices{
			serviceStatusFn: func(name string) (string, error) {
				return "active", nil
			},
			serviceDoFn: func(name string, action platform.ServiceAction) (string, error) {
				return string(action) + " ok", nil
			},
		},
		runtime: fakeRuntime{
			collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
		},
	}

	statusResult, err := a.ServiceStatusResult("bee-audit")
	if err != nil {
		t.Fatalf("ServiceStatusResult error: %v", err)
	}
	if statusResult.Title != "service status: bee-audit" || statusResult.Body != "active" {
		t.Fatalf("unexpected status result: %#v", statusResult)
	}

	actionResult, err := a.ServiceActionResult("bee-audit", platform.ServiceRestart)
	if err != nil {
		t.Fatalf("ServiceActionResult error: %v", err)
	}
	if actionResult.Title != "service restart: bee-audit" || actionResult.Body != "restart ok" {
		t.Fatalf("unexpected action result: %#v", actionResult)
	}
}

func TestToolCheckAndLogTailResults(t *testing.T) {
	t.Parallel()

	a := &App{
		tools: fakeTools{
			tailFileFn: func(path string, lines int) string {
				return path
			},
			checkToolsFn: func(names []string) []platform.ToolStatus {
				return []platform.ToolStatus{
					{Name: "dmidecode", OK: true, Path: "/usr/bin/dmidecode"},
					{Name: "smartctl", OK: false},
				}
			},
		},
	}

	toolsResult := a.ToolCheckResult([]string{"dmidecode", "smartctl"})
	if toolsResult.Title != "Required tools" {
		t.Fatalf("title=%q want %q", toolsResult.Title, "Required tools")
	}
	if want := "- dmidecode: OK (/usr/bin/dmidecode)"; !contains(toolsResult.Body, want) {
		t.Fatalf("body missing %q\nbody=%s", want, toolsResult.Body)
	}
	if want := "- smartctl: MISSING"; !contains(toolsResult.Body, want) {
		t.Fatalf("body missing %q\nbody=%s", want, toolsResult.Body)
	}

	logResult := a.AuditLogTailResult()
	if logResult.Title != "Audit log tail" {
		t.Fatalf("title=%q want %q", logResult.Title, "Audit log tail")
	}
	if want := DefaultAuditLogPath + "\n\n" + DefaultAuditJSONPath; logResult.Body != want {
		t.Fatalf("body=%q want %q", logResult.Body, want)
	}
}

func TestActionResultsUseFallbackBody(t *testing.T) {
	t.Parallel()

	a := &App{
		network: fakeNetwork{
			dhcpOneFn:       func(string) (string, error) { return " ", nil },
			dhcpAllFn:       func() (string, error) { return "", nil },
			setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
			listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
				return nil, nil
			},
			defaultRouteFn: func() string { return "" },
		},
		services: fakeServices{
			serviceStatusFn: func(string) (string, error) { return "", nil },
			serviceDoFn:     func(string, platform.ServiceAction) (string, error) { return "", nil },
		},
		tools: fakeTools{
			tailFileFn:   func(string, int) string { return " " },
			checkToolsFn: func([]string) []platform.ToolStatus { return nil },
		},
		sat: fakeSAT{
			runNvidiaFn:  func(string) (string, error) { return "", nil },
			runMemoryFn:  func(string) (string, error) { return "", nil },
			runStorageFn: func(string) (string, error) { return "", nil },
		},
		runtime: fakeRuntime{
			collectFn: func(string) (schema.RuntimeHealth, error) {
				return schema.RuntimeHealth{Status: "PARTIAL", ExportDir: "/tmp/export"}, nil
			},
		},
	}

	if got, _ := a.DHCPOneResult("eth0"); got.Body != "DHCP completed." {
		t.Fatalf("dhcp one body=%q", got.Body)
	}
	if got, _ := a.DHCPAllResult(); got.Body != "DHCP completed." {
		t.Fatalf("dhcp all body=%q", got.Body)
	}
	if got, _ := a.SetStaticIPv4Result(platform.StaticIPv4Config{Interface: "eth0"}); got.Body != "Static IPv4 updated." {
		t.Fatalf("static body=%q", got.Body)
	}
	if got, _ := a.ServiceStatusResult("bee-audit"); got.Body != "No status output." {
		t.Fatalf("status body=%q", got.Body)
	}
	if got, _ := a.ServiceActionResult("bee-audit", platform.ServiceRestart); got.Body != "Action completed." {
		t.Fatalf("action body=%q", got.Body)
	}
	if got := a.ToolCheckResult(nil); got.Body != "No tools checked." {
		t.Fatalf("tool body=%q", got.Body)
	}
	if got := a.AuditLogTailResult(); got.Body != "No audit logs found." {
		t.Fatalf("log body=%q", got.Body)
	}
	if got, _ := a.RunNvidiaAcceptancePackResult(""); got.Body != "Archive written." {
		t.Fatalf("sat body=%q", got.Body)
	}
	if got, _ := a.RunMemoryAcceptancePackResult(""); got.Body != "No output produced." {
		t.Fatalf("memory sat body=%q", got.Body)
	}
	if got, _ := a.RunStorageAcceptancePackResult(""); got.Body != "No output produced." {
		t.Fatalf("storage sat body=%q", got.Body)
	}
}

func TestExportSupportBundleResultMentionsUnmountedUSB(t *testing.T) {
	// Not parallel: this test swaps the package-level DefaultExportDir,
	// which would race with other tests that read or swap it.
	tmp := t.TempDir()
	oldExportDir := DefaultExportDir
	DefaultExportDir = tmp
	t.Cleanup(func() { DefaultExportDir = oldExportDir })

	if err := os.WriteFile(filepath.Join(tmp, "bee-audit.json"), []byte("{}\n"), 0644); err != nil {
		t.Fatalf("write bee-audit.json: %v", err)
	}
	if err := os.WriteFile(filepath.Join(tmp, "bee-audit.log"), []byte("audit ok\n"), 0644); err != nil {
		t.Fatalf("write bee-audit.log: %v", err)
	}

	a := &App{
		exports: fakeExports{
			exportToTargetFn: func(src string, target platform.RemovableTarget) (string, error) {
				// filepath.Base never returns "", so check the path itself.
				if src == "" {
					t.Fatalf("expected non-empty source path")
				}
				return "/media/bee/" + filepath.Base(src), nil
			},
		},
	}

	result, err := a.ExportSupportBundleResult(platform.RemovableTarget{Device: "/dev/sdb1"})
	if err != nil {
		t.Fatalf("ExportSupportBundleResult error: %v", err)
	}
	if result.Title != "Export support bundle" {
		t.Fatalf("title=%q want %q", result.Title, "Export support bundle")
	}
	if want := "USB target unmounted and safe to remove."; !contains(result.Body, want) {
		t.Fatalf("body missing %q\nbody=%s", want, result.Body)
	}
}

func TestExportSupportBundleResultDoesNotPretendSuccessOnError(t *testing.T) {
	// Not parallel: this test swaps the package-level DefaultExportDir,
	// which would race with other tests that read or swap it.
	tmp := t.TempDir()
	oldExportDir := DefaultExportDir
	DefaultExportDir = tmp
	t.Cleanup(func() { DefaultExportDir = oldExportDir })

	if err := os.WriteFile(filepath.Join(tmp, "bee-audit.json"), []byte("{}\n"), 0644); err != nil {
		t.Fatalf("write bee-audit.json: %v", err)
	}
	if err := os.WriteFile(filepath.Join(tmp, "bee-audit.log"), []byte("audit ok\n"), 0644); err != nil {
		t.Fatalf("write bee-audit.log: %v", err)
	}

	a := &App{
		exports: fakeExports{
			exportToTargetFn: func(string, platform.RemovableTarget) (string, error) {
				return "", errors.New("mount /dev/sda1: exFAT support is missing in this ISO build")
			},
		},
	}

	result, err := a.ExportSupportBundleResult(platform.RemovableTarget{Device: "/dev/sda1", FSType: "exfat"})
	if err == nil {
		t.Fatal("expected export error")
	}
	if contains(result.Body, "exported to") {
		t.Fatalf("body should not claim success:\n%s", result.Body)
	}
	if result.Body != "Support bundle export failed." {
		t.Fatalf("body=%q want %q", result.Body, "Support bundle export failed.")
	}
}

func TestRunNvidiaAcceptancePackResult(t *testing.T) {
	t.Parallel()

	a := &App{
		sat: fakeSAT{
			runNvidiaFn: func(baseDir string) (string, error) {
				if baseDir != "/tmp/sat" {
					t.Fatalf("baseDir=%q want %q", baseDir, "/tmp/sat")
				}
				return "/tmp/sat/out.tar.gz", nil
			},
			runMemoryFn:  func(string) (string, error) { return "", nil },
			runStorageFn: func(string) (string, error) { return "", nil },
		},
		runtime: fakeRuntime{
			collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
		},
	}

	result, err := a.RunNvidiaAcceptancePackResult("/tmp/sat")
	if err != nil {
		t.Fatalf("RunNvidiaAcceptancePackResult error: %v", err)
	}
	if result.Title != "NVIDIA SAT" || result.Body != "Archive written to /tmp/sat/out.tar.gz" {
		t.Fatalf("unexpected result: %#v", result)
	}
}

func TestRunSATDefaultsToExportDir(t *testing.T) {
	// Not parallel: this test swaps the package-level DefaultSATBaseDir,
	// which would race with other tests that read it.
	oldSATBaseDir := DefaultSATBaseDir
	DefaultSATBaseDir = "/tmp/export/bee-sat"
	t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })

	a := &App{
		sat: fakeSAT{
			runNvidiaFn: func(baseDir string) (string, error) {
				if baseDir != "/tmp/export/bee-sat" {
					t.Fatalf("nvidia baseDir=%q", baseDir)
				}
				return "", nil
			},
			runMemoryFn: func(baseDir string) (string, error) {
				if baseDir != "/tmp/export/bee-sat" {
					t.Fatalf("memory baseDir=%q", baseDir)
				}
				return "", nil
			},
			runStorageFn: func(baseDir string) (string, error) {
				if baseDir != "/tmp/export/bee-sat" {
					t.Fatalf("storage baseDir=%q", baseDir)
				}
				return "", nil
			},
		},
		runtime: fakeRuntime{
			collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
		},
	}

	if _, err := a.RunNvidiaAcceptancePack("", nil); err != nil {
		t.Fatal(err)
	}
	if _, err := a.RunMemoryAcceptancePack("", nil); err != nil {
		t.Fatal(err)
	}
	if _, err := a.RunStorageAcceptancePack("", nil); err != nil {
		t.Fatal(err)
	}
}

func TestFormatSATSummary(t *testing.T) {
	t.Parallel()

	got := formatSATSummary("Memory SAT", "overall_status=PARTIAL\njob_ok=2\njob_failed=0\njob_unsupported=1\ndevices=3\n")
	want := "Memory SAT: PARTIAL ok=2 failed=0 unsupported=1\nDevices: 3"
	if got != want {
		t.Fatalf("got %q want %q", got, want)
	}
}

func TestHealthSummaryResultIncludesCompactSATSummary(t *testing.T) {
	tmp := t.TempDir()
	oldAuditPath := DefaultAuditJSONPath
	oldSATBaseDir := DefaultSATBaseDir
	DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
	DefaultSATBaseDir = filepath.Join(tmp, "sat")
	t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
	t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })

	satDir := filepath.Join(DefaultSATBaseDir, "memory-testcase")
	if err := os.MkdirAll(satDir, 0755); err != nil {
		t.Fatalf("mkdir sat dir: %v", err)
	}

	raw := `{"collected_at":"2026-03-15T10:00:00Z","hardware":{"board":{"serial_number":"SRV123"},"storage":[{"serial_number":"DISK1","status":"Warning"}]}}`
	if err := os.WriteFile(DefaultAuditJSONPath, []byte(raw), 0644); err != nil {
		t.Fatalf("write audit json: %v", err)
	}
	if err := os.WriteFile(filepath.Join(satDir, "summary.txt"), []byte("overall_status=OK\njob_ok=3\njob_failed=0\njob_unsupported=0\n"), 0644); err != nil {
		t.Fatalf("write sat summary: %v", err)
	}

	result := (&App{}).HealthSummaryResult()
	if !contains(result.Body, "Memory SAT: OK ok=3 failed=0") {
		t.Fatalf("body missing compact sat summary:\n%s", result.Body)
	}
}

func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
	tmp := t.TempDir()
	exportDir := filepath.Join(tmp, "export")
	if err := os.MkdirAll(filepath.Join(exportDir, "bee-sat", "memory-run"), 0755); err != nil {
		t.Fatal(err)
	}
	if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.json"), []byte(`{"ok":true}`), 0644); err != nil {
		t.Fatal(err)
	}
	if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run", "verbose.log"), []byte("sat verbose"), 0644); err != nil {
		t.Fatal(err)
	}
	if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run.tar.gz"), []byte("nested sat archive"), 0644); err != nil {
		t.Fatal(err)
	}

	archive, err := BuildSupportBundle(exportDir)
	if err != nil {
		t.Fatalf("BuildSupportBundle error: %v", err)
	}
	if _, err := os.Stat(archive); err != nil {
		t.Fatalf("archive stat: %v", err)
	}

	file, err := os.Open(archive)
	if err != nil {
		t.Fatalf("open archive: %v", err)
	}
	defer file.Close()

	gzr, err := gzip.NewReader(file)
	if err != nil {
		t.Fatalf("gzip reader: %v", err)
	}
	defer gzr.Close()

	tr := tar.NewReader(gzr)
	var names []string
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			t.Fatalf("read tar entry: %v", err)
		}
		names = append(names, hdr.Name)
	}

	var foundRaw bool
	for _, name := range names {
		if contains(name, "/export/bee-sat/memory-run/verbose.log") {
			foundRaw = true
		}
		if contains(name, "/export/bee-sat/memory-run.tar.gz") {
			t.Fatalf("support bundle should not contain nested SAT archive: %s", name)
		}
	}
	if !foundRaw {
		t.Fatalf("support bundle missing raw SAT log, names=%v", names)
	}
}

func TestMainBanner(t *testing.T) {
	tmp := t.TempDir()
	oldAuditPath := DefaultAuditJSONPath
	DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
	t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })

	trueValue := true
	manufacturer := "Dell"
	product := "PowerEdge R760"
	cpuModel := "Intel Xeon Gold 6430"
	memoryType := "DDR5"
	gpuClass := "VideoController"
	gpuModel := "NVIDIA H100"

	payload := schema.HardwareIngestRequest{
		Hardware: schema.HardwareSnapshot{
			Board: schema.HardwareBoard{
				Manufacturer: &manufacturer,
				ProductName:  &product,
				SerialNumber: "SRV123",
			},
			CPUs: []schema.HardwareCPU{
				{Model: &cpuModel},
				{Model: &cpuModel},
			},
			Memory: []schema.HardwareMemory{
				{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
				{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
			},
			Storage: []schema.HardwareStorage{
				{Present: &trueValue, SizeGB: intPtr(3840)},
				{Present: &trueValue, SizeGB: intPtr(3840)},
			},
			PCIeDevices: []schema.HardwarePCIeDevice{
				{DeviceClass: &gpuClass, Model: &gpuModel},
				{DeviceClass: &gpuClass, Model: &gpuModel},
			},
		},
	}

	raw, err := json.Marshal(payload)
	if err != nil {
		t.Fatalf("marshal: %v", err)
	}
	if err := os.WriteFile(DefaultAuditJSONPath, raw, 0644); err != nil {
		t.Fatalf("write audit json: %v", err)
	}

	a := &App{
		network: fakeNetwork{
			listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
				return []platform.InterfaceInfo{
					{Name: "eth0", IPv4: []string{"10.0.0.10"}},
					{Name: "eth1", IPv4: []string{"192.168.1.10"}},
				}, nil
			},
		},
	}

	got := a.MainBanner()
	for _, want := range []string{
		"System: Dell PowerEdge R760 | S/N SRV123",
		"CPU: 2 x Intel Xeon Gold 6430",
		"Memory: 1.0 TB DDR5 (2 DIMMs)",
		"Storage: 2 drives / 7.5 TB",
		"GPU: 2 x NVIDIA H100",
		"IP: 10.0.0.10, 192.168.1.10",
	} {
		if !contains(got, want) {
			t.Fatalf("banner missing %q:\n%s", want, got)
		}
	}
}

func TestRuntimeHealthResultUsesAMDLabels(t *testing.T) {
	tmp := t.TempDir()
	oldRuntimePath := DefaultRuntimeJSONPath
	DefaultRuntimeJSONPath = filepath.Join(tmp, "runtime-health.json")
	t.Cleanup(func() { DefaultRuntimeJSONPath = oldRuntimePath })

	raw, err := json.Marshal(schema.RuntimeHealth{
		Status:        "OK",
		ExportDir:     "/appdata/bee/export",
		DriverReady:   true,
		CUDAReady:     true,
		NetworkStatus: "OK",
	})
	if err != nil {
		t.Fatalf("marshal runtime health: %v", err)
	}
	if err := os.WriteFile(DefaultRuntimeJSONPath, raw, 0644); err != nil {
		t.Fatalf("write runtime health: %v", err)
	}

	a := &App{
		sat: fakeSAT{
			detectVendorFn: func() string { return "amd" },
		},
	}

	result := a.RuntimeHealthResult()
	if !contains(result.Body, "AMDGPU ready: true") {
		t.Fatalf("body missing AMD driver label:\n%s", result.Body)
	}
	if !contains(result.Body, "ROCm SMI ready: true") {
		t.Fatalf("body missing ROCm label:\n%s", result.Body)
	}
	if contains(result.Body, "CUDA ready") {
		t.Fatalf("body should not mention CUDA on AMD:\n%s", result.Body)
	}
}

func intPtr(v int) *int { return &v }

func contains(haystack, needle string) bool {
	return len(needle) == 0 || (len(haystack) >= len(needle) && (haystack == needle || containsAt(haystack, needle)))
}

func containsAt(haystack, needle string) bool {
	for i := 0; i+len(needle) <= len(haystack); i++ {
		if haystack[i:i+len(needle)] == needle {
			return true
		}
	}
	return false
}
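Review note: the fakes in this test file implement each dependency by delegating to optional function fields. Some methods guard against a nil field and return a zero value (for example `fakeSAT.ListNvidiaGPUs`), while others call the field directly and would panic if a test leaves it unset (for example `fakeTools.TailFile`). The pattern can be sketched as below; `Tailer`, `fakeTailer`, and `describe` are hypothetical, simplified names for illustration and are not the project's actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// Tailer is a hypothetical stand-in for an interface the app depends on.
type Tailer interface {
	TailFile(path string, lines int) string
}

// fakeTailer implements Tailer by delegating to an optional function field.
type fakeTailer struct {
	tailFileFn func(string, int) string
}

// TailFile calls the configured function, or returns "" when the field is
// nil, so tests that do not care about this method need not set it.
func (f fakeTailer) TailFile(path string, lines int) string {
	if f.tailFileFn != nil {
		return f.tailFileFn(path, lines)
	}
	return ""
}

// describe is a hypothetical consumer that substitutes a fallback message
// when the dependency returns only whitespace.
func describe(t Tailer, path string) string {
	out := strings.TrimSpace(t.TailFile(path, 10))
	if out == "" {
		return "No audit logs found."
	}
	return out
}

func main() {
	// Zero-value fake: the nil guard makes this safe.
	fmt.Println(describe(fakeTailer{}, "/var/log/audit.log"))

	// Configured fake: echoes the path it was asked to tail.
	echo := fakeTailer{tailFileFn: func(path string, _ int) string { return path }}
	fmt.Println(describe(echo, "/var/log/audit.log"))
}
```

With the nil guard in place, a test only has to populate the fields it actually exercises.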
218
audit/internal/app/sat_overlay.go
Normal file
@@ -0,0 +1,218 @@
package app

import (
	"os"
	"path/filepath"
	"sort"
	"strings"

	"bee/audit/internal/schema"
)

func applyLatestSATStatuses(snap *schema.HardwareSnapshot, baseDir string) {
	if snap == nil || strings.TrimSpace(baseDir) == "" {
		return
	}
	if summary, ok := loadLatestSATSummary(baseDir, "gpu-amd-"); ok {
		applyGPUVendorSAT(snap.PCIeDevices, "amd", summary)
	}
	if summary, ok := loadLatestSATSummary(baseDir, "gpu-nvidia-"); ok {
		applyGPUVendorSAT(snap.PCIeDevices, "nvidia", summary)
	}
	if summary, ok := loadLatestSATSummary(baseDir, "memory-"); ok {
		applyMemorySAT(snap.Memory, summary)
	}
	if summary, ok := loadLatestSATSummary(baseDir, "cpu-"); ok {
		applyCPUSAT(snap.CPUs, summary)
	}
	if summary, ok := loadLatestSATSummary(baseDir, "storage-"); ok {
		applyStorageSAT(snap.Storage, summary)
	}
}

type satSummary struct {
	runAtUTC string
	overall  string
	kv       map[string]string
}

// loadLatestSATSummary parses summary.txt from the run directory under baseDir
// whose name starts with prefix and whose path sorts last. With
// timestamp-suffixed run directories (e.g. storage-20260325-161151) the last
// path in sorted order is the most recent run.
func loadLatestSATSummary(baseDir, prefix string) (satSummary, bool) {
	matches, err := filepath.Glob(filepath.Join(baseDir, prefix+"*/summary.txt"))
	if err != nil || len(matches) == 0 {
		return satSummary{}, false
	}
	sort.Strings(matches)
	raw, err := os.ReadFile(matches[len(matches)-1])
	if err != nil {
		return satSummary{}, false
	}
	kv := parseKeyValueSummary(string(raw))
	return satSummary{
		runAtUTC: strings.TrimSpace(kv["run_at_utc"]),
		overall:  strings.ToUpper(strings.TrimSpace(kv["overall_status"])),
		kv:       kv,
	}, true
}

func applyGPUVendorSAT(devs []schema.HardwarePCIeDevice, vendor string, summary satSummary) {
	status, description, ok := satSummaryStatus(summary, vendor+" GPU SAT")
	if !ok {
		return
	}
	for i := range devs {
		if !matchesGPUVendor(devs[i], vendor) {
			continue
		}
		mergeComponentStatus(&devs[i].HardwareComponentStatus, summary.runAtUTC, status, description)
	}
}

func applyMemorySAT(dimms []schema.HardwareMemory, summary satSummary) {
	status, description, ok := satSummaryStatus(summary, "memory SAT")
	if !ok {
		return
	}
	for i := range dimms {
		mergeComponentStatus(&dimms[i].HardwareComponentStatus, summary.runAtUTC, status, description)
	}
}

func applyCPUSAT(cpus []schema.HardwareCPU, summary satSummary) {
	status, description, ok := satSummaryStatus(summary, "CPU SAT")
	if !ok {
		return
	}
	for i := range cpus {
		mergeComponentStatus(&cpus[i].HardwareComponentStatus, summary.runAtUTC, status, description)
	}
}

func applyStorageSAT(disks []schema.HardwareStorage, summary satSummary) {
	byDevice := parseStorageSATStatus(summary)
	for i := range disks {
		devPath, _ := disks[i].Telemetry["linux_device"].(string)
		devName := filepath.Base(strings.TrimSpace(devPath))
		if devName == "" {
			continue
		}
		result, ok := byDevice[devName]
		if !ok {
			continue
		}
		mergeComponentStatus(&disks[i].HardwareComponentStatus, summary.runAtUTC, result.status, result.description)
	}
}

type satStatusResult struct {
	status      string
	description string
	ok          bool
}

// parseStorageSATStatus groups per-step keys of the form
// <device>_<step>_status by device and keeps the most severe status seen for
// each device. The overall_status key is skipped.
func parseStorageSATStatus(summary satSummary) map[string]satStatusResult {
	result := map[string]satStatusResult{}
	for key, value := range summary.kv {
		if !strings.HasSuffix(key, "_status") || key == "overall_status" {
			continue
		}
		base := strings.TrimSuffix(key, "_status")
		idx := strings.Index(base, "_")
		if idx <= 0 {
			continue
		}
		devName := base[:idx]
		step := strings.ReplaceAll(base[idx+1:], "_", "-")
		stepStatus, desc, ok := satKeyStatus(strings.ToUpper(strings.TrimSpace(value)), "storage "+step)
		if !ok {
			continue
		}
		current := result[devName]
		if !current.ok || statusSeverity(stepStatus) > statusSeverity(current.status) {
			result[devName] = satStatusResult{status: stepStatus, description: desc, ok: true}
		}
	}
	return result
}

func satSummaryStatus(summary satSummary, label string) (string, string, bool) {
	return satKeyStatus(summary.overall, label)
}

func satKeyStatus(rawStatus, label string) (string, string, bool) {
	switch strings.ToUpper(strings.TrimSpace(rawStatus)) {
	case "OK":
		// No error description on success; error_description is for problems only.
		return "OK", "", true
	case "PARTIAL", "UNSUPPORTED", "CANCELED", "CANCELLED":
		// Tool couldn't run or test was incomplete, so we can't assert hardware health.
		return "Unknown", "", true
	case "FAILED":
		return "Critical", label + " failed", true
	default:
		return "", "", false
	}
}

// mergeComponentStatus applies satStatus to component only when the component
// has no status yet, is Unknown, or the SAT status is strictly more severe
// than the current one. A history entry is appended only when changedAt is set.
func mergeComponentStatus(component *schema.HardwareComponentStatus, changedAt, satStatus, description string) {
	if component == nil || satStatus == "" {
		return
	}
	current := strings.TrimSpace(ptrString(component.Status))
	if current == "" || current == "Unknown" || statusSeverity(satStatus) > statusSeverity(current) {
		component.Status = appStringPtr(satStatus)
		if strings.TrimSpace(description) != "" {
			component.ErrorDescription = appStringPtr(description)
		}
		if strings.TrimSpace(changedAt) != "" {
			component.StatusChangedAt = appStringPtr(changedAt)
			component.StatusHistory = append(component.StatusHistory, schema.HardwareStatusHistory{
				Status:    satStatus,
				ChangedAt: changedAt,
				Details:   appStringPtr(description),
			})
		}
	}
}

func statusSeverity(status string) int {
	switch strings.TrimSpace(status) {
	case "Critical":
		return 3
	case "Warning":
		return 2
	case "OK":
		return 1
	case "Unknown":
		return 1 // same as OK, so it does not override OK from another source
	default:
		return 0
	}
}

func matchesGPUVendor(dev schema.HardwarePCIeDevice, vendor string) bool {
|
||||
if dev.DeviceClass == nil || !strings.Contains(strings.TrimSpace(*dev.DeviceClass), "Controller") && !strings.Contains(strings.TrimSpace(*dev.DeviceClass), "Accelerator") {
|
||||
if dev.DeviceClass == nil || !strings.Contains(strings.TrimSpace(*dev.DeviceClass), "Display") && !strings.Contains(strings.TrimSpace(*dev.DeviceClass), "Video") {
|
||||
return false
|
||||
}
|
||||
}
|
||||
manufacturer := strings.ToLower(strings.TrimSpace(ptrString(dev.Manufacturer)))
|
||||
switch vendor {
|
||||
case "amd":
|
||||
return strings.Contains(manufacturer, "advanced micro devices") || strings.Contains(manufacturer, "amd/ati")
|
||||
case "nvidia":
|
||||
return strings.Contains(manufacturer, "nvidia")
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
func ptrString(v *string) string {
|
||||
if v == nil {
|
||||
return ""
|
||||
}
|
||||
return *v
|
||||
}
|
||||
|
||||
func appStringPtr(value string) *string {
|
||||
return &value
|
||||
}
|
||||
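The precedence rule in mergeComponentStatus and statusSeverity can be tried in isolation. The following is a minimal sketch, not the code from this diff: `severity` and `mergeStatus` are hypothetical names, and the schema types are reduced to plain strings so the example runs on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// severity mirrors the ranking in statusSeverity: Critical > Warning > OK == Unknown.
func severity(status string) int {
	switch strings.TrimSpace(status) {
	case "Critical":
		return 3
	case "Warning":
		return 2
	case "OK", "Unknown":
		return 1
	default:
		return 0
	}
}

// mergeStatus is a hypothetical string-only reduction of mergeComponentStatus.
// The incoming status wins when the current one is empty or "Unknown",
// or when the incoming one is strictly more severe.
func mergeStatus(current, incoming string) string {
	if incoming == "" {
		return current
	}
	current = strings.TrimSpace(current)
	if current == "" || current == "Unknown" || severity(incoming) > severity(current) {
		return incoming
	}
	return current
}

func main() {
	fmt.Println(mergeStatus("OK", "Critical"))  // escalates
	fmt.Println(mergeStatus("Critical", "OK"))  // never downgrades
	fmt.Println(mergeStatus("OK", "Unknown"))   // Unknown does not override OK
	fmt.Println(mergeStatus("Unknown", "OK"))   // any result replaces Unknown
}
```

Because "Unknown" and "OK" share a rank, an "Unknown" result cannot displace an "OK" from another source, while an existing "Unknown" is always replaced through the explicit check.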
61
audit/internal/app/sat_overlay_test.go
Normal file
@@ -0,0 +1,61 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
func TestApplyLatestSATStatusesMarksStorageByDevice(t *testing.T) {
|
||||
baseDir := t.TempDir()
|
||||
runDir := filepath.Join(baseDir, "storage-20260325-161151")
|
||||
if err := os.MkdirAll(runDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
raw := "run_at_utc=2026-03-25T16:11:51Z\nnvme0n1_nvme_smart_log_status=OK\nsda_smartctl_health_status=FAILED\noverall_status=FAILED\n"
|
||||
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(raw), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
nvme := schema.HardwareStorage{Telemetry: map[string]any{"linux_device": "/dev/nvme0n1"}}
|
||||
usb := schema.HardwareStorage{Telemetry: map[string]any{"linux_device": "/dev/sda"}}
|
||||
snap := schema.HardwareSnapshot{Storage: []schema.HardwareStorage{nvme, usb}}
|
||||
|
||||
applyLatestSATStatuses(&snap, baseDir)
|
||||
|
||||
if snap.Storage[0].Status == nil || *snap.Storage[0].Status != "OK" {
|
||||
t.Fatalf("nvme status=%v want OK", snap.Storage[0].Status)
|
||||
}
|
||||
if snap.Storage[1].Status == nil || *snap.Storage[1].Status != "Critical" {
|
||||
t.Fatalf("sda status=%v want Critical", snap.Storage[1].Status)
|
||||
}
|
||||
}
|
||||
|
||||
func TestApplyLatestSATStatusesMarksAMDGPUs(t *testing.T) {
|
||||
baseDir := t.TempDir()
|
||||
runDir := filepath.Join(baseDir, "gpu-amd-20260325-161436")
|
||||
if err := os.MkdirAll(runDir, 0755); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
raw := "run_at_utc=2026-03-25T16:14:36Z\noverall_status=FAILED\n"
|
||||
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(raw), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
class := "DisplayController"
|
||||
manufacturer := "Advanced Micro Devices, Inc. [AMD/ATI]"
|
||||
snap := schema.HardwareSnapshot{
|
||||
PCIeDevices: []schema.HardwarePCIeDevice{{
|
||||
DeviceClass: &class,
|
||||
Manufacturer: &manufacturer,
|
||||
}},
|
||||
}
|
||||
|
||||
applyLatestSATStatuses(&snap, baseDir)
|
||||
|
||||
if snap.PCIeDevices[0].Status == nil || *snap.PCIeDevices[0].Status != "Critical" {
|
||||
t.Fatalf("gpu status=%v want Critical", snap.PCIeDevices[0].Status)
|
||||
}
|
||||
}
|
||||
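Both tests above write a summary.txt made of key=value lines. The parser that applyLatestSATStatuses uses is not part of this section, so the following is only a sketch of how such a file could be read; `parseSummary` is a hypothetical helper, not a function from the repository.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseSummary is a hypothetical helper that reads key=value lines into a map.
// Lines without "=" or with an empty key are skipped.
func parseSummary(raw string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(raw))
	for sc.Scan() {
		key, val, ok := strings.Cut(strings.TrimSpace(sc.Text()), "=")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			continue
		}
		out[key] = strings.TrimSpace(val)
	}
	return out
}

func main() {
	// Same payload as TestApplyLatestSATStatusesMarksStorageByDevice.
	raw := "run_at_utc=2026-03-25T16:11:51Z\nnvme0n1_nvme_smart_log_status=OK\nsda_smartctl_health_status=FAILED\noverall_status=FAILED\n"
	kv := parseSummary(raw)
	fmt.Println(kv["nvme0n1_nvme_smart_log_status"], kv["sda_smartctl_health_status"])
}
```

Splitting on the first "=" only keeps values such as timestamps intact even if they contain further "=" characters.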
364
audit/internal/app/support_bundle.go
Normal file
@@ -0,0 +1,364 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"archive/tar"
|
||||
"compress/gzip"
|
||||
"fmt"
|
||||
"io"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
var supportBundleServices = []string{
|
||||
"bee-audit.service",
|
||||
"bee-web.service",
|
||||
"bee-network.service",
|
||||
"bee-nvidia.service",
|
||||
"bee-preflight.service",
|
||||
"bee-sshsetup.service",
|
||||
}
|
||||
|
||||
var supportBundleCommands = []struct {
|
||||
name string
|
||||
cmd []string
|
||||
}{
|
||||
{name: "system/uname.txt", cmd: []string{"uname", "-a"}},
|
||||
{name: "system/lsmod.txt", cmd: []string{"lsmod"}},
|
||||
{name: "system/lspci-nn.txt", cmd: []string{"lspci", "-nn"}},
|
||||
{name: "system/ip-addr.txt", cmd: []string{"ip", "addr"}},
|
||||
{name: "system/ip-route.txt", cmd: []string{"ip", "route"}},
|
||||
{name: "system/mount.txt", cmd: []string{"mount"}},
|
||||
{name: "system/df-h.txt", cmd: []string{"df", "-h"}},
|
||||
{name: "system/dmesg-tail.txt", cmd: []string{"sh", "-c", "dmesg | tail -n 200"}},
|
||||
}
|
||||
|
||||
func BuildSupportBundle(exportDir string) (string, error) {
|
||||
exportDir = strings.TrimSpace(exportDir)
|
||||
if exportDir == "" {
|
||||
exportDir = DefaultExportDir
|
||||
}
|
||||
if err := os.MkdirAll(exportDir, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := cleanupOldSupportBundles(os.TempDir()); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
host := sanitizeFilename(hostnameOr("unknown"))
|
||||
ts := time.Now().UTC().Format("20060102-150405")
|
||||
stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s", host, ts))
|
||||
if err := os.MkdirAll(stageRoot, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
defer os.RemoveAll(stageRoot)
|
||||
|
||||
if err := copyExportDirForSupportBundle(exportDir, filepath.Join(stageRoot, "export")); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := writeJournalDump(filepath.Join(stageRoot, "systemd", "combined.journal.log")); err != nil {
|
||||
return "", err
|
||||
}
|
||||
for _, svc := range supportBundleServices {
|
||||
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".status.txt"), []string{"systemctl", "status", svc, "--no-pager"}); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".journal.log"), []string{"journalctl", "--no-pager", "-u", svc}); err != nil {
|
||||
return "", err
|
||||
}
|
||||
}
|
||||
for _, item := range supportBundleCommands {
|
||||
if err := writeCommandOutput(filepath.Join(stageRoot, item.name), item.cmd); err != nil {
|
||||
return "", err
|
||||
}
|
||||
}
|
||||
if err := writeManifest(filepath.Join(stageRoot, "manifest.txt"), exportDir, stageRoot); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
archivePath := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s.tar.gz", host, ts))
|
||||
if err := createSupportTarGz(archivePath, stageRoot); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return archivePath, nil
|
||||
}
|
||||
|
||||
func cleanupOldSupportBundles(dir string) error {
|
||||
matches, err := filepath.Glob(filepath.Join(dir, "bee-support-*.tar.gz"))
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
type entry struct {
|
||||
path string
|
||||
mod time.Time
|
||||
}
|
||||
list := make([]entry, 0, len(matches))
|
||||
for _, match := range matches {
|
||||
info, err := os.Stat(match)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
if time.Since(info.ModTime()) > 24*time.Hour {
|
||||
_ = os.Remove(match)
|
||||
continue
|
||||
}
|
||||
list = append(list, entry{path: match, mod: info.ModTime()})
|
||||
}
|
||||
sort.Slice(list, func(i, j int) bool { return list[i].mod.After(list[j].mod) })
|
||||
if len(list) > 3 {
|
||||
for _, old := range list[3:] {
|
||||
_ = os.Remove(old.path)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func writeJournalDump(dst string) error {
|
||||
args := []string{"--no-pager"}
|
||||
for _, svc := range supportBundleServices {
|
||||
args = append(args, "-u", svc)
|
||||
}
|
||||
raw, err := exec.Command("journalctl", args...).CombinedOutput()
|
||||
if len(raw) == 0 && err != nil {
|
||||
raw = []byte(err.Error() + "\n")
|
||||
}
|
||||
if len(raw) == 0 {
|
||||
raw = []byte("no journal output\n")
|
||||
}
|
||||
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
return os.WriteFile(dst, raw, 0644)
|
||||
}
|
||||
|
||||
func writeCommandOutput(dst string, cmd []string) error {
|
||||
if len(cmd) == 0 {
|
||||
return nil
|
||||
}
|
||||
raw, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
|
||||
if len(raw) == 0 {
|
||||
if err != nil {
|
||||
raw = []byte(err.Error() + "\n")
|
||||
} else {
|
||||
raw = []byte("no output\n")
|
||||
}
|
||||
}
|
||||
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
return os.WriteFile(dst, raw, 0644)
|
||||
}
|
||||
|
||||
func writeManifest(dst, exportDir, stageRoot string) error {
|
||||
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
var body strings.Builder
|
||||
fmt.Fprintf(&body, "bee_version=%s\n", buildVersion())
|
||||
fmt.Fprintf(&body, "host=%s\n", hostnameOr("unknown"))
|
||||
fmt.Fprintf(&body, "generated_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
|
||||
fmt.Fprintf(&body, "export_dir=%s\n", exportDir)
|
||||
fmt.Fprintf(&body, "\nfiles:\n")
|
||||
|
||||
var files []string
|
||||
if err := filepath.Walk(stageRoot, func(path string, info os.FileInfo, err error) error {
|
||||
if err != nil || info.IsDir() {
|
||||
return err
|
||||
}
|
||||
if filepath.Clean(path) == filepath.Clean(dst) {
|
||||
return nil
|
||||
}
|
||||
rel, err := filepath.Rel(stageRoot, path)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
files = append(files, fmt.Sprintf("%s\t%d", rel, info.Size()))
|
||||
return nil
|
||||
}); err != nil {
|
||||
return err
|
||||
}
|
||||
sort.Strings(files)
|
||||
for _, line := range files {
|
||||
body.WriteString(line)
|
||||
body.WriteByte('\n')
|
||||
}
|
||||
return os.WriteFile(dst, []byte(body.String()), 0644)
|
||||
}
|
||||
|
||||
func buildVersion() string {
|
||||
raw, err := exec.Command("bee", "version").CombinedOutput()
|
||||
if err != nil {
|
||||
return "unknown"
|
||||
}
|
||||
return strings.TrimSpace(string(raw))
|
||||
}
|
||||
|
||||
func copyDirContents(srcDir, dstDir string) error {
|
||||
entries, err := os.ReadDir(srcDir)
|
||||
if err != nil {
|
||||
if os.IsNotExist(err) {
|
||||
return nil
|
||||
}
|
||||
return err
|
||||
}
|
||||
for _, entry := range entries {
|
||||
src := filepath.Join(srcDir, entry.Name())
|
||||
dst := filepath.Join(dstDir, entry.Name())
|
||||
if err := copyPath(src, dst); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func copyExportDirForSupportBundle(srcDir, dstDir string) error {
|
||||
return copyDirContentsFiltered(srcDir, dstDir, func(rel string, info os.FileInfo) bool {
|
||||
cleanRel := filepath.ToSlash(strings.TrimPrefix(filepath.Clean(rel), "./"))
|
||||
if cleanRel == "" {
|
||||
return true
|
||||
}
|
||||
if strings.HasPrefix(cleanRel, "bee-sat/") && strings.HasSuffix(cleanRel, ".tar.gz") {
|
||||
return false
|
||||
}
|
||||
if strings.HasPrefix(filepath.Base(cleanRel), "bee-support-") && strings.HasSuffix(cleanRel, ".tar.gz") {
|
||||
return false
|
||||
}
|
||||
return true
|
||||
})
|
||||
}
|
||||
|
||||
func copyDirContentsFiltered(srcDir, dstDir string, keep func(rel string, info os.FileInfo) bool) error {
|
||||
entries, err := os.ReadDir(srcDir)
|
||||
if err != nil {
|
||||
if os.IsNotExist(err) {
|
||||
return nil
|
||||
}
|
||||
return err
|
||||
}
|
||||
for _, entry := range entries {
|
||||
src := filepath.Join(srcDir, entry.Name())
|
||||
dst := filepath.Join(dstDir, entry.Name())
|
||||
if err := copyPathFiltered(srcDir, src, dst, keep); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func copyPath(src, dst string) error {
|
||||
info, err := os.Stat(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if info.IsDir() {
|
||||
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
|
||||
return err
|
||||
}
|
||||
entries, err := os.ReadDir(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
for _, entry := range entries {
|
||||
if err := copyPath(filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name())); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
in, err := os.Open(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer in.Close()
|
||||
|
||||
out, err := os.OpenFile(dst, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, info.Mode().Perm())
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer out.Close()
|
||||
|
||||
_, err = io.Copy(out, in)
|
||||
return err
|
||||
}
|
||||
|
||||
func copyPathFiltered(rootSrc, src, dst string, keep func(rel string, info os.FileInfo) bool) error {
|
||||
info, err := os.Stat(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
rel, err := filepath.Rel(rootSrc, src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if keep != nil && !keep(rel, info) {
|
||||
return nil
|
||||
}
|
||||
if info.IsDir() {
|
||||
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
|
||||
return err
|
||||
}
|
||||
entries, err := os.ReadDir(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
for _, entry := range entries {
|
||||
if err := copyPathFiltered(rootSrc, filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name()), keep); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
return copyPath(src, dst)
|
||||
}
|
||||
|
||||
// createSupportTarGz packs every non-directory entry under srcDir into a
// gzip-compressed tar archive at dst. Entry names are relative to the parent
// of srcDir.
func createSupportTarGz(dst, srcDir string) (err error) {
	file, err := os.Create(dst)
	if err != nil {
		return err
	}
	gz := gzip.NewWriter(file)
	tw := tar.NewWriter(gz)
	// tar and gzip flush buffered data on Close, so report the first close
	// error instead of discarding it; otherwise a truncated archive could be
	// returned as a success.
	defer func() {
		for _, c := range []io.Closer{tw, gz, file} {
			if cerr := c.Close(); cerr != nil && err == nil {
				err = cerr
			}
		}
	}()

	base := filepath.Dir(srcDir)
	return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}

		header, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		header.Name, err = filepath.Rel(base, path)
		if err != nil {
			return err
		}
		if err := tw.WriteHeader(header); err != nil {
			return err
		}

		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()

		_, err = io.Copy(tw, f)
		return err
	})
}
|
||||
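The retention rule in cleanupOldSupportBundles (drop archives older than 24 hours, then keep only the three newest) is easier to check when separated from the filesystem. The sketch below assumes that separation; `bundle`, `staleBundles`, and `demoStale` are hypothetical names, and the function only returns the paths that would be deleted.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// bundle is a hypothetical stand-in for one archive on disk.
type bundle struct {
	Path string
	Mod  time.Time
}

// staleBundles returns the paths the retention rule would delete: anything
// older than maxAge, plus every survivor beyond the newest `keep` entries.
func staleBundles(all []bundle, now time.Time, maxAge time.Duration, keep int) []string {
	var stale []string
	fresh := make([]bundle, 0, len(all))
	for _, b := range all {
		if now.Sub(b.Mod) > maxAge {
			stale = append(stale, b.Path)
			continue
		}
		fresh = append(fresh, b)
	}
	// Newest first, so everything past index keep-1 is surplus.
	sort.Slice(fresh, func(i, j int) bool { return fresh[i].Mod.After(fresh[j].Mod) })
	if len(fresh) > keep {
		for _, b := range fresh[keep:] {
			stale = append(stale, b.Path)
		}
	}
	return stale
}

// demoStale applies the 24h / keep-3 policy from the diff to five sample bundles.
func demoStale() []string {
	now := time.Date(2026, 3, 25, 12, 0, 0, 0, time.UTC)
	all := []bundle{
		{Path: "a.tar.gz", Mod: now.Add(-1 * time.Hour)},
		{Path: "b.tar.gz", Mod: now.Add(-2 * time.Hour)},
		{Path: "c.tar.gz", Mod: now.Add(-3 * time.Hour)},
		{Path: "d.tar.gz", Mod: now.Add(-4 * time.Hour)},
		{Path: "e.tar.gz", Mod: now.Add(-48 * time.Hour)},
	}
	return staleBundles(all, now, 24*time.Hour, 3)
}

func main() {
	fmt.Println(demoStale())
}
```

Passing `now` in as a parameter, rather than calling time.Since inside the loop, is a design choice of this sketch that makes the policy deterministic under test.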
252
audit/internal/collector/amdgpu.go
Normal file
@@ -0,0 +1,252 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"encoding/csv"
|
||||
"log/slog"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
var (
|
||||
amdSMIExecCommand = exec.Command
|
||||
amdSMILookPath = exec.LookPath
|
||||
amdSMIGlob = filepath.Glob
|
||||
)
|
||||
|
||||
var amdSMIExecutableGlobs = []string{
|
||||
"/opt/rocm/bin/rocm-smi",
|
||||
"/opt/rocm-*/bin/rocm-smi",
|
||||
"/usr/local/bin/rocm-smi",
|
||||
}
|
||||
|
||||
type amdGPUInfo struct {
|
||||
BDF string
|
||||
Serial string
|
||||
Product string
|
||||
Firmware string
|
||||
PowerW *float64
|
||||
TempC *float64
|
||||
}
|
||||
|
||||
func enrichPCIeWithAMD(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
|
||||
if !hasAMDGPUDevices(devs) {
|
||||
return devs
|
||||
}
|
||||
infoByBDF, err := queryAMDGPUs()
|
||||
if err != nil {
|
||||
slog.Info("amdgpu: enrichment skipped", "err", err)
|
||||
return devs
|
||||
}
|
||||
enriched := 0
|
||||
for i := range devs {
|
||||
if !isAMDGPUDevice(devs[i]) || devs[i].BDF == nil {
|
||||
continue
|
||||
}
|
||||
info, ok := infoByBDF[normalizePCIeBDF(*devs[i].BDF)]
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if strings.TrimSpace(info.Serial) != "" {
|
||||
devs[i].SerialNumber = &info.Serial
|
||||
}
|
||||
if strings.TrimSpace(info.Firmware) != "" {
|
||||
devs[i].Firmware = &info.Firmware
|
||||
}
|
||||
if strings.TrimSpace(info.Product) != "" && devs[i].Model == nil {
|
||||
devs[i].Model = &info.Product
|
||||
}
|
||||
if info.PowerW != nil {
|
||||
devs[i].PowerW = info.PowerW
|
||||
}
|
||||
if info.TempC != nil {
|
||||
devs[i].TemperatureC = info.TempC
|
||||
}
|
||||
enriched++
|
||||
}
|
||||
if enriched > 0 {
|
||||
slog.Info("amdgpu: enriched", "count", enriched)
|
||||
}
|
||||
return devs
|
||||
}
|
||||
|
||||
func hasAMDGPUDevices(devs []schema.HardwarePCIeDevice) bool {
|
||||
for _, dev := range devs {
|
||||
if isAMDGPUDevice(dev) {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func isAMDGPUDevice(dev schema.HardwarePCIeDevice) bool {
|
||||
if dev.Manufacturer == nil || dev.DeviceClass == nil {
|
||||
return false
|
||||
}
|
||||
manufacturer := strings.ToLower(strings.TrimSpace(*dev.Manufacturer))
|
||||
return strings.Contains(manufacturer, "advanced micro devices") && isGPUClass(strings.TrimSpace(*dev.DeviceClass))
|
||||
}
|
||||
|
||||
func queryAMDGPUs() (map[string]amdGPUInfo, error) {
|
||||
busByCard, err := queryAMDField("--showbus")
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
infoByCard := map[string]amdGPUInfo{}
|
||||
for card, bus := range busByCard {
|
||||
bdf := normalizePCIeBDF(bus)
|
||||
if bdf == "" {
|
||||
continue
|
||||
}
|
||||
infoByCard[card] = amdGPUInfo{BDF: bdf}
|
||||
}
|
||||
if len(infoByCard) == 0 {
|
||||
return map[string]amdGPUInfo{}, nil
|
||||
}
|
||||
mergeAMDField(infoByCard, "--showserial", func(info *amdGPUInfo, value string) { info.Serial = value })
|
||||
mergeAMDField(infoByCard, "--showproductname", func(info *amdGPUInfo, value string) { info.Product = value })
|
||||
mergeAMDField(infoByCard, "--showvbios", func(info *amdGPUInfo, value string) { info.Firmware = value })
|
||||
mergeAMDNumericField(infoByCard, "--showpower", func(info *amdGPUInfo, value float64) { info.PowerW = &value })
|
||||
mergeAMDNumericField(infoByCard, "--showtemp", func(info *amdGPUInfo, value float64) { info.TempC = &value })
|
||||
|
||||
result := make(map[string]amdGPUInfo, len(infoByCard))
|
||||
for _, info := range infoByCard {
|
||||
if info.BDF == "" {
|
||||
continue
|
||||
}
|
||||
result[info.BDF] = info
|
||||
}
|
||||
return result, nil
|
||||
}
|
||||
|
||||
func mergeAMDField(infoByCard map[string]amdGPUInfo, flag string, apply func(*amdGPUInfo, string)) {
|
||||
values, err := queryAMDField(flag)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
for card, value := range values {
|
||||
info, ok := infoByCard[card]
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
value = strings.TrimSpace(value)
|
||||
if value == "" {
|
||||
continue
|
||||
}
|
||||
apply(&info, value)
|
||||
infoByCard[card] = info
|
||||
}
|
||||
}
|
||||
|
||||
func mergeAMDNumericField(infoByCard map[string]amdGPUInfo, flag string, apply func(*amdGPUInfo, float64)) {
|
||||
values, err := queryAMDNumericField(flag)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
for card, value := range values {
|
||||
info, ok := infoByCard[card]
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
apply(&info, value)
|
||||
infoByCard[card] = info
|
||||
}
|
||||
}
|
||||
|
||||
func queryAMDField(flag string) (map[string]string, error) {
|
||||
cmd, err := resolveAMDSMICmd(flag, "--csv")
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
out, err := amdSMIExecCommand(cmd[0], cmd[1:]...).CombinedOutput()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return parseROCmSingleValueCSV(string(out)), nil
|
||||
}
|
||||
|
||||
func queryAMDNumericField(flag string) (map[string]float64, error) {
|
||||
values, err := queryAMDField(flag)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
out := map[string]float64{}
|
||||
for card, raw := range values {
|
||||
if value, ok := firstFloat(raw); ok {
|
||||
out[card] = value
|
||||
}
|
||||
}
|
||||
return out, nil
|
||||
}
|
||||
|
||||
func resolveAMDSMICmd(args ...string) ([]string, error) {
|
||||
if path, err := amdSMILookPath("rocm-smi"); err == nil {
|
||||
return append([]string{path}, args...), nil
|
||||
}
|
||||
for _, pattern := range amdSMIExecutableGlobs {
|
||||
matches, err := amdSMIGlob(pattern)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
sort.Strings(matches)
|
||||
for _, match := range matches {
|
||||
return append([]string{match}, args...), nil
|
||||
}
|
||||
}
|
||||
return nil, exec.ErrNotFound
|
||||
}
|
||||
|
||||
func parseROCmSingleValueCSV(raw string) map[string]string {
|
||||
rows := map[string]string{}
|
||||
reader := csv.NewReader(strings.NewReader(raw))
|
||||
reader.FieldsPerRecord = -1
|
||||
records, err := reader.ReadAll()
|
||||
if err != nil {
|
||||
return rows
|
||||
}
|
||||
for _, rec := range records {
|
||||
if len(rec) < 2 {
|
||||
continue
|
||||
}
|
||||
card := normalizeROCmCardKey(rec[0])
|
||||
if card == "" {
|
||||
continue
|
||||
}
|
||||
value := strings.TrimSpace(strings.Join(rec[1:], ","))
|
||||
if value == "" || looksLikeCSVHeaderValue(value) {
|
||||
continue
|
||||
}
|
||||
rows[card] = value
|
||||
}
|
||||
return rows
|
||||
}
|
||||
|
||||
func normalizeROCmCardKey(raw string) string {
|
||||
raw = strings.ToLower(strings.TrimSpace(raw))
|
||||
raw = strings.Trim(raw, "\"")
|
||||
if raw == "" {
|
||||
return ""
|
||||
}
|
||||
if raw == "device" || raw == "gpu" || raw == "card" {
|
||||
return ""
|
||||
}
|
||||
if strings.HasPrefix(raw, "card") {
|
||||
return raw
|
||||
}
|
||||
if _, err := strconv.Atoi(raw); err == nil {
|
||||
return "card" + raw
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
func looksLikeCSVHeaderValue(value string) bool {
|
||||
value = strings.ToLower(strings.TrimSpace(value))
|
||||
return strings.Contains(value, "product") ||
|
||||
strings.Contains(value, "serial") ||
|
||||
strings.Contains(value, "vbios") ||
|
||||
strings.Contains(value, "bus")
|
||||
}
|
||||
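queryAMDNumericField relies on a helper named firstFloat whose body is not shown in this section. Judging only from its use, it appears to pull a leading number out of values that carry a unit suffix. The following is a hypothetical stand-in under that assumption; `firstNumber` and its regular expression are illustrative and may differ from the real helper.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// numberRe matches the first signed integer or decimal number in a string.
var numberRe = regexp.MustCompile(`[-+]?\d+(?:\.\d+)?`)

// firstNumber is a hypothetical stand-in for firstFloat: it returns the first
// number found in raw, or false when raw contains no digits.
func firstNumber(raw string) (float64, bool) {
	m := numberRe.FindString(raw)
	if m == "" {
		return 0, false
	}
	v, err := strconv.ParseFloat(m, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, raw := range []string{"45.5c", "35.0 W", "N/A"} {
		v, ok := firstNumber(raw)
		fmt.Println(raw, v, ok)
	}
}
```

Returning a boolean alongside the value lets the caller skip cards whose field is reported as text, which matches how queryAMDNumericField omits entries that fail to parse.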
56
audit/internal/collector/amdgpu_test.go
Normal file
@@ -0,0 +1,56 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"os/exec"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestParseROCmSingleValueCSV(t *testing.T) {
|
||||
raw := "device,Serial Number\ncard0,ABC123\ncard1,XYZ789\n"
|
||||
got := parseROCmSingleValueCSV(raw)
|
||||
if got["card0"] != "ABC123" {
|
||||
t.Fatalf("card0=%q want ABC123", got["card0"])
|
||||
}
|
||||
if got["card1"] != "XYZ789" {
|
||||
t.Fatalf("card1=%q want XYZ789", got["card1"])
|
||||
}
|
||||
}
|
||||
|
||||
func TestQueryAMDNumericFieldParsesUnits(t *testing.T) {
|
||||
origExec := amdSMIExecCommand
|
||||
origLookPath := amdSMILookPath
|
||||
t.Cleanup(func() {
|
||||
amdSMIExecCommand = origExec
|
||||
amdSMILookPath = origLookPath
|
||||
})
|
||||
|
||||
amdSMILookPath = func(string) (string, error) { return "/usr/bin/rocm-smi", nil }
|
||||
amdSMIExecCommand = func(name string, args ...string) *exec.Cmd {
|
||||
return exec.Command("sh", "-c", "printf 'device,Temperature\\ncard0,45.5c\\ncard1,67.0c\\n'")
|
||||
}
|
||||
|
||||
got, err := queryAMDNumericField("--showtemp")
|
||||
if err != nil {
|
||||
t.Fatalf("queryAMDNumericField: %v", err)
|
||||
}
|
||||
if got["card0"] != 45.5 {
|
||||
t.Fatalf("card0=%v want 45.5", got["card0"])
|
||||
}
|
||||
if got["card1"] != 67.0 {
|
||||
t.Fatalf("card1=%v want 67.0", got["card1"])
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizeROCmCardKey(t *testing.T) {
|
||||
tests := map[string]string{
|
||||
"0": "card0",
|
||||
"card1": "card1",
|
||||
"Device": "",
|
||||
"": "",
|
||||
}
|
||||
for input, want := range tests {
|
||||
if got := normalizeROCmCardKey(input); got != want {
|
||||
t.Fatalf("normalizeROCmCardKey(%q)=%q want %q", input, got, want)
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -4,10 +4,27 @@ import (
|
||||
"bee/audit/internal/schema"
|
||||
"bufio"
|
||||
"log/slog"
|
||||
"os"
|
||||
"os/exec"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var execDmidecode = func(typeNum string) (string, error) {
|
||||
out, err := exec.Command("dmidecode", "-t", typeNum).Output()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
return string(out), nil
|
||||
}
|
||||
|
||||
var execIpmitool = func(args ...string) (string, error) {
|
||||
out, err := exec.Command("ipmitool", args...).Output()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
return string(out), nil
|
||||
}
|
||||
|
||||
// collectBoard runs dmidecode for types 0, 1, 2 and returns the board record
|
||||
// plus the BIOS firmware entry. Any failure is logged and returns zero values.
|
||||
func collectBoard() (schema.HardwareBoard, []schema.HardwareFirmwareRecord) {
|
||||
@@ -61,6 +78,45 @@ func parseBoard(type1, type2 string) schema.HardwareBoard {
|
||||
return board
|
||||
}
|
||||
|
||||
// collectBMCFirmware collects BMC firmware version via ipmitool mc info.
|
||||
// Returns nil if ipmitool is missing, /dev/ipmi0 is absent, or any error occurs.
|
||||
func collectBMCFirmware() []schema.HardwareFirmwareRecord {
|
||||
if _, err := exec.LookPath("ipmitool"); err != nil {
|
||||
return nil
|
||||
}
|
||||
if _, err := os.Stat("/dev/ipmi0"); err != nil {
|
||||
return nil
|
||||
}
|
||||
out, err := execIpmitool("mc", "info")
|
||||
if err != nil {
|
||||
slog.Info("bmc: ipmitool mc info unavailable", "err", err)
|
||||
return nil
|
||||
}
|
||||
version := parseBMCFirmwareRevision(out)
|
||||
if version == "" {
|
||||
return nil
|
||||
}
|
||||
slog.Info("bmc: collected", "version", version)
|
||||
return []schema.HardwareFirmwareRecord{
|
||||
{DeviceName: "BMC", Version: version},
|
||||
}
|
||||
}
|
||||
|
||||
// parseBMCFirmwareRevision extracts the "Firmware Revision" field from ipmitool mc info output.
|
||||
func parseBMCFirmwareRevision(out string) string {
|
||||
for _, line := range strings.Split(out, "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
key, val, ok := strings.Cut(line, ":")
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if strings.TrimSpace(key) == "Firmware Revision" {
|
||||
return strings.TrimSpace(val)
|
||||
}
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
// parseBIOSFirmware extracts BIOS version from dmidecode type 0 output.
|
||||
func parseBIOSFirmware(type0 string) []schema.HardwareFirmwareRecord {
|
||||
fields := parseDMIFields(type0, "BIOS Information")
|
||||
@@ -141,9 +197,5 @@ func cleanDMIValue(v string) string {
|
||||
|
||||
 // runDmidecode executes dmidecode -t <typeNum> and returns its stdout.
 func runDmidecode(typeNum string) (string, error) {
-	out, err := exec.Command("dmidecode", "-t", typeNum).Output()
-	if err != nil {
-		return "", err
-	}
-	return string(out), nil
+	return execDmidecode(typeNum)
 }
|
||||
|
||||
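parseBMCFirmwareRevision scans `ipmitool mc info` output for a "Firmware Revision" line and splits it on the first colon. The sketch below reproduces that logic under a different name (`firmwareRevision`) so it can run standalone; the sample text is illustrative output written for this example, not captured from a real BMC.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleMCInfo is an illustrative excerpt in the "key : value" layout that
// ipmitool mc info prints; the values are made up for this example.
const sampleMCInfo = `Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 1.73
IPMI Version              : 2.0
`

// firmwareRevision follows the same steps as parseBMCFirmwareRevision:
// trim each line, split on the first colon, and match the key exactly.
func firmwareRevision(out string) string {
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue
		}
		if strings.TrimSpace(key) == "Firmware Revision" {
			return strings.TrimSpace(val)
		}
	}
	return ""
}

func main() {
	fmt.Println(firmwareRevision(sampleMCInfo))
}
```

An empty return value covers both missing output and output without the field, which is why collectBMCFirmware can treat "" as "nothing to record".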
@@ -4,15 +4,18 @@
|
||||
 package collector

 import (
+	"bee/audit/internal/runtimeenv"
 	"bee/audit/internal/schema"
 	"log/slog"
+	"os"
 	"time"
 )

 // Run executes all collectors and returns the combined snapshot.
 // Partial failures are logged as warnings; collection always completes.
-func Run() schema.HardwareIngestRequest {
+func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
 	start := time.Now()
+	collectedAt := time.Now().UTC().Format(time.RFC3339)
 	slog.Info("audit started")

 	snap := schema.HardwareSnapshot{}
|
||||
@@ -20,32 +23,45 @@ func Run() schema.HardwareIngestRequest {
|
||||
board, biosFW := collectBoard()
|
||||
snap.Board = board
|
||||
snap.Firmware = append(snap.Firmware, biosFW...)
|
||||
snap.Firmware = append(snap.Firmware, collectBMCFirmware()...)
|
||||
|
||||
cpus, cpuFW := collectCPUs(snap.Board.SerialNumber)
|
||||
snap.CPUs = cpus
|
||||
snap.Firmware = append(snap.Firmware, cpuFW...)
|
||||
snap.CPUs = collectCPUs()
|
||||
|
||||
snap.Memory = collectMemory()
|
||||
sensorDoc, err := readSensorsJSONDoc()
|
||||
if err != nil {
|
||||
slog.Info("sensors: unavailable for enrichment", "err", err)
|
||||
}
|
||||
snap.CPUs = enrichCPUsWithTelemetry(snap.CPUs, sensorDoc)
|
||||
snap.Memory = enrichMemoryWithTelemetry(snap.Memory, sensorDoc)
|
||||
snap.Storage = collectStorage()
|
||||
snap.PCIeDevices = collectPCIe()
|
||||
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices, snap.Board.SerialNumber)
|
||||
snap.PCIeDevices = enrichPCIeWithAMD(snap.PCIeDevices)
|
||||
snap.PCIeDevices = enrichPCIeWithPCISerials(snap.PCIeDevices)
|
||||
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices)
|
||||
snap.PCIeDevices = enrichPCIeWithMellanox(snap.PCIeDevices)
|
||||
snap.PCIeDevices = enrichPCIeWithNICTelemetry(snap.PCIeDevices)
|
||||
snap.PCIeDevices = enrichPCIeWithRAIDTelemetry(snap.PCIeDevices)
|
||||
snap.Storage = enrichStorageWithVROC(snap.Storage, snap.PCIeDevices)
|
||||
snap.Storage = appendUniqueStorage(snap.Storage, collectRAIDStorage(snap.PCIeDevices))
|
||||
snap.PowerSupplies = collectPSUs()
|
||||
snap.PowerSupplies = enrichPSUsWithTelemetry(snap.PowerSupplies, sensorDoc)
|
||||
snap.Sensors = buildSensorsFromDoc(sensorDoc)
|
||||
finalizeSnapshot(&snap, collectedAt)
|
||||
|
||||
// remaining collectors added in steps 1.8 – 1.10
|
||||
|
||||
slog.Info("audit completed", "duration", time.Since(start).Round(time.Millisecond))
|
||||
|
||||
sourceType := "livcd"
|
||||
protocol := "os-direct"
|
||||
|
||||
sourceType := "manual"
|
||||
var targetHost *string
|
||||
if hostname, err := os.Hostname(); err == nil && hostname != "" {
|
||||
targetHost = &hostname
|
||||
}
|
||||
return schema.HardwareIngestRequest{
|
||||
SourceType: &sourceType,
|
||||
Protocol: &protocol,
|
||||
CollectedAt: time.Now().UTC().Format(time.RFC3339),
|
||||
TargetHost: targetHost,
|
||||
CollectedAt: collectedAt,
|
||||
Hardware: snap,
|
||||
}
|
||||
}
|
||||
|
||||
64
audit/internal/collector/contract.go
Normal file
@@ -0,0 +1,64 @@
|
||||
package collector
|
||||
|
||||
import "strings"
|
||||
|
||||
const (
|
||||
statusOK = "OK"
|
||||
statusWarning = "Warning"
|
||||
statusCritical = "Critical"
|
||||
statusUnknown = "Unknown"
|
||||
statusEmpty = "Empty"
|
||||
)
|
||||
|
||||
func mapPCIeDeviceClass(raw string) string {
|
||||
normalized := strings.ToLower(strings.TrimSpace(raw))
|
||||
switch {
|
||||
case normalized == "":
|
||||
return ""
|
||||
case strings.Contains(normalized, "ethernet controller"):
|
||||
return "EthernetController"
|
||||
case strings.Contains(normalized, "fibre channel"):
|
||||
return "FibreChannelController"
|
||||
case strings.Contains(normalized, "network controller"), strings.Contains(normalized, "infiniband controller"):
|
||||
return "NetworkController"
|
||||
case strings.Contains(normalized, "serial attached scsi"), strings.Contains(normalized, "storage controller"):
|
||||
return "StorageController"
|
||||
case strings.Contains(normalized, "raid"), strings.Contains(normalized, "mass storage"):
|
||||
return "MassStorageController"
|
||||
case strings.Contains(normalized, "display controller"):
|
||||
return "DisplayController"
|
||||
case strings.Contains(normalized, "vga"), strings.Contains(normalized, "3d controller"), strings.Contains(normalized, "video controller"):
|
||||
return "VideoController"
|
||||
case strings.Contains(normalized, "processing accelerators"), strings.Contains(normalized, "processing accelerator"):
|
||||
return "ProcessingAccelerator"
|
||||
default:
|
||||
return raw
|
||||
}
|
||||
}
|
||||
|
||||
func isNICClass(class string) bool {
|
||||
switch strings.TrimSpace(class) {
|
||||
case "EthernetController", "NetworkController":
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
func isGPUClass(class string) bool {
|
||||
switch strings.TrimSpace(class) {
|
||||
case "VideoController", "DisplayController", "ProcessingAccelerator":
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
func isRAIDClass(class string) bool {
|
||||
switch strings.TrimSpace(class) {
|
||||
case "MassStorageController", "StorageController":
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
@@ -3,42 +3,39 @@ package collector
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"bufio"
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// collectCPUs runs dmidecode -t 4 and reads microcode version from sysfs.
|
||||
func collectCPUs(boardSerial string) ([]schema.HardwareCPU, []schema.HardwareFirmwareRecord) {
|
||||
// collectCPUs runs dmidecode -t 4 and enriches CPUs with microcode from sysfs.
|
||||
func collectCPUs() []schema.HardwareCPU {
|
||||
out, err := runDmidecode("4")
|
||||
if err != nil {
|
||||
slog.Warn("cpu: dmidecode type 4 failed", "err", err)
|
||||
return nil, nil
|
||||
return nil
|
||||
}
|
||||
|
||||
cpus := parseCPUs(out, boardSerial)
|
||||
|
||||
var firmware []schema.HardwareFirmwareRecord
|
||||
cpus := parseCPUs(out)
|
||||
if mc := readMicrocode(); mc != "" {
|
||||
firmware = append(firmware, schema.HardwareFirmwareRecord{
|
||||
DeviceName: "CPU Microcode",
|
||||
Version: mc,
|
||||
})
|
||||
for i := range cpus {
|
||||
cpus[i].Firmware = &mc
|
||||
}
|
||||
}
|
||||
|
||||
slog.Info("cpu: collected", "count", len(cpus))
|
||||
return cpus, firmware
|
||||
return cpus
|
||||
}
|
||||
|
||||
// parseCPUs splits dmidecode output into per-processor sections and parses each.
|
||||
func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
|
||||
func parseCPUs(output string) []schema.HardwareCPU {
|
||||
sections := splitDMISections(output, "Processor Information")
|
||||
cpus := make([]schema.HardwareCPU, 0, len(sections))
|
||||
|
||||
for _, section := range sections {
|
||||
cpu, ok := parseCPUSection(section, boardSerial)
|
||||
cpu, ok := parseCPUSection(section)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
@@ -49,14 +46,16 @@ func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
|
||||
|
||||
// parseCPUSection parses one "Processor Information" block into a HardwareCPU.
|
||||
// Returns false if the socket is unpopulated.
|
||||
func parseCPUSection(fields map[string]string, boardSerial string) (schema.HardwareCPU, bool) {
|
||||
func parseCPUSection(fields map[string]string) (schema.HardwareCPU, bool) {
|
||||
status := parseCPUStatus(fields["Status"])
|
||||
if status == "EMPTY" {
|
||||
if status == statusEmpty {
|
||||
return schema.HardwareCPU{}, false
|
||||
}
|
||||
|
||||
cpu := schema.HardwareCPU{}
|
||||
cpu.Status = &status
|
||||
present := true
|
||||
cpu.Present = &present
|
||||
|
||||
if socket, ok := parseSocketIndex(fields["Socket Designation"]); ok {
|
||||
cpu.Socket = &socket
|
||||
@@ -70,11 +69,6 @@ func parseCPUSection(fields map[string]string, boardSerial string) (schema.Hardw
|
||||
}
|
||||
if v := cleanDMIValue(fields["Serial Number"]); v != "" {
|
||||
cpu.SerialNumber = &v
|
||||
} else if boardSerial != "" && cpu.Socket != nil {
|
||||
// Intel Xeon never exposes serial via DMI — generate stable fallback
|
||||
// matching core's generateCPUVendorSerial() logic
|
||||
fb := fmt.Sprintf("%s-CPU-%d", boardSerial, *cpu.Socket)
|
||||
cpu.SerialNumber = &fb
|
||||
}
|
||||
|
||||
if v := parseMHz(fields["Max Speed"]); v > 0 {
|
||||
@@ -99,15 +93,15 @@ func parseCPUStatus(raw string) string {
|
||||
upper := strings.ToUpper(raw)
|
||||
switch {
|
||||
case upper == "" || upper == "UNKNOWN":
|
||||
return "UNKNOWN"
|
||||
return statusUnknown
|
||||
case strings.Contains(upper, "UNPOPULATED") || strings.Contains(upper, "NOT POPULATED"):
|
||||
return "EMPTY"
|
||||
return statusEmpty
|
||||
case strings.Contains(upper, "ENABLED"):
|
||||
return "OK"
|
||||
return statusOK
|
||||
case strings.Contains(upper, "DISABLED"):
|
||||
return "WARNING"
|
||||
return statusWarning
|
||||
default:
|
||||
return "UNKNOWN"
|
||||
return statusUnknown
|
||||
}
|
||||
}
|
||||
|
||||
@@ -178,7 +172,7 @@ func parseInt(v string) int {
|
||||
// readMicrocode reads the CPU microcode revision from sysfs.
|
||||
// Returns empty string if unavailable.
|
||||
func readMicrocode() string {
|
||||
data, err := os.ReadFile("/sys/devices/system/cpu/cpu0/microcode/version")
|
||||
data, err := os.ReadFile(filepath.Join(cpuSysBaseDir, "cpu0", "microcode", "version"))
|
||||
if err != nil {
|
||||
return ""
|
||||
}
|
||||
|
||||
196
audit/internal/collector/cpu_telemetry.go
Normal file
@@ -0,0 +1,196 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"regexp"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var (
|
||||
cpuSysBaseDir = "/sys/devices/system/cpu"
|
||||
socketIndexRe = regexp.MustCompile(`(?i)(?:package id|socket|cpu)\s*([0-9]+)`)
|
||||
)
|
||||
|
||||
func enrichCPUsWithTelemetry(cpus []schema.HardwareCPU, doc sensorsDoc) []schema.HardwareCPU {
|
||||
if len(cpus) == 0 {
|
||||
return cpus
|
||||
}
|
||||
|
||||
tempBySocket := cpuTempsFromSensors(doc, len(cpus))
|
||||
powerBySocket := cpuPowerFromSensors(doc, len(cpus))
|
||||
throttleBySocket := cpuThrottleBySocket()
|
||||
|
||||
for i := range cpus {
|
||||
socket := 0
|
||||
if cpus[i].Socket != nil {
|
||||
socket = *cpus[i].Socket
|
||||
}
|
||||
if value, ok := tempBySocket[socket]; ok {
|
||||
cpus[i].TemperatureC = &value
|
||||
}
|
||||
if value, ok := powerBySocket[socket]; ok {
|
||||
cpus[i].PowerW = &value
|
||||
}
|
||||
if value, ok := throttleBySocket[socket]; ok {
|
||||
cpus[i].Throttled = &value
|
||||
}
|
||||
}
|
||||
|
||||
return cpus
|
||||
}
|
||||
|
||||
func cpuTempsFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
|
||||
out := map[int]float64{}
|
||||
if len(doc) == 0 {
|
||||
return out
|
||||
}
|
||||
var fallback []float64
|
||||
for chip, features := range doc {
|
||||
for featureName, raw := range features {
|
||||
feature, ok := raw.(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if classifySensorFeature(feature) != "temp" {
|
||||
continue
|
||||
}
|
||||
temp, ok := firstFeatureFloat(feature, "_input")
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if socket, ok := detectCPUSocket(chip, featureName); ok {
|
||||
if _, exists := out[socket]; !exists {
|
||||
out[socket] = temp
|
||||
}
|
||||
continue
|
||||
}
|
||||
if isLikelyCPUTemp(chip, featureName) {
|
||||
fallback = append(fallback, temp)
|
||||
}
|
||||
}
|
||||
}
|
||||
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
|
||||
out[0] = fallback[0]
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func cpuPowerFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
|
||||
out := map[int]float64{}
|
||||
if len(doc) == 0 {
|
||||
return out
|
||||
}
|
||||
var fallback []float64
|
||||
for chip, features := range doc {
|
||||
for featureName, raw := range features {
|
||||
feature, ok := raw.(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if classifySensorFeature(feature) != "power" {
|
||||
continue
|
||||
}
|
||||
power, ok := firstFeatureFloatWithContains(feature, []string{"power"})
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if socket, ok := detectCPUSocket(chip, featureName); ok {
|
||||
if _, exists := out[socket]; !exists {
|
||||
out[socket] = power
|
||||
}
|
||||
continue
|
||||
}
|
||||
if isLikelyCPUPower(chip, featureName) {
|
||||
fallback = append(fallback, power)
|
||||
}
|
||||
}
|
||||
}
|
||||
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
|
||||
out[0] = fallback[0]
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func detectCPUSocket(parts ...string) (int, bool) {
|
||||
for _, part := range parts {
|
||||
matches := socketIndexRe.FindStringSubmatch(strings.ToLower(part))
|
||||
if len(matches) == 2 {
|
||||
value, err := strconv.Atoi(matches[1])
|
||||
if err == nil {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func isLikelyCPUTemp(chip, feature string) bool {
|
||||
value := strings.ToLower(chip + " " + feature)
|
||||
return strings.Contains(value, "coretemp") ||
|
||||
strings.Contains(value, "k10temp") ||
|
||||
strings.Contains(value, "package id") ||
|
||||
strings.Contains(value, "tdie") ||
|
||||
strings.Contains(value, "tctl") ||
|
||||
strings.Contains(value, "cpu temp")
|
||||
}
|
||||
|
||||
func isLikelyCPUPower(chip, feature string) bool {
|
||||
value := strings.ToLower(chip + " " + feature)
|
||||
return strings.Contains(value, "intel-rapl") ||
|
||||
strings.Contains(value, "package id") ||
|
||||
strings.Contains(value, "package-") ||
|
||||
strings.Contains(value, "cpu power")
|
||||
}
|
||||
|
||||
func cpuThrottleBySocket() map[int]bool {
|
||||
out := map[int]bool{}
|
||||
cpuDirs, err := filepath.Glob(filepath.Join(cpuSysBaseDir, "cpu[0-9]*"))
|
||||
if err != nil {
|
||||
return out
|
||||
}
|
||||
sort.Strings(cpuDirs)
|
||||
for _, cpuDir := range cpuDirs {
|
||||
socket, ok := readSocketIndex(cpuDir)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if cpuPackageThrottled(cpuDir) {
|
||||
out[socket] = true
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func readSocketIndex(cpuDir string) (int, bool) {
|
||||
raw, err := os.ReadFile(filepath.Join(cpuDir, "topology", "physical_package_id"))
|
||||
if err != nil {
|
||||
return 0, false
|
||||
}
|
||||
value, err := strconv.Atoi(strings.TrimSpace(string(raw)))
|
||||
if err != nil || value < 0 {
|
||||
return 0, false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func cpuPackageThrottled(cpuDir string) bool {
|
||||
paths := []string{
|
||||
filepath.Join(cpuDir, "thermal_throttle", "package_throttle_count"),
|
||||
filepath.Join(cpuDir, "thermal_throttle", "core_throttle_count"),
|
||||
}
|
||||
for _, path := range paths {
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
|
||||
if err == nil && value > 0 {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
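The socket lookup in cpu_telemetry.go depends on matching a keyword followed by digits in either the chip name or the feature label. A minimal standalone sketch of that matching follows. It reuses the pattern shown in the diff; `detectSocket` and the sample labels are illustrative stand-ins, not the repository's code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// socketIndexRe copies the pattern from the diff: a keyword followed by digits.
var socketIndexRe = regexp.MustCompile(`(?i)(?:package id|socket|cpu)\s*([0-9]+)`)

// detectSocket (illustrative name) returns the first socket index found in
// any of the given strings, checked in argument order.
func detectSocket(parts ...string) (int, bool) {
	for _, part := range parts {
		m := socketIndexRe.FindStringSubmatch(part)
		if len(m) != 2 {
			continue
		}
		if n, err := strconv.Atoi(m[1]); err == nil {
			return n, true
		}
	}
	return 0, false
}

func main() {
	// Sample labels are assumptions chosen to exercise both outcomes.
	for _, label := range []string{"Package id 1", "CPU0 Temp", "Tctl"} {
		n, ok := detectSocket("coretemp-isa-0000", label)
		fmt.Println(label, n, ok)
	}
}
```

Labels with no keyword and index, such as "Tctl", fall through to the single-CPU fallback path in the diff, which is only used when exactly one CPU was collected.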
|
||||
71
audit/internal/collector/cpu_telemetry_test.go
Normal file
@@ -0,0 +1,71 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
func TestEnrichCPUsWithTelemetry(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
oldBase := cpuSysBaseDir
|
||||
cpuSysBaseDir = tmp
|
||||
t.Cleanup(func() { cpuSysBaseDir = oldBase })
|
||||
|
||||
mustWriteFile(t, filepath.Join(tmp, "cpu0", "topology", "physical_package_id"), "0\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "cpu0", "thermal_throttle", "package_throttle_count"), "3\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "cpu1", "topology", "physical_package_id"), "1\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "cpu1", "thermal_throttle", "package_throttle_count"), "0\n")
|
||||
|
||||
doc := sensorsDoc{
|
||||
"coretemp-isa-0000": {
|
||||
"Package id 0": map[string]any{"temp1_input": 61.5},
|
||||
"Package id 1": map[string]any{"temp2_input": 58.0},
|
||||
},
|
||||
"intel-rapl-mmio-0": {
|
||||
"Package id 0": map[string]any{"power1_average": 180.0},
|
||||
"Package id 1": map[string]any{"power2_average": 175.0},
|
||||
},
|
||||
}
|
||||
|
||||
socket0 := 0
|
||||
socket1 := 1
|
||||
status := statusOK
|
||||
cpus := []schema.HardwareCPU{
|
||||
{Socket: &socket0, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{Socket: &socket1, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
}
|
||||
|
||||
got := enrichCPUsWithTelemetry(cpus, doc)
|
||||
|
||||
if got[0].TemperatureC == nil || *got[0].TemperatureC != 61.5 {
|
||||
t.Fatalf("cpu0 temperature mismatch: %#v", got[0].TemperatureC)
|
||||
}
|
||||
if got[0].PowerW == nil || *got[0].PowerW != 180.0 {
|
||||
t.Fatalf("cpu0 power mismatch: %#v", got[0].PowerW)
|
||||
}
|
||||
if got[0].Throttled == nil || !*got[0].Throttled {
|
||||
t.Fatalf("cpu0 throttled mismatch: %#v", got[0].Throttled)
|
||||
}
|
||||
if got[1].TemperatureC == nil || *got[1].TemperatureC != 58.0 {
|
||||
t.Fatalf("cpu1 temperature mismatch: %#v", got[1].TemperatureC)
|
||||
}
|
||||
if got[1].PowerW == nil || *got[1].PowerW != 175.0 {
|
||||
t.Fatalf("cpu1 power mismatch: %#v", got[1].PowerW)
|
||||
}
|
||||
if got[1].Throttled != nil && *got[1].Throttled {
|
||||
t.Fatalf("cpu1 throttled mismatch: %#v", got[1].Throttled)
|
||||
}
|
||||
}
|
||||
|
||||
func mustWriteFile(t *testing.T, path, content string) {
|
||||
t.Helper()
|
||||
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
|
||||
t.Fatalf("mkdir %s: %v", filepath.Dir(path), err)
|
||||
}
|
||||
if err := os.WriteFile(path, []byte(content), 0644); err != nil {
|
||||
t.Fatalf("write %s: %v", path, err)
|
||||
}
|
||||
}
|
||||
@@ -1,12 +1,14 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestParseCPUs_dual_socket(t *testing.T) {
|
||||
out := mustReadFile(t, "testdata/dmidecode_type4.txt")
|
||||
cpus := parseCPUs(out, "CAR315KA0803B90")
|
||||
cpus := parseCPUs(out)
|
||||
|
||||
if len(cpus) != 2 {
|
||||
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
|
||||
@@ -37,23 +39,22 @@ func TestParseCPUs_dual_socket(t *testing.T) {
|
||||
if cpu0.Status == nil || *cpu0.Status != "OK" {
|
||||
t.Errorf("cpu0 status: got %v, want OK", cpu0.Status)
|
||||
}
|
||||
// Intel Xeon serial not available → fallback
|
||||
if cpu0.SerialNumber == nil || *cpu0.SerialNumber != "CAR315KA0803B90-CPU-0" {
|
||||
t.Errorf("cpu0 serial fallback: got %v, want CAR315KA0803B90-CPU-0", cpu0.SerialNumber)
|
||||
if cpu0.SerialNumber != nil {
|
||||
t.Errorf("cpu0 serial should stay nil without source data, got %v", cpu0.SerialNumber)
|
||||
}
|
||||
|
||||
cpu1 := cpus[1]
|
||||
if cpu1.Socket == nil || *cpu1.Socket != 1 {
|
||||
t.Errorf("cpu1 socket: got %v, want 1", cpu1.Socket)
|
||||
}
|
||||
if cpu1.SerialNumber == nil || *cpu1.SerialNumber != "CAR315KA0803B90-CPU-1" {
|
||||
t.Errorf("cpu1 serial fallback: got %v, want CAR315KA0803B90-CPU-1", cpu1.SerialNumber)
|
||||
if cpu1.SerialNumber != nil {
|
||||
t.Errorf("cpu1 serial should stay nil without source data, got %v", cpu1.SerialNumber)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseCPUs_unpopulated_skipped(t *testing.T) {
|
||||
out := mustReadFile(t, "testdata/dmidecode_type4_disabled.txt")
|
||||
cpus := parseCPUs(out, "BOARD-001")
|
||||
cpus := parseCPUs(out)
|
||||
|
||||
if len(cpus) != 1 {
|
||||
t.Fatalf("expected 1 CPU (unpopulated skipped), got %d", len(cpus))
|
||||
@@ -63,18 +64,51 @@ func TestParseCPUs_unpopulated_skipped(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestCollectCPUsSetsFirmwareFromMicrocode(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
origBase := cpuSysBaseDir
|
||||
cpuSysBaseDir = tmp
|
||||
t.Cleanup(func() { cpuSysBaseDir = origBase })
|
||||
|
||||
if err := os.MkdirAll(filepath.Join(tmp, "cpu0", "microcode"), 0755); err != nil {
|
||||
t.Fatalf("mkdir microcode dir: %v", err)
|
||||
}
|
||||
if err := os.WriteFile(filepath.Join(tmp, "cpu0", "microcode", "version"), []byte("0x2b000643\n"), 0644); err != nil {
|
||||
t.Fatalf("write microcode version: %v", err)
|
||||
}
|
||||
|
||||
origRun := execDmidecode
|
||||
execDmidecode = func(typeNum string) (string, error) {
|
||||
if typeNum != "4" {
|
||||
t.Fatalf("unexpected dmidecode type: %s", typeNum)
|
||||
}
|
||||
return mustReadFile(t, "testdata/dmidecode_type4.txt"), nil
|
||||
}
|
||||
t.Cleanup(func() { execDmidecode = origRun })
|
||||
|
||||
cpus := collectCPUs()
|
||||
if len(cpus) != 2 {
|
||||
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
|
||||
}
|
||||
for i, cpu := range cpus {
|
||||
if cpu.Firmware == nil || *cpu.Firmware != "0x2b000643" {
|
||||
t.Fatalf("cpu[%d] firmware=%v want microcode", i, cpu.Firmware)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseCPUStatus(t *testing.T) {
|
||||
tests := []struct {
|
||||
input string
|
||||
want string
|
||||
}{
|
||||
{"Populated, Enabled", "OK"},
|
||||
{"Populated, Disabled By User", "WARNING"},
|
||||
{"Populated, Disabled By BIOS", "WARNING"},
|
||||
{"Unpopulated", "EMPTY"},
|
||||
{"Not Populated", "EMPTY"},
|
||||
{"Unknown", "UNKNOWN"},
|
||||
{"", "UNKNOWN"},
|
||||
{"Populated, Disabled By User", statusWarning},
|
||||
{"Populated, Disabled By BIOS", statusWarning},
|
||||
{"Unpopulated", statusEmpty},
|
||||
{"Not Populated", statusEmpty},
|
||||
{"Unknown", statusUnknown},
|
||||
{"", statusUnknown},
|
||||
}
|
||||
for _, tt := range tests {
|
||||
got := parseCPUStatus(tt.input)
|
||||
|
||||
88
audit/internal/collector/finalize.go
Normal file
@@ -0,0 +1,88 @@
|
||||
package collector
|
||||
|
||||
import "bee/audit/internal/schema"
|
||||
|
||||
func finalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
|
||||
snap.Memory = filterMemory(snap.Memory)
|
||||
snap.Storage = filterStorage(snap.Storage)
|
||||
snap.PowerSupplies = filterPSUs(snap.PowerSupplies)
|
||||
|
||||
setComponentStatusMetadata(snap, collectedAt)
|
||||
}
|
||||
|
||||
func filterMemory(dimms []schema.HardwareMemory) []schema.HardwareMemory {
|
||||
out := make([]schema.HardwareMemory, 0, len(dimms))
|
||||
for _, dimm := range dimms {
|
||||
if dimm.Present != nil && !*dimm.Present {
|
||||
continue
|
||||
}
|
||||
if dimm.Status != nil && *dimm.Status == statusEmpty {
|
||||
continue
|
||||
}
|
||||
if dimm.SerialNumber == nil || *dimm.SerialNumber == "" {
|
||||
continue
|
||||
}
|
||||
out = append(out, dimm)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func filterStorage(disks []schema.HardwareStorage) []schema.HardwareStorage {
|
||||
out := make([]schema.HardwareStorage, 0, len(disks))
|
||||
for _, disk := range disks {
|
||||
if disk.SerialNumber == nil || *disk.SerialNumber == "" {
|
||||
continue
|
||||
}
|
||||
out = append(out, disk)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func filterPSUs(psus []schema.HardwarePowerSupply) []schema.HardwarePowerSupply {
|
||||
out := make([]schema.HardwarePowerSupply, 0, len(psus))
|
||||
for _, psu := range psus {
|
||||
hasIdentity := false
|
||||
switch {
|
||||
case psu.SerialNumber != nil && *psu.SerialNumber != "":
|
||||
hasIdentity = true
|
||||
case psu.Slot != nil && *psu.Slot != "":
|
||||
hasIdentity = true
|
||||
case psu.Model != nil && *psu.Model != "":
|
||||
hasIdentity = true
|
||||
case psu.Vendor != nil && *psu.Vendor != "":
|
||||
hasIdentity = true
|
||||
}
|
||||
if !hasIdentity {
|
||||
continue
|
||||
}
|
||||
out = append(out, psu)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func setComponentStatusMetadata(snap *schema.HardwareSnapshot, collectedAt string) {
|
||||
for i := range snap.CPUs {
|
||||
setStatusCheckedAt(&snap.CPUs[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
for i := range snap.Memory {
|
||||
setStatusCheckedAt(&snap.Memory[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
for i := range snap.Storage {
|
||||
setStatusCheckedAt(&snap.Storage[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
for i := range snap.PCIeDevices {
|
||||
setStatusCheckedAt(&snap.PCIeDevices[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
for i := range snap.PowerSupplies {
|
||||
setStatusCheckedAt(&snap.PowerSupplies[i].HardwareComponentStatus, collectedAt)
|
||||
}
|
||||
}
|
||||
|
||||
func setStatusCheckedAt(status *schema.HardwareComponentStatus, collectedAt string) {
|
||||
if status == nil || status.Status == nil || *status.Status == "" {
|
||||
return
|
||||
}
|
||||
if status.StatusCheckedAt == nil {
|
||||
status.StatusCheckedAt = &collectedAt
|
||||
}
|
||||
}
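The PSU filter in finalize.go keeps an entry when any one identifying field is non-empty, whereas memory and storage entries must carry a serial number. The sketch below restates the PSU rule under simplified assumptions: the `psu` struct is a hypothetical stand-in for `schema.HardwarePowerSupply`, and `keepIdentifiedPSUs` is an illustrative name.

```go
package main

import "fmt"

// psu is a hypothetical, reduced stand-in for schema.HardwarePowerSupply.
type psu struct {
	SerialNumber, Slot, Model, Vendor *string
}

// nonEmpty reports whether the pointer is set and points at a non-empty string.
func nonEmpty(s *string) bool { return s != nil && *s != "" }

// hasIdentity mirrors the switch in the diff: any one field is enough.
func hasIdentity(p psu) bool {
	return nonEmpty(p.SerialNumber) || nonEmpty(p.Slot) ||
		nonEmpty(p.Model) || nonEmpty(p.Vendor)
}

// keepIdentifiedPSUs (illustrative name) drops entries with no identity at all.
func keepIdentifiedPSUs(in []psu) []psu {
	out := make([]psu, 0, len(in))
	for _, p := range in {
		if hasIdentity(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	slot, empty := "0", ""
	kept := keepIdentifiedPSUs([]psu{{Slot: &slot}, {}, {Model: &empty}})
	fmt.Println(len(kept))
}
```

A pointer to an empty string counts as missing, so an entry whose only populated field is `""` is dropped along with the fully empty one.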
|
||||
80
audit/internal/collector/finalize_test.go
Normal file
@@ -0,0 +1,80 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
|
||||
collectedAt := "2026-03-15T12:00:00Z"
|
||||
present := true
|
||||
status := statusOK
|
||||
serial := "SN-1"
|
||||
|
||||
snap := schema.HardwareSnapshot{
|
||||
Memory: []schema.HardwareMemory{
|
||||
{Present: &present, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{Present: &present, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
},
|
||||
Storage: []schema.HardwareStorage{
|
||||
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
},
|
||||
PowerSupplies: []schema.HardwarePowerSupply{
|
||||
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
},
|
||||
}
|
||||
|
||||
finalizeSnapshot(&snap, collectedAt)
|
||||
|
||||
if len(snap.Memory) != 1 || snap.Memory[0].StatusCheckedAt == nil || *snap.Memory[0].StatusCheckedAt != collectedAt {
|
||||
t.Fatalf("memory finalize mismatch: %+v", snap.Memory)
|
||||
}
|
||||
if len(snap.Storage) != 1 || snap.Storage[0].StatusCheckedAt == nil || *snap.Storage[0].StatusCheckedAt != collectedAt {
|
||||
t.Fatalf("storage finalize mismatch: %+v", snap.Storage)
|
||||
}
|
||||
if len(snap.PowerSupplies) != 1 || snap.PowerSupplies[0].StatusCheckedAt == nil || *snap.PowerSupplies[0].StatusCheckedAt != collectedAt {
|
||||
t.Fatalf("psu finalize mismatch: %+v", snap.PowerSupplies)
|
||||
}
|
||||
}
|
||||
|
||||
func TestFinalizeSnapshotPreservesDuplicateSerials(t *testing.T) {
|
||||
collectedAt := "2026-03-15T12:00:00Z"
|
||||
status := statusOK
|
||||
model := "Device"
|
||||
serial := "DUPLICATE"
|
||||
|
||||
snap := schema.HardwareSnapshot{
|
||||
Storage: []schema.HardwareStorage{
|
||||
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
},
|
||||
}
|
||||
|
||||
finalizeSnapshot(&snap, collectedAt)
|
||||
|
||||
if got := *snap.Storage[0].SerialNumber; got != serial {
|
||||
t.Fatalf("first serial changed: %q", got)
|
||||
}
|
||||
if got := *snap.Storage[1].SerialNumber; got != serial {
|
||||
t.Fatalf("duplicate serial should stay unchanged: %q", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestFilterPSUsKeepsSlotOnlyEntries(t *testing.T) {
|
||||
slot := "0"
|
||||
status := statusOK
|
||||
|
||||
got := filterPSUs([]schema.HardwarePowerSupply{
|
||||
{Slot: &slot, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
})
|
||||
|
||||
if len(got) != 1 {
|
||||
t.Fatalf("len(got)=%d want 1", len(got))
|
||||
}
|
||||
if got[0].Slot == nil || *got[0].Slot != "0" {
|
||||
t.Fatalf("unexpected kept PSU: %+v", got[0])
|
||||
}
|
||||
}
|
||||
@@ -47,12 +47,12 @@ func parseMemorySection(fields map[string]string) schema.HardwareMemory {
|
||||
dimm.Present = &present
|
||||
|
||||
if !present {
|
||||
status := "EMPTY"
|
||||
status := statusEmpty
|
||||
dimm.Status = &status
|
||||
return dimm
|
||||
}
|
||||
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
dimm.Status = &status
|
||||
|
||||
if mb := parseMemorySizeMB(rawSize); mb > 0 {
|
||||
|
||||
203
audit/internal/collector/memory_telemetry.go
Normal file
@@ -0,0 +1,203 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var edacBaseDir = "/sys/devices/system/edac/mc"
|
||||
|
||||
type edacDIMMStats struct {
|
||||
Label string
|
||||
CECount *int64
|
||||
UECount *int64
|
||||
}
|
||||
|
||||
func enrichMemoryWithTelemetry(dimms []schema.HardwareMemory, doc sensorsDoc) []schema.HardwareMemory {
|
||||
if len(dimms) == 0 {
|
||||
return dimms
|
||||
}
|
||||
|
||||
tempByLabel := memoryTempsFromSensors(doc)
|
||||
stats := readEDACStats()
|
||||
|
||||
for i := range dimms {
|
||||
labelKeys := dimmMatchKeys(dimms[i].Slot, dimms[i].Location)
|
||||
|
||||
for _, key := range labelKeys {
|
||||
if temp, ok := tempByLabel[key]; ok {
|
||||
dimms[i].TemperatureC = &temp
|
||||
break
|
||||
}
|
||||
}
|
||||
|
||||
for _, key := range labelKeys {
|
||||
if stat, ok := stats[key]; ok {
|
||||
if stat.CECount != nil {
|
||||
dimms[i].CorrectableECCErrorCount = stat.CECount
|
||||
}
|
||||
if stat.UECount != nil {
|
||||
dimms[i].UncorrectableECCErrorCount = stat.UECount
|
||||
}
|
||||
if stat.UECount != nil && *stat.UECount > 0 {
|
||||
dimms[i].DataLossDetected = boolPtr(true)
|
||||
status := statusCritical
|
||||
dimms[i].Status = &status
|
||||
if dimms[i].ErrorDescription == nil {
|
||||
dimms[i].ErrorDescription = stringPtr("EDAC reports uncorrectable ECC errors")
|
||||
}
|
||||
} else if stat.CECount != nil && *stat.CECount > 0 && (dimms[i].Status == nil || *dimms[i].Status == statusOK) {
|
||||
status := statusWarning
|
||||
dimms[i].Status = &status
|
||||
if dimms[i].ErrorDescription == nil {
|
||||
dimms[i].ErrorDescription = stringPtr("EDAC reports correctable ECC errors")
|
||||
}
|
||||
}
|
||||
break
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return dimms
|
||||
}
|
||||
|
||||
func memoryTempsFromSensors(doc sensorsDoc) map[string]float64 {
|
||||
out := map[string]float64{}
|
||||
if len(doc) == 0 {
|
||||
return out
|
||||
}
|
||||
for chip, features := range doc {
|
||||
for featureName, raw := range features {
|
||||
feature, ok := raw.(map[string]any)
|
||||
if !ok || classifySensorFeature(feature) != "temp" {
|
||||
continue
|
||||
}
|
||||
if !isLikelyMemoryTemp(chip, featureName) {
|
||||
continue
|
||||
}
|
||||
temp, ok := firstFeatureFloat(feature, "_input")
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
key := canonicalLabel(featureName)
|
||||
if key == "" {
|
||||
continue
|
||||
}
|
||||
if _, exists := out[key]; !exists {
|
||||
out[key] = temp
|
||||
}
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func readEDACStats() map[string]edacDIMMStats {
|
||||
out := map[string]edacDIMMStats{}
|
||||
mcDirs, err := filepath.Glob(filepath.Join(edacBaseDir, "mc*"))
|
||||
if err != nil {
|
||||
return out
|
||||
}
|
||||
sort.Strings(mcDirs)
|
||||
for _, mcDir := range mcDirs {
|
||||
dimmDirs, err := filepath.Glob(filepath.Join(mcDir, "dimm*"))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
sort.Strings(dimmDirs)
|
||||
for _, dimmDir := range dimmDirs {
|
||||
stat, ok := readEDACDIMMStats(dimmDir)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
key := canonicalLabel(stat.Label)
|
||||
if key == "" {
|
||||
continue
|
||||
}
|
||||
out[key] = stat
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func readEDACDIMMStats(dimmDir string) (edacDIMMStats, bool) {
|
||||
labelBytes, err := os.ReadFile(filepath.Join(dimmDir, "dimm_label"))
|
||||
if err != nil {
|
||||
labelBytes, err = os.ReadFile(filepath.Join(dimmDir, "label"))
|
||||
if err != nil {
|
||||
return edacDIMMStats{}, false
|
||||
}
|
||||
}
|
||||
label := strings.TrimSpace(string(labelBytes))
|
||||
if label == "" {
|
||||
return edacDIMMStats{}, false
|
||||
}
|
||||
|
||||
stat := edacDIMMStats{Label: label}
|
||||
if value, ok := readEDACCount(dimmDir, []string{"dimm_ce_count", "ce_count"}); ok {
|
||||
stat.CECount = &value
|
||||
}
|
||||
if value, ok := readEDACCount(dimmDir, []string{"dimm_ue_count", "ue_count"}); ok {
|
||||
stat.UECount = &value
|
||||
}
|
||||
return stat, true
|
||||
}
|
||||
|
||||
func readEDACCount(dir string, names []string) (int64, bool) {
|
||||
for _, name := range names {
|
||||
raw, err := os.ReadFile(filepath.Join(dir, name))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
|
||||
if err == nil && value >= 0 {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func dimmMatchKeys(slot, location *string) []string {
|
||||
var out []string
|
||||
add := func(value *string) {
|
||||
key := canonicalLabel(derefString(value))
|
||||
if key == "" {
|
||||
return
|
||||
}
|
||||
for _, existing := range out {
|
||||
if existing == key {
|
||||
return
|
||||
}
|
||||
}
|
||||
out = append(out, key)
|
||||
}
|
||||
add(slot)
|
||||
add(location)
|
||||
return out
|
||||
}
|
||||
|
||||
func canonicalLabel(value string) string {
|
||||
value = strings.ToUpper(strings.TrimSpace(value))
|
||||
if value == "" {
|
||||
return ""
|
||||
}
|
||||
var b strings.Builder
|
||||
for _, r := range value {
|
||||
if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
|
||||
b.WriteRune(r)
|
||||
}
|
||||
}
|
||||
return b.String()
|
||||
}
|
||||
|
||||
func isLikelyMemoryTemp(chip, feature string) bool {
|
||||
value := strings.ToLower(chip + " " + feature)
|
||||
return strings.Contains(value, "dimm") || strings.Contains(value, "sodimm")
|
||||
}
|
||||
|
||||
func boolPtr(value bool) *bool {
|
||||
return &value
|
||||
}
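DIMM matching in memory_telemetry.go works because labels from DMI, EDAC, and lm-sensors are reduced to one canonical key before comparison. The standalone sketch below follows the same rule as `canonicalLabel` in the diff (trim, uppercase, keep only A-Z and 0-9). The sample inputs are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalLabel trims and uppercases the input, then keeps only ASCII
// letters and digits, so separators such as "_" and " " no longer matter.
func canonicalLabel(value string) string {
	var b strings.Builder
	for _, r := range strings.ToUpper(strings.TrimSpace(value)) {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Different spellings of one slot collapse to the same key.
	for _, label := range []string{"CPU0_DIMM_A1", "cpu0 dimm a1\n", "CPU0-DIMM-A1"} {
		fmt.Printf("%q -> %q\n", label, canonicalLabel(label))
	}
}
```

One consequence of this design is that labels differing only in punctuation are treated as the same DIMM, which is what lets the EDAC label `CPU0_DIMM_A1` match the sensor feature `CPU0 DIMM A1` in the accompanying test.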
|
||||
61
audit/internal/collector/memory_telemetry_test.go
Normal file
@@ -0,0 +1,61 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"path/filepath"
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
func TestEnrichMemoryWithTelemetry(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
oldBase := edacBaseDir
|
||||
edacBaseDir = tmp
|
||||
t.Cleanup(func() { edacBaseDir = oldBase })
|
||||
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_label"), "CPU0_DIMM_A1\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ce_count"), "7\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ue_count"), "0\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_label"), "CPU1_DIMM_B2\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ce_count"), "0\n")
|
||||
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ue_count"), "2\n")
|
||||
|
||||
doc := sensorsDoc{
|
||||
"jc42-i2c-0-18": {
|
||||
"CPU0 DIMM A1": map[string]any{"temp1_input": 43.0},
|
||||
"CPU1 DIMM B2": map[string]any{"temp2_input": 46.0},
|
||||
},
|
||||
}
|
||||
|
||||
status := statusOK
|
||||
slotA := "CPU0_DIMM_A1"
|
||||
slotB := "CPU1_DIMM_B2"
|
||||
dimms := []schema.HardwareMemory{
|
||||
{Slot: &slotA, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
{Slot: &slotB, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
|
||||
}
|
||||
|
||||
got := enrichMemoryWithTelemetry(dimms, doc)
|
||||
|
||||
if got[0].TemperatureC == nil || *got[0].TemperatureC != 43.0 {
|
||||
t.Fatalf("dimm0 temperature mismatch: %#v", got[0].TemperatureC)
|
||||
}
|
||||
if got[0].CorrectableECCErrorCount == nil || *got[0].CorrectableECCErrorCount != 7 {
|
||||
t.Fatalf("dimm0 ce mismatch: %#v", got[0].CorrectableECCErrorCount)
|
||||
}
|
||||
if got[0].Status == nil || *got[0].Status != statusWarning {
|
||||
t.Fatalf("dimm0 status mismatch: %#v", got[0].Status)
|
||||
}
|
||||
if got[1].TemperatureC == nil || *got[1].TemperatureC != 46.0 {
|
||||
t.Fatalf("dimm1 temperature mismatch: %#v", got[1].TemperatureC)
|
||||
}
|
||||
if got[1].UncorrectableECCErrorCount == nil || *got[1].UncorrectableECCErrorCount != 2 {
|
||||
t.Fatalf("dimm1 ue mismatch: %#v", got[1].UncorrectableECCErrorCount)
|
||||
}
|
||||
if got[1].Status == nil || *got[1].Status != statusCritical {
|
||||
t.Fatalf("dimm1 status mismatch: %#v", got[1].Status)
|
||||
}
|
||||
if got[1].DataLossDetected == nil || !*got[1].DataLossDetected {
|
||||
t.Fatalf("dimm1 data_loss_detected mismatch: %#v", got[1].DataLossDetected)
|
||||
}
|
||||
}
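The status expectations in the test above follow the escalation rule in `enrichMemoryWithTelemetry`: any uncorrectable error forces a critical status, while correctable errors only raise an unset or OK status to a warning. The sketch below isolates that rule. The `escalate` helper is hypothetical, an unset status is modelled as an empty string, and the value of `statusCritical` is an assumption because its definition is not shown in this diff.

```go
package main

import "fmt"

const (
	statusOK       = "OK"
	statusWarning  = "WARNING"
	statusCritical = "CRITICAL" // assumed value; the constant is defined outside this diff
)

// escalate (hypothetical helper) returns the DIMM status after applying
// EDAC counters. A nil counter means the value could not be read.
func escalate(current string, ce, ue *int64) string {
	switch {
	case ue != nil && *ue > 0:
		// Uncorrectable errors always win, whatever the current status is.
		return statusCritical
	case ce != nil && *ce > 0 && (current == "" || current == statusOK):
		// Correctable errors never downgrade a status that is already worse.
		return statusWarning
	default:
		return current
	}
}

func main() {
	seven, zero, two := int64(7), int64(0), int64(2)
	fmt.Println(escalate(statusOK, &seven, &zero))
	fmt.Println(escalate(statusOK, &zero, &two))
	fmt.Println(escalate(statusCritical, &seven, nil))
}
```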
|
||||
@@ -18,17 +18,13 @@ var (
|
||||
}
|
||||
return string(out), nil
|
||||
}
|
||||
readNetStatFile = func(iface, key string) (int64, error) {
|
||||
path := filepath.Join("/sys/class/net", iface, "statistics", key)
|
||||
readNetAddressFile = func(iface string) (string, error) {
|
||||
path := filepath.Join("/sys/class/net", iface, "address")
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return 0, err
|
||||
return "", err
|
||||
}
|
||||
v, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
|
||||
if err != nil {
|
||||
return 0, err
|
||||
}
|
||||
return v, nil
|
||||
return strings.TrimSpace(string(raw)), nil
|
||||
}
|
||||
)
|
||||
|
||||
@@ -47,6 +43,12 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
|
||||
continue
|
||||
}
|
||||
iface := ifaces[0]
|
||||
devs[i].MacAddresses = collectInterfaceMACs(ifaces)
|
||||
if devs[i].SerialNumber == nil {
|
||||
if serial := queryPCIDeviceSerial(bdf); serial != "" {
|
||||
devs[i].SerialNumber = &serial
|
||||
}
|
||||
}
|
||||
|
||||
if devs[i].Firmware == nil {
|
||||
if out, err := ethtoolInfoQuery(iface); err == nil {
|
||||
@@ -56,16 +58,13 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
|
||||
}
|
||||
}
|
||||
|
||||
if devs[i].Telemetry == nil {
|
||||
devs[i].Telemetry = map[string]any{}
|
||||
}
|
||||
injectNICPacketStats(devs[i].Telemetry, iface)
|
||||
if out, err := ethtoolModuleQuery(iface); err == nil {
|
||||
injectSFPDOMTelemetry(devs[i].Telemetry, out)
|
||||
if injectSFPDOMTelemetry(&devs[i], out) {
|
||||
enriched++
|
||||
continue
|
||||
}
|
||||
}
|
||||
if len(devs[i].Telemetry) == 0 {
|
||||
devs[i].Telemetry = nil
|
||||
} else {
|
||||
if len(devs[i].MacAddresses) > 0 || devs[i].Firmware != nil {
|
||||
enriched++
|
||||
}
|
||||
}
|
||||
@@ -77,31 +76,32 @@ func isNICDevice(dev schema.HardwarePCIeDevice) bool {
|
||||
if dev.DeviceClass == nil {
|
||||
return false
|
||||
}
|
||||
c := strings.ToLower(strings.TrimSpace(*dev.DeviceClass))
|
||||
return strings.Contains(c, "ethernet controller") ||
|
||||
strings.Contains(c, "network controller") ||
|
||||
strings.Contains(c, "infiniband controller")
|
||||
c := strings.TrimSpace(*dev.DeviceClass)
|
||||
return isNICClass(c) || strings.EqualFold(c, "FibreChannelController")
|
||||
}
|
||||
|
||||
func injectNICPacketStats(dst map[string]any, iface string) {
|
||||
for _, key := range []string{"rx_packets", "tx_packets", "rx_errors", "tx_errors"} {
|
||||
if v, err := readNetStatFile(iface, key); err == nil {
|
||||
dst[key] = v
|
||||
func collectInterfaceMACs(ifaces []string) []string {
|
||||
seen := map[string]struct{}{}
|
||||
var out []string
|
||||
for _, iface := range ifaces {
|
||||
mac, err := readNetAddressFile(iface)
|
||||
if err != nil || mac == "" {
|
||||
continue
|
||||
}
|
||||
mac = strings.ToLower(strings.TrimSpace(mac))
|
||||
if _, ok := seen[mac]; ok {
|
||||
continue
|
||||
}
|
||||
seen[mac] = struct{}{}
|
||||
out = append(out, mac)
|
||||
}
|
||||
}
|
||||
|
||||
func injectSFPDOMTelemetry(dst map[string]any, raw string) {
|
||||
parsed := parseSFPDOM(raw)
|
||||
for k, v := range parsed {
|
||||
dst[k] = v
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
var floatRe = regexp.MustCompile(`[-+]?[0-9]*\.?[0-9]+`)
|
||||
|
||||
func parseSFPDOM(raw string) map[string]any {
|
||||
out := map[string]any{}
|
||||
func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
|
||||
var changed bool
|
||||
for _, line := range strings.Split(raw, "\n") {
|
||||
trimmed := strings.TrimSpace(line)
|
||||
if trimmed == "" {
|
||||
@@ -117,26 +117,55 @@ func parseSFPDOM(raw string) map[string]any {
|
||||
switch {
|
||||
case strings.Contains(key, "module temperature"):
|
||||
if f, ok := firstFloat(val); ok {
|
||||
out["sfp_temperature_c"] = f
|
||||
dev.SFPTemperatureC = &f
|
||||
changed = true
|
||||
}
|
||||
case strings.Contains(key, "laser output power"):
|
||||
if f, ok := dbmValue(val); ok {
|
||||
out["sfp_tx_power_dbm"] = f
|
||||
dev.SFPTXPowerDBM = &f
|
||||
changed = true
|
||||
}
|
||||
case strings.Contains(key, "receiver signal"):
|
||||
if f, ok := dbmValue(val); ok {
|
||||
out["sfp_rx_power_dbm"] = f
|
||||
dev.SFPRXPowerDBM = &f
|
||||
changed = true
|
||||
}
|
||||
case strings.Contains(key, "module voltage"):
|
||||
if f, ok := firstFloat(val); ok {
|
||||
out["sfp_voltage_v"] = f
|
||||
dev.SFPVoltageV = &f
|
||||
changed = true
|
||||
}
|
||||
case strings.Contains(key, "laser bias current"):
|
||||
if f, ok := firstFloat(val); ok {
|
||||
out["sfp_bias_ma"] = f
|
||||
dev.SFPBiasMA = &f
|
||||
changed = true
|
||||
}
|
||||
}
|
||||
}
|
||||
return changed
|
||||
}
|
||||
|
||||
func parseSFPDOM(raw string) map[string]any {
|
||||
dev := schema.HardwarePCIeDevice{}
|
||||
if !injectSFPDOMTelemetry(&dev, raw) {
|
||||
return map[string]any{}
|
||||
}
|
||||
out := map[string]any{}
|
||||
if dev.SFPTemperatureC != nil {
|
||||
out["sfp_temperature_c"] = *dev.SFPTemperatureC
|
||||
}
|
||||
if dev.SFPTXPowerDBM != nil {
|
||||
out["sfp_tx_power_dbm"] = *dev.SFPTXPowerDBM
|
||||
}
|
||||
if dev.SFPRXPowerDBM != nil {
|
||||
out["sfp_rx_power_dbm"] = *dev.SFPRXPowerDBM
|
||||
}
|
||||
if dev.SFPVoltageV != nil {
|
||||
out["sfp_voltage_v"] = *dev.SFPVoltageV
|
||||
}
|
||||
if dev.SFPBiasMA != nil {
|
||||
out["sfp_bias_ma"] = *dev.SFPBiasMA
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
|
||||
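Review note: `collectInterfaceMACs` lowercases and deduplicates whatever `/sys/class/net/<iface>/address` returns, but does not check that the value is a MAC address. A possible variant, sketched below under the assumption that rejecting malformed values is desirable, uses the standard library's `net.ParseMAC` for validation and normalization. The name `uniqueMACs` is hypothetical and not part of the change.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// uniqueMACs validates each address with net.ParseMAC, normalizes it to the
// lowercase colon-separated form produced by HardwareAddr.String, and drops
// duplicates while preserving first-seen order. Invalid or empty entries are
// skipped silently.
func uniqueMACs(raw []string) []string {
	seen := map[string]struct{}{}
	var out []string
	for _, s := range raw {
		hw, err := net.ParseMAC(strings.TrimSpace(s))
		if err != nil {
			continue
		}
		mac := hw.String()
		if _, ok := seen[mac]; ok {
			continue
		}
		seen[mac] = struct{}{}
		out = append(out, mac)
	}
	return out
}

func main() {
	in := []string{"AA:BB:CC:DD:EE:FF\n", "aa:bb:cc:dd:ee:ff", "", "not-a-mac", "02:00:00:00:00:01"}
	fmt.Println(uniqueMACs(in))
}
```

Note that `net.ParseMAC` also accepts hyphen- and dot-separated forms, so this variant is slightly more permissive about input syntax than a plain string comparison while being stricter about content.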
@@ -1,6 +1,10 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"fmt"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestParseSFPDOM(t *testing.T) {
|
||||
raw := `
|
||||
@@ -29,6 +33,74 @@ func TestParseSFPDOM(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseLSPCIDetailSerial(t *testing.T) {
|
||||
raw := `
|
||||
05:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
|
||||
Serial number: NIC-SN-12345
|
||||
`
|
||||
if got := parseLSPCIDetailSerial(raw); got != "NIC-SN-12345" {
|
||||
t.Fatalf("serial=%q want %q", got, "NIC-SN-12345")
|
||||
}
|
||||
}
|
||||
|
||||
func TestParsePCIVPDSerial(t *testing.T) {
|
||||
raw := []byte{0x82, 0x05, 0x00, 'M', 'L', 'X', '5', 0x90, 0x08, 0x00, 'S', 'N', 0x08, 'M', 'T', '1', '2', '3', '4', '5', '6'}
|
||||
if got := parsePCIVPDSerial(raw); got != "MT123456" {
|
||||
t.Fatalf("serial=%q want %q", got, "MT123456")
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnrichPCIeWithNICTelemetryAddsSerialFallback(t *testing.T) {
|
||||
origDetail := queryPCILSPCIDetail
|
||||
origVPD := readPCIVPDFile
|
||||
origIfaces := netIfacesByBDF
|
||||
origReadMAC := readNetAddressFile
|
||||
origEth := ethtoolInfoQuery
|
||||
origModule := ethtoolModuleQuery
|
||||
t.Cleanup(func() {
|
||||
queryPCILSPCIDetail = origDetail
|
||||
readPCIVPDFile = origVPD
|
||||
netIfacesByBDF = origIfaces
|
||||
readNetAddressFile = origReadMAC
|
||||
ethtoolInfoQuery = origEth
|
||||
ethtoolModuleQuery = origModule
|
||||
})
|
||||
|
||||
queryPCILSPCIDetail = func(bdf string) (string, error) {
|
||||
if bdf != "0000:18:00.0" {
|
||||
t.Fatalf("unexpected bdf: %s", bdf)
|
||||
}
|
||||
return "Serial number: NIC-SN-98765\n", nil
|
||||
}
|
||||
readPCIVPDFile = func(string) ([]byte, error) {
|
||||
return nil, fmt.Errorf("no vpd needed")
|
||||
}
|
||||
netIfacesByBDF = func(string) []string { return []string{"eth0"} }
|
||||
readNetAddressFile = func(iface string) (string, error) {
|
||||
if iface != "eth0" {
|
||||
t.Fatalf("unexpected iface: %s", iface)
|
||||
}
|
||||
return "aa:bb:cc:dd:ee:ff", nil
|
||||
}
|
||||
ethtoolInfoQuery = func(string) (string, error) { return "", fmt.Errorf("skip firmware") }
|
||||
ethtoolModuleQuery = func(string) (string, error) { return "", fmt.Errorf("skip optics") }
|
||||
|
||||
class := "EthernetController"
|
||||
bdf := "0000:18:00.0"
|
||||
devs := []schema.HardwarePCIeDevice{{
|
||||
DeviceClass: &class,
|
||||
BDF: &bdf,
|
||||
}}
|
||||
|
||||
out := enrichPCIeWithNICTelemetry(devs)
|
||||
if out[0].SerialNumber == nil || *out[0].SerialNumber != "NIC-SN-98765" {
|
||||
t.Fatalf("serial=%v want NIC-SN-98765", out[0].SerialNumber)
|
||||
}
|
||||
if len(out[0].MacAddresses) != 1 || out[0].MacAddresses[0] != "aa:bb:cc:dd:ee:ff" {
|
||||
t.Fatalf("mac_addresses=%v", out[0].MacAddresses)
|
||||
}
|
||||
}
|
||||
|
||||
func TestDBMValue(t *testing.T) {
|
||||
tests := []struct {
|
||||
in string
|
||||
|
||||
@@ -24,18 +24,29 @@ type nvidiaGPUInfo struct {
|
||||
}
|
||||
|
||||
// enrichPCIeWithNVIDIA enriches NVIDIA PCIe devices with data from nvidia-smi.
|
||||
// If the driver/tool is unavailable, NVIDIA devices get UNKNOWN status and
|
||||
// a stable serial fallback based on board serial + slot.
|
||||
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice, boardSerial string) []schema.HardwarePCIeDevice {
|
||||
// If the driver/tool is unavailable, NVIDIA devices get Unknown status.
|
||||
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
|
||||
if !hasNVIDIADevices(devs) {
|
||||
return devs
|
||||
}
|
||||
gpuByBDF, err := queryNVIDIAGPUs()
|
||||
if err != nil {
|
||||
slog.Info("nvidia: enrichment skipped", "err", err)
|
||||
return enrichPCIeWithNVIDIAData(devs, nil, boardSerial, false)
|
||||
return enrichPCIeWithNVIDIAData(devs, nil, false)
|
||||
}
|
||||
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, boardSerial, true)
|
||||
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, true)
|
||||
}
|
||||
|
||||
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, boardSerial string, driverLoaded bool) []schema.HardwarePCIeDevice {
|
||||
func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
|
||||
for _, dev := range devs {
|
||||
if isNVIDIADevice(dev) {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, driverLoaded bool) []schema.HardwarePCIeDevice {
|
||||
enriched := 0
|
||||
for i := range devs {
|
||||
if !isNVIDIADevice(devs[i]) {
|
||||
@@ -43,7 +54,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
|
||||
}
|
||||
|
||||
if !driverLoaded {
|
||||
setPCIeFallback(&devs[i], boardSerial)
|
||||
setPCIeFallback(&devs[i])
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -53,22 +64,21 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
|
||||
}
|
||||
info, ok := gpuByBDF[bdf]
|
||||
if !ok {
|
||||
setPCIeFallback(&devs[i], boardSerial)
|
||||
setPCIeFallback(&devs[i])
|
||||
continue
|
||||
}
|
||||
|
||||
if v := strings.TrimSpace(info.Serial); v != "" {
|
||||
devs[i].SerialNumber = &v
|
||||
} else {
|
||||
setPCIeFallbackSerial(&devs[i], boardSerial)
|
||||
}
|
||||
if v := strings.TrimSpace(info.VBIOS); v != "" {
|
||||
devs[i].Firmware = &v
|
||||
}
|
||||
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
if info.ECCUncorrected != nil && *info.ECCUncorrected > 0 {
|
||||
status = "WARNING"
|
||||
status = statusWarning
|
||||
devs[i].ErrorDescription = stringPtr("GPU reports uncorrected ECC errors")
|
||||
}
|
||||
devs[i].Status = &status
|
||||
injectNVIDIATelemetry(&devs[i], info)
|
||||
@@ -200,46 +210,25 @@ func isNVIDIADevice(dev schema.HardwarePCIeDevice) bool {
|
||||
return false
|
||||
}
|
||||
|
||||
func setPCIeFallback(dev *schema.HardwarePCIeDevice, boardSerial string) {
|
||||
setPCIeFallbackSerial(dev, boardSerial)
|
||||
status := "UNKNOWN"
|
||||
func setPCIeFallback(dev *schema.HardwarePCIeDevice) {
|
||||
status := statusUnknown
|
||||
dev.Status = &status
|
||||
}
|
||||
|
||||
func setPCIeFallbackSerial(dev *schema.HardwarePCIeDevice, boardSerial string) {
|
||||
if strings.TrimSpace(boardSerial) == "" || dev.SerialNumber != nil {
|
||||
return
|
||||
}
|
||||
slot := "unknown"
|
||||
if dev.BDF != nil && strings.TrimSpace(*dev.BDF) != "" {
|
||||
slot = strings.TrimSpace(*dev.BDF)
|
||||
} else if dev.Slot != nil && strings.TrimSpace(*dev.Slot) != "" {
|
||||
slot = strings.TrimSpace(*dev.Slot)
|
||||
}
|
||||
fb := fmt.Sprintf("%s-PCIE-%s", boardSerial, slot)
|
||||
dev.SerialNumber = &fb
|
||||
}
|
||||
|
||||
func injectNVIDIATelemetry(dev *schema.HardwarePCIeDevice, info nvidiaGPUInfo) {
|
||||
if dev.Telemetry == nil {
|
||||
dev.Telemetry = map[string]any{}
|
||||
}
|
||||
if info.TemperatureC != nil {
|
||||
dev.Telemetry["temperature_c"] = *info.TemperatureC
|
||||
dev.TemperatureC = info.TemperatureC
|
||||
}
|
||||
if info.PowerW != nil {
|
||||
dev.Telemetry["power_w"] = *info.PowerW
|
||||
dev.PowerW = info.PowerW
|
||||
}
|
||||
if info.ECCUncorrected != nil {
|
||||
dev.Telemetry["ecc_uncorrected_total"] = *info.ECCUncorrected
|
||||
dev.ECCUncorrectedTotal = info.ECCUncorrected
|
||||
}
|
||||
if info.ECCCorrected != nil {
|
||||
dev.Telemetry["ecc_corrected_total"] = *info.ECCCorrected
|
||||
dev.ECCCorrectedTotal = info.ECCCorrected
|
||||
}
|
||||
if info.HWSlowdown != nil {
|
||||
dev.Telemetry["hw_slowdown_active"] = *info.HWSlowdown
|
||||
}
|
||||
if len(dev.Telemetry) == 0 {
|
||||
dev.Telemetry = nil
|
||||
dev.HWSlowdown = info.HWSlowdown
|
||||
}
|
||||
}
|
||||
|
||||
@@ -54,10 +54,10 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
|
||||
status := "OK"
|
||||
devices := []schema.HardwarePCIeDevice{
|
||||
{
|
||||
VendorID: &vendorID,
|
||||
BDF: &bdf,
|
||||
Manufacturer: &manufacturer,
|
||||
Status: &status,
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
VendorID: &vendorID,
|
||||
BDF: &bdf,
|
||||
Manufacturer: &manufacturer,
|
||||
},
|
||||
}
|
||||
|
||||
@@ -73,21 +73,21 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
|
||||
},
|
||||
}
|
||||
|
||||
out := enrichPCIeWithNVIDIAData(devices, byBDF, "BOARD-001", true)
|
||||
out := enrichPCIeWithNVIDIAData(devices, byBDF, true)
|
||||
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-ABC" {
|
||||
t.Fatalf("serial: got %v", out[0].SerialNumber)
|
||||
}
|
||||
if out[0].Firmware == nil || *out[0].Firmware != "96.00.1F.00.02" {
|
||||
t.Fatalf("firmware: got %v", out[0].Firmware)
|
||||
}
|
||||
if out[0].Status == nil || *out[0].Status != "WARNING" {
|
||||
if out[0].Status == nil || *out[0].Status != statusWarning {
|
||||
t.Fatalf("status: got %v", out[0].Status)
|
||||
}
|
||||
if out[0].Telemetry == nil {
|
||||
t.Fatal("expected telemetry")
|
||||
if out[0].ECCUncorrectedTotal == nil || *out[0].ECCUncorrectedTotal != 2 {
|
||||
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].ECCUncorrectedTotal)
|
||||
}
|
||||
if got, ok := out[0].Telemetry["ecc_uncorrected_total"].(int64); !ok || got != 2 {
|
||||
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].Telemetry["ecc_uncorrected_total"])
|
||||
if out[0].TemperatureC == nil || *out[0].TemperatureC != 55.5 {
|
||||
t.Fatalf("temperature_c: got %#v", out[0].TemperatureC)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -103,11 +103,11 @@ func TestEnrichPCIeWithNVIDIAData_driverMissingFallback(t *testing.T) {
|
||||
},
|
||||
}
|
||||
|
||||
out := enrichPCIeWithNVIDIAData(devices, nil, "BOARD-123", false)
|
||||
if out[0].SerialNumber == nil || *out[0].SerialNumber != "BOARD-123-PCIE-0000:17:00.0" {
|
||||
t.Fatalf("fallback serial: got %v", out[0].SerialNumber)
|
||||
out := enrichPCIeWithNVIDIAData(devices, nil, false)
|
||||
if out[0].SerialNumber != nil {
|
||||
t.Fatalf("serial should stay nil without source data, got %v", out[0].SerialNumber)
|
||||
}
|
||||
if out[0].Status == nil || *out[0].Status != "UNKNOWN" {
|
||||
if out[0].Status == nil || *out[0].Status != statusUnknown {
|
||||
t.Fatalf("fallback status: got %v", out[0].Status)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -37,7 +37,7 @@ func parseLspci(output string) []schema.HardwarePCIeDevice {
|
||||
val := strings.TrimSpace(line[idx+2:])
|
||||
fields[key] = val
|
||||
}
|
||||
if !shouldIncludePCIeDevice(fields["Class"]) {
|
||||
if !shouldIncludePCIeDevice(fields["Class"], fields["Vendor"], fields["Device"]) {
|
||||
continue
|
||||
}
|
||||
dev := parseLspciDevice(fields)
|
||||
@@ -46,8 +46,10 @@ func parseLspci(output string) []schema.HardwarePCIeDevice {
|
||||
return devs
|
||||
}
|
||||
|
||||
func shouldIncludePCIeDevice(class string) bool {
|
||||
func shouldIncludePCIeDevice(class, vendor, device string) bool {
|
||||
c := strings.ToLower(strings.TrimSpace(class))
|
||||
v := strings.ToLower(strings.TrimSpace(vendor))
|
||||
d := strings.ToLower(strings.TrimSpace(device))
|
||||
if c == "" {
|
||||
return true
|
||||
}
|
||||
@@ -57,6 +59,8 @@ func shouldIncludePCIeDevice(class string) bool {
|
||||
"host bridge",
|
||||
"isa bridge",
|
||||
"pci bridge",
|
||||
"performance counter",
|
||||
"performance counters",
|
||||
"ram memory",
|
||||
"system peripheral",
|
||||
"communication controller",
|
||||
@@ -66,12 +70,28 @@ func shouldIncludePCIeDevice(class string) bool {
|
||||
"audio device",
|
||||
"serial bus controller",
|
||||
"unassigned class",
|
||||
"non-essential instrumentation",
|
||||
}
|
||||
for _, bad := range excluded {
|
||||
if strings.Contains(c, bad) {
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
if strings.Contains(v, "advanced micro devices") || strings.Contains(v, "[amd]") {
|
||||
internalAMDPatterns := []string{
|
||||
"dummy function",
|
||||
"reserved spp",
|
||||
"ptdma",
|
||||
"cryptographic coprocessor pspcpp",
|
||||
"pspcpp",
|
||||
}
|
||||
for _, bad := range internalAMDPatterns {
|
||||
if strings.Contains(d, bad) {
|
||||
return false
|
||||
}
|
||||
}
|
||||
}
|
||||
return true
|
||||
}
|
||||
|
||||
@@ -79,11 +99,12 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
|
||||
dev := schema.HardwarePCIeDevice{}
|
||||
present := true
|
||||
dev.Present = &present
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
dev.Status = &status
|
||||
|
||||
// Slot is the BDF: "0000:00:02.0"
|
||||
if bdf := fields["Slot"]; bdf != "" {
|
||||
dev.Slot = &bdf
|
||||
dev.BDF = &bdf
|
||||
// parse vendor_id and device_id from sysfs
|
||||
vendorID, deviceID := readPCIIDs(bdf)
|
||||
@@ -93,10 +114,34 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
|
||||
if deviceID != 0 {
|
||||
dev.DeviceID = &deviceID
|
||||
}
|
||||
if numaNode, ok := readPCINumaNode(bdf); ok {
|
||||
dev.NUMANode = &numaNode
|
||||
} else if numaNode, ok := parsePCINumaNode(fields["NUMANode"]); ok {
|
||||
dev.NUMANode = &numaNode
|
||||
}
|
||||
if width, ok := readPCIIntAttribute(bdf, "current_link_width"); ok {
|
||||
dev.LinkWidth = &width
|
||||
}
|
||||
if width, ok := readPCIIntAttribute(bdf, "max_link_width"); ok {
|
||||
dev.MaxLinkWidth = &width
|
||||
}
|
||||
if speed, ok := readPCIStringAttribute(bdf, "current_link_speed"); ok {
|
||||
linkSpeed := normalizePCILinkSpeed(speed)
|
||||
if linkSpeed != "" {
|
||||
dev.LinkSpeed = &linkSpeed
|
||||
}
|
||||
}
|
||||
if speed, ok := readPCIStringAttribute(bdf, "max_link_speed"); ok {
|
||||
linkSpeed := normalizePCILinkSpeed(speed)
|
||||
if linkSpeed != "" {
|
||||
dev.MaxLinkSpeed = &linkSpeed
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if v := fields["Class"]; v != "" {
|
||||
dev.DeviceClass = &v
|
||||
class := mapPCIeDeviceClass(v)
|
||||
dev.DeviceClass = &class
|
||||
}
|
||||
if v := fields["Vendor"]; v != "" {
|
||||
dev.Manufacturer = &v
|
||||
@@ -131,3 +176,67 @@ func readHexFile(path string) (int, error) {
|
||||
n, err := strconv.ParseInt(s, 16, 64)
|
||||
return int(n), err
|
||||
}
|
||||
|
||||
func readPCINumaNode(bdf string) (int, bool) {
|
||||
value, ok := readPCIIntAttribute(bdf, "numa_node")
|
||||
if !ok || value < 0 {
|
||||
return 0, false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func parsePCINumaNode(raw string) (int, bool) {
|
||||
raw = strings.TrimSpace(raw)
|
||||
if raw == "" {
|
||||
return 0, false
|
||||
}
|
||||
value, err := strconv.Atoi(raw)
|
||||
if err != nil || value < 0 {
|
||||
return 0, false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func readPCIIntAttribute(bdf, attribute string) (int, bool) {
|
||||
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
|
||||
if err != nil {
|
||||
return 0, false
|
||||
}
|
||||
value, err := strconv.Atoi(strings.TrimSpace(string(out)))
|
||||
if err != nil || value < 0 {
|
||||
return 0, false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func readPCIStringAttribute(bdf, attribute string) (string, bool) {
|
||||
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
|
||||
if err != nil {
|
||||
return "", false
|
||||
}
|
||||
value := strings.TrimSpace(string(out))
|
||||
if value == "" {
|
||||
return "", false
|
||||
}
|
||||
return value, true
|
||||
}
|
||||
|
||||
func normalizePCILinkSpeed(raw string) string {
|
||||
raw = strings.TrimSpace(strings.ToLower(raw))
|
||||
switch {
|
||||
case strings.Contains(raw, "2.5"):
|
||||
return "Gen1"
|
||||
case strings.Contains(raw, "5.0"):
|
||||
return "Gen2"
|
||||
case strings.Contains(raw, "8.0"):
|
||||
return "Gen3"
|
||||
case strings.Contains(raw, "16.0"):
|
||||
return "Gen4"
|
||||
case strings.Contains(raw, "32.0"):
|
||||
return "Gen5"
|
||||
case strings.Contains(raw, "64.0"):
|
||||
return "Gen6"
|
||||
default:
|
||||
return ""
|
||||
}
|
||||
}
|
||||
|
||||
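Review note: `normalizePCILinkSpeed` relies on substring matches, so its result depends on the order of the cases and on the exact formatting of the sysfs string. As far as I know, some older kernels print values such as "5 GT/s" without the ".0", which the substring "5.0" would miss. A hedged alternative is to parse the leading number and map it to a generation. The name `pcieGen` is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pcieGen maps a sysfs link-speed string such as "16.0 GT/s PCIe" to a PCIe
// generation label by parsing the leading transfer rate in GT/s.
// Unrecognized input yields an empty string.
func pcieGen(raw string) string {
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return ""
	}
	gts, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return ""
	}
	// All rates below are exactly representable as float64, so direct
	// comparison is safe here.
	switch gts {
	case 2.5:
		return "Gen1"
	case 5:
		return "Gen2"
	case 8:
		return "Gen3"
	case 16:
		return "Gen4"
	case 32:
		return "Gen5"
	case 64:
		return "Gen6"
	default:
		return ""
	}
}

func main() {
	for _, s := range []string{"2.5 GT/s PCIe", "5 GT/s", "16.0 GT/s PCIe", "unknown"} {
		fmt.Printf("%q -> %q\n", s, pcieGen(s))
	}
}
```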
@@ -1,41 +1,126 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
import (
|
||||
"encoding/json"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestShouldIncludePCIeDevice(t *testing.T) {
|
||||
tests := []struct {
|
||||
class string
|
||||
want bool
|
||||
name string
|
||||
class string
|
||||
vendor string
|
||||
device string
|
||||
want bool
|
||||
}{
|
||||
{"USB controller", false},
|
||||
{"System peripheral", false},
|
||||
{"Audio device", false},
|
||||
{"Host bridge", false},
|
||||
{"PCI bridge", false},
|
||||
{"SMBus", false},
|
||||
{"Ethernet controller", true},
|
||||
{"RAID bus controller", true},
|
||||
{"Non-Volatile memory controller", true},
|
||||
{"VGA compatible controller", true},
|
||||
{name: "usb", class: "USB controller", want: false},
|
||||
{name: "system peripheral", class: "System peripheral", want: false},
|
||||
{name: "audio", class: "Audio device", want: false},
|
||||
{name: "host bridge", class: "Host bridge", want: false},
|
||||
{name: "pci bridge", class: "PCI bridge", want: false},
|
||||
{name: "smbus", class: "SMBus", want: false},
|
||||
{name: "perf", class: "Performance counters", want: false},
|
||||
{name: "non essential instrumentation", class: "Non-Essential Instrumentation", want: false},
|
||||
{name: "amd dummy function", class: "Encryption controller", vendor: "Advanced Micro Devices, Inc. [AMD]", device: "Starship/Matisse PTDMA", want: false},
|
||||
{name: "amd pspcpp", class: "Encryption controller", vendor: "Advanced Micro Devices, Inc. [AMD]", device: "Starship/Matisse Cryptographic Coprocessor PSPCPP", want: false},
|
||||
{name: "ethernet", class: "Ethernet controller", want: true},
|
||||
{name: "raid", class: "RAID bus controller", want: true},
|
||||
{name: "nvme", class: "Non-Volatile memory controller", want: true},
|
||||
{name: "vga", class: "VGA compatible controller", want: true},
|
||||
{name: "other encryption controller", class: "Encryption controller", vendor: "Intel Corporation", device: "QuickAssist", want: true},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
got := shouldIncludePCIeDevice(tt.class)
|
||||
if got != tt.want {
|
||||
t.Fatalf("class %q include=%v want %v", tt.class, got, tt.want)
|
||||
}
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
got := shouldIncludePCIeDevice(tt.class, tt.vendor, tt.device)
|
||||
if got != tt.want {
|
||||
t.Fatalf("class=%q vendor=%q device=%q include=%v want %v", tt.class, tt.vendor, tt.device, got, tt.want)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseLspci_filtersExcludedClasses(t *testing.T) {
|
||||
input := "Slot:\t0000:00:14.0\nClass:\tUSB controller\nVendor:\tIntel Corporation\nDevice:\tUSB 3.0\n\n" +
|
||||
"Slot:\t0000:00:18.0\nClass:\tNon-Essential Instrumentation\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PCIe Dummy Function\n\n" +
|
||||
"Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
|
||||
|
||||
devs := parseLspci(input)
|
||||
if len(devs) != 1 {
|
||||
t.Fatalf("expected 1 filtered device, got %d", len(devs))
|
||||
}
|
||||
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VGA compatible controller" {
|
||||
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VideoController" {
|
||||
t.Fatalf("unexpected remaining class: %v", devs[0].DeviceClass)
|
||||
}
|
||||
if devs[0].Slot == nil || *devs[0].Slot != "0000:65:00.0" {
|
||||
t.Fatalf("slot: got %v", devs[0].Slot)
|
||||
}
|
||||
if devs[0].BDF == nil || *devs[0].BDF != "0000:65:00.0" {
|
||||
t.Fatalf("bdf: got %v", devs[0].BDF)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseLspci_filtersAMDChipsetNoise(t *testing.T) {
|
||||
input := "" +
|
||||
"Slot:\t0000:1a:00.0\nClass:\tNon-Essential Instrumentation\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PCIe Dummy Function\n\n" +
|
||||
"Slot:\t0000:1a:00.2\nClass:\tEncryption controller\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PTDMA\n\n" +
|
||||
"Slot:\t0000:05:00.0\nClass:\tEthernet controller\nVendor:\tMellanox Technologies\nDevice:\tMT28908 Family [ConnectX-6]\n\n"
|
||||
|
||||
devs := parseLspci(input)
|
||||
if len(devs) != 1 {
|
||||
t.Fatalf("expected 1 remaining device, got %d", len(devs))
|
||||
}
|
||||
if devs[0].Model == nil || *devs[0].Model != "MT28908 Family [ConnectX-6]" {
|
||||
t.Fatalf("unexpected remaining device: %+v", devs[0])
|
||||
}
|
||||
}
|
||||
|
||||
func TestPCIeJSONUsesSlotNotBDF(t *testing.T) {
|
||||
input := "Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
|
||||
|
||||
devs := parseLspci(input)
|
||||
data, err := json.Marshal(devs[0])
|
||||
if err != nil {
|
||||
t.Fatalf("marshal: %v", err)
|
||||
}
|
||||
text := string(data)
|
||||
if !strings.Contains(text, `"slot":"0000:65:00.0"`) {
|
||||
t.Fatalf("json missing slot: %s", text)
|
||||
}
|
||||
if strings.Contains(text, `"bdf"`) {
|
||||
t.Fatalf("json should not emit bdf: %s", text)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseLspciUsesNUMANodeFieldWhenSysfsUnavailable(t *testing.T) {
|
||||
input := "Slot:\t0000:65:00.0\nClass:\tEthernet controller\nVendor:\tIntel Corporation\nDevice:\tX710\nNUMANode:\t1\n\n"
|
||||
|
||||
devs := parseLspci(input)
|
||||
if len(devs) != 1 {
|
||||
t.Fatalf("expected 1 device, got %d", len(devs))
|
||||
}
|
||||
if devs[0].NUMANode == nil || *devs[0].NUMANode != 1 {
|
||||
t.Fatalf("numa_node=%v want 1", devs[0].NUMANode)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNormalizePCILinkSpeed(t *testing.T) {
|
||||
tests := []struct {
|
||||
raw string
|
||||
want string
|
||||
}{
|
||||
{"2.5 GT/s PCIe", "Gen1"},
|
||||
{"5.0 GT/s PCIe", "Gen2"},
|
||||
{"8.0 GT/s PCIe", "Gen3"},
|
||||
{"16.0 GT/s PCIe", "Gen4"},
|
||||
{"32.0 GT/s PCIe", "Gen5"},
|
||||
{"64.0 GT/s PCIe", "Gen6"},
|
||||
{"unknown", ""},
|
||||
}
|
||||
for _, tt := range tests {
|
||||
if got := normalizePCILinkSpeed(tt.raw); got != tt.want {
|
||||
t.Fatalf("normalizePCILinkSpeed(%q)=%q want %q", tt.raw, got, tt.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
123
audit/internal/collector/pcie_identity.go
Normal file
@@ -0,0 +1,123 @@
package collector

import (
	"bee/audit/internal/schema"
	"log/slog"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

var (
	queryPCILSPCIDetail = func(bdf string) (string, error) {
		out, err := exec.Command("lspci", "-vv", "-s", bdf).Output()
		if err != nil {
			return "", err
		}
		return string(out), nil
	}
	readPCIVPDFile = func(bdf string) ([]byte, error) {
		return os.ReadFile(filepath.Join("/sys/bus/pci/devices", bdf, "vpd"))
	}
)

func enrichPCIeWithPCISerials(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
	enriched := 0
	for i := range devs {
		if !shouldProbePCIeSerial(devs[i]) {
			continue
		}
		bdf := normalizePCIeBDF(*devs[i].BDF)
		if bdf == "" {
			continue
		}
		if serial := queryPCIDeviceSerial(bdf); serial != "" {
			devs[i].SerialNumber = &serial
			enriched++
		}
	}
	if enriched > 0 {
		slog.Info("pcie: serials enriched", "count", enriched)
	}
	return devs
}

func shouldProbePCIeSerial(dev schema.HardwarePCIeDevice) bool {
	if dev.BDF == nil || dev.SerialNumber != nil {
		return false
	}
	if dev.DeviceClass == nil {
		return false
	}
	class := strings.TrimSpace(*dev.DeviceClass)
	return isNICClass(class) || isGPUClass(class)
}

func queryPCIDeviceSerial(bdf string) string {
	if out, err := queryPCILSPCIDetail(bdf); err == nil {
		if serial := parseLSPCIDetailSerial(out); serial != "" {
			return serial
		}
	}
	if raw, err := readPCIVPDFile(bdf); err == nil {
		return parsePCIVPDSerial(raw)
	}
	return ""
}

func parseLSPCIDetailSerial(raw string) string {
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		lower := strings.ToLower(line)
		if !strings.Contains(lower, "serial number:") {
			continue
		}
		idx := strings.Index(line, ":")
		if idx < 0 {
			continue
		}
		if serial := strings.TrimSpace(line[idx+1:]); serial != "" {
			return serial
		}
	}
	return ""
}

func parsePCIVPDSerial(raw []byte) string {
	for i := 0; i+3 < len(raw); i++ {
		if raw[i] != 'S' || raw[i+1] != 'N' {
			continue
		}
		length := int(raw[i+2])
		if length <= 0 || length > 64 || i+3+length > len(raw) {
			continue
		}
		value := strings.TrimSpace(strings.Trim(string(raw[i+3:i+3+length]), "\x00"))
		if !looksLikeSerial(value) {
			continue
		}
		return value
	}
	return ""
}

func looksLikeSerial(value string) bool {
	if len(value) < 4 {
		return false
	}
	hasAlphaNum := false
	for _, r := range value {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
			hasAlphaNum = true
		case strings.ContainsRune(" -_./:", r):
		default:
			return false
		}
	}
	return hasAlphaNum
}
|
||||
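Review note: `parsePCIVPDSerial` scans the raw VPD bytes for the two characters `SN` anywhere in the buffer, which is tolerant of malformed data but can in principle match inside an unrelated field. A stricter alternative, sketched below, walks the resource tags instead. It assumes the usual PCI VPD layout: a large-resource tag byte (0x82 identifier string, 0x90 read-only section), a two-byte little-endian length, and inside the read-only section a sequence of keyword fields (two-byte keyword, one-byte length, data). The names `vpdSerial` and `vpdKeyword` are hypothetical. The fixture in `TestParsePCIVPDSerial` declares section lengths that do not match its payload, so this strict walker uses its own well-formed sample rather than that fixture.

```go
package main

import (
	"fmt"
	"strings"
)

// vpdSerial walks the VPD resource list and returns the "SN" keyword value
// from the first read-only (0x90) section that has one, or "" otherwise.
func vpdSerial(raw []byte) string {
	for i := 0; i < len(raw); {
		tag := raw[i]
		if tag&0x80 == 0 {
			// Small resource: the low 3 bits hold the length; 0x78 ends the list.
			if tag == 0x78 {
				return ""
			}
			i += 1 + int(tag&0x07)
			continue
		}
		if i+3 > len(raw) {
			return ""
		}
		n := int(raw[i+1]) | int(raw[i+2])<<8
		body := i + 3
		end := body + n
		if end > len(raw) {
			return ""
		}
		if tag == 0x90 {
			if sn := vpdKeyword(raw[body:end], "SN"); sn != "" {
				return sn
			}
		}
		i = end
	}
	return ""
}

// vpdKeyword iterates over keyword fields (2-byte key, 1-byte length, data)
// and returns the trimmed value stored under key.
func vpdKeyword(section []byte, key string) string {
	for i := 0; i+3 <= len(section); {
		n := int(section[i+2])
		start := i + 3
		if start+n > len(section) {
			return ""
		}
		if string(section[i:i+2]) == key {
			return strings.TrimSpace(strings.Trim(string(section[start:start+n]), "\x00"))
		}
		i = start + n
	}
	return ""
}

func main() {
	raw := []byte{
		0x82, 0x04, 0x00, 'M', 'L', 'X', '5', // identifier string, 4 bytes
		0x90, 0x0B, 0x00, // read-only section, 11 bytes
		'S', 'N', 0x08, 'M', 'T', '1', '2', '3', '4', '5', '6',
		0x78, // end tag
	}
	fmt.Printf("%q\n", vpdSerial(raw))
}
```

The trade-off is robustness: the byte scan in the change still finds a serial when a vendor writes inconsistent lengths, while the structured walk refuses such data. Using the walk first and the scan as a fallback would combine both.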
47
audit/internal/collector/pcie_identity_test.go
Normal file
@@ -0,0 +1,47 @@
package collector

import (
	"bee/audit/internal/schema"
	"fmt"
	"testing"
)

func TestEnrichPCIeWithPCISerialsAddsGPUFallback(t *testing.T) {
	origDetail := queryPCILSPCIDetail
	origVPD := readPCIVPDFile
	t.Cleanup(func() {
		queryPCILSPCIDetail = origDetail
		readPCIVPDFile = origVPD
	})

	queryPCILSPCIDetail = func(bdf string) (string, error) {
		if bdf != "0000:11:00.0" {
			t.Fatalf("unexpected bdf: %s", bdf)
		}
		return "Serial number: GPU-SN-12345\n", nil
	}
	readPCIVPDFile = func(string) ([]byte, error) {
		return nil, fmt.Errorf("no vpd needed")
	}

	class := "DisplayController"
	bdf := "0000:11:00.0"
	devs := []schema.HardwarePCIeDevice{{
		DeviceClass: &class,
		BDF:         &bdf,
	}}

	out := enrichPCIeWithPCISerials(devs)
	if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-SN-12345" {
		t.Fatalf("serial=%v want GPU-SN-12345", out[0].SerialNumber)
	}
}

func TestShouldProbePCIeSerialSkipsNonGPUOrNIC(t *testing.T) {
	class := "StorageController"
	bdf := "0000:19:00.0"
	dev := schema.HardwarePCIeDevice{DeviceClass: &class, BDF: &bdf}
	if shouldProbePCIeSerial(dev) {
		t.Fatal("unexpected probe for storage controller")
	}
}
|
||||
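The test above swaps package-level function variables (`queryPCILSPCIDetail`, `readPCIVPDFile`) and restores them in `t.Cleanup`. A minimal sketch of the same pattern outside the `testing` package follows; `lookupSerial`, `serialFor` and `withStub` are hypothetical names, and `defer` stands in for `t.Cleanup`.

```go
package main

import "fmt"

// lookupSerial is a hypothetical package-level hook. Production code calls it
// through the variable so that tests can replace it.
var lookupSerial = func(bdf string) (string, error) {
	return "", fmt.Errorf("no hardware access in this sketch: %s", bdf)
}

// serialFor returns the serial for bdf, or "" when the lookup fails.
func serialFor(bdf string) string {
	s, err := lookupSerial(bdf)
	if err != nil {
		return ""
	}
	return s
}

// withStub installs stub for the duration of fn and restores the original
// hook afterwards, even if fn panics.
func withStub(stub func(string) (string, error), fn func()) {
	orig := lookupSerial
	defer func() { lookupSerial = orig }()
	lookupSerial = stub
	fn()
}

func main() {
	withStub(func(string) (string, error) { return "GPU-SN-12345", nil }, func() {
		fmt.Println(serialFor("0000:11:00.0")) // stubbed lookup
	})
	fmt.Printf("%q\n", serialFor("0000:11:00.0")) // original hook again
}
```

Tests that mutate shared hooks like this cannot safely call `t.Parallel()`, which may be why the test above does not.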
@@ -4,18 +4,32 @@ import (
 	"bee/audit/internal/schema"
 	"log/slog"
 	"os/exec"
+	"regexp"
+	"sort"
 	"strconv"
 	"strings"
 )
 
 func collectPSUs() []schema.HardwarePowerSupply {
-	// ipmitool requires /dev/ipmi0 — not available on non-server hardware
-	out, err := exec.Command("ipmitool", "fru", "print").Output()
-	if err != nil {
+	var psus []schema.HardwarePowerSupply
+	if out, err := exec.Command("ipmitool", "fru", "print").Output(); err == nil {
+		psus = parseFRU(string(out))
+	} else {
+		slog.Info("psu: fru unavailable", "err", err)
+	}
+
+	sdrData := map[int]psuSDR{}
+	if sdrOut, err := exec.Command("ipmitool", "sdr").Output(); err == nil {
+		sdrData = parsePSUSDR(string(sdrOut))
+		if len(psus) == 0 {
+			psus = synthesizePSUsFromSDR(sdrData)
+		} else {
+			mergePSUSDR(psus, sdrData)
+		}
+	} else if len(psus) == 0 {
 		slog.Info("psu: ipmitool unavailable, skipping", "err", err)
 		return nil
 	}
-	psus := parseFRU(string(out))
 	slog.Info("psu: collected", "count", len(psus))
 	return psus
 }
@@ -75,9 +89,7 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
 
 	// Only process PSU FRU records
 	headerLower := strings.ToLower(header)
-	if !strings.Contains(headerLower, "psu") &&
-		!strings.Contains(headerLower, "power supply") &&
-		!strings.Contains(headerLower, "power_supply") {
+	if !isPSUHeader(headerLower) {
 		return schema.HardwarePowerSupply{}, false
 	}
 
@@ -85,21 +97,24 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
 	psu := schema.HardwarePowerSupply{Present: &present}
 
 	slotStr := strconv.Itoa(slotIdx)
+	if slot, ok := parsePSUSlot(header); ok && slot > 0 {
+		slotStr = strconv.Itoa(slot - 1)
+	}
 	psu.Slot = &slotStr
 
-	if v := cleanDMIValue(fields["Board Product"]); v != "" {
+	if v := firstNonEmptyField(fields, "Board Product", "Product Name", "Product Part Number"); v != "" {
 		psu.Model = &v
 	}
-	if v := cleanDMIValue(fields["Board Mfg"]); v != "" {
+	if v := firstNonEmptyField(fields, "Board Mfg", "Product Manufacturer", "Product Manufacturer Name"); v != "" {
 		psu.Vendor = &v
 	}
-	if v := cleanDMIValue(fields["Board Serial"]); v != "" {
+	if v := firstNonEmptyField(fields, "Board Serial", "Product Serial", "Product Serial Number"); v != "" {
 		psu.SerialNumber = &v
 	}
-	if v := cleanDMIValue(fields["Board Part Number"]); v != "" {
+	if v := firstNonEmptyField(fields, "Board Part Number", "Product Part Number", "Part Number"); v != "" {
 		psu.PartNumber = &v
 	}
-	if v := cleanDMIValue(fields["Board Extra"]); v != "" {
+	if v := firstNonEmptyField(fields, "Board Extra", "Product Version", "Board Version"); v != "" {
 		psu.Firmware = &v
 	}
 
@@ -110,12 +125,230 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
 		}
 	}
 
-	status := "OK"
+	status := statusOK
 	psu.Status = &status
 
 	return psu, true
 }
 
+func isPSUHeader(headerLower string) bool {
+	return strings.Contains(headerLower, "psu") ||
+		strings.Contains(headerLower, "pws") ||
+		strings.Contains(headerLower, "power supply") ||
+		strings.Contains(headerLower, "power_supply") ||
+		strings.Contains(headerLower, "power module")
+}
+
+func firstNonEmptyField(fields map[string]string, keys ...string) string {
+	for _, key := range keys {
+		if value := cleanDMIValue(fields[key]); value != "" {
+			return value
+		}
+	}
+	return ""
+}
+
+type psuSDR struct {
+	slot         int
+	status       string
+	reason       string
+	inputPowerW  *float64
+	outputPowerW *float64
+	inputVoltage *float64
+	temperatureC *float64
+	healthPct    *float64
+}
+
+var psuSlotPatterns = []*regexp.Regexp{
+	regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`),
+	regexp.MustCompile(`(?i)\bps\s*([0-9]+)\b`),
+	regexp.MustCompile(`(?i)\bpws\s*([0-9]+)\b`),
+	regexp.MustCompile(`(?i)\bpower\s*supply(?:\s*bay)?\s*([0-9]+)\b`),
+	regexp.MustCompile(`(?i)\bbay\s*([0-9]+)\b`),
+}
+
+func parsePSUSDR(raw string) map[int]psuSDR {
+	out := map[int]psuSDR{}
+	for _, line := range strings.Split(raw, "\n") {
+		fields := splitSDRFields(line)
+		if len(fields) < 3 {
+			continue
+		}
+		name := fields[0]
+		value := fields[1]
+		state := strings.ToLower(fields[2])
+		slot, ok := parsePSUSlot(name)
+		if !ok {
+			continue
+		}
+
+		entry := out[slot]
+		entry.slot = slot
+		if entry.status == "" {
+			entry.status = statusOK
+		}
+		if state != "" && state != "ok" && state != "ns" {
+			entry.status = statusCritical
+			entry.reason = "PSU sensor reported non-OK state: " + state
+		}
+
+		lowerName := strings.ToLower(name)
+		switch {
+		case strings.Contains(lowerName, "input power"):
+			entry.inputPowerW = parseFloatPtr(value)
+		case strings.Contains(lowerName, "output power"):
+			entry.outputPowerW = parseFloatPtr(value)
+		case strings.Contains(lowerName, "power supply bay"), strings.Contains(lowerName, "psu bay"):
+			entry.outputPowerW = parseFloatPtr(value)
+		case strings.Contains(lowerName, "input voltage"), strings.Contains(lowerName, "ac input"):
+			entry.inputVoltage = parseFloatPtr(value)
+		case strings.Contains(lowerName, "temp"):
+			entry.temperatureC = parseFloatPtr(value)
+		case strings.Contains(lowerName, "health"), strings.Contains(lowerName, "remaining life"), strings.Contains(lowerName, "life remaining"):
+			entry.healthPct = parsePercentPtr(value)
+		}
+		out[slot] = entry
+	}
+	return out
+}
+
+func synthesizePSUsFromSDR(sdr map[int]psuSDR) []schema.HardwarePowerSupply {
+	if len(sdr) == 0 {
+		return nil
+	}
+	slots := make([]int, 0, len(sdr))
+	for slot := range sdr {
+		slots = append(slots, slot)
+	}
+	sort.Ints(slots)
+
+	out := make([]schema.HardwarePowerSupply, 0, len(slots))
+	for _, slot := range slots {
+		entry := sdr[slot]
+		present := true
+		status := entry.status
+		if status == "" {
+			status = statusUnknown
+		}
+		slotStr := strconv.Itoa(slot - 1)
+		model := "PSU"
+		psu := schema.HardwarePowerSupply{
+			HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
+			Slot:                    &slotStr,
+			Present:                 &present,
+			Model:                   &model,
+			InputPowerW:             entry.inputPowerW,
+			OutputPowerW:            entry.outputPowerW,
+			InputVoltage:            entry.inputVoltage,
+			TemperatureC:            entry.temperatureC,
+		}
+		if entry.healthPct != nil {
+			psu.LifeRemainingPct = entry.healthPct
+			lifeUsed := 100 - *entry.healthPct
+			psu.LifeUsedPct = &lifeUsed
+		}
+		if entry.reason != "" {
+			psu.ErrorDescription = &entry.reason
+		}
+		out = append(out, psu)
+	}
+	return out
+}
+
+func mergePSUSDR(psus []schema.HardwarePowerSupply, sdr map[int]psuSDR) {
+	for i := range psus {
+		slotIdx, err := strconv.Atoi(derefPSUSlot(psus[i].Slot))
+		if err != nil {
+			continue
+		}
+		entry, ok := sdr[slotIdx+1]
+		if !ok {
+			continue
+		}
+		if entry.inputPowerW != nil {
+			psus[i].InputPowerW = entry.inputPowerW
+		}
+		if entry.outputPowerW != nil {
+			psus[i].OutputPowerW = entry.outputPowerW
+		}
+		if entry.inputVoltage != nil {
+			psus[i].InputVoltage = entry.inputVoltage
+		}
+		if entry.temperatureC != nil {
+			psus[i].TemperatureC = entry.temperatureC
+		}
+		if entry.healthPct != nil {
+			psus[i].LifeRemainingPct = entry.healthPct
+			lifeUsed := 100 - *entry.healthPct
+			psus[i].LifeUsedPct = &lifeUsed
+		}
+		if entry.status != "" {
+			psus[i].Status = &entry.status
+		}
+		if entry.reason != "" {
+			psus[i].ErrorDescription = &entry.reason
+		}
+		if psus[i].Status != nil && *psus[i].Status == statusOK {
+			if (entry.inputPowerW == nil && entry.outputPowerW == nil && entry.inputVoltage == nil) && entry.status == "" {
+				unknown := statusUnknown
+				psus[i].Status = &unknown
+			}
+		}
+	}
+}
+
+func splitSDRFields(line string) []string {
+	parts := strings.Split(line, "|")
+	out := make([]string, 0, len(parts))
+	for _, part := range parts {
+		part = strings.TrimSpace(part)
+		if part != "" {
+			out = append(out, part)
+		}
+	}
+	return out
+}
+
+func parsePSUSlot(name string) (int, bool) {
+	for _, re := range psuSlotPatterns {
+		m := re.FindStringSubmatch(strings.ToLower(name))
+		if len(m) == 0 {
+			continue
+		}
+		for _, group := range m[1:] {
+			if group == "" {
+				continue
+			}
+			n, err := strconv.Atoi(group)
+			if err == nil && n > 0 {
+				return n, true
+			}
+		}
+	}
+	return 0, false
+}
+
+func parseFloatPtr(raw string) *float64 {
+	raw = strings.TrimSpace(raw)
+	if raw == "" || strings.EqualFold(raw, "na") {
+		return nil
+	}
+	for _, field := range strings.Fields(raw) {
+		n, err := strconv.ParseFloat(strings.TrimSpace(field), 64)
+		if err == nil {
+			return &n
+		}
+	}
+	return nil
+}
+
+func derefPSUSlot(slot *string) string {
+	if slot == nil {
+		return ""
+	}
+	return *slot
+}
+
 // parseWattage extracts wattage from strings like "PSU 800W", "1200W PLATINUM".
 func parseWattage(s string) int {
 	s = strings.ToUpper(s)
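The slot handling in this file follows one convention: sensor and FRU names carry a 1-based PSU number, while the schema slot is a 0-based string. A minimal sketch of that conversion, assuming a single combined pattern instead of the `psuSlotPatterns` list; `slotPattern`, `slotFromName` and `schemaSlot` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// slotPattern is one illustrative pattern covering "PSU1", "PS 2" and "PWS3".
// It is an assumption for this sketch, not the list used in the diff.
var slotPattern = regexp.MustCompile(`(?i)\b(?:psu|ps|pws)\s*([0-9]+)\b`)

// slotFromName returns the 1-based PSU number found in a sensor name.
func slotFromName(name string) (int, bool) {
	m := slotPattern.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil || n <= 0 {
		return 0, false
	}
	return n, true
}

// schemaSlot converts the 1-based number into the 0-based slot string.
func schemaSlot(name string) (string, bool) {
	n, ok := slotFromName(name)
	if !ok {
		return "", false
	}
	return strconv.Itoa(n - 1), true
}

func main() {
	for _, name := range []string{"PS1 Input Power", "PWS2 Status", "CPU1 Temp"} {
		slot, ok := schemaSlot(name)
		fmt.Printf("%q -> %q %v\n", name, slot, ok)
	}
}
```

The leading `\b` is what keeps names such as "CPU1 Temp" from matching on their trailing "PU1".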
91
audit/internal/collector/psu_sdr_test.go
Normal file
@@ -0,0 +1,91 @@
package collector

import "testing"

func TestParsePSUSDR(t *testing.T) {
	raw := `
PS1 Input Power | 215 Watts | ok
PS1 Output Power | 198 Watts | ok
PS1 Input Voltage | 229 Volts | ok
PS1 Temp | 39 C | ok
PS1 Health | 97 % | ok
PS2 Input Power | 0 Watts | cr
`

	got := parsePSUSDR(raw)
	if len(got) != 2 {
		t.Fatalf("len(got)=%d want 2", len(got))
	}
	if got[1].status != statusOK {
		t.Fatalf("ps1 status=%q", got[1].status)
	}
	if got[1].inputPowerW == nil || *got[1].inputPowerW != 215 {
		t.Fatalf("ps1 input power=%v", got[1].inputPowerW)
	}
	if got[1].outputPowerW == nil || *got[1].outputPowerW != 198 {
		t.Fatalf("ps1 output power=%v", got[1].outputPowerW)
	}
	if got[1].inputVoltage == nil || *got[1].inputVoltage != 229 {
		t.Fatalf("ps1 input voltage=%v", got[1].inputVoltage)
	}
	if got[1].temperatureC == nil || *got[1].temperatureC != 39 {
		t.Fatalf("ps1 temperature=%v", got[1].temperatureC)
	}
	if got[1].healthPct == nil || *got[1].healthPct != 97 {
		t.Fatalf("ps1 health=%v", got[1].healthPct)
	}
	if got[2].status != statusCritical {
		t.Fatalf("ps2 status=%q", got[2].status)
	}
}

func TestParsePSUSlotVendorVariants(t *testing.T) {
	t.Parallel()

	tests := []struct {
		name string
		want int
	}{
		{name: "PWS1 Status", want: 1},
		{name: "Power Supply Bay 8", want: 8},
		{name: "PS 6 Input Power", want: 6},
	}

	for _, tt := range tests {
		got, ok := parsePSUSlot(tt.name)
		if !ok || got != tt.want {
			t.Fatalf("parsePSUSlot(%q)=(%d,%v) want (%d,true)", tt.name, got, ok, tt.want)
		}
	}
}

func TestSynthesizePSUsFromSDR(t *testing.T) {
	t.Parallel()

	health := 97.0
	outputPower := 915.0
	got := synthesizePSUsFromSDR(map[int]psuSDR{
		1: {
			slot:         1,
			status:       statusOK,
			outputPowerW: &outputPower,
			healthPct:    &health,
		},
	})

	if len(got) != 1 {
		t.Fatalf("len(got)=%d want 1", len(got))
	}
	if got[0].Slot == nil || *got[0].Slot != "0" {
		t.Fatalf("slot=%v want 0", got[0].Slot)
	}
	if got[0].OutputPowerW == nil || *got[0].OutputPowerW != 915 {
		t.Fatalf("output power=%v", got[0].OutputPowerW)
	}
	if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 97 {
		t.Fatalf("life remaining=%v", got[0].LifeRemainingPct)
	}
	if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 3 {
		t.Fatalf("life used=%v", got[0].LifeUsedPct)
	}
}
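The SDR fixture in `TestParsePSUSDR` uses the pipe-separated `name | value | state` layout. A minimal sketch of reading one such line, assuming the value column starts with the number; `sdrReading` and `parseSDRLine` are hypothetical and stricter than `parsePSUSDR`, which keeps an entry even when the value is not numeric.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sdrReading is one parsed sensor line (hypothetical type for this sketch).
type sdrReading struct {
	Name  string
	Value float64
	State string
}

// parseSDRLine splits "name | value unit | state" and parses the leading
// number of the value column. It reports false for malformed or
// non-numeric lines.
func parseSDRLine(line string) (sdrReading, bool) {
	parts := strings.Split(line, "|")
	if len(parts) < 3 {
		return sdrReading{}, false
	}
	fields := strings.Fields(parts[1])
	if len(fields) == 0 {
		return sdrReading{}, false
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return sdrReading{}, false
	}
	return sdrReading{
		Name:  strings.TrimSpace(parts[0]),
		Value: v,
		State: strings.ToLower(strings.TrimSpace(parts[2])),
	}, true
}

func main() {
	for _, line := range []string{
		"PS1 Input Power | 215 Watts | ok",
		"PS2 Temp | na | ns",
	} {
		r, ok := parseSDRLine(line)
		fmt.Printf("%+v %v\n", r, ok)
	}
}
```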
121
audit/internal/collector/psu_telemetry.go
Normal file
@@ -0,0 +1,121 @@
package collector

import (
	"bee/audit/internal/schema"
	"strconv"
	"strings"
)

func enrichPSUsWithTelemetry(psus []schema.HardwarePowerSupply, doc sensorsDoc) []schema.HardwarePowerSupply {
	if len(psus) == 0 || len(doc) == 0 {
		return psus
	}

	tempBySlot := psuTempsFromSensors(doc)
	healthBySlot := psuHealthFromSensors(doc)
	for i := range psus {
		slot := derefPSUSlot(psus[i].Slot)
		if slot == "" {
			continue
		}
		if psus[i].TemperatureC == nil {
			if value, ok := tempBySlot[slot]; ok {
				psus[i].TemperatureC = &value
			}
		}
		if psus[i].LifeRemainingPct == nil {
			if value, ok := healthBySlot[slot]; ok {
				psus[i].LifeRemainingPct = &value
				used := 100 - value
				psus[i].LifeUsedPct = &used
			}
		}
	}
	return psus
}

func psuHealthFromSensors(doc sensorsDoc) map[string]float64 {
	out := map[string]float64{}
	for chip, features := range doc {
		for featureName, raw := range features {
			feature, ok := raw.(map[string]any)
			if !ok {
				continue
			}
			if !isLikelyPSUHealth(chip, featureName) {
				continue
			}
			value, ok := firstFeaturePercent(feature)
			if !ok {
				continue
			}
			if slot, ok := detectPSUSlot(chip, featureName); ok {
				if _, exists := out[slot]; !exists {
					out[slot] = value
				}
			}
		}
	}
	return out
}

func firstFeaturePercent(feature map[string]any) (float64, bool) {
	keys := sortedFeatureKeys(feature)
	for _, key := range keys {
		lower := strings.ToLower(key)
		if strings.HasSuffix(lower, "_alarm") {
			continue
		}
		if strings.Contains(lower, "health") || strings.Contains(lower, "life") || strings.Contains(lower, "remain") {
			if value, ok := floatFromAny(feature[key]); ok {
				return value, true
			}
		}
	}
	return 0, false
}

func isLikelyPSUHealth(chip, feature string) bool {
	value := strings.ToLower(chip + " " + feature)
	return (strings.Contains(value, "psu") || strings.Contains(value, "power supply")) &&
		(strings.Contains(value, "health") || strings.Contains(value, "life") || strings.Contains(value, "remain"))
}

func psuTempsFromSensors(doc sensorsDoc) map[string]float64 {
	out := map[string]float64{}
	for chip, features := range doc {
		for featureName, raw := range features {
			feature, ok := raw.(map[string]any)
			if !ok || classifySensorFeature(feature) != "temp" {
				continue
			}
			if !isLikelyPSUTemp(chip, featureName) {
				continue
			}
			temp, ok := firstFeatureFloat(feature, "_input")
			if !ok {
				continue
			}
			if slot, ok := detectPSUSlot(chip, featureName); ok {
				if _, exists := out[slot]; !exists {
					out[slot] = temp
				}
			}
		}
	}
	return out
}

func isLikelyPSUTemp(chip, feature string) bool {
	value := strings.ToLower(chip + " " + feature)
	return strings.Contains(value, "psu") || strings.Contains(value, "power supply")
}

func detectPSUSlot(parts ...string) (string, bool) {
	for _, part := range parts {
		if value, ok := parsePSUSlot(part); ok && value > 0 {
			return strconv.Itoa(value - 1), true
		}
	}
	return "", false
}
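The functions above walk a chip-to-feature-to-subfeature document and type-assert each feature to `map[string]any`. A minimal sketch of why that assertion is needed, assuming a `sensors -j`-style JSON document in which each chip also carries a plain string entry such as "Adapter"; `sensorsJSON` and `inputReadings` are hypothetical names, and the sample JSON is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// sensorsJSON mirrors the assumed shape: chip name -> feature name -> value.
type sensorsJSON map[string]map[string]any

// inputReadings returns the first "*_input" value of every feature.
func inputReadings(raw []byte) (map[string]float64, error) {
	var doc sensorsJSON
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	out := map[string]float64{}
	for _, features := range doc {
		for name, v := range features {
			feature, ok := v.(map[string]any)
			if !ok {
				continue // string entries such as "Adapter" are skipped
			}
			keys := make([]string, 0, len(feature))
			for k := range feature {
				keys = append(keys, k)
			}
			sort.Strings(keys) // map order is random; sort for stable results
			for _, k := range keys {
				if !strings.HasSuffix(k, "_input") {
					continue
				}
				if f, ok := feature[k].(float64); ok {
					out[name] = f
					break
				}
			}
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"psu-hwmon-0":{"Adapter":"ISA adapter","PSU1 Temp":{"temp1_input":39.5,"temp1_max":80}}}`)
	got, err := inputReadings(raw)
	fmt.Println(got, err)
}
```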
42
audit/internal/collector/psu_telemetry_test.go
Normal file
@@ -0,0 +1,42 @@
package collector

import (
	"testing"

	"bee/audit/internal/schema"
)

func TestEnrichPSUsWithTelemetry(t *testing.T) {
	slot0 := "0"
	slot1 := "1"
	psus := []schema.HardwarePowerSupply{
		{Slot: &slot0},
		{Slot: &slot1},
	}

	doc := sensorsDoc{
		"psu-hwmon-0": {
			"PSU1 Temp":           map[string]any{"temp1_input": 39.5},
			"PSU2 Temp":           map[string]any{"temp2_input": 41.0},
			"PSU1 Health":         map[string]any{"health1_input": 98.0},
			"PSU2 Remaining Life": map[string]any{"life2_input": 95.0},
		},
	}

	got := enrichPSUsWithTelemetry(psus, doc)
	if got[0].TemperatureC == nil || *got[0].TemperatureC != 39.5 {
		t.Fatalf("psu0 temperature mismatch: %#v", got[0].TemperatureC)
	}
	if got[1].TemperatureC == nil || *got[1].TemperatureC != 41.0 {
		t.Fatalf("psu1 temperature mismatch: %#v", got[1].TemperatureC)
	}
	if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 98.0 {
		t.Fatalf("psu0 life remaining mismatch: %#v", got[0].LifeRemainingPct)
	}
	if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 2.0 {
		t.Fatalf("psu0 life used mismatch: %#v", got[0].LifeUsedPct)
	}
	if got[1].LifeRemainingPct == nil || *got[1].LifeRemainingPct != 95.0 {
		t.Fatalf("psu1 life remaining mismatch: %#v", got[1].LifeRemainingPct)
	}
}
@@ -83,11 +83,7 @@ func isLikelyRAIDController(dev schema.HardwarePCIeDevice) bool {
 	if dev.DeviceClass == nil {
 		return false
 	}
-	c := strings.ToLower(*dev.DeviceClass)
-	return strings.Contains(c, "raid") ||
-		strings.Contains(c, "sas") ||
-		strings.Contains(c, "mass storage") ||
-		strings.Contains(c, "serial attached scsi")
+	return isRAIDClass(*dev.DeviceClass)
 }
 
 func collectStorcliDrives() []schema.HardwareStorage {
@@ -182,7 +178,10 @@ func parseSASIrcuDisplay(raw string) []schema.HardwareStorage {
 
 		present := true
 		status := mapRAIDDriveStatus(b["State"])
-		s := schema.HardwareStorage{Present: &present, Status: &status}
+		s := schema.HardwareStorage{
+			HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
+			Present:                 &present,
+		}
 
 		enclosure := strings.TrimSpace(b["Enclosure #"])
 		slot := strings.TrimSpace(b["Slot #"])
@@ -281,7 +280,10 @@ func parseArcconfPhysicalDrives(raw string) []schema.HardwareStorage {
 	for _, b := range blocks {
 		present := true
 		status := mapRAIDDriveStatus(b["State"])
-		s := schema.HardwareStorage{Present: &present, Status: &status}
+		s := schema.HardwareStorage{
+			HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
+			Present:                 &present,
+		}
 
 		if v := strings.TrimSpace(b["Reported Location"]); v != "" {
 			s.Slot = &v
@@ -362,8 +364,11 @@ func parseSSACLIPhysicalDrives(raw string) []schema.HardwareStorage {
 		if m := ssacliPhysicalDriveLine.FindStringSubmatch(trimmed); len(m) == 3 {
 			flush()
 			present := true
-			status := "UNKNOWN"
-			s := schema.HardwareStorage{Present: &present, Status: &status}
+			status := statusUnknown
+			s := schema.HardwareStorage{
+				HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
+				Present:                 &present,
+			}
 			slot := m[1]
 			s.Slot = &slot
 
@@ -475,8 +480,8 @@ func storcliDriveToStorage(d struct {
 	present := true
 	status := mapRAIDDriveStatus(d.State)
 	s := schema.HardwareStorage{
-		Present: &present,
-		Status:  &status,
+		HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
+		Present:                 &present,
 	}
 
 	if v := strings.TrimSpace(d.EIDSlt); v != "" {
@@ -527,15 +532,15 @@ func mapRAIDDriveStatus(raw string) string {
 	u := strings.ToUpper(strings.TrimSpace(raw))
 	switch {
 	case strings.Contains(u, "OK"), strings.Contains(u, "OPTIMAL"), strings.Contains(u, "READY"):
-		return "OK"
+		return statusOK
 	case strings.Contains(u, "ONLN"), strings.Contains(u, "ONLINE"):
-		return "OK"
+		return statusOK
 	case strings.Contains(u, "RBLD"), strings.Contains(u, "REBUILD"):
-		return "WARNING"
+		return statusWarning
 	case strings.Contains(u, "FAIL"), strings.Contains(u, "OFFLINE"):
-		return "CRITICAL"
+		return statusCritical
 	default:
-		return "UNKNOWN"
+		return statusUnknown
 	}
 }
 
@@ -641,8 +646,9 @@ func enrichStorageWithVROC(storage []schema.HardwareStorage, pcie []schema.Hardw
 			storage[i].Telemetry["vroc_array"] = arr.Name
 			storage[i].Telemetry["vroc_degraded"] = arr.Degraded
 			if arr.Degraded {
-				status := "WARNING"
+				status := statusWarning
 				storage[i].Status = &status
+				storage[i].ErrorDescription = stringPtr("VROC array is degraded")
 			}
 			updated++
 		}
@@ -659,14 +665,14 @@ func hasVROCController(pcie []schema.HardwarePCIeDevice) bool {
 
 		class := ""
 		if dev.DeviceClass != nil {
-			class = strings.ToLower(*dev.DeviceClass)
+			class = strings.TrimSpace(*dev.DeviceClass)
 		}
 		model := ""
 		if dev.Model != nil {
 			model = strings.ToLower(*dev.Model)
 		}
 
-		if strings.Contains(class, "raid") ||
+		if isRAIDClass(class) ||
 			strings.Contains(model, "vroc") ||
 			strings.Contains(model, "volume management device") ||
 			strings.Contains(model, "vmd") {
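Several hunks in this file rewrite `schema.HardwareStorage{Present: ..., Status: ...}` into a literal that names `HardwareComponentStatus`, while assignments such as `storage[i].Status = &status` stay unchanged. That is consistent with `Status` having moved into an embedded struct: Go promotes embedded fields for selectors but not for composite-literal keys. A minimal sketch with hypothetical types (`componentStatus`, `storage`) standing in for the schema types, whose definitions are not shown here:

```go
package main

import "fmt"

// componentStatus and storage are stand-ins for the schema types.
type componentStatus struct {
	Status           *string
	ErrorDescription *string
}

type storage struct {
	componentStatus // embedded: its fields are promoted to storage
	Present         *bool
}

// newStorage must name the embedded struct in the literal;
// storage{Status: &status} would not compile.
func newStorage(status string) storage {
	present := true
	return storage{
		componentStatus: componentStatus{Status: &status},
		Present:         &present,
	}
}

// markDegraded writes through the promoted fields directly.
func markDegraded(s *storage, reason string) {
	warn := "WARNING"
	s.Status = &warn
	s.ErrorDescription = &reason
}

func main() {
	s := newStorage("OK")
	markDegraded(&s, "array is degraded")
	fmt.Println(*s.Status, *s.ErrorDescription, *s.Present)
}
```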
334
audit/internal/collector/raid_controller_telemetry.go
Normal file
@@ -0,0 +1,334 @@
package collector

import (
	"bee/audit/internal/schema"
	"encoding/json"
	"log/slog"
	"strconv"
	"strings"
)

type raidControllerTelemetry struct {
	BatteryChargePct       *float64
	BatteryHealthPct       *float64
	BatteryTemperatureC    *float64
	BatteryVoltageV        *float64
	BatteryReplaceRequired *bool
	ErrorDescription       *string
}

func enrichPCIeWithRAIDTelemetry(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
	byVendor := collectRAIDControllerTelemetry()
	if len(byVendor) == 0 {
		return devs
	}

	positions := map[int]int{}
	for i := range devs {
		if devs[i].VendorID == nil || !isLikelyRAIDController(devs[i]) {
			continue
		}
		vendor := *devs[i].VendorID
		list := byVendor[vendor]
		if len(list) == 0 {
			continue
		}
		index := positions[vendor]
		if index >= len(list) {
			continue
		}
		positions[vendor] = index + 1
		applyRAIDControllerTelemetry(&devs[i], list[index])
	}

	return devs
}

func applyRAIDControllerTelemetry(dev *schema.HardwarePCIeDevice, tel raidControllerTelemetry) {
	if tel.BatteryChargePct != nil {
		dev.BatteryChargePct = tel.BatteryChargePct
	}
	if tel.BatteryHealthPct != nil {
		dev.BatteryHealthPct = tel.BatteryHealthPct
	}
	if tel.BatteryTemperatureC != nil {
		dev.BatteryTemperatureC = tel.BatteryTemperatureC
	}
	if tel.BatteryVoltageV != nil {
		dev.BatteryVoltageV = tel.BatteryVoltageV
	}
	if tel.BatteryReplaceRequired != nil {
		dev.BatteryReplaceRequired = tel.BatteryReplaceRequired
	}
	if tel.ErrorDescription != nil {
		dev.ErrorDescription = tel.ErrorDescription
		if dev.Status == nil || *dev.Status == statusOK {
			status := statusWarning
			dev.Status = &status
		}
	}
}

func collectRAIDControllerTelemetry() map[int][]raidControllerTelemetry {
	out := map[int][]raidControllerTelemetry{}

	if raw, err := raidToolQuery("storcli64", "/call", "show", "all", "J"); err == nil {
		list := parseStorcliControllerTelemetry(raw)
		if len(list) > 0 {
			out[vendorBroadcomLSI] = append(out[vendorBroadcomLSI], list...)
			slog.Info("raid: storcli controller telemetry", "count", len(list))
		}
	}

	if raw, err := raidToolQuery("ssacli", "ctrl", "all", "show", "config", "detail"); err == nil {
		list := parseSSACLIControllerTelemetry(string(raw))
		if len(list) > 0 {
			out[vendorHPE] = append(out[vendorHPE], list...)
			slog.Info("raid: ssacli controller telemetry", "count", len(list))
		}
	}

	if raw, err := raidToolQuery("arcconf", "getconfig", "1", "ad"); err == nil {
		list := parseArcconfControllerTelemetry(string(raw))
		if len(list) > 0 {
			out[vendorAdaptec] = append(out[vendorAdaptec], list...)
			slog.Info("raid: arcconf controller telemetry", "count", len(list))
		}
	}

	return out
}

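`enrichPCIeWithRAIDTelemetry` pairs telemetry with devices by position: the n-th controller of a vendor receives the n-th telemetry record collected for that vendor. This appears to rely on the vendor tool listing controllers in the same order as the PCIe device slice. A minimal sketch of the pairing, with hypothetical types and plain strings in place of telemetry records; the vendor IDs are illustrative values only.

```go
package main

import "fmt"

// controller is a stand-in for a PCIe device entry.
type controller struct {
	VendorID int
	Battery  string
}

// pairByVendor hands out each vendor's records in order, one per device,
// and leaves devices untouched once the vendor's list is exhausted.
func pairByVendor(devs []controller, byVendor map[int][]string) {
	next := map[int]int{} // vendor ID -> index of the next unused record
	for i := range devs {
		list := byVendor[devs[i].VendorID]
		idx := next[devs[i].VendorID]
		if idx >= len(list) {
			continue
		}
		next[devs[i].VendorID] = idx + 1
		devs[i].Battery = list[idx]
	}
}

func main() {
	devs := []controller{{VendorID: 0x1000}, {VendorID: 0x9005}, {VendorID: 0x1000}, {VendorID: 0x1000}}
	pairByVendor(devs, map[int][]string{0x1000: {"first", "second"}})
	fmt.Printf("%+v\n", devs)
}
```

If the two orderings ever differ, records would land on the wrong controller without any error, so matching on a stable identifier would be more robust where the tool output provides one.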
func parseStorcliControllerTelemetry(raw []byte) []raidControllerTelemetry {
	var doc struct {
		Controllers []struct {
			ResponseData map[string]any `json:"Response Data"`
		} `json:"Controllers"`
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		slog.Warn("raid: parse storcli controller telemetry failed", "err", err)
		return nil
	}

	var out []raidControllerTelemetry
	for _, ctl := range doc.Controllers {
		tel := raidControllerTelemetry{}
		mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info"]))
		mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info_Details"]))
		mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info"]))
		mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info_Details"]))
		if hasRAIDControllerTelemetry(tel) {
			out = append(out, tel)
		}
	}
	return out
}

func nestedStringMap(raw any) map[string]string {
	switch value := raw.(type) {
	case map[string]any:
		out := map[string]string{}
		flattenStringMap("", value, out)
		return out
	case []any:
		out := map[string]string{}
		for _, item := range value {
			if m, ok := item.(map[string]any); ok {
				flattenStringMap("", m, out)
			}
		}
		return out
	default:
		return nil
	}
}

func flattenStringMap(prefix string, in map[string]any, out map[string]string) {
	for key, raw := range in {
		fullKey := strings.TrimSpace(strings.ToLower(strings.Trim(prefix+" "+key, " ")))
		switch value := raw.(type) {
		case map[string]any:
			flattenStringMap(fullKey, value, out)
		case []any:
			for _, item := range value {
				if m, ok := item.(map[string]any); ok {
					flattenStringMap(fullKey, m, out)
				}
			}
		case string:
			out[fullKey] = value
		case json.Number:
			out[fullKey] = value.String()
		case float64:
			out[fullKey] = strconv.FormatFloat(value, 'f', -1, 64)
		case bool:
			if value {
				out[fullKey] = "true"
			} else {
				out[fullKey] = "false"
			}
		}
	}
}

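`flattenStringMap` turns nested JSON objects into a flat map whose keys are the lowercased, space-joined path. A minimal sketch of that idea on a made-up document follows; `flatten` and `flattenJSON` are hypothetical, arrays are omitted, and the field names are not taken from real storcli output. With plain `json.Unmarshal` into `any`, numbers arrive as `float64`; `json.Number` values only appear when a `json.Decoder` is used with `UseNumber`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// flatten writes every scalar in `in` to `out` under a lowercased key made of
// the path segments joined by spaces.
func flatten(prefix string, in map[string]any, out map[string]string) {
	for k, v := range in {
		key := strings.ToLower(strings.TrimSpace(prefix + " " + k))
		switch t := v.(type) {
		case map[string]any:
			flatten(key, t, out)
		case string:
			out[key] = t
		case float64:
			out[key] = strconv.FormatFloat(t, 'f', -1, 64)
		case bool:
			out[key] = strconv.FormatBool(t)
		}
	}
}

// flattenJSON decodes raw and flattens the top-level object.
func flattenJSON(raw []byte) (map[string]string, error) {
	var doc map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	out := map[string]string{}
	flatten("", doc, out)
	return out, nil
}

func main() {
	raw := []byte(`{"Temperature":"32 C","Voltage":3.9,"Status":{"Replacement Required":false}}`)
	flat, err := flattenJSON(raw)
	fmt.Println(flat, err)
}
```

Flattening lets the caller match on substrings such as "temperature" or "replace" without knowing how deeply a given firmware version nests the field.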
func mergeStorcliBatteryMap(tel *raidControllerTelemetry, fields map[string]string) {
	if len(fields) == 0 {
		return
	}
	for key, raw := range fields {
		lower := strings.ToLower(strings.TrimSpace(key))
		switch {
		case strings.Contains(lower, "relative state of charge"), strings.Contains(lower, "remaining capacity"), strings.Contains(lower, "charge"):
			if tel.BatteryChargePct == nil {
				tel.BatteryChargePct = parsePercentPtr(raw)
			}
		case strings.Contains(lower, "state of health"), strings.Contains(lower, "health"):
			if tel.BatteryHealthPct == nil {
				tel.BatteryHealthPct = parsePercentPtr(raw)
			}
		case strings.Contains(lower, "temperature"):
			if tel.BatteryTemperatureC == nil {
				tel.BatteryTemperatureC = parseFloatPtr(raw)
			}
		case strings.Contains(lower, "voltage"):
			if tel.BatteryVoltageV == nil {
				tel.BatteryVoltageV = parseFloatPtr(raw)
			}
		case strings.Contains(lower, "replace"), strings.Contains(lower, "replacement required"):
			if tel.BatteryReplaceRequired == nil {
				tel.BatteryReplaceRequired = parseReplaceRequired(raw)
			}
		case strings.Contains(lower, "learn cycle requested"), strings.Contains(lower, "battery state"), strings.Contains(lower, "capacitance state"):
			if desc := batteryStateDescription(raw); desc != nil && tel.ErrorDescription == nil {
				tel.ErrorDescription = desc
			}
		}
	}
}

func parseSSACLIControllerTelemetry(raw string) []raidControllerTelemetry {
	lines := strings.Split(raw, "\n")
	var out []raidControllerTelemetry
	var current *raidControllerTelemetry

	flush := func() {
		if current != nil && hasRAIDControllerTelemetry(*current) {
			out = append(out, *current)
		}
		current = nil
	}

	for _, line := range lines {
		trimmed := strings.TrimSpace(line)
		if trimmed == "" {
			continue
		}
		if strings.HasPrefix(strings.ToLower(trimmed), "smart array") || strings.HasPrefix(strings.ToLower(trimmed), "controller ") {
			flush()
			current = &raidControllerTelemetry{}
			continue
		}
		if current == nil {
			continue
		}
		if idx := strings.Index(trimmed, ":"); idx > 0 {
			key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
			val := strings.TrimSpace(trimmed[idx+1:])
			switch {
			case strings.Contains(key, "capacitor temperature"), strings.Contains(key, "battery temperature"):
				current.BatteryTemperatureC = parseFloatPtr(val)
			case strings.Contains(key, "capacitor voltage"), strings.Contains(key, "battery voltage"):
				current.BatteryVoltageV = parseFloatPtr(val)
			case strings.Contains(key, "capacitor charge"), strings.Contains(key, "battery charge"):
				current.BatteryChargePct = parsePercentPtr(val)
			case strings.Contains(key, "capacitor health"), strings.Contains(key, "battery health"):
				current.BatteryHealthPct = parsePercentPtr(val)
			case strings.Contains(key, "replace") || strings.Contains(key, "failed"):
				if current.BatteryReplaceRequired == nil {
					current.BatteryReplaceRequired = parseReplaceRequired(val)
				}
				if desc := batteryStateDescription(val); desc != nil && current.ErrorDescription == nil {
					current.ErrorDescription = desc
				}
			}
		}
	}
	flush()
	return out
}

func parseArcconfControllerTelemetry(raw string) []raidControllerTelemetry {
|
||||
lines := strings.Split(raw, "\n")
|
||||
tel := raidControllerTelemetry{}
|
||||
for _, line := range lines {
|
||||
trimmed := strings.TrimSpace(line)
|
||||
if idx := strings.Index(trimmed, ":"); idx > 0 {
|
||||
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
|
||||
val := strings.TrimSpace(trimmed[idx+1:])
|
||||
switch {
|
||||
case strings.Contains(key, "battery temperature"), strings.Contains(key, "capacitor temperature"):
|
||||
tel.BatteryTemperatureC = parseFloatPtr(val)
|
||||
case strings.Contains(key, "battery voltage"), strings.Contains(key, "capacitor voltage"):
|
||||
tel.BatteryVoltageV = parseFloatPtr(val)
|
||||
case strings.Contains(key, "battery charge"), strings.Contains(key, "capacitor charge"):
|
||||
tel.BatteryChargePct = parsePercentPtr(val)
|
||||
case strings.Contains(key, "battery health"), strings.Contains(key, "capacitor health"):
|
||||
tel.BatteryHealthPct = parsePercentPtr(val)
|
||||
case strings.Contains(key, "replace"), strings.Contains(key, "failed"):
|
||||
if tel.BatteryReplaceRequired == nil {
|
||||
tel.BatteryReplaceRequired = parseReplaceRequired(val)
|
||||
}
|
||||
if desc := batteryStateDescription(val); desc != nil && tel.ErrorDescription == nil {
|
||||
tel.ErrorDescription = desc
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
if hasRAIDControllerTelemetry(tel) {
|
||||
return []raidControllerTelemetry{tel}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func hasRAIDControllerTelemetry(tel raidControllerTelemetry) bool {
	return tel.BatteryChargePct != nil ||
		tel.BatteryHealthPct != nil ||
		tel.BatteryTemperatureC != nil ||
		tel.BatteryVoltageV != nil ||
		tel.BatteryReplaceRequired != nil ||
		tel.ErrorDescription != nil
}

func parsePercentPtr(raw string) *float64 {
	raw = strings.ReplaceAll(strings.TrimSpace(raw), "%", "")
	return parseFloatPtr(raw)
}
|
||||
|
||||
func parseReplaceRequired(raw string) *bool {
	lower := strings.ToLower(strings.TrimSpace(raw))
	switch {
	case lower == "":
		return nil
	case strings.Contains(lower, "replace"), strings.Contains(lower, "failed"), strings.Contains(lower, "yes"), strings.Contains(lower, "required"):
		value := true
		return &value
	case strings.Contains(lower, "no"), strings.Contains(lower, "ok"), strings.Contains(lower, "good"), strings.Contains(lower, "optimal"):
		value := false
		return &value
	default:
		return nil
	}
}
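The substring checks in parseReplaceRequired are order-sensitive: a value such as "Not required" contains "required" and is reported as true before the "no" branch is reached, and "no" also matches inside words like "unknown". A minimal sketch of a word-based variant follows, assuming the tool output uses short English phrases. The name `parseReplaceRequiredTokens` and its word lists are hypothetical and not part of this change.

```go
package main

import "strings"

// parseReplaceRequiredTokens is a hypothetical word-based variant of
// parseReplaceRequired. It splits the value into lowercase words so that
// "no" does not match inside "unknown", and lets an explicit negation
// override words such as "required".
func parseReplaceRequiredTokens(raw string) *bool {
	words := strings.FieldsFunc(strings.ToLower(raw), func(r rune) bool {
		return !(r >= 'a' && r <= 'z')
	})
	if len(words) == 0 {
		return nil
	}
	negated, positive, healthy := false, false, false
	for _, w := range words {
		switch w {
		case "no", "not", "none":
			negated = true
		case "yes", "failed", "required", "replace", "replacement":
			positive = true
		case "ok", "good", "optimal":
			healthy = true
		}
	}
	var value bool
	switch {
	case positive:
		// "required" -> true, "not required" -> false.
		value = !negated
	case healthy && negated:
		// "not ok" is ambiguous for this field; report unknown.
		return nil
	case healthy || negated:
		// "ok", "good", or a bare "no".
		value = false
	default:
		return nil
	}
	return &value
}
```

With this variant, "No" and "Not required" both yield false, "Replacement required" yields true, and "Unknown" yields nil instead of false. Whether that matches every vendor tool's wording would need checking against real output.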
|
||||
|
||||
func batteryStateDescription(raw string) *string {
	lower := strings.ToLower(strings.TrimSpace(raw))
	if lower == "" {
		return nil
	}
	switch {
	case strings.Contains(lower, "failed"), strings.Contains(lower, "fault"), strings.Contains(lower, "replace"), strings.Contains(lower, "warning"), strings.Contains(lower, "degraded"):
		return &raw
	default:
		return nil
	}
}
|
||||
@@ -1,6 +1,10 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"errors"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestParseSASIrcuControllerIDs(t *testing.T) {
|
||||
raw := `LSI Corporation SAS2 IR Configuration Utility.
|
||||
@@ -90,7 +94,111 @@ physicaldrive 1I:1:2 (894 GB, SAS HDD, Failed)
|
||||
if drives[0].Status == nil || *drives[0].Status != "OK" {
|
||||
t.Fatalf("drive0 status: %v", drives[0].Status)
|
||||
}
|
||||
if drives[1].Status == nil || *drives[1].Status != "CRITICAL" {
|
||||
if drives[1].Status == nil || *drives[1].Status != statusCritical {
|
||||
t.Fatalf("drive1 status: %v", drives[1].Status)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseStorcliControllerTelemetry(t *testing.T) {
|
||||
raw := []byte(`{
|
||||
"Controllers": [
|
||||
{
|
||||
"Response Data": {
|
||||
"BBU_Info": {
|
||||
"State of Health": "98 %",
|
||||
"Relative State of Charge": "76 %",
|
||||
"Temperature": "41 C",
|
||||
"Voltage": "12.3 V",
|
||||
"Replacement required": "No"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}`)
|
||||
got := parseStorcliControllerTelemetry(raw)
|
||||
if len(got) != 1 {
|
||||
t.Fatalf("len(got)=%d want 1", len(got))
|
||||
}
|
||||
if got[0].BatteryHealthPct == nil || *got[0].BatteryHealthPct != 98 {
|
||||
t.Fatalf("battery health=%v", got[0].BatteryHealthPct)
|
||||
}
|
||||
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 76 {
|
||||
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
|
||||
}
|
||||
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 41 {
|
||||
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
|
||||
}
|
||||
if got[0].BatteryVoltageV == nil || *got[0].BatteryVoltageV != 12.3 {
|
||||
t.Fatalf("battery voltage=%v", got[0].BatteryVoltageV)
|
||||
}
|
||||
if got[0].BatteryReplaceRequired == nil || *got[0].BatteryReplaceRequired {
|
||||
t.Fatalf("battery replace=%v", got[0].BatteryReplaceRequired)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseSSACLIControllerTelemetry(t *testing.T) {
|
||||
raw := `Smart Array P440ar in Slot 0
|
||||
Battery/Capacitor Count: 1
|
||||
Capacitor Temperature (C): 37
|
||||
Capacitor Charge (%): 94
|
||||
Capacitor Health (%): 96
|
||||
Capacitor Voltage (V): 9.8
|
||||
Capacitor Failed: No
|
||||
`
|
||||
got := parseSSACLIControllerTelemetry(raw)
|
||||
if len(got) != 1 {
|
||||
t.Fatalf("len(got)=%d want 1", len(got))
|
||||
}
|
||||
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 37 {
|
||||
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
|
||||
}
|
||||
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 94 {
|
||||
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnrichPCIeWithRAIDTelemetry(t *testing.T) {
|
||||
orig := raidToolQuery
|
||||
t.Cleanup(func() { raidToolQuery = orig })
|
||||
raidToolQuery = func(name string, args ...string) ([]byte, error) {
|
||||
switch name {
|
||||
case "storcli64":
|
||||
return []byte(`{
|
||||
"Controllers": [
|
||||
{
|
||||
"Response Data": {
|
||||
"CV_Info": {
|
||||
"State of Health": "99 %",
|
||||
"Relative State of Charge": "81 %",
|
||||
"Temperature": "38 C",
|
||||
"Voltage": "12.1 V",
|
||||
"Replacement required": "No"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}`), nil
|
||||
default:
|
||||
return nil, errors.New("skip")
|
||||
}
|
||||
}
|
||||
|
||||
vendor := vendorBroadcomLSI
|
||||
class := "MassStorageController"
|
||||
status := statusOK
|
||||
devs := []schema.HardwarePCIeDevice{{
|
||||
VendorID: &vendor,
|
||||
DeviceClass: &class,
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
}}
|
||||
out := enrichPCIeWithRAIDTelemetry(devs)
|
||||
if out[0].BatteryHealthPct == nil || *out[0].BatteryHealthPct != 99 {
|
||||
t.Fatalf("battery health=%v", out[0].BatteryHealthPct)
|
||||
}
|
||||
if out[0].BatteryChargePct == nil || *out[0].BatteryChargePct != 81 {
|
||||
t.Fatalf("battery charge=%v", out[0].BatteryChargePct)
|
||||
}
|
||||
if out[0].BatteryVoltageV == nil || *out[0].BatteryVoltageV != 12.1 {
|
||||
t.Fatalf("battery voltage=%v", out[0].BatteryVoltageV)
|
||||
}
|
||||
}
|
||||
|
||||
373
audit/internal/collector/sensors.go
Normal file
@@ -0,0 +1,373 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"encoding/json"
|
||||
"log/slog"
|
||||
"os/exec"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
type sensorsDoc map[string]map[string]any
|
||||
|
||||
func collectSensors() *schema.HardwareSensors {
|
||||
doc, err := readSensorsJSONDoc()
|
||||
if err != nil {
|
||||
slog.Info("sensors: unavailable, skipping", "err", err)
|
||||
return nil
|
||||
}
|
||||
sensors := buildSensorsFromDoc(doc)
|
||||
if sensors == nil || (len(sensors.Fans) == 0 && len(sensors.Power) == 0 && len(sensors.Temperatures) == 0 && len(sensors.Other) == 0) {
|
||||
return nil
|
||||
}
|
||||
slog.Info("sensors: collected",
|
||||
"fans", len(sensors.Fans),
|
||||
"power", len(sensors.Power),
|
||||
"temperatures", len(sensors.Temperatures),
|
||||
"other", len(sensors.Other),
|
||||
)
|
||||
return sensors
|
||||
}
|
||||
|
||||
func readSensorsJSONDoc() (sensorsDoc, error) {
|
||||
out, err := exec.Command("sensors", "-j").Output()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
var doc sensorsDoc
|
||||
if err := json.Unmarshal(out, &doc); err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return doc, nil
|
||||
}
|
||||
|
||||
func buildSensorsFromDoc(doc sensorsDoc) *schema.HardwareSensors {
|
||||
if len(doc) == 0 {
|
||||
return nil
|
||||
}
|
||||
result := &schema.HardwareSensors{}
|
||||
seen := map[string]struct{}{}
|
||||
|
||||
chips := make([]string, 0, len(doc))
|
||||
for chip := range doc {
|
||||
chips = append(chips, chip)
|
||||
}
|
||||
sort.Strings(chips)
|
||||
|
||||
for _, chip := range chips {
|
||||
features := doc[chip]
|
||||
location := sensorLocation(chip)
|
||||
|
||||
keys := make([]string, 0, len(features))
|
||||
for key := range features {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
|
||||
for _, key := range keys {
|
||||
if strings.EqualFold(key, "Adapter") {
|
||||
continue
|
||||
}
|
||||
feature, ok := features[key].(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
name := strings.TrimSpace(key)
|
||||
if name == "" {
|
||||
continue
|
||||
}
|
||||
switch classifySensorFeature(feature) {
|
||||
case "fan":
|
||||
item := buildFanSensor(name, location, feature)
|
||||
if item == nil || duplicateSensor(seen, "fan", item.Name) {
|
||||
continue
|
||||
}
|
||||
result.Fans = append(result.Fans, *item)
|
||||
case "temp":
|
||||
item := buildTempSensor(name, location, feature)
|
||||
if item == nil || duplicateSensor(seen, "temp", item.Name) {
|
||||
continue
|
||||
}
|
||||
result.Temperatures = append(result.Temperatures, *item)
|
||||
case "power":
|
||||
item := buildPowerSensor(name, location, feature)
|
||||
if item == nil || duplicateSensor(seen, "power", item.Name) {
|
||||
continue
|
||||
}
|
||||
result.Power = append(result.Power, *item)
|
||||
default:
|
||||
item := buildOtherSensor(name, location, feature)
|
||||
if item == nil || duplicateSensor(seen, "other", item.Name) {
|
||||
continue
|
||||
}
|
||||
result.Other = append(result.Other, *item)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return result
|
||||
}
|
||||
|
||||
func parseSensorsJSON(raw []byte) (*schema.HardwareSensors, error) {
|
||||
var doc sensorsDoc
|
||||
err := json.Unmarshal(raw, &doc)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return buildSensorsFromDoc(doc), nil
|
||||
}
|
||||
|
||||
func duplicateSensor(seen map[string]struct{}, sensorType, name string) bool {
|
||||
key := sensorType + "\x00" + name
|
||||
if _, ok := seen[key]; ok {
|
||||
return true
|
||||
}
|
||||
seen[key] = struct{}{}
|
||||
return false
|
||||
}
|
||||
|
||||
func sensorLocation(chip string) *string {
|
||||
chip = strings.TrimSpace(chip)
|
||||
if chip == "" {
|
||||
return nil
|
||||
}
|
||||
return &chip
|
||||
}
|
||||
|
||||
func classifySensorFeature(feature map[string]any) string {
|
||||
for key := range feature {
|
||||
switch {
|
||||
case strings.Contains(key, "fan") && strings.HasSuffix(key, "_input"):
|
||||
return "fan"
|
||||
case strings.Contains(key, "temp") && strings.HasSuffix(key, "_input"):
|
||||
return "temp"
|
||||
case strings.Contains(key, "power") && (strings.HasSuffix(key, "_input") || strings.HasSuffix(key, "_average")):
|
||||
return "power"
|
||||
case strings.Contains(key, "curr") && strings.HasSuffix(key, "_input"):
|
||||
return "power"
|
||||
case strings.HasPrefix(key, "in") && strings.HasSuffix(key, "_input"):
|
||||
return "power"
|
||||
}
|
||||
}
|
||||
return "other"
|
||||
}
|
||||
|
||||
func buildFanSensor(name string, location *string, feature map[string]any) *schema.HardwareFanSensor {
|
||||
rpm, ok := firstFeatureInt(feature, "_input")
|
||||
if !ok {
|
||||
return nil
|
||||
}
|
||||
item := &schema.HardwareFanSensor{Name: name, Location: location, RPM: &rpm}
|
||||
if status := sensorStatusFromFeature(feature); status != nil {
|
||||
item.Status = status
|
||||
}
|
||||
return item
|
||||
}
|
||||
|
||||
func buildTempSensor(name string, location *string, feature map[string]any) *schema.HardwareTemperatureSensor {
|
||||
celsius, ok := firstFeatureFloat(feature, "_input")
|
||||
if !ok {
|
||||
return nil
|
||||
}
|
||||
item := &schema.HardwareTemperatureSensor{Name: name, Location: location, Celsius: &celsius}
|
||||
if warning, ok := firstFeatureFloatWithSuffixes(feature, []string{"_max", "_high"}); ok {
|
||||
item.ThresholdWarningCelsius = &warning
|
||||
}
|
||||
if critical, ok := firstFeatureFloatWithSuffixes(feature, []string{"_crit", "_emergency"}); ok {
|
||||
item.ThresholdCriticalCelsius = &critical
|
||||
}
|
||||
if status := sensorStatusFromFeature(feature); status != nil {
|
||||
item.Status = status
|
||||
} else {
|
||||
item.Status = deriveTemperatureStatus(item.Celsius, item.ThresholdWarningCelsius, item.ThresholdCriticalCelsius)
|
||||
}
|
||||
return item
|
||||
}
|
||||
|
||||
func buildPowerSensor(name string, location *string, feature map[string]any) *schema.HardwarePowerSensor {
|
||||
item := &schema.HardwarePowerSensor{Name: name, Location: location}
|
||||
if v, ok := firstFeatureFloatWithContains(feature, []string{"power"}); ok {
|
||||
item.PowerW = &v
|
||||
}
|
||||
if v, ok := firstFeatureFloatWithPrefix(feature, "curr"); ok {
|
||||
item.CurrentA = &v
|
||||
}
|
||||
if v, ok := firstFeatureFloatWithPrefix(feature, "in"); ok {
|
||||
item.VoltageV = &v
|
||||
}
|
||||
if item.PowerW == nil && item.CurrentA == nil && item.VoltageV == nil {
|
||||
return nil
|
||||
}
|
||||
if status := sensorStatusFromFeature(feature); status != nil {
|
||||
item.Status = status
|
||||
}
|
||||
return item
|
||||
}
|
||||
|
||||
func buildOtherSensor(name string, location *string, feature map[string]any) *schema.HardwareOtherSensor {
|
||||
value, unit, ok := firstGenericSensorValue(feature)
|
||||
if !ok {
|
||||
return nil
|
||||
}
|
||||
item := &schema.HardwareOtherSensor{Name: name, Location: location, Value: &value}
|
||||
if unit != "" {
|
||||
item.Unit = &unit
|
||||
}
|
||||
if status := sensorStatusFromFeature(feature); status != nil {
|
||||
item.Status = status
|
||||
}
|
||||
return item
|
||||
}
|
||||
|
||||
func sensorStatusFromFeature(feature map[string]any) *string {
|
||||
for key, raw := range feature {
|
||||
if !strings.HasSuffix(key, "_alarm") {
|
||||
continue
|
||||
}
|
||||
if number, ok := floatFromAny(raw); ok && number > 0 {
|
||||
status := statusWarning
|
||||
return &status
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func deriveTemperatureStatus(current, warning, critical *float64) *string {
	if current == nil {
		return nil
	}
	switch {
	case critical != nil && *current >= *critical:
		status := statusCritical
		return &status
	case warning != nil && *current >= *warning:
		status := statusWarning
		return &status
	default:
		status := statusOK
		return &status
	}
}
|
||||
|
||||
func firstFeatureInt(feature map[string]any, suffix string) (int, bool) {
	// Iterate in sorted key order so the result is deterministic when more
	// than one key carries the suffix (Go map iteration order is random).
	for _, key := range sortedFeatureKeys(feature) {
		if strings.HasSuffix(key, suffix) {
			if value, ok := floatFromAny(feature[key]); ok {
				return int(value), true
			}
		}
	}
	return 0, false
}
|
||||
|
||||
func firstFeatureFloat(feature map[string]any, suffix string) (float64, bool) {
|
||||
return firstFeatureFloatWithSuffixes(feature, []string{suffix})
|
||||
}
|
||||
|
||||
func firstFeatureFloatWithSuffixes(feature map[string]any, suffixes []string) (float64, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
for _, suffix := range suffixes {
|
||||
if strings.HasSuffix(key, suffix) {
|
||||
if value, ok := floatFromAny(feature[key]); ok {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func firstFeatureFloatWithContains(feature map[string]any, parts []string) (float64, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
matched := true
|
||||
for _, part := range parts {
|
||||
if !strings.Contains(key, part) {
|
||||
matched = false
|
||||
break
|
||||
}
|
||||
}
|
||||
if matched {
|
||||
if value, ok := floatFromAny(feature[key]); ok {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func firstFeatureFloatWithPrefix(feature map[string]any, prefix string) (float64, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
if strings.HasPrefix(key, prefix) && strings.HasSuffix(key, "_input") {
|
||||
if value, ok := floatFromAny(feature[key]); ok {
|
||||
return value, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func firstGenericSensorValue(feature map[string]any) (float64, string, bool) {
|
||||
keys := sortedFeatureKeys(feature)
|
||||
for _, key := range keys {
|
||||
if strings.HasSuffix(key, "_alarm") {
|
||||
continue
|
||||
}
|
||||
value, ok := floatFromAny(feature[key])
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
unit := inferSensorUnit(key)
|
||||
return value, unit, true
|
||||
}
|
||||
return 0, "", false
|
||||
}
|
||||
|
||||
func inferSensorUnit(key string) string {
|
||||
switch {
|
||||
case strings.Contains(key, "humidity"):
|
||||
return "%"
|
||||
case strings.Contains(key, "intrusion"):
|
||||
return ""
|
||||
default:
|
||||
return ""
|
||||
}
|
||||
}
|
||||
|
||||
func sortedFeatureKeys(feature map[string]any) []string {
|
||||
keys := make([]string, 0, len(feature))
|
||||
for key := range feature {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
return keys
|
||||
}
|
||||
|
||||
func floatFromAny(raw any) (float64, bool) {
|
||||
switch value := raw.(type) {
|
||||
case float64:
|
||||
return value, true
|
||||
case float32:
|
||||
return float64(value), true
|
||||
case int:
|
||||
return float64(value), true
|
||||
case int64:
|
||||
return float64(value), true
|
||||
case json.Number:
|
||||
if f, err := value.Float64(); err == nil {
|
||||
return f, true
|
||||
}
|
||||
case string:
|
||||
if value == "" {
|
||||
return 0, false
|
||||
}
|
||||
if f, err := strconv.ParseFloat(value, 64); err == nil {
|
||||
return f, true
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
54
audit/internal/collector/sensors_test.go
Normal file
@@ -0,0 +1,54 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
|
||||
func TestParseSensorsJSON(t *testing.T) {
|
||||
raw := []byte(`{
|
||||
"coretemp-isa-0000": {
|
||||
"Adapter": "ISA adapter",
|
||||
"Package id 0": {
|
||||
"temp1_input": 61.5,
|
||||
"temp1_max": 80.0,
|
||||
"temp1_crit": 95.0
|
||||
},
|
||||
"fan1": {
|
||||
"fan1_input": 4200
|
||||
}
|
||||
},
|
||||
"acpitz-acpi-0": {
|
||||
"Adapter": "ACPI interface",
|
||||
"in0": {
|
||||
"in0_input": 12.06
|
||||
},
|
||||
"curr1": {
|
||||
"curr1_input": 0.64
|
||||
},
|
||||
"power1": {
|
||||
"power1_average": 137.0
|
||||
},
|
||||
"humidity1": {
|
||||
"humidity1_input": 38.5
|
||||
}
|
||||
}
|
||||
}`)
|
||||
|
||||
got, err := parseSensorsJSON(raw)
|
||||
if err != nil {
|
||||
t.Fatalf("parseSensorsJSON error: %v", err)
|
||||
}
|
||||
if got == nil {
|
||||
t.Fatal("expected sensors")
|
||||
}
|
||||
if len(got.Temperatures) != 1 || got.Temperatures[0].Celsius == nil || *got.Temperatures[0].Celsius != 61.5 {
|
||||
t.Fatalf("temperatures mismatch: %#v", got.Temperatures)
|
||||
}
|
||||
if len(got.Fans) != 1 || got.Fans[0].RPM == nil || *got.Fans[0].RPM != 4200 {
|
||||
t.Fatalf("fans mismatch: %#v", got.Fans)
|
||||
}
|
||||
if len(got.Power) != 3 {
|
||||
t.Fatalf("power sensors mismatch: %#v", got.Power)
|
||||
}
|
||||
if len(got.Other) != 1 || got.Other[0].Unit == nil || *got.Other[0].Unit != "%" {
|
||||
t.Fatalf("other sensors mismatch: %#v", got.Other)
|
||||
}
|
||||
}
|
||||
@@ -5,11 +5,13 @@ import (
|
||||
"encoding/json"
|
||||
"log/slog"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func collectStorage() []schema.HardwareStorage {
|
||||
devs := lsblkDevices()
|
||||
devs := discoverStorageDevices()
|
||||
result := make([]schema.HardwareStorage, 0, len(devs))
|
||||
for _, dev := range devs {
|
||||
var s schema.HardwareStorage
|
||||
@@ -26,19 +28,60 @@ func collectStorage() []schema.HardwareStorage {
|
||||
|
||||
// lsblkDevice is a minimal lsblk JSON record.
|
||||
type lsblkDevice struct {
|
||||
	Name   string `json:"name"`
	Type   string `json:"type"`
	Size   string `json:"size"`
	Serial string `json:"serial"`
	Model  string `json:"model"`
	Tran   string `json:"tran"`
	Hctl   string `json:"hctl"`
|
||||
}
|
||||
|
||||
type lsblkRoot struct {
|
||||
Blockdevices []lsblkDevice `json:"blockdevices"`
|
||||
}
|
||||
|
||||
type nvmeListRoot struct {
|
||||
Devices []nvmeListDevice `json:"Devices"`
|
||||
}
|
||||
|
||||
type nvmeListDevice struct {
|
||||
DevicePath string `json:"DevicePath"`
|
||||
ModelNumber string `json:"ModelNumber"`
|
||||
SerialNumber string `json:"SerialNumber"`
|
||||
Firmware string `json:"Firmware"`
|
||||
PhysicalSize int64 `json:"PhysicalSize"`
|
||||
}
|
||||
|
||||
func discoverStorageDevices() []lsblkDevice {
	merged := map[string]lsblkDevice{}
	for _, dev := range lsblkDevices() {
		if dev.Name == "" {
			continue
		}
		merged[dev.Name] = dev
	}
	for _, dev := range nvmeListDevices() {
		if dev.Name == "" {
			continue
		}
		current := merged[dev.Name]
		merged[dev.Name] = mergeStorageDevice(current, dev)
	}

	disks := make([]lsblkDevice, 0, len(merged))
	for _, dev := range merged {
		if dev.Type == "" {
			dev.Type = "disk"
		}
		if dev.Type != "disk" {
			continue
		}
		disks = append(disks, dev)
	}
	return disks
}
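discoverStorageDevices collects the merged devices by ranging over a map, so the order of the returned slice can differ between runs. If stable output matters, for example when comparing two audit snapshots, the result could be sorted by device name. A minimal sketch follows, assuming a reduced `device` struct and a hypothetical helper `sortedDisks`; neither is part of this change.

```go
package main

import "sort"

// device is a reduced, hypothetical stand-in for lsblkDevice.
type device struct {
	Name string
	Type string
}

// sortedDisks keeps only entries of type "disk" (treating an empty type as
// "disk", as discoverStorageDevices does) and returns them ordered by name.
func sortedDisks(merged map[string]device) []device {
	disks := make([]device, 0, len(merged))
	for _, dev := range merged {
		if dev.Type == "" {
			dev.Type = "disk"
		}
		if dev.Type != "disk" {
			continue
		}
		disks = append(disks, dev)
	}
	// Sorting removes the dependence on Go's randomized map iteration order.
	sort.Slice(disks, func(i, j int) bool { return disks[i].Name < disks[j].Name })
	return disks
}
```

Applying the same idea in the collector would require importing `sort` in the storage file, which this diff does not do.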
|
||||
|
||||
func lsblkDevices() []lsblkDevice {
|
||||
out, err := exec.Command("lsblk", "-J", "-d",
|
||||
"-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL").Output()
|
||||
@@ -60,6 +103,59 @@ func lsblkDevices() []lsblkDevice {
|
||||
return disks
|
||||
}
|
||||
|
||||
func nvmeListDevices() []lsblkDevice {
|
||||
out, err := exec.Command("nvme", "list", "-o", "json").Output()
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
var root nvmeListRoot
|
||||
if err := json.Unmarshal(out, &root); err != nil {
|
||||
slog.Warn("storage: nvme list parse failed", "err", err)
|
||||
return nil
|
||||
}
|
||||
devices := make([]lsblkDevice, 0, len(root.Devices))
|
||||
for _, dev := range root.Devices {
|
||||
name := filepath.Base(strings.TrimSpace(dev.DevicePath))
|
||||
if name == "" {
|
||||
continue
|
||||
}
|
||||
devices = append(devices, lsblkDevice{
|
||||
Name: name,
|
||||
Type: "disk",
|
||||
Size: strconv.FormatInt(dev.PhysicalSize, 10),
|
||||
Serial: strings.TrimSpace(dev.SerialNumber),
|
||||
Model: strings.TrimSpace(dev.ModelNumber),
|
||||
Tran: "nvme",
|
||||
})
|
||||
}
|
||||
return devices
|
||||
}
|
||||
|
||||
func mergeStorageDevice(existing, incoming lsblkDevice) lsblkDevice {
	if existing.Name == "" {
		return incoming
	}
	if existing.Type == "" {
		existing.Type = incoming.Type
	}
	if strings.TrimSpace(existing.Size) == "" {
		existing.Size = incoming.Size
	}
	if strings.TrimSpace(existing.Serial) == "" {
		existing.Serial = incoming.Serial
	}
	if strings.TrimSpace(existing.Model) == "" {
		existing.Model = incoming.Model
	}
	if strings.TrimSpace(existing.Tran) == "" {
		existing.Tran = incoming.Tran
	}
	if strings.TrimSpace(existing.Hctl) == "" {
		existing.Hctl = incoming.Hctl
	}
	return existing
}
|
||||
|
||||
// smartctlInfo is the subset of smartctl -j -a output we care about.
|
||||
type smartctlInfo struct {
|
||||
ModelFamily string `json:"model_family"`
|
||||
@@ -67,14 +163,22 @@ type smartctlInfo struct {
|
||||
SerialNumber string `json:"serial_number"`
|
||||
FirmwareVer string `json:"firmware_version"`
|
||||
RotationRate int `json:"rotation_rate"`
|
||||
Temperature struct {
|
||||
Current int `json:"current"`
|
||||
} `json:"temperature"`
|
||||
SmartStatus struct {
|
||||
Passed bool `json:"passed"`
|
||||
} `json:"smart_status"`
|
||||
UserCapacity struct {
|
||||
Bytes int64 `json:"bytes"`
|
||||
} `json:"user_capacity"`
|
||||
AtaSmartAttributes struct {
|
||||
Table []struct {
|
||||
ID int `json:"id"`
|
||||
Name string `json:"name"`
|
||||
Raw struct{ Value int64 `json:"value"` } `json:"raw"`
|
||||
ID int `json:"id"`
|
||||
Name string `json:"name"`
|
||||
Raw struct {
|
||||
Value int64 `json:"value"`
|
||||
} `json:"raw"`
|
||||
} `json:"table"`
|
||||
} `json:"ata_smart_attributes"`
|
||||
PowerOnTime struct {
|
||||
@@ -86,6 +190,7 @@ type smartctlInfo struct {
|
||||
func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
|
||||
present := true
|
||||
s := schema.HardwareStorage{Present: &present}
|
||||
s.Telemetry = map[string]any{"linux_device": "/dev/" + dev.Name}
|
||||
|
||||
tran := strings.ToLower(dev.Tran)
|
||||
devPath := "/dev/" + dev.Name
|
||||
@@ -149,69 +254,117 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
|
||||
} else if info.RotationRate > 0 {
|
||||
devType = "HDD"
|
||||
}
|
||||
s.Type = &devType
|
||||
|
||||
// telemetry
|
||||
tel := map[string]any{}
|
||||
if info.Temperature.Current > 0 {
|
||||
t := float64(info.Temperature.Current)
|
||||
s.TemperatureC = &t
|
||||
}
|
||||
if info.PowerOnTime.Hours > 0 {
|
||||
tel["power_on_hours"] = info.PowerOnTime.Hours
|
||||
v := int64(info.PowerOnTime.Hours)
|
||||
s.PowerOnHours = &v
|
||||
}
|
||||
if info.PowerCycleCount > 0 {
|
||||
tel["power_cycles"] = info.PowerCycleCount
|
||||
v := int64(info.PowerCycleCount)
|
||||
s.PowerCycles = &v
|
||||
}
|
||||
reallocated := int64(0)
|
||||
pending := int64(0)
|
||||
uncorrectable := int64(0)
|
||||
lifeRemaining := int64(0)
|
||||
for _, attr := range info.AtaSmartAttributes.Table {
|
||||
switch attr.ID {
|
||||
case 5:
|
||||
tel["reallocated_sectors"] = attr.Raw.Value
|
||||
reallocated = attr.Raw.Value
|
||||
s.ReallocatedSectors = &reallocated
|
||||
case 177:
|
||||
tel["wear_leveling_pct"] = attr.Raw.Value
|
||||
value := float64(attr.Raw.Value)
|
||||
s.LifeUsedPct = &value
|
||||
case 231:
|
||||
tel["life_remaining_pct"] = attr.Raw.Value
|
||||
lifeRemaining = attr.Raw.Value
|
||||
value := float64(attr.Raw.Value)
|
||||
s.LifeRemainingPct = &value
|
||||
case 241:
|
||||
tel["total_lba_written"] = attr.Raw.Value
|
||||
value := attr.Raw.Value
|
||||
s.WrittenBytes = &value
|
||||
case 197:
|
||||
pending = attr.Raw.Value
|
||||
s.CurrentPendingSectors = &pending
|
||||
case 198:
|
||||
uncorrectable = attr.Raw.Value
|
||||
s.OfflineUncorrectable = &uncorrectable
|
||||
}
|
||||
}
|
||||
if len(tel) > 0 {
|
||||
s.Telemetry = tel
|
||||
|
||||
status := storageHealthStatus{
|
||||
overallPassed: info.SmartStatus.Passed,
|
||||
hasOverall: true,
|
||||
reallocatedSectors: reallocated,
|
||||
pendingSectors: pending,
|
||||
offlineUncorrectable: uncorrectable,
|
||||
lifeRemainingPct: lifeRemaining,
|
||||
}
|
||||
setStorageHealthStatus(&s, status)
|
||||
return s
|
||||
}
|
||||
|
||||
s.Type = &devType
|
||||
status := "OK"
|
||||
status := statusUnknown
|
||||
s.Status = &status
|
||||
return s
|
||||
}
|
||||
|
||||
// nvmeSmartLog is the subset of `nvme smart-log -o json` output we care about.
|
||||
type nvmeSmartLog struct {
|
||||
CriticalWarning int `json:"critical_warning"`
|
||||
PercentageUsed int `json:"percentage_used"`
|
||||
AvailableSpare int `json:"available_spare"`
|
||||
SpareThreshold int `json:"spare_thresh"`
|
||||
Temperature int64 `json:"temperature"`
|
||||
PowerOnHours int64 `json:"power_on_hours"`
|
||||
PowerCycles int64 `json:"power_cycles"`
|
||||
UnsafeShutdowns int64 `json:"unsafe_shutdowns"`
|
||||
DataUnitsRead int64 `json:"data_units_read"`
|
||||
DataUnitsWritten int64 `json:"data_units_written"`
|
||||
ControllerBusy int64 `json:"controller_busy_time"`
|
||||
MediaErrors int64 `json:"media_errors"`
|
||||
NumErrLogEntries int64 `json:"num_err_log_entries"`
|
||||
}
|
||||
|
||||
// nvmeIDCtrl is the subset of `nvme id-ctrl -o json` output.
|
||||
type nvmeIDCtrl struct {
|
||||
	ModelNumber   string `json:"mn"`
	SerialNumber  string `json:"sn"`
	FirmwareRev   string `json:"fr"`
	TotalCapacity int64  `json:"tnvmcap"`
|
||||
}
|
||||
|
||||
func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
|
||||
present := true
|
||||
devType := "NVMe"
|
||||
iface := "NVMe"
|
||||
status := "OK"
|
||||
status := statusOK
|
||||
s := schema.HardwareStorage{
|
||||
Present: &present,
|
||||
Type: &devType,
|
||||
Interface: &iface,
|
||||
Status: &status,
|
||||
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
|
||||
Present: &present,
|
||||
Type: &devType,
|
||||
Interface: &iface,
|
||||
Telemetry: map[string]any{"linux_device": "/dev/" + dev.Name},
|
||||
}
|
||||
|
||||
devPath := "/dev/" + dev.Name
|
||||
if v := cleanDMIValue(strings.TrimSpace(dev.Model)); v != "" {
|
||||
s.Model = &v
|
||||
}
|
||||
if v := cleanDMIValue(strings.TrimSpace(dev.Serial)); v != "" {
|
||||
s.SerialNumber = &v
|
||||
}
|
||||
if size := parseStorageBytes(dev.Size); size > 0 {
|
||||
gb := int(size / 1_000_000_000)
|
||||
if gb > 0 {
|
||||
s.SizeGB = &gb
|
||||
}
|
||||
}
|
||||
|
||||
// id-ctrl: model, serial, firmware, capacity
|
||||
if out, err := exec.Command("nvme", "id-ctrl", devPath, "-o", "json").Output(); err == nil {
|
||||
@@ -237,30 +390,131 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
|
||||
if out, err := exec.Command("nvme", "smart-log", devPath, "-o", "json").Output(); err == nil {
|
||||
var log nvmeSmartLog
|
||||
if json.Unmarshal(out, &log) == nil {
|
||||
tel := map[string]any{}
|
||||
if log.PowerOnHours > 0 {
|
||||
tel["power_on_hours"] = log.PowerOnHours
|
||||
s.PowerOnHours = &log.PowerOnHours
|
||||
}
|
||||
if log.PowerCycles > 0 {
|
||||
tel["power_cycles"] = log.PowerCycles
|
||||
s.PowerCycles = &log.PowerCycles
|
||||
}
|
||||
if log.UnsafeShutdowns > 0 {
|
||||
tel["unsafe_shutdowns"] = log.UnsafeShutdowns
|
||||
s.UnsafeShutdowns = &log.UnsafeShutdowns
|
||||
}
|
||||
if log.PercentageUsed > 0 {
|
||||
tel["percentage_used"] = log.PercentageUsed
|
||||
v := float64(log.PercentageUsed)
|
||||
s.LifeUsedPct = &v
|
||||
remaining := 100 - v
|
||||
s.LifeRemainingPct = &remaining
|
||||
}
|
||||
if log.DataUnitsWritten > 0 {
|
||||
tel["data_units_written"] = log.DataUnitsWritten
|
||||
v := nvmeDataUnitsToBytes(log.DataUnitsWritten)
|
||||
s.WrittenBytes = &v
|
||||
}
|
||||
if log.ControllerBusy > 0 {
|
||||
tel["controller_busy_time"] = log.ControllerBusy
|
||||
if log.DataUnitsRead > 0 {
|
||||
v := nvmeDataUnitsToBytes(log.DataUnitsRead)
|
||||
s.ReadBytes = &v
|
||||
}
|
||||
if len(tel) > 0 {
|
||||
s.Telemetry = tel
|
||||
if log.AvailableSpare > 0 {
|
||||
v := float64(log.AvailableSpare)
|
||||
s.AvailableSparePct = &v
|
||||
}
|
||||
if log.MediaErrors > 0 {
|
||||
s.MediaErrors = &log.MediaErrors
|
||||
}
|
||||
if log.NumErrLogEntries > 0 {
|
||||
s.ErrorLogEntries = &log.NumErrLogEntries
|
||||
}
|
||||
if log.Temperature > 0 {
|
||||
v := float64(log.Temperature - 273)
|
||||
s.TemperatureC = &v
|
||||
}
|
||||
setStorageHealthStatus(&s, storageHealthStatus{
|
||||
criticalWarning: log.CriticalWarning,
|
||||
percentageUsed: int64(log.PercentageUsed),
|
||||
availableSpare: int64(log.AvailableSpare),
|
||||
spareThreshold: int64(log.SpareThreshold),
|
||||
unsafeShutdowns: log.UnsafeShutdowns,
|
||||
mediaErrors: log.MediaErrors,
|
||||
errorLogEntries: log.NumErrLogEntries,
|
||||
})
|
||||
return s
|
||||
}
|
||||
}
|
||||
|
||||
status = statusUnknown
|
||||
s.Status = &status
|
||||
return s
|
||||
}
|
||||
|
||||
func parseStorageBytes(raw string) int64 {
|
||||
value, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
|
||||
if err == nil && value > 0 {
|
||||
return value
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
func nvmeDataUnitsToBytes(units int64) int64 {
|
||||
if units <= 0 {
|
||||
return 0
|
||||
}
|
||||
return units * 512000
|
||||
}
|
||||
|
||||
type storageHealthStatus struct {
|
||||
hasOverall bool
|
||||
overallPassed bool
|
||||
reallocatedSectors int64
|
||||
pendingSectors int64
|
||||
offlineUncorrectable int64
|
||||
lifeRemainingPct int64
|
||||
criticalWarning int
|
||||
percentageUsed int64
|
||||
availableSpare int64
|
||||
spareThreshold int64
|
||||
unsafeShutdowns int64
|
||||
mediaErrors int64
|
||||
errorLogEntries int64
|
||||
}
|
||||
|
||||
func setStorageHealthStatus(s *schema.HardwareStorage, health storageHealthStatus) {
|
||||
status := statusOK
|
||||
var description *string
|
||||
switch {
|
||||
case health.hasOverall && !health.overallPassed:
|
||||
status = statusCritical
|
||||
description = stringPtr("SMART overall self-assessment failed")
|
||||
case health.criticalWarning > 0:
|
||||
status = statusCritical
|
||||
description = stringPtr("NVMe critical warning is set")
|
||||
case health.pendingSectors > 0 || health.offlineUncorrectable > 0:
|
||||
status = statusCritical
|
||||
description = stringPtr("Pending or offline uncorrectable sectors detected")
|
||||
case health.mediaErrors > 0:
|
||||
status = statusWarning
|
||||
description = stringPtr("Media errors reported")
|
||||
case health.reallocatedSectors > 0:
|
||||
status = statusWarning
|
||||
description = stringPtr("Reallocated sectors detected")
|
||||
case health.errorLogEntries > 0:
|
||||
status = statusWarning
|
||||
description = stringPtr("Device error log contains entries")
|
||||
case health.lifeRemainingPct > 0 && health.lifeRemainingPct <= 10:
|
||||
status = statusWarning
|
||||
description = stringPtr("Life remaining is low")
|
||||
case health.percentageUsed >= 95:
|
||||
status = statusWarning
|
||||
description = stringPtr("Drive wear level is high")
|
||||
case health.availableSpare > 0 && health.spareThreshold > 0 && health.availableSpare <= health.spareThreshold:
|
||||
status = statusWarning
|
||||
description = stringPtr("Available spare is at or below threshold")
|
||||
case health.unsafeShutdowns > 100:
|
||||
status = statusWarning
|
||||
description = stringPtr("Unsafe shutdown count is high")
|
||||
}
|
||||
s.Status = &status
|
||||
s.ErrorDescription = description
|
||||
}
|
||||
|
||||
func stringPtr(value string) *string {
|
||||
return &value
|
||||
}
|
||||
|
||||
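The two unit conversions applied to the `nvme smart-log` fields above can be checked in isolation. The sketch below assumes the usual NVMe conventions: a data unit counts 1000 blocks of 512 bytes, and the composite temperature is reported in Kelvin. The helper names `dataUnitsToBytes` and `kelvinToCelsius` are illustrative and are not part of the diff.

```go
package main

import "fmt"

// dataUnitsToBytes converts an NVMe SMART "data units" counter to bytes.
// One data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
// Non-positive counters are treated as "not reported" and map to 0.
func dataUnitsToBytes(units int64) int64 {
	if units <= 0 {
		return 0
	}
	return units * 512_000
}

// kelvinToCelsius converts a whole-Kelvin reading to Celsius using the
// same integer offset (273) that the diff applies.
func kelvinToCelsius(kelvin int) float64 {
	return float64(kelvin - 273)
}

func main() {
	fmt.Println(dataUnitsToBytes(2_000_000)) // 1024000000000 (about 1.02 TB)
	fmt.Println(kelvinToCelsius(313))        // 40
}
```

Because the counter is multiplied by 512,000, an `int64` result overflows only for counters above roughly 1.8e13 data units, far beyond what a drive will report in practice.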
33
audit/internal/collector/storage_discovery_test.go
Normal file
@@ -0,0 +1,33 @@
|
||||
package collector
|
||||
|
||||
import "testing"
|
||||
|
||||
func TestMergeStorageDevicePrefersNonEmptyFields(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
got := mergeStorageDevice(
|
||||
lsblkDevice{Name: "nvme0n1", Type: "disk", Tran: "nvme"},
|
||||
lsblkDevice{Name: "nvme0n1", Type: "disk", Size: "1024", Serial: "SN123", Model: "Kioxia"},
|
||||
)
|
||||
|
||||
if got.Serial != "SN123" {
|
||||
t.Fatalf("serial=%q want SN123", got.Serial)
|
||||
}
|
||||
if got.Model != "Kioxia" {
|
||||
t.Fatalf("model=%q want Kioxia", got.Model)
|
||||
}
|
||||
if got.Size != "1024" {
|
||||
t.Fatalf("size=%q want 1024", got.Size)
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseStorageBytes(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
if got := parseStorageBytes(" 2048 "); got != 2048 {
|
||||
t.Fatalf("parseStorageBytes=%d want 2048", got)
|
||||
}
|
||||
if got := parseStorageBytes("1.92 TB"); got != 0 {
|
||||
t.Fatalf("parseStorageBytes invalid=%d want 0", got)
|
||||
}
|
||||
}
|
||||
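The body of `mergeStorageDevice` is not visible in this part of the diff. The test only establishes that fields missing from the first argument are filled from the second. A minimal sketch of that "prefer non-empty" merge, using a hypothetical `device` struct and a hypothetical `firstNonEmpty` helper in place of the real types, could look like this:

```go
package main

import "fmt"

// device is a hypothetical stand-in for lsblkDevice; only the fields
// exercised by the test are included.
type device struct {
	Name, Type, Tran, Size, Serial, Model string
}

// firstNonEmpty returns a unless it is empty, in which case it returns b.
func firstNonEmpty(a, b string) string {
	if a != "" {
		return a
	}
	return b
}

// mergeDevices keeps every non-empty field of primary and fills the
// remaining gaps from fallback. This is an assumption about the merge
// rule, not the implementation from the diff.
func mergeDevices(primary, fallback device) device {
	return device{
		Name:   firstNonEmpty(primary.Name, fallback.Name),
		Type:   firstNonEmpty(primary.Type, fallback.Type),
		Tran:   firstNonEmpty(primary.Tran, fallback.Tran),
		Size:   firstNonEmpty(primary.Size, fallback.Size),
		Serial: firstNonEmpty(primary.Serial, fallback.Serial),
		Model:  firstNonEmpty(primary.Model, fallback.Model),
	}
}

func main() {
	got := mergeDevices(
		device{Name: "nvme0n1", Type: "disk", Tran: "nvme"},
		device{Name: "nvme0n1", Type: "disk", Size: "1024", Serial: "SN123", Model: "Kioxia"},
	)
	fmt.Printf("%+v\n", got)
}
```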
63
audit/internal/collector/storage_health_test.go
Normal file
@@ -0,0 +1,63 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"testing"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
func TestSetStorageHealthStatus(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
health storageHealthStatus
|
||||
want string
|
||||
}{
|
||||
{
|
||||
name: "smart overall failed",
|
||||
health: storageHealthStatus{hasOverall: true, overallPassed: false},
|
||||
want: statusCritical,
|
||||
},
|
||||
{
|
||||
name: "nvme critical warning",
|
||||
health: storageHealthStatus{criticalWarning: 1},
|
||||
want: statusCritical,
|
||||
},
|
||||
{
|
||||
name: "pending sectors",
|
||||
health: storageHealthStatus{pendingSectors: 1},
|
||||
want: statusCritical,
|
||||
},
|
||||
{
|
||||
name: "media errors warning",
|
||||
health: storageHealthStatus{mediaErrors: 2},
|
||||
want: statusWarning,
|
||||
},
|
||||
{
|
||||
name: "reallocated warning",
|
||||
health: storageHealthStatus{reallocatedSectors: 1},
|
||||
want: statusWarning,
|
||||
},
|
||||
{
|
||||
name: "life remaining low",
|
||||
health: storageHealthStatus{lifeRemainingPct: 8},
|
||||
want: statusWarning,
|
||||
},
|
||||
{
|
||||
name: "healthy",
|
||||
health: storageHealthStatus{},
|
||||
want: statusOK,
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
var disk schema.HardwareStorage
|
||||
setStorageHealthStatus(&disk, tt.health)
|
||||
if disk.Status == nil || *disk.Status != tt.want {
|
||||
t.Fatalf("status=%v want %q", disk.Status, tt.want)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
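Each table entry above sets a single condition, so the test does not show what happens when several are set at once. Since `setStorageHealthStatus` uses an expression-less `switch`, the first matching case wins. The sketch below reproduces that precedence with a reduced, hypothetical `health` struct. The status strings `"Critical"` and `"Warning"` are placeholders, because the values of `statusCritical` and `statusWarning` are not shown in this part of the diff.

```go
package main

import "fmt"

// health is a reduced, hypothetical version of storageHealthStatus.
type health struct {
	criticalWarning int
	mediaErrors     int64
	unsafeShutdowns int64
}

// classify evaluates the cases top to bottom and returns on the first
// match, so a critical condition always masks any warning below it.
func classify(h health) (status, reason string) {
	switch {
	case h.criticalWarning > 0:
		return "Critical", "NVMe critical warning is set"
	case h.mediaErrors > 0:
		return "Warning", "Media errors reported"
	case h.unsafeShutdowns > 100:
		return "Warning", "Unsafe shutdown count is high"
	default:
		return "OK", ""
	}
}

func main() {
	// Both a critical warning and media errors are present;
	// only the critical warning is reported.
	status, reason := classify(health{criticalWarning: 1, mediaErrors: 7})
	fmt.Println(status, "-", reason)
}
```

One consequence of this design is that only a single description is recorded per device, so lower-priority findings stay hidden until the higher-priority one clears.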
114
audit/internal/collector/summary.go
Normal file
@@ -0,0 +1,114 @@
|
||||
package collector
|
||||
|
||||
import (
|
||||
"bee/audit/internal/schema"
|
||||
"fmt"
|
||||
"time"
|
||||
)
|
||||
|
||||
func BuildHealthSummary(snap schema.HardwareSnapshot) *schema.HardwareHealthSummary {
|
||||
summary := &schema.HardwareHealthSummary{
|
||||
Status: statusOK,
|
||||
CollectedAt: time.Now().UTC().Format(time.RFC3339),
|
||||
}
|
||||
|
||||
for _, dimm := range snap.Memory {
|
||||
switch derefString(dimm.Status) {
|
||||
case statusWarning:
|
||||
summary.MemoryWarn++
|
||||
summary.Warnings = append(summary.Warnings, formatMemorySummary(dimm))
|
||||
case statusCritical:
|
||||
summary.MemoryFail++
|
||||
summary.Failures = append(summary.Failures, formatMemorySummary(dimm))
|
||||
case statusEmpty:
|
||||
summary.EmptyDIMMs++
|
||||
}
|
||||
}
|
||||
|
||||
for _, disk := range snap.Storage {
|
||||
switch derefString(disk.Status) {
|
||||
case statusWarning:
|
||||
summary.StorageWarn++
|
||||
summary.Warnings = append(summary.Warnings, formatStorageSummary(disk))
|
||||
case statusCritical:
|
||||
summary.StorageFail++
|
||||
summary.Failures = append(summary.Failures, formatStorageSummary(disk))
|
||||
}
|
||||
}
|
||||
|
||||
for _, dev := range snap.PCIeDevices {
|
||||
switch derefString(dev.Status) {
|
||||
case statusWarning:
|
||||
summary.PCIeWarn++
|
||||
summary.Warnings = append(summary.Warnings, formatPCIeSummary(dev))
|
||||
case statusCritical:
|
||||
summary.PCIeFail++
|
||||
summary.Failures = append(summary.Failures, formatPCIeSummary(dev))
|
||||
}
|
||||
}
|
||||
|
||||
for _, psu := range snap.PowerSupplies {
|
||||
if psu.Present != nil && !*psu.Present {
|
||||
summary.MissingPSUs++
|
||||
}
|
||||
switch derefString(psu.Status) {
|
||||
case statusWarning:
|
||||
summary.PSUWarn++
|
||||
summary.Warnings = append(summary.Warnings, formatPSUSummary(psu))
|
||||
case statusCritical:
|
||||
summary.PSUFail++
|
||||
summary.Failures = append(summary.Failures, formatPSUSummary(psu))
|
||||
}
|
||||
}
|
||||
|
||||
if len(summary.Failures) > 0 || summary.StorageFail > 0 || summary.PCIeFail > 0 || summary.PSUFail > 0 || summary.MemoryFail > 0 {
|
||||
summary.Status = statusCritical
|
||||
} else if len(summary.Warnings) > 0 || summary.StorageWarn > 0 || summary.PCIeWarn > 0 || summary.PSUWarn > 0 || summary.MemoryWarn > 0 {
|
||||
summary.Status = statusWarning
|
||||
}
|
||||
|
||||
if len(summary.Warnings) == 0 {
|
||||
summary.Warnings = nil
|
||||
}
|
||||
if len(summary.Failures) == 0 {
|
||||
summary.Failures = nil
|
||||
}
|
||||
|
||||
return summary
|
||||
}
|
||||
|
||||
func derefString(value *string) string {
|
||||
if value == nil {
|
||||
return ""
|
||||
}
|
||||
return *value
|
||||
}
|
||||
|
||||
func preferredName(model, serial, slot *string) string {
|
||||
switch {
|
||||
case model != nil && *model != "":
|
||||
return *model
|
||||
case serial != nil && *serial != "":
|
||||
return *serial
|
||||
case slot != nil && *slot != "":
|
||||
return *slot
|
||||
default:
|
||||
return "unknown"
|
||||
}
|
||||
}
|
||||
|
||||
func formatStorageSummary(disk schema.HardwareStorage) string {
|
||||
return fmt.Sprintf("storage %s status=%s", preferredName(disk.Model, disk.SerialNumber, disk.Slot), derefString(disk.Status))
|
||||
}
|
||||
|
||||
func formatPCIeSummary(dev schema.HardwarePCIeDevice) string {
|
||||
return fmt.Sprintf("pcie %s status=%s", preferredName(dev.Model, dev.SerialNumber, dev.BDF), derefString(dev.Status))
|
||||
}
|
||||
|
||||
func formatPSUSummary(psu schema.HardwarePowerSupply) string {
|
||||
return fmt.Sprintf("psu %s status=%s", preferredName(psu.Model, psu.SerialNumber, psu.Slot), derefString(psu.Status))
|
||||
}
|
||||
|
||||
func formatMemorySummary(dimm schema.HardwareMemory) string {
|
||||
return fmt.Sprintf("memory %s status=%s", preferredName(dimm.PartNumber, dimm.SerialNumber, dimm.Slot), derefString(dimm.Status))
|
||||
}
|
||||
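`BuildHealthSummary` derives the overall status from per-component counters: any failure makes the summary critical, otherwise any warning makes it a warning. The same "worst status wins" rule can be written as a ranking. This is a compact variant for illustration, not the approach taken in the diff. The constant values and the names `severity` and `worst` are assumptions.

```go
package main

import "fmt"

// Placeholder status values; the real constants are defined elsewhere
// in the package and are not visible in this part of the diff.
const (
	ok       = "OK"
	warning  = "Warning"
	critical = "Critical"
)

// severity ranks a status; unknown or empty values rank as healthy,
// matching the summary's behaviour of ignoring them.
func severity(status string) int {
	switch status {
	case critical:
		return 2
	case warning:
		return 1
	default:
		return 0
	}
}

// worst returns the most severe status in the list, or ok when the
// list is empty or contains nothing above ok.
func worst(statuses []string) string {
	result := ok
	for _, s := range statuses {
		if severity(s) > severity(result) {
			result = s
		}
	}
	return result
}

func main() {
	fmt.Println(worst([]string{ok, warning, critical, warning})) // Critical
	fmt.Println(worst([]string{ok, "", warning}))                // Warning
	fmt.Println(worst(nil))                                      // OK
}
```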
@@ -31,7 +31,7 @@ md125 : active raid1 nvme2n1[0] nvme3n1[1]
|
||||
func TestHasVROCController(t *testing.T) {
|
||||
intel := vendorIntel
|
||||
model := "Volume Management Device NVMe RAID Controller"
|
||||
class := "RAID bus controller"
|
||||
class := "MassStorageController"
|
||||
tests := []struct {
|
||||
name string
|
||||
pcie []schema.HardwarePCIeDevice
|
||||
|
||||
153
audit/internal/platform/export.go
Normal file
@@ -0,0 +1,153 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var exportExecCommand = exec.Command
|
||||
|
||||
func formatMountTargetError(target RemovableTarget, raw string, err error) error {
|
||||
msg := strings.TrimSpace(raw)
|
||||
fstype := strings.ToLower(strings.TrimSpace(target.FSType))
|
||||
if fstype == "exfat" && strings.Contains(strings.ToLower(msg), "unknown filesystem type 'exfat'") {
|
||||
return fmt.Errorf("mount %s: exFAT support is missing in this ISO build: %w", target.Device, err)
|
||||
}
|
||||
if msg == "" {
|
||||
return err
|
||||
}
|
||||
return fmt.Errorf("%s: %w", msg, err)
|
||||
}
|
||||
|
||||
func removableTargetReadOnly(fields map[string]string) bool {
|
||||
if fields["RO"] == "1" {
|
||||
return true
|
||||
}
|
||||
switch strings.ToLower(strings.TrimSpace(fields["FSTYPE"])) {
|
||||
case "iso9660", "squashfs":
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
func ensureWritableMountpoint(mountpoint string) error {
|
||||
probe, err := os.CreateTemp(mountpoint, ".bee-write-test-*")
|
||||
if err != nil {
|
||||
return fmt.Errorf("target filesystem is not writable: %w", err)
|
||||
}
|
||||
name := probe.Name()
|
||||
if closeErr := probe.Close(); closeErr != nil {
|
||||
_ = os.Remove(name)
|
||||
return closeErr
|
||||
}
|
||||
if err := os.Remove(name); err != nil {
|
||||
return err
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (s *System) ListRemovableTargets() ([]RemovableTarget, error) {
|
||||
raw, err := exportExecCommand("lsblk", "-P", "-o", "NAME,TYPE,PKNAME,RM,RO,FSTYPE,MOUNTPOINT,SIZE,LABEL,MODEL").Output()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
var out []RemovableTarget
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
|
||||
if strings.TrimSpace(line) == "" {
|
||||
continue
|
||||
}
|
||||
fields := parseLSBLKPairs(line)
|
||||
deviceType := fields["TYPE"]
|
||||
if deviceType == "rom" || deviceType == "loop" {
|
||||
continue
|
||||
}
|
||||
|
||||
removable := fields["RM"] == "1"
|
||||
if !removable {
|
||||
if parent := fields["PKNAME"]; parent != "" {
|
||||
if data, err := os.ReadFile(filepath.Join("/sys/class/block", parent, "removable")); err == nil {
|
||||
removable = strings.TrimSpace(string(data)) == "1"
|
||||
}
|
||||
}
|
||||
}
|
||||
if !removable || fields["FSTYPE"] == "" || removableTargetReadOnly(fields) {
|
||||
continue
|
||||
}
|
||||
|
||||
out = append(out, RemovableTarget{
|
||||
Device: "/dev/" + fields["NAME"],
|
||||
FSType: fields["FSTYPE"],
|
||||
Size: fields["SIZE"],
|
||||
Label: fields["LABEL"],
|
||||
Model: fields["MODEL"],
|
||||
Mountpoint: fields["MOUNTPOINT"],
|
||||
})
|
||||
}
|
||||
|
||||
sort.Slice(out, func(i, j int) bool { return out[i].Device < out[j].Device })
|
||||
return out, nil
|
||||
}
|
||||
|
||||
func (s *System) ExportFileToTarget(src string, target RemovableTarget) (dst string, retErr error) {
|
||||
if src == "" || target.Device == "" {
|
||||
return "", fmt.Errorf("source and target are required")
|
||||
}
|
||||
if _, err := os.Stat(src); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
mountpoint := strings.TrimSpace(target.Mountpoint)
|
||||
mountedHere := false
|
||||
mounted := mountpoint != ""
|
||||
if mountpoint == "" {
|
||||
mountpoint = filepath.Join("/tmp", "bee-export-"+filepath.Base(target.Device))
|
||||
if err := os.MkdirAll(mountpoint, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
if raw, err := exportExecCommand("mount", target.Device, mountpoint).CombinedOutput(); err != nil {
|
||||
_ = os.Remove(mountpoint)
|
||||
return "", formatMountTargetError(target, string(raw), err)
|
||||
}
|
||||
mountedHere = true
|
||||
mounted = true
|
||||
}
|
||||
defer func() {
|
||||
if !mounted {
|
||||
return
|
||||
}
|
||||
_ = exportExecCommand("sync").Run()
|
||||
if raw, err := exportExecCommand("umount", mountpoint).CombinedOutput(); err != nil && retErr == nil {
|
||||
msg := strings.TrimSpace(string(raw))
|
||||
if msg == "" {
|
||||
retErr = err
|
||||
} else {
|
||||
retErr = fmt.Errorf("%s: %w", msg, err)
|
||||
}
|
||||
}
|
||||
if mountedHere {
|
||||
_ = os.Remove(mountpoint)
|
||||
}
|
||||
}()
|
||||
|
||||
if err := ensureWritableMountpoint(mountpoint); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
filename := filepath.Base(src)
|
||||
dst = filepath.Join(mountpoint, filename)
|
||||
data, err := os.ReadFile(src)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
if err := os.WriteFile(dst, data, 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
return dst, nil
|
||||
}
|
||||
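`ListRemovableTargets` relies on `parseLSBLKPairs`, which is not shown in this part of the diff. `lsblk -P` prints one device per line as `KEY="value"` pairs, and values may contain spaces. A hypothetical parser for that format, here called `parsePairs`, is sketched below. It assumes values never contain a literal double quote.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePairs splits one line of `lsblk -P` output into a key/value map.
// It walks the line pair by pair: the key runs up to `="`, the value
// runs up to the next double quote. Parsing stops at the first
// malformed pair and returns what was collected so far.
func parsePairs(line string) map[string]string {
	fields := make(map[string]string)
	rest := strings.TrimSpace(line)
	for rest != "" {
		eq := strings.Index(rest, `="`)
		if eq < 0 {
			break
		}
		key := strings.TrimSpace(rest[:eq])
		rest = rest[eq+2:]
		end := strings.IndexByte(rest, '"')
		if end < 0 {
			break
		}
		fields[key] = rest[:end]
		rest = strings.TrimSpace(rest[end+1:])
	}
	return fields
}

func main() {
	line := `NAME="sdb1" TYPE="part" RM="1" RO="0" FSTYPE="vfat" LABEL="MY USB" MODEL=""`
	fields := parsePairs(line)
	fmt.Println(fields["NAME"], fields["FSTYPE"], fields["LABEL"])
}
```

Splitting on spaces alone would break labels such as `MY USB`, which is why the sketch scans for the closing quote instead.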
112
audit/internal/platform/export_test.go
Normal file
@@ -0,0 +1,112 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestExportFileToTargetUnmountsExistingMountpoint(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
src := filepath.Join(tmp, "bundle.tar.gz")
|
||||
mountpoint := filepath.Join(tmp, "mnt")
|
||||
if err := os.MkdirAll(mountpoint, 0755); err != nil {
|
||||
t.Fatalf("mkdir mountpoint: %v", err)
|
||||
}
|
||||
if err := os.WriteFile(src, []byte("bundle"), 0644); err != nil {
|
||||
t.Fatalf("write src: %v", err)
|
||||
}
|
||||
|
||||
var calls [][]string
|
||||
oldExec := exportExecCommand
|
||||
exportExecCommand = func(name string, args ...string) *exec.Cmd {
|
||||
calls = append(calls, append([]string{name}, args...))
|
||||
return exec.Command("sh", "-c", "exit 0")
|
||||
}
|
||||
t.Cleanup(func() { exportExecCommand = oldExec })
|
||||
|
||||
s := &System{}
|
||||
dst, err := s.ExportFileToTarget(src, RemovableTarget{
|
||||
Device: "/dev/sdb1",
|
||||
Mountpoint: mountpoint,
|
||||
})
|
||||
if err != nil {
|
||||
t.Fatalf("ExportFileToTarget error: %v", err)
|
||||
}
|
||||
if got, want := dst, filepath.Join(mountpoint, "bundle.tar.gz"); got != want {
|
||||
t.Fatalf("dst=%q want %q", got, want)
|
||||
}
|
||||
if _, err := os.Stat(filepath.Join(mountpoint, "bundle.tar.gz")); err != nil {
|
||||
t.Fatalf("exported file missing: %v", err)
|
||||
}
|
||||
|
||||
foundUmount := false
|
||||
for _, call := range calls {
|
||||
if len(call) == 2 && call[0] == "umount" && call[1] == mountpoint {
|
||||
foundUmount = true
|
||||
break
|
||||
}
|
||||
}
|
||||
if !foundUmount {
|
||||
t.Fatalf("expected umount %q call, got %#v", mountpoint, calls)
|
||||
}
|
||||
}
|
||||
|
||||
func TestExportFileToTargetRejectsNonWritableMountpoint(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
src := filepath.Join(tmp, "bundle.tar.gz")
|
||||
mountpoint := filepath.Join(tmp, "mnt")
|
||||
if err := os.MkdirAll(mountpoint, 0755); err != nil {
|
||||
t.Fatalf("mkdir mountpoint: %v", err)
|
||||
}
|
||||
if err := os.WriteFile(src, []byte("bundle"), 0644); err != nil {
|
||||
t.Fatalf("write src: %v", err)
|
||||
}
|
||||
if err := os.Chmod(mountpoint, 0555); err != nil {
|
||||
t.Fatalf("chmod mountpoint: %v", err)
|
||||
}
|
||||
|
||||
oldExec := exportExecCommand
|
||||
exportExecCommand = func(name string, args ...string) *exec.Cmd {
|
||||
return exec.Command("sh", "-c", "exit 0")
|
||||
}
|
||||
t.Cleanup(func() { exportExecCommand = oldExec })
|
||||
|
||||
s := &System{}
|
||||
_, err := s.ExportFileToTarget(src, RemovableTarget{
|
||||
Device: "/dev/sdb1",
|
||||
Mountpoint: mountpoint,
|
||||
})
|
||||
if err == nil {
|
||||
t.Fatal("expected error for non-writable mountpoint")
|
||||
}
|
||||
if !strings.Contains(err.Error(), "target filesystem is not writable") {
|
||||
t.Fatalf("err=%q want writable message", err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestListRemovableTargetsSkipsReadOnlyMedia(t *testing.T) {
|
||||
oldExec := exportExecCommand
|
||||
lsblkOut := `NAME="sda1" TYPE="part" PKNAME="sda" RM="1" RO="1" FSTYPE="iso9660" MOUNTPOINT="/run/live/medium" SIZE="3.7G" LABEL="BEE" MODEL=""
|
||||
NAME="sdb1" TYPE="part" PKNAME="sdb" RM="1" RO="0" FSTYPE="vfat" MOUNTPOINT="/media/bee/USB" SIZE="29.8G" LABEL="USB" MODEL=""`
|
||||
exportExecCommand = func(name string, args ...string) *exec.Cmd {
|
||||
cmd := exec.Command("sh", "-c", "printf '%s\n' \"$LSBLK_OUT\"")
|
||||
cmd.Env = append(os.Environ(), "LSBLK_OUT="+lsblkOut)
|
||||
return cmd
|
||||
}
|
||||
t.Cleanup(func() { exportExecCommand = oldExec })
|
||||
|
||||
s := &System{}
|
||||
targets, err := s.ListRemovableTargets()
|
||||
if err != nil {
|
||||
t.Fatalf("ListRemovableTargets error: %v", err)
|
||||
}
|
||||
if len(targets) != 1 {
|
||||
t.Fatalf("len(targets)=%d want 1 (%+v)", len(targets), targets)
|
||||
}
|
||||
if got := targets[0].Device; got != "/dev/sdb1" {
|
||||
t.Fatalf("device=%q want /dev/sdb1", got)
|
||||
}
|
||||
}
|
||||
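The tests above work because `export.go` calls a package-level variable, `exportExecCommand`, instead of `exec.Command` directly, so a test can swap in a stub and restore it afterwards. The sketch below shows a variant of that seam in which the variable returns the command output rather than an `*exec.Cmd`, which lets the stub avoid spawning any process. The names `runCommand` and `unmount` are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runCommand is the seam: production code calls it instead of using
// exec.Command directly, so tests can replace it.
var runCommand = func(name string, args ...string) ([]byte, error) {
	return exec.Command(name, args...).CombinedOutput()
}

// unmount asks the system to unmount mountpoint and includes the
// tool's output in the returned error.
func unmount(mountpoint string) error {
	out, err := runCommand("umount", mountpoint)
	if err != nil {
		return fmt.Errorf("umount %s: %s: %w", mountpoint, out, err)
	}
	return nil
}

func main() {
	// Replace the seam with a stub that records calls and succeeds.
	var calls [][]string
	old := runCommand
	runCommand = func(name string, args ...string) ([]byte, error) {
		calls = append(calls, append([]string{name}, args...))
		return nil, nil
	}
	defer func() { runCommand = old }()

	err := unmount("/mnt/usb")
	fmt.Println(err, calls)
}
```

Because the seam is shared package state, tests that replace it should not run in parallel with each other, which may be why the export tests above do not call `t.Parallel()`.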
644
audit/internal/platform/gpu_metrics.go
Normal file
@@ -0,0 +1,644 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"fmt"
|
||||
"math"
|
||||
"os"
|
||||
"os/exec"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
// GPUMetricRow is one telemetry sample from nvidia-smi during a stress test.
|
||||
type GPUMetricRow struct {
|
||||
ElapsedSec float64 `json:"elapsed_sec"`
|
||||
GPUIndex int `json:"index"`
|
||||
TempC float64 `json:"temp_c"`
|
||||
UsagePct float64 `json:"usage_pct"`
|
||||
MemUsagePct float64 `json:"mem_usage_pct"`
|
||||
PowerW float64 `json:"power_w"`
|
||||
ClockMHz float64 `json:"clock_mhz"`
|
||||
}
|
||||
|
||||
// sampleGPUMetrics runs nvidia-smi once and returns current metrics for each GPU.
|
||||
func sampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
|
||||
args := []string{
|
||||
"--query-gpu=index,temperature.gpu,utilization.gpu,utilization.memory,power.draw,clocks.current.graphics",
|
||||
"--format=csv,noheader,nounits",
|
||||
}
|
||||
if len(gpuIndices) > 0 {
|
||||
ids := make([]string, len(gpuIndices))
|
||||
for i, idx := range gpuIndices {
|
||||
ids[i] = strconv.Itoa(idx)
|
||||
}
|
||||
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
|
||||
}
|
||||
out, err := exec.Command("nvidia-smi", args...).Output()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
var rows []GPUMetricRow
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" {
|
||||
continue
|
||||
}
|
||||
parts := strings.Split(line, ", ")
|
||||
if len(parts) < 6 {
|
||||
continue
|
||||
}
|
||||
idx, _ := strconv.Atoi(strings.TrimSpace(parts[0]))
|
||||
rows = append(rows, GPUMetricRow{
|
||||
GPUIndex: idx,
|
||||
TempC: parseGPUFloat(parts[1]),
|
||||
UsagePct: parseGPUFloat(parts[2]),
|
||||
MemUsagePct: parseGPUFloat(parts[3]),
|
||||
PowerW: parseGPUFloat(parts[4]),
|
||||
ClockMHz: parseGPUFloat(parts[5]),
|
||||
})
|
||||
}
|
||||
return rows, nil
|
||||
}
|
||||
|
||||
func parseGPUFloat(s string) float64 {
|
||||
s = strings.TrimSpace(s)
|
||||
if s == "N/A" || s == "[Not Supported]" || s == "" {
|
||||
return 0
|
||||
}
|
||||
v, _ := strconv.ParseFloat(s, 64)
|
||||
return v
|
||||
}
|
||||
|
||||
// SampleGPUMetrics runs nvidia-smi once and returns current metrics for each GPU.
|
||||
func SampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
|
||||
return sampleGPUMetrics(gpuIndices)
|
||||
}
|
||||
|
||||
// sampleAMDGPUMetrics queries rocm-smi for live GPU metrics.
|
||||
func sampleAMDGPUMetrics() ([]GPUMetricRow, error) {
|
||||
out, err := runROCmSMI("--showtemp", "--showuse", "--showpower", "--showmemuse", "--csv")
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
lines := strings.Split(strings.TrimSpace(string(out)), "\n")
|
||||
if len(lines) < 2 {
|
||||
return nil, fmt.Errorf("rocm-smi: insufficient output")
|
||||
}
|
||||
|
||||
// Parse header to find column indices by name.
|
||||
headers := strings.Split(lines[0], ",")
|
||||
colIdx := func(keywords ...string) int {
|
||||
for i, h := range headers {
|
||||
hl := strings.ToLower(strings.TrimSpace(h))
|
||||
for _, kw := range keywords {
|
||||
if strings.Contains(hl, kw) {
|
||||
return i
|
||||
}
|
||||
}
|
||||
}
|
||||
return -1
|
||||
}
|
||||
idxTemp := colIdx("sensor edge", "temperature (c)", "temp")
|
||||
idxUse := colIdx("gpu use (%)")
|
||||
idxMem := colIdx("vram%", "memory allocated")
|
||||
idxPow := colIdx("average graphics package power", "power (w)")
|
||||
|
||||
var rows []GPUMetricRow
|
||||
for _, line := range lines[1:] {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" {
|
||||
continue
|
||||
}
|
||||
parts := strings.Split(line, ",")
|
||||
idx := len(rows)
|
||||
row := GPUMetricRow{GPUIndex: idx}
|
||||
get := func(i int) float64 {
|
||||
if i < 0 || i >= len(parts) {
|
||||
return 0
|
||||
}
|
||||
v := strings.TrimSpace(parts[i])
|
||||
if strings.EqualFold(v, "n/a") {
|
||||
return 0
|
||||
}
|
||||
return parseGPUFloat(v)
|
||||
}
|
||||
row.TempC = get(idxTemp)
|
||||
row.UsagePct = get(idxUse)
|
||||
row.MemUsagePct = get(idxMem)
|
||||
row.PowerW = get(idxPow)
|
||||
rows = append(rows, row)
|
||||
}
|
||||
if len(rows) == 0 {
|
||||
return nil, fmt.Errorf("rocm-smi: no GPU rows parsed")
|
||||
}
|
||||
return rows, nil
|
||||
}
|
||||
|
||||
// WriteGPUMetricsCSV writes collected rows as a CSV file.
|
||||
func WriteGPUMetricsCSV(path string, rows []GPUMetricRow) error {
|
||||
var b bytes.Buffer
|
||||
b.WriteString("elapsed_sec,gpu_index,temperature_c,usage_pct,power_w,clock_mhz\n")
|
||||
for _, r := range rows {
|
||||
fmt.Fprintf(&b, "%.1f,%d,%.1f,%.1f,%.1f,%.0f\n",
|
||||
r.ElapsedSec, r.GPUIndex, r.TempC, r.UsagePct, r.PowerW, r.ClockMHz)
|
||||
}
|
||||
return os.WriteFile(path, b.Bytes(), 0644)
|
||||
}
|
||||
|
||||
// WriteGPUMetricsHTML writes a standalone HTML file with one SVG chart per GPU.
|
||||
func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
|
||||
// Group by GPU index preserving order.
|
||||
seen := make(map[int]bool)
|
||||
var order []int
|
||||
gpuMap := make(map[int][]GPUMetricRow)
|
||||
for _, r := range rows {
|
||||
if !seen[r.GPUIndex] {
|
||||
seen[r.GPUIndex] = true
|
||||
order = append(order, r.GPUIndex)
|
||||
}
|
||||
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
|
||||
}
|
||||
|
||||
var svgs strings.Builder
|
||||
for _, gpuIdx := range order {
|
||||
svgs.WriteString(drawGPUChartSVG(gpuMap[gpuIdx], gpuIdx))
|
||||
svgs.WriteString("\n")
|
||||
}
|
||||
|
||||
ts := time.Now().UTC().Format("2006-01-02 15:04:05 UTC")
|
||||
html := fmt.Sprintf(`<!DOCTYPE html>
|
||||
<html><head>
|
||||
<meta charset="utf-8">
|
||||
<title>GPU Stress Test Metrics</title>
|
||||
<style>
|
||||
body { font-family: sans-serif; background: #f0f0f0; margin: 0; padding: 20px; }
|
||||
h1 { text-align: center; color: #333; margin: 0 0 8px; }
|
||||
p { text-align: center; color: #888; font-size: 13px; margin: 0 0 24px; }
|
||||
</style>
|
||||
</head><body>
|
||||
<h1>GPU Stress Test Metrics</h1>
|
||||
<p>Generated %s</p>
|
||||
%s
|
||||
</body></html>`, ts, svgs.String())
|
||||
|
||||
return os.WriteFile(path, []byte(html), 0644)
|
||||
}
|
||||
|
||||
// drawGPUChartSVG generates a self-contained SVG chart for one GPU.
|
||||
func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
|
||||
// Layout
|
||||
const W, H = 960, 520
|
||||
const plotX1 = 120 // usage axis / chart left border
|
||||
const plotX2 = 840 // power axis / chart right border
|
||||
const plotY1 = 70 // top
|
||||
const plotY2 = 465 // bottom (PH = 395)
|
||||
const PW = plotX2 - plotX1
|
||||
const PH = plotY2 - plotY1
|
||||
// Outer axes
|
||||
const tempAxisX = 60 // temp axis line
|
||||
const clockAxisX = 900 // clock axis line
|
||||
|
||||
colors := [4]string{"#e74c3c", "#3498db", "#2ecc71", "#f39c12"}
|
||||
seriesLabel := [4]string{
|
||||
fmt.Sprintf("GPU %d Temp (°C)", gpuIdx),
|
||||
fmt.Sprintf("GPU %d Usage (%%)", gpuIdx),
|
||||
fmt.Sprintf("GPU %d Power (W)", gpuIdx),
|
||||
fmt.Sprintf("GPU %d Clock (MHz)", gpuIdx),
|
||||
}
|
||||
axisLabel := [4]string{"Temperature (°C)", "GPU Usage (%)", "Power (W)", "Clock (MHz)"}
|
||||
|
||||
// Extract series
|
||||
t := make([]float64, len(rows))
|
||||
vals := [4][]float64{}
|
||||
for i := range vals {
|
||||
vals[i] = make([]float64, len(rows))
|
||||
}
|
||||
for i, r := range rows {
|
||||
t[i] = r.ElapsedSec
|
||||
vals[0][i] = r.TempC
|
||||
vals[1][i] = r.UsagePct
|
||||
vals[2][i] = r.PowerW
|
||||
vals[3][i] = r.ClockMHz
|
||||
}
|
||||
|
||||
tMin, tMax := gpuMinMax(t)
|
||||
type axisScale struct {
|
||||
ticks []float64
|
||||
min, max float64
|
||||
}
|
||||
var axes [4]axisScale
|
||||
for i := 0; i < 4; i++ {
|
||||
mn, mx := gpuMinMax(vals[i])
|
||||
tks := gpuNiceTicks(mn, mx, 8)
|
||||
axes[i] = axisScale{ticks: tks, min: tks[0], max: tks[len(tks)-1]}
|
||||
}
|
||||
|
||||
xv := func(tv float64) float64 {
|
||||
if tMax == tMin {
|
||||
return float64(plotX1)
|
||||
}
|
||||
return float64(plotX1) + (tv-tMin)/(tMax-tMin)*float64(PW)
|
||||
}
|
||||
yv := func(v float64, ai int) float64 {
|
||||
a := axes[ai]
|
||||
if a.max == a.min {
|
||||
return float64(plotY1 + PH/2)
|
||||
}
|
||||
return float64(plotY2) - (v-a.min)/(a.max-a.min)*float64(PH)
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
|
||||
fmt.Fprintf(&b, `<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d"`+
|
||||
` style="background:#fff;border-radius:8px;display:block;margin:0 auto 24px;`+
|
||||
`box-shadow:0 2px 12px rgba(0,0,0,.12)">`+"\n", W, H)
|
||||
|
||||
// Title
|
||||
fmt.Fprintf(&b, `<text x="%d" y="22" text-anchor="middle" font-family="sans-serif"`+
|
||||
` font-size="14" font-weight="bold" fill="#333">GPU Stress Test Metrics — GPU %d</text>`+"\n",
|
||||
plotX1+PW/2, gpuIdx)
|
||||
|
||||
// Horizontal grid (align to temp axis ticks)
|
||||
b.WriteString(`<g stroke="#e0e0e0" stroke-width="0.5">` + "\n")
|
||||
for _, tick := range axes[0].ticks {
|
||||
y := yv(tick, 0)
|
||||
if y < float64(plotY1) || y > float64(plotY2) {
|
||||
continue
|
||||
}
|
||||
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"/>`+"\n",
|
||||
plotX1, y, plotX2, y)
|
||||
}
|
||||
// Vertical grid
|
||||
xTicks := gpuNiceTicks(tMin, tMax, 10)
|
||||
for _, tv := range xTicks {
|
||||
x := xv(tv)
|
||||
if x < float64(plotX1) || x > float64(plotX2) {
|
||||
continue
|
||||
}
|
||||
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d"/>`+"\n",
|
||||
x, plotY1, x, plotY2)
|
||||
}
|
||||
b.WriteString("</g>\n")
|
||||
|
||||
// Chart border
|
||||
fmt.Fprintf(&b, `<rect x="%d" y="%d" width="%d" height="%d"`+
|
||||
` fill="none" stroke="#333" stroke-width="1"/>`+"\n",
|
||||
plotX1, plotY1, PW, PH)
|
||||
|
||||
// X axis ticks and labels
|
||||
b.WriteString(`<g font-family="sans-serif" font-size="11" fill="#333" text-anchor="middle">` + "\n")
|
||||
for _, tv := range xTicks {
|
||||
x := xv(tv)
|
||||
if x < float64(plotX1) || x > float64(plotX2) {
|
||||
continue
|
||||
}
|
||||
fmt.Fprintf(&b, `<text x="%.1f" y="%d">%s</text>`+"\n", x, plotY2+18, gpuFormatTick(tv))
|
||||
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d" stroke="#333" stroke-width="1"/>`+"\n",
|
||||
x, plotY2, x, plotY2+4)
|
||||
}
|
||||
b.WriteString("</g>\n")
|
||||
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="13"`+
|
||||
` fill="#333" text-anchor="middle">Time (seconds)</text>`+"\n",
|
||||
plotX1+PW/2, plotY2+38)
|
||||
|
||||
// Y axes: [tempAxisX, plotX1, plotX2, clockAxisX]
|
||||
axisLineX := [4]int{tempAxisX, plotX1, plotX2, clockAxisX}
|
||||
axisRight := [4]bool{false, false, true, true}
|
||||
// Label x positions (for rotated vertical text)
|
||||
axisLabelX := [4]int{10, 68, 868, 950}
|
||||
|
||||
for i := 0; i < 4; i++ {
|
||||
ax := axisLineX[i]
|
||||
right := axisRight[i]
|
||||
color := colors[i]
|
||||
|
||||
// Axis line
|
||||
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
|
||||
` stroke="%s" stroke-width="1"/>`+"\n",
|
||||
ax, plotY1, ax, plotY2, color)
|
||||
|
||||
// Ticks and tick labels
|
||||
fmt.Fprintf(&b, `<g font-family="sans-serif" font-size="10" fill="%s">`+"\n", color)
|
||||
for _, tick := range axes[i].ticks {
|
||||
y := yv(tick, i)
|
||||
if y < float64(plotY1) || y > float64(plotY2) {
|
||||
continue
|
||||
}
|
||||
dx := -5
|
||||
textX := ax - 8
|
||||
anchor := "end"
|
||||
if right {
|
||||
dx = 5
|
||||
textX = ax + 8
|
||||
anchor = "start"
|
||||
}
|
||||
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"`+
|
||||
` stroke="%s" stroke-width="1"/>`+"\n",
|
||||
ax, y, ax+dx, y, color)
|
||||
fmt.Fprintf(&b, `<text x="%d" y="%.1f" text-anchor="%s" dy="4">%s</text>`+"\n",
|
||||
textX, y, anchor, gpuFormatTick(tick))
|
||||
}
|
||||
b.WriteString("</g>\n")
|
||||
|
||||
// Axis label (rotated)
|
||||
lx := axisLabelX[i]
|
||||
fmt.Fprintf(&b, `<text transform="translate(%d,%d) rotate(-90)"`+
|
||||
` font-family="sans-serif" font-size="12" fill="%s" text-anchor="middle">%s</text>`+"\n",
|
||||
lx, plotY1+PH/2, color, axisLabel[i])
|
||||
}
|
||||
|
||||
// Data lines
|
||||
for i := 0; i < 4; i++ {
|
||||
var pts strings.Builder
|
||||
for j := range rows {
|
||||
x := xv(t[j])
|
||||
y := yv(vals[i][j], i)
|
||||
if j == 0 {
|
||||
fmt.Fprintf(&pts, "%.1f,%.1f", x, y)
|
||||
} else {
|
||||
fmt.Fprintf(&pts, " %.1f,%.1f", x, y)
|
||||
}
|
||||
}
|
||||
fmt.Fprintf(&b, `<polyline points="%s" fill="none" stroke="%s" stroke-width="1.5"/>`+"\n",
|
||||
pts.String(), colors[i])
|
||||
}
|
||||
|
||||
// Legend
|
||||
const legendY = 42
|
||||
for i := 0; i < 4; i++ {
|
||||
lx := plotX1 + i*(PW/4) + 10
|
||||
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
|
||||
` stroke="%s" stroke-width="2"/>`+"\n",
|
||||
lx, legendY, lx+20, legendY, colors[i])
|
||||
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="12" fill="#333">%s</text>`+"\n",
|
||||
lx+25, legendY+4, seriesLabel[i])
|
||||
}
|
||||
|
||||
b.WriteString("</svg>\n")
|
||||
return b.String()
|
||||
}
|
||||
|
||||
const (
|
||||
ansiRed = "\033[31m"
|
||||
ansiBlue = "\033[34m"
|
||||
ansiGreen = "\033[32m"
|
||||
ansiYellow = "\033[33m"
|
||||
ansiReset = "\033[0m"
|
||||
)
|
||||
|
||||
const (
|
||||
termChartWidth = 70
|
||||
termChartHeight = 12
|
||||
)
|
||||
|
||||
// RenderGPUTerminalChart returns ANSI line charts (asciigraph-style) per GPU.
|
||||
// Used in SAT stress-test logs.
|
||||
func RenderGPUTerminalChart(rows []GPUMetricRow) string {
|
||||
seen := make(map[int]bool)
|
||||
var order []int
|
||||
gpuMap := make(map[int][]GPUMetricRow)
|
||||
for _, r := range rows {
|
||||
if !seen[r.GPUIndex] {
|
||||
seen[r.GPUIndex] = true
|
||||
order = append(order, r.GPUIndex)
|
||||
}
|
||||
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
|
||||
}
|
||||
|
||||
type seriesDef struct {
|
||||
caption string
|
||||
color string
|
||||
fn func(GPUMetricRow) float64
|
||||
}
|
||||
defs := []seriesDef{
|
||||
{"Temperature (°C)", ansiRed, func(r GPUMetricRow) float64 { return r.TempC }},
|
||||
{"GPU Usage (%)", ansiBlue, func(r GPUMetricRow) float64 { return r.UsagePct }},
|
||||
{"Power (W)", ansiGreen, func(r GPUMetricRow) float64 { return r.PowerW }},
|
||||
{"Clock (MHz)", ansiYellow, func(r GPUMetricRow) float64 { return r.ClockMHz }},
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
for _, gpuIdx := range order {
|
||||
gr := gpuMap[gpuIdx]
|
||||
if len(gr) == 0 {
|
||||
continue
|
||||
}
|
||||
tMax := gr[len(gr)-1].ElapsedSec - gr[0].ElapsedSec
|
||||
fmt.Fprintf(&b, "GPU %d — Stress Test Metrics (%.0f seconds)\n\n", gpuIdx, tMax)
|
||||
for _, d := range defs {
|
||||
b.WriteString(renderLineChart(extractGPUField(gr, d.fn), d.color, d.caption,
|
||||
termChartHeight, termChartWidth))
|
||||
b.WriteRune('\n')
|
||||
}
|
||||
}
|
||||
|
||||
return strings.TrimRight(b.String(), "\n")
|
||||
}
|
||||
|
||||
// renderLineChart draws a single time-series line chart using box-drawing characters.
|
||||
// Produces output in the style of asciigraph: ╭─╮ │ ╰─╯ with a Y axis and caption.
|
||||
func renderLineChart(vals []float64, color, caption string, height, width int) string {
|
||||
if len(vals) == 0 {
|
||||
return caption + "\n"
|
||||
}
|
||||
|
||||
mn, mx := gpuMinMax(vals)
|
||||
if mn == mx {
|
||||
mx = mn + 1
|
||||
}
|
||||
|
||||
// Use the smaller of width or len(vals) to avoid stretching sparse data.
|
||||
w := width
|
||||
if len(vals) < w {
|
||||
w = len(vals)
|
||||
}
|
||||
data := gpuDownsample(vals, w)
|
||||
|
||||
// row[i] = display row index: 0 = top = max value, height = bottom = min value.
|
||||
row := make([]int, w)
|
||||
for i, v := range data {
|
||||
r := int(math.Round((mx - v) / (mx - mn) * float64(height)))
|
||||
if r < 0 {
|
||||
r = 0
|
||||
}
|
||||
if r > height {
|
||||
r = height
|
||||
}
|
||||
row[i] = r
|
||||
}
|
||||
|
||||
// Fill the character grid.
|
||||
grid := make([][]rune, height+1)
|
||||
for i := range grid {
|
||||
grid[i] = make([]rune, w)
|
||||
for j := range grid[i] {
|
||||
grid[i][j] = ' '
|
||||
}
|
||||
}
|
||||
for x := 0; x < w; x++ {
|
||||
r := row[x]
|
||||
if x == 0 {
|
||||
grid[r][0] = '─'
|
||||
continue
|
||||
}
|
||||
p := row[x-1]
|
||||
switch {
|
||||
case r == p:
|
||||
grid[r][x] = '─'
|
||||
case r < p: // value went up (row index decreased toward top)
|
||||
grid[r][x] = '╭'
|
||||
grid[p][x] = '╯'
|
||||
for y := r + 1; y < p; y++ {
|
||||
grid[y][x] = '│'
|
||||
}
|
||||
default: // r > p, value went down
|
||||
grid[p][x] = '╮'
|
||||
grid[r][x] = '╰'
|
||||
for y := p + 1; y < r; y++ {
|
||||
grid[y][x] = '│'
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Y axis tick labels.
|
||||
ticks := gpuNiceTicks(mn, mx, height/2)
|
||||
tickAtRow := make(map[int]string)
|
||||
labelWidth := 4
|
||||
for _, t := range ticks {
|
||||
r := int(math.Round((mx - t) / (mx - mn) * float64(height)))
|
||||
if r < 0 || r > height {
|
||||
continue
|
||||
}
|
||||
s := gpuFormatTick(t)
|
||||
tickAtRow[r] = s
|
||||
if len(s) > labelWidth {
|
||||
labelWidth = len(s)
|
||||
}
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
for r := 0; r <= height; r++ {
|
||||
label := tickAtRow[r]
|
||||
fmt.Fprintf(&b, "%*s", labelWidth, label)
|
||||
switch {
|
||||
case label != "":
|
||||
b.WriteRune('┤')
|
||||
case r == height:
|
||||
b.WriteRune('┼')
|
||||
default:
|
||||
b.WriteRune('│')
|
||||
}
|
||||
b.WriteString(color)
|
||||
b.WriteString(string(grid[r]))
|
||||
b.WriteString(ansiReset)
|
||||
b.WriteRune('\n')
|
||||
}
|
||||
|
||||
// Bottom axis.
|
||||
b.WriteString(strings.Repeat(" ", labelWidth))
|
||||
b.WriteRune('└')
|
||||
b.WriteString(strings.Repeat("─", w))
|
||||
b.WriteRune('\n')
|
||||
|
||||
// Caption centered under the chart.
|
||||
if caption != "" {
|
||||
total := labelWidth + 1 + w
|
||||
if pad := (total - len([]rune(caption))) / 2; pad > 0 { // count runes, not bytes: "°" is multi-byte
|
||||
b.WriteString(strings.Repeat(" ", pad))
|
||||
}
|
||||
b.WriteString(caption)
|
||||
b.WriteRune('\n')
|
||||
}
|
||||
|
||||
return b.String()
|
||||
}
|
||||
|
||||
func extractGPUField(rows []GPUMetricRow, fn func(GPUMetricRow) float64) []float64 {
|
||||
v := make([]float64, len(rows))
|
||||
for i, r := range rows {
|
||||
v[i] = fn(r)
|
||||
}
|
||||
return v
|
||||
}
|
||||
|
||||
// gpuDownsample averages vals into w buckets (or nearest-neighbor upsamples if len(vals) < w).
|
||||
func gpuDownsample(vals []float64, w int) []float64 {
|
||||
n := len(vals)
|
||||
if n == 0 {
|
||||
return make([]float64, w)
|
||||
}
|
||||
result := make([]float64, w)
|
||||
if n >= w {
|
||||
counts := make([]int, w)
|
||||
for i, v := range vals {
|
||||
bucket := i * w / n
|
||||
if bucket >= w {
|
||||
bucket = w - 1
|
||||
}
|
||||
result[bucket] += v
|
||||
counts[bucket]++
|
||||
}
|
||||
for i := range result {
|
||||
if counts[i] > 0 {
|
||||
result[i] /= float64(counts[i])
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// Nearest-neighbour upsample.
|
||||
for i := range result {
|
||||
src := i * (n - 1) / (w - 1)
|
||||
if src >= n {
|
||||
src = n - 1
|
||||
}
|
||||
result[i] = vals[src]
|
||||
}
|
||||
}
|
||||
return result
|
||||
}
|
||||
|
||||
func gpuMinMax(vals []float64) (float64, float64) {
|
||||
if len(vals) == 0 {
|
||||
return 0, 1
|
||||
}
|
||||
mn, mx := vals[0], vals[0]
|
||||
for _, v := range vals[1:] {
|
||||
if v < mn {
|
||||
mn = v
|
||||
}
|
||||
if v > mx {
|
||||
mx = v
|
||||
}
|
||||
}
|
||||
return mn, mx
|
||||
}
|
||||
|
||||
func gpuNiceTicks(mn, mx float64, targetCount int) []float64 {
|
||||
if mn == mx {
|
||||
mn -= 1
|
||||
mx += 1
|
||||
}
|
||||
r := mx - mn
|
||||
step := math.Pow(10, math.Floor(math.Log10(r/float64(targetCount))))
|
||||
for _, f := range []float64{1, 2, 5, 10} {
|
||||
if r/(f*step) <= float64(targetCount)*1.5 {
|
||||
step = f * step
|
||||
break
|
||||
}
|
||||
}
|
||||
lo := math.Floor(mn/step) * step
|
||||
hi := math.Ceil(mx/step) * step
|
||||
var ticks []float64
|
||||
for v := lo; v <= hi+step*0.001; v += step {
|
||||
ticks = append(ticks, math.Round(v*1e9)/1e9)
|
||||
}
|
||||
return ticks
|
||||
}
|
||||
|
||||
func gpuFormatTick(v float64) string {
|
||||
if v == math.Trunc(v) {
|
||||
return strconv.Itoa(int(v))
|
||||
}
|
||||
return strconv.FormatFloat(v, 'f', 1, 64)
|
||||
}
|
||||
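The tick helper above snaps the step to 1, 2, 5 or 10 times a power of ten and then extends the range outward to whole steps. The sketch below copies that logic into a standalone program so the behaviour can be checked on concrete ranges. The name `niceTicks` and the sample ranges are illustrative and not part of the diff.

```go
package main

import (
	"fmt"
	"math"
)

// niceTicks mirrors the step selection of gpuNiceTicks from the diff above.
// The function name is hypothetical; the arithmetic is copied unchanged.
func niceTicks(mn, mx float64, targetCount int) []float64 {
	if mn == mx {
		mn--
		mx++
	}
	r := mx - mn
	// Start from the power of ten at or below the ideal step size.
	step := math.Pow(10, math.Floor(math.Log10(r/float64(targetCount))))
	// Pick the first multiplier that keeps the tick count within 1.5x the target.
	for _, f := range []float64{1, 2, 5, 10} {
		if r/(f*step) <= float64(targetCount)*1.5 {
			step *= f
			break
		}
	}
	lo := math.Floor(mn/step) * step
	hi := math.Ceil(mx/step) * step
	var ticks []float64
	for v := lo; v <= hi+step*0.001; v += step {
		// Rounding hides accumulated floating-point error in the labels.
		ticks = append(ticks, math.Round(v*1e9)/1e9)
	}
	return ticks
}

func main() {
	fmt.Println(niceTicks(0, 100, 10)) // X axis style: many ticks
	fmt.Println(niceTicks(37, 83, 6))  // terminal chart style: height/2 ticks
}
```

For the range 37..83 with a target of 6, the step becomes 10 and the ticks run from 30 to 90, so the first and last ticks can fall outside the data range. This is why the callers above skip ticks that land outside the plot area.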
214
audit/internal/platform/install.go
Normal file
@@ -0,0 +1,214 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"os"
|
||||
"os/exec"
|
||||
"strconv"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// InstallDisk describes a candidate disk for installation.
|
||||
type InstallDisk struct {
|
||||
Device string // e.g. /dev/sda
|
||||
Model string
|
||||
Size string // human-readable, e.g. "500G"
|
||||
SizeBytes int64 // raw byte count from lsblk
|
||||
MountedParts []string // partition mount points currently active
|
||||
}
|
||||
|
||||
const squashfsPath = "/run/live/medium/live/filesystem.squashfs"
|
||||
|
||||
// ListInstallDisks returns block devices suitable for installation.
|
||||
// Excludes the current live boot medium but includes USB drives.
|
||||
func (s *System) ListInstallDisks() ([]InstallDisk, error) {
|
||||
out, err := exec.Command("lsblk", "-dn", "-o", "NAME,MODEL,SIZE,TYPE,TRAN").Output()
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("lsblk: %w", err)
|
||||
}
|
||||
|
||||
bootDev := findLiveBootDevice()
|
||||
|
||||
var disks []InstallDisk
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
fields := strings.Fields(line)
|
||||
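// NAME MODEL SIZE TYPE TRAN: MODEL may contain spaces, so SIZE, TYPE and TRAN are taken from the end. This assumes TRAN is non-empty; a disk that reports no transport shifts the columns and is skipped.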
|
||||
if len(fields) < 4 {
|
||||
continue
|
||||
}
|
||||
// Last field: TRAN, second-to-last: TYPE, third-to-last: SIZE
|
||||
typ := fields[len(fields)-2]
|
||||
size := fields[len(fields)-3]
|
||||
name := fields[0]
|
||||
model := strings.Join(fields[1:len(fields)-3], " ")
|
||||
|
||||
if typ != "disk" {
|
||||
continue
|
||||
}
|
||||
|
||||
device := "/dev/" + name
|
||||
if device == bootDev {
|
||||
continue
|
||||
}
|
||||
|
||||
sizeBytes := diskSizeBytes(device)
|
||||
mounted := mountedParts(device)
|
||||
|
||||
disks = append(disks, InstallDisk{
|
||||
Device: device,
|
||||
Model: strings.TrimSpace(model),
|
||||
Size: size,
|
||||
SizeBytes: sizeBytes,
|
||||
MountedParts: mounted,
|
||||
})
|
||||
}
|
||||
return disks, nil
|
||||
}
|
||||
|
||||
// diskSizeBytes returns the byte size of a block device using lsblk.
|
||||
func diskSizeBytes(device string) int64 {
|
||||
out, err := exec.Command("lsblk", "-bdn", "-o", "SIZE", device).Output()
|
||||
if err != nil {
|
||||
return 0
|
||||
}
|
||||
n, _ := strconv.ParseInt(strings.TrimSpace(string(out)), 10, 64)
|
||||
return n
|
||||
}
|
||||
|
||||
// mountedParts returns a list of "<part> at <mountpoint>" strings for any
|
||||
// mounted partitions on the given device.
|
||||
func mountedParts(device string) []string {
|
||||
out, err := exec.Command("lsblk", "-n", "-o", "NAME,MOUNTPOINT", device).Output()
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
var result []string
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
fields := strings.Fields(line)
|
||||
if len(fields) < 2 {
|
||||
continue
|
||||
}
|
||||
mp := fields[1]
|
||||
if mp == "" || mp == "[SWAP]" {
|
||||
continue
|
||||
}
|
||||
result = append(result, "/dev/"+strings.TrimLeft(fields[0], "└─├─")+" at "+mp)
|
||||
}
|
||||
return result
|
||||
}
|
||||
|
||||
// findLiveBootDevice returns the block device backing /run/live/medium (if any).
|
||||
func findLiveBootDevice() string {
|
||||
out, err := exec.Command("findmnt", "-n", "-o", "SOURCE", "/run/live/medium").Output()
|
||||
if err != nil {
|
||||
return ""
|
||||
}
|
||||
src := strings.TrimSpace(string(out))
|
||||
if src == "" {
|
||||
return ""
|
||||
}
|
||||
// Strip partition suffix to get the whole disk device.
|
||||
// e.g. /dev/sdb1 → /dev/sdb, /dev/nvme0n1p1 → /dev/nvme0n1
|
||||
out2, err := exec.Command("lsblk", "-no", "PKNAME", src).Output()
|
||||
if err != nil || strings.TrimSpace(string(out2)) == "" {
|
||||
return src
|
||||
}
|
||||
return "/dev/" + strings.TrimSpace(string(out2))
|
||||
}
|
||||
|
||||
// MinInstallBytes returns the minimum recommended disk size for installation:
|
||||
// squashfs size × 1.5 to allow for extracted filesystem and bootloader.
|
||||
// Returns 0 if the squashfs is not available (non-live environment).
|
||||
func MinInstallBytes() int64 {
|
||||
fi, err := os.Stat(squashfsPath)
|
||||
if err != nil {
|
||||
return 0
|
||||
}
|
||||
return fi.Size() * 3 / 2
|
||||
}
|
||||
|
||||
// toramActive returns true when the live system was booted with toram.
|
||||
func toramActive() bool {
|
||||
data, err := os.ReadFile("/proc/cmdline")
|
||||
if err != nil {
|
||||
return false
|
||||
}
|
||||
return strings.Contains(string(data), "toram")
|
||||
}
|
||||
|
||||
// freeMemBytes returns MemAvailable from /proc/meminfo.
|
||||
func freeMemBytes() int64 {
|
||||
data, err := os.ReadFile("/proc/meminfo")
|
||||
if err != nil {
|
||||
return 0
|
||||
}
|
||||
for _, line := range strings.Split(string(data), "\n") {
|
||||
if strings.HasPrefix(line, "MemAvailable:") {
|
||||
fields := strings.Fields(line)
|
||||
if len(fields) >= 2 {
|
||||
n, _ := strconv.ParseInt(fields[1], 10, 64)
|
||||
return n * 1024 // kB → bytes
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
// DiskWarnings returns advisory warning strings for a disk candidate.
|
||||
func DiskWarnings(d InstallDisk) []string {
|
||||
var w []string
|
||||
if len(d.MountedParts) > 0 {
|
||||
w = append(w, "has mounted partitions: "+strings.Join(d.MountedParts, ", "))
|
||||
}
|
||||
min := MinInstallBytes()
|
||||
if min > 0 && d.SizeBytes > 0 && d.SizeBytes < min {
|
||||
w = append(w, fmt.Sprintf("disk may be too small (need ≥ %s, have %s)",
|
||||
humanBytes(min), humanBytes(d.SizeBytes)))
|
||||
}
|
||||
if toramActive() {
|
||||
sqFi, err := os.Stat(squashfsPath)
|
||||
if err == nil {
|
||||
free := freeMemBytes()
|
||||
if free > 0 && free < sqFi.Size()*2 {
|
||||
w = append(w, "toram mode — low RAM, extraction may be slow or fail")
|
||||
}
|
||||
}
|
||||
}
|
||||
return w
|
||||
}
|
||||
|
||||
func humanBytes(b int64) string {
|
||||
const unit = 1024
|
||||
if b < unit {
|
||||
return fmt.Sprintf("%d B", b)
|
||||
}
|
||||
div, exp := int64(unit), 0
|
||||
for n := b / unit; n >= unit; n /= unit {
|
||||
div *= unit
|
||||
exp++
|
||||
}
|
||||
return fmt.Sprintf("%.1f %cB", float64(b)/float64(div), "KMGTPE"[exp])
|
||||
}
|
||||
|
||||
// InstallToDisk runs bee-install <device> <logfile> and streams output to logFile.
|
||||
// The context can be used to cancel.
|
||||
func (s *System) InstallToDisk(ctx context.Context, device string, logFile string) error {
|
||||
cmd := exec.CommandContext(ctx, "bee-install", device, logFile)
|
||||
return cmd.Run()
|
||||
}
|
||||
|
||||
// InstallLogPath returns the default install log path for a given device.
|
||||
func InstallLogPath(device string) string {
|
||||
safe := strings.NewReplacer("/", "_", " ", "_").Replace(device)
|
||||
return "/tmp/bee-install" + safe + ".log"
|
||||
}
|
||||
|
||||
// Label returns a display label for a disk.
|
||||
func (d InstallDisk) Label() string {
|
||||
model := d.Model
|
||||
if model == "" {
|
||||
model = "Unknown"
|
||||
}
|
||||
return fmt.Sprintf("%s %s %s", d.Device, d.Size, model)
|
||||
}
|
||||
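ListInstallDisks above splits `lsblk` output on whitespace and reads SIZE, TYPE and TRAN from the end of each line, which only works when every column is populated. One possible alternative, sketched below, is to ask `lsblk` for `KEY="value"` pairs (its `-P`/`--pairs` option) so that empty or space-containing columns stay unambiguous. The helper name `parsePairs` and the sample lines are assumptions made for illustration; they are not taken from the repository or from a real machine.

```go
package main

import (
	"fmt"
	"regexp"
)

// pairRe matches one KEY="value" pair as printed by `lsblk -P`.
var pairRe = regexp.MustCompile(`([A-Z0-9_:-]+)="([^"]*)"`)

// parsePairs turns one line of `lsblk -P` output into a column map.
// Hypothetical helper, not part of the diff.
func parsePairs(line string) map[string]string {
	m := make(map[string]string)
	for _, g := range pairRe.FindAllStringSubmatch(line, -1) {
		m[g[1]] = g[2]
	}
	return m
}

func main() {
	// Illustrative lines in the shape produced by
	// `lsblk -dn -P -o NAME,MODEL,SIZE,TYPE,TRAN`.
	lines := []string{
		`NAME="sda" MODEL="Samsung SSD 860" SIZE="465.8G" TYPE="disk" TRAN="sata"`,
		`NAME="vda" MODEL="" SIZE="20G" TYPE="disk" TRAN=""`,
	}
	for _, l := range lines {
		p := parsePairs(l)
		fmt.Printf("%s model=%q size=%s type=%s tran=%q\n",
			p["NAME"], p["MODEL"], p["SIZE"], p["TYPE"], p["TRAN"])
	}
}
```

With this shape, the second sample line still yields `TYPE="disk"` even though both MODEL and TRAN are empty.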
191
audit/internal/platform/install_to_ram.go
Normal file
@@ -0,0 +1,191 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// IsLiveMediaInRAM reports whether /run/live/medium is backed by tmpfs.
// If findmnt fails, it falls back to checking the toram kernel parameter.
func (s *System) IsLiveMediaInRAM() bool {
|
||||
out, err := exec.Command("findmnt", "-n", "-o", "FSTYPE", "/run/live/medium").Output()
|
||||
if err != nil {
|
||||
return toramActive()
|
||||
}
|
||||
return strings.TrimSpace(string(out)) == "tmpfs"
|
||||
}
|
||||
|
||||
// RunInstallToRAM copies the squashfs images and the remaining medium files
// into /dev/shm/bee-live, re-points the loop devices at the RAM copies and
// bind-mounts the copy over /run/live/medium. Progress messages go to logFunc,
// which may be nil.
func (s *System) RunInstallToRAM(ctx context.Context, logFunc func(string)) error {
|
||||
log := func(msg string) {
|
||||
if logFunc != nil {
|
||||
logFunc(msg)
|
||||
}
|
||||
}
|
||||
|
||||
if s.IsLiveMediaInRAM() {
|
||||
log("Already running from RAM — installation media can be safely disconnected.")
|
||||
return nil
|
||||
}
|
||||
|
||||
squashfsFiles, err := filepath.Glob("/run/live/medium/live/*.squashfs")
|
||||
if err != nil || len(squashfsFiles) == 0 {
|
||||
return fmt.Errorf("no squashfs files found in /run/live/medium/live/")
|
||||
}
|
||||
|
||||
free := freeMemBytes()
|
||||
var needed int64
|
||||
for _, sf := range squashfsFiles {
|
||||
fi, err2 := os.Stat(sf)
|
||||
if err2 != nil {
|
||||
return fmt.Errorf("stat %s: %w", sf, err2)
|
||||
}
|
||||
needed += fi.Size()
|
||||
}
|
||||
const headroom = 256 * 1024 * 1024
|
||||
if free > 0 && needed+headroom > free {
|
||||
return fmt.Errorf("insufficient RAM: need %s, available %s",
|
||||
humanBytes(needed+headroom), humanBytes(free))
|
||||
}
|
||||
|
||||
dstDir := "/dev/shm/bee-live"
|
||||
if err := os.MkdirAll(dstDir, 0755); err != nil {
|
||||
return fmt.Errorf("create tmpfs dir: %w", err)
|
||||
}
|
||||
|
||||
for _, sf := range squashfsFiles {
|
||||
if err := ctx.Err(); err != nil {
|
||||
return err
|
||||
}
|
||||
base := filepath.Base(sf)
|
||||
dst := filepath.Join(dstDir, base)
|
||||
log(fmt.Sprintf("Copying %s to RAM...", base))
|
||||
if err := copyFileLarge(ctx, sf, dst, log); err != nil {
|
||||
return fmt.Errorf("copy %s: %w", base, err)
|
||||
}
|
||||
log(fmt.Sprintf("Copied %s.", base))
|
||||
|
||||
loopDev, err := findLoopForFile(sf)
|
||||
if err != nil {
|
||||
log(fmt.Sprintf("Loop device for %s not found (%v) — skipping re-association.", base, err))
|
||||
continue
|
||||
}
|
||||
if err := reassociateLoopDevice(loopDev, dst); err != nil {
|
||||
log(fmt.Sprintf("Warning: could not re-associate %s → %s: %v", loopDev, dst, err))
|
||||
} else {
|
||||
log(fmt.Sprintf("Loop device %s now backed by RAM copy.", loopDev))
|
||||
}
|
||||
}
|
||||
|
||||
log("Copying remaining medium files...")
|
||||
if err := cpDir(ctx, "/run/live/medium", dstDir, log); err != nil {
|
||||
log(fmt.Sprintf("Warning: partial copy: %v", err))
|
||||
}
|
||||
if err := ctx.Err(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := exec.Command("mount", "--bind", dstDir, "/run/live/medium").Run(); err != nil {
|
||||
log(fmt.Sprintf("Warning: rebind /run/live/medium failed: %v", err))
|
||||
}
|
||||
|
||||
log("Done. Installation media can be safely disconnected.")
|
||||
return nil
|
||||
}
|
||||
|
||||
func copyFileLarge(ctx context.Context, src, dst string, logFunc func(string)) error {
|
||||
in, err := os.Open(src)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer in.Close()
|
||||
fi, err := in.Stat()
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
out, err := os.Create(dst)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer out.Close()
|
||||
total := fi.Size()
|
||||
var copied int64
|
||||
buf := make([]byte, 4*1024*1024)
|
||||
for {
|
||||
if err := ctx.Err(); err != nil {
|
||||
return err
|
||||
}
|
||||
n, err := in.Read(buf)
|
||||
if n > 0 {
|
||||
if _, werr := out.Write(buf[:n]); werr != nil {
|
||||
return werr
|
||||
}
|
||||
copied += int64(n)
|
||||
if logFunc != nil && total > 0 {
|
||||
pct := int(float64(copied) / float64(total) * 100)
|
||||
logFunc(fmt.Sprintf(" %s / %s (%d%%)", humanBytes(copied), humanBytes(total), pct))
|
||||
}
|
||||
}
|
||||
if err == io.EOF {
|
||||
break
|
||||
}
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return out.Sync()
|
||||
}
|
||||
|
||||
func cpDir(ctx context.Context, src, dst string, logFunc func(string)) error {
|
||||
return filepath.Walk(src, func(path string, fi os.FileInfo, err error) error {
|
||||
if ctx.Err() != nil {
|
||||
return ctx.Err()
|
||||
}
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
rel, _ := filepath.Rel(src, path)
|
||||
target := filepath.Join(dst, rel)
|
||||
if fi.IsDir() {
|
||||
return os.MkdirAll(target, fi.Mode())
|
||||
}
|
||||
if strings.HasSuffix(path, ".squashfs") {
|
||||
return nil
|
||||
}
|
||||
if _, err := os.Stat(target); err == nil {
|
||||
return nil
|
||||
}
|
||||
return copyFileLarge(ctx, path, target, nil)
|
||||
})
|
||||
}
|
||||
|
||||
func findLoopForFile(backingFile string) (string, error) {
|
||||
out, err := exec.Command("losetup", "--list", "--json").Output()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
var result struct {
|
||||
Loopdevices []struct {
|
||||
Name string `json:"name"`
|
||||
BackFile string `json:"back-file"`
|
||||
} `json:"loopdevices"`
|
||||
}
|
||||
if err := json.Unmarshal(out, &result); err != nil {
|
||||
return "", err
|
||||
}
|
||||
for _, dev := range result.Loopdevices {
|
||||
if dev.BackFile == backingFile {
|
||||
return dev.Name, nil
|
||||
}
|
||||
}
|
||||
return "", fmt.Errorf("no loop device found for %s", backingFile)
|
||||
}
|
||||
|
||||
func reassociateLoopDevice(loopDev, newFile string) error {
|
||||
if err := exec.Command("losetup", "--replace", loopDev, newFile).Run(); err == nil {
|
||||
return nil
|
||||
}
|
||||
return loopChangeFD(loopDev, newFile)
|
||||
}
|
||||
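copyFileLarge above calls the log function after every 4 MiB chunk, so a multi-gigabyte squashfs produces hundreds of progress lines. A possible refinement, sketched here as a suggestion and not as the author's method, is to report only when the whole-number percentage changes. The `progress` type and `newProgress` constructor are hypothetical names introduced for this example.

```go
package main

import "fmt"

// progress reports copy progress only when the integer percentage changes.
// Hypothetical helper, not part of the diff.
type progress struct {
	total   int64
	copied  int64
	lastPct int
	report  func(pct int)
}

func newProgress(total int64, report func(pct int)) *progress {
	// lastPct starts at -1 so that 0% is reported once.
	return &progress{total: total, lastPct: -1, report: report}
}

// add records n more copied bytes and reports if the percentage advanced.
func (p *progress) add(n int) {
	p.copied += int64(n)
	if p.total <= 0 || p.report == nil {
		return
	}
	pct := int(p.copied * 100 / p.total)
	if pct != p.lastPct {
		p.lastPct = pct
		p.report(pct)
	}
}

func main() {
	lines := 0
	p := newProgress(1<<30, func(pct int) { lines++ })
	for i := 0; i < 256; i++ { // 256 chunks of 4 MiB make up 1 GiB
		p.add(4 << 20)
	}
	fmt.Println("chunks:", 256, "log lines:", lines)
}
```

In the copy loop, `p.add(n)` would take the place of the unconditional `logFunc` call after each successful write.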
28
audit/internal/platform/install_to_ram_linux.go
Normal file
@@ -0,0 +1,28 @@
|
||||
//go:build linux
|
||||
|
||||
package platform
|
||||
|
||||
import (
|
||||
"os"
|
||||
"syscall"
|
||||
)
|
||||
|
||||
const ioctlLoopChangeFD = 0x4C06 // LOOP_CHANGE_FD in <linux/loop.h>; 0x4C08 is LOOP_SET_DIRECT_IO
|
||||
|
||||
func loopChangeFD(loopDev, newFile string) error {
|
||||
lf, err := os.OpenFile(loopDev, os.O_RDWR, 0)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer lf.Close()
|
||||
nf, err := os.OpenFile(newFile, os.O_RDONLY, 0)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer nf.Close()
|
||||
_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, lf.Fd(), ioctlLoopChangeFD, nf.Fd())
|
||||
if errno != 0 {
|
||||
return errno
|
||||
}
|
||||
return nil
|
||||
}
|
||||
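Because loopChangeFD passes a raw request number to ioctl, it is worth checking the constant against `<linux/loop.h>`, where LOOP_CHANGE_FD is 0x4C06 and 0x4C08 is LOOP_SET_DIRECT_IO. The sketch below lists a few of the loop request numbers from that header with a small lookup helper. The Go identifiers are made up for this example; `golang.org/x/sys/unix` also exports these values as named constants, which may be preferable to a hand-written number.

```go
package main

import "fmt"

// A subset of the loop-device ioctl request numbers from <linux/loop.h>.
// The Go names are illustrative.
const (
	loopSetFD        = 0x4C00
	loopClrFD        = 0x4C01
	loopChangeFD     = 0x4C06
	loopSetCapacity  = 0x4C07
	loopSetDirectIO  = 0x4C08
	loopSetBlockSize = 0x4C09
)

var loopIoctlNames = map[uintptr]string{
	loopSetFD:        "LOOP_SET_FD",
	loopClrFD:        "LOOP_CLR_FD",
	loopChangeFD:     "LOOP_CHANGE_FD",
	loopSetCapacity:  "LOOP_SET_CAPACITY",
	loopSetDirectIO:  "LOOP_SET_DIRECT_IO",
	loopSetBlockSize: "LOOP_SET_BLOCK_SIZE",
}

// loopIoctlName returns the header name for a loop ioctl request number.
func loopIoctlName(req uintptr) string {
	if name, ok := loopIoctlNames[req]; ok {
		return name
	}
	return fmt.Sprintf("unknown (0x%X)", req)
}

func main() {
	for _, req := range []uintptr{0x4C06, 0x4C08} {
		fmt.Printf("0x%X = %s\n", req, loopIoctlName(req))
	}
}
```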
9
audit/internal/platform/install_to_ram_other.go
Normal file
@@ -0,0 +1,9 @@
|
||||
//go:build !linux
|
||||
|
||||
package platform
|
||||
|
||||
import "errors"
|
||||
|
||||
func loopChangeFD(loopDev, newFile string) error {
|
||||
return errors.New("LOOP_CHANGE_FD not available on this platform")
|
||||
}
|
||||
326
audit/internal/platform/live_metrics.go
Normal file
@@ -0,0 +1,326 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"encoding/json"
|
||||
"os"
|
||||
"os/exec"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
// LiveMetricSample is a single point-in-time snapshot of server metrics
|
||||
// collected for the web UI metrics page.
|
||||
type LiveMetricSample struct {
|
||||
Timestamp time.Time `json:"ts"`
|
||||
Fans []FanReading `json:"fans"`
|
||||
Temps []TempReading `json:"temps"`
|
||||
PowerW float64 `json:"power_w"`
|
||||
CPULoadPct float64 `json:"cpu_load_pct"`
|
||||
MemLoadPct float64 `json:"mem_load_pct"`
|
||||
GPUs []GPUMetricRow `json:"gpus"`
|
||||
}
|
||||
|
||||
// TempReading is a named temperature sensor value.
|
||||
type TempReading struct {
|
||||
Name string `json:"name"`
|
||||
Group string `json:"group,omitempty"`
|
||||
Celsius float64 `json:"celsius"`
|
||||
}
|
||||
|
||||
// SampleLiveMetrics collects a single metrics snapshot from all available
// sources: GPU (via nvidia-smi, falling back to AMD), fans and temperatures
// (via ipmitool/sensors), system power (via ipmitool dcmi), CPU load
// (/proc/stat) and memory load (/proc/meminfo). Missing sources are silently
// skipped.
|
||||
func SampleLiveMetrics() LiveMetricSample {
|
||||
s := LiveMetricSample{Timestamp: time.Now().UTC()}
|
||||
|
||||
// GPU metrics — try NVIDIA first, fall back to AMD
|
||||
if gpus, err := SampleGPUMetrics(nil); err == nil && len(gpus) > 0 {
|
||||
s.GPUs = gpus
|
||||
} else if amdGPUs, err := sampleAMDGPUMetrics(); err == nil && len(amdGPUs) > 0 {
|
||||
s.GPUs = amdGPUs
|
||||
}
|
||||
|
||||
// Fan speeds — skipped silently if ipmitool unavailable
|
||||
fans, _ := sampleFanSpeeds()
|
||||
s.Fans = fans
|
||||
|
||||
s.Temps = append(s.Temps, sampleLiveTemperatureReadings()...)
|
||||
if !hasTempGroup(s.Temps, "cpu") {
|
||||
if cpuTemp := sampleCPUMaxTemp(); cpuTemp > 0 {
|
||||
s.Temps = append(s.Temps, TempReading{Name: "CPU Max", Group: "cpu", Celsius: cpuTemp})
|
||||
}
|
||||
}
|
||||
|
||||
// System power — returns 0 if unavailable
|
||||
s.PowerW = sampleSystemPower()
|
||||
|
||||
// CPU load — from /proc/stat
|
||||
s.CPULoadPct = sampleCPULoadPct()
|
||||
|
||||
// Memory load — from /proc/meminfo
|
||||
s.MemLoadPct = sampleMemLoadPct()
|
||||
|
||||
return s
|
||||
}
|
||||
|
||||
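// cpuStatPrev holds the /proc/stat counters seen by the previous
// sampleCPULoadPct call. sampleCPULoadPct returns the overall CPU utilisation
// percentage between consecutive calls; the first call returns 0. It is not
// safe for concurrent use.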
|
||||
var cpuStatPrev [2]uint64 // [total, idle]
|
||||
|
||||
func sampleCPULoadPct() float64 {
|
||||
total, idle := readCPUStat()
|
||||
if total == 0 {
|
||||
return 0
|
||||
}
|
||||
prevTotal, prevIdle := cpuStatPrev[0], cpuStatPrev[1]
|
||||
cpuStatPrev = [2]uint64{total, idle}
|
||||
if prevTotal == 0 {
|
||||
return 0
|
||||
}
|
||||
dt := float64(total - prevTotal)
|
||||
di := float64(idle - prevIdle)
|
||||
if dt <= 0 {
|
||||
return 0
|
||||
}
|
||||
pct := (1 - di/dt) * 100
|
||||
if pct < 0 {
|
||||
return 0
|
||||
}
|
||||
if pct > 100 {
|
||||
return 100
|
||||
}
|
||||
return pct
|
||||
}
|
||||
|
||||
func readCPUStat() (total, idle uint64) {
|
||||
f, err := os.Open("/proc/stat")
|
||||
if err != nil {
|
||||
return 0, 0
|
||||
}
|
||||
defer f.Close()
|
||||
sc := bufio.NewScanner(f)
|
||||
for sc.Scan() {
|
||||
line := sc.Text()
|
||||
if !strings.HasPrefix(line, "cpu ") {
|
||||
continue
|
||||
}
|
||||
fields := strings.Fields(line)[1:] // skip "cpu"
|
||||
var vals [10]uint64
|
||||
for i := 0; i < len(fields) && i < 10; i++ {
|
||||
vals[i], _ = strconv.ParseUint(fields[i], 10, 64)
|
||||
}
|
||||
// idle = idle + iowait
|
||||
idle = vals[3] + vals[4]
|
||||
for _, v := range vals {
|
||||
total += v
|
||||
}
|
||||
return total, idle
|
||||
}
|
||||
return 0, 0
|
||||
}
|
||||
|
||||
func sampleMemLoadPct() float64 {
|
||||
f, err := os.Open("/proc/meminfo")
|
||||
if err != nil {
|
||||
return 0
|
||||
}
|
||||
defer f.Close()
|
||||
vals := map[string]uint64{}
|
||||
sc := bufio.NewScanner(f)
|
||||
for sc.Scan() {
|
||||
fields := strings.Fields(sc.Text())
|
||||
if len(fields) >= 2 {
|
||||
v, _ := strconv.ParseUint(fields[1], 10, 64)
|
||||
vals[strings.TrimSuffix(fields[0], ":")] = v
|
||||
}
|
||||
}
|
||||
total := vals["MemTotal"]
|
||||
avail := vals["MemAvailable"]
|
||||
if total == 0 {
|
||||
return 0
|
||||
}
|
||||
used := total - avail
|
||||
return float64(used) / float64(total) * 100
|
||||
}
|
||||
|
||||
func hasTempGroup(temps []TempReading, group string) bool {
|
||||
for _, t := range temps {
|
||||
if t.Group == group {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func sampleLiveTemperatureReadings() []TempReading {
|
||||
if temps := sampleLiveTempsViaSensorsJSON(); len(temps) > 0 {
|
||||
return temps
|
||||
}
|
||||
return sampleLiveTempsViaIPMI()
|
||||
}
|
||||
|
||||
func sampleLiveTempsViaSensorsJSON() []TempReading {
|
||||
out, err := exec.Command("sensors", "-j").Output()
|
||||
if err != nil || len(out) == 0 {
|
||||
return nil
|
||||
}
|
||||
|
||||
var doc map[string]map[string]any
|
||||
if err := json.Unmarshal(out, &doc); err != nil {
|
||||
return nil
|
||||
}
|
||||
|
||||
chips := make([]string, 0, len(doc))
|
||||
for chip := range doc {
|
||||
chips = append(chips, chip)
|
||||
}
|
||||
sort.Strings(chips)
|
||||
|
||||
temps := make([]TempReading, 0, len(chips))
|
||||
seen := map[string]struct{}{}
|
||||
for _, chip := range chips {
|
||||
features := doc[chip]
|
||||
featureNames := make([]string, 0, len(features))
|
||||
for name := range features {
|
||||
featureNames = append(featureNames, name)
|
||||
}
|
||||
sort.Strings(featureNames)
|
||||
for _, name := range featureNames {
|
||||
if strings.EqualFold(name, "Adapter") {
|
||||
continue
|
||||
}
|
||||
feature, ok := features[name].(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
value, ok := firstTempInputValue(feature)
|
||||
if !ok || value <= 0 || value > 150 {
|
||||
continue
|
||||
}
|
||||
group := classifyLiveTempGroup(chip, name)
|
||||
if group == "gpu" {
|
||||
continue
|
||||
}
|
||||
label := strings.TrimSpace(name)
|
||||
if label == "" {
|
||||
continue
|
||||
}
|
||||
if group == "ambient" {
|
||||
label = compactAmbientTempName(chip, label)
|
||||
}
|
||||
key := group + "\x00" + label
|
||||
if _, ok := seen[key]; ok {
|
||||
continue
|
||||
}
|
||||
seen[key] = struct{}{}
|
||||
temps = append(temps, TempReading{Name: label, Group: group, Celsius: value})
|
||||
}
|
||||
}
|
||||
return temps
|
||||
}
|
||||
|
||||
func sampleLiveTempsViaIPMI() []TempReading {
|
||||
out, err := exec.Command("ipmitool", "sdr", "type", "Temperature").Output()
|
||||
if err != nil || len(out) == 0 {
|
||||
return nil
|
||||
}
|
||||
var temps []TempReading
|
||||
seen := map[string]struct{}{}
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
parts := strings.Split(line, "|")
|
||||
if len(parts) < 3 {
|
||||
continue
|
||||
}
|
||||
name := strings.TrimSpace(parts[0])
|
||||
if name == "" {
|
||||
continue
|
||||
}
|
||||
unit := strings.ToLower(strings.TrimSpace(parts[2]))
|
||||
if !strings.Contains(unit, "degrees") {
|
||||
continue
|
||||
}
|
||||
raw := strings.TrimSpace(parts[1])
|
||||
if raw == "" || strings.EqualFold(raw, "na") {
|
||||
continue
|
||||
}
|
||||
value, err := strconv.ParseFloat(raw, 64)
|
||||
if err != nil || value <= 0 || value > 150 {
|
||||
continue
|
||||
}
|
||||
group := classifyLiveTempGroup("", name)
|
||||
if group == "gpu" {
|
||||
continue
|
||||
}
|
||||
label := name
|
||||
if group == "ambient" {
|
||||
label = compactAmbientTempName("", label)
|
||||
}
|
||||
key := group + "\x00" + label
|
||||
if _, ok := seen[key]; ok {
|
||||
continue
|
||||
}
|
||||
seen[key] = struct{}{}
|
||||
temps = append(temps, TempReading{Name: label, Group: group, Celsius: value})
|
||||
}
|
||||
return temps
|
||||
}
|
||||
|
||||
func firstTempInputValue(feature map[string]any) (float64, bool) {
|
||||
keys := make([]string, 0, len(feature))
|
||||
for key := range feature {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
for _, key := range keys {
|
||||
lower := strings.ToLower(key)
|
||||
if !strings.Contains(lower, "temp") || !strings.HasSuffix(lower, "_input") {
|
||||
continue
|
||||
}
|
||||
switch value := feature[key].(type) {
|
||||
case float64:
|
||||
return value, true
|
||||
case string:
|
||||
f, err := strconv.ParseFloat(value, 64)
|
||||
if err == nil {
|
||||
return f, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
|
||||
|
||||
func classifyLiveTempGroup(chip, name string) string {
	text := strings.ToLower(strings.TrimSpace(chip + " " + name))
	switch {
	case strings.Contains(text, "gpu"), strings.Contains(text, "amdgpu"), strings.Contains(text, "nvidia"), strings.Contains(text, "adeon"):
		return "gpu"
	case strings.Contains(text, "coretemp"),
		strings.Contains(text, "k10temp"),
		strings.Contains(text, "zenpower"),
		strings.Contains(text, "package id"),
		strings.Contains(text, "x86_pkg_temp"),
		strings.Contains(text, "tctl"),
		strings.Contains(text, "tdie"),
		strings.Contains(text, "tccd"),
		strings.Contains(text, "cpu"),
		strings.Contains(text, "peci"):
		return "cpu"
	default:
		return "ambient"
	}
}

func compactAmbientTempName(chip, name string) string {
	chip = strings.TrimSpace(chip)
	name = strings.TrimSpace(name)
	if chip == "" || strings.EqualFold(chip, name) {
		return name
	}
	if strings.Contains(strings.ToLower(name), strings.ToLower(chip)) {
		return name
	}
	return chip + " / " + name
}
44
audit/internal/platform/live_metrics_test.go
Normal file
@@ -0,0 +1,44 @@
package platform

import "testing"

func TestFirstTempInputValue(t *testing.T) {
	feature := map[string]any{
		"temp1_input": 61.5,
		"temp1_max":   80.0,
	}
	got, ok := firstTempInputValue(feature)
	if !ok {
		t.Fatal("expected value")
	}
	if got != 61.5 {
		t.Fatalf("got %v want 61.5", got)
	}
}

func TestClassifyLiveTempGroup(t *testing.T) {
	tests := []struct {
		chip string
		name string
		want string
	}{
		{chip: "coretemp-isa-0000", name: "Package id 0", want: "cpu"},
		{chip: "amdgpu-pci-4300", name: "edge", want: "gpu"},
		{chip: "nvme-pci-0100", name: "Composite", want: "ambient"},
		{chip: "acpitz-acpi-0", name: "temp1", want: "ambient"},
	}
	for _, tc := range tests {
		if got := classifyLiveTempGroup(tc.chip, tc.name); got != tc.want {
			t.Fatalf("classifyLiveTempGroup(%q,%q)=%q want %q", tc.chip, tc.name, got, tc.want)
		}
	}
}

func TestCompactAmbientTempName(t *testing.T) {
	if got := compactAmbientTempName("nvme-pci-0100", "Composite"); got != "nvme-pci-0100 / Composite" {
		t.Fatalf("got %q", got)
	}
	if got := compactAmbientTempName("", "Inlet Temp"); got != "Inlet Temp" {
		t.Fatalf("got %q", got)
	}
}
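The tests for live_metrics.go cover a feature map with a single `temp1_input` key. The sketch below illustrates why `firstTempInputValue` sorts the keys before scanning: Go randomizes map iteration order, so sorting keeps the pick stable when a feature carries several `*_input` keys. `firstInput` is a hypothetical standalone copy written for illustration, and the sample map values are made up.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// firstInput is a hypothetical standalone copy of the lookup in the diff:
// it returns the first temp*_input value in sorted key order.
func firstInput(feature map[string]any) (float64, bool) {
	keys := make([]string, 0, len(feature))
	for k := range feature {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for a stable pick
	for _, k := range keys {
		lower := strings.ToLower(k)
		if !strings.Contains(lower, "temp") || !strings.HasSuffix(lower, "_input") {
			continue
		}
		switch v := feature[k].(type) {
		case float64:
			return v, true
		case string:
			// String values are accepted when they parse as floats.
			if f, err := strconv.ParseFloat(v, 64); err == nil {
				return f, true
			}
		}
	}
	return 0, false
}

func main() {
	feature := map[string]any{
		"temp2_input": 70.0,
		"temp1_input": "61.5",
		"temp1_max":   80.0,
	}
	v, ok := firstInput(feature)
	fmt.Println(v, ok) // 61.5 true
}
```

Without the sort, either `temp1_input` or `temp2_input` could be returned from one run to the next.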
325
audit/internal/platform/network.go
Normal file
@@ -0,0 +1,325 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"errors"
|
||||
"fmt"
|
||||
"os"
|
||||
"os/exec"
|
||||
"sort"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func (s *System) ListInterfaces() ([]InterfaceInfo, error) {
|
||||
names, err := listInterfaceNames()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
out := make([]InterfaceInfo, 0, len(names))
|
||||
for _, name := range names {
|
||||
state := "unknown"
|
||||
if up, err := interfaceAdminState(name); err == nil {
|
||||
if up {
|
||||
state = "up"
|
||||
} else {
|
||||
state = "down"
|
||||
}
|
||||
}
|
||||
|
||||
ipv4, err := interfaceIPv4Addrs(name)
|
||||
if err != nil {
|
||||
ipv4 = nil
|
||||
}
|
||||
|
||||
out = append(out, InterfaceInfo{Name: name, State: state, IPv4: ipv4})
|
||||
}
|
||||
|
||||
return out, nil
|
||||
}
|
||||
|
||||
func (s *System) DefaultRoute() string {
|
||||
raw, err := exec.Command("ip", "route", "show", "default").Output()
|
||||
if err != nil {
|
||||
return ""
|
||||
}
|
||||
fields := strings.Fields(string(raw))
|
||||
for i := 0; i < len(fields)-1; i++ {
|
||||
if fields[i] == "via" {
|
||||
return fields[i+1]
|
||||
}
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
func (s *System) CaptureNetworkSnapshot() (NetworkSnapshot, error) {
|
||||
names, err := listInterfaceNames()
|
||||
if err != nil {
|
||||
return NetworkSnapshot{}, err
|
||||
}
|
||||
|
||||
snapshot := NetworkSnapshot{
|
||||
Interfaces: make([]NetworkInterfaceSnapshot, 0, len(names)),
|
||||
}
|
||||
for _, name := range names {
|
||||
up, err := interfaceAdminState(name)
|
||||
if err != nil {
|
||||
return NetworkSnapshot{}, err
|
||||
}
|
||||
ipv4, err := interfaceIPv4Addrs(name)
|
||||
if err != nil {
|
||||
return NetworkSnapshot{}, err
|
||||
}
|
||||
snapshot.Interfaces = append(snapshot.Interfaces, NetworkInterfaceSnapshot{
|
||||
Name: name,
|
||||
Up: up,
|
||||
IPv4: ipv4,
|
||||
})
|
||||
}
|
||||
|
||||
if raw, err := exec.Command("ip", "route", "show", "default").Output(); err == nil {
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
if line != "" {
|
||||
snapshot.DefaultRoutes = append(snapshot.DefaultRoutes, line)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if raw, err := os.ReadFile("/etc/resolv.conf"); err == nil {
|
||||
snapshot.ResolvConf = string(raw)
|
||||
}
|
||||
|
||||
return snapshot, nil
|
||||
}
|
||||
|
||||
func (s *System) RestoreNetworkSnapshot(snapshot NetworkSnapshot) error {
|
||||
var errs []string
|
||||
|
||||
for _, iface := range snapshot.Interfaces {
|
||||
if err := exec.Command("ip", "link", "set", "dev", iface.Name, "up").Run(); err != nil {
|
||||
errs = append(errs, fmt.Sprintf("%s: bring up before restore: %v", iface.Name, err))
|
||||
continue
|
||||
}
|
||||
if err := exec.Command("ip", "addr", "flush", "dev", iface.Name).Run(); err != nil {
|
||||
errs = append(errs, fmt.Sprintf("%s: flush addresses: %v", iface.Name, err))
|
||||
}
|
||||
for _, cidr := range iface.IPv4 {
|
||||
if raw, err := exec.Command("ip", "addr", "add", cidr, "dev", iface.Name).CombinedOutput(); err != nil {
|
||||
detail := strings.TrimSpace(string(raw))
|
||||
if detail != "" {
|
||||
errs = append(errs, fmt.Sprintf("%s: restore address %s: %v: %s", iface.Name, cidr, err, detail))
|
||||
} else {
|
||||
errs = append(errs, fmt.Sprintf("%s: restore address %s: %v", iface.Name, cidr, err))
|
||||
}
|
||||
}
|
||||
}
|
||||
state := "down"
|
||||
if iface.Up {
|
||||
state = "up"
|
||||
}
|
||||
if err := exec.Command("ip", "link", "set", "dev", iface.Name, state).Run(); err != nil {
|
||||
errs = append(errs, fmt.Sprintf("%s: restore state %s: %v", iface.Name, state, err))
|
||||
}
|
||||
}
|
||||
|
||||
if err := exec.Command("ip", "route", "del", "default").Run(); err != nil {
|
||||
var exitErr *exec.ExitError
|
||||
if !errors.As(err, &exitErr) {
|
||||
errs = append(errs, fmt.Sprintf("clear default route: %v", err))
|
||||
}
|
||||
}
|
||||
for _, route := range snapshot.DefaultRoutes {
|
||||
fields := strings.Fields(route)
|
||||
if len(fields) == 0 {
|
||||
continue
|
||||
}
|
||||
// Strip state flags that ip-route(8) does not accept as add arguments.
|
||||
filtered := fields[:0]
|
||||
for _, f := range fields {
|
||||
switch f {
|
||||
case "linkdown", "dead", "onlink", "pervasive":
|
||||
// skip
|
||||
default:
|
||||
filtered = append(filtered, f)
|
||||
}
|
||||
}
|
||||
args := append([]string{"route", "add"}, filtered...)
|
||||
if raw, err := exec.Command("ip", args...).CombinedOutput(); err != nil {
|
||||
detail := strings.TrimSpace(string(raw))
|
||||
if detail != "" {
|
||||
errs = append(errs, fmt.Sprintf("restore route %q: %v: %s", route, err, detail))
|
||||
} else {
|
||||
errs = append(errs, fmt.Sprintf("restore route %q: %v", route, err))
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if err := os.WriteFile("/etc/resolv.conf", []byte(snapshot.ResolvConf), 0644); err != nil {
|
||||
errs = append(errs, fmt.Sprintf("restore resolv.conf: %v", err))
|
||||
}
|
||||
|
||||
if len(errs) > 0 {
|
||||
return errors.New(strings.Join(errs, "; "))
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (s *System) DHCPOne(iface string) (string, error) {
	var out bytes.Buffer
	if err := exec.Command("ip", "link", "set", iface, "up").Run(); err != nil {
		fmt.Fprintf(&out, "WARN: ip link set up failed: %v\n", err)
	}
	// Release any existing lease first. The output is logged whether or not
	// the release succeeds; a failure here is not fatal.
	raw, _ := exec.Command("dhclient", "-r", iface).CombinedOutput()
	out.Write(raw)
	raw, err := exec.Command("dhclient", "-4", "-v", iface).CombinedOutput()
	out.Write(raw)
	return out.String(), err
}
|
||||
func (s *System) DHCPAll() (string, error) {
|
||||
ifaces, err := listInterfaceNames()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
var out strings.Builder
|
||||
for _, iface := range ifaces {
|
||||
fmt.Fprintf(&out, "[%s]\n", iface)
|
||||
log, err := s.DHCPOne(iface)
|
||||
out.WriteString(log)
|
||||
if err != nil {
|
||||
fmt.Fprintf(&out, "ERROR: %v\n", err)
|
||||
}
|
||||
out.WriteString("\n")
|
||||
}
|
||||
return out.String(), nil
|
||||
}
|
||||
|
||||
func (s *System) SetStaticIPv4(cfg StaticIPv4Config) (string, error) {
|
||||
if cfg.Interface == "" || cfg.Address == "" || cfg.Prefix == "" {
|
||||
return "", fmt.Errorf("interface, address, and prefix are required")
|
||||
}
|
||||
|
||||
dns := cfg.DNS
|
||||
if len(dns) == 0 {
|
||||
dns = []string{"77.88.8.8", "77.88.8.1", "1.1.1.1", "8.8.8.8"}
|
||||
}
|
||||
|
||||
var out strings.Builder
|
||||
_ = exec.Command("ip", "link", "set", cfg.Interface, "up").Run()
|
||||
_ = exec.Command("ip", "addr", "flush", "dev", cfg.Interface).Run()
|
||||
if raw, err := exec.Command("ip", "addr", "add", cfg.Address+"/"+cfg.Prefix, "dev", cfg.Interface).CombinedOutput(); err != nil {
|
||||
return string(raw), err
|
||||
}
|
||||
out.WriteString("address configured\n")
|
||||
if cfg.Gateway != "" {
|
||||
_ = exec.Command("ip", "route", "del", "default").Run()
|
||||
if raw, err := exec.Command("ip", "route", "add", "default", "via", cfg.Gateway, "dev", cfg.Interface).CombinedOutput(); err != nil {
|
||||
return out.String() + string(raw), err
|
||||
}
|
||||
out.WriteString("default route configured\n")
|
||||
}
|
||||
|
||||
var resolv strings.Builder
|
||||
for _, dnsServer := range dns {
|
||||
dnsServer = strings.TrimSpace(dnsServer)
|
||||
if dnsServer == "" {
|
||||
continue
|
||||
}
|
||||
fmt.Fprintf(&resolv, "nameserver %s\n", dnsServer)
|
||||
}
|
||||
if err := os.WriteFile("/etc/resolv.conf", []byte(resolv.String()), 0644); err != nil {
|
||||
return out.String(), err
|
||||
}
|
||||
out.WriteString("dns configured\n")
|
||||
return out.String(), nil
|
||||
}
|
||||
|
||||
// SetInterfaceState brings a network interface up or down.
|
||||
func (s *System) SetInterfaceState(iface string, up bool) error {
|
||||
state := "down"
|
||||
if up {
|
||||
state = "up"
|
||||
}
|
||||
return exec.Command("ip", "link", "set", "dev", iface, state).Run()
|
||||
}
|
||||
|
||||
// GetInterfaceState reports whether the interface is administratively UP (the UP flag is set), regardless of carrier state.
|
||||
func (s *System) GetInterfaceState(iface string) (bool, error) {
|
||||
return interfaceAdminState(iface)
|
||||
}
|
||||
|
||||
func interfaceAdminState(iface string) (bool, error) {
	raw, err := exec.Command("ip", "-o", "link", "show", "dev", iface).Output()
	if err != nil {
		return false, err
	}
	return parseInterfaceAdminState(string(raw))
}

func parseInterfaceAdminState(raw string) (bool, error) {
	start := strings.IndexByte(raw, '<')
	if start == -1 {
		return false, fmt.Errorf("ip link output missing flags")
	}
	end := strings.IndexByte(raw[start+1:], '>')
	if end == -1 {
		return false, fmt.Errorf("ip link output missing flag terminator")
	}
	flags := strings.Split(raw[start+1:start+1+end], ",")
	for _, flag := range flags {
		if strings.TrimSpace(flag) == "UP" {
			return true, nil
		}
	}
	return false, nil
}
|
||||
func interfaceIPv4Addrs(iface string) ([]string, error) {
|
||||
raw, err := exec.Command("ip", "-o", "-4", "addr", "show", "dev", iface).Output()
|
||||
if err != nil {
|
||||
var exitErr *exec.ExitError
|
||||
if errors.As(err, &exitErr) {
|
||||
return nil, nil
|
||||
}
|
||||
return nil, err
|
||||
}
|
||||
var ipv4 []string
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
|
||||
fields := strings.Fields(line)
|
||||
if len(fields) >= 4 {
|
||||
ipv4 = append(ipv4, fields[3])
|
||||
}
|
||||
}
|
||||
return ipv4, nil
|
||||
}
|
||||
|
||||
func listInterfaceNames() ([]string, error) {
|
||||
raw, err := exec.Command("ip", "-o", "link", "show").Output()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
var out []string
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
|
||||
fields := strings.SplitN(line, ": ", 3)
|
||||
if len(fields) < 2 {
|
||||
continue
|
||||
}
|
||||
name := fields[1]
|
||||
if name == "lo" || strings.HasPrefix(name, "docker") || strings.HasPrefix(name, "virbr") ||
|
||||
strings.HasPrefix(name, "veth") || strings.HasPrefix(name, "tun") ||
|
||||
strings.HasPrefix(name, "tap") || strings.HasPrefix(name, "br-") ||
|
||||
strings.HasPrefix(name, "bond") || strings.HasPrefix(name, "dummy") {
|
||||
continue
|
||||
}
|
||||
out = append(out, name)
|
||||
}
|
||||
sort.Strings(out)
|
||||
return out, nil
|
||||
}
|
||||
46
audit/internal/platform/network_test.go
Normal file
@@ -0,0 +1,46 @@
package platform

import "testing"

func TestParseInterfaceAdminState(t *testing.T) {
	tests := []struct {
		name    string
		raw     string
		want    bool
		wantErr bool
	}{
		{
			name: "admin up with no carrier",
			raw:  "2: enp1s0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000\n",
			want: true,
		},
		{
			name: "admin down",
			raw:  "2: enp1s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n",
			want: false,
		},
		{
			name:    "malformed output",
			raw:     "2: enp1s0: mtu 1500 state DOWN\n",
			wantErr: true,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got, err := parseInterfaceAdminState(tt.raw)
			if tt.wantErr {
				if err == nil {
					t.Fatal("expected error")
				}
				return
			}
			if err != nil {
				t.Fatalf("unexpected error: %v", err)
			}
			if got != tt.want {
				t.Fatalf("got %v want %v", got, tt.want)
			}
		})
	}
}
43
audit/internal/platform/parse.go
Normal file
@@ -0,0 +1,43 @@
package platform

import "strings"

func parseLSBLKPairs(line string) map[string]string {
	out := map[string]string{}
	for _, part := range splitQuotedFields(line) {
		idx := strings.Index(part, "=")
		if idx <= 0 {
			continue
		}
		key := part[:idx]
		value := strings.Trim(part[idx+1:], `"`)
		out[key] = value
	}
	return out
}

func splitQuotedFields(s string) []string {
	var out []string
	var cur strings.Builder
	inQuotes := false
	for _, r := range s {
		switch r {
		case '"':
			inQuotes = !inQuotes
			cur.WriteRune(r)
		case ' ':
			if inQuotes {
				cur.WriteRune(r)
			} else if cur.Len() > 0 {
				out = append(out, cur.String())
				cur.Reset()
			}
		default:
			cur.WriteRune(r)
		}
	}
	if cur.Len() > 0 {
		out = append(out, cur.String())
	}
	return out
}
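parse.go ships without a test in this diff, so here is a small usage sketch of the quote-aware splitting, assuming input in the `KEY="value"` form that `lsblk --pairs` prints. `splitQuoted` and `parsePairs` are hypothetical standalone copies of `splitQuotedFields` and `parseLSBLKPairs`, and the sample line is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuoted splits on spaces that fall outside double quotes.
func splitQuoted(s string) []string {
	var out []string
	var cur strings.Builder
	inQuotes := false
	for _, r := range s {
		switch {
		case r == '"':
			inQuotes = !inQuotes
			cur.WriteRune(r)
		case r == ' ' && !inQuotes:
			if cur.Len() > 0 {
				out = append(out, cur.String())
				cur.Reset()
			}
		default:
			cur.WriteRune(r)
		}
	}
	if cur.Len() > 0 {
		out = append(out, cur.String())
	}
	return out
}

// parsePairs turns KEY="value" fields into a map, stripping the quotes.
func parsePairs(line string) map[string]string {
	out := map[string]string{}
	for _, part := range splitQuoted(line) {
		if idx := strings.Index(part, "="); idx > 0 {
			out[part[:idx]] = strings.Trim(part[idx+1:], `"`)
		}
	}
	return out
}

func main() {
	line := `NAME="sda" SIZE="1.8T" MODEL="Example SSD 870"`
	pairs := parsePairs(line)
	fmt.Println(pairs["NAME"], pairs["SIZE"], pairs["MODEL"])
	// sda 1.8T Example SSD 870
}
```

The spaces inside the `MODEL` value survive because the splitter only breaks fields while outside a quoted run.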
214
audit/internal/platform/runtime.go
Normal file
@@ -0,0 +1,214 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"os"
|
||||
"os/exec"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"bee/audit/internal/schema"
|
||||
)
|
||||
|
||||
var runtimeRequiredTools = []string{
|
||||
"dmidecode",
|
||||
"lspci",
|
||||
"lsblk",
|
||||
"smartctl",
|
||||
"nvme",
|
||||
"ipmitool",
|
||||
"dhclient",
|
||||
"mount",
|
||||
}
|
||||
|
||||
var runtimeTrackedServices = []string{
|
||||
"bee-network",
|
||||
"bee-nvidia",
|
||||
"bee-preflight",
|
||||
"bee-audit",
|
||||
"bee-web",
|
||||
"bee-sshsetup",
|
||||
}
|
||||
|
||||
func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
|
||||
checkedAt := time.Now().UTC().Format(time.RFC3339)
|
||||
health := schema.RuntimeHealth{
|
||||
Status: "OK",
|
||||
CheckedAt: checkedAt,
|
||||
ExportDir: strings.TrimSpace(exportDir),
|
||||
}
|
||||
|
||||
if health.ExportDir != "" {
|
||||
if err := os.MkdirAll(health.ExportDir, 0755); err != nil {
|
||||
health.Status = "FAILED"
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "export_dir_unavailable",
|
||||
Severity: "critical",
|
||||
Description: err.Error(),
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
interfaces, err := s.ListInterfaces()
|
||||
if err == nil {
|
||||
health.Interfaces = make([]schema.RuntimeInterface, 0, len(interfaces))
|
||||
hasIPv4 := false
|
||||
missingIPv4 := false
|
||||
for _, iface := range interfaces {
|
||||
outcome := "no_offer"
|
||||
if len(iface.IPv4) > 0 {
|
||||
outcome = "lease_acquired"
|
||||
hasIPv4 = true
|
||||
} else if strings.EqualFold(iface.State, "DOWN") {
|
||||
outcome = "link_down"
|
||||
} else {
|
||||
missingIPv4 = true
|
||||
}
|
||||
health.Interfaces = append(health.Interfaces, schema.RuntimeInterface{
|
||||
Name: iface.Name,
|
||||
State: iface.State,
|
||||
IPv4: iface.IPv4,
|
||||
Outcome: outcome,
|
||||
})
|
||||
}
|
||||
switch {
|
||||
case hasIPv4 && !missingIPv4:
|
||||
health.NetworkStatus = "OK"
|
||||
case hasIPv4:
|
||||
health.NetworkStatus = "PARTIAL"
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "dhcp_partial",
|
||||
Severity: "warning",
|
||||
Description: "At least one interface did not obtain IPv4 connectivity.",
|
||||
})
|
||||
default:
|
||||
health.NetworkStatus = "FAILED"
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "dhcp_failed",
|
||||
Severity: "warning",
|
||||
Description: "No physical interface obtained IPv4 connectivity.",
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
vendor := s.DetectGPUVendor()
|
||||
for _, tool := range s.runtimeToolStatuses(vendor) {
|
||||
health.Tools = append(health.Tools, schema.RuntimeToolStatus{
|
||||
Name: tool.Name,
|
||||
Path: tool.Path,
|
||||
OK: tool.OK,
|
||||
})
|
||||
if !tool.OK {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "tool_missing",
|
||||
Severity: "warning",
|
||||
Description: "Required tool missing: " + tool.Name,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
for _, name := range runtimeTrackedServices {
|
||||
health.Services = append(health.Services, schema.RuntimeServiceStatus{
|
||||
Name: name,
|
||||
Status: s.ServiceState(name),
|
||||
})
|
||||
}
|
||||
|
||||
s.collectGPURuntimeHealth(vendor, &health)
|
||||
|
||||
if health.Status != "FAILED" && len(health.Issues) > 0 {
|
||||
health.Status = "PARTIAL"
|
||||
}
|
||||
return health, nil
|
||||
}
|
||||
|
||||
func commandText(name string, args ...string) string {
|
||||
raw, err := exec.Command(name, args...).CombinedOutput()
|
||||
if err != nil && len(raw) == 0 {
|
||||
return ""
|
||||
}
|
||||
return string(raw)
|
||||
}
|
||||
|
||||
func (s *System) runtimeToolStatuses(vendor string) []ToolStatus {
|
||||
tools := s.CheckTools(runtimeRequiredTools)
|
||||
switch vendor {
|
||||
case "nvidia":
|
||||
tools = append(tools, s.CheckTools([]string{
|
||||
"nvidia-smi",
|
||||
"nvidia-bug-report.sh",
|
||||
"bee-gpu-stress",
|
||||
})...)
|
||||
case "amd":
|
||||
tool := ToolStatus{Name: "rocm-smi"}
|
||||
if cmd, err := resolveROCmSMICommand(); err == nil && len(cmd) > 0 {
|
||||
tool.Path = cmd[0]
|
||||
if len(cmd) > 1 && strings.HasSuffix(cmd[1], "rocm_smi.py") {
|
||||
tool.Path = cmd[1]
|
||||
}
|
||||
tool.OK = true
|
||||
}
|
||||
tools = append(tools, tool)
|
||||
}
|
||||
return tools
|
||||
}
|
||||
|
||||
func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) {
|
||||
lsmodText := commandText("lsmod")
|
||||
|
||||
switch vendor {
|
||||
case "nvidia":
|
||||
health.DriverReady = strings.Contains(lsmodText, "nvidia ")
|
||||
if !health.DriverReady {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "nvidia_kernel_module_missing",
|
||||
Severity: "warning",
|
||||
Description: "NVIDIA kernel module is not loaded.",
|
||||
})
|
||||
}
|
||||
if health.DriverReady && !strings.Contains(lsmodText, "nvidia_modeset") {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "nvidia_modeset_failed",
|
||||
Severity: "warning",
|
||||
Description: "nvidia-modeset is not loaded; display/CUDA stack may be partial.",
|
||||
})
|
||||
}
|
||||
if out, err := exec.Command("nvidia-smi", "-L").CombinedOutput(); err == nil && strings.TrimSpace(string(out)) != "" {
|
||||
health.DriverReady = true
|
||||
}
|
||||
|
||||
if lookErr := exec.Command("sh", "-c", "command -v bee-gpu-stress >/dev/null 2>&1").Run(); lookErr == nil {
|
||||
out, err := exec.Command("bee-gpu-stress", "--seconds", "1", "--size-mb", "1").CombinedOutput()
|
||||
if err == nil {
|
||||
health.CUDAReady = true
|
||||
} else if strings.Contains(strings.ToLower(string(out)), "cuda_error_system_not_ready") {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "cuda_runtime_not_ready",
|
||||
Severity: "warning",
|
||||
Description: "CUDA runtime is not ready for GPU SAT.",
|
||||
})
|
||||
}
|
||||
}
|
||||
case "amd":
|
||||
health.DriverReady = strings.Contains(lsmodText, "amdgpu ") || strings.Contains(lsmodText, "amdkfd")
|
||||
if !health.DriverReady {
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "amdgpu_kernel_module_missing",
|
||||
Severity: "warning",
|
||||
Description: "AMD GPU driver is not loaded.",
|
||||
})
|
||||
}
|
||||
|
||||
out, err := runROCmSMI("--showproductname", "--csv")
|
||||
if err == nil && strings.TrimSpace(string(out)) != "" {
|
||||
health.CUDAReady = true
|
||||
health.DriverReady = true
|
||||
return
|
||||
}
|
||||
|
||||
health.Issues = append(health.Issues, schema.RuntimeIssue{
|
||||
Code: "rocm_smi_unavailable",
|
||||
Severity: "warning",
|
||||
Description: "ROCm SMI is not available for AMD GPU SAT.",
|
||||
})
|
||||
}
|
||||
}
|
||||
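`CollectRuntimeHealth` in runtime.go derives `NetworkStatus` from two flags accumulated over the interface list. A minimal sketch of that roll-up as a pure function, assuming only the `State` and `IPv4` fields matter; the `iface` type and `networkStatus` name are hypothetical and the addresses are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// iface is a hypothetical stand-in for the InterfaceInfo fields used here.
type iface struct {
	State string
	IPv4  []string
}

// networkStatus mirrors the roll-up in CollectRuntimeHealth: an interface
// with an address counts as connected, an admin-down interface is ignored,
// and any other interface counts as missing IPv4.
func networkStatus(ifaces []iface) string {
	hasIPv4, missingIPv4 := false, false
	for _, i := range ifaces {
		switch {
		case len(i.IPv4) > 0:
			hasIPv4 = true
		case strings.EqualFold(i.State, "down"):
			// link_down: not counted as a DHCP failure
		default:
			missingIPv4 = true
		}
	}
	switch {
	case hasIPv4 && !missingIPv4:
		return "OK"
	case hasIPv4:
		return "PARTIAL"
	default:
		return "FAILED"
	}
}

func main() {
	up := iface{State: "up", IPv4: []string{"192.0.2.10/24"}}
	noLease := iface{State: "up"}
	down := iface{State: "down"}
	fmt.Println(networkStatus([]iface{up, down}))    // OK
	fmt.Println(networkStatus([]iface{up, noLease})) // PARTIAL
	fmt.Println(networkStatus([]iface{down}))        // FAILED
}
```

Note that a host whose only interfaces are admin-down reports FAILED, since no interface holds an address.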
895
audit/internal/platform/sat.go
Normal file
@@ -0,0 +1,895 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"archive/tar"
|
||||
"bufio"
|
||||
"bytes"
|
||||
"compress/gzip"
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"io"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
)
|
||||
|
||||
var (
|
||||
satExecCommand = exec.Command
|
||||
satLookPath = exec.LookPath
|
||||
satGlob = filepath.Glob
|
||||
satStat = os.Stat
|
||||
|
||||
rocmSMIExecutableGlobs = []string{
|
||||
"/opt/rocm/bin/rocm-smi",
|
||||
"/opt/rocm-*/bin/rocm-smi",
|
||||
}
|
||||
rocmSMIScriptGlobs = []string{
|
||||
"/opt/rocm/libexec/rocm_smi/rocm_smi.py",
|
||||
"/opt/rocm-*/libexec/rocm_smi/rocm_smi.py",
|
||||
}
|
||||
rvsExecutableGlobs = []string{
|
||||
"/opt/rocm/bin/rvs",
|
||||
"/opt/rocm-*/bin/rvs",
|
||||
}
|
||||
)
|
||||
|
||||
// streamExecOutput runs cmd and streams each output line to logFunc (if non-nil).
// Returns combined stdout+stderr as a byte slice.
func streamExecOutput(cmd *exec.Cmd, logFunc func(string)) ([]byte, error) {
	pr, pw := io.Pipe()
	cmd.Stdout = pw
	cmd.Stderr = pw

	var buf bytes.Buffer
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		scanner := bufio.NewScanner(pr)
		for scanner.Scan() {
			line := scanner.Text()
			buf.WriteString(line + "\n")
			if logFunc != nil {
				logFunc(line)
			}
		}
	}()

	err := cmd.Start()
	if err != nil {
		_ = pw.Close()
		wg.Wait()
		return nil, err
	}
	waitErr := cmd.Wait()
	_ = pw.Close()
	wg.Wait()
	return buf.Bytes(), waitErr
}
|
||||
// NvidiaGPU holds basic GPU info from nvidia-smi.
|
||||
type NvidiaGPU struct {
|
||||
Index int
|
||||
Name string
|
||||
MemoryMB int
|
||||
}
|
||||
|
||||
// AMDGPUInfo holds basic info about an AMD GPU from rocm-smi.
|
||||
type AMDGPUInfo struct {
|
||||
Index int
|
||||
Name string
|
||||
}
|
||||
|
||||
// DetectGPUVendor returns "nvidia" if /dev/nvidia0 exists and "amd" if /dev/kfd exists; failing both, it falls back to lspci and returns "amd" when an AMD device is listed, or "" otherwise.
|
||||
func (s *System) DetectGPUVendor() string {
|
||||
if _, err := os.Stat("/dev/nvidia0"); err == nil {
|
||||
return "nvidia"
|
||||
}
|
||||
if _, err := os.Stat("/dev/kfd"); err == nil {
|
||||
return "amd"
|
||||
}
|
||||
if raw, err := exec.Command("lspci", "-nn").Output(); err == nil {
|
||||
text := strings.ToLower(string(raw))
|
||||
if strings.Contains(text, "advanced micro devices") || strings.Contains(text, "amd/ati") {
|
||||
return "amd"
|
||||
}
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
// ListAMDGPUs returns AMD GPUs visible to rocm-smi.
|
||||
func (s *System) ListAMDGPUs() ([]AMDGPUInfo, error) {
|
||||
out, err := runROCmSMI("--showproductname", "--csv")
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("rocm-smi: %w", err)
|
||||
}
|
||||
var gpus []AMDGPUInfo
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" || strings.HasPrefix(strings.ToLower(line), "device") {
|
||||
continue
|
||||
}
|
||||
parts := strings.SplitN(line, ",", 2)
|
||||
name := ""
|
||||
if len(parts) >= 2 {
|
||||
name = strings.TrimSpace(parts[1])
|
||||
}
|
||||
idx := len(gpus)
|
||||
gpus = append(gpus, AMDGPUInfo{Index: idx, Name: name})
|
||||
}
|
||||
return gpus, nil
|
||||
}
|
||||
|
||||
// RunAMDAcceptancePack runs an AMD GPU diagnostic pack using rocm-smi.
|
||||
func (s *System) RunAMDAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
|
||||
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd", []satJob{
|
||||
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
|
||||
{name: "02-rocm-smi-showallinfo.log", cmd: []string{"rocm-smi", "--showallinfo"}},
|
||||
{name: "03-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
|
||||
{name: "04-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
|
||||
}, logFunc)
|
||||
}
|
||||
|
||||
// RunAMDMemIntegrityPack runs the official RVS MEM module as a validate-style memory integrity test.
|
||||
func (s *System) RunAMDMemIntegrityPack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
|
||||
if err := ensureAMDRuntimeReady(); err != nil {
|
||||
return "", err
|
||||
}
|
||||
cfgFile := "/tmp/bee-amd-mem.conf"
|
||||
cfg := `actions:
|
||||
- name: mem_integrity
|
||||
device: all
|
||||
module: mem
|
||||
parallel: true
|
||||
duration: 60000
|
||||
copy_matrix: false
|
||||
target_stress: 90
|
||||
matrix_size: 8640
|
||||
`
|
||||
_ = os.WriteFile(cfgFile, []byte(cfg), 0644)
|
||||
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-mem", []satJob{
|
||||
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
|
||||
{name: "02-rvs-mem.log", cmd: []string{"rvs", "-c", cfgFile}},
|
||||
{name: "03-rocm-smi-after.log", cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--showmemuse", "--csv"}},
|
||||
}, logFunc)
|
||||
}
|
||||
|
||||
// RunAMDMemBandwidthPack runs AMD's memory/interconnect bandwidth-oriented tools.
|
||||
func (s *System) RunAMDMemBandwidthPack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
|
||||
if err := ensureAMDRuntimeReady(); err != nil {
|
||||
return "", err
|
||||
}
|
||||
cfgFile := "/tmp/bee-amd-babel.conf"
|
||||
cfg := `actions:
|
||||
- name: babel_mem_bw
|
||||
device: all
|
||||
module: babel
|
||||
parallel: true
|
||||
copy_matrix: true
|
||||
target_stress: 90
|
||||
matrix_size: 134217728
|
||||
`
|
||||
_ = os.WriteFile(cfgFile, []byte(cfg), 0644)
|
||||
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-bandwidth", []satJob{
|
||||
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
|
||||
{name: "02-rocm-bandwidth-test.log", cmd: []string{"rocm-bandwidth-test"}},
|
||||
{name: "03-rvs-babel.log", cmd: []string{"rvs", "-c", cfgFile}},
|
||||
{name: "04-rocm-smi-after.log", cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--showmemuse", "--csv"}},
|
||||
}, logFunc)
|
||||
}
|
||||
|
||||
// RunAMDStressPack runs an AMD GPU burn-in pack.
|
||||
// Missing tools are reported as UNSUPPORTED, consistent with the existing SAT pattern.
|
||||
func (s *System) RunAMDStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
|
||||
seconds := durationSec
|
||||
if seconds <= 0 {
|
||||
seconds = envInt("BEE_AMD_STRESS_SECONDS", 300)
|
||||
}
|
||||
if err := ensureAMDRuntimeReady(); err != nil {
|
||||
return "", err
|
||||
}
|
||||
// Build the GST config for the requested duration. Note that amdStressRVSConfig currently sets copy_matrix: false.
|
||||
rvsCfg := amdStressRVSConfig(seconds)
|
||||
cfgFile := "/tmp/bee-amd-gst.conf"
|
||||
_ = os.WriteFile(cfgFile, []byte(rvsCfg), 0644)
|
||||
|
||||
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-stress", amdStressJobs(seconds, cfgFile), logFunc)
|
||||
}
|
||||
|
||||
func amdStressRVSConfig(seconds int) string {
|
||||
return fmt.Sprintf(`actions:
|
||||
- name: gst_stress
|
||||
device: all
|
||||
module: gst
|
||||
parallel: true
|
||||
duration: %d
|
||||
copy_matrix: false
|
||||
target_stress: 90
|
||||
matrix_size_a: 8640
|
||||
matrix_size_b: 8640
|
||||
matrix_size_c: 8640
|
||||
`, seconds*1000)
|
||||
}
|
||||
|
||||
func amdStressJobs(seconds int, cfgFile string) []satJob {
|
||||
return []satJob{
|
||||
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
|
||||
{name: "02-rocm-bandwidth-test.log", cmd: []string{"rocm-bandwidth-test"}},
|
||||
{name: fmt.Sprintf("03-rvs-gst-%ds.log", seconds), cmd: []string{"rvs", "-c", cfgFile}},
|
||||
{name: "04-rocm-smi-after.log", cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--csv"}},
|
||||
}
|
||||
}
|
||||
|
||||
// ListNvidiaGPUs returns GPUs visible to nvidia-smi.
|
||||
func (s *System) ListNvidiaGPUs() ([]NvidiaGPU, error) {
|
||||
out, err := exec.Command("nvidia-smi",
|
||||
"--query-gpu=index,name,memory.total",
|
||||
"--format=csv,noheader,nounits").Output()
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("nvidia-smi: %w", err)
|
||||
}
|
||||
var gpus []NvidiaGPU
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" {
|
||||
continue
|
||||
}
|
||||
parts := strings.SplitN(line, ", ", 3)
|
||||
if len(parts) != 3 {
|
||||
continue
|
||||
}
|
||||
idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
memMB, _ := strconv.Atoi(strings.TrimSpace(parts[2]))
|
||||
gpus = append(gpus, NvidiaGPU{
|
||||
Index: idx,
|
||||
Name: strings.TrimSpace(parts[1]),
|
||||
MemoryMB: memMB,
|
||||
})
|
||||
}
|
||||
return gpus, nil
|
||||
}
|
||||
|
||||
// RunNCCLTests runs nccl-tests all_reduce_perf across all NVIDIA GPUs.
// Measures collective communication bandwidth over NVLink/PCIe.
func (s *System) RunNCCLTests(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
	// Detect the GPU count by counting non-empty nvidia-smi output lines.
	// If nvidia-smi fails or prints nothing, fall back to a single GPU.
	out, _ := exec.Command("nvidia-smi", "--query-gpu=index", "--format=csv,noheader").Output()
	gpuCount := 0
	for _, line := range strings.Split(string(out), "\n") {
		if strings.TrimSpace(line) != "" {
			gpuCount++
		}
	}
	if gpuCount < 1 {
		gpuCount = 1
	}
	return runAcceptancePackCtx(ctx, baseDir, "nccl-tests", []satJob{
		{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
		{name: "02-all-reduce-perf.log", cmd: []string{
			"all_reduce_perf", "-b", "512M", "-e", "4G", "-f", "2",
			"-g", strconv.Itoa(gpuCount), "--iters", "20",
		}},
	}, logFunc)
}
|
||||
|
||||
func (s *System) RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
|
||||
return runAcceptancePackCtx(context.Background(), baseDir, "gpu-nvidia", nvidiaSATJobs(), logFunc)
|
||||
}
|
||||
|
||||
// RunNvidiaAcceptancePackWithOptions runs the NVIDIA diagnostics via DCGM.
|
||||
// diagLevel: 1=quick, 2=medium, 3=targeted stress, 4=extended stress.
|
||||
// gpuIndices: specific GPU indices to test (empty = all GPUs).
|
||||
// ctx cancellation kills the running job.
|
||||
func (s *System) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (string, error) {
|
||||
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia", nvidiaDCGMJobs(diagLevel, gpuIndices), logFunc)
|
||||
}
|
||||
|
||||
func (s *System) RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
|
||||
sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128)
|
||||
passes := envInt("BEE_MEMTESTER_PASSES", 1)
|
||||
return runAcceptancePackCtx(ctx, baseDir, "memory", []satJob{
|
||||
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
|
||||
{name: "02-memtester.log", cmd: []string{"memtester", fmt.Sprintf("%dM", sizeMB), fmt.Sprintf("%d", passes)}},
|
||||
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
|
||||
}, logFunc)
|
||||
}
|
||||
|
||||
func (s *System) RunMemoryStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
|
||||
seconds := durationSec
|
||||
if seconds <= 0 {
|
||||
seconds = envInt("BEE_VM_STRESS_SECONDS", 300)
|
||||
}
|
||||
// Use 80% of RAM by default; override with BEE_VM_STRESS_SIZE_MB.
|
||||
sizeArg := "80%"
|
||||
if mb := envInt("BEE_VM_STRESS_SIZE_MB", 0); mb > 0 {
|
||||
sizeArg = fmt.Sprintf("%dM", mb)
|
||||
}
|
||||
return runAcceptancePackCtx(ctx, baseDir, "memory-stress", []satJob{
|
||||
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
|
||||
{name: "02-stress-ng-vm.log", cmd: []string{
|
||||
"stress-ng", "--vm", "1",
|
||||
"--vm-bytes", sizeArg,
|
||||
"--vm-method", "all",
|
||||
"--timeout", fmt.Sprintf("%d", seconds),
|
||||
"--metrics-brief",
|
||||
}},
|
||||
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
|
||||
}, logFunc)
|
||||
}
|
||||
|
||||
func (s *System) RunSATStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
|
||||
seconds := durationSec
|
||||
if seconds <= 0 {
|
||||
seconds = envInt("BEE_SAT_STRESS_SECONDS", 300)
|
||||
}
|
||||
cmd := []string{"stressapptest", "-s", fmt.Sprintf("%d", seconds), "-W", "--cc_test"}
|
||||
if mb := envInt("BEE_SAT_STRESS_MB", 0); mb > 0 {
|
||||
cmd = append(cmd, "-M", fmt.Sprintf("%d", mb))
|
||||
}
|
||||
return runAcceptancePackCtx(ctx, baseDir, "sat-stress", []satJob{
|
||||
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
|
||||
{name: "02-stressapptest.log", cmd: cmd},
|
||||
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
|
||||
}, logFunc)
|
||||
}
|
||||
|
||||
func (s *System) RunCPUAcceptancePack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
|
||||
if durationSec <= 0 {
|
||||
durationSec = 60
|
||||
}
|
||||
return runAcceptancePackCtx(ctx, baseDir, "cpu", []satJob{
|
||||
{name: "01-lscpu.log", cmd: []string{"lscpu"}},
|
||||
{name: "02-sensors-before.log", cmd: []string{"sensors"}},
|
||||
{name: "03-stress-ng.log", cmd: []string{"stress-ng", "--cpu", "0", "--cpu-method", "all", "--timeout", fmt.Sprintf("%d", durationSec)}},
|
||||
{name: "04-sensors-after.log", cmd: []string{"sensors"}},
|
||||
}, logFunc)
|
||||
}
|
||||
|
||||
func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
|
||||
if baseDir == "" {
|
||||
baseDir = "/var/log/bee-sat"
|
||||
}
|
||||
ts := time.Now().UTC().Format("20060102-150405")
|
||||
runDir := filepath.Join(baseDir, "storage-"+ts)
|
||||
if err := os.MkdirAll(runDir, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
verboseLog := filepath.Join(runDir, "verbose.log")
|
||||
|
||||
devices, err := listStorageDevices()
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
sort.Strings(devices)
|
||||
|
||||
var summary strings.Builder
|
||||
stats := satStats{}
|
||||
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
|
||||
if len(devices) == 0 {
|
||||
fmt.Fprintln(&summary, "devices=0")
|
||||
stats.Unsupported++
|
||||
} else {
|
||||
fmt.Fprintf(&summary, "devices=%d\n", len(devices))
|
||||
}
|
||||
|
||||
for index, devPath := range devices {
|
||||
if ctx.Err() != nil {
|
||||
break
|
||||
}
|
||||
prefix := fmt.Sprintf("%02d-%s", index+1, filepath.Base(devPath))
|
||||
commands := storageSATCommands(devPath)
|
||||
for cmdIndex, job := range commands {
|
||||
if ctx.Err() != nil {
|
||||
break
|
||||
}
|
||||
name := fmt.Sprintf("%s-%02d-%s.log", prefix, cmdIndex+1, job.name)
|
||||
out, err := runSATCommandCtx(ctx, verboseLog, job.name, job.cmd, nil, logFunc)
|
||||
if writeErr := os.WriteFile(filepath.Join(runDir, name), out, 0644); writeErr != nil {
|
||||
return "", writeErr
|
||||
}
|
||||
status, rc := classifySATResult(job.name, out, err)
|
||||
stats.Add(status)
|
||||
key := filepath.Base(devPath) + "_" + strings.ReplaceAll(job.name, "-", "_")
|
||||
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
|
||||
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
|
||||
}
|
||||
}
|
||||
|
||||
writeSATStats(&summary, stats)
|
||||
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
archive := filepath.Join(baseDir, "storage-"+ts+".tar.gz")
|
||||
if err := createTarGz(archive, runDir); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return archive, nil
|
||||
}
|
||||
|
||||
type satJob struct {
|
||||
name string
|
||||
cmd []string
|
||||
env []string // extra env vars (appended to os.Environ)
|
||||
collectGPU bool // collect GPU metrics via nvidia-smi while this job runs
|
||||
gpuIndices []int // GPU indices to collect metrics for (empty = all)
|
||||
}
|
||||
|
||||
type satStats struct {
|
||||
OK int
|
||||
Failed int
|
||||
Unsupported int
|
||||
}
|
||||
|
||||
func nvidiaSATJobs() []satJob {
|
||||
seconds := envInt("BEE_GPU_STRESS_SECONDS", 5)
|
||||
sizeMB := envInt("BEE_GPU_STRESS_SIZE_MB", 64)
|
||||
return []satJob{
|
||||
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
|
||||
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
|
||||
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
|
||||
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
|
||||
{name: "05-bee-gpu-stress.log", cmd: []string{"bee-gpu-stress", "--seconds", fmt.Sprintf("%d", seconds), "--size-mb", fmt.Sprintf("%d", sizeMB)}},
|
||||
}
|
||||
}
|
||||
|
||||
func nvidiaDCGMJobs(diagLevel int, gpuIndices []int) []satJob {
|
||||
if diagLevel < 1 || diagLevel > 4 {
|
||||
diagLevel = 3
|
||||
}
|
||||
diagArgs := []string{"dcgmi", "diag", "-r", strconv.Itoa(diagLevel)}
|
||||
if len(gpuIndices) > 0 {
|
||||
ids := make([]string, len(gpuIndices))
|
||||
for i, idx := range gpuIndices {
|
||||
ids[i] = strconv.Itoa(idx)
|
||||
}
|
||||
diagArgs = append(diagArgs, "-i", strings.Join(ids, ","))
|
||||
}
|
||||
return []satJob{
|
||||
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
|
||||
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
|
||||
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
|
||||
{name: "04-dcgmi-diag.log", cmd: diagArgs},
|
||||
}
|
||||
}
|
||||
|
||||
func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob, logFunc func(string)) (string, error) {
|
||||
if ctx == nil {
|
||||
ctx = context.Background()
|
||||
}
|
||||
if baseDir == "" {
|
||||
baseDir = "/var/log/bee-sat"
|
||||
}
|
||||
ts := time.Now().UTC().Format("20060102-150405")
|
||||
runDir := filepath.Join(baseDir, prefix+"-"+ts)
|
||||
if err := os.MkdirAll(runDir, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
verboseLog := filepath.Join(runDir, "verbose.log")
|
||||
|
||||
var summary strings.Builder
|
||||
stats := satStats{}
|
||||
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
|
||||
for _, job := range jobs {
|
||||
if ctx.Err() != nil {
|
||||
break
|
||||
}
|
||||
cmd := make([]string, 0, len(job.cmd))
|
||||
for _, arg := range job.cmd {
|
||||
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
|
||||
}
|
||||
|
||||
var out []byte
|
||||
var err error
|
||||
|
||||
if job.collectGPU {
|
||||
out, err = runSATCommandWithMetrics(ctx, verboseLog, job.name, cmd, job.env, job.gpuIndices, runDir, logFunc)
|
||||
} else {
|
||||
out, err = runSATCommandCtx(ctx, verboseLog, job.name, cmd, job.env, logFunc)
|
||||
}
|
||||
|
||||
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
|
||||
return "", writeErr
|
||||
}
|
||||
status, rc := classifySATResult(job.name, out, err)
|
||||
stats.Add(status)
|
||||
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
|
||||
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
|
||||
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
|
||||
}
|
||||
writeSATStats(&summary, stats)
|
||||
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
|
||||
if err := createTarGz(archive, runDir); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return archive, nil
|
||||
}
|
||||
|
||||
func runSATCommandCtx(ctx context.Context, verboseLog, name string, cmd []string, env []string, logFunc func(string)) ([]byte, error) {
|
||||
start := time.Now().UTC()
|
||||
resolvedCmd, err := resolveSATCommand(cmd)
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
|
||||
"cmd: "+strings.Join(resolvedCmd, " "),
|
||||
)
|
||||
if logFunc != nil {
|
||||
logFunc(fmt.Sprintf("=== %s ===", name))
|
||||
}
|
||||
if err != nil {
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
|
||||
"rc: 1",
|
||||
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
|
||||
"",
|
||||
)
|
||||
return []byte(err.Error() + "\n"), err
|
||||
}
|
||||
|
||||
c := exec.CommandContext(ctx, resolvedCmd[0], resolvedCmd[1:]...)
|
||||
if len(env) > 0 {
|
||||
c.Env = append(os.Environ(), env...)
|
||||
}
|
||||
out, err := streamExecOutput(c, logFunc)
|
||||
|
||||
rc := 0
|
||||
if err != nil {
|
||||
rc = 1
|
||||
}
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
|
||||
fmt.Sprintf("rc: %d", rc),
|
||||
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
|
||||
"",
|
||||
)
|
||||
return out, err
|
||||
}
|
||||
|
||||
func listStorageDevices() ([]string, error) {
|
||||
out, err := satExecCommand("lsblk", "-dn", "-o", "NAME,TYPE,TRAN").Output()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return parseStorageDevices(string(out)), nil
|
||||
}
|
||||
|
||||
func storageSATCommands(devPath string) []satJob {
|
||||
if strings.Contains(filepath.Base(devPath), "nvme") {
|
||||
return []satJob{
|
||||
{name: "nvme-id-ctrl", cmd: []string{"nvme", "id-ctrl", devPath, "-o", "json"}},
|
||||
{name: "nvme-smart-log", cmd: []string{"nvme", "smart-log", devPath, "-o", "json"}},
|
||||
{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "-s", "1", "--wait"}},
|
||||
}
|
||||
}
|
||||
return []satJob{
|
||||
{name: "smartctl-health", cmd: []string{"smartctl", "-H", "-A", devPath}},
|
||||
{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", "short", devPath}},
|
||||
}
|
||||
}
|
||||
|
||||
func (s *satStats) Add(status string) {
|
||||
switch status {
|
||||
case "OK":
|
||||
s.OK++
|
||||
case "UNSUPPORTED":
|
||||
s.Unsupported++
|
||||
default:
|
||||
s.Failed++
|
||||
}
|
||||
}
|
||||
|
||||
func (s satStats) Overall() string {
|
||||
if s.Failed > 0 {
|
||||
return "FAILED"
|
||||
}
|
||||
if s.Unsupported > 0 {
|
||||
return "PARTIAL"
|
||||
}
|
||||
return "OK"
|
||||
}
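How per-job statuses roll up into `overall_status` may be easier to see in isolation. This sketch re-creates the tally with hypothetical lower-case names (`stats`, `add`, `overall`) so it compiles on its own; the precedence (any failure wins, then any unsupported job yields PARTIAL) follows `Add` and `Overall` above.

```go
package main

import "fmt"

// stats is a standalone copy of the satStats counters, renamed for the sketch.
type stats struct{ OK, Failed, Unsupported int }

// add counts one job status; anything other than OK or UNSUPPORTED is a failure.
func (s *stats) add(status string) {
	switch status {
	case "OK":
		s.OK++
	case "UNSUPPORTED":
		s.Unsupported++
	default:
		s.Failed++
	}
}

// overall reports FAILED if any job failed, PARTIAL if any was unsupported, else OK.
func (s stats) overall() string {
	if s.Failed > 0 {
		return "FAILED"
	}
	if s.Unsupported > 0 {
		return "PARTIAL"
	}
	return "OK"
}

func main() {
	var s stats
	for _, status := range []string{"OK", "UNSUPPORTED", "OK"} {
		s.add(status)
	}
	fmt.Println(s.overall()) // prints "PARTIAL"
}
```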
|
||||
|
||||
func writeSATStats(summary *strings.Builder, stats satStats) {
|
||||
fmt.Fprintf(summary, "overall_status=%s\n", stats.Overall())
|
||||
fmt.Fprintf(summary, "job_ok=%d\n", stats.OK)
|
||||
fmt.Fprintf(summary, "job_failed=%d\n", stats.Failed)
|
||||
fmt.Fprintf(summary, "job_unsupported=%d\n", stats.Unsupported)
|
||||
}
|
||||
|
||||
func classifySATResult(name string, out []byte, err error) (string, int) {
	if err == nil {
		return "OK", 0
	}
	// Any error is reported as rc=1; the real exit code is not propagated.
	const rc = 1

	text := strings.ToLower(string(out))
	// No output at all means the tool failed to start (mlock limit, binary missing,
	// etc.), so nothing can be said about hardware health: report UNSUPPORTED.
	if len(strings.TrimSpace(text)) == 0 {
		return "UNSUPPORTED", rc
	}
	if strings.Contains(text, "unsupported") ||
		strings.Contains(text, "not supported") ||
		strings.Contains(text, "invalid opcode") ||
		strings.Contains(text, "unknown command") ||
		strings.Contains(text, "not implemented") ||
		strings.Contains(text, "not available") ||
		strings.Contains(text, "cuda_error_system_not_ready") ||
		strings.Contains(text, "no such device") ||
		// nvidia-smi on a machine with no NVIDIA GPU
		strings.Contains(text, "couldn't communicate with the nvidia driver") ||
		strings.Contains(text, "no nvidia gpu") ||
		(strings.Contains(name, "self-test") && strings.Contains(text, "aborted")) {
		return "UNSUPPORTED", rc
	}
	return "FAILED", rc
}
|
||||
|
||||
func runSATCommand(verboseLog, name string, cmd []string, logFunc func(string)) ([]byte, error) {
|
||||
start := time.Now().UTC()
|
||||
resolvedCmd, err := resolveSATCommand(cmd)
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
|
||||
"cmd: "+strings.Join(resolvedCmd, " "),
|
||||
)
|
||||
if logFunc != nil {
|
||||
logFunc(fmt.Sprintf("=== %s ===", name))
|
||||
}
|
||||
if err != nil {
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
|
||||
"rc: 1",
|
||||
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
|
||||
"",
|
||||
)
|
||||
return []byte(err.Error() + "\n"), err
|
||||
}
|
||||
|
||||
out, err := streamExecOutput(satExecCommand(resolvedCmd[0], resolvedCmd[1:]...), logFunc)
|
||||
|
||||
rc := 0
|
||||
if err != nil {
|
||||
rc = 1
|
||||
}
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
|
||||
fmt.Sprintf("rc: %d", rc),
|
||||
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
|
||||
"",
|
||||
)
|
||||
return out, err
|
||||
}
|
||||
|
||||
func runROCmSMI(args ...string) ([]byte, error) {
|
||||
cmd, err := resolveROCmSMICommand(args...)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return satExecCommand(cmd[0], cmd[1:]...).CombinedOutput()
|
||||
}
|
||||
|
||||
func resolveSATCommand(cmd []string) ([]string, error) {
|
||||
if len(cmd) == 0 {
|
||||
return nil, errors.New("empty SAT command")
|
||||
}
|
||||
switch cmd[0] {
|
||||
case "rocm-smi":
|
||||
return resolveROCmSMICommand(cmd[1:]...)
|
||||
case "rvs":
|
||||
return resolveRVSCommand(cmd[1:]...)
|
||||
}
|
||||
return cmd, nil
|
||||
}
|
||||
|
||||
func resolveRVSCommand(args ...string) ([]string, error) {
	if path, err := satLookPath("rvs"); err == nil {
		return append([]string{path}, args...), nil
	}
	// Fall back to the first existing match of the known install globs.
	if paths := expandExistingPaths(rvsExecutableGlobs); len(paths) > 0 {
		return append([]string{paths[0]}, args...), nil
	}
	return nil, errors.New("rvs not found in PATH or under /opt/rocm")
}
|
||||
|
||||
func resolveROCmSMICommand(args ...string) ([]string, error) {
	if path, err := satLookPath("rocm-smi"); err == nil {
		return append([]string{path}, args...), nil
	}

	// Fall back to the first existing executable under the known install globs.
	if paths := rocmSMIExecutableCandidates(); len(paths) > 0 {
		return append([]string{paths[0]}, args...), nil
	}

	// Last resort: run the rocm-smi Python script through python3.
	if pythonPath, err := satLookPath("python3"); err == nil {
		if scripts := rocmSMIScriptCandidates(); len(scripts) > 0 {
			return append([]string{pythonPath, scripts[0]}, args...), nil
		}
	}

	return nil, errors.New("rocm-smi not found in PATH or under /opt/rocm")
}
|
||||
|
||||
func ensureAMDRuntimeReady() error {
|
||||
if _, err := os.Stat("/dev/kfd"); err == nil {
|
||||
return nil
|
||||
}
|
||||
if raw, err := os.ReadFile("/sys/module/amdgpu/initstate"); err == nil {
|
||||
state := strings.TrimSpace(string(raw))
|
||||
if strings.EqualFold(state, "live") {
|
||||
return nil
|
||||
}
|
||||
return fmt.Errorf("AMD driver is present but not initialized: amdgpu initstate=%q", state)
|
||||
}
|
||||
return errors.New("AMD GPUs are present but the runtime is not initialized: /dev/kfd is missing and amdgpu is not loaded")
|
||||
}
|
||||
|
||||
func rocmSMIExecutableCandidates() []string {
|
||||
return expandExistingPaths(rocmSMIExecutableGlobs)
|
||||
}
|
||||
|
||||
func rocmSMIScriptCandidates() []string {
|
||||
return expandExistingPaths(rocmSMIScriptGlobs)
|
||||
}
|
||||
|
||||
func expandExistingPaths(patterns []string) []string {
|
||||
seen := make(map[string]struct{})
|
||||
var paths []string
|
||||
for _, pattern := range patterns {
|
||||
matches, err := satGlob(pattern)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
sort.Strings(matches)
|
||||
for _, match := range matches {
|
||||
if _, err := satStat(match); err != nil {
|
||||
continue
|
||||
}
|
||||
if _, ok := seen[match]; ok {
|
||||
continue
|
||||
}
|
||||
seen[match] = struct{}{}
|
||||
paths = append(paths, match)
|
||||
}
|
||||
}
|
||||
return paths
|
||||
}
|
||||
|
||||
func parseStorageDevices(raw string) []string {
|
||||
var devices []string
|
||||
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
|
||||
fields := strings.Fields(strings.TrimSpace(line))
|
||||
if len(fields) < 2 || fields[1] != "disk" {
|
||||
continue
|
||||
}
|
||||
if len(fields) >= 3 && strings.EqualFold(fields[2], "usb") {
|
||||
continue
|
||||
}
|
||||
devices = append(devices, "/dev/"+fields[0])
|
||||
}
|
||||
return devices
|
||||
}
|
||||
|
||||
// runSATCommandWithMetrics runs a command while collecting GPU metrics in the background.
|
||||
// On completion it writes gpu-metrics.csv and gpu-metrics.html into runDir.
|
||||
func runSATCommandWithMetrics(ctx context.Context, verboseLog, name string, cmd []string, env []string, gpuIndices []int, runDir string, logFunc func(string)) ([]byte, error) {
|
||||
stopCh := make(chan struct{})
|
||||
doneCh := make(chan struct{})
|
||||
var metricRows []GPUMetricRow
|
||||
start := time.Now()
|
||||
|
||||
go func() {
|
||||
defer close(doneCh)
|
||||
ticker := time.NewTicker(time.Second)
|
||||
defer ticker.Stop()
|
||||
for {
|
||||
select {
|
||||
case <-stopCh:
|
||||
return
|
||||
case <-ticker.C:
|
||||
samples, err := sampleGPUMetrics(gpuIndices)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
elapsed := time.Since(start).Seconds()
|
||||
for i := range samples {
|
||||
samples[i].ElapsedSec = elapsed
|
||||
}
|
||||
metricRows = append(metricRows, samples...)
|
||||
}
|
||||
}
|
||||
}()
|
||||
|
||||
out, err := runSATCommandCtx(ctx, verboseLog, name, cmd, env, logFunc)
|
||||
|
||||
close(stopCh)
|
||||
<-doneCh
|
||||
|
||||
if len(metricRows) > 0 {
|
||||
_ = WriteGPUMetricsCSV(filepath.Join(runDir, "gpu-metrics.csv"), metricRows)
|
||||
_ = WriteGPUMetricsHTML(filepath.Join(runDir, "gpu-metrics.html"), metricRows)
|
||||
chart := RenderGPUTerminalChart(metricRows)
|
||||
_ = os.WriteFile(filepath.Join(runDir, "gpu-metrics-term.txt"), []byte(chart), 0644)
|
||||
}
|
||||
|
||||
return out, err
|
||||
}
|
||||
|
||||
func appendSATVerboseLog(path string, lines ...string) {
|
||||
if path == "" {
|
||||
return
|
||||
}
|
||||
f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
defer f.Close()
|
||||
for _, line := range lines {
|
||||
_, _ = io.WriteString(f, line+"\n")
|
||||
}
|
||||
}
|
||||
|
||||
func envInt(name string, fallback int) int {
|
||||
raw := strings.TrimSpace(os.Getenv(name))
|
||||
if raw == "" {
|
||||
return fallback
|
||||
}
|
||||
value, err := strconv.Atoi(raw)
|
||||
if err != nil || value <= 0 {
|
||||
return fallback
|
||||
}
|
||||
return value
|
||||
}
|
||||
|
||||
func createTarGz(dst, srcDir string) (err error) {
	file, err := os.Create(dst)
	if err != nil {
		return err
	}
	gz := gzip.NewWriter(file)
	tw := tar.NewWriter(gz)
	// Close in reverse order of creation and surface the first close error:
	// tar and gzip write their trailers on Close, so a failure there means
	// the archive is truncated and must not be reported as success.
	defer func() {
		for _, c := range []io.Closer{tw, gz, file} {
			if cerr := c.Close(); cerr != nil && err == nil {
				err = cerr
			}
		}
	}()

	base := filepath.Dir(srcDir)
	return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}
		header, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		// Store paths relative to the parent of srcDir so the archive
		// unpacks into a single top-level run directory.
		rel, err := filepath.Rel(base, path)
		if err != nil {
			return err
		}
		header.Name = rel
		if err := tw.WriteHeader(header); err != nil {
			return err
		}
		src, err := os.Open(path)
		if err != nil {
			return err
		}
		defer src.Close()
		_, err = io.Copy(tw, src)
		return err
	})
}
|
||||
695
audit/internal/platform/sat_fan_stress.go
Normal file
@@ -0,0 +1,695 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
)
|
||||
|
||||
// FanStressOptions configures the fan-stress / thermal cycling test.
|
||||
type FanStressOptions struct {
|
||||
BaselineSec int // idle monitoring before and after load (default 30)
|
||||
Phase1DurSec int // first load phase duration in seconds (default 300)
|
||||
PauseSec int // pause between the two load phases (default 60)
|
||||
Phase2DurSec int // second load phase duration in seconds (default 300)
|
||||
SizeMB int // GPU memory to allocate per GPU during stress (default 64)
|
||||
GPUIndices []int // which GPU indices to stress (empty = all detected)
|
||||
}
|
||||
|
||||
// FanReading holds one fan sensor reading.
|
||||
type FanReading struct {
|
||||
Name string
|
||||
RPM float64
|
||||
}
|
||||
|
||||
// GPUStressMetric holds per-GPU metrics during the stress test.
|
||||
type GPUStressMetric struct {
|
||||
Index int
|
||||
TempC float64
|
||||
UsagePct float64
|
||||
PowerW float64
|
||||
ClockMHz float64
|
||||
Throttled bool // true if any throttle reason is active
|
||||
}
|
||||
|
||||
// FanStressRow is one second-interval telemetry sample covering all monitored dimensions.
|
||||
type FanStressRow struct {
|
||||
TimestampUTC string
|
||||
ElapsedSec float64
|
||||
Phase string // "baseline", "load1", "pause", "load2", "cooldown"
|
||||
GPUs []GPUStressMetric
|
||||
Fans []FanReading
|
||||
CPUMaxTempC float64 // highest CPU temperature from ipmitool / sensors
|
||||
SysPowerW float64 // DCMI system power reading
|
||||
}
|
||||
|
||||
// RunFanStressTest runs a two-phase GPU stress test while monitoring fan speeds,
|
||||
// temperatures, and power draw every second. Exports metrics.csv and fan-sensors.csv.
|
||||
// Designed to reproduce case-04 fan-speed lag and detect GPU thermal throttling.
|
||||
func (s *System) RunFanStressTest(ctx context.Context, baseDir string, opts FanStressOptions) (string, error) {
|
||||
if baseDir == "" {
|
||||
baseDir = "/var/log/bee-sat"
|
||||
}
|
||||
applyFanStressDefaults(&opts)
|
||||
|
||||
ts := time.Now().UTC().Format("20060102-150405")
|
||||
runDir := filepath.Join(baseDir, "fan-stress-"+ts)
|
||||
if err := os.MkdirAll(runDir, 0755); err != nil {
|
||||
return "", err
|
||||
}
|
||||
verboseLog := filepath.Join(runDir, "verbose.log")
|
||||
|
||||
// Phase name shared between sampler goroutine and main goroutine.
|
||||
var phaseMu sync.Mutex
|
||||
currentPhase := "init"
|
||||
setPhase := func(name string) {
|
||||
phaseMu.Lock()
|
||||
currentPhase = name
|
||||
phaseMu.Unlock()
|
||||
}
|
||||
getPhase := func() string {
|
||||
phaseMu.Lock()
|
||||
defer phaseMu.Unlock()
|
||||
return currentPhase
|
||||
}
|
||||
|
||||
start := time.Now()
|
||||
var rowsMu sync.Mutex
|
||||
var allRows []FanStressRow
|
||||
|
||||
// Start background sampler (every second).
|
||||
stopCh := make(chan struct{})
|
||||
doneCh := make(chan struct{})
|
||||
go func() {
|
||||
defer close(doneCh)
|
||||
ticker := time.NewTicker(time.Second)
|
||||
defer ticker.Stop()
|
||||
for {
|
||||
select {
|
||||
case <-stopCh:
|
||||
return
|
||||
case <-ticker.C:
|
||||
row := sampleFanStressRow(opts.GPUIndices, getPhase(), time.Since(start).Seconds())
|
||||
rowsMu.Lock()
|
||||
allRows = append(allRows, row)
|
||||
rowsMu.Unlock()
|
||||
}
|
||||
}
|
||||
}()
|
||||
|
||||
var summary strings.Builder
|
||||
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
|
||||
|
||||
stats := satStats{}
|
||||
|
||||
// idlePhase sleeps for durSec while the sampler stamps phaseName on each row.
|
||||
idlePhase := func(phaseName, stepName string, durSec int) {
|
||||
if ctx.Err() != nil {
|
||||
return
|
||||
}
|
||||
setPhase(phaseName)
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] start %s (idle %ds)", time.Now().UTC().Format(time.RFC3339), stepName, durSec),
|
||||
)
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
case <-time.After(time.Duration(durSec) * time.Second):
|
||||
}
|
||||
appendSATVerboseLog(verboseLog,
|
||||
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), stepName),
|
||||
)
|
||||
fmt.Fprintf(&summary, "%s_status=OK\n", stepName)
|
||||
stats.OK++
|
||||
}
|
||||
|
||||
// loadPhase runs bee-gpu-stress for durSec; sampler stamps phaseName on each row.
|
||||
loadPhase := func(phaseName, stepName string, durSec int) {
|
||||
if ctx.Err() != nil {
|
||||
return
|
||||
}
|
||||
setPhase(phaseName)
|
||||
var env []string
|
||||
if len(opts.GPUIndices) > 0 {
|
||||
ids := make([]string, len(opts.GPUIndices))
|
||||
for i, idx := range opts.GPUIndices {
|
||||
ids[i] = strconv.Itoa(idx)
|
||||
}
|
||||
env = []string{"CUDA_VISIBLE_DEVICES=" + strings.Join(ids, ",")}
|
||||
}
|
||||
cmd := []string{
|
||||
"bee-gpu-stress",
|
||||
"--seconds", strconv.Itoa(durSec),
|
||||
"--size-mb", strconv.Itoa(opts.SizeMB),
|
||||
}
|
||||
out, err := runSATCommandCtx(ctx, verboseLog, stepName, cmd, env, nil)
|
||||
_ = os.WriteFile(filepath.Join(runDir, stepName+".log"), out, 0644)
|
||||
if err != nil && err != context.Canceled && err.Error() != "signal: killed" {
|
||||
fmt.Fprintf(&summary, "%s_status=FAILED\n", stepName)
|
||||
stats.Failed++
|
||||
} else {
|
||||
fmt.Fprintf(&summary, "%s_status=OK\n", stepName)
|
||||
stats.OK++
|
||||
}
|
||||
}
|
||||
|
||||
// Execute test phases.
|
||||
idlePhase("baseline", "01-baseline", opts.BaselineSec)
|
||||
loadPhase("load1", "02-load1", opts.Phase1DurSec)
|
||||
idlePhase("pause", "03-pause", opts.PauseSec)
|
||||
loadPhase("load2", "04-load2", opts.Phase2DurSec)
|
||||
idlePhase("cooldown", "05-cooldown", opts.BaselineSec)
|
||||
|
||||
// Stop sampler and collect rows.
|
||||
close(stopCh)
|
||||
<-doneCh
|
||||
|
||||
rowsMu.Lock()
|
||||
rows := allRows
|
||||
rowsMu.Unlock()
|
||||
|
||||
// Analysis.
|
||||
throttled := analyzeThrottling(rows)
|
||||
maxGPUTemp := analyzeMaxTemp(rows, func(r FanStressRow) float64 {
|
||||
var m float64
|
||||
for _, g := range r.GPUs {
|
||||
if g.TempC > m {
|
||||
m = g.TempC
|
||||
}
|
||||
}
|
||||
return m
|
||||
})
|
||||
maxCPUTemp := analyzeMaxTemp(rows, func(r FanStressRow) float64 {
|
||||
return r.CPUMaxTempC
|
||||
})
|
||||
fanResponseSec := analyzeFanResponse(rows)
|
||||
|
||||
fmt.Fprintf(&summary, "throttling_detected=%v\n", throttled)
|
||||
fmt.Fprintf(&summary, "max_gpu_temp_c=%.1f\n", maxGPUTemp)
|
||||
fmt.Fprintf(&summary, "max_cpu_temp_c=%.1f\n", maxCPUTemp)
|
||||
if fanResponseSec >= 0 {
|
||||
fmt.Fprintf(&summary, "fan_response_sec=%.1f\n", fanResponseSec)
|
||||
} else {
|
||||
fmt.Fprintf(&summary, "fan_response_sec=N/A\n")
|
||||
}
|
||||
|
||||
// Throttling failure counts against overall result.
|
||||
if throttled {
|
||||
stats.Failed++
|
||||
}
|
||||
writeSATStats(&summary, stats)
|
||||
|
||||
// Write CSV outputs.
|
||||
if err := WriteFanStressCSV(filepath.Join(runDir, "metrics.csv"), rows, opts.GPUIndices); err != nil {
|
||||
return "", err
|
||||
}
|
||||
_ = WriteFanSensorsCSV(filepath.Join(runDir, "fan-sensors.csv"), rows)
|
||||
|
||||
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
archive := filepath.Join(baseDir, "fan-stress-"+ts+".tar.gz")
|
||||
if err := createTarGz(archive, runDir); err != nil {
|
||||
return "", err
|
||||
}
|
||||
return archive, nil
|
||||
}
|
||||
|
||||
func applyFanStressDefaults(opts *FanStressOptions) {
|
||||
if opts.BaselineSec <= 0 {
|
||||
opts.BaselineSec = 30
|
||||
}
|
||||
if opts.Phase1DurSec <= 0 {
|
||||
opts.Phase1DurSec = 300
|
||||
}
|
||||
if opts.PauseSec <= 0 {
|
||||
opts.PauseSec = 60
|
||||
}
|
||||
if opts.Phase2DurSec <= 0 {
|
||||
opts.Phase2DurSec = 300
|
||||
}
|
||||
if opts.SizeMB <= 0 {
|
||||
opts.SizeMB = 64
|
||||
}
|
||||
}
|
||||
|
||||
// sampleFanStressRow collects all metrics for one telemetry sample.
|
||||
func sampleFanStressRow(gpuIndices []int, phase string, elapsed float64) FanStressRow {
|
||||
row := FanStressRow{
|
||||
TimestampUTC: time.Now().UTC().Format(time.RFC3339),
|
||||
ElapsedSec: elapsed,
|
||||
Phase: phase,
|
||||
}
|
||||
row.GPUs = sampleGPUStressMetrics(gpuIndices)
|
||||
row.Fans, _ = sampleFanSpeeds()
|
||||
row.CPUMaxTempC = sampleCPUMaxTemp()
|
||||
row.SysPowerW = sampleSystemPower()
|
||||
return row
|
||||
}
|
||||
|
||||
// sampleGPUStressMetrics queries nvidia-smi for temperature, utilization, power,
|
||||
// clock frequency, and active throttle reasons for each GPU.
|
||||
func sampleGPUStressMetrics(gpuIndices []int) []GPUStressMetric {
|
||||
args := []string{
|
||||
"--query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics,clocks_throttle_reasons.active",
|
||||
"--format=csv,noheader,nounits",
|
||||
}
|
||||
if len(gpuIndices) > 0 {
|
||||
ids := make([]string, len(gpuIndices))
|
||||
for i, idx := range gpuIndices {
|
||||
ids[i] = strconv.Itoa(idx)
|
||||
}
|
||||
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
|
||||
}
|
||||
out, err := exec.Command("nvidia-smi", args...).Output()
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
var metrics []GPUStressMetric
|
||||
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" {
|
||||
continue
|
||||
}
|
||||
parts := strings.Split(line, ", ")
|
||||
if len(parts) < 6 {
|
||||
continue
|
||||
}
|
||||
idx, _ := strconv.Atoi(strings.TrimSpace(parts[0]))
|
||||
throttleVal := strings.TrimSpace(parts[5])
|
||||
// Throttled if active reasons bitmask is non-zero.
|
||||
throttled := throttleVal != "0x0000000000000000" &&
|
||||
throttleVal != "0x0" &&
|
||||
throttleVal != "0" &&
|
||||
throttleVal != "" &&
|
||||
throttleVal != "N/A"
|
||||
metrics = append(metrics, GPUStressMetric{
|
||||
Index: idx,
|
||||
TempC: parseGPUFloat(parts[1]),
|
||||
UsagePct: parseGPUFloat(parts[2]),
|
||||
PowerW: parseGPUFloat(parts[3]),
|
||||
ClockMHz: parseGPUFloat(parts[4]),
|
||||
Throttled: throttled,
|
||||
})
|
||||
}
|
||||
return metrics
|
||||
}
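The throttle check above compares the field against the known spellings of a zero bitmask. A minimal sketch of an alternative, assuming the field is either a hex/decimal number or non-numeric text, is to parse it once with `strconv.ParseUint`. The helper name `isThrottled` is hypothetical and not part of this change. One behavioral difference to be aware of: the sketch treats any unparseable value as "not throttled", whereas the code above treats unknown text other than "" and "N/A" as throttled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isThrottled is a hypothetical helper (not in the diff). It interprets the
// nvidia-smi clocks_throttle_reasons.active field as a numeric bitmask.
func isThrottled(raw string) bool {
	raw = strings.TrimSpace(raw)
	// Base 0 makes ParseUint accept "0x..." hex as well as plain decimal.
	mask, err := strconv.ParseUint(raw, 0, 64)
	if err != nil {
		// Assumption: non-numeric values ("", "N/A", ...) do not count as throttling.
		return false
	}
	return mask != 0
}

func main() {
	for _, v := range []string{"0x0000000000000000", "0x0", "0", "", "N/A", "0x0000000000000004"} {
		fmt.Printf("%q -> %v\n", v, isThrottled(v))
	}
}
```

Only the last input in `main` has a non-zero mask, so it is the only one reported as throttled.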
|
||||
|
||||
// sampleFanSpeeds reads fan RPM values from ipmitool sdr.
|
||||
func sampleFanSpeeds() ([]FanReading, error) {
|
||||
out, err := exec.Command("ipmitool", "sdr", "type", "Fan").Output()
|
||||
if err == nil {
|
||||
if fans := parseFanSpeeds(string(out)); len(fans) > 0 {
|
||||
return fans, nil
|
||||
}
|
||||
}
|
||||
fans, sensorsErr := sampleFanSpeedsViaSensorsJSON()
|
||||
if len(fans) > 0 {
|
||||
return fans, nil
|
||||
}
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
return nil, sensorsErr
|
||||
}
|
||||
|
||||
// parseFanSpeeds parses "ipmitool sdr type Fan" output.
// Handles two formats:
//   Old: "FAN1 | 2400.000 | RPM | ok"          (value in col[1], unit in col[2])
//   New: "FAN1 | 41h | ok | 29.1 | 4340 RPM"   (value+unit combined in last col)
|
||||
func parseFanSpeeds(raw string) []FanReading {
|
||||
var fans []FanReading
|
||||
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
|
||||
parts := strings.Split(line, "|")
|
||||
if len(parts) < 2 {
|
||||
continue
|
||||
}
|
||||
name := strings.TrimSpace(parts[0])
|
||||
// Find the first field that contains "RPM" (either as a standalone unit or inline)
|
||||
rpmVal := 0.0
|
||||
found := false
|
||||
for _, p := range parts[1:] {
|
||||
p = strings.TrimSpace(p)
|
||||
if !strings.Contains(strings.ToUpper(p), "RPM") {
|
||||
continue
|
||||
}
|
||||
if strings.EqualFold(p, "RPM") {
|
||||
continue // unit-only column in old format; value is in previous field
|
||||
}
|
||||
val, err := parseFanRPMValue(p)
|
||||
if err == nil {
|
||||
rpmVal = val
|
||||
found = true
|
||||
break
|
||||
}
|
||||
}
|
||||
// Old format: unit "RPM" is in col[2], value is in col[1]
|
||||
if !found && len(parts) >= 3 && strings.EqualFold(strings.TrimSpace(parts[2]), "RPM") {
|
||||
valStr := strings.TrimSpace(parts[1])
|
||||
if !strings.EqualFold(valStr, "na") && !strings.EqualFold(valStr, "disabled") && valStr != "" {
|
||||
if val, err := parseFanRPMValue(valStr); err == nil {
|
||||
rpmVal = val
|
||||
found = true
|
||||
}
|
||||
}
|
||||
}
|
||||
if !found {
|
||||
continue
|
||||
}
|
||||
fans = append(fans, FanReading{Name: name, RPM: rpmVal})
|
||||
}
|
||||
return fans
|
||||
}
|
||||
|
||||
func parseFanRPMValue(raw string) (float64, error) {
|
||||
fields := strings.Fields(strings.TrimSpace(strings.ReplaceAll(raw, ",", "")))
|
||||
if len(fields) == 0 {
|
||||
return 0, strconv.ErrSyntax
|
||||
}
|
||||
return strconv.ParseFloat(fields[0], 64)
|
||||
}
|
||||
|
||||
func sampleFanSpeedsViaSensorsJSON() ([]FanReading, error) {
|
||||
out, err := exec.Command("sensors", "-j").Output()
|
||||
if err != nil || len(out) == 0 {
|
||||
return nil, err
|
||||
}
|
||||
var doc map[string]map[string]any
|
||||
if err := json.Unmarshal(out, &doc); err != nil {
|
||||
return nil, err
|
||||
}
|
||||
chips := make([]string, 0, len(doc))
|
||||
for chip := range doc {
|
||||
chips = append(chips, chip)
|
||||
}
|
||||
sort.Strings(chips)
|
||||
var fans []FanReading
|
||||
seen := map[string]struct{}{}
|
||||
for _, chip := range chips {
|
||||
features := doc[chip]
|
||||
names := make([]string, 0, len(features))
|
||||
for name := range features {
|
||||
names = append(names, name)
|
||||
}
|
||||
sort.Strings(names)
|
||||
for _, name := range names {
|
||||
feature, ok := features[name].(map[string]any)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
rpm, ok := firstFanInputValue(feature)
|
||||
if !ok || rpm <= 0 {
|
||||
continue
|
||||
}
|
||||
label := strings.TrimSpace(name)
|
||||
if chip != "" && !strings.Contains(strings.ToLower(label), strings.ToLower(chip)) {
|
||||
label = chip + " / " + label
|
||||
}
|
||||
if _, ok := seen[label]; ok {
|
||||
continue
|
||||
}
|
||||
seen[label] = struct{}{}
|
||||
fans = append(fans, FanReading{Name: label, RPM: rpm})
|
||||
}
|
||||
}
|
||||
return fans, nil
|
||||
}
|
||||
|
||||
func firstFanInputValue(feature map[string]any) (float64, bool) {
|
||||
keys := make([]string, 0, len(feature))
|
||||
for key := range feature {
|
||||
keys = append(keys, key)
|
||||
}
|
||||
sort.Strings(keys)
|
||||
for _, key := range keys {
|
||||
lower := strings.ToLower(key)
|
||||
if !strings.Contains(lower, "fan") || !strings.HasSuffix(lower, "_input") {
|
||||
continue
|
||||
}
|
||||
switch value := feature[key].(type) {
|
||||
case float64:
|
||||
return value, true
|
||||
case string:
|
||||
f, err := strconv.ParseFloat(value, 64)
|
||||
if err == nil {
|
||||
return f, true
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0, false
|
||||
}
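The two functions above walk a three-level structure: chip, then feature, then sub-feature. The sketch below shows why the `map[string]any` type assertion is needed, using an illustrative document in the shape `sensors -j` produces. The chip name and readings are made up, and `fanRPMs` is a hypothetical, simplified stand-in for the real code (no sorting, no de-duplication).

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// sampleSensorsJSON is illustrative data, not captured from a real machine.
const sampleSensorsJSON = `{
  "nct6775-isa-0290": {
    "Adapter": "ISA adapter",
    "fan1": {"fan1_input": 1200.0, "fan1_min": 0.0},
    "fan2": {"fan2_input": 0.0},
    "temp1": {"temp1_input": 41.0}
  }
}`

// fanRPMs returns "chip / feature" -> RPM for every positive fan*_input value.
func fanRPMs(raw string) (map[string]float64, error) {
	var doc map[string]map[string]any
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		return nil, err
	}
	out := map[string]float64{}
	for chip, features := range doc {
		for name, v := range features {
			// Entries such as "Adapter" are plain strings and fail this assertion.
			sub, ok := v.(map[string]any)
			if !ok {
				continue
			}
			for key, val := range sub {
				lower := strings.ToLower(key)
				if !strings.Contains(lower, "fan") || !strings.HasSuffix(lower, "_input") {
					continue
				}
				if rpm, ok := val.(float64); ok && rpm > 0 {
					out[chip+" / "+name] = rpm
				}
			}
		}
	}
	return out, nil
}

func main() {
	fans, err := fanRPMs(sampleSensorsJSON)
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(fans))
	for name := range fans {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %.0f RPM\n", name, fans[name])
	}
}
```

In this sample, `fan2` is dropped because it reads 0 RPM, and `temp1` is dropped because its key does not contain "fan".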
|
||||
|
||||
// sampleCPUMaxTemp returns the highest CPU/inlet temperature reported by ipmitool,
// falling back to lm-sensors when ipmitool fails or yields no usable reading.
func sampleCPUMaxTemp() float64 {
	out, err := exec.Command("ipmitool", "sdr", "type", "Temperature").Output()
	if err == nil {
		if max := parseIPMIMaxTemp(string(out)); max > 0 {
			return max
		}
	}
	return sampleCPUTempViaSensors()
}
|
||||
|
||||
// parseIPMIMaxTemp extracts the maximum temperature from "ipmitool sdr type Temperature".
|
||||
func parseIPMIMaxTemp(raw string) float64 {
|
||||
var max float64
|
||||
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
|
||||
parts := strings.Split(line, "|")
|
||||
if len(parts) < 3 {
|
||||
continue
|
||||
}
|
||||
unit := strings.TrimSpace(parts[2])
|
||||
if !strings.Contains(strings.ToLower(unit), "degrees") {
|
||||
continue
|
||||
}
|
||||
valStr := strings.TrimSpace(parts[1])
|
||||
if strings.EqualFold(valStr, "na") || valStr == "" {
|
||||
continue
|
||||
}
|
||||
val, err := strconv.ParseFloat(valStr, 64)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
if val > max {
|
||||
max = val
|
||||
}
|
||||
}
|
||||
return max
|
||||
}
|
||||
|
||||
// sampleCPUTempViaSensors falls back to lm-sensors when ipmitool is unavailable.
// Only temp*_input readings are considered, so fan and voltage inputs cannot be
// mistaken for temperatures.
func sampleCPUTempViaSensors() float64 {
	out, err := exec.Command("sensors", "-u").Output()
	if err != nil {
		return 0
	}
	var max float64
	for _, line := range strings.Split(string(out), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		if !strings.HasPrefix(fields[0], "temp") || !strings.HasSuffix(fields[0], "_input:") {
			continue
		}
		val, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		// Discard implausible readings (disconnected or faulty sensors).
		if val > 0 && val < 150 && val > max {
			max = val
		}
	}
	return max
}
|
||||
|
||||
// sampleSystemPower reads system power draw via DCMI.
|
||||
func sampleSystemPower() float64 {
|
||||
out, err := exec.Command("ipmitool", "dcmi", "power", "reading").Output()
|
||||
if err != nil {
|
||||
return 0
|
||||
}
|
||||
return parseDCMIPowerReading(string(out))
|
||||
}
|
||||
|
||||
// parseDCMIPowerReading extracts the instantaneous power reading from ipmitool dcmi output.
|
||||
// Sample: " Instantaneous power reading: 500 Watts"
|
||||
func parseDCMIPowerReading(raw string) float64 {
|
||||
for _, line := range strings.Split(raw, "\n") {
|
||||
if !strings.Contains(strings.ToLower(line), "instantaneous") {
|
||||
continue
|
||||
}
|
||||
parts := strings.Fields(line)
|
||||
for i, p := range parts {
|
||||
if strings.EqualFold(p, "Watts") && i > 0 {
|
||||
val, err := strconv.ParseFloat(parts[i-1], 64)
|
||||
if err == nil {
|
||||
return val
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return 0
|
||||
}
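To make the expected input concrete, here is a standalone sketch of the same parsing logic run against a multi-line sample. The sample text and its values are illustrative, and `instantaneousWatts` is a hypothetical name used only for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleDCMI is illustrative text in the style of "ipmitool dcmi power reading".
const sampleDCMI = `
    Instantaneous power reading:                   500 Watts
    Minimum during sampling period:                120 Watts
    Maximum during sampling period:                640 Watts
    Average power reading over sample period:      480 Watts
`

// instantaneousWatts returns the number preceding "Watts" on the
// "Instantaneous" line, or 0 when no such line can be parsed.
func instantaneousWatts(raw string) float64 {
	for _, line := range strings.Split(raw, "\n") {
		if !strings.Contains(strings.ToLower(line), "instantaneous") {
			continue
		}
		parts := strings.Fields(line)
		for i, p := range parts {
			if strings.EqualFold(p, "Watts") && i > 0 {
				if val, err := strconv.ParseFloat(parts[i-1], 64); err == nil {
					return val
				}
			}
		}
	}
	return 0
}

func main() {
	fmt.Println(instantaneousWatts(sampleDCMI)) // prints 500
}
```

The minimum, maximum, and average lines are skipped because they do not contain "instantaneous", so only the first reading is returned.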
|
||||
|
||||
// analyzeThrottling returns true if any GPU reported an active throttle reason
// during either load phase.
|
||||
func analyzeThrottling(rows []FanStressRow) bool {
|
||||
for _, row := range rows {
|
||||
if row.Phase != "load1" && row.Phase != "load2" {
|
||||
continue
|
||||
}
|
||||
for _, gpu := range row.GPUs {
|
||||
if gpu.Throttled {
|
||||
return true
|
||||
}
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
// analyzeMaxTemp returns the maximum value of the given extractor across all rows.
|
||||
func analyzeMaxTemp(rows []FanStressRow, extract func(FanStressRow) float64) float64 {
|
||||
var max float64
|
||||
for _, row := range rows {
|
||||
if v := extract(row); v > max {
|
||||
max = v
|
||||
}
|
||||
}
|
||||
return max
|
||||
}
|
||||
|
||||
// analyzeFanResponse returns the seconds from load1 start until the average fan
// RPM first reaches at least 5% above the baseline average. Returns -1 if undetermined.
|
||||
func analyzeFanResponse(rows []FanStressRow) float64 {
|
||||
// Compute baseline average fan RPM.
|
||||
var baseTotal, baseCount float64
|
||||
for _, row := range rows {
|
||||
if row.Phase != "baseline" {
|
||||
continue
|
||||
}
|
||||
for _, f := range row.Fans {
|
||||
baseTotal += f.RPM
|
||||
baseCount++
|
||||
}
|
||||
}
|
||||
if baseCount == 0 || baseTotal == 0 {
|
||||
return -1
|
||||
}
|
||||
baseAvg := baseTotal / baseCount
|
||||
threshold := baseAvg * 1.05 // 5% increase signals fan ramp-up
|
||||
|
||||
// Find elapsed time when load1 started.
|
||||
var load1Start float64 = -1
|
||||
for _, row := range rows {
|
||||
if row.Phase == "load1" {
|
||||
load1Start = row.ElapsedSec
|
||||
break
|
||||
}
|
||||
}
|
||||
if load1Start < 0 {
|
||||
return -1
|
||||
}
|
||||
|
||||
// Find first load1 row where average RPM crosses the threshold.
|
||||
for _, row := range rows {
|
||||
if row.Phase != "load1" {
|
||||
continue
|
||||
}
|
||||
var total, count float64
|
||||
for _, f := range row.Fans {
|
||||
total += f.RPM
|
||||
count++
|
||||
}
|
||||
if count > 0 && total/count >= threshold {
|
||||
return row.ElapsedSec - load1Start
|
||||
}
|
||||
}
|
||||
return -1
|
||||
}
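A worked example may help with the threshold arithmetic. The sketch below is a simplified re-statement of the function above, assuming a reduced `sample` type (hypothetical) in place of `FanStressRow`. With a baseline average of 3000 RPM the threshold is 3000 × 1.05 = 3150 RPM, so the first load1 sample at or above 3150 determines the response time.

```go
package main

import "fmt"

// sample is a hypothetical, reduced stand-in for FanStressRow.
type sample struct {
	Phase      string
	ElapsedSec float64
	RPMs       []float64
}

// mean returns the arithmetic mean and false when vals is empty.
func mean(vals []float64) (float64, bool) {
	if len(vals) == 0 {
		return 0, false
	}
	var total float64
	for _, v := range vals {
		total += v
	}
	return total / float64(len(vals)), true
}

// fanResponseSec returns seconds from the first load1 sample until the mean
// RPM reaches 105% of the baseline mean, or -1 if that never happens.
func fanResponseSec(rows []sample) float64 {
	var base []float64
	for _, r := range rows {
		if r.Phase == "baseline" {
			base = append(base, r.RPMs...)
		}
	}
	baseAvg, ok := mean(base)
	if !ok || baseAvg == 0 {
		return -1
	}
	threshold := baseAvg * 1.05
	start := -1.0
	for _, r := range rows {
		if r.Phase != "load1" {
			continue
		}
		if start < 0 {
			start = r.ElapsedSec // first load1 sample marks the start of load
		}
		if avg, ok := mean(r.RPMs); ok && avg >= threshold {
			return r.ElapsedSec - start
		}
	}
	return -1
}

func main() {
	rows := []sample{
		{Phase: "baseline", ElapsedSec: 0, RPMs: []float64{3000, 3000}},
		{Phase: "baseline", ElapsedSec: 1, RPMs: []float64{3000, 3000}},
		{Phase: "load1", ElapsedSec: 30, RPMs: []float64{3000, 3000}},
		{Phase: "load1", ElapsedSec: 31, RPMs: []float64{3100, 3100}},
		{Phase: "load1", ElapsedSec: 32, RPMs: []float64{3200, 3200}},
	}
	fmt.Println(fanResponseSec(rows)) // prints 2
}
```

The sample at 31 s averages 3100 RPM, which is below the threshold. The sample at 32 s averages 3200 RPM, so the reported response time is 32 − 30 = 2 seconds.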
|
||||
|
||||
// WriteFanStressCSV writes the wide-format metrics CSV with one row per telemetry sample.
// GPU columns are generated per index in gpuIndices order.
|
||||
func WriteFanStressCSV(path string, rows []FanStressRow, gpuIndices []int) error {
|
||||
if len(rows) == 0 {
|
||||
return os.WriteFile(path, []byte("no data\n"), 0644)
|
||||
}
|
||||
|
||||
var b strings.Builder
|
||||
|
||||
// Header: fixed system columns + per-GPU columns.
|
||||
b.WriteString("timestamp_utc,elapsed_sec,phase,fan_avg_rpm,fan_min_rpm,fan_max_rpm,cpu_max_temp_c,sys_power_w")
|
||||
for _, idx := range gpuIndices {
|
||||
fmt.Fprintf(&b, ",gpu%d_temp_c,gpu%d_usage_pct,gpu%d_power_w,gpu%d_clock_mhz,gpu%d_throttled",
|
||||
idx, idx, idx, idx, idx)
|
||||
}
|
||||
b.WriteRune('\n')
|
||||
|
||||
for _, row := range rows {
|
||||
favg, fmin, fmax := fanRPMStats(row.Fans)
|
||||
fmt.Fprintf(&b, "%s,%.1f,%s,%.0f,%.0f,%.0f,%.1f,%.1f",
|
||||
row.TimestampUTC,
|
||||
row.ElapsedSec,
|
||||
row.Phase,
|
||||
favg, fmin, fmax,
|
||||
row.CPUMaxTempC,
|
||||
row.SysPowerW,
|
||||
)
|
||||
gpuByIdx := make(map[int]GPUStressMetric, len(row.GPUs))
|
||||
for _, g := range row.GPUs {
|
||||
gpuByIdx[g.Index] = g
|
||||
}
|
||||
for _, idx := range gpuIndices {
|
||||
g := gpuByIdx[idx]
|
||||
throttled := 0
|
||||
if g.Throttled {
|
||||
throttled = 1
|
||||
}
|
||||
fmt.Fprintf(&b, ",%.1f,%.1f,%.1f,%.0f,%d",
|
||||
g.TempC, g.UsagePct, g.PowerW, g.ClockMHz, throttled)
|
||||
}
|
||||
b.WriteRune('\n')
|
||||
}
|
||||
|
||||
return os.WriteFile(path, []byte(b.String()), 0644)
|
||||
}
|
||||
|
||||
// WriteFanSensorsCSV writes individual fan sensor readings in long (tidy) format.
|
||||
func WriteFanSensorsCSV(path string, rows []FanStressRow) error {
|
||||
var b strings.Builder
|
||||
b.WriteString("timestamp_utc,elapsed_sec,phase,fan_name,rpm\n")
|
||||
for _, row := range rows {
|
||||
for _, f := range row.Fans {
|
||||
fmt.Fprintf(&b, "%s,%.1f,%s,%s,%.0f\n",
|
||||
row.TimestampUTC, row.ElapsedSec, row.Phase, f.Name, f.RPM)
|
||||
}
|
||||
}
|
||||
return os.WriteFile(path, []byte(b.String()), 0644)
|
||||
}
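The long-format writer above builds each row with `fmt.Fprintf`, which is fine as long as fan names never contain commas or quotes. If that cannot be guaranteed (sensor labels come from external tools), a possible variant is to let `encoding/csv` handle quoting. This is a sketch under that assumption: `fanSample` and `fanSensorsCSV` are hypothetical names, and the row is flattened for brevity.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strconv"
)

// fanSample is a hypothetical flattened row: one fan reading at one instant.
type fanSample struct {
	TimestampUTC string
	ElapsedSec   float64
	Phase        string
	Name         string
	RPM          float64
}

// fanSensorsCSV renders the samples as CSV, quoting fields where required.
func fanSensorsCSV(samples []fanSample) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"timestamp_utc", "elapsed_sec", "phase", "fan_name", "rpm"}); err != nil {
		return "", err
	}
	for _, s := range samples {
		rec := []string{
			s.TimestampUTC,
			strconv.FormatFloat(s.ElapsedSec, 'f', 1, 64), // same precision as %.1f
			s.Phase,
			s.Name,
			strconv.FormatFloat(s.RPM, 'f', 0, 64), // same precision as %.0f
		}
		if err := w.Write(rec); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	out, err := fanSensorsCSV([]fanSample{
		{TimestampUTC: "2024-01-01T00:00:00Z", ElapsedSec: 1, Phase: "load1", Name: "fan1, rear", RPM: 4340},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

A name such as `fan1, rear` is emitted as `"fan1, rear"`, so the row still has exactly five columns when read back.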
|
||||
|
||||
// fanRPMStats computes average, min, max RPM across all fans in a sample row.
func fanRPMStats(fans []FanReading) (avg, min, max float64) {
	if len(fans) == 0 {
		return 0, 0, 0
	}
	min = fans[0].RPM
	max = fans[0].RPM
	var total float64
	for _, f := range fans {
		total += f.RPM
		if f.RPM < min {
			min = f.RPM
		}
		if f.RPM > max {
			max = f.RPM
		}
	}
	return total / float64(len(fans)), min, max
}
|
||||
27
audit/internal/platform/sat_fan_stress_test.go
Normal file
@@ -0,0 +1,27 @@
|
||||
package platform
|
||||
|
||||
import "testing"
|
||||
|
||||
func TestParseFanSpeeds(t *testing.T) {
|
||||
raw := "FAN1 | 2400.000 | RPM | ok\nFAN2 | 1800 RPM | ok | ok\nFAN3 | na | RPM | ns\n"
|
||||
got := parseFanSpeeds(raw)
|
||||
if len(got) != 2 {
|
||||
t.Fatalf("fans=%d want 2 (%v)", len(got), got)
|
||||
}
|
||||
if got[0].Name != "FAN1" || got[0].RPM != 2400 {
|
||||
t.Fatalf("fan0=%+v", got[0])
|
||||
}
|
||||
if got[1].Name != "FAN2" || got[1].RPM != 1800 {
|
||||
t.Fatalf("fan1=%+v", got[1])
|
||||
}
|
||||
}
|
||||
|
||||
func TestFirstFanInputValue(t *testing.T) {
|
||||
feature := map[string]any{
|
||||
"fan1_input": 9200.0,
|
||||
}
|
||||
got, ok := firstFanInputValue(feature)
|
||||
if !ok || got != 9200 {
|
||||
t.Fatalf("got=%v ok=%v", got, ok)
|
||||
}
|
||||
}
|
||||
224
audit/internal/platform/sat_test.go
Normal file
@@ -0,0 +1,224 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"errors"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestStorageSATCommands(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
nvme := storageSATCommands("/dev/nvme0n1")
|
||||
if len(nvme) != 3 || nvme[2].cmd[0] != "nvme" {
|
||||
t.Fatalf("unexpected nvme commands: %#v", nvme)
|
||||
}
|
||||
|
||||
sata := storageSATCommands("/dev/sda")
|
||||
if len(sata) != 2 || sata[0].cmd[0] != "smartctl" {
|
||||
t.Fatalf("unexpected sata commands: %#v", sata)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
jobs := nvidiaSATJobs()
|
||||
|
||||
if len(jobs) != 5 {
|
||||
t.Fatalf("jobs=%d want 5", len(jobs))
|
||||
}
|
||||
if got := jobs[4].cmd[0]; got != "bee-gpu-stress" {
|
||||
t.Fatalf("gpu stress command=%q want bee-gpu-stress", got)
|
||||
}
|
||||
if got := jobs[3].cmd[1]; got != "--output-file" {
|
||||
t.Fatalf("bug report flag=%q want --output-file", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestAMDStressConfigUsesSingleGSTAction(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
cfg := amdStressRVSConfig(123)
|
||||
if !strings.Contains(cfg, "module: gst") {
|
||||
t.Fatalf("config missing gst module:\n%s", cfg)
|
||||
}
|
||||
if strings.Contains(cfg, "module: mem") {
|
||||
t.Fatalf("config should not include mem module:\n%s", cfg)
|
||||
}
|
||||
if !strings.Contains(cfg, "copy_matrix: false") {
|
||||
t.Fatalf("config should use copy_matrix=false:\n%s", cfg)
|
||||
}
|
||||
if strings.Count(cfg, "duration: 123000") != 1 {
|
||||
t.Fatalf("config should apply duration once:\n%s", cfg)
|
||||
}
|
||||
for _, field := range []string{"matrix_size_a: 8640", "matrix_size_b: 8640", "matrix_size_c: 8640"} {
|
||||
if !strings.Contains(cfg, field) {
|
||||
t.Fatalf("config missing %s:\n%s", field, cfg)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestAMDStressJobsIncludeBandwidthAndGST(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
jobs := amdStressJobs(300, "/tmp/test-amd-gst.conf")
|
||||
if len(jobs) != 4 {
|
||||
t.Fatalf("jobs=%d want 4", len(jobs))
|
||||
}
|
||||
if got := jobs[1].cmd[0]; got != "rocm-bandwidth-test" {
|
||||
t.Fatalf("jobs[1]=%q want rocm-bandwidth-test", got)
|
||||
}
|
||||
if got := jobs[2].cmd[0]; got != "rvs" {
|
||||
t.Fatalf("jobs[2]=%q want rvs", got)
|
||||
}
|
||||
if got := jobs[2].cmd[2]; got != "/tmp/test-amd-gst.conf" {
|
||||
t.Fatalf("jobs[2] cfg=%q want /tmp/test-amd-gst.conf", got)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
|
||||
t.Setenv("BEE_GPU_STRESS_SECONDS", "9")
|
||||
t.Setenv("BEE_GPU_STRESS_SIZE_MB", "96")
|
||||
|
||||
jobs := nvidiaSATJobs()
|
||||
got := jobs[4].cmd
|
||||
want := []string{"bee-gpu-stress", "--seconds", "9", "--size-mb", "96"}
|
||||
if len(got) != len(want) {
|
||||
t.Fatalf("cmd len=%d want %d", len(got), len(want))
|
||||
}
|
||||
for i := range want {
|
||||
if got[i] != want[i] {
|
||||
t.Fatalf("cmd[%d]=%q want %q", i, got[i], want[i])
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestEnvIntFallback(t *testing.T) {
	// Register the variable with t.Setenv first so its original value is
	// restored after the test, then unset it to exercise the fallback path.
	t.Setenv("BEE_MEMTESTER_SIZE_MB", "")
	os.Unsetenv("BEE_MEMTESTER_SIZE_MB")
	if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
		t.Fatalf("got %d want 123", got)
	}
	t.Setenv("BEE_MEMTESTER_SIZE_MB", "bad")
	if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
		t.Fatalf("got %d want 123", got)
	}
	t.Setenv("BEE_MEMTESTER_SIZE_MB", "256")
	if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 256 {
		t.Fatalf("got %d want 256", got)
	}
}
|
||||
|
||||
func TestClassifySATResult(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
job string
|
||||
out string
|
||||
err error
|
||||
status string
|
||||
}{
|
||||
{name: "ok", job: "memtester", out: "done", err: nil, status: "OK"},
|
||||
{name: "unsupported", job: "smartctl-self-test-short", out: "Self-test not supported", err: errors.New("rc 1"), status: "UNSUPPORTED"},
|
||||
{name: "failed", job: "bee-gpu-stress", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
|
||||
{name: "cuda not ready", job: "bee-gpu-stress", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
got, _ := classifySATResult(tt.job, []byte(tt.out), tt.err)
|
||||
if got != tt.status {
|
||||
t.Fatalf("status=%q want %q", got, tt.status)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseStorageDevicesSkipsUSBDisks(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
raw := "nvme0n1 disk nvme\nsda disk usb\nloop0 loop\nsdb disk sata\n"
|
||||
got := parseStorageDevices(raw)
|
||||
want := []string{"/dev/nvme0n1", "/dev/sdb"}
|
||||
if len(got) != len(want) {
|
||||
t.Fatalf("len(devices)=%d want %d (%v)", len(got), len(want), got)
|
||||
}
|
||||
for i := range want {
|
||||
if got[i] != want[i] {
|
||||
t.Fatalf("devices[%d]=%q want %q", i, got[i], want[i])
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestResolveROCmSMICommandFromPATH(t *testing.T) {
|
||||
t.Setenv("PATH", t.TempDir())
|
||||
|
||||
toolPath := filepath.Join(os.Getenv("PATH"), "rocm-smi")
|
||||
if err := os.WriteFile(toolPath, []byte("#!/bin/sh\nexit 0\n"), 0755); err != nil {
|
||||
t.Fatalf("write rocm-smi: %v", err)
|
||||
}
|
||||
|
||||
cmd, err := resolveROCmSMICommand("--showproductname")
|
||||
if err != nil {
|
||||
t.Fatalf("resolveROCmSMICommand error: %v", err)
|
||||
}
|
||||
if len(cmd) != 2 {
|
||||
t.Fatalf("cmd len=%d want 2 (%v)", len(cmd), cmd)
|
||||
}
|
||||
if cmd[0] != toolPath {
|
||||
t.Fatalf("cmd[0]=%q want %q", cmd[0], toolPath)
|
||||
}
|
||||
}
|
||||
|
||||
func TestResolveROCmSMICommandFallsBackToROCmTree(t *testing.T) {
|
||||
tmp := t.TempDir()
|
||||
execPath := filepath.Join(tmp, "opt", "rocm", "bin", "rocm-smi")
|
||||
if err := os.MkdirAll(filepath.Dir(execPath), 0755); err != nil {
|
||||
t.Fatalf("mkdir: %v", err)
|
||||
}
|
||||
if err := os.WriteFile(execPath, []byte("#!/bin/sh\nexit 0\n"), 0755); err != nil {
|
||||
t.Fatalf("write rocm-smi: %v", err)
|
||||
}
|
||||
|
||||
oldGlob := rocmSMIExecutableGlobs
|
||||
oldScriptGlobs := rocmSMIScriptGlobs
|
||||
rocmSMIExecutableGlobs = []string{execPath}
|
||||
rocmSMIScriptGlobs = nil
|
||||
t.Cleanup(func() {
|
||||
rocmSMIExecutableGlobs = oldGlob
|
||||
rocmSMIScriptGlobs = oldScriptGlobs
|
||||
})
|
||||
|
||||
t.Setenv("PATH", "")
|
||||
|
||||
cmd, err := resolveROCmSMICommand("--showallinfo")
|
||||
if err != nil {
|
||||
t.Fatalf("resolveROCmSMICommand error: %v", err)
|
||||
}
|
||||
if len(cmd) != 2 {
|
||||
t.Fatalf("cmd len=%d want 2 (%v)", len(cmd), cmd)
|
||||
}
|
||||
if cmd[0] != execPath {
|
||||
t.Fatalf("cmd[0]=%q want %q", cmd[0], execPath)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunROCmSMIReportsMissingCommand(t *testing.T) {
|
||||
oldLookPath := satLookPath
|
||||
oldExecGlobs := rocmSMIExecutableGlobs
|
||||
oldScriptGlobs := rocmSMIScriptGlobs
|
||||
satLookPath = func(string) (string, error) { return "", exec.ErrNotFound }
|
||||
rocmSMIExecutableGlobs = nil
|
||||
rocmSMIScriptGlobs = nil
|
||||
t.Cleanup(func() {
|
||||
satLookPath = oldLookPath
|
||||
rocmSMIExecutableGlobs = oldExecGlobs
|
||||
rocmSMIScriptGlobs = oldScriptGlobs
|
||||
})
|
||||
|
||||
if _, err := runROCmSMI("--showproductname"); err == nil {
|
||||
t.Fatal("expected missing rocm-smi error")
|
||||
}
|
||||
}
|
||||
58
audit/internal/platform/services.go
Normal file
@@ -0,0 +1,58 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strings"
|
||||
)
|
||||
|
||||
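// ListBeeServices returns the sorted, de-duplicated names of installed bee-*
// systemd service units, excluding template units.
func (s *System) ListBeeServices() ([]string, error) {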
|
||||
seen := map[string]bool{}
|
||||
var out []string
|
||||
for _, pattern := range []string{"/etc/systemd/system/bee-*.service", "/lib/systemd/system/bee-*.service"} {
|
||||
matches, err := filepath.Glob(pattern)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
for _, match := range matches {
|
||||
name := strings.TrimSuffix(filepath.Base(match), ".service")
|
||||
// Skip template units (e.g. bee-journal-mirror@) — they have no instances to query.
|
||||
if strings.HasSuffix(name, "@") {
|
||||
continue
|
||||
}
|
||||
if !seen[name] {
|
||||
seen[name] = true
|
||||
out = append(out, name)
|
||||
}
|
||||
}
|
||||
}
|
||||
sort.Strings(out)
|
||||
return out, nil
|
||||
}
|
||||
|
||||
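// ServiceState reports the unit's state from "systemctl is-active". When that
// command exits non-zero it falls back to the ActiveState property from
// "systemctl show", and returns "unknown" if neither yields a state.
func (s *System) ServiceState(name string) string {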
|
||||
raw, err := exec.Command("systemctl", "is-active", name).CombinedOutput()
|
||||
if err == nil {
|
||||
return strings.TrimSpace(string(raw))
|
||||
}
|
||||
raw, err = exec.Command("systemctl", "show", name, "--property=ActiveState", "--value").CombinedOutput()
|
||||
if err != nil {
|
||||
return "unknown"
|
||||
}
|
||||
state := strings.TrimSpace(string(raw))
|
||||
if state == "" {
|
||||
return "unknown"
|
||||
}
|
||||
return state
|
||||
}
|
||||
|
||||
func (s *System) ServiceDo(name string, action ServiceAction) (string, error) {
|
||||
raw, err := exec.Command("systemctl", string(action), name).CombinedOutput()
|
||||
return string(raw), err
|
||||
}
|
||||
|
||||
func (s *System) ServiceStatus(name string) (string, error) {
|
||||
raw, err := exec.Command("systemctl", "status", name, "--no-pager").CombinedOutput()
|
||||
return string(raw), err
|
||||
}
|
||||
49
audit/internal/platform/system_test.go
Normal file
@@ -0,0 +1,49 @@
|
||||
package platform
|
||||
|
||||
import "testing"
|
||||
|
||||
func TestSplitQuotedFields(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
line := `NAME="sdb1" TYPE="part" LABEL="BEE EXPORT" MODEL="USB DISK 3.0"`
|
||||
got := splitQuotedFields(line)
|
||||
want := []string{
|
||||
`NAME="sdb1"`,
|
||||
`TYPE="part"`,
|
||||
`LABEL="BEE EXPORT"`,
|
||||
`MODEL="USB DISK 3.0"`,
|
||||
}
|
||||
|
||||
if len(got) != len(want) {
|
||||
t.Fatalf("len(got)=%d len(want)=%d; got=%q", len(got), len(want), got)
|
||||
}
|
||||
for i := range want {
|
||||
if got[i] != want[i] {
|
||||
t.Fatalf("got[%d]=%q want %q", i, got[i], want[i])
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestParseLSBLKPairs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
line := `NAME="sdb1" TYPE="part" PKNAME="sdb" RM="1" FSTYPE="vfat" MOUNTPOINT="" SIZE="57.3G" LABEL="BEE EXPORT" MODEL="USB DISK 3.0"`
|
||||
got := parseLSBLKPairs(line)
|
||||
|
||||
checks := map[string]string{
|
||||
"NAME": "sdb1",
|
||||
"TYPE": "part",
|
||||
"PKNAME": "sdb",
|
||||
"RM": "1",
|
||||
"FSTYPE": "vfat",
|
||||
"MOUNTPOINT": "",
|
||||
"SIZE": "57.3G",
|
||||
"LABEL": "BEE EXPORT",
|
||||
"MODEL": "USB DISK 3.0",
|
||||
}
|
||||
for key, want := range checks {
|
||||
if got[key] != want {
|
||||
t.Fatalf("got[%s]=%q want %q", key, got[key], want)
|
||||
}
|
||||
}
|
||||
}
|
||||
150
audit/internal/platform/techdump.go
Normal file
@@ -0,0 +1,150 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"sort"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var techDumpFixedCommands = []struct {
|
||||
Name string
|
||||
Args []string
|
||||
File string
|
||||
}{
|
||||
{Name: "dmidecode", Args: []string{"-t", "0"}, File: "dmidecode-type0.txt"},
|
||||
{Name: "dmidecode", Args: []string{"-t", "1"}, File: "dmidecode-type1.txt"},
|
||||
{Name: "dmidecode", Args: []string{"-t", "2"}, File: "dmidecode-type2.txt"},
|
||||
{Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"},
|
||||
{Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"},
|
||||
{Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"},
|
||||
{Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"},
|
||||
{Name: "sensors", Args: []string{"-j"}, File: "sensors.json"},
|
||||
{Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"},
|
||||
{Name: "ipmitool", Args: []string{"sdr"}, File: "ipmitool-sdr.txt"},
|
||||
{Name: "nvme", Args: []string{"list", "-o", "json"}, File: "nvme-list.json"},
|
||||
}
|
||||
|
||||
var techDumpNvidiaCommands = []struct {
|
||||
Name string
|
||||
Args []string
|
||||
File string
|
||||
}{
|
||||
{Name: "nvidia-smi", Args: []string{"-q"}, File: "nvidia-smi-q.txt"},
|
||||
{Name: "nvidia-smi", Args: []string{"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown", "--format=csv,noheader,nounits"}, File: "nvidia-smi-query.csv"},
|
||||
}
|
||||
|
||||
type lsblkDumpRoot struct {
|
||||
Blockdevices []struct {
|
||||
Name string `json:"name"`
|
||||
Type string `json:"type"`
|
||||
Tran string `json:"tran"`
|
||||
} `json:"blockdevices"`
|
||||
}
|
||||
|
||||
type nvmeDumpRoot struct {
|
||||
Devices []struct {
|
||||
DevicePath string `json:"DevicePath"`
|
||||
} `json:"Devices"`
|
||||
}
|
||||
|
||||
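// CaptureTechnicalDump writes the raw output of the fixed inventory commands,
// the vendor-specific GPU tools, and per-device smartctl/nvme queries into
// baseDir. Individual command failures are skipped; only a failure to create
// baseDir is returned.
func (s *System) CaptureTechnicalDump(baseDir string) error {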
|
||||
if err := os.MkdirAll(baseDir, 0755); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
for _, cmd := range techDumpFixedCommands {
|
||||
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
|
||||
}
|
||||
switch s.DetectGPUVendor() {
|
||||
case "nvidia":
|
||||
for _, cmd := range techDumpNvidiaCommands {
|
||||
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
|
||||
}
|
||||
case "amd":
|
||||
writeROCmSMIDump(filepath.Join(baseDir, "rocm-smi.txt"))
|
||||
writeROCmSMIDump(filepath.Join(baseDir, "rocm-smi-showallinfo.txt"), "--showallinfo")
|
||||
}
|
||||
|
||||
for _, dev := range lsblkDumpDevices(filepath.Join(baseDir, "lsblk.json")) {
|
||||
writeCommandDump(filepath.Join(baseDir, "smartctl-"+sanitizeDumpName(dev)+".json"), "smartctl", "-j", "-a", "/dev/"+dev)
|
||||
}
|
||||
for _, dev := range nvmeDumpDevices(filepath.Join(baseDir, "nvme-list.json")) {
|
||||
writeCommandDump(filepath.Join(baseDir, "nvme-id-ctrl-"+sanitizeDumpName(dev)+".json"), "nvme", "id-ctrl", dev, "-o", "json")
|
||||
writeCommandDump(filepath.Join(baseDir, "nvme-smart-log-"+sanitizeDumpName(dev)+".json"), "nvme", "smart-log", dev, "-o", "json")
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
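// writeCommandDump runs the command and writes its combined output to path.
// It is best effort: nothing is written when the command fails without
// producing output, and write errors are ignored.
func writeCommandDump(path, name string, args ...string) {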
|
||||
out, err := exec.Command(name, args...).CombinedOutput()
|
||||
if err != nil && len(out) == 0 {
|
||||
return
|
||||
}
|
||||
_ = os.WriteFile(path, out, 0644)
|
||||
}
|
||||
|
||||
func writeROCmSMIDump(path string, args ...string) {
|
||||
out, err := runROCmSMI(args...)
|
||||
if err != nil && len(out) == 0 {
|
||||
return
|
||||
}
|
||||
_ = os.WriteFile(path, out, 0644)
|
||||
}
|
||||
|
||||
func lsblkDumpDevices(path string) []string {
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
var root lsblkDumpRoot
|
||||
if err := json.Unmarshal(raw, &root); err != nil {
|
||||
return nil
|
||||
}
|
||||
var devices []string
|
||||
for _, dev := range root.Blockdevices {
|
||||
if strings.EqualFold(strings.TrimSpace(dev.Tran), "usb") {
|
||||
continue
|
||||
}
|
||||
if dev.Type == "disk" && strings.TrimSpace(dev.Name) != "" {
|
||||
devices = append(devices, strings.TrimSpace(dev.Name))
|
||||
}
|
||||
}
|
||||
sort.Strings(devices)
|
||||
return devices
|
||||
}
|
||||
|
||||
func nvmeDumpDevices(path string) []string {
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return nil
|
||||
}
|
||||
var root nvmeDumpRoot
|
||||
if err := json.Unmarshal(raw, &root); err != nil {
|
||||
return nil
|
||||
}
|
||||
seen := map[string]bool{}
|
||||
var devices []string
|
||||
for _, dev := range root.Devices {
|
||||
name := strings.TrimSpace(dev.DevicePath)
|
||||
if name == "" || seen[name] {
|
||||
continue
|
||||
}
|
||||
seen[name] = true
|
||||
devices = append(devices, name)
|
||||
}
|
||||
sort.Strings(devices)
|
||||
return devices
|
||||
}
|
||||
|
||||
func sanitizeDumpName(value string) string {
|
||||
value = strings.TrimSpace(value)
|
||||
value = strings.TrimPrefix(value, "/dev/")
|
||||
value = strings.ReplaceAll(value, "/", "_")
|
||||
if value == "" {
|
||||
return "unknown"
|
||||
}
|
||||
return value
|
||||
}
|
||||
48
audit/internal/platform/techdump_test.go
Normal file
@@ -0,0 +1,48 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"reflect"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestLSBLKDumpDevices(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "lsblk.json")
|
||||
if err := os.WriteFile(path, []byte(`{"blockdevices":[{"name":"sda","type":"disk","tran":"usb"},{"name":"sda1","type":"part"},{"name":"nvme0n1","type":"disk","tran":"nvme"},{"name":"sdb","type":"disk","tran":"sata"}]}`), 0644); err != nil {
|
||||
t.Fatalf("write lsblk fixture: %v", err)
|
||||
}
|
||||
|
||||
got := lsblkDumpDevices(path)
|
||||
want := []string{"nvme0n1", "sdb"}
|
||||
if !reflect.DeepEqual(got, want) {
|
||||
t.Fatalf("lsblkDumpDevices=%v want %v", got, want)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNVMEDumpDevices(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "nvme-list.json")
|
||||
if err := os.WriteFile(path, []byte(`{"Devices":[{"DevicePath":"/dev/nvme1n1"},{"DevicePath":"/dev/nvme0n1"},{"DevicePath":"/dev/nvme1n1"}]}`), 0644); err != nil {
|
||||
t.Fatalf("write nvme fixture: %v", err)
|
||||
}
|
||||
|
||||
got := nvmeDumpDevices(path)
|
||||
want := []string{"/dev/nvme0n1", "/dev/nvme1n1"}
|
||||
if !reflect.DeepEqual(got, want) {
|
||||
t.Fatalf("nvmeDumpDevices=%v want %v", got, want)
|
||||
}
|
||||
}
|
||||
|
||||
func TestSanitizeDumpName(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
if got := sanitizeDumpName("/dev/nvme0n1"); got != "nvme0n1" {
|
||||
t.Fatalf("sanitizeDumpName=%q want nvme0n1", got)
|
||||
}
|
||||
}
|
||||
29
audit/internal/platform/tools.go
Normal file
@@ -0,0 +1,29 @@
|
||||
package platform
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"os"
|
||||
"os/exec"
|
||||
"strings"
|
||||
)
|
||||
|
||||
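// TailFile returns the last `lines` lines of the file at path. If lines <= 0
// or the file has no more than that many lines, the whole file is returned
// unchanged. A read failure is reported in the returned text, not as an error.
func (s *System) TailFile(path string, lines int) string {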
|
||||
raw, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
return fmt.Sprintf("read %s: %v", path, err)
|
||||
}
|
||||
all := strings.Split(strings.TrimRight(string(raw), "\n"), "\n")
|
||||
if lines <= 0 || len(all) <= lines {
|
||||
return string(raw)
|
||||
}
|
||||
return strings.Join(all[len(all)-lines:], "\n")
|
||||
}
|
||||
|
||||
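// CheckTools resolves each name with exec.LookPath and reports, per tool,
// the resolved path and whether the lookup succeeded.
func (s *System) CheckTools(names []string) []ToolStatus {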
|
||||
out := make([]ToolStatus, 0, len(names))
|
||||
for _, name := range names {
|
||||
path, err := exec.LookPath(name)
|
||||
out = append(out, ToolStatus{Name: name, Path: path, OK: err == nil})
|
||||
}
|
||||
return out
|
||||
}
|
||||
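The slicing rule in `TailFile` has one subtle property: a short file comes back byte-for-byte, while a truncated tail loses its trailing newline. A minimal sketch of that rule on plain strings follows; `tailLines` is a hypothetical, file-free stand-in written for illustration and is not part of the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// tailLines is a hypothetical string-only variant of TailFile's logic.
func tailLines(text string, n int) string {
	all := strings.Split(strings.TrimRight(text, "\n"), "\n")
	if n <= 0 || len(all) <= n {
		// Short input (or no limit): returned unchanged, trailing newline included.
		return text
	}
	// Truncated tail: lines are re-joined, so there is no trailing newline.
	return strings.Join(all[len(all)-n:], "\n")
}

func main() {
	fmt.Printf("%q\n", tailLines("a\nb\nc\n", 2))
	fmt.Printf("%q\n", tailLines("a\nb\n", 5))
}
```

Callers that concatenate tails from several files may want to normalize the trailing newline themselves.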
56
audit/internal/platform/types.go
Normal file
@@ -0,0 +1,56 @@
|
||||
package platform
|
||||
|
||||
type System struct{}
|
||||
|
||||
type InterfaceInfo struct {
|
||||
Name string
|
||||
State string
|
||||
IPv4 []string
|
||||
}
|
||||
|
||||
type NetworkInterfaceSnapshot struct {
|
||||
Name string
|
||||
Up bool
|
||||
IPv4 []string
|
||||
}
|
||||
|
||||
type NetworkSnapshot struct {
|
||||
Interfaces []NetworkInterfaceSnapshot
|
||||
DefaultRoutes []string
|
||||
ResolvConf string
|
||||
}
|
||||
|
||||
type ServiceAction string
|
||||
|
||||
const (
|
||||
ServiceStart ServiceAction = "start"
|
||||
ServiceStop ServiceAction = "stop"
|
||||
ServiceRestart ServiceAction = "restart"
|
||||
)
|
||||
|
||||
type StaticIPv4Config struct {
|
||||
Interface string
|
||||
Address string
|
||||
Prefix string
|
||||
Gateway string
|
||||
DNS []string
|
||||
}
|
||||
|
||||
type RemovableTarget struct {
|
||||
Device string
|
||||
FSType string
|
||||
Size string
|
||||
Label string
|
||||
Model string
|
||||
Mountpoint string
|
||||
}
|
||||
|
||||
type ToolStatus struct {
|
||||
Name string
|
||||
Path string
|
||||
OK bool
|
||||
}
|
||||
|
||||
func New() *System {
|
||||
return &System{}
|
||||
}
|
||||
77
audit/internal/runtimeenv/runtimeenv.go
Normal file
@@ -0,0 +1,77 @@
|
||||
package runtimeenv
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"os"
|
||||
"strings"
|
||||
)
|
||||
|
||||
type Mode string
|
||||
|
||||
const (
|
||||
ModeAuto Mode = "auto"
|
||||
ModeLocal Mode = "local"
|
||||
ModeLiveCD Mode = "livecd"
|
||||
)
|
||||
|
||||
type Info struct {
|
||||
Mode Mode
|
||||
Detected bool
|
||||
Reason string
|
||||
}
|
||||
|
||||
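// ParseMode normalizes raw (trimmed, lower-cased) into a Mode. An empty
// string is treated as auto; any value other than auto, local, or livecd
// is rejected with an error.
func ParseMode(raw string) (Mode, error) {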
|
||||
mode := Mode(strings.TrimSpace(strings.ToLower(raw)))
|
||||
switch mode {
|
||||
case "", ModeAuto:
|
||||
return ModeAuto, nil
|
||||
case ModeLocal, ModeLiveCD:
|
||||
return mode, nil
|
||||
default:
|
||||
return "", fmt.Errorf("invalid runtime %q — use auto, local, or livecd", raw)
|
||||
}
|
||||
}
|
||||
|
||||
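// Detect resolves the runtime mode. Precedence: an explicit flag value other
// than auto, then the BEE_RUNTIME environment variable, then the
// /etc/bee-release marker file, then live-boot hints in /proc/cmdline;
// otherwise it falls back to local. Detected is true only when the mode was
// inferred rather than supplied by flag or environment.
func Detect(flagValue string) (Info, error) {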
|
||||
flagMode, err := ParseMode(flagValue)
|
||||
if err != nil {
|
||||
return Info{}, err
|
||||
}
|
||||
if flagMode != ModeAuto {
|
||||
return Info{Mode: flagMode, Reason: "flag"}, nil
|
||||
}
|
||||
|
||||
if envMode, ok := getenvMode("BEE_RUNTIME"); ok {
|
||||
return Info{Mode: envMode, Reason: "env:BEE_RUNTIME"}, nil
|
||||
}
|
||||
|
||||
if fileExists("/etc/bee-release") {
|
||||
return Info{Mode: ModeLiveCD, Detected: true, Reason: "marker:/etc/bee-release"}, nil
|
||||
}
|
||||
|
||||
if data, err := os.ReadFile("/proc/cmdline"); err == nil {
|
||||
cmdline := string(data)
|
||||
if strings.Contains(cmdline, " boot=live") || strings.HasPrefix(cmdline, "boot=live ") || strings.Contains(cmdline, "live-media") {
|
||||
return Info{Mode: ModeLiveCD, Detected: true, Reason: "kernel:boot=live"}, nil
|
||||
}
|
||||
}
|
||||
|
||||
return Info{Mode: ModeLocal, Detected: true, Reason: "default:local"}, nil
|
||||
}
|
||||
|
||||
func getenvMode(name string) (Mode, bool) {
|
||||
value := strings.TrimSpace(os.Getenv(name))
|
||||
if value == "" {
|
||||
return "", false
|
||||
}
|
||||
mode, err := ParseMode(value)
|
||||
if err != nil || mode == ModeAuto {
|
||||
return "", false
|
||||
}
|
||||
return mode, true
|
||||
}
|
||||
|
||||
func fileExists(path string) bool {
|
||||
info, err := os.Stat(path)
|
||||
return err == nil && !info.IsDir()
|
||||
}
|
||||
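The kernel command-line check in `Detect` relies on substring matches, which depend on where `boot=live` sits in the line. As a sketch of an alternative, the hypothetical `isLiveCmdline` below splits the line into whitespace-separated tokens and compares whole tokens. It is an assumption for illustration, not the code in this diff, and it is deliberately stricter: `live-media` must start a token rather than appear anywhere.

```go
package main

import (
	"fmt"
	"strings"
)

// isLiveCmdline is a hypothetical token-based variant of the /proc/cmdline
// check. It reports whether any parameter is exactly "boot=live" or begins
// with "live-media".
func isLiveCmdline(cmdline string) bool {
	// strings.Fields splits on any whitespace, including the trailing newline
	// that /proc/cmdline normally ends with.
	for _, token := range strings.Fields(cmdline) {
		if token == "boot=live" || strings.HasPrefix(token, "live-media") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isLiveCmdline("boot=live\n"))
	fmt.Println(isLiveCmdline("BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet"))
	fmt.Println(isLiveCmdline("quiet boot=livefoo"))
}
```

Token matching handles a command line that consists only of `boot=live` followed by a newline, which the substring checks above do not match, and it does not treat `boot=livefoo` as a live boot.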
67
audit/internal/runtimeenv/runtimeenv_test.go
Normal file
@@ -0,0 +1,67 @@
|
||||
package runtimeenv
|
||||
|
||||
import (
|
||||
"os"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestParseMode(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
tests := []struct {
|
||||
in string
|
||||
want Mode
|
||||
ok bool
|
||||
}{
|
||||
{in: "", want: ModeAuto, ok: true},
|
||||
{in: "auto", want: ModeAuto, ok: true},
|
||||
{in: "local", want: ModeLocal, ok: true},
|
||||
{in: "livecd", want: ModeLiveCD, ok: true},
|
||||
{in: "bad", ok: false},
|
||||
}
|
||||
|
||||
for _, test := range tests {
|
||||
got, err := ParseMode(test.in)
|
||||
if test.ok && err != nil {
|
||||
t.Fatalf("ParseMode(%q): %v", test.in, err)
|
||||
}
|
||||
if !test.ok && err == nil {
|
||||
t.Fatalf("ParseMode(%q): expected error", test.in)
|
||||
}
|
||||
if test.ok && got != test.want {
|
||||
t.Fatalf("ParseMode(%q): got %q want %q", test.in, got, test.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestDetectHonorsFlag(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
info, err := Detect("livecd")
|
||||
if err != nil {
|
||||
t.Fatalf("Detect(flag): %v", err)
|
||||
}
|
||||
if info.Mode != ModeLiveCD || info.Reason != "flag" {
|
||||
t.Fatalf("unexpected info: %+v", info)
|
||||
}
|
||||
}
|
||||
|
||||
func TestDetectHonorsEnv(t *testing.T) {
|
||||
// Not parallel: this test mutates the process-wide BEE_RUNTIME variable.
|
||||
|
||||
old := os.Getenv("BEE_RUNTIME")
|
||||
t.Cleanup(func() {
|
||||
_ = os.Setenv("BEE_RUNTIME", old)
|
||||
})
|
||||
if err := os.Setenv("BEE_RUNTIME", "local"); err != nil {
|
||||
t.Fatalf("Setenv: %v", err)
|
||||
}
|
||||
|
||||
info, err := Detect("auto")
|
||||
if err != nil {
|
||||
t.Fatalf("Detect(env): %v", err)
|
||||
}
|
||||
if info.Mode != ModeLocal || info.Reason != "env:BEE_RUNTIME" {
|
||||
t.Fatalf("unexpected info: %+v", info)
|
||||
}
|
||||
}
|
||||
@@ -2,17 +2,55 @@
|
||||
// core/internal/ingest/parser_hardware.go. No import dependency on core.
|
||||
package schema
|
||||
|
||||
// HardwareIngestRequest is the top-level output document produced by the audit binary.
|
||||
// HardwareIngestRequest is the top-level output document produced by `bee audit`.
|
||||
// It is accepted as-is by the core /api/ingest/hardware endpoint.
|
||||
type HardwareIngestRequest struct {
|
||||
Filename *string `json:"filename"`
|
||||
SourceType *string `json:"source_type"`
|
||||
Protocol *string `json:"protocol"`
|
||||
TargetHost string `json:"target_host"`
|
||||
Filename *string `json:"filename,omitempty"`
|
||||
SourceType *string `json:"source_type,omitempty"`
|
||||
Protocol *string `json:"protocol,omitempty"`
|
||||
TargetHost *string `json:"target_host,omitempty"`
|
||||
CollectedAt string `json:"collected_at"`
|
||||
Runtime *RuntimeHealth `json:"runtime,omitempty"`
|
||||
Hardware HardwareSnapshot `json:"hardware"`
|
||||
}
|
||||
|
||||
type RuntimeHealth struct {
|
||||
Status string `json:"status"`
|
||||
CheckedAt string `json:"checked_at"`
|
||||
ExportDir string `json:"export_dir,omitempty"`
|
||||
DriverReady bool `json:"driver_ready,omitempty"`
|
||||
CUDAReady bool `json:"cuda_ready,omitempty"`
|
||||
NetworkStatus string `json:"network_status,omitempty"`
|
||||
Issues []RuntimeIssue `json:"issues,omitempty"`
|
||||
Tools []RuntimeToolStatus `json:"tools,omitempty"`
|
||||
Services []RuntimeServiceStatus `json:"services,omitempty"`
|
||||
Interfaces []RuntimeInterface `json:"interfaces,omitempty"`
|
||||
}
|
||||
|
||||
type RuntimeIssue struct {
|
||||
Code string `json:"code"`
|
||||
Severity string `json:"severity,omitempty"`
|
||||
Description string `json:"description"`
|
||||
}
|
||||
|
||||
type RuntimeToolStatus struct {
|
||||
Name string `json:"name"`
|
||||
Path string `json:"path,omitempty"`
|
||||
OK bool `json:"ok"`
|
||||
}
|
||||
|
||||
type RuntimeServiceStatus struct {
|
||||
Name string `json:"name"`
|
||||
Status string `json:"status"`
|
||||
}
|
||||
|
||||
type RuntimeInterface struct {
|
||||
Name string `json:"name"`
|
||||
State string `json:"state,omitempty"`
|
||||
IPv4 []string `json:"ipv4,omitempty"`
|
||||
Outcome string `json:"outcome,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareSnapshot struct {
|
||||
Board HardwareBoard `json:"board"`
|
||||
Firmware []HardwareFirmwareRecord `json:"firmware,omitempty"`
|
||||
@@ -21,14 +59,33 @@ type HardwareSnapshot struct {
|
||||
Storage []HardwareStorage `json:"storage,omitempty"`
|
||||
PCIeDevices []HardwarePCIeDevice `json:"pcie_devices,omitempty"`
|
||||
PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"`
|
||||
Sensors *HardwareSensors `json:"sensors,omitempty"`
|
||||
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareHealthSummary struct {
|
||||
Status string `json:"status"`
|
||||
Warnings []string `json:"warnings,omitempty"`
|
||||
Failures []string `json:"failures,omitempty"`
|
||||
StorageWarn int `json:"storage_warn,omitempty"`
|
||||
StorageFail int `json:"storage_fail,omitempty"`
|
||||
PCIeWarn int `json:"pcie_warn,omitempty"`
|
||||
PCIeFail int `json:"pcie_fail,omitempty"`
|
||||
PSUWarn int `json:"psu_warn,omitempty"`
|
||||
PSUFail int `json:"psu_fail,omitempty"`
|
||||
MemoryWarn int `json:"memory_warn,omitempty"`
|
||||
MemoryFail int `json:"memory_fail,omitempty"`
|
||||
EmptyDIMMs int `json:"empty_dimms,omitempty"`
|
||||
MissingPSUs int `json:"missing_psus,omitempty"`
|
||||
CollectedAt string `json:"collected_at,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareBoard struct {
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
ProductName *string `json:"product_name"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
ProductName *string `json:"product_name,omitempty"`
|
||||
SerialNumber string `json:"serial_number"`
|
||||
PartNumber *string `json:"part_number"`
|
||||
UUID *string `json:"uuid"`
|
||||
PartNumber *string `json:"part_number,omitempty"`
|
||||
UUID *string `json:"uuid,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareFirmwareRecord struct {
|
||||
@@ -37,77 +94,196 @@ type HardwareFirmwareRecord struct {
|
||||
}
|
||||
|
||||
type HardwareCPU struct {
|
||||
Socket *int `json:"socket"`
|
||||
Model *string `json:"model"`
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
Status *string `json:"status"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
Firmware *string `json:"firmware"`
|
||||
Cores *int `json:"cores"`
|
||||
Threads *int `json:"threads"`
|
||||
FrequencyMHz *int `json:"frequency_mhz"`
|
||||
MaxFrequencyMHz *int `json:"max_frequency_mhz"`
|
||||
HardwareComponentStatus
|
||||
Socket *int `json:"socket,omitempty"`
|
||||
Model *string `json:"model,omitempty"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
Firmware *string `json:"firmware,omitempty"`
|
||||
Cores *int `json:"cores,omitempty"`
|
||||
Threads *int `json:"threads,omitempty"`
|
||||
FrequencyMHz *int `json:"frequency_mhz,omitempty"`
|
||||
MaxFrequencyMHz *int `json:"max_frequency_mhz,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
PowerW *float64 `json:"power_w,omitempty"`
|
||||
Throttled *bool `json:"throttled,omitempty"`
|
||||
CorrectableErrorCount *int64 `json:"correctable_error_count,omitempty"`
|
||||
UncorrectableErrorCount *int64 `json:"uncorrectable_error_count,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareMemory struct {
|
||||
Slot *string `json:"slot"`
|
||||
Location *string `json:"location"`
|
||||
Present *bool `json:"present"`
|
||||
SizeMB *int `json:"size_mb"`
|
||||
Type *string `json:"type"`
|
||||
MaxSpeedMHz *int `json:"max_speed_mhz"`
|
||||
CurrentSpeedMHz *int `json:"current_speed_mhz"`
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
PartNumber *string `json:"part_number"`
|
||||
Status *string `json:"status"`
|
||||
HardwareComponentStatus
|
||||
Slot *string `json:"slot,omitempty"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
SizeMB *int `json:"size_mb,omitempty"`
|
||||
Type *string `json:"type,omitempty"`
|
||||
MaxSpeedMHz *int `json:"max_speed_mhz,omitempty"`
|
||||
CurrentSpeedMHz *int `json:"current_speed_mhz,omitempty"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
PartNumber *string `json:"part_number,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
CorrectableECCErrorCount *int64 `json:"correctable_ecc_error_count,omitempty"`
|
||||
UncorrectableECCErrorCount *int64 `json:"uncorrectable_ecc_error_count,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
SpareBlocksRemainingPct *float64 `json:"spare_blocks_remaining_pct,omitempty"`
|
||||
PerformanceDegraded *bool `json:"performance_degraded,omitempty"`
|
||||
DataLossDetected *bool `json:"data_loss_detected,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareStorage struct {
|
||||
Slot *string `json:"slot"`
|
||||
Type *string `json:"type"`
|
||||
Model *string `json:"model"`
|
||||
SizeGB *int `json:"size_gb"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
Firmware *string `json:"firmware"`
|
||||
Interface *string `json:"interface"`
|
||||
Present *bool `json:"present"`
|
||||
Status *string `json:"status"`
|
||||
Telemetry map[string]any `json:"telemetry,omitempty"`
|
||||
HardwareComponentStatus
|
||||
Slot *string `json:"slot,omitempty"`
|
||||
Type *string `json:"type,omitempty"`
|
||||
Model *string `json:"model,omitempty"`
|
||||
SizeGB *int `json:"size_gb,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
Firmware *string `json:"firmware,omitempty"`
|
||||
Interface *string `json:"interface,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
PowerOnHours *int64 `json:"power_on_hours,omitempty"`
|
||||
PowerCycles *int64 `json:"power_cycles,omitempty"`
|
||||
UnsafeShutdowns *int64 `json:"unsafe_shutdowns,omitempty"`
|
||||
MediaErrors *int64 `json:"media_errors,omitempty"`
|
||||
ErrorLogEntries *int64 `json:"error_log_entries,omitempty"`
|
||||
WrittenBytes *int64 `json:"written_bytes,omitempty"`
|
||||
ReadBytes *int64 `json:"read_bytes,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
AvailableSparePct *float64 `json:"available_spare_pct,omitempty"`
|
||||
ReallocatedSectors *int64 `json:"reallocated_sectors,omitempty"`
|
||||
CurrentPendingSectors *int64 `json:"current_pending_sectors,omitempty"`
|
||||
OfflineUncorrectable *int64 `json:"offline_uncorrectable,omitempty"`
|
||||
Telemetry map[string]any `json:"-"`
|
||||
}
|
||||
|
||||
type HardwarePCIeDevice struct {
|
||||
Slot *string `json:"slot"`
|
||||
VendorID *int `json:"vendor_id"`
|
||||
DeviceID *int `json:"device_id"`
|
||||
BDF *string `json:"bdf"`
|
||||
DeviceClass *string `json:"device_class"`
|
||||
Manufacturer *string `json:"manufacturer"`
|
||||
Model *string `json:"model"`
|
||||
LinkWidth *int `json:"link_width"`
|
||||
LinkSpeed *string `json:"link_speed"`
|
||||
MaxLinkWidth *int `json:"max_link_width"`
|
||||
MaxLinkSpeed *string `json:"max_link_speed"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
Firmware *string `json:"firmware"`
|
||||
Present *bool `json:"present"`
|
||||
Status *string `json:"status"`
|
||||
Telemetry map[string]any `json:"telemetry,omitempty"`
|
||||
HardwareComponentStatus
|
||||
Slot *string `json:"slot,omitempty"`
|
||||
VendorID *int `json:"vendor_id,omitempty"`
|
||||
DeviceID *int `json:"device_id,omitempty"`
|
||||
NUMANode *int `json:"numa_node,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
PowerW *float64 `json:"power_w,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
ECCCorrectedTotal *int64 `json:"ecc_corrected_total,omitempty"`
|
||||
ECCUncorrectedTotal *int64 `json:"ecc_uncorrected_total,omitempty"`
|
||||
HWSlowdown *bool `json:"hw_slowdown,omitempty"`
|
||||
BatteryChargePct *float64 `json:"battery_charge_pct,omitempty"`
|
||||
BatteryHealthPct *float64 `json:"battery_health_pct,omitempty"`
|
||||
BatteryTemperatureC *float64 `json:"battery_temperature_c,omitempty"`
|
||||
BatteryVoltageV *float64 `json:"battery_voltage_v,omitempty"`
|
||||
BatteryReplaceRequired *bool `json:"battery_replace_required,omitempty"`
|
||||
SFPTemperatureC *float64 `json:"sfp_temperature_c,omitempty"`
|
||||
SFPTXPowerDBM *float64 `json:"sfp_tx_power_dbm,omitempty"`
|
||||
SFPRXPowerDBM *float64 `json:"sfp_rx_power_dbm,omitempty"`
|
||||
SFPVoltageV *float64 `json:"sfp_voltage_v,omitempty"`
|
||||
SFPBiasMA *float64 `json:"sfp_bias_ma,omitempty"`
|
||||
BDF *string `json:"-"`
|
||||
DeviceClass *string `json:"device_class,omitempty"`
|
||||
Manufacturer *string `json:"manufacturer,omitempty"`
|
||||
Model *string `json:"model,omitempty"`
|
||||
LinkWidth *int `json:"link_width,omitempty"`
|
||||
LinkSpeed *string `json:"link_speed,omitempty"`
|
||||
MaxLinkWidth *int `json:"max_link_width,omitempty"`
|
||||
MaxLinkSpeed *string `json:"max_link_speed,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
Firmware *string `json:"firmware,omitempty"`
|
||||
MacAddresses []string `json:"mac_addresses,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
Telemetry map[string]any `json:"-"`
|
||||
}
|
||||
|
||||
type HardwarePowerSupply struct {
|
||||
Slot *string `json:"slot"`
|
||||
Present *bool `json:"present"`
|
||||
Model *string `json:"model"`
|
||||
Vendor *string `json:"vendor"`
|
||||
WattageW *int `json:"wattage_w"`
|
||||
SerialNumber *string `json:"serial_number"`
|
||||
PartNumber *string `json:"part_number"`
|
||||
Firmware *string `json:"firmware"`
|
||||
Status *string `json:"status"`
|
||||
InputType *string `json:"input_type"`
|
||||
InputPowerW *float64 `json:"input_power_w"`
|
||||
OutputPowerW *float64 `json:"output_power_w"`
|
||||
InputVoltage *float64 `json:"input_voltage"`
|
||||
HardwareComponentStatus
|
||||
Slot *string `json:"slot,omitempty"`
|
||||
Present *bool `json:"present,omitempty"`
|
||||
Model *string `json:"model,omitempty"`
|
||||
Vendor *string `json:"vendor,omitempty"`
|
||||
WattageW *int `json:"wattage_w,omitempty"`
|
||||
SerialNumber *string `json:"serial_number,omitempty"`
|
||||
PartNumber *string `json:"part_number,omitempty"`
|
||||
Firmware *string `json:"firmware,omitempty"`
|
||||
InputType *string `json:"input_type,omitempty"`
|
||||
InputPowerW *float64 `json:"input_power_w,omitempty"`
|
||||
OutputPowerW *float64 `json:"output_power_w,omitempty"`
|
||||
InputVoltage *float64 `json:"input_voltage,omitempty"`
|
||||
TemperatureC *float64 `json:"temperature_c,omitempty"`
|
||||
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
|
||||
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareComponentStatus struct {
|
||||
Status *string `json:"status,omitempty"`
|
||||
StatusCheckedAt *string `json:"status_checked_at,omitempty"`
|
||||
StatusChangedAt *string `json:"status_changed_at,omitempty"`
|
||||
StatusHistory []HardwareStatusHistory `json:"status_history,omitempty"`
|
||||
ErrorDescription *string `json:"error_description,omitempty"`
|
||||
ManufacturedYearWeek *string `json:"manufactured_year_week,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareStatusHistory struct {
|
||||
Status string `json:"status"`
|
||||
ChangedAt string `json:"changed_at"`
|
||||
Details *string `json:"details,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareSensors struct {
|
||||
Fans []HardwareFanSensor `json:"fans,omitempty"`
|
||||
Power []HardwarePowerSensor `json:"power,omitempty"`
|
||||
Temperatures []HardwareTemperatureSensor `json:"temperatures,omitempty"`
|
||||
Other []HardwareOtherSensor `json:"other,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareFanSensor struct {
|
||||
Name string `json:"name"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
RPM *int `json:"rpm,omitempty"`
|
||||
Status *string `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
type HardwarePowerSensor struct {
|
||||
Name string `json:"name"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
VoltageV *float64 `json:"voltage_v,omitempty"`
|
||||
CurrentA *float64 `json:"current_a,omitempty"`
|
||||
PowerW *float64 `json:"power_w,omitempty"`
|
||||
Status *string `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareTemperatureSensor struct {
|
||||
Name string `json:"name"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
Celsius *float64 `json:"celsius,omitempty"`
|
||||
ThresholdWarningCelsius *float64 `json:"threshold_warning_celsius,omitempty"`
|
||||
ThresholdCriticalCelsius *float64 `json:"threshold_critical_celsius,omitempty"`
|
||||
Status *string `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareOtherSensor struct {
|
||||
Name string `json:"name"`
|
||||
Location *string `json:"location,omitempty"`
|
||||
Value *float64 `json:"value,omitempty"`
|
||||
Unit *string `json:"unit,omitempty"`
|
||||
Status *string `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
type HardwareEventLog struct {
|
||||
Source string `json:"source"`
|
||||
EventTime *string `json:"event_time,omitempty"`
|
||||
Severity *string `json:"severity,omitempty"`
|
||||
MessageID *string `json:"message_id,omitempty"`
|
||||
Message string `json:"message"`
|
||||
ComponentRef *string `json:"component_ref,omitempty"`
|
||||
Fingerprint *string `json:"fingerprint,omitempty"`
|
||||
IsActive *bool `json:"is_active,omitempty"`
|
||||
RawPayload map[string]any `json:"raw_payload,omitempty"`
|
||||
}
|
||||
|
||||
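Most fields in this schema change from plain tags to `omitempty`, and the choice between a pointer and a value type decides whether a zero value can still be reported. A minimal sketch of that difference follows; the `sample` type and `encode` helper are hypothetical and exist only for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sample is a hypothetical struct contrasting the two field styles used
// in the schema above.
type sample struct {
	Present *bool `json:"present,omitempty"` // pointer: nil is omitted, &false is kept
	Ready   bool  `json:"ready,omitempty"`   // plain bool: false is always omitted
}

// encode marshals v and returns the JSON text.
func encode(v any) string {
	data, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(data)
}

func main() {
	no := false
	fmt.Println(encode(sample{}))                           // {}
	fmt.Println(encode(sample{Present: &no, Ready: false})) // {"present":false}
	fmt.Println(encode(sample{Ready: true}))                // {"ready":true}
}
```

By the same rule, value-typed fields such as `DriverReady bool` with `omitempty` can only ever appear in the output as `true`; a consumer has to read an absent key as "false or unknown".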
46
audit/internal/schema/hardware_test.go
Normal file
@@ -0,0 +1,46 @@
|
||||
package schema
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
func TestHardwareSnapshotMarshalsNewContractFields(t *testing.T) {
|
||||
week := "2024-W07"
|
||||
eventTime := "2026-03-15T14:03:11Z"
|
||||
message := "Correctable ECC error threshold exceeded"
|
||||
|
||||
payload := HardwareIngestRequest{
|
||||
CollectedAt: "2026-03-15T15:00:00Z",
|
||||
Hardware: HardwareSnapshot{
|
||||
Board: HardwareBoard{SerialNumber: "SRV-001"},
|
||||
CPUs: []HardwareCPU{
|
||||
{
|
||||
HardwareComponentStatus: HardwareComponentStatus{
|
||||
ManufacturedYearWeek: &week,
|
||||
},
|
||||
},
|
||||
},
|
||||
EventLogs: []HardwareEventLog{
|
||||
{
|
||||
Source: "bmc",
|
||||
EventTime: &eventTime,
|
||||
Message: message,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
data, err := json.Marshal(payload)
|
||||
if err != nil {
|
||||
t.Fatalf("marshal: %v", err)
|
||||
}
|
||||
text := string(data)
|
||||
if !strings.Contains(text, `"manufactured_year_week":"2024-W07"`) {
|
||||
t.Fatalf("missing manufactured_year_week: %s", text)
|
||||
}
|
||||
if !strings.Contains(text, `"event_logs":[{"source":"bmc","event_time":"2026-03-15T14:03:11Z","message":"Correctable ECC error threshold exceeded"}]`) {
|
||||
t.Fatalf("missing event_logs payload: %s", text)
|
||||
}
|
||||
}
|
||||
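The test above depends on `encoding/json` promoting the fields of an embedded struct into the enclosing JSON object, which is why `manufactured_year_week` appears directly inside the CPU entry. A minimal sketch of that behavior, using the hypothetical trimmed-down types `ComponentStatus` and `CPU` rather than the real schema types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ComponentStatus is a hypothetical, reduced stand-in for the shared
// status block that the schema embeds into each component type.
type ComponentStatus struct {
	Status               *string `json:"status,omitempty"`
	ManufacturedYearWeek *string `json:"manufactured_year_week,omitempty"`
}

// CPU embeds ComponentStatus without a field name or tag, so its fields
// are serialized as if they were declared on CPU itself.
type CPU struct {
	ComponentStatus
	Model *string `json:"model,omitempty"`
}

// encodeCPU marshals c and returns the JSON text.
func encodeCPU(c CPU) string {
	data, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return string(data)
}

func main() {
	week, model := "2024-W07", "Example CPU"
	c := CPU{ComponentStatus: ComponentStatus{ManufacturedYearWeek: &week}, Model: &model}
	fmt.Println(encodeCPU(c))
}
```

No nested `"ComponentStatus"` object is produced; giving the embedded field its own JSON tag would change that and nest the status block under the tagged name.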
792
audit/internal/webui/api.go
Normal file
@@ -0,0 +1,792 @@
|
||||
package webui
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io"
|
||||
"net/http"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
"sync/atomic"
|
||||
"time"
|
||||
|
||||
"bee/audit/internal/app"
|
||||
"bee/audit/internal/platform"
|
||||
)
|
||||
|
||||
// ── Job ID counter ────────────────────────────────────────────────────────────
|
||||
|
||||
var jobCounter atomic.Uint64
|
||||
|
||||
func newJobID(prefix string) string {
|
||||
return fmt.Sprintf("%s-%d", prefix, jobCounter.Add(1))
|
||||
}
|
||||
|
||||
// ── SSE helpers ───────────────────────────────────────────────────────────────
|
||||
|
||||
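// sseWrite sends one Server-Sent Events frame and flushes it. It returns
// false if w does not support flushing. data is written verbatim, so
// callers are expected to pass a single line.
func sseWrite(w http.ResponseWriter, event, data string) bool {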
|
||||
f, ok := w.(http.Flusher)
|
||||
if !ok {
|
||||
return false
|
||||
}
|
||||
if event != "" {
|
||||
fmt.Fprintf(w, "event: %s\n", event)
|
||||
}
|
||||
fmt.Fprintf(w, "data: %s\n\n", data)
|
||||
f.Flush()
|
||||
return true
|
||||
}
|
||||
|
||||
func sseStart(w http.ResponseWriter) bool {
|
||||
_, ok := w.(http.Flusher)
|
||||
if !ok {
|
||||
http.Error(w, "streaming not supported", http.StatusInternalServerError)
|
||||
return false
|
||||
}
|
||||
w.Header().Set("Content-Type", "text/event-stream")
|
||||
w.Header().Set("Cache-Control", "no-cache")
|
||||
w.Header().Set("Connection", "keep-alive")
|
||||
w.Header().Set("Access-Control-Allow-Origin", "*")
|
||||
return true
|
||||
}
|
||||
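`sseWrite` emits its payload on a single `data:` line, which is fine for scanner output but would break framing if a payload ever contained a newline, since a blank line ends an SSE event. As a sketch under that assumption, the hypothetical `formatSSE` helper below builds a frame with one `data:` line per payload line. It is not part of this diff and assumes the payload contains no carriage returns.

```go
package main

import (
	"fmt"
	"strings"
)

// formatSSE is a hypothetical helper that renders one Server-Sent Events
// frame. Multi-line payloads are split so each line gets its own "data:"
// field, and the frame is terminated by a blank line.
func formatSSE(event, data string) string {
	var b strings.Builder
	if event != "" {
		fmt.Fprintf(&b, "event: %s\n", event)
	}
	for _, line := range strings.Split(data, "\n") {
		fmt.Fprintf(&b, "data: %s\n", line)
	}
	b.WriteString("\n")
	return b.String()
}

func main() {
	fmt.Printf("%q\n", formatSSE("done", "exit status 1\nsignal: killed"))
	fmt.Printf("%q\n", formatSSE("", "single line"))
}
```

On the client side, an `EventSource` joins consecutive `data:` fields with newlines, so the original multi-line text is recovered intact.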
|
||||
// streamJob streams lines from a jobState to a SSE response.
|
||||
func streamJob(w http.ResponseWriter, r *http.Request, j *jobState) {
|
||||
if !sseStart(w) {
|
||||
return
|
||||
}
|
||||
existing, ch := j.subscribe()
|
||||
for _, line := range existing {
|
||||
sseWrite(w, "", line)
|
||||
}
|
||||
if ch == nil {
|
||||
// Job already finished
|
||||
sseWrite(w, "done", j.err)
|
||||
return
|
||||
}
|
||||
for {
|
||||
select {
|
||||
case line, ok := <-ch:
|
||||
if !ok {
|
||||
sseWrite(w, "done", j.err)
|
||||
return
|
||||
}
|
||||
sseWrite(w, "", line)
|
||||
case <-r.Context().Done():
|
||||
return
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// runCmdJob runs an exec.Cmd as a background job, streaming stdout+stderr lines.
|
||||
func runCmdJob(j *jobState, cmd *exec.Cmd) {
|
||||
pr, pw := io.Pipe()
|
||||
cmd.Stdout = pw
|
||||
cmd.Stderr = pw
|
||||
|
||||
if err := cmd.Start(); err != nil {
|
||||
j.finish(err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
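	// Drain the pipe in the background; done is closed once every line
	// read from the command has been appended to the job.
	done := make(chan struct{})
	go func() {
		defer close(done)
		scanner := bufio.NewScanner(pr)
		for scanner.Scan() {
			j.append(scanner.Text())
		}
		// If the scanner stopped early (for example on an over-long line),
		// keep reading so the command never blocks while writing its output.
		_, _ = io.Copy(io.Discard, pr)
	}()

	err := cmd.Wait()
	_ = pw.Close()
	<-done // all output is recorded before the job is marked finished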
|
||||
if err != nil {
|
||||
j.finish(err.Error())
|
||||
} else {
|
||||
j.finish("")
|
||||
}
|
||||
}
|
||||
|
||||
// ── Audit ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPIAuditRun(w http.ResponseWriter, _ *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
t := &Task{
|
||||
ID: newJobID("audit"),
|
||||
Name: "Audit",
|
||||
Target: "audit",
|
||||
Status: TaskPending,
|
||||
CreatedAt: time.Now(),
|
||||
}
|
||||
globalQueue.enqueue(t)
|
||||
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
|
||||
}
|
||||
|
||||
func (h *handler) handleAPIAuditStream(w http.ResponseWriter, r *http.Request) {
|
||||
id := r.URL.Query().Get("job_id")
|
||||
if id == "" {
|
||||
id = r.URL.Query().Get("task_id")
|
||||
}
|
||||
// Try task queue first, then legacy job manager
|
||||
if j, ok := globalQueue.findJob(id); ok {
|
||||
streamJob(w, r, j)
|
||||
return
|
||||
}
|
||||
if j, ok := globalJobs.get(id); ok {
|
||||
streamJob(w, r, j)
|
||||
return
|
||||
}
|
||||
http.Error(w, "job not found", http.StatusNotFound)
|
||||
}
|
||||
|
||||
// ── SAT ───────────────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
|
||||
return func(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
|
||||
var body struct {
|
||||
Duration int `json:"duration"`
|
||||
DiagLevel int `json:"diag_level"`
|
||||
GPUIndices []int `json:"gpu_indices"`
|
||||
Profile string `json:"profile"`
|
||||
DisplayName string `json:"display_name"`
|
||||
}
|
||||
if r.ContentLength > 0 {
|
||||
_ = json.NewDecoder(r.Body).Decode(&body)
|
||||
}
|
||||
|
||||
name := taskNames[target]
|
||||
if name == "" {
|
||||
name = target
|
||||
}
|
||||
t := &Task{
|
||||
ID: newJobID("sat-" + target),
|
||||
Name: name,
|
||||
Target: target,
|
||||
Status: TaskPending,
|
||||
CreatedAt: time.Now(),
|
||||
params: taskParams{
|
||||
Duration: body.Duration,
|
||||
DiagLevel: body.DiagLevel,
|
||||
GPUIndices: body.GPUIndices,
|
||||
BurnProfile: body.Profile,
|
||||
DisplayName: body.DisplayName,
|
||||
},
|
||||
}
|
||||
if strings.TrimSpace(body.DisplayName) != "" {
|
||||
t.Name = body.DisplayName
|
||||
}
|
||||
globalQueue.enqueue(t)
|
||||
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
|
||||
}
|
||||
}
|
||||
|
||||
func (h *handler) handleAPISATStream(w http.ResponseWriter, r *http.Request) {
|
||||
id := r.URL.Query().Get("job_id")
|
||||
if id == "" {
|
||||
id = r.URL.Query().Get("task_id")
|
||||
}
|
||||
if j, ok := globalQueue.findJob(id); ok {
|
||||
streamJob(w, r, j)
|
||||
return
|
||||
}
|
||||
if j, ok := globalJobs.get(id); ok {
|
||||
streamJob(w, r, j)
|
||||
return
|
||||
}
|
||||
http.Error(w, "job not found", http.StatusNotFound)
|
||||
}
|
||||
|
||||
func (h *handler) handleAPISATAbort(w http.ResponseWriter, r *http.Request) {
|
||||
id := r.URL.Query().Get("job_id")
|
||||
if id == "" {
|
||||
id = r.URL.Query().Get("task_id")
|
||||
}
|
||||
if t, ok := globalQueue.findByID(id); ok {
|
||||
globalQueue.mu.Lock()
|
||||
switch t.Status {
|
||||
case TaskPending:
|
||||
t.Status = TaskCancelled
|
||||
now := time.Now()
|
||||
t.DoneAt = &now
|
||||
case TaskRunning:
|
||||
if t.job != nil {
|
||||
t.job.abort()
|
||||
}
|
||||
t.Status = TaskCancelled
|
||||
now := time.Now()
|
||||
t.DoneAt = &now
|
||||
}
|
||||
globalQueue.mu.Unlock()
|
||||
writeJSON(w, map[string]string{"status": "aborted"})
|
||||
return
|
||||
}
|
||||
if j, ok := globalJobs.get(id); ok {
|
||||
if j.abort() {
|
||||
writeJSON(w, map[string]string{"status": "aborted"})
|
||||
} else {
|
||||
writeJSON(w, map[string]string{"status": "not_running"})
|
||||
}
|
||||
return
|
||||
}
|
||||
http.Error(w, "job not found", http.StatusNotFound)
|
||||
}
|
||||
|
||||
// ── Services ──────────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPIServicesList(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
names, err := h.opts.App.ListBeeServices()
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
type serviceInfo struct {
|
||||
Name string `json:"name"`
|
||||
State string `json:"state"`
|
||||
Body string `json:"body"`
|
||||
}
|
||||
result := make([]serviceInfo, 0, len(names))
|
||||
for _, name := range names {
|
||||
state := h.opts.App.ServiceState(name)
|
||||
body, _ := h.opts.App.ServiceStatus(name)
|
||||
result = append(result, serviceInfo{Name: name, State: state, Body: body})
|
||||
}
|
||||
writeJSON(w, result)
|
||||
}
|
||||
|
||||
func (h *handler) handleAPIServicesAction(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
var req struct {
|
||||
Name string `json:"name"`
|
||||
Action string `json:"action"`
|
||||
}
|
||||
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
|
||||
writeError(w, http.StatusBadRequest, "invalid request body")
|
||||
return
|
||||
}
|
||||
var action platform.ServiceAction
|
||||
switch req.Action {
|
||||
case "start":
|
||||
action = platform.ServiceStart
|
||||
case "stop":
|
||||
action = platform.ServiceStop
|
||||
case "restart":
|
||||
action = platform.ServiceRestart
|
||||
default:
|
||||
writeError(w, http.StatusBadRequest, "action must be start|stop|restart")
|
||||
return
|
||||
}
|
||||
result, err := h.opts.App.ServiceActionResult(req.Name, action)
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
writeJSON(w, map[string]string{"status": "ok", "output": result.Body})
|
||||
}
|
||||
|
||||
// ── Network ───────────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPINetworkStatus(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
ifaces, err := h.opts.App.ListInterfaces()
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
writeJSON(w, map[string]any{
|
||||
"interfaces": ifaces,
|
||||
"default_route": h.opts.App.DefaultRoute(),
|
||||
})
|
||||
}
|
||||
|
||||
func (h *handler) handleAPINetworkDHCP(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
var req struct {
|
||||
Interface string `json:"interface"`
|
||||
}
|
||||
_ = json.NewDecoder(r.Body).Decode(&req)
|
||||
|
||||
result, err := h.applyPendingNetworkChange(func() (app.ActionResult, error) {
|
||||
if req.Interface == "" || req.Interface == "all" {
|
||||
return h.opts.App.DHCPAllResult()
|
||||
}
|
||||
return h.opts.App.DHCPOneResult(req.Interface)
|
||||
})
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
writeJSON(w, map[string]any{
|
||||
"status": "ok",
|
||||
"output": result.Body,
|
||||
"rollback_in": int(netRollbackTimeout.Seconds()),
|
||||
})
|
||||
}
|
||||
|
||||
func (h *handler) handleAPINetworkStatic(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
var req struct {
|
||||
Interface string `json:"interface"`
|
||||
Address string `json:"address"`
|
||||
Prefix string `json:"prefix"`
|
||||
Gateway string `json:"gateway"`
|
||||
DNS []string `json:"dns"`
|
||||
}
|
||||
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
|
||||
writeError(w, http.StatusBadRequest, "invalid request body")
|
||||
return
|
||||
}
|
||||
cfg := platform.StaticIPv4Config{
|
||||
Interface: req.Interface,
|
||||
Address: req.Address,
|
||||
Prefix: req.Prefix,
|
||||
Gateway: req.Gateway,
|
||||
DNS: req.DNS,
|
||||
}
|
||||
result, err := h.applyPendingNetworkChange(func() (app.ActionResult, error) {
|
||||
return h.opts.App.SetStaticIPv4Result(cfg)
|
||||
})
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
writeJSON(w, map[string]any{
|
||||
"status": "ok",
|
||||
"output": result.Body,
|
||||
"rollback_in": int(netRollbackTimeout.Seconds()),
|
||||
})
|
||||
}
|
||||
|
||||
// ── Export ────────────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPIExportList(w http.ResponseWriter, r *http.Request) {
|
||||
entries, err := listExportFiles(h.opts.ExportDir)
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
writeJSON(w, entries)
|
||||
}
|
||||
|
||||
func (h *handler) handleAPIExportBundle(w http.ResponseWriter, r *http.Request) {
|
||||
archive, err := app.BuildSupportBundle(h.opts.ExportDir)
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
writeJSON(w, map[string]string{
|
||||
"status": "ok",
|
||||
"path": archive,
|
||||
"url": "/export/support.tar.gz",
|
||||
})
|
||||
}
|
||||
|
||||
// ── GPU presence ──────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPIGPUPresence(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
gp := h.opts.App.DetectGPUPresence()
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
_ = json.NewEncoder(w).Encode(map[string]bool{
|
||||
"nvidia": gp.Nvidia,
|
||||
"amd": gp.AMD,
|
||||
})
|
||||
}
|
||||
|
||||
// ── System ────────────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPIRAMStatus(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
inRAM := h.opts.App.IsLiveMediaInRAM()
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
_ = json.NewEncoder(w).Encode(map[string]bool{"in_ram": inRAM})
|
||||
}
|
||||
|
||||
func (h *handler) handleAPIInstallToRAM(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
h.installMu.Lock()
|
||||
installRunning := h.installJob != nil && !h.installJob.isDone()
|
||||
h.installMu.Unlock()
|
||||
if installRunning {
|
||||
writeError(w, http.StatusConflict, "install to disk is already running")
|
||||
return
|
||||
}
|
||||
t := &Task{
|
||||
ID: newJobID("install-to-ram"),
|
||||
Name: "Install to RAM",
|
||||
Target: "install-to-ram",
|
||||
Priority: 10,
|
||||
Status: TaskPending,
|
||||
CreatedAt: time.Now(),
|
||||
}
|
||||
globalQueue.enqueue(t)
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
_ = json.NewEncoder(w).Encode(map[string]string{"task_id": t.ID})
|
||||
}
|
||||
|
||||
// ── Tools ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
var standardTools = []string{
|
||||
"dmidecode", "smartctl", "nvme", "lspci", "ipmitool",
|
||||
"nvidia-smi", "memtester", "stress-ng", "nvtop",
|
||||
"mstflint", "qrencode",
|
||||
}
|
||||
|
||||
func (h *handler) handleAPIToolsCheck(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
statuses := h.opts.App.CheckTools(standardTools)
|
||||
writeJSON(w, statuses)
|
||||
}
|
||||
|
||||
// ── Preflight ─────────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPIPreflight(w http.ResponseWriter, r *http.Request) {
|
||||
data, err := loadSnapshot(filepath.Join(h.opts.ExportDir, "runtime-health.json"))
|
||||
if err != nil {
|
||||
writeError(w, http.StatusNotFound, "runtime health not found")
|
||||
return
|
||||
}
|
||||
w.Header().Set("Content-Type", "application/json; charset=utf-8")
|
||||
w.Header().Set("Cache-Control", "no-store")
|
||||
_, _ = w.Write(data)
|
||||
}
|
||||
|
||||
// ── Install ───────────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPIInstallDisks(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
disks, err := h.opts.App.ListInstallDisks()
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
type diskJSON struct {
|
||||
Device string `json:"device"`
|
||||
Model string `json:"model"`
|
||||
Size string `json:"size"`
|
||||
SizeBytes int64 `json:"size_bytes"`
|
||||
MountedParts []string `json:"mounted_parts"`
|
||||
Warnings []string `json:"warnings"`
|
||||
}
|
||||
result := make([]diskJSON, 0, len(disks))
|
||||
for _, d := range disks {
|
||||
result = append(result, diskJSON{
|
||||
Device: d.Device,
|
||||
Model: d.Model,
|
||||
Size: d.Size,
|
||||
SizeBytes: d.SizeBytes,
|
||||
MountedParts: d.MountedParts,
|
||||
Warnings: platform.DiskWarnings(d),
|
||||
})
|
||||
}
|
||||
writeJSON(w, result)
|
||||
}
|
||||
|
||||
func (h *handler) handleAPIInstallRun(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
var req struct {
|
||||
Device string `json:"device"`
|
||||
}
|
||||
if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Device == "" {
|
||||
writeError(w, http.StatusBadRequest, "device is required")
|
||||
return
|
||||
}
|
||||
|
||||
// Whitelist: only allow devices that ListInstallDisks() returns.
|
||||
disks, err := h.opts.App.ListInstallDisks()
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
allowed := false
|
||||
for _, d := range disks {
|
||||
if d.Device == req.Device {
|
||||
allowed = true
|
||||
break
|
||||
}
|
||||
}
|
||||
if !allowed {
|
||||
writeError(w, http.StatusBadRequest, "device not in install candidate list")
|
||||
return
|
||||
}
|
||||
if globalQueue.hasActiveTarget("install-to-ram") {
|
||||
writeError(w, http.StatusConflict, "install to RAM task is already pending or running")
|
||||
return
|
||||
}
|
||||
|
||||
h.installMu.Lock()
|
||||
if h.installJob != nil && !h.installJob.isDone() {
|
||||
h.installMu.Unlock()
|
||||
writeError(w, http.StatusConflict, "install already running")
|
||||
return
|
||||
}
|
||||
j := &jobState{}
|
||||
h.installJob = j
|
||||
h.installMu.Unlock()
|
||||
|
||||
logFile := platform.InstallLogPath(req.Device)
|
||||
go runCmdJob(j, exec.CommandContext(context.Background(), "bee-install", req.Device, logFile))
|
||||
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
}
|
||||
|
||||
func (h *handler) handleAPIInstallStream(w http.ResponseWriter, r *http.Request) {
|
||||
h.installMu.Lock()
|
||||
j := h.installJob
|
||||
h.installMu.Unlock()
|
||||
if j == nil {
|
||||
if !sseStart(w) {
|
||||
return
|
||||
}
|
||||
sseWrite(w, "done", "")
|
||||
return
|
||||
}
|
||||
streamJob(w, r, j)
|
||||
}
|
||||
|
||||
// ── Metrics SSE ───────────────────────────────────────────────────────────────
|
||||
|
||||
func (h *handler) handleAPIMetricsStream(w http.ResponseWriter, r *http.Request) {
|
||||
if !sseStart(w) {
|
||||
return
|
||||
}
|
||||
ticker := time.NewTicker(1 * time.Second)
|
||||
defer ticker.Stop()
|
||||
for {
|
||||
select {
|
||||
case <-r.Context().Done():
|
||||
return
|
||||
case <-ticker.C:
|
||||
sample, ok := h.latestMetric()
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
b, err := json.Marshal(sample)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
if !sseWrite(w, "metrics", string(b)) {
|
||||
return
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// feedRings pushes one sample into all in-memory ring buffers.
|
||||
func (h *handler) feedRings(sample platform.LiveMetricSample) {
|
||||
for _, t := range sample.Temps {
|
||||
switch t.Group {
|
||||
case "cpu":
|
||||
h.pushNamedMetricRing(&h.cpuTempRings, t.Name, t.Celsius)
|
||||
case "ambient":
|
||||
h.pushNamedMetricRing(&h.ambientTempRings, t.Name, t.Celsius)
|
||||
}
|
||||
}
|
||||
h.ringPower.push(sample.PowerW)
|
||||
h.ringCPULoad.push(sample.CPULoadPct)
|
||||
h.ringMemLoad.push(sample.MemLoadPct)
|
||||
|
||||
h.ringsMu.Lock()
|
||||
for i, fan := range sample.Fans {
|
||||
for len(h.ringFans) <= i {
|
||||
h.ringFans = append(h.ringFans, newMetricsRing(120))
|
||||
h.fanNames = append(h.fanNames, fan.Name)
|
||||
}
|
||||
h.ringFans[i].push(float64(fan.RPM))
|
||||
}
|
||||
for _, gpu := range sample.GPUs {
|
||||
idx := gpu.GPUIndex
|
||||
for len(h.gpuRings) <= idx {
|
||||
h.gpuRings = append(h.gpuRings, &gpuRings{
|
||||
Temp: newMetricsRing(120),
|
||||
Util: newMetricsRing(120),
|
||||
MemUtil: newMetricsRing(120),
|
||||
Power: newMetricsRing(120),
|
||||
})
|
||||
}
|
||||
h.gpuRings[idx].Temp.push(gpu.TempC)
|
||||
h.gpuRings[idx].Util.push(gpu.UsagePct)
|
||||
h.gpuRings[idx].MemUtil.push(gpu.MemUsagePct)
|
||||
h.gpuRings[idx].Power.push(gpu.PowerW)
|
||||
}
|
||||
h.ringsMu.Unlock()
|
||||
}
|
||||
|
||||
func (h *handler) pushNamedMetricRing(dst *[]*namedMetricsRing, name string, value float64) {
|
||||
if name == "" {
|
||||
return
|
||||
}
|
||||
for _, item := range *dst {
|
||||
if item != nil && item.Name == name && item.Ring != nil {
|
||||
item.Ring.push(value)
|
||||
return
|
||||
}
|
||||
}
|
||||
*dst = append(*dst, &namedMetricsRing{
|
||||
Name: name,
|
||||
Ring: newMetricsRing(120),
|
||||
})
|
||||
(*dst)[len(*dst)-1].Ring.push(value)
|
||||
}
|
||||
|
||||
// ── Network toggle ────────────────────────────────────────────────────────────
|
||||
|
||||
const netRollbackTimeout = 60 * time.Second
|
||||
|
||||
func (h *handler) handleAPINetworkToggle(w http.ResponseWriter, r *http.Request) {
|
||||
if h.opts.App == nil {
|
||||
writeError(w, http.StatusServiceUnavailable, "app not configured")
|
||||
return
|
||||
}
|
||||
var req struct {
|
||||
Iface string `json:"iface"`
|
||||
}
|
||||
if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Iface == "" {
|
||||
writeError(w, http.StatusBadRequest, "iface is required")
|
||||
return
|
||||
}
|
||||
|
||||
wasUp, err := h.opts.App.GetInterfaceState(req.Iface)
|
||||
if err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
if _, err := h.applyPendingNetworkChange(func() (app.ActionResult, error) {
|
||||
err := h.opts.App.SetInterfaceState(req.Iface, !wasUp)
|
||||
return app.ActionResult{}, err
|
||||
}); err != nil {
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
newState := "up"
|
||||
if wasUp {
|
||||
newState = "down"
|
||||
}
|
||||
writeJSON(w, map[string]any{
|
||||
"iface": req.Iface,
|
||||
"new_state": newState,
|
||||
"rollback_in": int(netRollbackTimeout.Seconds()),
|
||||
})
|
||||
}
|
||||
|
||||
func (h *handler) applyPendingNetworkChange(apply func() (app.ActionResult, error)) (app.ActionResult, error) {
|
||||
if h.opts.App == nil {
|
||||
return app.ActionResult{}, fmt.Errorf("app not configured")
|
||||
}
|
||||
|
||||
if err := h.rollbackPendingNetworkChange(); err != nil && err.Error() != "no pending network change" {
|
||||
return app.ActionResult{}, err
|
||||
}
|
||||
|
||||
snapshot, err := h.opts.App.CaptureNetworkSnapshot()
|
||||
if err != nil {
|
||||
return app.ActionResult{}, err
|
||||
}
|
||||
|
||||
result, err := apply()
|
||||
if err != nil {
|
||||
return result, err
|
||||
}
|
||||
|
||||
pnc := &pendingNetChange{snapshot: snapshot}
|
||||
pnc.timer = time.AfterFunc(netRollbackTimeout, func() {
|
||||
_ = h.opts.App.RestoreNetworkSnapshot(snapshot)
|
||||
h.pendingNetMu.Lock()
|
||||
if h.pendingNet == pnc {
|
||||
h.pendingNet = nil
|
||||
}
|
||||
h.pendingNetMu.Unlock()
|
||||
})
|
||||
|
||||
h.pendingNetMu.Lock()
|
||||
h.pendingNet = pnc
|
||||
h.pendingNetMu.Unlock()
|
||||
|
||||
return result, nil
|
||||
}
|
||||
|
||||
func (h *handler) handleAPINetworkConfirm(w http.ResponseWriter, _ *http.Request) {
|
||||
h.pendingNetMu.Lock()
|
||||
pnc := h.pendingNet
|
||||
h.pendingNet = nil
|
||||
h.pendingNetMu.Unlock()
|
||||
if pnc != nil {
|
||||
pnc.mu.Lock()
|
||||
pnc.timer.Stop()
|
||||
pnc.mu.Unlock()
|
||||
}
|
||||
writeJSON(w, map[string]string{"status": "confirmed"})
|
||||
}
|
||||
|
||||
func (h *handler) handleAPINetworkRollback(w http.ResponseWriter, _ *http.Request) {
|
||||
if err := h.rollbackPendingNetworkChange(); err != nil {
|
||||
if err.Error() == "no pending network change" {
|
||||
writeError(w, http.StatusConflict, err.Error())
|
||||
return
|
||||
}
|
||||
writeError(w, http.StatusInternalServerError, err.Error())
|
||||
return
|
||||
}
|
||||
writeJSON(w, map[string]string{"status": "rolled back"})
|
||||
}
|
||||
|
||||
func (h *handler) rollbackPendingNetworkChange() error {
|
||||
h.pendingNetMu.Lock()
|
||||
pnc := h.pendingNet
|
||||
h.pendingNet = nil
|
||||
h.pendingNetMu.Unlock()
|
||||
if pnc == nil {
|
||||
return fmt.Errorf("no pending network change")
|
||||
}
|
||||
pnc.mu.Lock()
|
||||
pnc.timer.Stop()
|
||||
pnc.mu.Unlock()
|
||||
if h.opts.App != nil {
|
||||
return h.opts.App.RestoreNetworkSnapshot(pnc.snapshot)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
137
audit/internal/webui/jobs.go
Normal file
@@ -0,0 +1,137 @@
|
||||
package webui
|
||||
|
||||
import (
|
||||
"os"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
)
|
||||
|
||||
// jobState holds the output lines and completion status of an async job.
|
||||
type jobState struct {
|
||||
lines []string
|
||||
done bool
|
||||
err string
|
||||
mu sync.Mutex
|
||||
subs []chan string
|
||||
cancel func() // optional cancel function; nil if job is not cancellable
|
||||
logPath string
|
||||
}
|
||||
|
||||
// abort cancels the job if it has a cancel function and is not yet done.
|
||||
func (j *jobState) abort() bool {
|
||||
j.mu.Lock()
|
||||
defer j.mu.Unlock()
|
||||
if j.done || j.cancel == nil {
|
||||
return false
|
||||
}
|
||||
j.cancel()
|
||||
return true
|
||||
}
|
||||
|
||||
func (j *jobState) append(line string) {
|
||||
j.mu.Lock()
|
||||
defer j.mu.Unlock()
|
||||
j.lines = append(j.lines, line)
|
||||
if j.logPath != "" {
|
||||
appendJobLog(j.logPath, line)
|
||||
}
|
||||
for _, ch := range j.subs {
|
||||
select {
|
||||
case ch <- line:
|
||||
default:
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (j *jobState) finish(errMsg string) {
|
||||
j.mu.Lock()
|
||||
defer j.mu.Unlock()
|
||||
j.done = true
|
||||
j.err = errMsg
|
||||
for _, ch := range j.subs {
|
||||
close(ch)
|
||||
}
|
||||
j.subs = nil
|
||||
}
|
||||
|
||||
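// subscribe returns a copy of the lines recorded so far plus a channel that
// streams lines appended afterwards. The channel is nil if the job is already
// done, and it is closed when the job finishes. A line is dropped for any
// subscriber whose 256-line buffer is full.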
|
||||
func (j *jobState) subscribe() ([]string, <-chan string) {
|
||||
j.mu.Lock()
|
||||
defer j.mu.Unlock()
|
||||
existing := make([]string, len(j.lines))
|
||||
copy(existing, j.lines)
|
||||
if j.done {
|
||||
return existing, nil
|
||||
}
|
||||
ch := make(chan string, 256)
|
||||
j.subs = append(j.subs, ch)
|
||||
return existing, ch
|
||||
}
|
||||
|
||||
// jobManager manages async jobs identified by string IDs.
|
||||
type jobManager struct {
|
||||
mu sync.Mutex
|
||||
jobs map[string]*jobState
|
||||
}
|
||||
|
||||
var globalJobs = &jobManager{jobs: make(map[string]*jobState)}
|
||||
|
||||
func (m *jobManager) create(id string) *jobState {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
j := &jobState{}
|
||||
m.jobs[id] = j
|
||||
// Schedule cleanup after 30 minutes
|
||||
go func() {
|
||||
time.Sleep(30 * time.Minute)
|
||||
m.mu.Lock()
|
||||
delete(m.jobs, id)
|
||||
m.mu.Unlock()
|
||||
}()
|
||||
return j
|
||||
}
|
||||
|
||||
// isDone returns true if the job has finished (either successfully or with error).
|
||||
func (j *jobState) isDone() bool {
|
||||
j.mu.Lock()
|
||||
defer j.mu.Unlock()
|
||||
return j.done
|
||||
}
|
||||
|
||||
func (m *jobManager) get(id string) (*jobState, bool) {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
j, ok := m.jobs[id]
|
||||
return j, ok
|
||||
}
|
||||
|
||||
func newTaskJobState(logPath string) *jobState {
|
||||
j := &jobState{logPath: logPath}
|
||||
if logPath == "" {
|
||||
return j
|
||||
}
|
||||
data, err := os.ReadFile(logPath)
|
||||
if err != nil || len(data) == 0 {
|
||||
return j
|
||||
}
|
||||
lines := strings.Split(strings.ReplaceAll(string(data), "\r\n", "\n"), "\n")
|
||||
if len(lines) > 0 && lines[len(lines)-1] == "" {
|
||||
lines = lines[:len(lines)-1]
|
||||
}
|
||||
j.lines = append(j.lines, lines...)
|
||||
return j
|
||||
}
|
||||
|
||||
func appendJobLog(path, line string) {
|
||||
if path == "" {
|
||||
return
|
||||
}
|
||||
f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
defer f.Close()
|
||||
_, _ = f.WriteString(line + "\n")
|
||||
}
|
||||
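Review note: `subscribe` in jobs.go returns a nil channel once the job is done, and `append` drops a line for any subscriber whose buffer is full. A caller therefore needs a nil check before ranging over the channel, since receiving from a nil channel blocks forever. The sketch below is a reduced stand-in for `jobState` (no log file, no cancel function), and the `collect` helper is hypothetical. It only illustrates the replay-then-stream consumption pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// job is a reduced stand-in for jobState from the diff above.
type job struct {
	mu    sync.Mutex
	lines []string
	done  bool
	subs  []chan string
}

// append records a line and fans it out without blocking.
func (j *job) append(line string) {
	j.mu.Lock()
	defer j.mu.Unlock()
	j.lines = append(j.lines, line)
	for _, ch := range j.subs {
		select {
		case ch <- line:
		default: // subscriber buffer full: the line is dropped for it
		}
	}
}

// finish marks the job done and closes every subscriber channel.
func (j *job) finish() {
	j.mu.Lock()
	defer j.mu.Unlock()
	j.done = true
	for _, ch := range j.subs {
		close(ch)
	}
	j.subs = nil
}

// subscribe returns the lines so far and a channel for later lines.
// The channel is nil when the job has already finished.
func (j *job) subscribe() ([]string, <-chan string) {
	j.mu.Lock()
	defer j.mu.Unlock()
	existing := append([]string(nil), j.lines...)
	if j.done {
		return existing, nil
	}
	ch := make(chan string, 256)
	j.subs = append(j.subs, ch)
	return existing, ch
}

// collect (hypothetical) replays existing lines, then drains the stream.
func collect(j *job) []string {
	out, ch := j.subscribe()
	if ch == nil { // finished job: ranging over nil would block forever
		return out
	}
	for line := range ch {
		out = append(out, line)
	}
	return out
}

func main() {
	j := &job{}
	j.append("a")
	go func() {
		j.append("b")
		j.finish()
	}()
	got := collect(j)
	fmt.Println(got[0], len(got)) // a 2
	fmt.Println(collect(j))       // [a b]
}
```

Whether the subscriber in the first `collect` call sees "b" via replay or via the channel depends on goroutine scheduling, but the combined result is the same in both cases.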
317
audit/internal/webui/metricsdb.go
Normal file
@@ -0,0 +1,317 @@
|
||||
package webui
|
||||
|
||||
import (
|
||||
"database/sql"
|
||||
"encoding/csv"
|
||||
"io"
|
||||
"strconv"
|
||||
"time"
|
||||
|
||||
"bee/audit/internal/platform"
|
||||
_ "modernc.org/sqlite"
|
||||
)
|
||||
|
||||
const metricsDBPath = "/appdata/bee/metrics.db"
|
||||
|
||||
// MetricsDB persists live metric samples to SQLite.
|
||||
type MetricsDB struct {
|
||||
db *sql.DB
|
||||
}
|
||||
|
||||
// openMetricsDB opens (or creates) the metrics database at the given path.
|
||||
func openMetricsDB(path string) (*MetricsDB, error) {
|
||||
db, err := sql.Open("sqlite", path+"?_journal=WAL&_busy_timeout=5000")
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
db.SetMaxOpenConns(1)
|
||||
if err := initMetricsSchema(db); err != nil {
|
||||
_ = db.Close()
|
||||
return nil, err
|
||||
}
|
||||
return &MetricsDB{db: db}, nil
|
||||
}
|
||||
|
||||
func initMetricsSchema(db *sql.DB) error {
|
||||
_, err := db.Exec(`
|
||||
CREATE TABLE IF NOT EXISTS sys_metrics (
|
||||
ts INTEGER NOT NULL,
|
||||
cpu_load_pct REAL,
|
||||
mem_load_pct REAL,
|
||||
power_w REAL,
|
||||
PRIMARY KEY (ts)
|
||||
);
|
||||
CREATE TABLE IF NOT EXISTS gpu_metrics (
|
||||
ts INTEGER NOT NULL,
|
||||
gpu_index INTEGER NOT NULL,
|
||||
temp_c REAL,
|
||||
usage_pct REAL,
|
||||
mem_usage_pct REAL,
|
||||
power_w REAL,
|
||||
PRIMARY KEY (ts, gpu_index)
|
||||
);
|
||||
CREATE TABLE IF NOT EXISTS fan_metrics (
|
||||
ts INTEGER NOT NULL,
|
||||
name TEXT NOT NULL,
|
||||
rpm REAL,
|
||||
PRIMARY KEY (ts, name)
|
||||
);
|
||||
CREATE TABLE IF NOT EXISTS temp_metrics (
|
||||
ts INTEGER NOT NULL,
|
||||
name TEXT NOT NULL,
|
||||
grp TEXT NOT NULL,
|
||||
celsius REAL,
|
||||
PRIMARY KEY (ts, name)
|
||||
);
|
||||
`)
|
||||
return err
|
||||
}
|
||||
|
||||
// Write inserts one sample into all relevant tables.
|
||||
func (m *MetricsDB) Write(s platform.LiveMetricSample) error {
|
||||
ts := s.Timestamp.Unix()
|
||||
tx, err := m.db.Begin()
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer func() { _ = tx.Rollback() }()
|
||||
|
||||
_, err = tx.Exec(
|
||||
`INSERT OR REPLACE INTO sys_metrics(ts,cpu_load_pct,mem_load_pct,power_w) VALUES(?,?,?,?)`,
|
||||
ts, s.CPULoadPct, s.MemLoadPct, s.PowerW,
|
||||
)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
for _, g := range s.GPUs {
|
||||
_, err = tx.Exec(
|
||||
`INSERT OR REPLACE INTO gpu_metrics(ts,gpu_index,temp_c,usage_pct,mem_usage_pct,power_w) VALUES(?,?,?,?,?,?)`,
|
||||
ts, g.GPUIndex, g.TempC, g.UsagePct, g.MemUsagePct, g.PowerW,
|
||||
)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
for _, f := range s.Fans {
|
||||
_, err = tx.Exec(
|
||||
`INSERT OR REPLACE INTO fan_metrics(ts,name,rpm) VALUES(?,?,?)`,
|
||||
ts, f.Name, f.RPM,
|
||||
)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
for _, t := range s.Temps {
|
||||
_, err = tx.Exec(
|
||||
`INSERT OR REPLACE INTO temp_metrics(ts,name,grp,celsius) VALUES(?,?,?,?)`,
|
||||
ts, t.Name, t.Group, t.Celsius,
|
||||
)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return tx.Commit()
|
||||
}
|
||||
|
||||
// LoadRecent returns up to n of the most recent samples in chronological order (oldest first).
|
||||
func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
|
||||
return m.loadSamples(`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts DESC LIMIT ?`, n)
|
||||
}
|
||||
|
||||
// LoadAll returns all persisted samples in chronological order (oldest first).
// The query sorts newest-first because loadSamples always reverses its input
// and derives the ts range from the reversed slice.
func (m *MetricsDB) LoadAll() ([]platform.LiveMetricSample, error) {
	return m.loadSamples(`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts DESC`)
}
|
||||
|
||||
// loadSamples reconstructs LiveMetricSample rows from the normalized tables.
|
||||
func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetricSample, error) {
|
||||
rows, err := m.db.Query(query, args...)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
defer rows.Close()
|
||||
|
||||
type sysRow struct {
|
||||
ts int64
|
||||
cpu, mem, pwr float64
|
||||
}
|
||||
var sysRows []sysRow
|
||||
for rows.Next() {
|
||||
var r sysRow
|
||||
if err := rows.Scan(&r.ts, &r.cpu, &r.mem, &r.pwr); err != nil {
|
||||
continue
|
||||
}
|
||||
sysRows = append(sysRows, r)
|
||||
}
|
||||
if len(sysRows) == 0 {
|
||||
return nil, nil
|
||||
}
|
||||
// Reverse to chronological order
|
||||
for i, j := 0, len(sysRows)-1; i < j; i, j = i+1, j-1 {
|
||||
sysRows[i], sysRows[j] = sysRows[j], sysRows[i]
|
||||
}
|
||||
|
||||
// Collect min/max ts for range query
|
||||
minTS := sysRows[0].ts
|
||||
maxTS := sysRows[len(sysRows)-1].ts
|
||||
|
||||
// Load GPU rows in range
|
||||
type gpuKey struct{ ts int64; idx int }
|
||||
gpuData := map[gpuKey]platform.GPUMetricRow{}
|
||||
gRows, err := m.db.Query(
|
||||
`SELECT ts,gpu_index,temp_c,usage_pct,mem_usage_pct,power_w FROM gpu_metrics WHERE ts>=? AND ts<=? ORDER BY ts,gpu_index`,
|
||||
minTS, maxTS,
|
||||
)
|
||||
if err == nil {
|
||||
defer gRows.Close()
|
||||
for gRows.Next() {
|
||||
var ts int64
|
||||
var g platform.GPUMetricRow
|
||||
if err := gRows.Scan(&ts, &g.GPUIndex, &g.TempC, &g.UsagePct, &g.MemUsagePct, &g.PowerW); err == nil {
|
||||
gpuData[gpuKey{ts, g.GPUIndex}] = g
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Load fan rows in range
|
||||
type fanKey struct{ ts int64; name string }
|
||||
fanData := map[fanKey]float64{}
|
||||
fRows, err := m.db.Query(
|
||||
`SELECT ts,name,rpm FROM fan_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS,
|
||||
)
|
||||
if err == nil {
|
||||
defer fRows.Close()
|
||||
for fRows.Next() {
|
||||
var ts int64
|
||||
var name string
|
||||
var rpm float64
|
||||
if err := fRows.Scan(&ts, &name, &rpm); err == nil {
|
||||
fanData[fanKey{ts, name}] = rpm
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Load temp rows in range
|
||||
type tempKey struct{ ts int64; name string }
|
||||
tempData := map[tempKey]platform.TempReading{}
|
||||
tRows, err := m.db.Query(
|
||||
`SELECT ts,name,grp,celsius FROM temp_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS,
|
||||
)
|
||||
if err == nil {
|
||||
defer tRows.Close()
|
||||
for tRows.Next() {
|
||||
var ts int64
|
||||
var t platform.TempReading
|
||||
if err := tRows.Scan(&ts, &t.Name, &t.Group, &t.Celsius); err == nil {
|
||||
tempData[tempKey{ts, t.Name}] = t
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Collect unique GPU indices, fan names, and temp names from loaded data.
// Note: Go map iteration order is unspecified, so these slices are not in a stable order.
|
||||
seenGPU := map[int]bool{}
|
||||
var gpuIndices []int
|
||||
for k := range gpuData {
|
||||
if !seenGPU[k.idx] {
|
||||
seenGPU[k.idx] = true
|
||||
gpuIndices = append(gpuIndices, k.idx)
|
||||
}
|
||||
}
|
||||
seenFan := map[string]bool{}
|
||||
var fanNames []string
|
||||
for k := range fanData {
|
||||
if !seenFan[k.name] {
|
||||
seenFan[k.name] = true
|
||||
fanNames = append(fanNames, k.name)
|
||||
}
|
||||
}
|
||||
seenTemp := map[string]bool{}
|
||||
var tempNames []string
|
||||
for k := range tempData {
|
||||
if !seenTemp[k.name] {
|
||||
seenTemp[k.name] = true
|
||||
tempNames = append(tempNames, k.name)
|
||||
}
|
||||
}
|
||||
|
||||
samples := make([]platform.LiveMetricSample, len(sysRows))
|
||||
for i, r := range sysRows {
|
||||
s := platform.LiveMetricSample{
|
||||
Timestamp: time.Unix(r.ts, 0).UTC(),
|
||||
CPULoadPct: r.cpu,
|
||||
MemLoadPct: r.mem,
|
||||
PowerW: r.pwr,
|
||||
}
|
||||
for _, idx := range gpuIndices {
|
||||
if g, ok := gpuData[gpuKey{r.ts, idx}]; ok {
|
||||
s.GPUs = append(s.GPUs, g)
|
||||
}
|
||||
}
|
||||
for _, name := range fanNames {
|
||||
if rpm, ok := fanData[fanKey{r.ts, name}]; ok {
|
||||
s.Fans = append(s.Fans, platform.FanReading{Name: name, RPM: rpm})
|
||||
}
|
||||
}
|
||||
for _, name := range tempNames {
|
||||
if t, ok := tempData[tempKey{r.ts, name}]; ok {
|
||||
s.Temps = append(s.Temps, t)
|
||||
}
|
||||
}
|
||||
samples[i] = s
|
||||
}
|
||||
return samples, nil
|
||||
}
|
||||
|
||||
// ExportCSV writes sys and GPU metrics as CSV to w: one row per (timestamp, GPU),
// with empty GPU columns when a timestamp has no GPU rows.
|
||||
func (m *MetricsDB) ExportCSV(w io.Writer) error {
|
||||
rows, err := m.db.Query(`
|
||||
SELECT s.ts, s.cpu_load_pct, s.mem_load_pct, s.power_w,
|
||||
g.gpu_index, g.temp_c, g.usage_pct, g.mem_usage_pct, g.power_w
|
||||
FROM sys_metrics s
|
||||
LEFT JOIN gpu_metrics g ON g.ts = s.ts
|
||||
ORDER BY s.ts, g.gpu_index
|
||||
`)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer rows.Close()
|
||||
|
||||
cw := csv.NewWriter(w)
|
||||
_ = cw.Write([]string{"ts", "cpu_load_pct", "mem_load_pct", "sys_power_w", "gpu_index", "gpu_temp_c", "gpu_usage_pct", "gpu_mem_pct", "gpu_power_w"})
|
||||
for rows.Next() {
|
||||
var ts int64
|
||||
var cpu, mem, pwr float64
|
||||
var gpuIdx sql.NullInt64
|
||||
var gpuTemp, gpuUse, gpuMem, gpuPow sql.NullFloat64
|
||||
if err := rows.Scan(&ts, &cpu, &mem, &pwr, &gpuIdx, &gpuTemp, &gpuUse, &gpuMem, &gpuPow); err != nil {
|
||||
continue
|
||||
}
|
||||
row := []string{
|
||||
strconv.FormatInt(ts, 10),
|
||||
strconv.FormatFloat(cpu, 'f', 2, 64),
|
||||
strconv.FormatFloat(mem, 'f', 2, 64),
|
||||
strconv.FormatFloat(pwr, 'f', 1, 64),
|
||||
}
|
||||
if gpuIdx.Valid {
|
||||
row = append(row,
|
||||
strconv.FormatInt(gpuIdx.Int64, 10),
|
||||
strconv.FormatFloat(gpuTemp.Float64, 'f', 1, 64),
|
||||
strconv.FormatFloat(gpuUse.Float64, 'f', 1, 64),
|
||||
strconv.FormatFloat(gpuMem.Float64, 'f', 1, 64),
|
||||
strconv.FormatFloat(gpuPow.Float64, 'f', 1, 64),
|
||||
)
|
||||
} else {
|
||||
row = append(row, "", "", "", "", "")
|
||||
}
|
||||
_ = cw.Write(row)
|
||||
}
|
||||
cw.Flush()
|
||||
return cw.Error()
|
||||
}
|
||||
|
||||
// Close closes the database.
func (m *MetricsDB) Close() { _ = m.db.Close() }

func nullFloat(v float64) sql.NullFloat64 {
	return sql.NullFloat64{Float64: v, Valid: true}
}
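The `LEFT JOIN` in `ExportCSV` yields NULL GPU columns for timestamps without a GPU row, and those become empty CSV cells. The following is a minimal sketch of that mapping, not the repository's code: `gpuSample`, `cells`, and `renderDemo` are hypothetical names, the column set is reduced to three, and the sample values are made up.

```go
package main

import (
	"bytes"
	"database/sql"
	"encoding/csv"
	"strconv"
)

// gpuSample is a hypothetical, reduced row shape used only for this sketch.
type gpuSample struct {
	TS    int64
	Index sql.NullInt64 // invalid when the LEFT JOIN found no GPU row
	TempC sql.NullFloat64
}

// cells converts one sample to CSV fields, leaving the GPU columns empty
// when the join produced NULLs.
func (s gpuSample) cells() []string {
	row := []string{strconv.FormatInt(s.TS, 10)}
	if !s.Index.Valid {
		return append(row, "", "")
	}
	return append(row,
		strconv.FormatInt(s.Index.Int64, 10),
		strconv.FormatFloat(s.TempC.Float64, 'f', 1, 64),
	)
}

// renderDemo writes a header plus two made-up samples and returns the CSV text.
func renderDemo() (string, error) {
	var buf bytes.Buffer
	cw := csv.NewWriter(&buf)
	samples := []gpuSample{
		{TS: 1, Index: sql.NullInt64{Int64: 0, Valid: true}, TempC: sql.NullFloat64{Float64: 50, Valid: true}},
		{TS: 2}, // no GPU row for this timestamp
	}
	if err := cw.Write([]string{"ts", "gpu_index", "gpu_temp_c"}); err != nil {
		return "", err
	}
	for _, s := range samples {
		if err := cw.Write(s.cells()); err != nil {
			return "", err
		}
	}
	cw.Flush()
	return buf.String(), cw.Error()
}
```

Keeping the column count constant for every row, with empty strings for missing GPU data, lets CSV consumers parse the file with a fixed header.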
1257
audit/internal/webui/pages.go
Normal file
File diff suppressed because it is too large
1270
audit/internal/webui/server.go
Normal file
File diff suppressed because it is too large
270
audit/internal/webui/server_test.go
Normal file
@@ -0,0 +1,270 @@
package webui

import (
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"strings"
	"testing"
	"time"

	"bee/audit/internal/platform"
)

func TestChartLegendNumber(t *testing.T) {
	tests := []struct {
		in   float64
		want string
	}{
		{in: 0.4, want: "0"},
		{in: 61.5, want: "62"},
		{in: 999.4, want: "999"},
		{in: 1200, want: "1,2k"},
		{in: 1250, want: "1,25k"},
		{in: 1310, want: "1,31k"},
		{in: 1500, want: "1,5k"},
		{in: 2600, want: "2,6k"},
		{in: 10200, want: "10k"},
	}
	for _, tc := range tests {
		if got := chartLegendNumber(tc.in); got != tc.want {
			t.Fatalf("chartLegendNumber(%v)=%q want %q", tc.in, got, tc.want)
		}
	}
}

func TestChartDataFromSamplesUsesFullHistory(t *testing.T) {
	samples := []platform.LiveMetricSample{
		{
			Timestamp:  time.Now().Add(-3 * time.Minute),
			CPULoadPct: 10,
			MemLoadPct: 20,
			PowerW:     300,
			GPUs: []platform.GPUMetricRow{
				{GPUIndex: 0, UsagePct: 90, MemUsagePct: 5, PowerW: 120, TempC: 50},
			},
		},
		{
			Timestamp:  time.Now().Add(-2 * time.Minute),
			CPULoadPct: 30,
			MemLoadPct: 40,
			PowerW:     320,
			GPUs: []platform.GPUMetricRow{
				{GPUIndex: 0, UsagePct: 95, MemUsagePct: 7, PowerW: 125, TempC: 51},
			},
		},
		{
			Timestamp:  time.Now().Add(-1 * time.Minute),
			CPULoadPct: 50,
			MemLoadPct: 60,
			PowerW:     340,
			GPUs: []platform.GPUMetricRow{
				{GPUIndex: 0, UsagePct: 97, MemUsagePct: 9, PowerW: 130, TempC: 52},
			},
		},
	}

	datasets, names, labels, title, _, _, ok := chartDataFromSamples("gpu-all-power", samples)
	if !ok {
		t.Fatal("chartDataFromSamples returned ok=false")
	}
	if title != "GPU Power" {
		t.Fatalf("title=%q", title)
	}
	if len(names) != 1 || names[0] != "GPU 0" {
		t.Fatalf("names=%v", names)
	}
	if len(labels) != len(samples) {
		t.Fatalf("labels len=%d want %d", len(labels), len(samples))
	}
	if len(datasets) != 1 || len(datasets[0]) != len(samples) {
		t.Fatalf("datasets shape=%v", datasets)
	}
	if got := datasets[0][0]; got != 120 {
		t.Fatalf("datasets[0][0]=%v want 120", got)
	}
	if got := datasets[0][2]; got != 130 {
		t.Fatalf("datasets[0][2]=%v want 130", got)
	}
}

func TestRootRendersDashboard(t *testing.T) {
	dir := t.TempDir()
	path := filepath.Join(dir, "audit.json")
	exportDir := filepath.Join(dir, "export")
	if err := os.MkdirAll(exportDir, 0755); err != nil {
		t.Fatal(err)
	}
	if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
		t.Fatal(err)
	}

	handler := NewHandler(HandlerOptions{
		Title:     "Bee Hardware Audit",
		AuditPath: path,
		ExportDir: exportDir,
	})

	first := httptest.NewRecorder()
	handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/", nil))
	if first.Code != http.StatusOK {
		t.Fatalf("first status=%d", first.Code)
	}
	// Dashboard should contain the audit nav link and hardware summary
	if !strings.Contains(first.Body.String(), `href="/audit"`) {
		t.Fatalf("first body missing audit nav link: %s", first.Body.String())
	}
	if !strings.Contains(first.Body.String(), `/viewer`) {
		t.Fatalf("first body missing viewer link: %s", first.Body.String())
	}
	if got := first.Header().Get("Cache-Control"); got != "no-store" {
		t.Fatalf("first cache-control=%q", got)
	}

	if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
		t.Fatal(err)
	}

	second := httptest.NewRecorder()
	handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/", nil))
	if second.Code != http.StatusOK {
		t.Fatalf("second status=%d", second.Code)
	}
	if !strings.Contains(second.Body.String(), `Hardware Summary`) {
		t.Fatalf("second body missing hardware summary: %s", second.Body.String())
	}
}

func TestAuditPageRendersViewerFrameAndActions(t *testing.T) {
	dir := t.TempDir()
	path := filepath.Join(dir, "audit.json")
	if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z"}`), 0644); err != nil {
		t.Fatal(err)
	}

	handler := NewHandler(HandlerOptions{AuditPath: path})
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit", nil))
	if rec.Code != http.StatusOK {
		t.Fatalf("status=%d", rec.Code)
	}
	body := rec.Body.String()
	if !strings.Contains(body, `iframe class="viewer-frame" src="/viewer"`) {
		t.Fatalf("audit page missing viewer frame: %s", body)
	}
	if !strings.Contains(body, `openAuditModal()`) {
		t.Fatalf("audit page missing action modal trigger: %s", body)
	}
}

func TestViewerRendersLatestSnapshot(t *testing.T) {
	dir := t.TempDir()
	path := filepath.Join(dir, "audit.json")
	if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
		t.Fatal(err)
	}

	handler := NewHandler(HandlerOptions{AuditPath: path})
	first := httptest.NewRecorder()
	handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/viewer", nil))
	if first.Code != http.StatusOK {
		t.Fatalf("first status=%d", first.Code)
	}
	if !strings.Contains(first.Body.String(), "SERIAL-OLD") {
		t.Fatalf("viewer body missing old serial: %s", first.Body.String())
	}

	if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
		t.Fatal(err)
	}

	second := httptest.NewRecorder()
	handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/viewer", nil))
	if second.Code != http.StatusOK {
		t.Fatalf("second status=%d", second.Code)
	}
	if !strings.Contains(second.Body.String(), "SERIAL-NEW") {
		t.Fatalf("viewer body missing new serial: %s", second.Body.String())
	}
	if strings.Contains(second.Body.String(), "SERIAL-OLD") {
		t.Fatalf("viewer body still contains old serial: %s", second.Body.String())
	}
}

func TestAuditJSONServesLatestSnapshot(t *testing.T) {
	dir := t.TempDir()
	path := filepath.Join(dir, "audit.json")
	body := `{"hardware":{"board":{"serial_number":"SERIAL-API"}}}`
	if err := os.WriteFile(path, []byte(body), 0644); err != nil {
		t.Fatal(err)
	}

	handler := NewHandler(HandlerOptions{AuditPath: path})
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
	if rec.Code != http.StatusOK {
		t.Fatalf("status=%d", rec.Code)
	}
	if !strings.Contains(rec.Body.String(), "SERIAL-API") {
		t.Fatalf("body missing expected serial: %s", rec.Body.String())
	}
	if got := rec.Header().Get("Content-Type"); !strings.Contains(got, "application/json") {
		t.Fatalf("content-type=%q", got)
	}
}

func TestMissingAuditJSONReturnsNotFound(t *testing.T) {
	handler := NewHandler(HandlerOptions{AuditPath: "/missing/audit.json"})
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
	if rec.Code != http.StatusNotFound {
		t.Fatalf("status=%d want %d", rec.Code, http.StatusNotFound)
	}
}

func TestSupportBundleEndpointReturnsArchive(t *testing.T) {
	dir := t.TempDir()
	exportDir := filepath.Join(dir, "export")
	if err := os.MkdirAll(exportDir, 0755); err != nil {
		t.Fatal(err)
	}
	if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil {
		t.Fatal(err)
	}

	handler := NewHandler(HandlerOptions{ExportDir: exportDir})
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/export/support.tar.gz", nil))
	if rec.Code != http.StatusOK {
		t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
	}
	if got := rec.Header().Get("Content-Disposition"); !strings.Contains(got, "attachment;") {
		t.Fatalf("content-disposition=%q", got)
	}
	if rec.Body.Len() == 0 {
		t.Fatal("empty archive body")
	}
}

func TestRuntimeHealthEndpointReturnsJSON(t *testing.T) {
	dir := t.TempDir()
	exportDir := filepath.Join(dir, "export")
	if err := os.MkdirAll(exportDir, 0755); err != nil {
		t.Fatal(err)
	}
	body := `{"status":"PARTIAL","checked_at":"2026-03-16T10:00:00Z"}`
	if err := os.WriteFile(filepath.Join(exportDir, "runtime-health.json"), []byte(body), 0644); err != nil {
		t.Fatal(err)
	}

	handler := NewHandler(HandlerOptions{ExportDir: exportDir})
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/runtime-health.json", nil))
	if rec.Code != http.StatusOK {
		t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
	}
	if strings.TrimSpace(rec.Body.String()) != body {
		t.Fatalf("body=%q want %q", strings.TrimSpace(rec.Body.String()), body)
	}
}
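The table in `TestChartLegendNumber` pins down the expected legend format, but the implementation of `chartLegendNumber` is in a file whose diff is suppressed. As an illustration only, here is one hypothetical formatter that satisfies every case in that table. The name `legendNumberSketch` and the rounding rules are assumptions inferred from the test cases and are not taken from the repository: values below 1000 are rounded to integers, values below 10k get up to two decimals with a comma separator, and larger values are rounded to whole thousands.

```go
package main

import (
	"math"
	"strconv"
	"strings"
)

// legendNumberSketch is a hypothetical formatter inferred from the test table;
// the real chartLegendNumber may be implemented differently.
func legendNumberSketch(v float64) string {
	if v < 1000 {
		// Plain integer, rounded half away from zero (61.5 becomes 62).
		return strconv.FormatFloat(math.Round(v), 'f', 0, 64)
	}
	k := v / 1000
	if k >= 10 {
		// From 10k upwards the test table expects no decimals (10200 becomes "10k").
		return strconv.FormatFloat(math.Round(k), 'f', 0, 64) + "k"
	}
	// Up to two decimals, trailing zeros trimmed, comma as decimal separator.
	s := strconv.FormatFloat(k, 'f', 2, 64)
	s = strings.TrimRight(s, "0")
	s = strings.TrimSuffix(s, ".")
	return strings.Replace(s, ".", ",", 1) + "k"
}
```

Inputs outside the tested range (negative values, or values just below 1000 that round up to 1000) are not covered by the test table, so this sketch makes no claim about them.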
648
audit/internal/webui/tasks.go
Normal file
@@ -0,0 +1,648 @@
package webui

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"sort"
	"sync"
	"time"

	"bee/audit/internal/app"
)

// Task statuses.
const (
	TaskPending   = "pending"
	TaskRunning   = "running"
	TaskDone      = "done"
	TaskFailed    = "failed"
	TaskCancelled = "cancelled"
)

// taskNames maps target → human-readable name.
var taskNames = map[string]string{
	"nvidia":         "NVIDIA SAT",
	"memory":         "Memory SAT",
	"storage":        "Storage SAT",
	"cpu":            "CPU SAT",
	"amd":            "AMD GPU SAT",
	"amd-mem":        "AMD GPU MEM Integrity",
	"amd-bandwidth":  "AMD GPU MEM Bandwidth",
	"amd-stress":     "AMD GPU Burn-in",
	"memory-stress":  "Memory Burn-in",
	"sat-stress":     "SAT Stress (stressapptest)",
	"audit":          "Audit",
	"install":        "Install to Disk",
	"install-to-ram": "Install to RAM",
}

// Task represents one unit of work in the queue.
type Task struct {
	ID        string     `json:"id"`
	Name      string     `json:"name"`
	Target    string     `json:"target"`
	Priority  int        `json:"priority"`
	Status    string     `json:"status"`
	CreatedAt time.Time  `json:"created_at"`
	StartedAt *time.Time `json:"started_at,omitempty"`
	DoneAt    *time.Time `json:"done_at,omitempty"`
	ErrMsg    string     `json:"error,omitempty"`
	LogPath   string     `json:"log_path,omitempty"`

	// runtime fields (not serialised)
	job    *jobState
	params taskParams
}

// taskParams holds optional parameters parsed from the run request.
type taskParams struct {
	Duration    int    `json:"duration,omitempty"`
	DiagLevel   int    `json:"diag_level,omitempty"`
	GPUIndices  []int  `json:"gpu_indices,omitempty"`
	BurnProfile string `json:"burn_profile,omitempty"`
	DisplayName string `json:"display_name,omitempty"`
	Device      string `json:"device,omitempty"` // for install
}

type persistedTask struct {
	ID        string     `json:"id"`
	Name      string     `json:"name"`
	Target    string     `json:"target"`
	Priority  int        `json:"priority"`
	Status    string     `json:"status"`
	CreatedAt time.Time  `json:"created_at"`
	StartedAt *time.Time `json:"started_at,omitempty"`
	DoneAt    *time.Time `json:"done_at,omitempty"`
	ErrMsg    string     `json:"error,omitempty"`
	LogPath   string     `json:"log_path,omitempty"`
	Params    taskParams `json:"params,omitempty"`
}

type burnPreset struct {
	NvidiaDiag  int
	DurationSec int
}

func resolveBurnPreset(profile string) burnPreset {
	switch profile {
	case "overnight":
		return burnPreset{NvidiaDiag: 4, DurationSec: 8 * 60 * 60}
	case "acceptance":
		return burnPreset{NvidiaDiag: 3, DurationSec: 60 * 60}
	default:
		return burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60}
	}
}

// taskQueue manages a priority-ordered list of tasks and runs them one at a time.
type taskQueue struct {
	mu        sync.Mutex
	tasks     []*Task
	trigger   chan struct{}
	opts      *HandlerOptions // set by startWorker
	statePath string
	logsDir   string
	started   bool
}

var globalQueue = &taskQueue{trigger: make(chan struct{}, 1)}

const maxTaskHistory = 50

var (
	runMemoryAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
		return a.RunMemoryAcceptancePackCtx(ctx, baseDir, logFunc)
	}
	runStorageAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
		return a.RunStorageAcceptancePackCtx(ctx, baseDir, logFunc)
	}
	runCPUAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
		return a.RunCPUAcceptancePackCtx(ctx, baseDir, durationSec, logFunc)
	}
	runAMDAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
		return a.RunAMDAcceptancePackCtx(ctx, baseDir, logFunc)
	}
	runAMDMemIntegrityPackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
		return a.RunAMDMemIntegrityPackCtx(ctx, baseDir, logFunc)
	}
	runAMDMemBandwidthPackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
		return a.RunAMDMemBandwidthPackCtx(ctx, baseDir, logFunc)
	}
	runAMDStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
		return a.RunAMDStressPackCtx(ctx, baseDir, durationSec, logFunc)
	}
	runMemoryStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
		return a.RunMemoryStressPackCtx(ctx, baseDir, durationSec, logFunc)
	}
	runSATStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
		return a.RunSATStressPackCtx(ctx, baseDir, durationSec, logFunc)
	}
)

// enqueue adds a task to the queue and notifies the worker.
func (q *taskQueue) enqueue(t *Task) {
	q.mu.Lock()
	q.assignTaskLogPathLocked(t)
	q.tasks = append(q.tasks, t)
	q.prune()
	q.persistLocked()
	q.mu.Unlock()
	select {
	case q.trigger <- struct{}{}:
	default:
	}
}

// prune removes oldest completed tasks beyond maxTaskHistory.
func (q *taskQueue) prune() {
	var done []*Task
	var active []*Task
	for _, t := range q.tasks {
		switch t.Status {
		case TaskDone, TaskFailed, TaskCancelled:
			done = append(done, t)
		default:
			active = append(active, t)
		}
	}
	if len(done) > maxTaskHistory {
		done = done[len(done)-maxTaskHistory:]
	}
	q.tasks = append(active, done...)
}

// nextPending returns the highest-priority pending task (nil if none).
func (q *taskQueue) nextPending() *Task {
	var best *Task
	for _, t := range q.tasks {
		if t.Status != TaskPending {
			continue
		}
		if best == nil || t.Priority > best.Priority ||
			(t.Priority == best.Priority && t.CreatedAt.Before(best.CreatedAt)) {
			best = t
		}
	}
	return best
}

// findByID looks up a task by ID.
func (q *taskQueue) findByID(id string) (*Task, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for _, t := range q.tasks {
		if t.ID == id {
			return t, true
		}
	}
	return nil, false
}

// findJob returns the jobState for a task ID (for SSE streaming compatibility).
func (q *taskQueue) findJob(id string) (*jobState, bool) {
	t, ok := q.findByID(id)
	if !ok || t.job == nil {
		return nil, false
	}
	return t.job, true
}

func (q *taskQueue) hasActiveTarget(target string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	for _, t := range q.tasks {
		if t.Target != target {
			continue
		}
		if t.Status == TaskPending || t.Status == TaskRunning {
			return true
		}
	}
	return false
}

// snapshot returns a copy of all tasks sorted for display: running first, then
// pending, then finished; within each group by priority (highest first), then
// by creation time (oldest first).
func (q *taskQueue) snapshot() []Task {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := make([]Task, len(q.tasks))
	for i, t := range q.tasks {
		out[i] = *t
	}
	sort.SliceStable(out, func(i, j int) bool {
		si := statusOrder(out[i].Status)
		sj := statusOrder(out[j].Status)
		if si != sj {
			return si < sj
		}
		if out[i].Priority != out[j].Priority {
			return out[i].Priority > out[j].Priority
		}
		return out[i].CreatedAt.Before(out[j].CreatedAt)
	})
	return out
}

func statusOrder(s string) int {
	switch s {
	case TaskRunning:
		return 0
	case TaskPending:
		return 1
	default:
		return 2
	}
}

// startWorker launches the queue runner goroutine.
func (q *taskQueue) startWorker(opts *HandlerOptions) {
	q.mu.Lock()
	q.opts = opts
	q.statePath = filepath.Join(opts.ExportDir, "tasks-state.json")
	q.logsDir = filepath.Join(opts.ExportDir, "tasks")
	_ = os.MkdirAll(q.logsDir, 0755)
	if !q.started {
		q.loadLocked()
		q.started = true
		go q.worker()
	}
	hasPending := q.nextPending() != nil
	q.mu.Unlock()
	if hasPending {
		select {
		case q.trigger <- struct{}{}:
		default:
		}
	}
}

func (q *taskQueue) worker() {
	for {
		<-q.trigger
		setCPUGovernor("performance")
		for {
			q.mu.Lock()
			t := q.nextPending()
			if t == nil {
				q.mu.Unlock()
				break
			}
			now := time.Now()
			t.Status = TaskRunning
			t.StartedAt = &now
			t.DoneAt = nil
			t.ErrMsg = ""
			j := newTaskJobState(t.LogPath)
			ctx, cancel := context.WithCancel(context.Background())
			j.cancel = cancel
			t.job = j
			q.persistLocked()
			q.mu.Unlock()

			q.runTask(t, j, ctx)

			q.mu.Lock()
			now2 := time.Now()
			t.DoneAt = &now2
			if t.Status == TaskRunning { // not cancelled externally
				if j.err != "" {
					t.Status = TaskFailed
					t.ErrMsg = j.err
				} else {
					t.Status = TaskDone
				}
			}
			q.prune()
			q.persistLocked()
			q.mu.Unlock()
		}
		setCPUGovernor("powersave")
	}
}

// setCPUGovernor writes the given governor to all CPU scaling_governor sysfs files.
// Silently ignores errors (e.g. when cpufreq is not available).
func setCPUGovernor(governor string) {
	matches, err := filepath.Glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")
	if err != nil || len(matches) == 0 {
		return
	}
	for _, path := range matches {
		_ = os.WriteFile(path, []byte(governor), 0644)
	}
}

// runTask executes the work for a task, writing output to j.
func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
	if q.opts == nil || q.opts.App == nil {
		j.append("ERROR: app not configured")
		j.finish("app not configured")
		return
	}
	a := q.opts.App

	// Lines already present were loaded from an existing log, which means the
	// task is being re-run after a restart. Check before appending anything,
	// otherwise the condition would always be true.
	recovered := len(j.lines) > 0
	j.append(fmt.Sprintf("Starting %s...", t.Name))
	if recovered {
		j.append(fmt.Sprintf("Recovered after bee-web restart at %s", time.Now().UTC().Format(time.RFC3339)))
	}

	var (
		archive string
		err     error
	)

	switch t.Target {
	case "nvidia":
		diagLevel := t.params.DiagLevel
		if t.params.BurnProfile != "" && diagLevel <= 0 {
			diagLevel = resolveBurnPreset(t.params.BurnProfile).NvidiaDiag
		}
		if len(t.params.GPUIndices) > 0 || diagLevel > 0 {
			result, e := a.RunNvidiaAcceptancePackWithOptions(
				ctx, "", diagLevel, t.params.GPUIndices, j.append,
			)
			if e != nil {
				err = e
			} else {
				archive = result.Body
			}
		} else {
			archive, err = a.RunNvidiaAcceptancePack("", j.append)
		}
	case "memory":
		archive, err = runMemoryAcceptancePackCtx(a, ctx, "", j.append)
	case "storage":
		archive, err = runStorageAcceptancePackCtx(a, ctx, "", j.append)
	case "cpu":
		dur := t.params.Duration
		if t.params.BurnProfile != "" && dur <= 0 {
			dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
		}
		if dur <= 0 {
			dur = 60
		}
		archive, err = runCPUAcceptancePackCtx(a, ctx, "", dur, j.append)
	case "amd":
		archive, err = runAMDAcceptancePackCtx(a, ctx, "", j.append)
	case "amd-mem":
		archive, err = runAMDMemIntegrityPackCtx(a, ctx, "", j.append)
	case "amd-bandwidth":
		archive, err = runAMDMemBandwidthPackCtx(a, ctx, "", j.append)
	case "amd-stress":
		dur := t.params.Duration
		if t.params.BurnProfile != "" && dur <= 0 {
			dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
		}
		archive, err = runAMDStressPackCtx(a, ctx, "", dur, j.append)
	case "memory-stress":
		dur := t.params.Duration
		if t.params.BurnProfile != "" && dur <= 0 {
			dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
		}
		archive, err = runMemoryStressPackCtx(a, ctx, "", dur, j.append)
	case "sat-stress":
		dur := t.params.Duration
		if t.params.BurnProfile != "" && dur <= 0 {
			dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
		}
		archive, err = runSATStressPackCtx(a, ctx, "", dur, j.append)
	case "audit":
		result, e := a.RunAuditNow(q.opts.RuntimeMode)
		if e != nil {
			err = e
		} else {
			for _, line := range splitLines(result.Body) {
				j.append(line)
			}
		}
	case "install-to-ram":
		err = a.RunInstallToRAM(ctx, j.append)
	default:
		j.append("ERROR: unknown target: " + t.Target)
		j.finish("unknown target")
		return
	}

	if err != nil {
		if ctx.Err() != nil {
			j.append("Aborted.")
			j.finish("aborted")
		} else {
			j.append("ERROR: " + err.Error())
			j.finish(err.Error())
		}
		return
	}
	if archive != "" {
		j.append("Archive: " + archive)
	}
	j.finish("")
}

func splitLines(s string) []string {
	var out []string
	for _, l := range splitNL(s) {
		if l != "" {
			out = append(out, l)
		}
	}
	return out
}

func splitNL(s string) []string {
	var out []string
	start := 0
	for i, c := range s {
		if c == '\n' {
			out = append(out, s[start:i])
			start = i + 1
		}
	}
	out = append(out, s[start:])
	return out
}

// ── HTTP handlers ─────────────────────────────────────────────────────────────

func (h *handler) handleAPITasksList(w http.ResponseWriter, _ *http.Request) {
	tasks := globalQueue.snapshot()
	writeJSON(w, tasks)
}

func (h *handler) handleAPITasksCancel(w http.ResponseWriter, r *http.Request) {
	id := r.PathValue("id")
	t, ok := globalQueue.findByID(id)
	if !ok {
		writeError(w, http.StatusNotFound, "task not found")
		return
	}
	globalQueue.mu.Lock()
	defer globalQueue.mu.Unlock()
	switch t.Status {
	case TaskPending:
		t.Status = TaskCancelled
		now := time.Now()
		t.DoneAt = &now
		globalQueue.persistLocked()
		writeJSON(w, map[string]string{"status": "cancelled"})
	case TaskRunning:
		if t.job != nil {
			t.job.abort()
		}
		t.Status = TaskCancelled
		now := time.Now()
		t.DoneAt = &now
		globalQueue.persistLocked()
		writeJSON(w, map[string]string{"status": "cancelled"})
	default:
		writeError(w, http.StatusConflict, "task is not running or pending")
	}
}

func (h *handler) handleAPITasksPriority(w http.ResponseWriter, r *http.Request) {
	id := r.PathValue("id")
	t, ok := globalQueue.findByID(id)
	if !ok {
		writeError(w, http.StatusNotFound, "task not found")
		return
	}
	var req struct {
		Delta int `json:"delta"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeError(w, http.StatusBadRequest, "invalid body")
		return
	}
	globalQueue.mu.Lock()
	defer globalQueue.mu.Unlock()
	if t.Status != TaskPending {
		writeError(w, http.StatusConflict, "only pending tasks can be reprioritised")
		return
	}
	t.Priority += req.Delta
	globalQueue.persistLocked()
	writeJSON(w, map[string]int{"priority": t.Priority})
}

func (h *handler) handleAPITasksCancelAll(w http.ResponseWriter, _ *http.Request) {
	globalQueue.mu.Lock()
	now := time.Now()
	n := 0
	for _, t := range globalQueue.tasks {
		switch t.Status {
		case TaskPending:
			t.Status = TaskCancelled
			t.DoneAt = &now
			n++
		case TaskRunning:
			if t.job != nil {
				t.job.abort()
			}
			t.Status = TaskCancelled
			t.DoneAt = &now
			n++
		}
	}
	globalQueue.persistLocked()
	globalQueue.mu.Unlock()
	writeJSON(w, map[string]int{"cancelled": n})
}

func (h *handler) handleAPITasksStream(w http.ResponseWriter, r *http.Request) {
	id := r.PathValue("id")
	// Wait up to 5s for the task to get a job (it may be pending)
	deadline := time.Now().Add(5 * time.Second)
	var j *jobState
	for time.Now().Before(deadline) {
		if jj, ok := globalQueue.findJob(id); ok {
			j = jj
			break
		}
		time.Sleep(200 * time.Millisecond)
	}
	if j == nil {
		http.Error(w, "task not found or not yet started", http.StatusNotFound)
		return
	}
	streamJob(w, r, j)
}

func (q *taskQueue) assignTaskLogPathLocked(t *Task) {
	if t.LogPath != "" || q.logsDir == "" || t.ID == "" {
		return
	}
	t.LogPath = filepath.Join(q.logsDir, t.ID+".log")
}

func (q *taskQueue) loadLocked() {
	if q.statePath == "" {
		return
	}
	data, err := os.ReadFile(q.statePath)
	if err != nil || len(data) == 0 {
		return
	}
	var persisted []persistedTask
	if err := json.Unmarshal(data, &persisted); err != nil {
		return
	}
	for _, pt := range persisted {
		t := &Task{
			ID:        pt.ID,
			Name:      pt.Name,
			Target:    pt.Target,
			Priority:  pt.Priority,
			Status:    pt.Status,
			CreatedAt: pt.CreatedAt,
			StartedAt: pt.StartedAt,
			DoneAt:    pt.DoneAt,
			ErrMsg:    pt.ErrMsg,
			LogPath:   pt.LogPath,
			params:    pt.Params,
		}
		q.assignTaskLogPathLocked(t)
		if t.Status == TaskPending || t.Status == TaskRunning {
			t.Status = TaskPending
			t.DoneAt = nil
			t.ErrMsg = ""
		}
		q.tasks = append(q.tasks, t)
	}
	q.prune()
	q.persistLocked()
}

func (q *taskQueue) persistLocked() {
	if q.statePath == "" {
		return
	}
	state := make([]persistedTask, 0, len(q.tasks))
	for _, t := range q.tasks {
		state = append(state, persistedTask{
			ID:        t.ID,
			Name:      t.Name,
			Target:    t.Target,
			Priority:  t.Priority,
			Status:    t.Status,
			CreatedAt: t.CreatedAt,
			StartedAt: t.StartedAt,
			DoneAt:    t.DoneAt,
			ErrMsg:    t.ErrMsg,
			LogPath:   t.LogPath,
			Params:    t.params,
		})
	}
	data, err := json.MarshalIndent(state, "", " ")
	if err != nil {
		return
	}
	tmp := q.statePath + ".tmp"
	if err := os.WriteFile(tmp, data, 0644); err != nil {
		return
	}
	_ = os.Rename(tmp, q.statePath)
}
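The selection rule in `nextPending` (highest priority first, ties broken by the older `CreatedAt`) can be tried in isolation. The sketch below restates that rule on a reduced type. `miniTask`, `pickNext`, and `demoPick` are hypothetical names introduced for illustration, and the task data is made up.

```go
package main

import "time"

// miniTask is a hypothetical, trimmed-down stand-in for Task.
type miniTask struct {
	ID        string
	Priority  int
	Pending   bool
	CreatedAt time.Time
}

// pickNext mirrors the nextPending rule: among pending tasks the highest
// priority wins, and equal priorities go to the task created first.
func pickNext(tasks []*miniTask) *miniTask {
	var best *miniTask
	for _, t := range tasks {
		if !t.Pending {
			continue
		}
		if best == nil || t.Priority > best.Priority ||
			(t.Priority == best.Priority && t.CreatedAt.Before(best.CreatedAt)) {
			best = t
		}
	}
	return best
}

// demoPick builds a small made-up queue and returns the ID that would run next.
func demoPick() string {
	base := time.Date(2026, 3, 15, 0, 0, 0, 0, time.UTC)
	tasks := []*miniTask{
		{ID: "a", Priority: 0, Pending: true, CreatedAt: base},
		{ID: "b", Priority: 2, Pending: true, CreatedAt: base.Add(2 * time.Minute)},
		{ID: "c", Priority: 2, Pending: true, CreatedAt: base.Add(1 * time.Minute)},
		{ID: "d", Priority: 9, Pending: false, CreatedAt: base}, // already finished, so ignored
	}
	if t := pickNext(tasks); t != nil {
		return t.ID
	}
	return ""
}
```

In this example `demoPick` selects `c`: it shares the top pending priority with `b` but was created earlier. A linear scan is enough here because the queue is small and priorities can change while tasks wait, which a heap would have to be re-fixed for.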
156
audit/internal/webui/tasks_test.go
Normal file
@@ -0,0 +1,156 @@
package webui

import (
	"context"
	"os"
	"path/filepath"
	"testing"
	"time"

	"bee/audit/internal/app"
)

func TestTaskQueuePersistsAndRecoversPendingTasks(t *testing.T) {
	dir := t.TempDir()
	q := &taskQueue{
		statePath: filepath.Join(dir, "tasks-state.json"),
		logsDir:   filepath.Join(dir, "tasks"),
		trigger:   make(chan struct{}, 1),
	}
	if err := os.MkdirAll(q.logsDir, 0755); err != nil {
		t.Fatal(err)
	}

	started := time.Now().Add(-time.Minute)
	task := &Task{
		ID:        "task-1",
		Name:      "Memory Burn-in",
		Target:    "memory-stress",
		Priority:  2,
		Status:    TaskRunning,
		CreatedAt: time.Now().Add(-2 * time.Minute),
		StartedAt: &started,
		params: taskParams{
			Duration:    300,
			BurnProfile: "smoke",
		},
	}
	q.tasks = append(q.tasks, task)
	q.assignTaskLogPathLocked(task)
	q.persistLocked()

	recovered := &taskQueue{
		statePath: q.statePath,
		logsDir:   q.logsDir,
		trigger:   make(chan struct{}, 1),
	}
	recovered.loadLocked()

	if len(recovered.tasks) != 1 {
		t.Fatalf("tasks=%d want 1", len(recovered.tasks))
	}
	got := recovered.tasks[0]
	if got.Status != TaskPending {
		t.Fatalf("status=%q want %q", got.Status, TaskPending)
	}
	if got.params.Duration != 300 || got.params.BurnProfile != "smoke" {
		t.Fatalf("params=%+v", got.params)
	}
	if got.LogPath == "" {
		t.Fatal("expected log path")
	}
}

func TestNewTaskJobStateLoadsExistingLog(t *testing.T) {
|
||||
dir := t.TempDir()
|
||||
path := filepath.Join(dir, "task.log")
|
||||
if err := os.WriteFile(path, []byte("line1\nline2\n"), 0644); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
j := newTaskJobState(path)
|
||||
existing, ch := j.subscribe()
|
||||
if ch == nil {
|
||||
t.Fatal("expected live subscription channel")
|
||||
}
|
||||
if len(existing) != 2 || existing[0] != "line1" || existing[1] != "line2" {
|
||||
t.Fatalf("existing=%v", existing)
|
||||
}
|
||||
}
|
||||
|
||||
func TestResolveBurnPreset(t *testing.T) {
|
||||
tests := []struct {
|
||||
profile string
|
||||
want burnPreset
|
||||
}{
|
||||
{profile: "smoke", want: burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60}},
|
||||
{profile: "acceptance", want: burnPreset{NvidiaDiag: 3, DurationSec: 60 * 60}},
|
||||
{profile: "overnight", want: burnPreset{NvidiaDiag: 4, DurationSec: 8 * 60 * 60}},
|
||||
{profile: "", want: burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60}},
|
||||
}
|
||||
for _, tc := range tests {
|
||||
if got := resolveBurnPreset(tc.profile); got != tc.want {
|
||||
t.Fatalf("resolveBurnPreset(%q)=%+v want %+v", tc.profile, got, tc.want)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestRunTaskHonorsCancel(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
blocked := make(chan struct{})
|
||||
released := make(chan struct{})
|
||||
aRun := func(_ any, ctx context.Context, _ string, _ int, _ func(string)) (string, error) {
|
||||
close(blocked)
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
close(released)
|
||||
return "", ctx.Err()
|
||||
case <-time.After(5 * time.Second):
|
||||
close(released)
|
||||
return "unexpected", nil
|
||||
}
|
||||
}
|
||||
|
||||
q := &taskQueue{
|
||||
opts: &HandlerOptions{App: &app.App{}},
|
||||
}
|
||||
tk := &Task{
|
||||
ID: "cpu-1",
|
||||
Name: "CPU SAT",
|
||||
Target: "cpu",
|
||||
Status: TaskRunning,
|
||||
CreatedAt: time.Now(),
|
||||
params: taskParams{Duration: 60},
|
||||
}
|
||||
j := &jobState{}
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
j.cancel = cancel
|
||||
tk.job = j
|
||||
|
||||
orig := runCPUAcceptancePackCtx
|
||||
runCPUAcceptancePackCtx = func(_ *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
|
||||
return aRun(nil, ctx, baseDir, durationSec, logFunc)
|
||||
}
|
||||
defer func() { runCPUAcceptancePackCtx = orig }()
|
||||
|
||||
done := make(chan struct{})
|
||||
go func() {
|
||||
q.runTask(tk, j, ctx)
|
||||
close(done)
|
||||
}()
|
||||
|
||||
<-blocked
|
||||
j.abort()
|
||||
|
||||
select {
|
||||
case <-released:
|
||||
case <-time.After(2 * time.Second):
|
||||
t.Fatal("task did not observe cancel")
|
||||
}
|
||||
select {
|
||||
case <-done:
|
||||
case <-time.After(2 * time.Second):
|
||||
t.Fatal("runTask did not return after cancel")
|
||||
}
|
||||
}
|
||||
@@ -9,4 +9,5 @@ Generic engineering rules live in `bible/rules/patterns/`.
|
||||
|---|---|
|
||||
| `architecture/system-overview.md` | What bee does, scope, tech stack |
|
||||
| `architecture/runtime-flows.md` | Boot sequence, audit flow, service order |
|
||||
| `docs/hardware-ingest-contract.md` | Current Reanimator hardware ingest JSON contract |
|
||||
| `decisions/` | Architectural decision log |
|
||||
|
||||
38
bible-local/architecture/charting.md
Normal file
@@ -0,0 +1,38 @@
|
||||
# Charting architecture
|
||||
|
||||
## Decision: one chart engine for all live metrics
|
||||
|
||||
**Engine:** `github.com/go-analyze/charts` (pure Go, no CGO, SVG output)
|
||||
**Theme:** `grafana` (dark background, coloured lines)
|
||||
|
||||
All live metrics charts in the web UI are server-side SVG images served by Go
|
||||
and polled by the browser every 2 seconds via `<img src="...?t=now">`.
|
||||
There is no client-side canvas or JS chart library.
|
||||
|
||||
### Why go-analyze/charts
|
||||
|
||||
- Pure Go, no CGO — builds cleanly inside the live-build container
|
||||
- SVG output — crisp at any display resolution, full-width without pixelation
|
||||
- Grafana theme matches the dark web UI colour scheme
|
||||
- Active fork of the archived wcharczuk/go-chart
|
||||
|
||||
### SAT stress-test charts
|
||||
|
||||
The `drawGPUChartSVG` function in `platform/gpu_metrics.go` is a separate
|
||||
self-contained SVG renderer used **only** for completed SAT run reports
|
||||
(HTML export, burn-in summaries). It is not used for live metrics.
|
||||
|
||||
### Live metrics chart endpoints
|
||||
|
||||
| Path | Content |
|
||||
|------|---------|
|
||||
| `GET /api/metrics/chart/server.svg` | CPU temp, CPU load %, mem load %, power W, fan RPMs |
|
||||
| `GET /api/metrics/chart/gpu/{idx}.svg` | GPU temp °C, load %, mem %, power W |
|
||||
|
||||
Charts are 1400 × 280 px SVG. The page renders them at `width: 100%` in a
|
||||
single-column layout so they always fill the viewport width.
|
||||
|
||||
### Ring buffers
|
||||
|
||||
Each metric is stored in a 120-sample ring buffer (2 minutes of history at 1 Hz).
|
||||
Buffers are per-server or per-GPU and grow dynamically as new GPUs appear.
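
A fixed-size ring buffer of this kind can be sketched as follows. This is a minimal illustration under assumptions, not the project's code: the `ring` type and its methods are hypothetical names, and only the 120-sample capacity comes from the text above.

```go
package main

import "fmt"

// ring is a fixed-capacity buffer that overwrites the oldest sample when full.
type ring struct {
	vals []float64 // backing storage; len(vals) is the capacity
	next int       // index where the next sample will be written
	n    int       // number of valid samples, at most len(vals)
}

func newRing(capacity int) *ring {
	return &ring{vals: make([]float64, capacity)}
}

// push stores v, replacing the oldest sample once the buffer is full.
func (r *ring) push(v float64) {
	r.vals[r.next] = v
	r.next = (r.next + 1) % len(r.vals)
	if r.n < len(r.vals) {
		r.n++
	}
}

// snapshot returns the stored samples in chronological order, oldest first.
func (r *ring) snapshot() []float64 {
	out := make([]float64, 0, r.n)
	start := (r.next - r.n + len(r.vals)) % len(r.vals)
	for i := 0; i < r.n; i++ {
		out = append(out, r.vals[(start+i)%len(r.vals)])
	}
	return out
}

func main() {
	r := newRing(120) // 120 samples, i.e. 2 minutes at 1 Hz
	for i := 0; i < 125; i++ {
		r.push(float64(i))
	}
	s := r.snapshot()
	fmt.Println(len(s), s[0], s[len(s)-1]) // prints: 120 5 124
}
```

Per-GPU buffers could then live in a map keyed by GPU index and be created lazily the first time an index is seen.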
|
||||
@@ -4,100 +4,113 @@
|
||||
|
||||
**The live CD runs in an isolated network segment with no internet access.**
|
||||
All binaries, kernel modules, and tools must be baked into the ISO at build time.
|
||||
No `apk add`, no downloads, no package manager calls are allowed at boot.
|
||||
No package installation, no downloads, and no package manager calls are allowed at boot.
|
||||
DHCP is used only for LAN (operator SSH access). Internet is NOT available.
|
||||
|
||||
## Boot sequence (single ISO)
|
||||
|
||||
OpenRC default runlevel, service start order:
|
||||
The live system is expected to boot with `toram`, so `live-boot` copies the full read-only medium into RAM before mounting the root filesystem. After that point, runtime must not depend on the original USB/BMC virtual media staying readable.
|
||||
|
||||
`systemd` boot order:
|
||||
|
||||
```
|
||||
localmount
|
||||
├── bee-sshsetup (creates bee user, sets password; runs before dropbear)
|
||||
│ └── dropbear (SSH on port 22 — starts without network)
|
||||
├── bee-network (udhcpc -b on all physical interfaces, non-blocking)
|
||||
│ └── bee-nvidia (insmod nvidia*.ko from /usr/local/lib/nvidia/,
|
||||
│ creates libnvidia-ml.so.1 symlinks in /usr/lib/)
|
||||
│ └── bee-audit (runs audit binary → /var/log/bee-audit.json)
|
||||
local-fs.target
|
||||
├── bee-sshsetup.service (enables SSH key auth; password fallback only if marker exists)
|
||||
│ └── ssh.service (OpenSSH on port 22 — starts without network)
|
||||
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
|
||||
├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
|
||||
│ creates /dev/nvidia* nodes)
|
||||
├── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
|
||||
│ never blocks boot on partial collector failures)
|
||||
├── bee-web.service (runs `bee web` on :80 — full interactive web UI)
|
||||
└── bee-desktop.service (startx → openbox + chromium http://localhost/)
|
||||
```
|
||||
|
||||
**Critical invariants:**
|
||||
- Dropbear MUST start without network. `bee-sshsetup` has `need localmount` only.
|
||||
- `bee-network` uses `udhcpc -b` (background) — retries indefinitely if no cable.
|
||||
- `bee-nvidia` loads modules via `insmod` with absolute paths — NOT `modprobe`.
|
||||
Reason: modloop squashfs mounts over `/lib/modules/<kver>/` at boot, making it
|
||||
read-only. The overlay's modules at that path are inaccessible. Modules are stored
|
||||
at `/usr/local/lib/nvidia/` (overlay path, always writable).
|
||||
- `bee-nvidia` creates `libnvidia-ml.so.1` symlinks in `/usr/lib/` — required because
|
||||
`nvidia-smi` is a glibc binary that looks for the soname symlink, not the versioned file.
|
||||
- `gcompat` package provides `/lib64/ld-linux-x86-64.so.2` for glibc compat on Alpine musl.
|
||||
- `bee-audit` uses `after bee-nvidia` — ensures NVIDIA enrichment succeeds.
|
||||
- `bee-audit` uses `eend 0` always — never fails boot even if audit errors.
|
||||
- The live ISO boots with `boot=live toram`. Runtime binaries must continue working even if the original boot media disappears after early boot.
|
||||
- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
|
||||
- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
|
||||
- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
|
||||
Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
|
||||
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
|
||||
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
|
||||
- `bee-web.service` binds `0.0.0.0:80` and always renders the current `/var/log/bee-audit.json` contents.
|
||||
- Audit JSON now includes a `hardware.summary` block with overall verdict and warning/failure counts.
|
||||
|
||||
## Console and login flow
|
||||
|
||||
Local-console behavior:
|
||||
|
||||
```text
|
||||
tty1
|
||||
└── live-config autologin → bee
|
||||
└── /home/bee/.profile (prints web UI URLs)
|
||||
|
||||
display :0
|
||||
└── bee-desktop.service (User=bee)
|
||||
└── startx /usr/local/bin/bee-openbox-session -- :0
|
||||
├── tint2 (taskbar)
|
||||
├── chromium http://localhost/
|
||||
└── openbox (WM)
|
||||
```
|
||||
|
||||
Rules:
|
||||
- local `tty1` lands in user `bee`, not directly in `root`
|
||||
- `bee-desktop.service` starts X11 + openbox + Chromium automatically after `bee-web.service`
|
||||
- Chromium opens `http://localhost/` — the full interactive web UI
|
||||
- SSH is independent from the desktop path
|
||||
- serial console support is enabled for VM boot debugging
|
||||
|
||||
## ISO build sequence
|
||||
|
||||
```
|
||||
build.sh [--authorized-keys /path/to/keys]
|
||||
1. compile audit binary (skip if .go files older than binary)
|
||||
2. inject authorized_keys into overlay/root/.ssh/ (or set password fallback)
|
||||
3. copy audit binary → overlay/usr/local/bin/audit
|
||||
4. copy vendor binaries from iso/vendor/ → overlay/usr/local/bin/
|
||||
(storcli64, sas2ircu, sas3ircu, mstflint, gpu_burn — each optional)
|
||||
5. build-nvidia-module.sh:
|
||||
a. apk add linux-lts-dev (always, to get current Alpine 3.21 kernel headers)
|
||||
b. detect KVER from /usr/src/linux-headers-*
|
||||
c. download NVIDIA .run installer (sha256 verified, cached in dist/)
|
||||
d. extract installer
|
||||
e. build kernel modules against linux-lts headers
|
||||
f. create libnvidia-ml.so.1 / libcuda.so.1 symlinks in cache
|
||||
g. cache in dist/nvidia-<version>-<kver>/
|
||||
6. inject NVIDIA .ko → overlay/usr/local/lib/nvidia/
|
||||
7. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
|
||||
8. inject libnvidia-ml + libcuda → overlay/usr/lib/
|
||||
9. write overlay/etc/bee-release (versions + git commit)
|
||||
10. export BEE_BUILD_INFO for motd substitution
|
||||
11. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
|
||||
kernel_* section — cached (linux-lts modloop)
|
||||
apks_* section — cached (downloaded packages)
|
||||
syslinux_* / grub_* — cached
|
||||
apkovl — always regenerated (genapkovl-bee.sh)
|
||||
final ISO — always assembled
|
||||
build-in-container.sh [--authorized-keys /path/to/keys]
|
||||
1. compile `bee` binary (skip if .go files older than binary)
|
||||
2. create a temporary overlay staging dir under `dist/`
|
||||
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
|
||||
4. copy `bee` binary → staged `/usr/local/bin/bee`
|
||||
5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
|
||||
(`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli` — optional; `mstflint` comes from the Debian package set)
|
||||
6. `build-nvidia-module.sh`:
|
||||
a. install Debian kernel headers if missing
|
||||
b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
|
||||
c. extract installer
|
||||
d. build kernel modules against Debian headers
|
||||
e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
|
||||
f. cache in `dist/nvidia-<version>-<kver>/`
|
||||
7. `build-cublas.sh`:
|
||||
a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo
|
||||
b. verify packages against repo `Packages.gz`
|
||||
c. extract headers for `bee-gpu-stress` build
|
||||
d. cache userspace libs in `dist/cublas-<version>+cuda<series>/`
|
||||
8. build `bee-gpu-stress` against extracted cuBLASLt/cudart headers
|
||||
9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
|
||||
10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
|
||||
11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/`
|
||||
12. write staged `/etc/bee-release` (versions + git commit)
|
||||
13. patch staged `motd` with build metadata
|
||||
14. copy `iso/builder/` into a temporary live-build workdir under `dist/`
|
||||
15. sync staged overlay into workdir `config/includes.chroot/`
|
||||
16. run `lb config && lb build` inside the privileged builder container
|
||||
```
|
||||
|
||||
Build host notes:
|
||||
- `build-in-container.sh` targets `linux/amd64` builder containers by default, including Docker Desktop on macOS / Apple Silicon.
|
||||
- Override with `BEE_BUILDER_PLATFORM=<os/arch>` only if you intentionally need a different container platform.
|
||||
- If the local builder image under the same tag was previously built for the wrong architecture, the script rebuilds it automatically.
|
||||
|
||||
**Critical invariants:**
|
||||
- `KERNEL_PKG_VERSION` in `iso/builder/VERSIONS` pins the exact Alpine package version
|
||||
(e.g. `6.12.76-r0`). This version is used in THREE places that MUST stay in sync:
|
||||
1. `build-nvidia-module.sh` — `apk add linux-lts-dev=${KERNEL_PKG_VERSION}` (compile headers)
|
||||
2. `mkimg.bee.sh` — `linux-lts=${KERNEL_PKG_VERSION}` in apks list (ISO kernel)
|
||||
3. `build.sh` — build-time verification that headers match pin (fails loudly if not)
|
||||
When Alpine releases a new linux-lts patch (e.g. r0 → r1), update KERNEL_PKG_VERSION
|
||||
in VERSIONS — that's the only place to change. The build will fail loudly if the pin
|
||||
doesn't match the installed headers, so stale pins are caught immediately.
|
||||
- **All three must use the same APK mirror: `dl-cdn.alpinelinux.org`.** Both
|
||||
`build-nvidia-module.sh` (apk add) and `mkimage.sh` (--repository) explicitly use
|
||||
`https://dl-cdn.alpinelinux.org/alpine/v${ALPINE_VERSION}/main|community`.
|
||||
Never use the builder's local `/etc/apk/repositories` — its mirror may serve
|
||||
a different package state, causing "unable to select package" failures.
|
||||
- `linux-lts-dev` is always installed (not conditional) — stale 6.6.x headers on the
|
||||
builder would cause modules to be built for the wrong kernel and never load at runtime.
|
||||
- NVIDIA modules go to `overlay/usr/local/lib/nvidia/` — NOT `lib/modules/<kver>/extra/`.
|
||||
- `genapkovl-bee.sh` must be copied to `/var/tmp/` (CWD when mkimage runs).
|
||||
- `TMPDIR=/var/tmp` required — tmpfs `/tmp` is only ~1GB, too small for kernel firmware.
|
||||
- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` cache dirs.
|
||||
|
||||
## gpu_burn vendor binary
|
||||
|
||||
`gpu_burn` requires CUDA nvcc to build. It is NOT built as part of the main ISO build.
|
||||
Build separately on the builder VM and place in `iso/vendor/gpu_burn`:
|
||||
|
||||
```sh
|
||||
sh iso/builder/build-gpu-burn.sh dist/
|
||||
cp dist/gpu_burn iso/vendor/gpu_burn
|
||||
cp dist/compare.ptx iso/vendor/compare.ptx
|
||||
```
|
||||
|
||||
Requires: CUDA 12.8+ (supports GCC 14, Alpine 3.21), libxml2, g++, make, git.
|
||||
The `build.sh` will include it automatically if `iso/vendor/gpu_burn` exists.
|
||||
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
|
||||
1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
|
||||
2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
|
||||
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
|
||||
- `bee-gpu-stress` must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers.
|
||||
- The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot.
|
||||
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
|
||||
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
|
||||
- Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
|
||||
- On macOS / Docker Desktop, the builder still must run as `linux/amd64` so the shipped ISO binaries remain `amd64`.
|
||||
- Operators must provision enough RAM to hold the full compressed live medium plus normal runtime overhead, because `toram` copies the entire read-only ISO payload into memory before the system reaches steady state.
|
||||
|
||||
## Post-boot smoke test
|
||||
|
||||
@@ -109,35 +122,63 @@ ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
|
||||
|
||||
Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shipping.
|
||||
|
||||
Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
|
||||
gcompat installed, services running, audit completed with NVIDIA enrichment, internet.
|
||||
Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
|
||||
systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
|
||||
|
||||
## apkovl mechanism
|
||||
Current validation state:
|
||||
- local/libvirt VM boot path is validated for `systemd`, SSH, `bee audit`, `bee-network`, and Web UI startup
|
||||
- real hardware validation is still required before treating the ISO as release-ready
|
||||
|
||||
The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine initramfs extracts
|
||||
it at boot, overlaying `/etc`, `/usr`, `/root`, `/lib` on the tmpfs root.
|
||||
## Overlay mechanism
|
||||
|
||||
`genapkovl-bee.sh` generates the tarball containing:
|
||||
- `/etc/apk/world` — package list (apk installs on first boot)
|
||||
- `/etc/runlevels/*/` — OpenRC service symlinks
|
||||
- `/etc/conf.d/dropbear` — `DROPBEAR_OPTS="-R -B"`
|
||||
- `/etc/network/interfaces` — lo only (bee-network handles DHCP)
|
||||
- `/etc/hostname`
|
||||
- Everything from `iso/overlay/` (init scripts, binaries, ssh keys, tui)
|
||||
`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
|
||||
`build.sh` prepares a staged overlay, then syncs it into a temporary workdir's
|
||||
`config/includes.chroot/` before running `lb build`.
|
||||
|
||||
## Collector flow
|
||||
|
||||
```
|
||||
audit binary start
|
||||
`bee audit` start
|
||||
1. board collector (dmidecode -t 0,1,2)
|
||||
2. cpu collector (dmidecode -t 4)
|
||||
3. memory collector (dmidecode -t 17)
|
||||
4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
|
||||
5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/)
|
||||
6. psu collector (ipmitool fru — silent if no /dev/ipmi0)
|
||||
6. psu collector (ipmitool fru + sdr — silent if no /dev/ipmi0)
|
||||
7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
|
||||
8. output JSON → /var/log/bee-audit.json
|
||||
9. QR summary to stdout (qrencode if available)
|
||||
```
|
||||
|
||||
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
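
The "tool not found is not an error" convention can be sketched like this. The helper name `toolOutput` is hypothetical and the real collectors may be structured differently; the sketch only shows how a missing binary can map to `nil, nil` while genuine failures still surface.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// toolOutput runs name with args and returns its stdout.
// A tool that is not installed is not treated as an error: it yields nil, nil.
func toolOutput(name string, args ...string) ([]byte, error) {
	if _, err := exec.LookPath(name); err != nil {
		if errors.Is(err, exec.ErrNotFound) {
			return nil, nil // tool absent: the collector silently contributes nothing
		}
		return nil, err
	}
	return exec.Command(name, args...).Output()
}

func main() {
	// The binary name below is deliberately made up so the lookup fails.
	out, err := toolOutput("bee-hypothetical-missing-tool")
	fmt.Println(out == nil, err == nil) // prints: true true
}
```

A caller would log a non-nil error and continue with the next collector rather than abort the audit.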
|
||||
|
||||
Acceptance flows:
|
||||
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + mixed-precision `bee-gpu-stress`
|
||||
- `bee sat memory` → `memtester` archive
|
||||
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
|
||||
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
|
||||
- `bee-gpu-stress` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
|
||||
- Ampere: `fp16` + `fp32`/TF32 tensor-core load
|
||||
- Ada / Hopper: add `fp8`
|
||||
- Blackwell+: add `fp4`
|
||||
- PTX fallback is only for missing cuBLASLt/userspace or unsupported narrow datatypes
|
||||
- Runtime overrides:
|
||||
- `BEE_GPU_STRESS_SECONDS`
|
||||
- `BEE_GPU_STRESS_SIZE_MB`
|
||||
- `BEE_MEMTESTER_SIZE_MB`
|
||||
- `BEE_MEMTESTER_PASSES`
|
||||
|
||||
## NVIDIA SAT Web UI flow
|
||||
|
||||
```
|
||||
Web UI: Acceptance Tests page → Run Test button
|
||||
1. POST /api/sat/nvidia/run → returns job_id
|
||||
2. GET /api/sat/stream?job_id=... (SSE) — streams stdout/stderr lines live
|
||||
3. After completion — archive written to /appdata/bee/export/bee-sat/
|
||||
summary.txt contains overall_status (OK / FAILED) and per-job status values
|
||||
```
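
Assuming `summary.txt` is a plain list of `key=value` lines (the exact file layout is not shown here), a consumer could read the status fields with a sketch like the following. `parseSummary` and the `gpu_stress_status` key are hypothetical names used only for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseSummary reads key=value lines into a map and ignores anything else.
func parseSummary(text string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), "=")
		if !ok {
			continue // not a key=value line
		}
		out[strings.TrimSpace(key)] = strings.TrimSpace(val)
	}
	return out
}

func main() {
	// Sample content is invented for the example.
	s := parseSummary("overall_status=FAILED\ngpu_stress_status=OK\n")
	fmt.Println(s["overall_status"], s["gpu_stress_status"]) // prints: FAILED OK
}
```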
|
||||
|
||||
**Critical invariants:**
|
||||
- `bee-gpu-stress` uses `exec.CommandContext` — killed on job context cancel.
|
||||
- Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed).
|
||||
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
|
||||
Hardware audit LiveCD. Boots on a server via BMC virtual media or USB.
|
||||
Collects hardware inventory at OS level (not through BMC/Redfish).
|
||||
Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
|
||||
Produces `HardwareIngestRequest` JSON compatible with the contract in `bible-local/docs/hardware-ingest-contract.md`.
|
||||
|
||||
## Why it exists
|
||||
|
||||
@@ -19,18 +19,23 @@ Fills gaps where Redfish/logpile is blind:
|
||||
## In scope
|
||||
|
||||
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
|
||||
- Unattended operation — no user interaction required
|
||||
- Machine-readable health summary derived from collector verdicts
|
||||
- Operator-triggered acceptance tests for NVIDIA, memory, and storage
|
||||
- NVIDIA SAT includes both diagnostic collection and mixed-precision GPU stress via `bee-gpu-stress`
|
||||
- `bee-gpu-stress` should exercise tensor/inference paths (`fp16`, `fp32`/TF32, `fp8`, `fp4` when supported by the GPU/userspace stack) and fall back to Driver API PTX burn only if cuBLASLt is unavailable
|
||||
- Automatic boot audit with operator-facing local console and SSH access
|
||||
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
|
||||
- SSH access (dropbear) always available for inspection and debugging
|
||||
- Interactive TUI (`bee-tui`) for network setup, service management, GPU tests
|
||||
- GPU stress testing via `gpu_burn` (vendor binary, optional)
|
||||
- SSH access (OpenSSH) always available for inspection and debugging
|
||||
- Full web UI via `bee web` on port 80: interactive control panel with live metrics, SAT tests, network config, service management, export, and tools
|
||||
- Local operator desktop: openbox + Xorg + Chromium auto-opening `http://localhost/`
|
||||
- Local `tty1` operator UX: `bee` autologin, openbox desktop auto-starts with Chromium on `http://localhost/`
|
||||
|
||||
## Network isolation — CRITICAL
|
||||
|
||||
**The live CD runs in an isolated network segment with no internet access.**
|
||||
|
||||
- All tools, drivers, and binaries MUST be pre-baked into the ISO at build time
|
||||
- No `apk add` at boot — packages are installed during ISO creation, not at runtime
|
||||
- No package installation at boot — packages are installed during ISO creation, not at runtime
|
||||
- No downloads at boot — NVIDIA modules, vendor tools, and all binaries come from the ISO overlay
|
||||
- DHCP is used only for LAN access (SSH from operator laptop); internet is NOT assumed
|
||||
- Any feature requiring network downloads cannot be added to the live CD
|
||||
@@ -43,32 +48,66 @@ Fills gaps where Redfish/logpile is blind:
|
||||
- Anything requiring persistent storage on the audited machine
|
||||
- Windows support
|
||||
- Any functionality requiring internet access at boot
|
||||
- Component lifecycle/history across multiple snapshots
|
||||
- Status transition history (`status_history`, `status_changed_at`) derived from previous exports
|
||||
- Replacement detection between two or more audit runs
|
||||
|
||||
## Contract boundary
|
||||
|
||||
- `bee` is responsible for the current hardware snapshot only.
|
||||
- `bee` should populate current component state, hardware inventory, telemetry, and `status_checked_at`.
|
||||
- Historical status transitions and component replacement logic belong to the centralized ingest/lifecycle system, not to `bee`.
|
||||
- Contract fields that have no honest local source on a generic Linux host may remain empty.
|
||||
|
||||
## Tech stack
|
||||
|
||||
| Component | Technology |
|
||||
|---|---|
|
||||
| Audit binary | Go, static, `CGO_ENABLED=0` |
|
||||
| LiveCD | Alpine Linux 3.21, linux-lts 6.12.x |
|
||||
| ISO build | Alpine mkimage + apkovl overlay (`iso/overlay/`) |
|
||||
| Init system | OpenRC |
|
||||
| SSH | Dropbear (always included) |
|
||||
| NVIDIA driver | Proprietary `.run` installer, built against linux-lts headers |
|
||||
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` (not modloop path) |
|
||||
| glibc compat | `gcompat` — required for `nvidia-smi` (glibc binary on musl Alpine) |
|
||||
| Builder VM | Alpine 3.21 |
|
||||
| Live ISO | Debian 12 (bookworm), amd64 live-build image |
|
||||
| ISO build | Debian `live-build` + overlay sync into `config/includes.chroot/` |
|
||||
| Init system | `systemd` |
|
||||
| SSH | OpenSSH server |
|
||||
| NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
|
||||
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
|
||||
| GPU stress backend | `bee-gpu-stress` + cuBLASLt/cuBLAS/cudart mixed-precision GEMM, with Driver API PTX fallback |
|
||||
| Builder | Debian 12 host/VM or Debian 12 container image |
|
||||
|
||||
## Operator UX
|
||||
|
||||
- On the live ISO, `tty1` autologins as `bee`
|
||||
- `bee-desktop.service` starts X11 + openbox + Chromium on display `:0`
|
||||
- Chromium opens `http://localhost/` — the full web UI
|
||||
- SSH remains available independently of the local console path
|
||||
- Remote operators can open `http://<ip>/` in any browser on the same LAN
|
||||
- VM-oriented builds also include `qemu-guest-agent` and serial console support for debugging
|
||||
- The ISO boots with `toram`, so loss of the original USB/BMC virtual media after boot should not break already-installed runtime binaries
|
||||
|
||||
## Runtime split
|
||||
|
||||
- The main Go application must run both on a normal Linux host and inside the live ISO
|
||||
- Live-ISO-only responsibilities stay in `iso/` integration code
|
||||
- Live ISO launches the Go CLI with `--runtime livecd`
|
||||
- Local/manual runs use `--runtime auto` or `--runtime local`
|
||||
- Live ISO targets must have enough RAM for the full compressed live medium plus runtime working set because the boot medium is copied into memory at startup
|
||||
|
||||
## Key paths
|
||||
|
||||
| Path | Purpose |
|
||||
|---|---|
|
||||
| `audit/cmd/audit/` | CLI entry point |
|
||||
| `audit/cmd/bee/` | Main CLI entry point |
|
||||
| `audit/internal/collector/` | Per-subsystem collectors |
|
||||
| `audit/internal/schema/` | HardwareIngestRequest types |
|
||||
| `iso/builder/` | ISO build scripts and mkimage profile |
|
||||
| `iso/overlay/` | Single overlay: files injected into ISO via apkovl |
|
||||
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, gpu_burn, …) |
|
||||
| `iso/builder/VERSIONS` | Pinned versions: Alpine, Go, NVIDIA driver, kernel |
|
||||
| `iso/builder/` | ISO build scripts and `live-build` profile |
|
||||
| `iso/overlay/` | Source overlay copied into a staged build overlay |
|
||||
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, arcconf, ssacli, …) |
|
||||
| `internal/chart/` | Git submodule with `reanimator/chart`, embedded into `bee web` |
|
||||
| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
|
||||
| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
|
||||
| `iso/overlay/etc/profile.d/bee.sh` | tty1 welcome message with web UI URLs |
|
||||
| `iso/overlay/home/bee/.profile` | `bee` shell profile (PATH only) |
|
||||
| `iso/overlay/etc/systemd/system/bee-desktop.service` | starts X11 + openbox + chromium |
|
||||
| `iso/overlay/usr/local/bin/bee-desktop` | startx wrapper for bee-desktop.service |
|
||||
| `iso/overlay/usr/local/bin/bee-openbox-session` | xinitrc: tint2 + chromium + openbox |
|
||||
| `dist/` | Build outputs (gitignored) |
|
||||
| `iso/out/` | Downloaded ISO files (gitignored) |
|
||||
|
||||
@@ -1,21 +1,89 @@
|
||||
# Backlog
|
||||
|
||||
## GPU stress test (H100)
|
||||
## BMC version via IPMI
|
||||
|
||||
**Task:** add a GPU burn/stress test to bee-tui without significantly increasing the ISO size.
|
||||
**Status:** implemented.
|
||||
|
||||
**Context:**
|
||||
- `gpu_burn` (wilicc/gpu-burn) is not suitable: it requires `libcublas.so` (~500MB), which would multiply the ISO size
|
||||
- `libcuda.so` is already in the ISO (from the NVIDIA .run installer)
|
||||
Add BMC firmware version collection to the board collector:
|
||||
- Command: `ipmitool mc info` → `Firmware Revision` field
|
||||
- Record it in `hardware.firmware[]` as `{device_name: "BMC", version: "..."}`
|
||||
- Show it in the TUI right-hand column next to the BIOS version
|
||||
- Skip gracefully if `/dev/ipmi0` is absent (silent: same pattern as the PSU collector)
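
Extracting the `Firmware Revision` field can be sketched as below. The function name `bmcFirmwareRevision` is hypothetical, and the sample text is an invented excerpt in the `Key : Value` layout that `ipmitool mc info` prints; real output contains more fields.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// bmcFirmwareRevision returns the value of the "Firmware Revision" line
// from `ipmitool mc info` output, and false when the line is missing.
func bmcFirmwareRevision(mcInfo string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(mcInfo))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if ok && strings.TrimSpace(key) == "Firmware Revision" {
			return strings.TrimSpace(val), true
		}
	}
	return "", false
}

func main() {
	// Invented sample, not captured from real hardware.
	sample := "Device ID                 : 32\n" +
		"Firmware Revision         : 1.23\n" +
		"IPMI Version              : 2.0\n"
	v, ok := bmcFirmwareRevision(sample)
	fmt.Println(v, ok) // prints: 1.23 true
}
```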
|
||||
|
||||
**Chosen approach:** write a minimal stress tool on the CUDA Driver API
|
||||
- Uses only `libcuda.so` (already in the ISO), so there are no new dependencies
|
||||
- Implements matrix multiplication or a memory-bandwidth load via `cuLaunchKernel`
|
||||
- The binary is ~100KB, compiled with `nvcc` on the builder VM and placed in `iso/vendor/`
|
||||
- bee-tui calls it instead of `gpu_burn`
|
||||
## CPU acceptance test via stress-ng
|
||||
|
||||
**Rejected options:**
|
||||
- `gpu_burn`: needs libcublas (~500MB)
|
||||
- `nvbandwidth`: bandwidth only, does not burn FLOPs; needs libcudart (~8MB)
|
||||
- DCGM diag: the right tool for H100, but a ~100MB install
|
||||
- Download on demand: needs libcublas, so the same problem
|
||||
**Status:** implemented. CPU in Health Check gets PASS/FAIL from summary.txt.
|
||||
|
||||
Add a CPU SAT based on `stress-ng`:
|
||||
- Bake `stress-ng` into the ISO (add it to `bee.list.chroot`)
|
||||
- New `bee sat cpu`: runs `stress-ng --cpu 0 --cpu-method all --timeout <N>`, where N is the duration for the selected mode (Quick=60s, Standard=300s, Express=900s)
|
||||
- In parallel, collects temperatures via `sensors` and throttle flags from the audit JSON
|
||||
- Result: a SAT archive with summary.txt in the same format as the other SATs (overall_status=OK/FAILED)
|
||||
- После реализации: CPU в Health Check получает реальный PASS/FAIL статус
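
A minimal sketch of how the mode-to-command mapping above could look, written in Python for illustration only. The function names are hypothetical, and mapping the exit code to `overall_status` is an assumption; the backlog only states the `OK/FAILED` values.

```python
# Durations per mode, as listed in the backlog item above.
MODE_DURATION_S = {"Quick": 60, "Standard": 300, "Express": 900}


def stress_ng_argv(mode: str) -> list:
    """Build the stress-ng command line for the given SAT mode."""
    timeout = MODE_DURATION_S[mode]
    return ["stress-ng", "--cpu", "0", "--cpu-method", "all",
            "--timeout", f"{timeout}s"]


def overall_status(returncode: int) -> str:
    """Assumption: a zero exit code means OK, anything else FAILED."""
    return "OK" if returncode == 0 else "FAILED"


print(" ".join(stress_ng_argv("Standard")))
```

The argument list can be handed to `subprocess.run` directly, which avoids shell quoting.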
|
||||
|
||||
## Real hardware validation

**Status:** waiting for hardware access.

Still to be confirmed in practice:
- `bee sat nvidia` on a real NVIDIA GPU host
- `bee sat storage` on an NVMe/SATA/RAID host
- `ipmitool sdr` parsing on a server with a real BMC/IPMI
- vendor RAID tooling (`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli`) in a live ISO

## SAT result polish

**Status:** partially closed.

What can still be improved after field testing:
- classify vendor-specific self-test outputs in `storage SAT` more precisely
- tune the `memtester` defaults to the RAM size of the target machines
- extend `bee-gpu-stress` in duration/load if needed
|
||||
|
||||
## Hardware Contract backlog

**Status:** refined, reduced to a `bee`-only snapshot scope.

### Not backlog for `bee`

These tasks must not be implemented in `bee`, because they belong to the centralized ingest/lifecycle layer:
- `status_history`
- `status_changed_at`
- detecting component replacement between snapshots
- timeline/lifecycle/history derived from diffs between exports

`bee` is responsible only for the current hardware snapshot and `status_checked_at`.
|
||||
|
||||
### Can be implemented incrementally

These fields can be developed further as real sample outputs and vendor-specific parsers become available:
- `cpus.correctable_error_count`
- `cpus.uncorrectable_error_count`
- `power_supplies.life_remaining_pct`
- `power_supplies.life_used_pct`
- `pcie_devices.battery_charge_pct`
- `pcie_devices.battery_health_pct`
- `pcie_devices.battery_temperature_c`
- `pcie_devices.battery_voltage_v`
- `pcie_devices.battery_replace_required`

### Vendor/platform-specific, often empty

These fields may remain empty on some platforms even after the parsers are implemented:
- `power_supplies.life_remaining_pct`
- `power_supplies.life_used_pct`
- some of `pcie_devices.battery_*` for unsupported RAID/NIC/GPU vendors
|
||||
|
||||
### Unsupported in `bee`

These fields are considered unrealistic for a general OS-level hardware snapshotter without synthetic/fake data:
- `cpus.life_remaining_pct`
- `cpus.life_used_pct`
- `memory.life_remaining_pct`
- `memory.life_used_pct`
- `memory.spare_blocks_remaining_pct`
- `memory.performance_degraded`

Reason: an ordinary Linux host audit usually has no honest, vendor-neutral runtime source for these metrics.

These fields are considered dropped from the `bee` backlog and must not return to the work plan unless a new, provable local data source appears on the target machines.
|
||||
|
||||
793
bible-local/docs/hardware-ingest-contract.md
Normal file
@@ -0,0 +1,793 @@
|
||||
---
title: Hardware Ingest JSON Contract
version: "2.7"
updated: "2026-03-15"
maintainer: Reanimator Core
audience: external-integrators, ai-agents
language: ru
---
|
||||
|
||||
# Reanimator integration: hardware JSON import contract

Version: **2.7** · Date: **2026-03-15**

This document describes the JSON format for submitting server hardware data to **Reanimator** (a hardware lifecycle management system).
It is intended for developers of adjacent systems (Redfish collectors, monitoring agents, CMDB exporters) and may be included in the documentation of integrating projects.

> Current version of this document: https://git.mchus.pro/reanimator/core/src/branch/main/bible-local/docs/hardware-ingest-contract.md
|
||||
|
||||
---
|
||||
|
||||
## Changelog

| Version | Date | Changes |
|---------|------|---------|
| 2.7 | 2026-03-15 | Data synthesis in `event_logs` is explicitly forbidden; integrators must not invent component serial numbers if the source did not provide them |
| 2.6 | 2026-03-15 | Added the optional `event_logs` section for dedup/upsert of `host` / `bmc` / `redfish` logs outside the history timeline |
| 2.5 | 2026-03-15 | Added the common optional field `manufactured_year_week` for component sections (`YYYY-Www`) |
| 2.4 | 2026-03-15 | Added the first wave of component telemetry: health/life fields for `cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies` |
| 2.3 | 2026-03-15 | Added component telemetry fields: `pcie_devices.temperature_c`, `pcie_devices.power_w`, `power_supplies.temperature_c` |
| 2.2 | 2026-03-15 | Added the `numa_node` field on `pcie_devices` for topology/affinity |
| 2.1 | 2026-03-15 | Added the `sensors` section (fans, power, temperatures, other); the `mac_addresses` field on `pcie_devices`; extended the list of `device_class` values |
| 2.0 | 2026-02-01 | Status history (`status_history`, `status_changed_at`); telemetry fields for PSU; async job response |
| 1.0 | 2026-01-01 | Initial version of the contract |
|
||||
|
||||
---
|
||||
|
||||
## Principles

1. **Snapshot**: the JSON describes the state of the server at collection time. It may include the history of component status changes.
2. **Idempotency**: resending an identical payload does not create duplicates (deduplication by hash).
3. **Partial payloads**: you may send only the sections for which data is available. An empty array and a missing section are equivalent.
4. **Strict schema**: the endpoint uses a strict JSON decoder; unknown fields result in `400 Bad Request`.
5. **Event-driven**: an import creates timeline events (LOG_COLLECTED, INSTALLED, REMOVED, FIRMWARE_CHANGED, and others).
6. **No synthesis by the integrator**: the collector sends only values it actually collected. Do not invent `serial_number`, `component_ref`, `message`, `message_id`, or other identifiers/attributes if the source did not provide them or the parser could not extract them reliably.
|
||||
|
||||
---
|
||||
|
||||
## Endpoint

```
POST /ingest/hardware
Content-Type: application/json
```

**Response on acceptance (202 Accepted):**
```json
{
  "status": "accepted",
  "job_id": "job_01J..."
}
```

The import runs asynchronously. The result is available at:
```
GET /ingest/hardware/jobs/{job_id}
```
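
The submit-then-poll flow can be sketched with the Python standard library. This is a hedged illustration, not an official client: the base URL and the function names are hypothetical, and only the two paths and the `Content-Type` header come from the contract.

```python
import json
import urllib.request


def build_ingest_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare the POST /ingest/hardware request for a payload."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ingest/hardware",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def job_status_url(base_url: str, job_id: str) -> str:
    """URL to poll for the asynchronous import result."""
    return f"{base_url.rstrip('/')}/ingest/hardware/jobs/{job_id}"


def submit(base_url: str, payload: dict) -> str:
    """Send the payload and return the job_id from the 202 response."""
    with urllib.request.urlopen(build_ingest_request(base_url, payload)) as resp:
        return json.load(resp)["job_id"]
```

A caller would pass the returned `job_id` to `job_status_url` and fetch that URL until the job reports a final status.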
|
||||
|
||||
**Response when the job succeeds:**
```json
{
  "status": "success",
  "bundle_id": "lb_01J...",
  "asset_id": "mach_01J...",
  "collected_at": "2026-02-10T15:30:00Z",
  "duplicate": false,
  "summary": {
    "parts_observed": 15,
    "parts_created": 2,
    "parts_updated": 13,
    "installations_created": 2,
    "installations_closed": 1,
    "timeline_events_created": 9,
    "failure_events_created": 1
  }
}
```

**Response for a duplicate:**
```json
{
  "status": "success",
  "duplicate": true,
  "message": "LogBundle with this content hash already exists"
}
```

**Response on error (400 Bad Request):**
```json
{
  "status": "error",
  "error": "validation_failed",
  "details": {
    "field": "hardware.board.serial_number",
    "message": "serial_number is required"
  }
}
```

Common causes of `400`:
- Invalid `collected_at` format (RFC3339 is required).
- Empty `hardware.board.serial_number`.
- An unknown JSON field at any level.
- The request body exceeds the allowed size.
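
Most of these `400` causes can be caught on the client before sending. The sketch below is an assumption-laden pre-flight check in Python: `preflight_errors` is a hypothetical helper, it inspects only top-level fields, and `datetime.fromisoformat` is only an approximation of RFC3339 validation.

```python
from datetime import datetime

# Top-level fields named in the contract.
ALLOWED_TOP_LEVEL = {"filename", "source_type", "protocol",
                     "target_host", "collected_at", "hardware"}


def preflight_errors(payload: dict) -> list:
    """Return human-readable problems likely to trigger a 400."""
    errors = []
    unknown = sorted(set(payload) - ALLOWED_TOP_LEVEL)
    if unknown:
        errors.append(f"unknown top-level fields: {unknown}")
    try:
        # A trailing "Z" is accepted by fromisoformat() only on Python 3.11+.
        raw = str(payload.get("collected_at")).replace("Z", "+00:00")
        if datetime.fromisoformat(raw).tzinfo is None:
            raise ValueError("missing UTC offset")
    except ValueError:
        errors.append("collected_at must be an RFC3339 timestamp")
    board = (payload.get("hardware") or {}).get("board") or {}
    if not str(board.get("serial_number") or "").strip():
        errors.append("hardware.board.serial_number is required")
    return errors
```

The check does not replace server-side validation; nested unknown fields and the body size limit still need the real endpoint.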
|
||||
|
||||
---
|
||||
|
||||
## Top-level structure

```json
{
  "filename": "redfish://10.10.10.103",
  "source_type": "api",
  "protocol": "redfish",
  "target_host": "10.10.10.103",
  "collected_at": "2026-02-10T15:30:00Z",
  "hardware": {
    "board": { ... },
    "firmware": [ ... ],
    "cpus": [ ... ],
    "memory": [ ... ],
    "storage": [ ... ],
    "pcie_devices": [ ... ],
    "power_supplies": [ ... ],
    "sensors": { ... },
    "event_logs": [ ... ]
  }
}
```

### Top-level fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `collected_at` | string RFC3339 | **yes** | Time of data collection |
| `hardware` | object | **yes** | Hardware snapshot |
| `hardware.board.serial_number` | string | **yes** | Serial number of the board/server |
| `target_host` | string | no | IP or hostname |
| `source_type` | string | no | Source type: `api`, `logfile`, `manual` |
| `protocol` | string | no | Protocol: `redfish`, `ipmi`, `snmp`, `ssh` |
| `filename` | string | no | Source identifier |
|
||||
|
||||
---
|
||||
|
||||
## Common component status fields

These apply to all component sections (`cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies`).

| Field | Type | Description |
|-------|------|-------------|
| `status` | string | Current status: `OK`, `Warning`, `Critical`, `Unknown`, `Empty` |
| `status_checked_at` | string RFC3339 | Time of the last status check |
| `status_changed_at` | string RFC3339 | Time of the last status change |
| `status_history` | array | History of status transitions (see below) |
| `error_description` | string | Error/diagnostic text |
| `manufactured_year_week` | string | Manufacturing date in `YYYY-Www` format, for example `2024-W07` |

**`status_history[]` object:**

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `status` | string | **yes** | Status at that moment |
| `changed_at` | string RFC3339 | **yes** | Time of the transition (a record without this field is ignored) |
| `details` | string | no | Explanation of the transition |

**Event time priority rules:**

1. `status_changed_at`
2. The last `status_history` record with a matching status
3. The last parsable `status_history` record
4. `status_checked_at`
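
The priority order above can be read as a simple fallback chain. The following Python sketch is illustrative only: `event_time` is a hypothetical name, and "last" is taken to mean the last array element, which assumes `status_history` is already sorted by `changed_at` in ascending order.

```python
from datetime import datetime


def _parse(ts):
    """Parse a timestamp string; return None if missing or unparsable."""
    try:
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return None


def event_time(component: dict):
    """Pick the event time following the four priority rules."""
    changed = _parse(component.get("status_changed_at"))
    if changed:
        return changed
    parsable = []
    for record in component.get("status_history") or []:
        when = _parse(record.get("changed_at"))
        if when is not None:
            parsable.append((record.get("status"), when))
    matching = [t for s, t in parsable if s == component.get("status")]
    if matching:
        return matching[-1]
    if parsable:
        return parsable[-1][1]
    return _parse(component.get("status_checked_at"))
```

Records with a missing or malformed `changed_at` are dropped before rules 2 and 3 are applied, matching the note that such records are ignored.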
|
||||
|
||||
**Rules for sending statuses:**
- Send `status` as the current state of the component in the snapshot.
- If the source keeps history, send `status_history` sorted by `changed_at` in ascending order.
- Do not include `status_history` records without `changed_at`.
- All dates are RFC3339; UTC (`Z`) is recommended.
- Use `manufactured_year_week` when the source knows only the year and week of manufacture, without an exact calendar date.
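
As a small illustration of the `YYYY-Www` shape, the Python sketch below formats and checks a year/week pair. The helper names are hypothetical, and the 01 to 53 week range is an assumption borrowed from ISO 8601 week numbering; the contract itself only gives the pattern and one example.

```python
import re

YEAR_WEEK_RE = re.compile(r"^\d{4}-W(0[1-9]|[1-4]\d|5[0-3])$")


def format_year_week(year: int, week: int) -> str:
    """Format a manufacturing year and week as YYYY-Www."""
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range: {week}")
    return f"{year:04d}-W{week:02d}"


def is_year_week(value: str) -> bool:
    """Check that a string has the YYYY-Www shape."""
    return bool(YEAR_WEEK_RE.match(value))


print(format_year_week(2024, 7))
```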
|
||||
|
||||
---
|
||||
|
||||
## Hardware sections

### board

Basic information about the server. This section is required.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `serial_number` | string | **yes** | Serial number (the Asset identification key) |
| `manufacturer` | string | no | Manufacturer |
| `product_name` | string | no | Model |
| `part_number` | string | no | Part number |
| `uuid` | string | no | System UUID |

`"NULL"` values in string fields are treated as missing data.

```json
"board": {
  "manufacturer": "Supermicro",
  "product_name": "X12DPG-QT6",
  "serial_number": "21D634101",
  "part_number": "X12DPG-QT6-REV1.01",
  "uuid": "d7ef2fe5-2fd0-11f0-910a-346f11040868"
}
```
|
||||
|
||||
---
|
||||
|
||||
### firmware

Firmware versions of system components (BIOS, BMC, CPLD, and others).

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `device_name` | string | **yes** | Device name (`BIOS`, `BMC`, `CPLD`, …) |
| `version` | string | **yes** | Firmware version |

Records with an empty `device_name` or `version` are ignored.
A version change creates a `FIRMWARE_CHANGED` event for the Asset.

```json
"firmware": [
  { "device_name": "BIOS", "version": "06.08.05" },
  { "device_name": "BMC", "version": "5.17.00" },
  { "device_name": "CPLD", "version": "01.02.03" }
]
```
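
The two firmware rules (empty records are ignored, a version change raises `FIRMWARE_CHANGED`) can be sketched as follows. This Python snippet only illustrates the described behaviour; it is not the server's implementation, and matching records by `device_name` is an assumption.

```python
def clean_firmware(entries):
    """Drop records whose device_name or version is empty."""
    return [e for e in entries
            if (e.get("device_name") or "").strip()
            and (e.get("version") or "").strip()]


def firmware_changes(previous, current):
    """List (device_name, old_version, new_version) for changed versions."""
    old = {e["device_name"]: e["version"] for e in clean_firmware(previous)}
    changes = []
    for entry in clean_firmware(current):
        name, version = entry["device_name"], entry["version"]
        if name in old and old[name] != version:
            changes.append((name, old[name], version))
    return changes
```

Each tuple returned by `firmware_changes` corresponds to one place where a `FIRMWARE_CHANGED` event would be expected.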
|
||||
|
||||
---
|
||||
|
||||
### cpus

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `socket` | int | **yes** | Socket number (used to generate the serial) |
| `model` | string | no | Processor model |
| `manufacturer` | string | no | Manufacturer |
| `cores` | int | no | Number of cores |
| `threads` | int | no | Number of threads |
| `frequency_mhz` | int | no | Current frequency, MHz |
| `max_frequency_mhz` | int | no | Maximum frequency, MHz |
| `temperature_c` | float | no | CPU temperature, °C (telemetry) |
| `power_w` | float | no | Current CPU power, W (telemetry) |
| `throttled` | bool | no | Thermal/power throttling was detected |
| `correctable_error_count` | int | no | Number of correctable CPU errors |
| `uncorrectable_error_count` | int | no | Number of uncorrectable CPU errors |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `serial_number` | string | no | Serial number (if available) |
| `firmware` | string | no | Microcode version; if the logger reports `Microcode level`, pass it here as is |
| `present` | bool | no | Presence (defaults to `true`) |
| + common status fields | | | see the section above |

**serial_number generation when missing:** `{board_serial}-CPU-{socket}`

If the source uses a `Microcode level` field/label, pass its value in `cpus[].firmware` without any additional transformation.

```json
"cpus": [
  {
    "socket": 0,
    "model": "INTEL(R) XEON(R) GOLD 6530",
    "cores": 32,
    "threads": 64,
    "frequency_mhz": 2100,
    "max_frequency_mhz": 4000,
    "temperature_c": 61.5,
    "power_w": 182.0,
    "throttled": false,
    "manufacturer": "Intel",
    "status": "OK",
    "status_checked_at": "2026-02-10T15:28:00Z"
  }
]
```
|
||||
|
||||
---
|
||||
|
||||
### memory

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `slot` | string | no | Slot identifier |
| `present` | bool | no | Module presence (defaults to `true`) |
| `serial_number` | string | no | Serial number |
| `part_number` | string | no | Part number (used as the model) |
| `manufacturer` | string | no | Manufacturer |
| `size_mb` | int | no | Size in MB |
| `type` | string | no | Type: `DDR3`, `DDR4`, `DDR5`, … |
| `max_speed_mhz` | int | no | Maximum frequency, MHz |
| `current_speed_mhz` | int | no | Current frequency, MHz |
| `temperature_c` | float | no | DIMM/module temperature, °C (telemetry) |
| `correctable_ecc_error_count` | int | no | Number of correctable ECC errors |
| `uncorrectable_ecc_error_count` | int | no | Number of uncorrectable ECC errors |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `spare_blocks_remaining_pct` | float | no | Remaining spare blocks, % |
| `performance_degraded` | bool | no | Performance degradation was detected |
| `data_loss_detected` | bool | no | The source signals a risk/fact of data loss |
| + common status fields | | | see the section above |

A module without `serial_number` is ignored. A module with `present=false` or `status=Empty` is ignored.

```json
"memory": [
  {
    "slot": "CPU0_C0D0",
    "present": true,
    "size_mb": 32768,
    "type": "DDR5",
    "max_speed_mhz": 4800,
    "current_speed_mhz": 4800,
    "temperature_c": 43.0,
    "correctable_ecc_error_count": 0,
    "manufacturer": "Hynix",
    "serial_number": "80AD032419E17CEEC1",
    "part_number": "HMCG88AGBRA191N",
    "status": "OK"
  }
]
```
|
||||
|
||||
---
|
||||
|
||||
### storage

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `slot` | string | no | Canonical installation address of the PCIe device; pass the BDF (`0000:18:00.0`) |
| `serial_number` | string | no | Serial number |
| `model` | string | no | Model |
| `manufacturer` | string | no | Manufacturer |
| `type` | string | no | Type: `NVMe`, `SSD`, `HDD` |
| `interface` | string | no | Interface: `NVMe`, `SATA`, `SAS` |
| `size_gb` | int | no | Size in GB |
| `temperature_c` | float | no | Drive temperature, °C (telemetry) |
| `power_on_hours` | int64 | no | Power-on time, hours |
| `power_cycles` | int64 | no | Number of power cycles |
| `unsafe_shutdowns` | int64 | no | Unsafe shutdowns |
| `media_errors` | int64 | no | Media errors |
| `error_log_entries` | int64 | no | Number of error log entries |
| `written_bytes` | int64 | no | Total bytes written |
| `read_bytes` | int64 | no | Total bytes read |
| `life_used_pct` | float | no | Used life / wear, % |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `available_spare_pct` | float | no | Available spare, % |
| `reallocated_sectors` | int64 | no | Reallocated sectors |
| `current_pending_sectors` | int64 | no | Sectors pending remap |
| `offline_uncorrectable` | int64 | no | Uncorrectable errors found by offline scan |
| `firmware` | string | no | Firmware version |
| `present` | bool | no | Presence (defaults to `true`) |
| + common status fields | | | see the section above |

A drive without `serial_number` is ignored. A change of `firmware` creates a `FIRMWARE_CHANGED` event.

```json
"storage": [
  {
    "slot": "OB01",
    "type": "NVMe",
    "model": "INTEL SSDPF2KX076T1",
    "size_gb": 7680,
    "temperature_c": 38.5,
    "power_on_hours": 12450,
    "unsafe_shutdowns": 3,
    "written_bytes": 9876543210,
    "life_remaining_pct": 91.0,
    "serial_number": "BTAX41900GF87P6DGN",
    "manufacturer": "Intel",
    "firmware": "9CV10510",
    "interface": "NVMe",
    "present": true,
    "status": "OK"
  }
]
```
|
||||
|
||||
---
|
||||
|
||||
### pcie_devices

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `slot` | string | no | Slot identifier; for PCIe, pass the BDF |
| `vendor_id` | int | no | PCI Vendor ID (decimal) |
| `device_id` | int | no | PCI Device ID (decimal) |
| `numa_node` | int | no | NUMA node / CPU affinity of the device |
| `temperature_c` | float | no | Device temperature, °C (telemetry) |
| `power_w` | float | no | Current device power draw, W (telemetry) |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `ecc_corrected_total` | int64 | no | Total correctable ECC errors |
| `ecc_uncorrected_total` | int64 | no | Total uncorrectable ECC errors |
| `hw_slowdown` | bool | no | The device entered hardware slowdown / protective mode |
| `battery_charge_pct` | float | no | Battery / supercap charge, % |
| `battery_health_pct` | float | no | Battery / supercap health, % |
| `battery_temperature_c` | float | no | Battery / supercap temperature, °C |
| `battery_voltage_v` | float | no | Battery / supercap voltage, V |
| `battery_replace_required` | bool | no | Battery / supercap replacement is required |
| `sfp_temperature_c` | float | no | SFP/optic temperature, °C |
| `sfp_tx_power_dbm` | float | no | TX optical power, dBm |
| `sfp_rx_power_dbm` | float | no | RX optical power, dBm |
| `sfp_voltage_v` | float | no | SFP voltage, V |
| `sfp_bias_ma` | float | no | SFP bias current, mA |
| `bdf` | string | no | Deprecated alias for `slot`; when present, ingest normalizes it into `slot` |
| `device_class` | string | no | Device class (see the list below) |
| `manufacturer` | string | no | Manufacturer |
| `model` | string | no | Model |
| `serial_number` | string | no | Serial number |
| `firmware` | string | no | Firmware version |
| `link_width` | int | no | Current link width |
| `link_speed` | string | no | Current speed: `Gen3`, `Gen4`, `Gen5` |
| `max_link_width` | int | no | Maximum link width |
| `max_link_speed` | string | no | Maximum speed |
| `mac_addresses` | string[] | no | MAC addresses of the ports (for network devices) |
| `present` | bool | no | Presence (defaults to `true`) |
| + common status fields | | | see the section above |

Send `numa_node` for NIC / InfiniBand / RAID / GPU devices when the source knows the CPU/NUMA affinity. The field is stored in the snapshot attributes of the PCIe component and duplicated in telemetry for topology use cases.
Use the `temperature_c` and `power_w` fields for device-level telemetry of GPU / accelerator / smart PCIe devices. They do not affect component identification.

**serial_number generation when missing or `"N/A"`:** `{board_serial}-PCIE-{slot}`, where `slot` for PCIe equals the BDF.

`slot` is the only canonical address of a component. For PCIe, pass the BDF in `slot`. The `bdf` field is kept only as a transitional input alias and must not be used as a separate coordinate alongside `slot`.
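
Linux tools usually report PCI IDs in hexadecimal, while `vendor_id` and `device_id` are decimal integers in this contract. The Python sketch below shows the conversion and a basic BDF shape check; the helper names and the regular expression are illustrative assumptions, not part of the contract.

```python
import re

# domain:bus:device.function, e.g. 0000:3b:00.0
BDF_RE = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def pci_id_to_decimal(hex_id: str) -> int:
    """Convert a hex PCI ID such as '8086' or '0x8086' to a decimal int."""
    return int(hex_id, 16)


def is_bdf(slot: str) -> bool:
    """Check that a slot value looks like a full PCI BDF address."""
    return bool(BDF_RE.match(slot))


print(pci_id_to_decimal("0x8086"), is_bdf("0000:3b:00.0"))
```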
|
||||
|
||||
**`device_class` values:**

| Value | Purpose |
|-------|---------|
| `MassStorageController` | RAID controllers |
| `StorageController` | HBA, SAS controllers |
| `NetworkController` | Network adapters (InfiniBand, generic) |
| `EthernetController` | Ethernet NIC |
| `FibreChannelController` | Fibre Channel HBA |
| `VideoController` | GPUs, video cards |
| `ProcessingAccelerator` | Compute accelerators (AI/ML) |
| `DisplayController` | Display controllers (BMC VGA) |

The list is open: arbitrary strings are allowed for non-standard classes.

```json
"pcie_devices": [
  {
    "slot": "0000:3b:00.0",
    "vendor_id": 5555,
    "device_id": 4401,
    "numa_node": 0,
    "temperature_c": 48.5,
    "power_w": 18.2,
    "sfp_temperature_c": 36.2,
    "sfp_tx_power_dbm": -1.8,
    "sfp_rx_power_dbm": -2.1,
    "device_class": "EthernetController",
    "manufacturer": "Intel",
    "model": "X710 10GbE",
    "serial_number": "K65472-003",
    "firmware": "9.20 0x8000d4ae",
    "mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
    "status": "OK"
  }
]
```
|
||||
|
||||
---
|
||||
|
||||
### power_supplies

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `slot` | string | no | Slot identifier |
| `present` | bool | no | Presence (defaults to `true`) |
| `serial_number` | string | no | Serial number |
| `part_number` | string | no | Part number |
| `model` | string | no | Model |
| `vendor` | string | no | Manufacturer |
| `wattage_w` | int | no | Rated power, W |
| `firmware` | string | no | Firmware version |
| `input_type` | string | no | Input type (for example `ACWideRange`) |
| `input_voltage` | float | no | Input voltage, V (telemetry) |
| `input_power_w` | float | no | Input power, W (telemetry) |
| `output_power_w` | float | no | Output power, W (telemetry) |
| `temperature_c` | float | no | PSU temperature, °C (telemetry) |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| + common status fields | | | see the section above |

The telemetry fields (`input_voltage`, `input_power_w`, `output_power_w`, `temperature_c`, `life_remaining_pct`, `life_used_pct`) are stored in the component attributes and do not affect its identification.

A PSU without `serial_number` is ignored.

```json
"power_supplies": [
  {
    "slot": "0",
    "present": true,
    "model": "GW-CRPS3000LW",
    "vendor": "Great Wall",
    "wattage_w": 3000,
    "serial_number": "2P06C102610",
    "firmware": "00.03.05",
    "status": "OK",
    "input_type": "ACWideRange",
    "input_power_w": 137,
    "output_power_w": 104,
    "input_voltage": 215.25,
    "temperature_c": 39.5,
    "life_remaining_pct": 97.0
  }
]
```
|
||||
|
||||
---
|
||||
|
||||
### sensors

Server sensor readings. The section is optional and is not tied to components.
Data is stored as the last known value (last-known-value) at the Asset level.

```json
"sensors": {
  "fans": [ ... ],
  "power": [ ... ],
  "temperatures": [ ... ],
  "other": [ ... ]
}
```
|
||||
|
||||
---
|
||||
|
||||
### event_logs

Normalized operational server logs from `host`, `bmc`, or `redfish`.

These records do not go into the history timeline and do not create history events. They are kept in a separate deduplicated log store and shown in a separate asset logs / host logs UI block.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `source` | string | **yes** | Log source: `host`, `bmc`, `redfish` |
| `event_time` | string RFC3339 | no | Event time from the source; if missing, the ingest/collection time is used |
| `severity` | string | no | Level: `OK`, `Info`, `Warning`, `Critical`, `Unknown` |
| `message_id` | string | no | Event identifier/code from the source |
| `message` | string | **yes** | Normalized event text |
| `component_ref` | string | no | Reference to the component/device/slot, if it can be extracted |
| `fingerprint` | string | no | Ready-made external dedup key; if not sent, the system computes its own |
| `is_active` | bool | no | Indicates that the event is still active/not cleared, if the source supports a lifecycle |
| `raw_payload` | object | no | Raw vendor-specific payload for diagnostics |

**event_logs rules:**
- Logs are deduplicated within asset + source + fingerprint.
- If `fingerprint` is not sent, the system builds it from normalized fields (`source`, `message_id`, `message`, `component_ref`, time normalization).
- The integrator/log collector must not synthesize event content: do not invent `message`, `message_id`, `component_ref`, serial/device identifiers, or other fields if they are absent from the source log or were not extracted reliably.
- Receiving the same event again updates `last_seen_at`/the repeat counter and must not create a new timeline/history event.
- `event_logs` are used for a separate UI view of logs and do not change the canonical state of components/asset by default.
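
To make the fallback dedup key concrete, here is one way a fingerprint could be derived from the listed fields. The contract does not specify the server's algorithm, so everything in this Python sketch is an assumption: the SHA-256 hash, the whitespace collapsing, the separator, and the truncation of `event_time` to the minute.

```python
import hashlib


def event_fingerprint(event: dict) -> str:
    """Derive a dedup key from normalized event fields (illustrative only)."""
    parts = [
        (event.get("source") or "").strip().lower(),
        (event.get("message_id") or "").strip(),
        " ".join((event.get("message") or "").split()),  # collapse whitespace
        (event.get("component_ref") or "").strip(),
        (event.get("event_time") or "")[:16],  # assumed: minute precision
    ]
    joined = "\x1f".join(parts)  # unit separator keeps fields distinct
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

An integrator that already has a stable record ID from the source can send it as `fingerprint` instead of relying on a derived key.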
|
||||
|
||||
```json
"event_logs": [
  {
    "source": "bmc",
    "event_time": "2026-03-15T14:03:11Z",
    "severity": "Warning",
    "message_id": "0x000F",
    "message": "Correctable ECC error threshold exceeded",
    "component_ref": "CPU0_C0D0",
    "raw_payload": {
      "sensor": "DIMM_A1",
      "sel_record_id": "0042"
    }
  },
  {
    "source": "redfish",
    "event_time": "2026-03-15T14:03:20Z",
    "severity": "Info",
    "message_id": "OpenBMC.0.1.SystemReboot",
    "message": "System reboot requested by administrator",
    "component_ref": "Mainboard"
  }
]
```
|
||||
|
||||
#### sensors.fans

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | **yes** | Sensor name, unique within the section |
| `location` | string | no | Physical location |
| `rpm` | int | no | Speed, RPM |
| `status` | string | no | Status: `OK`, `Warning`, `Critical`, `Unknown` |

#### sensors.power

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `voltage_v` | float | no | Voltage, V |
| `current_a` | float | no | Current, A |
| `power_w` | float | no | Power, W |
| `status` | string | no | Status |

#### sensors.temperatures

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `celsius` | float | no | Temperature, °C |
| `threshold_warning_celsius` | float | no | Warning threshold, °C |
| `threshold_critical_celsius` | float | no | Critical threshold, °C |
| `status` | string | no | Status |

#### sensors.other

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `value` | float | no | Value |
| `unit` | string | no | Unit of measurement |
| `status` | string | no | Status |

**sensors rules:**
- Sensor identifier: the pair `(sensor_type, name)`. For duplicates within one payload, the first occurrence is used.
- Sensors without `name` are ignored.
- On every import the values are overwritten (upsert by key).
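
The first two sensor rules can be sketched as a small normalization pass. This Python snippet illustrates the described behaviour and is not the ingest code; the function name is hypothetical.

```python
def dedupe_sensors(sensors: dict) -> dict:
    """Keep the first reading per (sensor_type, name); drop nameless ones."""
    result = {}
    for sensor_type, readings in sensors.items():
        seen = set()
        kept = []
        for reading in readings:
            name = reading.get("name")
            if not name or name in seen:
                continue  # nameless or duplicate within this sensor type
            seen.add(name)
            kept.append(reading)
        result[sensor_type] = kept
    return result
```

Because the key includes the sensor type, a fan and a temperature sensor may share the same `name` without colliding.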
|
||||
|
||||
```json
"sensors": {
  "fans": [
    { "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" },
    { "name": "FAN_CPU0", "location": "CPU0", "rpm": 5600, "status": "OK" }
  ],
  "power": [
    { "name": "12V Rail", "location": "Mainboard", "voltage_v": 12.06, "status": "OK" },
    { "name": "PSU0 Input", "location": "PSU0", "voltage_v": 215.25, "current_a": 0.64, "power_w": 137.0, "status": "OK" }
  ],
  "temperatures": [
    { "name": "CPU0 Temp", "location": "CPU0", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" },
    { "name": "Inlet Temp", "location": "Front", "celsius": 22.0, "threshold_warning_celsius": 40.0, "threshold_critical_celsius": 50.0, "status": "OK" }
  ],
  "other": [
    { "name": "System Humidity", "value": 38.5, "unit": "%", "status": "OK" }
  ]
}
```

---

## Component status handling

| Status | Behavior |
|--------|-----------|
| `OK` | Processed normally |
| `Warning` | A `COMPONENT_WARNING` event is created |
| `Critical` | A `COMPONENT_FAILED` event is created and a record is added to `failure_events` |
| `Unknown` | The component is treated as working; a `COMPONENT_UNKNOWN` event is created |
| `Empty` | The component is neither created nor updated |
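
The table above is a small lookup from status to side effects. A minimal Python sketch under stated assumptions: `StatusAction`, `STATUS_ACTIONS`, and `action_for` are hypothetical names; rejecting statuses the table does not list is an assumption; so is storing the component for every status except `Empty`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StatusAction:
    store_component: bool   # create/update the component record
    event: Optional[str]    # event name to emit, if any
    record_failure: bool    # also write a row to failure_events


STATUS_ACTIONS = {
    "OK": StatusAction(True, None, False),
    "Warning": StatusAction(True, "COMPONENT_WARNING", False),
    "Critical": StatusAction(True, "COMPONENT_FAILED", True),
    "Unknown": StatusAction(True, "COMPONENT_UNKNOWN", False),
    "Empty": StatusAction(False, None, False),
}


def action_for(status: str) -> StatusAction:
    """Look up the documented behavior; unlisted statuses are rejected (assumption)."""
    try:
        return STATUS_ACTIONS[status]
    except KeyError:
        raise ValueError(f"status not covered by the table: {status!r}") from None
```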

---

## Handling missing serial_number

General rule for all sections: if the source did not return a serial number and the collector could not extract one reliably, the integrator must not substitute invented values, hashes, local placeholder identifiers, or guessed serial numbers. Only the server-side ingest fallback rules explicitly listed below are allowed.

| Type | Behavior |
|-----|-----------|
| CPU | Generated: `{board_serial}-CPU-{socket}` |
| PCIe | Generated: `{board_serial}-PCIE-{slot}` (if the serial is `"N/A"` or empty; for PCIe, `slot` is the BDF) |
| Memory | Component is ignored |
| Storage | Component is ignored |
| PSU | Component is ignored |

If `serial_number` is not unique within a single payload for the same `model`:

- The first occurrence keeps its original serial number.
- Each subsequent duplicate receives a placeholder: `NO_SN-XXXXXXXX`.
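
The fallback table and the duplicate rule above can be sketched in Python. This is a hedged illustration, not the server's ingest code. The names `resolve_serial`, `replace_duplicate_serials`, and `random_placeholder` are hypothetical. Treating `"N/A"` as missing for every type (the table states it only for PCIe) is an assumption. The source gives only the pattern `NO_SN-XXXXXXXX`, so filling it with eight random hex characters is also an assumption.

```python
import secrets
from typing import Any, Callable, Optional

# kind -> (tag used in the generated serial, field that supplies the suffix)
GENERATED = {"CPU": ("CPU", "socket"), "PCIe": ("PCIE", "slot")}
IGNORED = {"Memory", "Storage", "PSU"}


def resolve_serial(kind: str, component: dict[str, Any], board_serial: str) -> Optional[str]:
    """Return the serial to store, or None when the component must be ignored."""
    serial = str(component.get("serial_number") or "").strip()
    if serial and serial != "N/A":
        return serial
    if kind in GENERATED:
        tag, field = GENERATED[kind]
        return f"{board_serial}-{tag}-{component[field]}"
    if kind in IGNORED:
        return None
    raise ValueError(f"no fallback rule documented for {kind!r}")


def random_placeholder() -> str:
    # Hypothetical generator: only the NO_SN-XXXXXXXX pattern is documented.
    return "NO_SN-" + secrets.token_hex(4).upper()


def replace_duplicate_serials(
    components: list[dict[str, Any]],
    make_placeholder: Callable[[], str] = random_placeholder,
) -> list[dict[str, Any]]:
    """Keep the first (model, serial_number) pair; later duplicates get a placeholder."""
    seen: set[tuple[Any, Any]] = set()
    result = []
    for component in components:
        key = (component.get("model"), component.get("serial_number"))
        if key in seen:
            component = {**component, "serial_number": make_placeholder()}
        else:
            seen.add(key)
        result.append(component)
    return result
```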

---

## Minimal valid example

```json
{
  "collected_at": "2026-02-10T15:30:00Z",
  "target_host": "192.168.1.100",
  "hardware": {
    "board": {
      "serial_number": "SRV-001"
    }
  }
}
```

---

## Full example with status history

```json
{
  "filename": "redfish://10.10.10.103",
  "source_type": "api",
  "protocol": "redfish",
  "target_host": "10.10.10.103",
  "collected_at": "2026-02-10T15:30:00Z",
  "hardware": {
    "board": {
      "manufacturer": "Supermicro",
      "product_name": "X12DPG-QT6",
      "serial_number": "21D634101"
    },
    "firmware": [
      { "device_name": "BIOS", "version": "06.08.05" },
      { "device_name": "BMC", "version": "5.17.00" }
    ],
    "cpus": [
      {
        "socket": 0,
        "model": "INTEL(R) XEON(R) GOLD 6530",
        "manufacturer": "Intel",
        "cores": 32,
        "threads": 64,
        "status": "OK"
      }
    ],
    "storage": [
      {
        "slot": "OB01",
        "type": "NVMe",
        "model": "INTEL SSDPF2KX076T1",
        "size_gb": 7680,
        "serial_number": "BTAX41900GF87P6DGN",
        "manufacturer": "Intel",
        "firmware": "9CV10510",
        "present": true,
        "status": "OK",
        "status_changed_at": "2026-02-10T15:22:00Z",
        "status_history": [
          { "status": "Critical", "changed_at": "2026-02-10T15:10:00Z", "details": "I/O timeout on NVMe queue 3" },
          { "status": "OK", "changed_at": "2026-02-10T15:22:00Z", "details": "Recovered after controller reset" }
        ]
      }
    ],
    "pcie_devices": [
      {
        "slot": "0000:18:00.0",
        "device_class": "EthernetController",
        "manufacturer": "Intel",
        "model": "X710 10GbE",
        "serial_number": "K65472-003",
        "mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
        "status": "OK"
      }
    ],
    "power_supplies": [
      {
        "slot": "0",
        "present": true,
        "model": "GW-CRPS3000LW",
        "vendor": "Great Wall",
        "wattage_w": 3000,
        "serial_number": "2P06C102610",
        "firmware": "00.03.05",
        "status": "OK",
        "input_power_w": 137,
        "output_power_w": 104,
        "input_voltage": 215.25
      }
    ],
    "sensors": {
      "fans": [
        { "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" }
      ],
      "power": [
        { "name": "12V Rail", "voltage_v": 12.06, "status": "OK" }
      ],
      "temperatures": [
        { "name": "CPU0 Temp", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" }
      ],
      "other": [
        { "name": "System Humidity", "value": 38.5, "unit": "%" }
      ]
    }
  }
}
```
21
bible-local/docs/iso-build-rules.md
Normal file
@@ -0,0 +1,21 @@
# ISO Build Rules

## Verify package names before use

ISO builds take 30–60 minutes. A wrong package name wastes an entire build cycle.

**Rule: before adding any Debian package name to the ISO config, verify it exists and check its file list.**

Use one of:
- `https://packages.debian.org/bookworm/<package-name>`: existence and description
- `https://packages.debian.org/bookworm/amd64/<package-name>/filelist`: exact files installed
- `apt-cache show <package>` inside a Debian bookworm container

This applies to:
- `iso/builder/config/package-lists/*.list.chroot`
- Any package referenced in `grub.cfg`, hooks, or overlay scripts (e.g. file paths like `/boot/memtest86+x64.bin`)
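
The `apt-cache show` check can be run over a whole package list before a build starts. The Python sketch below is one way to do that inside a Debian bookworm container. The helper names `package_names`, `apt_cache_knows`, and `missing_packages` are hypothetical. It assumes list files hold whitespace-separated package names with `#` comment lines, and it treats a zero exit status from `apt-cache show` as "package exists", which may need extra handling for purely virtual packages.

```python
import subprocess
from typing import Callable


def package_names(list_text: str) -> list[str]:
    """Extract package names from the text of a *.list.chroot file."""
    names: list[str] = []
    for line in list_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        names.extend(line.split())
    return names


def apt_cache_knows(package: str) -> bool:
    """True if `apt-cache show <package>` exits with status 0."""
    result = subprocess.run(
        ["apt-cache", "show", package],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def missing_packages(list_text: str, exists: Callable[[str], bool] = apt_cache_knows) -> list[str]:
    """Return the packages from the list that the existence check rejects."""
    return [name for name in package_names(list_text) if not exists(name)]
```

Passing a different `exists` callable lets the same function be used outside a Debian container, for example against a cached package index. This check covers package names only; file paths referenced in `grub.cfg` still need the file list check described above.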

## Example of what goes wrong without this

`memtest86+` in Debian bookworm installs `/boot/memtest86+x64.bin`, not `/boot/memtest86+.bin`.
Guessing the filename caused a broken GRUB entry that only surfaced at boot time, after a full rebuild.
35
bible-local/docs/validate-vs-burn.md
Normal file
@@ -0,0 +1,35 @@
# Validate vs Burn: Hardware Impact Policy

## Validate Tests (non-destructive)

Tests on the **Validate** page are purely diagnostic. They:

- **Do not write to disks**: no data is written to storage devices; SMART counters (power-on hours, load cycle count, reallocated sectors) are not incremented.
- **Do not run sustained high load**: commands complete quickly (seconds to minutes) and do not push hardware to thermal or electrical limits.
- **Do not increment hardware wear counters**: GPU memory ECC counters, NVMe wear leveling counters, and similar endurance metrics are unaffected.
- **Are safe to run repeatedly**: on new, production-bound, or already-deployed hardware without concern for reducing lifespan.

### What Validate tests actually do

| Test | What it runs |
|---|---|
| NVIDIA GPU | `nvidia-smi`, `dcgmi diag` (levels 1–4 read-only diagnostics) |
| Memory | `memtester` on a limited allocation; reads/writes to RAM only |
| Storage | `smartctl -a`, `nvme smart-log`; reads SMART data only |
| CPU | `stress-ng` for a bounded duration; CPU-only, no I/O |
| AMD GPU | `rocm-smi --showallinfo`, `dmidecode`; read-only queries |

## Burn Tests (hardware wear)

Tests on the **Burn** page run hardware at maximum or near-maximum load for extended durations. They:

- **Wear storage**: write-intensive patterns can reduce SSD endurance (P/E cycles).
- **Stress GPU memory**: extended ECC stress tests may surface latent defects but also exercise memory cells.
- **Accelerate thermal cycling**: repeated heat/cool cycles degrade solder joints and capacitors over time.
- **May increment wear counters**: GPU power-on hours, NVMe media wear indicator, and similar metrics will advance.

### Rule

> Run **Validate** freely on any server, at any time, before or after deployment.
> Run **Burn** only when explicitly required (e.g., initial acceptance after repair, or per customer SLA).
> Document when and why Burn tests were run.
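
One way to make the "document when and why" part of the rule hard to skip is to refuse a Burn run that has no stated reason, and to record the start time with it. The Python sketch below only illustrates that idea; `BurnRecord` and `authorize_burn` are hypothetical names and are not part of the tool described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class BurnRecord:
    host: str
    reason: str            # why the Burn run was required
    started_at: datetime   # when it was started (UTC)


def authorize_burn(host: str, reason: str, now: Optional[datetime] = None) -> BurnRecord:
    """Return a record for the run, or raise if no reason was given."""
    reason = reason.strip()
    if not reason:
        raise ValueError("Burn tests require an explicit reason")
    return BurnRecord(host=host, reason=reason, started_at=now or datetime.now(timezone.utc))
```

Validate runs need no such gate, since the policy allows them on any server at any time.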
1
internal/chart
Submodule
Submodule internal/chart added at ac8120c8ab
58
iso/README.md
Normal file
@@ -0,0 +1,58 @@
# ISO Build

`bee` ISO is built inside a Debian 12 builder container via `iso/builder/build-in-container.sh`.

## Requirements

- Docker Desktop or another Docker-compatible container runtime
- Privileged containers enabled
- Enough free disk space for builder cache, Debian live-build artifacts, NVIDIA driver cache, and CUDA userspace packages

## Build On macOS

From the repository root:

```sh
sh iso/builder/build-in-container.sh
```

The script defaults to `linux/amd64` builder containers, so it works on:

- Intel Mac
- Apple Silicon (`M1` / `M2` / `M3` / `M4`) via Docker Desktop's Linux VM

You do not need to pass `--platform` manually for normal ISO builds.

## Useful Options

Build with explicit SSH keys baked into the ISO:

```sh
sh iso/builder/build-in-container.sh --authorized-keys ~/.ssh/id_ed25519.pub
```

Rebuild the builder image:

```sh
sh iso/builder/build-in-container.sh --rebuild-image
```

Use a custom cache directory:

```sh
sh iso/builder/build-in-container.sh --cache-dir /path/to/cache
```

## Notes

- The builder image is automatically rebuilt if the local tag exists for the wrong architecture.
- The live ISO boots with Debian `live-boot` `toram`, so the read-only medium is copied into RAM during boot and the runtime no longer depends on the original USB/BMC virtual media staying present.
- Target systems need enough RAM for the full compressed live medium plus normal runtime overhead, or boot may fail before reaching the TUI.
- Override the container platform only if you know why:

```sh
BEE_BUILDER_PLATFORM=linux/amd64 sh iso/builder/build-in-container.sh
```

- The shipped ISO is still `amd64`.
- Output ISO artifacts are written under `dist/`.
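
The `toram` note implies a simple pre-boot sanity check: total RAM must cover the live medium plus runtime overhead. The Python sketch below compares `MemTotal` from `/proc/meminfo`-style text against an image size. The function names are hypothetical, and the 2 GiB default overhead is an illustrative assumption rather than a measured value for this ISO.

```python
def mem_total_bytes(meminfo_text: str) -> int:
    """Parse MemTotal (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def fits_in_ram(meminfo_text: str, medium_bytes: int, overhead_bytes: int = 2 * 1024**3) -> bool:
    """True if total RAM covers the live medium plus the assumed runtime overhead."""
    return mem_total_bytes(meminfo_text) >= medium_bytes + overhead_bytes
```

On a running Linux system the text can come from reading `/proc/meminfo`, and `medium_bytes` from the size of the ISO file under `dist/`.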
57
iso/builder/Dockerfile
Normal file
@@ -0,0 +1,57 @@
FROM debian:12

ARG GO_VERSION=1.24.0

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update -qq && apt-get install -y \
    ca-certificates \
    live-build \
    debootstrap \
    squashfs-tools \
    xorriso \
    grub-pc-bin \
    grub-efi-amd64-bin \
    mtools \
    git \
    wget \
    curl \
    tar \
    xz-utils \
    rsync \
    build-essential \
    gcc \
    make \
    perl \
    linux-headers-amd64 \
    && rm -rf /var/lib/apt/lists/*

# Add NVIDIA CUDA repo and install nvcc (needed to compile nccl-tests)
RUN wget -qO /tmp/cuda-keyring.gpg \
    https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub \
    && gpg --dearmor < /tmp/cuda-keyring.gpg \
    > /usr/share/keyrings/nvidia-cuda.gpg \
    && rm /tmp/cuda-keyring.gpg \
    && echo "deb [signed-by=/usr/share/keyrings/nvidia-cuda.gpg] \
    https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /" \
    > /etc/apt/sources.list.d/cuda.list \
    && apt-get update -qq \
    && apt-get install -y cuda-nvcc-12-8 \
    && rm -rf /var/lib/apt/lists/* \
    && ln -sfn /usr/local/cuda-12.8 /usr/local/cuda

RUN arch="$(dpkg --print-architecture)" \
    && case "$arch" in \
        amd64) goarch=amd64 ;; \
        arm64) goarch=arm64 ;; \
        *) echo "unsupported architecture: $arch" >&2; exit 1 ;; \
    esac \
    && wget -q -O /tmp/go.tar.gz "https://go.dev/dl/go${GO_VERSION}.linux-${goarch}.tar.gz" \
    && rm -rf /usr/local/go \
    && tar -C /usr/local -xzf /tmp/go.tar.gz \
    && rm -f /tmp/go.tar.gz

ENV PATH=/usr/local/go/bin:${PATH}
WORKDIR /work

CMD ["/bin/bash"]
@@ -1,5 +1,22 @@
 DEBIAN_VERSION=12
-DEBIAN_KERNEL_ABI=6.1.0-43
+DEBIAN_KERNEL_ABI=auto
 NVIDIA_DRIVER_VERSION=590.48.01
-GO_VERSION=1.23.6
-AUDIT_VERSION=0.1.0
+NCCL_VERSION=2.28.9-1
+NCCL_CUDA_VERSION=13.0
+NCCL_SHA256=2e6faafd2c19cffc7738d9283976a3200ea9db9895907f337f0c7e5a25563186
+NCCL_TESTS_VERSION=2.13.10
+NVCC_VERSION=12.8
+CUBLAS_VERSION=13.0.2.14-1
+CUDA_USERSPACE_VERSION=13.0.96-1
+DCGM_VERSION=3.3.9
+ROCM_VERSION=6.3.4
+ROCM_SMI_VERSION=7.4.0.60304-76~22.04
+ROCM_BANDWIDTH_TEST_VERSION=1.4.0.60304-76~22.04
+ROCM_VALIDATION_SUITE_VERSION=1.1.0.60304-76~22.04
+ROCBLAS_VERSION=4.3.0.60304-76~22.04
+ROCRAND_VERSION=3.2.0.60304-76~22.04
+HIP_RUNTIME_AMD_VERSION=6.3.42134.60304-76~22.04
+HIPBLASLT_VERSION=0.10.0.60304-76~22.04
+COMGR_VERSION=2.8.0.60304-76~22.04
+GO_VERSION=1.24.0
+AUDIT_VERSION=1.0.0
@@ -7,6 +7,15 @@ set -e

. "$(dirname "$0")/../VERSIONS"

# Pin the exact kernel ABI detected by build.sh so the ISO kernel matches
# the kernel headers used to compile NVIDIA modules. Falls back to meta-package
# when lb config is run manually without the environment variable.
if [ -n "${BEE_KERNEL_ABI:-}" ] && [ "${BEE_KERNEL_ABI}" != "auto" ]; then
    LB_LINUX_PACKAGES="linux-image-${BEE_KERNEL_ABI}"
else
    LB_LINUX_PACKAGES="linux-image"
fi

lb config noauto \
    --distribution bookworm \
    --architectures amd64 \
@@ -19,12 +28,11 @@ lb config noauto \
    --mirror-binary "https://deb.debian.org/debian" \
    --security true \
    --linux-flavours "amd64" \
    --linux-packages "linux-image-${DEBIAN_KERNEL_ABI}" \
    --linux-packages "${LB_LINUX_PACKAGES}" \
    --memtest none \
    --iso-volume "BEE-DEBUG" \
    --iso-application "Bee Hardware Audit" \
    --hostname "bee-debug" \
    --username "root" \
    --bootappend-live "boot=live components quiet splash" \
    --iso-volume "EASY-BEE" \
    --iso-application "EASY-BEE" \
    --bootappend-live "boot=live components video=1920x1080 console=tty0 console=ttyS0,115200n8 loglevel=7 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
    --apt-recommends false \
    --chroot-squashfs-compression-type zstd \
    "${@}"
1176
iso/builder/bee-gpu-stress.c
Normal file
File diff suppressed because it is too large
Some files were not shown because too many files have changed in this diff