Compare commits

201 Commits

Author SHA1 Message Date
Mikhail Chusavitin
b447717a5a fix(iso): harden boot network bring-up - v3.20 2026-04-01 09:10:55 +03:00
Mikhail Chusavitin
f6f4923ac9 fix(iso): recover memtest after live-build 2026-04-01 08:55:57 +03:00
Mikhail Chusavitin
c394845b34 refactor(webui): queue install and bundle tasks - v3.18 2026-04-01 08:46:46 +03:00
Mikhail Chusavitin
3472afea32 fix(iso): make memtest non-blocking by default 2026-04-01 08:33:36 +03:00
Mikhail Chusavitin
942f11937f chore(submodule): update bible - v3.16 2026-04-01 08:23:39 +03:00
Mikhail Chusavitin
b5b34983f1 fix(webui): repair audit actions and CPU burn flow - v3.15 2026-04-01 08:19:11 +03:00
45221d1e9a fix(stress): label loaders and improve john opencl diagnostics 2026-04-01 07:31:52 +03:00
3869788bac fix(iso): validate memtest with xorriso fallback 2026-04-01 07:24:05 +03:00
3dbc2184ef fix(iso): archive build logs and memtest diagnostics 2026-04-01 07:14:53 +03:00
60cb8f889a fix(iso): restore memtest menu entries and validate ISO 2026-04-01 07:04:48 +03:00
c9ee078622 fix(stress): keep platform burn responsive under load 2026-03-31 22:28:26 +03:00
ea660500c9 chore: commit pending repo changes 2026-03-31 22:17:36 +03:00
d43a9aeec7 fix(iso): restore live-build memtest integration 2026-03-31 22:10:28 +03:00
Mikhail Chusavitin
f5622e351e Fix staged John cleanup for repeated ISO builds 2026-03-31 11:40:52 +03:00
Mikhail Chusavitin
a20806afc8 Fix ISO grub package conflict 2026-03-31 11:38:30 +03:00
Mikhail Chusavitin
4f9b6b3bcd Harden NVIDIA boot logging on live ISO 2026-03-31 11:37:21 +03:00
Mikhail Chusavitin
c850b39b01 feat: v3.10 GPU stress and NCCL burn updates 2026-03-31 11:22:27 +03:00
Mikhail Chusavitin
6dee8f3509 Add NVIDIA stress loader selection and DCGM 4 support 2026-03-31 11:15:15 +03:00
Mikhail Chusavitin
20f834aa96 feat: v3.4 — boot reliability, log readability, USB export, screen resolution, GRUB UEFI fix, memtest, KVM console stability
Web UI / logs:
- Strip ANSI escape codes and handle \r (progress bars) in task log output
- Add USB export API + UI card on Export page (list removable devices, write audit JSON or support bundle)
- Add Display Resolution card in Tools (xrandr-based, per-output mode selector)
- Dashboard: audit status banner with auto-reload when audit task completes
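
The log clean-up in the first bullet can be sketched as follows. This is a minimal Python illustration, not the project's Go code; the name clean_log_line and the "keep only the last \r segment" rule are assumptions about how progress-bar redraws are collapsed.

```python
import re

# CSI escape sequences: ESC [ parameters, optional intermediates, final byte.
# Covers colour codes such as "\x1b[31m" and cursor moves such as "\x1b[2K".
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def clean_log_line(raw: str) -> str:
    """Return roughly what a terminal would end up showing for one log line."""
    text = ANSI_CSI.sub("", raw)
    # Drop the line terminator first so a trailing "\r\n" is not mistaken
    # for a progress-bar rewind.
    text = text.rstrip("\r\n")
    # A bare "\r" rewinds to column 0; assume each redraw overwrites the
    # whole line, so only the last segment is kept.
    return text.split("\r")[-1]
```

For example, a progress bar emitted as "10%\r50%\r100%" collapses to its final state, "100%".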

Boot & install:
- bee-web starts immediately with no dependencies (was blocked by audit + network)
- bee-audit.service redesigned: waits for bee-web healthz, sleeps 60s, enqueues audit via /api/audit/run (task system)
- bee-install: fix GRUB UEFI — grub-install exit code was silently ignored (|| true); add --no-nvram fallback; always copy EFI/BOOT/BOOTX64.EFI fallback path
- Add grub-efi-amd64, grub-pc, grub-efi-amd64-signed, shim-signed to package list (grub-install requires these, not just -bin variants)
- memtest hook: create binary/boot/ before cp (the directory was missing); handle both Debian (no extension) and upstream (x64.efi) naming
- bee-openbox-session: increase healthz wait from 30s to 120s

KVM console stability:
- runCmdJob: syscall.Setpriority(PRIO_PROCESS, pid, 10) on all stress subprocesses
- lightdm.service.d: Nice=-5 so X server preempts stress processes

Packages: add btop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 10:16:15 +03:00
105d92df8b fix(iso): use underscore in volume label to comply with ISO 9660
ISO 9660 volume labels allow only A-Z, 0-9, and underscore.
Dashes cause xorriso WARNING on every build.
EASY-BEE-NVIDIA → EASY_BEE_NVIDIA (iso-application keeps dashes, it's UDF).
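
The character rule described above can be sketched in Python. The helper name iso9660_volume_label is hypothetical, and the 32-character truncation is an addition based on the size of the ISO 9660 volume identifier field, not something this commit mentions.

```python
import re


def iso9660_volume_label(name: str) -> str:
    """Map an arbitrary name onto the A-Z, 0-9, underscore alphabet."""
    # Upper-case first, then replace every remaining disallowed character
    # (dashes, spaces, dots, ...) with an underscore.
    label = re.sub(r"[^A-Z0-9_]", "_", name.upper())
    # The volume identifier field holds at most 32 characters.
    return label[:32]
```

With this rule, "EASY-BEE-NVIDIA" becomes "EASY_BEE_NVIDIA", matching the rename in this commit.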

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:38:02 +03:00
f96b149875 fix(memtest): extract EFI binary from .deb cache if chroot/boot/ is empty
memtest86+ postinst does not place files in /boot in a live-build chroot
without grub triggers. Added fallback: extract directly from the cached
.deb via dpkg-deb -x, with verbose logging throughout.

Also remove "NVIDIA no MSI-X" from boot menu (premature — root cause unknown).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:30:52 +03:00
5ee120158e fix(build): remove unused variant package lists before lb build
live-build picks up ALL .list.chroot files in config/package-lists/.
After rsync, bee-nvidia.list.chroot, bee-amd.list.chroot, and
bee-nogpu.list.chroot all end up in BUILD_WORK_DIR — causing lb to
try installing packages from every variant (and leaving version
placeholders unsubstituted in the unused lists).

Fix: after copying bee-${BEE_GPU_VENDOR}.list.chroot → bee-gpu.list.chroot,
delete all other bee-{nvidia,amd,nogpu}.list.chroot from BUILD_WORK_DIR.
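
A rough Python sketch of the fix, for illustration only (the real logic lives in the shell build script). The function name select_gpu_package_list is hypothetical, and removing the selected vendor's source list after copying it is an assumption about what "all other" covers.

```python
import shutil
from pathlib import Path

VARIANTS = ("nvidia", "amd", "nogpu")


def select_gpu_package_list(list_dir: Path, vendor: str) -> Path:
    """Leave exactly one GPU package list for live-build to pick up."""
    if vendor not in VARIANTS:
        raise ValueError(f"unknown GPU vendor: {vendor!r}")
    target = list_dir / "bee-gpu.list.chroot"
    shutil.copyfile(list_dir / f"bee-{vendor}.list.chroot", target)
    # live-build installs from every *.list.chroot it finds, so remove the
    # per-variant lists (assumption: including the one just copied).
    for variant in VARIANTS:
        (list_dir / f"bee-{variant}.list.chroot").unlink(missing_ok=True)
    return target
```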

Also includes nomsi boot mode changes (bee-nvidia-load + grub.cfg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:03:42 +03:00
09fe0e2e9e feat(iso): add nogpu variant (no NVIDIA, no AMD/ROCm)
- build.sh: accept --variant nogpu; skips all GPU build steps, removes
  both nvidia-cuda and rocm archives, strips bee-nvidia-load and
  bee-nvidia.service from overlay
- build-in-container.sh: add nogpu to the --variant flag; the 'all' variant
  now also builds nogpu; --clean-build wipes live-build-work-nogpu
- 9000-bee-setup hook: nogpu path enables no GPU services
- bee-nogpu.list.chroot: empty GPU package list

Output: easy-bee-nogpu-vX.iso

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 22:49:25 +03:00
ace1a9dba6 feat(iso): split into nvidia and amd variants, fix KVM graphics and PATH
- build.sh: add --variant nvidia|amd; separate work dirs per variant
  (live-build-work-nvidia / live-build-work-amd); GPU-specific steps
  (modules, NCCL, cuBLAS, nccl-tests) run only for nvidia; deb package
  cache synced back to shared location after each lb build so second
  variant reuses downloaded packages; ISO output named
  easy-bee-{variant}-v{ver}-amd64.iso
- build-in-container.sh: add --variant nvidia|amd|all (default: all);
  runs build.sh twice in one container for 'all'; --clean-build wipes
  both variant work dirs
- package-lists: remove GPU packages from bee.list.chroot; add
  bee-nvidia.list.chroot (DCGM) and bee-amd.list.chroot (ROCm)
- 9000-bee-setup hook: read /etc/bee-gpu-vendor; enable bee-nvidia.service
  and DCGM only for nvidia; set up ROCm symlinks only for amd
- auto/config: --iso-volume uses BEE_GPU_VENDOR_UPPER env var
- grub.cfg: add nomodeset to EASY-BEE and EASY-BEE (load to RAM) entries
  — fixes X/lightdm on BMC KVM (ASPEED AST chip requires nomodeset for
  fbdev to work; NVIDIA H100 compute does not need KMS)
- bee.sh / smoketest.sh: add /usr/sbin to PATH so dmidecode, smartctl,
  nvme are found
- 9100-memtest hook: add diagnostic listing of chroot/boot/memtest* files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 22:24:37 +03:00
905c581ece fix(iso): substitute all ROCm package version placeholders in build.sh
ROCM_BANDWIDTH_TEST_VERSION, ROCM_VALIDATION_SUITE_VERSION, ROCBLAS,
ROCRAND, HIP_RUNTIME_AMD, HIPBLASLT, COMGR were defined in VERSIONS and
in bee.list.chroot but the sed substitution block only covered 3 of them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 22:00:05 +03:00
7c2a0135d2 feat(audit): add platform thermal cycling stress test
Runs CPU (stressapptest) + GPU stress simultaneously across multiple
load/idle cycles with varying idle durations (120s/60s/30s) to detect
cooling systems that fail to recover under repeated load.

Presets: smoke (~5 min), acceptance (~25 min), overnight (~100 min).
Outputs metrics.csv + summary.txt with per-cycle throttle and fan
spindown analysis, packed as tar.gz.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 21:57:33 +03:00
407c1cd1c4 fix(charts): unify timeline labels across graphs 2026-03-29 21:24:06 +03:00
e15bcc91c5 feat(metrics): persist history in sqlite and add AMD memory validate tests 2026-03-29 12:28:06 +03:00
98f0cf0d52 fix(amd-stress): include VRAM load in GST burn 2026-03-29 12:03:50 +03:00
4db89e9773 fix(metrics): correct chart padding order — right=80 not top=80
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:38:45 +03:00
3fda18f708 feat(metrics): SQLite persistence + chart fixes (no dots, peak label, min/avg/max in title)
- Add modernc.org/sqlite dependency; write every sample to
  /appdata/bee/metrics.db (WAL mode, prune to 24h on startup)
- Pre-fill ring buffers from last 120 DB rows on startup so charts
  survive service restarts
- Ticker changed 3s→1s; chart JS refresh will be set to 2s (lag ≤3s)
- Add GET /api/metrics/export.csv for full history download
- Chart rendering: SymbolNone (no dots), right padding=80px so peak
  mark line label is not clipped, min/avg/max appended to chart title
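
The persistence scheme in the first two bullets (WAL mode, 24h prune on startup, pre-filling from the last 120 rows) can be sketched with Python's built-in sqlite3 module. The project uses modernc.org/sqlite from Go; the table name samples, its two columns, and both function names are hypothetical.

```python
import sqlite3
import time
from collections import deque


def open_metrics_db(path, now=None, retention_s=24 * 3600):
    """Open the metrics DB in WAL mode and prune samples older than 24h."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")
    db.execute(
        "CREATE TABLE IF NOT EXISTS samples (ts REAL NOT NULL, cpu_temp REAL)"
    )
    cutoff = (time.time() if now is None else now) - retention_s
    db.execute("DELETE FROM samples WHERE ts < ?", (cutoff,))
    db.commit()
    return db


def prefill_ring(db, capacity=120):
    """Load the newest `capacity` rows, oldest first, into a ring buffer."""
    rows = db.execute(
        "SELECT ts, cpu_temp FROM samples ORDER BY ts DESC LIMIT ?",
        (capacity,),
    ).fetchall()
    return deque(reversed(rows), maxlen=capacity)
```

The deque's maxlen gives the ring-buffer behaviour: appending a new sample after the pre-fill silently drops the oldest one.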

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:37:59 +03:00
ea518abf30 feat(metrics): add global peak mark line to all live metric charts
Finds the series with the highest value across all datasets and adds
a SeriesMarkTypeMax dashed mark line to it. Since all series share the
same Y axis this effectively shows a single "global peak" line for the
whole chart with a label on the right.
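
The selection step can be sketched as below. This is a Python illustration with a hypothetical function name; the actual code attaches a SeriesMarkTypeMax mark line to the winning series.

```python
from typing import Dict, List, Optional, Tuple


def global_peak(series: Dict[str, List[float]]) -> Optional[Tuple[str, float]]:
    """Return (series name, peak value) for the highest sample overall."""
    best: Optional[Tuple[str, float]] = None
    for name, values in series.items():
        if not values:
            continue  # an empty series cannot hold the peak
        peak = max(values)
        if best is None or peak > best[1]:
            best = (name, peak)
    return best
```

Only the returned series needs a mark line; because every series shares one Y axis, that single line marks the peak of the whole chart.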

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:24:50 +03:00
744de588bb fix(burn): resolve rvs binary via /opt/rocm-*/bin glob like rocm-smi; add terminal copy button
rvs was not in PATH so the stress job exited immediately (UNSUPPORTED).
Now resolveRVSCommand searches /opt/rocm-*/bin/rvs before failing.
Also add a Copy button overlay on all .terminal elements and set
user-select:text so logs can be copied from the web UI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:20:46 +03:00
a3ed9473a3 fix(metrics): strip units from GPU legend names; fix fan SDR parsing for new IPMI format
Legend names were "GPU 0 %" — remove unit suffix since chart title already
conveys it. Fan parsing now handles the 5-field IPMI SDR format where the
value+unit ("4340 RPM") are combined in the last column rather than split
across separate fields.
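
A sketch of parsing the combined value+unit column, in Python. The sample line in the docstring is an assumed example of a 5-field, pipe-separated SDR row, and parse_fan_sdr_line is a hypothetical name.

```python
from typing import Optional, Tuple


def parse_fan_sdr_line(line: str) -> Optional[Tuple[str, float]]:
    """Parse e.g. 'FAN1 | 41h | ok | 29.1 | 4340 RPM' into ('FAN1', 4340.0)."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 5:
        return None
    # Value and unit share the last column, e.g. "4340 RPM".
    parts = fields[-1].split()
    if len(parts) != 2 or parts[1].upper() != "RPM":
        return None  # "No Reading", "disabled", non-fan sensors, ...
    try:
        return fields[0], float(parts[0])
    except ValueError:
        return None
```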

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:14:27 +03:00
a714c45f10 fix(metrics): parse rocm-smi CSV by header keywords, not column position
MI250X outputs 7 temperature columns before power/use%; positional parsing
read junction temp (~40°C) as GPU utilisation. Switch to header-based
colIdx() lookup so the correct fields are read regardless of column order
or rocm-smi version.
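
The header-based lookup can be sketched in Python. The project's colIdx() is Go, and its exact matching rule is not shown here, so the keyword matching below and the column names used for illustration are assumptions.

```python
import csv
import io
from typing import Dict, List


def col_idx(header: List[str], *keywords: str) -> int:
    """Index of the first column whose name contains every keyword, or -1."""
    for i, name in enumerate(header):
        low = name.lower()
        if all(k.lower() in low for k in keywords):
            return i
    return -1


def parse_gpu_use(csv_text: str) -> Dict[str, float]:
    """Map device name to GPU utilisation, whatever the column order."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return {}
    header = rows[0]
    dev, use = col_idx(header, "device"), col_idx(header, "use", "%")
    if dev < 0 or use < 0:
        return {}
    result = {}
    for row in rows[1:]:
        if len(row) <= max(dev, use):
            continue  # short or blank row
        try:
            result[row[dev]] = float(row[use])
        except ValueError:
            continue
    return result
```

Extra temperature columns ahead of the utilisation column no longer matter, because no field is read by position.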

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:10:13 +03:00
349e026cfa fix(webui): restore chart legend, remove GPU numeric table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:04:51 +03:00
889fe1dc2f fix: IPMI access for bee user + remove chart legend
- Add udev rule: /dev/ipmi0 readable by 'ipmi' group (no sudo needed)
- Add 'ipmi' group creation and bee user membership in chroot hook
- Remove legend from all charts (data shown in GPU table below)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:03:35 +03:00
befdbf3768 fix(iso): autoload ipmi_si/ipmi_devintf for fan/sensor monitoring
Without these modules /dev/ipmi0 doesn't exist and ipmitool can't
read fan RPM, PSU fans, or IPMI temperature sensors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:59:15 +03:00
ec6a0b292d fix(webui): fix sensor grouping and fan card visibility
- Tccd1-8 (AMD CCD die temps) now classified as 'cpu' group,
  appear on CPU Temperature chart instead of ambient
- Fan RPM card hidden when no fans detected
- Remove CPU Load/Mem Load/Power from fan table (have dedicated charts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:58:01 +03:00
a03312c286 feat: AMD GPU compute stress via rocm-validation-suite GST (GEMM)
- Add rocm-validation-suite, rocblas, rocrand, hip-runtime-amd,
  hipblaslt, comgr to ISO (~700MB, needed for HIP compute)
- RunAMDStressPack: run RVS GST (SGEMM ~31 TFLOPS/GPU) + bandwidth test
- Add rvs symlink in chroot setup hook
- Pin all new package versions in VERSIONS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:56:32 +03:00
e69e9109da fix(iso): set bash as default shell for bee user
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:49:18 +03:00
413869809d feat(iso): add rocm-bandwidth-test for AMD GPU burn-in
- Add rocm-bandwidth-test package to ISO
- Add bee user to 'render' group (/dev/kfd, /dev/dri/renderD* access)
- Add rocm-bandwidth-test symlink alongside rocm-smi

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:48:29 +03:00
f9bd38572a fix(network): strip linkdown/dead/onlink flags when restoring routes
ip route show includes state flags like 'linkdown' that ip route add
does not accept, causing restore to fail.
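
A minimal Python sketch of the filtering step. The function name is hypothetical; the flag set is taken from this commit's subject line.

```python
# State flags printed by `ip route show` that this commit strips before
# the saved line is replayed through `ip route add`.
ROUTE_STATE_FLAGS = {"linkdown", "dead", "onlink"}


def restorable_route(line: str) -> str:
    """Drop state flags from one line of `ip route show` output."""
    return " ".join(tok for tok in line.split() if tok not in ROUTE_STATE_FLAGS)
```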

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:39:16 +03:00
662e3d2cdd feat(webui): combined GPU charts (load/memload/power/temp all GPUs per chart)
Replace per-GPU cards with 4 combined charts showing all GPUs as
separate series. Add gpu-all-load/memload/power/temp endpoints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:37:33 +03:00
126af96780 fix(webui): slow metrics chart refresh to 3s interval
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:32:35 +03:00
ada15ac777 fix: loading screen via Go handler instead of file:// HTML
- bee-web.service: remove After=bee-audit so Go starts immediately
- Go serves loading page from / when audit JSON not yet present;
  JS polls /api/ready (503 until file exists, 200 when ready)
  then redirects to dashboard
- bee-openbox-session: wait for /healthz (Go binds fast <2s),
  open http://localhost/ directly — no file:// cross-origin issues
- Remove loading.html static file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:31:46 +03:00
dfb94f9ca6 feat(iso): loading screen while bee-web starts
Replace 15s blocking wait with instant Chromium launch showing a
dark loading page that polls /healthz every 500ms and auto-redirects
to the app when ready.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 09:33:04 +03:00
5857805518 fix(iso): copy memtest86+ to ISO root via binary hook
memtest files live in chroot /boot (inside squashfs) but GRUB needs
them on the ISO filesystem. Binary hook copies them out at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 09:02:40 +03:00
59a1d4b209 release: v3.1 2026-03-28 22:51:36 +03:00
0dbfaf6121 feat: dynamic CPU governor (performance during tasks, powersave at idle)
Switch to performance governor when task queue starts processing,
back to powersave when queue drains. Removes bee-cpuperf.service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:47:11 +03:00
5d72d48714 feat(iso): set CPU governor to performance on boot
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:45:37 +03:00
096b4a09ca feat(iso): add bare-metal performance kernel params
mitigations=off, transparent_hugepage=always, numa_balancing=disable,
nowatchdog, nosoftlockup — safe on single-user bare-metal LiveCD,
improves SAT/burn test throughput. fail-safe entry unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:44:21 +03:00
5d42a92e4c feat(iso): use legacy network names (eth0/eth1) via net.ifnames=0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:43:00 +03:00
3e54763367 docs: add iso-build-rules (verify package names before use)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:38:54 +03:00
f91bce8661 fix(iso): fix memtest86+ path (bookworm uses memtest86+x64.bin/.efi)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:38:15 +03:00
585e6d7311 docs: add validate-vs-burn hardware impact policy
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:15:33 +03:00
0a98ed8ae9 feat: task queue, UI overhaul, burn tests, install-to-RAM
- Task queue: all SAT/audit jobs enqueue and run one-at-a-time;
  tasks persist past page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
  Dashboard hardware summary badges + split metrics charts (load/temp/power);
  Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
  out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies squashfs to /dev/shm, re-associates loop devices
  via LOOP_CHANGE_FD ioctl so live media can be ejected
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
  disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:15:11 +03:00
911745e4da refactor(iso): replace chroot hooks for DCGM/ROCm with live-build apt sources
Move datacenter-gpu-manager and rocm-smi-lib from dynamic chroot hooks
into live-build's config/archives mechanism so lb caches the .deb files
in cache/packages.chroot/ between builds, eliminating repeated 900+ MB
downloads. Versions pinned via VERSIONS and substituted into package
lists at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 13:01:10 +03:00
acfd2010d7 fix(iso): remove firmware-chelsio-t4 (not in Debian bookworm)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:43:29 +03:00
e904c13790 fix(iso): remove --no-sandbox from chromium (runs as bee user, not root)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:40:42 +03:00
24c5c72cee feat(iso): add NIC firmware packages for broad hardware support
Adds firmware-misc-nonfree (Intel ice/i40e/igc), firmware-bnx2/bnx2x
(Broadcom), firmware-cavium (Marvell/QLogic), firmware-qlogic,
firmware-chelsio-t4, firmware-realtek to fix missing network on
physical servers with modern NICs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:38:22 +03:00
6ff0bcad56 feat(iso): show kernel logs on graphical console (remove quiet, loglevel=7)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:23:57 +03:00
4fef26000c fix(iso): replace invalid --compression with --chroot-squashfs-compression-type
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:23:00 +03:00
a393dcb731 feat(webui): add POST /api/sat/abort + update bible-local runtime-flows
- jobState now has optional cancel func; abort() calls it if job is running
- handleAPISATRun passes cancellable context to RunNvidiaAcceptancePackWithOptions
- POST /api/sat/abort?job_id=... cancels the running SAT job
- bible-local/runtime-flows.md: replace TUI SAT flow with Web UI flow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:23:00 +03:00
9e55728053 feat(iso): replace --clean-cache with --clean-build (cleans + rebuilds)
--clean-build clears all caches (Go, NVIDIA, lb packages, work dir)
and rebuilds the Docker image, then proceeds with a full clean build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:12:21 +03:00
4b8023c1cb feat(iso): add --clean-cache option to build-in-container.sh
Removes all cached build artifacts: Go cache, NVIDIA/NCCL/cuBLAS
downloads, lb package cache, and live-build work dir. Use before
a clean rebuild or when switching Debian/kernel versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:11:31 +03:00
4c8417d20a feat(webui): add Install to Disk page
Expose the existing bee-install script through the web UI:
- platform/install.go: remove USB exclusion, add SizeBytes/MountedParts
  fields, add MinInstallBytes()/DiskWarnings() safety checks (size,
  mounted partitions, toram+low-RAM warning)
- webui: add GET /api/install/disks, POST /api/install/run,
  GET /api/install/stream endpoints
- webui: add Install to Disk page with disk table, warning badges,
  device-name confirmation gate, SSE progress terminal, reboot button

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:11:16 +03:00
0755374dd2 perf(iso): speed up builds — zstd squashfs + preserve lb chroot cache
- Switch squashfs compression from xz to zstd (3-5x faster compression,
  ~10-15% larger but decompresses faster at boot)
- Stop rm -rf BUILD_WORK_DIR on each build; rsync only config changes
  so lb can reuse its chroot across builds (skips apt install step)
- Keep lb-packages cache in CACHE_ROOT as fallback if work dir is wiped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:10:29 +03:00
c70ae274fa revert(iso): remove apt-cacher-ng support, use lb package cache instead
apt-cacher-ng requires a separate container; lb's own package cache
persisted in --cache-dir is simpler and sufficient.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:02:34 +03:00
23ad7ff534 feat(iso): persist lb package cache across builds in cache dir
Saves cache/packages.chroot before wiping BUILD_WORK_DIR and
restores it after, so apt packages are not re-downloaded on every
build. Cache lives in --cache-dir (same place as Go/NVIDIA cache).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:59:55 +03:00
de130966f7 feat(iso): add APT_PROXY support to speed up builds via apt-cacher-ng
Pass APT_PROXY=http://host:3142 to build-in-container.sh to route
all apt traffic through a local cache. Also supports --apt-proxy flag.
Mirrors in auto/config are set from BEE_APT_PROXY env when present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:57:54 +03:00
c6fbfc8306 fix(boot): restore toram as menu option only, not default boot param
toram was incorrectly added to the default bootappend-live causing
every boot to copy the full ISO to RAM (slow on BMC virtual media).
Default boot reads squashfs from media; toram is available as a
separate menu entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:52:25 +03:00
35ad1c74d9 feat(iso): add slim hook to strip locales/man pages/apt cache from squashfs
Removes ~100-300MB from the squashfs: man pages, non-en locales,
python cache, apt lists and package cache, temp files and logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:44:02 +03:00
4a02e74b17 fix(iso): add git safe.directory so git describe sees v* tags inside container
Without this, git refuses to read the bind-mounted repo (UID mismatch)
and describe returns empty, causing the version to fall back to iso/v1.0.20.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:23:37 +03:00
cd2853ad99 fix(webui): fix viewer static path so Reanimator Chart CSS loads correctly
Mount chart submodule static assets at /static/ (matching the template's
hardcoded href), fix nav to include Audit Snapshot tab, remove dead
renderViewerPage code and iframe from Dashboard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:19:17 +03:00
6caf771d6e fix(boot): restore toram kernel parameter
Without toram the squashfs is read from the physical medium at runtime.
Disconnecting the USB/CD after boot causes SQUASHFS I/O errors on any
uncached block, making all X11 apps crash with SIGBUS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:04:37 +03:00
14fa87b7d7 feat(netconf): add input validation, 'b' to go back, 'a' to abort
- All prompts accept 'a' = abort, 'b' = back to previous step
- Interface input: validate numeric range and name existence, re-prompt on bad input
- IP address: regex check x.x.x.x/prefix format
- Gateway: regex check x.x.x.x format
- Main loop: 'b' at mode selection goes back to interface list
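
The address check can be sketched as below, in Python rather than the script's shell. The numeric range checks on octets and prefix go beyond the format-only regex described above and are an addition; the function name is hypothetical.

```python
import re

CIDR_RE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})$")


def valid_ipv4_cidr(text: str) -> bool:
    """Accept input shaped like x.x.x.x/prefix with in-range numbers."""
    match = CIDR_RE.match(text.strip())
    if not match:
        return False
    *octets, prefix = (int(group) for group in match.groups())
    # Range checks are an extra step on top of the format check.
    return all(octet <= 255 for octet in octets) and prefix <= 32
```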

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 07:31:23 +03:00
600ece911b fix(desktop): remove forced 1920x1080 modeline, limit LightDM restarts
On real server hardware (IPMI/BMC AST chip + nomodeset) the VESA
framebuffer is set by BIOS at whatever resolution it chooses (often
1024x768 or 1280x1024). The hardcoded 1920x1080 Modeline caused X to
fail → LightDM crash-loop → SOL console flooded with systemd messages.

- Remove Monitor section / Modeline from xorg.conf — fbdev now uses
  whatever framebuffer resolution the kernel provides
- Add lightdm.service.d/bee-limits.conf: RestartSec=10,
  max 3 restarts per 60s so headless hardware doesn't spam the console

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 07:30:51 +03:00
2d424c63cb fix(netconf): accept interface number as input, not just name
User sees a numbered list but could only type the name.
Now numeric input is resolved to the interface name via awk NR==N.
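
The same resolution, sketched in Python (the script itself does it with awk NR==N). The function name is hypothetical, and returning None for an out-of-range number or unknown name is an assumption about how bad input is signalled.

```python
from typing import List, Optional


def resolve_interface(choice: str, interfaces: List[str]) -> Optional[str]:
    """Accept either a 1-based list number or an interface name."""
    choice = choice.strip()
    if choice.isdecimal():
        n = int(choice)
        # List positions are 1-based, like awk's NR.
        return interfaces[n - 1] if 1 <= n <= len(interfaces) else None
    return choice if choice in interfaces else None
```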

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 07:27:49 +03:00
50f28d1ee6 chore: drop legacy TUI/dead code
- Delete audit/internal/app/panel.go (388 lines, zero callers — TUI panel remnant)
- Delete RenderGPULiveChart() from platform/gpu_metrics.go (~155 lines, never called)
- Move formatSATDetail/cleanSummaryKey helpers to app.go (still used)
- Update motd: replace bee-tui with Web UI hint
- Update journald.conf.d comment: remove bee-tui reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 07:27:30 +03:00
3579747ae3 fix(iso): prioritise v[0-9]* tags over iso/v* for ISO filename
Plain v2.x tags are now the active tagging scheme; iso/v1.0.x tags
are legacy. Swap priority in resolve_iso_version so the ISO is named
bee-debian12-v2.x-amd64.iso instead of v1.0.x-N-gHASH.
Also tighten the v* pattern to v[0-9]* to avoid accidentally matching
other prefixed tags in both resolve functions.
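
The priority swap can be sketched in Python with shell-style patterns. Here tags stands in for candidate tags ordered nearest first, which is an assumption; how the real resolve_iso_version obtains and post-processes tags is not shown in this commit.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Checked in priority order: plain v2.x tags first, legacy iso/v1.0.x second.
TAG_PATTERNS = ("v[0-9]*", "iso/v[0-9]*")


def resolve_iso_version(tags: Iterable[str], fallback: str = "") -> str:
    """Return the first tag matching the highest-priority pattern."""
    tags = list(tags)
    for pattern in TAG_PATTERNS:
        for tag in tags:
            if fnmatchcase(tag, pattern):
                return tag
    return fallback
```

Requiring a digit after the "v" is what stops a tag such as "vendor-x" from being mistaken for a version.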

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:34:09 +03:00
09dc7d2613 feat(webui): apply light theme from chart submodule CSS
Replace dark #0f1117 theme with clean white/Semantic-UI-inspired
design matching the updated internal/chart submodule: white surface,
dark sidebar (#1b1c1d), Lato font, blue accent (#2185d0), subtle
borders. Also update submodule pointer to latest commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:31:29 +03:00
ec0b7f7ff9 feat(metrics): single chart engine + full-width stacked layout
- One engine: go-analyze/charts (grafana theme) for all live metrics
- Server chart: CPU temp, CPU load%, mem load%, power W, fan RPMs
- GPU charts: temp, load%, mem%, power W — one card per GPU, added dynamically
- Charts 1400x280px SVG, rendered at width:100% in single-column layout
- Add CPU load (from /proc/stat) and mem load (from /proc/meminfo) to LiveMetricSample
- Add GPU mem utilization to GPUMetricRow (nvidia-smi utilization.memory)
- Document charting architecture in bible-local/architecture/charting.md
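
Deriving a CPU load percentage from /proc/stat takes two samples, since the counters are cumulative. A Python sketch under that assumption follows; the function names are hypothetical, and counting iowait as idle time is a common convention rather than something this commit specifies.

```python
from typing import Tuple


def cpu_times(stat_text: str) -> Tuple[int, int]:
    """Return (idle, total) jiffies from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            vals = [int(v) for v in line.split()[1:]]
            # Fields: user nice system idle iowait irq softirq steal ...
            idle = vals[3] + (vals[4] if len(vals) > 4 else 0)
            # Guest time is already included in user/nice, so stop at steal.
            return idle, sum(vals[:8])
    raise ValueError("no aggregate cpu line found")


def cpu_load_percent(prev_stat: str, cur_stat: str) -> float:
    """Busy share of CPU time between two /proc/stat snapshots."""
    idle0, total0 = cpu_times(prev_stat)
    idle1, total1 = cpu_times(cur_stat)
    delta_total = total1 - total0
    if delta_total <= 0:
        return 0.0
    return 100.0 * (delta_total - (idle1 - idle0)) / delta_total
```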

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:26:13 +03:00
e7a7ff54b9 chore: add Makefile with run/build/test targets
make run                          — starts web UI on :8080
make run LISTEN=:9090             — custom port
make run AUDIT_PATH=/tmp/bee.json — with audit data
make build / make test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:14:53 +03:00
b4371e291e fix(build): resolve ISO version from plain v* tags (e.g. v2.6)
resolve_iso_version only matched iso/v* pattern; GUI release tags
(v2, v2.1 ... v2.6) were ignored, falling back to the old v1.0.20
annotated tag via resolve_audit_version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:11:33 +03:00
c22b53a406 feat(boot): set 1920x1080 resolution for framebuffer and GRUB
- Add video=1920x1080 to kernel cmdline (sets fbdev to Full HD)
- Update GRUB gfxmode to 1920x1080 (fallback to 1280x1024,auto)
- Add Xorg Monitor section with 1920x1080 Modeline and preferred mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:10:18 +03:00
ff0acc3698 feat(webui): server-side SVG charts + reanimator-chart viewer
Metrics:
- Replace canvas JS charts with server-side SVG via go-analyze/charts
- Add ring buffers (120 samples) for CPU temp and power
- /api/metrics/chart/{name}.svg endpoint serves live SVG, polled every 2s

Dashboard:
- Replace custom renderViewerPage with viewer.RenderHTML() from reanimator/chart submodule
- Mount chart static assets at /chart/static/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:07:47 +03:00
d50760e7c6 fix(webui): remove emojis from nav, fix metrics chart sizing
- Remove all emojis from sidebar nav and logo (broken on server console fonts)
- Fix canvas chart: use parentElement.getBoundingClientRect() for width,
  set explicit H=120px — fixes empty charts when offsetWidth/Height is 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:49:09 +03:00
ed4f8be019 fix(webui): services table — show state badge, full status on click
Replace raw systemctl output in table cell with:
- state badge (active/failed/inactive) — click to expand
- full systemctl status in collapsible pre block (max 200px scroll)
Fixes layout explosion from multi-line status text in table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:47:59 +03:00
883592d029 feat(desktop): switch to LightDM for X startup (matches Ubuntu LiveCD)
startx from user shell has /dev/fb0 permission issues and is fragile.
LightDM starts Xorg as root — standard LiveCD approach that works
on server hardware / IPMI KVM with nomodeset + fbdev/vesa.

- Add lightdm package, configure autologin as bee/openbox session
- Add /usr/share/xsessions/openbox.desktop
- Remove startx from .profile (LightDM manages X lifecycle)
- Remove Xwrapper.config needs_root_rights workaround (no longer needed)
- Enable lightdm.service in setup hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:17:59 +03:00
a6dcaf1c7e fix(desktop): fix X permissions for server hardware (IPMI KVM)
- Add bee user to video,input groups (fixes /dev/fb0 permission denied)
- Add Xwrapper.config: needs_root_rights=yes (X gets hw access)
- Add xserver-xorg-video-vesa as fallback driver
- Remove dead bee-tui chmod from setup hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:07:25 +03:00
88727fb590 fix(desktop): don't exec startx — fall back to shell on X failure
If X fails to start, the user gets a working shell prompt instead
of a dead session or autologin loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:48:26 +03:00
c9f5224c42 feat(console): add netconf command for quick network setup
Interactive script: lists interfaces, DHCP or static IP config.
Shown as hint in tty1 welcome message.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:07:14 +03:00
7cb5c02a9b fix(desktop): force fbdev Xorg driver for server framebuffer
Explicit xorg.conf.d config prevents Xorg from trying KMS/DRM
drivers that fail on server hardware with nomodeset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:05:42 +03:00
c1aa3cf491 fix(desktop): start X on vt1 from .profile for IPMI KVM compatibility
startx from autologin shell targets VT1 directly — KVM sees the
graphical UI without VT switching. Remove bee-desktop.service
(systemd-launched X defaults to VT7, invisible on KVM).
Add xserver-xorg-video-fbdev for server AST/VGA framebuffer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:03:59 +03:00
f7eb75c57c fix(iso): replace grub-pc/grub-efi-amd64 with -bin variants to fix package conflict
grub-pc and grub-efi-amd64 conflict with each other in Debian 12.
The -bin packages provide the same grub-install binaries without conflict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:12:18 +03:00
004cc4910d feat(webui): replace TUI with full web UI + local openbox desktop
- Remove audit/internal/tui/ (~3000 LOC, bubbletea/lipgloss/reanimator deps)
- Add /api/* REST+SSE endpoints: audit, SAT (nvidia/memory/storage/cpu),
  services, network, export, tools, live metrics stream
- Add async job manager with SSE streaming for long-running operations
- Add platform.SampleLiveMetrics() for live fan/temp/power/GPU polling
- Add multi-page web UI (vanilla JS): Dashboard, Metrics charts, Tests,
  Burn-in, Network, Services, Export, Tools
- Add bee-desktop.service: openbox + Xorg + Chromium opening http://localhost/
- Add openbox/tint2/xorg/xinit/xterm/chromium to ISO package list
- Update .profile, bee.sh, and bible-local docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 19:21:14 +03:00
ed1cceed8c fix(boot): add nomodeset to fix black screen on server VGA/IPMI KVM (AST chip KMS) 2026-03-27 00:13:36 +03:00
9fe9f061f8 fix(nccl-tests): set LIBRARY_PATH so ld finds libnccl.so in nccl cache 2026-03-26 23:59:06 +03:00
837a1fb981 fix(nccl-tests): pin /usr/local/cuda→12.8 symlink, auto-detect gencode by nvcc version 2026-03-26 23:54:07 +03:00
1f43b4e050 fix(nccl-tests): pass NCCL_LIB from nccl cache to fix -lnccl link error 2026-03-26 23:52:25 +03:00
83bbc8a1bc fix(nccl-tests): upgrade to cuda-nvcc-12-8, add sm_100 (Blackwell B100/B200) 2026-03-26 23:51:26 +03:00
896bdb6ee8 fix(nccl-tests): use cuda-nvcc-12-6 to support Ampere/Volta (sm_70..sm_90) 2026-03-26 23:50:36 +03:00
5407c26e25 fix(nccl-tests): CUDA 13.0 supports only sm_90+ (Hopper/H100) 2026-03-26 23:49:45 +03:00
4fddaba9c5 fix(nccl-tests): limit CUDA gencode to sm_70+ (CUDA 13 dropped Pascal) 2026-03-26 23:48:40 +03:00
d2f384b6eb fix(nccl-tests): use plain make instead of non-existent all_reduce_perf target 2026-03-26 23:47:49 +03:00
25f0f30aaf fix(boot): fix black screen on monitor, stop log spam on console
- Add console=tty0 so VGA display gets kernel output (was serial-only)
- Change loglevel=7→3 (debug→errors only)
- Add quiet to suppress verbose kernel boot messages
- journald: ForwardToConsole=no so service logs don't flood tty1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:45:09 +03:00
a57b037a91 feat(installer): add 'Install to disk' in Tools submenu
Copies the live system to a local disk via unsquashfs — no debootstrap,
no network required. Supports UEFI (GPT+EFI) and BIOS (MBR) layouts.

ISO:
- Add squashfs-tools, parted, grub-pc, grub-efi-amd64 to package list
- New overlay script bee-install: partitions, formats, unsquashfs,
  writes fstab, runs grub-install+update-grub in chroot

Go TUI:
- Settings → Tools submenu (Install to disk, Check tools)
- Disk picker screen: lists non-USB, non-boot disks via lsblk
- Confirm screen warns about data loss
- Runs with live progress tail of /tmp/bee-install.log
- platform/install.go: ListInstallDisks, InstallToDisk, findLiveBootDevice

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:35:01 +03:00
5644231f9a feat(nccl): add nccl-tests all_reduce_perf for GPU bandwidth testing
- Dockerfile: install cuda-nvcc-13-0 from NVIDIA repo for compilation
- build-nccl-tests.sh: downloads libnccl-dev for nccl.h, builds all_reduce_perf
- build.sh: runs nccl-tests build, injects binary into /usr/local/bin/
- platform: RunNCCLTests() auto-detects GPU count, runs all_reduce_perf
- TUI: NCCL bandwidth test entry in Burn-in Tests screen [N] hotkey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:22:19 +03:00
eea98e6d76 feat(dcgm): add NVIDIA DCGM diagnostics, fix KVM console
- Add 9002-nvidia-dcgm.hook.chroot: installs datacenter-gpu-manager
  from NVIDIA apt repo during live-build
- Enable nvidia-dcgm.service in chroot setup hook
- Replace bee-gpu-stress with dcgmi diag (levels 1-4) in NVIDIA SAT
- TUI: replace GPU checkbox + duration UI with DCGM level selection
- Remove console=tty2 from boot params: KVM/VGA now shows tty1
  where bee-tui runs, fixing unresponsive console

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:08:12 +03:00
967455194c feat(iso): make toram optional, add 'load to RAM' boot menu entry
Default boot no longer loads ISO to RAM (slow on BMC virtual media).
Separate menu entry added for toram in both GRUB and isolinux.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 21:45:04 +03:00
79dabf3efb fix(build): link bee-gpu-stress with -lm for sqrt()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:55:14 +03:00
1336f5b95c fix(cublas): copy include dirs containing files without .h extension
nv/target has no .h suffix; use -type f instead of -name '*.h' to
detect non-empty include directories.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:53:08 +03:00
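The difference between the two checks in 1336f5b95c can be seen on a throwaway directory. This is a minimal sketch, assuming an illustrative include tree whose only file is `nv/target`; the real `copy_headers()` logic is not shown in this log.

```sh
#!/bin/sh
# Build a temporary include tree whose only file has no .h suffix.
# The layout is illustrative, not the actual package contents.
tmp=$(mktemp -d)
mkdir -p "$tmp/include/nv"
: > "$tmp/include/nv/target"

# Name-based check: counts only *.h files, so the directory looks empty.
by_name=$(find "$tmp/include" -name '*.h' | wc -l | tr -d ' ')

# Type-based check: counts any regular file, so the directory is detected.
by_type=$(find "$tmp/include" -type f | wc -l | tr -d ' ')

rm -rf "$tmp"
printf 'name match: %s, type match: %s\n' "$by_name" "$by_type"
```

The name-based count is zero while the type-based count is one, which is why a directory like `nv/` was previously skipped.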
31486a31c1 fix(cublas): add cuda-cccl package for nv/target header
cuda_fp16.h (included by cublas_api.h) requires <nv/target> from
the CUDA C++ Core Libraries (cuda-cccl-13-0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:49:46 +03:00
aa3fc332ba fix(cublas): check for .h in subdirs when copying non-standard include dirs
ls *.h missed headers in subdirectories like crt/host_defines.h;
use find -maxdepth 2 instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:47:39 +03:00
62c57b87f2 fix(cublas): allow version-free lookup for cuda-crt package
cuda-crt-13-0 may not share the same version string as cuda-cudart-13-0;
pass empty version to lookup_pkg to match the first available version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:46:45 +03:00
f600261546 fix(cublas): add cuda-crt package for crt/host_defines.h
cublasLt.h -> cublas_api.h -> driver_types.h -> crt/host_defines.h
which lives in the cuda-crt-13-0 package, not cudart-dev.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:42:40 +03:00
d7ca04bdfb fix(cublas): search all include/ dirs in deb for CUDA headers
NVIDIA CUDA .deb packages install headers under
/usr/local/cuda-X.Y/targets/x86_64-linux/include/ not /usr/include/,
causing copy_headers() to silently skip them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:35:21 +03:00
5433652c70 fix(cublas): prevent double-print in lookup_pkg awk END block
awk exit in the blank-line block jumps to END, which printed the
result again causing repo_sha to contain the hash twice with a newline,
breaking the sha256 string comparison.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:29:10 +03:00
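The double print described in 5433652c70 follows from awk semantics: `exit` inside a rule still runs the `END` block. A minimal sketch, assuming hypothetical `lookup_buggy` and `lookup_fixed` functions and made-up stanza data (the real `lookup_pkg` body is not shown in this log):

```sh
#!/bin/sh
# Illustrative Packages-style input: three stanzas separated by blank lines.
stanzas() {
    printf 'Package: a\nSHA256: 111\n\nPackage: b\nSHA256: 222\n\nPackage: c\nSHA256: 333\n'
}

# Buggy: prints in the blank-line rule, then exit runs END, which prints again.
lookup_buggy() {
    stanzas | awk -v want="$1" '
        $1 == "Package:" { pkg = $2 }
        $1 == "SHA256:"  { sha = $2 }
        /^$/ { if (pkg == want) { print sha; exit } }
        END  { if (pkg == want) print sha }'
}

# Fixed: remember that the result was already printed before exit.
lookup_fixed() {
    stanzas | awk -v want="$1" '
        $1 == "Package:" { pkg = $2 }
        $1 == "SHA256:"  { sha = $2 }
        /^$/ { if (pkg == want) { print sha; printed = 1; exit } }
        END  { if (!printed && pkg == want) print sha }'
}

printf 'buggy: [%s]\n' "$(lookup_buggy b)"
printf 'fixed: [%s]\n' "$(lookup_fixed b)"
```

The `END` rule is still needed for the last stanza, which has no trailing blank line, so the guard flag is used instead of deleting the block.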
b25f014dbd fix(cublas): strip CR from Packages.gz fields to fix sha256 comparison
Debian Packages.gz uses CRLF line endings; \r in the captured SHA256
field caused string comparison to fail even when hashes were identical.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:24:58 +03:00
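The comparison failure in b25f014dbd can be reproduced without any package index. A minimal sketch, assuming a hypothetical `strip_cr` helper and a placeholder hash value:

```sh
#!/bin/sh
# strip_cr removes carriage returns from a value read on stdin.
# The helper name is hypothetical; the build script's own code is not shown.
strip_cr() {
    tr -d '\r'
}

cr=$(printf '\r')
expected="abc123"        # placeholder, not a real SHA256
raw="abc123${cr}"        # field as captured from a CRLF-terminated line
clean=$(printf '%s' "$raw" | strip_cr)

if [ "$raw" = "$expected" ]; then echo "raw matches"; else echo "raw differs"; fi
if [ "$clean" = "$expected" ]; then echo "clean matches"; else echo "clean differs"; fi
```

Command substitution strips trailing newlines but not a trailing carriage return, so the raw value never compares equal even though it looks identical when printed.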
d69a46f211 fix(cublas): redirect diagnostic echo to stderr in download_verified_pkg
Echo messages captured in stdout polluted the return value of
download_verified_pkg(), causing extract_deb() to receive a
multi-line string instead of a file path and silently exit via set -e.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:22:39 +03:00
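The stdout pollution fixed in d69a46f211 is a general shell pitfall: everything a function writes to stdout becomes part of the value captured by `$(...)`. A minimal sketch, assuming hypothetical `fetch_pkg_noisy` and `fetch_pkg_clean` functions that stand in for `download_verified_pkg` (paths are illustrative):

```sh
#!/bin/sh
# Diagnostic goes to stdout, so the caller captures two lines.
fetch_pkg_noisy() {
    echo "downloading $1"
    echo "/tmp/cache/$1.deb"
}

# Diagnostic goes to stderr, so stdout carries only the file path.
fetch_pkg_clean() {
    echo "downloading $1" >&2
    echo "/tmp/cache/$1.deb"
}

noisy=$(fetch_pkg_noisy demo)
clean=$(fetch_pkg_clean demo 2>/dev/null)

printf 'noisy: [%s]\n' "$noisy"
printf 'clean: [%s]\n' "$clean"
```

With the noisy variant the caller receives a two-line string instead of a path, which matches the symptom described for `extract_deb()`.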
Mikhail Chusavitin
fc5c2019aa iso: improve burn-in, export, and live boot 2026-03-26 18:56:19 +03:00
Mikhail Chusavitin
67a215c66f fix(iso): route kernel logs to tty2, keep tty1 clean for TUI
console=tty0 sent kernel messages to the active VT (tty1), overwriting
the TUI. Changed to console=tty2 so kernel logs land on a dedicated
console. tty1 is now clean; operator can press Alt+F2 to inspect kernel
messages and Alt+F3 for an extra shell.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 17:40:44 +03:00
Mikhail Chusavitin
8b4bfdf5ad feat(tui): live GPU chart during stress test, full VRAM allocation
- GPU Platform Stress Test now shows a live in-TUI chart instead of nvtop.
  nvidia-smi is polled every second; up to 60 data points per GPU kept.
  All three metrics (Usage %, Temp °C, Power W) drawn on a single plot,
  each normalised to its own range and rendered in a different colour.
- Memory allocation changed from MemoryMB/16 to MemoryMB-512 (full VRAM
  minus 512 MB driver overhead) so bee-gpu-stress actually stresses memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 17:37:20 +03:00
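The effect of the allocation change in 8b4bfdf5ad is easiest to see with numbers. A small sketch of the two formulas, assuming an example card that reports 81920 MB (the function names are hypothetical):

```sh
#!/bin/sh
# Allocation size in MB for a GPU reporting $1 MB of total memory.
old_alloc_mb() { echo $(( $1 / 16 )); }     # previous rule: MemoryMB/16
new_alloc_mb() { echo $(( $1 - 512 )); }    # new rule: MemoryMB-512

total_mb=81920   # assumed example value
printf 'old: %s MB, new: %s MB\n' \
    "$(old_alloc_mb "$total_mb")" "$(new_alloc_mb "$total_mb")"
```

For this example the old rule allocates 5120 MB and the new rule 81408 MB.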
Mikhail Chusavitin
0a52a4f3ba fix(iso): restore loglevel=7 on VGA console for crash visibility
loglevel=3 was hiding all kernel messages on tty0/ttyS0 except errors.
Machine crashes (panics, driver oops, module failures) were silent on VGA.

Restored loglevel=7 so kernel messages up to debug are printed to both
tty0 (VGA) and ttyS0 (SOL). Journald MaxLevelConsole reduced to info
(was debug) to reduce noise on SOL while keeping it useful.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 11:19:07 +03:00
Mikhail Chusavitin
b132f7973a fix(iso): derive ISO filename from iso/v* tags, not audit/v*
Previously the ISO file was named after git describe --match 'audit/v*',
so a new iso/ tag produced names like v1.0.9-1-gXXXXXXX instead of v1.0.17.
Now build.sh has resolve_iso_version() that looks at iso/v* tags separately.
The bee binary inside the ISO still uses AUDIT_VERSION_EFFECTIVE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 11:05:51 +03:00
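The tag-to-version mapping behind b132f7973a might look roughly like this. It is only a sketch: `iso_version_from_tag` is a hypothetical helper, and the actual body of `resolve_iso_version()` is not shown in this log.

```sh
#!/bin/sh
# Strip the "iso/" namespace from a tag name to get the ISO version string.
# Input would typically come from something like:
#   git describe --tags --match 'iso/v*'
iso_version_from_tag() {
    printf '%s\n' "${1#iso/}"
}

iso_version_from_tag "iso/v1.0.17"
```

Matching on `iso/v*` rather than `audit/v*` means a freshly tagged ISO release resolves to its own tag instead of a distance-suffixed audit tag.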
Mikhail Chusavitin
bd94b6c792 fix(iso): add libnvidia-ptxjitcompiler + ldconfig for PTX JIT and NCCL
- build-nvidia-module.sh: copy libnvidia-ptxjitcompiler.so.* alongside
  libcuda/libnvidia-ml — required by cuModuleLoadDataEx for PTX JIT.
  Without it: CUDA_ERROR_JIT_COMPILER_NOT_FOUND at runtime.
  Cache check updated to force rebuild when ptxjitcompiler is missing.
- bee-nvidia-load: run ldconfig after module load so that NVIDIA/NCCL
  libs injected into /usr/lib/ are visible to dlopen() callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:37:27 +03:00
Mikhail Chusavitin
06017eddfd feat(tui): remove nvtop auto-launch from NVIDIA SAT
nvtop is no longer shown during NVIDIA SAT runs.
[o] Open nvtop shortcut also removed from the running screen.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:29:05 +03:00
Mikhail Chusavitin
0ac7b6a963 fix(iso): restore console=tty0 — VGA screen was black without it
Commit d36e844 dropped console=tty0 and added dual-serial + debug logging.
Without console=tty0 the kernel never initialises the VGA console,
leaving the physical screen permanently blank.

- Restore console=tty0 (VGA) as primary, keep console=ttyS0 for SOL
- Drop console=ttyS1 (redundant second serial port)
- Replace loglevel=7 + journald debug flood with loglevel=3 (errors only)
  so kernel messages don't overwrite the TUI on the local screen
- Remove systemd.log_target/forward_to_console debug params

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:23:53 +03:00
Mikhail Chusavitin
3d2ae4cdcb fix(iso): use Ubuntu jammy codename for AMD ROCm repo — Debian not supported
AMD does not publish Debian Bookworm packages at all (only focal/jammy/noble).
Switch ROCM_UBUNTU_DIST to "jammy"; jammy packages install cleanly on
Debian 12 due to compatible glibc. Also expand candidate list to include
point-releases (6.3.4, 6.3.3, …) so we pick the latest actually-published one.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:08:58 +03:00
Mikhail Chusavitin
4669f14f4f feat(tui): GPU Platform Stress Test — live nvtop chart during test
Apply the same pattern as NVIDIA SAT: launch nvtop via tea.ExecProcess
so it occupies the full terminal as a live GPU chart (temp, power, fan,
utilisation lines) while the stress test runs in the background.

- Add screenGPUStressRunning screen + dedicated running/render handlers
- startGPUStressTest: tea.Batch(stress goroutine, tea.ExecProcess(nvtop))
- [o] reopen nvtop at any time; [a] abort (cancels context)
- Graceful degradation: test still runs if nvtop is not on PATH
- gpuStressDoneMsg routes result to screenOutput on completion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:01:31 +03:00
Mikhail Chusavitin
540a9e39b8 refactor(audit): rename Fan Stress Test → GPU Platform Stress Test
Update all user-facing strings in TUI and ActionResult title.
Internal identifiers (types, functions, file name) unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:56:25 +03:00
Mikhail Chusavitin
58510207fa fix(iso): fall back through ROCm 6.4→6.3→6.2 if repo Release file missing
ROCm 6.4 does not yet publish a Release file for Debian Bookworm, causing
the live-build chroot hook to fail with "does not have a Release file".

Try each version in ROCM_CANDIDATES order; skip to the next if apt-get update
fails (repo unavailable). Exit gracefully if none are available.
Also rename inner 'candidate' variable to 'smi_path' to avoid collision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:52:17 +03:00
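The fallback order described in 58510207fa can be sketched as a generic "first candidate that passes a probe" loop. The names `pick_first_available` and `repo_is_published` are hypothetical; in the real hook the probe would be the `apt-get update` attempt.

```sh
#!/bin/sh
# pick_first_available PROBE CANDIDATE...
# Runs PROBE on each candidate in order and prints the first that succeeds.
# Returns non-zero when none succeed so the caller can skip gracefully.
pick_first_available() {
    probe=$1
    shift
    for candidate in "$@"; do
        if "$probe" "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Stub probe for the demo: pretend only the 6.3 repo has a Release file.
repo_is_published() {
    [ "$1" = "6.3" ]
}

if version=$(pick_first_available repo_is_published 6.4 6.3 6.2); then
    echo "using ROCm $version"
else
    echo "no ROCm repo available, skipping"
fi
```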
Mikhail Chusavitin
4cd7c9ab4e feat(audit): fan-stress SAT for MSI case-04 fan lag & thermal throttle detection
Two-phase GPU thermal cycling test with per-second telemetry:
- Phases: baseline → load1 → pause (no cooldown) → load2 → cooldown
- Monitors: fan RPM (ipmitool sdr), CPU/server temps (ipmitool/sensors),
  system power (ipmitool dcmi), GPU temp/power/usage/clock/throttle (nvidia-smi)
- Detects throttling via clocks_throttle_reasons.active bitmask
- Measures fan response lag from load start (validates case-04 ~2s lag)
- Exports metrics.csv (wide format, one row/sec) and fan-sensors.csv (long format)
- TUI: adds [F] Fan Stress Test to Health Check screen with Quick/Standard/Express modes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:51:03 +03:00
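Throttle detection from the `clocks_throttle_reasons.active` bitmask, as mentioned in 4cd7c9ab4e, comes down to a bitwise AND. A minimal sketch: the bit values follow the NVML throttle-reason constants, but which bits the SAT actually treats as throttling is an assumption here.

```sh
#!/bin/sh
# NVML throttle-reason bits used in this sketch:
#   0x08 HW slowdown, 0x20 SW thermal slowdown, 0x40 HW thermal slowdown.
# Treating exactly these three as "thermal" is an assumption.
THERMAL_MASK=$(( 0x08 | 0x20 | 0x40 ))

# is_thermal_throttled MASK
# MASK is the hex string reported by nvidia-smi, e.g. 0x0000000000000020.
is_thermal_throttled() {
    [ $(( $1 & THERMAL_MASK )) -ne 0 ]
}

for mask in 0x0000000000000001 0x0000000000000004 0x0000000000000020; do
    if is_thermal_throttled "$mask"; then
        echo "$mask: thermal throttle"
    else
        echo "$mask: no thermal throttle"
    fi
done
```

Bits such as GPU idle (0x01) or software power cap (0x04) are deliberately ignored in this sketch so that an idle baseline phase is not flagged.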
Mikhail Chusavitin
cfe255f6e4 Release audit/v1.0.5 2026-03-26 09:41:19 +03:00
Mikhail Chusavitin
8b9d3447d7 Overlay SAT results into audit JSON 2026-03-25 20:11:03 +03:00
Mikhail Chusavitin
614b7cad61 Improve PCIe inventory and hardware identity collection 2026-03-25 20:00:38 +03:00
Mikhail Chusavitin
9a1df9b1ba Tighten support bundles and fix AMD runtime checks 2026-03-25 19:35:25 +03:00
Mikhail Chusavitin
30cf014d58 Rename NVIDIA bootloader modes 2026-03-25 19:12:26 +03:00
Mikhail Chusavitin
27d478aed6 Add bootloader choice for safe vs full NVIDIA boot 2026-03-25 19:11:15 +03:00
Mikhail Chusavitin
d36e8442a9 Stabilize live ISO consoles and NVIDIA boot path 2026-03-25 19:05:18 +03:00
Mikhail Chusavitin
b345b0d14d Derive ISO version from git tags 2026-03-25 18:40:48 +03:00
Mikhail Chusavitin
0a1ac2ab9f Bootstrap ROCm hook prerequisites in ISO build 2026-03-25 18:38:19 +03:00
Mikhail Chusavitin
1e62f828c6 Embed MOTD banner into TUI 2026-03-25 18:11:17 +03:00
Mikhail Chusavitin
f8c997d272 Add missing SAT progress TUI helpers 2026-03-25 18:03:45 +03:00
Mikhail Chusavitin
0c16616cc9 1. Verbose live progress during SAT tests (CPU, Memory, Storage, AMD GPU)
- New tui/sat_progress.go: polls {DefaultSATBaseDir}/{prefix}-*/verbose.log every 300ms and parses completed/in-progress steps
- Busy screen now shows each step as PASS  lscpu (234ms) / FAIL  stress-ng (60.0s) / ...   sensors-after instead of just "Working..."

2. Test results shown on screen (instead of just "Archive written to /path")
- RunCPUAcceptancePackResult, RunMemoryAcceptancePackResult, RunStorageAcceptancePackResult, RunAMDAcceptancePackResult now read summary.txt from the run directory and return a formatted per-step result:

  Run: 2025-03-25T10:00:00Z

  PASS  lscpu
  PASS  sensors-before
  FAIL  stress-ng
  PASS  sensors-after

  Overall: FAILED  (ok=3  failed=1)

3. AMD GPU SAT with auto-detection
- platform.System.DetectGPUVendor(): checks /dev/nvidia0 → "nvidia", /dev/kfd → "amd"
- platform.System.RunAMDAcceptancePack(): runs rocm-smi, rocm-smi --showallinfo, dmidecode
- GPU SAT (G key / GPU row enter) automatically routes to AMD or NVIDIA based on detected vendor
- "Run All" also auto-detects vendor

4. Panel detail view
- GPU detail now shows the most recent (NVIDIA or AMD) SAT result, whichever is newer
- All SAT detail views use the same human-readable formatSATDetail format
2026-03-25 17:54:27 +03:00
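The per-step result format quoted in 0c16616cc9 can be rolled up into the "Overall" line with a few lines of awk. This is a sketch only: `summarize_steps` is a hypothetical helper, the real formatting lives in the Go code, and the `PASSED` verdict string is an assumption because the log only shows the `FAILED` case.

```sh
#!/bin/sh
# Reads "PASS  <step>" / "FAIL  <step>" lines on stdin, echoes them back,
# and appends an overall verdict line in the format shown in the commit.
summarize_steps() {
    awk '
        $1 == "PASS" { ok++ }
        $1 == "FAIL" { failed++ }
        { print }
        END {
            verdict = (failed > 0) ? "FAILED" : "PASSED"
            printf "\nOverall: %s  (ok=%d  failed=%d)\n", verdict, ok, failed
        }'
}

printf 'PASS  lscpu\nPASS  sensors-before\nFAIL  stress-ng\nPASS  sensors-after\n' \
    | summarize_steps
```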
Mikhail Chusavitin
adcc147b32 feat(iso): add AMD Instinct MI250X/MI250 driver support
- firmware-amd-graphics: Aldebaran firmware blobs (fixes amdgpu IB ring
  test errors on MI250/MI250X at boot)
- 9001-amd-rocm.hook.chroot: adds AMD ROCm 6.4 apt repo and installs
  rocm-smi-lib for GPU monitoring (analogous to nvidia-smi)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 15:42:10 +03:00
Mikhail Chusavitin
94e233651e fix(sat): fix nvme device-self-test command flags
--start is not a valid nvme-cli flag; correct syntax is -s 1 (short test).
Add --wait so the command blocks until the test completes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 15:24:52 +03:00
Mikhail Chusavitin
03c36f6cb2 fix(iso): add stress-ng to package list for CPU SAT
stress-ng was missing from the LiveCD — CPU acceptance test exited
immediately with rc=1 because the binary was not found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:50:30 +03:00
Mikhail Chusavitin
a221814797 fix(tui): fix GPU panel row showing AMD chipset devices, clear screen before TUI
isGPUDevice matched all AMD vendor PCIe devices (SATA, crypto coprocessors,
PCIe dummies) because of a broad strings.Contains(vendor,"amd") check.
Remove it — AMD Instinct/Radeon GPUs are caught by ProcessingAccelerator /
DisplayController class. Also exclude ASPEED (BMC VGA adapter).

Add clear before bee-tui to avoid dirty terminal output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:49:09 +03:00
Mikhail Chusavitin
b6619d5ccc fix(iso): skip NVIDIA module load when no NVIDIA GPU present
Check PCI vendor 10de before attempting insmod — avoids spurious
nvidia_uvm symbol errors on systems without NVIDIA hardware (e.g. AMD MI350).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:38:31 +03:00
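One way to implement the vendor check from b6619d5ccc is to scan the sysfs `vendor` files before touching any module. A minimal sketch, assuming a hypothetical `has_pci_vendor` helper; the actual check in `bee-nvidia-load` is not shown in this log and may use a different mechanism such as `lspci`.

```sh
#!/bin/sh
# has_pci_vendor SYSFS_ROOT VENDOR_ID
# Succeeds if any device directory under SYSFS_ROOT has a matching vendor file.
# sysfs reports vendor IDs as 0x-prefixed hex, e.g. 0x10de for NVIDIA.
has_pci_vendor() {
    sysfs_root=$1
    want=$2
    for vendor_file in "$sysfs_root"/*/vendor; do
        [ -r "$vendor_file" ] || continue
        if [ "$(cat "$vendor_file")" = "$want" ]; then
            return 0
        fi
    done
    return 1
}

# Typical use on a live system:
if has_pci_vendor /sys/bus/pci/devices 0x10de; then
    echo "NVIDIA GPU present, modules can be loaded"
else
    echo "no NVIDIA GPU, skipping module load"
fi
```

Taking the sysfs root as a parameter keeps the function testable against a fake directory tree.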
Mikhail Chusavitin
450193b063 feat(iso): remove splash.png, show EASY-BEE ASCII art in GRUB text mode
The graphical splash had "BEE / HARDWARE AUDIT" baked into the PNG,
overriding the echo ASCII art. Replace with a plain black background
so the EASY-BEE block-char banner from grub.cfg echo commands is visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:32:23 +03:00
Mikhail Chusavitin
ee8931f171 fix(iso): pin ISO kernel to same ABI as compiled NVIDIA modules
Export detected DEBIAN_KERNEL_ABI as BEE_KERNEL_ABI from build.sh so
auto/config can pin linux-packages to the exact versioned package
(e.g. linux-image-6.1.0-31 + flavour amd64 = linux-image-6.1.0-31-amd64).
This prevents nvidia.ko vermagic mismatch if the linux-image-amd64
meta-package is updated between build start and lb build chroot step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 12:26:59 +03:00
Mikhail Chusavitin
b771d95894 fix(iso): fix linux-packages to "linux-image" so lb appends flavour correctly
live-build constructs the kernel package as <linux-packages>-<linux-flavours>,
so "linux-image-amd64" + "amd64" = "linux-image-amd64-amd64" (not found).
The correct value is "linux-image" + "amd64" = "linux-image-amd64".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:45:41 +03:00
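The naming rule described in b771d95894 is plain string concatenation, which makes the failure easy to model. The `kernel_package` function below is a hypothetical illustration of that rule, not live-build's own code.

```sh
#!/bin/sh
# Model of the rule: <linux-packages>-<linux-flavours>
kernel_package() {
    printf '%s-%s\n' "$1" "$2"
}

kernel_package "linux-image-amd64" "amd64"      # the broken setting
kernel_package "linux-image" "amd64"            # the corrected setting
kernel_package "linux-image-6.1.0-31" "amd64"   # ABI-pinned variant
```

The first call yields `linux-image-amd64-amd64`, which does not exist in the archive; the other two yield installable package names.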
Mikhail Chusavitin
8e60e474dc feat(iso): rebrand to EASY-BEE with ASCII art banner
Replace "Bee Hardware Audit" branding with EASY-BEE across bootloader
and LiveCD: grub.cfg menu entries, echo ASCII art before menu,
motd banner, iso-volume and iso-application metadata.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:45:12 +03:00
Mikhail Chusavitin
2f4ec2acda fix(iso): auto-detect and install kernel headers at build time
- Dockerfile: linux-headers-amd64 meta-package instead of pinned ABI;
  remove DEBIAN_KERNEL_ABI build-arg (no longer needed at image build time)
- build-in-container.sh: drop --build-arg DEBIAN_KERNEL_ABI
- build.sh: apt-get update + detect ABI from apt-cache at build time;
  auto-install linux-headers-<ABI> if kernel changed since image build

Image rebuild is now needed only when changing Go version or lb tools,
not on every Debian kernel point release.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:25:29 +03:00
Mikhail Chusavitin
7ed5cb0306 fix(iso): auto-detect kernel ABI at build time instead of pinning
DEBIAN_KERNEL_ABI=auto in VERSIONS — build.sh queries
apt-cache depends linux-image-amd64 to find the current ABI.
lb config now uses linux-image-amd64 meta-package.

This prevents build failures when Debian drops old kernel packages
from the repo (happens with every point release).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:17:29 +03:00
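The ABI lookup described in 7ed5cb0306 can be sketched as a small parser over `apt-cache depends` output. The `parse_kernel_abi` helper and the sample output are assumptions for illustration; the real logic in build.sh is not shown in this log.

```sh
#!/bin/sh
# Extract the kernel ABI (e.g. 6.1.0-44) from the output of
#   apt-cache depends linux-image-amd64
# read on stdin. Only the first matching Depends line is used.
parse_kernel_abi() {
    sed -n 's/^ *Depends: linux-image-\([0-9][0-9.]*-[0-9][0-9]*\)-amd64$/\1/p' \
        | head -n 1
}

# Illustrative sample of what the command prints for the meta-package.
sample_output() {
    printf 'linux-image-amd64\n  Depends: linux-image-6.1.0-44-amd64\n'
}

abi=$(sample_output | parse_kernel_abi)
echo "detected ABI: $abi"
```

On the builder the parser would be fed by the real command, for example `apt-cache depends linux-image-amd64 | parse_kernel_abi`.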
Mikhail Chusavitin
6df7ac68f5 fix(iso): bump kernel ABI to 6.1.0-44 (6.1.164-1 in bookworm)
6.1.0-43 is no longer available in Debian repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:16:09 +03:00
Mikhail Chusavitin
0ce23aea4f feat(iso): add exfatprogs and ntfs-3g for USB export support
exFAT is the default filesystem on USB drives >32GB sold today.
Without exfatprogs, mount fails silently and export to such drives is broken.
ntfs-3g covers Windows-formatted drives.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:12:51 +03:00
Mikhail Chusavitin
36dff6e584 feat: CPU SAT via stress-ng + BMC version via ipmitool
BMC:
- collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent
- collector/collector.go: append BMC firmware record to snap.Firmware
- app/panel.go: show BMC version in TUI right-panel header alongside BIOS

CPU SAT:
- platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng
- app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated
- app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data
- tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label
- tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult
- tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration
- cmd/bee/main.go: bee sat cpu [--duration N] subcommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:06:12 +03:00
Mikhail Chusavitin
1c80906c1f feat(tui): rebuild TUI around hardware diagnostics (Health Check + two-column layout)
- Replace 12-item flat menu with 4-item main menu: Health Check, Export support bundle, Settings, Exit
- Add Health Check screen (Lenovo-style): per-component checkboxes (GPU/MEM/DISK/CPU), Quick/Standard/Express modes, Run All, letter hotkeys G/M/S/C/R/A/1/2/3
- Add two-column main screen: left = menu, right = hardware panel with colored PASS/FAIL/CANCEL/N/A status per component; Tab/→ switches focus, Enter opens component detail
- Add app.LoadHardwarePanel() + ComponentDetailResult() reading audit JSON and SAT summary.txt files
- Move Network/Services/audit actions into Settings submenu
- Export: support bundle only (remove separate audit JSON export)
- Delete screen_acceptance.go; add screen_health_check.go, screen_settings.go, app/panel.go
- Add BMC + CPU stress-ng tests to backlog
- Update bible submodule
- Rewrite tui_test.go for new screen/action structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 10:59:21 +03:00
Mikhail Chusavitin
2abe2ce3aa fix(iso): fix NCCL version to 2.28.9+cuda13.0, add sha256 verification
NVIDIA's CUDA repo for Debian 12 only has NCCL packages for cuda13.x,
not cuda12.x. Update to the latest available: 2.28.9-1+cuda13.0.
Also pass sha256 from VERSIONS into build-nccl.sh for integrity check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 12:04:03 +03:00
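The integrity check mentioned in 2abe2ce3aa amounts to comparing a computed digest with the pinned one. A minimal sketch, assuming a hypothetical `verify_sha256` helper and GNU `sha256sum` on the builder; how build-nccl.sh implements the check is not shown in this log.

```sh
#!/bin/sh
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only when the SHA-256 digest of FILE equals EXPECTED_HEX.
verify_sha256() {
    file=$1
    expected=$2
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Demo on an empty file, whose SHA-256 digest is a well-known constant.
demo=$(mktemp)
if verify_sha256 "$demo" e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855; then
    echo "checksum ok"
else
    echo "checksum mismatch" >&2
fi
rm -f "$demo"
```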
Mikhail Chusavitin
8233c9ee85 feat(iso): add NCCL 2.26.2 to LiveCD
Download libnccl2 .deb from NVIDIA's CUDA apt repo (Debian 12) during ISO
build, extract libnccl.so.* into the overlay at /usr/lib/ alongside
libnvidia-ml and libcuda. Version pinned in VERSIONS, reflected in
/etc/bee-release.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 09:51:28 +03:00
Mikhail Chusavitin
13189e2683 fix(iso): pet hardware watchdog via systemd RuntimeWatchdogSec=30s
Without a keepalive the kernel watchdog timer expires and reboots
the host mid-audit. Configuring RuntimeWatchdogSec lets systemd PID 1
reset /dev/watchdog every 30 s — well within the typical 60 s timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 23:56:42 +03:00
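The watchdog setting from 13189e2683 belongs in the `[Manager]` section of systemd's system configuration. A sketch of writing it as a drop-in: the `write_watchdog_dropin` helper and the `10-watchdog.conf` file name are hypothetical, and the ISO may place the setting elsewhere.

```sh
#!/bin/sh
# write_watchdog_dropin CONF_DIR
# Writes a drop-in that makes PID 1 pet the hardware watchdog every 30 s.
# On a real system CONF_DIR would be /etc/systemd/system.conf.d.
write_watchdog_dropin() {
    conf_dir=$1
    mkdir -p "$conf_dir"
    cat > "$conf_dir/10-watchdog.conf" <<'EOF'
[Manager]
RuntimeWatchdogSec=30s
EOF
}

# Demo against a scratch directory so nothing on the host is modified.
scratch=$(mktemp -d)
write_watchdog_dropin "$scratch/system.conf.d"
cat "$scratch/system.conf.d/10-watchdog.conf"
rm -rf "$scratch"
```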
Mikhail Chusavitin
76a17937f3 feat(tui): NVIDIA SAT with nvtop, GPU selection, metrics and chart — v1.0.0
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 15:18:57 +03:00
Mikhail Chusavitin
b965184e71 feat: wrap chart viewer in web shell 2026-03-16 18:26:05 +03:00
Mikhail Chusavitin
b25a2f6d30 feat: add support bundle and raw audit export 2026-03-16 18:20:26 +03:00
Mikhail Chusavitin
d18cde19c1 Drop legacy non-container builders 2026-03-16 00:23:55 +03:00
Mikhail Chusavitin
78c6dfc0ef Sync hardware ingest contract v2.7 2026-03-15 23:03:38 +03:00
Mikhail Chusavitin
72cf482ad3 Embed Reanimator Chart web viewer 2026-03-15 22:07:42 +03:00
Mikhail Chusavitin
a6023372b1 Use microcode as CPU firmware 2026-03-15 21:16:17 +03:00
Mikhail Chusavitin
ab5a4be7ac Align hardware export with ingest contract 2026-03-15 21:04:53 +03:00
Mikhail Chusavitin
b8c235b5ac Add TUI hardware banner and polish SAT summaries 2026-03-15 14:27:01 +03:00
Mikhail Chusavitin
b483e2ce35 Add health verdicts and acceptance tests 2026-03-14 17:53:58 +03:00
Mikhail Chusavitin
17f0bda45e Update docs for current LiveCD flow 2026-03-14 16:28:30 +03:00
Mikhail Chusavitin
591164a251 Rename ISO volume to BEE 2026-03-14 14:58:49 +03:00
Mikhail Chusavitin
ef4ec5695d Remove broken TUI log redirection 2026-03-14 14:57:31 +03:00
Mikhail Chusavitin
f1e096cabe Keep live TUI logs off the console 2026-03-14 14:51:25 +03:00
Mikhail Chusavitin
6082c7953e Add console tools and bee menu startup 2026-03-14 08:36:38 +03:00
Mikhail Chusavitin
f37ef0d844 Run live TUI as root via sudo 2026-03-14 08:34:23 +03:00
Mikhail Chusavitin
e32fa6e477 Use live-config autologin for bee user 2026-03-14 08:33:36 +03:00
Mikhail Chusavitin
20118bb400 Fix tty1 autologin override order 2026-03-14 08:17:23 +03:00
Mikhail Chusavitin
55d6876297 Avoid tty1 black screen on live boot 2026-03-14 08:14:49 +03:00
Mikhail Chusavitin
e8e176ab7f Add zstd to live image packages 2026-03-14 08:04:18 +03:00
Mikhail Chusavitin
caeafa836b Improve VM boot diagnostics and guest support 2026-03-14 07:51:16 +03:00
Mikhail Chusavitin
e8a52562e7 Persist builder caches outside container 2026-03-14 07:40:32 +03:00
Mikhail Chusavitin
6aca1682b9 Refactor bee CLI and LiveCD integration 2026-03-13 16:52:16 +03:00
Mikhail Chusavitin
b7c888edb1 fix: getty autologin root, inject GSP firmware for H100, bump 0.1.1 2026-03-08 22:12:02 +03:00
Mikhail Chusavitin
17d5d74a8d fix: nomodeset + remove splash (framebuffer hangs on headless H100 server) 2026-03-08 21:39:31 +03:00
Mikhail Chusavitin
d487e539bb fix: use sudo git checkout to reset root-owned build artifacts 2026-03-08 20:54:15 +03:00
Mikhail Chusavitin
441ab3adbd fix: blacklist nouveau driver (hangs on H100 unknown chipset) 2026-03-08 20:51:49 +03:00
Mikhail Chusavitin
c91c8d8cf9 feat: bee-themed grub splash (amber/black honeycomb) with progress bar 2026-03-08 20:44:19 +03:00
Mikhail Chusavitin
83e1910281 feat: custom grub bootloader - bee branding, 5s auto-boot, no splash 2026-03-08 20:35:23 +03:00
Mikhail Chusavitin
2252c5af56 fix: use isc-dhcp-client for dhclient, remove standalone lsblk (in util-linux) 2026-03-08 19:43:59 +03:00
Mikhail Chusavitin
7a4d75c143 fix: remove unsupported --hostname/--username from lb config
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 19:28:01 +03:00
Mikhail Chusavitin
7c62d100d4 fix: use SYSSRC=common SYSOUT=amd64 for NVIDIA build on Debian split headers
Debian 12 splits kernel headers into two packages:
  linux-headers-<kver>        (arch-specific: generated/, config/)
  linux-headers-<kver>-common (source headers: linux/, asm-generic/, etc.)

NVIDIA conftest.sh builds include paths as HEADERS=$SOURCES/include.
When SYSSRC=amd64, HEADERS=amd64/include/ which is nearly empty —
conftest can't compile any kernel header tests, all compile-tests fail
silently, and NVIDIA assumes all kernel APIs are present. This causes
link errors for APIs added in kernel 6.3+ (vm_flags_set, vm_flags_clear)
and removed APIs (phys_to_dma, dma_is_direct, get_dma_ops).

Fix: pass SYSSRC=common (real headers) and SYSOUT=amd64 (generated headers).
NVIDIA Makefile maps SYSSRC→NV_KERNEL_SOURCES, SYSOUT→NV_KERNEL_OUTPUT,
and runs 'make -C common KBUILD_OUTPUT=amd64'. Conftest then correctly
detects which APIs are present in kernel 6.1 and uses proper wrappers.

Tested: 5 .ko files built successfully on Debian 12 kernel 6.1.0-43-amd64.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 19:23:47 +03:00
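The split-header fix in 7c62d100d4 comes down to passing two different directories to the NVIDIA build. A sketch of how the arguments might be assembled: `nvidia_make_args` is a hypothetical helper, and the `/usr/src/linux-headers-*` paths are assumed from the usual Debian header package layout rather than taken from the build script.

```sh
#!/bin/sh
# nvidia_make_args KVER
# KVER is the kernel ABI without the flavour suffix, e.g. 6.1.0-43.
# SYSSRC points at the -common package (source headers),
# SYSOUT at the flavour package (generated headers and config).
nvidia_make_args() {
    kver=$1
    printf 'SYSSRC=/usr/src/linux-headers-%s-common SYSOUT=/usr/src/linux-headers-%s-amd64\n' \
        "$kver" "$kver"
}

nvidia_make_args 6.1.0-43
```

The printed pair would be appended to the `make` invocation for the NVIDIA kernel modules.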
Mikhail Chusavitin
c843ff95a2 fix: add -Wno-error to CFLAGS_MODULE for NVIDIA kernel 6.1 compat
get_dma_ops() return type changed in kernel 6.1 — GCC treats int-conversion
warning as error. Suppress with -Wno-error to allow build to complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:55:25 +03:00
Mikhail Chusavitin
0057686769 fix: pass GCC include dir to NVIDIA make to resolve stdarg.h not found
Debian kernel build uses -nostdinc which strips GCC's own includes.
NVIDIA's nv_stdarg.h needs <stdarg.h> from GCC.
Pass -I$(gcc --print-file-name=include) via CFLAGS_MODULE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:53:37 +03:00
Mikhail Chusavitin
68b5e02a74 fix: run-builder.sh uses BUILDER_USER from .env, not hardcoded
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:48:33 +03:00
Mikhail Chusavitin
fa553c3f20 fix: update DEBIAN_KERNEL_ABI to 6.1.0-43 (actual kernel on build host)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:35:44 +03:00
Mikhail Chusavitin
345a93512a migrate ISO build from Alpine to Debian 12 (Bookworm)
Replace the entire live CD build pipeline:
- Alpine SDK + mkimage + genapkovl → Debian live-build (lb config/build)
- OpenRC init scripts → systemd service units
- dropbear → openssh-server (native to Debian live)
- udhcpc → dhclient for DHCP
- apk → apt-get in setup-builder.sh and build-nvidia-module.sh
- Add auto/config (lb config options) and auto/build wrapper
- Add config/package-lists/bee.list.chroot replacing Alpine apks
- Add config/hooks/normal/9000-bee-setup.hook.chroot to enable services
- Add bee-nvidia-load and bee-sshsetup helper scripts
- Keep NVIDIA pre-compile pipeline (Option B): compile on builder VM against
  pinned Debian kernel headers (DEBIAN_KERNEL_ABI), inject .ko into includes.chroot
- Fixes: native glibc (no gcompat shims), proper udev, writable /lib/modules,
  no Alpine modloop read-only constraint, no stale apk cache issues

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:01:38 +03:00
180 changed files with 25634 additions and 2248 deletions


@@ -1 +1,2 @@
BUILDER_HOST=
BUILDER_USER=

3
.gitmodules vendored

@@ -1,3 +1,6 @@
[submodule "bible"]
path = bible
url = https://git.mchus.pro/mchus/bible.git
[submodule "internal/chart"]
path = internal/chart
url = https://git.mchus.pro/reanimator/chart.git

395
PLAN.md

@@ -4,13 +4,13 @@ Hardware audit LiveCD for offline server inventory.
Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
**Principle:** OS-level collection — reads hardware directly, not through BMC.
Fully unattended — no user interaction required at any stage. Boot → update → audit → output → done.
All errors are logged, never presented interactively. Every failure path has a silent fallback.
Automatic boot audit plus operator console. Boot runs audit immediately, but local/SSH operators can rerun checks through the TUI and CLI.
Errors are logged; a partial collector failure must not block boot.
Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials, physical disks behind RAID, full SMART, NIC firmware.
---
## Status snapshot (2026-03-06)
## Status snapshot (2026-03-14)
### Phase 1 — Go Audit Binary
@@ -23,33 +23,38 @@ Fills the gaps where logpile/Redfish is blind: NVMe, DIMM serials, GPU serials,
- 1.7 PSU collector — **DONE (basic FRU path)**
- 1.8 NVIDIA GPU enrichment — **DONE**
- 1.8b Component wear / age telemetry — **DONE** (storage + NVMe + NVIDIA + NIC SFP/DOM + NIC packet stats)
- 1.8c Storage health verdicts — **DONE** (SMART/NVMe warning/failed status derivation)
- 1.9 Mellanox/NVIDIA NIC enrichment — **DONE** (mstflint + ethtool firmware fallback)
- 1.10 RAID controller enrichment — **DONE (initial multi-tool support)** (storcli + sas2/3ircu + arcconf + ssacli + VROC/mdstat)
- 1.11 Output and USB write — **DONE** (usb + /tmp fallback)
- 1.11 PSU SDR health — **DONE** (`ipmitool sdr` merged with FRU inventory)
- 1.11 Output and export workflow — **DONE** (explicit file output + manual removable export via TUI)
- 1.12 Integration test (local) — **DONE** (`scripts/test-local.sh`)
### Phase 2 — Alpine LiveCD
### Phase 2 — Debian Live ISO
- Debug ISO track is active (builder + overlay-debug + OpenRC services + TUI workflow).
- Production ISO track — **IN PROGRESS**.
- 2.3 Alpine mkimage profile — **DONE (production profile scaffold)**
- 2.4 Network bring-up on boot — **DONE**
- 2.5 OpenRC boot service (bee-audit) — **DONE** (with explicit bee-nvidia ordering)
- 2.6 Vendor utilities in overlay — **DONE (fetch script + iso/vendor scaffold)**
- 2.7 Auto-update wiring (USB first, network second) — **PARTIAL** (shell flow done; strict Ed25519 verification intentionally deferred to final stage)
- 2.8 Release workflow — **PARTIAL** (production build now injects audit binary, NVIDIA modules/tools, vendor tools, and build metadata)
- Current implementation uses Debian 12 `live-build`, `systemd`, and OpenSSH.
- Network bring-up on boot — **DONE**
- Boot services (`bee-network`, `bee-nvidia`, `bee-audit`, `bee-sshsetup`) — **DONE**
- Local console UX (`bee` autologin on `tty1`, `menu` auto-start, TUI privilege escalation via `sudo -n`) — **DONE**
- VM/debug support (`qemu-guest-agent`, serial console, virtual GPU initramfs modules) — **DONE**
- Vendor utilities in overlay — **DONE**
- Build metadata + staged overlay injection — **DONE**
- Builder container cache persisted outside container writable layer — **DONE**
- ISO volume label `BEE` — **DONE**
- Auto-update flow remains deferred; current focus is deterministic offline audit ISO behavior.
- Real-hardware validation remains **PENDING**; current validation is limited to local/libvirt VM boot + service checks.
---
## Phase 1 — Go Audit Binary
Self-contained static binary. Runs on any Linux (including Alpine LiveCD).
Self-contained static binary. Runs on any Linux (including the Debian live ISO).
Calls system utilities, parses their output, produces `HardwareIngestRequest` JSON.
### 1.1 — Project scaffold
- `audit/go.mod` — module `bee/audit`
- `audit/cmd/audit/main.go` — CLI entry point: flags, orchestration, JSON output
- `audit/cmd/bee/main.go` — main CLI entry point: subcommands, runtime selection, JSON output
- `audit/internal/schema/` — copy of `HardwareIngestRequest` types from core (no import dependency)
- `audit/internal/collector/` — empty package stubs for all collectors
- `const Version = "1.0"` in main
@@ -237,305 +242,143 @@ No hardcoded vendor names in detection logic — pure PCI vendor_id map.
Tests: table tests with storcli/sas2ircu text fixtures
### 1.11 — Output and USB write
### 1.11 — Output and export workflow
`--output stdout` (default): pretty-printed JSON to stdout
`--output file:<path>`: write JSON to explicit path
`--output usb`: auto-detect first removable block device, mount it, write `audit-<board_serial>-<YYYYMMDD-HHMMSS>.json`
USB detection: scan `/sys/block/*/removable`, pick first `1`, mount to `/tmp/bee-usb`
Live ISO default service output: `/var/log/bee-audit.json`
QR summary to stdout (always): board serial + model + component counts — fits in one QR code
Uses `qrencode` if present, else skips silently
Removable-media export is manual via `bee tui` (or the LiveCD wrapper `bee-tui`):
- operator chooses a removable filesystem explicitly
- TUI mounts it if needed
- TUI asks for confirmation before copying the JSON
- TUI unmounts temporary mountpoints after export
No auto-write to arbitrary removable media is allowed.
### 1.12 — Integration test (local)
`scripts/test-local.sh` — runs audit binary on developer machine (Linux), captures JSON,
`scripts/test-local.sh` — runs `bee audit` on developer machine (Linux), captures JSON,
validates required fields are present (board.serial_number non-empty, cpus non-empty, etc.)
Not a unit test — requires real hardware access. Documents how to run for verification.
---
## Phase 2 — Alpine LiveCD
## Phase 2 — Debian Live ISO
ISO image bootable via BMC virtual media. Runs audit binary automatically on boot.
ISO image bootable via BMC virtual media or USB. Runs boot services automatically and writes the audit result to `/var/log/bee-audit.json`.
### 2.1 — Builder environment
`iso/builder/Dockerfile` — Alpine 3.21 build environment with:
- `alpine-sdk`, `abuild`, `squashfs-tools`, `xorriso`
- Go toolchain (for binary compilation inside builder)
- NVIDIA driver `.run` pre-fetched during image build
`iso/builder/build-in-container.sh` is the only supported builder entrypoint.
It builds a Debian 12 builder image with `live-build`, toolchains, and pinned kernel headers,
then runs the ISO assembly in a privileged container because `live-build` needs
mount/chroot/loop capabilities.
`iso/builder/build.sh` orchestrates full ISO build:
1. Compile Go binary (static, `CGO_ENABLED=0`)
2. Compile NVIDIA kernel module against Alpine 3.21 LTS kernel headers
3. Run `mkimage.sh` with bee profile
4. Output: `dist/bee-<version>.iso`
`iso/builder/build.sh` orchestrates the full ISO build:
1. compile the Go `bee` binary
2. create a staged overlay under `dist/overlay-stage`
3. inject SSH auth, vendor tools, NVIDIA artifacts, and build metadata into the staged overlay
4. create a disposable `live-build` workdir under `dist/live-build-work`
5. sync the staged overlay into `config/includes.chroot/`
6. run `lb config && lb build`
7. copy the final ISO into `dist/`
### 2.2 — NVIDIA driver build
Alpine 3.21, LTS kernel 6.6 — fixed versions in builder.
`iso/builder/build-nvidia-module.sh`:
- downloads the pinned NVIDIA `.run` installer
- verifies SHA256
- builds kernel modules against the pinned Debian kernel ABI
- caches modules, userspace tools, and libs in `dist/nvidia-<version>-<kver>/`
`iso/builder/build-nvidia.sh`:
- Download `NVIDIA-Linux-x86_64-<ver>.run` (version pinned in `iso/builder/VERSIONS`)
- Extract kernel module sources
- Compile against `linux-lts-dev` headers
- Strip and package as `nvidia-<ver>-k6.6.ko.tar.gz` for inclusion in overlay
`iso/overlay/usr/local/bin/bee-nvidia-load`:
- loads `nvidia`, `nvidia-modeset`, `nvidia-uvm` via `insmod`
- creates `/dev/nvidia*` nodes if the driver registered successfully
- logs failures but does not block the rest of boot
`iso/overlay/usr/local/bin/load-nvidia.sh`:
- `insmod` sequence: nvidia.ko → nvidia-modeset.ko → nvidia-uvm.ko
- Verify: `nvidia-smi -L` → log result
- On failure: log warning, continue (audit runs without GPU enrichment)
### 2.3 — ISO assembly and overlay policy
### 2.3 — Alpine mkimage profile
`iso/overlay/` is source-only input for the build.
`iso/builder/mkimg.bee.sh` — Alpine mkimage profile:
- Base: `alpine-base`
- Kernel: `linux-lts`
- Packages: `dmidecode smartmontools nvme-cli pciutils ipmitool util-linux e2fsprogs qrencode`
- Overlay: `iso/overlay/` included as apkovl
Build-time files are injected into the staged overlay only:
- `bee`
- `bee-smoketest`
- `authorized_keys`
- password-fallback marker
- `/etc/bee-release`
- vendor tools from `iso/vendor/`
### 2.4 — Network bring-up on boot
The source tree must stay clean after a build.
`iso/overlay/usr/local/bin/bee-network.sh`:
- Enumerate all network interfaces: `ip link show` → filter out loopback and virtual (docker/bridge)
- For each physical interface: `ip link set <iface> up` + `udhcpc -i <iface> -t 5 -T 3 -n`
- Log each interface result (got IP / timeout / no carrier)
- Continue regardless — network is best-effort for auto-update
### 2.4 — Boot services
`iso/overlay/etc/init.d/bee-network`:
- runlevel: default, before: bee-update
- Calls bee-network.sh
- Does not block boot if DHCP fails on all interfaces
`systemd` service order:
- `bee-sshsetup.service` → configures SSH auth before `ssh.service`
- `bee-network.service` → starts best-effort DHCP on all physical interfaces
- `bee-nvidia.service` → loads NVIDIA modules if present
- `bee-audit.service` → runs audit and logs failures without turning partial collector bugs into a boot blocker
### 2.5 — OpenRC boot service (bee-audit)
### 2.4b — Runtime split
`iso/overlay/etc/init.d/bee-audit`:
- runlevel: default, after: bee-update
- start(): load-nvidia.sh → /usr/local/bin/audit --output usb
- on completion: print QR summary to /dev/tty1 (always, even if USB write failed)
- log everything to /var/log/bee-audit.log
- exits 0 regardless of partial failures — unattended, no prompts, no waits
Target split:
- main Go application works on a normal Linux host and on the live ISO
- live-ISO specifics stay in integration glue under `iso/`
- the live ISO passes `--runtime livecd` to the Go binary
- local runs default to `--runtime auto`, which resolves to `local` unless a live marker is detected
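A minimal sketch of how the `--runtime` value might be resolved is shown below. The `resolveRuntime` function, the injected `markerExists` callback, and the use of `/etc/bee-release` as the live marker are all assumptions for illustration; the actual logic in `audit/internal/runtimeenv` may differ.

```go
package main

import (
	"fmt"
	"os"
)

// liveMarkerPath is a hypothetical marker file; the real detection may
// rely on a different signal.
const liveMarkerPath = "/etc/bee-release"

// resolveRuntime maps a --runtime flag value to a concrete mode.
// markerExists is injected so the logic can be exercised without
// touching the filesystem.
func resolveRuntime(flagValue string, markerExists func() bool) (mode, reason string, err error) {
	switch flagValue {
	case "local", "livecd":
		return flagValue, "forced by --runtime", nil
	case "auto":
		if markerExists() {
			return "livecd", "live marker detected", nil
		}
		return "local", "no live marker detected", nil
	default:
		return "", "", fmt.Errorf("unknown runtime %q (want auto, local, or livecd)", flagValue)
	}
}

func main() {
	onDisk := func() bool {
		_, err := os.Stat(liveMarkerPath)
		return err == nil
	}
	mode, reason, err := resolveRuntime("auto", onDisk)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mode=%s reason=%s\n", mode, reason)
}
```

Returning a reason alongside the mode lets the caller log why a given mode was selected.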
Unattended invariants:
- No TTY prompts ever. All decisions are automatic.
- Missing USB: output goes to /tmp/bee-audit-<serial>-<date>.json, QR shown on screen.
- Missing NVIDIA driver: GPU records have status UNKNOWN, audit continues.
- Missing ipmitool/storcli/any tool: that collector is skipped, rest continue.
- Timeout on any external command: 30s hard limit via `timeout` wrapper, then skip.
- Boot never hangs waiting for user input.
Planned code shape:
- `audit/cmd/bee/` — main CLI entrypoint
- `audit/internal/runtimeenv/` — runtime detection and mode selection
- future `audit/internal/tui/` — host/live shared TUI logic
- `iso/overlay/` — boot-time livecd integration only
`iso/overlay/etc/runlevels/default/bee-audit` symlink
### 2.5 — Operator workflows
### 2.6 — Vendor utilities in overlay
- Automatic boot audit writes JSON to `/var/log/bee-audit.json`
- `tty1` autologins into `bee` and auto-runs `menu`
- `menu` launches the LiveCD wrapper `bee-tui`, which escalates to `root` via `sudo -n`
- `bee tui` can rerun the audit manually
- `bee tui` can export the latest audit JSON to removable media
- `bee tui` can show health summary and run NVIDIA/memory/storage acceptance tests
- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-burn`
- SAT summaries now expose `overall_status` plus per-job `OK/FAILED/UNSUPPORTED`
- Memory SAT runtime defaults can be overridden via `BEE_MEMTESTER_*`
- removable export requires explicit target selection, mount, confirmation, copy, and cleanup
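One possible way to derive `overall_status` from the per-job results is sketched below. The aggregation rule (any `FAILED` job fails the run, otherwise `OK` if at least one job passed, otherwise `UNSUPPORTED`) and the `jobResult` type are assumptions for illustration; the plan does not state the actual rule.

```go
package main

import "fmt"

// jobResult is a hypothetical per-job record in a SAT summary.
type jobResult struct {
	Name   string
	Status string // "OK", "FAILED", or "UNSUPPORTED"
}

// overallStatus folds per-job statuses into one verdict under the
// assumed rule: FAILED dominates, then OK, then UNSUPPORTED.
func overallStatus(jobs []jobResult) string {
	sawOK := false
	for _, j := range jobs {
		switch j.Status {
		case "FAILED":
			return "FAILED"
		case "OK":
			sawOK = true
		}
	}
	if sawOK {
		return "OK"
	}
	return "UNSUPPORTED"
}

func main() {
	jobs := []jobResult{
		{Name: "nvidia-smi", Status: "OK"},
		{Name: "bee-gpu-burn", Status: "UNSUPPORTED"},
	}
	fmt.Println("overall_status:", overallStatus(jobs))
}
```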
`iso/overlay/usr/local/bin/` includes pre-fetched proprietary tools:
- `storcli64` (Broadcom)
- `sas2ircu`, `sas3ircu` (Broadcom/LSI)
- `mstflint` (NVIDIA Networking / Mellanox)
### 2.6 — Vendor utilities and optional assets
`scripts/fetch-vendor.sh` — downloads and places these before ISO build.
Checksums verified. Tools not committed to git — fetched at build time.
Optional binaries live in `iso/vendor/` and are included when present:
- `storcli64`
- `sas2ircu`, `sas3ircu`
- `arcconf`
- `ssacli`
- `mstflint` (via Debian package set)
`iso/vendor/.gitkeep` — placeholder, directory gitignored except .gitkeep
Missing optional tools do not fail the build or boot.
### 2.7 — Auto-update of audit binary (USB + network)
### 2.7 — Release workflow
Two update paths, tried in order on every boot:
`iso/builder/VERSIONS` pins the current release inputs:
- audit version
- Debian version / kernel ABI
- Go version
- NVIDIA driver version
**Path A — USB (no network required, higher priority):**
`bee-update.sh` scans mounted removable media for an update package before checking network.
Looks for: `<usb>/bee-update/bee-audit-linux-amd64` + `<usb>/bee-update/bee-audit-linux-amd64.sha256`
Steps:
1. Find USB mount point (same detection as audit output: `/sys/block/*/removable`)
2. Check for `bee-update/bee-audit-linux-amd64` on the USB root
3. Read version from `bee-update/VERSION` file (plain text, e.g. `1.3`)
4. Compare with running binary version (`/usr/local/bin/audit --version`)
5. If USB version > running: verify SHA256 checksum, replace binary, log update
6. Re-run audit if updated
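Step 5 depends on comparing version strings numerically, since a plain string comparison would rank `1.10` below `1.9`. A minimal sketch, assuming purely numeric dotted versions with an optional `v` prefix (the helper names `parseVersion` and `compareVersions` are hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted version such as "1.3" or "v1.10" into integers.
func parseVersion(s string) ([]int, error) {
	s = strings.TrimPrefix(strings.TrimSpace(s), "v")
	parts := strings.Split(s, ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", s, err)
		}
		nums[i] = n
	}
	return nums, nil
}

// compareVersions returns -1, 0, or 1 when a is older than, equal to,
// or newer than b. Missing trailing components count as zero.
func compareVersions(a, b string) (int, error) {
	pa, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	pb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	cmp, err := compareVersions("1.10", "1.9")
	if err != nil {
		fmt.Println("skip update:", err)
		return
	}
	fmt.Println("update available:", cmp > 0)
}
```

A non-numeric version such as the `dev` default of a local build yields an error, which a caller could treat as "do not update".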
**Authenticity verification — Ed25519 multi-key trust (stdlib only, no external tools):**
Problem: SHA256 alone does not prevent a crafted attack — an attacker places their binary
and a matching SHA256 next to it. The LiveCD would accept it.
Solution: Ed25519 asymmetric signatures via Go stdlib `crypto/ed25519`.
Multiple developer public keys are supported. A binary update is accepted if its signature
verifies against ANY of the embedded trusted public keys.
This mirrors the SSH authorized_keys model: add a developer → add their public key.
Remove a developer → rebuild without their key.
**Key management — centralized across all projects:**
Public keys live in a dedicated repo at git.mchus.pro/mchus/keys (or similar):
```
keys/
developers/
mchusavitin.pub ← Ed25519 public key, base64, one line
developer2.pub
README.md ← how to generate a key pair
```
Public keys are safe to commit — they are not secret.
Private keys stay on each developer's machine, never committed anywhere.
Key generation (one-time per developer, run locally):
```sh
# scripts/keygen.sh — also lives in the keys repo
openssl genpkey -algorithm ed25519 -out ~/.bee-release.key
openssl pkey -in ~/.bee-release.key -pubout -outform DER \
| tail -c 32 | base64 > mchusavitin.pub
```
**Embedding public keys at release time (not compile time):**
Public keys are injected via `-ldflags` at build time from the keys repo.
The binary does not hardcode keys — they are provided by the release script.
```go
// audit/internal/updater/trust.go
// trustedKeysRaw is injected at build time via -ldflags
// format: base64(key1):base64(key2):...
var trustedKeysRaw string
func trustedKeys() ([]ed25519.PublicKey, error) {
if trustedKeysRaw == "" {
return nil, fmt.Errorf("binary built without trusted keys — updates disabled")
}
var keys []ed25519.PublicKey
for _, enc := range strings.Split(trustedKeysRaw, ":") {
b, err := base64.StdEncoding.DecodeString(strings.TrimSpace(enc))
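if err != nil {
return nil, fmt.Errorf("invalid trusted key: %w", err)
}
// Report a wrong key length separately: err is nil here, so %w would print nothing useful.
if len(b) != ed25519.PublicKeySize {
return nil, fmt.Errorf("invalid trusted key: got %d bytes, want %d", len(b), ed25519.PublicKeySize)
}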
keys = append(keys, ed25519.PublicKey(b))
}
return keys, nil
}
func verifySignature(binaryPath, sigPath string) error {
keys, err := trustedKeys()
if err != nil {
return err
}
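data, err := os.ReadFile(binaryPath)
if err != nil {
return fmt.Errorf("read binary: %w", err)
}
sig, err := os.ReadFile(sigPath) // 64 bytes raw Ed25519 signature
if err != nil {
return fmt.Errorf("read signature: %w", err)
}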
for _, key := range keys {
if ed25519.Verify(key, data, sig) {
return nil // any trusted key accepts → pass
}
}
return fmt.Errorf("signature verification failed: no trusted key matched")
}
```
Release build injects keys:
```sh
# scripts/build-release.sh
KEYS=$(paste -sd: keys/developers/*.pub)
go build -ldflags "-X bee/audit/internal/updater.trustedKeysRaw=${KEYS}" \
-o dist/bee-audit-linux-amd64 ./cmd/audit
```
Signing (release engineer signs with their private key):
```sh
# scripts/sign-release.sh <binary>
openssl pkeyutl -sign -inkey ~/.bee-release.key \
-rawin -in "$1" -out "$1.sig"
```
Binary built without `-ldflags` injection (e.g. local dev build) has `trustedKeysRaw=""`
→ updates are disabled, logged as INFO, audit continues normally.
Update rejected silently (logged as WARNING, audit continues with current binary) if:
- `.sig` file missing
- Signature does not match any trusted key
- `trustedKeysRaw` empty (dev build)
Update package layout on USB:
```
/bee-update/
bee-audit-linux-amd64 ← new binary (also signed with embedded keys)
bee-audit-linux-amd64.sig ← Ed25519 signature (64 bytes raw)
VERSION ← plain version string e.g. "1.3"
```
Admin workflow: download `bee-audit-linux-amd64` + `bee-audit-linux-amd64.sig` from Gitea
release assets, place in `bee-update/` on USB.
**Path B — Network (requires DHCP on at least one interface):**
1. Check network: ping git.mchus.pro -c 1 -W 3 || skip
2. Fetch: `GET https://git.mchus.pro/api/v1/repos/<org>/bee/releases/latest`
3. Parse tag_name, asset URLs for `bee-audit-linux-amd64` + `bee-audit-linux-amd64.sig`
4. Compare tag with running version
5. If newer: download both files to /tmp, verify Ed25519 signature against all trusted keys
6. Replace binary on pass, log and skip on fail
7. Re-run audit if updated
**Ordering:** USB update checked first, network checked second.
If USB update applied and verified, network check is skipped.
`iso/overlay/etc/init.d/bee-update`:
- runlevel: default
- after: bee-network (network path needs interfaces up)
- before: bee-audit (audit runs with latest binary)
- Calls bee-update.sh
Triggered after bee-audit completes, only if network is available.
`iso/overlay/usr/local/bin/bee-update.sh`:
```
1. Check network: ping git.mchus.pro -c 1 -W 3 || exit 0
2. Fetch latest release metadata:
GET https://git.mchus.pro/api/v1/repos/<org>/bee/releases/latest
3. Parse: extract tag_name, asset URL for bee-audit-linux-amd64
4. Compare tag_name with /usr/local/bin/audit --version output
5. If newer: download to /tmp/bee-audit-new, verify SHA256 checksum from release assets
6. Replace /usr/local/bin/audit (tmpfs — survives until reboot)
7. Log: updated from vX.Y to vX.Z
8. Re-run audit if update happened: /usr/local/bin/audit --output usb
```
`iso/overlay/etc/init.d/bee-update`:
- runlevel: default
- after: bee-audit, network
- Calls bee-update.sh
Release naming convention: binary asset named `bee-audit-linux-amd64` per release tag.
### 2.8 — Release workflow
`iso/builder/VERSIONS` — pinned versions:
```
AUDIT_VERSION=1.0
ALPINE_VERSION=3.21
KERNEL_VERSION=6.12
NVIDIA_DRIVER_VERSION=590.48.01
```
LiveCD release = full ISO rebuild. Binary-only patch = new Gitea release with binary asset.
On boot with network: ISO auto-patches its binary without full rebuild.
ISO version embedded in `/etc/bee-release`:
```
BEE_ISO_VERSION=1.0
BEE_AUDIT_VERSION=1.0
BUILD_DATE=2026-03-05
```
Current release model:
- shipping a new ISO means a full rebuild
- build metadata is embedded into `/etc/bee-release` and `motd`
- current ISO label is `BEE`
- binary self-update remains deferred; no automatic USB/network patching is part of the current runtime
---
## Eating order
Builder environment is set up early (after 1.3) so every subsequent collector
is developed and tested directly on real hardware in the actual Alpine environment.
is developed and tested directly on real hardware in the actual Debian live ISO environment.
No "works on my Mac" drift.
```
@@ -544,10 +387,10 @@ No "works on my Mac" drift.
1.2 board collector → first real data
1.3 CPU collector → +CPUs
--- BUILDER + DEBUG ISO (unblock real-hardware testing) ---
--- BUILDER + BEE ISO (unblock real-hardware testing) ---
2.1 builder VM setup → Alpine VM with build deps + Go toolchain
2.2 debug ISO profile → minimal Alpine ISO: audit binary + dropbear SSH + all packages
2.1 builder setup → privileged container with build deps
2.2 debug ISO profile → minimal Debian ISO: `bee` binary + OpenSSH + all packages
2.3 boot on real server → SSH in, verify packages present, run audit manually
--- CONTINUE COLLECTORS (tested on real hardware from here) ---
@@ -560,14 +403,14 @@ No "works on my Mac" drift.
1.8b wear/age telemetry → +SMART hours, NVMe % used, SFP DOM, ECC
1.9 Mellanox NIC enrichment → +NIC firmware/serial
1.10 RAID enrichment → +physical disks behind RAID
1.11 output + USB write → production-ready output
1.11 output + export workflow → file output + explicit removable export
--- PRODUCTION ISO ---
2.4 NVIDIA driver build → driver compiled into overlay
2.5 network bring-up on boot → DHCP on all interfaces
2.6 OpenRC boot service → audit runs on boot automatically
2.7 vendor utilities → storcli/sas2ircu/mstflint in image
2.8 auto-update → binary self-patches from Gitea
2.9 release workflow → versioning + release notes
2.6 systemd boot service → audit runs on boot automatically
2.7 vendor utilities → storcli/sas2ircu/arcconf/ssacli in image
2.8 release workflow → versioning + release notes
2.9 operator export flow → explicit TUI export to removable media
```

18
audit/Makefile Normal file

@@ -0,0 +1,18 @@
LISTEN ?= :8080
AUDIT_PATH ?=
RUN_ARGS := web --listen $(LISTEN)
ifneq ($(AUDIT_PATH),)
RUN_ARGS += --audit-path $(AUDIT_PATH)
endif
.PHONY: run build test
run:
go run ./cmd/bee $(RUN_ARGS)
build:
go build -o bee ./cmd/bee
test:
go test ./...

BIN
audit/bee Executable file

Binary file not shown.


@@ -1,167 +0,0 @@
package main
import (
"encoding/json"
"flag"
"fmt"
"log/slog"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
"time"
"bee/audit/internal/collector"
)
// Version is the audit binary version.
// Injected at release build time via:
//
// -ldflags "-X main.Version=1.2"
//
// Defaults to "dev" in local builds.
var Version = "dev"
func main() {
output := flag.String("output", "stdout", `output destination:
stdout — print JSON to stdout (default)
file:<path> — write JSON to file
usb — auto-detect removable media, write JSON there`)
showVersion := flag.Bool("version", false, "print version and exit")
flag.Parse()
if *showVersion {
fmt.Println(Version)
return
}
slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
Level: slog.LevelInfo,
})))
result := collector.Run()
data, err := json.MarshalIndent(result, "", " ")
if err != nil {
slog.Error("marshal result", "err", err)
os.Exit(1)
}
if err := writeOutput(*output, data); err != nil {
slog.Error("write output", "destination", *output, "err", err)
os.Exit(1)
}
}
func writeOutput(dest string, data []byte) error {
switch {
case dest == "stdout":
_, err := os.Stdout.Write(append(data, '\n'))
return err
case strings.HasPrefix(dest, "file:"):
path := strings.TrimPrefix(dest, "file:")
return os.WriteFile(path, append(data, '\n'), 0644)
case dest == "usb":
return writeToUSB(data)
default:
return fmt.Errorf("unknown output destination %q — use stdout, file:<path>, or usb", dest)
}
}
// writeToUSB auto-detects the first removable block device, mounts it,
// and writes the audit JSON. Falls back to /tmp on any failure.
func writeToUSB(data []byte) error {
boardSerial := extractBoardSerial(data)
filename := auditFilename(boardSerial, time.Now().UTC())
device, err := firstRemovableDevice()
if err != nil {
slog.Warn("usb output: no removable device, writing to /tmp", "err", err)
return writeAuditToPath(filepath.Join("/tmp", filename), data)
}
mountpoint := "/tmp/bee-usb"
if err := os.MkdirAll(mountpoint, 0755); err != nil {
return err
}
if err := exec.Command("mount", device, mountpoint).Run(); err != nil {
slog.Warn("usb output: mount failed, writing to /tmp", "device", device, "err", err)
return writeAuditToPath(filepath.Join("/tmp", filename), data)
}
defer func() {
if err := exec.Command("umount", mountpoint).Run(); err != nil {
slog.Warn("usb output: umount failed", "mountpoint", mountpoint, "err", err)
}
}()
path := filepath.Join(mountpoint, filename)
if err := writeAuditToPath(path, data); err != nil {
slog.Warn("usb output: write failed, falling back to /tmp", "path", path, "err", err)
return writeAuditToPath(filepath.Join("/tmp", filename), data)
}
slog.Info("usb output: written", "path", path)
return nil
}
func writeAuditToPath(path string, data []byte) error {
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
return err
}
slog.Info("audit output written", "path", path)
return nil
}
func extractBoardSerial(data []byte) string {
var doc struct {
Hardware struct {
Board struct {
SerialNumber string `json:"serial_number"`
} `json:"board"`
} `json:"hardware"`
}
if err := json.Unmarshal(data, &doc); err != nil {
return "unknown"
}
serial := strings.TrimSpace(doc.Hardware.Board.SerialNumber)
if serial == "" {
return "unknown"
}
return serial
}
func auditFilename(boardSerial string, now time.Time) string {
boardSerial = strings.TrimSpace(boardSerial)
if boardSerial == "" {
boardSerial = "unknown"
}
return fmt.Sprintf("audit-%s-%s.json", boardSerial, now.Format("20060102-150405"))
}
func firstRemovableDevice() (string, error) {
entries, err := os.ReadDir("/sys/block")
if err != nil {
return "", err
}
sort.Slice(entries, func(i, j int) bool { return entries[i].Name() < entries[j].Name() })
for _, e := range entries {
name := e.Name()
if strings.HasPrefix(name, "loop") || strings.HasPrefix(name, "ram") {
continue
}
removableFlag, err := os.ReadFile(filepath.Join("/sys/block", name, "removable"))
if err != nil {
continue
}
if strings.TrimSpace(string(removableFlag)) == "1" {
return filepath.Join("/dev", name), nil
}
}
return "", fmt.Errorf("no removable block device found")
}

409
audit/cmd/bee/main.go Normal file

@@ -0,0 +1,409 @@
package main
import (
"context"
"flag"
"fmt"
"io"
"log/slog"
"os"
"runtime/debug"
"strings"
"bee/audit/internal/app"
"bee/audit/internal/platform"
"bee/audit/internal/runtimeenv"
"bee/audit/internal/webui"
)
var Version = "dev"
func buildLabel() string {
label := strings.TrimSpace(Version)
if label == "" {
label = "dev"
}
if info, ok := debug.ReadBuildInfo(); ok {
var revision string
var modified bool
for _, setting := range info.Settings {
switch setting.Key {
case "vcs.revision":
revision = setting.Value
case "vcs.modified":
modified = setting.Value == "true"
}
}
if revision != "" {
short := revision
if len(short) > 12 {
short = short[:12]
}
label += " (" + short
if modified {
label += "+"
}
label += ")"
}
}
return label
}
func main() {
os.Exit(run(os.Args[1:], os.Stdout, os.Stderr))
}
func run(args []string, stdout, stderr io.Writer) int {
slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
Level: slog.LevelInfo,
})))
if len(args) == 0 {
printRootUsage(stderr)
return 2
}
switch args[0] {
case "help", "--help", "-h":
if len(args) > 1 {
return runHelp(args[1:], stdout, stderr)
}
printRootUsage(stdout)
return 0
case "audit":
return runAudit(args[1:], stdout, stderr)
case "export":
return runExport(args[1:], stdout, stderr)
case "preflight":
return runPreflight(args[1:], stdout, stderr)
case "support-bundle":
return runSupportBundle(args[1:], stdout, stderr)
case "web":
return runWeb(args[1:], stdout, stderr)
case "sat":
return runSAT(args[1:], stdout, stderr)
case "version", "--version", "-version":
fmt.Fprintln(stdout, Version)
return 0
default:
fmt.Fprintf(stderr, "bee: unknown command %q\n\n", args[0])
printRootUsage(stderr)
return 2
}
}
func printRootUsage(w io.Writer) {
fmt.Fprintln(w, `bee commands:
bee audit --runtime auto|local|livecd --output stdout|file:<path>
bee preflight --output stdout|file:<path>
bee export --target <device>
bee support-bundle --output stdout|file:<path>
bee web --listen :80 --audit-path `+app.DefaultAuditJSONPath+`
bee sat nvidia|memory|storage|cpu [--duration <seconds>]
bee version
bee help [command]`)
}
func runHelp(args []string, stdout, stderr io.Writer) int {
switch args[0] {
case "audit":
return runAudit([]string{"--help"}, stdout, stdout)
case "export":
return runExport([]string{"--help"}, stdout, stdout)
case "preflight":
return runPreflight([]string{"--help"}, stdout, stdout)
case "support-bundle":
return runSupportBundle([]string{"--help"}, stdout, stdout)
case "web":
return runWeb([]string{"--help"}, stdout, stdout)
case "sat":
return runSAT([]string{"--help"}, stdout, stdout)
case "version":
fmt.Fprintln(stdout, "usage: bee version")
return 0
default:
fmt.Fprintf(stderr, "bee help: unknown command %q\n\n", args[0])
printRootUsage(stderr)
return 2
}
}
func runAudit(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("audit", flag.ContinueOnError)
fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
runtimeFlag := fs.String("runtime", "auto", "runtime environment: auto, local, livecd")
showVersion := fs.Bool("version", false, "print version and exit")
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee audit [--runtime auto|local|livecd] [--output stdout|file:<path>]")
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
if *showVersion {
fmt.Fprintln(stdout, Version)
return 0
}
runtimeInfo, err := runtimeenv.Detect(*runtimeFlag)
if err != nil {
slog.Error("resolve runtime", "err", err)
return 1
}
slog.Info("runtime resolved", "mode", runtimeInfo.Mode, "reason", runtimeInfo.Reason)
application := app.New(platform.New())
path, err := application.RunAudit(runtimeInfo.Mode, *output)
if err != nil {
slog.Error("run audit", "err", err)
return 1
}
if path != "stdout" {
slog.Info("audit output written", "path", path)
}
return 0
}
func runExport(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("export", flag.ContinueOnError)
fs.SetOutput(stderr)
targetDevice := fs.String("target", "", "removable device path, e.g. /dev/sdb1")
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee export --target <device>")
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
if strings.TrimSpace(*targetDevice) == "" {
fmt.Fprintln(stderr, "bee export: --target is required")
fs.Usage()
return 2
}
application := app.New(platform.New())
targets, err := application.ListRemovableTargets()
if err != nil {
slog.Error("list removable targets", "err", err)
return 1
}
for _, target := range targets {
if target.Device == *targetDevice {
path, err := application.ExportLatestAudit(target)
if err != nil {
slog.Error("export latest audit", "err", err)
return 1
}
slog.Info("audit exported", "path", path)
return 0
}
}
slog.Error("target device not found among removable filesystems", "device", *targetDevice)
return 1
}
func runPreflight(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("preflight", flag.ContinueOnError)
fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee preflight [--output stdout|file:%s]\n", app.DefaultRuntimeJSONPath)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
application := app.New(platform.New())
path, err := application.RunRuntimePreflight(*output)
if err != nil {
slog.Error("run preflight", "err", err)
return 1
}
if path != "stdout" {
slog.Info("runtime health written", "path", path)
}
return 0
}
func runSupportBundle(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("support-bundle", flag.ContinueOnError)
fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee support-bundle [--output stdout|file:<path>]")
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
path, err := app.BuildSupportBundle(app.DefaultExportDir)
if err != nil {
slog.Error("build support bundle", "err", err)
return 1
}
defer os.Remove(path)
raw, err := os.ReadFile(path)
if err != nil {
slog.Error("read support bundle", "err", err)
return 1
}
switch {
case *output == "stdout":
if _, err := stdout.Write(raw); err != nil {
slog.Error("write support bundle stdout", "err", err)
return 1
}
case strings.HasPrefix(*output, "file:"):
dst := strings.TrimPrefix(*output, "file:")
if err := os.WriteFile(dst, raw, 0644); err != nil {
slog.Error("write support bundle", "err", err)
return 1
}
slog.Info("support bundle written", "path", dst)
default:
fmt.Fprintln(stderr, "bee support-bundle: unknown output destination")
fs.Usage()
return 2
}
return 0
}
func runWeb(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("web", flag.ContinueOnError)
fs.SetOutput(stderr)
listenAddr := fs.String("listen", ":8080", "listen address, e.g. :80")
auditPath := fs.String("audit-path", app.DefaultAuditJSONPath, "path to the latest audit JSON snapshot")
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles")
title := fs.String("title", "Bee Hardware Audit", "page title")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee web [--listen :80] [--audit-path %s] [--export-dir %s] [--title \"Bee Hardware Audit\"]\n", app.DefaultAuditJSONPath, app.DefaultExportDir)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
slog.Info("starting bee web", "listen", *listenAddr, "audit_path", *auditPath)
runtimeInfo, err := runtimeenv.Detect("auto")
if err != nil {
slog.Warn("resolve runtime for web", "err", err)
}
if err := webui.ListenAndServe(*listenAddr, webui.HandlerOptions{
Title: *title,
BuildLabel: buildLabel(),
AuditPath: *auditPath,
ExportDir: *exportDir,
App: app.New(platform.New()),
RuntimeMode: runtimeInfo.Mode,
}); err != nil {
slog.Error("run web", "err", err)
return 1
}
return 0
}
func runSAT(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 {
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>] [--diag-level <1-4>]")
return 2
}
if args[0] == "help" || args[0] == "--help" || args[0] == "-h" {
fmt.Fprintln(stdout, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>] [--diag-level <1-4>]")
return 0
}
fs := flag.NewFlagSet("sat", flag.ContinueOnError)
fs.SetOutput(stderr)
duration := fs.Int("duration", 0, "stress-ng duration in seconds (cpu only; default: 60)")
diagLevel := fs.Int("diag-level", 0, "DCGM diagnostic level for nvidia (1=quick, 2=medium, 3=targeted stress, 4=extended stress; default: 1)")
if err := fs.Parse(args[1:]); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fmt.Fprintf(stderr, "bee sat: unexpected arguments\n")
return 2
}
target := args[0]
if target != "nvidia" && target != "memory" && target != "storage" && target != "cpu" {
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", target)
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>] [--diag-level <1-4>]")
return 2
}
application := app.New(platform.New())
var (
archive string
err error
)
// Write progress lines to the injected stderr so callers and tests can capture them.
logLine := func(s string) { fmt.Fprintln(stderr, s) }
switch target {
case "nvidia":
level := *diagLevel
if level > 0 {
// Keep the archive path so the "sat archive written" log below reports it.
archive, err = application.RunNvidiaAcceptancePackWithOptions(context.Background(), "", level, nil, logLine)
} else {
archive, err = application.RunNvidiaAcceptancePack("", logLine)
}
case "memory":
archive, err = application.RunMemoryAcceptancePackCtx(context.Background(), "", logLine)
case "storage":
archive, err = application.RunStorageAcceptancePackCtx(context.Background(), "", logLine)
case "cpu":
dur := *duration
if dur <= 0 {
dur = 60
}
archive, err = application.RunCPUAcceptancePackCtx(context.Background(), "", dur, logLine)
}
if err != nil {
slog.Error("run sat", "target", target, "err", err)
return 1
}
slog.Info("sat archive written", "target", target, "path", archive)
return 0
}

219
audit/cmd/bee/main_test.go Normal file

@@ -0,0 +1,219 @@
package main
import (
"bytes"
"strings"
"testing"
)
func TestRunRootHelp(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"help"}, &stdout, &stderr)
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if !strings.Contains(stdout.String(), "bee commands:") {
t.Fatalf("stdout missing root usage:\n%s", stdout.String())
}
}
func TestRunNoArgsPrintsUsage(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run(nil, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "bee commands:") {
t.Fatalf("stderr missing root usage:\n%s", stderr.String())
}
}
func TestRunUnknownCommand(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"wat"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), `unknown command "wat"`) {
t.Fatalf("stderr missing unknown command message:\n%s", stderr.String())
}
}
func TestRunVersion(t *testing.T) {
// Not parallel: this test swaps the package-level Version variable.
old := Version
Version = "test-version"
t.Cleanup(func() { Version = old })
var stdout, stderr bytes.Buffer
rc := run([]string{"version"}, &stdout, &stderr)
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if strings.TrimSpace(stdout.String()) != "test-version" {
t.Fatalf("stdout=%q want %q", strings.TrimSpace(stdout.String()), "test-version")
}
}
func TestRunExportRequiresTarget(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"export"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "--target is required") {
t.Fatalf("stderr missing target error:\n%s", stderr.String())
}
if !strings.Contains(stderr.String(), "usage: bee export --target <device>") {
t.Fatalf("stderr missing export usage:\n%s", stderr.String())
}
}
func TestRunSATUsage(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"sat"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee sat nvidia|memory|storage") {
t.Fatalf("stderr missing sat usage:\n%s", stderr.String())
}
}
func TestRunPreflightRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"preflight", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee preflight") {
t.Fatalf("stderr missing preflight usage:\n%s", stderr.String())
}
}
func TestRunSupportBundleRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"support-bundle", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee support-bundle") {
t.Fatalf("stderr missing support-bundle usage:\n%s", stderr.String())
}
}
func TestRunHelpForSubcommand(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"help", "export"}, &stdout, &stderr)
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if !strings.Contains(stdout.String(), "usage: bee export --target <device>") {
t.Fatalf("stdout missing export help:\n%s", stdout.String())
}
}
func TestRunHelpUnknownSubcommand(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"help", "wat"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), `bee help: unknown command "wat"`) {
t.Fatalf("stderr missing help error:\n%s", stderr.String())
}
}
func TestRunSATUnknownTarget(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"sat", "amd"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), `unknown target "amd"`) {
t.Fatalf("stderr missing sat target error:\n%s", stderr.String())
}
}
func TestRunSATHelp(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"sat", "--help"}, &stdout, &stderr)
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if !strings.Contains(stdout.String(), "usage: bee sat nvidia|memory|storage|cpu") {
t.Fatalf("stdout missing sat help:\n%s", stdout.String())
}
}
func TestRunSATRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"sat", "memory", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "bee sat: unexpected arguments") {
t.Fatalf("stderr missing sat error:\n%s", stderr.String())
}
}
func TestRunAuditInvalidRuntime(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"audit", "--runtime", "bad"}, &stdout, &stderr)
if rc != 1 {
t.Fatalf("rc=%d want 1", rc)
}
}
func TestRunAuditRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"audit", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee audit") {
t.Fatalf("stderr missing audit usage:\n%s", stderr.String())
}
}
func TestRunExportRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"export", "--target", "/dev/sdb1", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee export --target <device>") {
t.Fatalf("stderr missing export usage:\n%s", stderr.String())
}
}


@@ -1,3 +1,26 @@
module bee/audit
go 1.23
go 1.25.0
replace reanimator/chart => ../internal/chart
require (
github.com/go-analyze/charts v0.5.26
reanimator/chart v0.0.0-00010101000000-000000000000
)
require (
github.com/dustin/go-humanize v1.0.1 // indirect
github.com/go-analyze/bulk v0.1.3 // indirect
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/mattn/go-isatty v0.0.20 // indirect
github.com/ncruces/go-strftime v1.0.0 // indirect
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec // indirect
golang.org/x/image v0.24.0 // indirect
golang.org/x/sys v0.42.0 // indirect
modernc.org/libc v1.70.0 // indirect
modernc.org/mathutil v1.7.1 // indirect
modernc.org/memory v1.11.0 // indirect
modernc.org/sqlite v1.48.0 // indirect
)

37
audit/go.sum Normal file

@@ -0,0 +1,37 @@
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
github.com/go-analyze/bulk v0.1.3 h1:pzRdBqzHDAT9PyROt0SlWE0YqPtdmTcEpIJY0C3vF0c=
github.com/go-analyze/bulk v0.1.3/go.mod h1:afon/KtFJYnekIyN20H/+XUvcLFjE8sKR1CfpqfClgM=
github.com/go-analyze/charts v0.5.26 h1:rSwZikLQuFX6cJzwI8OAgaWZneG1kDYxD857ms00ZxY=
github.com/go-analyze/charts v0.5.26/go.mod h1:s1YvQhjiSwtLx1f2dOKfiV9x2TT49nVSL6v2rlRpTbY=
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 h1:DACJavvAHhabrF08vX0COfcOBJRhZ8lUbR+ZWIs0Y5g=
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0/go.mod h1:E/TSTwGwJL78qG/PmXZO1EjYhfJinVAhrmmHX6Z8B9k=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY=
github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
github.com/ncruces/go-strftime v1.0.0 h1:HMFp8mLCTPp341M/ZnA4qaf7ZlsbTc+miZjCLOFAw7w=
github.com/ncruces/go-strftime v1.0.0/go.mod h1:Fwc5htZGVVkseilnfgOVb9mKy6w1naJmn9CehxcKcls=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec h1:W09IVJc94icq4NjY3clb7Lk8O1qJ8BdBEF8z0ibU0rE=
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec/go.mod h1:qqbHyh8v60DhA7CoWK5oRCqLrMHRGoxYCSS9EjAz6Eo=
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
golang.org/x/image v0.24.0 h1:AN7zRgVsbvmTfNyqIbbOraYL8mSwcKncEj8ofjgzcMQ=
golang.org/x/image v0.24.0/go.mod h1:4b/ITuLfqYq1hqZcjofwctIhi7sZh2WaCjvsBNjjya8=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.42.0 h1:omrd2nAlyT5ESRdCLYdm3+fMfNFE/+Rf4bDIQImRJeo=
golang.org/x/sys v0.42.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
modernc.org/libc v1.70.0 h1:U58NawXqXbgpZ/dcdS9kMshu08aiA6b7gusEusqzNkw=
modernc.org/libc v1.70.0/go.mod h1:OVmxFGP1CI/Z4L3E0Q3Mf1PDE0BucwMkcXjjLntvHJo=
modernc.org/mathutil v1.7.1 h1:GCZVGXdaN8gTqB1Mf/usp1Y/hSqgI2vAGGP4jZMCxOU=
modernc.org/mathutil v1.7.1/go.mod h1:4p5IwJITfppl0G4sUEDtCr4DthTaT47/N3aT6MhfgJg=
modernc.org/memory v1.11.0 h1:o4QC8aMQzmcwCK3t3Ux/ZHmwFPzE6hf2Y5LbkRs+hbI=
modernc.org/memory v1.11.0/go.mod h1:/JP4VbVC+K5sU2wZi9bHoq2MAkCnrt2r98UGeSK7Mjw=
modernc.org/sqlite v1.48.0 h1:ElZyLop3Q2mHYk5IFPPXADejZrlHu7APbpB0sF78bq4=
modernc.org/sqlite v1.48.0/go.mod h1:hWjRO6Tj/5Ik8ieqxQybiEOUXy0NJFNp2tpvVpKlvig=

1220
audit/internal/app/app.go Normal file

File diff suppressed because it is too large


@@ -0,0 +1,851 @@
package app
import (
"archive/tar"
"compress/gzip"
"context"
"encoding/json"
"errors"
"io"
"os"
"path/filepath"
"testing"
"bee/audit/internal/platform"
"bee/audit/internal/schema"
)
type fakeNetwork struct {
listInterfacesFn func() ([]platform.InterfaceInfo, error)
defaultRouteFn func() string
dhcpOneFn func(string) (string, error)
dhcpAllFn func() (string, error)
setStaticIPv4Fn func(platform.StaticIPv4Config) (string, error)
}
func (f fakeNetwork) ListInterfaces() ([]platform.InterfaceInfo, error) {
return f.listInterfacesFn()
}
func (f fakeNetwork) DefaultRoute() string {
return f.defaultRouteFn()
}
func (f fakeNetwork) DHCPOne(iface string) (string, error) {
return f.dhcpOneFn(iface)
}
func (f fakeNetwork) DHCPAll() (string, error) {
return f.dhcpAllFn()
}
func (f fakeNetwork) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
return f.setStaticIPv4Fn(cfg)
}
func (f fakeNetwork) SetInterfaceState(_ string, _ bool) error { return nil }
func (f fakeNetwork) GetInterfaceState(_ string) (bool, error) { return true, nil }
func (f fakeNetwork) CaptureNetworkSnapshot() (platform.NetworkSnapshot, error) {
return platform.NetworkSnapshot{}, nil
}
func (f fakeNetwork) RestoreNetworkSnapshot(platform.NetworkSnapshot) error { return nil }
type fakeServices struct {
serviceStatusFn func(string) (string, error)
serviceDoFn func(string, platform.ServiceAction) (string, error)
}
func (f fakeServices) ListBeeServices() ([]string, error) {
return nil, nil
}
func (f fakeServices) ServiceState(name string) string {
return "active"
}
func (f fakeServices) ServiceStatus(name string) (string, error) {
return f.serviceStatusFn(name)
}
func (f fakeServices) ServiceDo(name string, action platform.ServiceAction) (string, error) {
return f.serviceDoFn(name, action)
}
type fakeExports struct {
listTargetsFn func() ([]platform.RemovableTarget, error)
exportToTargetFn func(string, platform.RemovableTarget) (string, error)
}
func (f fakeExports) ListRemovableTargets() ([]platform.RemovableTarget, error) {
if f.listTargetsFn != nil {
return f.listTargetsFn()
}
return nil, nil
}
func (f fakeExports) ExportFileToTarget(src string, target platform.RemovableTarget) (string, error) {
if f.exportToTargetFn != nil {
return f.exportToTargetFn(src, target)
}
return "", nil
}
type fakeRuntime struct {
collectFn func(string) (schema.RuntimeHealth, error)
dumpFn func(string) error
}
func (f fakeRuntime) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
return f.collectFn(exportDir)
}
func (f fakeRuntime) CaptureTechnicalDump(baseDir string) error {
if f.dumpFn != nil {
return f.dumpFn(baseDir)
}
return nil
}
type fakeTools struct {
tailFileFn func(string, int) string
checkToolsFn func([]string) []platform.ToolStatus
}
func (f fakeTools) TailFile(path string, lines int) string {
return f.tailFileFn(path, lines)
}
func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
return f.checkToolsFn(names)
}
type fakeSAT struct {
runNvidiaFn func(string) (string, error)
runNvidiaStressFn func(string, platform.NvidiaStressOptions) (string, error)
runMemoryFn func(string) (string, error)
runStorageFn func(string) (string, error)
runCPUFn func(string, int) (string, error)
detectVendorFn func() string
listAMDGPUsFn func() ([]platform.AMDGPUInfo, error)
runAMDPackFn func(string) (string, error)
listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error)
}
func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string, _ func(string)) (string, error) {
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaAcceptancePackWithOptions(_ context.Context, baseDir string, _ int, _ []int, _ func(string)) (string, error) {
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaStressPack(_ context.Context, baseDir string, opts platform.NvidiaStressOptions, _ func(string)) (string, error) {
if f.runNvidiaStressFn != nil {
return f.runNvidiaStressFn(baseDir, opts)
}
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
if f.listNvidiaGPUsFn != nil {
return f.listNvidiaGPUsFn()
}
return nil, nil
}
func (f fakeSAT) RunMemoryAcceptancePack(_ context.Context, baseDir string, _ func(string)) (string, error) {
return f.runMemoryFn(baseDir)
}
func (f fakeSAT) RunStorageAcceptancePack(_ context.Context, baseDir string, _ func(string)) (string, error) {
return f.runStorageFn(baseDir)
}
func (f fakeSAT) RunCPUAcceptancePack(_ context.Context, baseDir string, durationSec int, _ func(string)) (string, error) {
if f.runCPUFn != nil {
return f.runCPUFn(baseDir, durationSec)
}
return "", nil
}
func (f fakeSAT) DetectGPUVendor() string {
if f.detectVendorFn != nil {
return f.detectVendorFn()
}
return ""
}
func (f fakeSAT) ListAMDGPUs() ([]platform.AMDGPUInfo, error) {
if f.listAMDGPUsFn != nil {
return f.listAMDGPUsFn()
}
return nil, nil
}
func (f fakeSAT) RunAMDAcceptancePack(_ context.Context, baseDir string, _ func(string)) (string, error) {
if f.runAMDPackFn != nil {
return f.runAMDPackFn(baseDir)
}
return "", nil
}
func (f fakeSAT) RunAMDMemIntegrityPack(_ context.Context, _ string, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunAMDMemBandwidthPack(_ context.Context, _ string, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunAMDStressPack(_ context.Context, _ string, _ int, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunMemoryStressPack(_ context.Context, _ string, _ int, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunSATStressPack(_ context.Context, _ string, _ int, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunFanStressTest(_ context.Context, _ string, _ platform.FanStressOptions) (string, error) {
return "", nil
}
func (f fakeSAT) RunPlatformStress(_ context.Context, _ string, _ platform.PlatformStressOptions, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunNCCLTests(_ context.Context, _ string, _ func(string)) (string, error) {
return "", nil
}
func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return []platform.InterfaceInfo{
{Name: "eth0", State: "UP", IPv4: []string{"10.0.0.2/24"}},
{Name: "eth1", State: "DOWN", IPv4: nil},
}, nil
},
defaultRouteFn: func() string { return "10.0.0.1" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
if err != nil {
t.Fatalf("NetworkStatus error: %v", err)
}
if result.Title != "Network status" {
t.Fatalf("title=%q want %q", result.Title, "Network status")
}
if want := "- eth0: state=UP ip=10.0.0.2/24"; !contains(result.Body, want) {
t.Fatalf("body missing %q\nbody=%s", want, result.Body)
}
if want := "- eth1: state=DOWN ip=(no IPv4)"; !contains(result.Body, want) {
t.Fatalf("body missing %q\nbody=%s", want, result.Body)
}
if want := "Default route: 10.0.0.1"; !contains(result.Body, want) {
t.Fatalf("body missing %q\nbody=%s", want, result.Body)
}
}
func TestNetworkStatusHandlesNoInterfaces(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
listInterfacesFn: func() ([]platform.InterfaceInfo, error) { return nil, nil },
defaultRouteFn: func() string { return "" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
if err != nil {
t.Fatalf("NetworkStatus error: %v", err)
}
if result.Body != "No physical interfaces found." {
t.Fatalf("body=%q want %q", result.Body, "No physical interfaces found.")
}
}
func TestNetworkStatusPropagatesListError(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return nil, errors.New("boom")
},
defaultRouteFn: func() string { return "" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
if err == nil {
t.Fatal("expected error")
}
if result.Title != "Network status" {
t.Fatalf("title=%q want %q", result.Title, "Network status")
}
}
func TestParseStaticIPv4ConfigAndDefaults(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
defaultRouteFn: func() string { return " 192.168.1.1 " },
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return nil, nil
},
dhcpOneFn: func(string) (string, error) { return "", nil },
dhcpAllFn: func() (string, error) { return "", nil },
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
defaults := a.DefaultStaticIPv4FormFields("eth0")
if len(defaults) != 4 {
t.Fatalf("len(defaults)=%d want 4", len(defaults))
}
if defaults[1] != "24" || defaults[2] != "192.168.1.1" {
t.Fatalf("unexpected defaults: %#v", defaults)
}
cfg := a.ParseStaticIPv4Config("eth0", []string{
" 10.10.0.5 ",
" 23 ",
" 10.10.0.1 ",
" 1.1.1.1 8.8.8.8 ",
})
if cfg.Interface != "eth0" || cfg.Address != "10.10.0.5" || cfg.Prefix != "23" || cfg.Gateway != "10.10.0.1" {
t.Fatalf("unexpected cfg: %#v", cfg)
}
if len(cfg.DNS) != 2 || cfg.DNS[0] != "1.1.1.1" || cfg.DNS[1] != "8.8.8.8" {
t.Fatalf("unexpected dns: %#v", cfg.DNS)
}
}
func TestServiceActionResults(t *testing.T) {
t.Parallel()
a := &App{
services: fakeServices{
serviceStatusFn: func(name string) (string, error) {
return "active", nil
},
serviceDoFn: func(name string, action platform.ServiceAction) (string, error) {
return string(action) + " ok", nil
},
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
statusResult, err := a.ServiceStatusResult("bee-audit")
if err != nil {
t.Fatalf("ServiceStatusResult error: %v", err)
}
if statusResult.Title != "service status: bee-audit" || statusResult.Body != "active" {
t.Fatalf("unexpected status result: %#v", statusResult)
}
actionResult, err := a.ServiceActionResult("bee-audit", platform.ServiceRestart)
if err != nil {
t.Fatalf("ServiceActionResult error: %v", err)
}
if actionResult.Title != "service restart: bee-audit" || actionResult.Body != "restart ok" {
t.Fatalf("unexpected action result: %#v", actionResult)
}
}
func TestToolCheckAndLogTailResults(t *testing.T) {
t.Parallel()
a := &App{
tools: fakeTools{
tailFileFn: func(path string, lines int) string {
return path
},
checkToolsFn: func(names []string) []platform.ToolStatus {
return []platform.ToolStatus{
{Name: "dmidecode", OK: true, Path: "/usr/bin/dmidecode"},
{Name: "smartctl", OK: false},
}
},
},
}
toolsResult := a.ToolCheckResult([]string{"dmidecode", "smartctl"})
if toolsResult.Title != "Required tools" {
t.Fatalf("title=%q want %q", toolsResult.Title, "Required tools")
}
if want := "- dmidecode: OK (/usr/bin/dmidecode)"; !contains(toolsResult.Body, want) {
t.Fatalf("body missing %q\nbody=%s", want, toolsResult.Body)
}
if want := "- smartctl: MISSING"; !contains(toolsResult.Body, want) {
t.Fatalf("body missing %q\nbody=%s", want, toolsResult.Body)
}
logResult := a.AuditLogTailResult()
if logResult.Title != "Audit log tail" {
t.Fatalf("title=%q want %q", logResult.Title, "Audit log tail")
}
if want := DefaultAuditLogPath + "\n\n" + DefaultAuditJSONPath; logResult.Body != want {
t.Fatalf("body=%q want %q", logResult.Body, want)
}
}
func TestActionResultsUseFallbackBody(t *testing.T) {
t.Parallel()
a := &App{
network: fakeNetwork{
dhcpOneFn: func(string) (string, error) { return " ", nil },
dhcpAllFn: func() (string, error) { return "", nil },
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return nil, nil
},
defaultRouteFn: func() string { return "" },
},
services: fakeServices{
serviceStatusFn: func(string) (string, error) { return "", nil },
serviceDoFn: func(string, platform.ServiceAction) (string, error) { return "", nil },
},
tools: fakeTools{
tailFileFn: func(string, int) string { return " " },
checkToolsFn: func([]string) []platform.ToolStatus { return nil },
},
sat: fakeSAT{
runNvidiaFn: func(string) (string, error) { return "", nil },
runMemoryFn: func(string) (string, error) { return "", nil },
runStorageFn: func(string) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) {
return schema.RuntimeHealth{Status: "PARTIAL", ExportDir: "/tmp/export"}, nil
},
},
}
if got, _ := a.DHCPOneResult("eth0"); got.Body != "DHCP completed." {
t.Fatalf("dhcp one body=%q", got.Body)
}
if got, _ := a.DHCPAllResult(); got.Body != "DHCP completed." {
t.Fatalf("dhcp all body=%q", got.Body)
}
if got, _ := a.SetStaticIPv4Result(platform.StaticIPv4Config{Interface: "eth0"}); got.Body != "Static IPv4 updated." {
t.Fatalf("static body=%q", got.Body)
}
if got, _ := a.ServiceStatusResult("bee-audit"); got.Body != "No status output." {
t.Fatalf("status body=%q", got.Body)
}
if got, _ := a.ServiceActionResult("bee-audit", platform.ServiceRestart); got.Body != "Action completed." {
t.Fatalf("action body=%q", got.Body)
}
if got := a.ToolCheckResult(nil); got.Body != "No tools checked." {
t.Fatalf("tool body=%q", got.Body)
}
if got := a.AuditLogTailResult(); got.Body != "No audit logs found." {
t.Fatalf("log body=%q", got.Body)
}
if got, _ := a.RunNvidiaAcceptancePackResult(""); got.Body != "Archive written." {
t.Fatalf("sat body=%q", got.Body)
}
if got, _ := a.RunMemoryAcceptancePackResult(""); got.Body != "No output produced." {
t.Fatalf("memory sat body=%q", got.Body)
}
if got, _ := a.RunStorageAcceptancePackResult(""); got.Body != "No output produced." {
t.Fatalf("storage sat body=%q", got.Body)
}
}
func TestExportSupportBundleResultMentionsUnmountedUSB(t *testing.T) {
// Not parallel: this test overrides the package-level DefaultExportDir.
tmp := t.TempDir()
oldExportDir := DefaultExportDir
DefaultExportDir = tmp
t.Cleanup(func() { DefaultExportDir = oldExportDir })
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.json"), []byte("{}\n"), 0644); err != nil {
t.Fatalf("write bee-audit.json: %v", err)
}
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.log"), []byte("audit ok\n"), 0644); err != nil {
t.Fatalf("write bee-audit.log: %v", err)
}
a := &App{
exports: fakeExports{
exportToTargetFn: func(src string, target platform.RemovableTarget) (string, error) {
// filepath.Base never returns "", so check the source path itself.
if src == "" {
t.Fatalf("expected non-empty source path")
}
return "/media/bee/" + filepath.Base(src), nil
},
},
}
result, err := a.ExportSupportBundleResult(platform.RemovableTarget{Device: "/dev/sdb1"})
if err != nil {
t.Fatalf("ExportSupportBundleResult error: %v", err)
}
if result.Title != "Export support bundle" {
t.Fatalf("title=%q want %q", result.Title, "Export support bundle")
}
if want := "USB target unmounted and safe to remove."; !contains(result.Body, want) {
t.Fatalf("body missing %q\nbody=%s", want, result.Body)
}
}
func TestExportSupportBundleResultDoesNotPretendSuccessOnError(t *testing.T) {
// Not parallel: this test overrides the package-level DefaultExportDir.
tmp := t.TempDir()
oldExportDir := DefaultExportDir
DefaultExportDir = tmp
t.Cleanup(func() { DefaultExportDir = oldExportDir })
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.json"), []byte("{}\n"), 0644); err != nil {
t.Fatalf("write bee-audit.json: %v", err)
}
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.log"), []byte("audit ok\n"), 0644); err != nil {
t.Fatalf("write bee-audit.log: %v", err)
}
a := &App{
exports: fakeExports{
exportToTargetFn: func(string, platform.RemovableTarget) (string, error) {
return "", errors.New("mount /dev/sda1: exFAT support is missing in this ISO build")
},
},
}
result, err := a.ExportSupportBundleResult(platform.RemovableTarget{Device: "/dev/sda1", FSType: "exfat"})
if err == nil {
t.Fatal("expected export error")
}
if contains(result.Body, "exported to") {
t.Fatalf("body should not claim success:\n%s", result.Body)
}
if result.Body != "Support bundle export failed." {
t.Fatalf("body=%q want %q", result.Body, "Support bundle export failed.")
}
}
func TestRunNvidiaAcceptancePackResult(t *testing.T) {
t.Parallel()
a := &App{
sat: fakeSAT{
runNvidiaFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/sat" {
t.Fatalf("baseDir=%q want %q", baseDir, "/tmp/sat")
}
return "/tmp/sat/out.tar.gz", nil
},
runMemoryFn: func(string) (string, error) { return "", nil },
runStorageFn: func(string) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.RunNvidiaAcceptancePackResult("/tmp/sat")
if err != nil {
t.Fatalf("RunNvidiaAcceptancePackResult error: %v", err)
}
if result.Title != "NVIDIA SAT" || result.Body != "Archive written to /tmp/sat/out.tar.gz" {
t.Fatalf("unexpected result: %#v", result)
}
}
func TestRunSATDefaultsToExportDir(t *testing.T) {
// Not parallel: this test overrides the package-level DefaultSATBaseDir.
oldSATBaseDir := DefaultSATBaseDir
DefaultSATBaseDir = "/tmp/export/bee-sat"
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
a := &App{
sat: fakeSAT{
runNvidiaFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("nvidia baseDir=%q", baseDir)
}
return "", nil
},
runMemoryFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("memory baseDir=%q", baseDir)
}
return "", nil
},
runStorageFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("storage baseDir=%q", baseDir)
}
return "", nil
},
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
if _, err := a.RunNvidiaAcceptancePack("", nil); err != nil {
t.Fatal(err)
}
if _, err := a.RunMemoryAcceptancePack("", nil); err != nil {
t.Fatal(err)
}
if _, err := a.RunStorageAcceptancePack("", nil); err != nil {
t.Fatal(err)
}
}
func TestFormatSATSummary(t *testing.T) {
t.Parallel()
got := formatSATSummary("Memory SAT", "overall_status=PARTIAL\njob_ok=2\njob_failed=0\njob_unsupported=1\ndevices=3\n")
want := "Memory SAT: PARTIAL ok=2 failed=0 unsupported=1\nDevices: 3"
if got != want {
t.Fatalf("got %q want %q", got, want)
}
}
func TestHealthSummaryResultIncludesCompactSATSummary(t *testing.T) {
tmp := t.TempDir()
oldAuditPath := DefaultAuditJSONPath
oldSATBaseDir := DefaultSATBaseDir
DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
DefaultSATBaseDir = filepath.Join(tmp, "sat")
t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
satDir := filepath.Join(DefaultSATBaseDir, "memory-testcase")
if err := os.MkdirAll(satDir, 0755); err != nil {
t.Fatalf("mkdir sat dir: %v", err)
}
raw := `{"collected_at":"2026-03-15T10:00:00Z","hardware":{"board":{"serial_number":"SRV123"},"storage":[{"serial_number":"DISK1","status":"Warning"}]}}`
if err := os.WriteFile(DefaultAuditJSONPath, []byte(raw), 0644); err != nil {
t.Fatalf("write audit json: %v", err)
}
if err := os.WriteFile(filepath.Join(satDir, "summary.txt"), []byte("overall_status=OK\njob_ok=3\njob_failed=0\njob_unsupported=0\n"), 0644); err != nil {
t.Fatalf("write sat summary: %v", err)
}
result := (&App{}).HealthSummaryResult()
if !contains(result.Body, "Memory SAT: OK ok=3 failed=0") {
t.Fatalf("body missing compact sat summary:\n%s", result.Body)
}
}
func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
tmp := t.TempDir()
exportDir := filepath.Join(tmp, "export")
if err := os.MkdirAll(filepath.Join(exportDir, "bee-sat", "memory-run"), 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.json"), []byte(`{"ok":true}`), 0644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run", "verbose.log"), []byte("sat verbose"), 0644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run.tar.gz"), []byte("nested sat archive"), 0644); err != nil {
t.Fatal(err)
}
archive, err := BuildSupportBundle(exportDir)
if err != nil {
t.Fatalf("BuildSupportBundle error: %v", err)
}
if _, err := os.Stat(archive); err != nil {
t.Fatalf("archive stat: %v", err)
}
file, err := os.Open(archive)
if err != nil {
t.Fatalf("open archive: %v", err)
}
defer file.Close()
gzr, err := gzip.NewReader(file)
if err != nil {
t.Fatalf("gzip reader: %v", err)
}
defer gzr.Close()
tr := tar.NewReader(gzr)
var names []string
for {
hdr, err := tr.Next()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
t.Fatalf("read tar entry: %v", err)
}
names = append(names, hdr.Name)
}
var foundRaw bool
for _, name := range names {
if contains(name, "/export/bee-sat/memory-run/verbose.log") {
foundRaw = true
}
if contains(name, "/export/bee-sat/memory-run.tar.gz") {
t.Fatalf("support bundle should not contain nested SAT archive: %s", name)
}
}
if !foundRaw {
t.Fatalf("support bundle missing raw SAT log, names=%v", names)
}
}
func TestMainBanner(t *testing.T) {
tmp := t.TempDir()
oldAuditPath := DefaultAuditJSONPath
DefaultAuditJSONPath = filepath.Join(tmp, "audit.json")
t.Cleanup(func() { DefaultAuditJSONPath = oldAuditPath })
trueValue := true
manufacturer := "Dell"
product := "PowerEdge R760"
cpuModel := "Intel Xeon Gold 6430"
memoryType := "DDR5"
gpuClass := "VideoController"
gpuModel := "NVIDIA H100"
payload := schema.HardwareIngestRequest{
Hardware: schema.HardwareSnapshot{
Board: schema.HardwareBoard{
Manufacturer: &manufacturer,
ProductName: &product,
SerialNumber: "SRV123",
},
CPUs: []schema.HardwareCPU{
{Model: &cpuModel},
{Model: &cpuModel},
},
Memory: []schema.HardwareMemory{
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
},
Storage: []schema.HardwareStorage{
{Present: &trueValue, SizeGB: intPtr(3840)},
{Present: &trueValue, SizeGB: intPtr(3840)},
},
PCIeDevices: []schema.HardwarePCIeDevice{
{DeviceClass: &gpuClass, Model: &gpuModel},
{DeviceClass: &gpuClass, Model: &gpuModel},
},
},
}
raw, err := json.Marshal(payload)
if err != nil {
t.Fatalf("marshal: %v", err)
}
if err := os.WriteFile(DefaultAuditJSONPath, raw, 0644); err != nil {
t.Fatalf("write audit json: %v", err)
}
a := &App{
network: fakeNetwork{
listInterfacesFn: func() ([]platform.InterfaceInfo, error) {
return []platform.InterfaceInfo{
{Name: "eth0", IPv4: []string{"10.0.0.10"}},
{Name: "eth1", IPv4: []string{"192.168.1.10"}},
}, nil
},
},
}
got := a.MainBanner()
for _, want := range []string{
"System: Dell PowerEdge R760 | S/N SRV123",
"CPU: 2 x Intel Xeon Gold 6430",
"Memory: 1.0 TB DDR5 (2 DIMMs)",
"Storage: 2 drives / 7.5 TB",
"GPU: 2 x NVIDIA H100",
"IP: 10.0.0.10, 192.168.1.10",
} {
if !contains(got, want) {
t.Fatalf("banner missing %q:\n%s", want, got)
}
}
}
func TestRuntimeHealthResultUsesAMDLabels(t *testing.T) {
tmp := t.TempDir()
oldRuntimePath := DefaultRuntimeJSONPath
DefaultRuntimeJSONPath = filepath.Join(tmp, "runtime-health.json")
t.Cleanup(func() { DefaultRuntimeJSONPath = oldRuntimePath })
raw, err := json.Marshal(schema.RuntimeHealth{
Status: "OK",
ExportDir: "/appdata/bee/export",
DriverReady: true,
CUDAReady: true,
NetworkStatus: "OK",
})
if err != nil {
t.Fatalf("marshal runtime health: %v", err)
}
if err := os.WriteFile(DefaultRuntimeJSONPath, raw, 0644); err != nil {
t.Fatalf("write runtime health: %v", err)
}
a := &App{
sat: fakeSAT{
detectVendorFn: func() string { return "amd" },
},
}
result := a.RuntimeHealthResult()
if !contains(result.Body, "AMDGPU ready: true") {
t.Fatalf("body missing AMD driver label:\n%s", result.Body)
}
if !contains(result.Body, "ROCm SMI ready: true") {
t.Fatalf("body missing ROCm label:\n%s", result.Body)
}
if contains(result.Body, "CUDA ready") {
t.Fatalf("body should not mention CUDA on AMD:\n%s", result.Body)
}
}
func intPtr(v int) *int { return &v }
func contains(haystack, needle string) bool {
return len(needle) == 0 || (len(haystack) >= len(needle) && (haystack == needle || containsAt(haystack, needle)))
}
func containsAt(haystack, needle string) bool {
for i := 0; i+len(needle) <= len(haystack); i++ {
if haystack[i:i+len(needle)] == needle {
return true
}
}
return false
}


@@ -0,0 +1,218 @@
package app
import (
"os"
"path/filepath"
"sort"
"strings"
"bee/audit/internal/schema"
)
func applyLatestSATStatuses(snap *schema.HardwareSnapshot, baseDir string) {
if snap == nil || strings.TrimSpace(baseDir) == "" {
return
}
if summary, ok := loadLatestSATSummary(baseDir, "gpu-amd-"); ok {
applyGPUVendorSAT(snap.PCIeDevices, "amd", summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "gpu-nvidia-"); ok {
applyGPUVendorSAT(snap.PCIeDevices, "nvidia", summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "memory-"); ok {
applyMemorySAT(snap.Memory, summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "cpu-"); ok {
applyCPUSAT(snap.CPUs, summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "storage-"); ok {
applyStorageSAT(snap.Storage, summary)
}
}
type satSummary struct {
runAtUTC string
overall string
kv map[string]string
}
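// loadLatestSATSummary reads summary.txt from the lexically last run directory
// matching prefix, which is the newest run when directory names end in a
// sortable timestamp.
func loadLatestSATSummary(baseDir, prefix string) (satSummary, bool) {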
matches, err := filepath.Glob(filepath.Join(baseDir, prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
return satSummary{}, false
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
return satSummary{}, false
}
kv := parseKeyValueSummary(string(raw))
return satSummary{
runAtUTC: strings.TrimSpace(kv["run_at_utc"]),
overall: strings.ToUpper(strings.TrimSpace(kv["overall_status"])),
kv: kv,
}, true
}
func applyGPUVendorSAT(devs []schema.HardwarePCIeDevice, vendor string, summary satSummary) {
status, description, ok := satSummaryStatus(summary, vendor+" GPU SAT")
if !ok {
return
}
for i := range devs {
if !matchesGPUVendor(devs[i], vendor) {
continue
}
mergeComponentStatus(&devs[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyMemorySAT(dimms []schema.HardwareMemory, summary satSummary) {
status, description, ok := satSummaryStatus(summary, "memory SAT")
if !ok {
return
}
for i := range dimms {
mergeComponentStatus(&dimms[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyCPUSAT(cpus []schema.HardwareCPU, summary satSummary) {
status, description, ok := satSummaryStatus(summary, "CPU SAT")
if !ok {
return
}
for i := range cpus {
mergeComponentStatus(&cpus[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyStorageSAT(disks []schema.HardwareStorage, summary satSummary) {
byDevice := parseStorageSATStatus(summary)
for i := range disks {
devPath, _ := disks[i].Telemetry["linux_device"].(string)
devName := filepath.Base(strings.TrimSpace(devPath))
if devName == "" {
continue
}
result, ok := byDevice[devName]
if !ok {
continue
}
mergeComponentStatus(&disks[i].HardwareComponentStatus, summary.runAtUTC, result.status, result.description)
}
}
type satStatusResult struct {
status string
description string
ok bool
}
func parseStorageSATStatus(summary satSummary) map[string]satStatusResult {
result := map[string]satStatusResult{}
for key, value := range summary.kv {
if !strings.HasSuffix(key, "_status") || key == "overall_status" {
continue
}
base := strings.TrimSuffix(key, "_status")
idx := strings.Index(base, "_")
if idx <= 0 {
continue
}
devName := base[:idx]
step := strings.ReplaceAll(base[idx+1:], "_", "-")
stepStatus, desc, ok := satKeyStatus(strings.ToUpper(strings.TrimSpace(value)), "storage "+step)
if !ok {
continue
}
current := result[devName]
if !current.ok || statusSeverity(stepStatus) > statusSeverity(current.status) {
result[devName] = satStatusResult{status: stepStatus, description: desc, ok: true}
}
}
return result
}
func satSummaryStatus(summary satSummary, label string) (string, string, bool) {
return satKeyStatus(summary.overall, label)
}
func satKeyStatus(rawStatus, label string) (string, string, bool) {
switch strings.ToUpper(strings.TrimSpace(rawStatus)) {
case "OK":
// No error description on success — error_description is for problems only.
return "OK", "", true
case "PARTIAL", "UNSUPPORTED", "CANCELED", "CANCELLED":
// Tool couldn't run or test was incomplete — we can't assert hardware health.
return "Unknown", "", true
case "FAILED":
return "Critical", label + " failed", true
default:
return "", "", false
}
}
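// mergeComponentStatus overwrites the component status only when the current
// status is empty or Unknown, or when the SAT status is strictly more severe.
func mergeComponentStatus(component *schema.HardwareComponentStatus, changedAt, satStatus, description string) {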
if component == nil || satStatus == "" {
return
}
current := strings.TrimSpace(ptrString(component.Status))
if current == "" || current == "Unknown" || statusSeverity(satStatus) > statusSeverity(current) {
component.Status = appStringPtr(satStatus)
if strings.TrimSpace(description) != "" {
component.ErrorDescription = appStringPtr(description)
}
if strings.TrimSpace(changedAt) != "" {
component.StatusChangedAt = appStringPtr(changedAt)
component.StatusHistory = append(component.StatusHistory, schema.HardwareStatusHistory{
Status: satStatus,
ChangedAt: changedAt,
Details: appStringPtr(description),
})
}
}
}
func statusSeverity(status string) int {
switch strings.TrimSpace(status) {
case "Critical":
return 3
case "Warning":
return 2
case "OK":
return 1
case "Unknown":
return 1 // same as OK — does not override OK from another source
default:
return 0
}
}
func matchesGPUVendor(dev schema.HardwarePCIeDevice, vendor string) bool {
if dev.DeviceClass == nil {
return false
}
// Accept any class that names a controller, accelerator, display, or video device.
class := strings.TrimSpace(*dev.DeviceClass)
if !strings.Contains(class, "Controller") && !strings.Contains(class, "Accelerator") &&
!strings.Contains(class, "Display") && !strings.Contains(class, "Video") {
return false
}
manufacturer := strings.ToLower(strings.TrimSpace(ptrString(dev.Manufacturer)))
switch vendor {
case "amd":
return strings.Contains(manufacturer, "advanced micro devices") || strings.Contains(manufacturer, "amd/ati")
case "nvidia":
return strings.Contains(manufacturer, "nvidia")
default:
return false
}
}
func ptrString(v *string) string {
if v == nil {
return ""
}
return *v
}
func appStringPtr(value string) *string {
return &value
}
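
loadLatestSATSummary relies on parseKeyValueSummary, which is not part of this diff. The following is a minimal sketch of what such a parser might look like, assuming one key=value pair per line. The name parseKV and the choice to skip malformed lines silently are assumptions for illustration, not the project's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKV is a hypothetical stand-in for parseKeyValueSummary.
// It splits each line at the first "=" and ignores lines without one.
func parseKV(raw string) map[string]string {
	out := map[string]string{}
	for _, line := range strings.Split(raw, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), "=")
		if !ok {
			continue // blank or malformed line
		}
		key = strings.TrimSpace(key)
		if key == "" {
			continue
		}
		out[key] = strings.TrimSpace(value)
	}
	return out
}

func main() {
	kv := parseKV("run_at_utc=2026-03-25T16:11:51Z\noverall_status=FAILED\n")
	fmt.Println(kv["overall_status"], kv["run_at_utc"])
}
```

Splitting at the first "=" keeps any later "=" characters inside the value, which matters if a summary field ever carries free-form text.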


@@ -0,0 +1,61 @@
package app
import (
"os"
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestApplyLatestSATStatusesMarksStorageByDevice(t *testing.T) {
baseDir := t.TempDir()
runDir := filepath.Join(baseDir, "storage-20260325-161151")
if err := os.MkdirAll(runDir, 0755); err != nil {
t.Fatal(err)
}
raw := "run_at_utc=2026-03-25T16:11:51Z\nnvme0n1_nvme_smart_log_status=OK\nsda_smartctl_health_status=FAILED\noverall_status=FAILED\n"
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(raw), 0644); err != nil {
t.Fatal(err)
}
nvme := schema.HardwareStorage{Telemetry: map[string]any{"linux_device": "/dev/nvme0n1"}}
usb := schema.HardwareStorage{Telemetry: map[string]any{"linux_device": "/dev/sda"}}
snap := schema.HardwareSnapshot{Storage: []schema.HardwareStorage{nvme, usb}}
applyLatestSATStatuses(&snap, baseDir)
if snap.Storage[0].Status == nil || *snap.Storage[0].Status != "OK" {
t.Fatalf("nvme status=%v want OK", snap.Storage[0].Status)
}
if snap.Storage[1].Status == nil || *snap.Storage[1].Status != "Critical" {
t.Fatalf("sda status=%v want Critical", snap.Storage[1].Status)
}
}
func TestApplyLatestSATStatusesMarksAMDGPUs(t *testing.T) {
baseDir := t.TempDir()
runDir := filepath.Join(baseDir, "gpu-amd-20260325-161436")
if err := os.MkdirAll(runDir, 0755); err != nil {
t.Fatal(err)
}
raw := "run_at_utc=2026-03-25T16:14:36Z\noverall_status=FAILED\n"
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(raw), 0644); err != nil {
t.Fatal(err)
}
class := "DisplayController"
manufacturer := "Advanced Micro Devices, Inc. [AMD/ATI]"
snap := schema.HardwareSnapshot{
PCIeDevices: []schema.HardwarePCIeDevice{{
DeviceClass: &class,
Manufacturer: &manufacturer,
}},
}
applyLatestSATStatuses(&snap, baseDir)
if snap.PCIeDevices[0].Status == nil || *snap.PCIeDevices[0].Status != "Critical" {
t.Fatalf("gpu status=%v want Critical", snap.PCIeDevices[0].Status)
}
}
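
The storage test above depends on how summary keys such as nvme0n1_nvme_smart_log_status are split into a device name and a step name. The sketch below isolates that split so it can be read on its own. It mirrors the logic in parseStorageSATStatus, but the function name splitStorageKey is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// splitStorageKey is a hypothetical helper that mirrors the key handling in
// parseStorageSATStatus: "<device>_<step>_status" becomes (device, step).
func splitStorageKey(key string) (device, step string, ok bool) {
	if !strings.HasSuffix(key, "_status") || key == "overall_status" {
		return "", "", false
	}
	base := strings.TrimSuffix(key, "_status")
	idx := strings.Index(base, "_")
	if idx <= 0 {
		return "", "", false
	}
	// Everything before the first underscore is treated as the device name.
	return base[:idx], strings.ReplaceAll(base[idx+1:], "_", "-"), true
}

func main() {
	for _, key := range []string{"nvme0n1_nvme_smart_log_status", "sda_smartctl_health_status", "overall_status"} {
		device, step, ok := splitStorageKey(key)
		fmt.Printf("%s -> device=%q step=%q ok=%v\n", key, device, step, ok)
	}
}
```

Because the split happens at the first underscore, a device name that itself contains an underscore would be cut short. The device names used in the test (nvme0n1, sda) do not have that problem.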


@@ -0,0 +1,396 @@
package app
import (
"archive/tar"
"compress/gzip"
"fmt"
"io"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
"time"
)
var supportBundleServices = []string{
"bee-audit.service",
"bee-web.service",
"bee-network.service",
"bee-nvidia.service",
"bee-preflight.service",
"bee-sshsetup.service",
}
var supportBundleCommands = []struct {
name string
cmd []string
}{
{name: "system/uname.txt", cmd: []string{"uname", "-a"}},
{name: "system/lsmod.txt", cmd: []string{"lsmod"}},
{name: "system/lspci-nn.txt", cmd: []string{"lspci", "-nn"}},
{name: "system/ip-addr.txt", cmd: []string{"ip", "addr"}},
{name: "system/ip-route.txt", cmd: []string{"ip", "route"}},
{name: "system/mount.txt", cmd: []string{"mount"}},
{name: "system/df-h.txt", cmd: []string{"df", "-h"}},
{name: "system/dmesg-tail.txt", cmd: []string{"sh", "-c", "dmesg | tail -n 200"}},
}
const supportBundleGlob = "bee-support-*.tar.gz"
func BuildSupportBundle(exportDir string) (string, error) {
exportDir = strings.TrimSpace(exportDir)
if exportDir == "" {
exportDir = DefaultExportDir
}
if err := os.MkdirAll(exportDir, 0755); err != nil {
return "", err
}
if err := cleanupOldSupportBundles(os.TempDir()); err != nil {
return "", err
}
host := sanitizeFilename(hostnameOr("unknown"))
ts := time.Now().UTC().Format("20060102-150405")
stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s", host, ts))
if err := os.MkdirAll(stageRoot, 0755); err != nil {
return "", err
}
defer os.RemoveAll(stageRoot)
if err := copyExportDirForSupportBundle(exportDir, filepath.Join(stageRoot, "export")); err != nil {
return "", err
}
if err := writeJournalDump(filepath.Join(stageRoot, "systemd", "combined.journal.log")); err != nil {
return "", err
}
for _, svc := range supportBundleServices {
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".status.txt"), []string{"systemctl", "status", svc, "--no-pager"}); err != nil {
return "", err
}
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".journal.log"), []string{"journalctl", "--no-pager", "-u", svc}); err != nil {
return "", err
}
}
for _, item := range supportBundleCommands {
if err := writeCommandOutput(filepath.Join(stageRoot, item.name), item.cmd); err != nil {
return "", err
}
}
if err := writeManifest(filepath.Join(stageRoot, "manifest.txt"), exportDir, stageRoot); err != nil {
return "", err
}
archivePath := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s.tar.gz", host, ts))
if err := createSupportTarGz(archivePath, stageRoot); err != nil {
return "", err
}
return archivePath, nil
}
func LatestSupportBundlePath() (string, error) {
return latestSupportBundlePath(os.TempDir())
}
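// cleanupOldSupportBundles deletes bundles older than 24 hours, then keeps
// only the three most recent of the remainder. Removal errors are ignored.
func cleanupOldSupportBundles(dir string) error {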
matches, err := filepath.Glob(filepath.Join(dir, supportBundleGlob))
if err != nil {
return err
}
entries := supportBundleEntries(matches)
for path, mod := range entries {
if time.Since(mod) > 24*time.Hour {
_ = os.Remove(path)
delete(entries, path)
}
}
ordered := orderSupportBundles(entries)
if len(ordered) > 3 {
for _, old := range ordered[3:] {
_ = os.Remove(old)
}
}
return nil
}
func latestSupportBundlePath(dir string) (string, error) {
matches, err := filepath.Glob(filepath.Join(dir, supportBundleGlob))
if err != nil {
return "", err
}
ordered := orderSupportBundles(supportBundleEntries(matches))
if len(ordered) == 0 {
return "", os.ErrNotExist
}
return ordered[0], nil
}
func supportBundleEntries(matches []string) map[string]time.Time {
entries := make(map[string]time.Time, len(matches))
for _, match := range matches {
info, err := os.Stat(match)
if err != nil {
continue
}
entries[match] = info.ModTime()
}
return entries
}
func orderSupportBundles(entries map[string]time.Time) []string {
ordered := make([]string, 0, len(entries))
for path := range entries {
ordered = append(ordered, path)
}
sort.Slice(ordered, func(i, j int) bool {
return entries[ordered[i]].After(entries[ordered[j]])
})
return ordered
}
func writeJournalDump(dst string) error {
args := []string{"--no-pager"}
for _, svc := range supportBundleServices {
args = append(args, "-u", svc)
}
raw, err := exec.Command("journalctl", args...).CombinedOutput()
if len(raw) == 0 && err != nil {
raw = []byte(err.Error() + "\n")
}
if len(raw) == 0 {
raw = []byte("no journal output\n")
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
return os.WriteFile(dst, raw, 0644)
}
func writeCommandOutput(dst string, cmd []string) error {
if len(cmd) == 0 {
return nil
}
raw, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
if len(raw) == 0 {
if err != nil {
raw = []byte(err.Error() + "\n")
} else {
raw = []byte("no output\n")
}
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
return os.WriteFile(dst, raw, 0644)
}
func writeManifest(dst, exportDir, stageRoot string) error {
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
var body strings.Builder
fmt.Fprintf(&body, "bee_version=%s\n", buildVersion())
fmt.Fprintf(&body, "host=%s\n", hostnameOr("unknown"))
fmt.Fprintf(&body, "generated_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
fmt.Fprintf(&body, "export_dir=%s\n", exportDir)
fmt.Fprintf(&body, "\nfiles:\n")
var files []string
if err := filepath.Walk(stageRoot, func(path string, info os.FileInfo, err error) error {
if err != nil || info.IsDir() {
return err
}
if filepath.Clean(path) == filepath.Clean(dst) {
return nil
}
rel, err := filepath.Rel(stageRoot, path)
if err != nil {
return err
}
files = append(files, fmt.Sprintf("%s\t%d", rel, info.Size()))
return nil
}); err != nil {
return err
}
sort.Strings(files)
for _, line := range files {
body.WriteString(line)
body.WriteByte('\n')
}
return os.WriteFile(dst, []byte(body.String()), 0644)
}
func buildVersion() string {
raw, err := exec.Command("bee", "version").CombinedOutput()
if err != nil {
return "unknown"
}
return strings.TrimSpace(string(raw))
}
func copyDirContents(srcDir, dstDir string) error {
entries, err := os.ReadDir(srcDir)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
for _, entry := range entries {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(dstDir, entry.Name())
if err := copyPath(src, dst); err != nil {
return err
}
}
return nil
}
func copyExportDirForSupportBundle(srcDir, dstDir string) error {
return copyDirContentsFiltered(srcDir, dstDir, func(rel string, info os.FileInfo) bool {
cleanRel := filepath.ToSlash(strings.TrimPrefix(filepath.Clean(rel), "./"))
if cleanRel == "" {
return true
}
if strings.HasPrefix(cleanRel, "bee-sat/") && strings.HasSuffix(cleanRel, ".tar.gz") {
return false
}
if strings.HasPrefix(filepath.Base(cleanRel), "bee-support-") && strings.HasSuffix(cleanRel, ".tar.gz") {
return false
}
return true
})
}
func copyDirContentsFiltered(srcDir, dstDir string, keep func(rel string, info os.FileInfo) bool) error {
entries, err := os.ReadDir(srcDir)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
for _, entry := range entries {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(dstDir, entry.Name())
if err := copyPathFiltered(srcDir, src, dst, keep); err != nil {
return err
}
}
return nil
}
func copyPath(src, dst string) error {
info, err := os.Stat(src)
if err != nil {
return err
}
if info.IsDir() {
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
return err
}
entries, err := os.ReadDir(src)
if err != nil {
return err
}
for _, entry := range entries {
if err := copyPath(filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name())); err != nil {
return err
}
}
return nil
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
in, err := os.Open(src)
if err != nil {
return err
}
defer in.Close()
out, err := os.OpenFile(dst, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, info.Mode().Perm())
if err != nil {
return err
}
defer out.Close()
_, err = io.Copy(out, in)
return err
}
func copyPathFiltered(rootSrc, src, dst string, keep func(rel string, info os.FileInfo) bool) error {
info, err := os.Stat(src)
if err != nil {
return err
}
rel, err := filepath.Rel(rootSrc, src)
if err != nil {
return err
}
if keep != nil && !keep(rel, info) {
return nil
}
if info.IsDir() {
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
return err
}
entries, err := os.ReadDir(src)
if err != nil {
return err
}
for _, entry := range entries {
if err := copyPathFiltered(rootSrc, filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name()), keep); err != nil {
return err
}
}
return nil
}
return copyPath(src, dst)
}
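func createSupportTarGz(dst, srcDir string) (err error) {
file, err := os.Create(dst)
if err != nil {
return err
}
// Surface close errors: a failed flush would otherwise leave a truncated
// archive that is reported as success. Deferred calls run tar, gzip, file.
closeInto := func(c io.Closer) {
if cerr := c.Close(); err == nil {
err = cerr
}
}
defer closeInto(file)
gz := gzip.NewWriter(file)
defer closeInto(gz)
tw := tar.NewWriter(gz)
defer closeInto(tw)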
base := filepath.Dir(srcDir)
return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if info.IsDir() {
return nil
}
header, err := tar.FileInfoHeader(info, "")
if err != nil {
return err
}
header.Name, err = filepath.Rel(base, path)
if err != nil {
return err
}
if err := tw.WriteHeader(header); err != nil {
return err
}
f, err := os.Open(path)
if err != nil {
return err
}
defer f.Close()
_, err = io.Copy(tw, f)
return err
})
}


@@ -0,0 +1,252 @@
package collector
import (
"encoding/csv"
"log/slog"
"os/exec"
"path/filepath"
"sort"
"strconv"
"strings"
"bee/audit/internal/schema"
)
var (
amdSMIExecCommand = exec.Command
amdSMILookPath = exec.LookPath
amdSMIGlob = filepath.Glob
)
var amdSMIExecutableGlobs = []string{
"/opt/rocm/bin/rocm-smi",
"/opt/rocm-*/bin/rocm-smi",
"/usr/local/bin/rocm-smi",
}
type amdGPUInfo struct {
BDF string
Serial string
Product string
Firmware string
PowerW *float64
TempC *float64
}
func enrichPCIeWithAMD(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
if !hasAMDGPUDevices(devs) {
return devs
}
infoByBDF, err := queryAMDGPUs()
if err != nil {
slog.Info("amdgpu: enrichment skipped", "err", err)
return devs
}
enriched := 0
for i := range devs {
if !isAMDGPUDevice(devs[i]) || devs[i].BDF == nil {
continue
}
info, ok := infoByBDF[normalizePCIeBDF(*devs[i].BDF)]
if !ok {
continue
}
if strings.TrimSpace(info.Serial) != "" {
devs[i].SerialNumber = &info.Serial
}
if strings.TrimSpace(info.Firmware) != "" {
devs[i].Firmware = &info.Firmware
}
if strings.TrimSpace(info.Product) != "" && devs[i].Model == nil {
devs[i].Model = &info.Product
}
if info.PowerW != nil {
devs[i].PowerW = info.PowerW
}
if info.TempC != nil {
devs[i].TemperatureC = info.TempC
}
enriched++
}
if enriched > 0 {
slog.Info("amdgpu: enriched", "count", enriched)
}
return devs
}
func hasAMDGPUDevices(devs []schema.HardwarePCIeDevice) bool {
for _, dev := range devs {
if isAMDGPUDevice(dev) {
return true
}
}
return false
}
func isAMDGPUDevice(dev schema.HardwarePCIeDevice) bool {
if dev.Manufacturer == nil || dev.DeviceClass == nil {
return false
}
manufacturer := strings.ToLower(strings.TrimSpace(*dev.Manufacturer))
return strings.Contains(manufacturer, "advanced micro devices") && isGPUClass(strings.TrimSpace(*dev.DeviceClass))
}
func queryAMDGPUs() (map[string]amdGPUInfo, error) {
busByCard, err := queryAMDField("--showbus")
if err != nil {
return nil, err
}
infoByCard := map[string]amdGPUInfo{}
for card, bus := range busByCard {
bdf := normalizePCIeBDF(bus)
if bdf == "" {
continue
}
infoByCard[card] = amdGPUInfo{BDF: bdf}
}
if len(infoByCard) == 0 {
return map[string]amdGPUInfo{}, nil
}
mergeAMDField(infoByCard, "--showserial", func(info *amdGPUInfo, value string) { info.Serial = value })
mergeAMDField(infoByCard, "--showproductname", func(info *amdGPUInfo, value string) { info.Product = value })
mergeAMDField(infoByCard, "--showvbios", func(info *amdGPUInfo, value string) { info.Firmware = value })
mergeAMDNumericField(infoByCard, "--showpower", func(info *amdGPUInfo, value float64) { info.PowerW = &value })
mergeAMDNumericField(infoByCard, "--showtemp", func(info *amdGPUInfo, value float64) { info.TempC = &value })
result := make(map[string]amdGPUInfo, len(infoByCard))
for _, info := range infoByCard {
if info.BDF == "" {
continue
}
result[info.BDF] = info
}
return result, nil
}
func mergeAMDField(infoByCard map[string]amdGPUInfo, flag string, apply func(*amdGPUInfo, string)) {
values, err := queryAMDField(flag)
if err != nil {
return
}
for card, value := range values {
info, ok := infoByCard[card]
if !ok {
continue
}
value = strings.TrimSpace(value)
if value == "" {
continue
}
apply(&info, value)
infoByCard[card] = info
}
}
func mergeAMDNumericField(infoByCard map[string]amdGPUInfo, flag string, apply func(*amdGPUInfo, float64)) {
values, err := queryAMDNumericField(flag)
if err != nil {
return
}
for card, value := range values {
info, ok := infoByCard[card]
if !ok {
continue
}
apply(&info, value)
infoByCard[card] = info
}
}
func queryAMDField(flag string) (map[string]string, error) {
cmd, err := resolveAMDSMICmd(flag, "--csv")
if err != nil {
return nil, err
}
out, err := amdSMIExecCommand(cmd[0], cmd[1:]...).CombinedOutput()
if err != nil {
return nil, err
}
return parseROCmSingleValueCSV(string(out)), nil
}
func queryAMDNumericField(flag string) (map[string]float64, error) {
values, err := queryAMDField(flag)
if err != nil {
return nil, err
}
out := map[string]float64{}
for card, raw := range values {
if value, ok := firstFloat(raw); ok {
out[card] = value
}
}
return out, nil
}
func resolveAMDSMICmd(args ...string) ([]string, error) {
if path, err := amdSMILookPath("rocm-smi"); err == nil {
return append([]string{path}, args...), nil
}
for _, pattern := range amdSMIExecutableGlobs {
matches, err := amdSMIGlob(pattern)
if err != nil {
continue
}
sort.Strings(matches)
if len(matches) > 0 {
return append([]string{matches[0]}, args...), nil
}
}
return nil, exec.ErrNotFound
}
func parseROCmSingleValueCSV(raw string) map[string]string {
rows := map[string]string{}
reader := csv.NewReader(strings.NewReader(raw))
reader.FieldsPerRecord = -1
records, err := reader.ReadAll()
if err != nil {
return rows
}
for _, rec := range records {
if len(rec) < 2 {
continue
}
card := normalizeROCmCardKey(rec[0])
if card == "" {
continue
}
value := strings.TrimSpace(strings.Join(rec[1:], ","))
if value == "" || looksLikeCSVHeaderValue(value) {
continue
}
rows[card] = value
}
return rows
}
func normalizeROCmCardKey(raw string) string {
raw = strings.ToLower(strings.TrimSpace(raw))
raw = strings.Trim(raw, "\"")
if raw == "" {
return ""
}
if raw == "device" || raw == "gpu" || raw == "card" {
return ""
}
if strings.HasPrefix(raw, "card") {
return raw
}
if _, err := strconv.Atoi(raw); err == nil {
return "card" + raw
}
return ""
}
func looksLikeCSVHeaderValue(value string) bool {
value = strings.ToLower(strings.TrimSpace(value))
return strings.Contains(value, "product") ||
strings.Contains(value, "serial") ||
strings.Contains(value, "vbios") ||
strings.Contains(value, "bus")
}
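
queryAMDNumericField depends on firstFloat, which this diff does not show. The accompanying test expects a value such as "45.5c" to yield 45.5. One way such a helper could work is sketched below. The name firstFloatSketch and the regular expression are chosen for illustration and may differ from the project's actual helper.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// numberRE matches the first signed decimal number in a string.
var numberRE = regexp.MustCompile(`-?\d+(?:\.\d+)?`)

// firstFloatSketch is a hypothetical stand-in for firstFloat: it returns the
// first number found in raw, ignoring any unit suffix or surrounding text.
func firstFloatSketch(raw string) (float64, bool) {
	match := numberRE.FindString(raw)
	if match == "" {
		return 0, false
	}
	value, err := strconv.ParseFloat(match, 64)
	if err != nil {
		return 0, false
	}
	return value, true
}

func main() {
	for _, raw := range []string{"45.5c", "35.0 W", "N/A"} {
		value, ok := firstFloatSketch(raw)
		fmt.Printf("%q -> %v %v\n", raw, value, ok)
	}
}
```

Returning a boolean alongside the value lets callers skip cards that report "N/A" instead of recording a misleading zero.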


@@ -0,0 +1,56 @@
package collector
import (
"os/exec"
"testing"
)
func TestParseROCmSingleValueCSV(t *testing.T) {
raw := "device,Serial Number\ncard0,ABC123\ncard1,XYZ789\n"
got := parseROCmSingleValueCSV(raw)
if got["card0"] != "ABC123" {
t.Fatalf("card0=%q want ABC123", got["card0"])
}
if got["card1"] != "XYZ789" {
t.Fatalf("card1=%q want XYZ789", got["card1"])
}
}
func TestQueryAMDNumericFieldParsesUnits(t *testing.T) {
origExec := amdSMIExecCommand
origLookPath := amdSMILookPath
t.Cleanup(func() {
amdSMIExecCommand = origExec
amdSMILookPath = origLookPath
})
amdSMILookPath = func(string) (string, error) { return "/usr/bin/rocm-smi", nil }
amdSMIExecCommand = func(name string, args ...string) *exec.Cmd {
return exec.Command("sh", "-c", "printf 'device,Temperature\\ncard0,45.5c\\ncard1,67.0c\\n'")
}
got, err := queryAMDNumericField("--showtemp")
if err != nil {
t.Fatalf("queryAMDNumericField: %v", err)
}
if got["card0"] != 45.5 {
t.Fatalf("card0=%v want 45.5", got["card0"])
}
if got["card1"] != 67.0 {
t.Fatalf("card1=%v want 67.0", got["card1"])
}
}
func TestNormalizeROCmCardKey(t *testing.T) {
tests := map[string]string{
"0": "card0",
"card1": "card1",
"Device": "",
"": "",
}
for input, want := range tests {
if got := normalizeROCmCardKey(input); got != want {
t.Fatalf("normalizeROCmCardKey(%q)=%q want %q", input, got, want)
}
}
}


@@ -4,10 +4,27 @@ import (
"bee/audit/internal/schema"
"bufio"
"log/slog"
"os"
"os/exec"
"strings"
)
var execDmidecode = func(typeNum string) (string, error) {
out, err := exec.Command("dmidecode", "-t", typeNum).Output()
if err != nil {
return "", err
}
return string(out), nil
}
var execIpmitool = func(args ...string) (string, error) {
out, err := exec.Command("ipmitool", args...).Output()
if err != nil {
return "", err
}
return string(out), nil
}
// collectBoard runs dmidecode for types 0, 1, 2 and returns the board record
// plus the BIOS firmware entry. Any failure is logged and returns zero values.
func collectBoard() (schema.HardwareBoard, []schema.HardwareFirmwareRecord) {
@@ -61,6 +78,45 @@ func parseBoard(type1, type2 string) schema.HardwareBoard {
return board
}
// collectBMCFirmware collects BMC firmware version via ipmitool mc info.
// Returns nil if ipmitool is missing, /dev/ipmi0 is absent, or any error occurs.
func collectBMCFirmware() []schema.HardwareFirmwareRecord {
if _, err := exec.LookPath("ipmitool"); err != nil {
return nil
}
if _, err := os.Stat("/dev/ipmi0"); err != nil {
return nil
}
out, err := execIpmitool("mc", "info")
if err != nil {
slog.Info("bmc: ipmitool mc info unavailable", "err", err)
return nil
}
version := parseBMCFirmwareRevision(out)
if version == "" {
return nil
}
slog.Info("bmc: collected", "version", version)
return []schema.HardwareFirmwareRecord{
{DeviceName: "BMC", Version: version},
}
}
// parseBMCFirmwareRevision extracts the "Firmware Revision" field from ipmitool mc info output.
func parseBMCFirmwareRevision(out string) string {
for _, line := range strings.Split(out, "\n") {
line = strings.TrimSpace(line)
key, val, ok := strings.Cut(line, ":")
if !ok {
continue
}
if strings.TrimSpace(key) == "Firmware Revision" {
return strings.TrimSpace(val)
}
}
return ""
}
// parseBIOSFirmware extracts BIOS version from dmidecode type 0 output.
func parseBIOSFirmware(type0 string) []schema.HardwareFirmwareRecord {
fields := parseDMIFields(type0, "BIOS Information")
@@ -141,9 +197,5 @@ func cleanDMIValue(v string) string {
// runDmidecode executes dmidecode -t <typeNum> and returns its stdout.
func runDmidecode(typeNum string) (string, error) {
out, err := exec.Command("dmidecode", "-t", typeNum).Output()
if err != nil {
return "", err
}
return string(out), nil
return execDmidecode(typeNum)
}
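
The `parseBMCFirmwareRevision` helper in the hunk above can be exercised without a BMC. The following is a minimal sketch: the function body follows the diff, while `sampleMCInfo` is hypothetical text shaped like `ipmitool mc info` output, not captured from real hardware.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBMCFirmwareRevision follows the function in the diff: it returns the
// value of the "Firmware Revision" line, or "" if no such line exists.
func parseBMCFirmwareRevision(out string) string {
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue // lines without a colon carry no key/value pair
		}
		if strings.TrimSpace(key) == "Firmware Revision" {
			return strings.TrimSpace(val)
		}
	}
	return ""
}

// sampleMCInfo is a hypothetical sample (assumption), used only for illustration.
const sampleMCInfo = `Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 1.74
IPMI Version              : 2.0
`

func main() {
	fmt.Printf("%q\n", parseBMCFirmwareRevision(sampleMCInfo))
	fmt.Printf("%q\n", parseBMCFirmwareRevision("no colon here"))
}
```

Because the parser takes a plain string, tests can feed it fixtures directly, and the `execIpmitool` variable only needs to be swapped when testing `collectBMCFirmware` itself.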


@@ -4,15 +4,18 @@
package collector
import (
"bee/audit/internal/runtimeenv"
"bee/audit/internal/schema"
"log/slog"
"os"
"time"
)
// Run executes all collectors and returns the combined snapshot.
// Partial failures are logged as warnings; collection always completes.
func Run() schema.HardwareIngestRequest {
func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
start := time.Now()
collectedAt := time.Now().UTC().Format(time.RFC3339)
slog.Info("audit started")
snap := schema.HardwareSnapshot{}
@@ -20,32 +23,45 @@ func Run() schema.HardwareIngestRequest {
board, biosFW := collectBoard()
snap.Board = board
snap.Firmware = append(snap.Firmware, biosFW...)
snap.Firmware = append(snap.Firmware, collectBMCFirmware()...)
cpus, cpuFW := collectCPUs(snap.Board.SerialNumber)
snap.CPUs = cpus
snap.Firmware = append(snap.Firmware, cpuFW...)
snap.CPUs = collectCPUs()
snap.Memory = collectMemory()
sensorDoc, err := readSensorsJSONDoc()
if err != nil {
slog.Info("sensors: unavailable for enrichment", "err", err)
}
snap.CPUs = enrichCPUsWithTelemetry(snap.CPUs, sensorDoc)
snap.Memory = enrichMemoryWithTelemetry(snap.Memory, sensorDoc)
snap.Storage = collectStorage()
snap.PCIeDevices = collectPCIe()
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices, snap.Board.SerialNumber)
snap.PCIeDevices = enrichPCIeWithAMD(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithPCISerials(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithMellanox(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNICTelemetry(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithRAIDTelemetry(snap.PCIeDevices)
snap.Storage = enrichStorageWithVROC(snap.Storage, snap.PCIeDevices)
snap.Storage = appendUniqueStorage(snap.Storage, collectRAIDStorage(snap.PCIeDevices))
snap.PowerSupplies = collectPSUs()
snap.PowerSupplies = enrichPSUsWithTelemetry(snap.PowerSupplies, sensorDoc)
snap.Sensors = buildSensorsFromDoc(sensorDoc)
finalizeSnapshot(&snap, collectedAt)
// remaining collectors added in steps 1.8 1.10
slog.Info("audit completed", "duration", time.Since(start).Round(time.Millisecond))
sourceType := "livcd"
protocol := "os-direct"
sourceType := "manual"
var targetHost *string
if hostname, err := os.Hostname(); err == nil && hostname != "" {
targetHost = &hostname
}
return schema.HardwareIngestRequest{
SourceType: &sourceType,
Protocol: &protocol,
CollectedAt: time.Now().UTC().Format(time.RFC3339),
TargetHost: targetHost,
CollectedAt: collectedAt,
Hardware: snap,
}
}


@@ -0,0 +1,64 @@
package collector
import "strings"
const (
statusOK = "OK"
statusWarning = "Warning"
statusCritical = "Critical"
statusUnknown = "Unknown"
statusEmpty = "Empty"
)
func mapPCIeDeviceClass(raw string) string {
normalized := strings.ToLower(strings.TrimSpace(raw))
switch {
case normalized == "":
return ""
case strings.Contains(normalized, "ethernet controller"):
return "EthernetController"
case strings.Contains(normalized, "fibre channel"):
return "FibreChannelController"
case strings.Contains(normalized, "network controller"), strings.Contains(normalized, "infiniband controller"):
return "NetworkController"
case strings.Contains(normalized, "serial attached scsi"), strings.Contains(normalized, "storage controller"):
return "StorageController"
case strings.Contains(normalized, "raid"), strings.Contains(normalized, "mass storage"):
return "MassStorageController"
case strings.Contains(normalized, "display controller"):
return "DisplayController"
case strings.Contains(normalized, "vga"), strings.Contains(normalized, "3d controller"), strings.Contains(normalized, "video controller"):
return "VideoController"
case strings.Contains(normalized, "processing accelerators"), strings.Contains(normalized, "processing accelerator"):
return "ProcessingAccelerator"
default:
return raw
}
}
func isNICClass(class string) bool {
switch strings.TrimSpace(class) {
case "EthernetController", "NetworkController":
return true
default:
return false
}
}
func isGPUClass(class string) bool {
switch strings.TrimSpace(class) {
case "VideoController", "DisplayController", "ProcessingAccelerator":
return true
default:
return false
}
}
func isRAIDClass(class string) bool {
switch strings.TrimSpace(class) {
case "MassStorageController", "StorageController":
return true
default:
return false
}
}
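
The case order in `mapPCIeDeviceClass` decides the result when a class string matches more than one substring. The sketch below copies the function from the hunk above and feeds it a few class strings. The inputs are assumed to resemble `lspci` class names and are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// mapPCIeDeviceClass is copied from the diff; the first matching case wins.
func mapPCIeDeviceClass(raw string) string {
	normalized := strings.ToLower(strings.TrimSpace(raw))
	switch {
	case normalized == "":
		return ""
	case strings.Contains(normalized, "ethernet controller"):
		return "EthernetController"
	case strings.Contains(normalized, "fibre channel"):
		return "FibreChannelController"
	case strings.Contains(normalized, "network controller"), strings.Contains(normalized, "infiniband controller"):
		return "NetworkController"
	case strings.Contains(normalized, "serial attached scsi"), strings.Contains(normalized, "storage controller"):
		return "StorageController"
	case strings.Contains(normalized, "raid"), strings.Contains(normalized, "mass storage"):
		return "MassStorageController"
	case strings.Contains(normalized, "display controller"):
		return "DisplayController"
	case strings.Contains(normalized, "vga"), strings.Contains(normalized, "3d controller"), strings.Contains(normalized, "video controller"):
		return "VideoController"
	case strings.Contains(normalized, "processing accelerators"), strings.Contains(normalized, "processing accelerator"):
		return "ProcessingAccelerator"
	default:
		return raw // unrecognized classes pass through unchanged
	}
}

func main() {
	// Hypothetical inputs (assumption): strings shaped like lspci class names.
	for _, in := range []string{
		"Ethernet controller",
		"RAID bus controller",
		"Mass storage controller",
		"VGA compatible controller",
		"Non-Volatile memory controller",
	} {
		fmt.Printf("%-32s -> %s\n", in, mapPCIeDeviceClass(in))
	}
}
```

Note that "Mass storage controller" contains the substring "storage controller", so as written it resolves to `StorageController` before the `mass storage` case is reached. Both values satisfy `isRAIDClass`, so downstream RAID enrichment treats them alike.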


@@ -3,42 +3,39 @@ package collector
import (
"bee/audit/internal/schema"
"bufio"
"fmt"
"log/slog"
"os"
"path/filepath"
"strconv"
"strings"
)
// collectCPUs runs dmidecode -t 4 and reads microcode version from sysfs.
func collectCPUs(boardSerial string) ([]schema.HardwareCPU, []schema.HardwareFirmwareRecord) {
// collectCPUs runs dmidecode -t 4 and enriches CPUs with microcode from sysfs.
func collectCPUs() []schema.HardwareCPU {
out, err := runDmidecode("4")
if err != nil {
slog.Warn("cpu: dmidecode type 4 failed", "err", err)
return nil, nil
return nil
}
cpus := parseCPUs(out, boardSerial)
var firmware []schema.HardwareFirmwareRecord
cpus := parseCPUs(out)
if mc := readMicrocode(); mc != "" {
firmware = append(firmware, schema.HardwareFirmwareRecord{
DeviceName: "CPU Microcode",
Version: mc,
})
for i := range cpus {
cpus[i].Firmware = &mc
}
}
slog.Info("cpu: collected", "count", len(cpus))
return cpus, firmware
return cpus
}
// parseCPUs splits dmidecode output into per-processor sections and parses each.
func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
func parseCPUs(output string) []schema.HardwareCPU {
sections := splitDMISections(output, "Processor Information")
cpus := make([]schema.HardwareCPU, 0, len(sections))
for _, section := range sections {
cpu, ok := parseCPUSection(section, boardSerial)
cpu, ok := parseCPUSection(section)
if !ok {
continue
}
@@ -49,14 +46,16 @@ func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
// parseCPUSection parses one "Processor Information" block into a HardwareCPU.
// Returns false if the socket is unpopulated.
func parseCPUSection(fields map[string]string, boardSerial string) (schema.HardwareCPU, bool) {
func parseCPUSection(fields map[string]string) (schema.HardwareCPU, bool) {
status := parseCPUStatus(fields["Status"])
if status == "EMPTY" {
if status == statusEmpty {
return schema.HardwareCPU{}, false
}
cpu := schema.HardwareCPU{}
cpu.Status = &status
present := true
cpu.Present = &present
if socket, ok := parseSocketIndex(fields["Socket Designation"]); ok {
cpu.Socket = &socket
@@ -70,11 +69,6 @@ func parseCPUSection(fields map[string]string, boardSerial string) (schema.Hardw
}
if v := cleanDMIValue(fields["Serial Number"]); v != "" {
cpu.SerialNumber = &v
} else if boardSerial != "" && cpu.Socket != nil {
// Intel Xeon never exposes serial via DMI — generate stable fallback
// matching core's generateCPUVendorSerial() logic
fb := fmt.Sprintf("%s-CPU-%d", boardSerial, *cpu.Socket)
cpu.SerialNumber = &fb
}
if v := parseMHz(fields["Max Speed"]); v > 0 {
@@ -99,15 +93,15 @@ func parseCPUStatus(raw string) string {
upper := strings.ToUpper(raw)
switch {
case upper == "" || upper == "UNKNOWN":
return "UNKNOWN"
return statusUnknown
case strings.Contains(upper, "UNPOPULATED") || strings.Contains(upper, "NOT POPULATED"):
return "EMPTY"
return statusEmpty
case strings.Contains(upper, "ENABLED"):
return "OK"
return statusOK
case strings.Contains(upper, "DISABLED"):
return "WARNING"
return statusWarning
default:
return "UNKNOWN"
return statusUnknown
}
}
@@ -178,7 +172,7 @@ func parseInt(v string) int {
// readMicrocode reads the CPU microcode revision from sysfs.
// Returns empty string if unavailable.
func readMicrocode() string {
data, err := os.ReadFile("/sys/devices/system/cpu/cpu0/microcode/version")
data, err := os.ReadFile(filepath.Join(cpuSysBaseDir, "cpu0", "microcode", "version"))
if err != nil {
return ""
}


@@ -0,0 +1,196 @@
package collector
import (
"bee/audit/internal/schema"
"os"
"path/filepath"
"regexp"
"sort"
"strconv"
"strings"
)
var (
cpuSysBaseDir = "/sys/devices/system/cpu"
socketIndexRe = regexp.MustCompile(`(?i)(?:package id|socket|cpu)\s*([0-9]+)`)
)
func enrichCPUsWithTelemetry(cpus []schema.HardwareCPU, doc sensorsDoc) []schema.HardwareCPU {
if len(cpus) == 0 {
return cpus
}
tempBySocket := cpuTempsFromSensors(doc, len(cpus))
powerBySocket := cpuPowerFromSensors(doc, len(cpus))
throttleBySocket := cpuThrottleBySocket()
for i := range cpus {
socket := 0
if cpus[i].Socket != nil {
socket = *cpus[i].Socket
}
if value, ok := tempBySocket[socket]; ok {
cpus[i].TemperatureC = &value
}
if value, ok := powerBySocket[socket]; ok {
cpus[i].PowerW = &value
}
if value, ok := throttleBySocket[socket]; ok {
cpus[i].Throttled = &value
}
}
return cpus
}
func cpuTempsFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
out := map[int]float64{}
if len(doc) == 0 {
return out
}
var fallback []float64
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if classifySensorFeature(feature) != "temp" {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
if socket, ok := detectCPUSocket(chip, featureName); ok {
if _, exists := out[socket]; !exists {
out[socket] = temp
}
continue
}
if isLikelyCPUTemp(chip, featureName) {
fallback = append(fallback, temp)
}
}
}
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
out[0] = fallback[0]
}
return out
}
func cpuPowerFromSensors(doc sensorsDoc, cpuCount int) map[int]float64 {
out := map[int]float64{}
if len(doc) == 0 {
return out
}
var fallback []float64
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if classifySensorFeature(feature) != "power" {
continue
}
power, ok := firstFeatureFloatWithContains(feature, []string{"power"})
if !ok {
continue
}
if socket, ok := detectCPUSocket(chip, featureName); ok {
if _, exists := out[socket]; !exists {
out[socket] = power
}
continue
}
if isLikelyCPUPower(chip, featureName) {
fallback = append(fallback, power)
}
}
}
if len(out) == 0 && cpuCount == 1 && len(fallback) > 0 {
out[0] = fallback[0]
}
return out
}
func detectCPUSocket(parts ...string) (int, bool) {
for _, part := range parts {
matches := socketIndexRe.FindStringSubmatch(strings.ToLower(part))
if len(matches) == 2 {
value, err := strconv.Atoi(matches[1])
if err == nil {
return value, true
}
}
}
return 0, false
}
func isLikelyCPUTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "coretemp") ||
strings.Contains(value, "k10temp") ||
strings.Contains(value, "package id") ||
strings.Contains(value, "tdie") ||
strings.Contains(value, "tctl") ||
strings.Contains(value, "cpu temp")
}
func isLikelyCPUPower(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "intel-rapl") ||
strings.Contains(value, "package id") ||
strings.Contains(value, "package-") ||
strings.Contains(value, "cpu power")
}
func cpuThrottleBySocket() map[int]bool {
out := map[int]bool{}
cpuDirs, err := filepath.Glob(filepath.Join(cpuSysBaseDir, "cpu[0-9]*"))
if err != nil {
return out
}
sort.Strings(cpuDirs)
for _, cpuDir := range cpuDirs {
socket, ok := readSocketIndex(cpuDir)
if !ok {
continue
}
if cpuPackageThrottled(cpuDir) {
out[socket] = true
}
}
return out
}
func readSocketIndex(cpuDir string) (int, bool) {
raw, err := os.ReadFile(filepath.Join(cpuDir, "topology", "physical_package_id"))
if err != nil {
return 0, false
}
value, err := strconv.Atoi(strings.TrimSpace(string(raw)))
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func cpuPackageThrottled(cpuDir string) bool {
paths := []string{
filepath.Join(cpuDir, "thermal_throttle", "package_throttle_count"),
filepath.Join(cpuDir, "thermal_throttle", "core_throttle_count"),
}
for _, path := range paths {
raw, err := os.ReadFile(path)
if err != nil {
continue
}
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
if err == nil && value > 0 {
return true
}
}
return false
}
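
Socket attribution in the telemetry code above hinges on `socketIndexRe`. This minimal sketch copies the regular expression and `detectCPUSocket` from the hunk and runs them on a few chip and feature names. The names in `main` are illustrative assumptions modelled on typical `sensors` labels.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// socketIndexRe is copied from the diff: a keyword followed by a number.
var socketIndexRe = regexp.MustCompile(`(?i)(?:package id|socket|cpu)\s*([0-9]+)`)

// detectCPUSocket returns the first socket index found in any of the parts.
func detectCPUSocket(parts ...string) (int, bool) {
	for _, part := range parts {
		matches := socketIndexRe.FindStringSubmatch(strings.ToLower(part))
		if len(matches) == 2 {
			if value, err := strconv.Atoi(matches[1]); err == nil {
				return value, true
			}
		}
	}
	return 0, false
}

func main() {
	// Hypothetical chip/feature pairs (assumption), checked in argument order.
	pairs := [][2]string{
		{"coretemp-isa-0000", "Package id 1"},
		{"nct6779-isa-0a20", "CPU1 Temp"},
		{"k10temp-pci-00c3", "Tctl"},
	}
	for _, p := range pairs {
		socket, ok := detectCPUSocket(p[0], p[1])
		fmt.Printf("%s / %s -> socket=%d ok=%v\n", p[0], p[1], socket, ok)
	}
}
```

Labels without a recognizable index, such as `Tctl`, fall through to the `isLikelyCPUTemp` fallback, which the diff applies only when exactly one CPU is present.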


@@ -0,0 +1,71 @@
package collector
import (
"os"
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestEnrichCPUsWithTelemetry(t *testing.T) {
tmp := t.TempDir()
oldBase := cpuSysBaseDir
cpuSysBaseDir = tmp
t.Cleanup(func() { cpuSysBaseDir = oldBase })
mustWriteFile(t, filepath.Join(tmp, "cpu0", "topology", "physical_package_id"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "cpu0", "thermal_throttle", "package_throttle_count"), "3\n")
mustWriteFile(t, filepath.Join(tmp, "cpu1", "topology", "physical_package_id"), "1\n")
mustWriteFile(t, filepath.Join(tmp, "cpu1", "thermal_throttle", "package_throttle_count"), "0\n")
doc := sensorsDoc{
"coretemp-isa-0000": {
"Package id 0": map[string]any{"temp1_input": 61.5},
"Package id 1": map[string]any{"temp2_input": 58.0},
},
"intel-rapl-mmio-0": {
"Package id 0": map[string]any{"power1_average": 180.0},
"Package id 1": map[string]any{"power2_average": 175.0},
},
}
socket0 := 0
socket1 := 1
status := statusOK
cpus := []schema.HardwareCPU{
{Socket: &socket0, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Socket: &socket1, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
}
got := enrichCPUsWithTelemetry(cpus, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 61.5 {
t.Fatalf("cpu0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[0].PowerW == nil || *got[0].PowerW != 180.0 {
t.Fatalf("cpu0 power mismatch: %#v", got[0].PowerW)
}
if got[0].Throttled == nil || !*got[0].Throttled {
t.Fatalf("cpu0 throttled mismatch: %#v", got[0].Throttled)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 58.0 {
t.Fatalf("cpu1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[1].PowerW == nil || *got[1].PowerW != 175.0 {
t.Fatalf("cpu1 power mismatch: %#v", got[1].PowerW)
}
if got[1].Throttled != nil && *got[1].Throttled {
t.Fatalf("cpu1 throttled mismatch: %#v", got[1].Throttled)
}
}
func mustWriteFile(t *testing.T, path, content string) {
t.Helper()
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
t.Fatalf("mkdir %s: %v", path, err)
}
if err := os.WriteFile(path, []byte(content), 0644); err != nil {
t.Fatalf("write %s: %v", path, err)
}
}


@@ -1,12 +1,14 @@
package collector
import (
"os"
"path/filepath"
"testing"
)
func TestParseCPUs_dual_socket(t *testing.T) {
out := mustReadFile(t, "testdata/dmidecode_type4.txt")
cpus := parseCPUs(out, "CAR315KA0803B90")
cpus := parseCPUs(out)
if len(cpus) != 2 {
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
@@ -37,23 +39,22 @@ func TestParseCPUs_dual_socket(t *testing.T) {
if cpu0.Status == nil || *cpu0.Status != "OK" {
t.Errorf("cpu0 status: got %v, want OK", cpu0.Status)
}
// Intel Xeon serial not available → fallback
if cpu0.SerialNumber == nil || *cpu0.SerialNumber != "CAR315KA0803B90-CPU-0" {
t.Errorf("cpu0 serial fallback: got %v, want CAR315KA0803B90-CPU-0", cpu0.SerialNumber)
if cpu0.SerialNumber != nil {
t.Errorf("cpu0 serial should stay nil without source data, got %v", cpu0.SerialNumber)
}
cpu1 := cpus[1]
if cpu1.Socket == nil || *cpu1.Socket != 1 {
t.Errorf("cpu1 socket: got %v, want 1", cpu1.Socket)
}
if cpu1.SerialNumber == nil || *cpu1.SerialNumber != "CAR315KA0803B90-CPU-1" {
t.Errorf("cpu1 serial fallback: got %v, want CAR315KA0803B90-CPU-1", cpu1.SerialNumber)
if cpu1.SerialNumber != nil {
t.Errorf("cpu1 serial should stay nil without source data, got %v", cpu1.SerialNumber)
}
}
func TestParseCPUs_unpopulated_skipped(t *testing.T) {
out := mustReadFile(t, "testdata/dmidecode_type4_disabled.txt")
cpus := parseCPUs(out, "BOARD-001")
cpus := parseCPUs(out)
if len(cpus) != 1 {
t.Fatalf("expected 1 CPU (unpopulated skipped), got %d", len(cpus))
@@ -63,18 +64,51 @@ func TestParseCPUs_unpopulated_skipped(t *testing.T) {
}
}
func TestCollectCPUsSetsFirmwareFromMicrocode(t *testing.T) {
tmp := t.TempDir()
origBase := cpuSysBaseDir
cpuSysBaseDir = tmp
t.Cleanup(func() { cpuSysBaseDir = origBase })
if err := os.MkdirAll(filepath.Join(tmp, "cpu0", "microcode"), 0755); err != nil {
t.Fatalf("mkdir microcode dir: %v", err)
}
if err := os.WriteFile(filepath.Join(tmp, "cpu0", "microcode", "version"), []byte("0x2b000643\n"), 0644); err != nil {
t.Fatalf("write microcode version: %v", err)
}
origRun := execDmidecode
execDmidecode = func(typeNum string) (string, error) {
if typeNum != "4" {
t.Fatalf("unexpected dmidecode type: %s", typeNum)
}
return mustReadFile(t, "testdata/dmidecode_type4.txt"), nil
}
t.Cleanup(func() { execDmidecode = origRun })
cpus := collectCPUs()
if len(cpus) != 2 {
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
}
for i, cpu := range cpus {
if cpu.Firmware == nil || *cpu.Firmware != "0x2b000643" {
t.Fatalf("cpu[%d] firmware=%v want microcode", i, cpu.Firmware)
}
}
}
func TestParseCPUStatus(t *testing.T) {
tests := []struct {
input string
want string
}{
{"Populated, Enabled", "OK"},
{"Populated, Disabled By User", "WARNING"},
{"Populated, Disabled By BIOS", "WARNING"},
{"Unpopulated", "EMPTY"},
{"Not Populated", "EMPTY"},
{"Unknown", "UNKNOWN"},
{"", "UNKNOWN"},
{"Populated, Disabled By User", statusWarning},
{"Populated, Disabled By BIOS", statusWarning},
{"Unpopulated", statusEmpty},
{"Not Populated", statusEmpty},
{"Unknown", statusUnknown},
{"", statusUnknown},
}
for _, tt := range tests {
got := parseCPUStatus(tt.input)


@@ -0,0 +1,88 @@
package collector
import "bee/audit/internal/schema"
func finalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
snap.Memory = filterMemory(snap.Memory)
snap.Storage = filterStorage(snap.Storage)
snap.PowerSupplies = filterPSUs(snap.PowerSupplies)
setComponentStatusMetadata(snap, collectedAt)
}
func filterMemory(dimms []schema.HardwareMemory) []schema.HardwareMemory {
out := make([]schema.HardwareMemory, 0, len(dimms))
for _, dimm := range dimms {
if dimm.Present != nil && !*dimm.Present {
continue
}
if dimm.Status != nil && *dimm.Status == statusEmpty {
continue
}
if dimm.SerialNumber == nil || *dimm.SerialNumber == "" {
continue
}
out = append(out, dimm)
}
return out
}
func filterStorage(disks []schema.HardwareStorage) []schema.HardwareStorage {
out := make([]schema.HardwareStorage, 0, len(disks))
for _, disk := range disks {
if disk.SerialNumber == nil || *disk.SerialNumber == "" {
continue
}
out = append(out, disk)
}
return out
}
func filterPSUs(psus []schema.HardwarePowerSupply) []schema.HardwarePowerSupply {
out := make([]schema.HardwarePowerSupply, 0, len(psus))
for _, psu := range psus {
hasIdentity := false
switch {
case psu.SerialNumber != nil && *psu.SerialNumber != "":
hasIdentity = true
case psu.Slot != nil && *psu.Slot != "":
hasIdentity = true
case psu.Model != nil && *psu.Model != "":
hasIdentity = true
case psu.Vendor != nil && *psu.Vendor != "":
hasIdentity = true
}
if !hasIdentity {
continue
}
out = append(out, psu)
}
return out
}
func setComponentStatusMetadata(snap *schema.HardwareSnapshot, collectedAt string) {
for i := range snap.CPUs {
setStatusCheckedAt(&snap.CPUs[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.Memory {
setStatusCheckedAt(&snap.Memory[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.Storage {
setStatusCheckedAt(&snap.Storage[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.PCIeDevices {
setStatusCheckedAt(&snap.PCIeDevices[i].HardwareComponentStatus, collectedAt)
}
for i := range snap.PowerSupplies {
setStatusCheckedAt(&snap.PowerSupplies[i].HardwareComponentStatus, collectedAt)
}
}
func setStatusCheckedAt(status *schema.HardwareComponentStatus, collectedAt string) {
if status == nil || status.Status == nil || *status.Status == "" {
return
}
if status.StatusCheckedAt == nil {
status.StatusCheckedAt = &collectedAt
}
}


@@ -0,0 +1,80 @@
package collector
import (
"bee/audit/internal/schema"
"testing"
)
func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
collectedAt := "2026-03-15T12:00:00Z"
present := true
status := statusOK
serial := "SN-1"
snap := schema.HardwareSnapshot{
Memory: []schema.HardwareMemory{
{Present: &present, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Present: &present, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
Storage: []schema.HardwareStorage{
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
PowerSupplies: []schema.HardwarePowerSupply{
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
}
finalizeSnapshot(&snap, collectedAt)
if len(snap.Memory) != 1 || snap.Memory[0].StatusCheckedAt == nil || *snap.Memory[0].StatusCheckedAt != collectedAt {
t.Fatalf("memory finalize mismatch: %+v", snap.Memory)
}
if len(snap.Storage) != 1 || snap.Storage[0].StatusCheckedAt == nil || *snap.Storage[0].StatusCheckedAt != collectedAt {
t.Fatalf("storage finalize mismatch: %+v", snap.Storage)
}
if len(snap.PowerSupplies) != 1 || snap.PowerSupplies[0].StatusCheckedAt == nil || *snap.PowerSupplies[0].StatusCheckedAt != collectedAt {
t.Fatalf("psu finalize mismatch: %+v", snap.PowerSupplies)
}
}
func TestFinalizeSnapshotPreservesDuplicateSerials(t *testing.T) {
collectedAt := "2026-03-15T12:00:00Z"
status := statusOK
model := "Device"
serial := "DUPLICATE"
snap := schema.HardwareSnapshot{
Storage: []schema.HardwareStorage{
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Model: &model, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
}
finalizeSnapshot(&snap, collectedAt)
if got := *snap.Storage[0].SerialNumber; got != serial {
t.Fatalf("first serial changed: %q", got)
}
if got := *snap.Storage[1].SerialNumber; got != serial {
t.Fatalf("duplicate serial should stay unchanged: %q", got)
}
}
func TestFilterPSUsKeepsSlotOnlyEntries(t *testing.T) {
slot := "0"
status := statusOK
got := filterPSUs([]schema.HardwarePowerSupply{
{Slot: &slot, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
})
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].Slot == nil || *got[0].Slot != "0" {
t.Fatalf("unexpected kept PSU: %+v", got[0])
}
}


@@ -47,12 +47,12 @@ func parseMemorySection(fields map[string]string) schema.HardwareMemory {
dimm.Present = &present
if !present {
status := "EMPTY"
status := statusEmpty
dimm.Status = &status
return dimm
}
status := "OK"
status := statusOK
dimm.Status = &status
if mb := parseMemorySizeMB(rawSize); mb > 0 {


@@ -0,0 +1,203 @@
package collector
import (
"bee/audit/internal/schema"
"os"
"path/filepath"
"sort"
"strconv"
"strings"
)
var edacBaseDir = "/sys/devices/system/edac/mc"
type edacDIMMStats struct {
Label string
CECount *int64
UECount *int64
}
func enrichMemoryWithTelemetry(dimms []schema.HardwareMemory, doc sensorsDoc) []schema.HardwareMemory {
if len(dimms) == 0 {
return dimms
}
tempByLabel := memoryTempsFromSensors(doc)
stats := readEDACStats()
for i := range dimms {
labelKeys := dimmMatchKeys(dimms[i].Slot, dimms[i].Location)
for _, key := range labelKeys {
if temp, ok := tempByLabel[key]; ok {
dimms[i].TemperatureC = &temp
break
}
}
for _, key := range labelKeys {
if stat, ok := stats[key]; ok {
if stat.CECount != nil {
dimms[i].CorrectableECCErrorCount = stat.CECount
}
if stat.UECount != nil {
dimms[i].UncorrectableECCErrorCount = stat.UECount
}
if stat.UECount != nil && *stat.UECount > 0 {
dimms[i].DataLossDetected = boolPtr(true)
status := statusCritical
dimms[i].Status = &status
if dimms[i].ErrorDescription == nil {
dimms[i].ErrorDescription = stringPtr("EDAC reports uncorrectable ECC errors")
}
} else if stat.CECount != nil && *stat.CECount > 0 && (dimms[i].Status == nil || *dimms[i].Status == statusOK) {
status := statusWarning
dimms[i].Status = &status
if dimms[i].ErrorDescription == nil {
dimms[i].ErrorDescription = stringPtr("EDAC reports correctable ECC errors")
}
}
break
}
}
}
return dimms
}
func memoryTempsFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
if len(doc) == 0 {
return out
}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok || classifySensorFeature(feature) != "temp" {
continue
}
if !isLikelyMemoryTemp(chip, featureName) {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
key := canonicalLabel(featureName)
if key == "" {
continue
}
if _, exists := out[key]; !exists {
out[key] = temp
}
}
}
return out
}
func readEDACStats() map[string]edacDIMMStats {
out := map[string]edacDIMMStats{}
mcDirs, err := filepath.Glob(filepath.Join(edacBaseDir, "mc*"))
if err != nil {
return out
}
sort.Strings(mcDirs)
for _, mcDir := range mcDirs {
dimmDirs, err := filepath.Glob(filepath.Join(mcDir, "dimm*"))
if err != nil {
continue
}
sort.Strings(dimmDirs)
for _, dimmDir := range dimmDirs {
stat, ok := readEDACDIMMStats(dimmDir)
if !ok {
continue
}
key := canonicalLabel(stat.Label)
if key == "" {
continue
}
out[key] = stat
}
}
return out
}
func readEDACDIMMStats(dimmDir string) (edacDIMMStats, bool) {
labelBytes, err := os.ReadFile(filepath.Join(dimmDir, "dimm_label"))
if err != nil {
labelBytes, err = os.ReadFile(filepath.Join(dimmDir, "label"))
if err != nil {
return edacDIMMStats{}, false
}
}
label := strings.TrimSpace(string(labelBytes))
if label == "" {
return edacDIMMStats{}, false
}
stat := edacDIMMStats{Label: label}
if value, ok := readEDACCount(dimmDir, []string{"dimm_ce_count", "ce_count"}); ok {
stat.CECount = &value
}
if value, ok := readEDACCount(dimmDir, []string{"dimm_ue_count", "ue_count"}); ok {
stat.UECount = &value
}
return stat, true
}
func readEDACCount(dir string, names []string) (int64, bool) {
for _, name := range names {
raw, err := os.ReadFile(filepath.Join(dir, name))
if err != nil {
continue
}
value, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
if err == nil && value >= 0 {
return value, true
}
}
return 0, false
}
func dimmMatchKeys(slot, location *string) []string {
var out []string
add := func(value *string) {
key := canonicalLabel(derefString(value))
if key == "" {
return
}
for _, existing := range out {
if existing == key {
return
}
}
out = append(out, key)
}
add(slot)
add(location)
return out
}
func canonicalLabel(value string) string {
value = strings.ToUpper(strings.TrimSpace(value))
if value == "" {
return ""
}
var b strings.Builder
for _, r := range value {
if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
b.WriteRune(r)
}
}
return b.String()
}
func isLikelyMemoryTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "dimm") || strings.Contains(value, "sodimm")
}
func boolPtr(value bool) *bool {
return &value
}
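
DIMM matching in the file above relies on `canonicalLabel` to make DMI slot names, EDAC labels, and sensor feature names comparable. The sketch below copies the function from the hunk and shows that differently punctuated spellings collapse to the same key. The sample labels are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalLabel is copied from the diff: upper-case the value and keep
// only ASCII letters and digits.
func canonicalLabel(value string) string {
	value = strings.ToUpper(strings.TrimSpace(value))
	if value == "" {
		return ""
	}
	var b strings.Builder
	for _, r := range value {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Hypothetical labels (assumption): one DMI-style, one sensors-style.
	for _, label := range []string{"CPU0_DIMM_A1", "cpu0 dimm a1", "CPU0-DIMM.A1", " -- "} {
		fmt.Printf("%q -> %q\n", label, canonicalLabel(label))
	}
}
```

A label made only of punctuation canonicalizes to the empty string, which is why `readEDACStats`, `memoryTempsFromSensors`, and `dimmMatchKeys` all skip empty keys.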


@@ -0,0 +1,61 @@
package collector
import (
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestEnrichMemoryWithTelemetry(t *testing.T) {
tmp := t.TempDir()
oldBase := edacBaseDir
edacBaseDir = tmp
t.Cleanup(func() { edacBaseDir = oldBase })
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_label"), "CPU0_DIMM_A1\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ce_count"), "7\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm0", "dimm_ue_count"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_label"), "CPU1_DIMM_B2\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ce_count"), "0\n")
mustWriteFile(t, filepath.Join(tmp, "mc0", "dimm1", "dimm_ue_count"), "2\n")
doc := sensorsDoc{
"jc42-i2c-0-18": {
"CPU0 DIMM A1": map[string]any{"temp1_input": 43.0},
"CPU1 DIMM B2": map[string]any{"temp2_input": 46.0},
},
}
status := statusOK
slotA := "CPU0_DIMM_A1"
slotB := "CPU1_DIMM_B2"
dimms := []schema.HardwareMemory{
{Slot: &slotA, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Slot: &slotB, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
}
got := enrichMemoryWithTelemetry(dimms, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 43.0 {
t.Fatalf("dimm0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[0].CorrectableECCErrorCount == nil || *got[0].CorrectableECCErrorCount != 7 {
t.Fatalf("dimm0 ce mismatch: %#v", got[0].CorrectableECCErrorCount)
}
if got[0].Status == nil || *got[0].Status != statusWarning {
t.Fatalf("dimm0 status mismatch: %#v", got[0].Status)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 46.0 {
t.Fatalf("dimm1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[1].UncorrectableECCErrorCount == nil || *got[1].UncorrectableECCErrorCount != 2 {
t.Fatalf("dimm1 ue mismatch: %#v", got[1].UncorrectableECCErrorCount)
}
if got[1].Status == nil || *got[1].Status != statusCritical {
t.Fatalf("dimm1 status mismatch: %#v", got[1].Status)
}
if got[1].DataLossDetected == nil || !*got[1].DataLossDetected {
t.Fatalf("dimm1 data_loss_detected mismatch: %#v", got[1].DataLossDetected)
}
}


@@ -18,17 +18,13 @@ var (
}
return string(out), nil
}
readNetStatFile = func(iface, key string) (int64, error) {
path := filepath.Join("/sys/class/net", iface, "statistics", key)
readNetAddressFile = func(iface string) (string, error) {
path := filepath.Join("/sys/class/net", iface, "address")
raw, err := os.ReadFile(path)
if err != nil {
return 0, err
return "", err
}
v, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
if err != nil {
return 0, err
}
return v, nil
return strings.TrimSpace(string(raw)), nil
}
)
@@ -47,6 +43,12 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
continue
}
iface := ifaces[0]
devs[i].MacAddresses = collectInterfaceMACs(ifaces)
if devs[i].SerialNumber == nil {
if serial := queryPCIDeviceSerial(bdf); serial != "" {
devs[i].SerialNumber = &serial
}
}
if devs[i].Firmware == nil {
if out, err := ethtoolInfoQuery(iface); err == nil {
@@ -56,16 +58,13 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
}
}
if devs[i].Telemetry == nil {
devs[i].Telemetry = map[string]any{}
}
injectNICPacketStats(devs[i].Telemetry, iface)
if out, err := ethtoolModuleQuery(iface); err == nil {
injectSFPDOMTelemetry(devs[i].Telemetry, out)
if injectSFPDOMTelemetry(&devs[i], out) {
enriched++
continue
}
}
if len(devs[i].Telemetry) == 0 {
devs[i].Telemetry = nil
} else {
if len(devs[i].MacAddresses) > 0 || devs[i].Firmware != nil {
enriched++
}
}
@@ -77,31 +76,32 @@ func isNICDevice(dev schema.HardwarePCIeDevice) bool {
if dev.DeviceClass == nil {
return false
}
c := strings.ToLower(strings.TrimSpace(*dev.DeviceClass))
return strings.Contains(c, "ethernet controller") ||
strings.Contains(c, "network controller") ||
strings.Contains(c, "infiniband controller")
c := strings.TrimSpace(*dev.DeviceClass)
return isNICClass(c) || strings.EqualFold(c, "FibreChannelController")
}
func injectNICPacketStats(dst map[string]any, iface string) {
for _, key := range []string{"rx_packets", "tx_packets", "rx_errors", "tx_errors"} {
if v, err := readNetStatFile(iface, key); err == nil {
dst[key] = v
func collectInterfaceMACs(ifaces []string) []string {
seen := map[string]struct{}{}
var out []string
for _, iface := range ifaces {
mac, err := readNetAddressFile(iface)
if err != nil || mac == "" {
continue
}
mac = strings.ToLower(strings.TrimSpace(mac))
if _, ok := seen[mac]; ok {
continue
}
seen[mac] = struct{}{}
out = append(out, mac)
}
}
func injectSFPDOMTelemetry(dst map[string]any, raw string) {
parsed := parseSFPDOM(raw)
for k, v := range parsed {
dst[k] = v
}
return out
}
var floatRe = regexp.MustCompile(`[-+]?[0-9]*\.?[0-9]+`)
func parseSFPDOM(raw string) map[string]any {
out := map[string]any{}
func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
var changed bool
for _, line := range strings.Split(raw, "\n") {
trimmed := strings.TrimSpace(line)
if trimmed == "" {
@@ -117,26 +117,55 @@ func parseSFPDOM(raw string) map[string]any {
switch {
case strings.Contains(key, "module temperature"):
if f, ok := firstFloat(val); ok {
out["sfp_temperature_c"] = f
dev.SFPTemperatureC = &f
changed = true
}
case strings.Contains(key, "laser output power"):
if f, ok := dbmValue(val); ok {
out["sfp_tx_power_dbm"] = f
dev.SFPTXPowerDBM = &f
changed = true
}
case strings.Contains(key, "receiver signal"):
if f, ok := dbmValue(val); ok {
out["sfp_rx_power_dbm"] = f
dev.SFPRXPowerDBM = &f
changed = true
}
case strings.Contains(key, "module voltage"):
if f, ok := firstFloat(val); ok {
out["sfp_voltage_v"] = f
dev.SFPVoltageV = &f
changed = true
}
case strings.Contains(key, "laser bias current"):
if f, ok := firstFloat(val); ok {
out["sfp_bias_ma"] = f
dev.SFPBiasMA = &f
changed = true
}
}
}
return changed
}
func parseSFPDOM(raw string) map[string]any {
dev := schema.HardwarePCIeDevice{}
if !injectSFPDOMTelemetry(&dev, raw) {
return map[string]any{}
}
out := map[string]any{}
if dev.SFPTemperatureC != nil {
out["sfp_temperature_c"] = *dev.SFPTemperatureC
}
if dev.SFPTXPowerDBM != nil {
out["sfp_tx_power_dbm"] = *dev.SFPTXPowerDBM
}
if dev.SFPRXPowerDBM != nil {
out["sfp_rx_power_dbm"] = *dev.SFPRXPowerDBM
}
if dev.SFPVoltageV != nil {
out["sfp_voltage_v"] = *dev.SFPVoltageV
}
if dev.SFPBiasMA != nil {
out["sfp_bias_ma"] = *dev.SFPBiasMA
}
return out
}
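
The move from a telemetry map to typed optional fields can be illustrated with a minimal standalone sketch. It assumes `ethtool -m` prints lines shaped like `Module temperature : 30.13 degrees C / 86.23 degrees F`. The names `domReading` and `parseDOM` are hypothetical and not part of this change. Unlike the substring match in the diff, this sketch compares the key exactly, so threshold lines such as `Laser output power high alarm threshold` are not mistaken for the live reading.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// domReading is a hypothetical holder for optics values; nil means "not reported".
type domReading struct {
	TemperatureC *float64
	TXPowerDBM   *float64
}

var (
	firstNumber = regexp.MustCompile(`[-+]?[0-9]*\.?[0-9]+`)
	dbmNumber   = regexp.MustCompile(`([-+]?[0-9]*\.?[0-9]+)\s*dBm`)
)

// parseDOM extracts two readings from "key : value" lines.
// Assumption: keys are matched exactly after trimming and lowercasing.
func parseDOM(raw string) domReading {
	var r domReading
	for _, line := range strings.Split(raw, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		switch strings.ToLower(strings.TrimSpace(key)) {
		case "module temperature":
			// The first number on the line is the Celsius value.
			if f, err := strconv.ParseFloat(firstNumber.FindString(val), 64); err == nil {
				r.TemperatureC = &f
			}
		case "laser output power":
			// Prefer the dBm figure over the mW figure that precedes it.
			if m := dbmNumber.FindStringSubmatch(val); m != nil {
				if f, err := strconv.ParseFloat(m[1], 64); err == nil {
					r.TXPowerDBM = &f
				}
			}
		}
	}
	return r
}

func main() {
	r := parseDOM("Module temperature : 30.13 degrees C / 86.23 degrees F\n" +
		"Laser output power : 0.6036 mW / -2.19 dBm\n")
	if r.TemperatureC != nil && r.TXPowerDBM != nil {
		fmt.Println(*r.TemperatureC, *r.TXPowerDBM)
	}
}
```

Pointer fields let a caller tell "zero" apart from "not reported", which a plain float64 cannot express.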


@@ -1,6 +1,10 @@
package collector
import "testing"
import (
"bee/audit/internal/schema"
"fmt"
"testing"
)
func TestParseSFPDOM(t *testing.T) {
raw := `
@@ -29,6 +33,74 @@ func TestParseSFPDOM(t *testing.T) {
}
}
func TestParseLSPCIDetailSerial(t *testing.T) {
raw := `
05:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Serial number: NIC-SN-12345
`
if got := parseLSPCIDetailSerial(raw); got != "NIC-SN-12345" {
t.Fatalf("serial=%q want %q", got, "NIC-SN-12345")
}
}
func TestParsePCIVPDSerial(t *testing.T) {
raw := []byte{0x82, 0x05, 0x00, 'M', 'L', 'X', '5', 0x90, 0x08, 0x00, 'S', 'N', 0x08, 'M', 'T', '1', '2', '3', '4', '5', '6'}
if got := parsePCIVPDSerial(raw); got != "MT123456" {
t.Fatalf("serial=%q want %q", got, "MT123456")
}
}
func TestEnrichPCIeWithNICTelemetryAddsSerialFallback(t *testing.T) {
origDetail := queryPCILSPCIDetail
origVPD := readPCIVPDFile
origIfaces := netIfacesByBDF
origReadMAC := readNetAddressFile
origEth := ethtoolInfoQuery
origModule := ethtoolModuleQuery
t.Cleanup(func() {
queryPCILSPCIDetail = origDetail
readPCIVPDFile = origVPD
netIfacesByBDF = origIfaces
readNetAddressFile = origReadMAC
ethtoolInfoQuery = origEth
ethtoolModuleQuery = origModule
})
queryPCILSPCIDetail = func(bdf string) (string, error) {
if bdf != "0000:18:00.0" {
t.Fatalf("unexpected bdf: %s", bdf)
}
return "Serial number: NIC-SN-98765\n", nil
}
readPCIVPDFile = func(string) ([]byte, error) {
return nil, fmt.Errorf("no vpd needed")
}
netIfacesByBDF = func(string) []string { return []string{"eth0"} }
readNetAddressFile = func(iface string) (string, error) {
if iface != "eth0" {
t.Fatalf("unexpected iface: %s", iface)
}
return "aa:bb:cc:dd:ee:ff", nil
}
ethtoolInfoQuery = func(string) (string, error) { return "", fmt.Errorf("skip firmware") }
ethtoolModuleQuery = func(string) (string, error) { return "", fmt.Errorf("skip optics") }
class := "EthernetController"
bdf := "0000:18:00.0"
devs := []schema.HardwarePCIeDevice{{
DeviceClass: &class,
BDF: &bdf,
}}
out := enrichPCIeWithNICTelemetry(devs)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "NIC-SN-98765" {
t.Fatalf("serial=%v want NIC-SN-98765", out[0].SerialNumber)
}
if len(out[0].MacAddresses) != 1 || out[0].MacAddresses[0] != "aa:bb:cc:dd:ee:ff" {
t.Fatalf("mac_addresses=%v", out[0].MacAddresses)
}
}
func TestDBMValue(t *testing.T) {
tests := []struct {
in string


@@ -24,18 +24,29 @@ type nvidiaGPUInfo struct {
}
// enrichPCIeWithNVIDIA enriches NVIDIA PCIe devices with data from nvidia-smi.
// If the driver/tool is unavailable, NVIDIA devices get UNKNOWN status and
// a stable serial fallback based on board serial + slot.
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice, boardSerial string) []schema.HardwarePCIeDevice {
// If the driver/tool is unavailable, NVIDIA devices get Unknown status.
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
if !hasNVIDIADevices(devs) {
return devs
}
gpuByBDF, err := queryNVIDIAGPUs()
if err != nil {
slog.Info("nvidia: enrichment skipped", "err", err)
return enrichPCIeWithNVIDIAData(devs, nil, boardSerial, false)
return enrichPCIeWithNVIDIAData(devs, nil, false)
}
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, boardSerial, true)
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, true)
}
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, boardSerial string, driverLoaded bool) []schema.HardwarePCIeDevice {
func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
for _, dev := range devs {
if isNVIDIADevice(dev) {
return true
}
}
return false
}
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, driverLoaded bool) []schema.HardwarePCIeDevice {
enriched := 0
for i := range devs {
if !isNVIDIADevice(devs[i]) {
@@ -43,7 +54,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
}
if !driverLoaded {
setPCIeFallback(&devs[i], boardSerial)
setPCIeFallback(&devs[i])
continue
}
@@ -53,22 +64,21 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
}
info, ok := gpuByBDF[bdf]
if !ok {
setPCIeFallback(&devs[i], boardSerial)
setPCIeFallback(&devs[i])
continue
}
if v := strings.TrimSpace(info.Serial); v != "" {
devs[i].SerialNumber = &v
} else {
setPCIeFallbackSerial(&devs[i], boardSerial)
}
if v := strings.TrimSpace(info.VBIOS); v != "" {
devs[i].Firmware = &v
}
status := "OK"
status := statusOK
if info.ECCUncorrected != nil && *info.ECCUncorrected > 0 {
status = "WARNING"
status = statusWarning
devs[i].ErrorDescription = stringPtr("GPU reports uncorrected ECC errors")
}
devs[i].Status = &status
injectNVIDIATelemetry(&devs[i], info)
@@ -200,46 +210,25 @@ func isNVIDIADevice(dev schema.HardwarePCIeDevice) bool {
return false
}
func setPCIeFallback(dev *schema.HardwarePCIeDevice, boardSerial string) {
setPCIeFallbackSerial(dev, boardSerial)
status := "UNKNOWN"
func setPCIeFallback(dev *schema.HardwarePCIeDevice) {
status := statusUnknown
dev.Status = &status
}
func setPCIeFallbackSerial(dev *schema.HardwarePCIeDevice, boardSerial string) {
if strings.TrimSpace(boardSerial) == "" || dev.SerialNumber != nil {
return
}
slot := "unknown"
if dev.BDF != nil && strings.TrimSpace(*dev.BDF) != "" {
slot = strings.TrimSpace(*dev.BDF)
} else if dev.Slot != nil && strings.TrimSpace(*dev.Slot) != "" {
slot = strings.TrimSpace(*dev.Slot)
}
fb := fmt.Sprintf("%s-PCIE-%s", boardSerial, slot)
dev.SerialNumber = &fb
}
func injectNVIDIATelemetry(dev *schema.HardwarePCIeDevice, info nvidiaGPUInfo) {
if dev.Telemetry == nil {
dev.Telemetry = map[string]any{}
}
if info.TemperatureC != nil {
dev.Telemetry["temperature_c"] = *info.TemperatureC
dev.TemperatureC = info.TemperatureC
}
if info.PowerW != nil {
dev.Telemetry["power_w"] = *info.PowerW
dev.PowerW = info.PowerW
}
if info.ECCUncorrected != nil {
dev.Telemetry["ecc_uncorrected_total"] = *info.ECCUncorrected
dev.ECCUncorrectedTotal = info.ECCUncorrected
}
if info.ECCCorrected != nil {
dev.Telemetry["ecc_corrected_total"] = *info.ECCCorrected
dev.ECCCorrectedTotal = info.ECCCorrected
}
if info.HWSlowdown != nil {
dev.Telemetry["hw_slowdown_active"] = *info.HWSlowdown
}
if len(dev.Telemetry) == 0 {
dev.Telemetry = nil
dev.HWSlowdown = info.HWSlowdown
}
}
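
How typed pointer fields behave when serialized can be sketched as follows. The struct `gpuReading` and its JSON tags are hypothetical; the real `schema.HardwarePCIeDevice` tags are not visible in this diff. The sketch only shows that `omitempty` drops nil pointers while still emitting a non-nil pointer to zero.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gpuReading is a hypothetical stand-in for typed device fields.
// The JSON tag names are assumptions for illustration only.
type gpuReading struct {
	TemperatureC        *float64 `json:"temperature_c,omitempty"`
	PowerW              *float64 `json:"power_w,omitempty"`
	ECCUncorrectedTotal *int64   `json:"ecc_uncorrected_total,omitempty"`
}

// encode marshals a reading and returns "" if marshaling fails.
func encode(r gpuReading) string {
	data, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(data)
}

func main() {
	temp := 55.5
	zero := int64(0)
	// Unset fields are omitted; an explicit zero count is kept.
	fmt.Println(encode(gpuReading{TemperatureC: &temp}))
	fmt.Println(encode(gpuReading{ECCUncorrectedTotal: &zero}))
}
```

With this shape there is no need to reset an empty map to nil after enrichment, since absent values never appear in the output.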


@@ -54,10 +54,10 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
status := "OK"
devices := []schema.HardwarePCIeDevice{
{
VendorID: &vendorID,
BDF: &bdf,
Manufacturer: &manufacturer,
Status: &status,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
VendorID: &vendorID,
BDF: &bdf,
Manufacturer: &manufacturer,
},
}
@@ -73,21 +73,21 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
},
}
out := enrichPCIeWithNVIDIAData(devices, byBDF, "BOARD-001", true)
out := enrichPCIeWithNVIDIAData(devices, byBDF, true)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-ABC" {
t.Fatalf("serial: got %v", out[0].SerialNumber)
}
if out[0].Firmware == nil || *out[0].Firmware != "96.00.1F.00.02" {
t.Fatalf("firmware: got %v", out[0].Firmware)
}
if out[0].Status == nil || *out[0].Status != "WARNING" {
if out[0].Status == nil || *out[0].Status != statusWarning {
t.Fatalf("status: got %v", out[0].Status)
}
if out[0].Telemetry == nil {
t.Fatal("expected telemetry")
if out[0].ECCUncorrectedTotal == nil || *out[0].ECCUncorrectedTotal != 2 {
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].ECCUncorrectedTotal)
}
if got, ok := out[0].Telemetry["ecc_uncorrected_total"].(int64); !ok || got != 2 {
t.Fatalf("ecc_uncorrected_total: got %#v", out[0].Telemetry["ecc_uncorrected_total"])
if out[0].TemperatureC == nil || *out[0].TemperatureC != 55.5 {
t.Fatalf("temperature_c: got %#v", out[0].TemperatureC)
}
}
@@ -103,11 +103,11 @@ func TestEnrichPCIeWithNVIDIAData_driverMissingFallback(t *testing.T) {
},
}
out := enrichPCIeWithNVIDIAData(devices, nil, "BOARD-123", false)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "BOARD-123-PCIE-0000:17:00.0" {
t.Fatalf("fallback serial: got %v", out[0].SerialNumber)
out := enrichPCIeWithNVIDIAData(devices, nil, false)
if out[0].SerialNumber != nil {
t.Fatalf("serial should stay nil without source data, got %v", out[0].SerialNumber)
}
if out[0].Status == nil || *out[0].Status != "UNKNOWN" {
if out[0].Status == nil || *out[0].Status != statusUnknown {
t.Fatalf("fallback status: got %v", out[0].Status)
}
}


@@ -37,7 +37,7 @@ func parseLspci(output string) []schema.HardwarePCIeDevice {
val := strings.TrimSpace(line[idx+2:])
fields[key] = val
}
if !shouldIncludePCIeDevice(fields["Class"]) {
if !shouldIncludePCIeDevice(fields["Class"], fields["Vendor"], fields["Device"]) {
continue
}
dev := parseLspciDevice(fields)
@@ -46,8 +46,10 @@ func parseLspci(output string) []schema.HardwarePCIeDevice {
return devs
}
func shouldIncludePCIeDevice(class string) bool {
func shouldIncludePCIeDevice(class, vendor, device string) bool {
c := strings.ToLower(strings.TrimSpace(class))
v := strings.ToLower(strings.TrimSpace(vendor))
d := strings.ToLower(strings.TrimSpace(device))
if c == "" {
return true
}
@@ -57,6 +59,8 @@ func shouldIncludePCIeDevice(class string) bool {
"host bridge",
"isa bridge",
"pci bridge",
"performance counter",
"performance counters",
"ram memory",
"system peripheral",
"communication controller",
@@ -66,12 +70,28 @@ func shouldIncludePCIeDevice(class string) bool {
"audio device",
"serial bus controller",
"unassigned class",
"non-essential instrumentation",
}
for _, bad := range excluded {
if strings.Contains(c, bad) {
return false
}
}
if strings.Contains(v, "advanced micro devices") || strings.Contains(v, "[amd]") {
internalAMDPatterns := []string{
"dummy function",
"reserved spp",
"ptdma",
"cryptographic coprocessor pspcpp",
"pspcpp",
}
for _, bad := range internalAMDPatterns {
if strings.Contains(d, bad) {
return false
}
}
}
return true
}
@@ -79,11 +99,12 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
dev := schema.HardwarePCIeDevice{}
present := true
dev.Present = &present
status := "OK"
status := statusOK
dev.Status = &status
// Slot is the BDF: "0000:00:02.0"
if bdf := fields["Slot"]; bdf != "" {
dev.Slot = &bdf
dev.BDF = &bdf
// parse vendor_id and device_id from sysfs
vendorID, deviceID := readPCIIDs(bdf)
@@ -93,10 +114,34 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
if deviceID != 0 {
dev.DeviceID = &deviceID
}
if numaNode, ok := readPCINumaNode(bdf); ok {
dev.NUMANode = &numaNode
} else if numaNode, ok := parsePCINumaNode(fields["NUMANode"]); ok {
dev.NUMANode = &numaNode
}
if width, ok := readPCIIntAttribute(bdf, "current_link_width"); ok {
dev.LinkWidth = &width
}
if width, ok := readPCIIntAttribute(bdf, "max_link_width"); ok {
dev.MaxLinkWidth = &width
}
if speed, ok := readPCIStringAttribute(bdf, "current_link_speed"); ok {
linkSpeed := normalizePCILinkSpeed(speed)
if linkSpeed != "" {
dev.LinkSpeed = &linkSpeed
}
}
if speed, ok := readPCIStringAttribute(bdf, "max_link_speed"); ok {
linkSpeed := normalizePCILinkSpeed(speed)
if linkSpeed != "" {
dev.MaxLinkSpeed = &linkSpeed
}
}
}
if v := fields["Class"]; v != "" {
dev.DeviceClass = &v
class := mapPCIeDeviceClass(v)
dev.DeviceClass = &class
}
if v := fields["Vendor"]; v != "" {
dev.Manufacturer = &v
@@ -131,3 +176,67 @@ func readHexFile(path string) (int, error) {
n, err := strconv.ParseInt(s, 16, 64)
return int(n), err
}
func readPCINumaNode(bdf string) (int, bool) {
value, ok := readPCIIntAttribute(bdf, "numa_node")
if !ok || value < 0 {
return 0, false
}
return value, true
}
func parsePCINumaNode(raw string) (int, bool) {
raw = strings.TrimSpace(raw)
if raw == "" {
return 0, false
}
value, err := strconv.Atoi(raw)
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func readPCIIntAttribute(bdf, attribute string) (int, bool) {
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
if err != nil {
return 0, false
}
value, err := strconv.Atoi(strings.TrimSpace(string(out)))
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func readPCIStringAttribute(bdf, attribute string) (string, bool) {
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
if err != nil {
return "", false
}
value := strings.TrimSpace(string(out))
if value == "" {
return "", false
}
return value, true
}
func normalizePCILinkSpeed(raw string) string {
raw = strings.TrimSpace(strings.ToLower(raw))
switch {
case strings.Contains(raw, "2.5"):
return "Gen1"
case strings.Contains(raw, "5.0"):
return "Gen2"
case strings.Contains(raw, "8.0"):
return "Gen3"
case strings.Contains(raw, "16.0"):
return "Gen4"
case strings.Contains(raw, "32.0"):
return "Gen5"
case strings.Contains(raw, "64.0"):
return "Gen6"
default:
return ""
}
}
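
As an alternative to the substring checks in normalizePCILinkSpeed, the leading number can be parsed and compared numerically. This is a sketch, not the code in this change; `linkSpeedGen` is a hypothetical name. It assumes the sysfs value starts with the transfer rate in GT/s, and that some kernels may print `8 GT/s` rather than `8.0 GT/s PCIe`. Numeric comparison also keeps an unlisted rate such as `128.0` from being matched by the `8.0` substring.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// linkSpeedGen maps a link-speed string to a PCIe generation label.
// It returns "" when the leading field is not a known transfer rate.
func linkSpeedGen(raw string) string {
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return ""
	}
	gts, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return ""
	}
	// All listed rates are exactly representable as float64.
	switch gts {
	case 2.5:
		return "Gen1"
	case 5:
		return "Gen2"
	case 8:
		return "Gen3"
	case 16:
		return "Gen4"
	case 32:
		return "Gen5"
	case 64:
		return "Gen6"
	default:
		return ""
	}
}

func main() {
	for _, raw := range []string{"8.0 GT/s PCIe", "8 GT/s", "Unknown"} {
		fmt.Printf("%q -> %q\n", raw, linkSpeedGen(raw))
	}
}
```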


@@ -1,41 +1,126 @@
package collector
import "testing"
import (
"encoding/json"
"strings"
"testing"
)
func TestShouldIncludePCIeDevice(t *testing.T) {
tests := []struct {
class string
want bool
name string
class string
vendor string
device string
want bool
}{
{"USB controller", false},
{"System peripheral", false},
{"Audio device", false},
{"Host bridge", false},
{"PCI bridge", false},
{"SMBus", false},
{"Ethernet controller", true},
{"RAID bus controller", true},
{"Non-Volatile memory controller", true},
{"VGA compatible controller", true},
{name: "usb", class: "USB controller", want: false},
{name: "system peripheral", class: "System peripheral", want: false},
{name: "audio", class: "Audio device", want: false},
{name: "host bridge", class: "Host bridge", want: false},
{name: "pci bridge", class: "PCI bridge", want: false},
{name: "smbus", class: "SMBus", want: false},
{name: "perf", class: "Performance counters", want: false},
{name: "non essential instrumentation", class: "Non-Essential Instrumentation", want: false},
{name: "amd ptdma", class: "Encryption controller", vendor: "Advanced Micro Devices, Inc. [AMD]", device: "Starship/Matisse PTDMA", want: false},
{name: "amd pspcpp", class: "Encryption controller", vendor: "Advanced Micro Devices, Inc. [AMD]", device: "Starship/Matisse Cryptographic Coprocessor PSPCPP", want: false},
{name: "ethernet", class: "Ethernet controller", want: true},
{name: "raid", class: "RAID bus controller", want: true},
{name: "nvme", class: "Non-Volatile memory controller", want: true},
{name: "vga", class: "VGA compatible controller", want: true},
{name: "other encryption controller", class: "Encryption controller", vendor: "Intel Corporation", device: "QuickAssist", want: true},
}
for _, tt := range tests {
got := shouldIncludePCIeDevice(tt.class)
if got != tt.want {
t.Fatalf("class %q include=%v want %v", tt.class, got, tt.want)
}
t.Run(tt.name, func(t *testing.T) {
got := shouldIncludePCIeDevice(tt.class, tt.vendor, tt.device)
if got != tt.want {
t.Fatalf("class=%q vendor=%q device=%q include=%v want %v", tt.class, tt.vendor, tt.device, got, tt.want)
}
})
}
}
func TestParseLspci_filtersExcludedClasses(t *testing.T) {
input := "Slot:\t0000:00:14.0\nClass:\tUSB controller\nVendor:\tIntel Corporation\nDevice:\tUSB 3.0\n\n" +
"Slot:\t0000:00:18.0\nClass:\tNon-Essential Instrumentation\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PCIe Dummy Function\n\n" +
"Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 filtered device, got %d", len(devs))
}
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VGA compatible controller" {
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VideoController" {
t.Fatalf("unexpected remaining class: %v", devs[0].DeviceClass)
}
if devs[0].Slot == nil || *devs[0].Slot != "0000:65:00.0" {
t.Fatalf("slot: got %v", devs[0].Slot)
}
if devs[0].BDF == nil || *devs[0].BDF != "0000:65:00.0" {
t.Fatalf("bdf: got %v", devs[0].BDF)
}
}
func TestParseLspci_filtersAMDChipsetNoise(t *testing.T) {
input := "" +
"Slot:\t0000:1a:00.0\nClass:\tNon-Essential Instrumentation\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PCIe Dummy Function\n\n" +
"Slot:\t0000:1a:00.2\nClass:\tEncryption controller\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PTDMA\n\n" +
"Slot:\t0000:05:00.0\nClass:\tEthernet controller\nVendor:\tMellanox Technologies\nDevice:\tMT28908 Family [ConnectX-6]\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 remaining device, got %d", len(devs))
}
if devs[0].Model == nil || *devs[0].Model != "MT28908 Family [ConnectX-6]" {
t.Fatalf("unexpected remaining device: %+v", devs[0])
}
}
func TestPCIeJSONUsesSlotNotBDF(t *testing.T) {
input := "Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
devs := parseLspci(input)
data, err := json.Marshal(devs[0])
if err != nil {
t.Fatalf("marshal: %v", err)
}
text := string(data)
if !strings.Contains(text, `"slot":"0000:65:00.0"`) {
t.Fatalf("json missing slot: %s", text)
}
if strings.Contains(text, `"bdf"`) {
t.Fatalf("json should not emit bdf: %s", text)
}
}
func TestParseLspciUsesNUMANodeFieldWhenSysfsUnavailable(t *testing.T) {
input := "Slot:\t0000:65:00.0\nClass:\tEthernet controller\nVendor:\tIntel Corporation\nDevice:\tX710\nNUMANode:\t1\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 device, got %d", len(devs))
}
if devs[0].NUMANode == nil || *devs[0].NUMANode != 1 {
t.Fatalf("numa_node=%v want 1", devs[0].NUMANode)
}
}
func TestNormalizePCILinkSpeed(t *testing.T) {
tests := []struct {
raw string
want string
}{
{"2.5 GT/s PCIe", "Gen1"},
{"5.0 GT/s PCIe", "Gen2"},
{"8.0 GT/s PCIe", "Gen3"},
{"16.0 GT/s PCIe", "Gen4"},
{"32.0 GT/s PCIe", "Gen5"},
{"64.0 GT/s PCIe", "Gen6"},
{"unknown", ""},
}
for _, tt := range tests {
if got := normalizePCILinkSpeed(tt.raw); got != tt.want {
t.Fatalf("normalizePCILinkSpeed(%q)=%q want %q", tt.raw, got, tt.want)
}
}
}


@@ -0,0 +1,123 @@
package collector
import (
"bee/audit/internal/schema"
"log/slog"
"os"
"os/exec"
"path/filepath"
"strings"
)
var (
queryPCILSPCIDetail = func(bdf string) (string, error) {
out, err := exec.Command("lspci", "-vv", "-s", bdf).Output()
if err != nil {
return "", err
}
return string(out), nil
}
readPCIVPDFile = func(bdf string) ([]byte, error) {
return os.ReadFile(filepath.Join("/sys/bus/pci/devices", bdf, "vpd"))
}
)
func enrichPCIeWithPCISerials(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
enriched := 0
for i := range devs {
if !shouldProbePCIeSerial(devs[i]) {
continue
}
bdf := normalizePCIeBDF(*devs[i].BDF)
if bdf == "" {
continue
}
if serial := queryPCIDeviceSerial(bdf); serial != "" {
devs[i].SerialNumber = &serial
enriched++
}
}
if enriched > 0 {
slog.Info("pcie: serials enriched", "count", enriched)
}
return devs
}
func shouldProbePCIeSerial(dev schema.HardwarePCIeDevice) bool {
if dev.BDF == nil || dev.SerialNumber != nil {
return false
}
if dev.DeviceClass == nil {
return false
}
class := strings.TrimSpace(*dev.DeviceClass)
return isNICClass(class) || isGPUClass(class)
}
func queryPCIDeviceSerial(bdf string) string {
if out, err := queryPCILSPCIDetail(bdf); err == nil {
if serial := parseLSPCIDetailSerial(out); serial != "" {
return serial
}
}
if raw, err := readPCIVPDFile(bdf); err == nil {
return parsePCIVPDSerial(raw)
}
return ""
}
func parseLSPCIDetailSerial(raw string) string {
for _, line := range strings.Split(raw, "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
lower := strings.ToLower(line)
if !strings.Contains(lower, "serial number:") {
continue
}
idx := strings.Index(line, ":")
if idx < 0 {
continue
}
if serial := strings.TrimSpace(line[idx+1:]); serial != "" {
return serial
}
}
return ""
}
func parsePCIVPDSerial(raw []byte) string {
for i := 0; i+3 < len(raw); i++ {
if raw[i] != 'S' || raw[i+1] != 'N' {
continue
}
length := int(raw[i+2])
if length <= 0 || length > 64 || i+3+length > len(raw) {
continue
}
value := strings.TrimSpace(strings.Trim(string(raw[i+3:i+3+length]), "\x00"))
if !looksLikeSerial(value) {
continue
}
return value
}
return ""
}
func looksLikeSerial(value string) bool {
if len(value) < 4 {
return false
}
hasAlphaNum := false
for _, r := range value {
switch {
case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
hasAlphaNum = true
case strings.ContainsRune(" -_./:", r):
default:
return false
}
}
return hasAlphaNum
}
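
parsePCIVPDSerial scans the raw bytes for an `SN` marker. A stricter alternative is to walk the VPD resource tags and read keywords only inside the read-only (VPD-R) section. The sketch below assumes the standard PCI VPD layout: large resources carry a two-byte little-endian length, `0x90` marks VPD-R, and `0x78` is the end tag. `vpdSerial` and `keywordValue` are hypothetical names. A strict walker can return nothing for a malformed dump that the byte scan would still accept, so it is a trade-off rather than a drop-in replacement.

```go
package main

import (
	"fmt"
	"strings"
)

// vpdSerial walks VPD resources and returns the SN keyword from VPD-R, or "".
func vpdSerial(raw []byte) string {
	for i := 0; i < len(raw); {
		tag := raw[i]
		if tag&0x80 == 0 {
			// Small resource: bits 6..3 are the tag name, bits 2..0 the length.
			if (tag>>3)&0x0F == 0x0F {
				return "" // end tag
			}
			i += 1 + int(tag&0x07)
			continue
		}
		// Large resource: two-byte little-endian length follows the tag.
		if i+3 > len(raw) {
			return ""
		}
		start := i + 3
		end := start + (int(raw[i+1]) | int(raw[i+2])<<8)
		if end > len(raw) {
			return ""
		}
		if tag == 0x90 { // VPD-R section
			if sn := keywordValue(raw[start:end], "SN"); sn != "" {
				return sn
			}
		}
		i = end
	}
	return ""
}

// keywordValue scans keyword records: two-byte name, one-byte length, data.
func keywordValue(body []byte, keyword string) string {
	for i := 0; i+3 <= len(body); {
		end := i + 3 + int(body[i+2])
		if end > len(body) {
			return ""
		}
		if string(body[i:i+2]) == keyword {
			return strings.TrimSpace(strings.Trim(string(body[i+3:end]), "\x00"))
		}
		i = end
	}
	return ""
}

func main() {
	raw := []byte{0x82, 0x04, 0x00, 'M', 'L', 'X', '5',
		0x90, 0x0B, 0x00, 'S', 'N', 0x08, 'M', 'T', '1', '2', '3', '4', '5', '6',
		0x78}
	fmt.Println(vpdSerial(raw))
}
```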


@@ -0,0 +1,47 @@
package collector
import (
"bee/audit/internal/schema"
"fmt"
"testing"
)
func TestEnrichPCIeWithPCISerialsAddsGPUFallback(t *testing.T) {
origDetail := queryPCILSPCIDetail
origVPD := readPCIVPDFile
t.Cleanup(func() {
queryPCILSPCIDetail = origDetail
readPCIVPDFile = origVPD
})
queryPCILSPCIDetail = func(bdf string) (string, error) {
if bdf != "0000:11:00.0" {
t.Fatalf("unexpected bdf: %s", bdf)
}
return "Serial number: GPU-SN-12345\n", nil
}
readPCIVPDFile = func(string) ([]byte, error) {
return nil, fmt.Errorf("no vpd needed")
}
class := "DisplayController"
bdf := "0000:11:00.0"
devs := []schema.HardwarePCIeDevice{{
DeviceClass: &class,
BDF: &bdf,
}}
out := enrichPCIeWithPCISerials(devs)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-SN-12345" {
t.Fatalf("serial=%v want GPU-SN-12345", out[0].SerialNumber)
}
}
func TestShouldProbePCIeSerialSkipsNonGPUOrNIC(t *testing.T) {
class := "StorageController"
bdf := "0000:19:00.0"
dev := schema.HardwarePCIeDevice{DeviceClass: &class, BDF: &bdf}
if shouldProbePCIeSerial(dev) {
t.Fatal("unexpected probe for storage controller")
}
}
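
The tests above replace package-level function variables and restore them in t.Cleanup. The same seam pattern, reduced to a standalone program, might look like the sketch below. `readAddress`, `firstMAC`, and `withStub` are hypothetical names; the default `readAddress` deliberately returns an error so that the sketch touches no real hardware.

```go
package main

import (
	"errors"
	"fmt"
)

// readAddress is a hypothetical seam: production code would read sysfs here.
var readAddress = func(iface string) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

// firstMAC returns the first address that can be read, or "".
func firstMAC(ifaces []string) string {
	for _, iface := range ifaces {
		if mac, err := readAddress(iface); err == nil && mac != "" {
			return mac
		}
	}
	return ""
}

// withStub swaps readAddress while fn runs and restores the original after,
// which is the role t.Cleanup plays inside a test.
func withStub(stub func(string) (string, error), fn func()) {
	orig := readAddress
	defer func() { readAddress = orig }()
	readAddress = stub
	fn()
}

func main() {
	withStub(func(string) (string, error) { return "aa:bb:cc:dd:ee:ff", nil }, func() {
		fmt.Println(firstMAC([]string{"eth0"}))
	})
}
```

Because the seam is shared package state, tests that swap it should not run in parallel with each other.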


@@ -4,18 +4,32 @@ import (
"bee/audit/internal/schema"
"log/slog"
"os/exec"
"regexp"
"sort"
"strconv"
"strings"
)
func collectPSUs() []schema.HardwarePowerSupply {
// ipmitool requires /dev/ipmi0 — not available on non-server hardware
out, err := exec.Command("ipmitool", "fru", "print").Output()
if err != nil {
var psus []schema.HardwarePowerSupply
if out, err := exec.Command("ipmitool", "fru", "print").Output(); err == nil {
psus = parseFRU(string(out))
} else {
slog.Info("psu: fru unavailable", "err", err)
}
sdrData := map[int]psuSDR{}
if sdrOut, err := exec.Command("ipmitool", "sdr").Output(); err == nil {
sdrData = parsePSUSDR(string(sdrOut))
if len(psus) == 0 {
psus = synthesizePSUsFromSDR(sdrData)
} else {
mergePSUSDR(psus, sdrData)
}
} else if len(psus) == 0 {
slog.Info("psu: ipmitool unavailable, skipping", "err", err)
return nil
}
psus := parseFRU(string(out))
slog.Info("psu: collected", "count", len(psus))
return psus
}
@@ -75,9 +89,7 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
// Only process PSU FRU records
headerLower := strings.ToLower(header)
if !strings.Contains(headerLower, "psu") &&
!strings.Contains(headerLower, "power supply") &&
!strings.Contains(headerLower, "power_supply") {
if !isPSUHeader(headerLower) {
return schema.HardwarePowerSupply{}, false
}
@@ -85,21 +97,24 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
psu := schema.HardwarePowerSupply{Present: &present}
slotStr := strconv.Itoa(slotIdx)
if slot, ok := parsePSUSlot(header); ok && slot > 0 {
slotStr = strconv.Itoa(slot - 1)
}
psu.Slot = &slotStr
if v := cleanDMIValue(fields["Board Product"]); v != "" {
if v := firstNonEmptyField(fields, "Board Product", "Product Name", "Product Part Number"); v != "" {
psu.Model = &v
}
if v := cleanDMIValue(fields["Board Mfg"]); v != "" {
if v := firstNonEmptyField(fields, "Board Mfg", "Product Manufacturer", "Product Manufacturer Name"); v != "" {
psu.Vendor = &v
}
if v := cleanDMIValue(fields["Board Serial"]); v != "" {
if v := firstNonEmptyField(fields, "Board Serial", "Product Serial", "Product Serial Number"); v != "" {
psu.SerialNumber = &v
}
if v := cleanDMIValue(fields["Board Part Number"]); v != "" {
if v := firstNonEmptyField(fields, "Board Part Number", "Product Part Number", "Part Number"); v != "" {
psu.PartNumber = &v
}
if v := cleanDMIValue(fields["Board Extra"]); v != "" {
if v := firstNonEmptyField(fields, "Board Extra", "Product Version", "Board Version"); v != "" {
psu.Firmware = &v
}
@@ -110,12 +125,230 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
}
}
status := "OK"
status := statusOK
psu.Status = &status
return psu, true
}
func isPSUHeader(headerLower string) bool {
return strings.Contains(headerLower, "psu") ||
strings.Contains(headerLower, "pws") ||
strings.Contains(headerLower, "power supply") ||
strings.Contains(headerLower, "power_supply") ||
strings.Contains(headerLower, "power module")
}
func firstNonEmptyField(fields map[string]string, keys ...string) string {
for _, key := range keys {
if value := cleanDMIValue(fields[key]); value != "" {
return value
}
}
return ""
}
type psuSDR struct {
slot int
status string
reason string
inputPowerW *float64
outputPowerW *float64
inputVoltage *float64
temperatureC *float64
healthPct *float64
}
var psuSlotPatterns = []*regexp.Regexp{
regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bps\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bpws\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bpower\s*supply(?:\s*bay)?\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bbay\s*([0-9]+)\b`),
}
func parsePSUSDR(raw string) map[int]psuSDR {
out := map[int]psuSDR{}
for _, line := range strings.Split(raw, "\n") {
fields := splitSDRFields(line)
if len(fields) < 3 {
continue
}
name := fields[0]
value := fields[1]
state := strings.ToLower(fields[2])
slot, ok := parsePSUSlot(name)
if !ok {
continue
}
entry := out[slot]
entry.slot = slot
if entry.status == "" {
entry.status = statusOK
}
if state != "" && state != "ok" && state != "ns" {
entry.status = statusCritical
entry.reason = "PSU sensor reported non-OK state: " + state
}
lowerName := strings.ToLower(name)
switch {
case strings.Contains(lowerName, "input power"):
entry.inputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "output power"):
entry.outputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "power supply bay"), strings.Contains(lowerName, "psu bay"):
entry.outputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "input voltage"), strings.Contains(lowerName, "ac input"):
entry.inputVoltage = parseFloatPtr(value)
case strings.Contains(lowerName, "temp"):
entry.temperatureC = parseFloatPtr(value)
case strings.Contains(lowerName, "health"), strings.Contains(lowerName, "remaining life"), strings.Contains(lowerName, "life remaining"):
entry.healthPct = parsePercentPtr(value)
}
out[slot] = entry
}
return out
}
func synthesizePSUsFromSDR(sdr map[int]psuSDR) []schema.HardwarePowerSupply {
if len(sdr) == 0 {
return nil
}
slots := make([]int, 0, len(sdr))
for slot := range sdr {
slots = append(slots, slot)
}
sort.Ints(slots)
out := make([]schema.HardwarePowerSupply, 0, len(slots))
for _, slot := range slots {
entry := sdr[slot]
present := true
status := entry.status
if status == "" {
status = statusUnknown
}
slotStr := strconv.Itoa(slot - 1)
model := "PSU"
psu := schema.HardwarePowerSupply{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Slot: &slotStr,
Present: &present,
Model: &model,
InputPowerW: entry.inputPowerW,
OutputPowerW: entry.outputPowerW,
InputVoltage: entry.inputVoltage,
TemperatureC: entry.temperatureC,
}
if entry.healthPct != nil {
psu.LifeRemainingPct = entry.healthPct
lifeUsed := 100 - *entry.healthPct
psu.LifeUsedPct = &lifeUsed
}
if entry.reason != "" {
psu.ErrorDescription = &entry.reason
}
out = append(out, psu)
}
return out
}
func mergePSUSDR(psus []schema.HardwarePowerSupply, sdr map[int]psuSDR) {
for i := range psus {
slotIdx, err := strconv.Atoi(derefPSUSlot(psus[i].Slot))
if err != nil {
continue
}
entry, ok := sdr[slotIdx+1]
if !ok {
continue
}
if entry.inputPowerW != nil {
psus[i].InputPowerW = entry.inputPowerW
}
if entry.outputPowerW != nil {
psus[i].OutputPowerW = entry.outputPowerW
}
if entry.inputVoltage != nil {
psus[i].InputVoltage = entry.inputVoltage
}
if entry.temperatureC != nil {
psus[i].TemperatureC = entry.temperatureC
}
if entry.healthPct != nil {
psus[i].LifeRemainingPct = entry.healthPct
lifeUsed := 100 - *entry.healthPct
psus[i].LifeUsedPct = &lifeUsed
}
if entry.status != "" {
psus[i].Status = &entry.status
}
if entry.reason != "" {
psus[i].ErrorDescription = &entry.reason
}
if psus[i].Status != nil && *psus[i].Status == statusOK {
if (entry.inputPowerW == nil && entry.outputPowerW == nil && entry.inputVoltage == nil) && entry.status == "" {
unknown := statusUnknown
psus[i].Status = &unknown
}
}
}
}
func splitSDRFields(line string) []string {
parts := strings.Split(line, "|")
out := make([]string, 0, len(parts))
for _, part := range parts {
part = strings.TrimSpace(part)
if part != "" {
out = append(out, part)
}
}
return out
}
func parsePSUSlot(name string) (int, bool) {
for _, re := range psuSlotPatterns {
m := re.FindStringSubmatch(strings.ToLower(name))
if len(m) == 0 {
continue
}
for _, group := range m[1:] {
if group == "" {
continue
}
n, err := strconv.Atoi(group)
if err == nil && n > 0 {
return n, true
}
}
}
return 0, false
}
func parseFloatPtr(raw string) *float64 {
raw = strings.TrimSpace(raw)
if raw == "" || strings.EqualFold(raw, "na") {
return nil
}
for _, field := range strings.Fields(raw) {
n, err := strconv.ParseFloat(strings.TrimSpace(field), 64)
if err == nil {
return &n
}
}
return nil
}
func derefPSUSlot(slot *string) string {
if slot == nil {
return ""
}
return *slot
}
// parseWattage extracts wattage from strings like "PSU 800W", "1200W PLATINUM".
func parseWattage(s string) int {
s = strings.ToUpper(s)


@@ -0,0 +1,91 @@
package collector
import "testing"
func TestParsePSUSDR(t *testing.T) {
raw := `
PS1 Input Power | 215 Watts | ok
PS1 Output Power | 198 Watts | ok
PS1 Input Voltage | 229 Volts | ok
PS1 Temp | 39 C | ok
PS1 Health | 97 % | ok
PS2 Input Power | 0 Watts | cr
`
got := parsePSUSDR(raw)
if len(got) != 2 {
t.Fatalf("len(got)=%d want 2", len(got))
}
if got[1].status != statusOK {
t.Fatalf("ps1 status=%q", got[1].status)
}
if got[1].inputPowerW == nil || *got[1].inputPowerW != 215 {
t.Fatalf("ps1 input power=%v", got[1].inputPowerW)
}
if got[1].outputPowerW == nil || *got[1].outputPowerW != 198 {
t.Fatalf("ps1 output power=%v", got[1].outputPowerW)
}
if got[1].inputVoltage == nil || *got[1].inputVoltage != 229 {
t.Fatalf("ps1 input voltage=%v", got[1].inputVoltage)
}
if got[1].temperatureC == nil || *got[1].temperatureC != 39 {
t.Fatalf("ps1 temperature=%v", got[1].temperatureC)
}
if got[1].healthPct == nil || *got[1].healthPct != 97 {
t.Fatalf("ps1 health=%v", got[1].healthPct)
}
if got[2].status != statusCritical {
t.Fatalf("ps2 status=%q", got[2].status)
}
}
func TestParsePSUSlotVendorVariants(t *testing.T) {
t.Parallel()
tests := []struct {
name string
want int
}{
{name: "PWS1 Status", want: 1},
{name: "Power Supply Bay 8", want: 8},
{name: "PS 6 Input Power", want: 6},
}
for _, tt := range tests {
got, ok := parsePSUSlot(tt.name)
if !ok || got != tt.want {
t.Fatalf("parsePSUSlot(%q)=(%d,%v) want (%d,true)", tt.name, got, ok, tt.want)
}
}
}
func TestSynthesizePSUsFromSDR(t *testing.T) {
t.Parallel()
health := 97.0
outputPower := 915.0
got := synthesizePSUsFromSDR(map[int]psuSDR{
1: {
slot: 1,
status: statusOK,
outputPowerW: &outputPower,
healthPct: &health,
},
})
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].Slot == nil || *got[0].Slot != "0" {
t.Fatalf("slot=%v want 0", got[0].Slot)
}
if got[0].OutputPowerW == nil || *got[0].OutputPowerW != 915 {
t.Fatalf("output power=%v", got[0].OutputPowerW)
}
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 97 {
t.Fatalf("life remaining=%v", got[0].LifeRemainingPct)
}
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 3 {
t.Fatalf("life used=%v", got[0].LifeUsedPct)
}
}
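
The fixture rows in TestParsePSUSDR follow a `name | reading | status` layout. A minimal sketch of parsing one such row is given here, assuming a simplified slot pattern: the real `psuSlotPatterns` is not shown in this diff, and `sdrReading`, `psuNumber`, and `parseSDRLine` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sdrReading is a hypothetical holder for one parsed SDR row.
type sdrReading struct {
	Slot   int // 1-based PSU number taken from the sensor name
	Name   string
	Value  float64
	Status string
}

// psuNumber is an assumed pattern, not the project's psuSlotPatterns.
var psuNumber = regexp.MustCompile(`(?i)^(?:psu|pws|ps)\s*(\d+)`)

// parseSDRLine splits "name | reading | status" and keeps rows that name a PSU
// and carry a numeric reading.
func parseSDRLine(line string) (sdrReading, bool) {
	parts := strings.Split(line, "|")
	if len(parts) < 3 {
		return sdrReading{}, false
	}
	name := strings.TrimSpace(parts[0])
	m := psuNumber.FindStringSubmatch(name)
	if m == nil {
		return sdrReading{}, false
	}
	slot, err := strconv.Atoi(m[1])
	if err != nil || slot <= 0 {
		return sdrReading{}, false
	}
	fields := strings.Fields(parts[1])
	if len(fields) == 0 {
		return sdrReading{}, false
	}
	value, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return sdrReading{}, false
	}
	return sdrReading{Slot: slot, Name: name, Value: value, Status: strings.TrimSpace(parts[2])}, true
}

func main() {
	r, ok := parseSDRLine("PS1 Input Power | 215 Watts | ok")
	fmt.Println(r.Slot, r.Value, r.Status, ok)
}
```

The sketch keeps the 1-based PSU number; the collector code converts to a 0-based slot string only when it writes `Slot`.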


@@ -0,0 +1,121 @@
package collector
import (
"bee/audit/internal/schema"
"strconv"
"strings"
)
func enrichPSUsWithTelemetry(psus []schema.HardwarePowerSupply, doc sensorsDoc) []schema.HardwarePowerSupply {
if len(psus) == 0 || len(doc) == 0 {
return psus
}
tempBySlot := psuTempsFromSensors(doc)
healthBySlot := psuHealthFromSensors(doc)
for i := range psus {
slot := derefPSUSlot(psus[i].Slot)
if slot == "" {
continue
}
if psus[i].TemperatureC == nil {
if value, ok := tempBySlot[slot]; ok {
psus[i].TemperatureC = &value
}
}
if psus[i].LifeRemainingPct == nil {
if value, ok := healthBySlot[slot]; ok {
psus[i].LifeRemainingPct = &value
used := 100 - value
psus[i].LifeUsedPct = &used
}
}
}
return psus
}
func psuHealthFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok {
continue
}
if !isLikelyPSUHealth(chip, featureName) {
continue
}
value, ok := firstFeaturePercent(feature)
if !ok {
continue
}
if slot, ok := detectPSUSlot(chip, featureName); ok {
if _, exists := out[slot]; !exists {
out[slot] = value
}
}
}
}
return out
}
func firstFeaturePercent(feature map[string]any) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
lower := strings.ToLower(key)
if strings.HasSuffix(lower, "_alarm") {
continue
}
if strings.Contains(lower, "health") || strings.Contains(lower, "life") || strings.Contains(lower, "remain") {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func isLikelyPSUHealth(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return (strings.Contains(value, "psu") || strings.Contains(value, "power supply")) &&
(strings.Contains(value, "health") || strings.Contains(value, "life") || strings.Contains(value, "remain"))
}
func psuTempsFromSensors(doc sensorsDoc) map[string]float64 {
out := map[string]float64{}
for chip, features := range doc {
for featureName, raw := range features {
feature, ok := raw.(map[string]any)
if !ok || classifySensorFeature(feature) != "temp" {
continue
}
if !isLikelyPSUTemp(chip, featureName) {
continue
}
temp, ok := firstFeatureFloat(feature, "_input")
if !ok {
continue
}
if slot, ok := detectPSUSlot(chip, featureName); ok {
if _, exists := out[slot]; !exists {
out[slot] = temp
}
}
}
}
return out
}
func isLikelyPSUTemp(chip, feature string) bool {
value := strings.ToLower(chip + " " + feature)
return strings.Contains(value, "psu") || strings.Contains(value, "power supply")
}
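// detectPSUSlot returns the 0-based slot, as a string, for the first part that
// names a 1-based PSU number (for example "PSU1 Temp" maps to "0").
func detectPSUSlot(parts ...string) (string, bool) {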
for _, part := range parts {
if value, ok := parsePSUSlot(part); ok && value > 0 {
return strconv.Itoa(value - 1), true
}
}
return "", false
}
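
psuTempsFromSensors and psuHealthFromSensors keep the first reading seen per slot while ranging over maps, whose iteration order Go leaves unspecified. The sketch below shows one way to make "first wins" reproducible by visiting labels in sorted order. `firstPerSlot` and `slotFromLabel` are hypothetical helpers, and `slotFromLabel` is only a stand-in for detectPSUSlot.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// slotFromLabel is a simplified stand-in for detectPSUSlot: it accepts labels of
// the form "PSU<n> ..." with n in 1..9 and returns the 0-based slot as a string.
func slotFromLabel(label string) (string, bool) {
	if len(label) < 4 || !strings.HasPrefix(label, "PSU") || label[3] < '1' || label[3] > '9' {
		return "", false
	}
	return strconv.Itoa(int(label[3] - '1')), true
}

// firstPerSlot keeps one reading per slot. Labels are visited in sorted order, so
// when two labels resolve to the same slot the same one wins on every run.
func firstPerSlot(readings map[string]float64, slotOf func(string) (string, bool)) map[string]float64 {
	labels := make([]string, 0, len(readings))
	for label := range readings {
		labels = append(labels, label)
	}
	sort.Strings(labels)

	out := map[string]float64{}
	for _, label := range labels {
		slot, ok := slotOf(label)
		if !ok {
			continue
		}
		if _, exists := out[slot]; !exists {
			out[slot] = readings[label]
		}
	}
	return out
}

func main() {
	readings := map[string]float64{
		"PSU1 Temp":       39.5,
		"PSU1 Temp Inlet": 31.0,
		"PSU2 Temp":       41.0,
	}
	fmt.Println(firstPerSlot(readings, slotFromLabel))
}
```

Sorting costs little at these sizes and makes the collected output stable between runs on the same hardware.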


@@ -0,0 +1,42 @@
package collector
import (
"testing"
"bee/audit/internal/schema"
)
func TestEnrichPSUsWithTelemetry(t *testing.T) {
slot0 := "0"
slot1 := "1"
psus := []schema.HardwarePowerSupply{
{Slot: &slot0},
{Slot: &slot1},
}
doc := sensorsDoc{
"psu-hwmon-0": {
"PSU1 Temp": map[string]any{"temp1_input": 39.5},
"PSU2 Temp": map[string]any{"temp2_input": 41.0},
"PSU1 Health": map[string]any{"health1_input": 98.0},
"PSU2 Remaining Life": map[string]any{"life2_input": 95.0},
},
}
got := enrichPSUsWithTelemetry(psus, doc)
if got[0].TemperatureC == nil || *got[0].TemperatureC != 39.5 {
t.Fatalf("psu0 temperature mismatch: %#v", got[0].TemperatureC)
}
if got[1].TemperatureC == nil || *got[1].TemperatureC != 41.0 {
t.Fatalf("psu1 temperature mismatch: %#v", got[1].TemperatureC)
}
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 98.0 {
t.Fatalf("psu0 life remaining mismatch: %#v", got[0].LifeRemainingPct)
}
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 2.0 {
t.Fatalf("psu0 life used mismatch: %#v", got[0].LifeUsedPct)
}
if got[1].LifeRemainingPct == nil || *got[1].LifeRemainingPct != 95.0 {
t.Fatalf("psu1 life remaining mismatch: %#v", got[1].LifeRemainingPct)
}
}


@@ -83,11 +83,7 @@ func isLikelyRAIDController(dev schema.HardwarePCIeDevice) bool {
if dev.DeviceClass == nil {
return false
}
c := strings.ToLower(*dev.DeviceClass)
return strings.Contains(c, "raid") ||
strings.Contains(c, "sas") ||
strings.Contains(c, "mass storage") ||
strings.Contains(c, "serial attached scsi")
return isRAIDClass(*dev.DeviceClass)
}
func collectStorcliDrives() []schema.HardwareStorage {
@@ -182,7 +178,10 @@ func parseSASIrcuDisplay(raw string) []schema.HardwareStorage {
present := true
status := mapRAIDDriveStatus(b["State"])
s := schema.HardwareStorage{Present: &present, Status: &status}
s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
enclosure := strings.TrimSpace(b["Enclosure #"])
slot := strings.TrimSpace(b["Slot #"])
@@ -281,7 +280,10 @@ func parseArcconfPhysicalDrives(raw string) []schema.HardwareStorage {
for _, b := range blocks {
present := true
status := mapRAIDDriveStatus(b["State"])
s := schema.HardwareStorage{Present: &present, Status: &status}
s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
if v := strings.TrimSpace(b["Reported Location"]); v != "" {
s.Slot = &v
@@ -362,8 +364,11 @@ func parseSSACLIPhysicalDrives(raw string) []schema.HardwareStorage {
if m := ssacliPhysicalDriveLine.FindStringSubmatch(trimmed); len(m) == 3 {
flush()
present := true
status := "UNKNOWN"
s := schema.HardwareStorage{Present: &present, Status: &status}
status := statusUnknown
s := schema.HardwareStorage{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
slot := m[1]
s.Slot = &slot
@@ -475,8 +480,8 @@ func storcliDriveToStorage(d struct {
present := true
status := mapRAIDDriveStatus(d.State)
s := schema.HardwareStorage{
Present: &present,
Status: &status,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
}
if v := strings.TrimSpace(d.EIDSlt); v != "" {
@@ -527,15 +532,15 @@ func mapRAIDDriveStatus(raw string) string {
u := strings.ToUpper(strings.TrimSpace(raw))
switch {
case strings.Contains(u, "OK"), strings.Contains(u, "OPTIMAL"), strings.Contains(u, "READY"):
return "OK"
return statusOK
case strings.Contains(u, "ONLN"), strings.Contains(u, "ONLINE"):
return "OK"
return statusOK
case strings.Contains(u, "RBLD"), strings.Contains(u, "REBUILD"):
return "WARNING"
return statusWarning
case strings.Contains(u, "FAIL"), strings.Contains(u, "OFFLINE"):
return "CRITICAL"
return statusCritical
default:
return "UNKNOWN"
return statusUnknown
}
}
@@ -641,8 +646,9 @@ func enrichStorageWithVROC(storage []schema.HardwareStorage, pcie []schema.Hardw
storage[i].Telemetry["vroc_array"] = arr.Name
storage[i].Telemetry["vroc_degraded"] = arr.Degraded
if arr.Degraded {
status := "WARNING"
status := statusWarning
storage[i].Status = &status
storage[i].ErrorDescription = stringPtr("VROC array is degraded")
}
updated++
}
@@ -659,14 +665,14 @@ func hasVROCController(pcie []schema.HardwarePCIeDevice) bool {
class := ""
if dev.DeviceClass != nil {
class = strings.ToLower(*dev.DeviceClass)
class = strings.TrimSpace(*dev.DeviceClass)
}
model := ""
if dev.Model != nil {
model = strings.ToLower(*dev.Model)
}
if strings.Contains(class, "raid") ||
if isRAIDClass(class) ||
strings.Contains(model, "vroc") ||
strings.Contains(model, "volume management device") ||
strings.Contains(model, "vmd") {


@@ -0,0 +1,334 @@
package collector
import (
"bee/audit/internal/schema"
"encoding/json"
"log/slog"
"strconv"
"strings"
)
type raidControllerTelemetry struct {
BatteryChargePct *float64
BatteryHealthPct *float64
BatteryTemperatureC *float64
BatteryVoltageV *float64
BatteryReplaceRequired *bool
ErrorDescription *string
}
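// enrichPCIeWithRAIDTelemetry attaches battery/capacitor telemetry to RAID
// controllers. Entries are paired with PCIe devices of the same vendor by
// position: the n-th controller reported by a vendor tool goes to the n-th
// matching device in devs.
func enrichPCIeWithRAIDTelemetry(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {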
byVendor := collectRAIDControllerTelemetry()
if len(byVendor) == 0 {
return devs
}
positions := map[int]int{}
for i := range devs {
if devs[i].VendorID == nil || !isLikelyRAIDController(devs[i]) {
continue
}
vendor := *devs[i].VendorID
list := byVendor[vendor]
if len(list) == 0 {
continue
}
index := positions[vendor]
if index >= len(list) {
continue
}
positions[vendor] = index + 1
applyRAIDControllerTelemetry(&devs[i], list[index])
}
return devs
}
func applyRAIDControllerTelemetry(dev *schema.HardwarePCIeDevice, tel raidControllerTelemetry) {
if tel.BatteryChargePct != nil {
dev.BatteryChargePct = tel.BatteryChargePct
}
if tel.BatteryHealthPct != nil {
dev.BatteryHealthPct = tel.BatteryHealthPct
}
if tel.BatteryTemperatureC != nil {
dev.BatteryTemperatureC = tel.BatteryTemperatureC
}
if tel.BatteryVoltageV != nil {
dev.BatteryVoltageV = tel.BatteryVoltageV
}
if tel.BatteryReplaceRequired != nil {
dev.BatteryReplaceRequired = tel.BatteryReplaceRequired
}
if tel.ErrorDescription != nil {
dev.ErrorDescription = tel.ErrorDescription
if dev.Status == nil || *dev.Status == statusOK {
status := statusWarning
dev.Status = &status
}
}
}
func collectRAIDControllerTelemetry() map[int][]raidControllerTelemetry {
out := map[int][]raidControllerTelemetry{}
if raw, err := raidToolQuery("storcli64", "/call", "show", "all", "J"); err == nil {
list := parseStorcliControllerTelemetry(raw)
if len(list) > 0 {
out[vendorBroadcomLSI] = append(out[vendorBroadcomLSI], list...)
slog.Info("raid: storcli controller telemetry", "count", len(list))
}
}
if raw, err := raidToolQuery("ssacli", "ctrl", "all", "show", "config", "detail"); err == nil {
list := parseSSACLIControllerTelemetry(string(raw))
if len(list) > 0 {
out[vendorHPE] = append(out[vendorHPE], list...)
slog.Info("raid: ssacli controller telemetry", "count", len(list))
}
}
if raw, err := raidToolQuery("arcconf", "getconfig", "1", "ad"); err == nil {
list := parseArcconfControllerTelemetry(string(raw))
if len(list) > 0 {
out[vendorAdaptec] = append(out[vendorAdaptec], list...)
slog.Info("raid: arcconf controller telemetry", "count", len(list))
}
}
return out
}
func parseStorcliControllerTelemetry(raw []byte) []raidControllerTelemetry {
var doc struct {
Controllers []struct {
ResponseData map[string]any `json:"Response Data"`
} `json:"Controllers"`
}
if err := json.Unmarshal(raw, &doc); err != nil {
slog.Warn("raid: parse storcli controller telemetry failed", "err", err)
return nil
}
var out []raidControllerTelemetry
for _, ctl := range doc.Controllers {
tel := raidControllerTelemetry{}
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["BBU_Info_Details"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info"]))
mergeStorcliBatteryMap(&tel, nestedStringMap(ctl.ResponseData["CV_Info_Details"]))
if hasRAIDControllerTelemetry(tel) {
out = append(out, tel)
}
}
return out
}
func nestedStringMap(raw any) map[string]string {
switch value := raw.(type) {
case map[string]any:
out := map[string]string{}
flattenStringMap("", value, out)
return out
case []any:
out := map[string]string{}
for _, item := range value {
if m, ok := item.(map[string]any); ok {
flattenStringMap("", m, out)
}
}
return out
default:
return nil
}
}
func flattenStringMap(prefix string, in map[string]any, out map[string]string) {
for key, raw := range in {
fullKey := strings.TrimSpace(strings.ToLower(strings.Trim(prefix+" "+key, " ")))
switch value := raw.(type) {
case map[string]any:
flattenStringMap(fullKey, value, out)
case []any:
for _, item := range value {
if m, ok := item.(map[string]any); ok {
flattenStringMap(fullKey, m, out)
}
}
case string:
out[fullKey] = value
case json.Number:
out[fullKey] = value.String()
case float64:
out[fullKey] = strconv.FormatFloat(value, 'f', -1, 64)
case bool:
if value {
out[fullKey] = "true"
} else {
out[fullKey] = "false"
}
}
}
}
func mergeStorcliBatteryMap(tel *raidControllerTelemetry, fields map[string]string) {
if len(fields) == 0 {
return
}
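// Map iteration order is unspecified, so when several keys match the same case
// (for example "Relative State of Charge" and "Remaining Capacity"), the value
// that is kept can differ between runs.
for key, raw := range fields {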
lower := strings.ToLower(strings.TrimSpace(key))
switch {
case strings.Contains(lower, "relative state of charge"), strings.Contains(lower, "remaining capacity"), strings.Contains(lower, "charge"):
if tel.BatteryChargePct == nil {
tel.BatteryChargePct = parsePercentPtr(raw)
}
case strings.Contains(lower, "state of health"), strings.Contains(lower, "health"):
if tel.BatteryHealthPct == nil {
tel.BatteryHealthPct = parsePercentPtr(raw)
}
case strings.Contains(lower, "temperature"):
if tel.BatteryTemperatureC == nil {
tel.BatteryTemperatureC = parseFloatPtr(raw)
}
case strings.Contains(lower, "voltage"):
if tel.BatteryVoltageV == nil {
tel.BatteryVoltageV = parseFloatPtr(raw)
}
case strings.Contains(lower, "replace"), strings.Contains(lower, "replacement required"):
if tel.BatteryReplaceRequired == nil {
tel.BatteryReplaceRequired = parseReplaceRequired(raw)
}
case strings.Contains(lower, "learn cycle requested"), strings.Contains(lower, "battery state"), strings.Contains(lower, "capacitance state"):
if desc := batteryStateDescription(raw); desc != nil && tel.ErrorDescription == nil {
tel.ErrorDescription = desc
}
}
}
}
func parseSSACLIControllerTelemetry(raw string) []raidControllerTelemetry {
lines := strings.Split(raw, "\n")
var out []raidControllerTelemetry
var current *raidControllerTelemetry
flush := func() {
if current != nil && hasRAIDControllerTelemetry(*current) {
out = append(out, *current)
}
current = nil
}
for _, line := range lines {
trimmed := strings.TrimSpace(line)
if trimmed == "" {
continue
}
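// Key/value rows such as "Controller Status: OK" also start with "controller ";
// only a line without a colon is treated as a controller header.
lower := strings.ToLower(trimmed)
if !strings.Contains(trimmed, ":") && (strings.HasPrefix(lower, "smart array") || strings.HasPrefix(lower, "controller ")) {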
flush()
current = &raidControllerTelemetry{}
continue
}
if current == nil {
continue
}
if idx := strings.Index(trimmed, ":"); idx > 0 {
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
val := strings.TrimSpace(trimmed[idx+1:])
switch {
case strings.Contains(key, "capacitor temperature"), strings.Contains(key, "battery temperature"):
current.BatteryTemperatureC = parseFloatPtr(val)
case strings.Contains(key, "capacitor voltage"), strings.Contains(key, "battery voltage"):
current.BatteryVoltageV = parseFloatPtr(val)
case strings.Contains(key, "capacitor charge"), strings.Contains(key, "battery charge"):
current.BatteryChargePct = parsePercentPtr(val)
case strings.Contains(key, "capacitor health"), strings.Contains(key, "battery health"):
current.BatteryHealthPct = parsePercentPtr(val)
case strings.Contains(key, "replace"), strings.Contains(key, "failed"):
if current.BatteryReplaceRequired == nil {
current.BatteryReplaceRequired = parseReplaceRequired(val)
}
if desc := batteryStateDescription(val); desc != nil && current.ErrorDescription == nil {
current.ErrorDescription = desc
}
}
}
}
flush()
return out
}
func parseArcconfControllerTelemetry(raw string) []raidControllerTelemetry {
lines := strings.Split(raw, "\n")
tel := raidControllerTelemetry{}
for _, line := range lines {
trimmed := strings.TrimSpace(line)
if idx := strings.Index(trimmed, ":"); idx > 0 {
key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
val := strings.TrimSpace(trimmed[idx+1:])
switch {
case strings.Contains(key, "battery temperature"), strings.Contains(key, "capacitor temperature"):
tel.BatteryTemperatureC = parseFloatPtr(val)
case strings.Contains(key, "battery voltage"), strings.Contains(key, "capacitor voltage"):
tel.BatteryVoltageV = parseFloatPtr(val)
case strings.Contains(key, "battery charge"), strings.Contains(key, "capacitor charge"):
tel.BatteryChargePct = parsePercentPtr(val)
case strings.Contains(key, "battery health"), strings.Contains(key, "capacitor health"):
tel.BatteryHealthPct = parsePercentPtr(val)
case strings.Contains(key, "replace"), strings.Contains(key, "failed"):
if tel.BatteryReplaceRequired == nil {
tel.BatteryReplaceRequired = parseReplaceRequired(val)
}
if desc := batteryStateDescription(val); desc != nil && tel.ErrorDescription == nil {
tel.ErrorDescription = desc
}
}
}
}
if hasRAIDControllerTelemetry(tel) {
return []raidControllerTelemetry{tel}
}
return nil
}
func hasRAIDControllerTelemetry(tel raidControllerTelemetry) bool {
return tel.BatteryChargePct != nil ||
tel.BatteryHealthPct != nil ||
tel.BatteryTemperatureC != nil ||
tel.BatteryVoltageV != nil ||
tel.BatteryReplaceRequired != nil ||
tel.ErrorDescription != nil
}
func parsePercentPtr(raw string) *float64 {
raw = strings.ReplaceAll(strings.TrimSpace(raw), "%", "")
return parseFloatPtr(raw)
}
func parseReplaceRequired(raw string) *bool {
lower := strings.ToLower(strings.TrimSpace(raw))
switch {
case lower == "":
return nil
case strings.Contains(lower, "replace"), strings.Contains(lower, "failed"), strings.Contains(lower, "yes"), strings.Contains(lower, "required"):
value := true
return &value
case strings.Contains(lower, "no"), strings.Contains(lower, "ok"), strings.Contains(lower, "good"), strings.Contains(lower, "optimal"):
value := false
return &value
default:
return nil
}
}
func batteryStateDescription(raw string) *string {
lower := strings.ToLower(strings.TrimSpace(raw))
if lower == "" {
return nil
}
switch {
case strings.Contains(lower, "failed"), strings.Contains(lower, "fault"), strings.Contains(lower, "replace"), strings.Contains(lower, "warning"), strings.Contains(lower, "degraded"):
return &raw
default:
return nil
}
}
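
parseReplaceRequired matches substrings, so a value such as "Unknown" contains "no" and "Not required" contains "required". A stricter variant could compare whole words instead. The sketch below is a hypothetical alternative, not the code in this change, and `replaceRequired` is an illustrative name.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceRequired reports whether a replacement is required and whether the
// value was understood at all. It compares whole lowercase words, so "Unknown"
// is not read as "no" and "Not required" is not read as "required".
func replaceRequired(raw string) (required bool, known bool) {
	words := strings.FieldsFunc(strings.ToLower(raw), func(r rune) bool {
		return r < 'a' || r > 'z' // split on anything that is not a letter
	})
	negated := false
	for _, w := range words {
		switch w {
		case "no", "not", "ok", "good", "optimal":
			negated = true
		case "yes", "required", "replace", "failed":
			// A preceding negative word flips the meaning ("not failed").
			return !negated, true
		}
	}
	if negated {
		return false, true
	}
	return false, false
}

func main() {
	for _, v := range []string{"No", "Yes", "Not required", "Unknown", "Failed"} {
		required, known := replaceRequired(v)
		fmt.Printf("%q required=%v known=%v\n", v, required, known)
	}
}
```

Returning a separate `known` flag plays the role of the nil pointer in the original: callers can tell "not required" apart from "could not tell".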


@@ -1,6 +1,10 @@
package collector
import "testing"
import (
"bee/audit/internal/schema"
"errors"
"testing"
)
func TestParseSASIrcuControllerIDs(t *testing.T) {
raw := `LSI Corporation SAS2 IR Configuration Utility.
@@ -90,7 +94,111 @@ physicaldrive 1I:1:2 (894 GB, SAS HDD, Failed)
if drives[0].Status == nil || *drives[0].Status != "OK" {
t.Fatalf("drive0 status: %v", drives[0].Status)
}
if drives[1].Status == nil || *drives[1].Status != "CRITICAL" {
if drives[1].Status == nil || *drives[1].Status != statusCritical {
t.Fatalf("drive1 status: %v", drives[1].Status)
}
}
func TestParseStorcliControllerTelemetry(t *testing.T) {
raw := []byte(`{
"Controllers": [
{
"Response Data": {
"BBU_Info": {
"State of Health": "98 %",
"Relative State of Charge": "76 %",
"Temperature": "41 C",
"Voltage": "12.3 V",
"Replacement required": "No"
}
}
}
]
}`)
got := parseStorcliControllerTelemetry(raw)
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].BatteryHealthPct == nil || *got[0].BatteryHealthPct != 98 {
t.Fatalf("battery health=%v", got[0].BatteryHealthPct)
}
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 76 {
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
}
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 41 {
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
}
if got[0].BatteryVoltageV == nil || *got[0].BatteryVoltageV != 12.3 {
t.Fatalf("battery voltage=%v", got[0].BatteryVoltageV)
}
if got[0].BatteryReplaceRequired == nil || *got[0].BatteryReplaceRequired {
t.Fatalf("battery replace=%v", got[0].BatteryReplaceRequired)
}
}
func TestParseSSACLIControllerTelemetry(t *testing.T) {
raw := `Smart Array P440ar in Slot 0
Battery/Capacitor Count: 1
Capacitor Temperature (C): 37
Capacitor Charge (%): 94
Capacitor Health (%): 96
Capacitor Voltage (V): 9.8
Capacitor Failed: No
`
got := parseSSACLIControllerTelemetry(raw)
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].BatteryTemperatureC == nil || *got[0].BatteryTemperatureC != 37 {
t.Fatalf("battery temperature=%v", got[0].BatteryTemperatureC)
}
if got[0].BatteryChargePct == nil || *got[0].BatteryChargePct != 94 {
t.Fatalf("battery charge=%v", got[0].BatteryChargePct)
}
}
func TestEnrichPCIeWithRAIDTelemetry(t *testing.T) {
orig := raidToolQuery
t.Cleanup(func() { raidToolQuery = orig })
raidToolQuery = func(name string, args ...string) ([]byte, error) {
switch name {
case "storcli64":
return []byte(`{
"Controllers": [
{
"Response Data": {
"CV_Info": {
"State of Health": "99 %",
"Relative State of Charge": "81 %",
"Temperature": "38 C",
"Voltage": "12.1 V",
"Replacement required": "No"
}
}
}
]
}`), nil
default:
return nil, errors.New("skip")
}
}
vendor := vendorBroadcomLSI
class := "MassStorageController"
status := statusOK
devs := []schema.HardwarePCIeDevice{{
VendorID: &vendor,
DeviceClass: &class,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
}}
out := enrichPCIeWithRAIDTelemetry(devs)
if out[0].BatteryHealthPct == nil || *out[0].BatteryHealthPct != 99 {
t.Fatalf("battery health=%v", out[0].BatteryHealthPct)
}
if out[0].BatteryChargePct == nil || *out[0].BatteryChargePct != 81 {
t.Fatalf("battery charge=%v", out[0].BatteryChargePct)
}
if out[0].BatteryVoltageV == nil || *out[0].BatteryVoltageV != 12.1 {
t.Fatalf("battery voltage=%v", out[0].BatteryVoltageV)
}
}
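
TestEnrichPCIeWithRAIDTelemetry swaps the package-level `raidToolQuery` for a stub and restores it with `t.Cleanup`. The same seam can be sketched outside a test as follows. `toolQuery`, `withToolStub`, and `outputSize` are hypothetical names, and `defer` stands in for `t.Cleanup`.

```go
package main

import (
	"errors"
	"fmt"
)

// toolQuery is a hypothetical seam analogous to raidToolQuery: production code
// calls through the variable, so a test can substitute canned output.
var toolQuery = func(name string, args ...string) ([]byte, error) {
	return nil, errors.New(name + ": not available in this sketch")
}

// withToolStub installs stub for the duration of body and then restores the
// previous function, even if body panics.
func withToolStub(stub func(name string, args ...string) ([]byte, error), body func()) {
	orig := toolQuery
	defer func() { toolQuery = orig }()
	toolQuery = stub
	body()
}

// outputSize returns the number of bytes the tool produced, or -1 on error.
func outputSize(name string) int {
	raw, err := toolQuery(name)
	if err != nil {
		return -1
	}
	return len(raw)
}

func main() {
	stub := func(string, ...string) ([]byte, error) { return []byte("{}"), nil }
	withToolStub(stub, func() {
		fmt.Println("stubbed:", outputSize("storcli64"))
	})
	fmt.Println("restored:", outputSize("storcli64"))
}
```

Because the seam is shared package state, tests that replace it should not run in parallel with each other. That matches the test above, which does not call `t.Parallel()`.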


@@ -0,0 +1,373 @@
package collector
import (
"bee/audit/internal/schema"
"encoding/json"
"log/slog"
"os/exec"
"sort"
"strconv"
"strings"
)
type sensorsDoc map[string]map[string]any
func collectSensors() *schema.HardwareSensors {
doc, err := readSensorsJSONDoc()
if err != nil {
slog.Info("sensors: unavailable, skipping", "err", err)
return nil
}
sensors := buildSensorsFromDoc(doc)
if sensors == nil || (len(sensors.Fans) == 0 && len(sensors.Power) == 0 && len(sensors.Temperatures) == 0 && len(sensors.Other) == 0) {
return nil
}
slog.Info("sensors: collected",
"fans", len(sensors.Fans),
"power", len(sensors.Power),
"temperatures", len(sensors.Temperatures),
"other", len(sensors.Other),
)
return sensors
}
func readSensorsJSONDoc() (sensorsDoc, error) {
out, err := exec.Command("sensors", "-j").Output()
if err != nil {
return nil, err
}
var doc sensorsDoc
if err := json.Unmarshal(out, &doc); err != nil {
return nil, err
}
return doc, nil
}
func buildSensorsFromDoc(doc sensorsDoc) *schema.HardwareSensors {
if len(doc) == 0 {
return nil
}
result := &schema.HardwareSensors{}
seen := map[string]struct{}{}
chips := make([]string, 0, len(doc))
for chip := range doc {
chips = append(chips, chip)
}
sort.Strings(chips)
for _, chip := range chips {
features := doc[chip]
location := sensorLocation(chip)
keys := make([]string, 0, len(features))
for key := range features {
keys = append(keys, key)
}
sort.Strings(keys)
for _, key := range keys {
if strings.EqualFold(key, "Adapter") {
continue
}
feature, ok := features[key].(map[string]any)
if !ok {
continue
}
name := strings.TrimSpace(key)
if name == "" {
continue
}
switch classifySensorFeature(feature) {
case "fan":
item := buildFanSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "fan", item.Name) {
continue
}
result.Fans = append(result.Fans, *item)
case "temp":
item := buildTempSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "temp", item.Name) {
continue
}
result.Temperatures = append(result.Temperatures, *item)
case "power":
item := buildPowerSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "power", item.Name) {
continue
}
result.Power = append(result.Power, *item)
default:
item := buildOtherSensor(name, location, feature)
if item == nil || duplicateSensor(seen, "other", item.Name) {
continue
}
result.Other = append(result.Other, *item)
}
}
}
return result
}
func parseSensorsJSON(raw []byte) (*schema.HardwareSensors, error) {
var doc sensorsDoc
err := json.Unmarshal(raw, &doc)
if err != nil {
return nil, err
}
return buildSensorsFromDoc(doc), nil
}
func duplicateSensor(seen map[string]struct{}, sensorType, name string) bool {
key := sensorType + "\x00" + name
if _, ok := seen[key]; ok {
return true
}
seen[key] = struct{}{}
return false
}
func sensorLocation(chip string) *string {
chip = strings.TrimSpace(chip)
if chip == "" {
return nil
}
return &chip
}
func classifySensorFeature(feature map[string]any) string {
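// Walk keys in sorted order so a feature with mixed key kinds is classified
// the same way on every run.
for _, key := range sortedFeatureKeys(feature) {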
switch {
case strings.Contains(key, "fan") && strings.HasSuffix(key, "_input"):
return "fan"
case strings.Contains(key, "temp") && strings.HasSuffix(key, "_input"):
return "temp"
case strings.Contains(key, "power") && (strings.HasSuffix(key, "_input") || strings.HasSuffix(key, "_average")):
return "power"
case strings.Contains(key, "curr") && strings.HasSuffix(key, "_input"):
return "power"
case strings.HasPrefix(key, "in") && strings.HasSuffix(key, "_input"):
return "power"
}
}
return "other"
}
func buildFanSensor(name string, location *string, feature map[string]any) *schema.HardwareFanSensor {
rpm, ok := firstFeatureInt(feature, "_input")
if !ok {
return nil
}
item := &schema.HardwareFanSensor{Name: name, Location: location, RPM: &rpm}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func buildTempSensor(name string, location *string, feature map[string]any) *schema.HardwareTemperatureSensor {
celsius, ok := firstFeatureFloat(feature, "_input")
if !ok {
return nil
}
item := &schema.HardwareTemperatureSensor{Name: name, Location: location, Celsius: &celsius}
if warning, ok := firstFeatureFloatWithSuffixes(feature, []string{"_max", "_high"}); ok {
item.ThresholdWarningCelsius = &warning
}
if critical, ok := firstFeatureFloatWithSuffixes(feature, []string{"_crit", "_emergency"}); ok {
item.ThresholdCriticalCelsius = &critical
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
} else {
item.Status = deriveTemperatureStatus(item.Celsius, item.ThresholdWarningCelsius, item.ThresholdCriticalCelsius)
}
return item
}
func buildPowerSensor(name string, location *string, feature map[string]any) *schema.HardwarePowerSensor {
item := &schema.HardwarePowerSensor{Name: name, Location: location}
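// Accept only actual readings: threshold and alarm keys such as power1_crit or
// power1_alarm also contain "power" and sort ahead of power1_input.
for _, key := range sortedFeatureKeys(feature) {
if !strings.Contains(key, "power") || !(strings.HasSuffix(key, "_input") || strings.HasSuffix(key, "_average")) {
continue
}
if v, ok := floatFromAny(feature[key]); ok {
item.PowerW = &v
break
}
}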
if v, ok := firstFeatureFloatWithPrefix(feature, "curr"); ok {
item.CurrentA = &v
}
if v, ok := firstFeatureFloatWithPrefix(feature, "in"); ok {
item.VoltageV = &v
}
if item.PowerW == nil && item.CurrentA == nil && item.VoltageV == nil {
return nil
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func buildOtherSensor(name string, location *string, feature map[string]any) *schema.HardwareOtherSensor {
value, unit, ok := firstGenericSensorValue(feature)
if !ok {
return nil
}
item := &schema.HardwareOtherSensor{Name: name, Location: location, Value: &value}
if unit != "" {
item.Unit = &unit
}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func sensorStatusFromFeature(feature map[string]any) *string {
for key, raw := range feature {
if !strings.HasSuffix(key, "_alarm") {
continue
}
if number, ok := floatFromAny(raw); ok && number > 0 {
status := statusWarning
return &status
}
}
return nil
}
func deriveTemperatureStatus(current, warning, critical *float64) *string {
if current == nil {
return nil
}
switch {
case critical != nil && *current >= *critical:
status := statusCritical
return &status
case warning != nil && *current >= *warning:
status := statusWarning
return &status
default:
status := statusOK
return &status
}
}
func firstFeatureInt(feature map[string]any, suffix string) (int, bool) {
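// Use sorted keys, like the float helpers below, so the result is deterministic.
for _, key := range sortedFeatureKeys(feature) {
if strings.HasSuffix(key, suffix) {
if value, ok := floatFromAny(feature[key]); ok {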
return int(value), true
}
}
}
return 0, false
}
func firstFeatureFloat(feature map[string]any, suffix string) (float64, bool) {
return firstFeatureFloatWithSuffixes(feature, []string{suffix})
}
func firstFeatureFloatWithSuffixes(feature map[string]any, suffixes []string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
for _, suffix := range suffixes {
if strings.HasSuffix(key, suffix) {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
}
return 0, false
}
func firstFeatureFloatWithContains(feature map[string]any, parts []string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
matched := true
for _, part := range parts {
if !strings.Contains(key, part) {
matched = false
break
}
}
if matched {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func firstFeatureFloatWithPrefix(feature map[string]any, prefix string) (float64, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
if strings.HasPrefix(key, prefix) && strings.HasSuffix(key, "_input") {
if value, ok := floatFromAny(feature[key]); ok {
return value, true
}
}
}
return 0, false
}
func firstGenericSensorValue(feature map[string]any) (float64, string, bool) {
keys := sortedFeatureKeys(feature)
for _, key := range keys {
if strings.HasSuffix(key, "_alarm") {
continue
}
value, ok := floatFromAny(feature[key])
if !ok {
continue
}
unit := inferSensorUnit(key)
return value, unit, true
}
return 0, "", false
}
func inferSensorUnit(key string) string {
switch {
case strings.Contains(key, "humidity"):
return "%"
case strings.Contains(key, "intrusion"):
return ""
default:
return ""
}
}
func sortedFeatureKeys(feature map[string]any) []string {
keys := make([]string, 0, len(feature))
for key := range feature {
keys = append(keys, key)
}
sort.Strings(keys)
return keys
}
func floatFromAny(raw any) (float64, bool) {
switch value := raw.(type) {
case float64:
return value, true
case float32:
return float64(value), true
case int:
return float64(value), true
case int64:
return float64(value), true
case json.Number:
if f, err := value.Float64(); err == nil {
return f, true
}
case string:
if value == "" {
return 0, false
}
if f, err := strconv.ParseFloat(value, 64); err == nil {
return f, true
}
}
return 0, false
}
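
The collector decodes `sensors -j` output into chip → feature → key/value maps, classifies each feature by its key names, and derives a temperature status from thresholds. The sketch below is a reduced version of that flow. `classify` and `tempStatus` are hypothetical simplifications that cover only fan and temperature features, and the status strings "OK", "WARNING", and "CRITICAL" are assumed from the literals this change replaces with constants.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// classify labels a feature by its "*_input" keys, visiting keys in sorted order.
func classify(feature map[string]any) string {
	keys := make([]string, 0, len(feature))
	for k := range feature {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if !strings.HasSuffix(k, "_input") {
			continue
		}
		switch {
		case strings.HasPrefix(k, "fan"):
			return "fan"
		case strings.HasPrefix(k, "temp"):
			return "temp"
		}
	}
	return "other"
}

// tempStatus compares a reading with optional warning and critical thresholds.
func tempStatus(current float64, warn, crit *float64) string {
	switch {
	case crit != nil && current >= *crit:
		return "CRITICAL"
	case warn != nil && current >= *warn:
		return "WARNING"
	default:
		return "OK"
	}
}

func main() {
	raw := []byte(`{"coretemp-isa-0000": {"Adapter": "ISA adapter",
		"Package id 0": {"temp1_input": 61.5, "temp1_max": 80.0, "temp1_crit": 95.0}}}`)
	var doc map[string]map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for chip, features := range doc {
		for name, v := range features {
			feature, ok := v.(map[string]any)
			if !ok {
				continue // "Adapter" is a plain string, not a feature map
			}
			fmt.Println(chip, name, classify(feature))
		}
	}
}
```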


@@ -0,0 +1,54 @@
package collector
import "testing"
func TestParseSensorsJSON(t *testing.T) {
raw := []byte(`{
"coretemp-isa-0000": {
"Adapter": "ISA adapter",
"Package id 0": {
"temp1_input": 61.5,
"temp1_max": 80.0,
"temp1_crit": 95.0
},
"fan1": {
"fan1_input": 4200
}
},
"acpitz-acpi-0": {
"Adapter": "ACPI interface",
"in0": {
"in0_input": 12.06
},
"curr1": {
"curr1_input": 0.64
},
"power1": {
"power1_average": 137.0
},
"humidity1": {
"humidity1_input": 38.5
}
}
}`)
got, err := parseSensorsJSON(raw)
if err != nil {
t.Fatalf("parseSensorsJSON error: %v", err)
}
if got == nil {
t.Fatal("expected sensors")
}
if len(got.Temperatures) != 1 || got.Temperatures[0].Celsius == nil || *got.Temperatures[0].Celsius != 61.5 {
t.Fatalf("temperatures mismatch: %#v", got.Temperatures)
}
if len(got.Fans) != 1 || got.Fans[0].RPM == nil || *got.Fans[0].RPM != 4200 {
t.Fatalf("fans mismatch: %#v", got.Fans)
}
if len(got.Power) != 3 {
t.Fatalf("power sensors mismatch: %#v", got.Power)
}
if len(got.Other) != 1 || got.Other[0].Unit == nil || *got.Other[0].Unit != "%" {
t.Fatalf("other sensors mismatch: %#v", got.Other)
}
}

@@ -5,11 +5,13 @@ import (
"encoding/json"
"log/slog"
"os/exec"
"path/filepath"
"strconv"
"strings"
)
func collectStorage() []schema.HardwareStorage {
devs := lsblkDevices()
devs := discoverStorageDevices()
result := make([]schema.HardwareStorage, 0, len(devs))
for _, dev := range devs {
var s schema.HardwareStorage
@@ -26,19 +28,60 @@ func collectStorage() []schema.HardwareStorage {
// lsblkDevice is a minimal lsblk JSON record.
type lsblkDevice struct {
Name string `json:"name"`
Type string `json:"type"`
Size string `json:"size"`
Serial string `json:"serial"`
Model string `json:"model"`
Tran string `json:"tran"`
Hctl string `json:"hctl"`
Name string `json:"name"`
Type string `json:"type"`
Size string `json:"size"`
Serial string `json:"serial"`
Model string `json:"model"`
Tran string `json:"tran"`
Hctl string `json:"hctl"`
}
type lsblkRoot struct {
Blockdevices []lsblkDevice `json:"blockdevices"`
}
type nvmeListRoot struct {
Devices []nvmeListDevice `json:"Devices"`
}
type nvmeListDevice struct {
DevicePath string `json:"DevicePath"`
ModelNumber string `json:"ModelNumber"`
SerialNumber string `json:"SerialNumber"`
Firmware string `json:"Firmware"`
PhysicalSize int64 `json:"PhysicalSize"`
}
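// discoverStorageDevices merges lsblk and `nvme list` results by device name, with
// non-empty lsblk fields taking precedence, and keeps only disk-type devices.
// The result order is unspecified because it follows map iteration.
func discoverStorageDevices() []lsblkDevice {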
merged := map[string]lsblkDevice{}
for _, dev := range lsblkDevices() {
if dev.Name == "" {
continue
}
merged[dev.Name] = dev
}
for _, dev := range nvmeListDevices() {
if dev.Name == "" {
continue
}
current := merged[dev.Name]
merged[dev.Name] = mergeStorageDevice(current, dev)
}
disks := make([]lsblkDevice, 0, len(merged))
for _, dev := range merged {
if dev.Type == "" {
dev.Type = "disk"
}
if dev.Type != "disk" {
continue
}
disks = append(disks, dev)
}
return disks
}
func lsblkDevices() []lsblkDevice {
out, err := exec.Command("lsblk", "-J", "-d",
"-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL").Output()
@@ -60,6 +103,59 @@ func lsblkDevices() []lsblkDevice {
return disks
}
func nvmeListDevices() []lsblkDevice {
out, err := exec.Command("nvme", "list", "-o", "json").Output()
if err != nil {
return nil
}
var root nvmeListRoot
if err := json.Unmarshal(out, &root); err != nil {
slog.Warn("storage: nvme list parse failed", "err", err)
return nil
}
devices := make([]lsblkDevice, 0, len(root.Devices))
for _, dev := range root.Devices {
name := filepath.Base(strings.TrimSpace(dev.DevicePath))
if name == "" {
continue
}
devices = append(devices, lsblkDevice{
Name: name,
Type: "disk",
Size: strconv.FormatInt(dev.PhysicalSize, 10),
Serial: strings.TrimSpace(dev.SerialNumber),
Model: strings.TrimSpace(dev.ModelNumber),
Tran: "nvme",
})
}
return devices
}
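// mergeStorageDevice fills empty fields of existing from incoming.
// Non-empty fields of existing are never overwritten.
func mergeStorageDevice(existing, incoming lsblkDevice) lsblkDevice {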
if existing.Name == "" {
return incoming
}
if existing.Type == "" {
existing.Type = incoming.Type
}
if strings.TrimSpace(existing.Size) == "" {
existing.Size = incoming.Size
}
if strings.TrimSpace(existing.Serial) == "" {
existing.Serial = incoming.Serial
}
if strings.TrimSpace(existing.Model) == "" {
existing.Model = incoming.Model
}
if strings.TrimSpace(existing.Tran) == "" {
existing.Tran = incoming.Tran
}
if strings.TrimSpace(existing.Hctl) == "" {
existing.Hctl = incoming.Hctl
}
return existing
}
// smartctlInfo is the subset of smartctl -j -a output we care about.
type smartctlInfo struct {
ModelFamily string `json:"model_family"`
@@ -67,14 +163,22 @@ type smartctlInfo struct {
SerialNumber string `json:"serial_number"`
FirmwareVer string `json:"firmware_version"`
RotationRate int `json:"rotation_rate"`
Temperature struct {
Current int `json:"current"`
} `json:"temperature"`
SmartStatus struct {
Passed bool `json:"passed"`
} `json:"smart_status"`
UserCapacity struct {
Bytes int64 `json:"bytes"`
} `json:"user_capacity"`
AtaSmartAttributes struct {
Table []struct {
ID int `json:"id"`
Name string `json:"name"`
Raw struct{ Value int64 `json:"value"` } `json:"raw"`
ID int `json:"id"`
Name string `json:"name"`
Raw struct {
Value int64 `json:"value"`
} `json:"raw"`
} `json:"table"`
} `json:"ata_smart_attributes"`
PowerOnTime struct {
@@ -86,6 +190,7 @@ type smartctlInfo struct {
func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
present := true
s := schema.HardwareStorage{Present: &present}
s.Telemetry = map[string]any{"linux_device": "/dev/" + dev.Name}
tran := strings.ToLower(dev.Tran)
devPath := "/dev/" + dev.Name
@@ -149,69 +254,117 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
} else if info.RotationRate > 0 {
devType = "HDD"
}
s.Type = &devType
// telemetry
tel := map[string]any{}
if info.Temperature.Current > 0 {
t := float64(info.Temperature.Current)
s.TemperatureC = &t
}
if info.PowerOnTime.Hours > 0 {
tel["power_on_hours"] = info.PowerOnTime.Hours
v := int64(info.PowerOnTime.Hours)
s.PowerOnHours = &v
}
if info.PowerCycleCount > 0 {
tel["power_cycles"] = info.PowerCycleCount
v := int64(info.PowerCycleCount)
s.PowerCycles = &v
}
reallocated := int64(0)
pending := int64(0)
uncorrectable := int64(0)
lifeRemaining := int64(0)
for _, attr := range info.AtaSmartAttributes.Table {
switch attr.ID {
case 5:
tel["reallocated_sectors"] = attr.Raw.Value
reallocated = attr.Raw.Value
s.ReallocatedSectors = &reallocated
case 177:
tel["wear_leveling_pct"] = attr.Raw.Value
value := float64(attr.Raw.Value)
s.LifeUsedPct = &value
case 231:
tel["life_remaining_pct"] = attr.Raw.Value
lifeRemaining = attr.Raw.Value
value := float64(attr.Raw.Value)
s.LifeRemainingPct = &value
case 241:
tel["total_lba_written"] = attr.Raw.Value
value := attr.Raw.Value
s.WrittenBytes = &value
case 197:
pending = attr.Raw.Value
s.CurrentPendingSectors = &pending
case 198:
uncorrectable = attr.Raw.Value
s.OfflineUncorrectable = &uncorrectable
}
}
if len(tel) > 0 {
s.Telemetry = tel
status := storageHealthStatus{
overallPassed: info.SmartStatus.Passed,
hasOverall: true,
reallocatedSectors: reallocated,
pendingSectors: pending,
offlineUncorrectable: uncorrectable,
lifeRemainingPct: lifeRemaining,
}
setStorageHealthStatus(&s, status)
return s
}
s.Type = &devType
status := "OK"
status := statusUnknown
s.Status = &status
return s
}
// nvmeSmartLog is the subset of `nvme smart-log -o json` output we care about.
type nvmeSmartLog struct {
CriticalWarning int `json:"critical_warning"`
PercentageUsed int `json:"percentage_used"`
AvailableSpare int `json:"available_spare"`
SpareThreshold int `json:"spare_thresh"`
Temperature int64 `json:"temperature"`
PowerOnHours int64 `json:"power_on_hours"`
PowerCycles int64 `json:"power_cycles"`
UnsafeShutdowns int64 `json:"unsafe_shutdowns"`
DataUnitsRead int64 `json:"data_units_read"`
DataUnitsWritten int64 `json:"data_units_written"`
ControllerBusy int64 `json:"controller_busy_time"`
MediaErrors int64 `json:"media_errors"`
NumErrLogEntries int64 `json:"num_err_log_entries"`
}
// nvmeIDCtrl is the subset of `nvme id-ctrl -o json` output.
type nvmeIDCtrl struct {
ModelNumber string `json:"mn"`
SerialNumber string `json:"sn"`
FirmwareRev string `json:"fr"`
TotalCapacity int64 `json:"tnvmcap"`
ModelNumber string `json:"mn"`
SerialNumber string `json:"sn"`
FirmwareRev string `json:"fr"`
TotalCapacity int64 `json:"tnvmcap"`
}
func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
present := true
devType := "NVMe"
iface := "NVMe"
status := "OK"
status := statusOK
s := schema.HardwareStorage{
Present: &present,
Type: &devType,
Interface: &iface,
Status: &status,
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Present: &present,
Type: &devType,
Interface: &iface,
Telemetry: map[string]any{"linux_device": "/dev/" + dev.Name},
}
devPath := "/dev/" + dev.Name
if v := cleanDMIValue(strings.TrimSpace(dev.Model)); v != "" {
s.Model = &v
}
if v := cleanDMIValue(strings.TrimSpace(dev.Serial)); v != "" {
s.SerialNumber = &v
}
if size := parseStorageBytes(dev.Size); size > 0 {
gb := int(size / 1_000_000_000)
if gb > 0 {
s.SizeGB = &gb
}
}
// id-ctrl: model, serial, firmware, capacity
if out, err := exec.Command("nvme", "id-ctrl", devPath, "-o", "json").Output(); err == nil {
@@ -237,30 +390,131 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
if out, err := exec.Command("nvme", "smart-log", devPath, "-o", "json").Output(); err == nil {
var log nvmeSmartLog
if json.Unmarshal(out, &log) == nil {
tel := map[string]any{}
if log.PowerOnHours > 0 {
tel["power_on_hours"] = log.PowerOnHours
s.PowerOnHours = &log.PowerOnHours
}
if log.PowerCycles > 0 {
tel["power_cycles"] = log.PowerCycles
s.PowerCycles = &log.PowerCycles
}
if log.UnsafeShutdowns > 0 {
tel["unsafe_shutdowns"] = log.UnsafeShutdowns
s.UnsafeShutdowns = &log.UnsafeShutdowns
}
if log.PercentageUsed > 0 {
tel["percentage_used"] = log.PercentageUsed
v := float64(log.PercentageUsed)
s.LifeUsedPct = &v
remaining := 100 - v
s.LifeRemainingPct = &remaining
}
if log.DataUnitsWritten > 0 {
tel["data_units_written"] = log.DataUnitsWritten
v := nvmeDataUnitsToBytes(log.DataUnitsWritten)
s.WrittenBytes = &v
}
if log.ControllerBusy > 0 {
tel["controller_busy_time"] = log.ControllerBusy
if log.DataUnitsRead > 0 {
v := nvmeDataUnitsToBytes(log.DataUnitsRead)
s.ReadBytes = &v
}
if len(tel) > 0 {
s.Telemetry = tel
if log.AvailableSpare > 0 {
v := float64(log.AvailableSpare)
s.AvailableSparePct = &v
}
if log.MediaErrors > 0 {
s.MediaErrors = &log.MediaErrors
}
if log.NumErrLogEntries > 0 {
s.ErrorLogEntries = &log.NumErrLogEntries
}
// nvme-cli reports the composite temperature in kelvins; convert to whole degrees Celsius.
if log.Temperature > 0 {
v := float64(log.Temperature - 273)
s.TemperatureC = &v
}
setStorageHealthStatus(&s, storageHealthStatus{
criticalWarning: log.CriticalWarning,
percentageUsed: int64(log.PercentageUsed),
availableSpare: int64(log.AvailableSpare),
spareThreshold: int64(log.SpareThreshold),
unsafeShutdowns: log.UnsafeShutdowns,
mediaErrors: log.MediaErrors,
errorLogEntries: log.NumErrLogEntries,
})
return s
}
}
status = statusUnknown
s.Status = &status
return s
}
func parseStorageBytes(raw string) int64 {
value, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
if err == nil && value > 0 {
return value
}
return 0
}
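// nvmeDataUnitsToBytes converts NVMe SMART data units to bytes.
// One data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
func nvmeDataUnitsToBytes(units int64) int64 {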
if units <= 0 {
return 0
}
return units * 512000
}
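The two NVMe unit conversions used in this file can be checked in isolation. This is a small sketch assuming the usual NVMe SMART conventions (data units of 1000 x 512 bytes, temperature in kelvins); `dataUnitsToBytes` and `kelvinToCelsius` are illustrative names, not functions from the repository.

```go
package main

import "fmt"

// dataUnitsToBytes assumes one SMART data unit is 1000 blocks of 512 bytes.
func dataUnitsToBytes(units int64) int64 {
	if units <= 0 {
		return 0
	}
	return units * 512_000
}

// kelvinToCelsius mirrors the integer offset used in the diff (273, not 273.15),
// so results are whole degrees.
func kelvinToCelsius(k int64) float64 {
	return float64(k - 273)
}

func main() {
	// 2,000,000 data units correspond to 1,024,000,000,000 bytes (about 1.02 TB).
	fmt.Println(dataUnitsToBytes(2_000_000))
	// 318 K corresponds to 45 degrees Celsius with the integer offset.
	fmt.Println(kelvinToCelsius(318))
}
```

On drives with very large write counters the multiplication stays well inside `int64` range: overflow would need more than about 1.8e13 data units.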
type storageHealthStatus struct {
hasOverall bool
overallPassed bool
reallocatedSectors int64
pendingSectors int64
offlineUncorrectable int64
lifeRemainingPct int64
criticalWarning int
percentageUsed int64
availableSpare int64
spareThreshold int64
unsafeShutdowns int64
mediaErrors int64
errorLogEntries int64
}
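// setStorageHealthStatus derives Status and ErrorDescription from the first matching
// rule below. Critical conditions are checked before warning conditions.
func setStorageHealthStatus(s *schema.HardwareStorage, health storageHealthStatus) {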
status := statusOK
var description *string
switch {
case health.hasOverall && !health.overallPassed:
status = statusCritical
description = stringPtr("SMART overall self-assessment failed")
case health.criticalWarning > 0:
status = statusCritical
description = stringPtr("NVMe critical warning is set")
case health.pendingSectors > 0 || health.offlineUncorrectable > 0:
status = statusCritical
description = stringPtr("Pending or offline uncorrectable sectors detected")
case health.mediaErrors > 0:
status = statusWarning
description = stringPtr("Media errors reported")
case health.reallocatedSectors > 0:
status = statusWarning
description = stringPtr("Reallocated sectors detected")
case health.errorLogEntries > 0:
status = statusWarning
description = stringPtr("Device error log contains entries")
case health.lifeRemainingPct > 0 && health.lifeRemainingPct <= 10:
status = statusWarning
description = stringPtr("Life remaining is low")
case health.percentageUsed >= 95:
status = statusWarning
description = stringPtr("Drive wear level is high")
case health.availableSpare > 0 && health.spareThreshold > 0 && health.availableSpare <= health.spareThreshold:
status = statusWarning
description = stringPtr("Available spare is at or below threshold")
case health.unsafeShutdowns > 100:
status = statusWarning
description = stringPtr("Unsafe shutdown count is high")
}
s.Status = &status
s.ErrorDescription = description
}
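Because the rules live in a single `switch`, only the first matching case applies, so the order of cases encodes severity. Below is a reduced sketch of that precedence with a hypothetical `classify` function and assumed status strings (the real `statusOK`, `statusWarning`, and `statusCritical` values are not shown in this diff).

```go
package main

import "fmt"

// Assumed status strings, for illustration only.
const (
	statusOK       = "OK"
	statusWarning  = "Warning"
	statusCritical = "Critical"
)

// health is a reduced, hypothetical version of storageHealthStatus.
type health struct {
	criticalWarning int
	pendingSectors  int64
	mediaErrors     int64
	unsafeShutdowns int64
}

// classify returns the status and reason of the first matching rule.
func classify(h health) (status, reason string) {
	switch {
	case h.criticalWarning > 0:
		return statusCritical, "NVMe critical warning is set"
	case h.pendingSectors > 0:
		return statusCritical, "pending sectors detected"
	case h.mediaErrors > 0:
		return statusWarning, "media errors reported"
	case h.unsafeShutdowns > 100:
		return statusWarning, "unsafe shutdown count is high"
	}
	return statusOK, ""
}

func main() {
	// A warning-level and a critical-level condition are both present;
	// the critical rule is listed first and therefore wins.
	status, reason := classify(health{criticalWarning: 1, mediaErrors: 7})
	fmt.Println(status, reason)
}
```

One consequence of this design is that only a single description is reported per drive, even when several conditions hold at once.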
func stringPtr(value string) *string {
return &value
}

@@ -0,0 +1,33 @@
package collector
import "testing"
func TestMergeStorageDevicePrefersNonEmptyFields(t *testing.T) {
t.Parallel()
got := mergeStorageDevice(
lsblkDevice{Name: "nvme0n1", Type: "disk", Tran: "nvme"},
lsblkDevice{Name: "nvme0n1", Type: "disk", Size: "1024", Serial: "SN123", Model: "Kioxia"},
)
if got.Serial != "SN123" {
t.Fatalf("serial=%q want SN123", got.Serial)
}
if got.Model != "Kioxia" {
t.Fatalf("model=%q want Kioxia", got.Model)
}
if got.Size != "1024" {
t.Fatalf("size=%q want 1024", got.Size)
}
}
func TestParseStorageBytes(t *testing.T) {
t.Parallel()
if got := parseStorageBytes(" 2048 "); got != 2048 {
t.Fatalf("parseStorageBytes=%d want 2048", got)
}
if got := parseStorageBytes("1.92 TB"); got != 0 {
t.Fatalf("parseStorageBytes invalid=%d want 0", got)
}
}

@@ -0,0 +1,63 @@
package collector
import (
"testing"
"bee/audit/internal/schema"
)
func TestSetStorageHealthStatus(t *testing.T) {
t.Parallel()
tests := []struct {
name string
health storageHealthStatus
want string
}{
{
name: "smart overall failed",
health: storageHealthStatus{hasOverall: true, overallPassed: false},
want: statusCritical,
},
{
name: "nvme critical warning",
health: storageHealthStatus{criticalWarning: 1},
want: statusCritical,
},
{
name: "pending sectors",
health: storageHealthStatus{pendingSectors: 1},
want: statusCritical,
},
{
name: "media errors warning",
health: storageHealthStatus{mediaErrors: 2},
want: statusWarning,
},
{
name: "reallocated warning",
health: storageHealthStatus{reallocatedSectors: 1},
want: statusWarning,
},
{
name: "life remaining low",
health: storageHealthStatus{lifeRemainingPct: 8},
want: statusWarning,
},
{
name: "healthy",
health: storageHealthStatus{},
want: statusOK,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
var disk schema.HardwareStorage
setStorageHealthStatus(&disk, tt.health)
if disk.Status == nil || *disk.Status != tt.want {
t.Fatalf("status=%v want %q", disk.Status, tt.want)
}
})
}
}

@@ -0,0 +1,114 @@
package collector
import (
"bee/audit/internal/schema"
"fmt"
"time"
)
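// BuildHealthSummary aggregates per-component statuses from snap into counters and
// message lists. The overall status is critical if any component failed, warning if
// any component warned, and OK otherwise.
func BuildHealthSummary(snap schema.HardwareSnapshot) *schema.HardwareHealthSummary {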
summary := &schema.HardwareHealthSummary{
Status: statusOK,
CollectedAt: time.Now().UTC().Format(time.RFC3339),
}
for _, dimm := range snap.Memory {
switch derefString(dimm.Status) {
case statusWarning:
summary.MemoryWarn++
summary.Warnings = append(summary.Warnings, formatMemorySummary(dimm))
case statusCritical:
summary.MemoryFail++
summary.Failures = append(summary.Failures, formatMemorySummary(dimm))
case statusEmpty:
summary.EmptyDIMMs++
}
}
for _, disk := range snap.Storage {
switch derefString(disk.Status) {
case statusWarning:
summary.StorageWarn++
summary.Warnings = append(summary.Warnings, formatStorageSummary(disk))
case statusCritical:
summary.StorageFail++
summary.Failures = append(summary.Failures, formatStorageSummary(disk))
}
}
for _, dev := range snap.PCIeDevices {
switch derefString(dev.Status) {
case statusWarning:
summary.PCIeWarn++
summary.Warnings = append(summary.Warnings, formatPCIeSummary(dev))
case statusCritical:
summary.PCIeFail++
summary.Failures = append(summary.Failures, formatPCIeSummary(dev))
}
}
for _, psu := range snap.PowerSupplies {
if psu.Present != nil && !*psu.Present {
summary.MissingPSUs++
}
switch derefString(psu.Status) {
case statusWarning:
summary.PSUWarn++
summary.Warnings = append(summary.Warnings, formatPSUSummary(psu))
case statusCritical:
summary.PSUFail++
summary.Failures = append(summary.Failures, formatPSUSummary(psu))
}
}
if len(summary.Failures) > 0 || summary.StorageFail > 0 || summary.PCIeFail > 0 || summary.PSUFail > 0 || summary.MemoryFail > 0 {
summary.Status = statusCritical
} else if len(summary.Warnings) > 0 || summary.StorageWarn > 0 || summary.PCIeWarn > 0 || summary.PSUWarn > 0 || summary.MemoryWarn > 0 {
summary.Status = statusWarning
}
if len(summary.Warnings) == 0 {
summary.Warnings = nil
}
if len(summary.Failures) == 0 {
summary.Failures = nil
}
return summary
}
func derefString(value *string) string {
if value == nil {
return ""
}
return *value
}
func preferredName(model, serial, slot *string) string {
switch {
case model != nil && *model != "":
return *model
case serial != nil && *serial != "":
return *serial
case slot != nil && *slot != "":
return *slot
default:
return "unknown"
}
}
func formatStorageSummary(disk schema.HardwareStorage) string {
return fmt.Sprintf("storage %s status=%s", preferredName(disk.Model, disk.SerialNumber, disk.Slot), derefString(disk.Status))
}
func formatPCIeSummary(dev schema.HardwarePCIeDevice) string {
return fmt.Sprintf("pcie %s status=%s", preferredName(dev.Model, dev.SerialNumber, dev.BDF), derefString(dev.Status))
}
func formatPSUSummary(psu schema.HardwarePowerSupply) string {
return fmt.Sprintf("psu %s status=%s", preferredName(psu.Model, psu.SerialNumber, psu.Slot), derefString(psu.Status))
}
func formatMemorySummary(dimm schema.HardwareMemory) string {
return fmt.Sprintf("memory %s status=%s", preferredName(dimm.PartNumber, dimm.SerialNumber, dimm.Slot), derefString(dimm.Status))
}
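The overall status computed by `BuildHealthSummary` amounts to "worst component status wins". One way to express that rule directly is sketched below; the status strings are assumptions (the real constants are defined elsewhere), and `severity` and `worstStatus` are hypothetical helpers.

```go
package main

import "fmt"

// Assumed status strings, for illustration only.
const (
	statusOK       = "OK"
	statusWarning  = "Warning"
	statusCritical = "Critical"
)

// severity ranks a status so statuses can be compared.
// Unknown or empty strings rank the same as OK.
func severity(status string) int {
	switch status {
	case statusCritical:
		return 2
	case statusWarning:
		return 1
	default:
		return 0
	}
}

// worstStatus returns the most severe status among the given components.
func worstStatus(statuses ...string) string {
	worst := statusOK
	for _, s := range statuses {
		if severity(s) > severity(worst) {
			worst = s
		}
	}
	return worst
}

func main() {
	fmt.Println(worstStatus(statusOK, statusWarning, statusOK))
	fmt.Println(worstStatus(statusWarning, statusCritical))
	fmt.Println(worstStatus()) // no components: stays at the OK default
}
```

The counter-based version in the diff additionally keeps per-category counts and human-readable messages, which a plain maximum would lose.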

@@ -31,7 +31,7 @@ md125 : active raid1 nvme2n1[0] nvme3n1[1]
func TestHasVROCController(t *testing.T) {
intel := vendorIntel
model := "Volume Management Device NVMe RAID Controller"
class := "RAID bus controller"
class := "MassStorageController"
tests := []struct {
name string
pcie []schema.HardwarePCIeDevice

@@ -0,0 +1,153 @@
package platform
import (
"fmt"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
)
var exportExecCommand = exec.Command
func formatMountTargetError(target RemovableTarget, raw string, err error) error {
msg := strings.TrimSpace(raw)
fstype := strings.ToLower(strings.TrimSpace(target.FSType))
if fstype == "exfat" && strings.Contains(strings.ToLower(msg), "unknown filesystem type 'exfat'") {
return fmt.Errorf("mount %s: exFAT support is missing in this ISO build: %w", target.Device, err)
}
if msg == "" {
return err
}
return fmt.Errorf("%s: %w", msg, err)
}
func removableTargetReadOnly(fields map[string]string) bool {
if fields["RO"] == "1" {
return true
}
switch strings.ToLower(strings.TrimSpace(fields["FSTYPE"])) {
case "iso9660", "squashfs":
return true
default:
return false
}
}
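// ensureWritableMountpoint verifies that the mounted filesystem accepts writes by
// creating and then removing a temporary probe file.
func ensureWritableMountpoint(mountpoint string) error {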
probe, err := os.CreateTemp(mountpoint, ".bee-write-test-*")
if err != nil {
return fmt.Errorf("target filesystem is not writable: %w", err)
}
name := probe.Name()
if closeErr := probe.Close(); closeErr != nil {
_ = os.Remove(name)
return closeErr
}
if err := os.Remove(name); err != nil {
return err
}
return nil
}
func (s *System) ListRemovableTargets() ([]RemovableTarget, error) {
raw, err := exportExecCommand("lsblk", "-P", "-o", "NAME,TYPE,PKNAME,RM,RO,FSTYPE,MOUNTPOINT,SIZE,LABEL,MODEL").Output()
if err != nil {
return nil, err
}
var out []RemovableTarget
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
if strings.TrimSpace(line) == "" {
continue
}
fields := parseLSBLKPairs(line)
deviceType := fields["TYPE"]
if deviceType == "rom" || deviceType == "loop" {
continue
}
removable := fields["RM"] == "1"
if !removable {
if parent := fields["PKNAME"]; parent != "" {
if data, err := os.ReadFile(filepath.Join("/sys/class/block", parent, "removable")); err == nil {
removable = strings.TrimSpace(string(data)) == "1"
}
}
}
if !removable || fields["FSTYPE"] == "" || removableTargetReadOnly(fields) {
continue
}
out = append(out, RemovableTarget{
Device: "/dev/" + fields["NAME"],
FSType: fields["FSTYPE"],
Size: fields["SIZE"],
Label: fields["LABEL"],
Model: fields["MODEL"],
Mountpoint: fields["MOUNTPOINT"],
})
}
sort.Slice(out, func(i, j int) bool { return out[i].Device < out[j].Device })
return out, nil
}
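// ExportFileToTarget copies src onto the removable target, mounting the device under
// /tmp when it is not already mounted. The target is synced and unmounted afterwards,
// even if it was already mounted before the call.
func (s *System) ExportFileToTarget(src string, target RemovableTarget) (dst string, retErr error) {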
if src == "" || target.Device == "" {
return "", fmt.Errorf("source and target are required")
}
if _, err := os.Stat(src); err != nil {
return "", err
}
mountpoint := strings.TrimSpace(target.Mountpoint)
mountedHere := false
mounted := mountpoint != ""
if mountpoint == "" {
mountpoint = filepath.Join("/tmp", "bee-export-"+filepath.Base(target.Device))
if err := os.MkdirAll(mountpoint, 0755); err != nil {
return "", err
}
if raw, err := exportExecCommand("mount", target.Device, mountpoint).CombinedOutput(); err != nil {
_ = os.Remove(mountpoint)
return "", formatMountTargetError(target, string(raw), err)
}
mountedHere = true
mounted = true
}
defer func() {
if !mounted {
return
}
_ = exportExecCommand("sync").Run()
if raw, err := exportExecCommand("umount", mountpoint).CombinedOutput(); err != nil && retErr == nil {
msg := strings.TrimSpace(string(raw))
if msg == "" {
retErr = err
} else {
retErr = fmt.Errorf("%s: %w", msg, err)
}
}
if mountedHere {
_ = os.Remove(mountpoint)
}
}()
if err := ensureWritableMountpoint(mountpoint); err != nil {
return "", err
}
filename := filepath.Base(src)
dst = filepath.Join(mountpoint, filename)
data, err := os.ReadFile(src)
if err != nil {
return "", err
}
if err := os.WriteFile(dst, data, 0644); err != nil {
return "", err
}
return dst, nil
}

@@ -0,0 +1,112 @@
package platform
import (
"os"
"os/exec"
"path/filepath"
"strings"
"testing"
)
func TestExportFileToTargetUnmountsExistingMountpoint(t *testing.T) {
tmp := t.TempDir()
src := filepath.Join(tmp, "bundle.tar.gz")
mountpoint := filepath.Join(tmp, "mnt")
if err := os.MkdirAll(mountpoint, 0755); err != nil {
t.Fatalf("mkdir mountpoint: %v", err)
}
if err := os.WriteFile(src, []byte("bundle"), 0644); err != nil {
t.Fatalf("write src: %v", err)
}
var calls [][]string
oldExec := exportExecCommand
exportExecCommand = func(name string, args ...string) *exec.Cmd {
calls = append(calls, append([]string{name}, args...))
return exec.Command("sh", "-c", "exit 0")
}
t.Cleanup(func() { exportExecCommand = oldExec })
s := &System{}
dst, err := s.ExportFileToTarget(src, RemovableTarget{
Device: "/dev/sdb1",
Mountpoint: mountpoint,
})
if err != nil {
t.Fatalf("ExportFileToTarget error: %v", err)
}
if got, want := dst, filepath.Join(mountpoint, "bundle.tar.gz"); got != want {
t.Fatalf("dst=%q want %q", got, want)
}
if _, err := os.Stat(filepath.Join(mountpoint, "bundle.tar.gz")); err != nil {
t.Fatalf("exported file missing: %v", err)
}
foundUmount := false
for _, call := range calls {
if len(call) == 2 && call[0] == "umount" && call[1] == mountpoint {
foundUmount = true
break
}
}
if !foundUmount {
t.Fatalf("expected umount %q call, got %#v", mountpoint, calls)
}
}
func TestExportFileToTargetRejectsNonWritableMountpoint(t *testing.T) {
tmp := t.TempDir()
src := filepath.Join(tmp, "bundle.tar.gz")
mountpoint := filepath.Join(tmp, "mnt")
if err := os.MkdirAll(mountpoint, 0755); err != nil {
t.Fatalf("mkdir mountpoint: %v", err)
}
if err := os.WriteFile(src, []byte("bundle"), 0644); err != nil {
t.Fatalf("write src: %v", err)
}
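if os.Geteuid() == 0 {
t.Skip("root bypasses directory permission bits; cannot simulate a read-only mountpoint")
}
if err := os.Chmod(mountpoint, 0555); err != nil {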
t.Fatalf("chmod mountpoint: %v", err)
}
oldExec := exportExecCommand
exportExecCommand = func(name string, args ...string) *exec.Cmd {
return exec.Command("sh", "-c", "exit 0")
}
t.Cleanup(func() { exportExecCommand = oldExec })
s := &System{}
_, err := s.ExportFileToTarget(src, RemovableTarget{
Device: "/dev/sdb1",
Mountpoint: mountpoint,
})
if err == nil {
t.Fatal("expected error for non-writable mountpoint")
}
if !strings.Contains(err.Error(), "target filesystem is not writable") {
t.Fatalf("err=%q want writable message", err)
}
}
func TestListRemovableTargetsSkipsReadOnlyMedia(t *testing.T) {
oldExec := exportExecCommand
lsblkOut := `NAME="sda1" TYPE="part" PKNAME="sda" RM="1" RO="1" FSTYPE="iso9660" MOUNTPOINT="/run/live/medium" SIZE="3.7G" LABEL="BEE" MODEL=""
NAME="sdb1" TYPE="part" PKNAME="sdb" RM="1" RO="0" FSTYPE="vfat" MOUNTPOINT="/media/bee/USB" SIZE="29.8G" LABEL="USB" MODEL=""`
exportExecCommand = func(name string, args ...string) *exec.Cmd {
cmd := exec.Command("sh", "-c", "printf '%s\n' \"$LSBLK_OUT\"")
cmd.Env = append(os.Environ(), "LSBLK_OUT="+lsblkOut)
return cmd
}
t.Cleanup(func() { exportExecCommand = oldExec })
s := &System{}
targets, err := s.ListRemovableTargets()
if err != nil {
t.Fatalf("ListRemovableTargets error: %v", err)
}
if len(targets) != 1 {
t.Fatalf("len(targets)=%d want 1 (%+v)", len(targets), targets)
}
if got := targets[0].Device; got != "/dev/sdb1" {
t.Fatalf("device=%q want /dev/sdb1", got)
}
}
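The fixture above uses the `KEY="value"` format produced by `lsblk -P`, which `ListRemovableTargets` hands to `parseLSBLKPairs`. That helper is not part of this diff, so the following is only a guess at how such a parser could look. `parsePairs` is a hypothetical name, and the sketch does not handle escaped quotes inside values.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePairs splits one `lsblk -P` style line into a key/value map.
// Values may contain spaces; escaped quotes are not handled.
func parsePairs(line string) map[string]string {
	fields := map[string]string{}
	for len(line) > 0 {
		line = strings.TrimLeft(line, " ")
		eq := strings.Index(line, `="`)
		if eq < 0 {
			break
		}
		key := line[:eq]
		rest := line[eq+2:]
		end := strings.IndexByte(rest, '"')
		if end < 0 {
			break
		}
		fields[key] = rest[:end]
		line = rest[end+1:]
	}
	return fields
}

func main() {
	// Illustrative line in the same shape as the test fixture.
	f := parsePairs(`NAME="sdb1" TYPE="part" RM="1" RO="0" FSTYPE="vfat" LABEL="MY USB" MODEL=""`)
	fmt.Printf("%q %q %q\n", f["NAME"], f["FSTYPE"], f["LABEL"])
}
```

Splitting on spaces alone would break labels and model names that contain spaces, which is why the sketch scans for the closing quote instead.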

@@ -0,0 +1,644 @@
package platform
import (
"bytes"
"fmt"
"math"
"os"
"os/exec"
"strconv"
"strings"
"time"
)
// GPUMetricRow is one telemetry sample from nvidia-smi during a stress test.
type GPUMetricRow struct {
ElapsedSec float64 `json:"elapsed_sec"`
GPUIndex int `json:"index"`
TempC float64 `json:"temp_c"`
UsagePct float64 `json:"usage_pct"`
MemUsagePct float64 `json:"mem_usage_pct"`
PowerW float64 `json:"power_w"`
ClockMHz float64 `json:"clock_mhz"`
}
// sampleGPUMetrics runs nvidia-smi once and returns current metrics for each GPU.
func sampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
args := []string{
"--query-gpu=index,temperature.gpu,utilization.gpu,utilization.memory,power.draw,clocks.current.graphics",
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
}
out, err := exec.Command("nvidia-smi", args...).Output()
if err != nil {
return nil, err
}
var rows []GPUMetricRow
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.Split(line, ", ")
if len(parts) < 6 {
continue
}
idx, _ := strconv.Atoi(strings.TrimSpace(parts[0]))
rows = append(rows, GPUMetricRow{
GPUIndex: idx,
TempC: parseGPUFloat(parts[1]),
UsagePct: parseGPUFloat(parts[2]),
MemUsagePct: parseGPUFloat(parts[3]),
PowerW: parseGPUFloat(parts[4]),
ClockMHz: parseGPUFloat(parts[5]),
})
}
return rows, nil
}
func parseGPUFloat(s string) float64 {
s = strings.TrimSpace(s)
if s == "N/A" || s == "[Not Supported]" || s == "" {
return 0
}
v, _ := strconv.ParseFloat(s, 64)
return v
}
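To make the row format concrete, here is a standalone sketch of parsing one `--format=csv,noheader,nounits` line. It is a variation on the code above rather than a copy: it splits on `","` and trims each field, and it rejects rows whose index is not an integer (the diff ignores that error and would record index 0). `gpuSample`, `parseMetric`, and `parseSample` are hypothetical names, and the sample rows are invented.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpuSample holds one parsed row; the fields mirror the six queried columns.
type gpuSample struct {
	Index    int
	TempC    float64
	UsagePct float64
	MemPct   float64
	PowerW   float64
	ClockMHz float64
}

// parseMetric returns 0 for placeholders such as "N/A" or "[Not Supported]"
// and for any other text that does not parse as a number.
func parseMetric(s string) float64 {
	v, err := strconv.ParseFloat(strings.TrimSpace(s), 64)
	if err != nil {
		return 0
	}
	return v
}

// parseSample parses one "index, temp, util, mem, power, clock" row.
func parseSample(line string) (gpuSample, bool) {
	parts := strings.Split(line, ",")
	if len(parts) < 6 {
		return gpuSample{}, false
	}
	idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
	if err != nil {
		return gpuSample{}, false
	}
	return gpuSample{
		Index:    idx,
		TempC:    parseMetric(parts[1]),
		UsagePct: parseMetric(parts[2]),
		MemPct:   parseMetric(parts[3]),
		PowerW:   parseMetric(parts[4]),
		ClockMHz: parseMetric(parts[5]),
	}, true
}

func main() {
	// Illustrative rows, not captured from a real GPU.
	for _, line := range []string{
		"0, 64, 98, 41, 287.35, 1755",
		"1, 58, [Not Supported], 12, N/A, 1410",
		"garbage",
	} {
		s, ok := parseSample(line)
		fmt.Printf("%+v ok=%v\n", s, ok)
	}
}
```

Mapping unsupported metrics to 0 keeps the charts simple, at the cost of making "not reported" indistinguishable from a true zero reading.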
// SampleGPUMetrics runs nvidia-smi once and returns current metrics for each GPU.
func SampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
return sampleGPUMetrics(gpuIndices)
}
// sampleAMDGPUMetrics queries rocm-smi for live GPU metrics.
func sampleAMDGPUMetrics() ([]GPUMetricRow, error) {
out, err := runROCmSMI("--showtemp", "--showuse", "--showpower", "--showmemuse", "--csv")
if err != nil {
return nil, err
}
lines := strings.Split(strings.TrimSpace(string(out)), "\n")
if len(lines) < 2 {
return nil, fmt.Errorf("rocm-smi: insufficient output")
}
// Parse header to find column indices by name.
headers := strings.Split(lines[0], ",")
colIdx := func(keywords ...string) int {
for i, h := range headers {
hl := strings.ToLower(strings.TrimSpace(h))
for _, kw := range keywords {
if strings.Contains(hl, kw) {
return i
}
}
}
return -1
}
idxTemp := colIdx("sensor edge", "temperature (c)", "temp")
idxUse := colIdx("gpu use (%)")
idxMem := colIdx("vram%", "memory allocated")
idxPow := colIdx("average graphics package power", "power (w)")
var rows []GPUMetricRow
for _, line := range lines[1:] {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.Split(line, ",")
idx := len(rows)
row := GPUMetricRow{GPUIndex: idx}
get := func(i int) float64 {
if i < 0 || i >= len(parts) {
return 0
}
v := strings.TrimSpace(parts[i])
if strings.EqualFold(v, "n/a") {
return 0
}
return parseGPUFloat(v)
}
row.TempC = get(idxTemp)
row.UsagePct = get(idxUse)
row.MemUsagePct = get(idxMem)
row.PowerW = get(idxPow)
rows = append(rows, row)
}
if len(rows) == 0 {
return nil, fmt.Errorf("rocm-smi: no GPU rows parsed")
}
return rows, nil
}
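The header lookup in `sampleAMDGPUMetrics` returns the first column whose lowercased name contains any of the keywords, so header order matters more than keyword order. A minimal sketch of that behaviour follows. `columnIndex` is a hypothetical standalone version of the `colIdx` closure, and the header row is illustrative only: real `rocm-smi --csv` headers vary between ROCm releases.

```go
package main

import (
	"fmt"
	"strings"
)

// columnIndex returns the index of the first header that contains any keyword
// (case-insensitive on the header side), or -1 when nothing matches.
// Keywords are expected to be lowercase.
func columnIndex(headers []string, keywords ...string) int {
	for i, h := range headers {
		hl := strings.ToLower(strings.TrimSpace(h))
		for _, kw := range keywords {
			if strings.Contains(hl, kw) {
				return i
			}
		}
	}
	return -1
}

func main() {
	// Illustrative header row, not captured from a real rocm-smi run.
	headers := strings.Split(
		"device,Temperature (Sensor edge) (C),Average Graphics Package Power (W),GPU use (%)", ",")

	fmt.Println(columnIndex(headers, "sensor edge", "temperature (c)", "temp"))
	fmt.Println(columnIndex(headers, "gpu use (%)"))
	fmt.Println(columnIndex(headers, "vram%", "memory allocated"))
}
```

Because a broad keyword such as `"temp"` matches any temperature column, a multi-sensor header would resolve to whichever temperature column appears first.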
// WriteGPUMetricsCSV writes collected rows as a CSV file.
func WriteGPUMetricsCSV(path string, rows []GPUMetricRow) error {
var b bytes.Buffer
b.WriteString("elapsed_sec,gpu_index,temperature_c,usage_pct,power_w,clock_mhz\n")
for _, r := range rows {
fmt.Fprintf(&b, "%.1f,%d,%.1f,%.1f,%.1f,%.0f\n",
r.ElapsedSec, r.GPUIndex, r.TempC, r.UsagePct, r.PowerW, r.ClockMHz)
}
return os.WriteFile(path, b.Bytes(), 0644)
}
// WriteGPUMetricsHTML writes a standalone HTML file with one SVG chart per GPU.
func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
// Group by GPU index preserving order.
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
var svgs strings.Builder
for _, gpuIdx := range order {
svgs.WriteString(drawGPUChartSVG(gpuMap[gpuIdx], gpuIdx))
svgs.WriteString("\n")
}
ts := time.Now().UTC().Format("2006-01-02 15:04:05 UTC")
html := fmt.Sprintf(`<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<title>GPU Stress Test Metrics</title>
<style>
body { font-family: sans-serif; background: #f0f0f0; margin: 0; padding: 20px; }
h1 { text-align: center; color: #333; margin: 0 0 8px; }
p { text-align: center; color: #888; font-size: 13px; margin: 0 0 24px; }
</style>
</head><body>
<h1>GPU Stress Test Metrics</h1>
<p>Generated %s</p>
%s
</body></html>`, ts, svgs.String())
return os.WriteFile(path, []byte(html), 0644)
}
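The grouping step at the top of `WriteGPUMetricsHTML` keeps charts in first-seen GPU order, which a plain map cannot guarantee on its own. A compact sketch of the same idea follows, using the group map itself as the "seen" check instead of a separate set. `sample` and `groupByGPU` are hypothetical names.

```go
package main

import "fmt"

// sample is a reduced, hypothetical version of GPUMetricRow.
type sample struct {
	GPU   int
	TempC float64
}

// groupByGPU buckets samples per GPU and records the order in which
// each GPU index first appears in the input.
func groupByGPU(rows []sample) (order []int, groups map[int][]sample) {
	groups = make(map[int][]sample)
	for _, r := range rows {
		if _, seen := groups[r.GPU]; !seen {
			order = append(order, r.GPU)
		}
		groups[r.GPU] = append(groups[r.GPU], r)
	}
	return order, groups
}

func main() {
	rows := []sample{{GPU: 1, TempC: 60}, {GPU: 0, TempC: 55}, {GPU: 1, TempC: 62}}
	order, groups := groupByGPU(rows)
	for _, gpu := range order {
		fmt.Println(gpu, len(groups[gpu]))
	}
}
```

Iterating `order` rather than the map keeps the generated HTML stable across runs with the same input.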
// drawGPUChartSVG generates a self-contained SVG chart for one GPU.
func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
// Layout
const W, H = 960, 520
const plotX1 = 120 // usage axis / chart left border
const plotX2 = 840 // power axis / chart right border
const plotY1 = 70 // top
const plotY2 = 465 // bottom (PH = 395)
const PW = plotX2 - plotX1
const PH = plotY2 - plotY1
// Outer axes
const tempAxisX = 60 // temp axis line
const clockAxisX = 900 // clock axis line
colors := [4]string{"#e74c3c", "#3498db", "#2ecc71", "#f39c12"}
seriesLabel := [4]string{
fmt.Sprintf("GPU %d Temp (°C)", gpuIdx),
fmt.Sprintf("GPU %d Usage (%%)", gpuIdx),
fmt.Sprintf("GPU %d Power (W)", gpuIdx),
fmt.Sprintf("GPU %d Clock (MHz)", gpuIdx),
}
axisLabel := [4]string{"Temperature (°C)", "GPU Usage (%)", "Power (W)", "Clock (MHz)"}
// Extract series
t := make([]float64, len(rows))
vals := [4][]float64{}
for i := range vals {
vals[i] = make([]float64, len(rows))
}
for i, r := range rows {
t[i] = r.ElapsedSec
vals[0][i] = r.TempC
vals[1][i] = r.UsagePct
vals[2][i] = r.PowerW
vals[3][i] = r.ClockMHz
}
tMin, tMax := gpuMinMax(t)
type axisScale struct {
ticks []float64
min, max float64
}
var axes [4]axisScale
for i := 0; i < 4; i++ {
mn, mx := gpuMinMax(vals[i])
tks := gpuNiceTicks(mn, mx, 8)
axes[i] = axisScale{ticks: tks, min: tks[0], max: tks[len(tks)-1]}
}
xv := func(tv float64) float64 {
if tMax == tMin {
return float64(plotX1)
}
return float64(plotX1) + (tv-tMin)/(tMax-tMin)*float64(PW)
}
yv := func(v float64, ai int) float64 {
a := axes[ai]
if a.max == a.min {
return float64(plotY1 + PH/2)
}
return float64(plotY2) - (v-a.min)/(a.max-a.min)*float64(PH)
}
var b strings.Builder
fmt.Fprintf(&b, `<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d"`+
` style="background:#fff;border-radius:8px;display:block;margin:0 auto 24px;`+
`box-shadow:0 2px 12px rgba(0,0,0,.12)">`+"\n", W, H)
// Title
fmt.Fprintf(&b, `<text x="%d" y="22" text-anchor="middle" font-family="sans-serif"`+
` font-size="14" font-weight="bold" fill="#333">GPU Stress Test Metrics — GPU %d</text>`+"\n",
plotX1+PW/2, gpuIdx)
// Horizontal grid (align to temp axis ticks)
b.WriteString(`<g stroke="#e0e0e0" stroke-width="0.5">` + "\n")
for _, tick := range axes[0].ticks {
y := yv(tick, 0)
if y < float64(plotY1) || y > float64(plotY2) {
continue
}
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"/>`+"\n",
plotX1, y, plotX2, y)
}
// Vertical grid
xTicks := gpuNiceTicks(tMin, tMax, 10)
for _, tv := range xTicks {
x := xv(tv)
if x < float64(plotX1) || x > float64(plotX2) {
continue
}
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d"/>`+"\n",
x, plotY1, x, plotY2)
}
b.WriteString("</g>\n")
// Chart border
fmt.Fprintf(&b, `<rect x="%d" y="%d" width="%d" height="%d"`+
` fill="none" stroke="#333" stroke-width="1"/>`+"\n",
plotX1, plotY1, PW, PH)
// X axis ticks and labels
b.WriteString(`<g font-family="sans-serif" font-size="11" fill="#333" text-anchor="middle">` + "\n")
for _, tv := range xTicks {
x := xv(tv)
if x < float64(plotX1) || x > float64(plotX2) {
continue
}
fmt.Fprintf(&b, `<text x="%.1f" y="%d">%s</text>`+"\n", x, plotY2+18, gpuFormatTick(tv))
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d" stroke="#333" stroke-width="1"/>`+"\n",
x, plotY2, x, plotY2+4)
}
b.WriteString("</g>\n")
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="13"`+
` fill="#333" text-anchor="middle">Time (seconds)</text>`+"\n",
plotX1+PW/2, plotY2+38)
// Y axes: [tempAxisX, plotX1, plotX2, clockAxisX]
axisLineX := [4]int{tempAxisX, plotX1, plotX2, clockAxisX}
axisRight := [4]bool{false, false, true, true}
// Label x positions (for rotated vertical text)
axisLabelX := [4]int{10, 68, 868, 950}
for i := 0; i < 4; i++ {
ax := axisLineX[i]
right := axisRight[i]
color := colors[i]
// Axis line
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
` stroke="%s" stroke-width="1"/>`+"\n",
ax, plotY1, ax, plotY2, color)
// Ticks and tick labels
fmt.Fprintf(&b, `<g font-family="sans-serif" font-size="10" fill="%s">`+"\n", color)
for _, tick := range axes[i].ticks {
y := yv(tick, i)
if y < float64(plotY1) || y > float64(plotY2) {
continue
}
dx := -5
textX := ax - 8
anchor := "end"
if right {
dx = 5
textX = ax + 8
anchor = "start"
}
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"`+
` stroke="%s" stroke-width="1"/>`+"\n",
ax, y, ax+dx, y, color)
fmt.Fprintf(&b, `<text x="%d" y="%.1f" text-anchor="%s" dy="4">%s</text>`+"\n",
textX, y, anchor, gpuFormatTick(tick))
}
b.WriteString("</g>\n")
// Axis label (rotated)
lx := axisLabelX[i]
fmt.Fprintf(&b, `<text transform="translate(%d,%d) rotate(-90)"`+
` font-family="sans-serif" font-size="12" fill="%s" text-anchor="middle">%s</text>`+"\n",
lx, plotY1+PH/2, color, axisLabel[i])
}
// Data lines
for i := 0; i < 4; i++ {
var pts strings.Builder
for j := range rows {
x := xv(t[j])
y := yv(vals[i][j], i)
if j == 0 {
fmt.Fprintf(&pts, "%.1f,%.1f", x, y)
} else {
fmt.Fprintf(&pts, " %.1f,%.1f", x, y)
}
}
fmt.Fprintf(&b, `<polyline points="%s" fill="none" stroke="%s" stroke-width="1.5"/>`+"\n",
pts.String(), colors[i])
}
// Legend
const legendY = 42
for i := 0; i < 4; i++ {
lx := plotX1 + i*(PW/4) + 10
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
` stroke="%s" stroke-width="2"/>`+"\n",
lx, legendY, lx+20, legendY, colors[i])
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="12" fill="#333">%s</text>`+"\n",
lx+25, legendY+4, seriesLabel[i])
}
b.WriteString("</svg>\n")
return b.String()
}
const (
ansiRed = "\033[31m"
ansiBlue = "\033[34m"
ansiGreen = "\033[32m"
ansiYellow = "\033[33m"
ansiReset = "\033[0m"
)
const (
termChartWidth = 70
termChartHeight = 12
)
// RenderGPUTerminalChart returns ANSI line charts (asciigraph-style) per GPU.
// Used in SAT stress-test logs.
func RenderGPUTerminalChart(rows []GPUMetricRow) string {
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
type seriesDef struct {
caption string
color string
fn func(GPUMetricRow) float64
}
defs := []seriesDef{
{"Temperature (°C)", ansiRed, func(r GPUMetricRow) float64 { return r.TempC }},
{"GPU Usage (%)", ansiBlue, func(r GPUMetricRow) float64 { return r.UsagePct }},
{"Power (W)", ansiGreen, func(r GPUMetricRow) float64 { return r.PowerW }},
{"Clock (MHz)", ansiYellow, func(r GPUMetricRow) float64 { return r.ClockMHz }},
}
var b strings.Builder
for _, gpuIdx := range order {
gr := gpuMap[gpuIdx]
if len(gr) == 0 {
continue
}
tMax := gr[len(gr)-1].ElapsedSec - gr[0].ElapsedSec
fmt.Fprintf(&b, "GPU %d — Stress Test Metrics (%.0f seconds)\n\n", gpuIdx, tMax)
for _, d := range defs {
b.WriteString(renderLineChart(extractGPUField(gr, d.fn), d.color, d.caption,
termChartHeight, termChartWidth))
b.WriteRune('\n')
}
}
return strings.TrimRight(b.String(), "\n")
}
// renderLineChart draws a single time-series line chart using box-drawing characters.
// Produces output in the style of asciigraph: ╭─╮ │ ╰─╯ with a Y axis and caption.
func renderLineChart(vals []float64, color, caption string, height, width int) string {
if len(vals) == 0 {
return caption + "\n"
}
mn, mx := gpuMinMax(vals)
if mn == mx {
mx = mn + 1
}
// Use the smaller of width or len(vals) to avoid stretching sparse data.
w := width
if len(vals) < w {
w = len(vals)
}
data := gpuDownsample(vals, w)
// row[i] = display row index: 0 = top = max value, height = bottom = min value.
row := make([]int, w)
for i, v := range data {
r := int(math.Round((mx - v) / (mx - mn) * float64(height)))
if r < 0 {
r = 0
}
if r > height {
r = height
}
row[i] = r
}
// Fill the character grid.
grid := make([][]rune, height+1)
for i := range grid {
grid[i] = make([]rune, w)
for j := range grid[i] {
grid[i][j] = ' '
}
}
for x := 0; x < w; x++ {
r := row[x]
if x == 0 {
grid[r][0] = '─'
continue
}
p := row[x-1]
switch {
case r == p:
grid[r][x] = '─'
case r < p: // value went up (row index decreased toward top)
grid[r][x] = '╭'
grid[p][x] = '╯'
for y := r + 1; y < p; y++ {
grid[y][x] = '│'
}
default: // r > p, value went down
grid[p][x] = '╮'
grid[r][x] = '╰'
for y := p + 1; y < r; y++ {
grid[y][x] = '│'
}
}
}
// Y axis tick labels.
ticks := gpuNiceTicks(mn, mx, height/2)
tickAtRow := make(map[int]string)
labelWidth := 4
for _, t := range ticks {
r := int(math.Round((mx - t) / (mx - mn) * float64(height)))
if r < 0 || r > height {
continue
}
s := gpuFormatTick(t)
tickAtRow[r] = s
if len(s) > labelWidth {
labelWidth = len(s)
}
}
var b strings.Builder
for r := 0; r <= height; r++ {
label := tickAtRow[r]
fmt.Fprintf(&b, "%*s", labelWidth, label)
switch {
case label != "":
b.WriteRune('┤')
case r == height:
b.WriteRune('┼')
default:
b.WriteRune('│')
}
b.WriteString(color)
b.WriteString(string(grid[r]))
b.WriteString(ansiReset)
b.WriteRune('\n')
}
// Bottom axis.
b.WriteString(strings.Repeat(" ", labelWidth))
b.WriteRune('└')
b.WriteString(strings.Repeat("─", w))
b.WriteRune('\n')
// Caption centered under the chart.
if caption != "" {
total := labelWidth + 1 + w
if pad := (total - len([]rune(caption))) / 2; pad > 0 { // count runes, not bytes: captions contain non-ASCII such as "°"
b.WriteString(strings.Repeat(" ", pad))
}
b.WriteString(caption)
b.WriteRune('\n')
}
return b.String()
}
func extractGPUField(rows []GPUMetricRow, fn func(GPUMetricRow) float64) []float64 {
v := make([]float64, len(rows))
for i, r := range rows {
v[i] = fn(r)
}
return v
}
// gpuDownsample averages vals into w buckets (or nearest-neighbor upsamples if len(vals) < w).
func gpuDownsample(vals []float64, w int) []float64 {
n := len(vals)
if n == 0 {
return make([]float64, w)
}
result := make([]float64, w)
if n >= w {
counts := make([]int, w)
for i, v := range vals {
bucket := i * w / n
if bucket >= w {
bucket = w - 1
}
result[bucket] += v
counts[bucket]++
}
for i := range result {
if counts[i] > 0 {
result[i] /= float64(counts[i])
}
}
} else {
// Nearest-neighbour upsample.
for i := range result {
src := i * (n - 1) / (w - 1)
if src >= n {
src = n - 1
}
result[i] = vals[src]
}
}
return result
}
func gpuMinMax(vals []float64) (float64, float64) {
if len(vals) == 0 {
return 0, 1
}
mn, mx := vals[0], vals[0]
for _, v := range vals[1:] {
if v < mn {
mn = v
}
if v > mx {
mx = v
}
}
return mn, mx
}
func gpuNiceTicks(mn, mx float64, targetCount int) []float64 {
if mn == mx {
mn -= 1
mx += 1
}
r := mx - mn
step := math.Pow(10, math.Floor(math.Log10(r/float64(targetCount))))
for _, f := range []float64{1, 2, 5, 10} {
if r/(f*step) <= float64(targetCount)*1.5 {
step = f * step
break
}
}
lo := math.Floor(mn/step) * step
hi := math.Ceil(mx/step) * step
var ticks []float64
for v := lo; v <= hi+step*0.001; v += step {
ticks = append(ticks, math.Round(v*1e9)/1e9)
}
return ticks
}
func gpuFormatTick(v float64) string {
if v == math.Trunc(v) {
return strconv.Itoa(int(v))
}
return strconv.FormatFloat(v, 'f', 1, 64)
}

@@ -0,0 +1,214 @@
package platform
import (
"context"
"fmt"
"os"
"os/exec"
"strconv"
"strings"
)
// InstallDisk describes a candidate disk for installation.
type InstallDisk struct {
Device string // e.g. /dev/sda
Model string
Size string // human-readable, e.g. "500G"
SizeBytes int64 // raw byte count from lsblk
MountedParts []string // partition mount points currently active
}
const squashfsPath = "/run/live/medium/live/filesystem.squashfs"
// ListInstallDisks returns block devices suitable for installation.
// Excludes the current live boot medium but includes USB drives.
func (s *System) ListInstallDisks() ([]InstallDisk, error) {
out, err := exec.Command("lsblk", "-dn", "-o", "NAME,MODEL,SIZE,TYPE,TRAN").Output()
if err != nil {
return nil, fmt.Errorf("lsblk: %w", err)
}
bootDev := findLiveBootDevice()
var disks []InstallDisk
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
fields := strings.Fields(line)
// Columns are NAME MODEL SIZE TYPE TRAN. MODEL may contain spaces, so parse from the end.
// TRAN can be empty (lsblk prints nothing for it), in which case TYPE is the last field.
if len(fields) < 3 {
continue
}
typIdx := len(fields) - 2
if fields[len(fields)-1] == "disk" {
typIdx = len(fields) - 1
}
if typIdx < 2 {
continue
}
typ := fields[typIdx]
size := fields[typIdx-1]
name := fields[0]
model := strings.Join(fields[1:typIdx-1], " ")
if typ != "disk" {
continue
}
device := "/dev/" + name
if device == bootDev {
continue
}
sizeBytes := diskSizeBytes(device)
mounted := mountedParts(device)
disks = append(disks, InstallDisk{
Device: device,
Model: strings.TrimSpace(model),
Size: size,
SizeBytes: sizeBytes,
MountedParts: mounted,
})
}
return disks, nil
}
// diskSizeBytes returns the byte size of a block device using lsblk.
func diskSizeBytes(device string) int64 {
out, err := exec.Command("lsblk", "-bdn", "-o", "SIZE", device).Output()
if err != nil {
return 0
}
n, _ := strconv.ParseInt(strings.TrimSpace(string(out)), 10, 64)
return n
}
// mountedParts returns a list of "<part> at <mountpoint>" strings for any
// mounted partitions on the given device.
func mountedParts(device string) []string {
out, err := exec.Command("lsblk", "-n", "-o", "NAME,MOUNTPOINT", device).Output()
if err != nil {
return nil
}
var result []string
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
fields := strings.Fields(line)
if len(fields) < 2 {
continue
}
mp := fields[1]
if mp == "" || mp == "[SWAP]" {
continue
}
result = append(result, "/dev/"+strings.TrimLeft(fields[0], "└─├─")+" at "+mp)
}
return result
}
// findLiveBootDevice returns the block device backing /run/live/medium (if any).
func findLiveBootDevice() string {
out, err := exec.Command("findmnt", "-n", "-o", "SOURCE", "/run/live/medium").Output()
if err != nil {
return ""
}
src := strings.TrimSpace(string(out))
if src == "" {
return ""
}
// Strip partition suffix to get the whole disk device.
// e.g. /dev/sdb1 → /dev/sdb, /dev/nvme0n1p1 → /dev/nvme0n1
out2, err := exec.Command("lsblk", "-no", "PKNAME", src).Output()
if err != nil || strings.TrimSpace(string(out2)) == "" {
return src
}
return "/dev/" + strings.TrimSpace(string(out2))
}
// MinInstallBytes returns the minimum recommended disk size for installation:
// squashfs size × 1.5 to allow for extracted filesystem and bootloader.
// Returns 0 if the squashfs is not available (non-live environment).
func MinInstallBytes() int64 {
fi, err := os.Stat(squashfsPath)
if err != nil {
return 0
}
return fi.Size() * 3 / 2
}
// toramActive returns true when the live system was booted with toram.
func toramActive() bool {
data, err := os.ReadFile("/proc/cmdline")
if err != nil {
return false
}
return strings.Contains(string(data), "toram")
}
// freeMemBytes returns MemAvailable from /proc/meminfo.
func freeMemBytes() int64 {
data, err := os.ReadFile("/proc/meminfo")
if err != nil {
return 0
}
for _, line := range strings.Split(string(data), "\n") {
if strings.HasPrefix(line, "MemAvailable:") {
fields := strings.Fields(line)
if len(fields) >= 2 {
n, _ := strconv.ParseInt(fields[1], 10, 64)
return n * 1024 // kB → bytes
}
}
}
return 0
}
// DiskWarnings returns advisory warning strings for a disk candidate.
func DiskWarnings(d InstallDisk) []string {
var w []string
if len(d.MountedParts) > 0 {
w = append(w, "has mounted partitions: "+strings.Join(d.MountedParts, ", "))
}
minBytes := MinInstallBytes() // avoid shadowing the min builtin (Go 1.21+)
if minBytes > 0 && d.SizeBytes > 0 && d.SizeBytes < minBytes {
w = append(w, fmt.Sprintf("disk may be too small (need ≥ %s, have %s)",
humanBytes(minBytes), humanBytes(d.SizeBytes)))
}
if toramActive() {
sqFi, err := os.Stat(squashfsPath)
if err == nil {
free := freeMemBytes()
if free > 0 && free < sqFi.Size()*2 {
w = append(w, "toram mode — low RAM, extraction may be slow or fail")
}
}
}
return w
}
func humanBytes(b int64) string {
const unit = 1024
if b < unit {
return fmt.Sprintf("%d B", b)
}
div, exp := int64(unit), 0
for n := b / unit; n >= unit; n /= unit {
div *= unit
exp++
}
return fmt.Sprintf("%.1f %cB", float64(b)/float64(div), "KMGTPE"[exp])
}
// InstallToDisk runs bee-install <device> <logfile> and streams output to logFile.
// The context can be used to cancel.
func (s *System) InstallToDisk(ctx context.Context, device string, logFile string) error {
cmd := exec.CommandContext(ctx, "bee-install", device, logFile)
return cmd.Run()
}
// InstallLogPath returns the default install log path for a given device.
func InstallLogPath(device string) string {
safe := strings.NewReplacer("/", "_", " ", "_").Replace(device)
return "/tmp/bee-install" + safe + ".log"
}
// Label returns a display label for a disk.
func (d InstallDisk) Label() string {
model := d.Model
if model == "" {
model = "Unknown"
}
return fmt.Sprintf("%s %s %s", d.Device, d.Size, model)
}

@@ -0,0 +1,191 @@
package platform
import (
"context"
"encoding/json"
"fmt"
"io"
"os"
"os/exec"
"path/filepath"
"strings"
)
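// IsLiveMediaInRAM reports whether /run/live/medium is backed by tmpfs.
// If findmnt fails, it falls back to checking the kernel command line for toram.
func (s *System) IsLiveMediaInRAM() bool {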
out, err := exec.Command("findmnt", "-n", "-o", "FSTYPE", "/run/live/medium").Output()
if err != nil {
return toramActive()
}
return strings.TrimSpace(string(out)) == "tmpfs"
}
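// RunInstallToRAM copies the live medium into /dev/shm, re-points the squashfs
// loop devices at the RAM copies, and bind-mounts the copy over /run/live/medium.
// Progress messages are passed to logFunc when it is non-nil.
func (s *System) RunInstallToRAM(ctx context.Context, logFunc func(string)) error {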
log := func(msg string) {
if logFunc != nil {
logFunc(msg)
}
}
if s.IsLiveMediaInRAM() {
log("Already running from RAM — installation media can be safely disconnected.")
return nil
}
squashfsFiles, err := filepath.Glob("/run/live/medium/live/*.squashfs")
if err != nil || len(squashfsFiles) == 0 {
return fmt.Errorf("no squashfs files found in /run/live/medium/live/")
}
free := freeMemBytes()
var needed int64
for _, sf := range squashfsFiles {
fi, err2 := os.Stat(sf)
if err2 != nil {
return fmt.Errorf("stat %s: %w", sf, err2)
}
needed += fi.Size()
}
const headroom = 256 * 1024 * 1024
if free > 0 && needed+headroom > free {
return fmt.Errorf("insufficient RAM: need %s, available %s",
humanBytes(needed+headroom), humanBytes(free))
}
dstDir := "/dev/shm/bee-live"
if err := os.MkdirAll(dstDir, 0755); err != nil {
return fmt.Errorf("create tmpfs dir: %w", err)
}
for _, sf := range squashfsFiles {
if err := ctx.Err(); err != nil {
return err
}
base := filepath.Base(sf)
dst := filepath.Join(dstDir, base)
log(fmt.Sprintf("Copying %s to RAM...", base))
if err := copyFileLarge(ctx, sf, dst, log); err != nil {
return fmt.Errorf("copy %s: %w", base, err)
}
log(fmt.Sprintf("Copied %s.", base))
loopDev, err := findLoopForFile(sf)
if err != nil {
log(fmt.Sprintf("Loop device for %s not found (%v) — skipping re-association.", base, err))
continue
}
if err := reassociateLoopDevice(loopDev, dst); err != nil {
log(fmt.Sprintf("Warning: could not re-associate %s → %s: %v", loopDev, dst, err))
} else {
log(fmt.Sprintf("Loop device %s now backed by RAM copy.", loopDev))
}
}
log("Copying remaining medium files...")
if err := cpDir(ctx, "/run/live/medium", dstDir, log); err != nil {
log(fmt.Sprintf("Warning: partial copy: %v", err))
}
if err := ctx.Err(); err != nil {
return err
}
if err := exec.Command("mount", "--bind", dstDir, "/run/live/medium").Run(); err != nil {
log(fmt.Sprintf("Warning: rebind /run/live/medium failed: %v", err))
}
log("Done. Installation media can be safely disconnected.")
return nil
}
func copyFileLarge(ctx context.Context, src, dst string, logFunc func(string)) error {
in, err := os.Open(src)
if err != nil {
return err
}
defer in.Close()
fi, err := in.Stat()
if err != nil {
return err
}
out, err := os.Create(dst)
if err != nil {
return err
}
defer out.Close()
total := fi.Size()
var copied int64
buf := make([]byte, 4*1024*1024)
for {
if err := ctx.Err(); err != nil {
return err
}
n, err := in.Read(buf)
if n > 0 {
if _, werr := out.Write(buf[:n]); werr != nil {
return werr
}
copied += int64(n)
if logFunc != nil && total > 0 {
pct := int(float64(copied) / float64(total) * 100)
logFunc(fmt.Sprintf(" %s / %s (%d%%)", humanBytes(copied), humanBytes(total), pct))
}
}
if err == io.EOF {
break
}
if err != nil {
return err
}
}
return out.Sync()
}
func cpDir(ctx context.Context, src, dst string, logFunc func(string)) error {
return filepath.Walk(src, func(path string, fi os.FileInfo, err error) error {
if ctx.Err() != nil {
return ctx.Err()
}
if err != nil {
return nil
}
rel, _ := filepath.Rel(src, path)
target := filepath.Join(dst, rel)
if fi.IsDir() {
return os.MkdirAll(target, fi.Mode())
}
if strings.HasSuffix(path, ".squashfs") {
return nil
}
if _, err := os.Stat(target); err == nil {
return nil
}
return copyFileLarge(ctx, path, target, nil)
})
}
func findLoopForFile(backingFile string) (string, error) {
out, err := exec.Command("losetup", "--list", "--json").Output()
if err != nil {
return "", err
}
var result struct {
Loopdevices []struct {
Name string `json:"name"`
BackFile string `json:"back-file"`
} `json:"loopdevices"`
}
if err := json.Unmarshal(out, &result); err != nil {
return "", err
}
for _, dev := range result.Loopdevices {
if dev.BackFile == backingFile {
return dev.Name, nil
}
}
return "", fmt.Errorf("no loop device found for %s", backingFile)
}
func reassociateLoopDevice(loopDev, newFile string) error {
if err := exec.Command("losetup", "--replace", loopDev, newFile).Run(); err == nil {
return nil
}
return loopChangeFD(loopDev, newFile)
}

@@ -0,0 +1,28 @@
//go:build linux
package platform
import (
"os"
"syscall"
)
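// ioctlLoopChangeFD is LOOP_CHANGE_FD from <linux/loop.h>.
// (0x4C08 is LOOP_SET_DIRECT_IO, which would not swap the backing file.)
const ioctlLoopChangeFD = 0x4C06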
func loopChangeFD(loopDev, newFile string) error {
lf, err := os.OpenFile(loopDev, os.O_RDWR, 0)
if err != nil {
return err
}
defer lf.Close()
nf, err := os.OpenFile(newFile, os.O_RDONLY, 0)
if err != nil {
return err
}
defer nf.Close()
_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, lf.Fd(), ioctlLoopChangeFD, nf.Fd())
if errno != 0 {
return errno
}
return nil
}

@@ -0,0 +1,9 @@
//go:build !linux
package platform
import "errors"
func loopChangeFD(loopDev, newFile string) error {
return errors.New("LOOP_CHANGE_FD not available on this platform")
}

@@ -0,0 +1,326 @@
package platform
import (
"bufio"
"encoding/json"
"os"
"os/exec"
"sort"
"strconv"
"strings"
"time"
)
// LiveMetricSample is a single point-in-time snapshot of server metrics
// collected for the web UI metrics page.
type LiveMetricSample struct {
Timestamp time.Time `json:"ts"`
Fans []FanReading `json:"fans"`
Temps []TempReading `json:"temps"`
PowerW float64 `json:"power_w"`
CPULoadPct float64 `json:"cpu_load_pct"`
MemLoadPct float64 `json:"mem_load_pct"`
GPUs []GPUMetricRow `json:"gpus"`
}
// TempReading is a named temperature sensor value.
type TempReading struct {
Name string `json:"name"`
Group string `json:"group,omitempty"`
Celsius float64 `json:"celsius"`
}
// SampleLiveMetrics collects a single metrics snapshot from all available
// sources: GPUs (NVIDIA first, then AMD), fans and temperatures (via
// sensors/ipmitool), system power (via ipmitool dcmi), and CPU and memory
// load (via /proc). Missing sources are silently skipped.
func SampleLiveMetrics() LiveMetricSample {
s := LiveMetricSample{Timestamp: time.Now().UTC()}
// GPU metrics — try NVIDIA first, fall back to AMD
if gpus, err := SampleGPUMetrics(nil); err == nil && len(gpus) > 0 {
s.GPUs = gpus
} else if amdGPUs, err := sampleAMDGPUMetrics(); err == nil && len(amdGPUs) > 0 {
s.GPUs = amdGPUs
}
// Fan speeds — skipped silently if ipmitool unavailable
fans, _ := sampleFanSpeeds()
s.Fans = fans
s.Temps = append(s.Temps, sampleLiveTemperatureReadings()...)
if !hasTempGroup(s.Temps, "cpu") {
if cpuTemp := sampleCPUMaxTemp(); cpuTemp > 0 {
s.Temps = append(s.Temps, TempReading{Name: "CPU Max", Group: "cpu", Celsius: cpuTemp})
}
}
// System power — returns 0 if unavailable
s.PowerW = sampleSystemPower()
// CPU load — from /proc/stat
s.CPULoadPct = sampleCPULoadPct()
// Memory load — from /proc/meminfo
s.MemLoadPct = sampleMemLoadPct()
return s
}
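// cpuStatPrev holds the [total, idle] counters from the previous call.
// It is not guarded by a lock, so sampleCPULoadPct must not be called concurrently.
var cpuStatPrev [2]uint64
// sampleCPULoadPct returns the overall CPU utilisation percentage since the
// previous call, computed from the delta between two /proc/stat readings.
// The first call only records a baseline and returns 0.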
func sampleCPULoadPct() float64 {
total, idle := readCPUStat()
if total == 0 {
return 0
}
prevTotal, prevIdle := cpuStatPrev[0], cpuStatPrev[1]
cpuStatPrev = [2]uint64{total, idle}
if prevTotal == 0 {
return 0
}
dt := float64(total - prevTotal)
di := float64(idle - prevIdle)
if dt <= 0 {
return 0
}
pct := (1 - di/dt) * 100
if pct < 0 {
return 0
}
if pct > 100 {
return 100
}
return pct
}
func readCPUStat() (total, idle uint64) {
f, err := os.Open("/proc/stat")
if err != nil {
return 0, 0
}
defer f.Close()
sc := bufio.NewScanner(f)
for sc.Scan() {
line := sc.Text()
if !strings.HasPrefix(line, "cpu ") {
continue
}
fields := strings.Fields(line)[1:] // skip "cpu"
var vals [10]uint64
for i := 0; i < len(fields) && i < 10; i++ {
vals[i], _ = strconv.ParseUint(fields[i], 10, 64)
}
// idle = idle + iowait
idle = vals[3] + vals[4]
// Sum user..steal only: guest and guest_nice are already included in user and nice.
for _, v := range vals[:8] {
total += v
}
return total, idle
}
return 0, 0
}
func sampleMemLoadPct() float64 {
f, err := os.Open("/proc/meminfo")
if err != nil {
return 0
}
defer f.Close()
vals := map[string]uint64{}
sc := bufio.NewScanner(f)
for sc.Scan() {
fields := strings.Fields(sc.Text())
if len(fields) >= 2 {
v, _ := strconv.ParseUint(fields[1], 10, 64)
vals[strings.TrimSuffix(fields[0], ":")] = v
}
}
total := vals["MemTotal"]
avail := vals["MemAvailable"]
if total == 0 {
return 0
}
used := total - avail
return float64(used) / float64(total) * 100
}
func hasTempGroup(temps []TempReading, group string) bool {
for _, t := range temps {
if t.Group == group {
return true
}
}
return false
}
func sampleLiveTemperatureReadings() []TempReading {
if temps := sampleLiveTempsViaSensorsJSON(); len(temps) > 0 {
return temps
}
return sampleLiveTempsViaIPMI()
}
func sampleLiveTempsViaSensorsJSON() []TempReading {
out, err := exec.Command("sensors", "-j").Output()
if err != nil || len(out) == 0 {
return nil
}
var doc map[string]map[string]any
if err := json.Unmarshal(out, &doc); err != nil {
return nil
}
chips := make([]string, 0, len(doc))
for chip := range doc {
chips = append(chips, chip)
}
sort.Strings(chips)
temps := make([]TempReading, 0, len(chips))
seen := map[string]struct{}{}
for _, chip := range chips {
features := doc[chip]
featureNames := make([]string, 0, len(features))
for name := range features {
featureNames = append(featureNames, name)
}
sort.Strings(featureNames)
for _, name := range featureNames {
if strings.EqualFold(name, "Adapter") {
continue
}
feature, ok := features[name].(map[string]any)
if !ok {
continue
}
value, ok := firstTempInputValue(feature)
if !ok || value <= 0 || value > 150 {
continue
}
group := classifyLiveTempGroup(chip, name)
if group == "gpu" {
continue
}
label := strings.TrimSpace(name)
if label == "" {
continue
}
if group == "ambient" {
label = compactAmbientTempName(chip, label)
}
key := group + "\x00" + label
if _, ok := seen[key]; ok {
continue
}
seen[key] = struct{}{}
temps = append(temps, TempReading{Name: label, Group: group, Celsius: value})
}
}
return temps
}
func sampleLiveTempsViaIPMI() []TempReading {
out, err := exec.Command("ipmitool", "sdr", "type", "Temperature").Output()
if err != nil || len(out) == 0 {
return nil
}
var temps []TempReading
seen := map[string]struct{}{}
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
parts := strings.Split(line, "|")
if len(parts) < 3 {
continue
}
name := strings.TrimSpace(parts[0])
if name == "" {
continue
}
unit := strings.ToLower(strings.TrimSpace(parts[2]))
if !strings.Contains(unit, "degrees") {
continue
}
raw := strings.TrimSpace(parts[1])
if raw == "" || strings.EqualFold(raw, "na") {
continue
}
value, err := strconv.ParseFloat(raw, 64)
if err != nil || value <= 0 || value > 150 {
continue
}
group := classifyLiveTempGroup("", name)
if group == "gpu" {
continue
}
label := name
if group == "ambient" {
label = compactAmbientTempName("", label)
}
key := group + "\x00" + label
if _, ok := seen[key]; ok {
continue
}
seen[key] = struct{}{}
temps = append(temps, TempReading{Name: label, Group: group, Celsius: value})
}
return temps
}
func firstTempInputValue(feature map[string]any) (float64, bool) {
keys := make([]string, 0, len(feature))
for key := range feature {
keys = append(keys, key)
}
sort.Strings(keys)
for _, key := range keys {
lower := strings.ToLower(key)
if !strings.Contains(lower, "temp") || !strings.HasSuffix(lower, "_input") {
continue
}
switch value := feature[key].(type) {
case float64:
return value, true
case string:
f, err := strconv.ParseFloat(value, 64)
if err == nil {
return f, true
}
}
}
return 0, false
}
func classifyLiveTempGroup(chip, name string) string {
text := strings.ToLower(strings.TrimSpace(chip + " " + name))
switch {
case strings.Contains(text, "gpu"), strings.Contains(text, "amdgpu"), strings.Contains(text, "nvidia"), strings.Contains(text, "radeon"):
return "gpu"
case strings.Contains(text, "coretemp"),
strings.Contains(text, "k10temp"),
strings.Contains(text, "zenpower"),
strings.Contains(text, "package id"),
strings.Contains(text, "x86_pkg_temp"),
strings.Contains(text, "tctl"),
strings.Contains(text, "tdie"),
strings.Contains(text, "tccd"),
strings.Contains(text, "cpu"),
strings.Contains(text, "peci"):
return "cpu"
default:
return "ambient"
}
}
func compactAmbientTempName(chip, name string) string {
chip = strings.TrimSpace(chip)
name = strings.TrimSpace(name)
if chip == "" || strings.EqualFold(chip, name) {
return name
}
if strings.Contains(strings.ToLower(name), strings.ToLower(chip)) {
return name
}
return chip + " / " + name
}

@@ -0,0 +1,44 @@
package platform
import "testing"
func TestFirstTempInputValue(t *testing.T) {
feature := map[string]any{
"temp1_input": 61.5,
"temp1_max": 80.0,
}
got, ok := firstTempInputValue(feature)
if !ok {
t.Fatal("expected value")
}
if got != 61.5 {
t.Fatalf("got %v want 61.5", got)
}
}
func TestClassifyLiveTempGroup(t *testing.T) {
tests := []struct {
chip string
name string
want string
}{
{chip: "coretemp-isa-0000", name: "Package id 0", want: "cpu"},
{chip: "amdgpu-pci-4300", name: "edge", want: "gpu"},
{chip: "nvme-pci-0100", name: "Composite", want: "ambient"},
{chip: "acpitz-acpi-0", name: "temp1", want: "ambient"},
}
for _, tc := range tests {
if got := classifyLiveTempGroup(tc.chip, tc.name); got != tc.want {
t.Fatalf("classifyLiveTempGroup(%q,%q)=%q want %q", tc.chip, tc.name, got, tc.want)
}
}
}
func TestCompactAmbientTempName(t *testing.T) {
if got := compactAmbientTempName("nvme-pci-0100", "Composite"); got != "nvme-pci-0100 / Composite" {
t.Fatalf("got %q", got)
}
if got := compactAmbientTempName("", "Inlet Temp"); got != "Inlet Temp" {
t.Fatalf("got %q", got)
}
}

@@ -0,0 +1,325 @@
package platform
import (
"bytes"
"errors"
"fmt"
"os"
"os/exec"
"sort"
"strings"
)
func (s *System) ListInterfaces() ([]InterfaceInfo, error) {
names, err := listInterfaceNames()
if err != nil {
return nil, err
}
out := make([]InterfaceInfo, 0, len(names))
for _, name := range names {
state := "unknown"
if up, err := interfaceAdminState(name); err == nil {
if up {
state = "up"
} else {
state = "down"
}
}
ipv4, err := interfaceIPv4Addrs(name)
if err != nil {
ipv4 = nil
}
out = append(out, InterfaceInfo{Name: name, State: state, IPv4: ipv4})
}
return out, nil
}
func (s *System) DefaultRoute() string {
raw, err := exec.Command("ip", "route", "show", "default").Output()
if err != nil {
return ""
}
fields := strings.Fields(string(raw))
for i := 0; i < len(fields)-1; i++ {
if fields[i] == "via" {
return fields[i+1]
}
}
return ""
}
func (s *System) CaptureNetworkSnapshot() (NetworkSnapshot, error) {
names, err := listInterfaceNames()
if err != nil {
return NetworkSnapshot{}, err
}
snapshot := NetworkSnapshot{
Interfaces: make([]NetworkInterfaceSnapshot, 0, len(names)),
}
for _, name := range names {
up, err := interfaceAdminState(name)
if err != nil {
return NetworkSnapshot{}, err
}
ipv4, err := interfaceIPv4Addrs(name)
if err != nil {
return NetworkSnapshot{}, err
}
snapshot.Interfaces = append(snapshot.Interfaces, NetworkInterfaceSnapshot{
Name: name,
Up: up,
IPv4: ipv4,
})
}
if raw, err := exec.Command("ip", "route", "show", "default").Output(); err == nil {
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
line = strings.TrimSpace(line)
if line != "" {
snapshot.DefaultRoutes = append(snapshot.DefaultRoutes, line)
}
}
}
if raw, err := os.ReadFile("/etc/resolv.conf"); err == nil {
snapshot.ResolvConf = string(raw)
}
return snapshot, nil
}
func (s *System) RestoreNetworkSnapshot(snapshot NetworkSnapshot) error {
var errs []string
for _, iface := range snapshot.Interfaces {
if err := exec.Command("ip", "link", "set", "dev", iface.Name, "up").Run(); err != nil {
errs = append(errs, fmt.Sprintf("%s: bring up before restore: %v", iface.Name, err))
continue
}
if err := exec.Command("ip", "-4", "addr", "flush", "dev", iface.Name).Run(); err != nil { // flush IPv4 only: the snapshot holds no IPv6 addresses
errs = append(errs, fmt.Sprintf("%s: flush addresses: %v", iface.Name, err))
}
for _, cidr := range iface.IPv4 {
if raw, err := exec.Command("ip", "addr", "add", cidr, "dev", iface.Name).CombinedOutput(); err != nil {
detail := strings.TrimSpace(string(raw))
if detail != "" {
errs = append(errs, fmt.Sprintf("%s: restore address %s: %v: %s", iface.Name, cidr, err, detail))
} else {
errs = append(errs, fmt.Sprintf("%s: restore address %s: %v", iface.Name, cidr, err))
}
}
}
state := "down"
if iface.Up {
state = "up"
}
if err := exec.Command("ip", "link", "set", "dev", iface.Name, state).Run(); err != nil {
errs = append(errs, fmt.Sprintf("%s: restore state %s: %v", iface.Name, state, err))
}
}
if err := exec.Command("ip", "route", "del", "default").Run(); err != nil {
var exitErr *exec.ExitError
if !errors.As(err, &exitErr) {
errs = append(errs, fmt.Sprintf("clear default route: %v", err))
}
}
for _, route := range snapshot.DefaultRoutes {
fields := strings.Fields(route)
if len(fields) == 0 {
continue
}
// Strip state flags that ip-route(8) does not accept as add arguments.
filtered := fields[:0]
for _, f := range fields {
switch f {
case "linkdown", "dead", "onlink", "pervasive":
// skip
default:
filtered = append(filtered, f)
}
}
args := append([]string{"route", "add"}, filtered...)
if raw, err := exec.Command("ip", args...).CombinedOutput(); err != nil {
detail := strings.TrimSpace(string(raw))
if detail != "" {
errs = append(errs, fmt.Sprintf("restore route %q: %v: %s", route, err, detail))
} else {
errs = append(errs, fmt.Sprintf("restore route %q: %v", route, err))
}
}
}
if err := os.WriteFile("/etc/resolv.conf", []byte(snapshot.ResolvConf), 0644); err != nil {
errs = append(errs, fmt.Sprintf("restore resolv.conf: %v", err))
}
if len(errs) > 0 {
return errors.New(strings.Join(errs, "; "))
}
return nil
}
func (s *System) DHCPOne(iface string) (string, error) {
var out bytes.Buffer
if err := exec.Command("ip", "link", "set", iface, "up").Run(); err != nil {
fmt.Fprintf(&out, "WARN: ip link set up failed: %v\n", err)
}
// Release any existing lease first; the output is logged whether or not the release succeeds.
if raw, _ := exec.Command("dhclient", "-r", iface).CombinedOutput(); len(raw) > 0 {
out.Write(raw)
}
raw, err := exec.Command("dhclient", "-4", "-v", iface).CombinedOutput()
out.Write(raw)
return out.String(), err
}
func (s *System) DHCPAll() (string, error) {
ifaces, err := listInterfaceNames()
if err != nil {
return "", err
}
var out strings.Builder
for _, iface := range ifaces {
fmt.Fprintf(&out, "[%s]\n", iface)
log, err := s.DHCPOne(iface)
out.WriteString(log)
if err != nil {
fmt.Fprintf(&out, "ERROR: %v\n", err)
}
out.WriteString("\n")
}
return out.String(), nil
}
func (s *System) SetStaticIPv4(cfg StaticIPv4Config) (string, error) {
if cfg.Interface == "" || cfg.Address == "" || cfg.Prefix == "" {
return "", fmt.Errorf("interface, address, and prefix are required")
}
dns := cfg.DNS
if len(dns) == 0 {
dns = []string{"77.88.8.8", "77.88.8.1", "1.1.1.1", "8.8.8.8"}
}
var out strings.Builder
_ = exec.Command("ip", "link", "set", cfg.Interface, "up").Run()
_ = exec.Command("ip", "addr", "flush", "dev", cfg.Interface).Run()
if raw, err := exec.Command("ip", "addr", "add", cfg.Address+"/"+cfg.Prefix, "dev", cfg.Interface).CombinedOutput(); err != nil {
return string(raw), err
}
out.WriteString("address configured\n")
if cfg.Gateway != "" {
_ = exec.Command("ip", "route", "del", "default").Run()
if raw, err := exec.Command("ip", "route", "add", "default", "via", cfg.Gateway, "dev", cfg.Interface).CombinedOutput(); err != nil {
return out.String() + string(raw), err
}
out.WriteString("default route configured\n")
}
var resolv strings.Builder
for _, dnsServer := range dns {
dnsServer = strings.TrimSpace(dnsServer)
if dnsServer == "" {
continue
}
fmt.Fprintf(&resolv, "nameserver %s\n", dnsServer)
}
if err := os.WriteFile("/etc/resolv.conf", []byte(resolv.String()), 0644); err != nil {
return out.String(), err
}
out.WriteString("dns configured\n")
return out.String(), nil
}
// SetInterfaceState brings a network interface up or down.
func (s *System) SetInterfaceState(iface string, up bool) error {
state := "down"
if up {
state = "up"
}
return exec.Command("ip", "link", "set", "dev", iface, state).Run()
}
// GetInterfaceState reports whether the interface is administratively UP
// (the UP flag is set), regardless of carrier state.
func (s *System) GetInterfaceState(iface string) (bool, error) {
return interfaceAdminState(iface)
}
func interfaceAdminState(iface string) (bool, error) {
raw, err := exec.Command("ip", "-o", "link", "show", "dev", iface).Output()
if err != nil {
return false, err
}
return parseInterfaceAdminState(string(raw))
}
func parseInterfaceAdminState(raw string) (bool, error) {
start := strings.IndexByte(raw, '<')
if start == -1 {
return false, fmt.Errorf("ip link output missing flags")
}
end := strings.IndexByte(raw[start+1:], '>')
if end == -1 {
return false, fmt.Errorf("ip link output missing flag terminator")
}
flags := strings.Split(raw[start+1:start+1+end], ",")
for _, flag := range flags {
if strings.TrimSpace(flag) == "UP" {
return true, nil
}
}
return false, nil
}
func interfaceIPv4Addrs(iface string) ([]string, error) {
raw, err := exec.Command("ip", "-o", "-4", "addr", "show", "dev", iface).Output()
if err != nil {
var exitErr *exec.ExitError
if errors.As(err, &exitErr) {
return nil, nil
}
return nil, err
}
var ipv4 []string
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
fields := strings.Fields(line)
if len(fields) >= 4 {
ipv4 = append(ipv4, fields[3])
}
}
return ipv4, nil
}
func listInterfaceNames() ([]string, error) {
raw, err := exec.Command("ip", "-o", "link", "show").Output()
if err != nil {
return nil, err
}
var out []string
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {
fields := strings.SplitN(line, ": ", 3)
if len(fields) < 2 {
continue
}
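name := fields[1]
// Stacked devices are printed as "name@parent" (for example VLANs); keep only the name.
if at := strings.IndexByte(name, '@'); at >= 0 {
name = name[:at]
}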
if name == "lo" || strings.HasPrefix(name, "docker") || strings.HasPrefix(name, "virbr") ||
strings.HasPrefix(name, "veth") || strings.HasPrefix(name, "tun") ||
strings.HasPrefix(name, "tap") || strings.HasPrefix(name, "br-") ||
strings.HasPrefix(name, "bond") || strings.HasPrefix(name, "dummy") {
continue
}
out = append(out, name)
}
sort.Strings(out)
return out, nil
}

@@ -0,0 +1,46 @@
package platform
import "testing"
func TestParseInterfaceAdminState(t *testing.T) {
tests := []struct {
name string
raw string
want bool
wantErr bool
}{
{
name: "admin up with no carrier",
raw: "2: enp1s0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000\n",
want: true,
},
{
name: "admin down",
raw: "2: enp1s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n",
want: false,
},
{
name: "malformed output",
raw: "2: enp1s0: mtu 1500 state DOWN\n",
wantErr: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got, err := parseInterfaceAdminState(tt.raw)
if tt.wantErr {
if err == nil {
t.Fatal("expected error")
}
return
}
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got != tt.want {
t.Fatalf("got %v want %v", got, tt.want)
}
})
}
}

@@ -0,0 +1,205 @@
package platform
import (
"context"
"fmt"
"sort"
"strconv"
"strings"
)
func (s *System) RunNvidiaStressPack(ctx context.Context, baseDir string, opts NvidiaStressOptions, logFunc func(string)) (string, error) {
normalizeNvidiaStressOptions(&opts)
job, err := buildNvidiaStressJob(opts)
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, nvidiaStressArchivePrefix(opts.Loader), []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-nvidia-smi-list.log", cmd: []string{"nvidia-smi", "-L"}},
job,
{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
}
func nvidiaStressArchivePrefix(loader string) string {
switch strings.TrimSpace(strings.ToLower(loader)) {
case NvidiaStressLoaderJohn:
return "gpu-nvidia-john"
case NvidiaStressLoaderNCCL:
return "gpu-nvidia-nccl"
default:
return "gpu-nvidia-burn"
}
}
func buildNvidiaStressJob(opts NvidiaStressOptions) (satJob, error) {
selected, err := resolveNvidiaGPUSelection(opts.GPUIndices, opts.ExcludeGPUIndices)
if err != nil {
return satJob{}, err
}
loader := strings.TrimSpace(strings.ToLower(opts.Loader))
switch loader {
case "", NvidiaStressLoaderBuiltin:
cmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(opts.DurationSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-bee-gpu-burn.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
case NvidiaStressLoaderJohn:
cmd := []string{
"bee-john-gpu-stress",
"--seconds", strconv.Itoa(opts.DurationSec),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-john-gpu-stress.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
case NvidiaStressLoaderNCCL:
cmd := []string{
"bee-nccl-gpu-stress",
"--seconds", strconv.Itoa(opts.DurationSec),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-bee-nccl-gpu-stress.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
default:
return satJob{}, fmt.Errorf("unknown NVIDIA stress loader %q", opts.Loader)
}
}
func normalizeNvidiaStressOptions(opts *NvidiaStressOptions) {
if opts.DurationSec <= 0 {
opts.DurationSec = 300
}
if opts.SizeMB <= 0 {
opts.SizeMB = 64
}
switch strings.TrimSpace(strings.ToLower(opts.Loader)) {
case "", NvidiaStressLoaderBuiltin:
opts.Loader = NvidiaStressLoaderBuiltin
case NvidiaStressLoaderJohn:
opts.Loader = NvidiaStressLoaderJohn
case NvidiaStressLoaderNCCL:
opts.Loader = NvidiaStressLoaderNCCL
default:
opts.Loader = NvidiaStressLoaderBuiltin
}
opts.GPUIndices = dedupeSortedIndices(opts.GPUIndices)
opts.ExcludeGPUIndices = dedupeSortedIndices(opts.ExcludeGPUIndices)
}
func resolveNvidiaGPUSelection(include, exclude []int) ([]int, error) {
all, err := listNvidiaGPUIndices()
if err != nil {
return nil, err
}
if len(all) == 0 {
return nil, fmt.Errorf("nvidia-smi found no NVIDIA GPUs")
}
selected := all
if len(include) > 0 {
want := make(map[int]struct{}, len(include))
for _, idx := range include {
want[idx] = struct{}{}
}
selected = selected[:0]
for _, idx := range all {
if _, ok := want[idx]; ok {
selected = append(selected, idx)
}
}
}
if len(exclude) > 0 {
skip := make(map[int]struct{}, len(exclude))
for _, idx := range exclude {
skip[idx] = struct{}{}
}
filtered := selected[:0]
for _, idx := range selected {
if _, ok := skip[idx]; ok {
continue
}
filtered = append(filtered, idx)
}
selected = filtered
}
if len(selected) == 0 {
return nil, fmt.Errorf("no NVIDIA GPUs selected after applying filters")
}
out := append([]int(nil), selected...)
sort.Ints(out)
return out, nil
}
func listNvidiaGPUIndices() ([]int, error) {
out, err := satExecCommand("nvidia-smi", "--query-gpu=index", "--format=csv,noheader,nounits").Output()
if err != nil {
return nil, fmt.Errorf("nvidia-smi: %w", err)
}
var indices []int
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
idx, err := strconv.Atoi(line)
if err != nil {
continue
}
indices = append(indices, idx)
}
return dedupeSortedIndices(indices), nil
}
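// dedupeSortedIndices drops negative and duplicate values and returns the
// remainder in ascending order. It returns nil for empty input.
func dedupeSortedIndices(values []int) []int {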
if len(values) == 0 {
return nil
}
seen := make(map[int]struct{}, len(values))
out := make([]int, 0, len(values))
for _, value := range values {
if value < 0 {
continue
}
if _, ok := seen[value]; ok {
continue
}
seen[value] = struct{}{}
out = append(out, value)
}
sort.Ints(out)
return out
}
func joinIndexList(values []int) string {
parts := make([]string, 0, len(values))
for _, value := range values {
parts = append(parts, strconv.Itoa(value))
}
return strings.Join(parts, ",")
}

@@ -0,0 +1,43 @@
package platform
import "strings"
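// parseLSBLKPairs parses one line of KEY="value" pairs (the format printed by
// `lsblk -P`) into a map. Fields without a key or an "=" are ignored.
func parseLSBLKPairs(line string) map[string]string {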
out := map[string]string{}
for _, part := range splitQuotedFields(line) {
idx := strings.Index(part, "=")
if idx <= 0 {
continue
}
key := part[:idx]
value := strings.Trim(part[idx+1:], `"`)
out[key] = value
}
return out
}
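// splitQuotedFields splits s on spaces that are not inside double quotes.
// The quote characters are kept in the returned fields.
func splitQuotedFields(s string) []string {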
var out []string
var cur strings.Builder
inQuotes := false
for _, r := range s {
switch r {
case '"':
inQuotes = !inQuotes
cur.WriteRune(r)
case ' ':
if inQuotes {
cur.WriteRune(r)
} else if cur.Len() > 0 {
out = append(out, cur.String())
cur.Reset()
}
default:
cur.WriteRune(r)
}
}
if cur.Len() > 0 {
out = append(out, cur.String())
}
return out
}

@@ -0,0 +1,528 @@
package platform
import (
"archive/tar"
"bytes"
"compress/gzip"
"context"
"encoding/csv"
"fmt"
"os"
"os/exec"
"path/filepath"
"runtime"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
// PlatformStressCycle defines one load+idle cycle.
type PlatformStressCycle struct {
LoadSec int // seconds of simultaneous CPU+GPU stress
IdleSec int // seconds of idle monitoring after load cut
}
// PlatformStressOptions controls the thermal cycling test.
type PlatformStressOptions struct {
Cycles []PlatformStressCycle
}
// platformStressRow is one second of telemetry.
type platformStressRow struct {
ElapsedSec float64
Cycle int
Phase string // "load" | "idle"
CPULoadPct float64
MaxCPUTempC float64
MaxGPUTempC float64
SysPowerW float64
FanMinRPM float64
FanMaxRPM float64
GPUThrottled bool
}
// RunPlatformStress runs repeated load+idle thermal cycling.
// Each cycle starts CPU (stressapptest) and GPU stress simultaneously,
// runs for LoadSec, then cuts load abruptly and monitors for IdleSec.
func (s *System) RunPlatformStress(
ctx context.Context,
baseDir string,
opts PlatformStressOptions,
logFunc func(string),
) (string, error) {
if logFunc == nil {
logFunc = func(string) {}
}
if len(opts.Cycles) == 0 {
return "", fmt.Errorf("no cycles defined")
}
if err := os.MkdirAll(baseDir, 0755); err != nil {
return "", fmt.Errorf("mkdir %s: %w", baseDir, err)
}
stamp := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, "platform-stress-"+stamp)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", fmt.Errorf("mkdir run dir: %w", err)
}
vendor := s.DetectGPUVendor()
logFunc(fmt.Sprintf("Platform Thermal Cycling — %d cycle(s), GPU vendor: %s", len(opts.Cycles), vendor))
var rows []platformStressRow
start := time.Now()
var analyses []cycleAnalysis
for i, cycle := range opts.Cycles {
if ctx.Err() != nil {
break
}
cycleNum := i + 1
logFunc(fmt.Sprintf("--- Cycle %d/%d: load=%ds, idle=%ds ---", cycleNum, len(opts.Cycles), cycle.LoadSec, cycle.IdleSec))
// ── LOAD PHASE ───────────────────────────────────────────────────────
loadCtx, loadCancel := context.WithTimeout(ctx, time.Duration(cycle.LoadSec)*time.Second)
var wg sync.WaitGroup
// CPU stress
wg.Add(1)
go func() {
defer wg.Done()
cpuCmd, err := buildCPUStressCmd(loadCtx)
if err != nil {
logFunc("CPU stress: " + err.Error())
return
}
_ = cpuCmd.Wait() // exits when loadCtx times out (SIGKILL)
}()
// GPU stress
wg.Add(1)
go func() {
defer wg.Done()
gpuCmd := buildGPUStressCmd(loadCtx, vendor)
if gpuCmd == nil {
return
}
_ = gpuCmd.Wait()
}()
// Sample metrics for the load phase; this call blocks until loadCtx expires.
loadRows := collectPhase(loadCtx, cycleNum, "load", start)
for _, r := range loadRows {
logFunc(formatPlatformRow(r))
}
rows = append(rows, loadRows...)
loadCancel()
wg.Wait()
if len(loadRows) > 0 {
logFunc(fmt.Sprintf("Cycle %d load ended (%.0fs)", cycleNum, loadRows[len(loadRows)-1].ElapsedSec))
}
// ── IDLE PHASE ───────────────────────────────────────────────────────
idleCtx, idleCancel := context.WithTimeout(ctx, time.Duration(cycle.IdleSec)*time.Second)
idleRows := collectPhase(idleCtx, cycleNum, "idle", start)
for _, r := range idleRows {
logFunc(formatPlatformRow(r))
}
rows = append(rows, idleRows...)
idleCancel()
// Per-cycle analysis
an := analyzePlatformCycle(loadRows, idleRows)
analyses = append(analyses, an)
logFunc(fmt.Sprintf("Cycle %d: maxCPU=%.1f°C maxGPU=%.1f°C power=%.0fW throttled=%v fanDrop=%.0f%%",
cycleNum, an.maxCPUTemp, an.maxGPUTemp, an.maxPower, an.throttled, an.fanDropPct))
}
// Write CSV
csvData := writePlatformCSV(rows)
_ = os.WriteFile(filepath.Join(runDir, "metrics.csv"), csvData, 0644)
// Write summary
summary := writePlatformSummary(opts, analyses)
logFunc("--- Summary ---")
for _, line := range strings.Split(summary, "\n") {
if line != "" {
logFunc(line)
}
}
_ = os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary), 0644)
// Pack tar.gz
archivePath := filepath.Join(baseDir, "platform-stress-"+stamp+".tar.gz")
if err := packPlatformDir(runDir, archivePath); err != nil {
return "", fmt.Errorf("pack archive: %w", err)
}
_ = os.RemoveAll(runDir)
return archivePath, nil
}
// collectPhase samples live metrics every second until ctx is done.
func collectPhase(ctx context.Context, cycle int, phase string, testStart time.Time) []platformStressRow {
var rows []platformStressRow
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return rows
case <-ticker.C:
sample := SampleLiveMetrics()
rows = append(rows, sampleToPlatformRow(sample, cycle, phase, testStart))
}
}
}
func sampleToPlatformRow(s LiveMetricSample, cycle int, phase string, testStart time.Time) platformStressRow {
r := platformStressRow{
ElapsedSec: time.Since(testStart).Seconds(),
Cycle: cycle,
Phase: phase,
CPULoadPct: s.CPULoadPct,
SysPowerW: s.PowerW,
}
for _, t := range s.Temps {
switch t.Group {
case "cpu":
if t.Celsius > r.MaxCPUTempC {
r.MaxCPUTempC = t.Celsius
}
case "gpu":
if t.Celsius > r.MaxGPUTempC {
r.MaxGPUTempC = t.Celsius
}
}
}
for _, g := range s.GPUs {
if g.TempC > r.MaxGPUTempC {
r.MaxGPUTempC = g.TempC
}
}
if len(s.Fans) > 0 {
r.FanMinRPM = s.Fans[0].RPM
r.FanMaxRPM = s.Fans[0].RPM
for _, f := range s.Fans[1:] {
if f.RPM < r.FanMinRPM {
r.FanMinRPM = f.RPM
}
if f.RPM > r.FanMaxRPM {
r.FanMaxRPM = f.RPM
}
}
}
return r
}
func formatPlatformRow(r platformStressRow) string {
throttle := ""
if r.GPUThrottled {
throttle = " THROTTLE"
}
fans := ""
if r.FanMinRPM > 0 {
fans = fmt.Sprintf(" fans=%.0f-%.0fRPM", r.FanMinRPM, r.FanMaxRPM)
}
return fmt.Sprintf("[%5.0fs] cycle=%d phase=%-4s cpu=%.0f%% cpuT=%.1f°C gpuT=%.1f°C pwr=%.0fW%s%s",
r.ElapsedSec, r.Cycle, r.Phase, r.CPULoadPct, r.MaxCPUTempC, r.MaxGPUTempC, r.SysPowerW, fans, throttle)
}
func analyzePlatformCycle(loadRows, idleRows []platformStressRow) cycleAnalysis {
var an cycleAnalysis
for _, r := range loadRows {
if r.MaxCPUTempC > an.maxCPUTemp {
an.maxCPUTemp = r.MaxCPUTempC
}
if r.MaxGPUTempC > an.maxGPUTemp {
an.maxGPUTemp = r.MaxGPUTempC
}
if r.SysPowerW > an.maxPower {
an.maxPower = r.SysPowerW
}
if r.GPUThrottled {
an.throttled = true
}
}
// Fan RPM at cut = avg of last 5 load rows
if n := len(loadRows); n > 0 {
window := loadRows
if n > 5 {
window = loadRows[n-5:]
}
var sum float64
var cnt int
for _, r := range window {
if r.FanMinRPM > 0 {
sum += (r.FanMinRPM + r.FanMaxRPM) / 2
cnt++
}
}
if cnt > 0 {
an.fanAtCutAvg = sum / float64(cnt)
}
}
// Fan RPM min in first 15s of idle
an.fanMin15s = an.fanAtCutAvg
var cutElapsed float64
if len(loadRows) > 0 {
cutElapsed = loadRows[len(loadRows)-1].ElapsedSec
}
for _, r := range idleRows {
if r.ElapsedSec > cutElapsed+15 {
break
}
avg := (r.FanMinRPM + r.FanMaxRPM) / 2
if avg > 0 && (an.fanMin15s == 0 || avg < an.fanMin15s) {
an.fanMin15s = avg
}
}
if an.fanAtCutAvg > 0 {
an.fanDropPct = (an.fanAtCutAvg - an.fanMin15s) / an.fanAtCutAvg * 100
}
return an
}
type cycleAnalysis struct {
maxCPUTemp float64
maxGPUTemp float64
maxPower float64
throttled bool
fanAtCutAvg float64
fanMin15s float64
fanDropPct float64
}
func writePlatformSummary(opts PlatformStressOptions, analyses []cycleAnalysis) string {
var b strings.Builder
fmt.Fprintf(&b, "Platform Thermal Cycling — %d cycle(s)\n", len(opts.Cycles))
fmt.Fprintf(&b, "%s\n\n", strings.Repeat("=", 48))
totalThrottle := 0
totalFanWarn := 0
for i, an := range analyses {
cycle := opts.Cycles[i]
fmt.Fprintf(&b, "Cycle %d/%d (load=%ds, idle=%ds)\n", i+1, len(opts.Cycles), cycle.LoadSec, cycle.IdleSec)
fmt.Fprintf(&b, " Max CPU temp: %.1f°C\n", an.maxCPUTemp)
fmt.Fprintf(&b, " Max GPU temp: %.1f°C\n", an.maxGPUTemp)
fmt.Fprintf(&b, " Max sys power: %.0f W\n", an.maxPower)
if an.throttled {
fmt.Fprintf(&b, " Throttle: DETECTED\n")
totalThrottle++
} else {
fmt.Fprintf(&b, " Throttle: none\n")
}
if an.fanAtCutAvg > 0 {
fmt.Fprintf(&b, " Fan at load cut: %.0f RPM avg\n", an.fanAtCutAvg)
fmt.Fprintf(&b, " Fan min (first 15s idle): %.0f RPM (drop %.0f%%)\n", an.fanMin15s, an.fanDropPct)
if an.fanDropPct > 20 {
fmt.Fprintf(&b, " Fan response: WARN — fast spindown (>20%% drop in 15s)\n")
totalFanWarn++
} else {
fmt.Fprintf(&b, " Fan response: OK\n")
}
}
b.WriteString("\n")
}
fmt.Fprintf(&b, "%s\n", strings.Repeat("=", 48))
if totalThrottle > 0 {
fmt.Fprintf(&b, "Overall: FAIL — throttle detected in %d/%d cycles\n", totalThrottle, len(analyses))
} else if totalFanWarn > 0 {
fmt.Fprintf(&b, "Overall: WARN — fast fan spindown in %d/%d cycles (cooling recovery risk)\n", totalFanWarn, len(analyses))
} else {
fmt.Fprintf(&b, "Overall: PASS\n")
}
return b.String()
}
func writePlatformCSV(rows []platformStressRow) []byte {
var buf bytes.Buffer
w := csv.NewWriter(&buf)
_ = w.Write([]string{
"elapsed_sec", "cycle", "phase",
"cpu_load_pct", "max_cpu_temp_c", "max_gpu_temp_c",
"sys_power_w", "fan_min_rpm", "fan_max_rpm", "gpu_throttled",
})
for _, r := range rows {
throttled := "0"
if r.GPUThrottled {
throttled = "1"
}
_ = w.Write([]string{
strconv.FormatFloat(r.ElapsedSec, 'f', 1, 64),
strconv.Itoa(r.Cycle),
r.Phase,
strconv.FormatFloat(r.CPULoadPct, 'f', 1, 64),
strconv.FormatFloat(r.MaxCPUTempC, 'f', 1, 64),
strconv.FormatFloat(r.MaxGPUTempC, 'f', 1, 64),
strconv.FormatFloat(r.SysPowerW, 'f', 1, 64),
strconv.FormatFloat(r.FanMinRPM, 'f', 0, 64),
strconv.FormatFloat(r.FanMaxRPM, 'f', 0, 64),
throttled,
})
}
w.Flush()
return buf.Bytes()
}
// buildCPUStressCmd creates a stressapptest command that runs until ctx is cancelled.
func buildCPUStressCmd(ctx context.Context) (*exec.Cmd, error) {
path, err := satLookPath("stressapptest")
if err != nil {
return nil, fmt.Errorf("stressapptest not found: %w", err)
}
// Use a very long duration; the context timeout will kill it at the right time.
cmdArgs := []string{"-s", "86400", "-W", "--cc_test"}
if threads := platformStressCPUThreads(); threads > 0 {
cmdArgs = append(cmdArgs, "-m", strconv.Itoa(threads))
}
if mb := platformStressMemoryMB(); mb > 0 {
cmdArgs = append(cmdArgs, "-M", strconv.Itoa(mb))
}
cmd := exec.CommandContext(ctx, path, cmdArgs...)
cmd.Stdout = nil
cmd.Stderr = nil
if err := startLowPriorityCmd(cmd, 15); err != nil {
return nil, fmt.Errorf("stressapptest start: %w", err)
}
return cmd, nil
}
// buildGPUStressCmd creates a GPU stress command appropriate for the detected vendor.
// Returns nil if no GPU stress tool is available (CPU-only cycling still useful).
func buildGPUStressCmd(ctx context.Context, vendor string) *exec.Cmd {
switch strings.ToLower(vendor) {
case "amd":
return buildAMDGPUStressCmd(ctx)
case "nvidia":
return buildNvidiaGPUStressCmd(ctx)
}
return nil
}
func buildAMDGPUStressCmd(ctx context.Context) *exec.Cmd {
rvsArgs, err := resolveRVSCommand()
if err != nil {
return nil
}
rvsPath := rvsArgs[0]
cfg := `actions:
- name: gst_platform
device: all
module: gst
parallel: true
duration: 86400000
copy_matrix: false
target_stress: 90
matrix_size_a: 8640
matrix_size_b: 8640
matrix_size_c: 8640
`
cfgFile := "/tmp/bee-platform-gst.conf"
if err := os.WriteFile(cfgFile, []byte(cfg), 0644); err != nil {
return nil
}
cmd := exec.CommandContext(ctx, rvsPath, "-c", cfgFile)
cmd.Stdout = nil
cmd.Stderr = nil
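// Return nil when the process could not be started so callers skip Wait.
if err := startLowPriorityCmd(cmd, 10); err != nil {
return nil
}
return cmd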
}
func buildNvidiaGPUStressCmd(ctx context.Context) *exec.Cmd {
path, err := satLookPath("bee-gpu-burn")
if err != nil {
path, err = satLookPath("bee-gpu-stress")
}
if err != nil {
return nil
}
cmd := exec.CommandContext(ctx, path, "--seconds", "86400", "--size-mb", "64")
cmd.Stdout = nil
cmd.Stderr = nil
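// Return nil when the process could not be started so callers skip Wait.
if err := startLowPriorityCmd(cmd, 10); err != nil {
return nil
}
return cmd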
}
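// startLowPriorityCmd starts cmd and then, on a best-effort basis, lowers its
// scheduling priority to the given nice value.
func startLowPriorityCmd(cmd *exec.Cmd, nice int) error {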
if err := cmd.Start(); err != nil {
return err
}
if cmd.Process != nil {
_ = syscall.Setpriority(syscall.PRIO_PROCESS, cmd.Process.Pid, nice)
}
return nil
}
func platformStressCPUThreads() int {
if n := envInt("BEE_PLATFORM_STRESS_THREADS", 0); n > 0 {
return n
}
cpus := runtime.NumCPU()
switch {
case cpus <= 2:
return 1
case cpus <= 8:
return cpus - 1
default:
return cpus - 2
}
}
func platformStressMemoryMB() int {
if mb := envInt("BEE_PLATFORM_STRESS_MB", 0); mb > 0 {
return mb
}
free := freeMemBytes()
if free <= 0 {
return 0
}
mb := int((free * 60) / 100 / (1024 * 1024))
if mb < 1024 {
return 1024
}
return mb
}
func packPlatformDir(dir, dest string) error {
f, err := os.Create(dest)
if err != nil {
return err
}
defer f.Close()
gz := gzip.NewWriter(f)
defer gz.Close()
tw := tar.NewWriter(gz)
defer tw.Close()
entries, err := os.ReadDir(dir)
if err != nil {
return err
}
base := filepath.Base(dir)
for _, e := range entries {
if e.IsDir() {
continue
}
fpath := filepath.Join(dir, e.Name())
data, err := os.ReadFile(fpath)
if err != nil {
continue
}
hdr := &tar.Header{
Name: filepath.Join(base, e.Name()),
Size: int64(len(data)),
Mode: 0644,
ModTime: time.Now(),
}
if err := tw.WriteHeader(hdr); err != nil {
return err
}
if _, err := tw.Write(data); err != nil {
return err
}
}
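// Close explicitly so flush errors are reported; the deferred calls then have nothing left to do.
if err := tw.Close(); err != nil {
return err
}
if err := gz.Close(); err != nil {
return err
}
return f.Close()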
}

@@ -0,0 +1,34 @@
package platform
import (
"runtime"
"testing"
)
func TestPlatformStressCPUThreadsOverride(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_THREADS", "7")
if got := platformStressCPUThreads(); got != 7 {
t.Fatalf("platformStressCPUThreads=%d want 7", got)
}
}
func TestPlatformStressCPUThreadsDefaultLeavesHeadroom(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_THREADS", "")
got := platformStressCPUThreads()
if got < 1 {
t.Fatalf("platformStressCPUThreads=%d want >= 1", got)
}
if got > runtime.NumCPU() {
t.Fatalf("platformStressCPUThreads=%d want <= NumCPU=%d", got, runtime.NumCPU())
}
if runtime.NumCPU() > 2 && got >= runtime.NumCPU() {
t.Fatalf("platformStressCPUThreads=%d want headroom below NumCPU=%d", got, runtime.NumCPU())
}
}
func TestPlatformStressMemoryMBOverride(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_MB", "8192")
if got := platformStressMemoryMB(); got != 8192 {
t.Fatalf("platformStressMemoryMB=%d want 8192", got)
}
}

@@ -0,0 +1,217 @@
package platform
import (
"os"
"os/exec"
"strings"
"time"
"bee/audit/internal/schema"
)
var runtimeRequiredTools = []string{
"dmidecode",
"lspci",
"lsblk",
"smartctl",
"nvme",
"ipmitool",
"dhclient",
"mount",
}
var runtimeTrackedServices = []string{
"bee-network",
"bee-nvidia",
"bee-preflight",
"bee-audit",
"bee-web",
"bee-sshsetup",
}
func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
checkedAt := time.Now().UTC().Format(time.RFC3339)
health := schema.RuntimeHealth{
Status: "OK",
CheckedAt: checkedAt,
ExportDir: strings.TrimSpace(exportDir),
}
if health.ExportDir != "" {
if err := os.MkdirAll(health.ExportDir, 0755); err != nil {
health.Status = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "export_dir_unavailable",
Severity: "critical",
Description: err.Error(),
})
}
}
interfaces, err := s.ListInterfaces()
if err == nil {
health.Interfaces = make([]schema.RuntimeInterface, 0, len(interfaces))
hasIPv4 := false
missingIPv4 := false
for _, iface := range interfaces {
outcome := "no_offer"
if len(iface.IPv4) > 0 {
outcome = "lease_acquired"
hasIPv4 = true
} else if strings.EqualFold(iface.State, "DOWN") {
outcome = "link_down"
} else {
missingIPv4 = true
}
health.Interfaces = append(health.Interfaces, schema.RuntimeInterface{
Name: iface.Name,
State: iface.State,
IPv4: iface.IPv4,
Outcome: outcome,
})
}
switch {
case hasIPv4 && !missingIPv4:
health.NetworkStatus = "OK"
case hasIPv4:
health.NetworkStatus = "PARTIAL"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_partial",
Severity: "warning",
Description: "At least one interface did not obtain IPv4 connectivity.",
})
default:
health.NetworkStatus = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_failed",
Severity: "warning",
Description: "No physical interface obtained IPv4 connectivity.",
})
}
}
vendor := s.DetectGPUVendor()
for _, tool := range s.runtimeToolStatuses(vendor) {
health.Tools = append(health.Tools, schema.RuntimeToolStatus{
Name: tool.Name,
Path: tool.Path,
OK: tool.OK,
})
if !tool.OK {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "tool_missing",
Severity: "warning",
Description: "Required tool missing: " + tool.Name,
})
}
}
for _, name := range runtimeTrackedServices {
health.Services = append(health.Services, schema.RuntimeServiceStatus{
Name: name,
Status: s.ServiceState(name),
})
}
s.collectGPURuntimeHealth(vendor, &health)
if health.Status != "FAILED" && len(health.Issues) > 0 {
health.Status = "PARTIAL"
}
return health, nil
}
func commandText(name string, args ...string) string {
raw, err := exec.Command(name, args...).CombinedOutput()
if err != nil && len(raw) == 0 {
return ""
}
return string(raw)
}
func (s *System) runtimeToolStatuses(vendor string) []ToolStatus {
tools := s.CheckTools(runtimeRequiredTools)
switch vendor {
case "nvidia":
tools = append(tools, s.CheckTools([]string{
"nvidia-smi",
"nvidia-bug-report.sh",
"bee-gpu-burn",
"bee-john-gpu-stress",
"bee-nccl-gpu-stress",
"all_reduce_perf",
})...)
case "amd":
tool := ToolStatus{Name: "rocm-smi"}
if cmd, err := resolveROCmSMICommand(); err == nil && len(cmd) > 0 {
tool.Path = cmd[0]
if len(cmd) > 1 && strings.HasSuffix(cmd[1], "rocm_smi.py") {
tool.Path = cmd[1]
}
tool.OK = true
}
tools = append(tools, tool)
}
return tools
}
func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) {
lsmodText := commandText("lsmod")
switch vendor {
case "nvidia":
health.DriverReady = strings.Contains(lsmodText, "nvidia ")
if !health.DriverReady {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_kernel_module_missing",
Severity: "warning",
Description: "NVIDIA kernel module is not loaded.",
})
}
if health.DriverReady && !strings.Contains(lsmodText, "nvidia_modeset") {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_modeset_failed",
Severity: "warning",
Description: "nvidia-modeset is not loaded; display/CUDA stack may be partial.",
})
}
if out, err := exec.Command("nvidia-smi", "-L").CombinedOutput(); err == nil && strings.TrimSpace(string(out)) != "" {
health.DriverReady = true
}
if _, lookErr := exec.LookPath("bee-gpu-burn"); lookErr == nil {
out, err := exec.Command("bee-gpu-burn", "--seconds", "1", "--size-mb", "1").CombinedOutput()
if err == nil {
health.CUDAReady = true
} else if strings.Contains(strings.ToLower(string(out)), "cuda_error_system_not_ready") {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "cuda_runtime_not_ready",
Severity: "warning",
Description: "CUDA runtime is not ready for GPU SAT.",
})
}
}
case "amd":
health.DriverReady = strings.Contains(lsmodText, "amdgpu ") || strings.Contains(lsmodText, "amdkfd")
if !health.DriverReady {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "amdgpu_kernel_module_missing",
Severity: "warning",
Description: "AMD GPU driver is not loaded.",
})
}
out, err := runROCmSMI("--showproductname", "--csv")
if err == nil && strings.TrimSpace(string(out)) != "" {
health.CUDAReady = true
health.DriverReady = true
return
}
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "rocm_smi_unavailable",
Severity: "warning",
Description: "ROCm SMI is not available for AMD GPU SAT.",
})
}
}

@@ -0,0 +1,897 @@
package platform
import (
"archive/tar"
"bufio"
"bytes"
"compress/gzip"
"context"
"errors"
"fmt"
"io"
"os"
"os/exec"
"path/filepath"
"sort"
"strconv"
"strings"
"sync"
"time"
)
var (
satExecCommand = exec.Command
satLookPath = exec.LookPath
satGlob = filepath.Glob
satStat = os.Stat
rocmSMIExecutableGlobs = []string{
"/opt/rocm/bin/rocm-smi",
"/opt/rocm-*/bin/rocm-smi",
}
rocmSMIScriptGlobs = []string{
"/opt/rocm/libexec/rocm_smi/rocm_smi.py",
"/opt/rocm-*/libexec/rocm_smi/rocm_smi.py",
}
rvsExecutableGlobs = []string{
"/opt/rocm/bin/rvs",
"/opt/rocm-*/bin/rvs",
}
)
// streamExecOutput runs cmd and streams each output line to logFunc (if non-nil).
// Returns combined stdout+stderr as a byte slice.
func streamExecOutput(cmd *exec.Cmd, logFunc func(string)) ([]byte, error) {
pr, pw := io.Pipe()
cmd.Stdout = pw
cmd.Stderr = pw
var buf bytes.Buffer
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
scanner := bufio.NewScanner(pr)
for scanner.Scan() {
line := scanner.Text()
buf.WriteString(line + "\n")
if logFunc != nil {
logFunc(line)
}
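}
// If the scanner stopped early (for example on a line longer than its buffer),
// keep draining the pipe so the writer side, and therefore cmd.Wait, cannot block forever.
_, _ = io.Copy(io.Discard, pr)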
}()
err := cmd.Start()
if err != nil {
_ = pw.Close()
wg.Wait()
return nil, err
}
waitErr := cmd.Wait()
_ = pw.Close()
wg.Wait()
return buf.Bytes(), waitErr
}
// NvidiaGPU holds basic GPU info from nvidia-smi.
type NvidiaGPU struct {
Index int
Name string
MemoryMB int
}
// AMDGPUInfo holds basic info about an AMD GPU from rocm-smi.
type AMDGPUInfo struct {
Index int
Name string
}
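// DetectGPUVendor returns "nvidia" if /dev/nvidia0 exists and "amd" if /dev/kfd exists.
// If neither device node is present, it falls back to scanning `lspci -nn` output for
// AMD vendor strings, and returns "" when nothing matches.
// The lspci match is vendor-wide, so it can also hit non-GPU AMD devices.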
func (s *System) DetectGPUVendor() string {
if _, err := os.Stat("/dev/nvidia0"); err == nil {
return "nvidia"
}
if _, err := os.Stat("/dev/kfd"); err == nil {
return "amd"
}
if raw, err := exec.Command("lspci", "-nn").Output(); err == nil {
text := strings.ToLower(string(raw))
if strings.Contains(text, "advanced micro devices") || strings.Contains(text, "amd/ati") {
return "amd"
}
}
return ""
}
// ListAMDGPUs returns AMD GPUs visible to rocm-smi.
func (s *System) ListAMDGPUs() ([]AMDGPUInfo, error) {
out, err := runROCmSMI("--showproductname", "--csv")
if err != nil {
return nil, fmt.Errorf("rocm-smi: %w", err)
}
var gpus []AMDGPUInfo
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" || strings.HasPrefix(strings.ToLower(line), "device") {
continue
}
parts := strings.SplitN(line, ",", 2)
name := ""
if len(parts) >= 2 {
name = strings.TrimSpace(parts[1])
}
idx := len(gpus)
gpus = append(gpus, AMDGPUInfo{Index: idx, Name: name})
}
return gpus, nil
}
// RunAMDAcceptancePack runs an AMD GPU diagnostic pack using rocm-smi.
func (s *System) RunAMDAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd", []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rocm-smi-showallinfo.log", cmd: []string{"rocm-smi", "--showallinfo"}},
{name: "03-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "04-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
}, logFunc)
}
// RunAMDMemIntegrityPack runs the official RVS MEM module as a validate-style memory integrity test.
func (s *System) RunAMDMemIntegrityPack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if err := ensureAMDRuntimeReady(); err != nil {
return "", err
}
cfgFile := "/tmp/bee-amd-mem.conf"
cfg := `actions:
- name: mem_integrity
device: all
module: mem
parallel: true
duration: 60000
copy_matrix: false
target_stress: 90
matrix_size: 8640
`
if err := os.WriteFile(cfgFile, []byte(cfg), 0644); err != nil {
return "", fmt.Errorf("write %s: %w", cfgFile, err)
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-mem", []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rvs-mem.log", cmd: []string{"rvs", "-c", cfgFile}},
{name: "03-rocm-smi-after.log", cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--showmemuse", "--csv"}},
}, logFunc)
}
// RunAMDMemBandwidthPack runs AMD's memory/interconnect bandwidth-oriented tools.
func (s *System) RunAMDMemBandwidthPack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if err := ensureAMDRuntimeReady(); err != nil {
return "", err
}
cfgFile := "/tmp/bee-amd-babel.conf"
cfg := `actions:
- name: babel_mem_bw
device: all
module: babel
parallel: true
copy_matrix: true
target_stress: 90
matrix_size: 134217728
`
if err := os.WriteFile(cfgFile, []byte(cfg), 0644); err != nil {
return "", fmt.Errorf("write %s: %w", cfgFile, err)
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-bandwidth", []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rocm-bandwidth-test.log", cmd: []string{"rocm-bandwidth-test"}},
{name: "03-rvs-babel.log", cmd: []string{"rvs", "-c", cfgFile}},
{name: "04-rocm-smi-after.log", cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--showmemuse", "--csv"}},
}, logFunc)
}
// RunAMDStressPack runs an AMD GPU burn-in pack.
// Missing tools are reported as UNSUPPORTED, consistent with the existing SAT pattern.
func (s *System) RunAMDStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
seconds := durationSec
if seconds <= 0 {
seconds = envInt("BEE_AMD_STRESS_SECONDS", 300)
}
if err := ensureAMDRuntimeReady(); err != nil {
return "", err
}
// NOTE: the generated GST config sets copy_matrix: false. Switch it to true in
// amdStressRVSConfig if the same GST run should drive VRAM traffic in addition to compute.
rvsCfg := amdStressRVSConfig(seconds)
cfgFile := "/tmp/bee-amd-gst.conf"
if err := os.WriteFile(cfgFile, []byte(rvsCfg), 0644); err != nil {
return "", fmt.Errorf("write %s: %w", cfgFile, err)
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-stress", amdStressJobs(seconds, cfgFile), logFunc)
}
func amdStressRVSConfig(seconds int) string {
return fmt.Sprintf(`actions:
- name: gst_stress
device: all
module: gst
parallel: true
duration: %d
copy_matrix: false
target_stress: 90
matrix_size_a: 8640
matrix_size_b: 8640
matrix_size_c: 8640
`, seconds*1000)
}
func amdStressJobs(seconds int, cfgFile string) []satJob {
return []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rocm-bandwidth-test.log", cmd: []string{"rocm-bandwidth-test"}},
{name: fmt.Sprintf("03-rvs-gst-%ds.log", seconds), cmd: []string{"rvs", "-c", cfgFile}},
{name: "04-rocm-smi-after.log", cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--csv"}},
}
}
// ListNvidiaGPUs returns GPUs visible to nvidia-smi.
func (s *System) ListNvidiaGPUs() ([]NvidiaGPU, error) {
out, err := exec.Command("nvidia-smi",
"--query-gpu=index,name,memory.total",
"--format=csv,noheader,nounits").Output()
if err != nil {
return nil, fmt.Errorf("nvidia-smi: %w", err)
}
var gpus []NvidiaGPU
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.SplitN(line, ", ", 3)
if len(parts) != 3 {
continue
}
idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
if err != nil {
continue
}
memMB, _ := strconv.Atoi(strings.TrimSpace(parts[2]))
gpus = append(gpus, NvidiaGPU{
Index: idx,
Name: strings.TrimSpace(parts[1]),
MemoryMB: memMB,
})
}
return gpus, nil
}
// RunNCCLTests runs nccl-tests all_reduce_perf across all NVIDIA GPUs.
// Measures collective communication bandwidth over NVLink/PCIe.
func (s *System) RunNCCLTests(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
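// Count GPUs from nvidia-smi output; fall back to 1 if the query fails or prints nothing.
gpuCount := 0
if out, err := exec.Command("nvidia-smi", "--query-gpu=index", "--format=csv,noheader").Output(); err == nil {
for _, line := range strings.Split(string(out), "\n") {
if strings.TrimSpace(line) != "" {
gpuCount++
}
}
}
if gpuCount < 1 {
gpuCount = 1
}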
return runAcceptancePackCtx(ctx, baseDir, "nccl-tests", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-all-reduce-perf.log", cmd: []string{
"all_reduce_perf", "-b", "512M", "-e", "4G", "-f", "2",
"-g", strconv.Itoa(gpuCount), "--iters", "20",
}},
}, logFunc)
}
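// RunNvidiaAcceptancePack runs the basic NVIDIA pack (nvidia-smi, dmidecode,
// nvidia-bug-report.sh, and a short bee-gpu-burn) with a background context,
// so the run cannot be cancelled by the caller.
func (s *System) RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error) {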
return runAcceptancePackCtx(context.Background(), baseDir, "gpu-nvidia", nvidiaSATJobs(), logFunc)
}
// RunNvidiaAcceptancePackWithOptions runs the NVIDIA diagnostics via DCGM.
// diagLevel: 1=quick, 2=medium, 3=targeted stress, 4=extended stress.
// gpuIndices: specific GPU indices to test (empty = all GPUs).
// ctx cancellation kills the running job.
func (s *System) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (string, error) {
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia", nvidiaDCGMJobs(diagLevel, gpuIndices), logFunc)
}
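// RunMemoryAcceptancePack runs memtester between two `free -h` snapshots.
// Size and pass count come from BEE_MEMTESTER_SIZE_MB (default 128) and
// BEE_MEMTESTER_PASSES (default 1).
func (s *System) RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {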
sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128)
passes := envInt("BEE_MEMTESTER_PASSES", 1)
return runAcceptancePackCtx(ctx, baseDir, "memory", []satJob{
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
{name: "02-memtester.log", cmd: []string{"memtester", fmt.Sprintf("%dM", sizeMB), fmt.Sprintf("%d", passes)}},
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
}, logFunc)
}
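// RunMemoryStressPack runs a stress-ng VM stressor for durationSec seconds.
// A non-positive durationSec falls back to BEE_VM_STRESS_SECONDS (default 300).
func (s *System) RunMemoryStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {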
seconds := durationSec
if seconds <= 0 {
seconds = envInt("BEE_VM_STRESS_SECONDS", 300)
}
// Use 80% of RAM by default; override with BEE_VM_STRESS_SIZE_MB.
sizeArg := "80%"
if mb := envInt("BEE_VM_STRESS_SIZE_MB", 0); mb > 0 {
sizeArg = fmt.Sprintf("%dM", mb)
}
return runAcceptancePackCtx(ctx, baseDir, "memory-stress", []satJob{
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
{name: "02-stress-ng-vm.log", cmd: []string{
"stress-ng", "--vm", "1",
"--vm-bytes", sizeArg,
"--vm-method", "all",
"--timeout", fmt.Sprintf("%d", seconds),
"--metrics-brief",
}},
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
}, logFunc)
}
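// RunSATStressPack runs stressapptest for durationSec seconds.
// A non-positive durationSec falls back to BEE_SAT_STRESS_SECONDS (default 300);
// BEE_SAT_STRESS_MB, when set, is passed through as the -M memory size.
func (s *System) RunSATStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {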
seconds := durationSec
if seconds <= 0 {
seconds = envInt("BEE_SAT_STRESS_SECONDS", 300)
}
cmd := []string{"stressapptest", "-s", fmt.Sprintf("%d", seconds), "-W", "--cc_test"}
if mb := envInt("BEE_SAT_STRESS_MB", 0); mb > 0 {
cmd = append(cmd, "-M", fmt.Sprintf("%d", mb))
}
return runAcceptancePackCtx(ctx, baseDir, "sat-stress", []satJob{
{name: "01-free-before.log", cmd: []string{"free", "-h"}},
{name: "02-stressapptest.log", cmd: cmd},
{name: "03-free-after.log", cmd: []string{"free", "-h"}},
}, logFunc)
}
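// RunCPUAcceptancePack runs stress-ng on all CPUs for durationSec seconds (default 60),
// capturing lscpu and `sensors` readings before and after the load.
func (s *System) RunCPUAcceptancePack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {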
if durationSec <= 0 {
durationSec = 60
}
return runAcceptancePackCtx(ctx, baseDir, "cpu", []satJob{
{name: "01-lscpu.log", cmd: []string{"lscpu"}},
{name: "02-sensors-before.log", cmd: []string{"sensors"}},
{name: "03-stress-ng.log", cmd: []string{"stress-ng", "--cpu", "0", "--cpu-method", "all", "--timeout", fmt.Sprintf("%d", durationSec)}},
{name: "04-sensors-after.log", cmd: []string{"sensors"}},
}, logFunc)
}
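// RunStorageAcceptancePack runs NVMe or SMART checks against every non-USB disk
// reported by lsblk and returns the path of the resulting .tar.gz archive.
func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {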
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, "storage-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
devices, err := listStorageDevices()
if err != nil {
return "", err
}
sort.Strings(devices)
var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
if len(devices) == 0 {
fmt.Fprintln(&summary, "devices=0")
stats.Unsupported++
} else {
fmt.Fprintf(&summary, "devices=%d\n", len(devices))
}
for index, devPath := range devices {
if ctx.Err() != nil {
break
}
prefix := fmt.Sprintf("%02d-%s", index+1, filepath.Base(devPath))
commands := storageSATCommands(devPath)
for cmdIndex, job := range commands {
if ctx.Err() != nil {
break
}
name := fmt.Sprintf("%s-%02d-%s.log", prefix, cmdIndex+1, job.name)
out, err := runSATCommandCtx(ctx, verboseLog, job.name, job.cmd, nil, logFunc)
if writeErr := os.WriteFile(filepath.Join(runDir, name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := filepath.Base(devPath) + "_" + strings.ReplaceAll(job.name, "-", "_")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, "storage-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
type satJob struct {
name string
cmd []string
env []string // extra env vars (appended to os.Environ)
collectGPU bool // collect GPU metrics via nvidia-smi while this job runs
gpuIndices []int // GPU indices to collect metrics for (empty = all)
}
type satStats struct {
OK int
Failed int
Unsupported int
}
func nvidiaSATJobs() []satJob {
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
{name: "05-bee-gpu-burn.log", cmd: []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}},
}
}
func nvidiaDCGMJobs(diagLevel int, gpuIndices []int) []satJob {
if diagLevel < 1 || diagLevel > 4 {
diagLevel = 3
}
diagArgs := []string{"dcgmi", "diag", "-r", strconv.Itoa(diagLevel)}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
diagArgs = append(diagArgs, "-i", strings.Join(ids, ","))
}
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-dcgmi-diag.log", cmd: diagArgs},
}
}
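// runAcceptancePackCtx runs jobs sequentially, writes each job's output plus a
// summary.txt into a timestamped run directory under baseDir, and returns the
// path of the .tar.gz archive built from that directory.
func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob, logFunc func(string)) (string, error) {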
if ctx == nil {
ctx = context.Background()
}
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, prefix+"-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
for _, job := range jobs {
if ctx.Err() != nil {
break
}
cmd := make([]string, 0, len(job.cmd))
for _, arg := range job.cmd {
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
}
var out []byte
var err error
if job.collectGPU {
out, err = runSATCommandWithMetrics(ctx, verboseLog, job.name, cmd, job.env, job.gpuIndices, runDir, logFunc)
} else {
out, err = runSATCommandCtx(ctx, verboseLog, job.name, cmd, job.env, logFunc)
}
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
func runSATCommandCtx(ctx context.Context, verboseLog, name string, cmd []string, env []string, logFunc func(string)) ([]byte, error) {
start := time.Now().UTC()
resolvedCmd, err := resolveSATCommand(cmd)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
"cmd: "+strings.Join(resolvedCmd, " "),
)
if logFunc != nil {
logFunc(fmt.Sprintf("=== %s ===", name))
}
if err != nil {
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
"rc: 1",
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return []byte(err.Error() + "\n"), err
}
c := exec.CommandContext(ctx, resolvedCmd[0], resolvedCmd[1:]...)
if len(env) > 0 {
c.Env = append(os.Environ(), env...)
}
out, err := streamExecOutput(c, logFunc)
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
fmt.Sprintf("rc: %d", rc),
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return out, err
}
func listStorageDevices() ([]string, error) {
out, err := satExecCommand("lsblk", "-dn", "-o", "NAME,TYPE,TRAN").Output()
if err != nil {
return nil, err
}
return parseStorageDevices(string(out)), nil
}
func storageSATCommands(devPath string) []satJob {
if strings.Contains(filepath.Base(devPath), "nvme") {
return []satJob{
{name: "nvme-id-ctrl", cmd: []string{"nvme", "id-ctrl", devPath, "-o", "json"}},
{name: "nvme-smart-log", cmd: []string{"nvme", "smart-log", devPath, "-o", "json"}},
{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "-s", "1", "--wait"}},
}
}
return []satJob{
{name: "smartctl-health", cmd: []string{"smartctl", "-H", "-A", devPath}},
{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", "short", devPath}},
}
}
func (s *satStats) Add(status string) {
switch status {
case "OK":
s.OK++
case "UNSUPPORTED":
s.Unsupported++
default:
s.Failed++
}
}
func (s satStats) Overall() string {
if s.Failed > 0 {
return "FAILED"
}
if s.Unsupported > 0 {
return "PARTIAL"
}
return "OK"
}
func writeSATStats(summary *strings.Builder, stats satStats) {
fmt.Fprintf(summary, "overall_status=%s\n", stats.Overall())
fmt.Fprintf(summary, "job_ok=%d\n", stats.OK)
fmt.Fprintf(summary, "job_failed=%d\n", stats.Failed)
fmt.Fprintf(summary, "job_unsupported=%d\n", stats.Unsupported)
}
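// classifySATResult maps a job's output and error to "OK", "UNSUPPORTED", or "FAILED",
// together with a coarse return code (0 on success, 1 on any error).
func classifySATResult(name string, out []byte, err error) (string, int) {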
rc := 0
if err != nil {
rc = 1
}
if err == nil {
return "OK", rc
}
text := strings.ToLower(string(out))
// No output at all means the tool failed to start (mlock limit, binary missing,
// etc.) — we cannot say anything about hardware health → UNSUPPORTED.
if len(strings.TrimSpace(text)) == 0 {
return "UNSUPPORTED", rc
}
if strings.Contains(text, "unsupported") ||
strings.Contains(text, "not supported") ||
strings.Contains(text, "invalid opcode") ||
strings.Contains(text, "unknown command") ||
strings.Contains(text, "not implemented") ||
strings.Contains(text, "not available") ||
strings.Contains(text, "cuda_error_system_not_ready") ||
strings.Contains(text, "no such device") ||
// nvidia-smi on a machine with no NVIDIA GPU
strings.Contains(text, "couldn't communicate with the nvidia driver") ||
strings.Contains(text, "no nvidia gpu") ||
(strings.Contains(name, "self-test") && strings.Contains(text, "aborted")) {
return "UNSUPPORTED", rc
}
return "FAILED", rc
}
func runSATCommand(verboseLog, name string, cmd []string, logFunc func(string)) ([]byte, error) {
start := time.Now().UTC()
resolvedCmd, err := resolveSATCommand(cmd)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
"cmd: "+strings.Join(resolvedCmd, " "),
)
if logFunc != nil {
logFunc(fmt.Sprintf("=== %s ===", name))
}
if err != nil {
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
"rc: 1",
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return []byte(err.Error() + "\n"), err
}
out, err := streamExecOutput(satExecCommand(resolvedCmd[0], resolvedCmd[1:]...), logFunc)
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
fmt.Sprintf("rc: %d", rc),
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return out, err
}
func runROCmSMI(args ...string) ([]byte, error) {
cmd, err := resolveROCmSMICommand(args...)
if err != nil {
return nil, err
}
return satExecCommand(cmd[0], cmd[1:]...).CombinedOutput()
}
func resolveSATCommand(cmd []string) ([]string, error) {
if len(cmd) == 0 {
return nil, errors.New("empty SAT command")
}
switch cmd[0] {
case "rocm-smi":
return resolveROCmSMICommand(cmd[1:]...)
case "rvs":
return resolveRVSCommand(cmd[1:]...)
}
path, err := satLookPath(cmd[0])
if err != nil {
return nil, fmt.Errorf("%s not found in PATH: %w", cmd[0], err)
}
return append([]string{path}, cmd[1:]...), nil
}
func resolveRVSCommand(args ...string) ([]string, error) {
if path, err := satLookPath("rvs"); err == nil {
return append([]string{path}, args...), nil
}
if paths := expandExistingPaths(rvsExecutableGlobs); len(paths) > 0 {
return append([]string{paths[0]}, args...), nil
}
return nil, errors.New("rvs not found in PATH or under /opt/rocm")
}
func resolveROCmSMICommand(args ...string) ([]string, error) {
if path, err := satLookPath("rocm-smi"); err == nil {
return append([]string{path}, args...), nil
}
if paths := rocmSMIExecutableCandidates(); len(paths) > 0 {
return append([]string{paths[0]}, args...), nil
}
if pythonPath, err := satLookPath("python3"); err == nil {
if scripts := rocmSMIScriptCandidates(); len(scripts) > 0 {
return append([]string{pythonPath, scripts[0]}, args...), nil
}
}
return nil, errors.New("rocm-smi not found in PATH or under /opt/rocm")
}
func ensureAMDRuntimeReady() error {
if _, err := os.Stat("/dev/kfd"); err == nil {
return nil
}
if raw, err := os.ReadFile("/sys/module/amdgpu/initstate"); err == nil {
state := strings.TrimSpace(string(raw))
if strings.EqualFold(state, "live") {
return nil
}
return fmt.Errorf("AMD driver is present but not initialized: amdgpu initstate=%q", state)
}
return errors.New("AMD GPUs are present but the runtime is not initialized: /dev/kfd is missing and amdgpu is not loaded")
}
func rocmSMIExecutableCandidates() []string {
return expandExistingPaths(rocmSMIExecutableGlobs)
}
func rocmSMIScriptCandidates() []string {
return expandExistingPaths(rocmSMIScriptGlobs)
}
func expandExistingPaths(patterns []string) []string {
seen := make(map[string]struct{})
var paths []string
for _, pattern := range patterns {
matches, err := satGlob(pattern)
if err != nil {
continue
}
sort.Strings(matches)
for _, match := range matches {
if _, err := satStat(match); err != nil {
continue
}
if _, ok := seen[match]; ok {
continue
}
seen[match] = struct{}{}
paths = append(paths, match)
}
}
return paths
}
func parseStorageDevices(raw string) []string {
var devices []string
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
fields := strings.Fields(strings.TrimSpace(line))
if len(fields) < 2 || fields[1] != "disk" {
continue
}
if len(fields) >= 3 && strings.EqualFold(fields[2], "usb") {
continue
}
devices = append(devices, "/dev/"+fields[0])
}
return devices
}
// runSATCommandWithMetrics runs a command while collecting GPU metrics in the background.
// On completion it writes gpu-metrics.csv and gpu-metrics.html into runDir.
func runSATCommandWithMetrics(ctx context.Context, verboseLog, name string, cmd []string, env []string, gpuIndices []int, runDir string, logFunc func(string)) ([]byte, error) {
stopCh := make(chan struct{})
doneCh := make(chan struct{})
var metricRows []GPUMetricRow
start := time.Now()
go func() {
defer close(doneCh)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-stopCh:
return
case <-ticker.C:
samples, err := sampleGPUMetrics(gpuIndices)
if err != nil {
continue
}
elapsed := time.Since(start).Seconds()
for i := range samples {
samples[i].ElapsedSec = elapsed
}
metricRows = append(metricRows, samples...)
}
}
}()
out, err := runSATCommandCtx(ctx, verboseLog, name, cmd, env, logFunc)
close(stopCh)
<-doneCh
if len(metricRows) > 0 {
_ = WriteGPUMetricsCSV(filepath.Join(runDir, "gpu-metrics.csv"), metricRows)
_ = WriteGPUMetricsHTML(filepath.Join(runDir, "gpu-metrics.html"), metricRows)
chart := RenderGPUTerminalChart(metricRows)
_ = os.WriteFile(filepath.Join(runDir, "gpu-metrics-term.txt"), []byte(chart), 0644)
}
return out, err
}
func appendSATVerboseLog(path string, lines ...string) {
if path == "" {
return
}
f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
if err != nil {
return
}
defer f.Close()
for _, line := range lines {
_, _ = io.WriteString(f, line+"\n")
}
}
func envInt(name string, fallback int) int {
raw := strings.TrimSpace(os.Getenv(name))
if raw == "" {
return fallback
}
value, err := strconv.Atoi(raw)
if err != nil || value <= 0 {
return fallback
}
return value
}
func createTarGz(dst, srcDir string) error {
file, err := os.Create(dst)
if err != nil {
return err
}
defer file.Close()
gz := gzip.NewWriter(file)
defer gz.Close()
tw := tar.NewWriter(gz)
defer tw.Close()
base := filepath.Dir(srcDir)
return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if info.IsDir() {
return nil
}
header, err := tar.FileInfoHeader(info, "")
if err != nil {
return err
}
rel, err := filepath.Rel(base, path)
if err != nil {
return err
}
header.Name = rel
if err := tw.WriteHeader(header); err != nil {
return err
}
file, err := os.Open(path)
if err != nil {
return err
}
defer file.Close()
_, err = io.Copy(tw, file)
return err
})
}

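
How per-job statuses roll up into overall_status in summary.txt can be sketched as follows. This is a minimal standalone copy of satStats with its Add and Overall methods from the diff; the status sequence in main is an assumption chosen for illustration.

```go
package main

import "fmt"

// satStats counts job outcomes, as in the diff.
type satStats struct {
	OK          int
	Failed      int
	Unsupported int
}

// Add increments the counter for status; anything other than "OK" or
// "UNSUPPORTED" is counted as a failure.
func (s *satStats) Add(status string) {
	switch status {
	case "OK":
		s.OK++
	case "UNSUPPORTED":
		s.Unsupported++
	default:
		s.Failed++
	}
}

// Overall reports FAILED if any job failed, PARTIAL if any job was
// unsupported, and OK otherwise.
func (s satStats) Overall() string {
	if s.Failed > 0 {
		return "FAILED"
	}
	if s.Unsupported > 0 {
		return "PARTIAL"
	}
	return "OK"
}

func main() {
	var stats satStats
	// Hypothetical job results: two successes and one missing tool.
	for _, status := range []string{"OK", "UNSUPPORTED", "OK"} {
		stats.Add(status)
	}
	fmt.Println(stats.Overall()) // prints PARTIAL
}
```

A single failed job always wins over unsupported ones, so a pack that is missing a tool but otherwise passes reports PARTIAL rather than FAILED.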

@@ -0,0 +1,691 @@
package platform
import (
"context"
"encoding/json"
"fmt"
"os"
"os/exec"
"path/filepath"
"sort"
"strconv"
"strings"
"sync"
"time"
)
// FanStressOptions configures the fan-stress / thermal cycling test.
type FanStressOptions struct {
BaselineSec int // idle monitoring before and after load (default 30)
Phase1DurSec int // first load phase duration in seconds (default 300)
PauseSec int // pause between the two load phases (default 60)
Phase2DurSec int // second load phase duration in seconds (default 300)
SizeMB int // GPU memory to allocate per GPU during stress (default 64)
GPUIndices []int // which GPU indices to stress (empty = all detected)
}
// FanReading holds one fan sensor reading.
type FanReading struct {
Name string
RPM float64
}
// GPUStressMetric holds per-GPU metrics during the stress test.
type GPUStressMetric struct {
Index int
TempC float64
UsagePct float64
PowerW float64
ClockMHz float64
Throttled bool // true if any throttle reason is active
}
// FanStressRow is one second-interval telemetry sample covering all monitored dimensions.
type FanStressRow struct {
TimestampUTC string
ElapsedSec float64
Phase string // "baseline", "load1", "pause", "load2", "cooldown"
GPUs []GPUStressMetric
Fans []FanReading
CPUMaxTempC float64 // highest CPU temperature from ipmitool / sensors
SysPowerW float64 // DCMI system power reading
}
// RunFanStressTest runs a two-phase GPU stress test while monitoring fan speeds,
// temperatures, and power draw every second. Exports metrics.csv and fan-sensors.csv.
// Designed to reproduce case-04 fan-speed lag and detect GPU thermal throttling.
func (s *System) RunFanStressTest(ctx context.Context, baseDir string, opts FanStressOptions) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
applyFanStressDefaults(&opts)
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, "fan-stress-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
// Phase name shared between sampler goroutine and main goroutine.
var phaseMu sync.Mutex
currentPhase := "init"
setPhase := func(name string) {
phaseMu.Lock()
currentPhase = name
phaseMu.Unlock()
}
getPhase := func() string {
phaseMu.Lock()
defer phaseMu.Unlock()
return currentPhase
}
start := time.Now()
var rowsMu sync.Mutex
var allRows []FanStressRow
// Start background sampler (every second).
stopCh := make(chan struct{})
doneCh := make(chan struct{})
go func() {
defer close(doneCh)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-stopCh:
return
case <-ticker.C:
row := sampleFanStressRow(opts.GPUIndices, getPhase(), time.Since(start).Seconds())
rowsMu.Lock()
allRows = append(allRows, row)
rowsMu.Unlock()
}
}
}()
var summary strings.Builder
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
stats := satStats{}
// idlePhase sleeps for durSec while the sampler stamps phaseName on each row.
idlePhase := func(phaseName, stepName string, durSec int) {
if ctx.Err() != nil {
return
}
setPhase(phaseName)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s (idle %ds)", time.Now().UTC().Format(time.RFC3339), stepName, durSec),
)
select {
case <-ctx.Done():
case <-time.After(time.Duration(durSec) * time.Second):
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), stepName),
)
fmt.Fprintf(&summary, "%s_status=OK\n", stepName)
stats.OK++
}
// loadPhase runs bee-gpu-burn for durSec; sampler stamps phaseName on each row.
loadPhase := func(phaseName, stepName string, durSec int) {
if ctx.Err() != nil {
return
}
setPhase(phaseName)
cmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(durSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
}
if len(opts.GPUIndices) > 0 {
cmd = append(cmd, "--devices", joinIndexList(dedupeSortedIndices(opts.GPUIndices)))
}
out, err := runSATCommandCtx(ctx, verboseLog, stepName, cmd, nil, nil)
_ = os.WriteFile(filepath.Join(runDir, stepName+".log"), out, 0644)
if err != nil && err != context.Canceled && err.Error() != "signal: killed" {
fmt.Fprintf(&summary, "%s_status=FAILED\n", stepName)
stats.Failed++
} else {
fmt.Fprintf(&summary, "%s_status=OK\n", stepName)
stats.OK++
}
}
// Execute test phases.
idlePhase("baseline", "01-baseline", opts.BaselineSec)
loadPhase("load1", "02-load1", opts.Phase1DurSec)
idlePhase("pause", "03-pause", opts.PauseSec)
loadPhase("load2", "04-load2", opts.Phase2DurSec)
idlePhase("cooldown", "05-cooldown", opts.BaselineSec)
// Stop sampler and collect rows.
close(stopCh)
<-doneCh
rowsMu.Lock()
rows := allRows
rowsMu.Unlock()
// Analysis.
throttled := analyzeThrottling(rows)
maxGPUTemp := analyzeMaxTemp(rows, func(r FanStressRow) float64 {
var m float64
for _, g := range r.GPUs {
if g.TempC > m {
m = g.TempC
}
}
return m
})
maxCPUTemp := analyzeMaxTemp(rows, func(r FanStressRow) float64 {
return r.CPUMaxTempC
})
fanResponseSec := analyzeFanResponse(rows)
fmt.Fprintf(&summary, "throttling_detected=%v\n", throttled)
fmt.Fprintf(&summary, "max_gpu_temp_c=%.1f\n", maxGPUTemp)
fmt.Fprintf(&summary, "max_cpu_temp_c=%.1f\n", maxCPUTemp)
if fanResponseSec >= 0 {
fmt.Fprintf(&summary, "fan_response_sec=%.1f\n", fanResponseSec)
} else {
fmt.Fprintf(&summary, "fan_response_sec=N/A\n")
}
// Throttling failure counts against overall result.
if throttled {
stats.Failed++
}
writeSATStats(&summary, stats)
// Write CSV outputs.
if err := WriteFanStressCSV(filepath.Join(runDir, "metrics.csv"), rows, opts.GPUIndices); err != nil {
return "", err
}
_ = WriteFanSensorsCSV(filepath.Join(runDir, "fan-sensors.csv"), rows)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, "fan-stress-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
func applyFanStressDefaults(opts *FanStressOptions) {
if opts.BaselineSec <= 0 {
opts.BaselineSec = 30
}
if opts.Phase1DurSec <= 0 {
opts.Phase1DurSec = 300
}
if opts.PauseSec <= 0 {
opts.PauseSec = 60
}
if opts.Phase2DurSec <= 0 {
opts.Phase2DurSec = 300
}
if opts.SizeMB <= 0 {
opts.SizeMB = 64
}
}
// sampleFanStressRow collects all metrics for one telemetry sample.
func sampleFanStressRow(gpuIndices []int, phase string, elapsed float64) FanStressRow {
row := FanStressRow{
TimestampUTC: time.Now().UTC().Format(time.RFC3339),
ElapsedSec: elapsed,
Phase: phase,
}
row.GPUs = sampleGPUStressMetrics(gpuIndices)
row.Fans, _ = sampleFanSpeeds()
row.CPUMaxTempC = sampleCPUMaxTemp()
row.SysPowerW = sampleSystemPower()
return row
}
// sampleGPUStressMetrics queries nvidia-smi for temperature, utilization, power,
// clock frequency, and active throttle reasons for each GPU.
func sampleGPUStressMetrics(gpuIndices []int) []GPUStressMetric {
args := []string{
"--query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics,clocks_throttle_reasons.active",
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
}
out, err := exec.Command("nvidia-smi", args...).Output()
if err != nil {
return nil
}
var metrics []GPUStressMetric
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.Split(line, ", ")
if len(parts) < 6 {
continue
}
idx, _ := strconv.Atoi(strings.TrimSpace(parts[0]))
throttleVal := strings.TrimSpace(parts[5])
// Throttled if active reasons bitmask is non-zero.
throttled := throttleVal != "0x0000000000000000" &&
throttleVal != "0x0" &&
throttleVal != "0" &&
throttleVal != "" &&
throttleVal != "N/A"
metrics = append(metrics, GPUStressMetric{
Index: idx,
TempC: parseGPUFloat(parts[1]),
UsagePct: parseGPUFloat(parts[2]),
PowerW: parseGPUFloat(parts[3]),
ClockMHz: parseGPUFloat(parts[4]),
Throttled: throttled,
})
}
return metrics
}
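The throttle check above compares the field against a fixed list of spellings of zero. A possible alternative, sketched here on the assumption that the field is a hexadecimal bitmask such as `0x0000000000000004`, is to parse it numerically. `isThrottled` is a hypothetical helper and not part of this repository. Unlike the string comparison, it treats any unparseable value as "not throttled".

```go
package main

import (
	"strconv"
	"strings"
)

// isThrottled reports whether a throttle-reasons field holds a non-zero bitmask.
// Empty or unparseable values (for example "N/A") are treated as "not throttled".
func isThrottled(field string) bool {
	s := strings.TrimSpace(field)
	s = strings.TrimPrefix(strings.TrimPrefix(s, "0x"), "0X")
	if s == "" {
		return false
	}
	mask, err := strconv.ParseUint(s, 16, 64)
	if err != nil {
		return false
	}
	return mask != 0
}
```

Parsing the mask also leaves room to ignore individual reason bits later, should some of them turn out not to indicate a real slowdown.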
// sampleFanSpeeds reads fan RPM values from "ipmitool sdr type Fan" and falls
// back to "sensors -j" when ipmitool fails or reports no usable fan readings.
func sampleFanSpeeds() ([]FanReading, error) {
out, err := exec.Command("ipmitool", "sdr", "type", "Fan").Output()
if err == nil {
if fans := parseFanSpeeds(string(out)); len(fans) > 0 {
return fans, nil
}
}
fans, sensorsErr := sampleFanSpeedsViaSensorsJSON()
if len(fans) > 0 {
return fans, nil
}
if err != nil {
return nil, err
}
return nil, sensorsErr
}
// parseFanSpeeds parses "ipmitool sdr type Fan" output.
// Handles two formats:
//
// Old: "FAN1 | 2400.000 | RPM | ok" (value in col[1], unit in col[2])
// New: "FAN1 | 41h | ok | 29.1 | 4340 RPM" (value+unit combined in last col)
func parseFanSpeeds(raw string) []FanReading {
var fans []FanReading
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
parts := strings.Split(line, "|")
if len(parts) < 2 {
continue
}
name := strings.TrimSpace(parts[0])
// Find the first field that contains "RPM" (either as a standalone unit or inline)
rpmVal := 0.0
found := false
for _, p := range parts[1:] {
p = strings.TrimSpace(p)
if !strings.Contains(strings.ToUpper(p), "RPM") {
continue
}
if strings.EqualFold(p, "RPM") {
continue // unit-only column in old format; value is in previous field
}
val, err := parseFanRPMValue(p)
if err == nil {
rpmVal = val
found = true
break
}
}
// Old format: unit "RPM" is in col[2], value is in col[1]
if !found && len(parts) >= 3 && strings.EqualFold(strings.TrimSpace(parts[2]), "RPM") {
valStr := strings.TrimSpace(parts[1])
if !strings.EqualFold(valStr, "na") && !strings.EqualFold(valStr, "disabled") && valStr != "" {
if val, err := parseFanRPMValue(valStr); err == nil {
rpmVal = val
found = true
}
}
}
if !found {
continue
}
fans = append(fans, FanReading{Name: name, RPM: rpmVal})
}
return fans
}
func parseFanRPMValue(raw string) (float64, error) {
fields := strings.Fields(strings.TrimSpace(strings.ReplaceAll(raw, ",", "")))
if len(fields) == 0 {
return 0, strconv.ErrSyntax
}
return strconv.ParseFloat(fields[0], 64)
}
func sampleFanSpeedsViaSensorsJSON() ([]FanReading, error) {
out, err := exec.Command("sensors", "-j").Output()
if err != nil || len(out) == 0 {
return nil, err
}
var doc map[string]map[string]any
if err := json.Unmarshal(out, &doc); err != nil {
return nil, err
}
chips := make([]string, 0, len(doc))
for chip := range doc {
chips = append(chips, chip)
}
sort.Strings(chips)
var fans []FanReading
seen := map[string]struct{}{}
for _, chip := range chips {
features := doc[chip]
names := make([]string, 0, len(features))
for name := range features {
names = append(names, name)
}
sort.Strings(names)
for _, name := range names {
feature, ok := features[name].(map[string]any)
if !ok {
continue
}
rpm, ok := firstFanInputValue(feature)
if !ok || rpm <= 0 {
continue
}
label := strings.TrimSpace(name)
if chip != "" && !strings.Contains(strings.ToLower(label), strings.ToLower(chip)) {
label = chip + " / " + label
}
if _, ok := seen[label]; ok {
continue
}
seen[label] = struct{}{}
fans = append(fans, FanReading{Name: label, RPM: rpm})
}
}
return fans, nil
}
func firstFanInputValue(feature map[string]any) (float64, bool) {
keys := make([]string, 0, len(feature))
for key := range feature {
keys = append(keys, key)
}
sort.Strings(keys)
for _, key := range keys {
lower := strings.ToLower(key)
if !strings.Contains(lower, "fan") || !strings.HasSuffix(lower, "_input") {
continue
}
switch value := feature[key].(type) {
case float64:
return value, true
case string:
f, err := strconv.ParseFloat(value, 64)
if err == nil {
return f, true
}
}
}
return 0, false
}
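The lm-sensors fallback above walks a chip, feature, value hierarchy. The sketch below shows that traversal on a hand-written document, assuming `sensors -j` output is shaped like the sample in the usage note. `fanInputs` and the chip name are illustrative, not taken from this repository.

```go
package main

import (
	"encoding/json"
	"strings"
)

// fanInputs extracts "<chip>/<feature>" -> RPM from sensors-style JSON.
func fanInputs(raw []byte) (map[string]float64, error) {
	var doc map[string]map[string]any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	out := map[string]float64{}
	for chip, features := range doc {
		for name, v := range features {
			// Non-object entries such as "Adapter" (a plain string) are skipped.
			sub, ok := v.(map[string]any)
			if !ok {
				continue
			}
			for key, val := range sub {
				rpm, isNum := val.(float64)
				if isNum && strings.HasPrefix(key, "fan") && strings.HasSuffix(key, "_input") {
					out[chip+"/"+name] = rpm
				}
			}
		}
	}
	return out, nil
}
```

For an input such as `{"nct6798-isa-0290":{"Adapter":"ISA adapter","fan1":{"fan1_input":1200.0}}}` the result holds a single entry keyed `nct6798-isa-0290/fan1`. Unlike the code above, this sketch does not sort keys, so it is not suitable where output order matters.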
// sampleCPUMaxTemp returns the highest temperature reported by any ipmitool
// temperature sensor, or by lm-sensors when the ipmitool call fails.
func sampleCPUMaxTemp() float64 {
out, err := exec.Command("ipmitool", "sdr", "type", "Temperature").Output()
if err != nil {
return sampleCPUTempViaSensors()
}
return parseIPMIMaxTemp(string(out))
}
// parseIPMIMaxTemp extracts the maximum temperature from "ipmitool sdr type Temperature".
func parseIPMIMaxTemp(raw string) float64 {
var max float64
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
parts := strings.Split(line, "|")
if len(parts) < 3 {
continue
}
unit := strings.TrimSpace(parts[2])
if !strings.Contains(strings.ToLower(unit), "degrees") {
continue
}
valStr := strings.TrimSpace(parts[1])
if strings.EqualFold(valStr, "na") || valStr == "" {
continue
}
val, err := strconv.ParseFloat(valStr, 64)
if err != nil {
continue
}
if val > max {
max = val
}
}
return max
}
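parseIPMIMaxTemp expects the value in the second column and the unit in the third. The fan parser earlier in this file also accepts a layout where value and unit share the last column. If temperature output can arrive in that combined layout too (an assumption; the sample lines in the usage note are illustrative), a variant could scan every column. `maxTempAnyLayout` is a hypothetical name.

```go
package main

import (
	"strconv"
	"strings"
)

// maxTempAnyLayout returns the highest reading found in pipe-separated sensor lines.
func maxTempAnyLayout(raw string) float64 {
	var max float64
	for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
		parts := strings.Split(line, "|")
		for i := 1; i < len(parts); i++ {
			p := strings.TrimSpace(parts[i])
			if !strings.Contains(strings.ToLower(p), "degrees") {
				continue
			}
			// Combined layout: "45 degrees C". Split layout: value sits in the previous column.
			valStr := strings.Fields(p)[0]
			if strings.EqualFold(valStr, "degrees") && i > 1 {
				valStr = strings.TrimSpace(parts[i-1])
			}
			if val, err := strconv.ParseFloat(valStr, 64); err == nil && val > max {
				max = val
			}
		}
	}
	return max
}
```

With this sketch both `CPU Temp | 45.000 | degrees C | ok` and `CPU1 Temp | 01h | ok | 3.1 | 45 degrees C` yield 45, while rows whose value is `na` are skipped because they fail to parse.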
// sampleCPUTempViaSensors falls back to lm-sensors when ipmitool is unavailable.
func sampleCPUTempViaSensors() float64 {
out, err := exec.Command("sensors", "-u").Output()
if err != nil {
return 0
}
var max float64
for _, line := range strings.Split(string(out), "\n") {
line = strings.TrimSpace(line)
fields := strings.Fields(line)
if len(fields) < 2 {
continue
}
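// Only temperature features count; voltage, fan, and power readings also end in "_input:".
if !strings.HasPrefix(fields[0], "temp") || !strings.HasSuffix(fields[0], "_input:") {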
continue
}
val, err := strconv.ParseFloat(fields[1], 64)
if err != nil {
continue
}
if val > 0 && val < 150 && val > max {
max = val
}
}
return max
}
// sampleSystemPower reads system power draw via DCMI.
func sampleSystemPower() float64 {
out, err := exec.Command("ipmitool", "dcmi", "power", "reading").Output()
if err != nil {
return 0
}
return parseDCMIPowerReading(string(out))
}
// parseDCMIPowerReading extracts the instantaneous power reading from ipmitool dcmi output.
// Sample: " Instantaneous power reading: 500 Watts"
func parseDCMIPowerReading(raw string) float64 {
for _, line := range strings.Split(raw, "\n") {
if !strings.Contains(strings.ToLower(line), "instantaneous") {
continue
}
parts := strings.Fields(line)
for i, p := range parts {
if strings.EqualFold(p, "Watts") && i > 0 {
val, err := strconv.ParseFloat(parts[i-1], 64)
if err == nil {
return val
}
}
}
}
return 0
}
// analyzeThrottling returns true if any GPU reported an active throttle reason
// during either load phase.
func analyzeThrottling(rows []FanStressRow) bool {
for _, row := range rows {
if row.Phase != "load1" && row.Phase != "load2" {
continue
}
for _, gpu := range row.GPUs {
if gpu.Throttled {
return true
}
}
}
return false
}
// analyzeMaxTemp returns the maximum value of the given extractor across all rows.
func analyzeMaxTemp(rows []FanStressRow, extract func(FanStressRow) float64) float64 {
var max float64
for _, row := range rows {
if v := extract(row); v > max {
max = v
}
}
return max
}
// analyzeFanResponse returns the seconds from load1 start until the average fan
// RPM of a sample first reaches 105% of the baseline average. Returns -1 if
// there is no baseline data, no load1 sample, or the threshold is never reached.
func analyzeFanResponse(rows []FanStressRow) float64 {
// Compute baseline average fan RPM.
var baseTotal, baseCount float64
for _, row := range rows {
if row.Phase != "baseline" {
continue
}
for _, f := range row.Fans {
baseTotal += f.RPM
baseCount++
}
}
if baseCount == 0 || baseTotal == 0 {
return -1
}
baseAvg := baseTotal / baseCount
threshold := baseAvg * 1.05 // 5% increase signals fan ramp-up
// Find elapsed time when load1 started.
var load1Start float64 = -1
for _, row := range rows {
if row.Phase == "load1" {
load1Start = row.ElapsedSec
break
}
}
if load1Start < 0 {
return -1
}
// Find first load1 row where average RPM crosses the threshold.
for _, row := range rows {
if row.Phase != "load1" {
continue
}
var total, count float64
for _, f := range row.Fans {
total += f.RPM
count++
}
if count > 0 && total/count >= threshold {
return row.ElapsedSec - load1Start
}
}
return -1
}
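As a worked example of the threshold logic above: a baseline average of 4000 RPM gives a threshold of 4200 RPM, and the response time is measured from the first load1 sample. The sketch below uses a simplified `sample` type, a hypothetical stand-in for FanStressRow carrying one averaged RPM per row. It averages per-row means, whereas the function above averages individual fan readings.

```go
package main

// sample is a simplified stand-in for FanStressRow: one averaged RPM value per row.
type sample struct {
	phase   string
	elapsed float64
	avgRPM  float64
}

// fanResponseSec returns the seconds from the first load1 sample until the
// average RPM reaches 105% of the baseline mean, or -1 if that never happens.
func fanResponseSec(rows []sample) float64 {
	var sum, n float64
	for _, r := range rows {
		if r.phase == "baseline" {
			sum += r.avgRPM
			n++
		}
	}
	if n == 0 || sum == 0 {
		return -1
	}
	threshold := sum / n * 1.05
	start := -1.0
	for _, r := range rows {
		if r.phase != "load1" {
			continue
		}
		if start < 0 {
			start = r.elapsed // first load1 sample marks the start of load
		}
		if r.avgRPM >= threshold {
			return r.elapsed - start
		}
	}
	return -1
}
```

With baseline samples at 4000 RPM and load1 samples of 4000, 4100, and 4300 RPM at 30 s, 31 s, and 33 s, the function reports a response time of 3 seconds.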
// WriteFanStressCSV writes the wide-format metrics CSV with one row per telemetry sample.
// GPU columns are generated per index in gpuIndices order.
func WriteFanStressCSV(path string, rows []FanStressRow, gpuIndices []int) error {
if len(rows) == 0 {
return os.WriteFile(path, []byte("no data\n"), 0644)
}
var b strings.Builder
// Header: fixed system columns + per-GPU columns.
b.WriteString("timestamp_utc,elapsed_sec,phase,fan_avg_rpm,fan_min_rpm,fan_max_rpm,cpu_max_temp_c,sys_power_w")
for _, idx := range gpuIndices {
fmt.Fprintf(&b, ",gpu%d_temp_c,gpu%d_usage_pct,gpu%d_power_w,gpu%d_clock_mhz,gpu%d_throttled",
idx, idx, idx, idx, idx)
}
b.WriteRune('\n')
for _, row := range rows {
favg, fmin, fmax := fanRPMStats(row.Fans)
fmt.Fprintf(&b, "%s,%.1f,%s,%.0f,%.0f,%.0f,%.1f,%.1f",
row.TimestampUTC,
row.ElapsedSec,
row.Phase,
favg, fmin, fmax,
row.CPUMaxTempC,
row.SysPowerW,
)
gpuByIdx := make(map[int]GPUStressMetric, len(row.GPUs))
for _, g := range row.GPUs {
gpuByIdx[g.Index] = g
}
for _, idx := range gpuIndices {
g := gpuByIdx[idx]
throttled := 0
if g.Throttled {
throttled = 1
}
fmt.Fprintf(&b, ",%.1f,%.1f,%.1f,%.0f,%d",
g.TempC, g.UsagePct, g.PowerW, g.ClockMHz, throttled)
}
b.WriteRune('\n')
}
return os.WriteFile(path, []byte(b.String()), 0644)
}
// WriteFanSensorsCSV writes individual fan sensor readings in long (tidy) format.
func WriteFanSensorsCSV(path string, rows []FanStressRow) error {
var b strings.Builder
b.WriteString("timestamp_utc,elapsed_sec,phase,fan_name,rpm\n")
for _, row := range rows {
for _, f := range row.Fans {
fmt.Fprintf(&b, "%s,%.1f,%s,%s,%.0f\n",
row.TimestampUTC, row.ElapsedSec, row.Phase, f.Name, f.RPM)
}
}
return os.WriteFile(path, []byte(b.String()), 0644)
}
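WriteFanSensorsCSV writes fan names without escaping, so a name containing a comma would shift the columns of that row. One way to guard against this, shown as a sketch with hypothetical names (`fanRow`, `fanRowsCSV`), is to let the standard `encoding/csv` writer decide when a field needs quoting.

```go
package main

import (
	"encoding/csv"
	"strconv"
	"strings"
)

// fanRow is a simplified stand-in for one (sample, fan) pair.
type fanRow struct {
	timestamp string
	elapsed   float64
	phase     string
	name      string
	rpm       float64
}

// fanRowsCSV renders the long-format CSV and quotes fields only where required.
func fanRowsCSV(rows []fanRow) (string, error) {
	var b strings.Builder
	w := csv.NewWriter(&b)
	if err := w.Write([]string{"timestamp_utc", "elapsed_sec", "phase", "fan_name", "rpm"}); err != nil {
		return "", err
	}
	for _, r := range rows {
		rec := []string{
			r.timestamp,
			strconv.FormatFloat(r.elapsed, 'f', 1, 64),
			r.phase,
			r.name,
			strconv.FormatFloat(r.rpm, 'f', 0, 64),
		}
		if err := w.Write(rec); err != nil {
			return "", err
		}
	}
	w.Flush()
	return b.String(), w.Error()
}
```

The numeric formatting mirrors the `%.1f` and `%.0f` verbs used above, so rows without special characters come out in the same form as before.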
// fanRPMStats computes average, min, max RPM across all fans in a sample row.
func fanRPMStats(fans []FanReading) (avg, min, max float64) {
if len(fans) == 0 {
return 0, 0, 0
}
min = fans[0].RPM
max = fans[0].RPM
var total float64
for _, f := range fans {
total += f.RPM
if f.RPM < min {
min = f.RPM
}
if f.RPM > max {
max = f.RPM
}
}
return total / float64(len(fans)), min, max
}


@@ -0,0 +1,27 @@
package platform
import "testing"
func TestParseFanSpeeds(t *testing.T) {
raw := "FAN1 | 2400.000 | RPM | ok\nFAN2 | 1800 RPM | ok | ok\nFAN3 | na | RPM | ns\n"
got := parseFanSpeeds(raw)
if len(got) != 2 {
t.Fatalf("fans=%d want 2 (%v)", len(got), got)
}
if got[0].Name != "FAN1" || got[0].RPM != 2400 {
t.Fatalf("fan0=%+v", got[0])
}
if got[1].Name != "FAN2" || got[1].RPM != 1800 {
t.Fatalf("fan1=%+v", got[1])
}
}
func TestFirstFanInputValue(t *testing.T) {
feature := map[string]any{
"fan1_input": 9200.0,
}
got, ok := firstFanInputValue(feature)
if !ok || got != 9200 {
t.Fatalf("got=%v ok=%v", got, ok)
}
}


@@ -0,0 +1,346 @@
package platform
import (
"errors"
"os"
"os/exec"
"path/filepath"
"strings"
"testing"
)
func TestStorageSATCommands(t *testing.T) {
t.Parallel()
nvme := storageSATCommands("/dev/nvme0n1")
if len(nvme) != 3 || nvme[2].cmd[0] != "nvme" {
t.Fatalf("unexpected nvme commands: %#v", nvme)
}
sata := storageSATCommands("/dev/sda")
if len(sata) != 2 || sata[0].cmd[0] != "smartctl" {
t.Fatalf("unexpected sata commands: %#v", sata)
}
}
func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
t.Parallel()
jobs := nvidiaSATJobs()
if len(jobs) != 5 {
t.Fatalf("jobs=%d want 5", len(jobs))
}
if got := jobs[4].cmd[0]; got != "bee-gpu-burn" {
t.Fatalf("gpu stress command=%q want bee-gpu-burn", got)
}
if got := jobs[3].cmd[1]; got != "--output-file" {
t.Fatalf("bug report flag=%q want --output-file", got)
}
}
func TestAMDStressConfigUsesSingleGSTAction(t *testing.T) {
t.Parallel()
cfg := amdStressRVSConfig(123)
if !strings.Contains(cfg, "module: gst") {
t.Fatalf("config missing gst module:\n%s", cfg)
}
if strings.Contains(cfg, "module: mem") {
t.Fatalf("config should not include mem module:\n%s", cfg)
}
if !strings.Contains(cfg, "copy_matrix: false") {
t.Fatalf("config should use copy_matrix=false:\n%s", cfg)
}
if strings.Count(cfg, "duration: 123000") != 1 {
t.Fatalf("config should apply duration once:\n%s", cfg)
}
for _, field := range []string{"matrix_size_a: 8640", "matrix_size_b: 8640", "matrix_size_c: 8640"} {
if !strings.Contains(cfg, field) {
t.Fatalf("config missing %s:\n%s", field, cfg)
}
}
}
func TestAMDStressJobsIncludeBandwidthAndGST(t *testing.T) {
t.Parallel()
jobs := amdStressJobs(300, "/tmp/test-amd-gst.conf")
if len(jobs) != 4 {
t.Fatalf("jobs=%d want 4", len(jobs))
}
if got := jobs[1].cmd[0]; got != "rocm-bandwidth-test" {
t.Fatalf("jobs[1]=%q want rocm-bandwidth-test", got)
}
if got := jobs[2].cmd[0]; got != "rvs" {
t.Fatalf("jobs[2]=%q want rvs", got)
}
if got := jobs[2].cmd[2]; got != "/tmp/test-amd-gst.conf" {
t.Fatalf("jobs[2] cfg=%q want /tmp/test-amd-gst.conf", got)
}
}
func TestNvidiaSATJobsUseBuiltinBurnDefaults(t *testing.T) {
jobs := nvidiaSATJobs()
got := jobs[4].cmd
want := []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}
if len(got) != len(want) {
t.Fatalf("cmd len=%d want %d", len(got), len(want))
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("cmd[%d]=%q want %q", i, got[i], want[i])
}
}
}
func TestBuildNvidiaStressJobUsesSelectedLoaderAndDevices(t *testing.T) {
// Not parallel: this test swaps the package-level satExecCommand hook.
oldExecCommand := satExecCommand
satExecCommand = func(name string, args ...string) *exec.Cmd {
if name == "nvidia-smi" {
return exec.Command("sh", "-c", "printf '0\n1\n2\n'")
}
return exec.Command(name, args...)
}
t.Cleanup(func() { satExecCommand = oldExecCommand })
job, err := buildNvidiaStressJob(NvidiaStressOptions{
DurationSec: 600,
Loader: NvidiaStressLoaderJohn,
ExcludeGPUIndices: []int{1},
})
if err != nil {
t.Fatalf("buildNvidiaStressJob error: %v", err)
}
wantCmd := []string{"bee-john-gpu-stress", "--seconds", "600", "--devices", "0,2"}
if len(job.cmd) != len(wantCmd) {
t.Fatalf("cmd len=%d want %d (%v)", len(job.cmd), len(wantCmd), job.cmd)
}
for i := range wantCmd {
if job.cmd[i] != wantCmd[i] {
t.Fatalf("cmd[%d]=%q want %q", i, job.cmd[i], wantCmd[i])
}
}
if got := joinIndexList(job.gpuIndices); got != "0,2" {
t.Fatalf("gpuIndices=%q want 0,2", got)
}
}
func TestBuildNvidiaStressJobUsesNCCLLoader(t *testing.T) {
// Not parallel: this test swaps the package-level satExecCommand hook.
oldExecCommand := satExecCommand
satExecCommand = func(name string, args ...string) *exec.Cmd {
if name == "nvidia-smi" {
return exec.Command("sh", "-c", "printf '0\n1\n2\n'")
}
return exec.Command(name, args...)
}
t.Cleanup(func() { satExecCommand = oldExecCommand })
job, err := buildNvidiaStressJob(NvidiaStressOptions{
DurationSec: 120,
Loader: NvidiaStressLoaderNCCL,
GPUIndices: []int{2, 0},
})
if err != nil {
t.Fatalf("buildNvidiaStressJob error: %v", err)
}
wantCmd := []string{"bee-nccl-gpu-stress", "--seconds", "120", "--devices", "0,2"}
if len(job.cmd) != len(wantCmd) {
t.Fatalf("cmd len=%d want %d (%v)", len(job.cmd), len(wantCmd), job.cmd)
}
for i := range wantCmd {
if job.cmd[i] != wantCmd[i] {
t.Fatalf("cmd[%d]=%q want %q", i, job.cmd[i], wantCmd[i])
}
}
if got := joinIndexList(job.gpuIndices); got != "0,2" {
t.Fatalf("gpuIndices=%q want 0,2", got)
}
}
func TestNvidiaStressArchivePrefixByLoader(t *testing.T) {
t.Parallel()
tests := []struct {
loader string
want string
}{
{loader: NvidiaStressLoaderBuiltin, want: "gpu-nvidia-burn"},
{loader: NvidiaStressLoaderJohn, want: "gpu-nvidia-john"},
{loader: NvidiaStressLoaderNCCL, want: "gpu-nvidia-nccl"},
{loader: "", want: "gpu-nvidia-burn"},
}
for _, tt := range tests {
if got := nvidiaStressArchivePrefix(tt.loader); got != tt.want {
t.Fatalf("loader=%q prefix=%q want %q", tt.loader, got, tt.want)
}
}
}
func TestEnvIntFallback(t *testing.T) {
os.Unsetenv("BEE_MEMTESTER_SIZE_MB")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
t.Fatalf("got %d want 123", got)
}
t.Setenv("BEE_MEMTESTER_SIZE_MB", "bad")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
t.Fatalf("got %d want 123", got)
}
t.Setenv("BEE_MEMTESTER_SIZE_MB", "256")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 256 {
t.Fatalf("got %d want 256", got)
}
}
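The test above pins down the contract of envInt: the fallback is returned when the variable is unset or not a valid integer. The implementation is not part of this excerpt, so the following is only a sketch of a function that would satisfy those cases. `intOrFallback` and `envIntSketch` are hypothetical names.

```go
package main

import (
	"os"
	"strconv"
	"strings"
)

// intOrFallback parses raw as a base-10 integer and returns fallback when
// raw is empty or malformed.
func intOrFallback(raw string, fallback int) int {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return fallback
	}
	v, err := strconv.Atoi(raw)
	if err != nil {
		return fallback
	}
	return v
}

// envIntSketch reads the named environment variable through intOrFallback.
func envIntSketch(name string, fallback int) int {
	return intOrFallback(os.Getenv(name), fallback)
}
```

Keeping the parsing separate from the environment lookup lets the three cases in the test be checked without touching process-wide state.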
func TestClassifySATResult(t *testing.T) {
tests := []struct {
name string
job string
out string
err error
status string
}{
{name: "ok", job: "memtester", out: "done", err: nil, status: "OK"},
{name: "unsupported", job: "smartctl-self-test-short", out: "Self-test not supported", err: errors.New("rc 1"), status: "UNSUPPORTED"},
{name: "failed", job: "bee-gpu-burn", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
{name: "cuda not ready", job: "bee-gpu-burn", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got, _ := classifySATResult(tt.job, []byte(tt.out), tt.err)
if got != tt.status {
t.Fatalf("status=%q want %q", got, tt.status)
}
})
}
}
func TestParseStorageDevicesSkipsUSBDisks(t *testing.T) {
t.Parallel()
raw := "nvme0n1 disk nvme\nsda disk usb\nloop0 loop\nsdb disk sata\n"
got := parseStorageDevices(raw)
want := []string{"/dev/nvme0n1", "/dev/sdb"}
if len(got) != len(want) {
t.Fatalf("len(devices)=%d want %d (%v)", len(got), len(want), got)
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("devices[%d]=%q want %q", i, got[i], want[i])
}
}
}
func TestResolveROCmSMICommandFromPATH(t *testing.T) {
t.Setenv("PATH", t.TempDir())
toolPath := filepath.Join(os.Getenv("PATH"), "rocm-smi")
if err := os.WriteFile(toolPath, []byte("#!/bin/sh\nexit 0\n"), 0755); err != nil {
t.Fatalf("write rocm-smi: %v", err)
}
cmd, err := resolveROCmSMICommand("--showproductname")
if err != nil {
t.Fatalf("resolveROCmSMICommand error: %v", err)
}
if len(cmd) != 2 {
t.Fatalf("cmd len=%d want 2 (%v)", len(cmd), cmd)
}
if cmd[0] != toolPath {
t.Fatalf("cmd[0]=%q want %q", cmd[0], toolPath)
}
}
func TestResolveSATCommandUsesLookPathForGenericTools(t *testing.T) {
oldLookPath := satLookPath
satLookPath = func(file string) (string, error) {
if file == "stress-ng" {
return "/usr/bin/stress-ng", nil
}
return "", exec.ErrNotFound
}
t.Cleanup(func() { satLookPath = oldLookPath })
cmd, err := resolveSATCommand([]string{"stress-ng", "--cpu", "0"})
if err != nil {
t.Fatalf("resolveSATCommand error: %v", err)
}
if len(cmd) != 3 {
t.Fatalf("cmd len=%d want 3 (%v)", len(cmd), cmd)
}
if cmd[0] != "/usr/bin/stress-ng" {
t.Fatalf("cmd[0]=%q want /usr/bin/stress-ng", cmd[0])
}
}
func TestResolveSATCommandFailsForMissingGenericTool(t *testing.T) {
oldLookPath := satLookPath
satLookPath = func(file string) (string, error) {
return "", exec.ErrNotFound
}
t.Cleanup(func() { satLookPath = oldLookPath })
_, err := resolveSATCommand([]string{"stress-ng", "--cpu", "0"})
if err == nil {
t.Fatal("expected error")
}
if !strings.Contains(err.Error(), "stress-ng not found in PATH") {
t.Fatalf("error=%q", err)
}
}
func TestResolveROCmSMICommandFallsBackToROCmTree(t *testing.T) {
tmp := t.TempDir()
execPath := filepath.Join(tmp, "opt", "rocm", "bin", "rocm-smi")
if err := os.MkdirAll(filepath.Dir(execPath), 0755); err != nil {
t.Fatalf("mkdir: %v", err)
}
if err := os.WriteFile(execPath, []byte("#!/bin/sh\nexit 0\n"), 0755); err != nil {
t.Fatalf("write rocm-smi: %v", err)
}
oldGlob := rocmSMIExecutableGlobs
oldScriptGlobs := rocmSMIScriptGlobs
rocmSMIExecutableGlobs = []string{execPath}
rocmSMIScriptGlobs = nil
t.Cleanup(func() {
rocmSMIExecutableGlobs = oldGlob
rocmSMIScriptGlobs = oldScriptGlobs
})
t.Setenv("PATH", "")
cmd, err := resolveROCmSMICommand("--showallinfo")
if err != nil {
t.Fatalf("resolveROCmSMICommand error: %v", err)
}
if len(cmd) != 2 {
t.Fatalf("cmd len=%d want 2 (%v)", len(cmd), cmd)
}
if cmd[0] != execPath {
t.Fatalf("cmd[0]=%q want %q", cmd[0], execPath)
}
}
func TestRunROCmSMIReportsMissingCommand(t *testing.T) {
oldLookPath := satLookPath
oldExecGlobs := rocmSMIExecutableGlobs
oldScriptGlobs := rocmSMIScriptGlobs
satLookPath = func(string) (string, error) { return "", exec.ErrNotFound }
rocmSMIExecutableGlobs = nil
rocmSMIScriptGlobs = nil
t.Cleanup(func() {
satLookPath = oldLookPath
rocmSMIExecutableGlobs = oldExecGlobs
rocmSMIScriptGlobs = oldScriptGlobs
})
if _, err := runROCmSMI("--showproductname"); err == nil {
t.Fatal("expected missing rocm-smi error")
}
}


@@ -0,0 +1,58 @@
package platform
import (
"os/exec"
"path/filepath"
"sort"
"strings"
)
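// ListBeeServices returns the sorted, de-duplicated names of bee-* systemd
// service units found in the system unit directories.
func (s *System) ListBeeServices() ([]string, error) {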
seen := map[string]bool{}
var out []string
for _, pattern := range []string{"/etc/systemd/system/bee-*.service", "/lib/systemd/system/bee-*.service"} {
matches, err := filepath.Glob(pattern)
if err != nil {
return nil, err
}
for _, match := range matches {
name := strings.TrimSuffix(filepath.Base(match), ".service")
// Skip template units (e.g. bee-journal-mirror@) — they have no instances to query.
if strings.HasSuffix(name, "@") {
continue
}
if !seen[name] {
seen[name] = true
out = append(out, name)
}
}
}
sort.Strings(out)
return out, nil
}
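// ServiceState returns the unit's active state as reported by systemctl,
// or "unknown" when the state cannot be determined.
func (s *System) ServiceState(name string) string {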
raw, err := exec.Command("systemctl", "is-active", name).CombinedOutput()
if err == nil {
return strings.TrimSpace(string(raw))
}
raw, err = exec.Command("systemctl", "show", name, "--property=ActiveState", "--value").CombinedOutput()
if err != nil {
return "unknown"
}
state := strings.TrimSpace(string(raw))
if state == "" {
return "unknown"
}
return state
}
func (s *System) ServiceDo(name string, action ServiceAction) (string, error) {
raw, err := exec.Command("systemctl", string(action), name).CombinedOutput()
return string(raw), err
}
func (s *System) ServiceStatus(name string) (string, error) {
raw, err := exec.Command("systemctl", "status", name, "--no-pager").CombinedOutput()
return string(raw), err
}


@@ -0,0 +1,49 @@
package platform
import "testing"
func TestSplitQuotedFields(t *testing.T) {
t.Parallel()
line := `NAME="sdb1" TYPE="part" LABEL="BEE EXPORT" MODEL="USB DISK 3.0"`
got := splitQuotedFields(line)
want := []string{
`NAME="sdb1"`,
`TYPE="part"`,
`LABEL="BEE EXPORT"`,
`MODEL="USB DISK 3.0"`,
}
if len(got) != len(want) {
t.Fatalf("len(got)=%d len(want)=%d; got=%q", len(got), len(want), got)
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("got[%d]=%q want %q", i, got[i], want[i])
}
}
}
func TestParseLSBLKPairs(t *testing.T) {
t.Parallel()
line := `NAME="sdb1" TYPE="part" PKNAME="sdb" RM="1" FSTYPE="vfat" MOUNTPOINT="" SIZE="57.3G" LABEL="BEE EXPORT" MODEL="USB DISK 3.0"`
got := parseLSBLKPairs(line)
checks := map[string]string{
"NAME": "sdb1",
"TYPE": "part",
"PKNAME": "sdb",
"RM": "1",
"FSTYPE": "vfat",
"MOUNTPOINT": "",
"SIZE": "57.3G",
"LABEL": "BEE EXPORT",
"MODEL": "USB DISK 3.0",
}
for key, want := range checks {
if got[key] != want {
t.Fatalf("got[%s]=%q want %q", key, got[key], want)
}
}
}
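The two tests above describe a tokenizer for `KEY="value"` lines in which quoted values may contain spaces. The real splitQuotedFields and parseLSBLKPairs are not shown in this excerpt. The sketch below is one way to meet the same expectations, using the hypothetical names `splitQuoted` and `parsePairs`, and it does not handle escaped quotes inside values.

```go
package main

import "strings"

// splitQuoted splits a line on spaces that fall outside double quotes.
func splitQuoted(line string) []string {
	var fields []string
	var cur strings.Builder
	inQuotes := false
	for _, r := range line {
		switch {
		case r == '"':
			inQuotes = !inQuotes
			cur.WriteRune(r)
		case r == ' ' && !inQuotes:
			if cur.Len() > 0 {
				fields = append(fields, cur.String())
				cur.Reset()
			}
		default:
			cur.WriteRune(r)
		}
	}
	if cur.Len() > 0 {
		fields = append(fields, cur.String())
	}
	return fields
}

// parsePairs turns KEY="value" tokens into a map, dropping the surrounding quotes.
func parsePairs(line string) map[string]string {
	out := map[string]string{}
	for _, field := range splitQuoted(line) {
		key, value, ok := strings.Cut(field, "=")
		if !ok {
			continue
		}
		out[key] = strings.Trim(value, `"`)
	}
	return out
}
```

An empty value such as `MOUNTPOINT=""` is kept as a key with an empty string, which matches what TestParseLSBLKPairs checks for.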


@@ -0,0 +1,150 @@
package platform
import (
"encoding/json"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
)
var techDumpFixedCommands = []struct {
Name string
Args []string
File string
}{
{Name: "dmidecode", Args: []string{"-t", "0"}, File: "dmidecode-type0.txt"},
{Name: "dmidecode", Args: []string{"-t", "1"}, File: "dmidecode-type1.txt"},
{Name: "dmidecode", Args: []string{"-t", "2"}, File: "dmidecode-type2.txt"},
{Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"},
{Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"},
{Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"},
{Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"},
{Name: "sensors", Args: []string{"-j"}, File: "sensors.json"},
{Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"},
{Name: "ipmitool", Args: []string{"sdr"}, File: "ipmitool-sdr.txt"},
{Name: "nvme", Args: []string{"list", "-o", "json"}, File: "nvme-list.json"},
}
var techDumpNvidiaCommands = []struct {
Name string
Args []string
File string
}{
{Name: "nvidia-smi", Args: []string{"-q"}, File: "nvidia-smi-q.txt"},
{Name: "nvidia-smi", Args: []string{"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown", "--format=csv,noheader,nounits"}, File: "nvidia-smi-query.csv"},
}
type lsblkDumpRoot struct {
Blockdevices []struct {
Name string `json:"name"`
Type string `json:"type"`
Tran string `json:"tran"`
} `json:"blockdevices"`
}
type nvmeDumpRoot struct {
Devices []struct {
DevicePath string `json:"DevicePath"`
} `json:"Devices"`
}
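// CaptureTechnicalDump writes raw diagnostic tool output into baseDir, one file
// per command. Commands that fail without producing output are skipped silently.
func (s *System) CaptureTechnicalDump(baseDir string) error {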
if err := os.MkdirAll(baseDir, 0755); err != nil {
return err
}
for _, cmd := range techDumpFixedCommands {
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
}
switch s.DetectGPUVendor() {
case "nvidia":
for _, cmd := range techDumpNvidiaCommands {
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
}
case "amd":
writeROCmSMIDump(filepath.Join(baseDir, "rocm-smi.txt"))
writeROCmSMIDump(filepath.Join(baseDir, "rocm-smi-showallinfo.txt"), "--showallinfo")
}
for _, dev := range lsblkDumpDevices(filepath.Join(baseDir, "lsblk.json")) {
writeCommandDump(filepath.Join(baseDir, "smartctl-"+sanitizeDumpName(dev)+".json"), "smartctl", "-j", "-a", "/dev/"+dev)
}
for _, dev := range nvmeDumpDevices(filepath.Join(baseDir, "nvme-list.json")) {
writeCommandDump(filepath.Join(baseDir, "nvme-id-ctrl-"+sanitizeDumpName(dev)+".json"), "nvme", "id-ctrl", dev, "-o", "json")
writeCommandDump(filepath.Join(baseDir, "nvme-smart-log-"+sanitizeDumpName(dev)+".json"), "nvme", "smart-log", dev, "-o", "json")
}
return nil
}
func writeCommandDump(path, name string, args ...string) {
out, err := exec.Command(name, args...).CombinedOutput()
if err != nil && len(out) == 0 {
return
}
_ = os.WriteFile(path, out, 0644)
}
func writeROCmSMIDump(path string, args ...string) {
out, err := runROCmSMI(args...)
if err != nil && len(out) == 0 {
return
}
_ = os.WriteFile(path, out, 0644)
}
func lsblkDumpDevices(path string) []string {
raw, err := os.ReadFile(path)
if err != nil {
return nil
}
var root lsblkDumpRoot
if err := json.Unmarshal(raw, &root); err != nil {
return nil
}
var devices []string
for _, dev := range root.Blockdevices {
if strings.EqualFold(strings.TrimSpace(dev.Tran), "usb") {
continue
}
if dev.Type == "disk" && strings.TrimSpace(dev.Name) != "" {
devices = append(devices, strings.TrimSpace(dev.Name))
}
}
sort.Strings(devices)
return devices
}
func nvmeDumpDevices(path string) []string {
raw, err := os.ReadFile(path)
if err != nil {
return nil
}
var root nvmeDumpRoot
if err := json.Unmarshal(raw, &root); err != nil {
return nil
}
seen := map[string]bool{}
var devices []string
for _, dev := range root.Devices {
name := strings.TrimSpace(dev.DevicePath)
if name == "" || seen[name] {
continue
}
seen[name] = true
devices = append(devices, name)
}
sort.Strings(devices)
return devices
}
func sanitizeDumpName(value string) string {
value = strings.TrimSpace(value)
value = strings.TrimPrefix(value, "/dev/")
value = strings.ReplaceAll(value, "/", "_")
if value == "" {
return "unknown"
}
return value
}


@@ -0,0 +1,48 @@
package platform
import (
"os"
"path/filepath"
"reflect"
"testing"
)
func TestLSBLKDumpDevices(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "lsblk.json")
if err := os.WriteFile(path, []byte(`{"blockdevices":[{"name":"sda","type":"disk","tran":"usb"},{"name":"sda1","type":"part"},{"name":"nvme0n1","type":"disk","tran":"nvme"},{"name":"sdb","type":"disk","tran":"sata"}]}`), 0644); err != nil {
t.Fatalf("write lsblk fixture: %v", err)
}
got := lsblkDumpDevices(path)
want := []string{"nvme0n1", "sdb"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("lsblkDumpDevices=%v want %v", got, want)
}
}
func TestNVMEDumpDevices(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "nvme-list.json")
if err := os.WriteFile(path, []byte(`{"Devices":[{"DevicePath":"/dev/nvme1n1"},{"DevicePath":"/dev/nvme0n1"},{"DevicePath":"/dev/nvme1n1"}]}`), 0644); err != nil {
t.Fatalf("write nvme fixture: %v", err)
}
got := nvmeDumpDevices(path)
want := []string{"/dev/nvme0n1", "/dev/nvme1n1"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("nvmeDumpDevices=%v want %v", got, want)
}
}
func TestSanitizeDumpName(t *testing.T) {
t.Parallel()
if got := sanitizeDumpName("/dev/nvme0n1"); got != "nvme0n1" {
t.Fatalf("sanitizeDumpName=%q want nvme0n1", got)
}
}


@@ -0,0 +1,29 @@
package platform
import (
"fmt"
"os"
"os/exec"
"strings"
)
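// TailFile returns the last `lines` lines of the file at path. If lines is not
// positive or the file has no more than that many lines, the whole file is
// returned. A read error is returned as the message text.
func (s *System) TailFile(path string, lines int) string {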
raw, err := os.ReadFile(path)
if err != nil {
return fmt.Sprintf("read %s: %v", path, err)
}
all := strings.Split(strings.TrimRight(string(raw), "\n"), "\n")
if lines <= 0 || len(all) <= lines {
return string(raw)
}
return strings.Join(all[len(all)-lines:], "\n")
}
func (s *System) CheckTools(names []string) []ToolStatus {
out := make([]ToolStatus, 0, len(names))
for _, name := range names {
path, err := exec.LookPath(name)
out = append(out, ToolStatus{Name: name, Path: path, OK: err == nil})
}
return out
}


@@ -0,0 +1,70 @@
package platform
type System struct{}
type InterfaceInfo struct {
Name string
State string
IPv4 []string
}
type NetworkInterfaceSnapshot struct {
Name string
Up bool
IPv4 []string
}
type NetworkSnapshot struct {
Interfaces []NetworkInterfaceSnapshot
DefaultRoutes []string
ResolvConf string
}
type ServiceAction string
const (
ServiceStart ServiceAction = "start"
ServiceStop ServiceAction = "stop"
ServiceRestart ServiceAction = "restart"
)
type StaticIPv4Config struct {
Interface string
Address string
Prefix string
Gateway string
DNS []string
}
type RemovableTarget struct {
Device string
FSType string
Size string
Label string
Model string
Mountpoint string
}
type ToolStatus struct {
Name string
Path string
OK bool
}
const (
NvidiaStressLoaderBuiltin = "builtin"
NvidiaStressLoaderJohn = "john"
NvidiaStressLoaderNCCL = "nccl"
)
type NvidiaStressOptions struct {
DurationSec int
SizeMB int
Loader string
GPUIndices []int
ExcludeGPUIndices []int
}
func New() *System {
return &System{}
}


@@ -0,0 +1,77 @@
package runtimeenv
import (
"fmt"
"os"
"strings"
)
type Mode string
const (
ModeAuto Mode = "auto"
ModeLocal Mode = "local"
ModeLiveCD Mode = "livecd"
)
type Info struct {
Mode Mode
Detected bool
Reason string
}
// ParseMode normalizes a runtime mode string. The empty string is treated as auto.
func ParseMode(raw string) (Mode, error) {
mode := Mode(strings.TrimSpace(strings.ToLower(raw)))
switch mode {
case "", ModeAuto:
return ModeAuto, nil
case ModeLocal, ModeLiveCD:
return mode, nil
default:
return "", fmt.Errorf("invalid runtime %q — use auto, local, or livecd", raw)
}
}
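// Detect resolves the runtime mode. Precedence: explicit flag, the BEE_RUNTIME
// environment variable, the /etc/bee-release marker, the kernel command line,
// and finally local as the default.
func Detect(flagValue string) (Info, error) {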
flagMode, err := ParseMode(flagValue)
if err != nil {
return Info{}, err
}
if flagMode != ModeAuto {
return Info{Mode: flagMode, Reason: "flag"}, nil
}
if envMode, ok := getenvMode("BEE_RUNTIME"); ok {
return Info{Mode: envMode, Reason: "env:BEE_RUNTIME"}, nil
}
if fileExists("/etc/bee-release") {
return Info{Mode: ModeLiveCD, Detected: true, Reason: "marker:/etc/bee-release"}, nil
}
if data, err := os.ReadFile("/proc/cmdline"); err == nil {
cmdline := string(data)
if strings.Contains(cmdline, " boot=live") || strings.HasPrefix(cmdline, "boot=live ") || strings.Contains(cmdline, "live-media") {
return Info{Mode: ModeLiveCD, Detected: true, Reason: "kernel:boot=live"}, nil
}
}
return Info{Mode: ModeLocal, Detected: true, Reason: "default:local"}, nil
}
func getenvMode(name string) (Mode, bool) {
value := strings.TrimSpace(os.Getenv(name))
if value == "" {
return "", false
}
mode, err := ParseMode(value)
if err != nil || mode == ModeAuto {
return "", false
}
return mode, true
}
func fileExists(path string) bool {
info, err := os.Stat(path)
return err == nil && !info.IsDir()
}
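Detect checks the kernel command line with substring matches, which depend on the spaces around `boot=live`. A token-based check is one alternative, sketched here with the hypothetical helper `hasLiveBootParam`. It also matches when `boot=live` is the only or the last parameter, and it requires `live-media` to start a parameter rather than appear anywhere in the line.

```go
package main

import "strings"

// hasLiveBootParam reports whether a kernel command line carries the boot=live
// parameter or any parameter that starts with "live-media".
func hasLiveBootParam(cmdline string) bool {
	for _, tok := range strings.Fields(cmdline) {
		if tok == "boot=live" || strings.HasPrefix(tok, "live-media") {
			return true
		}
	}
	return false
}
```

Because `strings.Fields` splits on any whitespace, the trailing newline in `/proc/cmdline` needs no special handling.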


@@ -0,0 +1,67 @@
package runtimeenv

import (
	"testing"
)

func TestParseMode(t *testing.T) {
	t.Parallel()
	tests := []struct {
		in   string
		want Mode
		ok   bool
	}{
		{in: "", want: ModeAuto, ok: true},
		{in: "auto", want: ModeAuto, ok: true},
		{in: "local", want: ModeLocal, ok: true},
		{in: "livecd", want: ModeLiveCD, ok: true},
		{in: "bad", ok: false},
	}
	for _, test := range tests {
		got, err := ParseMode(test.in)
		if test.ok && err != nil {
			t.Fatalf("ParseMode(%q): %v", test.in, err)
		}
		if !test.ok && err == nil {
			t.Fatalf("ParseMode(%q): expected error", test.in)
		}
		if test.ok && got != test.want {
			t.Fatalf("ParseMode(%q): got %q want %q", test.in, got, test.want)
		}
	}
}

func TestDetectHonorsFlag(t *testing.T) {
	t.Parallel()
	info, err := Detect("livecd")
	if err != nil {
		t.Fatalf("Detect(flag): %v", err)
	}
	if info.Mode != ModeLiveCD || info.Reason != "flag" {
		t.Fatalf("unexpected info: %+v", info)
	}
}

func TestDetectHonorsEnv(t *testing.T) {
	// Not parallel: t.Setenv changes process-wide state, restores the
	// previous value of BEE_RUNTIME on cleanup, and panics if the test
	// has called t.Parallel.
	t.Setenv("BEE_RUNTIME", "local")
	info, err := Detect("auto")
	if err != nil {
		t.Fatalf("Detect(env): %v", err)
	}
	if info.Mode != ModeLocal || info.Reason != "env:BEE_RUNTIME" {
		t.Fatalf("unexpected info: %+v", info)
	}
}


@@ -2,17 +2,55 @@
// core/internal/ingest/parser_hardware.go. No import dependency on core.
package schema
// HardwareIngestRequest is the top-level output document produced by the audit binary.
// HardwareIngestRequest is the top-level output document produced by `bee audit`.
// It is accepted as-is by the core /api/ingest/hardware endpoint.
type HardwareIngestRequest struct {
Filename *string `json:"filename"`
SourceType *string `json:"source_type"`
Protocol *string `json:"protocol"`
TargetHost string `json:"target_host"`
Filename *string `json:"filename,omitempty"`
SourceType *string `json:"source_type,omitempty"`
Protocol *string `json:"protocol,omitempty"`
TargetHost *string `json:"target_host,omitempty"`
CollectedAt string `json:"collected_at"`
Runtime *RuntimeHealth `json:"runtime,omitempty"`
Hardware HardwareSnapshot `json:"hardware"`
}
type RuntimeHealth struct {
Status string `json:"status"`
CheckedAt string `json:"checked_at"`
ExportDir string `json:"export_dir,omitempty"`
DriverReady bool `json:"driver_ready,omitempty"`
CUDAReady bool `json:"cuda_ready,omitempty"`
NetworkStatus string `json:"network_status,omitempty"`
Issues []RuntimeIssue `json:"issues,omitempty"`
Tools []RuntimeToolStatus `json:"tools,omitempty"`
Services []RuntimeServiceStatus `json:"services,omitempty"`
Interfaces []RuntimeInterface `json:"interfaces,omitempty"`
}
type RuntimeIssue struct {
Code string `json:"code"`
Severity string `json:"severity,omitempty"`
Description string `json:"description"`
}
type RuntimeToolStatus struct {
Name string `json:"name"`
Path string `json:"path,omitempty"`
OK bool `json:"ok"`
}
type RuntimeServiceStatus struct {
Name string `json:"name"`
Status string `json:"status"`
}
type RuntimeInterface struct {
Name string `json:"name"`
State string `json:"state,omitempty"`
IPv4 []string `json:"ipv4,omitempty"`
Outcome string `json:"outcome,omitempty"`
}
type HardwareSnapshot struct {
Board HardwareBoard `json:"board"`
Firmware []HardwareFirmwareRecord `json:"firmware,omitempty"`
@@ -21,14 +59,33 @@ type HardwareSnapshot struct {
Storage []HardwareStorage `json:"storage,omitempty"`
PCIeDevices []HardwarePCIeDevice `json:"pcie_devices,omitempty"`
PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"`
Sensors *HardwareSensors `json:"sensors,omitempty"`
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
}
type HardwareHealthSummary struct {
Status string `json:"status"`
Warnings []string `json:"warnings,omitempty"`
Failures []string `json:"failures,omitempty"`
StorageWarn int `json:"storage_warn,omitempty"`
StorageFail int `json:"storage_fail,omitempty"`
PCIeWarn int `json:"pcie_warn,omitempty"`
PCIeFail int `json:"pcie_fail,omitempty"`
PSUWarn int `json:"psu_warn,omitempty"`
PSUFail int `json:"psu_fail,omitempty"`
MemoryWarn int `json:"memory_warn,omitempty"`
MemoryFail int `json:"memory_fail,omitempty"`
EmptyDIMMs int `json:"empty_dimms,omitempty"`
MissingPSUs int `json:"missing_psus,omitempty"`
CollectedAt string `json:"collected_at,omitempty"`
}
type HardwareBoard struct {
Manufacturer *string `json:"manufacturer"`
ProductName *string `json:"product_name"`
Manufacturer *string `json:"manufacturer,omitempty"`
ProductName *string `json:"product_name,omitempty"`
SerialNumber string `json:"serial_number"`
PartNumber *string `json:"part_number"`
UUID *string `json:"uuid"`
PartNumber *string `json:"part_number,omitempty"`
UUID *string `json:"uuid,omitempty"`
}
type HardwareFirmwareRecord struct {
@@ -37,77 +94,196 @@ type HardwareFirmwareRecord struct {
}
type HardwareCPU struct {
Socket *int `json:"socket"`
Model *string `json:"model"`
Manufacturer *string `json:"manufacturer"`
Status *string `json:"status"`
SerialNumber *string `json:"serial_number"`
Firmware *string `json:"firmware"`
Cores *int `json:"cores"`
Threads *int `json:"threads"`
FrequencyMHz *int `json:"frequency_mhz"`
MaxFrequencyMHz *int `json:"max_frequency_mhz"`
HardwareComponentStatus
Socket *int `json:"socket,omitempty"`
Model *string `json:"model,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Firmware *string `json:"firmware,omitempty"`
Cores *int `json:"cores,omitempty"`
Threads *int `json:"threads,omitempty"`
FrequencyMHz *int `json:"frequency_mhz,omitempty"`
MaxFrequencyMHz *int `json:"max_frequency_mhz,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
Throttled *bool `json:"throttled,omitempty"`
CorrectableErrorCount *int64 `json:"correctable_error_count,omitempty"`
UncorrectableErrorCount *int64 `json:"uncorrectable_error_count,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
Present *bool `json:"present,omitempty"`
}
type HardwareMemory struct {
Slot *string `json:"slot"`
Location *string `json:"location"`
Present *bool `json:"present"`
SizeMB *int `json:"size_mb"`
Type *string `json:"type"`
MaxSpeedMHz *int `json:"max_speed_mhz"`
CurrentSpeedMHz *int `json:"current_speed_mhz"`
Manufacturer *string `json:"manufacturer"`
SerialNumber *string `json:"serial_number"`
PartNumber *string `json:"part_number"`
Status *string `json:"status"`
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
Location *string `json:"location,omitempty"`
Present *bool `json:"present,omitempty"`
SizeMB *int `json:"size_mb,omitempty"`
Type *string `json:"type,omitempty"`
MaxSpeedMHz *int `json:"max_speed_mhz,omitempty"`
CurrentSpeedMHz *int `json:"current_speed_mhz,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
PartNumber *string `json:"part_number,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
CorrectableECCErrorCount *int64 `json:"correctable_ecc_error_count,omitempty"`
UncorrectableECCErrorCount *int64 `json:"uncorrectable_ecc_error_count,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
SpareBlocksRemainingPct *float64 `json:"spare_blocks_remaining_pct,omitempty"`
PerformanceDegraded *bool `json:"performance_degraded,omitempty"`
DataLossDetected *bool `json:"data_loss_detected,omitempty"`
}
type HardwareStorage struct {
Slot *string `json:"slot"`
Type *string `json:"type"`
Model *string `json:"model"`
SizeGB *int `json:"size_gb"`
SerialNumber *string `json:"serial_number"`
Manufacturer *string `json:"manufacturer"`
Firmware *string `json:"firmware"`
Interface *string `json:"interface"`
Present *bool `json:"present"`
Status *string `json:"status"`
Telemetry map[string]any `json:"telemetry,omitempty"`
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
Type *string `json:"type,omitempty"`
Model *string `json:"model,omitempty"`
SizeGB *int `json:"size_gb,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
Firmware *string `json:"firmware,omitempty"`
Interface *string `json:"interface,omitempty"`
Present *bool `json:"present,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerOnHours *int64 `json:"power_on_hours,omitempty"`
PowerCycles *int64 `json:"power_cycles,omitempty"`
UnsafeShutdowns *int64 `json:"unsafe_shutdowns,omitempty"`
MediaErrors *int64 `json:"media_errors,omitempty"`
ErrorLogEntries *int64 `json:"error_log_entries,omitempty"`
WrittenBytes *int64 `json:"written_bytes,omitempty"`
ReadBytes *int64 `json:"read_bytes,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
AvailableSparePct *float64 `json:"available_spare_pct,omitempty"`
ReallocatedSectors *int64 `json:"reallocated_sectors,omitempty"`
CurrentPendingSectors *int64 `json:"current_pending_sectors,omitempty"`
OfflineUncorrectable *int64 `json:"offline_uncorrectable,omitempty"`
Telemetry map[string]any `json:"-"`
}
type HardwarePCIeDevice struct {
Slot *string `json:"slot"`
VendorID *int `json:"vendor_id"`
DeviceID *int `json:"device_id"`
BDF *string `json:"bdf"`
DeviceClass *string `json:"device_class"`
Manufacturer *string `json:"manufacturer"`
Model *string `json:"model"`
LinkWidth *int `json:"link_width"`
LinkSpeed *string `json:"link_speed"`
MaxLinkWidth *int `json:"max_link_width"`
MaxLinkSpeed *string `json:"max_link_speed"`
SerialNumber *string `json:"serial_number"`
Firmware *string `json:"firmware"`
Present *bool `json:"present"`
Status *string `json:"status"`
Telemetry map[string]any `json:"telemetry,omitempty"`
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
VendorID *int `json:"vendor_id,omitempty"`
DeviceID *int `json:"device_id,omitempty"`
NUMANode *int `json:"numa_node,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
ECCCorrectedTotal *int64 `json:"ecc_corrected_total,omitempty"`
ECCUncorrectedTotal *int64 `json:"ecc_uncorrected_total,omitempty"`
HWSlowdown *bool `json:"hw_slowdown,omitempty"`
BatteryChargePct *float64 `json:"battery_charge_pct,omitempty"`
BatteryHealthPct *float64 `json:"battery_health_pct,omitempty"`
BatteryTemperatureC *float64 `json:"battery_temperature_c,omitempty"`
BatteryVoltageV *float64 `json:"battery_voltage_v,omitempty"`
BatteryReplaceRequired *bool `json:"battery_replace_required,omitempty"`
SFPTemperatureC *float64 `json:"sfp_temperature_c,omitempty"`
SFPTXPowerDBM *float64 `json:"sfp_tx_power_dbm,omitempty"`
SFPRXPowerDBM *float64 `json:"sfp_rx_power_dbm,omitempty"`
SFPVoltageV *float64 `json:"sfp_voltage_v,omitempty"`
SFPBiasMA *float64 `json:"sfp_bias_ma,omitempty"`
BDF *string `json:"-"`
DeviceClass *string `json:"device_class,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
Model *string `json:"model,omitempty"`
LinkWidth *int `json:"link_width,omitempty"`
LinkSpeed *string `json:"link_speed,omitempty"`
MaxLinkWidth *int `json:"max_link_width,omitempty"`
MaxLinkSpeed *string `json:"max_link_speed,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Firmware *string `json:"firmware,omitempty"`
MacAddresses []string `json:"mac_addresses,omitempty"`
Present *bool `json:"present,omitempty"`
Telemetry map[string]any `json:"-"`
}
type HardwarePowerSupply struct {
Slot *string `json:"slot"`
Present *bool `json:"present"`
Model *string `json:"model"`
Vendor *string `json:"vendor"`
WattageW *int `json:"wattage_w"`
SerialNumber *string `json:"serial_number"`
PartNumber *string `json:"part_number"`
Firmware *string `json:"firmware"`
Status *string `json:"status"`
InputType *string `json:"input_type"`
InputPowerW *float64 `json:"input_power_w"`
OutputPowerW *float64 `json:"output_power_w"`
InputVoltage *float64 `json:"input_voltage"`
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
Present *bool `json:"present,omitempty"`
Model *string `json:"model,omitempty"`
Vendor *string `json:"vendor,omitempty"`
WattageW *int `json:"wattage_w,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
PartNumber *string `json:"part_number,omitempty"`
Firmware *string `json:"firmware,omitempty"`
InputType *string `json:"input_type,omitempty"`
InputPowerW *float64 `json:"input_power_w,omitempty"`
OutputPowerW *float64 `json:"output_power_w,omitempty"`
InputVoltage *float64 `json:"input_voltage,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
}
type HardwareComponentStatus struct {
Status *string `json:"status,omitempty"`
StatusCheckedAt *string `json:"status_checked_at,omitempty"`
StatusChangedAt *string `json:"status_changed_at,omitempty"`
StatusHistory []HardwareStatusHistory `json:"status_history,omitempty"`
ErrorDescription *string `json:"error_description,omitempty"`
ManufacturedYearWeek *string `json:"manufactured_year_week,omitempty"`
}
type HardwareStatusHistory struct {
Status string `json:"status"`
ChangedAt string `json:"changed_at"`
Details *string `json:"details,omitempty"`
}
type HardwareSensors struct {
Fans []HardwareFanSensor `json:"fans,omitempty"`
Power []HardwarePowerSensor `json:"power,omitempty"`
Temperatures []HardwareTemperatureSensor `json:"temperatures,omitempty"`
Other []HardwareOtherSensor `json:"other,omitempty"`
}
type HardwareFanSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
RPM *int `json:"rpm,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwarePowerSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
VoltageV *float64 `json:"voltage_v,omitempty"`
CurrentA *float64 `json:"current_a,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareTemperatureSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
Celsius *float64 `json:"celsius,omitempty"`
ThresholdWarningCelsius *float64 `json:"threshold_warning_celsius,omitempty"`
ThresholdCriticalCelsius *float64 `json:"threshold_critical_celsius,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareOtherSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
Value *float64 `json:"value,omitempty"`
Unit *string `json:"unit,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareEventLog struct {
Source string `json:"source"`
EventTime *string `json:"event_time,omitempty"`
Severity *string `json:"severity,omitempty"`
MessageID *string `json:"message_id,omitempty"`
Message string `json:"message"`
ComponentRef *string `json:"component_ref,omitempty"`
Fingerprint *string `json:"fingerprint,omitempty"`
IsActive *bool `json:"is_active,omitempty"`
RawPayload map[string]any `json:"raw_payload,omitempty"`
}
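
The schema mixes three tag styles: plain keys, `omitempty`, and `json:"-"`. One consequence is that a plain `bool` with `omitempty` (as in `RuntimeHealth.DriverReady`) disappears from the output when it is `false`, whereas a `*bool` with `omitempty` is dropped only when nil. A minimal sketch of that behaviour, using a hypothetical `sample` struct rather than the real schema types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sample is a hypothetical struct that mirrors the tag styles used in the
// schema; it is not one of the schema types.
type sample struct {
	Ready    bool           `json:"ready,omitempty"`   // dropped when false
	Present  *bool          `json:"present,omitempty"` // dropped only when nil
	Serial   string         `json:"serial_number"`     // always emitted
	Internal map[string]any `json:"-"`                 // never emitted
}

// encode marshals v and returns the JSON text, or the error text on failure.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	no := false
	fmt.Println(encode(sample{}))
	fmt.Println(encode(sample{Present: &no, Internal: map[string]any{"k": 1}}))
}
```

The first line printed is `{"serial_number":""}` and the second is `{"present":false,"serial_number":""}`. A consumer of the document therefore cannot distinguish "driver not ready" from "not reported" for the plain `bool` fields; whether that matters for the ingest endpoint is not stated in the source.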


@@ -0,0 +1,46 @@
package schema

import (
	"encoding/json"
	"strings"
	"testing"
)

func TestHardwareSnapshotMarshalsNewContractFields(t *testing.T) {
	week := "2024-W07"
	eventTime := "2026-03-15T14:03:11Z"
	message := "Correctable ECC error threshold exceeded"
	payload := HardwareIngestRequest{
		CollectedAt: "2026-03-15T15:00:00Z",
		Hardware: HardwareSnapshot{
			Board: HardwareBoard{SerialNumber: "SRV-001"},
			CPUs: []HardwareCPU{
				{
					HardwareComponentStatus: HardwareComponentStatus{
						ManufacturedYearWeek: &week,
					},
				},
			},
			EventLogs: []HardwareEventLog{
				{
					Source:    "bmc",
					EventTime: &eventTime,
					Message:   message,
				},
			},
		},
	}
	data, err := json.Marshal(payload)
	if err != nil {
		t.Fatalf("marshal: %v", err)
	}
	text := string(data)
	if !strings.Contains(text, `"manufactured_year_week":"2024-W07"`) {
		t.Fatalf("missing manufactured_year_week: %s", text)
	}
	if !strings.Contains(text, `"event_logs":[{"source":"bmc","event_time":"2026-03-15T14:03:11Z","message":"Correctable ECC error threshold exceeded"}]`) {
		t.Fatalf("missing event_logs payload: %s", text)
	}
}

audit/internal/webui/api.go

@@ -0,0 +1,990 @@
package webui
import (
"bufio"
"encoding/json"
"errors"
"fmt"
"io"
"net/http"
"os"
"os/exec"
"path/filepath"
"regexp"
"strings"
"sync/atomic"
"syscall"
"time"
"bee/audit/internal/app"
"bee/audit/internal/platform"
)
var ansiEscapeRE = regexp.MustCompile(`\x1b\[[0-9;]*[a-zA-Z]|\x1b[()][A-Z0-9]|\x1b[DABC]`)
// ── Job ID counter ────────────────────────────────────────────────────────────
var jobCounter atomic.Uint64
func newJobID(prefix string) string {
return fmt.Sprintf("%s-%d", prefix, jobCounter.Add(1))
}
// ── SSE helpers ───────────────────────────────────────────────────────────────
func sseWrite(w http.ResponseWriter, event, data string) bool {
f, ok := w.(http.Flusher)
if !ok {
return false
}
if event != "" {
fmt.Fprintf(w, "event: %s\n", event)
}
fmt.Fprintf(w, "data: %s\n\n", data)
f.Flush()
return true
}
func sseStart(w http.ResponseWriter) bool {
_, ok := w.(http.Flusher)
if !ok {
http.Error(w, "streaming not supported", http.StatusInternalServerError)
return false
}
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
w.Header().Set("Connection", "keep-alive")
w.Header().Set("Access-Control-Allow-Origin", "*")
return true
}
// streamJob streams lines from a jobState to a SSE response.
func streamJob(w http.ResponseWriter, r *http.Request, j *jobState) {
if !sseStart(w) {
return
}
existing, ch := j.subscribe()
for _, line := range existing {
sseWrite(w, "", line)
}
if ch == nil {
// Job already finished
sseWrite(w, "done", j.err)
return
}
for {
select {
case line, ok := <-ch:
if !ok {
sseWrite(w, "done", j.err)
return
}
sseWrite(w, "", line)
case <-r.Context().Done():
return
}
}
}
// streamCmdJob runs an exec.Cmd and streams stdout+stderr lines into j.
func streamCmdJob(j *jobState, cmd *exec.Cmd) error {
pr, pw := io.Pipe()
cmd.Stdout = pw
cmd.Stderr = pw
if err := cmd.Start(); err != nil {
_ = pw.Close()
_ = pr.Close()
return err
}
// Lower the CPU scheduling priority of stress/audit subprocesses to nice+10
// so the X server and kernel interrupt handling remain responsive under load
// (prevents KVM/IPMI graphical console from freezing during GPU stress tests).
if cmd.Process != nil {
_ = syscall.Setpriority(syscall.PRIO_PROCESS, cmd.Process.Pid, 10)
}
scanDone := make(chan error, 1)
go func() {
scanner := bufio.NewScanner(pr)
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
for scanner.Scan() {
// Split on \r to handle progress-bar style output (e.g. \r overwrites)
// and strip ANSI escape codes so logs are readable in the browser.
parts := strings.Split(scanner.Text(), "\r")
for _, part := range parts {
line := ansiEscapeRE.ReplaceAllString(part, "")
if line != "" {
j.append(line)
}
}
}
if err := scanner.Err(); err != nil && !errors.Is(err, io.ErrClosedPipe) {
scanDone <- err
return
}
scanDone <- nil
}()
err := cmd.Wait()
_ = pw.Close()
scanErr := <-scanDone
_ = pr.Close()
if err != nil {
return err
}
return scanErr
}
// ── Audit ─────────────────────────────────────────────────────────────────────
func (h *handler) handleAPIAuditRun(w http.ResponseWriter, _ *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
t := &Task{
ID: newJobID("audit"),
Name: "Audit",
Target: "audit",
Status: TaskPending,
CreatedAt: time.Now(),
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
}
func (h *handler) handleAPIAuditStream(w http.ResponseWriter, r *http.Request) {
id := r.URL.Query().Get("job_id")
if id == "" {
id = r.URL.Query().Get("task_id")
}
// Try task queue first, then legacy job manager
if j, ok := globalQueue.findJob(id); ok {
streamJob(w, r, j)
return
}
if j, ok := globalJobs.get(id); ok {
streamJob(w, r, j)
return
}
http.Error(w, "job not found", http.StatusNotFound)
}
// ── SAT ───────────────────────────────────────────────────────────────────────
func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var body struct {
Duration int `json:"duration"`
DiagLevel int `json:"diag_level"`
GPUIndices []int `json:"gpu_indices"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
Loader string `json:"loader"`
Profile string `json:"profile"`
DisplayName string `json:"display_name"`
}
if r.Body != nil {
if err := json.NewDecoder(r.Body).Decode(&body); err != nil && !errors.Is(err, io.EOF) {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
}
name := taskDisplayName(target, body.Profile, body.Loader)
t := &Task{
ID: newJobID("sat-" + target),
Name: name,
Target: target,
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{
Duration: body.Duration,
DiagLevel: body.DiagLevel,
GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices,
Loader: body.Loader,
BurnProfile: body.Profile,
DisplayName: body.DisplayName,
},
}
if strings.TrimSpace(body.DisplayName) != "" {
t.Name = body.DisplayName
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
}
}
func (h *handler) handleAPISATStream(w http.ResponseWriter, r *http.Request) {
id := r.URL.Query().Get("job_id")
if id == "" {
id = r.URL.Query().Get("task_id")
}
if j, ok := globalQueue.findJob(id); ok {
streamJob(w, r, j)
return
}
if j, ok := globalJobs.get(id); ok {
streamJob(w, r, j)
return
}
http.Error(w, "job not found", http.StatusNotFound)
}
func (h *handler) handleAPISATAbort(w http.ResponseWriter, r *http.Request) {
id := r.URL.Query().Get("job_id")
if id == "" {
id = r.URL.Query().Get("task_id")
}
if t, ok := globalQueue.findByID(id); ok {
globalQueue.mu.Lock()
switch t.Status {
case TaskPending:
t.Status = TaskCancelled
now := time.Now()
t.DoneAt = &now
case TaskRunning:
if t.job != nil {
t.job.abort()
}
t.Status = TaskCancelled
now := time.Now()
t.DoneAt = &now
}
globalQueue.mu.Unlock()
writeJSON(w, map[string]string{"status": "aborted"})
return
}
if j, ok := globalJobs.get(id); ok {
if j.abort() {
writeJSON(w, map[string]string{"status": "aborted"})
} else {
writeJSON(w, map[string]string{"status": "not_running"})
}
return
}
http.Error(w, "job not found", http.StatusNotFound)
}
// ── Services ──────────────────────────────────────────────────────────────────
func (h *handler) handleAPIServicesList(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
names, err := h.opts.App.ListBeeServices()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
type serviceInfo struct {
Name string `json:"name"`
State string `json:"state"`
Body string `json:"body"`
}
result := make([]serviceInfo, 0, len(names))
for _, name := range names {
state := h.opts.App.ServiceState(name)
body, _ := h.opts.App.ServiceStatus(name)
result = append(result, serviceInfo{Name: name, State: state, Body: body})
}
writeJSON(w, result)
}
func (h *handler) handleAPIServicesAction(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var req struct {
Name string `json:"name"`
Action string `json:"action"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
var action platform.ServiceAction
switch req.Action {
case "start":
action = platform.ServiceStart
case "stop":
action = platform.ServiceStop
case "restart":
action = platform.ServiceRestart
default:
writeError(w, http.StatusBadRequest, "action must be start|stop|restart")
return
}
result, err := h.opts.App.ServiceActionResult(req.Name, action)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "output": result.Body})
}
// ── Network ───────────────────────────────────────────────────────────────────
func (h *handler) handleAPINetworkStatus(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
ifaces, err := h.opts.App.ListInterfaces()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]any{
"interfaces": ifaces,
"default_route": h.opts.App.DefaultRoute(),
})
}
func (h *handler) handleAPINetworkDHCP(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var req struct {
Interface string `json:"interface"`
}
_ = json.NewDecoder(r.Body).Decode(&req)
result, err := h.applyPendingNetworkChange(func() (app.ActionResult, error) {
if req.Interface == "" || req.Interface == "all" {
return h.opts.App.DHCPAllResult()
}
return h.opts.App.DHCPOneResult(req.Interface)
})
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]any{
"status": "ok",
"output": result.Body,
"rollback_in": int(netRollbackTimeout.Seconds()),
})
}
func (h *handler) handleAPINetworkStatic(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var req struct {
Interface string `json:"interface"`
Address string `json:"address"`
Prefix string `json:"prefix"`
Gateway string `json:"gateway"`
DNS []string `json:"dns"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
cfg := platform.StaticIPv4Config{
Interface: req.Interface,
Address: req.Address,
Prefix: req.Prefix,
Gateway: req.Gateway,
DNS: req.DNS,
}
result, err := h.applyPendingNetworkChange(func() (app.ActionResult, error) {
return h.opts.App.SetStaticIPv4Result(cfg)
})
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]any{
"status": "ok",
"output": result.Body,
"rollback_in": int(netRollbackTimeout.Seconds()),
})
}
// ── Export ────────────────────────────────────────────────────────────────────
func (h *handler) handleAPIExportList(w http.ResponseWriter, r *http.Request) {
entries, err := listExportFiles(h.opts.ExportDir)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, entries)
}
func (h *handler) handleAPIExportBundle(w http.ResponseWriter, r *http.Request) {
if globalQueue.hasActiveTarget("support-bundle") {
writeError(w, http.StatusConflict, "support bundle task is already pending or running")
return
}
t := &Task{
ID: newJobID("support-bundle"),
Name: "Support Bundle",
Target: "support-bundle",
Status: TaskPending,
CreatedAt: time.Now(),
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{
"status": "queued",
"task_id": t.ID,
"job_id": t.ID,
"url": "/export/support.tar.gz",
})
}
func (h *handler) handleAPIExportUSBTargets(w http.ResponseWriter, _ *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
targets, err := h.opts.App.ListRemovableTargets()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
if targets == nil {
targets = []platform.RemovableTarget{}
}
writeJSON(w, targets)
}
func (h *handler) handleAPIExportUSBAudit(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var target platform.RemovableTarget
if err := json.NewDecoder(r.Body).Decode(&target); err != nil || target.Device == "" {
writeError(w, http.StatusBadRequest, "device is required")
return
}
result, err := h.opts.App.ExportLatestAuditResult(target)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "message": result.Body})
}
func (h *handler) handleAPIExportUSBBundle(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var target platform.RemovableTarget
if err := json.NewDecoder(r.Body).Decode(&target); err != nil || target.Device == "" {
writeError(w, http.StatusBadRequest, "device is required")
return
}
result, err := h.opts.App.ExportSupportBundleResult(target)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "message": result.Body})
}
// ── GPU presence ──────────────────────────────────────────────────────────────
func (h *handler) handleAPIGPUPresence(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
gp := h.opts.App.DetectGPUPresence()
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]bool{
"nvidia": gp.Nvidia,
"amd": gp.AMD,
})
}
// ── System ────────────────────────────────────────────────────────────────────
func (h *handler) handleAPIRAMStatus(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
inRAM := h.opts.App.IsLiveMediaInRAM()
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]bool{"in_ram": inRAM})
}
func (h *handler) handleAPIInstallToRAM(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
if globalQueue.hasActiveTarget("install") {
writeError(w, http.StatusConflict, "install to disk is already running")
return
}
t := &Task{
ID: newJobID("install-to-ram"),
Name: "Install to RAM",
Target: "install-to-ram",
Priority: 10,
Status: TaskPending,
CreatedAt: time.Now(),
}
globalQueue.enqueue(t)
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]string{"task_id": t.ID})
}
// ── Tools ─────────────────────────────────────────────────────────────────────
var standardTools = []string{
"dmidecode", "smartctl", "nvme", "lspci", "ipmitool",
"nvidia-smi", "memtester", "stress-ng", "nvtop",
"mstflint", "qrencode",
}
func (h *handler) handleAPIToolsCheck(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
statuses := h.opts.App.CheckTools(standardTools)
writeJSON(w, statuses)
}
// ── Preflight ─────────────────────────────────────────────────────────────────
func (h *handler) handleAPIPreflight(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(filepath.Join(h.opts.ExportDir, "runtime-health.json"))
if err != nil {
writeError(w, http.StatusNotFound, "runtime health not found")
return
}
w.Header().Set("Content-Type", "application/json; charset=utf-8")
w.Header().Set("Cache-Control", "no-store")
_, _ = w.Write(data)
}
// ── Install ───────────────────────────────────────────────────────────────────
func (h *handler) handleAPIInstallDisks(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
disks, err := h.opts.App.ListInstallDisks()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
type diskJSON struct {
Device string `json:"device"`
Model string `json:"model"`
Size string `json:"size"`
SizeBytes int64 `json:"size_bytes"`
MountedParts []string `json:"mounted_parts"`
Warnings []string `json:"warnings"`
}
result := make([]diskJSON, 0, len(disks))
for _, d := range disks {
result = append(result, diskJSON{
Device: d.Device,
Model: d.Model,
Size: d.Size,
SizeBytes: d.SizeBytes,
MountedParts: d.MountedParts,
Warnings: platform.DiskWarnings(d),
})
}
writeJSON(w, result)
}
func (h *handler) handleAPIInstallRun(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var req struct {
Device string `json:"device"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Device == "" {
writeError(w, http.StatusBadRequest, "device is required")
return
}
// Whitelist: only allow devices that ListInstallDisks() returns.
disks, err := h.opts.App.ListInstallDisks()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
allowed := false
for _, d := range disks {
if d.Device == req.Device {
allowed = true
break
}
}
if !allowed {
writeError(w, http.StatusBadRequest, "device not in install candidate list")
return
}
if globalQueue.hasActiveTarget("install-to-ram") {
writeError(w, http.StatusConflict, "install to RAM task is already pending or running")
return
}
if globalQueue.hasActiveTarget("install") {
writeError(w, http.StatusConflict, "install task is already pending or running")
return
}
t := &Task{
ID: newJobID("install"),
Name: "Install to Disk",
Target: "install",
Priority: 20,
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{
Device: req.Device,
},
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
}
// ── Metrics SSE ───────────────────────────────────────────────────────────────
func (h *handler) handleAPIMetricsLatest(w http.ResponseWriter, r *http.Request) {
sample, ok := h.latestMetric()
if !ok {
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write([]byte("{}"))
return
}
b, err := json.Marshal(sample)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write(b)
}
func (h *handler) handleAPIMetricsStream(w http.ResponseWriter, r *http.Request) {
if !sseStart(w) {
return
}
ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
for {
select {
case <-r.Context().Done():
return
case <-ticker.C:
sample, ok := h.latestMetric()
if !ok {
continue
}
b, err := json.Marshal(sample)
if err != nil {
continue
}
if !sseWrite(w, "metrics", string(b)) {
return
}
}
}
}
// feedRings pushes one sample into all in-memory ring buffers.
func (h *handler) feedRings(sample platform.LiveMetricSample) {
for _, t := range sample.Temps {
switch t.Group {
case "cpu":
h.pushNamedMetricRing(&h.cpuTempRings, t.Name, t.Celsius)
case "ambient":
h.pushNamedMetricRing(&h.ambientTempRings, t.Name, t.Celsius)
}
}
h.ringPower.push(sample.PowerW)
h.ringCPULoad.push(sample.CPULoadPct)
h.ringMemLoad.push(sample.MemLoadPct)
h.ringsMu.Lock()
for i, fan := range sample.Fans {
for len(h.ringFans) <= i {
h.ringFans = append(h.ringFans, newMetricsRing(120))
h.fanNames = append(h.fanNames, fan.Name)
}
h.ringFans[i].push(float64(fan.RPM))
}
for _, gpu := range sample.GPUs {
idx := gpu.GPUIndex
for len(h.gpuRings) <= idx {
h.gpuRings = append(h.gpuRings, &gpuRings{
Temp: newMetricsRing(120),
Util: newMetricsRing(120),
MemUtil: newMetricsRing(120),
Power: newMetricsRing(120),
})
}
h.gpuRings[idx].Temp.push(gpu.TempC)
h.gpuRings[idx].Util.push(gpu.UsagePct)
h.gpuRings[idx].MemUtil.push(gpu.MemUsagePct)
h.gpuRings[idx].Power.push(gpu.PowerW)
}
h.ringsMu.Unlock()
}
func (h *handler) pushNamedMetricRing(dst *[]*namedMetricsRing, name string, value float64) {
if name == "" {
return
}
for _, item := range *dst {
if item != nil && item.Name == name && item.Ring != nil {
item.Ring.push(value)
return
}
}
*dst = append(*dst, &namedMetricsRing{
Name: name,
Ring: newMetricsRing(120),
})
(*dst)[len(*dst)-1].Ring.push(value)
}
// ── Network toggle ────────────────────────────────────────────────────────────
const netRollbackTimeout = 60 * time.Second
func (h *handler) handleAPINetworkToggle(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var req struct {
Iface string `json:"iface"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Iface == "" {
writeError(w, http.StatusBadRequest, "iface is required")
return
}
wasUp, err := h.opts.App.GetInterfaceState(req.Iface)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
if _, err := h.applyPendingNetworkChange(func() (app.ActionResult, error) {
err := h.opts.App.SetInterfaceState(req.Iface, !wasUp)
return app.ActionResult{}, err
}); err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
newState := "up"
if wasUp {
newState = "down"
}
writeJSON(w, map[string]any{
"iface": req.Iface,
"new_state": newState,
"rollback_in": int(netRollbackTimeout.Seconds()),
})
}
func (h *handler) applyPendingNetworkChange(apply func() (app.ActionResult, error)) (app.ActionResult, error) {
if h.opts.App == nil {
return app.ActionResult{}, fmt.Errorf("app not configured")
}
if err := h.rollbackPendingNetworkChange(); err != nil && err.Error() != "no pending network change" {
return app.ActionResult{}, err
}
snapshot, err := h.opts.App.CaptureNetworkSnapshot()
if err != nil {
return app.ActionResult{}, err
}
result, err := apply()
if err != nil {
return result, err
}
pnc := &pendingNetChange{snapshot: snapshot}
pnc.timer = time.AfterFunc(netRollbackTimeout, func() {
_ = h.opts.App.RestoreNetworkSnapshot(snapshot)
h.pendingNetMu.Lock()
if h.pendingNet == pnc {
h.pendingNet = nil
}
h.pendingNetMu.Unlock()
})
h.pendingNetMu.Lock()
h.pendingNet = pnc
h.pendingNetMu.Unlock()
return result, nil
}
func (h *handler) handleAPINetworkConfirm(w http.ResponseWriter, _ *http.Request) {
h.pendingNetMu.Lock()
pnc := h.pendingNet
h.pendingNet = nil
h.pendingNetMu.Unlock()
if pnc != nil {
pnc.mu.Lock()
pnc.timer.Stop()
pnc.mu.Unlock()
}
writeJSON(w, map[string]string{"status": "confirmed"})
}
func (h *handler) handleAPINetworkRollback(w http.ResponseWriter, _ *http.Request) {
if err := h.rollbackPendingNetworkChange(); err != nil {
if err.Error() == "no pending network change" {
writeError(w, http.StatusConflict, err.Error())
return
}
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "rolled back"})
}
func (h *handler) rollbackPendingNetworkChange() error {
h.pendingNetMu.Lock()
pnc := h.pendingNet
h.pendingNet = nil
h.pendingNetMu.Unlock()
if pnc == nil {
return fmt.Errorf("no pending network change")
}
pnc.mu.Lock()
pnc.timer.Stop()
pnc.mu.Unlock()
if h.opts.App != nil {
return h.opts.App.RestoreNetworkSnapshot(pnc.snapshot)
}
return nil
}
// ── Display / Screen Resolution ───────────────────────────────────────────────
type displayMode struct {
Output string `json:"output"`
Mode string `json:"mode"`
Current bool `json:"current"`
}
type displayInfo struct {
Output string `json:"output"`
Modes []displayMode `json:"modes"`
Current string `json:"current"`
}
var xrandrOutputRE = regexp.MustCompile(`^(\S+)\s+connected`)
var xrandrModeRE = regexp.MustCompile(`^\s{3}(\d+x\d+)\s`)
var xrandrCurrentRE = regexp.MustCompile(`\*`)
func parseXrandrOutput(out string) []displayInfo {
var infos []displayInfo
var cur *displayInfo
for _, line := range strings.Split(out, "\n") {
if m := xrandrOutputRE.FindStringSubmatch(line); m != nil {
if cur != nil {
infos = append(infos, *cur)
}
cur = &displayInfo{Output: m[1]}
continue
}
if cur == nil {
continue
}
if m := xrandrModeRE.FindStringSubmatch(line); m != nil {
isCurrent := xrandrCurrentRE.MatchString(line)
mode := displayMode{Output: cur.Output, Mode: m[1], Current: isCurrent}
cur.Modes = append(cur.Modes, mode)
if isCurrent {
cur.Current = m[1]
}
}
}
if cur != nil {
infos = append(infos, *cur)
}
return infos
}
func xrandrCommand(args ...string) *exec.Cmd {
cmd := exec.Command("xrandr", args...)
env := append([]string{}, os.Environ()...)
hasDisplay := false
hasXAuthority := false
for _, kv := range env {
if strings.HasPrefix(kv, "DISPLAY=") && strings.TrimPrefix(kv, "DISPLAY=") != "" {
hasDisplay = true
}
if strings.HasPrefix(kv, "XAUTHORITY=") && strings.TrimPrefix(kv, "XAUTHORITY=") != "" {
hasXAuthority = true
}
}
if !hasDisplay {
env = append(env, "DISPLAY=:0")
}
if !hasXAuthority {
env = append(env, "XAUTHORITY=/home/bee/.Xauthority")
}
cmd.Env = env
return cmd
}
func (h *handler) handleAPIDisplayResolutions(w http.ResponseWriter, _ *http.Request) {
out, err := xrandrCommand().Output()
if err != nil {
writeError(w, http.StatusInternalServerError, "xrandr: "+err.Error())
return
}
writeJSON(w, parseXrandrOutput(string(out)))
}
func (h *handler) handleAPIDisplaySet(w http.ResponseWriter, r *http.Request) {
var req struct {
Output string `json:"output"`
Mode string `json:"mode"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Output == "" || req.Mode == "" {
writeError(w, http.StatusBadRequest, "output and mode are required")
return
}
// Validate mode looks like WxH to prevent injection
if !regexp.MustCompile(`^\d+x\d+$`).MatchString(req.Mode) {
writeError(w, http.StatusBadRequest, "invalid mode format")
return
}
// Validate output name (no special chars)
if !regexp.MustCompile(`^[A-Za-z0-9_\-]+$`).MatchString(req.Output) {
writeError(w, http.StatusBadRequest, "invalid output name")
return
}
if out, err := xrandrCommand("--output", req.Output, "--mode", req.Mode).CombinedOutput(); err != nil {
writeError(w, http.StatusInternalServerError, "xrandr: "+strings.TrimSpace(string(out)))
return
}
writeJSON(w, map[string]string{"status": "ok", "output": req.Output, "mode": req.Mode})
}
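Review note: `parseXrandrOutput` has no test in this diff, so here is a small worked example of the same line-matching approach. It is a simplified variant, not the handler's code: the hypothetical `currentModes` returns only the active mode per connected output, and the sample text is an illustrative `xrandr --query` listing, not captured from a real machine.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	outputRE = regexp.MustCompile(`^(\S+)\s+connected`)
	modeRE   = regexp.MustCompile(`^\s{3}(\d+x\d+)\s`)
)

// currentModes maps each connected output to its active mode, or to "" when
// no mode line for that output carries the '*' marker.
func currentModes(out string) map[string]string {
	modes := map[string]string{}
	output := ""
	for _, line := range strings.Split(out, "\n") {
		if m := outputRE.FindStringSubmatch(line); m != nil {
			output = m[1]
			modes[output] = ""
			continue
		}
		if output == "" {
			continue
		}
		if m := modeRE.FindStringSubmatch(line); m != nil && strings.Contains(line, "*") {
			modes[output] = m[1]
		}
	}
	return modes
}

// sample is illustrative xrandr output; mode lines start with three spaces.
const sample = `Screen 0: minimum 320 x 200, current 1920 x 1080, maximum 16384 x 16384
HDMI-1 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 527mm x 296mm
   1920x1080     60.00*+  50.00
   1280x720      60.00
VGA-1 connected (normal left inverted right x axis y axis)
   1024x768      60.00
`

func main() {
	modes := currentModes(sample)
	fmt.Printf("HDMI-1=%q VGA-1=%q\n", modes["HDMI-1"], modes["VGA-1"])
}
```

Lines for `disconnected` outputs do not match `outputRE`, because the pattern requires whitespace immediately before `connected`.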

@@ -0,0 +1,102 @@
package webui
import (
"encoding/json"
"net/http/httptest"
"strings"
"testing"
"bee/audit/internal/app"
)
func TestXrandrCommandAddsDefaultX11Env(t *testing.T) {
t.Setenv("DISPLAY", "")
t.Setenv("XAUTHORITY", "")
cmd := xrandrCommand("--query")
var hasDisplay bool
var hasXAuthority bool
for _, kv := range cmd.Env {
if kv == "DISPLAY=:0" {
hasDisplay = true
}
if kv == "XAUTHORITY=/home/bee/.Xauthority" {
hasXAuthority = true
}
}
if !hasDisplay {
t.Fatalf("DISPLAY not injected: %v", cmd.Env)
}
if !hasXAuthority {
t.Fatalf("XAUTHORITY not injected: %v", cmd.Env)
}
}
func TestHandleAPISATRunDecodesBodyWithoutContentLength(t *testing.T) {
globalQueue.mu.Lock()
originalTasks := globalQueue.tasks
globalQueue.tasks = nil
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = originalTasks
globalQueue.mu.Unlock()
})
h := &handler{opts: HandlerOptions{App: &app.App{}}}
req := httptest.NewRequest("POST", "/api/sat/cpu/run", strings.NewReader(`{"profile":"smoke"}`))
req.ContentLength = -1
rec := httptest.NewRecorder()
h.handleAPISATRun("cpu").ServeHTTP(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
if len(globalQueue.tasks) != 1 {
t.Fatalf("tasks=%d want 1", len(globalQueue.tasks))
}
if got := globalQueue.tasks[0].params.BurnProfile; got != "smoke" {
t.Fatalf("burn profile=%q want smoke", got)
}
}
func TestHandleAPIExportBundleQueuesTask(t *testing.T) {
globalQueue.mu.Lock()
originalTasks := globalQueue.tasks
globalQueue.tasks = nil
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = originalTasks
globalQueue.mu.Unlock()
})
h := &handler{opts: HandlerOptions{ExportDir: t.TempDir()}}
req := httptest.NewRequest("POST", "/api/export/bundle", nil)
rec := httptest.NewRecorder()
h.handleAPIExportBundle(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
var body map[string]string
if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
t.Fatalf("decode response: %v", err)
}
if body["task_id"] == "" {
t.Fatalf("missing task_id in response: %v", body)
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
if len(globalQueue.tasks) != 1 {
t.Fatalf("tasks=%d want 1", len(globalQueue.tasks))
}
if got := globalQueue.tasks[0].Target; got != "support-bundle" {
t.Fatalf("target=%q want support-bundle", got)
}
}

@@ -0,0 +1,137 @@
package webui
import (
"os"
"strings"
"sync"
"time"
)
// jobState holds the output lines and completion status of an async job.
type jobState struct {
lines []string
done bool
err string
mu sync.Mutex
subs []chan string
cancel func() // optional cancel function; nil if job is not cancellable
logPath string
}
// abort cancels the job if it has a cancel function and is not yet done.
func (j *jobState) abort() bool {
j.mu.Lock()
defer j.mu.Unlock()
if j.done || j.cancel == nil {
return false
}
j.cancel()
return true
}
func (j *jobState) append(line string) {
j.mu.Lock()
defer j.mu.Unlock()
j.lines = append(j.lines, line)
if j.logPath != "" {
appendJobLog(j.logPath, line)
}
for _, ch := range j.subs {
select {
case ch <- line:
default:
}
}
}
func (j *jobState) finish(errMsg string) {
j.mu.Lock()
defer j.mu.Unlock()
j.done = true
j.err = errMsg
for _, ch := range j.subs {
close(ch)
}
j.subs = nil
}
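// subscribe returns the lines recorded so far plus a channel that streams later lines.
// The channel is nil if the job is already done. Lines are dropped for a subscriber
// whose channel buffer is full, and the channel is closed when the job finishes.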
func (j *jobState) subscribe() ([]string, <-chan string) {
j.mu.Lock()
defer j.mu.Unlock()
existing := make([]string, len(j.lines))
copy(existing, j.lines)
if j.done {
return existing, nil
}
ch := make(chan string, 256)
j.subs = append(j.subs, ch)
return existing, ch
}
// jobManager manages async jobs identified by string IDs.
type jobManager struct {
mu sync.Mutex
jobs map[string]*jobState
}
var globalJobs = &jobManager{jobs: make(map[string]*jobState)}
func (m *jobManager) create(id string) *jobState {
m.mu.Lock()
defer m.mu.Unlock()
j := &jobState{}
m.jobs[id] = j
// Schedule cleanup after 30 minutes
go func() {
time.Sleep(30 * time.Minute)
m.mu.Lock()
delete(m.jobs, id)
m.mu.Unlock()
}()
return j
}
// isDone returns true if the job has finished (either successfully or with error).
func (j *jobState) isDone() bool {
j.mu.Lock()
defer j.mu.Unlock()
return j.done
}
func (m *jobManager) get(id string) (*jobState, bool) {
m.mu.Lock()
defer m.mu.Unlock()
j, ok := m.jobs[id]
return j, ok
}
func newTaskJobState(logPath string) *jobState {
j := &jobState{logPath: logPath}
if logPath == "" {
return j
}
data, err := os.ReadFile(logPath)
if err != nil || len(data) == 0 {
return j
}
lines := strings.Split(strings.ReplaceAll(string(data), "\r\n", "\n"), "\n")
if len(lines) > 0 && lines[len(lines)-1] == "" {
lines = lines[:len(lines)-1]
}
j.lines = append(j.lines, lines...)
return j
}
func appendJobLog(path, line string) {
if path == "" {
return
}
f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
if err != nil {
return
}
defer f.Close()
_, _ = f.WriteString(line + "\n")
}
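Review note: a consumer of `subscribe` has to handle two cases, replaying the history and then either draining the live channel or stopping at once when the channel is nil. The sketch below shows that pattern on `lineLog`, a hypothetical reduced copy of `jobState` without the log file and cancel function.

```go
package main

import (
	"fmt"
	"sync"
)

// lineLog is a reduced, hypothetical version of jobState.
type lineLog struct {
	mu    sync.Mutex
	lines []string
	done  bool
	subs  []chan string
}

// add records a line and offers it to every subscriber without blocking.
func (l *lineLog) add(line string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.lines = append(l.lines, line)
	for _, ch := range l.subs {
		select {
		case ch <- line:
		default: // subscriber buffer full: drop rather than stall the producer
		}
	}
}

// finish marks the log complete and closes all subscriber channels.
func (l *lineLog) finish() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.done = true
	for _, ch := range l.subs {
		close(ch)
	}
	l.subs = nil
}

// subscribe returns a copy of the history and a channel for later lines;
// the channel is nil once the log is finished.
func (l *lineLog) subscribe() ([]string, <-chan string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	existing := append([]string(nil), l.lines...)
	if l.done {
		return existing, nil
	}
	ch := make(chan string, 256)
	l.subs = append(l.subs, ch)
	return existing, ch
}

func main() {
	l := &lineLog{}
	l.add("starting")

	history, live := l.subscribe()
	l.add("working")
	l.finish()

	// Drain the live channel; the loop ends when finish closes it.
	for line := range live {
		history = append(history, line)
	}
	fmt.Println(history)

	// After finish, subscribe returns the full history and a nil channel,
	// so the caller must not range over it.
	late, ch := l.subscribe()
	fmt.Println(late, ch == nil)
}
```

Ranging over the nil channel returned for a finished job would block forever, which is why the nil check matters for late subscribers.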

@@ -0,0 +1,331 @@
package webui
import (
"database/sql"
"encoding/csv"
"io"
"os"
"path/filepath"
"strconv"
"time"
"bee/audit/internal/platform"
_ "modernc.org/sqlite"
)
const metricsDBPath = "/appdata/bee/metrics.db"
// MetricsDB persists live metric samples to SQLite.
type MetricsDB struct {
db *sql.DB
}
// openMetricsDB opens (or creates) the metrics database at the given path.
func openMetricsDB(path string) (*MetricsDB, error) {
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return nil, err
}
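// modernc.org/sqlite takes pragmas through _pragma=...; the _journal and
// _busy_timeout keys are mattn/go-sqlite3 options and have no effect here.
db, err := sql.Open("sqlite", path+"?_pragma=journal_mode(WAL)&_pragma=busy_timeout(5000)")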
if err != nil {
return nil, err
}
db.SetMaxOpenConns(1)
if err := initMetricsSchema(db); err != nil {
_ = db.Close()
return nil, err
}
return &MetricsDB{db: db}, nil
}
func initMetricsSchema(db *sql.DB) error {
_, err := db.Exec(`
CREATE TABLE IF NOT EXISTS sys_metrics (
ts INTEGER NOT NULL,
cpu_load_pct REAL,
mem_load_pct REAL,
power_w REAL,
PRIMARY KEY (ts)
);
CREATE TABLE IF NOT EXISTS gpu_metrics (
ts INTEGER NOT NULL,
gpu_index INTEGER NOT NULL,
temp_c REAL,
usage_pct REAL,
mem_usage_pct REAL,
power_w REAL,
PRIMARY KEY (ts, gpu_index)
);
CREATE TABLE IF NOT EXISTS fan_metrics (
ts INTEGER NOT NULL,
name TEXT NOT NULL,
rpm REAL,
PRIMARY KEY (ts, name)
);
CREATE TABLE IF NOT EXISTS temp_metrics (
ts INTEGER NOT NULL,
name TEXT NOT NULL,
grp TEXT NOT NULL,
celsius REAL,
PRIMARY KEY (ts, name)
);
`)
return err
}
// Write inserts one sample into all relevant tables.
func (m *MetricsDB) Write(s platform.LiveMetricSample) error {
ts := s.Timestamp.Unix()
tx, err := m.db.Begin()
if err != nil {
return err
}
defer func() { _ = tx.Rollback() }()
_, err = tx.Exec(
`INSERT OR REPLACE INTO sys_metrics(ts,cpu_load_pct,mem_load_pct,power_w) VALUES(?,?,?,?)`,
ts, s.CPULoadPct, s.MemLoadPct, s.PowerW,
)
if err != nil {
return err
}
for _, g := range s.GPUs {
_, err = tx.Exec(
`INSERT OR REPLACE INTO gpu_metrics(ts,gpu_index,temp_c,usage_pct,mem_usage_pct,power_w) VALUES(?,?,?,?,?,?)`,
ts, g.GPUIndex, g.TempC, g.UsagePct, g.MemUsagePct, g.PowerW,
)
if err != nil {
return err
}
}
for _, f := range s.Fans {
_, err = tx.Exec(
`INSERT OR REPLACE INTO fan_metrics(ts,name,rpm) VALUES(?,?,?)`,
ts, f.Name, f.RPM,
)
if err != nil {
return err
}
}
for _, t := range s.Temps {
_, err = tx.Exec(
`INSERT OR REPLACE INTO temp_metrics(ts,name,grp,celsius) VALUES(?,?,?,?)`,
ts, t.Name, t.Group, t.Celsius,
)
if err != nil {
return err
}
}
return tx.Commit()
}
// LoadRecent returns up to n samples in chronological order (oldest first).
func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
return m.loadSamples(`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts DESC LIMIT ?`, n)
}
// LoadAll returns all persisted samples in chronological order (oldest first).
func (m *MetricsDB) LoadAll() ([]platform.LiveMetricSample, error) {
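// loadSamples always reverses its rows, so query newest first here as well.
// No bind arguments: passing nil would be sent as one stray argument.
return m.loadSamples(`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts DESC`)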
}
// loadSamples reconstructs LiveMetricSample rows from the normalized tables.
func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetricSample, error) {
rows, err := m.db.Query(query, args...)
if err != nil {
return nil, err
}
defer rows.Close()
type sysRow struct {
ts int64
cpu, mem, pwr float64
}
var sysRows []sysRow
for rows.Next() {
var r sysRow
if err := rows.Scan(&r.ts, &r.cpu, &r.mem, &r.pwr); err != nil {
continue
}
sysRows = append(sysRows, r)
}
if len(sysRows) == 0 {
return nil, nil
}
// Reverse to chronological order
for i, j := 0, len(sysRows)-1; i < j; i, j = i+1, j-1 {
sysRows[i], sysRows[j] = sysRows[j], sysRows[i]
}
// Collect min/max ts for range query
minTS := sysRows[0].ts
maxTS := sysRows[len(sysRows)-1].ts
// Load GPU rows in range
type gpuKey struct {
ts int64
idx int
}
gpuData := map[gpuKey]platform.GPUMetricRow{}
gRows, err := m.db.Query(
`SELECT ts,gpu_index,temp_c,usage_pct,mem_usage_pct,power_w FROM gpu_metrics WHERE ts>=? AND ts<=? ORDER BY ts,gpu_index`,
minTS, maxTS,
)
if err == nil {
defer gRows.Close()
for gRows.Next() {
var ts int64
var g platform.GPUMetricRow
if err := gRows.Scan(&ts, &g.GPUIndex, &g.TempC, &g.UsagePct, &g.MemUsagePct, &g.PowerW); err == nil {
gpuData[gpuKey{ts, g.GPUIndex}] = g
}
}
}
// Load fan rows in range
type fanKey struct {
ts int64
name string
}
fanData := map[fanKey]float64{}
fRows, err := m.db.Query(
`SELECT ts,name,rpm FROM fan_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS,
)
if err == nil {
defer fRows.Close()
for fRows.Next() {
var ts int64
var name string
var rpm float64
if err := fRows.Scan(&ts, &name, &rpm); err == nil {
fanData[fanKey{ts, name}] = rpm
}
}
}
// Load temp rows in range
type tempKey struct {
ts int64
name string
}
tempData := map[tempKey]platform.TempReading{}
tRows, err := m.db.Query(
`SELECT ts,name,grp,celsius FROM temp_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS,
)
if err == nil {
defer tRows.Close()
for tRows.Next() {
var ts int64
var t platform.TempReading
if err := tRows.Scan(&ts, &t.Name, &t.Group, &t.Celsius); err == nil {
tempData[tempKey{ts, t.Name}] = t
}
}
}
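// Collect unique GPU indices, fan names, and temp names from the loaded data.
// Go map iteration order is random, so the order of these slices (and of the
// GPUs, Fans, and Temps within each sample) is not stable between calls.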
seenGPU := map[int]bool{}
var gpuIndices []int
for k := range gpuData {
if !seenGPU[k.idx] {
seenGPU[k.idx] = true
gpuIndices = append(gpuIndices, k.idx)
}
}
seenFan := map[string]bool{}
var fanNames []string
for k := range fanData {
if !seenFan[k.name] {
seenFan[k.name] = true
fanNames = append(fanNames, k.name)
}
}
seenTemp := map[string]bool{}
var tempNames []string
for k := range tempData {
if !seenTemp[k.name] {
seenTemp[k.name] = true
tempNames = append(tempNames, k.name)
}
}
samples := make([]platform.LiveMetricSample, len(sysRows))
for i, r := range sysRows {
s := platform.LiveMetricSample{
Timestamp: time.Unix(r.ts, 0).UTC(),
CPULoadPct: r.cpu,
MemLoadPct: r.mem,
PowerW: r.pwr,
}
for _, idx := range gpuIndices {
if g, ok := gpuData[gpuKey{r.ts, idx}]; ok {
s.GPUs = append(s.GPUs, g)
}
}
for _, name := range fanNames {
if rpm, ok := fanData[fanKey{r.ts, name}]; ok {
s.Fans = append(s.Fans, platform.FanReading{Name: name, RPM: rpm})
}
}
for _, name := range tempNames {
if t, ok := tempData[tempKey{r.ts, name}]; ok {
s.Temps = append(s.Temps, t)
}
}
samples[i] = s
}
return samples, nil
}
// ExportCSV writes all sys+gpu data as CSV to w.
func (m *MetricsDB) ExportCSV(w io.Writer) error {
rows, err := m.db.Query(`
SELECT s.ts, s.cpu_load_pct, s.mem_load_pct, s.power_w,
g.gpu_index, g.temp_c, g.usage_pct, g.mem_usage_pct, g.power_w
FROM sys_metrics s
LEFT JOIN gpu_metrics g ON g.ts = s.ts
ORDER BY s.ts, g.gpu_index
`)
if err != nil {
return err
}
defer rows.Close()
cw := csv.NewWriter(w)
_ = cw.Write([]string{"ts", "cpu_load_pct", "mem_load_pct", "sys_power_w", "gpu_index", "gpu_temp_c", "gpu_usage_pct", "gpu_mem_pct", "gpu_power_w"})
for rows.Next() {
var ts int64
var cpu, mem, pwr float64
var gpuIdx sql.NullInt64
var gpuTemp, gpuUse, gpuMem, gpuPow sql.NullFloat64
if err := rows.Scan(&ts, &cpu, &mem, &pwr, &gpuIdx, &gpuTemp, &gpuUse, &gpuMem, &gpuPow); err != nil {
continue
}
row := []string{
strconv.FormatInt(ts, 10),
strconv.FormatFloat(cpu, 'f', 2, 64),
strconv.FormatFloat(mem, 'f', 2, 64),
strconv.FormatFloat(pwr, 'f', 1, 64),
}
if gpuIdx.Valid {
row = append(row,
strconv.FormatInt(gpuIdx.Int64, 10),
strconv.FormatFloat(gpuTemp.Float64, 'f', 1, 64),
strconv.FormatFloat(gpuUse.Float64, 'f', 1, 64),
strconv.FormatFloat(gpuMem.Float64, 'f', 1, 64),
strconv.FormatFloat(gpuPow.Float64, 'f', 1, 64),
)
} else {
row = append(row, "", "", "", "", "")
}
_ = cw.Write(row)
}
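cw.Flush()
// Report a failure that ended row iteration early instead of silently
// returning a truncated CSV.
if err := rows.Err(); err != nil {
return err
}
return cw.Error()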
}
// Close closes the database.
func (m *MetricsDB) Close() { _ = m.db.Close() }
func nullFloat(v float64) sql.NullFloat64 {
return sql.NullFloat64{Float64: v, Valid: true}
}

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -0,0 +1,308 @@
package webui
import (
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"strings"
"testing"
"time"
"bee/audit/internal/platform"
)
func TestChartLegendNumber(t *testing.T) {
tests := []struct {
in float64
want string
}{
{in: 0.4, want: "0"},
{in: 61.5, want: "62"},
{in: 999.4, want: "999"},
{in: 1200, want: "1,2k"},
{in: 1250, want: "1,25k"},
{in: 1310, want: "1,31k"},
{in: 1500, want: "1,5k"},
{in: 2600, want: "2,6k"},
{in: 10200, want: "10k"},
}
for _, tc := range tests {
if got := chartLegendNumber(tc.in); got != tc.want {
t.Fatalf("chartLegendNumber(%v)=%q want %q", tc.in, got, tc.want)
}
}
}
func TestChartDataFromSamplesUsesFullHistory(t *testing.T) {
samples := []platform.LiveMetricSample{
{
Timestamp: time.Now().Add(-3 * time.Minute),
CPULoadPct: 10,
MemLoadPct: 20,
PowerW: 300,
GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, UsagePct: 90, MemUsagePct: 5, PowerW: 120, TempC: 50},
},
},
{
Timestamp: time.Now().Add(-2 * time.Minute),
CPULoadPct: 30,
MemLoadPct: 40,
PowerW: 320,
GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, UsagePct: 95, MemUsagePct: 7, PowerW: 125, TempC: 51},
},
},
{
Timestamp: time.Now().Add(-1 * time.Minute),
CPULoadPct: 50,
MemLoadPct: 60,
PowerW: 340,
GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, UsagePct: 97, MemUsagePct: 9, PowerW: 130, TempC: 52},
},
},
}
datasets, names, labels, title, _, _, ok := chartDataFromSamples("gpu-all-power", samples)
if !ok {
t.Fatal("chartDataFromSamples returned ok=false")
}
if title != "GPU Power" {
t.Fatalf("title=%q", title)
}
if len(names) != 1 || names[0] != "GPU 0" {
t.Fatalf("names=%v", names)
}
if len(labels) != len(samples) {
t.Fatalf("labels len=%d want %d", len(labels), len(samples))
}
if len(datasets) != 1 || len(datasets[0]) != len(samples) {
t.Fatalf("datasets shape=%v", datasets)
}
if got := datasets[0][0]; got != 120 {
t.Fatalf("datasets[0][0]=%v want 120", got)
}
if got := datasets[0][2]; got != 130 {
t.Fatalf("datasets[0][2]=%v want 130", got)
}
}
func TestRootRendersDashboard(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{
Title: "Bee Hardware Audit",
AuditPath: path,
ExportDir: exportDir,
})
first := httptest.NewRecorder()
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/", nil))
if first.Code != http.StatusOK {
t.Fatalf("first status=%d", first.Code)
}
// Dashboard should contain the audit nav link and hardware summary
if !strings.Contains(first.Body.String(), `href="/audit"`) {
t.Fatalf("first body missing audit nav link: %s", first.Body.String())
}
if !strings.Contains(first.Body.String(), `/viewer`) {
t.Fatalf("first body missing viewer link: %s", first.Body.String())
}
if got := first.Header().Get("Cache-Control"); got != "no-store" {
t.Fatalf("first cache-control=%q", got)
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
t.Fatal(err)
}
second := httptest.NewRecorder()
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/", nil))
if second.Code != http.StatusOK {
t.Fatalf("second status=%d", second.Code)
}
if !strings.Contains(second.Body.String(), `Hardware Summary`) {
t.Fatalf("second body missing hardware summary: %s", second.Body.String())
}
}
func TestRootShowsRunAuditButtonWhenSnapshotMissing(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{
Title: "Bee Hardware Audit",
AuditPath: filepath.Join(dir, "missing-audit.json"),
ExportDir: exportDir,
})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
if !strings.Contains(body, `Run Audit`) {
t.Fatalf("dashboard missing run audit button: %s", body)
}
if strings.Contains(body, `No audit data`) {
t.Fatalf("dashboard still shows empty audit badge: %s", body)
}
}
func TestAuditPageRendersViewerFrameAndActions(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z"}`), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
if !strings.Contains(body, `iframe class="viewer-frame" src="/viewer"`) {
t.Fatalf("audit page missing viewer frame: %s", body)
}
if !strings.Contains(body, `openAuditModal()`) {
t.Fatalf("audit page missing action modal trigger: %s", body)
}
}
func TestViewerRendersLatestSnapshot(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
first := httptest.NewRecorder()
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if first.Code != http.StatusOK {
t.Fatalf("first status=%d", first.Code)
}
if !strings.Contains(first.Body.String(), "SERIAL-OLD") {
t.Fatalf("viewer body missing old serial: %s", first.Body.String())
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
t.Fatal(err)
}
second := httptest.NewRecorder()
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if second.Code != http.StatusOK {
t.Fatalf("second status=%d", second.Code)
}
if !strings.Contains(second.Body.String(), "SERIAL-NEW") {
t.Fatalf("viewer body missing new serial: %s", second.Body.String())
}
if strings.Contains(second.Body.String(), "SERIAL-OLD") {
t.Fatalf("viewer body still contains old serial: %s", second.Body.String())
}
}
func TestAuditJSONServesLatestSnapshot(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
body := `{"hardware":{"board":{"serial_number":"SERIAL-API"}}}`
if err := os.WriteFile(path, []byte(body), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
if !strings.Contains(rec.Body.String(), "SERIAL-API") {
t.Fatalf("body missing expected serial: %s", rec.Body.String())
}
if got := rec.Header().Get("Content-Type"); !strings.Contains(got, "application/json") {
t.Fatalf("content-type=%q", got)
}
}
func TestMissingAuditJSONReturnsNotFound(t *testing.T) {
handler := NewHandler(HandlerOptions{AuditPath: "/missing/audit.json"})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
if rec.Code != http.StatusNotFound {
t.Fatalf("status=%d want %d", rec.Code, http.StatusNotFound)
}
}
func TestSupportBundleEndpointReturnsArchive(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil {
t.Fatal(err)
}
archive, err := os.CreateTemp(os.TempDir(), "bee-support-server-test-*.tar.gz")
if err != nil {
t.Fatal(err)
}
t.Cleanup(func() { _ = os.Remove(archive.Name()) })
if _, err := archive.WriteString("support-bundle"); err != nil {
t.Fatal(err)
}
if err := archive.Close(); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/export/support.tar.gz", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
if got := rec.Header().Get("Content-Disposition"); !strings.Contains(got, "attachment;") {
t.Fatalf("content-disposition=%q", got)
}
if rec.Body.Len() == 0 {
t.Fatal("empty archive body")
}
}
func TestRuntimeHealthEndpointReturnsJSON(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
body := `{"status":"PARTIAL","checked_at":"2026-03-16T10:00:00Z"}`
if err := os.WriteFile(filepath.Join(exportDir, "runtime-health.json"), []byte(body), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/runtime-health.json", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
if strings.TrimSpace(rec.Body.String()) != body {
t.Fatalf("body=%q want %q", strings.TrimSpace(rec.Body.String()), body)
}
}

@@ -0,0 +1,808 @@
package webui
import (
"context"
"encoding/json"
"fmt"
"net/http"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
"sync"
"time"
"bee/audit/internal/app"
"bee/audit/internal/platform"
)
// Task statuses.
const (
TaskPending = "pending"
TaskRunning = "running"
TaskDone = "done"
TaskFailed = "failed"
TaskCancelled = "cancelled"
)
// taskNames maps target → human-readable name for validate (SAT) runs.
var taskNames = map[string]string{
"nvidia": "NVIDIA SAT",
"nvidia-stress": "NVIDIA GPU Stress",
"memory": "Memory SAT",
"storage": "Storage SAT",
"cpu": "CPU SAT",
"amd": "AMD GPU SAT",
"amd-mem": "AMD GPU MEM Integrity",
"amd-bandwidth": "AMD GPU MEM Bandwidth",
"amd-stress": "AMD GPU Burn-in",
"memory-stress": "Memory Burn-in",
"sat-stress": "SAT Stress (stressapptest)",
"platform-stress": "Platform Thermal Cycling",
"audit": "Audit",
"support-bundle": "Support Bundle",
"install": "Install to Disk",
"install-to-ram": "Install to RAM",
}
// burnNames maps target → human-readable name when a burn profile is set.
var burnNames = map[string]string{
"nvidia": "NVIDIA Burn-in",
"memory": "Memory Burn-in",
"cpu": "CPU Burn-in",
"amd": "AMD GPU Burn-in",
}
func nvidiaStressTaskName(loader string) string {
switch strings.TrimSpace(strings.ToLower(loader)) {
case platform.NvidiaStressLoaderJohn:
return "NVIDIA GPU Stress (John/OpenCL)"
case platform.NvidiaStressLoaderNCCL:
return "NVIDIA GPU Stress (NCCL)"
default:
return "NVIDIA GPU Stress (bee-gpu-burn)"
}
}
func taskDisplayName(target, profile, loader string) string {
name := taskNames[target]
if profile != "" {
if n, ok := burnNames[target]; ok {
name = n
}
}
if target == "nvidia-stress" {
name = nvidiaStressTaskName(loader)
}
if name == "" {
name = target
}
return name
}
// Task represents one unit of work in the queue.
type Task struct {
ID string `json:"id"`
Name string `json:"name"`
Target string `json:"target"`
Priority int `json:"priority"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
StartedAt *time.Time `json:"started_at,omitempty"`
DoneAt *time.Time `json:"done_at,omitempty"`
ErrMsg string `json:"error,omitempty"`
LogPath string `json:"log_path,omitempty"`
// runtime fields (not serialised)
job *jobState
params taskParams
}
// taskParams holds optional parameters parsed from the run request.
type taskParams struct {
Duration int `json:"duration,omitempty"`
DiagLevel int `json:"diag_level,omitempty"`
GPUIndices []int `json:"gpu_indices,omitempty"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices,omitempty"`
Loader string `json:"loader,omitempty"`
BurnProfile string `json:"burn_profile,omitempty"`
DisplayName string `json:"display_name,omitempty"`
Device string `json:"device,omitempty"` // for install
}
type persistedTask struct {
ID string `json:"id"`
Name string `json:"name"`
Target string `json:"target"`
Priority int `json:"priority"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
StartedAt *time.Time `json:"started_at,omitempty"`
DoneAt *time.Time `json:"done_at,omitempty"`
ErrMsg string `json:"error,omitempty"`
LogPath string `json:"log_path,omitempty"`
Params taskParams `json:"params,omitempty"`
}
type burnPreset struct {
NvidiaDiag int
DurationSec int
}
func resolveBurnPreset(profile string) burnPreset {
switch profile {
case "overnight":
return burnPreset{NvidiaDiag: 4, DurationSec: 8 * 60 * 60}
case "acceptance":
return burnPreset{NvidiaDiag: 3, DurationSec: 60 * 60}
default:
return burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60}
}
}
func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions {
switch profile {
case "overnight":
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{
{LoadSec: 600, IdleSec: 120},
{LoadSec: 600, IdleSec: 60},
{LoadSec: 600, IdleSec: 30},
{LoadSec: 600, IdleSec: 120},
{LoadSec: 600, IdleSec: 60},
{LoadSec: 600, IdleSec: 30},
{LoadSec: 600, IdleSec: 120},
{LoadSec: 600, IdleSec: 60},
}}
case "acceptance":
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{
{LoadSec: 300, IdleSec: 60},
{LoadSec: 300, IdleSec: 30},
{LoadSec: 300, IdleSec: 60},
{LoadSec: 300, IdleSec: 30},
}}
default: // smoke
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{
{LoadSec: 90, IdleSec: 60},
{LoadSec: 90, IdleSec: 30},
}}
}
}
// taskQueue manages a priority-ordered list of tasks and runs them one at a time.
type taskQueue struct {
mu sync.Mutex
tasks []*Task
trigger chan struct{}
opts *HandlerOptions // set by startWorker
statePath string
logsDir string
started bool
}
var globalQueue = &taskQueue{trigger: make(chan struct{}, 1)}
const maxTaskHistory = 50
var (
runMemoryAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunMemoryAcceptancePackCtx(ctx, baseDir, logFunc)
}
runStorageAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunStorageAcceptancePackCtx(ctx, baseDir, logFunc)
}
runCPUAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunCPUAcceptancePackCtx(ctx, baseDir, durationSec, logFunc)
}
runAMDAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunAMDAcceptancePackCtx(ctx, baseDir, logFunc)
}
runAMDMemIntegrityPackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunAMDMemIntegrityPackCtx(ctx, baseDir, logFunc)
}
runAMDMemBandwidthPackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunAMDMemBandwidthPackCtx(ctx, baseDir, logFunc)
}
runNvidiaStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
return a.RunNvidiaStressPackCtx(ctx, baseDir, opts, logFunc)
}
runAMDStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunAMDStressPackCtx(ctx, baseDir, durationSec, logFunc)
}
runMemoryStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunMemoryStressPackCtx(ctx, baseDir, durationSec, logFunc)
}
runSATStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunSATStressPackCtx(ctx, baseDir, durationSec, logFunc)
}
buildSupportBundle = app.BuildSupportBundle
installCommand = func(ctx context.Context, device string, logPath string) *exec.Cmd {
return exec.CommandContext(ctx, "bee-install", device, logPath)
}
)
// enqueue adds a task to the queue and notifies the worker.
func (q *taskQueue) enqueue(t *Task) {
q.mu.Lock()
q.assignTaskLogPathLocked(t)
q.tasks = append(q.tasks, t)
q.prune()
q.persistLocked()
q.mu.Unlock()
select {
case q.trigger <- struct{}{}:
default:
}
}
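The `select`/`default` send in `enqueue` can be read as a coalescing wake-up: with a one-slot buffered channel, repeated notifications never block the caller and collapse into a single pending signal. A minimal standalone sketch of that pattern follows; the helper name `notify` is hypothetical and not part of this change.

```go
package main

import "fmt"

// notify performs a non-blocking send on a one-slot channel, mirroring the
// select/default pattern used by enqueue. It reports whether a new wake-up
// was queued (true) or one was already pending (false).
func notify(trigger chan struct{}) bool {
	select {
	case trigger <- struct{}{}:
		return true
	default:
		return false // slot already full: the worker will wake up anyway
	}
}

func main() {
	trigger := make(chan struct{}, 1)
	fmt.Println(notify(trigger)) // first signal fills the slot
	fmt.Println(notify(trigger)) // second signal is dropped instead of blocking
	<-trigger                    // the worker consumes the wake-up
	fmt.Println(notify(trigger)) // the slot is free again
}
```

Because the worker drains every pending task after a single wake-up, dropping the redundant signal should not lose work.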
// prune removes oldest completed tasks beyond maxTaskHistory.
func (q *taskQueue) prune() {
var done []*Task
var active []*Task
for _, t := range q.tasks {
switch t.Status {
case TaskDone, TaskFailed, TaskCancelled:
done = append(done, t)
default:
active = append(active, t)
}
}
if len(done) > maxTaskHistory {
done = done[len(done)-maxTaskHistory:]
}
q.tasks = append(active, done...)
}
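To illustrate the pruning rule in isolation: unfinished entries are always kept, and only the most recent finished entries survive, up to a limit. The sketch below uses a simplified, hypothetical `entry` type and a `pruneHistory` helper rather than the real `Task` and `taskQueue`.

```go
package main

import "fmt"

// entry is a simplified stand-in for Task (an assumption for this sketch).
type entry struct {
	ID       string
	Finished bool
}

// pruneHistory keeps every unfinished entry and only the last max finished
// entries, in the same "active first, then finished" order that prune produces.
func pruneHistory(entries []entry, max int) []entry {
	var active, done []entry
	for _, e := range entries {
		if e.Finished {
			done = append(done, e)
		} else {
			active = append(active, e)
		}
	}
	if len(done) > max {
		done = done[len(done)-max:] // drop the oldest finished entries
	}
	return append(active, done...)
}

func main() {
	in := []entry{{"d1", true}, {"d2", true}, {"a1", false}, {"d3", true}}
	for _, e := range pruneHistory(in, 2) {
		fmt.Println(e.ID)
	}
}
```

Note that "oldest" here means earliest position in the slice, not earliest completion time, which matches how `prune` slices the `done` list.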
// nextPending returns the highest-priority pending task (nil if none).
func (q *taskQueue) nextPending() *Task {
var best *Task
for _, t := range q.tasks {
if t.Status != TaskPending {
continue
}
if best == nil || t.Priority > best.Priority ||
(t.Priority == best.Priority && t.CreatedAt.Before(best.CreatedAt)) {
best = t
}
}
return best
}
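The selection rule in `nextPending` is "highest priority wins, ties go to the oldest task". A small sketch under simplified assumptions (the `item` type, `pickNext`, and `sampleItems` are hypothetical names used only for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// item is a simplified stand-in for a pending Task.
type item struct {
	ID        string
	Priority  int
	CreatedAt time.Time
}

// pickNext returns the highest-priority item; on equal priority the item
// created earlier wins. It returns nil for an empty slice.
func pickNext(items []item) *item {
	var best *item
	for i := range items {
		it := &items[i]
		if best == nil || it.Priority > best.Priority ||
			(it.Priority == best.Priority && it.CreatedAt.Before(best.CreatedAt)) {
			best = it
		}
	}
	return best
}

// sampleItems builds a fixed data set: "b" and "c" share the top priority,
// and "c" is one minute older than "b".
func sampleItems() []item {
	base := time.Date(2026, 4, 1, 8, 0, 0, 0, time.UTC)
	return []item{
		{ID: "a", Priority: 0, CreatedAt: base},
		{ID: "b", Priority: 2, CreatedAt: base.Add(2 * time.Minute)},
		{ID: "c", Priority: 2, CreatedAt: base.Add(time.Minute)},
	}
}

func main() {
	fmt.Println(pickNext(sampleItems()).ID) // prints "c"
}
```

This is why a positive `delta` sent to the priority endpoint moves a pending task ahead of older tasks with a lower priority.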
// findByID looks up a task by ID.
func (q *taskQueue) findByID(id string) (*Task, bool) {
q.mu.Lock()
defer q.mu.Unlock()
for _, t := range q.tasks {
if t.ID == id {
return t, true
}
}
return nil, false
}
// findJob returns the jobState for a task ID (for SSE streaming compatibility).
func (q *taskQueue) findJob(id string) (*jobState, bool) {
t, ok := q.findByID(id)
if !ok || t.job == nil {
return nil, false
}
return t.job, true
}
func (q *taskQueue) hasActiveTarget(target string) bool {
q.mu.Lock()
defer q.mu.Unlock()
for _, t := range q.tasks {
if t.Target != target {
continue
}
if t.Status == TaskPending || t.Status == TaskRunning {
return true
}
}
return false
}
// snapshot returns a copy of all tasks sorted for display: running first, then pending, then finished;
// within each group by priority (highest first), then by creation time (oldest first).
func (q *taskQueue) snapshot() []Task {
q.mu.Lock()
defer q.mu.Unlock()
out := make([]Task, len(q.tasks))
for i, t := range q.tasks {
out[i] = *t
}
sort.SliceStable(out, func(i, j int) bool {
si := statusOrder(out[i].Status)
sj := statusOrder(out[j].Status)
if si != sj {
return si < sj
}
if out[i].Priority != out[j].Priority {
return out[i].Priority > out[j].Priority
}
return out[i].CreatedAt.Before(out[j].CreatedAt)
})
return out
}
func statusOrder(s string) int {
switch s {
case TaskRunning:
return 0
case TaskPending:
return 1
default:
return 2
}
}
// startWorker launches the queue runner goroutine.
func (q *taskQueue) startWorker(opts *HandlerOptions) {
q.mu.Lock()
q.opts = opts
q.statePath = filepath.Join(opts.ExportDir, "tasks-state.json")
q.logsDir = filepath.Join(opts.ExportDir, "tasks")
_ = os.MkdirAll(q.logsDir, 0755)
if !q.started {
q.loadLocked()
q.started = true
go q.worker()
}
hasPending := q.nextPending() != nil
q.mu.Unlock()
if hasPending {
select {
case q.trigger <- struct{}{}:
default:
}
}
}
func (q *taskQueue) worker() {
for {
<-q.trigger
setCPUGovernor("performance")
for {
q.mu.Lock()
t := q.nextPending()
if t == nil {
q.mu.Unlock()
break
}
now := time.Now()
t.Status = TaskRunning
t.StartedAt = &now
t.DoneAt = nil
t.ErrMsg = ""
j := newTaskJobState(t.LogPath)
ctx, cancel := context.WithCancel(context.Background())
j.cancel = cancel
t.job = j
q.persistLocked()
q.mu.Unlock()
q.runTask(t, j, ctx)
q.mu.Lock()
now2 := time.Now()
t.DoneAt = &now2
if t.Status == TaskRunning { // not cancelled externally
if j.err != "" {
t.Status = TaskFailed
t.ErrMsg = j.err
} else {
t.Status = TaskDone
}
}
q.prune()
q.persistLocked()
q.mu.Unlock()
}
setCPUGovernor("powersave")
}
}
// setCPUGovernor writes the given governor to all CPU scaling_governor sysfs files.
// Silently ignores errors (e.g. when cpufreq is not available).
func setCPUGovernor(governor string) {
matches, err := filepath.Glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")
if err != nil || len(matches) == 0 {
return
}
for _, path := range matches {
_ = os.WriteFile(path, []byte(governor), 0644)
}
}
// runTask executes the work for a task, writing output to j.
func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if q.opts == nil {
j.append("ERROR: handler options not configured")
j.finish("handler options not configured")
return
}
a := q.opts.App
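// Detect a recovered log before appending anything; checking after the first
// append would make the condition always true.
recovered := len(j.lines) > 0
j.append(fmt.Sprintf("Starting %s...", t.Name))
if recovered {
j.append(fmt.Sprintf("Recovered after bee-web restart at %s", time.Now().UTC().Format(time.RFC3339)))
}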
var (
archive string
err error
)
switch t.Target {
case "nvidia":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
diagLevel := t.params.DiagLevel
if t.params.BurnProfile != "" && diagLevel <= 0 {
diagLevel = resolveBurnPreset(t.params.BurnProfile).NvidiaDiag
}
if len(t.params.GPUIndices) > 0 || diagLevel > 0 {
result, e := a.RunNvidiaAcceptancePackWithOptions(
ctx, "", diagLevel, t.params.GPUIndices, j.append,
)
if e != nil {
err = e
} else {
archive = result.Body
}
} else {
archive, err = a.RunNvidiaAcceptancePack("", j.append)
}
case "nvidia-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{
DurationSec: dur,
Loader: t.params.Loader,
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
}, j.append)
case "memory":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runMemoryAcceptancePackCtx(a, ctx, "", j.append)
case "storage":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runStorageAcceptancePackCtx(a, ctx, "", j.append)
case "cpu":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
if dur <= 0 {
dur = 60
}
j.append(fmt.Sprintf("CPU stress duration: %ds", dur))
archive, err = runCPUAcceptancePackCtx(a, ctx, "", dur, j.append)
case "amd":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDAcceptancePackCtx(a, ctx, "", j.append)
case "amd-mem":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDMemIntegrityPackCtx(a, ctx, "", j.append)
case "amd-bandwidth":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDMemBandwidthPackCtx(a, ctx, "", j.append)
case "amd-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runAMDStressPackCtx(a, ctx, "", dur, j.append)
case "memory-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runMemoryStressPackCtx(a, ctx, "", dur, j.append)
case "sat-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runSATStressPackCtx(a, ctx, "", dur, j.append)
case "platform-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
opts := resolvePlatformStressPreset(t.params.BurnProfile)
archive, err = a.RunPlatformStress(ctx, "", opts, j.append)
case "audit":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
result, e := a.RunAuditNow(q.opts.RuntimeMode)
if e != nil {
err = e
} else {
for _, line := range splitLines(result.Body) {
j.append(line)
}
}
case "support-bundle":
j.append("Building support bundle...")
archive, err = buildSupportBundle(q.opts.ExportDir)
case "install":
if strings.TrimSpace(t.params.Device) == "" {
err = fmt.Errorf("device is required")
break
}
installLogPath := platform.InstallLogPath(t.params.Device)
j.append("Install log: " + installLogPath)
err = streamCmdJob(j, installCommand(ctx, t.params.Device, installLogPath))
case "install-to-ram":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
err = a.RunInstallToRAM(ctx, j.append)
default:
j.append("ERROR: unknown target: " + t.Target)
j.finish("unknown target")
return
}
if err != nil {
if ctx.Err() != nil {
j.append("Aborted.")
j.finish("aborted")
} else {
j.append("ERROR: " + err.Error())
j.finish(err.Error())
}
return
}
if archive != "" {
j.append("Archive: " + archive)
}
j.finish("")
}
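Several stress targets in `runTask` resolve their duration the same way: an explicit `Duration` wins, otherwise the burn profile preset applies, and the `cpu` target additionally falls back to 60 seconds. The sketch below restates that precedence as one function; `effectiveDuration` and `presetDuration` are hypothetical helpers, and the preset values are copied from `resolveBurnPreset`.

```go
package main

import "fmt"

// presetDuration mirrors the DurationSec values returned by resolveBurnPreset.
func presetDuration(profile string) int {
	switch profile {
	case "overnight":
		return 8 * 60 * 60
	case "acceptance":
		return 60 * 60
	default:
		return 5 * 60
	}
}

// effectiveDuration applies the precedence used for the cpu target:
// explicit value, then burn profile preset, then the given fallback.
func effectiveDuration(explicit int, profile string, fallback int) int {
	dur := explicit
	if profile != "" && dur <= 0 {
		dur = presetDuration(profile)
	}
	if dur <= 0 {
		dur = fallback
	}
	return dur
}

func main() {
	fmt.Println(effectiveDuration(0, "smoke", 60))       // preset applies: 300
	fmt.Println(effectiveDuration(120, "overnight", 60)) // explicit value wins: 120
	fmt.Println(effectiveDuration(0, "", 60))            // fallback applies: 60
}
```

In the code above only the `cpu` case applies the final fallback; the other stress targets pass the resolved value through unchanged, so a zero duration is left for the pack runner to interpret.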
func splitLines(s string) []string {
var out []string
for _, l := range splitNL(s) {
if l != "" {
out = append(out, l)
}
}
return out
}
func splitNL(s string) []string {
var out []string
start := 0
for i, c := range s {
if c == '\n' {
out = append(out, s[start:i])
start = i + 1
}
}
out = append(out, s[start:])
return out
}
// ── HTTP handlers ─────────────────────────────────────────────────────────────
func (h *handler) handleAPITasksList(w http.ResponseWriter, _ *http.Request) {
tasks := globalQueue.snapshot()
writeJSON(w, tasks)
}
func (h *handler) handleAPITasksCancel(w http.ResponseWriter, r *http.Request) {
id := r.PathValue("id")
t, ok := globalQueue.findByID(id)
if !ok {
writeError(w, http.StatusNotFound, "task not found")
return
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
switch t.Status {
case TaskPending:
t.Status = TaskCancelled
now := time.Now()
t.DoneAt = &now
globalQueue.persistLocked()
writeJSON(w, map[string]string{"status": "cancelled"})
case TaskRunning:
if t.job != nil {
t.job.abort()
}
t.Status = TaskCancelled
now := time.Now()
t.DoneAt = &now
globalQueue.persistLocked()
writeJSON(w, map[string]string{"status": "cancelled"})
default:
writeError(w, http.StatusConflict, "task is not running or pending")
}
}
func (h *handler) handleAPITasksPriority(w http.ResponseWriter, r *http.Request) {
id := r.PathValue("id")
t, ok := globalQueue.findByID(id)
if !ok {
writeError(w, http.StatusNotFound, "task not found")
return
}
var req struct {
Delta int `json:"delta"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid body")
return
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
if t.Status != TaskPending {
writeError(w, http.StatusConflict, "only pending tasks can be reprioritised")
return
}
t.Priority += req.Delta
globalQueue.persistLocked()
writeJSON(w, map[string]int{"priority": t.Priority})
}
func (h *handler) handleAPITasksCancelAll(w http.ResponseWriter, _ *http.Request) {
globalQueue.mu.Lock()
now := time.Now()
n := 0
for _, t := range globalQueue.tasks {
switch t.Status {
case TaskPending:
t.Status = TaskCancelled
t.DoneAt = &now
n++
case TaskRunning:
if t.job != nil {
t.job.abort()
}
t.Status = TaskCancelled
t.DoneAt = &now
n++
}
}
globalQueue.persistLocked()
globalQueue.mu.Unlock()
writeJSON(w, map[string]int{"cancelled": n})
}
func (h *handler) handleAPITasksStream(w http.ResponseWriter, r *http.Request) {
id := r.PathValue("id")
// Wait up to 5s for the task to get a job (it may be pending)
deadline := time.Now().Add(5 * time.Second)
var j *jobState
for time.Now().Before(deadline) {
if jj, ok := globalQueue.findJob(id); ok {
j = jj
break
}
time.Sleep(200 * time.Millisecond)
}
if j == nil {
http.Error(w, "task not found or not yet started", http.StatusNotFound)
return
}
streamJob(w, r, j)
}
func (q *taskQueue) assignTaskLogPathLocked(t *Task) {
if t.LogPath != "" || q.logsDir == "" || t.ID == "" {
return
}
t.LogPath = filepath.Join(q.logsDir, t.ID+".log")
}
func (q *taskQueue) loadLocked() {
if q.statePath == "" {
return
}
data, err := os.ReadFile(q.statePath)
if err != nil || len(data) == 0 {
return
}
var persisted []persistedTask
if err := json.Unmarshal(data, &persisted); err != nil {
return
}
for _, pt := range persisted {
t := &Task{
ID: pt.ID,
Name: pt.Name,
Target: pt.Target,
Priority: pt.Priority,
Status: pt.Status,
CreatedAt: pt.CreatedAt,
StartedAt: pt.StartedAt,
DoneAt: pt.DoneAt,
ErrMsg: pt.ErrMsg,
LogPath: pt.LogPath,
params: pt.Params,
}
q.assignTaskLogPathLocked(t)
if t.Status == TaskPending || t.Status == TaskRunning {
t.Status = TaskPending
t.DoneAt = nil
t.ErrMsg = ""
}
q.tasks = append(q.tasks, t)
}
q.prune()
q.persistLocked()
}
func (q *taskQueue) persistLocked() {
if q.statePath == "" {
return
}
state := make([]persistedTask, 0, len(q.tasks))
for _, t := range q.tasks {
state = append(state, persistedTask{
ID: t.ID,
Name: t.Name,
Target: t.Target,
Priority: t.Priority,
Status: t.Status,
CreatedAt: t.CreatedAt,
StartedAt: t.StartedAt,
DoneAt: t.DoneAt,
ErrMsg: t.ErrMsg,
LogPath: t.LogPath,
Params: t.params,
})
}
data, err := json.MarshalIndent(state, "", " ")
if err != nil {
return
}
tmp := q.statePath + ".tmp"
if err := os.WriteFile(tmp, data, 0644); err != nil {
return
}
_ = os.Rename(tmp, q.statePath)
}
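`persistLocked` writes the state to a `.tmp` file and renames it over the real path, so a reader should see either the old or the new file, never a half-written one (assuming both paths are on the same POSIX filesystem). A standalone sketch of that pattern follows; `writeFileAtomic` and `roundTrip` are hypothetical names, and unlike `persistLocked` this variant returns the rename error instead of ignoring it.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to path+".tmp" and then renames it over path.
func writeFileAtomic(path string, data []byte) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	if err := os.Rename(tmp, path); err != nil {
		_ = os.Remove(tmp) // best-effort cleanup of the temporary file
		return err
	}
	return nil
}

// roundTrip writes data into a fresh temporary directory and reads it back.
func roundTrip(data string) (string, error) {
	dir, err := os.MkdirTemp("", "atomic-write-*")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "tasks-state.json")
	if err := writeFileAtomic(path, []byte(data)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip(`[{"id":"task-1","status":"pending"}]`)
	fmt.Println(got, err)
}
```

The sketch does not call `fsync`, so it protects against partially written files but not against data loss on power failure.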


@@ -0,0 +1,281 @@
package webui
import (
"context"
"os"
"os/exec"
"path/filepath"
"strings"
"testing"
"time"
"bee/audit/internal/app"
)
func TestTaskQueuePersistsAndRecoversPendingTasks(t *testing.T) {
dir := t.TempDir()
q := &taskQueue{
statePath: filepath.Join(dir, "tasks-state.json"),
logsDir: filepath.Join(dir, "tasks"),
trigger: make(chan struct{}, 1),
}
if err := os.MkdirAll(q.logsDir, 0755); err != nil {
t.Fatal(err)
}
started := time.Now().Add(-time.Minute)
task := &Task{
ID: "task-1",
Name: "Memory Burn-in",
Target: "memory-stress",
Priority: 2,
Status: TaskRunning,
CreatedAt: time.Now().Add(-2 * time.Minute),
StartedAt: &started,
params: taskParams{
Duration: 300,
BurnProfile: "smoke",
},
}
q.tasks = append(q.tasks, task)
q.assignTaskLogPathLocked(task)
q.persistLocked()
recovered := &taskQueue{
statePath: q.statePath,
logsDir: q.logsDir,
trigger: make(chan struct{}, 1),
}
recovered.loadLocked()
if len(recovered.tasks) != 1 {
t.Fatalf("tasks=%d want 1", len(recovered.tasks))
}
got := recovered.tasks[0]
if got.Status != TaskPending {
t.Fatalf("status=%q want %q", got.Status, TaskPending)
}
if got.params.Duration != 300 || got.params.BurnProfile != "smoke" {
t.Fatalf("params=%+v", got.params)
}
if got.LogPath == "" {
t.Fatal("expected log path")
}
}
func TestNewTaskJobStateLoadsExistingLog(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "task.log")
if err := os.WriteFile(path, []byte("line1\nline2\n"), 0644); err != nil {
t.Fatal(err)
}
j := newTaskJobState(path)
existing, ch := j.subscribe()
if ch == nil {
t.Fatal("expected live subscription channel")
}
if len(existing) != 2 || existing[0] != "line1" || existing[1] != "line2" {
t.Fatalf("existing=%v", existing)
}
}
func TestResolveBurnPreset(t *testing.T) {
tests := []struct {
profile string
want burnPreset
}{
{profile: "smoke", want: burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60}},
{profile: "acceptance", want: burnPreset{NvidiaDiag: 3, DurationSec: 60 * 60}},
{profile: "overnight", want: burnPreset{NvidiaDiag: 4, DurationSec: 8 * 60 * 60}},
{profile: "", want: burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60}},
}
for _, tc := range tests {
if got := resolveBurnPreset(tc.profile); got != tc.want {
t.Fatalf("resolveBurnPreset(%q)=%+v want %+v", tc.profile, got, tc.want)
}
}
}
func TestTaskDisplayNameUsesNvidiaStressLoader(t *testing.T) {
tests := []struct {
loader string
want string
}{
{loader: "", want: "NVIDIA GPU Stress (bee-gpu-burn)"},
{loader: "builtin", want: "NVIDIA GPU Stress (bee-gpu-burn)"},
{loader: "john", want: "NVIDIA GPU Stress (John/OpenCL)"},
{loader: "nccl", want: "NVIDIA GPU Stress (NCCL)"},
}
for _, tc := range tests {
if got := taskDisplayName("nvidia-stress", "acceptance", tc.loader); got != tc.want {
t.Fatalf("taskDisplayName(loader=%q)=%q want %q", tc.loader, got, tc.want)
}
}
}
func TestRunTaskHonorsCancel(t *testing.T) {
blocked := make(chan struct{})
released := make(chan struct{})
aRun := func(_ any, ctx context.Context, _ string, _ int, _ func(string)) (string, error) {
close(blocked)
select {
case <-ctx.Done():
close(released)
return "", ctx.Err()
case <-time.After(5 * time.Second):
close(released)
return "unexpected", nil
}
}
q := &taskQueue{
opts: &HandlerOptions{App: &app.App{}},
}
tk := &Task{
ID: "cpu-1",
Name: "CPU SAT",
Target: "cpu",
Status: TaskRunning,
CreatedAt: time.Now(),
params: taskParams{Duration: 60},
}
j := &jobState{}
ctx, cancel := context.WithCancel(context.Background())
j.cancel = cancel
tk.job = j
orig := runCPUAcceptancePackCtx
runCPUAcceptancePackCtx = func(_ *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return aRun(nil, ctx, baseDir, durationSec, logFunc)
}
defer func() { runCPUAcceptancePackCtx = orig }()
done := make(chan struct{})
go func() {
q.runTask(tk, j, ctx)
close(done)
}()
<-blocked
j.abort()
select {
case <-released:
case <-time.After(2 * time.Second):
t.Fatal("task did not observe cancel")
}
select {
case <-done:
case <-time.After(2 * time.Second):
t.Fatal("runTask did not return after cancel")
}
}
func TestRunTaskUsesBurnProfileDurationForCPU(t *testing.T) {
var gotDuration int
q := &taskQueue{
opts: &HandlerOptions{App: &app.App{}},
}
tk := &Task{
ID: "cpu-burn-1",
Name: "CPU Burn-in",
Target: "cpu",
Status: TaskRunning,
CreatedAt: time.Now(),
params: taskParams{BurnProfile: "smoke"},
}
j := &jobState{}
orig := runCPUAcceptancePackCtx
runCPUAcceptancePackCtx = func(_ *app.App, _ context.Context, _ string, durationSec int, _ func(string)) (string, error) {
gotDuration = durationSec
return "/tmp/cpu-burn.tar.gz", nil
}
defer func() { runCPUAcceptancePackCtx = orig }()
q.runTask(tk, j, context.Background())
if gotDuration != 5*60 {
t.Fatalf("duration=%d want %d", gotDuration, 5*60)
}
}
func TestRunTaskBuildsSupportBundleWithoutApp(t *testing.T) {
dir := t.TempDir()
q := &taskQueue{
opts: &HandlerOptions{ExportDir: dir},
}
tk := &Task{
ID: "support-bundle-1",
Name: "Support Bundle",
Target: "support-bundle",
Status: TaskRunning,
CreatedAt: time.Now(),
}
j := &jobState{}
var gotExportDir string
orig := buildSupportBundle
buildSupportBundle = func(exportDir string) (string, error) {
gotExportDir = exportDir
return filepath.Join(exportDir, "bundle.tar.gz"), nil
}
defer func() { buildSupportBundle = orig }()
q.runTask(tk, j, context.Background())
if gotExportDir != dir {
t.Fatalf("exportDir=%q want %q", gotExportDir, dir)
}
if j.err != "" {
t.Fatalf("unexpected error: %q", j.err)
}
if !strings.Contains(strings.Join(j.lines, "\n"), "Archive: "+filepath.Join(dir, "bundle.tar.gz")) {
t.Fatalf("lines=%v", j.lines)
}
}
func TestRunTaskInstallUsesSharedCommandStreaming(t *testing.T) {
q := &taskQueue{
opts: &HandlerOptions{},
}
tk := &Task{
ID: "install-1",
Name: "Install to Disk",
Target: "install",
Status: TaskRunning,
CreatedAt: time.Now(),
params: taskParams{Device: "/dev/sda"},
}
j := &jobState{}
var gotDevice string
var gotLogPath string
orig := installCommand
installCommand = func(ctx context.Context, device string, logPath string) *exec.Cmd {
gotDevice = device
gotLogPath = logPath
return exec.CommandContext(ctx, "sh", "-c", "printf 'line1\nline2\n'")
}
defer func() { installCommand = orig }()
q.runTask(tk, j, context.Background())
if gotDevice != "/dev/sda" {
t.Fatalf("device=%q want /dev/sda", gotDevice)
}
if gotLogPath == "" {
t.Fatal("expected install log path")
}
logs := strings.Join(j.lines, "\n")
if !strings.Contains(logs, "Install log: ") {
t.Fatalf("missing install log line: %v", j.lines)
}
if !strings.Contains(logs, "line1") || !strings.Contains(logs, "line2") {
t.Fatalf("missing streamed output: %v", j.lines)
}
if j.err != "" {
t.Fatalf("unexpected error: %q", j.err)
}
}

2
bible

Submodule bible updated: 456c1f022c...688b87e98d


@@ -9,4 +9,5 @@ Generic engineering rules live in `bible/rules/patterns/`.
|---|---|
| `architecture/system-overview.md` | What bee does, scope, tech stack |
| `architecture/runtime-flows.md` | Boot sequence, audit flow, service order |
| `docs/hardware-ingest-contract.md` | Current Reanimator hardware ingest JSON contract |
| `decisions/` | Architectural decision log |


@@ -0,0 +1,38 @@
# Charting architecture
## Decision: one chart engine for all live metrics
**Engine:** `github.com/go-analyze/charts` (pure Go, no CGO, SVG output)
**Theme:** `grafana` (dark background, coloured lines)
All live metrics charts in the web UI are server-side SVG images served by Go
and polled by the browser every 2 seconds via `<img src="...?t=now">`.
There is no client-side canvas or JS chart library.
### Why go-analyze/charts
- Pure Go, no CGO — builds cleanly inside the live-build container
- SVG output — crisp at any display resolution, full-width without pixelation
- Grafana theme matches the dark web UI colour scheme
- Active fork of the archived wcharczuk/go-chart
### SAT stress-test charts
The `drawGPUChartSVG` function in `platform/gpu_metrics.go` is a separate
self-contained SVG renderer used **only** for completed SAT run reports
(HTML export, burn-in summaries). It is not used for live metrics.
### Live metrics chart endpoints
| Path | Content |
|------|---------|
| `GET /api/metrics/chart/server.svg` | CPU temp, CPU load %, mem load %, power W, fan RPMs |
| `GET /api/metrics/chart/gpu/{idx}.svg` | GPU temp °C, load %, mem %, power W |
Charts are 1400 × 280 px SVG. The page renders them at `width: 100%` in a
single-column layout so they always fill the viewport width.
### Ring buffers
Each metric is stored in a 120-sample ring buffer (2 minutes of history at 1 Hz).
Buffers are per-server or per-GPU and grow dynamically as new GPUs appear.
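
A fixed-size ring buffer of this kind can be sketched as follows. This is an illustration only, not the actual implementation in the web UI: the `ring` type and its methods are hypothetical, and the capacity is assumed to be greater than zero.

```go
package main

import "fmt"

// ring is a fixed-capacity buffer that overwrites the oldest sample when full.
type ring struct {
	vals []float64
	next int  // index of the slot the next push writes to
	full bool // true once the buffer has wrapped at least once
}

// newRing allocates a ring with the given capacity (must be > 0).
func newRing(size int) *ring {
	return &ring{vals: make([]float64, size)}
}

// push stores one sample, overwriting the oldest one when the ring is full.
func (r *ring) push(v float64) {
	r.vals[r.next] = v
	r.next = (r.next + 1) % len(r.vals)
	if r.next == 0 {
		r.full = true
	}
}

// snapshot returns a copy of the stored samples, oldest first.
func (r *ring) snapshot() []float64 {
	if !r.full {
		return append([]float64(nil), r.vals[:r.next]...)
	}
	out := make([]float64, 0, len(r.vals))
	out = append(out, r.vals[r.next:]...)
	return append(out, r.vals[:r.next]...)
}

func main() {
	r := newRing(3) // the charts would use 120 samples
	for _, v := range []float64{1, 2, 3, 4} {
		r.push(v)
	}
	fmt.Println(r.snapshot()) // prints [2 3 4]
}
```

With a capacity of 120 and one sample per second, `snapshot` would return the last two minutes of history for the chart renderer.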


@@ -4,100 +4,113 @@
**The live CD runs in an isolated network segment with no internet access.**
All binaries, kernel modules, and tools must be baked into the ISO at build time.
No `apk add`, no downloads, no package manager calls are allowed at boot.
No package installation, no downloads, and no package manager calls are allowed at boot.
DHCP is used only for LAN (operator SSH access). Internet is NOT available.
## Boot sequence (single ISO)
OpenRC default runlevel, service start order:
The live system is expected to boot with `toram`, so `live-boot` copies the full read-only medium into RAM before mounting the root filesystem. After that point, runtime must not depend on the original USB/BMC virtual media staying readable.
`systemd` boot order:
```
localmount
├── bee-sshsetup (creates bee user, sets password; runs before dropbear)
│ └── dropbear (SSH on port 22 — starts without network)
├── bee-network (udhcpc -b on all physical interfaces, non-blocking)
│ └── bee-nvidia (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│ creates libnvidia-ml.so.1 symlinks in /usr/lib/)
│ └── bee-audit (runs audit binary → /var/log/bee-audit.json)
local-fs.target
├── bee-sshsetup.service (enables SSH key auth; password fallback only if marker exists)
│ └── ssh.service (OpenSSH on port 22 — starts without network)
├── bee-network.service (starts `dhclient -nw` on all physical interfaces, non-blocking)
├── bee-nvidia.service (insmod nvidia*.ko from /usr/local/lib/nvidia/,
│ creates /dev/nvidia* nodes)
├── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
│ never blocks boot on partial collector failures)
├── bee-web.service (runs `bee web` on :80 — full interactive web UI)
└── bee-desktop.service (startx → openbox + chromium http://localhost/)
```
**Critical invariants:**
- Dropbear MUST start without network. `bee-sshsetup` has `need localmount` only.
- `bee-network` uses `udhcpc -b` (background) — retries indefinitely if no cable.
- `bee-nvidia` loads modules via `insmod` with absolute paths — NOT `modprobe`.
Reason: modloop squashfs mounts over `/lib/modules/<kver>/` at boot, making it
read-only. The overlay's modules at that path are inaccessible. Modules are stored
at `/usr/local/lib/nvidia/` (overlay path, always writable).
- `bee-nvidia` creates `libnvidia-ml.so.1` symlinks in `/usr/lib/` — required because
`nvidia-smi` is a glibc binary that looks for the soname symlink, not the versioned file.
- `gcompat` package provides `/lib64/ld-linux-x86-64.so.2` for glibc compat on Alpine musl.
- `bee-audit` uses `after bee-nvidia` — ensures NVIDIA enrichment succeeds.
- `bee-audit` uses `eend 0` always — never fails boot even if audit errors.
- The live ISO boots with `boot=live toram`. Runtime binaries must continue working even if the original boot media disappears after early boot.
- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
Reason: the modules are shipped in the ISO overlay under `/usr/local/lib/nvidia/`, not in the host module tree.
- `bee-audit.service` does not wait for `network-online.target`; audit is local and must run even if DHCP is broken.
- `bee-audit.service` logs audit failures but does not turn partial collector problems into a boot blocker.
- `bee-web.service` binds `0.0.0.0:80` and always renders the current `/var/log/bee-audit.json` contents.
- Audit JSON now includes a `hardware.summary` block with overall verdict and warning/failure counts.
## Console and login flow
Local-console behavior:
```text
tty1
└── live-config autologin → bee
└── /home/bee/.profile (prints web UI URLs)
display :0
└── bee-desktop.service (User=bee)
└── startx /usr/local/bin/bee-openbox-session -- :0
├── tint2 (taskbar)
├── chromium http://localhost/
└── openbox (WM)
```
Rules:
- local `tty1` lands in user `bee`, not directly in `root`
- `bee-desktop.service` starts X11 + openbox + Chromium automatically after `bee-web.service`
- Chromium opens `http://localhost/` — the full interactive web UI
- SSH is independent from the desktop path
- serial console support is enabled for VM boot debugging
## ISO build sequence
```
build.sh [--authorized-keys /path/to/keys]
1. compile audit binary (skip if .go files older than binary)
2. inject authorized_keys into overlay/root/.ssh/ (or set password fallback)
3. copy audit binary → overlay/usr/local/bin/audit
4. copy vendor binaries from iso/vendor/ → overlay/usr/local/bin/
(storcli64, sas2ircu, sas3ircu, mstflint, gpu_burn — each optional)
5. build-nvidia-module.sh:
a. apk add linux-lts-dev (always, to get current Alpine 3.21 kernel headers)
b. detect KVER from /usr/src/linux-headers-*
c. download NVIDIA .run installer (sha256 verified, cached in dist/)
d. extract installer
e. build kernel modules against linux-lts headers
f. create libnvidia-ml.so.1 / libcuda.so.1 symlinks in cache
g. cache in dist/nvidia-<version>-<kver>/
6. inject NVIDIA .ko → overlay/usr/local/lib/nvidia/
7. inject nvidia-smi → overlay/usr/local/bin/nvidia-smi
8. inject libnvidia-ml + libcuda → overlay/usr/lib/
9. write overlay/etc/bee-release (versions + git commit)
10. export BEE_BUILD_INFO for motd substitution
11. mkimage.sh (from /var/tmp, TMPDIR=/var/tmp):
kernel_* section — cached (linux-lts modloop)
apks_* section — cached (downloaded packages)
syslinux_* / grub_* — cached
apkovl — always regenerated (genapkovl-bee.sh)
final ISO — always assembled
build-in-container.sh [--authorized-keys /path/to/keys]
1. compile `bee` binary (skip if .go files older than binary)
2. create a temporary overlay staging dir under `dist/`
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
4. copy `bee` binary → staged `/usr/local/bin/bee`
5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
(`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli` — optional; `mstflint` comes from the Debian package set)
6. `build-nvidia-module.sh`:
a. install Debian kernel headers if missing
b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
c. extract installer
d. build kernel modules against Debian headers
e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
f. cache in `dist/nvidia-<version>-<kver>/`
7. `build-cublas.sh`:
a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo
b. verify packages against repo `Packages.gz`
c. extract headers for `bee-gpu-burn` worker build
d. cache userspace libs in `dist/cublas-<version>+cuda<series>/`
8. build `bee-gpu-burn` worker against extracted cuBLASLt/cudart headers
9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/`
12. write staged `/etc/bee-release` (versions + git commit)
13. patch staged `motd` with build metadata
14. copy `iso/builder/` into a temporary live-build workdir under `dist/`
15. sync staged overlay into workdir `config/includes.chroot/`
16. run `lb config && lb build` inside the privileged builder container
```
Build host notes:
- `build-in-container.sh` targets `linux/amd64` builder containers by default, including Docker Desktop on macOS / Apple Silicon.
- Override with `BEE_BUILDER_PLATFORM=<os/arch>` only if you intentionally need a different container platform.
- If the local builder image under the same tag was previously built for the wrong architecture, the script rebuilds it automatically.
**Critical invariants:**
- `KERNEL_PKG_VERSION` in `iso/builder/VERSIONS` pins the exact Alpine package version
(e.g. `6.12.76-r0`). This version is used in THREE places that MUST stay in sync:
1. `build-nvidia-module.sh`: `apk add linux-lts-dev=${KERNEL_PKG_VERSION}` (compile headers)
2. `mkimg.bee.sh`: `linux-lts=${KERNEL_PKG_VERSION}` in apks list (ISO kernel)
3. `build.sh` — build-time verification that headers match pin (fails loudly if not)
When Alpine releases a new linux-lts patch (e.g. r0 → r1), update KERNEL_PKG_VERSION
in VERSIONS — that's the only place to change. The build will fail loudly if the pin
doesn't match the installed headers, so stale pins are caught immediately.
- **All three must use the same APK mirror: `dl-cdn.alpinelinux.org`.** Both
`build-nvidia-module.sh` (apk add) and `mkimage.sh` (--repository) explicitly use
`https://dl-cdn.alpinelinux.org/alpine/v${ALPINE_VERSION}/main|community`.
Never use the builder's local `/etc/apk/repositories` — its mirror may serve
a different package state, causing "unable to select package" failures.
- `linux-lts-dev` is always installed (not conditional) — stale 6.6.x headers on the
builder would cause modules to be built for the wrong kernel and never load at runtime.
- NVIDIA modules go to `overlay/usr/local/lib/nvidia/` — NOT `lib/modules/<kver>/extra/`.
- `genapkovl-bee.sh` must be copied to `/var/tmp/` (CWD when mkimage runs).
- `TMPDIR=/var/tmp` required — tmpfs `/tmp` is only ~1GB, too small for kernel firmware.
- Workdir cleanup preserves `apks_*`, `kernel_*`, `syslinux_*`, `grub_*` cache dirs.
## gpu_burn vendor binary
`gpu_burn` requires CUDA nvcc to build. It is NOT built as part of the main ISO build.
Build separately on the builder VM and place in `iso/vendor/gpu_burn`:
```sh
sh iso/builder/build-gpu-burn.sh dist/
cp dist/gpu_burn iso/vendor/gpu_burn
cp dist/compare.ptx iso/vendor/compare.ptx
```
Requires: CUDA 12.8+ (supports GCC 14, Alpine 3.21), libxml2, g++, make, git.
`build.sh` includes it automatically if `iso/vendor/gpu_burn` exists.
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
2. `auto/config`: `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- `bee-gpu-burn` worker must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers.
- The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
- Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
- On macOS / Docker Desktop, the builder still must run as `linux/amd64` so the shipped ISO binaries remain `amd64`.
- Operators must provision enough RAM to hold the full compressed live medium plus normal runtime overhead, because `toram` copies the entire read-only ISO payload into memory before the system reaches steady state.
## Post-boot smoke test
@@ -109,35 +122,62 @@ ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh
Exit code 0 = all required checks pass. All `FAIL` lines must be zero before shipping.
Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present,
gcompat installed, services running, audit completed with NVIDIA enrichment, internet.
Key checks: NVIDIA modules loaded, `nvidia-smi` sees all GPUs, lib symlinks present,
systemd services running, audit completed with NVIDIA enrichment, LAN reachability.
## apkovl mechanism
Current validation state:
- local/libvirt VM boot path is validated for `systemd`, SSH, `bee audit`, `bee-network`, and Web UI startup
- real hardware validation is still required before treating the ISO as release-ready
The apkovl is a `.tar.gz` injected into the ISO at `/boot/`. Alpine initramfs extracts
it at boot, overlaying `/etc`, `/usr`, `/root`, `/lib` on the tmpfs root.
## Overlay mechanism
`genapkovl-bee.sh` generates the tarball containing:
- `/etc/apk/world` — package list (apk installs on first boot)
- `/etc/runlevels/*/` — OpenRC service symlinks
- `/etc/conf.d/dropbear`: `DROPBEAR_OPTS="-R -B"`
- `/etc/network/interfaces` — lo only (bee-network handles DHCP)
- `/etc/hostname`
- Everything from `iso/overlay/` (init scripts, binaries, ssh keys, tui)
`live-build` copies files from `config/includes.chroot/` into the ISO filesystem.
`build.sh` prepares a staged overlay, then syncs it into a temporary workdir's
`config/includes.chroot/` before running `lb build`.
## Collector flow
```
audit binary start
`bee audit` start
1. board collector (dmidecode -t 0,1,2)
2. cpu collector (dmidecode -t 4)
3. memory collector (dmidecode -t 17)
4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
5. pcie collector (lspci -vmm -D, /sys/bus/pci/devices/)
6. psu collector (ipmitool fru — silent if no /dev/ipmi0)
6. psu collector (ipmitool fru + sdr — silent if no /dev/ipmi0)
7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
8. output JSON → /var/log/bee-audit.json
9. QR summary to stdout (qrencode if available)
```
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
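The `nil, nil` convention above can be sketched roughly as follows. This is a minimal illustration, not the project's collector code: `Section` and `collect` are hypothetical names, and treating any `exec.LookPath` failure as "tool not found" is an assumption.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// Section is a hypothetical collector result; the real schema types live elsewhere.
type Section struct {
	Tool string
	Raw  string
}

// collect runs an external tool and wraps its output.
// It returns (nil, nil) when the tool is not installed, so callers can
// tell "nothing to report" apart from a real failure.
func collect(tool string, args ...string) (*Section, error) {
	path, err := exec.LookPath(tool)
	if err != nil {
		return nil, nil // tool not found: skip silently
	}
	out, err := exec.Command(path, args...).Output()
	if err != nil {
		return nil, fmt.Errorf("%s: %w", tool, err)
	}
	return &Section{Tool: tool, Raw: string(out)}, nil
}

func main() {
	for _, tool := range []string{"dmidecode", "lspci"} {
		sec, err := collect(tool)
		switch {
		case err != nil:
			log.Printf("collector %s failed: %v", tool, err) // logged, never fatal
		case sec == nil:
			log.Printf("collector %s skipped: tool not found", tool)
		default:
			log.Printf("collector %s returned %d bytes", tool, len(sec.Raw))
		}
	}
}
```

The caller decides what to log; the collector itself never aborts the audit.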
Acceptance flows:
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-burn`
- NVIDIA GPU burn-in can use either `bee-gpu-burn` or `bee-john-gpu-stress` (John the Ripper jumbo via OpenCL)
- `bee sat memory` → `memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- `bee-gpu-burn` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
- Ampere: `fp16` + `fp32`/TF32 tensor-core load
- Ada / Hopper: add `fp8`
- Blackwell+: add `fp4`
- PTX fallback is only for missing cuBLASLt/userspace or unsupported narrow datatypes
- Runtime overrides:
- `BEE_MEMTESTER_SIZE_MB`
- `BEE_MEMTESTER_PASSES`
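A consumer of the SAT `summary.txt` described above might read the status values along these lines. This is a sketch under assumptions: the `key=value` line layout is inferred from the `overall_status=OK/FAILED` notation used elsewhere in these docs, and `parseSummary` plus the job names in the sample are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseSummary reads key=value lines and separates overall_status from
// the per-job *_status entries. The key=value layout is an assumption.
func parseSummary(text string) (overall string, jobs map[string]string) {
	jobs = make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, val, ok := strings.Cut(strings.TrimSpace(sc.Text()), "=")
		if !ok {
			continue // not a key=value line
		}
		key, val = strings.TrimSpace(key), strings.TrimSpace(val)
		switch {
		case key == "overall_status":
			overall = val
		case strings.HasSuffix(key, "_status"):
			jobs[strings.TrimSuffix(key, "_status")] = val
		}
	}
	return overall, jobs
}

func main() {
	// Hypothetical sample; real job names depend on the SAT being run.
	sample := "overall_status=FAILED\nnvidia_smi_status=OK\ngpu_burn_status=FAILED\n"
	overall, jobs := parseSummary(sample)
	fmt.Println(overall, jobs["nvidia_smi"], jobs["gpu_burn"])
}
```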
## NVIDIA SAT Web UI flow
```
Web UI: Acceptance Tests page → Run Test button
1. POST /api/sat/nvidia/run → returns job_id
2. GET /api/sat/stream?job_id=... (SSE) — streams stdout/stderr lines live
3. After completion — archive written to /appdata/bee/export/bee-sat/
summary.txt contains overall_status (OK / FAILED) and per-job status values
```
**Critical invariants:**
- `bee-gpu-burn` / `bee-john-gpu-stress` use `exec.CommandContext` — killed on job context cancel.
- Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed).
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.


@@ -4,7 +4,7 @@
Hardware audit LiveCD. Boots on a server via BMC virtual media or USB.
Collects hardware inventory at OS level (not through BMC/Redfish).
Produces `HardwareIngestRequest` JSON compatible with core/reanimator.
Produces `HardwareIngestRequest` JSON compatible with the contract in `bible-local/docs/hardware-ingest-contract.md`.
## Why it exists
@@ -19,18 +19,23 @@ Fills gaps where Redfish/logpile is blind:
## In scope
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
- Unattended operation — no user interaction required
- Machine-readable health summary derived from collector verdicts
- Operator-triggered acceptance tests for NVIDIA, memory, and storage
- NVIDIA SAT includes diagnostic collection plus a lightweight in-image GPU stress step via `bee-gpu-burn`
- `bee-gpu-burn` should exercise tensor/inference paths (`fp16`, `fp32`/TF32, `fp8`, `fp4` when supported by the GPU/userspace stack) and fall back to Driver API PTX burn only if cuBLASLt is unavailable
- Automatic boot audit with operator-facing local console and SSH access
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
- SSH access (dropbear) always available for inspection and debugging
- Interactive TUI (`bee-tui`) for network setup, service management, GPU tests
- GPU stress testing via `gpu_burn` (vendor binary, optional)
- SSH access (OpenSSH) always available for inspection and debugging
- Full web UI via `bee web` on port 80: interactive control panel with live metrics, SAT tests, network config, service management, export, and tools
- Local operator desktop: openbox + Xorg + Chromium auto-opening `http://localhost/`
- Local `tty1` operator UX: `bee` autologin, openbox desktop auto-starts with Chromium on `http://localhost/`
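The datatype selection for `bee-gpu-burn` mentioned in the scope list could be expressed as a lookup on CUDA compute capability. This is an illustrative sketch, not the worker's actual logic: `precisionsFor` is a hypothetical name, the thresholds follow NVIDIA's published compute capabilities (Ampere 8.x, Ada 8.9, Hopper 9.0, Blackwell 10.0 and up), and returning nothing for pre-Ampere GPUs is an assumption.

```go
package main

import "fmt"

// precisionsFor maps a CUDA compute capability to the GEMM datatypes
// the stress worker could attempt on that GPU generation.
func precisionsFor(major, minor int) []string {
	cc := major*10 + minor
	if cc < 80 {
		return nil // pre-Ampere: left to the PTX fallback (assumption)
	}
	out := []string{"fp16", "fp32/tf32"} // Ampere baseline
	if cc >= 89 {                        // Ada (8.9) and Hopper (9.0)
		out = append(out, "fp8")
	}
	if major >= 10 { // Blackwell and newer
		out = append(out, "fp4")
	}
	return out
}

func main() {
	fmt.Println(precisionsFor(9, 0))
}
```

A real worker would still need to confirm that the installed cuBLASLt supports each datatype before using it.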
## Network isolation — CRITICAL
**The live CD runs in an isolated network segment with no internet access.**
- All tools, drivers, and binaries MUST be pre-baked into the ISO at build time
- No `apk add` at boot — packages are installed during ISO creation, not at runtime
- No package installation at boot — packages are installed during ISO creation, not at runtime
- No downloads at boot — NVIDIA modules, vendor tools, and all binaries come from the ISO overlay
- DHCP is used only for LAN access (SSH from operator laptop); internet is NOT assumed
- Any feature requiring network downloads cannot be added to the live CD
@@ -43,32 +48,66 @@ Fills gaps where Redfish/logpile is blind:
- Anything requiring persistent storage on the audited machine
- Windows support
- Any functionality requiring internet access at boot
- Component lifecycle/history across multiple snapshots
- Status transition history (`status_history`, `status_changed_at`) derived from previous exports
- Replacement detection between two or more audit runs
## Contract boundary
- `bee` is responsible for the current hardware snapshot only.
- `bee` should populate current component state, hardware inventory, telemetry, and `status_checked_at`.
- Historical status transitions and component replacement logic belong to the centralized ingest/lifecycle system, not to `bee`.
- Contract fields that have no honest local source on a generic Linux host may remain empty.
## Tech stack
| Component | Technology |
|---|---|
| Audit binary | Go, static, `CGO_ENABLED=0` |
| LiveCD | Alpine Linux 3.21, linux-lts 6.12.x |
| ISO build | Alpine mkimage + apkovl overlay (`iso/overlay/`) |
| Init system | OpenRC |
| SSH | Dropbear (always included) |
| NVIDIA driver | Proprietary `.run` installer, built against linux-lts headers |
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` (not modloop path) |
| glibc compat | `gcompat` — required for `nvidia-smi` (glibc binary on musl Alpine) |
| Builder VM | Alpine 3.21 |
| Live ISO | Debian 12 (bookworm), amd64 live-build image |
| ISO build | Debian `live-build` + overlay sync into `config/includes.chroot/` |
| Init system | `systemd` |
| SSH | OpenSSH server |
| NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
| GPU stress backend | `bee-gpu-burn` + cuBLASLt/cuBLAS/cudart mixed-precision GEMM, with Driver API PTX fallback |
| Builder | Debian 12 host/VM or Debian 12 container image |
## Operator UX
- On the live ISO, `tty1` autologins as `bee`
- `bee-desktop.service` starts X11 + openbox + Chromium on display `:0`
- Chromium opens `http://localhost/` — the full web UI
- SSH remains available independently of the local console path
- Remote operators can open `http://<ip>/` in any browser on the same LAN
- VM-oriented builds also include `qemu-guest-agent` and serial console support for debugging
- The ISO boots with `toram`, so loss of the original USB/BMC virtual media after boot should not break already-installed runtime binaries
## Runtime split
- The main Go application must run both on a normal Linux host and inside the live ISO
- Live-ISO-only responsibilities stay in `iso/` integration code
- Live ISO launches the Go CLI with `--runtime livecd`
- Local/manual runs use `--runtime auto` or `--runtime local`
- Live ISO targets must have enough RAM for the full compressed live medium plus runtime working set because the boot medium is copied into memory at startup
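The `--runtime` switch described above might be validated like this. The sketch only covers flag parsing; `parseRuntime` is a hypothetical name, and using `auto` as the default when the flag is omitted is an assumption.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseRuntime extracts and validates the --runtime value.
func parseRuntime(args []string) (string, error) {
	fs := flag.NewFlagSet("bee", flag.ContinueOnError)
	mode := fs.String("runtime", "auto", "runtime mode: auto, local, or livecd")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *mode {
	case "auto", "local", "livecd":
		return *mode, nil
	}
	return "", fmt.Errorf("unknown --runtime %q", *mode)
}

func main() {
	mode, err := parseRuntime(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("runtime:", mode)
}
```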
## Key paths
| Path | Purpose |
|---|---|
| `audit/cmd/audit/` | CLI entry point |
| `audit/cmd/bee/` | Main CLI entry point |
| `audit/internal/collector/` | Per-subsystem collectors |
| `audit/internal/schema/` | HardwareIngestRequest types |
| `iso/builder/` | ISO build scripts and mkimage profile |
| `iso/overlay/` | Single overlay: files injected into ISO via apkovl |
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, gpu_burn, …) |
| `iso/builder/VERSIONS` | Pinned versions: Alpine, Go, NVIDIA driver, kernel |
| `iso/builder/` | ISO build scripts and `live-build` profile |
| `iso/overlay/` | Source overlay copied into a staged build overlay |
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, arcconf, ssacli, …) |
| `internal/chart/` | Git submodule with `reanimator/chart`, embedded into `bee web` |
| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
| `iso/overlay/etc/profile.d/bee.sh` | tty1 welcome message with web UI URLs |
| `iso/overlay/home/bee/.profile` | `bee` shell profile (PATH only) |
| `iso/overlay/etc/systemd/system/bee-desktop.service` | starts X11 + openbox + chromium |
| `iso/overlay/usr/local/bin/bee-desktop` | startx wrapper for bee-desktop.service |
| `iso/overlay/usr/local/bin/bee-openbox-session` | xinitrc: tint2 + chromium + openbox |
| `dist/` | Build outputs (gitignored) |
| `iso/out/` | Downloaded ISO files (gitignored) |


@@ -1,21 +1,89 @@
# Backlog
## GPU stress test (H100)
## BMC version via IPMI
**Task:** add a GPU burn/stress test to bee-tui without significantly increasing the ISO size.
**Status:** implemented.
**Context:**
- `gpu_burn` (wilicc/gpu-burn) is not suitable: it requires `libcublas.so` (~500MB), which would multiply the ISO size
- `libcuda.so` is already in the ISO (from the NVIDIA .run installer)
Add BMC firmware version collection to the board collector:
- Command: `ipmitool mc info` → `Firmware Revision` field
- Write it to `hardware.firmware[]` as `{device_name: "BMC", version: "..."}`
- Show it in the right-hand TUI column next to the BIOS version
- Graceful skip if `/dev/ipmi0` is absent (silent: same pattern as PSU collector)
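The BMC version item above could be parsed roughly as follows. This is a sketch, not the board collector itself: `parseBMCFirmware` and `FirmwareEntry` are hypothetical names, and the sample text in `main` is an abbreviated illustration of `ipmitool mc info` output, not a captured log.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// FirmwareEntry mirrors the {device_name, version} shape named above.
type FirmwareEntry struct {
	DeviceName string `json:"device_name"`
	Version    string `json:"version"`
}

// parseBMCFirmware looks for the "Firmware Revision" line and returns
// nil when it is absent, so the caller can skip the entry silently.
func parseBMCFirmware(mcInfo string) *FirmwareEntry {
	sc := bufio.NewScanner(strings.NewReader(mcInfo))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if ok && strings.TrimSpace(key) == "Firmware Revision" {
			if v := strings.TrimSpace(val); v != "" {
				return &FirmwareEntry{DeviceName: "BMC", Version: v}
			}
		}
	}
	return nil
}

func main() {
	// Illustrative sample only.
	sample := "Device ID                 : 32\nFirmware Revision         : 2.10\nIPMI Version              : 2.0\n"
	if fw := parseBMCFirmware(sample); fw != nil {
		fmt.Println(fw.DeviceName, fw.Version)
	}
}
```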
**Chosen approach:** write a minimal stress tool on top of the CUDA Driver API
- Uses only `libcuda.so` (already in the ISO), so no new dependencies
- Implements matrix multiplication or a memory bandwidth load via `cuLaunchKernel`
- The binary is ~100KB, compiled with `nvcc` on the builder VM and placed in `iso/vendor/`
- bee-tui invokes it instead of `gpu_burn`
## CPU acceptance test via stress-ng
**Rejected options:**
- `gpu_burn`: needs libcublas (~500MB)
- `nvbandwidth`: bandwidth only, does not burn FLOPs; needs libcudart (~8MB)
- DCGM diag: the right tool for H100, but a ~100MB install
- Download on demand: needs libcublas, so the same problem
**Status:** implemented. CPU in Health Check gets PASS/FAIL from summary.txt.
Add a CPU SAT based on `stress-ng`:
- Bake `stress-ng` into the ISO (add it to `bee.list.chroot`)
- New `bee sat cpu`: runs `stress-ng --cpu 0 --cpu-method all --timeout <N>`, where N is the duration from the selected mode (Quick=60s, Standard=300s, Express=900s)
- In parallel, samples temperatures via `sensors` and throttle flags from the audit JSON
- Result: a SAT archive with summary.txt in the same format as the other SATs (overall_status=OK/FAILED)
- After implementation: CPU in Health Check gets a real PASS/FAIL status
## Real hardware validation
**Status:** waiting for hardware access.
What remains to be confirmed in practice:
- `bee sat nvidia` on a real NVIDIA GPU host
- `bee sat storage` on an NVMe/SATA/RAID host
- `ipmitool sdr` parsing on a server with a real BMC/IPMI
- vendor RAID tooling (`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli`) in the live ISO
## SAT result polish
**Status:** partially done.
What can still be improved after field validation:
- classify vendor-specific self-test outputs in `storage SAT` more precisely
- tune the `memtester` defaults to the RAM size of the target machines
- if needed, extend `bee-gpu-stress` in duration/load
## Hardware Contract backlog
**Status:** refined, reduced to `bee`-only snapshot scope.
### Not backlog for `bee`
These tasks must not be implemented in `bee` because they belong to the centralized ingest/lifecycle layer:
- `status_history`
- `status_changed_at`
- detecting component replacement between snapshots
- timeline/lifecycle/history derived from diffs between exports
`bee` is responsible only for the current hardware snapshot and `status_checked_at`.
### Implementable incrementally
These fields can be developed further as real sample outputs and vendor-specific parsers become available:
- `cpus.correctable_error_count`
- `cpus.uncorrectable_error_count`
- `power_supplies.life_remaining_pct`
- `power_supplies.life_used_pct`
- `pcie_devices.battery_charge_pct`
- `pcie_devices.battery_health_pct`
- `pcie_devices.battery_temperature_c`
- `pcie_devices.battery_voltage_v`
- `pcie_devices.battery_replace_required`
### Vendor/platform-specific, often empty
These fields may be left empty on some platforms even after the parsers are implemented:
- `power_supplies.life_remaining_pct`
- `power_supplies.life_used_pct`
- some of the `pcie_devices.battery_*` fields for unsupported RAID/NIC/GPU vendors
### Unsupported in `bee`
These fields are considered unrealistic for a generic OS-level hardware snapshotter without synthetic/fake data:
- `cpus.life_remaining_pct`
- `cpus.life_used_pct`
- `memory.life_remaining_pct`
- `memory.life_used_pct`
- `memory.spare_blocks_remaining_pct`
- `memory.performance_degraded`
Reason: an ordinary Linux-host audit usually has no honest vendor-neutral runtime source for these metrics.
These fields are considered dropped from the `bee` backlog and must not return to the work plan unless a new, provable local data source appears on the target machines.


@@ -18,6 +18,8 @@ Use the official proprietary NVIDIA `.run` installer for both kernel modules and
- Kernel modules and nvidia-smi come from a single verified source.
- NVIDIA publishes `.sha256sum` alongside each installer — download and verify before use.
- Driver version pinned in `iso/builder/VERSIONS` as `NVIDIA_DRIVER_VERSION`.
- DCGM must track the CUDA user-mode driver major version exposed by `nvidia-smi`.
- For NVIDIA driver branch `590` with CUDA `13.x`, use DCGM 4 package family `datacenter-gpu-manager-4-cuda13`; legacy `datacenter-gpu-manager` 3.x does not provide a working path for this stack.
- Build process: download `.run`, extract, compile `kernel/` sources against `linux-lts-dev`.
- Modules cached in `dist/nvidia-<version>-<kver>/` — rebuild only on version or kernel change.
- ISO size increases by ~50MB for .ko files + nvidia-smi.


@@ -0,0 +1,117 @@
# Decision: Treat memtest as explicit ISO content, not as trusted live-build magic
**Date:** 2026-04-01
**Status:** active
## Context
We have already iterated on `memtest` multiple times and kept cycling between the same ideas.
The commit history shows several distinct attempts:
- `f91bce8` — fixed Bookworm memtest file names to `memtest86+x64.bin` / `memtest86+x64.efi`
- `5857805` — added a binary hook to copy memtest files from the build tree into the ISO root
- `f96b149` — added fallback extraction from the cached `.deb` when `chroot/boot/` stayed empty
- `d43a9ae` — removed the custom hook and switched back to live-build built-in memtest integration
- `60cb8f8` — restored explicit memtest menu entries and added ISO validation
- `3dbc218` / `3869788` — added archived build logs and better memtest diagnostics
Current evidence from the archived `easy-bee-nvidia-v3.14-amd64` logs dated 2026-04-01:
- `lb binary_memtest` does run and installs `memtest86+`
- but the final ISO still does **not** contain `boot/memtest86+x64.bin`
- the final ISO also does **not** contain memtest menu entries in `boot/grub/grub.cfg` or `isolinux/live.cfg`
So the assumption "live-build built-in memtest integration is enough on this stack" is currently false for this project until proven otherwise by a real built ISO.
Additional evidence from the archived `easy-bee-nvidia-v3.17-dirty-amd64` logs dated 2026-04-01:
- the build now completes successfully because memtest is non-blocking by default
- `lb binary_memtest` still runs and installs `memtest86+`
- the project-owned hook `config/hooks/normal/9100-memtest.hook.binary` does execute
- but it executes too early for its current target paths:
- `binary/boot/grub/grub.cfg` is still missing at hook time
- `binary/isolinux/live.cfg` is still missing at hook time
- memtest binaries are also still absent in `binary/boot/`
- later in the build, live-build does create intermediate bootloader configs with memtest lines in the workdir
- but the final ISO still lacks memtest binaries and still lacks memtest lines in extracted ISO `boot/grub/grub.cfg` and `isolinux/live.cfg`
So the assumption "the current normal binary hook path is late enough to patch final memtest artifacts" is also false.
## Known Failed Attempts
These approaches were already tried and should not be repeated blindly:
1. Built-in live-build memtest only.
Reason it failed:
- `lb binary_memtest` runs, but the final ISO still misses memtest binaries and menu entries.
2. Fixing only the memtest file names for Debian Bookworm.
Reason it failed:
- correct file names alone do not make the files appear in the final ISO.
3. Copying memtest from `chroot/boot/` into `binary/boot/` via a binary hook.
Reason it failed:
- in this stack `chroot/boot/` is often empty for memtest payloads at the relevant time.
4. Fallback extraction from cached `memtest86+` `.deb`.
Reason it failed:
- this was explored already and was not enough to stabilize the final ISO path end-to-end.
5. Restoring explicit memtest menu entries in source bootloader templates only.
Reason it failed:
- memtest lines in source templates or intermediate workdir configs do not guarantee the final ISO contains them.
6. Patching `binary/boot/grub/grub.cfg` and `binary/isolinux/live.cfg` from the current `config/hooks/normal/9100-memtest.hook.binary`.
Reason it failed:
- the hook runs before those files exist, so the hook cannot patch them there.
## What This Means
When revisiting memtest later, start from the constraints above rather than retrying the same patterns:
- do not assume the built-in memtest stage is sufficient
- do not assume `chroot/boot/` will contain memtest payloads
- do not assume source bootloader templates are the last writer of final ISO configs
- do not assume the current normal binary hook timing is late enough for final patching
Any future memtest fix must explicitly identify:
- where the memtest binaries are reliably available at build time
- which exact build stage writes the final bootloader configs that land in the ISO
- and a post-build proof from a real ISO, not only from intermediate workdir files
## Decision
For `bee`, memtest must be treated as an explicit ISO artifact with explicit post-build validation.
Project rules from now on:
- Do **not** trust `--memtest memtest86+` by itself.
- A memtest implementation is considered valid only if the produced ISO actually contains:
- `boot/memtest86+x64.bin`
- `boot/memtest86+x64.efi`
- a GRUB menu entry
- an isolinux menu entry
- If live-build built-in integration does not produce those artifacts, use an explicit project-owned mechanism such as:
- a binary hook copying files into `binary/boot/`
- extraction from the cached `memtest86+` `.deb`
- another deterministic build-time copy step
- Do **not** remove such explicit logic later unless a fresh real ISO build proves that built-in integration alone produces all required files and menu entries.
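The acceptance rule above lends itself to a mechanical check. The sketch below assumes the caller has already listed the files in the built ISO and read both bootloader configs; `missingMemtestArtifacts` is a hypothetical name, and detecting a menu entry by searching for the substring `memtest` is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// missingMemtestArtifacts returns the required memtest items that the
// produced ISO lacks. An empty result means the ISO passes the gate.
func missingMemtestArtifacts(isoFiles []string, grubCfg, isolinuxCfg string) []string {
	present := make(map[string]bool, len(isoFiles))
	for _, f := range isoFiles {
		present[strings.TrimPrefix(f, "/")] = true
	}
	var missing []string
	for _, f := range []string{"boot/memtest86+x64.bin", "boot/memtest86+x64.efi"} {
		if !present[f] {
			missing = append(missing, f)
		}
	}
	// Substring match is a simplification of "has a menu entry".
	if !strings.Contains(strings.ToLower(grubCfg), "memtest") {
		missing = append(missing, "GRUB menu entry")
	}
	if !strings.Contains(strings.ToLower(isolinuxCfg), "memtest") {
		missing = append(missing, "isolinux menu entry")
	}
	return missing
}

func main() {
	files := []string{"/boot/memtest86+x64.bin"}
	fmt.Println(missingMemtestArtifacts(files, "menuentry \"Memtest86+\" {}", ""))
}
```

Running the check against files extracted from the final ISO, rather than the intermediate workdir, is what makes it a proof in the sense this ADR asks for.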
Current implementation direction:
- keep the live-build memtest stage enabled if it helps package acquisition
- do not rely on the current early `binary_hooks` timing for final patching
- prefer a post-`lb build` recovery step in `build.sh` that:
- patches the fully materialized `LB_DIR/binary` tree
- injects memtest binaries there
- ensures final bootloader entries there
- reruns late binary stages (`binary_checksums`, `binary_iso`, `binary_zsync`) after the patch
## Consequences
- Future memtest changes must begin by reading this ADR and the commits listed above.
- Future memtest changes must also begin by reading the failed-attempt list above.
- We should stop re-introducing "prefer built-in live-build memtest" as a default assumption without new evidence.
- Memtest validation in `build.sh` is not optional; it is the acceptance gate that prevents another silent regression.
- If we change memtest strategy again, we must update this ADR with the exact build evidence that justified the change.


@@ -5,3 +5,4 @@ One file per decision, named `YYYY-MM-DD-short-topic.md`.
| Date | Decision | Status |
|---|---|---|
| 2026-03-05 | Use NVIDIA proprietary driver | active |
| 2026-04-01 | Treat memtest as explicit ISO content | active |


@@ -0,0 +1,793 @@
---
title: Hardware Ingest JSON Contract
version: "2.7"
updated: "2026-03-15"
maintainer: Reanimator Core
audience: external-integrators, ai-agents
language: ru
---
# Reanimator integration: hardware JSON import contract
Version: **2.7** · Date: **2026-03-15**
This document describes the JSON format for submitting server hardware data to **Reanimator** (hardware lifecycle management).
It is intended for developers of adjacent systems (Redfish collectors, monitoring agents, CMDB exporters) and may be included in the documentation of integrating projects.
> Current version of this document: https://git.mchus.pro/reanimator/core/src/branch/main/bible-local/docs/hardware-ingest-contract.md
---
## Changelog
| Version | Date | Changes |
|--------|------|-----------|
| 2.7 | 2026-03-15 | Synthesizing data in `event_logs` is now explicitly forbidden; integrators must not invent component serial numbers if the source did not provide them |
| 2.6 | 2026-03-15 | Added the optional `event_logs` section for dedup/upsert of `host` / `bmc` / `redfish` logs outside the history timeline |
| 2.5 | 2026-03-15 | Added the common optional field `manufactured_year_week` for component sections (`YYYY-Www`) |
| 2.4 | 2026-03-15 | Added the first wave of component telemetry: health/life fields for `cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies` |
| 2.3 | 2026-03-15 | Added component telemetry fields: `pcie_devices.temperature_c`, `pcie_devices.power_w`, `power_supplies.temperature_c` |
| 2.2 | 2026-03-15 | Added the `numa_node` field on `pcie_devices` for topology/affinity |
| 2.1 | 2026-03-15 | Added the `sensors` section (fans, power, temperatures, other); the `mac_addresses` field on `pcie_devices`; extended the list of `device_class` values |
| 2.0 | 2026-02-01 | Status history (`status_history`, `status_changed_at`); telemetry fields on PSU; async job response |
| 1.0 | 2026-01-01 | Initial version of the contract |
---
## Principles
1. **Snapshot**: the JSON describes the state of the server at collection time. It may include the history of component status changes.
2. **Idempotency**: resending an identical payload does not create duplicates (deduplication by hash).
3. **Partiality**: only the sections for which data is available need to be sent. An empty array and a missing section are equivalent.
4. **Strict schema**: the endpoint uses a strict JSON decoder; unknown fields result in `400 Bad Request`.
5. **Event-driven**: an import creates events in the timeline (LOG_COLLECTED, INSTALLED, REMOVED, FIRMWARE_CHANGED, etc.).
6. **No synthesis by the integrator**: the collector sends only values that were actually collected. Do not invent `serial_number`, `component_ref`, `message`, `message_id`, or any other identifiers/attributes if the source did not provide them or the parser could not extract them reliably.
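The strict-schema and idempotency principles can be illustrated from the receiving side. This is a sketch of the general technique, not Reanimator's implementation: `ingestRequest`, `decodeStrict`, and `contentHash` are hypothetical names, the struct lists only the top-level fields named in this contract, and hashing the raw request bytes with SHA-256 is an assumption about how deduplication might work.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// ingestRequest covers only the top-level envelope; hardware is kept raw here.
type ingestRequest struct {
	Filename    string          `json:"filename,omitempty"`
	SourceType  string          `json:"source_type,omitempty"`
	Protocol    string          `json:"protocol,omitempty"`
	TargetHost  string          `json:"target_host,omitempty"`
	CollectedAt string          `json:"collected_at"`
	Hardware    json.RawMessage `json:"hardware"`
}

// decodeStrict rejects unknown top-level fields. Nested sections would
// need their own typed structs to get the same check at every level.
func decodeStrict(body []byte) (*ingestRequest, error) {
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	var req ingestRequest
	if err := dec.Decode(&req); err != nil {
		return nil, err
	}
	return &req, nil
}

// contentHash derives a deduplication key from the payload bytes.
func contentHash(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

func main() {
	body := []byte(`{"collected_at":"2026-02-10T15:30:00Z","hardware":{},"extra":1}`)
	if _, err := decodeStrict(body); err != nil {
		fmt.Println("rejected:", err)
	}
	fmt.Println("hash:", contentHash(body))
}
```

A collector can run the same strict decode against its own output before sending, which catches stray fields locally instead of through a `400`.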
---
## Endpoint
```
POST /ingest/hardware
Content-Type: application/json
```
**Response on acceptance (202 Accepted):**
```json
{
"status": "accepted",
"job_id": "job_01J..."
}
```
The import runs asynchronously. The result is available at:
```
GET /ingest/hardware/jobs/{job_id}
```
**Response when the job succeeds:**
```json
{
"status": "success",
"bundle_id": "lb_01J...",
"asset_id": "mach_01J...",
"collected_at": "2026-02-10T15:30:00Z",
"duplicate": false,
"summary": {
"parts_observed": 15,
"parts_created": 2,
"parts_updated": 13,
"installations_created": 2,
"installations_closed": 1,
"timeline_events_created": 9,
"failure_events_created": 1
}
}
```
**Response for a duplicate:**
```json
{
"status": "success",
"duplicate": true,
"message": "LogBundle with this content hash already exists"
}
```
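Because the import is asynchronous, a client usually polls the job URL until it finishes. The sketch below shows one way to do that. The contract only documents the `success` job status, so two things here are assumptions: that a failed job reports `"status": "error"`, and that any other status means the job is still running. `fetch_job` is a hypothetical callable that performs `GET /ingest/hardware/jobs/{job_id}` and returns the parsed JSON body.

```python
import time
from typing import Callable

def wait_for_job(fetch_job: Callable[[str], dict], job_id: str,
                 interval_s: float = 2.0, max_attempts: int = 30) -> dict:
    """Poll the job until it reports a final status or attempts run out."""
    for _ in range(max_attempts):
        job = fetch_job(job_id)
        # Assumption: "success" and "error" are the only final statuses.
        if job.get("status") in ("success", "error"):
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

Passing the HTTP call in as a function keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.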
**Error response (400 Bad Request):**
```json
{
"status": "error",
"error": "validation_failed",
"details": {
"field": "hardware.board.serial_number",
"message": "serial_number is required"
}
}
```
Common causes of `400`:
- Invalid `collected_at` format (RFC3339 is required).
- Empty `hardware.board.serial_number`.
- An unknown JSON field at any level.
- The request body exceeds the allowed size.
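Most of these causes can be caught on the client before sending. The following is a rough pre-check, not the server's validator: it only inspects the top level and the `hardware` object, the RFC3339 test is approximate, and the size limit is not checked because its value is not stated here. The function names are hypothetical.

```python
from datetime import datetime

TOP_LEVEL = {"filename", "source_type", "protocol", "target_host", "collected_at", "hardware"}
HARDWARE = {"board", "firmware", "cpus", "memory", "storage", "pcie_devices",
            "power_supplies", "sensors", "event_logs"}

def is_rfc3339(value) -> bool:
    """Approximate check: parseable timestamp with an explicit UTC offset."""
    if not isinstance(value, str):
        return False
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return False
    return parsed.tzinfo is not None

def precheck(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = [f"unknown field: {k}" for k in sorted(payload.keys() - TOP_LEVEL)]
    if not is_rfc3339(payload.get("collected_at")):
        problems.append("collected_at must be RFC3339")
    hardware = payload.get("hardware")
    if not isinstance(hardware, dict):
        return problems + ["hardware is required"]
    problems += [f"unknown field: hardware.{k}" for k in sorted(hardware.keys() - HARDWARE)]
    board = hardware.get("board")
    serial = board.get("serial_number") if isinstance(board, dict) else None
    if not str(serial or "").strip():
        problems.append("hardware.board.serial_number is required")
    return problems
```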
---
## Top-level structure
```json
{
"filename": "redfish://10.10.10.103",
"source_type": "api",
"protocol": "redfish",
"target_host": "10.10.10.103",
"collected_at": "2026-02-10T15:30:00Z",
"hardware": {
"board": { ... },
"firmware": [ ... ],
"cpus": [ ... ],
"memory": [ ... ],
"storage": [ ... ],
"pcie_devices": [ ... ],
"power_supplies": [ ... ],
"sensors": { ... },
"event_logs": [ ... ]
}
}
```
### Top-level fields
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `collected_at` | string RFC3339 | **yes** | Data collection time |
| `hardware` | object | **yes** | Hardware snapshot |
| `hardware.board.serial_number` | string | **yes** | Board/server serial number |
| `target_host` | string | no | IP address or hostname |
| `source_type` | string | no | Source type: `api`, `logfile`, `manual` |
| `protocol` | string | no | Protocol: `redfish`, `ipmi`, `snmp`, `ssh` |
| `filename` | string | no | Source identifier |
---
## Common component status fields
These fields apply to all component sections (`cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies`).
| Field | Type | Description |
|------|-----|----------|
| `status` | string | Current status: `OK`, `Warning`, `Critical`, `Unknown`, `Empty` |
| `status_checked_at` | string RFC3339 | Time of the last status check |
| `status_changed_at` | string RFC3339 | Time of the last status change |
| `status_history` | array | History of status transitions (see below) |
| `error_description` | string | Error/diagnostic text |
| `manufactured_year_week` | string | Manufacturing date in `YYYY-Www` format, for example `2024-W07` |
**`status_history[]` object:**
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `status` | string | **yes** | Status at that moment |
| `changed_at` | string RFC3339 | **yes** | Transition time (an entry without this field is ignored) |
| `details` | string | no | Explanation of the transition |
**Event time priority rules** (the first available source wins):
1. `status_changed_at`
2. The last `status_history` entry with a matching status
3. The last parseable `status_history` entry
4. `status_checked_at`
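The priority order can be sketched as a small resolver. This is an illustration, not the server's code, and it rests on two assumptions: "matching status" means equal to the component's current `status`, and "last" means last in array order.

```python
from datetime import datetime
from typing import Optional

def parse_time(value) -> Optional[datetime]:
    """Parse an RFC3339-style timestamp; return None if it cannot be parsed."""
    if not isinstance(value, str):
        return None
    try:
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None

def resolve_event_time(component: dict) -> Optional[datetime]:
    # 1. An explicit status_changed_at wins.
    changed = parse_time(component.get("status_changed_at"))
    if changed:
        return changed
    history = [(parse_time(h.get("changed_at")), h.get("status"))
               for h in component.get("status_history", [])]
    parseable = [(t, s) for t, s in history if t is not None]
    # 2. Last history entry whose status equals the current status.
    matching = [t for t, s in parseable if s == component.get("status")]
    if matching:
        return matching[-1]
    # 3. Last history entry with a parseable time.
    if parseable:
        return parseable[-1][0]
    # 4. Fall back to the time of the last check.
    return parse_time(component.get("status_checked_at"))
```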
**Rules for sending statuses:**
- Send `status` as the current state of the component in the snapshot.
- If the source keeps a history, send `status_history` sorted by `changed_at` in ascending order.
- Do not include `status_history` entries without `changed_at`.
- All dates are RFC3339; UTC (`Z`) is recommended.
- Use `manufactured_year_week` when the source knows only the year and week of manufacture, without an exact calendar date.
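A collector might prepare these fields as in the sketch below. It assumes every `changed_at` that is present is parseable and carries a UTC offset; otherwise the sort raises an error. The helper names are hypothetical.

```python
from datetime import datetime

def normalize_history(history: list) -> list:
    """Drop entries without changed_at and sort the rest in ascending order."""
    kept = [h for h in history if h.get("changed_at")]
    return sorted(
        kept,
        key=lambda h: datetime.fromisoformat(h["changed_at"].replace("Z", "+00:00")),
    )

def year_week(year: int, week: int) -> str:
    """Format a year and week number as YYYY-Www."""
    return f"{year:04d}-W{week:02d}"

print(year_week(2024, 7))  # 2024-W07
```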
---
## Hardware sections
### board
Basic information about the server. This section is required.
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `serial_number` | string | **yes** | Serial number (the Asset identification key) |
| `manufacturer` | string | no | Manufacturer |
| `product_name` | string | no | Model |
| `part_number` | string | no | Part number |
| `uuid` | string | no | System UUID |
The value `"NULL"` in string fields is treated as missing data.
```json
"board": {
"manufacturer": "Supermicro",
"product_name": "X12DPG-QT6",
"serial_number": "21D634101",
"part_number": "X12DPG-QT6-REV1.01",
"uuid": "d7ef2fe5-2fd0-11f0-910a-346f11040868"
}
```
---
### firmware
Firmware versions of system components (BIOS, BMC, CPLD, and others).
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `device_name` | string | **yes** | Device name (`BIOS`, `BMC`, `CPLD`, …) |
| `version` | string | **yes** | Firmware version |
Entries with an empty `device_name` or `version` are ignored.
A version change creates a `FIRMWARE_CHANGED` event for the Asset.
```json
"firmware": [
{ "device_name": "BIOS", "version": "06.08.05" },
{ "device_name": "BMC", "version": "5.17.00" },
{ "device_name": "CPLD", "version": "01.02.03" }
]
```
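To illustrate the two rules above, the sketch below compares the firmware lists of two snapshots and reports which devices changed version. It only models the documented behavior. How the server treats a device that appears for the first time is not stated here, so such devices are simply not reported. The function names are hypothetical.

```python
def firmware_map(entries: list) -> dict:
    """Keep only entries with both device_name and version, keyed by device_name."""
    return {e["device_name"]: e["version"]
            for e in entries if e.get("device_name") and e.get("version")}

def changed_firmware(previous: list, current: list) -> dict:
    """Return {device_name: (old_version, new_version)} for versions that differ."""
    old, new = firmware_map(previous), firmware_map(current)
    return {name: (old[name], version)
            for name, version in new.items()
            if name in old and old[name] != version}
```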
---
### cpus
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `socket` | int | **yes** | Socket number (used to generate the serial number) |
| `model` | string | no | Processor model |
| `manufacturer` | string | no | Manufacturer |
| `cores` | int | no | Number of cores |
| `threads` | int | no | Number of threads |
| `frequency_mhz` | int | no | Current frequency |
| `max_frequency_mhz` | int | no | Maximum frequency |
| `temperature_c` | float | no | CPU temperature, °C (telemetry) |
| `power_w` | float | no | Current CPU power, W (telemetry) |
| `throttled` | bool | no | Thermal/power throttling detected |
| `correctable_error_count` | int | no | Number of correctable CPU errors |
| `uncorrectable_error_count` | int | no | Number of uncorrectable CPU errors |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `serial_number` | string | no | Serial number (if available) |
| `firmware` | string | no | Microcode version; if the logger reports `Microcode level`, pass it here as is |
| `present` | bool | no | Presence (defaults to `true`) |
| + common status fields | | | see the section above |
**serial_number generation when it is missing:** `{board_serial}-CPU-{socket}`
If the source uses a field or label named `Microcode level`, pass its value in `cpus[].firmware` without any additional conversion.
```json
"cpus": [
{
"socket": 0,
"model": "INTEL(R) XEON(R) GOLD 6530",
"cores": 32,
"threads": 64,
"frequency_mhz": 2100,
"max_frequency_mhz": 4000,
"temperature_c": 61.5,
"power_w": 182.0,
"throttled": false,
"manufacturer": "Intel",
"status": "OK",
"status_checked_at": "2026-02-10T15:28:00Z"
}
]
```
---
### memory
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `slot` | string | no | Slot identifier |
| `present` | bool | no | Module presence (defaults to `true`) |
| `serial_number` | string | no | Serial number |
| `part_number` | string | no | Part number (used as the model) |
| `manufacturer` | string | no | Manufacturer |
| `size_mb` | int | no | Capacity in MB |
| `type` | string | no | Type: `DDR3`, `DDR4`, `DDR5`, … |
| `max_speed_mhz` | int | no | Maximum frequency |
| `current_speed_mhz` | int | no | Current frequency |
| `temperature_c` | float | no | DIMM/module temperature, °C (telemetry) |
| `correctable_ecc_error_count` | int | no | Number of correctable ECC errors |
| `uncorrectable_ecc_error_count` | int | no | Number of uncorrectable ECC errors |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `spare_blocks_remaining_pct` | float | no | Remaining spare blocks, % |
| `performance_degraded` | bool | no | Performance degradation detected |
| `data_loss_detected` | bool | no | The source signals a risk or fact of data loss |
| + common status fields | | | see the section above |
A module without `serial_number` is ignored. A module with `present=false` or `status=Empty` is ignored.
```json
"memory": [
{
"slot": "CPU0_C0D0",
"present": true,
"size_mb": 32768,
"type": "DDR5",
"max_speed_mhz": 4800,
"current_speed_mhz": 4800,
"temperature_c": 43.0,
"correctable_ecc_error_count": 0,
"manufacturer": "Hynix",
"serial_number": "80AD032419E17CEEC1",
"part_number": "HMCG88AGBRA191N",
"status": "OK"
}
]
```
---
### storage
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `slot` | string | no | Slot identifier (for example `OB01`) |
| `serial_number` | string | no | Serial number |
| `model` | string | no | Model |
| `manufacturer` | string | no | Manufacturer |
| `type` | string | no | Type: `NVMe`, `SSD`, `HDD` |
| `interface` | string | no | Interface: `NVMe`, `SATA`, `SAS` |
| `size_gb` | int | no | Size in GB |
| `temperature_c` | float | no | Drive temperature, °C (telemetry) |
| `power_on_hours` | int64 | no | Power-on time, hours |
| `power_cycles` | int64 | no | Number of power cycles |
| `unsafe_shutdowns` | int64 | no | Unsafe shutdowns |
| `media_errors` | int64 | no | Media errors |
| `error_log_entries` | int64 | no | Number of error log entries |
| `written_bytes` | int64 | no | Total bytes written |
| `read_bytes` | int64 | no | Total bytes read |
| `life_used_pct` | float | no | Used life / wear, % |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `available_spare_pct` | float | no | Available spare, % |
| `reallocated_sectors` | int64 | no | Reallocated sectors |
| `current_pending_sectors` | int64 | no | Sectors pending remap |
| `offline_uncorrectable` | int64 | no | Uncorrectable errors found by offline scan |
| `firmware` | string | no | Firmware version |
| `present` | bool | no | Presence (defaults to `true`) |
| + common status fields | | | see the section above |
A drive without `serial_number` is ignored. A change in `firmware` creates a `FIRMWARE_CHANGED` event.
```json
"storage": [
{
"slot": "OB01",
"type": "NVMe",
"model": "INTEL SSDPF2KX076T1",
"size_gb": 7680,
"temperature_c": 38.5,
"power_on_hours": 12450,
"unsafe_shutdowns": 3,
"written_bytes": 9876543210,
"life_remaining_pct": 91.0,
"serial_number": "BTAX41900GF87P6DGN",
"manufacturer": "Intel",
"firmware": "9CV10510",
"interface": "NVMe",
"present": true,
"status": "OK"
}
]
```
---
### pcie_devices
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `slot` | string | no | Canonical installation address of the PCIe device; pass the BDF (`0000:18:00.0`) |
| `vendor_id` | int | no | PCI Vendor ID (decimal) |
| `device_id` | int | no | PCI Device ID (decimal) |
| `numa_node` | int | no | NUMA node / CPU affinity of the device |
| `temperature_c` | float | no | Device temperature, °C (telemetry) |
| `power_w` | float | no | Current device power draw, W (telemetry) |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `ecc_corrected_total` | int64 | no | Total correctable ECC errors |
| `ecc_uncorrected_total` | int64 | no | Total uncorrectable ECC errors |
| `hw_slowdown` | bool | no | The device entered hardware slowdown / protective mode |
| `battery_charge_pct` | float | no | Battery / supercap charge, % |
| `battery_health_pct` | float | no | Battery / supercap health, % |
| `battery_temperature_c` | float | no | Battery / supercap temperature, °C |
| `battery_voltage_v` | float | no | Battery / supercap voltage, V |
| `battery_replace_required` | bool | no | Battery / supercap replacement required |
| `sfp_temperature_c` | float | no | SFP/optic temperature, °C |
| `sfp_tx_power_dbm` | float | no | TX optical power, dBm |
| `sfp_rx_power_dbm` | float | no | RX optical power, dBm |
| `sfp_voltage_v` | float | no | SFP voltage, V |
| `sfp_bias_ma` | float | no | SFP bias current, mA |
| `bdf` | string | no | Deprecated alias for `slot`; when present, ingest normalizes it into `slot` |
| `device_class` | string | no | Device class (see the list below) |
| `manufacturer` | string | no | Manufacturer |
| `model` | string | no | Model |
| `serial_number` | string | no | Serial number |
| `firmware` | string | no | Firmware version |
| `link_width` | int | no | Current link width |
| `link_speed` | string | no | Current speed: `Gen3`, `Gen4`, `Gen5` |
| `max_link_width` | int | no | Maximum link width |
| `max_link_speed` | string | no | Maximum speed |
| `mac_addresses` | string[] | no | Port MAC addresses (for network devices) |
| `present` | bool | no | Presence (defaults to `true`) |
| + common status fields | | | see the section above |
Pass `numa_node` for NIC / InfiniBand / RAID / GPU devices when the source knows the CPU/NUMA affinity. The field is stored in the snapshot attributes of the PCIe component and duplicated in telemetry for topology use cases.
Use the `temperature_c` and `power_w` fields for device-level telemetry of GPU / accelerator / smart PCIe devices. They do not affect component identification.
**serial_number generation when it is missing or `"N/A"`:** `{board_serial}-PCIE-{slot}`, where `slot` for PCIe is the BDF.
`slot` is the only canonical address of a component. For PCIe, pass the BDF in `slot`. The `bdf` field is kept only as a transitional input alias and must not be used as a separate coordinate alongside `slot`.
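Two details of this section often trip up collectors: PCI IDs are usually reported in hexadecimal (for example by `lspci -n`) while the contract expects decimal integers, and older collectors may still emit `bdf` instead of `slot`. A minimal client-side sketch follows. The helper names are hypothetical, and the choice to let an existing `slot` win over `bdf` is an assumption, since the contract does not say which takes precedence.

```python
def hex_id_to_decimal(hex_id: str) -> int:
    """Convert a hexadecimal PCI ID such as '8086' or '0x1572' to a decimal int."""
    return int(hex_id, 16)

def normalize_pcie(device: dict) -> dict:
    """Return a copy where the deprecated `bdf` alias is folded into `slot`."""
    result = dict(device)
    bdf = result.pop("bdf", None)
    # Assumption: an explicitly provided slot is kept; bdf only fills a gap.
    if bdf and not result.get("slot"):
        result["slot"] = bdf
    return result

print(hex_id_to_decimal("8086"))  # 32902
```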
**`device_class` values:**
| Value | Purpose |
|----------|------------|
| `MassStorageController` | RAID controllers |
| `StorageController` | HBA, SAS controllers |
| `NetworkController` | Network adapters (InfiniBand, generic) |
| `EthernetController` | Ethernet NIC |
| `FibreChannelController` | Fibre Channel HBA |
| `VideoController` | GPUs, video cards |
| `ProcessingAccelerator` | Compute accelerators (AI/ML) |
| `DisplayController` | Display controllers (BMC VGA) |
The list is open: arbitrary strings are allowed for non-standard classes.
```json
"pcie_devices": [
{
"slot": "0000:3b:00.0",
"vendor_id": 32902,
"device_id": 5490,
"numa_node": 0,
"temperature_c": 48.5,
"power_w": 18.2,
"sfp_temperature_c": 36.2,
"sfp_tx_power_dbm": -1.8,
"sfp_rx_power_dbm": -2.1,
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",
"serial_number": "K65472-003",
"firmware": "9.20 0x8000d4ae",
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
"status": "OK"
}
]
```
---
### power_supplies
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `slot` | string | no | Slot identifier |
| `present` | bool | no | Presence (defaults to `true`) |
| `serial_number` | string | no | Serial number |
| `part_number` | string | no | Part number |
| `model` | string | no | Model |
| `vendor` | string | no | Manufacturer |
| `wattage_w` | int | no | Rated power in watts |
| `firmware` | string | no | Firmware version |
| `input_type` | string | no | Input type (for example `ACWideRange`) |
| `input_voltage` | float | no | Input voltage, V (telemetry) |
| `input_power_w` | float | no | Input power, W (telemetry) |
| `output_power_w` | float | no | Output power, W (telemetry) |
| `temperature_c` | float | no | PSU temperature, °C (telemetry) |
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| + common status fields | | | see the section above |
The telemetry fields (`input_voltage`, `input_power_w`, `output_power_w`, `temperature_c`, `life_remaining_pct`, `life_used_pct`) are stored in the component attributes and do not affect its identification.
A PSU without `serial_number` is ignored.
```json
"power_supplies": [
{
"slot": "0",
"present": true,
"model": "GW-CRPS3000LW",
"vendor": "Great Wall",
"wattage_w": 3000,
"serial_number": "2P06C102610",
"firmware": "00.03.05",
"status": "OK",
"input_type": "ACWideRange",
"input_power_w": 137,
"output_power_w": 104,
"input_voltage": 215.25,
"temperature_c": 39.5,
"life_remaining_pct": 97.0
}
]
```
---
### sensors
Server sensor readings. The section is optional and is not tied to components.
The data is stored as the last known value (last-known-value) at the Asset level.
```json
"sensors": {
"fans": [ ... ],
"power": [ ... ],
"temperatures": [ ... ],
"other": [ ... ]
}
```
---
### event_logs
Normalized operational server logs from `host`, `bmc`, or `redfish`.
These entries do not go into the history timeline and do not create history events. They are stored in a separate deduplicated log store and shown in a separate UI block (asset logs / host logs).
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `source` | string | **yes** | Log source: `host`, `bmc`, `redfish` |
| `event_time` | string RFC3339 | no | Event time from the source; if missing, the ingest/collection time is used |
| `severity` | string | no | Level: `OK`, `Info`, `Warning`, `Critical`, `Unknown` |
| `message_id` | string | no | Event identifier/code from the source |
| `message` | string | **yes** | Normalized event text |
| `component_ref` | string | no | Reference to the component/device/slot, if it can be extracted |
| `fingerprint` | string | no | Ready-made external dedup key; if not provided, the system computes its own |
| `is_active` | bool | no | Indicates that the event is still active / not cleared, if the source supports a lifecycle |
| `raw_payload` | object | no | Raw vendor-specific payload for diagnostics |
**event_logs rules:**
- Logs are deduplicated within asset + source + fingerprint.
- If `fingerprint` is not provided, the system builds it from normalized fields (`source`, `message_id`, `message`, `component_ref`, time normalization).
- The integrator/log collector must not synthesize event content: do not invent `message`, `message_id`, `component_ref`, serial/device identifiers, or any other fields if they are absent from the source log or were not extracted reliably.
- Receiving the same event again updates `last_seen_at` and the repeat counter and must not create a new timeline/history event.
- `event_logs` are used for a separate UI view of logs and do not change the canonical state of components or the asset by default.
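The deduplication rules can be pictured with a small in-memory model. Everything here is illustrative: the real fingerprint algorithm is not documented in this contract, so `sketch_fingerprint` is a hypothetical stand-in (it hashes whitespace-normalized fields and leaves out time normalization), and the store layout, including the `repeat_count` name, is an assumption.

```python
import hashlib

def sketch_fingerprint(event: dict) -> str:
    """Hypothetical fallback key built from normalized event fields."""
    parts = [
        event.get("source", ""),
        event.get("message_id", ""),
        " ".join(event.get("message", "").split()).lower(),
        event.get("component_ref", ""),
    ]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

def upsert_event(store: dict, asset_id: str, event: dict, seen_at: str) -> bool:
    """Insert or refresh an event; return True only when a new record is created."""
    key = (asset_id, event["source"], event.get("fingerprint") or sketch_fingerprint(event))
    if key in store:
        store[key]["last_seen_at"] = seen_at
        store[key]["repeat_count"] += 1
        return False
    store[key] = {"event": event, "first_seen_at": seen_at,
                  "last_seen_at": seen_at, "repeat_count": 1}
    return True
```

A repeated event only touches the existing record, which mirrors the rule that re-delivery must not create a new history event.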
```json
"event_logs": [
{
"source": "bmc",
"event_time": "2026-03-15T14:03:11Z",
"severity": "Warning",
"message_id": "0x000F",
"message": "Correctable ECC error threshold exceeded",
"component_ref": "CPU0_C0D0",
"raw_payload": {
"sensor": "DIMM_A1",
"sel_record_id": "0042"
}
},
{
"source": "redfish",
"event_time": "2026-03-15T14:03:20Z",
"severity": "Info",
"message_id": "OpenBMC.0.1.SystemReboot",
"message": "System reboot requested by administrator",
"component_ref": "Mainboard"
}
]
```
#### sensors.fans
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `name` | string | **yes** | Unique sensor name within the section |
| `location` | string | no | Physical location |
| `rpm` | int | no | Speed, RPM |
| `status` | string | no | Status: `OK`, `Warning`, `Critical`, `Unknown` |
#### sensors.power
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `voltage_v` | float | no | Voltage, V |
| `current_a` | float | no | Current, A |
| `power_w` | float | no | Power, W |
| `status` | string | no | Status |
#### sensors.temperatures
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `celsius` | float | no | Temperature, °C |
| `threshold_warning_celsius` | float | no | Warning threshold, °C |
| `threshold_critical_celsius` | float | no | Critical threshold, °C |
| `status` | string | no | Status |
#### sensors.other
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `name` | string | **yes** | Unique sensor name |
| `location` | string | no | Physical location |
| `value` | float | no | Value |
| `unit` | string | no | Unit of measurement |
| `status` | string | no | Status |
**sensors rules:**
- A sensor is identified by the pair `(sensor_type, name)`. For duplicates within one payload, the first occurrence is used.
- Sensors without `name` are ignored.
- On every import the values are overwritten (upsert by key).
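These three rules amount to a keyed upsert. A minimal sketch, assuming the last-known values are held in a plain dictionary; the function name and storage shape are hypothetical:

```python
def upsert_sensors(known: dict, sensors: dict) -> dict:
    """Apply one payload's sensors to `known`, a map of (sensor_type, name) -> reading."""
    for sensor_type, readings in sensors.items():
        seen = set()
        for reading in readings:
            name = reading.get("name")
            if not name or name in seen:
                # No name: ignored. Duplicate in this payload: first occurrence wins.
                continue
            seen.add(name)
            known[(sensor_type, name)] = reading  # overwrite the previous value
    return known
```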
```json
"sensors": {
"fans": [
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" },
{ "name": "FAN_CPU0", "location": "CPU0", "rpm": 5600, "status": "OK" }
],
"power": [
{ "name": "12V Rail", "location": "Mainboard", "voltage_v": 12.06, "status": "OK" },
{ "name": "PSU0 Input", "location": "PSU0", "voltage_v": 215.25, "current_a": 0.64, "power_w": 137.0, "status": "OK" }
],
"temperatures": [
{ "name": "CPU0 Temp", "location": "CPU0", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" },
{ "name": "Inlet Temp", "location": "Front", "celsius": 22.0, "threshold_warning_celsius": 40.0, "threshold_critical_celsius": 50.0, "status": "OK" }
],
"other": [
{ "name": "System Humidity", "value": 38.5, "unit": "%", "status": "OK" }
]
}
```
---
## Component status handling
| Status | Behavior |
|--------|-----------|
| `OK` | Normal processing |
| `Warning` | A `COMPONENT_WARNING` event is created |
| `Critical` | A `COMPONENT_FAILED` event is created, plus a record in `failure_events` |
| `Unknown` | The component is considered working; a `COMPONENT_UNKNOWN` event is created |
| `Empty` | The component is not created or updated |
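The table can be read as a lookup from status to planned actions. The sketch below is only a restatement of the table in code. How the server treats a status value outside the table is not specified, so treating it like `OK` is an assumption, and the function name is hypothetical.

```python
STATUS_EVENTS = {
    "OK": [],
    "Warning": ["COMPONENT_WARNING"],
    "Critical": ["COMPONENT_FAILED"],
    "Unknown": ["COMPONENT_UNKNOWN"],
}

def plan_status_handling(status: str) -> dict:
    """Describe what an import does for a component with the given status."""
    if status == "Empty":
        # The component is neither created nor updated.
        return {"upsert": False, "events": [], "failure_event": False}
    return {
        "upsert": True,
        "events": STATUS_EVENTS.get(status, []),  # assumption: unlisted status acts like OK
        "failure_event": status == "Critical",
    }
```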
---
## Handling missing serial_number
General rule for all sections: if the source did not return a serial number and the collector could not extract it reliably, the integrator must not substitute made-up values, hashes, local placeholder identifiers, or "guessed" serial numbers. Only the server-side ingest fallback rules explicitly listed below are allowed.
| Type | Behavior |
|-----|-----------|
| CPU | Generated: `{board_serial}-CPU-{socket}` |
| PCIe | Generated: `{board_serial}-PCIE-{slot}` (if serial is `"N/A"` or empty; `slot` for PCIe is the BDF) |
| Memory | The component is ignored |
| Storage | The component is ignored |
| PSU | The component is ignored |
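The fallback table describes server-side behavior; collectors should not apply it themselves. For readers who want to predict which identifier a component ends up with, the rules can be modeled as follows. The function name and the `kind` labels are hypothetical, and the sketch assumes CPU entries carry `socket` and PCIe entries carry `slot`.

```python
from typing import Optional

def effective_serial(kind: str, board_serial: str, component: dict) -> Optional[str]:
    """Model of the ingest fallback: return the serial used, or None if ignored."""
    serial = (component.get("serial_number") or "").strip()
    if kind == "pcie" and serial == "N/A":
        serial = ""  # "N/A" counts as missing for PCIe devices
    if serial:
        return serial
    if kind == "cpu":
        return f"{board_serial}-CPU-{component['socket']}"
    if kind == "pcie":
        return f"{board_serial}-PCIE-{component['slot']}"
    return None  # memory, storage, psu: the component is ignored

print(effective_serial("cpu", "21D634101", {"socket": 0}))  # 21D634101-CPU-0
```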
If `serial_number` is not unique within one payload for the same `model`:
- The first occurrence keeps the original serial number.
- Each subsequent duplicate receives a placeholder: `NO_SN-XXXXXXXX`.
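Since duplicates are rewritten to placeholders on the server, a collector may want to detect them before sending and log a warning. A small sketch of such a check; the function name is hypothetical, and it only reports duplicates without generating placeholder values:

```python
from collections import Counter

def duplicate_serials(components: list) -> list:
    """Return (model, serial_number) pairs that occur more than once in a section."""
    counts = Counter(
        (c.get("model"), c.get("serial_number"))
        for c in components
        if c.get("serial_number")
    )
    return [key for key, n in counts.items() if n > 1]
```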
---
## Minimal valid example
```json
{
"collected_at": "2026-02-10T15:30:00Z",
"target_host": "192.168.1.100",
"hardware": {
"board": {
"serial_number": "SRV-001"
}
}
}
```
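A payload like this can be produced with the standard library alone. The sketch below builds the two required fields and formats `collected_at` as RFC3339 in UTC. The function name is hypothetical, and `target_host` is left out because it is optional.

```python
import json
from datetime import datetime, timezone

def minimal_payload(board_serial: str, collected_at: datetime) -> str:
    """Serialize the smallest valid request body for POST /ingest/hardware."""
    if collected_at.tzinfo is None:
        raise ValueError("collected_at must be timezone-aware")
    stamp = collected_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps({
        "collected_at": stamp,
        "hardware": {"board": {"serial_number": board_serial}},
    })

print(minimal_payload("SRV-001", datetime(2026, 2, 10, 15, 30, tzinfo=timezone.utc)))
```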
---
## Full example with status history
```json
{
"filename": "redfish://10.10.10.103",
"source_type": "api",
"protocol": "redfish",
"target_host": "10.10.10.103",
"collected_at": "2026-02-10T15:30:00Z",
"hardware": {
"board": {
"manufacturer": "Supermicro",
"product_name": "X12DPG-QT6",
"serial_number": "21D634101"
},
"firmware": [
{ "device_name": "BIOS", "version": "06.08.05" },
{ "device_name": "BMC", "version": "5.17.00" }
],
"cpus": [
{
"socket": 0,
"model": "INTEL(R) XEON(R) GOLD 6530",
"manufacturer": "Intel",
"cores": 32,
"threads": 64,
"status": "OK"
}
],
"storage": [
{
"slot": "OB01",
"type": "NVMe",
"model": "INTEL SSDPF2KX076T1",
"size_gb": 7680,
"serial_number": "BTAX41900GF87P6DGN",
"manufacturer": "Intel",
"firmware": "9CV10510",
"present": true,
"status": "OK",
"status_changed_at": "2026-02-10T15:22:00Z",
"status_history": [
{ "status": "Critical", "changed_at": "2026-02-10T15:10:00Z", "details": "I/O timeout on NVMe queue 3" },
{ "status": "OK", "changed_at": "2026-02-10T15:22:00Z", "details": "Recovered after controller reset" }
]
}
],
"pcie_devices": [
{
"slot": "0000:18:00.0",
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",
"serial_number": "K65472-003",
"mac_addresses": ["3c:fd:fe:aa:bb:cc", "3c:fd:fe:aa:bb:cd"],
"status": "OK"
}
],
"power_supplies": [
{
"slot": "0",
"present": true,
"model": "GW-CRPS3000LW",
"vendor": "Great Wall",
"wattage_w": 3000,
"serial_number": "2P06C102610",
"firmware": "00.03.05",
"status": "OK",
"input_power_w": 137,
"output_power_w": 104,
"input_voltage": 215.25
}
],
"sensors": {
"fans": [
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" }
],
"power": [
{ "name": "12V Rail", "voltage_v": 12.06, "status": "OK" }
],
"temperatures": [
{ "name": "CPU0 Temp", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" }
],
"other": [
{ "name": "System Humidity", "value": 38.5, "unit": "%" }
]
}
}
}
```
