Compare commits

83 Commits

Author SHA1 Message Date
ff0acc3698 feat(webui): server-side SVG charts + reanimator-chart viewer
Metrics:
- Replace canvas JS charts with server-side SVG via go-analyze/charts
- Add ring buffers (120 samples) for CPU temp and power
- /api/metrics/chart/{name}.svg endpoint serves live SVG, polled every 2s

Dashboard:
- Replace custom renderViewerPage with viewer.RenderHTML() from reanimator/chart submodule
- Mount chart static assets at /chart/static/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:07:47 +03:00
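The 120-sample ring buffers mentioned in this commit can be sketched as below. This is a minimal illustration, not the repository's code: the `Ring` type and its `Push`/`Snapshot` methods are hypothetical names, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// Ring is a fixed-capacity buffer that overwrites its oldest sample when full.
// The type and method names are illustrative, not taken from the repository.
type Ring struct {
	buf   []float64
	next  int // index the next Push writes to
	count int // number of valid samples, at most len(buf)
}

// NewRing allocates a ring that holds up to capacity samples.
func NewRing(capacity int) *Ring {
	return &Ring{buf: make([]float64, capacity)}
}

// Push stores v, replacing the oldest sample once the ring is full.
func (r *Ring) Push(v float64) {
	r.buf[r.next] = v
	r.next = (r.next + 1) % len(r.buf)
	if r.count < len(r.buf) {
		r.count++
	}
}

// Snapshot returns the stored samples ordered oldest to newest.
func (r *Ring) Snapshot() []float64 {
	out := make([]float64, 0, r.count)
	start := (r.next - r.count + len(r.buf)) % len(r.buf)
	for i := 0; i < r.count; i++ {
		out = append(out, r.buf[(start+i)%len(r.buf)])
	}
	return out
}

func main() {
	cpuTemp := NewRing(120) // capacity taken from the commit message
	for i := 0; i < 125; i++ {
		cpuTemp.Push(float64(i))
	}
	s := cpuTemp.Snapshot()
	fmt.Println(len(s), s[0], s[len(s)-1])
}
```

A chart endpoint would render `Snapshot()` on each request, so the plot always covers the most recent samples without unbounded memory growth.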
d50760e7c6 fix(webui): remove emojis from nav, fix metrics chart sizing
- Remove all emojis from sidebar nav and logo (broken on server console fonts)
- Fix canvas chart: use parentElement.getBoundingClientRect() for width,
  set explicit H=120px — fixes empty charts when offsetWidth/Height is 0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:49:09 +03:00
ed4f8be019 fix(webui): services table — show state badge, full status on click
Replace raw systemctl output in table cell with:
- state badge (active/failed/inactive) — click to expand
- full systemctl status in collapsible pre block (max 200px scroll)
Fixes layout explosion from multi-line status text in table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:47:59 +03:00
883592d029 feat(desktop): switch to LightDM for X startup (matches Ubuntu LiveCD)
startx from user shell has /dev/fb0 permission issues and is fragile.
LightDM starts Xorg as root — standard LiveCD approach that works
on server hardware / IPMI KVM with nomodeset + fbdev/vesa.

- Add lightdm package, configure autologin as bee/openbox session
- Add /usr/share/xsessions/openbox.desktop
- Remove startx from .profile (LightDM manages X lifecycle)
- Remove Xwrapper.config needs_root_rights workaround (no longer needed)
- Enable lightdm.service in setup hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:17:59 +03:00
a6dcaf1c7e fix(desktop): fix X permissions for server hardware (IPMI KVM)
- Add bee user to video,input groups (fixes /dev/fb0 permission denied)
- Add Xwrapper.config: needs_root_rights=yes (X gets hw access)
- Add xserver-xorg-video-vesa as fallback driver
- Remove dead bee-tui chmod from setup hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:07:25 +03:00
88727fb590 fix(desktop): don't exec startx — fall back to shell on X failure
If X fails to start, the user gets a working shell prompt instead
of a dead session or autologin loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:48:26 +03:00
c9f5224c42 feat(console): add netconf command for quick network setup
Interactive script: lists interfaces, DHCP or static IP config.
Shown as hint in tty1 welcome message.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:07:14 +03:00
7cb5c02a9b fix(desktop): force fbdev Xorg driver for server framebuffer
Explicit xorg.conf.d config prevents Xorg from trying KMS/DRM
drivers that fail on server hardware with nomodeset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:05:42 +03:00
c1aa3cf491 fix(desktop): start X on vt1 from .profile for IPMI KVM compatibility
startx from autologin shell targets VT1 directly — KVM sees the
graphical UI without VT switching. Remove bee-desktop.service
(systemd-launched X defaults to VT7, invisible on KVM).
Add xserver-xorg-video-fbdev for server AST/VGA framebuffer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:03:59 +03:00
f7eb75c57c fix(iso): replace grub-pc/grub-efi-amd64 with -bin variants to fix package conflict
grub-pc and grub-efi-amd64 conflict with each other in Debian 12.
The -bin packages provide the same grub-install binaries without conflict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:12:18 +03:00
004cc4910d feat(webui): replace TUI with full web UI + local openbox desktop
- Remove audit/internal/tui/ (~3000 LOC, bubbletea/lipgloss/reanimator deps)
- Add /api/* REST+SSE endpoints: audit, SAT (nvidia/memory/storage/cpu),
  services, network, export, tools, live metrics stream
- Add async job manager with SSE streaming for long-running operations
- Add platform.SampleLiveMetrics() for live fan/temp/power/GPU polling
- Add multi-page web UI (vanilla JS): Dashboard, Metrics charts, Tests,
  Burn-in, Network, Services, Export, Tools
- Add bee-desktop.service: openbox + Xorg + Chromium opening http://localhost/
- Add openbox/tint2/xorg/xinit/xterm/chromium to ISO package list
- Update .profile, bee.sh, and bible-local docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 19:21:14 +03:00
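The SSE streaming used for long-running jobs can be illustrated with a small handler helper. This is a sketch under assumptions: `streamLines` and `renderSSE` are hypothetical names, the job manager itself is not shown, and each line is assumed to contain no embedded newline (which would break event framing).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamLines writes each line as one Server-Sent Event and flushes it immediately.
// It is a hypothetical helper, not the repository's job manager.
func streamLines(w http.ResponseWriter, lines <-chan string) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	flusher, canFlush := w.(http.Flusher)
	for line := range lines {
		// One event is a "data:" field terminated by a blank line.
		fmt.Fprintf(w, "data: %s\n\n", line)
		if canFlush {
			flusher.Flush()
		}
	}
}

// renderSSE runs streamLines against an in-memory recorder and returns the body.
func renderSSE(lines []string) string {
	ch := make(chan string, len(lines))
	for _, l := range lines {
		ch <- l
	}
	close(ch)
	rec := httptest.NewRecorder()
	streamLines(rec, ch)
	return rec.Body.String()
}

func main() {
	fmt.Print(renderSSE([]string{"step 1 done", "step 2 done"}))
}
```

Flushing after every event is what lets a browser `EventSource` show progress while the job is still running, instead of receiving everything when the response closes.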
ed1cceed8c fix(boot): add nomodeset to fix black screen on server VGA/IPMI KVM (AST chip KMS) 2026-03-27 00:13:36 +03:00
9fe9f061f8 fix(nccl-tests): set LIBRARY_PATH so ld finds libnccl.so in nccl cache 2026-03-26 23:59:06 +03:00
837a1fb981 fix(nccl-tests): pin /usr/local/cuda→12.8 symlink, auto-detect gencode by nvcc version 2026-03-26 23:54:07 +03:00
1f43b4e050 fix(nccl-tests): pass NCCL_LIB from nccl cache to fix -lnccl link error 2026-03-26 23:52:25 +03:00
83bbc8a1bc fix(nccl-tests): upgrade to cuda-nvcc-12-8, add sm_100 (Blackwell B100/B200) 2026-03-26 23:51:26 +03:00
896bdb6ee8 fix(nccl-tests): use cuda-nvcc-12-6 to support Ampere/Volta (sm_70..sm_90) 2026-03-26 23:50:36 +03:00
5407c26e25 fix(nccl-tests): CUDA 13.0 supports only sm_90+ (Hopper/H100) 2026-03-26 23:49:45 +03:00
4fddaba9c5 fix(nccl-tests): limit CUDA gencode to sm_70+ (CUDA 13 dropped Pascal) 2026-03-26 23:48:40 +03:00
d2f384b6eb fix(nccl-tests): use plain make instead of non-existent all_reduce_perf target 2026-03-26 23:47:49 +03:00
25f0f30aaf fix(boot): fix black screen on monitor, stop log spam on console
- Add console=tty0 so VGA display gets kernel output (was serial-only)
- Change loglevel=7→3 (debug→errors only)
- Add quiet to suppress verbose kernel boot messages
- journald: ForwardToConsole=no so service logs don't flood tty1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:45:09 +03:00
a57b037a91 feat(installer): add 'Install to disk' in Tools submenu
Copies the live system to a local disk via unsquashfs — no debootstrap,
no network required. Supports UEFI (GPT+EFI) and BIOS (MBR) layouts.

ISO:
- Add squashfs-tools, parted, grub-pc, grub-efi-amd64 to package list
- New overlay script bee-install: partitions, formats, unsquashfs,
  writes fstab, runs grub-install+update-grub in chroot

Go TUI:
- Settings → Tools submenu (Install to disk, Check tools)
- Disk picker screen: lists non-USB, non-boot disks via lsblk
- Confirm screen warns about data loss
- Runs with live progress tail of /tmp/bee-install.log
- platform/install.go: ListInstallDisks, InstallToDisk, findLiveBootDevice

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:35:01 +03:00
5644231f9a feat(nccl): add nccl-tests all_reduce_perf for GPU bandwidth testing
- Dockerfile: install cuda-nvcc-13-0 from NVIDIA repo for compilation
- build-nccl-tests.sh: downloads libnccl-dev for nccl.h, builds all_reduce_perf
- build.sh: runs nccl-tests build, injects binary into /usr/local/bin/
- platform: RunNCCLTests() auto-detects GPU count, runs all_reduce_perf
- TUI: NCCL bandwidth test entry in Burn-in Tests screen [N] hotkey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:22:19 +03:00
eea98e6d76 feat(dcgm): add NVIDIA DCGM diagnostics, fix KVM console
- Add 9002-nvidia-dcgm.hook.chroot: installs datacenter-gpu-manager
  from NVIDIA apt repo during live-build
- Enable nvidia-dcgm.service in chroot setup hook
- Replace bee-gpu-stress with dcgmi diag (levels 1-4) in NVIDIA SAT
- TUI: replace GPU checkbox + duration UI with DCGM level selection
- Remove console=tty2 from boot params: KVM/VGA now shows tty1
  where bee-tui runs, fixing unresponsive console

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:08:12 +03:00
967455194c feat(iso): make toram optional, add 'load to RAM' boot menu entry
Default boot no longer loads ISO to RAM (slow on BMC virtual media).
Separate menu entry added for toram in both GRUB and isolinux.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 21:45:04 +03:00
79dabf3efb fix(build): link bee-gpu-stress with -lm for sqrt()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:55:14 +03:00
1336f5b95c fix(cublas): copy include dirs containing files without .h extension
nv/target has no .h suffix; use -type f instead of -name '*.h' to
detect non-empty include directories.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:53:08 +03:00
31486a31c1 fix(cublas): add cuda-cccl package for nv/target header
cuda_fp16.h (included by cublas_api.h) requires <nv/target> from
the CUDA C++ Core Libraries (cuda-cccl-13-0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:49:46 +03:00
aa3fc332ba fix(cublas): check for .h in subdirs when copying non-standard include dirs
ls *.h missed headers in subdirectories like crt/host_defines.h;
use find -maxdepth 2 instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:47:39 +03:00
62c57b87f2 fix(cublas): allow version-free lookup for cuda-crt package
cuda-crt-13-0 may not share the same version string as cuda-cudart-13-0;
pass empty version to lookup_pkg to match the first available version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:46:45 +03:00
f600261546 fix(cublas): add cuda-crt package for crt/host_defines.h
cublasLt.h -> cublas_api.h -> driver_types.h -> crt/host_defines.h
which lives in the cuda-crt-13-0 package, not cudart-dev.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:42:40 +03:00
d7ca04bdfb fix(cublas): search all include/ dirs in deb for CUDA headers
NVIDIA CUDA .deb packages install headers under
/usr/local/cuda-X.Y/targets/x86_64-linux/include/ not /usr/include/,
causing copy_headers() to silently skip them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:35:21 +03:00
5433652c70 fix(cublas): prevent double-print in lookup_pkg awk END block
awk exit in the blank-line block jumps to END, which printed the
result again causing repo_sha to contain the hash twice with a newline,
breaking the sha256 string comparison.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:29:10 +03:00
b25f014dbd fix(cublas): strip CR from Packages.gz fields to fix sha256 comparison
Debian Packages.gz uses CRLF line endings; \r in the captured SHA256
field caused string comparison to fail even when hashes were identical.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:24:58 +03:00
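The failure mode described here (a trailing `\r` making two identical hashes compare unequal) is easy to reproduce. The fix in the commit is in a shell script; the Go sketch below only illustrates the same idea, and `sameDigest` is a hypothetical helper. The digest is shortened for readability.

```go
package main

import (
	"fmt"
	"strings"
)

// sameDigest compares two hex digests after removing surrounding whitespace,
// including a trailing carriage return, and ignoring letter case.
func sameDigest(fromIndex, computed string) bool {
	return strings.EqualFold(strings.TrimSpace(fromIndex), strings.TrimSpace(computed))
}

func main() {
	indexField := "9f86d081884c7d65\r" // value as captured from a CRLF-terminated line
	computed := "9f86d081884c7d65"

	fmt.Println(indexField == computed)           // raw comparison
	fmt.Println(sameDigest(indexField, computed)) // normalized comparison
}
```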
d69a46f211 fix(cublas): redirect diagnostic echo to stderr in download_verified_pkg
Echo messages captured in stdout polluted the return value of
download_verified_pkg(), causing extract_deb() to receive a
multi-line string instead of a file path and silently exit via set -e.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:22:39 +03:00
Mikhail Chusavitin
fc5c2019aa iso: improve burn-in, export, and live boot 2026-03-26 18:56:19 +03:00
Mikhail Chusavitin
67a215c66f fix(iso): route kernel logs to tty2, keep tty1 clean for TUI
console=tty0 sent kernel messages to the active VT (tty1), overwriting
the TUI. Changed to console=tty2 so kernel logs land on a dedicated
console. tty1 is now clean; operator can press Alt+F2 to inspect kernel
messages and Alt+F3 for an extra shell.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 17:40:44 +03:00
Mikhail Chusavitin
8b4bfdf5ad feat(tui): live GPU chart during stress test, full VRAM allocation
- GPU Platform Stress Test now shows a live in-TUI chart instead of nvtop.
  nvidia-smi is polled every second; up to 60 data points per GPU kept.
  All three metrics (Usage %, Temp °C, Power W) drawn on a single plot,
  each normalised to its own range and rendered in a different colour.
- Memory allocation changed from MemoryMB/16 to MemoryMB-512 (full VRAM
  minus 512 MB driver overhead) so bee-gpu-stress actually stresses memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 17:37:20 +03:00
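Two pieces of arithmetic in this commit can be sketched directly: per-metric normalisation (so usage, temperature and power share one plot) and the new VRAM allocation size. The function names are hypothetical, min/max scaling is one plausible reading of "normalised to its own range", and the clamp at zero is an added assumption.

```go
package main

import "fmt"

// normalize maps each value onto [0,1] using the series' own min and max,
// so metrics with different units can share one plot. A flat series maps to 0.5.
func normalize(series []float64) []float64 {
	if len(series) == 0 {
		return nil
	}
	lo, hi := series[0], series[0]
	for _, v := range series {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	out := make([]float64, len(series))
	for i, v := range series {
		if hi == lo {
			out[i] = 0.5
			continue
		}
		out[i] = (v - lo) / (hi - lo)
	}
	return out
}

// stressAllocMB returns total VRAM minus 512 MB, as described in the commit.
// Returning 0 for very small values is an assumption of this sketch.
func stressAllocMB(memoryMB int) int {
	if memoryMB <= 512 {
		return 0
	}
	return memoryMB - 512
}

func main() {
	fmt.Println(normalize([]float64{40, 60, 80}))
	fmt.Println("old:", 81920/16, "MB  new:", stressAllocMB(81920), "MB")
}
```

For an 80 GiB card the old formula allocated about 5 GB, while the new one allocates nearly all of the VRAM, which is why the commit says the test now actually stresses memory.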
Mikhail Chusavitin
0a52a4f3ba fix(iso): restore loglevel=7 on VGA console for crash visibility
loglevel=3 was hiding all kernel messages on tty0/ttyS0 except errors.
Machine crashes (panics, driver oops, module failures) were silent on VGA.

Restored loglevel=7 so kernel messages up to debug are printed to both
tty0 (VGA) and ttyS0 (SOL). Journald MaxLevelConsole reduced to info
(was debug) to reduce noise on SOL while keeping it useful.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 11:19:07 +03:00
Mikhail Chusavitin
b132f7973a fix(iso): derive ISO filename from iso/v* tags, not audit/v*
Previously the ISO file was named after git describe --match 'audit/v*',
so a new iso/ tag produced names like v1.0.9-1-gXXXXXXX instead of v1.0.17.
Now build.sh has resolve_iso_version() that looks at iso/v* tags separately.
The bee binary inside the ISO still uses AUDIT_VERSION_EFFECTIVE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 11:05:51 +03:00
Mikhail Chusavitin
bd94b6c792 fix(iso): add libnvidia-ptxjitcompiler + ldconfig for PTX JIT and NCCL
- build-nvidia-module.sh: copy libnvidia-ptxjitcompiler.so.* alongside
  libcuda/libnvidia-ml — required by cuModuleLoadDataEx for PTX JIT.
  Without it: CUDA_ERROR_JIT_COMPILER_NOT_FOUND at runtime.
  Cache check updated to force rebuild when ptxjitcompiler is missing.
- bee-nvidia-load: run ldconfig after module load so that NVIDIA/NCCL
  libs injected into /usr/lib/ are visible to dlopen() callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:37:27 +03:00
Mikhail Chusavitin
06017eddfd feat(tui): remove nvtop auto-launch from NVIDIA SAT
nvtop is no longer shown during NVIDIA SAT runs.
[o] Open nvtop shortcut also removed from the running screen.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:29:05 +03:00
Mikhail Chusavitin
0ac7b6a963 fix(iso): restore console=tty0 — VGA screen was black without it
Commit d36e844 dropped console=tty0 and added dual-serial + debug logging.
Without console=tty0 the kernel never initialises the VGA console,
leaving the physical screen permanently blank.

- Restore console=tty0 (VGA) as primary, keep console=ttyS0 for SOL
- Drop console=ttyS1 (redundant second serial port)
- Replace loglevel=7 + journald debug flood with loglevel=3 (errors only)
  so kernel messages don't overwrite the TUI on the local screen
- Remove systemd.log_target/forward_to_console debug params

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:23:53 +03:00
Mikhail Chusavitin
3d2ae4cdcb fix(iso): use Ubuntu jammy codename for AMD ROCm repo — Debian not supported
AMD does not publish Debian Bookworm packages at all (only focal/jammy/noble).
Switch ROCM_UBUNTU_DIST to "jammy"; jammy packages install cleanly on
Debian 12 due to compatible glibc. Also expand candidate list to include
point-releases (6.3.4, 6.3.3, …) so we pick the latest actually-published one.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:08:58 +03:00
Mikhail Chusavitin
4669f14f4f feat(tui): GPU Platform Stress Test — live nvtop chart during test
Apply the same pattern as NVIDIA SAT: launch nvtop via tea.ExecProcess
so it occupies the full terminal as a live GPU chart (temp, power, fan,
utilisation lines) while the stress test runs in the background.

- Add screenGPUStressRunning screen + dedicated running/render handlers
- startGPUStressTest: tea.Batch(stress goroutine, tea.ExecProcess(nvtop))
- [o] reopen nvtop at any time; [a] abort (cancels context)
- Graceful degradation: test still runs if nvtop is not on PATH
- gpuStressDoneMsg routes result to screenOutput on completion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:01:31 +03:00
Mikhail Chusavitin
540a9e39b8 refactor(audit): rename Fan Stress Test → GPU Platform Stress Test
Update all user-facing strings in TUI and ActionResult title.
Internal identifiers (types, functions, file name) unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:56:25 +03:00
Mikhail Chusavitin
58510207fa fix(iso): fall back through ROCm 6.4→6.3→6.2 if repo Release file missing
ROCm 6.4 does not yet publish a Release file for Debian Bookworm, causing
the live-build chroot hook to fail with "does not have a Release file".

Try each version in ROCM_CANDIDATES order; skip to the next if apt-get update
fails (repo unavailable). Exit gracefully if none are available.
Also rename inner 'candidate' variable to 'smi_path' to avoid collision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:52:17 +03:00
Mikhail Chusavitin
4cd7c9ab4e feat(audit): fan-stress SAT for MSI case-04 fan lag & thermal throttle detection
Two-phase GPU thermal cycling test with per-second telemetry:
- Phases: baseline → load1 → pause (no cooldown) → load2 → cooldown
- Monitors: fan RPM (ipmitool sdr), CPU/server temps (ipmitool/sensors),
  system power (ipmitool dcmi), GPU temp/power/usage/clock/throttle (nvidia-smi)
- Detects throttling via clocks_throttle_reasons.active bitmask
- Measures fan response lag from load start (validates case-04 ~2s lag)
- Exports metrics.csv (wide format, one row/sec) and fan-sensors.csv (long format)
- TUI: adds [F] Fan Stress Test to Health Check screen with Quick/Standard/Express modes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:51:03 +03:00
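The fan response lag measurement can be sketched from the per-second telemetry. This is an illustration only: the `Sample` type, the `fanLagSeconds` function and the "baseline plus minimum rise" criterion are assumptions, since the commit does not say how lag is computed. Samples are assumed to be sorted by time.

```go
package main

import "fmt"

// Sample is one per-second telemetry row. Field names are illustrative.
type Sample struct {
	Second int     // seconds since the run started
	FanRPM float64 // fan speed at that second
}

// fanLagSeconds returns how many seconds after loadStart the fan first
// exceeds baselineRPM by at least minRiseRPM. ok is false if it never does.
func fanLagSeconds(samples []Sample, loadStart int, baselineRPM, minRiseRPM float64) (lag int, ok bool) {
	for _, s := range samples {
		if s.Second < loadStart {
			continue
		}
		if s.FanRPM >= baselineRPM+minRiseRPM {
			return s.Second - loadStart, true
		}
	}
	return 0, false
}

func main() {
	samples := []Sample{
		{0, 3000}, {1, 3000}, {2, 3050}, {3, 4200}, {4, 5200},
	}
	// Load starts at second 1; a rise of 500 RPM counts as a response.
	lag, ok := fanLagSeconds(samples, 1, 3000, 500)
	fmt.Println(lag, ok)
}
```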
Mikhail Chusavitin
cfe255f6e4 Release audit/v1.0.5 2026-03-26 09:41:19 +03:00
Mikhail Chusavitin
8b9d3447d7 Overlay SAT results into audit JSON 2026-03-25 20:11:03 +03:00
Mikhail Chusavitin
614b7cad61 Improve PCIe inventory and hardware identity collection 2026-03-25 20:00:38 +03:00
Mikhail Chusavitin
9a1df9b1ba Tighten support bundles and fix AMD runtime checks 2026-03-25 19:35:25 +03:00
Mikhail Chusavitin
30cf014d58 Rename NVIDIA bootloader modes 2026-03-25 19:12:26 +03:00
Mikhail Chusavitin
27d478aed6 Add bootloader choice for safe vs full NVIDIA boot 2026-03-25 19:11:15 +03:00
Mikhail Chusavitin
d36e8442a9 Stabilize live ISO consoles and NVIDIA boot path 2026-03-25 19:05:18 +03:00
Mikhail Chusavitin
b345b0d14d Derive ISO version from git tags 2026-03-25 18:40:48 +03:00
Mikhail Chusavitin
0a1ac2ab9f Bootstrap ROCm hook prerequisites in ISO build 2026-03-25 18:38:19 +03:00
Mikhail Chusavitin
1e62f828c6 Embed MOTD banner into TUI 2026-03-25 18:11:17 +03:00
Mikhail Chusavitin
f8c997d272 Add missing SAT progress TUI helpers 2026-03-25 18:03:45 +03:00
Mikhail Chusavitin
0c16616cc9 1. Verbose live progress during SAT tests (CPU, Memory, Storage, AMD GPU)
- New tui/sat_progress.go: polls {DefaultSATBaseDir}/{prefix}-*/verbose.log every 300ms and parses completed/in-progress steps
- Busy screen now shows each step as PASS  lscpu (234ms) / FAIL  stress-ng (60.0s) / ...   sensors-after instead of just "Working..."

2. Test results shown on screen (instead of just "Archive written to /path")
- RunCPUAcceptancePackResult, RunMemoryAcceptancePackResult, RunStorageAcceptancePackResult, RunAMDAcceptancePackResult now read summary.txt from the run directory and return a formatted per-step result:

  Run: 2025-03-25T10:00:00Z

  PASS  lscpu
  PASS  sensors-before
  FAIL  stress-ng
  PASS  sensors-after

  Overall: FAILED  (ok=3  failed=1)

3. AMD GPU SAT with auto-detection
- platform.System.DetectGPUVendor(): checks /dev/nvidia0 → "nvidia", /dev/kfd → "amd"
- platform.System.RunAMDAcceptancePack(): runs rocm-smi, rocm-smi --showallinfo, dmidecode
- GPU SAT (G key / GPU row enter) automatically routes to AMD or NVIDIA based on detected vendor
- "Run All" also auto-detects vendor

4. Panel detail view
- GPU detail now shows the most recent (NVIDIA or AMD) SAT result, whichever is newer
- All SAT detail views use the same human-readable formatSATDetail format
2026-03-25 17:54:27 +03:00
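The vendor auto-detection described in item 3 can be sketched as follows. The device paths and return values come from the commit message; the function shape, the injected `exists` callback, and the empty-string result for "no GPU" are assumptions of this sketch rather than the actual `platform.System.DetectGPUVendor()` signature.

```go
package main

import (
	"fmt"
	"os"
)

// detectGPUVendor applies the order given in the commit message:
// /dev/nvidia0 first, then /dev/kfd. exists is injected so the logic
// can be exercised without real device nodes.
func detectGPUVendor(exists func(path string) bool) string {
	switch {
	case exists("/dev/nvidia0"):
		return "nvidia"
	case exists("/dev/kfd"):
		return "amd"
	default:
		return "" // assumption: empty string means no supported GPU found
	}
}

// pathExists reports whether os.Stat succeeds for path.
func pathExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	fmt.Printf("detected vendor: %q\n", detectGPUVendor(pathExists))
}
```

Checking NVIDIA first means a host that exposes both device nodes is routed to the NVIDIA SAT in this sketch.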
Mikhail Chusavitin
adcc147b32 feat(iso): add AMD Instinct MI250X/MI250 driver support
- firmware-amd-graphics: Aldebaran firmware blobs (fixes amdgpu IB ring
  test errors on MI250/MI250X at boot)
- 9001-amd-rocm.hook.chroot: adds AMD ROCm 6.4 apt repo and installs
  rocm-smi-lib for GPU monitoring (analogous to nvidia-smi)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 15:42:10 +03:00
Mikhail Chusavitin
94e233651e fix(sat): fix nvme device-self-test command flags
--start is not a valid nvme-cli flag; correct syntax is -s 1 (short test).
Add --wait so the command blocks until the test completes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 15:24:52 +03:00
Mikhail Chusavitin
03c36f6cb2 fix(iso): add stress-ng to package list for CPU SAT
stress-ng was missing from the LiveCD — CPU acceptance test exited
immediately with rc=1 because the binary was not found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:50:30 +03:00
Mikhail Chusavitin
a221814797 fix(tui): fix GPU panel row showing AMD chipset devices, clear screen before TUI
isGPUDevice matched all AMD vendor PCIe devices (SATA, crypto coprocessors,
PCIe dummies) because of a broad strings.Contains(vendor,"amd") check.
Remove it — AMD Instinct/Radeon GPUs are caught by ProcessingAccelerator /
DisplayController class. Also exclude ASPEED (BMC VGA adapter).

Add clear before bee-tui to avoid dirty terminal output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:49:09 +03:00
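The classification change can be sketched as a class-based check with an explicit ASPEED exclusion. This is a simplified illustration: the `PCIDevice` struct, the class strings, and matching ASPEED by vendor name are assumptions, and any other rules the real `isGPUDevice` applies (for example for NVIDIA devices) are not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// PCIDevice carries only the fields this sketch needs; the real struct may differ.
type PCIDevice struct {
	Vendor string
	Class  string
}

// isGPUDevice decides by device class rather than by vendor name, so AMD
// chipset devices (SATA, crypto coprocessors) are no longer counted as GPUs.
func isGPUDevice(d PCIDevice) bool {
	// The BMC's VGA adapter is a display controller but not a GPU of interest.
	if strings.Contains(strings.ToLower(d.Vendor), "aspeed") {
		return false
	}
	switch d.Class {
	case "ProcessingAccelerator", "DisplayController":
		return true
	}
	return false
}

func main() {
	devices := []PCIDevice{
		{Vendor: "Advanced Micro Devices", Class: "MassStorageController"},
		{Vendor: "Advanced Micro Devices", Class: "ProcessingAccelerator"},
		{Vendor: "ASPEED Technology", Class: "DisplayController"},
	}
	for _, d := range devices {
		fmt.Printf("%-24s %-22s gpu=%v\n", d.Vendor, d.Class, isGPUDevice(d))
	}
}
```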
Mikhail Chusavitin
b6619d5ccc fix(iso): skip NVIDIA module load when no NVIDIA GPU present
Check PCI vendor 10de before attempting insmod — avoids spurious
nvidia_uvm symbol errors on systems without NVIDIA hardware (e.g. AMD MI350).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:38:31 +03:00
Mikhail Chusavitin
450193b063 feat(iso): remove splash.png, show EASY-BEE ASCII art in GRUB text mode
The graphical splash had "BEE / HARDWARE AUDIT" baked into the PNG,
overriding the echo ASCII art. Replace with a plain black background
so the EASY-BEE block-char banner from grub.cfg echo commands is visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:32:23 +03:00
Mikhail Chusavitin
ee8931f171 fix(iso): pin ISO kernel to same ABI as compiled NVIDIA modules
Export detected DEBIAN_KERNEL_ABI as BEE_KERNEL_ABI from build.sh so
auto/config can pin linux-packages to the exact versioned package
(e.g. linux-image-6.1.0-31 + flavour amd64 = linux-image-6.1.0-31-amd64).
This prevents nvidia.ko vermagic mismatch if the linux-image-amd64
meta-package is updated between build start and lb build chroot step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 12:26:59 +03:00
Mikhail Chusavitin
b771d95894 fix(iso): fix linux-packages to "linux-image" so lb appends flavour correctly
live-build constructs the kernel package as <linux-packages>-<linux-flavours>,
so "linux-image-amd64" + "amd64" = "linux-image-amd64-amd64" (not found).
The correct value is "linux-image" + "amd64" = "linux-image-amd64".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:45:41 +03:00
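The naming rule behind this fix is plain string concatenation, which the following sketch makes explicit. `kernelPackage` is a hypothetical helper that mirrors the `<linux-packages>-<linux-flavours>` rule quoted in the commit message; it is not part of live-build.

```go
package main

import "fmt"

// kernelPackage joins the package stem and the flavour the way the commit
// message describes live-build doing it.
func kernelPackage(linuxPackages, flavour string) string {
	return linuxPackages + "-" + flavour
}

func main() {
	fmt.Println(kernelPackage("linux-image-amd64", "amd64")) // the broken setting
	fmt.Println(kernelPackage("linux-image", "amd64"))       // the corrected setting
}
```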
Mikhail Chusavitin
8e60e474dc feat(iso): rebrand to EASY-BEE with ASCII art banner
Replace "Bee Hardware Audit" branding with EASY-BEE across bootloader
and LiveCD: grub.cfg menu entries, echo ASCII art before menu,
motd banner, iso-volume and iso-application metadata.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:45:12 +03:00
Mikhail Chusavitin
2f4ec2acda fix(iso): auto-detect and install kernel headers at build time
- Dockerfile: linux-headers-amd64 meta-package instead of pinned ABI;
  remove DEBIAN_KERNEL_ABI build-arg (no longer needed at image build time)
- build-in-container.sh: drop --build-arg DEBIAN_KERNEL_ABI
- build.sh: apt-get update + detect ABI from apt-cache at build time;
  auto-install linux-headers-<ABI> if kernel changed since image build

Image rebuild is now needed only when changing Go version or lb tools,
not on every Debian kernel point release.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:25:29 +03:00
Mikhail Chusavitin
7ed5cb0306 fix(iso): auto-detect kernel ABI at build time instead of pinning
DEBIAN_KERNEL_ABI=auto in VERSIONS — build.sh queries
apt-cache depends linux-image-amd64 to find the current ABI.
lb config now uses linux-image-amd64 meta-package.

This prevents build failures when Debian drops old kernel packages
from the repo (happens with every point release).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:17:29 +03:00
Mikhail Chusavitin
6df7ac68f5 fix(iso): bump kernel ABI to 6.1.0-44 (6.1.164-1 in bookworm)
6.1.0-43 is no longer available in Debian repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:16:09 +03:00
Mikhail Chusavitin
0ce23aea4f feat(iso): add exfatprogs and ntfs-3g for USB export support
exFAT is the default filesystem on USB drives >32GB sold today.
Without exfatprogs, mount fails silently and export to such drives is broken.
ntfs-3g covers Windows-formatted drives.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:12:51 +03:00
Mikhail Chusavitin
36dff6e584 feat: CPU SAT via stress-ng + BMC version via ipmitool
BMC:
- collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent
- collector/collector.go: append BMC firmware record to snap.Firmware
- app/panel.go: show BMC version in TUI right-panel header alongside BIOS

CPU SAT:
- platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng
- app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated
- app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data
- tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label
- tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult
- tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration
- cmd/bee/main.go: bee sat cpu [--duration N] subcommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:06:12 +03:00
Mikhail Chusavitin
1c80906c1f feat(tui): rebuild TUI around hardware diagnostics (Health Check + two-column layout)
- Replace 12-item flat menu with 4-item main menu: Health Check, Export support bundle, Settings, Exit
- Add Health Check screen (Lenovo-style): per-component checkboxes (GPU/MEM/DISK/CPU), Quick/Standard/Express modes, Run All, letter hotkeys G/M/S/C/R/A/1/2/3
- Add two-column main screen: left = menu, right = hardware panel with colored PASS/FAIL/CANCEL/N/A status per component; Tab/→ switches focus, Enter opens component detail
- Add app.LoadHardwarePanel() + ComponentDetailResult() reading audit JSON and SAT summary.txt files
- Move Network/Services/audit actions into Settings submenu
- Export: support bundle only (remove separate audit JSON export)
- Delete screen_acceptance.go; add screen_health_check.go, screen_settings.go, app/panel.go
- Add BMC + CPU stress-ng tests to backlog
- Update bible submodule
- Rewrite tui_test.go for new screen/action structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 10:59:21 +03:00
Mikhail Chusavitin
2abe2ce3aa fix(iso): fix NCCL version to 2.28.9+cuda13.0, add sha256 verification
NVIDIA's CUDA repo for Debian 12 only has NCCL packages for cuda13.x,
not cuda12.x. Update to the latest available: 2.28.9-1+cuda13.0.
Also pass sha256 from VERSIONS into build-nccl.sh for integrity check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 12:04:03 +03:00
Mikhail Chusavitin
8233c9ee85 feat(iso): add NCCL 2.26.2 to LiveCD
Download libnccl2 .deb from NVIDIA's CUDA apt repo (Debian 12) during ISO
build, extract libnccl.so.* into the overlay at /usr/lib/ alongside
libnvidia-ml and libcuda. Version pinned in VERSIONS, reflected in
/etc/bee-release.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 09:51:28 +03:00
Mikhail Chusavitin
13189e2683 fix(iso): pet hardware watchdog via systemd RuntimeWatchdogSec=30s
Without a keepalive the kernel watchdog timer expires and reboots
the host mid-audit. Configuring RuntimeWatchdogSec lets systemd PID 1
reset /dev/watchdog every 30 s — well within the typical 60 s timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 23:56:42 +03:00
Mikhail Chusavitin
76a17937f3 feat(tui): NVIDIA SAT with nvtop, GPU selection, metrics and chart — v1.0.0
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 15:18:57 +03:00
Mikhail Chusavitin
b965184e71 feat: wrap chart viewer in web shell 2026-03-16 18:26:05 +03:00
Mikhail Chusavitin
b25a2f6d30 feat: add support bundle and raw audit export 2026-03-16 18:20:26 +03:00
Mikhail Chusavitin
d18cde19c1 Drop legacy non-container builders 2026-03-16 00:23:55 +03:00
Mikhail Chusavitin
78c6dfc0ef Sync hardware ingest contract v2.7 2026-03-15 23:03:38 +03:00
111 changed files with 9903 additions and 2215 deletions

13
PLAN.md

@@ -272,13 +272,10 @@ ISO image bootable via BMC virtual media or USB. Runs boot services automaticall
 ### 2.1 — Builder environment
-`iso/builder/setup-builder.sh` prepares a Debian 12 host/VM with:
-- `live-build`, `debootstrap`, bootloader tooling, kernel headers
-- Go toolchain
-- everything needed to compile the `bee` binary and NVIDIA modules
-`iso/builder/build-in-container.sh` offers the same builder stack in a Debian 12 container image.
-The container run is privileged because `live-build` needs mount/chroot/loop capabilities.
+`iso/builder/build-in-container.sh` is the only supported builder entrypoint.
+It builds a Debian 12 builder image with `live-build`, toolchains, and pinned kernel headers,
+then runs the ISO assembly in a privileged container because `live-build` needs
+mount/chroot/loop capabilities.
 `iso/builder/build.sh` orchestrates the full ISO build:
 1. compile the Go `bee` binary
@@ -392,7 +389,7 @@ No "works on my Mac" drift.
 --- BUILDER + BEE ISO (unblock real-hardware testing) ---
-2.1 builder setup → Debian host/VM or privileged container with build deps
+2.1 builder setup → privileged container with build deps
 2.2 debug ISO profile → minimal Debian ISO: `bee` binary + OpenSSH + all packages
 2.3 boot on real server → SSH in, verify packages present, run audit manually

View File

@@ -11,7 +11,6 @@ import (
 	"bee/audit/internal/app"
 	"bee/audit/internal/platform"
 	"bee/audit/internal/runtimeenv"
-	"bee/audit/internal/tui"
 	"bee/audit/internal/webui"
 )
@@ -40,10 +39,12 @@ func run(args []string, stdout, stderr io.Writer) int {
 		return 0
 	case "audit":
 		return runAudit(args[1:], stdout, stderr)
-	case "tui":
-		return runTUI(args[1:], stdout, stderr)
 	case "export":
 		return runExport(args[1:], stdout, stderr)
+	case "preflight":
+		return runPreflight(args[1:], stdout, stderr)
+	case "support-bundle":
+		return runSupportBundle(args[1:], stdout, stderr)
 	case "web":
 		return runWeb(args[1:], stdout, stderr)
 	case "sat":
@@ -61,10 +62,11 @@ func run(args []string, stdout, stderr io.Writer) int {
func printRootUsage(w io.Writer) {
fmt.Fprintln(w, `bee commands:
bee audit --runtime auto|local|livecd --output stdout|file:<path>
bee tui --runtime auto|local|livecd
bee preflight --output stdout|file:<path>
bee export --target <device>
bee web --listen :80 --audit-path /var/log/bee-audit.json
bee sat nvidia|memory|storage
bee support-bundle --output stdout|file:<path>
bee web --listen :80 --audit-path `+app.DefaultAuditJSONPath+`
bee sat nvidia|memory|storage|cpu [--duration <seconds>]
bee version
bee help [command]`)
}
@@ -73,10 +75,12 @@ func runHelp(args []string, stdout, stderr io.Writer) int {
switch args[0] {
case "audit":
return runAudit([]string{"--help"}, stdout, stdout)
case "tui":
return runTUI([]string{"--help"}, stdout, stdout)
case "export":
return runExport([]string{"--help"}, stdout, stdout)
case "preflight":
return runPreflight([]string{"--help"}, stdout, stdout)
case "support-bundle":
return runSupportBundle([]string{"--help"}, stdout, stdout)
case "web":
return runWeb([]string{"--help"}, stdout, stdout)
case "sat":
@@ -135,42 +139,6 @@ func runAudit(args []string, stdout, stderr io.Writer) int {
return 0
}
func runTUI(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("tui", flag.ContinueOnError)
fs.SetOutput(stderr)
runtimeFlag := fs.String("runtime", "auto", "runtime environment: auto, local, livecd")
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee tui [--runtime auto|local|livecd]")
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
runtimeInfo, err := runtimeenv.Detect(*runtimeFlag)
if err != nil {
slog.Error("resolve runtime", "err", err)
return 1
}
slog.SetDefault(slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{
Level: slog.LevelInfo,
})))
application := app.New(platform.New())
if err := tui.Run(application, runtimeInfo.Mode); err != nil {
slog.Error("run tui", "err", err)
return 1
}
return 0
}
func runExport(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("export", flag.ContinueOnError)
@@ -219,14 +187,96 @@ func runExport(args []string, stdout, stderr io.Writer) int {
return 1
}
func runPreflight(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("preflight", flag.ContinueOnError)
fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee preflight [--output stdout|file:%s]\n", app.DefaultRuntimeJSONPath)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
application := app.New(platform.New())
path, err := application.RunRuntimePreflight(*output)
if err != nil {
slog.Error("run preflight", "err", err)
return 1
}
if path != "stdout" {
slog.Info("runtime health written", "path", path)
}
return 0
}
func runSupportBundle(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("support-bundle", flag.ContinueOnError)
fs.SetOutput(stderr)
output := fs.String("output", "stdout", "output destination: stdout or file:<path>")
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee support-bundle [--output stdout|file:<path>]")
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
path, err := app.BuildSupportBundle(app.DefaultExportDir)
if err != nil {
slog.Error("build support bundle", "err", err)
return 1
}
defer os.Remove(path)
raw, err := os.ReadFile(path)
if err != nil {
slog.Error("read support bundle", "err", err)
return 1
}
switch {
case *output == "stdout":
if _, err := stdout.Write(raw); err != nil {
slog.Error("write support bundle stdout", "err", err)
return 1
}
case strings.HasPrefix(*output, "file:"):
dst := strings.TrimPrefix(*output, "file:")
if err := os.WriteFile(dst, raw, 0644); err != nil {
slog.Error("write support bundle", "err", err)
return 1
}
slog.Info("support bundle written", "path", dst)
default:
fmt.Fprintln(stderr, "bee support-bundle: unknown output destination")
fs.Usage()
return 2
}
return 0
}
func runWeb(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("web", flag.ContinueOnError)
fs.SetOutput(stderr)
listenAddr := fs.String("listen", ":8080", "listen address, e.g. :80")
auditPath := fs.String("audit-path", app.DefaultAuditJSONPath, "path to the latest audit JSON snapshot")
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles")
title := fs.String("title", "Bee Hardware Audit", "page title")
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee web [--listen :80] [--audit-path /var/log/bee-audit.json] [--title \"Bee Hardware Audit\"]")
fmt.Fprintf(stderr, "usage: bee web [--listen :80] [--audit-path %s] [--export-dir %s] [--title \"Bee Hardware Audit\"]\n", app.DefaultAuditJSONPath, app.DefaultExportDir)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
@@ -241,9 +291,18 @@ func runWeb(args []string, stdout, stderr io.Writer) int {
}
slog.Info("starting bee web", "listen", *listenAddr, "audit_path", *auditPath)
runtimeInfo, err := runtimeenv.Detect("auto")
if err != nil {
slog.Warn("resolve runtime for web", "err", err)
}
if err := webui.ListenAndServe(*listenAddr, webui.HandlerOptions{
Title: *title,
AuditPath: *auditPath,
Title: *title,
AuditPath: *auditPath,
ExportDir: *exportDir,
App: app.New(platform.New()),
RuntimeMode: runtimeInfo.Mode,
}); err != nil {
slog.Error("run web", "err", err)
return 1
@@ -253,43 +312,58 @@ func runWeb(args []string, stdout, stderr io.Writer) int {
func runSAT(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 {
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
return 2
}
if args[0] == "help" || args[0] == "--help" || args[0] == "-h" {
fmt.Fprintln(stdout, "usage: bee sat nvidia|memory|storage")
fmt.Fprintln(stdout, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
return 0
}
if args[0] != "nvidia" && args[0] != "memory" && args[0] != "storage" {
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", args[0])
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
fs := flag.NewFlagSet("sat", flag.ContinueOnError)
fs.SetOutput(stderr)
duration := fs.Int("duration", 0, "stress-ng duration in seconds (cpu only; default: 60)")
if err := fs.Parse(args[1:]); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if len(args) > 1 {
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage")
if fs.NArg() != 0 {
fmt.Fprintf(stderr, "bee sat: unexpected arguments\n")
return 2
}
target := args[0]
if target != "nvidia" && target != "memory" && target != "storage" && target != "cpu" {
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", target)
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
return 2
}
application := app.New(platform.New())
var (
archive string
err error
label string
)
switch args[0] {
switch target {
case "nvidia":
label = "nvidia"
archive, err = application.RunNvidiaAcceptancePack("")
case "memory":
label = "memory"
archive, err = application.RunMemoryAcceptancePack("")
case "storage":
label = "storage"
archive, err = application.RunStorageAcceptancePack("")
case "cpu":
dur := *duration
if dur <= 0 {
dur = 60
}
archive, err = application.RunCPUAcceptancePack("", dur)
}
if err != nil {
slog.Error("run sat", "target", label, "err", err)
slog.Error("run sat", "target", target, "err", err)
return 1
}
slog.Info("sat archive written", "target", label, "path", archive)
slog.Info("sat archive written", "target", target, "path", archive)
return 0
}


@@ -91,6 +91,32 @@ func TestRunSATUsage(t *testing.T) {
}
}
func TestRunPreflightRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"preflight", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee preflight") {
t.Fatalf("stderr missing preflight usage:\n%s", stderr.String())
}
}
func TestRunSupportBundleRejectsExtraArgs(t *testing.T) {
t.Parallel()
var stdout, stderr bytes.Buffer
rc := run([]string{"support-bundle", "extra"}, &stdout, &stderr)
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee support-bundle") {
t.Fatalf("stderr missing support-bundle usage:\n%s", stderr.String())
}
}
func TestRunHelpForSubcommand(t *testing.T) {
t.Parallel()
@@ -138,7 +164,7 @@ func TestRunSATHelp(t *testing.T) {
if rc != 0 {
t.Fatalf("rc=%d want 0", rc)
}
if !strings.Contains(stdout.String(), "usage: bee sat nvidia|memory|storage") {
if !strings.Contains(stdout.String(), "usage: bee sat nvidia|memory|storage|cpu") {
t.Fatalf("stdout missing sat help:\n%s", stdout.String())
}
}
@@ -151,8 +177,8 @@ func TestRunSATRejectsExtraArgs(t *testing.T) {
if rc != 2 {
t.Fatalf("rc=%d want 2", rc)
}
if !strings.Contains(stderr.String(), "usage: bee sat nvidia|memory|storage") {
t.Fatalf("stderr missing sat usage:\n%s", stderr.String())
if !strings.Contains(stderr.String(), "bee sat: unexpected arguments") {
t.Fatalf("stderr missing sat error:\n%s", stderr.String())
}
}


@@ -4,24 +4,14 @@ go 1.24.0
replace reanimator/chart => ../internal/chart
require github.com/charmbracelet/bubbletea v1.3.4
require reanimator/chart v0.0.0
require (
github.com/go-analyze/charts v0.5.26
reanimator/chart v0.0.0-00010101000000-000000000000
)
require (
github.com/aymanbagabas/go-osc52/v2 v2.0.1 // indirect
github.com/charmbracelet/lipgloss v1.0.0 // indirect
github.com/charmbracelet/x/ansi v0.8.0 // indirect
github.com/charmbracelet/x/term v0.2.1 // indirect
github.com/erikgeiser/coninput v0.0.0-20211004153227-1c3628e74d0f // indirect
github.com/lucasb-eyer/go-colorful v1.2.0 // indirect
github.com/mattn/go-isatty v0.0.20 // indirect
github.com/mattn/go-localereader v0.0.1 // indirect
github.com/mattn/go-runewidth v0.0.16 // indirect
github.com/muesli/ansi v0.0.0-20230316100256-276c6243b2f6 // indirect
github.com/muesli/cancelreader v0.2.2 // indirect
github.com/muesli/termenv v0.15.2 // indirect
github.com/rivo/uniseg v0.4.7 // indirect
golang.org/x/sync v0.11.0 // indirect
golang.org/x/sys v0.30.0 // indirect
golang.org/x/text v0.3.8 // indirect
github.com/dustin/go-humanize v1.0.1 // indirect
github.com/go-analyze/bulk v0.1.3 // indirect
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 // indirect
golang.org/x/image v0.24.0 // indirect
)


@@ -1,37 +1,18 @@
github.com/aymanbagabas/go-osc52/v2 v2.0.1 h1:HwpRHbFMcZLEVr42D4p7XBqjyuxQH5SMiErDT4WkJ2k=
github.com/aymanbagabas/go-osc52/v2 v2.0.1/go.mod h1:uYgXzlJ7ZpABp8OJ+exZzJJhRNQ2ASbcXHWsFqH8hp8=
github.com/charmbracelet/bubbletea v1.3.4 h1:kCg7B+jSCFPLYRA52SDZjr51kG/fMUEoPoZrkaDHyoI=
github.com/charmbracelet/bubbletea v1.3.4/go.mod h1:dtcUCyCGEX3g9tosuYiut3MXgY/Jsv9nKVdibKKRRXo=
github.com/charmbracelet/lipgloss v1.0.0 h1:O7VkGDvqEdGi93X+DeqsQ7PKHDgtQfF8j8/O2qFMQNg=
github.com/charmbracelet/lipgloss v1.0.0/go.mod h1:U5fy9Z+C38obMs+T+tJqst9VGzlOYGj4ri9reL3qUlo=
github.com/charmbracelet/x/ansi v0.8.0 h1:9GTq3xq9caJW8ZrBTe0LIe2fvfLR/bYXKTx2llXn7xE=
github.com/charmbracelet/x/ansi v0.8.0/go.mod h1:wdYl/ONOLHLIVmQaxbIYEC/cRKOQyjTkowiI4blgS9Q=
github.com/charmbracelet/x/term v0.2.1 h1:AQeHeLZ1OqSXhrAWpYUtZyX1T3zVxfpZuEQMIQaGIAQ=
github.com/charmbracelet/x/term v0.2.1/go.mod h1:oQ4enTYFV7QN4m0i9mzHrViD7TQKvNEEkHUMCmsxdUg=
github.com/erikgeiser/coninput v0.0.0-20211004153227-1c3628e74d0f h1:Y/CXytFA4m6baUTXGLOoWe4PQhGxaX0KpnayAqC48p4=
github.com/erikgeiser/coninput v0.0.0-20211004153227-1c3628e74d0f/go.mod h1:vw97MGsxSvLiUE2X8qFplwetxpGLQrlU1Q9AUEIzCaM=
github.com/lucasb-eyer/go-colorful v1.2.0 h1:1nnpGOrhyZZuNyfu1QjKiUICQ74+3FNCN69Aj6K7nkY=
github.com/lucasb-eyer/go-colorful v1.2.0/go.mod h1:R4dSotOR9KMtayYi1e77YzuveK+i7ruzyGqttikkLy0=
github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY=
github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
github.com/mattn/go-localereader v0.0.1 h1:ygSAOl7ZXTx4RdPYinUpg6W99U8jWvWi9Ye2JC/oIi4=
github.com/mattn/go-localereader v0.0.1/go.mod h1:8fBrzywKY7BI3czFoHkuzRoWE9C+EiG4R1k4Cjx5p88=
github.com/mattn/go-runewidth v0.0.16 h1:E5ScNMtiwvlvB5paMFdw9p4kSQzbXFikJ5SQO6TULQc=
github.com/mattn/go-runewidth v0.0.16/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
github.com/muesli/ansi v0.0.0-20230316100256-276c6243b2f6 h1:ZK8zHtRHOkbHy6Mmr5D264iyp3TiX5OmNcI5cIARiQI=
github.com/muesli/ansi v0.0.0-20230316100256-276c6243b2f6/go.mod h1:CJlz5H+gyd6CUWT45Oy4q24RdLyn7Md9Vj2/ldJBSIo=
github.com/muesli/cancelreader v0.2.2 h1:3I4Kt4BQjOR54NavqnDogx/MIoWBFa0StPA8ELUXHmA=
github.com/muesli/cancelreader v0.2.2/go.mod h1:3XuTXfFS2VjM+HTLZY9Ak0l6eUKfijIfMUZ4EgX0QYo=
github.com/muesli/termenv v0.15.2 h1:GohcuySI0QmI3wN8Ok9PtKGkgkFIk7y6Vpb5PvrY+Wo=
github.com/muesli/termenv v0.15.2/go.mod h1:Epx+iuz8sNs7mNKhxzH4fWXGNpZwUaJKRS1noLXviQ8=
github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=
github.com/rivo/uniseg v0.4.7 h1:WUdvkW8uEhrYfLC4ZzdpI2ztxP1I582+49Oc5Mq64VQ=
github.com/rivo/uniseg v0.4.7/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88=
golang.org/x/sync v0.11.0 h1:GGz8+XQP4FvTTrjZPzNKTMFtSXH80RAzG+5ghFPgK9w=
golang.org/x/sync v0.11.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=
golang.org/x/sys v0.0.0-20210809222454-d867a43fc93e/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.30.0 h1:QjkSwP/36a20jFYWkSue1YwXzLmsV5Gfq7Eiy72C1uc=
golang.org/x/sys v0.30.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/text v0.3.8 h1:nAL+RVCQ9uMn3vJZbV+MRnydTJFPf8qqY42YiA6MrqY=
golang.org/x/text v0.3.8/go.mod h1:E6s5w1FMmriuDzIBO73fBruAKo1PCIq6d2Q6DHfQ8WQ=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
github.com/go-analyze/bulk v0.1.3 h1:pzRdBqzHDAT9PyROt0SlWE0YqPtdmTcEpIJY0C3vF0c=
github.com/go-analyze/bulk v0.1.3/go.mod h1:afon/KtFJYnekIyN20H/+XUvcLFjE8sKR1CfpqfClgM=
github.com/go-analyze/charts v0.5.26 h1:rSwZikLQuFX6cJzwI8OAgaWZneG1kDYxD857ms00ZxY=
github.com/go-analyze/charts v0.5.26/go.mod h1:s1YvQhjiSwtLx1f2dOKfiV9x2TT49nVSL6v2rlRpTbY=
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 h1:DACJavvAHhabrF08vX0COfcOBJRhZ8lUbR+ZWIs0Y5g=
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0/go.mod h1:E/TSTwGwJL78qG/PmXZO1EjYhfJinVAhrmmHX6Z8B9k=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
golang.org/x/image v0.24.0 h1:AN7zRgVsbvmTfNyqIbbOraYL8mSwcKncEj8ofjgzcMQ=
golang.org/x/image v0.24.0/go.mod h1:4b/ITuLfqYq1hqZcjofwctIhi7sZh2WaCjvsBNjjya8=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=


@@ -1,8 +1,10 @@
package app
import (
"context"
"encoding/json"
"fmt"
"log/slog"
"os"
"path/filepath"
"sort"
@@ -17,17 +19,27 @@ import (
)
var (
DefaultAuditJSONPath = "/var/log/bee-audit.json"
DefaultAuditLogPath = "/var/log/bee-audit.log"
DefaultSATBaseDir = "/var/log/bee-sat"
DefaultExportDir = "/appdata/bee/export"
DefaultAuditJSONPath = DefaultExportDir + "/bee-audit.json"
DefaultAuditLogPath = DefaultExportDir + "/bee-audit.log"
DefaultWebLogPath = DefaultExportDir + "/bee-web.log"
DefaultNetworkLogPath = DefaultExportDir + "/bee-network.log"
DefaultNvidiaLogPath = DefaultExportDir + "/bee-nvidia.log"
DefaultSSHLogPath = DefaultExportDir + "/bee-sshsetup.log"
DefaultRuntimeJSONPath = DefaultExportDir + "/runtime-health.json"
DefaultRuntimeLogPath = DefaultExportDir + "/runtime-health.log"
DefaultTechDumpDir = DefaultExportDir + "/techdump"
DefaultSATBaseDir = DefaultExportDir + "/bee-sat"
)
type App struct {
network networkManager
services serviceManager
exports exportManager
tools toolManager
sat satRunner
network networkManager
services serviceManager
exports exportManager
tools toolManager
sat satRunner
runtime runtimeChecker
installer installer
}
type ActionResult struct {
@@ -45,6 +57,7 @@ type networkManager interface {
type serviceManager interface {
ListBeeServices() ([]string, error)
ServiceState(name string) string
ServiceStatus(name string) (string, error)
ServiceDo(name string, action platform.ServiceAction) (string, error)
}
@@ -59,24 +72,53 @@ type toolManager interface {
CheckTools(names []string) []platform.ToolStatus
}
type installer interface {
ListInstallDisks() ([]platform.InstallDisk, error)
InstallToDisk(ctx context.Context, device string, logFile string) error
}
type satRunner interface {
RunNvidiaAcceptancePack(baseDir string) (string, error)
RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int) (string, error)
RunMemoryAcceptancePack(baseDir string) (string, error)
RunStorageAcceptancePack(baseDir string) (string, error)
RunCPUAcceptancePack(baseDir string, durationSec int) (string, error)
ListNvidiaGPUs() ([]platform.NvidiaGPU, error)
DetectGPUVendor() string
ListAMDGPUs() ([]platform.AMDGPUInfo, error)
RunAMDAcceptancePack(baseDir string) (string, error)
RunFanStressTest(ctx context.Context, baseDir string, opts platform.FanStressOptions) (string, error)
RunNCCLTests(ctx context.Context, baseDir string) (string, error)
}
type runtimeChecker interface {
CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error)
CaptureTechnicalDump(baseDir string) error
}
func New(platform *platform.System) *App {
return &App{
network: platform,
services: platform,
exports: platform,
tools: platform,
sat: platform,
network: platform,
services: platform,
exports: platform,
tools: platform,
sat: platform,
runtime: platform,
installer: platform,
}
}
func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, error) {
if runtimeMode == runtimeenv.ModeLiveCD {
if err := a.runtime.CaptureTechnicalDump(DefaultTechDumpDir); err != nil {
slog.Warn("capture technical dump", "err", err)
}
}
result := collector.Run(runtimeMode)
applyLatestSATStatuses(&result.Hardware, DefaultSATBaseDir)
if health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath); err == nil {
result.Runtime = &health
}
data, err := json.MarshalIndent(result, "", " ")
if err != nil {
return "", err
@@ -88,6 +130,9 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
return "stdout", err
case strings.HasPrefix(output, "file:"):
path := strings.TrimPrefix(output, "file:")
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return "", err
}
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
return "", err
}
@@ -97,6 +142,72 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
}
}
func (a *App) RunRuntimePreflight(output string) (string, error) {
health, err := a.runtime.CollectRuntimeHealth(DefaultExportDir)
if err != nil {
return "", err
}
data, err := json.MarshalIndent(health, "", " ")
if err != nil {
return "", err
}
switch {
case output == "stdout":
_, err := os.Stdout.Write(append(data, '\n'))
return "stdout", err
case strings.HasPrefix(output, "file:"):
path := strings.TrimPrefix(output, "file:")
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return "", err
}
if err := os.WriteFile(path, append(data, '\n'), 0644); err != nil {
return "", err
}
return path, nil
default:
return "", fmt.Errorf("unknown output destination %q — use stdout or file:<path>", output)
}
}
func (a *App) RunRuntimePreflightResult() (ActionResult, error) {
path, err := a.RunRuntimePreflight("file:" + DefaultRuntimeJSONPath)
body := "Runtime preflight completed."
if path != "" {
body = "Runtime health written to " + path
}
return ActionResult{Title: "Run self-check", Body: body}, err
}
func (a *App) RuntimeHealthResult() ActionResult {
health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath)
if err != nil {
return ActionResult{Title: "Runtime issues", Body: "No runtime health found."}
}
driverLabel := "Driver ready"
accelLabel := "CUDA ready"
switch a.sat.DetectGPUVendor() {
case "amd":
driverLabel = "AMDGPU ready"
accelLabel = "ROCm SMI ready"
case "nvidia":
driverLabel = "NVIDIA ready"
}
var body strings.Builder
fmt.Fprintf(&body, "Status: %s\n", firstNonEmpty(health.Status, "UNKNOWN"))
fmt.Fprintf(&body, "Export dir: %s\n", firstNonEmpty(health.ExportDir, DefaultExportDir))
fmt.Fprintf(&body, "%s: %t\n", driverLabel, health.DriverReady)
fmt.Fprintf(&body, "%s: %t\n", accelLabel, health.CUDAReady)
fmt.Fprintf(&body, "Network: %s", firstNonEmpty(health.NetworkStatus, "UNKNOWN"))
if len(health.Issues) > 0 {
body.WriteString("\n\nIssues:\n")
for _, issue := range health.Issues {
fmt.Fprintf(&body, "- %s: %s\n", issue.Code, issue.Description)
}
}
return ActionResult{Title: "Runtime issues", Body: strings.TrimSpace(body.String())}
}
func (a *App) RunAuditNow(runtimeMode runtimeenv.Mode) (ActionResult, error) {
path, err := a.RunAudit(runtimeMode, "file:"+DefaultAuditJSONPath)
body := "Audit completed."
@@ -129,13 +240,37 @@ func (a *App) ExportLatestAudit(target platform.RemovableTarget) (string, error)
func (a *App) ExportLatestAuditResult(target platform.RemovableTarget) (ActionResult, error) {
path, err := a.ExportLatestAudit(target)
body := "Audit exported."
if path != "" {
body := "Audit export failed."
if err == nil {
body = "Audit exported."
}
if err == nil && path != "" {
body = "Audit exported to " + path
}
return ActionResult{Title: "Export audit", Body: body}, err
}
func (a *App) ExportSupportBundle(target platform.RemovableTarget) (string, error) {
archive, err := BuildSupportBundle(DefaultExportDir)
if err != nil {
return "", err
}
defer os.Remove(archive)
return a.exports.ExportFileToTarget(archive, target)
}
func (a *App) ExportSupportBundleResult(target platform.RemovableTarget) (ActionResult, error) {
path, err := a.ExportSupportBundle(target)
body := "Support bundle export failed."
if err == nil {
body = "Support bundle exported. USB target unmounted and safe to remove."
}
if err == nil && path != "" {
body = "Support bundle exported to " + path + ".\n\nUSB target unmounted and safe to remove."
}
return ActionResult{Title: "Export support bundle", Body: body}, err
}
func (a *App) ListInterfaces() ([]platform.InterfaceInfo, error) {
return a.network.ListInterfaces()
}
@@ -222,6 +357,10 @@ func (a *App) ListBeeServices() ([]string, error) {
return a.services.ListBeeServices()
}
func (a *App) ServiceState(name string) string {
return a.services.ServiceState(name)
}
func (a *App) ServiceStatus(name string) (string, error) {
return a.services.ServiceStatus(name)
}
@@ -278,11 +417,14 @@ func (a *App) AuditLogTailResult() ActionResult {
}
func (a *App) RunNvidiaAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaAcceptancePack(baseDir)
}
func (a *App) RunNvidiaAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.sat.RunNvidiaAcceptancePack(baseDir)
path, err := a.RunNvidiaAcceptancePack(baseDir)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
@@ -290,30 +432,160 @@ func (a *App) RunNvidiaAcceptancePackResult(baseDir string) (ActionResult, error
return ActionResult{Title: "NVIDIA SAT", Body: body}, err
}
func (a *App) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
return a.sat.ListNvidiaGPUs()
}
func (a *App) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int) (ActionResult, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
path, err := a.sat.RunNvidiaAcceptancePackWithOptions(ctx, baseDir, diagLevel, gpuIndices)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
}
return ActionResult{Title: "NVIDIA DCGM", Body: body}, err
}
func (a *App) RunMemoryAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunMemoryAcceptancePack(baseDir)
}
func (a *App) RunMemoryAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.sat.RunMemoryAcceptancePack(baseDir)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
path, err := a.RunMemoryAcceptancePack(baseDir)
return ActionResult{Title: "Memory SAT", Body: satResultBody(path)}, err
}
func (a *App) RunCPUAcceptancePack(baseDir string, durationSec int) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return ActionResult{Title: "Memory SAT", Body: body}, err
return a.sat.RunCPUAcceptancePack(baseDir, durationSec)
}
func (a *App) RunCPUAcceptancePackResult(baseDir string, durationSec int) (ActionResult, error) {
path, err := a.RunCPUAcceptancePack(baseDir, durationSec)
return ActionResult{Title: "CPU SAT", Body: satResultBody(path)}, err
}
func (a *App) RunStorageAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunStorageAcceptancePack(baseDir)
}
func (a *App) RunStorageAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.sat.RunStorageAcceptancePack(baseDir)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
path, err := a.RunStorageAcceptancePack(baseDir)
return ActionResult{Title: "Storage SAT", Body: satResultBody(path)}, err
}
func (a *App) DetectGPUVendor() string {
return a.sat.DetectGPUVendor()
}
func (a *App) ListAMDGPUs() ([]platform.AMDGPUInfo, error) {
return a.sat.ListAMDGPUs()
}
func (a *App) RunAMDAcceptancePack(baseDir string) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return ActionResult{Title: "Storage SAT", Body: body}, err
return a.sat.RunAMDAcceptancePack(baseDir)
}
func (a *App) RunAMDAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunAMDAcceptancePack(baseDir)
return ActionResult{Title: "AMD GPU SAT", Body: satResultBody(path)}, err
}
func (a *App) RunFanStressTest(ctx context.Context, baseDir string, opts platform.FanStressOptions) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunFanStressTest(ctx, baseDir, opts)
}
func (a *App) RunNCCLTestsResult(ctx context.Context) (ActionResult, error) {
path, err := a.sat.RunNCCLTests(ctx, DefaultSATBaseDir)
body := "Results: " + path
if err != nil && err != context.Canceled {
body += "\nERROR: " + err.Error()
}
return ActionResult{Title: "NCCL bandwidth test", Body: body}, err
}
func (a *App) RunFanStressTestResult(ctx context.Context, opts platform.FanStressOptions) (ActionResult, error) {
path, err := a.RunFanStressTest(ctx, "", opts)
body := formatFanStressResult(path)
if err != nil && err != context.Canceled {
body += "\nERROR: " + err.Error()
}
return ActionResult{Title: "GPU Platform Stress Test", Body: body}, err
}
// formatFanStressResult formats the summary.txt from a fan-stress run, including
// the per-step pass/fail display and the analysis section (throttling, max temps, fan response).
func formatFanStressResult(archivePath string) string {
if archivePath == "" {
return "No output produced."
}
runDir := strings.TrimSuffix(archivePath, ".tar.gz")
raw, err := os.ReadFile(filepath.Join(runDir, "summary.txt"))
if err != nil {
return "Archive written to " + archivePath
}
content := strings.TrimSpace(string(raw))
kv := parseKeyValueSummary(content)
var b strings.Builder
b.WriteString(formatSATDetail(content))
// Append analysis section.
var analysis []string
if v, ok := kv["throttling_detected"]; ok {
label := "NO"
if v == "true" {
label = "YES ← throttling detected during load"
}
analysis = append(analysis, "Throttling: "+label)
}
if v, ok := kv["max_gpu_temp_c"]; ok && v != "0.0" {
analysis = append(analysis, "Max GPU temp: "+v+"°C")
}
if v, ok := kv["max_cpu_temp_c"]; ok && v != "0.0" {
analysis = append(analysis, "Max CPU temp: "+v+"°C")
}
if v, ok := kv["fan_response_sec"]; ok && v != "N/A" && v != "-1.0" {
analysis = append(analysis, "Fan response: "+v+"s")
}
if len(analysis) > 0 {
b.WriteString("\n\n=== Analysis ===\n")
for _, line := range analysis {
b.WriteString(line + "\n")
}
}
return strings.TrimSpace(b.String())
}
// satResultBody reads summary.txt from the SAT run directory (archive path without .tar.gz)
// and returns a formatted human-readable result. Falls back to a plain message if unreadable.
func satResultBody(archivePath string) string {
if archivePath == "" {
return "No output produced."
}
runDir := strings.TrimSuffix(archivePath, ".tar.gz")
raw, err := os.ReadFile(filepath.Join(runDir, "summary.txt"))
if err != nil {
return "Archive written to " + archivePath
}
return formatSATDetail(strings.TrimSpace(string(raw)))
}
func (a *App) HealthSummaryResult() ActionResult {
@@ -435,6 +707,18 @@ func bodyOr(body, fallback string) string {
return body
}
func ReadRuntimeHealth(path string) (schema.RuntimeHealth, error) {
raw, err := os.ReadFile(path)
if err != nil {
return schema.RuntimeHealth{}, err
}
var health schema.RuntimeHealth
if err := json.Unmarshal(raw, &health); err != nil {
return schema.RuntimeHealth{}, err
}
return health, nil
}
func latestSATSummaries() []string {
patterns := []struct {
label string
@@ -443,6 +727,7 @@ func latestSATSummaries() []string {
{label: "NVIDIA SAT", prefix: "gpu-nvidia-"},
{label: "Memory SAT", prefix: "memory-"},
{label: "Storage SAT", prefix: "storage-"},
{label: "CPU SAT", prefix: "cpu-"},
}
var out []string
for _, item := range patterns {
@@ -647,12 +932,17 @@ func isGPUDevice(dev schema.HardwarePCIeDevice) bool {
class := trimPtr(dev.DeviceClass)
model := strings.ToLower(trimPtr(dev.Model))
vendor := strings.ToLower(trimPtr(dev.Manufacturer))
// Exclude ASPEED (BMC VGA adapter, not a compute GPU)
if strings.Contains(vendor, "aspeed") || strings.Contains(model, "aspeed") {
return false
}
// AMD Instinct / Radeon compute GPUs have class ProcessingAccelerator or DisplayController.
// Do NOT match by AMD vendor alone — chipset/CPU PCIe devices share that vendor.
return class == "VideoController" ||
class == "DisplayController" ||
class == "ProcessingAccelerator" ||
strings.Contains(model, "nvidia") ||
strings.Contains(vendor, "nvidia") ||
strings.Contains(vendor, "amd")
strings.Contains(vendor, "nvidia")
}
func trimPtr(value *string) string {
@@ -725,3 +1015,11 @@ func firstNonEmpty(values ...string) string {
}
return ""
}
func (a *App) ListInstallDisks() ([]platform.InstallDisk, error) {
return a.installer.ListInstallDisks()
}
func (a *App) InstallToDisk(ctx context.Context, device string, logFile string) error {
return a.installer.InstallToDisk(ctx, device, logFile)
}


@@ -1,8 +1,12 @@
package app
import (
"archive/tar"
"compress/gzip"
"context"
"encoding/json"
"errors"
"io"
"os"
"path/filepath"
"testing"
@@ -48,6 +52,10 @@ func (f fakeServices) ListBeeServices() ([]string, error) {
return nil, nil
}
func (f fakeServices) ServiceState(name string) string {
return "active"
}
func (f fakeServices) ServiceStatus(name string) (string, error) {
return f.serviceStatusFn(name)
}
@@ -56,16 +64,41 @@ func (f fakeServices) ServiceDo(name string, action platform.ServiceAction) (str
return f.serviceDoFn(name, action)
}
type fakeExports struct{}
type fakeExports struct {
listTargetsFn func() ([]platform.RemovableTarget, error)
exportToTargetFn func(string, platform.RemovableTarget) (string, error)
}
func (f fakeExports) ListRemovableTargets() ([]platform.RemovableTarget, error) {
if f.listTargetsFn != nil {
return f.listTargetsFn()
}
return nil, nil
}
func (f fakeExports) ExportFileToTarget(src string, target platform.RemovableTarget) (string, error) {
if f.exportToTargetFn != nil {
return f.exportToTargetFn(src, target)
}
return "", nil
}
type fakeRuntime struct {
collectFn func(string) (schema.RuntimeHealth, error)
dumpFn func(string) error
}
func (f fakeRuntime) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
return f.collectFn(exportDir)
}
func (f fakeRuntime) CaptureTechnicalDump(baseDir string) error {
if f.dumpFn != nil {
return f.dumpFn(baseDir)
}
return nil
}
type fakeTools struct {
tailFileFn func(string, int) string
checkToolsFn func([]string) []platform.ToolStatus
@@ -80,15 +113,31 @@ func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
}
type fakeSAT struct {
runNvidiaFn func(string) (string, error)
runMemoryFn func(string) (string, error)
runStorageFn func(string) (string, error)
runNvidiaFn func(string) (string, error)
runMemoryFn func(string) (string, error)
runStorageFn func(string) (string, error)
runCPUFn func(string, int) (string, error)
detectVendorFn func() string
listAMDGPUsFn func() ([]platform.AMDGPUInfo, error)
runAMDPackFn func(string) (string, error)
listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error)
}
func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string) (string, error) {
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaAcceptancePackWithOptions(_ context.Context, baseDir string, _ int, _ []int) (string, error) {
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
if f.listNvidiaGPUsFn != nil {
return f.listNvidiaGPUsFn()
}
return nil, nil
}
func (f fakeSAT) RunMemoryAcceptancePack(baseDir string) (string, error) {
return f.runMemoryFn(baseDir)
}
@@ -97,6 +146,42 @@ func (f fakeSAT) RunStorageAcceptancePack(baseDir string) (string, error) {
return f.runStorageFn(baseDir)
}
func (f fakeSAT) RunCPUAcceptancePack(baseDir string, durationSec int) (string, error) {
if f.runCPUFn != nil {
return f.runCPUFn(baseDir, durationSec)
}
return "", nil
}
func (f fakeSAT) DetectGPUVendor() string {
if f.detectVendorFn != nil {
return f.detectVendorFn()
}
return ""
}
func (f fakeSAT) ListAMDGPUs() ([]platform.AMDGPUInfo, error) {
if f.listAMDGPUsFn != nil {
return f.listAMDGPUsFn()
}
return nil, nil
}
func (f fakeSAT) RunAMDAcceptancePack(baseDir string) (string, error) {
if f.runAMDPackFn != nil {
return f.runAMDPackFn(baseDir)
}
return "", nil
}
func (f fakeSAT) RunFanStressTest(_ context.Context, _ string, _ platform.FanStressOptions) (string, error) {
return "", nil
}
func (f fakeSAT) RunNCCLTests(_ context.Context, _ string) (string, error) {
return "", nil
}
func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
t.Parallel()
@@ -110,6 +195,9 @@ func TestNetworkStatusFormatsInterfacesAndRoute(t *testing.T) {
},
defaultRouteFn: func() string { return "10.0.0.1" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
@@ -138,6 +226,9 @@ func TestNetworkStatusHandlesNoInterfaces(t *testing.T) {
listInterfacesFn: func() ([]platform.InterfaceInfo, error) { return nil, nil },
defaultRouteFn: func() string { return "" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
@@ -159,6 +250,9 @@ func TestNetworkStatusPropagatesListError(t *testing.T) {
},
defaultRouteFn: func() string { return "" },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.NetworkStatus()
@@ -183,6 +277,9 @@ func TestParseStaticIPv4ConfigAndDefaults(t *testing.T) {
dhcpAllFn: func() (string, error) { return "", nil },
setStaticIPv4Fn: func(platform.StaticIPv4Config) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
defaults := a.DefaultStaticIPv4FormFields("eth0")
@@ -219,6 +316,9 @@ func TestServiceActionResults(t *testing.T) {
return string(action) + " ok", nil
},
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
statusResult, err := a.ServiceStatusResult("bee-audit")
@@ -301,6 +401,11 @@ func TestActionResultsUseFallbackBody(t *testing.T) {
runMemoryFn: func(string) (string, error) { return "", nil },
runStorageFn: func(string) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) {
return schema.RuntimeHealth{Status: "PARTIAL", ExportDir: "/tmp/export"}, nil
},
},
}
if got, _ := a.DHCPOneResult("eth0"); got.Body != "DHCP completed." {
@@ -327,14 +432,87 @@ func TestActionResultsUseFallbackBody(t *testing.T) {
if got, _ := a.RunNvidiaAcceptancePackResult(""); got.Body != "Archive written." {
t.Fatalf("sat body=%q", got.Body)
}
if got, _ := a.RunMemoryAcceptancePackResult(""); got.Body != "Archive written." {
if got, _ := a.RunMemoryAcceptancePackResult(""); got.Body != "No output produced." {
t.Fatalf("memory sat body=%q", got.Body)
}
if got, _ := a.RunStorageAcceptancePackResult(""); got.Body != "Archive written." {
if got, _ := a.RunStorageAcceptancePackResult(""); got.Body != "No output produced." {
t.Fatalf("storage sat body=%q", got.Body)
}
}
func TestExportSupportBundleResultMentionsUnmountedUSB(t *testing.T) {
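// Not parallel: this test swaps the package-level DefaultExportDir, which would race with other parallel tests.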
tmp := t.TempDir()
oldExportDir := DefaultExportDir
DefaultExportDir = tmp
t.Cleanup(func() { DefaultExportDir = oldExportDir })
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.json"), []byte("{}\n"), 0644); err != nil {
t.Fatalf("write bee-audit.json: %v", err)
}
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.log"), []byte("audit ok\n"), 0644); err != nil {
t.Fatalf("write bee-audit.log: %v", err)
}
a := &App{
exports: fakeExports{
exportToTargetFn: func(src string, target platform.RemovableTarget) (string, error) {
// filepath.Base never returns "" (an empty path yields "."), so check src directly.
if src == "" {
t.Fatalf("expected non-empty source path")
}
return "/media/bee/" + filepath.Base(src), nil
},
},
}
result, err := a.ExportSupportBundleResult(platform.RemovableTarget{Device: "/dev/sdb1"})
if err != nil {
t.Fatalf("ExportSupportBundleResult error: %v", err)
}
if result.Title != "Export support bundle" {
t.Fatalf("title=%q want %q", result.Title, "Export support bundle")
}
if want := "USB target unmounted and safe to remove."; !contains(result.Body, want) {
t.Fatalf("body missing %q\nbody=%s", want, result.Body)
}
}
func TestExportSupportBundleResultDoesNotPretendSuccessOnError(t *testing.T) {
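// Not parallel: this test swaps the package-level DefaultExportDir, which would race with other parallel tests.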
tmp := t.TempDir()
oldExportDir := DefaultExportDir
DefaultExportDir = tmp
t.Cleanup(func() { DefaultExportDir = oldExportDir })
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.json"), []byte("{}\n"), 0644); err != nil {
t.Fatalf("write bee-audit.json: %v", err)
}
if err := os.WriteFile(filepath.Join(tmp, "bee-audit.log"), []byte("audit ok\n"), 0644); err != nil {
t.Fatalf("write bee-audit.log: %v", err)
}
a := &App{
exports: fakeExports{
exportToTargetFn: func(string, platform.RemovableTarget) (string, error) {
return "", errors.New("mount /dev/sda1: exFAT support is missing in this ISO build")
},
},
}
result, err := a.ExportSupportBundleResult(platform.RemovableTarget{Device: "/dev/sda1", FSType: "exfat"})
if err == nil {
t.Fatal("expected export error")
}
if contains(result.Body, "exported to") {
t.Fatalf("body should not claim success:\n%s", result.Body)
}
if result.Body != "Support bundle export failed." {
t.Fatalf("body=%q want %q", result.Body, "Support bundle export failed.")
}
}
func TestRunNvidiaAcceptancePackResult(t *testing.T) {
t.Parallel()
@@ -349,6 +527,9 @@ func TestRunNvidiaAcceptancePackResult(t *testing.T) {
runMemoryFn: func(string) (string, error) { return "", nil },
runStorageFn: func(string) (string, error) { return "", nil },
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
result, err := a.RunNvidiaAcceptancePackResult("/tmp/sat")
@@ -360,6 +541,50 @@ func TestRunNvidiaAcceptancePackResult(t *testing.T) {
}
}
func TestRunSATDefaultsToExportDir(t *testing.T) {
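// Not parallel: this test swaps the package-level DefaultSATBaseDir, which would race with other parallel tests.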
oldSATBaseDir := DefaultSATBaseDir
DefaultSATBaseDir = "/tmp/export/bee-sat"
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
a := &App{
sat: fakeSAT{
runNvidiaFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("nvidia baseDir=%q", baseDir)
}
return "", nil
},
runMemoryFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("memory baseDir=%q", baseDir)
}
return "", nil
},
runStorageFn: func(baseDir string) (string, error) {
if baseDir != "/tmp/export/bee-sat" {
t.Fatalf("storage baseDir=%q", baseDir)
}
return "", nil
},
},
runtime: fakeRuntime{
collectFn: func(string) (schema.RuntimeHealth, error) { return schema.RuntimeHealth{}, nil },
},
}
if _, err := a.RunNvidiaAcceptancePack(""); err != nil {
t.Fatal(err)
}
if _, err := a.RunMemoryAcceptancePack(""); err != nil {
t.Fatal(err)
}
if _, err := a.RunStorageAcceptancePack(""); err != nil {
t.Fatal(err)
}
}
func TestFormatSATSummary(t *testing.T) {
t.Parallel()
@@ -398,6 +623,69 @@ func TestHealthSummaryResultIncludesCompactSATSummary(t *testing.T) {
}
}
func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
tmp := t.TempDir()
exportDir := filepath.Join(tmp, "export")
if err := os.MkdirAll(filepath.Join(exportDir, "bee-sat", "memory-run"), 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.json"), []byte(`{"ok":true}`), 0644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run", "verbose.log"), []byte("sat verbose"), 0644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run.tar.gz"), []byte("nested sat archive"), 0644); err != nil {
t.Fatal(err)
}
archive, err := BuildSupportBundle(exportDir)
if err != nil {
t.Fatalf("BuildSupportBundle error: %v", err)
}
if _, err := os.Stat(archive); err != nil {
t.Fatalf("archive stat: %v", err)
}
file, err := os.Open(archive)
if err != nil {
t.Fatalf("open archive: %v", err)
}
defer file.Close()
gzr, err := gzip.NewReader(file)
if err != nil {
t.Fatalf("gzip reader: %v", err)
}
defer gzr.Close()
tr := tar.NewReader(gzr)
var names []string
for {
hdr, err := tr.Next()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
t.Fatalf("read tar entry: %v", err)
}
names = append(names, hdr.Name)
}
var foundRaw bool
for _, name := range names {
if contains(name, "/export/bee-sat/memory-run/verbose.log") {
foundRaw = true
}
if contains(name, "/export/bee-sat/memory-run.tar.gz") {
t.Fatalf("support bundle should not contain nested SAT archive: %s", name)
}
}
if !foundRaw {
t.Fatalf("support bundle missing raw SAT log, names=%v", names)
}
}
func TestMainBanner(t *testing.T) {
tmp := t.TempDir()
oldAuditPath := DefaultAuditJSONPath
@@ -472,6 +760,44 @@ func TestMainBanner(t *testing.T) {
}
}
func TestRuntimeHealthResultUsesAMDLabels(t *testing.T) {
tmp := t.TempDir()
oldRuntimePath := DefaultRuntimeJSONPath
DefaultRuntimeJSONPath = filepath.Join(tmp, "runtime-health.json")
t.Cleanup(func() { DefaultRuntimeJSONPath = oldRuntimePath })
raw, err := json.Marshal(schema.RuntimeHealth{
Status: "OK",
ExportDir: "/appdata/bee/export",
DriverReady: true,
CUDAReady: true,
NetworkStatus: "OK",
})
if err != nil {
t.Fatalf("marshal runtime health: %v", err)
}
if err := os.WriteFile(DefaultRuntimeJSONPath, raw, 0644); err != nil {
t.Fatalf("write runtime health: %v", err)
}
a := &App{
sat: fakeSAT{
detectVendorFn: func() string { return "amd" },
},
}
result := a.RuntimeHealthResult()
if !contains(result.Body, "AMDGPU ready: true") {
t.Fatalf("body missing AMD driver label:\n%s", result.Body)
}
if !contains(result.Body, "ROCm SMI ready: true") {
t.Fatalf("body missing ROCm label:\n%s", result.Body)
}
if contains(result.Body, "CUDA ready") {
t.Fatalf("body should not mention CUDA on AMD:\n%s", result.Body)
}
}
func intPtr(v int) *int { return &v }
func contains(haystack, needle string) bool {

audit/internal/app/panel.go

@@ -0,0 +1,387 @@
package app
import (
"encoding/json"
"fmt"
"os"
"path/filepath"
"sort"
"strings"
"bee/audit/internal/schema"
)
// ComponentRow is one line in the hardware panel.
type ComponentRow struct {
Key string // "CPU", "MEM", "GPU", "DISK", "PSU"
Status string // "PASS", "FAIL", "CANCEL", "N/A"
Detail string // compact one-liner
}
// HardwarePanelData holds everything the TUI right panel needs.
type HardwarePanelData struct {
Header []string
Rows []ComponentRow
}
// LoadHardwarePanel reads the latest audit JSON and SAT summaries.
// Returns a panel with only a placeholder header line if the audit JSON is missing or unreadable.
func (a *App) LoadHardwarePanel() HardwarePanelData {
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return HardwarePanelData{Header: []string{"No audit data — run audit first."}}
}
var snap schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snap); err != nil {
return HardwarePanelData{Header: []string{"Audit data unreadable."}}
}
statuses := satStatuses()
var header []string
if sys := formatSystemLine(snap.Hardware.Board); sys != "" {
header = append(header, sys)
}
for _, fw := range snap.Hardware.Firmware {
if fw.DeviceName == "BIOS" && fw.Version != "" {
header = append(header, "BIOS: "+fw.Version)
}
if fw.DeviceName == "BMC" && fw.Version != "" {
header = append(header, "BMC: "+fw.Version)
}
}
if ip := formatIPLine(a.network.ListInterfaces); ip != "" {
header = append(header, ip)
}
var rows []ComponentRow
if cpu := formatCPULine(snap.Hardware.CPUs); cpu != "" {
rows = append(rows, ComponentRow{
Key: "CPU",
Status: statuses["cpu"],
Detail: strings.TrimPrefix(cpu, "CPU: "),
})
}
if mem := formatMemoryLine(snap.Hardware.Memory); mem != "" {
rows = append(rows, ComponentRow{
Key: "MEM",
Status: statuses["memory"],
Detail: strings.TrimPrefix(mem, "Memory: "),
})
}
if gpu := formatGPULine(snap.Hardware.PCIeDevices); gpu != "" {
rows = append(rows, ComponentRow{
Key: "GPU",
Status: statuses["gpu"],
Detail: strings.TrimPrefix(gpu, "GPU: "),
})
}
if disk := formatStorageLine(snap.Hardware.Storage); disk != "" {
rows = append(rows, ComponentRow{
Key: "DISK",
Status: statuses["storage"],
Detail: strings.TrimPrefix(disk, "Storage: "),
})
}
if psu := formatPSULine(snap.Hardware.PowerSupplies); psu != "" {
rows = append(rows, ComponentRow{
Key: "PSU",
Status: "N/A",
Detail: psu,
})
}
return HardwarePanelData{Header: header, Rows: rows}
}
// ComponentDetailResult returns detail text for a component shown in the panel.
func (a *App) ComponentDetailResult(key string) ActionResult {
switch key {
case "CPU":
return a.cpuDetailResult(false)
case "MEM":
return a.satDetailResult("memory", "memory-", "MEM detail")
case "GPU":
// Prefer whichever GPU SAT was run most recently.
nv, _ := filepath.Glob(filepath.Join(DefaultSATBaseDir, "gpu-nvidia-*/summary.txt"))
am, _ := filepath.Glob(filepath.Join(DefaultSATBaseDir, "gpu-amd-*/summary.txt"))
sort.Strings(nv)
sort.Strings(am)
latestNV := ""
if len(nv) > 0 {
latestNV = nv[len(nv)-1]
}
latestAM := ""
if len(am) > 0 {
latestAM = am[len(am)-1]
}
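// Compare run timestamps, not full paths: "gpu-amd-" always sorts before "gpu-nvidia-".
amRun := strings.TrimPrefix(filepath.Base(filepath.Dir(latestAM)), "gpu-amd-")
nvRun := strings.TrimPrefix(filepath.Base(filepath.Dir(latestNV)), "gpu-nvidia-")
if latestAM != "" && (latestNV == "" || amRun > nvRun) {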
return a.satDetailResult("gpu", "gpu-amd-", "GPU detail")
}
return a.satDetailResult("gpu", "gpu-nvidia-", "GPU detail")
case "DISK":
return a.satDetailResult("storage", "storage-", "DISK detail")
case "PSU":
return a.psuDetailResult()
default:
return ActionResult{Title: key, Body: "No detail available."}
}
}
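The GPU branch above is meant to prefer whichever vendor's SAT ran most recently. Because `gpu-amd-` sorts before `gpu-nvidia-`, a recency comparison has to look at the timestamp suffix of the run directory rather than at the whole path. A minimal sketch, assuming run directories are named `<prefix><sortable timestamp>` as in the tests (for example `gpu-amd-20260325-161436`); `latestRun` and `newerGPUPrefix` are hypothetical helpers, not part of the diff.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// latestRun returns the newest run timestamp among paths shaped like
// <base>/<prefix><timestamp>/summary.txt, or "" if none match the prefix.
func latestRun(paths []string, prefix string) string {
	var runs []string
	for _, p := range paths {
		dir := filepath.Base(filepath.Dir(p))
		if strings.HasPrefix(dir, prefix) {
			runs = append(runs, strings.TrimPrefix(dir, prefix))
		}
	}
	if len(runs) == 0 {
		return ""
	}
	sort.Strings(runs)
	return runs[len(runs)-1]
}

// newerGPUPrefix picks the vendor prefix whose latest run has the later
// timestamp. Ties, and the case with no runs at all, fall back to NVIDIA.
func newerGPUPrefix(paths []string) string {
	amd := latestRun(paths, "gpu-amd-")
	nv := latestRun(paths, "gpu-nvidia-")
	if amd > nv {
		return "gpu-amd-"
	}
	return "gpu-nvidia-"
}

func main() {
	// Illustrative paths only.
	paths := []string{
		"/tmp/bee-sat/gpu-nvidia-20260325-120000/summary.txt",
		"/tmp/bee-sat/gpu-amd-20260325-161436/summary.txt",
	}
	fmt.Println(newerGPUPrefix(paths))
}
```

The comparison is lexical, so it only orders runs correctly while the timestamp format stays fixed-width and most-significant-first.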
func (a *App) cpuDetailResult(satOnly bool) ActionResult {
var b strings.Builder
// Show latest SAT summary if available.
satResult := a.satDetailResult("cpu", "cpu-", "CPU SAT")
if satResult.Body != "No test results found. Run a test first." {
fmt.Fprintln(&b, "=== Last SAT ===")
fmt.Fprintln(&b, satResult.Body)
fmt.Fprintln(&b)
}
if satOnly {
body := strings.TrimSpace(b.String())
if body == "" {
body = "No CPU SAT results found. Run a test first."
}
return ActionResult{Title: "CPU SAT", Body: body}
}
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return ActionResult{Title: "CPU", Body: strings.TrimSpace(b.String())}
}
var snap schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snap); err != nil {
return ActionResult{Title: "CPU", Body: strings.TrimSpace(b.String())}
}
if len(snap.Hardware.CPUs) == 0 {
return ActionResult{Title: "CPU", Body: strings.TrimSpace(b.String())}
}
fmt.Fprintln(&b, "=== Audit ===")
for i, cpu := range snap.Hardware.CPUs {
fmt.Fprintf(&b, "CPU %d\n", i)
if cpu.Model != nil {
fmt.Fprintf(&b, " Model: %s\n", *cpu.Model)
}
if cpu.Manufacturer != nil {
fmt.Fprintf(&b, " Vendor: %s\n", *cpu.Manufacturer)
}
if cpu.Cores != nil {
fmt.Fprintf(&b, " Cores: %d\n", *cpu.Cores)
}
if cpu.Threads != nil {
fmt.Fprintf(&b, " Threads: %d\n", *cpu.Threads)
}
if cpu.MaxFrequencyMHz != nil {
fmt.Fprintf(&b, " Max freq: %d MHz\n", *cpu.MaxFrequencyMHz)
}
if cpu.TemperatureC != nil {
fmt.Fprintf(&b, " Temp: %.1f°C\n", *cpu.TemperatureC)
}
if cpu.Throttled != nil {
fmt.Fprintf(&b, " Throttled: %v\n", *cpu.Throttled)
}
if cpu.CorrectableErrorCount != nil && *cpu.CorrectableErrorCount > 0 {
fmt.Fprintf(&b, " ECC correctable: %d\n", *cpu.CorrectableErrorCount)
}
if cpu.UncorrectableErrorCount != nil && *cpu.UncorrectableErrorCount > 0 {
fmt.Fprintf(&b, " ECC uncorrectable: %d\n", *cpu.UncorrectableErrorCount)
}
if i < len(snap.Hardware.CPUs)-1 {
fmt.Fprintln(&b)
}
}
return ActionResult{Title: "CPU", Body: strings.TrimSpace(b.String())}
}
func (a *App) satDetailResult(statusKey, prefix, title string) ActionResult {
matches, err := filepath.Glob(filepath.Join(DefaultSATBaseDir, prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
return ActionResult{Title: title, Body: "No test results found. Run a test first."}
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
return ActionResult{Title: title, Body: "Could not read test results."}
}
return ActionResult{Title: title, Body: formatSATDetail(strings.TrimSpace(string(raw)))}
}
// formatSATDetail converts raw summary.txt key=value content to a human-readable per-step display.
func formatSATDetail(raw string) string {
var b strings.Builder
kv := parseKeyValueSummary(raw)
if t, ok := kv["run_at_utc"]; ok {
fmt.Fprintf(&b, "Run: %s\n\n", t)
}
// Collect step names in the order they appear in the file.
lines := strings.Split(raw, "\n")
var stepKeys []string
seenStep := map[string]bool{}
for _, line := range lines {
if idx := strings.Index(line, "_status="); idx >= 0 {
key := line[:idx]
if !seenStep[key] && key != "overall" {
seenStep[key] = true
stepKeys = append(stepKeys, key)
}
}
}
for _, key := range stepKeys {
status := kv[key+"_status"]
display := cleanSummaryKey(key)
switch status {
case "OK":
fmt.Fprintf(&b, "PASS %s\n", display)
case "FAILED":
fmt.Fprintf(&b, "FAIL %s\n", display)
case "UNSUPPORTED":
fmt.Fprintf(&b, "SKIP %s\n", display)
default:
fmt.Fprintf(&b, "? %s\n", display)
}
}
if overall, ok := kv["overall_status"]; ok {
passed := kv["job_ok"]
failed := kv["job_failed"]
fmt.Fprintf(&b, "\nOverall: %s (ok=%s failed=%s)", overall, passed, failed)
}
return strings.TrimSpace(b.String())
}
// cleanSummaryKey strips the leading numeric prefix from a SAT step key.
// "1-lscpu" → "lscpu", "3-stress-ng" → "stress-ng"
func cleanSummaryKey(key string) string {
idx := strings.Index(key, "-")
if idx <= 0 {
return key
}
prefix := key[:idx]
for _, c := range prefix {
if c < '0' || c > '9' {
return key
}
}
return key[idx+1:]
}
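`formatSATDetail` and `cleanSummaryKey` together turn a `summary.txt` of `key=value` lines into a per-step PASS/FAIL/SKIP list in file order. The sketch below shows that flow in isolation. `parseKV` is a hypothetical stand-in for `parseKeyValueSummary`, whose implementation is not part of this section, and the sample step names are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKV is a hypothetical stand-in for parseKeyValueSummary:
// it splits each line at the first "=" and ignores lines without one.
func parseKV(raw string) map[string]string {
	kv := map[string]string{}
	for _, line := range strings.Split(raw, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), "=")
		if ok {
			kv[key] = value
		}
	}
	return kv
}

// stepLabel strips a leading all-digit prefix: "1-lscpu" becomes "lscpu".
func stepLabel(key string) string {
	prefix, rest, ok := strings.Cut(key, "-")
	if !ok || prefix == "" {
		return key
	}
	for _, c := range prefix {
		if c < '0' || c > '9' {
			return key
		}
	}
	return rest
}

// renderSteps lists each step once, in file order, with a display label.
func renderSteps(raw string) []string {
	kv := parseKV(raw)
	labels := map[string]string{"OK": "PASS", "FAILED": "FAIL", "UNSUPPORTED": "SKIP"}
	seen := map[string]bool{}
	var out []string
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimSpace(line)
		idx := strings.Index(line, "_status=")
		if idx < 0 {
			continue
		}
		step := line[:idx]
		if step == "overall" || seen[step] {
			continue
		}
		seen[step] = true
		label, ok := labels[kv[step+"_status"]]
		if !ok {
			label = "?"
		}
		out = append(out, label+" "+stepLabel(step))
	}
	return out
}

func main() {
	raw := "run_at_utc=2026-03-25T16:11:51Z\n1-lscpu_status=OK\n2-sensors_status=UNSUPPORTED\n3-stress-ng_status=FAILED\noverall_status=FAILED\n"
	for _, line := range renderSteps(raw) {
		fmt.Println(line)
	}
}
```

Walking the raw lines a second time, instead of ranging over the map, is what preserves step order, since Go map iteration order is unspecified.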
func (a *App) psuDetailResult() ActionResult {
raw, err := os.ReadFile(DefaultAuditJSONPath)
if err != nil {
return ActionResult{Title: "PSU", Body: "No audit data."}
}
var snap schema.HardwareIngestRequest
if err := json.Unmarshal(raw, &snap); err != nil {
return ActionResult{Title: "PSU", Body: "Audit data unreadable."}
}
if len(snap.Hardware.PowerSupplies) == 0 {
return ActionResult{Title: "PSU", Body: "No PSU data in last audit."}
}
var b strings.Builder
for i, psu := range snap.Hardware.PowerSupplies {
fmt.Fprintf(&b, "PSU %d\n", i)
if psu.Model != nil {
fmt.Fprintf(&b, " Model: %s\n", *psu.Model)
}
if psu.Vendor != nil {
fmt.Fprintf(&b, " Vendor: %s\n", *psu.Vendor)
}
if psu.WattageW != nil {
fmt.Fprintf(&b, " Rated: %d W\n", *psu.WattageW)
}
if psu.InputPowerW != nil {
fmt.Fprintf(&b, " Input: %.1f W\n", *psu.InputPowerW)
}
if psu.OutputPowerW != nil {
fmt.Fprintf(&b, " Output: %.1f W\n", *psu.OutputPowerW)
}
if psu.TemperatureC != nil {
fmt.Fprintf(&b, " Temp: %.1f°C\n", *psu.TemperatureC)
}
if i < len(snap.Hardware.PowerSupplies)-1 {
fmt.Fprintln(&b)
}
}
return ActionResult{Title: "PSU", Body: strings.TrimSpace(b.String())}
}
// satStatuses reads the latest summary.txt for each SAT type and returns
// a map of component key ("gpu","memory","storage","cpu") → status ("PASS","FAIL","CANCEL","N/A").
func satStatuses() map[string]string {
result := map[string]string{
"gpu": "N/A",
"memory": "N/A",
"storage": "N/A",
"cpu": "N/A",
}
patterns := []struct {
key string
prefix string
}{
{"gpu", "gpu-nvidia-"},
{"gpu", "gpu-amd-"},
{"memory", "memory-"},
{"storage", "storage-"},
{"cpu", "cpu-"},
}
for _, item := range patterns {
matches, err := filepath.Glob(filepath.Join(DefaultSATBaseDir, item.prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
continue
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
continue
}
values := parseKeyValueSummary(string(raw))
switch strings.ToUpper(strings.TrimSpace(values["overall_status"])) {
case "OK":
result[item.key] = "PASS"
case "FAILED":
result[item.key] = "FAIL"
case "CANCELED", "CANCELLED":
result[item.key] = "CANCEL"
}
}
return result
}
func formatPSULine(psus []schema.HardwarePowerSupply) string {
var present []schema.HardwarePowerSupply
for _, psu := range psus {
if psu.Present != nil && !*psu.Present {
continue
}
present = append(present, psu)
}
if len(present) == 0 {
return ""
}
firstW := 0
if present[0].WattageW != nil {
firstW = *present[0].WattageW
}
allSame := firstW > 0
for _, p := range present[1:] {
w := 0
if p.WattageW != nil {
w = *p.WattageW
}
if w != firstW {
allSame = false
break
}
}
if allSame {
return fmt.Sprintf("%dx %dW", len(present), firstW)
}
return fmt.Sprintf("%d PSU", len(present))
}


@@ -0,0 +1,214 @@
package app
import (
"os"
"path/filepath"
"sort"
"strings"
"bee/audit/internal/schema"
)
func applyLatestSATStatuses(snap *schema.HardwareSnapshot, baseDir string) {
if snap == nil || strings.TrimSpace(baseDir) == "" {
return
}
if summary, ok := loadLatestSATSummary(baseDir, "gpu-amd-"); ok {
applyGPUVendorSAT(snap.PCIeDevices, "amd", summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "gpu-nvidia-"); ok {
applyGPUVendorSAT(snap.PCIeDevices, "nvidia", summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "memory-"); ok {
applyMemorySAT(snap.Memory, summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "cpu-"); ok {
applyCPUSAT(snap.CPUs, summary)
}
if summary, ok := loadLatestSATSummary(baseDir, "storage-"); ok {
applyStorageSAT(snap.Storage, summary)
}
}
type satSummary struct {
runAtUTC string
overall string
kv map[string]string
}
func loadLatestSATSummary(baseDir, prefix string) (satSummary, bool) {
matches, err := filepath.Glob(filepath.Join(baseDir, prefix+"*/summary.txt"))
if err != nil || len(matches) == 0 {
return satSummary{}, false
}
sort.Strings(matches)
raw, err := os.ReadFile(matches[len(matches)-1])
if err != nil {
return satSummary{}, false
}
kv := parseKeyValueSummary(string(raw))
return satSummary{
runAtUTC: strings.TrimSpace(kv["run_at_utc"]),
overall: strings.ToUpper(strings.TrimSpace(kv["overall_status"])),
kv: kv,
}, true
}
func applyGPUVendorSAT(devs []schema.HardwarePCIeDevice, vendor string, summary satSummary) {
status, description, ok := satSummaryStatus(summary, vendor+" GPU SAT")
if !ok {
return
}
for i := range devs {
if !matchesGPUVendor(devs[i], vendor) {
continue
}
mergeComponentStatus(&devs[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyMemorySAT(dimms []schema.HardwareMemory, summary satSummary) {
status, description, ok := satSummaryStatus(summary, "memory SAT")
if !ok {
return
}
for i := range dimms {
mergeComponentStatus(&dimms[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyCPUSAT(cpus []schema.HardwareCPU, summary satSummary) {
status, description, ok := satSummaryStatus(summary, "CPU SAT")
if !ok {
return
}
for i := range cpus {
mergeComponentStatus(&cpus[i].HardwareComponentStatus, summary.runAtUTC, status, description)
}
}
func applyStorageSAT(disks []schema.HardwareStorage, summary satSummary) {
byDevice := parseStorageSATStatus(summary)
for i := range disks {
devPath, _ := disks[i].Telemetry["linux_device"].(string)
devName := filepath.Base(strings.TrimSpace(devPath))
if devName == "" {
continue
}
result, ok := byDevice[devName]
if !ok {
continue
}
mergeComponentStatus(&disks[i].HardwareComponentStatus, summary.runAtUTC, result.status, result.description)
}
}
type satStatusResult struct {
status string
description string
ok bool
}
func parseStorageSATStatus(summary satSummary) map[string]satStatusResult {
result := map[string]satStatusResult{}
for key, value := range summary.kv {
if !strings.HasSuffix(key, "_status") || key == "overall_status" {
continue
}
base := strings.TrimSuffix(key, "_status")
idx := strings.Index(base, "_")
if idx <= 0 {
continue
}
devName := base[:idx]
step := strings.ReplaceAll(base[idx+1:], "_", "-")
stepStatus, desc, ok := satKeyStatus(strings.ToUpper(strings.TrimSpace(value)), "storage "+step)
if !ok {
continue
}
current := result[devName]
if !current.ok || statusSeverity(stepStatus) > statusSeverity(current.status) {
result[devName] = satStatusResult{status: stepStatus, description: desc, ok: true}
}
}
return result
}
func satSummaryStatus(summary satSummary, label string) (string, string, bool) {
return satKeyStatus(summary.overall, label)
}
func satKeyStatus(rawStatus, label string) (string, string, bool) {
switch strings.ToUpper(strings.TrimSpace(rawStatus)) {
case "OK":
return "OK", label + " passed", true
case "PARTIAL", "UNSUPPORTED", "CANCELED", "CANCELLED":
return "Warning", label + " incomplete", true
case "FAILED":
return "Critical", label + " failed", true
default:
return "", "", false
}
}
func mergeComponentStatus(component *schema.HardwareComponentStatus, changedAt, satStatus, description string) {
if component == nil || satStatus == "" {
return
}
current := strings.TrimSpace(ptrString(component.Status))
if current == "" || current == "Unknown" || statusSeverity(satStatus) > statusSeverity(current) {
component.Status = appStringPtr(satStatus)
if strings.TrimSpace(description) != "" {
component.ErrorDescription = appStringPtr(description)
}
if strings.TrimSpace(changedAt) != "" {
component.StatusChangedAt = appStringPtr(changedAt)
component.StatusHistory = append(component.StatusHistory, schema.HardwareStatusHistory{
Status: satStatus,
ChangedAt: changedAt,
Details: appStringPtr(description),
})
}
}
}
func statusSeverity(status string) int {
switch strings.TrimSpace(status) {
case "Critical":
return 3
case "Warning":
return 2
case "OK":
return 1
default:
return 0
}
}
func matchesGPUVendor(dev schema.HardwarePCIeDevice, vendor string) bool {
if dev.DeviceClass == nil {
return false
}
class := strings.TrimSpace(*dev.DeviceClass)
if !strings.Contains(class, "Controller") && !strings.Contains(class, "Accelerator") &&
!strings.Contains(class, "Display") && !strings.Contains(class, "Video") {
return false
}
manufacturer := strings.ToLower(strings.TrimSpace(ptrString(dev.Manufacturer)))
switch vendor {
case "amd":
return strings.Contains(manufacturer, "advanced micro devices") || strings.Contains(manufacturer, "amd/ati")
case "nvidia":
return strings.Contains(manufacturer, "nvidia")
default:
return false
}
}
func ptrString(v *string) string {
if v == nil {
return ""
}
return *v
}
func appStringPtr(value string) *string {
return &value
}
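`parseStorageSATStatus` and `mergeComponentStatus` both rely on the same idea: a component keeps the most severe status seen so far. Since the storage results are read from a map, the outcome must not depend on iteration order. A minimal sketch of that "worst status wins" fold; the helper names `severity` and `worst` are assumptions for illustration.

```go
package main

import "fmt"

// severity ranks statuses the same way statusSeverity does in the diff:
// Critical > Warning > OK > anything else.
func severity(status string) int {
	switch status {
	case "Critical":
		return 3
	case "Warning":
		return 2
	case "OK":
		return 1
	default:
		return 0
	}
}

// worst returns the more severe of two statuses; the current value is kept on ties.
func worst(current, candidate string) string {
	if severity(candidate) > severity(current) {
		return candidate
	}
	return current
}

func main() {
	// Hypothetical per-step results for one disk, in arbitrary order.
	steps := []string{"OK", "Critical", "Warning"}
	status := ""
	for _, s := range steps {
		status = worst(status, s)
	}
	fmt.Println(status)
}
```

Taking a maximum is commutative, so any permutation of the step results produces the same final status.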


@@ -0,0 +1,61 @@
package app
import (
"os"
"path/filepath"
"testing"
"bee/audit/internal/schema"
)
func TestApplyLatestSATStatusesMarksStorageByDevice(t *testing.T) {
baseDir := t.TempDir()
runDir := filepath.Join(baseDir, "storage-20260325-161151")
if err := os.MkdirAll(runDir, 0755); err != nil {
t.Fatal(err)
}
raw := "run_at_utc=2026-03-25T16:11:51Z\nnvme0n1_nvme_smart_log_status=OK\nsda_smartctl_health_status=FAILED\noverall_status=FAILED\n"
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(raw), 0644); err != nil {
t.Fatal(err)
}
nvme := schema.HardwareStorage{Telemetry: map[string]any{"linux_device": "/dev/nvme0n1"}}
usb := schema.HardwareStorage{Telemetry: map[string]any{"linux_device": "/dev/sda"}}
snap := schema.HardwareSnapshot{Storage: []schema.HardwareStorage{nvme, usb}}
applyLatestSATStatuses(&snap, baseDir)
if snap.Storage[0].Status == nil || *snap.Storage[0].Status != "OK" {
t.Fatalf("nvme status=%v want OK", snap.Storage[0].Status)
}
if snap.Storage[1].Status == nil || *snap.Storage[1].Status != "Critical" {
t.Fatalf("sda status=%v want Critical", snap.Storage[1].Status)
}
}
func TestApplyLatestSATStatusesMarksAMDGPUs(t *testing.T) {
baseDir := t.TempDir()
runDir := filepath.Join(baseDir, "gpu-amd-20260325-161436")
if err := os.MkdirAll(runDir, 0755); err != nil {
t.Fatal(err)
}
raw := "run_at_utc=2026-03-25T16:14:36Z\noverall_status=FAILED\n"
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(raw), 0644); err != nil {
t.Fatal(err)
}
class := "DisplayController"
manufacturer := "Advanced Micro Devices, Inc. [AMD/ATI]"
snap := schema.HardwareSnapshot{
PCIeDevices: []schema.HardwarePCIeDevice{{
DeviceClass: &class,
Manufacturer: &manufacturer,
}},
}
applyLatestSATStatuses(&snap, baseDir)
if snap.PCIeDevices[0].Status == nil || *snap.PCIeDevices[0].Status != "Critical" {
t.Fatalf("gpu status=%v want Critical", snap.PCIeDevices[0].Status)
}
}


@@ -0,0 +1,364 @@
package app
import (
"archive/tar"
"compress/gzip"
"fmt"
"io"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
"time"
)
var supportBundleServices = []string{
"bee-audit.service",
"bee-web.service",
"bee-network.service",
"bee-nvidia.service",
"bee-preflight.service",
"bee-sshsetup.service",
}
var supportBundleCommands = []struct {
name string
cmd []string
}{
{name: "system/uname.txt", cmd: []string{"uname", "-a"}},
{name: "system/lsmod.txt", cmd: []string{"lsmod"}},
{name: "system/lspci-nn.txt", cmd: []string{"lspci", "-nn"}},
{name: "system/ip-addr.txt", cmd: []string{"ip", "addr"}},
{name: "system/ip-route.txt", cmd: []string{"ip", "route"}},
{name: "system/mount.txt", cmd: []string{"mount"}},
{name: "system/df-h.txt", cmd: []string{"df", "-h"}},
{name: "system/dmesg-tail.txt", cmd: []string{"sh", "-c", "dmesg | tail -n 200"}},
}
func BuildSupportBundle(exportDir string) (string, error) {
exportDir = strings.TrimSpace(exportDir)
if exportDir == "" {
exportDir = DefaultExportDir
}
if err := os.MkdirAll(exportDir, 0755); err != nil {
return "", err
}
if err := cleanupOldSupportBundles(os.TempDir()); err != nil {
return "", err
}
host := sanitizeFilename(hostnameOr("unknown"))
ts := time.Now().UTC().Format("20060102-150405")
stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s", host, ts))
if err := os.MkdirAll(stageRoot, 0755); err != nil {
return "", err
}
defer os.RemoveAll(stageRoot)
if err := copyExportDirForSupportBundle(exportDir, filepath.Join(stageRoot, "export")); err != nil {
return "", err
}
if err := writeJournalDump(filepath.Join(stageRoot, "systemd", "combined.journal.log")); err != nil {
return "", err
}
for _, svc := range supportBundleServices {
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".status.txt"), []string{"systemctl", "status", svc, "--no-pager"}); err != nil {
return "", err
}
if err := writeCommandOutput(filepath.Join(stageRoot, "systemd", svc+".journal.log"), []string{"journalctl", "--no-pager", "-u", svc}); err != nil {
return "", err
}
}
for _, item := range supportBundleCommands {
if err := writeCommandOutput(filepath.Join(stageRoot, item.name), item.cmd); err != nil {
return "", err
}
}
if err := writeManifest(filepath.Join(stageRoot, "manifest.txt"), exportDir, stageRoot); err != nil {
return "", err
}
archivePath := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s.tar.gz", host, ts))
if err := createSupportTarGz(archivePath, stageRoot); err != nil {
return "", err
}
return archivePath, nil
}
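// cleanupOldSupportBundles removes bundle archives in dir that are older than
// 24 hours and keeps only the three newest of the remaining ones.
func cleanupOldSupportBundles(dir string) error {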
matches, err := filepath.Glob(filepath.Join(dir, "bee-support-*.tar.gz"))
if err != nil {
return err
}
type entry struct {
path string
mod time.Time
}
list := make([]entry, 0, len(matches))
for _, match := range matches {
info, err := os.Stat(match)
if err != nil {
continue
}
if time.Since(info.ModTime()) > 24*time.Hour {
_ = os.Remove(match)
continue
}
list = append(list, entry{path: match, mod: info.ModTime()})
}
sort.Slice(list, func(i, j int) bool { return list[i].mod.After(list[j].mod) })
if len(list) > 3 {
for _, old := range list[3:] {
_ = os.Remove(old.path)
}
}
return nil
}
func writeJournalDump(dst string) error {
args := []string{"--no-pager"}
for _, svc := range supportBundleServices {
args = append(args, "-u", svc)
}
raw, err := exec.Command("journalctl", args...).CombinedOutput()
if len(raw) == 0 && err != nil {
raw = []byte(err.Error() + "\n")
}
if len(raw) == 0 {
raw = []byte("no journal output\n")
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
return os.WriteFile(dst, raw, 0644)
}
func writeCommandOutput(dst string, cmd []string) error {
if len(cmd) == 0 {
return nil
}
raw, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
if len(raw) == 0 {
if err != nil {
raw = []byte(err.Error() + "\n")
} else {
raw = []byte("no output\n")
}
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
return os.WriteFile(dst, raw, 0644)
}
func writeManifest(dst, exportDir, stageRoot string) error {
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
var body strings.Builder
fmt.Fprintf(&body, "bee_version=%s\n", buildVersion())
fmt.Fprintf(&body, "host=%s\n", hostnameOr("unknown"))
fmt.Fprintf(&body, "generated_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
fmt.Fprintf(&body, "export_dir=%s\n", exportDir)
fmt.Fprintf(&body, "\nfiles:\n")
var files []string
if err := filepath.Walk(stageRoot, func(path string, info os.FileInfo, err error) error {
if err != nil || info.IsDir() {
return err
}
if filepath.Clean(path) == filepath.Clean(dst) {
return nil
}
rel, err := filepath.Rel(stageRoot, path)
if err != nil {
return err
}
files = append(files, fmt.Sprintf("%s\t%d", rel, info.Size()))
return nil
}); err != nil {
return err
}
sort.Strings(files)
for _, line := range files {
body.WriteString(line)
body.WriteByte('\n')
}
return os.WriteFile(dst, []byte(body.String()), 0644)
}
func buildVersion() string {
raw, err := exec.Command("bee", "version").CombinedOutput()
if err != nil {
return "unknown"
}
return strings.TrimSpace(string(raw))
}
func copyDirContents(srcDir, dstDir string) error {
entries, err := os.ReadDir(srcDir)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
for _, entry := range entries {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(dstDir, entry.Name())
if err := copyPath(src, dst); err != nil {
return err
}
}
return nil
}
func copyExportDirForSupportBundle(srcDir, dstDir string) error {
return copyDirContentsFiltered(srcDir, dstDir, func(rel string, info os.FileInfo) bool {
cleanRel := filepath.ToSlash(strings.TrimPrefix(filepath.Clean(rel), "./"))
if cleanRel == "" {
return true
}
if strings.HasPrefix(cleanRel, "bee-sat/") && strings.HasSuffix(cleanRel, ".tar.gz") {
return false
}
if strings.HasPrefix(filepath.Base(cleanRel), "bee-support-") && strings.HasSuffix(cleanRel, ".tar.gz") {
return false
}
return true
})
}
func copyDirContentsFiltered(srcDir, dstDir string, keep func(rel string, info os.FileInfo) bool) error {
entries, err := os.ReadDir(srcDir)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
for _, entry := range entries {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(dstDir, entry.Name())
if err := copyPathFiltered(srcDir, src, dst, keep); err != nil {
return err
}
}
return nil
}
func copyPath(src, dst string) error {
info, err := os.Stat(src)
if err != nil {
return err
}
if info.IsDir() {
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
return err
}
entries, err := os.ReadDir(src)
if err != nil {
return err
}
for _, entry := range entries {
if err := copyPath(filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name())); err != nil {
return err
}
}
return nil
}
if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
return err
}
in, err := os.Open(src)
if err != nil {
return err
}
defer in.Close()
out, err := os.OpenFile(dst, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, info.Mode().Perm())
if err != nil {
return err
}
defer out.Close()
_, err = io.Copy(out, in)
return err
}
func copyPathFiltered(rootSrc, src, dst string, keep func(rel string, info os.FileInfo) bool) error {
info, err := os.Stat(src)
if err != nil {
return err
}
rel, err := filepath.Rel(rootSrc, src)
if err != nil {
return err
}
if keep != nil && !keep(rel, info) {
return nil
}
if info.IsDir() {
if err := os.MkdirAll(dst, info.Mode().Perm()); err != nil {
return err
}
entries, err := os.ReadDir(src)
if err != nil {
return err
}
for _, entry := range entries {
if err := copyPathFiltered(rootSrc, filepath.Join(src, entry.Name()), filepath.Join(dst, entry.Name()), keep); err != nil {
return err
}
}
return nil
}
return copyPath(src, dst)
}
func createSupportTarGz(dst, srcDir string) error {
file, err := os.Create(dst)
if err != nil {
return err
}
defer file.Close()
gz := gzip.NewWriter(file)
defer gz.Close()
tw := tar.NewWriter(gz)
defer tw.Close()
base := filepath.Dir(srcDir)
return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
if info.IsDir() {
return nil
}
header, err := tar.FileInfoHeader(info, "")
if err != nil {
return err
}
header.Name, err = filepath.Rel(base, path)
if err != nil {
return err
}
if err := tw.WriteHeader(header); err != nil {
return err
}
f, err := os.Open(path)
if err != nil {
return err
}
defer f.Close()
_, err = io.Copy(tw, f)
return err
})
}


@@ -0,0 +1,252 @@
package collector
import (
"encoding/csv"
"log/slog"
"os/exec"
"path/filepath"
"sort"
"strconv"
"strings"
"bee/audit/internal/schema"
)
var (
amdSMIExecCommand = exec.Command
amdSMILookPath = exec.LookPath
amdSMIGlob = filepath.Glob
)
var amdSMIExecutableGlobs = []string{
"/opt/rocm/bin/rocm-smi",
"/opt/rocm-*/bin/rocm-smi",
"/usr/local/bin/rocm-smi",
}
type amdGPUInfo struct {
BDF string
Serial string
Product string
Firmware string
PowerW *float64
TempC *float64
}
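// enrichPCIeWithAMD fills serial number, firmware, power, and temperature for
// AMD GPUs from rocm-smi output, matched by PCI BDF. Model is set only when it
// is still nil. If rocm-smi cannot be queried, devs is returned unchanged.
func enrichPCIeWithAMD(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {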
if !hasAMDGPUDevices(devs) {
return devs
}
infoByBDF, err := queryAMDGPUs()
if err != nil {
slog.Info("amdgpu: enrichment skipped", "err", err)
return devs
}
enriched := 0
for i := range devs {
if !isAMDGPUDevice(devs[i]) || devs[i].BDF == nil {
continue
}
info, ok := infoByBDF[normalizePCIeBDF(*devs[i].BDF)]
if !ok {
continue
}
if strings.TrimSpace(info.Serial) != "" {
devs[i].SerialNumber = &info.Serial
}
if strings.TrimSpace(info.Firmware) != "" {
devs[i].Firmware = &info.Firmware
}
if strings.TrimSpace(info.Product) != "" && devs[i].Model == nil {
devs[i].Model = &info.Product
}
if info.PowerW != nil {
devs[i].PowerW = info.PowerW
}
if info.TempC != nil {
devs[i].TemperatureC = info.TempC
}
enriched++
}
if enriched > 0 {
slog.Info("amdgpu: enriched", "count", enriched)
}
return devs
}
func hasAMDGPUDevices(devs []schema.HardwarePCIeDevice) bool {
for _, dev := range devs {
if isAMDGPUDevice(dev) {
return true
}
}
return false
}
func isAMDGPUDevice(dev schema.HardwarePCIeDevice) bool {
if dev.Manufacturer == nil || dev.DeviceClass == nil {
return false
}
manufacturer := strings.ToLower(strings.TrimSpace(*dev.Manufacturer))
return strings.Contains(manufacturer, "advanced micro devices") && isGPUClass(strings.TrimSpace(*dev.DeviceClass))
}
func queryAMDGPUs() (map[string]amdGPUInfo, error) {
busByCard, err := queryAMDField("--showbus")
if err != nil {
return nil, err
}
infoByCard := map[string]amdGPUInfo{}
for card, bus := range busByCard {
bdf := normalizePCIeBDF(bus)
if bdf == "" {
continue
}
infoByCard[card] = amdGPUInfo{BDF: bdf}
}
if len(infoByCard) == 0 {
return map[string]amdGPUInfo{}, nil
}
mergeAMDField(infoByCard, "--showserial", func(info *amdGPUInfo, value string) { info.Serial = value })
mergeAMDField(infoByCard, "--showproductname", func(info *amdGPUInfo, value string) { info.Product = value })
mergeAMDField(infoByCard, "--showvbios", func(info *amdGPUInfo, value string) { info.Firmware = value })
mergeAMDNumericField(infoByCard, "--showpower", func(info *amdGPUInfo, value float64) { info.PowerW = &value })
mergeAMDNumericField(infoByCard, "--showtemp", func(info *amdGPUInfo, value float64) { info.TempC = &value })
result := make(map[string]amdGPUInfo, len(infoByCard))
for _, info := range infoByCard {
if info.BDF == "" {
continue
}
result[info.BDF] = info
}
return result, nil
}
func mergeAMDField(infoByCard map[string]amdGPUInfo, flag string, apply func(*amdGPUInfo, string)) {
values, err := queryAMDField(flag)
if err != nil {
return
}
for card, value := range values {
info, ok := infoByCard[card]
if !ok {
continue
}
value = strings.TrimSpace(value)
if value == "" {
continue
}
apply(&info, value)
infoByCard[card] = info
}
}
func mergeAMDNumericField(infoByCard map[string]amdGPUInfo, flag string, apply func(*amdGPUInfo, float64)) {
values, err := queryAMDNumericField(flag)
if err != nil {
return
}
for card, value := range values {
info, ok := infoByCard[card]
if !ok {
continue
}
apply(&info, value)
infoByCard[card] = info
}
}
func queryAMDField(flag string) (map[string]string, error) {
cmd, err := resolveAMDSMICmd(flag, "--csv")
if err != nil {
return nil, err
}
out, err := amdSMIExecCommand(cmd[0], cmd[1:]...).CombinedOutput()
if err != nil {
return nil, err
}
return parseROCmSingleValueCSV(string(out)), nil
}
func queryAMDNumericField(flag string) (map[string]float64, error) {
values, err := queryAMDField(flag)
if err != nil {
return nil, err
}
out := map[string]float64{}
for card, raw := range values {
if value, ok := firstFloat(raw); ok {
out[card] = value
}
}
return out, nil
}
func resolveAMDSMICmd(args ...string) ([]string, error) {
if path, err := amdSMILookPath("rocm-smi"); err == nil {
return append([]string{path}, args...), nil
}
for _, pattern := range amdSMIExecutableGlobs {
matches, err := amdSMIGlob(pattern)
if err != nil {
continue
}
sort.Strings(matches)
if len(matches) > 0 {
return append([]string{matches[0]}, args...), nil
}
}
return nil, exec.ErrNotFound
}
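// parseROCmSingleValueCSV parses rocm-smi --csv output that carries one value
// per card into a map keyed by normalized card name (card0, card1, ...).
// Header rows and rows without a recognizable card key are skipped.
func parseROCmSingleValueCSV(raw string) map[string]string {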
rows := map[string]string{}
reader := csv.NewReader(strings.NewReader(raw))
reader.FieldsPerRecord = -1
records, err := reader.ReadAll()
if err != nil {
return rows
}
for _, rec := range records {
if len(rec) < 2 {
continue
}
card := normalizeROCmCardKey(rec[0])
if card == "" {
continue
}
value := strings.TrimSpace(strings.Join(rec[1:], ","))
if value == "" || looksLikeCSVHeaderValue(value) {
continue
}
rows[card] = value
}
return rows
}
func normalizeROCmCardKey(raw string) string {
raw = strings.ToLower(strings.TrimSpace(raw))
raw = strings.Trim(raw, "\"")
if raw == "" {
return ""
}
if raw == "device" || raw == "gpu" || raw == "card" {
return ""
}
if strings.HasPrefix(raw, "card") {
return raw
}
if _, err := strconv.Atoi(raw); err == nil {
return "card" + raw
}
return ""
}
func looksLikeCSVHeaderValue(value string) bool {
value = strings.ToLower(strings.TrimSpace(value))
return strings.Contains(value, "product") ||
strings.Contains(value, "serial") ||
strings.Contains(value, "vbios") ||
strings.Contains(value, "bus")
}
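
queryAMDNumericField relies on a firstFloat helper that is defined outside this diff. The following is a minimal sketch of what such a helper might look like, assuming it only needs to pull the first decimal number out of values such as "45.5c" or "225.0 W". The name firstFloatSketch and the regular expression are illustrative assumptions, not the repository's implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// numberRe matches the first signed decimal number in a string.
var numberRe = regexp.MustCompile(`[-+]?[0-9]+(?:\.[0-9]+)?`)

// firstFloatSketch is a hypothetical stand-in for firstFloat: it returns the
// first number found in raw and false when raw contains no number.
func firstFloatSketch(raw string) (float64, bool) {
	m := numberRe.FindString(raw)
	if m == "" {
		return 0, false
	}
	v, err := strconv.ParseFloat(m, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, s := range []string{"45.5c", "225.0 W", "N/A"} {
		v, ok := firstFloatSketch(s)
		fmt.Printf("%q -> %v %v\n", s, v, ok)
	}
}
```

Returning a boolean alongside the value lets the caller skip header cells and "N/A" readings without treating them as zero.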


@@ -0,0 +1,56 @@
package collector
import (
"os/exec"
"testing"
)
func TestParseROCmSingleValueCSV(t *testing.T) {
raw := "device,Serial Number\ncard0,ABC123\ncard1,XYZ789\n"
got := parseROCmSingleValueCSV(raw)
if got["card0"] != "ABC123" {
t.Fatalf("card0=%q want ABC123", got["card0"])
}
if got["card1"] != "XYZ789" {
t.Fatalf("card1=%q want XYZ789", got["card1"])
}
}
func TestQueryAMDNumericFieldParsesUnits(t *testing.T) {
origExec := amdSMIExecCommand
origLookPath := amdSMILookPath
t.Cleanup(func() {
amdSMIExecCommand = origExec
amdSMILookPath = origLookPath
})
amdSMILookPath = func(string) (string, error) { return "/usr/bin/rocm-smi", nil }
amdSMIExecCommand = func(name string, args ...string) *exec.Cmd {
return exec.Command("sh", "-c", "printf 'device,Temperature\\ncard0,45.5c\\ncard1,67.0c\\n'")
}
got, err := queryAMDNumericField("--showtemp")
if err != nil {
t.Fatalf("queryAMDNumericField: %v", err)
}
if got["card0"] != 45.5 {
t.Fatalf("card0=%v want 45.5", got["card0"])
}
if got["card1"] != 67.0 {
t.Fatalf("card1=%v want 67.0", got["card1"])
}
}
func TestNormalizeROCmCardKey(t *testing.T) {
tests := map[string]string{
"0": "card0",
"card1": "card1",
"Device": "",
"": "",
}
for input, want := range tests {
if got := normalizeROCmCardKey(input); got != want {
t.Fatalf("normalizeROCmCardKey(%q)=%q want %q", input, got, want)
}
}
}
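
The tests above replace the package-level amdSMIExecCommand and amdSMILookPath variables to avoid calling a real rocm-smi. A stripped-down sketch of that seam pattern follows. runTool and firstOutputLine are hypothetical names, and the fake output is invented for the example.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runTool is a hypothetical seam: production code calls it instead of
// exec.Command directly, so a test can swap in a fake.
var runTool = func(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).Output()
	return string(out), err
}

// firstOutputLine runs a tool through the seam and returns its first line.
func firstOutputLine(name string, args ...string) (string, error) {
	out, err := runTool(name, args...)
	if err != nil {
		return "", err
	}
	line, _, _ := strings.Cut(out, "\n")
	return strings.TrimSpace(line), nil
}

func main() {
	// Swap the seam for a fake and restore it afterwards, as t.Cleanup
	// does in the tests above.
	orig := runTool
	defer func() { runTool = orig }()
	runTool = func(string, ...string) (string, error) {
		return "card0,45.5c\ncard1,67.0c\n", nil
	}

	line, err := firstOutputLine("rocm-smi", "--showtemp", "--csv")
	fmt.Println(line, err)
}
```

A fake that returns a string directly avoids the dependency on a shell that the exec.Command("sh", "-c", ...) approach carries, at the cost of a slightly different seam signature.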


@@ -4,6 +4,7 @@ import (
"bee/audit/internal/schema"
"bufio"
"log/slog"
"os"
"os/exec"
"strings"
)
@@ -16,6 +17,14 @@ var execDmidecode = func(typeNum string) (string, error) {
return string(out), nil
}
var execIpmitool = func(args ...string) (string, error) {
out, err := exec.Command("ipmitool", args...).Output()
if err != nil {
return "", err
}
return string(out), nil
}
// collectBoard runs dmidecode for types 0, 1, 2 and returns the board record
// plus the BIOS firmware entry. Any failure is logged and returns zero values.
func collectBoard() (schema.HardwareBoard, []schema.HardwareFirmwareRecord) {
@@ -69,6 +78,45 @@ func parseBoard(type1, type2 string) schema.HardwareBoard {
return board
}
// collectBMCFirmware collects BMC firmware version via ipmitool mc info.
// Returns nil if ipmitool is missing, /dev/ipmi0 is absent, or any error occurs.
func collectBMCFirmware() []schema.HardwareFirmwareRecord {
if _, err := exec.LookPath("ipmitool"); err != nil {
return nil
}
if _, err := os.Stat("/dev/ipmi0"); err != nil {
return nil
}
out, err := execIpmitool("mc", "info")
if err != nil {
slog.Info("bmc: ipmitool mc info unavailable", "err", err)
return nil
}
version := parseBMCFirmwareRevision(out)
if version == "" {
return nil
}
slog.Info("bmc: collected", "version", version)
return []schema.HardwareFirmwareRecord{
{DeviceName: "BMC", Version: version},
}
}
// parseBMCFirmwareRevision extracts the "Firmware Revision" field from ipmitool mc info output.
func parseBMCFirmwareRevision(out string) string {
for _, line := range strings.Split(out, "\n") {
line = strings.TrimSpace(line)
key, val, ok := strings.Cut(line, ":")
if !ok {
continue
}
if strings.TrimSpace(key) == "Firmware Revision" {
return strings.TrimSpace(val)
}
}
return ""
}
// parseBIOSFirmware extracts BIOS version from dmidecode type 0 output.
func parseBIOSFirmware(type0 string) []schema.HardwareFirmwareRecord {
fields := parseDMIFields(type0, "BIOS Information")
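
parseBMCFirmwareRevision above scans `ipmitool mc info` output for the "Firmware Revision" key. The standalone sketch below repeats that logic against a hand-written sample. The sample text is illustrative and not captured from a real BMC.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBMCFirmwareRevision mirrors the function in the diff: it returns the
// value of the "Firmware Revision" line, or "" when the key is missing.
func parseBMCFirmwareRevision(out string) string {
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue
		}
		if strings.TrimSpace(key) == "Firmware Revision" {
			return strings.TrimSpace(val)
		}
	}
	return ""
}

func main() {
	// Illustrative sample in the "key : value" layout that ipmitool prints.
	sample := `Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 1.73
IPMI Version              : 2.0
`
	fmt.Println(parseBMCFirmwareRevision(sample))
}
```

Because strings.Cut splits on the first colon only, a value that itself contains a colon is returned intact.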


@@ -23,8 +23,9 @@ func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
board, biosFW := collectBoard()
snap.Board = board
snap.Firmware = append(snap.Firmware, biosFW...)
snap.Firmware = append(snap.Firmware, collectBMCFirmware()...)
snap.CPUs = collectCPUs(snap.Board.SerialNumber)
snap.CPUs = collectCPUs()
snap.Memory = collectMemory()
sensorDoc, err := readSensorsJSONDoc()
@@ -35,7 +36,9 @@ func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
snap.Memory = enrichMemoryWithTelemetry(snap.Memory, sensorDoc)
snap.Storage = collectStorage()
snap.PCIeDevices = collectPCIe()
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices, snap.Board.SerialNumber)
snap.PCIeDevices = enrichPCIeWithAMD(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithPCISerials(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithMellanox(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNICTelemetry(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithRAIDTelemetry(snap.PCIeDevices)


@@ -3,7 +3,6 @@ package collector
import (
"bee/audit/internal/schema"
"bufio"
"fmt"
"log/slog"
"os"
"path/filepath"
@@ -12,14 +11,14 @@ import (
)
// collectCPUs runs dmidecode -t 4 and enriches CPUs with microcode from sysfs.
func collectCPUs(boardSerial string) []schema.HardwareCPU {
func collectCPUs() []schema.HardwareCPU {
out, err := runDmidecode("4")
if err != nil {
slog.Warn("cpu: dmidecode type 4 failed", "err", err)
return nil
}
cpus := parseCPUs(out, boardSerial)
cpus := parseCPUs(out)
if mc := readMicrocode(); mc != "" {
for i := range cpus {
cpus[i].Firmware = &mc
@@ -31,12 +30,12 @@ func collectCPUs(boardSerial string) []schema.HardwareCPU {
}
// parseCPUs splits dmidecode output into per-processor sections and parses each.
func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
func parseCPUs(output string) []schema.HardwareCPU {
sections := splitDMISections(output, "Processor Information")
cpus := make([]schema.HardwareCPU, 0, len(sections))
for _, section := range sections {
cpu, ok := parseCPUSection(section, boardSerial)
cpu, ok := parseCPUSection(section)
if !ok {
continue
}
@@ -47,7 +46,7 @@ func parseCPUs(output, boardSerial string) []schema.HardwareCPU {
// parseCPUSection parses one "Processor Information" block into a HardwareCPU.
// Returns false if the socket is unpopulated.
func parseCPUSection(fields map[string]string, boardSerial string) (schema.HardwareCPU, bool) {
func parseCPUSection(fields map[string]string) (schema.HardwareCPU, bool) {
status := parseCPUStatus(fields["Status"])
if status == statusEmpty {
return schema.HardwareCPU{}, false
@@ -70,11 +69,6 @@ func parseCPUSection(fields map[string]string, boardSerial string) (schema.Hardw
}
if v := cleanDMIValue(fields["Serial Number"]); v != "" {
cpu.SerialNumber = &v
} else if boardSerial != "" && cpu.Socket != nil {
// Intel Xeon never exposes serial via DMI — generate stable fallback
// matching core's generateCPUVendorSerial() logic
fb := fmt.Sprintf("%s-CPU-%d", boardSerial, *cpu.Socket)
cpu.SerialNumber = &fb
}
if v := parseMHz(fields["Max Speed"]); v > 0 {


@@ -8,7 +8,7 @@ import (
func TestParseCPUs_dual_socket(t *testing.T) {
out := mustReadFile(t, "testdata/dmidecode_type4.txt")
cpus := parseCPUs(out, "CAR315KA0803B90")
cpus := parseCPUs(out)
if len(cpus) != 2 {
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
@@ -39,23 +39,22 @@ func TestParseCPUs_dual_socket(t *testing.T) {
if cpu0.Status == nil || *cpu0.Status != "OK" {
t.Errorf("cpu0 status: got %v, want OK", cpu0.Status)
}
// Intel Xeon serial not available → fallback
if cpu0.SerialNumber == nil || *cpu0.SerialNumber != "CAR315KA0803B90-CPU-0" {
t.Errorf("cpu0 serial fallback: got %v, want CAR315KA0803B90-CPU-0", cpu0.SerialNumber)
if cpu0.SerialNumber != nil {
t.Errorf("cpu0 serial should stay nil without source data, got %v", cpu0.SerialNumber)
}
cpu1 := cpus[1]
if cpu1.Socket == nil || *cpu1.Socket != 1 {
t.Errorf("cpu1 socket: got %v, want 1", cpu1.Socket)
}
if cpu1.SerialNumber == nil || *cpu1.SerialNumber != "CAR315KA0803B90-CPU-1" {
t.Errorf("cpu1 serial fallback: got %v, want CAR315KA0803B90-CPU-1", cpu1.SerialNumber)
if cpu1.SerialNumber != nil {
t.Errorf("cpu1 serial should stay nil without source data, got %v", cpu1.SerialNumber)
}
}
func TestParseCPUs_unpopulated_skipped(t *testing.T) {
out := mustReadFile(t, "testdata/dmidecode_type4_disabled.txt")
cpus := parseCPUs(out, "BOARD-001")
cpus := parseCPUs(out)
if len(cpus) != 1 {
t.Fatalf("expected 1 CPU (unpopulated skipped), got %d", len(cpus))
@@ -87,7 +86,7 @@ func TestCollectCPUsSetsFirmwareFromMicrocode(t *testing.T) {
}
t.Cleanup(func() { execDmidecode = origRun })
cpus := collectCPUs("CAR315KA0803B90")
cpus := collectCPUs()
if len(cpus) != 2 {
t.Fatalf("expected 2 CPUs, got %d", len(cpus))
}


@@ -1,9 +1,6 @@
package collector
import (
"bee/audit/internal/schema"
"fmt"
)
import "bee/audit/internal/schema"
func finalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
snap.Memory = filterMemory(snap.Memory)
@@ -11,7 +8,6 @@ func finalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
snap.PowerSupplies = filterPSUs(snap.PowerSupplies)
setComponentStatusMetadata(snap, collectedAt)
deduplicateComponentSerials(snap)
}
func filterMemory(dimms []schema.HardwareMemory) []schema.HardwareMemory {
@@ -45,7 +41,18 @@ func filterStorage(disks []schema.HardwareStorage) []schema.HardwareStorage {
func filterPSUs(psus []schema.HardwarePowerSupply) []schema.HardwarePowerSupply {
out := make([]schema.HardwarePowerSupply, 0, len(psus))
for _, psu := range psus {
if psu.SerialNumber == nil || *psu.SerialNumber == "" {
hasIdentity := false
switch {
case psu.SerialNumber != nil && *psu.SerialNumber != "":
hasIdentity = true
case psu.Slot != nil && *psu.Slot != "":
hasIdentity = true
case psu.Model != nil && *psu.Model != "":
hasIdentity = true
case psu.Vendor != nil && *psu.Vendor != "":
hasIdentity = true
}
if !hasIdentity {
continue
}
out = append(out, psu)
@@ -79,101 +86,3 @@ func setStatusCheckedAt(status *schema.HardwareComponentStatus, collectedAt stri
status.StatusCheckedAt = &collectedAt
}
}
func deduplicateComponentSerials(snap *schema.HardwareSnapshot) {
deduplicateCPUSerials(snap.CPUs)
deduplicateMemorySerials(snap.Memory)
deduplicateStorageSerials(snap.Storage)
deduplicatePCIeSerials(snap.PCIeDevices)
deduplicatePSUSerials(snap.PowerSupplies)
}
func deduplicateCPUSerials(items []schema.HardwareCPU) {
seen := map[string]int{}
seq := 1
for i := range items {
if items[i].SerialNumber == nil || *items[i].SerialNumber == "" {
continue
}
model := derefString(items[i].Model)
key := model + "\x00" + *items[i].SerialNumber
seen[key]++
if seen[key] > 1 {
repl := fmt.Sprintf("NO_SN-%08d", seq)
seq++
items[i].SerialNumber = &repl
}
}
}
func deduplicateMemorySerials(items []schema.HardwareMemory) {
seen := map[string]int{}
seq := 1
for i := range items {
if items[i].SerialNumber == nil || *items[i].SerialNumber == "" {
continue
}
model := derefString(items[i].PartNumber)
key := model + "\x00" + *items[i].SerialNumber
seen[key]++
if seen[key] > 1 {
repl := fmt.Sprintf("NO_SN-%08d", seq)
seq++
items[i].SerialNumber = &repl
}
}
}
func deduplicateStorageSerials(items []schema.HardwareStorage) {
seen := map[string]int{}
seq := 1
for i := range items {
if items[i].SerialNumber == nil || *items[i].SerialNumber == "" {
continue
}
model := derefString(items[i].Model)
key := model + "\x00" + *items[i].SerialNumber
seen[key]++
if seen[key] > 1 {
repl := fmt.Sprintf("NO_SN-%08d", seq)
seq++
items[i].SerialNumber = &repl
}
}
}
func deduplicatePCIeSerials(items []schema.HardwarePCIeDevice) {
seen := map[string]int{}
seq := 1
for i := range items {
if items[i].SerialNumber == nil || *items[i].SerialNumber == "" {
continue
}
model := derefString(items[i].Model)
key := model + "\x00" + *items[i].SerialNumber
seen[key]++
if seen[key] > 1 {
repl := fmt.Sprintf("NO_SN-%08d", seq)
seq++
items[i].SerialNumber = &repl
}
}
}
func deduplicatePSUSerials(items []schema.HardwarePowerSupply) {
seen := map[string]int{}
seq := 1
for i := range items {
if items[i].SerialNumber == nil || *items[i].SerialNumber == "" {
continue
}
model := derefString(items[i].Model)
key := model + "\x00" + *items[i].SerialNumber
seen[key]++
if seen[key] > 1 {
repl := fmt.Sprintf("NO_SN-%08d", seq)
seq++
items[i].SerialNumber = &repl
}
}
}
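
The reworked filterPSUs keeps a power supply when any of serial number, slot, model, or vendor is non-empty, instead of requiring a serial. The same rule can be expressed as a single predicate. The sketch below uses a hypothetical psu struct in place of schema.HardwarePowerSupply and is not the repository's code.

```go
package main

import "fmt"

// psu is a hypothetical, trimmed-down stand-in for schema.HardwarePowerSupply.
type psu struct {
	SerialNumber *string
	Slot         *string
	Model        *string
	Vendor       *string
}

// nonEmpty reports whether p points at a non-empty string.
func nonEmpty(p *string) bool { return p != nil && *p != "" }

// hasIdentity reports whether at least one identifying field is set.
func hasIdentity(p psu) bool {
	return nonEmpty(p.SerialNumber) || nonEmpty(p.Slot) ||
		nonEmpty(p.Model) || nonEmpty(p.Vendor)
}

// filterPSUs drops entries that carry no identifying field at all.
func filterPSUs(psus []psu) []psu {
	out := make([]psu, 0, len(psus))
	for _, p := range psus {
		if hasIdentity(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	slot, empty := "0", ""
	kept := filterPSUs([]psu{
		{Slot: &slot},    // kept: slot is enough
		{},               // dropped: nothing identifies it
		{Vendor: &empty}, // dropped: empty strings do not count
	})
	fmt.Println(len(kept))
}
```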


@@ -39,7 +39,7 @@ func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
}
}
func TestFinalizeSnapshotDeduplicatesSerials(t *testing.T) {
func TestFinalizeSnapshotPreservesDuplicateSerials(t *testing.T) {
collectedAt := "2026-03-15T12:00:00Z"
status := statusOK
model := "Device"
@@ -57,7 +57,24 @@ func TestFinalizeSnapshotDeduplicatesSerials(t *testing.T) {
if got := *snap.Storage[0].SerialNumber; got != serial {
t.Fatalf("first serial changed: %q", got)
}
if got := *snap.Storage[1].SerialNumber; got != "NO_SN-00000001" {
t.Fatalf("duplicate serial mismatch: %q", got)
if got := *snap.Storage[1].SerialNumber; got != serial {
t.Fatalf("duplicate serial should stay unchanged: %q", got)
}
}
func TestFilterPSUsKeepsSlotOnlyEntries(t *testing.T) {
slot := "0"
status := statusOK
got := filterPSUs([]schema.HardwarePowerSupply{
{Slot: &slot, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
})
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].Slot == nil || *got[0].Slot != "0" {
t.Fatalf("unexpected kept PSU: %+v", got[0])
}
}


@@ -44,6 +44,11 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
}
iface := ifaces[0]
devs[i].MacAddresses = collectInterfaceMACs(ifaces)
if devs[i].SerialNumber == nil {
if serial := queryPCIDeviceSerial(bdf); serial != "" {
devs[i].SerialNumber = &serial
}
}
if devs[i].Firmware == nil {
if out, err := ethtoolInfoQuery(iface); err == nil {


@@ -1,6 +1,10 @@
package collector
import "testing"
import (
"bee/audit/internal/schema"
"fmt"
"testing"
)
func TestParseSFPDOM(t *testing.T) {
raw := `
@@ -29,6 +33,74 @@ func TestParseSFPDOM(t *testing.T) {
}
}
func TestParseLSPCIDetailSerial(t *testing.T) {
raw := `
05:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Serial number: NIC-SN-12345
`
if got := parseLSPCIDetailSerial(raw); got != "NIC-SN-12345" {
t.Fatalf("serial=%q want %q", got, "NIC-SN-12345")
}
}
func TestParsePCIVPDSerial(t *testing.T) {
raw := []byte{0x82, 0x05, 0x00, 'M', 'L', 'X', '5', 0x90, 0x08, 0x00, 'S', 'N', 0x08, 'M', 'T', '1', '2', '3', '4', '5', '6'}
if got := parsePCIVPDSerial(raw); got != "MT123456" {
t.Fatalf("serial=%q want %q", got, "MT123456")
}
}
func TestEnrichPCIeWithNICTelemetryAddsSerialFallback(t *testing.T) {
origDetail := queryPCILSPCIDetail
origVPD := readPCIVPDFile
origIfaces := netIfacesByBDF
origReadMAC := readNetAddressFile
origEth := ethtoolInfoQuery
origModule := ethtoolModuleQuery
t.Cleanup(func() {
queryPCILSPCIDetail = origDetail
readPCIVPDFile = origVPD
netIfacesByBDF = origIfaces
readNetAddressFile = origReadMAC
ethtoolInfoQuery = origEth
ethtoolModuleQuery = origModule
})
queryPCILSPCIDetail = func(bdf string) (string, error) {
if bdf != "0000:18:00.0" {
t.Fatalf("unexpected bdf: %s", bdf)
}
return "Serial number: NIC-SN-98765\n", nil
}
readPCIVPDFile = func(string) ([]byte, error) {
return nil, fmt.Errorf("no vpd needed")
}
netIfacesByBDF = func(string) []string { return []string{"eth0"} }
readNetAddressFile = func(iface string) (string, error) {
if iface != "eth0" {
t.Fatalf("unexpected iface: %s", iface)
}
return "aa:bb:cc:dd:ee:ff", nil
}
ethtoolInfoQuery = func(string) (string, error) { return "", fmt.Errorf("skip firmware") }
ethtoolModuleQuery = func(string) (string, error) { return "", fmt.Errorf("skip optics") }
class := "EthernetController"
bdf := "0000:18:00.0"
devs := []schema.HardwarePCIeDevice{{
DeviceClass: &class,
BDF: &bdf,
}}
out := enrichPCIeWithNICTelemetry(devs)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "NIC-SN-98765" {
t.Fatalf("serial=%v want NIC-SN-98765", out[0].SerialNumber)
}
if len(out[0].MacAddresses) != 1 || out[0].MacAddresses[0] != "aa:bb:cc:dd:ee:ff" {
t.Fatalf("mac_addresses=%v", out[0].MacAddresses)
}
}
func TestDBMValue(t *testing.T) {
tests := []struct {
in string
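
TestParsePCIVPDSerial exercises parsePCIVPDSerial, whose body is not part of this diff. For orientation, here is a sketch of a strict PCI VPD walker: large resources carry a tag byte plus a two-byte little-endian length, and the read-only section (tag 0x90) holds keyword fields made of a two-byte keyword, a one-byte length, and data. vpdSerial and keywordValue are hypothetical names. This is not the repository's parser, and because it trusts every length field it returns an empty string for buffers whose lengths do not match their payloads.

```go
package main

import (
	"fmt"
	"strings"
)

// vpdSerial is a hypothetical strict parser: it walks VPD resources and
// returns the "SN" keyword from the VPD-R resource (tag 0x90), or "".
func vpdSerial(raw []byte) string {
	for i := 0; i < len(raw); {
		tag := raw[i]
		if tag&0x80 == 0 {
			// Small resource: length is in the low 3 bits; 0x78 is the end tag.
			if tag == 0x78 {
				return ""
			}
			i += 1 + int(tag&0x07)
			continue
		}
		// Large resource: tag byte followed by a 2-byte little-endian length.
		if i+3 > len(raw) {
			return ""
		}
		n := int(raw[i+1]) | int(raw[i+2])<<8
		start, end := i+3, i+3+n
		if end > len(raw) {
			return ""
		}
		if tag == 0x90 {
			if sn := keywordValue(raw[start:end], "SN"); sn != "" {
				return sn
			}
		}
		i = end
	}
	return ""
}

// keywordValue scans keyword fields: 2-byte keyword, 1-byte length, data.
func keywordValue(fields []byte, keyword string) string {
	for i := 0; i+3 <= len(fields); {
		end := i + 3 + int(fields[i+2])
		if end > len(fields) {
			return ""
		}
		if string(fields[i:i+2]) == keyword {
			return strings.TrimSpace(string(fields[i+3 : end]))
		}
		i = end
	}
	return ""
}

func main() {
	// Well-formed sample: identifier string "MLX5", VPD-R with one SN field, end tag.
	raw := []byte{
		0x82, 0x04, 0x00, 'M', 'L', 'X', '5',
		0x90, 0x0B, 0x00, 'S', 'N', 0x08, 'M', 'T', '1', '2', '3', '4', '5', '6',
		0x78,
	}
	fmt.Println(vpdSerial(raw))
}
```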


@@ -24,18 +24,17 @@ type nvidiaGPUInfo struct {
}
// enrichPCIeWithNVIDIA enriches NVIDIA PCIe devices with data from nvidia-smi.
// If the driver/tool is unavailable, NVIDIA devices get Unknown status and
// a stable serial fallback based on board serial + slot.
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice, boardSerial string) []schema.HardwarePCIeDevice {
// If the driver/tool is unavailable, NVIDIA devices get Unknown status.
func enrichPCIeWithNVIDIA(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
if !hasNVIDIADevices(devs) {
return devs
}
gpuByBDF, err := queryNVIDIAGPUs()
if err != nil {
slog.Info("nvidia: enrichment skipped", "err", err)
return enrichPCIeWithNVIDIAData(devs, nil, boardSerial, false)
return enrichPCIeWithNVIDIAData(devs, nil, false)
}
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, boardSerial, true)
return enrichPCIeWithNVIDIAData(devs, gpuByBDF, true)
}
func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
@@ -47,7 +46,7 @@ func hasNVIDIADevices(devs []schema.HardwarePCIeDevice) bool {
return false
}
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, boardSerial string, driverLoaded bool) []schema.HardwarePCIeDevice {
func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[string]nvidiaGPUInfo, driverLoaded bool) []schema.HardwarePCIeDevice {
enriched := 0
for i := range devs {
if !isNVIDIADevice(devs[i]) {
@@ -55,7 +54,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
}
if !driverLoaded {
setPCIeFallback(&devs[i], boardSerial)
setPCIeFallback(&devs[i])
continue
}
@@ -65,14 +64,12 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
}
info, ok := gpuByBDF[bdf]
if !ok {
setPCIeFallback(&devs[i], boardSerial)
setPCIeFallback(&devs[i])
continue
}
if v := strings.TrimSpace(info.Serial); v != "" {
devs[i].SerialNumber = &v
} else {
setPCIeFallbackSerial(&devs[i], boardSerial)
}
if v := strings.TrimSpace(info.VBIOS); v != "" {
devs[i].Firmware = &v
@@ -213,26 +210,11 @@ func isNVIDIADevice(dev schema.HardwarePCIeDevice) bool {
return false
}
func setPCIeFallback(dev *schema.HardwarePCIeDevice, boardSerial string) {
setPCIeFallbackSerial(dev, boardSerial)
func setPCIeFallback(dev *schema.HardwarePCIeDevice) {
status := statusUnknown
dev.Status = &status
}
func setPCIeFallbackSerial(dev *schema.HardwarePCIeDevice, boardSerial string) {
if strings.TrimSpace(boardSerial) == "" || dev.SerialNumber != nil {
return
}
slot := "unknown"
if dev.BDF != nil && strings.TrimSpace(*dev.BDF) != "" {
slot = strings.TrimSpace(*dev.BDF)
} else if dev.Slot != nil && strings.TrimSpace(*dev.Slot) != "" {
slot = strings.TrimSpace(*dev.Slot)
}
fb := fmt.Sprintf("%s-PCIE-%s", boardSerial, slot)
dev.SerialNumber = &fb
}
func injectNVIDIATelemetry(dev *schema.HardwarePCIeDevice, info nvidiaGPUInfo) {
if info.TemperatureC != nil {
dev.TemperatureC = info.TemperatureC


@@ -73,7 +73,7 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
},
}
out := enrichPCIeWithNVIDIAData(devices, byBDF, "BOARD-001", true)
out := enrichPCIeWithNVIDIAData(devices, byBDF, true)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-ABC" {
t.Fatalf("serial: got %v", out[0].SerialNumber)
}
@@ -103,9 +103,9 @@ func TestEnrichPCIeWithNVIDIAData_driverMissingFallback(t *testing.T) {
},
}
out := enrichPCIeWithNVIDIAData(devices, nil, "BOARD-123", false)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "BOARD-123-PCIE-0000:17:00.0" {
t.Fatalf("fallback serial: got %v", out[0].SerialNumber)
out := enrichPCIeWithNVIDIAData(devices, nil, false)
if out[0].SerialNumber != nil {
t.Fatalf("serial should stay nil without source data, got %v", out[0].SerialNumber)
}
if out[0].Status == nil || *out[0].Status != statusUnknown {
t.Fatalf("fallback status: got %v", out[0].Status)


@@ -37,7 +37,7 @@ func parseLspci(output string) []schema.HardwarePCIeDevice {
val := strings.TrimSpace(line[idx+2:])
fields[key] = val
}
if !shouldIncludePCIeDevice(fields["Class"]) {
if !shouldIncludePCIeDevice(fields["Class"], fields["Vendor"], fields["Device"]) {
continue
}
dev := parseLspciDevice(fields)
@@ -46,8 +46,10 @@ func parseLspci(output string) []schema.HardwarePCIeDevice {
return devs
}
func shouldIncludePCIeDevice(class string) bool {
func shouldIncludePCIeDevice(class, vendor, device string) bool {
c := strings.ToLower(strings.TrimSpace(class))
v := strings.ToLower(strings.TrimSpace(vendor))
d := strings.ToLower(strings.TrimSpace(device))
if c == "" {
return true
}
@@ -57,6 +59,8 @@ func shouldIncludePCIeDevice(class string) bool {
"host bridge",
"isa bridge",
"pci bridge",
"performance counter",
"performance counters",
"ram memory",
"system peripheral",
"communication controller",
@@ -66,12 +70,28 @@ func shouldIncludePCIeDevice(class string) bool {
"audio device",
"serial bus controller",
"unassigned class",
"non-essential instrumentation",
}
for _, bad := range excluded {
if strings.Contains(c, bad) {
return false
}
}
if strings.Contains(v, "advanced micro devices") || strings.Contains(v, "[amd]") {
internalAMDPatterns := []string{
"dummy function",
"reserved spp",
"ptdma",
"cryptographic coprocessor pspcpp",
"pspcpp",
}
for _, bad := range internalAMDPatterns {
if strings.Contains(d, bad) {
return false
}
}
}
return true
}
@@ -84,6 +104,7 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
// Slot is the BDF: "0000:00:02.0"
if bdf := fields["Slot"]; bdf != "" {
dev.Slot = &bdf
dev.BDF = &bdf
// parse vendor_id and device_id from sysfs
vendorID, deviceID := readPCIIDs(bdf)
@@ -95,6 +116,8 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
}
if numaNode, ok := readPCINumaNode(bdf); ok {
dev.NUMANode = &numaNode
} else if numaNode, ok := parsePCINumaNode(fields["NUMANode"]); ok {
dev.NUMANode = &numaNode
}
if width, ok := readPCIIntAttribute(bdf, "current_link_width"); ok {
dev.LinkWidth = &width
@@ -162,6 +185,18 @@ func readPCINumaNode(bdf string) (int, bool) {
return value, true
}
func parsePCINumaNode(raw string) (int, bool) {
raw = strings.TrimSpace(raw)
if raw == "" {
return 0, false
}
value, err := strconv.Atoi(raw)
if err != nil || value < 0 {
return 0, false
}
return value, true
}
func readPCIIntAttribute(bdf, attribute string) (int, bool) {
out, err := exec.Command("cat", "/sys/bus/pci/devices/"+bdf+"/"+attribute).Output()
if err != nil {
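
The hunks above filter records that the tests feed in as blank-line-separated `Key:\tValue` blocks. A minimal sketch of that parse-then-filter flow follows, assuming that input shape; `parseRecords` and `keep` are hypothetical names for illustration, not functions from this repository, and the exclusion list is passed in rather than hard-coded.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRecords splits blank-line-separated "Key:\tValue" blocks into maps.
// Only the first colon separates key from value, so BDF values such as
// "0000:65:00.0" survive intact.
func parseRecords(output string) []map[string]string {
	var records []map[string]string
	for _, block := range strings.Split(strings.TrimSpace(output), "\n\n") {
		fields := map[string]string{}
		for _, line := range strings.Split(block, "\n") {
			key, val, ok := strings.Cut(line, ":")
			if !ok {
				continue
			}
			fields[strings.TrimSpace(key)] = strings.TrimSpace(val)
		}
		if len(fields) > 0 {
			records = append(records, fields)
		}
	}
	return records
}

// keep reports whether a record's class avoids every excluded substring.
// Entries in excludedClasses are expected to be lower-case.
func keep(fields map[string]string, excludedClasses []string) bool {
	class := strings.ToLower(fields["Class"])
	for _, bad := range excludedClasses {
		if strings.Contains(class, bad) {
			return false
		}
	}
	return true
}

func main() {
	out := "Slot:\t0000:00:14.0\nClass:\tUSB controller\n\n" +
		"Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\n\n"
	for _, rec := range parseRecords(out) {
		if keep(rec, []string{"usb controller"}) {
			fmt.Println(rec["Slot"]) // prints "0000:65:00.0"
		}
	}
}
```

Matching on substrings rather than exact class names tolerates singular and plural spellings of the same class, at the cost of needing care that a short pattern does not swallow an unrelated class.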


@@ -1,34 +1,49 @@
package collector
import "testing"
import (
"encoding/json"
"strings"
"testing"
)
func TestShouldIncludePCIeDevice(t *testing.T) {
tests := []struct {
class string
want bool
name string
class string
vendor string
device string
want bool
}{
{"USB controller", false},
{"System peripheral", false},
{"Audio device", false},
{"Host bridge", false},
{"PCI bridge", false},
{"SMBus", false},
{"Ethernet controller", true},
{"RAID bus controller", true},
{"Non-Volatile memory controller", true},
{"VGA compatible controller", true},
{name: "usb", class: "USB controller", want: false},
{name: "system peripheral", class: "System peripheral", want: false},
{name: "audio", class: "Audio device", want: false},
{name: "host bridge", class: "Host bridge", want: false},
{name: "pci bridge", class: "PCI bridge", want: false},
{name: "smbus", class: "SMBus", want: false},
{name: "perf", class: "Performance counters", want: false},
{name: "non essential instrumentation", class: "Non-Essential Instrumentation", want: false},
{name: "amd ptdma", class: "Encryption controller", vendor: "Advanced Micro Devices, Inc. [AMD]", device: "Starship/Matisse PTDMA", want: false},
{name: "amd pspcpp", class: "Encryption controller", vendor: "Advanced Micro Devices, Inc. [AMD]", device: "Starship/Matisse Cryptographic Coprocessor PSPCPP", want: false},
{name: "ethernet", class: "Ethernet controller", want: true},
{name: "raid", class: "RAID bus controller", want: true},
{name: "nvme", class: "Non-Volatile memory controller", want: true},
{name: "vga", class: "VGA compatible controller", want: true},
{name: "other encryption controller", class: "Encryption controller", vendor: "Intel Corporation", device: "QuickAssist", want: true},
}
for _, tt := range tests {
got := shouldIncludePCIeDevice(tt.class)
if got != tt.want {
t.Fatalf("class %q include=%v want %v", tt.class, got, tt.want)
}
t.Run(tt.name, func(t *testing.T) {
got := shouldIncludePCIeDevice(tt.class, tt.vendor, tt.device)
if got != tt.want {
t.Fatalf("class=%q vendor=%q device=%q include=%v want %v", tt.class, tt.vendor, tt.device, got, tt.want)
}
})
}
}
func TestParseLspci_filtersExcludedClasses(t *testing.T) {
input := "Slot:\t0000:00:14.0\nClass:\tUSB controller\nVendor:\tIntel Corporation\nDevice:\tUSB 3.0\n\n" +
"Slot:\t0000:00:18.0\nClass:\tNon-Essential Instrumentation\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PCIe Dummy Function\n\n" +
"Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
devs := parseLspci(input)
@@ -38,6 +53,56 @@ func TestParseLspci_filtersExcludedClasses(t *testing.T) {
if devs[0].DeviceClass == nil || *devs[0].DeviceClass != "VideoController" {
t.Fatalf("unexpected remaining class: %v", devs[0].DeviceClass)
}
if devs[0].Slot == nil || *devs[0].Slot != "0000:65:00.0" {
t.Fatalf("slot: got %v", devs[0].Slot)
}
if devs[0].BDF == nil || *devs[0].BDF != "0000:65:00.0" {
t.Fatalf("bdf: got %v", devs[0].BDF)
}
}
func TestParseLspci_filtersAMDChipsetNoise(t *testing.T) {
input := "" +
"Slot:\t0000:1a:00.0\nClass:\tNon-Essential Instrumentation\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PCIe Dummy Function\n\n" +
"Slot:\t0000:1a:00.2\nClass:\tEncryption controller\nVendor:\tAdvanced Micro Devices, Inc. [AMD]\nDevice:\tStarship/Matisse PTDMA\n\n" +
"Slot:\t0000:05:00.0\nClass:\tEthernet controller\nVendor:\tMellanox Technologies\nDevice:\tMT28908 Family [ConnectX-6]\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 remaining device, got %d", len(devs))
}
if devs[0].Model == nil || *devs[0].Model != "MT28908 Family [ConnectX-6]" {
t.Fatalf("unexpected remaining device: %+v", devs[0])
}
}
func TestPCIeJSONUsesSlotNotBDF(t *testing.T) {
input := "Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
devs := parseLspci(input)
data, err := json.Marshal(devs[0])
if err != nil {
t.Fatalf("marshal: %v", err)
}
text := string(data)
if !strings.Contains(text, `"slot":"0000:65:00.0"`) {
t.Fatalf("json missing slot: %s", text)
}
if strings.Contains(text, `"bdf"`) {
t.Fatalf("json should not emit bdf: %s", text)
}
}
func TestParseLspciUsesNUMANodeFieldWhenSysfsUnavailable(t *testing.T) {
input := "Slot:\t0000:65:00.0\nClass:\tEthernet controller\nVendor:\tIntel Corporation\nDevice:\tX710\nNUMANode:\t1\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 device, got %d", len(devs))
}
if devs[0].NUMANode == nil || *devs[0].NUMANode != 1 {
t.Fatalf("numa_node=%v want 1", devs[0].NUMANode)
}
}
func TestNormalizePCILinkSpeed(t *testing.T) {


@@ -0,0 +1,123 @@
package collector
import (
"bee/audit/internal/schema"
"log/slog"
"os"
"os/exec"
"path/filepath"
"strings"
)
var (
queryPCILSPCIDetail = func(bdf string) (string, error) {
out, err := exec.Command("lspci", "-vv", "-s", bdf).Output()
if err != nil {
return "", err
}
return string(out), nil
}
readPCIVPDFile = func(bdf string) ([]byte, error) {
return os.ReadFile(filepath.Join("/sys/bus/pci/devices", bdf, "vpd"))
}
)
func enrichPCIeWithPCISerials(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
enriched := 0
for i := range devs {
if !shouldProbePCIeSerial(devs[i]) {
continue
}
bdf := normalizePCIeBDF(*devs[i].BDF)
if bdf == "" {
continue
}
if serial := queryPCIDeviceSerial(bdf); serial != "" {
devs[i].SerialNumber = &serial
enriched++
}
}
if enriched > 0 {
slog.Info("pcie: serials enriched", "count", enriched)
}
return devs
}
func shouldProbePCIeSerial(dev schema.HardwarePCIeDevice) bool {
if dev.BDF == nil || dev.SerialNumber != nil {
return false
}
if dev.DeviceClass == nil {
return false
}
class := strings.TrimSpace(*dev.DeviceClass)
return isNICClass(class) || isGPUClass(class)
}
func queryPCIDeviceSerial(bdf string) string {
if out, err := queryPCILSPCIDetail(bdf); err == nil {
if serial := parseLSPCIDetailSerial(out); serial != "" {
return serial
}
}
if raw, err := readPCIVPDFile(bdf); err == nil {
return parsePCIVPDSerial(raw)
}
return ""
}
func parseLSPCIDetailSerial(raw string) string {
for _, line := range strings.Split(raw, "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
lower := strings.ToLower(line)
if !strings.Contains(lower, "serial number:") {
continue
}
idx := strings.Index(line, ":")
if idx < 0 {
continue
}
if serial := strings.TrimSpace(line[idx+1:]); serial != "" {
return serial
}
}
return ""
}
func parsePCIVPDSerial(raw []byte) string {
for i := 0; i+3 < len(raw); i++ {
if raw[i] != 'S' || raw[i+1] != 'N' {
continue
}
length := int(raw[i+2])
if length <= 0 || length > 64 || i+3+length > len(raw) {
continue
}
value := strings.TrimSpace(strings.Trim(string(raw[i+3:i+3+length]), "\x00"))
if !looksLikeSerial(value) {
continue
}
return value
}
return ""
}
func looksLikeSerial(value string) bool {
if len(value) < 4 {
return false
}
hasAlphaNum := false
for _, r := range value {
switch {
case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
hasAlphaNum = true
case strings.ContainsRune(" -_./:", r):
default:
return false
}
}
return hasAlphaNum
}
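
`parsePCIVPDSerial` above scans for the bytes `SN` anywhere in the buffer and then sanity-checks the candidate value. An alternative, sketched below, walks the keyword fields in order (2-byte keyword, 1-byte length, data). This is only a sketch under the assumption that the input is the payload of the VPD read-only section with its resource tag header already stripped; `findVPDKeyword` is a hypothetical helper, not part of this change.

```go
package main

import "fmt"

// findVPDKeyword walks keyword fields laid out as: 2-byte keyword,
// 1-byte length, then that many data bytes. It assumes payload holds
// only keyword fields (no resource tag header in front of them).
func findVPDKeyword(payload []byte, keyword string) (string, bool) {
	if len(keyword) != 2 {
		return "", false
	}
	for i := 0; i+3 <= len(payload); {
		length := int(payload[i+2])
		end := i + 3 + length
		if end > len(payload) {
			return "", false // truncated field, stop rather than guess
		}
		if payload[i] == keyword[0] && payload[i+1] == keyword[1] {
			return string(payload[i+3 : end]), true
		}
		i = end // jump to the next keyword field
	}
	return "", false
}

func main() {
	payload := []byte{'P', 'N', 3, 'A', 'B', 'C', 'S', 'N', 4, 'X', '1', '2', '3'}
	serial, ok := findVPDKeyword(payload, "SN")
	fmt.Println(serial, ok) // prints "X123 true"
}
```

Walking field by field avoids false hits when `SN` happens to appear inside another field's data, but it depends on locating the section start correctly, which the byte scan does not need.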


@@ -0,0 +1,47 @@
package collector
import (
"bee/audit/internal/schema"
"fmt"
"testing"
)
func TestEnrichPCIeWithPCISerialsAddsGPUFallback(t *testing.T) {
origDetail := queryPCILSPCIDetail
origVPD := readPCIVPDFile
t.Cleanup(func() {
queryPCILSPCIDetail = origDetail
readPCIVPDFile = origVPD
})
queryPCILSPCIDetail = func(bdf string) (string, error) {
if bdf != "0000:11:00.0" {
t.Fatalf("unexpected bdf: %s", bdf)
}
return "Serial number: GPU-SN-12345\n", nil
}
readPCIVPDFile = func(string) ([]byte, error) {
return nil, fmt.Errorf("no vpd needed")
}
class := "DisplayController"
bdf := "0000:11:00.0"
devs := []schema.HardwarePCIeDevice{{
DeviceClass: &class,
BDF: &bdf,
}}
out := enrichPCIeWithPCISerials(devs)
if out[0].SerialNumber == nil || *out[0].SerialNumber != "GPU-SN-12345" {
t.Fatalf("serial=%v want GPU-SN-12345", out[0].SerialNumber)
}
}
func TestShouldProbePCIeSerialSkipsNonGPUOrNIC(t *testing.T) {
class := "StorageController"
bdf := "0000:19:00.0"
dev := schema.HardwarePCIeDevice{DeviceClass: &class, BDF: &bdf}
if shouldProbePCIeSerial(dev) {
t.Fatal("unexpected probe for storage controller")
}
}


@@ -5,21 +5,31 @@ import (
"log/slog"
"os/exec"
"regexp"
"sort"
"strconv"
"strings"
)
func collectPSUs() []schema.HardwarePowerSupply {
// ipmitool requires /dev/ipmi0 — not available on non-server hardware
out, err := exec.Command("ipmitool", "fru", "print").Output()
if err != nil {
var psus []schema.HardwarePowerSupply
if out, err := exec.Command("ipmitool", "fru", "print").Output(); err == nil {
psus = parseFRU(string(out))
} else {
slog.Info("psu: fru unavailable", "err", err)
}
sdrData := map[int]psuSDR{}
if sdrOut, err := exec.Command("ipmitool", "sdr").Output(); err == nil {
sdrData = parsePSUSDR(string(sdrOut))
if len(psus) == 0 {
psus = synthesizePSUsFromSDR(sdrData)
} else {
mergePSUSDR(psus, sdrData)
}
} else if len(psus) == 0 {
slog.Info("psu: ipmitool unavailable, skipping", "err", err)
return nil
}
psus := parseFRU(string(out))
if sdrOut, err := exec.Command("ipmitool", "sdr").Output(); err == nil {
mergePSUSDR(psus, parsePSUSDR(string(sdrOut)))
}
slog.Info("psu: collected", "count", len(psus))
return psus
}
@@ -79,9 +89,7 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
// Only process PSU FRU records
headerLower := strings.ToLower(header)
if !strings.Contains(headerLower, "psu") &&
!strings.Contains(headerLower, "power supply") &&
!strings.Contains(headerLower, "power_supply") {
if !isPSUHeader(headerLower) {
return schema.HardwarePowerSupply{}, false
}
@@ -89,21 +97,24 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
psu := schema.HardwarePowerSupply{Present: &present}
slotStr := strconv.Itoa(slotIdx)
if slot, ok := parsePSUSlot(header); ok && slot > 0 {
slotStr = strconv.Itoa(slot - 1)
}
psu.Slot = &slotStr
if v := cleanDMIValue(fields["Board Product"]); v != "" {
if v := firstNonEmptyField(fields, "Board Product", "Product Name", "Product Part Number"); v != "" {
psu.Model = &v
}
if v := cleanDMIValue(fields["Board Mfg"]); v != "" {
if v := firstNonEmptyField(fields, "Board Mfg", "Product Manufacturer", "Product Manufacturer Name"); v != "" {
psu.Vendor = &v
}
if v := cleanDMIValue(fields["Board Serial"]); v != "" {
if v := firstNonEmptyField(fields, "Board Serial", "Product Serial", "Product Serial Number"); v != "" {
psu.SerialNumber = &v
}
if v := cleanDMIValue(fields["Board Part Number"]); v != "" {
if v := firstNonEmptyField(fields, "Board Part Number", "Product Part Number", "Part Number"); v != "" {
psu.PartNumber = &v
}
if v := cleanDMIValue(fields["Board Extra"]); v != "" {
if v := firstNonEmptyField(fields, "Board Extra", "Product Version", "Board Version"); v != "" {
psu.Firmware = &v
}
@@ -120,6 +131,23 @@ func parseFRUBlock(block string, slotIdx int) (schema.HardwarePowerSupply, bool)
return psu, true
}
func isPSUHeader(headerLower string) bool {
return strings.Contains(headerLower, "psu") ||
strings.Contains(headerLower, "pws") ||
strings.Contains(headerLower, "power supply") ||
strings.Contains(headerLower, "power_supply") ||
strings.Contains(headerLower, "power module")
}
func firstNonEmptyField(fields map[string]string, keys ...string) string {
for _, key := range keys {
if value := cleanDMIValue(fields[key]); value != "" {
return value
}
}
return ""
}
type psuSDR struct {
slot int
status string
@@ -131,7 +159,13 @@ type psuSDR struct {
healthPct *float64
}
var psuSlotRe = regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b|\bps\s*([0-9]+)\b`)
var psuSlotPatterns = []*regexp.Regexp{
regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bps\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bpws\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bpower\s*supply(?:\s*bay)?\s*([0-9]+)\b`),
regexp.MustCompile(`(?i)\bbay\s*([0-9]+)\b`),
}
func parsePSUSDR(raw string) map[int]psuSDR {
out := map[int]psuSDR{}
@@ -164,6 +198,8 @@ func parsePSUSDR(raw string) map[int]psuSDR {
entry.inputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "output power"):
entry.outputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "power supply bay"), strings.Contains(lowerName, "psu bay"):
entry.outputPowerW = parseFloatPtr(value)
case strings.Contains(lowerName, "input voltage"), strings.Contains(lowerName, "ac input"):
entry.inputVoltage = parseFloatPtr(value)
case strings.Contains(lowerName, "temp"):
@@ -176,6 +212,49 @@ func parsePSUSDR(raw string) map[int]psuSDR {
return out
}
func synthesizePSUsFromSDR(sdr map[int]psuSDR) []schema.HardwarePowerSupply {
if len(sdr) == 0 {
return nil
}
slots := make([]int, 0, len(sdr))
for slot := range sdr {
slots = append(slots, slot)
}
sort.Ints(slots)
out := make([]schema.HardwarePowerSupply, 0, len(slots))
for _, slot := range slots {
entry := sdr[slot]
present := true
status := entry.status
if status == "" {
status = statusUnknown
}
slotStr := strconv.Itoa(slot - 1)
model := "PSU"
psu := schema.HardwarePowerSupply{
HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status},
Slot: &slotStr,
Present: &present,
Model: &model,
InputPowerW: entry.inputPowerW,
OutputPowerW: entry.outputPowerW,
InputVoltage: entry.inputVoltage,
TemperatureC: entry.temperatureC,
}
if entry.healthPct != nil {
psu.LifeRemainingPct = entry.healthPct
lifeUsed := 100 - *entry.healthPct
psu.LifeUsedPct = &lifeUsed
}
if entry.reason != "" {
psu.ErrorDescription = &entry.reason
}
out = append(out, psu)
}
return out
}
func mergePSUSDR(psus []schema.HardwarePowerSupply, sdr map[int]psuSDR) {
for i := range psus {
slotIdx, err := strconv.Atoi(derefPSUSlot(psus[i].Slot))
@@ -231,17 +310,19 @@ func splitSDRFields(line string) []string {
}
func parsePSUSlot(name string) (int, bool) {
m := psuSlotRe.FindStringSubmatch(strings.ToLower(name))
if len(m) == 0 {
return 0, false
}
for _, group := range m[1:] {
if group == "" {
for _, re := range psuSlotPatterns {
m := re.FindStringSubmatch(strings.ToLower(name))
if len(m) == 0 {
continue
}
n, err := strconv.Atoi(group)
if err == nil && n > 0 {
return n, true
for _, group := range m[1:] {
if group == "" {
continue
}
n, err := strconv.Atoi(group)
if err == nil && n > 0 {
return n, true
}
}
}
return 0, false
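
The change above replaces one alternation regex with an ordered list of patterns, where the first pattern that yields a positive number wins. A minimal sketch of that lookup, using a subset of the patterns from the diff and converting the 1-based sensor number to the 0-based slot string as `detectPSUSlot` does; `slotFromName` is a hypothetical name used only here.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// slotPatterns is tried in order; each pattern has exactly one capture group.
var slotPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`),
	regexp.MustCompile(`(?i)\bpws\s*([0-9]+)\b`),
	regexp.MustCompile(`(?i)\bpower\s*supply(?:\s*bay)?\s*([0-9]+)\b`),
}

// slotFromName returns the 0-based slot for a sensor or FRU name.
func slotFromName(name string) (string, bool) {
	for _, re := range slotPatterns {
		m := re.FindStringSubmatch(name)
		if m == nil {
			continue
		}
		n, err := strconv.Atoi(m[1])
		if err == nil && n > 0 {
			return strconv.Itoa(n - 1), true
		}
	}
	return "", false
}

func main() {
	for _, name := range []string{"PWS1 Status", "Power Supply Bay 8", "Fan 3"} {
		slot, ok := slotFromName(name)
		fmt.Printf("%q -> %q %v\n", name, slot, ok)
	}
}
```

With one capture group per pattern there is no need to skip empty groups, which is the bookkeeping the single alternation regex required.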


@@ -38,3 +38,54 @@ PS2 Input Power | 0 Watts | cr
t.Fatalf("ps2 status=%q", got[2].status)
}
}
func TestParsePSUSlotVendorVariants(t *testing.T) {
t.Parallel()
tests := []struct {
name string
want int
}{
{name: "PWS1 Status", want: 1},
{name: "Power Supply Bay 8", want: 8},
{name: "PS 6 Input Power", want: 6},
}
for _, tt := range tests {
got, ok := parsePSUSlot(tt.name)
if !ok || got != tt.want {
t.Fatalf("parsePSUSlot(%q)=(%d,%v) want (%d,true)", tt.name, got, ok, tt.want)
}
}
}
func TestSynthesizePSUsFromSDR(t *testing.T) {
t.Parallel()
health := 97.0
outputPower := 915.0
got := synthesizePSUsFromSDR(map[int]psuSDR{
1: {
slot: 1,
status: statusOK,
outputPowerW: &outputPower,
healthPct: &health,
},
})
if len(got) != 1 {
t.Fatalf("len(got)=%d want 1", len(got))
}
if got[0].Slot == nil || *got[0].Slot != "0" {
t.Fatalf("slot=%v want 0", got[0].Slot)
}
if got[0].OutputPowerW == nil || *got[0].OutputPowerW != 915 {
t.Fatalf("output power=%v", got[0].OutputPowerW)
}
if got[0].LifeRemainingPct == nil || *got[0].LifeRemainingPct != 97 {
t.Fatalf("life remaining=%v", got[0].LifeRemainingPct)
}
if got[0].LifeUsedPct == nil || *got[0].LifeUsedPct != 3 {
t.Fatalf("life used=%v", got[0].LifeUsedPct)
}
}


@@ -113,19 +113,8 @@ func isLikelyPSUTemp(chip, feature string) bool {
func detectPSUSlot(parts ...string) (string, bool) {
for _, part := range parts {
lower := strings.ToLower(part)
matches := psuSlotRe.FindStringSubmatch(lower)
if len(matches) == 0 {
continue
}
for _, group := range matches[1:] {
if group == "" {
continue
}
value, err := strconv.Atoi(group)
if err == nil && value > 0 {
return strconv.Itoa(value - 1), true
}
if value, ok := parsePSUSlot(part); ok && value > 0 {
return strconv.Itoa(value - 1), true
}
}
return "", false


@@ -5,11 +5,13 @@ import (
"encoding/json"
"log/slog"
"os/exec"
"path/filepath"
"strconv"
"strings"
)
func collectStorage() []schema.HardwareStorage {
devs := lsblkDevices()
devs := discoverStorageDevices()
result := make([]schema.HardwareStorage, 0, len(devs))
for _, dev := range devs {
var s schema.HardwareStorage
@@ -39,6 +41,47 @@ type lsblkRoot struct {
Blockdevices []lsblkDevice `json:"blockdevices"`
}
type nvmeListRoot struct {
Devices []nvmeListDevice `json:"Devices"`
}
type nvmeListDevice struct {
DevicePath string `json:"DevicePath"`
ModelNumber string `json:"ModelNumber"`
SerialNumber string `json:"SerialNumber"`
Firmware string `json:"Firmware"`
PhysicalSize int64 `json:"PhysicalSize"`
}
func discoverStorageDevices() []lsblkDevice {
merged := map[string]lsblkDevice{}
for _, dev := range lsblkDevices() {
if dev.Name == "" {
continue
}
merged[dev.Name] = dev
}
for _, dev := range nvmeListDevices() {
if dev.Name == "" {
continue
}
current := merged[dev.Name]
merged[dev.Name] = mergeStorageDevice(current, dev)
}
disks := make([]lsblkDevice, 0, len(merged))
for _, dev := range merged {
if dev.Type == "" {
dev.Type = "disk"
}
if dev.Type != "disk" {
continue
}
disks = append(disks, dev)
}
return disks
}
func lsblkDevices() []lsblkDevice {
out, err := exec.Command("lsblk", "-J", "-d",
"-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL").Output()
@@ -60,6 +103,59 @@ func lsblkDevices() []lsblkDevice {
return disks
}
func nvmeListDevices() []lsblkDevice {
out, err := exec.Command("nvme", "list", "-o", "json").Output()
if err != nil {
return nil
}
var root nvmeListRoot
if err := json.Unmarshal(out, &root); err != nil {
slog.Warn("storage: nvme list parse failed", "err", err)
return nil
}
devices := make([]lsblkDevice, 0, len(root.Devices))
for _, dev := range root.Devices {
name := filepath.Base(strings.TrimSpace(dev.DevicePath))
if name == "" || name == "." || name == "/" { // filepath.Base("") returns ".", never ""
continue
}
devices = append(devices, lsblkDevice{
Name: name,
Type: "disk",
Size: strconv.FormatInt(dev.PhysicalSize, 10),
Serial: strings.TrimSpace(dev.SerialNumber),
Model: strings.TrimSpace(dev.ModelNumber),
Tran: "nvme",
})
}
return devices
}
func mergeStorageDevice(existing, incoming lsblkDevice) lsblkDevice {
if existing.Name == "" {
return incoming
}
if existing.Type == "" {
existing.Type = incoming.Type
}
if strings.TrimSpace(existing.Size) == "" {
existing.Size = incoming.Size
}
if strings.TrimSpace(existing.Serial) == "" {
existing.Serial = incoming.Serial
}
if strings.TrimSpace(existing.Model) == "" {
existing.Model = incoming.Model
}
if strings.TrimSpace(existing.Tran) == "" {
existing.Tran = incoming.Tran
}
if strings.TrimSpace(existing.Hctl) == "" {
existing.Hctl = incoming.Hctl
}
return existing
}
// smartctlInfo is the subset of smartctl -j -a output we care about.
type smartctlInfo struct {
ModelFamily string `json:"model_family"`
@@ -94,6 +190,7 @@ type smartctlInfo struct {
func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
present := true
s := schema.HardwareStorage{Present: &present}
s.Telemetry = map[string]any{"linux_device": "/dev/" + dev.Name}
tran := strings.ToLower(dev.Tran)
devPath := "/dev/" + dev.Name
@@ -252,9 +349,22 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
Present: &present,
Type: &devType,
Interface: &iface,
Telemetry: map[string]any{"linux_device": "/dev/" + dev.Name},
}
devPath := "/dev/" + dev.Name
if v := cleanDMIValue(strings.TrimSpace(dev.Model)); v != "" {
s.Model = &v
}
if v := cleanDMIValue(strings.TrimSpace(dev.Serial)); v != "" {
s.SerialNumber = &v
}
if size := parseStorageBytes(dev.Size); size > 0 {
gb := int(size / 1_000_000_000)
if gb > 0 {
s.SizeGB = &gb
}
}
// id-ctrl: model, serial, firmware, capacity
if out, err := exec.Command("nvme", "id-ctrl", devPath, "-o", "json").Output(); err == nil {
@@ -335,6 +445,14 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
return s
}
func parseStorageBytes(raw string) int64 {
value, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
if err == nil && value > 0 {
return value
}
return 0
}
func nvmeDataUnitsToBytes(units int64) int64 {
if units <= 0 {
return 0
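
The storage hunks above decode `nvme list -o json` into the `nvmeListRoot` shape and derive the kernel device name from `DevicePath`. A minimal sketch of that decoding step, assuming the JSON field names shown in the diff; the sample document in `main` is invented for illustration and `deviceNames` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
	"strings"
)

// nvmeList mirrors the subset of fields used in the diff above.
type nvmeList struct {
	Devices []struct {
		DevicePath   string `json:"DevicePath"`
		ModelNumber  string `json:"ModelNumber"`
		SerialNumber string `json:"SerialNumber"`
		PhysicalSize int64  `json:"PhysicalSize"`
	} `json:"Devices"`
}

// deviceNames returns kernel device names such as "nvme0n1" found in raw.
func deviceNames(raw []byte) ([]string, error) {
	var root nvmeList
	if err := json.Unmarshal(raw, &root); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(root.Devices))
	for _, dev := range root.Devices {
		path := strings.TrimSpace(dev.DevicePath)
		if path == "" {
			continue // check before Base: filepath.Base("") returns "."
		}
		names = append(names, filepath.Base(path))
	}
	return names, nil
}

func main() {
	// Invented sample input, not captured from a real system.
	raw := []byte(`{"Devices":[{"DevicePath":"/dev/nvme0n1","ModelNumber":"Example","SerialNumber":"SN123","PhysicalSize":1024}]}`)
	names, err := deviceNames(raw)
	fmt.Println(names, err)
}
```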


@@ -0,0 +1,33 @@
package collector
import "testing"
func TestMergeStorageDevicePrefersNonEmptyFields(t *testing.T) {
t.Parallel()
got := mergeStorageDevice(
lsblkDevice{Name: "nvme0n1", Type: "disk", Tran: "nvme"},
lsblkDevice{Name: "nvme0n1", Type: "disk", Size: "1024", Serial: "SN123", Model: "Kioxia"},
)
if got.Serial != "SN123" {
t.Fatalf("serial=%q want SN123", got.Serial)
}
if got.Model != "Kioxia" {
t.Fatalf("model=%q want Kioxia", got.Model)
}
if got.Size != "1024" {
t.Fatalf("size=%q want 1024", got.Size)
}
}
func TestParseStorageBytes(t *testing.T) {
t.Parallel()
if got := parseStorageBytes(" 2048 "); got != 2048 {
t.Fatalf("parseStorageBytes=%d want 2048", got)
}
if got := parseStorageBytes("1.92 TB"); got != 0 {
t.Fatalf("parseStorageBytes invalid=%d want 0", got)
}
}


@@ -9,8 +9,50 @@ import (
"strings"
)
var exportExecCommand = exec.Command
func formatMountTargetError(target RemovableTarget, raw string, err error) error {
msg := strings.TrimSpace(raw)
fstype := strings.ToLower(strings.TrimSpace(target.FSType))
if fstype == "exfat" && strings.Contains(strings.ToLower(msg), "unknown filesystem type 'exfat'") {
return fmt.Errorf("mount %s: exFAT support is missing in this ISO build: %w", target.Device, err)
}
if msg == "" {
return err
}
return fmt.Errorf("%s: %w", msg, err)
}
func removableTargetReadOnly(fields map[string]string) bool {
if fields["RO"] == "1" {
return true
}
switch strings.ToLower(strings.TrimSpace(fields["FSTYPE"])) {
case "iso9660", "squashfs":
return true
default:
return false
}
}
func ensureWritableMountpoint(mountpoint string) error {
probe, err := os.CreateTemp(mountpoint, ".bee-write-test-*")
if err != nil {
return fmt.Errorf("target filesystem is not writable: %w", err)
}
name := probe.Name()
if closeErr := probe.Close(); closeErr != nil {
_ = os.Remove(name)
return closeErr
}
if err := os.Remove(name); err != nil {
return err
}
return nil
}
func (s *System) ListRemovableTargets() ([]RemovableTarget, error) {
raw, err := exec.Command("lsblk", "-P", "-o", "NAME,TYPE,PKNAME,RM,FSTYPE,MOUNTPOINT,SIZE,LABEL,MODEL").Output()
raw, err := exportExecCommand("lsblk", "-P", "-o", "NAME,TYPE,PKNAME,RM,RO,FSTYPE,MOUNTPOINT,SIZE,LABEL,MODEL").Output()
if err != nil {
return nil, err
}
@@ -34,7 +76,7 @@ func (s *System) ListRemovableTargets() ([]RemovableTarget, error) {
}
}
}
if !removable || fields["FSTYPE"] == "" {
if !removable || fields["FSTYPE"] == "" || removableTargetReadOnly(fields) {
continue
}
@@ -52,7 +94,7 @@ func (s *System) ListRemovableTargets() ([]RemovableTarget, error) {
return out, nil
}
func (s *System) ExportFileToTarget(src string, target RemovableTarget) (string, error) {
func (s *System) ExportFileToTarget(src string, target RemovableTarget) (dst string, retErr error) {
if src == "" || target.Device == "" {
return "", fmt.Errorf("source and target are required")
}
@@ -62,20 +104,43 @@ func (s *System) ExportFileToTarget(src string, target RemovableTarget) (string,
mountpoint := strings.TrimSpace(target.Mountpoint)
mountedHere := false
mounted := mountpoint != ""
if mountpoint == "" {
mountpoint = filepath.Join("/tmp", "bee-export-"+filepath.Base(target.Device))
if err := os.MkdirAll(mountpoint, 0755); err != nil {
return "", err
}
if raw, err := exec.Command("mount", target.Device, mountpoint).CombinedOutput(); err != nil {
if raw, err := exportExecCommand("mount", target.Device, mountpoint).CombinedOutput(); err != nil {
_ = os.Remove(mountpoint)
return string(raw), err
return "", formatMountTargetError(target, string(raw), err)
}
mountedHere = true
mounted = true
}
defer func() {
if !mounted {
return
}
_ = exportExecCommand("sync").Run()
if raw, err := exportExecCommand("umount", mountpoint).CombinedOutput(); err != nil && retErr == nil {
msg := strings.TrimSpace(string(raw))
if msg == "" {
retErr = err
} else {
retErr = fmt.Errorf("%s: %w", msg, err)
}
}
if mountedHere {
_ = os.Remove(mountpoint)
}
}()
if err := ensureWritableMountpoint(mountpoint); err != nil {
return "", err
}
filename := filepath.Base(src)
dst := filepath.Join(mountpoint, filename)
dst = filepath.Join(mountpoint, filename)
data, err := os.ReadFile(src)
if err != nil {
return "", err
@@ -83,12 +148,6 @@ func (s *System) ExportFileToTarget(src string, target RemovableTarget) (string,
if err := os.WriteFile(dst, data, 0644); err != nil {
return "", err
}
_ = exec.Command("sync").Run()
if mountedHere {
_ = exec.Command("umount", mountpoint).Run()
_ = os.Remove(mountpoint)
}
return dst, nil
}


@@ -0,0 +1,112 @@
package platform
import (
"os"
"os/exec"
"path/filepath"
"strings"
"testing"
)
func TestExportFileToTargetUnmountsExistingMountpoint(t *testing.T) {
tmp := t.TempDir()
src := filepath.Join(tmp, "bundle.tar.gz")
mountpoint := filepath.Join(tmp, "mnt")
if err := os.MkdirAll(mountpoint, 0755); err != nil {
t.Fatalf("mkdir mountpoint: %v", err)
}
if err := os.WriteFile(src, []byte("bundle"), 0644); err != nil {
t.Fatalf("write src: %v", err)
}
var calls [][]string
oldExec := exportExecCommand
exportExecCommand = func(name string, args ...string) *exec.Cmd {
calls = append(calls, append([]string{name}, args...))
return exec.Command("sh", "-c", "exit 0")
}
t.Cleanup(func() { exportExecCommand = oldExec })
s := &System{}
dst, err := s.ExportFileToTarget(src, RemovableTarget{
Device: "/dev/sdb1",
Mountpoint: mountpoint,
})
if err != nil {
t.Fatalf("ExportFileToTarget error: %v", err)
}
if got, want := dst, filepath.Join(mountpoint, "bundle.tar.gz"); got != want {
t.Fatalf("dst=%q want %q", got, want)
}
if _, err := os.Stat(filepath.Join(mountpoint, "bundle.tar.gz")); err != nil {
t.Fatalf("exported file missing: %v", err)
}
foundUmount := false
for _, call := range calls {
if len(call) == 2 && call[0] == "umount" && call[1] == mountpoint {
foundUmount = true
break
}
}
if !foundUmount {
t.Fatalf("expected umount %q call, got %#v", mountpoint, calls)
}
}
func TestExportFileToTargetRejectsNonWritableMountpoint(t *testing.T) {
tmp := t.TempDir()
src := filepath.Join(tmp, "bundle.tar.gz")
mountpoint := filepath.Join(tmp, "mnt")
if err := os.MkdirAll(mountpoint, 0755); err != nil {
t.Fatalf("mkdir mountpoint: %v", err)
}
if err := os.WriteFile(src, []byte("bundle"), 0644); err != nil {
t.Fatalf("write src: %v", err)
}
if err := os.Chmod(mountpoint, 0555); err != nil {
t.Fatalf("chmod mountpoint: %v", err)
}
oldExec := exportExecCommand
exportExecCommand = func(name string, args ...string) *exec.Cmd {
return exec.Command("sh", "-c", "exit 0")
}
t.Cleanup(func() { exportExecCommand = oldExec })
s := &System{}
_, err := s.ExportFileToTarget(src, RemovableTarget{
Device: "/dev/sdb1",
Mountpoint: mountpoint,
})
if err == nil {
t.Fatal("expected error for non-writable mountpoint")
}
if !strings.Contains(err.Error(), "target filesystem is not writable") {
t.Fatalf("err=%q want writable message", err)
}
}
func TestListRemovableTargetsSkipsReadOnlyMedia(t *testing.T) {
oldExec := exportExecCommand
lsblkOut := `NAME="sda1" TYPE="part" PKNAME="sda" RM="1" RO="1" FSTYPE="iso9660" MOUNTPOINT="/run/live/medium" SIZE="3.7G" LABEL="BEE" MODEL=""
NAME="sdb1" TYPE="part" PKNAME="sdb" RM="1" RO="0" FSTYPE="vfat" MOUNTPOINT="/media/bee/USB" SIZE="29.8G" LABEL="USB" MODEL=""`
exportExecCommand = func(name string, args ...string) *exec.Cmd {
cmd := exec.Command("sh", "-c", "printf '%s\n' \"$LSBLK_OUT\"")
cmd.Env = append(os.Environ(), "LSBLK_OUT="+lsblkOut)
return cmd
}
t.Cleanup(func() { exportExecCommand = oldExec })
s := &System{}
targets, err := s.ListRemovableTargets()
if err != nil {
t.Fatalf("ListRemovableTargets error: %v", err)
}
if len(targets) != 1 {
t.Fatalf("len(targets)=%d want 1 (%+v)", len(targets), targets)
}
if got := targets[0].Device; got != "/dev/sdb1" {
t.Fatalf("device=%q want /dev/sdb1", got)
}
}


@@ -0,0 +1,738 @@
package platform
import (
"bytes"
"fmt"
"math"
"os"
"os/exec"
"strconv"
"strings"
"time"
)
// GPUMetricRow is one telemetry sample from nvidia-smi during a stress test.
type GPUMetricRow struct {
ElapsedSec float64
GPUIndex int
TempC float64
UsagePct float64
PowerW float64
ClockMHz float64
}
// sampleGPUMetrics runs nvidia-smi once and returns current metrics for each GPU.
func sampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
args := []string{
"--query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics",
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
}
out, err := exec.Command("nvidia-smi", args...).Output()
if err != nil {
return nil, err
}
var rows []GPUMetricRow
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.Split(line, ", ")
if len(parts) < 5 {
continue
}
idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
if err != nil {
continue // skip rows whose GPU index is not numeric
}
rows = append(rows, GPUMetricRow{
GPUIndex: idx,
TempC: parseGPUFloat(parts[1]),
UsagePct: parseGPUFloat(parts[2]),
PowerW: parseGPUFloat(parts[3]),
ClockMHz: parseGPUFloat(parts[4]),
})
}
return rows, nil
}
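// parseGPUFloat parses one nvidia-smi CSV field. Placeholders such as "N/A"
// or "[Not Supported]" and any unparsable value are reported as 0.
func parseGPUFloat(s string) float64 {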
s = strings.TrimSpace(s)
if s == "N/A" || s == "[Not Supported]" || s == "" {
return 0
}
v, _ := strconv.ParseFloat(s, 64)
return v
}
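// SampleGPUMetrics is the exported wrapper around sampleGPUMetrics. It runs
// nvidia-smi once and returns current metrics for each requested GPU (all GPUs
// when gpuIndices is empty). ElapsedSec is left at zero for the caller to set.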
func SampleGPUMetrics(gpuIndices []int) ([]GPUMetricRow, error) {
return sampleGPUMetrics(gpuIndices)
}
// WriteGPUMetricsCSV writes collected rows as a CSV file.
func WriteGPUMetricsCSV(path string, rows []GPUMetricRow) error {
var b bytes.Buffer
b.WriteString("elapsed_sec,gpu_index,temperature_c,usage_pct,power_w,clock_mhz\n")
for _, r := range rows {
fmt.Fprintf(&b, "%.1f,%d,%.1f,%.1f,%.1f,%.0f\n",
r.ElapsedSec, r.GPUIndex, r.TempC, r.UsagePct, r.PowerW, r.ClockMHz)
}
return os.WriteFile(path, b.Bytes(), 0644)
}
// WriteGPUMetricsHTML writes a standalone HTML file with one SVG chart per GPU.
func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
// Group by GPU index preserving order.
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
var svgs strings.Builder
for _, gpuIdx := range order {
svgs.WriteString(drawGPUChartSVG(gpuMap[gpuIdx], gpuIdx))
svgs.WriteString("\n")
}
ts := time.Now().UTC().Format("2006-01-02 15:04:05 UTC")
html := fmt.Sprintf(`<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<title>GPU Stress Test Metrics</title>
<style>
body { font-family: sans-serif; background: #f0f0f0; margin: 0; padding: 20px; }
h1 { text-align: center; color: #333; margin: 0 0 8px; }
p { text-align: center; color: #888; font-size: 13px; margin: 0 0 24px; }
</style>
</head><body>
<h1>GPU Stress Test Metrics</h1>
<p>Generated %s</p>
%s
</body></html>`, ts, svgs.String())
return os.WriteFile(path, []byte(html), 0644)
}
// drawGPUChartSVG generates a self-contained SVG chart for one GPU.
func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
// Layout
const W, H = 960, 520
const plotX1 = 120 // usage axis / chart left border
const plotX2 = 840 // power axis / chart right border
const plotY1 = 70 // top
const plotY2 = 465 // bottom (PH = 395)
const PW = plotX2 - plotX1
const PH = plotY2 - plotY1
// Outer axes
const tempAxisX = 60 // temp axis line
const clockAxisX = 900 // clock axis line
colors := [4]string{"#e74c3c", "#3498db", "#2ecc71", "#f39c12"}
seriesLabel := [4]string{
fmt.Sprintf("GPU %d Temp (°C)", gpuIdx),
fmt.Sprintf("GPU %d Usage (%%)", gpuIdx),
fmt.Sprintf("GPU %d Power (W)", gpuIdx),
fmt.Sprintf("GPU %d Clock (MHz)", gpuIdx),
}
axisLabel := [4]string{"Temperature (°C)", "GPU Usage (%)", "Power (W)", "Clock (MHz)"}
// Extract series
t := make([]float64, len(rows))
vals := [4][]float64{}
for i := range vals {
vals[i] = make([]float64, len(rows))
}
for i, r := range rows {
t[i] = r.ElapsedSec
vals[0][i] = r.TempC
vals[1][i] = r.UsagePct
vals[2][i] = r.PowerW
vals[3][i] = r.ClockMHz
}
tMin, tMax := gpuMinMax(t)
type axisScale struct {
ticks []float64
min, max float64
}
var axes [4]axisScale
for i := 0; i < 4; i++ {
mn, mx := gpuMinMax(vals[i])
tks := gpuNiceTicks(mn, mx, 8)
axes[i] = axisScale{ticks: tks, min: tks[0], max: tks[len(tks)-1]}
}
xv := func(tv float64) float64 {
if tMax == tMin {
return float64(plotX1)
}
return float64(plotX1) + (tv-tMin)/(tMax-tMin)*float64(PW)
}
yv := func(v float64, ai int) float64 {
a := axes[ai]
if a.max == a.min {
return float64(plotY1 + PH/2)
}
return float64(plotY2) - (v-a.min)/(a.max-a.min)*float64(PH)
}
var b strings.Builder
fmt.Fprintf(&b, `<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d"`+
` style="background:#fff;border-radius:8px;display:block;margin:0 auto 24px;`+
`box-shadow:0 2px 12px rgba(0,0,0,.12)">`+"\n", W, H)
// Title
fmt.Fprintf(&b, `<text x="%d" y="22" text-anchor="middle" font-family="sans-serif"`+
` font-size="14" font-weight="bold" fill="#333">GPU Stress Test Metrics — GPU %d</text>`+"\n",
plotX1+PW/2, gpuIdx)
// Horizontal grid (align to temp axis ticks)
b.WriteString(`<g stroke="#e0e0e0" stroke-width="0.5">` + "\n")
for _, tick := range axes[0].ticks {
y := yv(tick, 0)
if y < float64(plotY1) || y > float64(plotY2) {
continue
}
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"/>`+"\n",
plotX1, y, plotX2, y)
}
// Vertical grid
xTicks := gpuNiceTicks(tMin, tMax, 10)
for _, tv := range xTicks {
x := xv(tv)
if x < float64(plotX1) || x > float64(plotX2) {
continue
}
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d"/>`+"\n",
x, plotY1, x, plotY2)
}
b.WriteString("</g>\n")
// Chart border
fmt.Fprintf(&b, `<rect x="%d" y="%d" width="%d" height="%d"`+
` fill="none" stroke="#333" stroke-width="1"/>`+"\n",
plotX1, plotY1, PW, PH)
// X axis ticks and labels
b.WriteString(`<g font-family="sans-serif" font-size="11" fill="#333" text-anchor="middle">` + "\n")
for _, tv := range xTicks {
x := xv(tv)
if x < float64(plotX1) || x > float64(plotX2) {
continue
}
fmt.Fprintf(&b, `<text x="%.1f" y="%d">%s</text>`+"\n", x, plotY2+18, gpuFormatTick(tv))
fmt.Fprintf(&b, `<line x1="%.1f" y1="%d" x2="%.1f" y2="%d" stroke="#333" stroke-width="1"/>`+"\n",
x, plotY2, x, plotY2+4)
}
b.WriteString("</g>\n")
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="13"`+
` fill="#333" text-anchor="middle">Time (seconds)</text>`+"\n",
plotX1+PW/2, plotY2+38)
// Y axes: [tempAxisX, plotX1, plotX2, clockAxisX]
axisLineX := [4]int{tempAxisX, plotX1, plotX2, clockAxisX}
axisRight := [4]bool{false, false, true, true}
// Label x positions (for rotated vertical text)
axisLabelX := [4]int{10, 68, 868, 950}
for i := 0; i < 4; i++ {
ax := axisLineX[i]
right := axisRight[i]
color := colors[i]
// Axis line
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
` stroke="%s" stroke-width="1"/>`+"\n",
ax, plotY1, ax, plotY2, color)
// Ticks and tick labels
fmt.Fprintf(&b, `<g font-family="sans-serif" font-size="10" fill="%s">`+"\n", color)
for _, tick := range axes[i].ticks {
y := yv(tick, i)
if y < float64(plotY1) || y > float64(plotY2) {
continue
}
dx := -5
textX := ax - 8
anchor := "end"
if right {
dx = 5
textX = ax + 8
anchor = "start"
}
fmt.Fprintf(&b, `<line x1="%d" y1="%.1f" x2="%d" y2="%.1f"`+
` stroke="%s" stroke-width="1"/>`+"\n",
ax, y, ax+dx, y, color)
fmt.Fprintf(&b, `<text x="%d" y="%.1f" text-anchor="%s" dy="4">%s</text>`+"\n",
textX, y, anchor, gpuFormatTick(tick))
}
b.WriteString("</g>\n")
// Axis label (rotated)
lx := axisLabelX[i]
fmt.Fprintf(&b, `<text transform="translate(%d,%d) rotate(-90)"`+
` font-family="sans-serif" font-size="12" fill="%s" text-anchor="middle">%s</text>`+"\n",
lx, plotY1+PH/2, color, axisLabel[i])
}
// Data lines
for i := 0; i < 4; i++ {
var pts strings.Builder
for j := range rows {
x := xv(t[j])
y := yv(vals[i][j], i)
if j == 0 {
fmt.Fprintf(&pts, "%.1f,%.1f", x, y)
} else {
fmt.Fprintf(&pts, " %.1f,%.1f", x, y)
}
}
fmt.Fprintf(&b, `<polyline points="%s" fill="none" stroke="%s" stroke-width="1.5"/>`+"\n",
pts.String(), colors[i])
}
// Legend
const legendY = 42
for i := 0; i < 4; i++ {
lx := plotX1 + i*(PW/4) + 10
fmt.Fprintf(&b, `<line x1="%d" y1="%d" x2="%d" y2="%d"`+
` stroke="%s" stroke-width="2"/>`+"\n",
lx, legendY, lx+20, legendY, colors[i])
fmt.Fprintf(&b, `<text x="%d" y="%d" font-family="sans-serif" font-size="12" fill="#333">%s</text>`+"\n",
lx+25, legendY+4, seriesLabel[i])
}
b.WriteString("</svg>\n")
return b.String()
}
const (
ansiRed = "\033[31m"
ansiBlue = "\033[34m"
ansiGreen = "\033[32m"
ansiYellow = "\033[33m"
ansiReset = "\033[0m"
)
const (
termChartWidth = 70
termChartHeight = 12
)
// RenderGPUTerminalChart returns ANSI line charts (asciigraph-style) per GPU.
// Suitable for display in the TUI screenOutput.
func RenderGPUTerminalChart(rows []GPUMetricRow) string {
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
type seriesDef struct {
caption string
color string
fn func(GPUMetricRow) float64
}
defs := []seriesDef{
{"Temperature (°C)", ansiRed, func(r GPUMetricRow) float64 { return r.TempC }},
{"GPU Usage (%)", ansiBlue, func(r GPUMetricRow) float64 { return r.UsagePct }},
{"Power (W)", ansiGreen, func(r GPUMetricRow) float64 { return r.PowerW }},
{"Clock (MHz)", ansiYellow, func(r GPUMetricRow) float64 { return r.ClockMHz }},
}
var b strings.Builder
for _, gpuIdx := range order {
gr := gpuMap[gpuIdx]
if len(gr) == 0 {
continue
}
tMax := gr[len(gr)-1].ElapsedSec - gr[0].ElapsedSec
fmt.Fprintf(&b, "GPU %d — Stress Test Metrics (%.0f seconds)\n\n", gpuIdx, tMax)
for _, d := range defs {
b.WriteString(renderLineChart(extractGPUField(gr, d.fn), d.color, d.caption,
termChartHeight, termChartWidth))
b.WriteRune('\n')
}
}
return strings.TrimRight(b.String(), "\n")
}
// RenderGPULiveChart renders all GPU metrics on a single combined chart per GPU.
// Each series is normalised to its own min-max range and drawn in a different colour.
// chartWidth controls the width of the plot area (Y-axis label uses 5 extra chars).
func RenderGPULiveChart(rows []GPUMetricRow, chartWidth int) string {
if chartWidth < 20 {
chartWidth = 70
}
const chartHeight = 14
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
type seriesDef struct {
label string
color string
unit string
fn func(GPUMetricRow) float64
}
defs := []seriesDef{
{"Usage", ansiBlue, "%", func(r GPUMetricRow) float64 { return r.UsagePct }},
{"Temp", ansiRed, "°C", func(r GPUMetricRow) float64 { return r.TempC }},
{"Power", ansiGreen, "W", func(r GPUMetricRow) float64 { return r.PowerW }},
}
var b strings.Builder
for _, gpuIdx := range order {
gr := gpuMap[gpuIdx]
if len(gr) == 0 {
continue
}
elapsed := gr[len(gr)-1].ElapsedSec
// Build value slices for each series.
type seriesData struct {
seriesDef
vals []float64
mn float64
mx float64
}
var series []seriesData
for _, d := range defs {
vals := extractGPUField(gr, d.fn)
mn, mx := gpuMinMax(vals)
if mn == mx {
mx = mn + 1
}
series = append(series, seriesData{d, vals, mn, mx})
}
// Shared character grid: row 0 = top (max), row chartHeight = bottom (min).
type cell struct {
ch rune
color string
}
grid := make([][]cell, chartHeight+1)
for r := range grid {
grid[r] = make([]cell, chartWidth)
for c := range grid[r] {
grid[r][c] = cell{' ', ""}
}
}
// Plot each series onto the shared grid.
for _, s := range series {
w := chartWidth
if len(s.vals) < w {
w = len(s.vals)
}
data := gpuDownsample(s.vals, w)
prevRow := -1
for x, v := range data {
row := chartHeight - int(math.Round((v-s.mn)/(s.mx-s.mn)*float64(chartHeight)))
if row < 0 {
row = 0
}
if row > chartHeight {
row = chartHeight
}
if prevRow < 0 || prevRow == row {
grid[row][x] = cell{'─', s.color}
} else {
lo, hi := prevRow, row
if lo > hi {
lo, hi = hi, lo
}
for y := lo + 1; y < hi; y++ {
grid[y][x] = cell{'│', s.color}
}
if prevRow < row {
grid[prevRow][x] = cell{'╮', s.color}
grid[row][x] = cell{'╰', s.color}
} else {
grid[prevRow][x] = cell{'╯', s.color}
grid[row][x] = cell{'╭', s.color}
}
}
prevRow = row
}
}
// Render: Y axis + data rows.
fmt.Fprintf(&b, "GPU %d (%.0fs) each series normalised to its range\n", gpuIdx, elapsed)
for r := 0; r <= chartHeight; r++ {
// Y axis label: 100% at top, 50% in middle, 0% at bottom.
switch r {
case 0:
fmt.Fprintf(&b, "%4s┤", "100%")
case chartHeight / 2:
fmt.Fprintf(&b, "%4s┤", "50%")
case chartHeight:
fmt.Fprintf(&b, "%4s┤", "0%")
default:
fmt.Fprintf(&b, "%4s│", "")
}
for c := 0; c < chartWidth; c++ {
cl := grid[r][c]
if cl.color != "" {
b.WriteString(cl.color)
b.WriteRune(cl.ch)
b.WriteString(ansiReset)
} else {
b.WriteRune(' ')
}
}
b.WriteRune('\n')
}
// Bottom axis.
b.WriteString(strings.Repeat(" ", 4) + "└") // pad to the width of the "%4s" Y-axis labels
b.WriteString(strings.Repeat("─", chartWidth))
b.WriteRune('\n')
// Legend with current (last) values.
b.WriteString(" ")
for i, s := range series {
last := s.vals[len(s.vals)-1]
b.WriteString(s.color)
fmt.Fprintf(&b, "▐ %s: %.0f%s", s.label, last, s.unit)
b.WriteString(ansiReset)
if i < len(series)-1 {
b.WriteString(" ")
}
}
b.WriteRune('\n')
}
return strings.TrimRight(b.String(), "\n")
}
// renderLineChart draws a single time-series line chart using box-drawing characters.
// Produces output in the style of asciigraph: ╭─╮ │ ╰─╯ with a Y axis and caption.
func renderLineChart(vals []float64, color, caption string, height, width int) string {
if len(vals) == 0 {
return caption + "\n"
}
mn, mx := gpuMinMax(vals)
if mn == mx {
mx = mn + 1
}
// Use the smaller of width or len(vals) to avoid stretching sparse data.
w := width
if len(vals) < w {
w = len(vals)
}
data := gpuDownsample(vals, w)
// row[i] = display row index: 0 = top = max value, height = bottom = min value.
row := make([]int, w)
for i, v := range data {
r := int(math.Round((mx - v) / (mx - mn) * float64(height)))
if r < 0 {
r = 0
}
if r > height {
r = height
}
row[i] = r
}
// Fill the character grid.
grid := make([][]rune, height+1)
for i := range grid {
grid[i] = make([]rune, w)
for j := range grid[i] {
grid[i][j] = ' '
}
}
for x := 0; x < w; x++ {
r := row[x]
if x == 0 {
grid[r][0] = '─'
continue
}
p := row[x-1]
switch {
case r == p:
grid[r][x] = '─'
case r < p: // value went up (row index decreased toward top)
grid[r][x] = '╭'
grid[p][x] = '╯'
for y := r + 1; y < p; y++ {
grid[y][x] = '│'
}
default: // r > p, value went down
grid[p][x] = '╮'
grid[r][x] = '╰'
for y := p + 1; y < r; y++ {
grid[y][x] = '│'
}
}
}
// Y axis tick labels.
ticks := gpuNiceTicks(mn, mx, height/2)
tickAtRow := make(map[int]string)
labelWidth := 4
for _, t := range ticks {
r := int(math.Round((mx - t) / (mx - mn) * float64(height)))
if r < 0 || r > height {
continue
}
s := gpuFormatTick(t)
tickAtRow[r] = s
if len(s) > labelWidth {
labelWidth = len(s)
}
}
var b strings.Builder
for r := 0; r <= height; r++ {
label := tickAtRow[r]
fmt.Fprintf(&b, "%*s", labelWidth, label)
switch {
case label != "":
b.WriteRune('┤')
case r == height:
b.WriteRune('┼')
default:
b.WriteRune('│')
}
b.WriteString(color)
b.WriteString(string(grid[r]))
b.WriteString(ansiReset)
b.WriteRune('\n')
}
// Bottom axis.
b.WriteString(strings.Repeat(" ", labelWidth))
b.WriteRune('└')
b.WriteString(strings.Repeat("─", w))
b.WriteRune('\n')
// Caption centered under the chart.
if caption != "" {
total := labelWidth + 1 + w
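// Count runes, not bytes: captions such as "Temperature (°C)" contain multi-byte characters.
if pad := (total - len([]rune(caption))) / 2; pad > 0 {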
b.WriteString(strings.Repeat(" ", pad))
}
b.WriteString(caption)
b.WriteRune('\n')
}
return b.String()
}
// extractGPUField applies fn to every row and returns the results in row order.
func extractGPUField(rows []GPUMetricRow, fn func(GPUMetricRow) float64) []float64 {
v := make([]float64, len(rows))
for i, r := range rows {
v[i] = fn(r)
}
return v
}
// gpuDownsample averages vals into w buckets (or nearest-neighbor upsamples if len(vals) < w).
func gpuDownsample(vals []float64, w int) []float64 {
n := len(vals)
if n == 0 {
return make([]float64, w)
}
result := make([]float64, w)
if n >= w {
counts := make([]int, w)
for i, v := range vals {
bucket := i * w / n
if bucket >= w {
bucket = w - 1
}
result[bucket] += v
counts[bucket]++
}
for i := range result {
if counts[i] > 0 {
result[i] /= float64(counts[i])
}
}
} else {
// Nearest-neighbour upsample.
for i := range result {
src := i * (n - 1) / (w - 1)
if src >= n {
src = n - 1
}
result[i] = vals[src]
}
}
return result
}
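// gpuMinMax returns the smallest and largest value in vals, or (0, 1) for an
// empty slice so that callers always get a usable axis range.
func gpuMinMax(vals []float64) (float64, float64) {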
if len(vals) == 0 {
return 0, 1
}
mn, mx := vals[0], vals[0]
for _, v := range vals[1:] {
if v < mn {
mn = v
}
if v > mx {
mx = v
}
}
return mn, mx
}
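// gpuNiceTicks returns evenly spaced tick values that cover [mn, mx]. The step
// is 1, 2, 5 or 10 times a power of ten, chosen so that the range spans at most
// about 1.5x targetCount intervals. targetCount must be positive.
func gpuNiceTicks(mn, mx float64, targetCount int) []float64 {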
if mn == mx {
mn -= 1
mx += 1
}
r := mx - mn
step := math.Pow(10, math.Floor(math.Log10(r/float64(targetCount))))
for _, f := range []float64{1, 2, 5, 10} {
if r/(f*step) <= float64(targetCount)*1.5 {
step = f * step
break
}
}
lo := math.Floor(mn/step) * step
hi := math.Ceil(mx/step) * step
var ticks []float64
for v := lo; v <= hi+step*0.001; v += step {
ticks = append(ticks, math.Round(v*1e9)/1e9)
}
return ticks
}
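// gpuFormatTick formats whole numbers without a decimal point and all other
// values with one decimal place.
func gpuFormatTick(v float64) string {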
if v == math.Trunc(v) {
return strconv.Itoa(int(v))
}
return strconv.FormatFloat(v, 'f', 1, 64)
}
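The step selection in `gpuNiceTicks` is easier to follow with concrete numbers. The sketch below repeats the same arithmetic in a standalone function (`niceTicks` is a name chosen here for illustration, and the guard for a non-positive target is an addition, not part of the original) and prints the ticks for a temperature range of 31 to 78 with a target of 8 ticks.

```go
package main

import (
	"fmt"
	"math"
)

// niceTicks mirrors the arithmetic of gpuNiceTicks: start from a power-of-ten
// step, then widen it by 1, 2, 5 or 10 until the range spans at most
// 1.5 * target intervals.
func niceTicks(mn, mx float64, target int) []float64 {
	if target < 1 { // added guard; the original assumes a positive target
		target = 1
	}
	if mn == mx {
		mn--
		mx++
	}
	r := mx - mn
	step := math.Pow(10, math.Floor(math.Log10(r/float64(target))))
	for _, f := range []float64{1, 2, 5, 10} {
		if r/(f*step) <= float64(target)*1.5 {
			step *= f
			break
		}
	}
	lo := math.Floor(mn/step) * step
	hi := math.Ceil(mx/step) * step
	var ticks []float64
	for v := lo; v <= hi+step*0.001; v += step {
		ticks = append(ticks, math.Round(v*1e9)/1e9)
	}
	return ticks
}

func main() {
	// Range 47, base step 1: factors 1 and 2 give 47 and 23.5 intervals
	// (more than 12), factor 5 gives 9.4, so the step becomes 5.
	fmt.Println(niceTicks(31, 78, 8)) // [30 35 40 45 50 55 60 65 70 75 80]
}
```

The axis therefore runs from 30 to 80, slightly wider than the data, which is why `drawGPUChartSVG` takes the axis minimum and maximum from the first and last tick rather than from the samples.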

@@ -0,0 +1,105 @@
package platform
import (
"context"
"fmt"
"os/exec"
"strconv"
"strings"
)
// InstallDisk describes a candidate disk for installation.
type InstallDisk struct {
Device string // e.g. /dev/sda
Model string
Size string // human-readable, e.g. "500G"
}
// ListInstallDisks returns block devices suitable for installation.
// Excludes USB drives and the current live boot medium.
func (s *System) ListInstallDisks() ([]InstallDisk, error) {
out, err := exec.Command("lsblk", "-dn", "-o", "NAME,MODEL,SIZE,TYPE,TRAN").Output()
if err != nil {
return nil, fmt.Errorf("lsblk: %w", err)
}
bootDev := findLiveBootDevice()
var disks []InstallDisk
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
fields := strings.Fields(line)
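// Columns are NAME MODEL SIZE TYPE TRAN. MODEL may contain spaces or be
// empty, so the trailing columns are parsed from the end of the line.
// TRAN can also be empty (e.g. for some virtual or RAID-backed disks);
// in that case the last field is the TYPE column.
tran := ""
if n := len(fields); n > 0 && fields[n-1] != "disk" {
tran = fields[n-1]
fields = fields[:n-1]
}
// Remaining layout: NAME [MODEL...] SIZE TYPE.
if len(fields) < 3 {
continue
}
typ := fields[len(fields)-1]
size := fields[len(fields)-2]
name := fields[0]
model := strings.Join(fields[1:len(fields)-2], " ")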
if typ != "disk" {
continue
}
if strings.EqualFold(tran, "usb") {
continue
}
device := "/dev/" + name
if device == bootDev {
continue
}
disks = append(disks, InstallDisk{
Device: device,
Model: strings.TrimSpace(model),
Size: size,
})
}
return disks, nil
}
// findLiveBootDevice returns the block device backing /run/live/medium (if any).
func findLiveBootDevice() string {
out, err := exec.Command("findmnt", "-n", "-o", "SOURCE", "/run/live/medium").Output()
if err != nil {
return ""
}
src := strings.TrimSpace(string(out))
if src == "" {
return ""
}
// Strip partition suffix to get the whole disk device.
// e.g. /dev/sdb1 → /dev/sdb, /dev/nvme0n1p1 → /dev/nvme0n1
out2, err := exec.Command("lsblk", "-no", "PKNAME", src).Output()
if err != nil || strings.TrimSpace(string(out2)) == "" {
return src
}
return "/dev/" + strings.TrimSpace(string(out2))
}
// InstallToDisk runs bee-install <device> <logfile> and streams output to logFile.
// The context can be used to cancel.
func (s *System) InstallToDisk(ctx context.Context, device string, logFile string) error {
cmd := exec.CommandContext(ctx, "bee-install", device, logFile)
return cmd.Run()
}
// InstallLogPath returns the default install log path for a given device.
func InstallLogPath(device string) string {
safe := strings.NewReplacer("/", "_", " ", "_").Replace(device)
return "/tmp/bee-install" + safe + ".log"
}
// Label returns a display label for a disk in the form "<device> <size> <model>".
func (d InstallDisk) Label() string {
model := d.Model
if model == "" {
model = "Unknown"
}
sizeBytes, err := strconv.ParseInt(strings.TrimSuffix(d.Size, "B"), 10, 64)
_ = sizeBytes
_ = err
return fmt.Sprintf("%s %s %s", d.Device, d.Size, model)
}
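`ListInstallDisks` splits whitespace-separated columns, which gets fragile when `MODEL` or `TRAN` is empty. As an alternative technique (not what this change does), `lsblk` can emit JSON with `-J`, where empty columns arrive as `null`. A minimal sketch follows, assuming the usual `blockdevices` array and string-valued `size`; the type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lsblkDevice holds the columns requested with -o NAME,MODEL,SIZE,TYPE,TRAN.
// A JSON null leaves the corresponding string field empty.
type lsblkDevice struct {
	Name  string `json:"name"`
	Model string `json:"model"`
	Size  string `json:"size"`
	Type  string `json:"type"`
	Tran  string `json:"tran"`
}

type lsblkReport struct {
	BlockDevices []lsblkDevice `json:"blockdevices"`
}

// installCandidates applies the same filters as ListInstallDisks:
// whole disks only, no USB transport, and not the live boot device.
func installCandidates(raw []byte, bootDev string) ([]string, error) {
	var rep lsblkReport
	if err := json.Unmarshal(raw, &rep); err != nil {
		return nil, fmt.Errorf("decode lsblk json: %w", err)
	}
	var devs []string
	for _, d := range rep.BlockDevices {
		dev := "/dev/" + d.Name
		if d.Type != "disk" || strings.EqualFold(d.Tran, "usb") || dev == bootDev {
			continue
		}
		devs = append(devs, dev)
	}
	return devs, nil
}

func main() {
	raw := []byte(`{"blockdevices":[
		{"name":"sda","model":"Samsung SSD","size":"500G","type":"disk","tran":"sata"},
		{"name":"sdb","model":"Flash","size":"29.8G","type":"disk","tran":"usb"},
		{"name":"vda","model":null,"size":"20G","type":"disk","tran":null}]}`)
	devs, err := installCandidates(raw, "/dev/sda")
	fmt.Println(devs, err)
}
```

In real use the input would come from something like `lsblk -J -d -o NAME,MODEL,SIZE,TYPE,TRAN`; the exact JSON shape can vary between util-linux versions, so treat the struct as a starting point.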

@@ -0,0 +1,45 @@
package platform
import "time"
// LiveMetricSample is a single point-in-time snapshot of server metrics
// collected for the web UI metrics page.
type LiveMetricSample struct {
Timestamp time.Time `json:"ts"`
Fans []FanReading `json:"fans"`
Temps []TempReading `json:"temps"`
PowerW float64 `json:"power_w"`
GPUs []GPUMetricRow `json:"gpus"`
}
// TempReading is a named temperature sensor value.
type TempReading struct {
Name string `json:"name"`
Celsius float64 `json:"celsius"`
}
// SampleLiveMetrics collects a single metrics snapshot from all available
// sources: GPU (via nvidia-smi), fans and temperatures (via ipmitool/sensors),
// and system power (via ipmitool dcmi). Missing sources are silently skipped.
func SampleLiveMetrics() LiveMetricSample {
s := LiveMetricSample{Timestamp: time.Now().UTC()}
// GPU metrics — skipped silently if nvidia-smi unavailable
gpus, _ := SampleGPUMetrics(nil)
s.GPUs = gpus
// Fan speeds — skipped silently if ipmitool unavailable
fans, _ := sampleFanSpeeds()
s.Fans = fans
// CPU/system temperature — returns 0 if unavailable
cpuTemp := sampleCPUMaxTemp()
if cpuTemp > 0 {
s.Temps = append(s.Temps, TempReading{Name: "CPU", Celsius: cpuTemp})
}
// System power — returns 0 if unavailable
s.PowerW = sampleSystemPower()
return s
}
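The commit message at the top of this comparison mentions 120-sample ring buffers for CPU temperature and power. Their implementation is not part of this file, so the following is only a sketch of how values from successive `SampleLiveMetrics` calls could be kept in a fixed-size buffer. The `ring` type and its methods are hypothetical.

```go
package main

import "fmt"

// ring keeps the most recent len(buf) values and overwrites the oldest.
type ring struct {
	buf  []float64
	next int  // index that the next push will write to
	full bool // true once the buffer has wrapped at least once
}

func newRing(n int) *ring {
	if n < 1 {
		n = 1
	}
	return &ring{buf: make([]float64, n)}
}

func (r *ring) push(v float64) {
	r.buf[r.next] = v
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// snapshot returns a copy of the stored values, oldest first.
func (r *ring) snapshot() []float64 {
	if !r.full {
		return append([]float64(nil), r.buf[:r.next]...)
	}
	out := make([]float64, 0, len(r.buf))
	out = append(out, r.buf[r.next:]...)
	return append(out, r.buf[:r.next]...)
}

func main() {
	power := newRing(3) // the commit message suggests 120 in practice
	for _, w := range []float64{210, 215, 230, 228, 240} {
		power.push(w)
	}
	fmt.Println(power.snapshot()) // [230 228 240]
}
```

A handler serving charts from several goroutines would also need a mutex around `push` and `snapshot`; that is omitted here to keep the sketch short.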

@@ -0,0 +1,214 @@
package platform
import (
"os"
"os/exec"
"strings"
"time"
"bee/audit/internal/schema"
)
var runtimeRequiredTools = []string{
"dmidecode",
"lspci",
"lsblk",
"smartctl",
"nvme",
"ipmitool",
"dhclient",
"mount",
}
var runtimeTrackedServices = []string{
"bee-network",
"bee-nvidia",
"bee-preflight",
"bee-audit",
"bee-web",
"bee-sshsetup",
}
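// CollectRuntimeHealth gathers a snapshot of the live environment: export
// directory, network interfaces, required tools, tracked services and GPU
// driver state. Status is "FAILED" when the export directory cannot be
// created, "PARTIAL" when any other issue was recorded, and "OK" otherwise.
func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {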
checkedAt := time.Now().UTC().Format(time.RFC3339)
health := schema.RuntimeHealth{
Status: "OK",
CheckedAt: checkedAt,
ExportDir: strings.TrimSpace(exportDir),
}
if health.ExportDir != "" {
if err := os.MkdirAll(health.ExportDir, 0755); err != nil {
health.Status = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "export_dir_unavailable",
Severity: "critical",
Description: err.Error(),
})
}
}
interfaces, err := s.ListInterfaces()
if err == nil {
health.Interfaces = make([]schema.RuntimeInterface, 0, len(interfaces))
hasIPv4 := false
missingIPv4 := false
for _, iface := range interfaces {
outcome := "no_offer"
if len(iface.IPv4) > 0 {
outcome = "lease_acquired"
hasIPv4 = true
} else if strings.EqualFold(iface.State, "DOWN") {
outcome = "link_down"
} else {
missingIPv4 = true
}
health.Interfaces = append(health.Interfaces, schema.RuntimeInterface{
Name: iface.Name,
State: iface.State,
IPv4: iface.IPv4,
Outcome: outcome,
})
}
switch {
case hasIPv4 && !missingIPv4:
health.NetworkStatus = "OK"
case hasIPv4:
health.NetworkStatus = "PARTIAL"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_partial",
Severity: "warning",
Description: "At least one interface did not obtain IPv4 connectivity.",
})
default:
health.NetworkStatus = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_failed",
Severity: "warning",
Description: "No physical interface obtained IPv4 connectivity.",
})
}
}
vendor := s.DetectGPUVendor()
for _, tool := range s.runtimeToolStatuses(vendor) {
health.Tools = append(health.Tools, schema.RuntimeToolStatus{
Name: tool.Name,
Path: tool.Path,
OK: tool.OK,
})
if !tool.OK {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "tool_missing",
Severity: "warning",
Description: "Required tool missing: " + tool.Name,
})
}
}
for _, name := range runtimeTrackedServices {
health.Services = append(health.Services, schema.RuntimeServiceStatus{
Name: name,
Status: s.ServiceState(name),
})
}
s.collectGPURuntimeHealth(vendor, &health)
if health.Status != "FAILED" && len(health.Issues) > 0 {
health.Status = "PARTIAL"
}
return health, nil
}
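// commandText runs a command and returns its combined stdout and stderr.
// Output is returned even when the command exits with an error; the result
// is empty only when the command produced no output at all.
func commandText(name string, args ...string) string {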
raw, err := exec.Command(name, args...).CombinedOutput()
if err != nil && len(raw) == 0 {
return ""
}
return string(raw)
}
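// runtimeToolStatuses checks the common required tools plus the vendor-specific
// ones: nvidia-smi, nvidia-bug-report.sh and bee-gpu-stress for NVIDIA, or
// rocm-smi for AMD.
func (s *System) runtimeToolStatuses(vendor string) []ToolStatus {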
tools := s.CheckTools(runtimeRequiredTools)
switch vendor {
case "nvidia":
tools = append(tools, s.CheckTools([]string{
"nvidia-smi",
"nvidia-bug-report.sh",
"bee-gpu-stress",
})...)
case "amd":
tool := ToolStatus{Name: "rocm-smi"}
if cmd, err := resolveROCmSMICommand(); err == nil && len(cmd) > 0 {
tool.Path = cmd[0]
if len(cmd) > 1 && strings.HasSuffix(cmd[1], "rocm_smi.py") {
tool.Path = cmd[1]
}
tool.OK = true
}
tools = append(tools, tool)
}
return tools
}
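// collectGPURuntimeHealth sets DriverReady and CUDAReady on health and appends
// an issue for each missing kernel module or unavailable runtime. For AMD,
// CUDAReady is set when rocm-smi responds.
func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) {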
lsmodText := commandText("lsmod")
switch vendor {
case "nvidia":
health.DriverReady = strings.Contains(lsmodText, "nvidia ")
if !health.DriverReady {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_kernel_module_missing",
Severity: "warning",
Description: "NVIDIA kernel module is not loaded.",
})
}
if health.DriverReady && !strings.Contains(lsmodText, "nvidia_modeset") {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_modeset_failed",
Severity: "warning",
Description: "nvidia-modeset is not loaded; display/CUDA stack may be partial.",
})
}
if out, err := exec.Command("nvidia-smi", "-L").CombinedOutput(); err == nil && strings.TrimSpace(string(out)) != "" {
health.DriverReady = true
}
if _, lookErr := exec.LookPath("bee-gpu-stress"); lookErr == nil {
out, err := exec.Command("bee-gpu-stress", "--seconds", "1", "--size-mb", "1").CombinedOutput()
if err == nil {
health.CUDAReady = true
} else if strings.Contains(strings.ToLower(string(out)), "cuda_error_system_not_ready") {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "cuda_runtime_not_ready",
Severity: "warning",
Description: "CUDA runtime is not ready for GPU SAT.",
})
}
}
case "amd":
health.DriverReady = strings.Contains(lsmodText, "amdgpu ") || strings.Contains(lsmodText, "amdkfd")
if !health.DriverReady {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "amdgpu_kernel_module_missing",
Severity: "warning",
Description: "AMD GPU driver is not loaded.",
})
}
out, err := runROCmSMI("--showproductname", "--csv")
if err == nil && strings.TrimSpace(string(out)) != "" {
health.CUDAReady = true
health.DriverReady = true
return
}
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "rocm_smi_unavailable",
Severity: "warning",
Description: "ROCm SMI is not available for AMD GPU SAT.",
})
}
}
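`collectGPURuntimeHealth` detects drivers with substring checks such as `strings.Contains(lsmodText, "nvidia ")`. A stricter alternative is to compare exact module names from the first column of `lsmod`. The sketch below is an illustration under that assumption; `loadedModules` is a hypothetical helper, not part of this change.

```go
package main

import (
	"fmt"
	"strings"
)

// loadedModules returns the set of module names from lsmod output.
// The header line ("Module Size Used by") is skipped.
func loadedModules(lsmodText string) map[string]bool {
	mods := make(map[string]bool)
	for i, line := range strings.Split(lsmodText, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		if i == 0 && fields[0] == "Module" {
			continue
		}
		mods[fields[0]] = true
	}
	return mods
}

func main() {
	text := "Module                  Size  Used by\n" +
		"nvidia_uvm           1540096  0\n" +
		"nvidia              56717312  1 nvidia_uvm\n"
	mods := loadedModules(text)
	fmt.Println(mods["nvidia"], mods["nvidia_modeset"])
}
```

With a set of exact names, a module listed only in another module's "Used by" column cannot be mistaken for a loaded driver.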

@@ -3,6 +3,8 @@ package platform
import (
"archive/tar"
"compress/gzip"
"context"
"errors"
"fmt"
"io"
"os"
@@ -14,10 +16,141 @@ import (
"time"
)
var (
satExecCommand = exec.Command
satLookPath = exec.LookPath
satGlob = filepath.Glob
satStat = os.Stat
rocmSMIExecutableGlobs = []string{
"/opt/rocm/bin/rocm-smi",
"/opt/rocm-*/bin/rocm-smi",
}
rocmSMIScriptGlobs = []string{
"/opt/rocm/libexec/rocm_smi/rocm_smi.py",
"/opt/rocm-*/libexec/rocm_smi/rocm_smi.py",
}
)
// NvidiaGPU holds basic GPU info from nvidia-smi.
type NvidiaGPU struct {
Index int
Name string
MemoryMB int
}
// AMDGPUInfo holds basic info about an AMD GPU from rocm-smi.
type AMDGPUInfo struct {
Index int
Name string
}
// DetectGPUVendor returns "nvidia" if /dev/nvidia0 exists, "amd" if /dev/kfd exists, or "" otherwise.
func (s *System) DetectGPUVendor() string {
if _, err := os.Stat("/dev/nvidia0"); err == nil {
return "nvidia"
}
if _, err := os.Stat("/dev/kfd"); err == nil {
return "amd"
}
return ""
}
// ListAMDGPUs returns AMD GPUs visible to rocm-smi.
func (s *System) ListAMDGPUs() ([]AMDGPUInfo, error) {
out, err := runROCmSMI("--showproductname", "--csv")
if err != nil {
return nil, fmt.Errorf("rocm-smi: %w", err)
}
var gpus []AMDGPUInfo
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" || strings.HasPrefix(strings.ToLower(line), "device") {
continue
}
parts := strings.SplitN(line, ",", 2)
name := ""
if len(parts) >= 2 {
name = strings.TrimSpace(parts[1])
}
idx := len(gpus)
gpus = append(gpus, AMDGPUInfo{Index: idx, Name: name})
}
return gpus, nil
}
// RunAMDAcceptancePack runs an AMD GPU diagnostic pack using rocm-smi.
func (s *System) RunAMDAcceptancePack(baseDir string) (string, error) {
return runAcceptancePack(baseDir, "gpu-amd", []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rocm-smi-showallinfo.log", cmd: []string{"rocm-smi", "--showallinfo"}},
{name: "03-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "04-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
})
}
// ListNvidiaGPUs returns GPUs visible to nvidia-smi.
func (s *System) ListNvidiaGPUs() ([]NvidiaGPU, error) {
out, err := exec.Command("nvidia-smi",
"--query-gpu=index,name,memory.total",
"--format=csv,noheader,nounits").Output()
if err != nil {
return nil, fmt.Errorf("nvidia-smi: %w", err)
}
var gpus []NvidiaGPU
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.SplitN(line, ", ", 3)
if len(parts) != 3 {
continue
}
idx, err := strconv.Atoi(strings.TrimSpace(parts[0]))
if err != nil {
continue
}
memMB, _ := strconv.Atoi(strings.TrimSpace(parts[2]))
gpus = append(gpus, NvidiaGPU{
Index: idx,
Name: strings.TrimSpace(parts[1]),
MemoryMB: memMB,
})
}
return gpus, nil
}
// RunNCCLTests runs nccl-tests all_reduce_perf across all NVIDIA GPUs.
// Measures collective communication bandwidth over NVLink/PCIe.
func (s *System) RunNCCLTests(ctx context.Context, baseDir string) (string, error) {
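// Detect the GPU count from one index per output line. If nvidia-smi fails,
// the output is empty and strings.Split still yields one element, so the
// count falls back to 1.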
out, _ := exec.Command("nvidia-smi", "--query-gpu=index", "--format=csv,noheader").Output()
gpuCount := len(strings.Split(strings.TrimSpace(string(out)), "\n"))
if gpuCount < 1 {
gpuCount = 1
}
return runAcceptancePackCtx(ctx, baseDir, "nccl-tests", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-all-reduce-perf.log", cmd: []string{
"all_reduce_perf", "-b", "512M", "-e", "4G", "-f", "2",
"-g", strconv.Itoa(gpuCount), "--iters", "20",
}},
})
}
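// RunNvidiaAcceptancePack runs the default NVIDIA job list from nvidiaSATJobs
// and returns the path of the resulting archive.
func (s *System) RunNvidiaAcceptancePack(baseDir string) (string, error) {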
return runAcceptancePack(baseDir, "gpu-nvidia", nvidiaSATJobs())
}
// RunNvidiaAcceptancePackWithOptions runs the NVIDIA diagnostics via DCGM.
// diagLevel: 1=quick, 2=medium, 3=targeted stress, 4=extended stress.
// gpuIndices: specific GPU indices to test (empty = all GPUs).
// ctx cancellation kills the running job.
func (s *System) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int) (string, error) {
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia", nvidiaDCGMJobs(diagLevel, gpuIndices))
}
func (s *System) RunMemoryAcceptancePack(baseDir string) (string, error) {
sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128)
passes := envInt("BEE_MEMTESTER_PASSES", 1)
@@ -28,6 +161,18 @@ func (s *System) RunMemoryAcceptancePack(baseDir string) (string, error) {
})
}
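// RunCPUAcceptancePack records lscpu and sensors output around a stress-ng CPU
// run. durationSec defaults to 60 when it is not positive.
func (s *System) RunCPUAcceptancePack(baseDir string, durationSec int) (string, error) {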
if durationSec <= 0 {
durationSec = 60
}
return runAcceptancePack(baseDir, "cpu", []satJob{
{name: "01-lscpu.log", cmd: []string{"lscpu"}},
{name: "02-sensors-before.log", cmd: []string{"sensors"}},
{name: "03-stress-ng.log", cmd: []string{"stress-ng", "--cpu", "0", "--cpu-method", "all", "--timeout", fmt.Sprintf("%d", durationSec)}},
{name: "04-sensors-after.log", cmd: []string{"sensors"}},
})
}
func (s *System) RunStorageAcceptancePack(baseDir string) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
@@ -37,6 +182,7 @@ func (s *System) RunStorageAcceptancePack(baseDir string) (string, error) {
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
devices, err := listStorageDevices()
if err != nil {
@@ -59,7 +205,7 @@ func (s *System) RunStorageAcceptancePack(baseDir string) (string, error) {
commands := storageSATCommands(devPath)
for cmdIndex, job := range commands {
name := fmt.Sprintf("%s-%02d-%s.log", prefix, cmdIndex+1, job.name)
out, err := exec.Command(job.cmd[0], job.cmd[1:]...).CombinedOutput()
out, err := runSATCommand(verboseLog, job.name, job.cmd)
if writeErr := os.WriteFile(filepath.Join(runDir, name), out, 0644); writeErr != nil {
return "", writeErr
}
@@ -83,8 +229,11 @@ func (s *System) RunStorageAcceptancePack(baseDir string) (string, error) {
}
type satJob struct {
name string
cmd []string
name string
cmd []string
env []string // extra env vars (appended to os.Environ)
collectGPU bool // collect GPU metrics via nvidia-smi while this job runs
gpuIndices []int // GPU indices to collect metrics for (empty = all)
}
type satStats struct {
@@ -100,7 +249,7 @@ func nvidiaSATJobs() []satJob {
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output", "{{run_dir}}/nvidia-bug-report.log"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
{name: "05-bee-gpu-stress.log", cmd: []string{"bee-gpu-stress", "--seconds", fmt.Sprintf("%d", seconds), "--size-mb", fmt.Sprintf("%d", sizeMB)}},
}
}
@@ -114,6 +263,7 @@ func runAcceptancePack(baseDir, prefix string, jobs []satJob) (string, error) {
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
var summary strings.Builder
stats := satStats{}
@@ -123,7 +273,7 @@ func runAcceptancePack(baseDir, prefix string, jobs []satJob) (string, error) {
for _, arg := range job.cmd {
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
}
out, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
out, err := runSATCommand(verboseLog, job.name, cmd)
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
@@ -145,20 +295,121 @@ func runAcceptancePack(baseDir, prefix string, jobs []satJob) (string, error) {
return archive, nil
}
func nvidiaDCGMJobs(diagLevel int, gpuIndices []int) []satJob {
if diagLevel < 1 || diagLevel > 4 {
diagLevel = 3
}
diagArgs := []string{"dcgmi", "diag", "-r", strconv.Itoa(diagLevel)}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
diagArgs = append(diagArgs, "-i", strings.Join(ids, ","))
}
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-dcgmi-diag.log", cmd: diagArgs},
}
}
func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, prefix+"-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
var summary strings.Builder
stats := satStats{}
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
for _, job := range jobs {
if ctx.Err() != nil {
break
}
cmd := make([]string, 0, len(job.cmd))
for _, arg := range job.cmd {
cmd = append(cmd, strings.ReplaceAll(arg, "{{run_dir}}", runDir))
}
var out []byte
var err error
if job.collectGPU {
out, err = runSATCommandWithMetrics(ctx, verboseLog, job.name, cmd, job.env, job.gpuIndices, runDir)
} else {
out, err = runSATCommandCtx(ctx, verboseLog, job.name, cmd, job.env)
}
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")
fmt.Fprintf(&summary, "%s_rc=%d\n", key, rc)
fmt.Fprintf(&summary, "%s_status=%s\n", key, status)
}
writeSATStats(&summary, stats)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
func runSATCommandCtx(ctx context.Context, verboseLog, name string, cmd []string, env []string) ([]byte, error) {
start := time.Now().UTC()
resolvedCmd, err := resolveSATCommand(cmd)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
"cmd: "+strings.Join(resolvedCmd, " "),
)
if err != nil {
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
"rc: 1",
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return []byte(err.Error() + "\n"), err
}
c := exec.CommandContext(ctx, resolvedCmd[0], resolvedCmd[1:]...)
if len(env) > 0 {
c.Env = append(os.Environ(), env...)
}
out, err := c.CombinedOutput()
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
fmt.Sprintf("rc: %d", rc),
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return out, err
}
func listStorageDevices() ([]string, error) {
out, err := exec.Command("lsblk", "-dn", "-o", "NAME,TYPE").Output()
out, err := satExecCommand("lsblk", "-dn", "-o", "NAME,TYPE,TRAN").Output()
if err != nil {
return nil, err
}
var devices []string
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
fields := strings.Fields(strings.TrimSpace(line))
if len(fields) != 2 || fields[1] != "disk" {
continue
}
devices = append(devices, "/dev/"+fields[0])
}
return devices, nil
return parseStorageDevices(string(out)), nil
}
func storageSATCommands(devPath string) []satJob {
@@ -166,7 +417,7 @@ func storageSATCommands(devPath string) []satJob {
return []satJob{
{name: "nvme-id-ctrl", cmd: []string{"nvme", "id-ctrl", devPath, "-o", "json"}},
{name: "nvme-smart-log", cmd: []string{"nvme", "smart-log", devPath, "-o", "json"}},
{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "--start", "1"}},
{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "-s", "1", "--wait"}},
}
}
return []satJob{
@@ -219,6 +470,7 @@ func classifySATResult(name string, out []byte, err error) (string, int) {
strings.Contains(text, "unknown command") ||
strings.Contains(text, "not implemented") ||
strings.Contains(text, "not available") ||
strings.Contains(text, "cuda_error_system_not_ready") ||
strings.Contains(text, "no such device") ||
(strings.Contains(name, "self-test") && strings.Contains(text, "aborted")) {
return "UNSUPPORTED", rc
@@ -226,6 +478,182 @@ func classifySATResult(name string, out []byte, err error) (string, int) {
return "FAILED", rc
}
func runSATCommand(verboseLog, name string, cmd []string) ([]byte, error) {
start := time.Now().UTC()
resolvedCmd, err := resolveSATCommand(cmd)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s", start.Format(time.RFC3339), name),
"cmd: "+strings.Join(resolvedCmd, " "),
)
if err != nil {
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
"rc: 1",
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return []byte(err.Error() + "\n"), err
}
out, err := satExecCommand(resolvedCmd[0], resolvedCmd[1:]...).CombinedOutput()
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), name),
fmt.Sprintf("rc: %d", rc),
fmt.Sprintf("duration_ms: %d", time.Since(start).Milliseconds()),
"",
)
return out, err
}
func runROCmSMI(args ...string) ([]byte, error) {
cmd, err := resolveROCmSMICommand(args...)
if err != nil {
return nil, err
}
return satExecCommand(cmd[0], cmd[1:]...).CombinedOutput()
}
func resolveSATCommand(cmd []string) ([]string, error) {
if len(cmd) == 0 {
return nil, errors.New("empty SAT command")
}
if cmd[0] != "rocm-smi" {
return cmd, nil
}
return resolveROCmSMICommand(cmd[1:]...)
}
func resolveROCmSMICommand(args ...string) ([]string, error) {
if path, err := satLookPath("rocm-smi"); err == nil {
return append([]string{path}, args...), nil
}
for _, path := range rocmSMIExecutableCandidates() {
return append([]string{path}, args...), nil
}
pythonPath, pyErr := satLookPath("python3")
if pyErr == nil {
for _, script := range rocmSMIScriptCandidates() {
cmd := []string{pythonPath, script}
cmd = append(cmd, args...)
return cmd, nil
}
}
return nil, errors.New("rocm-smi not found in PATH or under /opt/rocm")
}
func rocmSMIExecutableCandidates() []string {
return expandExistingPaths(rocmSMIExecutableGlobs)
}
func rocmSMIScriptCandidates() []string {
return expandExistingPaths(rocmSMIScriptGlobs)
}
func expandExistingPaths(patterns []string) []string {
seen := make(map[string]struct{})
var paths []string
for _, pattern := range patterns {
matches, err := satGlob(pattern)
if err != nil {
continue
}
sort.Strings(matches)
for _, match := range matches {
if _, err := satStat(match); err != nil {
continue
}
if _, ok := seen[match]; ok {
continue
}
seen[match] = struct{}{}
paths = append(paths, match)
}
}
return paths
}
func parseStorageDevices(raw string) []string {
var devices []string
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
fields := strings.Fields(strings.TrimSpace(line))
if len(fields) < 2 || fields[1] != "disk" {
continue
}
if len(fields) >= 3 && strings.EqualFold(fields[2], "usb") {
continue
}
devices = append(devices, "/dev/"+fields[0])
}
return devices
}
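// runSATCommandWithMetrics runs a command while sampling GPU metrics once per second in the background.
// If at least one sample was collected, it writes gpu-metrics.csv, gpu-metrics.html,
// and gpu-metrics-term.txt into runDir; errors from those writes are ignored.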
func runSATCommandWithMetrics(ctx context.Context, verboseLog, name string, cmd []string, env []string, gpuIndices []int, runDir string) ([]byte, error) {
stopCh := make(chan struct{})
doneCh := make(chan struct{})
var metricRows []GPUMetricRow
start := time.Now()
go func() {
defer close(doneCh)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-stopCh:
return
case <-ticker.C:
samples, err := sampleGPUMetrics(gpuIndices)
if err != nil {
continue
}
elapsed := time.Since(start).Seconds()
for i := range samples {
samples[i].ElapsedSec = elapsed
}
metricRows = append(metricRows, samples...)
}
}
}()
out, err := runSATCommandCtx(ctx, verboseLog, name, cmd, env)
close(stopCh)
<-doneCh
if len(metricRows) > 0 {
_ = WriteGPUMetricsCSV(filepath.Join(runDir, "gpu-metrics.csv"), metricRows)
_ = WriteGPUMetricsHTML(filepath.Join(runDir, "gpu-metrics.html"), metricRows)
chart := RenderGPUTerminalChart(metricRows)
_ = os.WriteFile(filepath.Join(runDir, "gpu-metrics-term.txt"), []byte(chart), 0644)
}
return out, err
}
func appendSATVerboseLog(path string, lines ...string) {
if path == "" {
return
}
f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
if err != nil {
return
}
defer f.Close()
for _, line := range lines {
_, _ = io.WriteString(f, line+"\n")
}
}
func envInt(name string, fallback int) int {
raw := strings.TrimSpace(os.Getenv(name))
if raw == "" {


@@ -0,0 +1,587 @@
package platform
import (
"context"
"fmt"
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
"sync"
"time"
)
// FanStressOptions configures the fan-stress / thermal cycling test.
type FanStressOptions struct {
BaselineSec int // idle monitoring before and after load (default 30)
Phase1DurSec int // first load phase duration in seconds (default 300)
PauseSec int // pause between the two load phases (default 60)
Phase2DurSec int // second load phase duration in seconds (default 300)
SizeMB int // GPU memory to allocate per GPU during stress (default 64)
GPUIndices []int // which GPU indices to stress (empty = all detected)
}
// FanReading holds one fan sensor reading.
type FanReading struct {
Name string
RPM float64
}
// GPUStressMetric holds per-GPU metrics during the stress test.
type GPUStressMetric struct {
Index int
TempC float64
UsagePct float64
PowerW float64
ClockMHz float64
Throttled bool // true if any throttle reason is active
}
// FanStressRow is one second-interval telemetry sample covering all monitored dimensions.
type FanStressRow struct {
TimestampUTC string
ElapsedSec float64
Phase string // "baseline", "load1", "pause", "load2", "cooldown"
GPUs []GPUStressMetric
Fans []FanReading
CPUMaxTempC float64 // highest CPU temperature from ipmitool / sensors
SysPowerW float64 // DCMI system power reading
}
// RunFanStressTest runs a two-phase GPU stress test while monitoring fan speeds,
// temperatures, and power draw every second. It writes metrics.csv, fan-sensors.csv,
// and summary.txt into a timestamped run directory and returns the path of the
// .tar.gz archive of that directory.
// Designed to reproduce case-04 fan-speed lag and detect GPU thermal throttling.
func (s *System) RunFanStressTest(ctx context.Context, baseDir string, opts FanStressOptions) (string, error) {
if baseDir == "" {
baseDir = "/var/log/bee-sat"
}
applyFanStressDefaults(&opts)
ts := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, "fan-stress-"+ts)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", err
}
verboseLog := filepath.Join(runDir, "verbose.log")
// Phase name shared between sampler goroutine and main goroutine.
var phaseMu sync.Mutex
currentPhase := "init"
setPhase := func(name string) {
phaseMu.Lock()
currentPhase = name
phaseMu.Unlock()
}
getPhase := func() string {
phaseMu.Lock()
defer phaseMu.Unlock()
return currentPhase
}
start := time.Now()
var rowsMu sync.Mutex
var allRows []FanStressRow
// Start background sampler (every second).
stopCh := make(chan struct{})
doneCh := make(chan struct{})
go func() {
defer close(doneCh)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-stopCh:
return
case <-ticker.C:
row := sampleFanStressRow(opts.GPUIndices, getPhase(), time.Since(start).Seconds())
rowsMu.Lock()
allRows = append(allRows, row)
rowsMu.Unlock()
}
}
}()
var summary strings.Builder
fmt.Fprintf(&summary, "run_at_utc=%s\n", time.Now().UTC().Format(time.RFC3339))
stats := satStats{}
// idlePhase sleeps for durSec while the sampler stamps phaseName on each row.
idlePhase := func(phaseName, stepName string, durSec int) {
if ctx.Err() != nil {
return
}
setPhase(phaseName)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start %s (idle %ds)", time.Now().UTC().Format(time.RFC3339), stepName, durSec),
)
select {
case <-ctx.Done():
case <-time.After(time.Duration(durSec) * time.Second):
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish %s", time.Now().UTC().Format(time.RFC3339), stepName),
)
fmt.Fprintf(&summary, "%s_status=OK\n", stepName)
stats.OK++
}
// loadPhase runs bee-gpu-stress for durSec; sampler stamps phaseName on each row.
loadPhase := func(phaseName, stepName string, durSec int) {
if ctx.Err() != nil {
return
}
setPhase(phaseName)
var env []string
if len(opts.GPUIndices) > 0 {
ids := make([]string, len(opts.GPUIndices))
for i, idx := range opts.GPUIndices {
ids[i] = strconv.Itoa(idx)
}
env = []string{"CUDA_VISIBLE_DEVICES=" + strings.Join(ids, ",")}
}
cmd := []string{
"bee-gpu-stress",
"--seconds", strconv.Itoa(durSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
}
out, err := runSATCommandCtx(ctx, verboseLog, stepName, cmd, env)
_ = os.WriteFile(filepath.Join(runDir, stepName+".log"), out, 0644)
if err != nil && err != context.Canceled && err.Error() != "signal: killed" {
fmt.Fprintf(&summary, "%s_status=FAILED\n", stepName)
stats.Failed++
} else {
fmt.Fprintf(&summary, "%s_status=OK\n", stepName)
stats.OK++
}
}
// Execute test phases.
idlePhase("baseline", "01-baseline", opts.BaselineSec)
loadPhase("load1", "02-load1", opts.Phase1DurSec)
idlePhase("pause", "03-pause", opts.PauseSec)
loadPhase("load2", "04-load2", opts.Phase2DurSec)
idlePhase("cooldown", "05-cooldown", opts.BaselineSec)
// Stop sampler and collect rows.
close(stopCh)
<-doneCh
rowsMu.Lock()
rows := allRows
rowsMu.Unlock()
// Analysis.
throttled := analyzeThrottling(rows)
maxGPUTemp := analyzeMaxTemp(rows, func(r FanStressRow) float64 {
var m float64
for _, g := range r.GPUs {
if g.TempC > m {
m = g.TempC
}
}
return m
})
maxCPUTemp := analyzeMaxTemp(rows, func(r FanStressRow) float64 {
return r.CPUMaxTempC
})
fanResponseSec := analyzeFanResponse(rows)
fmt.Fprintf(&summary, "throttling_detected=%v\n", throttled)
fmt.Fprintf(&summary, "max_gpu_temp_c=%.1f\n", maxGPUTemp)
fmt.Fprintf(&summary, "max_cpu_temp_c=%.1f\n", maxCPUTemp)
if fanResponseSec >= 0 {
fmt.Fprintf(&summary, "fan_response_sec=%.1f\n", fanResponseSec)
} else {
fmt.Fprintf(&summary, "fan_response_sec=N/A\n")
}
// Throttling failure counts against overall result.
if throttled {
stats.Failed++
}
writeSATStats(&summary, stats)
// Write CSV outputs.
if err := WriteFanStressCSV(filepath.Join(runDir, "metrics.csv"), rows, opts.GPUIndices); err != nil {
return "", err
}
_ = WriteFanSensorsCSV(filepath.Join(runDir, "fan-sensors.csv"), rows)
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err
}
archive := filepath.Join(baseDir, "fan-stress-"+ts+".tar.gz")
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
}
func applyFanStressDefaults(opts *FanStressOptions) {
if opts.BaselineSec <= 0 {
opts.BaselineSec = 30
}
if opts.Phase1DurSec <= 0 {
opts.Phase1DurSec = 300
}
if opts.PauseSec <= 0 {
opts.PauseSec = 60
}
if opts.Phase2DurSec <= 0 {
opts.Phase2DurSec = 300
}
if opts.SizeMB <= 0 {
opts.SizeMB = 64
}
}
// sampleFanStressRow collects all metrics for one telemetry sample.
func sampleFanStressRow(gpuIndices []int, phase string, elapsed float64) FanStressRow {
row := FanStressRow{
TimestampUTC: time.Now().UTC().Format(time.RFC3339),
ElapsedSec: elapsed,
Phase: phase,
}
row.GPUs = sampleGPUStressMetrics(gpuIndices)
row.Fans, _ = sampleFanSpeeds()
row.CPUMaxTempC = sampleCPUMaxTemp()
row.SysPowerW = sampleSystemPower()
return row
}
// sampleGPUStressMetrics queries nvidia-smi for temperature, utilization, power,
// clock frequency, and active throttle reasons for each GPU.
func sampleGPUStressMetrics(gpuIndices []int) []GPUStressMetric {
args := []string{
"--query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics,clocks_throttle_reasons.active",
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
ids := make([]string, len(gpuIndices))
for i, idx := range gpuIndices {
ids[i] = strconv.Itoa(idx)
}
args = append([]string{"--id=" + strings.Join(ids, ",")}, args...)
}
out, err := exec.Command("nvidia-smi", args...).Output()
if err != nil {
return nil
}
var metrics []GPUStressMetric
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
parts := strings.Split(line, ", ")
if len(parts) < 6 {
continue
}
idx, _ := strconv.Atoi(strings.TrimSpace(parts[0]))
throttleVal := strings.TrimSpace(parts[5])
// Throttled if active reasons bitmask is non-zero.
throttled := throttleVal != "0x0000000000000000" &&
throttleVal != "0x0" &&
throttleVal != "0" &&
throttleVal != "" &&
throttleVal != "N/A"
metrics = append(metrics, GPUStressMetric{
Index: idx,
TempC: parseGPUFloat(parts[1]),
UsagePct: parseGPUFloat(parts[2]),
PowerW: parseGPUFloat(parts[3]),
ClockMHz: parseGPUFloat(parts[4]),
Throttled: throttled,
})
}
return metrics
}
// sampleFanSpeeds reads fan RPM values from ipmitool sdr.
func sampleFanSpeeds() ([]FanReading, error) {
out, err := exec.Command("ipmitool", "sdr", "type", "Fan").Output()
if err != nil {
return nil, err
}
return parseFanSpeeds(string(out)), nil
}
// parseFanSpeeds parses "ipmitool sdr type Fan" output.
// Line format: "FAN1 | 2400.000 | RPM | ok"
func parseFanSpeeds(raw string) []FanReading {
var fans []FanReading
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
parts := strings.Split(line, "|")
if len(parts) < 3 {
continue
}
unit := strings.TrimSpace(parts[2])
if !strings.EqualFold(unit, "RPM") {
continue
}
valStr := strings.TrimSpace(parts[1])
if strings.EqualFold(valStr, "na") || strings.EqualFold(valStr, "disabled") || valStr == "" {
continue
}
val, err := strconv.ParseFloat(valStr, 64)
if err != nil {
continue
}
fans = append(fans, FanReading{
Name: strings.TrimSpace(parts[0]),
RPM: val,
})
}
return fans
}
// sampleCPUMaxTemp returns the highest reading from any IPMI temperature sensor
// (CPU, inlet, and others), falling back to lm-sensors if ipmitool fails.
func sampleCPUMaxTemp() float64 {
out, err := exec.Command("ipmitool", "sdr", "type", "Temperature").Output()
if err != nil {
return sampleCPUTempViaSensors()
}
return parseIPMIMaxTemp(string(out))
}
// parseIPMIMaxTemp extracts the maximum temperature from "ipmitool sdr type Temperature".
func parseIPMIMaxTemp(raw string) float64 {
var max float64
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
parts := strings.Split(line, "|")
if len(parts) < 3 {
continue
}
unit := strings.TrimSpace(parts[2])
if !strings.Contains(strings.ToLower(unit), "degrees") {
continue
}
valStr := strings.TrimSpace(parts[1])
if strings.EqualFold(valStr, "na") || valStr == "" {
continue
}
val, err := strconv.ParseFloat(valStr, 64)
if err != nil {
continue
}
if val > max {
max = val
}
}
return max
}
// sampleCPUTempViaSensors falls back to lm-sensors when ipmitool is unavailable.
func sampleCPUTempViaSensors() float64 {
out, err := exec.Command("sensors", "-u").Output()
if err != nil {
return 0
}
var max float64
for _, line := range strings.Split(string(out), "\n") {
line = strings.TrimSpace(line)
fields := strings.Fields(line)
if len(fields) < 2 {
continue
}
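// Only temperature inputs (tempN_input); "sensors -u" also prints fanN_input, inN_input, etc.
if !strings.HasPrefix(fields[0], "temp") || !strings.HasSuffix(fields[0], "_input:") {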
continue
}
val, err := strconv.ParseFloat(fields[1], 64)
if err != nil {
continue
}
if val > 0 && val < 150 && val > max {
max = val
}
}
return max
}
// sampleSystemPower reads system power draw via DCMI.
func sampleSystemPower() float64 {
out, err := exec.Command("ipmitool", "dcmi", "power", "reading").Output()
if err != nil {
return 0
}
return parseDCMIPowerReading(string(out))
}
// parseDCMIPowerReading extracts the instantaneous power reading from ipmitool dcmi output.
// Sample: " Instantaneous power reading: 500 Watts"
func parseDCMIPowerReading(raw string) float64 {
for _, line := range strings.Split(raw, "\n") {
if !strings.Contains(strings.ToLower(line), "instantaneous") {
continue
}
parts := strings.Fields(line)
for i, p := range parts {
if strings.EqualFold(p, "Watts") && i > 0 {
val, err := strconv.ParseFloat(parts[i-1], 64)
if err == nil {
return val
}
}
}
}
return 0
}
// analyzeThrottling returns true if any GPU reported an active throttle reason
// during either load phase.
func analyzeThrottling(rows []FanStressRow) bool {
for _, row := range rows {
if row.Phase != "load1" && row.Phase != "load2" {
continue
}
for _, gpu := range row.GPUs {
if gpu.Throttled {
return true
}
}
}
return false
}
// analyzeMaxTemp returns the maximum value of the given extractor across all rows.
func analyzeMaxTemp(rows []FanStressRow, extract func(FanStressRow) float64) float64 {
var max float64
for _, row := range rows {
if v := extract(row); v > max {
max = v
}
}
return max
}
// analyzeFanResponse returns the seconds from the first load1 sample until the
// average fan RPM first reaches at least 5% above the baseline average.
// Returns -1 if there is no baseline or load1 data, or if the threshold is never reached.
func analyzeFanResponse(rows []FanStressRow) float64 {
// Compute baseline average fan RPM.
var baseTotal, baseCount float64
for _, row := range rows {
if row.Phase != "baseline" {
continue
}
for _, f := range row.Fans {
baseTotal += f.RPM
baseCount++
}
}
if baseCount == 0 || baseTotal == 0 {
return -1
}
baseAvg := baseTotal / baseCount
threshold := baseAvg * 1.05 // 5% increase signals fan ramp-up
// Find elapsed time when load1 started.
var load1Start float64 = -1
for _, row := range rows {
if row.Phase == "load1" {
load1Start = row.ElapsedSec
break
}
}
if load1Start < 0 {
return -1
}
// Find first load1 row where average RPM crosses the threshold.
for _, row := range rows {
if row.Phase != "load1" {
continue
}
var total, count float64
for _, f := range row.Fans {
total += f.RPM
count++
}
if count > 0 && total/count >= threshold {
return row.ElapsedSec - load1Start
}
}
return -1
}
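// WriteFanStressCSV writes the wide-format metrics CSV with one row per sample
// (nominally one per second). GPU columns are generated per index in gpuIndices
// order; when gpuIndices is empty, no GPU columns are written, and a GPU missing
// from a sample is written as zeros.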
func WriteFanStressCSV(path string, rows []FanStressRow, gpuIndices []int) error {
if len(rows) == 0 {
return os.WriteFile(path, []byte("no data\n"), 0644)
}
var b strings.Builder
// Header: fixed system columns + per-GPU columns.
b.WriteString("timestamp_utc,elapsed_sec,phase,fan_avg_rpm,fan_min_rpm,fan_max_rpm,cpu_max_temp_c,sys_power_w")
for _, idx := range gpuIndices {
fmt.Fprintf(&b, ",gpu%d_temp_c,gpu%d_usage_pct,gpu%d_power_w,gpu%d_clock_mhz,gpu%d_throttled",
idx, idx, idx, idx, idx)
}
b.WriteRune('\n')
for _, row := range rows {
favg, fmin, fmax := fanRPMStats(row.Fans)
fmt.Fprintf(&b, "%s,%.1f,%s,%.0f,%.0f,%.0f,%.1f,%.1f",
row.TimestampUTC,
row.ElapsedSec,
row.Phase,
favg, fmin, fmax,
row.CPUMaxTempC,
row.SysPowerW,
)
gpuByIdx := make(map[int]GPUStressMetric, len(row.GPUs))
for _, g := range row.GPUs {
gpuByIdx[g.Index] = g
}
for _, idx := range gpuIndices {
g := gpuByIdx[idx]
throttled := 0
if g.Throttled {
throttled = 1
}
fmt.Fprintf(&b, ",%.1f,%.1f,%.1f,%.0f,%d",
g.TempC, g.UsagePct, g.PowerW, g.ClockMHz, throttled)
}
b.WriteRune('\n')
}
return os.WriteFile(path, []byte(b.String()), 0644)
}
// WriteFanSensorsCSV writes individual fan sensor readings in long (tidy) format.
func WriteFanSensorsCSV(path string, rows []FanStressRow) error {
var b strings.Builder
b.WriteString("timestamp_utc,elapsed_sec,phase,fan_name,rpm\n")
for _, row := range rows {
for _, f := range row.Fans {
fmt.Fprintf(&b, "%s,%.1f,%s,%s,%.0f\n",
row.TimestampUTC, row.ElapsedSec, row.Phase, f.Name, f.RPM)
}
}
return os.WriteFile(path, []byte(b.String()), 0644)
}
// fanRPMStats computes average, min, max RPM across all fans in a sample row.
func fanRPMStats(fans []FanReading) (avg, min, max float64) {
if len(fans) == 0 {
return 0, 0, 0
}
min = fans[0].RPM
max = fans[0].RPM
var total float64
for _, f := range fans {
total += f.RPM
if f.RPM < min {
min = f.RPM
}
if f.RPM > max {
max = f.RPM
}
}
return total / float64(len(fans)), min, max
}
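
The fan-response analysis above can be illustrated with a minimal standalone sketch. The `sample` type and the `fanResponseSec` name are hypothetical simplifications for illustration (each row carries a precomputed mean RPM instead of per-fan readings); the 5% threshold and the -1 sentinel follow analyzeFanResponse.

```go
package main

import "fmt"

// sample is a hypothetical, simplified stand-in for FanStressRow:
// one telemetry row with a phase label and the mean fan RPM of that row.
type sample struct {
	Phase      string
	ElapsedSec float64
	AvgRPM     float64
}

// fanResponseSec returns the seconds between the first "load1" sample and the
// first "load1" sample whose RPM is at least 5% above the baseline mean.
// It returns -1 when there is no baseline data or the threshold is never reached.
func fanResponseSec(rows []sample) float64 {
	var total, count float64
	for _, r := range rows {
		if r.Phase == "baseline" {
			total += r.AvgRPM
			count++
		}
	}
	if count == 0 || total == 0 {
		return -1
	}
	threshold := total / count * 1.05

	start := -1.0
	for _, r := range rows {
		if r.Phase != "load1" {
			continue
		}
		if start < 0 {
			start = r.ElapsedSec // first load1 sample marks the start of load
		}
		if r.AvgRPM >= threshold {
			return r.ElapsedSec - start
		}
	}
	return -1
}

func main() {
	rows := []sample{
		{Phase: "baseline", ElapsedSec: 1, AvgRPM: 2000},
		{Phase: "baseline", ElapsedSec: 2, AvgRPM: 2000},
		{Phase: "load1", ElapsedSec: 3, AvgRPM: 2000},
		{Phase: "load1", ElapsedSec: 4, AvgRPM: 2050},
		{Phase: "load1", ElapsedSec: 5, AvgRPM: 2200},
	}
	fmt.Println(fanResponseSec(rows)) // prints 2
}
```

Because the sampler ticks once per second, the reported lag has roughly one-second resolution.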


@@ -3,6 +3,8 @@ package platform
import (
"errors"
"os"
"os/exec"
"path/filepath"
"testing"
)
@@ -31,6 +33,9 @@ func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
if got := jobs[4].cmd[0]; got != "bee-gpu-stress" {
t.Fatalf("gpu stress command=%q want bee-gpu-stress", got)
}
if got := jobs[3].cmd[1]; got != "--output-file" {
t.Fatalf("bug report flag=%q want --output-file", got)
}
}
func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
@@ -76,6 +81,7 @@ func TestClassifySATResult(t *testing.T) {
{name: "ok", job: "memtester", out: "done", err: nil, status: "OK"},
{name: "unsupported", job: "smartctl-self-test-short", out: "Self-test not supported", err: errors.New("rc 1"), status: "UNSUPPORTED"},
{name: "failed", job: "bee-gpu-stress", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
{name: "cuda not ready", job: "bee-gpu-stress", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
}
for _, tt := range tests {
@@ -87,3 +93,90 @@ func TestClassifySATResult(t *testing.T) {
})
}
}
func TestParseStorageDevicesSkipsUSBDisks(t *testing.T) {
t.Parallel()
raw := "nvme0n1 disk nvme\nsda disk usb\nloop0 loop\nsdb disk sata\n"
got := parseStorageDevices(raw)
want := []string{"/dev/nvme0n1", "/dev/sdb"}
if len(got) != len(want) {
t.Fatalf("len(devices)=%d want %d (%v)", len(got), len(want), got)
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("devices[%d]=%q want %q", i, got[i], want[i])
}
}
}
func TestResolveROCmSMICommandFromPATH(t *testing.T) {
t.Setenv("PATH", t.TempDir())
toolPath := filepath.Join(os.Getenv("PATH"), "rocm-smi")
if err := os.WriteFile(toolPath, []byte("#!/bin/sh\nexit 0\n"), 0755); err != nil {
t.Fatalf("write rocm-smi: %v", err)
}
cmd, err := resolveROCmSMICommand("--showproductname")
if err != nil {
t.Fatalf("resolveROCmSMICommand error: %v", err)
}
if len(cmd) != 2 {
t.Fatalf("cmd len=%d want 2 (%v)", len(cmd), cmd)
}
if cmd[0] != toolPath {
t.Fatalf("cmd[0]=%q want %q", cmd[0], toolPath)
}
}
func TestResolveROCmSMICommandFallsBackToROCmTree(t *testing.T) {
tmp := t.TempDir()
execPath := filepath.Join(tmp, "opt", "rocm", "bin", "rocm-smi")
if err := os.MkdirAll(filepath.Dir(execPath), 0755); err != nil {
t.Fatalf("mkdir: %v", err)
}
if err := os.WriteFile(execPath, []byte("#!/bin/sh\nexit 0\n"), 0755); err != nil {
t.Fatalf("write rocm-smi: %v", err)
}
oldGlob := rocmSMIExecutableGlobs
oldScriptGlobs := rocmSMIScriptGlobs
rocmSMIExecutableGlobs = []string{execPath}
rocmSMIScriptGlobs = nil
t.Cleanup(func() {
rocmSMIExecutableGlobs = oldGlob
rocmSMIScriptGlobs = oldScriptGlobs
})
t.Setenv("PATH", "")
cmd, err := resolveROCmSMICommand("--showallinfo")
if err != nil {
t.Fatalf("resolveROCmSMICommand error: %v", err)
}
if len(cmd) != 2 {
t.Fatalf("cmd len=%d want 2 (%v)", len(cmd), cmd)
}
if cmd[0] != execPath {
t.Fatalf("cmd[0]=%q want %q", cmd[0], execPath)
}
}
func TestRunROCmSMIReportsMissingCommand(t *testing.T) {
oldLookPath := satLookPath
oldExecGlobs := rocmSMIExecutableGlobs
oldScriptGlobs := rocmSMIScriptGlobs
satLookPath = func(string) (string, error) { return "", exec.ErrNotFound }
rocmSMIExecutableGlobs = nil
rocmSMIScriptGlobs = nil
t.Cleanup(func() {
satLookPath = oldLookPath
rocmSMIExecutableGlobs = oldExecGlobs
rocmSMIScriptGlobs = oldScriptGlobs
})
if _, err := runROCmSMI("--showproductname"); err == nil {
t.Fatal("expected missing rocm-smi error")
}
}
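
The tests above replace `satLookPath` and the glob lists and restore them in `t.Cleanup`. The declarations of those seams are not part of this diff, so the following is only a sketch of how such a seam might look, assuming a package-level function variable that defaults to `exec.LookPath`. The names `lookPath` and `resolveTool` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// lookPath is a hypothetical seam in the style of satLookPath: production code
// calls it instead of exec.LookPath so that tests can swap the implementation.
var lookPath = exec.LookPath

// resolveTool returns the resolved path for name followed by args,
// or an error when the tool cannot be found.
func resolveTool(name string, args ...string) ([]string, error) {
	path, err := lookPath(name)
	if err != nil {
		return nil, errors.New(name + " not found in PATH")
	}
	return append([]string{path}, args...), nil
}

func main() {
	// Swap the seam the way a test would, then restore it afterwards.
	old := lookPath
	lookPath = func(string) (string, error) { return "/opt/fake/bin/tool", nil }
	defer func() { lookPath = old }()

	cmd, err := resolveTool("tool", "--version")
	fmt.Println(cmd, err)
}
```

Tests that mutate package-level variables like this should not call `t.Parallel()`, which is consistent with the ROCm tests above omitting it.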


@@ -0,0 +1,150 @@
package platform
import (
"encoding/json"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
)
var techDumpFixedCommands = []struct {
Name string
Args []string
File string
}{
{Name: "dmidecode", Args: []string{"-t", "0"}, File: "dmidecode-type0.txt"},
{Name: "dmidecode", Args: []string{"-t", "1"}, File: "dmidecode-type1.txt"},
{Name: "dmidecode", Args: []string{"-t", "2"}, File: "dmidecode-type2.txt"},
{Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"},
{Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"},
{Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"},
{Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"},
{Name: "sensors", Args: []string{"-j"}, File: "sensors.json"},
{Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"},
{Name: "ipmitool", Args: []string{"sdr"}, File: "ipmitool-sdr.txt"},
{Name: "nvme", Args: []string{"list", "-o", "json"}, File: "nvme-list.json"},
}
var techDumpNvidiaCommands = []struct {
Name string
Args []string
File string
}{
{Name: "nvidia-smi", Args: []string{"-q"}, File: "nvidia-smi-q.txt"},
{Name: "nvidia-smi", Args: []string{"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown", "--format=csv,noheader,nounits"}, File: "nvidia-smi-query.csv"},
}
type lsblkDumpRoot struct {
Blockdevices []struct {
Name string `json:"name"`
Type string `json:"type"`
Tran string `json:"tran"`
} `json:"blockdevices"`
}
type nvmeDumpRoot struct {
Devices []struct {
DevicePath string `json:"DevicePath"`
} `json:"Devices"`
}
func (s *System) CaptureTechnicalDump(baseDir string) error {
if err := os.MkdirAll(baseDir, 0755); err != nil {
return err
}
for _, cmd := range techDumpFixedCommands {
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
}
switch s.DetectGPUVendor() {
case "nvidia":
for _, cmd := range techDumpNvidiaCommands {
writeCommandDump(filepath.Join(baseDir, cmd.File), cmd.Name, cmd.Args...)
}
case "amd":
writeROCmSMIDump(filepath.Join(baseDir, "rocm-smi.txt"))
writeROCmSMIDump(filepath.Join(baseDir, "rocm-smi-showallinfo.txt"), "--showallinfo")
}
for _, dev := range lsblkDumpDevices(filepath.Join(baseDir, "lsblk.json")) {
writeCommandDump(filepath.Join(baseDir, "smartctl-"+sanitizeDumpName(dev)+".json"), "smartctl", "-j", "-a", "/dev/"+dev)
}
for _, dev := range nvmeDumpDevices(filepath.Join(baseDir, "nvme-list.json")) {
writeCommandDump(filepath.Join(baseDir, "nvme-id-ctrl-"+sanitizeDumpName(dev)+".json"), "nvme", "id-ctrl", dev, "-o", "json")
writeCommandDump(filepath.Join(baseDir, "nvme-smart-log-"+sanitizeDumpName(dev)+".json"), "nvme", "smart-log", dev, "-o", "json")
}
return nil
}
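// writeCommandDump runs name with args and writes the combined stdout/stderr to path.
// The file is skipped only when the command fails without producing any output;
// write errors are deliberately ignored.
func writeCommandDump(path, name string, args ...string) {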
out, err := exec.Command(name, args...).CombinedOutput()
if err != nil && len(out) == 0 {
return
}
_ = os.WriteFile(path, out, 0644)
}
func writeROCmSMIDump(path string, args ...string) {
out, err := runROCmSMI(args...)
if err != nil && len(out) == 0 {
return
}
_ = os.WriteFile(path, out, 0644)
}
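// lsblkDumpDevices reads the lsblk JSON dump at path and returns the sorted names
// of non-USB block devices of type "disk". It returns nil if the file cannot be
// read or parsed.
func lsblkDumpDevices(path string) []string {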
raw, err := os.ReadFile(path)
if err != nil {
return nil
}
var root lsblkDumpRoot
if err := json.Unmarshal(raw, &root); err != nil {
return nil
}
var devices []string
for _, dev := range root.Blockdevices {
if strings.EqualFold(strings.TrimSpace(dev.Tran), "usb") {
continue
}
if dev.Type == "disk" && strings.TrimSpace(dev.Name) != "" {
devices = append(devices, strings.TrimSpace(dev.Name))
}
}
sort.Strings(devices)
return devices
}
func nvmeDumpDevices(path string) []string {
raw, err := os.ReadFile(path)
if err != nil {
return nil
}
var root nvmeDumpRoot
if err := json.Unmarshal(raw, &root); err != nil {
return nil
}
seen := map[string]bool{}
var devices []string
for _, dev := range root.Devices {
name := strings.TrimSpace(dev.DevicePath)
if name == "" || seen[name] {
continue
}
seen[name] = true
devices = append(devices, name)
}
sort.Strings(devices)
return devices
}
func sanitizeDumpName(value string) string {
value = strings.TrimSpace(value)
value = strings.TrimPrefix(value, "/dev/")
value = strings.ReplaceAll(value, "/", "_")
if value == "" {
return "unknown"
}
return value
}
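
Note on the dump file naming above: a minimal, self-contained sketch of how device paths appear to map to dump file names. `sanitizeDumpName` is copied from the diff; `dumpFileName` is a hypothetical helper introduced only for illustration and is not part of the change.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeDumpName mirrors the helper in the diff: it strips the /dev/ prefix
// and replaces any remaining path separators so the result is a safe file name.
func sanitizeDumpName(value string) string {
	value = strings.TrimSpace(value)
	value = strings.TrimPrefix(value, "/dev/")
	value = strings.ReplaceAll(value, "/", "_")
	if value == "" {
		return "unknown"
	}
	return value
}

// dumpFileName is a hypothetical helper (not in the diff) that reproduces the
// "<prefix>-<device>.json" pattern used when the per-device dumps are written.
func dumpFileName(prefix, dev string) string {
	return prefix + "-" + sanitizeDumpName(dev) + ".json"
}

func main() {
	for _, dev := range []string{"/dev/nvme0n1", "sdb", "/dev/mapper/vg0-root", "  "} {
		fmt.Println(dumpFileName("smartctl", dev))
	}
}
```

Nested paths such as `/dev/mapper/vg0-root` are flattened with underscores, and a blank device name falls back to `unknown` rather than producing an empty file name.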


@@ -0,0 +1,48 @@
package platform
import (
"os"
"path/filepath"
"reflect"
"testing"
)
func TestLSBLKDumpDevices(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "lsblk.json")
if err := os.WriteFile(path, []byte(`{"blockdevices":[{"name":"sda","type":"disk","tran":"usb"},{"name":"sda1","type":"part"},{"name":"nvme0n1","type":"disk","tran":"nvme"},{"name":"sdb","type":"disk","tran":"sata"}]}`), 0644); err != nil {
t.Fatalf("write lsblk fixture: %v", err)
}
got := lsblkDumpDevices(path)
want := []string{"nvme0n1", "sdb"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("lsblkDumpDevices=%v want %v", got, want)
}
}
func TestNVMEDumpDevices(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "nvme-list.json")
if err := os.WriteFile(path, []byte(`{"Devices":[{"DevicePath":"/dev/nvme1n1"},{"DevicePath":"/dev/nvme0n1"},{"DevicePath":"/dev/nvme1n1"}]}`), 0644); err != nil {
t.Fatalf("write nvme fixture: %v", err)
}
got := nvmeDumpDevices(path)
want := []string{"/dev/nvme0n1", "/dev/nvme1n1"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("nvmeDumpDevices=%v want %v", got, want)
}
}
func TestSanitizeDumpName(t *testing.T) {
t.Parallel()
if got := sanitizeDumpName("/dev/nvme0n1"); got != "nvme0n1" {
t.Fatalf("sanitizeDumpName=%q want nvme0n1", got)
}
}
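
Note on the lsblk test above: the same fixture can be exercised without touching the filesystem. The sketch below assumes an in-memory variant of the filter; `filterDisks` and `lsblkRoot` are hypothetical names, and the logic is a restatement of `lsblkDumpDevices` from this diff rather than the project's own code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// lsblkRoot models only the fields the filter needs from `lsblk -J` output.
type lsblkRoot struct {
	Blockdevices []struct {
		Name string `json:"name"`
		Type string `json:"type"`
		Tran string `json:"tran"`
	} `json:"blockdevices"`
}

// filterDisks is a hypothetical in-memory variant of lsblkDumpDevices:
// it keeps devices of type "disk", skips USB transports, and sorts the names.
func filterDisks(raw []byte) []string {
	var root lsblkRoot
	if err := json.Unmarshal(raw, &root); err != nil {
		return nil
	}
	var devices []string
	for _, dev := range root.Blockdevices {
		name := strings.TrimSpace(dev.Name)
		if dev.Type != "disk" || name == "" {
			continue
		}
		if strings.EqualFold(strings.TrimSpace(dev.Tran), "usb") {
			continue
		}
		devices = append(devices, name)
	}
	sort.Strings(devices)
	return devices
}

func main() {
	raw := []byte(`{"blockdevices":[{"name":"sda","type":"disk","tran":"usb"},{"name":"sda1","type":"part"},{"name":"nvme0n1","type":"disk","tran":"nvme"},{"name":"sdb","type":"disk","tran":"sata"}]}`)
	fmt.Println(filterDisks(raw))
}
```

Separating parsing from file I/O this way would let the filter be tested against raw bytes directly; whether that split is worth it for this package is a judgment call.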


@@ -10,9 +10,47 @@ type HardwareIngestRequest struct {
Protocol *string `json:"protocol,omitempty"`
TargetHost *string `json:"target_host,omitempty"`
CollectedAt string `json:"collected_at"`
Runtime *RuntimeHealth `json:"runtime,omitempty"`
Hardware HardwareSnapshot `json:"hardware"`
}
type RuntimeHealth struct {
Status string `json:"status"`
CheckedAt string `json:"checked_at"`
ExportDir string `json:"export_dir,omitempty"`
DriverReady bool `json:"driver_ready,omitempty"`
CUDAReady bool `json:"cuda_ready,omitempty"`
NetworkStatus string `json:"network_status,omitempty"`
Issues []RuntimeIssue `json:"issues,omitempty"`
Tools []RuntimeToolStatus `json:"tools,omitempty"`
Services []RuntimeServiceStatus `json:"services,omitempty"`
Interfaces []RuntimeInterface `json:"interfaces,omitempty"`
}
type RuntimeIssue struct {
Code string `json:"code"`
Severity string `json:"severity,omitempty"`
Description string `json:"description"`
}
type RuntimeToolStatus struct {
Name string `json:"name"`
Path string `json:"path,omitempty"`
OK bool `json:"ok"`
}
type RuntimeServiceStatus struct {
Name string `json:"name"`
Status string `json:"status"`
}
type RuntimeInterface struct {
Name string `json:"name"`
State string `json:"state,omitempty"`
IPv4 []string `json:"ipv4,omitempty"`
Outcome string `json:"outcome,omitempty"`
}
type HardwareSnapshot struct {
Board HardwareBoard `json:"board"`
Firmware []HardwareFirmwareRecord `json:"firmware,omitempty"`
@@ -22,6 +60,7 @@ type HardwareSnapshot struct {
PCIeDevices []HardwarePCIeDevice `json:"pcie_devices,omitempty"`
PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"`
Sensors *HardwareSensors `json:"sensors,omitempty"`
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
}
type HardwareHealthSummary struct {
@@ -148,7 +187,7 @@ type HardwarePCIeDevice struct {
SFPRXPowerDBM *float64 `json:"sfp_rx_power_dbm,omitempty"`
SFPVoltageV *float64 `json:"sfp_voltage_v,omitempty"`
SFPBiasMA *float64 `json:"sfp_bias_ma,omitempty"`
BDF *string `json:"bdf,omitempty"`
BDF *string `json:"-"`
DeviceClass *string `json:"device_class,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
Model *string `json:"model,omitempty"`
@@ -183,11 +222,12 @@ type HardwarePowerSupply struct {
}
type HardwareComponentStatus struct {
Status *string `json:"status,omitempty"`
StatusCheckedAt *string `json:"status_checked_at,omitempty"`
StatusChangedAt *string `json:"status_changed_at,omitempty"`
StatusHistory []HardwareStatusHistory `json:"status_history,omitempty"`
ErrorDescription *string `json:"error_description,omitempty"`
Status *string `json:"status,omitempty"`
StatusCheckedAt *string `json:"status_checked_at,omitempty"`
StatusChangedAt *string `json:"status_changed_at,omitempty"`
StatusHistory []HardwareStatusHistory `json:"status_history,omitempty"`
ErrorDescription *string `json:"error_description,omitempty"`
ManufacturedYearWeek *string `json:"manufactured_year_week,omitempty"`
}
type HardwareStatusHistory struct {
@@ -235,3 +275,15 @@ type HardwareOtherSensor struct {
Unit *string `json:"unit,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwareEventLog struct {
Source string `json:"source"`
EventTime *string `json:"event_time,omitempty"`
Severity *string `json:"severity,omitempty"`
MessageID *string `json:"message_id,omitempty"`
Message string `json:"message"`
ComponentRef *string `json:"component_ref,omitempty"`
Fingerprint *string `json:"fingerprint,omitempty"`
IsActive *bool `json:"is_active,omitempty"`
RawPayload map[string]any `json:"raw_payload,omitempty"`
}
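
Note on the new `HardwareEventLog` type: a small sketch of how the pointer fields interact with `omitempty` when marshaled. The struct is copied from the diff; `encodeEvent` and the sample values are hypothetical and exist only for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HardwareEventLog is copied from the schema diff so the sketch is self-contained.
type HardwareEventLog struct {
	Source       string         `json:"source"`
	EventTime    *string        `json:"event_time,omitempty"`
	Severity     *string        `json:"severity,omitempty"`
	MessageID    *string        `json:"message_id,omitempty"`
	Message      string         `json:"message"`
	ComponentRef *string        `json:"component_ref,omitempty"`
	Fingerprint  *string        `json:"fingerprint,omitempty"`
	IsActive     *bool          `json:"is_active,omitempty"`
	RawPayload   map[string]any `json:"raw_payload,omitempty"`
}

// encodeEvent is a hypothetical helper that returns the JSON form of one event,
// or an empty string if marshaling fails.
func encodeEvent(ev HardwareEventLog) string {
	data, err := json.Marshal(ev)
	if err != nil {
		return ""
	}
	return string(data)
}

func main() {
	active := false
	ev := HardwareEventLog{
		Source:   "bmc",
		Message:  "fan 3 below threshold", // sample text, not from the diff
		IsActive: &active,
	}
	// Nil pointers are omitted; a non-nil *bool is emitted even when false.
	fmt.Println(encodeEvent(ev))
}
```

Because `IsActive` is a `*bool`, an explicit `false` survives `omitempty`, which a plain `bool` field would not; this is presumably why the optional fields are pointers.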


@@ -0,0 +1,46 @@
package schema
import (
"encoding/json"
"strings"
"testing"
)
func TestHardwareSnapshotMarshalsNewContractFields(t *testing.T) {
week := "2024-W07"
eventTime := "2026-03-15T14:03:11Z"
message := "Correctable ECC error threshold exceeded"
payload := HardwareIngestRequest{
CollectedAt: "2026-03-15T15:00:00Z",
Hardware: HardwareSnapshot{
Board: HardwareBoard{SerialNumber: "SRV-001"},
CPUs: []HardwareCPU{
{
HardwareComponentStatus: HardwareComponentStatus{
ManufacturedYearWeek: &week,
},
},
},
EventLogs: []HardwareEventLog{
{
Source: "bmc",
EventTime: &eventTime,
Message: message,
},
},
},
}
data, err := json.Marshal(payload)
if err != nil {
t.Fatalf("marshal: %v", err)
}
text := string(data)
if !strings.Contains(text, `"manufactured_year_week":"2024-W07"`) {
t.Fatalf("missing manufactured_year_week: %s", text)
}
if !strings.Contains(text, `"event_logs":[{"source":"bmc","event_time":"2026-03-15T14:03:11Z","message":"Correctable ECC error threshold exceeded"}]`) {
t.Fatalf("missing event_logs payload: %s", text)
}
}


@@ -1,119 +0,0 @@
package tui
import tea "github.com/charmbracelet/bubbletea"
func (m model) updateStaticForm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch msg.String() {
case "esc":
m.screen = screenNetwork
m.formFields = nil
m.formIndex = 0
return m, nil
case "up", "shift+tab":
if m.formIndex > 0 {
m.formIndex--
}
case "down", "tab":
if m.formIndex < len(m.formFields)-1 {
m.formIndex++
}
case "enter":
if m.formIndex < len(m.formFields)-1 {
m.formIndex++
return m, nil
}
cfg := m.app.ParseStaticIPv4Config(m.selectedIface, []string{
m.formFields[0].Value,
m.formFields[1].Value,
m.formFields[2].Value,
m.formFields[3].Value,
})
m.busy = true
m.busyTitle = "Static IPv4: " + m.selectedIface
return m, func() tea.Msg {
result, err := m.app.SetStaticIPv4Result(cfg)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
}
case "backspace":
field := &m.formFields[m.formIndex]
if len(field.Value) > 0 {
field.Value = field.Value[:len(field.Value)-1]
}
default:
if msg.Type == tea.KeyRunes && len(msg.Runes) > 0 {
m.formFields[m.formIndex].Value += string(msg.Runes)
}
}
return m, nil
}
func (m model) updateConfirm(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch msg.String() {
case "left", "up", "tab":
if m.cursor > 0 {
m.cursor--
}
case "right", "down":
if m.cursor < 1 {
m.cursor++
}
case "esc":
m.screen = m.confirmCancelTarget()
m.cursor = 0
m.pendingAction = actionNone
return m, nil
case "enter":
if m.cursor == 1 {
m.screen = m.confirmCancelTarget()
m.cursor = 0
m.pendingAction = actionNone
return m, nil
}
m.busy = true
switch m.pendingAction {
case actionExportAudit:
m.busyTitle = "Export audit"
target := *m.selectedTarget
return m, func() tea.Msg {
result, err := m.app.ExportLatestAuditResult(target)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
}
case actionRunNvidiaSAT:
m.busyTitle = "NVIDIA SAT"
return m, func() tea.Msg {
result, err := m.app.RunNvidiaAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
}
case actionRunMemorySAT:
m.busyTitle = "Memory SAT"
return m, func() tea.Msg {
result, err := m.app.RunMemoryAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
}
case actionRunStorageSAT:
m.busyTitle = "Storage SAT"
return m, func() tea.Msg {
result, err := m.app.RunStorageAcceptancePackResult("")
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenAcceptance}
}
}
case "ctrl+c":
return m, tea.Quit
}
return m, nil
}
func (m model) confirmCancelTarget() screen {
switch m.pendingAction {
case actionExportAudit:
return screenExportTargets
case actionRunNvidiaSAT:
fallthrough
case actionRunMemorySAT:
fallthrough
case actionRunStorageSAT:
return screenAcceptance
default:
return screenMain
}
}


@@ -1,29 +0,0 @@
package tui
import "bee/audit/internal/platform"
type resultMsg struct {
title string
body string
err error
back screen
}
type servicesMsg struct {
services []string
err error
}
type interfacesMsg struct {
ifaces []platform.InterfaceInfo
err error
}
type exportTargetsMsg struct {
targets []platform.RemovableTarget
err error
}
type bannerMsg struct {
text string
}


@@ -1,21 +0,0 @@
package tui
import tea "github.com/charmbracelet/bubbletea"
func (m model) handleAcceptanceMenu() (tea.Model, tea.Cmd) {
if m.cursor == 3 {
m.screen = screenMain
m.cursor = 0
return m, nil
}
switch m.cursor {
case 0:
m.pendingAction = actionRunNvidiaSAT
case 1:
m.pendingAction = actionRunMemorySAT
case 2:
m.pendingAction = actionRunStorageSAT
}
m.screen = screenConfirm
return m, nil
}


@@ -1,14 +0,0 @@
package tui
import tea "github.com/charmbracelet/bubbletea"
func (m model) handleExportTargetsMenu() (tea.Model, tea.Cmd) {
if len(m.targets) == 0 {
return m, resultCmd("Export audit", "No removable filesystems found", nil, screenMain)
}
target := m.targets[m.cursor]
m.selectedTarget = &target
m.pendingAction = actionExportAudit
m.screen = screenConfirm
return m, nil
}


@@ -1,63 +0,0 @@
package tui
import (
tea "github.com/charmbracelet/bubbletea"
)
func (m model) handleMainMenu() (tea.Model, tea.Cmd) {
switch m.cursor {
case 0:
m.screen = screenNetwork
m.cursor = 0
return m, nil
case 1:
m.busy = true
m.busyTitle = "Services"
return m, func() tea.Msg {
services, err := m.app.ListBeeServices()
return servicesMsg{services: services, err: err}
}
case 2:
m.screen = screenAcceptance
m.cursor = 0
return m, nil
case 3:
m.busy = true
m.busyTitle = "Run audit"
return m, func() tea.Msg {
result, err := m.app.RunAuditNow(m.runtimeMode)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenMain}
}
case 4:
m.busy = true
m.busyTitle = "Export audit"
return m, func() tea.Msg {
targets, err := m.app.ListRemovableTargets()
return exportTargetsMsg{targets: targets, err: err}
}
case 5:
m.busy = true
m.busyTitle = "Required tools"
return m, func() tea.Msg {
result := m.app.ToolCheckResult([]string{"dmidecode", "smartctl", "nvme", "ipmitool", "lspci", "ethtool", "bee", "nvidia-smi", "bee-gpu-stress", "memtester", "dhclient", "lsblk", "mount"})
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
}
case 6:
m.busy = true
m.busyTitle = "Health summary"
return m, func() tea.Msg {
result := m.app.HealthSummaryResult()
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
}
case 7:
m.busy = true
m.busyTitle = "Audit logs"
return m, func() tea.Msg {
result := m.app.AuditLogTailResult()
return resultMsg{title: result.Title, body: result.Body, back: screenMain}
}
case 8:
return m, tea.Quit
}
return m, nil
}


@@ -1,76 +0,0 @@
package tui
import (
"strings"
tea "github.com/charmbracelet/bubbletea"
)
func (m model) handleNetworkMenu() (tea.Model, tea.Cmd) {
switch m.cursor {
case 0:
m.busy = true
m.busyTitle = "Network status"
return m, func() tea.Msg {
result, err := m.app.NetworkStatus()
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
}
case 1:
m.busy = true
m.busyTitle = "DHCP all interfaces"
return m, func() tea.Msg {
result, err := m.app.DHCPAllResult()
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
}
case 2:
m.pendingAction = actionDHCPOne
m.busy = true
m.busyTitle = "Interfaces"
return m, func() tea.Msg {
ifaces, err := m.app.ListInterfaces()
return interfacesMsg{ifaces: ifaces, err: err}
}
case 3:
m.pendingAction = actionStaticIPv4
m.busy = true
m.busyTitle = "Interfaces"
return m, func() tea.Msg {
ifaces, err := m.app.ListInterfaces()
return interfacesMsg{ifaces: ifaces, err: err}
}
case 4:
m.screen = screenMain
m.cursor = 0
return m, nil
}
return m, nil
}
func (m model) handleInterfacePickMenu() (tea.Model, tea.Cmd) {
if len(m.interfaces) == 0 {
return m, resultCmd("Interfaces", "No physical interfaces found", nil, screenNetwork)
}
m.selectedIface = m.interfaces[m.cursor].Name
switch m.pendingAction {
case actionDHCPOne:
m.busy = true
m.busyTitle = "DHCP on " + m.selectedIface
return m, func() tea.Msg {
result, err := m.app.DHCPOneResult(m.selectedIface)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenNetwork}
}
case actionStaticIPv4:
defaults := m.app.DefaultStaticIPv4FormFields(m.selectedIface)
m.formFields = []formField{
{Label: "IPv4 address", Value: defaults[0]},
{Label: "Prefix", Value: defaults[1]},
{Label: "Gateway", Value: strings.TrimSpace(defaults[2])},
{Label: "DNS (space-separated)", Value: defaults[3]},
}
m.formIndex = 0
m.screen = screenStaticForm
return m, nil
default:
return m, nil
}
}


@@ -1,47 +0,0 @@
package tui
import (
"bee/audit/internal/platform"
tea "github.com/charmbracelet/bubbletea"
)
func (m model) handleServicesMenu() (tea.Model, tea.Cmd) {
if len(m.services) == 0 {
return m, resultCmd("Services", "No bee-* services found.", nil, screenMain)
}
m.selectedService = m.services[m.cursor]
m.screen = screenServiceAction
m.cursor = 0
return m, nil
}
func (m model) handleServiceActionMenu() (tea.Model, tea.Cmd) {
action := m.serviceMenu[m.cursor]
if action == "Back" {
m.screen = screenServices
m.cursor = 0
return m, nil
}
m.busy = true
m.busyTitle = "Service: " + m.selectedService
return m, func() tea.Msg {
switch action {
case "Status":
result, err := m.app.ServiceStatusResult(m.selectedService)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "Restart":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceRestart)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "Start":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStart)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
case "Stop":
result, err := m.app.ServiceActionResult(m.selectedService, platform.ServiceStop)
return resultMsg{title: result.Title, body: result.Body, err: err, back: screenServiceAction}
default:
return resultMsg{title: "Service", body: "Unknown action.", back: screenServiceAction}
}
}
}


@@ -1,588 +0,0 @@
package tui
import (
"strings"
"testing"
"bee/audit/internal/app"
"bee/audit/internal/platform"
"bee/audit/internal/runtimeenv"
tea "github.com/charmbracelet/bubbletea"
)
func newTestModel() model {
return newModel(app.New(platform.New()), runtimeenv.ModeLocal)
}
func sendKey(t *testing.T, m model, key tea.KeyType) model {
t.Helper()
next, _ := m.Update(tea.KeyMsg{Type: key})
return next.(model)
}
func TestUpdateMainMenuCursorNavigation(t *testing.T) {
t.Parallel()
m := newTestModel()
m = sendKey(t, m, tea.KeyDown)
if m.cursor != 1 {
t.Fatalf("cursor=%d want 1 after down", m.cursor)
}
m = sendKey(t, m, tea.KeyDown)
if m.cursor != 2 {
t.Fatalf("cursor=%d want 2 after second down", m.cursor)
}
m = sendKey(t, m, tea.KeyUp)
if m.cursor != 1 {
t.Fatalf("cursor=%d want 1 after up", m.cursor)
}
}
func TestUpdateMainMenuEnterActions(t *testing.T) {
t.Parallel()
tests := []struct {
name string
cursor int
wantScreen screen
wantBusy bool
wantCmd bool
}{
{name: "network", cursor: 0, wantScreen: screenNetwork},
{name: "services", cursor: 1, wantScreen: screenMain, wantBusy: true, wantCmd: true},
{name: "acceptance", cursor: 2, wantScreen: screenAcceptance},
{name: "run audit", cursor: 3, wantScreen: screenMain, wantBusy: true, wantCmd: true},
{name: "export", cursor: 4, wantScreen: screenMain, wantBusy: true, wantCmd: true},
}
for _, test := range tests {
test := test
t.Run(test.name, func(t *testing.T) {
t.Parallel()
m := newTestModel()
m.cursor = test.cursor
next, cmd := m.Update(tea.KeyMsg{Type: tea.KeyEnter})
got := next.(model)
if got.screen != test.wantScreen {
t.Fatalf("screen=%q want %q", got.screen, test.wantScreen)
}
if got.busy != test.wantBusy {
t.Fatalf("busy=%v want %v", got.busy, test.wantBusy)
}
if (cmd != nil) != test.wantCmd {
t.Fatalf("cmd present=%v want %v", cmd != nil, test.wantCmd)
}
})
}
}
func TestUpdateConfirmCancelViaKeys(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenConfirm
m.pendingAction = actionRunNvidiaSAT
next, _ := m.Update(tea.KeyMsg{Type: tea.KeyRight})
got := next.(model)
if got.cursor != 1 {
t.Fatalf("cursor=%d want 1 after right", got.cursor)
}
next, _ = got.Update(tea.KeyMsg{Type: tea.KeyEnter})
got = next.(model)
if got.screen != screenAcceptance {
t.Fatalf("screen=%q want %q", got.screen, screenAcceptance)
}
if got.cursor != 0 {
t.Fatalf("cursor=%d want 0 after cancel", got.cursor)
}
}
func TestMainMenuSimpleTransitions(t *testing.T) {
t.Parallel()
tests := []struct {
name string
cursor int
wantScreen screen
}{
{name: "network", cursor: 0, wantScreen: screenNetwork},
{name: "acceptance", cursor: 2, wantScreen: screenAcceptance},
}
for _, test := range tests {
test := test
t.Run(test.name, func(t *testing.T) {
t.Parallel()
m := newTestModel()
m.cursor = test.cursor
next, cmd := m.handleMainMenu()
got := next.(model)
if cmd != nil {
t.Fatalf("expected nil cmd for %s", test.name)
}
if got.screen != test.wantScreen {
t.Fatalf("screen=%q want %q", got.screen, test.wantScreen)
}
if got.cursor != 0 {
t.Fatalf("cursor=%d want 0", got.cursor)
}
})
}
}
func TestMainMenuAsyncActionsSetBusy(t *testing.T) {
t.Parallel()
tests := []struct {
name string
cursor int
}{
{name: "services", cursor: 1},
{name: "run audit", cursor: 3},
{name: "export", cursor: 4},
{name: "check tools", cursor: 5},
{name: "health summary", cursor: 6},
{name: "log tail", cursor: 7},
}
for _, test := range tests {
test := test
t.Run(test.name, func(t *testing.T) {
t.Parallel()
m := newTestModel()
m.cursor = test.cursor
next, cmd := m.handleMainMenu()
got := next.(model)
if !got.busy {
t.Fatalf("busy=false for %s", test.name)
}
if cmd == nil {
t.Fatalf("expected async cmd for %s", test.name)
}
})
}
}
func TestMainViewIncludesBanner(t *testing.T) {
t.Parallel()
m := newTestModel()
m.banner = "System: Test Server | S/N ABC123\nIP: 10.0.0.10"
view := m.View()
if !strings.Contains(view, "System: Test Server | S/N ABC123") {
t.Fatalf("view missing system banner:\n%s", view)
}
if !strings.Contains(view, "IP: 10.0.0.10") {
t.Fatalf("view missing ip banner:\n%s", view)
}
if !strings.Contains(view, "Select action") {
t.Fatalf("view missing menu subtitle:\n%s", view)
}
}
func TestEscapeNavigation(t *testing.T) {
t.Parallel()
tests := []struct {
name string
screen screen
wantScreen screen
}{
{name: "network to main", screen: screenNetwork, wantScreen: screenMain},
{name: "services to main", screen: screenServices, wantScreen: screenMain},
{name: "acceptance to main", screen: screenAcceptance, wantScreen: screenMain},
{name: "service action to services", screen: screenServiceAction, wantScreen: screenServices},
{name: "export targets to main", screen: screenExportTargets, wantScreen: screenMain},
{name: "interface pick to network", screen: screenInterfacePick, wantScreen: screenNetwork},
}
for _, test := range tests {
test := test
t.Run(test.name, func(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = test.screen
m.cursor = 3
next, _ := m.updateKey(tea.KeyMsg{Type: tea.KeyEsc})
got := next.(model)
if got.screen != test.wantScreen {
t.Fatalf("screen=%q want %q", got.screen, test.wantScreen)
}
if got.cursor != 0 {
t.Fatalf("cursor=%d want 0", got.cursor)
}
})
}
}
func TestOutputScreenReturnsToPreviousScreen(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenOutput
m.prevScreen = screenNetwork
m.title = "title"
m.body = "body"
next, _ := m.updateKey(tea.KeyMsg{Type: tea.KeyEnter})
got := next.(model)
if got.screen != screenNetwork {
t.Fatalf("screen=%q want %q", got.screen, screenNetwork)
}
if got.title != "" || got.body != "" {
t.Fatalf("expected output state cleared, got title=%q body=%q", got.title, got.body)
}
}
func TestAcceptanceConfirmFlow(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenAcceptance
m.cursor = 0
next, cmd := m.handleAcceptanceMenu()
got := next.(model)
if cmd != nil {
t.Fatal("expected nil cmd")
}
if got.screen != screenConfirm {
t.Fatalf("screen=%q want %q", got.screen, screenConfirm)
}
if got.pendingAction != actionRunNvidiaSAT {
t.Fatalf("pendingAction=%q want %q", got.pendingAction, actionRunNvidiaSAT)
}
next, _ = got.updateConfirm(tea.KeyMsg{Type: tea.KeyEsc})
got = next.(model)
if got.screen != screenAcceptance {
t.Fatalf("screen after esc=%q want %q", got.screen, screenAcceptance)
}
}
func TestAcceptanceMenuMapsNewTargets(t *testing.T) {
t.Parallel()
tests := []struct {
cursor int
want actionKind
}{
{cursor: 0, want: actionRunNvidiaSAT},
{cursor: 1, want: actionRunMemorySAT},
{cursor: 2, want: actionRunStorageSAT},
}
for _, test := range tests {
m := newTestModel()
m.screen = screenAcceptance
m.cursor = test.cursor
next, _ := m.handleAcceptanceMenu()
got := next.(model)
if got.pendingAction != test.want {
t.Fatalf("cursor=%d pendingAction=%q want %q", test.cursor, got.pendingAction, test.want)
}
}
}
func TestExportTargetSelectionOpensConfirm(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenExportTargets
m.targets = []platform.RemovableTarget{{Device: "/dev/sdb1", FSType: "vfat", Size: "16G"}}
next, cmd := m.handleExportTargetsMenu()
got := next.(model)
if cmd != nil {
t.Fatal("expected nil cmd")
}
if got.screen != screenConfirm {
t.Fatalf("screen=%q want %q", got.screen, screenConfirm)
}
if got.pendingAction != actionExportAudit {
t.Fatalf("pendingAction=%q want %q", got.pendingAction, actionExportAudit)
}
if got.selectedTarget == nil || got.selectedTarget.Device != "/dev/sdb1" {
t.Fatalf("selectedTarget=%+v want /dev/sdb1", got.selectedTarget)
}
}
func TestInterfacePickStaticIPv4OpensForm(t *testing.T) {
t.Parallel()
m := newTestModel()
m.pendingAction = actionStaticIPv4
m.interfaces = []platform.InterfaceInfo{{Name: "eth0"}}
next, cmd := m.handleInterfacePickMenu()
got := next.(model)
if cmd != nil {
t.Fatal("expected nil cmd")
}
if got.screen != screenStaticForm {
t.Fatalf("screen=%q want %q", got.screen, screenStaticForm)
}
if got.selectedIface != "eth0" {
t.Fatalf("selectedIface=%q want eth0", got.selectedIface)
}
if len(got.formFields) != 4 {
t.Fatalf("len(formFields)=%d want 4", len(got.formFields))
}
}
func TestResultMsgUsesExplicitBackScreen(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenConfirm
next, _ := m.Update(resultMsg{title: "done", body: "ok", back: screenNetwork})
got := next.(model)
if got.screen != screenOutput {
t.Fatalf("screen=%q want %q", got.screen, screenOutput)
}
if got.prevScreen != screenNetwork {
t.Fatalf("prevScreen=%q want %q", got.prevScreen, screenNetwork)
}
}
func TestConfirmCancelTarget(t *testing.T) {
t.Parallel()
m := newTestModel()
m.pendingAction = actionExportAudit
if got := m.confirmCancelTarget(); got != screenExportTargets {
t.Fatalf("export cancel target=%q want %q", got, screenExportTargets)
}
m.pendingAction = actionRunNvidiaSAT
if got := m.confirmCancelTarget(); got != screenAcceptance {
t.Fatalf("sat cancel target=%q want %q", got, screenAcceptance)
}
m.pendingAction = actionNone
if got := m.confirmCancelTarget(); got != screenMain {
t.Fatalf("default cancel target=%q want %q", got, screenMain)
}
}
func TestViewMainMenuRendersSelectedItem(t *testing.T) {
t.Parallel()
m := newTestModel()
m.cursor = 1
view := m.View()
for _, want := range []string{
"bee",
"Select action",
" Network",
"> Services",
"Acceptance tests",
"[↑/↓] move [enter] select [esc] back [ctrl+c] quit",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewBusyStateIsMinimal(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
view := m.View()
want := "bee\n\nWorking...\n\n[ctrl+c] quit\n"
if view != want {
t.Fatalf("busy view mismatch\nwant:\n%s\ngot:\n%s", want, view)
}
}
func TestViewBusyStateUsesBusyTitle(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
m.busyTitle = "Export audit"
view := m.View()
for _, want := range []string{
"Export audit",
"Working...",
"[ctrl+c] quit",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewOutputScreenRendersBodyAndBackHint(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenOutput
m.title = "Run audit"
m.body = "audit output: /var/log/bee-audit.json\n"
view := m.View()
for _, want := range []string{
"Run audit",
"audit output: /var/log/bee-audit.json",
"[enter/esc] back [ctrl+c] quit",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewExportTargetsRendersDeviceMetadata(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenExportTargets
m.targets = []platform.RemovableTarget{
{
Device: "/dev/sdb1",
FSType: "vfat",
Size: "29G",
Label: "BEEUSB",
Mountpoint: "/media/bee",
},
}
view := m.View()
for _, want := range []string{
"Export audit",
"Select removable filesystem",
"> /dev/sdb1 [vfat 29G] label=BEEUSB mounted=/media/bee",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewStaticFormRendersFields(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenStaticForm
m.selectedIface = "enp1s0"
m.formFields = []formField{
{Label: "Address", Value: "192.0.2.10/24"},
{Label: "Gateway", Value: "192.0.2.1"},
{Label: "DNS", Value: "1.1.1.1"},
}
m.formIndex = 1
view := m.View()
for _, want := range []string{
"Static IPv4: enp1s0",
" Address: 192.0.2.10/24",
"> Gateway: 192.0.2.1",
" DNS: 1.1.1.1",
"[tab/↑/↓] move [enter] next/submit [backspace] delete [esc] cancel",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestViewConfirmScreenMatchesPendingExport(t *testing.T) {
t.Parallel()
m := newTestModel()
m.screen = screenConfirm
m.pendingAction = actionExportAudit
m.selectedTarget = &platform.RemovableTarget{Device: "/dev/sdb1"}
view := m.View()
for _, want := range []string{
"Export audit",
"Copy latest audit JSON to /dev/sdb1?",
"> Confirm",
" Cancel",
} {
if !strings.Contains(view, want) {
t.Fatalf("view missing %q\nview:\n%s", want, view)
}
}
}
func TestResultMsgClearsBusyAndPendingAction(t *testing.T) {
t.Parallel()
m := newTestModel()
m.busy = true
m.busyTitle = "Export audit"
m.pendingAction = actionExportAudit
m.screen = screenConfirm
next, _ := m.Update(resultMsg{title: "Export audit", body: "done", back: screenMain})
got := next.(model)
if got.busy {
t.Fatal("busy=true want false")
}
if got.busyTitle != "" {
t.Fatalf("busyTitle=%q want empty", got.busyTitle)
}
if got.pendingAction != actionNone {
t.Fatalf("pendingAction=%q want empty", got.pendingAction)
}
}
func TestResultMsgErrorWithoutBodyFormatsCleanly(t *testing.T) {
t.Parallel()
m := newTestModel()
next, _ := m.Update(resultMsg{title: "Export audit", err: assertErr("boom"), back: screenMain})
got := next.(model)
if got.body != "ERROR: boom" {
t.Fatalf("body=%q want %q", got.body, "ERROR: boom")
}
}
type assertErr string
func (e assertErr) Error() string { return string(e) }


@@ -1,119 +0,0 @@
package tui
import (
"bee/audit/internal/app"
"bee/audit/internal/platform"
"bee/audit/internal/runtimeenv"
"strings"
tea "github.com/charmbracelet/bubbletea"
)
type screen string
const (
screenMain screen = "main"
screenNetwork screen = "network"
screenInterfacePick screen = "interface_pick"
screenServices screen = "services"
screenServiceAction screen = "service_action"
screenAcceptance screen = "acceptance"
screenExportTargets screen = "export_targets"
screenOutput screen = "output"
screenStaticForm screen = "static_form"
screenConfirm screen = "confirm"
)
type actionKind string
const (
actionNone actionKind = ""
actionDHCPOne actionKind = "dhcp_one"
actionStaticIPv4 actionKind = "static_ipv4"
actionExportAudit actionKind = "export_audit"
actionRunNvidiaSAT actionKind = "run_nvidia_sat"
actionRunMemorySAT actionKind = "run_memory_sat"
actionRunStorageSAT actionKind = "run_storage_sat"
)
type model struct {
app *app.App
runtimeMode runtimeenv.Mode
screen screen
prevScreen screen
cursor int
busy bool
busyTitle string
title string
body string
banner string
mainMenu []string
networkMenu []string
serviceMenu []string
services []string
interfaces []platform.InterfaceInfo
targets []platform.RemovableTarget
selectedService string
selectedIface string
selectedTarget *platform.RemovableTarget
pendingAction actionKind
formFields []formField
formIndex int
}
type formField struct {
Label string
Value string
}
func Run(application *app.App, runtimeMode runtimeenv.Mode) error {
options := []tea.ProgramOption{}
if runtimeMode != runtimeenv.ModeLiveCD {
options = append(options, tea.WithAltScreen())
}
program := tea.NewProgram(newModel(application, runtimeMode), options...)
_, err := program.Run()
return err
}
func newModel(application *app.App, runtimeMode runtimeenv.Mode) model {
return model{
app: application,
runtimeMode: runtimeMode,
screen: screenMain,
mainMenu: []string{
"Network",
"Services",
"Acceptance tests",
"Run audit",
"Export audit",
"Check tools",
"Show health summary",
"Show audit logs",
"Exit",
},
networkMenu: []string{
"Show status",
"DHCP on all interfaces",
"DHCP on one interface",
"Set static IPv4",
"Back",
},
serviceMenu: []string{
"Status",
"Restart",
"Start",
"Stop",
"Back",
},
}
}
func (m model) Init() tea.Cmd {
return func() tea.Msg {
return bannerMsg{text: strings.TrimSpace(m.app.MainBanner())}
}
}

@@ -1,168 +0,0 @@
package tui
import (
"fmt"
"strings"
tea "github.com/charmbracelet/bubbletea"
)
func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
switch msg := msg.(type) {
case tea.KeyMsg:
if m.busy {
switch msg.String() {
case "ctrl+c":
return m, tea.Quit
default:
return m, nil
}
}
return m.updateKey(msg)
case resultMsg:
m.busy = false
m.busyTitle = ""
m.title = msg.title
if msg.err != nil {
body := strings.TrimSpace(msg.body)
if body == "" {
m.body = fmt.Sprintf("ERROR: %v", msg.err)
} else {
m.body = fmt.Sprintf("%s\n\nERROR: %v", body, msg.err)
}
} else {
m.body = msg.body
}
m.pendingAction = actionNone
if msg.back != "" {
m.prevScreen = msg.back
} else {
m.prevScreen = m.screen
}
m.screen = screenOutput
m.cursor = 0
return m, nil
case servicesMsg:
m.busy = false
m.busyTitle = ""
if msg.err != nil {
m.title = "Services"
m.body = msg.err.Error()
m.prevScreen = screenMain
m.screen = screenOutput
return m, nil
}
m.services = msg.services
m.screen = screenServices
m.cursor = 0
return m, nil
case interfacesMsg:
m.busy = false
m.busyTitle = ""
if msg.err != nil {
m.title = "Interfaces"
m.body = msg.err.Error()
m.prevScreen = screenMain
m.screen = screenOutput
return m, nil
}
m.interfaces = msg.ifaces
m.screen = screenInterfacePick
m.cursor = 0
return m, nil
case exportTargetsMsg:
m.busy = false
m.busyTitle = ""
if msg.err != nil {
m.title = "Export audit"
m.body = msg.err.Error()
m.prevScreen = screenMain
m.screen = screenOutput
return m, nil
}
m.targets = msg.targets
m.screen = screenExportTargets
m.cursor = 0
return m, nil
case bannerMsg:
m.banner = strings.TrimSpace(msg.text)
return m, nil
}
return m, nil
}
func (m model) updateKey(msg tea.KeyMsg) (tea.Model, tea.Cmd) {
switch m.screen {
case screenMain:
return m.updateMenu(msg, len(m.mainMenu), m.handleMainMenu)
case screenNetwork:
return m.updateMenu(msg, len(m.networkMenu), m.handleNetworkMenu)
case screenServices:
return m.updateMenu(msg, len(m.services), m.handleServicesMenu)
case screenServiceAction:
return m.updateMenu(msg, len(m.serviceMenu), m.handleServiceActionMenu)
case screenAcceptance:
// 4 must match the number of items View renders for screenAcceptance.
return m.updateMenu(msg, 4, m.handleAcceptanceMenu)
case screenExportTargets:
return m.updateMenu(msg, len(m.targets), m.handleExportTargetsMenu)
case screenInterfacePick:
return m.updateMenu(msg, len(m.interfaces), m.handleInterfacePickMenu)
case screenOutput:
switch msg.String() {
case "esc", "enter", "q":
m.screen = m.prevScreen
m.body = ""
m.title = ""
m.pendingAction = actionNone
return m, nil
case "ctrl+c":
return m, tea.Quit
}
case screenStaticForm:
return m.updateStaticForm(msg)
case screenConfirm:
return m.updateConfirm(msg)
}
if msg.String() == "ctrl+c" {
return m, tea.Quit
}
return m, nil
}
func (m model) updateMenu(msg tea.KeyMsg, size int, onEnter func() (tea.Model, tea.Cmd)) (tea.Model, tea.Cmd) {
if size == 0 {
size = 1
}
switch msg.String() {
case "up", "k":
if m.cursor > 0 {
m.cursor--
}
case "down", "j":
if m.cursor < size-1 {
m.cursor++
}
case "enter":
return onEnter()
case "esc":
switch m.screen {
case screenNetwork, screenServices, screenAcceptance:
m.screen = screenMain
m.cursor = 0
case screenServiceAction:
m.screen = screenServices
m.cursor = 0
case screenExportTargets:
m.screen = screenMain
m.cursor = 0
case screenInterfacePick:
m.screen = screenNetwork
m.cursor = 0
}
case "q", "ctrl+c":
return m, tea.Quit
}
return m, nil
}
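
The up/down handling in updateMenu clamps the cursor to the menu bounds and treats an empty menu as a single slot. A minimal standalone sketch of that rule, using only the standard library; `moveCursor` is a hypothetical helper name and is not part of the package above.

```go
package main

import "fmt"

// moveCursor is a hypothetical helper mirroring the up/down branches of
// updateMenu: the cursor stays within [0, size-1], and an empty menu is
// treated as having one slot so the cursor remains at 0.
func moveCursor(cursor, size int, key string) int {
	if size == 0 {
		size = 1
	}
	switch key {
	case "up", "k":
		if cursor > 0 {
			cursor--
		}
	case "down", "j":
		if cursor < size-1 {
			cursor++
		}
	}
	return cursor
}

func main() {
	cursor := 0
	// Three "down" presses on a three-item menu stop at the last index.
	for _, key := range []string{"down", "j", "down", "up"} {
		cursor = moveCursor(cursor, 3, key)
	}
	fmt.Println(cursor)
}
```

Keeping the clamp in a pure function like this would make the bounds testable without constructing a Bubble Tea model.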

@@ -1,169 +0,0 @@
package tui
import (
"fmt"
"strings"
"bee/audit/internal/platform"
tea "github.com/charmbracelet/bubbletea"
)
func (m model) View() string {
if m.busy {
title := "bee"
if m.busyTitle != "" {
title = m.busyTitle
}
return fmt.Sprintf("%s\n\nWorking...\n\n[ctrl+c] quit\n", title)
}
switch m.screen {
case screenMain:
return renderMainMenu("bee", m.banner, "Select action", m.mainMenu, m.cursor)
case screenNetwork:
return renderMenu("Network", "Select action", m.networkMenu, m.cursor)
case screenServices:
return renderMenu("Services", "Select service", m.services, m.cursor)
case screenServiceAction:
items := make([]string, len(m.serviceMenu))
copy(items, m.serviceMenu)
return renderMenu("Service: "+m.selectedService, "Select action", items, m.cursor)
case screenAcceptance:
return renderMenu("Acceptance tests", "Select action", []string{"Run NVIDIA command pack", "Run memory test", "Run storage diagnostic pack", "Back"}, m.cursor)
case screenExportTargets:
return renderMenu("Export audit", "Select removable filesystem", renderTargetItems(m.targets), m.cursor)
case screenInterfacePick:
return renderMenu("Interfaces", "Select interface", renderInterfaceItems(m.interfaces), m.cursor)
case screenStaticForm:
return renderForm("Static IPv4: "+m.selectedIface, m.formFields, m.formIndex)
case screenConfirm:
title, body := m.confirmBody()
return renderConfirm(title, body, m.cursor)
case screenOutput:
return fmt.Sprintf("%s\n\n%s\n\n[enter/esc] back [ctrl+c] quit\n", m.title, strings.TrimSpace(m.body))
default:
return "bee\n"
}
}
func (m model) confirmBody() (string, string) {
switch m.pendingAction {
case actionExportAudit:
if m.selectedTarget == nil {
return "Export audit", "No target selected"
}
return "Export audit", fmt.Sprintf("Copy latest audit JSON to %s?", m.selectedTarget.Device)
case actionRunNvidiaSAT:
return "NVIDIA SAT", "Run NVIDIA acceptance command pack?"
case actionRunMemorySAT:
return "Memory SAT", "Run runtime memory test with memtester?"
case actionRunStorageSAT:
return "Storage SAT", "Run storage diagnostic pack and start short self-tests where supported?"
default:
return "Confirm", "Proceed?"
}
}
func renderTargetItems(targets []platform.RemovableTarget) []string {
items := make([]string, 0, len(targets))
for _, target := range targets {
desc := fmt.Sprintf("%s [%s %s]", target.Device, target.FSType, target.Size)
if target.Label != "" {
desc += " label=" + target.Label
}
if target.Mountpoint != "" {
desc += " mounted=" + target.Mountpoint
}
items = append(items, desc)
}
return items
}
func renderInterfaceItems(interfaces []platform.InterfaceInfo) []string {
items := make([]string, 0, len(interfaces))
for _, iface := range interfaces {
label := iface.Name
if len(iface.IPv4) > 0 {
label += " [" + strings.Join(iface.IPv4, ", ") + "]"
}
items = append(items, label)
}
return items
}
func renderMenu(title, subtitle string, items []string, cursor int) string {
var body strings.Builder
fmt.Fprintf(&body, "%s\n\n%s\n\n", title, subtitle)
if len(items) == 0 {
body.WriteString("(no items)\n")
} else {
for i, item := range items {
prefix := " "
if i == cursor {
prefix = "> "
}
fmt.Fprintf(&body, "%s%s\n", prefix, item)
}
}
body.WriteString("\n[↑/↓] move [enter] select [esc] back [ctrl+c] quit\n")
return body.String()
}
func renderMainMenu(title, banner, subtitle string, items []string, cursor int) string {
var body strings.Builder
fmt.Fprintf(&body, "%s\n\n", title)
if banner != "" {
body.WriteString(strings.TrimSpace(banner))
body.WriteString("\n\n")
}
body.WriteString(subtitle)
body.WriteString("\n\n")
if len(items) == 0 {
body.WriteString("(no items)\n")
} else {
for i, item := range items {
prefix := " "
if i == cursor {
prefix = "> "
}
fmt.Fprintf(&body, "%s%s\n", prefix, item)
}
}
body.WriteString("\n[↑/↓] move [enter] select [esc] back [ctrl+c] quit\n")
return body.String()
}
func renderForm(title string, fields []formField, idx int) string {
var body strings.Builder
fmt.Fprintf(&body, "%s\n\n", title)
for i, field := range fields {
prefix := " "
if i == idx {
prefix = "> "
}
fmt.Fprintf(&body, "%s%s: %s\n", prefix, field.Label, field.Value)
}
body.WriteString("\n[tab/↑/↓] move [enter] next/submit [backspace] delete [esc] cancel\n")
return body.String()
}
func renderConfirm(title, body string, cursor int) string {
options := []string{"Confirm", "Cancel"}
var out strings.Builder
fmt.Fprintf(&out, "%s\n\n%s\n\n", title, body)
for i, option := range options {
prefix := " "
if i == cursor {
prefix = "> "
}
fmt.Fprintf(&out, "%s%s\n", prefix, option)
}
out.WriteString("\n[←/→/↑/↓] move [enter] select [esc] cancel\n")
return out.String()
}
func resultCmd(title, body string, err error, back screen) tea.Cmd {
return func() tea.Msg {
return resultMsg{title: title, body: body, err: err, back: back}
}
}

445
audit/internal/webui/api.go Normal file
@@ -0,0 +1,445 @@
package webui
import (
"bufio"
"context"
"encoding/json"
"fmt"
"io"
"net/http"
"os/exec"
"path/filepath"
"strings"
"sync/atomic"
"time"
"bee/audit/internal/app"
"bee/audit/internal/platform"
)
// ── Job ID counter ────────────────────────────────────────────────────────────
var jobCounter atomic.Uint64
func newJobID(prefix string) string {
return fmt.Sprintf("%s-%d", prefix, jobCounter.Add(1))
}
// ── SSE helpers ───────────────────────────────────────────────────────────────
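// sseWrite emits one SSE message and flushes it. Multi-line data is split into
// one "data:" field per line so an embedded newline cannot end the event early.
func sseWrite(w http.ResponseWriter, event, data string) bool {
f, ok := w.(http.Flusher)
if !ok {
return false
}
if event != "" {
fmt.Fprintf(w, "event: %s\n", event)
}
for _, line := range strings.Split(data, "\n") {
fmt.Fprintf(w, "data: %s\n", line)
}
fmt.Fprint(w, "\n")
f.Flush()
return true
}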
func sseStart(w http.ResponseWriter) bool {
_, ok := w.(http.Flusher)
if !ok {
http.Error(w, "streaming not supported", http.StatusInternalServerError)
return false
}
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
w.Header().Set("Connection", "keep-alive")
w.Header().Set("Access-Control-Allow-Origin", "*")
return true
}
// streamJob streams lines from a jobState to an SSE response.
func streamJob(w http.ResponseWriter, r *http.Request, j *jobState) {
if !sseStart(w) {
return
}
existing, ch := j.subscribe()
for _, line := range existing {
sseWrite(w, "", line)
}
if ch == nil {
// Job already finished
sseWrite(w, "done", j.err)
return
}
for {
select {
case line, ok := <-ch:
if !ok {
sseWrite(w, "done", j.err)
return
}
sseWrite(w, "", line)
case <-r.Context().Done():
return
}
}
}
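// runCmdJob runs an exec.Cmd as a background job, streaming stdout+stderr lines.
func runCmdJob(j *jobState, cmd *exec.Cmd) {
pr, pw := io.Pipe()
cmd.Stdout = pw
cmd.Stderr = pw
if err := cmd.Start(); err != nil {
_ = pw.Close()
j.finish(err.Error())
return
}
// scanned is closed once every output line has been appended to the job.
scanned := make(chan struct{})
go func() {
defer close(scanned)
scanner := bufio.NewScanner(pr)
for scanner.Scan() {
j.append(scanner.Text())
}
// Keep draining if the scanner stopped early (for example on an
// over-long line) so the command never blocks on a full pipe.
_, _ = io.Copy(io.Discard, pr)
}()
err := cmd.Wait()
_ = pw.Close()
// Wait for the reader so finish is not called before the last line lands.
<-scanned
if err != nil {
j.finish(err.Error())
} else {
j.finish("")
}
}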
// ── Audit ─────────────────────────────────────────────────────────────────────
func (h *handler) handleAPIAuditRun(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
id := newJobID("audit")
j := globalJobs.create(id)
go func() {
j.append("Running audit...")
result, err := h.opts.App.RunAuditNow(h.opts.RuntimeMode)
if err != nil {
j.append("ERROR: " + err.Error())
j.finish(err.Error())
return
}
for _, line := range strings.Split(result.Body, "\n") {
if line != "" {
j.append(line)
}
}
j.finish("")
}()
writeJSON(w, map[string]string{"job_id": id})
}
func (h *handler) handleAPIAuditStream(w http.ResponseWriter, r *http.Request) {
id := r.URL.Query().Get("job_id")
j, ok := globalJobs.get(id)
if !ok {
http.Error(w, "job not found", http.StatusNotFound)
return
}
streamJob(w, r, j)
}
// ── SAT ───────────────────────────────────────────────────────────────────────
func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
// Parse optional parameters before the handler returns: net/http closes
// r.Body once the handler exits, so it must not be read from the goroutine.
var body struct {
Duration int `json:"duration"`
DiagLevel int `json:"diag_level"`
GPUIndices []int `json:"gpu_indices"`
}
body.DiagLevel = 1
if r.ContentLength > 0 {
_ = json.NewDecoder(r.Body).Decode(&body)
}
id := newJobID("sat-" + target)
j := globalJobs.create(id)
go func() {
j.append(fmt.Sprintf("Starting %s acceptance test...", target))
var (
archive string
err error
)
switch target {
case "nvidia":
if len(body.GPUIndices) > 0 || body.DiagLevel > 0 {
result, e := h.opts.App.RunNvidiaAcceptancePackWithOptions(
context.Background(), "", body.DiagLevel, body.GPUIndices,
)
if e != nil {
err = e
} else {
archive = result.Body
}
} else {
archive, err = h.opts.App.RunNvidiaAcceptancePack("")
}
case "memory":
archive, err = h.opts.App.RunMemoryAcceptancePack("")
case "storage":
archive, err = h.opts.App.RunStorageAcceptancePack("")
case "cpu":
dur := body.Duration
if dur <= 0 {
dur = 60
}
archive, err = h.opts.App.RunCPUAcceptancePack("", dur)
}
if err != nil {
j.append("ERROR: " + err.Error())
j.finish(err.Error())
return
}
j.append(fmt.Sprintf("Archive written: %s", archive))
j.finish("")
}()
writeJSON(w, map[string]string{"job_id": id})
}
}
func (h *handler) handleAPISATStream(w http.ResponseWriter, r *http.Request) {
id := r.URL.Query().Get("job_id")
j, ok := globalJobs.get(id)
if !ok {
http.Error(w, "job not found", http.StatusNotFound)
return
}
streamJob(w, r, j)
}
// ── Services ──────────────────────────────────────────────────────────────────
func (h *handler) handleAPIServicesList(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
names, err := h.opts.App.ListBeeServices()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
type serviceInfo struct {
Name string `json:"name"`
State string `json:"state"`
Body string `json:"body"`
}
result := make([]serviceInfo, 0, len(names))
for _, name := range names {
state := h.opts.App.ServiceState(name)
body, _ := h.opts.App.ServiceStatus(name)
result = append(result, serviceInfo{Name: name, State: state, Body: body})
}
writeJSON(w, result)
}
func (h *handler) handleAPIServicesAction(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var req struct {
Name string `json:"name"`
Action string `json:"action"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
var action platform.ServiceAction
switch req.Action {
case "start":
action = platform.ServiceStart
case "stop":
action = platform.ServiceStop
case "restart":
action = platform.ServiceRestart
default:
writeError(w, http.StatusBadRequest, "action must be start|stop|restart")
return
}
result, err := h.opts.App.ServiceActionResult(req.Name, action)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "output": result.Body})
}
// ── Network ───────────────────────────────────────────────────────────────────
func (h *handler) handleAPINetworkStatus(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
ifaces, err := h.opts.App.ListInterfaces()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]any{
"interfaces": ifaces,
"default_route": h.opts.App.DefaultRoute(),
})
}
func (h *handler) handleAPINetworkDHCP(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var req struct {
Interface string `json:"interface"`
}
_ = json.NewDecoder(r.Body).Decode(&req)
var result app.ActionResult
var err error
if req.Interface == "" || req.Interface == "all" {
result, err = h.opts.App.DHCPAllResult()
} else {
result, err = h.opts.App.DHCPOneResult(req.Interface)
}
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "output": result.Body})
}
func (h *handler) handleAPINetworkStatic(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var req struct {
Interface string `json:"interface"`
Address string `json:"address"`
Prefix string `json:"prefix"`
Gateway string `json:"gateway"`
DNS []string `json:"dns"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
cfg := platform.StaticIPv4Config{
Interface: req.Interface,
Address: req.Address,
Prefix: req.Prefix,
Gateway: req.Gateway,
DNS: req.DNS,
}
result, err := h.opts.App.SetStaticIPv4Result(cfg)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "output": result.Body})
}
// ── Export ────────────────────────────────────────────────────────────────────
func (h *handler) handleAPIExportList(w http.ResponseWriter, r *http.Request) {
entries, err := listExportFiles(h.opts.ExportDir)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, entries)
}
func (h *handler) handleAPIExportBundle(w http.ResponseWriter, r *http.Request) {
archive, err := app.BuildSupportBundle(h.opts.ExportDir)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{
"status": "ok",
"path": archive,
"url": "/export/support.tar.gz",
})
}
// ── Tools ─────────────────────────────────────────────────────────────────────
var standardTools = []string{
"dmidecode", "smartctl", "nvme", "lspci", "ipmitool",
"nvidia-smi", "memtester", "stress-ng", "nvtop",
"mstflint", "qrencode",
}
func (h *handler) handleAPIToolsCheck(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
statuses := h.opts.App.CheckTools(standardTools)
writeJSON(w, statuses)
}
// ── Preflight ─────────────────────────────────────────────────────────────────
func (h *handler) handleAPIPreflight(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(filepath.Join(h.opts.ExportDir, "runtime-health.json"))
if err != nil {
writeError(w, http.StatusNotFound, "runtime health not found")
return
}
w.Header().Set("Content-Type", "application/json; charset=utf-8")
w.Header().Set("Cache-Control", "no-store")
_, _ = w.Write(data)
}
// ── Metrics SSE ───────────────────────────────────────────────────────────────
func (h *handler) handleAPIMetricsStream(w http.ResponseWriter, r *http.Request) {
if !sseStart(w) {
return
}
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-r.Context().Done():
return
case <-ticker.C:
sample := platform.SampleLiveMetrics()
// Feed ring buffers for server-side SVG charts
for _, t := range sample.Temps {
if t.Name == "CPU" {
h.ringCPUTemp.push(t.Celsius)
break
}
}
h.ringPower.push(sample.PowerW)
b, err := json.Marshal(sample)
if err != nil {
continue
}
if !sseWrite(w, "metrics", string(b)) {
return
}
}
}
}
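
The SSE helpers in this file frame each message as an optional `event:` line followed by `data:` lines and a terminating blank line. A minimal sketch of that framing, assuming the standard library's `httptest.ResponseRecorder` as an in-memory client; `writeEvent` and `frame` are hypothetical names used only for illustration. It splits a payload on newlines and writes one `data:` line per piece, which is how the SSE format carries multi-line data.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// writeEvent is a hypothetical stand-in for an SSE write helper. It emits an
// optional event name, one "data:" line per line of payload, then the blank
// line that terminates the message, and flushes the response.
func writeEvent(w http.ResponseWriter, event, data string) bool {
	f, ok := w.(http.Flusher)
	if !ok {
		return false
	}
	if event != "" {
		fmt.Fprintf(w, "event: %s\n", event)
	}
	for _, line := range strings.Split(data, "\n") {
		fmt.Fprintf(w, "data: %s\n", line)
	}
	fmt.Fprint(w, "\n")
	f.Flush()
	return true
}

// frame renders a single message into a recorder and returns the raw bytes
// a client would receive.
func frame(event, data string) string {
	rec := httptest.NewRecorder()
	writeEvent(rec, event, data)
	return rec.Body.String()
}

func main() {
	fmt.Printf("%q\n", frame("", "Running audit..."))
	fmt.Printf("%q\n", frame("done", "exit status 1\nsee log"))
}
```

On the browser side, `EventSource` joins consecutive `data:` lines of one message with a newline, so the receiver sees the original multi-line string.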

@@ -0,0 +1,84 @@
package webui
import (
"sync"
"time"
)
// jobState holds the output lines and completion status of an async job.
type jobState struct {
lines []string
done bool
err string
mu sync.Mutex
// subs is a list of channels that receive new lines as they arrive.
subs []chan string
}
func (j *jobState) append(line string) {
j.mu.Lock()
defer j.mu.Unlock()
j.lines = append(j.lines, line)
for _, ch := range j.subs {
select {
case ch <- line:
default:
// Subscriber buffer is full: drop the line for it rather than block the job.
}
}
}
func (j *jobState) finish(errMsg string) {
j.mu.Lock()
defer j.mu.Unlock()
j.done = true
j.err = errMsg
for _, ch := range j.subs {
close(ch)
}
j.subs = nil
}
// subscribe returns a snapshot of the lines produced so far and a channel that
// receives subsequent lines. The channel is nil if the job has already finished.
func (j *jobState) subscribe() ([]string, <-chan string) {
j.mu.Lock()
defer j.mu.Unlock()
existing := make([]string, len(j.lines))
copy(existing, j.lines)
if j.done {
return existing, nil
}
ch := make(chan string, 256)
j.subs = append(j.subs, ch)
return existing, ch
}
// jobManager manages async jobs identified by string IDs.
type jobManager struct {
mu sync.Mutex
jobs map[string]*jobState
}
var globalJobs = &jobManager{jobs: make(map[string]*jobState)}
func (m *jobManager) create(id string) *jobState {
m.mu.Lock()
defer m.mu.Unlock()
j := &jobState{}
m.jobs[id] = j
// Schedule cleanup after 30 minutes
time.AfterFunc(30*time.Minute, func() {
m.mu.Lock()
delete(m.jobs, id)
m.mu.Unlock()
})
return j
}
func (m *jobManager) get(id string) (*jobState, bool) {
m.mu.Lock()
defer m.mu.Unlock()
j, ok := m.jobs[id]
return j, ok
}
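
The job state above combines a replayable backlog with live fan-out: a subscriber first receives the lines already produced, then a channel for new ones, and gets a nil channel if the job is already done. A minimal standalone sketch of that pattern, using only the standard library; the `feed` type, its methods, and `collect` are hypothetical names, and the buffer size of 8 is chosen only for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// feed is a hypothetical, simplified analogue of a job's output state.
type feed struct {
	mu    sync.Mutex
	lines []string
	done  bool
	subs  []chan string
}

// publish records a line and offers it to every subscriber without blocking.
func (f *feed) publish(line string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.lines = append(f.lines, line)
	for _, ch := range f.subs {
		select {
		case ch <- line:
		default: // subscriber buffer full: the line is dropped for it
		}
	}
}

// finish marks the feed complete and closes all subscriber channels.
func (f *feed) finish() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.done = true
	for _, ch := range f.subs {
		close(ch)
	}
	f.subs = nil
}

// subscribe returns the backlog plus a live channel, or nil if already done.
func (f *feed) subscribe() ([]string, <-chan string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	backlog := append([]string(nil), f.lines...)
	if f.done {
		return backlog, nil
	}
	ch := make(chan string, 8)
	f.subs = append(f.subs, ch)
	return backlog, ch
}

// collect subscribes, runs produce, then gathers backlog and live lines.
func collect(f *feed, produce func()) []string {
	out, ch := f.subscribe()
	produce()
	if ch != nil {
		for line := range ch {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	f := &feed{}
	f.publish("line 1")
	got := collect(f, func() {
		f.publish("line 2")
		f.finish()
	})
	fmt.Println(got)
}
```

Because the snapshot and the channel registration happen under the same lock, a subscriber cannot miss a line between the two; the trade-off is that a subscriber slower than its buffer loses lines instead of stalling the producer.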

@@ -0,0 +1,660 @@
package webui
import (
"encoding/json"
"fmt"
"html"
"net/url"
"os"
"path/filepath"
"sort"
"strings"
)
// ── Layout ────────────────────────────────────────────────────────────────────
func layoutHead(title string) string {
return `<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<title>` + html.EscapeString(title) + `</title>
<style>
*{box-sizing:border-box;margin:0;padding:0}
body{font-family:system-ui,-apple-system,sans-serif;background:#0f1117;color:#e2e8f0;display:flex;min-height:100vh}
a{color:inherit;text-decoration:none}
/* Sidebar */
.sidebar{width:200px;min-height:100vh;background:#161b25;border-right:1px solid #252d3d;flex-shrink:0;display:flex;flex-direction:column}
.sidebar-logo{padding:20px 16px 12px;font-size:20px;font-weight:700;color:#60a5fa;letter-spacing:-0.5px}
.sidebar-logo span{color:#94a3b8;font-weight:400;font-size:13px;display:block;margin-top:2px}
.nav{flex:1}
.nav-item{display:block;padding:10px 16px;color:#94a3b8;font-size:14px;border-left:3px solid transparent;transition:all .15s}
.nav-item:hover,.nav-item.active{background:#1e2535;color:#e2e8f0;border-left-color:#3b82f6}
.nav-icon{margin-right:8px;opacity:.7}
/* Content */
.main{flex:1;display:flex;flex-direction:column;overflow:auto}
.topbar{padding:16px 24px;border-bottom:1px solid #1e2535;display:flex;align-items:center;gap:12px}
.topbar h1{font-size:18px;font-weight:600}
.content{padding:24px;flex:1}
/* Cards */
.card{background:#161b25;border:1px solid #1e2535;border-radius:10px;margin-bottom:16px}
.card-head{padding:14px 18px;border-bottom:1px solid #1e2535;font-weight:600;font-size:14px;display:flex;align-items:center;gap:8px}
.card-body{padding:18px}
/* Buttons */
.btn{display:inline-flex;align-items:center;gap:6px;padding:8px 16px;border-radius:6px;font-size:13px;font-weight:600;cursor:pointer;border:none;transition:background .15s}
.btn-primary{background:#3b82f6;color:#fff}.btn-primary:hover{background:#2563eb}
.btn-danger{background:#ef4444;color:#fff}.btn-danger:hover{background:#dc2626}
.btn-secondary{background:#1e2535;color:#94a3b8;border:1px solid #252d3d}.btn-secondary:hover{background:#252d3d;color:#e2e8f0}
.btn-sm{padding:5px 10px;font-size:12px}
/* Tables */
table{width:100%;border-collapse:collapse;font-size:13px}
th{text-align:left;padding:8px 12px;color:#64748b;font-weight:600;border-bottom:1px solid #1e2535}
td{padding:8px 12px;border-bottom:1px solid #1a2030}
tr:last-child td{border:none}
tr:hover td{background:#1a2030}
/* Status badges */
.badge{display:inline-block;padding:2px 8px;border-radius:999px;font-size:11px;font-weight:600}
.badge-ok{background:#166534;color:#86efac}
.badge-warn{background:#713f12;color:#fde68a}
.badge-err{background:#7f1d1d;color:#fca5a5}
.badge-unknown{background:#1e293b;color:#64748b}
/* Output terminal */
.terminal{background:#0a0d14;border:1px solid #1e2535;border-radius:8px;padding:14px;font-family:monospace;font-size:12px;color:#86efac;max-height:400px;overflow-y:auto;white-space:pre-wrap;word-break:break-all}
/* Forms */
.form-row{margin-bottom:14px}
.form-row label{display:block;font-size:12px;color:#64748b;margin-bottom:5px}
.form-row input,.form-row select{width:100%;padding:8px 10px;background:#0f1117;border:1px solid #252d3d;border-radius:6px;color:#e2e8f0;font-size:13px;outline:none}
.form-row input:focus,.form-row select:focus{border-color:#3b82f6}
.chart-legend{font-size:11px;color:#64748b;padding:4px 0}
/* Grid */
.grid2{display:grid;grid-template-columns:1fr 1fr;gap:16px}
.grid3{display:grid;grid-template-columns:1fr 1fr 1fr;gap:16px}
@media(max-width:900px){.grid2,.grid3{grid-template-columns:1fr}}
/* iframe viewer */
.viewer-frame{width:100%;height:calc(100vh - 160px);border:0;border-radius:8px;background:#1a1f2e}
/* Alerts */
.alert{padding:10px 14px;border-radius:8px;font-size:13px;margin-bottom:14px}
.alert-info{background:#1e3a5f;border:1px solid #2563eb;color:#93c5fd}
.alert-warn{background:#451a03;border:1px solid #d97706;color:#fde68a}
</style>
</head>
<body>
`
}
func layoutNav(active string) string {
items := []struct{ id, icon, label string }{
{"dashboard", "", "Dashboard"},
{"metrics", "", "Metrics"},
{"tests", "", "Acceptance Tests"},
{"burn-in", "", "Burn-in"},
{"network", "", "Network"},
{"services", "", "Services"},
{"export", "", "Export"},
{"tools", "", "Tools"},
}
var b strings.Builder
b.WriteString(`<aside class="sidebar">`)
b.WriteString(`<div class="sidebar-logo">bee<span>hardware audit</span></div>`)
b.WriteString(`<nav class="nav">`)
for _, item := range items {
cls := "nav-item"
if item.id == active {
cls += " active"
}
href := "/"
if item.id != "dashboard" {
href = "/" + item.id
}
fmt.Fprintf(&b, `<a class="%s" href="%s">%s</a>`,
cls, href, item.label)
}
b.WriteString(`</nav></aside>`)
return b.String()
}
// renderPage dispatches to the appropriate page renderer.
func renderPage(page string, opts HandlerOptions) string {
var pageID, title, body string
switch page {
case "dashboard", "":
pageID = "dashboard"
title = "Dashboard"
body = renderDashboard(opts)
case "metrics":
pageID = "metrics"
title = "Live Metrics"
body = renderMetrics()
case "tests":
pageID = "tests"
title = "Acceptance Tests"
body = renderTests()
case "burn-in":
pageID = "burn-in"
title = "Burn-in Tests"
body = renderBurnIn()
case "network":
pageID = "network"
title = "Network"
body = renderNetwork()
case "services":
pageID = "services"
title = "Services"
body = renderServices()
case "export":
pageID = "export"
title = "Export"
body = renderExport(opts.ExportDir)
case "tools":
pageID = "tools"
title = "Tools"
body = renderTools()
default:
pageID = "dashboard"
title = "Not Found"
body = `<div class="alert alert-warn">Page not found.</div>`
}
return layoutHead(opts.Title+" — "+title) +
layoutNav(pageID) +
`<div class="main"><div class="topbar"><h1>` + html.EscapeString(title) + `</h1></div><div class="content">` +
body +
`</div></div></body></html>`
}
// ── Dashboard ─────────────────────────────────────────────────────────────────
func renderDashboard(opts HandlerOptions) string {
var b strings.Builder
b.WriteString(`<div class="grid2">`)
// Left: health summary
b.WriteString(`<div>`)
b.WriteString(renderHealthCard(opts))
b.WriteString(`</div>`)
// Right: quick actions
b.WriteString(`<div>`)
b.WriteString(`<div class="card"><div class="card-head">Quick Actions</div><div class="card-body">`)
b.WriteString(`<a class="btn btn-primary" href="/export/support.tar.gz" style="display:block;margin-bottom:10px">⬇ Download Support Bundle</a>`)
b.WriteString(`<a class="btn btn-secondary" href="/audit.json" style="display:block;margin-bottom:10px" target="_blank">📄 Open audit.json</a>`)
b.WriteString(`<a class="btn btn-secondary" href="/export/" style="display:block">📁 Browse Export Files</a>`)
b.WriteString(`<div style="margin-top:14px"><button class="btn btn-secondary" onclick="runAudit()">▶ Re-run Audit</button></div>`)
b.WriteString(`</div></div>`)
b.WriteString(`</div>`)
b.WriteString(`</div>`)
// Audit viewer iframe
b.WriteString(`<div class="card"><div class="card-head">Audit Snapshot</div><div class="card-body" style="padding:0">`)
b.WriteString(`<iframe class="viewer-frame" src="/viewer" loading="eager" referrerpolicy="same-origin"></iframe>`)
b.WriteString(`</div></div>`)
// Audit run output div
b.WriteString(`<div id="audit-output" style="display:none" class="card"><div class="card-head">Audit Output</div><div class="card-body"><div id="audit-terminal" class="terminal"></div></div></div>`)
b.WriteString(`<script>
function runAudit() {
document.getElementById('audit-output').style.display='block';
const term = document.getElementById('audit-terminal');
term.textContent = 'Starting audit...\n';
fetch('/api/audit/run', {method:'POST'})
.then(r => r.json())
.then(d => {
const es = new EventSource('/api/audit/stream?job_id=' + d.job_id);
es.onmessage = e => { term.textContent += e.data + '\n'; term.scrollTop = term.scrollHeight; };
es.addEventListener('done', e => { es.close(); term.textContent += (e.data ? '\nERROR: ' + e.data : '\nDone.') + '\n'; location.reload(); });
});
}
</script>`)
return b.String()
}
func renderHealthCard(opts HandlerOptions) string {
data, err := loadSnapshot(filepath.Join(opts.ExportDir, "runtime-health.json"))
if err != nil {
return `<div class="card"><div class="card-head">Runtime Health</div><div class="card-body"><span class="badge badge-unknown">No data</span></div></div>`
}
var health map[string]any
if err := json.Unmarshal(data, &health); err != nil {
return `<div class="card"><div class="card-head">Runtime Health</div><div class="card-body"><span class="badge badge-err">Parse error</span></div></div>`
}
status := fmt.Sprintf("%v", health["status"])
badge := "badge-ok"
if status == "PARTIAL" {
badge = "badge-warn"
} else if status == "FAIL" || status == "FAILED" {
badge = "badge-err"
}
var b strings.Builder
b.WriteString(`<div class="card"><div class="card-head">Runtime Health</div><div class="card-body">`)
fmt.Fprintf(&b, `<div style="margin-bottom:10px"><span class="badge %s">%s</span></div>`, badge, html.EscapeString(status))
if issues, ok := health["issues"].([]any); ok && len(issues) > 0 {
b.WriteString(`<div style="font-size:12px;color:#f87171">Issues:<br>`)
for _, issue := range issues {
if m, ok := issue.(map[string]any); ok {
b.WriteString(html.EscapeString(fmt.Sprintf("%v: %v", m["code"], m["message"])) + "<br>")
}
}
b.WriteString(`</div>`)
}
b.WriteString(`</div></div>`)
return b.String()
}
// ── Metrics ───────────────────────────────────────────────────────────────────
func renderMetrics() string {
return `<p style="color:#64748b;font-size:13px;margin-bottom:16px">Live server metrics. Charts refresh every 2 seconds.</p>
<div class="grid2">
<div class="card">
<div class="card-head">System</div>
<div class="card-body">
<img id="chart-cpu-temp" src="/api/metrics/chart/cpu-temp.svg" style="width:100%;border-radius:6px" alt="CPU Temp">
<img id="chart-power" src="/api/metrics/chart/power.svg" style="width:100%;border-radius:6px;margin-top:8px" alt="Power">
<div id="sys-table" style="margin-top:8px"></div>
</div>
</div>
<div class="card">
<div class="card-head">GPU</div>
<div class="card-body">
<div id="gpu-table"><p style="color:#64748b;font-size:12px">Waiting for data...</p></div>
</div>
</div>
</div>
<script>
function refreshCharts() {
const t = '?t=' + Date.now();
['chart-cpu-temp','chart-power'].forEach(id => {
const el = document.getElementById(id);
if (el) el.src = el.src.split('?')[0] + t;
});
}
setInterval(refreshCharts, 2000);
const es = new EventSource('/api/metrics/stream');
es.addEventListener('metrics', e => {
const d = JSON.parse(e.data);
const gpuRows = (d.gpus||[]).map(g =>
'<tr><td>GPU '+g.index+'</td><td>'+g.temp_c+'°C</td><td>'+g.usage_pct+'%</td><td>'+g.power_w+'W</td><td>'+g.clock_mhz+'MHz</td></tr>'
).join('');
document.getElementById('gpu-table').innerHTML = gpuRows ?
'<table><tr><th>GPU</th><th>Temp</th><th>Usage</th><th>Power</th><th>Clock</th></tr>'+gpuRows+'</table>' :
'<p style="color:#64748b;font-size:12px">No NVIDIA GPU detected</p>';
let sysHTML = '';
const cpuTemp = (d.temps||[]).find(t => t.name==='CPU');
if (cpuTemp) sysHTML += '<tr><td>CPU Temp</td><td>'+cpuTemp.celsius.toFixed(1)+'°C</td></tr>';
(d.fans||[]).forEach(f => sysHTML += '<tr><td>'+f.name+'</td><td>'+f.rpm+' RPM</td></tr>');
if (d.power_w) sysHTML += '<tr><td>System Power</td><td>'+d.power_w.toFixed(0)+'W</td></tr>';
document.getElementById('sys-table').innerHTML = sysHTML ?
'<table>'+sysHTML+'</table>' :
'<p style="color:#64748b;font-size:12px">No sensor data (ipmitool/sensors required)</p>';
});
es.onerror = () => {};
</script>`
}
// ── Acceptance Tests ──────────────────────────────────────────────────────────
func renderTests() string {
return `<p style="color:#64748b;font-size:13px;margin-bottom:16px">Run hardware acceptance tests and view results.</p>
<div class="grid2">
` + renderSATCard("nvidia", "NVIDIA GPU", `<div class="form-row"><label>Diag Level</label><select id="sat-nvidia-level"><option value="1">Level 1 — Quick</option><option value="2">Level 2 — Standard</option><option value="3">Level 3 — Extended</option><option value="4">Level 4 — Full</option></select></div>`) +
renderSATCard("memory", "Memory", "") +
renderSATCard("storage", "Storage", "") +
renderSATCard("cpu", "CPU", `<div class="form-row"><label>Duration (seconds)</label><input type="number" id="sat-cpu-dur" value="60" min="10"></div>`) +
`</div>
<div id="sat-output" style="display:none;margin-top:16px" class="card">
<div class="card-head">Test Output <span id="sat-title"></span></div>
<div class="card-body"><div id="sat-terminal" class="terminal"></div></div>
</div>
<script>
let satES = null;
function runSAT(target) {
if (satES) satES.close();
const body = {};
if (target === 'nvidia') body.diag_level = parseInt(document.getElementById('sat-nvidia-level').value)||1;
if (target === 'cpu') body.duration = parseInt(document.getElementById('sat-cpu-dur').value)||60;
document.getElementById('sat-output').style.display='block';
document.getElementById('sat-title').textContent = '— ' + target;
const term = document.getElementById('sat-terminal');
term.textContent = 'Starting ' + target + ' test...\n';
fetch('/api/sat/'+target+'/run', {method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(body)})
.then(r => r.json())
.then(d => {
satES = new EventSource('/api/sat/stream?job_id='+d.job_id);
satES.onmessage = e => { term.textContent += e.data+'\n'; term.scrollTop=term.scrollHeight; };
satES.addEventListener('done', e => { satES.close(); term.textContent += (e.data ? '\nERROR: '+e.data : '\nCompleted.')+'\n'; });
});
}
</script>`
}
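// renderSATCard renders one acceptance-test card: a heading, optional extra
// form controls, and a Run button wired to the page's runSAT(id) function.
func renderSATCard(id, label, extra string) string {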
return fmt.Sprintf(`<div class="card"><div class="card-head">%s</div><div class="card-body">%s<button class="btn btn-primary" onclick="runSAT('%s')">▶ Run Test</button></div></div>`,
label, extra, id)
}
// ── Burn-in ───────────────────────────────────────────────────────────────────
func renderBurnIn() string {
return `<p style="color:#64748b;font-size:13px;margin-bottom:16px">Long-running GPU and system stress tests. Check the <a href="/metrics" style="color:#60a5fa">Metrics</a> page for live telemetry.</p>
<div class="grid2">
<div class="card"><div class="card-head">GPU Platform Stress</div><div class="card-body">
<div class="form-row"><label>Duration</label><select id="bi-dur"><option value="600">10 minutes</option><option value="3600">1 hour</option><option value="28800">8 hours</option><option value="86400">24 hours</option></select></div>
<button class="btn btn-primary" onclick="runBurnIn('nvidia')">▶ Start GPU Stress</button>
</div></div>
<div class="card"><div class="card-head">CPU Stress</div><div class="card-body">
<div class="form-row"><label>Duration (seconds)</label><input type="number" id="bi-cpu-dur" value="300" min="60"></div>
<button class="btn btn-primary" onclick="runBurnIn('cpu')">▶ Start CPU Stress</button>
</div></div>
</div>
<div id="bi-output" style="display:none;margin-top:16px" class="card">
<div class="card-head">Output</div>
<div class="card-body"><div id="bi-terminal" class="terminal"></div></div>
</div>
<script>
let biES = null;
function runBurnIn(target) {
if (biES) biES.close();
const body = {};
if (target === 'nvidia') body.duration = parseInt(document.getElementById('bi-dur').value)||600;
if (target === 'cpu') body.duration = parseInt(document.getElementById('bi-cpu-dur').value)||300;
document.getElementById('bi-output').style.display='block';
const term = document.getElementById('bi-terminal');
term.textContent = 'Starting ' + target + ' burn-in...\n';
fetch('/api/sat/'+target+'/run', {method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(body)})
.then(r => r.json())
.then(d => {
biES = new EventSource('/api/sat/stream?job_id='+d.job_id);
biES.onmessage = e => { term.textContent += e.data+'\n'; term.scrollTop=term.scrollHeight; };
biES.addEventListener('done', e => { biES.close(); term.textContent += (e.data ? '\nERROR: '+e.data : '\nCompleted.')+'\n'; });
});
}
</script>`
}
// ── Network ───────────────────────────────────────────────────────────────────
func renderNetwork() string {
return `<div class="card"><div class="card-head">Network Interfaces</div><div class="card-body">
<div id="iface-table"><p style="color:#64748b;font-size:13px">Loading...</p></div>
</div></div>
<div class="grid2">
<div class="card"><div class="card-head">DHCP</div><div class="card-body">
<div class="form-row"><label>Interface (leave empty for all)</label><input type="text" id="dhcp-iface" placeholder="eth0"></div>
<button class="btn btn-primary" onclick="runDHCP()">▶ Run DHCP</button>
<div id="dhcp-out" style="margin-top:10px;font-size:12px;color:#86efac"></div>
</div></div>
<div class="card"><div class="card-head">Static IPv4</div><div class="card-body">
<div class="form-row"><label>Interface</label><input type="text" id="st-iface" placeholder="eth0"></div>
<div class="form-row"><label>Address</label><input type="text" id="st-addr" placeholder="192.168.1.100"></div>
<div class="form-row"><label>Prefix length</label><input type="text" id="st-prefix" placeholder="24"></div>
<div class="form-row"><label>Gateway</label><input type="text" id="st-gw" placeholder="192.168.1.1"></div>
<div class="form-row"><label>DNS (comma-separated)</label><input type="text" id="st-dns" placeholder="8.8.8.8,8.8.4.4"></div>
<button class="btn btn-primary" onclick="setStatic()">Apply Static IP</button>
<div id="static-out" style="margin-top:10px;font-size:12px;color:#86efac"></div>
</div></div>
</div>
<script>
function loadNetwork() {
fetch('/api/network').then(r=>r.json()).then(d => {
const rows = (d.interfaces||[]).map(i =>
'<tr><td>'+i.Name+'</td><td><span class="badge '+(i.State==='up'?'badge-ok':'badge-warn')+'">'+i.State+'</span></td><td>'+(i.IPv4||[]).join(', ')+'</td></tr>'
).join('');
document.getElementById('iface-table').innerHTML =
'<table><tr><th>Interface</th><th>State</th><th>Addresses</th></tr>'+rows+'</table>' +
(d.default_route ? '<p style="font-size:12px;color:#64748b;margin-top:8px">Default route: '+d.default_route+'</p>' : '');
});
}
function runDHCP() {
const iface = document.getElementById('dhcp-iface').value.trim();
fetch('/api/network/dhcp',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({interface:iface||'all'})})
.then(r=>r.json()).then(d => {
document.getElementById('dhcp-out').textContent = d.output || d.error || 'Done.';
loadNetwork();
});
}
function setStatic() {
const dns = document.getElementById('st-dns').value.split(',').map(s=>s.trim()).filter(Boolean);
fetch('/api/network/static',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({
interface: document.getElementById('st-iface').value,
address: document.getElementById('st-addr').value,
prefix: document.getElementById('st-prefix').value,
gateway: document.getElementById('st-gw').value,
dns: dns,
})}).then(r=>r.json()).then(d => {
document.getElementById('static-out').textContent = d.output || d.error || 'Done.';
loadNetwork();
});
}
loadNetwork();
</script>`
}
// ── Services ──────────────────────────────────────────────────────────────────
func renderServices() string {
return `<div class="card"><div class="card-head">Bee Services <button class="btn btn-sm btn-secondary" onclick="loadServices()" style="margin-left:auto">↻ Refresh</button></div>
<div class="card-body">
<div id="svc-table"><p style="color:#64748b;font-size:13px">Loading...</p></div>
</div></div>
<div id="svc-out" style="display:none;margin-top:8px" class="card">
<div class="card-head">Output</div>
<div class="card-body" style="padding:10px"><div id="svc-terminal" class="terminal" style="max-height:150px"></div></div>
</div>
<script>
function loadServices() {
fetch('/api/services').then(r=>r.json()).then(svcs => {
const rows = svcs.map(s => {
const st = s.state||'unknown';
const badge = st==='active' ? 'badge-ok' : st==='failed' ? 'badge-err' : 'badge-warn';
const id = 'svc-body-'+s.name.replace(/[^a-z0-9]/g,'-');
const body = (s.body||'').replace(/</g,'&lt;').replace(/>/g,'&gt;');
return '<tr>' +
'<td style="white-space:nowrap">'+s.name+'</td>' +
'<td style="white-space:nowrap"><span class="badge '+badge+'" style="cursor:pointer" onclick="toggleBody(\''+id+'\')">'+st+' ▾</span>' +
'<div id="'+id+'" style="display:none;margin-top:6px"><pre style="font-size:11px;white-space:pre-wrap;word-break:break-all;max-height:200px;overflow-y:auto;background:#0a0d14;padding:8px;border-radius:6px;color:#94a3b8">'+body+'</pre></div>' +
'</td>' +
'<td style="white-space:nowrap">' +
'<button class="btn btn-sm btn-secondary" onclick="svcAction(\''+s.name+'\',\'start\')">Start</button> ' +
'<button class="btn btn-sm btn-secondary" onclick="svcAction(\''+s.name+'\',\'stop\')">Stop</button> ' +
'<button class="btn btn-sm btn-secondary" onclick="svcAction(\''+s.name+'\',\'restart\')">Restart</button>' +
'</td></tr>';
}).join('');
document.getElementById('svc-table').innerHTML =
'<table><tr><th>Service</th><th>Status</th><th>Actions</th></tr>'+rows+'</table>';
});
}
function toggleBody(id) {
const el = document.getElementById(id);
if (el) el.style.display = el.style.display==='none' ? 'block' : 'none';
}
function svcAction(name, action) {
fetch('/api/services/action',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({name,action})})
.then(r=>r.json()).then(d => {
document.getElementById('svc-out').style.display='block';
document.getElementById('svc-terminal').textContent = d.output || d.error || action+' '+name;
setTimeout(loadServices, 1000);
});
}
loadServices();
</script>`
}
// ── Export ────────────────────────────────────────────────────────────────────
func renderExport(exportDir string) string {
entries, _ := listExportFiles(exportDir)
var rows strings.Builder
for _, e := range entries {
rows.WriteString(fmt.Sprintf(`<tr><td><a href="/export/file?path=%s" target="_blank">%s</a></td></tr>`,
url.QueryEscape(e), html.EscapeString(e)))
}
if len(entries) == 0 {
rows.WriteString(`<tr><td style="color:#64748b">No export files found.</td></tr>`)
}
return `<div class="grid2">
<div class="card"><div class="card-head">Support Bundle</div><div class="card-body">
<p style="font-size:13px;color:#94a3b8;margin-bottom:12px">Creates a tar.gz archive of all audit files, SAT results, and logs.</p>
<a class="btn btn-primary" href="/export/support.tar.gz">⬇ Download Support Bundle</a>
</div></div>
<div class="card"><div class="card-head">Export Files</div><div class="card-body">
<table><tr><th>File</th></tr>` + rows.String() + `</table>
</div></div>
</div>`
}
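// listExportFiles walks exportDir and returns the sorted relative paths of all
// non-directory entries. A missing directory yields an empty list, not an error.
func listExportFiles(exportDir string) ([]string, error) {
var entries []string
// Trim once so that Walk and Rel below operate on the same root.
exportDir = strings.TrimSpace(exportDir)
err := filepath.Walk(exportDir, func(path string, info os.FileInfo, err error) error {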
if err != nil {
return err
}
if info.IsDir() {
return nil
}
rel, err := filepath.Rel(exportDir, path)
if err != nil {
return err
}
entries = append(entries, rel)
return nil
})
if err != nil && !os.IsNotExist(err) {
return nil, err
}
sort.Strings(entries)
return entries, nil
}
// ── Tools ─────────────────────────────────────────────────────────────────────
func renderTools() string {
return `<div class="card"><div class="card-head">Tool Check <button class="btn btn-sm btn-secondary" onclick="checkTools()" style="margin-left:auto">↻ Check</button></div>
<div class="card-body"><div id="tools-table"><p style="color:#64748b;font-size:13px">Click Check to verify installed tools.</p></div></div></div>
<script>
function checkTools() {
document.getElementById('tools-table').innerHTML = '<p style="color:#64748b;font-size:13px">Checking...</p>';
fetch('/api/tools/check').then(r=>r.json()).then(tools => {
const rows = tools.map(t =>
'<tr><td>'+t.Name+'</td><td><span class="badge '+(t.OK ? 'badge-ok' : 'badge-err')+'">'+(t.OK ? '✓ '+t.Path : '✗ missing')+'</span></td></tr>'
).join('');
document.getElementById('tools-table').innerHTML =
'<table><tr><th>Tool</th><th>Status</th></tr>'+rows+'</table>';
});
}
checkTools();
</script>`
}
// ── Viewer (compatibility) ────────────────────────────────────────────────────
// renderViewerPage renders the audit snapshot as a styled HTML page.
// This endpoint is embedded as an iframe on the Dashboard page.
func renderViewerPage(title string, snapshot []byte) string {
var b strings.Builder
b.WriteString(`<!DOCTYPE html><html><head><meta charset="utf-8">`)
b.WriteString(`<title>` + html.EscapeString(title) + `</title>`)
b.WriteString(`<style>
*{box-sizing:border-box;margin:0;padding:0}
body{font-family:system-ui,sans-serif;background:#0f1117;color:#e2e8f0;padding:20px}
h2{font-size:14px;color:#64748b;margin-bottom:8px;margin-top:16px;text-transform:uppercase;letter-spacing:.05em}
.grid{display:grid;grid-template-columns:repeat(auto-fill,minmax(280px,1fr));gap:12px}
.card{background:#161b25;border:1px solid #1e2535;border-radius:8px;padding:14px}
.card-title{font-size:12px;color:#64748b;margin-bottom:6px}
.card-value{font-size:15px;font-weight:600}
.badge{display:inline-block;padding:2px 8px;border-radius:999px;font-size:11px;font-weight:600}
.ok{background:#166534;color:#86efac}.warn{background:#713f12;color:#fde68a}.err{background:#7f1d1d;color:#fca5a5}
pre{background:#0a0d14;border:1px solid #1e2535;border-radius:6px;padding:12px;font-size:11px;overflow-x:auto;color:#94a3b8;white-space:pre-wrap;word-break:break-word;max-height:400px;overflow-y:auto}
</style></head><body>
`)
if len(snapshot) == 0 {
b.WriteString(`<p style="color:#64748b">No audit snapshot available yet. Re-run audit from the Dashboard.</p>`)
b.WriteString(`</body></html>`)
return b.String()
}
var data map[string]any
if err := json.Unmarshal(snapshot, &data); err != nil {
// Fallback: render raw JSON
b.WriteString(`<pre>` + html.EscapeString(string(snapshot)) + `</pre>`)
b.WriteString(`</body></html>`)
return b.String()
}
// Collected at
if t, ok := data["collected_at"].(string); ok {
b.WriteString(`<p style="font-size:12px;color:#64748b;margin-bottom:16px">Collected: ` + html.EscapeString(t) + `</p>`)
}
// Hardware section
hw, _ := data["hardware"].(map[string]any)
if hw == nil {
hw = data
}
renderHWCards(&b, hw)
// Full JSON below
b.WriteString(`<h2>Raw JSON</h2>`)
pretty, _ := json.MarshalIndent(data, "", " ")
b.WriteString(`<pre>` + html.EscapeString(string(pretty)) + `</pre>`)
b.WriteString(`</body></html>`)
return b.String()
}
func renderHWCards(b *strings.Builder, hw map[string]any) {
sections := []struct{ key, label string }{
{"board", "Board"},
{"cpus", "CPUs"},
{"memory", "Memory"},
{"storage", "Storage"},
{"gpus", "GPUs"},
{"nics", "NICs"},
{"psus", "Power Supplies"},
}
for _, s := range sections {
v, ok := hw[s.key]
if !ok {
continue
}
b.WriteString(`<h2>` + s.label + `</h2><div class="grid">`)
renderValue(b, v)
b.WriteString(`</div>`)
}
}
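// renderValue renders each map as a card and recurses into slices; a scalar
// passed directly is ignored. Go does not specify map iteration order, so the
// order of fields within a card can differ between renders.
func renderValue(b *strings.Builder, v any) {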
switch val := v.(type) {
case []any:
for _, item := range val {
renderValue(b, item)
}
case map[string]any:
b.WriteString(`<div class="card">`)
for k, vv := range val {
b.WriteString(fmt.Sprintf(`<div class="card-title">%s</div><div class="card-value">%s</div>`,
html.EscapeString(k), html.EscapeString(fmt.Sprintf("%v", vv))))
}
b.WriteString(`</div>`)
}
}
// ── Export index (compatibility) ──────────────────────────────────────────────
func renderExportIndex(exportDir string) (string, error) {
entries, err := listExportFiles(exportDir)
if err != nil {
return "", err
}
var body strings.Builder
body.WriteString(`<!DOCTYPE html><html><head><meta charset="utf-8"><title>Bee Export Files</title></head><body>`)
body.WriteString(`<h1>Bee Export Files</h1><ul>`)
for _, entry := range entries {
body.WriteString(`<li><a href="/export/file?path=` + url.QueryEscape(entry) + `">` + html.EscapeString(entry) + `</a></li>`)
}
if len(entries) == 0 {
body.WriteString(`<li>No export files found.</li>`)
}
body.WriteString(`</ul></body></html>`)
return body.String(), nil
}

@@ -1,77 +1,339 @@
package webui
import (
"encoding/json"
"errors"
"fmt"
"net/http"
"os"
"path/filepath"
"strings"
"sync"
"time"
"bee/audit/internal/app"
"bee/audit/internal/runtimeenv"
gocharts "github.com/go-analyze/charts"
"reanimator/chart/viewer"
chartweb "reanimator/chart/web"
"reanimator/chart/web"
)
const defaultTitle = "Bee Hardware Audit"
// HandlerOptions configures the web UI handler.
type HandlerOptions struct {
Title string
AuditPath string
Title string
AuditPath string
ExportDir string
App *app.App
RuntimeMode runtimeenv.Mode
}
// metricsRing holds a rolling window of live metric samples.
type metricsRing struct {
mu sync.Mutex
vals []float64
labels []string
size int
}
func newMetricsRing(size int) *metricsRing {
return &metricsRing{size: size, vals: make([]float64, 0, size), labels: make([]string, 0, size)}
}
func (r *metricsRing) push(v float64) {
r.mu.Lock()
defer r.mu.Unlock()
if len(r.vals) >= r.size {
r.vals = r.vals[1:]
r.labels = r.labels[1:]
}
r.vals = append(r.vals, v)
r.labels = append(r.labels, time.Now().Format("15:04"))
}
func (r *metricsRing) snapshot() ([]float64, []string) {
r.mu.Lock()
defer r.mu.Unlock()
v := make([]float64, len(r.vals))
l := make([]string, len(r.labels))
copy(v, r.vals)
copy(l, r.labels)
return v, l
}
// handler is the HTTP handler for the web UI.
type handler struct {
opts HandlerOptions
mux *http.ServeMux
ringCPUTemp *metricsRing
ringPower *metricsRing
ringFans []*metricsRing
ringGPUTemp []*metricsRing
ringGPUUtil []*metricsRing
ringsMu sync.Mutex
}
// NewHandler creates the HTTP mux with all routes.
func NewHandler(opts HandlerOptions) http.Handler {
title := strings.TrimSpace(opts.Title)
if title == "" {
title = defaultTitle
if strings.TrimSpace(opts.Title) == "" {
opts.Title = defaultTitle
}
if strings.TrimSpace(opts.ExportDir) == "" {
opts.ExportDir = app.DefaultExportDir
}
if opts.RuntimeMode == "" {
opts.RuntimeMode = runtimeenv.ModeAuto
}
auditPath := strings.TrimSpace(opts.AuditPath)
h := &handler{
opts: opts,
ringCPUTemp: newMetricsRing(120),
ringPower: newMetricsRing(120),
}
mux := http.NewServeMux()
mux.Handle("GET /static/", http.StripPrefix("/static/", chartweb.Static()))
mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Cache-Control", "no-store")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("ok"))
})
mux.HandleFunc("GET /audit.json", func(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(auditPath)
if err != nil {
if errors.Is(err, os.ErrNotExist) {
http.Error(w, "audit snapshot not found", http.StatusNotFound)
return
}
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, _ = w.Write(data)
})
mux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
snapshot, err := loadSnapshot(auditPath)
if err != nil && !errors.Is(err, os.ErrNotExist) {
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
return
}
html, err := viewer.RenderHTML(snapshot, title)
if err != nil {
http.Error(w, fmt.Sprintf("render snapshot: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write(html)
})
// ── Infrastructure ──────────────────────────────────────────────────────
mux.HandleFunc("GET /healthz", h.handleHealthz)
// ── Existing read-only endpoints (preserved for compatibility) ──────────
mux.HandleFunc("GET /audit.json", h.handleAuditJSON)
mux.HandleFunc("GET /runtime-health.json", h.handleRuntimeHealthJSON)
mux.HandleFunc("GET /export/support.tar.gz", h.handleSupportBundleDownload)
mux.HandleFunc("GET /export/file", h.handleExportFile)
mux.HandleFunc("GET /export/", h.handleExportIndex)
mux.HandleFunc("GET /viewer", h.handleViewer)
// ── API ──────────────────────────────────────────────────────────────────
// Audit
mux.HandleFunc("POST /api/audit/run", h.handleAPIAuditRun)
mux.HandleFunc("GET /api/audit/stream", h.handleAPIAuditStream)
// SAT
mux.HandleFunc("POST /api/sat/nvidia/run", h.handleAPISATRun("nvidia"))
mux.HandleFunc("POST /api/sat/memory/run", h.handleAPISATRun("memory"))
mux.HandleFunc("POST /api/sat/storage/run", h.handleAPISATRun("storage"))
mux.HandleFunc("POST /api/sat/cpu/run", h.handleAPISATRun("cpu"))
mux.HandleFunc("GET /api/sat/stream", h.handleAPISATStream)
// Services
mux.HandleFunc("GET /api/services", h.handleAPIServicesList)
mux.HandleFunc("POST /api/services/action", h.handleAPIServicesAction)
// Network
mux.HandleFunc("GET /api/network", h.handleAPINetworkStatus)
mux.HandleFunc("POST /api/network/dhcp", h.handleAPINetworkDHCP)
mux.HandleFunc("POST /api/network/static", h.handleAPINetworkStatic)
// Export
mux.HandleFunc("GET /api/export/list", h.handleAPIExportList)
mux.HandleFunc("POST /api/export/bundle", h.handleAPIExportBundle)
// Tools
mux.HandleFunc("GET /api/tools/check", h.handleAPIToolsCheck)
// Preflight
mux.HandleFunc("GET /api/preflight", h.handleAPIPreflight)
// Metrics — SSE stream of live sensor data + server-side SVG charts
mux.HandleFunc("GET /api/metrics/stream", h.handleAPIMetricsStream)
mux.HandleFunc("GET /api/metrics/chart/", h.handleMetricsChartSVG)
// Reanimator chart static assets
mux.Handle("GET /chart/static/", http.StripPrefix("/chart/static/", web.Static()))
// ── Pages ────────────────────────────────────────────────────────────────
mux.HandleFunc("GET /", h.handlePage)
h.mux = mux
return mux
}
// ListenAndServe starts the HTTP server.
func ListenAndServe(addr string, opts HandlerOptions) error {
return http.ListenAndServe(addr, NewHandler(opts))
}
// ── Infrastructure handlers ──────────────────────────────────────────────────
func (h *handler) handleHealthz(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Cache-Control", "no-store")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("ok"))
}
// ── Compatibility endpoints ──────────────────────────────────────────────────
func (h *handler) handleAuditJSON(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(h.opts.AuditPath)
if err != nil {
if errors.Is(err, os.ErrNotExist) {
http.Error(w, "audit snapshot not found", http.StatusNotFound)
return
}
http.Error(w, fmt.Sprintf("read audit snapshot: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, _ = w.Write(data)
}
func (h *handler) handleRuntimeHealthJSON(w http.ResponseWriter, r *http.Request) {
data, err := loadSnapshot(filepath.Join(h.opts.ExportDir, "runtime-health.json"))
if err != nil {
if errors.Is(err, os.ErrNotExist) {
http.Error(w, "runtime health not found", http.StatusNotFound)
return
}
http.Error(w, fmt.Sprintf("read runtime health: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_, _ = w.Write(data)
}
func (h *handler) handleSupportBundleDownload(w http.ResponseWriter, r *http.Request) {
archive, err := app.BuildSupportBundle(h.opts.ExportDir)
if err != nil {
http.Error(w, fmt.Sprintf("build support bundle: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/gzip")
w.Header().Set("Content-Disposition", fmt.Sprintf("attachment; filename=%q", filepath.Base(archive)))
http.ServeFile(w, r, archive)
}
func (h *handler) handleExportFile(w http.ResponseWriter, r *http.Request) {
rel := strings.TrimSpace(r.URL.Query().Get("path"))
if rel == "" {
http.Error(w, "path is required", http.StatusBadRequest)
return
}
clean := filepath.Clean(rel)
if clean == "." || strings.HasPrefix(clean, "..") {
http.Error(w, "invalid path", http.StatusBadRequest)
return
}
http.ServeFile(w, r, filepath.Join(h.opts.ExportDir, clean))
}
func (h *handler) handleExportIndex(w http.ResponseWriter, r *http.Request) {
body, err := renderExportIndex(h.opts.ExportDir)
if err != nil {
http.Error(w, fmt.Sprintf("render export index: %v", err), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(body))
}
func (h *handler) handleViewer(w http.ResponseWriter, r *http.Request) {
snapshot, _ := loadSnapshot(h.opts.AuditPath)
body, err := viewer.RenderHTML(snapshot, h.opts.Title)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write(body)
}
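// handleMetricsChartSVG renders the named ring buffer ("cpu-temp" or "power")
// as an SVG line chart. Unknown chart names return 404.
func (h *handler) handleMetricsChartSVG(w http.ResponseWriter, r *http.Request) {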
name := strings.TrimPrefix(r.URL.Path, "/api/metrics/chart/")
name = strings.TrimSuffix(name, ".svg")
var ring *metricsRing
var title, unit string
switch name {
case "cpu-temp":
ring, title, unit = h.ringCPUTemp, "CPU Temperature", "°C"
case "power":
ring, title, unit = h.ringPower, "System Power", "W"
default:
http.NotFound(w, r)
return
}
vals, labels := ring.snapshot()
if len(vals) == 0 {
vals = []float64{0}
labels = []string{""}
}
// Sparse x-axis labels
sparse := make([]string, len(labels))
step := len(labels) / 6
if step < 1 {
step = 1
}
for i := range labels {
if i%step == 0 {
sparse[i] = labels[i]
}
}
opt := gocharts.NewLineChartOptionWithData([][]float64{vals})
opt.Title = gocharts.TitleOption{Text: title + " (" + unit + ")"}
opt.XAxis.Labels = sparse
opt.Legend = gocharts.LegendOption{Show: gocharts.Ptr(false)}
p := gocharts.NewPainter(gocharts.PainterOptions{
OutputFormat: gocharts.ChartOutputSVG,
Width: 600,
Height: 180,
}, gocharts.PainterThemeOption(gocharts.GetTheme("grafana")))
if err := p.LineChart(opt); err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
buf, err := p.Bytes()
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "image/svg+xml")
w.Header().Set("Cache-Control", "no-store")
_, _ = w.Write(buf)
}
// ── Page handler ─────────────────────────────────────────────────────────────
func (h *handler) handlePage(w http.ResponseWriter, r *http.Request) {
page := strings.TrimPrefix(r.URL.Path, "/")
if page == "" {
page = "dashboard"
}
body := renderPage(page, h.opts)
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(body))
}
// ── Helpers ──────────────────────────────────────────────────────────────────
func loadSnapshot(path string) ([]byte, error) {
if strings.TrimSpace(path) == "" {
return nil, os.ErrNotExist
}
return os.ReadFile(path)
}
// writeJSON sends v as JSON with status 200.
func writeJSON(w http.ResponseWriter, v any) {
w.Header().Set("Content-Type", "application/json; charset=utf-8")
w.Header().Set("Cache-Control", "no-store")
_ = json.NewEncoder(w).Encode(v)
}
// writeError sends a JSON error response.
func writeError(w http.ResponseWriter, status int, msg string) {
w.Header().Set("Content-Type", "application/json; charset=utf-8")
w.Header().Set("Cache-Control", "no-store")
w.WriteHeader(status)
_ = json.NewEncoder(w).Encode(map[string]string{"error": msg})
}

@@ -9,9 +9,13 @@ import (
"testing"
)
func TestRootRendersLatestSnapshot(t *testing.T) {
func TestRootRendersShellWithIframe(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
t.Fatal(err)
}
@@ -19,6 +23,7 @@ func TestRootRendersLatestSnapshot(t *testing.T) {
handler := NewHandler(HandlerOptions{
Title: "Bee Hardware Audit",
AuditPath: path,
ExportDir: exportDir,
})
first := httptest.NewRecorder()
@@ -26,8 +31,11 @@ func TestRootRendersLatestSnapshot(t *testing.T) {
if first.Code != http.StatusOK {
t.Fatalf("first status=%d", first.Code)
}
if !strings.Contains(first.Body.String(), "SERIAL-OLD") {
t.Fatalf("first body missing old serial: %s", first.Body.String())
if !strings.Contains(first.Body.String(), `iframe`) || !strings.Contains(first.Body.String(), `src="/viewer"`) {
t.Fatalf("first body missing iframe viewer: %s", first.Body.String())
}
if !strings.Contains(first.Body.String(), "/export/support.tar.gz") {
t.Fatalf("first body missing support bundle link: %s", first.Body.String())
}
if got := first.Header().Get("Cache-Control"); got != "no-store" {
t.Fatalf("first cache-control=%q", got)
@@ -42,11 +50,42 @@ func TestRootRendersLatestSnapshot(t *testing.T) {
if second.Code != http.StatusOK {
t.Fatalf("second status=%d", second.Code)
}
if !strings.Contains(second.Body.String(), `src="/viewer"`) {
t.Fatalf("second body missing iframe viewer: %s", second.Body.String())
}
}
func TestViewerRendersLatestSnapshot(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-OLD"}}}`), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
first := httptest.NewRecorder()
handler.ServeHTTP(first, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if first.Code != http.StatusOK {
t.Fatalf("first status=%d", first.Code)
}
if !strings.Contains(first.Body.String(), "SERIAL-OLD") {
t.Fatalf("viewer body missing old serial: %s", first.Body.String())
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:05:00Z","hardware":{"board":{"serial_number":"SERIAL-NEW"}}}`), 0644); err != nil {
t.Fatal(err)
}
second := httptest.NewRecorder()
handler.ServeHTTP(second, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if second.Code != http.StatusOK {
t.Fatalf("second status=%d", second.Code)
}
if !strings.Contains(second.Body.String(), "SERIAL-NEW") {
t.Fatalf("second body missing new serial: %s", second.Body.String())
t.Fatalf("viewer body missing new serial: %s", second.Body.String())
}
if strings.Contains(second.Body.String(), "SERIAL-OLD") {
t.Fatalf("second body still contains old serial: %s", second.Body.String())
t.Fatalf("viewer body still contains old serial: %s", second.Body.String())
}
}
@@ -80,3 +119,49 @@ func TestMissingAuditJSONReturnsNotFound(t *testing.T) {
t.Fatalf("status=%d want %d", rec.Code, http.StatusNotFound)
}
}
func TestSupportBundleEndpointReturnsArchive(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/export/support.tar.gz", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
if got := rec.Header().Get("Content-Disposition"); !strings.Contains(got, "attachment;") {
t.Fatalf("content-disposition=%q", got)
}
if rec.Body.Len() == 0 {
t.Fatal("empty archive body")
}
}
func TestRuntimeHealthEndpointReturnsJSON(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
body := `{"status":"PARTIAL","checked_at":"2026-03-16T10:00:00Z"}`
if err := os.WriteFile(filepath.Join(exportDir, "runtime-health.json"), []byte(body), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/runtime-health.json", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
if strings.TrimSpace(rec.Body.String()) != body {
t.Fatalf("body=%q want %q", strings.TrimSpace(rec.Body.String()), body)
}
}


@@ -9,6 +9,8 @@ DHCP is used only for LAN (operator SSH access). Internet is NOT available.
## Boot sequence (single ISO)
The live system is expected to boot with `toram`, so `live-boot` copies the full read-only medium into RAM before mounting the root filesystem. After that point, runtime must not depend on the original USB/BMC virtual media staying readable.
`systemd` boot order:
```
@@ -20,11 +22,12 @@ local-fs.target
│ creates /dev/nvidia* nodes)
├── bee-audit.service (runs `bee audit` → /var/log/bee-audit.json,
│ never blocks boot on partial collector failures)
└── bee-web.service (runs `bee web` on :80,
reads the latest audit snapshot on each request)
├── bee-web.service (runs `bee web` on :80, full interactive web UI)
└── bee-desktop.service (startx → openbox + chromium http://localhost/)
```
**Critical invariants:**
- The live ISO boots with `boot=live toram`. Runtime binaries must continue working even if the original boot media disappears after early boot.
- OpenSSH MUST start without network. `bee-sshsetup.service` runs before `ssh.service`.
- `bee-network.service` uses `dhclient -nw` (background) — network bring-up is best effort and non-blocking.
- `bee-nvidia.service` loads modules via `insmod` with absolute paths — NOT `modprobe`.
@@ -41,23 +44,27 @@ Local-console behavior:
```text
tty1
└── live-config autologin → bee
└── /home/bee/.profile
└── exec menu
└── /usr/local/bin/bee-tui
└── sudo -n /usr/local/bin/bee tui --runtime livecd
└── /home/bee/.profile (prints web UI URLs)
display :0
└── bee-desktop.service (User=bee)
└── startx /usr/local/bin/bee-openbox-session -- :0
├── tint2 (taskbar)
├── chromium http://localhost/
└── openbox (WM)
```
Rules:
- local `tty1` lands in user `bee`, not directly in `root`
- `menu` must work without typing `sudo`
- TUI actions still run as `root` via `sudo -n`
- SSH is independent from the tty1 path
- `bee-desktop.service` starts X11 + openbox + Chromium automatically after `bee-web.service`
- Chromium opens `http://localhost/` — the full interactive web UI
- SSH is independent from the desktop path
- serial console support is enabled for VM boot debugging
## ISO build sequence
```
build.sh [--authorized-keys /path/to/keys]
build-in-container.sh [--authorized-keys /path/to/keys]
1. compile `bee` binary (skip if .go files older than binary)
2. create a temporary overlay staging dir under `dist/`
3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
@@ -71,25 +78,39 @@ build.sh [--authorized-keys /path/to/keys]
d. build kernel modules against Debian headers
e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
f. cache in `dist/nvidia-<version>-<kver>/`
7. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
8. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
9. inject `libnvidia-ml` + `libcuda` → staged `/usr/lib/`
10. write staged `/etc/bee-release` (versions + git commit)
11. patch staged `motd` with build metadata
12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
13. sync staged overlay into workdir `config/includes.chroot/`
14. run `lb config && lb build` inside the temporary workdir
(either on a Debian host/VM or inside the privileged builder container)
7. `build-cublas.sh`:
a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo
b. verify packages against repo `Packages.gz`
c. extract headers for `bee-gpu-stress` build
d. cache userspace libs in `dist/cublas-<version>+cuda<series>/`
8. build `bee-gpu-stress` against extracted cuBLASLt/cudart headers
9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/`
12. write staged `/etc/bee-release` (versions + git commit)
13. patch staged `motd` with build metadata
14. copy `iso/builder/` into a temporary live-build workdir under `dist/`
15. sync staged overlay into workdir `config/includes.chroot/`
16. run `lb config && lb build` inside the privileged builder container
```
Build host notes:
- `build-in-container.sh` targets `linux/amd64` builder containers by default, including Docker Desktop on macOS / Apple Silicon.
- Override with `BEE_BUILDER_PLATFORM=<os/arch>` only if you intentionally need a different container platform.
- If the local builder image under the same tag was previously built for the wrong architecture, the script rebuilds it automatically.
**Critical invariants:**
- `DEBIAN_KERNEL_ABI` in `iso/builder/VERSIONS` pins the exact kernel ABI used in BOTH places:
1. `setup-builder.sh` / `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
2. `auto/config` → `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- `bee-gpu-stress` must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers.
- The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
- Container build requires `--privileged` because `live-build` uses mounts/chroots/loop devices during ISO assembly.
- On macOS / Docker Desktop, the builder still must run as `linux/amd64` so the shipped ISO binaries remain `amd64`.
- Operators must provision enough RAM to hold the full compressed live medium plus normal runtime overhead, because `toram` copies the entire read-only ISO payload into memory before the system reaches steady state.
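The RAM invariant can be checked with simple arithmetic before a machine is booted from the ISO. A minimal sketch, assuming the ISO size in bytes and the target's `MemTotal` in KiB are already known; the helper name `toram_fits` and the overhead figure are illustrative and not part of the repository:

```sh
#!/bin/sh
# toram_fits ISO_BYTES MEM_TOTAL_KB OVERHEAD_MB
# Hypothetical helper: succeeds when the live medium plus a chosen runtime
# overhead fits into the target's RAM.
toram_fits() {
    iso_bytes="$1"
    mem_total_kb="$2"
    overhead_mb="$3"
    # Round the ISO size up to whole KiB so the check errs on the safe side.
    iso_kb=$(( (iso_bytes + 1023) / 1024 ))
    need_kb=$(( iso_kb + overhead_mb * 1024 ))
    [ "$mem_total_kb" -ge "$need_kb" ]
}

# Example (values are placeholders): a 2 GiB ISO on a host with 8 GiB RAM,
# reserving 2 GiB for the running system.
#   toram_fits 2147483648 8388608 2048 || echo "not enough RAM for toram" >&2
```

On a Linux target the second argument can be read from the `MemTotal:` line of `/proc/meminfo`, which is reported in kB. How much overhead to reserve depends on the workload and is left to the operator.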
## Post-boot smoke test
@@ -132,12 +153,48 @@ Current validation state:
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
Acceptance flows:
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-stress`
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + mixed-precision `bee-gpu-stress`
- `bee sat memory` → `memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- `bee-gpu-stress` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
- Ampere: `fp16` + `fp32`/TF32 tensor-core load
- Ada / Hopper: add `fp8`
- Blackwell+: add `fp4`
- PTX fallback is only for missing cuBLASLt/userspace or unsupported narrow datatypes
- Runtime overrides:
- `BEE_GPU_STRESS_SECONDS`
- `BEE_GPU_STRESS_SIZE_MB`
- `BEE_MEMTESTER_SIZE_MB`
- `BEE_MEMTESTER_PASSES`
## NVIDIA SAT TUI flow (v1.0.0+)
```
TUI: Acceptance tests → NVIDIA command pack
1. screenNvidiaSATSetup
a. enumerate GPUs via `nvidia-smi --query-gpu=index,name,memory.total`
b. user selects duration preset: 10 min / 1 h / 8 h / 24 h
c. user selects GPUs via checkboxes (all selected by default)
d. memory size = max(selected GPU memory) — auto-detected, not exposed to user
2. Start → screenNvidiaSATRunning
a. CUDA_VISIBLE_DEVICES set to selected GPU indices
b. tea.Batch: SAT goroutine + tea.ExecProcess(nvtop) launched concurrently
c. nvtop occupies full terminal; SAT result queues in background
d. [o] reopen nvtop at any time; [a] abort (cancels context → kills bee-gpu-stress)
3. GPU metrics collection (during bee-gpu-stress)
- background goroutine polls `nvidia-smi` every second
- per-second rows: elapsed, GPU index, temp°C, usage%, power W, clock MHz
- outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG chart), gpu-metrics-term.txt
4. After SAT completes
- result shown in screenOutput with terminal line-chart (gpu-metrics-term.txt)
- chart is asciigraph-style: box-drawing chars (╭╮╰╯─│), 4 series per GPU,
Y axis with ticks, ANSI colours (red=temp, blue=usage, green=power, yellow=clock)
```
**Critical invariants:**
- `nvtop` must be in `iso/builder/config/package-lists/bee.list.chroot` (baked into ISO).
- `bee-gpu-stress` uses `exec.CommandContext` — aborted on cancel.
- Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed).
- If `nvtop` is not found on PATH, SAT still runs without it (graceful degradation).
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.
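The per-second metric rows from step 3 can be approximated from a shell for quick manual checks. A rough sketch, assuming the query fields `index`, `temperature.gpu`, `utilization.gpu`, `power.draw`, and `clocks.current.graphics`; the text above does not say which fields the Go collector requests, and `format_gpu_row` is a name made up for illustration:

```sh
#!/bin/sh
# format_gpu_row ELAPSED_SECONDS CSV_LINE
# Turns one output line of
#   nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics \
#              --format=csv,noheader,nounits
# into "elapsed,index,temp_c,usage_pct,power_w,clock_mhz".
format_gpu_row() {
    elapsed="$1"
    # nvidia-smi separates CSV fields with ", "; drop the spaces.
    printf '%s,%s\n' "$elapsed" "$(printf '%s' "$2" | tr -d ' ')"
}

# Example polling loop (needs a host with the NVIDIA driver loaded):
#   start=$(date +%s)
#   while sleep 1; do
#       now=$(date +%s)
#       nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,power.draw,clocks.current.graphics \
#                  --format=csv,noheader,nounits |
#       while IFS= read -r line; do
#           format_gpu_row "$((now - start))" "$line"
#       done
#   done >> gpu-metrics.csv
```

This only mirrors the row layout (elapsed, GPU index, temperature, usage, power, clock). The HTML and terminal charts are produced by the Go code and are not reproduced here.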


@@ -21,13 +21,14 @@ Fills gaps where Redfish/logpile is blind:
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
- Machine-readable health summary derived from collector verdicts
- Operator-triggered acceptance tests for NVIDIA, memory, and storage
- NVIDIA SAT includes both diagnostic collection and lightweight GPU stress via `bee-gpu-stress`
- NVIDIA SAT includes both diagnostic collection and mixed-precision GPU stress via `bee-gpu-stress`
- `bee-gpu-stress` should exercise tensor/inference paths (`fp16`, `fp32`/TF32, `fp8`, `fp4` when supported by the GPU/userspace stack) and fall back to Driver API PTX burn only if cuBLASLt is unavailable
- Automatic boot audit with operator-facing local console and SSH access
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
- SSH access (OpenSSH) always available for inspection and debugging
- Interactive Go TUI via `bee tui` for network setup, service management, and acceptance tests
- Read-only web viewer via `bee web`, rendering the latest audit snapshot through the embedded Reanimator Chart
- Local `tty1` operator UX: `bee` autologin, `menu` auto-start, privileged actions via `sudo -n`
- Full web UI via `bee web` on port 80: interactive control panel with live metrics, SAT tests, network config, service management, export, and tools
- Local operator desktop: openbox + Xorg + Chromium auto-opening `http://localhost/`
- Local `tty1` operator UX: `bee` autologin, openbox desktop auto-starts with Chromium on `http://localhost/`
## Network isolation — CRITICAL
@@ -69,15 +70,18 @@ Fills gaps where Redfish/logpile is blind:
| SSH | OpenSSH server |
| NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
| GPU stress backend | `bee-gpu-stress` + cuBLASLt/cuBLAS/cudart mixed-precision GEMM, with Driver API PTX fallback |
| Builder | Debian 12 host/VM or Debian 12 container image |
## Operator UX
- On the live ISO, `tty1` autologins as `bee`
- The login profile auto-runs `menu`, which enters the Go TUI
- The TUI itself executes privileged actions as `root` via `sudo -n`
- `bee-desktop.service` starts X11 + openbox + Chromium on display `:0`
- Chromium opens `http://localhost/` — the full web UI
- SSH remains available independently of the local console path
- Remote operators can open `http://<ip>/` in any browser on the same LAN
- VM-oriented builds also include `qemu-guest-agent` and serial console support for debugging
- The ISO boots with `toram`, so loss of the original USB/BMC virtual media after boot should not break already-installed runtime binaries
## Runtime split
@@ -85,6 +89,7 @@ Fills gaps where Redfish/logpile is blind:
- Live-ISO-only responsibilities stay in `iso/` integration code
- Live ISO launches the Go CLI with `--runtime livecd`
- Local/manual runs use `--runtime auto` or `--runtime local`
- Live ISO targets must have enough RAM for the full compressed live medium plus runtime working set because the boot medium is copied into memory at startup
## Key paths
@@ -99,7 +104,10 @@ Fills gaps where Redfish/logpile is blind:
| `internal/chart/` | Git submodule with `reanimator/chart`, embedded into `bee web` |
| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
| `iso/overlay/etc/profile.d/bee.sh` | `menu` helper + tty1 auto-start policy |
| `iso/overlay/home/bee/.profile` | `bee` shell profile for local console startup |
| `iso/overlay/etc/profile.d/bee.sh` | tty1 welcome message with web UI URLs |
| `iso/overlay/home/bee/.profile` | `bee` shell profile (PATH only) |
| `iso/overlay/etc/systemd/system/bee-desktop.service` | starts X11 + openbox + chromium |
| `iso/overlay/usr/local/bin/bee-desktop` | startx wrapper for bee-desktop.service |
| `iso/overlay/usr/local/bin/bee-openbox-session` | xinitrc: tint2 + chromium + openbox |
| `dist/` | Build outputs (gitignored) |
| `iso/out/` | Downloaded ISO files (gitignored) |


@@ -1,5 +1,26 @@
# Backlog
## BMC version via IPMI
**Status:** implemented.
Add BMC firmware version collection to the board collector:
- Command: `ipmitool mc info` → `Firmware Revision` field
- Record it in `hardware.firmware[]` as `{device_name: "BMC", version: "..."}`
- Show it in the right-hand TUI column next to the BIOS version
- Graceful skip if `/dev/ipmi0` is absent (silent: same pattern as PSU collector)
## CPU acceptance test via stress-ng
**Status:** implemented. CPU in Health Check gets PASS/FAIL from summary.txt.
Add a CPU SAT based on `stress-ng`:
- Bake `stress-ng` into the ISO (add it to `bee.list.chroot`)
- New `bee sat cpu`: runs `stress-ng --cpu 0 --cpu-method all --timeout <N>`, where N = duration from the selected mode (Quick=60s, Standard=300s, Express=900s)
- In parallel, collects temperatures via `sensors` and throttle flags from the audit JSON
- Result: SAT archive with summary.txt in the same format as the other SATs (overall_status=OK/FAILED)
- After implementation: CPU in Health Check gets a real PASS/FAIL status
## Real hardware validation
**Status:** awaiting access to hardware.


@@ -1,6 +1,6 @@
---
title: Hardware Ingest JSON Contract
version: "2.1"
version: "2.7"
updated: "2026-03-15"
maintainer: Reanimator Core
audience: external-integrators, ai-agents
@@ -9,7 +9,7 @@ language: ru
# Reanimator integration: hardware JSON import contract
Version: **2.1** · Date: **2026-03-15**
Version: **2.7** · Date: **2026-03-15**
This document describes the JSON format for submitting server hardware data to **Reanimator** (a hardware lifecycle management system).
It is intended for developers of adjacent systems (Redfish collectors, monitoring agents, CMDB exporters) and may be included in the documentation of integrating projects.
@@ -22,6 +22,9 @@ language: ru
| Version | Date | Changes |
|--------|------|-----------|
| 2.7 | 2026-03-15 | Data synthesis in `event_logs` is explicitly forbidden; integrators must not invent component serial numbers when the source did not provide them |
| 2.6 | 2026-03-15 | Added an optional `event_logs` section for dedup/upsert of `host` / `bmc` / `redfish` logs outside the history timeline |
| 2.5 | 2026-03-15 | Added a common optional field `manufactured_year_week` for component sections (`YYYY-Www`) |
| 2.4 | 2026-03-15 | Added the first wave of component telemetry: health/life fields for `cpus`, `memory`, `storage`, `pcie_devices`, `power_supplies` |
| 2.3 | 2026-03-15 | Added component telemetry fields: `pcie_devices.temperature_c`, `pcie_devices.power_w`, `power_supplies.temperature_c` |
| 2.2 | 2026-03-15 | Added the `numa_node` field on `pcie_devices` for topology/affinity |
@@ -38,6 +41,7 @@ language: ru
3. **Partial payloads**: send only the sections for which data is available. An empty array and a missing section are equivalent.
4. **Strict schema**: the endpoint uses a strict JSON decoder; unknown fields result in `400 Bad Request`.
5. **Event-driven**: an import creates events in the timeline (LOG_COLLECTED, INSTALLED, REMOVED, FIRMWARE_CHANGED, and others).
6. **No synthesis on the integrator side**: the collector sends only values that were actually collected. Do not invent `serial_number`, `component_ref`, `message`, `message_id`, or other identifiers/attributes if the source did not provide them or the parser could not extract them reliably.
---
@@ -127,7 +131,8 @@ GET /ingest/hardware/jobs/{job_id}
"storage": [ ... ],
"pcie_devices": [ ... ],
"power_supplies": [ ... ],
"sensors": { ... }
"sensors": { ... },
"event_logs": [ ... ]
}
}
```
@@ -157,6 +162,7 @@ GET /ingest/hardware/jobs/{job_id}
| `status_changed_at` | string RFC3339 | Time of the last status change |
| `status_history` | array | History of status transitions (see below) |
| `error_description` | string | Error/diagnostic text |
| `manufactured_year_week` | string | Manufacturing date in `YYYY-Www` format, for example `2024-W07` |
**`status_history[]` object:**
@@ -178,6 +184,7 @@ GET /ingest/hardware/jobs/{job_id}
- If the source keeps history, send `status_history` sorted by `changed_at` in ascending order.
- Do not include `status_history` entries without `changed_at`.
- All dates are RFC3339; UTC (`Z`) is recommended.
- Use `manufactured_year_week` when the source knows only the year and week of manufacture, without an exact calendar date.
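A collector can validate a `manufactured_year_week` value before sending it. A minimal sketch, assuming ISO 8601 week numbering (weeks 01 to 53), which the contract does not state explicitly; `is_year_week` is an illustrative name:

```sh
#!/bin/sh
# is_year_week VALUE
# Succeeds when VALUE looks like YYYY-Www with a week number from 01 to 53.
# It does not check whether that particular year really has a week 53.
is_year_week() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-W0[1-9]) return 0 ;;
        [0-9][0-9][0-9][0-9]-W[1-4][0-9]) return 0 ;;
        [0-9][0-9][0-9][0-9]-W5[0-3]) return 0 ;;
        *) return 1 ;;
    esac
}
```

If the check fails, the safer choice under the no-synthesis rule is to omit the field rather than to repair the value.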
---
@@ -250,12 +257,14 @@ GET /ingest/hardware/jobs/{job_id}
| `life_remaining_pct` | float | no | Remaining life / health, % |
| `life_used_pct` | float | no | Used life / wear, % |
| `serial_number` | string | no | Serial number (if available) |
| `firmware` | string | no | Microcode version |
| `firmware` | string | no | Microcode version; if the logger reports `Microcode level`, pass it here as is |
| `present` | bool | no | Presence (defaults to `true`) |
| + common status fields | | | see the section above |
**serial_number generation when missing:** `{board_serial}-CPU-{socket}`
If the source uses the field/label `Microcode level`, pass its value in `cpus[].firmware` without any additional transformation.
```json
"cpus": [
{
@@ -282,7 +291,6 @@ GET /ingest/hardware/jobs/{job_id}
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `slot` | string | no | Slot identifier |
| `location` | string | no | Physical location |
| `present` | bool | no | Module presence (defaults to `true`) |
| `serial_number` | string | no | Serial number |
| `part_number` | string | no | Part number (used as the model) |
@@ -328,7 +336,7 @@ GET /ingest/hardware/jobs/{job_id}
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `slot` | string | no | Slot identifier |
| `slot` | string | no | Canonical installation address of the PCIe device; pass the BDF (`0000:18:00.0`) |
| `serial_number` | string | no | Serial number |
| `model` | string | no | Model |
| `manufacturer` | string | no | Manufacturer |
@@ -404,7 +412,7 @@ GET /ingest/hardware/jobs/{job_id}
| `sfp_rx_power_dbm` | float | no | RX optical power, dBm |
| `sfp_voltage_v` | float | no | SFP voltage, V |
| `sfp_bias_ma` | float | no | SFP bias current, mA |
| `bdf` | string | no | Bus:Device.Function, for example `0000:18:00.0` |
| `bdf` | string | no | Deprecated alias for `slot`; when present, ingest normalizes it into `slot` |
| `device_class` | string | no | Device class (see the list below) |
| `manufacturer` | string | no | Manufacturer |
| `model` | string | no | Model |
@@ -421,7 +429,9 @@ GET /ingest/hardware/jobs/{job_id}
Pass `numa_node` for NIC / InfiniBand / RAID / GPU when the source knows the CPU/NUMA affinity. The field is stored in the snapshot attributes of the PCIe component and duplicated in telemetry for topology use cases.
Use the `temperature_c` and `power_w` fields for device-level telemetry of GPU / accelerator / smart PCIe devices. They do not affect component identification.
**serial_number generation when missing or `"N/A"`:** `{board_serial}-PCIE-{slot}`
**serial_number generation when missing or `"N/A"`:** `{board_serial}-PCIE-{slot}`, where `slot` for PCIe equals the BDF.
`slot` is the only canonical address of a component. For PCIe, pass the BDF in `slot`. The `bdf` field is kept only as a transitional alias on input and must not be used as a separate coordinate alongside `slot`.
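Since `slot` must carry the BDF for PCIe devices, a collector may want to check the value before emitting it. A minimal sketch, assuming lower-case hex with an explicit PCI domain is the expected form (as in the `0000:18:00.0` example) and that a short `bb:dd.f` address, as printed by tools that omit domain 0, belongs to domain `0000`; both assumptions and the name `normalize_bdf` are illustrative:

```sh
#!/bin/sh
# normalize_bdf VALUE
# Prints VALUE as a lower-case dddd:bb:dd.f address and fails when VALUE is
# not a BDF. The check is loose: it does not enforce the 5-bit device range.
normalize_bdf() {
    bdf=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case "$bdf" in
        [0-9a-f][0-9a-f]:[0-9a-f][0-9a-f].[0-7])
            # Assumption: an address without a domain lives in domain 0000.
            bdf="0000:$bdf" ;;
    esac
    case "$bdf" in
        [0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f].[0-7])
            printf '%s\n' "$bdf" ;;
        *)
            return 1 ;;
    esac
}
```

A label such as `PCIeCard2` is rejected rather than mapped to a guessed address, in line with the no-synthesis rule.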
**`device_class` values:**
@@ -441,7 +451,7 @@ GET /ingest/hardware/jobs/{job_id}
```json
"pcie_devices": [
{
"slot": "PCIeCard2",
"slot": "0000:3b:00.0",
"vendor_id": 5555,
"device_id": 4401,
"numa_node": 0,
@@ -450,7 +460,6 @@ GET /ingest/hardware/jobs/{job_id}
"sfp_temperature_c": 36.2,
"sfp_tx_power_dbm": -1.8,
"sfp_rx_power_dbm": -2.1,
"bdf": "0000:3b:00.0",
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",
@@ -526,6 +535,58 @@ PSU without `serial_number` is ignored.
}
```
---
### event_logs
Normalized operational server logs from `host`, `bmc`, or `redfish`.
These records do not enter the history timeline and do not create history events. They are stored in a separate deduplicated log store and shown in a separate UI block for asset logs / host logs.
| Field | Type | Required | Description |
|------|-----|-------------|----------|
| `source` | string | **yes** | Log source: `host`, `bmc`, `redfish` |
| `event_time` | string RFC3339 | no | Event time from the source; if absent, the ingest/collection time is used |
| `severity` | string | no | Level: `OK`, `Info`, `Warning`, `Critical`, `Unknown` |
| `message_id` | string | no | Event identifier/code from the source |
| `message` | string | **yes** | Normalized event text |
| `component_ref` | string | no | Reference to the component/device/slot, if it can be extracted |
| `fingerprint` | string | no | Externally supplied dedup key; if not provided, the system computes its own |
| `is_active` | bool | no | Indicates that the event is still active/not cleared, if the source supports a lifecycle |
| `raw_payload` | object | no | Raw vendor-specific payload for diagnostics |
**event_logs rules:**
- Logs are deduplicated within asset + source + fingerprint.
- If `fingerprint` is not provided, the system builds it from normalized fields (`source`, `message_id`, `message`, `component_ref`, time normalization).
- The integrator/log collector must not synthesize event content: do not invent `message`, `message_id`, `component_ref`, serial/device identifiers, or other fields if they are absent from the source log or were not reliably extracted.
- Receiving the same event again updates `last_seen_at`/the repeat counter and must not create a new timeline/history event.
- `event_logs` are used for a separate UI view of logs and do not change the canonical state of components/assets by default.
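The contract does not define how the server derives a `fingerprint`, so the following is only a sketch of one possible dedup key, not the ingest algorithm. It assumes SHA-256 over `source`, `message_id`, `message`, and `component_ref` after whitespace normalization, and it leaves out the time normalization mentioned above; `event_fingerprint` is a hypothetical name:

```sh
#!/bin/sh
# event_fingerprint SOURCE MESSAGE_ID MESSAGE COMPONENT_REF
# Hypothetical dedup key: SHA-256 over the fields, one per line, after
# collapsing whitespace runs and trimming both ends of each field.
event_fingerprint() {
    for field in "$@"; do
        norm=$(printf '%s' "$field" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//')
        # Normalized fields contain no newlines, so a newline is a safe separator.
        printf '%s\n' "$norm"
    done | sha256sum | awk '{print $1}'
}
```

Two copies of the same BMC event that differ only in spacing produce the same key, while the same text from a different `source` produces a different one. Missing optional fields should be passed as empty strings so the field positions stay fixed.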
```json
"event_logs": [
{
"source": "bmc",
"event_time": "2026-03-15T14:03:11Z",
"severity": "Warning",
"message_id": "0x000F",
"message": "Correctable ECC error threshold exceeded",
"component_ref": "CPU0_C0D0",
"raw_payload": {
"sensor": "DIMM_A1",
"sel_record_id": "0042"
}
},
{
"source": "redfish",
"event_time": "2026-03-15T14:03:20Z",
"severity": "Info",
"message_id": "OpenBMC.0.1.SystemReboot",
"message": "System reboot requested by administrator",
"component_ref": "Mainboard"
}
]
```
#### sensors.fans
| Field | Type | Required | Description |
@@ -608,10 +669,12 @@ PSU without `serial_number` is ignored.
## Handling missing serial_number
General rule for all sections: if the source did not return a serial number and the collector could not extract it reliably, the integrator must not substitute made-up values, hashes, local placeholder identifiers, or "guessed" serial numbers. Only the server-side ingest fallback rules explicitly listed below are allowed.
| Type | Behavior |
|-----|-----------|
| CPU | Generated: `{board_serial}-CPU-{socket}` |
| PCIe | Generated: `{board_serial}-PCIE-{slot}` (if serial = `"N/A"` or empty) |
| PCIe | Generated: `{board_serial}-PCIE-{slot}` (if serial = `"N/A"` or empty; `slot` for PCIe = BDF) |
| Memory | Component is ignored |
| Storage | Component is ignored |
| PSU | Component is ignored |
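To make the table concrete, here is a small model of the server-side fallback. It is an illustration of the documented behavior, not the ingest code, and integrators should not apply it to their own payloads; `fallback_serial` and its argument order are invented for the example:

```sh
#!/bin/sh
# fallback_serial KIND BOARD_SERIAL POSITION REPORTED_SERIAL
# Prints the serial the table above would yield. Exits non-zero without
# output when the component would be ignored.
fallback_serial() {
    kind="$1"; board="$2"; pos="$3"; serial="$4"
    # The contract names "N/A" as a missing marker only for PCIe.
    if [ "$kind" = pcie ] && [ "$serial" = "N/A" ]; then
        serial=""
    fi
    if [ -n "$serial" ]; then
        printf '%s\n' "$serial"
        return 0
    fi
    case "$kind" in
        cpu)  printf '%s-CPU-%s\n' "$board" "$pos" ;;
        pcie) printf '%s-PCIE-%s\n' "$board" "$pos" ;;
        *)    return 1 ;;   # memory, storage, PSU: component is ignored
    esac
}
```

For example, `fallback_serial pcie BRD1 0000:18:00.0 N/A` prints `BRD1-PCIE-0000:18:00.0`, and `fallback_serial psu BRD1 0 ''` prints nothing and fails.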
@@ -687,7 +750,7 @@ PSU without `serial_number` is ignored.
],
"pcie_devices": [
{
"slot": "PCIeCard1",
"slot": "0000:18:00.0",
"device_class": "EthernetController",
"manufacturer": "Intel",
"model": "X710 10GbE",

iso/README.md Normal file

@@ -0,0 +1,58 @@
# ISO Build
The `bee` ISO is built inside a Debian 12 builder container via `iso/builder/build-in-container.sh`.
## Requirements
- Docker Desktop or another Docker-compatible container runtime
- Privileged containers enabled
- Enough free disk space for builder cache, Debian live-build artifacts, NVIDIA driver cache, and CUDA userspace packages
## Build On macOS
From the repository root:
```sh
sh iso/builder/build-in-container.sh
```
The script defaults to `linux/amd64` builder containers, so it works on:
- Intel Mac
- Apple Silicon (`M1` / `M2` / `M3` / `M4`) via Docker Desktop's Linux VM
You do not need to pass `--platform` manually for normal ISO builds.
## Useful Options
Build with explicit SSH keys baked into the ISO:
```sh
sh iso/builder/build-in-container.sh --authorized-keys ~/.ssh/id_ed25519.pub
```
Rebuild the builder image:
```sh
sh iso/builder/build-in-container.sh --rebuild-image
```
Use a custom cache directory:
```sh
sh iso/builder/build-in-container.sh --cache-dir /path/to/cache
```
## Notes
- The builder image is automatically rebuilt if the local tag exists for the wrong architecture.
- The live ISO boots with Debian `live-boot` `toram`, so the read-only medium is copied into RAM during boot and the runtime no longer depends on the original USB/BMC virtual media staying present.
- Target systems need enough RAM for the full compressed live medium plus normal runtime overhead, or boot may fail before reaching the TUI.
- Override the container platform only if you know why:
```sh
BEE_BUILDER_PLATFORM=linux/amd64 sh iso/builder/build-in-container.sh
```
- The shipped ISO is still `amd64`.
- Output ISO artifacts are written under `dist/`.


@@ -1,7 +1,6 @@
FROM debian:12
ARG GO_VERSION=1.24.0
ARG DEBIAN_KERNEL_ABI=6.1.0-43
ENV DEBIAN_FRONTEND=noninteractive
@@ -24,9 +23,23 @@ RUN apt-get update -qq && apt-get install -y \
gcc \
make \
perl \
"linux-headers-${DEBIAN_KERNEL_ABI}-amd64" \
linux-headers-amd64 \
&& rm -rf /var/lib/apt/lists/*
# Add NVIDIA CUDA repo and install nvcc (needed to compile nccl-tests)
RUN wget -qO /tmp/cuda-keyring.gpg \
https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub \
&& gpg --dearmor < /tmp/cuda-keyring.gpg \
> /usr/share/keyrings/nvidia-cuda.gpg \
&& rm /tmp/cuda-keyring.gpg \
&& echo "deb [signed-by=/usr/share/keyrings/nvidia-cuda.gpg] \
https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /" \
> /etc/apt/sources.list.d/cuda.list \
&& apt-get update -qq \
&& apt-get install -y cuda-nvcc-12-8 \
&& rm -rf /var/lib/apt/lists/* \
&& ln -sfn /usr/local/cuda-12.8 /usr/local/cuda
RUN arch="$(dpkg --print-architecture)" \
&& case "$arch" in \
amd64) goarch=amd64 ;; \


@@ -1,5 +1,12 @@
DEBIAN_VERSION=12
DEBIAN_KERNEL_ABI=6.1.0-43
DEBIAN_KERNEL_ABI=auto
NVIDIA_DRIVER_VERSION=590.48.01
NCCL_VERSION=2.28.9-1
NCCL_CUDA_VERSION=13.0
NCCL_SHA256=2e6faafd2c19cffc7738d9283976a3200ea9db9895907f337f0c7e5a25563186
NCCL_TESTS_VERSION=2.13.10
NVCC_VERSION=12.8
CUBLAS_VERSION=13.0.2.14-1
CUDA_USERSPACE_VERSION=13.0.96-1
GO_VERSION=1.24.0
AUDIT_VERSION=0.1.1
AUDIT_VERSION=1.0.0


@@ -7,6 +7,15 @@ set -e
. "$(dirname "$0")/../VERSIONS"
# Pin the exact kernel ABI detected by build.sh so the ISO kernel matches
# the kernel headers used to compile NVIDIA modules. Falls back to meta-package
# when lb config is run manually without the environment variable.
if [ -n "${BEE_KERNEL_ABI:-}" ] && [ "${BEE_KERNEL_ABI}" != "auto" ]; then
LB_LINUX_PACKAGES="linux-image-${BEE_KERNEL_ABI}"
else
LB_LINUX_PACKAGES="linux-image"
fi
lb config noauto \
--distribution bookworm \
--architectures amd64 \
@@ -19,10 +28,10 @@ lb config noauto \
--mirror-binary "https://deb.debian.org/debian" \
--security true \
--linux-flavours "amd64" \
--linux-packages "linux-image-${DEBIAN_KERNEL_ABI}" \
--linux-packages "${LB_LINUX_PACKAGES}" \
--memtest none \
--iso-volume "BEE" \
--iso-application "Bee Hardware Audit" \
--bootappend-live "boot=live components console=tty0 console=ttyS0,115200n8 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
--iso-volume "EASY-BEE" \
--iso-application "EASY-BEE" \
--bootappend-live "boot=live components quiet nomodeset console=tty0 console=ttyS0,115200n8 loglevel=3 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
--apt-recommends false \
"${@}"

File diff suppressed because it is too large

iso/builder/build-cublas.sh Normal file

@@ -0,0 +1,190 @@
#!/bin/sh
# build-cublas.sh — download cuBLASLt/cuBLAS/cudart runtime + headers for bee-gpu-stress.
#
# Downloads .deb packages from NVIDIA's CUDA apt repository (Debian 12, x86_64),
# verifies them against Packages.gz, and extracts the small subset we need:
# - headers for compiling bee-gpu-stress against cuBLASLt
# - runtime libs for libcublas, libcublasLt, libcudart inside the ISO
set -e
CUBLAS_VERSION="$1"
CUDA_USERSPACE_VERSION="$2"
CUDA_SERIES="$3"
DIST_DIR="$4"
[ -n "$CUBLAS_VERSION" ] || { echo "usage: $0 <cublas-version> <cuda-userspace-version> <cuda-series> <dist-dir>"; exit 1; }
[ -n "$CUDA_USERSPACE_VERSION" ] || { echo "usage: $0 <cublas-version> <cuda-userspace-version> <cuda-series> <dist-dir>"; exit 1; }
[ -n "$CUDA_SERIES" ] || { echo "usage: $0 <cublas-version> <cuda-userspace-version> <cuda-series> <dist-dir>"; exit 1; }
[ -n "$DIST_DIR" ] || { echo "usage: $0 <cublas-version> <cuda-userspace-version> <cuda-series> <dist-dir>"; exit 1; }
CUDA_SERIES_DASH=$(printf '%s' "$CUDA_SERIES" | tr '.' '-')
REPO_BASE="https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64"
CACHE_DIR="${DIST_DIR}/cublas-${CUBLAS_VERSION}+cuda${CUDA_SERIES}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/cublas-downloads"
PACKAGES_GZ="${DOWNLOAD_CACHE_DIR}/Packages.gz"
echo "=== cuBLAS ${CUBLAS_VERSION} / cudart ${CUDA_USERSPACE_VERSION} / CUDA ${CUDA_SERIES} ==="
if [ -f "${CACHE_DIR}/include/cublasLt.h" ] && [ -f "${CACHE_DIR}/include/cuda_runtime_api.h" ] \
&& [ -f "${CACHE_DIR}/include/crt/host_defines.h" ] \
&& [ -f "${CACHE_DIR}/include/nv/target" ] \
&& [ "$(find "${CACHE_DIR}/lib" \( -name 'libcublas.so*' -o -name 'libcublasLt.so*' -o -name 'libcudart.so*' \) 2>/dev/null | wc -l)" -gt 0 ]; then
echo "=== cuBLAS cached, skipping download ==="
echo "cache: $CACHE_DIR"
exit 0
fi
mkdir -p "${DOWNLOAD_CACHE_DIR}" "${CACHE_DIR}/include" "${CACHE_DIR}/lib"
echo "=== downloading Packages.gz ==="
wget -q -O "${PACKAGES_GZ}" "${REPO_BASE}/Packages.gz"
lookup_pkg() {
pkg="$1"
ver="$2" # if empty, match any version (first found)
gzip -dc "${PACKAGES_GZ}" | awk -v pkg="$pkg" -v ver="$ver" '
/^Package: / { cur_pkg=$2; gsub(/\r/, "", cur_pkg) }
/^Version: / { cur_ver=$2; gsub(/\r/, "", cur_ver) }
/^Filename: / { cur_file=$2; gsub(/\r/, "", cur_file) }
/^SHA256: / { cur_sha=$2; gsub(/\r/, "", cur_sha) }
/^$/ {
if (cur_pkg == pkg && (ver == "" || cur_ver == ver)) {
print cur_file " " cur_sha
printed=1
exit
}
cur_pkg=""; cur_ver=""; cur_file=""; cur_sha=""
}
END {
if (!printed && cur_pkg == pkg && (ver == "" || cur_ver == ver)) {
print cur_file " " cur_sha
}
}'
}
download_verified_pkg() {
pkg="$1"
ver="$2"
meta="$(lookup_pkg "$pkg" "$ver")"
[ -n "$meta" ] || { echo "ERROR: package metadata not found for ${pkg} ${ver}"; exit 1; }
repo_file="$(printf '%s\n' "$meta" | awk '{print $1}')"
repo_sha="$(printf '%s\n' "$meta" | awk '{print $2}')"
[ -n "$repo_file" ] || { echo "ERROR: package filename missing for ${pkg}"; exit 1; }
[ -n "$repo_sha" ] || { echo "ERROR: package sha missing for ${pkg}"; exit 1; }
out="${DOWNLOAD_CACHE_DIR}/$(basename "$repo_file")"
if [ -f "$out" ]; then
actual_sha="$(sha256sum "$out" | awk '{print $1}')"
if [ "$actual_sha" = "$repo_sha" ]; then
echo "=== using cached $(basename "$repo_file") ===" >&2
printf '%s\n' "$out"
return 0
fi
echo "=== removing stale $(basename "$repo_file") (sha256 mismatch) ===" >&2
rm -f "$out"
fi
echo "=== downloading $(basename "$repo_file") ===" >&2
wget --show-progress -O "$out" "${REPO_BASE}/$(basename "$repo_file")"
actual_sha="$(sha256sum "$out" | awk '{print $1}')"
if [ "$actual_sha" != "$repo_sha" ]; then
echo "ERROR: sha256 mismatch for $(basename "$repo_file")" >&2
echo " expected: $repo_sha" >&2
echo " actual: $actual_sha" >&2
rm -f "$out"
exit 1
fi
echo "sha256 OK: $(basename "$repo_file")" >&2
printf '%s\n' "$out"
}
extract_deb() {
deb="$1"
dst="$2"
mkdir -p "$dst"
(
cd "$dst"
ar x "$deb"
data_tar=$(ls data.tar.* 2>/dev/null | head -1)
[ -n "$data_tar" ] || { echo "ERROR: data.tar.* not found in $deb"; exit 1; }
tar xf "$data_tar"
)
}
copy_headers() {
from="$1"
if [ -d "${from}/usr/include" ]; then
cp -a "${from}/usr/include/." "${CACHE_DIR}/include/"
fi
# NVIDIA CUDA packages install headers under /usr/local/cuda-X.Y/targets/x86_64-linux/include/
find "$from" -type d -name include | while read -r inc_dir; do
case "$inc_dir" in
*/usr/include) ;; # already handled above
*)
if find "${inc_dir}" -maxdepth 3 -type f | grep -q .; then
cp -a "${inc_dir}/." "${CACHE_DIR}/include/"
fi
;;
esac
done
}
copy_libs() {
from="$1"
find "$from" \( -name 'libcublas.so*' -o -name 'libcublasLt.so*' -o -name 'libcudart.so*' \) \
\( -type f -o -type l \) -exec cp -a {} "${CACHE_DIR}/lib/" \;
}
make_links() {
base="$1"
versioned=$(find "${CACHE_DIR}/lib" -maxdepth 1 -name "${base}.so.[0-9]*" -type f | sort | head -1)
[ -n "$versioned" ] || return 0
soname=$(printf '%s\n' "$versioned" | sed -E "s#.*/(${base}\.so\.[0-9]+).*#\\1#")
target=$(basename "$versioned")
ln -sf "$target" "${CACHE_DIR}/lib/${soname}" 2>/dev/null || true
ln -sf "${soname}" "${CACHE_DIR}/lib/${base}.so" 2>/dev/null || true
}
TMP_DIR=$(mktemp -d)
trap 'rm -rf "$TMP_DIR"' EXIT INT TERM
CUBLAS_RT_DEB=$(download_verified_pkg "libcublas-${CUDA_SERIES_DASH}" "${CUBLAS_VERSION}")
CUBLAS_DEV_DEB=$(download_verified_pkg "libcublas-dev-${CUDA_SERIES_DASH}" "${CUBLAS_VERSION}")
CUDART_RT_DEB=$(download_verified_pkg "cuda-cudart-${CUDA_SERIES_DASH}" "${CUDA_USERSPACE_VERSION}")
CUDART_DEV_DEB=$(download_verified_pkg "cuda-cudart-dev-${CUDA_SERIES_DASH}" "${CUDA_USERSPACE_VERSION}")
CUDA_CRT_DEB=$(download_verified_pkg "cuda-crt-${CUDA_SERIES_DASH}" "")
CUDA_CCCL_DEB=$(download_verified_pkg "cuda-cccl-${CUDA_SERIES_DASH}" "")
extract_deb "$CUBLAS_RT_DEB" "${TMP_DIR}/cublas-rt"
extract_deb "$CUBLAS_DEV_DEB" "${TMP_DIR}/cublas-dev"
extract_deb "$CUDART_RT_DEB" "${TMP_DIR}/cudart-rt"
extract_deb "$CUDART_DEV_DEB" "${TMP_DIR}/cudart-dev"
extract_deb "$CUDA_CRT_DEB" "${TMP_DIR}/cuda-crt"
extract_deb "$CUDA_CCCL_DEB" "${TMP_DIR}/cuda-cccl"
copy_headers "${TMP_DIR}/cublas-dev"
copy_headers "${TMP_DIR}/cudart-dev"
copy_headers "${TMP_DIR}/cuda-crt"
copy_headers "${TMP_DIR}/cuda-cccl"
copy_libs "${TMP_DIR}/cublas-rt"
copy_libs "${TMP_DIR}/cudart-rt"
make_links "libcublas"
make_links "libcublasLt"
make_links "libcudart"
[ -f "${CACHE_DIR}/include/cublasLt.h" ] || { echo "ERROR: cublasLt.h not extracted"; exit 1; }
[ -f "${CACHE_DIR}/include/cuda_runtime_api.h" ] || { echo "ERROR: cuda_runtime_api.h not extracted"; exit 1; }
[ "$(find "${CACHE_DIR}/lib" -maxdepth 1 -name 'libcublasLt.so*' | wc -l)" -gt 0 ] || { echo "ERROR: libcublasLt not extracted"; exit 1; }
[ "$(find "${CACHE_DIR}/lib" -maxdepth 1 -name 'libcublas.so*' | wc -l)" -gt 0 ] || { echo "ERROR: libcublas not extracted"; exit 1; }
[ "$(find "${CACHE_DIR}/lib" -maxdepth 1 -name 'libcudart.so*' | wc -l)" -gt 0 ] || { echo "ERROR: libcudart not extracted"; exit 1; }
echo "=== cuBLAS extraction complete ==="
echo "cache: $CACHE_DIR"
echo "headers: $(find "${CACHE_DIR}/include" -type f | wc -l)"
echo "libs: $(find "${CACHE_DIR}/lib" -maxdepth 1 \( -name 'libcublas*.so*' -o -name 'libcudart.so*' \) | wc -l)"
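The Packages.gz lookup in build-cublas.sh can be tried in isolation. The following is a minimal sketch, assuming a made-up two-stanza index (the package names, versions, and checksums are placeholders, and `sample_index` and `lookup_sample_pkg` are hypothetical helper names). It mirrors the blank-line and END handling of `lookup_pkg`, so the last stanza is matched even without a trailing blank line.

```shell
#!/bin/sh
# Hypothetical sample index; the stanzas below are illustrative only.
sample_index() {
cat <<'EOF'
Package: libfoo-12-8
Version: 1.2.3-1
Filename: ./libfoo-12-8_1.2.3-1_amd64.deb
SHA256: aaaa

Package: libbar-12-8
Version: 4.5.6-1
Filename: ./libbar-12-8_4.5.6-1_amd64.deb
SHA256: bbbb
EOF
}

# $1 = package name, $2 = version (an empty version matches any version).
# Prints "<filename> <sha256>" for the first matching stanza, or nothing.
lookup_sample_pkg() {
    sample_index | awk -v pkg="$1" -v ver="$2" '
        /^Package: /  { cur_pkg = $2 }
        /^Version: /  { cur_ver = $2 }
        /^Filename: / { cur_file = $2 }
        /^SHA256: /   { cur_sha = $2 }
        /^$/ {
            # A blank line closes the current stanza.
            if (cur_pkg == pkg && (ver == "" || cur_ver == ver)) {
                print cur_file " " cur_sha
                found = 1
                exit
            }
            cur_pkg = ""; cur_ver = ""; cur_file = ""; cur_sha = ""
        }
        END {
            # The last stanza may not be followed by a blank line.
            if (!found && cur_pkg == pkg && (ver == "" || cur_ver == ver)) {
                print cur_file " " cur_sha
            }
        }'
}

lookup_sample_pkg libfoo-12-8 1.2.3-1
```

The `found` flag matters because awk still runs the END rule after `exit`; without it a match in the final stanza could be printed twice.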


@@ -1,5 +1,5 @@
#!/bin/sh
# build-in-container.sh — build the bee ISO inside a Debian container.
# build-in-container.sh — build the bee ISO inside the Debian builder container.
set -e
@@ -7,6 +7,7 @@ REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
BUILDER_DIR="${REPO_ROOT}/iso/builder"
CONTAINER_TOOL="${CONTAINER_TOOL:-docker}"
IMAGE_TAG="${BEE_BUILDER_IMAGE:-bee-iso-builder}"
BUILDER_PLATFORM="${BEE_BUILDER_PLATFORM:-linux/amd64}"
CACHE_DIR="${BEE_BUILDER_CACHE_DIR:-${REPO_ROOT}/dist/container-cache}"
AUTH_KEYS=""
REBUILD_IMAGE=0
@@ -40,6 +41,13 @@ if ! command -v "$CONTAINER_TOOL" >/dev/null 2>&1; then
exit 1
fi
PLATFORM_OS="${BUILDER_PLATFORM%/*}"
PLATFORM_ARCH="${BUILDER_PLATFORM#*/}"
if [ -z "$PLATFORM_OS" ] || [ -z "$PLATFORM_ARCH" ] || [ "$PLATFORM_OS" = "$BUILDER_PLATFORM" ]; then
echo "invalid BEE_BUILDER_PLATFORM: ${BUILDER_PLATFORM} (expected os/arch, e.g. linux/amd64)" >&2
exit 1
fi
if [ -n "$AUTH_KEYS" ]; then
[ -f "$AUTH_KEYS" ] || { echo "authorized_keys not found: $AUTH_KEYS" >&2; exit 1; }
AUTH_KEYS_ABS="$(cd "$(dirname "$AUTH_KEYS")" && pwd)/$(basename "$AUTH_KEYS")"
@@ -56,20 +64,38 @@ mkdir -p \
IMAGE_REF="${IMAGE_TAG}:debian${DEBIAN_VERSION}"
if [ "$REBUILD_IMAGE" = "1" ] || ! "$CONTAINER_TOOL" image inspect "${IMAGE_REF}" >/dev/null 2>&1; then
image_matches_platform() {
actual_platform="$("$CONTAINER_TOOL" image inspect --format '{{.Os}}/{{.Architecture}}' "${IMAGE_REF}" 2>/dev/null || true)"
[ "$actual_platform" = "${BUILDER_PLATFORM}" ]
}
NEED_BUILD_IMAGE=0
if [ "$REBUILD_IMAGE" = "1" ]; then
NEED_BUILD_IMAGE=1
elif ! "$CONTAINER_TOOL" image inspect "${IMAGE_REF}" >/dev/null 2>&1; then
NEED_BUILD_IMAGE=1
elif ! image_matches_platform; then
actual_platform="$("$CONTAINER_TOOL" image inspect --format '{{.Os}}/{{.Architecture}}' "${IMAGE_REF}" 2>/dev/null || echo unknown)"
echo "=== rebuilding builder image ${IMAGE_REF}: platform mismatch (${actual_platform} != ${BUILDER_PLATFORM}) ==="
NEED_BUILD_IMAGE=1
fi
if [ "$NEED_BUILD_IMAGE" = "1" ]; then
"$CONTAINER_TOOL" build \
--platform "${BUILDER_PLATFORM}" \
--build-arg GO_VERSION="${GO_VERSION}" \
--build-arg DEBIAN_KERNEL_ABI="${DEBIAN_KERNEL_ABI}" \
-t "${IMAGE_REF}" \
"${BUILDER_DIR}"
else
echo "=== using existing builder image ${IMAGE_REF} ==="
echo "=== using existing builder image ${IMAGE_REF} (${BUILDER_PLATFORM}) ==="
fi
set -- \
run --rm --privileged \
--platform "${BUILDER_PLATFORM}" \
-v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \
@@ -80,9 +106,11 @@ set -- \
if [ -n "$AUTH_KEYS" ]; then
set -- run --rm --privileged \
--platform "${BUILDER_PLATFORM}" \
-v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \
-v "${AUTH_KEYS_DIR}:/tmp/bee-authkeys:ro" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \
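The `BEE_BUILDER_PLATFORM` check added in this file relies on two parameter expansions. A minimal sketch of the same validation as a standalone function follows; `validate_platform` is a hypothetical name used only for illustration and is not part of the script.

```shell
#!/bin/sh
# Split "os/arch" the same way the hunk above does and reject
# values that have no slash or an empty component.
validate_platform() {
    platform="$1"
    os="${platform%/*}"     # text before the last slash
    arch="${platform#*/}"   # text after the first slash
    # Without a slash, "%/*" leaves the value unchanged, so os == platform.
    if [ -z "$os" ] || [ -z "$arch" ] || [ "$os" = "$platform" ]; then
        return 1
    fi
    printf '%s %s\n' "$os" "$arch"
}

validate_platform linux/amd64
```

A value without a slash, such as `amd64`, is rejected because the `%/*` expansion returns it unchanged.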

iso/builder/build-nccl-tests.sh (new executable file, 138 lines)

@@ -0,0 +1,138 @@
#!/bin/sh
# build-nccl-tests.sh — build nccl-tests all_reduce_perf for the LiveCD.
#
# Downloads nccl-tests source from GitHub, downloads libnccl-dev .deb for
# nccl.h, and compiles all_reduce_perf with nvcc (cuda-nvcc-13-0).
#
# Output is cached in DIST_DIR/nccl-tests-<version>/ so subsequent builds
# are instant unless NCCL_TESTS_VERSION changes.
#
# Output layout:
# $CACHE_DIR/bin/all_reduce_perf
set -e
NCCL_TESTS_VERSION="$1"
NCCL_VERSION="$2"
NCCL_CUDA_VERSION="$3"
DIST_DIR="$4"
[ -n "$NCCL_TESTS_VERSION" ] || { echo "usage: $0 <nccl-tests-version> <nccl-version> <cuda-version> <dist-dir>"; exit 1; }
[ -n "$NCCL_VERSION" ] || { echo "usage: $0 <nccl-tests-version> <nccl-version> <cuda-version> <dist-dir>"; exit 1; }
[ -n "$NCCL_CUDA_VERSION" ] || { echo "usage: $0 <nccl-tests-version> <nccl-version> <cuda-version> <dist-dir>"; exit 1; }
[ -n "$DIST_DIR" ] || { echo "usage: $0 <nccl-tests-version> <nccl-version> <cuda-version> <dist-dir>"; exit 1; }
echo "=== nccl-tests ${NCCL_TESTS_VERSION} ==="
CACHE_DIR="${DIST_DIR}/nccl-tests-${NCCL_TESTS_VERSION}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/nccl-tests-downloads"
if [ -f "${CACHE_DIR}/bin/all_reduce_perf" ]; then
echo "=== nccl-tests cached, skipping build ==="
echo "binary: ${CACHE_DIR}/bin/all_reduce_perf"
exit 0
fi
# Resolve nvcc to an absolute path so that CUDA_HOME can be derived from it
# (e.g. cuda-nvcc-12-8 installs to /usr/local/cuda-12.8/bin/nvcc)
NVCC=""
for candidate in nvcc /usr/local/cuda-12.8/bin/nvcc /usr/local/cuda-12/bin/nvcc /usr/local/cuda/bin/nvcc; do
if resolved="$(command -v "$candidate" 2>/dev/null)"; then
NVCC="$resolved"
break
fi
done
[ -n "$NVCC" ] || { echo "ERROR: nvcc not found — install cuda-nvcc-13-0"; exit 1; }
echo "nvcc: $NVCC"
# Determine CUDA_HOME from nvcc location
CUDA_HOME="$(dirname "$(dirname "$NVCC")")"
echo "CUDA_HOME: $CUDA_HOME"
# Download libnccl-dev for nccl.h
REPO_BASE="https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64"
DEV_PKG="libnccl-dev_${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}_amd64.deb"
DEV_URL="${REPO_BASE}/${DEV_PKG}"
mkdir -p "$DOWNLOAD_CACHE_DIR"
DEV_DEB="${DOWNLOAD_CACHE_DIR}/${DEV_PKG}"
if [ ! -f "$DEV_DEB" ]; then
echo "=== downloading libnccl-dev ==="
wget --show-progress -O "$DEV_DEB" "$DEV_URL"
fi
# Extract nccl.h from libnccl-dev
NCCL_INCLUDE_TMP=$(mktemp -d)
trap 'rm -rf "$NCCL_INCLUDE_TMP" "$BUILD_TMP"' EXIT INT TERM
cd "$NCCL_INCLUDE_TMP"
ar x "$DEV_DEB"
DATA_TAR=$(ls data.tar.* 2>/dev/null | head -1)
[ -n "$DATA_TAR" ] || { echo "ERROR: data.tar.* not found in libnccl-dev .deb"; exit 1; }
tar xf "$DATA_TAR"
# nccl.h lands in ./usr/include/ or ./usr/local/cuda-X.Y/targets/.../include/
NCCL_H=$(find . -name 'nccl.h' -type f 2>/dev/null | head -1)
[ -n "$NCCL_H" ] || { echo "ERROR: nccl.h not found in libnccl-dev package"; exit 1; }
NCCL_INCLUDE_DIR="$(pwd)/$(dirname "$NCCL_H")"
echo "nccl.h: $NCCL_H"
# libnccl.so comes from the already-built NCCL cache (build-nccl.sh ran first)
NCCL_LIB_DIR="${DIST_DIR}/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}/lib"
[ -d "$NCCL_LIB_DIR" ] || { echo "ERROR: NCCL lib dir not found at $NCCL_LIB_DIR — run build-nccl.sh first"; exit 1; }
echo "nccl lib: $NCCL_LIB_DIR"
# Download nccl-tests source
SRC_TAR="${DOWNLOAD_CACHE_DIR}/nccl-tests-v${NCCL_TESTS_VERSION}.tar.gz"
SRC_URL="https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v${NCCL_TESTS_VERSION}.tar.gz"
if [ ! -f "$SRC_TAR" ]; then
echo "=== downloading nccl-tests v${NCCL_TESTS_VERSION} ==="
wget --show-progress -O "$SRC_TAR" "$SRC_URL"
fi
# Extract and build
BUILD_TMP=$(mktemp -d)
cd "$BUILD_TMP"
tar xf "$SRC_TAR"
SRC_DIR=$(ls -d nccl-tests-* 2>/dev/null | head -1)
[ -n "$SRC_DIR" ] || { echo "ERROR: source directory not found in archive"; exit 1; }
cd "$SRC_DIR"
echo "=== building all_reduce_perf ==="
# Pick gencode based on the actual nvcc version:
# CUDA 12.x — Volta..Blackwell (sm_70..sm_100)
# CUDA 13.x — Hopper and Blackwell only (sm_90, sm_100): nvcc 13 no longer targets
# Pascal/Volta, and this script also leaves out Ampere (sm_80/sm_86)
NVCC_MAJOR=$("$NVCC" --version 2>/dev/null | grep -oE 'release [0-9]+' | awk '{print $2}' | head -1)
echo "nvcc major version: ${NVCC_MAJOR:-unknown}"
if [ "${NVCC_MAJOR:-0}" -ge 13 ] 2>/dev/null; then
GENCODE="-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_100,code=sm_100"
echo "gencode: sm_90 sm_100 (CUDA 13+)"
else
GENCODE="-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_100,code=sm_100"
echo "gencode: sm_70..sm_100 (CUDA 12)"
fi
LIBRARY_PATH="$NCCL_LIB_DIR${LIBRARY_PATH:+:$LIBRARY_PATH}" \
make MPI=0 \
NVCC="$NVCC" \
CUDA_HOME="$CUDA_HOME" \
NCCL_HOME="$NCCL_INCLUDE_DIR/.." \
NCCL_LIB="$NCCL_LIB_DIR" \
NVCC_GENCODE="$GENCODE" \
BUILDDIR="./build"
[ -f "./build/all_reduce_perf" ] || { echo "ERROR: all_reduce_perf not found after build"; exit 1; }
mkdir -p "${CACHE_DIR}/bin"
cp "./build/all_reduce_perf" "${CACHE_DIR}/bin/all_reduce_perf"
chmod +x "${CACHE_DIR}/bin/all_reduce_perf"
echo "=== nccl-tests build complete ==="
echo "binary: ${CACHE_DIR}/bin/all_reduce_perf"
ls -lh "${CACHE_DIR}/bin/all_reduce_perf"
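The gencode selection in build-nccl-tests.sh depends only on the major version parsed from the `nvcc --version` banner. A minimal sketch of that logic, runnable without nvcc installed, is shown here. The function names and the sample banner string are assumptions for illustration, and the output is a list of compute capabilities rather than full `-gencode` flags.

```shell
#!/bin/sh
# Extract the CUDA major version from an nvcc banner line.
nvcc_major_from_banner() {
    printf '%s\n' "$1" | grep -oE 'release [0-9]+' | awk '{print $2}' | head -1
}

# Choose the compute capabilities to build for, following the script's split:
# CUDA 13 and later get sm_90 and sm_100, anything else gets sm_70..sm_100.
pick_archs() {
    major="${1:-0}"   # an empty or missing value falls back to 0
    if [ "$major" -ge 13 ] 2>/dev/null; then
        echo "90 100"
    else
        echo "70 80 86 90 100"
    fi
}

# Illustrative banner text, not captured from a real build.
banner='Cuda compilation tools, release 12.8, V12.8.93'
pick_archs "$(nvcc_major_from_banner "$banner")"
```

Defaulting the major version to 0 means an unparseable banner selects the wider CUDA 12 list, which matches the `${NVCC_MAJOR:-0}` fallback in the script.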

iso/builder/build-nccl.sh (new executable file, 94 lines)

@@ -0,0 +1,94 @@
#!/bin/sh
# build-nccl.sh — download and extract NCCL shared library for the LiveCD.
#
# Downloads libnccl2 .deb from NVIDIA's CUDA apt repository (Debian 12, x86_64)
# and extracts the shared library. Package integrity is verified via sha256
# when a checksum is passed as the optional fourth argument.
#
# Output is cached in DIST_DIR/nccl-<version>+cuda<cuda>/ so subsequent builds
# are instant unless NCCL_VERSION or NCCL_CUDA_VERSION changes.
#
# Output layout:
# $CACHE_DIR/lib/ — libnccl.so.* files
set -e
NCCL_VERSION="$1"
NCCL_CUDA_VERSION="$2"
DIST_DIR="$3"
EXPECTED_SHA256="$4"
[ -n "$NCCL_VERSION" ] || { echo "usage: $0 <nccl-version> <cuda-version> <dist-dir> [sha256]"; exit 1; }
[ -n "$NCCL_CUDA_VERSION" ] || { echo "usage: $0 <nccl-version> <cuda-version> <dist-dir> [sha256]"; exit 1; }
[ -n "$DIST_DIR" ] || { echo "usage: $0 <nccl-version> <cuda-version> <dist-dir> [sha256]"; exit 1; }
echo "=== NCCL ${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION} ==="
CACHE_DIR="${DIST_DIR}/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/nccl-downloads"
if [ -d "$CACHE_DIR/lib" ] && [ "$(ls "$CACHE_DIR/lib/"libnccl.so.* 2>/dev/null | wc -l)" -gt 0 ]; then
echo "=== NCCL cached, skipping download ==="
echo "cache: $CACHE_DIR"
echo "libs: $(ls "$CACHE_DIR/lib/" | wc -l) files"
exit 0
fi
REPO_BASE="https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64"
PKG_NAME="libnccl2_${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}_amd64.deb"
PKG_URL="${REPO_BASE}/${PKG_NAME}"
mkdir -p "$DOWNLOAD_CACHE_DIR"
DEB_FILE="${DOWNLOAD_CACHE_DIR}/${PKG_NAME}"
echo "=== downloading NCCL package ==="
echo "URL: ${PKG_URL}"
wget --show-progress -O "$DEB_FILE" "$PKG_URL"
if [ -n "$EXPECTED_SHA256" ]; then
echo "=== verifying sha256 ==="
ACTUAL_SHA256=$(sha256sum "$DEB_FILE" | awk '{print $1}')
if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
echo "ERROR: sha256 mismatch"
echo " expected: $EXPECTED_SHA256"
echo " actual: $ACTUAL_SHA256"
rm -f "$DEB_FILE"
exit 1
fi
echo "sha256 OK"
fi
echo "=== extracting NCCL libraries ==="
EXTRACT_TMP=$(mktemp -d)
trap 'rm -rf "$EXTRACT_TMP"' EXIT INT TERM
# .deb is an ar archive; data.tar.* contains the actual files
cd "$EXTRACT_TMP"
ar x "$DEB_FILE"
# Extract data tarball (xz, gz, or zst)
DATA_TAR=$(ls data.tar.* 2>/dev/null | head -1)
[ -n "$DATA_TAR" ] || { echo "ERROR: data.tar.* not found in .deb"; exit 1; }
tar xf "$DATA_TAR"
# Library lands in ./usr/lib/x86_64-linux-gnu/ or ./usr/lib/
mkdir -p "$CACHE_DIR/lib"
found=0
for f in $(find . -name 'libnccl.so.*' -not -type d 2>/dev/null); do
cp "$f" "$CACHE_DIR/lib/"
found=$((found + 1))
done
[ "$found" -gt 0 ] || { echo "ERROR: libnccl.so.* not found in package"; exit 1; }
# Create soname symlinks: libnccl.so.2 -> libnccl.so.<full>, libnccl.so -> libnccl.so.2
versioned=$(ls "$CACHE_DIR/lib/libnccl.so."[0-9][0-9.]* 2>/dev/null | head -1)
if [ -n "$versioned" ]; then
base=$(basename "$versioned")
ln -sf "$base" "$CACHE_DIR/lib/libnccl.so.2" 2>/dev/null || true
ln -sf "libnccl.so.2" "$CACHE_DIR/lib/libnccl.so" 2>/dev/null || true
fi
echo "=== NCCL extraction complete ==="
echo "cache: $CACHE_DIR"
ls -lh "$CACHE_DIR/lib/"
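The soname symlink chain described in the comment above (libnccl.so.2 -> libnccl.so.<full>, libnccl.so -> libnccl.so.2) can be sketched generically. This is an illustration under assumptions: `link_sonames` and the `libdemo` file name are hypothetical, and the soname is derived from the file name rather than hard-coded to `.2` as in build-nccl.sh.

```shell
#!/bin/sh
# Create <base>.so.<major> -> <base>.so.<full> and <base>.so -> <base>.so.<major>
# for the first versioned library found in a directory.
link_sonames() {
    libdir="$1"
    base="$2"
    versioned=$(find "$libdir" -maxdepth 1 -type f -name "${base}.so.[0-9]*" | sort | head -1)
    [ -n "$versioned" ] || return 0
    full=$(basename "$versioned")
    # Keep only the major component: libdemo.so.2.21.5 becomes libdemo.so.2
    soname=$(printf '%s\n' "$full" | sed -E "s/^(${base}\.so\.[0-9]+).*/\1/")
    # Skip the first link if the file is already named like its soname,
    # otherwise ln -sf would replace the real file with a self-referencing link.
    [ "$soname" = "$full" ] || ln -sf "$full" "${libdir}/${soname}"
    ln -sf "$soname" "${libdir}/${base}.so"
}

demo_dir=$(mktemp -d)
: > "${demo_dir}/libdemo.so.2.21.5"   # empty placeholder standing in for a real library
link_sonames "$demo_dir" libdemo
ls -l "$demo_dir"
rm -rf "$demo_dir"
```

The guard against `soname = full` addresses the same circular-symlink concern that the NVIDIA driver script handles with its `[0-9][0-9]*` glob.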


@@ -46,7 +46,8 @@ CACHE_DIR="${DIST_DIR}/nvidia-${NVIDIA_VERSION}-${KVER}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/nvidia-downloads"
EXTRACT_CACHE_DIR="${CACHE_ROOT}/nvidia-extract"
if [ -d "$CACHE_DIR/modules" ] && [ -f "$CACHE_DIR/bin/nvidia-smi" ]; then
if [ -d "$CACHE_DIR/modules" ] && [ -f "$CACHE_DIR/bin/nvidia-smi" ] \
&& [ "$(ls "$CACHE_DIR/lib/libnvidia-ptxjitcompiler.so."* 2>/dev/null | wc -l)" -gt 0 ]; then
echo "=== NVIDIA cached, skipping build ==="
echo "cache: $CACHE_DIR"
echo "modules: $(ls "$CACHE_DIR/modules/"*.ko 2>/dev/null | wc -l) .ko files"
@@ -129,8 +130,10 @@ else
echo "WARNING: no firmware/ dir found in installer (may be needed for Hopper GPUs)"
fi
# Copy ALL userspace library files
for lib in libnvidia-ml libcuda; do
# Copy ALL userspace library files.
# libnvidia-ptxjitcompiler is required by libcuda for PTX JIT compilation
# (cuModuleLoadDataEx with PTX source) — without it CUDA_ERROR_JIT_COMPILER_NOT_FOUND.
for lib in libnvidia-ml libcuda libnvidia-ptxjitcompiler; do
count=0
for f in $(find "$EXTRACT_DIR" -maxdepth 1 -name "${lib}.so.*" 2>/dev/null); do
cp "$f" "$CACHE_DIR/lib/" && count=$((count+1))
@@ -147,7 +150,7 @@ ko_count=$(ls "$CACHE_DIR/modules/"*.ko 2>/dev/null | wc -l)
[ "$ko_count" -gt 0 ] || { echo "ERROR: no .ko files built in $CACHE_DIR/modules/"; exit 1; }
# Create soname symlinks: use [0-9][0-9]* to avoid circular symlink (.so.1 has single digit)
for lib in libnvidia-ml libcuda; do
for lib in libnvidia-ml libcuda libnvidia-ptxjitcompiler; do
versioned=$(ls "$CACHE_DIR/lib/${lib}.so."[0-9][0-9]* 2>/dev/null | head -1)
[ -n "$versioned" ] || continue
base=$(basename "$versioned")


@@ -1,14 +1,13 @@
#!/bin/sh
# build.sh — build bee ISO (Debian 12 / live-build)
#
# Single build script. Produces a bootable live ISO with SSH access, TUI, NVIDIA drivers.
#
# Run on Debian 12 builder VM as root after setup-builder.sh.
# Usage:
# sh iso/builder/build.sh [--authorized-keys /path/to/authorized_keys]
# build.sh — internal ISO build entrypoint executed inside the builder container.
set -e
if [ "${BEE_CONTAINER_BUILD:-0}" != "1" ]; then
echo "build.sh must run inside iso/builder/build-in-container.sh" >&2
exit 1
fi
REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
BUILDER_DIR="${REPO_ROOT}/iso/builder"
OVERLAY_DIR="${REPO_ROOT}/iso/overlay"
@@ -35,8 +34,94 @@ mkdir -p "${CACHE_ROOT}"
: "${GOMODCACHE:=${CACHE_ROOT}/go-mod}"
export GOCACHE GOMODCACHE
resolve_audit_version() {
if [ -n "${BEE_AUDIT_VERSION:-}" ]; then
echo "${BEE_AUDIT_VERSION}"
return 0
fi
tag="$(git -C "${REPO_ROOT}" describe --tags --match 'audit/v*' --abbrev=7 --dirty 2>/dev/null || true)"
if [ -z "${tag}" ]; then
tag="$(git -C "${REPO_ROOT}" describe --tags --match 'v*' --abbrev=7 --dirty 2>/dev/null || true)"
fi
case "${tag}" in
audit/v*)
echo "${tag#audit/v}"
return 0
;;
v*)
echo "${tag#v}"
return 0
;;
"")
;;
*)
echo "${tag}"
return 0
;;
esac
if [ -n "${AUDIT_VERSION:-}" ]; then
echo "${AUDIT_VERSION}"
return 0
fi
date +%Y%m%d
}
# ISO image versioned separately from the audit binary (iso/v* tags).
resolve_iso_version() {
if [ -n "${BEE_ISO_VERSION:-}" ]; then
echo "${BEE_ISO_VERSION}"
return 0
fi
tag="$(git -C "${REPO_ROOT}" describe --tags --match 'iso/v*' --abbrev=7 --dirty 2>/dev/null || true)"
case "${tag}" in
iso/v*)
echo "${tag#iso/v}"
return 0
;;
esac
# Fall back to audit version so the name is still meaningful
resolve_audit_version
}
AUDIT_VERSION_EFFECTIVE="$(resolve_audit_version)"
ISO_VERSION_EFFECTIVE="$(resolve_iso_version)"
# Auto-detect kernel ABI: refresh apt index, then query current linux-image-amd64 dependency.
# If headers for the detected ABI are not yet installed (kernel updated since image build),
# install them on the fly so NVIDIA modules and ISO kernel always match.
if [ -z "${DEBIAN_KERNEL_ABI}" ] || [ "${DEBIAN_KERNEL_ABI}" = "auto" ]; then
echo "=== refreshing apt index to detect current kernel ABI ==="
apt-get update -qq
DEBIAN_KERNEL_ABI=$(apt-cache depends linux-image-amd64 2>/dev/null \
| awk '/Depends:.*linux-image-[0-9]/{print $2}' \
| grep -oE '[0-9]+\.[0-9]+\.[0-9]+-[0-9]+' \
| head -1)
if [ -z "${DEBIAN_KERNEL_ABI}" ]; then
echo "ERROR: could not auto-detect kernel ABI from apt-cache" >&2
exit 1
fi
echo "=== kernel ABI: ${DEBIAN_KERNEL_ABI} ==="
fi
# Export detected ABI so that auto/config can pin the exact kernel package
# (prevents NVIDIA module/kernel mismatch if linux-image-amd64 meta-package
# gets updated between build.sh start and lb build chroot step)
export BEE_KERNEL_ABI="${DEBIAN_KERNEL_ABI}"
KVER="${DEBIAN_KERNEL_ABI}-amd64"
if [ ! -d "/usr/src/linux-headers-${KVER}" ]; then
echo "=== installing linux-headers-${KVER} (kernel updated since image build) ==="
apt-get install -y "linux-headers-${KVER}"
fi
echo "=== bee ISO build ==="
echo "Debian: ${DEBIAN_VERSION}, Kernel ABI: ${DEBIAN_KERNEL_ABI}, Go: ${GO_VERSION}"
echo "Audit version: ${AUDIT_VERSION_EFFECTIVE}, ISO version: ${ISO_VERSION_EFFECTIVE}"
echo ""
echo "=== syncing git submodules ==="
@@ -56,7 +141,7 @@ if [ "$NEED_BUILD" = "1" ]; then
cd "${REPO_ROOT}/audit"
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 \
go build \
-ldflags "-s -w -X main.Version=${AUDIT_VERSION:-$(date +%Y%m%d)}" \
-ldflags "-s -w -X main.Version=${AUDIT_VERSION_EFFECTIVE}" \
-o "$BEE_BIN" \
./cmd/bee
echo "binary: $BEE_BIN"
@@ -74,6 +159,16 @@ else
echo "=== bee binary up to date, skipping build ==="
fi
echo ""
echo "=== downloading cuBLAS/cuBLASLt/cudart ${NCCL_CUDA_VERSION} userspace ==="
sh "${BUILDER_DIR}/build-cublas.sh" \
"${CUBLAS_VERSION}" \
"${CUDA_USERSPACE_VERSION}" \
"${NCCL_CUDA_VERSION}" \
"${DIST_DIR}"
CUBLAS_CACHE="${DIST_DIR}/cublas-${CUBLAS_VERSION}+cuda${NCCL_CUDA_VERSION}"
GPU_STRESS_NEED_BUILD=1
if [ -f "$GPU_STRESS_BIN" ] && [ "${BUILDER_DIR}/bee-gpu-stress.c" -ot "$GPU_STRESS_BIN" ]; then
GPU_STRESS_NEED_BUILD=0
@@ -82,9 +177,10 @@ fi
if [ "$GPU_STRESS_NEED_BUILD" = "1" ]; then
echo "=== building bee-gpu-stress ==="
gcc -O2 -s -Wall -Wextra \
-I"${CUBLAS_CACHE}/include" \
-o "$GPU_STRESS_BIN" \
"${BUILDER_DIR}/bee-gpu-stress.c" \
-ldl
-ldl -lm
echo "binary: $GPU_STRESS_BIN"
else
echo "=== bee-gpu-stress up to date, skipping build ==="
@@ -101,7 +197,8 @@ rm -f \
"${OVERLAY_STAGE_DIR}/root/.ssh/authorized_keys" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest"
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf"
# --- inject authorized_keys for SSH access ---
AUTHORIZED_KEYS_FILE="${OVERLAY_STAGE_DIR}/root/.ssh/authorized_keys"
@@ -187,18 +284,52 @@ if [ -d "${NVIDIA_CACHE}/firmware" ] && [ "$(ls -A "${NVIDIA_CACHE}/firmware" 2>
echo "=== firmware: $(ls "${OVERLAY_STAGE_DIR}/lib/firmware/nvidia/${NVIDIA_DRIVER_VERSION}/" | wc -l) files injected ==="
fi
# --- build / download NCCL ---
echo ""
echo "=== downloading NCCL ${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION} ==="
sh "${BUILDER_DIR}/build-nccl.sh" "${NCCL_VERSION}" "${NCCL_CUDA_VERSION}" "${DIST_DIR}" "${NCCL_SHA256:-}"
NCCL_CACHE="${DIST_DIR}/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}"
# Inject libnccl.so.* into overlay alongside other NVIDIA userspace libs
cp "${NCCL_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/"
echo "=== NCCL: $(ls "${NCCL_CACHE}/lib/" | wc -l) files injected into /usr/lib/ ==="
# Inject cuBLAS/cuBLASLt/cudart runtime libs used by bee-gpu-stress tensor-core GEMM path
cp "${CUBLAS_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/"
echo "=== cuBLAS: $(ls "${CUBLAS_CACHE}/lib/" | wc -l) files injected into /usr/lib/ ==="
# --- build nccl-tests ---
echo ""
echo "=== building nccl-tests ${NCCL_TESTS_VERSION} ==="
sh "${BUILDER_DIR}/build-nccl-tests.sh" \
"${NCCL_TESTS_VERSION}" \
"${NCCL_VERSION}" \
"${NCCL_CUDA_VERSION}" \
"${DIST_DIR}"
NCCL_TESTS_CACHE="${DIST_DIR}/nccl-tests-${NCCL_TESTS_VERSION}"
cp "${NCCL_TESTS_CACHE}/bin/all_reduce_perf" "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf"
echo "=== all_reduce_perf injected ==="
# --- embed build metadata ---
mkdir -p "${OVERLAY_STAGE_DIR}/etc"
BUILD_DATE="$(date +%Y-%m-%d)"
GIT_COMMIT="$(git -C "${REPO_ROOT}" rev-parse --short HEAD 2>/dev/null || echo unknown)"
cat > "${OVERLAY_STAGE_DIR}/etc/bee-release" <<EOF
BEE_ISO_VERSION=${AUDIT_VERSION}
BEE_AUDIT_VERSION=${AUDIT_VERSION}
BEE_ISO_VERSION=${ISO_VERSION_EFFECTIVE}
BEE_AUDIT_VERSION=${AUDIT_VERSION_EFFECTIVE}
BUILD_DATE=${BUILD_DATE}
GIT_COMMIT=${GIT_COMMIT}
DEBIAN_VERSION=${DEBIAN_VERSION}
DEBIAN_KERNEL_ABI=${DEBIAN_KERNEL_ABI}
NVIDIA_DRIVER_VERSION=${NVIDIA_DRIVER_VERSION}
NCCL_VERSION=${NCCL_VERSION}
NCCL_CUDA_VERSION=${NCCL_CUDA_VERSION}
CUBLAS_VERSION=${CUBLAS_VERSION}
CUDA_USERSPACE_VERSION=${CUDA_USERSPACE_VERSION}
NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}
EOF
# Patch motd with build info
@@ -232,7 +363,7 @@ lb build 2>&1
# live-build outputs live-image-amd64.hybrid.iso in LB_DIR
ISO_RAW="${LB_DIR}/live-image-amd64.hybrid.iso"
ISO_OUT="${DIST_DIR}/bee-debian${DEBIAN_VERSION}-v${AUDIT_VERSION}-amd64.iso"
ISO_OUT="${DIST_DIR}/bee-debian${DEBIAN_VERSION}-v${ISO_VERSION_EFFECTIVE}-amd64.iso"
if [ -f "$ISO_RAW" ]; then
cp "$ISO_RAW" "$ISO_OUT"
echo ""
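The tag handling in `resolve_audit_version` and `resolve_iso_version` comes down to stripping a known prefix from the `git describe` output. A minimal sketch of that mapping follows, without calling git; `version_from_tag` and the sample tag are hypothetical and used only to show the prefix rules.

```shell
#!/bin/sh
# Map a git-describe style tag to a bare version string.
# Returns non-zero for an empty tag so the caller can fall back
# to another source (an environment variable or the current date).
version_from_tag() {
    tag="$1"
    case "$tag" in
        audit/v*) echo "${tag#audit/v}" ;;
        iso/v*)   echo "${tag#iso/v}" ;;
        v*)       echo "${tag#v}" ;;
        "")       return 1 ;;
        *)        echo "$tag" ;;
    esac
}

version_from_tag "audit/v1.4.0-3-gabc1234-dirty"
```

Because `git describe --dirty --abbrev=7` appends the commit distance, short hash, and a dirty marker, the resulting version stays unique for untagged or modified trees.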


@@ -1,12 +1,31 @@
source /boot/grub/config.cfg
menuentry "Bee Hardware Audit" {
linux @KERNEL_LIVE@ @APPEND_LIVE@
echo ""
echo " ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗"
echo " ██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝"
echo " █████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗"
echo " ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝"
echo " ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗"
echo " ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝"
echo ""
menuentry "EASY-BEE" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal
initrd @INITRD_LIVE@
}
menuentry "Bee Hardware Audit (fail-safe)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ memtest noapic noapm nodma nomce nolapic nosmp vga=normal
menuentry "EASY-BEE (load to RAM)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram bee.nvidia.mode=normal
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (NVIDIA GSP=off)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=gsp-off
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (fail-safe)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=gsp-off memtest noapic noapm nodma nomce nolapic nosmp vga=normal
initrd @INITRD_LIVE@
}


@@ -1,4 +1,4 @@
desktop-image: "../splash.png"
desktop-color: "#000000"
title-color: "#f5a800"
title-font: "Unifont Regular 16"
title-text: ""

Binary file not shown.


@@ -0,0 +1,24 @@
label live-@FLAVOUR@-normal
menu label ^EASY-BEE
menu default
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=normal
label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ toram bee.nvidia.mode=normal
label live-@FLAVOUR@-gsp-off
menu label EASY-BEE (^NVIDIA GSP=off)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=gsp-off
label live-@FLAVOUR@-failsafe
menu label EASY-BEE (^fail-safe)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=gsp-off memtest noapic noapm nodma nomce nolapic nosmp vga=normal


@@ -5,29 +5,49 @@ set -e
echo "=== bee chroot setup ==="
ensure_bee_console_user() {
if id bee >/dev/null 2>&1; then
usermod -d /home/bee -s /bin/sh bee 2>/dev/null || true
else
useradd -d /home/bee -m -s /bin/sh -U bee
fi
mkdir -p /home/bee
chown -R bee:bee /home/bee
echo "bee:eeb" | chpasswd
usermod -aG sudo,video,input bee 2>/dev/null || true
}
ensure_bee_console_user
# Enable bee services
systemctl enable nvidia-dcgm.service 2>/dev/null || true
systemctl enable bee-network.service
systemctl enable bee-nvidia.service
systemctl enable bee-preflight.service
systemctl enable bee-audit.service
systemctl enable bee-web.service
systemctl enable bee-sshsetup.service
systemctl enable ssh.service
systemctl enable lightdm.service 2>/dev/null || true
systemctl enable qemu-guest-agent.service 2>/dev/null || true
systemctl enable serial-getty@ttyS0.service 2>/dev/null || true
systemctl enable serial-getty@ttyS1.service 2>/dev/null || true
systemctl enable bee-journal-mirror@ttyS1.service 2>/dev/null || true
# Ensure scripts are executable
chmod +x /usr/local/bin/bee-network.sh 2>/dev/null || true
chmod +x /usr/local/bin/bee-nvidia-load 2>/dev/null || true
chmod +x /usr/local/bin/bee-sshsetup 2>/dev/null || true
chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true
chmod +x /usr/local/bin/bee-tui 2>/dev/null || true
chmod +x /usr/local/bin/bee 2>/dev/null || true
chmod +x /usr/local/bin/bee-log-run 2>/dev/null || true
# Reload udev rules
udevadm control --reload-rules 2>/dev/null || true
# Create log directory
mkdir -p /var/log
# Create export directory
mkdir -p /appdata/bee/export
if [ -f /etc/sudoers.d/bee ]; then
chmod 0440 /etc/sudoers.d/bee


@@ -0,0 +1,103 @@
#!/bin/sh
# 9001-amd-rocm.hook.chroot — install AMD ROCm SMI tool for Instinct GPU monitoring.
# Runs inside the live-build chroot. Adds AMD's apt repository and installs
# rocm-smi-lib which provides the `rocm-smi` CLI (analogous to nvidia-smi).
#
# AMD does NOT publish Debian Bookworm packages. The repo uses Ubuntu codenames
# (jammy/noble). We use jammy (Ubuntu 22.04) — its packages install cleanly on
# Debian 12 (Bookworm) due to compatible glibc/libstdc++.
# Versions are tried newest-first; the hook falls back to an older point release
# if a newer one is missing.
set -e
# Ubuntu codename to use for the AMD repo (Debian has no AMD packages).
ROCM_UBUNTU_DIST="jammy"
# ROCm point-releases to try newest-first. AMD drops old point releases
# from the repo, so we walk backwards until apt-get update succeeds for one.
ROCM_CANDIDATES="6.3.4 6.3.3 6.3.2 6.3.1 6.3 6.2.4 6.2.3 6.2.2 6.2.1 6.2"
ROCM_KEYRING="/etc/apt/keyrings/rocm.gpg"
ROCM_LIST="/etc/apt/sources.list.d/rocm.list"
APT_UPDATED=0
mkdir -p /etc/apt/keyrings
ensure_tool() {
tool="$1"
pkg="$2"
if command -v "${tool}" >/dev/null 2>&1; then
return 0
fi
if [ "${APT_UPDATED}" -eq 0 ]; then
apt-get update -qq
APT_UPDATED=1
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends "${pkg}"
}
ensure_cert_bundle() {
if [ -s /etc/ssl/certs/ca-certificates.crt ]; then
return 0
fi
if [ "${APT_UPDATED}" -eq 0 ]; then
apt-get update -qq
APT_UPDATED=1
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ca-certificates
}
# live-build chroot may not include fetch/signing tools yet
if ! ensure_cert_bundle || ! ensure_tool wget wget || ! ensure_tool gpg gpg; then
echo "WARN: failed to install wget/gpg/ca-certificates prerequisites — skipping ROCm install"
exit 0
fi
# Download and import AMD GPG key
if ! wget -qO- "https://repo.radeon.com/rocm/rocm.gpg.key" \
| gpg --dearmor --yes --output "${ROCM_KEYRING}"; then
echo "WARN: failed to fetch AMD ROCm GPG key — skipping ROCm install"
exit 0
fi
# Try each ROCm version until apt-get update succeeds.
# AMD repo uses Ubuntu codenames; bookworm is not published — use jammy.
ROCM_VERSION=""
for candidate in ${ROCM_CANDIDATES}; do
cat > "${ROCM_LIST}" <<EOF
deb [arch=amd64 signed-by=${ROCM_KEYRING}] https://repo.radeon.com/rocm/apt/${candidate} ${ROCM_UBUNTU_DIST} main
EOF
if apt-get update -qq 2>/dev/null; then
ROCM_VERSION="${candidate}"
echo "=== AMD ROCm ${ROCM_VERSION} (${ROCM_UBUNTU_DIST}): repository available ==="
break
fi
echo "WARN: ROCm ${candidate} not available, trying next..."
rm -f "${ROCM_LIST}"
done
if [ -z "${ROCM_VERSION}" ]; then
echo "WARN: no ROCm apt repository available — skipping ROCm install"
rm -f "${ROCM_KEYRING}"
exit 0
fi
# rocm-smi-lib provides the rocm-smi CLI tool for GPU monitoring
if DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends rocm-smi-lib; then
echo "=== AMD ROCm: rocm-smi-lib installed ==="
if [ -x /opt/rocm/bin/rocm-smi ]; then
ln -sf /opt/rocm/bin/rocm-smi /usr/local/bin/rocm-smi
else
smi_path="$(find /opt -path '*/bin/rocm-smi' -type f 2>/dev/null | sort | tail -1)"
if [ -n "${smi_path}" ]; then
ln -sf "${smi_path}" /usr/local/bin/rocm-smi
fi
fi
rocm-smi --version 2>/dev/null || true
else
echo "WARN: rocm-smi-lib install failed — AMD GPU monitoring unavailable"
fi
# Clean up apt lists to keep ISO size down
rm -f "${ROCM_LIST}"
apt-get clean
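
The candidate loop in the hook above writes one source list per ROCm version and keeps the first version for which `apt-get update` succeeds. The same first-match fallback can be sketched in isolation as follows. This is a minimal illustration, not the hook's code: `first_available` and `fake_probe` are hypothetical names, and the version strings in the usage comment are placeholders rather than the real contents of `ROCM_CANDIDATES`.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate accepted by the probe command.
# $1 is the name of a command or function that takes one candidate as its
# argument and returns 0 when that candidate is usable.
first_available() {
    probe="$1"
    shift
    for candidate in "$@"; do
        if "${probe}" "${candidate}"; then
            printf '%s\n' "${candidate}"
            return 0
        fi
    done
    # No candidate was accepted.
    return 1
}

# Stand-in probe for illustration only. The hook itself probes by writing
# the source list and running apt-get update.
fake_probe() {
    [ "$1" = "6.2" ]
}

# Usage (placeholder versions):
#   version="$(first_available fake_probe 6.4 6.3 6.2)" || echo "no candidate"
```

Keeping the probe separate from the iteration makes the fallback order easy to exercise without network access.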

@@ -0,0 +1,66 @@
#!/bin/sh
# 9002-nvidia-dcgm.hook.chroot — install NVIDIA DCGM inside the live-build chroot.
# DCGM (Data Center GPU Manager) provides dcgmi diag for acceptance testing.
# Adds NVIDIA's CUDA apt repository (debian12/x86_64) and installs datacenter-gpu-manager.
set -e
NVIDIA_KEYRING="/usr/share/keyrings/nvidia-cuda.gpg"
NVIDIA_LIST="/etc/apt/sources.list.d/nvidia-cuda.list"
NVIDIA_KEY_URL="https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub"
NVIDIA_REPO="https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/"
APT_UPDATED=0
mkdir -p /usr/share/keyrings /etc/apt/sources.list.d
ensure_tool() {
tool="$1"
pkg="$2"
if command -v "${tool}" >/dev/null 2>&1; then
return 0
fi
if [ "${APT_UPDATED}" -eq 0 ]; then
apt-get update -qq
APT_UPDATED=1
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends "${pkg}"
}
ensure_cert_bundle() {
if [ -s /etc/ssl/certs/ca-certificates.crt ]; then
return 0
fi
if [ "${APT_UPDATED}" -eq 0 ]; then
apt-get update -qq
APT_UPDATED=1
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ca-certificates
}
if ! ensure_cert_bundle || ! ensure_tool wget wget || ! ensure_tool gpg gpg; then
echo "WARN: prerequisites missing — skipping DCGM install"
exit 0
fi
# Download and import NVIDIA GPG key
if ! wget -qO- "${NVIDIA_KEY_URL}" | gpg --dearmor --yes --output "${NVIDIA_KEYRING}"; then
echo "WARN: failed to fetch NVIDIA GPG key — skipping DCGM install"
exit 0
fi
cat > "${NVIDIA_LIST}" <<EOF
deb [signed-by=${NVIDIA_KEYRING}] ${NVIDIA_REPO} /
EOF
apt-get update -qq
if DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends datacenter-gpu-manager; then
echo "=== DCGM: datacenter-gpu-manager installed ==="
dcgmi --version 2>/dev/null || true
else
echo "WARN: datacenter-gpu-manager install failed — DCGM unavailable"
fi
# Clean up apt lists to keep ISO size down
rm -f "${NVIDIA_LIST}"
apt-get clean

@@ -18,6 +18,17 @@ qemu-guest-agent
# SSH
openssh-server
# Disk installer
squashfs-tools
parted
grub-pc-bin
grub-efi-amd64-bin
# Filesystem support for USB export targets
exfatprogs
exfat-fuse
ntfs-3g
# Utilities
bash
procps
@@ -27,16 +38,29 @@ less
vim-tiny
mc
htop
nvtop
sudo
zstd
mstflint
memtester
stress-ng
# QR codes (for displaying audit results)
qrencode
# Local desktop (openbox + chromium kiosk)
openbox
tint2
xorg
xterm
chromium
xserver-xorg-video-fbdev
xserver-xorg-video-vesa
lightdm
# Firmware
firmware-linux-free
firmware-amd-graphics
# glibc compat helpers (for any external binaries that need it)
libc6

@@ -1,75 +0,0 @@
#!/bin/sh
# setup-builder.sh — prepare Debian 12 host/VM as bee ISO builder
#
# Run once on a fresh Debian 12 (Bookworm) host/VM as root.
# After this script completes, the machine can build bee ISO images directly.
# Container alternative: use `iso/builder/build-in-container.sh`.
#
# Usage (on Debian VM):
# wget -O- https://git.mchus.pro/mchus/bee/raw/branch/main/iso/builder/setup-builder.sh | sh
# or: sh setup-builder.sh
set -e
. "$(dirname "$0")/VERSIONS" 2>/dev/null || true
GO_VERSION="${GO_VERSION:-1.24.0}"
DEBIAN_VERSION="${DEBIAN_VERSION:-12}"
DEBIAN_KERNEL_ABI="${DEBIAN_KERNEL_ABI:-6.1.0-28}"
echo "=== bee builder setup ==="
echo "Debian: $(cat /etc/debian_version)"
echo "Go target: ${GO_VERSION}"
echo "Kernel ABI: ${DEBIAN_KERNEL_ABI}"
echo ""
# --- system packages ---
export DEBIAN_FRONTEND=noninteractive
apt-get update -qq
apt-get install -y \
live-build \
debootstrap \
squashfs-tools \
xorriso \
grub-pc-bin \
grub-efi-amd64-bin \
mtools \
git \
wget \
curl \
tar \
xz-utils \
screen \
rsync \
build-essential \
gcc \
make \
perl \
"linux-headers-${DEBIAN_KERNEL_ABI}-amd64"
echo "linux-headers installed: $(dpkg -l "linux-headers-${DEBIAN_KERNEL_ABI}-amd64" | awk '/^ii/{print $3}')"
# --- Go toolchain ---
echo ""
echo "=== installing Go ${GO_VERSION} ==="
if [ -d /usr/local/go ] && /usr/local/go/bin/go version 2>/dev/null | grep -q "${GO_VERSION}"; then
echo "Go ${GO_VERSION} already installed"
else
ARCH=$(uname -m)
case "$ARCH" in
x86_64) GOARCH=amd64 ;;
aarch64) GOARCH=arm64 ;;
*) echo "unsupported arch: $ARCH"; exit 1 ;;
esac
wget -O /tmp/go.tar.gz \
"https://go.dev/dl/go${GO_VERSION}.linux-${GOARCH}.tar.gz"
rm -rf /usr/local/go
tar -C /usr/local -xzf /tmp/go.tar.gz
rm /tmp/go.tar.gz
fi
export PATH="$PATH:/usr/local/go/bin"
echo "Go: $(go version)"
echo ""
echo "=== builder setup complete ==="
echo "Next: sh iso/builder/build.sh"

@@ -26,6 +26,15 @@ echo ""
KVER=$(uname -r)
info "kernel: $KVER"
NVIDIA_BOOT_MODE="normal"
for arg in $(cat /proc/cmdline 2>/dev/null); do
case "$arg" in
bee.nvidia.mode=*)
NVIDIA_BOOT_MODE="${arg#*=}"
;;
esac
done
info "nvidia boot mode: ${NVIDIA_BOOT_MODE}"
# --- PATH & binaries ---
echo "-- PATH & binaries --"
@@ -53,17 +62,25 @@ else
 fail "NVIDIA ko dir missing: $KO_DIR"
 fi
-for mod in nvidia nvidia_modeset nvidia_uvm; do
+if /sbin/lsmod 2>/dev/null | grep -q "^nvidia "; then
+ok "module loaded: nvidia"
+else
+fail "module NOT loaded: nvidia"
+fi
+for mod in nvidia_modeset nvidia_uvm; do
 if /sbin/lsmod 2>/dev/null | grep -q "^$mod "; then
 ok "module loaded: $mod"
+elif [ "${NVIDIA_BOOT_MODE}" = "normal" ] || [ "${NVIDIA_BOOT_MODE}" = "full" ]; then
+fail "module NOT loaded in normal mode: $mod"
 else
-fail "module NOT loaded: $mod"
+warn "module not loaded in GSP-off mode: $mod"
 fi
 done
 echo ""
 echo "-- NVIDIA device nodes --"
-for dev in nvidiactl nvidia0 nvidia-uvm; do
+for dev in nvidiactl nvidia0; do
 if [ -e "/dev/$dev" ]; then
 ok "/dev/$dev exists"
 else
@@ -71,6 +88,14 @@ for dev in nvidiactl nvidia0 nvidia-uvm; do
fi
done
if [ -e /dev/nvidia-uvm ]; then
ok "/dev/nvidia-uvm exists"
elif [ "${NVIDIA_BOOT_MODE}" = "normal" ] || [ "${NVIDIA_BOOT_MODE}" = "full" ]; then
fail "/dev/nvidia-uvm missing in normal mode"
else
warn "/dev/nvidia-uvm missing — CUDA stress path may be unavailable until loaded on demand"
fi
echo ""
echo "-- nvidia-smi --"
if PATH="/usr/local/bin:$PATH" command -v nvidia-smi >/dev/null 2>&1; then
@@ -96,7 +121,7 @@ done
 echo ""
 echo "-- systemd services --"
-for svc in bee-nvidia bee-network bee-audit bee-web; do
+for svc in bee-nvidia bee-network bee-preflight bee-audit bee-web; do
 if systemctl is-active --quiet "$svc" 2>/dev/null; then
 ok "service active: $svc"
 else
@@ -104,6 +129,20 @@ for svc in bee-nvidia bee-network bee-audit bee-web; do
fi
done
echo ""
echo "-- runtime health --"
if [ -f /appdata/bee/export/runtime-health.json ] && [ -s /appdata/bee/export/runtime-health.json ]; then
ok "runtime: runtime-health.json present and non-empty"
else
fail "runtime: runtime-health.json missing or empty"
fi
if [ -f /appdata/bee/export/runtime-health.log ]; then
info "last runtime log line: $(tail -1 /appdata/bee/export/runtime-health.log)"
else
warn "runtime: no log found at /appdata/bee/export/runtime-health.log"
fi
for svc in ssh bee-sshsetup; do
if systemctl is-active --quiet "$svc" 2>/dev/null \
|| systemctl show "$svc" --property=ActiveState 2>/dev/null | grep -q "inactive\|exited"; then
@@ -126,37 +165,37 @@ fi
 echo ""
 echo "-- audit last run --"
-if [ -f /var/log/bee-audit.json ] && [ -s /var/log/bee-audit.json ]; then
+if [ -f /appdata/bee/export/bee-audit.json ] && [ -s /appdata/bee/export/bee-audit.json ]; then
 ok "audit: bee-audit.json present and non-empty"
-info "size: $(du -sh /var/log/bee-audit.json | cut -f1)"
+info "size: $(du -sh /appdata/bee/export/bee-audit.json | cut -f1)"
 else
 fail "audit: bee-audit.json missing or empty"
 fi
-if [ -f /var/log/bee-audit.log ]; then
-last_line=$(tail -1 /var/log/bee-audit.log)
+if [ -f /appdata/bee/export/bee-audit.log ]; then
+last_line=$(tail -1 /appdata/bee/export/bee-audit.log)
 info "last log line: $last_line"
-if grep -q "audit output written" /var/log/bee-audit.log 2>/dev/null; then
+if grep -q "audit output written" /appdata/bee/export/bee-audit.log 2>/dev/null; then
 ok "audit: completed successfully"
 else
 warn "audit: 'audit output written' not found in log — may have failed"
 fi
-if grep -q "nvidia: enrichment skipped\|nvidia.*skipped\|enrichment skipped" /var/log/bee-audit.log 2>/dev/null; then
-reason=$(grep -E "nvidia.*skipped|enrichment skipped" /var/log/bee-audit.log | tail -1)
+if grep -q "nvidia: enrichment skipped\|nvidia.*skipped\|enrichment skipped" /appdata/bee/export/bee-audit.log 2>/dev/null; then
+reason=$(grep -E "nvidia.*skipped|enrichment skipped" /appdata/bee/export/bee-audit.log | tail -1)
 fail "audit: nvidia enrichment skipped — $reason"
 else
 ok "audit: nvidia enrichment OK (no skip message)"
 fi
 else
-warn "audit: no log found at /var/log/bee-audit.log"
+warn "audit: no log found at /appdata/bee/export/bee-audit.log"
 fi
 echo ""
 echo "-- bee web --"
-if [ -f /var/log/bee-web.log ]; then
-info "last web log line: $(tail -1 /var/log/bee-web.log)"
+if [ -f /appdata/bee/export/bee-web.log ]; then
+info "last web log line: $(tail -1 /appdata/bee/export/bee-web.log)"
 else
-warn "web: no log found at /var/log/bee-web.log"
+warn "web: no log found at /appdata/bee/export/bee-web.log"
 fi
if bash -c 'exec 3<>/dev/tcp/127.0.0.1/80 && printf "GET /healthz HTTP/1.0\r\nHost: localhost\r\n\r\n" >&3 && grep -q "^ok$" <&3'; then
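
The boot-mode detection added earlier in this file walks `/proc/cmdline` word by word and keeps the value of `bee.nvidia.mode=`. A minimal standalone sketch of that parsing is shown below, taking the command line as an argument so it can be exercised without rebooting. The function name `nvidia_boot_mode` is hypothetical, and the `gsp-off` value in the usage comment is an assumed example; the diff only shows that `normal` and `full` are treated as the strict modes.

```shell
#!/bin/sh
# Hypothetical function: extract the value of bee.nvidia.mode=<value> from a
# kernel command line string passed as $1. Falls back to "normal" when the
# parameter is absent, matching the default used by the script above.
nvidia_boot_mode() {
    mode="normal"
    # Unquoted on purpose: the command line is split on whitespace.
    for arg in $1; do
        case "${arg}" in
            bee.nvidia.mode=*) mode="${arg#*=}" ;;
        esac
    done
    printf '%s\n' "${mode}"
}

# Usage on a live system:
#   nvidia_boot_mode "$(cat /proc/cmdline)"
# Usage with an assumed example value:
#   nvidia_boot_mode "boot=live bee.nvidia.mode=gsp-off quiet"
```

If the parameter appears more than once, the last occurrence wins, which is also how the loop in the script behaves.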

@@ -0,0 +1,2 @@
allowed_users=anybody
needs_root_rights=yes

@@ -0,0 +1,10 @@
Section "Device"
Identifier "fbdev"
Driver "fbdev"
Option "fbdev" "/dev/fb0"
EndSection
Section "Screen"
Identifier "screen0"
Device "fbdev"
EndSection

@@ -0,0 +1,5 @@
[Seat:*]
autologin-user=bee
autologin-user-timeout=0
autologin-session=openbox
user-session=openbox

@@ -1,15 +1,16 @@
██████╗ ███████╗███████╗ ██████╗ ███████╗██████╗ ██╗ ██╗ ██████╗
██╔══██╗██╔════╝██╔════╝ ██╔══██╗██╔════╝██╔══██╗██║ ██║██╔════╝
██████╔╝██████████╗ ██║ ██║█████╗ ██████╔╝██║ ██║██║ ███╗
██╔══██╗██╔══╝ ██╔══╝ ██║ ██║██╔══╝ ██╔══██╗██║ ██║██║ ██
██████╔╝██████████████╗ ██████╔╝███████╗██████╔╝╚██████╔╝╚██████╔╝
╚═════╝ ╚══════╝╚══════╝ ╚═════╝ ╚══════╝╚═════╝ ╚═════╝ ╚═════╝
███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝
████████████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗
██╔══██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝
███████╗██║ █████████║ ██║ ██████╔╝███████╗███████╗
╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╚═════╝ ╚══════╝╚══════╝
Hardware Audit LiveCD
Build: %%BUILD_INFO%%
Logs: /var/log/bee-audit.json /var/log/bee-network.log
Export dir: /appdata/bee/export
Self-check: /appdata/bee/export/runtime-health.json
Open TUI: bee-tui

@@ -1,20 +1,18 @@
-export PATH="$PATH:/usr/local/bin"
+export PATH="$PATH:/usr/local/bin:/opt/rocm/bin:/opt/rocm/sbin"
-menu() {
-if [ -x /usr/local/bin/bee-tui ]; then
-/usr/local/bin/bee-tui "$@"
-else
-echo "bee-tui is not installed"
-return 1
-fi
-}
-# On the local console, keep the shell visible and let the operator
-# start the TUI explicitly. This avoids black-screen failures if the
-# terminal implementation does not support the TUI well.
+# Print web UI URLs on the local console at login.
 if [ -z "${SSH_CONNECTION:-}" ] \
-&& [ -z "${SSH_TTY:-}" ] \
-&& [ "$(tty 2>/dev/null)" = "/dev/tty1" ]; then
+&& [ -z "${SSH_TTY:-}" ]; then
 echo "Bee live environment ready."
-echo "Run 'menu' to open the TUI."
+echo ""
+echo " Web UI (local): http://localhost/"
+# Print IP addresses for remote access
+_ips=$(ip -4 addr show scope global 2>/dev/null | awk '/inet /{print $2}' | cut -d/ -f1)
+for _ip in $_ips; do
+echo " Web UI (remote): http://$_ip/"
+done
+unset _ips _ip
+echo ""
+echo " Network setup: netconf"
+echo " Kernel logs: Alt+F2 | Extra shell: Alt+F3"
fi
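
The login banner derives its remote URLs from `ip -4 addr show scope global`, keeping the second field of each `inet` line and dropping the prefix length. A minimal sketch of the same awk/cut extraction follows, wrapped in a hypothetical `list_ipv4` function that reads the `ip` output on stdin so it can be tried against captured text. The function name is an assumption made for illustration and does not appear in the profile script.

```shell
#!/bin/sh
# Hypothetical helper: read `ip -4 addr show scope global` output on stdin
# and print one IPv4 address per line, without the /prefix-length suffix.
list_ipv4() {
    awk '/inet /{print $2}' | cut -d/ -f1
}

# Usage on a live system (prints one URL per global address):
#   ip -4 addr show scope global 2>/dev/null | list_ipv4 | while read -r addr; do
#       echo " Web UI (remote): http://${addr}/"
#   done
```

Reading from stdin keeps the parsing step separate from the `ip` call, so it can be checked on a machine without the target network configuration.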

@@ -0,0 +1,4 @@
[Journal]
# Do not forward service logs to the console — bee-tui runs on tty1
# and log spam makes the screen unusable on physical monitors.
ForwardToConsole=no

@@ -0,0 +1,4 @@
[Journal]
ForwardToConsole=yes
TTYPath=/dev/ttyS0
MaxLevelConsole=info

@@ -0,0 +1,4 @@
[Manager]
# Pet the hardware watchdog every 30s so the host doesn't reboot mid-audit.
# Kernel watchdog timeout is typically 60s; 30s gives a safe 2× margin.
RuntimeWatchdogSec=30s

@@ -1,13 +1,13 @@
 [Unit]
 Description=Bee: run hardware audit
-After=bee-network.service bee-nvidia.service
+After=bee-network.service bee-nvidia.service bee-preflight.service
 Before=bee-web.service
 [Service]
 Type=oneshot
-ExecStart=/bin/sh -c '/usr/local/bin/bee audit --runtime livecd --output file:/var/log/bee-audit.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-audit] WARN: audit exited with rc=$rc"; fi; exit 0'
-StandardOutput=append:/var/log/bee-audit.log
-StandardError=append:/var/log/bee-audit.log
+ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-audit.log /bin/sh -c '/usr/local/bin/bee audit --runtime livecd --output file:/appdata/bee/export/bee-audit.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-audit] WARN: audit exited with rc=$rc"; fi; exit 0'
+StandardOutput=journal
+StandardError=journal
 RemainAfterExit=yes
 [Install]

@@ -0,0 +1,16 @@
[Unit]
Description=Bee: mirror system journal to %I
After=systemd-journald.service
Requires=systemd-journald.service
ConditionPathExists=/dev/%I
[Service]
Type=simple
ExecStart=/bin/sh -c 'exec journalctl -f -n 200 -o short-monotonic > /dev/%I'
Restart=always
RestartSec=1
StandardOutput=null
StandardError=journal
[Install]
WantedBy=multi-user.target

@@ -5,9 +5,9 @@ Before=network-online.target bee-audit.service
 [Service]
 Type=oneshot
-ExecStart=/usr/local/bin/bee-network.sh
-StandardOutput=append:/var/log/bee-network.log
-StandardError=append:/var/log/bee-network.log
+ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-network.log /usr/local/bin/bee-network.sh
+StandardOutput=journal
+StandardError=journal
 RemainAfterExit=yes
 [Install]

@@ -5,7 +5,7 @@ Before=bee-audit.service
 [Service]
 Type=oneshot
-ExecStart=/usr/local/bin/bee-nvidia-load
+ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-nvidia.log /usr/local/bin/bee-nvidia-load
 StandardOutput=journal
 StandardError=journal
 RemainAfterExit=yes

@@ -0,0 +1,14 @@
[Unit]
Description=Bee: runtime preflight self-check
After=bee-network.service bee-nvidia.service
Before=bee-audit.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/runtime-health.log /bin/sh -c '/usr/local/bin/bee preflight --output file:/appdata/bee/export/runtime-health.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-preflight] WARN: preflight exited with rc=$rc"; fi; exit 0'
StandardOutput=journal
StandardError=journal
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

@@ -5,7 +5,9 @@ Before=ssh.service
 [Service]
 Type=oneshot
-ExecStart=/usr/local/bin/bee-sshsetup
+ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-sshsetup.log /usr/local/bin/bee-sshsetup
+StandardOutput=journal
+StandardError=journal
 RemainAfterExit=yes
[Install]
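
Several units in this diff now call `bee-log-run <logfile> <command...>` while sending stdout and stderr to the journal. The wrapper itself is not shown in this part of the diff, so the sketch below is only one way such a wrapper could behave. It assumes the wrapper appends the command's combined output to the log file, passes the same output through to stdout, and returns the command's exit status. The name `log_run` is hypothetical and stands in for the real script.

```shell
#!/bin/sh
# Hypothetical stand-in for bee-log-run (assumed behavior, not the real script):
# run a command, append its combined stdout/stderr to a log file, echo the same
# output to stdout, and return the command's exit status.
log_run() {
    log_file="$1"
    shift
    mkdir -p "$(dirname "${log_file}")"
    status_file="$(mktemp)"
    # The pipeline's own status is tee's, so the command's status is
    # saved to a temporary file and read back after the pipeline ends.
    { "$@" 2>&1; echo "$?" > "${status_file}"; } | tee -a "${log_file}"
    rc="$(cat "${status_file}")"
    rm -f "${status_file}"
    return "${rc}"
}

# Usage with the paths from the unit files above:
#   log_run /appdata/bee/export/bee-network.log /usr/local/bin/bee-network.sh
```

Under these assumptions the export directory holds a plain-text copy of each service log, while `journalctl` still sees the same lines through the unit's stdout.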

Some files were not shown because too many files have changed in this diff.