Compare commits

..

51 Commits
v3.2 ... v4.1

Author SHA1 Message Date
Mikhail Chusavitin
fc7fe0b08e fix(webui): build support bundle synchronously on download, bypass task queue
Support bundle is now built on-the-fly when the user clicks the button,
regardless of whether other tasks are running:

- GET /export/support.tar.gz builds the bundle synchronously and streams it
  directly to the client; the temp archive is removed after serving
- Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue
  approach meant the bundle could only be downloaded after navigating away
  and back, and was blocked entirely while a long SAT test was running
- UI: single "Download Support Bundle" button; fetch+blob gives a loading
  state ("Building...") while the server collects logs, then triggers the
  browser download with the correct filename from Content-Disposition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 12:58:00 +03:00
Mikhail Chusavitin
3cf75a541a build: collect ISO and logs under versioned dist/easy-bee-v{VERSION}/ dir
All final artefacts for a given version now land in one place:
  dist/easy-bee-v4.1/
    easy-bee-nvidia-v4.1-amd64.iso
    easy-bee-nvidia-v4.1-amd64.logs.tar.gz   ← log archive
                                               (logs dir deleted after archiving)

- Introduce OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}"
- Move LOG_DIR, LOG_ARCHIVE, and ISO_OUT into OUT_DIR
- cleanup_build_log: use dirname(LOG_DIR) as tar -C base so the path is
  correct regardless of where OUT_DIR lives; delete LOG_DIR after archiving

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 10:19:11 +03:00
Mikhail Chusavitin
1f750d3edd fix(webui): prevent orphaned workers on restart, reduce metrics polling, add Kill Workers button
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
  re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
  crashes mid-test (each restart was launching a new set of 8 workers on top
  of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
  samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
  stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
  POST /api/tasks/kill-workers which cancels the task queue then kills
  orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
  non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
  (e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 10:13:43 +03:00
Mikhail Chusavitin
b2b0444131 audit: ignore virtual hdisk and coprocessor noise 2026-04-02 09:56:17 +03:00
dbab43db90 Fix full-history metrics range loading 2026-04-01 23:55:28 +03:00
bcb7fe5fe9 Render charts from full SQLite history 2026-04-01 23:52:54 +03:00
d21d9d191b fix(build): bump DCGM to 4.5.3-1 — core package updated in CUDA repo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:49:57 +03:00
ef45246ea0 fix(sat): kill entire process group on task cancel
exec.CommandContext only kills the direct child (the shell script), leaving
grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT
job runs in its own process group, then send SIGKILL to the whole group
(-pgid) in the Cancel hook.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:46:33 +03:00
348db35119 fix(stress): stagger john GPU launches to prevent GWS tuning contention
When 8 john processes start simultaneously they race for GPU memory during
OpenCL GWS auto-tuning. Slower devices settle on a smaller work size (~594MiB
vs 762MiB) and run at 40% instead of 100% load. Add 3s sleep between launches
so each instance finishes memory allocation before the next one starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:44:00 +03:00
1dd7f243f5 Keep chart series colors stable 2026-04-01 23:37:57 +03:00
938e499ac2 Serve charts from SQLite history only 2026-04-01 23:33:13 +03:00
964ab39656 fix: run john stress in parallel per GPU, fix chromium fullscreen, filter BMC virtual disks
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
  so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
  white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:14:21 +03:00
c2aecc6ce9 Fix fan chart gaps and task durations 2026-04-01 22:36:11 +03:00
439b86ce59 Unify live metrics chart rendering 2026-04-01 22:19:33 +03:00
eb60100297 fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
  of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
  overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
  bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
  base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:25:23 +03:00
Mikhail Chusavitin
2baf3be640 Handle memtest recovery probe under set -e 2026-04-01 17:42:13 +03:00
Mikhail Chusavitin
d92f8f41d0 Fix memtest ISO validation false negatives 2026-04-01 12:22:17 +03:00
Mikhail Chusavitin
76a9100779 fix(iso): rebuild image after memtest recovery 2026-04-01 10:01:14 +03:00
Mikhail Chusavitin
1b6d592bf3 feat(iso): add optional kms display boot path 2026-04-01 09:42:59 +03:00
Mikhail Chusavitin
c95bbff23b fix(metrics): stabilize cpu and power sampling 2026-04-01 09:40:42 +03:00
Mikhail Chusavitin
4e4debd4da refactor(webui): redesign Burn tab and fix gpu-burn memory defaults
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
  Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
  endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
  with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
  now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
  field to selectively run CPU and/or GPU load goroutines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:39:07 +03:00
Mikhail Chusavitin
5839f870b7 fix(iso): include full nvidia opencl runtime 2026-04-01 09:16:06 +03:00
Mikhail Chusavitin
b447717a5a fix(iso): harden boot network bring-up - v3.20 2026-04-01 09:10:55 +03:00
Mikhail Chusavitin
f6f4923ac9 fix(iso): recover memtest after live-build 2026-04-01 08:55:57 +03:00
Mikhail Chusavitin
c394845b34 refactor(webui): queue install and bundle tasks - v3.18 2026-04-01 08:46:46 +03:00
Mikhail Chusavitin
3472afea32 fix(iso): make memtest non-blocking by default 2026-04-01 08:33:36 +03:00
Mikhail Chusavitin
942f11937f chore(submodule): update bible - v3.16 2026-04-01 08:23:39 +03:00
Mikhail Chusavitin
b5b34983f1 fix(webui): repair audit actions and CPU burn flow - v3.15 2026-04-01 08:19:11 +03:00
45221d1e9a fix(stress): label loaders and improve john opencl diagnostics 2026-04-01 07:31:52 +03:00
3869788bac fix(iso): validate memtest with xorriso fallback 2026-04-01 07:24:05 +03:00
3dbc2184ef fix(iso): archive build logs and memtest diagnostics 2026-04-01 07:14:53 +03:00
60cb8f889a fix(iso): restore memtest menu entries and validate ISO 2026-04-01 07:04:48 +03:00
c9ee078622 fix(stress): keep platform burn responsive under load 2026-03-31 22:28:26 +03:00
ea660500c9 chore: commit pending repo changes 2026-03-31 22:17:36 +03:00
d43a9aeec7 fix(iso): restore live-build memtest integration 2026-03-31 22:10:28 +03:00
Mikhail Chusavitin
f5622e351e Fix staged John cleanup for repeated ISO builds 2026-03-31 11:40:52 +03:00
Mikhail Chusavitin
a20806afc8 Fix ISO grub package conflict 2026-03-31 11:38:30 +03:00
Mikhail Chusavitin
4f9b6b3bcd Harden NVIDIA boot logging on live ISO 2026-03-31 11:37:21 +03:00
Mikhail Chusavitin
c850b39b01 feat: v3.10 GPU stress and NCCL burn updates 2026-03-31 11:22:27 +03:00
Mikhail Chusavitin
6dee8f3509 Add NVIDIA stress loader selection and DCGM 4 support 2026-03-31 11:15:15 +03:00
Mikhail Chusavitin
20f834aa96 feat: v3.4 — boot reliability, log readability, USB export, screen resolution, GRUB UEFI fix, memtest, KVM console stability
Web UI / logs:
- Strip ANSI escape codes and handle \r (progress bars) in task log output
- Add USB export API + UI card on Export page (list removable devices, write audit JSON or support bundle)
- Add Display Resolution card in Tools (xrandr-based, per-output mode selector)
- Dashboard: audit status banner with auto-reload when audit task completes

Boot & install:
- bee-web starts immediately with no dependencies (was blocked by audit + network)
- bee-audit.service redesigned: waits for bee-web healthz, sleeps 60s, enqueues audit via /api/audit/run (task system)
- bee-install: fix GRUB UEFI — grub-install exit code was silently ignored (|| true); add --no-nvram fallback; always copy EFI/BOOT/BOOTX64.EFI fallback path
- Add grub-efi-amd64, grub-pc, grub-efi-amd64-signed, shim-signed to package list (grub-install requires these, not just -bin variants)
- memtest hook: fix binary/boot/ not created before cp; handle both Debian (no extension) and upstream (x64.efi) naming
- bee-openbox-session: increase healthz wait from 30s to 120s

KVM console stability:
- runCmdJob: syscall.Setpriority(PRIO_PROCESS, pid, 10) on all stress subprocesses
- lightdm.service.d: Nice=-5 so X server preempts stress processes

Packages: add btop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 10:16:15 +03:00
105d92df8b fix(iso): use underscore in volume label to comply with ISO 9660
ISO 9660 volume labels allow only A-Z, 0-9, and underscore.
Dashes cause xorriso WARNING on every build.
EASY-BEE-NVIDIA → EASY_BEE_NVIDIA (iso-application keeps dashes, it's UDF).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:38:02 +03:00
f96b149875 fix(memtest): extract EFI binary from .deb cache if chroot/boot/ is empty
memtest86+ postinst does not place files in /boot in a live-build chroot
without grub triggers. Added fallback: extract directly from the cached
.deb via dpkg-deb -x, with verbose logging throughout.

Also remove "NVIDIA no MSI-X" from boot menu (premature — root cause unknown).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:30:52 +03:00
5ee120158e fix(build): remove unused variant package lists before lb build
live-build picks up ALL .list.chroot files in config/package-lists/.
After rsync, bee-nvidia.list.chroot, bee-amd.list.chroot, and
bee-nogpu.list.chroot all end up in BUILD_WORK_DIR — causing lb to
try installing packages from every variant (and leaving version
placeholders unsubstituted in the unused lists).

Fix: after copying bee-${BEE_GPU_VENDOR}.list.chroot → bee-gpu.list.chroot,
delete all other bee-{nvidia,amd,nogpu}.list.chroot from BUILD_WORK_DIR.

Also includes nomsi boot mode changes (bee-nvidia-load + grub.cfg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:03:42 +03:00
09fe0e2e9e feat(iso): add nogpu variant (no NVIDIA, no AMD/ROCm)
- build.sh: accept --variant nogpu; skips all GPU build steps, removes
  both nvidia-cuda and rocm archives, strips bee-nvidia-load and
  bee-nvidia.service from overlay
- build-in-container.sh: add nogpu to --variant flag; all variant
  includes nogpu; --clean-build wipes live-build-work-nogpu
- 9000-bee-setup hook: nogpu path enables no GPU services
- bee-nogpu.list.chroot: empty GPU package list

Output: easy-bee-nogpu-vX.iso

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 22:49:25 +03:00
ace1a9dba6 feat(iso): split into nvidia and amd variants, fix KVM graphics and PATH
- build.sh: add --variant nvidia|amd; separate work dirs per variant
  (live-build-work-nvidia / live-build-work-amd); GPU-specific steps
  (modules, NCCL, cuBLAS, nccl-tests) run only for nvidia; deb package
  cache synced back to shared location after each lb build so second
  variant reuses downloaded packages; ISO output named
  easy-bee-{variant}-v{ver}-amd64.iso
- build-in-container.sh: add --variant nvidia|amd|all (default: all);
  runs build.sh twice in one container for 'all'; --clean-build wipes
  both variant work dirs
- package-lists: remove GPU packages from bee.list.chroot; add
  bee-nvidia.list.chroot (DCGM) and bee-amd.list.chroot (ROCm)
- 9000-bee-setup hook: read /etc/bee-gpu-vendor; enable bee-nvidia.service
  and DCGM only for nvidia; set up ROCm symlinks only for amd
- auto/config: --iso-volume uses BEE_GPU_VENDOR_UPPER env var
- grub.cfg: add nomodeset to EASY-BEE and EASY-BEE (load to RAM) entries
  — fixes X/lightdm on BMC KVM (ASPEED AST chip requires nomodeset for
  fbdev to work; NVIDIA H100 compute does not need KMS)
- bee.sh / smoketest.sh: add /usr/sbin to PATH so dmidecode, smartctl,
  nvme are found
- 9100-memtest hook: add diagnostic listing of chroot/boot/memtest* files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 22:24:37 +03:00
905c581ece fix(iso): substitute all ROCm package version placeholders in build.sh
ROCM_BANDWIDTH_TEST_VERSION, ROCM_VALIDATION_SUITE_VERSION, ROCBLAS,
ROCRAND, HIP_RUNTIME_AMD, HIPBLASLT, COMGR were defined in VERSIONS and
in bee.list.chroot but the sed substitution block only covered 3 of them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 22:00:05 +03:00
7c2a0135d2 feat(audit): add platform thermal cycling stress test
Runs CPU (stressapptest) + GPU stress simultaneously across multiple
load/idle cycles with varying idle durations (120s/60s/30s) to detect
cooling systems that fail to recover under repeated load.

Presets: smoke (~5 min), acceptance (~25 min), overnight (~100 min).
Outputs metrics.csv + summary.txt with per-cycle throttle and fan
spindown analysis, packed as tar.gz.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 21:57:33 +03:00
407c1cd1c4 fix(charts): unify timeline labels across graphs 2026-03-29 21:24:06 +03:00
e15bcc91c5 feat(metrics): persist history in sqlite and add AMD memory validate tests 2026-03-29 12:28:06 +03:00
98f0cf0d52 fix(amd-stress): include VRAM load in GST burn 2026-03-29 12:03:50 +03:00
76 changed files with 6610 additions and 975 deletions

View File

@@ -343,9 +343,9 @@ Planned code shape:
- `bee tui` can rerun the audit manually
- `bee tui` can export the latest audit JSON to removable media
- `bee tui` can show health summary and run NVIDIA/memory/storage acceptance tests
- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-stress`
- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-burn`
- SAT summaries now expose `overall_status` plus per-job `OK/FAILED/UNSUPPORTED`
- Memory/GPU SAT runtime defaults can be overridden via `BEE_MEMTESTER_*` and `BEE_GPU_STRESS_*`
- Memory SAT runtime defaults can be overridden via `BEE_MEMTESTER_*`
- removable export requires explicit target selection, mount, confirmation, copy, and cleanup
### 2.6 — Vendor utilities and optional assets

View File

@@ -356,6 +356,7 @@ func runSAT(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("sat", flag.ContinueOnError)
fs.SetOutput(stderr)
duration := fs.Int("duration", 0, "stress-ng duration in seconds (cpu only; default: 60)")
diagLevel := fs.Int("diag-level", 0, "DCGM diagnostic level for nvidia (1=quick, 2=medium, 3=targeted stress, 4=extended stress; default: 1)")
if err := fs.Parse(args[1:]); err != nil {
if err == flag.ErrHelp {
return 0
@@ -370,7 +371,7 @@ func runSAT(args []string, stdout, stderr io.Writer) int {
target := args[0]
if target != "nvidia" && target != "memory" && target != "storage" && target != "cpu" {
fmt.Fprintf(stderr, "bee sat: unknown target %q\n", target)
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>] [--diag-level <1-4>]")
return 2
}
@@ -382,7 +383,12 @@ func runSAT(args []string, stdout, stderr io.Writer) int {
logLine := func(s string) { fmt.Fprintln(os.Stderr, s) }
switch target {
case "nvidia":
archive, err = application.RunNvidiaAcceptancePack("", logLine)
level := *diagLevel
if level > 0 {
_, err = application.RunNvidiaAcceptancePackWithOptions(context.Background(), "", level, nil, logLine)
} else {
archive, err = application.RunNvidiaAcceptancePack("", logLine)
}
case "memory":
archive, err = application.RunMemoryAcceptancePackCtx(context.Background(), "", logLine)
case "storage":

View File

@@ -107,6 +107,7 @@ func (a *App) RunInstallToRAM(ctx context.Context, logFunc func(string)) error {
type satRunner interface {
RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error)
RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (string, error)
RunNvidiaStressPack(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error)
RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
RunStorageAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
RunCPUAcceptancePack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error)
@@ -114,10 +115,13 @@ type satRunner interface {
DetectGPUVendor() string
ListAMDGPUs() ([]platform.AMDGPUInfo, error)
RunAMDAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
RunAMDMemIntegrityPack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
RunAMDMemBandwidthPack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
RunAMDStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error)
RunMemoryStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error)
RunSATStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error)
RunFanStressTest(ctx context.Context, baseDir string, opts platform.FanStressOptions) (string, error)
RunPlatformStress(ctx context.Context, baseDir string, opts platform.PlatformStressOptions, logFunc func(string)) (string, error)
RunNCCLTests(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
}
@@ -141,14 +145,23 @@ func New(platform *platform.System) *App {
// ApplySATOverlay parses a raw audit JSON, overlays the latest SAT results,
// and returns the updated JSON. Used by the web UI to serve always-fresh status.
func ApplySATOverlay(auditJSON []byte) ([]byte, error) {
var snap schema.HardwareIngestRequest
if err := json.Unmarshal(auditJSON, &snap); err != nil {
snap, err := readAuditSnapshot(auditJSON)
if err != nil {
return nil, err
}
applyLatestSATStatuses(&snap.Hardware, DefaultSATBaseDir)
return json.MarshalIndent(snap, "", " ")
}
func readAuditSnapshot(auditJSON []byte) (schema.HardwareIngestRequest, error) {
var snap schema.HardwareIngestRequest
if err := json.Unmarshal(auditJSON, &snap); err != nil {
return schema.HardwareIngestRequest{}, err
}
collector.NormalizeSnapshot(&snap.Hardware, snap.CollectedAt)
return snap, nil
}
func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, error) {
if runtimeMode == runtimeenv.ModeLiveCD {
if err := a.runtime.CaptureTechnicalDump(DefaultTechDumpDir); err != nil {
@@ -272,6 +285,9 @@ func (a *App) ExportLatestAudit(target platform.RemovableTarget) (string, error)
if err != nil {
return "", err
}
if normalized, normErr := ApplySATOverlay(data); normErr == nil {
data = normalized
}
if err := os.WriteFile(tmpPath, data, 0644); err != nil {
return "", err
}
@@ -505,6 +521,17 @@ func (a *App) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir st
return ActionResult{Title: "NVIDIA DCGM", Body: body}, err
}
func (a *App) RunNvidiaStressPack(baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
return a.RunNvidiaStressPackCtx(context.Background(), baseDir, opts, logFunc)
}
func (a *App) RunNvidiaStressPackCtx(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaStressPack(ctx, baseDir, opts, logFunc)
}
func (a *App) RunMemoryAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
return a.RunMemoryAcceptancePackCtx(context.Background(), baseDir, logFunc)
}
@@ -577,6 +604,20 @@ func (a *App) RunAMDAcceptancePackResult(baseDir string) (ActionResult, error) {
return ActionResult{Title: "AMD GPU SAT", Body: satResultBody(path)}, err
}
func (a *App) RunAMDMemIntegrityPackCtx(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunAMDMemIntegrityPack(ctx, baseDir, logFunc)
}
func (a *App) RunAMDMemBandwidthPackCtx(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunAMDMemBandwidthPack(ctx, baseDir, logFunc)
}
func (a *App) RunMemoryStressPack(baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunMemoryStressPackCtx(context.Background(), baseDir, durationSec, logFunc)
}
@@ -611,6 +652,13 @@ func (a *App) RunFanStressTest(ctx context.Context, baseDir string, opts platfor
return a.sat.RunFanStressTest(ctx, baseDir, opts)
}
func (a *App) RunPlatformStress(ctx context.Context, baseDir string, opts platform.PlatformStressOptions, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunPlatformStress(ctx, baseDir, opts, logFunc)
}
func (a *App) RunNCCLTestsResult(ctx context.Context) (ActionResult, error) {
path, err := a.sat.RunNCCLTests(ctx, DefaultSATBaseDir, nil)
body := "Results: " + path
@@ -697,6 +745,7 @@ func (a *App) HealthSummaryResult() ActionResult {
if err := json.Unmarshal(raw, &snapshot); err != nil {
return ActionResult{Title: "Health summary", Body: "Audit JSON is unreadable."}
}
collector.NormalizeSnapshot(&snapshot.Hardware, snapshot.CollectedAt)
summary := collector.BuildHealthSummary(snapshot.Hardware)
var body strings.Builder
@@ -731,6 +780,7 @@ func (a *App) MainBanner() string {
if err := json.Unmarshal(raw, &snapshot); err != nil {
return ""
}
collector.NormalizeSnapshot(&snapshot.Hardware, snapshot.CollectedAt)
var lines []string
if system := formatSystemLine(snapshot.Hardware.Board); system != "" {

View File

@@ -120,14 +120,15 @@ func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
}
type fakeSAT struct {
runNvidiaFn func(string) (string, error)
runMemoryFn func(string) (string, error)
runStorageFn func(string) (string, error)
runCPUFn func(string, int) (string, error)
detectVendorFn func() string
listAMDGPUsFn func() ([]platform.AMDGPUInfo, error)
runAMDPackFn func(string) (string, error)
listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error)
runNvidiaFn func(string) (string, error)
runNvidiaStressFn func(string, platform.NvidiaStressOptions) (string, error)
runMemoryFn func(string) (string, error)
runStorageFn func(string) (string, error)
runCPUFn func(string, int) (string, error)
detectVendorFn func() string
listAMDGPUsFn func() ([]platform.AMDGPUInfo, error)
runAMDPackFn func(string) (string, error)
listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error)
}
func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string, _ func(string)) (string, error) {
@@ -138,6 +139,13 @@ func (f fakeSAT) RunNvidiaAcceptancePackWithOptions(_ context.Context, baseDir s
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaStressPack(_ context.Context, baseDir string, opts platform.NvidiaStressOptions, _ func(string)) (string, error) {
if f.runNvidiaStressFn != nil {
return f.runNvidiaStressFn(baseDir, opts)
}
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
if f.listNvidiaGPUsFn != nil {
return f.listNvidiaGPUsFn()
@@ -181,6 +189,14 @@ func (f fakeSAT) RunAMDAcceptancePack(_ context.Context, baseDir string, _ func(
return "", nil
}
func (f fakeSAT) RunAMDMemIntegrityPack(_ context.Context, _ string, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunAMDMemBandwidthPack(_ context.Context, _ string, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunAMDStressPack(_ context.Context, _ string, _ int, _ func(string)) (string, error) {
return "", nil
}
@@ -195,6 +211,10 @@ func (f fakeSAT) RunFanStressTest(_ context.Context, _ string, _ platform.FanStr
return "", nil
}
func (f fakeSAT) RunPlatformStress(_ context.Context, _ string, _ platform.PlatformStressOptions, _ func(string)) (string, error) {
return "", nil
}
func (f fakeSAT) RunNCCLTests(_ context.Context, _ string, _ func(string)) (string, error) {
return "", nil
}
@@ -640,13 +660,50 @@ func TestHealthSummaryResultIncludesCompactSATSummary(t *testing.T) {
}
}
func TestApplySATOverlayFiltersIgnoredLegacyDevices(t *testing.T) {
tmp := t.TempDir()
oldSATBaseDir := DefaultSATBaseDir
DefaultSATBaseDir = filepath.Join(tmp, "sat")
t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })
raw := `{
"collected_at": "2026-03-15T10:00:00Z",
"hardware": {
"board": {"serial_number": "SRV123"},
"storage": [
{"model": "Virtual HDisk0", "serial_number": "AAAABBBBCCCC3"},
{"model": "PASCARI", "serial_number": "DISK1", "status": "OK"}
],
"pcie_devices": [
{"device_class": "Co-processor", "model": "402xx Series QAT", "status": "OK"},
{"device_class": "VideoController", "model": "NVIDIA H100", "status": "OK"}
]
}
}`
got, err := ApplySATOverlay([]byte(raw))
if err != nil {
t.Fatalf("ApplySATOverlay error: %v", err)
}
text := string(got)
if contains(text, "Virtual HDisk0") {
t.Fatalf("overlaid audit should drop virtual hdisk:\n%s", text)
}
if contains(text, "\"device_class\": \"Co-processor\"") {
t.Fatalf("overlaid audit should drop co-processors:\n%s", text)
}
if !contains(text, "PASCARI") || !contains(text, "NVIDIA H100") {
t.Fatalf("overlaid audit should keep real devices:\n%s", text)
}
}
func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
tmp := t.TempDir()
exportDir := filepath.Join(tmp, "export")
if err := os.MkdirAll(filepath.Join(exportDir, "bee-sat", "memory-run"), 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.json"), []byte(`{"ok":true}`), 0644); err != nil {
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.json"), []byte(`{"collected_at":"2026-03-15T10:00:00Z","hardware":{"board":{"serial_number":"SRV123"},"storage":[{"model":"Virtual HDisk0","serial_number":"AAAABBBBCCCC3"},{"model":"PASCARI","serial_number":"DISK1"}],"pcie_devices":[{"device_class":"Co-processor","model":"402xx Series QAT"},{"device_class":"VideoController","model":"NVIDIA H100"}]}}`), 0644); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(exportDir, "bee-sat", "memory-run", "verbose.log"), []byte("sat verbose"), 0644); err != nil {
@@ -678,6 +735,7 @@ func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
tr := tar.NewReader(gzr)
var names []string
var auditJSON string
for {
hdr, err := tr.Next()
if errors.Is(err, io.EOF) {
@@ -687,6 +745,13 @@ func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
t.Fatalf("read tar entry: %v", err)
}
names = append(names, hdr.Name)
if contains(hdr.Name, "/export/bee-audit.json") {
body, err := io.ReadAll(tr)
if err != nil {
t.Fatalf("read audit entry: %v", err)
}
auditJSON = string(body)
}
}
var foundRaw bool
@@ -701,6 +766,12 @@ func TestBuildSupportBundleIncludesExportDirContents(t *testing.T) {
if !foundRaw {
t.Fatalf("support bundle missing raw SAT log, names=%v", names)
}
if contains(auditJSON, "Virtual HDisk0") || contains(auditJSON, "\"device_class\": \"Co-processor\"") {
t.Fatalf("support bundle should normalize ignored devices:\n%s", auditJSON)
}
if !contains(auditJSON, "PASCARI") || !contains(auditJSON, "NVIDIA H100") {
t.Fatalf("support bundle should keep real devices:\n%s", auditJSON)
}
}
func TestMainBanner(t *testing.T) {
@@ -714,6 +785,10 @@ func TestMainBanner(t *testing.T) {
product := "PowerEdge R760"
cpuModel := "Intel Xeon Gold 6430"
memoryType := "DDR5"
memorySerialA := "DIMM-A"
memorySerialB := "DIMM-B"
storageSerialA := "DISK-A"
storageSerialB := "DISK-B"
gpuClass := "VideoController"
gpuModel := "NVIDIA H100"
@@ -729,12 +804,12 @@ func TestMainBanner(t *testing.T) {
{Model: &cpuModel},
},
Memory: []schema.HardwareMemory{
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType},
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType, SerialNumber: &memorySerialA},
{Present: &trueValue, SizeMB: intPtr(524288), Type: &memoryType, SerialNumber: &memorySerialB},
},
Storage: []schema.HardwareStorage{
{Present: &trueValue, SizeGB: intPtr(3840)},
{Present: &trueValue, SizeGB: intPtr(3840)},
{Present: &trueValue, SizeGB: intPtr(3840), SerialNumber: &storageSerialA},
{Present: &trueValue, SizeGB: intPtr(3840), SerialNumber: &storageSerialB},
},
PCIeDevices: []schema.HardwarePCIeDevice{
{DeviceClass: &gpuClass, Model: &gpuModel},

View File

@@ -36,6 +36,8 @@ var supportBundleCommands = []struct {
{name: "system/dmesg-tail.txt", cmd: []string{"sh", "-c", "dmesg | tail -n 200"}},
}
const supportBundleGlob = "bee-support-*.tar.gz"
func BuildSupportBundle(exportDir string) (string, error) {
exportDir = strings.TrimSpace(exportDir)
if exportDir == "" {
@@ -86,34 +88,64 @@ func BuildSupportBundle(exportDir string) (string, error) {
return archivePath, nil
}
func LatestSupportBundlePath() (string, error) {
return latestSupportBundlePath(os.TempDir())
}
func cleanupOldSupportBundles(dir string) error {
matches, err := filepath.Glob(filepath.Join(dir, "bee-support-*.tar.gz"))
matches, err := filepath.Glob(filepath.Join(dir, supportBundleGlob))
if err != nil {
return err
}
type entry struct {
path string
mod time.Time
entries := supportBundleEntries(matches)
for path, mod := range entries {
if time.Since(mod) > 24*time.Hour {
_ = os.Remove(path)
delete(entries, path)
}
}
list := make([]entry, 0, len(matches))
ordered := orderSupportBundles(entries)
if len(ordered) > 3 {
for _, old := range ordered[3:] {
_ = os.Remove(old)
}
}
return nil
}
func latestSupportBundlePath(dir string) (string, error) {
matches, err := filepath.Glob(filepath.Join(dir, supportBundleGlob))
if err != nil {
return "", err
}
ordered := orderSupportBundles(supportBundleEntries(matches))
if len(ordered) == 0 {
return "", os.ErrNotExist
}
return ordered[0], nil
}
func supportBundleEntries(matches []string) map[string]time.Time {
entries := make(map[string]time.Time, len(matches))
for _, match := range matches {
info, err := os.Stat(match)
if err != nil {
continue
}
if time.Since(info.ModTime()) > 24*time.Hour {
_ = os.Remove(match)
continue
}
list = append(list, entry{path: match, mod: info.ModTime()})
entries[match] = info.ModTime()
}
sort.Slice(list, func(i, j int) bool { return list[i].mod.After(list[j].mod) })
if len(list) > 3 {
for _, old := range list[3:] {
_ = os.Remove(old.path)
}
return entries
}
func orderSupportBundles(entries map[string]time.Time) []string {
ordered := make([]string, 0, len(entries))
for path := range entries {
ordered = append(ordered, path)
}
return nil
sort.Slice(ordered, func(i, j int) bool {
return entries[ordered[i]].After(entries[ordered[j]])
})
return ordered
}
func writeJournalDump(dst string) error {
@@ -215,7 +247,7 @@ func copyDirContents(srcDir, dstDir string) error {
}
func copyExportDirForSupportBundle(srcDir, dstDir string) error {
return copyDirContentsFiltered(srcDir, dstDir, func(rel string, info os.FileInfo) bool {
if err := copyDirContentsFiltered(srcDir, dstDir, func(rel string, info os.FileInfo) bool {
cleanRel := filepath.ToSlash(strings.TrimPrefix(filepath.Clean(rel), "./"))
if cleanRel == "" {
return true
@@ -227,7 +259,25 @@ func copyExportDirForSupportBundle(srcDir, dstDir string) error {
return false
}
return true
})
}); err != nil {
return err
}
return normalizeSupportBundleAuditJSON(filepath.Join(dstDir, "bee-audit.json"))
}
func normalizeSupportBundleAuditJSON(path string) error {
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return err
}
normalized, err := ApplySATOverlay(data)
if err != nil {
return nil
}
return os.WriteFile(path, normalized, 0644)
}
func copyDirContentsFiltered(srcDir, dstDir string, keep func(rel string, info os.FileInfo) bool) error {

View File

@@ -1,10 +1,18 @@
package collector
import "bee/audit/internal/schema"
import (
"bee/audit/internal/schema"
"strings"
)
func NormalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
finalizeSnapshot(snap, collectedAt)
}
func finalizeSnapshot(snap *schema.HardwareSnapshot, collectedAt string) {
snap.Memory = filterMemory(snap.Memory)
snap.Storage = filterStorage(snap.Storage)
snap.PCIeDevices = filterPCIe(snap.PCIeDevices)
snap.PowerSupplies = filterPSUs(snap.PowerSupplies)
setComponentStatusMetadata(snap, collectedAt)
@@ -33,11 +41,25 @@ func filterStorage(disks []schema.HardwareStorage) []schema.HardwareStorage {
if disk.SerialNumber == nil || *disk.SerialNumber == "" {
continue
}
if disk.Model != nil && isVirtualHDiskModel(*disk.Model) {
continue
}
out = append(out, disk)
}
return out
}
func filterPCIe(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
out := make([]schema.HardwarePCIeDevice, 0, len(devs))
for _, dev := range devs {
if dev.DeviceClass != nil && strings.Contains(strings.ToLower(strings.TrimSpace(*dev.DeviceClass)), "co-processor") {
continue
}
out = append(out, dev)
}
return out
}
func filterPSUs(psus []schema.HardwarePowerSupply) []schema.HardwarePowerSupply {
out := make([]schema.HardwarePowerSupply, 0, len(psus))
for _, psu := range psus {

View File

@@ -10,6 +10,10 @@ func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
present := true
status := statusOK
serial := "SN-1"
virtualModel := "Virtual HDisk1"
realModel := "PASCARI"
coProcessorClass := "Co-processor"
gpuClass := "VideoController"
snap := schema.HardwareSnapshot{
Memory: []schema.HardwareMemory{
@@ -17,9 +21,15 @@ func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
{Present: &present, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
Storage: []schema.HardwareStorage{
{Model: &virtualModel, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{Model: &realModel, SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
PCIeDevices: []schema.HardwarePCIeDevice{
{DeviceClass: &coProcessorClass, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{DeviceClass: &gpuClass, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
},
PowerSupplies: []schema.HardwarePowerSupply{
{SerialNumber: &serial, HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
{HardwareComponentStatus: schema.HardwareComponentStatus{Status: &status}},
@@ -31,9 +41,12 @@ func TestFinalizeSnapshotFiltersComponentsWithoutRequiredSerials(t *testing.T) {
if len(snap.Memory) != 1 || snap.Memory[0].StatusCheckedAt == nil || *snap.Memory[0].StatusCheckedAt != collectedAt {
t.Fatalf("memory finalize mismatch: %+v", snap.Memory)
}
if len(snap.Storage) != 1 || snap.Storage[0].StatusCheckedAt == nil || *snap.Storage[0].StatusCheckedAt != collectedAt {
if len(snap.Storage) != 2 || snap.Storage[0].StatusCheckedAt == nil || *snap.Storage[0].StatusCheckedAt != collectedAt {
t.Fatalf("storage finalize mismatch: %+v", snap.Storage)
}
if len(snap.PCIeDevices) != 1 || snap.PCIeDevices[0].DeviceClass == nil || *snap.PCIeDevices[0].DeviceClass != gpuClass {
t.Fatalf("pcie finalize mismatch: %+v", snap.PCIeDevices)
}
if len(snap.PowerSupplies) != 1 || snap.PowerSupplies[0].StatusCheckedAt == nil || *snap.PowerSupplies[0].StatusCheckedAt != collectedAt {
t.Fatalf("psu finalize mismatch: %+v", snap.PowerSupplies)
}

View File

@@ -13,14 +13,18 @@ import (
const nvidiaVendorID = 0x10de
type nvidiaGPUInfo struct {
BDF string
Serial string
VBIOS string
TemperatureC *float64
PowerW *float64
ECCUncorrected *int64
ECCCorrected *int64
HWSlowdown *bool
BDF string
Serial string
VBIOS string
TemperatureC *float64
PowerW *float64
ECCUncorrected *int64
ECCCorrected *int64
HWSlowdown *bool
PCIeLinkGenCurrent *int
PCIeLinkGenMax *int
PCIeLinkWidthCur *int
PCIeLinkWidthMax *int
}
// enrichPCIeWithNVIDIA enriches NVIDIA PCIe devices with data from nvidia-smi.
@@ -94,7 +98,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
func queryNVIDIAGPUs() (map[string]nvidiaGPUInfo, error) {
out, err := exec.Command(
"nvidia-smi",
"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown",
"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max",
"--format=csv,noheader,nounits",
).Output()
if err != nil {
@@ -118,8 +122,8 @@ func parseNVIDIASMIQuery(raw string) (map[string]nvidiaGPUInfo, error) {
if len(rec) == 0 {
continue
}
if len(rec) < 9 {
return nil, fmt.Errorf("unexpected nvidia-smi columns: got %d, want 9", len(rec))
if len(rec) < 13 {
return nil, fmt.Errorf("unexpected nvidia-smi columns: got %d, want 13", len(rec))
}
bdf := normalizePCIeBDF(rec[1])
@@ -128,14 +132,18 @@ func parseNVIDIASMIQuery(raw string) (map[string]nvidiaGPUInfo, error) {
}
info := nvidiaGPUInfo{
BDF: bdf,
Serial: strings.TrimSpace(rec[2]),
VBIOS: strings.TrimSpace(rec[3]),
TemperatureC: parseMaybeFloat(rec[4]),
PowerW: parseMaybeFloat(rec[5]),
ECCUncorrected: parseMaybeInt64(rec[6]),
ECCCorrected: parseMaybeInt64(rec[7]),
HWSlowdown: parseMaybeBool(rec[8]),
BDF: bdf,
Serial: strings.TrimSpace(rec[2]),
VBIOS: strings.TrimSpace(rec[3]),
TemperatureC: parseMaybeFloat(rec[4]),
PowerW: parseMaybeFloat(rec[5]),
ECCUncorrected: parseMaybeInt64(rec[6]),
ECCCorrected: parseMaybeInt64(rec[7]),
HWSlowdown: parseMaybeBool(rec[8]),
PCIeLinkGenCurrent: parseMaybeInt(rec[9]),
PCIeLinkGenMax: parseMaybeInt(rec[10]),
PCIeLinkWidthCur: parseMaybeInt(rec[11]),
PCIeLinkWidthMax: parseMaybeInt(rec[12]),
}
result[bdf] = info
}
@@ -167,6 +175,22 @@ func parseMaybeInt64(v string) *int64 {
return &n
}
func parseMaybeInt(v string) *int {
v = strings.TrimSpace(v)
if v == "" || strings.EqualFold(v, "n/a") || strings.EqualFold(v, "not supported") || strings.EqualFold(v, "[not supported]") {
return nil
}
n, err := strconv.Atoi(v)
if err != nil {
return nil
}
return &n
}
func pcieLinkGenLabel(gen int) string {
return fmt.Sprintf("Gen%d", gen)
}
func parseMaybeBool(v string) *bool {
v = strings.TrimSpace(strings.ToLower(v))
switch v {
@@ -231,4 +255,22 @@ func injectNVIDIATelemetry(dev *schema.HardwarePCIeDevice, info nvidiaGPUInfo) {
if info.HWSlowdown != nil {
dev.HWSlowdown = info.HWSlowdown
}
// Override PCIe link speed/width with nvidia-smi driver values.
// sysfs current_link_speed reflects the instantaneous physical link state and
// can show Gen1 when the GPU is idle due to ASPM power management. The driver
// knows the negotiated speed regardless of the current power state.
if info.PCIeLinkGenCurrent != nil {
s := pcieLinkGenLabel(*info.PCIeLinkGenCurrent)
dev.LinkSpeed = &s
}
if info.PCIeLinkGenMax != nil {
s := pcieLinkGenLabel(*info.PCIeLinkGenMax)
dev.MaxLinkSpeed = &s
}
if info.PCIeLinkWidthCur != nil {
dev.LinkWidth = info.PCIeLinkWidthCur
}
if info.PCIeLinkWidthMax != nil {
dev.MaxLinkWidth = info.PCIeLinkWidthMax
}
}

View File

@@ -6,7 +6,7 @@ import (
)
func TestParseNVIDIASMIQuery(t *testing.T) {
raw := "0, 00000000:65:00.0, GPU-SERIAL-1, 96.00.1F.00.02, 54, 210.33, 0, 5, Not Active\n"
raw := "0, 00000000:65:00.0, GPU-SERIAL-1, 96.00.1F.00.02, 54, 210.33, 0, 5, Not Active, 4, 4, 16, 16\n"
byBDF, err := parseNVIDIASMIQuery(raw)
if err != nil {
t.Fatalf("parse failed: %v", err)
@@ -28,6 +28,12 @@ func TestParseNVIDIASMIQuery(t *testing.T) {
if gpu.HWSlowdown == nil || *gpu.HWSlowdown {
t.Fatalf("hw slowdown: got %v, want false", gpu.HWSlowdown)
}
if gpu.PCIeLinkGenCurrent == nil || *gpu.PCIeLinkGenCurrent != 4 {
t.Fatalf("pcie link gen current: got %v, want 4", gpu.PCIeLinkGenCurrent)
}
if gpu.PCIeLinkGenMax == nil || *gpu.PCIeLinkGenMax != 4 {
t.Fatalf("pcie link gen max: got %v, want 4", gpu.PCIeLinkGenMax)
}
}
func TestNormalizePCIeBDF(t *testing.T) {

View File

@@ -59,6 +59,7 @@ func shouldIncludePCIeDevice(class, vendor, device string) bool {
"host bridge",
"isa bridge",
"pci bridge",
"co-processor",
"performance counter",
"performance counters",
"ram memory",

View File

@@ -19,6 +19,7 @@ func TestShouldIncludePCIeDevice(t *testing.T) {
{name: "audio", class: "Audio device", want: false},
{name: "host bridge", class: "Host bridge", want: false},
{name: "pci bridge", class: "PCI bridge", want: false},
{name: "co-processor", class: "Co-processor", want: false},
{name: "smbus", class: "SMBus", want: false},
{name: "perf", class: "Performance counters", want: false},
{name: "non essential instrumentation", class: "Non-Essential Instrumentation", want: false},
@@ -76,6 +77,20 @@ func TestParseLspci_filtersAMDChipsetNoise(t *testing.T) {
}
}
func TestParseLspci_filtersCoProcessors(t *testing.T) {
input := "" +
"Slot:\t0000:01:00.0\nClass:\tCo-processor\nVendor:\tIntel Corporation\nDevice:\t402xx Series QAT\n\n" +
"Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"
devs := parseLspci(input)
if len(devs) != 1 {
t.Fatalf("expected 1 remaining device, got %d", len(devs))
}
if devs[0].Model == nil || *devs[0].Model != "H100" {
t.Fatalf("unexpected remaining device: %+v", devs[0])
}
}
func TestPCIeJSONUsesSlotNotBDF(t *testing.T) {
input := "Slot:\t0000:65:00.0\nClass:\tVGA compatible controller\nVendor:\tNVIDIA Corporation\nDevice:\tH100\n\n"

View File

@@ -77,11 +77,28 @@ func discoverStorageDevices() []lsblkDevice {
if dev.Type != "disk" {
continue
}
if isVirtualBMCDisk(dev) {
slog.Debug("storage: skipping BMC virtual disk", "name", dev.Name, "model", dev.Model)
continue
}
disks = append(disks, dev)
}
return disks
}
// isVirtualBMCDisk returns true for BMC/IPMI virtual USB mass storage devices
// that appear as disks but are not real hardware (e.g. iDRAC Virtual HDisk*).
// These have zero reported size, a generic fake serial, and a model name that
// starts with "Virtual HDisk".
func isVirtualBMCDisk(dev lsblkDevice) bool {
return isVirtualHDiskModel(dev.Model)
}
func isVirtualHDiskModel(model string) bool {
model = strings.ToLower(strings.TrimSpace(model))
return strings.HasPrefix(model, "virtual hdisk")
}
func lsblkDevices() []lsblkDevice {
out, err := exec.Command("lsblk", "-J", "-d",
"-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL").Output()

View File

@@ -0,0 +1,64 @@
package platform
import (
"fmt"
"os"
"strconv"
"strings"
"syscall"
)
// workerPatterns are substrings matched against /proc/<pid>/cmdline to identify
// bee test worker processes that should be killed by KillTestWorkers.
var workerPatterns = []string{
"bee-gpu-burn",
"stress-ng",
"stressapptest",
"memtester",
}
// KilledProcess describes a process that was sent SIGKILL.
type KilledProcess struct {
PID int `json:"pid"`
Name string `json:"name"`
}
// KillTestWorkers scans /proc for running test worker processes and sends
// SIGKILL to each one found. It returns a list of killed processes.
// Errors for individual processes (e.g. already exited) are silently ignored.
func KillTestWorkers() []KilledProcess {
entries, err := os.ReadDir("/proc")
if err != nil {
return nil
}
var killed []KilledProcess
for _, e := range entries {
if !e.IsDir() {
continue
}
pid, err := strconv.Atoi(e.Name())
if err != nil {
continue
}
cmdline, err := os.ReadFile(fmt.Sprintf("/proc/%d/cmdline", pid))
if err != nil {
continue
}
// /proc/*/cmdline uses NUL bytes as argument separators.
args := strings.SplitN(strings.ReplaceAll(string(cmdline), "\x00", " "), " ", 2)
exe := strings.TrimSpace(args[0])
base := exe
if idx := strings.LastIndexByte(exe, '/'); idx >= 0 {
base = exe[idx+1:]
}
for _, pat := range workerPatterns {
if strings.Contains(base, pat) || strings.Contains(exe, pat) {
_ = syscall.Kill(pid, syscall.SIGKILL)
killed = append(killed, KilledProcess{PID: pid, Name: base})
break
}
}
}
return killed
}

View File

@@ -68,18 +68,20 @@ func SampleLiveMetrics() LiveMetricSample {
// sampleCPULoadPct reads two /proc/stat snapshots 200ms apart and returns
// the overall CPU utilisation percentage.
var cpuStatPrev [2]uint64 // [total, idle]
func sampleCPULoadPct() float64 {
total, idle := readCPUStat()
if total == 0 {
total0, idle0 := readCPUStat()
if total0 == 0 {
return 0
}
prevTotal, prevIdle := cpuStatPrev[0], cpuStatPrev[1]
cpuStatPrev = [2]uint64{total, idle}
if prevTotal == 0 {
time.Sleep(200 * time.Millisecond)
total1, idle1 := readCPUStat()
if total1 == 0 {
return 0
}
return cpuLoadPctBetween(total0, idle0, total1, idle1)
}
func cpuLoadPctBetween(prevTotal, prevIdle, total, idle uint64) float64 {
dt := float64(total - prevTotal)
di := float64(idle - prevIdle)
if dt <= 0 {

View File

@@ -42,3 +42,53 @@ func TestCompactAmbientTempName(t *testing.T) {
t.Fatalf("got %q", got)
}
}
func TestCPULoadPctBetween(t *testing.T) {
tests := []struct {
name string
prevTotal uint64
prevIdle uint64
total uint64
idle uint64
want float64
}{
{
name: "busy half",
prevTotal: 100,
prevIdle: 40,
total: 200,
idle: 90,
want: 50,
},
{
name: "fully busy",
prevTotal: 100,
prevIdle: 40,
total: 200,
idle: 40,
want: 100,
},
{
name: "no progress",
prevTotal: 100,
prevIdle: 40,
total: 100,
idle: 40,
want: 0,
},
{
name: "idle delta larger than total clamps to zero",
prevTotal: 100,
prevIdle: 40,
total: 200,
idle: 150,
want: 0,
},
}
for _, tc := range tests {
if got := cpuLoadPctBetween(tc.prevTotal, tc.prevIdle, tc.total, tc.idle); got != tc.want {
t.Fatalf("%s: cpuLoadPctBetween(...)=%v want %v", tc.name, got, tc.want)
}
}
}

View File

@@ -0,0 +1,203 @@
package platform
import (
"context"
"fmt"
"sort"
"strconv"
"strings"
)
func (s *System) RunNvidiaStressPack(ctx context.Context, baseDir string, opts NvidiaStressOptions, logFunc func(string)) (string, error) {
normalizeNvidiaStressOptions(&opts)
job, err := buildNvidiaStressJob(opts)
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, nvidiaStressArchivePrefix(opts.Loader), []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-nvidia-smi-list.log", cmd: []string{"nvidia-smi", "-L"}},
job,
{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
}
func nvidiaStressArchivePrefix(loader string) string {
switch strings.TrimSpace(strings.ToLower(loader)) {
case NvidiaStressLoaderJohn:
return "gpu-nvidia-john"
case NvidiaStressLoaderNCCL:
return "gpu-nvidia-nccl"
default:
return "gpu-nvidia-burn"
}
}
func buildNvidiaStressJob(opts NvidiaStressOptions) (satJob, error) {
selected, err := resolveNvidiaGPUSelection(opts.GPUIndices, opts.ExcludeGPUIndices)
if err != nil {
return satJob{}, err
}
loader := strings.TrimSpace(strings.ToLower(opts.Loader))
switch loader {
case "", NvidiaStressLoaderBuiltin:
cmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(opts.DurationSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-bee-gpu-burn.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
case NvidiaStressLoaderJohn:
cmd := []string{
"bee-john-gpu-stress",
"--seconds", strconv.Itoa(opts.DurationSec),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-john-gpu-stress.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
case NvidiaStressLoaderNCCL:
cmd := []string{
"bee-nccl-gpu-stress",
"--seconds", strconv.Itoa(opts.DurationSec),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-bee-nccl-gpu-stress.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
default:
return satJob{}, fmt.Errorf("unknown NVIDIA stress loader %q", opts.Loader)
}
}
func normalizeNvidiaStressOptions(opts *NvidiaStressOptions) {
if opts.DurationSec <= 0 {
opts.DurationSec = 300
}
// SizeMB=0 means "auto" — bee-gpu-burn will query per-GPU memory at runtime.
switch strings.TrimSpace(strings.ToLower(opts.Loader)) {
case "", NvidiaStressLoaderBuiltin:
opts.Loader = NvidiaStressLoaderBuiltin
case NvidiaStressLoaderJohn:
opts.Loader = NvidiaStressLoaderJohn
case NvidiaStressLoaderNCCL:
opts.Loader = NvidiaStressLoaderNCCL
default:
opts.Loader = NvidiaStressLoaderBuiltin
}
opts.GPUIndices = dedupeSortedIndices(opts.GPUIndices)
opts.ExcludeGPUIndices = dedupeSortedIndices(opts.ExcludeGPUIndices)
}
func resolveNvidiaGPUSelection(include, exclude []int) ([]int, error) {
all, err := listNvidiaGPUIndices()
if err != nil {
return nil, err
}
if len(all) == 0 {
return nil, fmt.Errorf("nvidia-smi found no NVIDIA GPUs")
}
selected := all
if len(include) > 0 {
want := make(map[int]struct{}, len(include))
for _, idx := range include {
want[idx] = struct{}{}
}
selected = selected[:0]
for _, idx := range all {
if _, ok := want[idx]; ok {
selected = append(selected, idx)
}
}
}
if len(exclude) > 0 {
skip := make(map[int]struct{}, len(exclude))
for _, idx := range exclude {
skip[idx] = struct{}{}
}
filtered := selected[:0]
for _, idx := range selected {
if _, ok := skip[idx]; ok {
continue
}
filtered = append(filtered, idx)
}
selected = filtered
}
if len(selected) == 0 {
return nil, fmt.Errorf("no NVIDIA GPUs selected after applying filters")
}
out := append([]int(nil), selected...)
sort.Ints(out)
return out, nil
}
func listNvidiaGPUIndices() ([]int, error) {
out, err := satExecCommand("nvidia-smi", "--query-gpu=index", "--format=csv,noheader,nounits").Output()
if err != nil {
return nil, fmt.Errorf("nvidia-smi: %w", err)
}
var indices []int
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
idx, err := strconv.Atoi(line)
if err != nil {
continue
}
indices = append(indices, idx)
}
return dedupeSortedIndices(indices), nil
}
func dedupeSortedIndices(values []int) []int {
if len(values) == 0 {
return nil
}
seen := make(map[int]struct{}, len(values))
out := make([]int, 0, len(values))
for _, value := range values {
if value < 0 {
continue
}
if _, ok := seen[value]; ok {
continue
}
seen[value] = struct{}{}
out = append(out, value)
}
sort.Ints(out)
return out
}
func joinIndexList(values []int) string {
parts := make([]string, 0, len(values))
for _, value := range values {
parts = append(parts, strconv.Itoa(value))
}
return strings.Join(parts, ",")
}

View File

@@ -0,0 +1,545 @@
package platform
import (
"archive/tar"
"bytes"
"compress/gzip"
"context"
"encoding/csv"
"fmt"
"os"
"os/exec"
"path/filepath"
"runtime"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
// PlatformStressCycle defines one load+idle cycle.
type PlatformStressCycle struct {
LoadSec int // seconds of simultaneous CPU+GPU stress
IdleSec int // seconds of idle monitoring after load cut
}
// PlatformStressOptions controls the thermal cycling test.
type PlatformStressOptions struct {
Cycles []PlatformStressCycle
Components []string // if empty: run all; values: "cpu", "gpu"
}
// platformStressRow is one second of telemetry.
type platformStressRow struct {
ElapsedSec float64
Cycle int
Phase string // "load" | "idle"
CPULoadPct float64
MaxCPUTempC float64
MaxGPUTempC float64
SysPowerW float64
FanMinRPM float64
FanMaxRPM float64
GPUThrottled bool
}
// RunPlatformStress runs repeated load+idle thermal cycling.
// Each cycle starts CPU (stressapptest) and GPU stress simultaneously,
// runs for LoadSec, then cuts load abruptly and monitors for IdleSec.
func (s *System) RunPlatformStress(
ctx context.Context,
baseDir string,
opts PlatformStressOptions,
logFunc func(string),
) (string, error) {
if logFunc == nil {
logFunc = func(string) {}
}
if len(opts.Cycles) == 0 {
return "", fmt.Errorf("no cycles defined")
}
if err := os.MkdirAll(baseDir, 0755); err != nil {
return "", fmt.Errorf("mkdir %s: %w", baseDir, err)
}
stamp := time.Now().UTC().Format("20060102-150405")
runDir := filepath.Join(baseDir, "platform-stress-"+stamp)
if err := os.MkdirAll(runDir, 0755); err != nil {
return "", fmt.Errorf("mkdir run dir: %w", err)
}
hasCPU := len(opts.Components) == 0 || containsComponent(opts.Components, "cpu")
hasGPU := len(opts.Components) == 0 || containsComponent(opts.Components, "gpu")
vendor := s.DetectGPUVendor()
logFunc(fmt.Sprintf("Platform Thermal Cycling — %d cycle(s), GPU vendor: %s, cpu=%v gpu=%v", len(opts.Cycles), vendor, hasCPU, hasGPU))
var rows []platformStressRow
start := time.Now()
var analyses []cycleAnalysis
for i, cycle := range opts.Cycles {
if ctx.Err() != nil {
break
}
cycleNum := i + 1
logFunc(fmt.Sprintf("--- Cycle %d/%d: load=%ds, idle=%ds ---", cycleNum, len(opts.Cycles), cycle.LoadSec, cycle.IdleSec))
// ── LOAD PHASE ───────────────────────────────────────────────────────
loadCtx, loadCancel := context.WithTimeout(ctx, time.Duration(cycle.LoadSec)*time.Second)
var wg sync.WaitGroup
// CPU stress
if hasCPU {
wg.Add(1)
go func() {
defer wg.Done()
cpuCmd, err := buildCPUStressCmd(loadCtx)
if err != nil {
logFunc("CPU stress: " + err.Error())
return
}
_ = cpuCmd.Wait() // exits when loadCtx times out (SIGKILL)
}()
}
// GPU stress
if hasGPU {
wg.Add(1)
go func() {
defer wg.Done()
gpuCmd := buildGPUStressCmd(loadCtx, vendor)
if gpuCmd == nil {
return
}
_ = gpuCmd.Wait()
}()
}
// Monitoring goroutine for load phase
loadRows := collectPhase(loadCtx, cycleNum, "load", start)
for _, r := range loadRows {
logFunc(formatPlatformRow(r))
}
rows = append(rows, loadRows...)
loadCancel()
wg.Wait()
if len(loadRows) > 0 {
logFunc(fmt.Sprintf("Cycle %d load ended (%.0fs)", cycleNum, loadRows[len(loadRows)-1].ElapsedSec))
}
// ── IDLE PHASE ───────────────────────────────────────────────────────
idleCtx, idleCancel := context.WithTimeout(ctx, time.Duration(cycle.IdleSec)*time.Second)
idleRows := collectPhase(idleCtx, cycleNum, "idle", start)
for _, r := range idleRows {
logFunc(formatPlatformRow(r))
}
rows = append(rows, idleRows...)
idleCancel()
// Per-cycle analysis
an := analyzePlatformCycle(loadRows, idleRows)
analyses = append(analyses, an)
logFunc(fmt.Sprintf("Cycle %d: maxCPU=%.1f°C maxGPU=%.1f°C power=%.0fW throttled=%v fanDrop=%.0f%%",
cycleNum, an.maxCPUTemp, an.maxGPUTemp, an.maxPower, an.throttled, an.fanDropPct))
}
// Write CSV
csvData := writePlatformCSV(rows)
_ = os.WriteFile(filepath.Join(runDir, "metrics.csv"), csvData, 0644)
// Write summary
summary := writePlatformSummary(opts, analyses)
logFunc("--- Summary ---")
for _, line := range strings.Split(summary, "\n") {
if line != "" {
logFunc(line)
}
}
_ = os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary), 0644)
// Pack tar.gz
archivePath := filepath.Join(baseDir, "platform-stress-"+stamp+".tar.gz")
if err := packPlatformDir(runDir, archivePath); err != nil {
return "", fmt.Errorf("pack archive: %w", err)
}
_ = os.RemoveAll(runDir)
return archivePath, nil
}
// collectPhase samples live metrics every second until ctx is done.
func collectPhase(ctx context.Context, cycle int, phase string, testStart time.Time) []platformStressRow {
var rows []platformStressRow
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return rows
case <-ticker.C:
sample := SampleLiveMetrics()
rows = append(rows, sampleToPlatformRow(sample, cycle, phase, testStart))
}
}
}
func sampleToPlatformRow(s LiveMetricSample, cycle int, phase string, testStart time.Time) platformStressRow {
r := platformStressRow{
ElapsedSec: time.Since(testStart).Seconds(),
Cycle: cycle,
Phase: phase,
CPULoadPct: s.CPULoadPct,
SysPowerW: s.PowerW,
}
for _, t := range s.Temps {
switch t.Group {
case "cpu":
if t.Celsius > r.MaxCPUTempC {
r.MaxCPUTempC = t.Celsius
}
case "gpu":
if t.Celsius > r.MaxGPUTempC {
r.MaxGPUTempC = t.Celsius
}
}
}
for _, g := range s.GPUs {
if g.TempC > r.MaxGPUTempC {
r.MaxGPUTempC = g.TempC
}
}
if len(s.Fans) > 0 {
r.FanMinRPM = s.Fans[0].RPM
r.FanMaxRPM = s.Fans[0].RPM
for _, f := range s.Fans[1:] {
if f.RPM < r.FanMinRPM {
r.FanMinRPM = f.RPM
}
if f.RPM > r.FanMaxRPM {
r.FanMaxRPM = f.RPM
}
}
}
return r
}
func formatPlatformRow(r platformStressRow) string {
throttle := ""
if r.GPUThrottled {
throttle = " THROTTLE"
}
fans := ""
if r.FanMinRPM > 0 {
fans = fmt.Sprintf(" fans=%.0f-%.0fRPM", r.FanMinRPM, r.FanMaxRPM)
}
return fmt.Sprintf("[%5.0fs] cycle=%d phase=%-4s cpu=%.0f%% cpuT=%.1f°C gpuT=%.1f°C pwr=%.0fW%s%s",
r.ElapsedSec, r.Cycle, r.Phase, r.CPULoadPct, r.MaxCPUTempC, r.MaxGPUTempC, r.SysPowerW, fans, throttle)
}
func analyzePlatformCycle(loadRows, idleRows []platformStressRow) cycleAnalysis {
var an cycleAnalysis
for _, r := range loadRows {
if r.MaxCPUTempC > an.maxCPUTemp {
an.maxCPUTemp = r.MaxCPUTempC
}
if r.MaxGPUTempC > an.maxGPUTemp {
an.maxGPUTemp = r.MaxGPUTempC
}
if r.SysPowerW > an.maxPower {
an.maxPower = r.SysPowerW
}
if r.GPUThrottled {
an.throttled = true
}
}
// Fan RPM at cut = avg of last 5 load rows
if n := len(loadRows); n > 0 {
window := loadRows
if n > 5 {
window = loadRows[n-5:]
}
var sum float64
var cnt int
for _, r := range window {
if r.FanMinRPM > 0 {
sum += (r.FanMinRPM + r.FanMaxRPM) / 2
cnt++
}
}
if cnt > 0 {
an.fanAtCutAvg = sum / float64(cnt)
}
}
// Fan RPM min in first 15s of idle
an.fanMin15s = an.fanAtCutAvg
var cutElapsed float64
if len(loadRows) > 0 {
cutElapsed = loadRows[len(loadRows)-1].ElapsedSec
}
for _, r := range idleRows {
if r.ElapsedSec > cutElapsed+15 {
break
}
avg := (r.FanMinRPM + r.FanMaxRPM) / 2
if avg > 0 && (an.fanMin15s == 0 || avg < an.fanMin15s) {
an.fanMin15s = avg
}
}
if an.fanAtCutAvg > 0 {
an.fanDropPct = (an.fanAtCutAvg - an.fanMin15s) / an.fanAtCutAvg * 100
}
return an
}
type cycleAnalysis struct {
maxCPUTemp float64
maxGPUTemp float64
maxPower float64
throttled bool
fanAtCutAvg float64
fanMin15s float64
fanDropPct float64
}
func writePlatformSummary(opts PlatformStressOptions, analyses []cycleAnalysis) string {
var b strings.Builder
fmt.Fprintf(&b, "Platform Thermal Cycling — %d cycle(s)\n", len(opts.Cycles))
fmt.Fprintf(&b, "%s\n\n", strings.Repeat("=", 48))
totalThrottle := 0
totalFanWarn := 0
for i, an := range analyses {
cycle := opts.Cycles[i]
fmt.Fprintf(&b, "Cycle %d/%d (load=%ds, idle=%ds)\n", i+1, len(opts.Cycles), cycle.LoadSec, cycle.IdleSec)
fmt.Fprintf(&b, " Max CPU temp: %.1f°C\n", an.maxCPUTemp)
fmt.Fprintf(&b, " Max GPU temp: %.1f°C\n", an.maxGPUTemp)
fmt.Fprintf(&b, " Max sys power: %.0f W\n", an.maxPower)
if an.throttled {
fmt.Fprintf(&b, " Throttle: DETECTED\n")
totalThrottle++
} else {
fmt.Fprintf(&b, " Throttle: none\n")
}
if an.fanAtCutAvg > 0 {
fmt.Fprintf(&b, " Fan at load cut: %.0f RPM avg\n", an.fanAtCutAvg)
fmt.Fprintf(&b, " Fan min (first 15s idle): %.0f RPM (drop %.0f%%)\n", an.fanMin15s, an.fanDropPct)
if an.fanDropPct > 20 {
fmt.Fprintf(&b, " Fan response: WARN — fast spindown (>20%% drop in 15s)\n")
totalFanWarn++
} else {
fmt.Fprintf(&b, " Fan response: OK\n")
}
}
b.WriteString("\n")
}
fmt.Fprintf(&b, "%s\n", strings.Repeat("=", 48))
if totalThrottle > 0 {
fmt.Fprintf(&b, "Overall: FAIL — throttle detected in %d/%d cycles\n", totalThrottle, len(analyses))
} else if totalFanWarn > 0 {
fmt.Fprintf(&b, "Overall: WARN — fast fan spindown in %d/%d cycles (cooling recovery risk)\n", totalFanWarn, len(analyses))
} else {
fmt.Fprintf(&b, "Overall: PASS\n")
}
return b.String()
}
func writePlatformCSV(rows []platformStressRow) []byte {
var buf bytes.Buffer
w := csv.NewWriter(&buf)
_ = w.Write([]string{
"elapsed_sec", "cycle", "phase",
"cpu_load_pct", "max_cpu_temp_c", "max_gpu_temp_c",
"sys_power_w", "fan_min_rpm", "fan_max_rpm", "gpu_throttled",
})
for _, r := range rows {
throttled := "0"
if r.GPUThrottled {
throttled = "1"
}
_ = w.Write([]string{
strconv.FormatFloat(r.ElapsedSec, 'f', 1, 64),
strconv.Itoa(r.Cycle),
r.Phase,
strconv.FormatFloat(r.CPULoadPct, 'f', 1, 64),
strconv.FormatFloat(r.MaxCPUTempC, 'f', 1, 64),
strconv.FormatFloat(r.MaxGPUTempC, 'f', 1, 64),
strconv.FormatFloat(r.SysPowerW, 'f', 1, 64),
strconv.FormatFloat(r.FanMinRPM, 'f', 0, 64),
strconv.FormatFloat(r.FanMaxRPM, 'f', 0, 64),
throttled,
})
}
w.Flush()
return buf.Bytes()
}
// buildCPUStressCmd creates a stressapptest command that runs until ctx is cancelled.
func buildCPUStressCmd(ctx context.Context) (*exec.Cmd, error) {
path, err := satLookPath("stressapptest")
if err != nil {
return nil, fmt.Errorf("stressapptest not found: %w", err)
}
// Use a very long duration; the context timeout will kill it at the right time.
cmdArgs := []string{"-s", "86400", "-W", "--cc_test"}
if threads := platformStressCPUThreads(); threads > 0 {
cmdArgs = append(cmdArgs, "-m", strconv.Itoa(threads))
}
if mb := platformStressMemoryMB(); mb > 0 {
cmdArgs = append(cmdArgs, "-M", strconv.Itoa(mb))
}
cmd := exec.CommandContext(ctx, path, cmdArgs...)
cmd.Stdout = nil
cmd.Stderr = nil
if err := startLowPriorityCmd(cmd, 15); err != nil {
return nil, fmt.Errorf("stressapptest start: %w", err)
}
return cmd, nil
}
// buildGPUStressCmd creates a GPU stress command appropriate for the detected vendor.
// Returns nil if no GPU stress tool is available (CPU-only cycling still useful).
func buildGPUStressCmd(ctx context.Context, vendor string) *exec.Cmd {
switch strings.ToLower(vendor) {
case "amd":
return buildAMDGPUStressCmd(ctx)
case "nvidia":
return buildNvidiaGPUStressCmd(ctx)
}
return nil
}
func buildAMDGPUStressCmd(ctx context.Context) *exec.Cmd {
rvsArgs, err := resolveRVSCommand()
if err != nil {
return nil
}
rvsPath := rvsArgs[0]
cfg := `actions:
- name: gst_platform
device: all
module: gst
parallel: true
duration: 86400000
copy_matrix: false
target_stress: 90
matrix_size_a: 8640
matrix_size_b: 8640
matrix_size_c: 8640
`
cfgFile := "/tmp/bee-platform-gst.conf"
_ = os.WriteFile(cfgFile, []byte(cfg), 0644)
cmd := exec.CommandContext(ctx, rvsPath, "-c", cfgFile)
cmd.Stdout = nil
cmd.Stderr = nil
_ = startLowPriorityCmd(cmd, 10)
return cmd
}
func buildNvidiaGPUStressCmd(ctx context.Context) *exec.Cmd {
path, err := satLookPath("bee-gpu-burn")
if err != nil {
path, err = satLookPath("bee-gpu-stress")
}
if err != nil {
return nil
}
cmd := exec.CommandContext(ctx, path, "--seconds", "86400")
cmd.Stdout = nil
cmd.Stderr = nil
_ = startLowPriorityCmd(cmd, 10)
return cmd
}
func startLowPriorityCmd(cmd *exec.Cmd, nice int) error {
if err := cmd.Start(); err != nil {
return err
}
if cmd.Process != nil {
_ = syscall.Setpriority(syscall.PRIO_PROCESS, cmd.Process.Pid, nice)
}
return nil
}
func platformStressCPUThreads() int {
if n := envInt("BEE_PLATFORM_STRESS_THREADS", 0); n > 0 {
return n
}
cpus := runtime.NumCPU()
switch {
case cpus <= 2:
return 1
case cpus <= 8:
return cpus - 1
default:
return cpus - 2
}
}
func platformStressMemoryMB() int {
if mb := envInt("BEE_PLATFORM_STRESS_MB", 0); mb > 0 {
return mb
}
free := freeMemBytes()
if free <= 0 {
return 0
}
mb := int((free * 60) / 100 / (1024 * 1024))
if mb < 1024 {
return 1024
}
return mb
}
func containsComponent(components []string, name string) bool {
for _, c := range components {
if c == name {
return true
}
}
return false
}
func packPlatformDir(dir, dest string) error {
f, err := os.Create(dest)
if err != nil {
return err
}
defer f.Close()
gz := gzip.NewWriter(f)
defer gz.Close()
tw := tar.NewWriter(gz)
defer tw.Close()
entries, err := os.ReadDir(dir)
if err != nil {
return err
}
base := filepath.Base(dir)
for _, e := range entries {
if e.IsDir() {
continue
}
fpath := filepath.Join(dir, e.Name())
data, err := os.ReadFile(fpath)
if err != nil {
continue
}
hdr := &tar.Header{
Name: filepath.Join(base, e.Name()),
Size: int64(len(data)),
Mode: 0644,
ModTime: time.Now(),
}
if err := tw.WriteHeader(hdr); err != nil {
return err
}
if _, err := tw.Write(data); err != nil {
return err
}
}
return nil
}

View File

@@ -0,0 +1,34 @@
package platform
import (
"runtime"
"testing"
)
func TestPlatformStressCPUThreadsOverride(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_THREADS", "7")
if got := platformStressCPUThreads(); got != 7 {
t.Fatalf("platformStressCPUThreads=%d want 7", got)
}
}
func TestPlatformStressCPUThreadsDefaultLeavesHeadroom(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_THREADS", "")
got := platformStressCPUThreads()
if got < 1 {
t.Fatalf("platformStressCPUThreads=%d want >= 1", got)
}
if got > runtime.NumCPU() {
t.Fatalf("platformStressCPUThreads=%d want <= NumCPU=%d", got, runtime.NumCPU())
}
if runtime.NumCPU() > 2 && got >= runtime.NumCPU() {
t.Fatalf("platformStressCPUThreads=%d want headroom below NumCPU=%d", got, runtime.NumCPU())
}
}
func TestPlatformStressMemoryMBOverride(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_MB", "8192")
if got := platformStressMemoryMB(); got != 8192 {
t.Fatalf("platformStressMemoryMB=%d want 8192", got)
}
}

View File

@@ -136,7 +136,10 @@ func (s *System) runtimeToolStatuses(vendor string) []ToolStatus {
tools = append(tools, s.CheckTools([]string{
"nvidia-smi",
"nvidia-bug-report.sh",
"bee-gpu-stress",
"bee-gpu-burn",
"bee-john-gpu-stress",
"bee-nccl-gpu-stress",
"all_reduce_perf",
})...)
case "amd":
tool := ToolStatus{Name: "rocm-smi"}
@@ -176,8 +179,8 @@ func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHe
health.DriverReady = true
}
if lookErr := exec.Command("sh", "-c", "command -v bee-gpu-stress >/dev/null 2>&1").Run(); lookErr == nil {
out, err := exec.Command("bee-gpu-stress", "--seconds", "1", "--size-mb", "1").CombinedOutput()
if _, lookErr := exec.LookPath("bee-gpu-burn"); lookErr == nil {
out, err := exec.Command("bee-gpu-burn", "--seconds", "1", "--size-mb", "1").CombinedOutput()
if err == nil {
health.CUDAReady = true
} else if strings.Contains(strings.ToLower(string(out)), "cuda_error_system_not_ready") {

View File

@@ -12,6 +12,7 @@ import (
"os"
"os/exec"
"path/filepath"
"syscall"
"sort"
"strconv"
"strings"
@@ -136,6 +137,54 @@ func (s *System) RunAMDAcceptancePack(ctx context.Context, baseDir string, logFu
}, logFunc)
}
// RunAMDMemIntegrityPack runs the official RVS MEM module as a validate-style memory integrity test.
func (s *System) RunAMDMemIntegrityPack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if err := ensureAMDRuntimeReady(); err != nil {
return "", err
}
cfgFile := "/tmp/bee-amd-mem.conf"
cfg := `actions:
- name: mem_integrity
device: all
module: mem
parallel: true
duration: 60000
copy_matrix: false
target_stress: 90
matrix_size: 8640
`
_ = os.WriteFile(cfgFile, []byte(cfg), 0644)
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-mem", []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rvs-mem.log", cmd: []string{"rvs", "-c", cfgFile}},
{name: "03-rocm-smi-after.log", cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--showmemuse", "--csv"}},
}, logFunc)
}
// RunAMDMemBandwidthPack runs AMD's memory/interconnect bandwidth-oriented tools.
func (s *System) RunAMDMemBandwidthPack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if err := ensureAMDRuntimeReady(); err != nil {
return "", err
}
cfgFile := "/tmp/bee-amd-babel.conf"
cfg := `actions:
- name: babel_mem_bw
device: all
module: babel
parallel: true
copy_matrix: true
target_stress: 90
matrix_size: 134217728
`
_ = os.WriteFile(cfgFile, []byte(cfg), 0644)
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-bandwidth", []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rocm-bandwidth-test.log", cmd: []string{"rocm-bandwidth-test"}},
{name: "03-rvs-babel.log", cmd: []string{"rvs", "-c", cfgFile}},
{name: "04-rocm-smi-after.log", cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--showmemuse", "--csv"}},
}, logFunc)
}
// RunAMDStressPack runs an AMD GPU burn-in pack.
// Missing tools are reported as UNSUPPORTED, consistent with the existing SAT pattern.
func (s *System) RunAMDStressPack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
@@ -146,8 +195,16 @@ func (s *System) RunAMDStressPack(ctx context.Context, baseDir string, durationS
if err := ensureAMDRuntimeReady(); err != nil {
return "", err
}
// Write RVS GST config to a temp file
rvsCfg := fmt.Sprintf(`actions:
// Enable copy_matrix so the same GST run drives VRAM traffic in addition to compute.
rvsCfg := amdStressRVSConfig(seconds)
cfgFile := "/tmp/bee-amd-gst.conf"
_ = os.WriteFile(cfgFile, []byte(rvsCfg), 0644)
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-stress", amdStressJobs(seconds, cfgFile), logFunc)
}
func amdStressRVSConfig(seconds int) string {
return fmt.Sprintf(`actions:
- name: gst_stress
device: all
module: gst
@@ -159,15 +216,15 @@ func (s *System) RunAMDStressPack(ctx context.Context, baseDir string, durationS
matrix_size_b: 8640
matrix_size_c: 8640
`, seconds*1000)
cfgFile := "/tmp/bee-amd-gst.conf"
_ = os.WriteFile(cfgFile, []byte(rvsCfg), 0644)
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-amd-stress", []satJob{
func amdStressJobs(seconds int, cfgFile string) []satJob {
return []satJob{
{name: "01-rocm-smi.log", cmd: []string{"rocm-smi"}},
{name: "02-rocm-bandwidth-test.log", cmd: []string{"rocm-bandwidth-test"}},
{name: fmt.Sprintf("03-rvs-gst-%ds.log", seconds), cmd: []string{"rvs", "-c", cfgFile}},
{name: fmt.Sprintf("04-rocm-smi-after.log"), cmd: []string{"rocm-smi", "--showtemp", "--showpower", "--csv"}},
}, logFunc)
}
}
// ListNvidiaGPUs returns GPUs visible to nvidia-smi.
@@ -369,14 +426,12 @@ type satStats struct {
}
func nvidiaSATJobs() []satJob {
seconds := envInt("BEE_GPU_STRESS_SECONDS", 5)
sizeMB := envInt("BEE_GPU_STRESS_SIZE_MB", 64)
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
{name: "05-bee-gpu-stress.log", cmd: []string{"bee-gpu-stress", "--seconds", fmt.Sprintf("%d", seconds), "--size-mb", fmt.Sprintf("%d", sizeMB)}},
{name: "05-bee-gpu-burn.log", cmd: []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}},
}
}
@@ -477,6 +532,13 @@ func runSATCommandCtx(ctx context.Context, verboseLog, name string, cmd []string
}
c := exec.CommandContext(ctx, resolvedCmd[0], resolvedCmd[1:]...)
c.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
c.Cancel = func() error {
if c.Process != nil {
_ = syscall.Kill(-c.Process.Pid, syscall.SIGKILL)
}
return nil
}
if len(env) > 0 {
c.Env = append(os.Environ(), env...)
}
@@ -630,7 +692,11 @@ func resolveSATCommand(cmd []string) ([]string, error) {
case "rvs":
return resolveRVSCommand(cmd[1:]...)
}
return cmd, nil
path, err := satLookPath(cmd[0])
if err != nil {
return nil, fmt.Errorf("%s not found in PATH: %w", cmd[0], err)
}
return append([]string{path}, cmd[1:]...), nil
}
func resolveRVSCommand(args ...string) ([]string, error) {

View File

@@ -51,6 +51,18 @@ type FanStressRow struct {
SysPowerW float64 // DCMI system power reading
}
type cachedPowerReading struct {
Value float64
UpdatedAt time.Time
}
var (
systemPowerCacheMu sync.Mutex
systemPowerCache cachedPowerReading
)
const systemPowerHoldTTL = 15 * time.Second
// RunFanStressTest runs a two-phase GPU stress test while monitoring fan speeds,
// temperatures, and power draw every second. Exports metrics.csv and fan-sensors.csv.
// Designed to reproduce case-04 fan-speed lag and detect GPU thermal throttling.
@@ -130,26 +142,21 @@ func (s *System) RunFanStressTest(ctx context.Context, baseDir string, opts FanS
stats.OK++
}
// loadPhase runs bee-gpu-stress for durSec; sampler stamps phaseName on each row.
// loadPhase runs bee-gpu-burn for durSec; sampler stamps phaseName on each row.
loadPhase := func(phaseName, stepName string, durSec int) {
if ctx.Err() != nil {
return
}
setPhase(phaseName)
var env []string
if len(opts.GPUIndices) > 0 {
ids := make([]string, len(opts.GPUIndices))
for i, idx := range opts.GPUIndices {
ids[i] = strconv.Itoa(idx)
}
env = []string{"CUDA_VISIBLE_DEVICES=" + strings.Join(ids, ",")}
}
cmd := []string{
"bee-gpu-stress",
"bee-gpu-burn",
"--seconds", strconv.Itoa(durSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
}
out, err := runSATCommandCtx(ctx, verboseLog, stepName, cmd, env, nil)
if len(opts.GPUIndices) > 0 {
cmd = append(cmd, "--devices", joinIndexList(dedupeSortedIndices(opts.GPUIndices)))
}
out, err := runSATCommandCtx(ctx, verboseLog, stepName, cmd, nil, nil)
_ = os.WriteFile(filepath.Join(runDir, stepName+".log"), out, 0644)
if err != nil && err != context.Canceled && err.Error() != "signal: killed" {
fmt.Fprintf(&summary, "%s_status=FAILED\n", stepName)
@@ -323,8 +330,9 @@ func sampleFanSpeeds() ([]FanReading, error) {
// parseFanSpeeds parses "ipmitool sdr type Fan" output.
// Handles two formats:
// Old: "FAN1 | 2400.000 | RPM | ok" (value in col[1], unit in col[2])
// New: "FAN1 | 41h | ok | 29.1 | 4340 RPM" (value+unit combined in last col)
//
// Old: "FAN1 | 2400.000 | RPM | ok" (value in col[1], unit in col[2])
// New: "FAN1 | 41h | ok | 29.1 | 4340 RPM" (value+unit combined in last col)
func parseFanSpeeds(raw string) []FanReading {
var fans []FanReading
for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {
@@ -512,11 +520,17 @@ func sampleCPUTempViaSensors() float64 {
// sampleSystemPower reads system power draw via DCMI.
func sampleSystemPower() float64 {
now := time.Now()
current := 0.0
out, err := exec.Command("ipmitool", "dcmi", "power", "reading").Output()
if err != nil {
return 0
if err == nil {
current = parseDCMIPowerReading(string(out))
}
return parseDCMIPowerReading(string(out))
systemPowerCacheMu.Lock()
defer systemPowerCacheMu.Unlock()
value, updated := effectiveSystemPowerReading(systemPowerCache, current, now)
systemPowerCache = updated
return value
}
// parseDCMIPowerReading extracts the instantaneous power reading from ipmitool dcmi output.
@@ -539,6 +553,17 @@ func parseDCMIPowerReading(raw string) float64 {
return 0
}
func effectiveSystemPowerReading(cache cachedPowerReading, current float64, now time.Time) (float64, cachedPowerReading) {
if current > 0 {
cache = cachedPowerReading{Value: current, UpdatedAt: now}
return current, cache
}
if cache.Value > 0 && !cache.UpdatedAt.IsZero() && now.Sub(cache.UpdatedAt) <= systemPowerHoldTTL {
return cache.Value, cache
}
return 0, cache
}
// analyzeThrottling returns true if any GPU reported an active throttle reason
// during either load phase.
func analyzeThrottling(rows []FanStressRow) bool {

View File

@@ -1,6 +1,9 @@
package platform
import "testing"
import (
"testing"
"time"
)
func TestParseFanSpeeds(t *testing.T) {
raw := "FAN1 | 2400.000 | RPM | ok\nFAN2 | 1800 RPM | ok | ok\nFAN3 | na | RPM | ns\n"
@@ -25,3 +28,40 @@ func TestFirstFanInputValue(t *testing.T) {
t.Fatalf("got=%v ok=%v", got, ok)
}
}
func TestParseDCMIPowerReading(t *testing.T) {
raw := `
Instantaneous power reading: 512 Watts
Minimum during sampling period: 498 Watts
`
if got := parseDCMIPowerReading(raw); got != 512 {
t.Fatalf("parseDCMIPowerReading()=%v want 512", got)
}
}
func TestEffectiveSystemPowerReading(t *testing.T) {
now := time.Now()
cache := cachedPowerReading{Value: 480, UpdatedAt: now.Add(-5 * time.Second)}
got, updated := effectiveSystemPowerReading(cache, 0, now)
if got != 480 {
t.Fatalf("got=%v want cached 480", got)
}
if updated.Value != 480 {
t.Fatalf("updated=%+v", updated)
}
got, updated = effectiveSystemPowerReading(cache, 530, now)
if got != 530 {
t.Fatalf("got=%v want 530", got)
}
if updated.Value != 530 {
t.Fatalf("updated=%+v", updated)
}
expired := cachedPowerReading{Value: 480, UpdatedAt: now.Add(-systemPowerHoldTTL - time.Second)}
got, _ = effectiveSystemPowerReading(expired, 0, now)
if got != 0 {
t.Fatalf("expired cache returned %v want 0", got)
}
}

View File

@@ -5,6 +5,7 @@ import (
"os"
"os/exec"
"path/filepath"
"strings"
"testing"
)
@@ -30,21 +31,59 @@ func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
if len(jobs) != 5 {
t.Fatalf("jobs=%d want 5", len(jobs))
}
if got := jobs[4].cmd[0]; got != "bee-gpu-stress" {
t.Fatalf("gpu stress command=%q want bee-gpu-stress", got)
if got := jobs[4].cmd[0]; got != "bee-gpu-burn" {
t.Fatalf("gpu stress command=%q want bee-gpu-burn", got)
}
if got := jobs[3].cmd[1]; got != "--output-file" {
t.Fatalf("bug report flag=%q want --output-file", got)
}
}
func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
t.Setenv("BEE_GPU_STRESS_SECONDS", "9")
t.Setenv("BEE_GPU_STRESS_SIZE_MB", "96")
func TestAMDStressConfigUsesSingleGSTAction(t *testing.T) {
t.Parallel()
cfg := amdStressRVSConfig(123)
if !strings.Contains(cfg, "module: gst") {
t.Fatalf("config missing gst module:\n%s", cfg)
}
if strings.Contains(cfg, "module: mem") {
t.Fatalf("config should not include mem module:\n%s", cfg)
}
if !strings.Contains(cfg, "copy_matrix: false") {
t.Fatalf("config should use copy_matrix=false:\n%s", cfg)
}
if strings.Count(cfg, "duration: 123000") != 1 {
t.Fatalf("config should apply duration once:\n%s", cfg)
}
for _, field := range []string{"matrix_size_a: 8640", "matrix_size_b: 8640", "matrix_size_c: 8640"} {
if !strings.Contains(cfg, field) {
t.Fatalf("config missing %s:\n%s", field, cfg)
}
}
}
func TestAMDStressJobsIncludeBandwidthAndGST(t *testing.T) {
t.Parallel()
jobs := amdStressJobs(300, "/tmp/test-amd-gst.conf")
if len(jobs) != 4 {
t.Fatalf("jobs=%d want 4", len(jobs))
}
if got := jobs[1].cmd[0]; got != "rocm-bandwidth-test" {
t.Fatalf("jobs[1]=%q want rocm-bandwidth-test", got)
}
if got := jobs[2].cmd[0]; got != "rvs" {
t.Fatalf("jobs[2]=%q want rvs", got)
}
if got := jobs[2].cmd[2]; got != "/tmp/test-amd-gst.conf" {
t.Fatalf("jobs[2] cfg=%q want /tmp/test-amd-gst.conf", got)
}
}
func TestNvidiaSATJobsUseBuiltinBurnDefaults(t *testing.T) {
jobs := nvidiaSATJobs()
got := jobs[4].cmd
want := []string{"bee-gpu-stress", "--seconds", "9", "--size-mb", "96"}
want := []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}
if len(got) != len(want) {
t.Fatalf("cmd len=%d want %d", len(got), len(want))
}
@@ -55,6 +94,93 @@ func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
}
}
func TestBuildNvidiaStressJobUsesSelectedLoaderAndDevices(t *testing.T) {
t.Parallel()
oldExecCommand := satExecCommand
satExecCommand = func(name string, args ...string) *exec.Cmd {
if name == "nvidia-smi" {
return exec.Command("sh", "-c", "printf '0\n1\n2\n'")
}
return exec.Command(name, args...)
}
t.Cleanup(func() { satExecCommand = oldExecCommand })
job, err := buildNvidiaStressJob(NvidiaStressOptions{
DurationSec: 600,
Loader: NvidiaStressLoaderJohn,
ExcludeGPUIndices: []int{1},
})
if err != nil {
t.Fatalf("buildNvidiaStressJob error: %v", err)
}
wantCmd := []string{"bee-john-gpu-stress", "--seconds", "600", "--devices", "0,2"}
if len(job.cmd) != len(wantCmd) {
t.Fatalf("cmd len=%d want %d (%v)", len(job.cmd), len(wantCmd), job.cmd)
}
for i := range wantCmd {
if job.cmd[i] != wantCmd[i] {
t.Fatalf("cmd[%d]=%q want %q", i, job.cmd[i], wantCmd[i])
}
}
if got := joinIndexList(job.gpuIndices); got != "0,2" {
t.Fatalf("gpuIndices=%q want 0,2", got)
}
}
func TestBuildNvidiaStressJobUsesNCCLLoader(t *testing.T) {
t.Parallel()
oldExecCommand := satExecCommand
satExecCommand = func(name string, args ...string) *exec.Cmd {
if name == "nvidia-smi" {
return exec.Command("sh", "-c", "printf '0\n1\n2\n'")
}
return exec.Command(name, args...)
}
t.Cleanup(func() { satExecCommand = oldExecCommand })
job, err := buildNvidiaStressJob(NvidiaStressOptions{
DurationSec: 120,
Loader: NvidiaStressLoaderNCCL,
GPUIndices: []int{2, 0},
})
if err != nil {
t.Fatalf("buildNvidiaStressJob error: %v", err)
}
wantCmd := []string{"bee-nccl-gpu-stress", "--seconds", "120", "--devices", "0,2"}
if len(job.cmd) != len(wantCmd) {
t.Fatalf("cmd len=%d want %d (%v)", len(job.cmd), len(wantCmd), job.cmd)
}
for i := range wantCmd {
if job.cmd[i] != wantCmd[i] {
t.Fatalf("cmd[%d]=%q want %q", i, job.cmd[i], wantCmd[i])
}
}
if got := joinIndexList(job.gpuIndices); got != "0,2" {
t.Fatalf("gpuIndices=%q want 0,2", got)
}
}
func TestNvidiaStressArchivePrefixByLoader(t *testing.T) {
t.Parallel()
tests := []struct {
loader string
want string
}{
{loader: NvidiaStressLoaderBuiltin, want: "gpu-nvidia-burn"},
{loader: NvidiaStressLoaderJohn, want: "gpu-nvidia-john"},
{loader: NvidiaStressLoaderNCCL, want: "gpu-nvidia-nccl"},
{loader: "", want: "gpu-nvidia-burn"},
}
for _, tt := range tests {
if got := nvidiaStressArchivePrefix(tt.loader); got != tt.want {
t.Fatalf("loader=%q prefix=%q want %q", tt.loader, got, tt.want)
}
}
}
func TestEnvIntFallback(t *testing.T) {
os.Unsetenv("BEE_MEMTESTER_SIZE_MB")
if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
@@ -80,8 +206,8 @@ func TestClassifySATResult(t *testing.T) {
}{
{name: "ok", job: "memtester", out: "done", err: nil, status: "OK"},
{name: "unsupported", job: "smartctl-self-test-short", out: "Self-test not supported", err: errors.New("rc 1"), status: "UNSUPPORTED"},
{name: "failed", job: "bee-gpu-stress", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
{name: "cuda not ready", job: "bee-gpu-stress", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
{name: "failed", job: "bee-gpu-burn", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
{name: "cuda not ready", job: "bee-gpu-burn", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
}
for _, tt := range tests {
@@ -130,6 +256,44 @@ func TestResolveROCmSMICommandFromPATH(t *testing.T) {
}
}
func TestResolveSATCommandUsesLookPathForGenericTools(t *testing.T) {
oldLookPath := satLookPath
satLookPath = func(file string) (string, error) {
if file == "stress-ng" {
return "/usr/bin/stress-ng", nil
}
return "", exec.ErrNotFound
}
t.Cleanup(func() { satLookPath = oldLookPath })
cmd, err := resolveSATCommand([]string{"stress-ng", "--cpu", "0"})
if err != nil {
t.Fatalf("resolveSATCommand error: %v", err)
}
if len(cmd) != 3 {
t.Fatalf("cmd len=%d want 3 (%v)", len(cmd), cmd)
}
if cmd[0] != "/usr/bin/stress-ng" {
t.Fatalf("cmd[0]=%q want /usr/bin/stress-ng", cmd[0])
}
}
func TestResolveSATCommandFailsForMissingGenericTool(t *testing.T) {
oldLookPath := satLookPath
satLookPath = func(file string) (string, error) {
return "", exec.ErrNotFound
}
t.Cleanup(func() { satLookPath = oldLookPath })
_, err := resolveSATCommand([]string{"stress-ng", "--cpu", "0"})
if err == nil {
t.Fatal("expected error")
}
if !strings.Contains(err.Error(), "stress-ng not found in PATH") {
t.Fatalf("error=%q", err)
}
}
func TestResolveROCmSMICommandFallsBackToROCmTree(t *testing.T) {
tmp := t.TempDir()
execPath := filepath.Join(tmp, "opt", "rocm", "bin", "rocm-smi")

View File

@@ -51,6 +51,20 @@ type ToolStatus struct {
OK bool
}
const (
NvidiaStressLoaderBuiltin = "builtin"
NvidiaStressLoaderJohn = "john"
NvidiaStressLoaderNCCL = "nccl"
)
type NvidiaStressOptions struct {
DurationSec int
SizeMB int
Loader string
GPUIndices []int
ExcludeGPUIndices []int
}
func New() *System {
return &System{}
}

View File

@@ -2,21 +2,26 @@ package webui
import (
"bufio"
"context"
"encoding/json"
"errors"
"fmt"
"io"
"net/http"
"os"
"os/exec"
"path/filepath"
"regexp"
"strings"
"sync/atomic"
"syscall"
"time"
"bee/audit/internal/app"
"bee/audit/internal/platform"
)
var ansiEscapeRE = regexp.MustCompile(`\x1b\[[0-9;]*[a-zA-Z]|\x1b[()][A-Z0-9]|\x1b[DABC]`)
// ── Job ID counter ────────────────────────────────────────────────────────────
var jobCounter atomic.Uint64
@@ -81,31 +86,54 @@ func streamJob(w http.ResponseWriter, r *http.Request, j *jobState) {
}
}
// runCmdJob runs an exec.Cmd as a background job, streaming stdout+stderr lines.
func runCmdJob(j *jobState, cmd *exec.Cmd) {
// streamCmdJob runs an exec.Cmd and streams stdout+stderr lines into j.
func streamCmdJob(j *jobState, cmd *exec.Cmd) error {
pr, pw := io.Pipe()
cmd.Stdout = pw
cmd.Stderr = pw
if err := cmd.Start(); err != nil {
j.finish(err.Error())
return
_ = pw.Close()
_ = pr.Close()
return err
}
// Lower the CPU scheduling priority of stress/audit subprocesses to nice+10
// so the X server and kernel interrupt handling remain responsive under load
// (prevents KVM/IPMI graphical console from freezing during GPU stress tests).
if cmd.Process != nil {
_ = syscall.Setpriority(syscall.PRIO_PROCESS, cmd.Process.Pid, 10)
}
scanDone := make(chan error, 1)
go func() {
scanner := bufio.NewScanner(pr)
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
for scanner.Scan() {
j.append(scanner.Text())
// Split on \r to handle progress-bar style output (e.g. \r overwrites)
// and strip ANSI escape codes so logs are readable in the browser.
parts := strings.Split(scanner.Text(), "\r")
for _, part := range parts {
line := ansiEscapeRE.ReplaceAllString(part, "")
if line != "" {
j.append(line)
}
}
}
if err := scanner.Err(); err != nil && !errors.Is(err, io.ErrClosedPipe) {
scanDone <- err
return
}
scanDone <- nil
}()
err := cmd.Wait()
_ = pw.Close()
scanErr := <-scanDone
_ = pr.Close()
if err != nil {
j.finish(err.Error())
} else {
j.finish("")
return err
}
return scanErr
}
// ── Audit ─────────────────────────────────────────────────────────────────────
@@ -153,20 +181,23 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
}
var body struct {
Duration int `json:"duration"`
DiagLevel int `json:"diag_level"`
GPUIndices []int `json:"gpu_indices"`
Profile string `json:"profile"`
DisplayName string `json:"display_name"`
Duration int `json:"duration"`
DiagLevel int `json:"diag_level"`
GPUIndices []int `json:"gpu_indices"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
Loader string `json:"loader"`
Profile string `json:"profile"`
DisplayName string `json:"display_name"`
PlatformComponents []string `json:"platform_components"`
}
if r.ContentLength > 0 {
_ = json.NewDecoder(r.Body).Decode(&body)
if r.Body != nil {
if err := json.NewDecoder(r.Body).Decode(&body); err != nil && !errors.Is(err, io.EOF) {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
}
name := taskNames[target]
if name == "" {
name = target
}
name := taskDisplayName(target, body.Profile, body.Loader)
t := &Task{
ID: newJobID("sat-" + target),
Name: name,
@@ -174,11 +205,14 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{
Duration: body.Duration,
DiagLevel: body.DiagLevel,
GPUIndices: body.GPUIndices,
BurnProfile: body.Profile,
DisplayName: body.DisplayName,
Duration: body.Duration,
DiagLevel: body.DiagLevel,
GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices,
Loader: body.Loader,
BurnProfile: body.Profile,
DisplayName: body.DisplayName,
PlatformComponents: body.PlatformComponents,
},
}
if strings.TrimSpace(body.DisplayName) != "" {
@@ -312,8 +346,10 @@ func (h *handler) handleAPINetworkStatus(w http.ResponseWriter, r *http.Request)
return
}
writeJSON(w, map[string]any{
"interfaces": ifaces,
"default_route": h.opts.App.DefaultRoute(),
"interfaces": ifaces,
"default_route": h.opts.App.DefaultRoute(),
"pending_change": h.hasPendingNetworkChange(),
"rollback_in": h.pendingNetworkRollbackIn(),
})
}
@@ -392,17 +428,57 @@ func (h *handler) handleAPIExportList(w http.ResponseWriter, r *http.Request) {
writeJSON(w, entries)
}
func (h *handler) handleAPIExportBundle(w http.ResponseWriter, r *http.Request) {
archive, err := app.BuildSupportBundle(h.opts.ExportDir)
func (h *handler) handleAPIExportUSBTargets(w http.ResponseWriter, _ *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
targets, err := h.opts.App.ListRemovableTargets()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{
"status": "ok",
"path": archive,
"url": "/export/support.tar.gz",
})
if targets == nil {
targets = []platform.RemovableTarget{}
}
writeJSON(w, targets)
}
func (h *handler) handleAPIExportUSBAudit(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var target platform.RemovableTarget
if err := json.NewDecoder(r.Body).Decode(&target); err != nil || target.Device == "" {
writeError(w, http.StatusBadRequest, "device is required")
return
}
result, err := h.opts.App.ExportLatestAuditResult(target)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "message": result.Body})
}
func (h *handler) handleAPIExportUSBBundle(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var target platform.RemovableTarget
if err := json.NewDecoder(r.Body).Decode(&target); err != nil || target.Device == "" {
writeError(w, http.StatusBadRequest, "device is required")
return
}
result, err := h.opts.App.ExportSupportBundleResult(target)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "message": result.Body})
}
// ── GPU presence ──────────────────────────────────────────────────────────────
@@ -420,6 +496,26 @@ func (h *handler) handleAPIGPUPresence(w http.ResponseWriter, r *http.Request) {
})
}
// ── GPU tools ─────────────────────────────────────────────────────────────────
func (h *handler) handleAPIGPUTools(w http.ResponseWriter, _ *http.Request) {
type toolEntry struct {
ID string `json:"id"`
Available bool `json:"available"`
Vendor string `json:"vendor"` // "nvidia" | "amd"
}
_, nvidiaErr := os.Stat("/dev/nvidia0")
_, amdErr := os.Stat("/dev/kfd")
nvidiaUp := nvidiaErr == nil
amdUp := amdErr == nil
writeJSON(w, []toolEntry{
{ID: "bee-gpu-burn", Available: nvidiaUp, Vendor: "nvidia"},
{ID: "john", Available: nvidiaUp, Vendor: "nvidia"},
{ID: "nccl", Available: nvidiaUp, Vendor: "nvidia"},
{ID: "rvs", Available: amdUp, Vendor: "amd"},
})
}
// ── System ────────────────────────────────────────────────────────────────────
func (h *handler) handleAPIRAMStatus(w http.ResponseWriter, r *http.Request) {
@@ -437,10 +533,7 @@ func (h *handler) handleAPIInstallToRAM(w http.ResponseWriter, r *http.Request)
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
h.installMu.Lock()
installRunning := h.installJob != nil && !h.installJob.isDone()
h.installMu.Unlock()
if installRunning {
if globalQueue.hasActiveTarget("install") {
writeError(w, http.StatusConflict, "install to disk is already running")
return
}
@@ -555,39 +648,43 @@ func (h *handler) handleAPIInstallRun(w http.ResponseWriter, r *http.Request) {
writeError(w, http.StatusConflict, "install to RAM task is already pending or running")
return
}
h.installMu.Lock()
if h.installJob != nil && !h.installJob.isDone() {
h.installMu.Unlock()
writeError(w, http.StatusConflict, "install already running")
if globalQueue.hasActiveTarget("install") {
writeError(w, http.StatusConflict, "install task is already pending or running")
return
}
j := &jobState{}
h.installJob = j
h.installMu.Unlock()
logFile := platform.InstallLogPath(req.Device)
go runCmdJob(j, exec.CommandContext(context.Background(), "bee-install", req.Device, logFile))
w.WriteHeader(http.StatusNoContent)
}
func (h *handler) handleAPIInstallStream(w http.ResponseWriter, r *http.Request) {
h.installMu.Lock()
j := h.installJob
h.installMu.Unlock()
if j == nil {
if !sseStart(w) {
return
}
sseWrite(w, "done", "")
return
t := &Task{
ID: newJobID("install"),
Name: "Install to Disk",
Target: "install",
Priority: 20,
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{
Device: req.Device,
},
}
streamJob(w, r, j)
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
}
// ── Metrics SSE ───────────────────────────────────────────────────────────────
func (h *handler) handleAPIMetricsLatest(w http.ResponseWriter, r *http.Request) {
sample, ok := h.latestMetric()
if !ok {
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write([]byte("{}"))
return
}
b, err := json.Marshal(sample)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write(b)
}
func (h *handler) handleAPIMetricsStream(w http.ResponseWriter, r *http.Request) {
if !sseStart(w) {
return
@@ -599,10 +696,9 @@ func (h *handler) handleAPIMetricsStream(w http.ResponseWriter, r *http.Request)
case <-r.Context().Done():
return
case <-ticker.C:
sample := platform.SampleLiveMetrics()
h.feedRings(sample)
if h.metricsDB != nil {
_ = h.metricsDB.Write(sample)
sample, ok := h.latestMetric()
if !ok {
continue
}
b, err := json.Marshal(sample)
if err != nil {
@@ -630,13 +726,7 @@ func (h *handler) feedRings(sample platform.LiveMetricSample) {
h.ringMemLoad.push(sample.MemLoadPct)
h.ringsMu.Lock()
for i, fan := range sample.Fans {
for len(h.ringFans) <= i {
h.ringFans = append(h.ringFans, newMetricsRing(120))
h.fanNames = append(h.fanNames, fan.Name)
}
h.ringFans[i].push(float64(fan.RPM))
}
h.pushFanRings(sample.Fans)
for _, gpu := range sample.GPUs {
idx := gpu.GPUIndex
for len(h.gpuRings) <= idx {
@@ -655,6 +745,51 @@ func (h *handler) feedRings(sample platform.LiveMetricSample) {
h.ringsMu.Unlock()
}
func (h *handler) pushFanRings(fans []platform.FanReading) {
if len(fans) == 0 && len(h.ringFans) == 0 {
return
}
fanValues := make(map[string]float64, len(fans))
for _, fan := range fans {
if fan.Name == "" {
continue
}
fanValues[fan.Name] = fan.RPM
found := false
for i, name := range h.fanNames {
if name == fan.Name {
found = true
if i >= len(h.ringFans) {
h.ringFans = append(h.ringFans, newMetricsRing(120))
}
break
}
}
if !found {
h.fanNames = append(h.fanNames, fan.Name)
h.ringFans = append(h.ringFans, newMetricsRing(120))
}
}
for i, ring := range h.ringFans {
if ring == nil {
continue
}
name := ""
if i < len(h.fanNames) {
name = h.fanNames[i]
}
if rpm, ok := fanValues[name]; ok {
ring.push(rpm)
continue
}
if last, ok := ring.latest(); ok {
ring.push(last)
continue
}
ring.push(0)
}
}
func (h *handler) pushNamedMetricRing(dst *[]*namedMetricsRing, name string, value float64) {
if name == "" {
return
@@ -733,7 +868,10 @@ func (h *handler) applyPendingNetworkChange(apply func() (app.ActionResult, erro
return result, err
}
pnc := &pendingNetChange{snapshot: snapshot}
pnc := &pendingNetChange{
snapshot: snapshot,
deadline: time.Now().Add(netRollbackTimeout),
}
pnc.timer = time.AfterFunc(netRollbackTimeout, func() {
_ = h.opts.App.RestoreNetworkSnapshot(snapshot)
h.pendingNetMu.Lock()
@@ -750,6 +888,25 @@ func (h *handler) applyPendingNetworkChange(apply func() (app.ActionResult, erro
return result, nil
}
func (h *handler) hasPendingNetworkChange() bool {
h.pendingNetMu.Lock()
defer h.pendingNetMu.Unlock()
return h.pendingNet != nil
}
func (h *handler) pendingNetworkRollbackIn() int {
h.pendingNetMu.Lock()
defer h.pendingNetMu.Unlock()
if h.pendingNet == nil {
return 0
}
remaining := int(time.Until(h.pendingNet.deadline).Seconds())
if remaining < 1 {
return 1
}
return remaining
}
func (h *handler) handleAPINetworkConfirm(w http.ResponseWriter, _ *http.Request) {
h.pendingNetMu.Lock()
pnc := h.pendingNet
@@ -791,3 +948,108 @@ func (h *handler) rollbackPendingNetworkChange() error {
}
return nil
}
// ── Display / Screen Resolution ───────────────────────────────────────────────
type displayMode struct {
Output string `json:"output"`
Mode string `json:"mode"`
Current bool `json:"current"`
}
type displayInfo struct {
Output string `json:"output"`
Modes []displayMode `json:"modes"`
Current string `json:"current"`
}
var xrandrOutputRE = regexp.MustCompile(`^(\S+)\s+connected`)
var xrandrModeRE = regexp.MustCompile(`^\s{3}(\d+x\d+)\s`)
var xrandrCurrentRE = regexp.MustCompile(`\*`)
func parseXrandrOutput(out string) []displayInfo {
var infos []displayInfo
var cur *displayInfo
for _, line := range strings.Split(out, "\n") {
if m := xrandrOutputRE.FindStringSubmatch(line); m != nil {
if cur != nil {
infos = append(infos, *cur)
}
cur = &displayInfo{Output: m[1]}
continue
}
if cur == nil {
continue
}
if m := xrandrModeRE.FindStringSubmatch(line); m != nil {
isCurrent := xrandrCurrentRE.MatchString(line)
mode := displayMode{Output: cur.Output, Mode: m[1], Current: isCurrent}
cur.Modes = append(cur.Modes, mode)
if isCurrent {
cur.Current = m[1]
}
}
}
if cur != nil {
infos = append(infos, *cur)
}
return infos
}
func xrandrCommand(args ...string) *exec.Cmd {
cmd := exec.Command("xrandr", args...)
env := append([]string{}, os.Environ()...)
hasDisplay := false
hasXAuthority := false
for _, kv := range env {
if strings.HasPrefix(kv, "DISPLAY=") && strings.TrimPrefix(kv, "DISPLAY=") != "" {
hasDisplay = true
}
if strings.HasPrefix(kv, "XAUTHORITY=") && strings.TrimPrefix(kv, "XAUTHORITY=") != "" {
hasXAuthority = true
}
}
if !hasDisplay {
env = append(env, "DISPLAY=:0")
}
if !hasXAuthority {
env = append(env, "XAUTHORITY=/home/bee/.Xauthority")
}
cmd.Env = env
return cmd
}
func (h *handler) handleAPIDisplayResolutions(w http.ResponseWriter, _ *http.Request) {
out, err := xrandrCommand().Output()
if err != nil {
writeError(w, http.StatusInternalServerError, "xrandr: "+err.Error())
return
}
writeJSON(w, parseXrandrOutput(string(out)))
}
func (h *handler) handleAPIDisplaySet(w http.ResponseWriter, r *http.Request) {
var req struct {
Output string `json:"output"`
Mode string `json:"mode"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Output == "" || req.Mode == "" {
writeError(w, http.StatusBadRequest, "output and mode are required")
return
}
// Validate mode looks like WxH to prevent injection
if !regexp.MustCompile(`^\d+x\d+$`).MatchString(req.Mode) {
writeError(w, http.StatusBadRequest, "invalid mode format")
return
}
// Validate output name (no special chars)
if !regexp.MustCompile(`^[A-Za-z0-9_\-]+$`).MatchString(req.Output) {
writeError(w, http.StatusBadRequest, "invalid output name")
return
}
if out, err := xrandrCommand("--output", req.Output, "--mode", req.Mode).CombinedOutput(); err != nil {
writeError(w, http.StatusInternalServerError, "xrandr: "+strings.TrimSpace(string(out)))
return
}
writeJSON(w, map[string]string{"status": "ok", "output": req.Output, "mode": req.Mode})
}

View File

@@ -0,0 +1,92 @@
package webui
import (
"net/http/httptest"
"strings"
"testing"
"bee/audit/internal/app"
"bee/audit/internal/platform"
)
func TestXrandrCommandAddsDefaultX11Env(t *testing.T) {
t.Setenv("DISPLAY", "")
t.Setenv("XAUTHORITY", "")
cmd := xrandrCommand("--query")
var hasDisplay bool
var hasXAuthority bool
for _, kv := range cmd.Env {
if kv == "DISPLAY=:0" {
hasDisplay = true
}
if kv == "XAUTHORITY=/home/bee/.Xauthority" {
hasXAuthority = true
}
}
if !hasDisplay {
t.Fatalf("DISPLAY not injected: %v", cmd.Env)
}
if !hasXAuthority {
t.Fatalf("XAUTHORITY not injected: %v", cmd.Env)
}
}
func TestHandleAPISATRunDecodesBodyWithoutContentLength(t *testing.T) {
globalQueue.mu.Lock()
originalTasks := globalQueue.tasks
globalQueue.tasks = nil
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = originalTasks
globalQueue.mu.Unlock()
})
h := &handler{opts: HandlerOptions{App: &app.App{}}}
req := httptest.NewRequest("POST", "/api/sat/cpu/run", strings.NewReader(`{"profile":"smoke"}`))
req.ContentLength = -1
rec := httptest.NewRecorder()
h.handleAPISATRun("cpu").ServeHTTP(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
if len(globalQueue.tasks) != 1 {
t.Fatalf("tasks=%d want 1", len(globalQueue.tasks))
}
if got := globalQueue.tasks[0].params.BurnProfile; got != "smoke" {
t.Fatalf("burn profile=%q want smoke", got)
}
}
func TestPushFanRingsTracksByNameAndCarriesForwardMissingSamples(t *testing.T) {
h := &handler{}
h.pushFanRings([]platform.FanReading{
{Name: "FAN_A", RPM: 4200},
{Name: "FAN_B", RPM: 5100},
})
h.pushFanRings([]platform.FanReading{
{Name: "FAN_B", RPM: 5200},
})
if len(h.fanNames) != 2 || h.fanNames[0] != "FAN_A" || h.fanNames[1] != "FAN_B" {
t.Fatalf("fanNames=%v", h.fanNames)
}
aVals, _ := h.ringFans[0].snapshot()
bVals, _ := h.ringFans[1].snapshot()
if len(aVals) != 2 || len(bVals) != 2 {
t.Fatalf("fan ring lengths: A=%d B=%d", len(aVals), len(bVals))
}
if aVals[1] != 4200 {
t.Fatalf("FAN_A should carry forward last value, got %v", aVals)
}
if bVals[1] != 5200 {
t.Fatalf("FAN_B should use latest sampled value, got %v", bVals)
}
}

View File

@@ -3,8 +3,10 @@ package webui
import (
"database/sql"
"encoding/csv"
"fmt"
"io"
"os"
"path/filepath"
"sort"
"strconv"
"time"
@@ -13,7 +15,6 @@ import (
)
const metricsDBPath = "/appdata/bee/metrics.db"
const metricsKeepDuration = 24 * time.Hour
// MetricsDB persists live metric samples to SQLite.
type MetricsDB struct {
@@ -22,6 +23,9 @@ type MetricsDB struct {
// openMetricsDB opens (or creates) the metrics database at the given path.
func openMetricsDB(path string) (*MetricsDB, error) {
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return nil, err
}
db, err := sql.Open("sqlite", path+"?_journal=WAL&_busy_timeout=5000")
if err != nil {
return nil, err
@@ -116,18 +120,25 @@ func (m *MetricsDB) Write(s platform.LiveMetricSample) error {
}
// LoadRecent returns up to n samples in chronological order (oldest first).
// It reconstructs LiveMetricSample from the normalized tables.
func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
rows, err := m.db.Query(
`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts DESC LIMIT ?`, n,
)
return m.loadSamples(`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM (SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts DESC LIMIT ?) ORDER BY ts`, n)
}
// LoadAll returns all persisted samples in chronological order (oldest first).
func (m *MetricsDB) LoadAll() ([]platform.LiveMetricSample, error) {
return m.loadSamples(`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts`, nil)
}
// loadSamples reconstructs LiveMetricSample rows from the normalized tables.
func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetricSample, error) {
rows, err := m.db.Query(query, args...)
if err != nil {
return nil, err
}
defer rows.Close()
type sysRow struct {
ts int64
ts int64
cpu, mem, pwr float64
}
var sysRows []sysRow
@@ -141,17 +152,15 @@ func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
if len(sysRows) == 0 {
return nil, nil
}
// Reverse to chronological order
for i, j := 0, len(sysRows)-1; i < j; i, j = i+1, j-1 {
sysRows[i], sysRows[j] = sysRows[j], sysRows[i]
}
// Collect min/max ts for range query
minTS := sysRows[0].ts
maxTS := sysRows[len(sysRows)-1].ts
// Load GPU rows in range
type gpuKey struct{ ts int64; idx int }
type gpuKey struct {
ts int64
idx int
}
gpuData := map[gpuKey]platform.GPUMetricRow{}
gRows, err := m.db.Query(
`SELECT ts,gpu_index,temp_c,usage_pct,mem_usage_pct,power_w FROM gpu_metrics WHERE ts>=? AND ts<=? ORDER BY ts,gpu_index`,
@@ -169,7 +178,10 @@ func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
}
// Load fan rows in range
type fanKey struct{ ts int64; name string }
type fanKey struct {
ts int64
name string
}
fanData := map[fanKey]float64{}
fRows, err := m.db.Query(
`SELECT ts,name,rpm FROM fan_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS,
@@ -187,7 +199,10 @@ func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
}
// Load temp rows in range
type tempKey struct{ ts int64; name string }
type tempKey struct {
ts int64
name string
}
tempData := map[tempKey]platform.TempReading{}
tRows, err := m.db.Query(
`SELECT ts,name,grp,celsius FROM temp_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS,
@@ -203,7 +218,9 @@ func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
}
}
// Collect unique GPU indices and fan names from loaded data (preserve order)
// Collect unique GPU indices and fan/temp names from loaded data.
// Sort each list so that sample reconstruction is deterministic regardless
// of Go's non-deterministic map iteration order.
seenGPU := map[int]bool{}
var gpuIndices []int
for k := range gpuData {
@@ -212,6 +229,8 @@ func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
gpuIndices = append(gpuIndices, k.idx)
}
}
sort.Ints(gpuIndices)
seenFan := map[string]bool{}
var fanNames []string
for k := range fanData {
@@ -220,6 +239,8 @@ func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
fanNames = append(fanNames, k.name)
}
}
sort.Strings(fanNames)
seenTemp := map[string]bool{}
var tempNames []string
for k := range tempData {
@@ -228,6 +249,7 @@ func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
tempNames = append(tempNames, k.name)
}
}
sort.Strings(tempNames)
samples := make([]platform.LiveMetricSample, len(sysRows))
for i, r := range sysRows {
@@ -257,14 +279,6 @@ func (m *MetricsDB) LoadRecent(n int) ([]platform.LiveMetricSample, error) {
return samples, nil
}
// Prune deletes samples older than keepDuration.
func (m *MetricsDB) Prune(keepDuration time.Duration) {
cutoff := time.Now().Add(-keepDuration).Unix()
for _, table := range []string{"sys_metrics", "gpu_metrics", "fan_metrics", "temp_metrics"} {
_, _ = m.db.Exec(fmt.Sprintf("DELETE FROM %s WHERE ts < ?", table), cutoff)
}
}
// ExportCSV writes all sys+gpu data as CSV to w.
func (m *MetricsDB) ExportCSV(w io.Writer) error {
rows, err := m.db.Query(`

View File

@@ -0,0 +1,69 @@
package webui
import (
"path/filepath"
"testing"
"time"
"bee/audit/internal/platform"
)
func TestMetricsDBLoadSamplesKeepsChronologicalRangeForGPUs(t *testing.T) {
db, err := openMetricsDB(filepath.Join(t.TempDir(), "metrics.db"))
if err != nil {
t.Fatalf("openMetricsDB: %v", err)
}
defer db.Close()
base := time.Unix(1_700_000_000, 0).UTC()
for i := 0; i < 3; i++ {
err := db.Write(platform.LiveMetricSample{
Timestamp: base.Add(time.Duration(i) * time.Second),
CPULoadPct: float64(10 + i),
MemLoadPct: float64(20 + i),
PowerW: float64(300 + i),
GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, PowerW: float64(100 + i)},
{GPUIndex: 2, PowerW: float64(200 + i)},
},
})
if err != nil {
t.Fatalf("Write(%d): %v", i, err)
}
}
all, err := db.LoadAll()
if err != nil {
t.Fatalf("LoadAll: %v", err)
}
if len(all) != 3 {
t.Fatalf("LoadAll len=%d want 3", len(all))
}
for i, sample := range all {
if len(sample.GPUs) != 2 {
t.Fatalf("LoadAll sample %d GPUs=%v want 2 rows", i, sample.GPUs)
}
if sample.GPUs[0].GPUIndex != 0 || sample.GPUs[0].PowerW != float64(100+i) {
t.Fatalf("LoadAll sample %d GPU0=%+v", i, sample.GPUs[0])
}
if sample.GPUs[1].GPUIndex != 2 || sample.GPUs[1].PowerW != float64(200+i) {
t.Fatalf("LoadAll sample %d GPU1=%+v", i, sample.GPUs[1])
}
}
recent, err := db.LoadRecent(2)
if err != nil {
t.Fatalf("LoadRecent: %v", err)
}
if len(recent) != 2 {
t.Fatalf("LoadRecent len=%d want 2", len(recent))
}
if !recent[0].Timestamp.Before(recent[1].Timestamp) {
t.Fatalf("LoadRecent timestamps not ascending: %v >= %v", recent[0].Timestamp, recent[1].Timestamp)
}
for i, sample := range recent {
if len(sample.GPUs) != 2 {
t.Fatalf("LoadRecent sample %d GPUs=%v want 2 rows", i, sample.GPUs)
}
}
}

View File

@@ -205,12 +205,83 @@ document.querySelectorAll('.terminal').forEach(function(t){
func renderDashboard(opts HandlerOptions) string {
var b strings.Builder
b.WriteString(renderAuditStatusBanner(opts))
b.WriteString(renderHardwareSummaryCard(opts))
b.WriteString(renderHealthCard(opts))
b.WriteString(renderMetrics())
return b.String()
}
// renderAuditStatusBanner shows a live progress banner when an audit task is
// running and auto-reloads the page when it completes.
func renderAuditStatusBanner(opts HandlerOptions) string {
// If audit data already exists, no banner needed — data is fresh.
// We still inject the polling script so a newly-triggered audit also reloads.
hasData := false
if _, err := loadSnapshot(opts.AuditPath); err == nil {
hasData = true
}
_ = hasData
return `<div id="audit-banner" style="display:none" class="alert alert-warn" style="margin-bottom:16px">
<span id="audit-banner-text">&#9654; Hardware audit is running — page will refresh automatically when complete.</span>
<a href="/tasks" style="margin-left:12px;font-size:12px">View in Tasks</a>
</div>
<script>
(function(){
var _auditPoll = null;
var _auditSeenRunning = false;
function pollAuditTask() {
fetch('/api/tasks').then(function(r){ return r.json(); }).then(function(tasks){
if (!tasks) return;
var audit = null;
for (var i = 0; i < tasks.length; i++) {
if (tasks[i].target === 'audit') { audit = tasks[i]; break; }
}
var banner = document.getElementById('audit-banner');
var txt = document.getElementById('audit-banner-text');
if (!audit) {
if (banner) banner.style.display = 'none';
return;
}
if (audit.status === 'running' || audit.status === 'pending') {
_auditSeenRunning = true;
if (banner) {
banner.style.display = '';
var label = audit.status === 'pending' ? 'pending\u2026' : 'running\u2026';
if (txt) txt.textContent = '\u25b6 Hardware audit ' + label + ' \u2014 page will refresh when complete.';
}
} else if (audit.status === 'done' && _auditSeenRunning) {
// Audit just finished — reload to show fresh hardware data.
clearInterval(_auditPoll);
if (banner) {
if (txt) txt.textContent = '\u2713 Audit complete \u2014 reloading\u2026';
banner.style.background = 'var(--ok-bg,#fcfff5)';
banner.style.color = 'var(--ok-fg,#2c662d)';
}
setTimeout(function(){ window.location.reload(); }, 800);
} else if (audit.status === 'failed') {
_auditSeenRunning = false;
if (banner) {
banner.style.display = '';
banner.style.background = 'var(--crit-bg,#fff6f6)';
banner.style.color = 'var(--crit-fg,#9f3a38)';
if (txt) txt.textContent = '\u2717 Audit failed: ' + (audit.error||'unknown error');
clearInterval(_auditPoll);
}
} else {
if (banner) banner.style.display = 'none';
}
}).catch(function(){});
}
_auditPoll = setInterval(pollAuditTask, 3000);
pollAuditTask();
})();
</script>`
}
func renderAudit() string {
return `<div class="card"><div class="card-head">Audit Viewer <button class="btn btn-sm btn-secondary" style="margin-left:auto" onclick="openAuditModal()">Actions</button></div><div class="card-body" style="padding:0"><iframe class="viewer-frame" src="/viewer" title="Audit viewer"></iframe></div></div>`
}
@@ -218,7 +289,7 @@ func renderAudit() string {
func renderHardwareSummaryCard(opts HandlerOptions) string {
data, err := loadSnapshot(opts.AuditPath)
if err != nil {
return `<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body"><span class="badge badge-unknown">No audit data</span></div></div>`
return `<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body"><button class="btn btn-primary" onclick="auditModalRun()">&#9654; Run Audit</button></div></div>`
}
// Parse just enough fields for the summary banner
var snap struct {
@@ -451,26 +522,37 @@ func renderMetrics() string {
</div>
<script>
const chartIds = [
'chart-server-load','chart-server-temp-cpu','chart-server-temp-gpu','chart-server-temp-ambient','chart-server-power','chart-server-fans',
'chart-gpu-all-load','chart-gpu-all-memload','chart-gpu-all-power','chart-gpu-all-temp'
];
function refreshChartImage(el) {
if (!el || el.dataset.loading === '1') return;
const baseSrc = el.dataset.baseSrc || el.src.split('?')[0];
const nextSrc = baseSrc + '?t=' + Date.now();
const probe = new Image();
el.dataset.baseSrc = baseSrc;
el.dataset.loading = '1';
probe.onload = function() {
el.src = nextSrc;
el.dataset.loading = '0';
};
probe.onerror = function() {
el.dataset.loading = '0';
};
probe.src = nextSrc;
}
function refreshCharts() {
const t = '?t=' + Date.now();
['chart-server-load','chart-server-temp-cpu','chart-server-temp-gpu','chart-server-temp-ambient','chart-server-power','chart-server-fans',
'chart-gpu-all-load','chart-gpu-all-memload','chart-gpu-all-power','chart-gpu-all-temp'].forEach(id => {
const el = document.getElementById(id);
if (el) el.src = el.src.split('?')[0] + t;
});
chartIds.forEach(id => refreshChartImage(document.getElementById(id)));
}
setInterval(refreshCharts, 3000);
const es = new EventSource('/api/metrics/stream');
es.addEventListener('metrics', e => {
const d = JSON.parse(e.data);
// Show/hide Fan RPM card based on data availability
fetch('/api/metrics/latest').then(r => r.json()).then(d => {
const fanCard = document.getElementById('card-server-fans');
if (fanCard) fanCard.style.display = (d.fans && d.fans.length > 0) ? '' : 'none';
});
es.onerror = () => {};
}).catch(() => {});
</script>`
}
@@ -494,7 +576,11 @@ func renderValidate() string {
renderSATCard("memory", "Memory", "") +
renderSATCard("storage", "Storage", "") +
renderSATCard("cpu", "CPU", `<div class="form-row"><label>Duration (seconds)</label><input type="number" id="sat-cpu-dur" value="60" min="10"></div>`) +
renderSATCard("amd", "AMD GPU", "") +
renderSATCard("amd", "AMD GPU", `<div style="display:flex;gap:8px;flex-wrap:wrap;margin-bottom:8px">
<button id="sat-btn-amd-mem" class="btn" type="button" onclick="runSAT('amd-mem')">MEM Integrity</button>
<button id="sat-btn-amd-bandwidth" class="btn" type="button" onclick="runSAT('amd-bandwidth')">MEM Bandwidth</button>
</div>
<p style="color:var(--muted);font-size:12px;margin:0">Additional AMD memory diagnostics: RVS MEM for integrity and BABEL + rocm-bandwidth-test for memory/interconnect bandwidth.</p>`) +
`</div>
<div id="sat-output" style="display:none;margin-top:16px" class="card">
<div class="card-head">Test Output <span id="sat-title"></span></div>
@@ -505,7 +591,7 @@ let satES = null;
function runSAT(target) {
if (satES) { satES.close(); satES = null; }
const body = {};
const labels = {nvidia:'Validate GPU', memory:'Validate Memory', storage:'Validate Storage', cpu:'Validate CPU', amd:'Validate AMD GPU'};
const labels = {nvidia:'Validate GPU', memory:'Validate Memory', storage:'Validate Storage', cpu:'Validate CPU', amd:'Validate AMD GPU', 'amd-mem':'AMD GPU MEM Integrity', 'amd-bandwidth':'AMD GPU MEM Bandwidth'};
body.display_name = labels[target] || ('Validate ' + target);
if (target === 'nvidia') body.diag_level = parseInt(document.getElementById('sat-nvidia-level').value)||1;
if (target === 'cpu') body.duration = parseInt(document.getElementById('sat-cpu-dur').value)||60;
@@ -524,7 +610,7 @@ function runSAT(target) {
}
function runAllSAT() {
const cycles = Math.max(1, parseInt(document.getElementById('sat-cycles').value)||1);
const targets = ['nvidia','memory','storage','cpu','amd'];
const targets = ['nvidia','memory','storage','cpu','amd','amd-mem','amd-bandwidth'];
const total = targets.length * cycles;
let enqueued = 0;
const status = document.getElementById('sat-all-status');
@@ -536,7 +622,7 @@ function runAllSAT() {
const btn = document.getElementById('sat-btn-' + target);
if (btn && btn.disabled) { enqueueNext(cycle, idx+1); return; }
const body = {};
const labels = {nvidia:'Validate GPU', memory:'Validate Memory', storage:'Validate Storage', cpu:'Validate CPU', amd:'Validate AMD GPU'};
const labels = {nvidia:'Validate GPU', memory:'Validate Memory', storage:'Validate Storage', cpu:'Validate CPU', amd:'Validate AMD GPU', 'amd-mem':'AMD GPU MEM Integrity', 'amd-bandwidth':'AMD GPU MEM Bandwidth'};
body.display_name = labels[target] || ('Validate ' + target);
if (target === 'nvidia') body.diag_level = parseInt(document.getElementById('sat-nvidia-level').value)||1;
if (target === 'cpu') body.duration = parseInt(document.getElementById('sat-cpu-dur').value)||60;
@@ -554,6 +640,8 @@ function runAllSAT() {
fetch('/api/gpu/presence').then(r=>r.json()).then(gp => {
if (!gp.nvidia) disableSATCard('nvidia', 'No NVIDIA GPU detected');
if (!gp.amd) disableSATCard('amd', 'No AMD GPU detected');
if (!gp.amd) disableSATCard('amd-mem', 'No AMD GPU detected');
if (!gp.amd) disableSATCard('amd-bandwidth', 'No AMD GPU detected');
});
function disableSATCard(id, reason) {
const btn = document.getElementById('sat-btn-' + id);
@@ -586,76 +674,210 @@ func renderSATCard(id, label, extra string) string {
func renderBurn() string {
return `<div class="alert alert-warn" style="margin-bottom:16px"><strong>&#9888; Warning:</strong> Stress tests on this page run hardware at maximum load. Repeated or prolonged use may reduce hardware lifespan (storage endurance, GPU wear). Use only when necessary.</div>
<p style="color:var(--muted);font-size:13px;margin-bottom:16px">Tasks continue in the background — view progress in <a href="/tasks">Tasks</a>.</p>
<div class="card"><div class="card-head">Burn Profile</div><div class="card-body">
<div class="form-row" style="max-width:320px"><label>Preset</label><select id="burn-profile"><option value="smoke">Smoke: 5 minutes</option><option value="acceptance">Acceptance: 1 hour</option><option value="overnight">Overnight: 8 hours</option></select></div>
<p style="color:var(--muted);font-size:12px">Applied to all tests on this page. NVIDIA uses mapped DCGM levels: smoke=quick, acceptance=targeted stress, overnight=extended stress.</p>
</div></div>
<div class="grid3">
<div class="card"><div class="card-head">NVIDIA GPU Stress</div><div class="card-body">
<button id="sat-btn-nvidia" class="btn btn-primary" onclick="runBurnIn('nvidia')">&#9654; Start NVIDIA Stress</button>
</div></div>
<div class="card"><div class="card-head">CPU Stress</div><div class="card-body">
<button class="btn btn-primary" onclick="runBurnIn('cpu')">&#9654; Start CPU Stress</button>
</div></div>
<div class="card"><div class="card-head">AMD GPU Stress</div><div class="card-body">
<p style="color:var(--muted);font-size:12px;margin-bottom:8px">Requires ROCm tools (rocm-bandwidth-test). Missing tools reported as UNSUPPORTED.</p>
<button id="sat-btn-amd-stress" class="btn btn-primary" onclick="runBurnIn('amd-stress')">&#9654; Start AMD Stress</button>
</div></div>
<div class="card"><div class="card-head">Memory Stress</div><div class="card-body">
<p style="color:var(--muted);font-size:12px;margin-bottom:8px">stress-ng --vm writes and verifies memory patterns across all of RAM. Env: <code>BEE_VM_STRESS_SECONDS</code> (default 300), <code>BEE_VM_STRESS_SIZE_MB</code> (default 80%).</p>
<button class="btn btn-primary" onclick="runBurnIn('memory-stress')">&#9654; Start Memory Stress</button>
</div></div>
<div class="card"><div class="card-head">SAT Stress (stressapptest)</div><div class="card-body">
<p style="color:var(--muted);font-size:12px;margin-bottom:8px">Google stressapptest saturates CPU, memory and cache buses simultaneously. Env: <code>BEE_SAT_STRESS_SECONDS</code> (default 300), <code>BEE_SAT_STRESS_MB</code> (default auto).</p>
<button class="btn btn-primary" onclick="runBurnIn('sat-stress')">&#9654; Start SAT Stress</button>
</div></div>
<div class="card" style="margin-bottom:16px">
<div class="card-head">Burn Profile</div>
<div class="card-body" style="display:flex;align-items:center;gap:16px;flex-wrap:wrap">
<div class="form-row" style="margin:0;max-width:380px"><label>Preset</label><select id="burn-profile">
<option value="smoke" selected>Smoke — quick check (~5 min)</option>
<option value="acceptance">Acceptance — 1 hour</option>
<option value="overnight">Overnight — 8 hours</option>
</select></div>
<button class="btn btn-primary" onclick="runAll()">&#9654; Run All</button>
<span id="burn-all-status" style="font-size:12px;color:var(--muted)"></span>
</div>
</div>
<div class="grid3" style="margin-bottom:16px">
<div class="card">
<div class="card-head">GPU Stress</div>
<div class="card-body">
<p style="font-size:12px;color:var(--muted);margin:0 0 10px">Tests run on all GPUs in the system. Availability determined by driver status.</p>
<div id="gpu-tools-list">
<label class="cb-row"><input type="checkbox" id="burn-gpu-bee" value="bee-gpu-burn" disabled><span>bee-gpu-burn <span class="cb-note" id="note-bee"></span></span></label>
<label class="cb-row"><input type="checkbox" id="burn-gpu-john" value="john" disabled><span>John the Ripper (OpenCL) <span class="cb-note" id="note-john"></span></span></label>
<label class="cb-row"><input type="checkbox" id="burn-gpu-nccl" value="nccl" disabled><span>NCCL all_reduce_perf <span class="cb-note" id="note-nccl"></span></span></label>
<label class="cb-row"><input type="checkbox" id="burn-gpu-rvs" value="rvs" disabled><span>RVS GST (AMD) <span class="cb-note" id="note-rvs"></span></span></label>
</div>
<button class="btn btn-primary" style="margin-top:10px" onclick="runGPUStress()">&#9654; Run GPU Stress</button>
</div>
</div>
<div class="card">
<div class="card-head">Compute Stress</div>
<div class="card-body">
<p style="font-size:12px;color:var(--muted);margin:0 0 10px">Select which subsystems to stress. Each checked item runs as a separate task.</p>
<label class="cb-row"><input type="checkbox" id="burn-cpu" checked><span>CPU stress (stress-ng)</span></label>
<label class="cb-row"><input type="checkbox" id="burn-mem-stress" checked><span>Memory stress (stress-ng --vm)</span></label>
<label class="cb-row"><input type="checkbox" id="burn-sat-stress"><span>stressapptest (CPU + memory bus)</span></label>
<button class="btn btn-primary" style="margin-top:10px" onclick="runComputeStress()">&#9654; Run Compute Stress</button>
</div>
</div>
<div class="card">
<div class="card-head">Platform Thermal Cycling</div>
<div class="card-body">
<p style="font-size:12px;color:var(--muted);margin:0 0 10px">Repeated load+idle cycles. Detects cooling recovery failures and GPU throttle. Smoke: 2×90s. Acceptance: 4×300s.</p>
<p style="font-size:12px;font-weight:600;margin:0 0 6px">Load components:</p>
<label class="cb-row"><input type="checkbox" id="burn-pt-cpu" checked><span>CPU (stressapptest)</span></label>
<label class="cb-row"><input type="checkbox" id="burn-pt-nvidia" disabled><span>NVIDIA GPU <span class="cb-note" id="note-pt-nvidia"></span></span></label>
<label class="cb-row"><input type="checkbox" id="burn-pt-amd" disabled><span>AMD GPU <span class="cb-note" id="note-pt-amd"></span></span></label>
<button class="btn btn-primary" style="margin-top:10px" onclick="runPlatformStress()">&#9654; Run Thermal Cycling</button>
</div>
</div>
</div>
<div id="bi-output" style="display:none;margin-top:16px" class="card">
<div class="card-head">Output <span id="bi-title"></span></div>
<div class="card-body"><div id="bi-terminal" class="terminal"></div></div>
</div>
<style>
.cb-row { display:flex; align-items:center; gap:8px; padding:4px 0; cursor:pointer; font-size:13px; }
.cb-row input[type=checkbox] { width:16px; height:16px; flex-shrink:0; }
.cb-row input[type=checkbox]:disabled { opacity:0.4; cursor:not-allowed; }
.cb-row input[type=checkbox]:disabled ~ span { opacity:0.45; cursor:not-allowed; }
.cb-note { font-size:11px; color:var(--muted); font-style:italic; }
</style>
<script>
let biES = null;
function runBurnIn(target) {
function profile() { return document.getElementById('burn-profile').value || 'smoke'; }
function enqueueTask(target, extra) {
const body = Object.assign({ profile: profile() }, extra || {});
return fetch('/api/sat/'+target+'/run', {
method: 'POST', headers: {'Content-Type':'application/json'}, body: JSON.stringify(body)
}).then(r => r.json());
}
function streamTask(taskId, label) {
if (biES) { biES.close(); biES = null; }
const body = { profile: document.getElementById('burn-profile').value || 'smoke' };
document.getElementById('bi-output').style.display='block';
document.getElementById('bi-title').textContent = '— ' + target + ' [' + body.profile + ']';
document.getElementById('bi-output').style.display = 'block';
document.getElementById('bi-title').textContent = '— ' + label + ' [' + profile() + ']';
const term = document.getElementById('bi-terminal');
term.textContent = 'Enqueuing ' + target + ' stress...\n';
fetch('/api/sat/'+target+'/run', {method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(body)})
.then(r => r.json())
.then(d => {
term.textContent += 'Task ' + d.task_id + ' queued.\n';
biES = new EventSource('/api/tasks/'+d.task_id+'/stream');
biES.onmessage = e => { term.textContent += e.data+'\n'; term.scrollTop=term.scrollHeight; };
biES.addEventListener('done', e => { biES.close(); biES=null; term.textContent += (e.data ? '\nERROR: '+e.data : '\nCompleted.')+'\n'; });
});
term.textContent = 'Task ' + taskId + ' queued. Streaming...\n';
biES = new EventSource('/api/tasks/'+taskId+'/stream');
biES.onmessage = e => { term.textContent += e.data+'\n'; term.scrollTop = term.scrollHeight; };
biES.addEventListener('done', e => {
biES.close(); biES = null;
term.textContent += (e.data ? '\nERROR: '+e.data : '\nCompleted.')+'\n';
});
}
</script>
<script>
fetch('/api/gpu/presence').then(r=>r.json()).then(gp => {
if (!gp.nvidia) disableSATCard('nvidia', 'No NVIDIA GPU detected');
if (!gp.amd) disableSATCard('amd-stress', 'No AMD GPU detected');
});
function disableSATCard(id, reason) {
const btn = document.getElementById('sat-btn-' + id);
if (!btn) return;
btn.disabled = true;
btn.title = reason;
btn.style.opacity = '0.4';
const card = btn.closest('.card');
if (card) {
let note = card.querySelector('.sat-unavail');
if (!note) {
note = document.createElement('p');
note.className = 'sat-unavail';
note.style.cssText = 'color:var(--muted);font-size:12px;margin-top:6px';
btn.parentNode.insertBefore(note, btn.nextSibling);
}
note.textContent = reason;
function runGPUStress() {
const ids = ['burn-gpu-bee','burn-gpu-john','burn-gpu-nccl','burn-gpu-rvs'];
const loaderMap = {'burn-gpu-bee':'builtin','burn-gpu-john':'john','burn-gpu-nccl':'nccl','burn-gpu-rvs':'rvs'};
const targetMap = {'burn-gpu-bee':'nvidia-stress','burn-gpu-john':'nvidia-stress','burn-gpu-nccl':'nvidia-stress','burn-gpu-rvs':'amd-stress'};
let last = null;
ids.filter(id => {
const el = document.getElementById(id);
return el && el.checked && !el.disabled;
}).forEach(id => {
const target = targetMap[id];
const extra = target === 'nvidia-stress' ? {loader: loaderMap[id]} : {};
enqueueTask(target, extra).then(d => { last = d; streamTask(d.task_id, target + ' / ' + loaderMap[id]); });
});
}
function runComputeStress() {
const tasks = [
{id:'burn-cpu', target:'cpu'},
{id:'burn-mem-stress', target:'memory-stress'},
{id:'burn-sat-stress', target:'sat-stress'},
];
let last = null;
tasks.filter(t => {
const el = document.getElementById(t.id);
return el && el.checked;
}).forEach(t => {
enqueueTask(t.target).then(d => { last = d; streamTask(d.task_id, t.target); });
});
}
function runPlatformStress() {
const comps = [];
if (document.getElementById('burn-pt-cpu').checked) comps.push('cpu');
const nv = document.getElementById('burn-pt-nvidia');
if (nv && nv.checked && !nv.disabled) comps.push('gpu');
const am = document.getElementById('burn-pt-amd');
if (am && am.checked && !am.disabled) comps.push('gpu');
const extra = comps.length > 0 ? {platform_components: comps} : {};
enqueueTask('platform-stress', extra).then(d => streamTask(d.task_id, 'platform-stress'));
}
function runAll() {
const status = document.getElementById('burn-all-status');
status.textContent = 'Enqueuing...';
let count = 0;
const done = () => { count++; status.textContent = count + ' tasks queued.'; };
// GPU tests
const gpuIds = ['burn-gpu-bee','burn-gpu-john','burn-gpu-nccl','burn-gpu-rvs'];
const loaderMap = {'burn-gpu-bee':'builtin','burn-gpu-john':'john','burn-gpu-nccl':'nccl','burn-gpu-rvs':'rvs'};
const gpuTargetMap = {'burn-gpu-bee':'nvidia-stress','burn-gpu-john':'nvidia-stress','burn-gpu-nccl':'nvidia-stress','burn-gpu-rvs':'amd-stress'};
gpuIds.filter(id => { const el = document.getElementById(id); return el && el.checked && !el.disabled; }).forEach(id => {
const target = gpuTargetMap[id];
const extra = target === 'nvidia-stress' ? {loader: loaderMap[id]} : {};
enqueueTask(target, extra).then(d => { streamTask(d.task_id, target); done(); });
});
// Compute tests
[{id:'burn-cpu',target:'cpu'},{id:'burn-mem-stress',target:'memory-stress'},{id:'burn-sat-stress',target:'sat-stress'}]
.filter(t => { const el = document.getElementById(t.id); return el && el.checked; })
.forEach(t => enqueueTask(t.target).then(d => { streamTask(d.task_id, t.target); done(); }));
// Platform
const comps = [];
if (document.getElementById('burn-pt-cpu').checked) comps.push('cpu');
const nv = document.getElementById('burn-pt-nvidia');
if (nv && nv.checked && !nv.disabled) comps.push('gpu');
const am = document.getElementById('burn-pt-amd');
if (am && am.checked && !am.disabled) comps.push('gpu');
const ptExtra = comps.length > 0 ? {platform_components: comps} : {};
enqueueTask('platform-stress', ptExtra).then(d => { streamTask(d.task_id, 'platform-stress'); done(); });
}
// Load GPU tool availability
fetch('/api/gpu/tools').then(r => r.json()).then(tools => {
const nvidiaMap = {'bee-gpu-burn':'burn-gpu-bee','john':'burn-gpu-john','nccl':'burn-gpu-nccl','rvs':'burn-gpu-rvs'};
const noteMap = {'bee-gpu-burn':'note-bee','john':'note-john','nccl':'note-nccl','rvs':'note-rvs'};
tools.forEach(t => {
const cb = document.getElementById(nvidiaMap[t.id]);
const note = document.getElementById(noteMap[t.id]);
if (!cb) return;
if (t.available) {
cb.disabled = false;
if (t.id === 'bee-gpu-burn') cb.checked = true;
} else {
const reason = t.vendor === 'nvidia' ? 'NVIDIA driver not running' : 'AMD driver not running';
if (note) note.textContent = '— ' + reason;
}
}
});
}).catch(() => {});
// Load GPU presence for platform thermal cycling
fetch('/api/gpu/presence').then(r => r.json()).then(gp => {
const nvCb = document.getElementById('burn-pt-nvidia');
const amCb = document.getElementById('burn-pt-amd');
const nvNote = document.getElementById('note-pt-nvidia');
const amNote = document.getElementById('note-pt-amd');
if (gp.nvidia) {
nvCb.disabled = false;
nvCb.checked = true;
} else {
if (nvNote) nvNote.textContent = '— NVIDIA driver not running';
}
if (gp.amd) {
amCb.disabled = false;
amCb.checked = true;
} else {
if (amNote) amNote.textContent = '— AMD driver not running';
}
}).catch(() => {});
</script>`
}
@@ -687,6 +909,8 @@ func renderNetworkInline() string {
</div>
<script>
var _netCountdownTimer = null;
var _netRefreshTimer = null;
const NET_ROLLBACK_SECS = 60;
function loadNetwork() {
fetch('/api/network').then(r=>r.json()).then(d => {
const rows = (d.interfaces||[]).map(i =>
@@ -697,21 +921,33 @@ function loadNetwork() {
document.getElementById('iface-table').innerHTML =
'<table><tr><th>Interface</th><th>State (click to toggle)</th><th>Addresses</th></tr>'+rows+'</table>' +
(d.default_route ? '<p style="font-size:12px;color:var(--muted);margin-top:8px">Default route: '+d.default_route+'</p>' : '');
});
if (d.pending_change) showNetPending(d.rollback_in || NET_ROLLBACK_SECS);
else hideNetPending();
}).catch(function() {});
}
function selectIface(iface) {
document.getElementById('dhcp-iface').value = iface;
document.getElementById('st-iface').value = iface;
}
function toggleIface(iface, currentState) {
showNetPending(NET_ROLLBACK_SECS);
fetch('/api/network/toggle',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({iface:iface})})
.then(r=>r.json()).then(d => {
if (d.error) { alert('Error: '+d.error); return; }
if (d.error) { hideNetPending(); alert('Error: '+d.error); return; }
loadNetwork();
showNetPending(d.rollback_in || 60);
showNetPending(d.rollback_in || NET_ROLLBACK_SECS);
}).catch(function() {
setTimeout(loadNetwork, 1500);
});
}
function hideNetPending() {
const el = document.getElementById('net-pending');
if (_netCountdownTimer) clearInterval(_netCountdownTimer);
_netCountdownTimer = null;
el.style.display = 'none';
}
function showNetPending(secs) {
if (!secs || secs < 1) { hideNetPending(); return; }
const el = document.getElementById('net-pending');
el.style.display = 'block';
if (_netCountdownTimer) clearInterval(_netCountdownTimer);
@@ -720,30 +956,33 @@ function showNetPending(secs) {
_netCountdownTimer = setInterval(function() {
remaining--;
document.getElementById('net-countdown').textContent = remaining;
if (remaining <= 0) { clearInterval(_netCountdownTimer); _netCountdownTimer=null; el.style.display='none'; loadNetwork(); }
if (remaining <= 0) { hideNetPending(); loadNetwork(); }
}, 1000);
}
function confirmNetChange() {
if (_netCountdownTimer) { clearInterval(_netCountdownTimer); _netCountdownTimer=null; }
document.getElementById('net-pending').style.display='none';
fetch('/api/network/confirm',{method:'POST'});
hideNetPending();
fetch('/api/network/confirm',{method:'POST'}).then(()=>loadNetwork()).catch(()=>{});
}
function rollbackNetChange() {
if (_netCountdownTimer) { clearInterval(_netCountdownTimer); _netCountdownTimer=null; }
document.getElementById('net-pending').style.display='none';
fetch('/api/network/rollback',{method:'POST'}).then(()=>loadNetwork());
hideNetPending();
fetch('/api/network/rollback',{method:'POST'}).then(()=>loadNetwork()).catch(()=>{});
}
function runDHCP() {
const iface = document.getElementById('dhcp-iface').value.trim();
showNetPending(NET_ROLLBACK_SECS);
fetch('/api/network/dhcp',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({interface:iface||'all'})})
.then(r=>r.json()).then(d => {
document.getElementById('dhcp-out').textContent = d.output || d.error || 'Done.';
if (!d.error) showNetPending(d.rollback_in || 60);
if (d.error) { hideNetPending(); return; }
showNetPending(d.rollback_in || NET_ROLLBACK_SECS);
loadNetwork();
}).catch(function() {
setTimeout(loadNetwork, 1500);
});
}
function setStatic() {
const dns = document.getElementById('st-dns').value.split(',').map(s=>s.trim()).filter(Boolean);
showNetPending(NET_ROLLBACK_SECS);
fetch('/api/network/static',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({
interface: document.getElementById('st-iface').value,
address: document.getElementById('st-addr').value,
@@ -752,11 +991,16 @@ function setStatic() {
dns: dns,
})}).then(r=>r.json()).then(d => {
document.getElementById('static-out').textContent = d.output || d.error || 'Done.';
if (!d.error) showNetPending(d.rollback_in || 60);
if (d.error) { hideNetPending(); return; }
showNetPending(d.rollback_in || NET_ROLLBACK_SECS);
loadNetwork();
}).catch(function() {
setTimeout(loadNetwork, 1500);
});
}
loadNetwork();
if (_netRefreshTimer) clearInterval(_netRefreshTimer);
_netRefreshTimer = setInterval(loadNetwork, 5000);
</script>`
}
@@ -835,12 +1079,79 @@ func renderExport(exportDir string) string {
return `<div class="grid2">
<div class="card"><div class="card-head">Support Bundle</div><div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Creates a tar.gz archive of all audit files, SAT results, and logs.</p>
<a class="btn btn-primary" href="/export/support.tar.gz">⬇ Download Support Bundle</a>
` + renderSupportBundleInline() + `
</div></div>
<div class="card"><div class="card-head">Export Files</div><div class="card-body">
<table><tr><th>File</th></tr>` + rows.String() + `</table>
</div></div>
</div>`
</div>
<div class="card" style="margin-top:16px">
<div class="card-head">Export to USB
<button class="btn btn-sm btn-secondary" onclick="usbRefresh()" style="margin-left:auto">&#8635; Refresh</button>
</div>
<div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Write audit JSON or support bundle directly to a removable USB drive.</p>
<div id="usb-status" style="font-size:13px;color:var(--muted)">Scanning for USB devices...</div>
<div id="usb-targets" style="margin-top:12px"></div>
<div id="usb-msg" style="margin-top:10px;font-size:13px"></div>
</div>
</div>
<script>
(function(){
function usbRefresh() {
document.getElementById('usb-status').textContent = 'Scanning...';
document.getElementById('usb-targets').innerHTML = '';
document.getElementById('usb-msg').textContent = '';
fetch('/api/export/usb').then(r=>r.json()).then(targets => {
const st = document.getElementById('usb-status');
const ct = document.getElementById('usb-targets');
if (!targets || targets.length === 0) {
st.textContent = 'No removable USB devices found.';
return;
}
st.textContent = targets.length + ' device(s) found:';
ct.innerHTML = '<table><tr><th>Device</th><th>FS</th><th>Size</th><th>Label</th><th>Model</th><th>Actions</th></tr>' +
targets.map(t => {
const dev = t.device || '';
const label = t.label || '';
const model = t.model || '';
return '<tr>' +
'<td style="font-family:monospace">'+dev+'</td>' +
'<td>'+t.fs_type+'</td>' +
'<td>'+t.size+'</td>' +
'<td>'+label+'</td>' +
'<td style="font-size:12px;color:var(--muted)">'+model+'</td>' +
'<td style="white-space:nowrap">' +
'<button class="btn btn-sm btn-primary" onclick="usbExport(\'audit\','+JSON.stringify(t)+')">Audit JSON</button> ' +
'<button class="btn btn-sm btn-secondary" onclick="usbExport(\'bundle\','+JSON.stringify(t)+')">Support Bundle</button>' +
'</td></tr>';
}).join('') + '</table>';
}).catch(e => {
document.getElementById('usb-status').textContent = 'Error: ' + e;
});
}
window.usbExport = function(type, target) {
const msg = document.getElementById('usb-msg');
msg.style.color = 'var(--muted)';
msg.textContent = 'Exporting to ' + (target.device||'') + '...';
fetch('/api/export/usb/'+type, {
method: 'POST',
headers: {'Content-Type':'application/json'},
body: JSON.stringify(target)
}).then(r=>r.json()).then(d => {
if (d.error) { msg.style.color='var(--err,red)'; msg.textContent = 'Error: '+d.error; return; }
msg.style.color = 'var(--ok,green)';
msg.textContent = d.message || 'Done.';
}).catch(e => {
msg.style.color = 'var(--err,red)';
msg.textContent = 'Error: '+e;
});
};
window.usbRefresh = usbRefresh;
usbRefresh();
})();
</script>`
}
func listExportFiles(exportDir string) ([]string, error) {
@@ -866,6 +1177,100 @@ func listExportFiles(exportDir string) ([]string, error) {
return entries, nil
}
func renderSupportBundleInline() string {
return `<button id="support-bundle-btn" class="btn btn-primary" onclick="supportBundleDownload()">&#8595; Download Support Bundle</button>
<div id="support-bundle-status" style="margin-top:10px;font-size:13px;color:var(--muted)"></div>
<script>
window.supportBundleDownload = function() {
var btn = document.getElementById('support-bundle-btn');
var status = document.getElementById('support-bundle-status');
btn.disabled = true;
btn.textContent = 'Building...';
status.textContent = 'Collecting logs and export data\u2026';
status.style.color = 'var(--muted)';
var filename = 'bee-support.tar.gz';
fetch('/export/support.tar.gz')
.then(function(r) {
if (!r.ok) throw new Error('HTTP ' + r.status);
var cd = r.headers.get('Content-Disposition') || '';
var m = cd.match(/filename="?([^";]+)"?/);
if (m) filename = m[1];
return r.blob();
})
.then(function(blob) {
var url = URL.createObjectURL(blob);
var a = document.createElement('a');
a.href = url;
a.download = filename;
document.body.appendChild(a);
a.click();
document.body.removeChild(a);
URL.revokeObjectURL(url);
status.textContent = 'Download started.';
status.style.color = 'var(--ok-fg)';
})
.catch(function(e) {
status.textContent = 'Error: ' + e.message;
status.style.color = 'var(--crit-fg)';
})
.finally(function() {
btn.disabled = false;
btn.textContent = '\u2195 Download Support Bundle';
});
};
</script>`
}
// ── Display Resolution ────────────────────────────────────────────────────────
func renderDisplayInline() string {
return `<div id="display-status" style="color:var(--muted);font-size:13px;margin-bottom:12px">Loading displays...</div>
<div id="display-controls"></div>
<script>
(function(){
function loadDisplays() {
fetch('/api/display/resolutions').then(r=>r.json()).then(displays => {
const status = document.getElementById('display-status');
const ctrl = document.getElementById('display-controls');
if (!displays || displays.length === 0) {
status.textContent = 'No connected displays found or xrandr not available.';
return;
}
status.textContent = '';
ctrl.innerHTML = displays.map(d => {
const opts = (d.modes||[]).map(m =>
'<option value="'+m.mode+'"'+(m.current?' selected':'')+'>'+m.mode+(m.current?' (current)':'')+'</option>'
).join('');
return '<div style="margin-bottom:12px">'
+'<span style="font-weight:600;margin-right:8px">'+d.output+'</span>'
+'<span style="color:var(--muted);font-size:12px;margin-right:12px">Current: '+d.current+'</span>'
+'<select id="res-sel-'+d.output+'" style="margin-right:8px">'+opts+'</select>'
+'<button class="btn btn-sm btn-primary" onclick="applyResolution(\''+d.output+'\')">Apply</button>'
+'</div>';
}).join('');
}).catch(()=>{
document.getElementById('display-status').textContent = 'xrandr not available on this system.';
});
}
window.applyResolution = function(output) {
const sel = document.getElementById('res-sel-'+output);
if (!sel) return;
const mode = sel.value;
const btn = sel.nextElementSibling;
btn.disabled = true;
btn.textContent = 'Applying...';
fetch('/api/display/set', {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify({output:output,mode:mode})})
.then(r=>r.json()).then(d=>{
if (d.error) { alert('Error: '+d.error); }
loadDisplays();
}).catch(e=>{ alert('Error: '+e); })
.finally(()=>{ btn.disabled=false; btn.textContent='Apply'; });
};
loadDisplays();
})();
</script>`
}
// ── Tools ─────────────────────────────────────────────────────────────────────
func renderTools() string {
@@ -905,7 +1310,7 @@ function installToRAM() {
<div class="card"><div class="card-head">Support Bundle</div><div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Downloads a tar.gz archive of all audit files, SAT results, and logs.</p>
<a class="btn btn-primary" href="/export/support.tar.gz">&#8595; Download Support Bundle</a>
` + renderSupportBundleInline() + `
</div></div>
<div class="card"><div class="card-head">Tool Check <button class="btn btn-sm btn-secondary" onclick="checkTools()" style="margin-left:auto">&#8635; Check</button></div>
@@ -917,6 +1322,9 @@ function installToRAM() {
<div class="card"><div class="card-head">Services</div><div class="card-body">` +
renderServicesInline() + `</div></div>
<div class="card"><div class="card-head">Display Resolution</div><div class="card-body">` +
renderDisplayInline() + `</div></div>
<script>
function checkTools() {
document.getElementById('tools-table').innerHTML = '<p style="color:var(--muted);font-size:13px">Checking...</p>';
@@ -1081,21 +1489,23 @@ function installStart() {
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({device: _installSelected.device})
}).then(function(r){
if (r.status === 204) {
installStreamLog();
} else {
return r.json().then(function(j){ throw new Error(j.error || r.statusText); });
}
return r.json().then(function(j){
if (!r.ok) throw new Error(j.error || r.statusText);
return j;
});
}).then(function(j){
if (!j.task_id) throw new Error('missing task id');
installStreamLog(j.task_id);
}).catch(function(e){
status.textContent = 'Error: ' + e;
status.style.color = 'var(--crit-fg)';
});
}
function installStreamLog() {
function installStreamLog(taskId) {
var term = document.getElementById('install-terminal');
var status = document.getElementById('install-status');
var es = new EventSource('/api/install/stream');
var es = new EventSource('/api/tasks/' + taskId + '/stream');
es.onmessage = function(e) {
term.textContent += e.data + '\n';
term.scrollTop = term.scrollHeight;
@@ -1140,8 +1550,10 @@ func renderInstall() string {
// ── Tasks ─────────────────────────────────────────────────────────────────────
func renderTasks() string {
return `<div style="display:flex;align-items:center;gap:12px;margin-bottom:16px">
return `<div style="display:flex;align-items:center;gap:12px;margin-bottom:16px;flex-wrap:wrap">
<button class="btn btn-danger btn-sm" onclick="cancelAll()">Cancel All</button>
<button class="btn btn-sm" style="background:#b45309;color:#fff" onclick="killWorkers()" title="Send SIGKILL to all running test processes (bee-gpu-burn, stress-ng, stressapptest, memtester)">Kill Workers</button>
<span id="kill-toast" style="font-size:12px;color:var(--muted);display:none"></span>
<span style="font-size:12px;color:var(--muted)">Tasks run one at a time. Logs persist after navigation.</span>
</div>
<div class="card">
@@ -1164,7 +1576,7 @@ function loadTasks() {
return;
}
const rows = tasks.map(t => {
const dur = t.started_at ? formatDur(t.started_at, t.done_at) : '';
const dur = t.elapsed_sec ? formatDurSec(t.elapsed_sec) : '';
const statusClass = {running:'badge-ok',pending:'badge-unknown',done:'badge-ok',failed:'badge-err',cancelled:'badge-unknown'}[t.status]||'badge-unknown';
const statusLabel = {running:'&#9654; running',pending:'pending',done:'&#10003; done',failed:'&#10007; failed',cancelled:'cancelled'}[t.status]||t.status;
let actions = '<button class="btn btn-sm btn-secondary" onclick="viewLog(\''+t.id+'\',\''+escHtml(t.name)+'\')">Logs</button>';
@@ -1189,14 +1601,11 @@ function loadTasks() {
function escHtml(s) { return (s||'').replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;').replace(/"/g,'&quot;'); }
function fmtTime(s) { if (!s) return ''; try { return new Date(s).toLocaleTimeString(); } catch(e){ return s; } }
function formatDur(start, end) {
try {
const s = new Date(start), e = end ? new Date(end) : new Date();
const sec = Math.round((e-s)/1000);
if (sec < 60) return sec+'s';
const m = Math.floor(sec/60), ss = sec%60;
return m+'m '+ss+'s';
} catch(e){ return ''; }
function formatDurSec(sec) {
sec = Math.max(0, Math.round(sec||0));
if (sec < 60) return sec+'s';
const m = Math.floor(sec/60), ss = sec%60;
return m+'m '+ss+'s';
}
function cancelTask(id) {
@@ -1205,6 +1614,21 @@ function cancelTask(id) {
function cancelAll() {
fetch('/api/tasks/cancel-all',{method:'POST'}).then(()=>loadTasks());
}
function killWorkers() {
if (!confirm('Send SIGKILL to all running test workers (bee-gpu-burn, stress-ng, stressapptest, memtester)?\n\nThis will also cancel all queued and running tasks.')) return;
fetch('/api/tasks/kill-workers',{method:'POST'})
.then(r=>r.json())
.then(d=>{
loadTasks();
var toast = document.getElementById('kill-toast');
var parts = [];
if (d.cancelled > 0) parts.push(d.cancelled+' task'+(d.cancelled===1?'':'s')+' cancelled');
if (d.killed > 0) parts.push(d.killed+' process'+(d.killed===1?'':'es')+' killed');
toast.textContent = parts.length ? parts.join(', ')+'.' : 'No processes found.';
toast.style.display = '';
setTimeout(()=>{ toast.style.display='none'; }, 5000);
});
}
function setPriority(id, delta) {
fetch('/api/tasks/'+id+'/priority',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({delta:delta})})
.then(()=>loadTasks());

View File

@@ -5,10 +5,12 @@ import (
"errors"
"fmt"
"html"
"log/slog"
"mime"
"net/http"
"os"
"path/filepath"
"sort"
"strings"
"sync"
"time"
@@ -72,29 +74,45 @@ func (r *metricsRing) snapshot() ([]float64, []string) {
defer r.mu.Unlock()
v := make([]float64, len(r.vals))
copy(v, r.vals)
now := time.Now()
labels := make([]string, len(r.times))
if len(r.times) == 0 {
return v, labels
}
sameDay := timestampsSameLocalDay(r.times)
for i, t := range r.times {
labels[i] = relAgeLabel(now.Sub(t))
labels[i] = formatTimelineLabel(t.Local(), sameDay)
}
return v, labels
}
func relAgeLabel(age time.Duration) string {
if age <= 0 {
return "0"
func (r *metricsRing) latest() (float64, bool) {
r.mu.Lock()
defer r.mu.Unlock()
if len(r.vals) == 0 {
return 0, false
}
if age < time.Hour {
m := int(age.Minutes())
if m == 0 {
return "-1m"
return r.vals[len(r.vals)-1], true
}
func timestampsSameLocalDay(times []time.Time) bool {
if len(times) == 0 {
return true
}
first := times[0].Local()
for _, t := range times[1:] {
local := t.Local()
if local.Year() != first.Year() || local.YearDay() != first.YearDay() {
return false
}
return fmt.Sprintf("-%dm", m)
}
if age < 24*time.Hour {
return fmt.Sprintf("-%dh", int(age.Hours()))
return true
}
func formatTimelineLabel(ts time.Time, sameDay bool) string {
if sameDay {
return ts.Format("15:04")
}
return fmt.Sprintf("-%dd", int(age.Hours()/24))
return ts.Format("01-02 15:04")
}
// gpuRings holds per-GPU ring buffers.
@@ -110,9 +128,16 @@ type namedMetricsRing struct {
Ring *metricsRing
}
// metricsChartWindow is the number of samples kept in the live ring buffer.
// At metricsCollectInterval = 5 s this covers 30 minutes of live history.
const metricsChartWindow = 360
var metricsCollectInterval = 5 * time.Second
// pendingNetChange tracks a network state change awaiting confirmation.
type pendingNetChange struct {
snapshot platform.NetworkSnapshot
deadline time.Time
timer *time.Timer
mu sync.Mutex
}
@@ -132,11 +157,10 @@ type handler struct {
// per-GPU rings (index = GPU index)
gpuRings []*gpuRings
ringsMu sync.Mutex
latestMu sync.RWMutex
latest *platform.LiveMetricSample
// metrics persistence (nil if DB unavailable)
metricsDB *MetricsDB
// install job (at most one at a time)
installJob *jobState
installMu sync.Mutex
// pending network change (rollback on timeout)
pendingNet *pendingNetChange
pendingNetMu sync.Mutex
@@ -164,13 +188,20 @@ func NewHandler(opts HandlerOptions) http.Handler {
// Open metrics DB and pre-fill ring buffers from history.
if db, err := openMetricsDB(metricsDBPath); err == nil {
h.metricsDB = db
db.Prune(metricsKeepDuration)
if samples, err := db.LoadRecent(120); err == nil {
if samples, err := db.LoadRecent(metricsChartWindow); err == nil {
for _, s := range samples {
h.feedRings(s)
}
if len(samples) > 0 {
h.setLatestMetric(samples[len(samples)-1])
}
} else {
slog.Warn("metrics history unavailable", "path", metricsDBPath, "err", err)
}
} else {
slog.Warn("metrics db disabled", "path", metricsDBPath, "err", err)
}
h.startMetricsCollector()
globalQueue.startWorker(&opts)
mux := http.NewServeMux()
@@ -194,19 +225,24 @@ func NewHandler(opts HandlerOptions) http.Handler {
// SAT
mux.HandleFunc("POST /api/sat/nvidia/run", h.handleAPISATRun("nvidia"))
mux.HandleFunc("POST /api/sat/nvidia-stress/run", h.handleAPISATRun("nvidia-stress"))
mux.HandleFunc("POST /api/sat/memory/run", h.handleAPISATRun("memory"))
mux.HandleFunc("POST /api/sat/storage/run", h.handleAPISATRun("storage"))
mux.HandleFunc("POST /api/sat/cpu/run", h.handleAPISATRun("cpu"))
mux.HandleFunc("POST /api/sat/amd/run", h.handleAPISATRun("amd"))
mux.HandleFunc("POST /api/sat/amd-mem/run", h.handleAPISATRun("amd-mem"))
mux.HandleFunc("POST /api/sat/amd-bandwidth/run", h.handleAPISATRun("amd-bandwidth"))
mux.HandleFunc("POST /api/sat/amd-stress/run", h.handleAPISATRun("amd-stress"))
mux.HandleFunc("POST /api/sat/memory-stress/run", h.handleAPISATRun("memory-stress"))
mux.HandleFunc("POST /api/sat/sat-stress/run", h.handleAPISATRun("sat-stress"))
mux.HandleFunc("POST /api/sat/platform-stress/run", h.handleAPISATRun("platform-stress"))
mux.HandleFunc("GET /api/sat/stream", h.handleAPISATStream)
mux.HandleFunc("POST /api/sat/abort", h.handleAPISATAbort)
// Tasks
mux.HandleFunc("GET /api/tasks", h.handleAPITasksList)
mux.HandleFunc("POST /api/tasks/cancel-all", h.handleAPITasksCancelAll)
mux.HandleFunc("POST /api/tasks/kill-workers", h.handleAPITasksKillWorkers)
mux.HandleFunc("POST /api/tasks/{id}/cancel", h.handleAPITasksCancel)
mux.HandleFunc("POST /api/tasks/{id}/priority", h.handleAPITasksPriority)
mux.HandleFunc("GET /api/tasks/{id}/stream", h.handleAPITasksStream)
@@ -225,13 +261,20 @@ func NewHandler(opts HandlerOptions) http.Handler {
// Export
mux.HandleFunc("GET /api/export/list", h.handleAPIExportList)
mux.HandleFunc("POST /api/export/bundle", h.handleAPIExportBundle)
mux.HandleFunc("GET /api/export/usb", h.handleAPIExportUSBTargets)
mux.HandleFunc("POST /api/export/usb/audit", h.handleAPIExportUSBAudit)
mux.HandleFunc("POST /api/export/usb/bundle", h.handleAPIExportUSBBundle)
// Tools
mux.HandleFunc("GET /api/tools/check", h.handleAPIToolsCheck)
// GPU presence
// Display
mux.HandleFunc("GET /api/display/resolutions", h.handleAPIDisplayResolutions)
mux.HandleFunc("POST /api/display/set", h.handleAPIDisplaySet)
// GPU presence / tools
mux.HandleFunc("GET /api/gpu/presence", h.handleAPIGPUPresence)
mux.HandleFunc("GET /api/gpu/tools", h.handleAPIGPUTools)
// System
mux.HandleFunc("GET /api/system/ram-status", h.handleAPIRAMStatus)
@@ -243,10 +286,10 @@ func NewHandler(opts HandlerOptions) http.Handler {
// Install
mux.HandleFunc("GET /api/install/disks", h.handleAPIInstallDisks)
mux.HandleFunc("POST /api/install/run", h.handleAPIInstallRun)
mux.HandleFunc("GET /api/install/stream", h.handleAPIInstallStream)
// Metrics — SSE stream of live sensor data + server-side SVG charts + CSV export
mux.HandleFunc("GET /api/metrics/stream", h.handleAPIMetricsStream)
mux.HandleFunc("GET /api/metrics/latest", h.handleAPIMetricsLatest)
mux.HandleFunc("GET /api/metrics/chart/", h.handleMetricsChartSVG)
mux.HandleFunc("GET /api/metrics/export.csv", h.handleAPIMetricsExportCSV)
@@ -260,6 +303,37 @@ func NewHandler(opts HandlerOptions) http.Handler {
return mux
}
func (h *handler) startMetricsCollector() {
go func() {
ticker := time.NewTicker(metricsCollectInterval)
defer ticker.Stop()
for range ticker.C {
sample := platform.SampleLiveMetrics()
if h.metricsDB != nil {
_ = h.metricsDB.Write(sample)
}
h.feedRings(sample)
h.setLatestMetric(sample)
}
}()
}
func (h *handler) setLatestMetric(sample platform.LiveMetricSample) {
h.latestMu.Lock()
defer h.latestMu.Unlock()
cp := sample
h.latest = &cp
}
func (h *handler) latestMetric() (platform.LiveMetricSample, bool) {
h.latestMu.RLock()
defer h.latestMu.RUnlock()
if h.latest == nil {
return platform.LiveMetricSample{}, false
}
return *h.latest, true
}
// ListenAndServe starts the HTTP server.
func ListenAndServe(addr string, opts HandlerOptions) error {
return http.ListenAndServe(addr, NewHandler(opts))
@@ -316,6 +390,7 @@ func (h *handler) handleSupportBundleDownload(w http.ResponseWriter, r *http.Req
http.Error(w, fmt.Sprintf("build support bundle: %v", err), http.StatusInternalServerError)
return
}
defer os.Remove(archive)
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/gzip")
w.Header().Set("Content-Disposition", fmt.Sprintf("attachment; filename=%q", filepath.Base(archive)))
@@ -387,207 +462,13 @@ func (h *handler) handleMetricsChartSVG(w http.ResponseWriter, r *http.Request)
path := strings.TrimPrefix(r.URL.Path, "/api/metrics/chart/")
path = strings.TrimSuffix(path, ".svg")
var datasets [][]float64
var names []string
var labels []string
var title string
var yMin, yMax *float64 // nil = auto; for load charts fixed 0-100
switch {
// ── Server sub-charts ─────────────────────────────────────────────────
case path == "server-load":
title = "CPU / Memory Load"
vCPULoad, l := h.ringCPULoad.snapshot()
vMemLoad, _ := h.ringMemLoad.snapshot()
labels = l
datasets = [][]float64{vCPULoad, vMemLoad}
names = []string{"CPU Load %", "Mem Load %"}
yMin = floatPtr(0)
yMax = floatPtr(100)
case path == "server-temp", path == "server-temp-cpu":
title = "CPU Temperature"
h.ringsMu.Lock()
datasets, names, labels = snapshotNamedRings(h.cpuTempRings)
h.ringsMu.Unlock()
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
case path == "server-temp-gpu":
title = "GPU Temperature"
h.ringsMu.Lock()
for idx, gr := range h.gpuRings {
if gr == nil {
continue
}
vTemp, l := gr.Temp.snapshot()
datasets = append(datasets, vTemp)
names = append(names, fmt.Sprintf("GPU %d", idx))
if len(labels) == 0 {
labels = l
}
}
h.ringsMu.Unlock()
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
case path == "server-temp-ambient":
title = "Ambient / Other Sensors"
h.ringsMu.Lock()
datasets, names, labels = snapshotNamedRings(h.ambientTempRings)
h.ringsMu.Unlock()
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
case path == "server-power":
title = "System Power"
vPower, l := h.ringPower.snapshot()
labels = l
datasets = [][]float64{vPower}
names = []string{"Power W"}
yMin = floatPtr(0)
yMax = autoMax120(vPower)
case path == "server-fans":
title = "Fan RPM"
h.ringsMu.Lock()
for i, fr := range h.ringFans {
fv, _ := fr.snapshot()
datasets = append(datasets, fv)
name := "Fan"
if i < len(h.fanNames) {
name = h.fanNames[i]
}
names = append(names, name+" RPM")
}
h.ringsMu.Unlock()
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
// ── Combined GPU charts (all GPUs on one chart) ───────────────────────
case path == "gpu-all-load":
title = "GPU Compute Load"
h.ringsMu.Lock()
for idx, gr := range h.gpuRings {
if gr == nil {
continue
}
vUtil, l := gr.Util.snapshot()
datasets = append(datasets, vUtil)
names = append(names, fmt.Sprintf("GPU %d", idx))
if len(labels) == 0 {
labels = l
}
}
h.ringsMu.Unlock()
yMin = floatPtr(0)
yMax = floatPtr(100)
case path == "gpu-all-memload":
title = "GPU Memory Load"
h.ringsMu.Lock()
for idx, gr := range h.gpuRings {
if gr == nil {
continue
}
vMem, l := gr.MemUtil.snapshot()
datasets = append(datasets, vMem)
names = append(names, fmt.Sprintf("GPU %d", idx))
if len(labels) == 0 {
labels = l
}
}
h.ringsMu.Unlock()
yMin = floatPtr(0)
yMax = floatPtr(100)
case path == "gpu-all-power":
title = "GPU Power"
h.ringsMu.Lock()
for idx, gr := range h.gpuRings {
if gr == nil {
continue
}
vPow, l := gr.Power.snapshot()
datasets = append(datasets, vPow)
names = append(names, fmt.Sprintf("GPU %d", idx))
if len(labels) == 0 {
labels = l
}
}
h.ringsMu.Unlock()
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
case path == "gpu-all-temp":
title = "GPU Temperature"
h.ringsMu.Lock()
for idx, gr := range h.gpuRings {
if gr == nil {
continue
}
vTemp, l := gr.Temp.snapshot()
datasets = append(datasets, vTemp)
names = append(names, fmt.Sprintf("GPU %d", idx))
if len(labels) == 0 {
labels = l
}
}
h.ringsMu.Unlock()
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
// ── Per-GPU sub-charts ────────────────────────────────────────────────
case strings.HasPrefix(path, "gpu/"):
rest := strings.TrimPrefix(path, "gpu/")
// rest is either "{idx}-load", "{idx}-temp", "{idx}-power", or legacy "{idx}"
sub := ""
if i := strings.LastIndex(rest, "-"); i > 0 {
sub = rest[i+1:]
rest = rest[:i]
}
idx := 0
fmt.Sscanf(rest, "%d", &idx)
h.ringsMu.Lock()
var gr *gpuRings
if idx < len(h.gpuRings) {
gr = h.gpuRings[idx]
}
h.ringsMu.Unlock()
if gr == nil {
http.NotFound(w, r)
return
}
switch sub {
case "load":
vUtil, l := gr.Util.snapshot()
vMemUtil, _ := gr.MemUtil.snapshot()
labels = l
title = fmt.Sprintf("GPU %d Load", idx)
datasets = [][]float64{vUtil, vMemUtil}
names = []string{"Load %", "Mem %"}
yMin = floatPtr(0)
yMax = floatPtr(100)
case "temp":
vTemp, l := gr.Temp.snapshot()
labels = l
title = fmt.Sprintf("GPU %d Temperature", idx)
datasets = [][]float64{vTemp}
names = []string{"Temp °C"}
yMin = floatPtr(0)
yMax = autoMax120(vTemp)
default: // "power" or legacy (no sub)
vPower, l := gr.Power.snapshot()
labels = l
title = fmt.Sprintf("GPU %d Power", idx)
datasets = [][]float64{vPower}
names = []string{"Power W"}
yMin = floatPtr(0)
yMax = autoMax120(vPower)
}
default:
http.NotFound(w, r)
if h.metricsDB == nil {
http.Error(w, "metrics database not available", http.StatusServiceUnavailable)
return
}
datasets, names, labels, title, yMin, yMax, ok := h.chartDataFromDB(path)
if !ok {
http.Error(w, "metrics history unavailable", http.StatusServiceUnavailable)
return
}
@@ -601,6 +482,306 @@ func (h *handler) handleMetricsChartSVG(w http.ResponseWriter, r *http.Request)
_, _ = w.Write(buf)
}
func (h *handler) chartDataFromDB(path string) ([][]float64, []string, []string, string, *float64, *float64, bool) {
samples, err := h.metricsDB.LoadAll()
if err != nil || len(samples) == 0 {
return nil, nil, nil, "", nil, nil, false
}
return chartDataFromSamples(path, samples)
}
func chartDataFromSamples(path string, samples []platform.LiveMetricSample) ([][]float64, []string, []string, string, *float64, *float64, bool) {
var datasets [][]float64
var names []string
var title string
var yMin, yMax *float64
labels := sampleTimeLabels(samples)
switch {
case path == "server-load":
title = "CPU / Memory Load"
cpu := make([]float64, len(samples))
mem := make([]float64, len(samples))
for i, s := range samples {
cpu[i] = s.CPULoadPct
mem[i] = s.MemLoadPct
}
datasets = [][]float64{cpu, mem}
names = []string{"CPU Load %", "Mem Load %"}
yMin = floatPtr(0)
yMax = floatPtr(100)
case path == "server-temp", path == "server-temp-cpu":
title = "CPU Temperature"
datasets, names = namedTempDatasets(samples, "cpu")
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
case path == "server-temp-gpu":
title = "GPU Temperature"
datasets, names = gpuDatasets(samples, func(g platform.GPUMetricRow) float64 { return g.TempC })
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
case path == "server-temp-ambient":
title = "Ambient / Other Sensors"
datasets, names = namedTempDatasets(samples, "ambient")
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
case path == "server-power":
title = "System Power"
power := make([]float64, len(samples))
for i, s := range samples {
power[i] = s.PowerW
}
power = normalizePowerSeries(power)
datasets = [][]float64{power}
names = []string{"Power W"}
yMin = floatPtr(0)
yMax = autoMax120(power)
case path == "server-fans":
title = "Fan RPM"
datasets, names = namedFanDatasets(samples)
yMin, yMax = autoBounds120(datasets...)
case path == "gpu-all-load":
title = "GPU Compute Load"
datasets, names = gpuDatasets(samples, func(g platform.GPUMetricRow) float64 { return g.UsagePct })
yMin = floatPtr(0)
yMax = floatPtr(100)
case path == "gpu-all-memload":
title = "GPU Memory Load"
datasets, names = gpuDatasets(samples, func(g platform.GPUMetricRow) float64 { return g.MemUsagePct })
yMin = floatPtr(0)
yMax = floatPtr(100)
case path == "gpu-all-power":
title = "GPU Power"
datasets, names = gpuDatasets(samples, func(g platform.GPUMetricRow) float64 { return g.PowerW })
yMin, yMax = autoBounds120(datasets...)
case path == "gpu-all-temp":
title = "GPU Temperature"
datasets, names = gpuDatasets(samples, func(g platform.GPUMetricRow) float64 { return g.TempC })
yMin = floatPtr(0)
yMax = autoMax120(datasets...)
case strings.HasPrefix(path, "gpu/"):
rest := strings.TrimPrefix(path, "gpu/")
sub := ""
if i := strings.LastIndex(rest, "-"); i > 0 {
sub = rest[i+1:]
rest = rest[:i]
}
idx := 0
fmt.Sscanf(rest, "%d", &idx)
switch sub {
case "load":
title = fmt.Sprintf("GPU %d Load", idx)
util := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.UsagePct })
mem := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.MemUsagePct })
if util == nil && mem == nil {
return nil, nil, nil, "", nil, nil, false
}
datasets = [][]float64{coalesceDataset(util, len(samples)), coalesceDataset(mem, len(samples))}
names = []string{"Load %", "Mem %"}
yMin = floatPtr(0)
yMax = floatPtr(100)
case "temp":
title = fmt.Sprintf("GPU %d Temperature", idx)
temp := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.TempC })
if temp == nil {
return nil, nil, nil, "", nil, nil, false
}
datasets = [][]float64{temp}
names = []string{"Temp °C"}
yMin = floatPtr(0)
yMax = autoMax120(temp)
default:
title = fmt.Sprintf("GPU %d Power", idx)
power := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.PowerW })
if power == nil {
return nil, nil, nil, "", nil, nil, false
}
datasets = [][]float64{power}
names = []string{"Power W"}
yMin, yMax = autoBounds120(power)
}
default:
return nil, nil, nil, "", nil, nil, false
}
return datasets, names, labels, title, yMin, yMax, len(datasets) > 0
}
func sampleTimeLabels(samples []platform.LiveMetricSample) []string {
labels := make([]string, len(samples))
if len(samples) == 0 {
return labels
}
times := make([]time.Time, len(samples))
for i, s := range samples {
times[i] = s.Timestamp
}
sameDay := timestampsSameLocalDay(times)
for i, s := range samples {
labels[i] = formatTimelineLabel(s.Timestamp.Local(), sameDay)
}
return labels
}
func namedTempDatasets(samples []platform.LiveMetricSample, group string) ([][]float64, []string) {
seen := map[string]bool{}
var names []string
for _, s := range samples {
for _, t := range s.Temps {
if t.Group == group && !seen[t.Name] {
seen[t.Name] = true
names = append(names, t.Name)
}
}
}
sort.Strings(names)
datasets := make([][]float64, 0, len(names))
for _, name := range names {
ds := make([]float64, len(samples))
for i, s := range samples {
for _, t := range s.Temps {
if t.Group == group && t.Name == name {
ds[i] = t.Celsius
break
}
}
}
datasets = append(datasets, ds)
}
return datasets, names
}
func namedFanDatasets(samples []platform.LiveMetricSample) ([][]float64, []string) {
seen := map[string]bool{}
var names []string
for _, s := range samples {
for _, f := range s.Fans {
if !seen[f.Name] {
seen[f.Name] = true
names = append(names, f.Name)
}
}
}
sort.Strings(names)
datasets := make([][]float64, 0, len(names))
for _, name := range names {
ds := make([]float64, len(samples))
for i, s := range samples {
for _, f := range s.Fans {
if f.Name == name {
ds[i] = f.RPM
break
}
}
}
datasets = append(datasets, normalizeFanSeries(ds))
}
return datasets, names
}
func gpuDatasets(samples []platform.LiveMetricSample, pick func(platform.GPUMetricRow) float64) ([][]float64, []string) {
seen := map[int]bool{}
var indices []int
for _, s := range samples {
for _, g := range s.GPUs {
if !seen[g.GPUIndex] {
seen[g.GPUIndex] = true
indices = append(indices, g.GPUIndex)
}
}
}
sort.Ints(indices)
datasets := make([][]float64, 0, len(indices))
names := make([]string, 0, len(indices))
for _, idx := range indices {
ds := gpuDatasetByIndex(samples, idx, pick)
if ds == nil {
continue
}
datasets = append(datasets, ds)
names = append(names, fmt.Sprintf("GPU %d", idx))
}
return datasets, names
}
func gpuDatasetByIndex(samples []platform.LiveMetricSample, idx int, pick func(platform.GPUMetricRow) float64) []float64 {
found := false
ds := make([]float64, len(samples))
for i, s := range samples {
for _, g := range s.GPUs {
if g.GPUIndex == idx {
ds[i] = pick(g)
found = true
break
}
}
}
if !found {
return nil
}
return ds
}
func coalesceDataset(ds []float64, n int) []float64 {
if ds != nil {
return ds
}
return make([]float64, n)
}
func normalizePowerSeries(ds []float64) []float64 {
if len(ds) == 0 {
return nil
}
out := make([]float64, len(ds))
copy(out, ds)
last := 0.0
haveLast := false
for i, v := range out {
if v > 0 {
last = v
haveLast = true
continue
}
if haveLast {
out[i] = last
}
}
return out
}
func normalizeFanSeries(ds []float64) []float64 {
if len(ds) == 0 {
return nil
}
out := make([]float64, len(ds))
var lastPositive float64
for i, v := range ds {
if v > 0 {
lastPositive = v
out[i] = v
continue
}
if lastPositive > 0 {
out[i] = lastPositive
continue
}
out[i] = 0
}
return out
}
// floatPtr returns a pointer to a float64 value.
func floatPtr(v float64) *float64 { return &v }
@@ -621,6 +802,47 @@ func autoMax120(datasets ...[]float64) *float64 {
return &v
}
func autoBounds120(datasets ...[]float64) (*float64, *float64) {
min := 0.0
max := 0.0
first := true
for _, ds := range datasets {
for _, v := range ds {
if first {
min, max = v, v
first = false
continue
}
if v < min {
min = v
}
if v > max {
max = v
}
}
}
if first {
return nil, nil
}
if max <= 0 {
return floatPtr(0), nil
}
span := max - min
if span <= 0 {
span = max * 0.1
if span <= 0 {
span = 1
}
}
pad := span * 0.2
low := min - pad
if low < 0 {
low = 0
}
high := max + pad
return floatPtr(low), floatPtr(high)
}
// renderChartSVG renders a line chart SVG with a fixed Y-axis range.
func renderChartSVG(title string, datasets [][]float64, names []string, labels []string, yMin, yMax *float64) ([]byte, error) {
n := len(labels)
@@ -651,15 +873,17 @@ func renderChartSVG(title string, datasets [][]float64, names []string, labels [
opt.Title = gocharts.TitleOption{Text: title}
opt.XAxis.Labels = sparse
opt.Legend = gocharts.LegendOption{SeriesNames: names}
if chartLegendVisible(len(names)) {
opt.Legend.Offset = gocharts.OffsetStr{Top: gocharts.PositionBottom}
opt.Legend.OverlayChart = gocharts.Ptr(false)
} else {
opt.Legend.Show = gocharts.Ptr(false)
}
opt.Symbol = gocharts.SymbolNone
// Right padding: reserve space for the MarkLine label (library recommendation).
opt.Padding = gocharts.NewBox(20, 20, 80, 20)
if yMin != nil || yMax != nil {
opt.YAxis = []gocharts.YAxisOption{{
Min: yMin,
Max: yMax,
ValueFormatter: chartLegendNumber,
}}
opt.YAxis = []gocharts.YAxisOption{chartYAxisOption(yMin, yMax)}
}
// Add a single peak mark line on the series that holds the global maximum.
@@ -671,7 +895,7 @@ func renderChartSVG(title string, datasets [][]float64, names []string, labels [
p := gocharts.NewPainter(gocharts.PainterOptions{
OutputFormat: gocharts.ChartOutputSVG,
Width: 1400,
Height: 240,
Height: chartCanvasHeight(len(names)),
}, gocharts.PainterThemeOption(gocharts.GetTheme("grafana")))
if err := p.LineChart(opt); err != nil {
return nil, err
@@ -679,6 +903,26 @@ func renderChartSVG(title string, datasets [][]float64, names []string, labels [
return p.Bytes()
}
func chartLegendVisible(seriesCount int) bool {
return seriesCount <= 8
}
func chartCanvasHeight(seriesCount int) int {
if chartLegendVisible(seriesCount) {
return 360
}
return 288
}
func chartYAxisOption(yMin, yMax *float64) gocharts.YAxisOption {
return gocharts.YAxisOption{
Min: yMin,
Max: yMax,
LabelCount: 11,
ValueFormatter: chartYAxisNumber,
}
}
// globalPeakSeries returns the index of the series containing the global maximum
// value across all datasets, and that maximum value.
func globalPeakSeries(datasets [][]float64) (idx int, peak float64) {
@@ -766,6 +1010,28 @@ func snapshotNamedRings(rings []*namedMetricsRing) ([][]float64, []string, []str
return datasets, names, labels
}
func snapshotFanRings(rings []*metricsRing, fanNames []string) ([][]float64, []string, []string) {
var datasets [][]float64
var names []string
var labels []string
for i, ring := range rings {
if ring == nil {
continue
}
vals, l := ring.snapshot()
datasets = append(datasets, normalizeFanSeries(vals))
name := "Fan"
if i < len(fanNames) {
name = fanNames[i]
}
names = append(names, name+" RPM")
if len(labels) == 0 {
labels = l
}
}
return datasets, names, labels
}
func chartLegendNumber(v float64) string {
neg := v < 0
if v < 0 {
@@ -788,6 +1054,30 @@ func chartLegendNumber(v float64) string {
return out
}
func chartYAxisNumber(v float64) string {
neg := v < 0
if neg {
v = -v
}
var out string
switch {
case v >= 10000:
out = fmt.Sprintf("%dк", int((v+500)/1000))
case v >= 1000:
// Use one decimal place so ticks like 1400, 1600, 1800 read as
// "1,4к", "1,6к", "1,8к" instead of the ambiguous "1к"/"2к".
s := fmt.Sprintf("%.1f", v/1000)
s = strings.TrimRight(strings.TrimRight(s, "0"), ".")
out = strings.ReplaceAll(s, ".", ",") + "к"
default:
out = fmt.Sprintf("%.0f", v)
}
if neg {
return "-" + out
}
return out
}
func sparseLabels(labels []string, n int) []string {
out := make([]string, len(labels))
step := len(labels) / n
@@ -868,13 +1158,6 @@ probe();
func (h *handler) handlePage(w http.ResponseWriter, r *http.Request) {
page := strings.TrimPrefix(r.URL.Path, "/")
if page == "" {
// Serve loading page until audit snapshot exists
if _, err := os.Stat(h.opts.AuditPath); err != nil {
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(loadingPageHTML))
return
}
page = "dashboard"
}
// Redirect old routes to new names

View File

@@ -7,6 +7,9 @@ import (
"path/filepath"
"strings"
"testing"
"time"
"bee/audit/internal/platform"
)
func TestChartLegendNumber(t *testing.T) {
@@ -31,6 +34,235 @@ func TestChartLegendNumber(t *testing.T) {
}
}
func TestChartDataFromSamplesUsesFullHistory(t *testing.T) {
samples := []platform.LiveMetricSample{
{
Timestamp: time.Now().Add(-3 * time.Minute),
CPULoadPct: 10,
MemLoadPct: 20,
PowerW: 300,
GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, UsagePct: 90, MemUsagePct: 5, PowerW: 120, TempC: 50},
},
},
{
Timestamp: time.Now().Add(-2 * time.Minute),
CPULoadPct: 30,
MemLoadPct: 40,
PowerW: 320,
GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, UsagePct: 95, MemUsagePct: 7, PowerW: 125, TempC: 51},
},
},
{
Timestamp: time.Now().Add(-1 * time.Minute),
CPULoadPct: 50,
MemLoadPct: 60,
PowerW: 340,
GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, UsagePct: 97, MemUsagePct: 9, PowerW: 130, TempC: 52},
},
},
}
datasets, names, labels, title, _, _, ok := chartDataFromSamples("gpu-all-power", samples)
if !ok {
t.Fatal("chartDataFromSamples returned ok=false")
}
if title != "GPU Power" {
t.Fatalf("title=%q", title)
}
if len(names) != 1 || names[0] != "GPU 0" {
t.Fatalf("names=%v", names)
}
if len(labels) != len(samples) {
t.Fatalf("labels len=%d want %d", len(labels), len(samples))
}
if len(datasets) != 1 || len(datasets[0]) != len(samples) {
t.Fatalf("datasets shape=%v", datasets)
}
if got := datasets[0][0]; got != 120 {
t.Fatalf("datasets[0][0]=%v want 120", got)
}
if got := datasets[0][2]; got != 130 {
t.Fatalf("datasets[0][2]=%v want 130", got)
}
}
func TestChartDataFromSamplesKeepsStableGPUSeriesOrder(t *testing.T) {
samples := []platform.LiveMetricSample{
{
Timestamp: time.Now().Add(-2 * time.Minute),
GPUs: []platform.GPUMetricRow{
{GPUIndex: 7, PowerW: 170},
{GPUIndex: 2, PowerW: 120},
{GPUIndex: 0, PowerW: 100},
},
},
{
Timestamp: time.Now().Add(-1 * time.Minute),
GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, PowerW: 101},
{GPUIndex: 7, PowerW: 171},
{GPUIndex: 2, PowerW: 121},
},
},
}
datasets, names, _, title, _, _, ok := chartDataFromSamples("gpu-all-power", samples)
if !ok {
t.Fatal("chartDataFromSamples returned ok=false")
}
if title != "GPU Power" {
t.Fatalf("title=%q", title)
}
wantNames := []string{"GPU 0", "GPU 2", "GPU 7"}
if len(names) != len(wantNames) {
t.Fatalf("names len=%d want %d: %v", len(names), len(wantNames), names)
}
for i := range wantNames {
if names[i] != wantNames[i] {
t.Fatalf("names[%d]=%q want %q; full=%v", i, names[i], wantNames[i], names)
}
}
if got := datasets[0]; len(got) != 2 || got[0] != 100 || got[1] != 101 {
t.Fatalf("GPU 0 dataset=%v want [100 101]", got)
}
if got := datasets[1]; len(got) != 2 || got[0] != 120 || got[1] != 121 {
t.Fatalf("GPU 2 dataset=%v want [120 121]", got)
}
if got := datasets[2]; len(got) != 2 || got[0] != 170 || got[1] != 171 {
t.Fatalf("GPU 7 dataset=%v want [170 171]", got)
}
}
func TestNormalizePowerSeriesHoldsLastPositive(t *testing.T) {
got := normalizePowerSeries([]float64{0, 480, 0, 0, 510, 0})
want := []float64{0, 480, 480, 480, 510, 510}
if len(got) != len(want) {
t.Fatalf("len=%d want %d", len(got), len(want))
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("got[%d]=%v want %v", i, got[i], want[i])
}
}
}
func TestRenderMetricsUsesBufferedChartRefresh(t *testing.T) {
body := renderMetrics()
if !strings.Contains(body, "const probe = new Image();") {
t.Fatalf("metrics page should preload chart images before swap: %s", body)
}
if !strings.Contains(body, "el.dataset.loading === '1'") {
t.Fatalf("metrics page should avoid overlapping chart reloads: %s", body)
}
}
func TestChartLegendVisible(t *testing.T) {
if !chartLegendVisible(8) {
t.Fatal("legend should stay visible for charts with up to 8 series")
}
if chartLegendVisible(9) {
t.Fatal("legend should be hidden for charts with more than 8 series")
}
}
func TestChartYAxisNumber(t *testing.T) {
tests := []struct {
in float64
want string
}{
{in: 999, want: "999"},
{in: 1000, want: "1к"},
{in: 1370, want: "1,4к"},
{in: 1500, want: "1,5к"},
{in: 1700, want: "1,7к"},
{in: 2000, want: "2к"},
{in: 9999, want: "10к"},
{in: 10200, want: "10к"},
{in: -1500, want: "-1,5к"},
}
for _, tc := range tests {
if got := chartYAxisNumber(tc.in); got != tc.want {
t.Fatalf("chartYAxisNumber(%v)=%q want %q", tc.in, got, tc.want)
}
}
}
func TestChartCanvasHeight(t *testing.T) {
if got := chartCanvasHeight(4); got != 360 {
t.Fatalf("chartCanvasHeight(4)=%d want 360", got)
}
if got := chartCanvasHeight(12); got != 288 {
t.Fatalf("chartCanvasHeight(12)=%d want 288", got)
}
}
func TestNormalizeFanSeriesHoldsLastPositive(t *testing.T) {
got := normalizeFanSeries([]float64{4200, 0, 0, 4300, 0})
want := []float64{4200, 4200, 4200, 4300, 4300}
if len(got) != len(want) {
t.Fatalf("len=%d want %d", len(got), len(want))
}
for i := range want {
if got[i] != want[i] {
t.Fatalf("got[%d]=%v want %v", i, got[i], want[i])
}
}
}
func TestChartYAxisOption(t *testing.T) {
min := floatPtr(0)
max := floatPtr(100)
opt := chartYAxisOption(min, max)
if opt.Min != min || opt.Max != max {
t.Fatalf("chartYAxisOption min/max mismatch: %#v", opt)
}
if opt.LabelCount != 11 {
t.Fatalf("chartYAxisOption labelCount=%d want 11", opt.LabelCount)
}
if got := opt.ValueFormatter(1000); got != "1к" {
t.Fatalf("chartYAxisOption formatter(1000)=%q want 1к", got)
}
}
func TestSnapshotFanRingsUsesTimelineLabels(t *testing.T) {
r1 := newMetricsRing(4)
r2 := newMetricsRing(4)
r1.push(1000)
r1.push(1100)
r2.push(1200)
r2.push(1300)
datasets, names, labels := snapshotFanRings([]*metricsRing{r1, r2}, []string{"FAN_A", "FAN_B"})
if len(datasets) != 2 {
t.Fatalf("datasets=%d want 2", len(datasets))
}
if len(names) != 2 || names[0] != "FAN_A RPM" || names[1] != "FAN_B RPM" {
t.Fatalf("names=%v", names)
}
if len(labels) != 2 {
t.Fatalf("labels=%v want 2 entries", labels)
}
if labels[0] == "" || labels[1] == "" {
t.Fatalf("labels should contain timeline values, got %v", labels)
}
}
func TestRenderNetworkInlineSyncsPendingState(t *testing.T) {
body := renderNetworkInline()
if !strings.Contains(body, "d.pending_change") {
t.Fatalf("network UI should read pending network state from API: %s", body)
}
if !strings.Contains(body, "setInterval(loadNetwork, 5000)") {
t.Fatalf("network UI should periodically refresh network state: %s", body)
}
if !strings.Contains(body, "showNetPending(NET_ROLLBACK_SECS)") {
t.Fatalf("network UI should show pending confirmation immediately on apply: %s", body)
}
}
func TestRootRendersDashboard(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
@@ -78,6 +310,33 @@ func TestRootRendersDashboard(t *testing.T) {
}
}
func TestRootShowsRunAuditButtonWhenSnapshotMissing(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{
Title: "Bee Hardware Audit",
AuditPath: filepath.Join(dir, "missing-audit.json"),
ExportDir: exportDir,
})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
if !strings.Contains(body, `Run Audit`) {
t.Fatalf("dashboard missing run audit button: %s", body)
}
if strings.Contains(body, `No audit data`) {
t.Fatalf("dashboard still shows empty audit badge: %s", body)
}
}
func TestAuditPageRendersViewerFrameAndActions(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
@@ -174,6 +433,17 @@ func TestSupportBundleEndpointReturnsArchive(t *testing.T) {
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil {
t.Fatal(err)
}
archive, err := os.CreateTemp(os.TempDir(), "bee-support-server-test-*.tar.gz")
if err != nil {
t.Fatal(err)
}
t.Cleanup(func() { _ = os.Remove(archive.Name()) })
if _, err := archive.WriteString("support-bundle"); err != nil {
t.Fatal(err)
}
if err := archive.Close(); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()

View File

@@ -6,12 +6,15 @@ import (
"fmt"
"net/http"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
"sync"
"time"
"bee/audit/internal/app"
"bee/audit/internal/platform"
)
// Task statuses.
@@ -23,33 +26,74 @@ const (
TaskCancelled = "cancelled"
)
// taskNames maps target → human-readable name.
// taskNames maps target → human-readable name for validate (SAT) runs.
var taskNames = map[string]string{
"nvidia": "NVIDIA SAT",
"memory": "Memory SAT",
"storage": "Storage SAT",
"cpu": "CPU SAT",
"amd": "AMD GPU SAT",
"amd-stress": "AMD GPU Burn-in",
"memory-stress": "Memory Burn-in",
"sat-stress": "SAT Stress (stressapptest)",
"audit": "Audit",
"install": "Install to Disk",
"install-to-ram": "Install to RAM",
"nvidia": "NVIDIA SAT",
"nvidia-stress": "NVIDIA GPU Stress",
"memory": "Memory SAT",
"storage": "Storage SAT",
"cpu": "CPU SAT",
"amd": "AMD GPU SAT",
"amd-mem": "AMD GPU MEM Integrity",
"amd-bandwidth": "AMD GPU MEM Bandwidth",
"amd-stress": "AMD GPU Burn-in",
"memory-stress": "Memory Burn-in",
"sat-stress": "SAT Stress (stressapptest)",
"platform-stress": "Platform Thermal Cycling",
"audit": "Audit",
"support-bundle": "Support Bundle",
"install": "Install to Disk",
"install-to-ram": "Install to RAM",
}
// burnNames maps target → human-readable name when a burn profile is set.
var burnNames = map[string]string{
"nvidia": "NVIDIA Burn-in",
"memory": "Memory Burn-in",
"cpu": "CPU Burn-in",
"amd": "AMD GPU Burn-in",
}
func nvidiaStressTaskName(loader string) string {
switch strings.TrimSpace(strings.ToLower(loader)) {
case platform.NvidiaStressLoaderJohn:
return "NVIDIA GPU Stress (John/OpenCL)"
case platform.NvidiaStressLoaderNCCL:
return "NVIDIA GPU Stress (NCCL)"
default:
return "NVIDIA GPU Stress (bee-gpu-burn)"
}
}
func taskDisplayName(target, profile, loader string) string {
name := taskNames[target]
if profile != "" {
if n, ok := burnNames[target]; ok {
name = n
}
}
if target == "nvidia-stress" {
name = nvidiaStressTaskName(loader)
}
if name == "" {
name = target
}
return name
}
// Task represents one unit of work in the queue.
type Task struct {
ID string `json:"id"`
Name string `json:"name"`
Target string `json:"target"`
Priority int `json:"priority"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
StartedAt *time.Time `json:"started_at,omitempty"`
DoneAt *time.Time `json:"done_at,omitempty"`
ErrMsg string `json:"error,omitempty"`
LogPath string `json:"log_path,omitempty"`
ID string `json:"id"`
Name string `json:"name"`
Target string `json:"target"`
Priority int `json:"priority"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
StartedAt *time.Time `json:"started_at,omitempty"`
DoneAt *time.Time `json:"done_at,omitempty"`
ElapsedSec int `json:"elapsed_sec,omitempty"`
ErrMsg string `json:"error,omitempty"`
LogPath string `json:"log_path,omitempty"`
// runtime fields (not serialised)
job *jobState
@@ -58,12 +102,15 @@ type Task struct {
// taskParams holds optional parameters parsed from the run request.
type taskParams struct {
Duration int `json:"duration,omitempty"`
DiagLevel int `json:"diag_level,omitempty"`
GPUIndices []int `json:"gpu_indices,omitempty"`
BurnProfile string `json:"burn_profile,omitempty"`
DisplayName string `json:"display_name,omitempty"`
Device string `json:"device,omitempty"` // for install
Duration int `json:"duration,omitempty"`
DiagLevel int `json:"diag_level,omitempty"`
GPUIndices []int `json:"gpu_indices,omitempty"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices,omitempty"`
Loader string `json:"loader,omitempty"`
BurnProfile string `json:"burn_profile,omitempty"`
DisplayName string `json:"display_name,omitempty"`
Device string `json:"device,omitempty"` // for install
PlatformComponents []string `json:"platform_components,omitempty"`
}
type persistedTask struct {
@@ -96,6 +143,34 @@ func resolveBurnPreset(profile string) burnPreset {
}
}
func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions {
switch profile {
case "overnight":
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{
{LoadSec: 600, IdleSec: 120},
{LoadSec: 600, IdleSec: 60},
{LoadSec: 600, IdleSec: 30},
{LoadSec: 600, IdleSec: 120},
{LoadSec: 600, IdleSec: 60},
{LoadSec: 600, IdleSec: 30},
{LoadSec: 600, IdleSec: 120},
{LoadSec: 600, IdleSec: 60},
}}
case "acceptance":
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{
{LoadSec: 300, IdleSec: 60},
{LoadSec: 300, IdleSec: 30},
{LoadSec: 300, IdleSec: 60},
{LoadSec: 300, IdleSec: 30},
}}
default: // smoke
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{
{LoadSec: 90, IdleSec: 60},
{LoadSec: 90, IdleSec: 30},
}}
}
}
// taskQueue manages a priority-ordered list of tasks and runs them one at a time.
type taskQueue struct {
mu sync.Mutex
@@ -124,6 +199,15 @@ var (
runAMDAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunAMDAcceptancePackCtx(ctx, baseDir, logFunc)
}
runAMDMemIntegrityPackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunAMDMemIntegrityPackCtx(ctx, baseDir, logFunc)
}
runAMDMemBandwidthPackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunAMDMemBandwidthPackCtx(ctx, baseDir, logFunc)
}
runNvidiaStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
return a.RunNvidiaStressPackCtx(ctx, baseDir, opts, logFunc)
}
runAMDStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunAMDStressPackCtx(ctx, baseDir, durationSec, logFunc)
}
@@ -133,6 +217,10 @@ var (
runSATStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunSATStressPackCtx(ctx, baseDir, durationSec, logFunc)
}
buildSupportBundle = app.BuildSupportBundle
installCommand = func(ctx context.Context, device string, logPath string) *exec.Cmd {
return exec.CommandContext(ctx, "bee-install", device, logPath)
}
)
// enqueue adds a task to the queue and notifies the worker.
@@ -224,6 +312,7 @@ func (q *taskQueue) snapshot() []Task {
out := make([]Task, len(q.tasks))
for i, t := range q.tasks {
out[i] = *t
out[i].ElapsedSec = taskElapsedSec(&out[i], time.Now())
}
sort.SliceStable(out, func(i, j int) bool {
si := statusOrder(out[i].Status)
@@ -330,9 +419,9 @@ func setCPUGovernor(governor string) {
// runTask executes the work for a task, writing output to j.
func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if q.opts == nil || q.opts.App == nil {
j.append("ERROR: app not configured")
j.finish("app not configured")
if q.opts == nil {
j.append("ERROR: handler options not configured")
j.finish("handler options not configured")
return
}
a := q.opts.App
@@ -349,6 +438,10 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
switch t.Target {
case "nvidia":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
diagLevel := t.params.DiagLevel
if t.params.BurnProfile != "" && diagLevel <= 0 {
diagLevel = resolveBurnPreset(t.params.BurnProfile).NvidiaDiag
@@ -365,11 +458,38 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
} else {
archive, err = a.RunNvidiaAcceptancePack("", j.append)
}
case "nvidia-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{
DurationSec: dur,
Loader: t.params.Loader,
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
}, j.append)
case "memory":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runMemoryAcceptancePackCtx(a, ctx, "", j.append)
case "storage":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runStorageAcceptancePackCtx(a, ctx, "", j.append)
case "cpu":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
@@ -377,28 +497,69 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if dur <= 0 {
dur = 60
}
j.append(fmt.Sprintf("CPU stress duration: %ds", dur))
archive, err = runCPUAcceptancePackCtx(a, ctx, "", dur, j.append)
case "amd":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDAcceptancePackCtx(a, ctx, "", j.append)
case "amd-mem":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDMemIntegrityPackCtx(a, ctx, "", j.append)
case "amd-bandwidth":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDMemBandwidthPackCtx(a, ctx, "", j.append)
case "amd-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runAMDStressPackCtx(a, ctx, "", dur, j.append)
case "memory-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runMemoryStressPackCtx(a, ctx, "", dur, j.append)
case "sat-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runSATStressPackCtx(a, ctx, "", dur, j.append)
case "platform-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
opts := resolvePlatformStressPreset(t.params.BurnProfile)
opts.Components = t.params.PlatformComponents
archive, err = a.RunPlatformStress(ctx, "", opts, j.append)
case "audit":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
result, e := a.RunAuditNow(q.opts.RuntimeMode)
if e != nil {
err = e
@@ -407,7 +568,22 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
j.append(line)
}
}
case "support-bundle":
j.append("Building support bundle...")
archive, err = buildSupportBundle(q.opts.ExportDir)
case "install":
if strings.TrimSpace(t.params.Device) == "" {
err = fmt.Errorf("device is required")
break
}
installLogPath := platform.InstallLogPath(t.params.Device)
j.append("Install log: " + installLogPath)
err = streamCmdJob(j, installCommand(ctx, t.params.Device, installLogPath))
case "install-to-ram":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
err = a.RunInstallToRAM(ctx, j.append)
default:
j.append("ERROR: unknown target: " + t.Target)
@@ -540,6 +716,38 @@ func (h *handler) handleAPITasksCancelAll(w http.ResponseWriter, _ *http.Request
writeJSON(w, map[string]int{"cancelled": n})
}
func (h *handler) handleAPITasksKillWorkers(w http.ResponseWriter, _ *http.Request) {
// Cancel all queued/running tasks in the queue first.
globalQueue.mu.Lock()
now := time.Now()
cancelled := 0
for _, t := range globalQueue.tasks {
switch t.Status {
case TaskPending:
t.Status = TaskCancelled
t.DoneAt = &now
cancelled++
case TaskRunning:
if t.job != nil {
t.job.abort()
}
t.Status = TaskCancelled
t.DoneAt = &now
cancelled++
}
}
globalQueue.persistLocked()
globalQueue.mu.Unlock()
// Kill orphaned test worker processes at the OS level.
killed := platform.KillTestWorkers()
writeJSON(w, map[string]any{
"cancelled": cancelled,
"killed": len(killed),
"processes": killed,
})
}
func (h *handler) handleAPITasksStream(w http.ResponseWriter, r *http.Request) {
id := r.PathValue("id")
// Wait up to 5s for the task to get a job (it may be pending)
@@ -593,8 +801,18 @@ func (q *taskQueue) loadLocked() {
params: pt.Params,
}
q.assignTaskLogPathLocked(t)
if t.Status == TaskPending || t.Status == TaskRunning {
t.Status = TaskPending
if t.Status == TaskRunning {
// The task was interrupted by a bee-web restart. Child processes
// (e.g. bee-gpu-burn-worker) survive the restart in their own
// process groups and cannot be cancelled retroactively. Mark the
// task as failed so the user can decide whether to re-run it
// rather than blindly re-launching duplicate workers.
now := time.Now()
t.Status = TaskFailed
t.DoneAt = &now
t.ErrMsg = "interrupted by bee-web restart"
} else if t.Status == TaskPending {
t.StartedAt = nil
t.DoneAt = nil
t.ErrMsg = ""
}
@@ -634,3 +852,21 @@ func (q *taskQueue) persistLocked() {
}
_ = os.Rename(tmp, q.statePath)
}
func taskElapsedSec(t *Task, now time.Time) int {
if t == nil || t.StartedAt == nil || t.StartedAt.IsZero() {
return 0
}
start := *t.StartedAt
if !t.CreatedAt.IsZero() && start.Before(t.CreatedAt) {
start = t.CreatedAt
}
end := now
if t.DoneAt != nil && !t.DoneAt.IsZero() {
end = *t.DoneAt
}
if end.Before(start) {
return 0
}
return int(end.Sub(start).Round(time.Second) / time.Second)
}

View File

@@ -3,7 +3,9 @@ package webui
import (
"context"
"os"
"os/exec"
"path/filepath"
"strings"
"testing"
"time"
@@ -22,21 +24,34 @@ func TestTaskQueuePersistsAndRecoversPendingTasks(t *testing.T) {
}
started := time.Now().Add(-time.Minute)
task := &Task{
ID: "task-1",
// A task that was pending (not yet started) must be re-queued on restart.
pendingTask := &Task{
ID: "task-pending",
Name: "Memory Burn-in",
Target: "memory-stress",
Priority: 2,
Status: TaskRunning,
Status: TaskPending,
CreatedAt: time.Now().Add(-2 * time.Minute),
StartedAt: &started,
params: taskParams{
Duration: 300,
BurnProfile: "smoke",
},
params: taskParams{Duration: 300, BurnProfile: "smoke"},
}
// A task that was running when bee-web crashed must NOT be re-queued —
// its child processes (e.g. gpu-burn-worker) survive the restart in
// their own process groups and can't be cancelled retroactively.
runningTask := &Task{
ID: "task-running",
Name: "NVIDIA GPU Stress",
Target: "nvidia-stress",
Priority: 1,
Status: TaskRunning,
CreatedAt: time.Now().Add(-3 * time.Minute),
StartedAt: &started,
params: taskParams{Duration: 86400},
}
for _, task := range []*Task{pendingTask, runningTask} {
q.tasks = append(q.tasks, task)
q.assignTaskLogPathLocked(task)
}
q.tasks = append(q.tasks, task)
q.assignTaskLogPathLocked(task)
q.persistLocked()
recovered := &taskQueue{
@@ -46,18 +61,47 @@ func TestTaskQueuePersistsAndRecoversPendingTasks(t *testing.T) {
}
recovered.loadLocked()
if len(recovered.tasks) != 1 {
t.Fatalf("tasks=%d want 1", len(recovered.tasks))
if len(recovered.tasks) != 2 {
t.Fatalf("tasks=%d want 2", len(recovered.tasks))
}
got := recovered.tasks[0]
if got.Status != TaskPending {
t.Fatalf("status=%q want %q", got.Status, TaskPending)
byID := map[string]*Task{}
for i := range recovered.tasks {
byID[recovered.tasks[i].ID] = recovered.tasks[i]
}
if got.params.Duration != 300 || got.params.BurnProfile != "smoke" {
t.Fatalf("params=%+v", got.params)
// Pending task must be re-queued as pending with params intact.
p := byID["task-pending"]
if p == nil {
t.Fatal("task-pending not found")
}
if got.LogPath == "" {
t.Fatal("expected log path")
if p.Status != TaskPending {
t.Fatalf("pending task: status=%q want %q", p.Status, TaskPending)
}
if p.StartedAt != nil {
t.Fatalf("pending task: started_at=%v want nil", p.StartedAt)
}
if p.params.Duration != 300 || p.params.BurnProfile != "smoke" {
t.Fatalf("pending task: params=%+v", p.params)
}
if p.LogPath == "" {
t.Fatal("pending task: expected log path")
}
// Running task must be marked failed, not re-queued, to prevent
// launching duplicate workers (e.g. a second set of gpu-burn-workers).
r := byID["task-running"]
if r == nil {
t.Fatal("task-running not found")
}
if r.Status != TaskFailed {
t.Fatalf("running task: status=%q want %q", r.Status, TaskFailed)
}
if r.ErrMsg == "" {
t.Fatal("running task: expected non-empty error message")
}
if r.DoneAt == nil {
t.Fatal("running task: expected done_at to be set")
}
}
@@ -95,9 +139,24 @@ func TestResolveBurnPreset(t *testing.T) {
}
}
func TestRunTaskHonorsCancel(t *testing.T) {
t.Parallel()
func TestTaskDisplayNameUsesNvidiaStressLoader(t *testing.T) {
tests := []struct {
loader string
want string
}{
{loader: "", want: "NVIDIA GPU Stress (bee-gpu-burn)"},
{loader: "builtin", want: "NVIDIA GPU Stress (bee-gpu-burn)"},
{loader: "john", want: "NVIDIA GPU Stress (John/OpenCL)"},
{loader: "nccl", want: "NVIDIA GPU Stress (NCCL)"},
}
for _, tc := range tests {
if got := taskDisplayName("nvidia-stress", "acceptance", tc.loader); got != tc.want {
t.Fatalf("taskDisplayName(loader=%q)=%q want %q", tc.loader, got, tc.want)
}
}
}
func TestRunTaskHonorsCancel(t *testing.T) {
blocked := make(chan struct{})
released := make(chan struct{})
aRun := func(_ any, ctx context.Context, _ string, _ int, _ func(string)) (string, error) {
@@ -154,3 +213,131 @@ func TestRunTaskHonorsCancel(t *testing.T) {
t.Fatal("runTask did not return after cancel")
}
}
func TestRunTaskUsesBurnProfileDurationForCPU(t *testing.T) {
var gotDuration int
q := &taskQueue{
opts: &HandlerOptions{App: &app.App{}},
}
tk := &Task{
ID: "cpu-burn-1",
Name: "CPU Burn-in",
Target: "cpu",
Status: TaskRunning,
CreatedAt: time.Now(),
params: taskParams{BurnProfile: "smoke"},
}
j := &jobState{}
orig := runCPUAcceptancePackCtx
runCPUAcceptancePackCtx = func(_ *app.App, _ context.Context, _ string, durationSec int, _ func(string)) (string, error) {
gotDuration = durationSec
return "/tmp/cpu-burn.tar.gz", nil
}
defer func() { runCPUAcceptancePackCtx = orig }()
q.runTask(tk, j, context.Background())
if gotDuration != 5*60 {
t.Fatalf("duration=%d want %d", gotDuration, 5*60)
}
}
func TestRunTaskBuildsSupportBundleWithoutApp(t *testing.T) {
dir := t.TempDir()
q := &taskQueue{
opts: &HandlerOptions{ExportDir: dir},
}
tk := &Task{
ID: "support-bundle-1",
Name: "Support Bundle",
Target: "support-bundle",
Status: TaskRunning,
CreatedAt: time.Now(),
}
j := &jobState{}
var gotExportDir string
orig := buildSupportBundle
buildSupportBundle = func(exportDir string) (string, error) {
gotExportDir = exportDir
return filepath.Join(exportDir, "bundle.tar.gz"), nil
}
defer func() { buildSupportBundle = orig }()
q.runTask(tk, j, context.Background())
if gotExportDir != dir {
t.Fatalf("exportDir=%q want %q", gotExportDir, dir)
}
if j.err != "" {
t.Fatalf("unexpected error: %q", j.err)
}
if !strings.Contains(strings.Join(j.lines, "\n"), "Archive: "+filepath.Join(dir, "bundle.tar.gz")) {
t.Fatalf("lines=%v", j.lines)
}
}
func TestTaskElapsedSecClampsInvalidStartedAt(t *testing.T) {
now := time.Date(2026, 4, 1, 19, 10, 0, 0, time.UTC)
created := time.Date(2026, 4, 1, 19, 4, 5, 0, time.UTC)
started := time.Time{}
task := &Task{
Status: TaskRunning,
CreatedAt: created,
StartedAt: &started,
}
if got := taskElapsedSec(task, now); got != 0 {
t.Fatalf("taskElapsedSec(zero start)=%d want 0", got)
}
stale := created.Add(-24 * time.Hour)
task.StartedAt = &stale
if got := taskElapsedSec(task, now); got != int(now.Sub(created).Seconds()) {
t.Fatalf("taskElapsedSec(stale start)=%d want %d", got, int(now.Sub(created).Seconds()))
}
}
func TestRunTaskInstallUsesSharedCommandStreaming(t *testing.T) {
q := &taskQueue{
opts: &HandlerOptions{},
}
tk := &Task{
ID: "install-1",
Name: "Install to Disk",
Target: "install",
Status: TaskRunning,
CreatedAt: time.Now(),
params: taskParams{Device: "/dev/sda"},
}
j := &jobState{}
var gotDevice string
var gotLogPath string
orig := installCommand
installCommand = func(ctx context.Context, device string, logPath string) *exec.Cmd {
gotDevice = device
gotLogPath = logPath
return exec.CommandContext(ctx, "sh", "-c", "printf 'line1\nline2\n'")
}
defer func() { installCommand = orig }()
q.runTask(tk, j, context.Background())
if gotDevice != "/dev/sda" {
t.Fatalf("device=%q want /dev/sda", gotDevice)
}
if gotLogPath == "" {
t.Fatal("expected install log path")
}
logs := strings.Join(j.lines, "\n")
if !strings.Contains(logs, "Install log: ") {
t.Fatalf("missing install log line: %v", j.lines)
}
if !strings.Contains(logs, "line1") || !strings.Contains(logs, "line2") {
t.Fatalf("missing streamed output: %v", j.lines)
}
if j.err != "" {
t.Fatalf("unexpected error: %q", j.err)
}
}

2
bible

Submodule bible updated: 456c1f022c...688b87e98d

View File

@@ -9,6 +9,34 @@ All live metrics charts in the web UI are server-side SVG images served by Go
and polled by the browser every 2 seconds via `<img src="...?t=now">`.
There is no client-side canvas or JS chart library.
## Rule: live charts must be visually uniform
Live charts are a single UI family, not a set of one-off widgets. New charts and
changes to existing charts must keep the same rendering model and presentation
rules unless there is an explicit architectural decision to diverge.
Default expectations:
- same server-side SVG pipeline for all live metrics charts
- same refresh behaviour and failure handling in the browser
- same canvas size class and card layout
- same legend placement policy across charts
- same axis, title, and summary conventions
- no chart-specific visual exceptions added as a quick fix
Current default for live charts:
- legend below the plot area when a chart has 8 series or fewer
- legend hidden when a chart has more than 8 series
- 10 equal Y-axis steps across the chart height
- 1400 x 360 SVG canvas with legend
- 1400 x 288 SVG canvas without legend
- full-width card rendering in a single-column stack
If one chart needs a different layout or legend behaviour, treat that as a
design-level decision affecting the whole chart family, not as a local tweak to
just one endpoint.
### Why go-analyze/charts
- Pure Go, no CGO — builds cleanly inside the live-build container
@@ -29,7 +57,8 @@ self-contained SVG renderer used **only** for completed SAT run reports
| `GET /api/metrics/chart/server.svg` | CPU temp, CPU load %, mem load %, power W, fan RPMs |
| `GET /api/metrics/chart/gpu/{idx}.svg` | GPU temp °C, load %, mem %, power W |
Charts are 1400 × 280 px SVG. The page renders them at `width: 100%` in a
Charts are 1400 × 360 px SVG when the legend is shown, and 1400 × 288 px when
the legend is hidden. The page renders them at `width: 100%` in a
single-column layout so they always fill the viewport width.
### Ring buffers

View File

@@ -60,6 +60,8 @@ Rules:
- Chromium opens `http://localhost/` — the full interactive web UI
- SSH is independent from the desktop path
- serial console support is enabled for VM boot debugging
- Default boot keeps the server-safe graphics path (`nomodeset` + forced `fbdev`) for IPMI/BMC consoles
- Higher-resolution mode selection is expected only when booting through an explicit `bee.display=kms` menu entry, which disables the forced `fbdev` Xorg config before `lightdm`
## ISO build sequence
@@ -81,9 +83,9 @@ build-in-container.sh [--authorized-keys /path/to/keys]
7. `build-cublas.sh`:
a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo
b. verify packages against repo `Packages.gz`
c. extract headers for `bee-gpu-stress` build
c. extract headers for `bee-gpu-burn` worker build
d. cache userspace libs in `dist/cublas-<version>+cuda<series>/`
8. build `bee-gpu-stress` against extracted cuBLASLt/cudart headers
8. build `bee-gpu-burn` worker against extracted cuBLASLt/cudart headers
9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/`
@@ -104,7 +106,7 @@ Build host notes:
1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
2. `auto/config``linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- `bee-gpu-stress` must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers.
- `bee-gpu-burn` worker must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers.
- The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
@@ -153,18 +155,17 @@ Current validation state:
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
Acceptance flows:
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + mixed-precision `bee-gpu-stress`
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-burn`
- NVIDIA GPU burn-in can use either `bee-gpu-burn` or `bee-john-gpu-stress` (John the Ripper jumbo via OpenCL)
- `bee sat memory``memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- `bee-gpu-stress` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
- `bee-gpu-burn` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
- Ampere: `fp16` + `fp32`/TF32 tensor-core load
- Ada / Hopper: add `fp8`
- Blackwell+: add `fp4`
- PTX fallback is only for missing cuBLASLt/userspace or unsupported narrow datatypes
- Runtime overrides:
- `BEE_GPU_STRESS_SECONDS`
- `BEE_GPU_STRESS_SIZE_MB`
- `BEE_MEMTESTER_SIZE_MB`
- `BEE_MEMTESTER_PASSES`
@@ -179,6 +180,6 @@ Web UI: Acceptance Tests page → Run Test button
```
**Critical invariants:**
- `bee-gpu-stress` uses `exec.CommandContext` — killed on job context cancel.
- `bee-gpu-burn` / `bee-john-gpu-stress` use `exec.CommandContext` — killed on job context cancel.
- Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed).
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG.

View File

@@ -21,8 +21,8 @@ Fills gaps where Redfish/logpile is blind:
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
- Machine-readable health summary derived from collector verdicts
- Operator-triggered acceptance tests for NVIDIA, memory, and storage
- NVIDIA SAT includes both diagnostic collection and mixed-precision GPU stress via `bee-gpu-stress`
- `bee-gpu-stress` should exercise tensor/inference paths (`fp16`, `fp32`/TF32, `fp8`, `fp4` when supported by the GPU/userspace stack) and fall back to Driver API PTX burn only if cuBLASLt is unavailable
- NVIDIA SAT includes diagnostic collection plus a lightweight in-image GPU stress step via `bee-gpu-burn`
- `bee-gpu-burn` should exercise tensor/inference paths (`fp16`, `fp32`/TF32, `fp8`, `fp4` when supported by the GPU/userspace stack) and fall back to Driver API PTX burn only if cuBLASLt is unavailable
- Automatic boot audit with operator-facing local console and SSH access
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
- SSH access (OpenSSH) always available for inspection and debugging
@@ -70,7 +70,7 @@ Fills gaps where Redfish/logpile is blind:
| SSH | OpenSSH server |
| NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
| GPU stress backend | `bee-gpu-stress` + cuBLASLt/cuBLAS/cudart mixed-precision GEMM, with Driver API PTX fallback |
| GPU stress backend | `bee-gpu-burn` + cuBLASLt/cuBLAS/cudart mixed-precision GEMM, with Driver API PTX fallback |
| Builder | Debian 12 host/VM or Debian 12 container image |
## Operator UX

View File

@@ -18,6 +18,8 @@ Use the official proprietary NVIDIA `.run` installer for both kernel modules and
- Kernel modules and nvidia-smi come from a single verified source.
- NVIDIA publishes `.sha256sum` alongside each installer — download and verify before use.
- Driver version pinned in `iso/builder/VERSIONS` as `NVIDIA_DRIVER_VERSION`.
- DCGM must track the CUDA user-mode driver major version exposed by `nvidia-smi`.
- For NVIDIA driver branch `590` with CUDA `13.x`, use DCGM 4 package family `datacenter-gpu-manager-4-cuda13`; legacy `datacenter-gpu-manager` 3.x does not provide a working path for this stack.
- Build process: download `.run`, extract, compile `kernel/` sources against `linux-lts-dev`.
- Modules cached in `dist/nvidia-<version>-<kver>/` — rebuild only on version or kernel change.
- ISO size increases by ~50MB for .ko files + nvidia-smi.

View File

@@ -0,0 +1,224 @@
# Decision: Treat memtest as explicit ISO content, not as trusted live-build magic
**Date:** 2026-04-01
**Status:** resolved
## Context
We have already iterated on `memtest` multiple times and kept cycling between the same ideas.
The commit history shows several distinct attempts:
- `f91bce8` — fixed Bookworm memtest file names to `memtest86+x64.bin` / `memtest86+x64.efi`
- `5857805` — added a binary hook to copy memtest files from the build tree into the ISO root
- `f96b149` — added fallback extraction from the cached `.deb` when `chroot/boot/` stayed empty
- `d43a9ae` — removed the custom hook and switched back to live-build built-in memtest integration
- `60cb8f8` — restored explicit memtest menu entries and added ISO validation
- `3dbc218` / `3869788` — added archived build logs and better memtest diagnostics
Current evidence from the archived `easy-bee-nvidia-v3.14-amd64` logs dated 2026-04-01:
- `lb binary_memtest` does run and installs `memtest86+`
- but the final ISO still does **not** contain `boot/memtest86+x64.bin`
- the final ISO also does **not** contain memtest menu entries in `boot/grub/grub.cfg` or `isolinux/live.cfg`
So the assumption "live-build built-in memtest integration is enough on this stack" is currently false for this project until proven otherwise by a real built ISO.
Additional evidence from the archived `easy-bee-nvidia-v3.17-dirty-amd64` logs dated 2026-04-01:
- the build now completes successfully because memtest is non-blocking by default
- `lb binary_memtest` still runs and installs `memtest86+`
- the project-owned hook `config/hooks/normal/9100-memtest.hook.binary` does execute
- but it executes too early for its current target paths:
- `binary/boot/grub/grub.cfg` is still missing at hook time
- `binary/isolinux/live.cfg` is still missing at hook time
- memtest binaries are also still absent in `binary/boot/`
- later in the build, live-build does create intermediate bootloader configs with memtest lines in the workdir
- but the final ISO still lacks memtest binaries and still lacks memtest lines in extracted ISO `boot/grub/grub.cfg` and `isolinux/live.cfg`
So the assumption "the current normal binary hook path is late enough to patch final memtest artifacts" is also false.
Correction after inspecting the real `easy-bee-nvidia-v3.20-5-g76a9100-amd64.iso`
artifact dated 2026-04-01:
- the final ISO does contain `boot/memtest86+x64.bin`
- the final ISO does contain `boot/memtest86+x64.efi`
- the final ISO does contain memtest menu entries in both `boot/grub/grub.cfg`
and `isolinux/live.cfg`
- so `v3.20-5-g76a9100` was **not** another real memtest regression in the
shipped ISO
- the regression was in the build-time validator/debug path in `build.sh`
Root cause of the false alarm:
- `build.sh` treated "ISO reader command exists" as equivalent to "ISO reader
successfully listed/extracted members"
- `iso_list_files` / `iso_extract_file` failures were collapsed into the same
observable output as "memtest content missing"
- this made a reader failure look identical to a missing memtest payload
- as a result, we re-entered the same memtest investigation loop even though
the real ISO was already correct
Additional correction from the subsequent `v3.21` build logs dated 2026-04-01:
- once ISO reading was fixed, the post-build debug correctly showed the raw ISO
still carried live-build's default memtest layout (`live/memtest.bin`,
`live/memtest.efi`, `boot/grub/memtest.cfg`, `isolinux/memtest.cfg`)
- that mismatch is expected to trigger project recovery, because `bee` requires
`boot/memtest86+x64.bin` / `boot/memtest86+x64.efi` plus matching menu paths
- however, `build.sh` exited before recovery because `set -e` treated a direct
`iso_memtest_present` return code of `1` as fatal
- so the next repeated loop was caused by shell control flow, not by proof that
the recovery design itself was wrong
## Known Failed Attempts
These approaches were already tried and should not be repeated blindly:
1. Built-in live-build memtest only.
Reason it failed:
- `lb binary_memtest` runs, but the final ISO still misses memtest binaries and menu entries.
2. Fixing only the memtest file names for Debian Bookworm.
Reason it failed:
- correct file names alone do not make the files appear in the final ISO.
3. Copying memtest from `chroot/boot/` into `binary/boot/` via a binary hook.
Reason it failed:
- in this stack `chroot/boot/` is often empty for memtest payloads at the relevant time.
4. Fallback extraction from cached `memtest86+` `.deb`.
Reason it failed:
- this was explored already and was not enough to stabilize the final ISO path end-to-end.
5. Restoring explicit memtest menu entries in source bootloader templates only.
Reason it failed:
- memtest lines in source templates or intermediate workdir configs do not guarantee the final ISO contains them.
6. Patching `binary/boot/grub/grub.cfg` and `binary/isolinux/live.cfg` from the current `config/hooks/normal/9100-memtest.hook.binary`.
Reason it failed:
- the hook runs before those files exist, so the hook cannot patch them there.
## What This Means
When revisiting memtest later, start from the constraints above rather than retrying the same patterns:
- do not assume the built-in memtest stage is sufficient
- do not assume `chroot/boot/` will contain memtest payloads
- do not assume source bootloader templates are the last writer of final ISO configs
- do not assume the current normal binary hook timing is late enough for final patching
Any future memtest fix must explicitly identify:
- where the memtest binaries are reliably available at build time
- which exact build stage writes the final bootloader configs that land in the ISO
- and a post-build proof from a real ISO, not only from intermediate workdir files
- whether the ISO inspection step itself succeeded, rather than merely whether
the validator printed a memtest warning
- whether a non-zero probe is intentionally handled inside an `if` / `case`
context rather than accidentally tripping `set -e`
## Decision
For `bee`, memtest must be treated as an explicit ISO artifact with explicit post-build validation.
Project rules from now on:
- Do **not** trust `--memtest memtest86+` by itself.
- A memtest implementation is considered valid only if the produced ISO actually contains:
- `boot/memtest86+x64.bin`
- `boot/memtest86+x64.efi`
- a GRUB menu entry
- an isolinux menu entry
- If live-build built-in integration does not produce those artifacts, use an explicit project-owned mechanism such as:
- a binary hook copying files into `binary/boot/`
- extraction from the cached `memtest86+` `.deb`
- another deterministic build-time copy step
- Do **not** remove such explicit logic later unless a fresh real ISO build proves that built-in integration alone produces all required files and menu entries.
Current implementation direction:
- keep the live-build memtest stage enabled if it helps package acquisition
- do not rely on the current early `binary_hooks` timing for final patching
- prefer a post-`lb build` recovery step in `build.sh` that:
- patches the fully materialized `LB_DIR/binary` tree
- injects memtest binaries there
- ensures final bootloader entries there
- reruns late binary stages (`binary_checksums`, `binary_iso`, `binary_zsync`) after the patch
- also treat ISO validation tooling as part of the critical path:
- install a stable ISO reader in the builder image
- fail with an explicit reader error if ISO listing/extraction fails
- do not treat reader failure as evidence that memtest is missing
- do not call a probe that may return "needs recovery" as a bare command under
`set -e`; wrap it in explicit control flow
## Consequences
- Future memtest changes must begin by reading this ADR and the commits listed above.
- Future memtest changes must also begin by reading the failed-attempt list above.
- We should stop re-introducing "prefer built-in live-build memtest" as a default assumption without new evidence.
- Memtest validation in `build.sh` is not optional; it is the acceptance gate that prevents another silent regression.
- But validation output is only trustworthy if ISO reading itself succeeded. A
"missing memtest" warning without a successful ISO read is not evidence.
- If we change memtest strategy again, we must update this ADR with the exact build evidence that justified the change.
## Working Solution (confirmed 2026-04-01, commits 76a9100 → 2baf3be)
This approach was confirmed working in ISO `easy-bee-nvidia-v3.20-5-g76a9100-amd64.iso`
and validated again in subsequent builds. The final ISO contains all required memtest artifacts.
### Components
**1. Binary hook `config/hooks/normal/9100-memtest.hook.binary`**
Runs inside the live-build binary phase. Does not patch bootloader files at hook time —
those files may not exist yet. Instead:
- Tries to copy `memtest86+x64.bin` / `memtest86+x64.efi` from `chroot/boot/` first.
- Falls back to extracting from the cached `.deb` (via `dpkg-deb -x`) if `chroot/boot/` is empty.
- Appends GRUB and isolinux menu entries only if the respective cfg files already exist at hook time.
If they do not exist, the hook warns and continues (does not fail).
Controlled by `BEE_REQUIRE_MEMTEST=1` env var to turn warnings into hard errors when needed.
**2. Post-`lb build` recovery step in `build.sh`**
After `lb build` completes, `build.sh` checks whether the fully materialized `binary/` tree
contains all required memtest artifacts. If not:
- Copies/extracts memtest binaries into `binary/boot/`.
- Patches `binary/boot/grub/grub.cfg` and `binary/isolinux/live.cfg` directly.
- Reruns the late binary stages (`binary_checksums`, `binary_iso`, `binary_zsync`) to rebuild
the ISO with the patched tree.
This is the deterministic safety net: even if the hook runs at the wrong time, the recovery
step handles the final `binary/` tree after live-build has written all bootloader configs.
**3. ISO validation hardening**
The memtest probe in `build.sh` is wrapped in explicit `if` / `case` control flow, not called
as a bare command under `set -e`. A non-zero probe return (needs recovery) is intentional and
handled — it does not abort the build prematurely.
ISO reading (`xorriso -indev -ls` / extraction) is treated as a separate prerequisite.
If the reader fails, the validator reports a reader error explicitly, not a memtest warning.
This prevents the false-negative loop that burned 2026-04-01 v3.14v3.19.
### Why this works when earlier attempts did not
The earlier patterns all shared a single flaw: they assumed a single build-time point
(hook or source template) would be the last writer of bootloader configs and memtest payloads.
In live-build on Debian Bookworm that assumption is false — live-build continues writing
bootloader files after custom hooks run, and `chroot/boot/` does not reliably hold memtest payloads.
The recovery step sidesteps the ordering problem entirely: it acts on the fully materialized
`binary/` tree after `lb build` finishes, then rebuilds the ISO from that patched tree.
There is no ordering dependency to get wrong.
### Do not revert
Do not remove the recovery step or the hook without a fresh real ISO build proving
live-build alone produces all four required artifacts:
- `boot/memtest86+x64.bin`
- `boot/memtest86+x64.efi`
- memtest entry in `boot/grub/grub.cfg`
- memtest entry in `isolinux/live.cfg`

View File

@@ -5,3 +5,4 @@ One file per decision, named `YYYY-MM-DD-short-topic.md`.
| Date | Decision | Status |
|---|---|---|
| 2026-03-05 | Use NVIDIA proprietary driver | active |
| 2026-04-01 | Treat memtest as explicit ISO content | active |

View File

@@ -13,9 +13,50 @@ Use one of:
This applies to:
- `iso/builder/config/package-lists/*.list.chroot`
- Any package referenced in `grub.cfg`, hooks, or overlay scripts (e.g. file paths like `/boot/memtest86+x64.bin`)
- Any package referenced in bootloader configs, hooks, or overlay scripts
## Example of what goes wrong without this
## Memtest rule
`memtest86+` in Debian bookworm installs `/boot/memtest86+x64.bin`, not `/boot/memtest86+.bin`.
Guessing the filename caused a broken GRUB entry that only surfaced at boot time, after a full rebuild.
Do not assume live-build's built-in memtest integration is sufficient for `bee`.
We already tried that path and regressed again on 2026-04-01: `lb binary_memtest`
ran, but the final ISO still lacked memtest binaries and menu entries.
For this project, memtest is accepted only when the produced ISO actually
contains all of the following:
- `boot/memtest86+x64.bin`
- `boot/memtest86+x64.efi`
- a memtest entry in `boot/grub/grub.cfg`
- a memtest entry in `isolinux/live.cfg`
Rules:
- Keep explicit post-build memtest validation in `build.sh`.
- Treat ISO reader success as a separate prerequisite from memtest content.
If the reader cannot list or extract from the ISO, that is a validator
failure, not proof that memtest is missing.
- If built-in integration does not produce the artifacts above, use a
deterministic project-owned copy/extract step instead of hoping live-build
will "start working".
- Do not switch back to built-in-only memtest without fresh build evidence from
a real ISO.
- If you reference memtest files manually, verify the exact package file list
first for the target Debian release.
Known bad loops for this repository:
- Do not retry built-in-only memtest without new evidence. We already proved
that `lb binary_memtest` can run while the final ISO still has no memtest.
- Do not assume fixing memtest file names is enough. Correct names did not fix
the final artifact path.
- Do not assume `chroot/boot/` contains memtest payloads at the time hooks run.
- Do not assume source `grub.cfg` / `live.cfg.in` are the final writers of ISO
bootloader configs.
- Do not assume the current `config/hooks/normal/9100-memtest.hook.binary`
timing is late enough to patch final `binary/boot/grub/grub.cfg` or
`binary/isolinux/live.cfg`; logs from 2026-04-01 showed those files were not
present yet when the hook executed.
- Do not treat a validator warning as ground truth until you have confirmed the
ISO reader actually succeeded. On 2026-04-01 we misdiagnosed another memtest
regression because the final ISO was correct but the validator produced a
false negative.

View File

@@ -48,6 +48,7 @@ sh iso/builder/build-in-container.sh --cache-dir /path/to/cache
- The builder image is automatically rebuilt if the local tag exists for the wrong architecture.
- The live ISO boots with Debian `live-boot` `toram`, so the read-only medium is copied into RAM during boot and the runtime no longer depends on the original USB/BMC virtual media staying present.
- Target systems need enough RAM for the full compressed live medium plus normal runtime overhead, or boot may fail before reaching the TUI.
- The NVIDIA variant installs DCGM 4 packages matched to the CUDA user-mode driver major version. For driver branch `590` / CUDA `13.x`, the package family is `datacenter-gpu-manager-4-cuda13` rather than legacy `datacenter-gpu-manager`.
- Override the container platform only if you know why:
```sh

View File

@@ -17,12 +17,23 @@ RUN apt-get update -qq && apt-get install -y \
wget \
curl \
tar \
libarchive-tools \
xz-utils \
rsync \
build-essential \
gcc \
make \
perl \
pkg-config \
yasm \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libgmp-dev \
libpcap-dev \
libsqlite3-dev \
libcurl4-openssl-dev \
ocl-icd-opencl-dev \
linux-headers-amd64 \
&& rm -rf /var/lib/apt/lists/*

View File

@@ -8,7 +8,8 @@ NCCL_TESTS_VERSION=2.13.10
NVCC_VERSION=12.8
CUBLAS_VERSION=13.0.2.14-1
CUDA_USERSPACE_VERSION=13.0.96-1
DCGM_VERSION=3.3.9
DCGM_VERSION=4.5.3-1
JOHN_JUMBO_COMMIT=67fcf9fe5a
ROCM_VERSION=6.3.4
ROCM_SMI_VERSION=7.4.0.60304-76~22.04
ROCM_BANDWIDTH_TEST_VERSION=1.4.0.60304-76~22.04

View File

@@ -29,10 +29,10 @@ lb config noauto \
--security true \
--linux-flavours "amd64" \
--linux-packages "${LB_LINUX_PACKAGES}" \
--memtest none \
--iso-volume "EASY-BEE" \
--iso-application "EASY-BEE" \
--bootappend-live "boot=live components video=1920x1080 console=tty0 console=ttyS0,115200n8 loglevel=7 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
--memtest memtest86+ \
--iso-volume "EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--iso-application "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--bootappend-live "boot=live components video=1920x1080 console=tty0 console=ttyS0,115200n8 loglevel=3 username=bee user-fullname=Bee modprobe.blacklist=nouveau,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_generic,soundcore" \
--apt-recommends false \
--chroot-squashfs-compression-type zstd \
"${@}"

View File

@@ -29,8 +29,14 @@ typedef void *CUfunction;
typedef void *CUstream;
#define CU_SUCCESS 0
#define CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT 16
#define CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR 75
#define CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR 76
#define MAX_STRESS_STREAMS 16
#define MAX_CUBLAS_PROFILES 5
#define MIN_PROFILE_BUDGET_BYTES ((size_t)4u * 1024u * 1024u)
#define MIN_STREAM_BUDGET_BYTES ((size_t)64u * 1024u * 1024u)
#define STRESS_LAUNCH_DEPTH 8
static const char *ptx_source =
".version 6.0\n"
@@ -97,6 +103,9 @@ typedef CUresult (*cuLaunchKernel_fn)(CUfunction,
CUstream,
void **,
void **);
typedef CUresult (*cuMemGetInfo_fn)(size_t *, size_t *);
typedef CUresult (*cuStreamCreate_fn)(CUstream *, unsigned int);
typedef CUresult (*cuStreamDestroy_fn)(CUstream);
typedef CUresult (*cuGetErrorName_fn)(CUresult, const char **);
typedef CUresult (*cuGetErrorString_fn)(CUresult, const char **);
@@ -118,6 +127,9 @@ struct cuda_api {
cuModuleLoadDataEx_fn cuModuleLoadDataEx;
cuModuleGetFunction_fn cuModuleGetFunction;
cuLaunchKernel_fn cuLaunchKernel;
cuMemGetInfo_fn cuMemGetInfo;
cuStreamCreate_fn cuStreamCreate;
cuStreamDestroy_fn cuStreamDestroy;
cuGetErrorName_fn cuGetErrorName;
cuGetErrorString_fn cuGetErrorString;
};
@@ -128,9 +140,10 @@ struct stress_report {
int cc_major;
int cc_minor;
int buffer_mb;
int stream_count;
unsigned long iterations;
uint64_t checksum;
char details[1024];
char details[16384];
};
static int load_symbol(void *lib, const char *name, void **out) {
@@ -144,7 +157,7 @@ static int load_cuda(struct cuda_api *api) {
if (!api->lib) {
return 0;
}
return
if (!(
load_symbol(api->lib, "cuInit", (void **)&api->cuInit) &&
load_symbol(api->lib, "cuDeviceGetCount", (void **)&api->cuDeviceGetCount) &&
load_symbol(api->lib, "cuDeviceGet", (void **)&api->cuDeviceGet) &&
@@ -160,7 +173,17 @@ static int load_cuda(struct cuda_api *api) {
load_symbol(api->lib, "cuMemcpyDtoH_v2", (void **)&api->cuMemcpyDtoH) &&
load_symbol(api->lib, "cuModuleLoadDataEx", (void **)&api->cuModuleLoadDataEx) &&
load_symbol(api->lib, "cuModuleGetFunction", (void **)&api->cuModuleGetFunction) &&
load_symbol(api->lib, "cuLaunchKernel", (void **)&api->cuLaunchKernel);
load_symbol(api->lib, "cuLaunchKernel", (void **)&api->cuLaunchKernel))) {
dlclose(api->lib);
memset(api, 0, sizeof(*api));
return 0;
}
load_symbol(api->lib, "cuMemGetInfo_v2", (void **)&api->cuMemGetInfo);
load_symbol(api->lib, "cuStreamCreate", (void **)&api->cuStreamCreate);
if (!load_symbol(api->lib, "cuStreamDestroy_v2", (void **)&api->cuStreamDestroy)) {
load_symbol(api->lib, "cuStreamDestroy", (void **)&api->cuStreamDestroy);
}
return 1;
}
static const char *cu_error_name(struct cuda_api *api, CUresult rc) {
@@ -193,14 +216,12 @@ static double now_seconds(void) {
return (double)ts.tv_sec + ((double)ts.tv_nsec / 1000000000.0);
}
#if HAVE_CUBLASLT_HEADERS
static size_t round_down_size(size_t value, size_t multiple) {
if (multiple == 0 || value < multiple) {
return value;
}
return value - (value % multiple);
}
#endif
static int query_compute_capability(struct cuda_api *api, CUdevice dev, int *major, int *minor) {
int cc_major = 0;
@@ -220,6 +241,75 @@ static int query_compute_capability(struct cuda_api *api, CUdevice dev, int *maj
return 1;
}
static int query_multiprocessor_count(struct cuda_api *api, CUdevice dev, int *count) {
int mp_count = 0;
if (!check_rc(api,
"cuDeviceGetAttribute(multiprocessors)",
api->cuDeviceGetAttribute(&mp_count, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev))) {
return 0;
}
*count = mp_count;
return 1;
}
static size_t clamp_budget_to_free_memory(struct cuda_api *api, size_t requested_bytes) {
size_t free_bytes = 0;
size_t total_bytes = 0;
size_t max_bytes = requested_bytes;
if (!api->cuMemGetInfo) {
return requested_bytes;
}
if (api->cuMemGetInfo(&free_bytes, &total_bytes) != CU_SUCCESS || free_bytes == 0) {
return requested_bytes;
}
max_bytes = (free_bytes * 9u) / 10u;
if (max_bytes < (size_t)4u * 1024u * 1024u) {
max_bytes = (size_t)4u * 1024u * 1024u;
}
if (requested_bytes > max_bytes) {
return max_bytes;
}
return requested_bytes;
}
static int choose_stream_count(int mp_count, int planned_profiles, size_t total_budget, int have_streams) {
int stream_count = 1;
if (!have_streams || mp_count <= 0 || planned_profiles <= 0) {
return 1;
}
stream_count = mp_count / 8;
if (stream_count < 2) {
stream_count = 2;
}
if (stream_count > MAX_STRESS_STREAMS) {
stream_count = MAX_STRESS_STREAMS;
}
while (stream_count > 1) {
size_t per_stream_budget = total_budget / ((size_t)planned_profiles * (size_t)stream_count);
if (per_stream_budget >= MIN_STREAM_BUDGET_BYTES) {
break;
}
stream_count--;
}
return stream_count;
}
static void destroy_streams(struct cuda_api *api, CUstream *streams, int count) {
if (!api->cuStreamDestroy) {
return;
}
for (int i = 0; i < count; i++) {
if (streams[i]) {
api->cuStreamDestroy(streams[i]);
streams[i] = NULL;
}
}
}
#if HAVE_CUBLASLT_HEADERS
static void append_detail(char *buf, size_t cap, const char *fmt, ...) {
size_t len = strlen(buf);
@@ -242,12 +332,19 @@ static int run_ptx_fallback(struct cuda_api *api,
int size_mb,
struct stress_report *report) {
CUcontext ctx = NULL;
CUdeviceptr device_mem = 0;
CUmodule module = NULL;
CUfunction kernel = NULL;
uint32_t sample[256];
uint32_t words = 0;
CUdeviceptr device_mem[MAX_STRESS_STREAMS] = {0};
CUstream streams[MAX_STRESS_STREAMS] = {0};
uint32_t words[MAX_STRESS_STREAMS] = {0};
uint32_t rounds[MAX_STRESS_STREAMS] = {0};
void *params[MAX_STRESS_STREAMS][3];
size_t bytes_per_stream[MAX_STRESS_STREAMS] = {0};
unsigned long iterations = 0;
int mp_count = 0;
int stream_count = 1;
int launches_per_wave = 0;
memset(report, 0, sizeof(*report));
snprintf(report->backend, sizeof(report->backend), "driver-ptx");
@@ -260,64 +357,109 @@ static int run_ptx_fallback(struct cuda_api *api,
return 0;
}
size_t bytes = (size_t)size_mb * 1024u * 1024u;
if (bytes < 4u * 1024u * 1024u) {
bytes = 4u * 1024u * 1024u;
size_t requested_bytes = (size_t)size_mb * 1024u * 1024u;
if (requested_bytes < MIN_PROFILE_BUDGET_BYTES) {
requested_bytes = MIN_PROFILE_BUDGET_BYTES;
}
if (bytes > (size_t)1024u * 1024u * 1024u) {
bytes = (size_t)1024u * 1024u * 1024u;
size_t total_bytes = clamp_budget_to_free_memory(api, requested_bytes);
if (total_bytes < MIN_PROFILE_BUDGET_BYTES) {
total_bytes = MIN_PROFILE_BUDGET_BYTES;
}
words = (uint32_t)(bytes / sizeof(uint32_t));
report->buffer_mb = (int)(total_bytes / (1024u * 1024u));
if (!check_rc(api, "cuMemAlloc", api->cuMemAlloc(&device_mem, bytes))) {
api->cuCtxDestroy(ctx);
return 0;
if (query_multiprocessor_count(api, dev, &mp_count) &&
api->cuStreamCreate &&
api->cuStreamDestroy) {
stream_count = choose_stream_count(mp_count, 1, total_bytes, 1);
}
if (!check_rc(api, "cuMemsetD8", api->cuMemsetD8(device_mem, 0, bytes))) {
api->cuMemFree(device_mem);
api->cuCtxDestroy(ctx);
return 0;
if (stream_count > 1) {
int created = 0;
for (; created < stream_count; created++) {
if (!check_rc(api, "cuStreamCreate", api->cuStreamCreate(&streams[created], 0))) {
destroy_streams(api, streams, created);
stream_count = 1;
break;
}
}
}
report->stream_count = stream_count;
for (int lane = 0; lane < stream_count; lane++) {
size_t slice = total_bytes / (size_t)stream_count;
if (lane == stream_count - 1) {
slice = total_bytes - ((size_t)lane * (total_bytes / (size_t)stream_count));
}
slice = round_down_size(slice, sizeof(uint32_t));
if (slice < MIN_PROFILE_BUDGET_BYTES) {
slice = MIN_PROFILE_BUDGET_BYTES;
}
bytes_per_stream[lane] = slice;
words[lane] = (uint32_t)(slice / sizeof(uint32_t));
if (!check_rc(api, "cuMemAlloc", api->cuMemAlloc(&device_mem[lane], slice))) {
goto fail;
}
if (!check_rc(api, "cuMemsetD8", api->cuMemsetD8(device_mem[lane], 0, slice))) {
goto fail;
}
rounds[lane] = 2048;
params[lane][0] = &device_mem[lane];
params[lane][1] = &words[lane];
params[lane][2] = &rounds[lane];
}
if (!check_rc(api,
"cuModuleLoadDataEx",
api->cuModuleLoadDataEx(&module, ptx_source, 0, NULL, NULL))) {
api->cuMemFree(device_mem);
api->cuCtxDestroy(ctx);
return 0;
goto fail;
}
if (!check_rc(api, "cuModuleGetFunction", api->cuModuleGetFunction(&kernel, module, "burn"))) {
api->cuMemFree(device_mem);
api->cuCtxDestroy(ctx);
return 0;
goto fail;
}
unsigned int threads = 256;
unsigned int blocks = (unsigned int)((words + threads - 1) / threads);
uint32_t rounds = 1024;
void *params[] = {&device_mem, &words, &rounds};
double start = now_seconds();
double deadline = start + (double)seconds;
while (now_seconds() < deadline) {
if (!check_rc(api,
"cuLaunchKernel",
api->cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1, 0, NULL, params, NULL))) {
api->cuMemFree(device_mem);
api->cuCtxDestroy(ctx);
return 0;
launches_per_wave = 0;
for (int depth = 0; depth < STRESS_LAUNCH_DEPTH && now_seconds() < deadline; depth++) {
int launched_this_batch = 0;
for (int lane = 0; lane < stream_count; lane++) {
unsigned int blocks = (unsigned int)((words[lane] + threads - 1) / threads);
if (!check_rc(api,
"cuLaunchKernel",
api->cuLaunchKernel(kernel,
blocks,
1,
1,
threads,
1,
1,
0,
streams[lane],
params[lane],
NULL))) {
goto fail;
}
launches_per_wave++;
launched_this_batch++;
}
if (launched_this_batch <= 0) {
break;
}
}
iterations++;
if (launches_per_wave <= 0) {
goto fail;
}
if (!check_rc(api, "cuCtxSynchronize", api->cuCtxSynchronize())) {
goto fail;
}
iterations += (unsigned long)launches_per_wave;
}
if (!check_rc(api, "cuCtxSynchronize", api->cuCtxSynchronize())) {
api->cuMemFree(device_mem);
api->cuCtxDestroy(ctx);
return 0;
}
if (!check_rc(api, "cuMemcpyDtoH", api->cuMemcpyDtoH(sample, device_mem, sizeof(sample)))) {
api->cuMemFree(device_mem);
api->cuCtxDestroy(ctx);
return 0;
if (!check_rc(api, "cuMemcpyDtoH", api->cuMemcpyDtoH(sample, device_mem[0], sizeof(sample)))) {
goto fail;
}
for (size_t i = 0; i < sizeof(sample) / sizeof(sample[0]); i++) {
@@ -326,12 +468,34 @@ static int run_ptx_fallback(struct cuda_api *api,
report->iterations = iterations;
snprintf(report->details,
sizeof(report->details),
"profile_int32_fallback=OK iterations=%lu\n",
"fallback_int32=OK requested_mb=%d actual_mb=%d streams=%d queue_depth=%d per_stream_mb=%zu iterations=%lu\n",
size_mb,
report->buffer_mb,
report->stream_count,
STRESS_LAUNCH_DEPTH,
bytes_per_stream[0] / (1024u * 1024u),
iterations);
api->cuMemFree(device_mem);
for (int lane = 0; lane < stream_count; lane++) {
if (device_mem[lane]) {
api->cuMemFree(device_mem[lane]);
}
}
destroy_streams(api, streams, stream_count);
api->cuCtxDestroy(ctx);
return 1;
fail:
for (int lane = 0; lane < MAX_STRESS_STREAMS; lane++) {
if (device_mem[lane]) {
api->cuMemFree(device_mem[lane]);
}
}
destroy_streams(api, streams, MAX_STRESS_STREAMS);
if (ctx) {
api->cuCtxDestroy(ctx);
}
return 0;
}
#if HAVE_CUBLASLT_HEADERS
@@ -418,6 +582,7 @@ struct profile_desc {
struct prepared_profile {
struct profile_desc desc;
CUstream stream;
cublasLtMatmulDesc_t op_desc;
cublasLtMatrixLayout_t a_layout;
cublasLtMatrixLayout_t b_layout;
@@ -617,8 +782,8 @@ static uint64_t choose_square_dim(size_t budget_bytes, size_t bytes_per_cell, in
if (dim < (uint64_t)multiple) {
dim = (uint64_t)multiple;
}
if (dim > 8192u) {
dim = 8192u;
if (dim > 65536u) {
dim = 65536u;
}
return dim;
}
@@ -704,10 +869,12 @@ static int prepare_profile(struct cublaslt_api *cublas,
cublasLtHandle_t handle,
struct cuda_api *cuda,
const struct profile_desc *desc,
CUstream stream,
size_t profile_budget_bytes,
struct prepared_profile *out) {
memset(out, 0, sizeof(*out));
out->desc = *desc;
out->stream = stream;
size_t bytes_per_cell = 0;
bytes_per_cell += bytes_for_elements(desc->a_type, 1);
@@ -935,7 +1102,7 @@ static int run_cublas_profile(cublasLtHandle_t handle,
&profile->heuristic.algo,
(void *)(uintptr_t)profile->workspace_dev,
profile->workspace_size,
(cudaStream_t)0));
profile->stream));
}
static int run_cublaslt_stress(struct cuda_api *cuda,
@@ -947,13 +1114,22 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
int size_mb,
struct stress_report *report) {
struct cublaslt_api cublas;
struct prepared_profile prepared[sizeof(k_profiles) / sizeof(k_profiles[0])];
struct prepared_profile prepared[MAX_STRESS_STREAMS * MAX_CUBLAS_PROFILES];
cublasLtHandle_t handle = NULL;
CUcontext ctx = NULL;
CUstream streams[MAX_STRESS_STREAMS] = {0};
uint16_t sample[256];
int cc = cc_major * 10 + cc_minor;
int planned = 0;
int active = 0;
int mp_count = 0;
int stream_count = 1;
int profile_count = (int)(sizeof(k_profiles) / sizeof(k_profiles[0]));
int prepared_count = 0;
int wave_launches = 0;
size_t requested_budget = 0;
size_t total_budget = 0;
size_t per_profile_budget = 0;
memset(report, 0, sizeof(*report));
snprintf(report->backend, sizeof(report->backend), "cublasLt");
@@ -986,16 +1162,46 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
return 0;
}
size_t total_budget = (size_t)size_mb * 1024u * 1024u;
if (total_budget < (size_t)planned * 4u * 1024u * 1024u) {
total_budget = (size_t)planned * 4u * 1024u * 1024u;
requested_budget = (size_t)size_mb * 1024u * 1024u;
if (requested_budget < (size_t)planned * MIN_PROFILE_BUDGET_BYTES) {
requested_budget = (size_t)planned * MIN_PROFILE_BUDGET_BYTES;
}
size_t per_profile_budget = total_budget / (size_t)planned;
if (per_profile_budget < 4u * 1024u * 1024u) {
per_profile_budget = 4u * 1024u * 1024u;
total_budget = clamp_budget_to_free_memory(cuda, requested_budget);
if (total_budget < (size_t)planned * MIN_PROFILE_BUDGET_BYTES) {
total_budget = (size_t)planned * MIN_PROFILE_BUDGET_BYTES;
}
if (query_multiprocessor_count(cuda, dev, &mp_count) &&
cuda->cuStreamCreate &&
cuda->cuStreamDestroy) {
stream_count = choose_stream_count(mp_count, planned, total_budget, 1);
}
if (stream_count > 1) {
int created = 0;
for (; created < stream_count; created++) {
if (!check_rc(cuda, "cuStreamCreate", cuda->cuStreamCreate(&streams[created], 0))) {
destroy_streams(cuda, streams, created);
stream_count = 1;
break;
}
}
}
report->stream_count = stream_count;
per_profile_budget = total_budget / ((size_t)planned * (size_t)stream_count);
if (per_profile_budget < MIN_PROFILE_BUDGET_BYTES) {
per_profile_budget = MIN_PROFILE_BUDGET_BYTES;
}
report->buffer_mb = (int)(total_budget / (1024u * 1024u));
append_detail(report->details,
sizeof(report->details),
"requested_mb=%d actual_mb=%d streams=%d queue_depth=%d mp_count=%d per_worker_mb=%zu\n",
size_mb,
report->buffer_mb,
report->stream_count,
STRESS_LAUNCH_DEPTH,
mp_count,
per_profile_budget / (1024u * 1024u));
for (size_t i = 0; i < sizeof(k_profiles) / sizeof(k_profiles[0]); i++) {
for (int i = 0; i < profile_count; i++) {
const struct profile_desc *desc = &k_profiles[i];
if (!(desc->enabled && cc >= desc->min_cc)) {
append_detail(report->details,
@@ -1005,63 +1211,87 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
desc->min_cc);
continue;
}
if (prepare_profile(&cublas, handle, cuda, desc, per_profile_budget, &prepared[i])) {
active++;
append_detail(report->details,
sizeof(report->details),
"%s=READY dim=%llux%llux%llu block=%s\n",
desc->name,
(unsigned long long)prepared[i].m,
(unsigned long long)prepared[i].n,
(unsigned long long)prepared[i].k,
desc->block_label);
} else {
append_detail(report->details, sizeof(report->details), "%s=SKIPPED unsupported\n", desc->name);
for (int lane = 0; lane < stream_count; lane++) {
CUstream stream = streams[lane];
if (prepared_count >= (int)(sizeof(prepared) / sizeof(prepared[0]))) {
break;
}
if (prepare_profile(&cublas, handle, cuda, desc, stream, per_profile_budget, &prepared[prepared_count])) {
active++;
append_detail(report->details,
sizeof(report->details),
"%s[%d]=READY dim=%llux%llux%llu block=%s stream=%d\n",
desc->name,
lane,
(unsigned long long)prepared[prepared_count].m,
(unsigned long long)prepared[prepared_count].n,
(unsigned long long)prepared[prepared_count].k,
desc->block_label,
lane);
prepared_count++;
} else {
append_detail(report->details,
sizeof(report->details),
"%s[%d]=SKIPPED unsupported\n",
desc->name,
lane);
}
}
}
if (active <= 0) {
cublas.cublasLtDestroy(handle);
destroy_streams(cuda, streams, stream_count);
cuda->cuCtxDestroy(ctx);
return 0;
}
double deadline = now_seconds() + (double)seconds;
while (now_seconds() < deadline) {
for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
if (!prepared[i].ready) {
continue;
}
if (!run_cublas_profile(handle, &cublas, &prepared[i])) {
append_detail(report->details,
sizeof(report->details),
"%s=FAILED runtime\n",
prepared[i].desc.name);
for (size_t j = 0; j < sizeof(prepared) / sizeof(prepared[0]); j++) {
destroy_profile(&cublas, cuda, &prepared[j]);
wave_launches = 0;
for (int depth = 0; depth < STRESS_LAUNCH_DEPTH && now_seconds() < deadline; depth++) {
int launched_this_batch = 0;
for (int i = 0; i < prepared_count; i++) {
if (!prepared[i].ready) {
continue;
}
cublas.cublasLtDestroy(handle);
cuda->cuCtxDestroy(ctx);
return 0;
if (!run_cublas_profile(handle, &cublas, &prepared[i])) {
append_detail(report->details,
sizeof(report->details),
"%s=FAILED runtime\n",
prepared[i].desc.name);
for (int j = 0; j < prepared_count; j++) {
destroy_profile(&cublas, cuda, &prepared[j]);
}
cublas.cublasLtDestroy(handle);
destroy_streams(cuda, streams, stream_count);
cuda->cuCtxDestroy(ctx);
return 0;
}
prepared[i].iterations++;
report->iterations++;
wave_launches++;
launched_this_batch++;
}
prepared[i].iterations++;
report->iterations++;
if (now_seconds() >= deadline) {
if (launched_this_batch <= 0) {
break;
}
}
}
if (!check_rc(cuda, "cuCtxSynchronize", cuda->cuCtxSynchronize())) {
for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
destroy_profile(&cublas, cuda, &prepared[i]);
if (wave_launches <= 0) {
break;
}
if (!check_rc(cuda, "cuCtxSynchronize", cuda->cuCtxSynchronize())) {
for (int i = 0; i < prepared_count; i++) {
destroy_profile(&cublas, cuda, &prepared[i]);
}
cublas.cublasLtDestroy(handle);
destroy_streams(cuda, streams, stream_count);
cuda->cuCtxDestroy(ctx);
return 0;
}
cublas.cublasLtDestroy(handle);
cuda->cuCtxDestroy(ctx);
return 0;
}
for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
for (int i = 0; i < prepared_count; i++) {
if (!prepared[i].ready) {
continue;
}
@@ -1072,7 +1302,7 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
prepared[i].iterations);
}
for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
for (int i = 0; i < prepared_count; i++) {
if (prepared[i].ready) {
if (check_rc(cuda, "cuMemcpyDtoH", cuda->cuMemcpyDtoH(sample, prepared[i].d_dev, sizeof(sample)))) {
for (size_t j = 0; j < sizeof(sample) / sizeof(sample[0]); j++) {
@@ -1083,10 +1313,11 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
}
}
for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
for (int i = 0; i < prepared_count; i++) {
destroy_profile(&cublas, cuda, &prepared[i]);
}
cublas.cublasLtDestroy(handle);
destroy_streams(cuda, streams, stream_count);
cuda->cuCtxDestroy(ctx);
return 1;
}
@@ -1095,13 +1326,16 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
int main(int argc, char **argv) {
int seconds = 5;
int size_mb = 64;
int device_index = 0;
for (int i = 1; i < argc; i++) {
if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) {
seconds = atoi(argv[++i]);
} else if ((strcmp(argv[i], "--size-mb") == 0 || strcmp(argv[i], "-m") == 0) && i + 1 < argc) {
size_mb = atoi(argv[++i]);
} else if ((strcmp(argv[i], "--device") == 0 || strcmp(argv[i], "-d") == 0) && i + 1 < argc) {
device_index = atoi(argv[++i]);
} else {
fprintf(stderr, "usage: %s [--seconds N] [--size-mb N]\n", argv[0]);
fprintf(stderr, "usage: %s [--seconds N] [--size-mb N] [--device N]\n", argv[0]);
return 2;
}
}
@@ -1111,6 +1345,9 @@ int main(int argc, char **argv) {
if (size_mb <= 0) {
size_mb = 64;
}
if (device_index < 0) {
device_index = 0;
}
struct cuda_api cuda;
if (!load_cuda(&cuda)) {
@@ -1133,8 +1370,13 @@ int main(int argc, char **argv) {
return 1;
}
if (device_index >= count) {
fprintf(stderr, "device index %d out of range (found %d CUDA device(s))\n", device_index, count);
return 1;
}
CUdevice dev = 0;
if (!check_rc(&cuda, "cuDeviceGet", cuda.cuDeviceGet(&dev, 0))) {
if (!check_rc(&cuda, "cuDeviceGet", cuda.cuDeviceGet(&dev, device_index))) {
return 1;
}
@@ -1162,10 +1404,12 @@ int main(int argc, char **argv) {
}
printf("device=%s\n", report.device);
printf("device_index=%d\n", device_index);
printf("compute_capability=%d.%d\n", report.cc_major, report.cc_minor);
printf("backend=%s\n", report.backend);
printf("duration_s=%d\n", seconds);
printf("buffer_mb=%d\n", report.buffer_mb);
printf("streams=%d\n", report.stream_count);
printf("iterations=%lu\n", report.iterations);
printf("checksum=%llu\n", (unsigned long long)report.checksum);
if (report.details[0] != '\0') {

View File

@@ -1,9 +1,9 @@
#!/bin/sh
# build-cublas.sh — download cuBLASLt/cuBLAS/cudart runtime + headers for bee-gpu-stress.
# build-cublas.sh — download cuBLASLt/cuBLAS/cudart runtime + headers for bee-gpu-burn worker.
#
# Downloads .deb packages from NVIDIA's CUDA apt repository (Debian 12, x86_64),
# verifies them against Packages.gz, and extracts the small subset we need:
# - headers for compiling bee-gpu-stress against cuBLASLt
# - headers for compiling bee-gpu-burn worker against cuBLASLt
# - runtime libs for libcublas, libcublasLt, libcudart inside the ISO
set -e

View File

@@ -12,6 +12,7 @@ CACHE_DIR="${BEE_BUILDER_CACHE_DIR:-${REPO_ROOT}/dist/container-cache}"
AUTH_KEYS=""
REBUILD_IMAGE=0
CLEAN_CACHE=0
VARIANT="all"
. "${BUILDER_DIR}/VERSIONS"
@@ -34,14 +35,23 @@ while [ $# -gt 0 ]; do
REBUILD_IMAGE=1
shift
;;
--variant)
VARIANT="$2"
shift 2
;;
*)
echo "unknown arg: $1" >&2
echo "usage: $0 [--cache-dir /path] [--rebuild-image] [--clean-build] [--authorized-keys /path/to/authorized_keys]" >&2
echo "usage: $0 [--cache-dir /path] [--rebuild-image] [--clean-build] [--authorized-keys /path/to/authorized_keys] [--variant nvidia|amd|all]" >&2
exit 1
;;
esac
done
case "$VARIANT" in
nvidia|amd|nogpu|all) ;;
*) echo "unknown variant: $VARIANT (expected nvidia, amd, nogpu, or all)" >&2; exit 1 ;;
esac
if [ "$CLEAN_CACHE" = "1" ]; then
echo "=== cleaning build cache: ${CACHE_DIR} ==="
rm -rf "${CACHE_DIR:?}/go-build" \
@@ -49,8 +59,10 @@ if [ "$CLEAN_CACHE" = "1" ]; then
"${CACHE_DIR:?}/tmp" \
"${CACHE_DIR:?}/bee" \
"${CACHE_DIR:?}/lb-packages"
echo "=== cleaning live-build work dir: ${REPO_ROOT}/dist/live-build-work ==="
rm -rf "${REPO_ROOT}/dist/live-build-work"
echo "=== cleaning live-build work dirs ==="
rm -rf "${REPO_ROOT}/dist/live-build-work-nvidia"
rm -rf "${REPO_ROOT}/dist/live-build-work-amd"
rm -rf "${REPO_ROOT}/dist/live-build-work-nogpu"
echo "=== caches cleared, proceeding with build ==="
fi
@@ -108,34 +120,75 @@ else
echo "=== using existing builder image ${IMAGE_REF} (${BUILDER_PLATFORM}) ==="
fi
set -- \
run --rm --privileged \
--platform "${BUILDER_PLATFORM}" \
-v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \
-e BEE_CACHE_DIR=/cache/bee \
-w /work \
"${IMAGE_REF}" \
sh /work/iso/builder/build.sh
if [ -n "$AUTH_KEYS" ]; then
set -- run --rm --privileged \
--platform "${BUILDER_PLATFORM}" \
-v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \
-v "${AUTH_KEYS_DIR}:/tmp/bee-authkeys:ro" \
# Build base docker run args (without --authorized-keys)
build_run_args() {
_variant="$1"
_auth_arg=""
if [ -n "$AUTH_KEYS" ]; then
_auth_arg="--authorized-keys /tmp/bee-authkeys/${AUTH_KEYS_BASE}"
fi
echo "run --rm --privileged \
--platform ${BUILDER_PLATFORM} \
-v ${REPO_ROOT}:/work \
-v ${CACHE_DIR}:/cache \
${AUTH_KEYS:+-v ${AUTH_KEYS_DIR}:/tmp/bee-authkeys:ro} \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \
-e BEE_CACHE_DIR=/cache/bee \
-w /work \
"${IMAGE_REF}" \
sh /work/iso/builder/build.sh --authorized-keys "/tmp/bee-authkeys/${AUTH_KEYS_BASE}"
fi
${IMAGE_REF} \
sh /work/iso/builder/build.sh --variant ${_variant} ${_auth_arg}"
}
"$CONTAINER_TOOL" "$@"
run_variant() {
_v="$1"
echo "=== building variant: ${_v} ==="
if [ -n "$AUTH_KEYS" ]; then
"$CONTAINER_TOOL" run --rm --privileged \
--platform "${BUILDER_PLATFORM}" \
-v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \
-v "${AUTH_KEYS_DIR}:/tmp/bee-authkeys:ro" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \
-e BEE_CACHE_DIR=/cache/bee \
-w /work \
"${IMAGE_REF}" \
sh /work/iso/builder/build.sh --variant "${_v}" \
--authorized-keys "/tmp/bee-authkeys/${AUTH_KEYS_BASE}"
else
"$CONTAINER_TOOL" run --rm --privileged \
--platform "${BUILDER_PLATFORM}" \
-v "${REPO_ROOT}:/work" \
-v "${CACHE_DIR}:/cache" \
-e BEE_CONTAINER_BUILD=1 \
-e GOCACHE=/cache/go-build \
-e GOMODCACHE=/cache/go-mod \
-e TMPDIR=/cache/tmp \
-e BEE_CACHE_DIR=/cache/bee \
-w /work \
"${IMAGE_REF}" \
sh /work/iso/builder/build.sh --variant "${_v}"
fi
}
case "$VARIANT" in
nvidia)
run_variant nvidia
;;
amd)
run_variant amd
;;
nogpu)
run_variant nogpu
;;
all)
run_variant nvidia
run_variant amd
run_variant nogpu
;;
esac

55
iso/builder/build-john.sh Normal file
View File

@@ -0,0 +1,55 @@
#!/bin/sh
# build-john.sh — build John the Ripper jumbo with OpenCL support for the LiveCD.
#
# Downloads a pinned source snapshot from the official openwall/john repository,
# builds it inside the builder container, and caches the resulting run/ tree.
set -e
JOHN_COMMIT="$1"
DIST_DIR="$2"
[ -n "$JOHN_COMMIT" ] || { echo "usage: $0 <john-commit> <dist-dir>"; exit 1; }
[ -n "$DIST_DIR" ] || { echo "usage: $0 <john-commit> <dist-dir>"; exit 1; }
echo "=== John the Ripper jumbo ${JOHN_COMMIT} ==="
CACHE_DIR="${DIST_DIR}/john-${JOHN_COMMIT}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/john-downloads"
SRC_TAR="${DOWNLOAD_CACHE_DIR}/john-${JOHN_COMMIT}.tar.gz"
SRC_URL="https://github.com/openwall/john/archive/${JOHN_COMMIT}.tar.gz"
if [ -x "${CACHE_DIR}/run/john" ] && [ -f "${CACHE_DIR}/run/john.conf" ]; then
echo "=== john cached, skipping build ==="
echo "run dir: ${CACHE_DIR}/run"
exit 0
fi
mkdir -p "${DOWNLOAD_CACHE_DIR}"
if [ ! -f "${SRC_TAR}" ]; then
echo "=== downloading john source snapshot ==="
wget --show-progress -O "${SRC_TAR}" "${SRC_URL}"
fi
BUILD_TMP=$(mktemp -d)
trap 'rm -rf "${BUILD_TMP}"' EXIT INT TERM
cd "${BUILD_TMP}"
tar xf "${SRC_TAR}"
SRC_DIR=$(find . -maxdepth 1 -type d -name 'john-*' | head -1)
[ -n "${SRC_DIR}" ] || { echo "ERROR: john source directory not found"; exit 1; }
cd "${SRC_DIR}/src"
echo "=== configuring john ==="
./configure
echo "=== building john ==="
make clean >/dev/null 2>&1 || true
make -j"$(nproc)"
mkdir -p "${CACHE_DIR}"
cp -a "../run" "${CACHE_DIR}/run"
chmod +x "${CACHE_DIR}/run/john"
echo "=== john build complete ==="
echo "run dir: ${CACHE_DIR}/run"

View File

@@ -9,6 +9,7 @@
#
# Output layout:
# $CACHE_DIR/bin/all_reduce_perf
# $CACHE_DIR/lib/libcudart.so* copied from the nvcc toolchain used to build nccl-tests
set -e
@@ -30,7 +31,7 @@ CACHE_DIR="${DIST_DIR}/nccl-tests-${NCCL_TESTS_VERSION}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/nccl-tests-downloads"
if [ -f "${CACHE_DIR}/bin/all_reduce_perf" ]; then
if [ -f "${CACHE_DIR}/bin/all_reduce_perf" ] && [ "$(find "${CACHE_DIR}/lib" -maxdepth 1 -name 'libcudart.so*' 2>/dev/null | wc -l)" -gt 0 ]; then
echo "=== nccl-tests cached, skipping build ==="
echo "binary: ${CACHE_DIR}/bin/all_reduce_perf"
exit 0
@@ -52,6 +53,23 @@ echo "nvcc: $NVCC"
CUDA_HOME="$(dirname "$(dirname "$NVCC")")"
echo "CUDA_HOME: $CUDA_HOME"
find_cudart_dir() {
for dir in \
"${CUDA_HOME}/targets/x86_64-linux/lib" \
"${CUDA_HOME}/targets/x86_64-linux/lib/stubs" \
"${CUDA_HOME}/lib64" \
"${CUDA_HOME}/lib"; do
if [ -d "$dir" ] && find "$dir" -maxdepth 1 -name 'libcudart.so*' -type f | grep -q .; then
printf '%s\n' "$dir"
return 0
fi
done
return 1
}
CUDART_DIR="$(find_cudart_dir)" || { echo "ERROR: libcudart.so* not found under ${CUDA_HOME}"; exit 1; }
echo "cudart dir: $CUDART_DIR"
# Download libnccl-dev for nccl.h
REPO_BASE="https://developer.download.nvidia.com/compute/cuda/repos/debian${DEBIAN_VERSION}/x86_64"
DEV_PKG="libnccl-dev_${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}_amd64.deb"
@@ -136,6 +154,11 @@ mkdir -p "${CACHE_DIR}/bin"
cp "./build/all_reduce_perf" "${CACHE_DIR}/bin/all_reduce_perf"
chmod +x "${CACHE_DIR}/bin/all_reduce_perf"
mkdir -p "${CACHE_DIR}/lib"
find "${CUDART_DIR}" -maxdepth 1 -name 'libcudart.so*' -type f -exec cp -a {} "${CACHE_DIR}/lib/" \;
[ "$(find "${CACHE_DIR}/lib" -maxdepth 1 -name 'libcudart.so*' -type f | wc -l)" -gt 0 ] || { echo "ERROR: libcudart runtime copy failed"; exit 1; }
echo "=== nccl-tests build complete ==="
echo "binary: ${CACHE_DIR}/bin/all_reduce_perf"
ls -lh "${CACHE_DIR}/bin/all_reduce_perf"
ls -lh "${CACHE_DIR}/lib/"libcudart.so* 2>/dev/null || true

View File

@@ -10,7 +10,7 @@
# Output layout:
# $CACHE_DIR/modules/ — nvidia*.ko files
# $CACHE_DIR/bin/ — nvidia-smi, nvidia-debugdump
# $CACHE_DIR/lib/ — libnvidia-ml.so*, libcuda.so* (for nvidia-smi)
# $CACHE_DIR/lib/ — libnvidia-ml.so*, libcuda.so*, OpenCL-related libs
set -e
@@ -46,7 +46,10 @@ CACHE_DIR="${DIST_DIR}/nvidia-${NVIDIA_VERSION}-${KVER}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/nvidia-downloads"
EXTRACT_CACHE_DIR="${CACHE_ROOT}/nvidia-extract"
CACHE_LAYOUT_VERSION="2"
CACHE_LAYOUT_MARKER="${CACHE_DIR}/.cache-layout-v${CACHE_LAYOUT_VERSION}"
if [ -d "$CACHE_DIR/modules" ] && [ -f "$CACHE_DIR/bin/nvidia-smi" ] \
&& [ -f "$CACHE_LAYOUT_MARKER" ] \
&& [ "$(ls "$CACHE_DIR/lib/libnvidia-ptxjitcompiler.so."* 2>/dev/null | wc -l)" -gt 0 ]; then
echo "=== NVIDIA cached, skipping build ==="
echo "cache: $CACHE_DIR"
@@ -130,17 +133,30 @@ else
echo "WARNING: no firmware/ dir found in installer (may be needed for Hopper GPUs)"
fi
# Copy ALL userspace library files.
# libnvidia-ptxjitcompiler is required by libcuda for PTX JIT compilation
# (cuModuleLoadDataEx with PTX source) — without it CUDA_ERROR_JIT_COMPILER_NOT_FOUND.
for lib in libnvidia-ml libcuda libnvidia-ptxjitcompiler; do
count=0
for f in $(find "$EXTRACT_DIR" -maxdepth 1 -name "${lib}.so.*" 2>/dev/null); do
cp "$f" "$CACHE_DIR/lib/" && count=$((count+1))
done
if [ "$count" -eq 0 ]; then
echo "ERROR: ${lib}.so.* not found in $EXTRACT_DIR"
ls "$EXTRACT_DIR/"*.so* 2>/dev/null | head -20 || true
# Copy NVIDIA userspace libraries broadly instead of whitelisting a few names.
# Newer driver branches add extra runtime deps (for example OpenCL/compiler side
# libraries). If we only copy a narrow allowlist, clinfo/John can see nvidia.icd
# but still fail with "no OpenCL platforms" because one dependent .so is absent.
copied_libs=0
for f in $(find "$EXTRACT_DIR" -maxdepth 1 \( -name 'libnvidia*.so.*' -o -name 'libcuda.so.*' \) -type f 2>/dev/null | sort); do
cp "$f" "$CACHE_DIR/lib/"
copied_libs=$((copied_libs+1))
done
if [ "$copied_libs" -eq 0 ]; then
echo "ERROR: no NVIDIA userspace libraries found in $EXTRACT_DIR"
ls "$EXTRACT_DIR/"*.so* 2>/dev/null | head -40 || true
exit 1
fi
for lib in \
libnvidia-ml \
libcuda \
libnvidia-ptxjitcompiler \
libnvidia-opencl; do
if ! ls "$CACHE_DIR/lib/${lib}.so."* >/dev/null 2>&1; then
echo "ERROR: required ${lib}.so.* not found in extracted userspace libs"
ls "$CACHE_DIR/lib/" | sort >&2 || true
exit 1
fi
done
@@ -149,16 +165,17 @@ done
ko_count=$(ls "$CACHE_DIR/modules/"*.ko 2>/dev/null | wc -l)
[ "$ko_count" -gt 0 ] || { echo "ERROR: no .ko files built in $CACHE_DIR/modules/"; exit 1; }
# Create soname symlinks: use [0-9][0-9]* to avoid circular symlink (.so.1 has single digit)
for lib in libnvidia-ml libcuda libnvidia-ptxjitcompiler; do
versioned=$(ls "$CACHE_DIR/lib/${lib}.so."[0-9][0-9]* 2>/dev/null | head -1)
[ -n "$versioned" ] || continue
# Create soname symlinks for every copied versioned library.
for versioned in "$CACHE_DIR"/lib/*.so.*; do
[ -f "$versioned" ] || continue
base=$(basename "$versioned")
ln -sf "$base" "$CACHE_DIR/lib/${lib}.so.1"
ln -sf "${lib}.so.1" "$CACHE_DIR/lib/${lib}.so" 2>/dev/null || true
echo "${lib}: .so.1 -> $base"
stem=${base%%.so.*}
ln -sf "$base" "$CACHE_DIR/lib/${stem}.so.1"
ln -sf "${stem}.so.1" "$CACHE_DIR/lib/${stem}.so" 2>/dev/null || true
done
touch "$CACHE_LAYOUT_MARKER"
echo "=== NVIDIA build complete ==="
echo "cache: $CACHE_DIR"
echo "modules: $ko_count .ko files"

File diff suppressed because it is too large Load Diff

View File

@@ -10,12 +10,17 @@ echo " ╚══════╝╚═╝ ╚═╝╚══════╝
echo ""
menuentry "EASY-BEE" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (graphics/KMS)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (load to RAM)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
@@ -24,6 +29,11 @@ menuentry "EASY-BEE (NVIDIA GSP=off)" {
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (graphics/KMS, GSP=off)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (fail-safe)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=gsp-off memtest noapic noapm nodma nomce nolapic nosmp vga=normal net.ifnames=0 biosdevname=0
initrd @INITRD_LIVE@

View File

@@ -5,6 +5,12 @@ label live-@FLAVOUR@-normal
initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=normal
label live-@FLAVOUR@-kms
menu label EASY-BEE (^graphics/KMS)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=normal
label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM)
linux @LINUX@
@@ -17,8 +23,18 @@ label live-@FLAVOUR@-gsp-off
initrd @INITRD@
append @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off
label live-@FLAVOUR@-kms-gsp-off
menu label EASY-BEE (g^raphics/KMS, GSP=off)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=gsp-off
label live-@FLAVOUR@-failsafe
menu label EASY-BEE (^fail-safe)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=gsp-off memtest noapic noapm nodma nomce nolapic nosmp vga=normal
label memtest
menu label ^Memory Test (memtest86+)
linux /boot/memtest86+x64.bin

View File

@@ -5,6 +5,9 @@ set -e
echo "=== bee chroot setup ==="
GPU_VENDOR=$(cat /etc/bee-gpu-vendor 2>/dev/null || echo nvidia)
echo "=== GPU vendor: ${GPU_VENDOR} ==="
ensure_bee_console_user() {
if id bee >/dev/null 2>&1; then
usermod -d /home/bee -s /bin/bash bee 2>/dev/null || true
@@ -21,10 +24,8 @@ ensure_bee_console_user() {
ensure_bee_console_user
# Enable bee services
systemctl enable nvidia-dcgm.service 2>/dev/null || true
# Enable common bee services
systemctl enable bee-network.service
systemctl enable bee-nvidia.service
systemctl enable bee-preflight.service
systemctl enable bee-audit.service
systemctl enable bee-web.service
@@ -36,25 +37,37 @@ systemctl enable serial-getty@ttyS0.service 2>/dev/null || true
systemctl enable serial-getty@ttyS1.service 2>/dev/null || true
systemctl enable bee-journal-mirror@ttyS1.service 2>/dev/null || true
# Enable GPU-vendor specific services
if [ "$GPU_VENDOR" = "nvidia" ]; then
systemctl enable nvidia-dcgm.service 2>/dev/null || true
systemctl enable bee-nvidia.service
elif [ "$GPU_VENDOR" = "amd" ]; then
# ROCm symlinks (packages install to /opt/rocm-*/bin/)
for tool in rocm-smi rocm-bandwidth-test rvs; do
if [ ! -e /usr/local/bin/${tool} ]; then
bin_path="$(find /opt -path "*/bin/${tool}" -type f 2>/dev/null | sort | tail -1)"
[ -n "${bin_path}" ] && ln -sf "${bin_path}" /usr/local/bin/${tool}
fi
done
fi
# nogpu: no GPU services needed
# Ensure scripts are executable
chmod +x /usr/local/bin/bee-network.sh 2>/dev/null || true
chmod +x /usr/local/bin/bee-nvidia-load 2>/dev/null || true
chmod +x /usr/local/bin/bee-sshsetup 2>/dev/null || true
chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true
chmod +x /usr/local/bin/bee 2>/dev/null || true
chmod +x /usr/local/bin/bee-log-run 2>/dev/null || true
if [ "$GPU_VENDOR" = "nvidia" ]; then
chmod +x /usr/local/bin/bee-nvidia-load 2>/dev/null || true
chmod +x /usr/local/bin/bee-gpu-burn 2>/dev/null || true
chmod +x /usr/local/bin/bee-john-gpu-stress 2>/dev/null || true
chmod +x /usr/local/bin/bee-nccl-gpu-stress 2>/dev/null || true
fi
# Reload udev rules
udevadm control --reload-rules 2>/dev/null || true
# rocm symlinks (packages install to /opt/rocm-*/bin/)
for tool in rocm-smi rocm-bandwidth-test rvs; do
if [ ! -e /usr/local/bin/${tool} ]; then
bin_path="$(find /opt -path "*/bin/${tool}" -type f 2>/dev/null | sort | tail -1)"
[ -n "${bin_path}" ] && ln -sf "${bin_path}" /usr/local/bin/${tool}
fi
done
# Create export directory
mkdir -p /appdata/bee/export
@@ -62,4 +75,4 @@ if [ -f /etc/sudoers.d/bee ]; then
chmod 0440 /etc/sudoers.d/bee
fi
echo "=== bee chroot setup complete ==="
echo "=== bee chroot setup complete (${GPU_VENDOR}) ==="

View File

@@ -1,13 +1,139 @@
#!/bin/sh
# Copy memtest86+ binaries from chroot /boot into the ISO boot directory
# so GRUB can chainload them directly (they must be on the ISO filesystem,
# not inside the squashfs).
# Ensure memtest is present in the final ISO even if live-build's built-in
# memtest stage does not copy the binaries or expose menu entries.
set -e
for f in memtest86+x64.bin memtest86+x64.efi memtest86+ia32.bin memtest86+ia32.efi; do
src="chroot/boot/${f}"
if [ -f "${src}" ]; then
cp "${src}" "binary/boot/${f}"
echo "memtest: copied ${f} to binary/boot/"
: "${BEE_REQUIRE_MEMTEST:=0}"
MEMTEST_FILES="memtest86+x64.bin memtest86+x64.efi"
BINARY_BOOT_DIR="binary/boot"
GRUB_CFG="binary/boot/grub/grub.cfg"
ISOLINUX_CFG="binary/isolinux/live.cfg"
log() {
echo "memtest hook: $*"
}
fail_or_warn() {
msg="$1"
if [ "${BEE_REQUIRE_MEMTEST}" = "1" ]; then
log "ERROR: ${msg}"
exit 1
fi
done
log "WARNING: ${msg}"
return 0
}
copy_memtest_file() {
src="$1"
base="$(basename "$src")"
dst="${BINARY_BOOT_DIR}/${base}"
[ -f "$src" ] || return 1
mkdir -p "${BINARY_BOOT_DIR}"
cp "$src" "$dst"
log "copied ${base} from ${src}"
}
extract_memtest_from_deb() {
deb="$1"
tmpdir="$(mktemp -d)"
log "extracting memtest payload from ${deb}"
dpkg-deb -x "$deb" "$tmpdir"
for f in ${MEMTEST_FILES}; do
if [ -f "${tmpdir}/boot/${f}" ]; then
copy_memtest_file "${tmpdir}/boot/${f}"
fi
done
rm -rf "$tmpdir"
}
ensure_memtest_binaries() {
missing=0
for f in ${MEMTEST_FILES}; do
[ -f "${BINARY_BOOT_DIR}/${f}" ] || missing=1
done
[ "$missing" -eq 1 ] || return 0
for root in chroot/boot /boot; do
for f in ${MEMTEST_FILES}; do
[ -f "${BINARY_BOOT_DIR}/${f}" ] || copy_memtest_file "${root}/${f}" || true
done
done
missing=0
for f in ${MEMTEST_FILES}; do
[ -f "${BINARY_BOOT_DIR}/${f}" ] || missing=1
done
[ "$missing" -eq 1 ] || return 0
for root in cache chroot/var/cache/apt/archives /var/cache/apt/archives; do
[ -d "$root" ] || continue
deb="$(find "$root" -type f \( -name 'memtest86+_*.deb' -o -name 'memtest86+*.deb' \) 2>/dev/null | head -1)"
[ -n "$deb" ] || continue
extract_memtest_from_deb "$deb"
break
done
missing=0
for f in ${MEMTEST_FILES}; do
if [ ! -f "${BINARY_BOOT_DIR}/${f}" ]; then
fail_or_warn "missing ${BINARY_BOOT_DIR}/${f}"
missing=1
fi
done
[ "$missing" -eq 0 ] || return 0
}
ensure_grub_entry() {
[ -f "$GRUB_CFG" ] || {
fail_or_warn "missing ${GRUB_CFG}"
return 0
}
grep -q '### BEE MEMTEST ###' "$GRUB_CFG" && return 0
cat >> "$GRUB_CFG" <<'EOF'
### BEE MEMTEST ###
if [ "${grub_platform}" = "efi" ]; then
menuentry "Memory Test (memtest86+)" {
chainloader /boot/memtest86+x64.efi
}
else
menuentry "Memory Test (memtest86+)" {
linux16 /boot/memtest86+x64.bin
}
fi
### /BEE MEMTEST ###
EOF
log "appended memtest entry to ${GRUB_CFG}"
}
ensure_isolinux_entry() {
[ -f "$ISOLINUX_CFG" ] || {
fail_or_warn "missing ${ISOLINUX_CFG}"
return 0
}
grep -q '### BEE MEMTEST ###' "$ISOLINUX_CFG" && return 0
cat >> "$ISOLINUX_CFG" <<'EOF'
# ### BEE MEMTEST ###
label memtest
menu label ^Memory Test (memtest86+)
linux /boot/memtest86+x64.bin
# ### /BEE MEMTEST ###
EOF
log "appended memtest entry to ${ISOLINUX_CFG}"
}
log "ensuring memtest binaries and menu entries in binary image"
ensure_memtest_binaries
ensure_grub_entry
ensure_isolinux_entry
log "memtest assets ready"

View File

@@ -0,0 +1,12 @@
# AMD GPU firmware
firmware-amd-graphics
# AMD ROCm — GPU monitoring, bandwidth test, and compute stress (RVS GST)
rocm-smi-lib=%%ROCM_SMI_VERSION%%
rocm-bandwidth-test=%%ROCM_BANDWIDTH_TEST_VERSION%%
rocm-validation-suite=%%ROCM_VALIDATION_SUITE_VERSION%%
rocblas=%%ROCBLAS_VERSION%%
rocrand=%%ROCRAND_VERSION%%
hip-runtime-amd=%%HIP_RUNTIME_AMD_VERSION%%
hipblaslt=%%HIPBLASLT_VERSION%%
comgr=%%COMGR_VERSION%%

View File

@@ -0,0 +1 @@
# No GPU variant — no NVIDIA, no AMD/ROCm packages

View File

@@ -0,0 +1,8 @@
# NVIDIA DCGM (Data Center GPU Manager) — dcgmi diag for acceptance testing.
# DCGM 4 is packaged per CUDA major. The image ships NVIDIA driver 590 with CUDA 13 userspace,
# so install the CUDA 13 build plus proprietary diagnostic components explicitly.
datacenter-gpu-manager-4-cuda13=1:%%DCGM_VERSION%%
datacenter-gpu-manager-4-proprietary=1:%%DCGM_VERSION%%
datacenter-gpu-manager-4-proprietary-cuda13=1:%%DCGM_VERSION%%
ocl-icd-libopencl1
clinfo

View File

@@ -21,8 +21,15 @@ openssh-server
# Disk installer
squashfs-tools
parted
# Keep GRUB install tools without selecting a single active platform package.
# grub-pc and grub-efi-amd64 conflict with each other, but grub2-common
# provides grub-install/update-grub and the *-bin packages provide BIOS/UEFI modules.
grub2-common
grub-pc-bin
grub-efi-amd64-bin
grub-efi-amd64-signed
shim-signed
efibootmgr
# Filesystem support for USB export targets
exfatprogs
@@ -39,11 +46,11 @@ vim-tiny
mc
htop
nvtop
btop
sudo
zstd
mstflint
memtester
memtest86+
stress-ng
stressapptest
@@ -64,26 +71,11 @@ lightdm
firmware-linux-free
firmware-linux-nonfree
firmware-misc-nonfree
firmware-amd-graphics
firmware-realtek
firmware-intel-sound
firmware-bnx2
firmware-bnx2x
firmware-cavium
firmware-qlogic
# NVIDIA DCGM (Data Center GPU Manager) — dcgmi diag for acceptance testing
datacenter-gpu-manager=1:%%DCGM_VERSION%%
# AMD ROCm — GPU monitoring, bandwidth test, and compute stress (RVS GST)
rocm-smi-lib=%%ROCM_SMI_VERSION%%
rocm-bandwidth-test=%%ROCM_BANDWIDTH_TEST_VERSION%%
rocm-validation-suite=%%ROCM_VALIDATION_SUITE_VERSION%%
rocblas=%%ROCBLAS_VERSION%%
rocrand=%%ROCRAND_VERSION%%
hip-runtime-amd=%%HIP_RUNTIME_AMD_VERSION%%
hipblaslt=%%HIPBLASLT_VERSION%%
comgr=%%COMGR_VERSION%%
# glibc compat helpers (for any external binaries that need it)
libc6

View File

@@ -39,7 +39,7 @@ info "nvidia boot mode: ${NVIDIA_BOOT_MODE}"
# --- PATH & binaries ---
echo "-- PATH & binaries --"
for tool in dmidecode smartctl nvme ipmitool lspci bee; do
if p=$(PATH="/usr/local/bin:$PATH" command -v "$tool" 2>/dev/null); then
if p=$(PATH="/usr/local/bin:/usr/sbin:/sbin:$PATH" command -v "$tool" 2>/dev/null); then
ok "$tool found: $p"
else
fail "$tool: NOT FOUND"
@@ -52,6 +52,14 @@ else
fail "nvidia-smi: NOT FOUND"
fi
for tool in bee-gpu-burn bee-john-gpu-stress bee-nccl-gpu-stress all_reduce_perf; do
if p=$(PATH="/usr/local/bin:$PATH" command -v "$tool" 2>/dev/null); then
ok "$tool found: $p"
else
fail "$tool: NOT FOUND"
fi
done
echo ""
echo "-- NVIDIA modules --"
KO_DIR="/usr/local/lib/nvidia"
@@ -109,6 +117,40 @@ else
fail "nvidia-smi: not found in PATH"
fi
echo ""
echo "-- OpenCL / John --"
if [ -f /etc/OpenCL/vendors/nvidia.icd ]; then
ok "OpenCL ICD present: /etc/OpenCL/vendors/nvidia.icd"
else
fail "OpenCL ICD missing: /etc/OpenCL/vendors/nvidia.icd"
fi
if ldconfig -p 2>/dev/null | grep -q "libnvidia-opencl.so.1"; then
ok "libnvidia-opencl.so.1 present in linker cache"
else
fail "libnvidia-opencl.so.1 missing from linker cache"
fi
if command -v clinfo >/dev/null 2>&1; then
if clinfo -l 2>/dev/null | grep -q "Platform"; then
ok "clinfo: OpenCL platform detected"
else
fail "clinfo: no OpenCL platform detected"
fi
else
fail "clinfo: not found in PATH"
fi
if command -v john >/dev/null 2>&1; then
if john --list=opencl-devices 2>/dev/null | grep -q "Device #"; then
ok "john: OpenCL devices detected"
else
fail "john: no OpenCL devices detected"
fi
else
fail "john: not found in PATH"
fi
echo ""
echo "-- lib symlinks --"
for lib in libnvidia-ml libcuda; do

View File

@@ -1,4 +1,4 @@
export PATH="$PATH:/usr/local/bin:/opt/rocm/bin:/opt/rocm/sbin"
export PATH="$PATH:/usr/local/bin:/usr/sbin:/sbin:/opt/rocm/bin:/opt/rocm/sbin"
# Print web UI URLs on the local console at login.
if [ -z "${SSH_CONNECTION:-}" ] \

View File

@@ -1,14 +1,14 @@
[Unit]
Description=Bee: run hardware audit
After=bee-network.service bee-nvidia.service bee-preflight.service
Description=Bee: hardware audit
After=bee-preflight.service bee-network.service bee-nvidia.service
Before=bee-web.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-audit.log /bin/sh -c '/usr/local/bin/bee audit --runtime livecd --output file:/appdata/bee/export/bee-audit.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-audit] WARN: audit exited with rc=$rc"; fi; exit 0'
RemainAfterExit=yes
ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-audit.log /usr/local/bin/bee audit --runtime auto --output file:/appdata/bee/export/bee-audit.json
StandardOutput=journal
StandardError=journal
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

View File

@@ -1,7 +1,6 @@
[Unit]
Description=Bee: hardware audit web viewer
After=bee-network.service
Wants=bee-audit.service
After=bee-audit.service
[Service]
Type=simple
@@ -11,6 +10,9 @@ RestartSec=2
StandardOutput=journal
StandardError=journal
LimitMEMLOCK=infinity
# Keep the web server responsive during GPU/CPU stress (children inherit nice+10
# via Setpriority in runCmdJob, but the bee-web parent stays at 0).
Nice=0
[Install]
WantedBy=multi-user.target

View File

@@ -0,0 +1,6 @@
[Unit]
Wants=bee-preflight.service
After=bee-preflight.service
[Service]
ExecStartPre=/usr/local/bin/bee-display-mode

View File

@@ -4,3 +4,6 @@
RestartSec=10
StartLimitIntervalSec=60
StartLimitBurst=3
# Raise scheduling priority of the X server so the graphical console (KVM/IPMI)
# stays responsive during GPU/CPU stress tests running at nice+10.
Nice=-5

View File

@@ -0,0 +1,54 @@
#!/bin/sh
# Select Xorg display mode based on kernel cmdline.
# Default is the current server-safe path: keep forced fbdev.
set -eu
cmdline_param() {
key="$1"
for token in $(cat /proc/cmdline 2>/dev/null); do
case "$token" in
"$key"=*)
echo "${token#*=}"
return 0
;;
esac
done
return 1
}
log() {
echo "bee-display-mode: $*"
}
mode="$(cmdline_param bee.display || true)"
if [ -z "$mode" ]; then
mode="safe"
fi
xorg_dir="/etc/X11/xorg.conf.d"
fbdev_conf="${xorg_dir}/10-fbdev.conf"
fbdev_park="${xorg_dir}/10-fbdev.conf.disabled"
mkdir -p "$xorg_dir"
case "$mode" in
kms|auto)
if [ -f "$fbdev_conf" ]; then
mv "$fbdev_conf" "$fbdev_park"
log "mode=${mode}; disabled forced fbdev config"
else
log "mode=${mode}; fbdev config already disabled"
fi
;;
safe|fbdev|"")
if [ -f "$fbdev_park" ] && [ ! -f "$fbdev_conf" ]; then
mv "$fbdev_park" "$fbdev_conf"
log "mode=${mode}; restored forced fbdev config"
else
log "mode=${mode}; keeping forced fbdev config"
fi
;;
*)
log "unknown bee.display=${mode}; keeping forced fbdev config"
;;
esac

View File

@@ -0,0 +1,102 @@
#!/bin/sh
set -eu
SECONDS=5
SIZE_MB=0
DEVICES=""
EXCLUDE=""
WORKER="/usr/local/lib/bee/bee-gpu-burn-worker"
usage() {
echo "usage: $0 [--seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3]" >&2
exit 2
}
normalize_list() {
echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}
contains_csv() {
needle="$1"
haystack="${2:-}"
echo ",${haystack}," | grep -q ",${needle},"
}
while [ "$#" -gt 0 ]; do
case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
*) usage ;;
esac
done
[ -x "${WORKER}" ] || { echo "bee-gpu-burn worker not found: ${WORKER}" >&2; exit 1; }
ALL_DEVICES=$(nvidia-smi --query-gpu=index --format=csv,noheader,nounits 2>/dev/null | sed 's/[[:space:]]//g' | awk 'NF' | paste -sd, -)
[ -n "${ALL_DEVICES}" ] || { echo "nvidia-smi found no NVIDIA GPUs" >&2; exit 1; }
DEVICES=$(normalize_list "${DEVICES}")
EXCLUDE=$(normalize_list "${EXCLUDE}")
SELECTED="${DEVICES}"
if [ -z "${SELECTED}" ]; then
SELECTED="${ALL_DEVICES}"
fi
FINAL=""
for id in $(echo "${SELECTED}" | tr ',' ' '); do
[ -n "${id}" ] || continue
if contains_csv "${id}" "${EXCLUDE}"; then
continue
fi
if [ -z "${FINAL}" ]; then
FINAL="${id}"
else
FINAL="${FINAL},${id}"
fi
done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
echo "loader=bee-gpu-burn"
echo "selected_gpus=${FINAL}"
TMP_DIR=$(mktemp -d)
trap 'rm -rf "${TMP_DIR}"' EXIT INT TERM
WORKERS=""
for id in $(echo "${FINAL}" | tr ',' ' '); do
log="${TMP_DIR}/gpu-${id}.log"
gpu_size_mb="${SIZE_MB}"
if [ "${gpu_size_mb}" -le 0 ] 2>/dev/null; then
total_mb=$(nvidia-smi --id="${id}" --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | tr -d '[:space:]')
if [ -n "${total_mb}" ] && [ "${total_mb}" -gt 0 ] 2>/dev/null; then
gpu_size_mb=$(( total_mb * 95 / 100 ))
else
gpu_size_mb=512
fi
fi
echo "starting gpu ${id} size=${gpu_size_mb}MB"
"${WORKER}" --device "${id}" --seconds "${SECONDS}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 &
pid=$!
WORKERS="${WORKERS} ${pid}:${id}:${log}"
done
status=0
for spec in ${WORKERS}; do
pid=${spec%%:*}
rest=${spec#*:}
id=${rest%%:*}
log=${rest#*:}
if wait "${pid}"; then
echo "gpu ${id} finished: OK"
else
rc=$?
echo "gpu ${id} finished: FAILED rc=${rc}"
status=1
fi
sed "s/^/[gpu ${id}] /" "${log}" || true
done
exit "${status}"

View File

@@ -12,17 +12,55 @@
set -euo pipefail
usage() {
cat >&2 <<'EOF'
Usage: bee-install <device> [logfile]
Installs the live system to a local disk (WIPES the target).
device Target block device, e.g. /dev/sda or /dev/nvme0n1
Must be a hard disk or NVMe — NOT a CD-ROM (/dev/sr*)
logfile Optional path for progress log (default: /tmp/bee-install.log)
Examples:
bee-install /dev/sda
bee-install /dev/nvme0n1
bee-install /dev/sdb /tmp/my-install.log
WARNING: ALL DATA ON <device> WILL BE ERASED.
Layout (UEFI): GPT — partition 1: EFI 512MB vfat, partition 2: root ext4
Layout (BIOS): MBR — partition 1: root ext4
EOF
exit 1
}
DEVICE="${1:-}"
LOGFILE="${2:-/tmp/bee-install.log}"
if [ -z "$DEVICE" ]; then
echo "Usage: bee-install <device> [logfile]" >&2
exit 1
if [ -z "$DEVICE" ] || [ "$DEVICE" = "--help" ] || [ "$DEVICE" = "-h" ]; then
usage
fi
if [ ! -b "$DEVICE" ]; then
echo "ERROR: $DEVICE is not a block device" >&2
echo "Run 'lsblk' to list available disks." >&2
exit 1
fi
# Block CD-ROM devices
case "$DEVICE" in
/dev/sr*|/dev/scd*)
echo "ERROR: $DEVICE is a CD-ROM/optical device — cannot install to it." >&2
echo "Run 'lsblk' to find the target disk (e.g. /dev/sda, /dev/nvme0n1)." >&2
exit 1
;;
esac
# Check required tools
for tool in parted mkfs.vfat mkfs.ext4 unsquashfs grub-install update-grub; do
if ! command -v "$tool" >/dev/null 2>&1; then
echo "ERROR: required tool not found: $tool" >&2
exit 1
fi
done
SQUASHFS="/run/live/medium/live/filesystem.squashfs"
if [ ! -f "$SQUASHFS" ]; then
@@ -158,20 +196,56 @@ mount --bind /sys "${MOUNT_ROOT}/sys"
# ------------------------------------------------------------------
log "--- Step 7/7: Installing GRUB bootloader ---"
# Helper: run a chroot command, log all output, return its exit code.
# Needed because "cmd | while" pipelines hide the exit code of cmd.
chroot_log() {
local rc=0
local out
out=$(chroot "$MOUNT_ROOT" "$@" 2>&1) || rc=$?
echo "$out" | while IFS= read -r line; do log " $line"; done
return $rc
}
if [ "$UEFI" = "1" ]; then
chroot "$MOUNT_ROOT" grub-install \
--target=x86_64-efi \
--efi-directory=/boot/efi \
--bootloader-id=bee \
--recheck 2>&1 | while read -r line; do log " $line"; done || true
# Primary attempt: write EFI NVRAM entry (requires writable efivars)
if ! chroot_log grub-install \
--target=x86_64-efi \
--efi-directory=/boot/efi \
--bootloader-id=bee \
--recheck; then
log " WARNING: grub-install (with NVRAM) failed — retrying with --no-nvram"
# --no-nvram: write grubx64.efi but skip EFI variable update.
# Needed on headless servers where efivars is read-only or unavailable.
chroot_log grub-install \
--target=x86_64-efi \
--efi-directory=/boot/efi \
--bootloader-id=bee \
--no-nvram \
--recheck || log " WARNING: grub-install --no-nvram also failed — check logs"
fi
# Always install the UEFI fallback path EFI/BOOT/BOOTX64.EFI.
# Many UEFI implementations (especially server BMCs and some firmware)
# ignore the NVRAM boot entry and only look for this path.
GRUB_EFI="${MOUNT_ROOT}/boot/efi/EFI/bee/grubx64.efi"
FALLBACK_DIR="${MOUNT_ROOT}/boot/efi/EFI/BOOT"
if [ -f "$GRUB_EFI" ]; then
mkdir -p "$FALLBACK_DIR"
cp "$GRUB_EFI" "${FALLBACK_DIR}/BOOTX64.EFI"
log " Fallback EFI binary installed: EFI/BOOT/BOOTX64.EFI"
else
log " WARNING: grubx64.efi not found at $GRUB_EFI — UEFI fallback path not set"
fi
else
chroot "$MOUNT_ROOT" grub-install \
chroot_log grub-install \
--target=i386-pc \
--recheck \
"$DEVICE" 2>&1 | while read -r line; do log " $line"; done || true
"$DEVICE" || log " WARNING: grub-install (BIOS) failed — check logs"
fi
chroot "$MOUNT_ROOT" update-grub 2>&1 | while read -r line; do log " $line"; done || true
log " GRUB installed."
chroot_log update-grub || log " WARNING: update-grub failed — check logs"
log " GRUB step complete."
# ------------------------------------------------------------------
# Cleanup

View File

@@ -0,0 +1,205 @@
#!/bin/sh
set -eu
SECONDS=300
DEVICES=""
EXCLUDE=""
FORMAT=""
JOHN_DIR="/usr/local/lib/bee/john/run"
JOHN_BIN="${JOHN_DIR}/john"
export OCL_ICD_VENDORS="/etc/OpenCL/vendors"
export LD_LIBRARY_PATH="/usr/lib:/usr/local/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
usage() {
echo "usage: $0 [--seconds N] [--devices 0,1] [--exclude 2,3] [--format name]" >&2
exit 2
}
normalize_list() {
echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}
contains_csv() {
needle="$1"
haystack="${2:-}"
echo ",${haystack}," | grep -q ",${needle},"
}
show_opencl_diagnostics() {
echo "-- OpenCL ICD vendors --" >&2
if [ -d /etc/OpenCL/vendors ]; then
ls -l /etc/OpenCL/vendors >&2 || true
for icd in /etc/OpenCL/vendors/*.icd; do
[ -f "${icd}" ] || continue
echo " file: ${icd}" >&2
sed 's/^/ /' "${icd}" >&2 || true
done
else
echo " /etc/OpenCL/vendors is missing" >&2
fi
echo "-- NVIDIA device nodes --" >&2
ls -l /dev/nvidia* >&2 || true
echo "-- ldconfig OpenCL/NVIDIA --" >&2
ldconfig -p 2>/dev/null | grep 'libOpenCL\|libcuda\|libnvidia-opencl' >&2 || true
if command -v clinfo >/dev/null 2>&1; then
echo "-- clinfo -l --" >&2
clinfo -l >&2 || true
fi
echo "-- john --list=opencl-devices --" >&2
./john --list=opencl-devices >&2 || true
}
refresh_nvidia_runtime() {
if [ "$(id -u)" != "0" ]; then
return 1
fi
if command -v bee-nvidia-load >/dev/null 2>&1; then
bee-nvidia-load >/dev/null 2>&1 || true
fi
ldconfig >/dev/null 2>&1 || true
return 0
}
ensure_nvidia_uvm() {
if lsmod 2>/dev/null | grep -q '^nvidia_uvm '; then
return 0
fi
if [ "$(id -u)" != "0" ]; then
return 1
fi
ko="/usr/local/lib/nvidia/nvidia-uvm.ko"
[ -f "${ko}" ] || return 1
if ! insmod "${ko}" >/dev/null 2>&1; then
return 1
fi
uvm_major=$(grep -m1 ' nvidia-uvm$' /proc/devices | awk '{print $1}')
if [ -n "${uvm_major}" ]; then
mknod -m 666 /dev/nvidia-uvm c "${uvm_major}" 0 2>/dev/null || true
mknod -m 666 /dev/nvidia-uvm-tools c "${uvm_major}" 1 2>/dev/null || true
fi
return 0
}
ensure_opencl_ready() {
out=$(./john --list=opencl-devices 2>&1 || true)
if echo "${out}" | grep -q "Device #"; then
return 0
fi
if refresh_nvidia_runtime; then
out=$(./john --list=opencl-devices 2>&1 || true)
if echo "${out}" | grep -q "Device #"; then
return 0
fi
fi
if ensure_nvidia_uvm; then
out=$(./john --list=opencl-devices 2>&1 || true)
if echo "${out}" | grep -q "Device #"; then
return 0
fi
fi
echo "OpenCL devices are not available for John." >&2
if ! lsmod 2>/dev/null | grep -q '^nvidia_uvm '; then
echo "nvidia_uvm is not loaded." >&2
fi
if [ ! -e /dev/nvidia-uvm ]; then
echo "/dev/nvidia-uvm is missing." >&2
fi
show_opencl_diagnostics
return 1
}
while [ "$#" -gt 0 ]; do
case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
--format) [ "$#" -ge 2 ] || usage; FORMAT="$2"; shift 2 ;;
*) usage ;;
esac
done
[ -x "${JOHN_BIN}" ] || { echo "john binary not found: ${JOHN_BIN}" >&2; exit 1; }
ALL_DEVICES=$(nvidia-smi --query-gpu=index --format=csv,noheader,nounits 2>/dev/null | sed 's/[[:space:]]//g' | awk 'NF' | paste -sd, -)
[ -n "${ALL_DEVICES}" ] || { echo "nvidia-smi found no NVIDIA GPUs" >&2; exit 1; }
DEVICES=$(normalize_list "${DEVICES}")
EXCLUDE=$(normalize_list "${EXCLUDE}")
SELECTED="${DEVICES}"
if [ -z "${SELECTED}" ]; then
SELECTED="${ALL_DEVICES}"
fi
FINAL=""
for id in $(echo "${SELECTED}" | tr ',' ' '); do
[ -n "${id}" ] || continue
if contains_csv "${id}" "${EXCLUDE}"; then
continue
fi
if [ -z "${FINAL}" ]; then
FINAL="${id}"
else
FINAL="${FINAL},${id}"
fi
done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
JOHN_DEVICES=""
for id in $(echo "${FINAL}" | tr ',' ' '); do
opencl_id=$((id + 1))
if [ -z "${JOHN_DEVICES}" ]; then
JOHN_DEVICES="${opencl_id}"
else
JOHN_DEVICES="${JOHN_DEVICES},${opencl_id}"
fi
done
echo "loader=john"
echo "selected_gpus=${FINAL}"
echo "john_devices=${JOHN_DEVICES}"
cd "${JOHN_DIR}"
ensure_opencl_ready || exit 1
choose_format() {
if [ -n "${FORMAT}" ]; then
echo "${FORMAT}"
return 0
fi
for candidate in sha512crypt-opencl pbkdf2-hmac-sha512-opencl 7z-opencl sha256crypt-opencl md5crypt-opencl; do
if ./john --test=1 --format="${candidate}" --devices="${JOHN_DEVICES}" >/dev/null 2>&1; then
echo "${candidate}"
return 0
fi
done
return 1
}
CHOSEN_FORMAT=$(choose_format) || {
echo "no suitable john OpenCL format found" >&2
./john --list=opencl-devices >&2 || true
exit 1
}
echo "format=${CHOSEN_FORMAT}"
PIDS=""
_first=1
for opencl_id in $(echo "${JOHN_DEVICES}" | tr ',' ' '); do
[ "${_first}" = "1" ] || sleep 3
_first=0
./john --test="${SECONDS}" --format="${CHOSEN_FORMAT}" --devices="${opencl_id}" &
PIDS="${PIDS} $!"
done
FAIL=0
for pid in ${PIDS}; do
wait "${pid}" || FAIL=$((FAIL+1))
done
[ "${FAIL}" -eq 0 ] || { echo "john: ${FAIL} device(s) failed" >&2; exit 1; }

View File

@@ -17,7 +17,7 @@ mkdir -p "$(dirname "$log_file")"
serial_sink() {
local tty="$1"
if [ -w "$tty" ]; then
cat > "$tty"
cat > "$tty" 2>/dev/null || true
else
cat > /dev/null
fi

View File

@@ -0,0 +1,91 @@
#!/bin/sh
set -eu
SECONDS=300
DEVICES=""
EXCLUDE=""
MIN_BYTES="512M"
MAX_BYTES="4G"
FACTOR="2"
ITERS="20"
ALL_REDUCE_BIN="/usr/local/bin/all_reduce_perf"
usage() {
echo "usage: $0 [--seconds N] [--devices 0,1] [--exclude 2,3]" >&2
exit 2
}
normalize_list() {
echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}
contains_csv() {
needle="$1"
haystack="${2:-}"
echo ",${haystack}," | grep -q ",${needle},"
}
while [ "$#" -gt 0 ]; do
case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
*) usage ;;
esac
done
[ -x "${ALL_REDUCE_BIN}" ] || { echo "all_reduce_perf not found: ${ALL_REDUCE_BIN}" >&2; exit 1; }
ALL_DEVICES=$(nvidia-smi --query-gpu=index --format=csv,noheader,nounits 2>/dev/null | sed 's/[[:space:]]//g' | awk 'NF' | paste -sd, -)
[ -n "${ALL_DEVICES}" ] || { echo "nvidia-smi found no NVIDIA GPUs" >&2; exit 1; }
DEVICES=$(normalize_list "${DEVICES}")
EXCLUDE=$(normalize_list "${EXCLUDE}")
SELECTED="${DEVICES}"
if [ -z "${SELECTED}" ]; then
SELECTED="${ALL_DEVICES}"
fi
FINAL=""
for id in $(echo "${SELECTED}" | tr ',' ' '); do
[ -n "${id}" ] || continue
if contains_csv "${id}" "${EXCLUDE}"; then
continue
fi
if [ -z "${FINAL}" ]; then
FINAL="${id}"
else
FINAL="${FINAL},${id}"
fi
done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
GPU_COUNT=$(echo "${FINAL}" | tr ',' '\n' | awk 'NF' | wc -l | awk '{print $1}')
[ "${GPU_COUNT}" -gt 0 ] || { echo "selected GPU count is zero" >&2; exit 1; }
echo "loader=nccl"
echo "selected_gpus=${FINAL}"
echo "gpu_count=${GPU_COUNT}"
echo "range=${MIN_BYTES}..${MAX_BYTES}"
echo "iters=${ITERS}"
deadline=$(( $(date +%s) + SECONDS ))
round=0
while :; do
now=$(date +%s)
if [ "${now}" -ge "${deadline}" ]; then
break
fi
round=$((round + 1))
remaining=$((deadline - now))
echo "round=${round} remaining_sec=${remaining}"
CUDA_VISIBLE_DEVICES="${FINAL}" \
"${ALL_REDUCE_BIN}" \
-b "${MIN_BYTES}" \
-e "${MAX_BYTES}" \
-f "${FACTOR}" \
-g "${GPU_COUNT}" \
--iters "${ITERS}"
done

View File

@@ -6,25 +6,66 @@ LOG_PREFIX="bee-network"
log() { echo "[$LOG_PREFIX] $*"; }
# find physical interfaces: exclude lo and virtual (docker/virbr/veth/tun/tap)
interfaces=$(ip -o link show \
| awk -F': ' '{print $2}' \
| grep -v '^lo$' \
| grep -vE '^(docker|virbr|veth|tun|tap|br-|bond|dummy)' \
| sort)
list_interfaces() {
ip -o link show \
| awk -F': ' '{print $2}' \
| grep -v '^lo$' \
| grep -vE '^(docker|virbr|veth|tun|tap|br-|bond|dummy)' \
| sort
}
if [ -z "$interfaces" ]; then
# Give udev a short chance to expose late NICs before the first scan.
if command -v udevadm >/dev/null 2>&1; then
udevadm settle --timeout=5 >/dev/null 2>&1 || log "WARN: udevadm settle timed out"
fi
started_ifaces=""
started_count=0
scan_pass=1
# Some server NICs appear a bit later after module/firmware init. Do a small
# bounded rescan window without turning network bring-up into a boot blocker.
while [ "$scan_pass" -le 3 ]; do
interfaces=$(list_interfaces)
if [ -n "$interfaces" ]; then
for iface in $interfaces; do
case " $started_ifaces " in
*" $iface "*) continue ;;
esac
log "bringing up $iface"
if ! ip link set "$iface" up; then
log "WARN: could not bring up $iface"
continue
fi
carrier=$(cat "/sys/class/net/$iface/carrier" 2>/dev/null || true)
if [ "$carrier" = "1" ]; then
log "carrier detected on $iface"
else
log "carrier not detected yet on $iface"
fi
# DHCP in background — non-blocking, keep dhclient verbose output in the service log.
dhclient -4 -v -nw "$iface" &
log "DHCP started for $iface (pid $!)"
started_ifaces="$started_ifaces $iface"
started_count=$((started_count + 1))
done
fi
if [ "$scan_pass" -ge 3 ]; then
break
fi
scan_pass=$((scan_pass + 1))
sleep 2
done
if [ "$started_count" -eq 0 ]; then
log "no physical interfaces found"
exit 0
fi
for iface in $interfaces; do
log "bringing up $iface"
ip link set "$iface" up || { log "WARN: could not bring up $iface"; continue; }
# DHCP in background — non-blocking, keep dhclient verbose output in the service log.
dhclient -4 -v -nw "$iface" &
log "DHCP started for $iface (pid $!)"
done
log "done"
log "done (interfaces started: $started_count)"

View File

@@ -59,15 +59,28 @@ load_module() {
return 1
}
load_host_module() {
mod="$1"
if modprobe "$mod" >/dev/null 2>&1; then
log "host module loaded: $mod"
return 0
fi
return 1
}
case "$nvidia_mode" in
normal|full)
if ! load_module nvidia; then
exit 1
fi
# nvidia-modeset on some server kernels needs ACPI video helper symbols
# exported by the generic "video" module. Best-effort only; compute paths
# remain functional even if display-related modules stay absent.
load_host_module video || true
load_module nvidia-modeset || true
load_module nvidia-uvm || true
;;
gsp-off|safe|*)
gsp-off|safe)
# NVIDIA documents that GSP firmware is enabled by default on newer GPUs and can
# be disabled via NVreg_EnableGpuFirmware=0. Safe mode keeps the live ISO on the
# conservative path for platforms where full boot-time GSP init is unstable.
@@ -76,6 +89,15 @@ case "$nvidia_mode" in
fi
log "GSP-off mode: skipping nvidia-modeset and nvidia-uvm during boot"
;;
nomsi|*)
# nomsi: disable MSI-X/MSI interrupts — use when RmInitAdapter fails with
# "Failed to enable MSI-X" on one or more GPUs (IOMMU group interrupt limits).
# NVreg_EnableMSI=0 forces legacy INTx interrupts for all GPUs.
if ! load_module nvidia NVreg_EnableGpuFirmware=0 NVreg_EnableMSI=0; then
exit 1
fi
log "nomsi mode: MSI-X disabled (NVreg_EnableMSI=0), skipping nvidia-modeset and nvidia-uvm"
;;
esac
# Create /dev/nvidia* device nodes (udev rules absent since we use .run installer)
@@ -105,4 +127,19 @@ fi
ldconfig 2>/dev/null || true
log "ldconfig refreshed"
# Start DCGM host engine so dcgmi can discover GPUs.
# nv-hostengine must run before any dcgmi command — without it, dcgmi reports
# "group is empty" even when GPUs and modules are present.
# Skip if already running (e.g. started by a dcgm systemd service or prior boot).
if command -v nv-hostengine >/dev/null 2>&1; then
if pgrep -x nv-hostengine >/dev/null 2>&1; then
log "nv-hostengine already running — skipping"
else
nv-hostengine
log "nv-hostengine started"
fi
else
log "WARN: nv-hostengine not found — dcgmi diagnostics will not work"
fi
log "done"

View File

@@ -8,20 +8,23 @@ xset -dpms
xset s noblank
tint2 &
# Wait for bee-web to bind (Go starts fast, usually <2s)
# Wait up to 120s for bee-web to bind. The web server starts immediately now
# (audit is deferred), so this should succeed in a few seconds on most hardware.
i=0
while [ $i -lt 30 ]; do
while [ $i -lt 120 ]; do
if curl -sf http://localhost/healthz >/dev/null 2>&1; then break; fi
sleep 1
i=$((i+1))
done
chromium \
--disable-infobars \
--disable-translate \
--no-first-run \
--disable-session-crashed-bubble \
--disable-features=TranslateUI \
--start-fullscreen \
--start-maximized \
http://localhost/ &
exec openbox

View File

@@ -3,6 +3,11 @@
# Type 'a' at any prompt to abort, 'b' to go back.
set -e
# Requires root for ip/dhclient/resolv.conf — re-exec via sudo if needed.
if [ "$(id -u)" -ne 0 ]; then
exec sudo "$0" "$@"
fi
abort() { echo "Aborted."; exit 0; }
ask() {