- Replace diag level 1-4 dropdown with Validate/Stress radio buttons
- Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short
- Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended
- Parallel GPU mode: spawn single task for all GPUs instead of splitting per model
- Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel
- Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts
- Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading')
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a "Multi-GPU tests — use all GPUs" checkbox to the NVIDIA GPU
selector (checked by default). When enabled, PSU Pulse, NCCL, and
NVBandwidth tests run on every GPU in the system regardless of the
per-GPU selection above — which is required for correct PSU stress
testing (synchronous pulses across all GPUs create worst-case
transients). When unchecked, only the manually selected GPUs are used.
The same logic applies both to Run All (expandSATTarget) and to the
individual Run button on each multi-GPU test card.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pulse_test is a PSU/power-delivery test, not a per-GPU compute test.
Its purpose is to synchronously pulse all GPUs between idle and full
load to create worst-case transient spikes on the power supply.
Running it one GPU at a time would produce a fraction of the PSU load
and miss any PSU-level failures.
- Move nvidia-pulse from nvidiaPerGPUTargets to nvidiaAllGPUTargets
(same dispatch path as NCCL and NVBandwidth)
- Change card onclick to runNvidiaFabricValidate (all selected GPUs at once)
- Update card title to "NVIDIA PSU Pulse Test" and description to
explain why synchronous multi-GPU execution is required
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvidia-interconnect (NCCL all_reduce_perf) and nvidia-bandwidth
(NVBandwidth) verify fabric connectivity and bandwidth — they are
not sustained burn loads. Move both from the Burn section to the
Validate section under the stress-mode toggle, alongside the other
DCGM diagnostic tests moved in the previous commit.
- Add sat-card-nvidia-interconnect and sat-card-nvidia-bandwidth
validate cards (stress-only, all selected GPUs at once)
- Add runNvidiaFabricValidate() for all-GPU-at-once dispatch
- Add nvidiaAllGPUTargets handling in expandSATTarget/runAllSAT
- Remove Interconnect / Bandwidth card from Burn section
- Remove nvidia-interconnect and nvidia-bandwidth from runAllBurnTasks
and the gpu/tools availability map
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-gpu-stress.c: remove per-wave cuCtxSynchronize barrier in both
cuBLASLt and PTX hot loops; sync at most once/sec so the GPU queue
stays continuously full — eliminates the CPU↔GPU ping-pong that
prevented reaching full TDP
- sat_fan_stress.go: default SizeMB 0 (auto = 95% VRAM) instead of
hardcoded 64 MB; tiny matrices caused <0.1 ms kernels where CPU
re-queue overhead dominated
- pages.go: move nvidia-targeted-power and nvidia-pulse from Burn →
Validate stress section alongside nvidia-targeted-stress; these are
DCGM pass/fail diagnostics, not sustained burn loads; remove the
Power Delivery / Power Budget card from Burn entirely
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add parallel GPU mode (checkbox, off by default): runs all selected GPUs
simultaneously via a single bee-gpu-burn invocation instead of sequentially;
per-GPU telemetry, throttle counters, TOPS, and scoring are preserved
- Make queryBenchmarkGPUInfo resilient: falls back to a base field set when
extended fields (attribute.multiprocessor_count, power.default_limit) cause
exit status 2, preventing lgc normalization from being silently skipped
- Log explicit "graphics clock lock skipped" note when inventory is unavailable
- Collect server model from DMI (/sys/class/dmi/id/product_name) and store in
result JSON; benchmark history columns now show "Server Model (N× GPU Model)"
grouped by server+GPU type rather than individual GPU index
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-nvidia-load: run insmod in background, poll /proc/devices for
nvidiactl; if GSP init doesn't complete in 90s, kill insmod and retry
with NVreg_EnableGpuFirmware=0. Handles EBUSY case with clear error.
- Write /run/bee-nvidia-mode (gsp-on/gsp-off/gsp-stuck) for audit layer
- Show GSP mode badge in sidebar: yellow for gsp-off, red for gsp-stuck
- Report NvidiaGSPMode in RuntimeHealth with issue entries
- Simplify GRUB menu: default (KMS+GSP), advanced submenu (GSP=off,
nomodeset, fail-safe), remove load-to-RAM entry
- Add pcmanfm, ristretto, mupdf, mousepad to desktop packages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- services.go: use sudo systemctl so bee user can control system services
- api.go: always return 200 with output field even on error, so the
frontend shows the actual systemctl message instead of "exit status 1"
- pages.go: button shows "..." while pending then restores label;
output panel is full-width under the table with ✓/✗ status indicator;
output auto-scrolls to bottom
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add bee-boot-status service: shows live service status on tty1 with
ASCII logo before getty, exits when all bee services settle
- Remove lightdm dependency on bee-preflight so GUI starts immediately
without waiting for NVIDIA driver load
- Replace Chromium blank-page problem with /loading spinner page that
polls /api/services and auto-redirects when services are ready; add
"Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome
- Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu,
bee-boot-status (with yellow ASCII logo), and web spinner
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tasks are now started simultaneously when multiple are enqueued (e.g.
Run All). The worker drains all pending tasks at once and launches each
in its own goroutine, waiting via WaitGroup. kmsg watcher updated to
use a shared event window with a reference counter across concurrent tasks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add platform/error_patterns.go: pluggable table of kernel log patterns
(NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct
- Add app/component_status_db.go: persistent JSON store (component-status.json)
keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never
downgrades Warning or Critical
- Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks,
writes Warning to DB for matched hardware errors
- Fix task status: overall_status=FAILED in summary.txt now marks task failed
- Audit routine overlays component DB statuses into bee-audit.json on every read
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Support bundle is now built on-the-fly when the user clicks the button,
regardless of whether other tasks are running:
- GET /export/support.tar.gz builds the bundle synchronously and streams it
directly to the client; the temp archive is removed after serving
- Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue
approach meant the bundle could only be downloaded after navigating away
and back, and was blocked entirely while a long SAT test was running
- UI: single "Download Support Bundle" button; fetch+blob gives a loading
state ("Building...") while the server collects logs, then triggers the
browser download with the correct filename from Content-Disposition
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
crashes mid-test (each restart was launching a new set of 8 workers on top
of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
POST /api/tasks/kill-workers which cancels the task queue then kills
orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
(e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
field to selectively run CPU and/or GPU load goroutines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>