USB Export Drive:
lsblk reports TRAN only for whole disks, not partitions (/dev/sdc1).
Strip trailing partition digits to get parent disk before transport check.
LiveCD in RAM:
When RunInstallToRAM copies squashfs to /dev/shm/bee-live/ but bind-mount
of /run/live/medium fails (CD-ROM boots), /run/live/medium still shows the
CD-ROM fstype. Add fallback: if /dev/shm/bee-live/*.squashfs exists, the
data is in RAM — report status OK.
Dashboard Hardware Summary:
Show server Manufacturer + ProductName as heading and S/N as subline above
the component table, sourced from hw.Board (dmidecode system-type data).
Validate:
Remove Cycles input — always run once. cycles=1 hardcoded in runAllSAT().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
factor = ratio/0.75, applied per-GPU with a note explaining the reduction
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
shows scalability_score when non-zero
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
- Default Power Limit → DefaultPowerLimitW
- Current Power Limit → PowerLimitW (fallback)
- Multiprocessor Count → MultiprocessorCount
Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.
Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.
Adjust composite score weights to match the new honest power reference:
base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
erratic (stddev >12%), with a plain-English description in the report.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire gpu inventory query
to fail and clock lock to be skipped → normalization: partial.
Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
and ErrorDescription when current link speed is below maximum negotiated
speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- schema: add ToRAMStatus and USBExportPath fields to RuntimeHealth
- platform/runtime.go: collectToRAMHealth (ok/warning/failed based on
IsLiveMediaInRAM + toramActive) and collectUSBExportHealth (scans
/proc/mounts + lsblk for writable USB-backed filesystems)
- pages.go: add USB Export Drive and LiveCD in RAM rows to the health table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
→ 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
.tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HPL 2.3 from netlib compiled against OpenBLAS with a minimal
single-process MPI stub — no MPI package required in the ISO.
Matrix size is auto-sized to 80% of total RAM at runtime.
Build:
- VERSIONS: HPL_VERSION=2.3, HPL_SHA256=32c5c17d…
- build-hpl.sh: downloads HPL + OpenBLAS from Debian 12 repo,
compiles xhpl with a self-contained mpi_stub.c
- build.sh: step 80-hpl, injects xhpl + libopenblas into overlay
Runtime:
- bee-hpl: generates HPL.dat (N auto from /proc/meminfo, NB=256,
P=1 Q=1), runs xhpl, prints standard WR... Gflops output
- platform/hpl.go: RunHPL(), parses WR line → GFlops + PASSED/FAILED
- tasks.go: target "hpl"
- pages.go: LINPACK (HPL) card in validate/stress grid (stress-only)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace diag level 1-4 dropdown with Validate/Stress radio buttons
- Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short
- Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended
- Parallel GPU mode: spawn single task for all GPUs instead of splitting per model
- Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel
- Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts
- Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading')
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-gpu-stress.c: remove per-wave cuCtxSynchronize barrier in both
cuBLASLt and PTX hot loops; sync at most once/sec so the GPU queue
stays continuously full — eliminates the CPU↔GPU ping-pong that
prevented reaching full TDP
- sat_fan_stress.go: default SizeMB 0 (auto = 95% VRAM) instead of
hardcoded 64 MB; tiny matrices caused <0.1 ms kernels where CPU
re-queue overhead dominated
- pages.go: move nvidia-targeted-power and nvidia-pulse from Burn →
Validate stress section alongside nvidia-targeted-stress; these are
DCGM pass/fail diagnostics, not sustained burn loads; remove the
Power Delivery / Power Budget card from Burn entirely
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add parallel GPU mode (checkbox, off by default): runs all selected GPUs
simultaneously via a single bee-gpu-burn invocation instead of sequentially;
per-GPU telemetry, throttle counters, TOPS, and scoring are preserved
- Make queryBenchmarkGPUInfo resilient: falls back to a base field set when
extended fields (attribute.multiprocessor_count, power.default_limit) cause
exit status 2, preventing lgc normalization from being silently skipped
- Log explicit "graphics clock lock skipped" note when inventory is unavailable
- Collect server model from DMI (/sys/class/dmi/id/product_name) and store in
result JSON; benchmark history columns now show "Server Model (N× GPU Model)"
grouped by server+GPU type rather than individual GPU index
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PowerSustainScore now uses DefaultPowerLimitW as reference so a
manually reduced power limit does not inflate the score. Falls back
to enforced limit if default is unavailable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-nvidia-load: run insmod in background, poll /proc/devices for
nvidiactl; if GSP init doesn't complete in 90s, kill insmod and retry
with NVreg_EnableGpuFirmware=0. Handles EBUSY case with clear error.
- Write /run/bee-nvidia-mode (gsp-on/gsp-off/gsp-stuck) for audit layer
- Show GSP mode badge in sidebar: yellow for gsp-off, red for gsp-stuck
- Report NvidiaGSPMode in RuntimeHealth with issue entries
- Simplify GRUB menu: default (KMS+GSP), advanced submenu (GSP=off,
nomodeset, fail-safe), remove load-to-RAM entry
- Add pcmanfm, ristretto, mupdf, mousepad to desktop packages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- services.go: use sudo systemctl so bee user can control system services
- api.go: always return 200 with output field even on error, so the
frontend shows the actual systemctl message instead of "exit status 1"
- pages.go: button shows "..." while pending then restores label;
output panel is full-width under the table with ✓/✗ status indicator;
output auto-scrolls to bottom
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bee-gpu-burn now receives --seconds <LoadSec> so it exits naturally
when the cycle ends, rather than relying solely on context cancellation
to kill it. Process group kill (Setpgid+Cancel) is kept as a safety net
for early cancellation (user stop, context timeout). Same fix for AMD
RVS which now gets duration_ms = LoadSec * 1000.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvvs (DCGM validation suite) survives when dcgmi is killed mid-run,
leaving the GPU occupied. The next dcgmi diag invocation then fails
with "affected resource is in use".
Two-part fix:
- Add nvvs and dcgmi to KillTestWorkers patterns so they are cleaned
up by the global cancel handler
- Call KillTestWorkers at the start of RunNvidiaTargetedStressValidatePack
to clear any stale processes before dcgmi diag runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bee-gpu-burn is a shell script that spawns bee-gpu-burn-worker children.
exec.CommandContext default cancel only kills the shell parent; the worker
processes survive and keep loading the GPU indefinitely.
Fix: set Setpgid=true and a custom Cancel that sends SIGKILL to the
entire process group (-pid), same pattern already used in runSATCommandCtx.
Applied to Nvidia, AMD, and CPU stress commands for consistency.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add platform/error_patterns.go: pluggable table of kernel log patterns
(NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct
- Add app/component_status_db.go: persistent JSON store (component-status.json)
keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never
downgrades Warning or Critical
- Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks,
writes Warning to DB for matched hardware errors
- Fix task status: overall_status=FAILED in summary.txt now marks task failed
- Audit routine overlays component DB statuses into bee-audit.json on every read
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
crashes mid-test (each restart was launching a new set of 8 workers on top
of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
POST /api/tasks/kill-workers which cancels the task queue then kills
orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
(e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
exec.CommandContext only kills the direct child (the shell script), leaving
grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT
job runs in its own process group, then send SIGKILL to the whole group
(-pgid) in the Cancel hook.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
field to selectively run CPU and/or GPU load goroutines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>