HPL 2.3 from netlib compiled against OpenBLAS with a minimal
single-process MPI stub — no MPI package required in the ISO.
Matrix size is auto-sized to 80% of total RAM at runtime.
Build:
- VERSIONS: HPL_VERSION=2.3, HPL_SHA256=32c5c17d…
- build-hpl.sh: downloads HPL + OpenBLAS from Debian 12 repo,
compiles xhpl with a self-contained mpi_stub.c
- build.sh: step 80-hpl, injects xhpl + libopenblas into overlay
Runtime:
- bee-hpl: generates HPL.dat (N auto from /proc/meminfo, NB=256,
P=1 Q=1), runs xhpl, prints standard WR... Gflops output
- platform/hpl.go: RunHPL(), parses WR line → GFlops + PASSED/FAILED
- tasks.go: target "hpl"
- pages.go: LINPACK (HPL) card in validate/stress grid (stress-only)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace diag level 1-4 dropdown with Validate/Stress radio buttons
- Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short
- Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended
- Parallel GPU mode: spawn single task for all GPUs instead of splitting per model
- Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel
- Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts
- Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading')
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-gpu-stress.c: remove per-wave cuCtxSynchronize barrier in both
cuBLASLt and PTX hot loops; sync at most once/sec so the GPU queue
stays continuously full — eliminates the CPU↔GPU ping-pong that
prevented reaching full TDP
- sat_fan_stress.go: default SizeMB 0 (auto = 95% VRAM) instead of
hardcoded 64 MB; tiny matrices caused <0.1 ms kernels where CPU
re-queue overhead dominated
- pages.go: move nvidia-targeted-power and nvidia-pulse from Burn →
Validate stress section alongside nvidia-targeted-stress; these are
DCGM pass/fail diagnostics, not sustained burn loads; remove the
Power Delivery / Power Budget card from Burn entirely
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add parallel GPU mode (checkbox, off by default): runs all selected GPUs
simultaneously via a single bee-gpu-burn invocation instead of sequentially;
per-GPU telemetry, throttle counters, TOPS, and scoring are preserved
- Make queryBenchmarkGPUInfo resilient: falls back to a base field set when
extended fields (attribute.multiprocessor_count, power.default_limit) cause
exit status 2, preventing lgc normalization from being silently skipped
- Log explicit "graphics clock lock skipped" note when inventory is unavailable
- Collect server model from DMI (/sys/class/dmi/id/product_name) and store in
result JSON; benchmark history columns now show "Server Model (N× GPU Model)"
grouped by server+GPU type rather than individual GPU index
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PowerSustainScore now uses DefaultPowerLimitW as reference so a
manually reduced power limit does not inflate the score. Falls back
to enforced limit if default is unavailable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-nvidia-load: run insmod in background, poll /proc/devices for
nvidiactl; if GSP init doesn't complete in 90s, kill insmod and retry
with NVreg_EnableGpuFirmware=0. Handles EBUSY case with clear error.
- Write /run/bee-nvidia-mode (gsp-on/gsp-off/gsp-stuck) for audit layer
- Show GSP mode badge in sidebar: yellow for gsp-off, red for gsp-stuck
- Report NvidiaGSPMode in RuntimeHealth with issue entries
- Simplify GRUB menu: default (KMS+GSP), advanced submenu (GSP=off,
nomodeset, fail-safe), remove load-to-RAM entry
- Add pcmanfm, ristretto, mupdf, mousepad to desktop packages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- services.go: use sudo systemctl so bee user can control system services
- api.go: always return 200 with output field even on error, so the
frontend shows the actual systemctl message instead of "exit status 1"
- pages.go: button shows "..." while pending then restores label;
output panel is full-width under the table with ✓/✗ status indicator;
output auto-scrolls to bottom
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bee-gpu-burn now receives --seconds <LoadSec> so it exits naturally
when the cycle ends, rather than relying solely on context cancellation
to kill it. Process group kill (Setpgid+Cancel) is kept as a safety net
for early cancellation (user stop, context timeout). Same fix for AMD
RVS which now gets duration_ms = LoadSec * 1000.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvvs (DCGM validation suite) survives when dcgmi is killed mid-run,
leaving the GPU occupied. The next dcgmi diag invocation then fails
with "affected resource is in use".
Two-part fix:
- Add nvvs and dcgmi to KillTestWorkers patterns so they are cleaned
up by the global cancel handler
- Call KillTestWorkers at the start of RunNvidiaTargetedStressValidatePack
to clear any stale processes before dcgmi diag runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bee-gpu-burn is a shell script that spawns bee-gpu-burn-worker children.
exec.CommandContext default cancel only kills the shell parent; the worker
processes survive and keep loading the GPU indefinitely.
Fix: set Setpgid=true and a custom Cancel that sends SIGKILL to the
entire process group (-pid), same pattern already used in runSATCommandCtx.
Applied to Nvidia, AMD, and CPU stress commands for consistency.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add platform/error_patterns.go: pluggable table of kernel log patterns
(NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct
- Add app/component_status_db.go: persistent JSON store (component-status.json)
keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never
downgrades Warning or Critical
- Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks,
writes Warning to DB for matched hardware errors
- Fix task status: overall_status=FAILED in summary.txt now marks task failed
- Audit routine overlays component DB statuses into bee-audit.json on every read
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
crashes mid-test (each restart was launching a new set of 8 workers on top
of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
POST /api/tasks/kill-workers which cancels the task queue then kills
orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
(e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
exec.CommandContext only kills the direct child (the shell script), leaving
grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT
job runs in its own process group, then send SIGKILL to the whole group
(-pgid) in the Cancel hook.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
field to selectively run CPU and/or GPU load goroutines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs CPU (stressapptest) + GPU stress simultaneously across multiple
load/idle cycles with varying idle durations (120s/60s/30s) to detect
cooling systems that fail to recover under repeated load.
Presets: smoke (~5 min), acceptance (~25 min), overnight (~100 min).
Outputs metrics.csv + summary.txt with per-cycle throttle and fan
spindown analysis, packed as tar.gz.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rvs was not in PATH so the stress job exited immediately (UNSUPPORTED).
Now resolveRVSCommand searches /opt/rocm-*/bin/rvs before failing.
Also add a Copy button overlay on all .terminal elements and set
user-select:text so logs can be copied from the web UI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Legend names were "GPU 0 %" — remove unit suffix since chart title already
conveys it. Fan parsing now handles the 5-field IPMI SDR format where the
value+unit ("4340 RPM") are combined in the last column rather than split
across separate fields.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MI250X outputs 7 temperature columns before power/use%; positional parsing
read junction temp (~40°C) as GPU utilisation. Switch to header-based
colIdx() lookup so the correct fields are read regardless of column order
or rocm-smi version.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Tccd1-8 (AMD CCD die temps) now classified as 'cpu' group,
appear on CPU Temperature chart instead of ambient
- Fan RPM card hidden when no fans detected
- Remove CPU Load/Mem Load/Power from fan table (have dedicated charts)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add rocm-validation-suite, rocblas, rocrand, hip-runtime-amd,
hipblaslt, comgr to ISO (~700MB, needed for HIP compute)
- RunAMDStressPack: run RVS GST (SGEMM ~31 TFLOPS/GPU) + bandwidth test
- Add rvs symlink in chroot setup hook
- Pin all new package versions in VERSIONS
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ip route show includes state flags like 'linkdown' that ip route add
does not accept, causing restore to fail.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Task queue: all SAT/audit jobs enqueue and run one-at-a-time;
tasks persist past page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
Dashboard hardware summary badges + split metrics charts (load/temp/power);
Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies squashfs to /dev/shm, re-associates loop devices
via LOOP_CHANGE_FD ioctl so live media can be ejected
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expose the existing bee-install script through the web UI:
- platform/install.go: remove USB exclusion, add SizeBytes/MountedParts
fields, add MinInstallBytes()/DiskWarnings() safety checks (size,
mounted partitions, toram+low-RAM warning)
- webui: add GET /api/install/disks, POST /api/install/run,
GET /api/install/stream endpoints
- webui: add Install to Disk page with disk table, warning badges,
device-name confirmation gate, SSE progress terminal, reboot button
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- One engine: go-analyze/charts (grafana theme) for all live metrics
- Server chart: CPU temp, CPU load%, mem load%, power W, fan RPMs
- GPU charts: temp, load%, mem%, power W — one card per GPU, added dynamically
- Charts 1400x280px SVG, rendered at width:100% in single-column layout
- Add CPU load (from /proc/stat) and mem load (from /proc/meminfo) to LiveMetricSample
- Add GPU mem utilization to GPUMetricRow (nvidia-smi utilization.memory)
- Document charting architecture in bible-local/architecture/charting.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>