Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.
Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run — no resource-busy contention
from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove telemetry-guided initial candidate; use strict binary search
midpoint at every step. Clean and predictable convergence in O(log N)
attempts within the allowed power range [minLimitW, startingLimitW].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power calibration previously stepped down 25 W at a time (linear),
requiring up to 6 attempts to find a stable limit within 150 W range.
New strategy:
- Binary search between minLimitW (lo, assumed stable floor) and the
starting/failed limit (hi, confirmed unstable), converging within a
10 W tolerance in ~4 attempts.
- For thermal throttle: the first-quarter telemetry rows estimate the
GPU's pre-throttle power draw. nextLimit = round5W(onset - 10 W) is
used as the initial candidate instead of the binary midpoint, landing
much closer to the true limit on the first step.
- On success: lo is updated and a higher level is tried (binary search
upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is
found rather than the first stable one.
- Let targeted_power run to natural completion on throttle (no mid-run
SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before
the next attempt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During power calibration: if a thermal throttle (sw_thermal/hw_thermal)
causes ≥20% clock drop while server fans are below 98% P95 duty cycle,
record a CoolingWarning on the GPU result and emit an actionable finding
telling the operator to rerun with fans manually fixed at 100%.
During steady-state benchmark: same signal enriches the existing
thermal_limited finding with fan duty cycle and clock drift values.
Covers both the main benchmark (buildBenchmarkFindings) and the power
bench (NvidiaPowerBenchResult.Findings).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a targeted_power attempt is cancelled (e.g. after sw_thermal
throttle), nv-hostengine holds the diagnostic slot asynchronously.
The next attempt immediately received DCGM_ST_IN_USE (exit 222)
and incorrectly derated the power limit.
Now: exit 222 is detected via isDCGMResourceBusy and triggers an
exponential back-off retry at the same power limit (1s, 2s, 4s, …
up to 256s). Once the back-off delay would exceed 300s the
calibration fails, indicating the slot is persistently held.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
USB Export Drive:
lsblk reports TRAN only for whole disks, not partitions (/dev/sdc1).
Strip trailing partition digits to get parent disk before transport check.
LiveCD in RAM:
When RunInstallToRAM copies squashfs to /dev/shm/bee-live/ but bind-mount
of /run/live/medium fails (CD-ROM boots), /run/live/medium still shows the
CD-ROM fstype. Add fallback: if /dev/shm/bee-live/*.squashfs exists, the
data is in RAM — report status OK.
Dashboard Hardware Summary:
Show server Manufacturer + ProductName as heading and S/N as subline above
the component table, sourced from hw.Board (dmidecode system-type data).
Validate:
Remove Cycles input — always run once. cycles=1 hardcoded in runAllSAT().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NCCL all_reduce is always attempted when 2+ GPUs are selected; a failure
leaves InterconnectScore=0 (no bonus, no penalty) and OverallStatus
unaffected. Exposing the checkbox implied NCCL is optional and made a
failed run look like a deliberate skip.
- Remove benchmark-run-nccl checkbox and its change listener from pages.go
- Client sends run_nccl: selected.length > 1 (automatic)
- api.go default runNCCL=true is unchanged
- Selection note now mentions NCCL automatically for multi-GPU runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
factor = ratio/0.75, applied per-GPU with a note explaining the reduction
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
shows scalability_score when non-zero
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
- Default Power Limit → DefaultPowerLimitW
- Current Power Limit → PowerLimitW (fallback)
- Multiprocessor Count → MultiprocessorCount
Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.
Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.
Adjust composite score weights to match the new honest power reference:
base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
erratic (stddev >12%), with a plain-English description in the report.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new checkbox (enabled by default) in the benchmark section.
In ramp-up mode N tasks are spawned simultaneously: 1 GPU, then 2,
then 3, up to all selected GPUs — each step runs its GPUs in parallel.
NCCL runs only on the final step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire gpu inventory query
to fail and clock lock to be skipped → normalization: partial.
Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- grub.cfg + isolinux/live.cfg.in: add pcie_aspm=off,
intel_idle.max_cstate=1 and processor.max_cstate=1 to all
non-failsafe boot entries
- bee-hpc-tuning: new script that sets all CPU cores to performance
governor via sysfs and logs THP state at boot
- bee-hpc-tuning.service: runs before bee-nvidia and bee-audit
- 9000-bee-setup.hook.chroot: enable service and mark script executable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
and ErrorDescription when current link speed is below maximum negotiated
speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- schema: add ToRAMStatus and USBExportPath fields to RuntimeHealth
- platform/runtime.go: collectToRAMHealth (ok/warning/failed based on
IsLiveMediaInRAM + toramActive) and collectUSBExportHealth (scans
/proc/mounts + lsblk for writable USB-backed filesystems)
- pages.go: add USB Export Drive and LiveCD in RAM rows to the health table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
→ 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
.tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues:
1. BMC/management VGA chips (e.g. Huawei iBMC Hi171x, ASPEED) were included
in GPU inventory because shouldIncludePCIeDevice only checked the PCI class,
not the device name. Added a name-based filter for known BMC/management
patterns when the class is VGA/display/3d.
2. New NVIDIA GPUs (e.g. RTX PRO 6000 Blackwell, device ID 2bb5) showed as
"Device 2bb5" because lspci's database lags behind. Added "name" to the
nvidia-smi query and use it to override dev.Model during enrichment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
hpl was listed in baseTargets and stressOnlyTargets but /api/sat/hpl/run
was never registered, causing a 405 Method Not Allowed (not valid JSON)
error when Validate one by one was triggered in stress mode.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Runtime Health now shows only LiveCD system status (services, tools,
drivers, network, CUDA/ROCm) — hardware component rows removed
- Hardware Summary now shows server components with readable descriptions
(model, count×size) and component-status.json health badges
- Add Network Adapters row to Hardware Summary
- SFP module static info (vendor, PN, SN, connector, type, wavelength)
now collected via ethtool -m regardless of carrier state
- PSU statuses from IPMI audit written to component-status.json so PSU
badge shows actual status after first audit instead of UNKNOWN
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HPL 2.3 from netlib compiled against OpenBLAS with a minimal
single-process MPI stub — no MPI package required in the ISO.
Matrix size is auto-sized to 80% of total RAM at runtime.
Build:
- VERSIONS: HPL_VERSION=2.3, HPL_SHA256=32c5c17d…
- build-hpl.sh: downloads HPL + OpenBLAS from Debian 12 repo,
compiles xhpl with a self-contained mpi_stub.c
- build.sh: step 80-hpl, injects xhpl + libopenblas into overlay
Runtime:
- bee-hpl: generates HPL.dat (N auto from /proc/meminfo, NB=256,
P=1 Q=1), runs xhpl, prints standard WR... Gflops output
- platform/hpl.go: RunHPL(), parses WR line → GFlops + PASSED/FAILED
- tasks.go: target "hpl"
- pages.go: LINPACK (HPL) card in validate/stress grid (stress-only)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>