Commit Graph

27 Commits

Author SHA1 Message Date
a5e0261ff2 Refactor power ramp to use true single-card baselines
Phase 1 now calibrates each GPU individually (sequentially) so that
PowerRealizationPct reflects real degradation from neighbour thermals and
shared power rails. Previously the baseline came from an all-GPU-together
run, making realization always ≈100% at the final ramp step.

Ramp step 1 reuses single-card calibration results (no extra run); steps
2..N run targeted_power on the growing GPU subset with derating active.

Remove OccupiedSlots/OccupiedSlotsNote fields and occupiedSlots() helper —
they were compensation for the old all-GPU calibration approach.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 23:47:57 +03:00
3cf2e9c9dc Run power calibration for all GPUs simultaneously
Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.

Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run — no resource-busy contention
  from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
  card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:25:05 +03:00
19dbabd71d Simplify power calibration: pure binary search, no telemetry guessing
Remove telemetry-guided initial candidate; use strict binary search
midpoint at every step. Clean and predictable convergence in O(log N)
attempts within the allowed power range [minLimitW, startingLimitW].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:12:45 +03:00
a6a07f2626 Replace linear power derate with binary search + telemetry-guided jump
Power calibration previously stepped down 25 W at a time (linear),
requiring up to 6 attempts to find a stable limit within 150 W range.

New strategy:
- Binary search between minLimitW (lo, assumed stable floor) and the
  starting/failed limit (hi, confirmed unstable), converging within a
  10 W tolerance in ~4 attempts.
- For thermal throttle: the first-quarter telemetry rows estimate the
  GPU's pre-throttle power draw. nextLimit = round5W(onset - 10 W) is
  used as the initial candidate instead of the binary midpoint, landing
  much closer to the true limit on the first step.
- On success: lo is updated and a higher level is tried (binary search
  upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is
  found rather than the first stable one.
- Let targeted_power run to natural completion on throttle (no mid-run
  SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before
  the next attempt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:05:23 +03:00
f87461ee4a Detect thermal throttle with fans below 100% as cooling misconfiguration
During power calibration: if a thermal throttle (sw_thermal/hw_thermal)
causes ≥20% clock drop while server fans are below 98% P95 duty cycle,
record a CoolingWarning on the GPU result and emit an actionable finding
telling the operator to rerun with fans manually fixed at 100%.

During steady-state benchmark: same signal enriches the existing
thermal_limited finding with fan duty cycle and clock drift values.

Covers both the main benchmark (buildBenchmarkFindings) and the power
bench (NvidiaPowerBenchResult.Findings).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 21:44:57 +03:00
a636146dbd Fix power calibration failing due to DCGM resource contention
When a targeted_power attempt is cancelled (e.g. after sw_thermal
throttle), nv-hostengine holds the diagnostic slot asynchronously.
The next attempt immediately received DCGM_ST_IN_USE (exit 222)
and incorrectly derated the power limit.

Now: exit 222 is detected via isDCGMResourceBusy and triggers an
exponential back-off retry at the same power limit (1s, 2s, 4s, …
up to 256s). Once the back-off delay would exceed 300s the
calibration fails, indicating the slot is persistently held.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 20:41:17 +03:00
Mikhail Chusavitin
303de2df04 Add slot-aware ramp sequence to bee-bench power 2026-04-14 17:47:40 +03:00
Mikhail Chusavitin
95124d228f Split bee-bench into perf and power workflows 2026-04-14 17:33:13 +03:00
Mikhail Chusavitin
2be7ae6d28 Refine NVIDIA benchmark phase timing 2026-04-14 14:12:06 +03:00
Mikhail Chusavitin
8fc986c933 Add benchmark fan duty cycle summary to report 2026-04-14 10:24:02 +03:00
Mikhail Chusavitin
88b5e0edf2 Harden IPMI power probe timeout 2026-04-14 10:18:23 +03:00
457ea1cf04 Unify benchmark exports and drop ASCII charts 2026-04-13 21:38:28 +03:00
bf6ecab4f0 Add per-precision benchmark phases, weighted TOPS scoring, and ECC tracking
- Split steady window into 6 equal slots: fp8/fp16/fp32/fp64/fp4 + combined
- Each precision phase runs bee-gpu-burn with --precision filter so PowerCVPct reflects single-kernel stability (not round-robin artifact)
- Add fp4 support in bee-gpu-stress.c for Blackwell (cc>=100) via existing CUDA_R_4F_E2M1 guard
- Weighted TOPS: fp64×2.0, fp32×1.0, fp16×0.5, fp8×0.25, fp4×0.125
- SyntheticScore = sum of weighted TOPS from per-precision phases
- MixedScore = sum from combined phase; MixedEfficiency = Mixed/Synthetic
- ComputeScore = SyntheticScore × (1 + MixedEfficiency × 0.3)
- ECC volatile counters sampled before/after each phase and overall
- DegradationReasons: ecc_uncorrected_errors, ecc_corrected_errors
- Report: per-precision stability table with ECC columns, methodology section
- Ramp-up history table redesign: GPU indices as columns, runs as rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 10:49:49 +03:00
813e2f86a9 Add scalability/ramp-up labeling, ServerPower penalty in scoring, and report improvements
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
  NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
  by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
  tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
  factor = ratio/0.75, applied per-GPU with a note explaining the reduction
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
  shows scalability_score when non-zero

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:30:47 +03:00
58a6da9b44 Recover power limits and SM count from nvidia-smi -q in enrichGPUInfo
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
  - Default Power Limit  → DefaultPowerLimitW
  - Current Power Limit  → PowerLimitW (fallback)
  - Multiprocessor Count → MultiprocessorCount

Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:17:56 +03:00
f4a19c0a00 Add power calibration step to benchmark; fix PowerSustainScore reference
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.

Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.

Adjust composite score weights to match the new honest power reference:
  base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:06:46 +03:00
9e3dcf9b4d Record host CPU/RAM config in benchmark results; check CPU load
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
  total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
  the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
  erratic (stddev >12%), with a plain-English description in the report.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:02:04 +03:00
Mikhail Chusavitin
525ed8b8fc Fix GPU clock lock normalization for Blackwell (clocks.max.* unsupported)
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire gpu inventory query
to fail and clock lock to be skipped → normalization: partial.

Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
  power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
  nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
  MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
  is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:33:54 +03:00
ba16021cdb Fix GPU model propagation, export filenames, PSU/service status, and chart perf
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
  nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
  → 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
  activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
  .tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:05:27 +03:00
b2f8626fee Refactor validate modes, fix benchmark report and IPMI power
- Replace diag level 1-4 dropdown with Validate/Stress radio buttons
- Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short
- Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended
- Parallel GPU mode: spawn single task for all GPUs instead of splitting per model
- Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel
- Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts
- Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading')

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:42:12 +03:00
Mikhail Chusavitin
93cfa78e8c Benchmark: parallel GPU mode, resilient inventory query, server model in results
- Add parallel GPU mode (checkbox, off by default): runs all selected GPUs
  simultaneously via a single bee-gpu-burn invocation instead of sequentially;
  per-GPU telemetry, throttle counters, TOPS, and scoring are preserved
- Make queryBenchmarkGPUInfo resilient: falls back to a base field set when
  extended fields (attribute.multiprocessor_count, power.default_limit) cause
  exit status 2, preventing lgc normalization from being silently skipped
- Log explicit "graphics clock lock skipped" note when inventory is unavailable
- Collect server model from DMI (/sys/class/dmi/id/product_name) and store in
  result JSON; benchmark history columns now show "Server Model (N× GPU Model)"
  grouped by server+GPU type rather than individual GPU index

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 18:32:15 +03:00
8fe20ba678 Fix benchmark scoring: PowerSustain uses default power limit
PowerSustainScore now uses DefaultPowerLimitW as reference so a
manually reduced power limit does not inflate the score. Falls back
to enforced limit if default is unavailable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:30:59 +03:00
d973231f37 Enhance benchmark: server power via IPMI, efficiency metrics, FP64, power limit check
- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
  compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
  from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:26:52 +03:00
Mikhail Chusavitin
a98c4d7461 Include terminal charts in benchmark report 2026-04-06 12:34:57 +03:00
Mikhail Chusavitin
35f4c53887 Stabilize NVIDIA GPU device mapping across loaders 2026-04-06 12:22:04 +03:00
20abff7f90 WIP: checkpoint current tree 2026-04-05 12:05:00 +03:00
bf47c8dbd2 Add NVIDIA benchmark reporting flow 2026-04-05 10:30:56 +03:00