- Sample server power (IPMI dcmi) during baseline+steady phases in parallel; compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmark clock calibration research
Status
In progress. Baseline data from production servers pending.
Background
The benchmark locks GPU clocks to MaxGraphicsClockMHz (boost) via nvidia-smi -lgc
before the steady-state phase. The metric low_sm_clock_vs_target fires when
avg_steady_clock < locked_target * 0.90.
Problem: boost clock is the theoretical maximum under ideal cooling. In practice, even a healthy GPU in a non-ideal server will sustain clocks well below boost. The 90% threshold has no empirical basis.
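As a minimal sketch of the rule described above, the check reduces to a single comparison. The function and parameter names are illustrative, not the benchmark's actual code. It is fed the two H100 PCIe averages recorded later in these notes (1384 MHz throttled, about 1600 MHz healthy).

```python
def low_sm_clock_vs_target(avg_steady_clock_mhz: float,
                           locked_target_mhz: float,
                           threshold: float = 0.90) -> bool:
    """Return True when the average steady clock is below threshold * locked target."""
    return avg_steady_clock_mhz < locked_target_mhz * threshold


BOOST_MHZ = 1755  # lock target used for H100 PCIe in these notes

# Throttled H100 PCIe in a non-designed server: 1384 MHz average.
throttled_fires = low_sm_clock_vs_target(1384, BOOST_MHZ)
# Healthy H100 PCIe HBM2e in a designed server: roughly 1600 MHz average.
healthy_fires = low_sm_clock_vs_target(1600, BOOST_MHZ)

print(throttled_fires, healthy_fires)  # True False
```

The cutoff for a 1755 MHz target is 1579.5 MHz, so the low end of the healthy 1580–1620 MHz range clears it by less than 1 MHz. That is how little margin the 90% threshold leaves.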
Key observations (2026-04-06)
H100 PCIe — new card, server not designed for it
- avg clock 1384 MHz, P95 1560 MHz (unstable; boost is probably 1755 MHz)
- Thermal sustain: 0.0 (sw_thermal covers entire steady window)
- Stability: 70.0 — clocks erratic, no equilibrium found
- Degradation: power_capped, thermal_limited, low_sm_clock_vs_target, variance_too_high
H200 NVL — new card, server not designed for it
- avg clock = P95 = 1635 MHz (perfectly stable)
- Thermal sustain: 0.0 (sw_thermal + sw_power cover entire steady window)
- Stability: 92.0 — found stable thermal equilibrium at 1635 MHz
- Degradation: power_capped, thermal_limited
- Compute: 989 TOPS — card is computing correctly for its frequency
Key insight
The meaningful distinction is not whether the card throttles but how stably it throttles. H200 found a thermal equilibrium (avg == P95, Stability 92), H100 did not (avg << P95, Stability 70). Both are new cards; the H100's instability may reflect a more severe thermal mismatch or a card issue.
sw_power ≈ sw_thermal pattern = server cooling constraint, card likely OK.
hw_thermal >> sw_thermal pattern = card itself overheating, investigate.
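The two patterns above could be turned into a coarse classifier along these lines. This is a sketch only: the inputs are assumed to be throttle-reason shares expressed as a percentage of the steady window, and `similar_tol_pct` and `dominance_factor` are hypothetical cutoffs standing in for "≈" and ">>", which the notes do not quantify.

```python
def classify_throttle_pattern(sw_power_pct: float,
                              sw_thermal_pct: float,
                              hw_thermal_pct: float,
                              similar_tol_pct: float = 20.0,
                              dominance_factor: float = 2.0) -> str:
    """Map throttle-reason shares of the steady window to a coarse diagnosis.

    similar_tol_pct and dominance_factor are illustrative assumptions.
    """
    # hw_thermal dominating sw_thermal: the card itself is overheating.
    if hw_thermal_pct > 0 and hw_thermal_pct >= dominance_factor * sw_thermal_pct:
        return "card_overheating"
    # sw_power and sw_thermal active for similar shares: server cooling limit.
    if sw_thermal_pct > 0 and abs(sw_power_pct - sw_thermal_pct) <= similar_tol_pct:
        return "server_cooling_constraint"
    return "inconclusive"


# H200 NVL above: sw_thermal and sw_power both cover the whole steady window.
# hw_thermal = 0 is assumed here; the notes do not report it.
print(classify_throttle_pattern(100.0, 100.0, 0.0))
```

Any real cutoffs would need to come from the baseline runs rather than from these placeholder defaults.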
Hypothesis for baseline
After testing on servers designed for their GPUs (proper cooling):
- Healthy GPU under sustained load will run at a stable fraction of boost
- Expected: avg_steady ≈ 80–95% of boost depending on model and TDP class
- Base clock (clocks.base.gr) may be a better reference than boost: a healthy card under real workload should comfortably exceed base clock
Baseline: H100 PCIe HBM2e — designed server (2026-04-06, 10 samples)
Source: external stress test tool, ~90s runs, designed server, adequate power.
Healthy fingerprint
- Power: hits cap ~340–360W immediately, stays flat throughout — HEALTHY
- Clock: starts ~1750 MHz, oscillates and declines to ~1540–1600 MHz by 90s
- Avg steady (visual): ~1580–1620 MHz
- vs boost 1755 MHz: ~91–92%
- Oscillation is NORMAL — this is the boost algorithm balancing under power cap
- Stable power + oscillating clocks = healthy power-cap behavior
- Temperature: linear rise ~38°C → 75–80°C over 90s (no runaway)
- Consistency: all 10 samples within ±20 MHz — very repeatable
Characteristic pattern
Flat power line + oscillating/declining clock line = GPU correctly managed by power cap algorithm. Do NOT flag this as instability.
Clock CV implication
The healthy oscillation WILL produce moderate ClockCVPct (~5–10%).
The current variance_too_high threshold (StabilityScore < 85) may fire on
healthy HBM2e PCIe cards. Needs recalibration.
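To make the CV figure concrete, here is a small sketch that computes a coefficient of variation over clock samples. It assumes ClockCVPct is population standard deviation divided by mean, times 100; the benchmark's exact definition is not stated in these notes. The traces are synthetic, not measured data.

```python
from statistics import mean, pstdev


def clock_cv_pct(samples_mhz):
    """Coefficient of variation of clock samples in percent (pstdev / mean * 100)."""
    avg = mean(samples_mhz)
    if avg == 0:
        return 0.0
    return pstdev(samples_mhz) / avg * 100.0


# Synthetic trace: clock swinging +/-80 MHz around 1600 MHz under a power cap.
oscillating = [1520, 1680] * 10
# Synthetic trace: clock pinned at one value (like the H200 NVL at 1635 MHz).
flat = [1635] * 20

print(round(clock_cv_pct(oscillating), 2), clock_cv_pct(flat))
```

A swing of only ±80 MHz already yields a CV of about 5%, which is the lower end of the range expected for healthy oscillation.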
Baseline: H100 HBM3 OEM SXM Custom (restored) — 2 confirmed samples
Source: pytorch_training_loop stress test, 120s (90s stress + 30s cooldown). Confirmed GPU: NVIDIA H100 80GB HBM3, GH100 rev a1.
GPU clock reference (from nvidia-smi, idle):
- base_clock_mhz: 1095
- boost_clock_mhz: 1755 (nvidia-smi clocks.max.graphics at idle)
- achieved_max_clock_mhz: 1980 (actual burst max observed by tool)
- Our benchmark locks to clocks.max.graphics = likely 1980 MHz for this chip
Observed under 700W sustained load (both samples nearly identical):
- Power: ~700W flat — SXM slot, adequate power confirmed
- Clock steady range: ~1380–1480 MHz, avg ~1420–1460 MHz
- vs 1980 MHz (lock target): 72–74% — severely below
- vs 1755 MHz (nvidia-smi boost): 81–83%
- vs 1095 MHz (base): 130% — above base but far below expected for SXM
- Clock/Watt: ~2.1 MHz/W vs HBM2e ~4.6 MHz/W — 2× worse efficiency
- Temperature: 38°C → 79–80°C (same rate as HBM2e)
- Oscillation: present, similar character to HBM2e but at much lower frequency
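The percentages and the Clock/Watt figure in the list above can be reproduced with a short calculation. This sketch uses 1440 MHz, the midpoint of the observed 1420–1460 MHz average, and for the HBM2e comparison assumes 1600 MHz at 350 W (midpoints of the ranges in the earlier baseline). The helper name is illustrative.

```python
def clock_ratios(avg_clock_mhz, power_w, references_mhz):
    """Return avg clock as a percentage of each reference clock, plus MHz per watt."""
    pct = {name: avg_clock_mhz / ref * 100.0 for name, ref in references_mhz.items()}
    return pct, avg_clock_mhz / power_w


REFS = {"lock_target": 1980, "smi_boost": 1755, "base": 1095}

# Restored H100 SXM HBM3: midpoint of the observed average, 700 W sustained.
pct, mhz_per_w = clock_ratios(1440, 700, REFS)
# H100 PCIe HBM2e comparison point (assumed midpoints: 1600 MHz, 350 W).
_, hbm2e_mhz_per_w = clock_ratios(1600, 350, {"smi_boost": 1755})

for name, value in pct.items():
    print(f"{name}: {value:.1f}%")
print(f"MHz/W: {mhz_per_w:.2f} vs HBM2e {hbm2e_mhz_per_w:.2f}")
```

Note that MHz/W compares cards with very different power envelopes, so the roughly 2× gap reflects TDP class as well as silicon condition.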
Diagnosis
These restored cards are degraded. A healthy H100 SXM in a designed server (DGX H100, HGX H100) should sustain ~1800–1900 MHz at 700W (~91–96% of 1980). The 72–74% result is a clear signal of silicon or VRM degradation from the refurbishment process.
Clock pattern note
Images 8/9 (previously marked as "HBM3 restored") are now confirmed identical to images 19/20. Both sample sets show the same degraded pattern (same batch).
Baseline matrix (filled where data available)
| GPU model | Config | Avg clock steady | vs boost | Clock/Watt | Notes |
|---|---|---|---|---|---|
| H100 PCIe HBM2e | designed server | 1580–1620 MHz | 91–92% | ~4.6 MHz/W | 10 samples, healthy |
| H100 SXM HBM3 restored | 700W full | 1420–1460 MHz | 72–74% of 1980 | ~2.1 MHz/W | 4 samples confirmed, degraded |
| H100 SXM HBM3 healthy | designed | ~1800–1900 MHz est. | ~91–96% est. | ~2.7 MHz/W est. | need real baseline |
| H200 NVL | designed | TBD | TBD | TBD | need baseline |
H100 official spec (from NVIDIA datasheet)
Source: NVIDIA H100 Tensor Core GPU Datasheet (image 23, 2026-04-06). In the datasheet, all TOPS marked * are with structural sparsity enabled; the table below lists the dense values (datasheet figure divided by 2).
| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H100 80GB PCIe | 756 TFLOPS | 378 TFLOPS | 1,513 TFLOPS | 350W | HBM2e |
| H100 NVL 94GB PCIe | 990 TFLOPS | 495 TFLOPS | 1,980 TFLOPS | 400W | HBM3 |
| H100 80GB SXM (BQQV) | 989 TFLOPS | 494 TFLOPS | — | 700W | HBM3 |
| H100 94GB SXM (BUBB) | 989 TFLOPS | 494 TFLOPS | — | 700W | HBM2e |
Notes:
- SXM boards do NOT list FP8 peak in this table (field empty)
- fp8_e5m2 is unsupported on H100 PCIe HBM2e — confirmed in our tests
- Tensor Cores: PCIe = 456, SXM = 528 (16% more on SXM)
Observed efficiency (H100 80GB PCIe, throttled server)
From the report in this session (power+thermal throttle throughout steady):
| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 329 TOPS | 756 TFLOPS | 44% |
| fp32_tf32 | 115 TOPS | 378 TFLOPS | 30% |
| fp8_e4m3 | 505 TOPS | 1,513 TFLOPS | 33% |
30–44% of spec is expected given sustained power+thermal throttle (avg clock 1384 MHz vs boost 1755 MHz = 79%). The GPU is computing correctly for its actual frequency: the low TOPS comes from throttle, not a silicon defect.
H200 official spec (from NVIDIA datasheet, image 24, 2026-04-06)
Format: without sparsity / with sparsity.
| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H200 NVL PCIe | 836 TFLOPS | 418 TFLOPS | 1,570 TFLOPS | 600W | HBM3e 141GB |
| H200 SXM | 990 TFLOPS | 495 TFLOPS | 1,979 TFLOPS | 700W | HBM3e 141GB |
Observed efficiency (H200 NVL PCIe, throttled non-designed server)
Avg clock 1635 MHz (62% of boost ~2619 MHz). Entire steady in thermal throttle.
| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 340 TOPS | 836 TFLOPS | 41% |
| fp32_tf32 | 120 TOPS | 418 TFLOPS | 29% |
| fp8_e4m3 | 529 TOPS | 1,570 TFLOPS | 34% |
Comparable to H100 PCIe efficiency (30–44%) despite the different card (memory, TDP): both are throttle-limited. This suggests that % of spec is not a quality signal; it reflects the thermal environment. tops_per_sm_per_ghz is the right metric.
Real-world GEMM efficiency reference (2026-04-06, web research)
Sources: SemiAnalysis MI300X vs H100 vs H200 training benchmark; cuBLAS optimization worklog (hamzaelshafie.bearblog.dev); Lambda AI H100 performance analysis.
What healthy systems actually achieve:
- H100 SXM in designed server: ~720 TFLOPS FP16 = ~73% of spec
- cuBLAS large square GEMM (8192³): up to ~83% flop utilization
- H200 NVL PCIe: no public data, extrapolating ~73% → ~610 TFLOPS FP16
Our results vs expectation:
| GPU | Our FP16 | Expected (73%) | Our % of spec | Gap |
|---|---|---|---|---|
| H100 PCIe HBM2e | 329 TOPS | ~552 TFLOPS | 44% | ~1.7× below |
| H200 NVL PCIe | 340 TOPS | ~610 TFLOPS | 41% | ~1.8× below |
Our results are little more than half of what a healthy system achieves, even allowing for throttle. This is NOT normal: 30–44% is not the industry baseline.
Likely causes of the gap (in order of probability):
- Thermal throttle — confirmed, sw_thermal covers entire steady window
- Power limit below TDP: GPU may be software-limited below 350W/600W. Previous user may have set a lower limit via nvidia-smi -pl and it was not reset. Our normalization sets clock locks but does NOT reset power limit. Key check: nvidia-smi -q | grep "Power Limit" (default vs enforced).
- Matrix size: ruled out. bee-gpu-burn uses 4096×4096×4096 for fp16, 8192×8192×4096 for fp8. These are large enough for peak tensor utilization.
Power limit gap analysis (H100 PCIe):
- Avg clock 1384 MHz = 79% of boost 1755 MHz
- Expected TOPS at 79% clock: 756 × 0.79 ≈ 597 TFLOPS
- Actually measured: 329 TOPS = 55% of that estimate
- Remaining gap after accounting for clock throttle: ~45%
- Most likely explanation: enforced power limit < 350W TDP, further reducing sustainable clock beyond what sw_thermal alone would cause.
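The gap arithmetic above can be written out as a small helper. It is a sketch that assumes throughput scales linearly with clock, which is the same simplification the estimate above makes; the function name is illustrative.

```python
def clock_scaled_gap(measured_tops, spec_dense_tflops, avg_clock_mhz, boost_clock_mhz):
    """Scale the dense spec by the achieved clock ratio and compare with the measurement.

    Returns (clock_ratio, expected_tops_at_clock, achieved_fraction_of_expected).
    Assumes throughput is proportional to clock.
    """
    clock_ratio = avg_clock_mhz / boost_clock_mhz
    expected = spec_dense_tflops * clock_ratio
    return clock_ratio, expected, measured_tops / expected


# H100 PCIe fp16_tensor figures from this document.
ratio, expected, achieved = clock_scaled_gap(329, 756, 1384, 1755)
print(f"clock ratio {ratio:.0%}, expected ~{expected:.0f} TFLOPS, "
      f"achieved {achieved:.0%}, unexplained gap {1 - achieved:.0%}")
```

With the unrounded clock ratio the expected figure comes out about 1 TFLOPS below the 597 quoted above; the conclusion (about 55% achieved, about 45% unexplained) is unchanged.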
Action item:
Add power.limit (enforced) AND power.default_limit to queryBenchmarkGPUInfo
so result.json shows if the card was pre-configured with a non-default limit.
If enforced < default × 0.95 → add finding "GPU power limit is below default TDP".
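A minimal sketch of that rule, assuming the two values have already been read from the power.limit and power.default_limit fields and converted to watts. The function name is hypothetical and not part of queryBenchmarkGPUInfo.

```python
from typing import Optional


def power_limit_finding(enforced_w: float,
                        default_w: float,
                        tolerance: float = 0.95) -> Optional[str]:
    """Return a finding string when the enforced limit is below tolerance * default."""
    if enforced_w < default_w * tolerance:
        return "GPU power limit is below default TDP"
    return None


# H100 PCIe with a 350 W default: the cutoff is 332.5 W.
print(power_limit_finding(300.0, 350.0))  # GPU power limit is below default TDP
print(power_limit_finding(340.0, 350.0))  # None
```

Keeping the tolerance as a parameter leaves room to tighten or loosen the 5% margin once real enforced/default pairs are collected.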
CPU/RAM impact on GPU FLOPS:
None. Pure on-GPU GEMM is fully compute-bound once data is in VRAM. CPU core count and host RAM are irrelevant.
Compute efficiency metric (proposed, no hardcode)
Instead of comparing TOPS to a hardcoded spec, compute: tops_per_sm_per_ghz = measured_tops / (sm_count × avg_clock_ghz)
This is model-agnostic. A GPU computing correctly at its actual frequency will show a consistent tops_per_sm_per_ghz regardless of throttle level. A GPU with degraded silicon will show low tops_per_sm_per_ghz even at normal clocks.
SM count is queryable: nvidia-smi --query-gpu=attribute.multiprocessor_count (needs to be added to queryBenchmarkGPUInfo).
Reference values to establish after baseline runs:
- H100 PCIe fp16_tensor: TBD tops/SM/GHz
- H100 SXM fp16_tensor: TBD tops/SM/GHz
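A sketch of the proposed metric, applied to the throttled H100 PCIe fp16_tensor result from this document. The SM count of 114 is an assumption derived from the 456 tensor cores listed in the spec notes (4 per SM); in practice it should be queried from the GPU rather than hardcoded.

```python
import math


def tops_per_sm_per_ghz(measured_tops: float, sm_count: int, avg_clock_mhz: float) -> float:
    """measured_tops / (sm_count * avg_clock_ghz), as proposed above."""
    if sm_count <= 0 or avg_clock_mhz <= 0:
        raise ValueError("sm_count and avg_clock_mhz must be positive")
    return measured_tops / (sm_count * (avg_clock_mhz / 1000.0))


# Throttled H100 PCIe: 329 TOPS fp16_tensor at 1384 MHz average.
# ASSUMPTION: 114 SMs (456 tensor cores / 4 per SM).
h100_pcie = tops_per_sm_per_ghz(329, 114, 1384)
print(f"{h100_pcie:.2f} TOPS/SM/GHz")

# If throughput scales with clock, the metric does not move with throttle level:
same_silicon_faster = tops_per_sm_per_ghz(329 * 1.2, 114, 1384 * 1.2)
print(math.isclose(h100_pcie, same_silicon_faster))
```

The second call illustrates the model-agnostic claim. A card that delivers 20% more TOPS at a 20% higher clock scores the same, so only a change in work done per SM per cycle shifts the value.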
Proposed threshold changes (pending more data)
- low_sm_clock_vs_target: lower threshold from 90% to 85% based on observed 91–92% on healthy HBM2e. Or remove entirely, since sw_power/sw_thermal already capture the root cause.
- variance_too_high (StabilityScore < 85): healthy HBM2e WILL oscillate under power cap. Consider suppressing this flag when power is flat and usage is 100% (oscillation is expected). Or lower threshold to 70.
- New signal, MHz/Watt efficiency: if base_graphics_clock_mhz is available, the ratio avg_clock / power_w could identify degraded silicon (HBM3 restored S1 would have been caught by this).
Decision deferred until baseline on SXM designed servers collected.
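For discussion, the suppression idea for variance_too_high might look like the sketch below. The definitions of "flat power" (power CV at or below 2%) and "100% usage" (utilization at or above 99%) are placeholder assumptions, as are all parameter names. None of this is decided.

```python
def variance_too_high(stability_score: float,
                      power_cv_pct: float,
                      gpu_util_pct: float,
                      score_threshold: float = 85.0,
                      flat_power_cv_pct: float = 2.0,
                      full_util_pct: float = 99.0) -> bool:
    """Fire on a low stability score unless power is flat and the GPU is fully loaded."""
    if stability_score >= score_threshold:
        return False
    power_is_flat = power_cv_pct <= flat_power_cv_pct
    fully_loaded = gpu_util_pct >= full_util_pct
    # Flat power plus full load means the clock oscillation is the power-cap
    # algorithm at work, so the flag is suppressed.
    return not (power_is_flat and fully_loaded)


print(variance_too_high(70.0, 1.0, 100.0))  # False: suppressed, healthy power-cap pattern
print(variance_too_high(70.0, 8.0, 100.0))  # True: power itself is unstable
```

Whether suppression or a lower threshold is the better fix depends on the SXM baseline data that is still pending.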