# Benchmark clock calibration research

## Status

In progress. Baseline data from production servers pending.

## Background

The benchmark locks GPU clocks to `MaxGraphicsClockMHz` (boost) via `nvidia-smi -lgc` before the steady-state phase. The metric `low_sm_clock_vs_target` fires when `avg_steady_clock < locked_target * 0.90`.

Problem: boost clock is the theoretical maximum under ideal cooling. In practice, even a healthy GPU in a non-ideal server will sustain clocks well below boost. The 90% threshold has no empirical basis.

## Key observations (2026-04-06)

### H100 PCIe: new card, server not designed for it

- avg clock 1384 MHz, P95 1560 MHz (unstable; boost probably 1755 MHz)
- Thermal sustain: 0.0 (sw_thermal covers entire steady window)
- Stability: 70.0 (clocks erratic, no equilibrium found)
- Degradation: power_capped, thermal_limited, low_sm_clock_vs_target, variance_too_high

### H200 NVL: new card, server not designed for it

- avg clock = P95 = 1635 MHz (perfectly stable)
- Thermal sustain: 0.0 (sw_thermal + sw_power cover entire steady window)
- Stability: 92.0 (found stable thermal equilibrium at 1635 MHz)
- Degradation: power_capped, thermal_limited
- Compute: 989 TOPS (card is computing correctly for its frequency)

### Key insight

The meaningful distinction is not *whether* the card throttles but *how stably* it throttles. The H200 found a thermal equilibrium (avg == P95, Stability 92); the H100 did not (avg << P95, Stability 70). Both are new cards; the H100's instability may reflect a more severe thermal mismatch or a card issue.

- `sw_power ≈ sw_thermal` pattern = server cooling constraint, card likely OK.
- `hw_thermal >> sw_thermal` pattern = card itself overheating, investigate.

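
The stable-versus-unstable distinction can be sketched as a small check on the steady-window summary. This is only an illustration, not the benchmark's implementation: the `SteadyWindow` container, the `found_equilibrium` name, and both cut-offs are hypothetical. The 2% gap is an arbitrary placeholder, and 85 simply mirrors the current `variance_too_high` threshold, which these notes say still needs recalibration.

```python
from dataclasses import dataclass


@dataclass
class SteadyWindow:
    """Clock summary for the steady-state phase (hypothetical container)."""
    avg_clock_mhz: float
    p95_clock_mhz: float
    stability_score: float


def found_equilibrium(w: SteadyWindow,
                      max_gap_pct: float = 2.0,
                      min_stability: float = 85.0) -> bool:
    """Return True when the card throttles stably (avg close to P95).

    Both cut-offs are placeholder assumptions, not calibrated values.
    """
    if w.p95_clock_mhz <= 0:
        return False
    # Relative distance between average and P95 clock, in percent of P95.
    gap_pct = 100.0 * (w.p95_clock_mhz - w.avg_clock_mhz) / w.p95_clock_mhz
    return gap_pct <= max_gap_pct and w.stability_score >= min_stability


# The two cards from the observations in these notes.
h200_nvl = SteadyWindow(avg_clock_mhz=1635, p95_clock_mhz=1635, stability_score=92.0)
h100_pcie = SteadyWindow(avg_clock_mhz=1384, p95_clock_mhz=1560, stability_score=70.0)

print(found_equilibrium(h200_nvl))   # True: avg == P95, Stability 92
print(found_equilibrium(h100_pcie))  # False: avg about 11% below P95, Stability 70
```

With these placeholder cut-offs the H200 NVL counts as "throttled but in equilibrium" and the H100 PCIe as "throttled and unstable", matching the manual reading of the two reports.
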
## Hypothesis for baseline

After testing on servers designed for their GPUs (proper cooling):

- Healthy GPU under sustained load will run at a stable fraction of boost
- Expected: avg_steady ≈ 80–95% of boost depending on model and TDP class
- Base clock (`clocks.base.gr`) may be a better reference than boost: a healthy card under real workload should comfortably exceed base clock

## Baseline: H100 PCIe HBM2e, designed server (2026-04-06, 10 samples)

Source: external stress test tool, ~90s runs, designed server, adequate power.

### Healthy fingerprint

- **Power**: hits cap ~340–360W immediately, stays flat throughout (HEALTHY)
- **Clock**: starts ~1750 MHz, oscillates and declines to ~1540–1600 MHz by 90s
  - Avg steady (visual): **~1580–1620 MHz**
  - vs boost 1755 MHz: **~91–92%**
  - Oscillation is NORMAL: this is the boost algorithm balancing under power cap
  - Stable power + oscillating clocks = healthy power-cap behavior
- **Temperature**: linear rise ~38°C → 75–80°C over 90s (no runaway)
- **Consistency**: all 10 samples within ±20 MHz, very repeatable

### Characteristic pattern

Flat power line + oscillating/declining clock line = GPU correctly managed by power cap algorithm. Do NOT flag this as instability.

### Clock CV implication

The healthy oscillation WILL produce moderate ClockCVPct (~5–10%). The current `variance_too_high` threshold (StabilityScore < 85) may fire on healthy HBM2e PCIe cards. Needs recalibration.

---

## Baseline: H100 HBM3 OEM SXM Custom (restored), 2 confirmed samples

Source: pytorch_training_loop stress test, 120s (90s stress + 30s cooldown). Confirmed GPU: NVIDIA H100 80GB HBM3, GH100 rev a1.

### GPU clock reference (from nvidia-smi, idle):

- base_clock_mhz: **1095**
- boost_clock_mhz: **1755** (nvidia-smi `clocks.max.graphics` at idle)
- achieved_max_clock_mhz: **1980** (actual burst max observed by tool)
- Our benchmark locks to `clocks.max.graphics` = likely 1980 MHz for this chip

### Observed under 700W sustained load (both samples nearly identical):

- Power: ~700W flat (SXM slot, adequate power confirmed)
- Clock steady range: **~1380–1480 MHz**, avg **~1420–1460 MHz**
- vs 1980 MHz (lock target): **72–74%**, severely below
- vs 1755 MHz (nvidia-smi boost): **81–83%**
- vs 1095 MHz (base): ~130%, above base but far below expected for SXM
- Clock/Watt: ~2.1 MHz/W vs HBM2e ~4.6 MHz/W, roughly 2× worse efficiency
- Temperature: 38°C → 79–80°C (same rate as HBM2e)
- Oscillation: present, similar in character to HBM2e but at a much lower clock frequency

### Diagnosis

These restored cards are degraded. A healthy H100 SXM in a designed server (DGX H100, HGX H100) should sustain ~1800–1900 MHz at 700W (~91–96% of 1980); this figure is an estimate, since no healthy SXM baseline has been measured yet. The 72–74% result is a clear signal of silicon or VRM degradation from the refurbishment process.

### Clock pattern note

Images 8/9 (previously marked as "HBM3 restored") are now confirmed identical to images 19/20. Both sample sets show the same degraded pattern (same batch).

---

## Baseline matrix (filled where data available)

| GPU model | Config | Avg clock steady | vs boost | Clock/Watt | Notes |
|---|---|---|---|---|---|
| H100 PCIe HBM2e | designed server | 1580–1620 MHz | 91–92% | ~4.6 MHz/W | 10 samples, healthy |
| H100 SXM HBM3 restored | 700W full | 1420–1460 MHz | 72–74% of 1980 | ~2.1 MHz/W | 4 samples confirmed, degraded |
| H100 SXM HBM3 healthy | designed | ~1800–1900 MHz est. | ~91–96% est. | ~2.7 MHz/W est. | need real baseline |
| H200 NVL | designed | TBD | TBD | TBD | need baseline |

---

## H100 official spec (from NVIDIA datasheet)

Source: NVIDIA H100 Tensor Core GPU Datasheet (image 23, 2026-04-06).

All TOPS marked * are with structural sparsity enabled. Divide by 2 for dense.

| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H100 80GB PCIe | 756 TFLOPS | 378 TFLOPS | 1,513 TFLOPS | 350W | HBM2e |
| H100 NVL 94GB PCIe | 990 TFLOPS | 495 TFLOPS | 1,980 TFLOPS | 400W | HBM3 |
| H100 80GB SXM (BQQV) | 989 TFLOPS | 494 TFLOPS | not listed | 700W | HBM3 |
| H100 94GB SXM (BUBB) | 989 TFLOPS | 494 TFLOPS | not listed | 700W | HBM2e |

Notes:

- SXM boards do NOT list FP8 peak in this table (field empty)
- fp8_e5m2 is unsupported on H100 PCIe HBM2e (confirmed in our tests)
- Tensor Cores: PCIe = 456, SXM = 528 (16% more on SXM)

## Observed efficiency (H100 80GB PCIe, throttled server)

From the report in this session (power+thermal throttle throughout steady):

| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 329 TOPS | 756 TFLOPS | 44% |
| fp32_tf32 | 115 TOPS | 378 TFLOPS | 30% |
| fp8_e4m3 | 505 TOPS | 1,513 TFLOPS | 33% |

30–44% of spec is expected given sustained power+thermal throttle (avg clock 1384 MHz vs boost 1755 MHz = 79%). The GPU is computing correctly for its actual frequency: the low TOPS comes from throttle, not a silicon defect.

## H200 official spec (from NVIDIA datasheet, image 24, 2026-04-06)

Datasheet format: without sparsity / with sparsity. The table below lists the dense (without sparsity) values.

| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H200 NVL PCIe | 836 TFLOPS | 418 TFLOPS | 1,570 TFLOPS | 600W | HBM3e 141GB |
| H200 SXM | 990 TFLOPS | 495 TFLOPS | 1,979 TFLOPS | 700W | HBM3e 141GB |

## Observed efficiency (H200 NVL PCIe, throttled non-designed server)

Avg clock 1635 MHz (62% of boost ~2619 MHz). The entire steady window is in thermal throttle.


| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 340 TOPS | 836 TFLOPS | 41% |
| fp32_tf32 | 120 TOPS | 418 TFLOPS | 29% |
| fp8_e4m3 | 529 TOPS | 1,570 TFLOPS | 34% |

Comparable to H100 PCIe efficiency (30–44%) despite the different card and memory configuration: both are throttle-limited. This confirms that % of spec is not a quality signal; it reflects the thermal environment. `tops_per_sm_per_ghz` is the right metric.

## Real-world GEMM efficiency reference (2026-04-06, web research)

Sources: SemiAnalysis MI300X vs H100 vs H200 training benchmark; cuBLAS optimization worklog (hamzaelshafie.bearblog.dev); Lambda AI H100 performance analysis.

### What healthy systems actually achieve:

- H100 SXM in designed server: **~720 TFLOPS FP16 = ~73% of spec**
- cuBLAS large square GEMM (8192³): up to **~83% flop utilization**
- H200 NVL PCIe: no public data, extrapolating ~73% → ~610 TFLOPS FP16

### Our results vs expectation:

| GPU | Our FP16 | Expected (73%) | Our % of spec | Gap |
|---|---|---|---|---|
| H100 PCIe HBM2e | 329 TOPS | ~552 TFLOPS | 44% | ~1.7× below |
| H200 NVL PCIe | 340 TOPS | ~610 TFLOPS | 41% | ~1.8× below |

Our results are roughly **half** (about 55–60%) of what a healthy system achieves, even under throttle. This is NOT normal: 30–44% is not the industry baseline.

### Likely causes of the gap (in order of probability):

1. **Thermal throttle**: confirmed, sw_thermal covers entire steady window.
2. **Power limit below TDP**: GPU may be software-limited below 350W/600W. A previous user may have set a lower limit via `nvidia-smi -pl` and it was not reset. Our normalization sets clock locks but does NOT reset the power limit. Key check: `nvidia-smi -q | grep "Power Limit"` (default vs enforced).
3. **Matrix size**: ruled out. bee-gpu-burn uses 4096×4096×4096 for fp16, 8192×8192×4096 for fp8. These are large enough for peak tensor utilization.

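
Cause 2 can be checked programmatically. The sketch below assumes the `enforced.power.limit` and `power.default_limit` query fields of `nvidia-smi --query-gpu`. The helper names are hypothetical, and the 0.95 ratio and the finding text are taken from the action item in these notes. Only the parsing function is exercised in the example, so it runs without a GPU.

```python
import subprocess
from typing import Optional

# nvidia-smi query fields: enforced limit first, factory default second.
QUERY_FIELDS = "enforced.power.limit,power.default_limit"


def read_power_limits(gpu_index: int = 0) -> str:
    """Run nvidia-smi and return one CSV line: '<enforced>, <default>' in watts."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def power_limit_finding(csv_line: str, min_ratio: float = 0.95) -> Optional[str]:
    """Return a finding string if enforced < default * min_ratio, else None."""
    parts = [p.strip() for p in csv_line.split(",")]
    if len(parts) != 2:
        return None
    try:
        enforced_w, default_w = float(parts[0]), float(parts[1])
    except ValueError:
        return None  # fields reported as unavailable, e.g. "[N/A]"
    if default_w <= 0:
        return None
    if enforced_w < default_w * min_ratio:
        return ("GPU power limit is below default TDP "
                f"({enforced_w:.0f}W enforced vs {default_w:.0f}W default)")
    return None


# Hypothetical readings: a card left at 250W by a previous user, and a default one.
print(power_limit_finding("250.00, 350.00"))
print(power_limit_finding("350.00, 350.00"))
```

On a real host the input would come from `read_power_limits()`; keeping the comparison separate from the subprocess call makes the rule easy to unit-test.
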
### Power limit gap analysis (H100 PCIe):

- Avg clock 1384 MHz = 79% of boost 1755 MHz
- Expected TOPS at 79% clock: 756 × 0.79 ≈ 597 TFLOPS
- Actually measured: 329 TOPS = 55% of that estimate
- Remaining gap after accounting for clock throttle: ~45%
- Most likely explanation: enforced power limit < 350W TDP, further reducing sustainable clock beyond what sw_thermal alone would cause.

### Action item:

Add `power.limit` (enforced) AND `power.default_limit` to queryBenchmarkGPUInfo so result.json shows if the card was pre-configured with a non-default limit. If enforced < default × 0.95 → add finding "GPU power limit is below default TDP".

### CPU/RAM impact on GPU FLOPS:

None. Pure on-GPU GEMM is fully compute-bound once data is in VRAM. CPU core count and host RAM are irrelevant.

## Compute efficiency metric (proposed, no hardcode)

Instead of comparing TOPS to a hardcoded spec, compute:

`tops_per_sm_per_ghz = measured_tops / (sm_count × avg_clock_ghz)`

This is model-agnostic. A GPU computing correctly at its actual frequency will show a consistent tops_per_sm_per_ghz regardless of throttle level. A GPU with degraded silicon will show low tops_per_sm_per_ghz even at normal clocks.

SM count is queryable: `nvidia-smi --query-gpu=attribute.multiprocessor_count` (needs to be added to queryBenchmarkGPUInfo).

Reference values to establish after baseline runs:

- H100 PCIe fp16_tensor: TBD tops/SM/GHz
- H100 SXM fp16_tensor: TBD tops/SM/GHz

## Proposed threshold changes (pending more data)

1. **`low_sm_clock_vs_target`**: lower threshold from 90% to 85% based on observed 91–92% on healthy HBM2e. Or remove entirely: sw_power/sw_thermal already capture the root cause.
2. **`variance_too_high`** (StabilityScore < 85): healthy HBM2e WILL oscillate under power cap. Consider suppressing this flag when power is flat and usage is 100% (oscillation is expected). Or lower threshold to 70.
3. **New signal: MHz/Watt efficiency**: if base_graphics_clock_mhz is available, ratio avg_clock / power_w could identify degraded silicon (HBM3 restored S1 would have been caught by this).

Decision deferred until the baseline on SXM designed servers is collected.
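
The two proposed signals, `tops_per_sm_per_ghz` and MHz/Watt, reduce to simple ratios. The sketch below is illustrative only: the function names are hypothetical, the SM count of 114 for the H100 PCIe is an assumption derived from the 456 tensor cores listed in these notes (4 per SM), and the clock and power inputs are midpoints of the ranges recorded above.

```python
def tops_per_sm_per_ghz(measured_tops: float, sm_count: int, avg_clock_mhz: float) -> float:
    """Throughput normalized by SM count and actual average clock (in GHz)."""
    if sm_count <= 0 or avg_clock_mhz <= 0:
        raise ValueError("sm_count and avg_clock_mhz must be positive")
    return measured_tops / (sm_count * (avg_clock_mhz / 1000.0))


def mhz_per_watt(avg_clock_mhz: float, power_w: float) -> float:
    """Sustained clock per watt of board power."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return avg_clock_mhz / power_w


# H100 PCIe on the throttled server: 329 TOPS fp16 at 1384 MHz average.
# 114 SMs is an assumption (456 tensor cores / 4 per SM).
h100_pcie_eff = tops_per_sm_per_ghz(329, 114, 1384)
print(round(h100_pcie_eff, 2))   # about 2.09 tops/SM/GHz

# MHz/W from the baselines in these notes (range midpoints).
restored_sxm = mhz_per_watt(1440, 700)   # about 2.06, degraded batch
healthy_pcie = mhz_per_watt(1600, 350)   # about 4.57, healthy HBM2e
print(round(restored_sxm, 2), round(healthy_pcie, 2))
```

No reference value exists yet for either ratio, so these numbers are data points to compare against the designed-server baselines once they are collected, not pass/fail limits.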