bee/bible-local/docs/benchmark-clock-calibration.md
Michael Chus d973231f37 Enhance benchmark: server power via IPMI, efficiency metrics, FP64, power limit check
- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
  compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
  from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:26:52 +03:00


Benchmark clock calibration research

Status

In progress. Baseline data from production servers pending.

Background

The benchmark locks GPU clocks to MaxGraphicsClockMHz (boost) via nvidia-smi -lgc before the steady-state phase. The metric low_sm_clock_vs_target fires when avg_steady_clock < locked_target * 0.90.
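The rule above can be sketched in a few lines of Python. The function and parameter names are illustrative, not the benchmark's actual code; the clock figures are the ones reported in the observations further down in this note.

```python
def low_sm_clock_vs_target(avg_steady_clock_mhz: float,
                           locked_target_mhz: float,
                           threshold: float = 0.90) -> bool:
    """Return True when the steady-state average clock is below threshold * target."""
    if locked_target_mhz <= 0:
        raise ValueError("locked_target_mhz must be positive")
    return avg_steady_clock_mhz < locked_target_mhz * threshold


# Throttled H100 PCIe: 1384 MHz average against a 1755 MHz lock target.
throttled = low_sm_clock_vs_target(1384, 1755)
# Healthy H100 PCIe at the low end of its reported steady range (1580 MHz).
healthy_low = low_sm_clock_vs_target(1580, 1755)
print(throttled, healthy_low)
```

With a 1755 MHz target the cutoff is 1579.5 MHz, so a healthy card averaging 1580 MHz clears it by only half a megahertz.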

Problem: boost clock is the theoretical maximum under ideal cooling. In practice, even a healthy GPU in a non-ideal server will sustain clocks well below boost. The 90% threshold has no empirical basis.

Key observations (2026-04-06)

H100 PCIe — new card, server not designed for it

  • avg clock 1384 MHz, P95 1560 MHz (unstable; boost is probably 1755 MHz)
  • Thermal sustain: 0.0 (sw_thermal covers entire steady window)
  • Stability: 70.0 — clocks erratic, no equilibrium found
  • Degradation: power_capped, thermal_limited, low_sm_clock_vs_target, variance_too_high

H200 NVL — new card, server not designed for it

  • avg clock = P95 = 1635 MHz (perfectly stable)
  • Thermal sustain: 0.0 (sw_thermal + sw_power cover entire steady window)
  • Stability: 92.0 — found stable thermal equilibrium at 1635 MHz
  • Degradation: power_capped, thermal_limited
  • Compute: 989 TOPS — card is computing correctly for its frequency

Key insight

The meaningful distinction is not whether the card throttles but how stably it throttles. H200 found a thermal equilibrium (avg == P95, Stability 92), H100 did not (avg << P95, Stability 70). Both are new cards; the H100's instability may reflect a more severe thermal mismatch or a card issue.

sw_power ≈ sw_thermal pattern = server cooling constraint, card likely OK. hw_thermal >> sw_thermal pattern = card itself overheating, investigate.
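A rough Python sketch of how these two patterns could be told apart, given each throttle reason as a percentage of the steady window. The function name, the margin for "≈", and the factor for ">>" are hypothetical placeholders chosen for illustration, not calibrated values.

```python
def classify_throttle_pattern(sw_power_pct: float,
                              sw_thermal_pct: float,
                              hw_thermal_pct: float,
                              similar_margin_pct: float = 20.0,
                              dominance_factor: float = 2.0) -> str:
    """Map throttle-reason shares of the steady window (0-100) to a coarse diagnosis.

    similar_margin_pct and dominance_factor are assumptions, not measured thresholds.
    """
    # Hardware thermal slowdown far exceeding the software one points at the card.
    if hw_thermal_pct > 0 and hw_thermal_pct >= dominance_factor * max(sw_thermal_pct, 1.0):
        return "card_overheating"
    # Software power and thermal throttling of similar extent points at the server.
    if sw_thermal_pct > 0 and abs(sw_power_pct - sw_thermal_pct) <= similar_margin_pct:
        return "server_cooling_constraint"
    return "inconclusive"


# H200 NVL case: sw_thermal and sw_power both cover the entire steady window.
print(classify_throttle_pattern(100.0, 100.0, 0.0))
```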

Hypothesis for baseline

After testing on servers designed for their GPUs (proper cooling):

  • Healthy GPU under sustained load will run at a stable fraction of boost
  • Expected: avg_steady ≈ 80–95% of boost depending on model and TDP class
  • Base clock (clocks.base.gr) may be a better reference than boost: a healthy card under real workload should comfortably exceed base clock

Baseline: H100 PCIe HBM2e — designed server (2026-04-06, 10 samples)

Source: external stress test tool, ~90s runs, designed server, adequate power.

Healthy fingerprint

  • Power: hits cap ~340–360W immediately, stays flat throughout (HEALTHY)
  • Clock: starts ~1750 MHz, oscillates and declines to ~1540–1600 MHz by 90s
    • Avg steady (visual): ~1580–1620 MHz
    • vs boost 1755 MHz: ~91–92%
    • Oscillation is NORMAL — this is the boost algorithm balancing under power cap
    • Stable power + oscillating clocks = healthy power-cap behavior
  • Temperature: linear rise ~38°C → 75–80°C over 90s (no runaway)
  • Consistency: all 10 samples within ±20 MHz — very repeatable

Characteristic pattern

Flat power line + oscillating/declining clock line = GPU correctly managed by power cap algorithm. Do NOT flag this as instability.

Clock CV implication

The healthy oscillation WILL produce moderate ClockCVPct (~5–10%). The current variance_too_high threshold (StabilityScore < 85) may fire on healthy HBM2e PCIe cards. Needs recalibration.
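For reference, a coefficient of variation over clock samples can be computed as below. This assumes ClockCVPct is the population standard deviation divided by the mean, times 100; the note does not spell out the benchmark's exact definition, and the oscillating sample values are invented for illustration.

```python
from statistics import mean, pstdev


def clock_cv_pct(clock_samples_mhz):
    """Coefficient of variation of clock samples, in percent (assumed definition)."""
    if not clock_samples_mhz:
        raise ValueError("need at least one clock sample")
    avg = mean(clock_samples_mhz)
    return 100.0 * pstdev(clock_samples_mhz) / avg


# A card sitting at thermal equilibrium: every sample identical.
flat_cv = clock_cv_pct([1635] * 6)
# Illustrative power-cap oscillation between roughly 1540 and 1750 MHz.
oscillating_cv = clock_cv_pct([1750, 1600, 1680, 1540, 1620, 1560])
print(flat_cv, round(oscillating_cv, 1))
```

Even this modest made-up oscillation lands at a few percent, so a variance check tuned for flat clocks would need headroom for it.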


Baseline: H100 HBM3 OEM SXM Custom (restored) — 2 confirmed samples

Source: pytorch_training_loop stress test, 120s (90s stress + 30s cooldown). Confirmed GPU: NVIDIA H100 80GB HBM3, GH100 rev a1.

GPU clock reference (from nvidia-smi, idle):

  • base_clock_mhz: 1095
  • boost_clock_mhz: 1755 (nvidia-smi clocks.max.graphics at idle)
  • achieved_max_clock_mhz: 1980 (actual burst max observed by tool)
  • Our benchmark locks to clocks.max.graphics = likely 1980 MHz for this chip

Observed under 700W sustained load (both samples nearly identical):

  • Power: ~700W flat — SXM slot, adequate power confirmed
  • Clock steady range: ~1380–1480 MHz, avg ~1420–1460 MHz
  • vs 1980 MHz (lock target): 72–74%, severely below
  • vs 1755 MHz (nvidia-smi boost): 81–83%
  • vs 1095 MHz (base): 130% — above base but far below expected for SXM
  • Clock/Watt: ~2.1 MHz/W vs HBM2e ~4.6 MHz/W — 2× worse efficiency
  • Temperature: 38°C → 79–80°C (same rate as HBM2e)
  • Oscillation: present, similar character to HBM2e but at much lower frequency
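The ratios in this list are simple divisions; a small Python helper makes them reproducible. The helper name is made up, and the inputs are midpoints picked from the reported ranges (1440 MHz at 700W for the restored SXM card, 1600 MHz at 350W for the HBM2e PCIe card), so the results are approximate.

```python
def clock_ratios(avg_clock_mhz, power_w, references_mhz):
    """Return percent-of-reference for each named clock, plus MHz per watt."""
    ratios = {name: 100.0 * avg_clock_mhz / ref for name, ref in references_mhz.items()}
    ratios["mhz_per_watt"] = avg_clock_mhz / power_w
    return ratios


# Restored H100 SXM HBM3: midpoint of the 1420-1460 MHz range at 700 W.
restored_sxm = clock_ratios(1440, 700, {"lock_target": 1980, "boost": 1755, "base": 1095})
# Healthy H100 PCIe HBM2e: midpoint of the 1580-1620 MHz range at about 350 W.
healthy_pcie = clock_ratios(1600, 350, {"boost": 1755})

for name, value in restored_sxm.items():
    print(f"{name}: {value:.1f}")
print(f"healthy PCIe MHz/W: {healthy_pcie['mhz_per_watt']:.1f}")
```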

Diagnosis

These restored cards are degraded. A healthy H100 SXM in a designed server (DGX H100, HGX H100) should sustain ~1800–1900 MHz at 700W (~91–96% of 1980). The 72–74% result is a clear signal of silicon or VRM degradation from the refurbishment process.

Clock pattern note

Images 8/9 (previously marked as "HBM3 restored") are now confirmed identical to images 19/20. Both sample sets show same degraded pattern — same batch.


Baseline matrix (filled where data available)

| GPU model | Config | Avg clock steady | vs boost | Clock/Watt | Notes |
|---|---|---|---|---|---|
| H100 PCIe HBM2e | designed server | 1580–1620 MHz | 91–92% | ~4.6 MHz/W | 10 samples, healthy |
| H100 SXM HBM3 restored | 700W full | 1420–1460 MHz | 72–74% of 1980 | ~2.1 MHz/W | 4 samples confirmed, degraded |
| H100 SXM HBM3 healthy | designed | ~1800–1900 MHz est. | ~91–96% est. | ~2.7 MHz/W est. | need real baseline |
| H200 NVL | designed | TBD | TBD | TBD | need baseline |

H100 official spec (from NVIDIA datasheet)

Source: NVIDIA H100 Tensor Core GPU Datasheet (image 23, 2026-04-06). All TOPS marked * are with structural sparsity enabled. Divide by 2 for dense.

| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H100 80GB PCIe | 756 TFLOPS | 378 TFLOPS | 1,513 TFLOPS | 350W | HBM2e |
| H100 NVL 94GB PCIe | 990 TFLOPS | 495 TFLOPS | 1,980 TFLOPS | 400W | HBM3 |
| H100 80GB SXM (BQQV) | 989 TFLOPS | 494 TFLOPS | | 700W | HBM3 |
| H100 94GB SXM (BUBB) | 989 TFLOPS | 494 TFLOPS | | 700W | HBM2e |

Notes:

  • SXM boards do NOT list FP8 peak in this table (field empty)
  • fp8_e5m2 is unsupported on H100 PCIe HBM2e — confirmed in our tests
  • Tensor Cores: PCIe = 456, SXM = 528 (16% more on SXM)

Observed efficiency (H100 80GB PCIe, throttled server)

From the report in this session (power+thermal throttle throughout steady):

| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 329 TOPS | 756 TFLOPS | 44% |
| fp32_tf32 | 115 TOPS | 378 TFLOPS | 30% |
| fp8_e4m3 | 505 TOPS | 1,513 TFLOPS | 33% |

30–44% of spec is expected given sustained power+thermal throttle (avg clock 1384 MHz vs boost 1755 MHz = 79%). The GPU is computing correctly for its actual frequency; the low TOPS comes from throttle, not a silicon defect.

H200 official spec (from NVIDIA datasheet, image 24, 2026-04-06)

Format: without sparsity / with sparsity.

| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H200 NVL PCIe | 836 TFLOPS | 418 TFLOPS | 1,570 TFLOPS | 600W | HBM3e 141GB |
| H200 SXM | 990 TFLOPS | 495 TFLOPS | 1,979 TFLOPS | 700W | HBM3e 141GB |

Observed efficiency (H200 NVL PCIe, throttled non-designed server)

Avg clock 1635 MHz (62% of boost ~2619 MHz). Entire steady in thermal throttle.

| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 340 TOPS | 836 TFLOPS | 41% |
| fp32_tf32 | 120 TOPS | 418 TFLOPS | 29% |
| fp8_e4m3 | 529 TOPS | 1,570 TFLOPS | 34% |

Comparable to H100 PCIe efficiency (30–44%) despite the different card and memory configuration; both are throttle-limited. This confirms that % of spec is not a quality signal: it reflects the thermal environment. tops_per_sm_per_ghz is the right metric.

Real-world GEMM efficiency reference (2026-04-06, web research)

Sources: SemiAnalysis MI300X vs H100 vs H200 training benchmark; cuBLAS optimization worklog (hamzaelshafie.bearblog.dev); Lambda AI H100 performance analysis.

What healthy systems actually achieve:

  • H100 SXM in designed server: ~720 TFLOPS FP16 = ~73% of spec
  • cuBLAS large square GEMM (8192³): up to ~83% flop utilization
  • H200 NVL PCIe: no public data, extrapolating ~73% → ~610 TFLOPS FP16

Our results vs expectation:

| GPU | Our FP16 | Expected (73%) | Our % of spec | Gap |
|---|---|---|---|---|
| H100 PCIe HBM2e | 329 TOPS | ~552 TFLOPS | 44% | ~1.7× below |
| H200 NVL PCIe | 340 TOPS | ~610 TFLOPS | 41% | ~1.8× below |

Our results are roughly half of what a healthy system achieves even under throttle. This is NOT normal — 30-44% is not the industry baseline.
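The "Expected (73%)" and "Gap" figures can be reproduced with the short Python sketch below. It takes the 73% utilization quoted for a healthy H100 SXM and applies it to the PCIe and NVL dense specs, which is the same extrapolation the note makes and not a measured PCIe baseline; the function name is hypothetical.

```python
REFERENCE_UTILIZATION = 0.73  # healthy H100 SXM figure quoted in this section


def gap_vs_expected(measured_tops, spec_dense_tflops, utilization=REFERENCE_UTILIZATION):
    """Return (expected throughput, factor by which the measurement falls short)."""
    expected = spec_dense_tflops * utilization
    return expected, expected / measured_tops


h100_expected, h100_gap = gap_vs_expected(329, 756)
h200_expected, h200_gap = gap_vs_expected(340, 836)
print(f"H100 PCIe: expected ~{h100_expected:.0f}, gap ~{h100_gap:.1f}x")
print(f"H200 NVL:  expected ~{h200_expected:.0f}, gap ~{h200_gap:.1f}x")
```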

Likely causes of the gap (in order of probability):

  1. Thermal throttle — confirmed, sw_thermal covers entire steady window
  2. Power limit below TDP — GPU may be software-limited below 350W/600W. Previous user may have set a lower limit via nvidia-smi -pl and it was not reset. Our normalization sets clock locks but does NOT reset power limit. Key check: nvidia-smi -q | grep "Power Limit" — default vs enforced.
  3. Matrix size — ruled out. bee-gpu-burn uses 4096×4096×4096 for fp16, 8192×8192×4096 for fp8. These are large enough for peak tensor utilization.

Power limit gap analysis (H100 PCIe):

  • Avg clock 1384 MHz = 79% of boost 1755 MHz
  • Expected TOPS at 79% clock: 756 × 0.79 ≈ 597 TFLOPS
  • Actually measured: 329 TOPS = 55% of that estimate
  • Remaining gap after accounting for clock throttle: ~45%
  • Most likely explanation: enforced power limit < 350W TDP, further reducing sustainable clock beyond what sw_thermal alone would cause.
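The arithmetic in this list, as a Python sketch. It keeps the note's simplifying assumption that tensor throughput scales linearly with SM clock; the function name is made up for illustration.

```python
def clock_scaled_gap(measured_tops, spec_dense_tflops, avg_clock_mhz, boost_clock_mhz):
    """Scale the spec by the achieved clock fraction and compare with the measurement.

    Assumes throughput is proportional to clock, which is a first-order estimate only.
    """
    clock_fraction = avg_clock_mhz / boost_clock_mhz
    expected_at_clock = spec_dense_tflops * clock_fraction
    achieved_fraction = measured_tops / expected_at_clock
    return clock_fraction, expected_at_clock, achieved_fraction


fraction, expected, achieved = clock_scaled_gap(329, 756, 1384, 1755)
print(f"clock fraction: {fraction:.0%}")
print(f"expected at that clock: ~{expected:.0f} TFLOPS")
print(f"achieved: {achieved:.0%}, unexplained gap: {1 - achieved:.0%}")
```

Without rounding the clock fraction to 79% first, the expectation comes out about one TFLOPS lower than the figure in the list; the roughly 45% unexplained gap is unchanged.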

Action item:

Add power.limit (enforced) AND power.default_limit to queryBenchmarkGPUInfo so result.json shows if the card was pre-configured with a non-default limit. If enforced < default × 0.95 → add finding "GPU power limit is below default TDP".
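A minimal Python sketch of the proposed check. It assumes the two limits are read with `nvidia-smi --query-gpu=power.default_limit,enforced.power.limit --format=csv,noheader,nounits`; the helper names and the sample line are hypothetical, and a real implementation would also need to handle fields reported as unavailable.

```python
def parse_power_limits(csv_line):
    """Parse one 'default, enforced' CSV line (watts, no units) into two floats."""
    default_w, enforced_w = (float(field) for field in csv_line.split(","))
    return default_w, enforced_w


def power_limit_finding(enforced_w, default_w, tolerance=0.95):
    """Return the finding text when the enforced limit is more than 5% below default."""
    if default_w <= 0:
        raise ValueError("default_w must be positive")
    if enforced_w < default_w * tolerance:
        return "GPU power limit is below default TDP"
    return None


# Hypothetical output line for a 350 W card left at a 300 W limit.
default_w, enforced_w = parse_power_limits("350.00, 300.00")
print(power_limit_finding(enforced_w, default_w))
```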

CPU/RAM impact on GPU FLOPS:

None. Pure on-GPU GEMM is fully compute-bound once data is in VRAM. CPU core count and host RAM are irrelevant.

Compute efficiency metric (proposed, no hardcode)

Instead of comparing TOPS to a hardcoded spec, compute: tops_per_sm_per_ghz = measured_tops / (sm_count × avg_clock_ghz)

This is model-agnostic. A GPU computing correctly at its actual frequency will show a consistent tops_per_sm_per_ghz regardless of throttle level. A GPU with degraded silicon will show low tops_per_sm_per_ghz even at normal clocks.

SM count is queryable: nvidia-smi --query-gpu=attribute.multiprocessor_count (needs to be added to queryBenchmarkGPUInfo).
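A Python sketch of the metric, plugged with the throttled H100 PCIe run from this note. The SM count of 114 is an assumption derived from the 456 Tensor Cores listed above at four Tensor Cores per SM; the resulting number comes from a throttled server, so it is not one of the reference values still to be established.

```python
def tops_per_sm_per_ghz(measured_tops, sm_count, avg_clock_mhz):
    """Throughput normalized by SM count and average clock (in GHz)."""
    if sm_count <= 0 or avg_clock_mhz <= 0:
        raise ValueError("sm_count and avg_clock_mhz must be positive")
    return measured_tops / (sm_count * avg_clock_mhz / 1000.0)


# Throttled H100 PCIe fp16_tensor run: 329 TOPS at 1384 MHz, assumed 114 SMs.
h100_pcie = tops_per_sm_per_ghz(329, 114, 1384)
# If TOPS and clock rise together (less throttle), the metric stays the same.
less_throttled = tops_per_sm_per_ghz(329 * 1.2, 114, 1384 * 1.2)
print(round(h100_pcie, 2), round(less_throttled, 2))
```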

Reference values to establish after baseline runs:

  • H100 PCIe fp16_tensor: TBD tops/SM/GHz
  • H100 SXM fp16_tensor: TBD tops/SM/GHz

Proposed threshold changes (pending more data)

  1. low_sm_clock_vs_target: lower the threshold from 90% to 85%, based on the 91–92% observed on healthy HBM2e. Or remove it entirely, since sw_power/sw_thermal already capture the root cause.

  2. variance_too_high (StabilityScore < 85): healthy HBM2e WILL oscillate under power cap. Consider suppressing this flag when power is flat and usage is 100% (oscillation is expected). Or lower threshold to 70.

  3. New signal: MHz/Watt efficiency: if base_graphics_clock_mhz is available, ratio avg_clock / power_w could identify degraded silicon (HBM3 restored S1 would have been caught by this).
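Proposal 2 could look roughly like the Python sketch below. The cutoffs for "flat power" (power CV of at most 2%) and "100% usage" (at least 99%) are hypothetical placeholders, as are the function and parameter names; only the StabilityScore threshold of 85 comes from this note.

```python
def variance_too_high(stability_score, power_cv_pct, gpu_util_pct,
                      score_threshold=85.0,
                      flat_power_cv_pct=2.0,
                      full_util_pct=99.0):
    """Flag unstable clocks unless the oscillation looks like normal power-cap behavior."""
    if stability_score >= score_threshold:
        return False
    # Flat power at full utilization: clock oscillation is expected, so suppress the flag.
    power_capped_oscillation = (power_cv_pct <= flat_power_cv_pct
                                and gpu_util_pct >= full_util_pct)
    return not power_capped_oscillation


print(variance_too_high(70.0, power_cv_pct=1.0, gpu_util_pct=100.0))  # suppressed
print(variance_too_high(70.0, power_cv_pct=8.0, gpu_util_pct=100.0))  # flagged
```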

Decision deferred until baseline on SXM designed servers collected.