
Michael Chus d973231f37 Enhance benchmark: server power via IPMI, efficiency metrics, FP64, power limit check

- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
  compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
  from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-06 22:26:52 +03:00


Benchmark clock calibration research

Status

In progress. Baseline data from production servers pending.

Background

The benchmark locks GPU clocks to MaxGraphicsClockMHz (boost) via nvidia-smi -lgc before the steady-state phase. The metric low_sm_clock_vs_target fires when avg_steady_clock < locked_target * 0.90.

Problem: boost clock is the theoretical maximum under ideal cooling. In practice, even a healthy GPU in a non-ideal server will sustain clocks well below boost. The 90% threshold has no empirical basis.
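As a minimal sketch of the current rule described above (the function and parameter names are illustrative, not taken from the benchmark source), the check reduces to one comparison:

```python
def low_sm_clock_vs_target(avg_steady_clock_mhz: float,
                           locked_target_mhz: float,
                           threshold: float = 0.90) -> bool:
    """Return True when the average steady clock falls below threshold * lock target."""
    return avg_steady_clock_mhz < locked_target_mhz * threshold


# Throttled H100 PCIe: 1384 MHz average against a 1755 MHz lock target.
print(low_sm_clock_vs_target(1384, 1755))  # True, the flag fires
# Healthy H100 PCIe baseline: about 1600 MHz against the same lock target.
print(low_sm_clock_vs_target(1600, 1755))  # False, just above the 1579.5 MHz cutoff
```

The healthy card clears the 90% cutoff by only about 20 MHz, which illustrates how little margin the current threshold leaves.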

Key observations (2026-04-06)

H100 PCIe — new card, server not designed for it

avg clock 1384 MHz, P95 1560 MHz (unstable; boost probably 1755 MHz)
Thermal sustain: 0.0 (sw_thermal covers entire steady window)
Stability: 70.0 — clocks erratic, no equilibrium found
Degradation: power_capped, thermal_limited, low_sm_clock_vs_target, variance_too_high

H200 NVL — new card, server not designed for it

avg clock = P95 = 1635 MHz (perfectly stable)
Thermal sustain: 0.0 (sw_thermal + sw_power cover entire steady window)
Stability: 92.0 — found stable thermal equilibrium at 1635 MHz
Degradation: power_capped, thermal_limited
Compute: 989 TOPS — card is computing correctly for its frequency

Key insight

The meaningful distinction is not whether the card throttles but how stably it throttles. The H200 found a thermal equilibrium (avg == P95, Stability 92); the H100 did not (avg 1384 MHz well below P95 1560 MHz, Stability 70). Both are new cards; the H100's instability may reflect a more severe thermal mismatch or a card issue.

sw_power ≈ sw_thermal pattern: server cooling constraint, card likely OK.
hw_thermal >> sw_thermal pattern: card itself overheating, investigate.
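The two patterns above could be turned into a rough classifier. This is only a sketch: it assumes the throttle counters arrive as a percentage of the steady window, and the tolerance for "≈" and the factor for ">>" are hypothetical values that would need calibration against real runs.

```python
def classify_throttle(sw_power_pct: float,
                      sw_thermal_pct: float,
                      hw_thermal_pct: float,
                      similar_tol_pct: float = 15.0,   # hypothetical meaning of "≈"
                      dominance_factor: float = 2.0    # hypothetical meaning of ">>"
                      ) -> str:
    """Map throttle-reason shares of the steady window to a coarse diagnosis."""
    # hw_thermal far above sw_thermal: the card itself is overheating.
    # max(..., 1.0) avoids treating any hw_thermal as dominant over a zero baseline
    # unless it is at least dominance_factor percent of the window.
    if hw_thermal_pct >= dominance_factor * max(sw_thermal_pct, 1.0):
        return "card_overheating_investigate"
    # sw_power and sw_thermal both active and of similar size: server cooling constraint.
    if (sw_power_pct > 0 and sw_thermal_pct > 0
            and abs(sw_power_pct - sw_thermal_pct) <= similar_tol_pct):
        return "server_cooling_constraint"
    return "inconclusive"


# H200 NVL case: sw_thermal + sw_power cover the entire steady window.
print(classify_throttle(sw_power_pct=100, sw_thermal_pct=100, hw_thermal_pct=0))
```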

Hypothesis for baseline

After testing on servers designed for their GPUs (proper cooling):

Healthy GPU under sustained load will run at a stable fraction of boost
Expected: avg_steady ≈ 80–95% of boost depending on model and TDP class
Base clock (clocks.base.gr) may be a better reference than boost: a healthy card under real workload should comfortably exceed base clock

Baseline: H100 PCIe HBM2e — designed server (2026-04-06, 10 samples)

Source: external stress test tool, ~90s runs, designed server, adequate power.

Healthy fingerprint

Power: hits cap ~340–360W immediately, stays flat throughout — HEALTHY
Clock: starts ~1750 MHz, oscillates and declines to ~1540–1600 MHz by 90s
- Avg steady (visual): ~1580–1620 MHz
- vs boost 1755 MHz: ~91–92%
- Oscillation is NORMAL — this is the boost algorithm balancing under power cap
- Stable power + oscillating clocks = healthy power-cap behavior
Temperature: linear rise ~38°C → 75–80°C over 90s (no runaway)
Consistency: all 10 samples within ±20 MHz — very repeatable

Characteristic pattern

Flat power line + oscillating/declining clock line = GPU correctly managed by power cap algorithm. Do NOT flag this as instability.

Clock CV implication

The healthy oscillation WILL produce moderate ClockCVPct (~5–10%). The current variance_too_high threshold (StabilityScore < 85) may fire on healthy HBM2e PCIe cards. Needs recalibration.
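For reference, a sketch of how a clock coefficient of variation can be computed from steady-window samples. It assumes ClockCVPct is the population standard deviation divided by the mean, times 100; these notes do not confirm the exact definition used by the benchmark, and the sample values are illustrative.

```python
import statistics


def clock_cv_pct(clock_samples_mhz: list[float]) -> float:
    """Coefficient of variation of clock samples, in percent."""
    if not clock_samples_mhz:
        raise ValueError("need at least one clock sample")
    mean = statistics.fmean(clock_samples_mhz)
    if mean == 0:
        raise ValueError("mean clock is zero")
    return statistics.pstdev(clock_samples_mhz) / mean * 100.0


# Perfectly stable clocks (H200 NVL style, avg == P95) give a CV of 0.
print(clock_cv_pct([1635.0] * 10))
# Clocks oscillating between two levels give a non-zero CV.
print(clock_cv_pct([1500.0, 1700.0, 1500.0, 1700.0]))
```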

Baseline: H100 HBM3 OEM SXM Custom (restored) — 2 confirmed samples

Source: pytorch_training_loop stress test, 120s (90s stress + 30s cooldown). Confirmed GPU: NVIDIA H100 80GB HBM3, GH100 rev a1.

GPU clock reference (from nvidia-smi, idle):

base_clock_mhz: 1095
boost_clock_mhz: 1755 (nvidia-smi clocks.max.graphics at idle)
achieved_max_clock_mhz: 1980 (actual burst max observed by tool)
Our benchmark locks to clocks.max.graphics = likely 1980 MHz for this chip

Observed under 700W sustained load (both samples nearly identical):

Power: ~700W flat — SXM slot, adequate power confirmed
Clock steady range: ~1380–1480 MHz, avg ~1420–1460 MHz
vs 1980 MHz (lock target): 72–74% — severely below
vs 1755 MHz (nvidia-smi boost): 81–83%
vs 1095 MHz (base): 130% — above base but far below expected for SXM
Clock/Watt: ~2.1 MHz/W vs HBM2e ~4.6 MHz/W — 2× worse efficiency
Temperature: 38°C → 79–80°C (same rate as HBM2e)
Oscillation: present, similar character to HBM2e but at much lower frequency
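The Clock/Watt figures above can be reproduced with a one-line ratio. The inputs are assumptions chosen as midpoints of the observed ranges (1440 MHz at 700 W for HBM3, 1600 MHz at 350 W for HBM2e); the function name is illustrative.

```python
def clock_per_watt(avg_clock_mhz: float, avg_power_w: float) -> float:
    """Average steady clock divided by average power draw, in MHz/W."""
    if avg_power_w <= 0:
        raise ValueError("power must be positive")
    return avg_clock_mhz / avg_power_w


hbm3 = clock_per_watt(1440, 700)   # midpoint of 1420-1460 MHz at ~700 W
hbm2e = clock_per_watt(1600, 350)  # midpoint of 1580-1620 MHz at ~350 W (assumed)
print(f"{hbm3:.1f} {hbm2e:.1f} {hbm2e / hbm3:.1f}x")  # 2.1 4.6 2.2x
```

Note that the two cards run at different power caps, so part of the difference in this ratio comes from the 700 W vs 350 W envelope rather than from card health alone.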

Diagnosis

These restored cards are degraded. A healthy H100 SXM in a designed server (DGX H100, HGX H100) should sustain ~1800–1900 MHz at 700W (~91–96% of 1980). The 72–74% result is a clear signal of silicon or VRM degradation from the refurbishment process.

Clock pattern note

Images 8/9 (previously marked as "HBM3 restored") are now confirmed identical to images 19/20. Both sample sets show the same degraded pattern, which points to the same batch.

Baseline matrix (filled where data available)

GPU model	Config	Avg clock steady	vs boost	Clock/Watt	Notes
H100 PCIe HBM2e	designed server	1580–1620 MHz	91–92%	~4.6 MHz/W	10 samples, healthy
H100 SXM HBM3 restored	700W full	1420–1460 MHz	72–74% of 1980	~2.1 MHz/W	4 samples confirmed, degraded
H100 SXM HBM3 healthy	designed	~1800–1900 MHz est.	~91–96% est.	~2.7 MHz/W est.	need real baseline
H200 NVL	designed	TBD	TBD	TBD	need baseline

H100 official spec (from NVIDIA datasheet)

Source: NVIDIA H100 Tensor Core GPU Datasheet (image 23, 2026-04-06). In the datasheet, all TOPS marked * are with structural sparsity enabled; the values below are dense (datasheet value divided by 2).

Model	FP16 Tensor (dense)	TF32 (dense)	FP8 (dense)	TDP	Memory
H100 80GB PCIe	756 TFLOPS	378 TFLOPS	1,513 TFLOPS	350W	HBM2e
H100 NVL 94GB PCIe	990 TFLOPS	495 TFLOPS	1,980 TFLOPS	400W	HBM3
H100 80GB SXM (BQQV)	989 TFLOPS	494 TFLOPS	—	700W	HBM3
H100 94GB SXM (BUBB)	989 TFLOPS	494 TFLOPS	—	700W	HBM2e

Notes:

SXM boards do NOT list FP8 peak in this table (field empty)
fp8_e5m2 is unsupported on H100 PCIe HBM2e — confirmed in our tests
Tensor Cores: PCIe = 456, SXM = 528 (16% more on SXM)

Observed efficiency (H100 80GB PCIe, throttled server)

From the report in this session (power+thermal throttle throughout steady):

Precision	Measured	Spec (dense)	% of spec
fp16_tensor	329 TOPS	756 TFLOPS	44%
fp32_tf32	115 TOPS	378 TFLOPS	30%
fp8_e4m3	505 TOPS	1,513 TFLOPS	33%

30–44% of spec is expected given sustained power+thermal throttle (avg clock 1384 MHz vs boost 1755 MHz = 79%). The GPU is computing correctly for its actual frequency: the low TOPS comes from throttle, not a silicon defect.

H200 official spec (from NVIDIA datasheet, image 24, 2026-04-06)

Datasheet format: without sparsity / with sparsity. The table below uses the without-sparsity (dense) values.

Model	FP16 Tensor (dense)	TF32 (dense)	FP8 (dense)	TDP	Memory
H200 NVL PCIe	836 TFLOPS	418 TFLOPS	1,570 TFLOPS	600W	HBM3e 141GB
H200 SXM	990 TFLOPS	495 TFLOPS	1,979 TFLOPS	700W	HBM3e 141GB

Observed efficiency (H200 NVL PCIe, throttled non-designed server)

Avg clock 1635 MHz (62% of boost ~2619 MHz). Entire steady in thermal throttle.

Precision	Measured	Spec (dense)	% of spec
fp16_tensor	340 TOPS	836 TFLOPS	41%
fp32_tf32	120 TOPS	418 TFLOPS	29%
fp8_e4m3	529 TOPS	1,570 TFLOPS	34%

Comparable to H100 PCIe efficiency (30–44%) despite the different card: both are throttle-limited. This confirms that % of spec is not a quality signal; it reflects the thermal environment. tops_per_sm_per_ghz is the right metric.

Real-world GEMM efficiency reference (2026-04-06, web research)

Sources: SemiAnalysis MI300X vs H100 vs H200 training benchmark; cuBLAS optimization worklog (hamzaelshafie.bearblog.dev); Lambda AI H100 performance analysis.

What healthy systems actually achieve:

H100 SXM in designed server: ~720 TFLOPS FP16 = ~73% of spec
cuBLAS large square GEMM (8192³): up to ~83% flop utilization
H200 NVL PCIe: no public data, extrapolating ~73% → ~610 TFLOPS FP16

Our results vs expectation:

GPU	Our FP16	Expected (73%)	Our % of spec	Gap
H100 PCIe HBM2e	329 TOPS	~552 TFLOPS	44%	~1.7× below
H200 NVL PCIe	340 TOPS	~610 TFLOPS	41%	~1.8× below

Our results are roughly half of what a healthy system achieves even under throttle. This is NOT normal — 30-44% is not the industry baseline.
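The "Expected (73%)" and "Gap" columns follow from simple arithmetic, sketched here. The 0.73 fraction is the H100 SXM reference quoted above; applying it to PCIe cards is the same extrapolation these notes make, and the function name is illustrative.

```python
HEALTHY_FRACTION = 0.73  # ~720 of 989 TFLOPS on a healthy H100 SXM, per the reference above


def gap_vs_healthy(measured_tops: float, spec_dense_tflops: float,
                   healthy_fraction: float = HEALTHY_FRACTION) -> tuple[float, float]:
    """Return (expected TFLOPS on a healthy system, factor by which we fall short)."""
    expected = spec_dense_tflops * healthy_fraction
    return expected, expected / measured_tops


for name, measured, spec in [("H100 PCIe HBM2e", 329, 756),
                             ("H200 NVL PCIe", 340, 836)]:
    expected, gap = gap_vs_healthy(measured, spec)
    print(f"{name}: expected ~{expected:.0f} TFLOPS, {gap:.1f}x below")
# H100 PCIe HBM2e: expected ~552 TFLOPS, 1.7x below
# H200 NVL PCIe: expected ~610 TFLOPS, 1.8x below
```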

Likely causes of the gap (in order of probability):

Thermal throttle — confirmed, sw_thermal covers entire steady window
Power limit below TDP — GPU may be software-limited below 350W/600W. Previous user may have set a lower limit via nvidia-smi -pl and it was not reset. Our normalization sets clock locks but does NOT reset power limit. Key check: nvidia-smi -q | grep "Power Limit" — default vs enforced.
Matrix size — ruled out. bee-gpu-burn uses 4096×4096×4096 for fp16, 8192×8192×4096 for fp8. These are large enough for peak tensor utilization.

Power limit gap analysis (H100 PCIe):

Avg clock 1384 MHz = 79% of boost 1755 MHz
Expected TOPS at 79% clock: 756 × 0.79 ≈ 597 TFLOPS
Actually measured: 329 TOPS = 55% of that estimate
Remaining gap after accounting for clock throttle: ~45%
Most likely explanation: enforced power limit < 350W TDP, further reducing sustainable clock beyond what sw_thermal alone would cause.
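The gap analysis above can be written out as a short calculation. It assumes throughput scales linearly with SM clock, which is a simplification; the function name is illustrative.

```python
def clock_scaled_gap(measured_tops: float, spec_dense_tflops: float,
                     avg_clock_mhz: float, boost_clock_mhz: float):
    """Scale the dense spec by the achieved clock ratio and compare with the measurement."""
    clock_ratio = avg_clock_mhz / boost_clock_mhz      # fraction of boost actually held
    expected = spec_dense_tflops * clock_ratio         # spec scaled linearly with clock
    achieved = measured_tops / expected                # share of the scaled estimate reached
    return clock_ratio, expected, achieved


ratio, expected, achieved = clock_scaled_gap(329, 756, 1384, 1755)
print(f"clock ratio {ratio:.0%}, expected ~{expected:.0f} TFLOPS, achieved {achieved:.0%}")
```

With the unrounded clock ratio the estimate lands at about 596 TFLOPS rather than 597; the conclusion (roughly 55% achieved, about 45% unexplained) is unchanged.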

Action item:

Add power.limit (enforced) AND power.default_limit to queryBenchmarkGPUInfo so result.json shows if the card was pre-configured with a non-default limit. If enforced < default × 0.95 → add finding "GPU power limit is below default TDP".
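A minimal sketch of the proposed check, assuming the enforced and default limits have already been read in watts (the function name and the handling of a missing default are illustrative choices, not the benchmark's actual code):

```python
from typing import Optional

POWER_LIMIT_FINDING = "GPU power limit is below default TDP"


def power_limit_finding(enforced_w: float, default_w: float,
                        tolerance: float = 0.95) -> Optional[str]:
    """Return the finding text when the enforced limit is more than 5% below default."""
    if default_w <= 0:
        return None  # default limit unknown; assumed to mean "cannot judge"
    if enforced_w < default_w * tolerance:
        return POWER_LIMIT_FINDING
    return None


print(power_limit_finding(300, 350))  # finding: 300 W is below 332.5 W
print(power_limit_finding(350, 350))  # None: limit is at default
```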

CPU/RAM impact on GPU FLOPS:

None. Pure on-GPU GEMM is fully compute-bound once data is in VRAM. CPU core count and host RAM are irrelevant.

Compute efficiency metric (proposed, no hardcode)

Instead of comparing TOPS to a hardcoded spec, compute: tops_per_sm_per_ghz = measured_tops / (sm_count × avg_clock_ghz)

This is model-agnostic. A GPU computing correctly at its actual frequency will show a consistent tops_per_sm_per_ghz regardless of throttle level. A GPU with degraded silicon will show low tops_per_sm_per_ghz even at normal clocks.

SM count is queryable: nvidia-smi --query-gpu=attribute.multiprocessor_count (needs to be added to queryBenchmarkGPUInfo).
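A sketch of the proposed metric follows. The SM count of 114 for the H100 PCIe is an assumption derived from the 456 Tensor Cores listed earlier at 4 Tensor Cores per SM; the function name mirrors the metric name but is not taken from the benchmark source.

```python
def tops_per_sm_per_ghz(measured_tops: float, sm_count: int,
                        avg_clock_mhz: float) -> float:
    """measured_tops / (sm_count * avg_clock_ghz), as defined above."""
    if sm_count <= 0 or avg_clock_mhz <= 0:
        raise ValueError("sm_count and avg_clock_mhz must be positive")
    avg_clock_ghz = avg_clock_mhz / 1000.0
    return measured_tops / (sm_count * avg_clock_ghz)


# Throttled H100 PCIe fp16_tensor run: 329 TOPS at 1384 MHz average clock.
value = tops_per_sm_per_ghz(329, 114, 1384)
print(f"{value:.2f} TOPS/SM/GHz")
```

This single data point is not a reference value; the baseline runs listed below are still needed to know what a healthy card should show.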

Reference values to establish after baseline runs:

H100 PCIe fp16_tensor: TBD tops/SM/GHz
H100 SXM fp16_tensor: TBD tops/SM/GHz

Proposed threshold changes (pending more data)

low_sm_clock_vs_target: lower threshold from 90% to 85% based on observed 91–92% on healthy HBM2e. Or remove entirely, since sw_power/sw_thermal already capture the root cause.
variance_too_high (StabilityScore < 85): healthy HBM2e WILL oscillate under power cap. Consider suppressing this flag when power is flat and usage is 100% (oscillation is expected). Or lower threshold to 70.
New signal: MHz/Watt efficiency: if base_graphics_clock_mhz is available, ratio avg_clock / power_w could identify degraded silicon (HBM3 restored S1 would have been caught by this).

Decision deferred until baseline on SXM designed servers collected.
