Enhance benchmark: server power via IPMI, efficiency metrics, FP64, power limit check

- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
  compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
  from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-04-06 22:26:52 +03:00
parent f5d175f488
commit d973231f37
5 changed files with 551 additions and 18 deletions

View File

@@ -0,0 +1,248 @@
# Benchmark clock calibration research
## Status
In progress. Baseline data from production servers pending.
## Background
The benchmark locks GPU clocks to `MaxGraphicsClockMHz` (boost) via `nvidia-smi -lgc`
before the steady-state phase. The metric `low_sm_clock_vs_target` fires when
`avg_steady_clock < locked_target * 0.90`.
Problem: boost clock is the theoretical maximum under ideal cooling. In practice,
even a healthy GPU in a non-ideal server will sustain clocks well below boost.
The 90% threshold has no empirical basis.
## Key observations (2026-04-06)
### H100 PCIe — new card, server not designed for it
- avg clock 1384 MHz, P95 1560 MHz (unstable, proba boost 1755 MHz)
- Thermal sustain: 0.0 (sw_thermal covers entire steady window)
- Stability: 70.0 — clocks erratic, no equilibrium found
- Degradation: power_capped, thermal_limited, low_sm_clock_vs_target, variance_too_high
### H200 NVL — new card, server not designed for it
- avg clock = P95 = 1635 MHz (perfectly stable)
- Thermal sustain: 0.0 (sw_thermal + sw_power cover entire steady window)
- Stability: 92.0 — found stable thermal equilibrium at 1635 MHz
- Degradation: power_capped, thermal_limited
- Compute: 989 TOPS — card is computing correctly for its frequency
### Key insight
The meaningful distinction is not *whether* the card throttles but *how stably*
it throttles. H200 found a thermal equilibrium (avg == P95, Stability 92),
H100 did not (avg << P95, Stability 70). Both are new cards; the H100's
instability may reflect a more severe thermal mismatch or a card issue.
`sw_power ≈ sw_thermal` pattern = server cooling constraint, card likely OK.
`hw_thermal >> sw_thermal` pattern = card itself overheating, investigate.
## Hypothesis for baseline
After testing on servers designed for their GPUs (proper cooling):
- Healthy GPU under sustained load will run at a stable fraction of boost
- Expected: avg_steady ≈ 8095% of boost depending on model and TDP class
- Base clock (`clocks.base.gr`) may be a better reference than boost:
a healthy card under real workload should comfortably exceed base clock
## Baseline: H100 PCIe HBM2e — designed server (2026-04-06, 10 samples)
Source: external stress test tool, ~90s runs, designed server, adequate power.
### Healthy fingerprint
- **Power**: hits cap ~340360W immediately, stays flat throughout — HEALTHY
- **Clock**: starts ~1750 MHz, oscillates and declines to ~15401600 MHz by 90s
- Avg steady (visual): **~15801620 MHz**
- vs boost 1755 MHz: **~9192%**
- Oscillation is NORMAL — this is the boost algorithm balancing under power cap
- Stable power + oscillating clocks = healthy power-cap behavior
- **Temperature**: linear rise ~38°C → 7580°C over 90s (no runaway)
- **Consistency**: all 10 samples within ±20 MHz — very repeatable
### Characteristic patten
Flat power line + oscillating/declining clock line = GPU correctly managed by
power cap algorithm. Do NOT flag this as instability.
### Clock CV implication
The healthy oscillation WILL produce moderate ClockCVPct (~510%).
The current `variance_too_high` threshold (StabilityScore < 85) may fire on
healthy HBM2e PCIe cards. Needs recalibration.
---
## Baseline: H100 HBM3 OEM SXM Custom (restored) — 2 confirmed samples
Source: pytorch_training_loop stress test, 120s (90s stress + 30s cooldown).
Confirmed GPU: NVIDIA H100 80GB HBM3, GH100 rev a1.
### GPU clock reference (from nvidia-smi, idle):
- base_clock_mhz: **1095**
- boost_clock_mhz: **1755** (nvidia-smi `clocks.max.graphics` at idle)
- achieved_max_clock_mhz: **1980** (actual burst max observed by tool)
- Our benchmark locks to `clocks.max.graphics` = likely 1980 MHz for this chip
### Observed under 700W sustained load (both samples nearly identical):
- Power: ~700W flat — SXM slot, adequate power confirmed
- Clock steady range: **~13801480 MHz**, avg **~14201460 MHz**
- vs 1980 MHz (lock target): **7274%** — severely below
- vs 1755 MHz (nvidia-smi boost): **8183%**
- vs 1095 MHz (base): 130% — above base but far below expected for SXM
- Clock/Watt: ~2.1 MHz/W vs HBM2e ~4.6 MHz/W — 2× worse efficiency
- Temperature: 38°C → 7980°C (same rate as HBM2e)
- Oscillation: present, similar character to HBM2e but at much lower frequency
### Diagnosis
These restored cards are degraded. A healthy H100 SXM in a designed server
(DGX H100, HGX H100) should sustain ~18001900 MHz at 700W (~9196% of 1980).
The 7274% result is a clear signal of silicon or VRM degradation from the
refurbishment process.
### Clock pattern note
Images 8/9 (previously marked as "HBM3 restored") are now confirmed identical
to images 19/20. Both sample sets show same degraded pattern — same batch.
---
## Baseline matrix (filled where data available)
| GPU model | Config | Avg clock steady | vs boost | Clock/Watt | Notes |
|---|---|---|---|---|---|
| H100 PCIe HBM2e | designed server | 15801620 MHz | 9192% | ~4.6 MHz/W | 10 samples, healthy |
| H100 SXM HBM3 restored | 700W full | 14201460 MHz | 7274% of 1980 | ~2.1 MHz/W | 4 samples confirmed, degraded |
| H100 SXM HBM3 healthy | designed | ~18001900 MHz est. | ~9196% est. | ~2.7 MHz/W est. | need real baseline |
| H200 NVL | designed | TBD | TBD | TBD | need baseline |
---
## H100 official spec (from NVIDIA datasheet)
Source: NVIDIA H100 Tensor Core GPU Datasheet (image 23, 2026-04-06).
All TOPS marked * are with structural sparsity enabled. Divide by 2 for dense.
| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H100 80GB PCIe | 756 TFLOPS | 378 TFLOPS | 1,513 TFLOPS | 350W | HBM2e |
| H100 NVL 94GB PCIe | 990 TFLOPS | 495 TFLOPS | 1,980 TFLOPS | 400W | HBM3 |
| H100 80GB SXM (BQQV) | 989 TFLOPS | 494 TFLOPS | — | 700W | HBM3 |
| H100 94GB SXM (BUBB) | 989 TFLOPS | 494 TFLOPS | — | 700W | HBM2e |
Notes:
- SXM boards do NOT list FP8 peak in this table (field empty)
- fp8_e5m2 is unsupported on H100 PCIe HBM2e — confirmed in our tests
- Tensor Cores: PCIe = 456, SXM = 528 (16% more on SXM)
## Observed efficiency (H100 80GB PCIe, throttled server)
From the report in this session (power+thermal throttle throughout steady):
| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 329 TOPS | 756 TFLOPS | 44% |
| fp32_tf32 | 115 TOPS | 378 TFLOPS | 30% |
| fp8_e4m3 | 505 TOPS | 1,513 TFLOPS | 33% |
3344% of spec is expected given sustained power+thermal throttle (avg clock
1384 MHz vs boost 1755 MHz = 79%). The GPU is computing correctly for its
actual frequency — the low TOPS comes from throttle, not silicon defect.
## H200 official spec (from NVIDIA datasheet, image 24, 2026-04-06)
Format: without sparsity / with sparsity.
| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H200 NVL PCIe | 836 TFLOPS | 418 TFLOPS | 1,570 TFLOPS | 600W | HBM3e 141GB |
| H200 SXM | 990 TFLOPS | 495 TFLOPS | 1,979 TFLOPS | 700W | HBM3e 141GB |
## Observed efficiency (H200 NVL PCIe, throttled non-designed server)
Avg clock 1635 MHz (62% of boost ~2619 MHz). Entire steady in thermal throttle.
| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 340 TOPS | 836 TFLOPS | 41% |
| fp32_tf32 | 120 TOPS | 418 TFLOPS | 29% |
| fp8_e4m3 | 529 TOPS | 1,570 TFLOPS | 34% |
Comparable to H100 PCIe efficiency (3344%) despite different architecture —
both are throttle-limited. Confirms that % of spec is not a quality signal,
it reflects the thermal environment. tops_per_sm_per_ghz is the right metric.
## Real-world GEMM efficiency reference (2026-04-06, web research)
Sources: SemiAnalysis MI300X vs H100 vs H200 training benchmark; cuBLAS optimization
worklog (hamzaelshafie.bearblog.dev); Lambda AI H100 performance analysis.
### What healthy systems actually achieve:
- H100 SXM in designed server: **~720 TFLOPS FP16 = ~73% of spec**
- cuBLAS large square GEMM (8192³): up to **~83% flop utilization**
- H200 NVL PCIe: no public data, extrapolating ~73% → ~610 TFLOPS FP16
### Our results vs expectation:
| GPU | Our FP16 | Expected (73%) | Our % of spec | Gap |
|---|---|---|---|---|
| H100 PCIe HBM2e | 329 TOPS | ~552 TFLOPS | 44% | ~1.7× below |
| H200 NVL PCIe | 340 TOPS | ~610 TFLOPS | 41% | ~1.8× below |
Our results are roughly **half** of what a healthy system achieves even under throttle.
This is NOT normal — 30-44% is not the industry baseline.
### Likely causes of the gap (in order of probability):
1. **Thermal throttle** — confirmed, sw_thermal covers entire steady window
2. **Power limit below TDP** — GPU may be software-limited below 350W/600W.
Previous user may have set a lower limit via nvidia-smi -pl and it was not
reset. Our normalization sets clock locks but does NOT reset power limit.
Key check: `nvidia-smi -q | grep "Power Limit"` — default vs enforced.
3. **Matrix size** — ruled out. bee-gpu-burn uses 4096×4096×4096 for fp16,
8192×8192×4096 for fp8. These are large enough for peak tensor utilization.
### Power limit gap analysis (H100 PCIe):
- Avg clock 1384 MHz = 79% of boost 1755 MHz
- Expected TOPS at 79% clock: 756 × 0.79 ≈ 597 TFLOPS
- Actually measured: 329 TOPS = 55% of that estimate
- Remaining gap after accounting for clock throttle: ~45%
- Most likely explanation: enforced power limit < 350W TDP, further reducing
sustainable clock beyond what sw_thermal alone would cause.
### Action item:
Add `power.limit` (enforced) AND `power.default_limit` to queryBenchmarkGPUInfo
so result.json shows if the card was pre-configured with a non-default limit.
If enforced < default × 0.95 → add finding "GPU power limit is below default TDP".
### CPU/RAM impact on GPU FLOPS:
None. Pure on-GPU GEMM is fully compute-bound once data is in VRAM.
CPU core count and host RAM are irrelevant.
## Compute efficiency metric (proposed, no hardcode)
Instead of comparing TOPS to a hardcoded spec, compute:
tops_per_sm_per_ghz = measured_tops / (sm_count × avg_clock_ghz)
This is model-agnostic. A GPU computing correctly at its actual frequency
will show a consistent tops_per_sm_per_ghz regardless of throttle level.
A GPU with degraded silicon will show low tops_per_sm_per_ghz even at
normal clocks.
SM count is queryable: nvidia-smi --query-gpu=attribute.multiprocessor_count
(needs to be added to queryBenchmarkGPUInfo).
Reference values to establish after baseline runs:
- H100 PCIe fp16_tensor: TBD tops/SM/GHz
- H100 SXM fp16_tensor: TBD tops/SM/GHz
## Proposed threshold changes (pending more data)
1. **`low_sm_clock_vs_target`**: raise threshold from 90% to 85% based on observed
9192% on healthy HBM2e. Or remove entirely — sw_power/sw_thermal already
capture the root cause.
2. **`variance_too_high`** (StabilityScore < 85): healthy HBM2e WILL oscillate
under power cap. Consider suppressing this flag when power is flat and usage
is 100% (oscillation is expected). Or lower threshold to 70.
3. **New signal: MHz/Watt efficiency**: if base_graphics_clock_mhz is available,
ratio avg_clock / power_w could identify degraded silicon (HBM3 restored S1
would have been caught by this).
Decision deferred until baseline on SXM designed servers collected.