Run power calibration for all GPUs simultaneously
Released 2026-04-14 22:25:05 +03:00 | 174 commits to main since this release

Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.

Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:

- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run: no resource-busy contention
  from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
  card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).
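
The per-GPU search with one shared run per attempt can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code: `SearchState`, `calibrate_all`, `run_attempt`, `make_simulator`, and the 5 W tolerance are all hypothetical names and values. `run_attempt` stands in for "apply these limits, run one shared load session, report which GPUs throttled"; here it is simulated rather than calling dcgmi.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass
class SearchState:
    lo: int  # highest power limit (W) assumed or observed to be stable
    hi: int  # lowest power limit (W) observed to throttle, or the initial cap


def calibrate_all(
    gpu_ids: Iterable[int],
    min_w: int,
    max_w: int,
    run_attempt: Callable[[Dict[int, int]], Set[int]],
    tolerance_w: int = 5,
) -> Dict[int, int]:
    """Binary-search a power limit per GPU, using one shared run per attempt.

    run_attempt (hypothetical) receives {gpu_id: trial_watts} for every GPU
    that is still searching and returns the set of GPU ids that throttled.
    """
    if tolerance_w < 1:
        raise ValueError("tolerance_w must be at least 1 W")
    # Independent lo/hi bounds for each GPU; min_w is assumed to be stable.
    state = {g: SearchState(lo=min_w, hi=max_w) for g in gpu_ids}
    while True:
        # A GPU stays active until its interval is within the tolerance.
        active = {g: s for g, s in state.items() if s.hi - s.lo > tolerance_w}
        if not active:
            break
        trial = {g: (s.lo + s.hi) // 2 for g, s in active.items()}
        throttled = run_attempt(trial)  # one shared session for all active GPUs
        for g, watts in trial.items():
            if g in throttled:
                state[g].hi = watts  # too high: lower the upper bound
            else:
                state[g].lo = watts  # stable: raise the lower bound
    return {g: s.lo for g, s in state.items()}


def make_simulator(
    true_max_w: Dict[int, int],
) -> Tuple[Callable[[Dict[int, int]], Set[int]], List[Dict[int, int]]]:
    """Build a fake run_attempt that throttles above a made-up per-GPU limit."""
    calls: List[Dict[int, int]] = []

    def run_attempt(trial: Dict[int, int]) -> Set[int]:
        calls.append(dict(trial))  # record each shared attempt
        return {g for g, w in trial.items() if w > true_max_w[g]}

    return run_attempt, calls
```

Calling `calibrate_all([0, 1], 200, 400, run_attempt)` with a simulator built by `make_simulator` runs both cards in every attempt while each keeps its own bounds, so each result lands within the tolerance of that card's simulated limit.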

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>