• v8.6 3cf2e9c9dc

    Run power calibration for all GPUs simultaneously

    mchus released this 2026-04-14 22:25:05 +03:00 | 174 commits to main since this release

    Previously, each GPU was calibrated sequentially (one card fully done
    before the next started), producing the staircase temperature pattern
    seen on the graph.

    Now all GPUs run together in a single dcgmi diag -r targeted_power
    session per attempt. This means:

    • All cards are under realistic thermal load at the same time.
    • A single DCGM session handles the run — no resource-busy contention
      from concurrent dcgmi processes.
    • Binary search state (lo/hi) is tracked independently per GPU; each
      card converges to its own highest stable power limit.
    • A shared ticker polls the throttle counters of all active GPUs.
    • Resource-busy exponential back-off is shared as well, since only one
      DCGM session is running.
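
    The per-GPU search described above can be sketched roughly as follows.
    This is an illustrative sketch, not the project's actual code: the names
    SearchState, calibrate_all and run_attempt are hypothetical. run_attempt
    stands in for one shared dcgmi diag -r targeted_power session plus
    throttle polling, and is assumed to return the set of GPU ids that
    throttled. Keeping already-converged cards loaded at their final limit
    is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set


@dataclass
class SearchState:
    """Binary search bounds for one GPU, in watts (hypothetical helper)."""
    lo: int  # highest power limit known or assumed to be stable
    hi: int  # lowest power limit known or assumed to be unstable

    def converged(self, tolerance: int) -> bool:
        return self.hi - self.lo <= tolerance

    def candidate(self) -> int:
        # Midpoint lies strictly between lo and hi while hi - lo > 1.
        return (self.lo + self.hi) // 2


def calibrate_all(
    gpu_ids: Iterable[int],
    lo: int,
    hi: int,
    run_attempt: Callable[[Dict[int, int]], Set[int]],
    tolerance: int = 1,
) -> Dict[int, int]:
    """Return the highest stable limit found for each GPU.

    Each loop iteration is one shared attempt covering every GPU; only the
    lo/hi bounds are tracked separately per card.
    """
    if tolerance < 1:
        raise ValueError("tolerance must be at least 1 W")
    states = {gpu: SearchState(lo, hi) for gpu in gpu_ids}
    while True:
        # Cards still searching get their midpoint as the next limit to try.
        active = {
            gpu: state.candidate()
            for gpu, state in states.items()
            if not state.converged(tolerance)
        }
        if not active:
            break
        # Assumption: converged cards stay loaded at their final limit so the
        # thermal conditions remain realistic for the others.
        limits = {gpu: active.get(gpu, state.lo) for gpu, state in states.items()}
        throttled = run_attempt(limits)  # one shared run for all GPUs
        for gpu, limit in active.items():
            if gpu in throttled:
                states[gpu].hi = limit  # too high: tighten the upper bound
            else:
                states[gpu].lo = limit  # stable: raise the lower bound
    return {gpu: state.lo for gpu, state in states.items()}
```

    With this shape, the number of shared attempts is set by the slowest
    card to converge, rather than by the sum over all cards as in the
    sequential approach.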

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
