bee/audit/internal/platform/benchmark.go at 3cf2e9c9dcd7b552b22de41d7a21dd31012cfce2

Michael Chus 3cf2e9c9dc Run power calibration for all GPUs simultaneously

Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.

Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run — no resource-busy contention
  from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
  card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-14 22:25:05 +03:00

107 KiB
