reanimator/bee - bee - MCHUS git PRO

Michael Chus 3cf2e9c9dc Run power calibration for all GPUs simultaneously

Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.

Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run, so there is no resource-busy
  contention from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
  card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Exponential back-off on resource-busy errors is shared, since there is
  only one DCGM session.
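
The per-GPU binary search inside a shared session can be sketched roughly
as below. This is an illustration only, not the code from this commit: the
names calibrate_all, SearchState, and make_fake_session are hypothetical,
and run_session stands in for whatever launches one shared diagnostic run
and reports which GPUs throttled. It assumes integer power limits in watts
and that a card which throttles at one limit also throttles at every
higher limit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


@dataclass
class SearchState:
    """Binary search window for one GPU (hypothetical structure)."""
    lo: int                     # lowest candidate limit not yet ruled out (W)
    hi: int                     # highest candidate limit not yet ruled out (W)
    best: Optional[int] = None  # highest limit seen stable so far


def calibrate_all(
    gpu_ids: Iterable[int],
    min_w: int,
    max_w: int,
    run_session: Callable[[Dict[int, int]], Set[int]],
) -> Dict[int, Optional[int]]:
    """Search every GPU's highest stable limit using shared sessions.

    run_session receives {gpu_id: power_limit_w} for one shared run and
    returns the set of GPU ids that throttled during that run.
    """
    states = {g: SearchState(min_w, max_w) for g in gpu_ids}
    while True:
        # GPUs whose window is empty have converged and drop out.
        active = [g for g, s in states.items() if s.lo <= s.hi]
        if not active:
            break
        # One attempt: every active GPU is tested at its own midpoint.
        limits = {g: (states[g].lo + states[g].hi) // 2 for g in active}
        throttled = run_session(limits)
        for g, mid in limits.items():
            s = states[g]
            if g in throttled:
                s.hi = mid - 1   # too high for this card: search lower
            else:
                s.best = mid     # stable: remember it and search higher
                s.lo = mid + 1
    return {g: s.best for g, s in states.items()}


def make_fake_session(
    thresholds: Dict[int, int],
) -> Tuple[Callable[[Dict[int, int]], Set[int]], List[Dict[int, int]]]:
    """Simulated session: a GPU throttles when its limit exceeds its threshold."""
    calls: List[Dict[int, int]] = []

    def run_session(limits: Dict[int, int]) -> Set[int]:
        calls.append(dict(limits))
        return {g for g, w in limits.items() if w > thresholds[g]}

    return run_session, calls
```

Each pass through the loop corresponds to one shared attempt, so the number
of sessions is bounded by the slowest-converging card rather than by the sum
over all cards. The sketch drops converged GPUs from later attempts for
brevity; keeping them loaded at their final limit would preserve the thermal
conditions described above.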

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-14 22:25:05 +03:00

audit

Run power calibration for all GPUs simultaneously

2026-04-14 22:25:05 +03:00

bible @ 1d89a4918e

Update bible submodule

2026-04-08 07:14:31 +03:00

bible-local

Split bee-bench into perf and power workflows

2026-04-14 17:33:13 +03:00

internal

chore: commit pending repo changes