bee/audit/internal at 3cf2e9c9dcd7b552b22de41d7a21dd31012cfce2 - bee

Michael Chus 3cf2e9c9dc Run power calibration for all GPUs simultaneously

Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.

Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run, so there is no resource-busy
  contention from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
  card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).
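
The per-GPU binary search described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names gpuSearch, calibrate, runAll and simulate are hypothetical, power limits are assumed to be whole watts, and runAll stands in for one shared load session (such as a single dcgmi diag -r targeted_power run) that reports which GPUs throttled. GPUs that have already converged are kept at their stable limit so the remaining cards are still tested under full thermal load.

```go
package main

import "fmt"

// gpuSearch holds binary-search bounds for one GPU, in watts.
// Hypothetical type for illustration; not taken from the bee source.
type gpuSearch struct {
	lo   int  // highest limit assumed or observed to be stable
	hi   int  // lowest limit assumed or observed to throttle
	done bool // true once hi-lo is within the step size
}

// next returns the midpoint limit to try in the next shared attempt.
func (s *gpuSearch) next() int { return s.lo + (s.hi-s.lo)/2 }

// update narrows this GPU's bounds after an attempt at limit tried.
func (s *gpuSearch) update(tried int, throttled bool, stepW int) {
	if throttled {
		s.hi = tried
	} else {
		s.lo = tried
	}
	if s.hi-s.lo <= stepW {
		s.done = true
	}
}

// calibrate repeats shared attempts until every GPU has converged.
// runAll receives the limit for each GPU index and returns which throttled.
func calibrate(states map[int]*gpuSearch, stepW int,
	runAll func(limits map[int]int) map[int]bool) map[int]int {
	if stepW < 1 {
		stepW = 1 // a zero step would never terminate with integer midpoints
	}
	for {
		limits := make(map[int]int, len(states))
		active := 0
		for idx, s := range states {
			if s.done {
				limits[idx] = s.lo // converged cards keep loading at their stable limit
			} else {
				limits[idx] = s.next()
				active++
			}
		}
		if active == 0 {
			break
		}
		throttled := runAll(limits) // one session covers all GPUs
		for idx, s := range states {
			if !s.done {
				s.update(limits[idx], throttled[idx], stepW)
			}
		}
	}
	result := make(map[int]int, len(states))
	for idx, s := range states {
		result[idx] = s.lo
	}
	return result
}

// simulate fakes a load run: a GPU throttles when its limit exceeds trueMax.
func simulate(trueMax map[int]int) func(map[int]int) map[int]bool {
	return func(limits map[int]int) map[int]bool {
		out := make(map[int]bool, len(limits))
		for idx, w := range limits {
			out[idx] = w > trueMax[idx]
		}
		return out
	}
}

func main() {
	states := map[int]*gpuSearch{
		0: {lo: 200, hi: 400},
		1: {lo: 200, hi: 400},
	}
	limits := calibrate(states, 5, simulate(map[int]int{0: 310, 1: 350}))
	fmt.Println(limits) // each value is within 5 W below that GPU's simulated maximum
}
```

Because each card owns its own lo/hi pair, a throttle event on one GPU only moves that GPU's upper bound, while the shared attempt loop keeps the number of load sessions equal to the slowest card's search depth.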

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-14 22:25:05 +03:00

app         Add slot-aware ramp sequence to bee-bench power                          2026-04-14 17:47:40 +03:00
collector   Warn on PCIe link speed degradation and collect lspci -vvv in techdump   2026-04-12 12:42:17 +03:00
platform    Run power calibration for all GPUs simultaneously                        2026-04-14 22:25:05 +03:00
runtimeenv  Refactor bee CLI and LiveCD integration