Power calibration previously stepped the limit down linearly, 25 W at a
time, requiring up to 6 attempts to find a stable limit within a 150 W
range.

New strategy:
- Binary search between minLimitW (lo, assumed stable floor) and the
starting/failed limit (hi, confirmed unstable), converging within a
10 W tolerance in ~4 attempts.
- For thermal throttle: the first quarter of the telemetry rows is used
to estimate the GPU's pre-throttle power draw (onset). nextLimit =
round5W(onset - 10 W) is used as the initial candidate instead of the
binary midpoint, landing much closer to the true limit on the first
step.
- On success: lo is updated and a higher level is tried (binary search
upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is
found rather than the first stable one.
- Let targeted_power run to natural completion on throttle (no mid-run
SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before
the next attempt.
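
The search described above can be sketched roughly as follows. This is
an illustrative sketch only, not the code in this commit: calibrate,
nextCandidate, onsetPowerW and the run callback are hypothetical names,
averaging the first quarter of the rows is an assumed way of estimating
onset, and round5W is assumed to round to the nearest multiple of 5 W.

```go
package main

import "fmt"

// toleranceW is the 10 W convergence tolerance named in the message.
const toleranceW = 10

// round5W rounds a wattage to the nearest multiple of 5 W (assumed behaviour).
func round5W(w int) int {
	return (w + 2) / 5 * 5
}

// onsetPowerW estimates pre-throttle draw as the mean of the first quarter
// of the telemetry rows (an assumption about how "onset" is derived).
func onsetPowerW(rows []float64) float64 {
	n := len(rows) / 4
	if n == 0 {
		n = len(rows)
	}
	if n == 0 {
		return 0
	}
	sum := 0.0
	for _, r := range rows[:n] {
		sum += r
	}
	return sum / float64(n)
}

// nextCandidate returns the next limit to try between lo and hi. With
// throttle telemetry it prefers round5W(onset - 10 W); otherwise, or when
// that value falls outside (lo, hi), it uses the rounded binary midpoint.
func nextCandidate(lo, hi int, throttleRows []float64) int {
	if len(throttleRows) > 0 {
		c := round5W(int(onsetPowerW(throttleRows)) - 10)
		if c > lo && c < hi {
			return c
		}
	}
	return round5W((lo + hi) / 2)
}

// calibrate narrows [lo, hi] until hi-lo <= toleranceW and returns the
// highest limit seen stable plus the number of attempts. run is a
// hypothetical callback: it applies limitW, runs the load test to
// completion, and reports stability plus telemetry if it throttled.
func calibrate(minLimitW, failedLimitW int, firstRows []float64,
	run func(limitW int) (stable bool, throttleRows []float64)) (int, int) {
	lo, hi := minLimitW, failedLimitW // lo assumed stable, hi confirmed unstable
	rows := firstRows
	attempts := 0
	for hi-lo > toleranceW {
		c := nextCandidate(lo, hi, rows)
		if c <= lo || c >= hi {
			break // rounding left no untested value inside the interval
		}
		attempts++
		stable, throttleRows := run(c)
		if stable {
			lo, rows = c, nil // success: search upward from here
		} else {
			hi, rows = c, throttleRows // failure: search downward
		}
	}
	return lo, attempts
}

func main() {
	// Simulated GPU whose highest stable limit is 290 W.
	run := func(limitW int) (bool, []float64) { return limitW <= 290, nil }
	limit, attempts := calibrate(200, 350, nil, run)
	fmt.Println(limit, attempts)
}
```

In the simulated run, the search starts from the 200..350 W interval
and narrows it until hi-lo is within the 10 W tolerance; the callback
would be replaced by whatever applies the limit and runs the real load.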

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>