• v8.1 a636146dbd

    Fix power calibration failing due to DCGM resource contention

    mchus released this 2026-04-14 20:41:17 +03:00 | 178 commits to main since this release

    When a targeted_power attempt is cancelled (e.g. after sw_thermal
    throttle), nv-hostengine holds the diagnostic slot asynchronously.
    The next attempt immediately received DCGM_ST_IN_USE (exit 222)
    and incorrectly derated the power limit.

    Now: exit 222 is detected via isDCGMResourceBusy and triggers an
    exponential back-off retry at the same power limit (1s, 2s, 4s, …
    up to 256s). Once the back-off delay would exceed 300s the
    calibration fails, indicating the slot is persistently held.

    Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

    Downloads