• v8.34 679aeb9947

    Run NVIDIA DCGM diag tests on all selected GPUs simultaneously

    mchus released this 2026-04-20 11:53:25 +03:00 | 75 commits to main since this release

    targeted_stress, targeted_power, and the Level 2/3 diag were dispatched
    one GPU at a time from the UI. On an 8-GPU system this turned what could
    be a single dcgmi command into 8 sequential runs of ~350–450 s each.
    dcgmi diag accepts -i with a comma-separated list of GPU indices and runs
    the diagnostic on all of them in parallel.

    Move nvidia, nvidia-targeted-stress and nvidia-targeted-power into
    nvidiaAllGPUTargets so that expandSATTarget passes all selected indices
    in one API call. Simplify runNvidiaValidateSet to match
    runNvidiaFabricValidate. Update the sat.go constants and the
    page_validate.go estimates to reflect simultaneous all-GPU execution
    (drop the n× multiplier from the total time estimates).
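    A rough sketch of the dispatch rule described above, assuming
    expandSATTarget maps a target plus the selected GPUs to a list of jobs.
    The satJob type, the map representation of nvidiaAllGPUTargets and the
    function body are hypothetical reconstructions; only the three target
    names come from this change, and the real set may hold more entries.

```go
package main

import "fmt"

// satJob is a hypothetical unit of work: one target run on a set of GPUs.
type satJob struct {
	Target string
	GPUs   []int
}

// nvidiaAllGPUTargets lists targets that run once across all selected GPUs.
// Only the targets moved in this release are shown here.
var nvidiaAllGPUTargets = map[string]bool{
	"nvidia":                 true,
	"nvidia-targeted-stress": true,
	"nvidia-targeted-power":  true,
}

// expandSATTarget (hypothetical sketch) returns the jobs to dispatch.
func expandSATTarget(target string, gpus []int) []satJob {
	if nvidiaAllGPUTargets[target] {
		// One job, one API call, every selected index at once.
		return []satJob{{Target: target, GPUs: gpus}}
	}
	// Any other target still gets one job per GPU.
	jobs := make([]satJob, 0, len(gpus))
	for _, g := range gpus {
		jobs = append(jobs, satJob{Target: target, GPUs: []int{g}})
	}
	return jobs
}

func main() {
	gpus := []int{0, 1, 2, 3, 4, 5, 6, 7}
	fmt.Println(len(expandSATTarget("nvidia-targeted-stress", gpus))) // 1
	fmt.Println(len(expandSATTarget("some-per-gpu-target", gpus)))    // 8
}
```

    "some-per-gpu-target" is a placeholder for any target outside the set.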

    Total stress test time on an 8-GPU system: ~5.3 h → ~2.5 h.

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
