-
Run NVIDIA DCGM diag tests on all selected GPUs simultaneously
released this
2026-04-20 11:53:25 +03:00 | 75 commits to main since this releasetargeted_stress, targeted_power, and the Level 2/3 diag were dispatched
one GPU at a time from the UI, turning a single dcgmi command into 8
sequential ~350–450 s runs. DCGM supports -i with a comma-separated list
of GPU indices and runs the diagnostic on all of them in parallel.Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into
nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one
API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate.
Update sat.go constants and page_validate.go estimates to reflect all-GPU
simultaneous execution (remove n× multiplier from total time estimates).Stress test on 8-GPU system: ~5.3 h → ~2.5 h.
Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com
Downloads