Fix pulse_test: run all GPUs simultaneously, not per-GPU
pulse_test is a PSU/power-delivery test, not a per-GPU compute test. Its purpose is to synchronously pulse all GPUs between idle and full load to create worst-case transient spikes on the power supply. Running it one GPU at a time would produce a fraction of the PSU load and miss any PSU-level failures.

- Move nvidia-pulse from nvidiaPerGPUTargets to nvidiaAllGPUTargets (same dispatch path as NCCL and NVBandwidth)
- Change card onclick to runNvidiaFabricValidate (all selected GPUs at once)
- Update card title to "NVIDIA PSU Pulse Test" and description to explain why synchronous multi-GPU execution is required

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -1112,11 +1112,11 @@ func renderValidate(opts HandlerOptions) string {
 )) +
 `</div>` +
 `<div id="sat-card-nvidia-pulse">` +
-renderSATCard("nvidia-pulse", "NVIDIA Pulse Test", "runNvidiaValidateSet('nvidia-pulse')", "", renderValidateCardBody(
+renderSATCard("nvidia-pulse", "NVIDIA PSU Pulse Test", "runNvidiaFabricValidate('nvidia-pulse')", "", renderValidateCardBody(
 inv.NVIDIA,
-`Verifies GPU transient power response using DCGM pulse load. Pass/fail determined by DCGM.`,
+`Tests power supply transient response by pulsing all GPUs simultaneously between idle and full load. Synchronous pulses across all GPUs create worst-case PSU load spikes — running per-GPU would miss PSU-level failures.`,
 `<code>dcgmi diag pulse_test</code>`,
-`Skipped in Validate mode. Runs in Stress mode only. Runs one GPU at a time.<p id="sat-pt-mode-hint" style="color:var(--warn-fg);font-size:12px;margin:8px 0 0">Only runs in Stress mode. Switch mode above to enable in Run All.</p>`,
+`Skipped in Validate mode. Runs in Stress mode only. Runs all selected GPUs simultaneously — synchronous pulsing is required to stress the PSU.<p id="sat-pt-mode-hint" style="color:var(--warn-fg);font-size:12px;margin:8px 0 0">Only runs in Stress mode. Switch mode above to enable in Run All.</p>`,
 )) +
 `</div>` +
 `<div id="sat-card-nvidia-interconnect">` +
@@ -1321,8 +1321,9 @@ function runSATWithOverrides(target, overrides) {
 return enqueueSATTarget(target, overrides)
 .then(d => streamSATTask(d.task_id, title, false));
 }
-const nvidiaPerGPUTargets = ['nvidia', 'nvidia-targeted-stress', 'nvidia-targeted-power', 'nvidia-pulse'];
-const nvidiaAllGPUTargets = ['nvidia-interconnect', 'nvidia-bandwidth'];
+const nvidiaPerGPUTargets = ['nvidia', 'nvidia-targeted-stress', 'nvidia-targeted-power'];
+// pulse_test and fabric tests run on all selected GPUs simultaneously
+const nvidiaAllGPUTargets = ['nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth'];
 function expandSATTarget(target) {
 if (nvidiaAllGPUTargets.indexOf(target) >= 0) {
 const selected = satSelectedGPUIndices();
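The practical effect of moving nvidia-pulse between the two lists can be illustrated with a small sketch. Only the two target arrays below are taken from the diff; planSATRuns, its return shape, and the fallback for unknown targets are hypothetical and are not the repository's actual expandSATTarget, whose body is only partly visible in the hunk above.

```javascript
// Target lists as they stand after this commit (from the diff).
const nvidiaPerGPUTargets = ['nvidia', 'nvidia-targeted-stress', 'nvidia-targeted-power'];
const nvidiaAllGPUTargets = ['nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth'];

// Hypothetical helper (not in the repository): returns the list of runs
// that would be scheduled for a target and a set of selected GPU indices.
function planSATRuns(target, selectedGPUs) {
  if (nvidiaAllGPUTargets.includes(target)) {
    // One run covering every selected GPU at once.
    return [{ target, gpus: selectedGPUs.slice() }];
  }
  if (nvidiaPerGPUTargets.includes(target)) {
    // One run per GPU, each isolated from the others.
    return selectedGPUs.map(gpu => ({ target, gpus: [gpu] }));
  }
  // Assumption: any other target runs once without a GPU list.
  return [{ target, gpus: [] }];
}

console.log(planSATRuns('nvidia-pulse', [0, 1, 2, 3]).length);          // 1
console.log(planSATRuns('nvidia-targeted-power', [0, 1, 2, 3]).length); // 4
```

Under these assumptions, a four-GPU selection yields a single pulse_test run that loads all four GPUs together, which is what produces the PSU-level transient the commit message describes, while the per-GPU targets still fan out into four separate runs.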