- Add BenchmarkEstimated* constants to benchmark_types.go from _v8 logs
(Standard Perf ~16 min, Standard Power Fit ~43 min, Stability Perf ~92 min)
- Update benchmark profile dropdown to show Perf / Power Fit timing per profile
- Add timing columns to Method Split table (Standard vs Stability per run type)
- Update burn preset labels to show "N min/GPU (sequential) or N min (parallel)"
- Clarify burn "one by one" description with sequential vs parallel scaling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add SATEstimated* constants to sat.go derived from _v8 production logs,
with a rule to recalculate them whenever the script changes
- Extend validateInventory with NvidiaGPUCount to make estimates GPU-aware
- Update all validate card duration strings: CPU, memory, storage, NVIDIA GPU,
targeted stress/power, pulse test, NCCL, nvbandwidth
- Fix nvbandwidth description ("intended to stay short" → actual ~45 min)
- Top-level profile labels show computed total including GPU count
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 10:51:15 +03:00
4 changed files with 181 additions and 29 deletions
return`<div class="alert alert-info" style="margin-bottom:16px"><strong>Non-destructive:</strong> Validate tests collect diagnostics only. They do not write to disks, do not run sustained load, and do not increment hardware wear counters.</div>
<p style="color:var(--muted);font-size:13px;margin-bottom:16px">Tasks continue in the background — view progress in <a href="/tasks">Tasks</a>.</p>
<p style="color:var(--muted);font-size:12px;margin:0 0 10px">Runs validate modules sequentially with the selected cycle count and mode. Validate is quick (~5–15 min total); Stress is thorough (~30–60 min total).</p>
<p style="color:var(--muted);font-size:12px;margin:0 0 10px">Runs validate modules sequentially. Validate: `+validateTotalStr+gpuNote+`; Stress: `+stressTotalStr+gpuNote+`. Estimates are based on real log data and scale with GPU count.</p>
<button type="button" class="btn btn-primary" onclick="runAllSAT()">Validate one by one</button>
`Short self-test in Validate, extended self-test in Stress.`,
`Seconds in Validate (NVMe: instant device query; SATA/SAS: short self-test). Up to ~1 h per device in Stress (extended self-test, device-dependent).`,
`Runs a controlled NVIDIA DCGM load to check stability under moderate stress.`,
`<code>dcgmi diag targeted_stress</code>`,
`Skipped in Validate mode. Runs after dcgmi diag in Stress mode. Runs one GPU at a time on the selected NVIDIA GPUs.<p id="sat-ts-mode-hint" style="color:var(--warn-fg);font-size:12px;margin:8px 0 0">Only runs in Stress mode. Switch mode above to enable in Run All.</p>`,
returns+`<p id="sat-ts-mode-hint" style="color:var(--warn-fg);font-size:12px;margin:8px 0 0">Only runs in Stress mode. Switch mode above to enable in Run All.</p>`
`Checks that the GPU can sustain its declared power delivery envelope. Pass/fail determined by DCGM.`,
`<code>dcgmi diag targeted_power</code>`,
`Skipped in Validate mode. Runs in Stress mode only. Runs one GPU at a time.<p id="sat-tp-mode-hint" style="color:var(--warn-fg);font-size:12px;margin:8px 0 0">Only runs in Stress mode. Switch mode above to enable in Run All.</p>`,
returns+`<p id="sat-tp-mode-hint" style="color:var(--warn-fg);font-size:12px;margin:8px 0 0">Only runs in Stress mode. Switch mode above to enable in Run All.</p>`
`Tests power supply transient response by pulsing all GPUs simultaneously between idle and full load. Synchronous pulses across all GPUs create worst-case PSU load spikes — running per-GPU would miss PSU-level failures.`,
`<code>dcgmi diag pulse_test</code>`,
`Skipped in Validate mode. Runs in Stress mode only. Runs all selected GPUs simultaneously — synchronous pulsing is required to stress the PSU.<p id="sat-pt-mode-hint" style="color:var(--warn-fg);font-size:12px;margin:8px 0 0">Only runs in Stress mode. Switch mode above to enable in Run All.</p>`,
`Skipped in Validate. Stress: `+validateFmtDur(platform.SATEstimatedNvidiaPulseTestSec)+` (all GPUs simultaneously; measured on 8-GPU system).`+`<p id="sat-pt-mode-hint" style="color:var(--warn-fg);font-size:12px;margin:8px 0 0">Only runs in Stress mode. Switch mode above to enable in Run All.</p>`,
`Verifies NVLink/NVSwitch fabric bandwidth using NCCL all_reduce_perf across all selected GPUs. Pass/fail based on achieved bandwidth vs. theoretical.`,
`<code>all_reduce_perf</code> (NCCL tests)`,
`Runs in Validate and Stress. Uses all selected GPUs simultaneously (requires ≥2) and is kept short so it fits the Validate flow.`,
`Validate and Stress: `+validateFmtDur(platform.SATEstimatedNvidiaInterconnectSec)+` (all GPUs simultaneously, requires ≥2).`,
`Validates GPU memory copy and peer-to-peer bandwidth paths using NVBandwidth.`,
`<code>nvbandwidth</code>`,
`Runs in Validate and Stress across all selected GPUs simultaneously. Intended to stay short enough for Validate.`,
`Validate and Stress: `+validateFmtDur(platform.SATEstimatedNvidiaBandwidthSec)+` (all GPUs simultaneously; nvbandwidth runs all built-in tests without a time limit — duration set by the tool).`,
<option value="standard" selected>Standard — Perf `+validateFmtDur(platform.BenchmarkEstimatedPerfStandardSec)+` / Power Fit `+validateFmtDur(platform.BenchmarkEstimatedPowerStandardSec)+`</option>
<option value="stability">Stability — Perf `+validateFmtDur(platform.BenchmarkEstimatedPerfStabilitySec)+` / Power Fit `+validateFmtDur(platform.BenchmarkEstimatedPowerStabilitySec)+`</option>
<option value="overnight">Overnight — Perf `+validateFmtDur(platform.BenchmarkEstimatedPerfOvernightSec)+` / Power Fit `+validateFmtDur(platform.BenchmarkEstimatedPowerOvernightSec)+`</option>
<p style="font-size:13px;color:var(--muted);margin-bottom:10px">The benchmark page now exposes two fundamentally different test families so compute score and server power-fit are not mixed into one number.</p>
<tr><td>Performance Benchmark</td><td><code>bee-gpu-burn</code></td><td>How much isolated compute performance does the GPU realize in this server?</td></tr>
<tr><td>Power / Thermal Fit</td><td><code>dcgmi targeted_power</code></td><td>How much power per GPU can this server sustain as GPU count ramps up?</td></tr>
<tr><td>Performance Benchmark</td><td><code>bee-gpu-burn</code></td><td>How much isolated compute performance does the GPU realize in this server?</td><td>`+validateFmtDur(platform.BenchmarkEstimatedPerfStandardSec)+`</td><td>`+validateFmtDur(platform.BenchmarkEstimatedPerfStabilitySec)+`</td></tr>
<tr><td>Power / Thermal Fit</td><td><code>dcgmi targeted_power</code></td><td>How much power per GPU can this server sustain as GPU count ramps up?</td><td>`+validateFmtDur(platform.BenchmarkEstimatedPowerStandardSec)+`</td><td>`+validateFmtDur(platform.BenchmarkEstimatedPowerStabilitySec)+`</td></tr>
</table>
<p style="font-size:12px;color:var(--muted);margin-top:10px">Use ramp-up mode for capacity work: it creates 1 GPU → 2 GPU → … → all selected steps so analysis software can derive server total score and watts-per-GPU curves.</p>
<p style="font-size:12px;color:var(--muted);margin-top:10px">Timings are per full ramp-up run (1 GPU → all selected), measured on 4–8 GPU servers. Use ramp-up mode for capacity work: it creates 1 GPU → 2 GPU → … → all selected steps so analysis software can derive server total score and watts-per-GPU curves.</p>
<button type="button" class="btn btn-primary" onclick="runAllBurnTasks()">Burn one by one</button>
<p>Run checked tests one by one. Tests run without cooldown. Each test duration is determined by the Burn Profile. Total test duration is the sum of all selected tests multiplied by the Burn Profile duration.</p>
<p>Runs checked tests as separate sequential tasks. In sequential GPU mode, total time = profile duration × N GPU. In parallel mode, all selected GPUs burn simultaneously for one profile duration.</p>
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.