Compare commits

..

11 Commits
v7.9 ... v7.14

Author SHA1 Message Date
bf6ecab4f0 Add per-precision benchmark phases, weighted TOPS scoring, and ECC tracking
- Split steady window into 6 equal slots: fp8/fp16/fp32/fp64/fp4 + combined
- Each precision phase runs bee-gpu-burn with --precision filter so PowerCVPct reflects single-kernel stability (not round-robin artifact)
- Add fp4 support in bee-gpu-stress.c for Blackwell (cc>=100) via existing CUDA_R_4F_E2M1 guard
- Weighted TOPS: fp64×2.0, fp32×1.0, fp16×0.5, fp8×0.25, fp4×0.125
- SyntheticScore = sum of weighted TOPS from per-precision phases
- MixedScore = sum from combined phase; MixedEfficiency = Mixed/Synthetic
- ComputeScore = SyntheticScore × (1 + MixedEfficiency × 0.3)
- ECC volatile counters sampled before/after each phase and overall
- DegradationReasons: ecc_uncorrected_errors, ecc_corrected_errors
- Report: per-precision stability table with ECC columns, methodology section
- Ramp-up history table redesign: GPU indices as columns, runs as rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 10:49:49 +03:00
02e44b1172 Fix USB/RAM status checks; add server model+S/N to dashboard; remove cycles
USB Export Drive:
  lsblk reports TRAN only for whole disks, not partitions (/dev/sdc1).
  Strip trailing partition digits to get parent disk before transport check.

LiveCD in RAM:
  When RunInstallToRAM copies squashfs to /dev/shm/bee-live/ but bind-mount
  of /run/live/medium fails (CD-ROM boots), /run/live/medium still shows the
  CD-ROM fstype. Add fallback: if /dev/shm/bee-live/*.squashfs exists, the
  data is in RAM — report status OK.

Dashboard Hardware Summary:
  Show server Manufacturer + ProductName as heading and S/N as subline above
  the component table, sourced from hw.Board (dmidecode system-type data).

Validate:
  Remove Cycles input — always run once. cycles=1 hardcoded in runAllSAT().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:46:42 +03:00
2ceaa0d0ca Include profile and mode in benchmark task names for task list clarity
Task names now follow the pattern:
  NVIDIA Benchmark · <profile> · <mode> [· GPU <indices>]

Examples:
  NVIDIA Benchmark · standard · sequential (GPU 0, RTX 6000 Pro)
  NVIDIA Benchmark · stability · parallel
  NVIDIA Benchmark · standard · ramp 1/4 · GPU 0
  NVIDIA Benchmark · standard · ramp 2/4 · GPU 0,1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:36:51 +03:00
9482ba20a2 Remove NCCL checkbox — auto-enable interconnect step when >1 GPU selected
NCCL all_reduce is always attempted when 2+ GPUs are selected; a failure
leaves InterconnectScore=0 (no bonus, no penalty) and OverallStatus
unaffected. Exposing the checkbox implied NCCL is optional and made a
failed run look like a deliberate skip.

- Remove benchmark-run-nccl checkbox and its change listener from pages.go
- Client sends run_nccl: selected.length > 1 (automatic)
- api.go default runNCCL=true is unchanged
- Selection note now mentions NCCL automatically for multi-GPU runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:33:17 +03:00
813e2f86a9 Add scalability/ramp-up labeling, ServerPower penalty in scoring, and report improvements
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
  NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
  by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
  tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
  factor = ratio/0.75, applied per-GPU with a note explaining the reduction
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
  shows scalability_score when non-zero

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:30:47 +03:00
58a6da9b44 Recover power limits and SM count from nvidia-smi -q in enrichGPUInfo
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
  - Default Power Limit  → DefaultPowerLimitW
  - Current Power Limit  → PowerLimitW (fallback)
  - Multiprocessor Count → MultiprocessorCount

Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:17:56 +03:00
f4a19c0a00 Add power calibration step to benchmark; fix PowerSustainScore reference
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.

Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.

Adjust composite score weights to match the new honest power reference:
  base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:06:46 +03:00
9e3dcf9b4d Record host CPU/RAM config in benchmark results; check CPU load
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
  total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
  the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
  erratic (stddev >12%), with a plain-English description in the report.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:02:04 +03:00
098e19f760 Add ramp-up mode to NVIDIA GPU benchmark
Adds a new checkbox (enabled by default) in the benchmark section.
In ramp-up mode N tasks are spawned simultaneously: 1 GPU, then 2,
then 3, up to all selected GPUs — each step runs its GPUs in parallel.
NCCL runs only on the final step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:34:19 +03:00
e16d0f34b5 Adjust burn GPU ramp timing by profile 2026-04-12 15:58:30 +03:00
Mikhail Chusavitin
525ed8b8fc Fix GPU clock lock normalization for Blackwell (clocks.max.* unsupported)
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire gpu inventory query
to fail and clock lock to be skipped → normalization: partial.

Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
  power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
  nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
  MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
  is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:33:54 +03:00
13 changed files with 1357 additions and 250 deletions

View File

@@ -7,6 +7,7 @@ import (
"fmt" "fmt"
"math" "math"
"os" "os"
"os/exec"
"path/filepath" "path/filepath"
"regexp" "regexp"
"sort" "sort"
@@ -72,6 +73,11 @@ var (
benchmarkIterationsPattern = regexp.MustCompile(`^([a-z0-9_]+)_iterations=(\d+)$`) benchmarkIterationsPattern = regexp.MustCompile(`^([a-z0-9_]+)_iterations=(\d+)$`)
) )
// benchmarkPrecisionPhases lists the precision categories run as individual
// steady-state windows before the combined steady pass. Order is from lowest
// to highest power draw so thermal ramp-up is gradual.
var benchmarkPrecisionPhases = []string{"fp8", "fp16", "fp32", "fp64", "fp4"}
func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts NvidiaBenchmarkOptions, logFunc func(string)) (string, error) { func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts NvidiaBenchmarkOptions, logFunc func(string)) (string, error) {
if ctx == nil { if ctx == nil {
ctx = context.Background() ctx = context.Background()
@@ -108,7 +114,11 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
ServerModel: readServerModel(), ServerModel: readServerModel(),
BenchmarkProfile: spec.Name, BenchmarkProfile: spec.Name,
ParallelGPUs: opts.ParallelGPUs, ParallelGPUs: opts.ParallelGPUs,
RampStep: opts.RampStep,
RampTotal: opts.RampTotal,
RampRunID: opts.RampRunID,
SelectedGPUIndices: append([]int(nil), selected...), SelectedGPUIndices: append([]int(nil), selected...),
HostConfig: readBenchmarkHostConfig(),
Normalization: BenchmarkNormalization{ Normalization: BenchmarkNormalization{
Status: "full", Status: "full",
}, },
@@ -121,15 +131,22 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
var serverIdleOK, serverLoadedOK bool var serverIdleOK, serverLoadedOK bool
var serverLoadedSamples int var serverLoadedSamples int
// Run nvidia-smi -q first: used both for the log file and as a fallback
// source of max clock values when CSV clock fields are unsupported.
var nvsmiQOut []byte
if out, err := runSATCommandCtx(ctx, verboseLog, "00-nvidia-smi-q.log", []string{"nvidia-smi", "-q"}, nil, nil); err == nil {
nvsmiQOut = out
_ = os.WriteFile(filepath.Join(runDir, "00-nvidia-smi-q.log"), out, 0644)
}
infoByIndex, infoErr := queryBenchmarkGPUInfo(selected) infoByIndex, infoErr := queryBenchmarkGPUInfo(selected)
if infoErr != nil { if infoErr != nil {
result.Warnings = append(result.Warnings, "gpu inventory query failed: "+infoErr.Error()) result.Warnings = append(result.Warnings, "gpu inventory query failed: "+infoErr.Error())
result.Normalization.Status = "partial" result.Normalization.Status = "partial"
} }
// Enrich with max clocks from verbose output — covers GPUs where
if out, err := runSATCommandCtx(ctx, verboseLog, "00-nvidia-smi-q.log", []string{"nvidia-smi", "-q"}, nil, nil); err == nil { // clocks.max.* CSV fields are unsupported (e.g. Blackwell / driver 98.x).
_ = os.WriteFile(filepath.Join(runDir, "00-nvidia-smi-q.log"), out, 0644) enrichGPUInfoWithMaxClocks(infoByIndex, nvsmiQOut)
}
activeApps, err := queryActiveComputeApps(selected) activeApps, err := queryActiveComputeApps(selected)
if err == nil && len(activeApps) > 0 { if err == nil && len(activeApps) > 0 {
@@ -145,8 +162,16 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
} }
}() }()
// Power calibration: run dcgmi targeted_power while sampling nvidia-smi power.
// Returns per-GPU p95 power as an honest TDP reference for PowerSustainScore.
calibPowerByIndex := runBenchmarkPowerCalibration(ctx, verboseLog, runDir, selected, logFunc)
// Start background CPU load sampler — samples every 10s during GPU phases.
cpuStopCh := make(chan struct{})
cpuSamplesCh := startCPULoadSampler(cpuStopCh, 10)
if opts.ParallelGPUs { if opts.ParallelGPUs {
runNvidiaBenchmarkParallel(ctx, verboseLog, runDir, selected, infoByIndex, opts, spec, logFunc, &result, &serverIdleW, &serverLoadedWSum, &serverIdleOK, &serverLoadedOK, &serverLoadedSamples) runNvidiaBenchmarkParallel(ctx, verboseLog, runDir, selected, infoByIndex, opts, spec, logFunc, &result, calibPowerByIndex, &serverIdleW, &serverLoadedWSum, &serverIdleOK, &serverLoadedOK, &serverLoadedSamples)
} else { } else {
for _, idx := range selected { for _, idx := range selected {
@@ -166,6 +191,9 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
gpuResult.BaseGraphicsClockMHz = info.BaseGraphicsClockMHz gpuResult.BaseGraphicsClockMHz = info.BaseGraphicsClockMHz
gpuResult.MaxMemoryClockMHz = info.MaxMemoryClockMHz gpuResult.MaxMemoryClockMHz = info.MaxMemoryClockMHz
} }
if w, ok := calibPowerByIndex[idx]; ok && w > 0 {
gpuResult.CalibratedPeakPowerW = w
}
if norm := findBenchmarkNormalization(result.Normalization.GPUs, idx); norm != nil { if norm := findBenchmarkNormalization(result.Normalization.GPUs, idx); norm != nil {
gpuResult.LockedGraphicsClockMHz = norm.GPUClockLockMHz gpuResult.LockedGraphicsClockMHz = norm.GPUClockLockMHz
gpuResult.LockedMemoryClockMHz = norm.MemoryClockLockMHz gpuResult.LockedMemoryClockMHz = norm.MemoryClockLockMHz
@@ -202,14 +230,56 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
continue continue
} }
// ── Per-precision stability phases ────────────────────────────────────────
// Run each precision category alone so PowerCVPct reflects genuine GPU
// power stability, not kernel-mix variance.
// Time budget: each phase gets steadySec/numPhases, minimum 60 s.
// SteadySec is split equally across all precision phases + 1 combined slot.
// Skipped phases (unsupported precision) are simply omitted; combined is fixed.
totalSlots := len(benchmarkPrecisionPhases) + 1
perPhaseSec := spec.SteadySec / totalSlots
if perPhaseSec < 60 {
perPhaseSec = 60
}
eccBase, _ := queryECCCounters(idx)
for _, prec := range benchmarkPrecisionPhases {
phaseCmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(perPhaseSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
"--devices", strconv.Itoa(idx),
"--precision", prec,
}
logFunc(fmt.Sprintf("GPU %d: %s stability phase (%ds)", idx, prec, perPhaseSec))
phaseLogName := fmt.Sprintf("gpu-%d-steady-%s", idx, prec)
eccBefore, _ := queryECCCounters(idx)
phaseOut, phaseRows, phaseErr := runBenchmarkCommandWithMetrics(ctx, verboseLog, phaseLogName+".log", phaseCmd, nil, []int{idx}, runDir, phaseLogName, logFunc)
eccAfter, _ := queryECCCounters(idx)
if phaseErr != nil || len(phaseRows) == 0 {
continue
}
phase := BenchmarkPrecisionSteadyPhase{
Precision: prec,
Steady: summarizeBenchmarkTelemetry(phaseRows),
ECC: diffECCCounters(eccBefore, eccAfter),
}
for _, p := range parseBenchmarkBurnLog(string(phaseOut)).Profiles {
if p.Supported {
phase.TeraOpsPerSec += p.TeraOpsPerSec
phase.WeightedTeraOpsPerSec += p.WeightedTeraOpsPerSec
}
}
gpuResult.PrecisionSteady = append(gpuResult.PrecisionSteady, phase)
}
beforeThrottle, _ := queryThrottleCounters(idx) beforeThrottle, _ := queryThrottleCounters(idx)
steadyCmd := []string{ steadyCmd := []string{
"bee-gpu-burn", "bee-gpu-burn",
"--seconds", strconv.Itoa(spec.SteadySec), "--seconds", strconv.Itoa(perPhaseSec),
"--size-mb", strconv.Itoa(opts.SizeMB), "--size-mb", strconv.Itoa(opts.SizeMB),
"--devices", strconv.Itoa(idx), "--devices", strconv.Itoa(idx),
} }
logFunc(fmt.Sprintf("GPU %d: steady compute (%ds)", idx, spec.SteadySec)) logFunc(fmt.Sprintf("GPU %d: steady compute (combined, %ds)", idx, perPhaseSec))
// Sample server power via IPMI in parallel with the steady phase. // Sample server power via IPMI in parallel with the steady phase.
// We collect readings every 5s and average them. // We collect readings every 5s and average them.
@@ -270,6 +340,9 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
gpuResult.Steady = summarizeBenchmarkTelemetry(steadyRows) gpuResult.Steady = summarizeBenchmarkTelemetry(steadyRows)
gpuResult.Throttle = diffThrottleCounters(beforeThrottle, afterThrottle) gpuResult.Throttle = diffThrottleCounters(beforeThrottle, afterThrottle)
if eccFinal, err := queryECCCounters(idx); err == nil {
gpuResult.ECC = diffECCCounters(eccBase, eccFinal)
}
cooldownRows, err := collectBenchmarkSamples(ctx, spec.CooldownSec, []int{idx}) cooldownRows, err := collectBenchmarkSamples(ctx, spec.CooldownSec, []int{idx})
if err != nil && err != context.Canceled { if err != nil && err != context.Canceled {
@@ -303,6 +376,16 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
} }
} }
// Stop CPU load sampler and attach results.
close(cpuStopCh)
if cpuSamples := <-cpuSamplesCh; len(cpuSamples) > 0 {
result.CPULoad = summarizeCPULoad(cpuSamples)
if result.CPULoad != nil && result.CPULoad.Status != "ok" {
logFunc(fmt.Sprintf("host CPU load during benchmark: avg=%.1f%% max=%.1f%% status=%s",
result.CPULoad.AvgPct, result.CPULoad.MaxPct, result.CPULoad.Status))
}
}
// Compute server power characterization from accumulated IPMI samples. // Compute server power characterization from accumulated IPMI samples.
var gpuReportedSumW float64 var gpuReportedSumW float64
for _, gpu := range result.GPUs { for _, gpu := range result.GPUs {
@@ -314,6 +397,20 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
} }
result.ServerPower = characterizeServerPower(serverIdleW, serverLoadedW, gpuReportedSumW, serverIdleOK && serverLoadedOK) result.ServerPower = characterizeServerPower(serverIdleW, serverLoadedW, gpuReportedSumW, serverIdleOK && serverLoadedOK)
// Apply server-power penalty when IPMI reports the server delta is much
// lower than GPU-reported sum: GPU power telemetry is over-stated, making
// CalibratedPeakPowerW and PowerSustainScore unreliable.
// Penalty factor scales from 1.0 (ratio ≥ 0.75, no penalty) down to 0.
if sp := result.ServerPower; sp != nil && sp.Available && sp.ReportingRatio > 0 && sp.ReportingRatio < 0.75 {
factor := sp.ReportingRatio / 0.75
for i := range result.GPUs {
result.GPUs[i].Scores.CompositeScore *= factor
result.GPUs[i].Notes = append(result.GPUs[i].Notes,
fmt.Sprintf("server-power penalty applied (reporting_ratio=%.2f < 0.75): composite score reduced to %.1f%%",
sp.ReportingRatio, factor*100))
}
}
result.Findings = buildBenchmarkFindings(result) result.Findings = buildBenchmarkFindings(result)
result.OverallStatus = benchmarkOverallStatus(result) result.OverallStatus = benchmarkOverallStatus(result)
@@ -370,9 +467,13 @@ func resolveBenchmarkProfile(profile string) benchmarkProfileSpec {
// Fields are tried in order; the first successful query wins. Extended fields // Fields are tried in order; the first successful query wins. Extended fields
// (attribute.multiprocessor_count, power.default_limit) are not supported on // (attribute.multiprocessor_count, power.default_limit) are not supported on
// all driver versions, so we fall back to the base set if the full query fails. // all driver versions, so we fall back to the base set if the full query fails.
// The minimal fallback omits clock fields entirely — clocks.max.* returns
// exit status 2 on some GPU generations (e.g. Blackwell); max clocks are
// then recovered from nvidia-smi -q via enrichGPUInfoWithMaxClocks.
var benchmarkGPUInfoQueries = []struct { var benchmarkGPUInfoQueries = []struct {
fields string fields string
extended bool // whether this query includes optional extended fields extended bool // whether this query includes optional extended fields
minimal bool // clock fields omitted; max clocks must be filled separately
}{ }{
{ {
fields: "index,uuid,name,pci.bus_id,vbios_version,power.limit,clocks.max.graphics,clocks.max.memory,clocks.base.graphics,attribute.multiprocessor_count,power.default_limit", fields: "index,uuid,name,pci.bus_id,vbios_version,power.limit,clocks.max.graphics,clocks.max.memory,clocks.base.graphics,attribute.multiprocessor_count,power.default_limit",
@@ -382,6 +483,104 @@ var benchmarkGPUInfoQueries = []struct {
fields: "index,uuid,name,pci.bus_id,vbios_version,power.limit,clocks.max.graphics,clocks.max.memory,clocks.base.graphics", fields: "index,uuid,name,pci.bus_id,vbios_version,power.limit,clocks.max.graphics,clocks.max.memory,clocks.base.graphics",
extended: false, extended: false,
}, },
{
fields: "index,uuid,name,pci.bus_id,vbios_version,power.limit",
minimal: true,
},
}
// enrichGPUInfoWithMaxClocks fills MaxGraphicsClockMHz / MaxMemoryClockMHz for
// any GPU in infoByIndex where those values are still zero. It parses the
// "Max Clocks" section of nvidia-smi -q output (already available as nvsmiQ).
// This is the fallback for GPUs (e.g. Blackwell) where clocks.max.* CSV fields
// return exit status 2 but the verbose query works fine.
func enrichGPUInfoWithMaxClocks(infoByIndex map[int]benchmarkGPUInfo, nvsmiQ []byte) {
if len(infoByIndex) == 0 || len(nvsmiQ) == 0 {
return
}
// Build bus_id → index map for matching verbose sections to GPU indices.
busToBenchIdx := make(map[string]int, len(infoByIndex))
for idx, info := range infoByIndex {
if info.BusID != "" {
// nvidia-smi -q uses "GPU 00000000:4E:00.0" (8-digit domain),
// while --query-gpu returns the same format; normalise to lower.
busToBenchIdx[strings.ToLower(strings.TrimSpace(info.BusID))] = idx
}
}
// Split the verbose output into per-GPU sections on "^GPU " lines.
gpuSectionRe := regexp.MustCompile(`(?m)^GPU\s+([\dA-Fa-f:\.]+)`)
maxGfxRe := regexp.MustCompile(`(?i)Max Clocks[\s\S]*?Graphics\s*:\s*(\d+)\s*MHz`)
maxMemRe := regexp.MustCompile(`(?i)Max Clocks[\s\S]*?Memory\s*:\s*(\d+)\s*MHz`)
defaultPwrRe := regexp.MustCompile(`(?i)Default Power Limit\s*:\s*([0-9.]+)\s*W`)
currentPwrRe := regexp.MustCompile(`(?i)Current Power Limit\s*:\s*([0-9.]+)\s*W`)
smCountRe := regexp.MustCompile(`(?i)Multiprocessor Count\s*:\s*(\d+)`)
sectionStarts := gpuSectionRe.FindAllSubmatchIndex(nvsmiQ, -1)
for i, loc := range sectionStarts {
busID := strings.ToLower(string(nvsmiQ[loc[2]:loc[3]]))
benchIdx, ok := busToBenchIdx[busID]
if !ok {
// Bus IDs from verbose output may have a different domain prefix;
// try suffix match on the slot portion (XX:XX.X).
for k, v := range busToBenchIdx {
if strings.HasSuffix(k, busID) || strings.HasSuffix(busID, k) {
benchIdx = v
ok = true
break
}
}
}
if !ok {
continue
}
end := len(nvsmiQ)
if i+1 < len(sectionStarts) {
end = sectionStarts[i+1][0]
}
section := nvsmiQ[loc[0]:end]
info := infoByIndex[benchIdx]
if info.MaxGraphicsClockMHz == 0 {
if m := maxGfxRe.FindSubmatch(section); m != nil {
if v, err := strconv.ParseFloat(string(m[1]), 64); err == nil {
info.MaxGraphicsClockMHz = v
}
}
}
if info.MaxMemoryClockMHz == 0 {
if m := maxMemRe.FindSubmatch(section); m != nil {
if v, err := strconv.ParseFloat(string(m[1]), 64); err == nil {
info.MaxMemoryClockMHz = v
}
}
}
if info.DefaultPowerLimitW == 0 {
if m := defaultPwrRe.FindSubmatch(section); m != nil {
if v, err := strconv.ParseFloat(string(m[1]), 64); err == nil && v > 0 {
info.DefaultPowerLimitW = v
}
}
}
if info.PowerLimitW == 0 {
if m := currentPwrRe.FindSubmatch(section); m != nil {
if v, err := strconv.ParseFloat(string(m[1]), 64); err == nil && v > 0 {
info.PowerLimitW = v
}
}
}
if info.MultiprocessorCount == 0 {
if m := smCountRe.FindSubmatch(section); m != nil {
if v, err := strconv.Atoi(string(m[1])); err == nil && v > 0 {
info.MultiprocessorCount = v
}
}
}
infoByIndex[benchIdx] = info
}
} }
func queryBenchmarkGPUInfo(gpuIndices []int) (map[int]benchmarkGPUInfo, error) { func queryBenchmarkGPUInfo(gpuIndices []int) (map[int]benchmarkGPUInfo, error) {
@@ -409,9 +608,13 @@ func queryBenchmarkGPUInfo(gpuIndices []int) (map[int]benchmarkGPUInfo, error) {
continue continue
} }
minFields := 6
if !q.minimal {
minFields = 9
}
infoByIndex := make(map[int]benchmarkGPUInfo, len(rows)) infoByIndex := make(map[int]benchmarkGPUInfo, len(rows))
for _, row := range rows { for _, row := range rows {
if len(row) < 9 { if len(row) < minFields {
continue continue
} }
idx, err := strconv.Atoi(strings.TrimSpace(row[0])) idx, err := strconv.Atoi(strings.TrimSpace(row[0]))
@@ -419,24 +622,26 @@ func queryBenchmarkGPUInfo(gpuIndices []int) (map[int]benchmarkGPUInfo, error) {
continue continue
} }
info := benchmarkGPUInfo{ info := benchmarkGPUInfo{
Index: idx, Index: idx,
UUID: strings.TrimSpace(row[1]), UUID: strings.TrimSpace(row[1]),
Name: strings.TrimSpace(row[2]), Name: strings.TrimSpace(row[2]),
BusID: strings.TrimSpace(row[3]), BusID: strings.TrimSpace(row[3]),
VBIOS: strings.TrimSpace(row[4]), VBIOS: strings.TrimSpace(row[4]),
PowerLimitW: parseBenchmarkFloat(row[5]), PowerLimitW: parseBenchmarkFloat(row[5]),
MaxGraphicsClockMHz: parseBenchmarkFloat(row[6]),
MaxMemoryClockMHz: parseBenchmarkFloat(row[7]),
} }
if len(row) >= 9 { if !q.minimal {
info.BaseGraphicsClockMHz = parseBenchmarkFloat(row[8]) info.MaxGraphicsClockMHz = parseBenchmarkFloat(row[6])
} info.MaxMemoryClockMHz = parseBenchmarkFloat(row[7])
if q.extended { if len(row) >= 9 {
if len(row) >= 10 { info.BaseGraphicsClockMHz = parseBenchmarkFloat(row[8])
info.MultiprocessorCount = int(parseBenchmarkFloat(row[9]))
} }
if len(row) >= 11 { if q.extended {
info.DefaultPowerLimitW = parseBenchmarkFloat(row[10]) if len(row) >= 10 {
info.MultiprocessorCount = int(parseBenchmarkFloat(row[9]))
}
if len(row) >= 11 {
info.DefaultPowerLimitW = parseBenchmarkFloat(row[10])
}
} }
} }
infoByIndex[idx] = info infoByIndex[idx] = info
@@ -656,8 +861,11 @@ func parseBenchmarkBurnLog(raw string) benchmarkBurnParseResult {
Iterations: profile.iterations, Iterations: profile.iterations,
Notes: profile.notes, Notes: profile.notes,
} }
w := precisionWeight(profile.category)
precision.Weight = w
if profile.supported && result.DurationSec > 0 && profile.m > 0 && profile.n > 0 && profile.k > 0 && profile.iterations > 0 { if profile.supported && result.DurationSec > 0 && profile.m > 0 && profile.n > 0 && profile.k > 0 && profile.iterations > 0 {
precision.TeraOpsPerSec = (2.0 * float64(profile.m) * float64(profile.n) * float64(profile.k) * float64(profile.iterations)) / float64(result.DurationSec) / 1e12 precision.TeraOpsPerSec = (2.0 * float64(profile.m) * float64(profile.n) * float64(profile.k) * float64(profile.iterations)) / float64(result.DurationSec) / 1e12
precision.WeightedTeraOpsPerSec = precision.TeraOpsPerSec * w
} }
result.Profiles = append(result.Profiles, precision) result.Profiles = append(result.Profiles, precision)
} }
@@ -686,6 +894,33 @@ func ensureBenchmarkProfile(profiles map[string]*benchmarkBurnProfile, name stri
return profile return profile
} }
// precisionWeight returns the fp32-equivalence factor for a precision category.
// Each factor represents how much "real" numeric work one operation of that
// type performs relative to fp32 (single precision = 1.0 baseline):
// fp64 = 2.0 — double precision, 2× more bits per operand
// fp32 = 1.0 — single precision baseline
// fp16 = 0.5 — half precision
// fp8 = 0.25 — quarter precision
// fp4 = 0.125 — eighth precision
// Multiplying raw TOPS by the weight gives fp32-equivalent TOPS, enabling
// cross-precision comparison on the same numeric scale.
func precisionWeight(category string) float64 {
switch category {
case "fp64":
return 2.0
case "fp32_tf32":
return 1.0
case "fp16_bf16":
return 0.5
case "fp8":
return 0.25
case "fp4":
return 0.125
default:
return 1.0
}
}
func stripBenchmarkPrefix(line string) string { func stripBenchmarkPrefix(line string) string {
if strings.HasPrefix(line, "[gpu ") { if strings.HasPrefix(line, "[gpu ") {
if idx := strings.Index(line, "] "); idx >= 0 { if idx := strings.Index(line, "] "); idx >= 0 {
@@ -735,24 +970,72 @@ func summarizeBenchmarkTelemetry(rows []GPUMetricRow) BenchmarkTelemetrySummary
func scoreBenchmarkGPUResult(gpu BenchmarkGPUResult) BenchmarkScorecard { func scoreBenchmarkGPUResult(gpu BenchmarkGPUResult) BenchmarkScorecard {
score := BenchmarkScorecard{} score := BenchmarkScorecard{}
for _, precision := range gpu.PrecisionResults {
if precision.Supported { // SyntheticScore: sum of fp32-equivalent TOPS from per-precision phases.
score.ComputeScore += precision.TeraOpsPerSec // Each precision ran alone with full GPU dedicated — peak capability.
for _, p := range gpu.PrecisionSteady {
score.SyntheticScore += p.WeightedTeraOpsPerSec
}
// MixedScore: sum of fp32-equivalent TOPS from the combined phase.
// All precisions compete simultaneously — closer to real inference workloads.
for _, p := range gpu.PrecisionResults {
if p.Supported {
score.MixedScore += p.WeightedTeraOpsPerSec
} }
} }
// Use default power limit for sustain score so a manually reduced limit
// does not inflate the score. Fall back to enforced limit if default unknown. // MixedEfficiency = MixedScore / SyntheticScore.
referencePowerW := gpu.DefaultPowerLimitW // Measures how well the GPU sustains throughput under concurrent mixed load.
if referencePowerW <= 0 { // A healthy GPU scores ~0.80.95; severe degradation suggests bandwidth
referencePowerW = gpu.PowerLimitW // contention or scheduler inefficiency.
if score.SyntheticScore > 0 && score.MixedScore > 0 {
score.MixedEfficiency = score.MixedScore / score.SyntheticScore
} }
if referencePowerW > 0 {
score.PowerSustainScore = math.Min(100, (gpu.Steady.AvgPowerW/referencePowerW)*100) // ComputeScore = SyntheticScore × (1 + MixedEfficiency × 0.3).
// SyntheticScore is the primary signal; MixedEfficiency adds up to +30%
// bonus for GPUs that handle mixed-precision concurrency well.
// Falls back to MixedScore alone when per-precision data is absent.
switch {
case score.SyntheticScore > 0:
score.ComputeScore = score.SyntheticScore * (1 + score.MixedEfficiency*0.3)
case score.MixedScore > 0:
score.ComputeScore = score.MixedScore
}
// PowerSustainScore: measures how close the GPU came to its rated TDP under
// a full-spectrum load (dcgmi targeted_power). 100 = exactly at rated TDP.
// Penalty applied symmetrically for both under- and over-TDP deviations:
// score = max(0, 100 |measured rated| / rated × 100)
// Under-TDP → power delivery / cooling issue.
// Over-TDP → power limit not properly enforced / power regulation fault.
// Falls back to 0 if calibration was not performed (dcgmi unavailable).
{
ref := gpu.DefaultPowerLimitW
if ref <= 0 {
ref = gpu.PowerLimitW
}
if gpu.CalibratedPeakPowerW > 0 && ref > 0 {
deviationPct := math.Abs(gpu.CalibratedPeakPowerW-ref) / ref * 100
score.PowerSustainScore = clampScore(100 - deviationPct)
}
} }
runtimeUS := math.Max(1, gpu.Steady.DurationSec*1e6) runtimeUS := math.Max(1, gpu.Steady.DurationSec*1e6)
thermalRatio := float64(gpu.Throttle.HWThermalSlowdownUS+gpu.Throttle.SWThermalSlowdownUS) / runtimeUS thermalRatio := float64(gpu.Throttle.HWThermalSlowdownUS+gpu.Throttle.SWThermalSlowdownUS) / runtimeUS
score.ThermalSustainScore = clampScore(100 - thermalRatio*100) score.ThermalSustainScore = clampScore(100 - thermalRatio*100)
score.StabilityScore = clampScore(100 - (gpu.Steady.ClockCVPct*4 + gpu.Steady.PowerCVPct*2 + gpu.Steady.ClockDriftPct*2)) // StabilityScore: prefer per-precision steady phases where each window runs a
// single kernel type so PowerCVPct is a genuine stability signal (not a
// workload-mix artifact). Fall back to combined steady using clock-only metrics
// when per-precision data is absent (older results, short profiles).
if len(gpu.PrecisionSteady) > 0 {
var sum float64
for _, p := range gpu.PrecisionSteady {
sum += clampScore(100 - (p.Steady.ClockCVPct*4 + p.Steady.PowerCVPct*2 + p.Steady.ClockDriftPct*2))
}
score.StabilityScore = sum / float64(len(gpu.PrecisionSteady))
} else {
score.StabilityScore = clampScore(100 - (gpu.Steady.ClockCVPct*4 + gpu.Steady.ClockDriftPct*2))
}
score.CompositeScore = compositeBenchmarkScore(score) score.CompositeScore = compositeBenchmarkScore(score)
if gpu.MultiprocessorCount > 0 && gpu.Steady.AvgGraphicsClockMHz > 0 && score.ComputeScore > 0 { if gpu.MultiprocessorCount > 0 && gpu.Steady.AvgGraphicsClockMHz > 0 && score.ComputeScore > 0 {
score.TOPSPerSMPerGHz = score.ComputeScore / float64(gpu.MultiprocessorCount) / (gpu.Steady.AvgGraphicsClockMHz / 1000.0) score.TOPSPerSMPerGHz = score.ComputeScore / float64(gpu.MultiprocessorCount) / (gpu.Steady.AvgGraphicsClockMHz / 1000.0)
@@ -761,7 +1044,15 @@ func scoreBenchmarkGPUResult(gpu BenchmarkGPUResult) BenchmarkScorecard {
} }
func compositeBenchmarkScore(score BenchmarkScorecard) float64 { func compositeBenchmarkScore(score BenchmarkScorecard) float64 {
quality := 0.40 + 0.20*(score.PowerSustainScore/100.0) + 0.20*(score.ThermalSustainScore/100.0) + 0.20*(score.StabilityScore/100.0) // Weights after introducing calibrated power reference:
// base 0.35 — floor so a GPU that fails all sustain checks still scores
// thermal 0.25 — heaviest: throttle counters are the most reliable signal
// stability 0.25 — clock/power variance matters for reproducibility
// power 0.15 — GPU reaches rated TDP under targeted_power? lower weight
// because calibration may be absent (dcgmi not installed)
// NCCL bonus 0.10 — interconnect health
// cap 1.10
quality := 0.35 + 0.15*(score.PowerSustainScore/100.0) + 0.25*(score.ThermalSustainScore/100.0) + 0.25*(score.StabilityScore/100.0)
if score.InterconnectScore > 0 { if score.InterconnectScore > 0 {
quality += 0.10 quality += 0.10
} }
@@ -792,6 +1083,12 @@ func detectBenchmarkDegradationReasons(gpu BenchmarkGPUResult, normalizationStat
if normalizationStatus != "full" { if normalizationStatus != "full" {
reasons = append(reasons, "normalization_partial") reasons = append(reasons, "normalization_partial")
} }
if gpu.ECC.Uncorrected > 0 {
reasons = append(reasons, "ecc_uncorrected_errors")
}
if gpu.ECC.Corrected > 0 {
reasons = append(reasons, "ecc_corrected_errors")
}
return dedupeStrings(reasons) return dedupeStrings(reasons)
} }
@@ -893,6 +1190,36 @@ func diffThrottleCounters(before, after BenchmarkThrottleCounters) BenchmarkThro
} }
} }
func queryECCCounters(gpuIndex int) (BenchmarkECCCounters, error) {
out, err := satExecCommand(
"nvidia-smi",
"--id="+strconv.Itoa(gpuIndex),
"--query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total",
"--format=csv,noheader,nounits",
).Output()
if err != nil {
return BenchmarkECCCounters{}, err
}
fields := strings.Split(strings.TrimSpace(string(out)), ",")
if len(fields) < 2 {
return BenchmarkECCCounters{}, fmt.Errorf("unexpected ECC counter columns: %q", strings.TrimSpace(string(out)))
}
corrected, err1 := strconv.ParseUint(strings.TrimSpace(fields[0]), 10, 64)
uncorrected, err2 := strconv.ParseUint(strings.TrimSpace(fields[1]), 10, 64)
if err1 != nil || err2 != nil {
// ECC may be disabled on this GPU — return zero counters silently.
return BenchmarkECCCounters{}, nil
}
return BenchmarkECCCounters{Corrected: corrected, Uncorrected: uncorrected}, nil
}
func diffECCCounters(before, after BenchmarkECCCounters) BenchmarkECCCounters {
return BenchmarkECCCounters{
Corrected: saturatingSub(after.Corrected, before.Corrected),
Uncorrected: saturatingSub(after.Uncorrected, before.Uncorrected),
}
}
func queryActiveComputeApps(gpuIndices []int) ([]string, error) { func queryActiveComputeApps(gpuIndices []int) ([]string, error) {
args := []string{ args := []string{
"--query-compute-apps=gpu_uuid,pid,process_name", "--query-compute-apps=gpu_uuid,pid,process_name",
@@ -970,6 +1297,10 @@ func buildBenchmarkFindings(result NvidiaBenchmarkResult) []string {
findings = append(findings, fmt.Sprintf("GPU %d showed unstable clocks/power over the benchmark window.", gpu.Index)) findings = append(findings, fmt.Sprintf("GPU %d showed unstable clocks/power over the benchmark window.", gpu.Index))
case "normalization_partial": case "normalization_partial":
findings = append(findings, fmt.Sprintf("GPU %d ran without full benchmark normalization.", gpu.Index)) findings = append(findings, fmt.Sprintf("GPU %d ran without full benchmark normalization.", gpu.Index))
case "ecc_uncorrected_errors":
findings = append(findings, fmt.Sprintf("GPU %d reported %d uncorrected ECC error(s) — possible hardware fault.", gpu.Index, gpu.ECC.Uncorrected))
case "ecc_corrected_errors":
findings = append(findings, fmt.Sprintf("GPU %d reported %d corrected ECC error(s) — possible DRAM degradation.", gpu.Index, gpu.ECC.Corrected))
} }
} }
if gpu.Backend == "driver-ptx" { if gpu.Backend == "driver-ptx" {
@@ -981,16 +1312,57 @@ func buildBenchmarkFindings(result NvidiaBenchmarkResult) []string {
gpu.Index, gpu.PowerLimitW, gpu.DefaultPowerLimitW, gpu.PowerLimitW/gpu.DefaultPowerLimitW*100, gpu.Index, gpu.PowerLimitW, gpu.DefaultPowerLimitW, gpu.PowerLimitW/gpu.DefaultPowerLimitW*100,
)) ))
} }
// Flag significant TDP deviation (over or under) from calibration.
if gpu.CalibratedPeakPowerW > 0 {
ref := gpu.DefaultPowerLimitW
if ref <= 0 {
ref = gpu.PowerLimitW
}
if ref > 0 {
deviationPct := (gpu.CalibratedPeakPowerW - ref) / ref * 100
switch {
case deviationPct < -10:
findings = append(findings, fmt.Sprintf(
"GPU %d reached only %.0f W (%.0f%% of rated %.0f W) under targeted_power. Check power delivery or cooling.",
gpu.Index, gpu.CalibratedPeakPowerW, gpu.CalibratedPeakPowerW/ref*100, ref,
))
case deviationPct > 5:
findings = append(findings, fmt.Sprintf(
"GPU %d exceeded rated TDP: %.0f W measured vs %.0f W rated (+%.0f%%). Power limit may not be enforced correctly.",
gpu.Index, gpu.CalibratedPeakPowerW, ref, deviationPct,
))
}
}
}
} }
if result.Interconnect != nil && result.Interconnect.Supported { if result.Interconnect != nil && result.Interconnect.Supported {
findings = append(findings, fmt.Sprintf("Multi-GPU all_reduce max bus bandwidth: %.1f GB/s.", result.Interconnect.MaxBusBWGBps)) findings = append(findings, fmt.Sprintf("Multi-GPU all_reduce max bus bandwidth: %.1f GB/s.", result.Interconnect.MaxBusBWGBps))
} }
if cl := result.CPULoad; cl != nil {
switch cl.Status {
case "high":
findings = append(findings, fmt.Sprintf(
"Host CPU load was elevated during the benchmark (avg %.1f%%, max %.1f%%). A competing CPU workload may skew GPU results.",
cl.AvgPct, cl.MaxPct,
))
case "unstable":
findings = append(findings, fmt.Sprintf(
"Host CPU load was erratic during the benchmark (avg %.1f%%, p95 %.1f%%). Results may be less reproducible.",
cl.AvgPct, cl.P95Pct,
))
}
}
if sp := result.ServerPower; sp != nil && sp.Available && sp.GPUReportedSumW > 0 { if sp := result.ServerPower; sp != nil && sp.Available && sp.GPUReportedSumW > 0 {
if sp.ReportingRatio < 0.75 { if sp.ReportingRatio < 0.75 {
findings = append(findings, fmt.Sprintf( findings = append(findings, fmt.Sprintf(
"GPU power reporting may be unreliable: server delta %.0f W vs GPU-reported %.0f W (ratio %.2f). GPU telemetry likely over-reports actual consumption.", "GPU power reporting may be unreliable: server delta %.0f W vs GPU-reported %.0f W (ratio %.2f). GPU telemetry likely over-reports actual consumption. Composite scores have been penalized accordingly.",
sp.DeltaW, sp.GPUReportedSumW, sp.ReportingRatio, sp.DeltaW, sp.GPUReportedSumW, sp.ReportingRatio,
)) ))
} else if sp.ReportingRatio > 1.25 {
findings = append(findings, fmt.Sprintf(
"Server power delta %.0f W exceeds GPU-reported sum %.0f W by %.0f%%. Other components (CPU, NVMe, networking) may be drawing substantial power under GPU load.",
sp.DeltaW, sp.GPUReportedSumW, (sp.ReportingRatio-1)*100,
))
} }
} }
return dedupeStrings(findings) return dedupeStrings(findings)
@@ -1295,6 +1667,7 @@ func runNvidiaBenchmarkParallel(
spec benchmarkProfileSpec, spec benchmarkProfileSpec,
logFunc func(string), logFunc func(string),
result *NvidiaBenchmarkResult, result *NvidiaBenchmarkResult,
calibPowerByIndex map[int]float64,
serverIdleW *float64, serverLoadedWSum *float64, serverIdleW *float64, serverLoadedWSum *float64,
serverIdleOK *bool, serverLoadedOK *bool, serverLoadedSamples *int, serverIdleOK *bool, serverLoadedOK *bool, serverLoadedSamples *int,
) { ) {
@@ -1316,6 +1689,9 @@ func runNvidiaBenchmarkParallel(
r.BaseGraphicsClockMHz = info.BaseGraphicsClockMHz r.BaseGraphicsClockMHz = info.BaseGraphicsClockMHz
r.MaxMemoryClockMHz = info.MaxMemoryClockMHz r.MaxMemoryClockMHz = info.MaxMemoryClockMHz
} }
if w, ok := calibPowerByIndex[idx]; ok && w > 0 {
r.CalibratedPeakPowerW = w
}
if norm := findBenchmarkNormalization(result.Normalization.GPUs, idx); norm != nil { if norm := findBenchmarkNormalization(result.Normalization.GPUs, idx); norm != nil {
r.LockedGraphicsClockMHz = norm.GPUClockLockMHz r.LockedGraphicsClockMHz = norm.GPUClockLockMHz
r.LockedMemoryClockMHz = norm.MemoryClockLockMHz r.LockedMemoryClockMHz = norm.MemoryClockLockMHz
@@ -1364,20 +1740,75 @@ func runNvidiaBenchmarkParallel(
} }
} }
// ── Per-precision stability phases (parallel) ─────────────────────────────
totalSlots := len(benchmarkPrecisionPhases) + 1
perPhaseSec := spec.SteadySec / totalSlots
if perPhaseSec < 60 {
perPhaseSec = 60
}
eccBase := make(map[int]BenchmarkECCCounters, len(selected))
for _, idx := range selected {
eccBase[idx], _ = queryECCCounters(idx)
}
for _, prec := range benchmarkPrecisionPhases {
phaseCmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(perPhaseSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
"--devices", allDevices,
"--precision", prec,
}
logFunc(fmt.Sprintf("GPUs %s: %s stability phase (%ds)", allDevices, prec, perPhaseSec))
phaseLogName := "gpu-all-steady-" + prec
eccBeforePhase := make(map[int]BenchmarkECCCounters, len(selected))
for _, idx := range selected {
eccBeforePhase[idx], _ = queryECCCounters(idx)
}
phaseOut, phaseRows, phaseErr := runBenchmarkCommandWithMetrics(ctx, verboseLog, phaseLogName+".log", phaseCmd, nil, selected, runDir, phaseLogName, logFunc)
eccAfterPhase := make(map[int]BenchmarkECCCounters, len(selected))
for _, idx := range selected {
eccAfterPhase[idx], _ = queryECCCounters(idx)
}
if phaseErr != nil || len(phaseRows) == 0 {
continue
}
parseByGPU := parseBenchmarkBurnLogByGPU(string(phaseOut))
for _, idx := range selected {
perGPU := filterRowsByGPU(phaseRows, idx)
if len(perGPU) == 0 {
continue
}
phase := BenchmarkPrecisionSteadyPhase{
Precision: prec,
Steady: summarizeBenchmarkTelemetry(perGPU),
ECC: diffECCCounters(eccBeforePhase[idx], eccAfterPhase[idx]),
}
if pr, ok := parseByGPU[idx]; ok {
for _, p := range pr.Profiles {
if p.Supported {
phase.TeraOpsPerSec += p.TeraOpsPerSec
phase.WeightedTeraOpsPerSec += p.WeightedTeraOpsPerSec
}
}
}
gpuResults[idx].PrecisionSteady = append(gpuResults[idx].PrecisionSteady, phase)
}
}
// Snapshot throttle counters before steady. // Snapshot throttle counters before steady.
beforeThrottle := make(map[int]BenchmarkThrottleCounters, len(selected)) beforeThrottle := make(map[int]BenchmarkThrottleCounters, len(selected))
for _, idx := range selected { for _, idx := range selected {
beforeThrottle[idx], _ = queryThrottleCounters(idx) beforeThrottle[idx], _ = queryThrottleCounters(idx)
} }
// Steady: all GPUs simultaneously. // Steady: all GPUs simultaneously (combined). Fixed at one slot = perPhaseSec.
steadyCmd := []string{ steadyCmd := []string{
"bee-gpu-burn", "bee-gpu-burn",
"--seconds", strconv.Itoa(spec.SteadySec), "--seconds", strconv.Itoa(perPhaseSec),
"--size-mb", strconv.Itoa(opts.SizeMB), "--size-mb", strconv.Itoa(opts.SizeMB),
"--devices", allDevices, "--devices", allDevices,
} }
logFunc(fmt.Sprintf("GPUs %s: parallel steady compute (%ds)", allDevices, spec.SteadySec)) logFunc(fmt.Sprintf("GPUs %s: parallel steady compute (combined, %ds)", allDevices, perPhaseSec))
// Sample server power via IPMI in parallel with steady phase. // Sample server power via IPMI in parallel with steady phase.
ipmiStopCh := make(chan struct{}) ipmiStopCh := make(chan struct{})
@@ -1433,6 +1864,9 @@ func runNvidiaBenchmarkParallel(
writeBenchmarkMetricsFiles(runDir, fmt.Sprintf("gpu-%d-steady", idx), perGPU) writeBenchmarkMetricsFiles(runDir, fmt.Sprintf("gpu-%d-steady", idx), perGPU)
gpuResults[idx].Steady = summarizeBenchmarkTelemetry(perGPU) gpuResults[idx].Steady = summarizeBenchmarkTelemetry(perGPU)
gpuResults[idx].Throttle = diffThrottleCounters(beforeThrottle[idx], afterThrottle[idx]) gpuResults[idx].Throttle = diffThrottleCounters(beforeThrottle[idx], afterThrottle[idx])
if eccFinal, err := queryECCCounters(idx); err == nil {
gpuResults[idx].ECC = diffECCCounters(eccBase[idx], eccFinal)
}
if pr, ok := parseResults[idx]; ok { if pr, ok := parseResults[idx]; ok {
gpuResults[idx].ComputeCapability = pr.ComputeCapability gpuResults[idx].ComputeCapability = pr.ComputeCapability
@@ -1477,3 +1911,225 @@ func runNvidiaBenchmarkParallel(
result.GPUs = append(result.GPUs, finalizeBenchmarkGPUResult(*r)) result.GPUs = append(result.GPUs, finalizeBenchmarkGPUResult(*r))
} }
} }
// readBenchmarkHostConfig reads static CPU and memory configuration from
// /proc/cpuinfo and /proc/meminfo. Returns nil if neither source is readable.
func readBenchmarkHostConfig() *BenchmarkHostConfig {
cfg := &BenchmarkHostConfig{}
populated := false
// Parse /proc/cpuinfo for CPU model, sockets, cores, threads.
if data, err := os.ReadFile("/proc/cpuinfo"); err == nil {
socketIDs := map[string]struct{}{}
coresPerSocket := map[string]int{}
var modelName string
threads := 0
for _, line := range strings.Split(string(data), "\n") {
kv := strings.SplitN(line, ":", 2)
if len(kv) != 2 {
continue
}
key := strings.TrimSpace(kv[0])
val := strings.TrimSpace(kv[1])
switch key {
case "processor":
threads++
case "model name":
if modelName == "" {
modelName = val
}
case "physical id":
socketIDs[val] = struct{}{}
case "cpu cores":
// Overwrite per-socket core count (last wins per socket, but all
// entries for the same socket report the same value).
if physLine := ""; physLine == "" {
// We accumulate below by treating cpu cores as a per-thread
// field; sum by socket requires a two-pass approach. Use the
// simpler approximation: totalCores = threads / (threads per core).
_ = val
}
}
}
// Second pass: per-socket core count.
var curSocket string
for _, line := range strings.Split(string(data), "\n") {
kv := strings.SplitN(line, ":", 2)
if len(kv) != 2 {
continue
}
key := strings.TrimSpace(kv[0])
val := strings.TrimSpace(kv[1])
switch key {
case "physical id":
curSocket = val
case "cpu cores":
if curSocket != "" {
if _, seen := coresPerSocket[curSocket]; !seen {
v, _ := strconv.Atoi(val)
coresPerSocket[curSocket] = v
}
}
}
}
totalCores := 0
for _, c := range coresPerSocket {
totalCores += c
}
cfg.CPUModel = modelName
cfg.CPUSockets = len(socketIDs)
if cfg.CPUSockets == 0 && threads > 0 {
cfg.CPUSockets = 1
}
cfg.CPUCores = totalCores
cfg.CPUThreads = threads
if modelName != "" || threads > 0 {
populated = true
}
}
// Parse /proc/meminfo for total physical RAM.
if data, err := os.ReadFile("/proc/meminfo"); err == nil {
for _, line := range strings.Split(string(data), "\n") {
if strings.HasPrefix(line, "MemTotal:") {
fields := strings.Fields(line)
if len(fields) >= 2 {
kb, _ := strconv.ParseUint(fields[1], 10, 64)
cfg.MemTotalGiB = float64(kb) / (1024 * 1024)
populated = true
}
break
}
}
}
if !populated {
return nil
}
return cfg
}
// startCPULoadSampler starts a goroutine that samples host CPU load every
// intervalSec seconds until stopCh is closed, then sends the collected
// samples on the returned channel.
func startCPULoadSampler(stopCh <-chan struct{}, intervalSec int) <-chan []float64 {
ch := make(chan []float64, 1)
go func() {
var samples []float64
ticker := time.NewTicker(time.Duration(intervalSec) * time.Second)
defer ticker.Stop()
for {
select {
case <-stopCh:
ch <- samples
return
case <-ticker.C:
if pct := sampleCPULoadPct(); pct > 0 {
samples = append(samples, pct)
}
}
}
}()
return ch
}
// summarizeCPULoad computes stats over sampled CPU load values and assigns
// a health status.
func summarizeCPULoad(samples []float64) *BenchmarkCPULoad {
if len(samples) == 0 {
return nil
}
sorted := append([]float64(nil), samples...)
sort.Float64s(sorted)
var sum float64
for _, v := range sorted {
sum += v
}
avg := sum / float64(len(sorted))
p95 := sorted[int(float64(len(sorted))*0.95)]
max := sorted[len(sorted)-1]
cl := &BenchmarkCPULoad{
AvgPct: math.Round(avg*10) / 10,
MaxPct: math.Round(max*10) / 10,
P95Pct: math.Round(p95*10) / 10,
Samples: len(sorted),
}
// Compute standard deviation to detect instability.
var variance float64
for _, v := range sorted {
d := v - avg
variance += d * d
}
stdDev := math.Sqrt(variance / float64(len(sorted)))
switch {
case avg > 20 || max > 40:
cl.Status = "high"
cl.Note = fmt.Sprintf("avg %.1f%% max %.1f%% — elevated host CPU load may interfere with GPU benchmark results", avg, max)
case stdDev > 12:
cl.Status = "unstable"
cl.Note = fmt.Sprintf("avg %.1f%% stddev %.1f%% — host CPU load was erratic during the benchmark", avg, stdDev)
default:
cl.Status = "ok"
}
return cl
}
// runBenchmarkPowerCalibration runs a short dcgmi targeted_power test while
// collecting nvidia-smi power samples in parallel. It returns a map from GPU
// index to p95 observed power (watts), which is used as the reference for
// PowerSustainScore instead of the hardware default limit.
//
// If dcgmi is unavailable or the run fails the function returns an empty map
// and the caller falls back to DefaultPowerLimitW. The calibration is skipped
// gracefully — it must never block or fail the main benchmark.
func runBenchmarkPowerCalibration(
ctx context.Context,
verboseLog, runDir string,
gpuIndices []int,
logFunc func(string),
) map[int]float64 {
const calibDurationSec = 45
// dcgmi must be present.
if _, err := exec.LookPath("dcgmi"); err != nil {
logFunc("power calibration: dcgmi not found, skipping (will use default power limit)")
return map[int]float64{}
}
logFunc(fmt.Sprintf("power calibration: running dcgmi targeted_power for %ds on GPUs %s", calibDurationSec, joinIndexList(gpuIndices)))
cmd := nvidiaDCGMNamedDiagCommand("targeted_power", calibDurationSec, gpuIndices)
out, rows, err := runBenchmarkCommandWithMetrics(ctx, verboseLog, "power-calibration.log", cmd, nil, gpuIndices, runDir, "power-calibration", logFunc)
_ = os.WriteFile(filepath.Join(runDir, "power-calibration.log"), out, 0644)
if err != nil {
logFunc(fmt.Sprintf("power calibration: dcgmi targeted_power failed (%v), skipping", err))
return map[int]float64{}
}
// Group rows by GPU index and compute p95 power for each.
result := make(map[int]float64, len(gpuIndices))
for _, idx := range gpuIndices {
perGPU := filterRowsByGPU(rows, idx)
if len(perGPU) == 0 {
continue
}
powers := make([]float64, 0, len(perGPU))
for _, r := range perGPU {
if r.PowerW > 0 {
powers = append(powers, r.PowerW)
}
}
if len(powers) == 0 {
continue
}
p95 := benchmarkPercentile(powers, 95)
if p95 > 0 {
result[idx] = p95
logFunc(fmt.Sprintf("power calibration: GPU %d p95=%.0f W (%d samples)", idx, p95, len(powers)))
}
}
return result
}

View File

@@ -60,9 +60,17 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
fmt.Fprintf(&b, "**Profile:** %s \n", result.BenchmarkProfile) fmt.Fprintf(&b, "**Profile:** %s \n", result.BenchmarkProfile)
fmt.Fprintf(&b, "**App version:** %s \n", result.BenchmarkVersion) fmt.Fprintf(&b, "**App version:** %s \n", result.BenchmarkVersion)
fmt.Fprintf(&b, "**Generated:** %s \n", result.GeneratedAt.Format("2006-01-02 15:04:05 UTC")) fmt.Fprintf(&b, "**Generated:** %s \n", result.GeneratedAt.Format("2006-01-02 15:04:05 UTC"))
if result.ParallelGPUs { if result.RampStep > 0 && result.RampTotal > 0 {
fmt.Fprintf(&b, "**Ramp-up step:** %d of %d \n", result.RampStep, result.RampTotal)
if result.RampRunID != "" {
fmt.Fprintf(&b, "**Ramp-up run ID:** %s \n", result.RampRunID)
}
} else if result.ParallelGPUs {
fmt.Fprintf(&b, "**Mode:** parallel (all GPUs simultaneously) \n") fmt.Fprintf(&b, "**Mode:** parallel (all GPUs simultaneously) \n")
} }
if result.ScalabilityScore > 0 {
fmt.Fprintf(&b, "**Scalability score:** %.1f%% \n", result.ScalabilityScore)
}
fmt.Fprintf(&b, "**Overall status:** %s \n", result.OverallStatus) fmt.Fprintf(&b, "**Overall status:** %s \n", result.OverallStatus)
b.WriteString("\n") b.WriteString("\n")
@@ -83,10 +91,24 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
b.WriteString("\n") b.WriteString("\n")
} }
// ── Scoring methodology ───────────────────────────────────────────────────
b.WriteString("## Scoring Methodology\n\n")
b.WriteString("**Compute score** is derived from two phases:\n\n")
b.WriteString("- **Synthetic** — each precision type (fp8, fp16, fp32, fp64, fp4) runs alone for a dedicated window. ")
b.WriteString("Measures peak throughput with the full GPU dedicated to one kernel type. ")
b.WriteString("Each result is normalised to fp32-equivalent TOPS using precision weights: ")
b.WriteString("fp64 ×2.0 · fp32 ×1.0 · fp16 ×0.5 · fp8 ×0.25 · fp4 ×0.125.\n")
b.WriteString("- **Mixed** — all precision types run simultaneously (combined phase). ")
b.WriteString("Reflects real inference workloads where fp8 matrix ops, fp16 attention and fp32 accumulation compete for bandwidth and SM scheduler slots.\n\n")
b.WriteString("**Formula:** `Compute = Synthetic × (1 + MixedEfficiency × 0.3)`\n\n")
b.WriteString("where `MixedEfficiency = Mixed / Synthetic`. A GPU that sustains 90 % throughput under mixed load ")
b.WriteString("receives a +27 % bonus over its synthetic score; one that drops to 60 % receives +18 %.\n\n")
b.WriteString("**Composite score** = `Compute × quality_factor` where quality factors in power sustain, thermal sustain, stability, and interconnect.\n\n")
// ── Scorecard table ─────────────────────────────────────────────────────── // ── Scorecard table ───────────────────────────────────────────────────────
b.WriteString("## Scorecard\n\n") b.WriteString("## Scorecard\n\n")
b.WriteString("| GPU | Status | Composite | Compute | TOPS/SM/GHz | Power Sustain | Thermal Sustain | Stability | Interconnect |\n") b.WriteString("| GPU | Status | Composite | Compute | Synthetic | Mixed | Mixed Eff. | TOPS/SM/GHz | Power Sustain | Thermal Sustain | Stability | Interconnect |\n")
b.WriteString("|-----|--------|-----------|---------|-------------|---------------|-----------------|-----------|-------------|\n") b.WriteString("|-----|--------|-----------|---------|-----------|-------|------------|-------------|---------------|-----------------|-----------|-------------|\n")
for _, gpu := range result.GPUs { for _, gpu := range result.GPUs {
name := strings.TrimSpace(gpu.Name) name := strings.TrimSpace(gpu.Name)
if name == "" { if name == "" {
@@ -100,11 +122,26 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
if gpu.Scores.TOPSPerSMPerGHz > 0 { if gpu.Scores.TOPSPerSMPerGHz > 0 {
topsPerSM = fmt.Sprintf("%.3f", gpu.Scores.TOPSPerSMPerGHz) topsPerSM = fmt.Sprintf("%.3f", gpu.Scores.TOPSPerSMPerGHz)
} }
fmt.Fprintf(&b, "| GPU %d %s | %s | **%.2f** | %.2f | %s | %.1f | %.1f | %.1f | %s |\n", synthetic := "-"
if gpu.Scores.SyntheticScore > 0 {
synthetic = fmt.Sprintf("%.2f", gpu.Scores.SyntheticScore)
}
mixed := "-"
if gpu.Scores.MixedScore > 0 {
mixed = fmt.Sprintf("%.2f", gpu.Scores.MixedScore)
}
mixedEff := "-"
if gpu.Scores.MixedEfficiency > 0 {
mixedEff = fmt.Sprintf("%.1f%%", gpu.Scores.MixedEfficiency*100)
}
fmt.Fprintf(&b, "| GPU %d %s | %s | **%.2f** | %.2f | %s | %s | %s | %s | %.1f | %.1f | %.1f | %s |\n",
gpu.Index, name, gpu.Index, name,
gpu.Status, gpu.Status,
gpu.Scores.CompositeScore, gpu.Scores.CompositeScore,
gpu.Scores.ComputeScore, gpu.Scores.ComputeScore,
synthetic,
mixed,
mixedEff,
topsPerSM, topsPerSM,
gpu.Scores.PowerSustainScore, gpu.Scores.PowerSustainScore,
gpu.Scores.ThermalSustainScore, gpu.Scores.ThermalSustainScore,
@@ -154,6 +191,35 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
fmt.Fprintf(&b, "| GPU utilisation | %.1f %% | — |\n", gpu.Steady.AvgUsagePct) fmt.Fprintf(&b, "| GPU utilisation | %.1f %% | — |\n", gpu.Steady.AvgUsagePct)
b.WriteString("\n") b.WriteString("\n")
// Per-precision stability phases.
if len(gpu.PrecisionSteady) > 0 {
b.WriteString("**Per-precision stability:**\n\n")
b.WriteString("| Precision | Clock CV | Power CV | Clock Drift | ECC corr | ECC uncorr |\n|-----------|----------|----------|-------------|----------|------------|\n")
for _, p := range gpu.PrecisionSteady {
eccCorr := "—"
eccUncorr := "—"
if !p.ECC.IsZero() {
eccCorr = fmt.Sprintf("%d", p.ECC.Corrected)
eccUncorr = fmt.Sprintf("%d", p.ECC.Uncorrected)
}
fmt.Fprintf(&b, "| %s | %.1f%% | %.1f%% | %.1f%% | %s | %s |\n",
p.Precision, p.Steady.ClockCVPct, p.Steady.PowerCVPct, p.Steady.ClockDriftPct,
eccCorr, eccUncorr)
}
b.WriteString("\n")
} else {
// Legacy: show combined-window variance.
fmt.Fprintf(&b, "**Clock/power variance (combined window):** clock CV %.1f%% · power CV %.1f%% · clock drift %.1f%%\n\n",
gpu.Steady.ClockCVPct, gpu.Steady.PowerCVPct, gpu.Steady.ClockDriftPct)
}
// ECC summary
if !gpu.ECC.IsZero() {
fmt.Fprintf(&b, "**ECC errors (total):** corrected=%d uncorrected=%d\n\n",
gpu.ECC.Corrected, gpu.ECC.Uncorrected)
}
// Throttle // Throttle
throttle := formatThrottleLine(gpu.Throttle, gpu.Steady.DurationSec) throttle := formatThrottleLine(gpu.Throttle, gpu.Steady.DurationSec)
if throttle != "none" { if throttle != "none" {
@@ -163,12 +229,14 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
// Precision results // Precision results
if len(gpu.PrecisionResults) > 0 { if len(gpu.PrecisionResults) > 0 {
b.WriteString("**Precision results:**\n\n") b.WriteString("**Precision results:**\n\n")
b.WriteString("| Precision | TOPS | Lanes | Iterations |\n|-----------|------|-------|------------|\n") b.WriteString("| Precision | TOPS (raw) | Weight | TOPS (fp32-eq) | Lanes | Iterations |\n|-----------|------------|--------|----------------|-------|------------|\n")
for _, p := range gpu.PrecisionResults { for _, p := range gpu.PrecisionResults {
if p.Supported { if p.Supported {
fmt.Fprintf(&b, "| %s | %.2f | %d | %d |\n", p.Name, p.TeraOpsPerSec, p.Lanes, p.Iterations) weightStr := fmt.Sprintf("×%.3g", p.Weight)
fmt.Fprintf(&b, "| %s | %.2f | %s | %.2f | %d | %d |\n",
p.Name, p.TeraOpsPerSec, weightStr, p.WeightedTeraOpsPerSec, p.Lanes, p.Iterations)
} else { } else {
fmt.Fprintf(&b, "| %s | — (unsupported) | — | — |\n", p.Name) fmt.Fprintf(&b, "| %s | — (unsupported) | — | — | — | — |\n", p.Name)
} }
} }
b.WriteString("\n") b.WriteString("\n")

View File

@@ -178,3 +178,67 @@ func TestRenderBenchmarkReportIncludesTerminalChartsWithoutANSI(t *testing.T) {
t.Fatalf("report should not contain ANSI escapes\n%s", report) t.Fatalf("report should not contain ANSI escapes\n%s", report)
} }
} }
func TestEnrichGPUInfoWithMaxClocks(t *testing.T) {
t.Parallel()
nvsmiQ := []byte(`
GPU 00000000:4E:00.0
Product Name : NVIDIA RTX PRO 6000 Blackwell Server Edition
Clocks
Graphics : 2422 MHz
Memory : 12481 MHz
Max Clocks
Graphics : 2430 MHz
SM : 2430 MHz
Memory : 12481 MHz
Video : 2107 MHz
GPU 00000000:4F:00.0
Product Name : NVIDIA RTX PRO 6000 Blackwell Server Edition
Max Clocks
Graphics : 2430 MHz
Memory : 12481 MHz
`)
infoByIndex := map[int]benchmarkGPUInfo{
0: {Index: 0, BusID: "00000000:4E:00.0"},
1: {Index: 1, BusID: "00000000:4F:00.0"},
}
enrichGPUInfoWithMaxClocks(infoByIndex, nvsmiQ)
if infoByIndex[0].MaxGraphicsClockMHz != 2430 {
t.Errorf("GPU 0 MaxGraphicsClockMHz = %v, want 2430", infoByIndex[0].MaxGraphicsClockMHz)
}
if infoByIndex[0].MaxMemoryClockMHz != 12481 {
t.Errorf("GPU 0 MaxMemoryClockMHz = %v, want 12481", infoByIndex[0].MaxMemoryClockMHz)
}
if infoByIndex[1].MaxGraphicsClockMHz != 2430 {
t.Errorf("GPU 1 MaxGraphicsClockMHz = %v, want 2430", infoByIndex[1].MaxGraphicsClockMHz)
}
if infoByIndex[1].MaxMemoryClockMHz != 12481 {
t.Errorf("GPU 1 MaxMemoryClockMHz = %v, want 12481", infoByIndex[1].MaxMemoryClockMHz)
}
}
func TestEnrichGPUInfoWithMaxClocksSkipsPopulated(t *testing.T) {
t.Parallel()
nvsmiQ := []byte(`
GPU 00000000:4E:00.0
Max Clocks
Graphics : 9999 MHz
Memory : 9999 MHz
`)
// Already populated — must not be overwritten.
infoByIndex := map[int]benchmarkGPUInfo{
0: {Index: 0, BusID: "00000000:4E:00.0", MaxGraphicsClockMHz: 2430, MaxMemoryClockMHz: 12481},
}
enrichGPUInfoWithMaxClocks(infoByIndex, nvsmiQ)
if infoByIndex[0].MaxGraphicsClockMHz != 2430 {
t.Errorf("expected existing value to be preserved, got %v", infoByIndex[0].MaxGraphicsClockMHz)
}
}

View File

@@ -2,6 +2,29 @@ package platform
import "time" import "time"
// BenchmarkHostConfig holds static CPU and memory configuration captured at
// benchmark start. Useful for correlating results across runs on different hardware.
type BenchmarkHostConfig struct {
CPUModel string `json:"cpu_model,omitempty"`
CPUSockets int `json:"cpu_sockets,omitempty"`
CPUCores int `json:"cpu_cores,omitempty"`
CPUThreads int `json:"cpu_threads,omitempty"`
MemTotalGiB float64 `json:"mem_total_gib,omitempty"`
}
// BenchmarkCPULoad summarises host CPU utilisation sampled during the GPU
// steady-state phase. High or unstable CPU load during a GPU benchmark may
// indicate a competing workload or a CPU-bound driver bottleneck.
type BenchmarkCPULoad struct {
AvgPct float64 `json:"avg_pct"`
MaxPct float64 `json:"max_pct"`
P95Pct float64 `json:"p95_pct"`
Samples int `json:"samples"`
// Status is "ok", "high", or "unstable".
Status string `json:"status"`
Note string `json:"note,omitempty"`
}
const ( const (
NvidiaBenchmarkProfileStandard = "standard" NvidiaBenchmarkProfileStandard = "standard"
NvidiaBenchmarkProfileStability = "stability" NvidiaBenchmarkProfileStability = "stability"
@@ -14,7 +37,10 @@ type NvidiaBenchmarkOptions struct {
GPUIndices []int GPUIndices []int
ExcludeGPUIndices []int ExcludeGPUIndices []int
RunNCCL bool RunNCCL bool
ParallelGPUs bool // run all selected GPUs simultaneously instead of sequentially ParallelGPUs bool // run all selected GPUs simultaneously instead of sequentially
RampStep int // 1-based step index within a ramp-up run (0 = not a ramp-up)
RampTotal int // total number of ramp-up steps in this run
RampRunID string // shared identifier across all steps of the same ramp-up run
} }
@@ -25,11 +51,17 @@ type NvidiaBenchmarkResult struct {
ServerModel string `json:"server_model,omitempty"` ServerModel string `json:"server_model,omitempty"`
BenchmarkProfile string `json:"benchmark_profile"` BenchmarkProfile string `json:"benchmark_profile"`
ParallelGPUs bool `json:"parallel_gpus,omitempty"` ParallelGPUs bool `json:"parallel_gpus,omitempty"`
RampStep int `json:"ramp_step,omitempty"`
RampTotal int `json:"ramp_total,omitempty"`
RampRunID string `json:"ramp_run_id,omitempty"`
ScalabilityScore float64 `json:"scalability_score,omitempty"`
OverallStatus string `json:"overall_status"` OverallStatus string `json:"overall_status"`
SelectedGPUIndices []int `json:"selected_gpu_indices"` SelectedGPUIndices []int `json:"selected_gpu_indices"`
Findings []string `json:"findings,omitempty"` Findings []string `json:"findings,omitempty"`
Warnings []string `json:"warnings,omitempty"` Warnings []string `json:"warnings,omitempty"`
Normalization BenchmarkNormalization `json:"normalization"` Normalization BenchmarkNormalization `json:"normalization"`
HostConfig *BenchmarkHostConfig `json:"host_config,omitempty"`
CPULoad *BenchmarkCPULoad `json:"cpu_load,omitempty"`
GPUs []BenchmarkGPUResult `json:"gpus"` GPUs []BenchmarkGPUResult `json:"gpus"`
Interconnect *BenchmarkInterconnectResult `json:"interconnect,omitempty"` Interconnect *BenchmarkInterconnectResult `json:"interconnect,omitempty"`
ServerPower *BenchmarkServerPower `json:"server_power,omitempty"` ServerPower *BenchmarkServerPower `json:"server_power,omitempty"`
@@ -63,16 +95,24 @@ type BenchmarkGPUResult struct {
PowerLimitW float64 `json:"power_limit_w,omitempty"` PowerLimitW float64 `json:"power_limit_w,omitempty"`
MultiprocessorCount int `json:"multiprocessor_count,omitempty"` MultiprocessorCount int `json:"multiprocessor_count,omitempty"`
DefaultPowerLimitW float64 `json:"default_power_limit_w,omitempty"` DefaultPowerLimitW float64 `json:"default_power_limit_w,omitempty"`
// CalibratedPeakPowerW is the p95 power measured during a short
// dcgmi targeted_power calibration run before the main benchmark.
// Used as the reference denominator for PowerSustainScore instead of
// the hardware default limit, which bee-gpu-burn cannot reach.
CalibratedPeakPowerW float64 `json:"calibrated_peak_power_w,omitempty"`
MaxGraphicsClockMHz float64 `json:"max_graphics_clock_mhz,omitempty"` MaxGraphicsClockMHz float64 `json:"max_graphics_clock_mhz,omitempty"`
BaseGraphicsClockMHz float64 `json:"base_graphics_clock_mhz,omitempty"` BaseGraphicsClockMHz float64 `json:"base_graphics_clock_mhz,omitempty"`
MaxMemoryClockMHz float64 `json:"max_memory_clock_mhz,omitempty"` MaxMemoryClockMHz float64 `json:"max_memory_clock_mhz,omitempty"`
LockedGraphicsClockMHz float64 `json:"locked_graphics_clock_mhz,omitempty"` LockedGraphicsClockMHz float64 `json:"locked_graphics_clock_mhz,omitempty"`
LockedMemoryClockMHz float64 `json:"locked_memory_clock_mhz,omitempty"` LockedMemoryClockMHz float64 `json:"locked_memory_clock_mhz,omitempty"`
Baseline BenchmarkTelemetrySummary `json:"baseline"` Baseline BenchmarkTelemetrySummary `json:"baseline"`
Steady BenchmarkTelemetrySummary `json:"steady"` Steady BenchmarkTelemetrySummary `json:"steady"`
Cooldown BenchmarkTelemetrySummary `json:"cooldown"` PrecisionSteady []BenchmarkPrecisionSteadyPhase `json:"precision_steady,omitempty"`
Throttle BenchmarkThrottleCounters `json:"throttle_counters"` Cooldown BenchmarkTelemetrySummary `json:"cooldown"`
PrecisionResults []BenchmarkPrecisionResult `json:"precision_results,omitempty"` Throttle BenchmarkThrottleCounters `json:"throttle_counters"`
// ECC error delta accumulated over the full benchmark (all phases combined).
ECC BenchmarkECCCounters `json:"ecc,omitempty"`
PrecisionResults []BenchmarkPrecisionResult `json:"precision_results,omitempty"`
Scores BenchmarkScorecard `json:"scores"` Scores BenchmarkScorecard `json:"scores"`
DegradationReasons []string `json:"degradation_reasons,omitempty"` DegradationReasons []string `json:"degradation_reasons,omitempty"`
Notes []string `json:"notes,omitempty"` Notes []string `json:"notes,omitempty"`
@@ -105,6 +145,18 @@ type BenchmarkThrottleCounters struct {
HWPowerBrakeSlowdownUS uint64 `json:"hw_power_brake_slowdown_us"` HWPowerBrakeSlowdownUS uint64 `json:"hw_power_brake_slowdown_us"`
} }
// BenchmarkECCCounters holds ECC error counts sampled at a point in time.
// Corrected = single-bit errors fixed by ECC (DRAM degradation).
// Uncorrected = double-bit errors that could not be corrected (serious fault).
// Both are volatile (since last driver reset), not persistent.
type BenchmarkECCCounters struct {
Corrected uint64 `json:"corrected"`
Uncorrected uint64 `json:"uncorrected"`
}
func (e BenchmarkECCCounters) Total() uint64 { return e.Corrected + e.Uncorrected }
func (e BenchmarkECCCounters) IsZero() bool { return e.Corrected == 0 && e.Uncorrected == 0 }
type BenchmarkPrecisionResult struct { type BenchmarkPrecisionResult struct {
Name string `json:"name"` Name string `json:"name"`
Category string `json:"category"` Category string `json:"category"`
@@ -115,19 +167,31 @@ type BenchmarkPrecisionResult struct {
K uint64 `json:"k,omitempty"` K uint64 `json:"k,omitempty"`
Iterations uint64 `json:"iterations,omitempty"` Iterations uint64 `json:"iterations,omitempty"`
TeraOpsPerSec float64 `json:"teraops_per_sec,omitempty"` TeraOpsPerSec float64 `json:"teraops_per_sec,omitempty"`
// Weight is the fp32-equivalence factor for this precision category.
// fp32 = 1.0 (baseline), fp64 = 2.0, fp16 = 0.5, fp8 = 0.25, fp4 = 0.125.
// WeightedTOPS = TeraOpsPerSec * Weight gives fp32-equivalent throughput.
Weight float64 `json:"weight,omitempty"`
WeightedTeraOpsPerSec float64 `json:"weighted_teraops_per_sec,omitempty"`
Notes string `json:"notes,omitempty"` Notes string `json:"notes,omitempty"`
} }
type BenchmarkScorecard struct { type BenchmarkScorecard struct {
ComputeScore float64 `json:"compute_score"` ComputeScore float64 `json:"compute_score"`
// SyntheticScore is the sum of fp32-equivalent TOPS from per-precision
// steady phases (each precision ran alone, full GPU dedicated).
SyntheticScore float64 `json:"synthetic_score,omitempty"`
// MixedScore is the sum of fp32-equivalent TOPS from the combined phase
// (all precisions competing simultaneously — closer to real workloads).
MixedScore float64 `json:"mixed_score,omitempty"`
// MixedEfficiency = MixedScore / SyntheticScore. Measures how well the GPU
// sustains throughput under concurrent mixed-precision load.
MixedEfficiency float64 `json:"mixed_efficiency,omitempty"`
PowerSustainScore float64 `json:"power_sustain_score"` PowerSustainScore float64 `json:"power_sustain_score"`
ThermalSustainScore float64 `json:"thermal_sustain_score"` ThermalSustainScore float64 `json:"thermal_sustain_score"`
StabilityScore float64 `json:"stability_score"` StabilityScore float64 `json:"stability_score"`
InterconnectScore float64 `json:"interconnect_score"` InterconnectScore float64 `json:"interconnect_score"`
CompositeScore float64 `json:"composite_score"` CompositeScore float64 `json:"composite_score"`
// TOPSPerSMPerGHz is compute efficiency independent of clock speed and SM count. // TOPSPerSMPerGHz is compute efficiency independent of clock speed and SM count.
// Comparable across throttle levels and GPU generations. Low value at normal
// clocks indicates silicon degradation.
TOPSPerSMPerGHz float64 `json:"tops_per_sm_per_ghz,omitempty"` TOPSPerSMPerGHz float64 `json:"tops_per_sm_per_ghz,omitempty"`
} }
@@ -145,6 +209,20 @@ type BenchmarkServerPower struct {
Notes []string `json:"notes,omitempty"` Notes []string `json:"notes,omitempty"`
} }
// BenchmarkPrecisionSteadyPhase holds per-precision-category telemetry collected
// during a dedicated single-precision steady window. Because only one kernel
// type runs at a time the PowerCVPct here is a genuine stability signal.
type BenchmarkPrecisionSteadyPhase struct {
Precision string `json:"precision"` // e.g. "fp8", "fp16", "fp32"
Steady BenchmarkTelemetrySummary `json:"steady"`
TeraOpsPerSec float64 `json:"teraops_per_sec,omitempty"`
WeightedTeraOpsPerSec float64 `json:"weighted_teraops_per_sec,omitempty"`
// ECC errors accumulated during this precision phase only.
// Non-zero corrected = stress-induced DRAM errors for this kernel type.
// Any uncorrected = serious fault triggered by this precision workload.
ECC BenchmarkECCCounters `json:"ecc,omitempty"`
}
type BenchmarkInterconnectResult struct { type BenchmarkInterconnectResult struct {
Status string `json:"status"` Status string `json:"status"`
Attempted bool `json:"attempted"` Attempted bool `json:"attempted"`

View File

@@ -14,9 +14,17 @@ import (
func (s *System) IsLiveMediaInRAM() bool { func (s *System) IsLiveMediaInRAM() bool {
fsType := mountFSType("/run/live/medium") fsType := mountFSType("/run/live/medium")
if fsType == "" { if fsType == "" {
// No medium mount at all — fall back to toram kernel parameter.
return toramActive() return toramActive()
} }
return strings.EqualFold(fsType, "tmpfs") if strings.EqualFold(fsType, "tmpfs") {
return true
}
// When RunInstallToRAM copies squashfs to /dev/shm/bee-live but the bind
// mount of /run/live/medium fails (common for CD-ROM boots), the medium
// fstype still shows the CD-ROM type. Check whether the RAM copy exists.
files, _ := filepath.Glob("/dev/shm/bee-live/*.squashfs")
return len(files) > 0
} }
func (s *System) LiveBootSource() LiveBootSource { func (s *System) LiveBootSource() LiveBootSource {

View File

@@ -244,11 +244,17 @@ func findUSBExportMount() string {
if readOnly { if readOnly {
continue continue
} }
// Check USB transport via lsblk on the device // Check USB transport via lsblk on the device (or its parent disk for partitions).
if !strings.HasPrefix(device, "/dev/") { if !strings.HasPrefix(device, "/dev/") {
continue continue
} }
if blockDeviceTransport(device) == "usb" { checkDev := device
// lsblk only reports TRAN for the whole disk, not for partitions (e.g. /dev/sdc1).
// Strip trailing partition digits to get the parent disk name.
if trimmed := strings.TrimRight(device, "0123456789"); trimmed != device && len(trimmed) > len("/dev/") {
checkDev = trimmed
}
if blockDeviceTransport(checkDev) == "usb" {
return mountPoint return mountPoint
} }
} }

View File

@@ -12,6 +12,7 @@ import (
"path/filepath" "path/filepath"
"regexp" "regexp"
"sort" "sort"
"strconv"
"strings" "strings"
"sync/atomic" "sync/atomic"
"syscall" "syscall"
@@ -209,6 +210,14 @@ func joinTaskIndices(indices []int) string {
return strings.Join(parts, ",") return strings.Join(parts, ",")
} }
func formatGPUIndexList(indices []int) string {
parts := make([]string, len(indices))
for i, idx := range indices {
parts[i] = strconv.Itoa(idx)
}
return strings.Join(parts, ",")
}
func formatSplitTaskName(baseName, selectionLabel string) string { func formatSplitTaskName(baseName, selectionLabel string) string {
baseName = strings.TrimSpace(baseName) baseName = strings.TrimSpace(baseName)
selectionLabel = strings.TrimSpace(selectionLabel) selectionLabel = strings.TrimSpace(selectionLabel)
@@ -488,6 +497,7 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
GPUIndices []int `json:"gpu_indices"` GPUIndices []int `json:"gpu_indices"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices"` ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
StaggerGPUStart bool `json:"stagger_gpu_start"` StaggerGPUStart bool `json:"stagger_gpu_start"`
ParallelGPUs bool `json:"parallel_gpus"`
Loader string `json:"loader"` Loader string `json:"loader"`
Profile string `json:"profile"` Profile string `json:"profile"`
DisplayName string `json:"display_name"` DisplayName string `json:"display_name"`
@@ -510,6 +520,7 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
GPUIndices: body.GPUIndices, GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices, ExcludeGPUIndices: body.ExcludeGPUIndices,
StaggerGPUStart: body.StaggerGPUStart, StaggerGPUStart: body.StaggerGPUStart,
ParallelGPUs: body.ParallelGPUs,
Loader: body.Loader, Loader: body.Loader,
BurnProfile: body.Profile, BurnProfile: body.Profile,
DisplayName: body.DisplayName, DisplayName: body.DisplayName,
@@ -540,6 +551,7 @@ func (h *handler) handleAPIBenchmarkNvidiaRun(w http.ResponseWriter, r *http.Req
ExcludeGPUIndices []int `json:"exclude_gpu_indices"` ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
RunNCCL *bool `json:"run_nccl"` RunNCCL *bool `json:"run_nccl"`
ParallelGPUs *bool `json:"parallel_gpus"` ParallelGPUs *bool `json:"parallel_gpus"`
RampUp *bool `json:"ramp_up"`
DisplayName string `json:"display_name"` DisplayName string `json:"display_name"`
} }
if r.Body != nil { if r.Body != nil {
@@ -557,10 +569,82 @@ func (h *handler) handleAPIBenchmarkNvidiaRun(w http.ResponseWriter, r *http.Req
if body.ParallelGPUs != nil { if body.ParallelGPUs != nil {
parallelGPUs = *body.ParallelGPUs parallelGPUs = *body.ParallelGPUs
} }
rampUp := false
if body.RampUp != nil {
rampUp = *body.RampUp
}
// Build a descriptive base name that includes profile and mode so the task
// list is self-explanatory without opening individual task detail pages.
profile := strings.TrimSpace(body.Profile)
if profile == "" {
profile = "standard"
}
name := taskDisplayName("nvidia-benchmark", "", "") name := taskDisplayName("nvidia-benchmark", "", "")
if strings.TrimSpace(body.DisplayName) != "" { if strings.TrimSpace(body.DisplayName) != "" {
name = body.DisplayName name = body.DisplayName
} }
// Append profile tag.
name = fmt.Sprintf("%s · %s", name, profile)
if rampUp && len(body.GPUIndices) > 1 {
// Ramp-up mode: resolve GPU list, then create one task per prefix
// [gpu0], [gpu0,gpu1], ..., [gpu0,...,gpuN-1], each running in parallel.
gpus, err := apiListNvidiaGPUs(h.opts.App)
if err != nil {
writeError(w, http.StatusBadRequest, err.Error())
return
}
resolved, err := expandSelectedGPUIndices(gpus, body.GPUIndices, body.ExcludeGPUIndices)
if err != nil {
writeError(w, http.StatusBadRequest, err.Error())
return
}
if len(resolved) < 2 {
// Fall through to normal single-task path.
rampUp = false
} else {
now := time.Now()
rampRunID := fmt.Sprintf("ramp-%s", now.UTC().Format("20060102-150405"))
var allTasks []*Task
for step := 1; step <= len(resolved); step++ {
subset := resolved[:step]
stepName := fmt.Sprintf("%s · ramp %d/%d · GPU %s", name, step, len(resolved), formatGPUIndexList(subset))
t := &Task{
ID: newJobID("benchmark-nvidia"),
Name: stepName,
Target: "nvidia-benchmark",
Priority: 15,
Status: TaskPending,
CreatedAt: now,
params: taskParams{
GPUIndices: append([]int(nil), subset...),
SizeMB: body.SizeMB,
BenchmarkProfile: body.Profile,
RunNCCL: runNCCL && step == len(resolved),
ParallelGPUs: true,
RampStep: step,
RampTotal: len(resolved),
RampRunID: rampRunID,
DisplayName: stepName,
},
}
allTasks = append(allTasks, t)
}
for _, t := range allTasks {
globalQueue.enqueue(t)
}
writeTaskRunResponse(w, allTasks)
return
}
}
// For non-ramp tasks append mode tag.
if parallelGPUs {
name = fmt.Sprintf("%s · parallel", name)
} else {
name = fmt.Sprintf("%s · sequential", name)
}
tasks, err := buildNvidiaTaskSet("nvidia-benchmark", 15, time.Now(), taskParams{ tasks, err := buildNvidiaTaskSet("nvidia-benchmark", 15, time.Now(), taskParams{
GPUIndices: body.GPUIndices, GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices, ExcludeGPUIndices: body.ExcludeGPUIndices,

View File

@@ -330,6 +330,33 @@ func renderHardwareSummaryCard(opts HandlerOptions) string {
var b strings.Builder var b strings.Builder
b.WriteString(`<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body">`) b.WriteString(`<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body">`)
// Server identity block above the component table.
{
var model, serial string
parts := []string{}
if hw.Board.Manufacturer != nil && strings.TrimSpace(*hw.Board.Manufacturer) != "" {
parts = append(parts, strings.TrimSpace(*hw.Board.Manufacturer))
}
if hw.Board.ProductName != nil && strings.TrimSpace(*hw.Board.ProductName) != "" {
parts = append(parts, strings.TrimSpace(*hw.Board.ProductName))
}
if len(parts) > 0 {
model = strings.Join(parts, " ")
}
serial = strings.TrimSpace(hw.Board.SerialNumber)
if model != "" || serial != "" {
b.WriteString(`<div style="margin-bottom:14px">`)
if model != "" {
fmt.Fprintf(&b, `<div style="font-size:16px;font-weight:700;margin-bottom:2px">%s</div>`, html.EscapeString(model))
}
if serial != "" {
fmt.Fprintf(&b, `<div style="font-size:12px;color:var(--muted)">S/N: %s</div>`, html.EscapeString(serial))
}
b.WriteString(`</div>`)
}
}
b.WriteString(`<table style="width:auto">`) b.WriteString(`<table style="width:auto">`)
writeRow := func(label, value, badgeHTML string) { writeRow := func(label, value, badgeHTML string) {
b.WriteString(fmt.Sprintf(`<tr><td style="padding:6px 14px 6px 0;font-weight:700;white-space:nowrap">%s</td><td style="padding:6px 0;color:var(--muted);font-size:13px">%s</td><td style="padding:6px 0 6px 12px">%s</td></tr>`, b.WriteString(fmt.Sprintf(`<tr><td style="padding:6px 14px 6px 0;font-weight:700;white-space:nowrap">%s</td><td style="padding:6px 0;color:var(--muted);font-size:13px">%s</td><td style="padding:6px 0 6px 12px">%s</td></tr>`,
@@ -1279,9 +1306,6 @@ func renderValidate(opts HandlerOptions) string {
<div class="card" style="margin-bottom:16px"> <div class="card" style="margin-bottom:16px">
<div class="card-head">Validate Profile</div> <div class="card-head">Validate Profile</div>
<div class="card-body validate-profile-body"> <div class="card-body validate-profile-body">
<div class="validate-profile-col">
<div class="form-row" style="margin:0"><label>Cycles</label><input type="number" id="sat-cycles" value="1" min="1" max="100" style="width:100%"></div>
</div>
<div class="validate-profile-col"> <div class="validate-profile-col">
<div class="form-row" style="margin:12px 0 0"><label>Mode</label></div> <div class="form-row" style="margin:12px 0 0"><label>Mode</label></div>
<label class="cb-row"><input type="radio" name="sat-mode" id="sat-mode-validate" value="validate" checked onchange="satModeChanged()"><span>Validate — quick non-destructive check</span></label> <label class="cb-row"><input type="radio" name="sat-mode" id="sat-mode-validate" value="validate" checked onchange="satModeChanged()"><span>Validate — quick non-destructive check</span></label>
@@ -1331,22 +1355,16 @@ func renderValidate(opts HandlerOptions) string {
<p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p> <p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p>
</div> </div>
<p id="sat-gpu-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA validate tasks.</p> <p id="sat-gpu-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA validate tasks.</p>
<div style="margin-top:10px;padding-top:10px;border-top:1px solid var(--border)">
<label class="sat-gpu-row" title="When checked, multi-GPU tests (PSU Pulse, NCCL, NVBandwidth) run on ALL GPUs in the system regardless of the selection above.">
<input type="checkbox" id="sat-multi-gpu-all" checked onchange="satUpdateGPUSelectionNote()">
<span><strong>Multi-GPU tests</strong> — use all GPUs <span style="font-size:11px;color:var(--muted)">(PSU Pulse, NCCL, NVBandwidth)</span></span>
</label>
</div>
</div> </div>
</div> </div>
<div class="grid3"> <div class="grid3">
` + renderSATCard("nvidia", "NVIDIA GPU", "runNvidiaValidateSet('nvidia')", "", renderValidateCardBody( ` + renderSATCard("nvidia", "NVIDIA GPU", "runNvidiaValidateSet('nvidia')", "", renderValidateCardBody(
inv.NVIDIA, inv.NVIDIA,
`Runs NVIDIA diagnostics and board inventory checks.`, `Runs NVIDIA diagnostics and board inventory checks.`,
`<code>nvidia-smi</code>, <code>dmidecode</code>, <code>dcgmi diag</code>`, `<code>nvidia-smi</code>, <code>dmidecode</code>, <code>dcgmi diag</code>`,
`Level 2 in Validate, Level 3 in Stress. Runs one GPU at a time on the selected NVIDIA GPUs.`, `Level 2 in Validate, Level 3 in Stress. Runs one GPU at a time on the selected NVIDIA GPUs.`,
)) + )) +
`<div id="sat-card-nvidia-targeted-stress">` + `<div id="sat-card-nvidia-targeted-stress">` +
renderSATCard("nvidia-targeted-stress", "NVIDIA GPU Targeted Stress", "runNvidiaValidateSet('nvidia-targeted-stress')", "", renderValidateCardBody( renderSATCard("nvidia-targeted-stress", "NVIDIA GPU Targeted Stress", "runNvidiaValidateSet('nvidia-targeted-stress')", "", renderValidateCardBody(
inv.NVIDIA, inv.NVIDIA,
@@ -1455,10 +1473,6 @@ function satSelectedGPUIndices() {
.filter(function(v) { return !Number.isNaN(v); }) .filter(function(v) { return !Number.isNaN(v); })
.sort(function(a, b) { return a - b; }); .sort(function(a, b) { return a - b; });
} }
function satMultiGPUAll() {
const cb = document.getElementById('sat-multi-gpu-all');
return cb ? cb.checked : true;
}
function satUpdateGPUSelectionNote() { function satUpdateGPUSelectionNote() {
const note = document.getElementById('sat-gpu-selection-note'); const note = document.getElementById('sat-gpu-selection-note');
if (!note) return; if (!note) return;
@@ -1467,8 +1481,7 @@ function satUpdateGPUSelectionNote() {
note.textContent = 'Select at least one NVIDIA GPU to enable NVIDIA validate tasks.'; note.textContent = 'Select at least one NVIDIA GPU to enable NVIDIA validate tasks.';
return; return;
} }
const multiAll = satMultiGPUAll(); note.textContent = 'Selected GPUs: ' + selected.join(', ') + '. Multi-GPU tests will use all selected GPUs.';
note.textContent = 'Selected GPUs: ' + selected.join(', ') + '. Multi-GPU tests: ' + (multiAll ? 'all GPUs in system' : 'selected GPUs only') + '.';
} }
function satRenderGPUList(gpus) { function satRenderGPUList(gpus) {
const root = document.getElementById('sat-gpu-list'); const root = document.getElementById('sat-gpu-list');
@@ -1582,15 +1595,8 @@ const nvidiaPerGPUTargets = ['nvidia', 'nvidia-targeted-stress', 'nvidia-targete
// pulse_test and fabric tests run on all selected GPUs simultaneously // pulse_test and fabric tests run on all selected GPUs simultaneously
const nvidiaAllGPUTargets = ['nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth']; const nvidiaAllGPUTargets = ['nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth'];
function satAllGPUIndicesForMulti() { function satAllGPUIndicesForMulti() {
// If "Multi-GPU tests — all GPUs" is checked, return all detected GPUs. // Multi-GPU tests always use the current GPU selection.
// Otherwise fall back to the per-GPU selection. return Promise.resolve(satSelectedGPUIndices());
if (satMultiGPUAll()) {
return loadSatNvidiaGPUs().then(function(gpus) {
return gpus.map(function(g) { return Number(g.index); });
});
}
const sel = satSelectedGPUIndices();
return Promise.resolve(sel);
} }
function expandSATTarget(target) { function expandSATTarget(target) {
if (nvidiaAllGPUTargets.indexOf(target) >= 0) { if (nvidiaAllGPUTargets.indexOf(target) >= 0) {
@@ -1680,7 +1686,7 @@ function runAMDValidateSet() {
return runNext(0); return runNext(0);
} }
function runAllSAT() { function runAllSAT() {
const cycles = Math.max(1, parseInt(document.getElementById('sat-cycles').value)||1); const cycles = 1;
const status = document.getElementById('sat-all-status'); const status = document.getElementById('sat-all-status');
status.textContent = 'Enqueuing...'; status.textContent = 'Enqueuing...';
const stressOnlyTargets = ['nvidia-targeted-stress', 'nvidia-targeted-power', 'nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth']; const stressOnlyTargets = ['nvidia-targeted-stress', 'nvidia-targeted-power', 'nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth'];
@@ -1922,23 +1928,10 @@ func renderSATCard(id, label, runAction, headerActions, body string) string {
// ── Benchmark ───────────────────────────────────────────────────────────────── // ── Benchmark ─────────────────────────────────────────────────────────────────
type benchmarkHistoryColumn struct {
key string
label string
name string
index int
parallel bool
}
type benchmarkHistoryCell struct {
score float64
present bool
}
type benchmarkHistoryRun struct { type benchmarkHistoryRun struct {
generatedAt time.Time generatedAt time.Time
displayTime string displayTime string
cells map[string]benchmarkHistoryCell gpuScores map[int]float64 // GPU index → composite score
} }
func renderBenchmark(opts HandlerOptions) string { func renderBenchmark(opts HandlerOptions) string {
@@ -1967,12 +1960,16 @@ func renderBenchmark(opts HandlerOptions) string {
</div> </div>
</div> </div>
<label class="benchmark-cb-row"> <label class="benchmark-cb-row">
<input type="checkbox" id="benchmark-parallel-gpus"> <input type="radio" name="benchmark-mode" value="sequential" onchange="benchmarkUpdateSelectionNote()">
<span>Run all selected GPUs simultaneously (parallel mode)</span> <span>Sequential — one GPU at a time</span>
</label> </label>
<label class="benchmark-cb-row"> <label class="benchmark-cb-row" id="benchmark-parallel-label">
<input type="checkbox" id="benchmark-run-nccl" checked> <input type="radio" name="benchmark-mode" value="parallel" onchange="benchmarkUpdateSelectionNote()">
<span>Run multi-GPU interconnect step (NCCL) only on the selected GPUs</span> <span>Parallel — all selected GPUs simultaneously</span>
</label>
<label class="benchmark-cb-row" id="benchmark-ramp-label">
<input type="radio" name="benchmark-mode" value="ramp-up" checked onchange="benchmarkUpdateSelectionNote()">
<span>Ramp-up — 1 GPU → 2 → … → all selected (separate tasks)</span>
</label> </label>
<p id="benchmark-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 14px">Select one GPU for single-card benchmarking or several GPUs for a constrained multi-GPU run.</p> <p id="benchmark-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 14px">Select one GPU for single-card benchmarking or several GPUs for a constrained multi-GPU run.</p>
<button id="benchmark-run-btn" class="btn btn-primary" onclick="runNvidiaBenchmark()" disabled>&#9654; Run Benchmark</button> <button id="benchmark-run-btn" class="btn btn-primary" onclick="runNvidiaBenchmark()" disabled>&#9654; Run Benchmark</button>
@@ -2025,22 +2022,28 @@ function benchmarkSelectedGPUIndices() {
.sort(function(a, b) { return a - b; }); .sort(function(a, b) { return a - b; });
} }
function benchmarkMode() {
const el = document.querySelector('input[name="benchmark-mode"]:checked');
return el ? el.value : 'sequential';
}
function benchmarkUpdateSelectionNote() { function benchmarkUpdateSelectionNote() {
const selected = benchmarkSelectedGPUIndices(); const selected = benchmarkSelectedGPUIndices();
const btn = document.getElementById('benchmark-run-btn'); const btn = document.getElementById('benchmark-run-btn');
const note = document.getElementById('benchmark-selection-note'); const note = document.getElementById('benchmark-selection-note');
const nccl = document.getElementById('benchmark-run-nccl');
if (!selected.length) { if (!selected.length) {
btn.disabled = true; btn.disabled = true;
note.textContent = 'Select at least one NVIDIA GPU to run the benchmark.'; note.textContent = 'Select at least one NVIDIA GPU to run the benchmark.';
return; return;
} }
btn.disabled = false; btn.disabled = false;
note.textContent = 'Selected GPUs: ' + selected.join(', ') + '.'; const mode = benchmarkMode();
if (nccl && nccl.checked && selected.length < 2) { if (mode === 'ramp-up') {
note.textContent += ' NCCL will be skipped because fewer than 2 GPUs are selected.'; note.textContent = 'Ramp-up: ' + selected.length + ' tasks (1 GPU → ' + selected.length + ' GPUs). NCCL on final step.';
} else if (nccl && nccl.checked) { } else if (mode === 'parallel') {
note.textContent += ' NCCL interconnect will use only these GPUs.'; note.textContent = 'Parallel: all ' + selected.length + ' GPU(s) simultaneously.' + (selected.length > 1 ? ' NCCL included.' : '');
} else {
note.textContent = 'Sequential: each GPU benchmarked separately.' + (selected.length > 1 ? ' NCCL included on each.' : '');
} }
} }
@@ -2058,6 +2061,33 @@ function benchmarkRenderGPUList(gpus) {
+ '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>' + '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>'
+ '</label>'; + '</label>';
}).join(''); }).join('');
benchmarkApplyMultiGPUState(gpus.length);
benchmarkUpdateSelectionNote();
}
// Disable radio options that require multiple GPUs when only one is present.
function benchmarkApplyMultiGPUState(gpuCount) {
var multiValues = ['parallel', 'ramp-up'];
var radios = document.querySelectorAll('input[name="benchmark-mode"]');
radios.forEach(function(el) {
var isMulti = multiValues.indexOf(el.value) >= 0;
if (gpuCount < 2 && isMulti) {
el.disabled = true;
if (el.checked) {
// fall back to sequential
var seq = document.querySelector('input[name="benchmark-mode"][value="sequential"]');
if (seq) seq.checked = true;
}
var label = el.closest('label');
if (label) label.style.opacity = '0.4';
} else {
el.disabled = false;
// restore default: ramp-up checked when ≥2 GPUs
if (gpuCount >= 2 && el.value === 'ramp-up') el.checked = true;
var label = el.closest('label');
if (label) label.style.opacity = '';
}
});
benchmarkUpdateSelectionNote(); benchmarkUpdateSelectionNote();
} }
@@ -2095,12 +2125,15 @@ function runNvidiaBenchmark() {
return; return;
} }
if (benchmarkES) { benchmarkES.close(); benchmarkES = null; } if (benchmarkES) { benchmarkES.close(); benchmarkES = null; }
const parallelGPUs = !!document.getElementById('benchmark-parallel-gpus').checked; const mode = benchmarkMode();
const rampUp = mode === 'ramp-up' && selected.length > 1;
const parallelGPUs = mode === 'parallel';
const body = { const body = {
profile: document.getElementById('benchmark-profile').value || 'standard', profile: document.getElementById('benchmark-profile').value || 'standard',
gpu_indices: selected, gpu_indices: selected,
run_nccl: !!document.getElementById('benchmark-run-nccl').checked, run_nccl: selected.length > 1,
parallel_gpus: parallelGPUs, parallel_gpus: parallelGPUs,
ramp_up: rampUp,
display_name: 'NVIDIA Benchmark' display_name: 'NVIDIA Benchmark'
}; };
document.getElementById('benchmark-output').style.display = 'block'; document.getElementById('benchmark-output').style.display = 'block';
@@ -2155,23 +2188,22 @@ function runNvidiaBenchmark() {
}); });
} }
document.getElementById('benchmark-run-nccl').addEventListener('change', benchmarkUpdateSelectionNote);
benchmarkLoadGPUs(); benchmarkLoadGPUs();
</script>` </script>`
} }
func renderBenchmarkResultsCard(exportDir string) string { func renderBenchmarkResultsCard(exportDir string) string {
columns, runs := loadBenchmarkHistory(exportDir) maxIdx, runs := loadBenchmarkHistory(exportDir)
return renderBenchmarkResultsCardFromRuns( return renderBenchmarkResultsCardFromRuns(
"Benchmark Results", "Benchmark Results",
"Composite score by saved benchmark run and GPU.", "Composite score by saved benchmark run and GPU.",
"No saved benchmark runs yet.", "No saved benchmark runs yet.",
columns, maxIdx,
runs, runs,
) )
} }
func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string, columns []benchmarkHistoryColumn, runs []benchmarkHistoryRun) string { func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string, maxGPUIndex int, runs []benchmarkHistoryRun) string {
if len(runs) == 0 { if len(runs) == 0 {
return `<div class="card"><div class="card-head">` + html.EscapeString(title) + `</div><div class="card-body"><p style="color:var(--muted);font-size:13px">` + html.EscapeString(emptyMessage) + `</p></div></div>` return `<div class="card"><div class="card-head">` + html.EscapeString(title) + `</div><div class="card-body"><p style="color:var(--muted);font-size:13px">` + html.EscapeString(emptyMessage) + `</p></div></div>`
} }
@@ -2181,22 +2213,22 @@ func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string,
b.WriteString(`<p style="color:var(--muted);font-size:13px;margin-bottom:12px">` + html.EscapeString(description) + `</p>`) b.WriteString(`<p style="color:var(--muted);font-size:13px;margin-bottom:12px">` + html.EscapeString(description) + `</p>`)
} }
b.WriteString(`<div style="overflow-x:auto">`) b.WriteString(`<div style="overflow-x:auto">`)
b.WriteString(`<table><thead><tr><th>Test</th><th>Time</th>`) b.WriteString(`<table><thead><tr><th>Run</th><th>Time</th>`)
for _, col := range columns { for i := 0; i <= maxGPUIndex; i++ {
b.WriteString(`<th>` + html.EscapeString(col.label) + `</th>`) b.WriteString(`<th>GPU ` + strconv.Itoa(i) + `</th>`)
} }
b.WriteString(`</tr></thead><tbody>`) b.WriteString(`</tr></thead><tbody>`)
for i, run := range runs { for i, run := range runs {
b.WriteString(`<tr>`) b.WriteString(`<tr>`)
b.WriteString(`<td>#` + strconv.Itoa(i+1) + `</td>`) b.WriteString(`<td>#` + strconv.Itoa(i+1) + `</td>`)
b.WriteString(`<td>` + html.EscapeString(run.displayTime) + `</td>`) b.WriteString(`<td>` + html.EscapeString(run.displayTime) + `</td>`)
for _, col := range columns { for idx := 0; idx <= maxGPUIndex; idx++ {
cell, ok := run.cells[col.key] score, ok := run.gpuScores[idx]
if !ok || !cell.present { if !ok {
b.WriteString(`<td style="color:var(--muted)">-</td>`) b.WriteString(`<td style="color:var(--muted)">-</td>`)
continue continue
} }
b.WriteString(`<td>` + fmt.Sprintf("%.2f", cell.score) + `</td>`) b.WriteString(`<td>` + fmt.Sprintf("%.2f", score) + `</td>`)
} }
b.WriteString(`</tr>`) b.WriteString(`</tr>`)
} }
@@ -2204,22 +2236,22 @@ func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string,
return b.String() return b.String()
} }
func loadBenchmarkHistory(exportDir string) ([]benchmarkHistoryColumn, []benchmarkHistoryRun) { func loadBenchmarkHistory(exportDir string) (int, []benchmarkHistoryRun) {
baseDir := app.DefaultBenchmarkBaseDir baseDir := app.DefaultBenchmarkBaseDir
if strings.TrimSpace(exportDir) != "" { if strings.TrimSpace(exportDir) != "" {
baseDir = filepath.Join(exportDir, "bee-benchmark") baseDir = filepath.Join(exportDir, "bee-benchmark")
} }
paths, err := filepath.Glob(filepath.Join(baseDir, "gpu-benchmark-*", "result.json")) paths, err := filepath.Glob(filepath.Join(baseDir, "gpu-benchmark-*", "result.json"))
if err != nil || len(paths) == 0 { if err != nil || len(paths) == 0 {
return nil, nil return -1, nil
} }
sort.Strings(paths) sort.Strings(paths)
return loadBenchmarkHistoryFromPaths(paths) return loadBenchmarkHistoryFromPaths(paths)
} }
func loadBenchmarkHistoryFromPaths(paths []string) ([]benchmarkHistoryColumn, []benchmarkHistoryRun) { func loadBenchmarkHistoryFromPaths(paths []string) (int, []benchmarkHistoryRun) {
columnByKey := make(map[string]benchmarkHistoryColumn)
runs := make([]benchmarkHistoryRun, 0, len(paths)) runs := make([]benchmarkHistoryRun, 0, len(paths))
maxGPUIndex := -1
for _, path := range paths { for _, path := range paths {
raw, err := os.ReadFile(path) raw, err := os.ReadFile(path)
if err != nil { if err != nil {
@@ -2232,102 +2264,22 @@ func loadBenchmarkHistoryFromPaths(paths []string) ([]benchmarkHistoryColumn, []
run := benchmarkHistoryRun{ run := benchmarkHistoryRun{
generatedAt: result.GeneratedAt, generatedAt: result.GeneratedAt,
displayTime: result.GeneratedAt.Local().Format("2006-01-02 15:04:05"), displayTime: result.GeneratedAt.Local().Format("2006-01-02 15:04:05"),
cells: make(map[string]benchmarkHistoryCell), gpuScores: make(map[int]float64),
} }
for _, gpu := range result.GPUs {
if result.ParallelGPUs { run.gpuScores[gpu.Index] = gpu.Scores.CompositeScore
// All GPUs ran simultaneously — one column per server, score = avg composite. if gpu.Index > maxGPUIndex {
gpuModelCount := make(map[string]int) maxGPUIndex = gpu.Index
for _, gpu := range result.GPUs {
gpuModelCount[strings.TrimSpace(gpu.Name)]++
}
scoreSum := make(map[string]float64)
scoreCnt := make(map[string]int)
for _, gpu := range result.GPUs {
key := "parallel|" + strings.TrimSpace(result.ServerModel) + "|" + strings.TrimSpace(gpu.Name)
scoreSum[key] += gpu.Scores.CompositeScore
scoreCnt[key]++
count := gpuModelCount[strings.TrimSpace(gpu.Name)]
columnByKey[key] = benchmarkHistoryColumn{
key: key,
label: benchmarkHistoryParallelLabel(result.ServerModel, gpu.Name, count),
name: strings.TrimSpace(gpu.Name),
index: -1,
parallel: true,
}
}
for key, sum := range scoreSum {
run.cells[key] = benchmarkHistoryCell{score: sum / float64(scoreCnt[key]), present: true}
}
} else {
// Each GPU ran independently — one column per GPU index.
for _, gpu := range result.GPUs {
key := "gpu|" + strings.TrimSpace(result.ServerModel) + "|" + strings.TrimSpace(gpu.Name) + "|" + strconv.Itoa(gpu.Index)
columnByKey[key] = benchmarkHistoryColumn{
key: key,
label: benchmarkHistoryPerGPULabel(gpu.Name, gpu.Index),
name: strings.TrimSpace(gpu.Name),
index: gpu.Index,
parallel: false,
}
run.cells[key] = benchmarkHistoryCell{score: gpu.Scores.CompositeScore, present: true}
} }
} }
runs = append(runs, run) runs = append(runs, run)
} }
columns := make([]benchmarkHistoryColumn, 0, len(columnByKey))
for _, col := range columnByKey {
columns = append(columns, col)
}
// Sequential GPU columns first (sorted by GPU index), then parallel server columns.
sort.Slice(columns, func(i, j int) bool {
if columns[i].parallel != columns[j].parallel {
return !columns[i].parallel // sequential first
}
if columns[i].parallel {
li := strings.ToLower(columns[i].label)
lj := strings.ToLower(columns[j].label)
if li != lj {
return li < lj
}
return columns[i].key < columns[j].key
}
// Sequential: sort by GPU index, then name.
if columns[i].index != columns[j].index {
return columns[i].index < columns[j].index
}
return strings.ToLower(columns[i].name) < strings.ToLower(columns[j].name)
})
sort.Slice(runs, func(i, j int) bool { sort.Slice(runs, func(i, j int) bool {
return runs[i].generatedAt.After(runs[j].generatedAt) return runs[i].generatedAt.After(runs[j].generatedAt)
}) })
return columns, runs return maxGPUIndex, runs
} }
// benchmarkHistoryPerGPULabel formats a label for a single-GPU column: "GPU #N — ModelName".
func benchmarkHistoryPerGPULabel(gpuName string, index int) string {
gpuName = strings.TrimSpace(gpuName)
if gpuName == "" {
gpuName = "Unknown GPU"
}
return fmt.Sprintf("GPU #%d — %s", index, gpuName)
}
// benchmarkHistoryParallelLabel formats a label for an all-GPU parallel column:
// "ServerModel — N× ModelName (All GPUs)" or "N× ModelName (All GPUs)" if no server.
func benchmarkHistoryParallelLabel(serverModel, gpuName string, count int) string {
serverModel = strings.TrimSpace(serverModel)
gpuName = strings.TrimSpace(gpuName)
if gpuName == "" {
gpuName = "Unknown GPU"
}
gpuPart := fmt.Sprintf("%d× %s (All GPUs)", count, gpuName)
if serverModel == "" {
return gpuPart
}
return fmt.Sprintf("%s — %s", serverModel, gpuPart)
}
// ── Burn ────────────────────────────────────────────────────────────────────── // ── Burn ──────────────────────────────────────────────────────────────────────
@@ -2371,10 +2323,20 @@ func renderBurn() string {
<p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p> <p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p>
</div> </div>
<p id="burn-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA burn recipes.</p> <p id="burn-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA burn recipes.</p>
<label class="cb-row" style="margin-top:10px"> <div style="display:flex;flex-direction:column;gap:4px;margin-top:10px">
<input type="checkbox" id="burn-stagger-nvidia"> <label class="cb-row">
<span>Ramp selected NVIDIA GPUs one by one before full-load hold. Uses a 3-minute stabilization window per GPU, then keeps all selected GPUs under load for the chosen Burn Profile duration.</span> <input type="radio" name="burn-nvidia-mode" value="sequential" checked>
</label> <span>Sequential — selected GPUs one at a time</span>
</label>
<label class="cb-row" id="burn-parallel-label">
<input type="radio" name="burn-nvidia-mode" value="parallel">
<span>Parallel — all selected GPUs simultaneously</span>
</label>
<label class="cb-row" id="burn-ramp-label">
<input type="radio" name="burn-nvidia-mode" value="ramp-up">
<span>Ramp-up — add one GPU at a time</span>
</label>
</div>
</div> </div>
</div> </div>
@@ -2450,9 +2412,30 @@ function burnSelectedGPUIndices() {
.sort(function(a, b) { return a - b; }); .sort(function(a, b) { return a - b; });
} }
function burnUseNvidiaRampUp() { function burnNvidiaMode() {
const el = document.getElementById('burn-stagger-nvidia'); const el = document.querySelector('input[name="burn-nvidia-mode"]:checked');
return !!(el && el.checked); return el ? el.value : 'sequential';
}
function burnApplyMultiGPUState(gpuCount) {
var multiValues = ['parallel', 'ramp-up'];
var radios = document.querySelectorAll('input[name="burn-nvidia-mode"]');
radios.forEach(function(el) {
var isMulti = multiValues.indexOf(el.value) >= 0;
if (gpuCount < 2 && isMulti) {
el.disabled = true;
if (el.checked) {
var seq = document.querySelector('input[name="burn-nvidia-mode"][value="sequential"]');
if (seq) seq.checked = true;
}
var label = el.closest('label');
if (label) label.style.opacity = '0.4';
} else {
el.disabled = false;
var label = el.closest('label');
if (label) label.style.opacity = '';
}
});
} }
function burnUpdateSelectionNote() { function burnUpdateSelectionNote() {
@@ -2479,6 +2462,7 @@ function burnRenderGPUList(gpus) {
+ '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>' + '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>'
+ '</label>'; + '</label>';
}).join(''); }).join('');
burnApplyMultiGPUState(gpus.length);
burnUpdateSelectionNote(); burnUpdateSelectionNote();
} }
@@ -2514,8 +2498,11 @@ function enqueueBurnTask(target, label, extra, useSelectedNvidia) {
return Promise.reject(new Error('Select at least one NVIDIA GPU.')); return Promise.reject(new Error('Select at least one NVIDIA GPU.'));
} }
body.gpu_indices = selected; body.gpu_indices = selected;
if (burnUseNvidiaRampUp() && selected.length > 1) { const bMode = burnNvidiaMode();
if (bMode === 'ramp-up' && selected.length > 1) {
body.stagger_gpu_start = true; body.stagger_gpu_start = true;
} else if (bMode === 'parallel' && selected.length > 1) {
body.parallel_gpus = true;
} }
} }
return fetch('/api/sat/' + target + '/run', { return fetch('/api/sat/' + target + '/run', {
@@ -3108,7 +3095,6 @@ usbRefresh();
</script>` </script>`
} }
func renderNvidiaSelfHealInline() string { func renderNvidiaSelfHealInline() string {
return `<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Inspect NVIDIA GPU health, restart the bee-nvidia driver service, and issue a per-GPU reset when the driver reports reset required.</p> return `<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Inspect NVIDIA GPU health, restart the bee-nvidia driver service, and issue a per-GPU reset when the driver reports reset required.</p>
<div style="display:flex;gap:8px;flex-wrap:wrap;margin-bottom:12px"> <div style="display:flex;gap:8px;flex-wrap:wrap;margin-bottom:12px">

View File

@@ -693,8 +693,8 @@ func TestBenchmarkPageRendersSavedResultsTable(t *testing.T) {
for _, needle := range []string{ for _, needle := range []string{
`Benchmark Results`, `Benchmark Results`,
`Composite score by saved benchmark run and GPU.`, `Composite score by saved benchmark run and GPU.`,
`GPU #0 — NVIDIA H100 PCIe`, `GPU 0`,
`GPU #1 — NVIDIA H100 PCIe`, `GPU 1`,
`#1`, `#1`,
wantTime, wantTime,
`1176.25`, `1176.25`,

View File

@@ -126,6 +126,9 @@ type taskParams struct {
BenchmarkProfile string `json:"benchmark_profile,omitempty"` BenchmarkProfile string `json:"benchmark_profile,omitempty"`
RunNCCL bool `json:"run_nccl,omitempty"` RunNCCL bool `json:"run_nccl,omitempty"`
ParallelGPUs bool `json:"parallel_gpus,omitempty"` ParallelGPUs bool `json:"parallel_gpus,omitempty"`
RampStep int `json:"ramp_step,omitempty"`
RampTotal int `json:"ramp_total,omitempty"`
RampRunID string `json:"ramp_run_id,omitempty"`
DisplayName string `json:"display_name,omitempty"` DisplayName string `json:"display_name,omitempty"`
Device string `json:"device,omitempty"` // for install Device string `json:"device,omitempty"` // for install
PlatformComponents []string `json:"platform_components,omitempty"` PlatformComponents []string `json:"platform_components,omitempty"`
@@ -152,6 +155,12 @@ type burnPreset struct {
DurationSec int DurationSec int
} }
type nvidiaRampSpec struct {
DurationSec int
StaggerSeconds int
TotalDurationSec int
}
func resolveBurnPreset(profile string) burnPreset { func resolveBurnPreset(profile string) burnPreset {
switch profile { switch profile {
case "overnight": case "overnight":
@@ -163,11 +172,43 @@ func resolveBurnPreset(profile string) burnPreset {
} }
} }
func boolToNvidiaStaggerSeconds(enabled bool, selected []int) int { func resolveNvidiaRampPlan(profile string, enabled bool, selected []int) (nvidiaRampSpec, error) {
if enabled && len(selected) > 1 { base := resolveBurnPreset(profile).DurationSec
return 180 plan := nvidiaRampSpec{
DurationSec: base,
TotalDurationSec: base,
} }
return 0 if !enabled {
return plan, nil
}
count := len(selected)
if count == 0 {
return nvidiaRampSpec{}, fmt.Errorf("staggered NVIDIA burn requires explicit GPU selection")
}
if count == 1 {
return plan, nil
}
switch profile {
case "acceptance":
plan.StaggerSeconds = 10 * 60
plan.TotalDurationSec = plan.DurationSec + plan.StaggerSeconds*(count-1)
case "overnight":
plan.StaggerSeconds = 60 * 60
plan.TotalDurationSec = 8 * 60 * 60
minTotal := count * 60 * 60
if plan.TotalDurationSec < minTotal {
plan.TotalDurationSec = minTotal
}
if plan.TotalDurationSec > 10*60*60 {
return nvidiaRampSpec{}, fmt.Errorf("overnight staggered NVIDIA burn supports at most 10 GPUs")
}
plan.DurationSec = plan.TotalDurationSec - plan.StaggerSeconds*(count-1)
default:
plan.StaggerSeconds = 2 * 60
plan.TotalDurationSec = plan.DurationSec + plan.StaggerSeconds*(count-1)
}
return plan, nil
} }
func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions { func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions {
@@ -599,8 +640,11 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
ExcludeGPUIndices: t.params.ExcludeGPUIndices, ExcludeGPUIndices: t.params.ExcludeGPUIndices,
RunNCCL: t.params.RunNCCL, RunNCCL: t.params.RunNCCL,
ParallelGPUs: t.params.ParallelGPUs, ParallelGPUs: t.params.ParallelGPUs,
RampStep: t.params.RampStep,
RampTotal: t.params.RampTotal,
RampRunID: t.params.RampRunID,
}, j.append) }, j.append)
case "nvidia-compute": case "nvidia-compute":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
break break
@@ -609,11 +653,18 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
staggerSec := boolToNvidiaStaggerSeconds(t.params.StaggerGPUStart, t.params.GPUIndices) rampPlan, planErr := resolveNvidiaRampPlan(t.params.BurnProfile, t.params.StaggerGPUStart, t.params.GPUIndices)
if staggerSec > 0 { if planErr != nil {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU", staggerSec)) err = planErr
} break
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, staggerSec, j.append) }
if t.params.BurnProfile != "" && t.params.StaggerGPUStart && dur <= 0 {
dur = rampPlan.DurationSec
}
if rampPlan.StaggerSeconds > 0 {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU; post-ramp hold: %ds; total runtime: %ds", rampPlan.StaggerSeconds, dur, rampPlan.TotalDurationSec))
}
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, rampPlan.StaggerSeconds, j.append)
case "nvidia-targeted-power": case "nvidia-targeted-power":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
@@ -663,13 +714,24 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{ rampPlan, planErr := resolveNvidiaRampPlan(t.params.BurnProfile, t.params.StaggerGPUStart, t.params.GPUIndices)
DurationSec: dur, if planErr != nil {
Loader: t.params.Loader, err = planErr
GPUIndices: t.params.GPUIndices, break
ExcludeGPUIndices: t.params.ExcludeGPUIndices, }
StaggerSeconds: boolToNvidiaStaggerSeconds(t.params.StaggerGPUStart, t.params.GPUIndices), if t.params.BurnProfile != "" && t.params.StaggerGPUStart && dur <= 0 {
}, j.append) dur = rampPlan.DurationSec
}
if rampPlan.StaggerSeconds > 0 {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU; post-ramp hold: %ds; total runtime: %ds", rampPlan.StaggerSeconds, dur, rampPlan.TotalDurationSec))
}
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{
DurationSec: dur,
Loader: t.params.Loader,
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
StaggerSeconds: rampPlan.StaggerSeconds,
}, j.append)
case "memory": case "memory":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")

View File

@@ -422,7 +422,7 @@ func TestWriteTaskReportArtifactsIncludesBenchmarkResultsForTask(t *testing.T) {
for _, needle := range []string{ for _, needle := range []string{
`Benchmark Results`, `Benchmark Results`,
`Composite score for this benchmark task.`, `Composite score for this benchmark task.`,
`GPU #0 — NVIDIA H100 PCIe`, `GPU 0`,
`1176.25`, `1176.25`,
} { } {
if !strings.Contains(html, needle) { if !strings.Contains(html, needle) {
@@ -491,6 +491,83 @@ func TestResolveBurnPreset(t *testing.T) {
} }
} }
func TestResolveNvidiaRampPlan(t *testing.T) {
tests := []struct {
name string
profile string
enabled bool
selected []int
want nvidiaRampSpec
wantErr string
}{
{
name: "disabled uses base preset",
profile: "acceptance",
selected: []int{0, 1},
want: nvidiaRampSpec{DurationSec: 60 * 60, TotalDurationSec: 60 * 60},
},
{
name: "smoke ramp uses two minute steps",
profile: "smoke",
enabled: true,
selected: []int{0, 1, 2},
want: nvidiaRampSpec{DurationSec: 5 * 60, StaggerSeconds: 2 * 60, TotalDurationSec: 9 * 60},
},
{
name: "acceptance ramp uses ten minute steps",
profile: "acceptance",
enabled: true,
selected: []int{0, 1, 2},
want: nvidiaRampSpec{DurationSec: 60 * 60, StaggerSeconds: 10 * 60, TotalDurationSec: 80 * 60},
},
{
name: "overnight stays at eight hours when possible",
profile: "overnight",
enabled: true,
selected: []int{0, 1, 2},
want: nvidiaRampSpec{DurationSec: 6 * 60 * 60, StaggerSeconds: 60 * 60, TotalDurationSec: 8 * 60 * 60},
},
{
name: "overnight extends to keep one hour after final gpu",
profile: "overnight",
enabled: true,
selected: []int{0, 1, 2, 3, 4, 5, 6, 7, 8},
want: nvidiaRampSpec{DurationSec: 60 * 60, StaggerSeconds: 60 * 60, TotalDurationSec: 9 * 60 * 60},
},
{
name: "overnight rejects impossible gpu count",
profile: "overnight",
enabled: true,
selected: []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
wantErr: "at most 10 GPUs",
},
{
name: "enabled requires explicit selection",
profile: "smoke",
enabled: true,
wantErr: "requires explicit GPU selection",
},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
got, err := resolveNvidiaRampPlan(tc.profile, tc.enabled, tc.selected)
if tc.wantErr != "" {
if err == nil || !strings.Contains(err.Error(), tc.wantErr) {
t.Fatalf("err=%v want substring %q", err, tc.wantErr)
}
return
}
if err != nil {
t.Fatalf("resolveNvidiaRampPlan error: %v", err)
}
if got != tc.want {
t.Fatalf("resolveNvidiaRampPlan(%q, %t, %v)=%+v want %+v", tc.profile, tc.enabled, tc.selected, got, tc.want)
}
})
}
}
func TestTaskDisplayNameUsesNvidiaStressLoader(t *testing.T) { func TestTaskDisplayNameUsesNvidiaStressLoader(t *testing.T) {
tests := []struct { tests := []struct {
loader string loader string

View File

@@ -1121,6 +1121,7 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
int cc_minor, int cc_minor,
int seconds, int seconds,
int size_mb, int size_mb,
const char *precision_filter,
struct stress_report *report) { struct stress_report *report) {
struct cublaslt_api cublas; struct cublaslt_api cublas;
struct prepared_profile prepared[MAX_STRESS_STREAMS * MAX_CUBLAS_PROFILES]; struct prepared_profile prepared[MAX_STRESS_STREAMS * MAX_CUBLAS_PROFILES];
@@ -1159,7 +1160,8 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
} }
for (size_t i = 0; i < sizeof(k_profiles) / sizeof(k_profiles[0]); i++) { for (size_t i = 0; i < sizeof(k_profiles) / sizeof(k_profiles[0]); i++) {
if (k_profiles[i].enabled && cc >= k_profiles[i].min_cc) { if (k_profiles[i].enabled && cc >= k_profiles[i].min_cc &&
(precision_filter == NULL || strcmp(k_profiles[i].block_label, precision_filter) == 0)) {
planned++; planned++;
} }
} }
@@ -1218,6 +1220,13 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
desc->min_cc); desc->min_cc);
continue; continue;
} }
if (precision_filter != NULL && strcmp(desc->block_label, precision_filter) != 0) {
append_detail(report->details,
sizeof(report->details),
"%s=SKIPPED precision_filter\n",
desc->name);
continue;
}
for (int lane = 0; lane < stream_count; lane++) { for (int lane = 0; lane < stream_count; lane++) {
CUstream stream = streams[lane]; CUstream stream = streams[lane];
if (prepared_count >= (int)(sizeof(prepared) / sizeof(prepared[0]))) { if (prepared_count >= (int)(sizeof(prepared) / sizeof(prepared[0]))) {
@@ -1339,6 +1348,7 @@ int main(int argc, char **argv) {
int seconds = 5; int seconds = 5;
int size_mb = 64; int size_mb = 64;
int device_index = 0; int device_index = 0;
const char *precision_filter = NULL; /* NULL = all; else block_label to match */
for (int i = 1; i < argc; i++) { for (int i = 1; i < argc; i++) {
if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) { if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) {
seconds = atoi(argv[++i]); seconds = atoi(argv[++i]);
@@ -1346,8 +1356,12 @@ int main(int argc, char **argv) {
size_mb = atoi(argv[++i]); size_mb = atoi(argv[++i]);
} else if ((strcmp(argv[i], "--device") == 0 || strcmp(argv[i], "-d") == 0) && i + 1 < argc) { } else if ((strcmp(argv[i], "--device") == 0 || strcmp(argv[i], "-d") == 0) && i + 1 < argc) {
device_index = atoi(argv[++i]); device_index = atoi(argv[++i]);
} else if (strcmp(argv[i], "--precision") == 0 && i + 1 < argc) {
precision_filter = argv[++i];
} else { } else {
fprintf(stderr, "usage: %s [--seconds N] [--size-mb N] [--device N]\n", argv[0]); fprintf(stderr,
"usage: %s [--seconds N] [--size-mb N] [--device N] [--precision fp8|fp16|fp32|fp64|fp4]\n",
argv[0]);
return 2; return 2;
} }
} }
@@ -1407,7 +1421,7 @@ int main(int argc, char **argv) {
int ok = 0; int ok = 0;
#if HAVE_CUBLASLT_HEADERS #if HAVE_CUBLASLT_HEADERS
ok = run_cublaslt_stress(&cuda, dev, name, cc_major, cc_minor, seconds, size_mb, &report); ok = run_cublaslt_stress(&cuda, dev, name, cc_major, cc_minor, seconds, size_mb, precision_filter, &report);
#endif #endif
if (!ok) { if (!ok) {
if (!run_ptx_fallback(&cuda, dev, name, cc_major, cc_minor, seconds, size_mb, &report)) { if (!run_ptx_fallback(&cuda, dev, name, cc_major, cc_minor, seconds, size_mb, &report)) {

View File

@@ -6,10 +6,11 @@ STAGGER_SECONDS=0
SIZE_MB=0 SIZE_MB=0
DEVICES="" DEVICES=""
EXCLUDE="" EXCLUDE=""
PRECISION=""
WORKER="/usr/local/lib/bee/bee-gpu-burn-worker" WORKER="/usr/local/lib/bee/bee-gpu-burn-worker"
usage() { usage() {
echo "usage: $0 [--seconds N] [--stagger-seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3]" >&2 echo "usage: $0 [--seconds N] [--stagger-seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3] [--precision fp8|fp16|fp32|fp64|fp4]" >&2
exit 2 exit 2
} }
@@ -30,6 +31,7 @@ while [ "$#" -gt 0 ]; do
--size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;; --size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;; --devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;; --exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
--precision) [ "$#" -ge 2 ] || usage; PRECISION="$2"; shift 2 ;;
*) usage ;; *) usage ;;
esac esac
done done
@@ -88,8 +90,10 @@ for id in $(echo "${FINAL}" | tr ',' ' '); do
extra_sec=$(( STAGGER_SECONDS * (GPU_COUNT - gpu_pos) )) extra_sec=$(( STAGGER_SECONDS * (GPU_COUNT - gpu_pos) ))
gpu_seconds=$(( SECONDS + extra_sec )) gpu_seconds=$(( SECONDS + extra_sec ))
echo "starting gpu ${id} size=${gpu_size_mb}MB seconds=${gpu_seconds}" echo "starting gpu ${id} size=${gpu_size_mb}MB seconds=${gpu_seconds}"
precision_arg=""
[ -n "${PRECISION}" ] && precision_arg="--precision ${PRECISION}"
CUDA_VISIBLE_DEVICES="${id}" \ CUDA_VISIBLE_DEVICES="${id}" \
"${WORKER}" --device 0 --seconds "${gpu_seconds}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 & "${WORKER}" --device 0 --seconds "${gpu_seconds}" --size-mb "${gpu_size_mb}" ${precision_arg} >"${log}" 2>&1 &
pid=$! pid=$!
WORKERS="${WORKERS} ${pid}:${id}:${log}" WORKERS="${WORKERS} ${pid}:${id}:${log}"
if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${gpu_pos}" -lt "${GPU_COUNT}" ]; then if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${gpu_pos}" -lt "${GPU_COUNT}" ]; then