Compare commits

...

14 Commits
v7.8 ... v7.17

Author SHA1 Message Date
bf182daa89 Fix benchmark report methodology and rebuild gpu burn worker on toolchain changes 2026-04-13 23:43:12 +03:00
457ea1cf04 Unify benchmark exports and drop ASCII charts 2026-04-13 21:38:28 +03:00
bf6ecab4f0 Add per-precision benchmark phases, weighted TOPS scoring, and ECC tracking
- Split steady window into 6 equal slots: fp8/fp16/fp32/fp64/fp4 + combined
- Each precision phase runs bee-gpu-burn with --precision filter so PowerCVPct reflects single-kernel stability (not round-robin artifact)
- Add fp4 support in bee-gpu-stress.c for Blackwell (cc>=100) via existing CUDA_R_4F_E2M1 guard
- Weighted TOPS: fp64×2.0, fp32×1.0, fp16×0.5, fp8×0.25, fp4×0.125
- SyntheticScore = sum of weighted TOPS from per-precision phases
- MixedScore = sum from combined phase; MixedEfficiency = Mixed/Synthetic
- ComputeScore = SyntheticScore × (1 + MixedEfficiency × 0.3)
- ECC volatile counters sampled before/after each phase and overall
- DegradationReasons: ecc_uncorrected_errors, ecc_corrected_errors
- Report: per-precision stability table with ECC columns, methodology section
- Ramp-up history table redesign: GPU indices as columns, runs as rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 10:49:49 +03:00
02e44b1172 Fix USB/RAM status checks; add server model+S/N to dashboard; remove cycles
USB Export Drive:
  lsblk reports TRAN only for whole disks, not partitions (/dev/sdc1).
  Strip trailing partition digits to get parent disk before transport check.

LiveCD in RAM:
  When RunInstallToRAM copies squashfs to /dev/shm/bee-live/ but bind-mount
  of /run/live/medium fails (CD-ROM boots), /run/live/medium still shows the
  CD-ROM fstype. Add fallback: if /dev/shm/bee-live/*.squashfs exists, the
  data is in RAM — report status OK.

Dashboard Hardware Summary:
  Show server Manufacturer + ProductName as heading and S/N as subline above
  the component table, sourced from hw.Board (dmidecode system-type data).

Validate:
  Remove Cycles input — always run once. cycles=1 hardcoded in runAllSAT().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:46:42 +03:00
2ceaa0d0ca Include profile and mode in benchmark task names for task list clarity
Task names now follow the pattern:
  NVIDIA Benchmark · <profile> · <mode> [· GPU <indices>]

Examples:
  NVIDIA Benchmark · standard · sequential (GPU 0, RTX 6000 Pro)
  NVIDIA Benchmark · stability · parallel
  NVIDIA Benchmark · standard · ramp 1/4 · GPU 0
  NVIDIA Benchmark · standard · ramp 2/4 · GPU 0,1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:36:51 +03:00
9482ba20a2 Remove NCCL checkbox — auto-enable interconnect step when >1 GPU selected
NCCL all_reduce is always attempted when 2+ GPUs are selected; a failure
leaves InterconnectScore=0 (no bonus, no penalty) and OverallStatus
unaffected. Exposing the checkbox implied NCCL is optional and made a
failed run look like a deliberate skip.

- Remove benchmark-run-nccl checkbox and its change listener from pages.go
- Client sends run_nccl: selected.length > 1 (automatic)
- api.go default runNCCL=true is unchanged
- Selection note now mentions NCCL automatically for multi-GPU runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:33:17 +03:00
813e2f86a9 Add scalability/ramp-up labeling, ServerPower penalty in scoring, and report improvements
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
  NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
  by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
  tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
  factor = ratio/0.75, applied per-GPU with a note explaining the reduction
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
  shows scalability_score when non-zero

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:30:47 +03:00
58a6da9b44 Recover power limits and SM count from nvidia-smi -q in enrichGPUInfo
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
  - Default Power Limit  → DefaultPowerLimitW
  - Current Power Limit  → PowerLimitW (fallback)
  - Multiprocessor Count → MultiprocessorCount

Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:17:56 +03:00
f4a19c0a00 Add power calibration step to benchmark; fix PowerSustainScore reference
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.

Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.

Adjust composite score weights to match the new honest power reference:
  base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:06:46 +03:00
9e3dcf9b4d Record host CPU/RAM config in benchmark results; check CPU load
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
  total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
  the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
  erratic (stddev >12%), with a plain-English description in the report.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:02:04 +03:00
098e19f760 Add ramp-up mode to NVIDIA GPU benchmark
Adds a new checkbox (enabled by default) in the benchmark section.
In ramp-up mode N tasks are spawned simultaneously: 1 GPU, then 2,
then 3, up to all selected GPUs — each step runs its GPUs in parallel.
NCCL runs only on the final step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:34:19 +03:00
e16d0f34b5 Adjust burn GPU ramp timing by profile 2026-04-12 15:58:30 +03:00
Mikhail Chusavitin
525ed8b8fc Fix GPU clock lock normalization for Blackwell (clocks.max.* unsupported)
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire gpu inventory query
to fail and clock lock to be skipped → normalization: partial.

Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
  power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
  nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
  MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
  is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:33:54 +03:00
Mikhail Chusavitin
4f94ebcb2c Add HPC tuning: PCIe ASPM off, C-states, performance CPU governor
- grub.cfg + isolinux/live.cfg.in: add pcie_aspm=off,
  intel_idle.max_cstate=1 and processor.max_cstate=1 to all
  non-failsafe boot entries
- bee-hpc-tuning: new script that sets all CPU cores to performance
  governor via sysfs and logs THP state at boot
- bee-hpc-tuning.service: runs before bee-nvidia and bee-audit
- 9000-bee-setup.hook.chroot: enable service and mark script executable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:07:32 +03:00
22 changed files with 1861 additions and 743 deletions

File diff suppressed because it is too large Load Diff

View File

@@ -2,25 +2,15 @@ package platform
import ( import (
"fmt" "fmt"
"os"
"path/filepath"
"regexp"
"strings" "strings"
"time" "time"
) )
func renderBenchmarkReport(result NvidiaBenchmarkResult) string { func renderBenchmarkReport(result NvidiaBenchmarkResult) string {
return renderBenchmarkReportWithCharts(result, nil) return renderBenchmarkReportWithCharts(result)
} }
type benchmarkReportChart struct { func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult) string {
Title string
Content string
}
var ansiEscapePattern = regexp.MustCompile(`\x1b\[[0-9;]*m`)
func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benchmarkReportChart) string {
var b strings.Builder var b strings.Builder
// ── Header ──────────────────────────────────────────────────────────────── // ── Header ────────────────────────────────────────────────────────────────
@@ -60,9 +50,17 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
fmt.Fprintf(&b, "**Profile:** %s \n", result.BenchmarkProfile) fmt.Fprintf(&b, "**Profile:** %s \n", result.BenchmarkProfile)
fmt.Fprintf(&b, "**App version:** %s \n", result.BenchmarkVersion) fmt.Fprintf(&b, "**App version:** %s \n", result.BenchmarkVersion)
fmt.Fprintf(&b, "**Generated:** %s \n", result.GeneratedAt.Format("2006-01-02 15:04:05 UTC")) fmt.Fprintf(&b, "**Generated:** %s \n", result.GeneratedAt.Format("2006-01-02 15:04:05 UTC"))
if result.ParallelGPUs { if result.RampStep > 0 && result.RampTotal > 0 {
fmt.Fprintf(&b, "**Ramp-up step:** %d of %d \n", result.RampStep, result.RampTotal)
if result.RampRunID != "" {
fmt.Fprintf(&b, "**Ramp-up run ID:** %s \n", result.RampRunID)
}
} else if result.ParallelGPUs {
fmt.Fprintf(&b, "**Mode:** parallel (all GPUs simultaneously) \n") fmt.Fprintf(&b, "**Mode:** parallel (all GPUs simultaneously) \n")
} }
if result.ScalabilityScore > 0 {
fmt.Fprintf(&b, "**Scalability score:** %.1f%% \n", result.ScalabilityScore)
}
fmt.Fprintf(&b, "**Overall status:** %s \n", result.OverallStatus) fmt.Fprintf(&b, "**Overall status:** %s \n", result.OverallStatus)
b.WriteString("\n") b.WriteString("\n")
@@ -83,10 +81,28 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
b.WriteString("\n") b.WriteString("\n")
} }
// ── Methodology ───────────────────────────────────────────────────────────
b.WriteString("## Methodology\n\n")
fmt.Fprintf(&b, "- Profile `%s` uses standardized baseline -> warmup -> steady-state -> interconnect -> cooldown phases.\n", result.BenchmarkProfile)
b.WriteString("- Single-GPU compute score comes from `bee-gpu-burn` on the cuBLASLt path when available.\n")
b.WriteString("- Thermal and power limits are inferred from NVIDIA clock-event counters plus sustained telemetry.\n")
b.WriteString("- `result.json` is the canonical machine-readable source for the run.\n\n")
b.WriteString("**Compute score** is derived from two phases:\n\n")
b.WriteString("- **Synthetic** — each precision type (fp8, fp16, fp32, fp64, fp4) runs alone for a dedicated window. ")
b.WriteString("Measures peak throughput with the full GPU dedicated to one kernel type. ")
b.WriteString("Each result is normalised to fp32-equivalent TOPS using precision weights: ")
b.WriteString("fp64 ×2.0 · fp32 ×1.0 · fp16 ×0.5 · fp8 ×0.25 · fp4 ×0.125.\n")
b.WriteString("- **Mixed** — all precision types run simultaneously (combined phase). ")
b.WriteString("Reflects real inference workloads where fp8 matrix ops, fp16 attention and fp32 accumulation compete for bandwidth and SM scheduler slots.\n\n")
b.WriteString("**Formula:** `Compute = Synthetic × (1 + MixedEfficiency × 0.3)`\n\n")
b.WriteString("where `MixedEfficiency = Mixed / Synthetic`. A GPU that sustains 90 % throughput under mixed load ")
b.WriteString("receives a +27 % bonus over its synthetic score; one that drops to 60 % receives +18 %.\n\n")
b.WriteString("**Composite score** = `Compute × quality_factor` where quality factors in power sustain, thermal sustain, stability, and interconnect.\n\n")
// ── Scorecard table ─────────────────────────────────────────────────────── // ── Scorecard table ───────────────────────────────────────────────────────
b.WriteString("## Scorecard\n\n") b.WriteString("## Scorecard\n\n")
b.WriteString("| GPU | Status | Composite | Compute | TOPS/SM/GHz | Power Sustain | Thermal Sustain | Stability | Interconnect |\n") b.WriteString("| GPU | Status | Composite | Compute | Synthetic | Mixed | Mixed Eff. | TOPS/SM/GHz | Power Sustain | Thermal Sustain | Stability | Interconnect |\n")
b.WriteString("|-----|--------|-----------|---------|-------------|---------------|-----------------|-----------|-------------|\n") b.WriteString("|-----|--------|-----------|---------|-----------|-------|------------|-------------|---------------|-----------------|-----------|-------------|\n")
for _, gpu := range result.GPUs { for _, gpu := range result.GPUs {
name := strings.TrimSpace(gpu.Name) name := strings.TrimSpace(gpu.Name)
if name == "" { if name == "" {
@@ -100,11 +116,26 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
if gpu.Scores.TOPSPerSMPerGHz > 0 { if gpu.Scores.TOPSPerSMPerGHz > 0 {
topsPerSM = fmt.Sprintf("%.3f", gpu.Scores.TOPSPerSMPerGHz) topsPerSM = fmt.Sprintf("%.3f", gpu.Scores.TOPSPerSMPerGHz)
} }
fmt.Fprintf(&b, "| GPU %d %s | %s | **%.2f** | %.2f | %s | %.1f | %.1f | %.1f | %s |\n", synthetic := "-"
if gpu.Scores.SyntheticScore > 0 {
synthetic = fmt.Sprintf("%.2f", gpu.Scores.SyntheticScore)
}
mixed := "-"
if gpu.Scores.MixedScore > 0 {
mixed = fmt.Sprintf("%.2f", gpu.Scores.MixedScore)
}
mixedEff := "-"
if gpu.Scores.MixedEfficiency > 0 {
mixedEff = fmt.Sprintf("%.1f%%", gpu.Scores.MixedEfficiency*100)
}
fmt.Fprintf(&b, "| GPU %d %s | %s | **%.2f** | %.2f | %s | %s | %s | %s | %.1f | %.1f | %.1f | %s |\n",
gpu.Index, name, gpu.Index, name,
gpu.Status, gpu.Status,
gpu.Scores.CompositeScore, gpu.Scores.CompositeScore,
gpu.Scores.ComputeScore, gpu.Scores.ComputeScore,
synthetic,
mixed,
mixedEff,
topsPerSM, topsPerSM,
gpu.Scores.PowerSustainScore, gpu.Scores.PowerSustainScore,
gpu.Scores.ThermalSustainScore, gpu.Scores.ThermalSustainScore,
@@ -154,6 +185,34 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
fmt.Fprintf(&b, "| GPU utilisation | %.1f %% | — |\n", gpu.Steady.AvgUsagePct) fmt.Fprintf(&b, "| GPU utilisation | %.1f %% | — |\n", gpu.Steady.AvgUsagePct)
b.WriteString("\n") b.WriteString("\n")
// Per-precision stability phases.
if len(gpu.PrecisionSteady) > 0 {
b.WriteString("**Per-precision stability:**\n\n")
b.WriteString("| Precision | Clock CV | Power CV | Clock Drift | ECC corr | ECC uncorr |\n|-----------|----------|----------|-------------|----------|------------|\n")
for _, p := range gpu.PrecisionSteady {
eccCorr := "—"
eccUncorr := "—"
if !p.ECC.IsZero() {
eccCorr = fmt.Sprintf("%d", p.ECC.Corrected)
eccUncorr = fmt.Sprintf("%d", p.ECC.Uncorrected)
}
fmt.Fprintf(&b, "| %s | %.1f%% | %.1f%% | %.1f%% | %s | %s |\n",
p.Precision, p.Steady.ClockCVPct, p.Steady.PowerCVPct, p.Steady.ClockDriftPct,
eccCorr, eccUncorr)
}
b.WriteString("\n")
} else {
// Legacy: show combined-window variance.
fmt.Fprintf(&b, "**Clock/power variance (combined window):** clock CV %.1f%% · power CV %.1f%% · clock drift %.1f%%\n\n",
gpu.Steady.ClockCVPct, gpu.Steady.PowerCVPct, gpu.Steady.ClockDriftPct)
}
// ECC summary
if !gpu.ECC.IsZero() {
fmt.Fprintf(&b, "**ECC errors (total):** corrected=%d uncorrected=%d\n\n",
gpu.ECC.Corrected, gpu.ECC.Uncorrected)
}
// Throttle // Throttle
throttle := formatThrottleLine(gpu.Throttle, gpu.Steady.DurationSec) throttle := formatThrottleLine(gpu.Throttle, gpu.Steady.DurationSec)
if throttle != "none" { if throttle != "none" {
@@ -163,12 +222,14 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
// Precision results // Precision results
if len(gpu.PrecisionResults) > 0 { if len(gpu.PrecisionResults) > 0 {
b.WriteString("**Precision results:**\n\n") b.WriteString("**Precision results:**\n\n")
b.WriteString("| Precision | TOPS | Lanes | Iterations |\n|-----------|------|-------|------------|\n") b.WriteString("| Precision | TOPS (raw) | Weight | TOPS (fp32-eq) | Lanes | Iterations |\n|-----------|------------|--------|----------------|-------|------------|\n")
for _, p := range gpu.PrecisionResults { for _, p := range gpu.PrecisionResults {
if p.Supported { if p.Supported {
fmt.Fprintf(&b, "| %s | %.2f | %d | %d |\n", p.Name, p.TeraOpsPerSec, p.Lanes, p.Iterations) weightStr := fmt.Sprintf("×%.3g", p.Weight)
fmt.Fprintf(&b, "| %s | %.2f | %s | %.2f | %d | %d |\n",
p.Name, p.TeraOpsPerSec, weightStr, p.WeightedTeraOpsPerSec, p.Lanes, p.Iterations)
} else { } else {
fmt.Fprintf(&b, "| %s | — (unsupported) | — | — |\n", p.Name) fmt.Fprintf(&b, "| %s | — (unsupported) | — | — | — | — |\n", p.Name)
} }
} }
b.WriteString("\n") b.WriteString("\n")
@@ -229,61 +290,16 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
} }
} }
// ── Terminal charts (steady-state only) ───────────────────────────────────
if len(charts) > 0 {
b.WriteString("## Steady-State Charts\n\n")
for _, chart := range charts {
content := strings.TrimSpace(stripANSIEscapeSequences(chart.Content))
if content == "" {
continue
}
fmt.Fprintf(&b, "### %s\n\n```\n%s\n```\n\n", chart.Title, content)
}
}
// ── Methodology ───────────────────────────────────────────────────────────
b.WriteString("## Methodology\n\n")
fmt.Fprintf(&b, "- Profile `%s` uses standardized baseline → warmup → steady-state → interconnect → cooldown phases.\n", result.BenchmarkProfile)
b.WriteString("- Single-GPU compute score from bee-gpu-burn cuBLASLt when available.\n")
b.WriteString("- Thermal and power limitations inferred from NVIDIA clock event reason counters and sustained telemetry.\n")
b.WriteString("- `result.json` is the canonical machine-readable source for this benchmark run.\n\n")
// ── Raw files ───────────────────────────────────────────────────────────── // ── Raw files ─────────────────────────────────────────────────────────────
b.WriteString("## Raw Files\n\n") b.WriteString("## Raw Files\n\n")
b.WriteString("- `result.json`\n- `report.md`\n- `summary.txt`\n- `verbose.log`\n") b.WriteString("- `result.json`\n- `report.md`\n- `summary.txt`\n- `verbose.log`\n")
b.WriteString("- `gpu-*-baseline-metrics.csv/html/term.txt`\n") b.WriteString("- `gpu-metrics.csv`\n- `gpu-metrics.html`\n- `gpu-burn.log`\n")
b.WriteString("- `gpu-*-warmup.log`\n")
b.WriteString("- `gpu-*-steady.log`\n")
b.WriteString("- `gpu-*-steady-metrics.csv/html/term.txt`\n")
b.WriteString("- `gpu-*-cooldown-metrics.csv/html/term.txt`\n")
if result.Interconnect != nil { if result.Interconnect != nil {
b.WriteString("- `nccl-all-reduce.log`\n") b.WriteString("- `nccl-all-reduce.log`\n")
} }
return b.String() return b.String()
} }
// loadBenchmarkReportCharts loads only steady-state terminal charts (baseline and
// cooldown charts are not useful for human review).
func loadBenchmarkReportCharts(runDir string, gpuIndices []int) []benchmarkReportChart {
var charts []benchmarkReportChart
for _, idx := range gpuIndices {
path := filepath.Join(runDir, fmt.Sprintf("gpu-%d-steady-metrics-term.txt", idx))
raw, err := os.ReadFile(path)
if err != nil || len(raw) == 0 {
continue
}
charts = append(charts, benchmarkReportChart{
Title: fmt.Sprintf("GPU %d — Steady State", idx),
Content: string(raw),
})
}
return charts
}
func stripANSIEscapeSequences(raw string) string {
return ansiEscapePattern.ReplaceAllString(raw, "")
}
// formatThrottleLine renders throttle counters as human-readable percentages of // formatThrottleLine renders throttle counters as human-readable percentages of
// the steady-state window. Only non-zero counters are shown. When the steady // the steady-state window. Only non-zero counters are shown. When the steady
// duration is unknown (0), raw seconds are shown instead. // duration is unknown (0), raw seconds are shown instead.

View File

@@ -147,34 +147,89 @@ func TestRenderBenchmarkReportIncludesFindingsAndScores(t *testing.T) {
} }
} }
func TestRenderBenchmarkReportIncludesTerminalChartsWithoutANSI(t *testing.T) { func TestRenderBenchmarkReportListsUnifiedArtifacts(t *testing.T) {
t.Parallel() t.Parallel()
report := renderBenchmarkReportWithCharts(NvidiaBenchmarkResult{ report := renderBenchmarkReport(NvidiaBenchmarkResult{
BenchmarkProfile: NvidiaBenchmarkProfileStandard, BenchmarkProfile: NvidiaBenchmarkProfileStandard,
OverallStatus: "OK", OverallStatus: "OK",
SelectedGPUIndices: []int{0}, SelectedGPUIndices: []int{0},
Normalization: BenchmarkNormalization{ Normalization: BenchmarkNormalization{
Status: "full", Status: "full",
}, },
}, []benchmarkReportChart{
{
Title: "GPU 0 Steady State",
Content: "\x1b[31mGPU 0 chart\x1b[0m\n 42┤───",
},
}) })
for _, needle := range []string{ for _, needle := range []string{
"Steady-State Charts", "gpu-metrics.csv",
"GPU 0 Steady State", "gpu-metrics.html",
"GPU 0 chart", "gpu-burn.log",
"42┤───",
} { } {
if !strings.Contains(report, needle) { if !strings.Contains(report, needle) {
t.Fatalf("report missing %q\n%s", needle, report) t.Fatalf("report missing %q\n%s", needle, report)
} }
} }
if strings.Contains(report, "\x1b[31m") { }
t.Fatalf("report should not contain ANSI escapes\n%s", report)
func TestEnrichGPUInfoWithMaxClocks(t *testing.T) {
t.Parallel()
nvsmiQ := []byte(`
GPU 00000000:4E:00.0
Product Name : NVIDIA RTX PRO 6000 Blackwell Server Edition
Clocks
Graphics : 2422 MHz
Memory : 12481 MHz
Max Clocks
Graphics : 2430 MHz
SM : 2430 MHz
Memory : 12481 MHz
Video : 2107 MHz
GPU 00000000:4F:00.0
Product Name : NVIDIA RTX PRO 6000 Blackwell Server Edition
Max Clocks
Graphics : 2430 MHz
Memory : 12481 MHz
`)
infoByIndex := map[int]benchmarkGPUInfo{
0: {Index: 0, BusID: "00000000:4E:00.0"},
1: {Index: 1, BusID: "00000000:4F:00.0"},
}
enrichGPUInfoWithMaxClocks(infoByIndex, nvsmiQ)
if infoByIndex[0].MaxGraphicsClockMHz != 2430 {
t.Errorf("GPU 0 MaxGraphicsClockMHz = %v, want 2430", infoByIndex[0].MaxGraphicsClockMHz)
}
if infoByIndex[0].MaxMemoryClockMHz != 12481 {
t.Errorf("GPU 0 MaxMemoryClockMHz = %v, want 12481", infoByIndex[0].MaxMemoryClockMHz)
}
if infoByIndex[1].MaxGraphicsClockMHz != 2430 {
t.Errorf("GPU 1 MaxGraphicsClockMHz = %v, want 2430", infoByIndex[1].MaxGraphicsClockMHz)
}
if infoByIndex[1].MaxMemoryClockMHz != 12481 {
t.Errorf("GPU 1 MaxMemoryClockMHz = %v, want 12481", infoByIndex[1].MaxMemoryClockMHz)
}
}
func TestEnrichGPUInfoWithMaxClocksSkipsPopulated(t *testing.T) {
t.Parallel()
nvsmiQ := []byte(`
GPU 00000000:4E:00.0
Max Clocks
Graphics : 9999 MHz
Memory : 9999 MHz
`)
// Already populated — must not be overwritten.
infoByIndex := map[int]benchmarkGPUInfo{
0: {Index: 0, BusID: "00000000:4E:00.0", MaxGraphicsClockMHz: 2430, MaxMemoryClockMHz: 12481},
}
enrichGPUInfoWithMaxClocks(infoByIndex, nvsmiQ)
if infoByIndex[0].MaxGraphicsClockMHz != 2430 {
t.Errorf("expected existing value to be preserved, got %v", infoByIndex[0].MaxGraphicsClockMHz)
} }
} }

View File

@@ -2,6 +2,29 @@ package platform
import "time" import "time"
// BenchmarkHostConfig holds static CPU and memory configuration captured at
// benchmark start. Useful for correlating results across runs on different hardware.
type BenchmarkHostConfig struct {
CPUModel string `json:"cpu_model,omitempty"`
CPUSockets int `json:"cpu_sockets,omitempty"`
CPUCores int `json:"cpu_cores,omitempty"`
CPUThreads int `json:"cpu_threads,omitempty"`
MemTotalGiB float64 `json:"mem_total_gib,omitempty"`
}
// BenchmarkCPULoad summarises host CPU utilisation sampled during the GPU
// steady-state phase. High or unstable CPU load during a GPU benchmark may
// indicate a competing workload or a CPU-bound driver bottleneck.
type BenchmarkCPULoad struct {
AvgPct float64 `json:"avg_pct"`
MaxPct float64 `json:"max_pct"`
P95Pct float64 `json:"p95_pct"`
Samples int `json:"samples"`
// Status is "ok", "high", or "unstable".
Status string `json:"status"`
Note string `json:"note,omitempty"`
}
const ( const (
NvidiaBenchmarkProfileStandard = "standard" NvidiaBenchmarkProfileStandard = "standard"
NvidiaBenchmarkProfileStability = "stability" NvidiaBenchmarkProfileStability = "stability"
@@ -14,10 +37,12 @@ type NvidiaBenchmarkOptions struct {
GPUIndices []int GPUIndices []int
ExcludeGPUIndices []int ExcludeGPUIndices []int
RunNCCL bool RunNCCL bool
ParallelGPUs bool // run all selected GPUs simultaneously instead of sequentially ParallelGPUs bool // run all selected GPUs simultaneously instead of sequentially
RampStep int // 1-based step index within a ramp-up run (0 = not a ramp-up)
RampTotal int // total number of ramp-up steps in this run
RampRunID string // shared identifier across all steps of the same ramp-up run
} }
type NvidiaBenchmarkResult struct { type NvidiaBenchmarkResult struct {
BenchmarkVersion string `json:"benchmark_version"` BenchmarkVersion string `json:"benchmark_version"`
GeneratedAt time.Time `json:"generated_at"` GeneratedAt time.Time `json:"generated_at"`
@@ -25,11 +50,17 @@ type NvidiaBenchmarkResult struct {
ServerModel string `json:"server_model,omitempty"` ServerModel string `json:"server_model,omitempty"`
BenchmarkProfile string `json:"benchmark_profile"` BenchmarkProfile string `json:"benchmark_profile"`
ParallelGPUs bool `json:"parallel_gpus,omitempty"` ParallelGPUs bool `json:"parallel_gpus,omitempty"`
RampStep int `json:"ramp_step,omitempty"`
RampTotal int `json:"ramp_total,omitempty"`
RampRunID string `json:"ramp_run_id,omitempty"`
ScalabilityScore float64 `json:"scalability_score,omitempty"`
OverallStatus string `json:"overall_status"` OverallStatus string `json:"overall_status"`
SelectedGPUIndices []int `json:"selected_gpu_indices"` SelectedGPUIndices []int `json:"selected_gpu_indices"`
Findings []string `json:"findings,omitempty"` Findings []string `json:"findings,omitempty"`
Warnings []string `json:"warnings,omitempty"` Warnings []string `json:"warnings,omitempty"`
Normalization BenchmarkNormalization `json:"normalization"` Normalization BenchmarkNormalization `json:"normalization"`
HostConfig *BenchmarkHostConfig `json:"host_config,omitempty"`
CPULoad *BenchmarkCPULoad `json:"cpu_load,omitempty"`
GPUs []BenchmarkGPUResult `json:"gpus"` GPUs []BenchmarkGPUResult `json:"gpus"`
Interconnect *BenchmarkInterconnectResult `json:"interconnect,omitempty"` Interconnect *BenchmarkInterconnectResult `json:"interconnect,omitempty"`
ServerPower *BenchmarkServerPower `json:"server_power,omitempty"` ServerPower *BenchmarkServerPower `json:"server_power,omitempty"`
@@ -52,30 +83,38 @@ type BenchmarkNormalizationGPU struct {
} }
type BenchmarkGPUResult struct { type BenchmarkGPUResult struct {
Index int `json:"index"` Index int `json:"index"`
UUID string `json:"uuid,omitempty"` UUID string `json:"uuid,omitempty"`
Name string `json:"name,omitempty"` Name string `json:"name,omitempty"`
BusID string `json:"bus_id,omitempty"` BusID string `json:"bus_id,omitempty"`
VBIOS string `json:"vbios,omitempty"` VBIOS string `json:"vbios,omitempty"`
ComputeCapability string `json:"compute_capability,omitempty"` ComputeCapability string `json:"compute_capability,omitempty"`
Backend string `json:"backend,omitempty"` Backend string `json:"backend,omitempty"`
Status string `json:"status"` Status string `json:"status"`
PowerLimitW float64 `json:"power_limit_w,omitempty"` PowerLimitW float64 `json:"power_limit_w,omitempty"`
MultiprocessorCount int `json:"multiprocessor_count,omitempty"` MultiprocessorCount int `json:"multiprocessor_count,omitempty"`
DefaultPowerLimitW float64 `json:"default_power_limit_w,omitempty"` DefaultPowerLimitW float64 `json:"default_power_limit_w,omitempty"`
MaxGraphicsClockMHz float64 `json:"max_graphics_clock_mhz,omitempty"` // CalibratedPeakPowerW is the p95 power measured during a short
BaseGraphicsClockMHz float64 `json:"base_graphics_clock_mhz,omitempty"` // dcgmi targeted_power calibration run before the main benchmark.
MaxMemoryClockMHz float64 `json:"max_memory_clock_mhz,omitempty"` // Used as the reference denominator for PowerSustainScore instead of
LockedGraphicsClockMHz float64 `json:"locked_graphics_clock_mhz,omitempty"` // the hardware default limit, which bee-gpu-burn cannot reach.
LockedMemoryClockMHz float64 `json:"locked_memory_clock_mhz,omitempty"` CalibratedPeakPowerW float64 `json:"calibrated_peak_power_w,omitempty"`
Baseline BenchmarkTelemetrySummary `json:"baseline"` MaxGraphicsClockMHz float64 `json:"max_graphics_clock_mhz,omitempty"`
Steady BenchmarkTelemetrySummary `json:"steady"` BaseGraphicsClockMHz float64 `json:"base_graphics_clock_mhz,omitempty"`
Cooldown BenchmarkTelemetrySummary `json:"cooldown"` MaxMemoryClockMHz float64 `json:"max_memory_clock_mhz,omitempty"`
Throttle BenchmarkThrottleCounters `json:"throttle_counters"` LockedGraphicsClockMHz float64 `json:"locked_graphics_clock_mhz,omitempty"`
PrecisionResults []BenchmarkPrecisionResult `json:"precision_results,omitempty"` LockedMemoryClockMHz float64 `json:"locked_memory_clock_mhz,omitempty"`
Scores BenchmarkScorecard `json:"scores"` Baseline BenchmarkTelemetrySummary `json:"baseline"`
DegradationReasons []string `json:"degradation_reasons,omitempty"` Steady BenchmarkTelemetrySummary `json:"steady"`
Notes []string `json:"notes,omitempty"` PrecisionSteady []BenchmarkPrecisionSteadyPhase `json:"precision_steady,omitempty"`
Cooldown BenchmarkTelemetrySummary `json:"cooldown"`
Throttle BenchmarkThrottleCounters `json:"throttle_counters"`
// ECC error delta accumulated over the full benchmark (all phases combined).
ECC BenchmarkECCCounters `json:"ecc,omitempty"`
PrecisionResults []BenchmarkPrecisionResult `json:"precision_results,omitempty"`
Scores BenchmarkScorecard `json:"scores"`
DegradationReasons []string `json:"degradation_reasons,omitempty"`
Notes []string `json:"notes,omitempty"`
} }
type BenchmarkTelemetrySummary struct { type BenchmarkTelemetrySummary struct {
@@ -105,6 +144,18 @@ type BenchmarkThrottleCounters struct {
HWPowerBrakeSlowdownUS uint64 `json:"hw_power_brake_slowdown_us"` HWPowerBrakeSlowdownUS uint64 `json:"hw_power_brake_slowdown_us"`
} }
// BenchmarkECCCounters holds ECC error counts sampled at a point in time.
// Corrected = single-bit errors fixed by ECC (DRAM degradation).
// Uncorrected = double-bit errors that could not be corrected (serious fault).
// Both are volatile (since last driver reset), not persistent.
type BenchmarkECCCounters struct {
Corrected uint64 `json:"corrected"`
Uncorrected uint64 `json:"uncorrected"`
}
func (e BenchmarkECCCounters) Total() uint64 { return e.Corrected + e.Uncorrected }
func (e BenchmarkECCCounters) IsZero() bool { return e.Corrected == 0 && e.Uncorrected == 0 }
type BenchmarkPrecisionResult struct { type BenchmarkPrecisionResult struct {
Name string `json:"name"` Name string `json:"name"`
Category string `json:"category"` Category string `json:"category"`
@@ -115,19 +166,31 @@ type BenchmarkPrecisionResult struct {
K uint64 `json:"k,omitempty"` K uint64 `json:"k,omitempty"`
Iterations uint64 `json:"iterations,omitempty"` Iterations uint64 `json:"iterations,omitempty"`
TeraOpsPerSec float64 `json:"teraops_per_sec,omitempty"` TeraOpsPerSec float64 `json:"teraops_per_sec,omitempty"`
Notes string `json:"notes,omitempty"` // Weight is the fp32-equivalence factor for this precision category.
// fp32 = 1.0 (baseline), fp64 = 2.0, fp16 = 0.5, fp8 = 0.25, fp4 = 0.125.
// WeightedTOPS = TeraOpsPerSec * Weight gives fp32-equivalent throughput.
Weight float64 `json:"weight,omitempty"`
WeightedTeraOpsPerSec float64 `json:"weighted_teraops_per_sec,omitempty"`
Notes string `json:"notes,omitempty"`
} }
type BenchmarkScorecard struct { type BenchmarkScorecard struct {
ComputeScore float64 `json:"compute_score"` ComputeScore float64 `json:"compute_score"`
// SyntheticScore is the sum of fp32-equivalent TOPS from per-precision
// steady phases (each precision ran alone, full GPU dedicated).
SyntheticScore float64 `json:"synthetic_score,omitempty"`
// MixedScore is the sum of fp32-equivalent TOPS from the combined phase
// (all precisions competing simultaneously — closer to real workloads).
MixedScore float64 `json:"mixed_score,omitempty"`
// MixedEfficiency = MixedScore / SyntheticScore. Measures how well the GPU
// sustains throughput under concurrent mixed-precision load.
MixedEfficiency float64 `json:"mixed_efficiency,omitempty"`
PowerSustainScore float64 `json:"power_sustain_score"` PowerSustainScore float64 `json:"power_sustain_score"`
ThermalSustainScore float64 `json:"thermal_sustain_score"` ThermalSustainScore float64 `json:"thermal_sustain_score"`
StabilityScore float64 `json:"stability_score"` StabilityScore float64 `json:"stability_score"`
InterconnectScore float64 `json:"interconnect_score"` InterconnectScore float64 `json:"interconnect_score"`
CompositeScore float64 `json:"composite_score"` CompositeScore float64 `json:"composite_score"`
// TOPSPerSMPerGHz is compute efficiency independent of clock speed and SM count. // TOPSPerSMPerGHz is compute efficiency independent of clock speed and SM count.
// Comparable across throttle levels and GPU generations. Low value at normal
// clocks indicates silicon degradation.
TOPSPerSMPerGHz float64 `json:"tops_per_sm_per_ghz,omitempty"` TOPSPerSMPerGHz float64 `json:"tops_per_sm_per_ghz,omitempty"`
} }
@@ -145,6 +208,20 @@ type BenchmarkServerPower struct {
Notes []string `json:"notes,omitempty"` Notes []string `json:"notes,omitempty"`
} }
// BenchmarkPrecisionSteadyPhase holds per-precision-category telemetry collected
// during a dedicated single-precision steady window. Because only one kernel
// type runs at a time the PowerCVPct here is a genuine stability signal.
type BenchmarkPrecisionSteadyPhase struct {
Precision string `json:"precision"` // e.g. "fp8", "fp16", "fp32"
Steady BenchmarkTelemetrySummary `json:"steady"`
TeraOpsPerSec float64 `json:"teraops_per_sec,omitempty"`
WeightedTeraOpsPerSec float64 `json:"weighted_teraops_per_sec,omitempty"`
// ECC errors accumulated during this precision phase only.
// Non-zero corrected = stress-induced DRAM errors for this kernel type.
// Any uncorrected = serious fault triggered by this precision workload.
ECC BenchmarkECCCounters `json:"ecc,omitempty"`
}
type BenchmarkInterconnectResult struct { type BenchmarkInterconnectResult struct {
Status string `json:"status"` Status string `json:"status"`
Attempted bool `json:"attempted"` Attempted bool `json:"attempted"`

View File

@@ -13,6 +13,7 @@ import (
// GPUMetricRow is one telemetry sample from nvidia-smi during a stress test. // GPUMetricRow is one telemetry sample from nvidia-smi during a stress test.
type GPUMetricRow struct { type GPUMetricRow struct {
Stage string `json:"stage,omitempty"`
ElapsedSec float64 `json:"elapsed_sec"` ElapsedSec float64 `json:"elapsed_sec"`
GPUIndex int `json:"index"` GPUIndex int `json:"index"`
TempC float64 `json:"temp_c"` TempC float64 `json:"temp_c"`
@@ -141,14 +142,20 @@ func sampleAMDGPUMetrics() ([]GPUMetricRow, error) {
// WriteGPUMetricsCSV writes collected rows as a CSV file. // WriteGPUMetricsCSV writes collected rows as a CSV file.
func WriteGPUMetricsCSV(path string, rows []GPUMetricRow) error { func WriteGPUMetricsCSV(path string, rows []GPUMetricRow) error {
var b bytes.Buffer var b bytes.Buffer
b.WriteString("elapsed_sec,gpu_index,temperature_c,usage_pct,mem_usage_pct,power_w,clock_mhz,mem_clock_mhz\n") b.WriteString("stage,elapsed_sec,gpu_index,temperature_c,usage_pct,mem_usage_pct,power_w,clock_mhz,mem_clock_mhz\n")
for _, r := range rows { for _, r := range rows {
fmt.Fprintf(&b, "%.1f,%d,%.1f,%.1f,%.1f,%.1f,%.0f,%.0f\n", fmt.Fprintf(&b, "%s,%.1f,%d,%.1f,%.1f,%.1f,%.1f,%.0f,%.0f\n",
r.ElapsedSec, r.GPUIndex, r.TempC, r.UsagePct, r.MemUsagePct, r.PowerW, r.ClockMHz, r.MemClockMHz) strconv.Quote(strings.TrimSpace(r.Stage)), r.ElapsedSec, r.GPUIndex, r.TempC, r.UsagePct, r.MemUsagePct, r.PowerW, r.ClockMHz, r.MemClockMHz)
} }
return os.WriteFile(path, b.Bytes(), 0644) return os.WriteFile(path, b.Bytes(), 0644)
} }
type gpuMetricStageSpan struct {
Name string
Start float64
End float64
}
// WriteGPUMetricsHTML writes a standalone HTML file with one SVG chart per GPU. // WriteGPUMetricsHTML writes a standalone HTML file with one SVG chart per GPU.
func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error { func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
// Group by GPU index preserving order. // Group by GPU index preserving order.
@@ -163,9 +170,25 @@ func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r) gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
} }
stageSpans := buildGPUMetricStageSpans(rows)
stageColorByName := make(map[string]string, len(stageSpans))
for i, span := range stageSpans {
stageColorByName[span.Name] = gpuMetricStagePalette[i%len(gpuMetricStagePalette)]
}
var legend strings.Builder
if len(stageSpans) > 0 {
legend.WriteString(`<div class="stage-legend">`)
for _, span := range stageSpans {
fmt.Fprintf(&legend, `<span class="stage-chip"><span class="stage-swatch" style="background:%s"></span>%s</span>`,
stageColorByName[span.Name], gpuHTMLEscape(span.Name))
}
legend.WriteString(`</div>`)
}
var svgs strings.Builder var svgs strings.Builder
for _, gpuIdx := range order { for _, gpuIdx := range order {
svgs.WriteString(drawGPUChartSVG(gpuMap[gpuIdx], gpuIdx)) svgs.WriteString(drawGPUChartSVG(gpuMap[gpuIdx], gpuIdx, stageSpans, stageColorByName))
svgs.WriteString("\n") svgs.WriteString("\n")
} }
@@ -175,21 +198,39 @@ func WriteGPUMetricsHTML(path string, rows []GPUMetricRow) error {
<meta charset="utf-8"> <meta charset="utf-8">
<title>GPU Stress Test Metrics</title> <title>GPU Stress Test Metrics</title>
<style> <style>
body { font-family: sans-serif; background: #f0f0f0; margin: 0; padding: 20px; } :root{--bg:#fff;--surface:#fff;--surface-2:#f9fafb;--border:rgba(34,36,38,.15);--border-lite:rgba(34,36,38,.1);--ink:rgba(0,0,0,.87);--muted:rgba(0,0,0,.6)}
h1 { text-align: center; color: #333; margin: 0 0 8px; } *{box-sizing:border-box}
p { text-align: center; color: #888; font-size: 13px; margin: 0 0 24px; } body{font:14px/1.5 Lato,"Helvetica Neue",Arial,Helvetica,sans-serif;background:var(--bg);color:var(--ink);margin:0}
.page{padding:24px}
.card{background:var(--surface);border:1px solid var(--border);border-radius:4px;box-shadow:0 1px 2px rgba(34,36,38,.15);overflow:hidden}
.card-head{padding:11px 16px;background:var(--surface-2);border-bottom:1px solid var(--border);font-weight:700;font-size:13px}
.card-body{padding:16px}
h1{font-size:22px;margin:0 0 6px}
p{color:var(--muted);font-size:13px;margin:0 0 16px}
.stage-legend{display:flex;flex-wrap:wrap;gap:10px;margin:0 0 16px}
.stage-chip{display:inline-flex;align-items:center;gap:8px;padding:4px 10px;border-radius:999px;background:var(--surface-2);border:1px solid var(--border-lite);font-size:12px}
.stage-swatch{display:inline-block;width:12px;height:12px;border-radius:999px}
.chart-block{margin-top:16px}
</style> </style>
</head><body> </head><body>
<div class="page">
<div class="card">
<div class="card-head">GPU Stress Test Metrics</div>
<div class="card-body">
<h1>GPU Stress Test Metrics</h1> <h1>GPU Stress Test Metrics</h1>
<p>Generated %s</p> <p>Generated %s</p>
%s %s
</body></html>`, ts, svgs.String()) <div class="chart-block">%s</div>
</div>
</div>
</div>
</body></html>`, ts, legend.String(), svgs.String())
return os.WriteFile(path, []byte(html), 0644) return os.WriteFile(path, []byte(html), 0644)
} }
// drawGPUChartSVG generates a self-contained SVG chart for one GPU. // drawGPUChartSVG generates a self-contained SVG chart for one GPU.
func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string { func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int, stageSpans []gpuMetricStageSpan, stageColorByName map[string]string) string {
// Layout // Layout
const W, H = 960, 520 const W, H = 960, 520
const plotX1 = 120 // usage axis / chart left border const plotX1 = 120 // usage axis / chart left border
@@ -284,6 +325,23 @@ func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
} }
b.WriteString("</g>\n") b.WriteString("</g>\n")
// Stage backgrounds
for _, span := range stageSpans {
x1 := xv(span.Start)
x2 := xv(span.End)
if x2 < x1 {
x1, x2 = x2, x1
}
if x2-x1 < 1 {
x2 = x1 + 1
}
color := stageColorByName[span.Name]
fmt.Fprintf(&b, `<rect x="%.1f" y="%d" width="%.1f" height="%d" fill="%s" fill-opacity="0.18"/>`+"\n",
x1, plotY1, x2-x1, PH, color)
fmt.Fprintf(&b, `<text x="%.1f" y="%d" font-family="sans-serif" font-size="10" fill="#444" text-anchor="middle">%s</text>`+"\n",
x1+(x2-x1)/2, plotY1+12, gpuHTMLEscape(span.Name))
}
// Chart border // Chart border
fmt.Fprintf(&b, `<rect x="%d" y="%d" width="%d" height="%d"`+ fmt.Fprintf(&b, `<rect x="%d" y="%d" width="%d" height="%d"`+
` fill="none" stroke="#333" stroke-width="1"/>`+"\n", ` fill="none" stroke="#333" stroke-width="1"/>`+"\n",
@@ -382,221 +440,6 @@ func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
return b.String() return b.String()
} }
const (
ansiAmber = "\033[38;5;214m"
ansiReset = "\033[0m"
)
const (
termChartWidth = 70
termChartHeight = 12
)
// RenderGPUTerminalChart returns ANSI line charts (asciigraph-style) per GPU.
// Used in SAT stress-test logs.
func RenderGPUTerminalChart(rows []GPUMetricRow) string {
seen := make(map[int]bool)
var order []int
gpuMap := make(map[int][]GPUMetricRow)
for _, r := range rows {
if !seen[r.GPUIndex] {
seen[r.GPUIndex] = true
order = append(order, r.GPUIndex)
}
gpuMap[r.GPUIndex] = append(gpuMap[r.GPUIndex], r)
}
type seriesDef struct {
caption string
color string
fn func(GPUMetricRow) float64
}
defs := []seriesDef{
{"Temperature (°C)", ansiAmber, func(r GPUMetricRow) float64 { return r.TempC }},
{"GPU Usage (%)", ansiAmber, func(r GPUMetricRow) float64 { return r.UsagePct }},
{"Power (W)", ansiAmber, func(r GPUMetricRow) float64 { return r.PowerW }},
{"Clock (MHz)", ansiAmber, func(r GPUMetricRow) float64 { return r.ClockMHz }},
}
var b strings.Builder
for _, gpuIdx := range order {
gr := gpuMap[gpuIdx]
if len(gr) == 0 {
continue
}
tMax := gr[len(gr)-1].ElapsedSec - gr[0].ElapsedSec
fmt.Fprintf(&b, "GPU %d — Stress Test Metrics (%.0f seconds)\n\n", gpuIdx, tMax)
for _, d := range defs {
b.WriteString(renderLineChart(extractGPUField(gr, d.fn), d.color, d.caption,
termChartHeight, termChartWidth))
b.WriteRune('\n')
}
}
return strings.TrimRight(b.String(), "\n")
}
// renderLineChart draws a single time-series line chart using box-drawing characters.
// Produces output in the style of asciigraph: ╭─╮ │ ╰─╯ with a Y axis and caption.
func renderLineChart(vals []float64, color, caption string, height, width int) string {
if len(vals) == 0 {
return caption + "\n"
}
mn, mx := gpuMinMax(vals)
if mn == mx {
mx = mn + 1
}
// Use the smaller of width or len(vals) to avoid stretching sparse data.
w := width
if len(vals) < w {
w = len(vals)
}
data := gpuDownsample(vals, w)
// row[i] = display row index: 0 = top = max value, height = bottom = min value.
row := make([]int, w)
for i, v := range data {
r := int(math.Round((mx - v) / (mx - mn) * float64(height)))
if r < 0 {
r = 0
}
if r > height {
r = height
}
row[i] = r
}
// Fill the character grid.
grid := make([][]rune, height+1)
for i := range grid {
grid[i] = make([]rune, w)
for j := range grid[i] {
grid[i][j] = ' '
}
}
for x := 0; x < w; x++ {
r := row[x]
if x == 0 {
grid[r][0] = '─'
continue
}
p := row[x-1]
switch {
case r == p:
grid[r][x] = '─'
case r < p: // value went up (row index decreased toward top)
grid[r][x] = '╭'
grid[p][x] = '╯'
for y := r + 1; y < p; y++ {
grid[y][x] = '│'
}
default: // r > p, value went down
grid[p][x] = '╮'
grid[r][x] = '╰'
for y := p + 1; y < r; y++ {
grid[y][x] = '│'
}
}
}
// Y axis tick labels.
ticks := gpuNiceTicks(mn, mx, height/2)
tickAtRow := make(map[int]string)
labelWidth := 4
for _, t := range ticks {
r := int(math.Round((mx - t) / (mx - mn) * float64(height)))
if r < 0 || r > height {
continue
}
s := gpuFormatTick(t)
tickAtRow[r] = s
if len(s) > labelWidth {
labelWidth = len(s)
}
}
var b strings.Builder
for r := 0; r <= height; r++ {
label := tickAtRow[r]
fmt.Fprintf(&b, "%*s", labelWidth, label)
switch {
case label != "":
b.WriteRune('┤')
case r == height:
b.WriteRune('┼')
default:
b.WriteRune('│')
}
b.WriteString(color)
b.WriteString(string(grid[r]))
b.WriteString(ansiReset)
b.WriteRune('\n')
}
// Bottom axis.
b.WriteString(strings.Repeat(" ", labelWidth))
b.WriteRune('└')
b.WriteString(strings.Repeat("─", w))
b.WriteRune('\n')
// Caption centered under the chart.
if caption != "" {
total := labelWidth + 1 + w
if pad := (total - len(caption)) / 2; pad > 0 {
b.WriteString(strings.Repeat(" ", pad))
}
b.WriteString(caption)
b.WriteRune('\n')
}
return b.String()
}
func extractGPUField(rows []GPUMetricRow, fn func(GPUMetricRow) float64) []float64 {
v := make([]float64, len(rows))
for i, r := range rows {
v[i] = fn(r)
}
return v
}
// gpuDownsample averages vals into w buckets (or nearest-neighbor upsamples if len(vals) < w).
func gpuDownsample(vals []float64, w int) []float64 {
n := len(vals)
if n == 0 {
return make([]float64, w)
}
result := make([]float64, w)
if n >= w {
counts := make([]int, w)
for i, v := range vals {
bucket := i * w / n
if bucket >= w {
bucket = w - 1
}
result[bucket] += v
counts[bucket]++
}
for i := range result {
if counts[i] > 0 {
result[i] /= float64(counts[i])
}
}
} else {
// Nearest-neighbour upsample.
for i := range result {
src := i * (n - 1) / (w - 1)
if src >= n {
src = n - 1
}
result[i] = vals[src]
}
}
return result
}
func gpuMinMax(vals []float64) (float64, float64) { func gpuMinMax(vals []float64) (float64, float64) {
if len(vals) == 0 { if len(vals) == 0 {
return 0, 1 return 0, 1
@@ -641,3 +484,46 @@ func gpuFormatTick(v float64) string {
} }
return strconv.FormatFloat(v, 'f', 1, 64) return strconv.FormatFloat(v, 'f', 1, 64)
} }
var gpuMetricStagePalette = []string{
"#d95c5c",
"#2185d0",
"#21ba45",
"#f2c037",
"#6435c9",
"#00b5ad",
"#a5673f",
}
func buildGPUMetricStageSpans(rows []GPUMetricRow) []gpuMetricStageSpan {
var spans []gpuMetricStageSpan
for _, row := range rows {
name := strings.TrimSpace(row.Stage)
if name == "" {
name = "run"
}
if len(spans) == 0 || spans[len(spans)-1].Name != name {
spans = append(spans, gpuMetricStageSpan{Name: name, Start: row.ElapsedSec, End: row.ElapsedSec})
continue
}
spans[len(spans)-1].End = row.ElapsedSec
}
for i := range spans {
if spans[i].End <= spans[i].Start {
spans[i].End = spans[i].Start + 1
}
}
return spans
}
var gpuHTMLReplacer = strings.NewReplacer(
"&", "&amp;",
"<", "&lt;",
">", "&gt;",
`"`, "&quot;",
"'", "&#39;",
)
func gpuHTMLEscape(s string) string {
return gpuHTMLReplacer.Replace(s)
}

View File

@@ -0,0 +1,65 @@
package platform
import (
"os"
"path/filepath"
"strings"
"testing"
)
func TestWriteGPUMetricsCSVIncludesStageColumn(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "gpu-metrics.csv")
rows := []GPUMetricRow{
{Stage: "warmup", ElapsedSec: 1, GPUIndex: 0, TempC: 71, UsagePct: 99, MemUsagePct: 80, PowerW: 420, ClockMHz: 1800, MemClockMHz: 1200},
}
if err := WriteGPUMetricsCSV(path, rows); err != nil {
t.Fatalf("WriteGPUMetricsCSV: %v", err)
}
raw, err := os.ReadFile(path)
if err != nil {
t.Fatalf("ReadFile: %v", err)
}
text := string(raw)
for _, needle := range []string{
"stage,elapsed_sec,gpu_index",
`"warmup",1.0,0,71.0,99.0,80.0,420.0,1800,1200`,
} {
if !strings.Contains(text, needle) {
t.Fatalf("csv missing %q\n%s", needle, text)
}
}
}
func TestWriteGPUMetricsHTMLShowsStageLegendAndLabels(t *testing.T) {
t.Parallel()
dir := t.TempDir()
path := filepath.Join(dir, "gpu-metrics.html")
rows := []GPUMetricRow{
{Stage: "baseline", ElapsedSec: 1, GPUIndex: 0, TempC: 50, UsagePct: 10, MemUsagePct: 5, PowerW: 100, ClockMHz: 500, MemClockMHz: 400},
{Stage: "baseline", ElapsedSec: 2, GPUIndex: 0, TempC: 51, UsagePct: 11, MemUsagePct: 5, PowerW: 101, ClockMHz: 510, MemClockMHz: 400},
{Stage: "steady-fp16", ElapsedSec: 3, GPUIndex: 0, TempC: 70, UsagePct: 98, MemUsagePct: 75, PowerW: 390, ClockMHz: 1700, MemClockMHz: 1100},
{Stage: "steady-fp16", ElapsedSec: 4, GPUIndex: 0, TempC: 71, UsagePct: 99, MemUsagePct: 76, PowerW: 395, ClockMHz: 1710, MemClockMHz: 1110},
}
if err := WriteGPUMetricsHTML(path, rows); err != nil {
t.Fatalf("WriteGPUMetricsHTML: %v", err)
}
raw, err := os.ReadFile(path)
if err != nil {
t.Fatalf("ReadFile: %v", err)
}
text := string(raw)
for _, needle := range []string{
"stage-legend",
"baseline",
"steady-fp16",
"GPU Stress Test Metrics",
} {
if !strings.Contains(text, needle) {
t.Fatalf("html missing %q\n%s", needle, text)
}
}
}

View File

@@ -14,9 +14,17 @@ import (
func (s *System) IsLiveMediaInRAM() bool { func (s *System) IsLiveMediaInRAM() bool {
fsType := mountFSType("/run/live/medium") fsType := mountFSType("/run/live/medium")
if fsType == "" { if fsType == "" {
// No medium mount at all — fall back to toram kernel parameter.
return toramActive() return toramActive()
} }
return strings.EqualFold(fsType, "tmpfs") if strings.EqualFold(fsType, "tmpfs") {
return true
}
// When RunInstallToRAM copies squashfs to /dev/shm/bee-live but the bind
// mount of /run/live/medium fails (common for CD-ROM boots), the medium
// fstype still shows the CD-ROM type. Check whether the RAM copy exists.
files, _ := filepath.Glob("/dev/shm/bee-live/*.squashfs")
return len(files) > 0
} }
func (s *System) LiveBootSource() LiveBootSource { func (s *System) LiveBootSource() LiveBootSource {

View File

@@ -244,11 +244,17 @@ func findUSBExportMount() string {
if readOnly { if readOnly {
continue continue
} }
// Check USB transport via lsblk on the device // Check USB transport via lsblk on the device (or its parent disk for partitions).
if !strings.HasPrefix(device, "/dev/") { if !strings.HasPrefix(device, "/dev/") {
continue continue
} }
if blockDeviceTransport(device) == "usb" { checkDev := device
// lsblk only reports TRAN for the whole disk, not for partitions (e.g. /dev/sdc1).
// Strip trailing partition digits to get the parent disk name.
if trimmed := strings.TrimRight(device, "0123456789"); trimmed != device && len(trimmed) > len("/dev/") {
checkDev = trimmed
}
if blockDeviceTransport(checkDev) == "usb" {
return mountPoint return mountPoint
} }
} }

View File

@@ -108,15 +108,15 @@ type nvidiaGPUHealth struct {
} }
type nvidiaGPUStatusFile struct { type nvidiaGPUStatusFile struct {
Index int Index int
Name string Name string
RunStatus string RunStatus string
Reason string Reason string
Health string Health string
HealthRaw string HealthRaw string
Observed bool Observed bool
Selected bool Selected bool
FailingJob string FailingJob string
} }
// AMDGPUInfo holds basic info about an AMD GPU from rocm-smi. // AMDGPUInfo holds basic info about an AMD GPU from rocm-smi.
@@ -410,13 +410,13 @@ func (s *System) RunNvidiaOfficialComputePack(ctx context.Context, baseDir strin
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-compute", withNvidiaPersistenceMode( return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-compute", withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}}, satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{name: "02-dcgmi-version.log", cmd: []string{"dcgmi", "-v"}}, satJob{name: "02-dcgmi-version.log", cmd: []string{"dcgmi", "-v"}},
satJob{ satJob{
name: "03-dcgmproftester.log", name: "03-dcgmproftester.log",
cmd: profCmd, cmd: profCmd,
env: profEnv, env: profEnv,
collectGPU: true, collectGPU: true,
gpuIndices: selected, gpuIndices: selected,
}, },
satJob{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}}, satJob{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
), logFunc) ), logFunc)
} }
@@ -1382,8 +1382,6 @@ func runSATCommandWithMetrics(ctx context.Context, verboseLog, name string, cmd
if len(metricRows) > 0 { if len(metricRows) > 0 {
_ = WriteGPUMetricsCSV(filepath.Join(runDir, "gpu-metrics.csv"), metricRows) _ = WriteGPUMetricsCSV(filepath.Join(runDir, "gpu-metrics.csv"), metricRows)
_ = WriteGPUMetricsHTML(filepath.Join(runDir, "gpu-metrics.html"), metricRows) _ = WriteGPUMetricsHTML(filepath.Join(runDir, "gpu-metrics.html"), metricRows)
chart := RenderGPUTerminalChart(metricRows)
_ = os.WriteFile(filepath.Join(runDir, "gpu-metrics-term.txt"), []byte(chart), 0644)
} }
return out, err return out, err

View File

@@ -12,6 +12,7 @@ import (
"path/filepath" "path/filepath"
"regexp" "regexp"
"sort" "sort"
"strconv"
"strings" "strings"
"sync/atomic" "sync/atomic"
"syscall" "syscall"
@@ -209,6 +210,14 @@ func joinTaskIndices(indices []int) string {
return strings.Join(parts, ",") return strings.Join(parts, ",")
} }
func formatGPUIndexList(indices []int) string {
parts := make([]string, len(indices))
for i, idx := range indices {
parts[i] = strconv.Itoa(idx)
}
return strings.Join(parts, ",")
}
func formatSplitTaskName(baseName, selectionLabel string) string { func formatSplitTaskName(baseName, selectionLabel string) string {
baseName = strings.TrimSpace(baseName) baseName = strings.TrimSpace(baseName)
selectionLabel = strings.TrimSpace(selectionLabel) selectionLabel = strings.TrimSpace(selectionLabel)
@@ -488,6 +497,7 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
GPUIndices []int `json:"gpu_indices"` GPUIndices []int `json:"gpu_indices"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices"` ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
StaggerGPUStart bool `json:"stagger_gpu_start"` StaggerGPUStart bool `json:"stagger_gpu_start"`
ParallelGPUs bool `json:"parallel_gpus"`
Loader string `json:"loader"` Loader string `json:"loader"`
Profile string `json:"profile"` Profile string `json:"profile"`
DisplayName string `json:"display_name"` DisplayName string `json:"display_name"`
@@ -510,6 +520,7 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
GPUIndices: body.GPUIndices, GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices, ExcludeGPUIndices: body.ExcludeGPUIndices,
StaggerGPUStart: body.StaggerGPUStart, StaggerGPUStart: body.StaggerGPUStart,
ParallelGPUs: body.ParallelGPUs,
Loader: body.Loader, Loader: body.Loader,
BurnProfile: body.Profile, BurnProfile: body.Profile,
DisplayName: body.DisplayName, DisplayName: body.DisplayName,
@@ -540,6 +551,7 @@ func (h *handler) handleAPIBenchmarkNvidiaRun(w http.ResponseWriter, r *http.Req
ExcludeGPUIndices []int `json:"exclude_gpu_indices"` ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
RunNCCL *bool `json:"run_nccl"` RunNCCL *bool `json:"run_nccl"`
ParallelGPUs *bool `json:"parallel_gpus"` ParallelGPUs *bool `json:"parallel_gpus"`
RampUp *bool `json:"ramp_up"`
DisplayName string `json:"display_name"` DisplayName string `json:"display_name"`
} }
if r.Body != nil { if r.Body != nil {
@@ -557,10 +569,82 @@ func (h *handler) handleAPIBenchmarkNvidiaRun(w http.ResponseWriter, r *http.Req
if body.ParallelGPUs != nil { if body.ParallelGPUs != nil {
parallelGPUs = *body.ParallelGPUs parallelGPUs = *body.ParallelGPUs
} }
rampUp := false
if body.RampUp != nil {
rampUp = *body.RampUp
}
// Build a descriptive base name that includes profile and mode so the task
// list is self-explanatory without opening individual task detail pages.
profile := strings.TrimSpace(body.Profile)
if profile == "" {
profile = "standard"
}
name := taskDisplayName("nvidia-benchmark", "", "") name := taskDisplayName("nvidia-benchmark", "", "")
if strings.TrimSpace(body.DisplayName) != "" { if strings.TrimSpace(body.DisplayName) != "" {
name = body.DisplayName name = body.DisplayName
} }
// Append profile tag.
name = fmt.Sprintf("%s · %s", name, profile)
if rampUp && len(body.GPUIndices) > 1 {
// Ramp-up mode: resolve GPU list, then create one task per prefix
// [gpu0], [gpu0,gpu1], ..., [gpu0,...,gpuN-1], each running in parallel.
gpus, err := apiListNvidiaGPUs(h.opts.App)
if err != nil {
writeError(w, http.StatusBadRequest, err.Error())
return
}
resolved, err := expandSelectedGPUIndices(gpus, body.GPUIndices, body.ExcludeGPUIndices)
if err != nil {
writeError(w, http.StatusBadRequest, err.Error())
return
}
if len(resolved) < 2 {
// Fall through to normal single-task path.
rampUp = false
} else {
now := time.Now()
rampRunID := fmt.Sprintf("ramp-%s", now.UTC().Format("20060102-150405"))
var allTasks []*Task
for step := 1; step <= len(resolved); step++ {
subset := resolved[:step]
stepName := fmt.Sprintf("%s · ramp %d/%d · GPU %s", name, step, len(resolved), formatGPUIndexList(subset))
t := &Task{
ID: newJobID("benchmark-nvidia"),
Name: stepName,
Target: "nvidia-benchmark",
Priority: 15,
Status: TaskPending,
CreatedAt: now,
params: taskParams{
GPUIndices: append([]int(nil), subset...),
SizeMB: body.SizeMB,
BenchmarkProfile: body.Profile,
RunNCCL: runNCCL && step == len(resolved),
ParallelGPUs: true,
RampStep: step,
RampTotal: len(resolved),
RampRunID: rampRunID,
DisplayName: stepName,
},
}
allTasks = append(allTasks, t)
}
for _, t := range allTasks {
globalQueue.enqueue(t)
}
writeTaskRunResponse(w, allTasks)
return
}
}
// For non-ramp tasks append mode tag.
if parallelGPUs {
name = fmt.Sprintf("%s · parallel", name)
} else {
name = fmt.Sprintf("%s · sequential", name)
}
tasks, err := buildNvidiaTaskSet("nvidia-benchmark", 15, time.Now(), taskParams{ tasks, err := buildNvidiaTaskSet("nvidia-benchmark", 15, time.Now(), taskParams{
GPUIndices: body.GPUIndices, GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices, ExcludeGPUIndices: body.ExcludeGPUIndices,

View File

@@ -330,6 +330,33 @@ func renderHardwareSummaryCard(opts HandlerOptions) string {
var b strings.Builder var b strings.Builder
b.WriteString(`<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body">`) b.WriteString(`<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body">`)
// Server identity block above the component table.
{
var model, serial string
parts := []string{}
if hw.Board.Manufacturer != nil && strings.TrimSpace(*hw.Board.Manufacturer) != "" {
parts = append(parts, strings.TrimSpace(*hw.Board.Manufacturer))
}
if hw.Board.ProductName != nil && strings.TrimSpace(*hw.Board.ProductName) != "" {
parts = append(parts, strings.TrimSpace(*hw.Board.ProductName))
}
if len(parts) > 0 {
model = strings.Join(parts, " ")
}
serial = strings.TrimSpace(hw.Board.SerialNumber)
if model != "" || serial != "" {
b.WriteString(`<div style="margin-bottom:14px">`)
if model != "" {
fmt.Fprintf(&b, `<div style="font-size:16px;font-weight:700;margin-bottom:2px">%s</div>`, html.EscapeString(model))
}
if serial != "" {
fmt.Fprintf(&b, `<div style="font-size:12px;color:var(--muted)">S/N: %s</div>`, html.EscapeString(serial))
}
b.WriteString(`</div>`)
}
}
b.WriteString(`<table style="width:auto">`) b.WriteString(`<table style="width:auto">`)
writeRow := func(label, value, badgeHTML string) { writeRow := func(label, value, badgeHTML string) {
b.WriteString(fmt.Sprintf(`<tr><td style="padding:6px 14px 6px 0;font-weight:700;white-space:nowrap">%s</td><td style="padding:6px 0;color:var(--muted);font-size:13px">%s</td><td style="padding:6px 0 6px 12px">%s</td></tr>`, b.WriteString(fmt.Sprintf(`<tr><td style="padding:6px 14px 6px 0;font-weight:700;white-space:nowrap">%s</td><td style="padding:6px 0;color:var(--muted);font-size:13px">%s</td><td style="padding:6px 0 6px 12px">%s</td></tr>`,
@@ -1279,9 +1306,6 @@ func renderValidate(opts HandlerOptions) string {
<div class="card" style="margin-bottom:16px"> <div class="card" style="margin-bottom:16px">
<div class="card-head">Validate Profile</div> <div class="card-head">Validate Profile</div>
<div class="card-body validate-profile-body"> <div class="card-body validate-profile-body">
<div class="validate-profile-col">
<div class="form-row" style="margin:0"><label>Cycles</label><input type="number" id="sat-cycles" value="1" min="1" max="100" style="width:100%"></div>
</div>
<div class="validate-profile-col"> <div class="validate-profile-col">
<div class="form-row" style="margin:12px 0 0"><label>Mode</label></div> <div class="form-row" style="margin:12px 0 0"><label>Mode</label></div>
<label class="cb-row"><input type="radio" name="sat-mode" id="sat-mode-validate" value="validate" checked onchange="satModeChanged()"><span>Validate — quick non-destructive check</span></label> <label class="cb-row"><input type="radio" name="sat-mode" id="sat-mode-validate" value="validate" checked onchange="satModeChanged()"><span>Validate — quick non-destructive check</span></label>
@@ -1331,22 +1355,16 @@ func renderValidate(opts HandlerOptions) string {
<p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p> <p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p>
</div> </div>
<p id="sat-gpu-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA validate tasks.</p> <p id="sat-gpu-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA validate tasks.</p>
<div style="margin-top:10px;padding-top:10px;border-top:1px solid var(--border)">
<label class="sat-gpu-row" title="When checked, multi-GPU tests (PSU Pulse, NCCL, NVBandwidth) run on ALL GPUs in the system regardless of the selection above.">
<input type="checkbox" id="sat-multi-gpu-all" checked onchange="satUpdateGPUSelectionNote()">
<span><strong>Multi-GPU tests</strong> — use all GPUs <span style="font-size:11px;color:var(--muted)">(PSU Pulse, NCCL, NVBandwidth)</span></span>
</label>
</div>
</div> </div>
</div> </div>
<div class="grid3"> <div class="grid3">
` + renderSATCard("nvidia", "NVIDIA GPU", "runNvidiaValidateSet('nvidia')", "", renderValidateCardBody( ` + renderSATCard("nvidia", "NVIDIA GPU", "runNvidiaValidateSet('nvidia')", "", renderValidateCardBody(
inv.NVIDIA, inv.NVIDIA,
`Runs NVIDIA diagnostics and board inventory checks.`, `Runs NVIDIA diagnostics and board inventory checks.`,
`<code>nvidia-smi</code>, <code>dmidecode</code>, <code>dcgmi diag</code>`, `<code>nvidia-smi</code>, <code>dmidecode</code>, <code>dcgmi diag</code>`,
`Level 2 in Validate, Level 3 in Stress. Runs one GPU at a time on the selected NVIDIA GPUs.`, `Level 2 in Validate, Level 3 in Stress. Runs one GPU at a time on the selected NVIDIA GPUs.`,
)) + )) +
`<div id="sat-card-nvidia-targeted-stress">` + `<div id="sat-card-nvidia-targeted-stress">` +
renderSATCard("nvidia-targeted-stress", "NVIDIA GPU Targeted Stress", "runNvidiaValidateSet('nvidia-targeted-stress')", "", renderValidateCardBody( renderSATCard("nvidia-targeted-stress", "NVIDIA GPU Targeted Stress", "runNvidiaValidateSet('nvidia-targeted-stress')", "", renderValidateCardBody(
inv.NVIDIA, inv.NVIDIA,
@@ -1455,10 +1473,6 @@ function satSelectedGPUIndices() {
.filter(function(v) { return !Number.isNaN(v); }) .filter(function(v) { return !Number.isNaN(v); })
.sort(function(a, b) { return a - b; }); .sort(function(a, b) { return a - b; });
} }
function satMultiGPUAll() {
const cb = document.getElementById('sat-multi-gpu-all');
return cb ? cb.checked : true;
}
function satUpdateGPUSelectionNote() { function satUpdateGPUSelectionNote() {
const note = document.getElementById('sat-gpu-selection-note'); const note = document.getElementById('sat-gpu-selection-note');
if (!note) return; if (!note) return;
@@ -1467,8 +1481,7 @@ function satUpdateGPUSelectionNote() {
note.textContent = 'Select at least one NVIDIA GPU to enable NVIDIA validate tasks.'; note.textContent = 'Select at least one NVIDIA GPU to enable NVIDIA validate tasks.';
return; return;
} }
const multiAll = satMultiGPUAll(); note.textContent = 'Selected GPUs: ' + selected.join(', ') + '. Multi-GPU tests will use all selected GPUs.';
note.textContent = 'Selected GPUs: ' + selected.join(', ') + '. Multi-GPU tests: ' + (multiAll ? 'all GPUs in system' : 'selected GPUs only') + '.';
} }
function satRenderGPUList(gpus) { function satRenderGPUList(gpus) {
const root = document.getElementById('sat-gpu-list'); const root = document.getElementById('sat-gpu-list');
@@ -1582,15 +1595,8 @@ const nvidiaPerGPUTargets = ['nvidia', 'nvidia-targeted-stress', 'nvidia-targete
// pulse_test and fabric tests run on all selected GPUs simultaneously // pulse_test and fabric tests run on all selected GPUs simultaneously
const nvidiaAllGPUTargets = ['nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth']; const nvidiaAllGPUTargets = ['nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth'];
function satAllGPUIndicesForMulti() { function satAllGPUIndicesForMulti() {
// If "Multi-GPU tests — all GPUs" is checked, return all detected GPUs. // Multi-GPU tests always use the current GPU selection.
// Otherwise fall back to the per-GPU selection. return Promise.resolve(satSelectedGPUIndices());
if (satMultiGPUAll()) {
return loadSatNvidiaGPUs().then(function(gpus) {
return gpus.map(function(g) { return Number(g.index); });
});
}
const sel = satSelectedGPUIndices();
return Promise.resolve(sel);
} }
function expandSATTarget(target) { function expandSATTarget(target) {
if (nvidiaAllGPUTargets.indexOf(target) >= 0) { if (nvidiaAllGPUTargets.indexOf(target) >= 0) {
@@ -1680,7 +1686,7 @@ function runAMDValidateSet() {
return runNext(0); return runNext(0);
} }
function runAllSAT() { function runAllSAT() {
const cycles = Math.max(1, parseInt(document.getElementById('sat-cycles').value)||1); const cycles = 1;
const status = document.getElementById('sat-all-status'); const status = document.getElementById('sat-all-status');
status.textContent = 'Enqueuing...'; status.textContent = 'Enqueuing...';
const stressOnlyTargets = ['nvidia-targeted-stress', 'nvidia-targeted-power', 'nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth']; const stressOnlyTargets = ['nvidia-targeted-stress', 'nvidia-targeted-power', 'nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth'];
@@ -1922,23 +1928,10 @@ func renderSATCard(id, label, runAction, headerActions, body string) string {
// ── Benchmark ───────────────────────────────────────────────────────────────── // ── Benchmark ─────────────────────────────────────────────────────────────────
type benchmarkHistoryColumn struct {
key string
label string
name string
index int
parallel bool
}
type benchmarkHistoryCell struct {
score float64
present bool
}
type benchmarkHistoryRun struct { type benchmarkHistoryRun struct {
generatedAt time.Time generatedAt time.Time
displayTime string displayTime string
cells map[string]benchmarkHistoryCell gpuScores map[int]float64 // GPU index → composite score
} }
func renderBenchmark(opts HandlerOptions) string { func renderBenchmark(opts HandlerOptions) string {
@@ -1967,12 +1960,16 @@ func renderBenchmark(opts HandlerOptions) string {
</div> </div>
</div> </div>
<label class="benchmark-cb-row"> <label class="benchmark-cb-row">
<input type="checkbox" id="benchmark-parallel-gpus"> <input type="radio" name="benchmark-mode" value="sequential" onchange="benchmarkUpdateSelectionNote()">
<span>Run all selected GPUs simultaneously (parallel mode)</span> <span>Sequential — one GPU at a time</span>
</label> </label>
<label class="benchmark-cb-row"> <label class="benchmark-cb-row" id="benchmark-parallel-label">
<input type="checkbox" id="benchmark-run-nccl" checked> <input type="radio" name="benchmark-mode" value="parallel" onchange="benchmarkUpdateSelectionNote()">
<span>Run multi-GPU interconnect step (NCCL) only on the selected GPUs</span> <span>Parallel — all selected GPUs simultaneously</span>
</label>
<label class="benchmark-cb-row" id="benchmark-ramp-label">
<input type="radio" name="benchmark-mode" value="ramp-up" checked onchange="benchmarkUpdateSelectionNote()">
<span>Ramp-up — 1 GPU → 2 → … → all selected (separate tasks)</span>
</label> </label>
<p id="benchmark-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 14px">Select one GPU for single-card benchmarking or several GPUs for a constrained multi-GPU run.</p> <p id="benchmark-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 14px">Select one GPU for single-card benchmarking or several GPUs for a constrained multi-GPU run.</p>
<button id="benchmark-run-btn" class="btn btn-primary" onclick="runNvidiaBenchmark()" disabled>&#9654; Run Benchmark</button> <button id="benchmark-run-btn" class="btn btn-primary" onclick="runNvidiaBenchmark()" disabled>&#9654; Run Benchmark</button>
@@ -2025,22 +2022,28 @@ function benchmarkSelectedGPUIndices() {
.sort(function(a, b) { return a - b; }); .sort(function(a, b) { return a - b; });
} }
function benchmarkMode() {
const el = document.querySelector('input[name="benchmark-mode"]:checked');
return el ? el.value : 'sequential';
}
function benchmarkUpdateSelectionNote() { function benchmarkUpdateSelectionNote() {
const selected = benchmarkSelectedGPUIndices(); const selected = benchmarkSelectedGPUIndices();
const btn = document.getElementById('benchmark-run-btn'); const btn = document.getElementById('benchmark-run-btn');
const note = document.getElementById('benchmark-selection-note'); const note = document.getElementById('benchmark-selection-note');
const nccl = document.getElementById('benchmark-run-nccl');
if (!selected.length) { if (!selected.length) {
btn.disabled = true; btn.disabled = true;
note.textContent = 'Select at least one NVIDIA GPU to run the benchmark.'; note.textContent = 'Select at least one NVIDIA GPU to run the benchmark.';
return; return;
} }
btn.disabled = false; btn.disabled = false;
note.textContent = 'Selected GPUs: ' + selected.join(', ') + '.'; const mode = benchmarkMode();
if (nccl && nccl.checked && selected.length < 2) { if (mode === 'ramp-up') {
note.textContent += ' NCCL will be skipped because fewer than 2 GPUs are selected.'; note.textContent = 'Ramp-up: ' + selected.length + ' tasks (1 GPU → ' + selected.length + ' GPUs). NCCL on final step.';
} else if (nccl && nccl.checked) { } else if (mode === 'parallel') {
note.textContent += ' NCCL interconnect will use only these GPUs.'; note.textContent = 'Parallel: all ' + selected.length + ' GPU(s) simultaneously.' + (selected.length > 1 ? ' NCCL included.' : '');
} else {
note.textContent = 'Sequential: each GPU benchmarked separately.' + (selected.length > 1 ? ' NCCL included on each.' : '');
} }
} }
@@ -2058,6 +2061,33 @@ function benchmarkRenderGPUList(gpus) {
+ '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>' + '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>'
+ '</label>'; + '</label>';
}).join(''); }).join('');
benchmarkApplyMultiGPUState(gpus.length);
benchmarkUpdateSelectionNote();
}
// Disable radio options that require multiple GPUs when only one is present.
function benchmarkApplyMultiGPUState(gpuCount) {
var multiValues = ['parallel', 'ramp-up'];
var radios = document.querySelectorAll('input[name="benchmark-mode"]');
radios.forEach(function(el) {
var isMulti = multiValues.indexOf(el.value) >= 0;
if (gpuCount < 2 && isMulti) {
el.disabled = true;
if (el.checked) {
// fall back to sequential
var seq = document.querySelector('input[name="benchmark-mode"][value="sequential"]');
if (seq) seq.checked = true;
}
var label = el.closest('label');
if (label) label.style.opacity = '0.4';
} else {
el.disabled = false;
// restore default: ramp-up checked when ≥2 GPUs
if (gpuCount >= 2 && el.value === 'ramp-up') el.checked = true;
var label = el.closest('label');
if (label) label.style.opacity = '';
}
});
benchmarkUpdateSelectionNote(); benchmarkUpdateSelectionNote();
} }
@@ -2095,12 +2125,15 @@ function runNvidiaBenchmark() {
return; return;
} }
if (benchmarkES) { benchmarkES.close(); benchmarkES = null; } if (benchmarkES) { benchmarkES.close(); benchmarkES = null; }
const parallelGPUs = !!document.getElementById('benchmark-parallel-gpus').checked; const mode = benchmarkMode();
const rampUp = mode === 'ramp-up' && selected.length > 1;
const parallelGPUs = mode === 'parallel';
const body = { const body = {
profile: document.getElementById('benchmark-profile').value || 'standard', profile: document.getElementById('benchmark-profile').value || 'standard',
gpu_indices: selected, gpu_indices: selected,
run_nccl: !!document.getElementById('benchmark-run-nccl').checked, run_nccl: selected.length > 1,
parallel_gpus: parallelGPUs, parallel_gpus: parallelGPUs,
ramp_up: rampUp,
display_name: 'NVIDIA Benchmark' display_name: 'NVIDIA Benchmark'
}; };
document.getElementById('benchmark-output').style.display = 'block'; document.getElementById('benchmark-output').style.display = 'block';
@@ -2155,23 +2188,22 @@ function runNvidiaBenchmark() {
}); });
} }
document.getElementById('benchmark-run-nccl').addEventListener('change', benchmarkUpdateSelectionNote);
benchmarkLoadGPUs(); benchmarkLoadGPUs();
</script>` </script>`
} }
func renderBenchmarkResultsCard(exportDir string) string { func renderBenchmarkResultsCard(exportDir string) string {
columns, runs := loadBenchmarkHistory(exportDir) maxIdx, runs := loadBenchmarkHistory(exportDir)
return renderBenchmarkResultsCardFromRuns( return renderBenchmarkResultsCardFromRuns(
"Benchmark Results", "Benchmark Results",
"Composite score by saved benchmark run and GPU.", "Composite score by saved benchmark run and GPU.",
"No saved benchmark runs yet.", "No saved benchmark runs yet.",
columns, maxIdx,
runs, runs,
) )
} }
func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string, columns []benchmarkHistoryColumn, runs []benchmarkHistoryRun) string { func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string, maxGPUIndex int, runs []benchmarkHistoryRun) string {
if len(runs) == 0 { if len(runs) == 0 {
return `<div class="card"><div class="card-head">` + html.EscapeString(title) + `</div><div class="card-body"><p style="color:var(--muted);font-size:13px">` + html.EscapeString(emptyMessage) + `</p></div></div>` return `<div class="card"><div class="card-head">` + html.EscapeString(title) + `</div><div class="card-body"><p style="color:var(--muted);font-size:13px">` + html.EscapeString(emptyMessage) + `</p></div></div>`
} }
@@ -2181,22 +2213,22 @@ func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string,
b.WriteString(`<p style="color:var(--muted);font-size:13px;margin-bottom:12px">` + html.EscapeString(description) + `</p>`) b.WriteString(`<p style="color:var(--muted);font-size:13px;margin-bottom:12px">` + html.EscapeString(description) + `</p>`)
} }
b.WriteString(`<div style="overflow-x:auto">`) b.WriteString(`<div style="overflow-x:auto">`)
b.WriteString(`<table><thead><tr><th>Test</th><th>Time</th>`) b.WriteString(`<table><thead><tr><th>Run</th><th>Time</th>`)
for _, col := range columns { for i := 0; i <= maxGPUIndex; i++ {
b.WriteString(`<th>` + html.EscapeString(col.label) + `</th>`) b.WriteString(`<th>GPU ` + strconv.Itoa(i) + `</th>`)
} }
b.WriteString(`</tr></thead><tbody>`) b.WriteString(`</tr></thead><tbody>`)
for i, run := range runs { for i, run := range runs {
b.WriteString(`<tr>`) b.WriteString(`<tr>`)
b.WriteString(`<td>#` + strconv.Itoa(i+1) + `</td>`) b.WriteString(`<td>#` + strconv.Itoa(i+1) + `</td>`)
b.WriteString(`<td>` + html.EscapeString(run.displayTime) + `</td>`) b.WriteString(`<td>` + html.EscapeString(run.displayTime) + `</td>`)
for _, col := range columns { for idx := 0; idx <= maxGPUIndex; idx++ {
cell, ok := run.cells[col.key] score, ok := run.gpuScores[idx]
if !ok || !cell.present { if !ok {
b.WriteString(`<td style="color:var(--muted)">-</td>`) b.WriteString(`<td style="color:var(--muted)">-</td>`)
continue continue
} }
b.WriteString(`<td>` + fmt.Sprintf("%.2f", cell.score) + `</td>`) b.WriteString(`<td>` + fmt.Sprintf("%.2f", score) + `</td>`)
} }
b.WriteString(`</tr>`) b.WriteString(`</tr>`)
} }
@@ -2204,22 +2236,22 @@ func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string,
return b.String() return b.String()
} }
func loadBenchmarkHistory(exportDir string) ([]benchmarkHistoryColumn, []benchmarkHistoryRun) { func loadBenchmarkHistory(exportDir string) (int, []benchmarkHistoryRun) {
baseDir := app.DefaultBenchmarkBaseDir baseDir := app.DefaultBenchmarkBaseDir
if strings.TrimSpace(exportDir) != "" { if strings.TrimSpace(exportDir) != "" {
baseDir = filepath.Join(exportDir, "bee-benchmark") baseDir = filepath.Join(exportDir, "bee-benchmark")
} }
paths, err := filepath.Glob(filepath.Join(baseDir, "gpu-benchmark-*", "result.json")) paths, err := filepath.Glob(filepath.Join(baseDir, "gpu-benchmark-*", "result.json"))
if err != nil || len(paths) == 0 { if err != nil || len(paths) == 0 {
return nil, nil return -1, nil
} }
sort.Strings(paths) sort.Strings(paths)
return loadBenchmarkHistoryFromPaths(paths) return loadBenchmarkHistoryFromPaths(paths)
} }
func loadBenchmarkHistoryFromPaths(paths []string) ([]benchmarkHistoryColumn, []benchmarkHistoryRun) { func loadBenchmarkHistoryFromPaths(paths []string) (int, []benchmarkHistoryRun) {
columnByKey := make(map[string]benchmarkHistoryColumn)
runs := make([]benchmarkHistoryRun, 0, len(paths)) runs := make([]benchmarkHistoryRun, 0, len(paths))
maxGPUIndex := -1
for _, path := range paths { for _, path := range paths {
raw, err := os.ReadFile(path) raw, err := os.ReadFile(path)
if err != nil { if err != nil {
@@ -2232,102 +2264,22 @@ func loadBenchmarkHistoryFromPaths(paths []string) ([]benchmarkHistoryColumn, []
run := benchmarkHistoryRun{ run := benchmarkHistoryRun{
generatedAt: result.GeneratedAt, generatedAt: result.GeneratedAt,
displayTime: result.GeneratedAt.Local().Format("2006-01-02 15:04:05"), displayTime: result.GeneratedAt.Local().Format("2006-01-02 15:04:05"),
cells: make(map[string]benchmarkHistoryCell), gpuScores: make(map[int]float64),
} }
for _, gpu := range result.GPUs {
if result.ParallelGPUs { run.gpuScores[gpu.Index] = gpu.Scores.CompositeScore
// All GPUs ran simultaneously — one column per server, score = avg composite. if gpu.Index > maxGPUIndex {
gpuModelCount := make(map[string]int) maxGPUIndex = gpu.Index
for _, gpu := range result.GPUs {
gpuModelCount[strings.TrimSpace(gpu.Name)]++
}
scoreSum := make(map[string]float64)
scoreCnt := make(map[string]int)
for _, gpu := range result.GPUs {
key := "parallel|" + strings.TrimSpace(result.ServerModel) + "|" + strings.TrimSpace(gpu.Name)
scoreSum[key] += gpu.Scores.CompositeScore
scoreCnt[key]++
count := gpuModelCount[strings.TrimSpace(gpu.Name)]
columnByKey[key] = benchmarkHistoryColumn{
key: key,
label: benchmarkHistoryParallelLabel(result.ServerModel, gpu.Name, count),
name: strings.TrimSpace(gpu.Name),
index: -1,
parallel: true,
}
}
for key, sum := range scoreSum {
run.cells[key] = benchmarkHistoryCell{score: sum / float64(scoreCnt[key]), present: true}
}
} else {
// Each GPU ran independently — one column per GPU index.
for _, gpu := range result.GPUs {
key := "gpu|" + strings.TrimSpace(result.ServerModel) + "|" + strings.TrimSpace(gpu.Name) + "|" + strconv.Itoa(gpu.Index)
columnByKey[key] = benchmarkHistoryColumn{
key: key,
label: benchmarkHistoryPerGPULabel(gpu.Name, gpu.Index),
name: strings.TrimSpace(gpu.Name),
index: gpu.Index,
parallel: false,
}
run.cells[key] = benchmarkHistoryCell{score: gpu.Scores.CompositeScore, present: true}
} }
} }
runs = append(runs, run) runs = append(runs, run)
} }
columns := make([]benchmarkHistoryColumn, 0, len(columnByKey))
for _, col := range columnByKey {
columns = append(columns, col)
}
// Sequential GPU columns first (sorted by GPU index), then parallel server columns.
sort.Slice(columns, func(i, j int) bool {
if columns[i].parallel != columns[j].parallel {
return !columns[i].parallel // sequential first
}
if columns[i].parallel {
li := strings.ToLower(columns[i].label)
lj := strings.ToLower(columns[j].label)
if li != lj {
return li < lj
}
return columns[i].key < columns[j].key
}
// Sequential: sort by GPU index, then name.
if columns[i].index != columns[j].index {
return columns[i].index < columns[j].index
}
return strings.ToLower(columns[i].name) < strings.ToLower(columns[j].name)
})
sort.Slice(runs, func(i, j int) bool { sort.Slice(runs, func(i, j int) bool {
return runs[i].generatedAt.After(runs[j].generatedAt) return runs[i].generatedAt.After(runs[j].generatedAt)
}) })
return columns, runs return maxGPUIndex, runs
} }
// benchmarkHistoryPerGPULabel formats a label for a single-GPU column: "GPU #N — ModelName".
func benchmarkHistoryPerGPULabel(gpuName string, index int) string {
gpuName = strings.TrimSpace(gpuName)
if gpuName == "" {
gpuName = "Unknown GPU"
}
return fmt.Sprintf("GPU #%d — %s", index, gpuName)
}
// benchmarkHistoryParallelLabel formats a label for an all-GPU parallel column:
// "ServerModel — N× ModelName (All GPUs)" or "N× ModelName (All GPUs)" if no server.
func benchmarkHistoryParallelLabel(serverModel, gpuName string, count int) string {
serverModel = strings.TrimSpace(serverModel)
gpuName = strings.TrimSpace(gpuName)
if gpuName == "" {
gpuName = "Unknown GPU"
}
gpuPart := fmt.Sprintf("%d× %s (All GPUs)", count, gpuName)
if serverModel == "" {
return gpuPart
}
return fmt.Sprintf("%s — %s", serverModel, gpuPart)
}
// ── Burn ────────────────────────────────────────────────────────────────────── // ── Burn ──────────────────────────────────────────────────────────────────────
@@ -2371,10 +2323,20 @@ func renderBurn() string {
<p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p> <p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p>
</div> </div>
<p id="burn-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA burn recipes.</p> <p id="burn-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA burn recipes.</p>
<label class="cb-row" style="margin-top:10px"> <div style="display:flex;flex-direction:column;gap:4px;margin-top:10px">
<input type="checkbox" id="burn-stagger-nvidia"> <label class="cb-row">
<span>Ramp selected NVIDIA GPUs one by one before full-load hold. Uses a 3-minute stabilization window per GPU, then keeps all selected GPUs under load for the chosen Burn Profile duration.</span> <input type="radio" name="burn-nvidia-mode" value="sequential" checked>
</label> <span>Sequential — selected GPUs one at a time</span>
</label>
<label class="cb-row" id="burn-parallel-label">
<input type="radio" name="burn-nvidia-mode" value="parallel">
<span>Parallel — all selected GPUs simultaneously</span>
</label>
<label class="cb-row" id="burn-ramp-label">
<input type="radio" name="burn-nvidia-mode" value="ramp-up">
<span>Ramp-up — add one GPU at a time</span>
</label>
</div>
</div> </div>
</div> </div>
@@ -2450,9 +2412,30 @@ function burnSelectedGPUIndices() {
.sort(function(a, b) { return a - b; }); .sort(function(a, b) { return a - b; });
} }
function burnUseNvidiaRampUp() { function burnNvidiaMode() {
const el = document.getElementById('burn-stagger-nvidia'); const el = document.querySelector('input[name="burn-nvidia-mode"]:checked');
return !!(el && el.checked); return el ? el.value : 'sequential';
}
function burnApplyMultiGPUState(gpuCount) {
var multiValues = ['parallel', 'ramp-up'];
var radios = document.querySelectorAll('input[name="burn-nvidia-mode"]');
radios.forEach(function(el) {
var isMulti = multiValues.indexOf(el.value) >= 0;
if (gpuCount < 2 && isMulti) {
el.disabled = true;
if (el.checked) {
var seq = document.querySelector('input[name="burn-nvidia-mode"][value="sequential"]');
if (seq) seq.checked = true;
}
var label = el.closest('label');
if (label) label.style.opacity = '0.4';
} else {
el.disabled = false;
var label = el.closest('label');
if (label) label.style.opacity = '';
}
});
} }
function burnUpdateSelectionNote() { function burnUpdateSelectionNote() {
@@ -2479,6 +2462,7 @@ function burnRenderGPUList(gpus) {
+ '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>' + '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>'
+ '</label>'; + '</label>';
}).join(''); }).join('');
burnApplyMultiGPUState(gpus.length);
burnUpdateSelectionNote(); burnUpdateSelectionNote();
} }
@@ -2514,8 +2498,11 @@ function enqueueBurnTask(target, label, extra, useSelectedNvidia) {
return Promise.reject(new Error('Select at least one NVIDIA GPU.')); return Promise.reject(new Error('Select at least one NVIDIA GPU.'));
} }
body.gpu_indices = selected; body.gpu_indices = selected;
if (burnUseNvidiaRampUp() && selected.length > 1) { const bMode = burnNvidiaMode();
if (bMode === 'ramp-up' && selected.length > 1) {
body.stagger_gpu_start = true; body.stagger_gpu_start = true;
} else if (bMode === 'parallel' && selected.length > 1) {
body.parallel_gpus = true;
} }
} }
return fetch('/api/sat/' + target + '/run', { return fetch('/api/sat/' + target + '/run', {
@@ -3108,7 +3095,6 @@ usbRefresh();
</script>` </script>`
} }
func renderNvidiaSelfHealInline() string { func renderNvidiaSelfHealInline() string {
return `<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Inspect NVIDIA GPU health, restart the bee-nvidia driver service, and issue a per-GPU reset when the driver reports reset required.</p> return `<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Inspect NVIDIA GPU health, restart the bee-nvidia driver service, and issue a per-GPU reset when the driver reports reset required.</p>
<div style="display:flex;gap:8px;flex-wrap:wrap;margin-bottom:12px"> <div style="display:flex;gap:8px;flex-wrap:wrap;margin-bottom:12px">

View File

@@ -693,8 +693,8 @@ func TestBenchmarkPageRendersSavedResultsTable(t *testing.T) {
for _, needle := range []string{ for _, needle := range []string{
`Benchmark Results`, `Benchmark Results`,
`Composite score by saved benchmark run and GPU.`, `Composite score by saved benchmark run and GPU.`,
`GPU #0 — NVIDIA H100 PCIe`, `GPU 0`,
`GPU #1 — NVIDIA H100 PCIe`, `GPU 1`,
`#1`, `#1`,
wantTime, wantTime,
`1176.25`, `1176.25`,

View File

@@ -126,6 +126,9 @@ type taskParams struct {
BenchmarkProfile string `json:"benchmark_profile,omitempty"` BenchmarkProfile string `json:"benchmark_profile,omitempty"`
RunNCCL bool `json:"run_nccl,omitempty"` RunNCCL bool `json:"run_nccl,omitempty"`
ParallelGPUs bool `json:"parallel_gpus,omitempty"` ParallelGPUs bool `json:"parallel_gpus,omitempty"`
RampStep int `json:"ramp_step,omitempty"`
RampTotal int `json:"ramp_total,omitempty"`
RampRunID string `json:"ramp_run_id,omitempty"`
DisplayName string `json:"display_name,omitempty"` DisplayName string `json:"display_name,omitempty"`
Device string `json:"device,omitempty"` // for install Device string `json:"device,omitempty"` // for install
PlatformComponents []string `json:"platform_components,omitempty"` PlatformComponents []string `json:"platform_components,omitempty"`
@@ -152,6 +155,12 @@ type burnPreset struct {
DurationSec int DurationSec int
} }
type nvidiaRampSpec struct {
DurationSec int
StaggerSeconds int
TotalDurationSec int
}
func resolveBurnPreset(profile string) burnPreset { func resolveBurnPreset(profile string) burnPreset {
switch profile { switch profile {
case "overnight": case "overnight":
@@ -163,11 +172,43 @@ func resolveBurnPreset(profile string) burnPreset {
} }
} }
func boolToNvidiaStaggerSeconds(enabled bool, selected []int) int { func resolveNvidiaRampPlan(profile string, enabled bool, selected []int) (nvidiaRampSpec, error) {
if enabled && len(selected) > 1 { base := resolveBurnPreset(profile).DurationSec
return 180 plan := nvidiaRampSpec{
DurationSec: base,
TotalDurationSec: base,
} }
return 0 if !enabled {
return plan, nil
}
count := len(selected)
if count == 0 {
return nvidiaRampSpec{}, fmt.Errorf("staggered NVIDIA burn requires explicit GPU selection")
}
if count == 1 {
return plan, nil
}
switch profile {
case "acceptance":
plan.StaggerSeconds = 10 * 60
plan.TotalDurationSec = plan.DurationSec + plan.StaggerSeconds*(count-1)
case "overnight":
plan.StaggerSeconds = 60 * 60
plan.TotalDurationSec = 8 * 60 * 60
minTotal := count * 60 * 60
if plan.TotalDurationSec < minTotal {
plan.TotalDurationSec = minTotal
}
if plan.TotalDurationSec > 10*60*60 {
return nvidiaRampSpec{}, fmt.Errorf("overnight staggered NVIDIA burn supports at most 10 GPUs")
}
plan.DurationSec = plan.TotalDurationSec - plan.StaggerSeconds*(count-1)
default:
plan.StaggerSeconds = 2 * 60
plan.TotalDurationSec = plan.DurationSec + plan.StaggerSeconds*(count-1)
}
return plan, nil
} }
func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions { func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions {
@@ -599,8 +640,11 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
ExcludeGPUIndices: t.params.ExcludeGPUIndices, ExcludeGPUIndices: t.params.ExcludeGPUIndices,
RunNCCL: t.params.RunNCCL, RunNCCL: t.params.RunNCCL,
ParallelGPUs: t.params.ParallelGPUs, ParallelGPUs: t.params.ParallelGPUs,
RampStep: t.params.RampStep,
RampTotal: t.params.RampTotal,
RampRunID: t.params.RampRunID,
}, j.append) }, j.append)
case "nvidia-compute": case "nvidia-compute":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
break break
@@ -609,11 +653,18 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
staggerSec := boolToNvidiaStaggerSeconds(t.params.StaggerGPUStart, t.params.GPUIndices) rampPlan, planErr := resolveNvidiaRampPlan(t.params.BurnProfile, t.params.StaggerGPUStart, t.params.GPUIndices)
if staggerSec > 0 { if planErr != nil {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU", staggerSec)) err = planErr
} break
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, staggerSec, j.append) }
if t.params.BurnProfile != "" && t.params.StaggerGPUStart && dur <= 0 {
dur = rampPlan.DurationSec
}
if rampPlan.StaggerSeconds > 0 {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU; post-ramp hold: %ds; total runtime: %ds", rampPlan.StaggerSeconds, dur, rampPlan.TotalDurationSec))
}
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, rampPlan.StaggerSeconds, j.append)
case "nvidia-targeted-power": case "nvidia-targeted-power":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
@@ -663,13 +714,24 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{ rampPlan, planErr := resolveNvidiaRampPlan(t.params.BurnProfile, t.params.StaggerGPUStart, t.params.GPUIndices)
DurationSec: dur, if planErr != nil {
Loader: t.params.Loader, err = planErr
GPUIndices: t.params.GPUIndices, break
ExcludeGPUIndices: t.params.ExcludeGPUIndices, }
StaggerSeconds: boolToNvidiaStaggerSeconds(t.params.StaggerGPUStart, t.params.GPUIndices), if t.params.BurnProfile != "" && t.params.StaggerGPUStart && dur <= 0 {
}, j.append) dur = rampPlan.DurationSec
}
if rampPlan.StaggerSeconds > 0 {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU; post-ramp hold: %ds; total runtime: %ds", rampPlan.StaggerSeconds, dur, rampPlan.TotalDurationSec))
}
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{
DurationSec: dur,
Loader: t.params.Loader,
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
StaggerSeconds: rampPlan.StaggerSeconds,
}, j.append)
case "memory": case "memory":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")

View File

@@ -422,7 +422,7 @@ func TestWriteTaskReportArtifactsIncludesBenchmarkResultsForTask(t *testing.T) {
for _, needle := range []string{ for _, needle := range []string{
`Benchmark Results`, `Benchmark Results`,
`Composite score for this benchmark task.`, `Composite score for this benchmark task.`,
`GPU #0 — NVIDIA H100 PCIe`, `GPU 0`,
`1176.25`, `1176.25`,
} { } {
if !strings.Contains(html, needle) { if !strings.Contains(html, needle) {
@@ -491,6 +491,83 @@ func TestResolveBurnPreset(t *testing.T) {
} }
} }
func TestResolveNvidiaRampPlan(t *testing.T) {
tests := []struct {
name string
profile string
enabled bool
selected []int
want nvidiaRampSpec
wantErr string
}{
{
name: "disabled uses base preset",
profile: "acceptance",
selected: []int{0, 1},
want: nvidiaRampSpec{DurationSec: 60 * 60, TotalDurationSec: 60 * 60},
},
{
name: "smoke ramp uses two minute steps",
profile: "smoke",
enabled: true,
selected: []int{0, 1, 2},
want: nvidiaRampSpec{DurationSec: 5 * 60, StaggerSeconds: 2 * 60, TotalDurationSec: 9 * 60},
},
{
name: "acceptance ramp uses ten minute steps",
profile: "acceptance",
enabled: true,
selected: []int{0, 1, 2},
want: nvidiaRampSpec{DurationSec: 60 * 60, StaggerSeconds: 10 * 60, TotalDurationSec: 80 * 60},
},
{
name: "overnight stays at eight hours when possible",
profile: "overnight",
enabled: true,
selected: []int{0, 1, 2},
want: nvidiaRampSpec{DurationSec: 6 * 60 * 60, StaggerSeconds: 60 * 60, TotalDurationSec: 8 * 60 * 60},
},
{
name: "overnight extends to keep one hour after final gpu",
profile: "overnight",
enabled: true,
selected: []int{0, 1, 2, 3, 4, 5, 6, 7, 8},
want: nvidiaRampSpec{DurationSec: 60 * 60, StaggerSeconds: 60 * 60, TotalDurationSec: 9 * 60 * 60},
},
{
name: "overnight rejects impossible gpu count",
profile: "overnight",
enabled: true,
selected: []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
wantErr: "at most 10 GPUs",
},
{
name: "enabled requires explicit selection",
profile: "smoke",
enabled: true,
wantErr: "requires explicit GPU selection",
},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
got, err := resolveNvidiaRampPlan(tc.profile, tc.enabled, tc.selected)
if tc.wantErr != "" {
if err == nil || !strings.Contains(err.Error(), tc.wantErr) {
t.Fatalf("err=%v want substring %q", err, tc.wantErr)
}
return
}
if err != nil {
t.Fatalf("resolveNvidiaRampPlan error: %v", err)
}
if got != tc.want {
t.Fatalf("resolveNvidiaRampPlan(%q, %t, %v)=%+v want %+v", tc.profile, tc.enabled, tc.selected, got, tc.want)
}
})
}
}
func TestTaskDisplayNameUsesNvidiaStressLoader(t *testing.T) { func TestTaskDisplayNameUsesNvidiaStressLoader(t *testing.T) {
tests := []struct { tests := []struct {
loader string loader string

View File

@@ -1121,6 +1121,7 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
int cc_minor, int cc_minor,
int seconds, int seconds,
int size_mb, int size_mb,
const char *precision_filter,
struct stress_report *report) { struct stress_report *report) {
struct cublaslt_api cublas; struct cublaslt_api cublas;
struct prepared_profile prepared[MAX_STRESS_STREAMS * MAX_CUBLAS_PROFILES]; struct prepared_profile prepared[MAX_STRESS_STREAMS * MAX_CUBLAS_PROFILES];
@@ -1159,7 +1160,8 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
} }
for (size_t i = 0; i < sizeof(k_profiles) / sizeof(k_profiles[0]); i++) { for (size_t i = 0; i < sizeof(k_profiles) / sizeof(k_profiles[0]); i++) {
if (k_profiles[i].enabled && cc >= k_profiles[i].min_cc) { if (k_profiles[i].enabled && cc >= k_profiles[i].min_cc &&
(precision_filter == NULL || strcmp(k_profiles[i].block_label, precision_filter) == 0)) {
planned++; planned++;
} }
} }
@@ -1218,6 +1220,13 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
desc->min_cc); desc->min_cc);
continue; continue;
} }
if (precision_filter != NULL && strcmp(desc->block_label, precision_filter) != 0) {
append_detail(report->details,
sizeof(report->details),
"%s=SKIPPED precision_filter\n",
desc->name);
continue;
}
for (int lane = 0; lane < stream_count; lane++) { for (int lane = 0; lane < stream_count; lane++) {
CUstream stream = streams[lane]; CUstream stream = streams[lane];
if (prepared_count >= (int)(sizeof(prepared) / sizeof(prepared[0]))) { if (prepared_count >= (int)(sizeof(prepared) / sizeof(prepared[0]))) {
@@ -1339,6 +1348,7 @@ int main(int argc, char **argv) {
int seconds = 5; int seconds = 5;
int size_mb = 64; int size_mb = 64;
int device_index = 0; int device_index = 0;
const char *precision_filter = NULL; /* NULL = all; else block_label to match */
for (int i = 1; i < argc; i++) { for (int i = 1; i < argc; i++) {
if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) { if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) {
seconds = atoi(argv[++i]); seconds = atoi(argv[++i]);
@@ -1346,8 +1356,12 @@ int main(int argc, char **argv) {
size_mb = atoi(argv[++i]); size_mb = atoi(argv[++i]);
} else if ((strcmp(argv[i], "--device") == 0 || strcmp(argv[i], "-d") == 0) && i + 1 < argc) { } else if ((strcmp(argv[i], "--device") == 0 || strcmp(argv[i], "-d") == 0) && i + 1 < argc) {
device_index = atoi(argv[++i]); device_index = atoi(argv[++i]);
} else if (strcmp(argv[i], "--precision") == 0 && i + 1 < argc) {
precision_filter = argv[++i];
} else { } else {
fprintf(stderr, "usage: %s [--seconds N] [--size-mb N] [--device N]\n", argv[0]); fprintf(stderr,
"usage: %s [--seconds N] [--size-mb N] [--device N] [--precision fp8|fp16|fp32|fp64|fp4]\n",
argv[0]);
return 2; return 2;
} }
} }
@@ -1407,7 +1421,7 @@ int main(int argc, char **argv) {
int ok = 0; int ok = 0;
#if HAVE_CUBLASLT_HEADERS #if HAVE_CUBLASLT_HEADERS
ok = run_cublaslt_stress(&cuda, dev, name, cc_major, cc_minor, seconds, size_mb, &report); ok = run_cublaslt_stress(&cuda, dev, name, cc_major, cc_minor, seconds, size_mb, precision_filter, &report);
#endif #endif
if (!ok) { if (!ok) {
if (!run_ptx_fallback(&cuda, dev, name, cc_major, cc_minor, seconds, size_mb, &report)) { if (!run_ptx_fallback(&cuda, dev, name, cc_major, cc_minor, seconds, size_mb, &report)) {

View File

@@ -874,8 +874,20 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
CUBLAS_CACHE="${DIST_DIR}/cublas-${CUBLAS_VERSION}+cuda${NCCL_CUDA_VERSION}" CUBLAS_CACHE="${DIST_DIR}/cublas-${CUBLAS_VERSION}+cuda${NCCL_CUDA_VERSION}"
GPU_STRESS_NEED_BUILD=1 GPU_STRESS_NEED_BUILD=1
if [ -f "$GPU_BURN_WORKER_BIN" ] && [ "${BUILDER_DIR}/bee-gpu-stress.c" -ot "$GPU_BURN_WORKER_BIN" ]; then if [ -f "$GPU_BURN_WORKER_BIN" ]; then
GPU_STRESS_NEED_BUILD=0 GPU_STRESS_NEED_BUILD=0
for dep in \
"${BUILDER_DIR}/bee-gpu-stress.c" \
"${BUILDER_DIR}/VERSIONS"; do
if [ "$dep" -nt "$GPU_BURN_WORKER_BIN" ]; then
GPU_STRESS_NEED_BUILD=1
break
fi
done
if [ "$GPU_STRESS_NEED_BUILD" = "0" ] && \
find "${CUBLAS_CACHE}/include" "${CUBLAS_CACHE}/lib" -type f -newer "$GPU_BURN_WORKER_BIN" | grep -q .; then
GPU_STRESS_NEED_BUILD=1
fi
fi fi
if [ "$GPU_STRESS_NEED_BUILD" = "1" ]; then if [ "$GPU_STRESS_NEED_BUILD" = "1" ]; then

View File

@@ -11,18 +11,18 @@ echo " Hardware Audit LiveCD"
echo "" echo ""
menuentry "EASY-BEE" { menuentry "EASY-BEE" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@ initrd @INITRD_LIVE@
} }
submenu "EASY-BEE (advanced options) -->" { submenu "EASY-BEE (advanced options) -->" {
menuentry "EASY-BEE — GSP=off" { menuentry "EASY-BEE — GSP=off" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@ initrd @INITRD_LIVE@
} }
menuentry "EASY-BEE — KMS (no nomodeset)" { menuentry "EASY-BEE — KMS (no nomodeset)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@ initrd @INITRD_LIVE@
} }

View File

@@ -3,31 +3,31 @@ label live-@FLAVOUR@-normal
menu default menu default
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=normal append @APPEND_LIVE@ bee.nvidia.mode=normal pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-kms label live-@FLAVOUR@-kms
menu label EASY-BEE (^graphics/KMS) menu label EASY-BEE (^graphics/KMS)
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=normal append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=normal pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-toram label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM) menu label EASY-BEE (^load to RAM)
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ toram bee.nvidia.mode=normal append @APPEND_LIVE@ toram bee.nvidia.mode=normal pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-gsp-off label live-@FLAVOUR@-gsp-off
menu label EASY-BEE (^NVIDIA GSP=off) menu label EASY-BEE (^NVIDIA GSP=off)
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off append @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-kms-gsp-off label live-@FLAVOUR@-kms-gsp-off
menu label EASY-BEE (g^raphics/KMS, GSP=off) menu label EASY-BEE (g^raphics/KMS, GSP=off)
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=gsp-off append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=gsp-off pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-failsafe label live-@FLAVOUR@-failsafe
menu label EASY-BEE (^fail-safe) menu label EASY-BEE (^fail-safe)

View File

@@ -25,6 +25,7 @@ ensure_bee_console_user() {
ensure_bee_console_user ensure_bee_console_user
# Enable common bee services # Enable common bee services
systemctl enable bee-hpc-tuning.service
systemctl enable bee-network.service systemctl enable bee-network.service
systemctl enable bee-preflight.service systemctl enable bee-preflight.service
systemctl enable bee-audit.service systemctl enable bee-audit.service
@@ -55,6 +56,7 @@ fi
# nogpu: no GPU services needed # nogpu: no GPU services needed
# Ensure scripts are executable # Ensure scripts are executable
chmod +x /usr/local/bin/bee-hpc-tuning 2>/dev/null || true
chmod +x /usr/local/bin/bee-network.sh 2>/dev/null || true chmod +x /usr/local/bin/bee-network.sh 2>/dev/null || true
chmod +x /usr/local/bin/bee-sshsetup 2>/dev/null || true chmod +x /usr/local/bin/bee-sshsetup 2>/dev/null || true
chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true

View File

@@ -0,0 +1,14 @@
[Unit]
Description=Bee: HPC tuning (CPU governor, C-states)
After=local-fs.target
Before=bee-nvidia.service bee-audit.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-hpc-tuning.log /usr/local/bin/bee-hpc-tuning
StandardOutput=journal
StandardError=journal
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

View File

@@ -6,10 +6,11 @@ STAGGER_SECONDS=0
SIZE_MB=0 SIZE_MB=0
DEVICES="" DEVICES=""
EXCLUDE="" EXCLUDE=""
PRECISION=""
WORKER="/usr/local/lib/bee/bee-gpu-burn-worker" WORKER="/usr/local/lib/bee/bee-gpu-burn-worker"
usage() { usage() {
echo "usage: $0 [--seconds N] [--stagger-seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3]" >&2 echo "usage: $0 [--seconds N] [--stagger-seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3] [--precision fp8|fp16|fp32|fp64|fp4]" >&2
exit 2 exit 2
} }
@@ -30,6 +31,7 @@ while [ "$#" -gt 0 ]; do
--size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;; --size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;; --devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;; --exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
--precision) [ "$#" -ge 2 ] || usage; PRECISION="$2"; shift 2 ;;
*) usage ;; *) usage ;;
esac esac
done done
@@ -88,8 +90,10 @@ for id in $(echo "${FINAL}" | tr ',' ' '); do
extra_sec=$(( STAGGER_SECONDS * (GPU_COUNT - gpu_pos) )) extra_sec=$(( STAGGER_SECONDS * (GPU_COUNT - gpu_pos) ))
gpu_seconds=$(( SECONDS + extra_sec )) gpu_seconds=$(( SECONDS + extra_sec ))
echo "starting gpu ${id} size=${gpu_size_mb}MB seconds=${gpu_seconds}" echo "starting gpu ${id} size=${gpu_size_mb}MB seconds=${gpu_seconds}"
precision_arg=""
[ -n "${PRECISION}" ] && precision_arg="--precision ${PRECISION}"
CUDA_VISIBLE_DEVICES="${id}" \ CUDA_VISIBLE_DEVICES="${id}" \
"${WORKER}" --device 0 --seconds "${gpu_seconds}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 & "${WORKER}" --device 0 --seconds "${gpu_seconds}" --size-mb "${gpu_size_mb}" ${precision_arg} >"${log}" 2>&1 &
pid=$! pid=$!
WORKERS="${WORKERS} ${pid}:${id}:${log}" WORKERS="${WORKERS} ${pid}:${id}:${log}"
if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${gpu_pos}" -lt "${GPU_COUNT}" ]; then if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${gpu_pos}" -lt "${GPU_COUNT}" ]; then

View File

@@ -0,0 +1,41 @@
#!/bin/sh
# bee-hpc-tuning — apply HPC tuning for deterministic benchmarking
# Called by bee-hpc-tuning.service at boot.
log() { echo "[bee-hpc-tuning] $*"; }
# ── CPU governor ────────────────────────────────────────────────────────────
# Set all CPU cores to performance governor via sysfs.
# cpupower is not available; write directly to scaling_governor.
governor_ok=0
governor_fail=0
for gov_path in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
[ -f "$gov_path" ] || continue
if echo performance > "$gov_path" 2>/dev/null; then
governor_ok=$((governor_ok + 1))
else
governor_fail=$((governor_fail + 1))
fi
done
if [ "$governor_ok" -gt 0 ] && [ "$governor_fail" -eq 0 ]; then
log "CPU governor set to performance on ${governor_ok} core(s)"
elif [ "$governor_ok" -gt 0 ]; then
log "WARN: CPU governor: ${governor_ok} OK, ${governor_fail} failed"
elif [ "$governor_fail" -gt 0 ]; then
log "WARN: failed to set CPU governor on ${governor_fail} core(s)"
else
log "WARN: no cpufreq scaling_governor paths found (C-state governor or HW-controlled)"
fi
# ── Transparent Huge Pages ───────────────────────────────────────────────────
# Kernel cmdline sets transparent_hugepage=always at boot, but confirm and log.
thp_path=/sys/kernel/mm/transparent_hugepage/enabled
if [ -f "$thp_path" ]; then
current=$(cat "$thp_path" 2>/dev/null)
log "transparent_hugepage: ${current}"
else
log "WARN: transparent_hugepage sysfs path not found"
fi
log "done"