Compare commits

...

23 Commits
v5.8 ... v6.3

Author SHA1 Message Date
Mikhail Chusavitin
93cfa78e8c Benchmark: parallel GPU mode, resilient inventory query, server model in results
- Add parallel GPU mode (checkbox, off by default): runs all selected GPUs
  simultaneously via a single bee-gpu-burn invocation instead of sequentially;
  per-GPU telemetry, throttle counters, TOPS, and scoring are preserved
- Make queryBenchmarkGPUInfo resilient: falls back to a base field set when
  extended fields (attribute.multiprocessor_count, power.default_limit) cause
  exit status 2, preventing lgc normalization from being silently skipped
- Log explicit "graphics clock lock skipped" note when inventory is unavailable
- Collect server model from DMI (/sys/class/dmi/id/product_name) and store in
  result JSON; benchmark history columns now show "Server Model (N× GPU Model)"
  grouped by server+GPU type rather than individual GPU index

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 18:32:15 +03:00
Mikhail Chusavitin
1358485f2b fix logo wallpaper 2026-04-07 10:15:38 +03:00
8fe20ba678 Fix benchmark scoring: PowerSustain uses default power limit
PowerSustainScore now uses DefaultPowerLimitW as reference so a
manually reduced power limit does not inflate the score. Falls back
to enforced limit if default is unavailable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:30:59 +03:00
d973231f37 Enhance benchmark: server power via IPMI, efficiency metrics, FP64, power limit check
- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
  compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
  from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:26:52 +03:00
f5d175f488 Fix toram: patch live-boot to not use O_DIRECT when replacing loop to tmpfs
losetup --replace --direct-io=on fails with EINVAL when the target file
is on tmpfs (/dev/shm), because tmpfs does not support O_DIRECT.
Strip the --direct-io flag from the replace call and downgrade the
verification failure to a warning so boot continues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 21:06:21 +03:00
fa00667750 Refactor NVIDIA GPU Selection into standalone card on validate page
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 21:06:16 +03:00
Mikhail Chusavitin
c7d2816a7f Limit NVIDIA legacy boot hooks to proprietary ISO 2026-04-06 16:33:16 +03:00
Mikhail Chusavitin
d2eadedff2 Default NVIDIA ISO to open modules and add nvidia-legacy 2026-04-06 16:27:13 +03:00
Mikhail Chusavitin
a98c4d7461 Include terminal charts in benchmark report 2026-04-06 12:34:57 +03:00
Mikhail Chusavitin
2354ae367d Normalize task IDs and artifact folder prefixes 2026-04-06 12:26:47 +03:00
Mikhail Chusavitin
0d0e1f55a7 Avoid misleading SAT summaries after task cancellation 2026-04-06 12:24:19 +03:00
Mikhail Chusavitin
35f4c53887 Stabilize NVIDIA GPU device mapping across loaders 2026-04-06 12:22:04 +03:00
Mikhail Chusavitin
981315e6fd Split NVIDIA tasks by homogeneous GPU groups 2026-04-06 11:58:13 +03:00
Mikhail Chusavitin
fc5c100a29 Fix NVIDIA persistence mode and add benchmark results table 2026-04-06 10:47:07 +03:00
6e94216f3b Hide task charts while pending 2026-04-05 22:34:34 +03:00
53455063b9 Stabilize live task detail page 2026-04-05 22:14:52 +03:00
4602f97836 Enforce sequential task orchestration 2026-04-05 22:10:42 +03:00
c65d3ae3b1 Add nomodeset to default GRUB entry — fix black screen on headless servers
Servers with NVIDIA compute GPUs (H100 etc.) have no display output,
so KMS blanks the console. nomodeset disables kernel modesetting and
lets the NVIDIA proprietary driver handle display via Xorg.

KMS variant moved to advanced submenu for cases where it is needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 21:40:47 +03:00
7a21c370e4 Handle NVIDIA GSP firmware init hang with timeout fallback
- bee-nvidia-load: run insmod in background, poll /proc/devices for
  nvidiactl; if GSP init doesn't complete in 90s, kill insmod and retry
  with NVreg_EnableGpuFirmware=0. Handles EBUSY case with clear error.
- Write /run/bee-nvidia-mode (gsp-on/gsp-off/gsp-stuck) for audit layer
- Show GSP mode badge in sidebar: yellow for gsp-off, red for gsp-stuck
- Report NvidiaGSPMode in RuntimeHealth with issue entries
- Simplify GRUB menu: default (KMS+GSP), advanced submenu (GSP=off,
  nomodeset, fail-safe), remove load-to-RAM entry
- Add pcmanfm, ristretto, mupdf, mousepad to desktop packages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 21:00:43 +03:00
a493e3ab5b Fix service control buttons: sudo, real error output, UX feedback
- services.go: use sudo systemctl so bee user can control system services
- api.go: always return 200 with output field even on error, so the
  frontend shows the actual systemctl message instead of "exit status 1"
- pages.go: button shows "..." while pending then restores label;
  output panel is full-width under the table with ✓/✗ status indicator;
  output auto-scrolls to bottom

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:25:41 +03:00
19b4803ec7 Pass exact cycle duration to GPU stress instead of 86400s sentinel
bee-gpu-burn now receives --seconds <LoadSec> so it exits naturally
when the cycle ends, rather than relying solely on context cancellation
to kill it. Process group kill (Setpgid+Cancel) is kept as a safety net
for early cancellation (user stop, context timeout). Same fix for AMD
RVS which now gets duration_ms = LoadSec * 1000.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:22:43 +03:00
1bdfb1e9ca Fix nvidia-targeted-stress failing with DCGM_ST_IN_USE (-34)
nvvs (DCGM validation suite) survives when dcgmi is killed mid-run,
leaving the GPU occupied. The next dcgmi diag invocation then fails
with "affected resource is in use".

Two-part fix:
- Add nvvs and dcgmi to KillTestWorkers patterns so they are cleaned
  up by the global cancel handler
- Call KillTestWorkers at the start of RunNvidiaTargetedStressValidatePack
  to clear any stale processes before dcgmi diag runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:21:36 +03:00
c5d6b30177 Fix platform thermal cycling leaving GPU load running after test ends
bee-gpu-burn is a shell script that spawns bee-gpu-burn-worker children.
exec.CommandContext default cancel only kills the shell parent; the worker
processes survive and keep loading the GPU indefinitely.

Fix: set Setpgid=true and a custom Cancel that sends SIGKILL to the
entire process group (-pid), same pattern already used in runSATCommandCtx.
Applied to Nvidia, AMD, and CPU stress commands for consistency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:19:20 +03:00
35 changed files with 2883 additions and 445 deletions

View File

@@ -27,14 +27,17 @@ type benchmarkProfileSpec struct {
}
type benchmarkGPUInfo struct {
Index int
UUID string
Name string
BusID string
VBIOS string
PowerLimitW float64
MaxGraphicsClockMHz float64
MaxMemoryClockMHz float64
Index int
UUID string
Name string
BusID string
VBIOS string
PowerLimitW float64
DefaultPowerLimitW float64
MaxGraphicsClockMHz float64
MaxMemoryClockMHz float64
BaseGraphicsClockMHz float64
MultiprocessorCount int
}
type benchmarkBurnProfile struct {
@@ -102,7 +105,9 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
BenchmarkVersion: benchmarkVersion,
GeneratedAt: time.Now().UTC(),
Hostname: hostname,
ServerModel: readServerModel(),
BenchmarkProfile: spec.Name,
ParallelGPUs: opts.ParallelGPUs,
SelectedGPUIndices: append([]int(nil), selected...),
Normalization: BenchmarkNormalization{
Status: "full",
@@ -111,6 +116,11 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
logFunc(fmt.Sprintf("NVIDIA benchmark profile=%s gpus=%s", spec.Name, joinIndexList(selected)))
// Server power characterization state — populated during per-GPU phases.
var serverIdleW, serverLoadedWSum float64
var serverIdleOK, serverLoadedOK bool
var serverLoadedSamples int
infoByIndex, infoErr := queryBenchmarkGPUInfo(selected)
if infoErr != nil {
result.Warnings = append(result.Warnings, "gpu inventory query failed: "+infoErr.Error())
@@ -135,6 +145,10 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
}
}()
if opts.ParallelGPUs {
runNvidiaBenchmarkParallel(ctx, verboseLog, runDir, selected, infoByIndex, opts, spec, logFunc, &result, &serverIdleW, &serverLoadedWSum, &serverIdleOK, &serverLoadedOK, &serverLoadedSamples)
} else {
for _, idx := range selected {
gpuResult := BenchmarkGPUResult{
Index: idx,
@@ -146,7 +160,10 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
gpuResult.BusID = info.BusID
gpuResult.VBIOS = info.VBIOS
gpuResult.PowerLimitW = info.PowerLimitW
gpuResult.MultiprocessorCount = info.MultiprocessorCount
gpuResult.DefaultPowerLimitW = info.DefaultPowerLimitW
gpuResult.MaxGraphicsClockMHz = info.MaxGraphicsClockMHz
gpuResult.BaseGraphicsClockMHz = info.BaseGraphicsClockMHz
gpuResult.MaxMemoryClockMHz = info.MaxMemoryClockMHz
}
if norm := findBenchmarkNormalization(result.Normalization.GPUs, idx); norm != nil {
@@ -161,6 +178,15 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
gpuResult.Baseline = summarizeBenchmarkTelemetry(baselineRows)
writeBenchmarkMetricsFiles(runDir, fmt.Sprintf("gpu-%d-baseline", idx), baselineRows)
// Sample server idle power once (first GPU only — server state is global).
if !serverIdleOK {
if w, ok := sampleIPMIPowerSeries(ctx, maxInt(spec.BaselineSec, 10)); ok {
serverIdleW = w
serverIdleOK = true
logFunc(fmt.Sprintf("server idle power (IPMI): %.0f W", w))
}
}
warmupCmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(spec.WarmupSec),
@@ -184,7 +210,50 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
"--devices", strconv.Itoa(idx),
}
logFunc(fmt.Sprintf("GPU %d: steady compute (%ds)", idx, spec.SteadySec))
// Sample server power via IPMI in parallel with the steady phase.
// We collect readings every 5s and average them.
ipmiStopCh := make(chan struct{})
ipmiResultCh := make(chan float64, 1)
go func() {
defer close(ipmiResultCh)
var samples []float64
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
// First sample after a short warmup delay.
select {
case <-ipmiStopCh:
return
case <-time.After(15 * time.Second):
}
for {
if w, err := queryIPMIServerPowerW(); err == nil {
samples = append(samples, w)
}
select {
case <-ipmiStopCh:
if len(samples) > 0 {
var sum float64
for _, w := range samples {
sum += w
}
ipmiResultCh <- sum / float64(len(samples))
}
return
case <-ticker.C:
}
}
}()
steadyOut, steadyRows, steadyErr := runBenchmarkCommandWithMetrics(ctx, verboseLog, fmt.Sprintf("gpu-%d-steady.log", idx), steadyCmd, nil, []int{idx}, runDir, fmt.Sprintf("gpu-%d-steady", idx), logFunc)
close(ipmiStopCh)
if loadedW, ok := <-ipmiResultCh; ok {
serverLoadedWSum += loadedW
serverLoadedSamples++
serverLoadedOK = true
logFunc(fmt.Sprintf("GPU %d: server loaded power (IPMI): %.0f W", idx, loadedW))
}
_ = os.WriteFile(filepath.Join(runDir, fmt.Sprintf("gpu-%d-steady.log", idx)), steadyOut, 0644)
afterThrottle, _ := queryThrottleCounters(idx)
if steadyErr != nil {
@@ -222,6 +291,8 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
result.GPUs = append(result.GPUs, finalizeBenchmarkGPUResult(gpuResult))
}
} // end sequential path
if len(selected) > 1 && opts.RunNCCL {
result.Interconnect = runBenchmarkInterconnect(ctx, verboseLog, runDir, selected, spec, logFunc)
if result.Interconnect != nil && result.Interconnect.Supported {
@@ -232,6 +303,17 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
}
}
// Compute server power characterization from accumulated IPMI samples.
var gpuReportedSumW float64
for _, gpu := range result.GPUs {
gpuReportedSumW += gpu.Steady.AvgPowerW
}
var serverLoadedW float64
if serverLoadedSamples > 0 {
serverLoadedW = serverLoadedWSum / float64(serverLoadedSamples)
}
result.ServerPower = characterizeServerPower(serverIdleW, serverLoadedW, gpuReportedSumW, serverIdleOK && serverLoadedOK)
result.Findings = buildBenchmarkFindings(result)
result.OverallStatus = benchmarkOverallStatus(result)
@@ -243,7 +325,7 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
return "", fmt.Errorf("write result.json: %w", err)
}
report := renderBenchmarkReport(result)
report := renderBenchmarkReportWithCharts(result, loadBenchmarkReportCharts(runDir, selected))
if err := os.WriteFile(filepath.Join(runDir, "report.txt"), []byte(report), 0644); err != nil {
return "", fmt.Errorf("write report.txt: %w", err)
}
@@ -288,50 +370,87 @@ func resolveBenchmarkProfile(profile string) benchmarkProfileSpec {
}
}
func queryBenchmarkGPUInfo(gpuIndices []int) (map[int]benchmarkGPUInfo, error) {
args := []string{
"--query-gpu=index,uuid,name,pci.bus_id,vbios_version,power.limit,clocks.max.graphics,clocks.max.memory",
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
args = append([]string{"--id=" + joinIndexList(gpuIndices)}, args...)
}
out, err := satExecCommand("nvidia-smi", args...).Output()
if err != nil {
return nil, fmt.Errorf("nvidia-smi gpu info: %w", err)
}
r := csv.NewReader(strings.NewReader(string(out)))
r.TrimLeadingSpace = true
r.FieldsPerRecord = -1
rows, err := r.ReadAll()
if err != nil {
return nil, fmt.Errorf("parse nvidia-smi gpu info: %w", err)
}
infoByIndex := make(map[int]benchmarkGPUInfo, len(rows))
for _, row := range rows {
if len(row) < 8 {
continue
}
idx, err := strconv.Atoi(strings.TrimSpace(row[0]))
if err != nil {
continue
}
infoByIndex[idx] = benchmarkGPUInfo{
Index: idx,
UUID: strings.TrimSpace(row[1]),
Name: strings.TrimSpace(row[2]),
BusID: strings.TrimSpace(row[3]),
VBIOS: strings.TrimSpace(row[4]),
PowerLimitW: parseBenchmarkFloat(row[5]),
MaxGraphicsClockMHz: parseBenchmarkFloat(row[6]),
MaxMemoryClockMHz: parseBenchmarkFloat(row[7]),
}
}
return infoByIndex, nil
// benchmarkGPUInfoQuery describes a nvidia-smi --query-gpu field set to try.
// Fields are tried in order; the first successful query wins. Extended fields
// (attribute.multiprocessor_count, power.default_limit) are not supported on
// all driver versions, so we fall back to the base set if the full query fails.
var benchmarkGPUInfoQueries = []struct {
fields string
extended bool // whether this query includes optional extended fields
}{
{
fields: "index,uuid,name,pci.bus_id,vbios_version,power.limit,clocks.max.graphics,clocks.max.memory,clocks.base.graphics,attribute.multiprocessor_count,power.default_limit",
extended: true,
},
{
fields: "index,uuid,name,pci.bus_id,vbios_version,power.limit,clocks.max.graphics,clocks.max.memory,clocks.base.graphics",
extended: false,
},
}
func queryBenchmarkGPUInfo(gpuIndices []int) (map[int]benchmarkGPUInfo, error) {
var lastErr error
for _, q := range benchmarkGPUInfoQueries {
args := []string{
"--query-gpu=" + q.fields,
"--format=csv,noheader,nounits",
}
if len(gpuIndices) > 0 {
args = append([]string{"--id=" + joinIndexList(gpuIndices)}, args...)
}
out, err := satExecCommand("nvidia-smi", args...).Output()
if err != nil {
lastErr = fmt.Errorf("nvidia-smi gpu info (%s): %w", q.fields[:min(len(q.fields), 40)], err)
continue
}
r := csv.NewReader(strings.NewReader(string(out)))
r.TrimLeadingSpace = true
r.FieldsPerRecord = -1
rows, err := r.ReadAll()
if err != nil {
lastErr = fmt.Errorf("parse nvidia-smi gpu info: %w", err)
continue
}
infoByIndex := make(map[int]benchmarkGPUInfo, len(rows))
for _, row := range rows {
if len(row) < 9 {
continue
}
idx, err := strconv.Atoi(strings.TrimSpace(row[0]))
if err != nil {
continue
}
info := benchmarkGPUInfo{
Index: idx,
UUID: strings.TrimSpace(row[1]),
Name: strings.TrimSpace(row[2]),
BusID: strings.TrimSpace(row[3]),
VBIOS: strings.TrimSpace(row[4]),
PowerLimitW: parseBenchmarkFloat(row[5]),
MaxGraphicsClockMHz: parseBenchmarkFloat(row[6]),
MaxMemoryClockMHz: parseBenchmarkFloat(row[7]),
}
if len(row) >= 9 {
info.BaseGraphicsClockMHz = parseBenchmarkFloat(row[8])
}
if q.extended {
if len(row) >= 10 {
info.MultiprocessorCount = int(parseBenchmarkFloat(row[9]))
}
if len(row) >= 11 {
info.DefaultPowerLimitW = parseBenchmarkFloat(row[10])
}
}
infoByIndex[idx] = info
}
return infoByIndex, nil
}
return nil, lastErr
}
func applyBenchmarkNormalization(ctx context.Context, verboseLog string, gpuIndices []int, infoByIndex map[int]benchmarkGPUInfo, result *NvidiaBenchmarkResult) []benchmarkRestoreAction {
if os.Geteuid() != 0 {
result.Normalization.Status = "partial"
@@ -370,6 +489,10 @@ func applyBenchmarkNormalization(ctx context.Context, verboseLog string, gpuIndi
_, _ = runSATCommandCtx(context.Background(), verboseLog, fmt.Sprintf("restore-gpu-%d-rgc", idxCopy), []string{"nvidia-smi", "-i", strconv.Itoa(idxCopy), "-rgc"}, nil, nil)
}})
}
} else {
rec.GPUClockLockStatus = "skipped"
rec.Notes = append(rec.Notes, "graphics clock lock skipped: gpu inventory unavailable or MaxGraphicsClockMHz=0")
result.Normalization.Status = "partial"
}
if info, ok := infoByIndex[idx]; ok && info.MaxMemoryClockMHz > 0 {
@@ -551,6 +674,8 @@ func ensureBenchmarkProfile(profiles map[string]*benchmarkBurnProfile, name stri
}
category := "other"
switch {
case strings.HasPrefix(name, "fp64"):
category = "fp64"
case strings.HasPrefix(name, "fp32"):
category = "fp32_tf32"
case strings.HasPrefix(name, "fp16"):
@@ -619,14 +744,23 @@ func scoreBenchmarkGPUResult(gpu BenchmarkGPUResult) BenchmarkScorecard {
score.ComputeScore += precision.TeraOpsPerSec
}
}
if gpu.PowerLimitW > 0 {
score.PowerSustainScore = math.Min(100, (gpu.Steady.AvgPowerW/gpu.PowerLimitW)*100)
// Use default power limit for sustain score so a manually reduced limit
// does not inflate the score. Fall back to enforced limit if default unknown.
referencePowerW := gpu.DefaultPowerLimitW
if referencePowerW <= 0 {
referencePowerW = gpu.PowerLimitW
}
if referencePowerW > 0 {
score.PowerSustainScore = math.Min(100, (gpu.Steady.AvgPowerW/referencePowerW)*100)
}
runtimeUS := math.Max(1, gpu.Steady.DurationSec*1e6)
thermalRatio := float64(gpu.Throttle.HWThermalSlowdownUS+gpu.Throttle.SWThermalSlowdownUS) / runtimeUS
score.ThermalSustainScore = clampScore(100 - thermalRatio*100)
score.StabilityScore = clampScore(100 - (gpu.Steady.ClockCVPct*4 + gpu.Steady.PowerCVPct*2 + gpu.Steady.ClockDriftPct*2))
score.CompositeScore = compositeBenchmarkScore(score)
if gpu.MultiprocessorCount > 0 && gpu.Steady.AvgGraphicsClockMHz > 0 && score.ComputeScore > 0 {
score.TOPSPerSMPerGHz = score.ComputeScore / float64(gpu.MultiprocessorCount) / (gpu.Steady.AvgGraphicsClockMHz / 1000.0)
}
return score
}
@@ -679,7 +813,10 @@ func runBenchmarkInterconnect(ctx context.Context, verboseLog, runDir string, gp
"-g", strconv.Itoa(len(gpuIndices)),
"--iters", strconv.Itoa(maxInt(20, spec.NCCLSec/10)),
}
env := []string{"CUDA_VISIBLE_DEVICES=" + joinIndexList(gpuIndices)}
env := []string{
"CUDA_DEVICE_ORDER=PCI_BUS_ID",
"CUDA_VISIBLE_DEVICES=" + joinIndexList(gpuIndices),
}
logFunc(fmt.Sprintf("NCCL interconnect: gpus=%s", joinIndexList(gpuIndices)))
out, err := runSATCommandCtx(ctx, verboseLog, "nccl-all-reduce.log", cmd, env, logFunc)
_ = os.WriteFile(filepath.Join(runDir, "nccl-all-reduce.log"), out, 0644)
@@ -795,10 +932,30 @@ func finalizeBenchmarkGPUResult(gpu BenchmarkGPUResult) BenchmarkGPUResult {
func buildBenchmarkFindings(result NvidiaBenchmarkResult) []string {
var findings []string
passed := 0
for _, gpu := range result.GPUs {
if gpu.Status == "OK" {
passed++
}
}
total := len(result.GPUs)
if total > 0 {
if passed == total {
findings = append(findings, fmt.Sprintf("All %d GPU(s) passed the benchmark.", total))
} else {
findings = append(findings, fmt.Sprintf("%d of %d GPU(s) passed the benchmark.", passed, total))
}
}
if result.Normalization.Status != "full" {
findings = append(findings, "Environment normalization was partial; compare results with caution.")
}
for _, gpu := range result.GPUs {
if gpu.Status == "FAILED" && len(gpu.DegradationReasons) == 0 {
findings = append(findings, fmt.Sprintf("GPU %d failed the benchmark (check verbose.log for details).", gpu.Index))
continue
}
if len(gpu.DegradationReasons) == 0 && gpu.Status == "OK" {
findings = append(findings, fmt.Sprintf("GPU %d held clocks without observable throttle counters during steady state.", gpu.Index))
continue
@@ -822,10 +979,24 @@ func buildBenchmarkFindings(result NvidiaBenchmarkResult) []string {
if gpu.Backend == "driver-ptx" {
findings = append(findings, fmt.Sprintf("GPU %d used driver PTX fallback; tensor score is intentionally degraded.", gpu.Index))
}
if gpu.DefaultPowerLimitW > 0 && gpu.PowerLimitW > 0 && gpu.PowerLimitW < gpu.DefaultPowerLimitW*0.95 {
findings = append(findings, fmt.Sprintf(
"GPU %d power limit %.0f W is below default %.0f W (%.0f%%). Performance may be artificially reduced.",
gpu.Index, gpu.PowerLimitW, gpu.DefaultPowerLimitW, gpu.PowerLimitW/gpu.DefaultPowerLimitW*100,
))
}
}
if result.Interconnect != nil && result.Interconnect.Supported {
findings = append(findings, fmt.Sprintf("Multi-GPU all_reduce max bus bandwidth: %.1f GB/s.", result.Interconnect.MaxBusBWGBps))
}
if sp := result.ServerPower; sp != nil && sp.Available && sp.GPUReportedSumW > 0 {
if sp.ReportingRatio < 0.75 {
findings = append(findings, fmt.Sprintf(
"GPU power reporting may be unreliable: server delta %.0f W vs GPU-reported %.0f W (ratio %.2f). GPU telemetry likely over-reports actual consumption.",
sp.DeltaW, sp.GPUReportedSumW, sp.ReportingRatio,
))
}
}
return dedupeStrings(findings)
}
@@ -1004,3 +1175,319 @@ func maxInt(a, b int) int {
}
return b
}
// queryIPMIServerPowerW reads the current server power draw via ipmitool dcmi.
// Returns 0 and an error if IPMI is unavailable or the output cannot be parsed.
func queryIPMIServerPowerW() (float64, error) {
out, err := satExecCommand("ipmitool", "dcmi", "power", "reading").Output()
if err != nil {
return 0, fmt.Errorf("ipmitool dcmi power reading: %w", err)
}
for _, line := range strings.Split(string(out), "\n") {
if strings.Contains(line, "Current Power") {
parts := strings.SplitN(line, ":", 2)
if len(parts) == 2 {
val := strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(parts[1]), "Watts"))
val = strings.TrimSpace(val)
w, err := strconv.ParseFloat(val, 64)
if err == nil && w > 0 {
return w, nil
}
}
}
}
return 0, fmt.Errorf("could not parse ipmitool dcmi power reading output")
}
// sampleIPMIPowerSeries collects IPMI power readings every 2 seconds for
// durationSec seconds. Returns the mean of all successful samples.
// Returns 0, false if IPMI is unavailable.
func sampleIPMIPowerSeries(ctx context.Context, durationSec int) (meanW float64, ok bool) {
if durationSec <= 0 {
return 0, false
}
deadline := time.Now().Add(time.Duration(durationSec) * time.Second)
var samples []float64
for {
if w, err := queryIPMIServerPowerW(); err == nil {
samples = append(samples, w)
}
if time.Now().After(deadline) {
break
}
select {
case <-ctx.Done():
break
case <-time.After(2 * time.Second):
}
}
if len(samples) == 0 {
return 0, false
}
var sum float64
for _, w := range samples {
sum += w
}
return sum / float64(len(samples)), true
}
// characterizeServerPower computes BenchmarkServerPower from idle and loaded
// IPMI samples plus the GPU-reported average power during steady state.
func characterizeServerPower(idleW, loadedW, gpuReportedSumW float64, ipmiAvailable bool) *BenchmarkServerPower {
sp := &BenchmarkServerPower{Available: ipmiAvailable}
if !ipmiAvailable {
sp.Notes = append(sp.Notes, "IPMI power reading unavailable; server-side power characterization skipped")
return sp
}
sp.IdleW = idleW
sp.LoadedW = loadedW
sp.DeltaW = loadedW - idleW
sp.GPUReportedSumW = gpuReportedSumW
if gpuReportedSumW > 0 && sp.DeltaW > 0 {
sp.ReportingRatio = sp.DeltaW / gpuReportedSumW
}
return sp
}
// readServerModel returns the DMI system product name (e.g. "SuperMicro SYS-421GE-TNRT").
// Returns empty string if unavailable (non-Linux or missing DMI entry).
func readServerModel() string {
data, err := os.ReadFile("/sys/class/dmi/id/product_name")
if err != nil {
return ""
}
return strings.TrimSpace(string(data))
}
// filterRowsByGPU returns only the metric rows for a specific GPU index.
func filterRowsByGPU(rows []GPUMetricRow, gpuIndex int) []GPUMetricRow {
var out []GPUMetricRow
for _, r := range rows {
if r.GPUIndex == gpuIndex {
out = append(out, r)
}
}
return out
}
// parseBenchmarkBurnLogByGPU splits a multi-GPU bee-gpu-burn output by [gpu N] prefix
// and returns a per-GPU parse result map.
func parseBenchmarkBurnLogByGPU(raw string) map[int]benchmarkBurnParseResult {
gpuLines := make(map[int][]string)
for _, line := range strings.Split(strings.ReplaceAll(raw, "\r\n", "\n"), "\n") {
line = strings.TrimSpace(line)
if !strings.HasPrefix(line, "[gpu ") {
continue
}
end := strings.Index(line, "] ")
if end < 0 {
continue
}
gpuIdx, err := strconv.Atoi(strings.TrimSpace(line[5:end]))
if err != nil {
continue
}
gpuLines[gpuIdx] = append(gpuLines[gpuIdx], line[end+2:])
}
results := make(map[int]benchmarkBurnParseResult, len(gpuLines))
for gpuIdx, lines := range gpuLines {
// Lines are already stripped of the [gpu N] prefix; parseBenchmarkBurnLog
// calls stripBenchmarkPrefix which is a no-op on already-stripped lines.
results[gpuIdx] = parseBenchmarkBurnLog(strings.Join(lines, "\n"))
}
return results
}
// runNvidiaBenchmarkParallel runs warmup and steady compute on all selected GPUs
// simultaneously using a single bee-gpu-burn invocation per phase.
func runNvidiaBenchmarkParallel(
ctx context.Context,
verboseLog, runDir string,
selected []int,
infoByIndex map[int]benchmarkGPUInfo,
opts NvidiaBenchmarkOptions,
spec benchmarkProfileSpec,
logFunc func(string),
result *NvidiaBenchmarkResult,
serverIdleW *float64, serverLoadedWSum *float64,
serverIdleOK *bool, serverLoadedOK *bool, serverLoadedSamples *int,
) {
allDevices := joinIndexList(selected)
// Build per-GPU result stubs.
gpuResults := make(map[int]*BenchmarkGPUResult, len(selected))
for _, idx := range selected {
r := &BenchmarkGPUResult{Index: idx, Status: "FAILED"}
if info, ok := infoByIndex[idx]; ok {
r.UUID = info.UUID
r.Name = info.Name
r.BusID = info.BusID
r.VBIOS = info.VBIOS
r.PowerLimitW = info.PowerLimitW
r.MultiprocessorCount = info.MultiprocessorCount
r.DefaultPowerLimitW = info.DefaultPowerLimitW
r.MaxGraphicsClockMHz = info.MaxGraphicsClockMHz
r.BaseGraphicsClockMHz = info.BaseGraphicsClockMHz
r.MaxMemoryClockMHz = info.MaxMemoryClockMHz
}
if norm := findBenchmarkNormalization(result.Normalization.GPUs, idx); norm != nil {
r.LockedGraphicsClockMHz = norm.GPUClockLockMHz
r.LockedMemoryClockMHz = norm.MemoryClockLockMHz
}
gpuResults[idx] = r
}
// Baseline: sample all GPUs together.
baselineRows, err := collectBenchmarkSamples(ctx, spec.BaselineSec, selected)
if err != nil && err != context.Canceled {
for _, idx := range selected {
gpuResults[idx].Notes = append(gpuResults[idx].Notes, "baseline sampling failed: "+err.Error())
}
}
for _, idx := range selected {
perGPU := filterRowsByGPU(baselineRows, idx)
gpuResults[idx].Baseline = summarizeBenchmarkTelemetry(perGPU)
writeBenchmarkMetricsFiles(runDir, fmt.Sprintf("gpu-%d-baseline", idx), perGPU)
}
// Sample server idle power once.
if !*serverIdleOK {
if w, ok := sampleIPMIPowerSeries(ctx, maxInt(spec.BaselineSec, 10)); ok {
*serverIdleW = w
*serverIdleOK = true
logFunc(fmt.Sprintf("server idle power (IPMI): %.0f W", w))
}
}
// Warmup: all GPUs simultaneously.
warmupCmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(spec.WarmupSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
"--devices", allDevices,
}
logFunc(fmt.Sprintf("GPUs %s: parallel warmup (%ds)", allDevices, spec.WarmupSec))
warmupOut, warmupRows, warmupErr := runBenchmarkCommandWithMetrics(ctx, verboseLog, "gpu-all-warmup.log", warmupCmd, nil, selected, runDir, "gpu-all-warmup", logFunc)
_ = os.WriteFile(filepath.Join(runDir, "gpu-all-warmup.log"), warmupOut, 0644)
for _, idx := range selected {
writeBenchmarkMetricsFiles(runDir, fmt.Sprintf("gpu-%d-warmup", idx), filterRowsByGPU(warmupRows, idx))
}
if warmupErr != nil {
for _, idx := range selected {
gpuResults[idx].Notes = append(gpuResults[idx].Notes, "parallel warmup failed: "+warmupErr.Error())
}
}
// Snapshot throttle counters before steady.
beforeThrottle := make(map[int]BenchmarkThrottleCounters, len(selected))
for _, idx := range selected {
beforeThrottle[idx], _ = queryThrottleCounters(idx)
}
// Steady: all GPUs simultaneously.
steadyCmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(spec.SteadySec),
"--size-mb", strconv.Itoa(opts.SizeMB),
"--devices", allDevices,
}
logFunc(fmt.Sprintf("GPUs %s: parallel steady compute (%ds)", allDevices, spec.SteadySec))
// Sample server power via IPMI in parallel with steady phase.
ipmiStopCh := make(chan struct{})
ipmiResultCh := make(chan float64, 1)
go func() {
defer close(ipmiResultCh)
var samples []float64
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
select {
case <-ipmiStopCh:
return
case <-time.After(15 * time.Second):
}
for {
if w, err := queryIPMIServerPowerW(); err == nil {
samples = append(samples, w)
}
select {
case <-ipmiStopCh:
if len(samples) > 0 {
var sum float64
for _, w := range samples {
sum += w
}
ipmiResultCh <- sum / float64(len(samples))
}
return
case <-ticker.C:
}
}
}()
steadyOut, steadyRows, steadyErr := runBenchmarkCommandWithMetrics(ctx, verboseLog, "gpu-all-steady.log", steadyCmd, nil, selected, runDir, "gpu-all-steady", logFunc)
close(ipmiStopCh)
if loadedW, ok := <-ipmiResultCh; ok {
*serverLoadedWSum += loadedW
(*serverLoadedSamples)++
*serverLoadedOK = true
logFunc(fmt.Sprintf("GPUs %s: server loaded power (IPMI): %.0f W", allDevices, loadedW))
}
_ = os.WriteFile(filepath.Join(runDir, "gpu-all-steady.log"), steadyOut, 0644)
afterThrottle := make(map[int]BenchmarkThrottleCounters, len(selected))
for _, idx := range selected {
afterThrottle[idx], _ = queryThrottleCounters(idx)
}
parseResults := parseBenchmarkBurnLogByGPU(string(steadyOut))
for _, idx := range selected {
perGPU := filterRowsByGPU(steadyRows, idx)
writeBenchmarkMetricsFiles(runDir, fmt.Sprintf("gpu-%d-steady", idx), perGPU)
gpuResults[idx].Steady = summarizeBenchmarkTelemetry(perGPU)
gpuResults[idx].Throttle = diffThrottleCounters(beforeThrottle[idx], afterThrottle[idx])
if pr, ok := parseResults[idx]; ok {
gpuResults[idx].ComputeCapability = pr.ComputeCapability
gpuResults[idx].Backend = pr.Backend
gpuResults[idx].PrecisionResults = pr.Profiles
if pr.Fallback {
gpuResults[idx].Notes = append(gpuResults[idx].Notes, "benchmark used driver PTX fallback; tensor throughput score is not comparable")
}
}
if steadyErr != nil {
gpuResults[idx].Notes = append(gpuResults[idx].Notes, "parallel steady compute failed: "+steadyErr.Error())
}
}
// Cooldown: all GPUs together.
cooldownRows, err := collectBenchmarkSamples(ctx, spec.CooldownSec, selected)
if err != nil && err != context.Canceled {
for _, idx := range selected {
gpuResults[idx].Notes = append(gpuResults[idx].Notes, "cooldown sampling failed: "+err.Error())
}
}
for _, idx := range selected {
perGPU := filterRowsByGPU(cooldownRows, idx)
gpuResults[idx].Cooldown = summarizeBenchmarkTelemetry(perGPU)
writeBenchmarkMetricsFiles(runDir, fmt.Sprintf("gpu-%d-cooldown", idx), perGPU)
}
// Score and finalize each GPU.
for _, idx := range selected {
r := gpuResults[idx]
r.Scores = scoreBenchmarkGPUResult(*r)
r.DegradationReasons = detectBenchmarkDegradationReasons(*r, result.Normalization.Status)
pr := parseResults[idx]
switch {
case steadyErr != nil:
r.Status = classifySATErrorStatus(steadyOut, steadyErr)
case pr.Fallback:
r.Status = "PARTIAL"
default:
r.Status = "OK"
}
result.GPUs = append(result.GPUs, finalizeBenchmarkGPUResult(*r))
}
}

View File

@@ -2,11 +2,25 @@ package platform
import (
"fmt"
"os"
"path/filepath"
"regexp"
"strings"
"time"
)
func renderBenchmarkReport(result NvidiaBenchmarkResult) string {
return renderBenchmarkReportWithCharts(result, nil)
}
type benchmarkReportChart struct {
Title string
Content string
}
var ansiEscapePattern = regexp.MustCompile(`\x1b\[[0-9;]*m`)
func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benchmarkReportChart) string {
var b strings.Builder
fmt.Fprintf(&b, "Bee NVIDIA Benchmark Report\n")
fmt.Fprintf(&b, "===========================\n\n")
@@ -42,6 +56,9 @@ func renderBenchmarkReport(result NvidiaBenchmarkResult) string {
fmt.Fprintf(&b, " Status: %s\n", gpu.Status)
fmt.Fprintf(&b, " Composite score: %.2f\n", gpu.Scores.CompositeScore)
fmt.Fprintf(&b, " Compute score: %.2f\n", gpu.Scores.ComputeScore)
if gpu.Scores.TOPSPerSMPerGHz > 0 {
fmt.Fprintf(&b, " Compute efficiency: %.3f TOPS/SM/GHz\n", gpu.Scores.TOPSPerSMPerGHz)
}
fmt.Fprintf(&b, " Power sustain: %.1f\n", gpu.Scores.PowerSustainScore)
fmt.Fprintf(&b, " Thermal sustain: %.1f\n", gpu.Scores.ThermalSustainScore)
fmt.Fprintf(&b, " Stability: %.1f\n", gpu.Scores.StabilityScore)
@@ -63,13 +80,7 @@ func renderBenchmarkReport(result NvidiaBenchmarkResult) string {
}
}
}
fmt.Fprintf(&b, " Throttle counters (us): sw_power=%d sw_thermal=%d sync_boost=%d hw_thermal=%d hw_power_brake=%d\n",
gpu.Throttle.SWPowerCapUS,
gpu.Throttle.SWThermalSlowdownUS,
gpu.Throttle.SyncBoostUS,
gpu.Throttle.HWThermalSlowdownUS,
gpu.Throttle.HWPowerBrakeSlowdownUS,
)
fmt.Fprintf(&b, " Throttle: %s\n", formatThrottleLine(gpu.Throttle, gpu.Steady.DurationSec))
if len(gpu.Notes) > 0 {
fmt.Fprintf(&b, " Notes:\n")
for _, note := range gpu.Notes {
@@ -93,6 +104,40 @@ func renderBenchmarkReport(result NvidiaBenchmarkResult) string {
b.WriteString("\n")
}
if len(charts) > 0 {
fmt.Fprintf(&b, "Terminal Charts\n")
fmt.Fprintf(&b, "---------------\n")
for _, chart := range charts {
content := strings.TrimSpace(stripANSIEscapeSequences(chart.Content))
if content == "" {
continue
}
fmt.Fprintf(&b, "%s\n", chart.Title)
fmt.Fprintf(&b, "%s\n", strings.Repeat("~", len(chart.Title)))
fmt.Fprintf(&b, "%s\n\n", content)
}
}
if sp := result.ServerPower; sp != nil {
fmt.Fprintf(&b, "Server Power (IPMI)\n")
fmt.Fprintf(&b, "-------------------\n")
if !sp.Available {
fmt.Fprintf(&b, "Unavailable\n")
} else {
fmt.Fprintf(&b, " Server idle: %.0f W\n", sp.IdleW)
fmt.Fprintf(&b, " Server under load: %.0f W\n", sp.LoadedW)
fmt.Fprintf(&b, " Server delta: %.0f W\n", sp.DeltaW)
fmt.Fprintf(&b, " GPU reported (sum): %.0f W\n", sp.GPUReportedSumW)
if sp.ReportingRatio > 0 {
fmt.Fprintf(&b, " Reporting ratio: %.2f (1.0 = accurate, <0.75 = GPU over-reports)\n", sp.ReportingRatio)
}
}
for _, note := range sp.Notes {
fmt.Fprintf(&b, " Note: %s\n", note)
}
b.WriteString("\n")
}
fmt.Fprintf(&b, "Methodology\n")
fmt.Fprintf(&b, "-----------\n")
fmt.Fprintf(&b, "- Profile %s uses standardized baseline, warmup, steady-state, interconnect, and cooldown phases.\n", result.BenchmarkProfile)
@@ -117,6 +162,72 @@ func renderBenchmarkReport(result NvidiaBenchmarkResult) string {
return b.String()
}
func loadBenchmarkReportCharts(runDir string, gpuIndices []int) []benchmarkReportChart {
phases := []struct {
name string
label string
}{
{name: "baseline", label: "Baseline"},
{name: "steady", label: "Steady State"},
{name: "cooldown", label: "Cooldown"},
}
var charts []benchmarkReportChart
for _, idx := range gpuIndices {
for _, phase := range phases {
path := filepath.Join(runDir, fmt.Sprintf("gpu-%d-%s-metrics-term.txt", idx, phase.name))
raw, err := os.ReadFile(path)
if err != nil || len(raw) == 0 {
continue
}
charts = append(charts, benchmarkReportChart{
Title: fmt.Sprintf("GPU %d %s", idx, phase.label),
Content: string(raw),
})
}
}
return charts
}
func stripANSIEscapeSequences(raw string) string {
return ansiEscapePattern.ReplaceAllString(raw, "")
}
// formatThrottleLine renders throttle counters as human-readable percentages of
// the steady-state window. Only non-zero counters are shown. When the steady
// duration is unknown (0), raw seconds are shown instead.
func formatThrottleLine(t BenchmarkThrottleCounters, steadyDurationSec float64) string {
type counter struct {
label string
us uint64
}
counters := []counter{
{"sw_power", t.SWPowerCapUS},
{"sw_thermal", t.SWThermalSlowdownUS},
{"sync_boost", t.SyncBoostUS},
{"hw_thermal", t.HWThermalSlowdownUS},
{"hw_power_brake", t.HWPowerBrakeSlowdownUS},
}
var parts []string
for _, c := range counters {
if c.us == 0 {
continue
}
sec := float64(c.us) / 1e6
if steadyDurationSec > 0 {
pct := sec / steadyDurationSec * 100
parts = append(parts, fmt.Sprintf("%s=%.1f%% (%.0fs)", c.label, pct, sec))
} else if sec < 1 {
parts = append(parts, fmt.Sprintf("%s=%.0fms", c.label, sec*1000))
} else {
parts = append(parts, fmt.Sprintf("%s=%.1fs", c.label, sec))
}
}
if len(parts) == 0 {
return "none"
}
return strings.Join(parts, " ")
}
func renderBenchmarkSummary(result NvidiaBenchmarkResult) string {
var b strings.Builder
fmt.Fprintf(&b, "run_at_utc=%s\n", result.GeneratedAt.Format(time.RFC3339))

View File

@@ -145,3 +145,35 @@ func TestRenderBenchmarkReportIncludesFindingsAndScores(t *testing.T) {
}
}
}
func TestRenderBenchmarkReportIncludesTerminalChartsWithoutANSI(t *testing.T) {
t.Parallel()
report := renderBenchmarkReportWithCharts(NvidiaBenchmarkResult{
BenchmarkProfile: NvidiaBenchmarkProfileStandard,
OverallStatus: "OK",
SelectedGPUIndices: []int{0},
Normalization: BenchmarkNormalization{
Status: "full",
},
}, []benchmarkReportChart{
{
Title: "GPU 0 Steady State",
Content: "\x1b[31mGPU 0 chart\x1b[0m\n 42┤───",
},
})
for _, needle := range []string{
"Terminal Charts",
"GPU 0 Steady State",
"GPU 0 chart",
"42┤───",
} {
if !strings.Contains(report, needle) {
t.Fatalf("report missing %q\n%s", needle, report)
}
}
if strings.Contains(report, "\x1b[31m") {
t.Fatalf("report should not contain ANSI escapes\n%s", report)
}
}

View File

@@ -14,13 +14,17 @@ type NvidiaBenchmarkOptions struct {
GPUIndices []int
ExcludeGPUIndices []int
RunNCCL bool
ParallelGPUs bool // run all selected GPUs simultaneously instead of sequentially
}
type NvidiaBenchmarkResult struct {
BenchmarkVersion string `json:"benchmark_version"`
GeneratedAt time.Time `json:"generated_at"`
Hostname string `json:"hostname,omitempty"`
ServerModel string `json:"server_model,omitempty"`
BenchmarkProfile string `json:"benchmark_profile"`
ParallelGPUs bool `json:"parallel_gpus,omitempty"`
OverallStatus string `json:"overall_status"`
SelectedGPUIndices []int `json:"selected_gpu_indices"`
Findings []string `json:"findings,omitempty"`
@@ -28,6 +32,7 @@ type NvidiaBenchmarkResult struct {
Normalization BenchmarkNormalization `json:"normalization"`
GPUs []BenchmarkGPUResult `json:"gpus"`
Interconnect *BenchmarkInterconnectResult `json:"interconnect,omitempty"`
ServerPower *BenchmarkServerPower `json:"server_power,omitempty"`
}
type BenchmarkNormalization struct {
@@ -56,7 +61,10 @@ type BenchmarkGPUResult struct {
Backend string `json:"backend,omitempty"`
Status string `json:"status"`
PowerLimitW float64 `json:"power_limit_w,omitempty"`
MultiprocessorCount int `json:"multiprocessor_count,omitempty"`
DefaultPowerLimitW float64 `json:"default_power_limit_w,omitempty"`
MaxGraphicsClockMHz float64 `json:"max_graphics_clock_mhz,omitempty"`
BaseGraphicsClockMHz float64 `json:"base_graphics_clock_mhz,omitempty"`
MaxMemoryClockMHz float64 `json:"max_memory_clock_mhz,omitempty"`
LockedGraphicsClockMHz float64 `json:"locked_graphics_clock_mhz,omitempty"`
LockedMemoryClockMHz float64 `json:"locked_memory_clock_mhz,omitempty"`
@@ -117,6 +125,24 @@ type BenchmarkScorecard struct {
StabilityScore float64 `json:"stability_score"`
InterconnectScore float64 `json:"interconnect_score"`
CompositeScore float64 `json:"composite_score"`
// TOPSPerSMPerGHz is compute efficiency independent of clock speed and SM count.
// Comparable across throttle levels and GPU generations. Low value at normal
// clocks indicates silicon degradation.
TOPSPerSMPerGHz float64 `json:"tops_per_sm_per_ghz,omitempty"`
}
// BenchmarkServerPower captures server-side power via IPMI alongside GPU-reported
// power. The reporting_ratio (delta / gpu_reported_sum) near 1.0 means GPU power
// telemetry is accurate; a ratio well below 1.0 (e.g. 0.5) means the GPU is
// over-reporting its power consumption.
type BenchmarkServerPower struct {
Available bool `json:"available"`
IdleW float64 `json:"idle_w,omitempty"`
LoadedW float64 `json:"loaded_w,omitempty"`
DeltaW float64 `json:"delta_w,omitempty"`
GPUReportedSumW float64 `json:"gpu_reported_sum_w,omitempty"`
ReportingRatio float64 `json:"reporting_ratio,omitempty"`
Notes []string `json:"notes,omitempty"`
}
type BenchmarkInterconnectResult struct {

View File

@@ -15,6 +15,10 @@ var workerPatterns = []string{
"stress-ng",
"stressapptest",
"memtester",
// DCGM diagnostic workers — nvvs is spawned by dcgmi diag and survives
// if dcgmi is killed mid-run, leaving the GPU occupied (DCGM_ST_IN_USE).
"nvvs",
"dcgmi",
}
// KilledProcess describes a process that was sent SIGKILL.

View File

@@ -16,12 +16,12 @@ func (s *System) RunNvidiaStressPack(ctx context.Context, baseDir string, opts N
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, nvidiaStressArchivePrefix(opts.Loader), []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-nvidia-smi-list.log", cmd: []string{"nvidia-smi", "-L"}},
return runAcceptancePackCtx(ctx, baseDir, nvidiaStressArchivePrefix(opts.Loader), withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{name: "02-nvidia-smi-list.log", cmd: []string{"nvidia-smi", "-L"}},
job,
{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
satJob{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
), logFunc)
}
func nvidiaStressArchivePrefix(loader string) string {

View File

@@ -110,7 +110,7 @@ func (s *System) RunPlatformStress(
wg.Add(1)
go func() {
defer wg.Done()
gpuCmd := buildGPUStressCmd(loadCtx, vendor)
gpuCmd := buildGPUStressCmd(loadCtx, vendor, cycle.LoadSec)
if gpuCmd == nil {
return
}
@@ -392,6 +392,13 @@ func buildCPUStressCmd(ctx context.Context) (*exec.Cmd, error) {
cmdArgs = append(cmdArgs, "-M", strconv.Itoa(mb))
}
cmd := exec.CommandContext(ctx, path, cmdArgs...)
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
cmd.Cancel = func() error {
if cmd.Process != nil {
_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
}
return nil
}
cmd.Stdout = nil
cmd.Stderr = nil
if err := startLowPriorityCmd(cmd, 15); err != nil {
@@ -402,28 +409,28 @@ func buildCPUStressCmd(ctx context.Context) (*exec.Cmd, error) {
// buildGPUStressCmd creates a GPU stress command appropriate for the detected vendor.
// Returns nil if no GPU stress tool is available (CPU-only cycling still useful).
func buildGPUStressCmd(ctx context.Context, vendor string) *exec.Cmd {
func buildGPUStressCmd(ctx context.Context, vendor string, durSec int) *exec.Cmd {
switch strings.ToLower(vendor) {
case "amd":
return buildAMDGPUStressCmd(ctx)
return buildAMDGPUStressCmd(ctx, durSec)
case "nvidia":
return buildNvidiaGPUStressCmd(ctx)
return buildNvidiaGPUStressCmd(ctx, durSec)
}
return nil
}
func buildAMDGPUStressCmd(ctx context.Context) *exec.Cmd {
func buildAMDGPUStressCmd(ctx context.Context, durSec int) *exec.Cmd {
rvsArgs, err := resolveRVSCommand()
if err != nil {
return nil
}
rvsPath := rvsArgs[0]
cfg := `actions:
cfg := fmt.Sprintf(`actions:
- name: gst_platform
device: all
module: gst
parallel: true
duration: 86400000
duration: %d`, durSec*1000) + `
copy_matrix: false
target_stress: 90
matrix_size_a: 8640
@@ -433,13 +440,20 @@ func buildAMDGPUStressCmd(ctx context.Context) *exec.Cmd {
cfgFile := "/tmp/bee-platform-gst.conf"
_ = os.WriteFile(cfgFile, []byte(cfg), 0644)
cmd := exec.CommandContext(ctx, rvsPath, "-c", cfgFile)
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
cmd.Cancel = func() error {
if cmd.Process != nil {
_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
}
return nil
}
cmd.Stdout = nil
cmd.Stderr = nil
_ = startLowPriorityCmd(cmd, 10)
return cmd
}
func buildNvidiaGPUStressCmd(ctx context.Context) *exec.Cmd {
func buildNvidiaGPUStressCmd(ctx context.Context, durSec int) *exec.Cmd {
path, err := satLookPath("bee-gpu-burn")
if err != nil {
path, err = satLookPath("bee-gpu-stress")
@@ -447,7 +461,17 @@ func buildNvidiaGPUStressCmd(ctx context.Context) *exec.Cmd {
if err != nil {
return nil
}
cmd := exec.CommandContext(ctx, path, "--seconds", "86400")
// Pass exact duration so bee-gpu-burn exits on its own when the cycle ends.
// Process group kill via Setpgid+Cancel is kept as a safety net for cases
// where the context is cancelled early (user stop, parent timeout).
cmd := exec.CommandContext(ctx, path, "--seconds", strconv.Itoa(durSec))
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
cmd.Cancel = func() error {
if cmd.Process != nil {
_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
}
return nil
}
cmd.Stdout = nil
cmd.Stderr = nil
_ = startLowPriorityCmd(cmd, 10)

View File

@@ -173,6 +173,22 @@ func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHe
switch vendor {
case "nvidia":
if raw, err := os.ReadFile("/run/bee-nvidia-mode"); err == nil {
health.NvidiaGSPMode = strings.TrimSpace(string(raw))
if health.NvidiaGSPMode == "gsp-stuck" {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_gsp_stuck",
Severity: "critical",
Description: "NVIDIA GSP firmware init timed out and the kernel module is stuck. Reboot and select 'GSP=off' in the boot menu.",
})
} else if health.NvidiaGSPMode == "gsp-off" {
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "nvidia_gsp_disabled",
Severity: "warning",
Description: "NVIDIA GSP firmware disabled (fallback). Power management runs via CPU path — power draw readings may differ from reference hardware.",
})
}
}
health.DriverReady = strings.Contains(lsmodText, "nvidia ")
if !health.DriverReady {
health.Issues = append(health.Issues, schema.RuntimeIssue{

View File

@@ -278,13 +278,13 @@ func (s *System) RunNCCLTests(ctx context.Context, baseDir string, logFunc func(
if gpuCount < 1 {
gpuCount = 1
}
return runAcceptancePackCtx(ctx, baseDir, "nccl-tests", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-all-reduce-perf.log", cmd: []string{
return runAcceptancePackCtx(ctx, baseDir, "nccl-tests", withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{name: "02-all-reduce-perf.log", cmd: []string{
"all_reduce_perf", "-b", "512M", "-e", "4G", "-f", "2",
"-g", strconv.Itoa(gpuCount), "--iters", "20",
}},
}, logFunc)
), logFunc)
}
func (s *System) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
@@ -296,18 +296,18 @@ func (s *System) RunNvidiaOfficialComputePack(ctx context.Context, baseDir strin
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-compute", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dcgmi-version.log", cmd: []string{"dcgmi", "-v"}},
{
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-compute", withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{name: "02-dcgmi-version.log", cmd: []string{"dcgmi", "-v"}},
satJob{
name: "03-dcgmproftester.log",
cmd: profCmd,
env: nvidiaVisibleDevicesEnv(selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
satJob{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
), logFunc)
}
func (s *System) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
@@ -315,16 +315,16 @@ func (s *System) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string,
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-targeted-power", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-targeted-power", withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{
name: "02-dcgmi-targeted-power.log",
cmd: nvidiaDCGMNamedDiagCommand("targeted_power", normalizeNvidiaBurnDuration(durationSec), selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
satJob{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
), logFunc)
}
func (s *System) RunNvidiaPulseTestPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
@@ -332,16 +332,16 @@ func (s *System) RunNvidiaPulseTestPack(ctx context.Context, baseDir string, dur
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-pulse", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-pulse", withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{
name: "02-dcgmi-pulse-test.log",
cmd: nvidiaDCGMNamedDiagCommand("pulse_test", normalizeNvidiaBurnDuration(durationSec), selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
satJob{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
), logFunc)
}
func (s *System) RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpuIndices []int, logFunc func(string)) (string, error) {
@@ -349,16 +349,16 @@ func (s *System) RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpu
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-bandwidth", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-bandwidth", withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{
name: "02-dcgmi-nvbandwidth.log",
cmd: nvidiaDCGMNamedDiagCommand("nvbandwidth", 0, selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
satJob{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
), logFunc)
}
func (s *System) RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
@@ -382,16 +382,23 @@ func (s *System) RunNvidiaTargetedStressValidatePack(ctx context.Context, baseDi
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-targeted-stress", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{
// Kill any lingering nvvs/dcgmi processes from a previous interrupted run
// before starting — otherwise dcgmi diag fails with DCGM_ST_IN_USE (-34).
if killed := KillTestWorkers(); len(killed) > 0 && logFunc != nil {
for _, p := range killed {
logFunc(fmt.Sprintf("pre-flight: killed stale worker pid=%d name=%s", p.PID, p.Name))
}
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-targeted-stress", withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{
name: "02-dcgmi-targeted-stress.log",
cmd: nvidiaDCGMNamedDiagCommand("targeted_stress", normalizeNvidiaBurnDuration(durationSec), selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
satJob{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
), logFunc)
}
func resolveDCGMGPUIndices(gpuIndices []int) ([]int, error) {
@@ -561,14 +568,24 @@ type satStats struct {
Unsupported int
}
func withNvidiaPersistenceMode(jobs ...satJob) []satJob {
out := make([]satJob, 0, len(jobs)+1)
out = append(out, satJob{
name: "00-nvidia-smi-persistence-mode.log",
cmd: []string{"nvidia-smi", "-pm", "1"},
})
out = append(out, jobs...)
return out
}
func nvidiaSATJobs() []satJob {
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
{name: "05-bee-gpu-burn.log", cmd: []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}},
}
return withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
satJob{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
satJob{name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
satJob{name: "05-bee-gpu-burn.log", cmd: []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}},
)
}
func nvidiaDCGMJobs(diagLevel int, gpuIndices []int) []satJob {
@@ -583,12 +600,12 @@ func nvidiaDCGMJobs(diagLevel int, gpuIndices []int) []satJob {
}
diagArgs = append(diagArgs, "-i", strings.Join(ids, ","))
}
return []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
{name: "04-dcgmi-diag.log", cmd: diagArgs},
}
return withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
satJob{name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
satJob{name: "04-dcgmi-diag.log", cmd: diagArgs},
)
}
func nvidiaDCGMNamedDiagCommand(name string, durationSec int, gpuIndices []int) []string {
@@ -613,7 +630,10 @@ func nvidiaVisibleDevicesEnv(gpuIndices []int) []string {
if len(gpuIndices) == 0 {
return nil
}
return []string{"CUDA_VISIBLE_DEVICES=" + joinIndexList(gpuIndices)}
return []string{
"CUDA_DEVICE_ORDER=PCI_BUS_ID",
"CUDA_VISIBLE_DEVICES=" + joinIndexList(gpuIndices),
}
}
func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob, logFunc func(string)) (string, error) {
@@ -654,6 +674,9 @@ func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []sa
if writeErr := os.WriteFile(filepath.Join(runDir, job.name), out, 0644); writeErr != nil {
return "", writeErr
}
if ctx.Err() != nil {
return "", ctx.Err()
}
status, rc := classifySATResult(job.name, out, err)
stats.Add(status)
key := strings.TrimSuffix(strings.TrimPrefix(job.name, "0"), ".log")

View File

@@ -1,12 +1,14 @@
package platform
import (
"context"
"errors"
"os"
"os/exec"
"path/filepath"
"strings"
"testing"
"time"
)
func TestStorageSATCommands(t *testing.T) {
@@ -28,13 +30,19 @@ func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
jobs := nvidiaSATJobs()
if len(jobs) != 5 {
t.Fatalf("jobs=%d want 5", len(jobs))
if len(jobs) != 6 {
t.Fatalf("jobs=%d want 6", len(jobs))
}
if got := jobs[4].cmd[0]; got != "bee-gpu-burn" {
if got := jobs[0].cmd[0]; got != "nvidia-smi" {
t.Fatalf("preflight command=%q want nvidia-smi", got)
}
if got := strings.Join(jobs[0].cmd, " "); got != "nvidia-smi -pm 1" {
t.Fatalf("preflight=%q want %q", got, "nvidia-smi -pm 1")
}
if got := jobs[5].cmd[0]; got != "bee-gpu-burn" {
t.Fatalf("gpu stress command=%q want bee-gpu-burn", got)
}
if got := jobs[3].cmd[1]; got != "--output-file" {
if got := jobs[4].cmd[1]; got != "--output-file" {
t.Fatalf("bug report flag=%q want --output-file", got)
}
}
@@ -82,7 +90,7 @@ func TestAMDStressJobsIncludeBandwidthAndGST(t *testing.T) {
func TestNvidiaSATJobsUseBuiltinBurnDefaults(t *testing.T) {
jobs := nvidiaSATJobs()
got := jobs[4].cmd
got := jobs[5].cmd
want := []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}
if len(got) != len(want) {
t.Fatalf("cmd len=%d want %d", len(got), len(want))
@@ -94,6 +102,19 @@ func TestNvidiaSATJobsUseBuiltinBurnDefaults(t *testing.T) {
}
}
func TestNvidiaDCGMJobsEnablePersistenceModeBeforeDiag(t *testing.T) {
jobs := nvidiaDCGMJobs(3, []int{2, 0})
if len(jobs) != 5 {
t.Fatalf("jobs=%d want 5", len(jobs))
}
if got := strings.Join(jobs[0].cmd, " "); got != "nvidia-smi -pm 1" {
t.Fatalf("preflight=%q want %q", got, "nvidia-smi -pm 1")
}
if got := strings.Join(jobs[4].cmd, " "); got != "dcgmi diag -r 3 -i 2,0" {
t.Fatalf("diag=%q want %q", got, "dcgmi diag -r 3 -i 2,0")
}
}
func TestBuildNvidiaStressJobUsesSelectedLoaderAndDevices(t *testing.T) {
t.Parallel()
@@ -234,11 +255,14 @@ func TestNvidiaDCGMNamedDiagCommandUsesDurationAndSelection(t *testing.T) {
func TestNvidiaVisibleDevicesEnvUsesSelectedGPUs(t *testing.T) {
env := nvidiaVisibleDevicesEnv([]int{0, 2, 4})
if len(env) != 1 {
t.Fatalf("env len=%d want 1 (%v)", len(env), env)
if len(env) != 2 {
t.Fatalf("env len=%d want 2 (%v)", len(env), env)
}
if env[0] != "CUDA_VISIBLE_DEVICES=0,2,4" {
t.Fatalf("env[0]=%q want CUDA_VISIBLE_DEVICES=0,2,4", env[0])
if env[0] != "CUDA_DEVICE_ORDER=PCI_BUS_ID" {
t.Fatalf("env[0]=%q want CUDA_DEVICE_ORDER=PCI_BUS_ID", env[0])
}
if env[1] != "CUDA_VISIBLE_DEVICES=0,2,4" {
t.Fatalf("env[1]=%q want CUDA_VISIBLE_DEVICES=0,2,4", env[1])
}
}
@@ -331,6 +355,38 @@ func TestClassifySATResult(t *testing.T) {
}
}
func TestRunAcceptancePackCtxReturnsContextErrorWithoutArchive(t *testing.T) {
dir := t.TempDir()
ctx, cancel := context.WithCancel(context.Background())
t.Cleanup(cancel)
done := make(chan struct{})
go func() {
time.Sleep(100 * time.Millisecond)
cancel()
close(done)
}()
archive, err := runAcceptancePackCtx(ctx, dir, "cancelled-pack", []satJob{
{name: "01-sleep.log", cmd: []string{"sh", "-c", "sleep 5"}},
}, nil)
<-done
if !errors.Is(err, context.Canceled) {
t.Fatalf("err=%v want context.Canceled", err)
}
if archive != "" {
t.Fatalf("archive=%q want empty", archive)
}
matches, globErr := filepath.Glob(filepath.Join(dir, "cancelled-pack-*.tar.gz"))
if globErr != nil {
t.Fatalf("Glob error: %v", globErr)
}
if len(matches) != 0 {
t.Fatalf("archives=%v want none", matches)
}
}
func TestParseStorageDevicesSkipsUSBDisks(t *testing.T) {
t.Parallel()

View File

@@ -61,7 +61,9 @@ func (s *System) ServiceState(name string) string {
}
func (s *System) ServiceDo(name string, action ServiceAction) (string, error) {
raw, err := exec.Command("systemctl", string(action), name).CombinedOutput()
// bee-web runs as the bee user; sudo is required to control system services.
// /etc/sudoers.d/bee grants bee NOPASSWD:ALL.
raw, err := exec.Command("sudo", "systemctl", string(action), name).CombinedOutput()
return string(raw), err
}

View File

@@ -20,6 +20,7 @@ type RuntimeHealth struct {
ExportDir string `json:"export_dir,omitempty"`
DriverReady bool `json:"driver_ready,omitempty"`
CUDAReady bool `json:"cuda_ready,omitempty"`
NvidiaGSPMode string `json:"nvidia_gsp_mode,omitempty"` // "gsp-on", "gsp-off", "gsp-stuck"
NetworkStatus string `json:"network_status,omitempty"`
Issues []RuntimeIssue `json:"issues,omitempty"`
Tools []RuntimeToolStatus `json:"tools,omitempty"`

View File

@@ -11,6 +11,7 @@ import (
"os/exec"
"path/filepath"
"regexp"
"sort"
"strings"
"sync/atomic"
"syscall"
@@ -21,13 +22,238 @@ import (
)
var ansiEscapeRE = regexp.MustCompile(`\x1b\[[0-9;]*[a-zA-Z]|\x1b[()][A-Z0-9]|\x1b[DABC]`)
var apiListNvidiaGPUs = func(a *app.App) ([]platform.NvidiaGPU, error) {
if a == nil {
return nil, fmt.Errorf("app not configured")
}
return a.ListNvidiaGPUs()
}
// ── Job ID counter ────────────────────────────────────────────────────────────
var jobCounter atomic.Uint64
func newJobID(prefix string) string {
return fmt.Sprintf("%s-%d", prefix, jobCounter.Add(1))
func newJobID(_ string) string {
start := int((jobCounter.Add(1) - 1) % 1000)
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
for offset := 0; offset < 1000; offset++ {
n := (start + offset) % 1000
id := fmt.Sprintf("TASK-%03d", n)
if !taskIDInUseLocked(id) {
return id
}
}
return fmt.Sprintf("TASK-%03d", start)
}
func taskIDInUseLocked(id string) bool {
for _, t := range globalQueue.tasks {
if t != nil && t.ID == id {
return true
}
}
return false
}
type taskRunResponse struct {
TaskID string `json:"task_id,omitempty"`
JobID string `json:"job_id,omitempty"`
TaskIDs []string `json:"task_ids,omitempty"`
JobIDs []string `json:"job_ids,omitempty"`
TaskCount int `json:"task_count,omitempty"`
}
type nvidiaTaskSelection struct {
GPUIndices []int
Label string
}
func writeTaskRunResponse(w http.ResponseWriter, tasks []*Task) {
if len(tasks) == 0 {
writeJSON(w, taskRunResponse{})
return
}
ids := make([]string, 0, len(tasks))
for _, t := range tasks {
if t == nil || strings.TrimSpace(t.ID) == "" {
continue
}
ids = append(ids, t.ID)
}
resp := taskRunResponse{TaskCount: len(ids)}
if len(ids) > 0 {
resp.TaskID = ids[0]
resp.JobID = ids[0]
resp.TaskIDs = ids
resp.JobIDs = ids
}
writeJSON(w, resp)
}
func shouldSplitHomogeneousNvidiaTarget(target string) bool {
switch strings.TrimSpace(target) {
case "nvidia", "nvidia-targeted-stress", "nvidia-benchmark", "nvidia-compute",
"nvidia-targeted-power", "nvidia-pulse", "nvidia-interconnect",
"nvidia-bandwidth", "nvidia-stress":
return true
default:
return false
}
}
func expandHomogeneousNvidiaSelections(gpus []platform.NvidiaGPU, include, exclude []int) ([]nvidiaTaskSelection, error) {
if len(gpus) == 0 {
return nil, fmt.Errorf("no NVIDIA GPUs detected")
}
indexed := make(map[int]platform.NvidiaGPU, len(gpus))
allIndices := make([]int, 0, len(gpus))
for _, gpu := range gpus {
indexed[gpu.Index] = gpu
allIndices = append(allIndices, gpu.Index)
}
sort.Ints(allIndices)
selected := allIndices
if len(include) > 0 {
selected = make([]int, 0, len(include))
seen := make(map[int]struct{}, len(include))
for _, idx := range include {
if _, ok := indexed[idx]; !ok {
continue
}
if _, dup := seen[idx]; dup {
continue
}
seen[idx] = struct{}{}
selected = append(selected, idx)
}
sort.Ints(selected)
}
if len(exclude) > 0 {
skip := make(map[int]struct{}, len(exclude))
for _, idx := range exclude {
skip[idx] = struct{}{}
}
filtered := selected[:0]
for _, idx := range selected {
if _, ok := skip[idx]; ok {
continue
}
filtered = append(filtered, idx)
}
selected = filtered
}
if len(selected) == 0 {
return nil, fmt.Errorf("no NVIDIA GPUs selected")
}
modelGroups := make(map[string][]platform.NvidiaGPU)
modelOrder := make([]string, 0)
for _, idx := range selected {
gpu := indexed[idx]
model := strings.TrimSpace(gpu.Name)
if model == "" {
model = fmt.Sprintf("GPU %d", gpu.Index)
}
if _, ok := modelGroups[model]; !ok {
modelOrder = append(modelOrder, model)
}
modelGroups[model] = append(modelGroups[model], gpu)
}
sort.Slice(modelOrder, func(i, j int) bool {
left := modelGroups[modelOrder[i]]
right := modelGroups[modelOrder[j]]
if len(left) == 0 || len(right) == 0 {
return modelOrder[i] < modelOrder[j]
}
return left[0].Index < right[0].Index
})
var groups []nvidiaTaskSelection
var singles []nvidiaTaskSelection
for _, model := range modelOrder {
group := modelGroups[model]
sort.Slice(group, func(i, j int) bool { return group[i].Index < group[j].Index })
indices := make([]int, 0, len(group))
for _, gpu := range group {
indices = append(indices, gpu.Index)
}
if len(indices) >= 2 {
groups = append(groups, nvidiaTaskSelection{
GPUIndices: indices,
Label: fmt.Sprintf("%s; GPUs %s", model, joinTaskIndices(indices)),
})
continue
}
gpu := group[0]
singles = append(singles, nvidiaTaskSelection{
GPUIndices: []int{gpu.Index},
Label: fmt.Sprintf("GPU %d — %s", gpu.Index, model),
})
}
return append(groups, singles...), nil
}
func joinTaskIndices(indices []int) string {
parts := make([]string, 0, len(indices))
for _, idx := range indices {
parts = append(parts, fmt.Sprintf("%d", idx))
}
return strings.Join(parts, ",")
}
func formatSplitTaskName(baseName, selectionLabel string) string {
baseName = strings.TrimSpace(baseName)
selectionLabel = strings.TrimSpace(selectionLabel)
if baseName == "" {
return selectionLabel
}
if selectionLabel == "" {
return baseName
}
return baseName + " (" + selectionLabel + ")"
}
func buildNvidiaTaskSet(target string, priority int, createdAt time.Time, params taskParams, baseName string, appRef *app.App, idPrefix string) ([]*Task, error) {
if !shouldSplitHomogeneousNvidiaTarget(target) {
t := &Task{
ID: newJobID(idPrefix),
Name: baseName,
Target: target,
Priority: priority,
Status: TaskPending,
CreatedAt: createdAt,
params: params,
}
return []*Task{t}, nil
}
gpus, err := apiListNvidiaGPUs(appRef)
if err != nil {
return nil, err
}
selections, err := expandHomogeneousNvidiaSelections(gpus, params.GPUIndices, params.ExcludeGPUIndices)
if err != nil {
return nil, err
}
tasks := make([]*Task, 0, len(selections))
for _, selection := range selections {
taskParamsCopy := params
taskParamsCopy.GPUIndices = append([]int(nil), selection.GPUIndices...)
taskParamsCopy.ExcludeGPUIndices = nil
displayName := formatSplitTaskName(baseName, selection.Label)
taskParamsCopy.DisplayName = displayName
tasks = append(tasks, &Task{
ID: newJobID(idPrefix),
Name: displayName,
Target: target,
Priority: priority,
Status: TaskPending,
CreatedAt: createdAt,
params: taskParamsCopy,
})
}
return tasks, nil
}
// ── SSE helpers ───────────────────────────────────────────────────────────────
@@ -207,28 +433,28 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
}
name := taskDisplayName(target, body.Profile, body.Loader)
t := &Task{
ID: newJobID("sat-" + target),
Name: name,
Target: target,
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{
Duration: body.Duration,
DiagLevel: body.DiagLevel,
GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices,
Loader: body.Loader,
BurnProfile: body.Profile,
DisplayName: body.DisplayName,
PlatformComponents: body.PlatformComponents,
},
}
if strings.TrimSpace(body.DisplayName) != "" {
t.Name = body.DisplayName
name = body.DisplayName
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
params := taskParams{
Duration: body.Duration,
DiagLevel: body.DiagLevel,
GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices,
Loader: body.Loader,
BurnProfile: body.Profile,
DisplayName: body.DisplayName,
PlatformComponents: body.PlatformComponents,
}
tasks, err := buildNvidiaTaskSet(target, 0, time.Now(), params, name, h.opts.App, "sat-"+target)
if err != nil {
writeError(w, http.StatusBadRequest, err.Error())
return
}
for _, t := range tasks {
globalQueue.enqueue(t)
}
writeTaskRunResponse(w, tasks)
}
}
@@ -244,6 +470,7 @@ func (h *handler) handleAPIBenchmarkNvidiaRun(w http.ResponseWriter, r *http.Req
GPUIndices []int `json:"gpu_indices"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
RunNCCL *bool `json:"run_nccl"`
ParallelGPUs *bool `json:"parallel_gpus"`
DisplayName string `json:"display_name"`
}
if r.Body != nil {
@@ -257,27 +484,31 @@ func (h *handler) handleAPIBenchmarkNvidiaRun(w http.ResponseWriter, r *http.Req
if body.RunNCCL != nil {
runNCCL = *body.RunNCCL
}
t := &Task{
ID: newJobID("benchmark-nvidia"),
Name: taskDisplayName("nvidia-benchmark", "", ""),
Target: "nvidia-benchmark",
Priority: 15,
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{
GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices,
SizeMB: body.SizeMB,
BenchmarkProfile: body.Profile,
RunNCCL: runNCCL,
DisplayName: body.DisplayName,
},
parallelGPUs := false
if body.ParallelGPUs != nil {
parallelGPUs = *body.ParallelGPUs
}
name := taskDisplayName("nvidia-benchmark", "", "")
if strings.TrimSpace(body.DisplayName) != "" {
t.Name = body.DisplayName
name = body.DisplayName
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
tasks, err := buildNvidiaTaskSet("nvidia-benchmark", 15, time.Now(), taskParams{
GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices,
SizeMB: body.SizeMB,
BenchmarkProfile: body.Profile,
RunNCCL: runNCCL,
ParallelGPUs: parallelGPUs,
DisplayName: body.DisplayName,
}, name, h.opts.App, "benchmark-nvidia")
if err != nil {
writeError(w, http.StatusBadRequest, err.Error())
return
}
for _, t := range tasks {
globalQueue.enqueue(t)
}
writeTaskRunResponse(w, tasks)
}
func (h *handler) handleAPISATStream(w http.ResponseWriter, r *http.Request) {
@@ -383,11 +614,13 @@ func (h *handler) handleAPIServicesAction(w http.ResponseWriter, r *http.Request
return
}
result, err := h.opts.App.ServiceActionResult(req.Name, action)
status := "ok"
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
status = "error"
}
writeJSON(w, map[string]string{"status": "ok", "output": result.Body})
// Always return 200 with output so the frontend can display the actual
// systemctl error message instead of a generic "exit status 1".
writeJSON(w, map[string]string{"status": status, "output": result.Body})
}
// ── Network ───────────────────────────────────────────────────────────────────

View File

@@ -1,6 +1,7 @@
package webui
import (
"encoding/json"
"net/http/httptest"
"strings"
"testing"
@@ -74,6 +75,14 @@ func TestHandleAPIBenchmarkNvidiaRunQueuesSelectedGPUs(t *testing.T) {
globalQueue.tasks = originalTasks
globalQueue.mu.Unlock()
})
prevList := apiListNvidiaGPUs
apiListNvidiaGPUs = func(_ *app.App) ([]platform.NvidiaGPU, error) {
return []platform.NvidiaGPU{
{Index: 1, Name: "NVIDIA H100 PCIe"},
{Index: 3, Name: "NVIDIA H100 PCIe"},
}, nil
}
t.Cleanup(func() { apiListNvidiaGPUs = prevList })
h := &handler{opts: HandlerOptions{App: &app.App{}}}
req := httptest.NewRequest("POST", "/api/benchmark/nvidia/run", strings.NewReader(`{"profile":"standard","gpu_indices":[1,3],"run_nccl":false}`))
@@ -101,6 +110,97 @@ func TestHandleAPIBenchmarkNvidiaRunQueuesSelectedGPUs(t *testing.T) {
}
}
func TestHandleAPIBenchmarkNvidiaRunSplitsMixedGPUModels(t *testing.T) {
globalQueue.mu.Lock()
originalTasks := globalQueue.tasks
globalQueue.tasks = nil
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = originalTasks
globalQueue.mu.Unlock()
})
prevList := apiListNvidiaGPUs
apiListNvidiaGPUs = func(_ *app.App) ([]platform.NvidiaGPU, error) {
return []platform.NvidiaGPU{
{Index: 0, Name: "NVIDIA H100 PCIe"},
{Index: 1, Name: "NVIDIA H100 PCIe"},
{Index: 2, Name: "NVIDIA H200 NVL"},
}, nil
}
t.Cleanup(func() { apiListNvidiaGPUs = prevList })
h := &handler{opts: HandlerOptions{App: &app.App{}}}
req := httptest.NewRequest("POST", "/api/benchmark/nvidia/run", strings.NewReader(`{"profile":"standard","gpu_indices":[0,1,2],"run_nccl":false}`))
rec := httptest.NewRecorder()
h.handleAPIBenchmarkNvidiaRun(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
var resp taskRunResponse
if err := json.Unmarshal(rec.Body.Bytes(), &resp); err != nil {
t.Fatalf("decode response: %v", err)
}
if len(resp.TaskIDs) != 2 {
t.Fatalf("task_ids=%v want 2 items", resp.TaskIDs)
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
if len(globalQueue.tasks) != 2 {
t.Fatalf("tasks=%d want 2", len(globalQueue.tasks))
}
if got := globalQueue.tasks[0].params.GPUIndices; len(got) != 2 || got[0] != 0 || got[1] != 1 {
t.Fatalf("task[0] gpu indices=%v want [0 1]", got)
}
if got := globalQueue.tasks[1].params.GPUIndices; len(got) != 1 || got[0] != 2 {
t.Fatalf("task[1] gpu indices=%v want [2]", got)
}
}
func TestHandleAPISATRunSplitsMixedNvidiaTaskSet(t *testing.T) {
globalQueue.mu.Lock()
originalTasks := globalQueue.tasks
globalQueue.tasks = nil
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = originalTasks
globalQueue.mu.Unlock()
})
prevList := apiListNvidiaGPUs
apiListNvidiaGPUs = func(_ *app.App) ([]platform.NvidiaGPU, error) {
return []platform.NvidiaGPU{
{Index: 0, Name: "NVIDIA H100 PCIe"},
{Index: 1, Name: "NVIDIA H100 PCIe"},
{Index: 2, Name: "NVIDIA H200 NVL"},
}, nil
}
t.Cleanup(func() { apiListNvidiaGPUs = prevList })
h := &handler{opts: HandlerOptions{App: &app.App{}}}
req := httptest.NewRequest("POST", "/api/sat/nvidia-targeted-power/run", strings.NewReader(`{"profile":"acceptance","gpu_indices":[0,1,2]}`))
rec := httptest.NewRecorder()
h.handleAPISATRun("nvidia-targeted-power").ServeHTTP(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
if len(globalQueue.tasks) != 2 {
t.Fatalf("tasks=%d want 2", len(globalQueue.tasks))
}
if got := globalQueue.tasks[0].params.GPUIndices; len(got) != 2 || got[0] != 0 || got[1] != 1 {
t.Fatalf("task[0] gpu indices=%v want [0 1]", got)
}
if got := globalQueue.tasks[1].params.GPUIndices; len(got) != 1 || got[0] != 2 {
t.Fatalf("task[1] gpu indices=%v want [2]", got)
}
}
func TestPushFanRingsTracksByNameAndCarriesForwardMissingSamples(t *testing.T) {
h := &handler{}
h.pushFanRings([]platform.FanReading{

View File

@@ -8,9 +8,12 @@ import (
"os"
"path/filepath"
"sort"
"strconv"
"strings"
"time"
"bee/audit/internal/app"
"bee/audit/internal/platform"
"bee/audit/internal/schema"
)
@@ -33,6 +36,9 @@ a{color:var(--accent);text-decoration:none}
.sidebar-logo{padding:18px 16px 12px;font-size:18px;font-weight:700;color:#fff;letter-spacing:-.5px}
.sidebar-logo span{color:rgba(255,255,255,.5);font-weight:400;font-size:12px;display:block;margin-top:2px}
.sidebar-version{padding:0 16px 14px;font-size:11px;color:rgba(255,255,255,.45)}
.sidebar-badge{margin:0 12px 12px;padding:5px 8px;border-radius:4px;font-size:11px;font-weight:600;text-align:center}
.sidebar-badge-warn{background:#7a4f00;color:#f6c90e}
.sidebar-badge-crit{background:#5c1a1a;color:#ff6b6b}
.nav{flex:1}
.nav-item{display:block;padding:10px 16px;color:rgba(255,255,255,.7);font-size:13px;border-left:3px solid transparent;transition:all .15s}
.nav-item:hover{color:#fff;background:rgba(255,255,255,.08)}
@@ -107,6 +113,15 @@ func layoutNav(active string, buildLabel string) string {
buildLabel = "dev"
}
b.WriteString(`<div class="sidebar-version">Version ` + html.EscapeString(buildLabel) + `</div>`)
if raw, err := os.ReadFile("/run/bee-nvidia-mode"); err == nil {
gspMode := strings.TrimSpace(string(raw))
switch gspMode {
case "gsp-off":
b.WriteString(`<div class="sidebar-badge sidebar-badge-warn">NVIDIA GSP=off</div>`)
case "gsp-stuck":
b.WriteString(`<div class="sidebar-badge sidebar-badge-crit">NVIDIA GSP stuck — reboot</div>`)
}
}
b.WriteString(`<nav class="nav">`)
for _, item := range items {
cls := "nav-item"
@@ -149,7 +164,7 @@ func renderPage(page string, opts HandlerOptions) string {
case "benchmark":
pageID = "benchmark"
title = "Benchmark"
body = renderBenchmark()
body = renderBenchmark(opts)
case "tasks":
pageID = "tasks"
title = "Tasks"
@@ -1055,18 +1070,34 @@ func renderValidate(opts HandlerOptions) string {
)) +
`</div>
<div style="height:1px;background:var(--border);margin:16px 0"></div>
<div class="card" style="margin-bottom:16px">
<div class="card-head">NVIDIA GPU Selection</div>
<div class="card-body">
<p style="font-size:12px;color:var(--muted);margin:0 0 8px">` + inv.NVIDIA + `</p>
<p style="font-size:12px;color:var(--muted);margin:0 0 10px">All NVIDIA validate tasks use only the GPUs selected here. The same selection is used by Validate one by one.</p>
<div style="display:flex;gap:8px;flex-wrap:wrap;margin-bottom:8px">
<button class="btn btn-sm btn-secondary" type="button" onclick="satSelectAllGPUs()">Select All</button>
<button class="btn btn-sm btn-secondary" type="button" onclick="satSelectNoGPUs()">Clear</button>
</div>
<div id="sat-gpu-list" style="border:1px solid var(--border);border-radius:4px;padding:12px;min-height:88px">
<p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p>
</div>
<p id="sat-gpu-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA validate tasks.</p>
</div>
</div>
<div class="grid3">
` + renderSATCard("nvidia", "NVIDIA GPU", "runNvidiaValidateSet('nvidia')", "", renderValidateCardBody(
inv.NVIDIA,
`Runs NVIDIA diagnostics and board inventory checks.`,
`<code>nvidia-smi</code>, <code>dmidecode</code>, <code>dcgmi diag</code>`,
`Runs one GPU at a time. Diag level is taken from Validate Profile.`,
)) +
inv.NVIDIA,
`Runs NVIDIA diagnostics and board inventory checks.`,
`<code>nvidia-smi</code>, <code>dmidecode</code>, <code>dcgmi diag</code>`,
`Runs one GPU at a time on the selected NVIDIA GPUs. Diag level is taken from Validate Profile.`,
)) +
renderSATCard("nvidia-targeted-stress", "NVIDIA GPU Targeted Stress", "runNvidiaValidateSet('nvidia-targeted-stress')", "", renderValidateCardBody(
inv.NVIDIA,
`Runs a controlled NVIDIA DCGM load in Validate to check stability under moderate stress.`,
`<code>dcgmi diag targeted_stress</code>`,
`Runs one GPU at a time with the fixed DCGM targeted stress recipe.`,
`Runs one GPU at a time on the selected NVIDIA GPUs with the fixed DCGM targeted stress recipe.`,
)) +
`</div>
<div class="grid3" style="margin-top:16px">
@@ -1088,6 +1119,8 @@ func renderValidate(opts HandlerOptions) string {
.validate-card-body { padding:0; }
.validate-card-section { padding:12px 16px 0; }
.validate-card-section:last-child { padding-bottom:16px; }
.sat-gpu-row { display:flex; align-items:flex-start; gap:8px; padding:6px 0; cursor:pointer; font-size:13px; }
.sat-gpu-row input[type=checkbox] { width:16px; height:16px; margin-top:2px; flex-shrink:0; }
@media(max-width:900px){ .validate-profile-body { grid-template-columns:1fr; } }
</style>
<script>
@@ -1116,6 +1149,59 @@ function loadSatNvidiaGPUs() {
}
return satNvidiaGPUsPromise;
}
function satSelectedGPUIndices() {
return Array.from(document.querySelectorAll('.sat-nvidia-checkbox'))
.filter(function(el) { return el.checked && !el.disabled; })
.map(function(el) { return parseInt(el.value, 10); })
.filter(function(v) { return !Number.isNaN(v); })
.sort(function(a, b) { return a - b; });
}
function satUpdateGPUSelectionNote() {
const note = document.getElementById('sat-gpu-selection-note');
if (!note) return;
const selected = satSelectedGPUIndices();
if (!selected.length) {
note.textContent = 'Select at least one NVIDIA GPU to enable NVIDIA validate tasks.';
return;
}
note.textContent = 'Selected NVIDIA GPUs: ' + selected.join(', ') + '.';
}
function satRenderGPUList(gpus) {
const root = document.getElementById('sat-gpu-list');
if (!root) return;
if (!gpus || !gpus.length) {
root.innerHTML = '<p style="color:var(--muted);font-size:13px">No NVIDIA GPUs detected.</p>';
satUpdateGPUSelectionNote();
return;
}
root.innerHTML = gpus.map(function(gpu) {
const mem = gpu.memory_mb > 0 ? ' · ' + gpu.memory_mb + ' MiB' : '';
return '<label class="sat-gpu-row">'
+ '<input class="sat-nvidia-checkbox" type="checkbox" value="' + gpu.index + '" checked onchange="satUpdateGPUSelectionNote()">'
+ '<span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span>'
+ '</label>';
}).join('');
satUpdateGPUSelectionNote();
}
function satSelectAllGPUs() {
document.querySelectorAll('.sat-nvidia-checkbox').forEach(function(el) { el.checked = true; });
satUpdateGPUSelectionNote();
}
function satSelectNoGPUs() {
document.querySelectorAll('.sat-nvidia-checkbox').forEach(function(el) { el.checked = false; });
satUpdateGPUSelectionNote();
}
function satLoadGPUs() {
loadSatNvidiaGPUs().then(function(gpus) {
satRenderGPUList(gpus);
}).catch(function(err) {
const root = document.getElementById('sat-gpu-list');
if (root) {
root.innerHTML = '<p style="color:var(--crit-fg);font-size:13px">Error: ' + err.message + '</p>';
}
satUpdateGPUSelectionNote();
});
}
function satGPUDisplayName(gpu) {
const idx = (gpu && Number.isFinite(Number(gpu.index))) ? Number(gpu.index) : 0;
const name = gpu && gpu.name ? gpu.name : ('GPU ' + idx);
@@ -1137,6 +1223,36 @@ function enqueueSATTarget(target, overrides) {
return fetch('/api/sat/'+target+'/run', {method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(satRequestBody(target, overrides))})
.then(r => r.json());
}
function streamSATTask(taskId, title, resetTerminal) {
if (satES) { satES.close(); satES = null; }
document.getElementById('sat-output').style.display='block';
document.getElementById('sat-title').textContent = '— ' + title;
const term = document.getElementById('sat-terminal');
if (resetTerminal) {
term.textContent = '';
}
term.textContent += 'Task ' + taskId + ' queued. Streaming log...\n';
return new Promise(function(resolve) {
satES = new EventSource('/api/tasks/' + taskId + '/stream');
satES.onmessage = function(e) { term.textContent += e.data + '\n'; term.scrollTop = term.scrollHeight; };
satES.addEventListener('done', function(e) {
satES.close();
satES = null;
term.textContent += (e.data ? '\nERROR: ' + e.data : '\nCompleted.') + '\n';
term.scrollTop = term.scrollHeight;
resolve({ok: !e.data, error: e.data || ''});
});
satES.onerror = function() {
if (satES) {
satES.close();
satES = null;
}
term.textContent += '\nERROR: stream disconnected.\n';
term.scrollTop = term.scrollHeight;
resolve({ok: false, error: 'stream disconnected'});
};
});
}
function selectedAMDValidateTargets() {
const targets = [];
const gpu = document.getElementById('sat-amd-target');
@@ -1151,24 +1267,23 @@ function runSAT(target) {
return runSATWithOverrides(target, null);
}
function runSATWithOverrides(target, overrides) {
if (satES) { satES.close(); satES = null; }
document.getElementById('sat-output').style.display='block';
document.getElementById('sat-title').textContent = '— ' + target;
const title = (overrides && overrides.display_name) || target;
const term = document.getElementById('sat-terminal');
term.textContent = 'Enqueuing ' + target + ' test...\n';
document.getElementById('sat-output').style.display='block';
document.getElementById('sat-title').textContent = '— ' + title;
term.textContent = 'Enqueuing ' + title + ' test...\n';
return enqueueSATTarget(target, overrides)
.then(d => {
term.textContent += 'Task ' + d.task_id + ' queued. Streaming log...\n';
satES = new EventSource('/api/tasks/'+d.task_id+'/stream');
satES.onmessage = e => { term.textContent += e.data+'\n'; term.scrollTop=term.scrollHeight; };
satES.addEventListener('done', e => { satES.close(); satES=null; term.textContent += (e.data ? '\nERROR: '+e.data : '\nCompleted.')+'\n'; });
});
.then(d => streamSATTask(d.task_id, title, false));
}
function expandSATTarget(target) {
if (target !== 'nvidia' && target !== 'nvidia-targeted-stress') {
return Promise.resolve([{target: target}]);
}
return loadSatNvidiaGPUs().then(gpus => gpus.map(gpu => ({
const selected = satSelectedGPUIndices();
if (!selected.length) {
return Promise.reject(new Error('Select at least one NVIDIA GPU.'));
}
return loadSatNvidiaGPUs().then(gpus => gpus.filter(gpu => selected.indexOf(Number(gpu.index)) >= 0).map(gpu => ({
target: target,
overrides: {
gpu_indices: [Number(gpu.index)],
@@ -1179,65 +1294,61 @@ function expandSATTarget(target) {
}
function runNvidiaValidateSet(target) {
return loadSatNvidiaGPUs().then(gpus => {
if (!gpus.length) return;
if (gpus.length === 1) {
const gpu = gpus[0];
const selected = satSelectedGPUIndices();
const picked = gpus.filter(gpu => selected.indexOf(Number(gpu.index)) >= 0);
if (!picked.length) {
throw new Error('Select at least one NVIDIA GPU.');
}
if (picked.length === 1) {
const gpu = picked[0];
return runSATWithOverrides(target, {
gpu_indices: [Number(gpu.index)],
display_name: (satLabels()[target] || ('Validate ' + target)) + ' (' + satGPUDisplayName(gpu) + ')'
});
}
if (satES) { satES.close(); satES = null; }
document.getElementById('sat-output').style.display='block';
document.getElementById('sat-title').textContent = '— ' + target;
const term = document.getElementById('sat-terminal');
term.textContent = 'Enqueuing ' + target + ' tests one GPU at a time...\n';
term.textContent = 'Running ' + target + ' one GPU at a time...\n';
const labelBase = satLabels()[target] || ('Validate ' + target);
const enqueueNext = (idx) => {
if (idx >= gpus.length) return;
const gpu = gpus[idx];
const runNext = (idx) => {
if (idx >= picked.length) return Promise.resolve();
const gpu = picked[idx];
const gpuLabel = satGPUDisplayName(gpu);
enqueueSATTarget(target, {
term.textContent += '\n[' + (idx + 1) + '/' + picked.length + '] ' + gpuLabel + '\n';
return enqueueSATTarget(target, {
gpu_indices: [Number(gpu.index)],
display_name: labelBase + ' (' + gpuLabel + ')'
}).then(d => {
term.textContent += 'Task ' + d.task_id + ' queued for ' + gpuLabel + '.\n';
if (idx === gpus.length - 1) {
satES = new EventSource('/api/tasks/' + d.task_id + '/stream');
satES.onmessage = e => { term.textContent += e.data+'\n'; term.scrollTop=term.scrollHeight; };
satES.addEventListener('done', e => { satES.close(); satES=null; term.textContent += (e.data ? '\nERROR: '+e.data : '\nCompleted.')+'\n'; });
}
enqueueNext(idx + 1);
return streamSATTask(d.task_id, labelBase + ' (' + gpuLabel + ')', false);
}).then(function() {
return runNext(idx + 1);
});
};
enqueueNext(0);
return runNext(0);
});
}
function runAMDValidateSet() {
const targets = selectedAMDValidateTargets();
if (!targets.length) return;
if (targets.length === 1) return runSAT(targets[0]);
if (satES) { satES.close(); satES = null; }
document.getElementById('sat-output').style.display='block';
document.getElementById('sat-title').textContent = '— amd';
const term = document.getElementById('sat-terminal');
term.textContent = 'Enqueuing AMD validate set...\n';
term.textContent = 'Running AMD validate set one by one...\n';
const labels = satLabels();
const enqueueNext = (idx) => {
if (idx >= targets.length) return;
const runNext = (idx) => {
if (idx >= targets.length) return Promise.resolve();
const target = targets[idx];
enqueueSATTarget(target)
term.textContent += '\n[' + (idx + 1) + '/' + targets.length + '] ' + labels[target] + '\n';
return enqueueSATTarget(target)
.then(d => {
term.textContent += 'Task ' + d.task_id + ' queued for ' + labels[target] + '.\n';
if (idx === targets.length - 1) {
satES = new EventSource('/api/tasks/'+d.task_id+'/stream');
satES.onmessage = e => { term.textContent += e.data+'\n'; term.scrollTop=term.scrollHeight; };
satES.addEventListener('done', e => { satES.close(); satES=null; term.textContent += (e.data ? '\nERROR: '+e.data : '\nCompleted.')+'\n'; });
}
enqueueNext(idx + 1);
return streamSATTask(d.task_id, labels[target], false);
}).then(function() {
return runNext(idx + 1);
});
};
enqueueNext(0);
return runNext(0);
}
function runAllSAT() {
const cycles = Math.max(1, parseInt(document.getElementById('sat-cycles').value)||1);
@@ -1259,17 +1370,17 @@ function runAllSAT() {
status.textContent = 'No tasks selected.';
return;
}
const enqueueNext = (idx) => {
if (idx >= expanded.length) { status.textContent = 'Enqueued ' + total + ' tasks.'; return; }
const runNext = (idx) => {
if (idx >= expanded.length) { status.textContent = 'Completed ' + total + ' task(s).'; return Promise.resolve(); }
const item = expanded[idx];
enqueueSATTarget(item.target, item.overrides)
status.textContent = 'Running ' + (idx + 1) + '/' + total + '...';
return enqueueSATTarget(item.target, item.overrides)
.then(() => {
enqueued++;
status.textContent = 'Enqueued ' + enqueued + '/' + total + '...';
enqueueNext(idx + 1);
return runNext(idx + 1);
});
};
enqueueNext(0);
return runNext(0);
}).catch(err => {
status.textContent = 'Error: ' + err.message;
});
@@ -1282,6 +1393,7 @@ fetch('/api/gpu/presence').then(r=>r.json()).then(gp => {
if (!gp.amd) disableSATCard('amd', 'No AMD GPU detected');
if (!gp.amd) disableSATAMDOptions('No AMD GPU detected');
});
satLoadGPUs();
function disableSATAMDOptions(reason) {
['sat-amd-target','sat-amd-mem-target','sat-amd-bandwidth-target'].forEach(function(id) {
const cb = document.getElementById(id);
@@ -1470,7 +1582,25 @@ func renderSATCard(id, label, runAction, headerActions, body string) string {
// ── Benchmark ─────────────────────────────────────────────────────────────────
func renderBenchmark() string {
type benchmarkHistoryColumn struct {
key string
label string
name string
index int
}
type benchmarkHistoryCell struct {
score float64
present bool
}
type benchmarkHistoryRun struct {
generatedAt time.Time
displayTime string
cells map[string]benchmarkHistoryCell
}
func renderBenchmark(opts HandlerOptions) string {
return `<p style="color:var(--muted);font-size:13px;margin-bottom:16px">Benchmark runs generate a human-readable TXT report and machine-readable result bundle. Tasks continue in the background — view progress in <a href="/tasks">Tasks</a>.</p>
<div class="grid2">
@@ -1495,6 +1625,10 @@ func renderBenchmark() string {
<p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p>
</div>
</div>
<label class="benchmark-cb-row">
<input type="checkbox" id="benchmark-parallel-gpus">
<span>Run all selected GPUs simultaneously (parallel mode)</span>
</label>
<label class="benchmark-cb-row">
<input type="checkbox" id="benchmark-run-nccl" checked>
<span>Run multi-GPU interconnect step (NCCL) only on the selected GPUs</span>
@@ -1519,6 +1653,8 @@ func renderBenchmark() string {
</div>
</div>
` + renderBenchmarkResultsCard(opts.ExportDir) + `
<div id="benchmark-output" style="display:none;margin-top:16px" class="card">
<div class="card-head">Benchmark Output <span id="benchmark-title"></span></div>
<div class="card-body"><div id="benchmark-terminal" class="terminal"></div></div>
@@ -1534,6 +1670,12 @@ func renderBenchmark() string {
<script>
let benchmarkES = null;
function benchmarkTaskIDs(payload) {
if (payload && Array.isArray(payload.task_ids) && payload.task_ids.length) return payload.task_ids;
if (payload && payload.task_id) return [payload.task_id];
return [];
}
function benchmarkSelectedGPUIndices() {
return Array.from(document.querySelectorAll('.benchmark-gpu-checkbox'))
.filter(function(el) { return el.checked && !el.disabled; })
@@ -1612,10 +1754,12 @@ function runNvidiaBenchmark() {
return;
}
if (benchmarkES) { benchmarkES.close(); benchmarkES = null; }
const parallelGPUs = !!document.getElementById('benchmark-parallel-gpus').checked;
const body = {
profile: document.getElementById('benchmark-profile').value || 'standard',
gpu_indices: selected,
run_nccl: !!document.getElementById('benchmark-run-nccl').checked,
parallel_gpus: parallelGPUs,
display_name: 'NVIDIA Benchmark'
};
document.getElementById('benchmark-output').style.display = 'block';
@@ -1633,17 +1777,37 @@ function runNvidiaBenchmark() {
return payload;
});
}).then(function(d) {
status.textContent = 'Task ' + d.task_id + ' queued.';
term.textContent += 'Task ' + d.task_id + ' queued. Streaming log...\n';
benchmarkES = new EventSource('/api/tasks/' + d.task_id + '/stream');
benchmarkES.onmessage = function(e) { term.textContent += e.data + '\n'; term.scrollTop = term.scrollHeight; };
benchmarkES.addEventListener('done', function(e) {
benchmarkES.close();
benchmarkES = null;
term.textContent += (e.data ? '\nERROR: ' + e.data : '\nCompleted.') + '\n';
term.scrollTop = term.scrollHeight;
status.textContent = e.data ? 'Failed.' : 'Completed.';
});
const taskIds = benchmarkTaskIDs(d);
if (!taskIds.length) throw new Error('No benchmark task was queued.');
status.textContent = taskIds.length === 1 ? ('Task ' + taskIds[0] + ' queued.') : ('Queued ' + taskIds.length + ' tasks.');
const streamNext = function(idx, failures) {
if (idx >= taskIds.length) {
status.textContent = failures ? 'Completed with failures.' : 'Completed.';
return;
}
const taskId = taskIds[idx];
term.textContent += '\n[' + (idx + 1) + '/' + taskIds.length + '] Task ' + taskId + ' queued. Streaming log...\n';
benchmarkES = new EventSource('/api/tasks/' + taskId + '/stream');
benchmarkES.onmessage = function(e) { term.textContent += e.data + '\n'; term.scrollTop = term.scrollHeight; };
benchmarkES.addEventListener('done', function(e) {
benchmarkES.close();
benchmarkES = null;
if (e.data) failures += 1;
term.textContent += (e.data ? '\nERROR: ' + e.data : '\nCompleted.') + '\n';
term.scrollTop = term.scrollHeight;
streamNext(idx + 1, failures);
});
benchmarkES.onerror = function() {
if (benchmarkES) {
benchmarkES.close();
benchmarkES = null;
}
term.textContent += '\nERROR: stream disconnected.\n';
term.scrollTop = term.scrollHeight;
streamNext(idx + 1, failures + 1);
};
};
streamNext(0, 0);
}).catch(function(err) {
status.textContent = 'Error.';
term.textContent += 'ERROR: ' + err.message + '\n';
@@ -1655,6 +1819,147 @@ benchmarkLoadGPUs();
</script>`
}
func renderBenchmarkResultsCard(exportDir string) string {
columns, runs := loadBenchmarkHistory(exportDir)
return renderBenchmarkResultsCardFromRuns(
"Benchmark Results",
"Composite score by saved benchmark run and GPU.",
"No saved benchmark runs yet.",
columns,
runs,
)
}
func renderBenchmarkResultsCardFromRuns(title, description, emptyMessage string, columns []benchmarkHistoryColumn, runs []benchmarkHistoryRun) string {
if len(runs) == 0 {
return `<div class="card"><div class="card-head">` + html.EscapeString(title) + `</div><div class="card-body"><p style="color:var(--muted);font-size:13px">` + html.EscapeString(emptyMessage) + `</p></div></div>`
}
var b strings.Builder
b.WriteString(`<div class="card"><div class="card-head">` + html.EscapeString(title) + `</div><div class="card-body">`)
if strings.TrimSpace(description) != "" {
b.WriteString(`<p style="color:var(--muted);font-size:13px;margin-bottom:12px">` + html.EscapeString(description) + `</p>`)
}
b.WriteString(`<div style="overflow-x:auto">`)
b.WriteString(`<table><thead><tr><th>Test</th><th>Time</th>`)
for _, col := range columns {
b.WriteString(`<th>` + html.EscapeString(col.label) + `</th>`)
}
b.WriteString(`</tr></thead><tbody>`)
for i, run := range runs {
b.WriteString(`<tr>`)
b.WriteString(`<td>#` + strconv.Itoa(i+1) + `</td>`)
b.WriteString(`<td>` + html.EscapeString(run.displayTime) + `</td>`)
for _, col := range columns {
cell, ok := run.cells[col.key]
if !ok || !cell.present {
b.WriteString(`<td style="color:var(--muted)">-</td>`)
continue
}
b.WriteString(`<td>` + fmt.Sprintf("%.2f", cell.score) + `</td>`)
}
b.WriteString(`</tr>`)
}
b.WriteString(`</tbody></table></div></div></div>`)
return b.String()
}
func loadBenchmarkHistory(exportDir string) ([]benchmarkHistoryColumn, []benchmarkHistoryRun) {
baseDir := app.DefaultBenchmarkBaseDir
if strings.TrimSpace(exportDir) != "" {
baseDir = filepath.Join(exportDir, "bee-benchmark")
}
paths, err := filepath.Glob(filepath.Join(baseDir, "gpu-benchmark-*", "result.json"))
if err != nil || len(paths) == 0 {
return nil, nil
}
sort.Strings(paths)
return loadBenchmarkHistoryFromPaths(paths)
}
func loadBenchmarkHistoryFromPaths(paths []string) ([]benchmarkHistoryColumn, []benchmarkHistoryRun) {
columnByKey := make(map[string]benchmarkHistoryColumn)
runs := make([]benchmarkHistoryRun, 0, len(paths))
for _, path := range paths {
raw, err := os.ReadFile(path)
if err != nil {
continue
}
var result platform.NvidiaBenchmarkResult
if err := json.Unmarshal(raw, &result); err != nil {
continue
}
run := benchmarkHistoryRun{
generatedAt: result.GeneratedAt,
displayTime: result.GeneratedAt.Local().Format("2006-01-02 15:04:05"),
cells: make(map[string]benchmarkHistoryCell),
}
// Count how many GPUs of each model appear in this run (for the label).
gpuModelCount := make(map[string]int)
for _, gpu := range result.GPUs {
gpuModelCount[strings.TrimSpace(gpu.Name)]++
}
// Track best composite score per column key within this run.
runBest := make(map[string]float64)
for _, gpu := range result.GPUs {
key := benchmarkHistoryColumnKey(result.ServerModel, gpu.Name)
count := gpuModelCount[strings.TrimSpace(gpu.Name)]
columnByKey[key] = benchmarkHistoryColumn{
key: key,
label: benchmarkHistoryColumnLabel(result.ServerModel, gpu.Name, count),
name: strings.TrimSpace(gpu.Name),
index: gpu.Index,
}
if gpu.Scores.CompositeScore > runBest[key] {
runBest[key] = gpu.Scores.CompositeScore
}
}
for key, score := range runBest {
run.cells[key] = benchmarkHistoryCell{score: score, present: true}
}
runs = append(runs, run)
}
columns := make([]benchmarkHistoryColumn, 0, len(columnByKey))
for _, col := range columnByKey {
columns = append(columns, col)
}
sort.Slice(columns, func(i, j int) bool {
li := strings.ToLower(columns[i].label)
lj := strings.ToLower(columns[j].label)
if li != lj {
return li < lj
}
return columns[i].key < columns[j].key
})
sort.Slice(runs, func(i, j int) bool {
return runs[i].generatedAt.After(runs[j].generatedAt)
})
return columns, runs
}
// benchmarkHistoryColumnKey groups results by server model + GPU model so that
// runs on the same hardware produce one column regardless of individual GPU index.
func benchmarkHistoryColumnKey(serverModel, gpuName string) string {
return strings.TrimSpace(serverModel) + "|" + strings.TrimSpace(gpuName)
}
// benchmarkHistoryColumnLabel formats the column header as
// "Server Model (N× GPU Model)" or "GPU Model" when server info is missing.
func benchmarkHistoryColumnLabel(serverModel, gpuName string, count int) string {
serverModel = strings.TrimSpace(serverModel)
gpuName = strings.TrimSpace(gpuName)
if gpuName == "" {
gpuName = "Unknown GPU"
}
gpuPart := fmt.Sprintf("%d× %s", count, gpuName)
if serverModel == "" {
return gpuPart
}
return fmt.Sprintf("%s (%s)", serverModel, gpuPart)
}
// ── Burn ──────────────────────────────────────────────────────────────────────
func renderBurn() string {
@@ -1774,6 +2079,12 @@ func renderBurn() string {
<script>
let biES = null;
function burnTaskIDs(payload) {
if (payload && Array.isArray(payload.task_ids) && payload.task_ids.length) return payload.task_ids;
if (payload && payload.task_id) return [payload.task_id];
return [];
}
function burnProfile() {
const selected = document.querySelector('input[name="burn-profile"]:checked');
return selected ? selected.value : 'smoke';
@@ -1874,6 +2185,52 @@ function streamTask(taskId, label) {
term.scrollTop = term.scrollHeight;
});
}
function streamBurnTask(taskId, label, resetTerminal) {
return streamBurnTaskSet([taskId], label, resetTerminal);
}
function streamBurnTaskSet(taskIds, label, resetTerminal) {
if (biES) { biES.close(); biES = null; }
document.getElementById('bi-output').style.display = 'block';
document.getElementById('bi-title').textContent = '— ' + label + ' [' + burnProfile() + ']';
const term = document.getElementById('bi-terminal');
if (resetTerminal) {
term.textContent = '';
}
if (!Array.isArray(taskIds) || !taskIds.length) {
term.textContent += 'ERROR: no tasks queued.\n';
return Promise.resolve({ok:false, error:'no tasks queued'});
}
const streamNext = function(idx, failures) {
if (idx >= taskIds.length) {
return Promise.resolve({ok: failures === 0, error: failures ? (failures + ' task(s) failed') : ''});
}
const taskId = taskIds[idx];
term.textContent += '[' + (idx + 1) + '/' + taskIds.length + '] Task ' + taskId + ' queued. Streaming...\n';
return new Promise(function(resolve) {
biES = new EventSource('/api/tasks/' + taskId + '/stream');
biES.onmessage = function(e) { term.textContent += e.data + '\n'; term.scrollTop = term.scrollHeight; };
biES.addEventListener('done', function(e) {
biES.close();
biES = null;
term.textContent += (e.data ? '\nERROR: ' + e.data : '\nCompleted.') + '\n';
term.scrollTop = term.scrollHeight;
resolve(failures + (e.data ? 1 : 0));
});
biES.onerror = function() {
if (biES) {
biES.close();
biES = null;
}
term.textContent += '\nERROR: stream disconnected.\n';
term.scrollTop = term.scrollHeight;
resolve(failures + 1);
};
}).then(function(nextFailures) {
return streamNext(idx + 1, nextFailures);
});
};
return streamNext(0, 0);
}
function runBurnTaskSet(tasks, statusElId) {
const enabled = tasks.filter(function(t) {
@@ -1886,19 +2243,33 @@ function runBurnTaskSet(tasks, statusElId) {
if (status) status.textContent = 'No tasks selected.';
return;
}
enabled.forEach(function(t) {
enqueueBurnTask(t.target, t.label, t.extra, !!t.nvidia)
const term = document.getElementById('bi-terminal');
document.getElementById('bi-output').style.display = 'block';
document.getElementById('bi-title').textContent = '— Burn one by one [' + burnProfile() + ']';
term.textContent = '';
const runNext = function(idx) {
if (idx >= enabled.length) {
if (status) status.textContent = 'Completed ' + enabled.length + ' task(s).';
return Promise.resolve();
}
const t = enabled[idx];
term.textContent += '\n[' + (idx + 1) + '/' + enabled.length + '] ' + t.label + '\n';
if (status) status.textContent = 'Running ' + (idx + 1) + '/' + enabled.length + '...';
return enqueueBurnTask(t.target, t.label, t.extra, !!t.nvidia)
.then(function(d) {
if (status) status.textContent = enabled.length + ' task(s) queued.';
streamTask(d.task_id, t.label);
return streamBurnTaskSet(burnTaskIDs(d), t.label, false);
})
.then(function() {
return runNext(idx + 1);
})
.catch(function(err) {
if (status) status.textContent = 'Error: ' + err.message;
const term = document.getElementById('bi-terminal');
document.getElementById('bi-output').style.display = 'block';
term.textContent += 'ERROR: ' + err.message + '\n';
return Promise.reject(err);
});
});
};
return runNext(0);
}
function runPlatformStress() {
@@ -2107,9 +2478,12 @@ func renderServicesInline() string {
return `<p style="font-size:13px;color:var(--muted);margin-bottom:10px">` + html.EscapeString(`bee-selfheal.timer is expected to be active; the oneshot bee-selfheal.service itself is not shown as a long-running service.`) + `</p>
<div style="display:flex;justify-content:flex-end;gap:8px;flex-wrap:wrap;margin-bottom:8px"><button class="btn btn-sm btn-secondary" onclick="restartGPUDrivers()">Restart GPU Drivers</button><button class="btn btn-sm btn-secondary" onclick="loadServices()">&#8635; Refresh</button></div>
<div id="svc-table"><p style="color:var(--muted);font-size:13px">Loading...</p></div>
<div id="svc-out" style="display:none;margin-top:8px" class="card">
<div class="card-head">Output</div>
<div class="card-body" style="padding:10px"><div id="svc-terminal" class="terminal" style="max-height:150px"></div></div>
<div id="svc-out" style="display:none;margin-top:12px">
<div style="display:flex;align-items:center;justify-content:space-between;margin-bottom:4px">
<span id="svc-out-label" style="font-size:12px;font-weight:600;color:var(--muted)">Output</span>
<span id="svc-out-status" style="font-size:12px"></span>
</div>
<div id="svc-terminal" class="terminal" style="max-height:220px;width:100%;box-sizing:border-box"></div>
</div>
<script>
function loadServices() {
@@ -2125,9 +2499,9 @@ function loadServices() {
'<div id="'+id+'" style="display:none;margin-top:6px"><pre style="font-size:11px;white-space:pre-wrap;word-break:break-all;max-height:200px;overflow-y:auto;background:#1b1c1d;padding:8px;border-radius:4px;color:#b5cea8">'+body+'</pre></div>' +
'</td>' +
'<td style="white-space:nowrap">' +
'<button class="btn btn-sm btn-secondary" onclick="svcAction(\''+s.name+'\',\'start\')">Start</button> ' +
'<button class="btn btn-sm btn-secondary" onclick="svcAction(\''+s.name+'\',\'stop\')">Stop</button> ' +
'<button class="btn btn-sm btn-secondary" onclick="svcAction(\''+s.name+'\',\'restart\')">Restart</button>' +
'<button class="btn btn-sm btn-secondary" id="btn-'+s.name+'-start" onclick="svcAction(this,\''+s.name+'\',\'start\')">Start</button> ' +
'<button class="btn btn-sm btn-secondary" id="btn-'+s.name+'-stop" onclick="svcAction(this,\''+s.name+'\',\'stop\')">Stop</button> ' +
'<button class="btn btn-sm btn-secondary" id="btn-'+s.name+'-restart" onclick="svcAction(this,\''+s.name+'\',\'restart\')">Restart</button>' +
'</td></tr>';
}).join('');
document.getElementById('svc-table').innerHTML =
@@ -2138,16 +2512,45 @@ function toggleBody(id) {
const el = document.getElementById(id);
if (el) el.style.display = el.style.display==='none' ? 'block' : 'none';
}
function svcAction(name, action) {
function svcAction(btn, name, action) {
var label = btn.textContent;
btn.disabled = true;
btn.textContent = '...';
var out = document.getElementById('svc-out');
var term = document.getElementById('svc-terminal');
var statusEl = document.getElementById('svc-out-status');
var labelEl = document.getElementById('svc-out-label');
out.style.display = 'block';
labelEl.textContent = action + ' ' + name;
term.textContent = 'Running...';
statusEl.textContent = '';
statusEl.style.color = '';
fetch('/api/services/action',{method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify({name,action})})
.then(r=>r.json()).then(d => {
document.getElementById('svc-out').style.display='block';
document.getElementById('svc-terminal').textContent = d.output || d.error || action+' '+name;
setTimeout(loadServices, 1000);
term.textContent = d.output || d.error || '(no output)';
term.scrollTop = term.scrollHeight;
if (d.status === 'ok') {
statusEl.textContent = '✓ done';
statusEl.style.color = 'var(--ok-fg, #2c662d)';
} else {
statusEl.textContent = '✗ failed';
statusEl.style.color = 'var(--crit-fg, #9f3a38)';
}
btn.textContent = label;
btn.disabled = false;
setTimeout(loadServices, 800);
}).catch(e => {
term.textContent = 'Request failed: ' + e;
statusEl.textContent = '✗ error';
statusEl.style.color = 'var(--crit-fg, #9f3a38)';
btn.textContent = label;
btn.disabled = false;
});
}
function restartGPUDrivers() {
svcAction('bee-nvidia', 'restart');
var btn = document.querySelector('[onclick*="restartGPUDrivers"]');
if (!btn) { svcAction({textContent:'',disabled:false}, 'bee-nvidia', 'restart'); return; }
svcAction(btn, 'bee-nvidia', 'restart');
}
loadServices();
</script>`

View File

@@ -1,6 +1,7 @@
package webui
import (
"encoding/json"
"net/http"
"net/http/httptest"
"os"
@@ -601,8 +602,8 @@ func TestToolsPageRendersRestartGPUDriversButton(t *testing.T) {
if !strings.Contains(body, `Restart GPU Drivers`) {
t.Fatalf("tools page missing restart gpu drivers button: %s", body)
}
if !strings.Contains(body, `svcAction('bee-nvidia', 'restart')`) {
t.Fatalf("tools page missing bee-nvidia restart action: %s", body)
if !strings.Contains(body, `restartGPUDrivers()`) {
t.Fatalf("tools page missing restartGPUDrivers action: %s", body)
}
if !strings.Contains(body, `id="boot-source-text"`) {
t.Fatalf("tools page missing boot source field: %s", body)
@@ -636,6 +637,66 @@ func TestBenchmarkPageRendersGPUSelectionControls(t *testing.T) {
}
}
func TestBenchmarkPageRendersSavedResultsTable(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
runDir := filepath.Join(exportDir, "bee-benchmark", "gpu-benchmark-20260406-120000")
if err := os.MkdirAll(runDir, 0755); err != nil {
t.Fatal(err)
}
result := platform.NvidiaBenchmarkResult{
GeneratedAt: time.Date(2026, time.April, 6, 12, 0, 0, 0, time.UTC),
BenchmarkProfile: "standard",
OverallStatus: "OK",
GPUs: []platform.BenchmarkGPUResult{
{
Index: 0,
Name: "NVIDIA H100 PCIe",
Scores: platform.BenchmarkScorecard{
CompositeScore: 1176.25,
},
},
{
Index: 1,
Name: "NVIDIA H100 PCIe",
Scores: platform.BenchmarkScorecard{
CompositeScore: 1168.50,
},
},
},
}
raw, err := json.Marshal(result)
if err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(runDir, "result.json"), raw, 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/benchmark", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
wantTime := result.GeneratedAt.Local().Format("2006-01-02 15:04:05")
for _, needle := range []string{
`Benchmark Results`,
`Composite score by saved benchmark run and GPU.`,
`NVIDIA H100 PCIe / GPU 0`,
`NVIDIA H100 PCIe / GPU 1`,
`#1`,
wantTime,
`1176.25`,
`1168.50`,
} {
if !strings.Contains(body, needle) {
t.Fatalf("benchmark page missing %q: %s", needle, body)
}
}
}
func TestValidatePageRendersNvidiaTargetedStressCard(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
@@ -649,6 +710,10 @@ func TestValidatePageRendersNvidiaTargetedStressCard(t *testing.T) {
`nvidia-targeted-stress`,
`controlled NVIDIA DCGM load`,
`<code>dcgmi diag targeted_stress</code>`,
`NVIDIA GPU Selection`,
`All NVIDIA validate tasks use only the GPUs selected here.`,
`Select All`,
`id="sat-gpu-list"`,
} {
if !strings.Contains(body, needle) {
t.Fatalf("validate page missing %q: %s", needle, body)

View File

@@ -97,32 +97,73 @@ func renderTaskDetailPage(opts HandlerOptions, task Task) string {
body.WriteString(`</div></div>`)
}
if task.Status == TaskRunning || task.Status == TaskPending {
if task.Status == TaskRunning {
body.WriteString(`<div class="card"><div class="card-head">Live Charts</div><div class="card-body">`)
body.WriteString(`<div id="task-live-charts" style="display:flex;flex-direction:column;gap:16px;color:var(--muted);font-size:13px">Loading charts...</div>`)
body.WriteString(`</div></div>`)
}
if task.Status == TaskRunning || task.Status == TaskPending {
body.WriteString(`<div class="card"><div class="card-head">Live Logs</div><div class="card-body">`)
body.WriteString(`<div id="task-live-log" class="terminal" style="max-height:none;white-space:pre-wrap">Connecting...</div>`)
body.WriteString(`</div></div>`)
body.WriteString(`<script>
function cancelTaskDetail(id) {
fetch('/api/tasks/' + id + '/cancel', {method:'POST'}).then(function(){ window.location.reload(); });
fetch('/api/tasks/' + id + '/cancel', {method:'POST'}).then(function(){
var term = document.getElementById('task-live-log');
if (term) {
term.textContent += '\nCancel requested.\n';
term.scrollTop = term.scrollHeight;
}
});
}
function renderTaskLiveCharts(taskId, charts) {
const host = document.getElementById('task-live-charts');
if (!host) return;
if (!Array.isArray(charts) || charts.length === 0) {
host.innerHTML = 'Waiting for metric samples...';
return;
}
const seen = {};
charts.forEach(function(chart) {
seen[chart.file] = true;
let img = host.querySelector('img[data-chart-file="' + chart.file + '"]');
if (img) {
const card = img.closest('.card');
if (card) {
const title = card.querySelector('.card-head');
if (title) title.textContent = chart.title;
}
return;
}
const card = document.createElement('div');
card.className = 'card';
card.style.margin = '0';
card.innerHTML = '<div class="card-head"></div><div class="card-body" style="padding:12px"></div>';
card.querySelector('.card-head').textContent = chart.title;
const body = card.querySelector('.card-body');
img = document.createElement('img');
img.setAttribute('data-task-chart', '1');
img.setAttribute('data-chart-file', chart.file);
img.setAttribute('data-base-src', '/api/tasks/' + taskId + '/chart/' + chart.file);
img.src = '/api/tasks/' + taskId + '/chart/' + chart.file + '?t=' + Date.now();
img.style.width = '100%';
img.style.display = 'block';
img.style.borderRadius = '6px';
img.alt = chart.title;
body.appendChild(img);
host.appendChild(card);
});
Array.from(host.querySelectorAll('img[data-task-chart="1"]')).forEach(function(img) {
const file = img.getAttribute('data-chart-file') || '';
if (seen[file]) return;
const card = img.closest('.card');
if (card) card.remove();
});
}
function loadTaskLiveCharts(taskId) {
fetch('/api/tasks/' + taskId + '/charts').then(function(r){ return r.json(); }).then(function(charts){
const host = document.getElementById('task-live-charts');
if (!host) return;
if (!Array.isArray(charts) || charts.length === 0) {
host.innerHTML = 'Waiting for metric samples...';
return;
}
host.innerHTML = charts.map(function(chart) {
return '<div class="card" style="margin:0">' +
'<div class="card-head">' + chart.title + '</div>' +
'<div class="card-body" style="padding:12px">' +
'<img data-task-chart="1" data-base-src="/api/tasks/' + taskId + '/chart/' + chart.file + '" src="/api/tasks/' + taskId + '/chart/' + chart.file + '?t=' + Date.now() + '" style="width:100%;display:block;border-radius:6px" alt="' + chart.title + '">' +
'</div></div>';
}).join('');
renderTaskLiveCharts(taskId, charts);
}).catch(function(){
const host = document.getElementById('task-live-charts');
if (host) host.innerHTML = 'Task charts are unavailable.';
@@ -138,12 +179,31 @@ function refreshTaskLiveCharts() {
var _taskDetailES = new EventSource('/api/tasks/` + html.EscapeString(task.ID) + `/stream');
var _taskDetailTerm = document.getElementById('task-live-log');
var _taskChartTimer = null;
var _taskChartsFrozen = false;
_taskDetailES.onopen = function(){ _taskDetailTerm.textContent = ''; };
_taskDetailES.onmessage = function(e){ _taskDetailTerm.textContent += e.data + "\n"; _taskDetailTerm.scrollTop = _taskDetailTerm.scrollHeight; };
_taskDetailES.addEventListener('done', function(){ if (_taskChartTimer) clearInterval(_taskChartTimer); _taskDetailES.close(); setTimeout(function(){ window.location.reload(); }, 1000); });
_taskDetailES.onerror = function(){ if (_taskChartTimer) clearInterval(_taskChartTimer); _taskDetailES.close(); };
_taskDetailES.addEventListener('done', function(e){
if (_taskChartTimer) clearInterval(_taskChartTimer);
_taskDetailES.close();
_taskDetailES = null;
_taskChartsFrozen = true;
_taskDetailTerm.textContent += (e.data ? '\nTask finished with error.\n' : '\nTask finished.\n');
_taskDetailTerm.scrollTop = _taskDetailTerm.scrollHeight;
refreshTaskLiveCharts();
});
_taskDetailES.onerror = function(){
if (_taskChartTimer) clearInterval(_taskChartTimer);
if (_taskDetailES) {
_taskDetailES.close();
_taskDetailES = null;
}
};
loadTaskLiveCharts('` + html.EscapeString(task.ID) + `');
_taskChartTimer = setInterval(function(){ refreshTaskLiveCharts(); loadTaskLiveCharts('` + html.EscapeString(task.ID) + `'); }, 2000);
_taskChartTimer = setInterval(function(){
if (_taskChartsFrozen) return;
loadTaskLiveCharts('` + html.EscapeString(task.ID) + `');
refreshTaskLiveCharts();
}, 2000);
</script>`)
}

View File

@@ -230,6 +230,9 @@ func renderTaskReportFragment(report taskReport, charts map[string]string, logTe
b.WriteString(`<div style="margin-top:14px;font-size:13px;color:var(--muted)">`)
b.WriteString(`Started: ` + formatTaskTime(report.StartedAt, report.CreatedAt) + ` | Finished: ` + formatTaskTime(report.DoneAt, time.Time{}) + ` | Duration: ` + formatTaskDuration(report.DurationSec))
b.WriteString(`</div></div></div>`)
if benchmarkCard := renderTaskBenchmarkResultsCard(report.Target, logText); benchmarkCard != "" {
b.WriteString(benchmarkCard)
}
if len(report.Charts) > 0 {
for _, chart := range report.Charts {
@@ -247,6 +250,57 @@ func renderTaskReportFragment(report taskReport, charts map[string]string, logTe
return b.String()
}
func renderTaskBenchmarkResultsCard(target, logText string) string {
if strings.TrimSpace(target) != "nvidia-benchmark" {
return ""
}
resultPath := taskBenchmarkResultPath(logText)
if strings.TrimSpace(resultPath) == "" {
return ""
}
columns, runs := loadBenchmarkHistoryFromPaths([]string{resultPath})
if len(runs) == 0 {
return ""
}
return renderBenchmarkResultsCardFromRuns(
"Benchmark Results",
"Composite score for this benchmark task.",
"No benchmark results were saved for this task.",
columns,
runs,
)
}
func taskBenchmarkResultPath(logText string) string {
archivePath := taskArchivePathFromLog(logText)
if archivePath == "" {
return ""
}
runDir := strings.TrimSuffix(archivePath, ".tar.gz")
if runDir == archivePath {
return ""
}
return filepath.Join(runDir, "result.json")
}
func taskArchivePathFromLog(logText string) string {
lines := strings.Split(logText, "\n")
for i := len(lines) - 1; i >= 0; i-- {
line := strings.TrimSpace(lines[i])
if line == "" || !strings.HasPrefix(line, "Archive:") {
continue
}
path := strings.TrimSpace(strings.TrimPrefix(line, "Archive:"))
if strings.HasPrefix(path, "Archive written to ") {
path = strings.TrimSpace(strings.TrimPrefix(path, "Archive written to "))
}
if strings.HasSuffix(path, ".tar.gz") {
return path
}
}
return ""
}
func renderTaskStatusBadge(status string) string {
className := map[string]string{
TaskRunning: "badge-ok",

View File

@@ -123,6 +123,7 @@ type taskParams struct {
BurnProfile string `json:"burn_profile,omitempty"`
BenchmarkProfile string `json:"benchmark_profile,omitempty"`
RunNCCL bool `json:"run_nccl,omitempty"`
ParallelGPUs bool `json:"parallel_gpus,omitempty"`
DisplayName string `json:"display_name,omitempty"`
Device string `json:"device,omitempty"` // for install
PlatformComponents []string `json:"platform_components,omitempty"`
@@ -423,13 +424,14 @@ func (q *taskQueue) worker() {
setCPUGovernor("performance")
defer setCPUGovernor("powersave")
// Drain all pending tasks and start them in parallel.
q.mu.Lock()
var batch []*Task
for {
q.mu.Lock()
t := q.nextPending()
if t == nil {
break
q.prune()
q.persistLocked()
q.mu.Unlock()
return
}
now := time.Now()
t.Status = TaskRunning
@@ -438,29 +440,14 @@ func (q *taskQueue) worker() {
t.ErrMsg = ""
j := newTaskJobState(t.LogPath, taskSerialPrefix(t))
t.job = j
batch = append(batch, t)
}
if len(batch) > 0 {
q.persistLocked()
}
q.mu.Unlock()
q.mu.Unlock()
var wg sync.WaitGroup
for _, t := range batch {
t := t
j := t.job
taskCtx, taskCancel := context.WithCancel(context.Background())
j.cancel = taskCancel
wg.Add(1)
goRecoverOnce("task "+t.Target, func() {
defer wg.Done()
defer taskCancel()
q.executeTask(t, j, taskCtx)
})
}
wg.Wait()
q.executeTask(t, j, taskCtx)
taskCancel()
if len(batch) > 0 {
q.mu.Lock()
q.prune()
q.persistLocked()
@@ -599,6 +586,7 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
RunNCCL: t.params.RunNCCL,
ParallelGPUs: t.params.ParallelGPUs,
}, j.append)
case "nvidia-compute":
if a == nil {
@@ -1163,7 +1151,32 @@ func taskArtifactsDir(root string, t *Task, status string) string {
if strings.TrimSpace(root) == "" || t == nil {
return ""
}
return filepath.Join(root, fmt.Sprintf("%s_%s_%s", t.ID, sanitizeTaskFolderPart(t.Name), taskFolderStatus(status)))
prefix := taskFolderNumberPrefix(t.ID)
return filepath.Join(root, fmt.Sprintf("%s_%s_%s", prefix, sanitizeTaskFolderPart(t.Name), taskFolderStatus(status)))
}
func taskFolderNumberPrefix(taskID string) string {
taskID = strings.TrimSpace(taskID)
if strings.HasPrefix(taskID, "TASK-") && len(taskID) >= len("TASK-000") {
num := strings.TrimSpace(strings.TrimPrefix(taskID, "TASK-"))
if len(num) == 3 {
allDigits := true
for _, r := range num {
if r < '0' || r > '9' {
allDigits = false
break
}
}
if allDigits {
return num
}
}
}
fallback := sanitizeTaskFolderPart(taskID)
if fallback == "" {
return "000"
}
return fallback
}
func ensureTaskReportPaths(t *Task) {

View File

@@ -163,6 +163,40 @@ func TestTaskQueueSnapshotSortsNewestFirst(t *testing.T) {
}
}
func TestNewJobIDUsesTASKPrefixAndZeroPadding(t *testing.T) {
globalQueue.mu.Lock()
origTasks := globalQueue.tasks
globalQueue.tasks = nil
globalQueue.mu.Unlock()
origCounter := jobCounter.Load()
jobCounter.Store(0)
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = origTasks
globalQueue.mu.Unlock()
jobCounter.Store(origCounter)
})
if got := newJobID("ignored"); got != "TASK-000" {
t.Fatalf("id=%q want TASK-000", got)
}
if got := newJobID("ignored"); got != "TASK-001" {
t.Fatalf("id=%q want TASK-001", got)
}
}
func TestTaskArtifactsDirStartsWithTaskNumber(t *testing.T) {
root := t.TempDir()
task := &Task{
ID: "TASK-007",
Name: "NVIDIA Benchmark",
}
got := filepath.Base(taskArtifactsDir(root, task, TaskDone))
if !strings.HasPrefix(got, "007_") {
t.Fatalf("artifacts dir=%q want prefix 007_", got)
}
}
func TestHandleAPITasksStreamReplaysPersistedLogWithoutLiveJob(t *testing.T) {
dir := t.TempDir()
logPath := filepath.Join(dir, "task.log")
@@ -325,6 +359,78 @@ func TestFinalizeTaskRunCreatesReportFolderAndArtifacts(t *testing.T) {
}
}
func TestWriteTaskReportArtifactsIncludesBenchmarkResultsForTask(t *testing.T) {
dir := t.TempDir()
metricsPath := filepath.Join(dir, "metrics.db")
prevMetricsPath := taskReportMetricsDBPath
taskReportMetricsDBPath = metricsPath
t.Cleanup(func() { taskReportMetricsDBPath = prevMetricsPath })
benchmarkDir := filepath.Join(dir, "bee-benchmark", "gpu-benchmark-20260406-120000")
if err := os.MkdirAll(benchmarkDir, 0755); err != nil {
t.Fatal(err)
}
result := platform.NvidiaBenchmarkResult{
GeneratedAt: time.Date(2026, time.April, 6, 12, 0, 0, 0, time.UTC),
BenchmarkProfile: "standard",
OverallStatus: "OK",
GPUs: []platform.BenchmarkGPUResult{
{
Index: 0,
Name: "NVIDIA H100 PCIe",
Scores: platform.BenchmarkScorecard{
CompositeScore: 1176.25,
},
},
},
}
raw, err := json.Marshal(result)
if err != nil {
t.Fatal(err)
}
if err := os.WriteFile(filepath.Join(benchmarkDir, "result.json"), raw, 0644); err != nil {
t.Fatal(err)
}
artifactsDir := filepath.Join(dir, "tasks", "task-bench_done")
if err := os.MkdirAll(artifactsDir, 0755); err != nil {
t.Fatal(err)
}
task := &Task{
ID: "task-bench",
Name: "NVIDIA Benchmark",
Target: "nvidia-benchmark",
Status: TaskDone,
CreatedAt: time.Now().UTC().Add(-time.Minute),
ArtifactsDir: artifactsDir,
}
ensureTaskReportPaths(task)
logText := "line-1\nArchive: " + filepath.Join(dir, "bee-benchmark", "gpu-benchmark-20260406-120000.tar.gz") + "\n"
if err := os.WriteFile(task.LogPath, []byte(logText), 0644); err != nil {
t.Fatal(err)
}
if err := writeTaskReportArtifacts(task); err != nil {
t.Fatalf("writeTaskReportArtifacts: %v", err)
}
body, err := os.ReadFile(task.ReportHTMLPath)
if err != nil {
t.Fatalf("ReadFile(report.html): %v", err)
}
html := string(body)
for _, needle := range []string{
`Benchmark Results`,
`Composite score for this benchmark task.`,
`NVIDIA H100 PCIe / GPU 0`,
`1176.25`,
} {
if !strings.Contains(html, needle) {
t.Fatalf("report missing %q: %s", needle, html)
}
}
}
func TestTaskLifecycleMirrorsToSerialConsole(t *testing.T) {
var lines []string
prev := taskSerialWriteLine

2
bible

Submodule bible updated: 1d89a4918e...688b87e98d

View File

@@ -0,0 +1,248 @@
# Benchmark clock calibration research
## Status
In progress. Baseline data from production servers pending.
## Background
The benchmark locks GPU clocks to `MaxGraphicsClockMHz` (boost) via `nvidia-smi -lgc`
before the steady-state phase. The metric `low_sm_clock_vs_target` fires when
`avg_steady_clock < locked_target * 0.90`.
Problem: boost clock is the theoretical maximum under ideal cooling. In practice,
even a healthy GPU in a non-ideal server will sustain clocks well below boost.
The 90% threshold has no empirical basis.
## Key observations (2026-04-06)
### H100 PCIe — new card, server not designed for it
- avg clock 1384 MHz, P95 1560 MHz (unstable, proba boost 1755 MHz)
- Thermal sustain: 0.0 (sw_thermal covers entire steady window)
- Stability: 70.0 — clocks erratic, no equilibrium found
- Degradation: power_capped, thermal_limited, low_sm_clock_vs_target, variance_too_high
### H200 NVL — new card, server not designed for it
- avg clock = P95 = 1635 MHz (perfectly stable)
- Thermal sustain: 0.0 (sw_thermal + sw_power cover entire steady window)
- Stability: 92.0 — found stable thermal equilibrium at 1635 MHz
- Degradation: power_capped, thermal_limited
- Compute: 989 TOPS — card is computing correctly for its frequency
### Key insight
The meaningful distinction is not *whether* the card throttles but *how stably*
it throttles. H200 found a thermal equilibrium (avg == P95, Stability 92),
H100 did not (avg << P95, Stability 70). Both are new cards; the H100's
instability may reflect a more severe thermal mismatch or a card issue.
`sw_power ≈ sw_thermal` pattern = server cooling constraint, card likely OK.
`hw_thermal >> sw_thermal` pattern = card itself overheating, investigate.
## Hypothesis for baseline
After testing on servers designed for their GPUs (proper cooling):
- Healthy GPU under sustained load will run at a stable fraction of boost
- Expected: avg_steady ≈ 8095% of boost depending on model and TDP class
- Base clock (`clocks.base.gr`) may be a better reference than boost:
a healthy card under real workload should comfortably exceed base clock
## Baseline: H100 PCIe HBM2e — designed server (2026-04-06, 10 samples)
Source: external stress test tool, ~90s runs, designed server, adequate power.
### Healthy fingerprint
- **Power**: hits cap ~340360W immediately, stays flat throughout — HEALTHY
- **Clock**: starts ~1750 MHz, oscillates and declines to ~15401600 MHz by 90s
- Avg steady (visual): **~15801620 MHz**
- vs boost 1755 MHz: **~9192%**
- Oscillation is NORMAL — this is the boost algorithm balancing under power cap
- Stable power + oscillating clocks = healthy power-cap behavior
- **Temperature**: linear rise ~38°C → 7580°C over 90s (no runaway)
- **Consistency**: all 10 samples within ±20 MHz — very repeatable
### Characteristic patten
Flat power line + oscillating/declining clock line = GPU correctly managed by
power cap algorithm. Do NOT flag this as instability.
### Clock CV implication
The healthy oscillation WILL produce moderate ClockCVPct (~510%).
The current `variance_too_high` threshold (StabilityScore < 85) may fire on
healthy HBM2e PCIe cards. Needs recalibration.
---
## Baseline: H100 HBM3 OEM SXM Custom (restored) — 2 confirmed samples
Source: pytorch_training_loop stress test, 120s (90s stress + 30s cooldown).
Confirmed GPU: NVIDIA H100 80GB HBM3, GH100 rev a1.
### GPU clock reference (from nvidia-smi, idle):
- base_clock_mhz: **1095**
- boost_clock_mhz: **1755** (nvidia-smi `clocks.max.graphics` at idle)
- achieved_max_clock_mhz: **1980** (actual burst max observed by tool)
- Our benchmark locks to `clocks.max.graphics` = likely 1980 MHz for this chip
### Observed under 700W sustained load (both samples nearly identical):
- Power: ~700W flat — SXM slot, adequate power confirmed
- Clock steady range: **~13801480 MHz**, avg **~14201460 MHz**
- vs 1980 MHz (lock target): **7274%** — severely below
- vs 1755 MHz (nvidia-smi boost): **8183%**
- vs 1095 MHz (base): 130% — above base but far below expected for SXM
- Clock/Watt: ~2.1 MHz/W vs HBM2e ~4.6 MHz/W — 2× worse efficiency
- Temperature: 38°C → 7980°C (same rate as HBM2e)
- Oscillation: present, similar character to HBM2e but at much lower frequency
### Diagnosis
These restored cards are degraded. A healthy H100 SXM in a designed server
(DGX H100, HGX H100) should sustain ~18001900 MHz at 700W (~9196% of 1980).
The 7274% result is a clear signal of silicon or VRM degradation from the
refurbishment process.
### Clock pattern note
Images 8/9 (previously marked as "HBM3 restored") are now confirmed identical
to images 19/20. Both sample sets show same degraded pattern — same batch.
---
## Baseline matrix (filled where data available)
| GPU model | Config | Avg clock steady | vs boost | Clock/Watt | Notes |
|---|---|---|---|---|---|
| H100 PCIe HBM2e | designed server | 15801620 MHz | 9192% | ~4.6 MHz/W | 10 samples, healthy |
| H100 SXM HBM3 restored | 700W full | 14201460 MHz | 7274% of 1980 | ~2.1 MHz/W | 4 samples confirmed, degraded |
| H100 SXM HBM3 healthy | designed | ~18001900 MHz est. | ~9196% est. | ~2.7 MHz/W est. | need real baseline |
| H200 NVL | designed | TBD | TBD | TBD | need baseline |
---
## H100 official spec (from NVIDIA datasheet)
Source: NVIDIA H100 Tensor Core GPU Datasheet (image 23, 2026-04-06).
All TOPS marked * are with structural sparsity enabled. Divide by 2 for dense.
| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H100 80GB PCIe | 756 TFLOPS | 378 TFLOPS | 1,513 TFLOPS | 350W | HBM2e |
| H100 NVL 94GB PCIe | 990 TFLOPS | 495 TFLOPS | 1,980 TFLOPS | 400W | HBM3 |
| H100 80GB SXM (BQQV) | 989 TFLOPS | 494 TFLOPS | — | 700W | HBM3 |
| H100 94GB SXM (BUBB) | 989 TFLOPS | 494 TFLOPS | — | 700W | HBM2e |
Notes:
- SXM boards do NOT list FP8 peak in this table (field empty)
- fp8_e5m2 is unsupported on H100 PCIe HBM2e — confirmed in our tests
- Tensor Cores: PCIe = 456, SXM = 528 (16% more on SXM)
## Observed efficiency (H100 80GB PCIe, throttled server)
From the report in this session (power+thermal throttle throughout steady):
| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 329 TOPS | 756 TFLOPS | 44% |
| fp32_tf32 | 115 TOPS | 378 TFLOPS | 30% |
| fp8_e4m3 | 505 TOPS | 1,513 TFLOPS | 33% |
3344% of spec is expected given sustained power+thermal throttle (avg clock
1384 MHz vs boost 1755 MHz = 79%). The GPU is computing correctly for its
actual frequency — the low TOPS comes from throttle, not silicon defect.
## H200 official spec (from NVIDIA datasheet, image 24, 2026-04-06)
Format: without sparsity / with sparsity.
| Model | FP16 Tensor (dense) | TF32 (dense) | FP8 (dense) | TDP | Memory |
|---|---|---|---|---|---|
| H200 NVL PCIe | 836 TFLOPS | 418 TFLOPS | 1,570 TFLOPS | 600W | HBM3e 141GB |
| H200 SXM | 990 TFLOPS | 495 TFLOPS | 1,979 TFLOPS | 700W | HBM3e 141GB |
## Observed efficiency (H200 NVL PCIe, throttled non-designed server)
Avg clock 1635 MHz (62% of boost ~2619 MHz). Entire steady in thermal throttle.
| Precision | Measured | Spec (dense) | % of spec |
|---|---|---|---|
| fp16_tensor | 340 TOPS | 836 TFLOPS | 41% |
| fp32_tf32 | 120 TOPS | 418 TFLOPS | 29% |
| fp8_e4m3 | 529 TOPS | 1,570 TFLOPS | 34% |
Comparable to H100 PCIe efficiency (3344%) despite different architecture —
both are throttle-limited. Confirms that % of spec is not a quality signal,
it reflects the thermal environment. tops_per_sm_per_ghz is the right metric.
## Real-world GEMM efficiency reference (2026-04-06, web research)
Sources: SemiAnalysis MI300X vs H100 vs H200 training benchmark; cuBLAS optimization
worklog (hamzaelshafie.bearblog.dev); Lambda AI H100 performance analysis.
### What healthy systems actually achieve:
- H100 SXM in designed server: **~720 TFLOPS FP16 = ~73% of spec**
- cuBLAS large square GEMM (8192³): up to **~83% flop utilization**
- H200 NVL PCIe: no public data, extrapolating ~73% → ~610 TFLOPS FP16
### Our results vs expectation:
| GPU | Our FP16 | Expected (73%) | Our % of spec | Gap |
|---|---|---|---|---|
| H100 PCIe HBM2e | 329 TOPS | ~552 TFLOPS | 44% | ~1.7× below |
| H200 NVL PCIe | 340 TOPS | ~610 TFLOPS | 41% | ~1.8× below |
Our results are roughly **half** of what a healthy system achieves even under throttle.
This is NOT normal — 30-44% is not the industry baseline.
### Likely causes of the gap (in order of probability):
1. **Thermal throttle** — confirmed, sw_thermal covers entire steady window
2. **Power limit below TDP** — GPU may be software-limited below 350W/600W.
Previous user may have set a lower limit via nvidia-smi -pl and it was not
reset. Our normalization sets clock locks but does NOT reset power limit.
Key check: `nvidia-smi -q | grep "Power Limit"` — default vs enforced.
3. **Matrix size** — ruled out. bee-gpu-burn uses 4096×4096×4096 for fp16,
8192×8192×4096 for fp8. These are large enough for peak tensor utilization.
### Power limit gap analysis (H100 PCIe):
- Avg clock 1384 MHz = 79% of boost 1755 MHz
- Expected TOPS at 79% clock: 756 × 0.79 ≈ 597 TFLOPS
- Actually measured: 329 TOPS = 55% of that estimate
- Remaining gap after accounting for clock throttle: ~45%
- Most likely explanation: enforced power limit < 350W TDP, further reducing
sustainable clock beyond what sw_thermal alone would cause.
### Action item:
Add `power.limit` (enforced) AND `power.default_limit` to queryBenchmarkGPUInfo
so result.json shows if the card was pre-configured with a non-default limit.
If enforced < default × 0.95 → add finding "GPU power limit is below default TDP".
### CPU/RAM impact on GPU FLOPS:
None. Pure on-GPU GEMM is fully compute-bound once data is in VRAM.
CPU core count and host RAM are irrelevant.
## Compute efficiency metric (proposed, no hardcode)
Instead of comparing TOPS to a hardcoded spec, compute:
tops_per_sm_per_ghz = measured_tops / (sm_count × avg_clock_ghz)
This is model-agnostic. A GPU computing correctly at its actual frequency
will show a consistent tops_per_sm_per_ghz regardless of throttle level.
A GPU with degraded silicon will show low tops_per_sm_per_ghz even at
normal clocks.
SM count is queryable: nvidia-smi --query-gpu=attribute.multiprocessor_count
(needs to be added to queryBenchmarkGPUInfo).
Reference values to establish after baseline runs:
- H100 PCIe fp16_tensor: TBD tops/SM/GHz
- H100 SXM fp16_tensor: TBD tops/SM/GHz
## Proposed threshold changes (pending more data)
1. **`low_sm_clock_vs_target`**: raise threshold from 90% to 85% based on observed
9192% on healthy HBM2e. Or remove entirely — sw_power/sw_thermal already
capture the root cause.
2. **`variance_too_high`** (StabilityScore < 85): healthy HBM2e WILL oscillate
under power cap. Consider suppressing this flag when power is flat and usage
is 100% (oscillation is expected). Or lower threshold to 70.
3. **New signal: MHz/Watt efficiency**: if base_graphics_clock_mhz is available,
ratio avg_clock / power_w could identify degraded silicon (HBM3 restored S1
would have been caught by this).
Decision deferred until baseline on SXM designed servers collected.

View File

@@ -606,6 +606,20 @@ struct prepared_profile {
};
static const struct profile_desc k_profiles[] = {
{
"fp64",
"fp64",
80,
1,
0,
0,
8,
CUDA_R_64F,
CUDA_R_64F,
CUDA_R_64F,
CUDA_R_64F,
CUBLAS_COMPUTE_64F,
},
{
"fp32_tf32",
"fp32",

View File

@@ -41,15 +41,15 @@ while [ $# -gt 0 ]; do
;;
*)
echo "unknown arg: $1" >&2
echo "usage: $0 [--cache-dir /path] [--rebuild-image] [--clean-build] [--authorized-keys /path/to/authorized_keys] [--variant nvidia|amd|all]" >&2
echo "usage: $0 [--cache-dir /path] [--rebuild-image] [--clean-build] [--authorized-keys /path/to/authorized_keys] [--variant nvidia|nvidia-legacy|amd|nogpu|all]" >&2
exit 1
;;
esac
done
case "$VARIANT" in
nvidia|amd|nogpu|all) ;;
*) echo "unknown variant: $VARIANT (expected nvidia, amd, nogpu, or all)" >&2; exit 1 ;;
nvidia|nvidia-legacy|amd|nogpu|all) ;;
*) echo "unknown variant: $VARIANT (expected nvidia, nvidia-legacy, amd, nogpu, or all)" >&2; exit 1 ;;
esac
if [ "$CLEAN_CACHE" = "1" ]; then
@@ -61,8 +61,13 @@ if [ "$CLEAN_CACHE" = "1" ]; then
"${CACHE_DIR:?}/lb-packages"
echo "=== cleaning live-build work dirs ==="
rm -rf "${REPO_ROOT}/dist/live-build-work-nvidia"
rm -rf "${REPO_ROOT}/dist/live-build-work-nvidia-legacy"
rm -rf "${REPO_ROOT}/dist/live-build-work-amd"
rm -rf "${REPO_ROOT}/dist/live-build-work-nogpu"
rm -rf "${REPO_ROOT}/dist/overlay-stage-nvidia"
rm -rf "${REPO_ROOT}/dist/overlay-stage-nvidia-legacy"
rm -rf "${REPO_ROOT}/dist/overlay-stage-amd"
rm -rf "${REPO_ROOT}/dist/overlay-stage-nogpu"
echo "=== caches cleared, proceeding with build ==="
fi
@@ -180,6 +185,9 @@ case "$VARIANT" in
nvidia)
run_variant nvidia
;;
nvidia-legacy)
run_variant nvidia-legacy
;;
amd)
run_variant amd
;;
@@ -188,6 +196,7 @@ case "$VARIANT" in
;;
all)
run_variant nvidia
run_variant nvidia-legacy
run_variant amd
run_variant nogpu
;;

View File

@@ -1,8 +1,10 @@
#!/bin/sh
# build-nvidia-module.sh — compile NVIDIA proprietary driver modules for Debian 12
# build-nvidia-module.sh — compile NVIDIA kernel modules for Debian 12
#
# Downloads the official NVIDIA .run installer, extracts kernel modules and
# userspace tools (nvidia-smi, libnvidia-ml). Everything is proprietary NVIDIA.
# userspace tools (nvidia-smi, libnvidia-ml). Supports both:
# - open -> kernel-open/ sources from the .run installer
# - proprietary -> traditional proprietary kernel sources from the .run installer
#
# Output is cached in DIST_DIR/nvidia-<version>-<kver>/ so subsequent builds
# are instant unless NVIDIA_DRIVER_VERSION or kernel version changes.
@@ -17,10 +19,19 @@ set -e
NVIDIA_VERSION="$1"
DIST_DIR="$2"
DEBIAN_KERNEL_ABI="$3"
NVIDIA_FLAVOR="${4:-open}"
[ -n "$NVIDIA_VERSION" ] || { echo "usage: $0 <nvidia-version> <dist-dir> <debian-kernel-abi>"; exit 1; }
[ -n "$DIST_DIR" ] || { echo "usage: $0 <nvidia-version> <dist-dir> <debian-kernel-abi>"; exit 1; }
[ -n "$DEBIAN_KERNEL_ABI" ] || { echo "usage: $0 <nvidia-version> <dist-dir> <debian-kernel-abi>"; exit 1; }
[ -n "$NVIDIA_VERSION" ] || { echo "usage: $0 <nvidia-version> <dist-dir> <debian-kernel-abi> [open|proprietary]"; exit 1; }
[ -n "$DIST_DIR" ] || { echo "usage: $0 <nvidia-version> <dist-dir> <debian-kernel-abi> [open|proprietary]"; exit 1; }
[ -n "$DEBIAN_KERNEL_ABI" ] || { echo "usage: $0 <nvidia-version> <dist-dir> <debian-kernel-abi> [open|proprietary]"; exit 1; }
case "$NVIDIA_FLAVOR" in
open|proprietary) ;;
*)
echo "unsupported NVIDIA flavor: $NVIDIA_FLAVOR (expected open or proprietary)" >&2
exit 1
;;
esac
KVER="${DEBIAN_KERNEL_ABI}-amd64"
# On Debian, kernel headers are split into two packages:
@@ -31,22 +42,13 @@ KVER="${DEBIAN_KERNEL_ABI}-amd64"
KDIR_ARCH="/usr/src/linux-headers-${KVER}"
KDIR_COMMON="/usr/src/linux-headers-${DEBIAN_KERNEL_ABI}-common"
echo "=== NVIDIA ${NVIDIA_VERSION} (proprietary) for kernel ${KVER} ==="
echo "=== NVIDIA ${NVIDIA_VERSION} (${NVIDIA_FLAVOR}) for kernel ${KVER} ==="
if [ ! -d "$KDIR_ARCH" ] || [ ! -d "$KDIR_COMMON" ]; then
echo "=== installing linux-headers-${KVER} ==="
DEBIAN_FRONTEND=noninteractive apt-get install -y \
"linux-headers-${KVER}" \
gcc make perl
fi
echo "kernel headers (arch): $KDIR_ARCH"
echo "kernel headers (common): $KDIR_COMMON"
CACHE_DIR="${DIST_DIR}/nvidia-${NVIDIA_VERSION}-${KVER}"
CACHE_DIR="${DIST_DIR}/nvidia-${NVIDIA_FLAVOR}-${NVIDIA_VERSION}-${KVER}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/nvidia-downloads"
EXTRACT_CACHE_DIR="${CACHE_ROOT}/nvidia-extract"
CACHE_LAYOUT_VERSION="2"
CACHE_LAYOUT_VERSION="3"
CACHE_LAYOUT_MARKER="${CACHE_DIR}/.cache-layout-v${CACHE_LAYOUT_VERSION}"
if [ -d "$CACHE_DIR/modules" ] && [ -f "$CACHE_DIR/bin/nvidia-smi" ] \
&& [ -f "$CACHE_LAYOUT_MARKER" ] \
@@ -57,6 +59,15 @@ if [ -d "$CACHE_DIR/modules" ] && [ -f "$CACHE_DIR/bin/nvidia-smi" ] \
exit 0
fi
if [ ! -d "$KDIR_ARCH" ] || [ ! -d "$KDIR_COMMON" ]; then
echo "=== installing linux-headers-${KVER} ==="
DEBIAN_FRONTEND=noninteractive apt-get install -y \
"linux-headers-${KVER}" \
gcc make perl
fi
echo "kernel headers (arch): $KDIR_ARCH"
echo "kernel headers (common): $KDIR_COMMON"
# Download official NVIDIA .run installer with sha256 verification
BASE_URL="https://download.nvidia.com/XFree86/Linux-x86_64/${NVIDIA_VERSION}"
mkdir -p "$DOWNLOAD_CACHE_DIR" "$EXTRACT_CACHE_DIR"
@@ -90,12 +101,18 @@ EXTRACT_DIR="${EXTRACT_CACHE_DIR}/nvidia-extract-${NVIDIA_VERSION}"
rm -rf "$EXTRACT_DIR"
"$RUN_FILE" --extract-only --target "$EXTRACT_DIR"
# Find kernel source directory (proprietary: kernel/, open: kernel-open/)
# Find kernel source directory for the selected flavor.
KERNEL_SRC=""
for d in "$EXTRACT_DIR/kernel" "$EXTRACT_DIR/kernel-modules-sources" "$EXTRACT_DIR/kernel-source"; do
[ -f "$d/Makefile" ] && KERNEL_SRC="$d" && break
done
[ -n "$KERNEL_SRC" ] || { echo "ERROR: kernel source dir not found in:"; ls "$EXTRACT_DIR/"; exit 1; }
if [ "$NVIDIA_FLAVOR" = "open" ]; then
for d in "$EXTRACT_DIR/kernel-open" "$EXTRACT_DIR/kernel-open/"*; do
[ -f "$d/Makefile" ] && KERNEL_SRC="$d" && break
done
else
for d in "$EXTRACT_DIR/kernel" "$EXTRACT_DIR/kernel-modules-sources" "$EXTRACT_DIR/kernel-source"; do
[ -f "$d/Makefile" ] && KERNEL_SRC="$d" && break
done
fi
[ -n "$KERNEL_SRC" ] || { echo "ERROR: kernel source dir not found for flavor ${NVIDIA_FLAVOR} in:"; ls "$EXTRACT_DIR/"; exit 1; }
echo "kernel source: $KERNEL_SRC"
# Build kernel modules

View File

@@ -15,26 +15,46 @@ DIST_DIR="${REPO_ROOT}/dist"
VENDOR_DIR="${REPO_ROOT}/iso/vendor"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
AUTH_KEYS=""
BUILD_VARIANT="nvidia"
BEE_GPU_VENDOR="nvidia"
BEE_NVIDIA_MODULE_FLAVOR="open"
# parse args
while [ $# -gt 0 ]; do
case "$1" in
--authorized-keys) AUTH_KEYS="$2"; shift 2 ;;
--variant) BEE_GPU_VENDOR="$2"; shift 2 ;;
--variant) BUILD_VARIANT="$2"; shift 2 ;;
*) echo "unknown arg: $1"; exit 1 ;;
esac
done
case "$BEE_GPU_VENDOR" in
nvidia|amd|nogpu) ;;
*) echo "unknown variant: $BEE_GPU_VENDOR (expected nvidia, amd, or nogpu)" >&2; exit 1 ;;
case "$BUILD_VARIANT" in
nvidia)
BEE_GPU_VENDOR="nvidia"
BEE_NVIDIA_MODULE_FLAVOR="open"
;;
nvidia-legacy)
BEE_GPU_VENDOR="nvidia"
BEE_NVIDIA_MODULE_FLAVOR="proprietary"
;;
amd)
BEE_GPU_VENDOR="amd"
BEE_NVIDIA_MODULE_FLAVOR=""
;;
nogpu)
BEE_GPU_VENDOR="nogpu"
BEE_NVIDIA_MODULE_FLAVOR=""
;;
*)
echo "unknown variant: $BUILD_VARIANT (expected nvidia, nvidia-legacy, amd, or nogpu)" >&2
exit 1
;;
esac
BUILD_WORK_DIR="${DIST_DIR}/live-build-work-${BEE_GPU_VENDOR}"
OVERLAY_STAGE_DIR="${DIST_DIR}/overlay-stage-${BEE_GPU_VENDOR}"
BUILD_WORK_DIR="${DIST_DIR}/live-build-work-${BUILD_VARIANT}"
OVERLAY_STAGE_DIR="${DIST_DIR}/overlay-stage-${BUILD_VARIANT}"
export BEE_GPU_VENDOR
export BEE_GPU_VENDOR BEE_NVIDIA_MODULE_FLAVOR BUILD_VARIANT
. "${BUILDER_DIR}/VERSIONS"
export PATH="$PATH:/usr/local/go/bin"
@@ -627,7 +647,7 @@ recover_iso_memtest() {
AUDIT_VERSION_EFFECTIVE="$(resolve_audit_version)"
ISO_VERSION_EFFECTIVE="$(resolve_iso_version)"
ISO_BASENAME="easy-bee-${BEE_GPU_VENDOR}-v${ISO_VERSION_EFFECTIVE}-amd64"
ISO_BASENAME="easy-bee-${BUILD_VARIANT}-v${ISO_VERSION_EFFECTIVE}-amd64"
# Versioned output directory: dist/easy-bee-v4.1/ — all final artefacts live here.
OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}"
mkdir -p "${OUT_DIR}"
@@ -801,7 +821,7 @@ if [ ! -d "/usr/src/linux-headers-${KVER}" ]; then
apt-get install -y "linux-headers-${KVER}"
fi
echo "=== bee ISO build (variant: ${BEE_GPU_VENDOR}) ==="
echo "=== bee ISO build (variant: ${BUILD_VARIANT}) ==="
echo "Debian: ${DEBIAN_VERSION}, Kernel ABI: ${DEBIAN_KERNEL_ABI}, Go: ${GO_VERSION}"
echo "Audit version: ${AUDIT_VERSION_EFFECTIVE}, ISO version: ${ISO_VERSION_EFFECTIVE}"
echo ""
@@ -871,7 +891,7 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
fi
fi
echo "=== preparing staged overlay (${BEE_GPU_VENDOR}) ==="
echo "=== preparing staged overlay (${BUILD_VARIANT}) ==="
mkdir -p "${BUILD_WORK_DIR}" "${OVERLAY_STAGE_DIR}"
# Sync builder config into variant work dir, preserving lb cache.
@@ -897,6 +917,86 @@ elif [ -d "${LB_PKG_CACHE}" ] && [ "$(ls -A "${LB_PKG_CACHE}" 2>/dev/null)" ]; t
rsync -a "${LB_PKG_CACHE}/" "${BUILD_WORK_DIR}/cache/packages.chroot/"
fi
if [ "$BEE_GPU_VENDOR" != "nvidia" ] || [ "$BEE_NVIDIA_MODULE_FLAVOR" != "proprietary" ]; then
cat > "${BUILD_WORK_DIR}/config/bootloaders/grub-pc/grub.cfg" <<'EOF'
source /boot/grub/config.cfg
echo ""
echo " ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗"
echo " ██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝"
echo " █████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗"
echo " ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝"
echo " ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗"
echo " ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝"
echo " Hardware Audit LiveCD"
echo ""
menuentry "EASY-BEE" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
submenu "EASY-BEE (advanced options) -->" {
menuentry "EASY-BEE — KMS (no nomodeset)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE — fail-safe" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset noapic noapm nodma nomce nolapic nosmp vga=normal net.ifnames=0 biosdevname=0
initrd @INITRD_LIVE@
}
}
if [ "${grub_platform}" = "efi" ]; then
menuentry "Memory Test (memtest86+)" {
chainloader /boot/memtest86+x64.efi
}
else
menuentry "Memory Test (memtest86+)" {
linux16 /boot/memtest86+x64.bin
}
fi
if [ "${grub_platform}" = "efi" ]; then
menuentry "UEFI Firmware Settings" {
fwsetup
}
fi
EOF
cat > "${BUILD_WORK_DIR}/config/bootloaders/isolinux/live.cfg.in" <<'EOF'
label live-@FLAVOUR@-normal
menu label ^EASY-BEE
menu default
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@
label live-@FLAVOUR@-kms
menu label EASY-BEE (^graphics/KMS)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ bee.display=kms
label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ toram
label live-@FLAVOUR@-failsafe
menu label EASY-BEE (^fail-safe)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ memtest noapic noapm nodma nomce nolapic nosmp vga=normal
label memtest
menu label ^Memory Test (memtest86+)
linux /boot/memtest86+x64.bin
EOF
fi
rsync -a "${OVERLAY_DIR}/" "${OVERLAY_STAGE_DIR}/"
rm -f \
"${OVERLAY_STAGE_DIR}/etc/bee-ssh-password-fallback" \
@@ -981,10 +1081,10 @@ done
# --- NVIDIA kernel modules and userspace libs ---
if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
run_step "build NVIDIA ${NVIDIA_DRIVER_VERSION} modules" "40-nvidia-module" \
sh "${BUILDER_DIR}/build-nvidia-module.sh" "${NVIDIA_DRIVER_VERSION}" "${DIST_DIR}" "${DEBIAN_KERNEL_ABI}"
sh "${BUILDER_DIR}/build-nvidia-module.sh" "${NVIDIA_DRIVER_VERSION}" "${DIST_DIR}" "${DEBIAN_KERNEL_ABI}" "${BEE_NVIDIA_MODULE_FLAVOR}"
KVER="${DEBIAN_KERNEL_ABI}-amd64"
NVIDIA_CACHE="${DIST_DIR}/nvidia-${NVIDIA_DRIVER_VERSION}-${KVER}"
NVIDIA_CACHE="${DIST_DIR}/nvidia-${BEE_NVIDIA_MODULE_FLAVOR}-${NVIDIA_DRIVER_VERSION}-${KVER}"
# Inject .ko files into overlay at /usr/local/lib/nvidia/
OVERLAY_KMOD_DIR="${OVERLAY_STAGE_DIR}/usr/local/lib/nvidia"
@@ -1055,13 +1155,14 @@ GIT_COMMIT="$(git -C "${REPO_ROOT}" rev-parse --short HEAD 2>/dev/null || echo u
if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
GPU_VERSION_LINE="NVIDIA_DRIVER_VERSION=${NVIDIA_DRIVER_VERSION}
NVIDIA_KERNEL_MODULES_FLAVOR=${BEE_NVIDIA_MODULE_FLAVOR}
NCCL_VERSION=${NCCL_VERSION}
NCCL_CUDA_VERSION=${NCCL_CUDA_VERSION}
CUBLAS_VERSION=${CUBLAS_VERSION}
CUDA_USERSPACE_VERSION=${CUDA_USERSPACE_VERSION}
NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}
JOHN_JUMBO_COMMIT=${JOHN_JUMBO_COMMIT}"
GPU_BUILD_INFO="nvidia:${NVIDIA_DRIVER_VERSION}"
GPU_BUILD_INFO="nvidia-${BEE_NVIDIA_MODULE_FLAVOR}:${NVIDIA_DRIVER_VERSION}"
elif [ "$BEE_GPU_VENDOR" = "amd" ]; then
GPU_VERSION_LINE="ROCM_VERSION=${ROCM_VERSION}"
GPU_BUILD_INFO="rocm:${ROCM_VERSION}"
@@ -1073,6 +1174,7 @@ fi
cat > "${OVERLAY_STAGE_DIR}/etc/bee-release" <<EOF
BEE_ISO_VERSION=${ISO_VERSION_EFFECTIVE}
BEE_AUDIT_VERSION=${AUDIT_VERSION_EFFECTIVE}
BEE_BUILD_VARIANT=${BUILD_VARIANT}
BEE_GPU_VENDOR=${BEE_GPU_VENDOR}
BUILD_DATE=${BUILD_DATE}
GIT_COMMIT=${GIT_COMMIT}
@@ -1083,6 +1185,11 @@ EOF
# Write GPU vendor marker for hooks
echo "${BEE_GPU_VENDOR}" > "${OVERLAY_STAGE_DIR}/etc/bee-gpu-vendor"
if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
echo "${BEE_NVIDIA_MODULE_FLAVOR}" > "${OVERLAY_STAGE_DIR}/etc/bee-nvidia-modules-flavor"
else
rm -f "${OVERLAY_STAGE_DIR}/etc/bee-nvidia-modules-flavor"
fi
# Patch motd with build info
BEE_BUILD_INFO="${BUILD_DATE} git:${GIT_COMMIT} debian:${DEBIAN_VERSION} ${GPU_BUILD_INFO}"
@@ -1153,10 +1260,10 @@ fi
# --- build ISO using live-build ---
echo ""
echo "=== building ISO (live-build, variant: ${BEE_GPU_VENDOR}) ==="
echo "=== building ISO (variant: ${BUILD_VARIANT}) ==="
# Export for auto/config
BEE_GPU_VENDOR_UPPER="$(echo "${BEE_GPU_VENDOR}" | tr 'a-z' 'A-Z')"
BEE_GPU_VENDOR_UPPER="$(echo "${BUILD_VARIANT}" | tr 'a-z-' 'A-Z_')"
export BEE_GPU_VENDOR_UPPER
cd "${LB_DIR}"
@@ -1191,7 +1298,7 @@ if [ -f "$ISO_RAW" ]; then
validate_iso_nvidia_runtime "$ISO_RAW"
cp "$ISO_RAW" "$ISO_OUT"
echo ""
echo "=== done (${BEE_GPU_VENDOR}) ==="
echo "=== done (${BUILD_VARIANT}) ==="
echo "ISO: $ISO_OUT"
if command -v stat >/dev/null 2>&1; then
ISO_SIZE_BYTES="$(stat -c '%s' "$ISO_OUT" 2>/dev/null || stat -f '%z' "$ISO_OUT")"

View File

@@ -15,29 +15,21 @@ menuentry "EASY-BEE" {
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (graphics/KMS)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
submenu "EASY-BEE (advanced options) -->" {
menuentry "EASY-BEE — GSP=off" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (load to RAM)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE — KMS (no nomodeset)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (NVIDIA GSP=off)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (graphics/KMS, GSP=off)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE (fail-safe)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=gsp-off memtest noapic noapm nodma nomce nolapic nosmp vga=normal net.ifnames=0 biosdevname=0
initrd @INITRD_LIVE@
menuentry "EASY-BEE — fail-safe" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off noapic noapm nodma nomce nolapic nosmp vga=normal net.ifnames=0 biosdevname=0
initrd @INITRD_LIVE@
}
}
if [ "${grub_platform}" = "efi" ]; then

View File

@@ -5,65 +5,110 @@ echo "=== generating bee wallpaper ==="
mkdir -p /usr/share/bee
python3 - <<'PYEOF'
from PIL import Image, ImageDraw, ImageFont
from PIL import Image, ImageDraw, ImageFont, ImageFilter
import os
W, H = 1920, 1080
LOGO = """\
\u2588\u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2557 \u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557
\u2588\u2588\u2554\u2550\u2550\u2550\u2550\u255d\u2588\u2588\u2554\u2550\u2550\u2588\u2588\u2557\u2588\u2588\u2554\u2550\u2550\u2550\u2550\u255d\u255a\u2588\u2588\u2557 \u2588\u2588\u2554\u255d \u2588\u2588\u2554\u2550\u2550\u2588\u2588\u2557\u2588\u2588\u2554\u2550\u2550\u2550\u2550\u255d\u2588\u2588\u2554\u2550\u2550\u2550\u2550\u255d
\u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2551\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557 \u255a\u2588\u2588\u2588\u2588\u2554\u255d \u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2588\u2588\u2588\u2588\u2554\u255d\u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2557
\u2588\u2588\u2554\u2550\u2550\u255d \u2588\u2588\u2554\u2550\u2550\u2588\u2588\u2551\u255a\u2550\u2550\u2550\u2550\u2588\u2588\u2551 \u255a\u2588\u2588\u2554\u255d \u255a\u2550\u2550\u2550\u2550\u255d\u2588\u2588\u2554\u2550\u2550\u2588\u2588\u2557\u2588\u2588\u2554\u2550\u2550\u255d \u2588\u2588\u2554\u2550\u2550\u255d
\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2551 \u2588\u2588\u2551\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2551 \u2588\u2588\u2551 \u2588\u2588\u2588\u2588\u2588\u2588\u2554\u255d\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557
\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u255d\u255a\u2550\u255d \u255a\u2550\u255d\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u255d \u255a\u2550\u255d \u255a\u2550\u2550\u2550\u2550\u2550\u255d \u255a\u2550\u2550\u2550\u2550\u2550\u2550\u255d\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u255d
Hardware Audit LiveCD"""
GLYPHS = {
'E': ["11111", "10000", "11110", "10000", "10000", "10000", "11111"],
'A': ["01110", "10001", "10001", "11111", "10001", "10001", "10001"],
'S': ["01111", "10000", "10000", "01110", "00001", "00001", "11110"],
'Y': ["10001", "10001", "01010", "00100", "00100", "00100", "00100"],
'B': ["11110", "10001", "10001", "11110", "10001", "10001", "11110"],
'-': ["00000", "00000", "11111", "00000", "00000", "00000", "00000"],
}
# Find a monospace font that supports box-drawing characters
FONT_CANDIDATES = [
'/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf',
'/usr/share/fonts/truetype/liberation/LiberationMono-Regular.ttf',
'/usr/share/fonts/truetype/freefont/FreeMono.ttf',
'/usr/share/fonts/truetype/noto/NotoMono-Regular.ttf',
TITLE = "EASY-BEE"
SUBTITLE = "Hardware Audit LiveCD"
CELL = 30
GLYPH_GAP = 18
ROW_GAP = 6
FG = (0xF6, 0xD0, 0x47)
FG_DIM = (0xD4, 0xA9, 0x1C)
SHADOW = (0x5E, 0x47, 0x05)
SUB = (0x96, 0x7A, 0x17)
BG = (0x05, 0x05, 0x05)
SUB_FONT_CANDIDATES = [
'/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf',
'/usr/share/fonts/truetype/liberation2/LiberationSans-Bold.ttf',
'/usr/share/fonts/truetype/liberation/LiberationSans-Bold.ttf',
'/usr/share/fonts/truetype/freefont/FreeSansBold.ttf',
]
font_path = None
for p in FONT_CANDIDATES:
if os.path.exists(p):
font_path = p
break
SIZE = 22
if font_path:
font_logo = ImageFont.truetype(font_path, SIZE)
font_sub = ImageFont.truetype(font_path, SIZE)
else:
font_logo = ImageFont.load_default()
font_sub = font_logo
def load_font(size):
for path in SUB_FONT_CANDIDATES:
if os.path.exists(path):
return ImageFont.truetype(path, size)
return ImageFont.load_default()
img = Image.new('RGB', (W, H), (0, 0, 0))
def glyph_width(ch):
return len(GLYPHS[ch][0])
def render_logo_mask():
width_cells = 0
for idx, ch in enumerate(TITLE):
width_cells += glyph_width(ch)
if idx != len(TITLE) - 1:
width_cells += 1
mask_w = width_cells * CELL + (len(TITLE) - 1) * GLYPH_GAP
mask_h = 7 * CELL + 6 * ROW_GAP
mask = Image.new('L', (mask_w, mask_h), 0)
draw = ImageDraw.Draw(mask)
cx = 0
for idx, ch in enumerate(TITLE):
glyph = GLYPHS[ch]
for row_idx, row in enumerate(glyph):
for col_idx, cell in enumerate(row):
if cell != '1':
continue
x0 = cx + col_idx * CELL
y0 = row_idx * (CELL + ROW_GAP)
x1 = x0 + CELL - 4
y1 = y0 + CELL - 4
draw.rounded_rectangle((x0, y0, x1, y1), radius=4, fill=255)
cx += glyph_width(ch) * CELL
if idx != len(TITLE) - 1:
cx += CELL + GLYPH_GAP
return mask
img = Image.new('RGB', (W, H), BG)
draw = ImageDraw.Draw(img)
# Measure logo block
lines = LOGO.split('\n')
bbox = draw.textbbox((0, 0), LOGO, font=font_logo)
text_w = bbox[2] - bbox[0]
text_h = bbox[3] - bbox[1]
# Soft amber glow under the logo without depending on font rendering.
glow = Image.new('RGBA', (W, H), (0, 0, 0, 0))
glow_draw = ImageDraw.Draw(glow)
glow_draw.ellipse((360, 250, 1560, 840), fill=(180, 120, 10, 56))
glow_draw.ellipse((520, 340, 1400, 760), fill=(255, 190, 40, 36))
glow = glow.filter(ImageFilter.GaussianBlur(60))
img = Image.alpha_composite(img.convert('RGBA'), glow)
x = (W - text_w) // 2
y = (H - text_h) // 2
logo_mask = render_logo_mask()
logo_w, logo_h = logo_mask.size
logo_x = (W - logo_w) // 2
logo_y = 290
# Draw logo lines: first 6 in amber, last line (subtitle) dimmer
logo_lines = lines[:6]
sub_line = lines[6] if len(lines) > 6 else ''
shadow_mask = logo_mask.filter(ImageFilter.GaussianBlur(2))
img.paste(SHADOW, (logo_x + 16, logo_y + 14), shadow_mask)
img.paste(FG_DIM, (logo_x + 8, logo_y + 7), logo_mask)
img.paste(FG, (logo_x, logo_y), logo_mask)
cy = y
for line in logo_lines:
draw.text((x, cy), line, font=font_logo, fill=(0xf6, 0xc9, 0x0e))
cy += SIZE + 2
cy += 8
if sub_line:
draw.text((x, cy), sub_line, font=font_sub, fill=(0x80, 0x68, 0x18))
font_sub = load_font(30)
sub_bb = draw.textbbox((0, 0), SUBTITLE, font=font_sub)
sub_x = (W - (sub_bb[2] - sub_bb[0])) // 2
sub_y = logo_y + logo_h + 54
draw = ImageDraw.Draw(img)
draw.text((sub_x + 2, sub_y + 2), SUBTITLE, font=font_sub, fill=(35, 28, 6))
draw.text((sub_x, sub_y), SUBTITLE, font=font_sub, fill=SUB)
img = img.convert('RGB')
img.save('/usr/share/bee/wallpaper.png', optimize=True)
print('wallpaper written: /usr/share/bee/wallpaper.png')

View File

@@ -0,0 +1,41 @@
#!/bin/sh
# 9010-fix-toram.hook.chroot — patch live-boot toram to work with tmpfs (no O_DIRECT)
#
# live-boot tries "losetup --replace --direct-io=on" when re-associating the
# loop device to the RAM copy in /dev/shm. tmpfs does not support O_DIRECT,
# so the ioctl returns EINVAL and the verification step fails.
#
# The patch replaces the replace call so that if --direct-io=on fails it falls
# back to a plain replace without direct-io, and also relaxes the verification
# to a warning so the boot continues even when re-association is imperfect.
set -e
TORAM_SCRIPT="/usr/lib/live/boot/9990-toram-todisk.sh"
if [ ! -f "${TORAM_SCRIPT}" ]; then
echo "9010-fix-toram: ${TORAM_SCRIPT} not found, skipping"
exit 0
fi
echo "9010-fix-toram: patching ${TORAM_SCRIPT}"
# Replace any losetup --replace call that includes --direct-io=on with a
# version that first tries with direct-io, then retries without it.
#
# The sed expression turns:
# losetup --replace ... --direct-io=on LOOP FILE
# into a shell snippet that tries both, silently.
#
# We also downgrade the fatal "Task finished with error." block to a warning
# so the boot continues if re-association fails (squashfs still accessible).
# 1. Strip --direct-io=on from the losetup --replace call so it works on tmpfs.
sed -i 's/losetup --replace --direct-io=on/losetup --replace/g' "${TORAM_SCRIPT}"
sed -i 's/losetup --replace --direct-io/losetup --replace/g' "${TORAM_SCRIPT}"
# 2. Turn the hard error into a warning so boot continues.
# live-boot prints this exact string when verification fails.
sed -i 's/echo "Task finished with error\."/echo "Warning: toram re-association failed, continuing boot (squashfs still in RAM)"/' "${TORAM_SCRIPT}"
echo "9010-fix-toram: patch applied"
grep -n "losetup" "${TORAM_SCRIPT}" | head -20 || true

View File

@@ -65,6 +65,10 @@ python3-pil
xorg
xterm
chromium
mousepad
pcmanfm
ristretto
mupdf
xserver-xorg-video-fbdev
xserver-xorg-video-vesa
lightdm

View File

@@ -27,6 +27,7 @@ echo ""
KVER=$(uname -r)
info "kernel: $KVER"
NVIDIA_BOOT_MODE="normal"
NVIDIA_MODULES_FLAVOR="proprietary"
for arg in $(cat /proc/cmdline 2>/dev/null); do
case "$arg" in
bee.nvidia.mode=*)
@@ -34,7 +35,11 @@ for arg in $(cat /proc/cmdline 2>/dev/null); do
;;
esac
done
if [ -f /etc/bee-nvidia-modules-flavor ]; then
NVIDIA_MODULES_FLAVOR="$(tr -d '[:space:]' </etc/bee-nvidia-modules-flavor 2>/dev/null || echo proprietary)"
fi
info "nvidia boot mode: ${NVIDIA_BOOT_MODE}"
info "nvidia modules flavor: ${NVIDIA_MODULES_FLAVOR}"
# --- PATH & binaries ---
echo "-- PATH & binaries --"
@@ -110,10 +115,12 @@ fi
for mod in nvidia_modeset nvidia_uvm; do
if /sbin/lsmod 2>/dev/null | grep -q "^$mod "; then
ok "module loaded: $mod"
elif [ "${NVIDIA_BOOT_MODE}" = "normal" ] || [ "${NVIDIA_BOOT_MODE}" = "full" ]; then
elif [ "${NVIDIA_MODULES_FLAVOR}" = "proprietary" ] && { [ "${NVIDIA_BOOT_MODE}" = "normal" ] || [ "${NVIDIA_BOOT_MODE}" = "full" ]; }; then
fail "module NOT loaded in normal mode: $mod"
else
elif [ "${NVIDIA_MODULES_FLAVOR}" = "proprietary" ]; then
warn "module not loaded in GSP-off mode: $mod"
else
fail "module NOT loaded: $mod"
fi
done
@@ -129,10 +136,12 @@ done
if [ -e /dev/nvidia-uvm ]; then
ok "/dev/nvidia-uvm exists"
elif [ "${NVIDIA_BOOT_MODE}" = "normal" ] || [ "${NVIDIA_BOOT_MODE}" = "full" ]; then
elif [ "${NVIDIA_MODULES_FLAVOR}" = "proprietary" ] && { [ "${NVIDIA_BOOT_MODE}" = "normal" ] || [ "${NVIDIA_BOOT_MODE}" = "full" ]; }; then
fail "/dev/nvidia-uvm missing in normal mode"
else
elif [ "${NVIDIA_MODULES_FLAVOR}" = "proprietary" ]; then
warn "/dev/nvidia-uvm missing — CUDA stress path may be unavailable until loaded on demand"
else
fail "/dev/nvidia-uvm missing"
fi
echo ""

View File

@@ -62,6 +62,8 @@ done
echo "loader=bee-gpu-burn"
echo "selected_gpus=${FINAL}"
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
TMP_DIR=$(mktemp -d)
trap 'rm -rf "${TMP_DIR}"' EXIT INT TERM
@@ -78,7 +80,8 @@ for id in $(echo "${FINAL}" | tr ',' ' '); do
fi
fi
echo "starting gpu ${id} size=${gpu_size_mb}MB"
"${WORKER}" --device "${id}" --seconds "${SECONDS}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 &
CUDA_VISIBLE_DEVICES="${id}" \
"${WORKER}" --device 0 --seconds "${SECONDS}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 &
pid=$!
WORKERS="${WORKERS} ${pid}:${id}:${log}"
done

View File

@@ -152,14 +152,19 @@ done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export CUDA_VISIBLE_DEVICES="${FINAL}"
JOHN_DEVICES=""
local_id=1
for id in $(echo "${FINAL}" | tr ',' ' '); do
opencl_id=$((id + 1))
opencl_id="${local_id}"
if [ -z "${JOHN_DEVICES}" ]; then
JOHN_DEVICES="${opencl_id}"
else
JOHN_DEVICES="${JOHN_DEVICES},${opencl_id}"
fi
local_id=$((local_id + 1))
done
echo "loader=john"

View File

@@ -70,6 +70,8 @@ echo "gpu_count=${GPU_COUNT}"
echo "range=${MIN_BYTES}..${MAX_BYTES}"
echo "iters=${ITERS}"
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
deadline=$(( $(date +%s) + SECONDS ))
round=0

View File

@@ -6,6 +6,19 @@ NVIDIA_KO_DIR="/usr/local/lib/nvidia"
log() { echo "[bee-nvidia] $*"; }
read_nvidia_modules_flavor() {
if [ -f /etc/bee-nvidia-modules-flavor ]; then
flavor="$(tr -d '[:space:]' </etc/bee-nvidia-modules-flavor 2>/dev/null)"
case "$flavor" in
open|proprietary)
echo "$flavor"
return 0
;;
esac
fi
echo "proprietary"
}
log "kernel: $(uname -r)"
# Skip if no NVIDIA GPU present (PCI vendor 10de)
@@ -40,6 +53,8 @@ if [ -z "$nvidia_mode" ]; then
nvidia_mode="normal"
fi
log "boot mode: $nvidia_mode"
nvidia_modules_flavor="$(read_nvidia_modules_flavor)"
log "modules flavor: $nvidia_modules_flavor"
load_module() {
mod="$1"
@@ -50,11 +65,93 @@ load_module() {
log "WARN: not found: $ko"
return 1
fi
if insmod "$ko" "$@"; then
if timeout 90 insmod "$ko" "$@"; then
log "loaded: $mod $*"
return 0
fi
log "WARN: failed to load: $mod"
log "WARN: failed to load: $mod (exit $?)"
dmesg | tail -n 10 | sed 's/^/ dmesg: /' || true
return 1
}
nvidia_is_functional() {
grep -q ' nvidiactl$' /proc/devices 2>/dev/null
}
load_module_with_gsp_fallback() {
ko="$NVIDIA_KO_DIR/nvidia.ko"
if [ ! -f "$ko" ]; then
log "ERROR: not found: $ko"
return 1
fi
# Run insmod in background — on some converted SXM→PCIe cards GSP enters an
# infinite crash/reload loop and insmod never returns. We check for successful
# initialization by polling /proc/devices for nvidiactl instead of waiting for
# insmod to exit.
log "loading nvidia (GSP enabled, timeout 90s)"
insmod "$ko" &
_insmod_pid=$!
_waited=0
while [ $_waited -lt 90 ]; do
if nvidia_is_functional; then
log "loaded: nvidia (GSP enabled, ${_waited}s)"
echo "gsp-on" > /run/bee-nvidia-mode
return 0
fi
# Check if insmod exited with an error before timeout
if ! kill -0 "$_insmod_pid" 2>/dev/null; then
wait "$_insmod_pid"
_rc=$?
if [ $_rc -ne 0 ]; then
log "nvidia load failed (exit $_rc)"
dmesg | tail -n 10 | sed 's/^/ dmesg: /' || true
return 1
fi
# insmod exited 0 but nvidiactl not yet in /proc/devices — give it a moment
sleep 2
if nvidia_is_functional; then
log "loaded: nvidia (GSP enabled, ${_waited}s)"
return 0
fi
log "insmod exited 0 but nvidiactl missing — treating as failure"
return 1
fi
sleep 1
_waited=$((_waited + 1))
done
# GSP init timed out — kill the hanging insmod and attempt gsp-off fallback
log "nvidia GSP init timed out after 90s"
kill "$_insmod_pid" 2>/dev/null || true
wait "$_insmod_pid" 2>/dev/null || true
# Attempt to unload the partially-initialized module
if ! rmmod nvidia 2>/dev/null; then
# Module is stuck in the kernel — cannot reload with different params.
# User must reboot and select bee.nvidia.mode=gsp-off at boot menu.
log "ERROR: rmmod nvidia failed (EBUSY) — module stuck in kernel"
log "ERROR: reboot and select 'EASY-BEE (advanced) -> GSP=off' in boot menu"
echo "gsp-stuck" > /run/bee-nvidia-mode
return 1
fi
sleep 2
log "retrying with NVreg_EnableGpuFirmware=0"
log "WARNING: GSP disabled — power management will run via CPU path, not GPU firmware"
if insmod "$ko" NVreg_EnableGpuFirmware=0; then
if nvidia_is_functional; then
log "loaded: nvidia (GSP disabled)"
echo "gsp-off" > /run/bee-nvidia-mode
return 0
fi
log "insmod gsp-off exited 0 but nvidiactl missing"
return 1
fi
log "nvidia load failed (GSP=off)"
dmesg | tail -n 10 | sed 's/^/ dmesg: /' || true
return 1
}
@@ -68,37 +165,54 @@ load_host_module() {
return 1
}
case "$nvidia_mode" in
normal|full)
if ! load_module nvidia; then
exit 1
fi
# nvidia-modeset on some server kernels needs ACPI video helper symbols
# exported by the generic "video" module. Best-effort only; compute paths
# remain functional even if display-related modules stay absent.
load_host_module video || true
load_module nvidia-modeset || true
load_module nvidia-uvm || true
;;
gsp-off|safe)
# NVIDIA documents that GSP firmware is enabled by default on newer GPUs and can
# be disabled via NVreg_EnableGpuFirmware=0. Safe mode keeps the live ISO on the
# conservative path for platforms where full boot-time GSP init is unstable.
if ! load_module nvidia NVreg_EnableGpuFirmware=0; then
exit 1
fi
log "GSP-off mode: skipping nvidia-modeset and nvidia-uvm during boot"
;;
nomsi|*)
# nomsi: disable MSI-X/MSI interrupts — use when RmInitAdapter fails with
# "Failed to enable MSI-X" on one or more GPUs (IOMMU group interrupt limits).
# NVreg_EnableMSI=0 forces legacy INTx interrupts for all GPUs.
if ! load_module nvidia NVreg_EnableGpuFirmware=0 NVreg_EnableMSI=0; then
exit 1
fi
log "nomsi mode: MSI-X disabled (NVreg_EnableMSI=0), skipping nvidia-modeset and nvidia-uvm"
;;
esac
if [ "$nvidia_modules_flavor" = "open" ]; then
case "$nvidia_mode" in
gsp-off|safe|nomsi)
log "ignoring boot mode ${nvidia_mode} for open NVIDIA modules"
;;
esac
if ! load_module nvidia; then
exit 1
fi
# nvidia-modeset on some server kernels needs ACPI video helper symbols
# exported by the generic "video" module. Best-effort only; compute paths
# remain functional even if display-related modules stay absent.
load_host_module video || true
load_module nvidia-modeset || true
load_module nvidia-uvm || true
else
case "$nvidia_mode" in
normal|full)
if ! load_module_with_gsp_fallback; then
exit 1
fi
# nvidia-modeset on some server kernels needs ACPI video helper symbols
# exported by the generic "video" module. Best-effort only; compute paths
# remain functional even if display-related modules stay absent.
load_host_module video || true
load_module nvidia-modeset || true
load_module nvidia-uvm || true
;;
gsp-off|safe)
# NVIDIA documents that GSP firmware is enabled by default on newer GPUs and can
# be disabled via NVreg_EnableGpuFirmware=0. Safe mode keeps the live ISO on the
# conservative path for platforms where full boot-time GSP init is unstable.
if ! load_module nvidia NVreg_EnableGpuFirmware=0; then
exit 1
fi
log "GSP-off mode: skipping nvidia-modeset and nvidia-uvm during boot"
;;
nomsi|*)
# nomsi: disable MSI-X/MSI interrupts — use when RmInitAdapter fails with
# "Failed to enable MSI-X" on one or more GPUs (IOMMU group interrupt limits).
# NVreg_EnableMSI=0 forces legacy INTx interrupts for all GPUs.
if ! load_module nvidia NVreg_EnableGpuFirmware=0 NVreg_EnableMSI=0; then
exit 1
fi
log "nomsi mode: MSI-X disabled (NVreg_EnableMSI=0), skipping nvidia-modeset and nvidia-uvm"
;;
esac
fi
# Create /dev/nvidia* device nodes (udev rules absent since we use .run installer)
nvidia_major=$(grep -m1 ' nvidiactl$' /proc/devices | awk '{print $1}')
@@ -127,6 +241,18 @@ fi
ldconfig 2>/dev/null || true
log "ldconfig refreshed"
# Keep persistence mode enabled across the session so dcgmi / stress tools do
# not fail with deployment warnings on otherwise healthy GPUs.
if command -v nvidia-smi >/dev/null 2>&1; then
if nvidia-smi -pm 1 >/dev/null 2>&1; then
log "enabled NVIDIA persistence mode"
else
log "WARN: failed to enable NVIDIA persistence mode"
fi
else
log "WARN: nvidia-smi not found — cannot enable persistence mode"
fi
# Start DCGM host engine so dcgmi can discover GPUs.
# nv-hostengine must run after the NVIDIA modules and device nodes are ready.
# If it started too early (for example via systemd before bee-nvidia-load), it can