LiveCD: set Baby Bee wallpaper centered on black background

400×400px PNG centered via feh --bg-center --image-bg '#000000'. Fallback solid fill also changed to black.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dashboard: per-device status chips with hover tooltips
2026-04-16 06:57:23 +03:00 · 2026-04-16 06:54:13 +03:00 · 2026-04-16 06:46:45 +03:00 · 2026-04-16 06:45:15 +03:00 · 2026-04-16 06:42:00 +03:00 · 2026-04-16 06:40:06 +03:00
22 changed files with 1197 additions and 427 deletions
--- a/audit/go.mod
+++ b/audit/go.mod
@@ -5,22 +5,18 @@ go 1.25.0
 replace reanimator/chart => ../internal/chart

 require (
-	github.com/go-analyze/charts v0.5.26
+	modernc.org/sqlite v1.48.0
 	reanimator/chart v0.0.0-00010101000000-000000000000
 )

 require (
 	github.com/dustin/go-humanize v1.0.1 // indirect
-	github.com/go-analyze/bulk v0.1.3 // indirect
-	github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 // indirect
 	github.com/google/uuid v1.6.0 // indirect
 	github.com/mattn/go-isatty v0.0.20 // indirect
 	github.com/ncruces/go-strftime v1.0.0 // indirect
 	github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec // indirect
-	golang.org/x/image v0.24.0 // indirect
 	golang.org/x/sys v0.42.0 // indirect
-	modernc.org/libc v1.70.0 // indirect
+	modernc.org/libc v1.72.0 // indirect
 	modernc.org/mathutil v1.7.1 // indirect
 	modernc.org/memory v1.11.0 // indirect
-	modernc.org/sqlite v1.48.0 // indirect
 )
--- a/audit/go.sum
+++ b/audit/go.sum
@@ -1,37 +1,51 @@
-github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
-github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
 github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
-github.com/go-analyze/bulk v0.1.3 h1:pzRdBqzHDAT9PyROt0SlWE0YqPtdmTcEpIJY0C3vF0c=
-github.com/go-analyze/bulk v0.1.3/go.mod h1:afon/KtFJYnekIyN20H/+XUvcLFjE8sKR1CfpqfClgM=
-github.com/go-analyze/charts v0.5.26 h1:rSwZikLQuFX6cJzwI8OAgaWZneG1kDYxD857ms00ZxY=
-github.com/go-analyze/charts v0.5.26/go.mod h1:s1YvQhjiSwtLx1f2dOKfiV9x2TT49nVSL6v2rlRpTbY=
-github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 h1:DACJavvAHhabrF08vX0COfcOBJRhZ8lUbR+ZWIs0Y5g=
-github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0/go.mod h1:E/TSTwGwJL78qG/PmXZO1EjYhfJinVAhrmmHX6Z8B9k=
+github.com/google/pprof v0.0.0-20250317173921-a4b03ec1a45e h1:ijClszYn+mADRFY17kjQEVQ1XRhq2/JR1M3sGqeJoxs=
+github.com/google/pprof v0.0.0-20250317173921-a4b03ec1a45e/go.mod h1:boTsfXsheKC2y+lKOCMpSfarhxDeIzfZG1jqGcPl3cA=
 github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
 github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
+github.com/hashicorp/golang-lru/v2 v2.0.7 h1:a+bsQ5rvGLjzHuww6tVxozPZFVghXaHOwFs4luLUK2k=
+github.com/hashicorp/golang-lru/v2 v2.0.7/go.mod h1:QeFd9opnmA6QUJc5vARoKUSoFhyfM2/ZepoAG6RGpeM=
 github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY=
 github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=
 github.com/ncruces/go-strftime v1.0.0 h1:HMFp8mLCTPp341M/ZnA4qaf7ZlsbTc+miZjCLOFAw7w=
 github.com/ncruces/go-strftime v1.0.0/go.mod h1:Fwc5htZGVVkseilnfgOVb9mKy6w1naJmn9CehxcKcls=
-github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
-github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
 github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec h1:W09IVJc94icq4NjY3clb7Lk8O1qJ8BdBEF8z0ibU0rE=
 github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec/go.mod h1:qqbHyh8v60DhA7CoWK5oRCqLrMHRGoxYCSS9EjAz6Eo=
-github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
-github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
-golang.org/x/image v0.24.0 h1:AN7zRgVsbvmTfNyqIbbOraYL8mSwcKncEj8ofjgzcMQ=
-golang.org/x/image v0.24.0/go.mod h1:4b/ITuLfqYq1hqZcjofwctIhi7sZh2WaCjvsBNjjya8=
+golang.org/x/mod v0.33.0 h1:tHFzIWbBifEmbwtGz65eaWyGiGZatSrT9prnU8DbVL8=
+golang.org/x/mod v0.33.0/go.mod h1:swjeQEj+6r7fODbD2cqrnje9PnziFuw4bmLbBZFrQ5w=
+golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4=
+golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0=
 golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
 golang.org/x/sys v0.42.0 h1:omrd2nAlyT5ESRdCLYdm3+fMfNFE/+Rf4bDIQImRJeo=
 golang.org/x/sys v0.42.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
-gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
-gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
-modernc.org/libc v1.70.0 h1:U58NawXqXbgpZ/dcdS9kMshu08aiA6b7gusEusqzNkw=
-modernc.org/libc v1.70.0/go.mod h1:OVmxFGP1CI/Z4L3E0Q3Mf1PDE0BucwMkcXjjLntvHJo=
+golang.org/x/tools v0.42.0 h1:uNgphsn75Tdz5Ji2q36v/nsFSfR/9BRFvqhGBaJGd5k=
+golang.org/x/tools v0.42.0/go.mod h1:Ma6lCIwGZvHK6XtgbswSoWroEkhugApmsXyrUmBhfr0=
+modernc.org/cc/v4 v4.27.3 h1:uNCgn37E5U09mTv1XgskEVUJ8ADKpmFMPxzGJ0TSo+U=
+modernc.org/cc/v4 v4.27.3/go.mod h1:3YjcbCqhoTTHPycJDRl2WZKKFj0nwcOIPBfEZK0Hdk8=
+modernc.org/ccgo/v4 v4.32.4 h1:L5OB8rpEX4ZsXEQwGozRfJyJSFHbbNVOoQ59DU9/KuU=
+modernc.org/ccgo/v4 v4.32.4/go.mod h1:lY7f+fiTDHfcv6YlRgSkxYfhs+UvOEEzj49jAn2TOx0=
+modernc.org/fileutil v1.4.0 h1:j6ZzNTftVS054gi281TyLjHPp6CPHr2KCxEXjEbD6SM=
+modernc.org/fileutil v1.4.0/go.mod h1:EqdKFDxiByqxLk8ozOxObDSfcVOv/54xDs/DUHdvCUU=
+modernc.org/gc/v2 v2.6.5 h1:nyqdV8q46KvTpZlsw66kWqwXRHdjIlJOhG6kxiV/9xI=
+modernc.org/gc/v2 v2.6.5/go.mod h1:YgIahr1ypgfe7chRuJi2gD7DBQiKSLMPgBQe9oIiito=
+modernc.org/gc/v3 v3.1.2 h1:ZtDCnhonXSZexk/AYsegNRV1lJGgaNZJuKjJSWKyEqo=
+modernc.org/gc/v3 v3.1.2/go.mod h1:HFK/6AGESC7Ex+EZJhJ2Gni6cTaYpSMmU/cT9RmlfYY=
+modernc.org/goabi0 v0.2.0 h1:HvEowk7LxcPd0eq6mVOAEMai46V+i7Jrj13t4AzuNks=
+modernc.org/goabi0 v0.2.0/go.mod h1:CEFRnnJhKvWT1c1JTI3Avm+tgOWbkOu5oPA8eH8LnMI=
+modernc.org/libc v1.72.0 h1:IEu559v9a0XWjw0DPoVKtXpO2qt5NVLAnFaBbjq+n8c=
+modernc.org/libc v1.72.0/go.mod h1:tTU8DL8A+XLVkEY3x5E/tO7s2Q/q42EtnNWda/L5QhQ=
 modernc.org/mathutil v1.7.1 h1:GCZVGXdaN8gTqB1Mf/usp1Y/hSqgI2vAGGP4jZMCxOU=
 modernc.org/mathutil v1.7.1/go.mod h1:4p5IwJITfppl0G4sUEDtCr4DthTaT47/N3aT6MhfgJg=
 modernc.org/memory v1.11.0 h1:o4QC8aMQzmcwCK3t3Ux/ZHmwFPzE6hf2Y5LbkRs+hbI=
 modernc.org/memory v1.11.0/go.mod h1:/JP4VbVC+K5sU2wZi9bHoq2MAkCnrt2r98UGeSK7Mjw=
+modernc.org/opt v0.1.4 h1:2kNGMRiUjrp4LcaPuLY2PzUfqM/w9N23quVwhKt5Qm8=
+modernc.org/opt v0.1.4/go.mod h1:03fq9lsNfvkYSfxrfUhZCWPk1lm4cq4N+Bh//bEtgns=
+modernc.org/sortutil v1.2.1 h1:+xyoGf15mM3NMlPDnFqrteY07klSFxLElE2PVuWIJ7w=
+modernc.org/sortutil v1.2.1/go.mod h1:7ZI3a3REbai7gzCLcotuw9AC4VZVpYMjDzETGsSMqJE=
 modernc.org/sqlite v1.48.0 h1:ElZyLop3Q2mHYk5IFPPXADejZrlHu7APbpB0sF78bq4=
 modernc.org/sqlite v1.48.0/go.mod h1:hWjRO6Tj/5Ik8ieqxQybiEOUXy0NJFNp2tpvVpKlvig=
+modernc.org/strutil v1.2.1 h1:UneZBkQA+DX2Rp35KcM69cSsNES9ly8mQWD71HKlOA0=
+modernc.org/strutil v1.2.1/go.mod h1:EHkiggD70koQxjVdSBM3JKM7k6L0FbGE5eymy9i3B9A=
+modernc.org/token v1.1.0 h1:Xl7Ap9dKaEs5kLoOQeQmPWevfnk/DM5qcLcYlA8ys6Y=
+modernc.org/token v1.1.0/go.mod h1:UGzOrNV1mAFSEB63lOFHIpNRUVMvYTc6yu1SMY/XTDM=
--- a/audit/internal/app/support_bundle.go
+++ b/audit/internal/app/support_bundle.go
@@ -22,6 +22,8 @@ var supportBundleServices = []string{
 	"bee-selfheal.service",
 	"bee-selfheal.timer",
 	"bee-sshsetup.service",
+	"nvidia-dcgm.service",
+	"nvidia-fabricmanager.service",
 }

 var supportBundleCommands = []struct {
@@ -48,6 +50,43 @@ else
 fi
 `}},
 	{name: "system/nvidia-smi-q.txt", cmd: []string{"nvidia-smi", "-q"}},
+	{name: "system/nvidia-smi-topo.txt", cmd: []string{"sh", "-c", `
+if command -v nvidia-smi >/dev/null 2>&1; then
+  nvidia-smi topo -m 2>&1 || true
+else
+  echo "nvidia-smi not found"
+fi
+`}},
+	{name: "system/systemctl-nvidia-units.txt", cmd: []string{"sh", "-c", `
+if ! command -v systemctl >/dev/null 2>&1; then
+  echo "systemctl not found"
+  exit 0
+fi
+echo "=== unit files ==="
+systemctl list-unit-files --no-pager --all 'nvidia*' 'fabric*' 2>&1 || true
+echo
+echo "=== active units ==="
+systemctl list-units --no-pager --all 'nvidia*' 'fabric*' 2>&1 || true
+echo
+echo "=== failed units ==="
+systemctl --failed --no-pager 2>&1 | grep -iE 'nvidia|fabric' || echo "no failed nvidia/fabric units"
+`}},
+	{name: "system/fabric-manager-paths.txt", cmd: []string{"sh", "-c", `
+found=0
+for candidate in \
+  /usr/bin/nvidia-fabricmanager \
+  /usr/bin/nv-fabricmanager \
+  /usr/bin/nvidia-fabricmanagerd \
+  /usr/bin/nvlsm; do
+  if [ -e "$candidate" ]; then
+    found=1
+    echo "=== $candidate ==="
+    ls -l "$candidate" 2>&1 || true
+    echo
+  fi
+done
+[ "$found" -eq 1 ] || echo "no fabric manager binaries found"
+`}},
 	{name: "system/lspci-nvidia-bridges-vv.txt", cmd: []string{"sh", "-c", `
 if ! command -v lspci >/dev/null 2>&1; then
  echo "lspci not found"
@@ -195,6 +234,10 @@ var supportBundleOptionalFiles = []struct {
 }{
 	{name: "system/kern.log", src: "/var/log/kern.log"},
 	{name: "system/syslog.txt", src: "/var/log/syslog"},
+	{name: "system/fabricmanager.log", src: "/var/log/fabricmanager.log"},
+	{name: "system/nvlsm.log", src: "/var/log/nvlsm.log"},
+	{name: "system/fabricmanager/fabricmanager.log", src: "/var/log/fabricmanager/fabricmanager.log"},
+	{name: "system/fabricmanager/nvlsm.log", src: "/var/log/fabricmanager/nvlsm.log"},
 }

 const supportBundleGlob = "????-??-?? (BEE-SP*)*.tar.gz"
--- a/audit/internal/platform/benchmark.go
+++ b/audit/internal/platform/benchmark.go
@@ -40,6 +40,12 @@ type benchmarkGPUInfo struct {
 	MaxMemoryClockMHz    float64
 	BaseGraphicsClockMHz float64
 	MultiprocessorCount  int
+	// Temperature limits sourced from nvidia-smi -q verbose output.
+	// ShutdownTempC is the hardware thermal shutdown threshold.
+	// SlowdownTempC is the software throttle onset threshold.
+	// Both fall back to safe conservative defaults when not available.
+	ShutdownTempC float64 // fallback: 90°C
+	SlowdownTempC float64 // fallback: 80°C
 }

 type benchmarkPowerCalibrationResult struct {
@@ -304,18 +310,10 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
 		}
 	}()

-	// Power calibration: run dcgmi targeted_power while sampling nvidia-smi power.
-	// Returns per-GPU p95 power as an honest TDP reference for PowerSustainScore.
-	calibByIndex, powerRestoreActions := runBenchmarkPowerCalibration(ctx, verboseLog, runDir, selected, infoByIndex, logFunc)
-	restoreActions = append(restoreActions, powerRestoreActions...)
-	for _, idx := range selected {
-		if calib, ok := calibByIndex[idx]; ok && calib.Derated && calib.AppliedPowerLimitW > 0 {
-			result.Warnings = append(result.Warnings, fmt.Sprintf(
-				"GPU %d could not complete targeted_power at its default server power budget; benchmark ran at reduced power limit %.0f W.",
-				idx, calib.AppliedPowerLimitW,
-			))
-		}
-	}
+	// No power calibration before performance benchmark — GPUs run at their
+	// default power limits. PowerSustainScore is derived from steady-state power
+	// observed during the benchmark itself.
+	calibByIndex := make(map[int]benchmarkPowerCalibrationResult)

 	// Start background CPU load sampler — samples every 10s during GPU phases.
 	cpuStopCh := make(chan struct{})
@@ -338,6 +336,8 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
 				gpuResult.PowerLimitW = info.PowerLimitW
 				gpuResult.MultiprocessorCount = info.MultiprocessorCount
 				gpuResult.DefaultPowerLimitW = info.DefaultPowerLimitW
+				gpuResult.ShutdownTempC = info.ShutdownTempC
+				gpuResult.SlowdownTempC = info.SlowdownTempC
 				gpuResult.MaxGraphicsClockMHz = info.MaxGraphicsClockMHz
 				gpuResult.BaseGraphicsClockMHz = info.BaseGraphicsClockMHz
 				gpuResult.MaxMemoryClockMHz = info.MaxMemoryClockMHz
@@ -516,6 +516,17 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv

 			gpuResult.Scores = scoreBenchmarkGPUResult(gpuResult)
 			gpuResult.DegradationReasons = detectBenchmarkDegradationReasons(gpuResult, result.Normalization.Status)
+			if anomaly := detectPowerAnomaly(metricRows, idx); anomaly != "" {
+				gpuResult.Notes = append(gpuResult.Notes,
+					fmt.Sprintf("[HARD STOP] GPU %d: %s", idx, anomaly))
+			}
+			if warn := detectSlowdownTempExceedance(metricRows, idx, gpuResult.SlowdownTempC); warn != "" {
+				gpuResult.Notes = append(gpuResult.Notes,
+					fmt.Sprintf("[WARNING] GPU %d: %s", idx, warn))
+				if gpuResult.Status == "OK" {
+					gpuResult.Status = "PARTIAL"
+				}
+			}
 			if planErr != nil {
 				gpuResult.Status = classifySATErrorStatus(phaseLogs["mixed"], planErr)
 			} else if len(gpuResult.PrecisionFailures) > 0 {
@@ -531,12 +542,74 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv

 	} // end sequential path

+	// Performance scalability ramp-up: run parallel benchmarks for k=2..N GPUs
+	// and compute scalability relative to the best single-GPU result.
+	// Only runs in sequential mode (each GPU was tested individually above) and
+	// when there are at least 2 GPUs.
+	if !opts.ParallelGPUs && len(selected) >= 2 {
+		// Find the best single-card SyntheticScore as the 1-GPU baseline.
+		var bestTOPS float64
+		for _, g := range result.GPUs {
+			if g.Scores.SyntheticScore > bestTOPS {
+				bestTOPS = g.Scores.SyntheticScore
+			}
+		}
+		if bestTOPS > 0 {
+			var rampSteps []NvidiaPerformanceRampStep
+			var scalabilityPcts []float64
+			for k := 2; k <= len(selected); k++ {
+				subset := append([]int(nil), selected[:k]...)
+				rampDir := filepath.Join(runDir, fmt.Sprintf("ramp-%02d", k))
+				_ = os.MkdirAll(rampDir, 0755)
+				logFunc(fmt.Sprintf("performance ramp: step %d/%d — running %d GPUs in parallel", k, len(selected), k))
+
+				var rampResult NvidiaBenchmarkResult
+				var rampIdleW, rampLoadedWSum float64
+				var rampIdleOK, rampLoadedOK bool
+				var rampLoadedSamples int
+				var rampMetricRows []GPUMetricRow
+				var rampTimelineSec float64
+				emptyCalib := make(map[int]benchmarkPowerCalibrationResult)
+
+				runNvidiaBenchmarkParallel(ctx, verboseLog, rampDir, subset, infoByIndex, opts, spec, logFunc,
+					&rampResult, emptyCalib,
+					&rampIdleW, &rampLoadedWSum, &rampIdleOK, &rampLoadedOK, &rampLoadedSamples,
+					&rampMetricRows, &rampTimelineSec, "")
+
+				var totalSynth, totalMixed float64
+				for _, g := range rampResult.GPUs {
+					totalSynth += g.Scores.SyntheticScore
+					totalMixed += g.Scores.MixedScore
+				}
+				scalPct := totalSynth / (float64(k) * bestTOPS) * 100
+				scalabilityPcts = append(scalabilityPcts, scalPct)
+
+				stepStatus := "OK"
+				if len(rampResult.GPUs) < k {
+					stepStatus = "PARTIAL"
+				}
+				rampSteps = append(rampSteps, NvidiaPerformanceRampStep{
+					StepIndex:          k,
+					GPUIndices:         subset,
+					TotalSyntheticTOPS: totalSynth,
+					TotalMixedTOPS:     totalMixed,
+					ScalabilityPct:     scalPct,
+					Status:             stepStatus,
+				})
+			}
+			result.PerformanceRampSteps = rampSteps
+			result.PlatformPowerScore = benchmarkMean(scalabilityPcts)
+			if len(scalabilityPcts) > 0 {
+				result.ScalabilityScore = scalabilityPcts[len(scalabilityPcts)-1]
+			}
+		}
+	}
+
 	if len(selected) > 1 && opts.RunNCCL {
 		result.Interconnect = runBenchmarkInterconnect(ctx, verboseLog, runDir, selected, spec, logFunc)
 		if result.Interconnect != nil && result.Interconnect.Supported {
 			for i := range result.GPUs {
 				result.GPUs[i].Scores.InterconnectScore = result.Interconnect.MaxBusBWGBps
-				result.GPUs[i].Scores.CompositeScore = compositeBenchmarkScore(result.GPUs[i].Scores)
 			}
 		}
 	}
@@ -683,6 +756,8 @@ func enrichGPUInfoWithMaxClocks(infoByIndex map[int]benchmarkGPUInfo, nvsmiQ []b
 	defaultPwrRe := regexp.MustCompile(`(?i)Default Power Limit\s*:\s*([0-9.]+)\s*W`)
 	currentPwrRe := regexp.MustCompile(`(?i)Current Power Limit\s*:\s*([0-9.]+)\s*W`)
 	smCountRe := regexp.MustCompile(`(?i)Multiprocessor Count\s*:\s*(\d+)`)
+	shutdownTempRe := regexp.MustCompile(`(?i)GPU Shutdown Temp\s*:\s*(\d+)\s*C`)
+	slowdownTempRe := regexp.MustCompile(`(?i)GPU Slowdown Temp\s*:\s*(\d+)\s*C`)

 	sectionStarts := gpuSectionRe.FindAllSubmatchIndex(nvsmiQ, -1)
 	for i, loc := range sectionStarts {
@@ -746,6 +821,20 @@ func enrichGPUInfoWithMaxClocks(infoByIndex map[int]benchmarkGPUInfo, nvsmiQ []b
 				}
 			}
 		}
+		if info.ShutdownTempC == 0 {
+			if m := shutdownTempRe.FindSubmatch(section); m != nil {
+				if v, err := strconv.ParseFloat(string(m[1]), 64); err == nil && v > 0 {
+					info.ShutdownTempC = v
+				}
+			}
+		}
+		if info.SlowdownTempC == 0 {
+			if m := slowdownTempRe.FindSubmatch(section); m != nil {
+				if v, err := strconv.ParseFloat(string(m[1]), 64); err == nil && v > 0 {
+					info.SlowdownTempC = v
+				}
+			}
+		}
 		infoByIndex[benchIdx] = info
 	}
 }
@@ -1344,63 +1433,172 @@ func scoreBenchmarkGPUResult(gpu BenchmarkGPUResult) BenchmarkScorecard {
 	case score.MixedScore > 0:
 		score.ComputeScore = score.MixedScore
 	}
-	// PowerSustainScore: measures how close the GPU came to its rated TDP under
-	// a full-spectrum load (dcgmi targeted_power). 100 = exactly at rated TDP.
-	// Penalty applied symmetrically for both under- and over-TDP deviations:
-	//   score = max(0, 100 − |measured − rated| / rated × 100)
-	// Under-TDP → power delivery / cooling issue.
-	// Over-TDP  → power limit not properly enforced / power regulation fault.
-	// Falls back to 0 if calibration was not performed (dcgmi unavailable).
-	{
-		ref := gpu.DefaultPowerLimitW
-		if ref <= 0 {
-			ref = gpu.PowerLimitW
-		}
-		if gpu.CalibratedPeakPowerW > 0 && ref > 0 {
-			deviationPct := math.Abs(gpu.CalibratedPeakPowerW-ref) / ref * 100
-			score.PowerSustainScore = clampScore(100 - deviationPct)
-		}
-	}
-	runtimeUS := math.Max(1, gpu.Steady.DurationSec*1e6)
-	thermalRatio := float64(gpu.Throttle.HWThermalSlowdownUS+gpu.Throttle.SWThermalSlowdownUS) / runtimeUS
-	score.ThermalSustainScore = clampScore(100 - thermalRatio*100)
-	// StabilityScore: prefer per-precision steady phases where each window runs a
-	// single kernel type so PowerCVPct is a genuine stability signal (not a
-	// workload-mix artifact). Fall back to combined steady using clock-only metrics
-	// when per-precision data is absent (older results, short profiles).
+	// PowerSustainScore: how stable is GPU power draw during the benchmark?
+	// High variance means the workload is bursting or the power delivery is
+	// unstable. Score = max(0, 100 − PowerCVPct × 3).
+	// At 10% CV → score 70; at roughly 33.3% CV and above → score 0.
+	// Uses per-precision windows when available (each runs a single kernel,
+	// so CV reflects genuine power regulation, not workload switching).
 	if len(gpu.PrecisionSteady) > 0 {
 		var sum float64
 		for _, p := range gpu.PrecisionSteady {
-			sum += clampScore(100 - (p.Steady.ClockCVPct*4 + p.Steady.PowerCVPct*2 + p.Steady.ClockDriftPct*2))
+			sum += clampScore(100 - p.Steady.PowerCVPct*3)
 		}
-		score.StabilityScore = sum / float64(len(gpu.PrecisionSteady))
-	} else {
-		score.StabilityScore = clampScore(100 - (gpu.Steady.ClockCVPct*4 + gpu.Steady.ClockDriftPct*2))
+		score.PowerSustainScore = sum / float64(len(gpu.PrecisionSteady))
+	} else if gpu.Steady.PowerCVPct > 0 {
+		score.PowerSustainScore = clampScore(100 - gpu.Steady.PowerCVPct*3)
 	}
-	score.CompositeScore = compositeBenchmarkScore(score)
+
+	// ThermalSustainScore: how stable is GPU temperature during the benchmark?
+	// High variance means cooling is inconsistent (fan bursts, liquid flow
+	// instability, or frequent transitions in and out of throttle).
+	// Score = max(0, 100 − TempCVPct × 3).
+	if gpu.Steady.TempCVPct > 0 {
+		score.ThermalSustainScore = clampScore(100 - gpu.Steady.TempCVPct*3)
+	} else {
+		// TempCV not recorded — fall back to 100 (no penalty).
+		score.ThermalSustainScore = 100
+	}
+
+	// Throttle breakdown: compute per-type percentages for diagnosis.
+	// Each counter measures microseconds spent in that throttle state during
+	// the steady-state window. Counters can overlap (e.g. thermal + power cap
+	// simultaneously), so they are reported independently, not summed.
+	runtimeUS := math.Max(1, gpu.Steady.DurationSec*1e6)
+	score.ThermalThrottlePct = math.Min(100,
+		float64(gpu.Throttle.HWThermalSlowdownUS+gpu.Throttle.SWThermalSlowdownUS)/runtimeUS*100)
+	score.PowerCapThrottlePct = math.Min(100,
+		float64(gpu.Throttle.SWPowerCapUS)/runtimeUS*100)
+	score.SyncBoostThrottlePct = math.Min(100,
+		float64(gpu.Throttle.SyncBoostUS)/runtimeUS*100)
+
+	// StabilityScore: combined throttle signal (thermal + power cap).
+	// Score = max(0, 100 − combined_throttle_pct).
+	// 1% throttle → 99; 10% → 90; any throttle > 0 is penalised.
+	combinedThrottlePct := math.Min(100,
+		float64(gpu.Throttle.HWThermalSlowdownUS+gpu.Throttle.SWThermalSlowdownUS+gpu.Throttle.SWPowerCapUS)/runtimeUS*100)
+	score.StabilityScore = clampScore(100 - combinedThrottlePct)
+
+	// TempHeadroomC: distance from p95 temperature to the GPU's hardware
+	// shutdown threshold (sourced from nvidia-smi -q "GPU Shutdown Temp").
+	// Fallback: 90°C when not available.
+	// Assessed independently of throttle — a GPU at 86°C without any throttle
+	// counter still has limited headroom and operates in a degraded-reliability zone.
+	// Warning zone: headroom < (shutdownTemp - slowdownTemp), i.e. past slowdown onset.
+	// Critical zone: headroom < 10°C from shutdown.
+	if gpu.Steady.P95TempC > 0 {
+		shutdownTemp := gpu.ShutdownTempC
+		if shutdownTemp <= 0 {
+			shutdownTemp = 90
+		}
+		score.TempHeadroomC = shutdownTemp - gpu.Steady.P95TempC
+	}
+	score.ServerQualityScore = serverQualityScore(score)
+	score.CompositeScore = score.ComputeScore
 	if gpu.MultiprocessorCount > 0 && gpu.Steady.AvgGraphicsClockMHz > 0 && score.ComputeScore > 0 {
 		score.TOPSPerSMPerGHz = score.ComputeScore / float64(gpu.MultiprocessorCount) / (gpu.Steady.AvgGraphicsClockMHz / 1000.0)
 	}
 	return score
 }

+// compositeBenchmarkScore is kept for compatibility with legacy callers.
+// CompositeScore = ComputeScore (no quality multiplier; throttling already
+// reduces TOPS directly, so no additional penalty is needed).
 func compositeBenchmarkScore(score BenchmarkScorecard) float64 {
-	// Weights after introducing calibrated power reference:
-	//   base        0.35 — floor so a GPU that fails all sustain checks still scores
-	//   thermal     0.25 — heaviest: throttle counters are the most reliable signal
-	//   stability   0.25 — clock/power variance matters for reproducibility
-	//   power       0.15 — GPU reaches rated TDP under targeted_power? lower weight
-	//                       because calibration may be absent (dcgmi not installed)
-	//   NCCL bonus  0.10 — interconnect health
-	//   cap         1.10
-	quality := 0.35 + 0.15*(score.PowerSustainScore/100.0) + 0.25*(score.ThermalSustainScore/100.0) + 0.25*(score.StabilityScore/100.0)
-	if score.InterconnectScore > 0 {
-		quality += 0.10
+	return score.ComputeScore
+}
+
+// serverQualityScore returns a 0–100 score reflecting server infrastructure
+// quality, independent of GPU model or compute speed.
+//
+//	StabilityScore (throttle time)   0.40 — heaviest: direct evidence GPU can't sustain load
+//	PowerSustainScore (power CV)     0.30 — unstable draw hints at PSU/VRM issues
+//	ThermalSustainScore (temp CV)    0.30 — unstable temp hints at airflow/cooling issues
+func serverQualityScore(score BenchmarkScorecard) float64 {
+	q := 0.40*(score.StabilityScore/100.0) +
+		0.30*(score.PowerSustainScore/100.0) +
+		0.30*(score.ThermalSustainScore/100.0)
+	return clampScore(q * 100)
+}
+
+// detectPowerAnomaly scans per-GPU steady-state metric rows for a sudden
+// power drop — a symptom of bad cable contact, VRM fault, or thermal event
+// on the power delivery path. Returns a non-empty string if an anomaly is found.
+//
+// Algorithm: uses a 5-sample rolling baseline; flags any sample that falls
+// more than 30% below the baseline while the GPU was otherwise loaded
+// (usage > 50%). A sustained throttle (power cap) is not flagged here —
+// that is already captured by PowerCapThrottlePct.
+func detectPowerAnomaly(rows []GPUMetricRow, gpuIndex int) string {
+	const windowSize = 5
+	const dropThresholdPct = 30.0
+	const minUsagePct = 50.0
+
+	// Filter rows for this GPU during steady state only.
+	var steady []GPUMetricRow
+	for _, r := range rows {
+		if r.GPUIndex == gpuIndex && r.Stage != "" && strings.Contains(r.Stage, "steady") {
+			steady = append(steady, r)
+		}
 	}
-	if quality > 1.10 {
-		quality = 1.10
+	if len(steady) < windowSize+2 {
+		return ""
 	}
-	return score.ComputeScore * quality
+
+	// Compute initial baseline from the first window.
+	var baseSum float64
+	for i := 0; i < windowSize; i++ {
+		baseSum += steady[i].PowerW
+	}
+
+	for i := windowSize; i < len(steady); i++ {
+		baseline := baseSum / float64(windowSize)
+		sample := steady[i]
+		if baseline > 0 && sample.UsagePct >= minUsagePct {
+			dropPct := (baseline - sample.PowerW) / baseline * 100
+			if dropPct >= dropThresholdPct {
+				return fmt.Sprintf("sudden power drop detected at t=%.0fs: %.0f W → %.0f W (%.0f%% below rolling baseline) — possible bad cable contact or VRM fault",
+					sample.ElapsedSec, baseline, sample.PowerW, dropPct)
+			}
+		}
+		// Slide the window baseline.
+		baseSum -= steady[i-windowSize].PowerW
+		baseSum += sample.PowerW
+	}
+	return ""
+}
+
+// detectSlowdownTempExceedance scans steady-state metric rows for a GPU and
+// returns a warning string if any temperature sample exceeded the GPU's
+// SlowdownTempC threshold. Uses fallback 80°C when SlowdownTempC is zero.
+// This is a real-time signal distinct from p95 stats — even a single spike
+// above the slowdown threshold is worth flagging.
+func detectSlowdownTempExceedance(rows []GPUMetricRow, gpuIndex int, slowdownTempC float64) string {
+	if slowdownTempC <= 0 {
+		slowdownTempC = 80
+	}
+	var maxTemp float64
+	var exceedCount int
+	for _, r := range rows {
+		if r.GPUIndex != gpuIndex {
+			continue
+		}
+		if !strings.Contains(r.Stage, "steady") {
+			continue
+		}
+		if r.TempC > maxTemp {
+			maxTemp = r.TempC
+		}
+		if r.TempC >= slowdownTempC {
+			exceedCount++
+		}
+	}
+	if exceedCount == 0 {
+		return ""
+	}
+	return fmt.Sprintf(
+		"temperature exceeded slowdown threshold (%.0f°C) in %d sample(s) during steady state — peak %.1f°C",
+		slowdownTempC, exceedCount, maxTemp)
 }

 func detectBenchmarkDegradationReasons(gpu BenchmarkGPUResult, normalizationStatus string) []string {
@@ -1592,7 +1790,7 @@ func finalizeBenchmarkGPUResult(gpu BenchmarkGPUResult) BenchmarkGPUResult {
 		gpu.Status = "OK"
 	}
 	if gpu.Scores.CompositeScore == 0 {
-		gpu.Scores.CompositeScore = compositeBenchmarkScore(gpu.Scores)
+		gpu.Scores.CompositeScore = gpu.Scores.ComputeScore
 	}
 	return gpu
 }
@@ -1630,19 +1828,26 @@ func buildBenchmarkFindings(result NvidiaBenchmarkResult) []string {
 		for _, reason := range gpu.DegradationReasons {
 			switch reason {
 			case "power_capped":
-				findings = append(findings, fmt.Sprintf("GPU %d spent measurable time under SW power cap.", gpu.Index))
+				findings = append(findings, fmt.Sprintf(
+					"[POWER] GPU %d: power cap throttle %.1f%% of steady state — server is not delivering full TDP to the GPU.",
+					gpu.Index, gpu.Scores.PowerCapThrottlePct))
 			case "thermal_limited":
-				msg := fmt.Sprintf("GPU %d reported thermal slowdown during steady state.", gpu.Index)
+				// Hard stop check: thermal throttle while fans are not at maximum.
+				// This means the server does not see GPU thermals — incompatible config.
 				if result.Cooling != nil && result.Cooling.FanDutyCycleAvailable &&
-					result.Cooling.P95FanDutyCyclePct < 98 && gpu.Steady.ClockDriftPct >= 20 {
-					msg += fmt.Sprintf(
-						" Fans peaked at %.0f%% duty cycle (not at maximum) while clocks dropped %.0f%% — possible cooling misconfiguration; rerun the benchmark with fan speed manually fixed at 100%%.",
-						result.Cooling.P95FanDutyCyclePct, gpu.Steady.ClockDriftPct,
-					)
+					result.Cooling.P95FanDutyCyclePct < 95 {
+					findings = append(findings, fmt.Sprintf(
+						"[HARD STOP] GPU %d: thermal throttle (%.1f%% of time) while fans peaked at only %.0f%% duty cycle — server cooling is not responding to GPU heat load. Configuration is likely incompatible.",
+						gpu.Index, gpu.Scores.ThermalThrottlePct, result.Cooling.P95FanDutyCyclePct))
+				} else {
+					findings = append(findings, fmt.Sprintf(
+						"[THERMAL] GPU %d: thermal throttle %.1f%% of steady state.",
+						gpu.Index, gpu.Scores.ThermalThrottlePct))
 				}
-				findings = append(findings, msg)
 			case "sync_boost_limited":
-				findings = append(findings, fmt.Sprintf("GPU %d was limited by sync boost behaviour.", gpu.Index))
+				findings = append(findings, fmt.Sprintf(
+					"[SYNC] GPU %d: sync boost throttle %.1f%% of steady state — GPUs are constraining each other's clocks.",
+					gpu.Index, gpu.Scores.SyncBoostThrottlePct))
 			case "low_sm_clock_vs_target":
 				findings = append(findings, fmt.Sprintf("GPU %d average SM clock stayed below the requested lock target.", gpu.Index))
 			case "variance_too_high":
@@ -1650,11 +1855,39 @@ func buildBenchmarkFindings(result NvidiaBenchmarkResult) []string {
 			case "normalization_partial":
 				findings = append(findings, fmt.Sprintf("GPU %d ran without full benchmark normalization.", gpu.Index))
 			case "power_limit_derated":
-				findings = append(findings, fmt.Sprintf("GPU %d could not sustain targeted_power in this server at the default limit; benchmark ran derated at %.0f W.", gpu.Index, gpu.PowerLimitW))
+				findings = append(findings, fmt.Sprintf("[POWER] GPU %d could not sustain full TDP in this server; benchmark ran at reduced limit %.0f W.", gpu.Index, gpu.PowerLimitW))
 			case "ecc_uncorrected_errors":
-				findings = append(findings, fmt.Sprintf("GPU %d reported %d uncorrected ECC error(s) — possible hardware fault.", gpu.Index, gpu.ECC.Uncorrected))
+				findings = append(findings, fmt.Sprintf(
+					"[HARD STOP] GPU %d: %d uncorrected ECC error(s) detected — possible hardware fault. Do not use in production.",
+					gpu.Index, gpu.ECC.Uncorrected))
 			case "ecc_corrected_errors":
-				findings = append(findings, fmt.Sprintf("GPU %d reported %d corrected ECC error(s) — possible DRAM degradation.", gpu.Index, gpu.ECC.Corrected))
+				findings = append(findings, fmt.Sprintf(
+					"[WARNING] GPU %d: %d corrected ECC error(s) — possible DRAM degradation, monitor closely.",
+					gpu.Index, gpu.ECC.Corrected))
+			}
+		}
+		// Temperature headroom checks — independent of throttle counters.
+		// Shutdown and slowdown thresholds are per-GPU from nvidia-smi -q;
+		// fall back to 90°C / 80°C when unavailable.
+		if gpu.Steady.P95TempC > 0 {
+			shutdownTemp := gpu.ShutdownTempC
+			if shutdownTemp <= 0 {
+				shutdownTemp = 90
+			}
+			slowdownTemp := gpu.SlowdownTempC
+			if slowdownTemp <= 0 {
+				slowdownTemp = 80
+			}
+			headroom := shutdownTemp - gpu.Steady.P95TempC
+			switch {
+			case headroom < 10:
+				findings = append(findings, fmt.Sprintf(
+					"[HARD STOP] GPU %d: p95 temperature %.1f°C — only %.1f°C from shutdown threshold (%.0f°C). Do not operate.",
+					gpu.Index, gpu.Steady.P95TempC, headroom, shutdownTemp))
+			case gpu.Steady.P95TempC >= slowdownTemp:
+				findings = append(findings, fmt.Sprintf(
+					"[THERMAL] GPU %d: p95 temperature %.1f°C exceeds slowdown threshold (%.0f°C) — %.1f°C headroom to shutdown. Operating in degraded reliability zone.",
+					gpu.Index, gpu.Steady.P95TempC, slowdownTemp, headroom))
 			}
 		}
 		if gpu.CoolingWarning != "" {
@@ -2055,6 +2288,8 @@ func runNvidiaBenchmarkParallel(
 			r.PowerLimitW = info.PowerLimitW
 			r.MultiprocessorCount = info.MultiprocessorCount
 			r.DefaultPowerLimitW = info.DefaultPowerLimitW
+			r.ShutdownTempC = info.ShutdownTempC
+			r.SlowdownTempC = info.SlowdownTempC
 			r.MaxGraphicsClockMHz = info.MaxGraphicsClockMHz
 			r.BaseGraphicsClockMHz = info.BaseGraphicsClockMHz
 			r.MaxMemoryClockMHz = info.MaxMemoryClockMHz
@@ -2470,6 +2705,7 @@ func runBenchmarkPowerCalibration(
 	gpuIndices []int,
 	infoByIndex map[int]benchmarkGPUInfo,
 	logFunc func(string),
+	fixedLimits map[int]int,
 ) (map[int]benchmarkPowerCalibrationResult, []benchmarkRestoreAction) {
 	const calibDurationSec = 120
 	const maxDerateW = 150
@@ -2486,6 +2722,11 @@ func runBenchmarkPowerCalibration(
 		logFunc("power calibration: dcgmi not found, skipping (will use default power limit)")
 		return map[int]benchmarkPowerCalibrationResult{}, nil
 	}
+	if killed := KillTestWorkers(); len(killed) > 0 {
+		for _, p := range killed {
+			logFunc(fmt.Sprintf("power calibration pre-flight: killed stale worker pid=%d name=%s", p.PID, p.Name))
+		}
+	}

 	canDerate := os.Geteuid() == 0
 	if !canDerate {
@@ -2555,6 +2796,21 @@ func runBenchmarkPowerCalibration(
 			hi:             appliedLimitW + 1, // not yet tested, not yet confirmed unstable
 			calib:          benchmarkPowerCalibrationResult{AppliedPowerLimitW: float64(appliedLimitW)},
 		}
+		if fixedLimits != nil {
+			if fixedW, ok := fixedLimits[idx]; ok {
+				// This GPU's limit was established in a prior ramp step and must
+				// remain unchanged. Apply it immediately and skip the binary search.
+				if canDerate && fixedW > 0 {
+					_ = setBenchmarkPowerLimit(ctx, verboseLog, idx, fixedW)
+				}
+				s.appliedLimitW = fixedW
+				s.calib.AppliedPowerLimitW = float64(fixedW)
+				s.calib.Completed = true
+				s.converged = true
+				s.calib.Notes = append(s.calib.Notes,
+					fmt.Sprintf("fixed limit: %d W (held from prior ramp step)", fixedW))
+			}
+		}
 		states = append(states, s)
 		if canDerate && originalLimitW > 0 {
 			idxCopy := idx
@@ -2764,6 +3020,10 @@ calibDone:
 						s.appliedLimitW = s.lo
 						s.calib.AppliedPowerLimitW = float64(s.lo)
 						s.calib.Derated = s.lo < s.originalLimitW
+						// Summary was captured when we last verified stability at s.lo,
+						// so the result is valid — mark as completed even though we
+						// converged from the failure path (tried higher, failed, fell back).
+						s.calib.Completed = true
 					}
 				} else {
 					s.calib.Notes = append(s.calib.Notes, fmt.Sprintf("could not find a stable targeted_power limit within %d W of the default", maxDerateW))
@@ -2846,7 +3106,8 @@ func renderPowerBenchReport(result NvidiaPowerBenchResult) string {
 	fmt.Fprintf(&b, "**Benchmark version:** %s  \n", result.BenchmarkVersion)
 	fmt.Fprintf(&b, "**Profile:** %s  \n", result.BenchmarkProfile)
 	fmt.Fprintf(&b, "**Generated:** %s  \n", result.GeneratedAt.Format("2006-01-02 15:04:05 UTC"))
-	fmt.Fprintf(&b, "**Overall status:** %s  \n\n", result.OverallStatus)
+	fmt.Fprintf(&b, "**Overall status:** %s  \n", result.OverallStatus)
+	fmt.Fprintf(&b, "**Platform max TDP:** %.0f W  \n\n", result.PlatformMaxTDPW)
 	if len(result.Findings) > 0 {
 		b.WriteString("## Summary\n\n")
 		for _, finding := range result.Findings {
@@ -2860,25 +3121,36 @@ func renderPowerBenchReport(result NvidiaPowerBenchResult) string {
 	}
 	if len(result.RampSteps) > 0 {
 		b.WriteString("## Ramp Sequence\n\n")
-		b.WriteString("| Step | GPUs | Total Power | Avg / GPU | Avg Realization | Min Realization | Derated |\n")
-		b.WriteString("|------|------|-------------|-----------|-----------------|-----------------|---------|\n")
+		b.WriteString("| Step | New GPU | Stable Limit | Total Observed | Derated | Status |\n")
+		b.WriteString("|------|---------|--------------|----------------|---------|--------|\n")
 		for _, step := range result.RampSteps {
-			fmt.Fprintf(&b, "| %d | %s | %.0f W | %.0f W | %.1f%% | %.1f%% | %d |\n",
-				step.StepIndex, joinIndexList(step.GPUIndices), step.TotalObservedPowerW, step.AvgObservedPowerW, step.AvgPowerRealizationPct, step.MinPowerRealizationPct, step.DeratedGPUCount)
+			derated := "-"
+			if step.Derated {
+				derated = "⚠ yes"
+			}
+			fmt.Fprintf(&b, "| %d | GPU %d | %.0f W | %.0f W | %s | %s |\n",
+				step.StepIndex, step.NewGPUIndex, step.NewGPUStableLimitW, step.TotalObservedPowerW, derated, step.Status)
 		}
 		b.WriteString("\n")
 	}
 	b.WriteString("## Per-Slot Results\n\n")
-	b.WriteString("| GPU | Status | Max Power | Temp | Applied Limit | Default Limit | Attempts |\n")
-	b.WriteString("|-----|--------|-----------|------|---------------|---------------|----------|\n")
+	b.WriteString("| GPU | Status | Single-card Limit | Stable Limit | Temp | Attempts |\n")
+	b.WriteString("|-----|--------|-------------------|--------------|------|----------|\n")
 	for _, gpu := range result.GPUs {
-		fmt.Fprintf(&b, "| GPU %d | %s | %.0f W | %.1f C | %.0f W | %.0f W | %d |\n",
-			gpu.Index, gpu.Status, gpu.MaxObservedPowerW, gpu.MaxObservedTempC, gpu.AppliedPowerLimitW, gpu.DefaultPowerLimitW, gpu.CalibrationAttempts)
+		stableLimit := "-"
+		if gpu.StablePowerLimitW > 0 {
+			if gpu.Derated {
+				stableLimit = fmt.Sprintf("%.0f W ⚠", gpu.StablePowerLimitW)
+			} else {
+				stableLimit = fmt.Sprintf("%.0f W", gpu.StablePowerLimitW)
+			}
+		}
+		fmt.Fprintf(&b, "| GPU %d | %s | %.0f W | %s | %.1f C | %d |\n",
+			gpu.Index, gpu.Status, gpu.AppliedPowerLimitW, stableLimit, gpu.MaxObservedTempC, gpu.CalibrationAttempts)
 	}
 	b.WriteString("\n")
 	for _, gpu := range result.GPUs {
 		fmt.Fprintf(&b, "### GPU %d — %s\n\n", gpu.Index, gpu.Name)
-
 		for _, note := range gpu.Notes {
 			fmt.Fprintf(&b, "- %s\n", note)
 		}
@@ -2893,14 +3165,22 @@ func renderPowerBenchSummary(result NvidiaPowerBenchResult) string {
 	fmt.Fprintf(&b, "benchmark_version=%s\n", result.BenchmarkVersion)
 	fmt.Fprintf(&b, "benchmark_profile=%s\n", result.BenchmarkProfile)
 	fmt.Fprintf(&b, "overall_status=%s\n", result.OverallStatus)
+	fmt.Fprintf(&b, "platform_max_tdp_w=%.0f\n", result.PlatformMaxTDPW)
 	fmt.Fprintf(&b, "gpu_count=%d\n", len(result.GPUs))
 	if len(result.RecommendedSlotOrder) > 0 {
 		fmt.Fprintf(&b, "recommended_slot_order=%s\n", joinIndexList(result.RecommendedSlotOrder))
 	}
 	for _, step := range result.RampSteps {
 		fmt.Fprintf(&b, "ramp_step_%d_gpus=%s\n", step.StepIndex, joinIndexList(step.GPUIndices))
+		fmt.Fprintf(&b, "ramp_step_%d_new_gpu=%d\n", step.StepIndex, step.NewGPUIndex)
+		fmt.Fprintf(&b, "ramp_step_%d_stable_limit_w=%.0f\n", step.StepIndex, step.NewGPUStableLimitW)
 		fmt.Fprintf(&b, "ramp_step_%d_total_power_w=%.0f\n", step.StepIndex, step.TotalObservedPowerW)
 	}
+	for _, gpu := range result.GPUs {
+		if gpu.StablePowerLimitW > 0 {
+			fmt.Fprintf(&b, "gpu_%d_stable_limit_w=%.0f\n", gpu.Index, gpu.StablePowerLimitW)
+		}
+	}
 	return b.String()
 }

@@ -2953,7 +3233,7 @@ func (s *System) RunNvidiaPowerBench(ctx context.Context, baseDir string, opts N
 		_ = os.MkdirAll(singleDir, 0755)
 		singleInfo := cloneBenchmarkGPUInfoMap(infoByIndex)
 		logFunc(fmt.Sprintf("power calibration: GPU %d single-card baseline", idx))
-		c, restore := runBenchmarkPowerCalibration(ctx, verboseLog, singleDir, []int{idx}, singleInfo, logFunc)
+		c, restore := runBenchmarkPowerCalibration(ctx, verboseLog, singleDir, []int{idx}, singleInfo, logFunc, nil)
 		allRestoreActions = append(allRestoreActions, restore...)
 		if r, ok := c[idx]; ok {
 			calibByIndex[idx] = r
@@ -3029,72 +3309,125 @@ func (s *System) RunNvidiaPowerBench(ctx context.Context, baseDir string, opts N
 		singleByIndex[gpu.Index] = gpu
 	}

-	// Phase 2: ramp — add one GPU per step and calibrate the growing subset
-	// simultaneously. Step 1 reuses single-card results; steps 2..N run fresh
-	// targeted_power with derating if degradation is detected.
-	for step := 1; step <= len(result.RecommendedSlotOrder); step++ {
+	// Phase 2: cumulative thermal ramp.
+	// Each step introduces one new GPU into an environment where all previously
+	// calibrated GPUs are already running at their fixed stable limits. The new
+	// GPU's stable TDP is found by binary search (targeted_power) under real
+	// multi-GPU thermal load. Once found, its limit is fixed permanently for all
+	// subsequent steps. This ensures each GPU's limit reflects actual sustained
+	// power in the final full-system thermal state.
+	//
+	// stableLimits accumulates GPU index → fixed stable limit (W) across steps.
+	stableLimits := make(map[int]int, len(result.RecommendedSlotOrder))
+
+	// Step 1: reuse single-card calibration result directly.
+	if len(result.RecommendedSlotOrder) > 0 {
+		firstIdx := result.RecommendedSlotOrder[0]
+		firstCalib := calibByIndex[firstIdx]
+		stableLimits[firstIdx] = int(math.Round(firstCalib.AppliedPowerLimitW))
+		ramp := NvidiaPowerBenchStep{
+			StepIndex:           1,
+			GPUIndices:          []int{firstIdx},
+			NewGPUIndex:         firstIdx,
+			NewGPUStableLimitW:  firstCalib.AppliedPowerLimitW,
+			TotalObservedPowerW: firstCalib.Summary.P95PowerW,
+			AvgObservedPowerW:   firstCalib.Summary.P95PowerW,
+			Derated:             firstCalib.Derated,
+			Status:              "OK",
+		}
+		if !firstCalib.Completed {
+			ramp.Status = "FAILED"
+			ramp.Notes = append(ramp.Notes, fmt.Sprintf("GPU %d did not complete single-card targeted_power", firstIdx))
+			result.OverallStatus = "PARTIAL"
+		} else if firstCalib.Derated {
+			ramp.Status = "PARTIAL"
+			if result.OverallStatus == "OK" {
+				result.OverallStatus = "PARTIAL"
+			}
+			result.Findings = append(result.Findings, fmt.Sprintf("Ramp step 1 (GPU %d) required derating to %.0f W.", firstIdx, firstCalib.AppliedPowerLimitW))
+		}
+		result.RampSteps = append(result.RampSteps, ramp)
+		logFunc(fmt.Sprintf("power ramp: step 1/%d — reused single-card calibration for GPU %d, stable limit %.0f W",
+			len(result.RecommendedSlotOrder), firstIdx, firstCalib.AppliedPowerLimitW))
+	}
+
+	// Steps 2..N: each step fixes previously calibrated GPUs and searches only
+	// the new GPU's stable limit in the combined thermal environment.
+	for stepNum := 1; stepNum < len(result.RecommendedSlotOrder); stepNum++ {
+		step := stepNum + 1
 		subset := append([]int(nil), result.RecommendedSlotOrder[:step]...)
+		newGPUIdx := result.RecommendedSlotOrder[stepNum]
 		stepDir := filepath.Join(runDir, fmt.Sprintf("step-%02d", step))
 		_ = os.MkdirAll(stepDir, 0755)
-		var stepCalib map[int]benchmarkPowerCalibrationResult
-		if step == 1 {
-			// Single-GPU step — already measured in phase 1; reuse directly.
-			stepCalib = calibByIndex
-			logFunc(fmt.Sprintf("power ramp: step 1/%d — reusing single-card calibration for GPU %d", len(result.RecommendedSlotOrder), subset[0]))
-		} else {
-			stepInfo := cloneBenchmarkGPUInfoMap(infoByIndex)
-			var stepRestore []benchmarkRestoreAction
-			stepCalib, stepRestore = runBenchmarkPowerCalibration(ctx, verboseLog, stepDir, subset, stepInfo, logFunc)
-			for i := len(stepRestore) - 1; i >= 0; i-- {
-				stepRestore[i].fn()
-			}
+
+		// All previously calibrated GPUs are fixed at their stable limits.
+		fixedForStep := make(map[int]int, len(stableLimits))
+		for k, v := range stableLimits {
+			fixedForStep[k] = v
 		}
+
+		logFunc(fmt.Sprintf("power ramp: step %d/%d — calibrating GPU %d with %d fixed GPU(s)",
+			step, len(result.RecommendedSlotOrder), newGPUIdx, len(fixedForStep)))
+
+		stepInfo := cloneBenchmarkGPUInfoMap(infoByIndex)
+		stepCalib, stepRestore := runBenchmarkPowerCalibration(ctx, verboseLog, stepDir, subset, stepInfo, logFunc, fixedForStep)
+		// Accumulate restore actions; they all run in the outer defer.
+		allRestoreActions = append(allRestoreActions, stepRestore...)
+
 		ramp := NvidiaPowerBenchStep{
-			StepIndex:  step,
-			GPUIndices: subset,
-			Status:     "OK",
+			StepIndex:   step,
+			GPUIndices:  subset,
+			NewGPUIndex: newGPUIdx,
+			Status:      "OK",
 		}
-		var realizationValues []float64
+
+		// Total observed power = sum of p95 across all GPUs in this step.
 		for _, idx := range subset {
-			calib := stepCalib[idx]
-			ramp.TotalObservedPowerW += calib.Summary.P95PowerW
-			if calib.Derated {
-				ramp.DeratedGPUCount++
-				ramp.Status = "PARTIAL"
-			}
-			if !calib.Completed {
-				ramp.Status = "FAILED"
-				ramp.Notes = append(ramp.Notes, fmt.Sprintf("GPU %d did not complete targeted_power in ramp step %d", idx, step))
-				continue
-			}
-			if single, ok := singleByIndex[idx]; ok && single.MaxObservedPowerW > 0 {
-				realization := calib.Summary.P95PowerW / single.MaxObservedPowerW * 100
-				realizationValues = append(realizationValues, realization)
+			if c, ok := stepCalib[idx]; ok {
+				ramp.TotalObservedPowerW += c.Summary.P95PowerW
 			}
 		}
 		if len(subset) > 0 {
 			ramp.AvgObservedPowerW = ramp.TotalObservedPowerW / float64(len(subset))
 		}
-		if len(realizationValues) > 0 {
-			ramp.AvgPowerRealizationPct = benchmarkMean(realizationValues)
-			ramp.MinPowerRealizationPct = realizationValues[0]
-			for _, v := range realizationValues[1:] {
-				if v < ramp.MinPowerRealizationPct {
-					ramp.MinPowerRealizationPct = v
+
+		// Determine stable limit for the new GPU.
+		if c, ok := stepCalib[newGPUIdx]; ok && c.Completed {
+			stableLimits[newGPUIdx] = int(math.Round(c.AppliedPowerLimitW))
+			ramp.NewGPUStableLimitW = c.AppliedPowerLimitW
+			ramp.Derated = c.Derated
+			if c.Derated {
+				ramp.Status = "PARTIAL"
+				if result.OverallStatus == "OK" {
+					result.OverallStatus = "PARTIAL"
 				}
+				result.Findings = append(result.Findings, fmt.Sprintf("Ramp step %d (GPU %d) required derating to %.0f W under combined thermal load.", step, newGPUIdx, c.AppliedPowerLimitW))
 			}
+		} else {
+			// Calibration failed — fall back to single-card limit.
+			fb := calibByIndex[newGPUIdx]
+			stableLimits[newGPUIdx] = int(math.Round(fb.AppliedPowerLimitW))
+			ramp.NewGPUStableLimitW = fb.AppliedPowerLimitW
+			ramp.Status = "FAILED"
+			ramp.Notes = append(ramp.Notes, fmt.Sprintf("GPU %d did not complete targeted_power in ramp step %d; using single-card limit %.0f W", newGPUIdx, step, fb.AppliedPowerLimitW))
+			result.OverallStatus = "PARTIAL"
 		}
-		if ramp.MinPowerRealizationPct > 0 && ramp.MinPowerRealizationPct < 90 {
-			ramp.Notes = append(ramp.Notes, fmt.Sprintf("Power realization fell to %.1f%% of single-card baseline by step %d.", ramp.MinPowerRealizationPct, step))
-			if result.OverallStatus == "OK" {
-				result.OverallStatus = "PARTIAL"
-			}
-		}
-		if ramp.DeratedGPUCount > 0 {
-			result.Findings = append(result.Findings, fmt.Sprintf("Ramp step %d (%s) needed derating on %d GPU(s).", step, joinIndexList(subset), ramp.DeratedGPUCount))
-		}
+
 		result.RampSteps = append(result.RampSteps, ramp)
 	}
+
+	// Populate StablePowerLimitW on each GPU entry from the accumulated stable limits.
+	for i := range result.GPUs {
+		if lim, ok := stableLimits[result.GPUs[i].Index]; ok {
+			result.GPUs[i].StablePowerLimitW = float64(lim)
+		}
+	}
+
+	// PlatformMaxTDPW = sum of all stable limits — the actual sustained power
+	// budget of this server with all GPUs running simultaneously without throttling.
+	for _, lim := range stableLimits {
+		result.PlatformMaxTDPW += float64(lim)
+	}
 	resultJSON, err := json.MarshalIndent(result, "", "  ")
 	if err != nil {
 		return "", fmt.Errorf("marshal power result: %w", err)
--- a/audit/internal/platform/benchmark_report.go
+++ b/audit/internal/platform/benchmark_report.go
@@ -61,6 +61,9 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult) string {
 	if result.ScalabilityScore > 0 {
 		fmt.Fprintf(&b, "**Scalability score:** %.1f%%  \n", result.ScalabilityScore)
 	}
+	if result.PlatformPowerScore > 0 {
+		fmt.Fprintf(&b, "**Platform power score:** %.1f%%  \n", result.PlatformPowerScore)
+	}
 	fmt.Fprintf(&b, "**Overall status:** %s  \n", result.OverallStatus)
 	b.WriteString("\n")

@@ -81,41 +84,92 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult) string {
 		b.WriteString("\n")
 	}

-	// ── Methodology ───────────────────────────────────────────────────────────
-	b.WriteString("## Methodology\n\n")
-	fmt.Fprintf(&b, "- Profile `%s` uses standardized baseline -> warmup -> steady-state -> interconnect phases.\n", result.BenchmarkProfile)
-	b.WriteString("- Single-GPU compute score comes from `bee-gpu-burn` on the cuBLASLt path when available.\n")
-	b.WriteString("- Thermal and power limits are inferred from NVIDIA clock-event counters plus sustained telemetry.\n")
-	b.WriteString("- `result.json` is the canonical machine-readable source for the run.\n\n")
-	b.WriteString("**Compute score** is derived from two phases:\n\n")
-	b.WriteString("- **Synthetic** — each precision type (int8, fp8, fp16, fp32, fp64, fp4) runs alone for a dedicated window. ")
-	b.WriteString("Measures peak throughput with the full GPU dedicated to one kernel type. ")
-	b.WriteString("Each result is normalised to fp32-equivalent TOPS using precision weights: ")
-	b.WriteString("fp64 ×2.0 · fp32 ×1.0 · fp16 ×0.5 · int8 ×0.25 · fp8 ×0.25 · fp4 ×0.125.\n")
-	b.WriteString("- **Mixed** — all precision types run simultaneously (combined phase). ")
-	b.WriteString("Reflects real inference workloads where fp8 matrix ops, fp16 attention and fp32 accumulation compete for bandwidth and SM scheduler slots.\n\n")
-	b.WriteString("**Formula:** `Compute = Synthetic × (1 + MixedEfficiency × 0.3)`\n\n")
-	b.WriteString("where `MixedEfficiency = Mixed / Synthetic`. A GPU that sustains 90 % throughput under mixed load ")
-	b.WriteString("receives a +27 % bonus over its synthetic score; one that drops to 60 % receives +18 %.\n\n")
-	b.WriteString("**Composite score** = `Compute × quality_factor` where quality factors in power sustain, thermal sustain, stability, and interconnect.\n\n")
+	// ── Balanced Scorecard ────────────────────────────────────────────────────
+	b.WriteString("## Balanced Scorecard\n\n")

-	// ── Scorecard table ───────────────────────────────────────────────────────
-	b.WriteString("## Scorecard\n\n")
-	b.WriteString("| GPU | Status | Composite | Compute | Synthetic | Mixed | Mixed Eff. | TOPS/SM/GHz | Power Sustain | Thermal Sustain | Stability | Interconnect |\n")
-	b.WriteString("|-----|--------|-----------|---------|-----------|-------|------------|-------------|---------------|-----------------|-----------|-------------|\n")
+	// Perspective 1: Compatibility — hard stops
+	b.WriteString("### 1. Compatibility\n\n")
+	b.WriteString("| GPU | Thermal throttle | Fan duty at throttle | ECC uncorr | Status |\n")
+	b.WriteString("|-----|------------------|----------------------|------------|--------|\n")
 	for _, gpu := range result.GPUs {
-		name := strings.TrimSpace(gpu.Name)
-		if name == "" {
-			name = "Unknown GPU"
+		thermalThrottle := "-"
+		if gpu.Scores.ThermalThrottlePct > 0 {
+			thermalThrottle = fmt.Sprintf("%.1f%%", gpu.Scores.ThermalThrottlePct)
 		}
-		interconnect := "-"
-		if gpu.Scores.InterconnectScore > 0 {
-			interconnect = fmt.Sprintf("%.1f", gpu.Scores.InterconnectScore)
+		fanAtThrottle := "-"
+		if result.Cooling != nil && result.Cooling.FanDutyCycleAvailable && gpu.Scores.ThermalThrottlePct > 0 {
+			fanAtThrottle = fmt.Sprintf("%.0f%%", result.Cooling.P95FanDutyCyclePct)
 		}
-		topsPerSM := "-"
-		if gpu.Scores.TOPSPerSMPerGHz > 0 {
-			topsPerSM = fmt.Sprintf("%.3f", gpu.Scores.TOPSPerSMPerGHz)
+		ecc := "-"
+		if gpu.ECC.Uncorrected > 0 {
+			ecc = fmt.Sprintf("⛔ %d", gpu.ECC.Uncorrected)
 		}
+		compatStatus := "✓ OK"
+		if gpu.ECC.Uncorrected > 0 || (gpu.Scores.ThermalThrottlePct > 0 && result.Cooling != nil && result.Cooling.FanDutyCycleAvailable && result.Cooling.P95FanDutyCyclePct < 95) {
+			compatStatus = "⛔ HARD STOP"
+		}
+		fmt.Fprintf(&b, "| GPU %d | %s | %s | %s | %s |\n",
+			gpu.Index, thermalThrottle, fanAtThrottle, ecc, compatStatus)
+	}
+	b.WriteString("\n")
+
+	// Perspective 2: Thermal headroom
+	b.WriteString("### 2. Thermal Headroom\n\n")
+	b.WriteString("| GPU | p95 temp | Slowdown limit | Shutdown limit | Headroom | Thermal throttle | Status |\n")
+	b.WriteString("|-----|----------|----------------|----------------|----------|------------------|--------|\n")
+	for _, gpu := range result.GPUs {
+		shutdownTemp := gpu.ShutdownTempC
+		if shutdownTemp <= 0 {
+			shutdownTemp = 90
+		}
+		slowdownTemp := gpu.SlowdownTempC
+		if slowdownTemp <= 0 {
+			slowdownTemp = 80
+		}
+		headroom := gpu.Scores.TempHeadroomC
+		thermalStatus := "✓ OK"
+		switch {
+		case headroom < 10:
+			thermalStatus = "⛔ CRITICAL"
+		case gpu.Steady.P95TempC >= slowdownTemp:
+			thermalStatus = "⚠ WARNING"
+		}
+		throttlePct := "-"
+		if gpu.Scores.ThermalThrottlePct > 0 {
+			throttlePct = fmt.Sprintf("%.1f%%", gpu.Scores.ThermalThrottlePct)
+		}
+		fmt.Fprintf(&b, "| GPU %d | %.1f°C | %.0f°C | %.0f°C | %.1f°C | %s | %s |\n",
+			gpu.Index, gpu.Steady.P95TempC, slowdownTemp, shutdownTemp, headroom, throttlePct, thermalStatus)
+	}
+	b.WriteString("\n")
+
+	// Perspective 3: Power delivery
+	b.WriteString("### 3. Power Delivery\n\n")
+	b.WriteString("| GPU | Power cap throttle | Power stability | Fan duty (p95) | Status |\n")
+	b.WriteString("|-----|--------------------|-----------------|----------------|--------|\n")
+	for _, gpu := range result.GPUs {
+		powerCap := "-"
+		if gpu.Scores.PowerCapThrottlePct > 0 {
+			powerCap = fmt.Sprintf("%.1f%%", gpu.Scores.PowerCapThrottlePct)
+		}
+		fanDuty := "-"
+		if result.Cooling != nil && result.Cooling.FanDutyCycleAvailable {
+			fanDuty = fmt.Sprintf("%.0f%%", result.Cooling.P95FanDutyCyclePct)
+		}
+		powerStatus := "✓ OK"
+		if gpu.Scores.PowerCapThrottlePct > 5 {
+			powerStatus = "⚠ POWER LIMITED"
+		}
+		fmt.Fprintf(&b, "| GPU %d | %s | %.1f | %s | %s |\n",
+			gpu.Index, powerCap, gpu.Scores.PowerSustainScore, fanDuty, powerStatus)
+	}
+	b.WriteString("\n")
+
+	// Perspective 4: Performance
+	b.WriteString("### 4. Performance\n\n")
+	b.WriteString("| GPU | Compute TOPS | Synthetic | Mixed | Mixed Eff. | TOPS/SM/GHz |\n")
+	b.WriteString("|-----|--------------|-----------|-------|------------|-------------|\n")
+	for _, gpu := range result.GPUs {
 		synthetic := "-"
 		if gpu.Scores.SyntheticScore > 0 {
 			synthetic = fmt.Sprintf("%.2f", gpu.Scores.SyntheticScore)
@@ -128,20 +182,41 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult) string {
 		if gpu.Scores.MixedEfficiency > 0 {
 			mixedEff = fmt.Sprintf("%.1f%%", gpu.Scores.MixedEfficiency*100)
 		}
-		fmt.Fprintf(&b, "| GPU %d %s | %s | **%.2f** | %.2f | %s | %s | %s | %s | %.1f | %.1f | %.1f | %s |\n",
-			gpu.Index, name,
-			gpu.Status,
-			gpu.Scores.CompositeScore,
-			gpu.Scores.ComputeScore,
-			synthetic,
-			mixed,
-			mixedEff,
-			topsPerSM,
-			gpu.Scores.PowerSustainScore,
-			gpu.Scores.ThermalSustainScore,
-			gpu.Scores.StabilityScore,
-			interconnect,
-		)
+		topsPerSM := "-"
+		if gpu.Scores.TOPSPerSMPerGHz > 0 {
+			topsPerSM = fmt.Sprintf("%.3f", gpu.Scores.TOPSPerSMPerGHz)
+		}
+		fmt.Fprintf(&b, "| GPU %d | **%.2f** | %s | %s | %s | %s |\n",
+			gpu.Index, gpu.Scores.CompositeScore, synthetic, mixed, mixedEff, topsPerSM)
+	}
+	if len(result.PerformanceRampSteps) > 0 {
+		fmt.Fprintf(&b, "\n**Platform power score (scalability):** %.1f%%\n", result.PlatformPowerScore)
+	}
+	b.WriteString("\n")
+
+	// Perspective 5: Anomaly flags
+	b.WriteString("### 5. Anomalies\n\n")
+	b.WriteString("| GPU | ECC corrected | Sync boost throttle | Power instability | Thermal instability |\n")
+	b.WriteString("|-----|---------------|---------------------|-------------------|---------------------|\n")
+	for _, gpu := range result.GPUs {
+		eccCorr := "-"
+		if gpu.ECC.Corrected > 0 {
+			eccCorr = fmt.Sprintf("⚠ %d", gpu.ECC.Corrected)
+		}
+		syncBoost := "-"
+		if gpu.Scores.SyncBoostThrottlePct > 0 {
+			syncBoost = fmt.Sprintf("%.1f%%", gpu.Scores.SyncBoostThrottlePct)
+		}
+		powerVar := "OK"
+		if gpu.Scores.PowerSustainScore < 70 {
+			powerVar = "⚠ unstable"
+		}
+		thermalVar := "OK"
+		if gpu.Scores.ThermalSustainScore < 70 {
+			thermalVar = "⚠ unstable"
+		}
+		fmt.Fprintf(&b, "| GPU %d | %s | %s | %s | %s |\n",
+			gpu.Index, eccCorr, syncBoost, powerVar, thermalVar)
 	}
 	b.WriteString("\n")

@@ -171,13 +246,13 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult) string {
 			fmt.Fprintf(&b, "- **Power limit:** %.0f W (default %.0f W)\n", gpu.PowerLimitW, gpu.DefaultPowerLimitW)
 		}
 		if gpu.PowerLimitDerated {
-			fmt.Fprintf(&b, "- **Power limit derating:** active after %d targeted_power attempt(s)\n", gpu.PowerCalibrationTries)
+			fmt.Fprintf(&b, "- **Power limit derating:** active (reduced limit %.0f W)\n", gpu.PowerLimitW)
 		}
 		if gpu.CalibratedPeakPowerW > 0 {
 			if gpu.CalibratedPeakTempC > 0 {
-				fmt.Fprintf(&b, "- **Power calibration (`dcgmi targeted_power`):** %.0f W p95 at %.1f °C p95\n", gpu.CalibratedPeakPowerW, gpu.CalibratedPeakTempC)
+				fmt.Fprintf(&b, "- **Calibrated peak power:** %.0f W p95 at %.1f °C p95\n", gpu.CalibratedPeakPowerW, gpu.CalibratedPeakTempC)
 			} else {
-				fmt.Fprintf(&b, "- **Power calibration (`dcgmi targeted_power`):** %.0f W p95\n", gpu.CalibratedPeakPowerW)
+				fmt.Fprintf(&b, "- **Calibrated peak power:** %.0f W p95\n", gpu.CalibratedPeakPowerW)
 			}
 		}
 		if gpu.LockedGraphicsClockMHz > 0 {
@@ -329,6 +404,19 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult) string {
 		}
 	}

+	// ── Platform Scalability ──────────────────────────────────────────────────
+	if len(result.PerformanceRampSteps) > 0 {
+		b.WriteString("## Platform Scalability (Performance Ramp)\n\n")
+		fmt.Fprintf(&b, "**Platform power score:** %.1f%%  \n\n", result.PlatformPowerScore)
+		b.WriteString("| k GPUs | GPU Indices | Total Synthetic TOPS | Scalability |\n")
+		b.WriteString("|--------|-------------|----------------------|-------------|\n")
+		for _, step := range result.PerformanceRampSteps {
+			fmt.Fprintf(&b, "| %d | %s | %.2f | %.1f%% |\n",
+				step.StepIndex, joinIndexList(step.GPUIndices), step.TotalSyntheticTOPS, step.ScalabilityPct)
+		}
+		b.WriteString("\n")
+	}
+
 	// ── Raw files ─────────────────────────────────────────────────────────────
 	b.WriteString("## Raw Files\n\n")
 	b.WriteString("- `result.json`\n- `report.md`\n- `summary.txt`\n- `verbose.log`\n")
--- a/audit/internal/platform/benchmark_types.go
+++ b/audit/internal/platform/benchmark_types.go
@@ -65,6 +65,11 @@ type NvidiaBenchmarkResult struct {
 	RampTotal          int                          `json:"ramp_total,omitempty"`
 	RampRunID          string                       `json:"ramp_run_id,omitempty"`
 	ScalabilityScore   float64                      `json:"scalability_score,omitempty"`
+	// PlatformPowerScore is the mean compute scalability across ramp steps 2..N.
+	// 100% = each added GPU contributes exactly its single-card throughput.
+	// < 100% = throughput loss due to thermal throttle, power limits, or contention.
+	PlatformPowerScore   float64                    `json:"platform_power_score,omitempty"`
+	PerformanceRampSteps []NvidiaPerformanceRampStep `json:"performance_ramp_steps,omitempty"`
 	OverallStatus      string                       `json:"overall_status"`
 	SelectedGPUIndices []int                        `json:"selected_gpu_indices"`
 	Findings           []string                     `json:"findings,omitempty"`
@@ -107,6 +112,12 @@ type BenchmarkGPUResult struct {
 	PowerLimitDerated   bool    `json:"power_limit_derated,omitempty"`
 	MultiprocessorCount int     `json:"multiprocessor_count,omitempty"`
 	DefaultPowerLimitW  float64 `json:"default_power_limit_w,omitempty"`
+	// ShutdownTempC is the hardware thermal shutdown threshold for this GPU,
+	// sourced from nvidia-smi -q ("GPU Shutdown Temp"). Fallback: 90°C.
+	ShutdownTempC float64 `json:"shutdown_temp_c,omitempty"`
+	// SlowdownTempC is the software throttle onset threshold ("GPU Slowdown Temp").
+	// Fallback: 80°C.
+	SlowdownTempC float64 `json:"slowdown_temp_c,omitempty"`
 	// CalibratedPeakPowerW is the p95 power measured during a short
 	// dcgmi targeted_power calibration run before the main benchmark.
 	// Used as the reference denominator for PowerSustainScore instead of
@@ -206,9 +217,30 @@ type BenchmarkScorecard struct {
 	MixedEfficiency     float64 `json:"mixed_efficiency,omitempty"`
 	PowerSustainScore   float64 `json:"power_sustain_score"`
 	ThermalSustainScore float64 `json:"thermal_sustain_score"`
-	StabilityScore      float64 `json:"stability_score"`
-	InterconnectScore   float64 `json:"interconnect_score"`
-	CompositeScore      float64 `json:"composite_score"`
+	// StabilityScore is derived from the fraction of steady-state time the GPU
+	// spent throttling (thermal + power cap combined): 0% throttle = 100; 100% throttle = 0.
+	StabilityScore float64 `json:"stability_score"`
+
+	// Throttle breakdown — percentage of steady-state time in each throttle type.
+	// Used for diagnosis: tells WHY the GPU throttled, not just whether it did.
+	ThermalThrottlePct   float64 `json:"thermal_throttle_pct"`   // HW+SW thermal slowdown
+	PowerCapThrottlePct  float64 `json:"power_cap_throttle_pct"` // SW power cap
+	SyncBoostThrottlePct float64 `json:"sync_boost_throttle_pct,omitempty"`
+
+	// Temperature headroom: distance to the 100°C destruction threshold.
+	// TempHeadroomC = 100 - P95TempC. < 20°C = warning; < 10°C = critical.
+	// Independent of throttle — a GPU at 86°C without throttle is still in the red zone.
+	TempHeadroomC float64 `json:"temp_headroom_c"`
+
+	InterconnectScore float64 `json:"interconnect_score"`
+	// ServerQualityScore (0–100) reflects server infrastructure quality independent
+	// of GPU model. Combines throttle time, power variance, and temp variance.
+	// Use this to compare servers with the same GPU, or to flag a bad server
+	// that throttles an otherwise fast GPU.
+	ServerQualityScore float64 `json:"server_quality_score"`
+	// CompositeScore is the raw compute score (TOPS, fp32-equivalent).
+	// A throttling GPU will score lower here automatically — no quality multiplier.
+	CompositeScore float64 `json:"composite_score"`
 	// TOPSPerSMPerGHz is compute efficiency independent of clock speed and SM count.
 	TOPSPerSMPerGHz float64 `json:"tops_per_sm_per_ghz,omitempty"`
 }
@@ -265,8 +297,12 @@ type NvidiaPowerBenchResult struct {
 	RecommendedSlotOrder []int                  `json:"recommended_slot_order,omitempty"`
 	RampSteps            []NvidiaPowerBenchStep `json:"ramp_steps,omitempty"`
 	OverallStatus        string                 `json:"overall_status"`
-	Findings             []string               `json:"findings,omitempty"`
-	GPUs                 []NvidiaPowerBenchGPU  `json:"gpus"`
+	// PlatformMaxTDPW is the sum of per-GPU stable power limits found during the
+	// cumulative thermal ramp. Represents the actual sustained power budget of
+	// this server under full GPU load. Use for rack power planning.
+	PlatformMaxTDPW float64  `json:"platform_max_tdp_w"`
+	Findings        []string `json:"findings,omitempty"`
+	GPUs            []NvidiaPowerBenchGPU `json:"gpus"`
 }

 type NvidiaPowerBenchGPU struct {
@@ -274,7 +310,14 @@ type NvidiaPowerBenchGPU struct {
 	Name                string   `json:"name,omitempty"`
 	BusID               string   `json:"bus_id,omitempty"`
 	DefaultPowerLimitW  float64  `json:"default_power_limit_w,omitempty"`
+	// AppliedPowerLimitW is the stable limit found during single-card calibration.
 	AppliedPowerLimitW  float64  `json:"applied_power_limit_w,omitempty"`
+	// StablePowerLimitW is the final fixed limit for this GPU after the
+	// cumulative thermal ramp. This is the limit at which the GPU operated
+	// stably with all other GPUs running simultaneously at their own limits.
+	// May be lower than AppliedPowerLimitW if multi-GPU thermal load required
+	// additional derating.
+	StablePowerLimitW   float64  `json:"stable_power_limit_w,omitempty"`
 	MaxObservedPowerW   float64  `json:"max_observed_power_w,omitempty"`
 	MaxObservedTempC    float64  `json:"max_observed_temp_c,omitempty"`
 	CalibrationAttempts int      `json:"calibration_attempts,omitempty"`
@@ -286,13 +329,31 @@ type NvidiaPowerBenchGPU struct {
 }

 type NvidiaPowerBenchStep struct {
-	StepIndex              int      `json:"step_index"`
-	GPUIndices             []int    `json:"gpu_indices"`
-	TotalObservedPowerW    float64  `json:"total_observed_power_w,omitempty"`
-	AvgObservedPowerW      float64  `json:"avg_observed_power_w,omitempty"`
-	MinPowerRealizationPct float64  `json:"min_power_realization_pct,omitempty"`
-	AvgPowerRealizationPct float64  `json:"avg_power_realization_pct,omitempty"`
-	DeratedGPUCount        int      `json:"derated_gpu_count,omitempty"`
-	Status                 string   `json:"status"`
-	Notes                  []string `json:"notes,omitempty"`
+	StepIndex           int      `json:"step_index"`
+	GPUIndices          []int    `json:"gpu_indices"`
+	// NewGPUIndex is the GPU whose stable limit was searched in this step.
+	NewGPUIndex         int      `json:"new_gpu_index"`
+	// NewGPUStableLimitW is the stable power limit found for the new GPU.
+	NewGPUStableLimitW  float64  `json:"new_gpu_stable_limit_w,omitempty"`
+	TotalObservedPowerW float64  `json:"total_observed_power_w,omitempty"`
+	AvgObservedPowerW   float64  `json:"avg_observed_power_w,omitempty"`
+	Derated             bool     `json:"derated,omitempty"`
+	Status              string   `json:"status"`
+	Notes               []string `json:"notes,omitempty"`
+}
+
+// NvidiaPerformanceRampStep holds per-step performance data for the
+// scalability ramp-up phase of the performance benchmark.
+type NvidiaPerformanceRampStep struct {
+	StepIndex          int      `json:"step_index"`
+	GPUIndices         []int    `json:"gpu_indices"`
+	// TotalSyntheticTOPS is the sum of per-GPU SyntheticScore (fp32-equivalent
+	// TOPS from dedicated single-precision phases) across all GPUs in this step.
+	TotalSyntheticTOPS float64  `json:"total_synthetic_tops"`
+	TotalMixedTOPS     float64  `json:"total_mixed_tops,omitempty"`
+	// ScalabilityPct = TotalSyntheticTOPS / (k × best_single_gpu_tops) × 100.
+	// 100% = perfect linear scaling. < 100% = thermal/power/interconnect loss.
+	ScalabilityPct     float64  `json:"scalability_pct"`
+	Status             string   `json:"status"`
+	Notes              []string `json:"notes,omitempty"`
 }
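
Review note on `ScalabilityPct`: the field comment gives the formula but not the edge cases. The sketch below shows one way to compute it. The function name `scalabilityPct` and the zero-guard behaviour are assumptions for illustration only, not code from this commit.

```go
package main

import "fmt"

// scalabilityPct is a hypothetical helper that follows the formula in the
// NvidiaPerformanceRampStep comment:
//   TotalSyntheticTOPS / (k * best_single_gpu_tops) * 100
// Returning 0 for a non-positive k or baseline is an assumption made here
// to avoid dividing by zero; the real implementation may differ.
func scalabilityPct(totalSyntheticTOPS float64, k int, bestSingleGPUTOPS float64) float64 {
	if k <= 0 || bestSingleGPUTOPS <= 0 {
		return 0
	}
	return totalSyntheticTOPS / (float64(k) * bestSingleGPUTOPS) * 100
}

func main() {
	// Four GPUs, best single-GPU score 100 TOPS, 300 TOPS measured in total.
	fmt.Printf("%.1f%%\n", scalabilityPct(300, 4, 100)) // 75.0%
	// Perfect linear scaling.
	fmt.Printf("%.1f%%\n", scalabilityPct(400, 4, 100)) // 100.0%
}
```

A result below 100% at step k then reads directly as the share of single-GPU throughput lost to thermal, power or interconnect limits.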
--- a/audit/internal/platform/runtime.go
+++ b/audit/internal/platform/runtime.go
@@ -28,6 +28,8 @@ var runtimeTrackedServices = []string{
 	"bee-audit",
 	"bee-web",
 	"bee-sshsetup",
+	"nvidia-dcgm",
+	"nvidia-fabricmanager",
 }

 func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, error) {
--- a/audit/internal/platform/sat.go
+++ b/audit/internal/platform/sat.go
@@ -426,6 +426,13 @@ func (s *System) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string,
 	if err != nil {
 		return "", err
 	}
+	// Kill any lingering nvvs/dcgmi processes from a previous interrupted run
+	// before starting — otherwise dcgmi diag fails with DCGM_ST_IN_USE (-34).
+	if killed := KillTestWorkers(); len(killed) > 0 && logFunc != nil {
+		for _, p := range killed {
+			logFunc(fmt.Sprintf("pre-flight: killed stale worker pid=%d name=%s", p.PID, p.Name))
+		}
+	}
 	return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-targeted-power", withNvidiaPersistenceMode(
 		satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
 		satJob{
@@ -443,6 +450,13 @@ func (s *System) RunNvidiaPulseTestPack(ctx context.Context, baseDir string, dur
 	if err != nil {
 		return "", err
 	}
+	// Kill any lingering nvvs/dcgmi processes from a previous interrupted run
+	// before starting — otherwise dcgmi diag fails with DCGM_ST_IN_USE (-34).
+	if killed := KillTestWorkers(); len(killed) > 0 && logFunc != nil {
+		for _, p := range killed {
+			logFunc(fmt.Sprintf("pre-flight: killed stale worker pid=%d name=%s", p.PID, p.Name))
+		}
+	}
 	return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-pulse", withNvidiaPersistenceMode(
 		satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
 		satJob{
@@ -460,6 +474,13 @@ func (s *System) RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpu
 	if err != nil {
 		return "", err
 	}
+	// Kill any lingering nvvs/dcgmi processes from a previous interrupted run
+	// before starting — otherwise dcgmi diag fails with DCGM_ST_IN_USE (-34).
+	if killed := KillTestWorkers(); len(killed) > 0 && logFunc != nil {
+		for _, p := range killed {
+			logFunc(fmt.Sprintf("pre-flight: killed stale worker pid=%d name=%s", p.PID, p.Name))
+		}
+	}
 	return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-bandwidth", withNvidiaPersistenceMode(
 		satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
 		satJob{
@@ -552,10 +573,16 @@ func (s *System) RunMemoryAcceptancePack(ctx context.Context, baseDir string, si
 	if passes <= 0 {
 		passes = 1
 	}
-	// Bound memtester with a hard wall-clock timeout: ~2.5 min per 100 MB per
-	// pass, plus a fixed 2-minute buffer. Without this, a stuck memory
-	// controller can cause memtester to spin forever on a single subtest.
-	timeoutSec := sizeMB*passes*150/100 + 120
+	// Keep Validate Memory bounded to a quick diagnostic window. The timeout is
+	// intentionally conservative enough for healthy systems while avoiding the
+	// prior 30-80 minute hangs caused by memtester spinning on a bad subtest.
+	timeoutSec := sizeMB*passes*20/100 + 60
+	if timeoutSec < 180 {
+		timeoutSec = 180
+	}
+	if timeoutSec > 900 {
+		timeoutSec = 900
+	}
 	return runAcceptancePackCtx(ctx, baseDir, "memory", []satJob{
 		{name: "01-free-before.log", cmd: []string{"free", "-h"}},
 		{name: "02-memtester.log", cmd: []string{"timeout", fmt.Sprintf("%d", timeoutSec), "memtester", fmt.Sprintf("%dM", sizeMB), fmt.Sprintf("%d", passes)}},
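
Review note on the new memtester timeout: the arithmetic in the hunk above can be checked in isolation. The sketch below copies the formula and clamps into a standalone function. The name `memtesterTimeoutSec` is hypothetical; in the commit the logic is inline in `RunMemoryAcceptancePack`.

```go
package main

import "fmt"

// memtesterTimeoutSec reproduces the timeout arithmetic from the hunk:
// 20 s per 100 MB per pass plus a 60 s buffer (integer division),
// clamped to the range [180, 900] seconds.
func memtesterTimeoutSec(sizeMB, passes int) int {
	timeoutSec := sizeMB*passes*20/100 + 60
	if timeoutSec < 180 {
		timeoutSec = 180
	}
	if timeoutSec > 900 {
		timeoutSec = 900
	}
	return timeoutSec
}

func main() {
	cases := []struct{ sizeMB, passes int }{
		{256, 1},  // default Validate preset
		{512, 1},  // Stress preset
		{1024, 1}, // "acceptance" profile
		{1024, 2}, // "overnight" profile
		{8192, 1}, // large request, hits the upper clamp
	}
	for _, c := range cases {
		fmt.Printf("%d MB x %d -> %d s\n", c.sizeMB, c.passes, memtesterTimeoutSec(c.sizeMB, c.passes))
	}
	// Output:
	// 256 MB x 1 -> 180 s
	// 512 MB x 1 -> 180 s
	// 1024 MB x 1 -> 264 s
	// 1024 MB x 2 -> 469 s
	// 8192 MB x 1 -> 900 s
}
```

Both the 256 MB and the 512 MB preset land on the 180 s floor, so the lower clamp, not the per-MB term, sets the timeout for the default Validate and Stress runs.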
--- a/audit/internal/webui/api.go
+++ b/audit/internal/webui/api.go
@@ -628,8 +628,10 @@ func (h *handler) handleAPIBenchmarkNvidiaRunKind(target string) http.HandlerFun
 		}

 		if rampUp && len(body.GPUIndices) > 1 {
-			// Ramp-up mode: resolve GPU list, then create one task per prefix
-			// [gpu0], [gpu0,gpu1], ..., [gpu0,...,gpuN-1], each running in parallel.
+			// Ramp-up mode: RunNvidiaPowerBench internally ramps from 1 to N GPUs
+			// in Phase 2 (one additional GPU per step). A single task with all
+			// selected GPUs is sufficient — spawning N tasks with growing subsets
+			// would repeat all earlier steps redundantly.
 			gpus, err := apiListNvidiaGPUs(h.opts.App)
 			if err != nil {
 				writeError(w, http.StatusBadRequest, err.Error())
@@ -646,35 +648,27 @@ func (h *handler) handleAPIBenchmarkNvidiaRunKind(target string) http.HandlerFun
 			} else {
 				now := time.Now()
 				rampRunID := fmt.Sprintf("ramp-%s", now.UTC().Format("20060102-150405"))
-				var allTasks []*Task
-				for step := 1; step <= len(resolved); step++ {
-					subset := resolved[:step]
-					stepName := fmt.Sprintf("%s · ramp %d/%d · GPU %s", name, step, len(resolved), formatGPUIndexList(subset))
-					t := &Task{
-						ID:        newJobID("bee-bench-nvidia"),
-						Name:      stepName,
-						Target:    target,
-						Priority:  defaultTaskPriority(target, taskParams{}),
-						Status:    TaskPending,
-						CreatedAt: now,
-						params: taskParams{
-							GPUIndices:       append([]int(nil), subset...),
-							SizeMB:           body.SizeMB,
-							BenchmarkProfile: body.Profile,
-							RunNCCL:          runNCCL && step == len(resolved),
-							ParallelGPUs:     true,
-							RampStep:         step,
-							RampTotal:        len(resolved),
-							RampRunID:        rampRunID,
-							DisplayName:      stepName,
-						},
-					}
-					allTasks = append(allTasks, t)
+				taskName := fmt.Sprintf("%s · ramp 1–%d · GPU %s", name, len(resolved), formatGPUIndexList(resolved))
+				t := &Task{
+					ID:        newJobID("bee-bench-nvidia"),
+					Name:      taskName,
+					Target:    target,
+					Priority:  defaultTaskPriority(target, taskParams{}),
+					Status:    TaskPending,
+					CreatedAt: now,
+					params: taskParams{
+						GPUIndices:       append([]int(nil), resolved...),
+						SizeMB:           body.SizeMB,
+						BenchmarkProfile: body.Profile,
+						RunNCCL:          runNCCL,
+						ParallelGPUs:     true,
+						RampTotal:        len(resolved),
+						RampRunID:        rampRunID,
+						DisplayName:      taskName,
+					},
 				}
-				for _, t := range allTasks {
-					globalQueue.enqueue(t)
-				}
-				writeTaskRunResponse(w, allTasks)
+				globalQueue.enqueue(t)
+				writeTaskRunResponse(w, []*Task{t})
 				return
 			}
 		}
@@ -743,6 +737,9 @@ func (h *handler) handleAPISATAbort(w http.ResponseWriter, r *http.Request) {
 			if t.job != nil {
 				t.job.abort()
 			}
+			if taskMayLeaveOrphanWorkers(t.Target) {
+				platform.KillTestWorkers()
+			}
 			t.Status = TaskCancelled
 			now := time.Now()
 			t.DoneAt = &now
--- a/audit/internal/webui/pages.go
+++ b/audit/internal/webui/pages.go
@@ -72,6 +72,13 @@ tbody tr:hover td{background:rgba(0,0,0,.03)}
 .badge-warn{background:var(--warn-bg);color:var(--warn-fg);border:1px solid #c9ba9b}
 .badge-err{background:var(--crit-bg);color:var(--crit-fg);border:1px solid var(--crit-border)}
 .badge-unknown{background:var(--surface-2);color:var(--muted);border:1px solid var(--border)}
+/* Component chips — one small square per device */
+.chips{display:inline-flex;flex-wrap:wrap;gap:3px;align-items:center;vertical-align:middle}
+.chip{display:inline-flex;align-items:center;justify-content:center;width:20px;height:20px;border-radius:3px;font-size:10px;font-weight:800;cursor:default;font-family:monospace;letter-spacing:0;user-select:none}
+.chip-ok{background:var(--ok-bg);color:var(--ok-fg);border:1px solid #a3c293}
+.chip-warn{background:var(--warn-bg);color:var(--warn-fg);border:1px solid #c9ba9b}
+.chip-fail{background:var(--crit-bg);color:var(--crit-fg);border:1px solid var(--crit-border)}
+.chip-unknown{background:var(--surface-2);color:var(--muted);border:1px solid var(--border)}
 /* Output terminal */
 .terminal{background:#1b1c1d;border:1px solid rgba(0,0,0,.2);border-radius:4px;padding:14px;font-family:monospace;font-size:12px;color:#b5cea8;max-height:400px;overflow-y:auto;white-space:pre-wrap;word-break:break-all;user-select:text;-webkit-user-select:text}
 .terminal-wrap{position:relative}.terminal-copy{position:absolute;top:6px;right:6px;background:#2d2f30;border:1px solid #444;color:#aaa;font-size:11px;padding:2px 8px;border-radius:3px;cursor:pointer;opacity:.7}.terminal-copy:hover{opacity:1}
@@ -363,23 +370,25 @@ func renderHardwareSummaryCard(opts HandlerOptions) string {
 			html.EscapeString(label), html.EscapeString(value), badgeHTML))
 	}

-	cpuRow := aggregateComponentStatus("CPU", records, []string{"cpu:all"}, nil)
-	writeRow("CPU", hwDescribeCPU(hw), runtimeStatusBadge(cpuRow.Status))
+	writeRow("CPU", hwDescribeCPU(hw),
+		renderComponentChips(matchedRecords(records, []string{"cpu:all"}, nil)))

-	memRow := aggregateComponentStatus("Memory", records, []string{"memory:all"}, []string{"memory:"})
-	writeRow("Memory", hwDescribeMemory(hw), runtimeStatusBadge(memRow.Status))
+	writeRow("Memory", hwDescribeMemory(hw),
+		renderComponentChips(matchedRecords(records, []string{"memory:all"}, []string{"memory:"})))

-	storageRow := aggregateComponentStatus("Storage", records, []string{"storage:all"}, []string{"storage:"})
-	writeRow("Storage", hwDescribeStorage(hw), runtimeStatusBadge(storageRow.Status))
+	writeRow("Storage", hwDescribeStorage(hw),
+		renderComponentChips(matchedRecords(records, []string{"storage:all"}, []string{"storage:"})))

-	gpuRow := aggregateComponentStatus("GPU", records, nil, []string{"pcie:gpu:"})
-	writeRow("GPU", hwDescribeGPU(hw), runtimeStatusBadge(gpuRow.Status))
+	writeRow("GPU", hwDescribeGPU(hw),
+		renderComponentChips(matchedRecords(records, nil, []string{"pcie:gpu:"})))

-	psuRow := aggregateComponentStatus("PSU", records, nil, []string{"psu:"})
-	if psuRow.Status == "UNKNOWN" && len(hw.PowerSupplies) > 0 {
-		psuRow.Status = hwPSUStatus(hw.PowerSupplies)
+	psuMatched := matchedRecords(records, nil, []string{"psu:"})
+	if len(psuMatched) == 0 && len(hw.PowerSupplies) > 0 {
+		// No PSU records yet — synthesise a single chip from IPMI status.
+		psuStatus := hwPSUStatus(hw.PowerSupplies)
+		psuMatched = []app.ComponentStatusRecord{{ComponentKey: "psu:ipmi", Status: psuStatus}}
 	}
-	writeRow("PSU", hwDescribePSU(hw), runtimeStatusBadge(psuRow.Status))
+	writeRow("PSU", hwDescribePSU(hw), renderComponentChips(psuMatched))

 	if nicDesc := hwDescribeNIC(hw); nicDesc != "" {
 		writeRow("Network", nicDesc, "")
@@ -892,6 +901,31 @@ func buildHardwareComponentRows(exportDir string) []runtimeHealthRow {
 	}
 }

+func firstNonEmpty(vals ...string) string {
+	for _, v := range vals {
+		if v != "" {
+			return v
+		}
+	}
+	return ""
+}
+
+// matchedRecords returns all ComponentStatusRecord entries whose key matches
+// any exact key or any of the given prefixes. Used for per-device chip rendering.
+func matchedRecords(records []app.ComponentStatusRecord, exact []string, prefixes []string) []app.ComponentStatusRecord {
+	var matched []app.ComponentStatusRecord
+	for _, rec := range records {
+		key := strings.TrimSpace(rec.ComponentKey)
+		if key == "" {
+			continue
+		}
+		if containsExactKey(key, exact) || hasAnyPrefix(key, prefixes) {
+			matched = append(matched, rec)
+		}
+	}
+	return matched
+}
+
 func aggregateComponentStatus(title string, records []app.ComponentStatusRecord, exact []string, prefixes []string) runtimeHealthRow {
 	matched := make([]app.ComponentStatusRecord, 0)
 	for _, rec := range records {
@@ -1034,6 +1068,52 @@ func runtimeIssueDescriptions(issues []schema.RuntimeIssue, codes ...string) str
 	return strings.Join(messages, "; ")
 }

+// chipLetterClass maps a component status to a single display letter and CSS class.
+func chipLetterClass(status string) (letter, cls string) {
+	switch strings.ToUpper(strings.TrimSpace(status)) {
+	case "OK":
+		return "O", "chip-ok"
+	case "WARNING", "WARN", "PARTIAL":
+		return "W", "chip-warn"
+	case "CRITICAL", "FAIL", "FAILED", "ERROR":
+		return "F", "chip-fail"
+	default:
+		return "?", "chip-unknown"
+	}
+}
+
+// renderComponentChips renders one 20×20 chip per ComponentStatusRecord.
+// Hover tooltip shows component key, status, error summary and last check time.
+// Falls back to a single unknown chip when no records are available.
+func renderComponentChips(matched []app.ComponentStatusRecord) string {
+	if len(matched) == 0 {
+		return `<span class="chips"><span class="chip chip-unknown" title="No data">?</span></span>`
+	}
+	sort.Slice(matched, func(i, j int) bool {
+		return matched[i].ComponentKey < matched[j].ComponentKey
+	})
+	var b strings.Builder
+	b.WriteString(`<span class="chips">`)
+	for _, rec := range matched {
+		letter, cls := chipLetterClass(rec.Status)
+		var tooltip strings.Builder
+		tooltip.WriteString(rec.ComponentKey)
+		tooltip.WriteString(": ")
+		tooltip.WriteString(firstNonEmpty(rec.Status, "UNKNOWN"))
+		if rec.ErrorSummary != "" {
+			tooltip.WriteString(" — ")
+			tooltip.WriteString(rec.ErrorSummary)
+		}
+		if !rec.LastCheckedAt.IsZero() {
+			fmt.Fprintf(&tooltip, " (checked %s)", rec.LastCheckedAt.Format("15:04:05"))
+		}
+		fmt.Fprintf(&b, `<span class="chip %s" title="%s">%s</span>`,
+			cls, html.EscapeString(tooltip.String()), letter)
+	}
+	b.WriteString(`</span>`)
+	return b.String()
+}
+
 func runtimeStatusBadge(status string) string {
 	status = strings.ToUpper(strings.TrimSpace(status))
 	badge := "badge-unknown"
@@ -1339,7 +1419,7 @@ func renderValidate(opts HandlerOptions) string {
 			inv.Memory,
 			`Runs a RAM validation pass and records memory state around the test.`,
 			`<code>free</code>, <code>memtester</code>`,
-			`256 MB / 1 pass in Validate, 1 GB / 3 passes in Stress.`,
+			`256 MB / 1 pass in Validate, 512 MB / 1 pass in Stress.`,
 		)) +
 		renderSATCard("storage", "Storage", "runSAT('storage')", "", renderValidateCardBody(
 			inv.Storage,
--- a/audit/internal/webui/tasks.go
+++ b/audit/internal/webui/tasks.go
@@ -162,6 +162,32 @@ type nvidiaRampSpec struct {
 	TotalDurationSec int
 }

+func resolveMemoryValidatePreset(profile string, stress bool) (sizeMB, passes int) {
+	switch strings.TrimSpace(strings.ToLower(profile)) {
+	case "overnight":
+		return 1024, 2
+	case "acceptance":
+		return 1024, 1
+	case "smoke":
+		return 256, 1
+	}
+	if stress {
+		return 512, 1
+	}
+	return 256, 1
+}
+
+func taskMayLeaveOrphanWorkers(target string) bool {
+	switch strings.TrimSpace(strings.ToLower(target)) {
+	case "nvidia", "nvidia-targeted-stress", "nvidia-targeted-power", "nvidia-pulse",
+		"nvidia-bandwidth", "nvidia-stress", "nvidia-compute", "nvidia-bench-perf",
+		"memory", "memory-stress", "cpu", "sat-stress", "platform-stress":
+		return true
+	default:
+		return false
+	}
+}
+
 func resolveBurnPreset(profile string) burnPreset {
 	switch profile {
 	case "overnight":
@@ -751,10 +777,8 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
 			err = fmt.Errorf("app not configured")
 			break
 		}
-		sizeMB, passes := 256, 1
-		if t.params.StressMode {
-			sizeMB, passes = 1024, 3
-		}
+		sizeMB, passes := resolveMemoryValidatePreset(t.params.BurnProfile, t.params.StressMode)
+		j.append(fmt.Sprintf("Memory validate preset: %d MB x %d pass(es)", sizeMB, passes))
 		archive, err = runMemoryAcceptancePackCtx(a, ctx, "", sizeMB, passes, j.append)
 	case "storage":
 		if a == nil {
@@ -1010,6 +1034,9 @@ func (h *handler) handleAPITasksCancelAll(w http.ResponseWriter, _ *http.Request
 			if t.job != nil {
 				t.job.abort()
 			}
+			if taskMayLeaveOrphanWorkers(t.Target) {
+				platform.KillTestWorkers()
+			}
 			t.Status = TaskCancelled
 			t.DoneAt = &now
 			taskSerialEvent(t, "finished with status="+t.Status)
@@ -1037,6 +1064,9 @@ func (h *handler) handleAPITasksKillWorkers(w http.ResponseWriter, _ *http.Reque
 			if t.job != nil {
 				t.job.abort()
 			}
+			if taskMayLeaveOrphanWorkers(t.Target) {
+				platform.KillTestWorkers()
+			}
 			t.Status = TaskCancelled
 			t.DoneAt = &now
 			taskSerialEvent(t, "finished with status="+t.Status)
@@ -1141,10 +1171,13 @@ func (q *taskQueue) loadLocked() {
 		q.assignTaskLogPathLocked(t)
 		if t.Status == TaskRunning {
 			// The task was interrupted by a bee-web restart. Child processes
-			// (e.g. bee-gpu-burn-worker) survive the restart in their own
-			// process groups and cannot be cancelled retroactively. Mark the
-			// task as failed so the user can decide whether to re-run it
-			// rather than blindly re-launching duplicate workers.
+			// (e.g. bee-gpu-burn-worker, dcgmi/nvvs) survive the restart in
+			// their own process groups. Kill any matching stale workers before
+			// marking the task failed so the next GPU test does not inherit a
+			// busy DCGM slot or duplicate workers.
+			if taskMayLeaveOrphanWorkers(t.Target) {
+				_ = platform.KillTestWorkers()
+			}
 			now := time.Now()
 			t.Status = TaskFailed
 			t.DoneAt = &now
--- a/audit/internal/webui/tasks_test.go
+++ b/audit/internal/webui/tasks_test.go
@@ -672,6 +672,36 @@ func TestRunTaskUsesBurnProfileDurationForCPU(t *testing.T) {
 	}
 }

+func TestRunTaskUsesQuickPresetForMemoryValidate(t *testing.T) {
+	var gotSizeMB, gotPasses int
+	q := &taskQueue{
+		opts: &HandlerOptions{App: &app.App{}},
+	}
+	tk := &Task{
+		ID:        "mem-validate-1",
+		Name:      "Memory SAT",
+		Target:    "memory",
+		Status:    TaskRunning,
+		CreatedAt: time.Now(),
+		params:    taskParams{StressMode: true},
+	}
+	j := &jobState{}
+
+	orig := runMemoryAcceptancePackCtx
+	runMemoryAcceptancePackCtx = func(_ *app.App, _ context.Context, _ string, sizeMB, passes int, _ func(string)) (string, error) {
+		gotSizeMB = sizeMB
+		gotPasses = passes
+		return "/tmp/memory-validate.tar.gz", nil
+	}
+	defer func() { runMemoryAcceptancePackCtx = orig }()
+
+	q.runTask(tk, j, context.Background())
+
+	if gotSizeMB != 512 || gotPasses != 1 {
+		t.Fatalf("memory validate preset=%dMB x%d want 512MB x1", gotSizeMB, gotPasses)
+	}
+}
+
 func TestRunTaskBuildsSupportBundleWithoutApp(t *testing.T) {
 	dir := t.TempDir()
 	q := &taskQueue{
--- a/iso/builder/VERSIONS
+++ b/iso/builder/VERSIONS
@@ -1,6 +1,7 @@
 DEBIAN_VERSION=12
 DEBIAN_KERNEL_ABI=auto
 NVIDIA_DRIVER_VERSION=590.48.01
+NVIDIA_FABRICMANAGER_VERSION=590.48.01-1
 NCCL_VERSION=2.28.9-1
 NCCL_CUDA_VERSION=13.0
 NCCL_SHA256=2e6faafd2c19cffc7738d9283976a3200ea9db9895907f337f0c7e5a25563186
--- a/iso/builder/auto/config
+++ b/iso/builder/auto/config
@@ -33,6 +33,7 @@ lb config noauto \
    --iso-volume "EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
    --iso-application "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
    --bootappend-live "boot=live components video=1920x1080 console=tty0 console=ttyS0,115200n8 loglevel=3 systemd.show_status=1 username=bee user-fullname=Bee modprobe.blacklist=nouveau,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_generic,soundcore" \
+    --debootstrap-options "--include=ca-certificates" \
    --apt-recommends false \
    --chroot-squashfs-compression-type zstd \
    "${@}"
--- a/iso/builder/bee-gpu-stress.c
+++ b/iso/builder/bee-gpu-stress.c
@@ -35,6 +35,8 @@ typedef void *CUstream;
 #define MAX_STRESS_STREAMS 16
 #define MIN_PROFILE_BUDGET_BYTES ((size_t)4u * 1024u * 1024u)
 #define MIN_STREAM_BUDGET_BYTES ((size_t)64u * 1024u * 1024u)
+#define MAX_SINGLE_PRECISION_STREAMS 4
+#define MAX_SINGLE_PRECISION_PROFILE_BUDGET_BYTES ((size_t)2u * 1024u * 1024u * 1024u)

 static const char *ptx_source =
    ".version 6.0\n"
@@ -296,6 +298,13 @@ static int choose_stream_count(int mp_count, int planned_profiles, size_t total_
    return stream_count;
 }

+static size_t clamp_single_precision_profile_budget(size_t profile_budget_bytes) {
+    if (profile_budget_bytes > MAX_SINGLE_PRECISION_PROFILE_BUDGET_BYTES) {
+        return MAX_SINGLE_PRECISION_PROFILE_BUDGET_BYTES;
+    }
+    return profile_budget_bytes;
+}
+
 static void destroy_streams(struct cuda_api *api, CUstream *streams, int count) {
    if (!api->cuStreamDestroy) {
        return;
@@ -908,11 +917,9 @@ static int prepare_profile(struct cublaslt_api *cublas,
                           CUstream stream,
                           size_t profile_budget_bytes,
                           struct prepared_profile *out) {
-    memset(out, 0, sizeof(*out));
-    out->desc = *desc;
-    out->stream = stream;
-
    size_t bytes_per_cell = 0;
+    size_t attempt_budget = profile_budget_bytes;
+
    bytes_per_cell += bytes_for_elements(desc->a_type, 1);
    bytes_per_cell += bytes_for_elements(desc->b_type, 1);
    bytes_per_cell += bytes_for_elements(desc->c_type, 1);
@@ -921,106 +928,115 @@ static int prepare_profile(struct cublaslt_api *cublas,
        return 0;
    }

-    uint64_t dim = choose_square_dim(profile_budget_bytes, bytes_per_cell, desc->min_multiple);
-    out->m = dim;
-    out->n = dim;
-    out->k = dim;
+    while (attempt_budget >= MIN_PROFILE_BUDGET_BYTES) {
+        memset(out, 0, sizeof(*out));
+        out->desc = *desc;
+        out->stream = stream;

-    size_t desired_workspace = profile_budget_bytes / 8u;
-    if (desired_workspace > 32u * 1024u * 1024u) {
-        desired_workspace = 32u * 1024u * 1024u;
-    }
-    desired_workspace = round_down_size(desired_workspace, 256u);
+        uint64_t dim = choose_square_dim(attempt_budget, bytes_per_cell, desc->min_multiple);
+        out->m = dim;
+        out->n = dim;
+        out->k = dim;

-    size_t a_bytes = 0;
-    size_t b_bytes = 0;
-    size_t c_bytes = 0;
-    size_t d_bytes = 0;
-    size_t scale_bytes = 0;
-    while (1) {
-        a_bytes = bytes_for_elements(desc->a_type, out->k * out->m);
-        b_bytes = bytes_for_elements(desc->b_type, out->k * out->n);
-        c_bytes = bytes_for_elements(desc->c_type, out->m * out->n);
-        d_bytes = bytes_for_elements(desc->d_type, out->m * out->n);
-        scale_bytes = profile_scale_bytes(desc, out->m, out->n, out->k);
+        size_t desired_workspace = attempt_budget / 8u;
+        if (desired_workspace > 32u * 1024u * 1024u) {
+            desired_workspace = 32u * 1024u * 1024u;
+        }
+        desired_workspace = round_down_size(desired_workspace, 256u);

-        size_t matrix_bytes = a_bytes + b_bytes + c_bytes + d_bytes + scale_bytes;
-        if (matrix_bytes <= profile_budget_bytes) {
-            size_t remaining = profile_budget_bytes - matrix_bytes;
-            out->workspace_size = desired_workspace;
-            if (out->workspace_size > remaining) {
-                out->workspace_size = round_down_size(remaining, 256u);
+        size_t a_bytes = 0;
+        size_t b_bytes = 0;
+        size_t c_bytes = 0;
+        size_t d_bytes = 0;
+        size_t scale_bytes = 0;
+        while (1) {
+            a_bytes = bytes_for_elements(desc->a_type, out->k * out->m);
+            b_bytes = bytes_for_elements(desc->b_type, out->k * out->n);
+            c_bytes = bytes_for_elements(desc->c_type, out->m * out->n);
+            d_bytes = bytes_for_elements(desc->d_type, out->m * out->n);
+            scale_bytes = profile_scale_bytes(desc, out->m, out->n, out->k);
+
+            size_t matrix_bytes = a_bytes + b_bytes + c_bytes + d_bytes + scale_bytes;
+            if (matrix_bytes <= attempt_budget) {
+                size_t remaining = attempt_budget - matrix_bytes;
+                out->workspace_size = desired_workspace;
+                if (out->workspace_size > remaining) {
+                    out->workspace_size = round_down_size(remaining, 256u);
+                }
+                break;
            }
-            break;
+
+            if (out->m <= (uint64_t)desc->min_multiple) {
+                break;
+            }
+            out->m -= (uint64_t)desc->min_multiple;
+            out->n = out->m;
+            out->k = out->m;
+        }
+        if (out->m < (uint64_t)desc->min_multiple) {
+            attempt_budget /= 2u;
+            continue;
        }

-        if (out->m <= (uint64_t)desc->min_multiple) {
-            return 0;
-        }
-        out->m -= (uint64_t)desc->min_multiple;
-        out->n = out->m;
-        out->k = out->m;
-    }
-
-    if (!alloc_filled(cuda, &out->a_dev, a_bytes, 0x11) ||
-        !alloc_filled(cuda, &out->b_dev, b_bytes, 0x11) ||
-        !alloc_filled(cuda, &out->c_dev, c_bytes, 0x00) ||
-        !alloc_filled(cuda, &out->d_dev, d_bytes, 0x00)) {
-        destroy_profile(cublas, cuda, out);
-        return 0;
-    }
-
-    cudaDataType_t scale_type = matmul_scale_type(desc);
-    if (!check_cublas("cublasLtMatmulDescCreate",
-                      cublas->cublasLtMatmulDescCreate(&out->op_desc, desc->compute_type, scale_type))) {
-        destroy_profile(cublas, cuda, out);
-        return 0;
-    }
-
-    cublasOperation_t transa = CUBLAS_OP_T;
-    cublasOperation_t transb = CUBLAS_OP_N;
-    if (!check_cublas("set TRANSA",
-                      cublas->cublasLtMatmulDescSetAttribute(out->op_desc,
-                                                             CUBLASLT_MATMUL_DESC_TRANSA,
-                                                             &transa,
-                                                             sizeof(transa))) ||
-        !check_cublas("set TRANSB",
-                      cublas->cublasLtMatmulDescSetAttribute(out->op_desc,
-                                                             CUBLASLT_MATMUL_DESC_TRANSB,
-                                                             &transb,
-                                                             sizeof(transb)))) {
-        destroy_profile(cublas, cuda, out);
-        return 0;
-    }
-
-    if (desc->needs_scalar_scale) {
-        float one = 1.0f;
-        if (!alloc_filled(cuda, &out->a_scale_dev, sizeof(one), 0x00) ||
-            !alloc_filled(cuda, &out->b_scale_dev, sizeof(one), 0x00)) {
+        if (!alloc_filled(cuda, &out->a_dev, a_bytes, 0x11) ||
+            !alloc_filled(cuda, &out->b_dev, b_bytes, 0x11) ||
+            !alloc_filled(cuda, &out->c_dev, c_bytes, 0x00) ||
+            !alloc_filled(cuda, &out->d_dev, d_bytes, 0x00)) {
            destroy_profile(cublas, cuda, out);
            return 0;
        }
-        if (!device_upload(cuda, out->a_scale_dev, &one, sizeof(one)) ||
-            !device_upload(cuda, out->b_scale_dev, &one, sizeof(one))) {
+
+        cudaDataType_t scale_type = matmul_scale_type(desc);
+        if (!check_cublas("cublasLtMatmulDescCreate",
+                          cublas->cublasLtMatmulDescCreate(&out->op_desc, desc->compute_type, scale_type))) {
            destroy_profile(cublas, cuda, out);
            return 0;
        }
-        void *a_scale_ptr = (void *)(uintptr_t)out->a_scale_dev;
-        void *b_scale_ptr = (void *)(uintptr_t)out->b_scale_dev;
-        if (!check_cublas("set A scale ptr",
+
+        cublasOperation_t transa = CUBLAS_OP_T;
+        cublasOperation_t transb = CUBLAS_OP_N;
+        if (!check_cublas("set TRANSA",
                          cublas->cublasLtMatmulDescSetAttribute(out->op_desc,
-                                                                 CUBLASLT_MATMUL_DESC_A_SCALE_POINTER,
-                                                                 &a_scale_ptr,
-                                                                 sizeof(a_scale_ptr))) ||
-            !check_cublas("set B scale ptr",
+                                                                 CUBLASLT_MATMUL_DESC_TRANSA,
+                                                                 &transa,
+                                                                 sizeof(transa))) ||
+            !check_cublas("set TRANSB",
                          cublas->cublasLtMatmulDescSetAttribute(out->op_desc,
-                                                                 CUBLASLT_MATMUL_DESC_B_SCALE_POINTER,
-                                                                 &b_scale_ptr,
-                                                                 sizeof(b_scale_ptr)))) {
+                                                                 CUBLASLT_MATMUL_DESC_TRANSB,
+                                                                 &transb,
+                                                                 sizeof(transb)))) {
            destroy_profile(cublas, cuda, out);
            return 0;
        }
-    }
+
+        if (desc->needs_scalar_scale) {
+            float one = 1.0f;
+            if (!alloc_filled(cuda, &out->a_scale_dev, sizeof(one), 0x00) ||
+                !alloc_filled(cuda, &out->b_scale_dev, sizeof(one), 0x00)) {
+                destroy_profile(cublas, cuda, out);
+                return 0;
+            }
+            if (!device_upload(cuda, out->a_scale_dev, &one, sizeof(one)) ||
+                !device_upload(cuda, out->b_scale_dev, &one, sizeof(one))) {
+                destroy_profile(cublas, cuda, out);
+                return 0;
+            }
+            void *a_scale_ptr = (void *)(uintptr_t)out->a_scale_dev;
+            void *b_scale_ptr = (void *)(uintptr_t)out->b_scale_dev;
+            if (!check_cublas("set A scale ptr",
+                              cublas->cublasLtMatmulDescSetAttribute(out->op_desc,
+                                                                     CUBLASLT_MATMUL_DESC_A_SCALE_POINTER,
+                                                                     &a_scale_ptr,
+                                                                     sizeof(a_scale_ptr))) ||
+                !check_cublas("set B scale ptr",
+                              cublas->cublasLtMatmulDescSetAttribute(out->op_desc,
+                                                                     CUBLASLT_MATMUL_DESC_B_SCALE_POINTER,
+                                                                     &b_scale_ptr,
+                                                                     sizeof(b_scale_ptr)))) {
+                destroy_profile(cublas, cuda, out);
+                return 0;
+            }
+        }

 #if defined(CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3)
    if (desc->needs_block_scale) {
@@ -1060,62 +1076,65 @@ static int prepare_profile(struct cublaslt_api *cublas,
    }
 #endif

-    if (!check_cublas("create A layout",
-                      cublas->cublasLtMatrixLayoutCreate(&out->a_layout, desc->a_type, out->k, out->m, out->k)) ||
-        !check_cublas("create B layout",
-                      cublas->cublasLtMatrixLayoutCreate(&out->b_layout, desc->b_type, out->k, out->n, out->k)) ||
-        !check_cublas("create C layout",
-                      cublas->cublasLtMatrixLayoutCreate(&out->c_layout, desc->c_type, out->m, out->n, out->m)) ||
-        !check_cublas("create D layout",
-                      cublas->cublasLtMatrixLayoutCreate(&out->d_layout, desc->d_type, out->m, out->n, out->m))) {
-        destroy_profile(cublas, cuda, out);
-        return 0;
-    }
-
-    if (!check_cublas("create preference", cublas->cublasLtMatmulPreferenceCreate(&out->preference))) {
-        destroy_profile(cublas, cuda, out);
-        return 0;
-    }
-
-    if (out->workspace_size > 0) {
-        if (!alloc_filled(cuda, &out->workspace_dev, out->workspace_size, 0x00)) {
+        if (!check_cublas("create A layout",
+                          cublas->cublasLtMatrixLayoutCreate(&out->a_layout, desc->a_type, out->k, out->m, out->k)) ||
+            !check_cublas("create B layout",
+                          cublas->cublasLtMatrixLayoutCreate(&out->b_layout, desc->b_type, out->k, out->n, out->k)) ||
+            !check_cublas("create C layout",
+                          cublas->cublasLtMatrixLayoutCreate(&out->c_layout, desc->c_type, out->m, out->n, out->m)) ||
+            !check_cublas("create D layout",
+                          cublas->cublasLtMatrixLayoutCreate(&out->d_layout, desc->d_type, out->m, out->n, out->m))) {
            destroy_profile(cublas, cuda, out);
            return 0;
        }
+
+        if (!check_cublas("create preference", cublas->cublasLtMatmulPreferenceCreate(&out->preference))) {
+            destroy_profile(cublas, cuda, out);
+            return 0;
+        }
+
+        if (out->workspace_size > 0) {
+            if (!alloc_filled(cuda, &out->workspace_dev, out->workspace_size, 0x00)) {
+                destroy_profile(cublas, cuda, out);
+                return 0;
+            }
+        }
+
+        if (!check_cublas("set workspace",
+                          cublas->cublasLtMatmulPreferenceSetAttribute(
+                              out->preference,
+                              CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
+                              &out->workspace_size,
+                              sizeof(out->workspace_size)))) {
+            destroy_profile(cublas, cuda, out);
+            return 0;
+        }
+
+        int found = 0;
+        if (check_cublas("heuristic",
+                         cublas->cublasLtMatmulAlgoGetHeuristic(handle,
+                                                                out->op_desc,
+                                                                out->a_layout,
+                                                                out->b_layout,
+                                                                out->c_layout,
+                                                                out->d_layout,
+                                                                out->preference,
+                                                                1,
+                                                                &out->heuristic,
+                                                                &found)) &&
+            found > 0) {
+            out->ready = 1;
+            return 1;
+        }
+
+        destroy_profile(cublas, cuda, out);
+        attempt_budget = round_down_size(attempt_budget * 3u / 4u, 256u);
+        if (attempt_budget < MIN_PROFILE_BUDGET_BYTES) {
+            break;
+        }
    }

-    if (!check_cublas("set workspace",
-                      cublas->cublasLtMatmulPreferenceSetAttribute(
-                          out->preference,
-                          CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
-                          &out->workspace_size,
-                          sizeof(out->workspace_size)))) {
-        destroy_profile(cublas, cuda, out);
-        return 0;
-    }
-
-    int found = 0;
-    if (!check_cublas("heuristic",
-                      cublas->cublasLtMatmulAlgoGetHeuristic(handle,
-                                                             out->op_desc,
-                                                             out->a_layout,
-                                                             out->b_layout,
-                                                             out->c_layout,
-                                                             out->d_layout,
-                                                             out->preference,
-                                                             1,
-                                                             &out->heuristic,
-                                                             &found))) {
-        destroy_profile(cublas, cuda, out);
-        return 0;
-    }
-    if (found <= 0) {
-        destroy_profile(cublas, cuda, out);
-        return 0;
-    }
-
-    out->ready = 1;
-    return 1;
+    return 0;
 }

 static int run_cublas_profile(cublasLtHandle_t handle,
@@ -1180,6 +1199,7 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
    size_t requested_budget = 0;
    size_t total_budget = 0;
    size_t per_profile_budget = 0;
+    int budget_profiles = 0;

    memset(report, 0, sizeof(*report));
    snprintf(report->backend, sizeof(report->backend), "cublasLt");
@@ -1215,8 +1235,9 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
    }

    /* Count all profiles active on this GPU regardless of filter.
-     * Used as the budget divisor so matrix sizes stay consistent whether
-     * running all precisions together or a single-precision phase. */
+     * Mixed phases still divide budget across the full precision set, while
+     * single-precision benchmark phases dedicate budget only to active
+     * profiles matching precision_filter. */
    int planned_total = 0;
    for (size_t i = 0; i < sizeof(k_profiles) / sizeof(k_profiles[0]); i++) {
        if (k_profiles[i].enabled && cc >= k_profiles[i].min_cc) {
@@ -1226,19 +1247,29 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
    if (planned_total < planned) {
        planned_total = planned;
    }
+    budget_profiles = planned_total;
+    if (precision_filter != NULL) {
+        budget_profiles = planned;
+    }
+    if (budget_profiles <= 0) {
+        budget_profiles = planned_total;
+    }

    requested_budget = (size_t)size_mb * 1024u * 1024u;
-    if (requested_budget < (size_t)planned_total * MIN_PROFILE_BUDGET_BYTES) {
-        requested_budget = (size_t)planned_total * MIN_PROFILE_BUDGET_BYTES;
+    if (requested_budget < (size_t)budget_profiles * MIN_PROFILE_BUDGET_BYTES) {
+        requested_budget = (size_t)budget_profiles * MIN_PROFILE_BUDGET_BYTES;
    }
    total_budget = clamp_budget_to_free_memory(cuda, requested_budget);
-    if (total_budget < (size_t)planned_total * MIN_PROFILE_BUDGET_BYTES) {
-        total_budget = (size_t)planned_total * MIN_PROFILE_BUDGET_BYTES;
+    if (total_budget < (size_t)budget_profiles * MIN_PROFILE_BUDGET_BYTES) {
+        total_budget = (size_t)budget_profiles * MIN_PROFILE_BUDGET_BYTES;
    }
    if (query_multiprocessor_count(cuda, dev, &mp_count) &&
        cuda->cuStreamCreate &&
        cuda->cuStreamDestroy) {
-        stream_count = choose_stream_count(mp_count, planned_total, total_budget, 1);
+        stream_count = choose_stream_count(mp_count, budget_profiles, total_budget, 1);
+    }
+    if (precision_filter != NULL && stream_count > MAX_SINGLE_PRECISION_STREAMS) {
+        stream_count = MAX_SINGLE_PRECISION_STREAMS;
    }
    if (stream_count > 1) {
        int created = 0;
@@ -1251,18 +1282,22 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
        }
    }
    report->stream_count = stream_count;
-    per_profile_budget = total_budget / ((size_t)planned_total * (size_t)stream_count);
+    per_profile_budget = total_budget / ((size_t)budget_profiles * (size_t)stream_count);
    if (per_profile_budget < MIN_PROFILE_BUDGET_BYTES) {
        per_profile_budget = MIN_PROFILE_BUDGET_BYTES;
    }
+    if (precision_filter != NULL) {
+        per_profile_budget = clamp_single_precision_profile_budget(per_profile_budget);
+    }
    report->buffer_mb = (int)(total_budget / (1024u * 1024u));
    append_detail(report->details,
                  sizeof(report->details),
-                  "requested_mb=%d actual_mb=%d streams=%d mp_count=%d per_worker_mb=%zu\n",
+                  "requested_mb=%d actual_mb=%d streams=%d mp_count=%d budget_profiles=%d per_worker_mb=%zu\n",
                  size_mb,
                  report->buffer_mb,
                  report->stream_count,
                  mp_count,
+                  budget_profiles,
                  per_profile_budget / (1024u * 1024u));

    for (int i = 0; i < profile_count; i++) {
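
Note on the hunks above: the budget divisor now depends on `precision_filter`, and `prepare_profile` retries with three quarters of the previous budget (rounded down to 256 bytes) until the budget falls below `MIN_PROFILE_BUDGET_BYTES`. The following is a minimal standalone sketch of that arithmetic, not the code from this change. All `sketch_*` names are hypothetical, the 4 MiB minimum is an assumed value, the behaviour of `round_down_size` is inferred from its name, and `clamp_single_precision_profile_budget` is left out because its body is not shown here.

```c
#include <stddef.h>

/* Assumed value, for illustration only. The real MIN_PROFILE_BUDGET_BYTES
 * is defined elsewhere in the source and is not shown in this diff. */
#define SKETCH_MIN_PROFILE_BUDGET_BYTES ((size_t)4u * 1024u * 1024u)

/* Assumed behaviour of round_down_size(): the largest multiple of
 * `multiple` that does not exceed `value`. */
size_t sketch_round_down_size(size_t value, size_t multiple) {
    if (multiple == 0) {
        return value;
    }
    return value - (value % multiple);
}

/* Divisor selection as in the hunk: a filtered (single-precision) run
 * divides the budget by the filtered profile count, a mixed run by the
 * full count. */
int sketch_budget_profiles(int planned, int planned_total, int has_filter) {
    int budget_profiles = has_filter ? planned : planned_total;
    if (budget_profiles <= 0) {
        budget_profiles = planned_total;
    }
    return budget_profiles;
}

/* Per-worker share with the minimum floor applied. */
size_t sketch_per_profile_budget(size_t total_budget, int budget_profiles, int stream_count) {
    /* Guards added by this sketch to avoid dividing by zero. */
    if (budget_profiles <= 0) {
        budget_profiles = 1;
    }
    if (stream_count <= 0) {
        stream_count = 1;
    }
    size_t share = total_budget / ((size_t)budget_profiles * (size_t)stream_count);
    if (share < SKETCH_MIN_PROFILE_BUDGET_BYTES) {
        share = SKETCH_MIN_PROFILE_BUDGET_BYTES;
    }
    return share;
}

/* Hypothetical stand-in for one full prepare attempt (layouts, workspace,
 * heuristic). Returns nonzero on success. */
typedef int (*sketch_try_fn)(size_t budget, void *ctx);

/* Shrink-and-retry loop: returns the budget that succeeded, or 0. */
size_t sketch_prepare_with_backoff(size_t budget, sketch_try_fn try_prepare, void *ctx) {
    size_t attempt = budget;
    for (;;) {
        if (try_prepare(attempt, ctx)) {
            return attempt;
        }
        attempt = sketch_round_down_size(attempt * 3u / 4u, 256u);
        if (attempt < SKETCH_MIN_PROFILE_BUDGET_BYTES) {
            break;
        }
    }
    return 0;
}

/* Example callback: succeeds once the budget fits under *ctx bytes. */
int sketch_fits_limit(size_t budget, void *ctx) {
    const size_t *limit = (const size_t *)ctx;
    return budget <= *limit;
}
```

With a 20 MiB limit and a 32 MiB starting budget, the attempts under these assumptions are 32 MiB, 24 MiB, and 18 MiB, and the third one succeeds.
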
--- a/iso/builder/build.sh
+++ b/iso/builder/build.sh
@@ -1262,6 +1262,7 @@ fi
 # --- substitute version placeholders in package list and archive ---
 if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
    sed -i \
+        -e "s/%%NVIDIA_FABRICMANAGER_VERSION%%/${NVIDIA_FABRICMANAGER_VERSION}/g" \
        -e "s/%%DCGM_VERSION%%/${DCGM_VERSION}/g" \
        "${BUILD_WORK_DIR}/config/package-lists/bee-gpu.list.chroot"
 elif [ "$BEE_GPU_VENDOR" = "amd" ]; then
@@ -1304,7 +1305,7 @@ BEE_GPU_VENDOR_UPPER="$(echo "${BUILD_VARIANT}" | tr 'a-z-' 'A-Z_')"
 export BEE_GPU_VENDOR_UPPER

 cd "${LB_DIR}"
-run_step_sh "live-build clean" "80-lb-clean" "lb clean 2>&1 | tail -3"
+run_step_sh "live-build clean" "80-lb-clean" "lb clean --all 2>&1 | tail -3"
 run_step_sh "live-build config" "81-lb-config" "lb config 2>&1 | tail -5"
 dump_memtest_debug "pre-build" "${LB_DIR}"
 run_step_sh "live-build build" "90-lb-build" "lb build 2>&1"
--- a/iso/builder/config/hooks/normal/9000-bee-setup.hook.chroot
+++ b/iso/builder/config/hooks/normal/9000-bee-setup.hook.chroot
@@ -43,6 +43,7 @@ systemctl enable bee-journal-mirror@ttyS1.service 2>/dev/null || true
 # Enable GPU-vendor specific services
 if [ "$GPU_VENDOR" = "nvidia" ]; then
    systemctl enable nvidia-dcgm.service 2>/dev/null || true
+    systemctl enable nvidia-fabricmanager.service 2>/dev/null || true
    systemctl enable bee-nvidia.service
 elif [ "$GPU_VENDOR" = "amd" ]; then
    # ROCm symlinks (packages install to /opt/rocm-*/bin/)
--- a/iso/builder/config/hooks/normal/9100-memtest.hook.binary
+++ b/iso/builder/config/hooks/normal/9100-memtest.hook.binary
@@ -26,6 +26,14 @@ fail_or_warn() {
    return 0
 }

+# grub.cfg and live.cfg may not exist yet when binary hooks run — live-build
+# creates them after this hook (lb binary_grub-efi / lb binary_syslinux).
+# The template already has memtest entries hardcoded, so a missing config file
+# here is not an error; validate_iso_memtest() checks the final ISO instead.
+warn_only() {
+    log "WARNING: $1"
+}
+
 copy_memtest_file() {
    src="$1"
    dst_name="${2:-$(basename "$src")}"
@@ -61,15 +69,17 @@ extract_memtest_from_deb() {

 download_and_extract_memtest() {
    tmpdl="$(mktemp -d)"
-    ver_arg=""
    if [ -n "${MEMTEST_VERSION:-}" ]; then
-        ver_arg="=memtest86+=${MEMTEST_VERSION}"
-        log "downloading memtest86+=${MEMTEST_VERSION} from apt"
+        pkg_spec="memtest86+=${MEMTEST_VERSION}"
    else
-        log "downloading memtest86+ from apt (no version pinned)"
+        pkg_spec="memtest86+"
+    fi
+    log "downloading ${pkg_spec} from apt"
+    if ! ( cd "$tmpdl" && apt-get download "$pkg_spec" 2>/dev/null ); then
+        log "apt download failed, retrying after apt-get update"
+        apt-get update -qq >/dev/null 2>&1 || true
+        ( cd "$tmpdl" && apt-get download "$pkg_spec" 2>/dev/null ) || true
    fi
-    # shellcheck disable=SC2086
-    ( cd "$tmpdl" && apt-get download "memtest86+${ver_arg}" ) 2>/dev/null || true
    deb="$(find "$tmpdl" -maxdepth 1 -type f -name 'memtest86+*.deb' 2>/dev/null | head -1)"
    if [ -n "$deb" ]; then
        extract_memtest_from_deb "$deb"
@@ -133,7 +143,7 @@ ensure_memtest_binaries() {

 ensure_grub_entry() {
    [ -f "$GRUB_CFG" ] || {
-        fail_or_warn "missing ${GRUB_CFG}"
+        warn_only "missing ${GRUB_CFG} (will be created by lb binary_grub-efi from template)"
        return 0
    }

@@ -159,7 +169,7 @@ EOF

 ensure_isolinux_entry() {
    [ -f "$ISOLINUX_CFG" ] || {
-        fail_or_warn "missing ${ISOLINUX_CFG}"
+        warn_only "missing ${ISOLINUX_CFG} (will be created by lb binary_syslinux from template)"
        return 0
    }

--- a/iso/builder/config/package-lists/bee-nvidia.list.chroot
+++ b/iso/builder/config/package-lists/bee-nvidia.list.chroot
@@ -5,6 +5,7 @@
 # DCGM 4 is packaged per CUDA major. The image ships NVIDIA driver 590 with
 # CUDA 13 userspace, so install the CUDA 13 build plus proprietary components
 # explicitly.
+nvidia-fabricmanager=%%NVIDIA_FABRICMANAGER_VERSION%%
 datacenter-gpu-manager-4-cuda13=1:%%DCGM_VERSION%%
 datacenter-gpu-manager-4-proprietary=1:%%DCGM_VERSION%%
 datacenter-gpu-manager-4-proprietary-cuda13=1:%%DCGM_VERSION%%
--- a/iso/overlay/usr/local/bin/bee-nvidia-load
+++ b/iso/overlay/usr/local/bin/bee-nvidia-load
@@ -258,6 +258,22 @@ else
    log "WARN: nvidia-smi not found — cannot enable persistence mode"
 fi

+# Start or refresh Fabric Manager after the NVIDIA stack is ready. On NVSwitch
+# systems CUDA/DCGM can report "system not yet initialized" until fabric
+# training completes under nvidia-fabricmanager.
+if command -v systemctl >/dev/null 2>&1 && systemctl list-unit-files --no-legend 2>/dev/null | grep -q '^nvidia-fabricmanager\.service'; then
+    if systemctl restart nvidia-fabricmanager.service >/dev/null 2>&1; then
+        log "nvidia-fabricmanager restarted"
+    elif systemctl start nvidia-fabricmanager.service >/dev/null 2>&1; then
+        log "nvidia-fabricmanager started"
+    else
+        log "WARN: failed to start nvidia-fabricmanager.service"
+        systemctl status nvidia-fabricmanager.service --no-pager 2>&1 | sed 's/^/  fabricmanager: /' || true
+    fi
+else
+    log "WARN: nvidia-fabricmanager.service not installed"
+fi
+
 # Start DCGM host engine so dcgmi can discover GPUs.
 # nv-hostengine must run after the NVIDIA modules and device nodes are ready.
 # If it started too early (for example via systemd before bee-nvidia-load), it can
--- a/iso/overlay/usr/local/bin/bee-openbox-session
+++ b/iso/overlay/usr/local/bin/bee-openbox-session
@@ -9,9 +9,9 @@ xset s noblank

 # Set desktop background.
 if [ -f /usr/share/bee/wallpaper.png ]; then
-    feh --bg-fill /usr/share/bee/wallpaper.png
+    feh --bg-center --image-bg '#000000' /usr/share/bee/wallpaper.png
 else
-    xsetroot -solid '#f6c90e'
+    xsetroot -solid '#000000'
 fi

 tint2 &
--- a/iso/overlay/usr/share/bee/wallpaper.png
+++ b/iso/overlay/usr/share/bee/wallpaper.png
Author	SHA1	Message	Date
Michael Chus	30aa30cd67	LiveCD: set Baby Bee wallpaper centered on black background. 400×400px PNG centered via feh --bg-center --image-bg '#000000'. Fallback solid fill also changed to black. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:57:23 +03:00
Michael Chus	4f76e1de21	Dashboard: per-device status chips with hover tooltips. Replace single aggregated badge per hardware category with individual colored chips (O/W/F/?) for each ComponentStatusRecord. Added helper functions: matchedRecords, firstNonEmpty. CSS classes: chip-ok/warn/fail/unknown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:54:13 +03:00
Michael Chus	3732e64a4a	Add slowdown temperature exceedance detector to benchmark. detectSlowdownTempExceedance scans steady-state metric rows per GPU and emits a [WARNING] note + PARTIAL status if any sample >= SlowdownTempC. Uses per-GPU threshold from nvidia-smi -q, fallback 80°C. Distinct from p95-based TempHeadroomC check: catches even a single spike above the slowdown threshold that would be smoothed out in aggregates. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:46:45 +03:00
Michael Chus	0d925299ff	Use per-GPU temperature limits from nvidia-smi -q for headroom calculation. Parse "GPU Shutdown Temp" and "GPU Slowdown Temp" from nvidia-smi -q verbose output in enrichGPUInfoWithMaxClocks. Store as ShutdownTempC/SlowdownTempC on benchmarkGPUInfo and BenchmarkGPUResult. Fallback: 90°C shutdown / 80°C slowdown when not available. TempHeadroomC = ShutdownTempC - P95TempC (per-GPU, not hardcoded 100°C). Warning threshold: p95 >= SlowdownTempC. Critical: headroom < 10°C. Report table shows both limits alongside headroom and p95 temp. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:45:15 +03:00
Michael Chus	a8d5e019a5	Translate report to English; add power anomaly detector. All report strings are now English only. Add detectPowerAnomaly: scans steady-state metric rows per GPU with a 5-sample rolling baseline; flags a sudden drop ≥30% while GPU usage >50% as [HARD STOP], which indicates bad cable contact or VRM fault. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:42:00 +03:00
Michael Chus	72ec086568	Restructure benchmark report as balanced scorecard (5 perspectives). Split throttle into separate signals: ThermalThrottlePct, PowerCapThrottlePct, SyncBoostThrottlePct. Add TempHeadroomC (100 - p95_temp) as independent thermal headroom metric; warning < 20°C (>80°C), critical < 10°C (>90°C). Hard stop findings: thermal throttle with fans < 95%, ECC uncorrected errors, p95 temp > 90°C. Throttle findings now include per-type percentages and diagnostic context. Replace flat scorecard table with BSC 5-perspective layout: 1. Compatibility (hard stops: thermal+fan, ECC); 2. Thermal headroom (p95 temp, delta to 100°C, throttle %); 3. Power delivery (power cap throttle, power CV, fan duty); 4. Performance (Compute TOPS, Synthetic, Mixed, TOPS/SM/GHz); 5. Anomalies (ECC corrected, sync boost, power/thermal variance). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:40:06 +03:00
Michael Chus	7a0b0934df	Separate compute score from server quality score. CompositeScore = raw ComputeScore (TOPS). Throttling GPUs score lower automatically: no quality multiplier distorting the compute signal. Add ServerQualityScore (0-100): server infrastructure quality independent of GPU model. Formula: 0.40×Stability + 0.30×PowerSustain + 0.30×Thermal. Use to compare servers with the same GPU or flag bad server conditions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 00:45:55 +03:00
Michael Chus	d8ca0dca2c	Redesign scoring metrics: variance-based sustain scores, throttle stability. PowerSustainScore: power draw variance (CV) during load, not deviation from TDP. ThermalSustainScore: temperature variance (CV) during load. StabilityScore: fraction of time spent in thermal+power-cap throttling. Remove NCCL bonus from quality_factor. quality = 0.35 + 0.35×Stability + 0.15×PowerSustain + 0.15×ThermalSustain, cap 1.00. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 00:39:59 +03:00
Michael Chus	d90250f80a	Fix DCGM cleanup and shorten memory validate	2026-04-16 00:39:37 +03:00
Michael Chus	8d6eaef5de	Update perf benchmark report methodology to reflect new design. Remove references to pre-benchmark power calibration and dcgmi targeted_power. Document platform_power_score ramp-up methodology, PowerSustainScore fallback to steady-state power, and full-budget single-precision phases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 00:31:58 +03:00
Michael Chus	732bf4cbab	Redesign power and performance benchmarks with new methodology. Power/Thermal Fit: cumulative fixed-limit ramp where each GPU's stable TDP is found under real multi-GPU thermal load (all prior GPUs running at their fixed limits). PlatformMaxTDPW = sum of stable limits across all GPUs. Remove PlatformPowerScore from power test. Performance Benchmark: remove pre-benchmark power calibration entirely. After N single-card runs, execute k=2..N parallel ramp-up steps and compute PlatformPowerScore = mean compute scalability vs best single-card TOPS. PowerSustainScore falls back to Steady.AvgPowerW when calibration absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 00:30:50 +03:00
Michael Chus	fa6d905a10	Tune bee-gpu-burn single-precision benchmark phases	2026-04-16 00:05:47 +03:00
Mikhail Chusavitin	5c1862ce4c	Use lb clean --all to clear bootstrap cache on every build. Prevents stale debootstrap cache from bypassing --debootstrap-options changes (e.g. --include=ca-certificates added in v8.15). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 17:37:08 +03:00
Mikhail Chusavitin	b65ef2ea1d	Fix: use --debootstrap-options to include ca-certificates in bootstrap. --bootstrap-packages is not a valid lb config option (20230502). Use --debootstrap-options "--include=ca-certificates" instead to ensure ca-certificates is present when lb chroot_archives runs apt-get update against the NVIDIA CUDA HTTPS source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 17:26:01 +03:00
Mikhail Chusavitin	533d703c97	Bootstrap ca-certificates so NVIDIA CUDA HTTPS source is trusted. debootstrap creates a minimal chroot without ca-certificates, causing apt-get update to fail TLS verification for the NVIDIA CUDA apt source: "No system certificates available. Try installing ca-certificates." Add ca-certificates to --bootstrap-packages so it is present before lb chroot_archives configures the NVIDIA HTTPS source and runs apt-get update. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 17:24:20 +03:00
Mikhail Chusavitin	04eb4b5a6d	Revert "Pre-download DCGM/fabricmanager debs on host to bypass chroot apt". This reverts commit `4110dbf8a6`.	2026-04-15 17:19:53 +03:00
Mikhail Chusavitin	4110dbf8a6	Pre-download DCGM/fabricmanager debs on host to bypass chroot apt. The NVIDIA CUDA HTTPS apt source (developer.download.nvidia.com) may be unreachable from inside the live-build container chroot, causing 'E: Unable to locate package datacenter-gpu-manager-4-cuda13'. Add build-dcgm.sh that downloads DCGM and nvidia-fabricmanager .deb packages on the build host (verifying SHA256 against Packages.gz) and caches them in BEE_CACHE_DIR. build.sh (step 25-dcgm, nvidia only) copies them into LB_DIR/config/packages.chroot/ before lb build, so live-build creates a local apt repo from them. The chroot installs the packages from the local repo without ever contacting the NVIDIA CUDA HTTPS source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 17:10:23 +03:00
Mikhail Chusavitin	7237e4d3e4	Add fabric manager boot and support diagnostics	2026-04-15 16:14:26 +03:00
Mikhail Chusavitin	ab3ad77cd6	Fix Go module: upgrade modernc.org/libc v1.70.0 → v1.72.0. modernc.org/sqlite v1.48.0 requires modernc.org/libc/sys/types which is absent in v1.70.0 but present in v1.72.0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 14:32:04 +03:00
Mikhail Chusavitin	cd9e2cbe13	Fix ramp-up power bench: one task instead of N redundant tasks. RunNvidiaPowerBench already performs a full internal ramp from 1 to N GPUs in Phase 2. Spawning N tasks with growing GPU subsets meant task K repeated all steps 1..K-1 already done by tasks 1..K-1: O(N²) work instead of O(N). Replace with a single task using all selected GPUs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 12:29:11 +03:00
Mikhail Chusavitin	0317dc58fd	Fix memtest hook: grub.cfg/live.cfg missing during binary hooks is expected. lb binary_grub-efi and lb binary_syslinux create these files from templates that already have memtest entries hardcoded. The hook should not fail when the files don't exist yet; validate_iso_memtest() checks the final ISO. Only the binary files (x64.bin, x64.efi) are required here. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 10:33:22 +03:00
Mikhail Chusavitin	1c5cb45698	Fix memtest hook: bad ver_arg format in apt-get download. ver_arg was set to "=memtest86+=VERSION" making the command "apt-get download memtest86+=memtest86+=VERSION" (invalid). Fixed to build pkg_spec directly as "memtest86+=VERSION". Also add apt-get update retry if initial download fails. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 10:15:01 +03:00
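
Note on commit a8d5e019a5 in the table above: its message describes a power anomaly check that compares each steady-state sample against a 5-sample rolling baseline and flags a drop of at least 30% while GPU usage is above 50%. The sketch below shows one way such a check could look. It is written in C for consistency with the stress-tool code in this diff and is not the project's `detectPowerAnomaly`, which is not shown here. The function name, the use of a plain mean of the previous five samples as the baseline, and the per-GPU array inputs are all assumptions.

```c
#include <stddef.h>

/* Window size taken from the commit message ("5-sample rolling baseline"). */
#define SKETCH_BASELINE_WINDOW 5

/* Scans one GPU's steady-state samples and returns the index of the first
 * sample whose power draw is at least 30% below the mean of the previous
 * five samples while utilisation is above 50%. Returns -1 if no sample
 * qualifies. How the baseline is computed is an assumption of this sketch. */
long sketch_find_power_drop(const double *power_w,
                            const double *gpu_util_pct,
                            size_t n) {
    for (size_t i = SKETCH_BASELINE_WINDOW; i < n; i++) {
        double sum = 0.0;
        for (size_t j = i - SKETCH_BASELINE_WINDOW; j < i; j++) {
            sum += power_w[j];
        }
        double baseline = sum / SKETCH_BASELINE_WINDOW;
        if (baseline <= 0.0) {
            continue; /* no meaningful baseline to compare against */
        }
        double drop = (baseline - power_w[i]) / baseline;
        if (drop >= 0.30 && gpu_util_pct[i] > 50.0) {
            return (long)i;
        }
    }
    return -1;
}
```

A caller would run this once per GPU over that GPU's steady-state rows and treat a non-negative result as the hard-stop condition described in the commit message.
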