- NvidiaPowerBenchResult gains ServerPower *BenchmarkServerPower
- RunNvidiaPowerBench samples IPMI server power at idle before Phase 1 and
under load via a background goroutine throughout the Phase 2 ramp
- renderPowerBenchReport: new "Server vs GPU Power Comparison" table
with ratio annotation (✓ match / ⚠ minor / ✗ over-report)
- renderPowerBenchSummary: server_idle_w, server_loaded_w, server_delta_w,
server_reporting_ratio keys
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
400×400px PNG centered via feh --bg-center --image-bg '#000000'.
Fallback solid fill also changed to black.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace single aggregated badge per hardware category with individual
colored chips (O/W/F/?) for each ComponentStatusRecord. Added helper
functions: matchedRecords, firstNonEmpty. CSS classes: chip-ok/warn/fail/unknown.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
detectSlowdownTempExceedance scans steady-state metric rows per GPU and
emits a [WARNING] note + PARTIAL status if any sample >= SlowdownTempC.
Uses per-GPU threshold from nvidia-smi -q, fallback 80°C.
Distinct from p95-based TempHeadroomC check: catches even a single spike
above the slowdown threshold that would be smoothed out in aggregates.
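
A minimal sketch of the per-sample check described above, not the actual detectSlowdownTempExceedance code. The function name exceedsSlowdown and the convention that a threshold <= 0 means "not reported by nvidia-smi" are assumptions made for illustration.

```go
package main

import "fmt"

// defaultSlowdownTempC is the fallback named in the commit message.
const defaultSlowdownTempC = 80.0

// exceedsSlowdown reports whether any steady-state temperature sample reaches
// the slowdown threshold, and returns the peak sample seen.
// Assumption: slowdownTempC <= 0 means the per-GPU value was unavailable.
func exceedsSlowdown(tempsC []float64, slowdownTempC float64) (bool, float64) {
	if slowdownTempC <= 0 {
		slowdownTempC = defaultSlowdownTempC
	}
	exceeded := false
	peak := 0.0
	for _, t := range tempsC {
		if t > peak {
			peak = t
		}
		if t >= slowdownTempC {
			exceeded = true
		}
	}
	return exceeded, peak
}

func main() {
	// One spike at 86°C would barely move a p95 aggregate over a long run,
	// but it is caught here.
	samples := []float64{71, 72, 73, 86, 72, 71}
	hit, peak := exceedsSlowdown(samples, 85)
	fmt.Printf("exceeded=%v peak=%.0f\n", hit, peak)
}
```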
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse "GPU Shutdown Temp" and "GPU Slowdown Temp" from nvidia-smi -q verbose
output in enrichGPUInfoWithMaxClocks. Store as ShutdownTempC/SlowdownTempC
on benchmarkGPUInfo and BenchmarkGPUResult. Fallback: 90°C shutdown / 80°C
slowdown when not available.
TempHeadroomC = ShutdownTempC - P95TempC (per-GPU, not hardcoded 100°C).
Warning threshold: p95 >= SlowdownTempC. Critical: headroom < 10°C.
Report table shows both limits alongside headroom and p95 temp.
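
The headroom rule above can be sketched as follows. The type thermalLimits, the function classifyThermal, the level strings, and the choice to let "critical" take precedence over "warning" are assumptions for illustration; only the formula, thresholds, and fallbacks come from this message.

```go
package main

import "fmt"

// thermalLimits holds the per-GPU limits parsed from nvidia-smi -q.
// Assumption: a zero value means the limit was not available.
type thermalLimits struct {
	ShutdownTempC float64
	SlowdownTempC float64
}

// classifyThermal returns TempHeadroomC and a severity level.
func classifyThermal(p95TempC float64, l thermalLimits) (float64, string) {
	if l.ShutdownTempC <= 0 {
		l.ShutdownTempC = 90 // fallback from the commit message
	}
	if l.SlowdownTempC <= 0 {
		l.SlowdownTempC = 80 // fallback from the commit message
	}
	headroomC := l.ShutdownTempC - p95TempC
	switch {
	case headroomC < 10:
		return headroomC, "critical"
	case p95TempC >= l.SlowdownTempC:
		return headroomC, "warning"
	}
	return headroomC, "ok"
}

func main() {
	h, level := classifyThermal(84, thermalLimits{ShutdownTempC: 95, SlowdownTempC: 87})
	fmt.Printf("headroom=%.0f°C level=%s\n", h, level)
}
```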
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All report strings are now English only.
Add detectPowerAnomaly: scans steady-state metric rows per GPU with a
5-sample rolling baseline; flags a sudden drop ≥30% while GPU usage >50%
as [HARD STOP] — indicates bad cable contact or VRM fault.
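
A rough sketch of the rolling-baseline check, under stated assumptions: the baseline is taken as the mean of the 5 preceding samples, and the sample type and findPowerDrop name are hypothetical. The real detectPowerAnomaly may compute its baseline differently.

```go
package main

import "fmt"

// sample is one steady-state metric row for a single GPU (hypothetical type).
type sample struct {
	PowerW  float64
	UtilPct float64
}

// findPowerDrop returns the index of the first sample whose power is at least
// 30% below the mean of the 5 preceding samples while GPU usage is above 50%.
// It returns -1 when no such sample exists.
func findPowerDrop(rows []sample) int {
	const window = 5
	for i := window; i < len(rows); i++ {
		sum := 0.0
		for _, r := range rows[i-window : i] {
			sum += r.PowerW
		}
		baseline := sum / window
		if baseline <= 0 {
			continue
		}
		drop := (baseline - rows[i].PowerW) / baseline
		if drop >= 0.30 && rows[i].UtilPct > 50 {
			return i
		}
	}
	return -1
}

func main() {
	rows := []sample{
		{300, 97}, {302, 98}, {299, 97}, {301, 98}, {300, 97},
		{190, 96}, // sudden drop while the GPU is still busy
	}
	fmt.Println("first anomalous row:", findPowerDrop(rows))
}
```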
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CompositeScore = raw ComputeScore (TOPS). Throttling GPUs score lower
automatically — no quality multiplier distorting the compute signal.
Add ServerQualityScore (0-100): server infrastructure quality independent
of GPU model. Formula: 0.40×Stability + 0.30×PowerSustain + 0.30×Thermal.
Use to compare servers with the same GPU or flag bad server conditions.
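
The weighting can be illustrated with a short sketch. It assumes each sub-score is normalised to the range 0..1 and that the result is scaled to 0-100; the function names are hypothetical.

```go
package main

import "fmt"

// clamp01 keeps a sub-score inside [0, 1].
func clamp01(x float64) float64 {
	if x < 0 {
		return 0
	}
	if x > 1 {
		return 1
	}
	return x
}

// serverQualityScore applies the weights from the commit message.
// Assumption: inputs are fractions in [0, 1]; the output is on a 0-100 scale.
func serverQualityScore(stability, powerSustain, thermal float64) float64 {
	q := 0.40*clamp01(stability) + 0.30*clamp01(powerSustain) + 0.30*clamp01(thermal)
	return 100 * q
}

func main() {
	// Same GPU model in two chassis: the second one throttles more often.
	fmt.Printf("server A: %.1f\n", serverQualityScore(0.98, 0.95, 0.90))
	fmt.Printf("server B: %.1f\n", serverQualityScore(0.60, 0.95, 0.90))
}
```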
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PowerSustainScore: power draw variance (CV) during load, not deviation from TDP.
ThermalSustainScore: temperature variance (CV) during load.
StabilityScore: fraction of time spent in thermal+power-cap throttling.
Remove NCCL bonus from quality_factor.
quality = 0.35 + 0.35×Stability + 0.15×PowerSustain + 0.15×ThermalSustain, cap 1.00.
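
For reference, a small sketch of the two building blocks named above: the coefficient of variation (standard deviation divided by mean) and the capped quality formula. How a CV value is mapped to a sustain score is not stated in this message, so the sketch takes the three sub-scores as inputs. The population standard deviation is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// coefficientOfVariation returns stddev/mean using the population standard
// deviation (an assumption; the source does not say which variant is used).
func coefficientOfVariation(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	mean := 0.0
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	if mean == 0 {
		return 0
	}
	ss := 0.0
	for _, x := range xs {
		ss += (x - mean) * (x - mean)
	}
	return math.Sqrt(ss/float64(len(xs))) / math.Abs(mean)
}

// qualityFactor applies the formula from the commit message, capped at 1.00.
func qualityFactor(stability, powerSustain, thermalSustain float64) float64 {
	return math.Min(1.0, 0.35+0.35*stability+0.15*powerSustain+0.15*thermalSustain)
}

func main() {
	powerW := []float64{295, 300, 305, 298, 302}
	fmt.Printf("power CV: %.4f\n", coefficientOfVariation(powerW))
	fmt.Printf("quality:  %.3f\n", qualityFactor(0.9, 0.8, 0.7))
}
```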
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove references to pre-benchmark power calibration and dcgmi
targeted_power. Document platform_power_score ramp-up methodology,
PowerSustainScore fallback to steady-state power, and full-budget
single-precision phases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power/Thermal Fit: cumulative fixed-limit ramp where each GPU's stable TDP
is found under real multi-GPU thermal load (all prior GPUs running at their
fixed limits). PlatformMaxTDPW = sum of stable limits across all GPUs.
Remove PlatformPowerScore from power test.
Performance Benchmark: remove pre-benchmark power calibration entirely.
After N single-card runs, execute k=2..N parallel ramp-up steps and compute
PlatformPowerScore = mean compute scalability vs best single-card TOPS.
PowerSustainScore falls back to Steady.AvgPowerW when calibration absent.
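
One possible reading of "mean compute scalability vs best single-card TOPS", sketched under assumptions: for each ramp-up step with k GPUs, scalability is taken as the aggregate TOPS divided by k times the best single-card TOPS, and the score is the mean over steps k=2..N. The exact definition used by the benchmark is not given in this message, and the function name is hypothetical.

```go
package main

import "fmt"

// platformPowerScore averages per-step scalability over the parallel ramp.
// rampTOPS[i] is the aggregate TOPS measured with k = i+2 GPUs running.
// Assumption: scalability_k = rampTOPS_k / (k * bestSingleTOPS).
func platformPowerScore(bestSingleTOPS float64, rampTOPS []float64) float64 {
	if bestSingleTOPS <= 0 || len(rampTOPS) == 0 {
		return 0
	}
	sum := 0.0
	for i, total := range rampTOPS {
		k := float64(i + 2)
		sum += total / (k * bestSingleTOPS)
	}
	return sum / float64(len(rampTOPS))
}

func main() {
	// Best single card: 100 TOPS. Steps with 2, 3 and 4 GPUs in parallel.
	score := platformPowerScore(100, []float64{190, 270, 340})
	fmt.Printf("PlatformPowerScore (sketch): %.3f\n", score)
}
```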
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents stale debootstrap cache from bypassing --debootstrap-options
changes (e.g. --include=ca-certificates added in v8.15).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--bootstrap-packages is not a valid lb config option (20230502).
(20230502 is taken here to be the live-build version in use.)
Use --debootstrap-options "--include=ca-certificates" instead to ensure
ca-certificates is present when lb chroot_archives runs apt-get update
against the NVIDIA CUDA HTTPS source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
debootstrap creates a minimal chroot without ca-certificates, causing
apt-get update to fail TLS verification for the NVIDIA CUDA apt source:
"No system certificates available. Try installing ca-certificates."
Add ca-certificates to --bootstrap-packages so it is present before
lb chroot_archives configures the NVIDIA HTTPS source and runs apt-get update.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The NVIDIA CUDA HTTPS apt source (developer.download.nvidia.com) may be
unreachable from inside the live-build container chroot, causing
'E: Unable to locate package datacenter-gpu-manager-4-cuda13'.
Add build-dcgm.sh that downloads DCGM and nvidia-fabricmanager .deb
packages on the build host (verifying SHA256 against Packages.gz) and
caches them in BEE_CACHE_DIR. build.sh (step 25-dcgm, nvidia only)
copies them into LB_DIR/config/packages.chroot/ before lb build, so
live-build creates a local apt repo from them. The chroot installs the
packages from the local repo without ever contacting the NVIDIA CUDA
HTTPS source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
modernc.org/sqlite v1.48.0 requires modernc.org/libc/sys/types which is
absent in v1.70.0 but present in v1.72.0.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RunNvidiaPowerBench already performs a full internal ramp from 1 to N
GPUs in Phase 2. Spawning N tasks with growing GPU subsets meant task K
repeated all steps 1..K-1 already done by tasks 1..K-1 — O(N²) work
instead of O(N). Replace with a single task using all selected GPUs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lb binary_grub-efi and lb binary_syslinux create these files from templates
that already have memtest entries hardcoded. The hook should not fail when
the files don't exist yet — validate_iso_memtest() checks the final ISO.
Only the binary files (x64.bin, x64.efi) are required here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ver_arg was set to "=memtest86+=VERSION" making the command
"apt-get download memtest86+=memtest86+=VERSION" (invalid).
Fixed to build pkg_spec directly as "memtest86+=VERSION".
Also add apt-get update retry if initial download fails.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Disabling --security broke the build because linux-image-6.1.0-44-amd64
is a security update not present in the base bookworm repo.
Main packages already come from mirror.mephi.ru.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Switch all lb mirrors to mirror.mephi.ru/debian/ for faster, more reliable downloads
- Disable security repo (--security false) — not needed for LiveCD
- Pin MEMTEST_VERSION=6.10-4 in VERSIONS, export to hook environment
- Set BEE_REQUIRE_MEMTEST=1 in build-in-container.sh — missing memtest is now fatal
- Fix 9100-memtest.hook.binary: add apt-get download fallback when lb
binary_memtest has already purged the package cache; handle both 5.x
(memtest86+x64.bin) and 6.x (memtest86+.bin) BIOS binary naming
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mkdir -p LOG_DIR before writing the optional step log so that a race
with cleanup_build_log (EXIT trap archiving the log dir) does not cause
a "Directory nonexistent" error during lb binary_checksums / lb binary_iso.
Also downgrade apt-get update failure to a warning so a transient mirror
outage does not block kernel ABI auto-detection when the apt cache is warm.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Benchmark page now shows two result sections: Performance (scores) and
Power / Thermal Fit (slot table). After any benchmark task completes
the results section auto-refreshes via GET /api/benchmark/results
without a full page reload.
- Power results table shows each GPU slot with nominal TDP, achieved
stable power limit, and P95 observed power. Rows with derated cards
are highlighted amber so under-performing slots stand out at a glance.
Older runs are collapsed in a <details> summary.
- memtester is now wrapped with timeout(1) so a stuck memory controller
cannot cause Validate Memory to hang indefinitely. Wall-clock limit is
~2.5 min per 100 MB per pass plus a 2-minute buffer.
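
The wall-clock limit can be worked through with a short sketch. The function name memtesterTimeout and the rounding of the size up to whole 100 MB chunks are assumptions; the per-chunk allowance and the buffer come from the text above.

```go
package main

import (
	"fmt"
	"time"
)

// memtesterTimeout allows 2.5 minutes per 100 MB per pass plus a 2-minute
// buffer. Assumption: partial chunks are rounded up to a full 100 MB.
func memtesterTimeout(sizeMB, passes int) time.Duration {
	chunks := (sizeMB + 99) / 100
	return time.Duration(chunks*passes)*150*time.Second + 2*time.Minute
}

func main() {
	sizeMB, passes := 1024, 1
	limit := memtesterTimeout(sizeMB, passes)
	// Illustrative command line only; the real wrapper may pass other options.
	fmt.Printf("timeout %d memtester %dM %d\n", int(limit.Seconds()), sizeMB, passes)
}
```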
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 now calibrates each GPU individually (sequentially) so that
PowerRealizationPct reflects real degradation from neighbour thermals and
shared power rails. Previously the baseline came from an all-GPU-together
run, making realization always ≈100% at the final ramp step.
Ramp step 1 reuses single-card calibration results (no extra run); steps
2..N run targeted_power on the growing GPU subset with derating active.
Remove OccupiedSlots/OccupiedSlotsNote fields and occupiedSlots() helper —
they were compensation for the old all-GPU calibration approach.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.
Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run — no resource-busy contention
from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove telemetry-guided initial candidate; use strict binary search
midpoint at every step. Clean and predictable convergence in O(log N)
attempts within the allowed power range [minLimitW, startingLimitW].
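
A minimal sketch of a strict-midpoint search over [minLimitW, startingLimitW]. The stable callback stands in for one calibration attempt, and the 10 W tolerance is borrowed from the earlier binary-search commit; both are assumptions of this example rather than the actual calibration code.

```go
package main

import "fmt"

// findStableLimit narrows [loW, hiW] by always probing the midpoint.
// loW is assumed stable and hiW is assumed unstable. It returns the highest
// limit confirmed stable and the number of attempts used.
func findStableLimit(loW, hiW, toleranceW int, stable func(limitW int) bool) (int, int) {
	if toleranceW < 1 {
		toleranceW = 1 // guarantees the integer midpoint makes progress
	}
	attempts := 0
	for hiW-loW > toleranceW {
		mid := (loW + hiW) / 2
		attempts++
		if stable(mid) {
			loW = mid
		} else {
			hiW = mid
		}
	}
	return loW, attempts
}

func main() {
	// Hypothetical card that throttles above 300 W.
	trueLimitW := 300
	limit, attempts := findStableLimit(200, 350, 10, func(w int) bool { return w <= trueLimitW })
	fmt.Printf("stable limit %d W after %d attempts\n", limit, attempts)
}
```

With a 150 W range and a 10 W tolerance the search above finishes in 4 attempts, probing 275, 312, 293 and 302 W.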
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power calibration previously stepped down 25 W at a time (linear),
requiring up to 6 attempts to find a stable limit within a 150 W range.
New strategy:
- Binary search between minLimitW (lo, assumed stable floor) and the
starting/failed limit (hi, confirmed unstable), converging within a
10 W tolerance in ~4 attempts.
- For thermal throttle: the first-quarter telemetry rows estimate the
GPU's pre-throttle power draw. nextLimit = round5W(onset - 10 W) is
used as the initial candidate instead of the binary midpoint, landing
much closer to the true limit on the first step.
- On success: lo is updated and a higher level is tried (binary search
upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is
found rather than the first stable one.
- Let targeted_power run to natural completion on throttle (no mid-run
SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before
the next attempt.
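
The telemetry-guided first candidate from the list above can be sketched as follows. The commit names round5W but does not show it, so the rounding to the nearest 5 W is an assumption, as are the use of the mean of the first-quarter rows as the onset estimate and the thermalCandidate name.

```go
package main

import (
	"fmt"
	"math"
)

// round5W rounds a power value to the nearest multiple of 5 W (assumed behaviour).
func round5W(w float64) int {
	return int(math.Round(w/5)) * 5
}

// thermalCandidate estimates pre-throttle power from the first quarter of the
// telemetry rows and proposes round5W(onset - 10 W) as the next limit.
// It returns false when the candidate falls outside the open interval
// (loW, hiW), in which case a caller would use the binary midpoint instead.
func thermalCandidate(powerW []float64, loW, hiW int) (int, bool) {
	n := len(powerW) / 4
	if n == 0 {
		return 0, false
	}
	sum := 0.0
	for _, p := range powerW[:n] {
		sum += p
	}
	onset := sum / float64(n) // assumption: mean of the first-quarter rows
	c := round5W(onset - 10)
	if c <= loW || c >= hiW {
		return 0, false
	}
	return c, true
}

func main() {
	rows := []float64{318, 322, 310, 290, 270, 265, 262, 260}
	if c, ok := thermalCandidate(rows, 200, 350); ok {
		fmt.Printf("first candidate: %d W\n", c)
	}
}
```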
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During power calibration: if a thermal throttle (sw_thermal/hw_thermal)
causes ≥20% clock drop while server fans are below 98% P95 duty cycle,
record a CoolingWarning on the GPU result and emit an actionable finding
telling the operator to rerun with fans manually fixed at 100%.
During steady-state benchmark: same signal enriches the existing
thermal_limited finding with fan duty cycle and clock drift values.
Covers both the main benchmark (buildBenchmarkFindings) and the power
bench (NvidiaPowerBenchResult.Findings).
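
The trigger condition reduces to three checks, sketched below. The function name and parameter shapes are hypothetical; the throttle reasons and thresholds are the ones stated above.

```go
package main

import "fmt"

// coolingWarning reports whether a thermal throttle happened while the fans
// still had headroom, which suggests the fan curve rather than the GPU is
// the limiting factor.
func coolingWarning(throttleReason string, clockDropPct, fanP95DutyPct float64) bool {
	thermal := throttleReason == "sw_thermal" || throttleReason == "hw_thermal"
	return thermal && clockDropPct >= 20 && fanP95DutyPct < 98
}

func main() {
	if coolingWarning("sw_thermal", 27.5, 71) {
		fmt.Println("CoolingWarning: rerun with fans manually fixed at 100%")
	}
}
```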
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a targeted_power attempt is cancelled (e.g. after sw_thermal
throttle), nv-hostengine holds the diagnostic slot asynchronously.
The next attempt immediately received DCGM_ST_IN_USE (exit 222)
and incorrectly derated the power limit.
Now: exit 222 is detected via isDCGMResourceBusy and triggers an
exponential back-off retry at the same power limit (1s, 2s, 4s, …
up to 256s). Once the back-off delay would exceed 300s the
calibration fails, indicating the slot is persistently held.
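
A sketch of the back-off loop, assuming a callback that reports whether the attempt hit the resource-busy condition (the real code detects exit 222 via isDCGMResourceBusy). The names retryWhileBusy, errSlotHeld and noSleep are hypothetical, and the sleep function is injected so the example runs instantly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errSlotHeld = errors.New("DCGM diagnostic slot persistently held")

const maxBackoff = 300 * time.Second

// retryWhileBusy repeats attempt at the same power limit while it reports
// "busy", sleeping 1s, 2s, 4s, ... between tries. It gives up once the next
// delay would exceed maxBackoff, and returns the number of attempts made.
func retryWhileBusy(attempt func() bool, sleep func(time.Duration)) (int, error) {
	tries := 0
	for delay := time.Second; ; delay *= 2 {
		tries++
		if !attempt() {
			return tries, nil
		}
		if delay > maxBackoff {
			return tries, errSlotHeld
		}
		sleep(delay)
	}
}

// noSleep is a stand-in for time.Sleep used only by this example.
func noSleep(time.Duration) {}

func main() {
	calls := 0
	// The slot is released after two busy responses.
	tries, err := retryWhileBusy(func() bool { calls++; return calls <= 2 }, noSleep)
	fmt.Println(tries, err)
}
```

With doubling from 1s, the last delay actually slept is 256s; the following delay would be 512s, which is over the 300s ceiling, so the loop stops there.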
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>