- Increase stability profile duration from 33 min to 90 min by wiring
  powerBenchDurationSec() into runBenchmarkPowerCalibration (its return
  value was previously computed but discarded)
- Collect per-step PSU slot readings, fan RPM/duty, and per-GPU telemetry
in ramp loop; add matching fields to NvidiaPowerBenchStep/NvidiaPowerBenchGPU
- Rewrite renderPowerBenchReport: replace the Per-Slot Results section with
  a Single GPU section, rework the Ramp Sequence table to rows=runs /
  cols=GPUs, add a PSU Performance section (conditional on IPMI data), and
  add a transposed Single vs All-GPU comparison table to the per-GPU sections
- Add fmtMDTable helper (benchmark_table.go) and apply it to all tables in
  both the power and performance reports so columns align in plain-text
  view (sketch below)
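
A minimal sketch of the column-alignment idea behind fmtMDTable, assuming a
rows-of-cells signature; the real helper likely also emits the header
separator row (imports: fmt, strings):

    func fmtMDTable(rows [][]string) string {
        if len(rows) == 0 {
            return ""
        }
        // Width of each column = widest cell seen in that column.
        widths := make([]int, len(rows[0]))
        for _, row := range rows {
            for i, cell := range row {
                if len(cell) > widths[i] {
                    widths[i] = len(cell)
                }
            }
        }
        var b strings.Builder
        for _, row := range rows {
            for i, cell := range row {
                fmt.Fprintf(&b, "| %-*s ", widths[i], cell)
            }
            b.WriteString("|\n")
        }
        return b.String()
    }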
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stability hardening (webui/app):
- readFileLimited(): guards against OOM when reading audit JSON (100 MB cap),
  the component-status DB (10 MB), and the task log (50 MB); see the sketch
  at the end of this list
- jobs.go: buffer the task log with one open fd per task instead of
  open/write/close on every line (eliminates thousands of syscalls/sec
  during GPU stress tests)
- stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s), reset
  after a successful run of >30s, restart counter logged via slog
- kill_workers.go: 5s timeout on the /proc scan, with a warn log when it
  triggers
- bee-web.service: MemoryMax=3G, so the OOM killer contains bee-web instead
  of taking down the host
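
A minimal sketch of the readFileLimited idea, assuming the cap is passed in;
the real helper may surface truncation differently (imports: fmt, io, os):

    func readFileLimited(path string, maxBytes int64) ([]byte, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()
        // Read one byte past the cap so truncation is detectable.
        data, err := io.ReadAll(io.LimitReader(f, maxBytes+1))
        if err != nil {
            return nil, err
        }
        if int64(len(data)) > maxBytes {
            return data[:maxBytes], fmt.Errorf("%s: truncated at %d bytes", path, maxBytes)
        }
        return data, nil
    }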
Build script:
- build.sh: remove the grub-pc/grub.cfg + live.cfg.in generation block,
  dead code since v8.25; grub-pc is ignored by live-build, and the generated
  live.cfg.in overwrote the correct static file with a stale version that
  lacked the kernel tuning parameters and the gsp-off/kms+gsp-off entries
- build.sh: dump_memtest_debug now logs grub-efi/grub.cfg instead of
  grub-pc/grub.cfg (which was always reported as "missing")
GRUB:
- live-theme/bee-logo.png: 400×400px bee logo on a black background
- live-theme/theme.txt: add an image component centered in the top third of
  the screen; menu shifted from 62% down to 65%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Snapshot IPMI "Power Supply" sensor states before and after each benchmark
run. Compare before/after to surface only *new* anomalies (pre-existing faults
are excluded). Results land in NvidiaBenchmarkResult.PSUIssues and
NvidiaPowerBenchResult.PSUIssues (JSON: psu_issues) and are printed in the
text benchmark report under a "PSU Issues" section.
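
A minimal sketch of the before/after diff; newPSUIssues is a hypothetical
name, and sensor states are assumed keyed by sensor name (imports: sort):

    // Only sensors that are non-OK after the run and were not already
    // in that exact state before it count as new anomalies.
    func newPSUIssues(before, after map[string]string) []string {
        var issues []string
        for sensor, state := range after {
            if state == "ok" {
                continue // healthy after the run
            }
            if before[sensor] == state {
                continue // pre-existing fault, excluded
            }
            issues = append(issues, sensor+": "+state)
        }
        sort.Strings(issues) // deterministic report order
        return issues
    }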
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add BenchmarkEstimated* constants to benchmark_types.go from _v8 logs
(Standard Perf ~16 min, Standard Power Fit ~43 min, Stability Perf ~92 min)
- Update benchmark profile dropdown to show Perf / Power Fit timing per profile
- Add timing columns to Method Split table (Standard vs Stability per run type)
- Update burn preset labels to show "N min/GPU (sequential) or N min (parallel)"
- Clarify burn "one by one" description with sequential vs parallel scaling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add SATEstimated* constants to sat.go derived from _v8 production logs,
with a rule to recalculate them whenever the script changes
- Extend validateInventory with NvidiaGPUCount to make estimates GPU-aware
- Update all validate card duration strings: CPU, memory, storage, NVIDIA GPU,
targeted stress/power, pulse test, NCCL, nvbandwidth
- Fix nvbandwidth description ("intended to stay short" → actual ~45 min)
- Top-level profile labels show the computed total, factoring in GPU count
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add PSUReading struct and PSUs []PSUReading to LiveMetricSample
- Sample per-PSU input watts from IPMI SDR entity 10.x (Power Supply)
- Render stacked filled-area SVG chart (one layer per PSU, cumulative total)
- Fall back to single-line chart on systems with ≤1 PSU in SDR
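
A minimal sketch of the stacking step, assuming samples[t][k] holds input
watts for PSU k at sample t; each SVG layer then plots one row of the result
(cumulativeLayers is a hypothetical name):

    // Layer k carries the cumulative sum of PSUs 0..k, so the topmost
    // layer traces the total server input power.
    func cumulativeLayers(samples [][]float64) [][]float64 {
        if len(samples) == 0 {
            return nil
        }
        layers := make([][]float64, len(samples[0]))
        for k := range layers {
            layers[k] = make([]float64, len(samples))
        }
        for t, psus := range samples {
            sum := 0.0
            for k, w := range psus {
                sum += w
                layers[k][t] = sum
            }
        }
        return layers
    }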
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- grub.cfg: add "load to RAM (toram)" entry to advanced submenu
- install_to_ram.go: resume from existing /dev/shm/bee-live copy if
source medium is unavailable after bee-web restart
- tasks.go: fix "Recovered after bee-web restart" shown on every run
(check j.lines before first append, not after)
- bee-install: retry unsquashfs up to 5x with wait-for-remount on
source loss; clear error message with bee-remount-medium hint
- bee-remount-medium: new script to find and remount live ISO source
after USB/CD reconnect; supports --wait polling mode
- 9000-bee-setup: chmod +x for bee-install and bee-remount-medium
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Sample IPMI loaded_w per single-card calibration and per ramp step
instead of averaging over the entire Phase 2; top-level ServerPower
uses the final (all-GPU) ramp step value
- Add ServerLoadedW/ServerDeltaW to NvidiaPowerBenchGPU and
NvidiaPowerBenchStep so external tooling can compare wall power per
phase without re-parsing logs
- Write gpu-metrics.csv/.html inside each single-XX/ and step-XX/
subdir; aggregate all phases into a top-level gpu-metrics.csv/.html
- Write 00-nvidia-smi-q.log at the start of every power run
- Add Telemetry (p95 temp/power/fan/clock) to NvidiaPowerBenchGPU in
result.json from the converged calibration attempt
- Power benchmark page: split "Achieved W" into Single-card W and
Multi-GPU W (StablePowerLimitW); derate highlight and status color
now reflect the final multi-GPU limit vs nominal
- Performance benchmark page: add Status column and per-GPU score
color coding (green/yellow/red) based on gpu.Status and OverallStatus
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- NvidiaPowerBenchResult gains ServerPower *BenchmarkServerPower
- RunNvidiaPowerBench samples IPMI idle before Phase 1 and loaded via
background goroutine throughout Phase 2 ramp
- renderPowerBenchReport: new "Server vs GPU Power Comparison" table
with ratio annotation (✓ match / ⚠ minor / ✗ over-report)
- renderPowerBenchSummary: server_idle_w, server_loaded_w, server_delta_w,
server_reporting_ratio keys
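
A hypothetical sketch of the ratio annotation, assuming ratio = GPU-reported
power sum / server wall delta; the actual cutoffs in renderPowerBenchReport
are not stated here and may differ:

    func ratioMark(ratio float64) string {
        switch {
        case ratio <= 1.05:
            return "✓ match" // GPUs report at or below wall power
        case ratio <= 1.25:
            return "⚠ minor" // slight over-report, within tolerance
        default:
            return "✗ over-report" // GPU telemetry not trustworthy
        }
    }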
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
detectSlowdownTempExceedance scans steady-state metric rows per GPU and
emits a [WARNING] note + PARTIAL status if any sample >= SlowdownTempC.
Uses per-GPU threshold from nvidia-smi -q, fallback 80°C.
Distinct from p95-based TempHeadroomC check: catches even a single spike
above the slowdown threshold that would be smoothed out in aggregates.
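
A minimal sketch of the per-GPU scan, with temperatures already extracted
from the steady-state rows (exceedsSlowdownTemp is a hypothetical name):

    func exceedsSlowdownTemp(tempsC []float64, slowdownC float64) bool {
        if slowdownC <= 0 {
            slowdownC = 80 // fallback when nvidia-smi -q gave no threshold
        }
        for _, t := range tempsC {
            if t >= slowdownC {
                return true // a single spike is enough
            }
        }
        return false
    }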
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse "GPU Shutdown Temp" and "GPU Slowdown Temp" from nvidia-smi -q verbose
output in enrichGPUInfoWithMaxClocks. Store as ShutdownTempC/SlowdownTempC
on benchmarkGPUInfo and BenchmarkGPUResult. Fallback: 90°C shutdown / 80°C
slowdown when not available.
TempHeadroomC = ShutdownTempC - P95TempC (per-GPU, not hardcoded 100°C).
Warning threshold: p95 >= SlowdownTempC. Critical: headroom < 10°C.
Report table shows both limits alongside headroom and p95 temp.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All report strings are now English only.
Add detectPowerAnomaly: scans steady-state metric rows per GPU with a
5-sample rolling baseline; flags a sudden drop ≥30% while GPU usage >50%
as [HARD STOP] — indicates bad cable contact or VRM fault.
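
A minimal sketch of the detection loop, assuming parallel slices of power
and utilisation samples per GPU (hasPowerDrop is a hypothetical name):

    func hasPowerDrop(powerW, utilPct []float64) bool {
        const window = 5 // rolling baseline length
        for i := window; i < len(powerW) && i < len(utilPct); i++ {
            var base float64
            for _, w := range powerW[i-window : i] {
                base += w
            }
            base /= window
            // Sudden drop ≥30% vs baseline while the GPU is >50% busy.
            if base > 0 && powerW[i] <= 0.7*base && utilPct[i] > 50 {
                return true
            }
        }
        return false
    }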
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CompositeScore = raw ComputeScore (TOPS). Throttling GPUs score lower
automatically — no quality multiplier distorting the compute signal.
Add ServerQualityScore (0-100): server infrastructure quality independent
of GPU model. Formula: 0.40×Stability + 0.30×PowerSustain + 0.30×Thermal.
Use to compare servers with the same GPU or flag bad server conditions.
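
The formula, transcribed as a Go sketch; the component scores are assumed to
share the 0-100 scale of the result:

    func serverQualityScore(stability, powerSustain, thermal float64) float64 {
        return 0.40*stability + 0.30*powerSustain + 0.30*thermal
    }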
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PowerSustainScore: power draw variance (CV) during load, not deviation from TDP.
ThermalSustainScore: temperature variance (CV) during load.
StabilityScore: fraction of time spent in thermal+power-cap throttling.
Remove NCCL bonus from quality_factor.
quality = 0.35 + 0.35×Stability + 0.15×PowerSustain + 0.15×ThermalSustain, cap 1.00.
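
A minimal sketch of the CV computation and the stated quality formula,
assuming component scores normalized to 0-1 (imports: math):

    // CV = stddev / mean over the load window; lower means steadier.
    func coefficientOfVariation(xs []float64) float64 {
        if len(xs) == 0 {
            return 0
        }
        var mean float64
        for _, x := range xs {
            mean += x
        }
        mean /= float64(len(xs))
        if mean == 0 {
            return 0
        }
        var ss float64
        for _, x := range xs {
            d := x - mean
            ss += d * d
        }
        return math.Sqrt(ss/float64(len(xs))) / mean
    }

    func qualityFactor(stability, powerSustain, thermalSustain float64) float64 {
        return math.Min(0.35+0.35*stability+0.15*powerSustain+0.15*thermalSustain, 1.00)
    }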
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove references to pre-benchmark power calibration and dcgmi
targeted_power. Document platform_power_score ramp-up methodology,
PowerSustainScore fallback to steady-state power, and full-budget
single-precision phases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power/Thermal Fit: cumulative fixed-limit ramp where each GPU's stable TDP
is found under real multi-GPU thermal load (all prior GPUs running at their
fixed limits). PlatformMaxTDPW = sum of stable limits across all GPUs.
Remove PlatformPowerScore from power test.
Performance Benchmark: remove pre-benchmark power calibration entirely.
After N single-card runs, execute k=2..N parallel ramp-up steps and compute
PlatformPowerScore = mean compute scalability vs best single-card TOPS.
PowerSustainScore falls back to Steady.AvgPowerW when calibration absent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Benchmark page now shows two result sections: Performance (scores) and
Power / Thermal Fit (slot table). After any benchmark task completes
the results section auto-refreshes via GET /api/benchmark/results
without a full page reload.
- Power results table shows each GPU slot with nominal TDP, achieved
stable power limit, and P95 observed power. Rows with derated cards
are highlighted amber so under-performing slots stand out at a glance.
Older runs are collapsed in a <details> summary.
- memtester is now wrapped with timeout(1) so a stuck memory controller
cannot cause Validate Memory to hang indefinitely. Wall-clock limit is
~2.5 min per 100 MB per pass plus a 2-minute buffer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 now calibrates each GPU individually (sequentially) so that
PowerRealizationPct reflects real degradation from neighbour thermals and
shared power rails. Previously the baseline came from an all-GPU-together
run, making realization always ≈100% at the final ramp step.
Ramp step 1 reuses single-card calibration results (no extra run); steps
2..N run targeted_power on the growing GPU subset with derating active.
Remove OccupiedSlots/OccupiedSlotsNote fields and occupiedSlots() helper —
they were compensation for the old all-GPU calibration approach.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.
Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run — no resource-busy contention
from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove telemetry-guided initial candidate; use strict binary search
midpoint at every step. Clean and predictable convergence in O(log N)
attempts within the allowed power range [minLimitW, startingLimitW].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power calibration previously stepped down 25 W at a time (linear),
requiring up to 6 attempts to find a stable limit within a 150 W range.
New strategy:
- Binary search between minLimitW (lo, assumed stable floor) and the
  starting/failed limit (hi, confirmed unstable), converging within a
  10 W tolerance in ~4 attempts (see the sketch after this list).
- For thermal throttle: the first-quarter telemetry rows estimate the
GPU's pre-throttle power draw. nextLimit = round5W(onset - 10 W) is
used as the initial candidate instead of the binary midpoint, landing
much closer to the true limit on the first step.
- On success: lo is updated and a higher level is tried (binary search
upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is
found rather than the first stable one.
- Let targeted_power run to natural completion on throttle (no mid-run
SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before
the next attempt.
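
A minimal sketch of the convergence loop; the real code drives dcgmi per
attempt, tracks per-GPU lo/hi state, and applies the thermal-onset initial
candidate described above:

    // lo = last known-stable limit, hi = last known-unstable limit.
    func calibrate(lo, hi float64, stableAt func(limitW float64) bool) float64 {
        const tolW = 10
        for hi-lo > tolW {
            mid := (lo + hi) / 2
            if stableAt(mid) {
                lo = mid // stable: push upward toward the true limit
            } else {
                hi = mid // unstable: tighten from above
            }
        }
        return lo // highest limit confirmed stable within tolerance
    }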
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During power calibration: if a thermal throttle (sw_thermal/hw_thermal)
causes ≥20% clock drop while server fans are below 98% P95 duty cycle,
record a CoolingWarning on the GPU result and emit an actionable finding
telling the operator to rerun with fans manually fixed at 100%.
During steady-state benchmark: same signal enriches the existing
thermal_limited finding with fan duty cycle and clock drift values.
Covers both the main benchmark (buildBenchmarkFindings) and the power
bench (NvidiaPowerBenchResult.Findings).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a targeted_power attempt is cancelled (e.g. after sw_thermal
throttle), nv-hostengine holds the diagnostic slot asynchronously.
The next attempt immediately received DCGM_ST_IN_USE (exit 222)
and incorrectly derated the power limit.
Now: exit 222 is detected via isDCGMResourceBusy and triggers an
exponential back-off retry at the same power limit (1s, 2s, 4s, …
up to 256s). Once the back-off delay would exceed 300s the
calibration fails, indicating the slot is persistently held.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
USB Export Drive:
lsblk reports TRAN only for whole disks, not partitions (/dev/sdc1).
Strip trailing partition digits to get parent disk before transport check.
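
A minimal sketch for sdX-style names as described; NVMe partitions
(nvme0n1p1) would additionally need the trailing "p" stripped (imports:
strings; parentDisk is a hypothetical name):

    func parentDisk(dev string) string {
        // /dev/sdc1 → /dev/sdc; lsblk then reports TRAN for the parent.
        return strings.TrimRight(dev, "0123456789")
    }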
LiveCD in RAM:
When RunInstallToRAM copies squashfs to /dev/shm/bee-live/ but bind-mount
of /run/live/medium fails (CD-ROM boots), /run/live/medium still shows the
CD-ROM fstype. Add fallback: if /dev/shm/bee-live/*.squashfs exists, the
data is in RAM — report status OK.
Dashboard Hardware Summary:
Show server Manufacturer + ProductName as heading and S/N as subline above
the component table, sourced from hw.Board (dmidecode system-type data).
Validate:
Remove Cycles input — always run once. cycles=1 hardcoded in runAllSAT().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
factor = ratio/0.75, applied per-GPU with a note explaining the reduction
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
shows scalability_score when non-zero
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
- Default Power Limit → DefaultPowerLimitW
- Current Power Limit → PowerLimitW (fallback)
- Multiprocessor Count → MultiprocessorCount
Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.
Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.
Adjust composite score weights to match the new honest power reference:
base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
erratic (stddev >12%), with a plain-English description in the report.
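
A minimal sketch of the classification using the thresholds above; the
precedence between "high" and "unstable" is assumed, and classifyCPULoad is
a hypothetical name:

    func classifyCPULoad(avgPct, maxPct, stddevPct float64) string {
        switch {
        case stddevPct > 12:
            return "unstable" // erratic host load
        case avgPct > 20 || maxPct > 40:
            return "high" // elevated background load
        default:
            return "ok"
        }
    }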
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire GPU inventory query
to fail and the clock lock to be skipped → normalization: partial.
Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
  and ErrorDescription when the current link speed is below the maximum
  supported speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison
  (sketch below)
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta
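
A hypothetical sketch of the rank helper; the exact speed-string format
depends on what the collector parses from sysfs/lspci:

    func pcieLinkSpeedRank(speed string) int {
        ranks := map[string]int{
            "2.5 GT/s": 1, // Gen1
            "5 GT/s":   2, // Gen2
            "8 GT/s":   3, // Gen3
            "16 GT/s":  4, // Gen4
            "32 GT/s":  5, // Gen5
        }
        return ranks[speed] // 0 = unknown; never triggers the warning
    }

applyPCIeLinkSpeedWarning would then flag a device when the current speed's
rank is non-zero and below the rank of the maximum supported speed.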
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- schema: add ToRAMStatus and USBExportPath fields to RuntimeHealth
- platform/runtime.go: collectToRAMHealth (ok/warning/failed based on
IsLiveMediaInRAM + toramActive) and collectUSBExportHealth (scans
/proc/mounts + lsblk for writable USB-backed filesystems)
- pages.go: add USB Export Drive and LiveCD in RAM rows to the health table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
→ 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
.tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>