- benchmark.go: retain sdrLastStep from the final ramp step instead of
re-sampling after the test, when the GPUs are already idle
- scripts/deploy.sh: build+deploy bee binary to remote host over SSH
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DCMI reports only the managed power domain (~CPU+MB), missing GPU draw.
PSU AC input sensors cover full wall power. When samplePSUPower returns
data, sum the slots for PowerW; fall back to DCMI otherwise.
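The selection rule can be sketched as follows (a minimal sketch; `sumOrFallback` and its slice-of-slot-watts signature are illustrative, not the real samplePSUPower API):

```go
package main

import "fmt"

// sumOrFallback prefers the summed per-slot PSU AC-input readings and falls
// back to DCMI only when no PSU data is available.
func sumOrFallback(psuSlotW []float64, dcmiW float64) float64 {
	var sum float64
	for _, w := range psuSlotW {
		sum += w
	}
	if sum > 0 {
		return sum // PSU AC-input sensors cover full wall power
	}
	return dcmiW // DCMI covers only the managed domain (~CPU+MB)
}

func main() {
	fmt.Println(sumOrFallback([]float64{812, 790, 805, 798}, 1600)) // PSU data wins: 3205
	fmt.Println(sumOrFallback(nil, 1600))                           // falls back to DCMI: 1600
}
```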
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
\b does not fire between a digit and '_' because '_' is \w in RE2.
The pattern \bpsu?\s*([0-9]+)\b never matched PSU1_POWER_IN style
sensors, so parsePSUSDR (and PSUSlotsFromSDR / samplePSUPower) returned
empty results for MSI servers — causing all power graphs to fall back
to DCMI which reports ~half actual draw.
Added an explicit underscore-terminated pattern, tried first in the list,
plus tests covering the MSI format.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same fix as ramp steps: take sdrSingle snapshot after calibration
and prefer PSUInW over DCMI for singleIPMILoadedW. DCMI kept as
fallback. Log message indicates source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When sdrStep.PSUInW is available, prefer it over DCMI for
ramp.ServerLoadedW and ServerDeltaW. DCMI on this platform (MSI 4-PSU)
reports ~half actual draw; SDR sums all PSU_POWER_IN sensors correctly.
Delta is now SDR-to-SDR (sdrStep.PSUInW - sdrIdle.PSUInW) for
consistency. DCMI path kept as fallback when SDR has no PSU data.
Log message now indicates the source (SDR PSU AC input vs DCMI).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MSI servers place PSU_POWER_IN/OUT sensors on entity 3.0, not 10.N
(the IPMI "Power Supply" entity). The old parser filtered by entity ID
and found nothing, so the dashboard fell back to DCMI which reports
roughly half the actual draw.
Now delegates to collector.PSUSlotsFromSDR — the same name-based
matching already used in the Power Fit benchmark.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A single dcgmproftester process without -i only loads GPU 0 regardless of
CUDA_VISIBLE_DEVICES. Now always routes multi-GPU runs through
bee-dcgmproftester-staggered (--stagger-seconds 0 for parallel mode),
which spawns one process per GPU so all GPUs are loaded simultaneously.
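A sketch of the per-GPU fan-out; the dcgmproftester flags and helper name are illustrative, and only the one-CUDA_VISIBLE_DEVICES-per-process idea is the point:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildPerGPUCmds builds (but does not run) one dcgmproftester command per
// GPU, each pinned to a single device via CUDA_VISIBLE_DEVICES, so every
// GPU gets loaded. A lone process without -i would only ever load GPU 0.
func buildPerGPUCmds(gpus []int) []*exec.Cmd {
	var cmds []*exec.Cmd
	for _, g := range gpus {
		cmd := exec.Command("dcgmproftester12", "-t", "1004", "-d", "120")
		cmd.Env = append(os.Environ(), fmt.Sprintf("CUDA_VISIBLE_DEVICES=%d", g))
		cmds = append(cmds, cmd)
	}
	return cmds
}

func main() {
	for _, c := range buildPerGPUCmds([]int{0, 1, 2}) {
		fmt.Println(c.Env[len(c.Env)-1]) // the per-process pin
	}
}
```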
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Increase stability profile duration from 33 min to 90 min by wiring
powerBenchDurationSec() into runBenchmarkPowerCalibration (was discarded)
- Collect per-step PSU slot readings, fan RPM/duty, and per-GPU telemetry
in ramp loop; add matching fields to NvidiaPowerBenchStep/NvidiaPowerBenchGPU
- Rewrite renderPowerBenchReport: replace Per-Slot Results with a Single GPU
section, rework the Ramp Sequence table to rows=runs / cols=GPUs, add a PSU
Performance section (conditional on IPMI data), and add a transposed Single
vs All-GPU comparison table in the per-GPU sections
- Add fmtMDTable helper (benchmark_table.go) and apply to all tables in
both power and performance reports so columns align in plain-text view
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stability hardening (webui/app):
- readFileLimited(): OOM protection when reading audit JSON (100 MB),
the component-status DB (10 MB), and the task log (50 MB)
- jobs.go: buffered task log, one open fd per task instead of
open/write/close for every line (eliminates thousands of syscalls/sec
during GPU stress tests)
- stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s),
reset after a successful run of >30s, restart counter in slog
- kill_workers.go: 5s timeout on the /proc scan, warn when it fires
- bee-web.service: MemoryMax=3G to guard against the OOM killer
Build script:
- build.sh: removed the grub-pc/grub.cfg + live.cfg.in generation block,
dead code since v8.25; grub-pc is ignored by live-build, and the generated
live.cfg.in overwrote the correct static file with an outdated version
lacking the kernel tuning parameters and the gsp-off/kms+gsp-off entries
- build.sh: dump_memtest_debug now logs grub-efi/grub.cfg instead of
grub-pc/grub.cfg (which was always "missing")
GRUB:
- live-theme/bee-logo.png: 400×400px bee logo on a black background
- live-theme/theme.txt: added an image component centered in the top third
of the screen; menu shifted from 62% to 65%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
As metrics.db grew (1 sample/5 s × hours), handleMetricsChartSVG called
LoadAll() on every chart request — loading all rows across 4 tables through a
single SQLite connection. With ~10 charts auto-refreshing in parallel, requests
queued behind each other, saturating the connection pool and pegging a CPU core.
Fix: add a background compactor that runs every hour via the metrics collector:
• Downsample: rows older than 2 h are thinned to 1 per minute (keep MIN(ts)
per ts/60 bucket) — retains chart shape while cutting row count by ~92 %.
• Prune: rows older than 48 h are deleted entirely.
• After prune: WAL checkpoint/truncate to release disk space.
LoadAll() in handleMetricsChartSVG is unchanged — it now stays fast because
the DB is kept small rather than capping the query window.
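The downsampling rule can be sketched in-memory (the real implementation runs as SQL over the metrics tables; this mirrors only the keep-MIN(ts)-per-60s-bucket logic):

```go
package main

import "fmt"

// downsample keeps the earliest timestamp per 60-second bucket, assuming
// the input is sorted ascending, mirroring "keep MIN(ts) per ts/60 bucket".
func downsample(ts []int64) []int64 {
	var kept []int64
	seen := map[int64]bool{}
	for _, t := range ts {
		bucket := t / 60
		if !seen[bucket] {
			seen[bucket] = true
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	var ts []int64
	for t := int64(0); t < 180; t += 5 { // 1 sample / 5 s over 3 minutes
		ts = append(ts, t)
	}
	fmt.Println(len(ts), "->", len(downsample(ts))) // 36 -> 3
}
```

Going from one row per 5 s to one per 60 s keeps 1/12 of the rows, which is where the ~92 % reduction comes from.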
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Snapshot IPMI "Power Supply" sensor states before and after each benchmark
run. Compare before/after to surface only *new* anomalies (pre-existing faults
are excluded). Results land in NvidiaBenchmarkResult.PSUIssues and
NvidiaPowerBenchResult.PSUIssues (JSON: psu_issues) and are printed in the
text benchmark report under a "PSU Issues" section.
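The before/after comparison can be sketched as a set difference (the sensor-name-to-state map shape and helper name are assumptions):

```go
package main

import "fmt"

// newPSUIssues returns only anomalies that appeared during the run,
// excluding faults that were already present in the pre-run snapshot.
func newPSUIssues(before, after map[string]string) []string {
	var issues []string
	for sensor, state := range after {
		if state == "ok" {
			continue
		}
		if before[sensor] == state {
			continue // pre-existing fault: not caused by this run
		}
		issues = append(issues, sensor+": "+state)
	}
	return issues
}

func main() {
	before := map[string]string{"PSU1": "ok", "PSU2": "Failure detected"}
	after := map[string]string{"PSU1": "Power Supply AC lost", "PSU2": "Failure detected"}
	fmt.Println(newPSUIssues(before, after)) // only the new PSU1 fault surfaces
}
```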
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TestHandleAPIBenchmarkPowerFitRampQueuesBenchmarkPowerFitTasks: ramp-up
mode intentionally creates a single task (the runner handles 1→N internally
to avoid repeating earlier ramp steps). Updated the test to
expect 1 task and verify RampTotal=3 instead of asserting 3 separate tasks.
- TestBenchmarkPageRendersSavedResultsTable: benchmark page used "Performance
Results" as heading while the test looked for "Perf Results". Aligned the
page heading with the shorter label used everywhere else (task reports, etc.).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add BenchmarkEstimated* constants to benchmark_types.go from _v8 logs
(Standard Perf ~16 min, Standard Power Fit ~43 min, Stability Perf ~92 min)
- Update benchmark profile dropdown to show Perf / Power Fit timing per profile
- Add timing columns to Method Split table (Standard vs Stability per run type)
- Update burn preset labels to show "N min/GPU (sequential) or N min (parallel)"
- Clarify burn "one by one" description with sequential vs parallel scaling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add SATEstimated* constants to sat.go derived from _v8 production logs,
with a rule to recalculate them whenever the script changes
- Extend validateInventory with NvidiaGPUCount to make estimates GPU-aware
- Update all validate card duration strings: CPU, memory, storage, NVIDIA GPU,
targeted stress/power, pulse test, NCCL, nvbandwidth
- Fix nvbandwidth description ("intended to stay short" → actual ~45 min)
- Top-level profile labels show computed total including GPU count
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add PSUReading struct and PSUs []PSUReading to LiveMetricSample
- Sample per-PSU input watts from IPMI SDR entity 10.x (Power Supply)
- Render stacked filled-area SVG chart (one layer per PSU, cumulative total)
- Fall back to single-line chart on systems with ≤1 PSU in SDR
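The stacking math behind the filled-area layers can be sketched as (function name illustrative):

```go
package main

import "fmt"

// cumulative turns per-PSU watts into stacked layer heights: layer k's top
// edge is the sum of PSUs 0..k, so the topmost edge is total wall power.
func cumulative(perPSUW []float64) []float64 {
	out := make([]float64, len(perPSUW))
	var run float64
	for i, w := range perPSUW {
		run += w
		out[i] = run
	}
	return out
}

func main() {
	fmt.Println(cumulative([]float64{100, 200, 150})) // [100 300 450]
}
```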
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- grub.cfg: add "load to RAM (toram)" entry to advanced submenu
- install_to_ram.go: resume from existing /dev/shm/bee-live copy if
source medium is unavailable after bee-web restart
- tasks.go: fix "Recovered after bee-web restart" shown on every run
(check j.lines before first append, not after)
- bee-install: retry unsquashfs up to 5x with wait-for-remount on
source loss; clear error message with bee-remount-medium hint
- bee-remount-medium: new script to find and remount live ISO source
after USB/CD reconnect; supports --wait polling mode
- 9000-bee-setup: chmod +x for bee-install and bee-remount-medium
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Sample IPMI loaded_w per single-card calibration and per ramp step
instead of averaging over the entire Phase 2; top-level ServerPower
uses the final (all-GPU) ramp step value
- Add ServerLoadedW/ServerDeltaW to NvidiaPowerBenchGPU and
NvidiaPowerBenchStep so external tooling can compare wall power per
phase without re-parsing logs
- Write gpu-metrics.csv/.html inside each single-XX/ and step-XX/
subdir; aggregate all phases into a top-level gpu-metrics.csv/.html
- Write 00-nvidia-smi-q.log at the start of every power run
- Add Telemetry (p95 temp/power/fan/clock) to NvidiaPowerBenchGPU in
result.json from the converged calibration attempt
- Power benchmark page: split "Achieved W" into Single-card W and
Multi-GPU W (StablePowerLimitW); derate highlight and status color
now reflect the final multi-GPU limit vs nominal
- Performance benchmark page: add Status column and per-GPU score
color coding (green/yellow/red) based on gpu.Status and OverallStatus
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- NvidiaPowerBenchResult gains ServerPower *BenchmarkServerPower
- RunNvidiaPowerBench samples IPMI idle before Phase 1 and loaded via
background goroutine throughout Phase 2 ramp
- renderPowerBenchReport: new "Server vs GPU Power Comparison" table
with ratio annotation (✓ match / ⚠ minor / ✗ over-report)
- renderPowerBenchSummary: server_idle_w, server_loaded_w, server_delta_w,
server_reporting_ratio keys
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace single aggregated badge per hardware category with individual
colored chips (O/W/F/?) for each ComponentStatusRecord. Added helper
functions: matchedRecords, firstNonEmpty. CSS classes: chip-ok/warn/fail/unknown.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
detectSlowdownTempExceedance scans steady-state metric rows per GPU and
emits a [WARNING] note + PARTIAL status if any sample >= SlowdownTempC.
Uses per-GPU threshold from nvidia-smi -q, fallback 80°C.
Distinct from p95-based TempHeadroomC check: catches even a single spike
above the slowdown threshold that would be smoothed out in aggregates.
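The scan reduces to an any-sample check (names illustrative):

```go
package main

import "fmt"

// exceedsSlowdown trips on any single steady-state sample at or above the
// slowdown threshold, unlike the p95 check, which smooths out a lone spike.
func exceedsSlowdown(tempsC []float64, slowdownC float64) bool {
	for _, t := range tempsC {
		if t >= slowdownC {
			return true
		}
	}
	return false
}

func main() {
	samples := []float64{71, 72, 74, 81, 73}      // one spike above 80
	fmt.Println(exceedsSlowdown(samples, 80))     // true
	fmt.Println(exceedsSlowdown(samples[:3], 80)) // false
}
```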
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse "GPU Shutdown Temp" and "GPU Slowdown Temp" from nvidia-smi -q verbose
output in enrichGPUInfoWithMaxClocks. Store as ShutdownTempC/SlowdownTempC
on benchmarkGPUInfo and BenchmarkGPUResult. Fallback: 90°C shutdown / 80°C
slowdown when not available.
TempHeadroomC = ShutdownTempC - P95TempC (per-GPU, not hardcoded 100°C).
Warning threshold: p95 >= SlowdownTempC. Critical: headroom < 10°C.
Report table shows both limits alongside headroom and p95 temp.
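The per-GPU thresholds combine as follows (a sketch; the 90/80 values are the stated fallbacks when nvidia-smi -q does not report the limits):

```go
package main

import "fmt"

// tempStatus applies the rules above: TempHeadroomC = ShutdownTempC - P95TempC,
// warning at p95 >= SlowdownTempC, critical when headroom < 10°C.
func tempStatus(p95C, shutdownC, slowdownC float64) string {
	headroom := shutdownC - p95C
	switch {
	case headroom < 10:
		return "critical"
	case p95C >= slowdownC:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(tempStatus(74, 90, 80)) // ok
	fmt.Println(tempStatus(80, 90, 80)) // warning (p95 at slowdown limit)
	fmt.Println(tempStatus(83, 90, 80)) // critical (headroom 7°C)
}
```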
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All report strings are now English only.
Add detectPowerAnomaly: scans steady-state metric rows per GPU with a
5-sample rolling baseline; flags a sudden drop ≥30% while GPU usage >50%
as [HARD STOP] — indicates bad cable contact or VRM fault.
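A sketch of the detection as described (5-sample rolling baseline; field names and the exact comparison are illustrative):

```go
package main

import "fmt"

// detectPowerAnomaly flags a sample whose power drops >=30% below the mean
// of the previous 5 samples while GPU utilization is still above 50%.
// Returns the index of the first suspect sample, or -1.
func detectPowerAnomaly(powerW, utilPct []float64) int {
	const window = 5
	for i := window; i < len(powerW); i++ {
		var base float64
		for _, w := range powerW[i-window : i] {
			base += w
		}
		base /= window
		if utilPct[i] > 50 && powerW[i] <= base*0.70 {
			return i // [HARD STOP] candidate
		}
	}
	return -1
}

func main() {
	power := []float64{300, 300, 300, 300, 300, 180} // sudden 40% drop under load
	util := []float64{99, 99, 99, 99, 99, 98}
	fmt.Println(detectPowerAnomaly(power, util)) // 5
}
```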
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CompositeScore = raw ComputeScore (TOPS). Throttling GPUs score lower
automatically — no quality multiplier distorting the compute signal.
Add ServerQualityScore (0-100): server infrastructure quality independent
of GPU model. Formula: 0.40×Stability + 0.30×PowerSustain + 0.30×Thermal.
Use to compare servers with the same GPU or flag bad server conditions.
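The formula as a sketch (a 0-100 component-score scale is assumed from the text):

```go
package main

import "fmt"

// serverQualityScore applies the stated weights to the three component scores.
func serverQualityScore(stability, powerSustain, thermal float64) float64 {
	return 0.40*stability + 0.30*powerSustain + 0.30*thermal
}

func main() {
	fmt.Printf("%.1f\n", serverQualityScore(100, 100, 100)) // 100.0
	fmt.Printf("%.1f\n", serverQualityScore(90, 60, 80))    // 78.0
}
```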
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PowerSustainScore: power draw variance (CV) during load, not deviation from TDP.
ThermalSustainScore: temperature variance (CV) during load.
StabilityScore: fraction of time spent in thermal+power-cap throttling.
Remove NCCL bonus from quality_factor.
quality = 0.35 + 0.35×Stability + 0.15×PowerSustain + 0.15×ThermalSustain, cap 1.00.
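The CV basis and the capped quality formula can be sketched as (component scores assumed on a 0..1 scale):

```go
package main

import (
	"fmt"
	"math"
)

// cv is the coefficient of variation (stddev/mean) of load-time samples,
// the basis of PowerSustainScore and ThermalSustainScore per the text.
func cv(xs []float64) float64 {
	var mean float64
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var varSum float64
	for _, x := range xs {
		varSum += (x - mean) * (x - mean)
	}
	return math.Sqrt(varSum/float64(len(xs))) / mean
}

// quality implements the stated formula, capped at 1.00.
func quality(stability, powerSustain, thermalSustain float64) float64 {
	q := 0.35 + 0.35*stability + 0.15*powerSustain + 0.15*thermalSustain
	return math.Min(q, 1.0)
}

func main() {
	fmt.Println(cv([]float64{300, 300, 300, 300})) // 0: perfectly flat draw
	fmt.Println(quality(1, 1, 1))                  // capped at 1
}
```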
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove references to pre-benchmark power calibration and dcgmi
targeted_power. Document platform_power_score ramp-up methodology,
PowerSustainScore fallback to steady-state power, and full-budget
single-precision phases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power/Thermal Fit: cumulative fixed-limit ramp where each GPU's stable TDP
is found under real multi-GPU thermal load (all prior GPUs running at their
fixed limits). PlatformMaxTDPW = sum of stable limits across all GPUs.
Remove PlatformPowerScore from power test.
Performance Benchmark: remove pre-benchmark power calibration entirely.
After N single-card runs, execute k=2..N parallel ramp-up steps and compute
PlatformPowerScore = mean compute scalability vs best single-card TOPS.
PowerSustainScore falls back to Steady.AvgPowerW when calibration absent.
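The scalability mean can be sketched as follows; the exact normalization (aggregate step TOPS over k times the best single-card TOPS) is an assumption drawn from the description above:

```go
package main

import "fmt"

// platformPowerScore averages per-step scalability: for each parallel step
// with k GPUs, its aggregate TOPS divided by k * best single-card TOPS.
func platformPowerScore(bestSingleTOPS float64, stepTOPS map[int]float64) float64 {
	var sum float64
	for k, tops := range stepTOPS {
		sum += tops / (float64(k) * bestSingleTOPS)
	}
	return sum / float64(len(stepTOPS))
}

func main() {
	// k=2..4 parallel steps on a hypothetical 4-GPU box, best single = 100 TOPS
	steps := map[int]float64{2: 196, 3: 288, 4: 376}
	fmt.Printf("%.3f\n", platformPowerScore(100, steps)) // 0.960
}
```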
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
modernc.org/sqlite v1.48.0 requires modernc.org/libc/sys/types which is
absent in v1.70.0 but present in v1.72.0.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RunNvidiaPowerBench already performs a full internal ramp from 1 to N
GPUs in Phase 2. Spawning N tasks with growing GPU subsets meant task K
repeated all steps 1..K-1 already done by tasks 1..K-1 — O(N²) work
instead of O(N). Replace with a single task using all selected GPUs.
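The step-count arithmetic behind the complexity claim:

```go
package main

import "fmt"

// rampSteps compares total ramp steps executed: N tasks with growing GPU
// subsets run 1+2+...+N = N(N+1)/2 steps, while one task covering all
// GPUs runs just N.
func rampSteps(n int) (nTasks, oneTask int) {
	return n * (n + 1) / 2, n
}

func main() {
	a, b := rampSteps(8) // e.g. an 8-GPU server
	fmt.Println(a, b)    // 36 8
}
```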
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Benchmark page now shows two result sections: Performance (scores) and
Power / Thermal Fit (slot table). After any benchmark task completes
the results section auto-refreshes via GET /api/benchmark/results
without a full page reload.
- Power results table shows each GPU slot with nominal TDP, achieved
stable power limit, and P95 observed power. Rows with derated cards
are highlighted amber so under-performing slots stand out at a glance.
Older runs are collapsed in a <details> summary.
- memtester is now wrapped with timeout(1) so a stuck memory controller
cannot cause Validate Memory to hang indefinitely. Wall-clock limit is
~2.5 min per 100 MB per pass plus a 2-minute buffer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 now calibrates each GPU individually (sequentially) so that
PowerRealizationPct reflects real degradation from neighbour thermals and
shared power rails. Previously the baseline came from an all-GPU-together
run, making realization always ≈100% at the final ramp step.
Ramp step 1 reuses single-card calibration results (no extra run); steps
2..N run targeted_power on the growing GPU subset with derating active.
Remove OccupiedSlots/OccupiedSlotsNote fields and occupiedSlots() helper —
they were compensation for the old all-GPU calibration approach.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.
Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run — no resource-busy contention
from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>