targeted_stress, targeted_power, and the Level 2/3 diag were dispatched
one GPU at a time from the UI, turning a single dcgmi command into 8
sequential ~350–450 s runs. DCGM supports -i with a comma-separated list
of GPU indices and runs the diagnostic on all of them in parallel.
Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into
nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one
API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate.
Update sat.go constants and page_validate.go estimates to reflect all-GPU
simultaneous execution (remove n× multiplier from total time estimates).
Stress test on 8-GPU system: ~5.3 h → ~2.5 h.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- benchmark.go: retain sdrLastStep from final ramp step instead of
re-sampling after test when GPUs are already idle
- scripts/deploy.sh: build+deploy bee binary to remote host over SSH
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DCMI reports only the managed power domain (~CPU+MB), missing GPU draw.
PSU AC input sensors cover full wall power. When samplePSUPower returns
data, sum the slots for PowerW; fall back to DCMI otherwise.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
\b does not fire between a digit and '_' because '_' is \w in RE2.
The pattern \bpsu?\s*([0-9]+)\b never matched PSU1_POWER_IN style
sensors, so parsePSUSDR (and PSUSlotsFromSDR / samplePSUPower) returned
empty results for MSI servers — causing all power graphs to fall back
to DCMI which reports ~half actual draw.
Added an explicit underscore-terminated pattern first in the list and
tests covering the MSI format.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dcgmproftester and other GPU test subprocesses run inside the bee-web
cgroup and exceed 3G with 8 GPUs. OOM killer terminates the whole
service. No memory cap is appropriate on a LiveCD where GPU tests
legitimately use several GB.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same fix as ramp steps: take sdrSingle snapshot after calibration
and prefer PSUInW over DCMI for singleIPMILoadedW. DCMI kept as
fallback. Log message indicates source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When sdrStep.PSUInW is available, prefer it over DCMI for
ramp.ServerLoadedW and ServerDeltaW. DCMI on this platform (MSI 4-PSU)
reports ~half actual draw; SDR sums all PSU_POWER_IN sensors correctly.
Delta is now SDR-to-SDR (sdrStep.PSUInW - sdrIdle.PSUInW) for
consistency. DCMI path kept as fallback when SDR has no PSU data.
Log message now indicates the source (SDR PSU AC input vs DCMI).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MSI servers place PSU_POWER_IN/OUT sensors on entity 3.0, not 10.N
(the IPMI "Power Supply" entity). The old parser filtered by entity ID
and found nothing, so the dashboard fell back to DCMI which reports
roughly half the actual draw.
Now delegates to collector.PSUSlotsFromSDR — the same name-based
matching already used in the Power Fit benchmark.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A single dcgmproftester process without -i only loads GPU 0 regardless of
CUDA_VISIBLE_DEVICES. Now always routes multi-GPU runs through
bee-dcgmproftester-staggered (--stagger-seconds 0 for parallel mode),
which spawns one process per GPU so all GPUs are loaded simultaneously.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Increase stability profile duration from 33 min to 90 min by wiring
powerBenchDurationSec() into runBenchmarkPowerCalibration (was discarded)
- Collect per-step PSU slot readings, fan RPM/duty, and per-GPU telemetry
in ramp loop; add matching fields to NvidiaPowerBenchStep/NvidiaPowerBenchGPU
- Rewrite renderPowerBenchReport: replace Per-Slot Results with Single GPU
section, rework Ramp Sequence rows=runs/cols=GPUs, add PSU Performance
section (conditional on IPMI data), add transposed Single vs All-GPU
comparison table in per-GPU sections
- Add fmtMDTable helper (benchmark_table.go) and apply to all tables in
both power and performance reports so columns align in plain-text view
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stability hardening (webui/app):
- readFileLimited(): защита от OOM при чтении audit JSON (100 MB),
component-status DB (10 MB) и лога задачи (50 MB)
- jobs.go: буферизованный лог задачи — один открытый fd на задачу
вместо open/write/close на каждую строку (устраняет тысячи syscall/сек
при GPU стресс-тестах)
- stability.go: экспоненциальный backoff в goRecoverLoop (2s→4s→…→60s),
сброс при успешном прогоне >30s, счётчик перезапусков в slog
- kill_workers.go: таймаут 5s на скан /proc, warn при срабатывании
- bee-web.service: MemoryMax=3G — OOM killer защищён
Build script:
- build.sh: удалён блок генерации grub-pc/grub.cfg + live.cfg.in —
мёртвый код с v8.25; grub-pc игнорируется live-build, а генерируемый
live.cfg.in перезаписывал правильный статический файл устаревшей
версией без tuning-параметров ядра и пунктов gsp-off/kms+gsp-off
- build.sh: dump_memtest_debug теперь логирует grub-efi/grub.cfg
вместо grub-pc/grub.cfg (было всегда "missing")
GRUB:
- live-theme/bee-logo.png: логотип пчелы 400×400px на чёрном фоне
- live-theme/theme.txt: + image компонент по центру в верхней трети
экрана; меню сдвинуто с 62% до 65%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
As metrics.db grew (1 sample/5 s × hours), handleMetricsChartSVG called
LoadAll() on every chart request — loading all rows across 4 tables through a
single SQLite connection. With ~10 charts auto-refreshing in parallel, requests
queued behind each other, saturating the connection pool and pegging a CPU core.
Fix: add a background compactor that runs every hour via the metrics collector:
• Downsample: rows older than 2 h are thinned to 1 per minute (keep MIN(ts)
per ts/60 bucket) — retains chart shape while cutting row count by ~92 %.
• Prune: rows older than 48 h are deleted entirely.
• After prune: WAL checkpoint/truncate to release disk space.
LoadAll() in handleMetricsChartSVG is unchanged — it now stays fast because
the DB is kept small rather than capping the query window.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Snapshot IPMI "Power Supply" sensor states before and after each benchmark
run. Compare before/after to surface only *new* anomalies (pre-existing faults
are excluded). Results land in NvidiaBenchmarkResult.PSUIssues and
NvidiaPowerBenchResult.PSUIssues (JSON: psu_issues) and are printed in the
text benchmark report under a "PSU Issues" section.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TestHandleAPIBenchmarkPowerFitRampQueuesBenchmarkPowerFitTasks: ramp-up
mode intentionally creates a single task (the runner handles 1→N internally
to avoid redundant repetition of earlier ramp steps). Updated the test to
expect 1 task and verify RampTotal=3 instead of asserting 3 separate tasks.
- TestBenchmarkPageRendersSavedResultsTable: benchmark page used "Performance
Results" as heading while the test looked for "Perf Results". Aligned the
page heading with the shorter label used everywhere else (task reports, etc.).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add BenchmarkEstimated* constants to benchmark_types.go from _v8 logs
(Standard Perf ~16 min, Standard Power Fit ~43 min, Stability Perf ~92 min)
- Update benchmark profile dropdown to show Perf / Power Fit timing per profile
- Add timing columns to Method Split table (Standard vs Stability per run type)
- Update burn preset labels to show "N min/GPU (sequential) or N min (parallel)"
- Clarify burn "one by one" description with sequential vs parallel scaling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add SATEstimated* constants to sat.go derived from _v8 production logs,
with a rule to recalculate them whenever the script changes
- Extend validateInventory with NvidiaGPUCount to make estimates GPU-aware
- Update all validate card duration strings: CPU, memory, storage, NVIDIA GPU,
targeted stress/power, pulse test, NCCL, nvbandwidth
- Fix nvbandwidth description ("intended to stay short" → actual ~45 min)
- Top-level profile labels show computed total including GPU count
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add PSUReading struct and PSUs []PSUReading to LiveMetricSample
- Sample per-PSU input watts from IPMI SDR entity 10.x (Power Supply)
- Render stacked filled-area SVG chart (one layer per PSU, cumulative total)
- Fall back to single-line chart on systems with ≤1 PSU in SDR
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
grub-efi/grub.cfg: add KMS+GSP=off entry (was in isolinux, missing in GRUB)
isolinux/live.cfg.in: add full standard param set to all entries
(net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always
numa_balancing=disable nowatchdog nosoftlockup) to match grub-efi
bible-local/docs/iso-build-rules.md: add bootloader sync rule documenting
that grub-efi and isolinux must be kept in sync manually, listing canonical
entries and standard param set, and noting the grub-pc/grub-efi history.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Build uses --bootloaders "grub-efi,syslinux" so live-build reads
config/bootloaders/grub-efi/ for the UEFI GRUB config. The directory
was incorrectly named grub-pc, causing live-build to ignore our custom
grub.cfg and generate a default one (missing toram, GSP-off entries).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
live-boot already uses rsync --progress when /bin/rsync exists; without
it the copy falls back to silent cp -a. Add rsync to the ISO package
list and install an initramfs-tools hook (bee-rsync) that copies the
rsync binary + shared libs into the initrd via copy_exec. The hook then
rebuilds the initramfs so the change takes effect in the ISO's initrd.img.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- grub.cfg: add "load to RAM (toram)" entry to advanced submenu
- install_to_ram.go: resume from existing /dev/shm/bee-live copy if
source medium is unavailable after bee-web restart
- tasks.go: fix "Recovered after bee-web restart" shown on every run
(check j.lines before first append, not after)
- bee-install: retry unsquashfs up to 5x with wait-for-remount on
source loss; clear error message with bee-remount-medium hint
- bee-remount-medium: new script to find and remount live ISO source
after USB/CD reconnect; supports --wait polling mode
- 9000-bee-setup: chmod +x for bee-install and bee-remount-medium
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Sample IPMI loaded_w per single-card calibration and per ramp step
instead of averaging over the entire Phase 2; top-level ServerPower
uses the final (all-GPU) ramp step value
- Add ServerLoadedW/ServerDeltaW to NvidiaPowerBenchGPU and
NvidiaPowerBenchStep so external tooling can compare wall power per
phase without re-parsing logs
- Write gpu-metrics.csv/.html inside each single-XX/ and step-XX/
subdir; aggregate all phases into a top-level gpu-metrics.csv/.html
- Write 00-nvidia-smi-q.log at the start of every power run
- Add Telemetry (p95 temp/power/fan/clock) to NvidiaPowerBenchGPU in
result.json from the converged calibration attempt
- Power benchmark page: split "Achieved W" into Single-card W and
Multi-GPU W (StablePowerLimitW); derate highlight and status color
now reflect the final multi-GPU limit vs nominal
- Performance benchmark page: add Status column and per-GPU score
color coding (green/yellow/red) based on gpu.Status and OverallStatus
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- NvidiaPowerBenchResult gains ServerPower *BenchmarkServerPower
- RunNvidiaPowerBench samples IPMI idle before Phase 1 and loaded via
background goroutine throughout Phase 2 ramp
- renderPowerBenchReport: new "Server vs GPU Power Comparison" table
with ratio annotation (✓ match / ⚠ minor / ✗ over-report)
- renderPowerBenchSummary: server_idle_w, server_loaded_w, server_delta_w,
server_reporting_ratio keys
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
400×400px PNG centered via feh --bg-center --image-bg '#000000'.
Fallback solid fill also changed to black.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace single aggregated badge per hardware category with individual
colored chips (O/W/F/?) for each ComponentStatusRecord. Added helper
functions: matchedRecords, firstNonEmpty. CSS classes: chip-ok/warn/fail/unknown.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>