- Increase stability profile duration from 33 min to 90 min by wiring
  powerBenchDurationSec() into runBenchmarkPowerCalibration (its return
  value was previously computed but discarded)
- Collect per-step PSU slot readings, fan RPM/duty, and per-GPU telemetry
in ramp loop; add matching fields to NvidiaPowerBenchStep/NvidiaPowerBenchGPU
- Rewrite renderPowerBenchReport: replace the Per-Slot Results section with
  a Single GPU section, rework the Ramp Sequence table to rows=runs /
  cols=GPUs, add a PSU Performance section (conditional on IPMI data), and
  add a transposed Single vs All-GPU comparison table to the per-GPU sections
- Add fmtMDTable helper (benchmark_table.go) and apply it to all tables in
  both the power and performance reports so columns align in plain-text
  view (sketch below)
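
A minimal sketch of the column-alignment idea behind fmtMDTable, assuming a
rows-of-cells signature; the real helper likely also emits the header
separator row (imports: fmt, strings):

    func fmtMDTable(rows [][]string) string {
        if len(rows) == 0 {
            return ""
        }
        // Width of each column = widest cell seen in that column.
        widths := make([]int, len(rows[0]))
        for _, row := range rows {
            for i, cell := range row {
                if len(cell) > widths[i] {
                    widths[i] = len(cell)
                }
            }
        }
        var b strings.Builder
        for _, row := range rows {
            for i, cell := range row {
                fmt.Fprintf(&b, "| %-*s ", widths[i], cell)
            }
            b.WriteString("|\n")
        }
        return b.String()
    }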
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stability hardening (webui/app):
- readFileLimited(): guards against OOM when reading audit JSON (100 MB cap),
  the component-status DB (10 MB), and the task log (50 MB); see the sketch
  at the end of this list
- jobs.go: buffer the task log with one open fd per task instead of
  open/write/close on every line (eliminates thousands of syscalls/sec
  during GPU stress tests)
- stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s), reset
  after a successful run of >30s, restart counter logged via slog
- kill_workers.go: 5s timeout on the /proc scan, with a warn log when it
  triggers
- bee-web.service: MemoryMax=3G, so the OOM killer contains bee-web instead
  of taking down the host
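
A minimal sketch of the readFileLimited idea, assuming the cap is passed in;
the real helper may surface truncation differently (imports: fmt, io, os):

    func readFileLimited(path string, maxBytes int64) ([]byte, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()
        // Read one byte past the cap so truncation is detectable.
        data, err := io.ReadAll(io.LimitReader(f, maxBytes+1))
        if err != nil {
            return nil, err
        }
        if int64(len(data)) > maxBytes {
            return data[:maxBytes], fmt.Errorf("%s: truncated at %d bytes", path, maxBytes)
        }
        return data, nil
    }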
Build script:
- build.sh: remove the grub-pc/grub.cfg + live.cfg.in generation block,
  dead code since v8.25; grub-pc is ignored by live-build, and the generated
  live.cfg.in overwrote the correct static file with a stale version that
  lacked the kernel tuning parameters and the gsp-off/kms+gsp-off entries
- build.sh: dump_memtest_debug now logs grub-efi/grub.cfg instead of
  grub-pc/grub.cfg (which was always reported as "missing")
GRUB:
- live-theme/bee-logo.png: 400×400px bee logo on a black background
- live-theme/theme.txt: add an image component centered in the top third of
  the screen; menu shifted from 62% down to 65%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Snapshot IPMI "Power Supply" sensor states before and after each benchmark
run. Compare before/after to surface only *new* anomalies (pre-existing faults
are excluded). Results land in NvidiaBenchmarkResult.PSUIssues and
NvidiaPowerBenchResult.PSUIssues (JSON: psu_issues) and are printed in the
text benchmark report under a "PSU Issues" section.
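
A minimal sketch of the before/after diff; newPSUIssues is a hypothetical
name, and sensor states are assumed keyed by sensor name (imports: sort):

    // Only sensors that are non-OK after the run and were not already
    // in that exact state before it count as new anomalies.
    func newPSUIssues(before, after map[string]string) []string {
        var issues []string
        for sensor, state := range after {
            if state == "ok" {
                continue // healthy after the run
            }
            if before[sensor] == state {
                continue // pre-existing fault, excluded
            }
            issues = append(issues, sensor+": "+state)
        }
        sort.Strings(issues) // deterministic report order
        return issues
    }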
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add BenchmarkEstimated* constants to benchmark_types.go from _v8 logs
(Standard Perf ~16 min, Standard Power Fit ~43 min, Stability Perf ~92 min)
- Update benchmark profile dropdown to show Perf / Power Fit timing per profile
- Add timing columns to Method Split table (Standard vs Stability per run type)
- Update burn preset labels to show "N min/GPU (sequential) or N min (parallel)"
- Clarify burn "one by one" description with sequential vs parallel scaling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add SATEstimated* constants to sat.go derived from _v8 production logs,
with a rule to recalculate them whenever the script changes
- Extend validateInventory with NvidiaGPUCount to make estimates GPU-aware
- Update all validate card duration strings: CPU, memory, storage, NVIDIA GPU,
targeted stress/power, pulse test, NCCL, nvbandwidth
- Fix nvbandwidth description ("intended to stay short" → actual ~45 min)
- Top-level profile labels show the computed total, factoring in GPU count
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add PSUReading struct and PSUs []PSUReading to LiveMetricSample
- Sample per-PSU input watts from IPMI SDR entity 10.x (Power Supply)
- Render stacked filled-area SVG chart (one layer per PSU, cumulative total)
- Fall back to single-line chart on systems with ≤1 PSU in SDR
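
A minimal sketch of the stacking step, assuming samples[t][k] holds input
watts for PSU k at sample t; each SVG layer then plots one row of the result
(cumulativeLayers is a hypothetical name):

    // Layer k carries the cumulative sum of PSUs 0..k, so the topmost
    // layer traces the total server input power.
    func cumulativeLayers(samples [][]float64) [][]float64 {
        if len(samples) == 0 {
            return nil
        }
        layers := make([][]float64, len(samples[0]))
        for k := range layers {
            layers[k] = make([]float64, len(samples))
        }
        for t, psus := range samples {
            sum := 0.0
            for k, w := range psus {
                sum += w
                layers[k][t] = sum
            }
        }
        return layers
    }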
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- grub.cfg: add "load to RAM (toram)" entry to advanced submenu
- install_to_ram.go: resume from existing /dev/shm/bee-live copy if
source medium is unavailable after bee-web restart
- tasks.go: fix "Recovered after bee-web restart" shown on every run
(check j.lines before first append, not after)
- bee-install: retry unsquashfs up to 5x with wait-for-remount on
source loss; clear error message with bee-remount-medium hint
- bee-remount-medium: new script to find and remount live ISO source
after USB/CD reconnect; supports --wait polling mode
- 9000-bee-setup: chmod +x for bee-install and bee-remount-medium
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Sample IPMI loaded_w per single-card calibration and per ramp step
instead of averaging over the entire Phase 2; top-level ServerPower
uses the final (all-GPU) ramp step value
- Add ServerLoadedW/ServerDeltaW to NvidiaPowerBenchGPU and
NvidiaPowerBenchStep so external tooling can compare wall power per
phase without re-parsing logs
- Write gpu-metrics.csv/.html inside each single-XX/ and step-XX/
subdir; aggregate all phases into a top-level gpu-metrics.csv/.html
- Write 00-nvidia-smi-q.log at the start of every power run
- Add Telemetry (p95 temp/power/fan/clock) to NvidiaPowerBenchGPU in
result.json from the converged calibration attempt
- Power benchmark page: split "Achieved W" into Single-card W and
Multi-GPU W (StablePowerLimitW); derate highlight and status color
now reflect the final multi-GPU limit vs nominal
- Performance benchmark page: add Status column and per-GPU score
color coding (green/yellow/red) based on gpu.Status and OverallStatus
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- NvidiaPowerBenchResult gains ServerPower *BenchmarkServerPower
- RunNvidiaPowerBench samples IPMI idle before Phase 1 and loaded via
background goroutine throughout Phase 2 ramp
- renderPowerBenchReport: new "Server vs GPU Power Comparison" table
with ratio annotation (✓ match / ⚠ minor / ✗ over-report)
- renderPowerBenchSummary: server_idle_w, server_loaded_w, server_delta_w,
server_reporting_ratio keys
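
A hypothetical sketch of the ratio annotation, assuming ratio = GPU-reported
power sum / server wall delta; the actual cutoffs in renderPowerBenchReport
are not stated here and may differ:

    func ratioMark(ratio float64) string {
        switch {
        case ratio <= 1.05:
            return "✓ match" // GPUs report at or below wall power
        case ratio <= 1.25:
            return "⚠ minor" // slight over-report, within tolerance
        default:
            return "✗ over-report" // GPU telemetry not trustworthy
        }
    }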
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
detectSlowdownTempExceedance scans steady-state metric rows per GPU and
emits a [WARNING] note + PARTIAL status if any sample >= SlowdownTempC.
Uses per-GPU threshold from nvidia-smi -q, fallback 80°C.
Distinct from p95-based TempHeadroomC check: catches even a single spike
above the slowdown threshold that would be smoothed out in aggregates.
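
A minimal sketch of the per-GPU scan, with temperatures already extracted
from the steady-state rows (exceedsSlowdownTemp is a hypothetical name):

    func exceedsSlowdownTemp(tempsC []float64, slowdownC float64) bool {
        if slowdownC <= 0 {
            slowdownC = 80 // fallback when nvidia-smi -q gave no threshold
        }
        for _, t := range tempsC {
            if t >= slowdownC {
                return true // a single spike is enough
            }
        }
        return false
    }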
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse "GPU Shutdown Temp" and "GPU Slowdown Temp" from nvidia-smi -q verbose
output in enrichGPUInfoWithMaxClocks. Store as ShutdownTempC/SlowdownTempC
on benchmarkGPUInfo and BenchmarkGPUResult. Fallback: 90°C shutdown / 80°C
slowdown when not available.
TempHeadroomC = ShutdownTempC - P95TempC (per-GPU, not hardcoded 100°C).
Warning threshold: p95 >= SlowdownTempC. Critical: headroom < 10°C.
Report table shows both limits alongside headroom and p95 temp.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All report strings are now English only.
Add detectPowerAnomaly: scans steady-state metric rows per GPU with a
5-sample rolling baseline; flags a sudden drop ≥30% while GPU usage >50%
as [HARD STOP] — indicates bad cable contact or VRM fault.
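
A minimal sketch of the detection loop, assuming parallel slices of power
and utilisation samples per GPU (hasPowerDrop is a hypothetical name):

    func hasPowerDrop(powerW, utilPct []float64) bool {
        const window = 5 // rolling baseline length
        for i := window; i < len(powerW) && i < len(utilPct); i++ {
            var base float64
            for _, w := range powerW[i-window : i] {
                base += w
            }
            base /= window
            // Sudden drop ≥30% vs baseline while the GPU is >50% busy.
            if base > 0 && powerW[i] <= 0.7*base && utilPct[i] > 50 {
                return true
            }
        }
        return false
    }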
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CompositeScore = raw ComputeScore (TOPS). Throttling GPUs score lower
automatically — no quality multiplier distorting the compute signal.
Add ServerQualityScore (0-100): server infrastructure quality independent
of GPU model. Formula: 0.40×Stability + 0.30×PowerSustain + 0.30×Thermal.
Use to compare servers with the same GPU or flag bad server conditions.
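
The formula, transcribed as a Go sketch; the component scores are assumed to
share the 0-100 scale of the result:

    func serverQualityScore(stability, powerSustain, thermal float64) float64 {
        return 0.40*stability + 0.30*powerSustain + 0.30*thermal
    }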
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PowerSustainScore: power draw variance (CV) during load, not deviation from TDP.
ThermalSustainScore: temperature variance (CV) during load.
StabilityScore: fraction of time spent in thermal+power-cap throttling.
Remove NCCL bonus from quality_factor.
quality = 0.35 + 0.35×Stability + 0.15×PowerSustain + 0.15×ThermalSustain, cap 1.00.
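
A minimal sketch of the CV computation and the stated quality formula,
assuming component scores normalized to 0-1 (imports: math):

    // CV = stddev / mean over the load window; lower means steadier.
    func coefficientOfVariation(xs []float64) float64 {
        if len(xs) == 0 {
            return 0
        }
        var mean float64
        for _, x := range xs {
            mean += x
        }
        mean /= float64(len(xs))
        if mean == 0 {
            return 0
        }
        var ss float64
        for _, x := range xs {
            d := x - mean
            ss += d * d
        }
        return math.Sqrt(ss/float64(len(xs))) / mean
    }

    func qualityFactor(stability, powerSustain, thermalSustain float64) float64 {
        return math.Min(0.35+0.35*stability+0.15*powerSustain+0.15*thermalSustain, 1.00)
    }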
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove references to pre-benchmark power calibration and dcgmi
targeted_power. Document platform_power_score ramp-up methodology,
PowerSustainScore fallback to steady-state power, and full-budget
single-precision phases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power/Thermal Fit: cumulative fixed-limit ramp where each GPU's stable TDP
is found under real multi-GPU thermal load (all prior GPUs running at their
fixed limits). PlatformMaxTDPW = sum of stable limits across all GPUs.
Remove PlatformPowerScore from power test.
Performance Benchmark: remove pre-benchmark power calibration entirely.
After N single-card runs, execute k=2..N parallel ramp-up steps and compute
PlatformPowerScore = mean compute scalability vs best single-card TOPS.
PowerSustainScore falls back to Steady.AvgPowerW when calibration absent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Benchmark page now shows two result sections: Performance (scores) and
Power / Thermal Fit (slot table). After any benchmark task completes
the results section auto-refreshes via GET /api/benchmark/results
without a full page reload.
- Power results table shows each GPU slot with nominal TDP, achieved
stable power limit, and P95 observed power. Rows with derated cards
are highlighted amber so under-performing slots stand out at a glance.
Older runs are collapsed in a <details> summary.
- memtester is now wrapped with timeout(1) so a stuck memory controller
cannot cause Validate Memory to hang indefinitely. Wall-clock limit is
~2.5 min per 100 MB per pass plus a 2-minute buffer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 now calibrates each GPU individually (sequentially) so that
PowerRealizationPct reflects real degradation from neighbour thermals and
shared power rails. Previously the baseline came from an all-GPU-together
run, making realization always ≈100% at the final ramp step.
Ramp step 1 reuses single-card calibration results (no extra run); steps
2..N run targeted_power on the growing GPU subset with derating active.
Remove OccupiedSlots/OccupiedSlotsNote fields and occupiedSlots() helper —
they were compensation for the old all-GPU calibration approach.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.
Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run — no resource-busy contention
from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove telemetry-guided initial candidate; use strict binary search
midpoint at every step. Clean and predictable convergence in O(log N)
attempts within the allowed power range [minLimitW, startingLimitW].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power calibration previously stepped down 25 W at a time (linear),
requiring up to 6 attempts to find a stable limit within a 150 W range.
New strategy:
- Binary search between minLimitW (lo, assumed stable floor) and the
  starting/failed limit (hi, confirmed unstable), converging within a
  10 W tolerance in ~4 attempts (see the sketch after this list).
- For thermal throttle: the first-quarter telemetry rows estimate the
GPU's pre-throttle power draw. nextLimit = round5W(onset - 10 W) is
used as the initial candidate instead of the binary midpoint, landing
much closer to the true limit on the first step.
- On success: lo is updated and a higher level is tried (binary search
upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is
found rather than the first stable one.
- Let targeted_power run to natural completion on throttle (no mid-run
SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before
the next attempt.
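
A minimal sketch of the convergence loop; the real code drives dcgmi per
attempt, tracks per-GPU lo/hi state, and applies the thermal-onset initial
candidate described above:

    // lo = last known-stable limit, hi = last known-unstable limit.
    func calibrate(lo, hi float64, stableAt func(limitW float64) bool) float64 {
        const tolW = 10
        for hi-lo > tolW {
            mid := (lo + hi) / 2
            if stableAt(mid) {
                lo = mid // stable: push upward toward the true limit
            } else {
                hi = mid // unstable: tighten from above
            }
        }
        return lo // highest limit confirmed stable within tolerance
    }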
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During power calibration: if a thermal throttle (sw_thermal/hw_thermal)
causes ≥20% clock drop while server fans are below 98% P95 duty cycle,
record a CoolingWarning on the GPU result and emit an actionable finding
telling the operator to rerun with fans manually fixed at 100%.
During steady-state benchmark: same signal enriches the existing
thermal_limited finding with fan duty cycle and clock drift values.
Covers both the main benchmark (buildBenchmarkFindings) and the power
bench (NvidiaPowerBenchResult.Findings).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a targeted_power attempt is cancelled (e.g. after sw_thermal
throttle), nv-hostengine holds the diagnostic slot asynchronously.
The next attempt immediately received DCGM_ST_IN_USE (exit 222)
and incorrectly derated the power limit.
Now: exit 222 is detected via isDCGMResourceBusy and triggers an
exponential back-off retry at the same power limit (1s, 2s, 4s, …
up to 256s). Once the back-off delay would exceed 300s the
calibration fails, indicating the slot is persistently held.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
USB Export Drive:
lsblk reports TRAN only for whole disks, not partitions (/dev/sdc1).
Strip trailing partition digits to get parent disk before transport check.
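
A minimal sketch for sdX-style names as described; NVMe partitions
(nvme0n1p1) would additionally need the trailing "p" stripped (imports:
strings; parentDisk is a hypothetical name):

    func parentDisk(dev string) string {
        // /dev/sdc1 → /dev/sdc; lsblk then reports TRAN for the parent.
        return strings.TrimRight(dev, "0123456789")
    }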
LiveCD in RAM:
When RunInstallToRAM copies squashfs to /dev/shm/bee-live/ but bind-mount
of /run/live/medium fails (CD-ROM boots), /run/live/medium still shows the
CD-ROM fstype. Add fallback: if /dev/shm/bee-live/*.squashfs exists, the
data is in RAM — report status OK.
Dashboard Hardware Summary:
Show server Manufacturer + ProductName as heading and S/N as subline above
the component table, sourced from hw.Board (dmidecode system-type data).
Validate:
Remove Cycles input — always run once. cycles=1 hardcoded in runAllSAT().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
factor = ratio/0.75, applied per-GPU with a note explaining the reduction
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
shows scalability_score when non-zero
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
- Default Power Limit → DefaultPowerLimitW
- Current Power Limit → PowerLimitW (fallback)
- Multiprocessor Count → MultiprocessorCount
Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.
Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.
Adjust composite score weights to match the new honest power reference:
base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
erratic (stddev >12%), with a plain-English description in the report.
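
A minimal sketch of the classification using the thresholds above; the
precedence between "high" and "unstable" is assumed, and classifyCPULoad is
a hypothetical name:

    func classifyCPULoad(avgPct, maxPct, stddevPct float64) string {
        switch {
        case stddevPct > 12:
            return "unstable" // erratic host load
        case avgPct > 20 || maxPct > 40:
            return "high" // elevated background load
        default:
            return "ok"
        }
    }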
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire GPU inventory query
to fail and the clock lock to be skipped → normalization: partial.
Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
  and ErrorDescription when the current link speed is below the maximum
  supported speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison
  (sketch below)
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta
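
A hypothetical sketch of the rank helper; the exact speed-string format
depends on what the collector parses from sysfs/lspci:

    func pcieLinkSpeedRank(speed string) int {
        ranks := map[string]int{
            "2.5 GT/s": 1, // Gen1
            "5 GT/s":   2, // Gen2
            "8 GT/s":   3, // Gen3
            "16 GT/s":  4, // Gen4
            "32 GT/s":  5, // Gen5
        }
        return ranks[speed] // 0 = unknown; never triggers the warning
    }

applyPCIeLinkSpeedWarning would then flag a device when the current speed's
rank is non-zero and below the rank of the maximum supported speed.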
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- schema: add ToRAMStatus and USBExportPath fields to RuntimeHealth
- platform/runtime.go: collectToRAMHealth (ok/warning/failed based on
IsLiveMediaInRAM + toramActive) and collectUSBExportHealth (scans
/proc/mounts + lsblk for writable USB-backed filesystems)
- pages.go: add USB Export Drive and LiveCD in RAM rows to the health table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
→ 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
.tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>