reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Mikhail Chusavitin	e306250da7	Disable fp64/fp4 in mixed gpu burn v8.18 v8.19	2026-04-16 10:00:03 +03:00
Mikhail Chusavitin	c5b2081ac9	Disable unstable fp4/fp64 benchmark phases	2026-04-16 09:58:02 +03:00
Michael Chus	434528083e	Power bench: compare GPU-reported TDP vs IPMI server power delta - NvidiaPowerBenchResult gains ServerPower *BenchmarkServerPower - RunNvidiaPowerBench samples IPMI idle before Phase 1 and loaded via background goroutine throughout Phase 2 ramp - renderPowerBenchReport: new "Server vs GPU Power Comparison" table with ratio annotation (✓ match / ⚠ minor / ✗ over-report) - renderPowerBenchSummary: server_idle_w, server_loaded_w, server_delta_w, server_reporting_ratio keys Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 07:21:02 +03:00
Michael Chus	30aa30cd67	LiveCD: set Baby Bee wallpaper centered on black background 400×400px PNG centered via feh --bg-center --image-bg '#000000'. Fallback solid fill also changed to black. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.17	2026-04-16 06:57:23 +03:00
Michael Chus	4f76e1de21	Dashboard: per-device status chips with hover tooltips Replace single aggregated badge per hardware category with individual colored chips (O/W/F/?) for each ComponentStatusRecord. Added helper functions: matchedRecords, firstNonEmpty. CSS classes: chip-ok/warn/fail/unknown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:54:13 +03:00
Michael Chus	3732e64a4a	Add slowdown temperature exceedance detector to benchmark detectSlowdownTempExceedance scans steady-state metric rows per GPU and emits a [WARNING] note + PARTIAL status if any sample >= SlowdownTempC. Uses per-GPU threshold from nvidia-smi -q, fallback 80°C. Distinct from p95-based TempHeadroomC check: catches even a single spike above the slowdown threshold that would be smoothed out in aggregates. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:46:45 +03:00
Michael Chus	0d925299ff	Use per-GPU temperature limits from nvidia-smi -q for headroom calculation Parse "GPU Shutdown Temp" and "GPU Slowdown Temp" from nvidia-smi -q verbose output in enrichGPUInfoWithMaxClocks. Store as ShutdownTempC/SlowdownTempC on benchmarkGPUInfo and BenchmarkGPUResult. Fallback: 90°C shutdown / 80°C slowdown when not available. TempHeadroomC = ShutdownTempC - P95TempC (per-GPU, not hardcoded 100°C). Warning threshold: p95 >= SlowdownTempC. Critical: headroom < 10°C. Report table shows both limits alongside headroom and p95 temp. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:45:15 +03:00
Michael Chus	a8d5e019a5	Translate report to English; add power anomaly detector All report strings are now English only. Add detectPowerAnomaly: scans steady-state metric rows per GPU with a 5-sample rolling baseline; flags a sudden drop ≥30% while GPU usage >50% as [HARD STOP] — indicates bad cable contact or VRM fault. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:42:00 +03:00
Michael Chus	72ec086568	Restructure benchmark report as balanced scorecard (5 perspectives) Split throttle into separate signals: ThermalThrottlePct, PowerCapThrottlePct, SyncBoostThrottlePct. Add TempHeadroomC (100 - p95_temp) as independent thermal headroom metric; warning < 20°C (>80°C), critical < 10°C (>90°C). Hard stop findings: thermal throttle with fans < 95%, ECC uncorrected errors, p95 temp > 90°C. Throttle findings now include per-type percentages and diagnostic context. Replace flat scorecard table with BSC 5-perspective layout: 1. Compatibility (hard stops: thermal+fan, ECC) 2. Thermal headroom (p95 temp, delta to 100°C, throttle %) 3. Power delivery (power cap throttle, power CV, fan duty) 4. Performance (Compute TOPS, Synthetic, Mixed, TOPS/SM/GHz) 5. Anomalies (ECC corrected, sync boost, power/thermal variance) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:40:06 +03:00
Michael Chus	7a0b0934df	Separate compute score from server quality score CompositeScore = raw ComputeScore (TOPS). Throttling GPUs score lower automatically — no quality multiplier distorting the compute signal. Add ServerQualityScore (0-100): server infrastructure quality independent of GPU model. Formula: 0.40×Stability + 0.30×PowerSustain + 0.30×Thermal. Use to compare servers with the same GPU or flag bad server conditions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 00:45:55 +03:00
Michael Chus	d8ca0dca2c	Redesign scoring metrics: variance-based sustain scores, throttle stability PowerSustainScore: power draw variance (CV) during load, not deviation from TDP. ThermalSustainScore: temperature variance (CV) during load. StabilityScore: fraction of time spent in thermal+power-cap throttling. Remove NCCL bonus from quality_factor. quality = 0.35 + 0.35×Stability + 0.15×PowerSustain + 0.15×ThermalSustain, cap 1.00. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 00:39:59 +03:00
Michael Chus	d90250f80a	Fix DCGM cleanup and shorten memory validate	2026-04-16 00:39:37 +03:00
Michael Chus	8d6eaef5de	Update perf benchmark report methodology to reflect new design Remove references to pre-benchmark power calibration and dcgmi targeted_power. Document platform_power_score ramp-up methodology, PowerSustainScore fallback to steady-state power, and full-budget single-precision phases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 00:31:58 +03:00
Michael Chus	732bf4cbab	Redesign power and performance benchmarks with new methodology Power/Thermal Fit: cumulative fixed-limit ramp where each GPU's stable TDP is found under real multi-GPU thermal load (all prior GPUs running at their fixed limits). PlatformMaxTDPW = sum of stable limits across all GPUs. Remove PlatformPowerScore from power test. Performance Benchmark: remove pre-benchmark power calibration entirely. After N single-card runs, execute k=2..N parallel ramp-up steps and compute PlatformPowerScore = mean compute scalability vs best single-card TOPS. PowerSustainScore falls back to Steady.AvgPowerW when calibration absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 00:30:50 +03:00
Michael Chus	fa6d905a10	Tune bee-gpu-burn single-precision benchmark phases	2026-04-16 00:05:47 +03:00
Mikhail Chusavitin	5c1862ce4c	Use lb clean --all to clear bootstrap cache on every build Prevents stale debootstrap cache from bypassing --debootstrap-options changes (e.g. --include=ca-certificates added in v8.15). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.16	2026-04-15 17:37:08 +03:00
Mikhail Chusavitin	b65ef2ea1d	Fix: use --debootstrap-options to include ca-certificates in bootstrap --bootstrap-packages is not a valid lb config option (20230502). Use --debootstrap-options "--include=ca-certificates" instead to ensure ca-certificates is present when lb chroot_archives runs apt-get update against the NVIDIA CUDA HTTPS source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.15	2026-04-15 17:26:01 +03:00
Mikhail Chusavitin	533d703c97	Bootstrap ca-certificates so NVIDIA CUDA HTTPS source is trusted debootstrap creates a minimal chroot without ca-certificates, causing apt-get update to fail TLS verification for the NVIDIA CUDA apt source: "No system certificates available. Try installing ca-certificates." Add ca-certificates to --bootstrap-packages so it is present before lb chroot_archives configures the NVIDIA HTTPS source and runs apt-get update. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.14	2026-04-15 17:24:20 +03:00
Mikhail Chusavitin	04eb4b5a6d	Revert "Pre-download DCGM/fabricmanager debs on host to bypass chroot apt" This reverts commit `4110dbf8a6`. v8.13	2026-04-15 17:19:53 +03:00
Mikhail Chusavitin	4110dbf8a6	Pre-download DCGM/fabricmanager debs on host to bypass chroot apt The NVIDIA CUDA HTTPS apt source (developer.download.nvidia.com) may be unreachable from inside the live-build container chroot, causing 'E: Unable to locate package datacenter-gpu-manager-4-cuda13'. Add build-dcgm.sh that downloads DCGM and nvidia-fabricmanager .deb packages on the build host (verifying SHA256 against Packages.gz) and caches them in BEE_CACHE_DIR. build.sh (step 25-dcgm, nvidia only) copies them into LB_DIR/config/packages.chroot/ before lb build, so live-build creates a local apt repo from them. The chroot installs the packages from the local repo without ever contacting the NVIDIA CUDA HTTPS source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.12	2026-04-15 17:10:23 +03:00
Mikhail Chusavitin	7237e4d3e4	Add fabric manager boot and support diagnostics v8.11	2026-04-15 16:14:26 +03:00
Mikhail Chusavitin	ab3ad77cd6	Fix Go module: upgrade modernc.org/libc v1.70.0 → v1.72.0 modernc.org/sqlite v1.48.0 requires modernc.org/libc/sys/types which is absent in v1.70.0 but present in v1.72.0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.10	2026-04-15 14:32:04 +03:00
Mikhail Chusavitin	cd9e2cbe13	Fix ramp-up power bench: one task instead of N redundant tasks RunNvidiaPowerBench already performs a full internal ramp from 1 to N GPUs in Phase 2. Spawning N tasks with growing GPU subsets meant task K repeated all steps 1..K-1 already done by tasks 1..K-1 — O(N²) work instead of O(N). Replace with a single task using all selected GPUs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.9	2026-04-15 12:29:11 +03:00
Mikhail Chusavitin	0317dc58fd	Fix memtest hook: grub.cfg/live.cfg missing during binary hooks is expected lb binary_grub-efi and lb binary_syslinux create these files from templates that already have memtest entries hardcoded. The hook should not fail when the files don't exist yet — validate_iso_memtest() checks the final ISO. Only the binary files (x64.bin, x64.efi) are required here. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.8.3	2026-04-15 10:33:22 +03:00
Mikhail Chusavitin	1c5cb45698	Fix memtest hook: bad ver_arg format in apt-get download ver_arg was set to "=memtest86+=VERSION" making the command "apt-get download memtest86+=memtest86+=VERSION" (invalid). Fixed to build pkg_spec directly as "memtest86+=VERSION". Also add apt-get update retry if initial download fails. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.8.2	2026-04-15 10:15:01 +03:00
Mikhail Chusavitin	090b92ca73	Re-enable security repo: kernel 6.1.0-44 is in bookworm-security only Disabling --security broke the build because linux-image-6.1.0-44-amd64 is a security update not present in the base bookworm repo. Main packages already come from mirror.mephi.ru. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.8.1	2026-04-15 10:02:52 +03:00
Mikhail Chusavitin	2dccbc010c	Use MEPHI mirror, disable security repo, fix memtest in ISO build - Switch all lb mirrors to mirror.mephi.ru/debian/ for faster/reliable downloads - Disable security repo (--security false) — not needed for LiveCD - Pin MEMTEST_VERSION=6.10-4 in VERSIONS, export to hook environment - Set BEE_REQUIRE_MEMTEST=1 in build-in-container.sh — missing memtest is now fatal - Fix 9100-memtest.hook.binary: add apt-get download fallback when lb binary_memtest has already purged the package cache; handle both 5.x (memtest86+x64.bin) and 6.x (memtest86+.bin) BIOS binary naming Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.8	2026-04-15 09:57:29 +03:00
Michael Chus	e84c69d360	Fix optional step log dir missing after memtest recovery mkdir -p LOG_DIR before writing the optional step log so that a race with cleanup_build_log (EXIT trap archiving the log dir) does not cause a "Directory nonexistent" error during lb binary_checksums / lb binary_iso. Also downgrade apt-get update failure to a warning so a transient mirror outage does not block kernel ABI auto-detection when the apt cache is warm. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 07:28:36 +03:00
Michael Chus	c80a39e7ac	Add power results table, fix benchmark results refresh, bound memtester - Benchmark page now shows two result sections: Performance (scores) and Power / Thermal Fit (slot table). After any benchmark task completes the results section auto-refreshes via GET /api/benchmark/results without a full page reload. - Power results table shows each GPU slot with nominal TDP, achieved stable power limit, and P95 observed power. Rows with derated cards are highlighted amber so under-performing slots stand out at a glance. Older runs are collapsed in a <details> summary. - memtester is now wrapped with timeout(1) so a stuck memory controller cannot cause Validate Memory to hang indefinitely. Wall-clock limit is ~2.5 min per 100 MB per pass plus a 2-minute buffer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.7	2026-04-15 07:16:18 +03:00
Michael Chus	a5e0261ff2	Refactor power ramp to use true single-card baselines Phase 1 now calibrates each GPU individually (sequentially) so that PowerRealizationPct reflects real degradation from neighbour thermals and shared power rails. Previously the baseline came from an all-GPU-together run, making realization always ≈100% at the final ramp step. Ramp step 1 reuses single-card calibration results (no extra run); steps 2..N run targeted_power on the growing GPU subset with derating active. Remove OccupiedSlots/OccupiedSlotsNote fields and occupiedSlots() helper — they were compensation for the old all-GPU calibration approach. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 23:47:57 +03:00
Michael Chus	ee422ede3c	Revert "Add raster Easy Bee branding assets" This reverts commit `d560b2fead`.	2026-04-14 23:00:15 +03:00
Michael Chus	d560b2fead	Add raster Easy Bee branding assets	2026-04-14 22:39:25 +03:00
Michael Chus	3cf2e9c9dc	Run power calibration for all GPUs simultaneously Previously each GPU was calibrated sequentially (one card fully done before the next started), producing the staircase temperature pattern seen on the graph. Now all GPUs run together in a single dcgmi diag -r targeted_power session per attempt. This means: - All cards are under realistic thermal load at the same time. - A single DCGM session handles the run — no resource-busy contention from concurrent dcgmi processes. - Binary search state (lo/hi) is tracked independently per GPU; each card converges to its own highest stable power limit. - Throttle counter polling covers all active GPUs in the shared ticker. - Resource-busy exponential back-off is shared (one DCGM session). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.6	2026-04-14 22:25:05 +03:00
Michael Chus	19dbabd71d	Simplify power calibration: pure binary search, no telemetry guessing Remove telemetry-guided initial candidate; use strict binary search midpoint at every step. Clean and predictable convergence in O(log N) attempts within the allowed power range [minLimitW, startingLimitW]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.5	2026-04-14 22:12:45 +03:00
Michael Chus	a6a07f2626	Replace linear power derate with binary search + telemetry-guided jump Power calibration previously stepped down 25 W at a time (linear), requiring up to 6 attempts to find a stable limit within 150 W range. New strategy: - Binary search between minLimitW (lo, assumed stable floor) and the starting/failed limit (hi, confirmed unstable), converging within a 10 W tolerance in ~4 attempts. - For thermal throttle: the first-quarter telemetry rows estimate the GPU's pre-throttle power draw. nextLimit = round5W(onset - 10 W) is used as the initial candidate instead of the binary midpoint, landing much closer to the true limit on the first step. - On success: lo is updated and a higher level is tried (binary search upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is found rather than the first stable one. - Let targeted_power run to natural completion on throttle (no mid-run SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before the next attempt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.4	2026-04-14 22:05:23 +03:00
Michael Chus	f87461ee4a	Detect thermal throttle with fans below 100% as cooling misconfiguration During power calibration: if a thermal throttle (sw_thermal/hw_thermal) causes ≥20% clock drop while server fans are below 98% P95 duty cycle, record a CoolingWarning on the GPU result and emit an actionable finding telling the operator to rerun with fans manually fixed at 100%. During steady-state benchmark: same signal enriches the existing thermal_limited finding with fan duty cycle and clock drift values. Covers both the main benchmark (buildBenchmarkFindings) and the power bench (NvidiaPowerBenchResult.Findings). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.2	2026-04-14 21:44:57 +03:00
Michael Chus	a636146dbd	Fix power calibration failing due to DCGM resource contention When a targeted_power attempt is cancelled (e.g. after sw_thermal throttle), nv-hostengine holds the diagnostic slot asynchronously. The next attempt immediately received DCGM_ST_IN_USE (exit 222) and incorrectly derated the power limit. Now: exit 222 is detected via isDCGMResourceBusy and triggers an exponential back-off retry at the same power limit (1s, 2s, 4s, … up to 256s). Once the back-off delay would exceed 300s the calibration fails, indicating the slot is persistently held. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.1	2026-04-14 20:41:17 +03:00
Mikhail Chusavitin	303de2df04	Add slot-aware ramp sequence to bee-bench power v8.0	2026-04-14 17:47:40 +03:00
Mikhail Chusavitin	95124d228f	Split bee-bench into perf and power workflows	2026-04-14 17:33:13 +03:00
Mikhail Chusavitin	54338dbae5	Unify live RAM runtime state	2026-04-14 16:18:33 +03:00
Mikhail Chusavitin	2be7ae6d28	Refine NVIDIA benchmark phase timing	2026-04-14 14:12:06 +03:00
Mikhail Chusavitin	b1a5035edd	Normalize task queue priorities by workflow v7.20 v7.21	2026-04-14 11:13:54 +03:00
Mikhail Chusavitin	8fc986c933	Add benchmark fan duty cycle summary to report	2026-04-14 10:24:02 +03:00
Mikhail Chusavitin	88b5e0edf2	Harden IPMI power probe timeout v7.19	2026-04-14 10:18:23 +03:00
Mikhail Chusavitin	82fe1f6d26	Disable precision fallback and pin cuBLAS 13.1	2026-04-14 10:17:44 +03:00
Michael Chus	81e7c921f8	Debugging during build	2026-04-14 07:02:37 +03:00
Michael Chus	0fb8f2777f	Fix combined gpu burn profile capacity for fp4 v7.18	2026-04-14 00:00:40 +03:00
Michael Chus	bf182daa89	Fix benchmark report methodology and rebuild gpu burn worker on toolchain changes v7.17	2026-04-13 23:43:12 +03:00
Michael Chus	457ea1cf04	Unify benchmark exports and drop ASCII charts v7.15 v7.16	2026-04-13 21:38:28 +03:00
Michael Chus	bf6ecab4f0	Add per-precision benchmark phases, weighted TOPS scoring, and ECC tracking - Split steady window into 6 equal slots: fp8/fp16/fp32/fp64/fp4 + combined - Each precision phase runs bee-gpu-burn with --precision filter so PowerCVPct reflects single-kernel stability (not round-robin artifact) - Add fp4 support in bee-gpu-stress.c for Blackwell (cc>=100) via existing CUDA_R_4F_E2M1 guard - Weighted TOPS: fp64×2.0, fp32×1.0, fp16×0.5, fp8×0.25, fp4×0.125 - SyntheticScore = sum of weighted TOPS from per-precision phases - MixedScore = sum from combined phase; MixedEfficiency = Mixed/Synthetic - ComputeScore = SyntheticScore × (1 + MixedEfficiency × 0.3) - ECC volatile counters sampled before/after each phase and overall - DegradationReasons: ecc_uncorrected_errors, ecc_corrected_errors - Report: per-precision stability table with ECC columns, methodology section - Ramp-up history table redesign: GPU indices as columns, runs as rows Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v7.14	2026-04-13 10:49:49 +03:00


446 Commits