Commit Graph

413 Commits

19dbabd71d Simplify power calibration: pure binary search, no telemetry guessing
Remove telemetry-guided initial candidate; use strict binary search
midpoint at every step. Clean and predictable convergence in O(log N)
attempts within the allowed power range [minLimitW, startingLimitW].
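
A minimal sketch of that loop, assuming a hypothetical runAttempt
wrapper around a single targeted_power run at the given limit:

  // binarySearchPowerLimit returns the highest limit in [lo, hi] that
  // runAttempt reports stable, to within tolW watts.
  func binarySearchPowerLimit(lo, hi, tolW int, runAttempt func(limitW int) bool) int {
      for hi-lo > tolW {
          mid := (lo + hi) / 2
          if runAttempt(mid) {
              lo = mid // stable: push the floor up
          } else {
              hi = mid // unstable: pull the ceiling down
          }
      }
      return lo
  }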

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v8.5
2026-04-14 22:12:45 +03:00
a6a07f2626 Replace linear power derate with binary search + telemetry-guided jump
Power calibration previously stepped down 25 W at a time (linear),
requiring up to 6 attempts to find a stable limit within a 150 W range.

New strategy:
- Binary search between minLimitW (lo, assumed stable floor) and the
  starting/failed limit (hi, confirmed unstable), converging within a
  10 W tolerance in ~4 attempts.
- For thermal throttle: the first-quarter telemetry rows estimate the
  GPU's pre-throttle power draw. nextLimit = round5W(onset - 10 W) is
  used as the initial candidate instead of the binary midpoint, landing
  much closer to the true limit on the first step (sketched after this list).
- On success: lo is updated and a higher level is tried (binary search
  upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is
  found rather than the first stable one.
- Let targeted_power run to natural completion on throttle (no mid-run
  SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before
  the next attempt.
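
A sketch of the telemetry-guided first step; round5W is the helper
named above, with its rounding behaviour assumed:

  // round5W rounds to the nearest 5 W step (assumed behaviour).
  func round5W(w int) int { return (w + 2) / 5 * 5 }

  // initialCandidate tries round5W(onset - 10 W) first and falls back
  // to the plain binary midpoint when the estimate is outside (lo, hi).
  func initialCandidate(onsetW, lo, hi int) int {
      cand := round5W(onsetW - 10)
      if cand <= lo || cand >= hi {
          return (lo + hi) / 2
      }
      return cand
  }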

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v8.4
2026-04-14 22:05:23 +03:00
f87461ee4a Detect thermal throttle with fans below 100% as cooling misconfiguration
During power calibration: if a thermal throttle (sw_thermal/hw_thermal)
causes ≥20% clock drop while server fans are below 98% P95 duty cycle,
record a CoolingWarning on the GPU result and emit an actionable finding
telling the operator to rerun with fans manually fixed at 100%.

During the steady-state benchmark: the same signal enriches the existing
thermal_limited finding with fan duty cycle and clock drift values.

Covers both the main benchmark (buildBenchmarkFindings) and the power
bench (NvidiaPowerBenchResult.Findings).
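
The detection predicate as a sketch, with the thresholds from this
message (names are illustrative):

  // coolingMisconfigured: a thermal throttle with a ≥20% clock drop
  // while fan P95 duty cycle is still below 98% points at the cooling
  // setup rather than the GPU, so a CoolingWarning is recorded.
  func coolingMisconfigured(thermalThrottle bool, clockDropPct, fanP95Pct float64) bool {
      return thermalThrottle && clockDropPct >= 20 && fanP95Pct < 98
  }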

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v8.2
2026-04-14 21:44:57 +03:00
a636146dbd Fix power calibration failing due to DCGM resource contention
When a targeted_power attempt is cancelled (e.g. after an sw_thermal
throttle), nv-hostengine releases the diagnostic slot asynchronously,
so the next attempt would immediately receive DCGM_ST_IN_USE (exit 222)
and incorrectly derate the power limit.

Now: exit 222 is detected via isDCGMResourceBusy and triggers an
exponential back-off retry at the same power limit (1s, 2s, 4s, …
up to 256s). Once the back-off delay would exceed 300s the
calibration fails, indicating the slot is persistently held.
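
A sketch of the retry policy, assuming a hypothetical attempt callback
that returns the dcgmi exit code:

  import (
      "fmt"
      "time"
  )

  // retryWhileSlotBusy retries attempt while the diagnostic slot is
  // busy (exit 222 = DCGM_ST_IN_USE), doubling the delay 1s, 2s, 4s, …
  // and failing once the next delay would exceed 300s.
  func retryWhileSlotBusy(attempt func() (exitCode int, err error)) error {
      delay := time.Second
      for {
          code, err := attempt()
          if code != 222 {
              return err // success or an unrelated failure: stop retrying
          }
          if delay > 300*time.Second {
              return fmt.Errorf("calibration failed: diagnostic slot persistently held")
          }
          time.Sleep(delay)
          delay *= 2
      }
  }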

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v8.1
2026-04-14 20:41:17 +03:00
Mikhail Chusavitin
303de2df04 Add slot-aware ramp sequence to bee-bench power v8.0 2026-04-14 17:47:40 +03:00
Mikhail Chusavitin
95124d228f Split bee-bench into perf and power workflows 2026-04-14 17:33:13 +03:00
Mikhail Chusavitin
54338dbae5 Unify live RAM runtime state 2026-04-14 16:18:33 +03:00
Mikhail Chusavitin
2be7ae6d28 Refine NVIDIA benchmark phase timing 2026-04-14 14:12:06 +03:00
Mikhail Chusavitin
b1a5035edd Normalize task queue priorities by workflow v7.20 v7.21 2026-04-14 11:13:54 +03:00
Mikhail Chusavitin
8fc986c933 Add benchmark fan duty cycle summary to report 2026-04-14 10:24:02 +03:00
Mikhail Chusavitin
88b5e0edf2 Harden IPMI power probe timeout v7.19 2026-04-14 10:18:23 +03:00
Mikhail Chusavitin
82fe1f6d26 Disable precision fallback and pin cuBLAS 13.1 2026-04-14 10:17:44 +03:00
81e7c921f8 Debug during build 2026-04-14 07:02:37 +03:00
0fb8f2777f Fix combined gpu burn profile capacity for fp4 v7.18 2026-04-14 00:00:40 +03:00
bf182daa89 Fix benchmark report methodology and rebuild gpu burn worker on toolchain changes v7.17 2026-04-13 23:43:12 +03:00
457ea1cf04 Unify benchmark exports and drop ASCII charts v7.15 v7.16 2026-04-13 21:38:28 +03:00
bf6ecab4f0 Add per-precision benchmark phases, weighted TOPS scoring, and ECC tracking
- Split steady window into 6 equal slots: fp8/fp16/fp32/fp64/fp4 + combined
- Each precision phase runs bee-gpu-burn with a --precision filter so PowerCVPct reflects single-kernel stability (not a round-robin artifact)
- Add fp4 support in bee-gpu-stress.c for Blackwell (cc>=100) via existing CUDA_R_4F_E2M1 guard
- Weighted TOPS: fp64×2.0, fp32×1.0, fp16×0.5, fp8×0.25, fp4×0.125
- SyntheticScore = sum of weighted TOPS from per-precision phases
- MixedScore = sum from combined phase; MixedEfficiency = Mixed/Synthetic
- ComputeScore = SyntheticScore × (1 + MixedEfficiency × 0.3) (sketched below)
- ECC volatile counters sampled before/after each phase and overall
- DegradationReasons: ecc_uncorrected_errors, ecc_corrected_errors
- Report: per-precision stability table with ECC columns, methodology section
- Ramp-up history table redesign: GPU indices as columns, runs as rows
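
The scoring arithmetic from the bullets above as a sketch (function
and map names are illustrative):

  // Per-precision weights from this message.
  var topsWeights = map[string]float64{
      "fp64": 2.0, "fp32": 1.0, "fp16": 0.5, "fp8": 0.25, "fp4": 0.125,
  }

  // computeScores folds measured TOPS (keyed by precision) into the
  // synthetic, mixed and composite figures defined above.
  func computeScores(perPhase, combined map[string]float64) (synthetic, mixed, compute float64) {
      for p, tops := range perPhase {
          synthetic += tops * topsWeights[p]
      }
      for p, tops := range combined {
          mixed += tops * topsWeights[p]
      }
      var eff float64
      if synthetic > 0 {
          eff = mixed / synthetic // MixedEfficiency
      }
      compute = synthetic * (1 + eff*0.3) // ComputeScore
      return
  }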

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.14
2026-04-13 10:49:49 +03:00
02e44b1172 Fix USB/RAM status checks; add server model+S/N to dashboard; remove cycles
USB Export Drive:
  lsblk reports TRAN only for whole disks, not partitions (/dev/sdc1).
  Strip trailing partition digits to get parent disk before transport check.
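
  A sketch of the parent-disk derivation (function name is
  illustrative; NVMe-style names such as nvme0n1p1 would additionally
  need the trailing "p<digit>" handled):

    import "strings"

    // parentDisk strips trailing partition digits so the TRAN check
    // runs against the whole disk: /dev/sdc1 → /dev/sdc.
    func parentDisk(dev string) string {
        return strings.TrimRight(dev, "0123456789")
    }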

LiveCD in RAM:
  When RunInstallToRAM copies squashfs to /dev/shm/bee-live/ but bind-mount
  of /run/live/medium fails (on CD-ROM boots), /run/live/medium still shows the
  CD-ROM fstype. Add fallback: if /dev/shm/bee-live/*.squashfs exists, the
  data is in RAM — report status OK.

Dashboard Hardware Summary:
  Show server Manufacturer + ProductName as heading and S/N as subline above
  the component table, sourced from hw.Board (dmidecode system-type data).

Validate:
  Remove Cycles input — always run once. cycles=1 hardcoded in runAllSAT().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.13
2026-04-12 22:46:42 +03:00
2ceaa0d0ca Include profile and mode in benchmark task names for task list clarity
Task names now follow the pattern:
  NVIDIA Benchmark · <profile> · <mode> [· GPU <indices>]

Examples:
  NVIDIA Benchmark · standard · sequential (GPU 0, RTX 6000 Pro)
  NVIDIA Benchmark · stability · parallel
  NVIDIA Benchmark · standard · ramp 1/4 · GPU 0
  NVIDIA Benchmark · standard · ramp 2/4 · GPU 0,1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:36:51 +03:00
9482ba20a2 Remove NCCL checkbox — auto-enable interconnect step when >1 GPU selected
NCCL all_reduce is always attempted when 2+ GPUs are selected; a failure
leaves InterconnectScore=0 (no bonus, no penalty) and OverallStatus
unaffected. Exposing the checkbox implied NCCL is optional and made a
failed run look like a deliberate skip.

- Remove benchmark-run-nccl checkbox and its change listener from pages.go
- Client sends run_nccl: selected.length > 1 (automatic)
- api.go default runNCCL=true is unchanged
- Selection note now mentions NCCL automatically for multi-GPU runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:33:17 +03:00
813e2f86a9 Add scalability/ramp-up labeling, ServerPower penalty in scoring, and report improvements
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
  NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
  by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
  tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
  factor = ratio/0.75, applied per-GPU with a note explaining the reduction (sketched below)
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
  shows scalability_score when non-zero
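
The penalty arithmetic from the ServerPower item above, as a sketch:

  // serverPowerPenalty scales a per-GPU CompositeScore when IPMI
  // covered less than 75% of the window: factor = reportingRatio/0.75.
  func serverPowerPenalty(composite, reportingRatio float64) float64 {
      if reportingRatio >= 0.75 {
          return composite
      }
      return composite * (reportingRatio / 0.75)
  }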

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:30:47 +03:00
58a6da9b44 Recover power limits and SM count from nvidia-smi -q in enrichGPUInfo
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
  - Default Power Limit  → DefaultPowerLimitW
  - Current Power Limit  → PowerLimitW (fallback)
  - Multiprocessor Count → MultiprocessorCount

Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).
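
A sketch of pulling a single "Key : Value" field out of the verbose
output (illustrative only; the real enrichGPUInfoWithMaxClocks must
also handle per-GPU sections and unit suffixes):

  import "strings"

  // parseQField finds the first "key : value" line in nvidia-smi -q
  // output, e.g. "    Default Power Limit : 600.00 W" → "600.00 W".
  func parseQField(output, key string) (string, bool) {
      for _, line := range strings.Split(output, "\n") {
          k, v, found := strings.Cut(line, ":")
          if found && strings.TrimSpace(k) == key {
              return strings.TrimSpace(v), true
          }
      }
      return "", false
  }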

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:17:56 +03:00
f4a19c0a00 Add power calibration step to benchmark; fix PowerSustainScore reference
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.

Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.

Adjust composite score weights to match the new honest power reference:
  base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).
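
The fallback chain as a sketch (zero meaning "not available"):

  // powerReference picks the PowerSustainScore denominator:
  // calibrated peak → default limit → enforced limit.
  func powerReference(calibratedW, defaultW, enforcedW float64) float64 {
      for _, w := range []float64{calibratedW, defaultW, enforcedW} {
          if w > 0 {
              return w
          }
      }
      return 0
  }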

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:06:46 +03:00
9e3dcf9b4d Record host CPU/RAM config in benchmark results; check CPU load
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
  total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
  the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
  erratic (stddev >12%), with a plain-English description in the report.
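
One plausible mapping of those thresholds onto the ok / high /
unstable statuses:

  // classifyCPULoad maps the sampled statistics onto a status;
  // erratic load wins over elevated load so instability is not masked.
  func classifyCPULoad(avgPct, maxPct, stddevPct float64) string {
      switch {
      case stddevPct > 12:
          return "unstable"
      case avgPct > 20 || maxPct > 40:
          return "high"
      default:
          return "ok"
      }
  }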

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:02:04 +03:00
098e19f760 Add ramp-up mode to NVIDIA GPU benchmark
Adds a new checkbox (enabled by default) in the benchmark section.
In ramp-up mode N tasks are spawned at once: the first covers 1 GPU,
the next 2, then 3, up to all selected GPUs — each step runs its GPUs
in parallel.
NCCL runs only on the final step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.12
2026-04-12 18:34:19 +03:00
e16d0f34b5 Adjust burn GPU ramp timing by profile v7.11 2026-04-12 15:58:30 +03:00
Mikhail Chusavitin
525ed8b8fc Fix GPU clock lock normalization for Blackwell (clocks.max.* unsupported)
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire GPU inventory query
to fail and clock lock to be skipped → normalization: partial.

Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
  power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
  nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
  MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
  is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.10
2026-04-12 13:33:54 +03:00
Mikhail Chusavitin
4f94ebcb2c Add HPC tuning: PCIe ASPM off, C-states, performance CPU governor
- grub.cfg + isolinux/live.cfg.in: add pcie_aspm=off,
  intel_idle.max_cstate=1 and processor.max_cstate=1 to all
  non-failsafe boot entries
- bee-hpc-tuning: new script that sets all CPU cores to performance
  governor via sysfs (sketched below) and logs THP state at boot
- bee-hpc-tuning.service: runs before bee-nvidia and bee-audit
- 9000-bee-setup.hook.chroot: enable service and mark script executable
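
A Go sketch of the governor step (the shipped bee-hpc-tuning is a
shell script; this is an equivalent, not the script itself):

  import (
      "os"
      "path/filepath"
  )

  // setPerformanceGovernor writes "performance" into every core's
  // scaling_governor through sysfs.
  func setPerformanceGovernor() error {
      paths, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor")
      if err != nil {
          return err
      }
      for _, p := range paths {
          if err := os.WriteFile(p, []byte("performance"), 0o644); err != nil {
              return err
          }
      }
      return nil
  }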

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.9
2026-04-12 13:07:32 +03:00
Mikhail Chusavitin
05c1fde233 Warn on PCIe link speed degradation and collect lspci -vvv in techdump
- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
  and ErrorDescription when current link speed is below maximum negotiated
  speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison (sketched below)
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta
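
One plausible shape for the pcieLinkSpeedRank helper (the GT/s
prefixes follow PCIe generations; the exact string handling is an
assumption):

  import "strings"

  // pcieLinkSpeedRank maps lspci speed strings onto an ordinal so a
  // degraded LnkSta speed can be compared against LnkCap numerically.
  func pcieLinkSpeedRank(speed string) int {
      gens := []string{"2.5GT/s", "5GT/s", "8GT/s", "16GT/s", "32GT/s", "64GT/s"}
      for i, prefix := range gens {
          if strings.HasPrefix(speed, prefix) {
              return i + 1 // Gen1..Gen6
          }
      }
      return 0 // unknown
  }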

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.8
2026-04-12 12:42:17 +03:00
825ef6b98a Add USB export drive and LiveCD-in-RAM checks to Runtime Health
- schema: add ToRAMStatus and USBExportPath fields to RuntimeHealth
- platform/runtime.go: collectToRAMHealth (ok/warning/failed based on
  IsLiveMediaInRAM + toramActive) and collectUSBExportHealth (scans
  /proc/mounts + lsblk for writable USB-backed filesystems)
- pages.go: add USB Export Drive and LiveCD in RAM rows to the health table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.7
2026-04-11 10:05:27 +03:00
ba16021cdb Fix GPU model propagation, export filenames, PSU/service status, and chart perf
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
  nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
  → 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
  activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
  .tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf (sketched below)
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"
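
A sketch of the min-max downsampling idea (each bucket keeps its
extremes so spikes survive the cap; the real code would also preserve
point order and timestamps):

  // downsampleMinMax reduces a series to at most maxPts points by
  // keeping the minimum and maximum of each bucket.
  func downsampleMinMax(points []float64, maxPts int) []float64 {
      if maxPts < 2 || len(points) <= maxPts {
          return points
      }
      buckets := maxPts / 2
      out := make([]float64, 0, 2*buckets)
      for b := 0; b < buckets; b++ {
          lo, hi := b*len(points)/buckets, (b+1)*len(points)/buckets
          mn, mx := points[lo], points[lo]
          for _, v := range points[lo:hi] {
              if v < mn {
                  mn = v
              }
              if v > mx {
                  mx = v
              }
          }
          out = append(out, mn, mx)
      }
      return out
  }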

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:05:27 +03:00
Mikhail Chusavitin
bb1218ddd4 Fix GPU inventory: exclude BMC virtual VGA, show real NVIDIA model names
Two issues:
1. BMC/management VGA chips (e.g. Huawei iBMC Hi171x, ASPEED) were included
   in GPU inventory because shouldIncludePCIeDevice only checked the PCI class,
   not the device name. Added a name-based filter for known BMC/management
   patterns when the class is VGA/display/3d.

2. New NVIDIA GPUs (e.g. RTX PRO 6000 Blackwell, device ID 2bb5) showed as
   "Device 2bb5" because lspci's database lags behind. Added "name" to the
   nvidia-smi query and use it to override dev.Model during enrichment.
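
A sketch of the name-based filter (the pattern list is illustrative,
not the shipped one):

  import "strings"

  // Known BMC/management VGA name fragments.
  var bmcVGAPatterns = []string{"iBMC", "Hi171", "ASPEED"}

  // isBMCVGA reports whether a display-class PCI device looks like a
  // BMC framebuffer rather than a usable GPU.
  func isBMCVGA(name string) bool {
      for _, p := range bmcVGAPatterns {
          if strings.Contains(name, p) {
              return true
          }
      }
      return false
  }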

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.6
2026-04-10 13:57:26 +03:00
Mikhail Chusavitin
65faae8ede Remove hpl from SAT run-all targets — no backend route exists
hpl was listed in baseTargets and stressOnlyTargets but /api/sat/hpl/run
was never registered, causing a 405 Method Not Allowed (not valid JSON)
error when "Validate one by one" was triggered in stress mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:30:32 +03:00
05241f2e0e Redesign dashboard: split Runtime Health and Hardware Summary
- Runtime Health now shows only LiveCD system status (services, tools,
  drivers, network, CUDA/ROCm) — hardware component rows removed
- Hardware Summary now shows server components with readable descriptions
  (model, count×size) and component-status.json health badges
- Add Network Adapters row to Hardware Summary
- SFP module static info (vendor, PN, SN, connector, type, wavelength)
  now collected via ethtool -m regardless of carrier state
- PSU statuses from IPMI audit written to component-status.json so PSU
  badge shows actual status after first audit instead of UNKNOWN

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.5
2026-04-09 23:41:23 +03:00
Mikhail Chusavitin
c1690a084b Fix app tests that mutate global defaults v7.4 2026-04-09 15:28:25 +03:00
Mikhail Chusavitin
9481ca2805 Add staged NVIDIA burn ramp-up mode 2026-04-09 15:21:14 +03:00
Mikhail Chusavitin
a78fdadd88 Refine validate and burn profile layout 2026-04-09 15:14:48 +03:00
Mikhail Chusavitin
4ef403898f Tighten NVIDIA GPU PCI detection 2026-04-09 15:14:48 +03:00
025548ab3c UI: amber accents, smaller wallpaper logo, new support bundle name, drop display resolution
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
  using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.3
2026-04-08 21:37:01 +03:00
Mikhail Chusavitin
e0d94d7f47 Remove HPL from build and audit flows 2026-04-08 10:00:23 +03:00
Mikhail Chusavitin
13899aa864 Drop incompatible HPL git fallback 2026-04-08 09:50:58 +03:00
Mikhail Chusavitin
f345d8a89d Build HPL serially to avoid upstream make races 2026-04-08 09:47:35 +03:00
Mikhail Chusavitin
4715059ac0 Fix HPL MPI stub header and keep full build logs 2026-04-08 09:45:14 +03:00
Mikhail Chusavitin
0660a40287 Harden HPL builder cache and runtime libs 2026-04-08 09:40:18 +03:00
Mikhail Chusavitin
67369d9b7b Fix OpenBLAS package lookup in HPL build 2026-04-08 09:32:49 +03:00
Mikhail Chusavitin
3f41a026ca Add resilient HPL source fallbacks v7.1 v7.2 2026-04-08 09:25:31 +03:00
Mikhail Chusavitin
0ee4f46537 Restore MOTD-style ASCII wallpaper 2026-04-08 09:14:27 +03:00
8db40b098a Update bible submodule
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v7.0
2026-04-08 07:14:31 +03:00
16e7ae00e7 Add HPL (LINPACK) benchmark as validate/stress task
HPL 2.3 from netlib compiled against OpenBLAS with a minimal
single-process MPI stub — no MPI package required in the ISO.
Matrix size is auto-sized to 80% of total RAM at runtime.
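
Since HPL factorises one dense N×N double-precision matrix (8 bytes
per element), targeting 80% of RAM gives N ≈ sqrt(0.8·RAM/8). A
sketch, with rounding down to the NB=256 block size as an assumption:

  import "math"

  // hplProblemSize derives N for HPL.dat from total RAM: 8·N² bytes
  // ≈ 80% of memory, rounded down to a multiple of NB=256.
  func hplProblemSize(totalRAMBytes uint64) int {
      n := int(math.Sqrt(0.8 * float64(totalRAMBytes) / 8))
      return n / 256 * 256
  }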

Build:
- VERSIONS: HPL_VERSION=2.3, HPL_SHA256=32c5c17d…
- build-hpl.sh: downloads HPL + OpenBLAS from Debian 12 repo,
  compiles xhpl with a self-contained mpi_stub.c
- build.sh: step 80-hpl, injects xhpl + libopenblas into overlay

Runtime:
- bee-hpl: generates HPL.dat (N auto from /proc/meminfo, NB=256,
  P=1 Q=1), runs xhpl, prints standard WR... Gflops output
- platform/hpl.go: RunHPL(), parses WR line → GFlops + PASSED/FAILED
- tasks.go: target "hpl"
- pages.go: LINPACK (HPL) card in validate/stress grid (stress-only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 07:08:18 +03:00
b2f8626fee Refactor validate modes, fix benchmark report and IPMI power
- Replace diag level 1-4 dropdown with Validate/Stress radio buttons
- Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short
- Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended
- Parallel GPU mode: spawn single task for all GPUs instead of splitting per model
- Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel
- Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts
- Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading')

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v6.5
2026-04-08 00:42:12 +03:00