reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Michael Chus	884988cb2a	Fix audit hang on SAS HBAs: skip scsi host scan for SAS hosts Writing to /sys/class/scsi_host/hostX/scan on SAS controllers (e.g. Adaptec smartpqi/PM8222-SHBA) triggers sas_user_scan which blocks indefinitely, causing the audit to hang forever. Skip hosts that appear under /sys/class/sas_host/ — SAS topology is discovered by the driver. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 18:50:20 +03:00
Michael Chus	963bc960ca	Fix SATA discovery, add NVLink bridge detection, add infiniband-diags - storage: add jsonInt64 dual-format unmarshaler to handle lsblk output change in util-linux 2.38 (LOG-SEC/PHY-SEC now emitted as JSON integers, not quoted strings); fixes SATA disks invisible on Debian 12 - pcie: detect NVLink bridge mezzanine CX-7 cards (Mellanox x2, no host net ifaces, DeviceName contains "NVLINK" in lspci -v) and mark them with device_class="NVLinkBridge"; escalate PCIe link speed downgrade to Critical for these cards (Gen3 on a fixed internal connector = hardware fault, not a transient warning) - pcie: cross-reference nvidia-smi topo to capture NVLink bond counts and active status for all NVLink bridge cards - packages: add infiniband-diags to ISO; provides ibstat required by nvidia-fabricmanager-start.sh to enumerate IB devices before FM launch (absence causes CUDA_ERROR_SYSTEM_NOT_READY) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 20:57:04 +03:00
Michael Chus	4f6579e040	Fix Runtime Health criteria: network, services, nvidia-fabricmanager Network: green if at least one interface has IPv4 (drop PARTIAL state). Bee Services: treat inactive as OK — oneshot services (bee-sshsetup, bee-preflight, bee-network, bee-audit, etc.) complete successfully and exit to inactive; only failed is a real problem. nvidia-fabricmanager: add ExecCondition=bee-check-nvswitch drop-in so the service is silently skipped (inactive, not failed) on systems without NVSwitch hardware (e.g. H200 NVL with direct NVLink, no NVSwitch chips). bee-check-nvswitch detects NVSwitch via lspci (vendor 10de, class 0680). bee-nvidia.service: add ConditionPathExists=/usr/local/bin/bee-nvidia-load so the unit is a no-op if somehow present in a non-nvidia build. bee-boot-status: read /etc/bee-gpu-vendor and exclude bee-nvidia from CRITICAL/ALL on non-nvidia builds, preventing boot hang if the unit is unexpectedly present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 05:20:25 +03:00
Michael Chus	dc07580adc	Add AER decode, event counter, and sparkline to component detail modal - decodeAERStatus: parses aer_status hex from kernel error strings and maps PCIe AER register bits to human-readable names with correctable/ uncorrectable classification (e.g. "Receiver Error, Replay Timer Timeout (correctable)") - renderSparkline: 100px inline SVG showing non-OK events over time, bars positioned proportionally to timestamp; evenly spaced when timestamps coincide - renderComponentDetail: shows event count badge and sparkline in the component header row; decoded AER line appears below the raw error summary Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-13 23:54:54 +03:00
Mikhail Chusavitin	805a3b277d	Track PCIe AER correctable errors; fix GPU status key routing Add nvidia-aer-correctable and pcie-aer-correctable patterns to catch "bus correctable error" events seen in SEL (Critical Interrupt / offset 7). Both patterns carry severity "warning" — correctable errors are hardware-recovered and should not flag a card as failed. Fix kmsg_watcher routing: GPU-category events were keyed as pcie:<BDF> but the UI queries for pcie:gpu: prefix. Split the switch so "gpu" → pcie:gpu:<BDF> and "pcie" → pcie:<BDF>. This applies to both flushWindow (SAT-window path) and flushImmediate (always-on path). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-08 12:50:14 +03:00
Mikhail Chusavitin	0939a647ea	Fix component detail modal: replace dead hx-* with fetch-based JS HTMX was never loaded on the page, so hx-get on the component label spans was dead code — the dialog opened empty. Replace with a plain openComponentDetail() fetch call. Also fix dialog positioning broken by the CSS reset (*{margin:0} overrode the UA margin:auto that centers <dialog>). Replace card hx-trigger polling with a setInterval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-08 10:53:20 +03:00
Mikhail Chusavitin	ae80d7711e	Add continuous hardware health monitoring and component detail view - kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times, not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB - New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source) - Hardware Summary card auto-refreshes every 30s via htmx without page reload - Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal with per-component status, source, timestamp and last 20 history entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 09:56:39 +03:00
Michael Chus	7b4bcc745a	Split live rootfs into smaller squashfs layers	2026-05-03 23:15:22 +03:00
Michael Chus	6c2b188ec9	Add no-GUI boot mode and quieter boot diagnostics	2026-05-03 21:14:45 +03:00
Michael Chus	14505ef24a	Remove easy bee ASCII logo banners	2026-05-03 21:07:13 +03:00
Michael Chus	cac5b9c86e	Detach install media after install-to-ram	2026-05-03 14:16:45 +03:00
Michael Chus	0e39e7d960	Make toram default and add install-to-ram CLI	2026-05-03 14:07:47 +03:00
Mikhail Chusavitin	58d6da0e4f	Fix live task logs and SAT windows	2026-04-30 17:26:45 +03:00
Mikhail Chusavitin	7ce73e34a4	Add NVMe block format tool	2026-04-30 16:27:25 +03:00
Mikhail Chusavitin	2c22b01fe3	Fix IPMI hangs, add VROC license, fix blackbox service, drop qrencode IPMI hang fix (Lenovo XCC SR650 V3): - Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag - Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found - collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list) - collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts VROC license detection: - Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform - Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot Blackbox service fix: - bee-blackbox.service was missing from systemctl enable list in ISO build hook - Service never started on boot; state file never written; UI button stayed "Enable" Drop qrencode: - Remove from package list, standardTools API check, and runtime-flows doc Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 10:46:59 +03:00
Mikhail Chusavitin	ec89616585	Add storage block geometry to audit and viewer	2026-04-29 17:39:11 +03:00
Mikhail Chusavitin	7c504e5056	Collect IOMMU group per PCIe device from sysfs Reads the iommu_group symlink for each BDF and exposes the group number as iommu_group in the hardware snapshot JSON. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 12:34:54 +03:00
Mikhail Chusavitin	11d00b9442	Document read-only submodules policy	2026-04-29 09:54:23 +03:00
Mikhail Chusavitin	2163017a98	Collect and report storage telemetry	2026-04-29 09:40:58 +03:00
Michael Chus	29179917c3	Add USB blackbox log mirroring service	2026-04-24 10:20:12 +03:00
Michael Chus	be4b439804	Commit remaining workspace changes	2026-04-23 20:32:26 +03:00
Michael Chus	749fc8a94d	Unify NVIDIA GPU recovery paths	2026-04-23 20:31:41 +03:00
Mikhail Chusavitin	6b5d22c194	chore(git): ignore local audit binary	2026-04-20 13:21:35 +03:00
Mikhail Chusavitin	679aeb9947	Run NVIDIA DCGM diag tests on all selected GPUs simultaneously targeted_stress, targeted_power, and the Level 2/3 diag were dispatched one GPU at a time from the UI, turning a single dcgmi command into 8 sequential ~350–450 s runs. DCGM supports -i with a comma-separated list of GPU indices and runs the diagnostic on all of them in parallel. Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate. Update sat.go constants and page_validate.go estimates to reflect all-GPU simultaneous execution (remove n× multiplier from total time estimates). Stress test on 8-GPU system: ~5.3 h → ~2.5 h. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 11:53:25 +03:00
Mikhail Chusavitin	4af997f436	Update audit bee binary	2026-04-20 10:55:42 +03:00
Mikhail Chusavitin	6caace0cc0	Make power benchmark report phase-averaged	2026-04-20 10:53:53 +03:00
Mikhail Chusavitin	5f0103635b	Update power benchmark GPU reset flow	2026-04-20 09:46:00 +03:00
Mikhail Chusavitin	84a2551dc0	Fix NVIDIA self-heal recovery flow	2026-04-20 09:43:22 +03:00
Mikhail Chusavitin	1cfabc9230	Reset GPUs before power benchmark	2026-04-20 09:42:19 +03:00
Mikhail Chusavitin	5dc711de23	Start power calibration from full GPU TDP	2026-04-20 09:28:58 +03:00
Mikhail Chusavitin	ab802719f8	Use real NVIDIA power-limit bounds in benchmark	2026-04-20 09:26:56 +03:00
Mikhail Chusavitin	a94e8007f8	Ignore power throttling in benchmark calibration	2026-04-20 09:26:29 +03:00
Michael Chus	c69bf07b27	Commit remaining workspace changes	2026-04-20 07:02:31 +03:00
Michael Chus	b3cf8e3893	Globalize autotuned system power source	2026-04-20 07:02:12 +03:00
Michael Chus	17118298bd	audit: switch power benchmark load to dcgmproftester	2026-04-20 06:57:14 +03:00
Michael Chus	65bcc9ce81	refactor(webui): split pages into task modules	2026-04-20 06:56:52 +03:00
Michael Chus	0cdfbc5875	fix(iso): restore boot UX and boot logs	2026-04-19 23:08:09 +03:00
Michael Chus	cf9b54b600	Use last ramp-step SDR snapshot for PSU loaded power; add deploy script - benchmark.go: retain sdrLastStep from final ramp step instead of re-sampling after test when GPUs are already idle - scripts/deploy.sh: build+deploy bee binary to remote host over SSH Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 21:26:44 +03:00
Michael Chus	0bfb3fe954	Use PSU SDR sum for system power chart when available DCMI reports only the managed power domain (~CPU+MB), missing GPU draw. PSU AC input sensors cover full wall power. When samplePSUPower returns data, sum the slots for PowerW; fall back to DCMI otherwise. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:10:01 +03:00
Michael Chus	3053cb0710	Fix PSU slot regex: match MSI underscore format PSU1_POWER_IN \b does not fire between a digit and '_' because '_' is \w in RE2. The pattern \bpsu?\s*([0-9]+)\b never matched PSU1_POWER_IN style sensors, so parsePSUSDR (and PSUSlotsFromSDR / samplePSUPower) returned empty results for MSI servers — causing all power graphs to fall back to DCMI which reports ~half actual draw. Added an explicit underscore-terminated pattern first in the list and tests covering the MSI format. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:03:02 +03:00
Michael Chus	e35484013e	Use SDR PSU AC input for single-card calibration server power Same fix as ramp steps: take sdrSingle snapshot after calibration and prefer PSUInW over DCMI for singleIPMILoadedW. DCMI kept as fallback. Log message indicates source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:44:13 +03:00
Michael Chus	2cdf034bb0	Use SDR PSU AC input for per-step server power in power ramp When sdrStep.PSUInW is available, prefer it over DCMI for ramp.ServerLoadedW and ServerDeltaW. DCMI on this platform (MSI 4-PSU) reports ~half actual draw; SDR sums all PSU_POWER_IN sensors correctly. Delta is now SDR-to-SDR (sdrStep.PSUInW - sdrIdle.PSUInW) for consistency. DCMI path kept as fallback when SDR has no PSU data. Log message now indicates the source (SDR PSU AC input vs DCMI). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:43:36 +03:00
Michael Chus	b89580c24d	Fix PSU power chart: use name-based SDR matching instead of entity ID MSI servers place PSU_POWER_IN/OUT sensors on entity 3.0, not 10.N (the IPMI "Power Supply" entity). The old parser filtered by entity ID and found nothing, so the dashboard fell back to DCMI which reports roughly half the actual draw. Now delegates to collector.PSUSlotsFromSDR — the same name-based matching already used in the Power Fit benchmark. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:39:21 +03:00
Michael Chus	df1385d3d6	Fix dcgmproftester parallel mode: use staggered script for all multi-GPU runs A single dcgmproftester process without -i only loads GPU 0 regardless of CUDA_VISIBLE_DEVICES. Now always routes multi-GPU runs through bee-dcgmproftester-staggered (--stagger-seconds 0 for parallel mode), which spawns one process per GPU so all GPUs are loaded simultaneously. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:31:34 +03:00
Michael Chus	f8cd9a7376	Rework Power Fit report: 90 min stability, aligned tables, PSU/fan sections - Increase stability profile duration from 33 min to 90 min by wiring powerBenchDurationSec() into runBenchmarkPowerCalibration (was discarded) - Collect per-step PSU slot readings, fan RPM/duty, and per-GPU telemetry in ramp loop; add matching fields to NvidiaPowerBenchStep/NvidiaPowerBenchGPU - Rewrite renderPowerBenchReport: replace Per-Slot Results with Single GPU section, rework Ramp Sequence rows=runs/cols=GPUs, add PSU Performance section (conditional on IPMI data), add transposed Single vs All-GPU comparison table in per-GPU sections - Add fmtMDTable helper (benchmark_table.go) and apply to all tables in both power and performance reports so columns align in plain-text view Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:04:12 +03:00
Michael Chus	d52ec67f8f	Stability hardening, build script fixes, GRUB bee logo Stability hardening (webui/app): - readFileLimited(): OOM protection when reading the audit JSON (100 MB), the component-status DB (10 MB), and the task log (50 MB) - jobs.go: buffered task log, one open fd per task instead of open/write/close for every line (eliminates thousands of syscalls/sec during GPU stress tests) - stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s), reset after a successful run longer than 30s, restart counter in slog - kill_workers.go: 5s timeout on the /proc scan, warn when it fires - bee-web.service: MemoryMax=3G as protection against the OOM killer Build script: - build.sh: removed the block that generated grub-pc/grub.cfg + live.cfg.in (dead code since v8.25); grub-pc is ignored by live-build, and the generated live.cfg.in overwrote the correct static file with a stale version lacking the kernel tuning parameters and the gsp-off/kms+gsp-off entries - build.sh: dump_memtest_debug now logs grub-efi/grub.cfg instead of grub-pc/grub.cfg (which was always "missing") GRUB: - live-theme/bee-logo.png: bee logo, 400×400px on a black background - live-theme/theme.txt: added an image component centered in the upper third of the screen; menu moved from 62% to 65% Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 13:08:31 +03:00
Michael Chus	61c7abaa80	Add multi-source PSU power triangulation and per-slot distribution table - collector/psu.go: export PSUSlotsFromSDR() reusing slot regex patterns; add isPSUInputPower/isPSUOutputPower helpers covering MSI/MLT/xFusion/HPE naming; add xFusion Power<N> slot pattern; parseBoundedFloat for self-healing (rejects zero/negative/out-of-range sensor readings); default fallback treats unclassified PSU sensors as AC input - benchmark_types.go: BenchmarkPSUSlotPower struct; BenchmarkServerPower gains PSUInputIdle/Loaded, PSUOutputIdle/Loaded, PSUSlotReadingsIdle/Loaded, GPUSlotTotalW, DCMICoverageRatio fields - benchmark.go: sampleIPMISDRPowerSensors uses collector.PSUSlotsFromSDR instead of custom classifier; detectDCMIPartialCoverage replaces ramp heuristic — compares DCMI idle vs SDR PSU sum, flags <0.70 ratio as partial coverage; detectIPMISaturationFallback kept for servers without SDR PSU sensors; report gains PSU Load Distribution table (per-slot AC/DC idle vs loaded, Δ) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 13:07:48 +03:00
Michael Chus	52c3a24b76	Compact metrics DB in background to prevent CPU spin under load As metrics.db grew (1 sample/5 s × hours), handleMetricsChartSVG called LoadAll() on every chart request — loading all rows across 4 tables through a single SQLite connection. With ~10 charts auto-refreshing in parallel, requests queued behind each other, saturating the connection pool and pegging a CPU core. Fix: add a background compactor that runs every hour via the metrics collector: • Downsample: rows older than 2 h are thinned to 1 per minute (keep MIN(ts) per ts/60 bucket) — retains chart shape while cutting row count by ~92 %. • Prune: rows older than 48 h are deleted entirely. • After prune: WAL checkpoint/truncate to release disk space. LoadAll() in handleMetricsChartSVG is unchanged — it now stays fast because the DB is kept small rather than capping the query window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 15:28:05 +03:00
Michael Chus	028bb30333	Detect PSU faults during perf and power benchmarks Snapshot IPMI "Power Supply" sensor states before and after each benchmark run. Compare before/after to surface only new anomalies (pre-existing faults are excluded). Results land in NvidiaBenchmarkResult.PSUIssues and NvidiaPowerBenchResult.PSUIssues (JSON: psu_issues) and are printed in the text benchmark report under a "PSU Issues" section. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 15:08:41 +03:00
Michael Chus	7d64e5d215	Fix two stale failing tests - TestHandleAPIBenchmarkPowerFitRampQueuesBenchmarkPowerFitTasks: ramp-up mode intentionally creates a single task (the runner handles 1→N internally to avoid redundant repetition of earlier ramp steps). Updated the test to expect 1 task and verify RampTotal=3 instead of asserting 3 separate tasks. - TestBenchmarkPageRendersSavedResultsTable: benchmark page used "Performance Results" as heading while the test looked for "Perf Results". Aligned the page heading with the shorter label used everywhere else (task reports, etc.). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 15:07:27 +03:00
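
The word-boundary pitfall described in commit 3053cb0710 can be reproduced in a few lines of Go. This is a minimal sketch, not the repository's code: the `(?i)` flag, the `underscorePattern` variable, and the `slot` helper are assumptions made for illustration; only the `\bpsu?\s*([0-9]+)\b` pattern and the `PSU1_POWER_IN` sensor name come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
)

// oldPattern is the pattern quoted in the commit message.
// The case-insensitive flag is an assumption.
var oldPattern = regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`)

// underscorePattern is a hypothetical underscore-terminated variant,
// tried first, as the commit message describes.
var underscorePattern = regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)_`)

// slot returns the PSU slot number found in a sensor name, if any.
func slot(name string) (string, bool) {
	for _, re := range []*regexp.Regexp{underscorePattern, oldPattern} {
		if m := re.FindStringSubmatch(name); m != nil {
			return m[1], true
		}
	}
	return "", false
}

func main() {
	// '_' is a word character, so there is no \b between '1' and '_'.
	fmt.Println(oldPattern.MatchString("PSU1_POWER_IN")) // false
	fmt.Println(oldPattern.MatchString("PSU1 Power In")) // true

	s, ok := slot("PSU1_POWER_IN")
	fmt.Println(s, ok) // 1 true
}
```

Trying the underscore-terminated pattern first keeps the original pattern as a fallback for space-separated sensor names.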
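
The decoded line quoted in commit dc07580adc ("Receiver Error, Replay Timer Timeout (correctable)") corresponds to bits 0 and 12 of the PCIe AER Correctable Error Status register. The sketch below illustrates that mapping only. The `aer_status: 0x...` log format, the `decodeCorrectable` name, and the decision to treat the value as a correctable status are assumptions; the repository's `decodeAERStatus` also classifies uncorrectable errors, which this sketch omits.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// aerStatusRe extracts the hex status from a kernel error string
// (assumed format: "aer_status: 0x00001001").
var aerStatusRe = regexp.MustCompile(`aer_status:\s*0x([0-9a-fA-F]+)`)

// correctableBits lists a subset of the Correctable Error Status bits.
var correctableBits = []struct {
	bit  uint
	name string
}{
	{0, "Receiver Error"},
	{6, "Bad TLP"},
	{7, "Bad DLLP"},
	{8, "REPLAY_NUM Rollover"},
	{12, "Replay Timer Timeout"},
	{13, "Advisory Non-Fatal Error"},
}

// decodeCorrectable maps the set bits to human-readable names.
func decodeCorrectable(line string) (string, bool) {
	m := aerStatusRe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	status, err := strconv.ParseUint(m[1], 16, 32)
	if err != nil {
		return "", false
	}
	var names []string
	for _, b := range correctableBits {
		if status&(1<<b.bit) != 0 {
			names = append(names, b.name)
		}
	}
	if len(names) == 0 {
		return "", false
	}
	return strings.Join(names, ", ") + " (correctable)", true
}

func main() {
	s, _ := decodeCorrectable("AER: aer_status: 0x00001001, aer_mask: 0x00000000")
	fmt.Println(s) // Receiver Error, Replay Timer Timeout (correctable)
}
```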
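
The downsampling rule from commit 52c3a24b76 (keep `MIN(ts)` per `ts/60` bucket for old rows) can be illustrated in memory, without SQLite. This is a sketch of the rule only; the `sample` type and the `downsample` function are hypothetical, and it assumes rows are sorted by ascending timestamp so the first row seen in a bucket is the one with the minimum `ts`.

```go
package main

import "fmt"

// sample is a hypothetical metrics row.
type sample struct {
	TS    int64 // Unix seconds
	Value float64
}

// downsample keeps rows at or after cutoff untouched. For older rows it
// keeps only the earliest sample in each bucket of bucketSec seconds.
// Rows must be sorted by ascending TS.
func downsample(rows []sample, cutoff, bucketSec int64) []sample {
	seen := make(map[int64]bool)
	out := make([]sample, 0, len(rows))
	for _, r := range rows {
		if r.TS >= cutoff {
			out = append(out, r)
			continue
		}
		b := r.TS / bucketSec
		if seen[b] {
			continue
		}
		seen[b] = true
		out = append(out, r)
	}
	return out
}

func main() {
	// One hour of data at 1 sample per 5 s, all older than the cutoff.
	var rows []sample
	for ts := int64(0); ts < 3600; ts += 5 {
		rows = append(rows, sample{TS: ts, Value: float64(ts)})
	}
	kept := downsample(rows, 3600, 60)
	removed := 100 * float64(len(rows)-len(kept)) / float64(len(rows))
	fmt.Printf("%d -> %d rows (%.1f%% removed)\n", len(rows), len(kept), removed)
	// prints: 720 -> 60 rows (91.7% removed)
}
```

At 1 sample per 5 s, each minute holds 12 rows and one survives, which matches the roughly 92 % reduction stated in the commit message.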

261 Commits