reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Mikhail Chusavitin	24f2e65b6e	Add unified FRU/Elabel card with Huawei iBMC OEM IPMI support Replaces separate IPMI FRU and SAA DMI cards with a single FRU / Elabel card that reads all available sources in parallel and shows each field with a color-coded source chip (IPMI FRU / Huawei iBMC / SAA DMI). Huawei elabel fields are read/written via OEM IPMI raw commands (NetFn 0x30, cmd 0x90) with 19-byte chunking protocol, matching the FusionServer ElabelTool V511 wire format. Covers DeviceName, DeviceSerialNumber, ProductName, ProductSerialNumber, ProductAssetTag, ProductManufacturer, MainboardManufacturer, BoardProductName, ChassisPartnumber, ChassisType (read-only), IOChassisSerial, IOChassisAssetTag, and GUID (read-only via standard 0x06 0x08). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 15:29:07 +03:00
Mikhail Chusavitin	cf29131116	Rework FRU and DMI editors: per-row inline save, all fields editable - Replace global Save button with per-row ✓ (save) / ✗ (cancel) buttons that appear only when a field is changed - All fields shown as editable inputs; server rejects unknown fields with a clear error message instead of hiding them in the UI - Monospace font and 1.5px border for all value inputs - Server-side name→area/index lookup for fields sent without area - SAA DMI card: same per-row UX, confirm dialog kept (requires reboot) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 09:30:39 +03:00
Mikhail Chusavitin	13e6324853	Fix IPMI FRU editable field detection for abbreviated ipmitool names ipmitool fru print on some BMC implementations returns short names ("Chassis Serial", "Board Mfg", "Board Product", "Board Serial", "Product Serial") instead of the full names in the vendor doc. Add both variants to fruEditableFields so all fields are editable regardless of which naming convention the BMC uses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 09:24:15 +03:00
Mikhail Chusavitin	892ef6fb7d	Add Reboot and Shutdown buttons to Settings page POST /api/system/reboot → systemctl reboot POST /api/system/shutdown → systemctl poweroff Both require confirm() before executing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 09:18:30 +03:00
Mikhail Chusavitin	ce46a97975	Remove duplicate Blackbox Logging card from Settings page The USB Black-Box card already provides enable/disable per device. The standalone Blackbox Logging card was non-functional and redundant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 09:15:31 +03:00
Mikhail Chusavitin	258ecb3453	Add RAID Controller Management to Tools page Unified card for LSI/Broadcom and Intel VROC controllers: auto-detects foreign configurations and warns the operator with Import/Clear actions; allows creating RAID 1 mirrors from unconfigured drives regardless of controller type. Live output streams via SSE into an inline terminal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 08:58:19 +03:00
Mikhail Chusavitin	cbb0d1e522	Collect IPMI sensors, SEL and dmesg errors into audit JSON and support bundle - audit JSON: IPMI sensor readings (ipmitool sensor) merged into hardware.sensors alongside lm-sensors data - audit JSON: IPMI SEL entries (ipmitool sel list) in hardware.event_logs with source "ipmi-sel" - audit JSON: dmesg error/warning lines in hardware.event_logs with source "dmesg" (filtered by error/warn/AER/Xid/NVRM/ECC/panic patterns) - support bundle: added ipmitool-sensor.txt, ipmitool-sel.txt, ipmitool-sel-time.txt to techdump - saa_dmi.go: fix dmiItemRE to accept SHN with parentheses (e.g. PS(4)LC for PSU fields) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 08:41:37 +03:00
Mikhail Chusavitin	bab941ccf1	Fix SAA: set CWD=/usr/local/bin; include all SAA package binaries - saa_dmi.go: set cmd.Dir=/usr/local/bin on all saa exec calls so acpica_bin/acpidump is found relative to correct working directory - build.sh: copy all saa companion dirs (acpica_bin, ExternalData, tool, stunnel, GO_SNMP) to /usr/local/bin/ preserving structure - iso/vendor: add acpica_bin/acpiexec, ExternalData/, tool/gpu/nVidia/x64/, tool/USBController/, stunnel/, GO_SNMP/ from SAA 1.5.0 release package Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 08:24:50 +03:00
Mikhail Chusavitin	b49c71a980	Add IPMI FRU editor to Tools page - New card "IPMI — FRU" on Tools page (device 0, in-band) - Read: GET /api/tools/ipmi-fru → ipmitool fru print 0 → editable table - Editable fields: chassis (part#, serial, extra), board (mfr, product, serial, part#), product (mfr, name, part#, version, serial); read-only fields displayed as text - Write: POST /api/tools/ipmi-fru/write → task → backup to fru-backups/ → ipmitool fru edit per field - Dirty tracking + Save (N changed) button, same UX as Supermicro DMI card Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 08:13:35 +03:00
Mikhail Chusavitin	85d1acdaa3	Split validate/stress into separate fixed-mode pages - Check (2): validate mode only — no mode switcher, no stress-only cards (nvidia-targeted-stress, nvidia-targeted-power, nvidia-pulse hidden) - Load (3): stress mode only — no mode switcher, all cards shown - satStressMode() hardcoded per page; satModeChanged() removed - Profile card with radio buttons removed from both pages - Replaced with simple Run All button + est. time Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 19:12:17 +03:00
Mikhail Chusavitin	a2d7513153	Restructure nav to Load/Burn/Benchmark; fix SAA acpidump dependency - Nav steps 3-5: Load (validate), Burn (burn-in), Benchmark (speed+endurance merged) - /load now renders validate mode; /burn renders burn-in; /benchmark replaces /speed+/endurance - Legacy redirects updated: /validate→/load, /burn-in→/burn, /speed+/endurance→/benchmark - Add acpica_bin/acpidump from SAA 1.5.0 package; required by saa GetDmiInfo (ExitCode 8) - build.sh copies acpica_bin/acpidump to /usr/local/bin/acpica_bin/ alongside saa Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 19:07:51 +03:00
Mikhail Chusavitin	5b5d8609d3	Refactor nav: remove numbers from Tools/Settings, add separator and Tasks item - Remove "6." / "7." prefixes from Tools and Settings nav labels and page titles - Add a horizontal separator (nav-sep) before the Tools/Settings group - Move Tasks into the nav as a regular nav-item after the separator, replacing the separate tasks-nav-btn at the sidebar bottom - Tasks item retains the active-count badge (tasks-nav-count) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 17:54:54 +03:00
Mikhail Chusavitin	e7442972d1	Move session-scoped LiveCD tools from Tools to Settings Tools page now contains only NVMe Block Format and Supermicro - DMI. Moved to Settings (7): - System Install (Install to RAM + Install to Disk) - Support Bundle + USB Black-Box - Tool Check - NVIDIA Self Heal (replaces simple NVIDIA Recovery card) - Network - Services Update TestToolsPageRendersNvidiaSelfHealSection to assert the moved cards on /settings instead of /tools. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 17:52:19 +03:00
Mikhail Chusavitin	4c6daa1c5e	Add SAA binary to ISO vendor, rename card to Supermicro - DMI - Extract saa 1.5.0 (Linux x86_64) into iso/vendor/saa — baked into ISO at /usr/local/bin/saa via the existing vendor loop in build.sh - Add saa to the vendor tool loop in iso/builder/build.sh - Rename the web UI card from "SAA - DMI" to "Supermicro - DMI" - Remove the redundant description hint about saa on PATH Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 17:49:12 +03:00
Mikhail Chusavitin	8149360410	Fix SAA DMI parser to match real DMI.txt format Replace the guessed pipe/key=value parser with the correct format documented in SAA User Guide 4.8.1: [Section] Item Name {SHN} = "value" // comment Handles string values (strips surrounding quotes), non-string values (UUID, hex), section headers for display names, version line, and // comments. Verified against the SAA 1.5.0 User Guide sample. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 15:58:02 +03:00
Mikhail Chusavitin	4262c5b798	Add SAA DMI editor to Tools page Adds a new card to the web UI Tools page for reading and editing DMI fields via SAA (In-Band). Reads current DMI configuration with GetDmiInfo, displays all fields as an editable table, and applies only the changed fields via EditDmiInfo + ChangeDmiInfo. Backs up the original DMI file to dmi-backups/ before any write, making it available in the support bundle for rollback. Also adds "saa" to the standard tool check list. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 15:50:42 +03:00
Mikhail Chusavitin	271dadda03	Restructure web UI navigation into 7 numbered workflow stages Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark, Tasks, Tools) with a numbered progression that guides engineers through a logical acceptance workflow: Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed → 5. Endurance → 6. Tools → 7. Settings Key changes: - layout.go: numbered nav labels, new hrefs, Tasks removed from nav and replaced with a persistent sidebar badge (polls /api/tasks every 5 s, highlights amber when tasks are active) - server.go: 301 redirects from /validate→/check, /burn→/load, /benchmark→/speed for backward compatibility - pages.go: dispatch cases for all new routes; old routes kept as fallbacks - page_validate.go: add renderCheck() — non-destructive check page with validate-mode tests only (no stress toggle, no targeted-stress/ targeted-power/pulse cards) - page_burn.go: add renderLoad() wrapper; update scope alert to reference /check instead of /validate - page_benchmark.go: add renderSpeed() (performance focus) and renderEndurance() (stability/overnight focus) wrappers - page_settings.go: new Settings page with blackbox logging toggle, NVIDIA driver reset, and build info - server_test.go: update five tests to use new route names and content expectations Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 11:00:02 +03:00
Michael Chus	7d2e904d14	Bring codebase into compliance with bible contracts (A–E) A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to HardwareSnapshot. B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go; replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice, isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate) with numeric vendor_id comparisons. C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go, app_network.go, app_services.go, app_packs.go, app_install.go — no logic changes. D (go-code-style): wrap bare return err in interfaceAdminState and interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including the interface name. E (go-project-bible): add bible-local/architecture/data-model.md and bible-local/architecture/api-surface.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-13 14:32:08 +03:00
Michael Chus	e169a7722c	Fix NVMe SMART status always Unknown; fix GPU count including NVSwitches nvme-cli emits smart-log counters as JSON strings and uses field names avail_spare / percent_used instead of the prose names in the NVMe spec. The nvmeSmartLog struct had int64 fields with wrong JSON tags — Unmarshal returned an error and the whole health block was skipped, leaving every NVMe drive with status=Unknown. Fix: switch all numeric fields to jsonInt64 (already used for lsblk block sizes) which accepts both bare numbers and quoted strings, and correct the avail_spare / percent_used tag names. Also fix validateIsVendorGPU for NVIDIA: previously counted any NVIDIA PCIe device (including NVSwitch bridges) as a GPU, producing wrong estimates (12 instead of 8 on an HGX H100 system). Now requires device_class to be videocontroller or processingaccelerator, matching the existing AMD filter logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 18:06:32 +03:00
Michael Chus	4f6579e040	Fix Runtime Health criteria: network, services, nvidia-fabricmanager Network: green if at least one interface has IPv4 (drop PARTIAL state). Bee Services: treat inactive as OK — oneshot services (bee-sshsetup, bee-preflight, bee-network, bee-audit, etc.) complete successfully and exit to inactive; only failed is a real problem. nvidia-fabricmanager: add ExecCondition=bee-check-nvswitch drop-in so the service is silently skipped (inactive, not failed) on systems without NVSwitch hardware (e.g. H200 NVL with direct NVLink, no NVSwitch chips). bee-check-nvswitch detects NVSwitch via lspci (vendor 10de, class 0680). bee-nvidia.service: add ConditionPathExists=/usr/local/bin/bee-nvidia-load so the unit is a no-op if somehow present in a non-nvidia build. bee-boot-status: read /etc/bee-gpu-vendor and exclude bee-nvidia from CRITICAL/ALL on non-nvidia builds, preventing boot hang if the unit is unexpectedly present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 05:20:25 +03:00
Michael Chus	dc07580adc	Add AER decode, event counter, and sparkline to component detail modal - decodeAERStatus: parses aer_status hex from kernel error strings and maps PCIe AER register bits to human-readable names with correctable/ uncorrectable classification (e.g. "Receiver Error, Replay Timer Timeout (correctable)") - renderSparkline: 100px inline SVG showing non-OK events over time, bars positioned proportionally to timestamp; evenly spaced when timestamps coincide - renderComponentDetail: shows event count badge and sparkline in the component header row; decoded AER line appears below the raw error summary Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-13 23:54:54 +03:00
Mikhail Chusavitin	805a3b277d	Track PCIe AER correctable errors; fix GPU status key routing Add nvidia-aer-correctable and pcie-aer-correctable patterns to catch "bus correctable error" events seen in SEL (Critical Interrupt / offset 7). Both patterns carry severity "warning" — correctable errors are hardware-recovered and should not flag a card as failed. Fix kmsg_watcher routing: GPU-category events were keyed as pcie:<BDF> but the UI queries for pcie:gpu: prefix. Split the switch so "gpu" → pcie:gpu:<BDF> and "pcie" → pcie:<BDF>. This applies to both flushWindow (SAT-window path) and flushImmediate (always-on path). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-08 12:50:14 +03:00
Mikhail Chusavitin	0939a647ea	Fix component detail modal: replace dead hx-* with fetch-based JS HTMX was never loaded on the page, so hx-get on the component label spans was dead code — the dialog opened empty. Replace with a plain openComponentDetail() fetch call. Also fix dialog positioning broken by the CSS reset (*{margin:0} overrode the UA margin:auto that centers <dialog>). Replace card hx-trigger polling with a setInterval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-08 10:53:20 +03:00
Mikhail Chusavitin	ae80d7711e	Add continuous hardware health monitoring and component detail view - kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times, not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB - New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source) - Hardware Summary card auto-refreshes every 30s via htmx without page reload - Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal with per-component status, source, timestamp and last 20 history entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 09:56:39 +03:00
Michael Chus	6c2b188ec9	Add no-GUI boot mode and quieter boot diagnostics	2026-05-03 21:14:45 +03:00
Michael Chus	14505ef24a	Remove easy bee ASCII logo banners	2026-05-03 21:07:13 +03:00
Michael Chus	0e39e7d960	Make toram default and add install-to-ram CLI	2026-05-03 14:07:47 +03:00
Mikhail Chusavitin	58d6da0e4f	Fix live task logs and SAT windows	2026-04-30 17:26:45 +03:00
Mikhail Chusavitin	7ce73e34a4	Add NVMe block format tool	2026-04-30 16:27:25 +03:00
Mikhail Chusavitin	2c22b01fe3	Fix IPMI hangs, add VROC license, fix blackbox service, drop qrencode IPMI hang fix (Lenovo XCC SR650 V3): - Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag - Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found - collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list) - collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts VROC license detection: - Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform - Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot Blackbox service fix: - bee-blackbox.service was missing from systemctl enable list in ISO build hook - Service never started on boot; state file never written; UI button stayed "Enable" Drop qrencode: - Remove from package list, standardTools API check, and runtime-flows doc Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 10:46:59 +03:00
Mikhail Chusavitin	ec89616585	Add storage block geometry to audit and viewer	2026-04-29 17:39:11 +03:00
Michael Chus	29179917c3	Add USB blackbox log mirroring service	2026-04-24 10:20:12 +03:00
Michael Chus	be4b439804	Commit remaining workspace changes	2026-04-23 20:32:26 +03:00
Mikhail Chusavitin	679aeb9947	Run NVIDIA DCGM diag tests on all selected GPUs simultaneously targeted_stress, targeted_power, and the Level 2/3 diag were dispatched one GPU at a time from the UI, turning a single dcgmi command into 8 sequential ~350–450 s runs. DCGM supports -i with a comma-separated list of GPU indices and runs the diagnostic on all of them in parallel. Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate. Update sat.go constants and page_validate.go estimates to reflect all-GPU simultaneous execution (remove n× multiplier from total time estimates). Stress test on 8-GPU system: ~5.3 h → ~2.5 h. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 11:53:25 +03:00
Michael Chus	b3cf8e3893	Globalize autotuned system power source	2026-04-20 07:02:12 +03:00
Michael Chus	65bcc9ce81	refactor(webui): split pages into task modules	2026-04-20 06:56:52 +03:00
Michael Chus	0cdfbc5875	fix(iso): restore boot UX and boot logs	2026-04-19 23:08:09 +03:00
Michael Chus	d52ec67f8f	Stability hardening, build script fixes, GRUB bee logo Stability hardening (webui/app): - readFileLimited(): OOM protection when reading the audit JSON (100 MB), the component-status DB (10 MB) and the task log (50 MB) - jobs.go: buffered task log, one open fd per task instead of open/write/close on every line (eliminates thousands of syscalls/sec during GPU stress tests) - stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s), reset after a successful run of >30s, restart counter in slog - kill_workers.go: 5s timeout on the /proc scan, warn when it fires - bee-web.service: MemoryMax=3G for OOM-killer protection Build script: - build.sh: removed the block that generated grub-pc/grub.cfg + live.cfg.in, dead code since v8.25; grub-pc is ignored by live-build, and the generated live.cfg.in was overwriting the correct static file with a stale version lacking the kernel tuning parameters and the gsp-off/kms+gsp-off entries - build.sh: dump_memtest_debug now logs grub-efi/grub.cfg instead of grub-pc/grub.cfg (which was always "missing") GRUB: - live-theme/bee-logo.png: 400×400px bee logo on a black background - live-theme/theme.txt: + image component centered in the top third of the screen; menu moved from 62% to 65% Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 13:08:31 +03:00
Michael Chus	52c3a24b76	Compact metrics DB in background to prevent CPU spin under load As metrics.db grew (1 sample/5 s × hours), handleMetricsChartSVG called LoadAll() on every chart request — loading all rows across 4 tables through a single SQLite connection. With ~10 charts auto-refreshing in parallel, requests queued behind each other, saturating the connection pool and pegging a CPU core. Fix: add a background compactor that runs every hour via the metrics collector: • Downsample: rows older than 2 h are thinned to 1 per minute (keep MIN(ts) per ts/60 bucket) — retains chart shape while cutting row count by ~92 %. • Prune: rows older than 48 h are deleted entirely. • After prune: WAL checkpoint/truncate to release disk space. LoadAll() in handleMetricsChartSVG is unchanged — it now stays fast because the DB is kept small rather than capping the query window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 15:28:05 +03:00
Michael Chus	7d64e5d215	Fix two stale failing tests - TestHandleAPIBenchmarkPowerFitRampQueuesBenchmarkPowerFitTasks: ramp-up mode intentionally creates a single task (the runner handles 1→N internally to avoid redundant repetition of earlier ramp steps). Updated the test to expect 1 task and verify RampTotal=3 instead of asserting 3 separate tasks. - TestBenchmarkPageRendersSavedResultsTable: benchmark page used "Performance Results" as heading while the test looked for "Perf Results". Aligned the page heading with the shorter label used everywhere else (task reports, etc.). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 15:07:27 +03:00
Michael Chus	51b721aeb3	Add real-data duration estimates to benchmark and burn pages - Add BenchmarkEstimated* constants to benchmark_types.go from _v8 logs (Standard Perf ~16 min, Standard Power Fit ~43 min, Stability Perf ~92 min) - Update benchmark profile dropdown to show Perf / Power Fit timing per profile - Add timing columns to Method Split table (Standard vs Stability per run type) - Update burn preset labels to show "N min/GPU (sequential) or N min (parallel)" - Clarify burn "one by one" description with sequential vs parallel scaling Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 10:54:50 +03:00
Michael Chus	bac89bb6e5	Add real-data duration estimates to validate tab profiles - Add SATEstimated* constants to sat.go derived from _v8 production logs, with a rule to recalculate them whenever the script changes - Extend validateInventory with NvidiaGPUCount to make estimates GPU-aware - Update all validate card duration strings: CPU, memory, storage, NVIDIA GPU, targeted stress/power, pulse test, NCCL, nvbandwidth - Fix nvbandwidth description ("intended to stay short" → actual ~45 min) - Top-level profile labels show computed total including GPU count Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 10:51:15 +03:00
Michael Chus	7a618da1f9	Redesign system power chart as stacked per-PSU area chart - Add PSUReading struct and PSUs []PSUReading to LiveMetricSample - Sample per-PSU input watts from IPMI SDR entity 10.x (Power Supply) - Render stacked filled-area SVG chart (one layer per PSU, cumulative total) - Fall back to single-line chart on systems with ≤1 PSU in SDR Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 10:42:00 +03:00
Michael Chus	63363e9629	Add toram boot entry and Install to RAM resume support - grub.cfg: add "load to RAM (toram)" entry to advanced submenu - install_to_ram.go: resume from existing /dev/shm/bee-live copy if source medium is unavailable after bee-web restart - tasks.go: fix "Recovered after bee-web restart" shown on every run (check j.lines before first append, not after) - bee-install: retry unsquashfs up to 5x with wait-for-remount on source loss; clear error message with bee-remount-medium hint - bee-remount-medium: new script to find and remount live ISO source after USB/CD reconnect; supports --wait polling mode - 9000-bee-setup: chmod +x for bee-install and bee-remount-medium Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 23:48:56 +03:00
Mikhail Chusavitin	5285c0d101	Capture per-run IPMI power and GPU telemetry in power benchmark - Sample IPMI loaded_w per single-card calibration and per ramp step instead of averaging over the entire Phase 2; top-level ServerPower uses the final (all-GPU) ramp step value - Add ServerLoadedW/ServerDeltaW to NvidiaPowerBenchGPU and NvidiaPowerBenchStep so external tooling can compare wall power per phase without re-parsing logs - Write gpu-metrics.csv/.html inside each single-XX/ and step-XX/ subdir; aggregate all phases into a top-level gpu-metrics.csv/.html - Write 00-nvidia-smi-q.log at the start of every power run - Add Telemetry (p95 temp/power/fan/clock) to NvidiaPowerBenchGPU in result.json from the converged calibration attempt - Power benchmark page: split "Achieved W" into Single-card W and Multi-GPU W (StablePowerLimitW); derate highlight and status color now reflect the final multi-GPU limit vs nominal - Performance benchmark page: add Status column and per-GPU score color coding (green/yellow/red) based on gpu.Status and OverallStatus Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:59:58 +03:00
Mikhail Chusavitin	b4280941f5	Move NCCL and NVBandwidth into validate mode	2026-04-16 11:02:30 +03:00
Michael Chus	4f76e1de21	Dashboard: per-device status chips with hover tooltips Replace single aggregated badge per hardware category with individual colored chips (O/W/F/?) for each ComponentStatusRecord. Added helper functions: matchedRecords, firstNonEmpty. CSS classes: chip-ok/warn/fail/unknown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 06:54:13 +03:00
Michael Chus	d90250f80a	Fix DCGM cleanup and shorten memory validate	2026-04-16 00:39:37 +03:00
Mikhail Chusavitin	cd9e2cbe13	Fix ramp-up power bench: one task instead of N redundant tasks RunNvidiaPowerBench already performs a full internal ramp from 1 to N GPUs in Phase 2. Spawning N tasks with growing GPU subsets meant task K repeated all steps 1..K-1 already done by tasks 1..K-1 — O(N²) work instead of O(N). Replace with a single task using all selected GPUs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 12:29:11 +03:00
Michael Chus	c80a39e7ac	Add power results table, fix benchmark results refresh, bound memtester - Benchmark page now shows two result sections: Performance (scores) and Power / Thermal Fit (slot table). After any benchmark task completes the results section auto-refreshes via GET /api/benchmark/results without a full page reload. - Power results table shows each GPU slot with nominal TDP, achieved stable power limit, and P95 observed power. Rows with derated cards are highlighted amber so under-performing slots stand out at a glance. Older runs are collapsed in a <details> summary. - memtester is now wrapped with timeout(1) so a stuck memory controller cannot cause Validate Memory to hang indefinitely. Wall-clock limit is ~2.5 min per 100 MB per pass plus a 2-minute buffer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 07:16:18 +03:00
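
Commit e169a7722c above mentions a jsonInt64 type that accepts both bare numbers and quoted strings, which is what makes the nvme-cli smart-log output parseable. The following is a minimal sketch of how such a type could look, not the repository's actual code: the smartLog struct, the parseSmartLog helper, and the sample payload are hypothetical, and only the avail_spare / percent_used tag names come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// jsonInt64 decodes from either a bare JSON number (42) or a quoted one ("42").
// Sketch only; the type in the repository may be implemented differently.
type jsonInt64 int64

func (v *jsonInt64) UnmarshalJSON(data []byte) error {
	s := string(data)
	if s == "null" {
		return nil // leave the zero value in place
	}
	// Strip surrounding quotes if the value arrived as a JSON string.
	if len(s) >= 2 && s[0] == '"' && s[len(s)-1] == '"' {
		s = s[1 : len(s)-1]
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return fmt.Errorf("jsonInt64: parse %q: %w", string(data), err)
	}
	*v = jsonInt64(n)
	return nil
}

// smartLog is a hypothetical two-field subset of a smart-log document.
type smartLog struct {
	AvailSpare  jsonInt64 `json:"avail_spare"`
	PercentUsed jsonInt64 `json:"percent_used"`
}

// parseSmartLog is a hypothetical helper used only for this illustration.
func parseSmartLog(raw []byte) (smartLog, error) {
	var out smartLog
	err := json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	// One counter is quoted, the other is a bare number; both should decode.
	log, err := parseSmartLog([]byte(`{"avail_spare":"100","percent_used":3}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(log.AvailSpare, log.PercentUsed)
}
```

With plain int64 fields the quoted counter would make json.Unmarshal return an error for the whole document, which matches the failure mode the commit describes (the health block skipped, status left as Unknown).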


156 Commits
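
The compaction rule described in commit 52c3a24b76 (rows older than 2 h thinned to one per minute by keeping MIN(ts) per ts/60 bucket, rows older than 48 h deleted) can be illustrated in memory. This is a sketch under stated assumptions: the real compactor works on SQLite tables, whereas the compact function and its constants below are hypothetical names that operate on a plain slice of Unix-second timestamps.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	downsampleAfterSec = 2 * 60 * 60  // rows older than 2 h are thinned
	pruneAfterSec      = 48 * 60 * 60 // rows older than 48 h are dropped
	bucketSec          = 60           // one surviving row per minute
)

// compact returns the timestamps that survive compaction at time now.
// Hypothetical in-memory stand-in for the SQL downsample/prune steps.
func compact(ts []int64, now int64) []int64 {
	sorted := append([]int64(nil), ts...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	seen := make(map[int64]bool) // minute buckets that already kept a row
	kept := make([]int64, 0, len(sorted))
	for _, t := range sorted {
		age := now - t
		switch {
		case age > pruneAfterSec:
			continue // prune: too old to keep at all
		case age > downsampleAfterSec:
			// Input is ascending, so the first row seen in a bucket is its MIN(ts).
			b := t / bucketSec
			if seen[b] {
				continue
			}
			seen[b] = true
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	now := int64(600000)
	var ts []int64
	// Three hours of samples at one per 5 s, as in the commit's description.
	for t := now - 3*3600; t <= now; t += 5 {
		ts = append(ts, t)
	}
	fmt.Printf("%d rows before, %d after\n", len(ts), len(compact(ts, now)))
}
```

At one sample per 5 s a minute holds 12 rows, so keeping one of them removes 11/12 of the old rows, consistent with the roughly 92 % reduction quoted in the commit message.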