reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Mikhail Chusavitin	8575cf06f8	webui: show all RAID drives per controller and add drive-prepare action RAID Controller Management previously hid any LSI drive that wasn't already Frgn/UGood/JBOD, and scoped VROC "free drives" from all system disks instead of the ones actually wired to the VROC controller's ports - drives attached directly to the CPU or another HBA could leak in. Now every drive is listed per its own controller, and LSI drives not already ready for array creation get a "Prepare" button that forces them to Unconfigured Good via storcli. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>	2026-07-01 13:32:03 +03:00
Mikhail Chusavitin	24f2e65b6e	Add unified FRU/Elabel card with Huawei iBMC OEM IPMI support Replaces separate IPMI FRU and SAA DMI cards with a single FRU / Elabel card that reads all available sources in parallel and shows each field with a color-coded source chip (IPMI FRU / Huawei iBMC / SAA DMI). Huawei elabel fields are read/written via OEM IPMI raw commands (NetFn 0x30, cmd 0x90) with 19-byte chunking protocol, matching the FusionServer ElabelTool V511 wire format. Covers DeviceName, DeviceSerialNumber, ProductName, ProductSerialNumber, ProductAssetTag, ProductManufacturer, MainboardManufacturer, BoardProductName, ChassisPartnumber, ChassisType (read-only), IOChassisSerial, IOChassisAssetTag, and GUID (read-only via standard 0x06 0x08). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 15:29:07 +03:00
Mikhail Chusavitin	892ef6fb7d	Add Reboot and Shutdown buttons to Settings page POST /api/system/reboot → systemctl reboot POST /api/system/shutdown → systemctl poweroff Both require confirm() before executing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 09:18:30 +03:00
Mikhail Chusavitin	258ecb3453	Add RAID Controller Management to Tools page Unified card for LSI/Broadcom and Intel VROC controllers: auto-detects foreign configurations and warns the operator with Import/Clear actions; allows creating RAID 1 mirrors from unconfigured drives regardless of controller type. Live output streams via SSE into an inline terminal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 08:58:19 +03:00
Mikhail Chusavitin	b49c71a980	Add IPMI FRU editor to Tools page - New card "IPMI — FRU" on Tools page (device 0, in-band) - Read: GET /api/tools/ipmi-fru → ipmitool fru print 0 → editable table - Editable fields: chassis (part#, serial, extra), board (mfr, product, serial, part#), product (mfr, name, part#, version, serial); read-only fields displayed as text - Write: POST /api/tools/ipmi-fru/write → task → backup to fru-backups/ → ipmitool fru edit per field - Dirty tracking + Save (N changed) button, same UX as Supermicro DMI card Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-19 08:13:35 +03:00
Mikhail Chusavitin	a2d7513153	Restructure nav to Load/Burn/Benchmark; fix SAA acpidump dependency - Nav steps 3-5: Load (validate), Burn (burn-in), Benchmark (speed+endurance merged) - /load now renders validate mode; /burn renders burn-in; /benchmark replaces /speed+/endurance - Legacy redirects updated: /validate→/load, /burn-in→/burn, /speed+/endurance→/benchmark - Add acpica_bin/acpidump from SAA 1.5.0 package; required by saa GetDmiInfo (ExitCode 8) - build.sh copies acpica_bin/acpidump to /usr/local/bin/acpica_bin/ alongside saa Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 19:07:51 +03:00
Mikhail Chusavitin	4262c5b798	Add SAA DMI editor to Tools page Adds a new card to the web UI Tools page for reading and editing DMI fields via SAA (In-Band). Reads current DMI configuration with GetDmiInfo, displays all fields as an editable table, and applies only the changed fields via EditDmiInfo + ChangeDmiInfo. Backs up the original DMI file to dmi-backups/ before any write, making it available in the support bundle for rollback. Also adds "saa" to the standard tool check list. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 15:50:42 +03:00
Mikhail Chusavitin	271dadda03	Restructure web UI navigation into 7 numbered workflow stages Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark, Tasks, Tools) with a numbered progression that guides engineers through a logical acceptance workflow: Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed → 5. Endurance → 6. Tools → 7. Settings Key changes: - layout.go: numbered nav labels, new hrefs, Tasks removed from nav and replaced with a persistent sidebar badge (polls /api/tasks every 5 s, highlights amber when tasks are active) - server.go: 301 redirects from /validate→/check, /burn→/load, /benchmark→/speed for backward compatibility - pages.go: dispatch cases for all new routes; old routes kept as fallbacks - page_validate.go: add renderCheck() — non-destructive check page with validate-mode tests only (no stress toggle, no targeted-stress/ targeted-power/pulse cards) - page_burn.go: add renderLoad() wrapper; update scope alert to reference /check instead of /validate - page_benchmark.go: add renderSpeed() (performance focus) and renderEndurance() (stability/overnight focus) wrappers - page_settings.go: new Settings page with blackbox logging toggle, NVIDIA driver reset, and build info - server_test.go: update five tests to use new route names and content expectations Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 11:00:02 +03:00
Mikhail Chusavitin	ae80d7711e	Add continuous hardware health monitoring and component detail view - kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times, not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB - New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source) - Hardware Summary card auto-refreshes every 30s via htmx without page reload - Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal with per-component status, source, timestamp and last 20 history entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 09:56:39 +03:00
mchus	6c2b188ec9	Add no-GUI boot mode and quieter boot diagnostics	2026-05-03 21:14:45 +03:00
mchus	14505ef24a	Remove easy bee ASCII logo banners	2026-05-03 21:07:13 +03:00
Mikhail Chusavitin	7ce73e34a4	Add NVMe block format tool	2026-04-30 16:27:25 +03:00
Mikhail Chusavitin	ec89616585	Add storage block geometry to audit and viewer	2026-04-29 17:39:11 +03:00
mchus	29179917c3	Add USB blackbox log mirroring service	2026-04-24 10:20:12 +03:00
mchus	b3cf8e3893	Globalize autotuned system power source	2026-04-20 07:02:12 +03:00
mchus	0cdfbc5875	fix(iso): restore boot UX and boot logs	2026-04-19 23:08:09 +03:00
mchus	52c3a24b76	Compact metrics DB in background to prevent CPU spin under load As metrics.db grew (1 sample/5 s × hours), handleMetricsChartSVG called LoadAll() on every chart request — loading all rows across 4 tables through a single SQLite connection. With ~10 charts auto-refreshing in parallel, requests queued behind each other, saturating the connection pool and pegging a CPU core. Fix: add a background compactor that runs every hour via the metrics collector: • Downsample: rows older than 2 h are thinned to 1 per minute (keep MIN(ts) per ts/60 bucket) — retains chart shape while cutting row count by ~92 %. • Prune: rows older than 48 h are deleted entirely. • After prune: WAL checkpoint/truncate to release disk space. LoadAll() in handleMetricsChartSVG is unchanged — it now stays fast because the DB is kept small rather than capping the query window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 15:28:05 +03:00
mchus	7a618da1f9	Redesign system power chart as stacked per-PSU area chart - Add PSUReading struct and PSUs []PSUReading to LiveMetricSample - Sample per-PSU input watts from IPMI SDR entity 10.x (Power Supply) - Render stacked filled-area SVG chart (one layer per PSU, cumulative total) - Fall back to single-line chart on systems with ≤1 PSU in SDR Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 10:42:00 +03:00
mchus	c80a39e7ac	Add power results table, fix benchmark results refresh, bound memtester - Benchmark page now shows two result sections: Performance (scores) and Power / Thermal Fit (slot table). After any benchmark task completes the results section auto-refreshes via GET /api/benchmark/results without a full page reload. - Power results table shows each GPU slot with nominal TDP, achieved stable power limit, and P95 observed power. Rows with derated cards are highlighted amber so under-performing slots stand out at a glance. Older runs are collapsed in a <details> summary. - memtester is now wrapped with timeout(1) so a stuck memory controller cannot cause Validate Memory to hang indefinitely. Wall-clock limit is ~2.5 min per 100 MB per pass plus a 2-minute buffer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 07:16:18 +03:00
mchus	ee422ede3c	Revert "Add raster Easy Bee branding assets" This reverts commit `d560b2fead`.	2026-04-14 23:00:15 +03:00
mchus	d560b2fead	Add raster Easy Bee branding assets	2026-04-14 22:39:25 +03:00
Mikhail Chusavitin	95124d228f	Split bee-bench into perf and power workflows	2026-04-14 17:33:13 +03:00
mchus	025548ab3c	UI: amber accents, smaller wallpaper logo, new support bundle name, drop display resolution - Bootloader: GRUB fallback text colors → yellow/brown (amber tone) - CLI charts: all GPU metric series use single amber color (xterm-256 #214) - Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size - Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz using dmidecode for server model (spaces→underscores) and serial number - Remove display resolution feature (UI card, API routes, handlers, tests) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 21:37:01 +03:00
Mikhail Chusavitin	531d1ca366	Add NVIDIA self-heal tools and per-GPU SAT status	2026-04-07 20:20:05 +03:00
mchus	5b9015451e	Add live task charts and fix USB export actions	2026-04-05 20:14:23 +03:00
mchus	2875313ba0	Improve boot UX: status display, faster GUI, loading spinner - Add bee-boot-status service: shows live service status on tty1 with ASCII logo before getty, exits when all bee services settle - Remove lightdm dependency on bee-preflight so GUI starts immediately without waiting for NVIDIA driver load - Replace Chromium blank-page problem with /loading spinner page that polls /api/services and auto-redirects when services are ready; add "Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome - Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu, bee-boot-status (with yellow ASCII logo), and web spinner Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 18:58:24 +03:00
mchus	e609fbbc26	Add task reports and streamline GPU charts	2026-04-05 18:13:58 +03:00
mchus	cc2b49ea41	Improve validate GPU runs and web UI feedback	2026-04-05 17:50:13 +03:00
mchus	38e79143eb	Refine burn UI and NVIDIA stress flows	2026-04-05 13:43:43 +03:00
mchus	25af2df23a	Unify metrics charts on custom SVG renderer	2026-04-05 12:17:50 +03:00
mchus	20abff7f90	WIP: checkpoint current tree	2026-04-05 12:05:00 +03:00
mchus	a14ec8631c	Persist GPU chart mode and expand GPU charts	2026-04-05 11:52:32 +03:00
mchus	f58c7e58d3	Fix webui streaming recovery regressions	2026-04-05 10:39:09 +03:00
mchus	143b7dca5d	Add stability hardening and self-heal recovery	2026-04-05 10:29:37 +03:00
mchus	9826d437a5	Add GPU clock charts and grouped GPU metrics view	2026-04-05 09:57:38 +03:00
Mikhail Chusavitin	fd722692a4	feat(watchdog): hardware error monitor + unified component status store - Add platform/error_patterns.go: pluggable table of kernel log patterns (NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct - Add app/component_status_db.go: persistent JSON store (component-status.json) keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never downgrades Warning or Critical - Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks, writes Warning to DB for matched hardware errors - Fix task status: overall_status=FAILED in summary.txt now marks task failed - Audit routine overlays component DB statuses into bee-audit.json on every read Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 19:20:59 +03:00
Mikhail Chusavitin	fc7fe0b08e	fix(webui): build support bundle synchronously on download, bypass task queue Support bundle is now built on-the-fly when the user clicks the button, regardless of whether other tasks are running: - GET /export/support.tar.gz builds the bundle synchronously and streams it directly to the client; the temp archive is removed after serving - Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue approach meant the bundle could only be downloaded after navigating away and back, and was blocked entirely while a long SAT test was running - UI: single "Download Support Bundle" button; fetch+blob gives a loading state ("Building...") while the server collects logs, then triggers the browser download with the correct filename from Content-Disposition Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 12:58:00 +03:00
Mikhail Chusavitin	1f750d3edd	fix(webui): prevent orphaned workers on restart, reduce metrics polling, add Kill Workers button - tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web crashes mid-test (each restart was launching a new set of 8 workers on top of still-alive orphans from the previous crash) - server: reduce metrics collector interval 1s→5s, grow ring buffer to 360 samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5× - platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn, stress-ng, stressapptest, memtester without relying on pkill/killall - webui: add "Kill Workers" button next to Cancel All; calls POST /api/tasks/kill-workers which cancels the task queue then kills orphaned OS-level processes; shows toast with killed count - metricsdb: sort GPU indices and fan/temp names after map iteration to fix non-deterministic sample reconstruction order (flaky test) - server: fix chartYAxisNumber to use one decimal place for 1000–9999 (e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 10:13:43 +03:00
mchus	bcb7fe5fe9	Render charts from full SQLite history	2026-04-01 23:52:54 +03:00
mchus	1dd7f243f5	Keep chart series colors stable	2026-04-01 23:37:57 +03:00
mchus	938e499ac2	Serve charts from SQLite history only	2026-04-01 23:33:13 +03:00
mchus	c2aecc6ce9	Fix fan chart gaps and task durations	2026-04-01 22:36:11 +03:00
mchus	439b86ce59	Unify live metrics chart rendering	2026-04-01 22:19:33 +03:00
Mikhail Chusavitin	4e4debd4da	refactor(webui): redesign Burn tab and fix gpu-burn memory defaults - Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress, Compute Stress, Platform Thermal Cycling) + global Burn Profile - Run All button at top enqueues all enabled tests across all cards - GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools endpoint based on driver status (/dev/nvidia0, /dev/kfd) - Compute Stress: checkboxes for cpu/memory-stress/stressapptest - Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd) with platform_components param wired through to PlatformStressOptions - bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script now queries nvidia-smi memory.total per GPU and uses 95% of it - platform_stress: removed hardcoded --size-mb 64; respects Components field to selectively run CPU and/or GPU load goroutines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 09:39:07 +03:00
Mikhail Chusavitin	c394845b34	refactor(webui): queue install and bundle tasks - v3.18	2026-04-01 08:46:46 +03:00
mchus	ea660500c9	chore: commit pending repo changes	2026-03-31 22:17:36 +03:00
Mikhail Chusavitin	6dee8f3509	Add NVIDIA stress loader selection and DCGM 4 support	2026-03-31 11:15:15 +03:00
Mikhail Chusavitin	20f834aa96	feat: v3.4 — boot reliability, log readability, USB export, screen resolution, GRUB UEFI fix, memtest, KVM console stability Web UI / logs: - Strip ANSI escape codes and handle \r (progress bars) in task log output - Add USB export API + UI card on Export page (list removable devices, write audit JSON or support bundle) - Add Display Resolution card in Tools (xrandr-based, per-output mode selector) - Dashboard: audit status banner with auto-reload when audit task completes Boot & install: - bee-web starts immediately with no dependencies (was blocked by audit + network) - bee-audit.service redesigned: waits for bee-web healthz, sleeps 60s, enqueues audit via /api/audit/run (task system) - bee-install: fix GRUB UEFI — grub-install exit code was silently ignored (|| true); add --no-nvram fallback; always copy EFI/BOOT/BOOTX64.EFI fallback path - Add grub-efi-amd64, grub-pc, grub-efi-amd64-signed, shim-signed to package list (grub-install requires these, not just -bin variants) - memtest hook: fix binary/boot/ not created before cp; handle both Debian (no extension) and upstream (x64.efi) naming - bee-openbox-session: increase healthz wait from 30s to 120s KVM console stability: - runCmdJob: syscall.Setpriority(PRIO_PROCESS, pid, 10) on all stress subprocesses - lightdm.service.d: Nice=-5 so X server preempts stress processes Packages: add btop Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-31 10:16:15 +03:00
mchus	7c2a0135d2	feat(audit): add platform thermal cycling stress test Runs CPU (stressapptest) + GPU stress simultaneously across multiple load/idle cycles with varying idle durations (120s/60s/30s) to detect cooling systems that fail to recover under repeated load. Presets: smoke (~5 min), acceptance (~25 min), overnight (~100 min). Outputs metrics.csv + summary.txt with per-cycle throttle and fan spindown analysis, packed as tar.gz. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 21:57:33 +03:00
mchus	407c1cd1c4	fix(charts): unify timeline labels across graphs	2026-03-29 21:24:06 +03:00
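
Note on commit 52c3a24b76: the compaction rules in that message (rows older than 2 h thinned to one per minute by keeping MIN(ts) per ts/60 bucket, rows older than 48 h deleted) can be illustrated with a small in-memory sketch. This is not the repository's code. The function name compactTimestamps, the use of plain Unix-second timestamps, and a slice standing in for the SQLite tables are all assumptions made for illustration; the WAL checkpoint step is not modelled.

```go
package main

import (
	"fmt"
	"sort"
)

// compactTimestamps is a hypothetical helper that applies the two rules
// from the commit message to a list of Unix-second sample timestamps:
//   - older than 48 h: dropped entirely (prune)
//   - older than 2 h: only the earliest sample per 60-second bucket is kept
//   - newer samples: kept unchanged
func compactTimestamps(ts []int64, now int64) []int64 {
	const (
		downsampleAfter = 2 * 60 * 60  // 2 h in seconds
		pruneAfter      = 48 * 60 * 60 // 48 h in seconds
		bucketSeconds   = 60           // one retained sample per minute
	)

	// Sort a copy so the first sample seen in each bucket is MIN(ts).
	sorted := append([]int64(nil), ts...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	seenBucket := make(map[int64]bool)
	out := make([]int64, 0, len(sorted))
	for _, t := range sorted {
		age := now - t
		switch {
		case age > pruneAfter:
			continue // prune: too old to keep
		case age > downsampleAfter:
			bucket := t / bucketSeconds
			if seenBucket[bucket] {
				continue // a smaller ts from this bucket was already kept
			}
			seenBucket[bucket] = true
		}
		out = append(out, t)
	}
	return out
}

func main() {
	now := int64(200000)
	samples := []int64{23600, 189185, 189180, 189190, 189240, 199990, 199995}
	fmt.Println(compactTimestamps(samples, now))
}
```

At one sample every 5 s there are 12 rows per minute, so keeping one per minute removes 11 of every 12 rows in the downsampled range, which is the roughly 92 % reduction quoted in the commit message.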

70 Commits