reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Michael Chus	b2f8626fee	Refactor validate modes, fix benchmark report and IPMI power - Replace diag level 1-4 dropdown with Validate/Stress radio buttons - Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short - Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended - Parallel GPU mode: spawn single task for all GPUs instead of splitting per model - Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel - Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts - Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading') Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 00:42:12 +03:00
Michael Chus	dd26e03b2d	Add multi-GPU selector option for system-level tests Adds a "Multi-GPU tests — use all GPUs" checkbox to the NVIDIA GPU selector (checked by default). When enabled, PSU Pulse, NCCL, and NVBandwidth tests run on every GPU in the system regardless of the per-GPU selection above — which is required for correct PSU stress testing (synchronous pulses across all GPUs create worst-case transients). When unchecked, only the manually selected GPUs are used. The same logic applies both to Run All (expandSATTarget) and to the individual Run button on each multi-GPU test card. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 00:25:12 +03:00
Michael Chus	6937a4c6ec	Fix pulse_test: run all GPUs simultaneously, not per-GPU pulse_test is a PSU/power-delivery test, not a per-GPU compute test. Its purpose is to synchronously pulse all GPUs between idle and full load to create worst-case transient spikes on the power supply. Running it one GPU at a time would produce a fraction of the PSU load and miss any PSU-level failures. - Move nvidia-pulse from nvidiaPerGPUTargets to nvidiaAllGPUTargets (same dispatch path as NCCL and NVBandwidth) - Change card onclick to runNvidiaFabricValidate (all selected GPUs at once) - Update card title to "NVIDIA PSU Pulse Test" and description to explain why synchronous multi-GPU execution is required Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 00:19:11 +03:00
Michael Chus	b9be93c213	Move NCCL interconnect and NVBandwidth tests to validate/stress nvidia-interconnect (NCCL all_reduce_perf) and nvidia-bandwidth (NVBandwidth) verify fabric connectivity and bandwidth — they are not sustained burn loads. Move both from the Burn section to the Validate section under the stress-mode toggle, alongside the other DCGM diagnostic tests moved in the previous commit. - Add sat-card-nvidia-interconnect and sat-card-nvidia-bandwidth validate cards (stress-only, all selected GPUs at once) - Add runNvidiaFabricValidate() for all-GPU-at-once dispatch - Add nvidiaAllGPUTargets handling in expandSATTarget/runAllSAT - Remove Interconnect / Bandwidth card from Burn section - Remove nvidia-interconnect and nvidia-bandwidth from runAllBurnTasks and the gpu/tools availability map Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 00:16:42 +03:00
Michael Chus	d1a22d782d	Move power diag tests to validate/stress; fix GPU burn power saturation - bee-gpu-stress.c: remove per-wave cuCtxSynchronize barrier in both cuBLASLt and PTX hot loops; sync at most once/sec so the GPU queue stays continuously full — eliminates the CPU↔GPU ping-pong that prevented reaching full TDP - sat_fan_stress.go: default SizeMB 0 (auto = 95% VRAM) instead of hardcoded 64 MB; tiny matrices caused <0.1 ms kernels where CPU re-queue overhead dominated - pages.go: move nvidia-targeted-power and nvidia-pulse from Burn → Validate stress section alongside nvidia-targeted-stress; these are DCGM pass/fail diagnostics, not sustained burn loads; remove the Power Delivery / Power Budget card from Burn entirely Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 00:13:52 +03:00
Mikhail Chusavitin	531d1ca366	Add NVIDIA self-heal tools and per-GPU SAT status	2026-04-07 20:20:05 +03:00
Mikhail Chusavitin	93cfa78e8c	Benchmark: parallel GPU mode, resilient inventory query, server model in results - Add parallel GPU mode (checkbox, off by default): runs all selected GPUs simultaneously via a single bee-gpu-burn invocation instead of sequentially; per-GPU telemetry, throttle counters, TOPS, and scoring are preserved - Make queryBenchmarkGPUInfo resilient: falls back to a base field set when extended fields (attribute.multiprocessor_count, power.default_limit) cause exit status 2, preventing lgc normalization from being silently skipped - Log explicit "graphics clock lock skipped" note when inventory is unavailable - Collect server model from DMI (/sys/class/dmi/id/product_name) and store in result JSON; benchmark history columns now show "Server Model (N× GPU Model)" grouped by server+GPU type rather than individual GPU index Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-07 18:32:15 +03:00
Michael Chus	fa00667750	Refactor NVIDIA GPU Selection into standalone card on validate page Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 21:06:16 +03:00
Mikhail Chusavitin	2354ae367d	Normalize task IDs and artifact folder prefixes	2026-04-06 12:26:47 +03:00
Mikhail Chusavitin	981315e6fd	Split NVIDIA tasks by homogeneous GPU groups	2026-04-06 11:58:13 +03:00
Mikhail Chusavitin	fc5c100a29	Fix NVIDIA persistence mode and add benchmark results table	2026-04-06 10:47:07 +03:00
Michael Chus	6e94216f3b	Hide task charts while pending	2026-04-05 22:34:34 +03:00
Michael Chus	53455063b9	Stabilize live task detail page	2026-04-05 22:14:52 +03:00
Michael Chus	4602f97836	Enforce sequential task orchestration	2026-04-05 22:10:42 +03:00
Michael Chus	7a21c370e4	Handle NVIDIA GSP firmware init hang with timeout fallback - bee-nvidia-load: run insmod in background, poll /proc/devices for nvidiactl; if GSP init doesn't complete in 90s, kill insmod and retry with NVreg_EnableGpuFirmware=0. Handles EBUSY case with clear error. - Write /run/bee-nvidia-mode (gsp-on/gsp-off/gsp-stuck) for audit layer - Show GSP mode badge in sidebar: yellow for gsp-off, red for gsp-stuck - Report NvidiaGSPMode in RuntimeHealth with issue entries - Simplify GRUB menu: default (KMS+GSP), advanced submenu (GSP=off, nomodeset, fail-safe), remove load-to-RAM entry - Add pcmanfm, ristretto, mupdf, mousepad to desktop packages Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 21:00:43 +03:00
Michael Chus	a493e3ab5b	Fix service control buttons: sudo, real error output, UX feedback - services.go: use sudo systemctl so bee user can control system services - api.go: always return 200 with output field even on error, so the frontend shows the actual systemctl message instead of "exit status 1" - pages.go: button shows "..." while pending then restores label; output panel is full-width under the table with ✓/✗ status indicator; output auto-scrolls to bottom Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 20:25:41 +03:00
Michael Chus	5b9015451e	Add live task charts and fix USB export actions	2026-04-05 20:14:23 +03:00
Michael Chus	2875313ba0	Improve boot UX: status display, faster GUI, loading spinner - Add bee-boot-status service: shows live service status on tty1 with ASCII logo before getty, exits when all bee services settle - Remove lightdm dependency on bee-preflight so GUI starts immediately without waiting for NVIDIA driver load - Replace Chromium blank-page problem with /loading spinner page that polls /api/services and auto-redirects when services are ready; add "Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome - Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu, bee-boot-status (with yellow ASCII logo), and web spinner Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 18:58:24 +03:00
Michael Chus	f1621efee4	Mirror task lifecycle to serial console	2026-04-05 18:34:06 +03:00
Michael Chus	e609fbbc26	Add task reports and streamline GPU charts	2026-04-05 18:13:58 +03:00
Michael Chus	cc2b49ea41	Improve validate GPU runs and web UI feedback	2026-04-05 17:50:13 +03:00
Michael Chus	33e0a5bef2	Refine validate UI and runtime health table	2026-04-05 16:24:45 +03:00
Michael Chus	38e79143eb	Refine burn UI and NVIDIA stress flows	2026-04-05 13:43:43 +03:00
Michael Chus	25af2df23a	Unify metrics charts on custom SVG renderer	2026-04-05 12:17:50 +03:00
Michael Chus	20abff7f90	WIP: checkpoint current tree	2026-04-05 12:05:00 +03:00
Michael Chus	a14ec8631c	Persist GPU chart mode and expand GPU charts	2026-04-05 11:52:32 +03:00
Michael Chus	f58c7e58d3	Fix webui streaming recovery regressions	2026-04-05 10:39:09 +03:00
Michael Chus	143b7dca5d	Add stability hardening and self-heal recovery	2026-04-05 10:29:37 +03:00
Michael Chus	9826d437a5	Add GPU clock charts and grouped GPU metrics view	2026-04-05 09:57:38 +03:00
Mikhail Chusavitin	11f52ac710	Fix task log modal scrolling	2026-04-03 10:36:11 +03:00
Mikhail Chusavitin	1cb398fe83	Show tag version at top of sidebar	2026-04-03 10:08:00 +03:00
Mikhail Chusavitin	7f6386dccc	Restore USB support bundle export on tools page	2026-04-03 09:48:22 +03:00
Mikhail Chusavitin	295a19b93a	feat(tasks): run all queued tasks in parallel Tasks are now started simultaneously when multiple are enqueued (e.g. Run All). The worker drains all pending tasks at once and launches each in its own goroutine, waiting via WaitGroup. kmsg watcher updated to use a shared event window with a reference counter across concurrent tasks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 09:15:06 +03:00
Mikhail Chusavitin	fd722692a4	feat(watchdog): hardware error monitor + unified component status store - Add platform/error_patterns.go: pluggable table of kernel log patterns (NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct - Add app/component_status_db.go: persistent JSON store (component-status.json) keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never downgrades Warning or Critical - Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks, writes Warning to DB for matched hardware errors - Fix task status: overall_status=FAILED in summary.txt now marks task failed - Audit routine overlays component DB statuses into bee-audit.json on every read Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 19:20:59 +03:00
Mikhail Chusavitin	c27449c60e	feat(webui): show current boot source	2026-04-02 15:36:32 +03:00
Mikhail Chusavitin	5ef879e307	feat(webui): add gpu driver restart action	2026-04-02 15:30:23 +03:00
Mikhail Chusavitin	17ff3811f8	fix(webui): improve tasks logs and ordering	2026-04-02 13:43:59 +03:00
Mikhail Chusavitin	fc7fe0b08e	fix(webui): build support bundle synchronously on download, bypass task queue Support bundle is now built on-the-fly when the user clicks the button, regardless of whether other tasks are running: - GET /export/support.tar.gz builds the bundle synchronously and streams it directly to the client; the temp archive is removed after serving - Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue approach meant the bundle could only be downloaded after navigating away and back, and was blocked entirely while a long SAT test was running - UI: single "Download Support Bundle" button; fetch+blob gives a loading state ("Building...") while the server collects logs, then triggers the browser download with the correct filename from Content-Disposition Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 12:58:00 +03:00
Mikhail Chusavitin	1f750d3edd	fix(webui): prevent orphaned workers on restart, reduce metrics polling, add Kill Workers button - tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web crashes mid-test (each restart was launching a new set of 8 workers on top of still-alive orphans from the previous crash) - server: reduce metrics collector interval 1s→5s, grow ring buffer to 360 samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5× - platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn, stress-ng, stressapptest, memtester without relying on pkill/killall - webui: add "Kill Workers" button next to Cancel All; calls POST /api/tasks/kill-workers which cancels the task queue then kills orphaned OS-level processes; shows toast with killed count - metricsdb: sort GPU indices and fan/temp names after map iteration to fix non-deterministic sample reconstruction order (flaky test) - server: fix chartYAxisNumber to use one decimal place for 1000–9999 (e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 10:13:43 +03:00
Michael Chus	dbab43db90	Fix full-history metrics range loading	2026-04-01 23:55:28 +03:00
Michael Chus	bcb7fe5fe9	Render charts from full SQLite history	2026-04-01 23:52:54 +03:00
Michael Chus	1dd7f243f5	Keep chart series colors stable	2026-04-01 23:37:57 +03:00
Michael Chus	938e499ac2	Serve charts from SQLite history only	2026-04-01 23:33:13 +03:00
Michael Chus	c2aecc6ce9	Fix fan chart gaps and task durations	2026-04-01 22:36:11 +03:00
Michael Chus	439b86ce59	Unify live metrics chart rendering	2026-04-01 22:19:33 +03:00
Mikhail Chusavitin	c95bbff23b	fix(metrics): stabilize cpu and power sampling	2026-04-01 09:40:42 +03:00
Mikhail Chusavitin	4e4debd4da	refactor(webui): redesign Burn tab and fix gpu-burn memory defaults - Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress, Compute Stress, Platform Thermal Cycling) + global Burn Profile - Run All button at top enqueues all enabled tests across all cards - GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools endpoint based on driver status (/dev/nvidia0, /dev/kfd) - Compute Stress: checkboxes for cpu/memory-stress/stressapptest - Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd) with platform_components param wired through to PlatformStressOptions - bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script now queries nvidia-smi memory.total per GPU and uses 95% of it - platform_stress: removed hardcoded --size-mb 64; respects Components field to selectively run CPU and/or GPU load goroutines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 09:39:07 +03:00
Mikhail Chusavitin	c394845b34	refactor(webui): queue install and bundle tasks - v3.18	2026-04-01 08:46:46 +03:00
Mikhail Chusavitin	b5b34983f1	fix(webui): repair audit actions and CPU burn flow - v3.15	2026-04-01 08:19:11 +03:00
Michael Chus	45221d1e9a	fix(stress): label loaders and improve john opencl diagnostics	2026-04-01 07:31:52 +03:00

1 2

84 Commits