reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Michael Chus	4602f97836	Enforce sequential task orchestration	2026-04-05 22:10:42 +03:00
Michael Chus	7a21c370e4	Handle NVIDIA GSP firmware init hang with timeout fallback - bee-nvidia-load: run insmod in background, poll /proc/devices for nvidiactl; if GSP init doesn't complete in 90s, kill insmod and retry with NVreg_EnableGpuFirmware=0. Handles EBUSY case with clear error. - Write /run/bee-nvidia-mode (gsp-on/gsp-off/gsp-stuck) for audit layer - Show GSP mode badge in sidebar: yellow for gsp-off, red for gsp-stuck - Report NvidiaGSPMode in RuntimeHealth with issue entries - Simplify GRUB menu: default (KMS+GSP), advanced submenu (GSP=off, nomodeset, fail-safe), remove load-to-RAM entry - Add pcmanfm, ristretto, mupdf, mousepad to desktop packages Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 21:00:43 +03:00
Michael Chus	a493e3ab5b	Fix service control buttons: sudo, real error output, UX feedback - services.go: use sudo systemctl so bee user can control system services - api.go: always return 200 with output field even on error, so the frontend shows the actual systemctl message instead of "exit status 1" - pages.go: button shows "..." while pending then restores label; output panel is full-width under the table with ✓/✗ status indicator; output auto-scrolls to bottom Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 20:25:41 +03:00
Michael Chus	19b4803ec7	Pass exact cycle duration to GPU stress instead of 86400s sentinel bee-gpu-burn now receives --seconds <LoadSec> so it exits naturally when the cycle ends, rather than relying solely on context cancellation to kill it. Process group kill (Setpgid+Cancel) is kept as a safety net for early cancellation (user stop, context timeout). Same fix for AMD RVS which now gets duration_ms = LoadSec * 1000. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 20:22:43 +03:00
Michael Chus	1bdfb1e9ca	Fix nvidia-targeted-stress failing with DCGM_ST_IN_USE (-34) nvvs (DCGM validation suite) survives when dcgmi is killed mid-run, leaving the GPU occupied. The next dcgmi diag invocation then fails with "affected resource is in use". Two-part fix: - Add nvvs and dcgmi to KillTestWorkers patterns so they are cleaned up by the global cancel handler - Call KillTestWorkers at the start of RunNvidiaTargetedStressValidatePack to clear any stale processes before dcgmi diag runs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 20:21:36 +03:00
Michael Chus	c5d6b30177	Fix platform thermal cycling leaving GPU load running after test ends bee-gpu-burn is a shell script that spawns bee-gpu-burn-worker children. exec.CommandContext default cancel only kills the shell parent; the worker processes survive and keep loading the GPU indefinitely. Fix: set Setpgid=true and a custom Cancel that sends SIGKILL to the entire process group (-pid), same pattern already used in runSATCommandCtx. Applied to Nvidia, AMD, and CPU stress commands for consistency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 20:19:20 +03:00
Michael Chus	5b9015451e	Add live task charts and fix USB export actions	2026-04-05 20:14:23 +03:00
Michael Chus	2875313ba0	Improve boot UX: status display, faster GUI, loading spinner - Add bee-boot-status service: shows live service status on tty1 with ASCII logo before getty, exits when all bee services settle - Remove lightdm dependency on bee-preflight so GUI starts immediately without waiting for NVIDIA driver load - Replace Chromium blank-page problem with /loading spinner page that polls /api/services and auto-redirects when services are ready; add "Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome - Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu, bee-boot-status (with yellow ASCII logo), and web spinner Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 18:58:24 +03:00
Michael Chus	f1621efee4	Mirror task lifecycle to serial console	2026-04-05 18:34:06 +03:00
Michael Chus	4461249cc3	Make memory stress size follow available RAM	2026-04-05 18:33:26 +03:00
Michael Chus	e609fbbc26	Add task reports and streamline GPU charts	2026-04-05 18:13:58 +03:00
Michael Chus	cc2b49ea41	Improve validate GPU runs and web UI feedback	2026-04-05 17:50:13 +03:00
Michael Chus	33e0a5bef2	Refine validate UI and runtime health table	2026-04-05 16:24:45 +03:00
Michael Chus	38e79143eb	Refine burn UI and NVIDIA stress flows	2026-04-05 13:43:43 +03:00
Michael Chus	25af2df23a	Unify metrics charts on custom SVG renderer	2026-04-05 12:17:50 +03:00
Michael Chus	20abff7f90	WIP: checkpoint current tree	2026-04-05 12:05:00 +03:00
Michael Chus	a14ec8631c	Persist GPU chart mode and expand GPU charts	2026-04-05 11:52:32 +03:00
Michael Chus	f58c7e58d3	Fix webui streaming recovery regressions	2026-04-05 10:39:09 +03:00
Michael Chus	bf47c8dbd2	Add NVIDIA benchmark reporting flow	2026-04-05 10:30:56 +03:00
Michael Chus	143b7dca5d	Add stability hardening and self-heal recovery	2026-04-05 10:29:37 +03:00
Michael Chus	9826d437a5	Add GPU clock charts and grouped GPU metrics view	2026-04-05 09:57:38 +03:00
Mikhail Chusavitin	f3c14cd893	Harden NIC probing for empty SFP ports	2026-04-04 15:23:15 +03:00
Mikhail Chusavitin	728270dc8e	Unblock bee-web startup and expand support bundle diagnostics	2026-04-04 15:18:43 +03:00
Mikhail Chusavitin	11f52ac710	Fix task log modal scrolling	2026-04-03 10:36:11 +03:00
Mikhail Chusavitin	1cb398fe83	Show tag version at top of sidebar	2026-04-03 10:08:00 +03:00
Mikhail Chusavitin	7a843be6b0	Stabilize DCGM GPU discovery	2026-04-03 09:50:33 +03:00
Mikhail Chusavitin	7f6386dccc	Restore USB support bundle export on tools page	2026-04-03 09:48:22 +03:00
Mikhail Chusavitin	295a19b93a	feat(tasks): run all queued tasks in parallel Tasks are now started simultaneously when multiple are enqueued (e.g. Run All). The worker drains all pending tasks at once and launches each in its own goroutine, waiting via WaitGroup. kmsg watcher updated to use a shared event window with a reference counter across concurrent tasks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 09:15:06 +03:00
Mikhail Chusavitin	fd722692a4	feat(watchdog): hardware error monitor + unified component status store - Add platform/error_patterns.go: pluggable table of kernel log patterns (NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct - Add app/component_status_db.go: persistent JSON store (component-status.json) keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never downgrades Warning or Critical - Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks, writes Warning to DB for matched hardware errors - Fix task status: overall_status=FAILED in summary.txt now marks task failed - Audit routine overlays component DB statuses into bee-audit.json on every read Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 19:20:59 +03:00
Mikhail Chusavitin	99cece524c	feat(support-bundle): add PCIe link diagnostics and system logs - Add full dmesg (was tail -200), kern.log, syslog - Add /proc/cmdline, lspci -vvv, nvidia-smi -q - Add per-GPU PCIe link speed/width from sysfs (NVIDIA devices only) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 15:42:28 +03:00
Mikhail Chusavitin	c27449c60e	feat(webui): show current boot source	2026-04-02 15:36:32 +03:00
Mikhail Chusavitin	5ef879e307	feat(webui): add gpu driver restart action	2026-04-02 15:30:23 +03:00
Mikhail Chusavitin	e7df63bae1	fix(app): include extra system logs in support bundle	2026-04-02 13:44:58 +03:00
Mikhail Chusavitin	17ff3811f8	fix(webui): improve tasks logs and ordering	2026-04-02 13:43:59 +03:00
Mikhail Chusavitin	fc7fe0b08e	fix(webui): build support bundle synchronously on download, bypass task queue Support bundle is now built on-the-fly when the user clicks the button, regardless of whether other tasks are running: - GET /export/support.tar.gz builds the bundle synchronously and streams it directly to the client; the temp archive is removed after serving - Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue approach meant the bundle could only be downloaded after navigating away and back, and was blocked entirely while a long SAT test was running - UI: single "Download Support Bundle" button; fetch+blob gives a loading state ("Building...") while the server collects logs, then triggers the browser download with the correct filename from Content-Disposition Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 12:58:00 +03:00
Mikhail Chusavitin	1f750d3edd	fix(webui): prevent orphaned workers on restart, reduce metrics polling, add Kill Workers button - tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web crashes mid-test (each restart was launching a new set of 8 workers on top of still-alive orphans from the previous crash) - server: reduce metrics collector interval 1s→5s, grow ring buffer to 360 samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5× - platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn, stress-ng, stressapptest, memtester without relying on pkill/killall - webui: add "Kill Workers" button next to Cancel All; calls POST /api/tasks/kill-workers which cancels the task queue then kills orphaned OS-level processes; shows toast with killed count - metricsdb: sort GPU indices and fan/temp names after map iteration to fix non-deterministic sample reconstruction order (flaky test) - server: fix chartYAxisNumber to use one decimal place for 1000–9999 (e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 10:13:43 +03:00
Mikhail Chusavitin	b2b0444131	audit: ignore virtual hdisk and coprocessor noise	2026-04-02 09:56:17 +03:00
Michael Chus	dbab43db90	Fix full-history metrics range loading	2026-04-01 23:55:28 +03:00
Michael Chus	bcb7fe5fe9	Render charts from full SQLite history	2026-04-01 23:52:54 +03:00
Michael Chus	ef45246ea0	fix(sat): kill entire process group on task cancel exec.CommandContext only kills the direct child (the shell script), leaving grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT job runs in its own process group, then send SIGKILL to the whole group (-pgid) in the Cancel hook. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 23:46:33 +03:00
Michael Chus	1dd7f243f5	Keep chart series colors stable	2026-04-01 23:37:57 +03:00
Michael Chus	938e499ac2	Serve charts from SQLite history only	2026-04-01 23:33:13 +03:00
Michael Chus	964ab39656	fix: run john stress in parallel per GPU, fix chromium fullscreen, filter BMC virtual disks - bee-john-gpu-stress: spawn one john process per OpenCL device in parallel so all GPUs are stressed simultaneously instead of only device 1 - bee-openbox-session: --start-fullscreen → --start-maximized to fix blank white page on first render in fbdev environment - storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 23:14:21 +03:00
Michael Chus	c2aecc6ce9	Fix fan chart gaps and task durations	2026-04-01 22:36:11 +03:00
Michael Chus	439b86ce59	Unify live metrics chart rendering	2026-04-01 22:19:33 +03:00
Michael Chus	eb60100297	fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup - nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state - build: remove bee-nccl-gpu-stress from rm -f list so shell script from overlay is not silently dropped from the ISO - smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress, bee-nccl-gpu-stress, all_reduce_perf - netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors - auto/config: reduce loglevel 7→3 to show clean systemd output on boot - auto/config: blacklist snd_hda_intel and related audio modules (unused on servers) - package-lists: remove firmware-intel-sound and firmware-amd-graphics from base list; move firmware-amd-graphics to bee-amd variant only - bible-local: mark memtest ADR resolved, document working solution Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 21:25:23 +03:00
Mikhail Chusavitin	c95bbff23b	fix(metrics): stabilize cpu and power sampling	2026-04-01 09:40:42 +03:00
Mikhail Chusavitin	4e4debd4da	refactor(webui): redesign Burn tab and fix gpu-burn memory defaults - Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress, Compute Stress, Platform Thermal Cycling) + global Burn Profile - Run All button at top enqueues all enabled tests across all cards - GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools endpoint based on driver status (/dev/nvidia0, /dev/kfd) - Compute Stress: checkboxes for cpu/memory-stress/stressapptest - Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd) with platform_components param wired through to PlatformStressOptions - bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script now queries nvidia-smi memory.total per GPU and uses 95% of it - platform_stress: removed hardcoded --size-mb 64; respects Components field to selectively run CPU and/or GPU load goroutines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 09:39:07 +03:00
Mikhail Chusavitin	c394845b34	refactor(webui): queue install and bundle tasks - v3.18	2026-04-01 08:46:46 +03:00
Mikhail Chusavitin	b5b34983f1	fix(webui): repair audit actions and CPU burn flow - v3.15	2026-04-01 08:19:11 +03:00
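
Commits c5d6b30177 and ef45246ea0 above describe the same pattern: run the stress command in its own process group (Setpgid) and have the Cancel hook send SIGKILL to the whole group, so worker children of a shell wrapper do not outlive the task. A minimal Go sketch of that pattern follows. It is not the repository's code: the helper names `commandInOwnGroup` and `runWithTimeoutMs` are hypothetical, and it assumes a Unix-like system and Go 1.20 or later (for `exec.Cmd.Cancel`).

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// commandInOwnGroup (hypothetical helper) builds a command that runs in its
// own process group. On context cancellation it kills the whole group instead
// of only the direct child.
func commandInOwnGroup(ctx context.Context, name string, args ...string) *exec.Cmd {
	cmd := exec.CommandContext(ctx, name, args...)
	// Setpgid with a zero Pgid makes the child the leader of a new group,
	// so its process group ID equals its PID.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	cmd.Cancel = func() error {
		// A negative PID addresses every process in that group.
		return syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	}
	return cmd
}

// runWithTimeoutMs (hypothetical helper) runs a shell snippet under a timeout
// and reports how long Run took, in milliseconds.
func runWithTimeoutMs(script string, timeoutMs int) (int64, error) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	start := time.Now()
	err := commandInOwnGroup(ctx, "sh", "-c", script).Run()
	return time.Since(start).Milliseconds(), err
}

func main() {
	// The shell spawns a background child and waits for it, loosely
	// mimicking a wrapper script that starts worker processes.
	elapsed, err := runWithTimeoutMs("sleep 30 & wait", 200)
	fmt.Printf("returned after ~%d ms, err=%v\n", elapsed, err)
}
```

With the default Cancel, only the `sh` parent would be killed and the background `sleep` would keep running, which is the orphaned-worker situation the commits describe.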


124 Commits
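
Commit fd722692a4 states that in the component status store "OK never downgrades Warning or Critical". A minimal Go sketch of that one rule is below. The type and method names (`Status`, `Store`, `Record`) and the example PCIe address are hypothetical, the real store is a persistent JSON file rather than an in-memory map, and the commit message does not say how Warning and Critical interact, so this sketch simply lets any non-OK write overwrite the previous value.

```go
package main

import "fmt"

// Status is a hypothetical severity type; the names are illustrative only.
type Status int

const (
	StatusOK Status = iota
	StatusWarning
	StatusCritical
)

func (s Status) String() string {
	switch s {
	case StatusOK:
		return "OK"
	case StatusWarning:
		return "Warning"
	case StatusCritical:
		return "Critical"
	}
	return "Unknown"
}

// Store maps component keys such as "pcie:<BDF>" or "cpu:all" to a status.
type Store map[string]Status

// Record writes a status for key. An incoming OK is ignored when the stored
// status is already Warning or Critical; every other write overwrites
// (an assumption, since the source only specifies the OK rule).
func (st Store) Record(key string, s Status) {
	if cur, ok := st[key]; ok && s == StatusOK && cur != StatusOK {
		return
	}
	st[key] = s
}

func main() {
	st := Store{}
	st.Record("pcie:0000:3b:00.0", StatusWarning)
	st.Record("pcie:0000:3b:00.0", StatusOK) // ignored: would downgrade
	st.Record("cpu:all", StatusOK)
	fmt.Println(st["pcie:0000:3b:00.0"], st["cpu:all"]) // prints: Warning OK
}
```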