reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Mikhail Chusavitin	eea2591bcc	Fix John GPU stress duration semantics	2026-04-03 09:46:16 +03:00
Mikhail Chusavitin	295a19b93a	feat(tasks): run all queued tasks in parallel Tasks are now started simultaneously when multiple are enqueued (e.g. Run All). The worker drains all pending tasks at once and launches each in its own goroutine, waiting via WaitGroup. kmsg watcher updated to use a shared event window with a reference counter across concurrent tasks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 09:15:06 +03:00
Mikhail Chusavitin	444a7d16cc	fix(iso): increase boot verbosity for service startup visibility Raise loglevel from 3 to 6 (INFO) and add systemd.show_status=1 so kernel driver messages and systemd [ OK ]/[ FAILED ] lines are visible during boot instead of showing only a blank cursor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v4.3	2026-04-02 19:33:27 +03:00
Mikhail Chusavitin	fd722692a4	feat(watchdog): hardware error monitor + unified component status store - Add platform/error_patterns.go: pluggable table of kernel log patterns (NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct - Add app/component_status_db.go: persistent JSON store (component-status.json) keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never downgrades Warning or Critical - Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks, writes Warning to DB for matched hardware errors - Fix task status: overall_status=FAILED in summary.txt now marks task failed - Audit routine overlays component DB statuses into bee-audit.json on every read Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 19:20:59 +03:00
Mikhail Chusavitin	99cece524c	feat(support-bundle): add PCIe link diagnostics and system logs - Add full dmesg (was tail -200), kern.log, syslog - Add /proc/cmdline, lspci -vvv, nvidia-smi -q - Add per-GPU PCIe link speed/width from sysfs (NVIDIA devices only) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v4.2	2026-04-02 15:42:28 +03:00
Mikhail Chusavitin	c27449c60e	feat(webui): show current boot source	2026-04-02 15:36:32 +03:00
Mikhail Chusavitin	5ef879e307	feat(webui): add gpu driver restart action	2026-04-02 15:30:23 +03:00
Mikhail Chusavitin	e7df63bae1	fix(app): include extra system logs in support bundle	2026-04-02 13:44:58 +03:00
Mikhail Chusavitin	17ff3811f8	fix(webui): improve tasks logs and ordering	2026-04-02 13:43:59 +03:00
Mikhail Chusavitin	fc7fe0b08e	fix(webui): build support bundle synchronously on download, bypass task queue Support bundle is now built on-the-fly when the user clicks the button, regardless of whether other tasks are running: - GET /export/support.tar.gz builds the bundle synchronously and streams it directly to the client; the temp archive is removed after serving - Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue approach meant the bundle could only be downloaded after navigating away and back, and was blocked entirely while a long SAT test was running - UI: single "Download Support Bundle" button; fetch+blob gives a loading state ("Building...") while the server collects logs, then triggers the browser download with the correct filename from Content-Disposition Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v4.1	2026-04-02 12:58:00 +03:00
Mikhail Chusavitin	3cf75a541a	build: collect ISO and logs under versioned dist/easy-bee-v{VERSION}/ dir All final artefacts for a given version now land in one place: dist/easy-bee-v4.1/ easy-bee-nvidia-v4.1-amd64.iso easy-bee-nvidia-v4.1-amd64.logs.tar.gz ← log archive (logs dir deleted after archiving) - Introduce OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}" - Move LOG_DIR, LOG_ARCHIVE, and ISO_OUT into OUT_DIR - cleanup_build_log: use dirname(LOG_DIR) as tar -C base so the path is correct regardless of where OUT_DIR lives; delete LOG_DIR after archiving Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 10:19:11 +03:00
Mikhail Chusavitin	1f750d3edd	fix(webui): prevent orphaned workers on restart, reduce metrics polling, add Kill Workers button - tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web crashes mid-test (each restart was launching a new set of 8 workers on top of still-alive orphans from the previous crash) - server: reduce metrics collector interval 1s→5s, grow ring buffer to 360 samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5× - platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn, stress-ng, stressapptest, memtester without relying on pkill/killall - webui: add "Kill Workers" button next to Cancel All; calls POST /api/tasks/kill-workers which cancels the task queue then kills orphaned OS-level processes; shows toast with killed count - metricsdb: sort GPU indices and fan/temp names after map iteration to fix non-deterministic sample reconstruction order (flaky test) - server: fix chartYAxisNumber to use one decimal place for 1000–9999 (e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 10:13:43 +03:00
Mikhail Chusavitin	b2b0444131	audit: ignore virtual hdisk and coprocessor noise	2026-04-02 09:56:17 +03:00
mchus	dbab43db90	Fix full-history metrics range loading v4	2026-04-01 23:55:28 +03:00
mchus	bcb7fe5fe9	Render charts from full SQLite history	2026-04-01 23:52:54 +03:00
mchus	d21d9d191b	fix(build): bump DCGM to 4.5.3-1 — core package updated in CUDA repo Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 23:49:57 +03:00
mchus	ef45246ea0	fix(sat): kill entire process group on task cancel exec.CommandContext only kills the direct child (the shell script), leaving grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT job runs in its own process group, then send SIGKILL to the whole group (-pgid) in the Cancel hook. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 23:46:33 +03:00
mchus	348db35119	fix(stress): stagger john GPU launches to prevent GWS tuning contention When 8 john processes start simultaneously they race for GPU memory during OpenCL GWS auto-tuning. Slower devices settle on a smaller work size (~594MiB vs 762MiB) and run at 40% instead of 100% load. Add 3s sleep between launches so each instance finishes memory allocation before the next one starts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 23:44:00 +03:00
mchus	1dd7f243f5	Keep chart series colors stable	2026-04-01 23:37:57 +03:00
mchus	938e499ac2	Serve charts from SQLite history only	2026-04-01 23:33:13 +03:00
mchus	964ab39656	fix: run john stress in parallel per GPU, fix chromium fullscreen, filter BMC virtual disks - bee-john-gpu-stress: spawn one john process per OpenCL device in parallel so all GPUs are stressed simultaneously instead of only device 1 - bee-openbox-session: --start-fullscreen → --start-maximized to fix blank white page on first render in fbdev environment - storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 23:14:21 +03:00
mchus	c2aecc6ce9	Fix fan chart gaps and task durations	2026-04-01 22:36:11 +03:00
mchus	439b86ce59	Unify live metrics chart rendering	2026-04-01 22:19:33 +03:00
mchus	eb60100297	fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup - nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state - build: remove bee-nccl-gpu-stress from rm -f list so shell script from overlay is not silently dropped from the ISO - smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress, bee-nccl-gpu-stress, all_reduce_perf - netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors - auto/config: reduce loglevel 7→3 to show clean systemd output on boot - auto/config: blacklist snd_hda_intel and related audio modules (unused on servers) - package-lists: remove firmware-intel-sound and firmware-amd-graphics from base list; move firmware-amd-graphics to bee-amd variant only - bible-local: mark memtest ADR resolved, document working solution Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 21:25:23 +03:00
Mikhail Chusavitin	2baf3be640	Handle memtest recovery probe under set -e	2026-04-01 17:42:13 +03:00
Mikhail Chusavitin	d92f8f41d0	Fix memtest ISO validation false negatives v3.21	2026-04-01 12:22:17 +03:00
Mikhail Chusavitin	76a9100779	fix(iso): rebuild image after memtest recovery	2026-04-01 10:01:14 +03:00
Mikhail Chusavitin	1b6d592bf3	feat(iso): add optional kms display boot path	2026-04-01 09:42:59 +03:00
Mikhail Chusavitin	c95bbff23b	fix(metrics): stabilize cpu and power sampling	2026-04-01 09:40:42 +03:00
Mikhail Chusavitin	4e4debd4da	refactor(webui): redesign Burn tab and fix gpu-burn memory defaults - Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress, Compute Stress, Platform Thermal Cycling) + global Burn Profile - Run All button at top enqueues all enabled tests across all cards - GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools endpoint based on driver status (/dev/nvidia0, /dev/kfd) - Compute Stress: checkboxes for cpu/memory-stress/stressapptest - Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd) with platform_components param wired through to PlatformStressOptions - bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script now queries nvidia-smi memory.total per GPU and uses 95% of it - platform_stress: removed hardcoded --size-mb 64; respects Components field to selectively run CPU and/or GPU load goroutines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 09:39:07 +03:00
Mikhail Chusavitin	5839f870b7	fix(iso): include full nvidia opencl runtime	2026-04-01 09:16:06 +03:00
Mikhail Chusavitin	b447717a5a	fix(iso): harden boot network bring-up - v3.20 v3.20	2026-04-01 09:10:55 +03:00
Mikhail Chusavitin	f6f4923ac9	fix(iso): recover memtest after live-build v3.19	2026-04-01 08:55:57 +03:00
Mikhail Chusavitin	c394845b34	refactor(webui): queue install and bundle tasks - v3.18 v3.18	2026-04-01 08:46:46 +03:00
Mikhail Chusavitin	3472afea32	fix(iso): make memtest non-blocking by default v3.17	2026-04-01 08:33:36 +03:00
Mikhail Chusavitin	942f11937f	chore(submodule): update bible - v3.16 v3.16	2026-04-01 08:23:39 +03:00
Mikhail Chusavitin	b5b34983f1	fix(webui): repair audit actions and CPU burn flow - v3.15 v3.15	2026-04-01 08:19:11 +03:00
mchus	45221d1e9a	fix(stress): label loaders and improve john opencl diagnostics v3.14	2026-04-01 07:31:52 +03:00
mchus	3869788bac	fix(iso): validate memtest with xorriso fallback	2026-04-01 07:24:05 +03:00
mchus	3dbc2184ef	fix(iso): archive build logs and memtest diagnostics v3.13	2026-04-01 07:14:53 +03:00
mchus	60cb8f889a	fix(iso): restore memtest menu entries and validate ISO v3.12	2026-04-01 07:04:48 +03:00
mchus	c9ee078622	fix(stress): keep platform burn responsive under load v3.11	2026-03-31 22:28:26 +03:00
mchus	ea660500c9	chore: commit pending repo changes	2026-03-31 22:17:36 +03:00
mchus	d43a9aeec7	fix(iso): restore live-build memtest integration	2026-03-31 22:10:28 +03:00
Mikhail Chusavitin	f5622e351e	Fix staged John cleanup for repeated ISO builds	2026-03-31 11:40:52 +03:00
Mikhail Chusavitin	a20806afc8	Fix ISO grub package conflict	2026-03-31 11:38:30 +03:00
Mikhail Chusavitin	4f9b6b3bcd	Harden NVIDIA boot logging on live ISO	2026-03-31 11:37:21 +03:00
Mikhail Chusavitin	c850b39b01	feat: v3.10 GPU stress and NCCL burn updates v3.10	2026-03-31 11:22:27 +03:00
Mikhail Chusavitin	6dee8f3509	Add NVIDIA stress loader selection and DCGM 4 support	2026-03-31 11:15:15 +03:00
Mikhail Chusavitin	20f834aa96	feat: v3.4 — boot reliability, log readability, USB export, screen resolution, GRUB UEFI fix, memtest, KVM console stability Web UI / logs: - Strip ANSI escape codes and handle \r (progress bars) in task log output - Add USB export API + UI card on Export page (list removable devices, write audit JSON or support bundle) - Add Display Resolution card in Tools (xrandr-based, per-output mode selector) - Dashboard: audit status banner with auto-reload when audit task completes Boot & install: - bee-web starts immediately with no dependencies (was blocked by audit + network) - bee-audit.service redesigned: waits for bee-web healthz, sleeps 60s, enqueues audit via /api/audit/run (task system) - bee-install: fix GRUB UEFI — grub-install exit code was silently ignored (|| true); add --no-nvram fallback; always copy EFI/BOOT/BOOTX64.EFI fallback path - Add grub-efi-amd64, grub-pc, grub-efi-amd64-signed, shim-signed to package list (grub-install requires these, not just -bin variants) - memtest hook: fix binary/boot/ not created before cp; handle both Debian (no extension) and upstream (x64.efi) naming - bee-openbox-session: increase healthz wait from 30s to 120s KVM console stability: - runCmdJob: syscall.Setpriority(PRIO_PROCESS, pid, 10) on all stress subprocesses - lightdm.service.d: Nice=-5 so X server preempts stress processes Packages: add btop Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v3.9	2026-03-31 10:16:15 +03:00
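
Note on ef45246ea0 (kill entire process group on task cancel): the commit message describes setting Setpgid so each SAT job gets its own process group, then sending SIGKILL to the negative pgid. The following is a minimal sketch of that technique, not the repository's actual code. The helper names startInGroup and killGroup are hypothetical, and the sketch assumes a Unix-like system with sh and sleep available.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// startInGroup (hypothetical helper) runs a shell script in a new process
// group. With Setpgid set and no explicit Pgid, the child's pgid equals its
// own pid, and anything the script spawns inherits that group.
func startInGroup(script string) (*exec.Cmd, error) {
	cmd := exec.Command("sh", "-c", script)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// killGroup (hypothetical helper) sends SIGKILL to every process in the
// command's group. A negative pid tells kill(2) to address the group.
func killGroup(cmd *exec.Cmd) error {
	return syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
}

func main() {
	// The background sleep stands in for a grandchild such as john or gpu-burn.
	cmd, err := startInGroup("sleep 60 & sleep 60")
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	time.Sleep(200 * time.Millisecond) // give the shell time to fork
	if err := killGroup(cmd); err != nil {
		fmt.Println("kill failed:", err)
		return
	}
	// Wait reaps the shell; it reports an error because the shell was killed.
	fmt.Println("wait returned:", cmd.Wait())
}
```

Killing only cmd.Process would stop the shell but leave the background sleep running, which is the orphan problem the commit message describes.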


305 Commits
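
Note on fd722692a4 (unified component status store): the commit message states that statuses are keyed by strings such as "pcie:BDF" and that OK never downgrades Warning or Critical. A minimal in-memory sketch of that merge rule follows. The type and method names (Status, Store, Record) are hypothetical, persistence to component-status.json is omitted, and the sketch additionally assumes that Warning does not overwrite Critical, which the commit message does not state.

```go
package main

import "fmt"

// Status is a hypothetical component health level.
type Status string

const (
	StatusOK       Status = "OK"
	StatusWarning  Status = "Warning"
	StatusCritical Status = "Critical"
)

// severity orders the levels so that a stored status can only escalate.
var severity = map[Status]int{
	StatusOK:       0,
	StatusWarning:  1,
	StatusCritical: 2,
}

// Store maps a component key (for example "pcie:0000:3b:00.0" or "cpu:all")
// to the worst status recorded so far.
type Store map[string]Status

// Record stores st under key unless an equal or more severe status is
// already present, and returns the status in effect afterwards.
func (s Store) Record(key string, st Status) Status {
	if cur, ok := s[key]; ok && severity[cur] >= severity[st] {
		return cur
	}
	s[key] = st
	return st
}

func main() {
	s := Store{}
	s.Record("pcie:0000:3b:00.0", StatusWarning)
	// A later OK report leaves the earlier Warning in place.
	fmt.Println(s.Record("pcie:0000:3b:00.0", StatusOK))
}
```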