reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Michael Chus	ba16021cdb	Fix GPU model propagation, export filenames, PSU/service status, and chart perf - nvidia.go: add Name field to nvidiaGPUInfo, include model name in nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData - pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x … → 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat activating/deactivating/reloading service states as OK in Runtime Health - support_bundle.go: use "150405" time format (no colons) for exFAT compat - sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove .tar.gz archive creation from export dirs — export packs everything itself - charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf - benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 10:05:27 +03:00
Mikhail Chusavitin	9481ca2805	Add staged NVIDIA burn ramp-up mode	2026-04-09 15:21:14 +03:00
Michael Chus	b2f8626fee	Refactor validate modes, fix benchmark report and IPMI power - Replace diag level 1-4 dropdown with Validate/Stress radio buttons - Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short - Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended - Parallel GPU mode: spawn single task for all GPUs instead of splitting per model - Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel - Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts - Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading') Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 00:42:12 +03:00
Mikhail Chusavitin	531d1ca366	Add NVIDIA self-heal tools and per-GPU SAT status	2026-04-07 20:20:05 +03:00
Mikhail Chusavitin	0d0e1f55a7	Avoid misleading SAT summaries after task cancellation	2026-04-06 12:24:19 +03:00
Mikhail Chusavitin	35f4c53887	Stabilize NVIDIA GPU device mapping across loaders	2026-04-06 12:22:04 +03:00
Mikhail Chusavitin	fc5c100a29	Fix NVIDIA persistence mode and add benchmark results table	2026-04-06 10:47:07 +03:00
Michael Chus	1bdfb1e9ca	Fix nvidia-targeted-stress failing with DCGM_ST_IN_USE (-34). nvvs (DCGM validation suite) survives when dcgmi is killed mid-run, leaving the GPU occupied. The next dcgmi diag invocation then fails with "affected resource is in use". Two-part fix: - Add nvvs and dcgmi to KillTestWorkers patterns so they are cleaned up by the global cancel handler - Call KillTestWorkers at the start of RunNvidiaTargetedStressValidatePack to clear any stale processes before dcgmi diag runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 20:21:36 +03:00
Michael Chus	4461249cc3	Make memory stress size follow available RAM	2026-04-05 18:33:26 +03:00
Michael Chus	e609fbbc26	Add task reports and streamline GPU charts	2026-04-05 18:13:58 +03:00
Michael Chus	38e79143eb	Refine burn UI and NVIDIA stress flows	2026-04-05 13:43:43 +03:00
Michael Chus	20abff7f90	WIP: checkpoint current tree	2026-04-05 12:05:00 +03:00
Mikhail Chusavitin	7a843be6b0	Stabilize DCGM GPU discovery	2026-04-03 09:50:33 +03:00
Michael Chus	ef45246ea0	fix(sat): kill entire process group on task cancel. exec.CommandContext only kills the direct child (the shell script), leaving grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT job runs in its own process group, then send SIGKILL to the whole group (-pgid) in the Cancel hook. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 23:46:33 +03:00
Mikhail Chusavitin	b5b34983f1	fix(webui): repair audit actions and CPU burn flow - v3.15	2026-04-01 08:19:11 +03:00
Mikhail Chusavitin	6dee8f3509	Add NVIDIA stress loader selection and DCGM 4 support	2026-03-31 11:15:15 +03:00
Michael Chus	e15bcc91c5	feat(metrics): persist history in sqlite and add AMD memory validate tests	2026-03-29 12:28:06 +03:00
Michael Chus	98f0cf0d52	fix(amd-stress): include VRAM load in GST burn	2026-03-29 12:03:50 +03:00
Michael Chus	744de588bb	fix(burn): resolve rvs binary via /opt/rocm-*/bin glob like rocm-smi; add terminal copy button. rvs was not in PATH so the stress job exited immediately (UNSUPPORTED). Now resolveRVSCommand searches /opt/rocm-*/bin/rvs before failing. Also add a Copy button overlay on all .terminal elements and set user-select:text so logs can be copied from the web UI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 11:20:46 +03:00
Michael Chus	a03312c286	feat: AMD GPU compute stress via rocm-validation-suite GST (GEMM) - Add rocm-validation-suite, rocblas, rocrand, hip-runtime-amd, hipblaslt, comgr to ISO (~700MB, needed for HIP compute) - RunAMDStressPack: run RVS GST (SGEMM ~31 TFLOPS/GPU) + bandwidth test - Add rvs symlink in chroot setup hook - Pin all new package versions in VERSIONS Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 10:56:32 +03:00
Michael Chus	59a1d4b209	release: v3.1	2026-03-28 22:51:36 +03:00
Michael Chus	0a98ed8ae9	feat: task queue, UI overhaul, burn tests, install-to-RAM - Task queue: all SAT/audit jobs enqueue and run one-at-a-time; tasks persist past page navigation; new Tasks page with cancel/priority/log stream - UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal; Dashboard hardware summary badges + split metrics charts (load/temp/power); Tools page consolidates network, services, install, support bundle - AMD GPU: acceptance test and stress burn cards; GPU presence API greys out irrelevant SAT cards automatically - Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest) - Install to RAM: copies squashfs to /dev/shm, re-associates loop devices via LOOP_CHANGE_FD ioctl so live media can be ejected - Charts: relative time axis (0 = now, negative left) - memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED - SAT overlay applied dynamically on every /audit.json serve - MIME panic guard for LiveCD ramdisk I/O errors - ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry; disable screensaver/DPMS in bee-openbox-session - Unknown SAT status severity = 1 (does not override OK) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 21:15:11 +03:00
Michael Chus	5644231f9a	feat(nccl): add nccl-tests all_reduce_perf for GPU bandwidth testing - Dockerfile: install cuda-nvcc-13-0 from NVIDIA repo for compilation - build-nccl-tests.sh: downloads libnccl-dev for nccl.h, builds all_reduce_perf - build.sh: runs nccl-tests build, injects binary into /usr/local/bin/ - platform: RunNCCLTests() auto-detects GPU count, runs all_reduce_perf - TUI: NCCL bandwidth test entry in Burn-in Tests screen [N] hotkey Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 23:22:19 +03:00
Michael Chus	eea98e6d76	feat(dcgm): add NVIDIA DCGM diagnostics, fix KVM console - Add 9002-nvidia-dcgm.hook.chroot: installs datacenter-gpu-manager from NVIDIA apt repo during live-build - Enable nvidia-dcgm.service in chroot setup hook - Replace bee-gpu-stress with dcgmi diag (levels 1-4) in NVIDIA SAT - TUI: replace GPU checkbox + duration UI with DCGM level selection - Remove console=tty2 from boot params: KVM/VGA now shows tty1 where bee-tui runs, fixing unresponsive console Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 23:08:12 +03:00
Mikhail Chusavitin	9a1df9b1ba	Tighten support bundles and fix AMD runtime checks	2026-03-25 19:35:25 +03:00
Mikhail Chusavitin	0c16616cc9	1. Verbose live progress during SAT tests (CPU, Memory, Storage, AMD GPU) - New tui/sat_progress.go: polls {DefaultSATBaseDir}/{prefix}-*/verbose.log every 300ms and parses completed/in-progress steps - Busy screen now shows each step as PASS lscpu (234ms) / FAIL stress-ng (60.0s) / ... sensors-after instead of just "Working..." 2. Test results shown on screen (instead of just "Archive written to /path") - RunCPUAcceptancePackResult, RunMemoryAcceptancePackResult, RunStorageAcceptancePackResult, RunAMDAcceptancePackResult now read summary.txt from the run directory and return a formatted per-step result: Run: 2025-03-25T10:00:00Z PASS lscpu PASS sensors-before FAIL stress-ng PASS sensors-after Overall: FAILED (ok=3 failed=1) 3. AMD GPU SAT with auto-detection - platform.System.DetectGPUVendor(): checks /dev/nvidia0 → "nvidia", /dev/kfd → "amd" - platform.System.RunAMDAcceptancePack(): runs rocm-smi, rocm-smi --showallinfo, dmidecode - GPU SAT (G key / GPU row enter) automatically routes to AMD or NVIDIA based on detected vendor - "Run All" also auto-detects vendor 4. Panel detail view - GPU detail now shows the most recent (NVIDIA or AMD) SAT result, whichever is newer - All SAT detail views use the same human-readable formatSATDetail format	2026-03-25 17:54:27 +03:00
Mikhail Chusavitin	94e233651e	fix(sat): fix nvme device-self-test command flags. --start is not a valid nvme-cli flag; correct syntax is -s 1 (short test). Add --wait so the command blocks until the test completes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-25 15:24:52 +03:00
Mikhail Chusavitin	36dff6e584	feat: CPU SAT via stress-ng + BMC version via ipmitool BMC: - collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent - collector/collector.go: append BMC firmware record to snap.Firmware - app/panel.go: show BMC version in TUI right-panel header alongside BIOS CPU SAT: - platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng - app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated - app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data - tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label - tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult - tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration - cmd/bee/main.go: bee sat cpu [--duration N] subcommand Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-25 11:06:12 +03:00
Mikhail Chusavitin	76a17937f3	feat(tui): NVIDIA SAT with nvtop, GPU selection, metrics and chart — v1.0.0 - TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes - nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort - GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock) - Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt - Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours - AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list - runtime-flows.md updated with full NVIDIA SAT TUI flow documentation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-18 15:18:57 +03:00
Mikhail Chusavitin	b25a2f6d30	feat: add support bundle and raw audit export	2026-03-16 18:20:26 +03:00
Mikhail Chusavitin	b8c235b5ac	Add TUI hardware banner and polish SAT summaries	2026-03-15 14:27:01 +03:00
Mikhail Chusavitin	b483e2ce35	Add health verdicts and acceptance tests	2026-03-14 17:53:58 +03:00
Mikhail Chusavitin	6aca1682b9	Refactor bee CLI and LiveCD integration	2026-03-13 16:52:16 +03:00

33 Commits