reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Mikhail Chusavitin	386c0738ee	storage SAT: split collect/self-test modes, add per-disk text reports. Check mode: read-only SMART/NVMe data collection, no self-test. Load mode: same collection plus a short self-test (nvme device-self-test -s 1, smartctl -t short). Card descriptions updated accordingly. After each storage SAT run, a disk-N-devname-report.txt is written per device into the runDir (auto-included in support bundles). The Web UI task page renders one card per disk directly below Task Report. Also fixes the pre-existing TestDashboardRendersRuntimeHealthTable failure: the test fixture used "inactive" status, but the code now treats inactive as OK for completed oneshot services; the fixture was updated to "failed" to match intent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-30 19:07:36 +03:00
Mikhail Chusavitin	b4280941f5	Move NCCL and NVBandwidth into validate mode	2026-04-16 11:02:30 +03:00
Michael Chus	b2f8626fee	Refactor validate modes, fix benchmark report and IPMI power. Replace diag level 1-4 dropdown with Validate/Stress radio buttons. Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short. Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended. Parallel GPU mode: spawn single task for all GPUs instead of splitting per model. Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel. Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts. Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading'). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 00:42:12 +03:00
Mikhail Chusavitin	531d1ca366	Add NVIDIA self-heal tools and per-GPU SAT status	2026-04-07 20:20:05 +03:00
Mikhail Chusavitin	0d0e1f55a7	Avoid misleading SAT summaries after task cancellation	2026-04-06 12:24:19 +03:00
Mikhail Chusavitin	35f4c53887	Stabilize NVIDIA GPU device mapping across loaders	2026-04-06 12:22:04 +03:00
Mikhail Chusavitin	fc5c100a29	Fix NVIDIA persistence mode and add benchmark results table	2026-04-06 10:47:07 +03:00
Michael Chus	4461249cc3	Make memory stress size follow available RAM	2026-04-05 18:33:26 +03:00
Michael Chus	38e79143eb	Refine burn UI and NVIDIA stress flows	2026-04-05 13:43:43 +03:00
Mikhail Chusavitin	7a843be6b0	Stabilize DCGM GPU discovery	2026-04-03 09:50:33 +03:00
Mikhail Chusavitin	b5b34983f1	fix(webui): repair audit actions and CPU burn flow - v3.15	2026-04-01 08:19:11 +03:00
Michael Chus	45221d1e9a	fix(stress): label loaders and improve john opencl diagnostics	2026-04-01 07:31:52 +03:00
Mikhail Chusavitin	c850b39b01	feat: v3.10 GPU stress and NCCL burn updates	2026-03-31 11:22:27 +03:00
Mikhail Chusavitin	6dee8f3509	Add NVIDIA stress loader selection and DCGM 4 support	2026-03-31 11:15:15 +03:00
Michael Chus	e15bcc91c5	feat(metrics): persist history in sqlite and add AMD memory validate tests	2026-03-29 12:28:06 +03:00
Michael Chus	98f0cf0d52	fix(amd-stress): include VRAM load in GST burn	2026-03-29 12:03:50 +03:00
Mikhail Chusavitin	9a1df9b1ba	Tighten support bundles and fix AMD runtime checks	2026-03-25 19:35:25 +03:00
Mikhail Chusavitin	b25a2f6d30	feat: add support bundle and raw audit export	2026-03-16 18:20:26 +03:00
Mikhail Chusavitin	b8c235b5ac	Add TUI hardware banner and polish SAT summaries	2026-03-15 14:27:01 +03:00
Mikhail Chusavitin	b483e2ce35	Add health verdicts and acceptance tests	2026-03-14 17:53:58 +03:00

20 Commits