reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Michael Chus	2ceaa0d0ca	Include profile and mode in benchmark task names for task list clarity Task names now follow the pattern: NVIDIA Benchmark · <profile> · <mode> [· GPU <indices>] Examples: NVIDIA Benchmark · standard · sequential (GPU 0, RTX 6000 Pro) NVIDIA Benchmark · stability · parallel NVIDIA Benchmark · standard · ramp 1/4 · GPU 0 NVIDIA Benchmark · standard · ramp 2/4 · GPU 0,1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:36:51 +03:00
Michael Chus	813e2f86a9	Add scalability/ramp-up labeling, ServerPower penalty in scoring, and report improvements - Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files - Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally by comparing ramp-up step results sharing the same ramp_run_id) - Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time), tasks.go handler, and benchmark.go result population - Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75: factor = ratio/0.75, applied per-GPU with a note explaining the reduction - Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw) - Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode; shows scalability_score when non-zero Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:30:47 +03:00
Michael Chus	098e19f760	Add ramp-up mode to NVIDIA GPU benchmark Adds a new checkbox (enabled by default) in the benchmark section. In ramp-up mode N tasks are spawned simultaneously: 1 GPU, then 2, then 3, up to all selected GPUs — each step runs its GPUs in parallel. NCCL runs only on the final step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 18:34:19 +03:00
Mikhail Chusavitin	9481ca2805	Add staged NVIDIA burn ramp-up mode	2026-04-09 15:21:14 +03:00
Michael Chus	025548ab3c	UI: amber accents, smaller wallpaper logo, new support bundle name, drop display resolution - Bootloader: GRUB fallback text colors → yellow/brown (amber tone) - CLI charts: all GPU metric series use single amber color (xterm-256 #214) - Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size - Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz using dmidecode for server model (spaces→underscores) and serial number - Remove display resolution feature (UI card, API routes, handlers, tests) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 21:37:01 +03:00
Michael Chus	b2f8626fee	Refactor validate modes, fix benchmark report and IPMI power - Replace diag level 1-4 dropdown with Validate/Stress radio buttons - Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short - Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended - Parallel GPU mode: spawn single task for all GPUs instead of splitting per model - Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel - Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts - Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading') Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 00:42:12 +03:00
Mikhail Chusavitin	531d1ca366	Add NVIDIA self-heal tools and per-GPU SAT status	2026-04-07 20:20:05 +03:00
Mikhail Chusavitin	93cfa78e8c	Benchmark: parallel GPU mode, resilient inventory query, server model in results - Add parallel GPU mode (checkbox, off by default): runs all selected GPUs simultaneously via a single bee-gpu-burn invocation instead of sequentially; per-GPU telemetry, throttle counters, TOPS, and scoring are preserved - Make queryBenchmarkGPUInfo resilient: falls back to a base field set when extended fields (attribute.multiprocessor_count, power.default_limit) cause exit status 2, preventing lgc normalization from being silently skipped - Log explicit "graphics clock lock skipped" note when inventory is unavailable - Collect server model from DMI (/sys/class/dmi/id/product_name) and store in result JSON; benchmark history columns now show "Server Model (N× GPU Model)" grouped by server+GPU type rather than individual GPU index Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-07 18:32:15 +03:00
Mikhail Chusavitin	2354ae367d	Normalize task IDs and artifact folder prefixes	2026-04-06 12:26:47 +03:00
Mikhail Chusavitin	981315e6fd	Split NVIDIA tasks by homogeneous GPU groups	2026-04-06 11:58:13 +03:00
Michael Chus	a493e3ab5b	Fix service control buttons: sudo, real error output, UX feedback - services.go: use sudo systemctl so bee user can control system services - api.go: always return 200 with output field even on error, so the frontend shows the actual systemctl message instead of "exit status 1" - pages.go: button shows "..." while pending then restores label; output panel is full-width under the table with ✓/✗ status indicator; output auto-scrolls to bottom Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 20:25:41 +03:00
Michael Chus	38e79143eb	Refine burn UI and NVIDIA stress flows	2026-04-05 13:43:43 +03:00
Michael Chus	20abff7f90	WIP: checkpoint current tree	2026-04-05 12:05:00 +03:00
Michael Chus	143b7dca5d	Add stability hardening and self-heal recovery	2026-04-05 10:29:37 +03:00
Mikhail Chusavitin	c27449c60e	feat(webui): show current boot source	2026-04-02 15:36:32 +03:00
Mikhail Chusavitin	17ff3811f8	fix(webui): improve tasks logs and ordering	2026-04-02 13:43:59 +03:00
Mikhail Chusavitin	fc7fe0b08e	fix(webui): build support bundle synchronously on download, bypass task queue Support bundle is now built on-the-fly when the user clicks the button, regardless of whether other tasks are running: - GET /export/support.tar.gz builds the bundle synchronously and streams it directly to the client; the temp archive is removed after serving - Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue approach meant the bundle could only be downloaded after navigating away and back, and was blocked entirely while a long SAT test was running - UI: single "Download Support Bundle" button; fetch+blob gives a loading state ("Building...") while the server collects logs, then triggers the browser download with the correct filename from Content-Disposition Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 12:58:00 +03:00
Michael Chus	c2aecc6ce9	Fix fan chart gaps and task durations	2026-04-01 22:36:11 +03:00
Michael Chus	439b86ce59	Unify live metrics chart rendering	2026-04-01 22:19:33 +03:00
Mikhail Chusavitin	4e4debd4da	refactor(webui): redesign Burn tab and fix gpu-burn memory defaults - Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress, Compute Stress, Platform Thermal Cycling) + global Burn Profile - Run All button at top enqueues all enabled tests across all cards - GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools endpoint based on driver status (/dev/nvidia0, /dev/kfd) - Compute Stress: checkboxes for cpu/memory-stress/stressapptest - Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd) with platform_components param wired through to PlatformStressOptions - bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script now queries nvidia-smi memory.total per GPU and uses 95% of it - platform_stress: removed hardcoded --size-mb 64; respects Components field to selectively run CPU and/or GPU load goroutines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 09:39:07 +03:00
Mikhail Chusavitin	c394845b34	refactor(webui): queue install and bundle tasks - v3.18	2026-04-01 08:46:46 +03:00
Mikhail Chusavitin	b5b34983f1	fix(webui): repair audit actions and CPU burn flow - v3.15	2026-04-01 08:19:11 +03:00
Michael Chus	45221d1e9a	fix(stress): label loaders and improve john opencl diagnostics	2026-04-01 07:31:52 +03:00
Michael Chus	ea660500c9	chore: commit pending repo changes	2026-03-31 22:17:36 +03:00
Mikhail Chusavitin	6dee8f3509	Add NVIDIA stress loader selection and DCGM 4 support	2026-03-31 11:15:15 +03:00
Mikhail Chusavitin	20f834aa96	feat: v3.4 — boot reliability, log readability, USB export, screen resolution, GRUB UEFI fix, memtest, KVM console stability Web UI / logs: - Strip ANSI escape codes and handle \r (progress bars) in task log output - Add USB export API + UI card on Export page (list removable devices, write audit JSON or support bundle) - Add Display Resolution card in Tools (xrandr-based, per-output mode selector) - Dashboard: audit status banner with auto-reload when audit task completes Boot & install: - bee-web starts immediately with no dependencies (was blocked by audit + network) - bee-audit.service redesigned: waits for bee-web healthz, sleeps 60s, enqueues audit via /api/audit/run (task system) - bee-install: fix GRUB UEFI — grub-install exit code was silently ignored (\|\| true); add --no-nvram fallback; always copy EFI/BOOT/BOOTX64.EFI fallback path - Add grub-efi-amd64, grub-pc, grub-efi-amd64-signed, shim-signed to package list (grub-install requires these, not just -bin variants) - memtest hook: fix binary/boot/ not created before cp; handle both Debian (no extension) and upstream (x64.efi) naming - bee-openbox-session: increase healthz wait from 30s to 120s KVM console stability: - runCmdJob: syscall.Setpriority(PRIO_PROCESS, pid, 10) on all stress subprocesses - lightdm.service.d: Nice=-5 so X server preempts stress processes Packages: add btop Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-31 10:16:15 +03:00
Michael Chus	e15bcc91c5	feat(metrics): persist history in sqlite and add AMD memory validate tests	2026-03-29 12:28:06 +03:00
Michael Chus	3fda18f708	feat(metrics): SQLite persistence + chart fixes (no dots, peak label, min/avg/max in title) - Add modernc.org/sqlite dependency; write every sample to /appdata/bee/metrics.db (WAL mode, prune to 24h on startup) - Pre-fill ring buffers from last 120 DB rows on startup so charts survive service restarts - Ticker changed 3s→1s; chart JS refresh will be set to 2s (lag ≤3s) - Add GET /api/metrics/export.csv for full history download - Chart rendering: SymbolNone (no dots), right padding=80px so peak mark line label is not clipped, min/avg/max appended to chart title Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 11:37:59 +03:00
Michael Chus	126af96780	fix(webui): slow metrics chart refresh to 3s interval Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 10:32:35 +03:00
Michael Chus	59a1d4b209	release: v3.1	2026-03-28 22:51:36 +03:00
Michael Chus	0a98ed8ae9	feat: task queue, UI overhaul, burn tests, install-to-RAM - Task queue: all SAT/audit jobs enqueue and run one-at-a-time; tasks persist past page navigation; new Tasks page with cancel/priority/log stream - UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal; Dashboard hardware summary badges + split metrics charts (load/temp/power); Tools page consolidates network, services, install, support bundle - AMD GPU: acceptance test and stress burn cards; GPU presence API greys out irrelevant SAT cards automatically - Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest) - Install to RAM: copies squashfs to /dev/shm, re-associates loop devices via LOOP_CHANGE_FD ioctl so live media can be ejected - Charts: relative time axis (0 = now, negative left) - memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED - SAT overlay applied dynamically on every /audit.json serve - MIME panic guard for LiveCD ramdisk I/O errors - ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry; disable screensaver/DPMS in bee-openbox-session - Unknown SAT status severity = 1 (does not override OK) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 21:15:11 +03:00
Michael Chus	a393dcb731	feat(webui): add POST /api/sat/abort + update bible-local runtime-flows - jobState now has optional cancel func; abort() calls it if job is running - handleAPISATRun passes cancellable context to RunNvidiaAcceptancePackWithOptions - POST /api/sat/abort?job_id=... cancels the running SAT job - bible-local/runtime-flows.md: replace TUI SAT flow with Web UI flow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:23:00 +03:00
Michael Chus	4c8417d20a	feat(webui): add Install to Disk page Expose the existing bee-install script through the web UI: - platform/install.go: remove USB exclusion, add SizeBytes/MountedParts fields, add MinInstallBytes()/DiskWarnings() safety checks (size, mounted partitions, toram+low-RAM warning) - webui: add GET /api/install/disks, POST /api/install/run, GET /api/install/stream endpoints - webui: add Install to Disk page with disk table, warning badges, device-name confirmation gate, SSE progress terminal, reboot button Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:11:16 +03:00
Michael Chus	ec0b7f7ff9	feat(metrics): single chart engine + full-width stacked layout - One engine: go-analyze/charts (grafana theme) for all live metrics - Server chart: CPU temp, CPU load%, mem load%, power W, fan RPMs - GPU charts: temp, load%, mem%, power W — one card per GPU, added dynamically - Charts 1400x280px SVG, rendered at width:100% in single-column layout - Add CPU load (from /proc/stat) and mem load (from /proc/meminfo) to LiveMetricSample - Add GPU mem utilization to GPUMetricRow (nvidia-smi utilization.memory) - Document charting architecture in bible-local/architecture/charting.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 23:26:13 +03:00
Michael Chus	ff0acc3698	feat(webui): server-side SVG charts + reanimator-chart viewer Metrics: - Replace canvas JS charts with server-side SVG via go-analyze/charts - Add ring buffers (120 samples) for CPU temp and power - /api/metrics/chart/{name}.svg endpoint serves live SVG, polled every 2s Dashboard: - Replace custom renderViewerPage with viewer.RenderHTML() from reanimator/chart submodule - Mount chart static assets at /chart/static/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 23:07:47 +03:00
Michael Chus	ed4f8be019	fix(webui): services table — show state badge, full status on click Replace raw systemctl output in table cell with: - state badge (active/failed/inactive) — click to expand - full systemctl status in collapsible pre block (max 200px scroll) Fixes layout explosion from multi-line status text in table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:47:59 +03:00
Michael Chus	004cc4910d	feat(webui): replace TUI with full web UI + local openbox desktop - Remove audit/internal/tui/ (~3000 LOC, bubbletea/lipgloss/reanimator deps) - Add /api/* REST+SSE endpoints: audit, SAT (nvidia/memory/storage/cpu), services, network, export, tools, live metrics stream - Add async job manager with SSE streaming for long-running operations - Add platform.SampleLiveMetrics() for live fan/temp/power/GPU polling - Add multi-page web UI (vanilla JS): Dashboard, Metrics charts, Tests, Burn-in, Network, Services, Export, Tools - Add bee-desktop.service: openbox + Xorg + Chromium opening http://localhost/ - Add openbox/tint2/xorg/xinit/xterm/chromium to ISO package list - Update .profile, bee.sh, and bible-local docs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:21:14 +03:00

37 Commits