Commit Graph

325 Commits

Author SHA1 Message Date
f1621efee4 Mirror task lifecycle to serial console v5.6 2026-04-05 18:34:06 +03:00
4461249cc3 Make memory stress size follow available RAM 2026-04-05 18:33:26 +03:00
e609fbbc26 Add task reports and streamline GPU charts v5.5 2026-04-05 18:13:58 +03:00
cc2b49ea41 Improve validate GPU runs and web UI feedback 2026-04-05 17:50:13 +03:00
33e0a5bef2 Refine validate UI and runtime health table v5.4 2026-04-05 16:24:45 +03:00
38e79143eb Refine burn UI and NVIDIA stress flows v5.3 2026-04-05 13:43:43 +03:00
25af2df23a Unify metrics charts on custom SVG renderer v5.2 2026-04-05 12:17:50 +03:00
20abff7f90 WIP: checkpoint current tree 2026-04-05 12:05:00 +03:00
a14ec8631c Persist GPU chart mode and expand GPU charts 2026-04-05 11:52:32 +03:00
f58c7e58d3 Fix webui streaming recovery regressions v5.1 2026-04-05 10:39:09 +03:00
bf47c8dbd2 Add NVIDIA benchmark reporting flow v5 2026-04-05 10:30:56 +03:00
143b7dca5d Add stability hardening and self-heal recovery 2026-04-05 10:29:37 +03:00
9826d437a5 Add GPU clock charts and grouped GPU metrics view 2026-04-05 09:57:38 +03:00
Mikhail Chusavitin
f3c14cd893 Harden NIC probing for empty SFP ports v4.6 v4.7 2026-04-04 15:23:15 +03:00
Mikhail Chusavitin
728270dc8e Unblock bee-web startup and expand support bundle diagnostics v4.5 2026-04-04 15:18:43 +03:00
Mikhail Chusavitin
8692f825bc Use plain repo tags for build version v4.4 2026-04-03 10:48:51 +03:00
Mikhail Chusavitin
11f52ac710 Fix task log modal scrolling 2026-04-03 10:36:11 +03:00
Mikhail Chusavitin
1cb398fe83 Show tag version at top of sidebar 2026-04-03 10:08:00 +03:00
Mikhail Chusavitin
7a843be6b0 Stabilize DCGM GPU discovery 2026-04-03 09:50:33 +03:00
Mikhail Chusavitin
7f6386dccc Restore USB support bundle export on tools page 2026-04-03 09:48:22 +03:00
Mikhail Chusavitin
eea2591bcc Fix John GPU stress duration semantics 2026-04-03 09:46:16 +03:00
Mikhail Chusavitin
295a19b93a feat(tasks): run all queued tasks in parallel
Tasks are now started simultaneously when multiple are enqueued (e.g.
Run All). The worker drains all pending tasks at once and launches each
in its own goroutine, waiting via WaitGroup. kmsg watcher updated to
use a shared event window with a reference counter across concurrent tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 09:15:06 +03:00
Mikhail Chusavitin
444a7d16cc fix(iso): increase boot verbosity for service startup visibility
Raise loglevel from 3 to 6 (INFO) and add systemd.show_status=1 so
kernel driver messages and systemd [ OK ]/[ FAILED ] lines are visible
during boot instead of showing only a blank cursor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v4.3
2026-04-02 19:33:27 +03:00
Mikhail Chusavitin
fd722692a4 feat(watchdog): hardware error monitor + unified component status store
- Add platform/error_patterns.go: pluggable table of kernel log patterns
  (NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct
- Add app/component_status_db.go: persistent JSON store (component-status.json)
  keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never
  downgrades Warning or Critical
- Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks,
  writes Warning to DB for matched hardware errors
- Fix task status: overall_status=FAILED in summary.txt now marks task failed
- Audit routine overlays component DB statuses into bee-audit.json on every read

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 19:20:59 +03:00
Mikhail Chusavitin
99cece524c feat(support-bundle): add PCIe link diagnostics and system logs
- Add full dmesg (was tail -200), kern.log, syslog
- Add /proc/cmdline, lspci -vvv, nvidia-smi -q
- Add per-GPU PCIe link speed/width from sysfs (NVIDIA devices only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v4.2
2026-04-02 15:42:28 +03:00
Mikhail Chusavitin
c27449c60e feat(webui): show current boot source 2026-04-02 15:36:32 +03:00
Mikhail Chusavitin
5ef879e307 feat(webui): add gpu driver restart action 2026-04-02 15:30:23 +03:00
Mikhail Chusavitin
e7df63bae1 fix(app): include extra system logs in support bundle 2026-04-02 13:44:58 +03:00
Mikhail Chusavitin
17ff3811f8 fix(webui): improve tasks logs and ordering 2026-04-02 13:43:59 +03:00
Mikhail Chusavitin
fc7fe0b08e fix(webui): build support bundle synchronously on download, bypass task queue
Support bundle is now built on-the-fly when the user clicks the button,
regardless of whether other tasks are running:

- GET /export/support.tar.gz builds the bundle synchronously and streams it
  directly to the client; the temp archive is removed after serving
- Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue
  approach meant the bundle could only be downloaded after navigating away
  and back, and was blocked entirely while a long SAT test was running
- UI: single "Download Support Bundle" button; fetch+blob gives a loading
  state ("Building...") while the server collects logs, then triggers the
  browser download with the correct filename from Content-Disposition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v4.1
2026-04-02 12:58:00 +03:00
Mikhail Chusavitin
3cf75a541a build: collect ISO and logs under versioned dist/easy-bee-v{VERSION}/ dir
All final artefacts for a given version now land in one place:
  dist/easy-bee-v4.1/
    easy-bee-nvidia-v4.1-amd64.iso
    easy-bee-nvidia-v4.1-amd64.logs.tar.gz   ← log archive
                                               (logs dir deleted after archiving)

- Introduce OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}"
- Move LOG_DIR, LOG_ARCHIVE, and ISO_OUT into OUT_DIR
- cleanup_build_log: use dirname(LOG_DIR) as tar -C base so the path is
  correct regardless of where OUT_DIR lives; delete LOG_DIR after archiving

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 10:19:11 +03:00
Mikhail Chusavitin
1f750d3edd fix(webui): prevent orphaned workers on restart, reduce metrics polling, add Kill Workers button
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
  re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
  crashes mid-test (each restart was launching a new set of 8 workers on top
  of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
  samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
  stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
  POST /api/tasks/kill-workers which cancels the task queue then kills
  orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
  non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
  (e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 10:13:43 +03:00
Mikhail Chusavitin
b2b0444131 audit: ignore virtual hdisk and coprocessor noise 2026-04-02 09:56:17 +03:00
dbab43db90 Fix full-history metrics range loading v4 2026-04-01 23:55:28 +03:00
bcb7fe5fe9 Render charts from full SQLite history 2026-04-01 23:52:54 +03:00
d21d9d191b fix(build): bump DCGM to 4.5.3-1 — core package updated in CUDA repo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:49:57 +03:00
ef45246ea0 fix(sat): kill entire process group on task cancel
exec.CommandContext only kills the direct child (the shell script), leaving
grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT
job runs in its own process group, then send SIGKILL to the whole group
(-pgid) in the Cancel hook.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:46:33 +03:00
348db35119 fix(stress): stagger john GPU launches to prevent GWS tuning contention
When 8 john processes start simultaneously they race for GPU memory during
OpenCL GWS auto-tuning. Slower devices settle on a smaller work size (~594MiB
vs 762MiB) and run at 40% instead of 100% load. Add 3s sleep between launches
so each instance finishes memory allocation before the next one starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:44:00 +03:00
1dd7f243f5 Keep chart series colors stable 2026-04-01 23:37:57 +03:00
938e499ac2 Serve charts from SQLite history only 2026-04-01 23:33:13 +03:00
964ab39656 fix: run john stress in parallel per GPU, fix chromium fullscreen, filter BMC virtual disks
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
  so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
  white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:14:21 +03:00
c2aecc6ce9 Fix fan chart gaps and task durations 2026-04-01 22:36:11 +03:00
439b86ce59 Unify live metrics chart rendering 2026-04-01 22:19:33 +03:00
eb60100297 fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
  of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
  overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
  bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
  base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:25:23 +03:00
Mikhail Chusavitin
2baf3be640 Handle memtest recovery probe under set -e 2026-04-01 17:42:13 +03:00
Mikhail Chusavitin
d92f8f41d0 Fix memtest ISO validation false negatives v3.21 2026-04-01 12:22:17 +03:00
Mikhail Chusavitin
76a9100779 fix(iso): rebuild image after memtest recovery 2026-04-01 10:01:14 +03:00
Mikhail Chusavitin
1b6d592bf3 feat(iso): add optional kms display boot path 2026-04-01 09:42:59 +03:00
Mikhail Chusavitin
c95bbff23b fix(metrics): stabilize cpu and power sampling 2026-04-01 09:40:42 +03:00
Mikhail Chusavitin
4e4debd4da refactor(webui): redesign Burn tab and fix gpu-burn memory defaults
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
  Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
  endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
  with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
  now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
  field to selectively run CPU and/or GPU load goroutines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:39:07 +03:00