Commit Graph

334 Commits

Author SHA1 Message Date
5b9015451e Add live task charts and fix USB export actions v5.8 2026-04-05 20:14:23 +03:00
d1a6863ceb Use amber fallback wallpaper color (#f6c90e) instead of black
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:30:41 +03:00
f9aa05de8e Add wallpaper: black background with amber EASY-BEE ASCII art logo
- Add feh and python3-pil to package list
- Add chroot hook that generates /usr/share/bee/wallpaper.png using PIL:
  black background, EASY-BEE box-drawing logo in amber (#f6c90e),
  "Hardware Audit LiveCD" subtitle in dim amber — matches motd exactly
- bee-openbox-session: set wallpaper with feh --bg-fill, fall back to
  xsetroot -solid black if wallpaper not found

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:29:42 +03:00
a9ccea8cca Fix black desktop and Chromium blank page on startup
- Set xsetroot solid background (#12100a, dark amber) so openbox
  doesn't show bare black before Chromium opens
- Re-add healthz wait loop before launching Chromium: without it
  Chromium opens localhost/loading before bee-web is up and gets
  connection-refused which renders as a blank white page

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:25:32 +03:00
fc5c985fb5 Reset tty1 properly when bee-boot-status exits
Add TTYReset=yes and TTYVHangup=yes so systemd restores the terminal
to a clean state before handing tty1 to getty. Without this the screen
went black with no cursor after the status display finished.

Also remove DefaultDependencies=no which was too aggressive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:22:01 +03:00
5eb3baddb4 Fix bee-boot-status blank screen caused by variable buffering
Command substitution in sh strips trailing newlines, so accumulating
output in a variable via $(...) lost all line breaks. Reverted to
direct printf calls which work correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:21:10 +03:00
a6ac13b5d3 Improve bee-boot-status: slower refresh, more detail
- Refresh every 3s instead of 1s to reduce flicker
- Show ssh, bee-sshsetup in service list
- Show failure reason for failed services
- Show last journal line for activating services
- Show IP addresses and web UI URL when network is up
- Render frame to variable before printing to reduce flicker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:20:07 +03:00
4003cb7676 Lower kernel console loglevel to 3 to reduce boot noise
loglevel=6 floods the screen with mpt3sas/scsi/sd informational
messages, hiding systemd service status and bee-boot-status display.
loglevel=3 shows only kernel errors; all messages still go to serial.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:19:09 +03:00
2875313ba0 Improve boot UX: status display, faster GUI, loading spinner
- Add bee-boot-status service: shows live service status on tty1 with
  ASCII logo before getty, exits when all bee services settle
- Remove lightdm dependency on bee-preflight so GUI starts immediately
  without waiting for NVIDIA driver load
- Replace Chromium blank-page problem with /loading spinner page that
  polls /api/services and auto-redirects when services are ready; add
  "Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome
- Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu,
  bee-boot-status (with yellow ASCII logo), and web spinner

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v5.7
2026-04-05 18:58:24 +03:00
f1621efee4 Mirror task lifecycle to serial console v5.6 2026-04-05 18:34:06 +03:00
4461249cc3 Make memory stress size follow available RAM 2026-04-05 18:33:26 +03:00
e609fbbc26 Add task reports and streamline GPU charts v5.5 2026-04-05 18:13:58 +03:00
cc2b49ea41 Improve validate GPU runs and web UI feedback 2026-04-05 17:50:13 +03:00
33e0a5bef2 Refine validate UI and runtime health table v5.4 2026-04-05 16:24:45 +03:00
38e79143eb Refine burn UI and NVIDIA stress flows v5.3 2026-04-05 13:43:43 +03:00
25af2df23a Unify metrics charts on custom SVG renderer v5.2 2026-04-05 12:17:50 +03:00
20abff7f90 WIP: checkpoint current tree 2026-04-05 12:05:00 +03:00
a14ec8631c Persist GPU chart mode and expand GPU charts 2026-04-05 11:52:32 +03:00
f58c7e58d3 Fix webui streaming recovery regressions v5.1 2026-04-05 10:39:09 +03:00
bf47c8dbd2 Add NVIDIA benchmark reporting flow v5 2026-04-05 10:30:56 +03:00
143b7dca5d Add stability hardening and self-heal recovery 2026-04-05 10:29:37 +03:00
9826d437a5 Add GPU clock charts and grouped GPU metrics view 2026-04-05 09:57:38 +03:00
Mikhail Chusavitin
f3c14cd893 Harden NIC probing for empty SFP ports v4.6 v4.7 2026-04-04 15:23:15 +03:00
Mikhail Chusavitin
728270dc8e Unblock bee-web startup and expand support bundle diagnostics v4.5 2026-04-04 15:18:43 +03:00
Mikhail Chusavitin
8692f825bc Use plain repo tags for build version v4.4 2026-04-03 10:48:51 +03:00
Mikhail Chusavitin
11f52ac710 Fix task log modal scrolling 2026-04-03 10:36:11 +03:00
Mikhail Chusavitin
1cb398fe83 Show tag version at top of sidebar 2026-04-03 10:08:00 +03:00
Mikhail Chusavitin
7a843be6b0 Stabilize DCGM GPU discovery 2026-04-03 09:50:33 +03:00
Mikhail Chusavitin
7f6386dccc Restore USB support bundle export on tools page 2026-04-03 09:48:22 +03:00
Mikhail Chusavitin
eea2591bcc Fix John GPU stress duration semantics 2026-04-03 09:46:16 +03:00
Mikhail Chusavitin
295a19b93a feat(tasks): run all queued tasks in parallel
Tasks are now started simultaneously when multiple are enqueued (e.g.
Run All). The worker drains all pending tasks at once and launches each
in its own goroutine, waiting via WaitGroup. kmsg watcher updated to
use a shared event window with a reference counter across concurrent tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 09:15:06 +03:00
Mikhail Chusavitin
444a7d16cc fix(iso): increase boot verbosity for service startup visibility
Raise loglevel from 3 to 6 (INFO) and add systemd.show_status=1 so
kernel driver messages and systemd [ OK ]/[ FAILED ] lines are visible
during boot instead of showing only a blank cursor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v4.3
2026-04-02 19:33:27 +03:00
Mikhail Chusavitin
fd722692a4 feat(watchdog): hardware error monitor + unified component status store
- Add platform/error_patterns.go: pluggable table of kernel log patterns
  (NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct
- Add app/component_status_db.go: persistent JSON store (component-status.json)
  keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never
  downgrades Warning or Critical
- Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks,
  writes Warning to DB for matched hardware errors
- Fix task status: overall_status=FAILED in summary.txt now marks task failed
- Audit routine overlays component DB statuses into bee-audit.json on every read

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 19:20:59 +03:00
Mikhail Chusavitin
99cece524c feat(support-bundle): add PCIe link diagnostics and system logs
- Add full dmesg (was tail -200), kern.log, syslog
- Add /proc/cmdline, lspci -vvv, nvidia-smi -q
- Add per-GPU PCIe link speed/width from sysfs (NVIDIA devices only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v4.2
2026-04-02 15:42:28 +03:00
Mikhail Chusavitin
c27449c60e feat(webui): show current boot source 2026-04-02 15:36:32 +03:00
Mikhail Chusavitin
5ef879e307 feat(webui): add gpu driver restart action 2026-04-02 15:30:23 +03:00
Mikhail Chusavitin
e7df63bae1 fix(app): include extra system logs in support bundle 2026-04-02 13:44:58 +03:00
Mikhail Chusavitin
17ff3811f8 fix(webui): improve tasks logs and ordering 2026-04-02 13:43:59 +03:00
Mikhail Chusavitin
fc7fe0b08e fix(webui): build support bundle synchronously on download, bypass task queue
Support bundle is now built on-the-fly when the user clicks the button,
regardless of whether other tasks are running:

- GET /export/support.tar.gz builds the bundle synchronously and streams it
  directly to the client; the temp archive is removed after serving
- Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue
  approach meant the bundle could only be downloaded after navigating away
  and back, and was blocked entirely while a long SAT test was running
- UI: single "Download Support Bundle" button; fetch+blob gives a loading
  state ("Building...") while the server collects logs, then triggers the
  browser download with the correct filename from Content-Disposition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v4.1
2026-04-02 12:58:00 +03:00
Mikhail Chusavitin
3cf75a541a build: collect ISO and logs under versioned dist/easy-bee-v{VERSION}/ dir
All final artefacts for a given version now land in one place:
  dist/easy-bee-v4.1/
    easy-bee-nvidia-v4.1-amd64.iso
    easy-bee-nvidia-v4.1-amd64.logs.tar.gz   ← log archive
                                               (logs dir deleted after archiving)

- Introduce OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}"
- Move LOG_DIR, LOG_ARCHIVE, and ISO_OUT into OUT_DIR
- cleanup_build_log: use dirname(LOG_DIR) as tar -C base so the path is
  correct regardless of where OUT_DIR lives; delete LOG_DIR after archiving

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 10:19:11 +03:00
Mikhail Chusavitin
1f750d3edd fix(webui): prevent orphaned workers on restart, reduce metrics polling, add Kill Workers button
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
  re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
  crashes mid-test (each restart was launching a new set of 8 workers on top
  of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
  samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
  stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
  POST /api/tasks/kill-workers which cancels the task queue then kills
  orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
  non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
  (e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 10:13:43 +03:00
Mikhail Chusavitin
b2b0444131 audit: ignore virtual hdisk and coprocessor noise 2026-04-02 09:56:17 +03:00
dbab43db90 Fix full-history metrics range loading v4 2026-04-01 23:55:28 +03:00
bcb7fe5fe9 Render charts from full SQLite history 2026-04-01 23:52:54 +03:00
d21d9d191b fix(build): bump DCGM to 4.5.3-1 — core package updated in CUDA repo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:49:57 +03:00
ef45246ea0 fix(sat): kill entire process group on task cancel
exec.CommandContext only kills the direct child (the shell script), leaving
grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT
job runs in its own process group, then send SIGKILL to the whole group
(-pgid) in the Cancel hook.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:46:33 +03:00
348db35119 fix(stress): stagger john GPU launches to prevent GWS tuning contention
When 8 john processes start simultaneously they race for GPU memory during
OpenCL GWS auto-tuning. Slower devices settle on a smaller work size (~594MiB
vs 762MiB) and run at 40% instead of 100% load. Add 3s sleep between launches
so each instance finishes memory allocation before the next one starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:44:00 +03:00
1dd7f243f5 Keep chart series colors stable 2026-04-01 23:37:57 +03:00
938e499ac2 Serve charts from SQLite history only 2026-04-01 23:33:13 +03:00
964ab39656 fix: run john stress in parallel per GPU, fix chromium fullscreen, filter BMC virtual disks
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
  so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
  white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:14:21 +03:00