- bee-nvidia-load: run insmod in background, poll /proc/devices for
nvidiactl; if GSP init doesn't complete in 90s, kill insmod and retry
with NVreg_EnableGpuFirmware=0. Handles the EBUSY case with a clear error message.
- Write /run/bee-nvidia-mode (gsp-on/gsp-off/gsp-stuck) for audit layer
- Show GSP mode badge in sidebar: yellow for gsp-off, red for gsp-stuck
- Report NvidiaGSPMode in RuntimeHealth with issue entries
- Simplify GRUB menu: default (KMS+GSP), advanced submenu (GSP=off,
nomodeset, fail-safe), remove load-to-RAM entry
- Add pcmanfm, ristretto, mupdf, mousepad to desktop packages
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- services.go: use sudo systemctl so bee user can control system services
- api.go: always return 200 with output field even on error, so the
frontend shows the actual systemctl message instead of "exit status 1"
- pages.go: button shows "..." while pending then restores label;
output panel is full-width under the table with ✓/✗ status indicator;
output auto-scrolls to bottom
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bee-gpu-burn now receives --seconds <LoadSec> so it exits naturally
when the cycle ends, rather than relying solely on context cancellation
to kill it. Process group kill (Setpgid+Cancel) is kept as a safety net
for early cancellation (user stop, context timeout). Same fix for AMD
RVS which now gets duration_ms = LoadSec * 1000.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvvs (DCGM validation suite) survives when dcgmi is killed mid-run,
leaving the GPU occupied. The next dcgmi diag invocation then fails
with "affected resource is in use".
Two-part fix:
- Add nvvs and dcgmi to KillTestWorkers patterns so they are cleaned
up by the global cancel handler
- Call KillTestWorkers at the start of RunNvidiaTargetedStressValidatePack
to clear any stale processes before dcgmi diag runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bee-gpu-burn is a shell script that spawns bee-gpu-burn-worker children.
exec.CommandContext's default cancel only kills the shell parent; the worker
processes survive and keep loading the GPU indefinitely.
Fix: set Setpgid=true and a custom Cancel that sends SIGKILL to the
entire process group (-pid), same pattern already used in runSATCommandCtx.
Applied to Nvidia, AMD, and CPU stress commands for consistency.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add bee-boot-status service: shows live service status on tty1 with
ASCII logo before getty, exits when all bee services settle
- Remove lightdm dependency on bee-preflight so GUI starts immediately
without waiting for NVIDIA driver load
- Fix Chromium blank page on boot: serve a /loading spinner page that
polls /api/services and auto-redirects when services are ready; add an
"Open app now" override button; use a fresh --user-data-dir=/tmp/bee-chrome
- Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu,
bee-boot-status (with yellow ASCII logo), and web spinner
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tasks are now started simultaneously when multiple are enqueued (e.g.
Run All). The worker drains all pending tasks at once and launches each
in its own goroutine, waiting via WaitGroup. kmsg watcher updated to
use a shared event window with a reference counter across concurrent tasks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add platform/error_patterns.go: pluggable table of kernel log patterns
(NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct
- Add app/component_status_db.go: persistent JSON store (component-status.json)
keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never
downgrades Warning or Critical
- Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks,
writes Warning to DB for matched hardware errors
- Fix task status: overall_status=FAILED in summary.txt now marks task failed
- Audit routine overlays component DB statuses into bee-audit.json on every read
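A compressed sketch of two pieces above, the pluggable pattern table and the never-downgrade status merge (struct fields, pattern strings, and severity names are illustrative assumptions, not the real error_patterns.go contents):

```go
package main

import (
	"fmt"
	"regexp"
)

// errorPattern mirrors the "extend by adding one struct" idea: each entry
// maps a kernel log regex to a component class.
type errorPattern struct {
	Component string // e.g. "pcie", "nvidia", "memory"
	Re        *regexp.Regexp
}

var patterns = []errorPattern{
	{"pcie", regexp.MustCompile(`pcieport.*AER.*(Corrected|Uncorrected)`)},
	{"nvidia", regexp.MustCompile(`NVRM: Xid`)},
	{"memory", regexp.MustCompile(`Machine Check Event|EDAC .* error`)},
}

// classify returns the component of the first pattern matching a kmsg line.
func classify(line string) (string, bool) {
	for _, p := range patterns {
		if p.Re.MatchString(line) {
			return p.Component, true
		}
	}
	return "", false
}

// Severity ordering: OK < Warning < Critical. merge enforces the
// "OK never downgrades Warning or Critical" rule of the status DB.
var rank = map[string]int{"OK": 0, "Warning": 1, "Critical": 2}

func merge(old, incoming string) string {
	if rank[incoming] > rank[old] {
		return incoming
	}
	return old
}

func main() {
	comp, ok := classify("NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus")
	fmt.Println(comp, ok)               // nvidia true
	fmt.Println(merge("Warning", "OK")) // Warning
}
```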
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Support bundle is now built on the fly when the user clicks the button,
regardless of whether other tasks are running:
- GET /export/support.tar.gz builds the bundle synchronously and streams it
directly to the client; the temp archive is removed after serving
- Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue
approach meant the bundle could only be downloaded after navigating away
and back, and was blocked entirely while a long SAT test was running
- UI: single "Download Support Bundle" button; fetch+blob gives a loading
state ("Building...") while the server collects logs, then triggers the
browser download with the correct filename from Content-Disposition
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
crashes mid-test (each restart was launching a new set of 8 workers on top
of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
POST /api/tasks/kill-workers which cancels the task queue then kills
orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
(e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable
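A sketch of the /proc scan behind KillTestWorkers (the pattern list is taken from the text plus the nvvs/dcgmi additions mentioned earlier; the function shape is an assumption). Matching against /proc/&lt;pid&gt;/comm avoids any dependency on pkill or killall; note that comm is truncated to 15 characters, which prefix matching tolerates:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

var workerPatterns = []string{
	"bee-gpu-burn", "stress-ng", "stressapptest", "memtester", "nvvs", "dcgmi",
}

func matchesWorker(comm string) bool {
	for _, p := range workerPatterns {
		if strings.HasPrefix(comm, p) {
			return true
		}
	}
	return false
}

// killTestWorkers scans /proc and SIGKILLs any process whose comm matches
// a known test-worker pattern, returning the number killed.
func killTestWorkers() int {
	killed := 0
	entries, _ := os.ReadDir("/proc")
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a numeric pid directory
		}
		comm, err := os.ReadFile(filepath.Join("/proc", e.Name(), "comm"))
		if err != nil {
			continue // process already gone
		}
		if matchesWorker(strings.TrimSpace(string(comm))) {
			if syscall.Kill(pid, syscall.SIGKILL) == nil {
				killed++
			}
		}
	}
	return killed
}

func main() {
	// comm truncation: "bee-gpu-burn-worker" appears as "bee-gpu-burn-wo".
	fmt.Println(matchesWorker("bee-gpu-burn-wo"), matchesWorker("bash"))
}
```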
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
exec.CommandContext only kills the direct child (the shell script), leaving
grandchildren (john, gpu-burn, etc.) as orphans. Set Setpgid so each SAT
job runs in its own process group, then send SIGKILL to the whole group
(-pgid) in the Cancel hook.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from the rm -f list so the overlay's
shell script is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
field to selectively run CPU and/or GPU load goroutines
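The 95%-of-memory.total auto-sizing can be illustrated in Go on canned `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` output (the real bee-gpu-burn is a shell script; `autoSizeMB` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// autoSizeMB implements the size-mb=0 (auto) behavior described above:
// one entry per GPU, each 95% of that GPU's memory.total in MiB.
func autoSizeMB(nvidiaSmiOut string) []int {
	var sizes []int
	for _, line := range strings.Split(strings.TrimSpace(nvidiaSmiOut), "\n") {
		total, err := strconv.Atoi(strings.TrimSpace(line))
		if err != nil {
			continue // skip malformed lines
		}
		sizes = append(sizes, total*95/100)
	}
	return sizes
}

func main() {
	// Two hypothetical 81559 MiB (80GB-class) devices.
	fmt.Println(autoSizeMB("81559\n81559\n")) // [77481 77481]
}
```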
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>