Unified card for LSI/Broadcom and Intel VROC controllers: auto-detects
foreign configurations and warns the operator with Import/Clear actions;
allows creating RAID 1 mirrors from unconfigured drives regardless of
controller type. Live output streams via SSE into an inline terminal.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new card to the web UI Tools page for reading and editing DMI
fields via SAA (In-Band). Reads current DMI configuration with GetDmiInfo,
displays all fields as an editable table, and applies only the changed
fields via EditDmiInfo + ChangeDmiInfo. Backs up the original DMI file to
dmi-backups/ before any write, making it available in the support bundle
for rollback. Also adds "saa" to the standard tool check list.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark,
Tasks, Tools) with a numbered progression that guides engineers through
a logical acceptance workflow:
Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed
→ 5. Endurance → 6. Tools → 7. Settings
Key changes:
- layout.go: numbered nav labels, new hrefs, Tasks removed from nav
and replaced with a persistent sidebar badge (polls /api/tasks every
5 s, highlights amber when tasks are active)
- server.go: 301 redirects from /validate→/check, /burn→/load,
/benchmark→/speed for backward compatibility
- pages.go: dispatch cases for all new routes; old routes kept as
fallbacks
- page_validate.go: add renderCheck() — non-destructive check page
with validate-mode tests only (no stress toggle, no targeted-stress/
targeted-power/pulse cards)
- page_burn.go: add renderLoad() wrapper; update scope alert to
reference /check instead of /validate
- page_benchmark.go: add renderSpeed() (performance focus) and
renderEndurance() (stability/overnight focus) wrappers
- page_settings.go: new Settings page with blackbox logging toggle,
NVIDIA driver reset, and build info
- server_test.go: update five tests to use new route names and
content expectations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times,
not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB
- New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source)
- Hardware Summary card auto-refreshes every 30s via htmx without page reload
- Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal
with per-component status, source, timestamp and last 20 history entries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
As metrics.db grew (1 sample/5 s × hours), handleMetricsChartSVG called
LoadAll() on every chart request — loading all rows across 4 tables through a
single SQLite connection. With ~10 charts auto-refreshing in parallel, requests
queued behind each other, saturating the connection pool and pegging a CPU core.
Fix: add a background compactor that runs every hour via the metrics collector:
• Downsample: rows older than 2 h are thinned to 1 per minute (keep MIN(ts)
per ts/60 bucket) — retains chart shape while cutting row count by ~92 %.
• Prune: rows older than 48 h are deleted entirely.
• After prune: WAL checkpoint/truncate to release disk space.
LoadAll() in handleMetricsChartSVG is unchanged — it now stays fast because
the DB is kept small rather than capping the query window.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add PSUReading struct and PSUs []PSUReading to LiveMetricSample
- Sample per-PSU input watts from IPMI SDR entity 10.x (Power Supply)
- Render stacked filled-area SVG chart (one layer per PSU, cumulative total)
- Fall back to single-line chart on systems with ≤1 PSU in SDR
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Benchmark page now shows two result sections: Performance (scores) and
Power / Thermal Fit (slot table). After any benchmark task completes
the results section auto-refreshes via GET /api/benchmark/results
without a full page reload.
- Power results table shows each GPU slot with nominal TDP, achieved
stable power limit, and P95 observed power. Rows with derated cards
are highlighted amber so under-performing slots stand out at a glance.
Older runs are collapsed in a <details> summary.
- memtester is now wrapped with timeout(1) so a stuck memory controller
cannot cause Validate Memory to hang indefinitely. Wall-clock limit is
~2.5 min per 100 MB per pass plus a 2-minute buffer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add bee-boot-status service: shows live service status on tty1 with
ASCII logo before getty, exits when all bee services settle
- Remove lightdm dependency on bee-preflight so GUI starts immediately
without waiting for NVIDIA driver load
- Replace Chromium blank-page problem with /loading spinner page that
polls /api/services and auto-redirects when services are ready; add
"Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome
- Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu,
bee-boot-status (with yellow ASCII logo), and web spinner
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add platform/error_patterns.go: pluggable table of kernel log patterns
(NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct
- Add app/component_status_db.go: persistent JSON store (component-status.json)
keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never
downgrades Warning or Critical
- Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks,
writes Warning to DB for matched hardware errors
- Fix task status: overall_status=FAILED in summary.txt now marks task failed
- Audit routine overlays component DB statuses into bee-audit.json on every read
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Support bundle is now built on-the-fly when the user clicks the button,
regardless of whether other tasks are running:
- GET /export/support.tar.gz builds the bundle synchronously and streams it
directly to the client; the temp archive is removed after serving
- Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue
approach meant the bundle could only be downloaded after navigating away
and back, and was blocked entirely while a long SAT test was running
- UI: single "Download Support Bundle" button; fetch+blob gives a loading
state ("Building...") while the server collects logs, then triggers the
browser download with the correct filename from Content-Disposition
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
crashes mid-test (each restart was launching a new set of 8 workers on top
of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
POST /api/tasks/kill-workers which cancels the task queue then kills
orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
(e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
field to selectively run CPU and/or GPU load goroutines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs CPU (stressapptest) + GPU stress simultaneously across multiple
load/idle cycles with varying idle durations (120s/60s/30s) to detect
cooling systems that fail to recover under repeated load.
Presets: smoke (~5 min), acceptance (~25 min), overnight (~100 min).
Outputs metrics.csv + summary.txt with per-cycle throttle and fan
spindown analysis, packed as tar.gz.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add modernc.org/sqlite dependency; write every sample to
/appdata/bee/metrics.db (WAL mode, prune to 24h on startup)
- Pre-fill ring buffers from last 120 DB rows on startup so charts
survive service restarts
- Ticker changed 3s→1s; chart JS refresh will be set to 2s (lag ≤3s)
- Add GET /api/metrics/export.csv for full history download
- Chart rendering: SymbolNone (no dots), right padding=80px so peak
mark line label is not clipped, min/avg/max appended to chart title
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>