RAID Controller Management previously hid any LSI drive that wasn't
already Frgn/UGood/JBOD, and scoped VROC "free drives" from all system
disks instead of the ones actually wired to the VROC controller's
ports - drives attached directly to the CPU or another HBA could leak
in. Now every drive is listed per its own controller, and LSI drives
not already ready for array creation get a "Prepare" button that
forces them to Unconfigured Good via storcli.
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Replaces separate IPMI FRU and SAA DMI cards with a single FRU / Elabel
card that reads all available sources in parallel and shows each field
with a color-coded source chip (IPMI FRU / Huawei iBMC / SAA DMI).
Huawei elabel fields are read/written via OEM IPMI raw commands
(NetFn 0x30, cmd 0x90) with 19-byte chunking protocol, matching
the FusionServer ElabelTool V511 wire format. Covers DeviceName,
DeviceSerialNumber, ProductName, ProductSerialNumber, ProductAssetTag,
ProductManufacturer, MainboardManufacturer, BoardProductName,
ChassisPartnumber, ChassisType (read-only), IOChassisSerial,
IOChassisAssetTag, and GUID (read-only via standard 0x06 0x08).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /api/system/reboot → systemctl reboot
POST /api/system/shutdown → systemctl poweroff
Both require confirm() before executing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Unified card for LSI/Broadcom and Intel VROC controllers: auto-detects
foreign configurations and warns the operator with Import/Clear actions;
allows creating RAID 1 mirrors from unconfigured drives regardless of
controller type. Live output streams via SSE into an inline terminal.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new card to the web UI Tools page for reading and editing DMI
fields via SAA (In-Band). Reads current DMI configuration with GetDmiInfo,
displays all fields as an editable table, and applies only the changed
fields via EditDmiInfo + ChangeDmiInfo. Backs up the original DMI file to
dmi-backups/ before any write, making it available in the support bundle
for rollback. Also adds "saa" to the standard tool check list.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark,
Tasks, Tools) with a numbered progression that guides engineers through
a logical acceptance workflow:
Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed
→ 5. Endurance → 6. Tools → 7. Settings
Key changes:
- layout.go: numbered nav labels, new hrefs, Tasks removed from nav
and replaced with a persistent sidebar badge (polls /api/tasks every
5 s, highlights amber when tasks are active)
- server.go: 301 redirects from /validate→/check, /burn→/load,
/benchmark→/speed for backward compatibility
- pages.go: dispatch cases for all new routes; old routes kept as
fallbacks
- page_validate.go: add renderCheck() — non-destructive check page
with validate-mode tests only (no stress toggle, no targeted-stress/
targeted-power/pulse cards)
- page_burn.go: add renderLoad() wrapper; update scope alert to
reference /check instead of /validate
- page_benchmark.go: add renderSpeed() (performance focus) and
renderEndurance() (stability/overnight focus) wrappers
- page_settings.go: new Settings page with blackbox logging toggle,
NVIDIA driver reset, and build info
- server_test.go: update five tests to use new route names and
content expectations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times,
not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB
- New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source)
- Hardware Summary card auto-refreshes every 30s via htmx without page reload
- Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal
with per-component status, source, timestamp and last 20 history entries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
As metrics.db grew (1 sample/5 s × hours), handleMetricsChartSVG called
LoadAll() on every chart request — loading all rows across 4 tables through a
single SQLite connection. With ~10 charts auto-refreshing in parallel, requests
queued behind each other, saturating the connection pool and pegging a CPU core.
Fix: add a background compactor that runs every hour via the metrics collector:
• Downsample: rows older than 2 h are thinned to 1 per minute (keep MIN(ts)
per ts/60 bucket) — retains chart shape while cutting row count by ~92 %.
• Prune: rows older than 48 h are deleted entirely.
• After prune: WAL checkpoint/truncate to release disk space.
LoadAll() in handleMetricsChartSVG is unchanged — it now stays fast because
the DB is kept small rather than capping the query window.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add PSUReading struct and PSUs []PSUReading to LiveMetricSample
- Sample per-PSU input watts from IPMI SDR entity 10.x (Power Supply)
- Render stacked filled-area SVG chart (one layer per PSU, cumulative total)
- Fall back to single-line chart on systems with ≤1 PSU in SDR
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Benchmark page now shows two result sections: Performance (scores) and
Power / Thermal Fit (slot table). After any benchmark task completes
the results section auto-refreshes via GET /api/benchmark/results
without a full page reload.
- Power results table shows each GPU slot with nominal TDP, achieved
stable power limit, and P95 observed power. Rows with derated cards
are highlighted amber so under-performing slots stand out at a glance.
Older runs are collapsed in a <details> summary.
- memtester is now wrapped with timeout(1) so a stuck memory controller
cannot cause Validate Memory to hang indefinitely. Wall-clock limit is
~2.5 min per 100 MB per pass plus a 2-minute buffer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add bee-boot-status service: shows live service status on tty1 with
ASCII logo before getty, exits when all bee services settle
- Remove lightdm dependency on bee-preflight so GUI starts immediately
without waiting for NVIDIA driver load
- Replace Chromium blank-page problem with /loading spinner page that
polls /api/services and auto-redirects when services are ready; add
"Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome
- Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu,
bee-boot-status (with yellow ASCII logo), and web spinner
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add platform/error_patterns.go: pluggable table of kernel log patterns
(NVIDIA/GPU, PCIe AER, storage I/O, MCE, EDAC) — extend by adding one struct
- Add app/component_status_db.go: persistent JSON store (component-status.json)
keyed by "pcie:BDF", "storage:dev", "cpu:all", "memory:all"; OK never
downgrades Warning or Critical
- Add webui/kmsg_watcher.go: goroutine reads /dev/kmsg during SAT tasks,
writes Warning to DB for matched hardware errors
- Fix task status: overall_status=FAILED in summary.txt now marks task failed
- Audit routine overlays component DB statuses into bee-audit.json on every read
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Support bundle is now built on-the-fly when the user clicks the button,
regardless of whether other tasks are running:
- GET /export/support.tar.gz builds the bundle synchronously and streams it
directly to the client; the temp archive is removed after serving
- Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue
approach meant the bundle could only be downloaded after navigating away
and back, and was blocked entirely while a long SAT test was running
- UI: single "Download Support Bundle" button; fetch+blob gives a loading
state ("Building...") while the server collects logs, then triggers the
browser download with the correct filename from Content-Disposition
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tasks: mark TaskRunning tasks as TaskFailed on bee-web restart instead of
re-queueing them — prevents duplicate gpu-burn-worker spawns when bee-web
crashes mid-test (each restart was launching a new set of 8 workers on top
of still-alive orphans from the previous crash)
- server: reduce metrics collector interval 1s→5s, grow ring buffer to 360
samples (30 min); cuts nvidia-smi/ipmitool/sensors subprocess rate by 5×
- platform: add KillTestWorkers() — scans /proc and SIGKILLs bee-gpu-burn,
stress-ng, stressapptest, memtester without relying on pkill/killall
- webui: add "Kill Workers" button next to Cancel All; calls
POST /api/tasks/kill-workers which cancels the task queue then kills
orphaned OS-level processes; shows toast with killed count
- metricsdb: sort GPU indices and fan/temp names after map iteration to fix
non-deterministic sample reconstruction order (flaky test)
- server: fix chartYAxisNumber to use one decimal place for 1000–9999
(e.g. "1,7к" instead of "2к") so Y-axis ticks are distinguishable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
field to selectively run CPU and/or GPU load goroutines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs CPU (stressapptest) + GPU stress simultaneously across multiple
load/idle cycles with varying idle durations (120s/60s/30s) to detect
cooling systems that fail to recover under repeated load.
Presets: smoke (~5 min), acceptance (~25 min), overnight (~100 min).
Outputs metrics.csv + summary.txt with per-cycle throttle and fan
spindown analysis, packed as tar.gz.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>