- Add rocm-validation-suite, rocblas, rocrand, hip-runtime-amd,
hipblaslt, comgr to ISO (~700MB, needed for HIP compute)
- RunAMDStressPack: run RVS GST (SGEMM ~31 TFLOPS/GPU) + bandwidth test
- Add rvs symlink in chroot setup hook
- Pin all new package versions in VERSIONS
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ip route show includes state flags like 'linkdown' that ip route add
does not accept, causing restore to fail.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Task queue: all SAT/audit jobs enqueue and run one-at-a-time;
tasks persist past page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
Dashboard hardware summary badges + split metrics charts (load/temp/power);
Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies squashfs to /dev/shm, re-associates loop devices
via LOOP_CHANGE_FD ioctl so live media can be ejected
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expose the existing bee-install script through the web UI:
- platform/install.go: remove USB exclusion, add SizeBytes/MountedParts
fields, add MinInstallBytes()/DiskWarnings() safety checks (size,
mounted partitions, toram+low-RAM warning)
- webui: add GET /api/install/disks, POST /api/install/run,
GET /api/install/stream endpoints
- webui: add Install to Disk page with disk table, warning badges,
device-name confirmation gate, SSE progress terminal, reboot button
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- One engine: go-analyze/charts (grafana theme) for all live metrics
- Server chart: CPU temp, CPU load%, mem load%, power W, fan RPMs
- GPU charts: temp, load%, mem%, power W — one card per GPU, added dynamically
- Charts 1400x280px SVG, rendered at width:100% in single-column layout
- Add CPU load (from /proc/stat) and mem load (from /proc/meminfo) to LiveMetricSample
- Add GPU mem utilization to GPUMetricRow (nvidia-smi utilization.memory)
- Document charting architecture in bible-local/architecture/charting.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copies the live system to a local disk via unsquashfs — no debootstrap,
no network required. Supports UEFI (GPT+EFI) and BIOS (MBR) layouts.
ISO:
- Add squashfs-tools, parted, grub-pc, grub-efi-amd64 to package list
- New overlay script bee-install: partitions, formats, unsquashfs,
writes fstab, runs grub-install+update-grub in chroot
Go TUI:
- Settings → Tools submenu (Install to disk, Check tools)
- Disk picker screen: lists non-USB, non-boot disks via lsblk
- Confirm screen warns about data loss
- Runs with live progress tail of /tmp/bee-install.log
- platform/install.go: ListInstallDisks, InstallToDisk, findLiveBootDevice
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GPU Platform Stress Test now shows a live in-TUI chart instead of nvtop.
nvidia-smi is polled every second; up to 60 data points per GPU kept.
All three metrics (Usage %, Temp °C, Power W) drawn on a single plot,
each normalised to its own range and rendered in a different colour.
- Memory allocation changed from MemoryMB/16 to MemoryMB-512 (full VRAM
minus 512 MB driver overhead) so bee-gpu-stress actually stresses memory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two-phase GPU thermal cycling test with per-second telemetry:
- Phases: baseline → load1 → pause (no cooldown) → load2 → cooldown
- Monitors: fan RPM (ipmitool sdr), CPU/server temps (ipmitool/sensors),
system power (ipmitool dcmi), GPU temp/power/usage/clock/throttle (nvidia-smi)
- Detects throttling via clocks_throttle_reasons.active bitmask
- Measures fan response lag from load start (validates case-04 ~2s lag)
- Exports metrics.csv (wide format, one row/sec) and fan-sensors.csv (long format)
- TUI: adds [F] Fan Stress Test to Health Check screen with Quick/Standard/Express modes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New tui/sat_progress.go: polls {DefaultSATBaseDir}/{prefix}-*/verbose.log every 300ms and parses completed/in-progress steps
- Busy screen now shows each step as PASS lscpu (234ms) / FAIL stress-ng (60.0s) / ... sensors-after instead of just "Working..."
2. Test results shown on screen (instead of just "Archive written to /path")
- RunCPUAcceptancePackResult, RunMemoryAcceptancePackResult, RunStorageAcceptancePackResult, RunAMDAcceptancePackResult now read summary.txt from the run directory and return a formatted per-step result:
Run: 2025-03-25T10:00:00Z
PASS lscpu
PASS sensors-before
FAIL stress-ng
PASS sensors-after
Overall: FAILED (ok=3 failed=1)
3. AMD GPU SAT with auto-detection
- platform.System.DetectGPUVendor(): checks /dev/nvidia0 → "nvidia", /dev/kfd → "amd"
- platform.System.RunAMDAcceptancePack(): runs rocm-smi, rocm-smi --showallinfo, dmidecode
- GPU SAT (G key / GPU row enter) automatically routes to AMD or NVIDIA based on detected vendor
- "Run All" also auto-detects vendor
4. Panel detail view
- GPU detail now shows the most recent (NVIDIA or AMD) SAT result, whichever is newer
- All SAT detail views use the same human-readable formatSATDetail format
--start is not a valid nvme-cli flag; correct syntax is -s 1 (short test).
Add --wait so the command blocks until the test completes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BMC:
- collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent
- collector/collector.go: append BMC firmware record to snap.Firmware
- app/panel.go: show BMC version in TUI right-panel header alongside BIOS
CPU SAT:
- platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng
- app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated
- app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data
- tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label
- tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult
- tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration
- cmd/bee/main.go: bee sat cpu [--duration N] subcommand
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>