Commit Graph

255 Commits

Author SHA1 Message Date
105d92df8b fix(iso): use underscore in volume label to comply with ISO 9660
ISO 9660 volume labels allow only A-Z, 0-9, and underscore.
Dashes cause xorriso WARNING on every build.
EASY-BEE-NVIDIA → EASY_BEE_NVIDIA (iso-application keeps dashes, it's UDF).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v3.8
2026-03-30 23:38:02 +03:00
f96b149875 fix(memtest): extract EFI binary from .deb cache if chroot/boot/ is empty
memtest86+ postinst does not place files in /boot in a live-build chroot
without grub triggers. Added fallback: extract directly from the cached
.deb via dpkg-deb -x, with verbose logging throughout.

Also remove "NVIDIA no MSI-X" from boot menu (premature — root cause unknown).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v3.7
2026-03-30 23:30:52 +03:00
5ee120158e fix(build): remove unused variant package lists before lb build
live-build picks up ALL .list.chroot files in config/package-lists/.
After rsync, bee-nvidia.list.chroot, bee-amd.list.chroot, and
bee-nogpu.list.chroot all end up in BUILD_WORK_DIR — causing lb to
try installing packages from every variant (and leaving version
placeholders unsubstituted in the unused lists).

Fix: after copying bee-${BEE_GPU_VENDOR}.list.chroot → bee-gpu.list.chroot,
delete all other bee-{nvidia,amd,nogpu}.list.chroot from BUILD_WORK_DIR.

Also includes nomsi boot mode changes (bee-nvidia-load + grub.cfg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v3.6
2026-03-30 23:03:42 +03:00
09fe0e2e9e feat(iso): add nogpu variant (no NVIDIA, no AMD/ROCm)
- build.sh: accept --variant nogpu; skips all GPU build steps, removes
  both nvidia-cuda and rocm archives, strips bee-nvidia-load and
  bee-nvidia.service from overlay
- build-in-container.sh: add nogpu to --variant flag; all variant
  includes nogpu; --clean-build wipes live-build-work-nogpu
- 9000-bee-setup hook: nogpu path enables no GPU services
- bee-nogpu.list.chroot: empty GPU package list

Output: easy-bee-nogpu-vX.iso

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v3.5
2026-03-30 22:49:25 +03:00
ace1a9dba6 feat(iso): split into nvidia and amd variants, fix KVM graphics and PATH
- build.sh: add --variant nvidia|amd; separate work dirs per variant
  (live-build-work-nvidia / live-build-work-amd); GPU-specific steps
  (modules, NCCL, cuBLAS, nccl-tests) run only for nvidia; deb package
  cache synced back to shared location after each lb build so second
  variant reuses downloaded packages; ISO output named
  easy-bee-{variant}-v{ver}-amd64.iso
- build-in-container.sh: add --variant nvidia|amd|all (default: all);
  runs build.sh twice in one container for 'all'; --clean-build wipes
  both variant work dirs
- package-lists: remove GPU packages from bee.list.chroot; add
  bee-nvidia.list.chroot (DCGM) and bee-amd.list.chroot (ROCm)
- 9000-bee-setup hook: read /etc/bee-gpu-vendor; enable bee-nvidia.service
  and DCGM only for nvidia; set up ROCm symlinks only for amd
- auto/config: --iso-volume uses BEE_GPU_VENDOR_UPPER env var
- grub.cfg: add nomodeset to EASY-BEE and EASY-BEE (load to RAM) entries
  — fixes X/lightdm on BMC KVM (ASPEED AST chip requires nomodeset for
  fbdev to work; NVIDIA H100 compute does not need KMS)
- bee.sh / smoketest.sh: add /usr/sbin to PATH so dmidecode, smartctl,
  nvme are found
- 9100-memtest hook: add diagnostic listing of chroot/boot/memtest* files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v3.4
2026-03-30 22:24:37 +03:00
905c581ece fix(iso): substitute all ROCm package version placeholders in build.sh
ROCM_BANDWIDTH_TEST_VERSION, ROCM_VALIDATION_SUITE_VERSION, ROCBLAS,
ROCRAND, HIP_RUNTIME_AMD, HIPBLASLT, COMGR were defined in VERSIONS and
in bee.list.chroot but the sed substitution block only covered 3 of them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 22:00:05 +03:00
7c2a0135d2 feat(audit): add platform thermal cycling stress test
Runs CPU (stressapptest) + GPU stress simultaneously across multiple
load/idle cycles with varying idle durations (120s/60s/30s) to detect
cooling systems that fail to recover under repeated load.

Presets: smoke (~5 min), acceptance (~25 min), overnight (~100 min).
Outputs metrics.csv + summary.txt with per-cycle throttle and fan
spindown analysis, packed as tar.gz.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 21:57:33 +03:00
407c1cd1c4 fix(charts): unify timeline labels across graphs v3.3 2026-03-29 21:24:06 +03:00
e15bcc91c5 feat(metrics): persist history in sqlite and add AMD memory validate tests 2026-03-29 12:28:06 +03:00
98f0cf0d52 fix(amd-stress): include VRAM load in GST burn 2026-03-29 12:03:50 +03:00
4db89e9773 fix(metrics): correct chart padding order — right=80 not top=80
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v3.2
2026-03-29 11:38:45 +03:00
3fda18f708 feat(metrics): SQLite persistence + chart fixes (no dots, peak label, min/avg/max in title)
- Add modernc.org/sqlite dependency; write every sample to
  /appdata/bee/metrics.db (WAL mode, prune to 24h on startup)
- Pre-fill ring buffers from last 120 DB rows on startup so charts
  survive service restarts
- Ticker changed 3s→1s; chart JS refresh will be set to 2s (lag ≤3s)
- Add GET /api/metrics/export.csv for full history download
- Chart rendering: SymbolNone (no dots), right padding=80px so peak
  mark line label is not clipped, min/avg/max appended to chart title

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:37:59 +03:00
ea518abf30 feat(metrics): add global peak mark line to all live metric charts
Finds the series with the highest value across all datasets and adds
a SeriesMarkTypeMax dashed mark line to it. Since all series share the
same Y axis this effectively shows a single "global peak" line for the
whole chart with a label on the right.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:24:50 +03:00
744de588bb fix(burn): resolve rvs binary via /opt/rocm-*/bin glob like rocm-smi; add terminal copy button
rvs was not in PATH so the stress job exited immediately (UNSUPPORTED).
Now resolveRVSCommand searches /opt/rocm-*/bin/rvs before failing.
Also add a Copy button overlay on all .terminal elements and set
user-select:text so logs can be copied from the web UI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:20:46 +03:00
a3ed9473a3 fix(metrics): strip units from GPU legend names; fix fan SDR parsing for new IPMI format
Legend names were "GPU 0 %" — remove unit suffix since chart title already
conveys it. Fan parsing now handles the 5-field IPMI SDR format where the
value+unit ("4340 RPM") are combined in the last column rather than split
across separate fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:14:27 +03:00
a714c45f10 fix(metrics): parse rocm-smi CSV by header keywords, not column position
MI250X outputs 7 temperature columns before power/use%; positional parsing
read junction temp (~40°C) as GPU utilisation. Switch to header-based
colIdx() lookup so the correct fields are read regardless of column order
or rocm-smi version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:10:13 +03:00
349e026cfa fix(webui): restore chart legend, remove GPU numeric table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:04:51 +03:00
889fe1dc2f fix: IPMI access for bee user + remove chart legend
- Add udev rule: /dev/ipmi0 readable by 'ipmi' group (no sudo needed)
- Add 'ipmi' group creation and bee user membership in chroot hook
- Remove legend from all charts (data shown in GPU table below)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:03:35 +03:00
befdbf3768 fix(iso): autoload ipmi_si/ipmi_devintf for fan/sensor monitoring
Without these modules /dev/ipmi0 doesn't exist and ipmitool can't
read fan RPM, PSU fans, or IPMI temperature sensors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:59:15 +03:00
ec6a0b292d fix(webui): fix sensor grouping and fan card visibility
- Tccd1-8 (AMD CCD die temps) now classified as 'cpu' group,
  appear on CPU Temperature chart instead of ambient
- Fan RPM card hidden when no fans detected
- Remove CPU Load/Mem Load/Power from fan table (have dedicated charts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:58:01 +03:00
a03312c286 feat: AMD GPU compute stress via rocm-validation-suite GST (GEMM)
- Add rocm-validation-suite, rocblas, rocrand, hip-runtime-amd,
  hipblaslt, comgr to ISO (~700MB, needed for HIP compute)
- RunAMDStressPack: run RVS GST (SGEMM ~31 TFLOPS/GPU) + bandwidth test
- Add rvs symlink in chroot setup hook
- Pin all new package versions in VERSIONS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:56:32 +03:00
e69e9109da fix(iso): set bash as default shell for bee user
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:49:18 +03:00
413869809d feat(iso): add rocm-bandwidth-test for AMD GPU burn-in
- Add rocm-bandwidth-test package to ISO
- Add bee user to 'render' group (/dev/kfd, /dev/dri/renderD* access)
- Add rocm-bandwidth-test symlink alongside rocm-smi

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:48:29 +03:00
f9bd38572a fix(network): strip linkdown/dead/onlink flags when restoring routes
ip route show includes state flags like 'linkdown' that ip route add
does not accept, causing restore to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:39:16 +03:00
662e3d2cdd feat(webui): combined GPU charts (load/memload/power/temp all GPUs per chart)
Replace per-GPU cards with 4 combined charts showing all GPUs as
separate series. Add gpu-all-load/memload/power/temp endpoints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:37:33 +03:00
126af96780 fix(webui): slow metrics chart refresh to 3s interval
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:32:35 +03:00
ada15ac777 fix: loading screen via Go handler instead of file:// HTML
- bee-web.service: remove After=bee-audit so Go starts immediately
- Go serves loading page from / when audit JSON not yet present;
  JS polls /api/ready (503 until file exists, 200 when ready)
  then redirects to dashboard
- bee-openbox-session: wait for /healthz (Go binds fast <2s),
  open http://localhost/ directly — no file:// cross-origin issues
- Remove loading.html static file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:31:46 +03:00
dfb94f9ca6 feat(iso): loading screen while bee-web starts
Replace 15s blocking wait with instant Chromium launch showing a
dark loading page that polls /healthz every 500ms and auto-redirects
to the app when ready.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 09:33:04 +03:00
5857805518 fix(iso): copy memtest86+ to ISO root via binary hook
memtest files live in chroot /boot (inside squashfs) but GRUB needs
them on the ISO filesystem. Binary hook copies them out at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 09:02:40 +03:00
59a1d4b209 release: v3.1 v3.1 2026-03-28 22:51:36 +03:00
0dbfaf6121 feat: dynamic CPU governor (performance during tasks, powersave at idle)
Switch to performance governor when task queue starts processing,
back to powersave when queue drains. Removes bee-cpuperf.service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:47:11 +03:00
5d72d48714 feat(iso): set CPU governor to performance on boot
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:45:37 +03:00
096b4a09ca feat(iso): add bare-metal performance kernel params
mitigations=off, transparent_hugepage=always, numa_balancing=disable,
nowatchdog, nosoftlockup — safe on single-user bare-metal LiveCD,
improves SAT/burn test throughput. fail-safe entry unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:44:21 +03:00
5d42a92e4c feat(iso): use legacy network names (eth0/eth1) via net.ifnames=0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:43:00 +03:00
3e54763367 docs: add iso-build-rules (verify package names before use)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:38:54 +03:00
f91bce8661 fix(iso): fix memtest86+ path (bookworm uses memtest86+x64.bin/.efi)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:38:15 +03:00
585e6d7311 docs: add validate-vs-burn hardware impact policy
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:15:33 +03:00
0a98ed8ae9 feat: task queue, UI overhaul, burn tests, install-to-RAM
- Task queue: all SAT/audit jobs enqueue and run one-at-a-time;
  tasks persist past page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
  Dashboard hardware summary badges + split metrics charts (load/temp/power);
  Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
  out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies squashfs to /dev/shm, re-associates loop devices
  via LOOP_CHANGE_FD ioctl so live media can be ejected
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
  disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v3.0
2026-03-28 21:15:11 +03:00
911745e4da refactor(iso): replace chroot hooks for DCGM/ROCm with live-build apt sources
Move datacenter-gpu-manager and rocm-smi-lib from dynamic chroot hooks
into live-build's config/archives mechanism so lb caches the .deb files
in cache/packages.chroot/ between builds, eliminating repeated 900+ MB
downloads. Versions pinned via VERSIONS and substituted into package
lists at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v2.9
2026-03-28 13:01:10 +03:00
acfd2010d7 fix(iso): remove firmware-chelsio-t4 (not in Debian bookworm)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:43:29 +03:00
e904c13790 fix(iso): remove --no-sandbox from chromium (runs as bee user, not root)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:40:42 +03:00
24c5c72cee feat(iso): add NIC firmware packages for broad hardware support
Adds firmware-misc-nonfree (Intel ice/i40e/igc), firmware-bnx2/bnx2x
(Broadcom), firmware-cavium (Marvell/QLogic), firmware-qlogic,
firmware-chelsio-t4, firmware-realtek to fix missing network on
physical servers with modern NICs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:38:22 +03:00
6ff0bcad56 feat(iso): show kernel logs on graphical console (remove quiet, loglevel=7)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:23:57 +03:00
4fef26000c fix(iso): replace invalid --compression with --chroot-squashfs-compression-type
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:23:00 +03:00
a393dcb731 feat(webui): add POST /api/sat/abort + update bible-local runtime-flows
- jobState now has optional cancel func; abort() calls it if job is running
- handleAPISATRun passes cancellable context to RunNvidiaAcceptancePackWithOptions
- POST /api/sat/abort?job_id=... cancels the running SAT job
- bible-local/runtime-flows.md: replace TUI SAT flow with Web UI flow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:23:00 +03:00
9e55728053 feat(iso): replace --clean-cache with --clean-build (cleans + rebuilds)
--clean-build clears all caches (Go, NVIDIA, lb packages, work dir)
and rebuilds the Docker image, then proceeds with a full clean build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v2.8
2026-03-28 10:12:21 +03:00
4b8023c1cb feat(iso): add --clean-cache option to build-in-container.sh
Removes all cached build artifacts: Go cache, NVIDIA/NCCL/cuBLAS
downloads, lb package cache, and live-build work dir. Use before
a clean rebuild or when switching Debian/kernel versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:11:31 +03:00
4c8417d20a feat(webui): add Install to Disk page
Expose the existing bee-install script through the web UI:
- platform/install.go: remove USB exclusion, add SizeBytes/MountedParts
  fields, add MinInstallBytes()/DiskWarnings() safety checks (size,
  mounted partitions, toram+low-RAM warning)
- webui: add GET /api/install/disks, POST /api/install/run,
  GET /api/install/stream endpoints
- webui: add Install to Disk page with disk table, warning badges,
  device-name confirmation gate, SSE progress terminal, reboot button

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:11:16 +03:00
0755374dd2 perf(iso): speed up builds — zstd squashfs + preserve lb chroot cache
- Switch squashfs compression from xz to zstd (3-5x faster compression,
  ~10-15% larger but decompresses faster at boot)
- Stop rm -rf BUILD_WORK_DIR on each build; rsync only config changes
  so lb can reuse its chroot across builds (skips apt install step)
- Keep lb-packages cache in CACHE_ROOT as fallback if work dir is wiped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:10:29 +03:00
c70ae274fa revert(iso): remove apt-cacher-ng support, use lb package cache instead
apt-cacher-ng requires a separate container; lb's own package cache
persisted in --cache-dir is simpler and sufficient.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:02:34 +03:00