Commit Graph

170 Commits

Author SHA1 Message Date
Mikhail Chusavitin
a20806afc8 Fix ISO grub package conflict 2026-03-31 11:38:30 +03:00
Mikhail Chusavitin
c850b39b01 feat: v3.10 GPU stress and NCCL burn updates 2026-03-31 11:22:27 +03:00
Mikhail Chusavitin
6dee8f3509 Add NVIDIA stress loader selection and DCGM 4 support 2026-03-31 11:15:15 +03:00
Mikhail Chusavitin
20f834aa96 feat: v3.4 — boot reliability, log readability, USB export, screen resolution, GRUB UEFI fix, memtest, KVM console stability
Web UI / logs:
- Strip ANSI escape codes and handle \r (progress bars) in task log output
- Add USB export API + UI card on Export page (list removable devices, write audit JSON or support bundle)
- Add Display Resolution card in Tools (xrandr-based, per-output mode selector)
- Dashboard: audit status banner with auto-reload when audit task completes

Boot & install:
- bee-web starts immediately with no dependencies (was blocked by audit + network)
- bee-audit.service redesigned: waits for bee-web healthz, sleeps 60s, enqueues audit via /api/audit/run (task system)
- bee-install: fix GRUB UEFI — grub-install exit code was silently ignored (|| true); add --no-nvram fallback; always copy EFI/BOOT/BOOTX64.EFI fallback path
- Add grub-efi-amd64, grub-pc, grub-efi-amd64-signed, shim-signed to package list (grub-install requires these, not just -bin variants)
- memtest hook: fix binary/boot/ not created before cp; handle both Debian (no extension) and upstream (x64.efi) naming
- bee-openbox-session: increase healthz wait from 30s to 120s

KVM console stability:
- runCmdJob: syscall.Setpriority(PRIO_PROCESS, pid, 10) on all stress subprocesses
- lightdm.service.d: Nice=-5 so X server preempts stress processes

Packages: add btop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 10:16:15 +03:00
105d92df8b fix(iso): use underscore in volume label to comply with ISO 9660
ISO 9660 volume labels allow only A-Z, 0-9, and underscore.
Dashes cause xorriso WARNING on every build.
EASY-BEE-NVIDIA → EASY_BEE_NVIDIA (iso-application keeps dashes, it's UDF).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:38:02 +03:00
f96b149875 fix(memtest): extract EFI binary from .deb cache if chroot/boot/ is empty
memtest86+ postinst does not place files in /boot in a live-build chroot
without grub triggers. Added fallback: extract directly from the cached
.deb via dpkg-deb -x, with verbose logging throughout.

Also remove "NVIDIA no MSI-X" from boot menu (premature — root cause unknown).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:30:52 +03:00
5ee120158e fix(build): remove unused variant package lists before lb build
live-build picks up ALL .list.chroot files in config/package-lists/.
After rsync, bee-nvidia.list.chroot, bee-amd.list.chroot, and
bee-nogpu.list.chroot all end up in BUILD_WORK_DIR — causing lb to
try installing packages from every variant (and leaving version
placeholders unsubstituted in the unused lists).

Fix: after copying bee-${BEE_GPU_VENDOR}.list.chroot → bee-gpu.list.chroot,
delete all other bee-{nvidia,amd,nogpu}.list.chroot from BUILD_WORK_DIR.

Also includes nomsi boot mode changes (bee-nvidia-load + grub.cfg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:03:42 +03:00
09fe0e2e9e feat(iso): add nogpu variant (no NVIDIA, no AMD/ROCm)
- build.sh: accept --variant nogpu; skips all GPU build steps, removes
  both nvidia-cuda and rocm archives, strips bee-nvidia-load and
  bee-nvidia.service from overlay
- build-in-container.sh: add nogpu to --variant flag; all variant
  includes nogpu; --clean-build wipes live-build-work-nogpu
- 9000-bee-setup hook: nogpu path enables no GPU services
- bee-nogpu.list.chroot: empty GPU package list

Output: easy-bee-nogpu-vX.iso

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 22:49:25 +03:00
ace1a9dba6 feat(iso): split into nvidia and amd variants, fix KVM graphics and PATH
- build.sh: add --variant nvidia|amd; separate work dirs per variant
  (live-build-work-nvidia / live-build-work-amd); GPU-specific steps
  (modules, NCCL, cuBLAS, nccl-tests) run only for nvidia; deb package
  cache synced back to shared location after each lb build so second
  variant reuses downloaded packages; ISO output named
  easy-bee-{variant}-v{ver}-amd64.iso
- build-in-container.sh: add --variant nvidia|amd|all (default: all);
  runs build.sh twice in one container for 'all'; --clean-build wipes
  both variant work dirs
- package-lists: remove GPU packages from bee.list.chroot; add
  bee-nvidia.list.chroot (DCGM) and bee-amd.list.chroot (ROCm)
- 9000-bee-setup hook: read /etc/bee-gpu-vendor; enable bee-nvidia.service
  and DCGM only for nvidia; set up ROCm symlinks only for amd
- auto/config: --iso-volume uses BEE_GPU_VENDOR_UPPER env var
- grub.cfg: add nomodeset to EASY-BEE and EASY-BEE (load to RAM) entries
  — fixes X/lightdm on BMC KVM (ASPEED AST chip requires nomodeset for
  fbdev to work; NVIDIA H100 compute does not need KMS)
- bee.sh / smoketest.sh: add /usr/sbin to PATH so dmidecode, smartctl,
  nvme are found
- 9100-memtest hook: add diagnostic listing of chroot/boot/memtest* files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 22:24:37 +03:00
905c581ece fix(iso): substitute all ROCm package version placeholders in build.sh
ROCM_BANDWIDTH_TEST_VERSION, ROCM_VALIDATION_SUITE_VERSION, ROCBLAS,
ROCRAND, HIP_RUNTIME_AMD, HIPBLASLT, COMGR were defined in VERSIONS and
in bee.list.chroot but the sed substitution block only covered 3 of them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 22:00:05 +03:00
889fe1dc2f fix: IPMI access for bee user + remove chart legend
- Add udev rule: /dev/ipmi0 readable by 'ipmi' group (no sudo needed)
- Add 'ipmi' group creation and bee user membership in chroot hook
- Remove legend from all charts (data shown in GPU table below)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:03:35 +03:00
a03312c286 feat: AMD GPU compute stress via rocm-validation-suite GST (GEMM)
- Add rocm-validation-suite, rocblas, rocrand, hip-runtime-amd,
  hipblaslt, comgr to ISO (~700MB, needed for HIP compute)
- RunAMDStressPack: run RVS GST (SGEMM ~31 TFLOPS/GPU) + bandwidth test
- Add rvs symlink in chroot setup hook
- Pin all new package versions in VERSIONS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:56:32 +03:00
e69e9109da fix(iso): set bash as default shell for bee user
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:49:18 +03:00
413869809d feat(iso): add rocm-bandwidth-test for AMD GPU burn-in
- Add rocm-bandwidth-test package to ISO
- Add bee user to 'render' group (/dev/kfd, /dev/dri/renderD* access)
- Add rocm-bandwidth-test symlink alongside rocm-smi

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:48:29 +03:00
5857805518 fix(iso): copy memtest86+ to ISO root via binary hook
memtest files live in chroot /boot (inside squashfs) but GRUB needs
them on the ISO filesystem. Binary hook copies them out at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 09:02:40 +03:00
59a1d4b209 release: v3.1 2026-03-28 22:51:36 +03:00
096b4a09ca feat(iso): add bare-metal performance kernel params
mitigations=off, transparent_hugepage=always, numa_balancing=disable,
nowatchdog, nosoftlockup — safe on single-user bare-metal LiveCD,
improves SAT/burn test throughput. fail-safe entry unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:44:21 +03:00
5d42a92e4c feat(iso): use legacy network names (eth0/eth1) via net.ifnames=0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:43:00 +03:00
f91bce8661 fix(iso): fix memtest86+ path (bookworm uses memtest86+x64.bin/.efi)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:38:15 +03:00
0a98ed8ae9 feat: task queue, UI overhaul, burn tests, install-to-RAM
- Task queue: all SAT/audit jobs enqueue and run one-at-a-time;
  tasks persist past page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
  Dashboard hardware summary badges + split metrics charts (load/temp/power);
  Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
  out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies squashfs to /dev/shm, re-associates loop devices
  via LOOP_CHANGE_FD ioctl so live media can be ejected
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
  disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:15:11 +03:00
911745e4da refactor(iso): replace chroot hooks for DCGM/ROCm with live-build apt sources
Move datacenter-gpu-manager and rocm-smi-lib from dynamic chroot hooks
into live-build's config/archives mechanism so lb caches the .deb files
in cache/packages.chroot/ between builds, eliminating repeated 900+ MB
downloads. Versions pinned via VERSIONS and substituted into package
lists at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 13:01:10 +03:00
acfd2010d7 fix(iso): remove firmware-chelsio-t4 (not in Debian bookworm)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:43:29 +03:00
24c5c72cee feat(iso): add NIC firmware packages for broad hardware support
Adds firmware-misc-nonfree (Intel ice/i40e/igc), firmware-bnx2/bnx2x
(Broadcom), firmware-cavium (Marvell/QLogic), firmware-qlogic,
firmware-chelsio-t4, firmware-realtek to fix missing network on
physical servers with modern NICs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:38:22 +03:00
6ff0bcad56 feat(iso): show kernel logs on graphical console (remove quiet, loglevel=7)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:23:57 +03:00
4fef26000c fix(iso): replace invalid --compression with --chroot-squashfs-compression-type
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:23:00 +03:00
9e55728053 feat(iso): replace --clean-cache with --clean-build (cleans + rebuilds)
--clean-build clears all caches (Go, NVIDIA, lb packages, work dir)
and rebuilds the Docker image, then proceeds with a full clean build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:12:21 +03:00
4b8023c1cb feat(iso): add --clean-cache option to build-in-container.sh
Removes all cached build artifacts: Go cache, NVIDIA/NCCL/cuBLAS
downloads, lb package cache, and live-build work dir. Use before
a clean rebuild or when switching Debian/kernel versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:11:31 +03:00
0755374dd2 perf(iso): speed up builds — zstd squashfs + preserve lb chroot cache
- Switch squashfs compression from xz to zstd (3-5x faster compression,
  ~10-15% larger but decompresses faster at boot)
- Stop rm -rf BUILD_WORK_DIR on each build; rsync only config changes
  so lb can reuse its chroot across builds (skips apt install step)
- Keep lb-packages cache in CACHE_ROOT as fallback if work dir is wiped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:10:29 +03:00
c70ae274fa revert(iso): remove apt-cacher-ng support, use lb package cache instead
apt-cacher-ng requires a separate container; lb's own package cache
persisted in --cache-dir is simpler and sufficient.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:02:34 +03:00
23ad7ff534 feat(iso): persist lb package cache across builds in cache dir
Saves cache/packages.chroot before wiping BUILD_WORK_DIR and
restores it after, so apt packages are not re-downloaded on every
build. Cache lives in --cache-dir (same place as Go/NVIDIA cache).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:59:55 +03:00
de130966f7 feat(iso): add APT_PROXY support to speed up builds via apt-cacher-ng
Pass APT_PROXY=http://host:3142 to build-in-container.sh to route
all apt traffic through a local cache. Also supports --apt-proxy flag.
Mirrors in auto/config are set from BEE_APT_PROXY env when present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:57:54 +03:00
c6fbfc8306 fix(boot): restore toram as menu option only, not default boot param
toram was incorrectly added to the default bootappend-live causing
every boot to copy the full ISO to RAM (slow on BMC virtual media).
Default boot reads squashfs from media; toram is available as a
separate menu entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:52:25 +03:00
35ad1c74d9 feat(iso): add slim hook to strip locales/man pages/apt cache from squashfs
Removes ~100-300MB from the squashfs: man pages, non-en locales,
python cache, apt lists and package cache, temp files and logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:44:02 +03:00
4a02e74b17 fix(iso): add git safe.directory so git describe sees v* tags inside container
Without this, git refuses to read the bind-mounted repo (UID mismatch)
and describe returns empty, causing the version to fall back to iso/v1.0.20.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:23:37 +03:00
6caf771d6e fix(boot): restore toram kernel parameter
Without toram the squashfs is read from the physical medium at runtime.
Disconnecting the USB/CD after boot causes SQUASHFS I/O errors on any
uncached block, making all X11 apps crash with SIGBUS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:04:37 +03:00
3579747ae3 fix(iso): prioritise v[0-9]* tags over iso/v* for ISO filename
Plain v2.x tags are now the active tagging scheme; iso/v1.0.x tags
are legacy. Swap priority in resolve_iso_version so the ISO is named
bee-debian12-v2.x-amd64.iso instead of v1.0.x-N-gHASH.
Also tighten the v* pattern to v[0-9]* to avoid accidentally matching
other prefixed tags in both resolve functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:34:09 +03:00
b4371e291e fix(build): resolve ISO version from plain v* tags (e.g. v2.6)
resolve_iso_version only matched iso/v* pattern; GUI release tags
(v2, v2.1 ... v2.6) were ignored, falling back to the old v1.0.20
annotated tag via resolve_audit_version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:11:33 +03:00
c22b53a406 feat(boot): set 1920x1080 resolution for framebuffer and GRUB
- Add video=1920x1080 to kernel cmdline (sets fbdev to Full HD)
- Update GRUB gfxmode to 1920x1080 (fallback to 1280x1024,auto)
- Add Xorg Monitor section with 1920x1080 Modeline and preferred mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:10:18 +03:00
883592d029 feat(desktop): switch to LightDM for X startup (matches Ubuntu LiveCD)
startx from user shell has /dev/fb0 permission issues and is fragile.
LightDM starts Xorg as root — standard LiveCD approach that works
on server hardware / IPMI KVM with nomodeset + fbdev/vesa.

- Add lightdm package, configure autologin as bee/openbox session
- Add /usr/share/xsessions/openbox.desktop
- Remove startx from .profile (LightDM manages X lifecycle)
- Remove Xwrapper.config needs_root_rights workaround (no longer needed)
- Enable lightdm.service in setup hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:17:59 +03:00
a6dcaf1c7e fix(desktop): fix X permissions for server hardware (IPMI KVM)
- Add bee user to video,input groups (fixes /dev/fb0 permission denied)
- Add Xwrapper.config: needs_root_rights=yes (X gets hw access)
- Add xserver-xorg-video-vesa as fallback driver
- Remove dead bee-tui chmod from setup hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:07:25 +03:00
c1aa3cf491 fix(desktop): start X on vt1 from .profile for IPMI KVM compatibility
startx from autologin shell targets VT1 directly — KVM sees the
graphical UI without VT switching. Remove bee-desktop.service
(systemd-launched X defaults to VT7, invisible on KVM).
Add xserver-xorg-video-fbdev for server AST/VGA framebuffer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:03:59 +03:00
f7eb75c57c fix(iso): replace grub-pc/grub-efi-amd64 with -bin variants to fix package conflict
grub-pc and grub-efi-amd64 conflict with each other in Debian 12.
The -bin packages provide the same grub-install binaries without conflict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:12:18 +03:00
004cc4910d feat(webui): replace TUI with full web UI + local openbox desktop
- Remove audit/internal/tui/ (~3000 LOC, bubbletea/lipgloss/reanimator deps)
- Add /api/* REST+SSE endpoints: audit, SAT (nvidia/memory/storage/cpu),
  services, network, export, tools, live metrics stream
- Add async job manager with SSE streaming for long-running operations
- Add platform.SampleLiveMetrics() for live fan/temp/power/GPU polling
- Add multi-page web UI (vanilla JS): Dashboard, Metrics charts, Tests,
  Burn-in, Network, Services, Export, Tools
- Add bee-desktop.service: openbox + Xorg + Chromium opening http://localhost/
- Add openbox/tint2/xorg/xinit/xterm/chromium to ISO package list
- Update .profile, bee.sh, and bible-local docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 19:21:14 +03:00
ed1cceed8c fix(boot): add nomodeset to fix black screen on server VGA/IPMI KVM (AST chip KMS) 2026-03-27 00:13:36 +03:00
9fe9f061f8 fix(nccl-tests): set LIBRARY_PATH so ld finds libnccl.so in nccl cache 2026-03-26 23:59:06 +03:00
837a1fb981 fix(nccl-tests): pin /usr/local/cuda→12.8 symlink, auto-detect gencode by nvcc version 2026-03-26 23:54:07 +03:00
1f43b4e050 fix(nccl-tests): pass NCCL_LIB from nccl cache to fix -lnccl link error 2026-03-26 23:52:25 +03:00
83bbc8a1bc fix(nccl-tests): upgrade to cuda-nvcc-12-8, add sm_100 (Blackwell B100/B200) 2026-03-26 23:51:26 +03:00
896bdb6ee8 fix(nccl-tests): use cuda-nvcc-12-6 to support Ampere/Volta (sm_70..sm_90) 2026-03-26 23:50:36 +03:00
5407c26e25 fix(nccl-tests): CUDA 13.0 supports only sm_90+ (Hopper/H100) 2026-03-26 23:49:45 +03:00