Commit Graph

178 Commits

Author SHA1 Message Date
0dbfaf6121 feat: dynamic CPU governor (performance during tasks, powersave at idle)
Switch to performance governor when task queue starts processing,
back to powersave when queue drains. Removes bee-cpuperf.service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:47:11 +03:00
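
Illustrative note on 0dbfaf6121: the switch described above can be sketched as follows. This is a minimal sketch, not the project's code. It assumes the governor is set through the standard Linux cpufreq sysfs files; the names set_governor and GovernorSwitch and the sysfs_root parameter are hypothetical and exist only so the example can run against a scratch directory.

```python
from pathlib import Path

# Assumption: standard cpufreq sysfs layout, cpuN/cpufreq/scaling_governor.
DEFAULT_SYSFS_ROOT = Path("/sys/devices/system/cpu")


def set_governor(governor: str, sysfs_root: Path = DEFAULT_SYSFS_ROOT) -> int:
    """Write `governor` for every CPU found; return how many files were written."""
    written = 0
    for path in sorted(sysfs_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        try:
            path.write_text(governor)
            written += 1
        except OSError:
            # A CPU may be offline or the governor unsupported; skip it.
            continue
    return written


class GovernorSwitch:
    """Hypothetical hooks a task queue could call at its two transitions."""

    def __init__(self, sysfs_root: Path = DEFAULT_SYSFS_ROOT):
        self.sysfs_root = sysfs_root

    def on_queue_start(self) -> int:
        # Queue begins processing: favour throughput.
        return set_governor("performance", self.sysfs_root)

    def on_queue_drained(self) -> int:
        # Queue is empty again: fall back to the idle governor.
        return set_governor("powersave", self.sysfs_root)
```

On a real system the writes need root, and the set of available governors depends on the active cpufreq driver.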
5d72d48714 feat(iso): set CPU governor to performance on boot
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:45:37 +03:00
096b4a09ca feat(iso): add bare-metal performance kernel params
mitigations=off, transparent_hugepage=always, numa_balancing=disable,
nowatchdog, nosoftlockup — safe on single-user bare-metal LiveCD,
improves SAT/burn test throughput. fail-safe entry unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:44:21 +03:00
5d42a92e4c feat(iso): use legacy network names (eth0/eth1) via net.ifnames=0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:43:00 +03:00
f91bce8661 fix(iso): fix memtest86+ path (bookworm uses memtest86+x64.bin/.efi)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:38:15 +03:00
0a98ed8ae9 feat: task queue, UI overhaul, burn tests, install-to-RAM
- Task queue: all SAT/audit jobs enqueue and run one-at-a-time;
  tasks persist past page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
  Dashboard hardware summary badges + split metrics charts (load/temp/power);
  Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
  out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies squashfs to /dev/shm, re-associates loop devices
  via LOOP_CHANGE_FD ioctl so live media can be ejected
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
  disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:15:11 +03:00
911745e4da refactor(iso): replace chroot hooks for DCGM/ROCm with live-build apt sources
Move datacenter-gpu-manager and rocm-smi-lib from dynamic chroot hooks
into live-build's config/archives mechanism so lb caches the .deb files
in cache/packages.chroot/ between builds, eliminating repeated 900+ MB
downloads. Versions pinned via VERSIONS and substituted into package
lists at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 13:01:10 +03:00
acfd2010d7 fix(iso): remove firmware-chelsio-t4 (not in Debian bookworm)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:43:29 +03:00
e904c13790 fix(iso): remove --no-sandbox from chromium (runs as bee user, not root)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:40:42 +03:00
24c5c72cee feat(iso): add NIC firmware packages for broad hardware support
Adds firmware-misc-nonfree (Intel ice/i40e/igc), firmware-bnx2/bnx2x
(Broadcom), firmware-cavium (Marvell/QLogic), firmware-qlogic,
firmware-chelsio-t4, firmware-realtek to fix missing network on
physical servers with modern NICs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:38:22 +03:00
6ff0bcad56 feat(iso): show kernel logs on graphical console (remove quiet, loglevel=7)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:23:57 +03:00
4fef26000c fix(iso): replace invalid --compression with --chroot-squashfs-compression-type
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:23:00 +03:00
9e55728053 feat(iso): replace --clean-cache with --clean-build (cleans + rebuilds)
--clean-build clears all caches (Go, NVIDIA, lb packages, work dir)
and rebuilds the Docker image, then proceeds with a full clean build.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:12:21 +03:00
4b8023c1cb feat(iso): add --clean-cache option to build-in-container.sh
Removes all cached build artifacts: Go cache, NVIDIA/NCCL/cuBLAS
downloads, lb package cache, and live-build work dir. Use before
a clean rebuild or when switching Debian/kernel versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:11:31 +03:00
0755374dd2 perf(iso): speed up builds — zstd squashfs + preserve lb chroot cache
- Switch squashfs compression from xz to zstd (3-5x faster compression,
  ~10-15% larger but decompresses faster at boot)
- Stop rm -rf BUILD_WORK_DIR on each build; rsync only config changes
  so lb can reuse its chroot across builds (skips apt install step)
- Keep lb-packages cache in CACHE_ROOT as fallback if work dir is wiped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:10:29 +03:00
c70ae274fa revert(iso): remove apt-cacher-ng support, use lb package cache instead
apt-cacher-ng requires a separate container; lb's own package cache
persisted in --cache-dir is simpler and sufficient.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:02:34 +03:00
23ad7ff534 feat(iso): persist lb package cache across builds in cache dir
Saves cache/packages.chroot before wiping BUILD_WORK_DIR and
restores it after, so apt packages are not re-downloaded on every
build. Cache lives in --cache-dir (same place as Go/NVIDIA cache).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:59:55 +03:00
de130966f7 feat(iso): add APT_PROXY support to speed up builds via apt-cacher-ng
Pass APT_PROXY=http://host:3142 to build-in-container.sh to route
all apt traffic through a local cache. Also supports --apt-proxy flag.
Mirrors in auto/config are set from BEE_APT_PROXY env when present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:57:54 +03:00
c6fbfc8306 fix(boot): restore toram as menu option only, not default boot param
toram was incorrectly added to the default bootappend-live causing
every boot to copy the full ISO to RAM (slow on BMC virtual media).
Default boot reads squashfs from media; toram is available as a
separate menu entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 09:52:25 +03:00
35ad1c74d9 feat(iso): add slim hook to strip locales/man pages/apt cache from squashfs
Removes ~100-300MB from the squashfs: man pages, non-en locales,
python cache, apt lists and package cache, temp files and logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:44:02 +03:00
4a02e74b17 fix(iso): add git safe.directory so git describe sees v* tags inside container
Without this, git refuses to read the bind-mounted repo (UID mismatch)
and describe returns empty, causing the version to fall back to iso/v1.0.20.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:23:37 +03:00
6caf771d6e fix(boot): restore toram kernel parameter
Without toram the squashfs is read from the physical medium at runtime.
Disconnecting the USB/CD after boot causes SQUASHFS I/O errors on any
uncached block, making all X11 apps crash with SIGBUS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 08:04:37 +03:00
14fa87b7d7 feat(netconf): add input validation, 'b' to go back, 'a' to abort
- All prompts accept 'a' = abort, 'b' = back to previous step
- Interface input: validate numeric range and name existence, re-prompt on bad input
- IP address: regex check x.x.x.x/prefix format
- Gateway: regex check x.x.x.x format
- Main loop: 'b' at mode selection goes back to interface list

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 07:31:23 +03:00
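
Illustrative note on 14fa87b7d7: the checks listed above can be sketched as follows. The real netconf is an interactive shell script, so this Python version only mirrors the described behaviour. The function names are hypothetical, and the octet and prefix range checks go beyond the pure format check the commit mentions.

```python
import re

# Format checks as described in the commit: x.x.x.x/prefix and x.x.x.x.
CIDR_RE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})")
GATEWAY_RE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")


def is_valid_cidr(text: str) -> bool:
    """True for input such as 192.168.1.10/24."""
    match = CIDR_RE.fullmatch(text.strip())
    if not match:
        return False
    *octets, prefix = (int(group) for group in match.groups())
    # Assumption: range checks added for illustration only.
    return all(octet <= 255 for octet in octets) and prefix <= 32


def is_valid_gateway(text: str) -> bool:
    """True for input such as 192.168.1.1."""
    match = GATEWAY_RE.fullmatch(text.strip())
    return bool(match) and all(int(group) <= 255 for group in match.groups())


def classify_answer(text: str):
    """Map a prompt answer to ('abort' | 'back' | 'value', payload)."""
    answer = text.strip()
    if answer == "a":
        return ("abort", None)
    if answer == "b":
        return ("back", None)
    return ("value", answer)
```

A prompt loop would call classify_answer first, then re-prompt while the matching validator rejects the value.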
600ece911b fix(desktop): remove forced 1920x1080 modeline, limit LightDM restarts
On real server hardware (IPMI/BMC AST chip + nomodeset) the VESA
framebuffer is set by BIOS at whatever resolution it chooses (often
1024x768 or 1280x1024). The hardcoded 1920x1080 Modeline caused X to
fail → LightDM crash-loop → SOL console flooded with systemd messages.

- Remove Monitor section / Modeline from xorg.conf — fbdev now uses
  whatever framebuffer resolution the kernel provides
- Add lightdm.service.d/bee-limits.conf: RestartSec=10,
  max 3 restarts per 60s so headless hardware doesn't spam the console

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 07:30:51 +03:00
2d424c63cb fix(netconf): accept interface number as input, not just name
User sees a numbered list but could only type the name.
Now numeric input is resolved to the interface name via awk NR==N.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 07:27:49 +03:00
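
Illustrative note on 2d424c63cb: the number-to-name resolution above is done with awk NR==N in the script. The same idea is sketched here in Python. resolve_interface is a hypothetical name, and returning None for unknown input is an assumption borrowed from the later validation commit.

```python
from typing import List, Optional


def resolve_interface(user_input: str, interfaces: List[str]) -> Optional[str]:
    """Accept either a 1-based list number or an interface name.

    `interfaces` is the list in the order it was shown to the user.
    Returns the interface name, or None if the input matches nothing.
    """
    choice = user_input.strip()
    if choice.isascii() and choice.isdigit():
        index = int(choice)
        # Numbers are 1-based, like awk's NR.
        if 1 <= index <= len(interfaces):
            return interfaces[index - 1]
        return None
    return choice if choice in interfaces else None
```

For example, with the list ["eth0", "eth1"], both "2" and "eth1" resolve to eth1.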
50f28d1ee6 chore: drop legacy TUI/dead code
- Delete audit/internal/app/panel.go (388 lines, zero callers — TUI panel remnant)
- Delete RenderGPULiveChart() from platform/gpu_metrics.go (~155 lines, never called)
- Move formatSATDetail/cleanSummaryKey helpers to app.go (still used)
- Update motd: replace bee-tui with Web UI hint
- Update journald.conf.d comment: remove bee-tui reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 07:27:30 +03:00
3579747ae3 fix(iso): prioritise v[0-9]* tags over iso/v* for ISO filename
Plain v2.x tags are now the active tagging scheme; iso/v1.0.x tags
are legacy. Swap priority in resolve_iso_version so the ISO is named
bee-debian12-v2.x-amd64.iso instead of v1.0.x-N-gHASH.
Also tighten the v* pattern to v[0-9]* to avoid accidentally matching
other prefixed tags in both resolve functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:34:09 +03:00
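
Illustrative note on 3579747ae3: the priority swap above can be sketched as follows. This is a sketch under assumptions, not the actual resolve_iso_version, which is a shell function working from git describe. pick_iso_tag is a hypothetical name, and the input is assumed to be a list of tag names ordered newest first.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional

# Patterns in priority order: plain v2.x tags first, legacy iso/v* second.
TAG_PATTERNS = ("v[0-9]*", "iso/v*")


def pick_iso_tag(tags: Iterable[str]) -> Optional[str]:
    """Return the first tag matching the highest-priority pattern."""
    candidates = list(tags)
    for pattern in TAG_PATTERNS:
        for tag in candidates:
            if fnmatchcase(tag, pattern):
                return tag
    return None
```

The tightened v[0-9]* pattern is what keeps a tag such as vendor-x from being picked up as a version.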
b4371e291e fix(build): resolve ISO version from plain v* tags (e.g. v2.6)
resolve_iso_version only matched iso/v* pattern; GUI release tags
(v2, v2.1 ... v2.6) were ignored, falling back to the old v1.0.20
annotated tag via resolve_audit_version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:11:33 +03:00
c22b53a406 feat(boot): set 1920x1080 resolution for framebuffer and GRUB
- Add video=1920x1080 to kernel cmdline (sets fbdev to Full HD)
- Update GRUB gfxmode to 1920x1080 (fallback to 1280x1024,auto)
- Add Xorg Monitor section with 1920x1080 Modeline and preferred mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:10:18 +03:00
883592d029 feat(desktop): switch to LightDM for X startup (matches Ubuntu LiveCD)
startx from user shell has /dev/fb0 permission issues and is fragile.
LightDM starts Xorg as root — standard LiveCD approach that works
on server hardware / IPMI KVM with nomodeset + fbdev/vesa.

- Add lightdm package, configure autologin as bee/openbox session
- Add /usr/share/xsessions/openbox.desktop
- Remove startx from .profile (LightDM manages X lifecycle)
- Remove Xwrapper.config needs_root_rights workaround (no longer needed)
- Enable lightdm.service in setup hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:17:59 +03:00
a6dcaf1c7e fix(desktop): fix X permissions for server hardware (IPMI KVM)
- Add bee user to video,input groups (fixes /dev/fb0 permission denied)
- Add Xwrapper.config: needs_root_rights=yes (X gets hw access)
- Add xserver-xorg-video-vesa as fallback driver
- Remove dead bee-tui chmod from setup hook

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:07:25 +03:00
88727fb590 fix(desktop): don't exec startx — fall back to shell on X failure
If X fails to start, the user gets a working shell prompt instead
of a dead session or autologin loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:48:26 +03:00
c9f5224c42 feat(console): add netconf command for quick network setup
Interactive script: lists interfaces, DHCP or static IP config.
Shown as hint in tty1 welcome message.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:07:14 +03:00
7cb5c02a9b fix(desktop): force fbdev Xorg driver for server framebuffer
Explicit xorg.conf.d config prevents Xorg from trying KMS/DRM
drivers that fail on server hardware with nomodeset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:05:42 +03:00
c1aa3cf491 fix(desktop): start X on vt1 from .profile for IPMI KVM compatibility
startx from autologin shell targets VT1 directly — KVM sees the
graphical UI without VT switching. Remove bee-desktop.service
(systemd-launched X defaults to VT7, invisible on KVM).
Add xserver-xorg-video-fbdev for server AST/VGA framebuffer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:03:59 +03:00
f7eb75c57c fix(iso): replace grub-pc/grub-efi-amd64 with -bin variants to fix package conflict
grub-pc and grub-efi-amd64 conflict with each other in Debian 12.
The -bin packages provide the same grub-install binaries without conflict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:12:18 +03:00
004cc4910d feat(webui): replace TUI with full web UI + local openbox desktop
- Remove audit/internal/tui/ (~3000 LOC, bubbletea/lipgloss/reanimator deps)
- Add /api/* REST+SSE endpoints: audit, SAT (nvidia/memory/storage/cpu),
  services, network, export, tools, live metrics stream
- Add async job manager with SSE streaming for long-running operations
- Add platform.SampleLiveMetrics() for live fan/temp/power/GPU polling
- Add multi-page web UI (vanilla JS): Dashboard, Metrics charts, Tests,
  Burn-in, Network, Services, Export, Tools
- Add bee-desktop.service: openbox + Xorg + Chromium opening http://localhost/
- Add openbox/tint2/xorg/xinit/xterm/chromium to ISO package list
- Update .profile, bee.sh, and bible-local docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 19:21:14 +03:00
ed1cceed8c fix(boot): add nomodeset to fix black screen on server VGA/IPMI KVM (AST chip KMS)
2026-03-27 00:13:36 +03:00
9fe9f061f8 fix(nccl-tests): set LIBRARY_PATH so ld finds libnccl.so in nccl cache
2026-03-26 23:59:06 +03:00
837a1fb981 fix(nccl-tests): pin /usr/local/cuda→12.8 symlink, auto-detect gencode by nvcc version
2026-03-26 23:54:07 +03:00
1f43b4e050 fix(nccl-tests): pass NCCL_LIB from nccl cache to fix -lnccl link error
2026-03-26 23:52:25 +03:00
83bbc8a1bc fix(nccl-tests): upgrade to cuda-nvcc-12-8, add sm_100 (Blackwell B100/B200)
2026-03-26 23:51:26 +03:00
896bdb6ee8 fix(nccl-tests): use cuda-nvcc-12-6 to support Ampere/Volta (sm_70..sm_90)
2026-03-26 23:50:36 +03:00
5407c26e25 fix(nccl-tests): CUDA 13.0 supports only sm_90+ (Hopper/H100)
2026-03-26 23:49:45 +03:00
4fddaba9c5 fix(nccl-tests): limit CUDA gencode to sm_70+ (CUDA 13 dropped Pascal)
2026-03-26 23:48:40 +03:00
d2f384b6eb fix(nccl-tests): use plain make instead of non-existent all_reduce_perf target
2026-03-26 23:47:49 +03:00
25f0f30aaf fix(boot): fix black screen on monitor, stop log spam on console
- Add console=tty0 so VGA display gets kernel output (was serial-only)
- Change loglevel=7→3 (debug→errors only)
- Add quiet to suppress verbose kernel boot messages
- journald: ForwardToConsole=no so service logs don't flood tty1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:45:09 +03:00
a57b037a91 feat(installer): add 'Install to disk' in Tools submenu
Copies the live system to a local disk via unsquashfs — no debootstrap,
no network required. Supports UEFI (GPT+EFI) and BIOS (MBR) layouts.

ISO:
- Add squashfs-tools, parted, grub-pc, grub-efi-amd64 to package list
- New overlay script bee-install: partitions, formats, unsquashfs,
  writes fstab, runs grub-install+update-grub in chroot

Go TUI:
- Settings → Tools submenu (Install to disk, Check tools)
- Disk picker screen: lists non-USB, non-boot disks via lsblk
- Confirm screen warns about data loss
- Runs with live progress tail of /tmp/bee-install.log
- platform/install.go: ListInstallDisks, InstallToDisk, findLiveBootDevice

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:35:01 +03:00
5644231f9a feat(nccl): add nccl-tests all_reduce_perf for GPU bandwidth testing
- Dockerfile: install cuda-nvcc-13-0 from NVIDIA repo for compilation
- build-nccl-tests.sh: downloads libnccl-dev for nccl.h, builds all_reduce_perf
- build.sh: runs nccl-tests build, injects binary into /usr/local/bin/
- platform: RunNCCLTests() auto-detects GPU count, runs all_reduce_perf
- TUI: NCCL bandwidth test entry in Burn-in Tests screen [N] hotkey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:22:19 +03:00
eea98e6d76 feat(dcgm): add NVIDIA DCGM diagnostics, fix KVM console
- Add 9002-nvidia-dcgm.hook.chroot: installs datacenter-gpu-manager
  from NVIDIA apt repo during live-build
- Enable nvidia-dcgm.service in chroot setup hook
- Replace bee-gpu-stress with dcgmi diag (levels 1-4) in NVIDIA SAT
- TUI: replace GPU checkbox + duration UI with DCGM level selection
- Remove console=tty2 from boot params: KVM/VGA now shows tty1
  where bee-tui runs, fixing unresponsive console

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:08:12 +03:00