Commit Graph

220 Commits

Author SHA1 Message Date
Mikhail Chusavitin
3cf75a541a build: collect ISO and logs under versioned dist/easy-bee-v{VERSION}/ dir
All final artefacts for a given version now land in one place:
  dist/easy-bee-v4.1/
    easy-bee-nvidia-v4.1-amd64.iso
    easy-bee-nvidia-v4.1-amd64.logs.tar.gz   ← log archive
                                               (logs dir deleted after archiving)

- Introduce OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}"
- Move LOG_DIR, LOG_ARCHIVE, and ISO_OUT into OUT_DIR
- cleanup_build_log: use dirname(LOG_DIR) as tar -C base so the path is
  correct regardless of where OUT_DIR lives; delete LOG_DIR after archiving

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 10:19:11 +03:00
d21d9d191b fix(build): bump DCGM to 4.5.3-1 — core package updated in CUDA repo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:49:57 +03:00
348db35119 fix(stress): stagger john GPU launches to prevent GWS tuning contention
When 8 john processes start simultaneously they race for GPU memory during
OpenCL GWS auto-tuning. Slower devices settle on a smaller work size (~594MiB
vs 762MiB) and run at 40% instead of 100% load. Add 3s sleep between launches
so each instance finishes memory allocation before the next one starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:44:00 +03:00
964ab39656 fix: run john stress in parallel per GPU, fix chromium fullscreen, filter BMC virtual disks
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
  so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
  white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:14:21 +03:00
eb60100297 fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
  of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
  overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
  bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
  base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:25:23 +03:00
Mikhail Chusavitin
2baf3be640 Handle memtest recovery probe under set -e 2026-04-01 17:42:13 +03:00
Mikhail Chusavitin
d92f8f41d0 Fix memtest ISO validation false negatives 2026-04-01 12:22:17 +03:00
Mikhail Chusavitin
76a9100779 fix(iso): rebuild image after memtest recovery 2026-04-01 10:01:14 +03:00
Mikhail Chusavitin
1b6d592bf3 feat(iso): add optional kms display boot path 2026-04-01 09:42:59 +03:00
Mikhail Chusavitin
4e4debd4da refactor(webui): redesign Burn tab and fix gpu-burn memory defaults
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
  Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
  endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
  with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
  now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
  field to selectively run CPU and/or GPU load goroutines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:39:07 +03:00
Mikhail Chusavitin
5839f870b7 fix(iso): include full nvidia opencl runtime 2026-04-01 09:16:06 +03:00
Mikhail Chusavitin
b447717a5a fix(iso): harden boot network bring-up - v3.20 2026-04-01 09:10:55 +03:00
Mikhail Chusavitin
f6f4923ac9 fix(iso): recover memtest after live-build 2026-04-01 08:55:57 +03:00
Mikhail Chusavitin
3472afea32 fix(iso): make memtest non-blocking by default 2026-04-01 08:33:36 +03:00
45221d1e9a fix(stress): label loaders and improve john opencl diagnostics 2026-04-01 07:31:52 +03:00
3869788bac fix(iso): validate memtest with xorriso fallback 2026-04-01 07:24:05 +03:00
3dbc2184ef fix(iso): archive build logs and memtest diagnostics 2026-04-01 07:14:53 +03:00
60cb8f889a fix(iso): restore memtest menu entries and validate ISO 2026-04-01 07:04:48 +03:00
c9ee078622 fix(stress): keep platform burn responsive under load 2026-03-31 22:28:26 +03:00
ea660500c9 chore: commit pending repo changes 2026-03-31 22:17:36 +03:00
d43a9aeec7 fix(iso): restore live-build memtest integration 2026-03-31 22:10:28 +03:00
Mikhail Chusavitin
f5622e351e Fix staged John cleanup for repeated ISO builds 2026-03-31 11:40:52 +03:00
Mikhail Chusavitin
a20806afc8 Fix ISO grub package conflict 2026-03-31 11:38:30 +03:00
Mikhail Chusavitin
4f9b6b3bcd Harden NVIDIA boot logging on live ISO 2026-03-31 11:37:21 +03:00
Mikhail Chusavitin
c850b39b01 feat: v3.10 GPU stress and NCCL burn updates 2026-03-31 11:22:27 +03:00
Mikhail Chusavitin
6dee8f3509 Add NVIDIA stress loader selection and DCGM 4 support 2026-03-31 11:15:15 +03:00
Mikhail Chusavitin
20f834aa96 feat: v3.4 — boot reliability, log readability, USB export, screen resolution, GRUB UEFI fix, memtest, KVM console stability
Web UI / logs:
- Strip ANSI escape codes and handle \r (progress bars) in task log output
- Add USB export API + UI card on Export page (list removable devices, write audit JSON or support bundle)
- Add Display Resolution card in Tools (xrandr-based, per-output mode selector)
- Dashboard: audit status banner with auto-reload when audit task completes

Boot & install:
- bee-web starts immediately with no dependencies (was blocked by audit + network)
- bee-audit.service redesigned: waits for bee-web healthz, sleeps 60s, enqueues audit via /api/audit/run (task system)
- bee-install: fix GRUB UEFI — grub-install exit code was silently ignored (|| true); add --no-nvram fallback; always copy EFI/BOOT/BOOTX64.EFI fallback path
- Add grub-efi-amd64, grub-pc, grub-efi-amd64-signed, shim-signed to package list (grub-install requires these, not just -bin variants)
- memtest hook: fix binary/boot/ not created before cp; handle both Debian (no extension) and upstream (x64.efi) naming
- bee-openbox-session: increase healthz wait from 30s to 120s

KVM console stability:
- runCmdJob: syscall.Setpriority(PRIO_PROCESS, pid, 10) on all stress subprocesses
- lightdm.service.d: Nice=-5 so X server preempts stress processes

Packages: add btop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 10:16:15 +03:00
105d92df8b fix(iso): use underscore in volume label to comply with ISO 9660
ISO 9660 volume labels allow only A-Z, 0-9, and underscore.
Dashes cause xorriso WARNING on every build.
EASY-BEE-NVIDIA → EASY_BEE_NVIDIA (iso-application keeps dashes, it's UDF).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:38:02 +03:00
f96b149875 fix(memtest): extract EFI binary from .deb cache if chroot/boot/ is empty
memtest86+ postinst does not place files in /boot in a live-build chroot
without grub triggers. Added fallback: extract directly from the cached
.deb via dpkg-deb -x, with verbose logging throughout.

Also remove "NVIDIA no MSI-X" from boot menu (premature — root cause unknown).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:30:52 +03:00
5ee120158e fix(build): remove unused variant package lists before lb build
live-build picks up ALL .list.chroot files in config/package-lists/.
After rsync, bee-nvidia.list.chroot, bee-amd.list.chroot, and
bee-nogpu.list.chroot all end up in BUILD_WORK_DIR — causing lb to
try installing packages from every variant (and leaving version
placeholders unsubstituted in the unused lists).

Fix: after copying bee-${BEE_GPU_VENDOR}.list.chroot → bee-gpu.list.chroot,
delete all other bee-{nvidia,amd,nogpu}.list.chroot from BUILD_WORK_DIR.

Also includes nomsi boot mode changes (bee-nvidia-load + grub.cfg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:03:42 +03:00
09fe0e2e9e feat(iso): add nogpu variant (no NVIDIA, no AMD/ROCm)
- build.sh: accept --variant nogpu; skips all GPU build steps, removes
  both nvidia-cuda and rocm archives, strips bee-nvidia-load and
  bee-nvidia.service from overlay
- build-in-container.sh: add nogpu to --variant flag; all variant
  includes nogpu; --clean-build wipes live-build-work-nogpu
- 9000-bee-setup hook: nogpu path enables no GPU services
- bee-nogpu.list.chroot: empty GPU package list

Output: easy-bee-nogpu-vX.iso

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 22:49:25 +03:00
ace1a9dba6 feat(iso): split into nvidia and amd variants, fix KVM graphics and PATH
- build.sh: add --variant nvidia|amd; separate work dirs per variant
  (live-build-work-nvidia / live-build-work-amd); GPU-specific steps
  (modules, NCCL, cuBLAS, nccl-tests) run only for nvidia; deb package
  cache synced back to shared location after each lb build so second
  variant reuses downloaded packages; ISO output named
  easy-bee-{variant}-v{ver}-amd64.iso
- build-in-container.sh: add --variant nvidia|amd|all (default: all);
  runs build.sh twice in one container for 'all'; --clean-build wipes
  both variant work dirs
- package-lists: remove GPU packages from bee.list.chroot; add
  bee-nvidia.list.chroot (DCGM) and bee-amd.list.chroot (ROCm)
- 9000-bee-setup hook: read /etc/bee-gpu-vendor; enable bee-nvidia.service
  and DCGM only for nvidia; set up ROCm symlinks only for amd
- auto/config: --iso-volume uses BEE_GPU_VENDOR_UPPER env var
- grub.cfg: add nomodeset to EASY-BEE and EASY-BEE (load to RAM) entries
  — fixes X/lightdm on BMC KVM (ASPEED AST chip requires nomodeset for
  fbdev to work; NVIDIA H100 compute does not need KMS)
- bee.sh / smoketest.sh: add /usr/sbin to PATH so dmidecode, smartctl,
  nvme are found
- 9100-memtest hook: add diagnostic listing of chroot/boot/memtest* files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 22:24:37 +03:00
905c581ece fix(iso): substitute all ROCm package version placeholders in build.sh
ROCM_BANDWIDTH_TEST_VERSION, ROCM_VALIDATION_SUITE_VERSION, ROCBLAS,
ROCRAND, HIP_RUNTIME_AMD, HIPBLASLT, COMGR were defined in VERSIONS and
in bee.list.chroot but the sed substitution block only covered 3 of them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 22:00:05 +03:00
889fe1dc2f fix: IPMI access for bee user + remove chart legend
- Add udev rule: /dev/ipmi0 readable by 'ipmi' group (no sudo needed)
- Add 'ipmi' group creation and bee user membership in chroot hook
- Remove legend from all charts (data shown in GPU table below)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 11:03:35 +03:00
befdbf3768 fix(iso): autoload ipmi_si/ipmi_devintf for fan/sensor monitoring
Without these modules /dev/ipmi0 doesn't exist and ipmitool can't
read fan RPM, PSU fans, or IPMI temperature sensors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:59:15 +03:00
a03312c286 feat: AMD GPU compute stress via rocm-validation-suite GST (GEMM)
- Add rocm-validation-suite, rocblas, rocrand, hip-runtime-amd,
  hipblaslt, comgr to ISO (~700MB, needed for HIP compute)
- RunAMDStressPack: run RVS GST (SGEMM ~31 TFLOPS/GPU) + bandwidth test
- Add rvs symlink in chroot setup hook
- Pin all new package versions in VERSIONS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:56:32 +03:00
e69e9109da fix(iso): set bash as default shell for bee user
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:49:18 +03:00
413869809d feat(iso): add rocm-bandwidth-test for AMD GPU burn-in
- Add rocm-bandwidth-test package to ISO
- Add bee user to 'render' group (/dev/kfd, /dev/dri/renderD* access)
- Add rocm-bandwidth-test symlink alongside rocm-smi

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:48:29 +03:00
ada15ac777 fix: loading screen via Go handler instead of file:// HTML
- bee-web.service: remove After=bee-audit so Go starts immediately
- Go serves loading page from / when audit JSON not yet present;
  JS polls /api/ready (503 until file exists, 200 when ready)
  then redirects to dashboard
- bee-openbox-session: wait for /healthz (Go binds fast <2s),
  open http://localhost/ directly — no file:// cross-origin issues
- Remove loading.html static file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 10:31:46 +03:00
dfb94f9ca6 feat(iso): loading screen while bee-web starts
Replace 15s blocking wait with instant Chromium launch showing a
dark loading page that polls /healthz every 500ms and auto-redirects
to the app when ready.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 09:33:04 +03:00
5857805518 fix(iso): copy memtest86+ to ISO root via binary hook
memtest files live in chroot /boot (inside squashfs) but GRUB needs
them on the ISO filesystem. Binary hook copies them out at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 09:02:40 +03:00
59a1d4b209 release: v3.1 2026-03-28 22:51:36 +03:00
0dbfaf6121 feat: dynamic CPU governor (performance during tasks, powersave at idle)
Switch to performance governor when task queue starts processing,
back to powersave when queue drains. Removes bee-cpuperf.service.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:47:11 +03:00
5d72d48714 feat(iso): set CPU governor to performance on boot
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:45:37 +03:00
096b4a09ca feat(iso): add bare-metal performance kernel params
mitigations=off, transparent_hugepage=always, numa_balancing=disable,
nowatchdog, nosoftlockup — safe on single-user bare-metal LiveCD,
improves SAT/burn test throughput. fail-safe entry unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:44:21 +03:00
5d42a92e4c feat(iso): use legacy network names (eth0/eth1) via net.ifnames=0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:43:00 +03:00
f91bce8661 fix(iso): fix memtest86+ path (bookworm uses memtest86+x64.bin/.efi)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:38:15 +03:00
0a98ed8ae9 feat: task queue, UI overhaul, burn tests, install-to-RAM
- Task queue: all SAT/audit jobs enqueue and run one-at-a-time;
  tasks persist past page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
  Dashboard hardware summary badges + split metrics charts (load/temp/power);
  Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
  out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies squashfs to /dev/shm, re-associates loop devices
  via LOOP_CHANGE_FD ioctl so live media can be ejected
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
  disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 21:15:11 +03:00
911745e4da refactor(iso): replace chroot hooks for DCGM/ROCm with live-build apt sources
Move datacenter-gpu-manager and rocm-smi-lib from dynamic chroot hooks
into live-build's config/archives mechanism so lb caches the .deb files
in cache/packages.chroot/ between builds, eliminating repeated 900+ MB
downloads. Versions pinned via VERSIONS and substituted into package
lists at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 13:01:10 +03:00
acfd2010d7 fix(iso): remove firmware-chelsio-t4 (not in Debian bookworm)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 12:43:29 +03:00