All final artefacts for a given version now land in one place:
  dist/easy-bee-v4.1/
    easy-bee-nvidia-v4.1-amd64.iso
    easy-bee-nvidia-v4.1-amd64.logs.tar.gz   ← log archive
    (logs dir deleted after archiving)
- Introduce OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}"
- Move LOG_DIR, LOG_ARCHIVE, and ISO_OUT into OUT_DIR
- cleanup_build_log: pass dirname(LOG_DIR) to tar -C so paths inside the
  archive stay correct wherever OUT_DIR lives; delete LOG_DIR after archiving
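
A minimal sketch of the archive-then-delete step, assuming a helper named
archive_logs (hypothetical; the real function is cleanup_build_log and may
differ). Running tar from the parent of LOG_DIR keeps the member paths
relative no matter where OUT_DIR is located:

```sh
# Hypothetical helper: archive_logs LOG_DIR LOG_ARCHIVE
# Runs in a subshell so its variables do not leak into the caller.
archive_logs() (
    log_dir=$1
    log_archive=$2
    # Nothing to archive is not treated as an error.
    [ -d "$log_dir" ] || exit 0
    # -C <parent> plus the bare directory name yields members like "logs/build.log".
    tar -czf "$log_archive" -C "$(dirname "$log_dir")" "$(basename "$log_dir")" || exit 1
    # Delete the directory only after tar has succeeded.
    rm -rf "$log_dir"
)
```

Because the directory is removed only after tar exits successfully, a failed
archive leaves the raw logs in place.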
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When 8 john processes start simultaneously, they race for GPU memory during
OpenCL GWS auto-tuning. Slower devices settle on a smaller work size (~594 MiB
vs 762 MiB) and run at 40% load instead of 100%. Add a 3s sleep between launches
so each instance finishes memory allocation before the next one starts.
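
The staggered launch can be sketched as follows. launch_staggered and the
worker command are hypothetical names; the actual john invocation in
bee-john-gpu-stress is not shown here:

```sh
# Hypothetical helper: launch_staggered DELAY CMD DEVICE...
# Starts CMD once per device in the background, pausing DELAY seconds
# between launches, then waits for all of them to finish.
launch_staggered() {
    delay=$1
    cmd=$2
    shift 2
    first=1
    for dev in "$@"; do
        # No pause before the first launch.
        if [ "$first" -eq 0 ]; then
            sleep "$delay"
        fi
        first=0
        "$cmd" "$dev" &
    done
    wait
}
```

A call such as `launch_staggered 3 run_john_on_device 1 2 3 4` would start
four workers three seconds apart, where run_john_on_device is whatever
wrapper starts john for one OpenCL device.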
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC
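
As an illustration of the storage filter only (the collector itself is not
shown in this log and is presumably not a shell script), a sketch that drops
devices whose model starts with "Virtual HDisk". The "DEVICE MODEL" input
format is an assumption:

```sh
# Hypothetical filter: reads "DEVICE MODEL..." lines on stdin and prints
# only those whose model does not start with "Virtual HDisk".
skip_virtual_disks() {
    awk '{ model = $0; sub(/^[^ \t]+[ \t]+/, "", model) }
         model !~ /^Virtual HDisk/'
}
```

It could be fed from `lsblk -dno NAME,MODEL`, for example.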
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution
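
For the nvidia collector change, a sketch of reading the link generation from
nvidia-smi CSV output. parse_pcie_gen is a hypothetical name and the output
format is illustrative, not the collector's real schema:

```sh
# Hypothetical parser: reads "index, current, max" CSV lines, as produced by
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max \
#              --format=csv,noheader
# and prints one key=value line per GPU.
parse_pcie_gen() {
    awk -F', *' '{ printf "gpu=%s gen_current=%s gen_max=%s\n", $1, $2, $3 }'
}
```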
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
field to selectively run CPU and/or GPU load goroutines
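
The auto size rule for bee-gpu-burn (95% of memory.total) amounts to the
following integer arithmetic. auto_size_mb is a hypothetical name, and the
input is assumed to be the per-GPU value in MiB:

```sh
# Hypothetical helper: auto_size_mb TOTAL_MIB
# Prints 95% of the given total, rounded down to a whole MiB.
# TOTAL_MIB could come from:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
auto_size_mb() {
    echo $(( $1 * 95 / 100 ))
}
```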
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ISO 9660 volume labels allow only A-Z, 0-9, and underscore.
Dashes cause xorriso WARNING on every build.
EASY-BEE-NVIDIA → EASY_BEE_NVIDIA (iso-application keeps its dashes: the
application ID field accepts a wider character set than the volume label).
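
A sketch of deriving a compliant label from an arbitrary name.
to_volume_label is a hypothetical helper; the build itself may simply
hardcode the underscore form:

```sh
# Hypothetical helper: to_volume_label NAME
# Uppercases NAME and replaces every character outside A-Z, 0-9 and "_"
# with "_", so xorriso sees only characters valid in a volume label.
to_volume_label() {
    printf '%s' "$1" | tr 'a-z' 'A-Z' | tr -c 'A-Z0-9_' '_'
}
```

For example, `to_volume_label EASY-BEE-NVIDIA` prints EASY_BEE_NVIDIA.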
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
memtest86+ postinst does not place files in /boot in a live-build chroot
without grub triggers. Added fallback: extract directly from the cached
.deb via dpkg-deb -x, with verbose logging throughout.
Also remove "NVIDIA no MSI-X" from boot menu (premature — root cause unknown).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
live-build picks up ALL .list.chroot files in config/package-lists/.
After rsync, bee-nvidia.list.chroot, bee-amd.list.chroot, and
bee-nogpu.list.chroot all end up in BUILD_WORK_DIR — causing lb to
try installing packages from every variant (and leaving version
placeholders unsubstituted in the unused lists).
Fix: after copying bee-${BEE_GPU_VENDOR}.list.chroot → bee-gpu.list.chroot,
delete all other bee-{nvidia,amd,nogpu}.list.chroot from BUILD_WORK_DIR.
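
The copy-then-prune step can be sketched like this. select_gpu_list is a
hypothetical name; the directory and vendor arguments stand in for the
package-lists directory under BUILD_WORK_DIR and for BEE_GPU_VENDOR:

```sh
# Hypothetical helper: select_gpu_list LIST_DIR VENDOR
# Copies the chosen variant list to bee-gpu.list.chroot, then removes all
# per-variant lists so live-build sees exactly one GPU package list.
select_gpu_list() (
    dir=$1
    vendor=$2
    cp "$dir/bee-$vendor.list.chroot" "$dir/bee-gpu.list.chroot" || exit 1
    for v in nvidia amd nogpu; do
        rm -f "$dir/bee-$v.list.chroot"
    done
)
```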
Also includes nomsi boot mode changes (bee-nvidia-load + grub.cfg).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- build.sh: add --variant nvidia|amd; separate work dirs per variant
(live-build-work-nvidia / live-build-work-amd); GPU-specific steps
(modules, NCCL, cuBLAS, nccl-tests) run only for nvidia; deb package
cache synced back to shared location after each lb build so second
variant reuses downloaded packages; ISO output named
easy-bee-{variant}-v{ver}-amd64.iso
- build-in-container.sh: add --variant nvidia|amd|all (default: all);
runs build.sh twice in one container for 'all'; --clean-build wipes
both variant work dirs
- package-lists: remove GPU packages from bee.list.chroot; add
bee-nvidia.list.chroot (DCGM) and bee-amd.list.chroot (ROCm)
- 9000-bee-setup hook: read /etc/bee-gpu-vendor; enable bee-nvidia.service
and DCGM only for nvidia; set up ROCm symlinks only for amd
- auto/config: --iso-volume uses BEE_GPU_VENDOR_UPPER env var
- grub.cfg: add nomodeset to EASY-BEE and EASY-BEE (load to RAM) entries
— fixes X/lightdm on BMC KVM (ASPEED AST chip requires nomodeset for
fbdev to work; NVIDIA H100 compute does not need KMS)
- bee.sh / smoketest.sh: add /usr/sbin to PATH so dmidecode, smartctl,
nvme are found
- 9100-memtest hook: add diagnostic listing of chroot/boot/memtest* files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ROCM_BANDWIDTH_TEST_VERSION, ROCM_VALIDATION_SUITE_VERSION, ROCBLAS,
ROCRAND, HIP_RUNTIME_AMD, HIPBLASLT, COMGR were defined in VERSIONS and
in bee.list.chroot but the sed substitution block only covered 3 of them.
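
One way to avoid a hand-maintained sed block falling behind VERSIONS is to
generate the substitutions from the file itself. This is a sketch, not the
build's actual code: the NAME=value file format, the %%NAME%% placeholder
syntax and the function name are all assumptions:

```sh
# Hypothetical helper: substitute_versions VERSIONS_FILE LIST_FILE
# Builds one sed substitution per NAME=value line and applies them all to
# LIST_FILE, writing the result to stdout. Values must not contain "|".
substitute_versions() (
    versions=$1
    list=$2
    script=$(mktemp) || exit 1
    # Turn "NAME=1.2.3" into the sed command "s|%%NAME%%|1.2.3|g".
    sed -n 's/^\([A-Z_][A-Z0-9_]*\)=\(.*\)$/s|%%\1%%|\2|g/p' "$versions" > "$script"
    sed -f "$script" "$list"
    status=$?
    rm -f "$script"
    exit "$status"
)
```

A follow-up `grep '%%[A-Z0-9_]*%%'` on the output would catch any
placeholder that still has no matching entry in VERSIONS.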
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add udev rule: /dev/ipmi0 readable by 'ipmi' group (no sudo needed)
- Add 'ipmi' group creation and bee user membership in chroot hook
- Remove legend from all charts (data shown in GPU table below)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without these modules /dev/ipmi0 doesn't exist and ipmitool can't
read fan RPM, PSU fans, or IPMI temperature sensors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add rocm-validation-suite, rocblas, rocrand, hip-runtime-amd,
hipblaslt, comgr to ISO (~700MB, needed for HIP compute)
- RunAMDStressPack: run RVS GST (SGEMM ~31 TFLOPS/GPU) + bandwidth test
- Add rvs symlink in chroot setup hook
- Pin all new package versions in VERSIONS
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add rocm-bandwidth-test package to ISO
- Add bee user to 'render' group (/dev/kfd, /dev/dri/renderD* access)
- Add rocm-bandwidth-test symlink alongside rocm-smi
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-web.service: remove After=bee-audit so Go starts immediately
- Go serves loading page from / when audit JSON not yet present;
JS polls /api/ready (503 until file exists, 200 when ready)
then redirects to dashboard
- bee-openbox-session: wait for /healthz (Go binds in under 2s), then
  open http://localhost/ directly, avoiding file:// cross-origin issues
- Remove loading.html static file
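
The wait on /healthz in bee-openbox-session could look roughly like the
generic retry loop below. wait_until is a hypothetical helper and the retry
counts are illustrative:

```sh
# Hypothetical helper: wait_until TRIES DELAY CMD [ARGS...]
# Runs CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (not executed here): curl -f exits non-zero on HTTP errors.
#   wait_until 30 1 curl -fsS -o /dev/null http://localhost/healthz
```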
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace 15s blocking wait with instant Chromium launch showing a
dark loading page that polls /healthz every 500ms and auto-redirects
to the app when ready.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
memtest files live in chroot /boot (inside squashfs) but GRUB needs
them on the ISO filesystem. Binary hook copies them out at build time.
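
A sketch of what such a copy step might do. copy_memtest is a hypothetical
name, and both directories are passed in as arguments because the hook's
actual paths are not given here:

```sh
# Hypothetical helper: copy_memtest CHROOT_BOOT ISO_BOOT
# Copies every memtest* file from the chroot's /boot to a directory on the
# ISO filesystem. Fails if no such file exists, so a missing memtest is
# noticed at build time instead of at the GRUB menu.
copy_memtest() (
    chroot_boot=$1
    iso_boot=$2
    mkdir -p "$iso_boot" || exit 1
    found=0
    for f in "$chroot_boot"/memtest*; do
        [ -f "$f" ] || continue
        cp "$f" "$iso_boot/" || exit 1
        found=1
    done
    [ "$found" -eq 1 ]
)
```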
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch to performance governor when task queue starts processing,
back to powersave when queue drains. Removes bee-cpuperf.service.
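
For reference, switching the governor comes down to writing its name into
each CPU's cpufreq node. This shell sketch only illustrates that mechanism;
the task queue presumably does the equivalent in its own code. set_governor
and the optional root argument are hypothetical:

```sh
# Hypothetical helper: set_governor NAME [SYSFS_CPU_ROOT]
# Writes NAME (e.g. "performance" or "powersave") to every writable
# scaling_governor file. CPUs without cpufreq support are skipped.
set_governor() (
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        if [ -w "$f" ]; then
            printf '%s\n' "$gov" > "$f"
        fi
    done
    exit 0
)
```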
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Task queue: all SAT/audit jobs enqueue and run one-at-a-time;
tasks persist past page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
Dashboard hardware summary badges + split metrics charts (load/temp/power);
Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies squashfs to /dev/shm, re-associates loop devices
via LOOP_CHANGE_FD ioctl so live media can be ejected
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move datacenter-gpu-manager and rocm-smi-lib from dynamic chroot hooks
into live-build's config/archives mechanism so lb caches the .deb files
in cache/packages.chroot/ between builds, eliminating repeated 900+ MB
downloads. Versions pinned via VERSIONS and substituted into package
lists at build time.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>