Commit Graph

140 Commits

Author SHA1 Message Date
9fe9f061f8 fix(nccl-tests): set LIBRARY_PATH so ld finds libnccl.so in nccl cache 2026-03-26 23:59:06 +03:00
837a1fb981 fix(nccl-tests): pin /usr/local/cuda→12.8 symlink, auto-detect gencode by nvcc version 2026-03-26 23:54:07 +03:00
1f43b4e050 fix(nccl-tests): pass NCCL_LIB from nccl cache to fix -lnccl link error 2026-03-26 23:52:25 +03:00
83bbc8a1bc fix(nccl-tests): upgrade to cuda-nvcc-12-8, add sm_100 (Blackwell B100/B200) 2026-03-26 23:51:26 +03:00
896bdb6ee8 fix(nccl-tests): use cuda-nvcc-12-6 to support Ampere/Volta (sm_70..sm_90) 2026-03-26 23:50:36 +03:00
5407c26e25 fix(nccl-tests): CUDA 13.0 supports only sm_90+ (Hopper/H100) 2026-03-26 23:49:45 +03:00
4fddaba9c5 fix(nccl-tests): limit CUDA gencode to sm_70+ (CUDA 13 dropped Pascal) 2026-03-26 23:48:40 +03:00
d2f384b6eb fix(nccl-tests): use plain make instead of non-existent all_reduce_perf target 2026-03-26 23:47:49 +03:00
25f0f30aaf fix(boot): fix black screen on monitor, stop log spam on console
- Add console=tty0 so VGA display gets kernel output (was serial-only)
- Change loglevel=7→3 (debug→errors only)
- Add quiet to suppress verbose kernel boot messages
- journald: ForwardToConsole=no so service logs don't flood tty1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:45:09 +03:00
a57b037a91 feat(installer): add 'Install to disk' in Tools submenu
Copies the live system to a local disk via unsquashfs — no debootstrap,
no network required. Supports UEFI (GPT+EFI) and BIOS (MBR) layouts.

ISO:
- Add squashfs-tools, parted, grub-pc, grub-efi-amd64 to package list
- New overlay script bee-install: partitions, formats, unsquashfs,
  writes fstab, runs grub-install+update-grub in chroot

Go TUI:
- Settings → Tools submenu (Install to disk, Check tools)
- Disk picker screen: lists non-USB, non-boot disks via lsblk
- Confirm screen warns about data loss
- Runs with live progress tail of /tmp/bee-install.log
- platform/install.go: ListInstallDisks, InstallToDisk, findLiveBootDevice

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:35:01 +03:00
5644231f9a feat(nccl): add nccl-tests all_reduce_perf for GPU bandwidth testing
- Dockerfile: install cuda-nvcc-13-0 from NVIDIA repo for compilation
- build-nccl-tests.sh: downloads libnccl-dev for nccl.h, builds all_reduce_perf
- build.sh: runs nccl-tests build, injects binary into /usr/local/bin/
- platform: RunNCCLTests() auto-detects GPU count, runs all_reduce_perf
- TUI: NCCL bandwidth test entry in Burn-in Tests screen [N] hotkey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:22:19 +03:00
eea98e6d76 feat(dcgm): add NVIDIA DCGM diagnostics, fix KVM console
- Add 9002-nvidia-dcgm.hook.chroot: installs datacenter-gpu-manager
  from NVIDIA apt repo during live-build
- Enable nvidia-dcgm.service in chroot setup hook
- Replace bee-gpu-stress with dcgmi diag (levels 1-4) in NVIDIA SAT
- TUI: replace GPU checkbox + duration UI with DCGM level selection
- Remove console=tty2 from boot params: KVM/VGA now shows tty1
  where bee-tui runs, fixing unresponsive console

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:08:12 +03:00
967455194c feat(iso): make toram optional, add 'load to RAM' boot menu entry
Default boot no longer loads ISO to RAM (slow on BMC virtual media).
Separate menu entry added for toram in both GRUB and isolinux.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 21:45:04 +03:00
79dabf3efb fix(build): link bee-gpu-stress with -lm for sqrt()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:55:14 +03:00
1336f5b95c fix(cublas): copy include dirs containing files without .h extension
nv/target has no .h suffix; use -type f instead of -name '*.h' to
detect non-empty include directories.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:53:08 +03:00
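The check described in 1336f5b95c (treat an include directory as non-empty when it holds any regular file, not only `*.h` files) might be sketched in Python as follows. This is an illustration only: the real logic lives in a shell script using `find`, the function names here are hypothetical, and no depth limit is reproduced.

```python
import tempfile
from pathlib import Path


def has_headers_by_suffix(directory: Path) -> bool:
    """Old behaviour: only files ending in .h count as headers."""
    return any(p.is_file() and p.suffix == ".h" for p in directory.rglob("*"))


def has_any_file(directory: Path) -> bool:
    """New behaviour: any regular file makes the directory worth copying."""
    return any(p.is_file() for p in directory.rglob("*"))


with tempfile.TemporaryDirectory() as tmp:
    include_dir = Path(tmp) / "nv"
    include_dir.mkdir()
    # nv/target is a real CUDA header name that carries no .h suffix.
    (include_dir / "target").write_text("// header without a .h suffix\n")

    suffix_only = has_headers_by_suffix(include_dir)
    any_file = has_any_file(include_dir)

print(suffix_only, any_file)
```

The suffix-based check misses the directory entirely, while the file-based check picks it up.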
31486a31c1 fix(cublas): add cuda-cccl package for nv/target header
cuda_fp16.h (included by cublas_api.h) requires <nv/target> from
the CUDA C++ Core Libraries (cuda-cccl-13-0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:49:46 +03:00
aa3fc332ba fix(cublas): check for .h in subdirs when copying non-standard include dirs
ls *.h missed headers in subdirectories like crt/host_defines.h;
use find -maxdepth 2 instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:47:39 +03:00
62c57b87f2 fix(cublas): allow version-free lookup for cuda-crt package
cuda-crt-13-0 may not share the same version string as cuda-cudart-13-0;
pass empty version to lookup_pkg to match the first available version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:46:45 +03:00
f600261546 fix(cublas): add cuda-crt package for crt/host_defines.h
cublasLt.h -> cublas_api.h -> driver_types.h -> crt/host_defines.h
which lives in the cuda-crt-13-0 package, not cudart-dev.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:42:40 +03:00
d7ca04bdfb fix(cublas): search all include/ dirs in deb for CUDA headers
NVIDIA CUDA .deb packages install headers under
/usr/local/cuda-X.Y/targets/x86_64-linux/include/ not /usr/include/,
causing copy_headers() to silently skip them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:35:21 +03:00
5433652c70 fix(cublas): prevent double-print in lookup_pkg awk END block
awk exit in the blank-line block jumps to END, which printed the
result again causing repo_sha to contain the hash twice with a newline,
breaking the sha256 string comparison.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:29:10 +03:00
b25f014dbd fix(cublas): strip CR from Packages.gz fields to fix sha256 comparison
Debian Packages.gz uses CRLF line endings; \r in the captured SHA256
field caused string comparison to fail even when hashes were identical.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:24:58 +03:00
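The failure mode in b25f014dbd (a trailing `\r` in the captured SHA256 field making identical hashes compare unequal) can be reproduced with a small Python sketch. The stanza contents, the package name `example-pkg`, and the `parse_stanza` helper are assumptions made for illustration; the actual fix is in the shell/awk lookup code.

```python
import hashlib


def parse_stanza(stanza: str, strip_cr: bool = True) -> dict:
    """Parse 'Key: value' lines; optionally drop a trailing carriage return."""
    fields = {}
    for line in stanza.split("\n"):
        if strip_cr:
            line = line.rstrip("\r")  # the fix: remove the stray CR first
        if ": " in line:
            key, value = line.split(": ", 1)
            fields[key] = value
    return fields


payload = b"example package bytes"  # stand-in for a downloaded .deb
actual_sha = hashlib.sha256(payload).hexdigest()

# An index stanza whose lines end in CRLF, as described in the commit message.
stanza = f"Package: example-pkg\r\nSHA256: {actual_sha}\r\n"

broken = parse_stanza(stanza, strip_cr=False)["SHA256"]
fixed = parse_stanza(stanza, strip_cr=True)["SHA256"]

print(broken == actual_sha, fixed == actual_sha)
```

Without the strip, the field value ends in `\r` and the comparison fails even though the hex digits match.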
d69a46f211 fix(cublas): redirect diagnostic echo to stderr in download_verified_pkg
Echo messages captured in stdout polluted the return value of
download_verified_pkg(), causing extract_deb() to receive a
multi-line string instead of a file path and silently exit via set -e.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:22:39 +03:00
Mikhail Chusavitin
fc5c2019aa iso: improve burn-in, export, and live boot 2026-03-26 18:56:19 +03:00
Mikhail Chusavitin
67a215c66f fix(iso): route kernel logs to tty2, keep tty1 clean for TUI
console=tty0 sent kernel messages to the active VT (tty1), overwriting
the TUI. Changed to console=tty2 so kernel logs land on a dedicated
console. tty1 is now clean; operator can press Alt+F2 to inspect kernel
messages and Alt+F3 for an extra shell.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 17:40:44 +03:00
Mikhail Chusavitin
0a52a4f3ba fix(iso): restore loglevel=7 on VGA console for crash visibility
loglevel=3 was hiding all kernel messages on tty0/ttyS0 except errors.
Machine crashes (panics, driver oops, module failures) were silent on VGA.

Restored loglevel=7 so kernel messages up to debug are printed to both
tty0 (VGA) and ttyS0 (SOL). Journald MaxLevelConsole reduced to info
(was debug) to reduce noise on SOL while keeping it useful.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 11:19:07 +03:00
Mikhail Chusavitin
b132f7973a fix(iso): derive ISO filename from iso/v* tags, not audit/v*
Previously the ISO file was named after git describe --match 'audit/v*',
so a new iso/ tag produced names like v1.0.9-1-gXXXXXXX instead of v1.0.17.
Now build.sh has resolve_iso_version() that looks at iso/v* tags separately.
The bee binary inside the ISO still uses AUDIT_VERSION_EFFECTIVE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 11:05:51 +03:00
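A rough Python sketch of the idea behind b132f7973a: the ISO version is taken from `iso/v*` tags only, so `audit/v*` tags no longer influence the filename. The real `resolve_iso_version()` is a shell function in build.sh and may rely on `git describe`; the version below is a hypothetical simplification that picks the highest matching tag from a list.

```python
import re
from typing import List, Optional


def resolve_iso_version(tags: List[str], prefix: str = "iso/v") -> Optional[str]:
    """Return the highest 'vX.Y.Z' among tags with the given prefix, or None."""
    versions = []
    for tag in tags:
        if not tag.startswith(prefix):
            continue  # ignore audit/v* and any other tag namespace
        match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", tag[len(prefix):])
        if match:
            versions.append(tuple(int(part) for part in match.groups()))
    if not versions:
        return None
    # Tuples compare numerically, so (1, 0, 17) > (1, 0, 9).
    return "v" + ".".join(str(part) for part in max(versions))


tags = ["audit/v1.0.9", "iso/v1.0.9", "iso/v1.0.17"]
print(resolve_iso_version(tags))
```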
Mikhail Chusavitin
bd94b6c792 fix(iso): add libnvidia-ptxjitcompiler + ldconfig for PTX JIT and NCCL
- build-nvidia-module.sh: copy libnvidia-ptxjitcompiler.so.* alongside
  libcuda/libnvidia-ml — required by cuModuleLoadDataEx for PTX JIT.
  Without it: CUDA_ERROR_JIT_COMPILER_NOT_FOUND at runtime.
  Cache check updated to force rebuild when ptxjitcompiler is missing.
- bee-nvidia-load: run ldconfig after module load so that NVIDIA/NCCL
  libs injected into /usr/lib/ are visible to dlopen() callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:37:27 +03:00
Mikhail Chusavitin
0ac7b6a963 fix(iso): restore console=tty0 — VGA screen was black without it
Commit d36e844 dropped console=tty0 and added dual-serial + debug logging.
Without console=tty0 the kernel never initialises the VGA console,
leaving the physical screen permanently blank.

- Restore console=tty0 (VGA) as primary, keep console=ttyS0 for SOL
- Drop console=ttyS1 (redundant second serial port)
- Replace loglevel=7 + journald debug flood with loglevel=3 (errors only)
  so kernel messages don't overwrite the TUI on the local screen
- Remove systemd.log_target/forward_to_console debug params

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:23:53 +03:00
Mikhail Chusavitin
3d2ae4cdcb fix(iso): use Ubuntu jammy codename for AMD ROCm repo — Debian not supported
AMD does not publish Debian Bookworm packages at all (only focal/jammy/noble).
Switch ROCM_UBUNTU_DIST to "jammy"; jammy packages install cleanly on
Debian 12 due to compatible glibc. Also expand candidate list to include
point-releases (6.3.4, 6.3.3, …) so we pick the latest actually-published one.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 10:08:58 +03:00
Mikhail Chusavitin
58510207fa fix(iso): fall back through ROCm 6.4→6.3→6.2 if repo Release file missing
ROCm 6.4 does not yet publish a Release file for Debian Bookworm, causing
the live-build chroot hook to fail with "does not have a Release file".

Try each version in ROCM_CANDIDATES order; skip to the next if apt-get update
fails (repo unavailable). Exit gracefully if none are available.
Also rename inner 'candidate' variable to 'smi_path' to avoid collision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 09:52:17 +03:00
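The fallback described in 58510207fa (try each ROCm version in order, skip to the next when the repo is unavailable, give up gracefully if none work) has roughly this shape. This is a Python sketch under assumptions: `pick_first_available` and the `published` set are hypothetical, and the real hook probes availability by running `apt-get update`.

```python
from typing import Callable, Iterable, Optional


def pick_first_available(
    candidates: Iterable[str],
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate whose repository probe succeeds."""
    for version in candidates:
        if is_available(version):
            return version
        # Probe failed (e.g. missing Release file): fall through to the next.
    return None  # caller should exit gracefully rather than fail the build


# Stand-in for the repo probe: pretend only 6.3 and 6.2 are published.
published = {"6.3", "6.2"}
chosen = pick_first_available(["6.4", "6.3", "6.2"], lambda v: v in published)
nothing = pick_first_available(["6.4"], lambda v: v in published)

print(chosen, nothing)
```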
Mikhail Chusavitin
9a1df9b1ba Tighten support bundles and fix AMD runtime checks 2026-03-25 19:35:25 +03:00
Mikhail Chusavitin
30cf014d58 Rename NVIDIA bootloader modes 2026-03-25 19:12:26 +03:00
Mikhail Chusavitin
27d478aed6 Add bootloader choice for safe vs full NVIDIA boot 2026-03-25 19:11:15 +03:00
Mikhail Chusavitin
d36e8442a9 Stabilize live ISO consoles and NVIDIA boot path 2026-03-25 19:05:18 +03:00
Mikhail Chusavitin
b345b0d14d Derive ISO version from git tags 2026-03-25 18:40:48 +03:00
Mikhail Chusavitin
0a1ac2ab9f Bootstrap ROCm hook prerequisites in ISO build 2026-03-25 18:38:19 +03:00
Mikhail Chusavitin
adcc147b32 feat(iso): add AMD Instinct MI250X/MI250 driver support
- firmware-amd-graphics: Aldebaran firmware blobs (fixes amdgpu IB ring
  test errors on MI250/MI250X at boot)
- 9001-amd-rocm.hook.chroot: adds AMD ROCm 6.4 apt repo and installs
  rocm-smi-lib for GPU monitoring (analogous to nvidia-smi)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 15:42:10 +03:00
Mikhail Chusavitin
03c36f6cb2 fix(iso): add stress-ng to package list for CPU SAT
stress-ng was missing from the LiveCD — CPU acceptance test exited
immediately with rc=1 because the binary was not found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:50:30 +03:00
Mikhail Chusavitin
a221814797 fix(tui): fix GPU panel row showing AMD chipset devices, clear screen before TUI
isGPUDevice matched all AMD vendor PCIe devices (SATA, crypto coprocessors,
PCIe dummies) because of a broad strings.Contains(vendor,"amd") check.
Remove it — AMD Instinct/Radeon GPUs are caught by ProcessingAccelerator /
DisplayController class. Also exclude ASPEED (BMC VGA adapter).

Add clear before bee-tui to avoid dirty terminal output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:49:09 +03:00
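The classification change in a221814797 can be illustrated in Python. The actual `isGPUDevice` is Go code in the TUI; this sketch is not that implementation. It assumes a device is described by a vendor string and a PCI class name, and that only the two classes named in the commit message count as GPUs.

```python
GPU_CLASSES = {"ProcessingAccelerator", "DisplayController"}
EXCLUDED_VENDORS = ("aspeed",)  # BMC VGA adapters are not compute GPUs


def is_gpu_device(vendor: str, device_class: str) -> bool:
    """Classify by PCI class, not by a vendor substring such as 'amd'."""
    vendor_lower = vendor.lower()
    if any(name in vendor_lower for name in EXCLUDED_VENDORS):
        return False
    return device_class in GPU_CLASSES


devices = [
    ("Advanced Micro Devices, Inc. [AMD]", "SATAController"),
    ("Advanced Micro Devices, Inc. [AMD/ATI]", "ProcessingAccelerator"),
    ("ASPEED Technology, Inc.", "DisplayController"),
]
results = [is_gpu_device(vendor, cls) for vendor, cls in devices]
print(results)
```

An AMD SATA controller is no longer listed, an AMD accelerator still is, and the ASPEED adapter is excluded despite its display class.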
Mikhail Chusavitin
b6619d5ccc fix(iso): skip NVIDIA module load when no NVIDIA GPU present
Check PCI vendor 10de before attempting insmod — avoids spurious
nvidia_uvm symbol errors on systems without NVIDIA hardware (e.g. AMD MI350).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:38:31 +03:00
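A minimal Python sketch of the guard in b6619d5ccc: look for PCI vendor ID `10de` before loading any NVIDIA module. It assumes the Linux sysfs layout, where each device directory under `/sys/bus/pci/devices` has a `vendor` file containing a value such as `0x10de`. The function name is hypothetical, and the demo uses a temporary directory with made-up device names instead of real PCI addresses.

```python
import tempfile
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"


def has_nvidia_device(sysfs_root: Path) -> bool:
    """Return True if any PCI device under sysfs_root reports vendor 10de."""
    for vendor_file in sysfs_root.glob("*/vendor"):
        try:
            if vendor_file.read_text().strip().lower() == NVIDIA_VENDOR_ID:
                return True
        except OSError:
            continue  # unreadable entry: skip it instead of aborting
    return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dev_amd").mkdir()
    (root / "dev_amd" / "vendor").write_text("0x1002\n")  # AMD/ATI vendor ID
    amd_only = has_nvidia_device(root)

    (root / "dev_nvidia").mkdir()
    (root / "dev_nvidia" / "vendor").write_text("0x10de\n")
    with_nvidia = has_nvidia_device(root)

print(amd_only, with_nvidia)
```

On a live system the call would be `has_nvidia_device(Path("/sys/bus/pci/devices"))`, and the module load would be skipped when it returns False.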
Mikhail Chusavitin
450193b063 feat(iso): remove splash.png, show EASY-BEE ASCII art in GRUB text mode
The graphical splash had "BEE / HARDWARE AUDIT" baked into the PNG,
overriding the echo ASCII art. Replace with a plain black background
so the EASY-BEE block-char banner from grub.cfg echo commands is visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 13:32:23 +03:00
Mikhail Chusavitin
ee8931f171 fix(iso): pin ISO kernel to same ABI as compiled NVIDIA modules
Export detected DEBIAN_KERNEL_ABI as BEE_KERNEL_ABI from build.sh so
auto/config can pin linux-packages to the exact versioned package
(e.g. linux-image-6.1.0-31 + flavour amd64 = linux-image-6.1.0-31-amd64).
This prevents nvidia.ko vermagic mismatch if the linux-image-amd64
meta-package is updated between build start and lb build chroot step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 12:26:59 +03:00
Mikhail Chusavitin
b771d95894 fix(iso): fix linux-packages to "linux-image" so lb appends flavour correctly
live-build constructs the kernel package as <linux-packages>-<linux-flavours>,
so "linux-image-amd64" + "amd64" = "linux-image-amd64-amd64" (not found).
The correct value is "linux-image" + "amd64" = "linux-image-amd64".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:45:41 +03:00
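The naming rule quoted in b771d95894, and the pinned form used in ee8931f171, come down to one string concatenation. The Python helper below is hypothetical and only mirrors the `<linux-packages>-<linux-flavours>` rule as stated in those commit messages.

```python
def kernel_package(linux_packages: str, linux_flavour: str) -> str:
    """Join the two live-build settings the way the commit message describes."""
    return f"{linux_packages}-{linux_flavour}"


wrong = kernel_package("linux-image-amd64", "amd64")      # flavour appended twice
meta = kernel_package("linux-image", "amd64")             # meta-package name
pinned = kernel_package("linux-image-6.1.0-31", "amd64")  # ABI-pinned package

print(wrong, meta, pinned)
```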
Mikhail Chusavitin
8e60e474dc feat(iso): rebrand to EASY-BEE with ASCII art banner
Replace "Bee Hardware Audit" branding with EASY-BEE across bootloader
and LiveCD: grub.cfg menu entries, echo ASCII art before menu,
motd banner, iso-volume and iso-application metadata.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:45:12 +03:00
Mikhail Chusavitin
2f4ec2acda fix(iso): auto-detect and install kernel headers at build time
- Dockerfile: linux-headers-amd64 meta-package instead of pinned ABI;
  remove DEBIAN_KERNEL_ABI build-arg (no longer needed at image build time)
- build-in-container.sh: drop --build-arg DEBIAN_KERNEL_ABI
- build.sh: apt-get update + detect ABI from apt-cache at build time;
  auto-install linux-headers-<ABI> if kernel changed since image build

Image rebuild is now needed only when changing Go version or lb tools,
not on every Debian kernel point release.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:25:29 +03:00
Mikhail Chusavitin
7ed5cb0306 fix(iso): auto-detect kernel ABI at build time instead of pinning
DEBIAN_KERNEL_ABI=auto in VERSIONS — build.sh queries
apt-cache depends linux-image-amd64 to find the current ABI.
lb config now uses linux-image-amd64 meta-package.

This prevents build failures when Debian drops old kernel packages
from the repo (happens with every point release).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:17:29 +03:00
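The ABI auto-detection in 7ed5cb0306 can be sketched as parsing the output of `apt-cache depends linux-image-amd64`. The real code is shell in build.sh; the Python below is an illustration. The sample output is an assumed, abbreviated form of what `apt-cache depends` prints, and `parse_kernel_abi` is a hypothetical helper.

```python
import re
from typing import Optional


def parse_kernel_abi(depends_output: str, flavour: str = "amd64") -> Optional[str]:
    """Extract an ABI such as '6.1.0-44' from a 'Depends: linux-image-...' line."""
    pattern = re.compile(
        rf"Depends:\s+linux-image-(\d+\.\d+\.\d+-\d+)-{re.escape(flavour)}\b"
    )
    match = pattern.search(depends_output)
    return match.group(1) if match else None


# Assumed sample of `apt-cache depends linux-image-amd64` output.
sample = "linux-image-amd64\n  Depends: linux-image-6.1.0-44-amd64\n"

abi = parse_kernel_abi(sample)
print(abi)
```

The detected value would then feed both the `linux-headers-<ABI>` install and the kernel package pin.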
Mikhail Chusavitin
6df7ac68f5 fix(iso): bump kernel ABI to 6.1.0-44 (6.1.164-1 in bookworm)
6.1.0-43 is no longer available in Debian repos.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:16:09 +03:00
Mikhail Chusavitin
0ce23aea4f feat(iso): add exfatprogs and ntfs-3g for USB export support
exFAT is the default filesystem on USB drives >32GB sold today.
Without exfatprogs, mount fails silently and export to such drives is broken.
ntfs-3g covers Windows-formatted drives.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:12:51 +03:00
Mikhail Chusavitin
2abe2ce3aa fix(iso): fix NCCL version to 2.28.9+cuda13.0, add sha256 verification
NVIDIA's CUDA repo for Debian 12 only has NCCL packages for cuda13.x,
not cuda12.x. Update to the latest available: 2.28.9-1+cuda13.0.
Also pass sha256 from VERSIONS into build-nccl.sh for integrity check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 12:04:03 +03:00