177 Commits

Author SHA1 Message Date
ed1cceed8c fix(boot): add nomodeset to fix black screen on server VGA/IPMI KVM (AST chip KMS) 2026-03-27 00:13:36 +03:00
9fe9f061f8 fix(nccl-tests): set LIBRARY_PATH so ld finds libnccl.so in nccl cache 2026-03-26 23:59:06 +03:00
837a1fb981 fix(nccl-tests): pin /usr/local/cuda→12.8 symlink, auto-detect gencode by nvcc version 2026-03-26 23:54:07 +03:00
1f43b4e050 fix(nccl-tests): pass NCCL_LIB from nccl cache to fix -lnccl link error 2026-03-26 23:52:25 +03:00
83bbc8a1bc fix(nccl-tests): upgrade to cuda-nvcc-12-8, add sm_100 (Blackwell B100/B200) 2026-03-26 23:51:26 +03:00
896bdb6ee8 fix(nccl-tests): use cuda-nvcc-12-6 to support Ampere/Volta (sm_70..sm_90) 2026-03-26 23:50:36 +03:00
5407c26e25 fix(nccl-tests): CUDA 13.0 supports only sm_90+ (Hopper/H100) 2026-03-26 23:49:45 +03:00
4fddaba9c5 fix(nccl-tests): limit CUDA gencode to sm_70+ (CUDA 13 dropped Pascal) 2026-03-26 23:48:40 +03:00
d2f384b6eb fix(nccl-tests): use plain make instead of non-existent all_reduce_perf target 2026-03-26 23:47:49 +03:00
25f0f30aaf fix(boot): fix black screen on monitor, stop log spam on console
- Add console=tty0 so VGA display gets kernel output (was serial-only)
- Change loglevel=7→3 (debug→errors only)
- Add quiet to suppress verbose kernel boot messages
- journald: ForwardToConsole=no so service logs don't flood tty1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:45:09 +03:00
a57b037a91 feat(installer): add 'Install to disk' in Tools submenu
Copies the live system to a local disk via unsquashfs — no debootstrap,
no network required. Supports UEFI (GPT+EFI) and BIOS (MBR) layouts.

ISO:
- Add squashfs-tools, parted, grub-pc, grub-efi-amd64 to package list
- New overlay script bee-install: partitions, formats, unsquashfs,
  writes fstab, runs grub-install+update-grub in chroot

Go TUI:
- Settings → Tools submenu (Install to disk, Check tools)
- Disk picker screen: lists non-USB, non-boot disks via lsblk
- Confirm screen warns about data loss
- Runs with live progress tail of /tmp/bee-install.log
- platform/install.go: ListInstallDisks, InstallToDisk, findLiveBootDevice

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:35:01 +03:00
5644231f9a feat(nccl): add nccl-tests all_reduce_perf for GPU bandwidth testing
- Dockerfile: install cuda-nvcc-13-0 from NVIDIA repo for compilation
- build-nccl-tests.sh: downloads libnccl-dev for nccl.h, builds all_reduce_perf
- build.sh: runs nccl-tests build, injects binary into /usr/local/bin/
- platform: RunNCCLTests() auto-detects GPU count, runs all_reduce_perf
- TUI: NCCL bandwidth test entry in Burn-in Tests screen [N] hotkey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:22:19 +03:00
eea98e6d76 feat(dcgm): add NVIDIA DCGM diagnostics, fix KVM console
- Add 9002-nvidia-dcgm.hook.chroot: installs datacenter-gpu-manager
  from NVIDIA apt repo during live-build
- Enable nvidia-dcgm.service in chroot setup hook
- Replace bee-gpu-stress with dcgmi diag (levels 1-4) in NVIDIA SAT
- TUI: replace GPU checkbox + duration UI with DCGM level selection
- Remove console=tty2 from boot params: KVM/VGA now shows tty1
  where bee-tui runs, fixing unresponsive console

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:08:12 +03:00
967455194c feat(iso): make toram optional, add 'load to RAM' boot menu entry
Default boot no longer loads ISO to RAM (slow on BMC virtual media).
Separate menu entry added for toram in both GRUB and isolinux.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 21:45:04 +03:00
79dabf3efb fix(build): link bee-gpu-stress with -lm for sqrt()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:55:14 +03:00
1336f5b95c fix(cublas): copy include dirs containing files without .h extension
nv/target has no .h suffix; use -type f instead of -name '*.h' to
detect non-empty include directories.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:53:08 +03:00
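A minimal sketch of why the filter change in 1336f5b95c matters, assuming a throwaway directory that mimics nv/target (the temporary paths are illustrative and not taken from the build script):

```shell
#!/bin/sh
# Build a scratch include dir whose only file has no .h suffix,
# mirroring the nv/target case described in the commit above.
tmp=$(mktemp -d)
mkdir -p "$tmp/include/nv"
: > "$tmp/include/nv/target"

# The suffix filter sees nothing; the type filter sees the file.
by_name=$(find "$tmp/include/nv" -name '*.h' | wc -l | tr -d ' ')
by_type=$(find "$tmp/include/nv" -type f | wc -l | tr -d ' ')
echo "name filter: $by_name, type filter: $by_type"

rm -rf "$tmp"
```

With the suffix filter the directory looks empty and would be skipped, whereas -type f counts any regular file.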
31486a31c1 fix(cublas): add cuda-cccl package for nv/target header
cuda_fp16.h (included by cublas_api.h) requires <nv/target> from
the CUDA C++ Core Libraries (cuda-cccl-13-0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:49:46 +03:00
aa3fc332ba fix(cublas): check for .h in subdirs when copying non-standard include dirs
ls *.h missed headers in subdirectories like crt/host_defines.h;
use find -maxdepth 2 instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:47:39 +03:00
62c57b87f2 fix(cublas): allow version-free lookup for cuda-crt package
cuda-crt-13-0 may not share the same version string as cuda-cudart-13-0;
pass empty version to lookup_pkg to match the first available version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:46:45 +03:00
f600261546 fix(cublas): add cuda-crt package for crt/host_defines.h
cublasLt.h -> cublas_api.h -> driver_types.h -> crt/host_defines.h
which lives in the cuda-crt-13-0 package, not cudart-dev.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:42:40 +03:00
d7ca04bdfb fix(cublas): search all include/ dirs in deb for CUDA headers
NVIDIA CUDA .deb packages install headers under
/usr/local/cuda-X.Y/targets/x86_64-linux/include/ not /usr/include/,
causing copy_headers() to silently skip them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:35:21 +03:00
5433652c70 fix(cublas): prevent double-print in lookup_pkg awk END block
awk exit in the blank-line block jumps to END, which printed the
result again causing repo_sha to contain the hash twice with a newline,
breaking the sha256 string comparison.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:29:10 +03:00
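The double-print described in 5433652c70 can be reproduced with a small sketch. The two-stanza index and the awk rules below are hypothetical stand-ins for lookup_pkg, whose real body is not shown in this log:

```shell
#!/bin/sh
# Hypothetical index in the style of a Packages file.
index='Package: demo
SHA256: abc123

Package: other
SHA256: def456'

# Buggy shape: `exit` in the blank-line rule still runs END,
# so the hash is printed a second time.
buggy=$(printf '%s\n' "$index" | awk '
    /^Package: demo$/   { hit = 1 }
    hit && /^SHA256: /  { sha = $2 }
    hit && /^$/         { print sha; exit }
    END                 { print sha }')

# Guarded shape: END prints only if nothing was printed earlier
# (still needed when the matching stanza is the last one in the file).
fixed=$(printf '%s\n' "$index" | awk '
    /^Package: demo$/   { hit = 1 }
    hit && /^SHA256: /  { sha = $2 }
    hit && /^$/         { print sha; printed = 1; exit }
    END                 { if (!printed && sha != "") print sha }')

printf 'buggy:\n%s\nfixed:\n%s\n' "$buggy" "$fixed"
```

In awk, exit inside a pattern rule stops reading input but still runs the END block, which is why a guard flag (or moving the print into END only) is needed.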
b25f014dbd fix(cublas): strip CR from Packages.gz fields to fix sha256 comparison
Debian Packages.gz uses CRLF line endings; \r in the captured SHA256
field caused string comparison to fail even when hashes were identical.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:24:58 +03:00
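A short sketch of the failure mode named in b25f014dbd, assuming a captured field that ends in a carriage return (the hash value is made up for illustration):

```shell
#!/bin/sh
# A field as it might be captured with a stray trailing \r.
# Command substitution strips trailing newlines but not carriage returns.
raw=$(printf 'abc123\r')
expected='abc123'

# Strip carriage returns before comparing.
clean=$(printf '%s' "$raw" | tr -d '\r')

if [ "$raw" = "$expected" ]; then echo "raw matches"; else echo "raw differs"; fi
if [ "$clean" = "$expected" ]; then echo "clean matches"; else echo "clean differs"; fi
```

The two strings look identical when echoed to a terminal, which is what makes this kind of mismatch hard to spot.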
d69a46f211 fix(cublas): redirect diagnostic echo to stderr in download_verified_pkg
Echo messages captured in stdout polluted the return value of
download_verified_pkg(), causing extract_deb() to receive a
multi-line string instead of a file path and silently exit via set -e.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 20:22:39 +03:00
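The stdout pollution fixed in d69a46f211 follows a common shell pattern: a function that "returns" a value on stdout must send everything else to stderr. A minimal sketch, where fetch_pkg and its path are hypothetical stand-ins for download_verified_pkg:

```shell
#!/bin/sh
# Hypothetical stand-in for download_verified_pkg; the real body is not in this log.
fetch_pkg() {
    echo "downloading $1" >&2      # diagnostics go to stderr
    echo "/tmp/cache/$1.deb"       # only the file path goes to stdout
}

# The caller captures stdout only, so it receives a single-line path.
pkg_path=$(fetch_pkg demo)
echo "captured: $pkg_path"
```

Without the >&2 redirect, pkg_path would hold both lines, and any consumer expecting a file path would receive a multi-line string.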
Mikhail Chusavitin
fc5c2019aa iso: improve burn-in, export, and live boot iso/v1.0.21 2026-03-26 18:56:19 +03:00
Mikhail Chusavitin
67a215c66f fix(iso): route kernel logs to tty2, keep tty1 clean for TUI
console=tty0 sent kernel messages to the active VT (tty1), overwriting
the TUI. Changed to console=tty2 so kernel logs land on a dedicated
console. tty1 is now clean; operator can press Alt+F2 to inspect kernel
messages and Alt+F3 for an extra shell.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iso/v1.0.20
2026-03-26 17:40:44 +03:00
Mikhail Chusavitin
8b4bfdf5ad feat(tui): live GPU chart during stress test, full VRAM allocation
- GPU Platform Stress Test now shows a live in-TUI chart instead of nvtop.
  nvidia-smi is polled every second; up to 60 data points per GPU kept.
  All three metrics (Usage %, Temp °C, Power W) drawn on a single plot,
  each normalised to its own range and rendered in a different colour.
- Memory allocation changed from MemoryMB/16 to MemoryMB-512 (full VRAM
  minus 512 MB driver overhead) so bee-gpu-stress actually stresses memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
audit/v1.0.10
2026-03-26 17:37:20 +03:00
Mikhail Chusavitin
0a52a4f3ba fix(iso): restore loglevel=7 on VGA console for crash visibility
loglevel=3 was hiding all kernel messages on tty0/ttyS0 except errors.
Machine crashes (panics, driver oops, module failures) were silent on VGA.

Restored loglevel=7 so kernel messages up to debug are printed to both
tty0 (VGA) and ttyS0 (SOL). Journald MaxLevelConsole reduced to info
(was debug) to reduce noise on SOL while keeping it useful.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iso/v1.0.19
2026-03-26 11:19:07 +03:00
Mikhail Chusavitin
b132f7973a fix(iso): derive ISO filename from iso/v* tags, not audit/v*
Previously the ISO file was named after git describe --match 'audit/v*',
so a new iso/ tag produced names like v1.0.9-1-gXXXXXXX instead of v1.0.17.
Now build.sh has resolve_iso_version() that looks at iso/v* tags separately.
The bee binary inside the ISO still uses AUDIT_VERSION_EFFECTIVE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iso/v1.0.18
2026-03-26 11:05:51 +03:00
Mikhail Chusavitin
bd94b6c792 fix(iso): add libnvidia-ptxjitcompiler + ldconfig for PTX JIT and NCCL
- build-nvidia-module.sh: copy libnvidia-ptxjitcompiler.so.* alongside
  libcuda/libnvidia-ml — required by cuModuleLoadDataEx for PTX JIT.
  Without it: CUDA_ERROR_JIT_COMPILER_NOT_FOUND at runtime.
  Cache check updated to force rebuild when ptxjitcompiler is missing.
- bee-nvidia-load: run ldconfig after module load so that NVIDIA/NCCL
  libs injected into /usr/lib/ are visible to dlopen() callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iso/v1.0.17
2026-03-26 10:37:27 +03:00
Mikhail Chusavitin
06017eddfd feat(tui): remove nvtop auto-launch from NVIDIA SAT
nvtop is no longer shown during NVIDIA SAT runs.
[o] Open nvtop shortcut also removed from the running screen.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
audit/v1.0.9
2026-03-26 10:29:05 +03:00
Mikhail Chusavitin
0ac7b6a963 fix(iso): restore console=tty0 — VGA screen was black without it
Commit d36e844 dropped console=tty0 and added dual-serial + debug logging.
Without console=tty0 the kernel never initialises the VGA console,
leaving the physical screen permanently blank.

- Restore console=tty0 (VGA) as primary, keep console=ttyS0 for SOL
- Drop console=ttyS1 (redundant second serial port)
- Replace loglevel=7 + journald debug flood with loglevel=3 (errors only)
  so kernel messages don't overwrite the TUI on the local screen
- Remove systemd.log_target/forward_to_console debug params

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iso/v1.0.16
2026-03-26 10:23:53 +03:00
Mikhail Chusavitin
3d2ae4cdcb fix(iso): use Ubuntu jammy codename for AMD ROCm repo — Debian not supported
AMD does not publish Debian Bookworm packages at all (only focal/jammy/noble).
Switch ROCM_UBUNTU_DIST to "jammy"; jammy packages install cleanly on
Debian 12 due to compatible glibc. Also expand candidate list to include
point-releases (6.3.4, 6.3.3, …) so we pick the latest actually-published one.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iso/v1.0.15
2026-03-26 10:08:58 +03:00
Mikhail Chusavitin
4669f14f4f feat(tui): GPU Platform Stress Test — live nvtop chart during test
Apply the same pattern as NVIDIA SAT: launch nvtop via tea.ExecProcess
so it occupies the full terminal as a live GPU chart (temp, power, fan,
utilisation lines) while the stress test runs in the background.

- Add screenGPUStressRunning screen + dedicated running/render handlers
- startGPUStressTest: tea.Batch(stress goroutine, tea.ExecProcess(nvtop))
- [o] reopen nvtop at any time; [a] abort (cancels context)
- Graceful degradation: test still runs if nvtop is not on PATH
- gpuStressDoneMsg routes result to screenOutput on completion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
audit/v1.0.8
2026-03-26 10:01:31 +03:00
Mikhail Chusavitin
540a9e39b8 refactor(audit): rename Fan Stress Test → GPU Platform Stress Test
Update all user-facing strings in TUI and ActionResult title.
Internal identifiers (types, functions, file name) unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
audit/v1.0.7
2026-03-26 09:56:25 +03:00
Mikhail Chusavitin
58510207fa fix(iso): fall back through ROCm 6.4→6.3→6.2 if repo Release file missing
ROCm 6.4 does not yet publish a Release file for Debian Bookworm, causing
the live-build chroot hook to fail with "does not have a Release file".

Try each version in ROCM_CANDIDATES order; skip to the next if apt-get update
fails (repo unavailable). Exit gracefully if none are available.
Also rename inner 'candidate' variable to 'smi_path' to avoid collision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
iso/v1.0.14
2026-03-26 09:52:17 +03:00
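The fallback order in 58510207fa can be sketched as a first-success loop. The repo_available probe below is a hypothetical stand-in for "add the repo and run apt-get update"; here it pretends that only 6.3 is published:

```shell
#!/bin/sh
# Hypothetical probe: succeeds only for 6.3 in this sketch.
repo_available() { [ "$1" = "6.3" ]; }

chosen=""
for ver in 6.4 6.3 6.2; do
    if repo_available "$ver"; then
        chosen=$ver
        break
    fi
    echo "ROCm $ver unavailable, trying next" >&2
done

# Exit gracefully (no error) when no candidate works.
if [ -z "$chosen" ]; then
    echo "no ROCm repo available, skipping" >&2
else
    echo "using ROCm $chosen"
fi
```

Keeping the probe in an if condition means a failed candidate does not abort a script running under set -e.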
Mikhail Chusavitin
4cd7c9ab4e feat(audit): fan-stress SAT for MSI case-04 fan lag & thermal throttle detection
Two-phase GPU thermal cycling test with per-second telemetry:
- Phases: baseline → load1 → pause (no cooldown) → load2 → cooldown
- Monitors: fan RPM (ipmitool sdr), CPU/server temps (ipmitool/sensors),
  system power (ipmitool dcmi), GPU temp/power/usage/clock/throttle (nvidia-smi)
- Detects throttling via clocks_throttle_reasons.active bitmask
- Measures fan response lag from load start (validates case-04 ~2s lag)
- Exports metrics.csv (wide format, one row/sec) and fan-sensors.csv (long format)
- TUI: adds [F] Fan Stress Test to Health Check screen with Quick/Standard/Express modes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
audit/v1.0.6
2026-03-26 09:51:03 +03:00
Mikhail Chusavitin
cfe255f6e4 Release audit/v1.0.5 audit/v1.0.5 2026-03-26 09:41:19 +03:00
Mikhail Chusavitin
8b9d3447d7 Overlay SAT results into audit JSON audit/v1.0.4 2026-03-25 20:11:03 +03:00
Mikhail Chusavitin
614b7cad61 Improve PCIe inventory and hardware identity collection audit/v1.0.3 2026-03-25 20:00:38 +03:00
Mikhail Chusavitin
9a1df9b1ba Tighten support bundles and fix AMD runtime checks audit/v1.0.2 2026-03-25 19:35:25 +03:00
Mikhail Chusavitin
30cf014d58 Rename NVIDIA bootloader modes 2026-03-25 19:12:26 +03:00
Mikhail Chusavitin
27d478aed6 Add bootloader choice for safe vs full NVIDIA boot 2026-03-25 19:11:15 +03:00
Mikhail Chusavitin
d36e8442a9 Stabilize live ISO consoles and NVIDIA boot path 2026-03-25 19:05:18 +03:00
Mikhail Chusavitin
b345b0d14d Derive ISO version from git tags 2026-03-25 18:40:48 +03:00
Mikhail Chusavitin
0a1ac2ab9f Bootstrap ROCm hook prerequisites in ISO build 2026-03-25 18:38:19 +03:00
Mikhail Chusavitin
1e62f828c6 Embed MOTD banner into TUI audit/v1.0.1 2026-03-25 18:11:17 +03:00
Mikhail Chusavitin
f8c997d272 Add missing SAT progress TUI helpers 2026-03-25 18:03:45 +03:00
Mikhail Chusavitin
0c16616cc9 1. Verbose live progress during SAT tests (CPU, Memory, Storage, AMD GPU)
- New tui/sat_progress.go: polls {DefaultSATBaseDir}/{prefix}-*/verbose.log every 300ms and parses completed/in-progress steps
- Busy screen now shows each step as PASS  lscpu (234ms) / FAIL  stress-ng (60.0s) / ...   sensors-after instead of just "Working..."

2. Test results shown on screen (instead of just "Archive written to /path")
- RunCPUAcceptancePackResult, RunMemoryAcceptancePackResult, RunStorageAcceptancePackResult, RunAMDAcceptancePackResult now read summary.txt from the run directory and return a formatted per-step result:

  Run: 2025-03-25T10:00:00Z

  PASS  lscpu
  PASS  sensors-before
  FAIL  stress-ng
  PASS  sensors-after

  Overall: FAILED  (ok=3  failed=1)

3. AMD GPU SAT with auto-detection
- platform.System.DetectGPUVendor(): checks /dev/nvidia0 → "nvidia", /dev/kfd → "amd"
- platform.System.RunAMDAcceptancePack(): runs rocm-smi, rocm-smi --showallinfo, dmidecode
- GPU SAT (G key / GPU row enter) automatically routes to AMD or NVIDIA based on detected vendor
- "Run All" also auto-detects vendor

4. Panel detail view
- GPU detail now shows the most recent (NVIDIA or AMD) SAT result, whichever is newer
- All SAT detail views use the same human-readable formatSATDetail format
2026-03-25 17:54:27 +03:00
Mikhail Chusavitin
adcc147b32 feat(iso): add AMD Instinct MI250X/MI250 driver support
- firmware-amd-graphics: Aldebaran firmware blobs (fixes amdgpu IB ring
  test errors on MI250/MI250X at boot)
- 9001-amd-rocm.hook.chroot: adds AMD ROCm 6.4 apt repo and installs
  rocm-smi-lib for GPU monitoring (analogous to nvidia-smi)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 15:42:10 +03:00