Mikhail Chusavitin
c850b39b01
feat: v3.10 GPU stress and NCCL burn updates
2026-03-31 11:22:27 +03:00
911745e4da
refactor(iso): replace chroot hooks for DCGM/ROCm with live-build apt sources
Move datacenter-gpu-manager and rocm-smi-lib from dynamic chroot hooks
into live-build's config/archives mechanism so lb caches the .deb files
in cache/packages.chroot/ between builds, eliminating repeated 900+ MB
downloads. Versions pinned via VERSIONS and substituted into package
lists at build time.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 13:01:10 +03:00
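The version substitution described in the commit above could look roughly like the following. This is a minimal sketch, not the repository's code: the `@NAME@` placeholder syntax, the `render_package_list` helper, and the assumption that VERSIONS holds simple `KEY=value` lines are all hypothetical.

```shell
# Sketch only: the @NAME@ placeholder syntax and this helper are assumptions
# for illustration, not taken from the repository.

# render_package_list VERSIONS_FILE TEMPLATE
# Prints TEMPLATE with every @KEY@ replaced by the value of the matching
# KEY=value line in VERSIONS_FILE.
render_package_list() {
    versions_file=$1
    template=$2
    sed_script=""
    while IFS='=' read -r key value; do
        # Skip blank lines and comment lines.
        case "$key" in ''|'#'*) continue ;; esac
        sed_script="${sed_script}s|@${key}@|${value}|g;"
    done < "$versions_file"
    sed "$sed_script" "$template"
}
```

Under these assumptions, a template line such as `datacenter-gpu-manager=@DCGM_VERSION@` would be rendered with the pinned version before the result is written to a live-build package list, so lb sees the same pinned package on every build and can reuse the cached .deb.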
9fe9f061f8
fix(nccl-tests): set LIBRARY_PATH so ld finds libnccl.so in nccl cache
2026-03-26 23:59:06 +03:00
837a1fb981
fix(nccl-tests): pin /usr/local/cuda→12.8 symlink, auto-detect gencode by nvcc version
2026-03-26 23:54:07 +03:00
1f43b4e050
fix(nccl-tests): pass NCCL_LIB from nccl cache to fix -lnccl link error
2026-03-26 23:52:25 +03:00
83bbc8a1bc
fix(nccl-tests): upgrade to cuda-nvcc-12-8, add sm_100 (Blackwell B100/B200)
2026-03-26 23:51:26 +03:00
896bdb6ee8
fix(nccl-tests): use cuda-nvcc-12-6 to support Ampere/Volta (sm_70..sm_90)
2026-03-26 23:50:36 +03:00
5407c26e25
fix(nccl-tests): CUDA 13.0 supports only sm_90+ (Hopper/H100)
2026-03-26 23:49:45 +03:00
4fddaba9c5
fix(nccl-tests): limit CUDA gencode to sm_70+ (CUDA 13 dropped Pascal)
2026-03-26 23:48:40 +03:00
d2f384b6eb
fix(nccl-tests): use plain make instead of non-existent all_reduce_perf target
2026-03-26 23:47:49 +03:00
5644231f9a
feat(nccl): add nccl-tests all_reduce_perf for GPU bandwidth testing
- Dockerfile: install cuda-nvcc-13-0 from NVIDIA repo for compilation
- build-nccl-tests.sh: downloads libnccl-dev for nccl.h, builds all_reduce_perf
- build.sh: runs nccl-tests build, injects binary into /usr/local/bin/
- platform: RunNCCLTests() auto-detects GPU count, runs all_reduce_perf
- TUI: NCCL bandwidth test entry in Burn-in Tests screen [N] hotkey
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:22:19 +03:00
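The GPU auto-detection step in the commit above can be illustrated in shell. This is a hedged sketch, not the actual RunNCCLTests() implementation (which lives in the platform code): the helper names `detect_gpu_count` and `nccl_command` are hypothetical, and the message-size arguments follow the example in the nccl-tests README rather than anything stated in the commit.

```shell
# Sketch only: helper names and size arguments are assumptions, not the
# repository's RunNCCLTests() logic.

# Count GPUs by counting "GPU N: ..." lines printed by `nvidia-smi -L`.
# Prints 0 when nvidia-smi is missing or lists no devices.
detect_gpu_count() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo 0
        return
    fi
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
}

# Print the all_reduce_perf command line for the given GPU count.
# -b/-e: minimum/maximum message size, -f: size multiplication factor,
# -g: number of GPUs per thread.
nccl_command() {
    gpus=$1
    if [ "$gpus" -lt 1 ]; then
        echo "no NVIDIA GPUs detected" >&2
        return 1
    fi
    echo "/usr/local/bin/all_reduce_perf -b 8 -e 128M -f 2 -g $gpus"
}
```

A caller might combine the two as `cmd=$(nccl_command "$(detect_gpu_count)") && $cmd`, which refuses to start the benchmark on a machine without NVIDIA GPUs.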