feat(nccl): add nccl-tests all_reduce_perf for GPU bandwidth testing

- Dockerfile: install cuda-nvcc-13-0 from NVIDIA repo for compilation
- build-nccl-tests.sh: downloads libnccl-dev for nccl.h, builds all_reduce_perf
- build.sh: runs nccl-tests build, injects binary into /usr/local/bin/
- platform: RunNCCLTests() auto-detects GPU count, runs all_reduce_perf
- TUI: add NCCL bandwidth test entry to the Burn-in Tests screen, bound to the [N] hotkey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 23:22:19 +03:00
parent eea98e6d76
commit 5644231f9a
11 changed files with 221 additions and 13 deletions

@@ -26,6 +26,19 @@ RUN apt-get update -qq && apt-get install -y \
linux-headers-amd64 \
&& rm -rf /var/lib/apt/lists/*
# Add NVIDIA CUDA repo and install nvcc (needed to compile nccl-tests)
RUN wget -qO /tmp/cuda-keyring.gpg \
https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub \
&& gpg --dearmor < /tmp/cuda-keyring.gpg \
> /usr/share/keyrings/nvidia-cuda.gpg \
&& rm /tmp/cuda-keyring.gpg \
&& echo "deb [signed-by=/usr/share/keyrings/nvidia-cuda.gpg] \
https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /" \
> /etc/apt/sources.list.d/cuda.list \
&& apt-get update -qq \
&& apt-get install -y cuda-nvcc-13-0 \
&& rm -rf /var/lib/apt/lists/*
RUN arch="$(dpkg --print-architecture)" \
&& case "$arch" in \
amd64) goarch=amd64 ;; \