GPU PCIe Test Methodology
Validate
- CPU check
lscpu
sensors
stress-ng
- Memory check
free
timeout <timeout_sec> memtester
free
- NVMe storage check
nvme id-ctrl
nvme smart-log
nvme device-self-test
- SATA/SAS storage check
smartctl -H -A
smartctl -t short
- Basic NVIDIA GPU check
nvidia-smi -pm 1
nvidia-smi -q
dmidecode -t baseboard
dmidecode -t system
dcgmi diag -r 2
- Inter-GPU communication check
all_reduce_perf
- GPU bandwidth check
dcgmi diag -r nvbandwidth
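
As one possible way to script the Validate steps, the sketch below wraps each command in a small runner that records PASS, FAIL, or SKIP and keeps going. It is not part of the original methodology. `run_step`, `validate_gpu_basic`, the `LOG_FILE` path, and the skip-if-missing policy are all hypothetical choices for illustration. Only the Basic NVIDIA GPU check is wired up, because its commands need no arguments beyond those listed. `dmidecode` and `nvidia-smi -pm 1` normally require root.

```shell
#!/bin/sh
# Hypothetical runner: names, log path, and PASS/FAIL/SKIP policy are assumptions.
LOG_FILE="${LOG_FILE:-/tmp/gpu_validate.log}"
PASS=0
FAIL=0
SKIP=0

# run_step LABEL CMD [ARGS...]
# Runs one check, appends its output to LOG_FILE, and updates the counters.
# Always returns 0 so that one failing check does not stop the sequence.
run_step() {
    _label="$1"; shift
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "SKIP ${_label}: $1 not installed"
        SKIP=$((SKIP + 1))
        return 0
    fi
    _rc=0
    "$@" >>"$LOG_FILE" 2>&1 || _rc=$?
    if [ "$_rc" -eq 0 ]; then
        echo "PASS ${_label}"
        PASS=$((PASS + 1))
    else
        echo "FAIL ${_label} (exit ${_rc})"
        FAIL=$((FAIL + 1))
    fi
    return 0
}

# Basic NVIDIA GPU check, using the commands from the list above.
validate_gpu_basic() {
    run_step "persistence mode" nvidia-smi -pm 1
    run_step "GPU query"        nvidia-smi -q
    run_step "baseboard info"   dmidecode -t baseboard
    run_step "system info"      dmidecode -t system
    run_step "DCGM diag run 2"  dcgmi diag -r 2
}
```

Calling `validate_gpu_basic` and then printing `$PASS`, `$FAIL`, and `$SKIP` gives a one-line summary. The other checks can be added the same way once device paths and test sizes are chosen for the target machine.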
Validate -> Stress
- Extended NVIDIA GPU check
nvidia-smi -pm 1
nvidia-smi -q
dmidecode -t baseboard
dmidecode -t system
dcgmi diag -r 3
- NVIDIA targeted stress
nvidia-smi -pm 1
nvidia-smi -q
dcgmi diag -r targeted_stress
- NVIDIA targeted power
nvidia-smi -pm 1
nvidia-smi -q
dcgmi diag -r targeted_power
- NVIDIA pulse test
nvidia-smi -pm 1
nvidia-smi -q
dcgmi diag -r pulse_test
- Inter-GPU communication check
all_reduce_perf
- GPU bandwidth check
dcgmi diag -r nvbandwidth
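
The three targeted Stress items above (targeted stress, targeted power, pulse test) share one command pattern, so they could be driven from a loop. The following is a minimal sketch, not the original tooling. `run_or_print`, `run_targeted`, `DCGM_TARGETED_RUNS`, and the `DRY_RUN` switch are hypothetical names. `DRY_RUN` defaults to 1 here, so the commands are only printed unless it is explicitly set to 0.

```shell
#!/bin/sh
# DCGM test names taken from the Stress list above.
DCGM_TARGETED_RUNS="targeted_stress targeted_power pulse_test"

# Print the command when DRY_RUN is 1 (the default in this sketch),
# otherwise execute it and pass its exit status through.
run_or_print() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# For each targeted test: enable persistence mode, dump GPU state,
# then start the DCGM diagnostic with that test name.
run_targeted() {
    for _run in $DCGM_TARGETED_RUNS; do
        run_or_print nvidia-smi -pm 1
        run_or_print nvidia-smi -q
        run_or_print dcgmi diag -r "$_run" ||
            echo "FAILED: dcgmi diag -r ${_run}" >&2
    done
}
```

Running `run_targeted` with the default settings lists the nine commands in order, which is a cheap way to review the plan before launching it on hardware with `DRY_RUN=0 run_targeted`.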