# GPU PCIe Test Methodology

## Validate

- CPU check
  - `lscpu`
  - `sensors`
  - `stress-ng`
- Memory check
  - `free`
  - `timeout <timeout_sec> memtester`
  - `free`
- NVMe storage check
  - `nvme id-ctrl`
  - `nvme smart-log`
  - `nvme device-self-test`
- SATA/SAS storage check
  - `smartctl -H -A`
  - `smartctl -t short`
- Basic NVIDIA GPU check
  - `nvidia-smi -pm 1`
  - `nvidia-smi -q`
  - `dmidecode -t baseboard`
  - `dmidecode -t system`
  - `dcgmi diag -r 2`
- Inter-GPU communication check
  - `all_reduce_perf`
- GPU bandwidth check
  - `dcgmi diag -r nvbandwidth`

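The memory check in the list above brackets a time-limited `memtester` run with two `free` snapshots. A minimal sketch of that sequence follows. The helper names `run_step` and `memory_check`, the variables `DRY_RUN`, `TIMEOUT_SEC`, `MEM_SIZE` and `MEM_LOOPS`, and all default values are assumptions for illustration; the list only specifies the three commands and the `<timeout_sec>` placeholder.

```shell
#!/bin/sh
# Sketch of the memory check: `free`, a bounded `memtester` run, `free` again.
# All names and defaults below are illustrative assumptions.

DRY_RUN="${DRY_RUN:-1}"            # 1 = only print commands (default in this sketch)
TIMEOUT_SEC="${TIMEOUT_SEC:-600}"  # assumed value for <timeout_sec>
MEM_SIZE="${MEM_SIZE:-1G}"         # assumed amount of memory for memtester to test
MEM_LOOPS="${MEM_LOOPS:-1}"        # assumed number of memtester passes

# Print the command in dry-run mode, otherwise execute it.
run_step() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

memory_check() {
    # Snapshot memory usage before the test.
    run_step free

    # Bound memtester with timeout; GNU timeout exits with 124
    # when the time limit is reached before the command finishes.
    run_step timeout "$TIMEOUT_SEC" memtester "$MEM_SIZE" "$MEM_LOOPS"
    rc=$?

    # Snapshot memory usage after the test, even if memtester failed.
    run_step free

    return "$rc"
}
```

After sourcing the file, calling `memory_check` prints the three commands. Setting `DRY_RUN=0` executes them, which typically requires root so that `memtester` can lock memory. Whether exit status 124 counts as a pass is left to the caller, since the list does not say.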
## Validate -> Stress

- Extended NVIDIA GPU check
  - `nvidia-smi -pm 1`
  - `nvidia-smi -q`
  - `dmidecode -t baseboard`
  - `dmidecode -t system`
  - `dcgmi diag -r 3`
- NVIDIA targeted stress
  - `nvidia-smi -pm 1`
  - `nvidia-smi -q`
  - `dcgmi diag -r targeted_stress`
- NVIDIA targeted power
  - `nvidia-smi -pm 1`
  - `nvidia-smi -q`
  - `dcgmi diag -r targeted_power`
- NVIDIA pulse test
  - `nvidia-smi -pm 1`
  - `nvidia-smi -q`
  - `dcgmi diag -r pulse_test`
- Inter-GPU communication check
  - `all_reduce_perf`
- GPU bandwidth check
  - `dcgmi diag -r nvbandwidth`
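The four DCGM steps in this section share one pattern: enable persistence mode, dump the GPU state, then run one `dcgmi diag` test. The loop below is a rough sketch of that pattern. The names `run_step`, `stress_pass`, `STRESS_RUNS` and `DRY_RUN` are hypothetical, and stopping at the first failing command is an assumption, not something the list prescribes. The `all_reduce_perf` and `nvbandwidth` steps are left out because the list gives no arguments for them.

```shell
#!/bin/sh
# Sketch of the stress pass: repeat the nvidia-smi preamble before each DCGM run.
# Helper names and the stop-on-first-failure policy are illustrative assumptions.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands (default in this sketch)

# Values passed to `dcgmi diag -r`, in the order listed in this section.
STRESS_RUNS="3 targeted_stress targeted_power pulse_test"

# Print the command in dry-run mode, otherwise execute it.
run_step() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

stress_pass() {
    for run in $STRESS_RUNS; do
        run_step nvidia-smi -pm 1 || return 1
        run_step nvidia-smi -q || return 1

        # Only the extended check (run level 3) records board and system info.
        if [ "$run" = "3" ]; then
            run_step dmidecode -t baseboard || return 1
            run_step dmidecode -t system || return 1
        fi

        run_step dcgmi diag -r "$run" || return 1
    done
}
```

With the default `DRY_RUN=1`, `stress_pass` prints the command sequence without touching the GPUs, which makes it easy to review the plan before running it as root with `DRY_RUN=0`.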