Add validate test matrix and GPU test methodology docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Mikhail Chusavitin
2026-04-30 10:47:08 +03:00
parent 2c22b01fe3
commit 0b8a2ff83f
2 changed files with 111 additions and 0 deletions

View File

@@ -0,0 +1,54 @@
# GPU PCIe Test Methodology
## Validate
- CPU check
- `lscpu`
- `sensors`
- `stress-ng`
- Memory check
- `free`
- `timeout <timeout_sec> memtester`
- `free`
- NVMe storage check
- `nvme id-ctrl`
- `nvme smart-log`
- `nvme device-self-test`
- SATA/SAS storage check
- `smartctl -H -A`
- `smartctl -t short`
- Basic NVIDIA GPU check
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dmidecode -t baseboard`
- `dmidecode -t system`
- `dcgmi diag -r 2`
- Inter-GPU communication check
- `all_reduce_perf`
- GPU bandwidth check
- `dcgmi diag -r nvbandwidth`
## Validate -> Stress
- Extended NVIDIA GPU check
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dmidecode -t baseboard`
- `dmidecode -t system`
- `dcgmi diag -r 3`
- NVIDIA targeted stress
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r targeted_stress`
- NVIDIA targeted power
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r targeted_power`
- NVIDIA pulse test
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r pulse_test`
- Inter-GPU communication check
- `all_reduce_perf`
- GPU bandwidth check
- `dcgmi diag -r nvbandwidth`