Files
bee/bible-local/docs/customer-gpu-test-methodology.md
Mikhail Chusavitin 0b8a2ff83f Add validate test matrix and GPU test methodology docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 10:47:08 +03:00

1.1 KiB

GPU PCIe Test Methodology

Validate

  • CPU check
    • lscpu
    • sensors
    • stress-ng
  • Memory check
    • free
    • timeout <timeout_sec> memtester
    • free
  • NVMe storage check
    • nvme id-ctrl
    • nvme smart-log
    • nvme device-self-test
  • SATA/SAS storage check
    • smartctl -H -A
    • smartctl -t short
  • Basic NVIDIA GPU check
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dmidecode -t baseboard
    • dmidecode -t system
    • dcgmi diag -r 2
  • Inter-GPU communication check
    • all_reduce_perf
  • GPU bandwidth check
    • dcgmi diag -r nvbandwidth

Validate -> Stress

  • Extended NVIDIA GPU check
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dmidecode -t baseboard
    • dmidecode -t system
    • dcgmi diag -r 3
  • NVIDIA targeted stress
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dcgmi diag -r targeted_stress
  • NVIDIA targeted power
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dcgmi diag -r targeted_power
  • NVIDIA pulse test
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dcgmi diag -r pulse_test
  • Inter-GPU communication check
    • all_reduce_perf
  • GPU bandwidth check
    • dcgmi diag -r nvbandwidth