Add validate test matrix and GPU test methodology docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -9,5 +9,62 @@ Generic engineering rules live in `bible/rules/patterns/`.
 |---|---|
 | `architecture/system-overview.md` | What bee does, scope, tech stack |
 | `architecture/runtime-flows.md` | Boot sequence, audit flow, service order |
+| `docs/customer-gpu-test-methodology.md` | Customer-facing GPU PCIe Validate / Validate -> Stress test list |
 | `docs/hardware-ingest-contract.md` | Current Reanimator hardware ingest JSON contract |
+| `docs/validate-vs-burn.md` | Validate and Validate -> Stress hardware test policy |
 | `decisions/` | Architectural decision log, including read-only submodule policy |
+
+## Validate Test Matrix
+
+### Validate
+
+- CPU check
+  - `lscpu`
+  - `sensors`
+  - `stress-ng`
+- Memory check
+  - `free`
+  - `timeout <timeout_sec> memtester`
+  - `free`
+- NVMe storage check
+  - `nvme id-ctrl`
+  - `nvme smart-log`
+  - `nvme device-self-test`
+- SATA/SAS storage check
+  - `smartctl -H -A`
+  - `smartctl -t short`
+- Basic NVIDIA GPU check
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dmidecode -t baseboard`
+  - `dmidecode -t system`
+  - `dcgmi diag -r 2`
+- Inter-GPU communication check
+  - `all_reduce_perf`
+- GPU bandwidth check
+  - `dcgmi diag -r nvbandwidth`
+
+### Validate -> Stress
+
+- Extended NVIDIA GPU check
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dmidecode -t baseboard`
+  - `dmidecode -t system`
+  - `dcgmi diag -r 3`
+- NVIDIA targeted stress
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dcgmi diag -r targeted_stress`
+- NVIDIA targeted power
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dcgmi diag -r targeted_power`
+- NVIDIA pulse test
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dcgmi diag -r pulse_test`
+- Inter-GPU communication check
+  - `all_reduce_perf`
+- GPU bandwidth check
+  - `dcgmi diag -r nvbandwidth`
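
The matrix in this hunk lists two profiles that overlap in part. As a minimal sketch of how it could be encoded and compared, assuming a plain dict layout and the helper names `shared_checks` and `dcgm_runs` (none of which appear in the commit; only the command strings are copied from the matrix), one might write:

```python
# Hypothetical encoding of the test matrix. The dict layout and helper names
# are illustrative assumptions; the command strings are copied from the matrix.
GPU_PREP = ["nvidia-smi -pm 1", "nvidia-smi -q"]
BOARD_INFO = ["dmidecode -t baseboard", "dmidecode -t system"]
COMMON_TAIL = {
    "Inter-GPU communication check": ["all_reduce_perf"],
    "GPU bandwidth check": ["dcgmi diag -r nvbandwidth"],
}

VALIDATE = {
    "CPU check": ["lscpu", "sensors", "stress-ng"],
    "Memory check": ["free", "timeout <timeout_sec> memtester", "free"],
    "NVMe storage check": ["nvme id-ctrl", "nvme smart-log", "nvme device-self-test"],
    "SATA/SAS storage check": ["smartctl -H -A", "smartctl -t short"],
    "Basic NVIDIA GPU check": GPU_PREP + BOARD_INFO + ["dcgmi diag -r 2"],
    **COMMON_TAIL,
}

VALIDATE_STRESS = {
    "Extended NVIDIA GPU check": GPU_PREP + BOARD_INFO + ["dcgmi diag -r 3"],
    "NVIDIA targeted stress": GPU_PREP + ["dcgmi diag -r targeted_stress"],
    "NVIDIA targeted power": GPU_PREP + ["dcgmi diag -r targeted_power"],
    "NVIDIA pulse test": GPU_PREP + ["dcgmi diag -r pulse_test"],
    **COMMON_TAIL,
}


def shared_checks(a, b):
    """Return check names present in both profiles with identical commands."""
    return [name for name in a if name in b and a[name] == b[name]]


def dcgm_runs(profile):
    """Return the argument passed to `dcgmi diag -r` for each DCGM step, in order."""
    prefix = "dcgmi diag -r "
    return [
        cmd[len(prefix):]
        for cmds in profile.values()
        for cmd in cmds
        if cmd.startswith(prefix)
    ]


if __name__ == "__main__":
    print(shared_checks(VALIDATE, VALIDATE_STRESS))
    print(dcgm_runs(VALIDATE))
    print(dcgm_runs(VALIDATE_STRESS))
```

Read this way, the two profiles share only the inter-GPU communication and bandwidth checks, while the DCGM step moves from `-r 2` to `-r 3` plus the three named tests.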
54
bible-local/docs/customer-gpu-test-methodology.md
Normal file
@@ -0,0 +1,54 @@
+# GPU PCIe Test Methodology
+
+## Validate
+
+- CPU check
+  - `lscpu`
+  - `sensors`
+  - `stress-ng`
+- Memory check
+  - `free`
+  - `timeout <timeout_sec> memtester`
+  - `free`
+- NVMe storage check
+  - `nvme id-ctrl`
+  - `nvme smart-log`
+  - `nvme device-self-test`
+- SATA/SAS storage check
+  - `smartctl -H -A`
+  - `smartctl -t short`
+- Basic NVIDIA GPU check
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dmidecode -t baseboard`
+  - `dmidecode -t system`
+  - `dcgmi diag -r 2`
+- Inter-GPU communication check
+  - `all_reduce_perf`
+- GPU bandwidth check
+  - `dcgmi diag -r nvbandwidth`
+
+## Validate -> Stress
+
+- Extended NVIDIA GPU check
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dmidecode -t baseboard`
+  - `dmidecode -t system`
+  - `dcgmi diag -r 3`
+- NVIDIA targeted stress
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dcgmi diag -r targeted_stress`
+- NVIDIA targeted power
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dcgmi diag -r targeted_power`
+- NVIDIA pulse test
+  - `nvidia-smi -pm 1`
+  - `nvidia-smi -q`
+  - `dcgmi diag -r pulse_test`
+- Inter-GPU communication check
+  - `all_reduce_perf`
+- GPU bandwidth check
+  - `dcgmi diag -r nvbandwidth`
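
The Memory check lists `free`, a time-bounded `memtester`, then `free` again, presumably as a before/after snapshot. The following is a rough sketch of that sequence as an ordered plan, not the project's actual runner. The function names, the `mem_size` and `loops` arguments passed to `memtester`, and the dry-run default are all assumptions; the methodology only specifies `timeout <timeout_sec> memtester`.

```python
import subprocess


def memory_check_plan(timeout_sec, mem_size="256M", loops=1):
    """Build the three-step Memory check as argv lists.

    mem_size and loops are hypothetical placeholders; the methodology does
    not state which memtester arguments are used.
    """
    return [
        ["free"],
        ["timeout", str(timeout_sec), "memtester", mem_size, str(loops)],
        ["free"],
    ]


def run_plan(plan, dry_run=True):
    """Print each step, or execute it and record the exit status.

    GNU timeout exits with status 124 when the limit is hit; how that status
    should be judged is a policy decision this sketch does not make.
    subprocess.run raises FileNotFoundError if a tool is not installed.
    """
    results = []
    for argv in plan:
        if dry_run:
            print(" ".join(argv))
            results.append((argv, None))
        else:
            proc = subprocess.run(argv, capture_output=True, text=True)
            results.append((argv, proc.returncode))
    return results


if __name__ == "__main__":
    run_plan(memory_check_plan(timeout_sec=300))
```

Keeping the plan as data and the execution separate lets the command order be reviewed or logged without touching the hardware; passing `dry_run=False` would actually invoke the tools.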