1.8 KiB
1.8 KiB
bee — Project Bible
Project-specific architecture, decisions, and runtime contracts.
Generic engineering rules live in bible/rules/patterns/.
Files
| File | Contents |
|---|---|
architecture/system-overview.md |
What bee does, scope, tech stack |
architecture/runtime-flows.md |
Boot sequence, audit flow, service order |
docs/customer-gpu-test-methodology.md |
Customer-facing GPU PCIe Validate / Validate -> Stress test list |
docs/hardware-ingest-contract.md |
Current Reanimator hardware ingest JSON contract |
docs/validate-vs-burn.md |
Validate and Validate -> Stress hardware test policy |
decisions/ |
Architectural decision log, including read-only submodule policy |
Validate Test Matrix
Validate
- CPU check
lscpusensorsstress-ng
- Memory check
freetimeout <timeout_sec> memtesterfree
- NVMe storage check
nvme id-ctrlnvme smart-lognvme device-self-test
- SATA/SAS storage check
smartctl -H -Asmartctl -t short
- Basic NVIDIA GPU check
nvidia-smi -pm 1nvidia-smi -qdmidecode -t baseboarddmidecode -t systemdcgmi diag -r 2
- Inter-GPU communication check
all_reduce_perf
- GPU bandwidth check
dcgmi diag -r nvbandwidth
Validate -> Stress
- Extended NVIDIA GPU check
nvidia-smi -pm 1nvidia-smi -qdmidecode -t baseboarddmidecode -t systemdcgmi diag -r 3
- NVIDIA targeted stress
nvidia-smi -pm 1nvidia-smi -qdcgmi diag -r targeted_stress
- NVIDIA targeted power
nvidia-smi -pm 1nvidia-smi -qdcgmi diag -r targeted_power
- NVIDIA pulse test
nvidia-smi -pm 1nvidia-smi -qdcgmi diag -r pulse_test
- Inter-GPU communication check
all_reduce_perf
- GPU bandwidth check
dcgmi diag -r nvbandwidth