diff --git a/bible-local/docs/validate-vs-burn.md b/bible-local/docs/validate-vs-burn.md new file mode 100644 index 0000000..928e051 --- /dev/null +++ b/bible-local/docs/validate-vs-burn.md @@ -0,0 +1,35 @@ +# Validate vs Burn: Hardware Impact Policy + +## Validate Tests (non-destructive) + +Tests on the **Validate** page are purely diagnostic. They: + +- **Do not write to disks** — no data is written to storage devices; SMART counters (power-on hours, load cycle count, reallocated sectors) are not incremented. +- **Do not run sustained high load** — commands complete quickly (seconds to minutes) and do not push hardware to thermal or electrical limits. +- **Do not increment hardware wear counters** — GPU memory ECC counters, NVMe wear leveling counters, and similar endurance metrics are unaffected. +- **Are safe to run repeatedly** — on new, production-bound, or already-deployed hardware without concern for reducing lifespan. + +### What Validate tests actually do + +| Test | What it runs | +|---|---| +| NVIDIA GPU | `nvidia-smi`, `dcgmi diag` (levels 1–4 read-only diagnostics) | +| Memory | `memtester` on a limited allocation; reads/writes to RAM only | +| Storage | `smartctl -a`, `nvme smart-log` — reads SMART data only | +| CPU | `stress-ng` for a bounded duration; CPU-only, no I/O | +| AMD GPU | `rocm-smi --showallinfo`, `dmidecode` — read-only queries | + +## Burn Tests (hardware wear) + +Tests on the **Burn** page run hardware at maximum or near-maximum load for extended durations. They: + +- **Wear storage**: write-intensive patterns can reduce SSD endurance (P/E cycles). +- **Stress GPU memory**: extended ECC stress tests may surface latent defects but also exercise memory cells. +- **Accelerate thermal cycling**: repeated heat/cool cycles degrade solder joints and capacitors over time. +- **May increment wear counters**: GPU power-on hours, NVMe media wear indicator, and similar metrics will advance. + +### Rule + +> Run **Validate** freely on any server, at any time, before or after deployment. +> Run **Burn** only when explicitly required (e.g., initial acceptance after repair, or per customer SLA). +> Document when and why Burn tests were run.