Validate vs Burn: Hardware Impact Policy
Validate Tests (non-destructive)
Tests on the Validate page are purely diagnostic. They:
- Do not write to disks: no data is written to storage devices, so the tests add no write wear and do not change SMART counters such as load cycle count or reallocated sectors. Power-on hours accrue with uptime as usual, whether or not a test runs.
- Do not run sustained high load: commands complete within seconds to minutes and do not push hardware to thermal or electrical limits.
- Do not increment hardware wear counters: GPU memory ECC counters, NVMe wear-leveling counters, and similar endurance metrics are unaffected.
- Are safe to run repeatedly: they can be run on new, production-bound, or already-deployed hardware without reducing its lifespan.
What Validate tests actually do
| Test | What it runs |
|---|---|
| NVIDIA GPU | nvidia-smi, dcgmi diag (levels 1–4 read-only diagnostics) |
| Memory | memtester on a limited allocation; reads/writes to RAM only |
| Storage | smartctl -a, nvme smart-log — reads SMART data only |
| CPU | stress-ng for a bounded duration; CPU-only, no I/O |
| AMD GPU | rocm-smi --showallinfo, dmidecode — read-only queries |
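As a rough illustration of the table above, the following Python sketch maps each Validate test to the commands it lists. The function names `validate_commands` and `run_available`, the default device paths, the 256M memtester allocation, the single memtester pass, and the 60-second stress-ng limit are all assumptions made for this example, not the tool's actual configuration.

```python
import shutil
import subprocess


def validate_commands(disk="/dev/sda", nvme="/dev/nvme0", dcgm_level=1):
    """Return {test name: list of argv lists} mirroring the Validate table.

    Device paths, sizes, and durations are illustrative assumptions.
    """
    return {
        "NVIDIA GPU": [["nvidia-smi"], ["dcgmi", "diag", "-r", str(dcgm_level)]],
        # Limited allocation, one pass: touches RAM only.
        "Memory": [["memtester", "256M", "1"]],
        # Both commands only read SMART data from the device.
        "Storage": [["smartctl", "-a", disk], ["nvme", "smart-log", nvme]],
        # "--cpu 0" starts one worker per online CPU; the timeout bounds the run.
        "CPU": [["stress-ng", "--cpu", "0", "--timeout", "60s"]],
        "AMD GPU": [["rocm-smi", "--showallinfo"], ["dmidecode"]],
    }


def run_available(commands):
    """Run each command whose binary is installed; skip the rest."""
    results = {}
    for test, argvs in commands.items():
        for argv in argvs:
            if shutil.which(argv[0]) is None:
                results[(test, argv[0])] = "skipped (not installed)"
                continue
            proc = subprocess.run(argv, capture_output=True, text=True)
            results[(test, argv[0])] = f"exit {proc.returncode}"
    return results
```

Keeping the command list separate from the runner makes it easy to review exactly what a Validate run would execute before anything touches the hardware. Several of these commands need root privileges in practice.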
Burn Tests (hardware wear)
Tests on the Burn page run hardware at maximum or near-maximum load for extended durations. They:
- Wear storage: write-intensive patterns consume SSD program/erase (P/E) cycles and reduce remaining endurance.
- Stress GPU memory: extended ECC stress tests may surface latent defects, but they also keep memory cells under sustained load.
- Accelerate thermal cycling: repeated heat/cool cycles degrade solder joints and capacitors over time.
- Advance wear counters: GPU power-on hours, the NVMe media wear indicator, and similar metrics increase during a run.
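One way to quantify the storage wear described above is to compare SMART-log snapshots taken before and after a run. The sketch below assumes the snapshots come from `nvme smart-log <device> -o json` and that the JSON contains `data_units_written` and `percent_used` fields. The helper name `wear_delta` and the sample values are hypothetical.

```python
import json

# One NVMe "data unit" is 1000 units of 512 bytes, per the NVMe SMART log definition.
BYTES_PER_DATA_UNIT = 512 * 1000


def wear_delta(before_json, after_json):
    """Compare two SMART-log snapshots taken before and after a test run."""
    before = json.loads(before_json)
    after = json.loads(after_json)
    units = after["data_units_written"] - before["data_units_written"]
    return {
        "bytes_written": units * BYTES_PER_DATA_UNIT,
        "percent_used_delta": after["percent_used"] - before["percent_used"],
    }


# Hypothetical snapshots for illustration, not real measurements.
before = '{"data_units_written": 1000000, "percent_used": 3}'
after = '{"data_units_written": 1250000, "percent_used": 4}'
print(wear_delta(before, after))
```

The same comparison around a Validate run should report zero bytes written by the test itself, although other activity on the host can still add writes in the meantime.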
Rule
Run Validate freely on any server, at any time, before or after deployment. Run Burn only when explicitly required (e.g., initial acceptance after repair, or per customer SLA). Document when and why Burn tests were run.
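The rule can be enforced in tooling with a small gate. This is a minimal sketch under stated assumptions: the function name `authorize`, the `"validate"`/`"burn"` labels, and the in-memory `log` list are hypothetical and stand in for whatever record-keeping the team actually uses.

```python
from datetime import datetime, timezone


def authorize(kind, reason=None, log=None):
    """Return True if the run is allowed under the rule above.

    Validate runs are always allowed. Burn runs need a non-empty reason,
    which is appended to `log` (if given) with a UTC timestamp.
    """
    if kind == "validate":
        return True
    if kind != "burn":
        raise ValueError(f"unknown test kind: {kind!r}")
    if not reason or not reason.strip():
        return False  # Burn without a documented reason is refused.
    if log is not None:
        log.append({
            "kind": kind,
            "reason": reason.strip(),
            "when": datetime.now(timezone.utc).isoformat(),
        })
    return True
```

For example, `authorize("burn", "initial acceptance after repair", log)` returns `True` and leaves one timestamped entry in `log`, while `authorize("burn")` returns `False`.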