Files
bee/bible-local/docs/validate-vs-burn.md
2026-03-28 21:15:33 +03:00

36 lines
1.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Validate vs Burn: Hardware Impact Policy
## Validate Tests (non-destructive)
Tests on the **Validate** page are purely diagnostic. They:
- **Do not write to disks** — no data is written to storage devices; SMART counters (power-on hours, load cycle count, reallocated sectors) are not incremented.
- **Do not run sustained high load** — commands complete quickly (seconds to minutes) and do not push hardware to thermal or electrical limits.
- **Do not increment hardware wear counters** — GPU memory ECC counters, NVMe wear leveling counters, and similar endurance metrics are unaffected.
- **Are safe to run repeatedly** — on new, production-bound, or already-deployed hardware without concern for reducing lifespan.
### What Validate tests actually do
| Test | What it runs |
|---|---|
| NVIDIA GPU | `nvidia-smi`, `dcgmi diag` (levels 14 read-only diagnostics) |
| Memory | `memtester` on a limited allocation; reads/writes to RAM only |
| Storage | `smartctl -a`, `nvme smart-log` — reads SMART data only |
| CPU | `stress-ng` for a bounded duration; CPU-only, no I/O |
| AMD GPU | `rocm-smi --showallinfo`, `dmidecode` — read-only queries |
## Burn Tests (hardware wear)
Tests on the **Burn** page run hardware at maximum or near-maximum load for extended durations. They:
- **Wear storage**: write-intensive patterns can reduce SSD endurance (P/E cycles).
- **Stress GPU memory**: extended ECC stress tests may surface latent defects but also exercise memory cells.
- **Accelerate thermal cycling**: repeated heat/cool cycles degrade solder joints and capacitors over time.
- **May increment wear counters**: GPU power-on hours, NVMe media wear indicator, and similar metrics will advance.
### Rule
> Run **Validate** freely on any server, at any time, before or after deployment.
> Run **Burn** only when explicitly required (e.g., initial acceptance after repair, or per customer SLA).
> Document when and why Burn tests were run.