bee/bible-local/docs/validate-vs-burn.md
2026-03-28 21:15:33 +03:00


Validate vs Burn: Hardware Impact Policy

Validate Tests (non-destructive)

Tests on the Validate page are purely diagnostic. They:

  • Do not write to disks: no data is written to storage devices, so the tests add no write wear and do not advance SMART counters such as load cycle count or reallocated sectors. Power-on hours accrue with uptime whether or not a test runs.
  • Do not run sustained high load — commands complete quickly (seconds to minutes) and do not push hardware to thermal or electrical limits.
  • Do not increment hardware wear counters — GPU memory ECC counters, NVMe wear leveling counters, and similar endurance metrics are unaffected.
  • Are safe to run repeatedly on new, production-bound, or already-deployed hardware, without concern for reducing its lifespan.
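
The "no wear counters advance" property above can be spot-checked by snapshotting counters before and after a Validate run and comparing them. The following is a minimal sketch, not part of the documented tooling: the counter names in WEAR_COUNTERS and the snapshot values are hypothetical, and collecting real snapshots (for example by parsing SMART output) is left out.

```python
from typing import Dict, Optional, Tuple

# Hypothetical counter names; real names depend on the device and the tool
# used to read them. Power-on hours are deliberately not tracked, because
# they grow with uptime regardless of which tests run.
WEAR_COUNTERS = ("data_units_written", "percentage_used", "media_errors")


def changed_wear_counters(
    before: Dict[str, int], after: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {name: (before, after)} for each tracked counter that differs."""
    changed = {}
    for name in WEAR_COUNTERS:
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


# Made-up snapshots taken before and after a Validate run.
before = {"data_units_written": 1_204_113, "percentage_used": 3,
          "media_errors": 0, "power_on_hours": 8123}
after = {"data_units_written": 1_204_113, "percentage_used": 3,
         "media_errors": 0, "power_on_hours": 8124}

print(changed_wear_counters(before, after))  # prints {}
```

On a live system the operating system may write to the disk while a test runs, so a non-empty result is a prompt to investigate rather than proof that the test itself caused wear.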

What Validate tests actually do

| Test | What it runs |
|---|---|
| NVIDIA GPU | nvidia-smi, dcgmi diag (levels 1-4, read-only diagnostics) |
| Memory | memtester on a limited allocation; reads/writes to RAM only |
| Storage | smartctl -a, nvme smart-log; reads SMART data only |
| CPU | stress-ng for a bounded duration; CPU-only, no I/O |
| AMD GPU | rocm-smi --showallinfo, dmidecode; read-only queries |
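
As an illustration of the table above, the sketch below maps each Validate test to concrete command lines. The tool names come from the table; everything else is an assumption chosen for the example: the device paths, the dcgmi run level (-r 1), the memtester size and loop count (256M, 1), and the stress-ng options (--cpu 0 --timeout 60s). The code only builds and prints the commands; it does not execute them.

```python
from typing import Dict, List


def validate_commands(
    disk: str = "/dev/sda",      # assumed SATA/SAS device path
    nvme: str = "/dev/nvme0",    # assumed NVMe device path
) -> Dict[str, List[List[str]]]:
    """Return argv lists for each Validate test, keyed by test name."""
    return {
        "NVIDIA GPU": [["nvidia-smi"], ["dcgmi", "diag", "-r", "1"]],
        # Limited allocation, single pass: touches RAM only.
        "Memory": [["memtester", "256M", "1"]],
        # Both commands only read SMART / health data.
        "Storage": [["smartctl", "-a", disk], ["nvme", "smart-log", nvme]],
        # --cpu 0 starts one worker per online CPU; --timeout bounds the run.
        "CPU": [["stress-ng", "--cpu", "0", "--timeout", "60s"]],
        "AMD GPU": [["rocm-smi", "--showallinfo"], ["dmidecode"]],
    }


for test, commands in validate_commands().items():
    for argv in commands:
        print(f"{test}: {' '.join(argv)}")
```

Keeping the commands as argv lists rather than shell strings makes it straightforward to pass them to subprocess.run without shell quoting issues, should a runner be built on top of this.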

Burn Tests (hardware wear)

Tests on the Burn page run hardware at maximum or near-maximum load for extended durations. They:

  • Wear storage: write-intensive patterns can reduce SSD endurance (P/E cycles).
  • Stress GPU memory: extended ECC stress tests may surface latent defects but also exercise memory cells.
  • Accelerate thermal cycling: repeated heat/cool cycles degrade solder joints and capacitors over time.
  • Advance wear counters: GPU power-on hours, the NVMe media wear indicator, and similar metrics can increase during a run.
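
To make the storage-wear point above concrete, the fraction of an SSD's rated write endurance consumed by a burn run can be estimated as bytes written divided by the drive's TBW rating. The numbers below are hypothetical (a drive rated for 600 TBW, written at a sustained 1 GB/s for 24 hours), and the estimate ignores write amplification, so actual NAND wear may be higher.

```python
def endurance_consumed(bytes_written: float, rated_tbw_terabytes: float) -> float:
    """Fraction of the rated write endurance (TBW) used by bytes_written."""
    rated_bytes = rated_tbw_terabytes * 1e12  # 1 TB = 10**12 bytes
    return bytes_written / rated_bytes


# Hypothetical burn run: 1 GB/s sustained for 24 hours.
written = 1e9 * 24 * 3600              # 8.64e13 bytes, i.e. 86.4 TB
fraction = endurance_consumed(written, rated_tbw_terabytes=600)

print(f"{fraction:.1%}")  # prints 14.4%
```

Under these assumptions a single day of write-heavy burn testing uses roughly a seventh of the drive's rated endurance, which is why Burn runs are worth documenting.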

Rule

Run Validate freely on any server, at any time, before or after deployment. Run Burn only when explicitly required (e.g., initial acceptance after repair, or per customer SLA). Document when and why Burn tests were run.
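
One way to enforce this rule in tooling is a small gate that lets Validate runs through unconditionally but refuses a Burn run unless a reason is supplied, and returns a record of when and why the run was started. This is a sketch under assumptions: the names RunRecord and authorize, the host label, and the record fields are all hypothetical and not part of the documented system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical audit record for a test run."""
    kind: str                 # "validate" or "burn"
    host: str
    reason: Optional[str]     # required for burn, optional for validate
    started_at: str           # ISO 8601 timestamp in UTC


def authorize(kind: str, host: str, reason: Optional[str] = None) -> RunRecord:
    """Allow Validate freely; allow Burn only with an explicit reason."""
    if kind not in ("validate", "burn"):
        raise ValueError(f"unknown test kind: {kind!r}")
    if kind == "burn" and not (reason and reason.strip()):
        raise PermissionError("Burn requires an explicit, documented reason")
    return RunRecord(kind, host, reason, datetime.now(timezone.utc).isoformat())


# Validate needs no justification.
print(authorize("validate", "srv-01").kind)

# Burn is accepted only with a reason, which is kept in the record.
record = authorize("burn", "srv-01", reason="acceptance after repair")
print(record.reason)
```

Persisting the returned record (for example, appending it to a log) would cover the "document when and why" part of the rule.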