Michael Chus 2320925433 Skip PCIe link-speed warnings for disabled devices
Disabled PCIe devices (sysfs enable==0) carry no data traffic; their
link state has no operational impact. Switchtec PCIe switch management
endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers)
train at reduced speed intentionally and were producing spurious warnings.

Check is vendor-agnostic: reads enable attribute via existing helper,
no vendor/device ID hardcoding.

Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-12 03:42:19 +03:00
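
The check described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `read_attr` and `link_speed_warnings` are hypothetical names standing in for the "existing helper" and the warning pass, and comparing `current_link_speed` against `max_link_speed` as plain strings is a simplifying assumption. The sysfs attributes `enable`, `current_link_speed`, and `max_link_speed` are standard Linux PCI attributes.

```python
from pathlib import Path
from typing import List, Optional

SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def read_attr(dev_dir: Path, name: str) -> Optional[str]:
    """Return a stripped sysfs attribute value, or None if it is unreadable."""
    try:
        return (dev_dir / name).read_text().strip()
    except OSError:
        return None


def link_speed_warnings(root: Path = SYSFS_PCI_DEVICES) -> List[str]:
    """Collect link-speed warnings, skipping devices whose enable attribute is 0."""
    warnings = []
    for dev_dir in sorted(root.iterdir()):
        # Vendor-agnostic gate: a disabled device carries no data traffic,
        # so its link state is not reported. No vendor/device IDs are consulted.
        if read_attr(dev_dir, "enable") == "0":
            continue
        current = read_attr(dev_dir, "current_link_speed")
        maximum = read_attr(dev_dir, "max_link_speed")
        if current is None or maximum is None:
            continue  # no link attributes exposed for this device
        # Assumption: any mismatch between the two strings counts as reduced speed.
        if current != maximum:
            warnings.append(f"{dev_dir.name}: link at {current}, capable of {maximum}")
    return warnings
```

Passing a different `root` makes the function easy to exercise against a fake sysfs tree in tests, which is why the path is a parameter rather than a hardcoded constant.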

bee — Project Bible

Project-specific architecture, decisions, and runtime contracts. Generic engineering rules live in bible/rules/patterns/.

Files

| File | Contents |
| --- | --- |
| architecture/system-overview.md | What bee does, scope, tech stack |
| architecture/runtime-flows.md | Boot sequence, audit flow, service order |
| docs/customer-gpu-test-methodology.md | Customer-facing GPU PCIe Validate / Validate -> Stress test list |
| docs/hardware-ingest-contract.md | Current Reanimator hardware ingest JSON contract |
| docs/validate-vs-burn.md | Validate and Validate -> Stress hardware test policy |
| decisions/ | Architectural decision log, including read-only submodule policy |

Validate Test Matrix

Validate

  • CPU check
    • lscpu
    • sensors
    • stress-ng
  • Memory check
    • free
    • timeout <timeout_sec> memtester
    • free
  • NVMe storage check
    • nvme id-ctrl
    • nvme smart-log
    • nvme device-self-test
  • SATA/SAS storage check
    • smartctl -H -A
    • smartctl -t short
  • Basic NVIDIA GPU check
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dmidecode -t baseboard
    • dmidecode -t system
    • dcgmi diag -r 2
  • Inter-GPU communication check
    • all_reduce_perf
  • GPU bandwidth check
    • dcgmi diag -r nvbandwidth
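
One way to make the Validate matrix above machine-readable is to transcribe it into an ordered mapping and expand it into argv lists. The sketch below is an illustration only: `VALIDATE` and `plan` are hypothetical names, the commands are copied as listed (command heads only, without the device paths or extra arguments a real run would need), and `<timeout_sec>` is treated as a placeholder to be filled in by the caller.

```python
import shlex

# Validate matrix transcribed from the list above. Only the command heads
# shown there are included; real invocations need further arguments.
VALIDATE = {
    "CPU check": ["lscpu", "sensors", "stress-ng"],
    "Memory check": ["free", "timeout <timeout_sec> memtester", "free"],
    "NVMe storage check": ["nvme id-ctrl", "nvme smart-log", "nvme device-self-test"],
    "SATA/SAS storage check": ["smartctl -H -A", "smartctl -t short"],
    "Basic NVIDIA GPU check": [
        "nvidia-smi -pm 1",
        "nvidia-smi -q",
        "dmidecode -t baseboard",
        "dmidecode -t system",
        "dcgmi diag -r 2",
    ],
    "Inter-GPU communication check": ["all_reduce_perf"],
    "GPU bandwidth check": ["dcgmi diag -r nvbandwidth"],
}


def plan(matrix, timeout_sec):
    """Expand the matrix into ordered (check, argv) pairs, filling <timeout_sec>."""
    steps = []
    for check, commands in matrix.items():
        for command in commands:
            filled = command.replace("<timeout_sec>", str(timeout_sec))
            steps.append((check, shlex.split(filled)))
    return steps


if __name__ == "__main__":
    # Print the plan without executing anything.
    for check, argv in plan(VALIDATE, 600):
        print(f"{check}: {' '.join(argv)}")
```

Keeping the plan separate from execution lets the same table drive a dry run, a log header, or an actual runner.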

Validate -> Stress

  • Extended NVIDIA GPU check
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dmidecode -t baseboard
    • dmidecode -t system
    • dcgmi diag -r 3
  • NVIDIA targeted stress
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dcgmi diag -r targeted_stress
  • NVIDIA targeted power
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dcgmi diag -r targeted_power
  • NVIDIA pulse test
    • nvidia-smi -pm 1
    • nvidia-smi -q
    • dcgmi diag -r pulse_test
  • Inter-GPU communication check
    • all_reduce_perf
  • GPU bandwidth check
    • dcgmi diag -r nvbandwidth
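
The NVIDIA entries in the Validate -> Stress list share a common shape: enable persistence mode, dump `nvidia-smi -q`, optionally record `dmidecode` output, then run one `dcgmi diag -r` target. A minimal sketch of that pattern, assuming a hypothetical helper `nvidia_diag_steps` and a hypothetical `STRESS` mapping (neither name comes from the project):

```python
def nvidia_diag_steps(run, include_dmidecode=False):
    """Build the command list for one NVIDIA check ending in `dcgmi diag -r <run>`."""
    steps = ["nvidia-smi -pm 1", "nvidia-smi -q"]
    if include_dmidecode:
        # Only the extended check in the list above records board/system identity.
        steps += ["dmidecode -t baseboard", "dmidecode -t system"]
    steps.append(f"dcgmi diag -r {run}")
    return steps


# Validate -> Stress matrix, reproducing the list above entry for entry.
STRESS = {
    "Extended NVIDIA GPU check": nvidia_diag_steps("3", include_dmidecode=True),
    "NVIDIA targeted stress": nvidia_diag_steps("targeted_stress"),
    "NVIDIA targeted power": nvidia_diag_steps("targeted_power"),
    "NVIDIA pulse test": nvidia_diag_steps("pulse_test"),
    "Inter-GPU communication check": ["all_reduce_perf"],
    "GPU bandwidth check": ["dcgmi diag -r nvbandwidth"],
}

if __name__ == "__main__":
    for check, commands in STRESS.items():
        print(check)
        for command in commands:
            print(f"  {command}")
```

Generating the repeated preamble from one function keeps the per-test entries down to the single value that actually differs, the `-r` target.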