Disabled PCIe devices (sysfs enable==0) carry no data traffic; their link state has no operational impact. Switchtec PCIe switch management endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers) train at reduced speed intentionally and were producing spurious warnings.

Check is vendor-agnostic: reads enable attribute via existing helper, no vendor/device ID hardcoding.

Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Decision: Skip PCIe link-speed warnings for disabled devices
Date: 2026-06-12
Status: active
Context
On HGX H100 SXM5 baseboards, the Microchip Switchtec PM41028 PSX PCIe switch
(vendor 11F8, device 4128, NVIDIA subsystem 10DE:1643) appears in lspci as a
"Memory controller". Its upstream link trains at Gen3 x2 while the device is
capable of Gen4 x16. The device is permanently in a disabled state: memory access
and bus-mastering are both off (Mem-, BusMaster-); /sys/bus/pci/devices/<bdf>/enable
reads 0.
This chip is the PCIe fabric management endpoint for the NVSwitch interconnect. It
carries only low-bandwidth management traffic and is intentionally not activated
by any Linux driver. The bee audit was reporting a statusWarning with the message
"PCIe link speed degraded" for this device, which is misleading because the device
is not in the data path.
Decision
applyPCIeLinkSpeedWarning reads /sys/bus/pci/devices/<bdf>/enable via the
existing readPCIIntAttribute helper. If the value is 0 the function returns
early without setting any warning status.
The check is vendor-agnostic: it applies to any PCIe device that Linux has not
activated, regardless of make or model. This is consistent with the
no-hardcoded-vendors contract: no vendor ID, device ID, or name string is
used as a condition.
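
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual bee code: the pciDevice struct, the status string values, the sysfsRoot parameter, the signature of readPCIIntAttribute, and the runDemo helper are all assumptions made so the example runs against a temporary directory instead of /sys/bus/pci/devices. The BDF addresses are made up. The sketch also assumes that an unreadable enable attribute falls through to the normal link-speed comparison, which the decision does not specify.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// Hypothetical status values; the real constants may hold different strings.
const (
	statusOK      = "OK"
	statusWarning = "Warning"
)

// pciDevice is a simplified, hypothetical stand-in for the audit's PCIe record.
type pciDevice struct {
	BDF              string
	CurrentLinkSpeed string
	MaxLinkSpeed     string
	Status           string
	ErrorDescription string
}

// readPCIIntAttribute is an assumed shape for the existing helper: it reads
// <sysfsRoot>/<bdf>/<attr> and parses the content as a decimal integer.
func readPCIIntAttribute(sysfsRoot, bdf, attr string) (int, error) {
	raw, err := os.ReadFile(filepath.Join(sysfsRoot, bdf, attr))
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(raw)))
}

// applyPCIeLinkSpeedWarning returns early for devices whose enable attribute
// reads 0. No vendor ID, device ID, or name string is consulted.
func applyPCIeLinkSpeedWarning(sysfsRoot string, d *pciDevice) {
	enabled, err := readPCIIntAttribute(sysfsRoot, d.BDF, "enable")
	if err == nil && enabled == 0 {
		return // disabled device: link state has no operational impact
	}
	// Assumption: a read error falls through to the normal check.
	if d.CurrentLinkSpeed != d.MaxLinkSpeed {
		d.Status = statusWarning
		d.ErrorDescription = "PCIe link speed degraded"
	}
}

// runDemo builds a fake sysfs tree with one disabled and one enabled device,
// both with a degraded link, and returns their resulting statuses.
func runDemo() (string, string, error) {
	root, err := os.MkdirTemp("", "pci-sysfs")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)

	enableByBDF := map[string]string{"0000:05:00.0": "0\n", "0000:06:00.0": "1\n"}
	for bdf, enable := range enableByBDF {
		dir := filepath.Join(root, bdf)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", "", err
		}
		if err := os.WriteFile(filepath.Join(dir, "enable"), []byte(enable), 0o644); err != nil {
			return "", "", err
		}
	}

	disabled := &pciDevice{BDF: "0000:05:00.0", CurrentLinkSpeed: "8.0 GT/s PCIe",
		MaxLinkSpeed: "16.0 GT/s PCIe", Status: statusOK}
	enabled := &pciDevice{BDF: "0000:06:00.0", CurrentLinkSpeed: "8.0 GT/s PCIe",
		MaxLinkSpeed: "16.0 GT/s PCIe", Status: statusOK}
	applyPCIeLinkSpeedWarning(root, disabled)
	applyPCIeLinkSpeedWarning(root, enabled)
	return disabled.Status, enabled.Status, nil
}

func main() {
	d, e, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Printf("disabled device: %s, enabled device: %s\n", d, e)
}
```

Passing the sysfs root as a parameter is only a convenience for this sketch; it lets the guard be exercised without real hardware.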
Consequences
- PCIe fabric management endpoints, IPMI virtual devices, and other permanently disabled PCIe functions no longer produce spurious link-degradation warnings.
- Real link degradation on active devices (GPUs, NICs, NVMe, NVLink bridges) continues to be detected and reported as before.
- NVLink bridge cards retain their existing statusCritical path (they are always enabled, so the early return is never taken for them).
- The Switchtec device on HGX H100 boards shows statusOK with no error_description in the audit JSON.
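
One way the last point could look in the audit output is sketched below. Only the error_description key comes from this document; the auditEntry struct, the bdf and status keys, the "OK" string, and the BDF address are hypothetical. The sketch assumes the field is dropped through Go's omitempty tag, which may not be how the audit actually serializes it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// auditEntry is a hypothetical shape for one device in the audit JSON.
type auditEntry struct {
	BDF              string `json:"bdf"`
	Status           string `json:"status"`
	ErrorDescription string `json:"error_description,omitempty"`
}

// marshalEntry serializes an entry; an empty ErrorDescription is omitted.
func marshalEntry(e auditEntry) string {
	out, err := json.Marshal(e)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	// A disabled device that took the early return: status untouched, no description.
	fmt.Println(marshalEntry(auditEntry{BDF: "0000:05:00.0", Status: "OK"}))
}
```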