bee/bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md
mchus 2320925433 Skip PCIe link-speed warnings for disabled devices
Disabled PCIe devices (sysfs enable==0) carry no data traffic; their
link state has no operational impact. Switchtec PCIe switch management
endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers)
train at reduced speed intentionally and were producing spurious warnings.

Check is vendor-agnostic: reads enable attribute via existing helper,
no vendor/device ID hardcoding.

Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-12 03:42:19 +03:00

Decision: Skip PCIe link-speed warnings for disabled devices

Date: 2026-06-12
Status: active

Context

On HGX H100 SXM5 baseboards, the Microchip Switchtec PM41028 PSX PCIe switch (vendor 11F8, device 4128, NVIDIA subsystem 10DE:1643) appears in lspci as a "Memory controller". Its upstream link trains at Gen3 x2 while the device is capable of Gen4 x16. The device is permanently in a disabled state: memory access and bus-mastering are both off (Mem-, BusMaster-); /sys/bus/pci/devices/<bdf>/enable reads 0.

This chip is the PCIe fabric management endpoint for the NVSwitch interconnect. It carries only low-bandwidth management traffic and is intentionally not activated by any Linux driver. The bee audit was nevertheless reporting statusWarning with the message "PCIe link speed degraded" for this device, which is misleading because the device is not in the data path.

Decision

applyPCIeLinkSpeedWarning reads /sys/bus/pci/devices/<bdf>/enable via the existing readPCIIntAttribute helper. If the value is 0, the function returns early without setting any warning status.

The check is vendor-agnostic: it applies to any PCIe device that Linux has not activated, regardless of make or model. This is consistent with the no-hardcoded-vendors contract: no vendor ID, device ID, or name string is used as a condition.
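
The early return can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the attrReader type stands in for readPCIIntAttribute (whose real signature is not shown here), and the pciDevice struct, its field names, the "Warning" status string, and the fall-through behaviour when the attribute cannot be read are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// attrReader reads an integer sysfs attribute for a PCI device.
// Hypothetical stand-in for the project's readPCIIntAttribute helper.
type attrReader func(bdf, attr string) (int, error)

// sysfsIntAttr reads <root>/<bdf>/<attr> and parses it as a decimal integer.
func sysfsIntAttr(root, bdf, attr string) (int, error) {
	raw, err := os.ReadFile(filepath.Join(root, bdf, attr))
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(raw)))
}

// pciDevice holds only the fields this sketch needs (assumed layout).
type pciDevice struct {
	BDF              string
	CurGen, MaxGen   int // negotiated and maximum PCIe generation, e.g. 3 and 4
	Status           string
	ErrorDescription string
}

// applyLinkSpeedWarning flags a degraded link unless the device is disabled.
func applyLinkSpeedWarning(dev *pciDevice, read attrReader) {
	// A disabled device (enable == 0) carries no data traffic, so its link
	// state is ignored. No vendor or device ID is consulted.
	// Assumption: if the attribute cannot be read, the normal check still runs.
	if enabled, err := read(dev.BDF, "enable"); err == nil && enabled == 0 {
		return
	}
	if dev.CurGen < dev.MaxGen {
		dev.Status = "Warning"
		dev.ErrorDescription = "PCIe link speed degraded"
	}
}

func main() {
	read := func(bdf, attr string) (int, error) {
		return sysfsIntAttr("/sys/bus/pci/devices", bdf, attr)
	}
	// Placeholder address; substitute a real BDF from the target host.
	dev := pciDevice{BDF: "0000:00:00.0", CurGen: 3, MaxGen: 4}
	applyLinkSpeedWarning(&dev, read)
	fmt.Printf("%s status=%q description=%q\n", dev.BDF, dev.Status, dev.ErrorDescription)
}
```

Passing the reader as a function keeps the decision logic separate from sysfs access, so the disabled and enabled cases can be exercised without real hardware.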

Consequences

  • PCIe fabric management endpoints, IPMI virtual devices, and other permanently disabled PCIe functions no longer produce spurious link-degradation warnings.
  • Real link degradation on active devices (GPUs, NICs, NVMe, NVLink bridges) continues to be detected and reported as before.
  • NVLink bridge cards retain their existing statusCritical path (they are always enabled, so the early return is never taken for them).
  • The Switchtec device on HGX H100 boards shows statusOK with no error_description in the audit JSON.