smartpqi uses scsi_transport_sas but does not register a sas_host
object, so /sys/class/sas_host/host14 does not exist and the existing
SAS detection check does not match this host. Writing to host14/scan then
calls sas_user_scan, which blocks indefinitely on scsi_scan_target's
mutex (confirmed by kernel hung-task traces in the field).
Add a second detection path via /sys/class/scsi_host/hostX/proc_name:
skip hosts whose driver is "smartpqi" or "hpsa" (the latter drives the
HPE Smart Array predecessors, which exhibit the same behaviour).
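The proc_name check can be sketched roughly as follows. This is a minimal illustration, not the code from this change: the names skipScanDrivers, shouldSkipDriver and hostDriverBlocksScan are hypothetical, and treating an unreadable proc_name as "do not skip" is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// skipScanDrivers lists proc_name values whose hosts must not be rescanned.
// (Hypothetical name; the driver list itself comes from the commit message.)
var skipScanDrivers = map[string]bool{
	"smartpqi": true,
	"hpsa":     true,
}

// shouldSkipDriver reports whether a proc_name value, as read from sysfs
// (usually with a trailing newline), names a driver on the skip list.
func shouldSkipDriver(procName string) bool {
	return skipScanDrivers[strings.TrimSpace(procName)]
}

// hostDriverBlocksScan reads <sysfsRoot>/class/scsi_host/<host>/proc_name.
// Assumption: a missing or unreadable file means "do not skip".
func hostDriverBlocksScan(sysfsRoot, host string) bool {
	p := filepath.Join(sysfsRoot, "class", "scsi_host", host, "proc_name")
	data, err := os.ReadFile(p)
	if err != nil {
		return false
	}
	return shouldSkipDriver(string(data))
}

func main() {
	// Read-only demonstration; nothing is written to any scan file.
	for _, host := range []string{"host0", "host14"} {
		fmt.Printf("%s skip=%v\n", host, hostDriverBlocksScan("/sys", host))
	}
}
```

Passing the sysfs root as a parameter keeps the check testable against a temporary directory tree.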
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvme-cli emits smart-log counters as JSON strings and uses field names
avail_spare / percent_used instead of the prose names in the NVMe spec.
The nvmeSmartLog struct had int64 fields with wrong JSON tags, so Unmarshal
returned an error and the whole health block was skipped, leaving every
NVMe drive with status=Unknown.
Fix: switch all numeric fields to jsonInt64 (already used for lsblk
block sizes), which accepts both bare numbers and quoted strings, and
correct the avail_spare / percent_used tag names.
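For illustration, a dual-format integer type along these lines might look like the sketch below. It is not the project's actual jsonInt64: the struct nvmeSmartLogSketch is a trimmed, hypothetical stand-in with only the two fields named above, and parseSmartLog is a helper invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// jsonInt64 decodes from either a bare JSON integer (42) or a quoted
// decimal string ("42"). Fractions and exponents are rejected.
type jsonInt64 int64

// UnmarshalJSON implements json.Unmarshaler.
func (v *jsonInt64) UnmarshalJSON(data []byte) error {
	s := string(data)
	if s == "null" {
		return nil // keep the zero value
	}
	if len(data) > 0 && data[0] == '"' {
		// Quoted form: let encoding/json strip the quotes first.
		if err := json.Unmarshal(data, &s); err != nil {
			return err
		}
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return fmt.Errorf("jsonInt64: %w", err)
	}
	*v = jsonInt64(n)
	return nil
}

// nvmeSmartLogSketch is a hypothetical, trimmed stand-in for the real struct.
type nvmeSmartLogSketch struct {
	AvailSpare  jsonInt64 `json:"avail_spare"`
	PercentUsed jsonInt64 `json:"percent_used"`
}

// parseSmartLog decodes a smart-log JSON document into the sketch struct.
func parseSmartLog(raw string) (nvmeSmartLogSketch, error) {
	var out nvmeSmartLogSketch
	err := json.Unmarshal([]byte(raw), &out)
	return out, err
}

func main() {
	// One field quoted, one bare: both forms decode.
	log, err := parseSmartLog(`{"avail_spare":"100","percent_used":3}`)
	fmt.Println(log.AvailSpare, log.PercentUsed, err)
}
```

Because the type accepts both encodings, the same struct keeps working if the tool's output switches between quoted and bare numbers across versions.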
Also fix validateIsVendorGPU for NVIDIA: it previously counted any NVIDIA
PCIe device (including NVSwitch bridges) as a GPU, producing inflated
estimates (12 instead of 8 on an HGX H100 system). It now requires
device_class to be videocontroller or processingaccelerator, matching
the existing AMD filter logic.
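A minimal sketch of the class filter, under stated assumptions: pciDevice, isGPUClass and countVendorGPUs are hypothetical names, the case-insensitive comparison is an assumption, and "Bridge" is only a placeholder class for the NVSwitch devices.

```go
package main

import (
	"fmt"
	"strings"
)

// pciDevice is a hypothetical, reduced view of a collected PCIe device.
type pciDevice struct {
	Vendor      string
	DeviceClass string
}

// isGPUClass accepts only the two classes named in the commit message.
// Assumption: the comparison is case-insensitive.
func isGPUClass(class string) bool {
	switch strings.ToLower(class) {
	case "videocontroller", "processingaccelerator":
		return true
	default:
		return false
	}
}

// countVendorGPUs counts devices of the given vendor that pass the class filter.
func countVendorGPUs(devs []pciDevice, vendor string) int {
	n := 0
	for _, d := range devs {
		if strings.EqualFold(d.Vendor, vendor) && isGPUClass(d.DeviceClass) {
			n++
		}
	}
	return n
}

func main() {
	// Eight accelerators plus four bridge-class devices, as on the
	// HGX H100 example: only the accelerators should be counted.
	var devs []pciDevice
	for i := 0; i < 8; i++ {
		devs = append(devs, pciDevice{Vendor: "NVIDIA", DeviceClass: "ProcessingAccelerator"})
	}
	for i := 0; i < 4; i++ {
		devs = append(devs, pciDevice{Vendor: "NVIDIA", DeviceClass: "Bridge"})
	}
	fmt.Println(countVendorGPUs(devs, "NVIDIA"))
}
```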
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writing to /sys/class/scsi_host/hostX/scan on SAS controllers (e.g.
Adaptec smartpqi/PM8222-SHBA) triggers sas_user_scan, which blocks
indefinitely and hangs the audit. Skip hosts that appear under
/sys/class/sas_host/, since SAS topology is discovered by the driver.
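The sas_host test amounts to checking for a directory in sysfs. A rough sketch, with hypothetical helper names (sasHostPath, isSASHost) and the assumption that any stat error means "not a SAS host":

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sasHostPath returns the sas_host class directory for a SCSI host name.
func sasHostPath(sysfsRoot, host string) string {
	return filepath.Join(sysfsRoot, "class", "sas_host", host)
}

// isSASHost reports whether the host is registered under sas_host.
// Assumption: any stat error, including "not found", means "not SAS".
func isSASHost(sysfsRoot, host string) bool {
	_, err := os.Stat(sasHostPath(sysfsRoot, host))
	return err == nil
}

func main() {
	hosts, _ := filepath.Glob("/sys/class/scsi_host/host*")
	for _, h := range hosts {
		name := filepath.Base(h)
		if isSASHost("/sys", name) {
			fmt.Println("skip rescan:", name)
			continue
		}
		// The sketch only reports; it never writes to the scan file.
		fmt.Println("would rescan:", name)
	}
}
```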
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- storage: add jsonInt64 dual-format unmarshaler to handle the lsblk output
change in util-linux 2.38 (LOG-SEC/PHY-SEC now emitted as JSON
integers, not quoted strings); fixes SATA disks being invisible on Debian 12
- pcie: detect NVLink bridge mezzanine CX-7 cards (Mellanox x2, no host
net ifaces, DeviceName contains "NVLINK" in lspci -v) and mark them
with device_class="NVLinkBridge"; escalate PCIe link speed downgrade to
Critical for these cards (Gen3 on a fixed internal connector = hardware
fault, not a transient warning)
- pcie: cross-reference nvidia-smi topo to capture NVLink bond counts and
active status for all NVLink bridge cards
- packages: add infiniband-diags to ISO; provides ibstat required by
nvidia-fabricmanager-start.sh to enumerate IB devices before FM launch
(absence causes CUDA_ERROR_SYSTEM_NOT_READY)
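The link-speed escalation for NVLink bridge cards could be expressed as in the sketch below. The severity type, its level names, and linkSpeedSeverity are hypothetical; that other device classes get a Warning on downgrade is an assumption inferred from the wording above.

```go
package main

import "fmt"

// severity is a hypothetical three-level status used only in this sketch.
type severity int

const (
	sevOK severity = iota
	sevWarning
	sevCritical
)

func (s severity) String() string {
	switch s {
	case sevOK:
		return "OK"
	case sevWarning:
		return "Warning"
	default:
		return "Critical"
	}
}

// linkSpeedSeverity rates a PCIe link running at currentGen when the
// device supports maxGen. A downgrade on an NVLinkBridge card is treated
// as a hardware fault; assumption: other classes only warn.
func linkSpeedSeverity(deviceClass string, currentGen, maxGen int) severity {
	if currentGen >= maxGen {
		return sevOK
	}
	if deviceClass == "NVLinkBridge" {
		return sevCritical
	}
	return sevWarning
}

func main() {
	fmt.Println(linkSpeedSeverity("NVLinkBridge", 3, 5))
	fmt.Println(linkSpeedSeverity("NetworkController", 3, 5))
}
```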
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC
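The Virtual HDisk filter is essentially a prefix match. A minimal sketch, assuming the match is made on the device model string; isVirtualBMCDisk and filterPhysical are hypothetical names, and the model strings in main are made up:

```go
package main

import (
	"fmt"
	"strings"
)

// isVirtualBMCDisk reports whether a model string looks like a BMC/iDRAC
// virtual disk. Assumption: the prefix is matched on the model field.
func isVirtualBMCDisk(model string) bool {
	return strings.HasPrefix(strings.TrimSpace(model), "Virtual HDisk")
}

// filterPhysical drops virtual BMC disks and keeps everything else in order.
func filterPhysical(models []string) []string {
	var kept []string
	for _, m := range models {
		if !isVirtualBMCDisk(m) {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	// Illustrative model strings only.
	models := []string{"Virtual HDisk0", "ExampleSSD 960GB", "Virtual HDisk1"}
	fmt.Println(filterPhysical(models))
}
```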
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>