Root cause analysis for device-bound firmware leaking into hardware.firmware
on Supermicro Redfish (SYS-A21GE-NBRT HGX B200):
- collectFirmwareInventory (6c19a58) had no coverage for Supermicro naming.
isDeviceBoundFirmwareName checked "gpu " / "nic " (space-terminated) while
Supermicro uses "GPU1 System Slot0" / "NIC1 System Slot0 ..." (digit suffix).
- 9c5512d added _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns to fix HGX,
but checked DeviceName which contains "Software Inventory" (from Redfish Name),
not the firmware Id. Dead code from day one.
09-testing.md: add firmware filter worked example and rule #4 — verify the
filter checks the field that the collector actually populates.
10-decisions.md: ADL-019 — isDeviceBoundFirmwareName must be extended per
vendor with a test case per vendor format before shipping.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On Supermicro HGX systems (SYS-A21GE-NBRT) ~35 sub-chassis (GPU, NVSwitch,
PCIeRetimer, ERoT/IRoT, BMC, FPGA) all carry ChassisType=Module/Component/Zone
and expose empty /Drives collections. shouldAdaptiveNVMeProbe returned true for
all of them, triggering 35 × 384 = 13 440 HTTP requests → ~22 min wasted per
collection (more than half of total 35 min collection time).
Fix: chassisTypeCanHaveNVMe returns false for Module, Component, Zone. The
candidate selection loop in collectRawRedfishTree now checks the parent chassis
doc before adding a /Drives path to the probe list. Enclosure (NVMe backplane),
RackMount, and unknown types are unaffected.
Tests:
- TestChassisTypeCanHaveNVMe: table-driven, covers excluded and storage-capable types
- TestNVMePostProbeSkipsNonStorageChassis: topology integration, GPU chassis +
backplane with empty /Drives → exactly 1 candidate selected (backplane only)
Docs:
- ADL-018 in bible-local/10-decisions.md
- Candidate-selection test matrix in bible-local/09-testing.md
- SYS-A21GE-NBRT baseline row in docs/test_server_collection_memory.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three bugs, all related to GPU dedup in the Redfish replay pipeline:
1. collectGPUsFromProcessors (redfish_replay.go): GPU-type Processor entries
(Systems/HGX_Baseboard_0/Processors/GPU_SXM_N) were not deduplicated against
existing PCIeDevice GPUs on Supermicro HGX. The chassis-ID lookup keyed on
processor Id ("GPU_SXM_1") but the chassis is named "HGX_GPU_SXM_1" — lookup
returned nothing, serial stayed empty, UUID was unseen → 8 duplicate GPU rows.
Fix: read SerialNumber directly from the Processor doc first; chassis lookup
is now a fallback override (as it was designed for MSI).
2. looksLikeGPU (redfish.go): NVSwitch PCIe devices (Model="NVSwitch",
Manufacturer="NVIDIA") were classified as GPUs because "nvidia" matched the
GPU hint list. Fix: early return false when Model contains "nvswitch".
3. gpuDocDedupKey (redfish.go): commit 9df29b1 changed the dedup key to prefer
slot|model before path, which collapsed two distinct GPUs with identical model
names in GraphicsControllers into one entry. Fix: only serial and BDF are used
as cross-path stable dedup keys; fall back to Redfish path when neither is
present. This also restores TestReplayCollectGPUs_DedupUsesRedfishPathBeforeHeuristics
which had been broken on main since 9df29b1.
Added tests:
- TestCollectGPUsFromProcessors_SupermicroHGX: Processor GPU dedup when
chassis-ID naming convention does not match processor Id
- TestReplayCollectGPUs_DedupCrossChassisSerial: same GPU via two Chassis
PCIeDevice trees with matching serials → collapsed to one
- TestLooksLikeGPU_NVSwitchExcluded: NVSwitch is not a GPU
Added rule to bible-local/09-testing.md: dedup/filter/classify functions must
cover true-positive, true-negative, and the vendor counter-case axes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parser / archive:
- Add .sds extension as tar-format alias (archive.go)
- Add tests for multipart upload size limits (multipart_limits_test.go)
- Remove supermicro crashdump parser (ADL-015)
Dell parser:
- Remove GPU duplicates from PCIeDevices (DCIM_VideoView vs DCIM_PCIDeviceView
both list the same GPU; VideoView record is authoritative)
Server:
- Add LOGPILE_CONVERT_MAX_MB env var for independent convert batch size limit
- Improve "file too large" error message with current limit value
Web:
- Add CONVERT_MAX_FILES_PER_BATCH = 1000 cap
- Minor UI copy and CSS fixes
Bible:
- bible-local/06-parsers.md: add pci.ids enrichment rule (enrich model from
pciids when name is empty but vendor_id+device_id are present)
- Sync bible submodule and local overview/architecture docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Dell NICView: strip " - XX:XX:XX:XX:XX:XX" suffix from ProductName
(Dell TSR embeds MAC in this field for every NIC port)
- Dell SoftwareIdentity: same strip applied to ElementName; store FQDD
in FirmwareInfo.Description so exporter can filter device-bound entries
- Exporter: add isDeviceBoundFirmwareFQDD() to filter firmware entries
whose Description matches NIC./PSU./Disk./RAID.Backplane./GPU. FQDD
prefixes (prevents device firmware from appearing in hardware.firmware)
- Exporter: extend isDeviceBoundFirmwareName() to filter HGX GPU/NVSwitch
firmware inventory IDs (_fw_gpu_, _fw_nvswitch_, _inforom_gpu_)
- Inspur: remove HDD firmware from Hardware.Firmware — already present
in Storage.Firmware, duplicating it violates ADL-016
- bible-local/06-parsers.md: document firmware and MAC stripping rules
- bible-local/10-decisions.md: add ADL-016 (device-bound firmware) and
ADL-017 (vendor-embedded MAC in model name fields)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add bible.git as submodule at bible/
- Move docs/bible/ → bible-local/ (project-specific architecture)
- Update CLAUDE.md to reference both bible/ and bible-local/
- Add AGENTS.md for Codex with same structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>