From 47bb0ee9391b5004c05421d4b061a3ec88dccbf6 Mon Sep 17 00:00:00 2001 From: Mikhail Chusavitin Date: Thu, 12 Mar 2026 14:03:47 +0300 Subject: [PATCH] docs: document firmware filter regression pattern in bible (ADL-019) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Root cause analysis for device-bound firmware leaking into hardware.firmware on Supermicro Redfish (SYS-A21GE-NBRT HGX B200): - collectFirmwareInventory (6c19a58) had no coverage for Supermicro naming. isDeviceBoundFirmwareName checked "gpu " / "nic " (space-terminated) while Supermicro uses "GPU1 System Slot0" / "NIC1 System Slot0 ..." (digit suffix). - 9c5512d added _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns to fix HGX, but checked DeviceName which contains "Software Inventory" (from Redfish Name), not the firmware Id. Dead code from day one. 09-testing.md: add firmware filter worked example and rule #4 — verify the filter checks the field that the collector actually populates. 10-decisions.md: ADL-019 — isDeviceBoundFirmwareName must be extended per vendor with a test case per vendor format before shipping. Co-Authored-By: Claude Sonnet 4.6 --- bible-local/09-testing.md | 22 ++++++++++++++++++++++ bible-local/10-decisions.md | 25 +++++++++++++++++++++++++ 2 files changed, 47 insertions(+) diff --git a/bible-local/09-testing.md b/bible-local/09-testing.md index 7e5d03b..0825e61 100644 --- a/bible-local/09-testing.md +++ b/bible-local/09-testing.md @@ -71,12 +71,34 @@ TestXxx_KeepsDistinct — two different items with same model → 2 res TestXxx_ — the specific vendor/setup that triggered the code ``` +### Worked example — firmware filter regression (2026-03-12) + +`collectFirmwareInventory` was added in `6c19a58` without coverage for Supermicro naming. +`isDeviceBoundFirmwareName` had patterns for Dell-style names (`"GPU SomeDevice"`, `"NIC OnboardLAN"`) +but Supermicro Redfish uses `"GPU1 System Slot0"` and `"NIC1 System Slot0 ..."` — digit follows +immediately after the type prefix. 29 device-bound entries leaked into `hardware.firmware`. + +`9c5512d` attempted to fix this with HGX ID patterns (`_fw_gpu_`, etc.) in the wrong field: +the filter checked `DeviceName` but `collectFirmwareInventory` populates it from `Name` first +(`"Software Inventory"` for all HGX per-component slots), not from the `Id` field that contains +the firmware ID like `"HGX_FW_GPU_SXM_1"`. The patterns were effectively dead code from day one. + +**Required test matrix for any filter function:** + +``` +TestXxx_FiltersDeviceBound_Dell — Dell-style names that motivated the original code +TestXxx_FiltersDeviceBound_Supermicro — Supermicro names with digit suffix (GPU1/NIC1) +TestXxx_KeepsSystemLevel — BIOS, BMC, CPLD names must NOT be filtered +``` + ### Practical rule When you write a new filter/dedup/classify function, ask: 1. Does my test cover the vendor that motivated this code? 2. Does my test cover a *different* vendor or naming convention where the function must NOT fire? 3. If I change the dedup key logic, do existing tests still exercise the old correct behavior? +4. When the filter checks a field on a model struct, does my test verify that the field is + actually populated by the collector? (Dead-code filter pattern: `9c5512d` `_fw_gpu_` check.) If any answer is "no" — add the missing test before committing. diff --git a/bible-local/10-decisions.md b/bible-local/10-decisions.md index 4953c7a..93714d1 100644 --- a/bible-local/10-decisions.md +++ b/bible-local/10-decisions.md @@ -274,4 +274,29 @@ for `Enclosure`, `RackMount`, and any unrecognised type (fail-safe). both the excluded types and the storage-capable types (see `TestChassisTypeCanHaveNVMe` and `TestNVMePostProbeSkipsNonStorageChassis`). +## ADL-019 — isDeviceBoundFirmwareName must cover vendor-specific naming patterns per vendor + +**Date:** 2026-03-12 +**Context:** `isDeviceBoundFirmwareName` was written to filter Dell-style device firmware names +(`"GPU SomeDevice"`, `"NIC OnboardLAN"`). When Supermicro Redfish FirmwareInventory was added +(`6c19a58`), no Supermicro-specific patterns were added. Supermicro names a NIC entry +`"NIC1 System Slot0 AOM-DP805-IO"` — a digit follows the type prefix directly, bypassing the +`"nic "` (space-terminated) check. 29 device-bound entries leaked into `hardware.firmware` on +SYS-A21GE-NBRT (HGX B200). Commit `9c5512d` attempted a fix by adding `_fw_gpu_` patterns, +but checked `DeviceName` which contains `"Software Inventory"` (from the Redfish `Name` field), +not the firmware inventory ID. The patterns were dead code from the moment they were committed. +**Decision:** +- `isDeviceBoundFirmwareName` must be extended for each new vendor whose FirmwareInventory + naming convention differs from the existing patterns. +- When adding HGX/Supermicro patterns, check that the pattern matches the field value that + `collectFirmwareInventory` actually stores — trace the data path from Redfish doc to + `FirmwareInfo.DeviceName` before writing the condition. +- `TestIsDeviceBoundFirmwareName` must contain at least one case per vendor format. +**Consequences:** +- New vendors with FirmwareInventory support require a test covering both device-bound names + (must return true) and system-level names (must return false) before the code ships. +- The dead `_fw_gpu_` / `_fw_nvswitch_` / `_inforom_gpu_` patterns were replaced with + correct prefix+digit checks (`"gpu" + digit`, `"nic" + digit`) and explicit string checks + (`"nvmecontroller"`, `"power supply"`, `"software inventory"`). +