Files
logpile/bible-local/09-testing.md
Mikhail Chusavitin 47bb0ee939 docs: document firmware filter regression pattern in bible (ADL-019)
Root cause analysis for device-bound firmware leaking into hardware.firmware
on Supermicro Redfish (SYS-A21GE-NBRT HGX B200):

- collectFirmwareInventory (6c19a58) had no coverage for Supermicro naming.
  isDeviceBoundFirmwareName checked "gpu " / "nic " (space-terminated) while
  Supermicro uses "GPU1 System Slot0" / "NIC1 System Slot0 ..." (digit suffix).

- 9c5512d added _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns to fix HGX,
  but checked DeviceName which contains "Software Inventory" (from Redfish Name),
  not the firmware Id. Dead code from day one.

09-testing.md: add firmware filter worked example and rule #4 — verify the
filter checks the field that the collector actually populates.

10-decisions.md: ADL-019 — isDeviceBoundFirmwareName must be extended per
vendor with a test case per vendor format before shipping.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 14:03:47 +03:00

6.0 KiB
Raw Blame History

09 — Testing

Required before merge

go test ./...

All tests must pass before any change is merged.

Where to add tests

Change area Test location
Collectors internal/collector/*_test.go
HTTP handlers internal/server/*_test.go
Exporters internal/exporter/*_test.go
Parsers internal/parser/vendors/<vendor>/*_test.go

Exporter tests

The Reanimator exporter has comprehensive coverage:

Test file Coverage
reanimator_converter_test.go Unit tests per conversion function
reanimator_integration_test.go Full export with realistic AnalysisResult

Run exporter tests only:

go test ./internal/exporter/...
go test ./internal/exporter/... -v -run Reanimator
go test ./internal/exporter/... -cover

Guidelines

  • Prefer table-driven tests for parsing logic (multiple input variants).
  • Do not rely on network access in unit tests.
  • Test both the happy path and edge cases (missing fields, empty collections).
  • When adding a new vendor parser, include at minimum:
    • Detect() test with a positive and a negative sample file list.
    • Parse() test with a minimal but representative archive.

Dedup and filtering functions — mandatory coverage

Any function that deduplicates, filters, or classifies hardware inventory items must have tests covering all three axes before the code is considered done:

Axis What to test Why
True positive Items that ARE duplicates are collapsed to one Proves the function works
True negative Items that are NOT duplicates are kept separate Proves the function doesn't over-collapse
Counter-case The scenario that motivated the original code still works after changes Prevents regression from future fixes

Worked example — GPU dedup regression (2026-03-11)

collectGPUsFromProcessors was added for MSI (chassis Id matches processor Id). No tests → when Supermicro HGX arrived (chassis Id = "HGX_GPU_SXM_1", processor Id = "GPU_SXM_1"), the chassis lookup silently returned nothing, serial stayed empty, UUID was new → 8 duplicate GPUs.

Simultaneously, fixing gpuDocDedupKey to use slot|model before path collapsed two distinct GraphicsControllers GPUs with the same model into one — breaking an existing test that had no counter-case for the path-fallback scenario.

Required test matrix for any dedup function:

TestXxx_CollapsesDuplicates   — same item via two sources → 1 result
TestXxx_KeepsDistinct          — two different items with same model → 2 results
TestXxx_<VendorThatMotivated>  — the specific vendor/setup that triggered the code

Worked example — firmware filter regression (2026-03-12)

collectFirmwareInventory was added in 6c19a58 without coverage for Supermicro naming. isDeviceBoundFirmwareName had patterns for Dell-style names ("GPU SomeDevice", "NIC OnboardLAN") but Supermicro Redfish uses "GPU1 System Slot0" and "NIC1 System Slot0 ..." — digit follows immediately after the type prefix. 29 device-bound entries leaked into hardware.firmware.

9c5512d attempted to fix this with HGX ID patterns (_fw_gpu_, etc.) in the wrong field: the filter checked DeviceName but collectFirmwareInventory populates it from Name first ("Software Inventory" for all HGX per-component slots), not from the Id field that contains the firmware ID like "HGX_FW_GPU_SXM_1". The patterns were effectively dead code from day one.

Required test matrix for any filter function:

TestXxx_FiltersDeviceBound_Dell         — Dell-style names that motivated the original code
TestXxx_FiltersDeviceBound_Supermicro   — Supermicro names with digit suffix (GPU1/NIC1)
TestXxx_KeepsSystemLevel                — BIOS, BMC, CPLD names must NOT be filtered

Practical rule

When you write a new filter/dedup/classify function, ask:

  1. Does my test cover the vendor that motivated this code?
  2. Does my test cover a different vendor or naming convention where the function must NOT fire?
  3. If I change the dedup key logic, do existing tests still exercise the old correct behavior?
  4. When the filter checks a field on a model struct, does my test verify that the field is actually populated by the collector? (Dead-code filter pattern: 9c5512d _fw_gpu_ check.)

If any answer is "no" — add the missing test before committing.

Collector candidate-selection functions — mandatory coverage

Any function that selects paths for an expensive operation (probing, crawling, plan-B retry) must have tests covering:

Axis What to test Why
Positive Paths that should be selected ARE selected Proves the feature works
Negative Paths that should be excluded ARE excluded Prevents runaway I/O
Topology integration Given a realistic out map, the count of selected paths matches expectations Catches implicit coupling between the selector and the surrounding data shape

Worked example — NVMe post-probe regression (2026-03-12)

shouldAdaptiveNVMeProbe was added in 2fa4a12 for Supermicro NVMe backplanes that return Members: [] but serve disks at Disk.Bay.N paths. No topology-level test was added.

When SYS-A21GE-NBRT (HGX B200) arrived, its 35 sub-chassis (GPU, NVSwitch, PCIeRetimer, ERoT, IRoT, BMC, FPGA) all have ChassisType=Module/Component/Zone and empty /Drives → all 35 passed the filter → 35 × 384 = 13 440 HTTP requests → 22 min extra per collection.

A topology integration test (TestNVMePostProbeSkipsNonStorageChassis) would have caught this at commit time: given GPU chassis + backplane, exactly 1 candidate must be selected.

Required test matrix for any path-selection function:

TestXxx_SelectsTargetPath        — the path that motivated the code IS selected
TestXxx_SkipsIrrelevantPath      — a path that must never be selected IS skipped
TestXxx_TopologyCount            — given a realistic multi-chassis map, selected count = N