Files

Mikhail Chusavitin 1eb639e6bf redfish: skip NVMe bay probe for non-storage chassis types (Module/Component/Zone)

On Supermicro HGX systems (SYS-A21GE-NBRT) ~35 sub-chassis (GPU, NVSwitch,
PCIeRetimer, ERoT/IRoT, BMC, FPGA) all carry ChassisType=Module/Component/Zone
and expose empty /Drives collections. shouldAdaptiveNVMeProbe returned true for
all of them, triggering 35 × 384 = 13 440 HTTP requests → ~22 min wasted per
collection (more than half of total 35 min collection time).

Fix: chassisTypeCanHaveNVMe returns false for Module, Component, Zone. The
candidate selection loop in collectRawRedfishTree now checks the parent chassis
doc before adding a /Drives path to the probe list. Enclosure (NVMe backplane),
RackMount, and unknown types are unaffected.

Tests:
- TestChassisTypeCanHaveNVMe: table-driven, covers excluded and storage-capable types
- TestNVMePostProbeSkipsNonStorageChassis: topology integration, GPU chassis +
  backplane with empty /Drives → exactly 1 candidate selected (backplane only)

Docs:
- ADL-018 in bible-local/10-decisions.md
- Candidate-selection test matrix in bible-local/09-testing.md
- SYS-A21GE-NBRT baseline row in docs/test_server_collection_memory.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-12 13:38:29 +03:00

4.7 KiB

Raw Blame History

09 — Testing

Required before merge

go test ./...

All tests must pass before any change is merged.

Where to add tests

Change area	Test location
Collectors	`internal/collector/*_test.go`
HTTP handlers	`internal/server/*_test.go`
Exporters	`internal/exporter/*_test.go`
Parsers	`internal/parser/vendors/<vendor>/*_test.go`

Exporter tests

The Reanimator exporter has comprehensive coverage:

Test file	Coverage
`reanimator_converter_test.go`	Unit tests per conversion function
`reanimator_integration_test.go`	Full export with realistic `AnalysisResult`

Run exporter tests only:

go test ./internal/exporter/...
go test ./internal/exporter/... -v -run Reanimator
go test ./internal/exporter/... -cover

Guidelines

Prefer table-driven tests for parsing logic (multiple input variants).
Do not rely on network access in unit tests.
Test both the happy path and edge cases (missing fields, empty collections).
When adding a new vendor parser, include at minimum:
- Detect() test with a positive and a negative sample file list.
- Parse() test with a minimal but representative archive.

Dedup and filtering functions — mandatory coverage

Any function that deduplicates, filters, or classifies hardware inventory items must have tests covering all three axes before the code is considered done:

Axis	What to test	Why
True positive	Items that ARE duplicates are collapsed to one	Proves the function works
True negative	Items that are NOT duplicates are kept separate	Proves the function doesn't over-collapse
Counter-case	The scenario that motivated the original code still works after changes	Prevents regression from future fixes

Worked example — GPU dedup regression (2026-03-11)

collectGPUsFromProcessors was added for MSI (chassis Id matches processor Id). No tests → when Supermicro HGX arrived (chassis Id = "HGX_GPU_SXM_1", processor Id = "GPU_SXM_1"), the chassis lookup silently returned nothing, serial stayed empty, UUID was new → 8 duplicate GPUs.

Simultaneously, fixing gpuDocDedupKey to use slot|model before path collapsed two distinct GraphicsControllers GPUs with the same model into one — breaking an existing test that had no counter-case for the path-fallback scenario.

Required test matrix for any dedup function:

TestXxx_CollapsesDuplicates   — same item via two sources → 1 result
TestXxx_KeepsDistinct          — two different items with same model → 2 results
TestXxx_<VendorThatMotivated>  — the specific vendor/setup that triggered the code

Practical rule

When you write a new filter/dedup/classify function, ask:

Does my test cover the vendor that motivated this code?
Does my test cover a different vendor or naming convention where the function must NOT fire?
If I change the dedup key logic, do existing tests still exercise the old correct behavior?

If any answer is "no" — add the missing test before committing.

Collector candidate-selection functions — mandatory coverage

Any function that selects paths for an expensive operation (probing, crawling, plan-B retry) must have tests covering:

Axis	What to test	Why
Positive	Paths that should be selected ARE selected	Proves the feature works
Negative	Paths that should be excluded ARE excluded	Prevents runaway I/O
Topology integration	Given a realistic `out` map, the count of selected paths matches expectations	Catches implicit coupling between the selector and the surrounding data shape

Worked example — NVMe post-probe regression (2026-03-12)

shouldAdaptiveNVMeProbe was added in 2fa4a12 for Supermicro NVMe backplanes that return Members: [] but serve disks at Disk.Bay.N paths. No topology-level test was added.

When SYS-A21GE-NBRT (HGX B200) arrived, its 35 sub-chassis (GPU, NVSwitch, PCIeRetimer, ERoT, IRoT, BMC, FPGA) all have ChassisType=Module/Component/Zone and empty /Drives → all 35 passed the filter → 35 × 384 = 13 440 HTTP requests → 22 min extra per collection.

A topology integration test (TestNVMePostProbeSkipsNonStorageChassis) would have caught this at commit time: given GPU chassis + backplane, exactly 1 candidate must be selected.

Required test matrix for any path-selection function:

TestXxx_SelectsTargetPath        — the path that motivated the code IS selected
TestXxx_SkipsIrrelevantPath      — a path that must never be selected IS skipped
TestXxx_TopologyCount            — given a realistic multi-chassis map, selected count = N

4.7 KiB Raw Blame History Unescape Escape