Root cause analysis for device-bound firmware leaking into hardware.firmware on Supermicro Redfish (SYS-A21GE-NBRT HGX B200): - collectFirmwareInventory (6c19a58) had no coverage for Supermicro naming. isDeviceBoundFirmwareName checked "gpu " / "nic " (space-terminated) while Supermicro uses "GPU1 System Slot0" / "NIC1 System Slot0 ..." (digit suffix). -9c5512dadded _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns to fix HGX, but checked DeviceName which contains "Software Inventory" (from Redfish Name), not the firmware Id. Dead code from day one. 09-testing.md: add firmware filter worked example and rule #4 — verify the filter checks the field that the collector actually populates. 10-decisions.md: ADL-019 — isDeviceBoundFirmwareName must be extended per vendor with a test case per vendor format before shipping. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
135 lines
6.0 KiB
Markdown
135 lines
6.0 KiB
Markdown
# 09 — Testing
|
||
|
||
## Required before merge
|
||
|
||
```bash
|
||
go test ./...
|
||
```
|
||
|
||
All tests must pass before any change is merged.
|
||
|
||
## Where to add tests
|
||
|
||
| Change area | Test location |
|
||
|-------------|---------------|
|
||
| Collectors | `internal/collector/*_test.go` |
|
||
| HTTP handlers | `internal/server/*_test.go` |
|
||
| Exporters | `internal/exporter/*_test.go` |
|
||
| Parsers | `internal/parser/vendors/<vendor>/*_test.go` |
|
||
|
||
## Exporter tests
|
||
|
||
The Reanimator exporter has comprehensive coverage:
|
||
|
||
| Test file | Coverage |
|
||
|-----------|----------|
|
||
| `reanimator_converter_test.go` | Unit tests per conversion function |
|
||
| `reanimator_integration_test.go` | Full export with realistic `AnalysisResult` |
|
||
|
||
Run exporter tests only:
|
||
```bash
|
||
go test ./internal/exporter/...
|
||
go test ./internal/exporter/... -v -run Reanimator
|
||
go test ./internal/exporter/... -cover
|
||
```
|
||
|
||
## Guidelines
|
||
|
||
- Prefer table-driven tests for parsing logic (multiple input variants).
|
||
- Do not rely on network access in unit tests.
|
||
- Test both the happy path and edge cases (missing fields, empty collections).
|
||
- When adding a new vendor parser, include at minimum:
|
||
- `Detect()` test with a positive and a negative sample file list.
|
||
- `Parse()` test with a minimal but representative archive.
|
||
|
||
## Dedup and filtering functions — mandatory coverage
|
||
|
||
Any function that deduplicates, filters, or classifies hardware inventory items
|
||
**must** have tests covering all three axes before the code is considered done:
|
||
|
||
| Axis | What to test | Why |
|
||
|------|-------------|-----|
|
||
| **True positive** | Items that ARE duplicates are collapsed to one | Proves the function works |
|
||
| **True negative** | Items that are NOT duplicates are kept separate | Proves the function doesn't over-collapse |
|
||
| **Counter-case** | The scenario that motivated the original code still works after changes | Prevents regression from future fixes |
|
||
|
||
### Worked example — GPU dedup regression (2026-03-11)
|
||
|
||
`collectGPUsFromProcessors` was added for MSI (chassis Id matches processor Id).
|
||
No tests → when Supermicro HGX arrived (chassis Id = "HGX_GPU_SXM_1", processor Id = "GPU_SXM_1"),
|
||
the chassis lookup silently returned nothing, serial stayed empty, UUID was new → 8 duplicate GPUs.
|
||
|
||
Simultaneously, fixing `gpuDocDedupKey` to use `slot|model` before path collapsed two distinct
|
||
GraphicsControllers GPUs with the same model into one — breaking an existing test that had no
|
||
counter-case for the path-fallback scenario.
|
||
|
||
**Required test matrix for any dedup function:**
|
||
|
||
```
|
||
TestXxx_CollapsesDuplicates — same item via two sources → 1 result
|
||
TestXxx_KeepsDistinct — two different items with same model → 2 results
|
||
TestXxx_<VendorThatMotivated> — the specific vendor/setup that triggered the code
|
||
```
|
||
|
||
### Worked example — firmware filter regression (2026-03-12)
|
||
|
||
`collectFirmwareInventory` was added in `6c19a58` without coverage for Supermicro naming.
|
||
`isDeviceBoundFirmwareName` had patterns for Dell-style names (`"GPU SomeDevice"`, `"NIC OnboardLAN"`)
|
||
but Supermicro Redfish uses `"GPU1 System Slot0"` and `"NIC1 System Slot0 ..."` — digit follows
|
||
immediately after the type prefix. 29 device-bound entries leaked into `hardware.firmware`.
|
||
|
||
`9c5512d` attempted to fix this with HGX ID patterns (`_fw_gpu_`, etc.) in the wrong field:
|
||
the filter checked `DeviceName` but `collectFirmwareInventory` populates it from `Name` first
|
||
(`"Software Inventory"` for all HGX per-component slots), not from the `Id` field that contains
|
||
the firmware ID like `"HGX_FW_GPU_SXM_1"`. The patterns were effectively dead code from day one.
|
||
|
||
**Required test matrix for any filter function:**
|
||
|
||
```
|
||
TestXxx_FiltersDeviceBound_Dell — Dell-style names that motivated the original code
|
||
TestXxx_FiltersDeviceBound_Supermicro — Supermicro names with digit suffix (GPU1/NIC1)
|
||
TestXxx_KeepsSystemLevel — BIOS, BMC, CPLD names must NOT be filtered
|
||
```
|
||
|
||
### Practical rule
|
||
|
||
When you write a new filter/dedup/classify function, ask:
|
||
1. Does my test cover the vendor that motivated this code?
|
||
2. Does my test cover a *different* vendor or naming convention where the function must NOT fire?
|
||
3. If I change the dedup key logic, do existing tests still exercise the old correct behavior?
|
||
4. When the filter checks a field on a model struct, does my test verify that the field is
|
||
actually populated by the collector? (Dead-code filter pattern: `9c5512d` `_fw_gpu_` check.)
|
||
|
||
If any answer is "no" — add the missing test before committing.
|
||
|
||
## Collector candidate-selection functions — mandatory coverage
|
||
|
||
Any function that selects paths for an expensive operation (probing, crawling, plan-B retry)
|
||
**must** have tests covering:
|
||
|
||
| Axis | What to test | Why |
|
||
|------|-------------|-----|
|
||
| **Positive** | Paths that should be selected ARE selected | Proves the feature works |
|
||
| **Negative** | Paths that should be excluded ARE excluded | Prevents runaway I/O |
|
||
| **Topology integration** | Given a realistic `out` map, the count of selected paths matches expectations | Catches implicit coupling between the selector and the surrounding data shape |
|
||
|
||
### Worked example — NVMe post-probe regression (2026-03-12)
|
||
|
||
`shouldAdaptiveNVMeProbe` was added in `2fa4a12` for Supermicro NVMe backplanes that return
|
||
`Members: []` but serve disks at `Disk.Bay.N` paths. No topology-level test was added.
|
||
|
||
When SYS-A21GE-NBRT (HGX B200) arrived, its 35 sub-chassis (GPU, NVSwitch, PCIeRetimer,
|
||
ERoT, IRoT, BMC, FPGA) all have `ChassisType=Module/Component/Zone` and empty `/Drives` →
|
||
all 35 passed the filter → 35 × 384 = 13 440 HTTP requests → 22 min extra per collection.
|
||
|
||
A topology integration test (`TestNVMePostProbeSkipsNonStorageChassis`) would have caught
|
||
this at commit time: given GPU chassis + backplane, exactly 1 candidate must be selected.
|
||
|
||
**Required test matrix for any path-selection function:**
|
||
|
||
```
|
||
TestXxx_SelectsTargetPath — the path that motivated the code IS selected
|
||
TestXxx_SkipsIrrelevantPath — a path that must never be selected IS skipped
|
||
TestXxx_TopologyCount — given a realistic multi-chassis map, selected count = N
|
||
```
|