Files
logpile/bible-local/09-testing.md
Mikhail Chusavitin 47bb0ee939 docs: document firmware filter regression pattern in bible (ADL-019)
Root cause analysis for device-bound firmware leaking into hardware.firmware
on Supermicro Redfish (SYS-A21GE-NBRT HGX B200):

- collectFirmwareInventory (6c19a58) had no coverage for Supermicro naming.
  isDeviceBoundFirmwareName checked "gpu " / "nic " (space-terminated) while
  Supermicro uses "GPU1 System Slot0" / "NIC1 System Slot0 ..." (digit suffix).

- 9c5512d added _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns to fix HGX,
  but checked DeviceName which contains "Software Inventory" (from Redfish Name),
  not the firmware Id. Dead code from day one.

09-testing.md: add firmware filter worked example and rule #4 — verify the
filter checks the field that the collector actually populates.

10-decisions.md: ADL-019 — isDeviceBoundFirmwareName must be extended per
vendor with a test case per vendor format before shipping.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 14:03:47 +03:00

135 lines
6.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 09 — Testing
## Required before merge
```bash
go test ./...
```
All tests must pass before any change is merged.
## Where to add tests
| Change area | Test location |
|-------------|---------------|
| Collectors | `internal/collector/*_test.go` |
| HTTP handlers | `internal/server/*_test.go` |
| Exporters | `internal/exporter/*_test.go` |
| Parsers | `internal/parser/vendors/<vendor>/*_test.go` |
## Exporter tests
The Reanimator exporter has comprehensive coverage:
| Test file | Coverage |
|-----------|----------|
| `reanimator_converter_test.go` | Unit tests per conversion function |
| `reanimator_integration_test.go` | Full export with realistic `AnalysisResult` |
Run exporter tests only:
```bash
go test ./internal/exporter/...
go test ./internal/exporter/... -v -run Reanimator
go test ./internal/exporter/... -cover
```
## Guidelines
- Prefer table-driven tests for parsing logic (multiple input variants).
- Do not rely on network access in unit tests.
- Test both the happy path and edge cases (missing fields, empty collections).
- When adding a new vendor parser, include at minimum:
- `Detect()` test with a positive and a negative sample file list.
- `Parse()` test with a minimal but representative archive.
## Dedup and filtering functions — mandatory coverage
Any function that deduplicates, filters, or classifies hardware inventory items
**must** have tests covering all three axes before the code is considered done:
| Axis | What to test | Why |
|------|-------------|-----|
| **True positive** | Items that ARE duplicates are collapsed to one | Proves the function works |
| **True negative** | Items that are NOT duplicates are kept separate | Proves the function doesn't over-collapse |
| **Counter-case** | The scenario that motivated the original code still works after changes | Prevents regression from future fixes |
### Worked example — GPU dedup regression (2026-03-11)
`collectGPUsFromProcessors` was added for MSI (chassis Id matches processor Id).
No tests → when Supermicro HGX arrived (chassis Id = "HGX_GPU_SXM_1", processor Id = "GPU_SXM_1"),
the chassis lookup silently returned nothing, serial stayed empty, UUID was new → 8 duplicate GPUs.
Simultaneously, fixing `gpuDocDedupKey` to use `slot|model` before path collapsed two distinct
GraphicsControllers GPUs with the same model into one — breaking an existing test that had no
counter-case for the path-fallback scenario.
**Required test matrix for any dedup function:**
```
TestXxx_CollapsesDuplicates — same item via two sources → 1 result
TestXxx_KeepsDistinct — two different items with same model → 2 results
TestXxx_<VendorThatMotivated> — the specific vendor/setup that triggered the code
```
### Worked example — firmware filter regression (2026-03-12)
`collectFirmwareInventory` was added in `6c19a58` without coverage for Supermicro naming.
`isDeviceBoundFirmwareName` had patterns for Dell-style names (`"GPU SomeDevice"`, `"NIC OnboardLAN"`)
but Supermicro Redfish uses `"GPU1 System Slot0"` and `"NIC1 System Slot0 ..."` — digit follows
immediately after the type prefix. 29 device-bound entries leaked into `hardware.firmware`.
`9c5512d` attempted to fix this with HGX ID patterns (`_fw_gpu_`, etc.) in the wrong field:
the filter checked `DeviceName` but `collectFirmwareInventory` populates it from `Name` first
(`"Software Inventory"` for all HGX per-component slots), not from the `Id` field that contains
the firmware ID like `"HGX_FW_GPU_SXM_1"`. The patterns were effectively dead code from day one.
**Required test matrix for any filter function:**
```
TestXxx_FiltersDeviceBound_Dell — Dell-style names that motivated the original code
TestXxx_FiltersDeviceBound_Supermicro — Supermicro names with digit suffix (GPU1/NIC1)
TestXxx_KeepsSystemLevel — BIOS, BMC, CPLD names must NOT be filtered
```
### Practical rule
When you write a new filter/dedup/classify function, ask:
1. Does my test cover the vendor that motivated this code?
2. Does my test cover a *different* vendor or naming convention where the function must NOT fire?
3. If I change the dedup key logic, do existing tests still exercise the old correct behavior?
4. When the filter checks a field on a model struct, does my test verify that the field is
actually populated by the collector? (Dead-code filter pattern: `9c5512d` `_fw_gpu_` check.)
If any answer is "no" — add the missing test before committing.
## Collector candidate-selection functions — mandatory coverage
Any function that selects paths for an expensive operation (probing, crawling, plan-B retry)
**must** have tests covering:
| Axis | What to test | Why |
|------|-------------|-----|
| **Positive** | Paths that should be selected ARE selected | Proves the feature works |
| **Negative** | Paths that should be excluded ARE excluded | Prevents runaway I/O |
| **Topology integration** | Given a realistic `out` map, the count of selected paths matches expectations | Catches implicit coupling between the selector and the surrounding data shape |
### Worked example — NVMe post-probe regression (2026-03-12)
`shouldAdaptiveNVMeProbe` was added in `2fa4a12` for Supermicro NVMe backplanes that return
`Members: []` but serve disks at `Disk.Bay.N` paths. No topology-level test was added.
When SYS-A21GE-NBRT (HGX B200) arrived, its 35 sub-chassis (GPU, NVSwitch, PCIeRetimer,
ERoT, IRoT, BMC, FPGA) all have `ChassisType=Module/Component/Zone` and empty `/Drives`
all 35 passed the filter → 35 × 384 = 13 440 HTTP requests → 22 min extra per collection.
A topology integration test (`TestNVMePostProbeSkipsNonStorageChassis`) would have caught
this at commit time: given GPU chassis + backplane, exactly 1 candidate must be selected.
**Required test matrix for any path-selection function:**
```
TestXxx_SelectsTargetPath — the path that motivated the code IS selected
TestXxx_SkipsIrrelevantPath — a path that must never be selected IS skipped
TestXxx_TopologyCount — given a realistic multi-chassis map, selected count = N
```