# 10 — Architectural Decision Log (ADL)

> **Rule:** Every significant architectural decision **must be recorded here** before or alongside
> the code change. This applies to humans and AI assistants alike.
>
> Format: date · title · context · decision · consequences

---

## ADL-001 — In-memory only state (no database)

**Date:** project start
**Context:** LOGPile is designed as a standalone diagnostic tool, not a persistent service.
**Decision:** All parsed/collected data lives in `Server.result` (in-memory). No database, no files written.
**Consequences:**
- Data is lost on process restart — intentional.
- Simple deployment: single binary, no setup required.
- JSON export is the persistence mechanism for users who want to save results.

---

## ADL-002 — Vendor parser auto-registration via init()

**Date:** project start
**Context:** Need an extensible parser registry without a central factory function.
**Decision:** Each vendor parser registers itself in its package's `init()` function.
`vendors/vendors.go` holds blank imports to trigger registration.
**Consequences:**
- Adding a new parser requires only: implement interface + add one blank import.
- No central list to maintain (other than the import file).
- `go test ./...` will include new parsers automatically.

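
A minimal sketch of the self-registration pattern described in ADL-002, collapsed into one file so it compiles on its own. The names `VendorParser`, `Register`, `RegisteredNames`, and `dellParser` are illustrative assumptions, not the project's actual identifiers. In the real layout each vendor would live in its own package and be pulled in by a blank import such as `_ "…/vendors/dell"`.

```go
package main

import "sort"

// VendorParser is a hypothetical stand-in for the project's parser interface.
type VendorParser interface {
	Name() string
}

// registry holds every parser that registered itself during package init.
// Package-level variables are initialized before any init() function runs.
var registry = map[string]VendorParser{}

// Register adds a parser to the registry. Duplicate names panic so that
// conflicting registrations are caught at startup rather than at parse time.
func Register(p VendorParser) {
	if _, dup := registry[p.Name()]; dup {
		panic("parser registered twice: " + p.Name())
	}
	registry[p.Name()] = p
}

// RegisteredNames returns the registered parser names in sorted order.
func RegisteredNames() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// dellParser is a placeholder vendor parser used only for this sketch.
type dellParser struct{}

func (dellParser) Name() string { return "dell" }

// init runs automatically when the package is loaded, so importing the
// package (even with a blank import) is enough to register the parser.
func init() { Register(dellParser{}) }
```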
---

## ADL-003 — Highest-confidence parser wins

**Date:** project start
**Context:** Multiple parsers may partially match an archive (e.g. generic + specific vendor).
**Decision:** Run all parsers' `Detect()`, select the one returning the highest score (0–100).
**Consequences:**
- Generic fallback (score 15) only activates when no vendor parser scores higher.
- Parsers must be conservative with high scores (70+) to avoid false positives.

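
The selection rule in ADL-003 can be sketched as a single pass over the candidates. `Detector`, `SelectParser`, and `fixedScore` are hypothetical names, and the tie-break (the earlier candidate wins) is an assumption, since this log does not state how equal scores are resolved.

```go
package main

// Detector is a hypothetical view of a parser: a name plus a confidence score.
type Detector interface {
	Name() string
	Detect(files []string) int // confidence in the range 0..100
}

// SelectParser runs Detect on every candidate and returns the highest scorer.
// It returns nil when no candidate reports a positive score.
// On a tie the earlier candidate is kept (an assumption of this sketch).
func SelectParser(candidates []Detector, files []string) Detector {
	var best Detector
	bestScore := 0
	for _, c := range candidates {
		if s := c.Detect(files); s > bestScore {
			best, bestScore = c, s
		}
	}
	return best
}

// fixedScore is a test double that always reports the same score.
type fixedScore struct {
	name  string
	score int
}

func (f fixedScore) Name() string        { return f.name }
func (f fixedScore) Detect([]string) int { return f.score }
```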
---

## ADL-004 — Canonical hardware.devices as single source of truth

**Date:** v1.5.0
**Context:** UI tabs and Reanimator exporter were reading from different sub-fields of
`AnalysisResult`, causing potential drift.
**Decision:** Introduce `hardware.devices` as the canonical inventory repository.
All UI tabs and all exporters must read exclusively from this repository.
**Consequences:**
- Any UI vs Reanimator discrepancy is classified as a bug, not a "known difference".
- Deduplication logic runs once in the repository builder (serial → bdf → distinct).
- New hardware attributes must be added to canonical schema first, then mapped to consumers.

---

## ADL-005 — No hardcoded PCI model strings; use pci.ids

**Date:** v1.5.0
**Context:** NVIDIA and other vendors release new GPU models frequently; hardcoded maps
required code changes for each new model ID.
**Decision:** Use the `pciutils/pciids` database (git submodule, embedded at build time).
PCI vendor/device ID → human-readable model name via lookup.
**Consequences:**
- New GPU models can be supported by updating `pci.ids` without code changes.
- `make build` auto-syncs `pci.ids` from submodule before compilation.
- External override via `LOGPILE_PCI_IDS_PATH` env var.

---

## ADL-006 — Reanimator export uses canonical hardware.devices (not raw sub-fields)

**Date:** v1.5.0
**Context:** Early Reanimator exporter read from `Hardware.GPUs`, `Hardware.NICs`, etc.
directly, diverging from UI data.
**Decision:** Reanimator exporter must use `hardware.devices` — the same source as the UI.
Exporter groups/filters canonical records by section; does not rebuild from sub-fields.
**Consequences:**
- Guarantees UI and export consistency.
- Exporter code is simpler — mainly a filter+map, not a data reconstruction.

---

## ADL-007 — Documentation language is English

**Date:** 2026-02-20
**Context:** Codebase documentation was mixed Russian/English, reducing clarity for
international contributors and AI assistants.
**Decision:** All maintained project documentation (`docs/bible/`, `README.md`,
`CLAUDE.md`, and new technical docs) must be written in English.
**Consequences:**
- Bible is authoritative in English.
- AI assistants get consistent, unambiguous context.

---

## ADL-008 — Bible is the single source of truth for architecture docs

**Date:** 2026-02-23
**Context:** Architecture information was duplicated across `README.md`, `CLAUDE.md`,
and the Bible, creating drift risk and stale guidance for humans and AI agents.
**Decision:** Keep architecture and technical design documentation only in `docs/bible/`.
Top-level `README.md` and `CLAUDE.md` must remain minimal pointers/instructions.
**Consequences:**
- Reduces documentation drift and duplicate updates.
- AI assistants are directed to one authoritative source before making changes.
- Documentation updates that affect architecture must include Bible changes (and ADL entries when significant).

---

## ADL-009 — Redfish analysis is performed from raw snapshot replay (unified tunnel)

**Date:** 2026-02-24
**Context:** Live Redfish collection and raw export re-analysis used different parsing paths,
which caused drift and made bug fixes difficult to validate consistently.
**Decision:** Redfish live collection must produce a `raw_payloads.redfish_tree` snapshot first,
then run the same replay analyzer used for imported raw exports.
**Consequences:**
- Same `redfish_tree` input produces the same parsed result in live and offline modes.
- Debugging parser issues can be done against exported raw bundles without live BMC access.
- Snapshot completeness becomes critical; collector seeds/limits are part of analyzer correctness.

---

## ADL-010 — Raw export is a self-contained re-analysis package (not a final result dump)

**Date:** 2026-02-24
**Context:** Exporting only normalized `AnalysisResult` loses raw source fidelity and prevents
future parser improvements from being applied to already collected data.
**Decision:** `Export Raw Data` produces a self-contained raw package (JSON or ZIP bundle)
that the application can reopen and re-analyze. Parsed data in the package is optional and not
the source of truth on import.
**Consequences:**
- Re-opening an export always re-runs analysis from raw source (`redfish_tree` or uploaded file bytes).
- Raw bundles include collection context and diagnostics for debugging (`collect.log`, `parser_fields.json`).
- Endpoint compatibility is preserved (`/api/export/json`) while actual payload format may be a bundle.

---

## ADL-011 — Redfish snapshot crawler is bounded, prioritized, and failure-tolerant

**Date:** 2026-02-24
**Context:** Full Redfish trees on modern GPU systems are large, noisy, and contain many
vendor-specific or non-fetchable links. Unbounded crawling and naive queue design caused hangs
and incomplete snapshots.
**Decision:** Use a bounded snapshot crawler with:
- explicit document cap (`LOGPILE_REDFISH_SNAPSHOT_MAX_DOCS`)
- priority seed paths (PCIe/Fabrics/Firmware/Storage/PowerSubsystem/ThermalSubsystem)
- normalized `@odata.id` paths (strip `#fragment`)
- noisy expected error filtering (404/405/410/501 hidden from UI)
- queue capacity sized to crawl cap to avoid producer/consumer deadlock

**Consequences:**
- Snapshot collection remains stable on large BMC trees.
- Most high-value inventory paths are reached before the cap.
- UI progress remains useful while debug logs retain low-level fetch failures.

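
The crawler properties listed in ADL-011 can be illustrated with a single-threaded sketch. `NormalizeODataID` and `Crawl` are hypothetical names, and `fetch` is a caller-supplied function standing in for the HTTP layer. The sketch uses a slice as the queue; because at most `maxDocs` paths are ever admitted, the queue can never hold more than `maxDocs` entries. In a channel-based design, `make(chan string, maxDocs)` would give the same guarantee, so a producer can never block on a full queue.

```go
package main

import "strings"

// NormalizeODataID strips any "#fragment" so that two links to the same
// document are not queued as separate paths.
func NormalizeODataID(id string) string {
	if i := strings.IndexByte(id, '#'); i >= 0 {
		id = id[:i]
	}
	return id
}

// Crawl attempts at most maxDocs documents, breadth-first, seeds first.
// fetch returns the links found in a document; a fetch error skips that
// document instead of aborting the whole crawl.
func Crawl(seeds []string, maxDocs int, fetch func(path string) ([]string, error)) []string {
	seen := make(map[string]bool)
	queue := make([]string, 0, maxDocs)

	// enqueue admits a path only if it is new and the cap is not reached.
	enqueue := func(p string) {
		p = NormalizeODataID(p)
		if p == "" || seen[p] || len(seen) >= maxDocs {
			return
		}
		seen[p] = true
		queue = append(queue, p)
	}

	// Priority seeds go in first so they are fetched before discovered links.
	for _, s := range seeds {
		enqueue(s)
	}

	var attempted []string
	for len(queue) > 0 {
		path := queue[0]
		queue = queue[1:]
		attempted = append(attempted, path)

		links, err := fetch(path)
		if err != nil {
			continue // tolerated, e.g. 404/405 on vendor-specific links
		}
		for _, l := range links {
			enqueue(l)
		}
	}
	return attempted
}
```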
---

## ADL-012 — Vendor-specific storage inventory probing is allowed as fallback

**Date:** 2026-02-24
**Context:** Some Supermicro BMCs expose empty standard `Storage/.../Drives` collections while
real disk inventory exists under vendor-specific `Disk.Bay` endpoints and enclosure links.
**Decision:** When standard drive collections are empty, collector/replay may probe vendor-style
`.../Drives/Disk.Bay.*` endpoints and follow `Storage.Links.Enclosures[*]` to recover physical drives.
**Consequences:**
- Higher storage inventory coverage on Supermicro HBA/HA-RAID/MRVL/NVMe backplane implementations.
- Replay must mirror the same probing behavior to preserve deterministic results.
- Probing remains bounded (finite candidate set) to avoid runaway requests.

---

## ADL-013 — PowerSubsystem is preferred over legacy Power on newer Redfish implementations

**Date:** 2026-02-24
**Context:** X14+/newer Redfish implementations increasingly expose authoritative PSU data in
`PowerSubsystem/PowerSupplies`, while legacy `/Power` may be incomplete or schema-shifted.
**Decision:** Prefer `Chassis/*/PowerSubsystem/PowerSupplies` as the primary PSU source and use
legacy `Chassis/*/Power` as fallback.
**Consequences:**
- Better compatibility with newer BMC firmware generations.
- Legacy systems remain supported without special-case collector selection.
- Snapshot priority seeds must include `PowerSubsystem` resources.

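
The prefer-then-fall-back rule in ADL-013 reduces to a few lines. This is a sketch under stated assumptions: `PSU` and `CollectPSUs` are hypothetical names, and "the preferred source yielded nothing" is taken to mean an empty result, which may be narrower than the project's real criterion.

```go
package main

// PSU is a minimal hypothetical record; the real model carries more fields.
type PSU struct {
	Name   string
	Source string // which Redfish resource the record came from
}

// CollectPSUs prefers PowerSubsystem/PowerSupplies data and consults the
// legacy Power resource only when the preferred source returns no entries.
func CollectPSUs(fromPowerSubsystem, fromLegacyPower func() []PSU) []PSU {
	if psus := fromPowerSubsystem(); len(psus) > 0 {
		return psus
	}
	return fromLegacyPower()
}
```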
---

## ADL-014 — Threshold logic lives on the server; UI reflects status only

**Date:** 2026-02-24
**Context:** Duplicating threshold math in frontend and backend creates drift and inconsistent
highlighting (e.g. PSU mains voltage range checks).
**Decision:** Business threshold evaluation (e.g. PSU voltage nominal range) must be computed on
the server; frontend only renders status/flags returned by the API.
**Consequences:**
- Single source of truth for threshold policies.
- UI can evolve visually without re-implementing domain logic.
- API payloads may carry richer status semantics over time.

---

## ADL-015 — Supermicro crashdump archive parser removed from active registry

**Date:** 2026-03-01
**Context:** The Supermicro crashdump parser (`SMC Crash Dump Parser`) produced low-value
results for current workflows and was explicitly rejected as a supported archive path.
**Decision:** Remove `supermicro` vendor parser from active registration and project source.
Do not include it in `/api/parsers` output or parser documentation matrix.
**Consequences:**
- Supermicro crashdump archives (`CDump.txt` format) are no longer parsed by a dedicated vendor parser.
- Such archives fall back to other matching parsers (typically `generic`) unless a new replacement parser is added.
- Reintroduction requires a new parser package and an explicit registry import in `vendors/vendors.go`.

---

## ADL-016 — Device-bound firmware must not appear in hardware.firmware

**Date:** 2026-03-01
**Context:** Dell TSR `DCIM_SoftwareIdentity` lists firmware for every component (NICs,
PSUs, disks, backplanes) in addition to system-level firmware. Naively importing all entries
into `Hardware.Firmware` caused device firmware to appear twice in Reanimator: once in the
device's own record and again in the top-level firmware list.
**Decision:**
- `Hardware.Firmware` contains only system-level firmware (BIOS, BMC/iDRAC, CPLD,
  Lifecycle Controller, storage controllers, BOSS).
- Device-bound entries (NIC, PSU, Disk, Backplane, GPU) must not be added to
  `Hardware.Firmware`.
- Parsers must store the FQDD (or equivalent slot identifier) in `FirmwareInfo.Description`
  so the Reanimator exporter can filter by FQDD prefix.
- The exporter's `isDeviceBoundFirmwareFQDD()` function performs this filter.

**Consequences:**
- Any new parser that ingests a per-device firmware inventory must follow the same rule.
- Device firmware is accessible only via the device's own record, not the firmware list.

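
A sketch of what an FQDD-prefix filter like the one in ADL-016 might look like. It is not the project's `isDeviceBoundFirmwareFQDD()`: the function names, the reduced `FirmwareInfo` struct, and above all the prefix list are assumptions chosen for illustration, and the real table may differ.

```go
package main

import "strings"

// deviceBoundFQDDPrefixes is an illustrative list, not the project's table.
var deviceBoundFQDDPrefixes = []string{"nic.", "psu.", "disk."}

// isDeviceBoundFQDDSketch reports whether an FQDD belongs to a device that
// already carries its own firmware record.
func isDeviceBoundFQDDSketch(fqdd string) bool {
	f := strings.ToLower(strings.TrimSpace(fqdd))
	for _, p := range deviceBoundFQDDPrefixes {
		if strings.HasPrefix(f, p) {
			return true
		}
	}
	return false
}

// FirmwareInfo mirrors only the two fields this sketch needs.
type FirmwareInfo struct {
	DeviceName  string
	Description string // carries the FQDD, per the decision above
}

// SystemLevelOnly drops entries whose FQDD marks them as device-bound,
// leaving only system-level firmware for the top-level list.
func SystemLevelOnly(all []FirmwareInfo) []FirmwareInfo {
	out := make([]FirmwareInfo, 0, len(all))
	for _, fw := range all {
		if !isDeviceBoundFQDDSketch(fw.Description) {
			out = append(out, fw)
		}
	}
	return out
}
```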
---

## ADL-017 — Vendor-embedded MAC addresses must be stripped from model name fields

**Date:** 2026-03-01
**Context:** Dell TSR embeds MAC addresses directly in `ProductName` and `ElementName`
fields (e.g. `"NVIDIA ConnectX-6 Lx 2x 25G SFP28 OCP3.0 SFF - C4:70:BD:DB:56:08"`).
This caused model names to contain MAC addresses in NIC model, NIC firmware device name,
and potentially other fields.
**Decision:** Strip any ` - XX:XX:XX:XX:XX:XX` suffix from all model/name string fields
at parse time before storing in any model struct. Use the regex
`\s+-\s+([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$`.
**Consequences:**
- Model names are clean and consistent across all devices.
- All parsers must apply this stripping to any field used as a device name or model.
- Confirmed affected fields in Dell: `DCIM_NICView.ProductName`, `DCIM_SoftwareIdentity.ElementName`.

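
The regex from ADL-017 can be applied with Go's standard `regexp` package as follows. The pattern is copied from the decision; the helper name `StripMACSuffix` is an assumption. Because the pattern is anchored with `$` and requires whitespace around the hyphen, hyphens inside a model name such as `ConnectX-6` are left alone.

```go
package main

import "regexp"

// macSuffix matches a trailing " - XX:XX:XX:XX:XX:XX" as specified in ADL-017.
var macSuffix = regexp.MustCompile(`\s+-\s+([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$`)

// StripMACSuffix removes a vendor-embedded MAC address from the end of a
// model or device name. Names without such a suffix are returned unchanged.
func StripMACSuffix(name string) string {
	return macSuffix.ReplaceAllString(name, "")
}
```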
---

## ADL-018 — NVMe bay probe must be restricted to storage-capable chassis types

**Date:** 2026-03-12
**Context:** `shouldAdaptiveNVMeProbe` was introduced in `2fa4a12` to recover NVMe drives on
Supermicro BMCs that expose empty `Drives` collections but serve disks at direct `Disk.Bay.N`
paths. The function returns `true` for any chassis with an empty `Members` array. On
Supermicro HGX systems (SYS-A21GE-NBRT and similar) ~35 sub-chassis (GPU, NVSwitch,
PCIeRetimer, ERoT, IRoT, BMC, FPGA) all carry `ChassisType=Module/Component/Zone` and
expose empty `/Drives` collections. Without filtering, each triggered 384 HTTP requests →
13 440 requests ≈ 22 minutes of pure I/O waste per collection.
**Decision:** Before probing `Disk.Bay.N` candidates for a chassis, check its `ChassisType`
via `chassisTypeCanHaveNVMe`. Skip if type is `Module`, `Component`, or `Zone`. Keep probing
for `Enclosure`, `RackMount`, and any unrecognised type (fail-safe).
**Consequences:**
- On HGX systems, the NVMe post-probe phase drops from ~22 min to effectively zero.
- NVMe backplane recovery (`Enclosure` type) is unaffected.
- Any new chassis type that hosts NVMe storage is covered by the default `true` path.
- `chassisTypeCanHaveNVMe` and the candidate-selection loop must have unit tests covering
  both the excluded types and the storage-capable types (see `TestChassisTypeCanHaveNVMe`
  and `TestNVMePostProbeSkipsNonStorageChassis`).

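
A sketch of the deny-list check from ADL-018, together with the request arithmetic from the context. The function names are hypothetical, and exact case-sensitive matching of `ChassisType` is an assumption; the project's `chassisTypeCanHaveNVMe` may normalize its input. The figures in the context (35 sub-chassis × 384 requests = 13 440 requests in about 22 minutes) imply roughly 0.1 s per request.

```go
package main

// chassisTypeCanHaveNVMeSketch returns false only for chassis types that are
// known not to host drives. Everything else, including unrecognised or empty
// types, returns true so that new storage-capable types are probed by default.
func chassisTypeCanHaveNVMeSketch(chassisType string) bool {
	switch chassisType {
	case "Module", "Component", "Zone":
		return false
	default:
		return true
	}
}

// wastedRequests is the number of probe requests issued when every one of
// subChassis non-storage chassis triggers requestsPerChassis probes.
func wastedRequests(subChassis, requestsPerChassis int) int {
	return subChassis * requestsPerChassis
}
```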

---

## ADL-019 — isDeviceBoundFirmwareName must cover vendor-specific naming patterns per vendor

**Date:** 2026-03-12
**Context:** `isDeviceBoundFirmwareName` was written to filter Dell-style device firmware names
(`"GPU SomeDevice"`, `"NIC OnboardLAN"`). When Supermicro Redfish FirmwareInventory was added
(`6c19a58`), no Supermicro-specific patterns were added. Supermicro names a NIC entry
`"NIC1 System Slot0 AOM-DP805-IO"` — a digit follows the type prefix directly, bypassing the
`"nic "` (space-terminated) check. 29 device-bound entries leaked into `hardware.firmware` on
SYS-A21GE-NBRT (HGX B200). Commit `9c5512d` attempted a fix by adding `_fw_gpu_` patterns,
but checked `DeviceName` which contains `"Software Inventory"` (from the Redfish `Name` field),
not the firmware inventory ID. The patterns were dead code from the moment they were committed.
**Decision:**
- `isDeviceBoundFirmwareName` must be extended for each new vendor whose FirmwareInventory
  naming convention differs from the existing patterns.
- When adding HGX/Supermicro patterns, check that the pattern matches the field value that
  `collectFirmwareInventory` actually stores — trace the data path from Redfish doc to
  `FirmwareInfo.DeviceName` before writing the condition.
- `TestIsDeviceBoundFirmwareName` must contain at least one case per vendor format.

**Consequences:**
- New vendors with FirmwareInventory support require a test covering both device-bound names
  (must return true) and system-level names (must return false) before the code ships.
- The dead `_fw_gpu_` / `_fw_nvswitch_` / `_inforom_gpu_` patterns were replaced with
  correct prefix+digit checks (`"gpu" + digit`, `"nic" + digit`) and explicit string checks
  (`"nvmecontroller"`, `"power supply"`, `"software inventory"`).

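
The prefix+digit check described in ADL-019 can be sketched as below. This is not the project's `isDeviceBoundFirmwareName`: the function names are hypothetical, and treating the explicit strings as substring matches on the lower-cased name is an assumption. The point of the sketch is that `"nic "` and `"nic" + digit` are separate conditions, so both the Dell-style and the Supermicro-style names are caught.

```go
package main

import "strings"

// hasPrefixThenDigit reports whether s starts with prefix immediately
// followed by an ASCII digit, e.g. "nic1 system slot0" for prefix "nic".
func hasPrefixThenDigit(s, prefix string) bool {
	if !strings.HasPrefix(s, prefix) || len(s) == len(prefix) {
		return false
	}
	c := s[len(prefix)]
	return c >= '0' && c <= '9'
}

// isDeviceBoundNameSketch classifies a firmware entry name as device-bound.
func isDeviceBoundNameSketch(name string) bool {
	n := strings.ToLower(strings.TrimSpace(name))

	// Dell-style: type prefix followed by a space ("NIC OnboardLAN").
	if strings.HasPrefix(n, "gpu ") || strings.HasPrefix(n, "nic ") {
		return true
	}
	// Supermicro-style: type prefix followed directly by a digit
	// ("NIC1 System Slot0 AOM-DP805-IO").
	if hasPrefixThenDigit(n, "gpu") || hasPrefixThenDigit(n, "nic") {
		return true
	}
	// Explicit strings named in the consequences above (substring match
	// is an assumption of this sketch).
	for _, s := range []string{"nvmecontroller", "power supply", "software inventory"} {
		if strings.Contains(n, s) {
			return true
		}
	}
	return false
}
```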
<!-- Add new decisions below this line using the format above -->