logpile/bible-local/10-decisions.md
Mikhail Chusavitin 47bb0ee939 docs: document firmware filter regression pattern in bible (ADL-019)
Root cause analysis for device-bound firmware leaking into hardware.firmware
on Supermicro Redfish (SYS-A21GE-NBRT HGX B200):

- collectFirmwareInventory (6c19a58) had no coverage for Supermicro naming.
  isDeviceBoundFirmwareName checked "gpu " / "nic " (space-terminated) while
  Supermicro uses "GPU1 System Slot0" / "NIC1 System Slot0 ..." (digit suffix).

- 9c5512d added _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns to fix HGX,
  but checked DeviceName which contains "Software Inventory" (from Redfish Name),
  not the firmware Id. Dead code from day one.

09-testing.md: add firmware filter worked example and rule #4 — verify the
filter checks the field that the collector actually populates.

10-decisions.md: ADL-019 — isDeviceBoundFirmwareName must be extended per
vendor with a test case per vendor format before shipping.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 14:03:47 +03:00


10 — Architectural Decision Log (ADL)

Rule: Every significant architectural decision must be recorded here before or alongside the code change. This applies to humans and AI assistants alike.

Format: date · title · context · decision · consequences


ADL-001 — In-memory only state (no database)

Date: project start

Context: LOGPile is designed as a standalone diagnostic tool, not a persistent service.

Decision: All parsed/collected data lives in Server.result (in-memory). No database, no files written.

Consequences:

  • Data is lost on process restart — intentional.
  • Simple deployment: single binary, no setup required.
  • JSON export is the persistence mechanism for users who want to save results.

ADL-002 — Vendor parser auto-registration via init()

Date: project start

Context: Need an extensible parser registry without a central factory function.

Decision: Each vendor parser registers itself in its package's init() function. vendors/vendors.go holds blank imports to trigger registration.

Consequences:

  • Adding a new parser requires only: implement interface + add one blank import.
  • No central list to maintain (other than the import file).
  • go test ./... will include new parsers automatically.
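
The registration pattern can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Parser` interface, `Register`, `RegisteredNames`, and `dellParser` are hypothetical names, and everything lives in one file here, whereas in the layout described above the `init()` would sit in the vendor package and be triggered by a blank import.

```go
package main

import (
	"fmt"
	"sort"
)

// Parser is a hypothetical minimal interface; the real one may differ.
type Parser interface {
	Name() string
}

// registry holds every parser that registered itself.
// Package-level variables are initialized before any init() runs.
var registry = map[string]Parser{}

// Register adds a parser to the registry. Vendor packages call it from init().
func Register(p Parser) {
	registry[p.Name()] = p
}

// RegisteredNames returns the registered parser names in sorted order.
func RegisteredNames() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// dellParser stands in for a vendor parser (hypothetical).
type dellParser struct{}

func (dellParser) Name() string { return "dell" }

// In a real vendor package this init() would run only because the package
// is blank-imported somewhere, e.g. `import _ ".../vendors/dell"`.
func init() {
	Register(dellParser{})
}

func main() {
	fmt.Println(RegisteredNames())
}
```

The design trade-off is that registration is implicit: a parser whose package is never imported silently does not exist, which is why the blank-import file still has to be maintained.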

ADL-003 — Highest-confidence parser wins

Date: project start

Context: Multiple parsers may partially match an archive (e.g. generic + specific vendor).

Decision: Run all parsers' Detect(), select the one returning the highest score (0-100).

Consequences:

  • Generic fallback (score 15) only activates when no vendor parser scores higher.
  • Parsers must be conservative with high scores (70+) to avoid false positives.
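
A minimal sketch of the selection rule, assuming a hypothetical `Detector` interface whose `Detect` returns a score in the 0-100 range. Tie-breaking (first registered wins) and the "zero means no match" rule are assumptions of this sketch, not documented behavior.

```go
package main

import "fmt"

// Detector is a hypothetical interface: Detect returns a confidence of 0-100.
type Detector interface {
	Name() string
	Detect(files []string) int
}

// fixedScore is a test double that always reports the same confidence.
type fixedScore struct {
	name  string
	score int
}

func (f fixedScore) Name() string        { return f.name }
func (f fixedScore) Detect([]string) int { return f.score }

// SelectParser runs every Detect and returns the highest-scoring parser.
// It returns nil when no parser scores above zero; on ties the earlier
// parser in the slice wins (an assumption of this sketch).
func SelectParser(parsers []Detector, files []string) Detector {
	var best Detector
	bestScore := 0
	for _, p := range parsers {
		if s := p.Detect(files); s > bestScore {
			best, bestScore = p, s
		}
	}
	return best
}

func main() {
	parsers := []Detector{
		fixedScore{"generic", 15}, // generic fallback score from the text above
		fixedScore{"dell", 85},
	}
	fmt.Println(SelectParser(parsers, nil).Name())
}
```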

ADL-004 — Canonical hardware.devices as single source of truth

Date: v1.5.0

Context: UI tabs and Reanimator exporter were reading from different sub-fields of AnalysisResult, causing potential drift.

Decision: Introduce hardware.devices as the canonical inventory repository. All UI tabs and all exporters must read exclusively from this repository.

Consequences:

  • Any UI vs Reanimator discrepancy is classified as a bug, not a "known difference".
  • Deduplication logic runs once in the repository builder (serial → bdf → distinct).
  • New hardware attributes must be added to canonical schema first, then mapped to consumers.
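
One way to read the "serial → bdf → distinct" order is as a key-priority rule: merge by serial number when present, otherwise by PCI bus/device/function address, otherwise never merge. The sketch below implements that reading only; the `Device` struct and `Dedup` are hypothetical, and the real builder may reconcile records more carefully (for example, a record seen once with a serial and once with only a BDF is not merged here).

```go
package main

import "fmt"

// Device is a hypothetical, trimmed-down canonical record.
type Device struct {
	Kind   string
	Serial string
	BDF    string
	Model  string
}

// Dedup keeps the first record for each identity key.
// Key priority: serial number, then PCI BDF; records with neither
// are treated as distinct and always kept.
func Dedup(devs []Device) []Device {
	seen := map[string]bool{}
	out := make([]Device, 0, len(devs))
	for _, d := range devs {
		key := ""
		switch {
		case d.Serial != "":
			key = "serial:" + d.Serial
		case d.BDF != "":
			key = "bdf:" + d.BDF
		}
		if key != "" {
			if seen[key] {
				continue // duplicate of an earlier record
			}
			seen[key] = true
		}
		out = append(out, d)
	}
	return out
}

func main() {
	devs := []Device{
		{Kind: "gpu", Serial: "SN1", BDF: "0000:17:00.0"},
		{Kind: "gpu", Serial: "SN1"},
		{Kind: "nic", BDF: "0000:3b:00.0"},
		{Kind: "nic", BDF: "0000:3b:00.0"},
		{Kind: "psu"},
		{Kind: "psu"},
	}
	fmt.Println(len(Dedup(devs)))
}
```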

ADL-005 — No hardcoded PCI model strings; use pci.ids

Date: v1.5.0

Context: NVIDIA and other vendors release new GPU models frequently; hardcoded maps required code changes for each new model ID.

Decision: Use the pciutils/pciids database (git submodule, embedded at build time). PCI vendor/device ID → human-readable model name via lookup.

Consequences:

  • New GPU models can be supported by updating pci.ids without code changes.
  • make build auto-syncs pci.ids from submodule before compilation.
  • External override via LOGPILE_PCI_IDS_PATH env var.
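
The lookup can be sketched as a small parser over the pci.ids text layout (vendor lines at column zero, device lines indented by one tab, subsystem lines by two). `ParsePCIIDs` and `LookupModel` are hypothetical names, the sample entries are placeholders rather than real IDs, and embedding and the `LOGPILE_PCI_IDS_PATH` override are left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ParsePCIIDs builds a "vendor:device" -> name map from pci.ids text.
// Subsystem lines and the device-class section are ignored in this sketch.
func ParsePCIIDs(data string) map[string]string {
	names := map[string]string{}
	vendor := ""
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, "C ") {
			break // device classes follow; not needed for model lookup
		}
		if strings.HasPrefix(line, "\t\t") {
			continue // subsystem entry
		}
		// ID and name are separated by two spaces.
		id, name, ok := strings.Cut(strings.TrimPrefix(line, "\t"), "  ")
		if !ok {
			continue
		}
		if strings.HasPrefix(line, "\t") {
			if vendor != "" {
				names[vendor+":"+id] = name
			}
		} else {
			vendor = id
		}
	}
	return names
}

// LookupModel returns the model name, or "" when the ID pair is unknown.
func LookupModel(names map[string]string, vendorID, deviceID string) string {
	return names[strings.ToLower(vendorID)+":"+strings.ToLower(deviceID)]
}

func main() {
	// Placeholder data in pci.ids layout, not real IDs.
	sample := "abcd  Example Vendor\n\t1234  Example Device\n"
	fmt.Println(LookupModel(ParsePCIIDs(sample), "ABCD", "1234"))
}
```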

ADL-006 — Reanimator export uses canonical hardware.devices (not raw sub-fields)

Date: v1.5.0

Context: Early Reanimator exporter read from Hardware.GPUs, Hardware.NICs, etc. directly, diverging from UI data.

Decision: Reanimator exporter must use hardware.devices, the same source as the UI. Exporter groups/filters canonical records by section; does not rebuild from sub-fields.

Consequences:

  • Guarantees UI and export consistency.
  • Exporter code is simpler — mainly a filter+map, not a data reconstruction.

ADL-007 — Documentation language is English

Date: 2026-02-20

Context: Codebase documentation was mixed Russian/English, reducing clarity for international contributors and AI assistants.

Decision: All maintained project documentation (docs/bible/, README.md, CLAUDE.md, and new technical docs) must be written in English.

Consequences:

  • Bible is authoritative in English.
  • AI assistants get consistent, unambiguous context.

ADL-008 — Bible is the single source of truth for architecture docs

Date: 2026-02-23

Context: Architecture information was duplicated across README.md, CLAUDE.md, and the Bible, creating drift risk and stale guidance for humans and AI agents.

Decision: Keep architecture and technical design documentation only in docs/bible/. Top-level README.md and CLAUDE.md must remain minimal pointers/instructions.

Consequences:

  • Reduces documentation drift and duplicate updates.
  • AI assistants are directed to one authoritative source before making changes.
  • Documentation updates that affect architecture must include Bible changes (and ADL entries when significant).

ADL-009 — Redfish analysis is performed from raw snapshot replay (unified tunnel)

Date: 2026-02-24

Context: Live Redfish collection and raw export re-analysis used different parsing paths, which caused drift and made bug fixes difficult to validate consistently.

Decision: Redfish live collection must produce a raw_payloads.redfish_tree snapshot first, then run the same replay analyzer used for imported raw exports.

Consequences:

  • Same redfish_tree input produces the same parsed result in live and offline modes.
  • Debugging parser issues can be done against exported raw bundles without live BMC access.
  • Snapshot completeness becomes critical; collector seeds/limits are part of analyzer correctness.

ADL-010 — Raw export is a self-contained re-analysis package (not a final result dump)

Date: 2026-02-24

Context: Exporting only normalized AnalysisResult loses raw source fidelity and prevents future parser improvements from being applied to already collected data.

Decision: Export Raw Data produces a self-contained raw package (JSON or ZIP bundle) that the application can reopen and re-analyze. Parsed data in the package is optional and not the source of truth on import.

Consequences:

  • Re-opening an export always re-runs analysis from raw source (redfish_tree or uploaded file bytes).
  • Raw bundles include collection context and diagnostics for debugging (collect.log, parser_fields.json).
  • Endpoint compatibility is preserved (/api/export/json) while actual payload format may be a bundle.

ADL-011 — Redfish snapshot crawler is bounded, prioritized, and failure-tolerant

Date: 2026-02-24

Context: Full Redfish trees on modern GPU systems are large, noisy, and contain many vendor-specific or non-fetchable links. Unbounded crawling and naive queue design caused hangs and incomplete snapshots.

Decision: Use a bounded snapshot crawler with:

  • explicit document cap (LOGPILE_REDFISH_SNAPSHOT_MAX_DOCS)
  • priority seed paths (PCIe/Fabrics/Firmware/Storage/PowerSubsystem/ThermalSubsystem)
  • normalized @odata.id paths (strip #fragment)
  • noisy expected error filtering (404/405/410/501 hidden from UI)
  • queue capacity sized to crawl cap to avoid producer/consumer deadlock

Consequences:

  • Snapshot collection remains stable on large BMC trees.
  • Most high-value inventory paths are reached before the cap.
  • UI progress remains useful while debug logs retain low-level fetch failures.
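
The crawler properties listed above (document cap, seeds first, fragment stripping, tolerated fetch failures) can be sketched as a sequential breadth-first walk. This is a simplified illustration: `Crawl`, `normalizeODataID`, and `errNotFound` are hypothetical names, the real crawler presumably uses concurrent workers and a channel-based queue, and error filtering by HTTP status is reduced to "skip on any error".

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNotFound stands in for an expected fetch failure such as HTTP 404.
var errNotFound = errors.New("not found")

// normalizeODataID strips a "#fragment" so the same document is fetched once.
func normalizeODataID(id string) string {
	if i := strings.IndexByte(id, '#'); i >= 0 {
		id = id[:i]
	}
	return id
}

// Crawl walks links breadth-first starting from the seeds.
// At most maxDocs distinct paths are ever queued, so the queue can be
// allocated with that capacity up front. Fetch errors are skipped.
func Crawl(seeds []string, maxDocs int, fetch func(path string) ([]string, error)) []string {
	queue := make([]string, 0, maxDocs)
	seen := map[string]bool{}
	enqueue := func(id string) {
		p := normalizeODataID(id)
		if p == "" || seen[p] || len(seen) >= maxDocs {
			return
		}
		seen[p] = true
		queue = append(queue, p)
	}
	for _, s := range seeds {
		enqueue(s) // priority seeds enter the queue before discovered links
	}
	var fetched []string
	for len(queue) > 0 {
		p := queue[0]
		queue = queue[1:]
		links, err := fetch(p)
		if err != nil {
			continue // failure-tolerant: one bad link does not stop the crawl
		}
		fetched = append(fetched, p)
		for _, l := range links {
			enqueue(l)
		}
	}
	return fetched
}

func main() {
	tree := map[string][]string{
		"/redfish/v1/Chassis":   {"/redfish/v1/Chassis/1#/Links", "/redfish/v1/Broken"},
		"/redfish/v1/Chassis/1": {"/redfish/v1/Chassis"},
	}
	fetch := func(p string) ([]string, error) {
		links, ok := tree[p]
		if !ok {
			return nil, errNotFound
		}
		return links, nil
	}
	fmt.Println(Crawl([]string{"/redfish/v1/Chassis"}, 100, fetch))
}
```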

ADL-012 — Vendor-specific storage inventory probing is allowed as fallback

Date: 2026-02-24

Context: Some Supermicro BMCs expose empty standard Storage/.../Drives collections while real disk inventory exists under vendor-specific Disk.Bay endpoints and enclosure links.

Decision: When standard drive collections are empty, collector/replay may probe vendor-style .../Drives/Disk.Bay.* endpoints and follow Storage.Links.Enclosures[*] to recover physical drives.

Consequences:

  • Higher storage inventory coverage on Supermicro HBA/HA-RAID/MRVL/NVMe backplane implementations.
  • Replay must mirror the same probing behavior to preserve deterministic results.
  • Probing remains bounded (finite candidate set) to avoid runaway requests.

ADL-013 — PowerSubsystem is preferred over legacy Power on newer Redfish implementations

Date: 2026-02-24

Context: X14+/newer Redfish implementations increasingly expose authoritative PSU data in PowerSubsystem/PowerSupplies, while legacy /Power may be incomplete or schema-shifted.

Decision: Prefer Chassis/*/PowerSubsystem/PowerSupplies as the primary PSU source and use legacy Chassis/*/Power as fallback.

Consequences:

  • Better compatibility with newer BMC firmware generations.
  • Legacy systems remain supported without special-case collector selection.
  • Snapshot priority seeds must include PowerSubsystem resources.
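
The preference order can be sketched as a lookup against the set of paths present in a snapshot. `PSUSourcePath` and the `map[string]bool` snapshot representation are hypothetical; the sketch only checks for presence, while a real implementation would probably also fall back when the primary collection exists but is empty.

```go
package main

import "fmt"

// PSUSourcePath picks the PSU source for one chassis.
// snapshot is the set of paths present in the collected tree (assumed shape).
// PowerSubsystem/PowerSupplies is preferred; legacy Power is the fallback.
func PSUSourcePath(snapshot map[string]bool, chassisPath string) (string, bool) {
	primary := chassisPath + "/PowerSubsystem/PowerSupplies"
	if snapshot[primary] {
		return primary, true
	}
	legacy := chassisPath + "/Power"
	if snapshot[legacy] {
		return legacy, true
	}
	return "", false
}

func main() {
	snapshot := map[string]bool{
		"/redfish/v1/Chassis/1/Power": true,
	}
	path, ok := PSUSourcePath(snapshot, "/redfish/v1/Chassis/1")
	fmt.Println(path, ok)
}
```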

ADL-014 — Threshold logic lives on the server; UI reflects status only

Date: 2026-02-24

Context: Duplicating threshold math in frontend and backend creates drift and inconsistent highlighting (e.g. PSU mains voltage range checks).

Decision: Business threshold evaluation (e.g. PSU voltage nominal range) must be computed on the server; frontend only renders status/flags returned by the API.

Consequences:

  • Single source of truth for threshold policies.
  • UI can evolve visually without re-implementing domain logic.
  • API payloads may carry richer status semantics over time.
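
A minimal sketch of the split: the server evaluates the threshold and ships a ready-made status, so the frontend never sees the limits. The `PSUReading` shape, the JSON field names, and the 180-264 V range are illustrative assumptions, not the project's actual policy values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical nominal mains range; real limits live in server-side policy.
const (
	minMainsVolts = 180.0
	maxMainsVolts = 264.0
)

// PSUReading is what the API would return: the raw value plus a status
// the frontend can render without knowing the limits.
type PSUReading struct {
	InputVoltage float64 `json:"input_voltage"`
	Status       string  `json:"status"`
}

// EvaluatePSU applies the threshold on the server side.
func EvaluatePSU(volts float64) PSUReading {
	status := "ok"
	if volts < minMainsVolts || volts > maxMainsVolts {
		status = "warning"
	}
	return PSUReading{InputVoltage: volts, Status: status}
}

func main() {
	b, err := json.Marshal(EvaluatePSU(170))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```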

ADL-015 — Supermicro crashdump archive parser removed from active registry

Date: 2026-03-01

Context: The Supermicro crashdump parser (SMC Crash Dump Parser) produced low-value results for current workflows and was explicitly rejected as a supported archive path.

Decision: Remove supermicro vendor parser from active registration and project source. Do not include it in /api/parsers output or parser documentation matrix.

Consequences:

  • Supermicro crashdump archives (CDump.txt format) are no longer parsed by a dedicated vendor parser.
  • Such archives fall back to other matching parsers (typically generic) unless a new replacement parser is added.
  • Reintroduction requires a new parser package and an explicit registry import in vendors/vendors.go.

ADL-016 — Device-bound firmware must not appear in hardware.firmware

Date: 2026-03-01

Context: Dell TSR DCIM_SoftwareIdentity lists firmware for every component (NICs, PSUs, disks, backplanes) in addition to system-level firmware. Naively importing all entries into Hardware.Firmware caused device firmware to appear twice in Reanimator: once in the device's own record and again in the top-level firmware list.

Decision:

  • Hardware.Firmware contains only system-level firmware (BIOS, BMC/iDRAC, CPLD, Lifecycle Controller, storage controllers, BOSS).
  • Device-bound entries (NIC, PSU, Disk, Backplane, GPU) must not be added to Hardware.Firmware.
  • Parsers must store the FQDD (or equivalent slot identifier) in FirmwareInfo.Description so the Reanimator exporter can filter by FQDD prefix.
  • The exporter's isDeviceBoundFirmwareFQDD() function performs this filter.

Consequences:

  • Any new parser that ingests a per-device firmware inventory must follow the same rule.
  • Device firmware is accessible only via the device's own record, not the firmware list.
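
The FQDD-prefix filter can be sketched as below. This is not the exporter's actual `isDeviceBoundFirmwareFQDD()`: the function names, the `FirmwareInfo` fields shown, and the prefix list are assumptions, and the list is deliberately incomplete (backplane and GPU prefixes are omitted because their exact FQDD forms are not given here).

```go
package main

import (
	"fmt"
	"strings"
)

// FirmwareInfo is trimmed to the fields this sketch needs.
// Description carries the FQDD, as the decision above requires.
type FirmwareInfo struct {
	DeviceName  string
	Version     string
	Description string
}

// deviceBoundPrefixes is an illustrative, incomplete list of FQDD prefixes.
var deviceBoundPrefixes = []string{"NIC.", "PSU.", "Disk."}

// isDeviceBoundFQDD reports whether an FQDD belongs to a per-device component.
func isDeviceBoundFQDD(fqdd string) bool {
	for _, p := range deviceBoundPrefixes {
		if strings.HasPrefix(fqdd, p) {
			return true
		}
	}
	return false
}

// FilterSystemFirmware keeps only entries that are not device-bound.
func FilterSystemFirmware(all []FirmwareInfo) []FirmwareInfo {
	out := make([]FirmwareInfo, 0, len(all))
	for _, fw := range all {
		if !isDeviceBoundFQDD(fw.Description) {
			out = append(out, fw)
		}
	}
	return out
}

func main() {
	all := []FirmwareInfo{
		{DeviceName: "BIOS", Description: "BIOS.Setup.1-1"},
		{DeviceName: "NIC firmware", Description: "NIC.Integrated.1-1-1"},
		{DeviceName: "PSU firmware", Description: "PSU.Slot.1"},
	}
	for _, fw := range FilterSystemFirmware(all) {
		fmt.Println(fw.DeviceName)
	}
}
```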

ADL-017 — Vendor-embedded MAC addresses must be stripped from model name fields

Date: 2026-03-01

Context: Dell TSR embeds MAC addresses directly in ProductName and ElementName fields (e.g. "NVIDIA ConnectX-6 Lx 2x 25G SFP28 OCP3.0 SFF - C4:70:BD:DB:56:08"). This caused model names to contain MAC addresses in NIC model, NIC firmware device name, and potentially other fields.

Decision: Strip any - XX:XX:XX:XX:XX:XX suffix from all model/name string fields at parse time before storing in any model struct. Use the regex \s+-\s+([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$.

Consequences:

  • Model names are clean and consistent across all devices.
  • All parsers must apply this stripping to any field used as a device name or model.
  • Confirmed affected fields in Dell: DCIM_NICView.ProductName, DCIM_SoftwareIdentity.ElementName.
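
Applied with Go's `regexp` package, the regex from the decision looks like this. `StripMACSuffix` is a hypothetical helper name; the pattern and the sample string are taken from the text above.

```go
package main

import (
	"fmt"
	"regexp"
)

// macSuffix matches a trailing " - XX:XX:XX:XX:XX:XX" MAC address suffix.
var macSuffix = regexp.MustCompile(`\s+-\s+([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$`)

// StripMACSuffix removes a vendor-embedded MAC suffix from a model name.
// Strings without such a suffix are returned unchanged.
func StripMACSuffix(s string) string {
	return macSuffix.ReplaceAllString(s, "")
}

func main() {
	name := "NVIDIA ConnectX-6 Lx 2x 25G SFP28 OCP3.0 SFF - C4:70:BD:DB:56:08"
	fmt.Printf("%q\n", StripMACSuffix(name))
}
```

Because the pattern requires whitespace on both sides of the dash and is anchored at the end, the hyphen inside "ConnectX-6" is left alone.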

ADL-018 — NVMe bay probe must be restricted to storage-capable chassis types

Date: 2026-03-12

Context: shouldAdaptiveNVMeProbe was introduced in 2fa4a12 to recover NVMe drives on Supermicro BMCs that expose empty Drives collections but serve disks at direct Disk.Bay.N paths. The function returns true for any chassis with an empty Members array. On Supermicro HGX systems (SYS-A21GE-NBRT and similar) ~35 sub-chassis (GPU, NVSwitch, PCIeRetimer, ERoT, IRoT, BMC, FPGA) all carry ChassisType=Module/Component/Zone and expose empty /Drives collections. Without filtering, each sub-chassis triggered 384 HTTP requests, about 13 440 requests in total (35 × 384), or roughly 22 minutes of pure I/O waste per collection.

Decision: Before probing Disk.Bay.N candidates for a chassis, check its ChassisType via chassisTypeCanHaveNVMe. Skip if type is Module, Component, or Zone. Keep probing for Enclosure, RackMount, and any unrecognised type (fail-safe).

Consequences:

  • On HGX systems the NVMe post-probe phase drops from ~22 min to effectively zero.
  • NVMe backplane recovery (Enclosure type) is unaffected.
  • Any new chassis type that hosts NVMe storage is covered by the default true path.
  • chassisTypeCanHaveNVMe and the candidate-selection loop must have unit tests covering both the excluded types and the storage-capable types (see TestChassisTypeCanHaveNVMe and TestNVMePostProbeSkipsNonStorageChassis).
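
The deny-list shape of the check can be sketched as follows. The function name matches the one in the decision, but the body is an assumption: in particular, trimming and case-insensitive comparison are choices made for this sketch, not confirmed behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// chassisTypeCanHaveNVMe reports whether a chassis of the given Redfish
// ChassisType should be probed for Disk.Bay.N drives.
// Only the known non-storage types are excluded; anything unrecognised
// (including an empty string) stays probe-eligible as a fail-safe.
func chassisTypeCanHaveNVMe(chassisType string) bool {
	switch strings.ToLower(strings.TrimSpace(chassisType)) {
	case "module", "component", "zone":
		return false
	default:
		return true
	}
}

func main() {
	for _, t := range []string{"Module", "Component", "Zone", "Enclosure", "RackMount", "SomethingNew"} {
		fmt.Printf("%-12s probe=%v\n", t, chassisTypeCanHaveNVMe(t))
	}
	// Request volume avoided on the HGX system described above.
	fmt.Println("skipped requests:", 35*384)
}
```

A deny-list rather than an allow-list is what makes the "default true path" consequence hold: a chassis type nobody anticipated costs extra requests, but never loses drives.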

ADL-019 — isDeviceBoundFirmwareName must cover vendor-specific naming patterns per vendor

Date: 2026-03-12

Context: isDeviceBoundFirmwareName was written to filter Dell-style device firmware names ("GPU SomeDevice", "NIC OnboardLAN"). When Supermicro Redfish FirmwareInventory was added (6c19a58), no Supermicro-specific patterns were added. Supermicro names a NIC entry "NIC1 System Slot0 AOM-DP805-IO": a digit follows the type prefix directly, bypassing the "nic " (space-terminated) check. 29 device-bound entries leaked into hardware.firmware on SYS-A21GE-NBRT (HGX B200). Commit 9c5512d attempted a fix by adding _fw_gpu_ patterns, but checked DeviceName which contains "Software Inventory" (from the Redfish Name field), not the firmware inventory ID. The patterns were dead code from the moment they were committed.

Decision:

  • isDeviceBoundFirmwareName must be extended for each new vendor whose FirmwareInventory naming convention differs from the existing patterns.
  • When adding HGX/Supermicro patterns, check that the pattern matches the field value that collectFirmwareInventory actually stores — trace the data path from Redfish doc to FirmwareInfo.DeviceName before writing the condition.
  • TestIsDeviceBoundFirmwareName must contain at least one case per vendor format.

Consequences:

  • New vendors with FirmwareInventory support require a test covering both device-bound names (must return true) and system-level names (must return false) before the code ships.
  • The dead _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns were replaced with correct prefix+digit checks ("gpu" + digit, "nic" + digit) and explicit string checks ("nvmecontroller", "power supply", "software inventory").
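
The corrected checks described above can be sketched as below, together with a per-vendor case table of the kind the decision requires. The function name matches the text, but the body is an assumption: the helper `hasPrefixThenDigit` is hypothetical, and whether the explicit strings are matched by substring or by equality is not stated, so substring matching is used here.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPrefixThenDigit reports whether s starts with prefix immediately
// followed by an ASCII digit, e.g. "nic1 ..." for prefix "nic".
func hasPrefixThenDigit(s, prefix string) bool {
	if !strings.HasPrefix(s, prefix) || len(s) <= len(prefix) {
		return false
	}
	c := s[len(prefix)]
	return c >= '0' && c <= '9'
}

// isDeviceBoundFirmwareName sketches the filter on the stored DeviceName.
func isDeviceBoundFirmwareName(name string) bool {
	n := strings.ToLower(strings.TrimSpace(name))
	for _, p := range []string{"gpu", "nic"} {
		// Dell style: "gpu " / "nic ". Supermicro style: "gpu1" / "nic1".
		if strings.HasPrefix(n, p+" ") || hasPrefixThenDigit(n, p) {
			return true
		}
	}
	for _, s := range []string{"nvmecontroller", "power supply", "software inventory"} {
		if strings.Contains(n, s) { // substring match is an assumption
			return true
		}
	}
	return false
}

func main() {
	cases := []struct {
		vendor, name string
		want         bool
	}{
		{"dell", "GPU SomeDevice", true},
		{"dell", "NIC OnboardLAN", true},
		{"supermicro", "GPU1 System Slot0", true},
		{"supermicro", "NIC1 System Slot0 AOM-DP805-IO", true},
		{"any", "BIOS", false},
		{"any", "BMC", false},
	}
	for _, c := range cases {
		got := isDeviceBoundFirmwareName(c.name)
		fmt.Printf("%-10s %-32q got=%v want=%v\n", c.vendor, c.name, got, c.want)
	}
}
```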