logpile/docs/bible/10-decisions.md
2026-02-25 12:17:17 +03:00

10 — Architectural Decision Log (ADL)

Rule: Every significant architectural decision must be recorded here before or alongside the code change. This applies to humans and AI assistants alike.

Format: date · title · context · decision · consequences


ADL-001 — In-memory only state (no database)

Date: project start

Context: LOGPile is designed as a standalone diagnostic tool, not a persistent service.

Decision: All parsed/collected data lives in Server.result (in-memory). No database, no files written.

Consequences:

  • Data is lost on process restart — intentional.
  • Simple deployment: single binary, no setup required.
  • JSON export is the persistence mechanism for users who want to save results.

ADL-002 — Vendor parser auto-registration via init()

Date: project start

Context: Need an extensible parser registry without a central factory function.

Decision: Each vendor parser registers itself in its package's init() function. vendors/vendors.go holds blank imports to trigger registration.

Consequences:

  • Adding a new parser requires only: implement interface + add one blank import.
  • No central list to maintain (other than the import file).
  • go test ./... will include new parsers automatically.
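
The registration pattern described above can be sketched as follows. This is a minimal single-file illustration, not the project's actual code: the Parser interface, the Register and Registered functions, and exampleParser are all hypothetical names. In the real layout the init() call would sit in each vendor package and run when vendors/vendors.go blank-imports it.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Parser is a hypothetical stand-in for the project's vendor parser interface.
type Parser interface {
	Name() string
}

var (
	mu       sync.RWMutex
	registry = map[string]Parser{}
)

// Register adds a parser to the registry. It is meant to be called from init().
func Register(p Parser) {
	mu.Lock()
	defer mu.Unlock()
	if _, dup := registry[p.Name()]; dup {
		panic("parser registered twice: " + p.Name())
	}
	registry[p.Name()] = p
}

// Registered returns the names of all registered parsers in sorted order.
func Registered() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// exampleParser would normally live in its own vendor package.
type exampleParser struct{}

func (exampleParser) Name() string { return "example-vendor" }

// Package-level variables are initialized before init() runs,
// so the registry map is ready by the time Register is called.
func init() { Register(exampleParser{}) }

func main() {
	fmt.Println(Registered())
}
```

Panicking on a duplicate name is one possible policy; it surfaces a copy-paste mistake at startup instead of letting one parser silently shadow another.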

ADL-003 — Highest-confidence parser wins

Date: project start

Context: Multiple parsers may partially match an archive (e.g. generic + specific vendor).

Decision: Run every parser's Detect() and select the parser that returns the highest score (0–100).

Consequences:

  • Generic fallback (score 15) only activates when no vendor parser scores higher.
  • Parsers must be conservative with high scores (70+) to avoid false positives.
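
A minimal sketch of the selection step, assuming a hypothetical Detector interface whose Detect returns an integer score. The tie-breaking rule (first parser wins) and the treatment of a zero score as "no match" are assumptions of this example, not documented behavior.

```go
package main

import "fmt"

// Detector is a hypothetical slice of the parser interface: Detect returns
// a confidence score in the range 0-100 for the given archive file names.
type Detector interface {
	Name() string
	Detect(files []string) int
}

// fixedScore is a test double that always reports the same score.
type fixedScore struct {
	name  string
	score int
}

func (f fixedScore) Name() string        { return f.name }
func (f fixedScore) Detect([]string) int { return f.score }

// pickParser runs every Detect and returns the highest-scoring parser.
// It returns nil when no parser scores above zero. Ties keep the earlier parser.
func pickParser(parsers []Detector, files []string) Detector {
	var best Detector
	bestScore := 0
	for _, p := range parsers {
		if s := p.Detect(files); s > bestScore {
			best, bestScore = p, s
		}
	}
	return best
}

func main() {
	parsers := []Detector{
		fixedScore{name: "generic", score: 15},
		fixedScore{name: "vendor-x", score: 80},
	}
	if p := pickParser(parsers, []string{"dump.tar.gz"}); p != nil {
		fmt.Println("selected:", p.Name())
	}
}
```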

ADL-004 — Canonical hardware.devices as single source of truth

Date: v1.5.0

Context: UI tabs and Reanimator exporter were reading from different sub-fields of AnalysisResult, causing potential drift.

Decision: Introduce hardware.devices as the canonical inventory repository. All UI tabs and all exporters must read exclusively from this repository.

Consequences:

  • Any UI vs Reanimator discrepancy is classified as a bug, not a "known difference".
  • Deduplication logic runs once in the repository builder (serial → bdf → distinct).
  • New hardware attributes must be added to canonical schema first, then mapped to consumers.
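
One possible reading of the "serial → bdf → distinct" deduplication order is sketched below: a record is keyed by its serial number when it has one, otherwise by its PCI bus/device/function address, and otherwise it is kept as a distinct entry. The Device type and its fields are hypothetical, and the real repository builder may merge records differently (for example, this sketch does not merge a serial-keyed record with a BDF-only record of the same device).

```go
package main

import "fmt"

// Device is a hypothetical, simplified canonical inventory record.
type Device struct {
	Kind   string
	Serial string
	BDF    string // PCI address, e.g. "0000:17:00.0"
}

// dedupKey picks the identity used for deduplication:
// serial first, then BDF, else a per-record key so the record stays distinct.
func dedupKey(d Device, index int) string {
	switch {
	case d.Serial != "":
		return "serial:" + d.Serial
	case d.BDF != "":
		return "bdf:" + d.BDF
	default:
		return fmt.Sprintf("distinct:%d", index)
	}
}

// dedupe keeps the first record seen for each key and preserves input order.
func dedupe(in []Device) []Device {
	seen := make(map[string]bool, len(in))
	out := make([]Device, 0, len(in))
	for i, d := range in {
		k := dedupKey(d, i)
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, d)
	}
	return out
}

func main() {
	devices := []Device{
		{Kind: "gpu", Serial: "S1", BDF: "0000:17:00.0"},
		{Kind: "gpu", Serial: "S1"},
		{Kind: "nic", BDF: "0000:31:00.0"},
		{Kind: "psu"},
	}
	for _, d := range dedupe(devices) {
		fmt.Printf("%+v\n", d)
	}
}
```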

ADL-005 — No hardcoded PCI model strings; use pci.ids

Date: v1.5.0

Context: NVIDIA and other vendors release new GPU models frequently; hardcoded maps required code changes for each new model ID.

Decision: Use the pciutils/pciids database (git submodule, embedded at build time). PCI vendor/device ID → human-readable model name via lookup.

Consequences:

  • New GPU models can be supported by updating pci.ids without code changes.
  • make build auto-syncs pci.ids from submodule before compilation.
  • External override via LOGPILE_PCI_IDS_PATH env var.

ADL-006 — Reanimator export uses canonical hardware.devices (not raw sub-fields)

Date: v1.5.0

Context: Early Reanimator exporter read from Hardware.GPUs, Hardware.NICs, etc. directly, diverging from UI data.

Decision: Reanimator exporter must use hardware.devices, the same source as the UI. Exporter groups/filters canonical records by section; does not rebuild from sub-fields.

Consequences:

  • Guarantees UI and export consistency.
  • Exporter code is simpler — mainly a filter+map, not a data reconstruction.

ADL-007 — Documentation language is English

Date: 2026-02-20

Context: Codebase documentation was mixed Russian/English, reducing clarity for international contributors and AI assistants.

Decision: All maintained project documentation (docs/bible/, README.md, CLAUDE.md, and new technical docs) must be written in English.

Consequences:

  • Bible is authoritative in English.
  • AI assistants get consistent, unambiguous context.

ADL-008 — Bible is the single source of truth for architecture docs

Date: 2026-02-23

Context: Architecture information was duplicated across README.md, CLAUDE.md, and the Bible, creating drift risk and stale guidance for humans and AI agents.

Decision: Keep architecture and technical design documentation only in docs/bible/. Top-level README.md and CLAUDE.md must remain minimal pointers/instructions.

Consequences:

  • Reduces documentation drift and duplicate updates.
  • AI assistants are directed to one authoritative source before making changes.
  • Documentation updates that affect architecture must include Bible changes (and ADL entries when significant).

ADL-009 — Redfish analysis is performed from raw snapshot replay (unified tunnel)

Date: 2026-02-24

Context: Live Redfish collection and raw export re-analysis used different parsing paths, which caused drift and made bug fixes difficult to validate consistently.

Decision: Redfish live collection must produce a raw_payloads.redfish_tree snapshot first, then run the same replay analyzer used for imported raw exports.

Consequences:

  • Same redfish_tree input produces the same parsed result in live and offline modes.
  • Debugging parser issues can be done against exported raw bundles without live BMC access.
  • Snapshot completeness becomes critical; collector seeds/limits are part of analyzer correctness.
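
The "unified tunnel" can be pictured as two entry points that converge on one analyzer. The sketch below is illustrative only: Snapshot, Result, replayAnalyze, analyzeLive, and analyzeImported are hypothetical names, and the analyzer body is a placeholder.

```go
package main

import "fmt"

// Snapshot is a hypothetical in-memory form of raw_payloads.redfish_tree:
// Redfish path -> raw JSON document.
type Snapshot map[string][]byte

// Result is a placeholder for the parsed analysis output.
type Result struct {
	DocCount int
}

// replayAnalyze is the single analysis path. It only sees the snapshot,
// never the live BMC, so live and offline runs cannot diverge.
func replayAnalyze(s Snapshot) Result {
	return Result{DocCount: len(s)}
}

// analyzeLive collects a snapshot first, then reuses the replay analyzer.
func analyzeLive(collect func() (Snapshot, error)) (Snapshot, Result, error) {
	snap, err := collect()
	if err != nil {
		return nil, Result{}, err
	}
	return snap, replayAnalyze(snap), nil
}

// analyzeImported handles a snapshot loaded from a raw export.
func analyzeImported(snap Snapshot) Result {
	return replayAnalyze(snap)
}

func main() {
	collect := func() (Snapshot, error) {
		return Snapshot{"/redfish/v1": []byte(`{}`)}, nil
	}
	snap, live, err := analyzeLive(collect)
	if err != nil {
		fmt.Println("collect failed:", err)
		return
	}
	fmt.Println("live and replay agree:", live == analyzeImported(snap))
}
```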

ADL-010 — Raw export is a self-contained re-analysis package (not a final result dump)

Date: 2026-02-24

Context: Exporting only normalized AnalysisResult loses raw source fidelity and prevents future parser improvements from being applied to already collected data.

Decision: Export Raw Data produces a self-contained raw package (JSON or ZIP bundle) that the application can reopen and re-analyze. Parsed data in the package is optional and not the source of truth on import.

Consequences:

  • Re-opening an export always re-runs analysis from raw source (redfish_tree or uploaded file bytes).
  • Raw bundles include collection context and diagnostics for debugging (collect.log, parser_fields.json).
  • Endpoint compatibility is preserved (/api/export/json) while actual payload format may be a bundle.

ADL-011 — Redfish snapshot crawler is bounded, prioritized, and failure-tolerant

Date: 2026-02-24

Context: Full Redfish trees on modern GPU systems are large, noisy, and contain many vendor-specific or non-fetchable links. Unbounded crawling and naive queue design caused hangs and incomplete snapshots.

Decision: Use a bounded snapshot crawler with:

  • explicit document cap (LOGPILE_REDFISH_SNAPSHOT_MAX_DOCS)
  • priority seed paths (PCIe/Fabrics/Firmware/Storage/PowerSubsystem/ThermalSubsystem)
  • normalized @odata.id paths (strip #fragment)
  • noisy expected error filtering (404/405/410/501 hidden from UI)
  • queue capacity sized to crawl cap to avoid producer/consumer deadlock

Consequences:

  • Snapshot collection remains stable on large BMC trees.
  • Most high-value inventory paths are reached before the cap.
  • UI progress remains useful while debug logs retain low-level fetch failures.

ADL-012 — Vendor-specific storage inventory probing is allowed as fallback

Date: 2026-02-24

Context: Some Supermicro BMCs expose empty standard Storage/.../Drives collections while real disk inventory exists under vendor-specific Disk.Bay endpoints and enclosure links.

Decision: When standard drive collections are empty, collector/replay may probe vendor-style .../Drives/Disk.Bay.* endpoints and follow Storage.Links.Enclosures[*] to recover physical drives.

Consequences:

  • Higher storage inventory coverage on Supermicro HBA/HA-RAID/MRVL/NVMe backplane implementations.
  • Replay must mirror the same probing behavior to preserve deterministic results.
  • Probing remains bounded (finite candidate set) to avoid runaway requests.
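
The bounded fallback can be sketched as below. Everything here is illustrative: probeDrives and its parameters are hypothetical, the numeric Disk.Bay.N suffix is an assumption about how the wildcard expands, and enclosure-link following is omitted. The exists callback stands in for either a live fetch or a snapshot lookup, which is how collector and replay could share the same logic.

```go
package main

import "fmt"

// probeDrives returns the drive paths to use for one storage resource.
// standard holds the members of the standard Drives collection; when it is
// non-empty it is used as-is. Otherwise a finite candidate set of
// vendor-style paths under drivesBase is probed (at most maxBays requests).
func probeDrives(standard []string, drivesBase string, maxBays int, exists func(string) bool) []string {
	if len(standard) > 0 {
		return standard
	}
	var found []string
	for i := 0; i < maxBays; i++ {
		// Assumption: bays are numbered from 0 with a numeric suffix.
		candidate := fmt.Sprintf("%s/Disk.Bay.%d", drivesBase, i)
		if exists(candidate) {
			found = append(found, candidate)
		}
	}
	return found
}

func main() {
	// Hypothetical snapshot contents used for the lookup.
	snapshot := map[string]bool{
		"/storage/Drives/Disk.Bay.1": true,
		"/storage/Drives/Disk.Bay.3": true,
	}
	exists := func(p string) bool { return snapshot[p] }
	fmt.Println(probeDrives(nil, "/storage/Drives", 4, exists))
}
```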

ADL-013 — PowerSubsystem is preferred over legacy Power on newer Redfish implementations

Date: 2026-02-24

Context: X14+/newer Redfish implementations increasingly expose authoritative PSU data in PowerSubsystem/PowerSupplies, while legacy /Power may be incomplete or schema-shifted.

Decision: Prefer Chassis/*/PowerSubsystem/PowerSupplies as the primary PSU source and use legacy Chassis/*/Power as fallback.

Consequences:

  • Better compatibility with newer BMC firmware generations.
  • Legacy systems remain supported without special-case collector selection.
  • Snapshot priority seeds must include PowerSubsystem resources.
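
A minimal sketch of the source preference, assuming the snapshot is keyed by full Redfish paths under the standard /redfish/v1 service root. The function name psuSourcePaths is hypothetical, and treating "any PowerSubsystem path present" as the switch (rather than deciding per chassis, or checking that the collection is non-empty) is a simplification.

```go
package main

import (
	"fmt"
	"path"
)

const (
	preferredPattern = "/redfish/v1/Chassis/*/PowerSubsystem/PowerSupplies"
	legacyPattern    = "/redfish/v1/Chassis/*/Power"
)

// psuSourcePaths returns the snapshot paths to read PSU data from:
// PowerSubsystem/PowerSupplies when available, otherwise legacy Power.
func psuSourcePaths(snapshotPaths []string) []string {
	var preferred, legacy []string
	for _, p := range snapshotPaths {
		// path.Match's "*" does not cross "/" boundaries, so it matches
		// exactly one chassis ID segment.
		if ok, _ := path.Match(preferredPattern, p); ok {
			preferred = append(preferred, p)
		} else if ok, _ := path.Match(legacyPattern, p); ok {
			legacy = append(legacy, p)
		}
	}
	if len(preferred) > 0 {
		return preferred
	}
	return legacy
}

func main() {
	paths := []string{
		"/redfish/v1/Chassis/1/Power",
		"/redfish/v1/Chassis/1/PowerSubsystem/PowerSupplies",
	}
	fmt.Println(psuSourcePaths(paths))
}
```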

ADL-014 — Threshold logic lives on the server; UI reflects status only

Date: 2026-02-24

Context: Duplicating threshold math in frontend and backend creates drift and inconsistent highlighting (e.g. PSU mains voltage range checks).

Decision: Business threshold evaluation (e.g. PSU voltage nominal range) must be computed on the server; frontend only renders status/flags returned by the API.

Consequences:

  • Single source of truth for threshold policies.
  • UI can evolve visually without re-implementing domain logic.
  • API payloads may carry richer status semantics over time.