# 10 — Architectural Decision Log (ADL)

> **Rule:** Every significant architectural decision **must be recorded here** before or alongside
> the code change. This applies to humans and AI assistants alike.
>
> Format: date · title · context · decision · consequences

---

## ADL-001 — In-memory only state (no database)

**Date:** project start
**Context:** LOGPile is designed as a standalone diagnostic tool, not a persistent service.
**Decision:** All parsed/collected data lives in `Server.result` (in-memory). No database, no files written.
**Consequences:**
- Data is lost on process restart — intentional.
- Simple deployment: single binary, no setup required.
- JSON export is the persistence mechanism for users who want to save results.

---

## ADL-002 — Vendor parser auto-registration via init()

**Date:** project start
**Context:** Need an extensible parser registry without a central factory function.
**Decision:** Each vendor parser registers itself in its package's `init()` function.
`vendors/vendors.go` holds blank imports to trigger registration.
**Consequences:**
- Adding a new parser requires only: implement interface + add one blank import.
- No central list to maintain (other than the import file).
- `go test ./...` will include new parsers automatically.
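
The registration pattern can be sketched in a single file as follows. This is a minimal illustration, not the project's code: the `Parser` interface, `Register`, `Names`, and `dellParser` are hypothetical names, and in the real layout the `init()` would live in the vendor's own package and run because of the blank import in `vendors/vendors.go`.

```go
package main

import (
	"fmt"
	"sort"
)

// Parser is a hypothetical stand-in for the project's vendor parser interface.
type Parser interface {
	Name() string
}

// registry holds every parser that registered itself during package init.
var registry = map[string]Parser{}

// Register adds a parser to the registry; vendor packages call it from init().
func Register(p Parser) {
	registry[p.Name()] = p
}

// Names returns the registered parser names in sorted order.
func Names() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// dellParser is an illustrative vendor parser.
type dellParser struct{}

func (dellParser) Name() string { return "dell" }

// init runs before main, so the parser is registered without any central factory.
func init() {
	Register(dellParser{})
}

func main() {
	fmt.Println(Names())
}
```

Package-level variables are initialized before any `init()` runs, so the registry map is ready by the time a vendor package registers itself.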

---

## ADL-003 — Highest-confidence parser wins

**Date:** project start
**Context:** Multiple parsers may partially match an archive (e.g. generic + specific vendor).
**Decision:** Run all parsers' `Detect()`, select the one returning the highest score (0–100).
**Consequences:**
- Generic fallback (score 15) only activates when no vendor parser scores higher.
- Parsers must be conservative with high scores (70+) to avoid false positives.
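
A minimal sketch of the selection rule, under stated assumptions: the `Detector` interface, the `fixedScore` helper, and the tie-breaking behavior (first parser with the highest score wins, a score of 0 means no match) are illustrative and not taken from the project source.

```go
package main

import "fmt"

// Detector is a hypothetical view of a parser: it scores an archive from 0 to 100.
type Detector interface {
	Name() string
	Detect(files []string) int
}

// fixedScore is a test double that always returns the same score.
type fixedScore struct {
	name  string
	score int
}

func (f fixedScore) Name() string        { return f.name }
func (f fixedScore) Detect([]string) int { return f.score }

// Best runs every Detect() and returns the parser with the highest score.
// It returns nil when no parser scores above zero (assumed convention).
func Best(parsers []Detector, files []string) (Detector, int) {
	var best Detector
	bestScore := 0
	for _, p := range parsers {
		if s := p.Detect(files); s > bestScore {
			best, bestScore = p, s
		}
	}
	return best, bestScore
}

func main() {
	parsers := []Detector{fixedScore{"generic", 15}, fixedScore{"dell", 80}}
	p, score := Best(parsers, nil)
	fmt.Println(p.Name(), score)
}
```

With this shape, the generic parser's score of 15 only wins when every vendor parser returns something lower.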

---

## ADL-004 — Canonical hardware.devices as single source of truth

**Date:** v1.5.0
**Context:** UI tabs and Reanimator exporter were reading from different sub-fields of
`AnalysisResult`, causing potential drift.
**Decision:** Introduce `hardware.devices` as the canonical inventory repository.
All UI tabs and all exporters must read exclusively from this repository.
**Consequences:**
- Any UI vs Reanimator discrepancy is classified as a bug, not a "known difference".
- Deduplication logic runs once in the repository builder (serial → bdf → distinct).
- New hardware attributes must be added to canonical schema first, then mapped to consumers.
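
One possible reading of the "serial → bdf → distinct" rule is sketched below. It assumes that a record is keyed by serial number when present, otherwise by PCI bus/device/function address, and that a record with neither is always kept as distinct. The `Device` type and `dedup` function are hypothetical, and the real repository builder may also merge fields across duplicates, which this sketch does not attempt.

```go
package main

import "fmt"

// Device is a simplified, hypothetical canonical inventory record.
type Device struct {
	Serial string
	BDF    string
	Model  string
}

// dedup keeps the first record for each identity key.
// Key precedence: serial, then BDF; records with neither are treated as distinct.
func dedup(devs []Device) []Device {
	seen := map[string]bool{}
	var out []Device
	for _, d := range devs {
		key := ""
		switch {
		case d.Serial != "":
			key = "serial:" + d.Serial
		case d.BDF != "":
			key = "bdf:" + d.BDF
		}
		if key != "" {
			if seen[key] {
				continue // duplicate of an earlier record
			}
			seen[key] = true
		}
		out = append(out, d)
	}
	return out
}

func main() {
	devs := []Device{
		{Serial: "S1", BDF: "0000:01:00.0"},
		{Serial: "S1"},
		{BDF: "0000:02:00.0"},
		{BDF: "0000:02:00.0"},
		{Model: "unknown"},
	}
	fmt.Println(len(dedup(devs)))
}
```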

---

## ADL-005 — No hardcoded PCI model strings; use pci.ids

**Date:** v1.5.0
**Context:** NVIDIA and other vendors release new GPU models frequently; hardcoded maps
required code changes for each new model ID.
**Decision:** Use the `pciutils/pciids` database (git submodule, embedded at build time).
PCI vendor/device ID → human-readable model name via lookup.
**Consequences:**
- New GPU models can be supported by updating `pci.ids` without code changes.
- `make build` auto-syncs `pci.ids` from submodule before compilation.
- External override via `LOGPILE_PCI_IDS_PATH` env var.

---

## ADL-006 — Reanimator export uses canonical hardware.devices (not raw sub-fields)

**Date:** v1.5.0
**Context:** Early Reanimator exporter read from `Hardware.GPUs`, `Hardware.NICs`, etc.
directly, diverging from UI data.
**Decision:** Reanimator exporter must use `hardware.devices` — the same source as the UI.
Exporter groups/filters canonical records by section; does not rebuild from sub-fields.
**Consequences:**
- Guarantees UI and export consistency.
- Exporter code is simpler — mainly a filter+map, not a data reconstruction.

---

## ADL-007 — Documentation language is English

**Date:** 2026-02-20
**Context:** Codebase documentation was mixed Russian/English, reducing clarity for
international contributors and AI assistants.
**Decision:** All maintained project documentation (`docs/bible/`, `README.md`,
`CLAUDE.md`, and new technical docs) must be written in English.
**Consequences:**
- Bible is authoritative in English.
- AI assistants get consistent, unambiguous context.

---

## ADL-008 — Bible is the single source of truth for architecture docs

**Date:** 2026-02-23
**Context:** Architecture information was duplicated across `README.md`, `CLAUDE.md`,
and the Bible, creating drift risk and stale guidance for humans and AI agents.
**Decision:** Keep architecture and technical design documentation only in `docs/bible/`.
Top-level `README.md` and `CLAUDE.md` must remain minimal pointers/instructions.
**Consequences:**
- Reduces documentation drift and duplicate updates.
- AI assistants are directed to one authoritative source before making changes.
- Documentation updates that affect architecture must include Bible changes (and ADL entries when significant).

---

## ADL-009 — Redfish analysis is performed from raw snapshot replay (unified tunnel)

**Date:** 2026-02-24
**Context:** Live Redfish collection and raw export re-analysis used different parsing paths,
which caused drift and made bug fixes difficult to validate consistently.
**Decision:** Redfish live collection must produce a `raw_payloads.redfish_tree` snapshot first,
then run the same replay analyzer used for imported raw exports.
**Consequences:**
- Same `redfish_tree` input produces the same parsed result in live and offline modes.
- Debugging parser issues can be done against exported raw bundles without live BMC access.
- Snapshot completeness becomes critical; collector seeds/limits are part of analyzer correctness.

---

## ADL-010 — Raw export is a self-contained re-analysis package (not a final result dump)

**Date:** 2026-02-24
**Context:** Exporting only normalized `AnalysisResult` loses raw source fidelity and prevents
future parser improvements from being applied to already collected data.
**Decision:** `Export Raw Data` produces a self-contained raw package (JSON or ZIP bundle)
that the application can reopen and re-analyze. Parsed data in the package is optional and not
the source of truth on import.
**Consequences:**
- Re-opening an export always re-runs analysis from raw source (`redfish_tree` or uploaded file bytes).
- Raw bundles include collection context and diagnostics for debugging (`collect.log`, `parser_fields.json`).
- Endpoint compatibility is preserved (`/api/export/json`) while actual payload format may be a bundle.

---

## ADL-011 — Redfish snapshot crawler is bounded, prioritized, and failure-tolerant

**Date:** 2026-02-24
**Context:** Full Redfish trees on modern GPU systems are large, noisy, and contain many
vendor-specific or non-fetchable links. Unbounded crawling and naive queue design caused hangs
and incomplete snapshots.
**Decision:** Use a bounded snapshot crawler with:
- explicit document cap (`LOGPILE_REDFISH_SNAPSHOT_MAX_DOCS`)
- priority seed paths (PCIe/Fabrics/Firmware/Storage/PowerSubsystem/ThermalSubsystem)
- normalized `@odata.id` paths (strip `#fragment`)
- noisy expected error filtering (404/405/410/501 hidden from UI)
- queue capacity sized to crawl cap to avoid producer/consumer deadlock

**Consequences:**
- Snapshot collection remains stable on large BMC trees.
- Most high-value inventory paths are reached before the cap.
- UI progress remains useful while debug logs retain low-level fetch failures.
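
The interaction between the document cap, path normalization, and queue sizing can be sketched as below. This is a simplified single-goroutine model, not the collector's implementation: `normalizeODataID`, `crawl`, and the in-memory `links` map (standing in for HTTP fetches) are hypothetical. The point it illustrates is that when at most `maxDocs` paths are ever enqueued and the queue has capacity `maxDocs`, a send on the queue can never block.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeODataID strips a "#fragment" so the same document is not fetched twice.
func normalizeODataID(p string) string {
	if i := strings.IndexByte(p, '#'); i >= 0 {
		p = p[:i]
	}
	return p
}

// crawl visits at most maxDocs documents, seeds first, then discovered links.
// links maps a path to the @odata.id values found in that document.
func crawl(seeds []string, links map[string][]string, maxDocs int) []string {
	if maxDocs <= 0 {
		return nil
	}
	seen := map[string]bool{}
	// Capacity equals the cap: no more than maxDocs paths are ever enqueued,
	// so enqueue never blocks even with a single consumer.
	queue := make(chan string, maxDocs)
	enqueue := func(p string) {
		p = normalizeODataID(p)
		if p == "" || seen[p] || len(seen) >= maxDocs {
			return
		}
		seen[p] = true
		queue <- p
	}
	for _, s := range seeds {
		enqueue(s) // priority seeds go in before anything discovered
	}
	var visited []string
	for len(queue) > 0 {
		p := <-queue
		visited = append(visited, p)
		for _, l := range links[p] {
			enqueue(l)
		}
	}
	return visited
}

func main() {
	links := map[string][]string{
		"/redfish/v1": {"/redfish/v1/Systems", "/redfish/v1/Chassis#/Links"},
	}
	fmt.Println(crawl([]string{"/redfish/v1"}, links, 10))
}
```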

---

## ADL-012 — Vendor-specific storage inventory probing is allowed as fallback

**Date:** 2026-02-24
**Context:** Some Supermicro BMCs expose empty standard `Storage/.../Drives` collections while
real disk inventory exists under vendor-specific `Disk.Bay` endpoints and enclosure links.
**Decision:** When standard drive collections are empty, collector/replay may probe vendor-style
`.../Drives/Disk.Bay.*` endpoints and follow `Storage.Links.Enclosures[*]` to recover physical drives.
**Consequences:**
- Higher storage inventory coverage on Supermicro HBA/HA-RAID/MRVL/NVMe backplane implementations.
- Replay must mirror the same probing behavior to preserve deterministic results.
- Probing remains bounded (finite candidate set) to avoid runaway requests.

---

## ADL-013 — PowerSubsystem is preferred over legacy Power on newer Redfish implementations

**Date:** 2026-02-24
**Context:** X14+/newer Redfish implementations increasingly expose authoritative PSU data in
`PowerSubsystem/PowerSupplies`, while legacy `/Power` may be incomplete or schema-shifted.
**Decision:** Prefer `Chassis/*/PowerSubsystem/PowerSupplies` as the primary PSU source and use
legacy `Chassis/*/Power` as fallback.
**Consequences:**
- Better compatibility with newer BMC firmware generations.
- Legacy systems remain supported without special-case collector selection.
- Snapshot priority seeds must include `PowerSubsystem` resources.

---

## ADL-014 — Threshold logic lives on the server; UI reflects status only

**Date:** 2026-02-24
**Context:** Duplicating threshold math in frontend and backend creates drift and inconsistent
highlighting (e.g. PSU mains voltage range checks).
**Decision:** Business threshold evaluation (e.g. PSU voltage nominal range) must be computed on
the server; frontend only renders status/flags returned by the API.
**Consequences:**
- Single source of truth for threshold policies.
- UI can evolve visually without re-implementing domain logic.
- API payloads may carry richer status semantics over time.

---

## ADL-015 — Supermicro crashdump archive parser removed from active registry

**Date:** 2026-03-01
**Context:** The Supermicro crashdump parser (`SMC Crash Dump Parser`) produced low-value
results for current workflows and was explicitly rejected as a supported archive path.
**Decision:** Remove `supermicro` vendor parser from active registration and project source.
Do not include it in `/api/parsers` output or parser documentation matrix.
**Consequences:**
- Supermicro crashdump archives (`CDump.txt` format) are no longer parsed by a dedicated vendor parser.
- Such archives fall back to other matching parsers (typically `generic`) unless a new replacement parser is added.
- Reintroduction requires a new parser package and an explicit registry import in `vendors/vendors.go`.

---

## ADL-016 — Device-bound firmware must not appear in hardware.firmware

**Date:** 2026-03-01
**Context:** Dell TSR `DCIM_SoftwareIdentity` lists firmware for every component (NICs,
PSUs, disks, backplanes) in addition to system-level firmware. Naively importing all entries
into `Hardware.Firmware` caused device firmware to appear twice in Reanimator: once in the
device's own record and again in the top-level firmware list.
**Decision:**
- `Hardware.Firmware` contains only system-level firmware (BIOS, BMC/iDRAC, CPLD,
Lifecycle Controller, storage controllers, BOSS).
- Device-bound entries (NIC, PSU, Disk, Backplane, GPU) must not be added to
`Hardware.Firmware`.
- Parsers must store the FQDD (or equivalent slot identifier) in `FirmwareInfo.Description`
so the Reanimator exporter can filter by FQDD prefix.
- The exporter's `isDeviceBoundFirmwareFQDD()` function performs this filter.

**Consequences:**
- Any new parser that ingests a per-device firmware inventory must follow the same rule.
- Device firmware is accessible only via the device's own record, not the firmware list.

---

## ADL-017 — Vendor-embedded MAC addresses must be stripped from model name fields

**Date:** 2026-03-01
**Context:** Dell TSR embeds MAC addresses directly in `ProductName` and `ElementName`
fields (e.g. `"NVIDIA ConnectX-6 Lx 2x 25G SFP28 OCP3.0 SFF - C4:70:BD:DB:56:08"`).
This caused model names to contain MAC addresses in NIC model, NIC firmware device name,
and potentially other fields.
**Decision:** Strip any ` - XX:XX:XX:XX:XX:XX` suffix from all model/name string fields
at parse time before storing in any model struct. Use the regex
`\s+-\s+([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$`.
**Consequences:**
- Model names are clean and consistent across all devices.
- All parsers must apply this stripping to any field used as a device name or model.
- Confirmed affected fields in Dell: `DCIM_NICView.ProductName`, `DCIM_SoftwareIdentity.ElementName`.
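
The stripping rule can be expressed with Go's `regexp` package using the pattern given in the decision. The helper name `stripMACSuffix` is hypothetical; only the regular expression and the sample product name come from this entry.

```go
package main

import (
	"fmt"
	"regexp"
)

// macSuffix matches a trailing " - XX:XX:XX:XX:XX:XX" MAC address suffix.
var macSuffix = regexp.MustCompile(`\s+-\s+([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$`)

// stripMACSuffix removes a vendor-embedded MAC address from the end of a name.
// Names without such a suffix are returned unchanged.
func stripMACSuffix(name string) string {
	return macSuffix.ReplaceAllString(name, "")
}

func main() {
	fmt.Println(stripMACSuffix("NVIDIA ConnectX-6 Lx 2x 25G SFP28 OCP3.0 SFF - C4:70:BD:DB:56:08"))
}
```

The hyphen inside `ConnectX-6` is not affected because the pattern requires whitespace on both sides of the dash and is anchored to the end of the string.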

---

## ADL-018 — NVMe bay probe must be restricted to storage-capable chassis types

**Date:** 2026-03-12
**Context:** `shouldAdaptiveNVMeProbe` was introduced in `2fa4a12` to recover NVMe drives on
Supermicro BMCs that expose empty `Drives` collections but serve disks at direct `Disk.Bay.N`
paths. The function returns `true` for any chassis with an empty `Members` array. On
Supermicro HGX systems (SYS-A21GE-NBRT and similar) ~35 sub-chassis (GPU, NVSwitch,
PCIeRetimer, ERoT, IRoT, BMC, FPGA) all carry `ChassisType=Module/Component/Zone` and
expose empty `/Drives` collections. Without filtering, each triggered 384 HTTP requests,
for about 35 × 384 = 13 440 requests, or roughly 22 minutes of pure I/O waste per collection.
**Decision:** Before probing `Disk.Bay.N` candidates for a chassis, check its `ChassisType`
via `chassisTypeCanHaveNVMe`. Skip if type is `Module`, `Component`, or `Zone`. Keep probing
for `Enclosure`, `RackMount`, and any unrecognised type (fail-safe).
**Consequences:**
- On HGX systems, the NVMe post-probe phase drops from ~22 min to effectively zero.
- NVMe backplane recovery (`Enclosure` type) is unaffected.
- Any new chassis type that hosts NVMe storage is covered by the default `true` path.
- `chassisTypeCanHaveNVMe` and the candidate-selection loop must have unit tests covering
both the excluded types and the storage-capable types (see `TestChassisTypeCanHaveNVMe`
and `TestNVMePostProbeSkipsNonStorageChassis`).
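
A minimal sketch of the chassis-type gate, assuming exact, case-sensitive `ChassisType` strings. The function name mirrors the one in the decision, but the body, the `Chassis` type, and `probeCandidates` are illustrative and not copied from the project.

```go
package main

import "fmt"

// Chassis is a hypothetical, trimmed-down view of a Redfish chassis resource.
type Chassis struct {
	ID          string
	ChassisType string
}

// chassisTypeCanHaveNVMe returns false only for types known not to host drives.
// Unrecognised or empty types return true, which is the fail-safe default.
func chassisTypeCanHaveNVMe(chassisType string) bool {
	switch chassisType {
	case "Module", "Component", "Zone":
		return false
	}
	return true
}

// probeCandidates filters the chassis list down to those worth probing for Disk.Bay.N.
func probeCandidates(all []Chassis) []Chassis {
	var out []Chassis
	for _, c := range all {
		if chassisTypeCanHaveNVMe(c.ChassisType) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	all := []Chassis{
		{"GPU_SXM_1", "Module"},
		{"NVMeBackplane", "Enclosure"},
		{"1", "RackMount"},
	}
	for _, c := range probeCandidates(all) {
		fmt.Println(c.ID)
	}
}
```

Expressing the rule as a deny-list keeps the fail-safe property described above: a chassis type nobody anticipated is still probed.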

## ADL-019 — Redfish post-probe recovery is profile-owned acquisition policy

**Date:** 2026-03-18
**Context:** Numeric collection post-probe and direct NVMe `Disk.Bay` recovery were still
controlled by collector-core heuristics, which kept platform-specific acquisition behavior in
`redfish.go` and made vendor/topology refactoring incomplete.
**Decision:** Move expensive Redfish post-probe enablement into profile-owned acquisition policy.
The collector core may execute bounded post-probe loops, but profiles must explicitly enable:
- numeric collection post-probe
- direct NVMe `Disk.Bay` recovery
- sensor collection post-probe

**Consequences:**
- Generic collector flow no longer implicitly turns on storage/NVMe recovery for every platform.
- Supermicro-specific direct NVMe recovery and generic numeric collection recovery are now
regression-tested through profile fixtures.
- Future platform storage/post-probe behavior must be added through profile tuning, not new
vendor-shaped `if` branches in collector core.

## ADL-020 — Redfish critical plan-B activation is profile-owned recovery policy

**Date:** 2026-03-18
**Context:** `critical plan-B` and `profile plan-B` were still effectively always-on collector
behavior once paths were present, including critical collection member retry and slow numeric
child probing. That kept acquisition recovery semantics in `redfish.go` instead of the profile
layer.
**Decision:** Move plan-B activation into profile-owned recovery policy. Profiles must explicitly
enable:
- critical collection member retry
- slow numeric probing during critical plan-B
- profile-specific plan-B pass

**Consequences:**
- Recovery behavior is now observable in raw Redfish diagnostics alongside other tuning.
- Generic/fallback recovery remains available through profile policy instead of implicit collector
defaults.
- Future platform-specific plan-B behavior must be introduced through profile tuning and tests,
not through new unconditional collector branches.

## ADL-021 — Extra discovered-path storage seeds must be profile-scoped, not core-baseline

**Date:** 2026-03-18
**Context:** The collector core baseline seed list still contained storage-specific discovered-path
suffixes such as `SimpleStorage` and `Storage/IntelVROC/*`. These are useful on some platforms,
but they are acquisition extensions layered on top of discovered `Systems/*` resources, not part
of the minimal vendor-neutral Redfish baseline.
**Decision:** Move such discovered-path expansions into profile-owned scoped path policy. The
collector core keeps the vendor-neutral baseline; profiles may add extra system/chassis/manager
suffixes that are expanded over discovered members during acquisition planning.
**Consequences:**
- Platform-shaped storage discovery no longer lives in `redfish.go` baseline seed construction.
- Extra discovered-path branches are visible in plan diagnostics and fixture regression tests.
- Future model/vendor storage path expansions must be added through scoped profile policy instead
of editing the shared baseline seed list.

## ADL-022 — Adaptive prefetch eligibility is profile-owned policy

**Date:** 2026-03-18
**Context:** The adaptive prefetch executor was still driven by hardcoded include/exclude path
rules in `redfish.go`. That made GPU/storage/network prefetch shaping part of collector-core
knowledge rather than profile-owned acquisition policy.
**Decision:** Move prefetch eligibility rules into profile tuning. The collector core still runs
adaptive prefetch, but profiles provide:
- `IncludeSuffixes` for critical paths eligible for prefetch
- `ExcludeContains` for path shapes that must never be prefetched

**Consequences:**
- Prefetch behavior is now visible in raw Redfish diagnostics and test fixtures.
- Platform- or topology-specific prefetch shaping no longer requires editing collector-core
string lists.
- Future prefetch tuning must be introduced through profiles and regression tests.
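
How a profile-supplied policy might be evaluated is sketched below. The field names `IncludeSuffixes` and `ExcludeContains` come from the decision; the `PrefetchPolicy` type, the `Eligible` method, the precedence (exclusions win over inclusions), and the example path fragments are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// PrefetchPolicy is a hypothetical container for profile-owned prefetch rules.
type PrefetchPolicy struct {
	IncludeSuffixes []string // path suffixes eligible for prefetch
	ExcludeContains []string // substrings that veto prefetch entirely
}

// Eligible reports whether a path may be prefetched under this policy.
// Exclusions are checked first, so they always override inclusions.
func (p PrefetchPolicy) Eligible(path string) bool {
	for _, x := range p.ExcludeContains {
		if strings.Contains(path, x) {
			return false
		}
	}
	for _, s := range p.IncludeSuffixes {
		if strings.HasSuffix(path, s) {
			return true
		}
	}
	return false
}

func main() {
	// Example values are illustrative, not taken from any real profile.
	policy := PrefetchPolicy{
		IncludeSuffixes: []string{"/Storage", "/Processors"},
		ExcludeContains: []string{"/LogServices/"},
	}
	fmt.Println(policy.Eligible("/redfish/v1/Systems/1/Storage"))
}
```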

## ADL-023 — Core critical baseline is roots-only; critical shaping is profile-owned

**Date:** 2026-03-18
**Context:** `redfishCriticalEndpoints(...)` still encoded a broad set of system/chassis/manager
critical branches directly in collector core. This mixed minimal crawl invariants with profile-
specific acquisition shaping.
**Decision:** Reduce collector-core critical baseline to vendor-neutral roots only:
- `/redfish/v1`
- discovered `Systems/*`
- discovered `Chassis/*`
- discovered `Managers/*`

Profiles now own additional critical shaping through:
- scoped critical suffix policy for discovered resources
- explicit top-level `CriticalPaths`

**Consequences:**
- Critical inventory breadth is now explained by the acquisition plan, not hidden in collector
helper defaults.
- Generic profile still provides the previous broad critical coverage, so behavior stays stable.
- Future critical-path tuning must be implemented in profiles and regression-tested there.

## ADL-024 — Live Redfish execution plans are resolved inside redfishprofile

**Date:** 2026-03-18
**Context:** Even after moving seeds, scoped paths, critical shaping, recovery, and prefetch
policy into profiles, `redfish.go` still manually merged discovered resources with those policy
fragments. That left acquisition-plan resolution logic in collector core.
**Decision:** Introduce `redfishprofile.ResolveAcquisitionPlan(...)` as the boundary between
profile planning and collector execution. `redfishprofile` now resolves:
- baseline seeds
- baseline critical roots
- scoped path expansions
- explicit profile seed/critical/plan-B paths

The collector core consumes the resolved plan and executes it.
**Consequences:**
- Acquisition planning logic is now testable in `redfishprofile` without going through the live
collector.
- `redfish.go` no longer owns path-resolution helpers for seeds/critical planning.
- This creates a clean next step toward true per-profile acquisition hooks beyond static policy
fragments.

## ADL-025 — Post-discovery acquisition refinement belongs to profile hooks

**Date:** 2026-03-18
**Context:** Some acquisition behavior depends not only on vendor/model hints, but on what the
lightweight Redfish discovery actually returned. Static absolute path lists in profile plans are
too rigid for such cases and reintroduce guessed platform knowledge.
**Decision:** Add a post-discovery acquisition refinement hook to Redfish profiles. Profiles may
mutate the resolved execution plan after discovered `Systems/*`, `Chassis/*`, and `Managers/*`
are known.

First concrete use:
- MSI now derives GPU chassis seeds and `.../Sensors` critical/plan-B paths from discovered
`Chassis/GPU*` resources instead of hardcoded `GPU1..GPU4` absolute paths in the static plan.

Additional use:
- Supermicro now derives `UpdateService/Oem/Supermicro/FirmwareInventory` critical/plan-B paths
from resource hints instead of carrying that absolute path in the static plan.

Additional use:
- Dell now derives `Managers/iDRAC.Embedded.*` acquisition paths from discovered manager
resources instead of carrying `Managers/iDRAC.Embedded.1` as a static absolute path.

**Consequences:**
- Profile modules can react to actual discovery results without pushing conditional logic back
into `redfish.go`.
- Diagnostics still show the final refined plan because the collector stores the refined plan,
not only the pre-refinement template.
- Future vendor-specific discovery-dependent acquisition behavior should be implemented through
this hook rather than new collector-core branches.

## ADL-026 — Replay analysis uses a resolved profile plan, not ad-hoc directives only

**Date:** 2026-03-18
**Context:** Replay still relied on a flat `AnalysisDirectives` struct assembled centrally,
while vendor-specific conditions often depended on the actual snapshot shape. That made analysis
behavior harder to explain and kept too much vendor logic in generic replay collectors.
**Decision:** Introduce `redfishprofile.ResolveAnalysisPlan(...)` for replay. The resolved
analysis plan contains:
- active match result
- resolved analysis directives
- analysis notes explaining snapshot-aware hook activation

Profiles may refine this plan using the snapshot and discovered resources before replay collectors
run.

First concrete uses:
- MSI enables processor-GPU fallback and MSI chassis lookup only when the snapshot actually
contains GPU processors and `Chassis/GPU*`
- HGX enables processor-GPU alias fallback from actual HGX/GPU_SXM topology signals in the snapshot
- Supermicro enables NVMe backplane and known-controller recovery from actual snapshot paths

**Consequences:**
- Replay behavior is now closer to the acquisition architecture: a resolved profile plan feeds the
executor.
- `redfish_analysis_plan` is stored in raw payload metadata for offline debugging.
- Future analysis-side vendor logic should move into profile refinement hooks instead of growing the
central directive builder.

## ADL-027 — Replay GPU/storage executors consume resolved analysis plans

**Date:** 2026-03-18
**Context:** Even after introducing `ResolveAnalysisPlan(...)`, replay GPU/storage collectors still
accepted a raw `AnalysisDirectives` struct. That preserved an implicit shortcut from the old design
and weakened the plan/executor boundary.
**Decision:** Replay GPU/storage executors now accept `redfishprofile.ResolvedAnalysisPlan`
directly. The executor reads resolved directives from the plan instead of being passed a standalone
directive bundle.
**Consequences:**
- GPU and storage replay execution now follows the same architectural pattern as acquisition:
resolve plan first, execute second.
- Future profile-owned execution helpers can use plan notes or additional resolved fields without
changing the executor API again.
- Remaining replay areas should migrate the same way instead of continuing to accept raw directive
structs.

## ADL-019 — isDeviceBoundFirmwareName must cover vendor-specific naming patterns per vendor

**Date:** 2026-03-12
**Context:** `isDeviceBoundFirmwareName` was written to filter Dell-style device firmware names
(`"GPU SomeDevice"`, `"NIC OnboardLAN"`). When Supermicro Redfish FirmwareInventory was added
(`6c19a58`), no Supermicro-specific patterns were added. Supermicro names a NIC entry
`"NIC1 System Slot0 AOM-DP805-IO"` — a digit follows the type prefix directly, bypassing the
`"nic "` (space-terminated) check. 29 device-bound entries leaked into `hardware.firmware` on
SYS-A21GE-NBRT (HGX B200). Commit `9c5512d` attempted a fix by adding `_fw_gpu_` patterns,
but checked `DeviceName` which contains `"Software Inventory"` (from the Redfish `Name` field),
not the firmware inventory ID. The patterns were dead code from the moment they were committed.
**Decision:**
- `isDeviceBoundFirmwareName` must be extended for each new vendor whose FirmwareInventory
naming convention differs from the existing patterns.
- When adding HGX/Supermicro patterns, check that the pattern matches the field value that
`collectFirmwareInventory` actually stores — trace the data path from Redfish doc to
`FirmwareInfo.DeviceName` before writing the condition.
- `TestIsDeviceBoundFirmwareName` must contain at least one case per vendor format.

**Consequences:**
- New vendors with FirmwareInventory support require a test covering both device-bound names
(must return true) and system-level names (must return false) before the code ships.
- The dead `_fw_gpu_` / `_fw_nvswitch_` / `_inforom_gpu_` patterns were replaced with
correct prefix+digit checks (`"gpu" + digit`, `"nic" + digit`) and explicit string checks
(`"nvmecontroller"`, `"power supply"`, `"software inventory"`).
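
The difference between the space-terminated check and the prefix+digit check can be sketched as follows. This is an illustration under assumptions, not the project's `isDeviceBoundFirmwareName`: the helper names are hypothetical, matching is assumed to be case-insensitive, and the explicit strings are assumed to be substring checks.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPrefixDigit reports whether s starts with prefix immediately followed by a digit,
// e.g. "nic1 ..." for prefix "nic".
func hasPrefixDigit(s, prefix string) bool {
	if !strings.HasPrefix(s, prefix) || len(s) <= len(prefix) {
		return false
	}
	c := s[len(prefix)]
	return c >= '0' && c <= '9'
}

// isDeviceBoundName is a hypothetical sketch of the combined rule set.
func isDeviceBoundName(name string) bool {
	n := strings.ToLower(strings.TrimSpace(name))
	// Dell-style names: type prefix followed by a space.
	if strings.HasPrefix(n, "gpu ") || strings.HasPrefix(n, "nic ") {
		return true
	}
	// Supermicro-style names: type prefix followed directly by a digit.
	if hasPrefixDigit(n, "gpu") || hasPrefixDigit(n, "nic") {
		return true
	}
	// Explicit strings listed in the ADL entry (substring match is an assumption).
	for _, s := range []string{"nvmecontroller", "power supply", "software inventory"} {
		if strings.Contains(n, s) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isDeviceBoundName("NIC1 System Slot0 AOM-DP805-IO"))
	fmt.Println(isDeviceBoundName("BIOS"))
}
```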

## ADL-020 — Dell TSR device-bound firmware filtered via FQDD; InfiniBand routed to NetworkAdapters

**Date:** 2026-03-15
**Context:** Dell TSR `sysinfo_DCIM_SoftwareIdentity.xml` lists firmware for every installed
component. `parseSoftwareIdentityXML` dumped all of these into `hardware.firmware` without
filtering, so device-bound entries such as `"Mellanox Network Adapter"` (FQDD `InfiniBand.Slot.1-1`)
and `"PERC H755 Front"` (FQDD `RAID.SL.3-1`) appeared in the reanimator export alongside system
firmware like BIOS and iDRAC. Confirmed on PowerEdge R6625 (8VS2LG4).

Additionally, `DCIM_InfiniBandView` was not handled in the parser switch, so Mellanox ConnectX-6
appeared only as a PCIe device with `model: "16x or x16"` (from `DataBusWidth` fallback).
`parseControllerView` called `addFirmware` with description `"storage controller"` instead of the
FQDD, so the FQDD-based filter in the exporter could not remove it.

**Decision:**
1. `isDeviceBoundFirmwareFQDD` extended with `"infiniband."` and `"fc."` prefixes; `"raid.backplane."`
broadened to `"raid."` to cover `RAID.SL.*`, `RAID.Integrated.*`, etc.
2. `DCIM_InfiniBandView` routed to `parseNICView` → device appears as `NetworkAdapter` with correct
firmware, MAC address, and VendorID/DeviceID.
3. `"InfiniBand."` added to `pcieFQDDNoisePrefix` to suppress the duplicate `DCIM_PCIDeviceView`
entry (DataBusWidth-only, no useful data).
4. `parseControllerView` now passes `fqdd` as the `addFirmware` description so the FQDD filter
removes the entry in the exporter.
5. `parsePCIeDeviceView` now prioritises `props["description"]` (chip model, e.g. `"MT28908 Family
[ConnectX-6]"`) over `props["devicedescription"]` (location string) for `pcie.Description`.
6. `convertPCIeDevices` model fallback order: `PartNumber → Description → DeviceClass`.

**Consequences:**
- `hardware.firmware` contains only system-level entries; NIC/RAID/storage-controller firmware
lives on the respective device record.
- `TestParseDellInfiniBandView` and `TestIsDeviceBoundFirmwareFQDD` guard the regression.
- Any future Dell TSR device class whose FQDD prefix is not yet in the prefix list may still leak;
extend `isDeviceBoundFirmwareFQDD` and add a test case when encountered.
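
A prefix-based FQDD filter of the kind described above might look like the sketch below. Only the three prefixes named in this entry (`"infiniband."`, `"fc."`, `"raid."`) are included; the real prefix list is longer and is not reproduced here. The function name `isDeviceBoundFQDD` and the lowercase comparison are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// deviceBoundPrefixes holds only the prefixes named in this ADL entry;
// the project's actual list contains more entries.
var deviceBoundPrefixes = []string{"infiniband.", "fc.", "raid."}

// isDeviceBoundFQDD reports whether an FQDD belongs to a device rather than the system.
func isDeviceBoundFQDD(fqdd string) bool {
	f := strings.ToLower(strings.TrimSpace(fqdd))
	for _, p := range deviceBoundPrefixes {
		if strings.HasPrefix(f, p) {
			return true
		}
	}
	return false
}

func main() {
	for _, fqdd := range []string{"InfiniBand.Slot.1-1", "RAID.SL.3-1", "iDRAC.Embedded.1"} {
		fmt.Println(fqdd, isDeviceBoundFQDD(fqdd))
	}
}
```

Broadening `"raid.backplane."` to `"raid."` means one prefix covers `RAID.SL.*`, `RAID.Integrated.*`, and the backplane entries alike.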

---

## ADL-021 — pci.ids enrichment: chip model and vendor resolved from PCI IDs when source data is generic or missing

**Date:** 2026-03-15
**Context:**
Dell TSR `DCIM_InfiniBandView.ProductName` reports a generic marketing name ("Mellanox Network
Adapter") instead of the precise chip identifier ("MT28908 Family [ConnectX-6]"). The actual
chip model is available in `pci.ids` by VendorID:DeviceID (15B3:101B). Vendor name may also be
absent when no `VendorName` / `Manufacturer` property is present.

The general rule was established: *if model is not found in source data but PCI IDs are known,
resolve model from `pci.ids`*. This rule applies broadly across all export paths.

**Decision (two-layer enrichment):**
1. **Parser layer (Dell, `parseNICView`):** When `VendorID != 0 && DeviceID != 0`, prefer
`pciids.DeviceName(vendorID, deviceID)` over the product name from logs. This makes the chip
identifier the primary model for NIC/InfiniBand adapters (more specific than marketing name).
Fill `Vendor` from `pciids.VendorName(vendorID)` when the vendor field is otherwise empty.
Same fallback applied in `parsePCIeDeviceView` for empty `Description`.
2. **Exporter layer (`convertPCIeFromDevices`):** General rule — when `d.Model == ""` after all
legacy fallbacks and `VendorID != 0 && DeviceID != 0`, set `model = pciids.DeviceName(...)`.
Also fill empty `manufacturer` from `pciids.VendorName(...)`. This covers all parsers/sources.

**Consequences:**
- Mellanox InfiniBand slot now reports `model: "MT28908 Family [ConnectX-6]"` and
`manufacturer: "Mellanox Technologies"` in the reanimator export.
- For NICs where pci.ids has no entry, the original product name is kept (pci.ids returns "").
- `TestParseDellInfiniBandView` asserts the model and vendor from pci.ids.
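
The parser-layer rule can be sketched with an in-memory table standing in for the embedded `pci.ids` database. The lookup maps, `pciKey`, `preferPCIModel`, and `fillVendor` are hypothetical; the single table entry (15B3:101B) and its names are the ones quoted in this entry. The real `pciids` package API is not reproduced here.

```go
package main

import "fmt"

// pciKey identifies a device by PCI vendor and device ID.
type pciKey struct {
	vendor, device uint16
}

// Stand-ins for the pci.ids database; a missing entry yields "".
var deviceNames = map[pciKey]string{
	pciKey{0x15B3, 0x101B}: "MT28908 Family [ConnectX-6]",
}

var vendorNames = map[uint16]string{
	0x15B3: "Mellanox Technologies",
}

// preferPCIModel returns the pci.ids chip name when both IDs are known and the
// database has an entry; otherwise it keeps the product name from the source.
func preferPCIModel(productName string, vendorID, deviceID uint16) string {
	if vendorID != 0 && deviceID != 0 {
		if name := deviceNames[pciKey{vendorID, deviceID}]; name != "" {
			return name
		}
	}
	return productName
}

// fillVendor fills an empty vendor field from pci.ids and never overwrites a set one.
func fillVendor(vendor string, vendorID uint16) string {
	if vendor == "" && vendorID != 0 {
		if name := vendorNames[vendorID]; name != "" {
			return name
		}
	}
	return vendor
}

func main() {
	fmt.Println(preferPCIModel("Mellanox Network Adapter", 0x15B3, 0x101B))
	fmt.Println(fillVendor("", 0x15B3))
}
```

The exporter-layer rule differs only in its trigger: it applies the lookup when the model is still empty after all other fallbacks, rather than preferring it over an existing product name.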
|
||
|
||
---
|
||
|
||
## ADL-022 — CPUAffinity parsed into NUMANode for PCIe, NIC, and controller devices

**Date:** 2026-03-15
**Context:**
Dell TSR DCIM view classes report `CPUAffinity` for NIC, InfiniBand, PCIe, and controller
devices. Values are "1", "2" (NUMA node index), or "Not Applicable" (for devices that bridge
both CPUs or have no CPU affinity). This data is needed for topology-aware diagnostics.

**Decision:**
- Add `NUMANode int` (JSON: `"numa_node,omitempty"`) to `models.PCIeDevice`,
  `models.NetworkAdapter`, `models.HardwareDevice`, and `ReanimatorPCIe`.
- Parse from `props["cpuaffinity"]` using `parseIntLoose`: numeric values ("1", "2") map
  directly; "Not Applicable" returns 0 (omitted via `omitempty`).
- Thread through `buildDevicesFromLegacy` (PCIe and NIC sections) and `convertPCIeFromDevices`.
- `parseControllerView` also parses CPUAffinity since RAID controllers have NUMA affinity.

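A minimal sketch of the parsing rule above, assuming a hypothetical helper `numaNodeFromAffinity`; the project's actual `parseIntLoose` is not reproduced here and may accept more input shapes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// numaNodeFromAffinity maps a CPUAffinity string to a NUMA node index.
// Anything that is not a non-negative integer (for example "Not Applicable")
// yields 0, which a `numa_node,omitempty` JSON tag then omits.
func numaNodeFromAffinity(raw string) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	for _, v := range []string{"1", " 2 ", "Not Applicable", ""} {
		fmt.Printf("%q -> %d\n", v, numaNodeFromAffinity(v))
	}
}
```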
**Consequences:**
- `numa_node: 1` or `2` appears in reanimator export for devices with known affinity.
- Value 0 / absent means "not reported" — covers both "Not Applicable" and sources that don't
  provide CPUAffinity at all.
- `TestParseDellCPUAffinity` verifies numeric values parsed correctly and "Not Applicable"→0.

---

## ADL-023 — Reanimator export must match ingest contract exactly

**Date:** 2026-03-15
**Context:**
LOGPile's Reanimator export had drifted from the strict ingest contract. It emitted fields that
Reanimator does not currently accept (`status_at_collection`, `numa_node`),
while missing fields and sections now present in the contract (`hardware.sensors`,
`pcie_devices[].mac_addresses`). Memory export rules also diverged from the ingest side: empty or
serial-less DIMMs were still exported.

**Decision:**
- Treat the Reanimator ingest contract as the authoritative schema for `GET /api/export/reanimator`.
- Emit only fields present in the current upstream contract revision.
- Add `hardware.sensors`, `pcie_devices[].mac_addresses`, `pcie_devices[].numa_node`, and
  upstream-approved component telemetry/health fields.
- Leave out fields that are still not part of the upstream contract.
- Map internal `source_type=archive` to external `source_type=logfile`.
- Skip memory entries that are empty, not present, or missing serial numbers.
- Generate CPU and PCIe serials only in the forms allowed by the contract.
- Mirror the applied contract in `bible-local/docs/hardware-ingest-contract.md`.

**Consequences:**
- Some previously exported diagnostic fields are intentionally dropped from the Reanimator payload
  until the upstream contract adds them.
- Internal models may retain richer fields than the current export schema.
- `hardware.devices` is canonical only after merge with legacy hardware slices; partial parser-owned
  canonical records must not hide CPUs, memory, storage, NICs, or PSUs still stored in legacy
  fields.
- CSV and Reanimator exports must use the same merged canonical inventory to avoid divergent export
  contents across surfaces.
- Future exporter changes must update both the code and the mirrored contract document together.

---

## ADL-024 — Component presence is implicit; Redfish linked metrics are part of replay correctness

**Date:** 2026-03-15
**Context:**
The upstream ingest contract allows `present`, but current export semantics do not need to send
`present=true` for populated components. At the same time, several important Redfish component
telemetry fields were only available through linked metric resources such as `ProcessorMetrics`,
`MemoryMetrics`, and `DriveMetrics`. Without collecting and replaying these linked documents,
live collection and raw snapshot replay still underreported component health fields.

**Decision:**
- Do not serialize `present=true` in Reanimator export. Presence is represented by the presence of
  the component record itself.
- Do not export component records marked `present=false`.
- Interpret CPU `firmware` in Reanimator payload as CPU microcode.
- Treat Redfish linked metric resources `ProcessorMetrics`, `MemoryMetrics`, `DriveMetrics`,
  `EnvironmentMetrics`, and generic `Metrics` as part of analyzer correctness when they are linked
  from component resources.
- Replay logic must merge these linked metric resources back into CPU, memory, storage, PCIe, GPU,
  NIC, and PSU component `Details` the same way live collection expects them to be used.

**Consequences:**
- Reanimator payloads are smaller and avoid redundant `present=true` noise while still excluding
  empty slots and absent components.
- Any future exporter change that reintroduces serialized component presence needs an explicit
  contract review.
- Raw Redfish snapshot completeness now includes linked per-component metric resources, not only
  top-level inventory members.
- CPU microcode is no longer expected in top-level `hardware.firmware`; it belongs on the CPU
  component record.

<!-- Add new decisions below this line using the format above -->

## ADL-025 — Missing serial numbers must remain absent in Reanimator export

**Date:** 2026-03-15
**Context:**
LOGPile previously generated synthetic serial numbers for components that had no real serial in
source data, especially CPUs and PCIe-class devices. This made the payload look richer, but the
serials were not authoritative and could mislead downstream consumers. Reanimator can already
accept missing serials and generate its own internal fallback identifiers when needed.

**Decision:**
- Do not synthesize fake serial numbers in LOGPile's Reanimator export.
- If a component has no real serial in parsed source data, export the serial field as absent.
- This applies to CPUs, PCIe devices, GPUs, NICs, and any other component class unless an
  upstream contract explicitly requires a deterministic exporter-generated identifier.
- Any fallback serial generation defined by the upstream contract is ingest-side Reanimator behavior,
  not LOGPile exporter behavior.

**Consequences:**
- Exported payloads carry only source-backed serial numbers.
- Fake identifiers such as `BOARD-...-CPU-...` or synthetic PCIe serials are no longer considered
  acceptable exporter behavior.
- Any future attempt to reintroduce generated serials requires an explicit contract review and a
  new ADL entry.

---

## ADL-026 — Live Redfish collection uses explicit preflight host-power confirmation

**Date:** 2026-03-15
**Context:**
Live Redfish inventory can be incomplete when the managed host is powered off. At the same time,
LOGPile must not silently power on a host without explicit user choice. The collection workflow
therefore needs a preflight step that verifies connectivity, shows current host power state to the
user, and only powers on the host when the user explicitly chose that path.

**Decision:**
- Add a dedicated live preflight API step before collection starts.
- UI first runs connectivity and power-state check, then offers:
  - collect as-is
  - power on and collect
  - if the host is off and the user does not answer within 5 seconds, default to collecting without
    powering the host on
- Redfish collection may power on the host only when the request explicitly sets
  `power_on_if_host_off=true`
- when LOGPile powers on the host for collection, it must try to power the host back off after
  collection completes
- if LOGPile did not power the host on itself, it must never power the host off
- all preflight and power-control steps must be logged into the collection log and therefore into
  the raw-export bundle

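The power-control rules above reduce to a small decision table. The sketch below is illustrative only: `powerPlan` and `planPower` are hypothetical names, and a real collector would additionally power the host off only if its own power-on request actually succeeded.

```go
package main

import "fmt"

// powerPlan lists the power actions the collector may take for one job.
type powerPlan struct {
	PowerOnBefore bool // power the host on before collecting
	PowerOffAfter bool // try to power it back off once collection completes
}

// planPower derives the plan from the preflight power state and the explicit
// request flag (power_on_if_host_off). The host is only ever powered off
// again when this job is the one that powered it on.
func planPower(hostOn, powerOnIfHostOff bool) powerPlan {
	if hostOn || !powerOnIfHostOff {
		return powerPlan{}
	}
	return powerPlan{PowerOnBefore: true, PowerOffAfter: true}
}

func main() {
	fmt.Printf("off + flag:    %+v\n", planPower(false, true))
	fmt.Printf("off, no flag:  %+v\n", planPower(false, false))
	fmt.Printf("already on:    %+v\n", planPower(true, true))
}
```

Deriving both actions from one function keeps the "never power off a host we did not power on" rule impossible to violate by accident.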
**Consequences:**
- Live collection becomes a two-step UX: probe first, collect second.
- Raw bundles preserve operator-visible evidence of power-state decisions and power-control attempts.
- Power-on failures do not block collection entirely; they only downgrade completeness expectations.

---

## ADL-027 — Sensors without numeric readings are not exported

**Date:** 2026-03-15
**Context:**
Some parsed sensor records carry only a name, unit, or status, but no actual numeric reading. Such
records are not useful as telemetry in Reanimator export and create noisy, low-value sensor lists.

**Decision:**
- Do not export temperature, power, fan, or other sensor records unless they carry a real numeric
  measurement value.
- Presence of a sensor name or health/status alone is not sufficient for export.

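One way to express this filter, assuming a simplified, hypothetical `sensor` record in which a missing reading is modelled as a nil pointer (the project's real sensor model is not shown in this log):

```go
package main

import "fmt"

// sensor is a simplified record; Value is nil when no numeric reading was parsed.
type sensor struct {
	Name   string
	Status string
	Value  *float64
}

// exportable keeps only sensors that carry a numeric reading.
// A name or status on its own is not enough.
func exportable(in []sensor) []sensor {
	out := make([]sensor, 0, len(in))
	for _, s := range in {
		if s.Value != nil {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	temp := 41.5
	kept := exportable([]sensor{
		{Name: "CPU1 Temp", Value: &temp},
		{Name: "Fan3", Status: "OK"}, // status only, filtered out
	})
	fmt.Println(len(kept), kept[0].Name)
}
```

Using a pointer keeps a genuine reading of `0` distinguishable from "no reading", which a plain `float64` field could not do.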
**Consequences:**
- Exported sensor groups contain only actionable telemetry.
- Parsers and collectors may still keep non-numeric sensor artifacts internally for diagnostics, but
  Reanimator export must filter them out.

---

## ADL-028 — Reanimator PCIe export excludes storage endpoints and synthetic serials

**Date:** 2026-03-15
**Context:**
Some Redfish and archive sources expose NVMe drives both as storage inventory and as PCIe-visible
endpoints. Exporting such drives in both `hardware.storage` and `hardware.pcie_devices` creates
duplicates without adding useful topology value. At the same time, PCIe-class export still had old
fallback behavior that generated synthetic serial numbers when source serials were absent.

**Decision:**
- Export disks and NVMe drives only through `hardware.storage`.
- Do not export storage endpoints as `hardware.pcie_devices`, even if the source inventory exposes
  them as PCIe/NVMe devices.
- Keep real PCIe storage controllers such as RAID and HBA adapters in `hardware.pcie_devices`.
- Do not synthesize PCIe/GPU/NIC serial numbers in LOGPile; missing serials stay absent.
- Treat placeholder names such as `Network Device View` as non-authoritative and prefer resolved
  device names when stronger data exists.

**Consequences:**
- Reanimator payloads no longer duplicate NVMe drives between storage and PCIe sections.
- PCIe export remains topology-focused while storage export remains component-focused.
- Missing PCIe-class serials no longer produce fake `BOARD-...-PCIE-...` identifiers.

---

## ADL-029 — Local exporter guidance tracks upstream contract v2.7 terminology

**Date:** 2026-03-15
**Context:**
The upstream Reanimator hardware ingest contract moved to `v2.7` and clarified several points that
matter for LOGPile documentation: ingest-side serial fallback rules, canonical PCIe addressing via
`slot`, the optional `event_logs` section, and the shared `manufactured_year_week` field.

**Decision:**
- Keep the local mirrored contract file as an exact copy of the upstream `v2.7` document.
- Describe CPU/PCIe serial fallback as Reanimator ingest behavior, not LOGPile exporter behavior.
- Treat `pcie_devices.slot` as the canonical address on the LOGPile side as well; `bdf` may remain
  an internal fallback/dedupe key but is not serialized in the payload.
- Export `event_logs` only from normalized parser/collector events that can be mapped to contract
  sources `host` / `bmc` / `redfish` without synthesizing message content.
- Export `manufactured_year_week` only as a reliable passthrough when a parser/collector already
  extracted a valid `YYYY-Www` value.

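A passthrough guard for `manufactured_year_week` might look like the sketch below. `validYearWeek` is a hypothetical helper, and the accepted week range of 01 to 53 is an assumption based on the ISO 8601 week-date format; the contract's exact validation rules are not quoted in this log.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// yearWeekRe matches the literal shape YYYY-Www, for example 2024-W07.
var yearWeekRe = regexp.MustCompile(`^(\d{4})-W(\d{2})$`)

// validYearWeek reports whether s has the YYYY-Www shape with a week in 01..53.
// It does not check whether the given year really has a week 53.
func validYearWeek(s string) bool {
	m := yearWeekRe.FindStringSubmatch(s)
	if m == nil {
		return false
	}
	week, err := strconv.Atoi(m[2])
	return err == nil && week >= 1 && week <= 53
}

func main() {
	for _, v := range []string{"2024-W07", "2024-W00", "2024-7", "24-W07"} {
		fmt.Printf("%-9s %v\n", v, validYearWeek(v))
	}
}
```

Values that fail the check would simply be left out of the payload rather than repaired, in line with the "reliable passthrough" wording.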
**Consequences:**
- Local bible wording no longer conflicts with upstream contract terminology.
- Reanimator payloads use contract-native PCIe addressing and no longer expose `bdf` as a parallel
  coordinate.
- LOGPile event export remains strictly source-derived; internal warnings such as LOGPile analysis
  notes do not leak into Reanimator `event_logs`.

---

## ADL-030 — Audit result rendering is delegated to embedded reanimator/chart

**Date:** 2026-03-16
**Context:**
LOGPile already owns file upload, Redfish collection, archive parsing, normalization, and
Reanimator export. Maintaining a second host-side audit renderer for the same data created
presentation drift and duplicated UI logic.

**Decision:**
- Use vendored `reanimator/chart` as the only audit result viewer.
- Keep LOGPile responsible for service flows: upload, live collection, batch convert, raw export,
  Reanimator export, and parse-error reporting.
- Render the current dataset by converting it to Reanimator JSON and passing that snapshot to
  embedded `chart` under `/chart/current`.

**Consequences:**
- Reanimator JSON becomes the single presentation contract for the audit surface.
- The host UI becomes a service shell around the viewer instead of maintaining its own
  field-by-field tabs.
- `internal/chart` must be updated explicitly as a git submodule when the viewer changes.

---

## ADL-031 — Redfish uses profile-driven acquisition and unified ingest entrypoints

**Date:** 2026-03-17
**Context:**
Redfish collection had accumulated platform-specific probing in the shared collector path, while
upload and raw-export replay still entered analysis through direct handler branches. This made
vendor/model tuning harder to contain and increased regression risk when one topology needed a
special acquisition strategy.

**Decision:**
- Introduce `internal/ingest.Service` as the internal source-family entrypoint for archive parsing
  and Redfish raw replay.
- Introduce `internal/collector/redfishprofile/` for Redfish profile matching and modular hooks.
- Split Redfish behavior into coordinated phases:
  - acquisition planning during live collection
  - analysis hooks during snapshot replay
- Use score-based profile matching. If confidence is low, enter fallback acquisition mode and
  aggregate only safe additive profile probes.
- Allow profile modules to provide bounded acquisition tuning hints such as crawl cap, prefetch
  behavior, and expensive post-probe toggles.
- Allow profile modules to own model-specific `CriticalPaths` and bounded `PlanBPaths` so vendor
  recovery targets stop leaking into the collector core.
- Expose Redfish profile matching as structured diagnostics during live collection: logs must
  contain all module scores, and collect job status must expose active modules for the UI.

**Consequences:**
- Server handlers stop owning parser-vs-replay branching details directly.
- Vendor/model-specific Redfish logic gets an explicit module boundary.
- Unknown-vendor Redfish collection becomes slower but more complete by design.
- Tactical Redfish fixes should move into profile modules instead of widening generic replay logic.
- Repo-owned compact fixtures under `internal/collector/redfishprofile/testdata/`, derived from
  representative raw-export snapshots, are used to lock profile matching and acquisition tuning
  for known MSI and Supermicro-family shapes.

---

## ADL-032 — MSI ghost GPU filter: exclude GPUs with temperature=0 on powered-on host

**Date:** 2026-03-18
**Context:**
MSI/AMI BMC caches GPU inventory from the host via Host Interface (in-band). When GPUs are
removed without a reboot the old entries remain in `Chassis/GPU*` and
`Systems/Self/Processors/GPU*` with `Status.Health: OK, State: Enabled`. The BMC has no
out-of-band mechanism to detect physical absence. A physically present GPU always reports
an ambient temperature (>0°C) even when idle; a stale cached entry returns `Reading: 0`.

**Decision:**
- Add `EnableMSIGhostGPUFilter` directive (enabled by MSI profile's `refineAnalysis`
  alongside `EnableProcessorGPUFallback`).
- In `collectGPUsFromProcessors`: for each processor GPU, resolve its chassis path and read
  `Chassis/GPU{n}/Sensors/GPU{n}_Temperature`. If `PowerState=On` and `Reading=0` → skip.
- Filter only applies when host is powered on; when host is off all temperatures are 0 and
  the signal is ambiguous.

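The skip condition can be written as a single predicate. This is a sketch with a hypothetical `isGhostGPU` helper; modelling a missing temperature sensor as a nil reading is an assumption made here so that unknown sensor paths fall through to "include".

```go
package main

import "fmt"

// isGhostGPU reports whether a GPU entry should be treated as a stale cached
// record: the host is powered on, a temperature sensor was found, and it reads 0.
// A nil reading (sensor path not found) never marks the GPU as a ghost.
func isGhostGPU(hostPoweredOn bool, temperature *float64) bool {
	return hostPoweredOn && temperature != nil && *temperature == 0
}

func main() {
	zero, warm := 0.0, 34.0
	fmt.Println(isGhostGPU(true, &zero))  // host on, reading 0
	fmt.Println(isGhostGPU(true, &warm))  // host on, real reading
	fmt.Println(isGhostGPU(false, &zero)) // host off, signal ambiguous
	fmt.Println(isGhostGPU(true, nil))    // no sensor found
}
```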
**Consequences:**
- Ghost GPUs from previous hardware configurations no longer appear in the inventory.
- Filter is MSI-profile-owned and does not affect HGX, Supermicro, or generic paths.
- Any new MSI GPU chassis that uses a different temperature sensor path will bypass the filter
  (safe default: include rather than wrongly exclude).

---

## ADL-033 — Reanimator export collected_at uses inventory LastModifiedTime with 30-day fallback

**Date:** 2026-03-18
**Context:**
For Redfish sources the BMC Manager `DateTime` reflects the BMC clock at the moment it is read, not
when the hardware inventory was last known-good. `InventoryData/Status.LastModifiedTime`
(AMI/MSI OEM endpoint) records the actual timestamp of the last successful host-pushed
inventory cycle and is a better proxy for "when was this hardware configuration last confirmed".

**Decision:**
- `inferInventoryLastModifiedTime` reads `LastModifiedTime` from the snapshot and sets
  `AnalysisResult.InventoryLastModifiedAt`.
- `reanimatorCollectedAt()` in the exporter selects `InventoryLastModifiedAt` when it is set
  and no older than 30 days; otherwise falls back to `CollectedAt`.
- Fallback rationale: inventory older than 30 days is likely from a long-running server with
  no recent reboot; using the actual collection date is more useful for the downstream consumer.
- The inventory timestamp is also logged during replay and live collection for diagnostics.

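The selection rule can be sketched as below. `chooseCollectedAt` is a hypothetical stand-in for `reanimatorCollectedAt()`, and measuring the 30-day age relative to the collection time (rather than the current wall clock) is an assumption; this log does not say which reference point the exporter uses.

```go
package main

import (
	"fmt"
	"time"
)

// maxInventoryAge is the 30-day freshness window from the decision above.
const maxInventoryAge = 30 * 24 * time.Hour

// referenceTime is a fixed example collection time used for demonstration.
var referenceTime = time.Date(2026, 3, 18, 12, 0, 0, 0, time.UTC)

// chooseCollectedAt prefers the inventory timestamp when it is set and not
// older than maxInventoryAge relative to collectedAt; otherwise it returns
// collectedAt unchanged.
func chooseCollectedAt(collectedAt, inventoryLastModified time.Time) time.Time {
	if inventoryLastModified.IsZero() {
		return collectedAt
	}
	if collectedAt.Sub(inventoryLastModified) > maxInventoryAge {
		return collectedAt
	}
	return inventoryLastModified
}

func main() {
	fresh := referenceTime.AddDate(0, 0, -3)
	stale := referenceTime.AddDate(0, 0, -45)
	fmt.Println(chooseCollectedAt(referenceTime, fresh).Format(time.RFC3339))
	fmt.Println(chooseCollectedAt(referenceTime, stale).Format(time.RFC3339))
}
```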
**Consequences:**
- Reanimator export `collected_at` reflects the last confirmed inventory cycle on AMI/MSI BMCs.
- On non-AMI BMCs or when `InventoryData/Status` is absent, behavior is unchanged.
- If inventory is stale (>30 days), collection date is used as before.

---

## ADL-034 — Redfish inventory invalidated before host power-on

**Date:** 2026-03-18
**Context:**
When a host is powered on by the collector (`power_on_if_host_off=true`), the BMC still holds
inventory from the previous boot. If hardware changed between shutdowns, the new boot will push
fresh inventory — but only if the BMC accepts it (CRC mismatch triggers re-population). Without
explicit invalidation, unchanged CRCs can cause the BMC to skip re-processing even after a
hardware change.

**Decision:**
- Before any power-on attempt, `invalidateRedfishInventory` POSTs to
  `{systemPath}/Oem/Ami/Inventory/Crc` with all groups zeroed (`CPU`, `DIMM`, `PCIE`,
  `CERTIFICATES`, `SECUREBOOT`).
- Best-effort: a 404/405 response (non-AMI BMC) is logged and silently ignored.
- The invalidation is logged at `INFO` level and surfaced as a collect progress message.

**Consequences:**
- On AMI/MSI BMCs: the next boot will push a full fresh inventory regardless of whether
  CRCs appear unchanged, eliminating ghost components from prior hardware configurations.
- On non-AMI BMCs: the POST fails immediately (endpoint does not exist), nothing changes.
- Invalidation runs only when `power_on_if_host_off=true` and host is confirmed off.

---

## ADL-035 — Redfish hardware event log collection from Systems LogServices

**Date:** 2026-03-18
**Context:** Redfish BMCs expose event logs via `LogServices/{svc}/Entries`. On MSI/AMI this includes the IPMI SEL with hardware events (temperature, power, drive failures, etc.). Live collection previously collected only inventory/sensor snapshots; event history was unavailable in Reanimator.

**Decision:**
- After tree-walk, fetch hardware log entries separately via `collectRedfishLogEntries()` (not part of tree-walk to avoid bloat).
- Only `Systems/{sys}/LogServices` is queried — Managers LogServices (BMC audit/journal) are excluded.
- Log services with Id/Name containing "audit", "journal", "bmc", "security", "manager", "debug" are skipped.
- Entries older than 7 days (client-side filter) are discarded. Pages are followed until an out-of-window entry is found (assumes newest-first ordering, typical for BMCs).
- Entries with `EntryType: "Oem"` or `MessageId` containing user/auth/login keywords are filtered as non-hardware.
- Raw entries stored in `rawPayloads["redfish_log_entries"]` as `[]map[string]interface{}`.
- Parsed to `models.Event` in `parseRedfishLogEntries()` during replay — same path for live and offline.
- Max 200 entries per log service, 500 total to limit BMC load.

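The log-service skip rule lends itself to a small predicate. In this sketch `skipLogService` is a hypothetical helper, and matching case-insensitively is an assumption; the decision above only lists the substrings.

```go
package main

import (
	"fmt"
	"strings"
)

// skippedLogServiceTokens are the substrings listed in the decision above.
var skippedLogServiceTokens = []string{"audit", "journal", "bmc", "security", "manager", "debug"}

// skipLogService reports whether a log service should be ignored because its
// Id or Name contains one of the non-hardware tokens (case-insensitive here).
func skipLogService(id, name string) bool {
	haystack := strings.ToLower(id + " " + name)
	for _, token := range skippedLogServiceTokens {
		if strings.Contains(haystack, token) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(skipLogService("SEL", "System Event Log"))
	fmt.Println(skipLogService("AuditLog", "Audit Log"))
	fmt.Println(skipLogService("Log1", "BMC Journal"))
}
```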
**Consequences:**
- Hardware event history (last 7 days) visible in Reanimator `EventLogs` section.
- No impact on existing inventory pipeline or offline archive replay (archives without `redfish_log_entries` key silently skip parsing).
- Adds extra HTTP requests during live collection (sequential, after tree-walk completes).

---

## ADL-036 — Redfish profile matching may use platform grammar hints beyond vendor strings

**Date:** 2026-03-25
**Context:**
Some BMCs expose unusable `Manufacturer` / `Model` values (`NULL`, placeholders, or generic SoC
names) while still exposing a stable platform-specific Redfish grammar: repeated member names,
firmware inventory IDs, OEM action names, and target-path quirks. Matching only on vendor
strings forced such systems into fallback mode even when the platform shape was consistent.

**Decision:**
- Extend `redfishprofile.MatchSignals` with doc-derived hint tokens collected from discovery docs
  and replay snapshots.
- Allow profile matchers to score on stable platform grammar such as:
  - collection member naming (`outboardPCIeCard*`, drive slot grammars)
  - firmware inventory member IDs
  - OEM action/type markers and linked target paths
- During live collection, gather only lightweight extra hint collections needed for matching
  (`NetworkInterfaces`, `NetworkAdapters`, `Drives`, `UpdateService/FirmwareInventory`), not slow
  deep inventory branches.
- Keep such profiles out of fallback aggregation unless they are proven safe as broad additive
  hints.

**Consequences:**
- Platform-family profiles can activate even when vendor strings are absent or set to `NULL`.
- Matching logic becomes more robust for OEM BMC implementations that differ mainly by Redfish
  grammar rather than by explicit vendor strings.
- Live collection gains a small amount of extra discovery I/O to harvest stable member IDs, but
  avoids slow deep probes such as `Assembly` just for profile selection.

---

## ADL-037 — easy-bee archives are parsed from the embedded bee-audit snapshot

**Date:** 2026-03-25
**Context:**
`reanimator-easy-bee` support bundles already contain a normalized hardware snapshot in
`export/bee-audit.json` plus supporting logs and techdump files. Rebuilding the same inventory
from raw `techdump/` files inside LOGPile would duplicate parser logic and create drift between
the producer utility and archive importer.

**Decision:**
- Add a dedicated `easy_bee` vendor parser for `bee-support-*.tar.gz` bundles.
- Detect the bundle by `manifest.txt` (`bee_version=...`) plus `export/bee-audit.json`.
- Parse the archive from the embedded snapshot first; treat `techdump/` and runtime files as
  secondary context only.
- Normalize snapshot-only fields needed by LOGPile, notably:
  - flatten `hardware.sensors` groups into `[]SensorReading`
  - turn runtime issues/status into `[]Event`
  - synthesize a board FRU entry when the snapshot does not include FRU data

**Consequences:**
- LOGPile stays aligned with the schema emitted by `reanimator-easy-bee`.
- Adding support required only a thin archive adapter instead of a full hardware parser.
- If the upstream utility changes the embedded snapshot schema, the `easy_bee` adapter is the
  only place that must be updated.

---

## ADL-038 — HPE AHS parser uses hybrid extraction instead of full `zbb` schema decoding

**Date:** 2026-03-30
**Context:** HPE iLO Active Health System exports (`.ahs`) are proprietary `ABJR` containers
with gzip-compressed `zbb` payloads. The sample inventory data contains two practical signal
families: printable SMBIOS/FRU-style strings and embedded Redfish JSON subtrees, especially for
storage controllers and drives. Full `zbb` binary schema decoding is not documented and would add
significant complexity before proving user value.

**Decision:** Support HPE AHS with a hybrid parser:
- decode the outer `ABJR` container
- gunzip embedded members when applicable
- extract inventory from printable SMBIOS/FRU payloads
- extract storage/controller/backplane details from embedded Redfish JSON objects
- enrich firmware and PSU inventory from auxiliary package payloads such as `bcert.pkg`
- do not attempt complete semantic decoding of the internal `zbb` record format

**Consequences:**
- Parser reaches inventory-grade usefulness quickly for HPE `.ahs` uploads.
- Storage inventory is stronger than text-only parsing because it reuses structured Redfish data when present.
- Auxiliary package payloads can supply missing firmware/PSU fields even when the main SMBIOS-like blob is incomplete.
- Future deeper `zbb` decoding can be added incrementally without replacing the current parser contract.

---

## ADL-039 — Canonical inventory keeps DIMMs with unknown capacity when identity is known

**Date:** 2026-03-30
**Context:** Some sources, notably HPE iLO AHS SMBIOS-like blobs, expose installed DIMM identity
(slot, serial, part number, manufacturer) but do not include capacity. The parser already extracts
those modules into `Hardware.Memory`, but canonical device building and export previously dropped
them because `size_mb == 0`.

**Decision:** Treat a DIMM as installed inventory when `present=true` and it has identifying
memory fields such as serial number or part number, even if `size_mb` is unknown.

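As a sketch, the canonical-inventory rule could be expressed with a predicate like the one below. The `dimm` struct and `installed` function are hypothetical, and treating a non-zero `SizeMB` as sufficient on its own is an assumption inferred from the earlier behavior described in the context; export-side rules such as the serial requirement in ADL-023 are not modelled here.

```go
package main

import "fmt"

// dimm is a simplified, hypothetical view of a memory module record.
type dimm struct {
	Slot       string
	Present    bool
	SizeMB     int
	Serial     string
	PartNumber string
}

// installed reports whether a module counts as installed inventory:
// it must be present and have either a known capacity or identifying fields.
func installed(d dimm) bool {
	if !d.Present {
		return false
	}
	return d.SizeMB > 0 || d.Serial != "" || d.PartNumber != ""
}

func main() {
	modules := []dimm{
		{Slot: "DIMM 1", Present: true, Serial: "ABC123"},  // identity only
		{Slot: "DIMM 2", Present: true},                    // empty slot
		{Slot: "DIMM 3", Present: false, Serial: "XYZ789"}, // marked absent
	}
	for _, m := range modules {
		fmt.Println(m.Slot, installed(m))
	}
}
```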
**Consequences:**
- HPE AHS uploads now show real installed memory modules instead of hiding them.
- Empty slots still stay filtered because they lack inventory identity or are marked absent.
- Specification/export can include "size unknown" memory entries without inventing capacity data.

---

## ADL-040 — HPE Redfish normalization prefers chassis `Devices/*` over generic PCIe topology labels

**Date:** 2026-03-30
**Context:** HPE ProLiant Gen11 Redfish snapshots expose parallel inventory trees. `Chassis/*/PCIeDevices/*`
is good for topology presence, but often reports only generic `DeviceType` values such as
`SingleFunction`. `Chassis/*/Devices/*` carries the concrete slot label, richer device type, and
product-vs-spare part identifiers for the same physical NIC/controller. Replay fallback over empty
storage volume collections can also discover `Volumes/Capabilities` children, which are not real
logical volumes.

**Decision:**
- Treat Redfish `SKU` as a valid fallback for `hardware.board.part_number` when `PartNumber` is empty.
- Ignore `Volumes/Capabilities` documents during logical-volume parsing.
- Enrich `Chassis/*/PCIeDevices/*` entries with matching `Chassis/*/Devices/*` documents by
  serial/name/part identity.
- Keep `pcie.device_class` semantic; do not replace it with model or part-number strings when
  Redfish exposes only generic topology labels.

**Consequences:**
- HPE Redfish imports now keep the server SKU in `hardware.board.part_number`.
- Empty volume collections no longer produce fake `Capabilities` volume records.
- HPE PCIe inventory gets better slot labels like `OCP 3.0 Slot 15` plus concrete classes such as
  `LOM/NIC` or `SAS/SATA Storage Controller`.
- `part_number` remains available separately for model identity, without polluting the class field.

---

## ADL-041 — Redfish replay drops topology-only PCIe noise classes from canonical inventory

**Date:** 2026-04-01
**Context:** Some Redfish BMCs, especially MSI/AMI GPU systems, expose a very wide PCIe topology
tree under `Chassis/*/PCIeDevices/*`. Besides real endpoint devices, the replay sees bridge stages,
CPU-side helper functions, IMC/mesh signal-processing nodes, USB/SPI side controllers, and GPU
display-function duplicates reported as generic `Display Device`. Keeping all of them in
`hardware.pcie_devices` pollutes downstream exports such as Reanimator and hides the actual
endpoint inventory signal.

**Decision:**
- Filter topology-only PCIe records during Redfish replay, not in the UI layer.
- Drop PCIe entries with replay-resolved classes:
  - `Bridge`
  - `Processor`
  - `SignalProcessingController`
  - `SerialBusController`
- Drop `DisplayController` entries when the source Redfish PCIe document is the generic MSI-style
  `Description: "Display Device"` duplicate.
- Drop PCIe network endpoints when their PCIe functions already link to `NetworkDeviceFunctions`,
  because those devices are represented canonically in `hardware.network_adapters`.
- When `Systems/*/NetworkInterfaces/*` links back to a chassis `NetworkAdapter`, match against the
  fully enriched chassis NIC identity to avoid creating a second ghost NIC row with the raw
  `NetworkAdapter_*` slot/name.
- Treat generic Redfish object names such as `NetworkAdapter_*` and `PCIeDevice_*` as placeholder
  models and replace them from PCI IDs when a concrete vendor/device match exists.
- Drop MSI-style storage service PCIe endpoints whose resolved device names are only
  `Volume Management Device NVMe RAID Controller` or `PCIe Switch management endpoint`; storage
  inventory already comes from the Redfish storage tree.
- Normalize Ethernet-class NICs into the single exported class `NetworkController`; do not split
  `EthernetController` into a separate top-level inventory section.
- Keep endpoint classes such as `NetworkController`, `MassStorageController`, and dedicated GPU
  inventory coming from `hardware.gpus`.

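The class-based part of this filter can be sketched as follows. `keepPCIe` is a hypothetical helper that covers only the class and description rules; the NIC deduplication, placeholder-name replacement, and storage-endpoint rules from the list above are deliberately left out.

```go
package main

import "fmt"

// topologyOnlyClasses lists the replay-resolved classes dropped outright.
var topologyOnlyClasses = map[string]bool{
	"Bridge":                     true,
	"Processor":                  true,
	"SignalProcessingController": true,
	"SerialBusController":        true,
}

// keepPCIe reports whether a PCIe record stays in the canonical inventory,
// based only on its resolved class and its Redfish Description.
func keepPCIe(class, description string) bool {
	if topologyOnlyClasses[class] {
		return false
	}
	// Generic display-function duplicate of a GPU already listed elsewhere.
	if class == "DisplayController" && description == "Display Device" {
		return false
	}
	return true
}

func main() {
	fmt.Println(keepPCIe("Bridge", ""))
	fmt.Println(keepPCIe("DisplayController", "Display Device"))
	fmt.Println(keepPCIe("MassStorageController", "RAID adapter"))
}
```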
**Consequences:**
- `hardware.pcie_devices` becomes closer to real endpoint inventory instead of raw PCIe topology.
- Reanimator exports stop showing MSI bridge/processor/display duplicate noise.
- Reanimator exports no longer duplicate the same MSI NIC as both `PCIeDevice_*` and
  `NetworkAdapter_*`.
- Replay no longer creates extra NIC rows from `Systems/NetworkInterfaces` when the same adapter
  was already normalized from `Chassis/NetworkAdapters`.
- MSI VMD / PCIe switch storage service endpoints no longer pollute PCIe inventory.
- UI/Reanimator group all Ethernet NICs under the same `NETWORKCONTROLLER` section.
- Canonical NIC inventory prefers resolved PCI product names over generic Redfish placeholder names.
- The raw Redfish snapshot still remains available in `raw_payloads.redfish_tree` for low-level
  troubleshooting if topology details are ever needed.

---

## ADL-042 — xFusion file-export archives merge AppDump inventory with RTOS/Log snapshots

**Date:** 2026-04-04
**Context:** xFusion iBMC `tar.gz` exports expose the base inventory in `AppDump/`, but the most
useful NIC and firmware details live elsewhere: NIC firmware/MAC snapshots in
`LogDump/netcard/netcard_info.txt` and system firmware versions in
`RTOSDump/versioninfo/app_revision.txt`. Parsing only `AppDump/` left xFusion uploads detectable but
incomplete for UI and Reanimator consumers.

**Decision:**
- Treat xFusion file-export `tar.gz` bundles as a first-class archive parser input.
- Merge OCP NIC identity from `AppDump/card_manage/card_info` with the latest per-slot snapshot
  from `LogDump/netcard/netcard_info.txt` to produce `hardware.network_adapters`.
- Import system-level firmware from `RTOSDump/versioninfo/app_revision.txt` into
  `hardware.firmware`.
- Allow FRU fallback from `RTOSDump/versioninfo/fruinfo.txt` when `AppDump/FruData/fruinfo.txt`
  is absent.

**Consequences:**
- xFusion uploads now preserve NIC BDF, MAC, firmware, and serial identity in normalized output.
- System firmware such as BIOS and iBMC versions survives xFusion file exports.
- xFusion archives participate more reliably in canonical device/export flows without special UI
  cases.