10 — Architectural Decision Log (ADL)

Rule: Every significant architectural decision must be recorded here before or alongside the code change. This applies to humans and AI assistants alike.

Format: date · title · context · decision · consequences


ADL-001 — In-memory only state (no database)

Date: project start Context: LOGPile is designed as a standalone diagnostic tool, not a persistent service. Decision: All parsed/collected data lives in Server.result (in-memory). No database, no files written. Consequences:

  • Data is lost on process restart — intentional.
  • Simple deployment: single binary, no setup required.
  • JSON export is the persistence mechanism for users who want to save results.

ADL-002 — Vendor parser auto-registration via init()

Date: project start Context: Need an extensible parser registry without a central factory function. Decision: Each vendor parser registers itself in its package's init() function. vendors/vendors.go holds blank imports to trigger registration. Consequences:

  • Adding a new parser requires only: implement interface + add one blank import.
  • No central list to maintain (other than the import file).
  • go test ./... will include new parsers automatically.
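
A minimal single-file sketch of the self-registration pattern described above. The `Parser` interface, `Register`, `Names`, and `dellParser` are hypothetical names chosen for illustration; in the real layout each `init()` would live in its own vendor package and be triggered by a blank import in vendors/vendors.go.

```go
package main

import (
	"fmt"
	"sort"
)

// Parser is a hypothetical minimal interface; the real one likely has more methods.
type Parser interface {
	Name() string
}

// registry is initialized before any init() function runs.
var registry = map[string]Parser{}

// Register is meant to be called from a parser package's init().
func Register(p Parser) { registry[p.Name()] = p }

// Names returns the registered parser names in sorted order.
func Names() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// dellParser stands in for a vendor parser (illustrative only).
type dellParser struct{}

func (dellParser) Name() string { return "dell" }

// In a multi-package layout this init() sits in the vendor package and runs
// because vendors/vendors.go imports that package with a blank identifier.
func init() { Register(dellParser{}) }

func main() {
	fmt.Println(Names())
}
```

Because registration happens at package initialization, the registry is already populated by the time `main` or any test starts.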

ADL-003 — Highest-confidence parser wins

Date: project start Context: Multiple parsers may partially match an archive (e.g. generic + specific vendor). Decision: Run all parsers' Detect(), select the one returning the highest score (0–100). Consequences:

  • Generic fallback (score 15) only activates when no vendor parser scores higher.
  • Parsers must be conservative with high scores (70+) to avoid false positives.
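
The selection rule can be sketched as follows. The `Detector` interface and the `fixedScore` test double are assumptions made for this example, and the tie-breaking behaviour (the earlier entry wins) is not specified by this entry.

```go
package main

import "fmt"

// Detector is a hypothetical slice of the parser interface: only what selection needs.
type Detector interface {
	Name() string
	Detect(files []string) int // confidence score, 0-100
}

// fixedScore is a test double that always reports the same score.
type fixedScore struct {
	name  string
	score int
}

func (f fixedScore) Name() string        { return f.name }
func (f fixedScore) Detect([]string) int { return f.score }

// selectParser runs every Detect() and returns the highest scorer.
// It returns nil when no parser scores above zero.
// Ties keep the earlier entry, which is an assumption of this sketch.
func selectParser(parsers []Detector, files []string) (Detector, int) {
	var best Detector
	bestScore := 0
	for _, p := range parsers {
		if s := p.Detect(files); s > bestScore {
			best, bestScore = p, s
		}
	}
	return best, bestScore
}

func main() {
	parsers := []Detector{fixedScore{"generic", 15}, fixedScore{"dell", 85}}
	best, score := selectParser(parsers, []string{"sysinfo.xml"})
	fmt.Println(best.Name(), score)
}
```

With only the generic parser present, the same function returns it at score 15, which is the fallback behaviour described above.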

ADL-004 — Canonical hardware.devices as single source of truth

Date: v1.5.0 Context: UI tabs and Reanimator exporter were reading from different sub-fields of AnalysisResult, causing potential drift. Decision: Introduce hardware.devices as the canonical inventory repository. All UI tabs and all exporters must read exclusively from this repository. Consequences:

  • Any UI vs Reanimator discrepancy is classified as a bug, not a "known difference".
  • Deduplication logic runs once in the repository builder (serial → bdf → distinct).
  • New hardware attributes must be added to canonical schema first, then mapped to consumers.
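
One possible reading of the "serial → bdf → distinct" order, sketched under assumptions: a record is keyed by serial when it has one, otherwise by PCI BDF, and otherwise kept as a distinct record. The `Device` struct and its fields are hypothetical, not the project's actual schema.

```go
package main

import (
	"fmt"
	"strings"
)

// Device is a hypothetical, reduced form of a canonical inventory record.
type Device struct {
	Kind   string
	Serial string
	BDF    string
}

// dedupKey picks the strongest available identity for a record.
// Records with neither serial nor BDF get a per-index key, so they never merge.
func dedupKey(d Device, idx int) string {
	switch {
	case strings.TrimSpace(d.Serial) != "":
		return "serial:" + strings.TrimSpace(d.Serial)
	case strings.TrimSpace(d.BDF) != "":
		return "bdf:" + strings.ToLower(strings.TrimSpace(d.BDF))
	default:
		return fmt.Sprintf("distinct:%d", idx)
	}
}

// dedup keeps the first record seen for each key, preserving input order.
func dedup(in []Device) []Device {
	seen := make(map[string]bool, len(in))
	out := make([]Device, 0, len(in))
	for i, d := range in {
		k := dedupKey(d, i)
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, d)
	}
	return out
}

func main() {
	devices := []Device{
		{Kind: "nic", Serial: "SN1"},
		{Kind: "nic", Serial: "SN1", BDF: "0000:01:00.0"}, // same serial: dropped
		{Kind: "gpu", BDF: "0000:02:00.0"},
		{Kind: "gpu", BDF: "0000:02:00.0"}, // same BDF: dropped
		{Kind: "psu"},
		{Kind: "psu"}, // no identity: kept as distinct
	}
	fmt.Println(len(dedup(devices)))
}
```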

ADL-005 — No hardcoded PCI model strings; use pci.ids

Date: v1.5.0 Context: NVIDIA and other vendors release new GPU models frequently; hardcoded maps required code changes for each new model ID. Decision: Use the pciutils/pciids database (git submodule, embedded at build time). PCI vendor/device ID → human-readable model name via lookup. Consequences:

  • New GPU models can be supported by updating pci.ids without code changes.
  • make build auto-syncs pci.ids from submodule before compilation.
  • External override via LOGPILE_PCI_IDS_PATH env var.

ADL-006 — Reanimator export uses canonical hardware.devices (not raw sub-fields)

Date: v1.5.0 Context: Early Reanimator exporter read from Hardware.GPUs, Hardware.NICs, etc. directly, diverging from UI data. Decision: Reanimator exporter must use hardware.devices — the same source as the UI. Exporter groups/filters canonical records by section; does not rebuild from sub-fields. Consequences:

  • Guarantees UI and export consistency.
  • Exporter code is simpler — mainly a filter+map, not a data reconstruction.

ADL-007 — Documentation language is English

Date: 2026-02-20 Context: Codebase documentation was mixed Russian/English, reducing clarity for international contributors and AI assistants. Decision: All maintained project documentation (docs/bible/, README.md, CLAUDE.md, and new technical docs) must be written in English. Consequences:

  • Bible is authoritative in English.
  • AI assistants get consistent, unambiguous context.

ADL-008 — Bible is the single source of truth for architecture docs

Date: 2026-02-23 Context: Architecture information was duplicated across README.md, CLAUDE.md, and the Bible, creating drift risk and stale guidance for humans and AI agents. Decision: Keep architecture and technical design documentation only in docs/bible/. Top-level README.md and CLAUDE.md must remain minimal pointers/instructions. Consequences:

  • Reduces documentation drift and duplicate updates.
  • AI assistants are directed to one authoritative source before making changes.
  • Documentation updates that affect architecture must include Bible changes (and ADL entries when significant).

ADL-009 — Redfish analysis is performed from raw snapshot replay (unified tunnel)

Date: 2026-02-24 Context: Live Redfish collection and raw export re-analysis used different parsing paths, which caused drift and made bug fixes difficult to validate consistently. Decision: Redfish live collection must produce a raw_payloads.redfish_tree snapshot first, then run the same replay analyzer used for imported raw exports. Consequences:

  • Same redfish_tree input produces the same parsed result in live and offline modes.
  • Debugging parser issues can be done against exported raw bundles without live BMC access.
  • Snapshot completeness becomes critical; collector seeds/limits are part of analyzer correctness.

ADL-010 — Raw export is a self-contained re-analysis package (not a final result dump)

Date: 2026-02-24 Context: Exporting only normalized AnalysisResult loses raw source fidelity and prevents future parser improvements from being applied to already collected data. Decision: Export Raw Data produces a self-contained raw package (JSON or ZIP bundle) that the application can reopen and re-analyze. Parsed data in the package is optional and not the source of truth on import. Consequences:

  • Re-opening an export always re-runs analysis from raw source (redfish_tree or uploaded file bytes).
  • Raw bundles include collection context and diagnostics for debugging (collect.log, parser_fields.json).
  • Endpoint compatibility is preserved (/api/export/json) while actual payload format may be a bundle.

ADL-011 — Redfish snapshot crawler is bounded, prioritized, and failure-tolerant

Date: 2026-02-24 Context: Full Redfish trees on modern GPU systems are large, noisy, and contain many vendor-specific or non-fetchable links. Unbounded crawling and naive queue design caused hangs and incomplete snapshots. Decision: Use a bounded snapshot crawler with:

  • explicit document cap (LOGPILE_REDFISH_SNAPSHOT_MAX_DOCS)
  • priority seed paths (PCIe/Fabrics/Firmware/Storage/PowerSubsystem/ThermalSubsystem)
  • normalized @odata.id paths (strip #fragment)
  • noisy expected error filtering (404/405/410/501 hidden from UI)
  • queue capacity sized to crawl cap to avoid producer/consumer deadlock

Consequences:

  • Snapshot collection remains stable on large BMC trees.
  • Most high-value inventory paths are reached before the cap.
  • UI progress remains useful while debug logs retain low-level fetch failures.
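
A simplified, single-goroutine sketch of the bounded crawl, assuming an in-memory link map in place of real HTTP fetches. The function names are hypothetical. The cap is enforced at enqueue time, so the queue can never hold more entries than the document cap, which is the property the real producer/consumer queue relies on.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePath strips a "#fragment" from an @odata.id value.
func normalizePath(p string) string {
	if i := strings.IndexByte(p, '#'); i >= 0 {
		p = p[:i]
	}
	return p
}

// isExpectedError reports whether a status is routine crawl noise
// that should stay out of the UI (it can still be logged for debugging).
func isExpectedError(status int) bool {
	switch status {
	case 404, 405, 410, 501:
		return true
	}
	return false
}

// crawl walks links breadth-first from the seeds and visits at most maxDocs paths.
// links stands in for "fetch the document and extract its @odata.id references".
func crawl(links map[string][]string, seeds []string, maxDocs int) []string {
	seen := make(map[string]bool)
	queue := make([]string, 0, maxDocs)
	enqueue := func(p string) {
		p = normalizePath(p)
		if p == "" || seen[p] || len(seen) >= maxDocs {
			return
		}
		seen[p] = true
		queue = append(queue, p)
	}
	for _, s := range seeds { // priority seeds are queued first
		enqueue(s)
	}
	var visited []string
	for len(queue) > 0 {
		p := queue[0]
		queue = queue[1:]
		visited = append(visited, p)
		for _, l := range links[p] {
			enqueue(l)
		}
	}
	return visited
}

func main() {
	links := map[string][]string{
		"/redfish/v1/Chassis": {
			"/redfish/v1/Chassis/1#/Links",
			"/redfish/v1/Chassis/1",
			"/redfish/v1/Chassis/2",
		},
	}
	fmt.Println(crawl(links, []string{"/redfish/v1/Chassis"}, 2))
	fmt.Println(isExpectedError(404), isExpectedError(500))
}
```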

ADL-012 — Vendor-specific storage inventory probing is allowed as fallback

Date: 2026-02-24 Context: Some Supermicro BMCs expose empty standard Storage/.../Drives collections while real disk inventory exists under vendor-specific Disk.Bay endpoints and enclosure links. Decision: When standard drive collections are empty, collector/replay may probe vendor-style .../Drives/Disk.Bay.* endpoints and follow Storage.Links.Enclosures[*] to recover physical drives. Consequences:

  • Higher storage inventory coverage on Supermicro HBA/HA-RAID/MRVL/NVMe backplane implementations.
  • Replay must mirror the same probing behavior to preserve deterministic results.
  • Probing remains bounded (finite candidate set) to avoid runaway requests.

ADL-013 — PowerSubsystem is preferred over legacy Power on newer Redfish implementations

Date: 2026-02-24 Context: X14+/newer Redfish implementations increasingly expose authoritative PSU data in PowerSubsystem/PowerSupplies, while legacy /Power may be incomplete or schema-shifted. Decision: Prefer Chassis/*/PowerSubsystem/PowerSupplies as the primary PSU source and use legacy Chassis/*/Power as fallback. Consequences:

  • Better compatibility with newer BMC firmware generations.
  • Legacy systems remain supported without special-case collector selection.
  • Snapshot priority seeds must include PowerSubsystem resources.

ADL-014 — Threshold logic lives on the server; UI reflects status only

Date: 2026-02-24 Context: Duplicating threshold math in frontend and backend creates drift and inconsistent highlighting (e.g. PSU mains voltage range checks). Decision: Business threshold evaluation (e.g. PSU voltage nominal range) must be computed on the server; frontend only renders status/flags returned by the API. Consequences:

  • Single source of truth for threshold policies.
  • UI can evolve visually without re-implementing domain logic.
  • API payloads may carry richer status semantics over time.

ADL-015 — Supermicro crashdump archive parser removed from active registry

Date: 2026-03-01 Context: The Supermicro crashdump parser (SMC Crash Dump Parser) produced low-value results for current workflows and was explicitly rejected as a supported archive path. Decision: Remove supermicro vendor parser from active registration and project source. Do not include it in /api/parsers output or parser documentation matrix. Consequences:

  • Supermicro crashdump archives (CDump.txt format) are no longer parsed by a dedicated vendor parser.
  • Such archives fall back to other matching parsers (typically generic) unless a new replacement parser is added.
  • Reintroduction requires a new parser package and an explicit registry import in vendors/vendors.go.

ADL-016 — Device-bound firmware must not appear in hardware.firmware

Date: 2026-03-01 Context: Dell TSR DCIM_SoftwareIdentity lists firmware for every component (NICs, PSUs, disks, backplanes) in addition to system-level firmware. Naively importing all entries into Hardware.Firmware caused device firmware to appear twice in Reanimator: once in the device's own record and again in the top-level firmware list. Decision:

  • Hardware.Firmware contains only system-level firmware (BIOS, BMC/iDRAC, CPLD, Lifecycle Controller, storage controllers, BOSS).
  • Device-bound entries (NIC, PSU, Disk, Backplane, GPU) must not be added to Hardware.Firmware.
  • Parsers must store the FQDD (or equivalent slot identifier) in FirmwareInfo.Description so the Reanimator exporter can filter by FQDD prefix.
  • The exporter's isDeviceBoundFirmwareFQDD() function performs this filter.

Consequences:

  • Any new parser that ingests a per-device firmware inventory must follow the same rule.
  • Device firmware is accessible only via the device's own record, not the firmware list.

ADL-017 — Vendor-embedded MAC addresses must be stripped from model name fields

Date: 2026-03-01 Context: Dell TSR embeds MAC addresses directly in ProductName and ElementName fields (e.g. "NVIDIA ConnectX-6 Lx 2x 25G SFP28 OCP3.0 SFF - C4:70:BD:DB:56:08"). This caused model names to contain MAC addresses in NIC model, NIC firmware device name, and potentially other fields. Decision: Strip any - XX:XX:XX:XX:XX:XX suffix from all model/name string fields at parse time before storing in any model struct. Use the regex \s+-\s+([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$. Consequences:

  • Model names are clean and consistent across all devices.
  • All parsers must apply this stripping to any field used as a device name or model.
  • Confirmed affected fields in Dell: DCIM_NICView.ProductName, DCIM_SoftwareIdentity.ElementName.
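
A short sketch applying the regex from the decision with Go's standard `regexp` package. The helper name `stripMACSuffix` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// macSuffix matches a trailing " - XX:XX:XX:XX:XX:XX" as given in this entry.
var macSuffix = regexp.MustCompile(`\s+-\s+([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$`)

// stripMACSuffix removes a vendor-embedded MAC address from a model/name field.
// Strings without such a suffix are returned unchanged.
func stripMACSuffix(s string) string {
	return macSuffix.ReplaceAllString(s, "")
}

func main() {
	in := "NVIDIA ConnectX-6 Lx 2x 25G SFP28 OCP3.0 SFF - C4:70:BD:DB:56:08"
	fmt.Println(stripMACSuffix(in))
}
```

The pattern is anchored at the end of the string, so a hyphen or MAC-like token in the middle of a model name is left alone.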

ADL-018 — NVMe bay probe must be restricted to storage-capable chassis types

Date: 2026-03-12 Context: shouldAdaptiveNVMeProbe was introduced in 2fa4a12 to recover NVMe drives on Supermicro BMCs that expose empty Drives collections but serve disks at direct Disk.Bay.N paths. The function returns true for any chassis with an empty Members array. On Supermicro HGX systems (SYS-A21GE-NBRT and similar) ~35 sub-chassis (GPU, NVSwitch, PCIeRetimer, ERoT, IRoT, BMC, FPGA) all carry ChassisType=Module/Component/Zone and expose empty /Drives collections. Without filtering, each triggered 384 HTTP requests → 13 440 requests ≈ 22 minutes of pure I/O waste per collection. Decision: Before probing Disk.Bay.N candidates for a chassis, check its ChassisType via chassisTypeCanHaveNVMe. Skip if type is Module, Component, or Zone. Keep probing for Enclosure, RackMount, and any unrecognised type (fail-safe). Consequences:

  • On HGX systems, NVMe post-probe time drops from ~22 min to effectively zero.
  • NVMe backplane recovery (Enclosure type) is unaffected.
  • Any new chassis type that hosts NVMe storage is covered by the default true path.
  • chassisTypeCanHaveNVMe and the candidate-selection loop must have unit tests covering both the excluded types and the storage-capable types (see TestChassisTypeCanHaveNVMe and TestNVMePostProbeSkipsNonStorageChassis).
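
A sketch of the chassis-type gate and the request arithmetic from the context. The function name comes from this entry, but the body is an assumption about how it could be written; the case-insensitive comparison in particular is not confirmed by the source.

```go
package main

import (
	"fmt"
	"strings"
)

// chassisTypeCanHaveNVMe returns false only for chassis types known not to host
// drives. Unknown or empty types fall through to true (fail-safe).
func chassisTypeCanHaveNVMe(chassisType string) bool {
	switch strings.ToLower(strings.TrimSpace(chassisType)) {
	case "module", "component", "zone":
		return false
	}
	return true
}

func main() {
	for _, t := range []string{"Module", "Zone", "Enclosure", "RackMount", "SomethingNew"} {
		fmt.Printf("%-12s probe=%v\n", t, chassisTypeCanHaveNVMe(t))
	}

	// Cost of the unfiltered probe, using the figures quoted in the context:
	// about 35 sub-chassis, 384 requests each.
	const subChassis, requestsPerChassis = 35, 384
	fmt.Println(subChassis * requestsPerChassis) // 13440
}
```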

ADL-019 — Redfish post-probe recovery is profile-owned acquisition policy

Date: 2026-03-18 Context: Numeric collection post-probe and direct NVMe Disk.Bay recovery were still controlled by collector-core heuristics, which kept platform-specific acquisition behavior in redfish.go and made vendor/topology refactoring incomplete. Decision: Move expensive Redfish post-probe enablement into profile-owned acquisition policy. The collector core may execute bounded post-probe loops, but profiles must explicitly enable:

  • numeric collection post-probe
  • direct NVMe Disk.Bay recovery
  • sensor collection post-probe

Consequences:

  • Generic collector flow no longer implicitly turns on storage/NVMe recovery for every platform.
  • Supermicro-specific direct NVMe recovery and generic numeric collection recovery are now regression-tested through profile fixtures.
  • Future platform storage/post-probe behavior must be added through profile tuning, not new vendor-shaped if branches in collector core.

ADL-020 — Redfish critical plan-B activation is profile-owned recovery policy

Date: 2026-03-18 Context: critical plan-B and profile plan-B were still effectively always-on collector behavior once paths were present, including critical collection member retry and slow numeric child probing. That kept acquisition recovery semantics in redfish.go instead of the profile layer. Decision: Move plan-B activation into profile-owned recovery policy. Profiles must explicitly enable:

  • critical collection member retry
  • slow numeric probing during critical plan-B
  • profile-specific plan-B pass

Consequences:

  • Recovery behavior is now observable in raw Redfish diagnostics alongside other tuning.
  • Generic/fallback recovery remains available through profile policy instead of implicit collector defaults.
  • Future platform-specific plan-B behavior must be introduced through profile tuning and tests, not through new unconditional collector branches.

ADL-021 — Extra discovered-path storage seeds must be profile-scoped, not core-baseline

Date: 2026-03-18 Context: The collector core baseline seed list still contained storage-specific discovered-path suffixes such as SimpleStorage and Storage/IntelVROC/*. These are useful on some platforms, but they are acquisition extensions layered on top of discovered Systems/* resources, not part of the minimal vendor-neutral Redfish baseline. Decision: Move such discovered-path expansions into profile-owned scoped path policy. The collector core keeps the vendor-neutral baseline; profiles may add extra system/chassis/manager suffixes that are expanded over discovered members during acquisition planning. Consequences:

  • Platform-shaped storage discovery no longer lives in redfish.go baseline seed construction.
  • Extra discovered-path branches are visible in plan diagnostics and fixture regression tests.
  • Future model/vendor storage path expansions must be added through scoped profile policy instead of editing the shared baseline seed list.
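
Expanding profile-owned suffixes over discovered members might look like the sketch below. `expandScoped` is a hypothetical helper, not the project's actual function, and the example suffixes are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// expandScoped joins every discovered member path with every profile suffix.
// Slashes at the join point are normalized so the result has exactly one.
func expandScoped(members, suffixes []string) []string {
	out := make([]string, 0, len(members)*len(suffixes))
	for _, m := range members {
		base := strings.TrimRight(m, "/")
		for _, s := range suffixes {
			out = append(out, base+"/"+strings.TrimLeft(s, "/"))
		}
	}
	return out
}

func main() {
	systems := []string{"/redfish/v1/Systems/1"}
	profileSuffixes := []string{"SimpleStorage", "Storage/IntelVROC"}
	for _, p := range expandScoped(systems, profileSuffixes) {
		fmt.Println(p)
	}
}
```

Keeping the suffix list in the profile means the shared baseline never needs to know which platforms expose these branches.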

ADL-022 — Adaptive prefetch eligibility is profile-owned policy

Date: 2026-03-18 Context: The adaptive prefetch executor was still driven by hardcoded include/exclude path rules in redfish.go. That made GPU/storage/network prefetch shaping part of collector-core knowledge rather than profile-owned acquisition policy. Decision: Move prefetch eligibility rules into profile tuning. The collector core still runs adaptive prefetch, but profiles provide:

  • IncludeSuffixes for critical paths eligible for prefetch
  • ExcludeContains for path shapes that must never be prefetched

Consequences:

  • Prefetch behavior is now visible in raw Redfish diagnostics and test fixtures.
  • Platform- or topology-specific prefetch shaping no longer requires editing collector-core string lists.
  • Future prefetch tuning must be introduced through profiles and regression tests.
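
A sketch of how the two rule lists could combine. The field names `IncludeSuffixes` and `ExcludeContains` come from this entry; the `PrefetchPolicy` type, the `Eligible` method, and the rule that excludes take precedence are assumptions, as are the example paths.

```go
package main

import (
	"fmt"
	"strings"
)

// PrefetchPolicy is a hypothetical container for the profile-provided rules.
type PrefetchPolicy struct {
	IncludeSuffixes []string // critical path endings eligible for prefetch
	ExcludeContains []string // path fragments that must never be prefetched
}

// Eligible applies excludes first, then requires a matching include suffix.
func (p PrefetchPolicy) Eligible(path string) bool {
	for _, x := range p.ExcludeContains {
		if strings.Contains(path, x) {
			return false
		}
	}
	for _, s := range p.IncludeSuffixes {
		if strings.HasSuffix(path, s) {
			return true
		}
	}
	return false
}

func main() {
	policy := PrefetchPolicy{
		IncludeSuffixes: []string{"/Processors", "/Storage"},
		ExcludeContains: []string{"/LogServices/"},
	}
	fmt.Println(policy.Eligible("/redfish/v1/Systems/1/Storage"))
	fmt.Println(policy.Eligible("/redfish/v1/Systems/1/LogServices/Log1/Storage"))
}
```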

ADL-023 — Core critical baseline is roots-only; critical shaping is profile-owned

Date: 2026-03-18 Context: redfishCriticalEndpoints(...) still encoded a broad set of system/chassis/manager critical branches directly in collector core. This mixed minimal crawl invariants with profile-specific acquisition shaping. Decision: Reduce collector-core critical baseline to vendor-neutral roots only:

  • /redfish/v1
  • discovered Systems/*
  • discovered Chassis/*
  • discovered Managers/*

Profiles now own additional critical shaping through:

  • scoped critical suffix policy for discovered resources
  • explicit top-level CriticalPaths

Consequences:

  • Critical inventory breadth is now explained by the acquisition plan, not hidden in collector helper defaults.
  • Generic profile still provides the previous broad critical coverage, so behavior stays stable.
  • Future critical-path tuning must be implemented in profiles and regression-tested there.

ADL-024 — Live Redfish execution plans are resolved inside redfishprofile

Date: 2026-03-18 Context: Even after moving seeds, scoped paths, critical shaping, recovery, and prefetch policy into profiles, redfish.go still manually merged discovered resources with those policy fragments. That left acquisition-plan resolution logic in collector core. Decision: Introduce redfishprofile.ResolveAcquisitionPlan(...) as the boundary between profile planning and collector execution. redfishprofile now resolves:

  • baseline seeds
  • baseline critical roots
  • scoped path expansions
  • explicit profile seed/critical/plan-B paths

The collector core consumes the resolved plan and executes it. Consequences:

  • Acquisition planning logic is now testable in redfishprofile without going through the live collector.
  • redfish.go no longer owns path-resolution helpers for seeds/critical planning.
  • This creates a clean next step toward true per-profile acquisition hooks beyond static policy fragments.

ADL-025 — Post-discovery acquisition refinement belongs to profile hooks

Date: 2026-03-18 Context: Some acquisition behavior depends not only on vendor/model hints, but on what the lightweight Redfish discovery actually returned. Static absolute path lists in profile plans are too rigid for such cases and reintroduce guessed platform knowledge. Decision: Add a post-discovery acquisition refinement hook to Redfish profiles. Profiles may mutate the resolved execution plan after discovered Systems/*, Chassis/*, and Managers/* are known.

First concrete use:

  • MSI now derives GPU chassis seeds and .../Sensors critical/plan-B paths from discovered Chassis/GPU* resources instead of hardcoded GPU1..GPU4 absolute paths in the static plan.

Additional use:

  • Supermicro now derives UpdateService/Oem/Supermicro/FirmwareInventory critical/plan-B paths from resource hints instead of carrying that absolute path in the static plan.

Additional use:

  • Dell now derives Managers/iDRAC.Embedded.* acquisition paths from discovered manager resources instead of carrying Managers/iDRAC.Embedded.1 as a static absolute path.

Consequences:

  • Profile modules can react to actual discovery results without pushing conditional logic back into redfish.go.
  • Diagnostics still show the final refined plan because the collector stores the refined plan, not only the pre-refinement template.
  • Future vendor-specific discovery-dependent acquisition behavior should be implemented through this hook rather than new collector-core branches.

ADL-026 — Replay analysis uses a resolved profile plan, not ad-hoc directives only

Date: 2026-03-18 Context: Replay still relied on a flat AnalysisDirectives struct assembled centrally, while vendor-specific conditions often depended on the actual snapshot shape. That made analysis behavior harder to explain and kept too much vendor logic in generic replay collectors. Decision: Introduce redfishprofile.ResolveAnalysisPlan(...) for replay. The resolved analysis plan contains:

  • active match result
  • resolved analysis directives
  • analysis notes explaining snapshot-aware hook activation

Profiles may refine this plan using the snapshot and discovered resources before replay collectors run.

First concrete uses:

  • MSI enables processor-GPU fallback and MSI chassis lookup only when the snapshot actually contains GPU processors and Chassis/GPU*
  • HGX enables processor-GPU alias fallback from actual HGX/GPU_SXM topology signals in the snapshot
  • Supermicro enables NVMe backplane and known-controller recovery from actual snapshot paths

Consequences:

  • Replay behavior is now closer to the acquisition architecture: a resolved profile plan feeds the executor.
  • redfish_analysis_plan is stored in raw payload metadata for offline debugging.
  • Future analysis-side vendor logic should move into profile refinement hooks instead of growing the central directive builder.

ADL-027 — Replay GPU/storage executors consume resolved analysis plans

Date: 2026-03-18 Context: Even after introducing ResolveAnalysisPlan(...), replay GPU/storage collectors still accepted a raw AnalysisDirectives struct. That preserved an implicit shortcut from the old design and weakened the plan/executor boundary. Decision: Replay GPU/storage executors now accept redfishprofile.ResolvedAnalysisPlan directly. The executor reads resolved directives from the plan instead of being passed a standalone directive bundle. Consequences:

  • GPU and storage replay execution now follows the same architectural pattern as acquisition: resolve plan first, execute second.
  • Future profile-owned execution helpers can use plan notes or additional resolved fields without changing the executor API again.
  • Remaining replay areas should migrate the same way instead of continuing to accept raw directive structs.

ADL-019 — isDeviceBoundFirmwareName must cover vendor-specific naming patterns per vendor

Date: 2026-03-12 Context: isDeviceBoundFirmwareName was written to filter Dell-style device firmware names ("GPU SomeDevice", "NIC OnboardLAN"). When Supermicro Redfish FirmwareInventory was added (6c19a58), no Supermicro-specific patterns were added. Supermicro names a NIC entry "NIC1 System Slot0 AOM-DP805-IO" — a digit follows the type prefix directly, bypassing the "nic " (space-terminated) check. 29 device-bound entries leaked into hardware.firmware on SYS-A21GE-NBRT (HGX B200). Commit 9c5512d attempted a fix by adding _fw_gpu_ patterns, but checked DeviceName which contains "Software Inventory" (from the Redfish Name field), not the firmware inventory ID. The patterns were dead code from the moment they were committed. Decision:

  • isDeviceBoundFirmwareName must be extended for each new vendor whose FirmwareInventory naming convention differs from the existing patterns.
  • When adding HGX/Supermicro patterns, check that the pattern matches the field value that collectFirmwareInventory actually stores — trace the data path from Redfish doc to FirmwareInfo.DeviceName before writing the condition.
  • TestIsDeviceBoundFirmwareName must contain at least one case per vendor format.

Consequences:

  • New vendors with FirmwareInventory support require a test covering both device-bound names (must return true) and system-level names (must return false) before the code ships.
  • The dead _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns were replaced with correct prefix+digit checks ("gpu" + digit, "nic" + digit) and explicit string checks ("nvmecontroller", "power supply", "software inventory").
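
The corrected checks might be shaped like the sketch below. It is an illustrative reconstruction, not the project's code: the `hasPrefixDigit` helper is hypothetical, and whether the explicit strings are matched by substring or by equality is an assumption (substring is used here).

```go
package main

import (
	"fmt"
	"strings"
)

// hasPrefixDigit reports whether s starts with prefix immediately followed by a digit,
// e.g. "nic1 ..." for prefix "nic".
func hasPrefixDigit(s, prefix string) bool {
	if !strings.HasPrefix(s, prefix) || len(s) <= len(prefix) {
		return false
	}
	c := s[len(prefix)]
	return c >= '0' && c <= '9'
}

// isDeviceBoundFirmwareName covers the Dell-style "type + space" names and the
// Supermicro-style "type + digit" names mentioned in this entry.
func isDeviceBoundFirmwareName(name string) bool {
	n := strings.ToLower(strings.TrimSpace(name))
	for _, typ := range []string{"gpu", "nic"} {
		if strings.HasPrefix(n, typ+" ") || hasPrefixDigit(n, typ) {
			return true
		}
	}
	for _, s := range []string{"nvmecontroller", "power supply", "software inventory"} {
		if strings.Contains(n, s) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"NIC1 System Slot0 AOM-DP805-IO", "GPU SomeDevice", "BIOS"} {
		fmt.Printf("%q device-bound=%v\n", name, isDeviceBoundFirmwareName(name))
	}
}
```

A table-driven test with one device-bound and one system-level name per vendor format is the natural companion to a function like this.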

ADL-020 — Dell TSR device-bound firmware filtered via FQDD; InfiniBand routed to NetworkAdapters

Date: 2026-03-15 Context: Dell TSR sysinfo_DCIM_SoftwareIdentity.xml lists firmware for every installed component. parseSoftwareIdentityXML dumped all of these into hardware.firmware without filtering, so device-bound entries such as "Mellanox Network Adapter" (FQDD InfiniBand.Slot.1-1) and "PERC H755 Front" (FQDD RAID.SL.3-1) appeared in the reanimator export alongside system firmware like BIOS and iDRAC. Confirmed on PowerEdge R6625 (8VS2LG4).

Additionally, DCIM_InfiniBandView was not handled in the parser switch, so Mellanox ConnectX-6 appeared only as a PCIe device with model: "16x or x16" (from DataBusWidth fallback). parseControllerView called addFirmware with description "storage controller" instead of the FQDD, so the FQDD-based filter in the exporter could not remove it.

Decision:

  1. isDeviceBoundFirmwareFQDD extended with "infiniband." and "fc." prefixes; "raid.backplane." broadened to "raid." to cover RAID.SL.*, RAID.Integrated.*, etc.
  2. DCIM_InfiniBandView routed to parseNICView → device appears as NetworkAdapter with correct firmware, MAC address, and VendorID/DeviceID.
  3. "InfiniBand." added to pcieFQDDNoisePrefix to suppress the duplicate DCIM_PCIDeviceView entry (DataBusWidth-only, no useful data).
  4. parseControllerView now passes fqdd as the addFirmware description so the FQDD filter removes the entry in the exporter.
  5. parsePCIeDeviceView now prioritises props["description"] (chip model, e.g. "MT28908 Family [ConnectX-6]") over props["devicedescription"] (location string) for pcie.Description.
  6. convertPCIeDevices model fallback order: PartNumber → Description → DeviceClass.

Consequences:

  • hardware.firmware contains only system-level entries; NIC/RAID/storage-controller firmware lives on the respective device record.
  • TestParseDellInfiniBandView and TestIsDeviceBoundFirmwareFQDD guard the regression.
  • Any future Dell TSR device class whose FQDD prefix is not yet in the prefix list may still leak; extend isDeviceBoundFirmwareFQDD and add a test case when encountered.
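
A sketch of the FQDD prefix filter, limited to the three prefixes this entry names. The real prefix list is presumably longer, and the case-insensitive match is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// deviceBoundFQDDPrefixes holds only the prefixes named in this entry;
// the actual list in the exporter is assumed to contain more.
var deviceBoundFQDDPrefixes = []string{"infiniband.", "fc.", "raid."}

// isDeviceBoundFirmwareFQDD reports whether a firmware entry's FQDD belongs to a device
// and should therefore be kept out of the top-level firmware list.
func isDeviceBoundFirmwareFQDD(fqdd string) bool {
	f := strings.ToLower(strings.TrimSpace(fqdd))
	for _, p := range deviceBoundFQDDPrefixes {
		if strings.HasPrefix(f, p) {
			return true
		}
	}
	return false
}

func main() {
	for _, fqdd := range []string{"InfiniBand.Slot.1-1", "RAID.SL.3-1", "iDRAC.Embedded.1-1"} {
		fmt.Printf("%-20s device-bound=%v\n", fqdd, isDeviceBoundFirmwareFQDD(fqdd))
	}
}
```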

ADL-021 — pci.ids enrichment: chip model and vendor resolved from PCI IDs when source data is generic or missing

Date: 2026-03-15 Context: Dell TSR DCIM_InfiniBandView.ProductName reports a generic marketing name ("Mellanox Network Adapter") instead of the precise chip identifier ("MT28908 Family [ConnectX-6]"). The actual chip model is available in pci.ids by VendorID:DeviceID (15B3:101B). Vendor name may also be absent when no VendorName / Manufacturer property is present.

The general rule was established: if model is not found in source data but PCI IDs are known, resolve model from pci.ids. This rule applies broadly across all export paths.

Decision (two-layer enrichment):

  1. Parser layer (Dell, parseNICView): When VendorID != 0 && DeviceID != 0, prefer pciids.DeviceName(vendorID, deviceID) over the product name from logs. This makes the chip identifier the primary model for NIC/InfiniBand adapters (more specific than marketing name). Fill Vendor from pciids.VendorName(vendorID) when the vendor field is otherwise empty. Same fallback applied in parsePCIeDeviceView for empty Description.
  2. Exporter layer (convertPCIeFromDevices): General rule — when d.Model == "" after all legacy fallbacks and VendorID != 0 && DeviceID != 0, set model = pciids.DeviceName(...). Also fill empty manufacturer from pciids.VendorName(...). This covers all parsers/sources.

Consequences:

  • Mellanox InfiniBand slot now reports model: "MT28908 Family [ConnectX-6]" and manufacturer: "Mellanox Technologies" in the reanimator export.
  • For NICs where pci.ids has no entry, the original product name is kept (pci.ids returns "").
  • TestParseDellInfiniBandView asserts the model and vendor from pci.ids.
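
The two layers can be sketched with a tiny in-memory stand-in for the embedded database. The maps and the helper names (`parserLayerModel`, `exporterLayerModel`, `fillVendor`) are hypothetical; only the IDs and names quoted in this entry are used as data.

```go
package main

import "fmt"

type pciKey struct{ vendor, device uint16 }

// Stand-ins for the embedded pci.ids lookup; a miss returns "".
var deviceNames = map[pciKey]string{{0x15B3, 0x101B}: "MT28908 Family [ConnectX-6]"}
var vendorNames = map[uint16]string{0x15B3: "Mellanox Technologies"}

func deviceName(v, d uint16) string { return deviceNames[pciKey{v, d}] }
func vendorName(v uint16) string    { return vendorNames[v] }

// parserLayerModel prefers the pci.ids chip name over the logged product name
// when both IDs are known and the lookup succeeds.
func parserLayerModel(productName string, v, d uint16) string {
	if v != 0 && d != 0 {
		if name := deviceName(v, d); name != "" {
			return name
		}
	}
	return productName
}

// exporterLayerModel only fills a model that is still empty after other fallbacks.
func exporterLayerModel(model string, v, d uint16) string {
	if model == "" && v != 0 && d != 0 {
		return deviceName(v, d)
	}
	return model
}

// fillVendor fills an empty vendor from the vendor ID and never overwrites a value.
func fillVendor(vendor string, v uint16) string {
	if vendor == "" && v != 0 {
		return vendorName(v)
	}
	return vendor
}

func main() {
	fmt.Println(parserLayerModel("Mellanox Network Adapter", 0x15B3, 0x101B))
	fmt.Println(parserLayerModel("Some NIC", 0x1234, 0x5678)) // no pci.ids entry: name kept
	fmt.Println(fillVendor("", 0x15B3))
}
```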

ADL-022 — CPUAffinity parsed into NUMANode for PCIe, NIC, and controller devices

Date: 2026-03-15 Context: Dell TSR DCIM view classes report CPUAffinity for NIC, InfiniBand, PCIe, and controller devices. Values are "1", "2" (NUMA node index), or "Not Applicable" (for devices that bridge both CPUs or have no CPU affinity). This data is needed for topology-aware diagnostics.

Decision:

  • Add NUMANode int (JSON: "numa_node,omitempty") to models.PCIeDevice, models.NetworkAdapter, models.HardwareDevice, and ReanimatorPCIe.
  • Parse from props["cpuaffinity"] using parseIntLoose: numeric values ("1", "2") map directly; "Not Applicable" returns 0 (omitted via omitempty).
  • Thread through buildDevicesFromLegacy (PCIe and NIC sections) and convertPCIeFromDevices.
  • parseControllerView also parses CPUAffinity since RAID controllers have NUMA affinity.

Consequences:

  • numa_node: 1 or 2 appears in reanimator export for devices with known affinity.
  • Value 0 / absent means "not reported" — covers both "Not Applicable" and sources that don't provide CPUAffinity at all.
  • TestParseDellCPUAffinity verifies numeric values parsed correctly and "Not Applicable"→0.
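
A sketch of the parsing and serialization behaviour. The `parseIntLoose` body is a stand-in (the real helper may accept more input shapes), and the `PCIeDevice` struct is reduced to two fields for illustration; only the `numa_node,omitempty` tag is taken from this entry.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// parseIntLoose is a stand-in: it returns 0 for anything that is not a plain integer.
func parseIntLoose(s string) int {
	n, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil {
		return 0
	}
	return n
}

// PCIeDevice is a reduced, hypothetical version of the model struct.
type PCIeDevice struct {
	Slot     string `json:"slot"`
	NUMANode int    `json:"numa_node,omitempty"`
}

// encode marshals a device; a zero NUMANode is omitted from the output.
func encode(d PCIeDevice) string {
	b, err := json.Marshal(d)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	for _, raw := range []string{"1", "2", "Not Applicable"} {
		d := PCIeDevice{Slot: "3", NUMANode: parseIntLoose(raw)}
		fmt.Println(encode(d))
	}
}
```

Because 0 doubles as "not reported", a platform that numbers NUMA nodes from 0 would need a different representation; the Dell values described here start at 1.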

ADL-023 — Reanimator export must match ingest contract exactly

Date: 2026-03-15 Context: LOGPile's Reanimator export had drifted from the strict ingest contract. It emitted fields that Reanimator does not currently accept (status_at_collection, numa_node), while missing fields and sections now present in the contract (hardware.sensors, pcie_devices[].mac_addresses). Memory export rules also diverged from the ingest side: empty or serial-less DIMMs were still exported.

Decision:

  • Treat the Reanimator ingest contract as the authoritative schema for GET /api/export/reanimator.
  • Emit only fields present in the current upstream contract revision.
  • Add hardware.sensors, pcie_devices[].mac_addresses, pcie_devices[].numa_node, and upstream-approved component telemetry/health fields.
  • Leave out fields that are still not part of the upstream contract.
  • Map internal source_type=archive to external source_type=logfile.
  • Skip memory entries that are empty, not present, or missing serial numbers.
  • Generate CPU and PCIe serials only in the forms allowed by the contract.
  • Mirror the applied contract in bible-local/docs/hardware-ingest-contract.md.

Consequences:

  • Some previously exported diagnostic fields are intentionally dropped from the Reanimator payload until the upstream contract adds them.
  • Internal models may retain richer fields than the current export schema.
  • hardware.devices is canonical only after merge with legacy hardware slices; partial parser-owned canonical records must not hide CPUs, memory, storage, NICs, or PSUs still stored in legacy fields.
  • CSV and Reanimator exports must use the same merged canonical inventory to avoid divergent export contents across surfaces.
  • Future exporter changes must update both the code and the mirrored contract document together.
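
Two of the export rules above (the source_type mapping and the memory filter) can be sketched as follows. The `MemoryDIMM` struct and both helper names are hypothetical, and treating "empty" as a zero size is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// MemoryDIMM is a hypothetical, reduced memory record.
type MemoryDIMM struct {
	Slot    string
	Present bool
	SizeMB  int
	Serial  string
}

// exportSourceType maps the internal source type to the external contract value.
func exportSourceType(internal string) string {
	if internal == "archive" {
		return "logfile"
	}
	return internal
}

// exportableMemory drops DIMMs that are not present, empty, or missing a serial number.
func exportableMemory(in []MemoryDIMM) []MemoryDIMM {
	out := make([]MemoryDIMM, 0, len(in))
	for _, d := range in {
		if !d.Present || d.SizeMB == 0 || strings.TrimSpace(d.Serial) == "" {
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	dimms := []MemoryDIMM{
		{Slot: "A1", Present: true, SizeMB: 32768, Serial: "S1"},
		{Slot: "A2", Present: false},
		{Slot: "A3", Present: true, SizeMB: 32768}, // no serial: skipped
	}
	fmt.Println(len(exportableMemory(dimms)), exportSourceType("archive"))
}
```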

ADL-024 — Component presence is implicit; Redfish linked metrics are part of replay correctness

Date: 2026-03-15 Context: The upstream ingest contract allows present, but current export semantics do not need to send present=true for populated components. At the same time, several important Redfish component telemetry fields were only available through linked metric resources such as ProcessorMetrics, MemoryMetrics, and DriveMetrics. Without collecting and replaying these linked documents, live collection and raw snapshot replay still underreported component health fields.

Decision:

  • Do not serialize present=true in Reanimator export. Presence is represented by the presence of the component record itself.
  • Do not export component records marked present=false.
  • Interpret CPU firmware in Reanimator payload as CPU microcode.
  • Treat Redfish linked metric resources ProcessorMetrics, MemoryMetrics, DriveMetrics, EnvironmentMetrics, and generic Metrics as part of analyzer correctness when they are linked from component resources.
  • Replay logic must merge these linked metric resources back into CPU, memory, storage, PCIe, GPU, NIC, and PSU component Details the same way live collection expects them to be used.
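
The presence rule above can be sketched as follows. This is a minimal illustration, assuming a hypothetical internal `component` record with an optional `Present` flag and a hypothetical export shape; the type names and JSON field names are illustrative and are not LOGPile's actual models or contract fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// component is a hypothetical internal record. Present is nil when the
// source did not state presence at all.
type component struct {
	Slot    string
	Model   string
	Present *bool
}

// exportedComponent is a hypothetical export shape. It deliberately has no
// "present" field: a record in the payload means the component is present.
type exportedComponent struct {
	Slot  string `json:"slot"`
	Model string `json:"model,omitempty"`
}

// exportComponents drops records explicitly marked present=false and never
// serializes presence for the remaining ones.
func exportComponents(in []component) []exportedComponent {
	out := make([]exportedComponent, 0, len(in))
	for _, c := range in {
		if c.Present != nil && !*c.Present {
			continue
		}
		out = append(out, exportedComponent{Slot: c.Slot, Model: c.Model})
	}
	return out
}

func main() {
	yes, no := true, false
	payload, err := json.Marshal(exportComponents([]component{
		{Slot: "CPU0", Model: "Example CPU", Present: &yes},
		{Slot: "CPU1", Present: &no},
	}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```

Treating an unset presence flag as "present" is an assumption of this sketch; the decision itself only states that records marked present=false are not exported.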

Consequences:

  • Reanimator payloads are smaller and avoid redundant present=true noise while still excluding empty slots and absent components.
  • Any future exporter change that reintroduces serialized component presence needs an explicit contract review.
  • Raw Redfish snapshot completeness now includes linked per-component metric resources, not only top-level inventory members.
  • CPU microcode is no longer expected in top-level hardware.firmware; it belongs on the CPU component record.

ADL-025 — Missing serial numbers must remain absent in Reanimator export

Date: 2026-03-15 Context: LOGPile previously generated synthetic serial numbers for components that had no real serial in source data, especially CPUs and PCIe-class devices. This made the payload look richer, but the serials were not authoritative and could mislead downstream consumers. Reanimator can already accept missing serials and generate its own internal fallback identifiers when needed.

Decision:

  • Do not synthesize fake serial numbers in LOGPile's Reanimator export.
  • If a component has no real serial in parsed source data, export the serial field as absent.
  • This applies to CPUs, PCIe devices, GPUs, NICs, and any other component class unless an upstream contract explicitly requires a deterministic exporter-generated identifier.
  • Any fallback serial generation defined by the upstream contract is ingest-side Reanimator behavior, not LOGPile exporter behavior.
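
One way to keep a missing serial absent in JSON is an `omitempty` tag on the serial field. The sketch below assumes a hypothetical `exportedCPU` type and illustrative field names (`socket`, `serial_number`); it is not the actual exporter code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// exportedCPU is a hypothetical export record. With omitempty, an empty
// serial is left out of the JSON instead of being replaced by a fake value.
type exportedCPU struct {
	Socket       string `json:"socket"`
	SerialNumber string `json:"serial_number,omitempty"`
}

// marshalCPU passes the parsed serial through unchanged (apart from
// trimming whitespace) and never generates a substitute.
func marshalCPU(socket, parsedSerial string) string {
	cpu := exportedCPU{Socket: socket, SerialNumber: strings.TrimSpace(parsedSerial)}
	b, err := json.Marshal(cpu)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshalCPU("CPU0", ""))        // serial_number key is absent
	fmt.Println(marshalCPU("CPU1", "SN12345")) // source-backed serial is kept
}
```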

Consequences:

  • Exported payloads carry only source-backed serial numbers.
  • Fake identifiers such as BOARD-...-CPU-... or synthetic PCIe serials are no longer considered acceptable exporter behavior.
  • Any future attempt to reintroduce generated serials requires an explicit contract review and a new ADL entry.

ADL-026 — Live Redfish collection uses explicit preflight host-power confirmation

Date: 2026-03-15 Context: Live Redfish inventory can be incomplete when the managed host is powered off. At the same time, LOGPile must not silently power on a host without explicit user choice. The collection workflow therefore needs a preflight step that verifies connectivity, shows current host power state to the user, and only powers on the host when the user explicitly chose that path.

Decision:

  • Add a dedicated live preflight API step before collection starts.
  • The UI first runs a connectivity and power-state check, then offers:
    • collect as-is
    • power on and collect
  • If the host is off and the user does not answer within 5 seconds, default to collecting without powering the host on.
  • Redfish collection may power on the host only when the request explicitly sets power_on_if_host_off=true.
  • When LOGPile powers on the host for collection, it must try to power the host back off after collection completes.
  • If LOGPile did not power the host on itself, it must never power the host off.
  • All preflight and power-control steps must be logged into the collection log and therefore into the raw-export bundle.
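
The power-control rules above reduce to a small decision table plus a timed default. The sketch below is illustrative only: `powerPlan`, `planPower`, and `awaitChoice` are hypothetical names, and modelling the operator's answer as a Go channel is an assumption (the decision describes the timeout as UI behavior).

```go
package main

import (
	"fmt"
	"time"
)

// defaultChoiceTimeout mirrors the 5-second default from the decision.
const defaultChoiceTimeout = 5 * time.Second

// powerPlan says which power actions the collector may take.
type powerPlan struct {
	PowerOnBefore bool
	PowerOffAfter bool
}

// planPower powers the host on only when it is off and the request asked
// for it, and powers it off afterwards only if we powered it on ourselves.
func planPower(hostOn, powerOnIfHostOff bool) powerPlan {
	if hostOn || !powerOnIfHostOff {
		return powerPlan{}
	}
	return powerPlan{PowerOnBefore: true, PowerOffAfter: true}
}

// awaitChoice returns the operator's answer, or false ("collect as-is")
// when no answer arrives before the timeout.
func awaitChoice(answers <-chan bool, timeout time.Duration) bool {
	select {
	case powerOn := <-answers:
		return powerOn
	case <-time.After(timeout):
		return false
	}
}

func main() {
	fmt.Printf("%+v\n", planPower(false, true))
	fmt.Printf("%+v\n", planPower(true, true))
	noAnswer := make(chan bool)
	// A short timeout keeps the demo fast; production would use defaultChoiceTimeout.
	fmt.Println(awaitChoice(noAnswer, 50*time.Millisecond))
}
```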

Consequences:

  • Live collection becomes a two-step UX: probe first, collect second.
  • Raw bundles preserve operator-visible evidence of power-state decisions and power-control attempts.
  • Power-on failures do not block collection entirely; they only downgrade completeness expectations.

ADL-027 — Sensors without numeric readings are not exported

Date: 2026-03-15 Context: Some parsed sensor records carry only a name, unit, or status, but no actual numeric reading. Such records are not useful as telemetry in Reanimator export and create noisy, low-value sensor lists.

Decision:

  • Do not export temperature, power, fan, or other sensor records unless they carry a real numeric measurement value.
  • Presence of a sensor name or health/status alone is not sufficient for export.
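
A minimal sketch of this filter, assuming a hypothetical `sensor` type in which a missing reading is modelled as a nil pointer. The real internal model may represent "no reading" differently.

```go
package main

import (
	"fmt"
	"math"
)

// sensor is a hypothetical parsed record; Value is nil when the source
// carried only a name, unit, or status.
type sensor struct {
	Name   string
	Unit   string
	Status string
	Value  *float64
}

// exportableSensors keeps only records with a finite numeric reading.
// A reading of 0 is still a real measurement and is kept.
func exportableSensors(in []sensor) []sensor {
	out := make([]sensor, 0, len(in))
	for _, s := range in {
		if s.Value == nil || math.IsNaN(*s.Value) || math.IsInf(*s.Value, 0) {
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	reading := 42.5
	sensors := []sensor{
		{Name: "CPU0_Temp", Unit: "C", Value: &reading},
		{Name: "PSU1_Status", Status: "OK"}, // status only: filtered out
	}
	for _, s := range exportableSensors(sensors) {
		fmt.Printf("%s=%.1f%s\n", s.Name, *s.Value, s.Unit)
	}
}
```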

Consequences:

  • Exported sensor groups contain only actionable telemetry.
  • Parsers and collectors may still keep non-numeric sensor artifacts internally for diagnostics, but Reanimator export must filter them out.

ADL-028 — Reanimator PCIe export excludes storage endpoints and synthetic serials

Date: 2026-03-15 Context: Some Redfish and archive sources expose NVMe drives both as storage inventory and as PCIe-visible endpoints. Exporting such drives in both hardware.storage and hardware.pcie_devices creates duplicates without adding useful topology value. At the same time, PCIe-class export still had old fallback behavior that generated synthetic serial numbers when source serials were absent.

Decision:

  • Export disks and NVMe drives only through hardware.storage.
  • Do not export storage endpoints as hardware.pcie_devices, even if the source inventory exposes them as PCIe/NVMe devices.
  • Keep real PCIe storage controllers such as RAID and HBA adapters in hardware.pcie_devices.
  • Do not synthesize PCIe/GPU/NIC serial numbers in LOGPile; missing serials stay absent.
  • Treat placeholder names such as Network Device View as non-authoritative and prefer resolved device names when stronger data exists.

Consequences:

  • Reanimator payloads no longer duplicate NVMe drives between storage and PCIe sections.
  • PCIe export remains topology-focused while storage export remains component-focused.
  • Missing PCIe-class serials no longer produce fake BOARD-...-PCIE-... identifiers.

ADL-029 — Local exporter guidance tracks upstream contract v2.7 terminology

Date: 2026-03-15 Context: The upstream Reanimator hardware ingest contract moved to v2.7 and clarified several points that matter for LOGPile documentation: ingest-side serial fallback rules, canonical PCIe addressing via slot, the optional event_logs section, and the shared manufactured_year_week field.

Decision:

  • Keep the local mirrored contract file as an exact copy of the upstream v2.7 document.
  • Describe CPU/PCIe serial fallback as Reanimator ingest behavior, not LOGPile exporter behavior.
  • Treat pcie_devices.slot as the canonical address on the LOGPile side as well; bdf may remain an internal fallback/dedupe key but is not serialized in the payload.
  • Export event_logs only from normalized parser/collector events that can be mapped to contract sources host / bmc / redfish without synthesizing message content.
  • Export manufactured_year_week only as a reliable passthrough when a parser/collector already extracted a valid YYYY-Www value.
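
The passthrough rule for manufactured_year_week can be sketched as a strict format check. The helper name `passthroughYearWeek` is hypothetical, and the pattern only validates the YYYY-Www shape with weeks 01 to 53; it does not check whether a given year actually has a week 53.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// yearWeekPattern accepts four digits, "-W", and a week number 01..53.
var yearWeekPattern = regexp.MustCompile(`^\d{4}-W(0[1-9]|[1-4]\d|5[0-3])$`)

// passthroughYearWeek returns the value unchanged when it is already a
// valid YYYY-Www string, and reports false otherwise so the exporter can
// leave the field out instead of guessing.
func passthroughYearWeek(raw string) (string, bool) {
	v := strings.TrimSpace(raw)
	if !yearWeekPattern.MatchString(v) {
		return "", false
	}
	return v, true
}

func main() {
	for _, raw := range []string{"2024-W07", "2024-W54", "0724"} {
		v, ok := passthroughYearWeek(raw)
		fmt.Printf("%q -> %q %v\n", raw, v, ok)
	}
}
```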

Consequences:

  • Local bible wording no longer conflicts with upstream contract terminology.
  • Reanimator payloads use contract-native PCIe addressing and no longer expose bdf as a parallel coordinate.
  • LOGPile event export remains strictly source-derived; internal warnings such as LOGPile analysis notes do not leak into Reanimator event_logs.

ADL-030 — Audit result rendering is delegated to embedded reanimator/chart

Date: 2026-03-16 Context: LOGPile already owns file upload, Redfish collection, archive parsing, normalization, and Reanimator export. Maintaining a second host-side audit renderer for the same data created presentation drift and duplicated UI logic.

Decision:

  • Use vendored reanimator/chart as the only audit result viewer.
  • Keep LOGPile responsible for service flows: upload, live collection, batch convert, raw export, Reanimator export, and parse-error reporting.
  • Render the current dataset by converting it to Reanimator JSON and passing that snapshot to embedded chart under /chart/current.

Consequences:

  • Reanimator JSON becomes the single presentation contract for the audit surface.
  • The host UI becomes a service shell around the viewer instead of maintaining its own field-by-field tabs.
  • internal/chart must be updated explicitly as a git submodule when the viewer changes.

ADL-031 — Redfish uses profile-driven acquisition and unified ingest entrypoints

Date: 2026-03-17 Context: Redfish collection had accumulated platform-specific probing in the shared collector path, while upload and raw-export replay still entered analysis through direct handler branches. This made vendor/model tuning harder to contain and increased regression risk when one topology needed a special acquisition strategy.

Decision:

  • Introduce internal/ingest.Service as the internal source-family entrypoint for archive parsing and Redfish raw replay.
  • Introduce internal/collector/redfishprofile/ for Redfish profile matching and modular hooks.
  • Split Redfish behavior into coordinated phases:
    • acquisition planning during live collection
    • analysis hooks during snapshot replay
  • Use score-based profile matching. If confidence is low, enter fallback acquisition mode and aggregate only safe additive profile probes.
  • Allow profile modules to provide bounded acquisition tuning hints such as crawl cap, prefetch behavior, and expensive post-probe toggles.
  • Allow profile modules to own model-specific CriticalPaths and bounded PlanBPaths so vendor recovery targets stop leaking into the collector core.
  • Expose Redfish profile matching as structured diagnostics during live collection: logs must contain all module scores, and collect job status must expose active modules for the UI.
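
The score-based selection with a low-confidence fallback might look roughly like the sketch below. Everything here is an assumption made for illustration: the `profile` type, the `SafeFallback` flag, and the threshold value of 60 do not come from the redfishprofile package.

```go
package main

import "fmt"

// minConfidence is an assumed threshold below which no profile is trusted.
const minConfidence = 60

// profile is a hypothetical match result for one profile module.
type profile struct {
	Name         string
	Score        int  // 0..100 match score
	SafeFallback bool // probes are additive and safe on unknown platforms
}

// selectProfiles returns the active module names and whether the collector
// entered fallback mode. With a confident match, only confident modules are
// active; otherwise only modules marked safe for fallback are aggregated.
func selectProfiles(candidates []profile) (active []string, fallback bool) {
	best := 0
	for _, p := range candidates {
		if p.Score > best {
			best = p.Score
		}
	}
	fallback = best < minConfidence
	for _, p := range candidates {
		if (!fallback && p.Score >= minConfidence) || (fallback && p.SafeFallback) {
			active = append(active, p.Name)
		}
	}
	return active, fallback
}

func main() {
	candidates := []profile{
		{Name: "msi", Score: 85},
		{Name: "supermicro", Score: 5},
		{Name: "generic", Score: 10, SafeFallback: true},
	}
	active, fallback := selectProfiles(candidates)
	fmt.Println(active, fallback)
}
```

Returning both the active module list and the fallback flag matches the diagnostics requirement above: the caller can log every score and surface the active modules in job status.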

Consequences:

  • Server handlers stop owning parser-vs-replay branching details directly.
  • Vendor/model-specific Redfish logic gets an explicit module boundary.
  • Unknown-vendor Redfish collection becomes slower but more complete by design.
  • Tactical Redfish fixes should move into profile modules instead of widening generic replay logic.
  • Repo-owned compact fixtures under internal/collector/redfishprofile/testdata/, derived from representative raw-export snapshots, are used to lock profile matching and acquisition tuning for known MSI and Supermicro-family shapes.

ADL-032 — MSI ghost GPU filter: exclude GPUs with temperature=0 on powered-on host

Date: 2026-03-18 Context: MSI/AMI BMC caches GPU inventory from the host via Host Interface (in-band). When GPUs are removed without a reboot the old entries remain in Chassis/GPU* and Systems/Self/Processors/GPU* with Status.Health: OK, State: Enabled. The BMC has no out-of-band mechanism to detect physical absence. A physically present GPU always reports an ambient temperature (>0°C) even when idle; a stale cached entry returns Reading: 0.

Decision:

  • Add EnableMSIGhostGPUFilter directive (enabled by MSI profile's refineAnalysis alongside EnableProcessorGPUFallback).
  • In collectGPUsFromProcessors: for each processor GPU, resolve its chassis path and read Chassis/GPU{n}/Sensors/GPU{n}_Temperature. If PowerState=On and Reading=0 → skip.
  • The filter applies only when the host is powered on; when the host is off, all temperatures read 0 and the signal is ambiguous.
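
The ghost-GPU rule is a two-input predicate. The sketch below uses a hypothetical helper name, `isGhostGPU`, and models a missing temperature sensor as a nil pointer; a missing sensor keeps the GPU, in line with the "include rather than wrongly exclude" default.

```go
package main

import "fmt"

// isGhostGPU reports whether a GPU entry should be skipped as a stale
// cached record: the host is powered on and the temperature reading is 0.
// A nil reading (sensor path not found) never marks the GPU as a ghost.
func isGhostGPU(hostPoweredOn bool, tempReading *float64) bool {
	if !hostPoweredOn || tempReading == nil {
		return false
	}
	return *tempReading == 0
}

func main() {
	zero, warm := 0.0, 41.0
	fmt.Println(isGhostGPU(true, &zero))  // powered on, reads 0: ghost
	fmt.Println(isGhostGPU(true, &warm))  // powered on, real reading: keep
	fmt.Println(isGhostGPU(false, &zero)) // host off: ambiguous, keep
	fmt.Println(isGhostGPU(true, nil))    // no sensor found: keep
}
```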

Consequences:

  • Ghost GPUs from previous hardware configurations no longer appear in the inventory.
  • Filter is MSI-profile-owned and does not affect HGX, Supermicro, or generic paths.
  • Any new MSI GPU chassis that uses a different temperature sensor path will bypass the filter (safe default: include rather than wrongly exclude).

ADL-033 — Reanimator export collected_at uses inventory LastModifiedTime with 30-day fallback

Date: 2026-03-18 Context: For Redfish sources the BMC Manager DateTime reflects when the BMC clock read the time, not when the hardware inventory was last known-good. InventoryData/Status.LastModifiedTime (AMI/MSI OEM endpoint) records the actual timestamp of the last successful host-pushed inventory cycle and is a better proxy for "when was this hardware configuration last confirmed".

Decision:

  • inferInventoryLastModifiedTime reads LastModifiedTime from the snapshot and sets AnalysisResult.InventoryLastModifiedAt.
  • reanimatorCollectedAt() in the exporter selects InventoryLastModifiedAt when it is set and no older than 30 days; otherwise falls back to CollectedAt.
  • Fallback rationale: inventory older than 30 days is likely from a long-running server with no recent reboot; using the actual collection date is more useful for the downstream consumer.
  • The inventory timestamp is also logged during replay and live collection for diagnostics.
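
The selection rule can be sketched as below. `pickCollectedAt` is a hypothetical stand-in, not the actual reanimatorCollectedAt() implementation, and measuring the 30-day age relative to the collection time (rather than the current wall clock) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// maxInventoryAge is the 30-day limit from the decision.
const maxInventoryAge = 30 * 24 * time.Hour

// day returns a fixed demo date shifted by n days.
func day(n int) time.Time {
	return time.Date(2026, 3, 18, 12, 0, 0, 0, time.UTC).AddDate(0, 0, n)
}

// pickCollectedAt prefers the inventory timestamp when it is set and not
// older than maxInventoryAge relative to the collection time; otherwise it
// falls back to the collection time.
func pickCollectedAt(inventory *time.Time, collected time.Time) time.Time {
	if inventory == nil || inventory.IsZero() {
		return collected
	}
	if collected.Sub(*inventory) > maxInventoryAge {
		return collected
	}
	return *inventory
}

func main() {
	fresh, stale := day(-10), day(-45)
	fmt.Println(pickCollectedAt(&fresh, day(0)).Format(time.RFC3339))
	fmt.Println(pickCollectedAt(&stale, day(0)).Format(time.RFC3339))
	fmt.Println(pickCollectedAt(nil, day(0)).Format(time.RFC3339))
}
```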

Consequences:

  • Reanimator export collected_at reflects the last confirmed inventory cycle on AMI/MSI BMCs.
  • On non-AMI BMCs or when InventoryData/Status is absent, behavior is unchanged.
  • If inventory is stale (>30 days), collection date is used as before.

ADL-034 — Redfish inventory invalidated before host power-on

Date: 2026-03-18 Context: When a host is powered on by the collector (power_on_if_host_off=true), the BMC still holds inventory from the previous boot. If hardware changed between shutdowns, the new boot will push fresh inventory — but only if the BMC accepts it (CRC mismatch triggers re-population). Without explicit invalidation, unchanged CRCs can cause the BMC to skip re-processing even after a hardware change.

Decision:

  • Before any power-on attempt, invalidateRedfishInventory POSTs to {systemPath}/Oem/Ami/Inventory/Crc with all groups zeroed (CPU, DIMM, PCIE, CERTIFICATES, SECUREBOOT).
  • Best-effort: a 404/405 response (non-AMI BMC) is logged and otherwise ignored; it does not fail the collection.
  • The invalidation is logged at INFO level and surfaced as a collect progress message.

Consequences:

  • On AMI/MSI BMCs: the next boot will push a full fresh inventory regardless of whether CRCs appear unchanged, eliminating ghost components from prior hardware configurations.
  • On non-AMI BMCs: the POST fails immediately because the endpoint does not exist, and nothing changes.
  • Invalidation runs only when power_on_if_host_off=true and host is confirmed off.

ADL-035 — Redfish hardware event log collection from Systems LogServices

Date: 2026-03-18 Context: Redfish BMCs expose event logs via LogServices/{svc}/Entries. On MSI/AMI this includes the IPMI SEL with hardware events (temperature, power, drive failures, etc.). Live collection previously collected only inventory/sensor snapshots; event history was unavailable in Reanimator.

Decision:

  • After tree-walk, fetch hardware log entries separately via collectRedfishLogEntries() (not part of tree-walk to avoid bloat).
  • Only Systems/{sys}/LogServices is queried — Managers LogServices (BMC audit/journal) are excluded.
  • Log services with Id/Name containing "audit", "journal", "bmc", "security", "manager", "debug" are skipped.
  • Entries older than 7 days (client-side filter) are discarded. Pages are followed until an out-of-window entry is found (assumes newest-first ordering, typical for BMCs).
  • Entries with EntryType: "Oem" or MessageId containing user/auth/login keywords are filtered as non-hardware.
  • Raw entries stored in rawPayloads["redfish_log_entries"] as []map[string]interface{}.
  • Parsed to models.Event in parseRedfishLogEntries() during replay — same path for live and offline.
  • Max 200 entries per log service, 500 total to limit BMC load.

Consequences:

  • Hardware event history (last 7 days) visible in Reanimator EventLogs section.
  • No impact on existing inventory pipeline or offline archive replay (archives without redfish_log_entries key silently skip parsing).
  • Adds extra HTTP requests during live collection (sequential, after tree-walk completes).
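
The ADL-035 limits (skipped service names, 7-day window, per-service and total caps) can be sketched as three small helpers. The function names are hypothetical, and case-insensitive keyword matching is an assumption; the keywords and numeric limits come from the decision.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const (
	logWindow     = 7 * 24 * time.Hour // entries older than this are discarded
	maxPerService = 200
	maxTotal      = 500
)

// skippedKeywords lists Id/Name fragments that mark non-hardware logs.
var skippedKeywords = []string{"audit", "journal", "bmc", "security", "manager", "debug"}

// skipLogService reports whether a log service Id or Name contains one of
// the skipped keywords (compared case-insensitively in this sketch).
func skipLogService(idOrName string) bool {
	lower := strings.ToLower(idOrName)
	for _, kw := range skippedKeywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}

// inWindow reports whether an entry of the given age is still collected.
func inWindow(age time.Duration) bool {
	return age <= logWindow
}

// serviceBudget returns how many entries may still be fetched from the
// next log service, given how many were collected so far in total.
func serviceBudget(collectedSoFar int) int {
	remaining := maxTotal - collectedSoFar
	if remaining < 0 {
		remaining = 0
	}
	if remaining > maxPerService {
		return maxPerService
	}
	return remaining
}

func main() {
	for _, name := range []string{"SEL", "AuditLog", "BMCJournal"} {
		fmt.Printf("%s skipped=%v\n", name, skipLogService(name))
	}
	fmt.Println(inWindow(3*24*time.Hour), inWindow(8*24*time.Hour))
	fmt.Println(serviceBudget(0), serviceBudget(450))
}
```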

ADL-036 — Redfish profile matching may use platform grammar hints beyond vendor strings

Date: 2026-03-25 Context: Some BMCs expose unusable Manufacturer / Model values (NULL, placeholders, or generic SoC names) while still exposing a stable platform-specific Redfish grammar: repeated member names, firmware inventory IDs, OEM action names, and target-path quirks. Matching only on vendor strings forced such systems into fallback mode even when the platform shape was consistent.

Decision:

  • Extend redfishprofile.MatchSignals with doc-derived hint tokens collected from discovery docs and replay snapshots.
  • Allow profile matchers to score on stable platform grammar such as:
    • collection member naming (outboardPCIeCard*, drive slot grammars)
    • firmware inventory member IDs
    • OEM action/type markers and linked target paths
  • During live collection, gather only lightweight extra hint collections needed for matching (NetworkInterfaces, NetworkAdapters, Drives, UpdateService/FirmwareInventory), not slow deep inventory branches.
  • Keep such profiles out of fallback aggregation unless they are proven safe as broad additive hints.

Consequences:

  • Platform-family profiles can activate even when vendor strings are absent or set to NULL.
  • Matching logic becomes more robust for OEM BMC implementations that differ mainly by Redfish grammar rather than by explicit vendor strings.
  • Live collection gains a small amount of extra discovery I/O to harvest stable member IDs, but avoids slow deep probes such as Assembly just for profile selection.

ADL-037 — easy-bee archives are parsed from the embedded bee-audit snapshot

Date: 2026-03-25 Context: reanimator-easy-bee support bundles already contain a normalized hardware snapshot in export/bee-audit.json plus supporting logs and techdump files. Rebuilding the same inventory from raw techdump/ files inside LOGPile would duplicate parser logic and create drift between the producer utility and archive importer.

Decision:

  • Add a dedicated easy_bee vendor parser for bee-support-*.tar.gz bundles.
  • Detect the bundle by manifest.txt (bee_version=...) plus export/bee-audit.json.
  • Parse the archive from the embedded snapshot first; treat techdump/ and runtime files as secondary context only.
  • Normalize snapshot-only fields needed by LOGPile, notably:
    • flatten hardware.sensors groups into []SensorReading
    • turn runtime issues/status into []Event
    • synthesize a board FRU entry when the snapshot does not include FRU data
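
The detection rule (manifest.txt containing bee_version=... plus export/bee-audit.json) can be sketched over an already-listed archive. This is an illustration under assumptions: `looksLikeEasyBee` is a hypothetical helper, the archive is represented as a path-to-content map, and paths are taken relative to the bundle root. The real parser's Detect() returns a score rather than a boolean.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// looksLikeEasyBee reports whether the file set contains both markers:
// a manifest.txt with a bee_version= line and export/bee-audit.json.
func looksLikeEasyBee(files map[string]string) bool {
	manifest, ok := files["manifest.txt"]
	if !ok {
		return false
	}
	if _, ok := files["export/bee-audit.json"]; !ok {
		return false
	}
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		if strings.HasPrefix(strings.TrimSpace(sc.Text()), "bee_version=") {
			return true
		}
	}
	return false
}

func main() {
	bundle := map[string]string{
		"manifest.txt":          "bee_version=1.2.3\nhost=example\n",
		"export/bee-audit.json": "{}",
	}
	fmt.Println(looksLikeEasyBee(bundle))
}
```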

Consequences:

  • LOGPile stays aligned with the schema emitted by reanimator-easy-bee.
  • Adding support required only a thin archive adapter instead of a full hardware parser.
  • If the upstream utility changes the embedded snapshot schema, the easy_bee adapter is the only place that must be updated.

ADL-038 — HPE AHS parser uses hybrid extraction instead of full zbb schema decoding

Date: 2026-03-30 Context: HPE iLO Active Health System exports (.ahs) are proprietary ABJR containers with gzip-compressed zbb payloads. The sample inventory data contains two practical signal families: printable SMBIOS/FRU-style strings and embedded Redfish JSON subtrees, especially for storage controllers and drives. Full zbb binary schema decoding is not documented and would add significant complexity before proving user value.

Decision: Support HPE AHS with a hybrid parser:

  • decode the outer ABJR container
  • gunzip embedded members when applicable
  • extract inventory from printable SMBIOS/FRU payloads
  • extract storage/controller/backplane details from embedded Redfish JSON objects
  • enrich firmware and PSU inventory from auxiliary package payloads such as bcert.pkg
  • do not attempt complete semantic decoding of the internal zbb record format

Consequences:

  • Parser reaches inventory-grade usefulness quickly for HPE .ahs uploads.
  • Storage inventory is stronger than text-only parsing because it reuses structured Redfish data when present.
  • Auxiliary package payloads can supply missing firmware/PSU fields even when the main SMBIOS-like blob is incomplete.
  • Future deeper zbb decoding can be added incrementally without replacing the current parser contract.

ADL-039 — Canonical inventory keeps DIMMs with unknown capacity when identity is known

Date: 2026-03-30 Context: Some sources, notably HPE iLO AHS SMBIOS-like blobs, expose installed DIMM identity (slot, serial, part number, manufacturer) but do not include capacity. The parser already extracts those modules into Hardware.Memory, but canonical device building and export previously dropped them because size_mb == 0. Decision: Treat a DIMM as installed inventory when present=true and it has identifying memory fields such as serial number or part number, even if size_mb is unknown. Consequences:

  • HPE AHS uploads now show real installed memory modules instead of hiding them.
  • Empty slots still stay filtered because they lack inventory identity or are marked absent.
  • Specification/export can include "size unknown" memory entries without inventing capacity data.
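
The ADL-039 keep rule can be sketched as a predicate. The `dimm` type and `isInstalledDIMM` name are hypothetical, and treating a present module with known capacity as installed (the pre-existing behavior) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// dimm is a hypothetical memory record; SizeMB is 0 when capacity is unknown.
type dimm struct {
	Slot       string
	Present    bool
	SizeMB     int
	Serial     string
	PartNumber string
}

// isInstalledDIMM keeps a module when it is present and either has a known
// capacity or carries identity (serial or part number).
func isInstalledDIMM(d dimm) bool {
	if !d.Present {
		return false
	}
	if d.SizeMB > 0 {
		return true
	}
	return strings.TrimSpace(d.Serial) != "" || strings.TrimSpace(d.PartNumber) != ""
}

func main() {
	modules := []dimm{
		{Slot: "DIMM1", Present: true, Serial: "ABC123"}, // size unknown, identity known
		{Slot: "DIMM2", Present: true},                   // no size, no identity
		{Slot: "DIMM3", Present: false, Serial: "OLD"},   // marked absent
	}
	for _, m := range modules {
		fmt.Printf("%s installed=%v\n", m.Slot, isInstalledDIMM(m))
	}
}
```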

ADL-040 — HPE Redfish normalization prefers chassis Devices/* over generic PCIe topology labels

Date: 2026-03-30 Context: HPE ProLiant Gen11 Redfish snapshots expose parallel inventory trees. Chassis/*/PCIeDevices/* is good for topology presence, but often reports only generic DeviceType values such as SingleFunction. Chassis/*/Devices/* carries the concrete slot label, richer device type, and product-vs-spare part identifiers for the same physical NIC/controller. Replay fallback over empty storage volume collections can also discover Volumes/Capabilities children, which are not real logical volumes.

Decision:

  • Treat Redfish SKU as a valid fallback for hardware.board.part_number when PartNumber is empty.
  • Ignore Volumes/Capabilities documents during logical-volume parsing.
  • Enrich Chassis/*/PCIeDevices/* entries with matching Chassis/*/Devices/* documents by serial/name/part identity.
  • Keep pcie.device_class semantic; do not replace it with model or part-number strings when Redfish exposes only generic topology labels.

Consequences:

  • HPE Redfish imports now keep the server SKU in hardware.board.part_number.
  • Empty volume collections no longer produce fake Capabilities volume records.
  • HPE PCIe inventory gets better slot labels like OCP 3.0 Slot 15 plus concrete classes such as LOM/NIC or SAS/SATA Storage Controller.
  • part_number remains available separately for model identity, without polluting the class field.

ADL-041 — Redfish replay drops topology-only PCIe noise classes from canonical inventory

Date: 2026-04-01 Context: Some Redfish BMCs, especially MSI/AMI GPU systems, expose a very wide PCIe topology tree under Chassis/*/PCIeDevices/*. Besides real endpoint devices, the replay sees bridge stages, CPU-side helper functions, IMC/mesh signal-processing nodes, USB/SPI side controllers, and GPU display-function duplicates reported as generic Display Device. Keeping all of them in hardware.pcie_devices pollutes downstream exports such as Reanimator and hides the actual endpoint inventory signal.

Decision:

  • Filter topology-only PCIe records during Redfish replay, not in the UI layer.
  • Drop PCIe entries with replay-resolved classes:
    • Bridge
    • Processor
    • SignalProcessingController
    • SerialBusController
  • Drop DisplayController entries when the source Redfish PCIe document is the generic MSI-style Description: "Display Device" duplicate.
  • Drop PCIe network endpoints when their PCIe functions already link to NetworkDeviceFunctions, because those devices are represented canonically in hardware.network_adapters.
  • When Systems/*/NetworkInterfaces/* links back to a chassis NetworkAdapter, match against the fully enriched chassis NIC identity to avoid creating a second ghost NIC row with the raw NetworkAdapter_* slot/name.
  • Treat generic Redfish object names such as NetworkAdapter_* and PCIeDevice_* as placeholder models and replace them from PCI IDs when a concrete vendor/device match exists.
  • Drop MSI-style storage service PCIe endpoints whose resolved device names are only Volume Management Device NVMe RAID Controller or PCIe Switch management endpoint; storage inventory already comes from the Redfish storage tree.
  • Normalize Ethernet-class NICs into the single exported class NetworkController; do not split EthernetController into a separate top-level inventory section.
  • Keep endpoint classes such as NetworkController, MassStorageController, and dedicated GPU inventory coming from hardware.gpus.
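
Several of the drop rules above are plain predicates on the replay-resolved record. The sketch below covers the class-based drops, the generic "Display Device" duplicate, NICs already linked to NetworkDeviceFunctions, and Ethernet class normalization. The `pcieDevice` type and its fields are hypothetical; the class names are taken from the list above.

```go
package main

import "fmt"

// topologyOnly lists replay-resolved classes that are dropped outright.
var topologyOnly = map[string]bool{
	"Bridge":                     true,
	"Processor":                  true,
	"SignalProcessingController": true,
	"SerialBusController":        true,
}

// pcieDevice is a hypothetical replay-resolved PCIe record.
type pcieDevice struct {
	Class                 string
	Description           string
	LinksNetworkFunctions bool // PCIe functions link to NetworkDeviceFunctions
}

// keepPCIe reports whether the record stays in the PCIe inventory.
func keepPCIe(d pcieDevice) bool {
	if topologyOnly[d.Class] {
		return false
	}
	if d.Class == "DisplayController" && d.Description == "Display Device" {
		return false
	}
	if d.LinksNetworkFunctions {
		return false // already represented as a network adapter
	}
	return true
}

// normalizeClass folds Ethernet-class NICs into NetworkController.
func normalizeClass(class string) string {
	if class == "EthernetController" {
		return "NetworkController"
	}
	return class
}

func main() {
	devices := []pcieDevice{
		{Class: "Bridge"},
		{Class: "DisplayController", Description: "Display Device"},
		{Class: "MassStorageController"},
		{Class: "EthernetController"},
	}
	for _, d := range devices {
		fmt.Printf("%s keep=%v exported=%s\n", d.Class, keepPCIe(d), normalizeClass(d.Class))
	}
}
```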

Consequences:

  • hardware.pcie_devices becomes closer to real endpoint inventory instead of raw PCIe topology.
  • Reanimator exports stop showing MSI bridge/processor/display duplicate noise.
  • Reanimator exports no longer duplicate the same MSI NIC as both PCIeDevice_* and NetworkAdapter_*.
  • Replay no longer creates extra NIC rows from Systems/NetworkInterfaces when the same adapter was already normalized from Chassis/NetworkAdapters.
  • MSI VMD / PCIe switch storage service endpoints no longer pollute PCIe inventory.
  • UI/Reanimator group all Ethernet NICs under the same NETWORKCONTROLLER section.
  • Canonical NIC inventory prefers resolved PCI product names over generic Redfish placeholder names.
  • The raw Redfish snapshot remains available in raw_payloads.redfish_tree for low-level troubleshooting if topology details are ever needed.

ADL-042 — xFusion file-export archives merge AppDump inventory with RTOS/Log snapshots

Date: 2026-04-04 Context: xFusion iBMC tar.gz exports expose the base inventory in AppDump/, but the most useful NIC and firmware details live elsewhere: NIC firmware/MAC snapshots in LogDump/netcard/netcard_info.txt and system firmware versions in RTOSDump/versioninfo/app_revision.txt. Parsing only AppDump/ left xFusion uploads detectable but incomplete for UI and Reanimator consumers.

Decision:

  • Treat xFusion file-export tar.gz bundles as a first-class archive parser input.
  • Merge OCP NIC identity from AppDump/card_manage/card_info with the latest per-slot snapshot from LogDump/netcard/netcard_info.txt to produce hardware.network_adapters.
  • Import system-level firmware from RTOSDump/versioninfo/app_revision.txt into hardware.firmware.
  • Allow FRU fallback from RTOSDump/versioninfo/fruinfo.txt when AppDump/FruData/fruinfo.txt is absent.

Consequences:

  • xFusion uploads now preserve NIC BDF, MAC, firmware, and serial identity in normalized output.
  • System firmware such as BIOS and iBMC versions survives xFusion file exports.
  • xFusion archives participate more reliably in canonical device/export flows without special UI cases.

ADL-043 — Extended HGX diagnostic plan-B is opt-in from the live collect form

Date: 2026-04-13 Context: Some Supermicro HGX Redfish targets expose slow or hanging component-chassis inventory collections during critical plan-B, especially under Chassis/HGX_* for Assembly, Accelerators, Drives, NetworkAdapters, and PCIeDevices. Default collection should not block operators on deep diagnostic retries that are useful mainly for troubleshooting. Decision: Keep the normal snapshot/replay path unchanged, but gate those heavy HGX component-chassis critical plan-B retries behind the existing live-collect debug_payloads flag, presented in the UI as "Сбор расширенных данных для диагностики" ("Collect extended diagnostic data"). Consequences:

  • Default live collection skips those heavy diagnostic plan-B retries and reaches replay faster.
  • Operators can explicitly opt into the slower diagnostic path when they need deeper collection.
  • The same user-facing toggle continues to enable extra debug payload capture for troubleshooting.