feat: Redfish hardware event log collection + MSI ghost GPU filter + inventory improvements

- Collect hardware event logs (last 7 days) from Systems and Managers/SEL LogServices
- Parse AMI raw IPMI dump messages into readable descriptions (Sensor_Type: Event_Type)
- Filter out audit/journal/non-hardware log services; only SEL from Managers
- MSI ghost GPU filter: exclude processor GPU entries with temperature=0 when host is powered on
- Reanimator collected_at uses InventoryData/Status.LastModifiedTime (30-day fallback)
- Invalidate Redfish inventory CRC groups before host power-on
- Log inventory LastModifiedTime age in collection logs
- Drop SecureBoot collection (SecureBootMode, SecureBootDatabases) — not hardware inventory
- Add build version to UI footer via template
- Add MSI Redfish API reference doc to bible-local/docs/

ADL-032–ADL-035

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mikhail Chusavitin
2026-03-18 23:47:22 +03:00
parent 30409eef67
commit 96e65d8f65
15 changed files with 989 additions and 13 deletions


@@ -822,3 +822,99 @@ special acquisition strategy.
- Repo-owned compact fixtures under `internal/collector/redfishprofile/testdata/`, derived from
representative raw-export snapshots, are used to lock profile matching and acquisition tuning
for known MSI and Supermicro-family shapes.
---
## ADL-032 — MSI ghost GPU filter: exclude GPUs with temperature=0 on powered-on host
**Date:** 2026-03-18
**Context:**
MSI/AMI BMC caches GPU inventory from the host via Host Interface (in-band). When GPUs are
removed without a reboot the old entries remain in `Chassis/GPU*` and
`Systems/Self/Processors/GPU*` with `Status.Health: OK, State: Enabled`. The BMC has no
out-of-band mechanism to detect physical absence. A physically present GPU always reports
an ambient temperature (>0°C) even when idle; a stale cached entry returns `Reading: 0`.
**Decision:**
- Add `EnableMSIGhostGPUFilter` directive (enabled by MSI profile's `refineAnalysis`
alongside `EnableProcessorGPUFallback`).
- In `collectGPUsFromProcessors`: for each processor GPU, resolve its chassis path and read
`Chassis/GPU{n}/Sensors/GPU{n}_Temperature`. If `PowerState=On` and `Reading=0` → skip.
- Filter only applies when host is powered on; when host is off all temperatures are 0 and
the signal is ambiguous.
**Consequences:**
- Ghost GPUs from previous hardware configurations no longer appear in the inventory.
- Filter is MSI-profile-owned and does not affect HGX, Supermicro, or generic paths.
- Any new MSI GPU chassis that uses a different temperature sensor path will bypass the filter
(safe default: include rather than wrongly exclude).
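The skip rule above can be sketched roughly as follows. This is a minimal illustration, not the actual `collectGPUsFromProcessors` code: the `gpuCandidate` type and the `isGhostGPU`, `filterGhostGPUs`, and `reading` helpers are hypothetical names, and the sketch assumes the chassis temperature sensor has already been resolved into an optional reading.

```go
package main

import "fmt"

// gpuCandidate is a hypothetical, simplified view of one processor GPU entry.
type gpuCandidate struct {
	ID          string
	TempReading *float64 // nil when the temperature sensor could not be read
}

// isGhostGPU reports whether an entry looks like a stale cached GPU.
func isGhostGPU(hostPoweredOn bool, g gpuCandidate) bool {
	if !hostPoweredOn {
		// Host off: every temperature reads 0, so the signal is ambiguous.
		return false
	}
	if g.TempReading == nil {
		// Sensor path missing or different: include rather than wrongly exclude.
		return false
	}
	return *g.TempReading == 0
}

// filterGhostGPUs returns the entries that survive the ghost filter.
func filterGhostGPUs(hostPoweredOn bool, gpus []gpuCandidate) []gpuCandidate {
	kept := make([]gpuCandidate, 0, len(gpus))
	for _, g := range gpus {
		if isGhostGPU(hostPoweredOn, g) {
			continue
		}
		kept = append(kept, g)
	}
	return kept
}

// reading is a small helper that returns a pointer to a sensor value.
func reading(v float64) *float64 { return &v }

func main() {
	gpus := []gpuCandidate{
		{ID: "GPU0", TempReading: reading(34)},
		{ID: "GPU1", TempReading: reading(0)}, // stale cached entry
		{ID: "GPU2"},                          // sensor not found
	}
	for _, g := range filterGhostGPUs(true, gpus) {
		fmt.Println(g.ID)
	}
}
```

Modelling the reading as a pointer keeps "sensor missing" distinct from "sensor reads 0", which is what makes the safe default in the last consequence possible.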
---
## ADL-033 — Reanimator export collected_at uses inventory LastModifiedTime with 30-day fallback
**Date:** 2026-03-18
**Context:**
For Redfish sources the BMC Manager `DateTime` reports the BMC's current clock time, not
when the hardware inventory was last known-good. `InventoryData/Status.LastModifiedTime`
(AMI/MSI OEM endpoint) records the actual timestamp of the last successful host-pushed
inventory cycle and is a better proxy for "when was this hardware configuration last confirmed".
**Decision:**
- `inferInventoryLastModifiedTime` reads `LastModifiedTime` from the snapshot and sets
`AnalysisResult.InventoryLastModifiedAt`.
- `reanimatorCollectedAt()` in the exporter selects `InventoryLastModifiedAt` when it is set
and no older than 30 days; otherwise falls back to `CollectedAt`.
- Fallback rationale: inventory older than 30 days is likely from a long-running server with
no recent reboot; using the actual collection date is more useful for the downstream consumer.
- The inventory timestamp is also logged during replay and live collection for diagnostics.
**Consequences:**
- Reanimator export `collected_at` reflects the last confirmed inventory cycle on AMI/MSI BMCs.
- On non-AMI BMCs or when `InventoryData/Status` is absent, behavior is unchanged.
- If inventory is stale (>30 days), collection date is used as before.
---
## ADL-034 — Redfish inventory invalidated before host power-on
**Date:** 2026-03-18
**Context:**
When a host is powered on by the collector (`power_on_if_host_off=true`), the BMC still holds
inventory from the previous boot. If hardware changed between shutdowns, the new boot will push
fresh inventory — but only if the BMC accepts it (CRC mismatch triggers re-population). Without
explicit invalidation, unchanged CRCs can cause the BMC to skip re-processing even after a
hardware change.
**Decision:**
- Before any power-on attempt, `invalidateRedfishInventory` POSTs to
`{systemPath}/Oem/Ami/Inventory/Crc` with all groups zeroed (`CPU`, `DIMM`, `PCIE`,
`CERTIFICATES`, `SECUREBOOT`).
- Best-effort: a 404/405 response (non-AMI BMC) is logged and otherwise ignored.
- The invalidation is logged at `INFO` level and surfaced as a collect progress message.
**Consequences:**
- On AMI/MSI BMCs: the next boot will push a full fresh inventory regardless of whether
CRCs appear unchanged, eliminating ghost components from prior hardware configurations.
- On non-AMI BMCs: the POST fails immediately (endpoint does not exist) and nothing changes.
- Invalidation runs only when `power_on_if_host_off=true` and host is confirmed off.
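The best-effort handling of the response could look something like the sketch below. `interpretInvalidateStatus` is a hypothetical helper, not the real `invalidateRedfishInventory`; only the 404/405 case comes from the decision above. Treating 2xx as success and any other status as an error is an assumption, and the POST body is deliberately left out because its exact shape is not described here.

```go
package main

import (
	"fmt"
	"net/http"
)

// interpretInvalidateStatus maps the HTTP status of the CRC invalidation POST
// to an outcome. applied is true only when the BMC accepted the request.
func interpretInvalidateStatus(status int) (applied bool, err error) {
	switch {
	case status == http.StatusNotFound || status == http.StatusMethodNotAllowed:
		// Non-AMI BMC: endpoint is absent. Log it, then carry on with power-on.
		return false, nil
	case status >= 200 && status < 300:
		return true, nil
	default:
		// Assumption: other statuses are reported to the caller.
		return false, fmt.Errorf("inventory invalidation: unexpected HTTP status %d", status)
	}
}

func main() {
	for _, code := range []int{204, 404, 405, 500} {
		applied, err := interpretInvalidateStatus(code)
		fmt.Printf("status=%d applied=%v err=%v\n", code, applied, err)
	}
}
```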
---
## ADL-035 — Redfish hardware event log collection from Systems LogServices
**Date:** 2026-03-18
**Context:**
Redfish BMCs expose event logs via `LogServices/{svc}/Entries`. On MSI/AMI this includes the IPMI
SEL with hardware events (temperature, power, drive failures, etc.). Live collection previously
collected only inventory/sensor snapshots; event history was unavailable in Reanimator.
**Decision:**
- After tree-walk, fetch hardware log entries separately via `collectRedfishLogEntries()` (not part of tree-walk to avoid bloat).
- Only `Systems/{sys}/LogServices` is queried — Managers LogServices (BMC audit/journal) are excluded.
- Log services with Id/Name containing "audit", "journal", "bmc", "security", "manager", "debug" are skipped.
- Entries older than 7 days (client-side filter) are discarded. Pages are followed until an out-of-window entry is found (assumes newest-first ordering, typical for BMCs).
- Entries with `EntryType: "Oem"` or a `MessageId` containing user/auth/login keywords are filtered out as non-hardware.
- Raw entries stored in `rawPayloads["redfish_log_entries"]` as `[]map[string]interface{}`.
- Parsed to `models.Event` in `parseRedfishLogEntries()` during replay — same path for live and offline.
- At most 200 entries per log service and 500 in total are fetched, to limit BMC load.
**Consequences:**
- Hardware event history (last 7 days) visible in Reanimator `EventLogs` section.
- No impact on existing inventory pipeline or offline archive replay (archives without `redfish_log_entries` key silently skip parsing).
- Adds extra HTTP requests during live collection (sequential, after tree-walk completes).