feat: Redfish hardware event log collection + MSI ghost GPU filter + inventory improvements

- Collect hardware event logs (last 7 days) from Systems and Managers/SEL LogServices
- Parse AMI raw IPMI dump messages into readable descriptions (Sensor_Type: Event_Type)
- Filter out audit/journal/non-hardware log services; only SEL from Managers
- MSI ghost GPU filter: exclude processor GPU entries with temperature=0 when host is powered on
- Reanimator collected_at uses InventoryData/Status.LastModifiedTime (30-day fallback)
- Invalidate Redfish inventory CRC groups before host power-on
- Log inventory LastModifiedTime age in collection logs
- Drop SecureBoot collection (SecureBootMode, SecureBootDatabases); not hardware inventory
- Add build version to UI footer via template
- Add MSI Redfish API reference doc to bible-local/docs/

ADL-032–ADL-035

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -822,3 +822,99 @@ special acquisition strategy.
- Repo-owned compact fixtures under `internal/collector/redfishprofile/testdata/`, derived from
  representative raw-export snapshots, are used to lock profile matching and acquisition tuning
  for known MSI and Supermicro-family shapes.

---

## ADL-032 — MSI ghost GPU filter: exclude GPUs with temperature=0 on powered-on host

**Date:** 2026-03-18
**Context:**
The MSI/AMI BMC caches GPU inventory pushed by the host over the in-band Host Interface. When
GPUs are removed without a reboot, the old entries remain in `Chassis/GPU*` and
`Systems/Self/Processors/GPU*` with `Status.Health: OK, State: Enabled`. The BMC has no
out-of-band mechanism to detect physical absence. A physically present GPU always reports
a nonzero temperature (>0°C) even when idle; a stale cached entry returns `Reading: 0`.

**Decision:**
- Add an `EnableMSIGhostGPUFilter` directive (enabled by the MSI profile's `refineAnalysis`
  alongside `EnableProcessorGPUFallback`).
- In `collectGPUsFromProcessors`: for each processor GPU, resolve its chassis path and read
  `Chassis/GPU{n}/Sensors/GPU{n}_Temperature`. If `PowerState=On` and `Reading=0`, skip the entry.
- The filter applies only when the host is powered on; when the host is off, all temperatures
  read 0 and the signal is ambiguous.

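The rule above can be sketched in Go roughly as follows. This is a minimal illustration, not the collector's actual code: `gpuCandidate`, `isGhostGPU`, and `filterGhostGPUs` are hypothetical names, and treating an unreadable sensor as "keep the GPU" is an assumption based on the safe default of including rather than wrongly excluding.

```go
package main

import "fmt"

// gpuCandidate is a hypothetical, simplified view of one processor GPU entry.
// The real collector types are not shown in this document.
type gpuCandidate struct {
	Name        string
	TempReading *float64 // nil when the chassis temperature sensor could not be read
}

// isGhostGPU reports whether a candidate should be skipped: the filter must be
// enabled, the host must be powered on, and the temperature reading must be 0.
func isGhostGPU(filterEnabled bool, hostPowerState string, g gpuCandidate) bool {
	if !filterEnabled || hostPowerState != "On" {
		return false // host off: every reading is 0, so the signal is ambiguous
	}
	if g.TempReading == nil {
		return false // assumption: unknown sensor path means include the GPU
	}
	return *g.TempReading == 0
}

// filterGhostGPUs returns the candidates that survive the ghost filter.
func filterGhostGPUs(filterEnabled bool, hostPowerState string, in []gpuCandidate) []gpuCandidate {
	out := make([]gpuCandidate, 0, len(in))
	for _, g := range in {
		if isGhostGPU(filterEnabled, hostPowerState, g) {
			continue
		}
		out = append(out, g)
	}
	return out
}

// temp is a small helper that returns a pointer to a reading.
func temp(v float64) *float64 { return &v }

func main() {
	gpus := []gpuCandidate{
		{Name: "GPU0", TempReading: temp(41)},
		{Name: "GPU1", TempReading: temp(0)},
		{Name: "GPU2"}, // sensor missing
	}
	for _, g := range filterGhostGPUs(true, "On", gpus) {
		fmt.Println(g.Name)
	}
}
```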
**Consequences:**
- Ghost GPUs from previous hardware configurations no longer appear in the inventory.
- Filter is MSI-profile-owned and does not affect HGX, Supermicro, or generic paths.
- Any new MSI GPU chassis that uses a different temperature sensor path will bypass the filter
  (safe default: include rather than wrongly exclude).

---

## ADL-033 — Reanimator export collected_at uses inventory LastModifiedTime with 30-day fallback

**Date:** 2026-03-18
**Context:**
For Redfish sources, the BMC Manager `DateTime` reflects the BMC clock at the moment it is read,
not when the hardware inventory was last known-good. `InventoryData/Status.LastModifiedTime`
(AMI/MSI OEM endpoint) records the actual timestamp of the last successful host-pushed
inventory cycle and is a better proxy for "when was this hardware configuration last confirmed".

**Decision:**
- `inferInventoryLastModifiedTime` reads `LastModifiedTime` from the snapshot and sets
  `AnalysisResult.InventoryLastModifiedAt`.
- `reanimatorCollectedAt()` in the exporter selects `InventoryLastModifiedAt` when it is set
  and no older than 30 days; otherwise it falls back to `CollectedAt`.
- Fallback rationale: inventory older than 30 days is likely from a long-running server with
  no recent reboot; using the actual collection date is more useful for the downstream consumer.
- The inventory timestamp is also logged during replay and live collection for diagnostics.

**Consequences:**
- Reanimator export `collected_at` reflects the last confirmed inventory cycle on AMI/MSI BMCs.
- On non-AMI BMCs or when `InventoryData/Status` is absent, behavior is unchanged.
- If inventory is stale (>30 days), collection date is used as before.

---

## ADL-034 — Redfish inventory invalidated before host power-on

**Date:** 2026-03-18
**Context:**
When a host is powered on by the collector (`power_on_if_host_off=true`), the BMC still holds
inventory from the previous boot. If hardware changed while the host was off, the new boot will
push fresh inventory, but only if the BMC accepts it (a CRC mismatch triggers re-population).
Without explicit invalidation, unchanged CRCs can cause the BMC to skip re-processing even after
a hardware change.

**Decision:**
- Before any power-on attempt, `invalidateRedfishInventory` POSTs to
  `{systemPath}/Oem/Ami/Inventory/Crc` with all groups zeroed (`CPU`, `DIMM`, `PCIE`,
  `CERTIFICATES`, `SECUREBOOT`).
- Best-effort: a 404/405 response (non-AMI BMC) is logged and otherwise ignored.
- The invalidation is logged at `INFO` level and surfaced as a collect progress message.

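A rough Go sketch of the best-effort invalidation follows. The JSON body shape (a flat object mapping each group name to `0`) is an assumption, because the real AMI request schema is not given here. The helper names `crcResetBody`, `unsupported`, and `invalidateInventory` are hypothetical, and authentication and logging are omitted.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// crcGroups lists the inventory groups named in the decision above.
var crcGroups = []string{"CPU", "DIMM", "PCIE", "CERTIFICATES", "SECUREBOOT"}

// crcResetBody builds a request body with every group zeroed.
// Assumption: a flat {"GROUP": 0} object; the real schema may differ.
func crcResetBody() ([]byte, error) {
	body := make(map[string]int, len(crcGroups))
	for _, g := range crcGroups {
		body[g] = 0
	}
	return json.Marshal(body)
}

// unsupported reports whether a status code means the endpoint does not exist
// on this BMC (non-AMI), which the best-effort policy tolerates.
func unsupported(status int) bool {
	return status == http.StatusNotFound || status == http.StatusMethodNotAllowed
}

// invalidateInventory POSTs the zeroed CRC groups. It returns true when the
// BMC accepted the request, false with a nil error when the endpoint is
// unsupported, and an error for any other failure.
func invalidateInventory(client *http.Client, baseURL, systemPath string) (bool, error) {
	body, err := crcResetBody()
	if err != nil {
		return false, err
	}
	url := baseURL + systemPath + "/Oem/Ami/Inventory/Crc"
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	switch {
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		return true, nil
	case unsupported(resp.StatusCode):
		return false, nil
	default:
		return false, fmt.Errorf("inventory invalidation: unexpected status %s", resp.Status)
	}
}

func main() {
	// Print the assumed request body without contacting a BMC.
	body, err := crcResetBody()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```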
**Consequences:**
- On AMI/MSI BMCs: the next boot will push a full fresh inventory regardless of whether
  CRCs appear unchanged, eliminating ghost components from prior hardware configurations.
- On non-AMI BMCs: the POST fails immediately (the endpoint does not exist) and nothing changes.
- Invalidation runs only when `power_on_if_host_off=true` and the host is confirmed off.

---

## ADL-035 — Redfish hardware event log collection from Systems LogServices

**Date:** 2026-03-18
**Context:** Redfish BMCs expose event logs via `LogServices/{svc}/Entries`. On MSI/AMI this includes the IPMI SEL with hardware events (temperature, power, drive failures, etc.). Live collection previously captured only inventory/sensor snapshots, so event history was unavailable in Reanimator.

**Decision:**
- After tree-walk, fetch hardware log entries separately via `collectRedfishLogEntries()` (not part of tree-walk to avoid bloat).
- Only `Systems/{sys}/LogServices` is queried; Managers LogServices (BMC audit/journal) are excluded.
- Log services with Id/Name containing "audit", "journal", "bmc", "security", "manager", "debug" are skipped.
- Entries older than 7 days are discarded by a client-side filter. Pages are followed until an out-of-window entry is found (this assumes newest-first ordering, which is typical for BMCs).
- Entries with `EntryType: "Oem"` or a `MessageId` containing user/auth/login keywords are filtered out as non-hardware.
- Raw entries are stored in `rawPayloads["redfish_log_entries"]` as `[]map[string]interface{}`.
- Raw entries are parsed to `models.Event` in `parseRedfishLogEntries()` during replay, so live and offline collection share the same path.
- At most 200 entries per log service and 500 in total are fetched, to limit BMC load.

**Consequences:**
- Hardware event history (last 7 days) is visible in the Reanimator `EventLogs` section.
- No impact on the existing inventory pipeline or offline archive replay (archives without a `redfish_log_entries` key silently skip parsing).
- Adds extra HTTP requests during live collection (sequential, after tree-walk completes).