export: align reanimator and enrich redfish metrics
@@ -299,4 +299,159 @@ not the firmware inventory ID. The patterns were dead code from the moment they
correct prefix+digit checks (`"gpu" + digit`, `"nic" + digit`) and explicit string checks
(`"nvmecontroller"`, `"power supply"`, `"software inventory"`).

## ADL-020 — Dell TSR device-bound firmware filtered via FQDD; InfiniBand routed to NetworkAdapters

**Date:** 2026-03-15
**Context:** Dell TSR `sysinfo_DCIM_SoftwareIdentity.xml` lists firmware for every installed
component. `parseSoftwareIdentityXML` dumped all of these into `hardware.firmware` without
filtering, so device-bound entries such as `"Mellanox Network Adapter"` (FQDD `InfiniBand.Slot.1-1`)
and `"PERC H755 Front"` (FQDD `RAID.SL.3-1`) appeared in the reanimator export alongside system
firmware like BIOS and iDRAC. Confirmed on PowerEdge R6625 (8VS2LG4).

Additionally, `DCIM_InfiniBandView` was not handled in the parser switch, so Mellanox ConnectX-6
appeared only as a PCIe device with `model: "16x or x16"` (from `DataBusWidth` fallback).
`parseControllerView` called `addFirmware` with description `"storage controller"` instead of the
FQDD, so the FQDD-based filter in the exporter could not remove it.

**Decision:**
1. `isDeviceBoundFirmwareFQDD` extended with `"infiniband."` and `"fc."` prefixes; `"raid.backplane."`
   broadened to `"raid."` to cover `RAID.SL.*`, `RAID.Integrated.*`, etc.
2. `DCIM_InfiniBandView` routed to `parseNICView` → device appears as `NetworkAdapter` with correct
   firmware, MAC address, and VendorID/DeviceID.
3. `"InfiniBand."` added to `pcieFQDDNoisePrefix` to suppress the duplicate `DCIM_PCIDeviceView`
   entry (DataBusWidth-only, no useful data).
4. `parseControllerView` now passes `fqdd` as the `addFirmware` description so the FQDD filter
   removes the entry in the exporter.
5. `parsePCIeDeviceView` now prioritises `props["description"]` (chip model, e.g. `"MT28908 Family
   [ConnectX-6]"`) over `props["devicedescription"]` (location string) for `pcie.Description`.
6. `convertPCIeDevices` model fallback order: `PartNumber → Description → DeviceClass`.
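The prefix filter from item 1 can be pictured roughly as follows. This is a minimal sketch, not the code from this commit: the prefix slice `deviceBoundFQDDPrefixes`, the case-insensitive matching, and the sample FQDD `BIOS.Setup.1-1` are assumptions made for illustration, and the real list may hold more prefixes than the three named here.

```go
package main

import (
	"fmt"
	"strings"
)

// deviceBoundFQDDPrefixes is a hypothetical prefix list holding only the
// prefixes named in this decision; the real list may be longer.
var deviceBoundFQDDPrefixes = []string{
	"infiniband.",
	"fc.",
	"raid.",
}

// isDeviceBoundFirmwareFQDD reports whether an FQDD names a device
// (NIC, HBA, RAID controller) rather than system-level firmware.
// Lower-casing before the comparison is an assumption: the decision lists
// lowercase prefixes while Dell FQDDs are mixed case.
func isDeviceBoundFirmwareFQDD(fqdd string) bool {
	lower := strings.ToLower(strings.TrimSpace(fqdd))
	for _, prefix := range deviceBoundFQDDPrefixes {
		if strings.HasPrefix(lower, prefix) {
			return true
		}
	}
	return false
}

func main() {
	for _, fqdd := range []string{"InfiniBand.Slot.1-1", "RAID.SL.3-1", "BIOS.Setup.1-1"} {
		fmt.Printf("%-22s device-bound=%v\n", fqdd, isDeviceBoundFirmwareFQDD(fqdd))
	}
}
```

Broadening `"raid.backplane."` to `"raid."` trades precision for coverage: any FQDD in the RAID namespace is now treated as device-bound, which is what lets `RAID.SL.3-1` match.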

**Consequences:**
- `hardware.firmware` contains only system-level entries; NIC/RAID/storage-controller firmware
  lives on the respective device record.
- `TestParseDellInfiniBandView` and `TestIsDeviceBoundFirmwareFQDD` guard the regression.
- Any future Dell TSR device class whose FQDD prefix is not yet in the prefix list may still leak;
  extend `isDeviceBoundFirmwareFQDD` and add a test case when encountered.

---

## ADL-021 — pci.ids enrichment: chip model and vendor resolved from PCI IDs when source data is generic or missing

**Date:** 2026-03-15
**Context:**
Dell TSR `DCIM_InfiniBandView.ProductName` reports a generic marketing name ("Mellanox Network
Adapter") instead of the precise chip identifier ("MT28908 Family [ConnectX-6]"). The actual
chip model is available in `pci.ids` by VendorID:DeviceID (15B3:101B). Vendor name may also be
absent when no `VendorName` / `Manufacturer` property is present.

The general rule was established: *if model is not found in source data but PCI IDs are known,
resolve model from `pci.ids`*. This rule applies broadly across all export paths.

**Decision (two-layer enrichment):**
1. **Parser layer (Dell, `parseNICView`):** When `VendorID != 0 && DeviceID != 0`, prefer
   `pciids.DeviceName(vendorID, deviceID)` over the product name from logs. This makes the chip
   identifier the primary model for NIC/InfiniBand adapters (more specific than marketing name).
   Fill `Vendor` from `pciids.VendorName(vendorID)` when the vendor field is otherwise empty.
   Same fallback applied in `parsePCIeDeviceView` for empty `Description`.
2. **Exporter layer (`convertPCIeFromDevices`):** General rule: when `d.Model == ""` after all
   legacy fallbacks and `VendorID != 0 && DeviceID != 0`, set `model = pciids.DeviceName(...)`.
   Also fill empty `manufacturer` from `pciids.VendorName(...)`. This covers all parsers/sources.
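The two layers differ in when the pci.ids name wins, which a small sketch may make clearer. Everything below is illustrative: the `device` struct, the in-memory lookup tables, and the helpers `lookupDeviceName` / `lookupVendorName` are hypothetical stand-ins for the project's `pciids` package, whose actual signatures are not shown in this log.

```go
package main

import "fmt"

// pciKey identifies a device by PCI vendor and device ID.
type pciKey struct{ vendor, device uint16 }

// Stand-in tables for this sketch only; the real lookup reads pci.ids.
var (
	vendorNames = map[uint16]string{0x15b3: "Mellanox Technologies"}
	deviceNames = map[pciKey]string{{0x15b3, 0x101b}: "MT28908 Family [ConnectX-6]"}
)

// Both helpers return "" when pci.ids has no entry.
func lookupVendorName(vendorID uint16) string { return vendorNames[vendorID] }
func lookupDeviceName(vendorID, deviceID uint16) string {
	return deviceNames[pciKey{vendorID, deviceID}]
}

// device is a hypothetical, reduced device record.
type device struct {
	Model, Manufacturer string
	VendorID, DeviceID  uint16
}

// enrichInParser applies the parser-layer rule: a pci.ids name replaces the
// product name, but the product name is kept when the lookup returns "".
func enrichInParser(d *device) {
	if d.VendorID == 0 || d.DeviceID == 0 {
		return
	}
	if name := lookupDeviceName(d.VendorID, d.DeviceID); name != "" {
		d.Model = name
	}
	if d.Manufacturer == "" {
		d.Manufacturer = lookupVendorName(d.VendorID)
	}
}

// enrichInExporter applies the exporter-layer rule: only empty fields are filled.
func enrichInExporter(d *device) {
	if d.VendorID == 0 || d.DeviceID == 0 {
		return
	}
	if d.Model == "" {
		d.Model = lookupDeviceName(d.VendorID, d.DeviceID)
	}
	if d.Manufacturer == "" {
		d.Manufacturer = lookupVendorName(d.VendorID)
	}
}

func main() {
	nic := device{Model: "Mellanox Network Adapter", VendorID: 0x15b3, DeviceID: 0x101b}
	enrichInParser(&nic)
	fmt.Printf("parser layer:   %+v\n", nic)

	bare := device{VendorID: 0x15b3, DeviceID: 0x101b}
	enrichInExporter(&bare)
	fmt.Printf("exporter layer: %+v\n", bare)
}
```

The parser layer overrides a non-empty marketing name, while the exporter layer never overwrites a model that a parser already set, so the two rules do not fight each other.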

**Consequences:**
- Mellanox InfiniBand slot now reports `model: "MT28908 Family [ConnectX-6]"` and
  `manufacturer: "Mellanox Technologies"` in the reanimator export.
- For NICs where pci.ids has no entry, the original product name is kept (pci.ids returns "").
- `TestParseDellInfiniBandView` asserts the model and vendor from pci.ids.

---

## ADL-022 — CPUAffinity parsed into NUMANode for PCIe, NIC, and controller devices

**Date:** 2026-03-15
**Context:**
Dell TSR DCIM view classes report `CPUAffinity` for NIC, InfiniBand, PCIe, and controller
devices. Values are "1", "2" (NUMA node index), or "Not Applicable" (for devices that bridge
both CPUs or have no CPU affinity). This data is needed for topology-aware diagnostics.

**Decision:**
- Add `NUMANode int` (JSON: `"numa_node,omitempty"`) to `models.PCIeDevice`,
  `models.NetworkAdapter`, `models.HardwareDevice`, and `ReanimatorPCIe`.
- Parse from `props["cpuaffinity"]` using `parseIntLoose`: numeric values ("1", "2") map
  directly; "Not Applicable" returns 0 (omitted via `omitempty`).
- Thread through `buildDevicesFromLegacy` (PCIe and NIC sections) and `convertPCIeFromDevices`.
- `parseControllerView` also parses CPUAffinity since RAID controllers have NUMA affinity.
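The interaction between a lenient integer parse and `omitempty` can be sketched as below. The body of `parseIntLoose` is a guess at its behaviour based on this decision (non-numeric input yields 0); the `pcieDevice` struct and its `slot` field are hypothetical and stand in for the real model types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// parseIntLoose is an assumed implementation: it returns the parsed integer,
// or 0 for anything that is not a plain integer (e.g. "Not Applicable").
func parseIntLoose(s string) int {
	n, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil {
		return 0
	}
	return n
}

// pcieDevice is a reduced, hypothetical record carrying only what the sketch needs.
type pcieDevice struct {
	Slot     string `json:"slot"`
	NUMANode int    `json:"numa_node,omitempty"`
}

func main() {
	for _, raw := range []string{"1", "2", "Not Applicable"} {
		d := pcieDevice{Slot: "Slot.1", NUMANode: parseIntLoose(raw)}
		out, err := json.Marshal(d)
		if err != nil {
			panic(err)
		}
		// With omitempty, a zero NUMANode drops the numa_node key entirely.
		fmt.Printf("%q -> %s\n", raw, out)
	}
}
```

Because Dell reports node indexes starting at "1", using 0 as the "not reported" sentinel loses nothing for this source; a source that numbers NUMA nodes from 0 would need a different representation, such as a pointer field.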

**Consequences:**
- `numa_node: 1` or `2` appears in reanimator export for devices with known affinity.
- Value 0 / absent means "not reported"; this covers both "Not Applicable" and sources that don't
  provide CPUAffinity at all.
- `TestParseDellCPUAffinity` verifies that numeric values are parsed correctly and that
  "Not Applicable" maps to 0.

---

## ADL-023 — Reanimator export must match ingest contract exactly

**Date:** 2026-03-15
**Context:**
LOGPile's Reanimator export had drifted from the strict ingest contract. It emitted fields that
Reanimator does not currently accept (`status_at_collection`, `numa_node`),
while missing fields and sections now present in the contract (`hardware.sensors`,
`pcie_devices[].mac_addresses`). Memory export rules also diverged from the ingest side: empty or
serial-less DIMMs were still exported.

**Decision:**
- Treat the Reanimator ingest contract as the authoritative schema for `GET /api/export/reanimator`.
- Emit only fields present in the current upstream contract revision.
- Add `hardware.sensors`, `pcie_devices[].mac_addresses`, `pcie_devices[].numa_node`, and
  upstream-approved component telemetry/health fields.
- Leave out fields that are still not part of the upstream contract.
- Map internal `source_type=archive` to external `source_type=logfile`.
- Skip memory entries that are empty, not present, or missing serial numbers.
- Generate CPU and PCIe serials only in the forms allowed by the contract.
- Mirror the applied contract in `bible-local/docs/hardware-ingest-contract.md`.
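Two of the rules above are mechanical enough to sketch: the memory filter and the `source_type` mapping. The `dimm` struct, its field names, and the reading of "empty" as a zero size are assumptions for illustration only; the mirrored contract document defines the real conditions.

```go
package main

import (
	"fmt"
	"strings"
)

// dimm is a hypothetical, reduced memory record.
type dimm struct {
	Slot         string
	Present      bool
	SizeMB       int
	SerialNumber string
}

// exportableDIMMs drops entries that are not present, empty (assumed here to
// mean SizeMB == 0), or missing a serial number.
func exportableDIMMs(in []dimm) []dimm {
	out := make([]dimm, 0, len(in))
	for _, d := range in {
		if !d.Present || d.SizeMB == 0 || strings.TrimSpace(d.SerialNumber) == "" {
			continue
		}
		out = append(out, d)
	}
	return out
}

// externalSourceType maps the internal "archive" value to the contract's
// "logfile"; other values pass through unchanged in this sketch.
func externalSourceType(internal string) string {
	if internal == "archive" {
		return "logfile"
	}
	return internal
}

func main() {
	dimms := []dimm{
		{Slot: "A1", Present: true, SizeMB: 32768, SerialNumber: "SN-EXAMPLE-1"},
		{Slot: "A2", Present: false},
		{Slot: "A3", Present: true, SizeMB: 32768, SerialNumber: ""},
	}
	for _, d := range exportableDIMMs(dimms) {
		fmt.Println("export", d.Slot)
	}
	fmt.Println("source_type:", externalSourceType("archive"))
}
```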

**Consequences:**
- Some previously exported diagnostic fields are intentionally dropped from the Reanimator payload
  until the upstream contract adds them.
- Internal models may retain richer fields than the current export schema.
- `hardware.devices` is canonical only after merge with legacy hardware slices; partial parser-owned
  canonical records must not hide CPUs, memory, storage, NICs, or PSUs still stored in legacy
  fields.
- CSV and Reanimator exports must use the same merged canonical inventory to avoid divergent export
  contents across surfaces.
- Future exporter changes must update both the code and the mirrored contract document together.

---

## ADL-024 — Component presence is implicit; Redfish linked metrics are part of replay correctness

**Date:** 2026-03-15
**Context:**
The upstream ingest contract allows `present`, but current export semantics do not need to send
`present=true` for populated components. At the same time, several important Redfish component
telemetry fields were only available through linked metric resources such as `ProcessorMetrics`,
`MemoryMetrics`, and `DriveMetrics`. Without collecting and replaying these linked documents,
live collection and raw snapshot replay still underreported component health fields.

**Decision:**
- Do not serialize `present=true` in Reanimator export. Presence is represented by the presence of
  the component record itself.
- Do not export component records marked `present=false`.
- Interpret CPU `firmware` in Reanimator payload as CPU microcode.
- Treat Redfish linked metric resources `ProcessorMetrics`, `MemoryMetrics`, `DriveMetrics`,
  `EnvironmentMetrics`, and generic `Metrics` as part of analyzer correctness when they are linked
  from component resources.
- Replay logic must merge these linked metric resources back into CPU, memory, storage, PCIe, GPU,
  NIC, and PSU component `Details` the same way live collection expects them to be used.
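A rough sketch of the two ideas, implicit presence and merging linked metrics into `Details`, follows. The `component` and `exportedComponent` types, the JSON keys `slot` and `details`, the sample metric key, and the "existing keys win" merge policy are all assumptions; the actual merge rules live in the collector and replay code, not in this log.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// component is a hypothetical internal record.
type component struct {
	Slot    string
	Present bool
	Details map[string]any
}

// exportedComponent deliberately has no "present" field: a record that is
// exported is present by definition.
type exportedComponent struct {
	Slot    string         `json:"slot"`
	Details map[string]any `json:"details,omitempty"`
}

// mergeLinkedMetrics copies values from a linked metric resource (for example
// a ProcessorMetrics document) into Details without overwriting existing keys.
// The "existing keys win" policy is an assumption of this sketch.
func mergeLinkedMetrics(c *component, metrics map[string]any) {
	if c.Details == nil {
		c.Details = map[string]any{}
	}
	for key, value := range metrics {
		if _, exists := c.Details[key]; !exists {
			c.Details[key] = value
		}
	}
}

// exportComponents skips records marked not present and emits the rest
// without any presence flag.
func exportComponents(in []component) []exportedComponent {
	out := make([]exportedComponent, 0, len(in))
	for _, c := range in {
		if !c.Present {
			continue
		}
		out = append(out, exportedComponent{Slot: c.Slot, Details: c.Details})
	}
	return out
}

func main() {
	cpus := []component{{Slot: "CPU1", Present: true}, {Slot: "CPU2", Present: false}}
	mergeLinkedMetrics(&cpus[0], map[string]any{"TemperatureCelsius": 54})

	out, err := json.Marshal(exportComponents(cpus))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Running the same merge function from both live collection and snapshot replay is one way to keep the two paths from diverging, which is the concern this decision raises.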

**Consequences:**
- Reanimator payloads are smaller and avoid redundant `present=true` noise while still excluding
  empty slots and absent components.
- Any future exporter change that reintroduces serialized component presence needs an explicit
  contract review.
- Raw Redfish snapshot completeness now includes linked per-component metric resources, not only
  top-level inventory members.
- CPU microcode is no longer expected in top-level `hardware.firmware`; it belongs on the CPU
  component record.

<!-- Add new decisions below this line using the format above -->