export: align reanimator and enrich redfish metrics

Mikhail Chusavitin
2026-03-15 21:38:28 +03:00
parent 0acdc2b202
commit 9007f1b360
17 changed files with 3756 additions and 650 deletions

@@ -299,4 +299,159 @@ not the firmware inventory ID. The patterns were dead code from the moment they
correct prefix+digit checks (`"gpu" + digit`, `"nic" + digit`) and explicit string checks
(`"nvmecontroller"`, `"power supply"`, `"software inventory"`).
## ADL-020 — Dell TSR device-bound firmware filtered via FQDD; InfiniBand routed to NetworkAdapters
**Date:** 2026-03-15
**Context:** Dell TSR `sysinfo_DCIM_SoftwareIdentity.xml` lists firmware for every installed
component. `parseSoftwareIdentityXML` dumped all of these into `hardware.firmware` without
filtering, so device-bound entries such as `"Mellanox Network Adapter"` (FQDD `InfiniBand.Slot.1-1`)
and `"PERC H755 Front"` (FQDD `RAID.SL.3-1`) appeared in the reanimator export alongside system
firmware like BIOS and iDRAC. Confirmed on PowerEdge R6625 (8VS2LG4).
Additionally, `DCIM_InfiniBandView` was not handled in the parser switch, so Mellanox ConnectX-6
appeared only as a PCIe device with `model: "16x or x16"` (from `DataBusWidth` fallback).
`parseControllerView` called `addFirmware` with description `"storage controller"` instead of the
FQDD, so the FQDD-based filter in the exporter could not remove it.
**Decision:**
1. `isDeviceBoundFirmwareFQDD` extended with `"infiniband."` and `"fc."` prefixes; `"raid.backplane."`
broadened to `"raid."` to cover `RAID.SL.*`, `RAID.Integrated.*`, etc.
2. `DCIM_InfiniBandView` routed to `parseNICView` → device appears as `NetworkAdapter` with correct
firmware, MAC address, and VendorID/DeviceID.
3. `"InfiniBand."` added to `pcieFQDDNoisePrefix` to suppress the duplicate `DCIM_PCIDeviceView`
entry (DataBusWidth-only, no useful data).
4. `parseControllerView` now passes `fqdd` as the `addFirmware` description so the FQDD filter
removes the entry in the exporter.
5. `parsePCIeDeviceView` now prioritises `props["description"]` (chip model, e.g. `"MT28908 Family
[ConnectX-6]"`) over `props["devicedescription"]` (location string) for `pcie.Description`.
6. `convertPCIeDevices` model fallback order: `PartNumber → Description → DeviceClass`.
**Consequences:**
- `hardware.firmware` contains only system-level entries; NIC/RAID/storage-controller firmware
lives on the respective device record.
- `TestParseDellInfiniBandView` and `TestIsDeviceBoundFirmwareFQDD` guard the regression.
- Any future Dell TSR device class whose FQDD prefix is not yet in the prefix list may still leak;
extend `isDeviceBoundFirmwareFQDD` and add a test case when encountered.
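
The prefix check from decision 1 can be sketched as below. This is a minimal illustration, not the repository's implementation: it assumes a case-insensitive prefix match and lists only the three prefixes named in this entry (the real list presumably holds more). `BIOS.Setup.1-1` is used only as an example of a system-level FQDD.

```go
package main

import (
	"fmt"
	"strings"
)

// deviceBoundFQDDPrefixes holds only the prefixes named in ADL-020.
// Assumption: the real list is longer and is matched in lower case.
var deviceBoundFQDDPrefixes = []string{"infiniband.", "fc.", "raid."}

// isDeviceBoundFirmwareFQDD reports whether a firmware entry belongs to a
// specific device (NIC, HBA, RAID controller) rather than to the system.
func isDeviceBoundFirmwareFQDD(fqdd string) bool {
	f := strings.ToLower(strings.TrimSpace(fqdd))
	for _, p := range deviceBoundFQDDPrefixes {
		if strings.HasPrefix(f, p) {
			return true
		}
	}
	return false
}

func main() {
	for _, fqdd := range []string{"InfiniBand.Slot.1-1", "RAID.SL.3-1", "BIOS.Setup.1-1"} {
		fmt.Printf("%s device-bound=%v\n", fqdd, isDeviceBoundFirmwareFQDD(fqdd))
	}
}
```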
---
## ADL-021 — pci.ids enrichment: chip model and vendor resolved from PCI IDs when source data is generic or missing
**Date:** 2026-03-15
**Context:**
Dell TSR `DCIM_InfiniBandView.ProductName` reports a generic marketing name ("Mellanox Network
Adapter") instead of the precise chip identifier ("MT28908 Family [ConnectX-6]"). The actual
chip model is available in `pci.ids` by VendorID:DeviceID (15B3:101B). Vendor name may also be
absent when no `VendorName` / `Manufacturer` property is present.
This establishes a general rule: *if the model is not found in source data but the PCI IDs are known,
resolve the model from `pci.ids`*. The rule applies to all export paths.
**Decision (two-layer enrichment):**
1. **Parser layer (Dell, `parseNICView`):** When `VendorID != 0 && DeviceID != 0`, prefer
`pciids.DeviceName(vendorID, deviceID)` over the product name from logs. This makes the chip
identifier the primary model for NIC/InfiniBand adapters (more specific than marketing name).
Fill `Vendor` from `pciids.VendorName(vendorID)` when the vendor field is otherwise empty.
The same fallback applies in `parsePCIeDeviceView` when `Description` is empty.
2. **Exporter layer (`convertPCIeFromDevices`):** General rule — when `d.Model == ""` after all
legacy fallbacks and `VendorID != 0 && DeviceID != 0`, set `model = pciids.DeviceName(...)`.
Also fill empty `manufacturer` from `pciids.VendorName(...)`. This covers all parsers/sources.
**Consequences:**
- Mellanox InfiniBand slot now reports `model: "MT28908 Family [ConnectX-6]"` and
`manufacturer: "Mellanox Technologies"` in the reanimator export.
- For NICs with no `pci.ids` entry, the lookup returns `""` and the original product name is kept.
- `TestParseDellInfiniBandView` asserts the model and vendor from pci.ids.
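
The parser-layer preference described above might look roughly like this. The sketch replaces the project's `pciids` package with two tiny in-memory maps (an assumption made so the example is self-contained); `resolveModel` and `resolveVendor` are hypothetical helper names. The single table entry uses the IDs and names quoted in this entry.

```go
package main

import "fmt"

// pciKey identifies a device by VendorID:DeviceID.
type pciKey struct{ vendor, device int }

// Stand-ins for pci.ids lookups; the real data comes from the pci.ids database.
var deviceNames = map[pciKey]string{{0x15B3, 0x101B}: "MT28908 Family [ConnectX-6]"}
var vendorNames = map[int]string{0x15B3: "Mellanox Technologies"}

// resolveModel prefers the pci.ids chip name when both IDs are known and the
// lookup succeeds; otherwise it keeps the product name from the logs.
func resolveModel(productName string, vendorID, deviceID int) string {
	if vendorID != 0 && deviceID != 0 {
		if name := deviceNames[pciKey{vendorID, deviceID}]; name != "" {
			return name
		}
	}
	return productName
}

// resolveVendor fills the vendor from pci.ids only when the field is empty.
func resolveVendor(vendor string, vendorID int) string {
	if vendor == "" && vendorID != 0 {
		if name := vendorNames[vendorID]; name != "" {
			return name
		}
	}
	return vendor
}

func main() {
	fmt.Println(resolveModel("Mellanox Network Adapter", 0x15B3, 0x101B))
	fmt.Println(resolveVendor("", 0x15B3))
}
```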
---
## ADL-022 — CPUAffinity parsed into NUMANode for PCIe, NIC, and controller devices
**Date:** 2026-03-15
**Context:**
Dell TSR DCIM view classes report `CPUAffinity` for NIC, InfiniBand, PCIe, and controller
devices. Values are "1", "2" (NUMA node index), or "Not Applicable" (for devices that bridge
both CPUs or have no CPU affinity). This data is needed for topology-aware diagnostics.
**Decision:**
- Add `NUMANode int` (JSON: `"numa_node,omitempty"`) to `models.PCIeDevice`,
`models.NetworkAdapter`, `models.HardwareDevice`, and `ReanimatorPCIe`.
- Parse from `props["cpuaffinity"]` using `parseIntLoose`: numeric values ("1", "2") map
directly; "Not Applicable" returns 0 (omitted via `omitempty`).
- Thread through `buildDevicesFromLegacy` (PCIe and NIC sections) and `convertPCIeFromDevices`.
- `parseControllerView` also parses CPUAffinity since RAID controllers have NUMA affinity.
**Consequences:**
- `numa_node: 1` or `2` appears in reanimator export for devices with known affinity.
- A value of 0 or an absent field means "not reported"; this covers both "Not Applicable" and
sources that don't provide CPUAffinity at all.
- `TestParseDellCPUAffinity` verifies that numeric values parse correctly and that
"Not Applicable" maps to 0.
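
To illustrate how `omitempty` hides the "Not Applicable" case, here is a small sketch. The `parseIntLoose` body is an assumption (trim, then `strconv.Atoi`, returning 0 on any error); the project's helper may be more lenient. The `pcieDevice` struct and `exportJSON` are hypothetical and reduced to two fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// pcieDevice is a reduced stand-in for models.PCIeDevice.
type pcieDevice struct {
	Slot     string `json:"slot"`
	NUMANode int    `json:"numa_node,omitempty"`
}

// parseIntLoose returns 0 for anything that is not a plain integer
// (assumed behaviour; "Not Applicable" therefore maps to 0).
func parseIntLoose(s string) int {
	n, err := strconv.Atoi(strings.TrimSpace(s))
	if err != nil {
		return 0
	}
	return n
}

// exportJSON serialises one device; a zero NUMANode is dropped by omitempty.
func exportJSON(slot, cpuAffinity string) string {
	b, err := json.Marshal(pcieDevice{Slot: slot, NUMANode: parseIntLoose(cpuAffinity)})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(exportJSON("Slot 1", "2"))              // {"slot":"Slot 1","numa_node":2}
	fmt.Println(exportJSON("Slot 2", "Not Applicable")) // {"slot":"Slot 2"}
}
```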
---
## ADL-023 — Reanimator export must match ingest contract exactly
**Date:** 2026-03-15
**Context:**
LOGPile's Reanimator export had drifted from the strict ingest contract. It emitted fields that
Reanimator does not currently accept (`status_at_collection`, `numa_node`),
while missing fields and sections now present in the contract (`hardware.sensors`,
`pcie_devices[].mac_addresses`). Memory export rules also diverged from the ingest side: empty or
serial-less DIMMs were still exported.
**Decision:**
- Treat the Reanimator ingest contract as the authoritative schema for `GET /api/export/reanimator`.
- Emit only fields present in the current upstream contract revision.
- Add `hardware.sensors`, `pcie_devices[].mac_addresses`, `pcie_devices[].numa_node`, and
upstream-approved component telemetry/health fields.
- Leave out fields that are still not part of the upstream contract.
- Map internal `source_type=archive` to external `source_type=logfile`.
- Skip memory entries that are empty, not present, or missing serial numbers.
- Generate CPU and PCIe serials only in the forms allowed by the contract.
- Mirror the applied contract in `bible-local/docs/hardware-ingest-contract.md`.
**Consequences:**
- Some previously exported diagnostic fields are intentionally dropped from the Reanimator payload
until the upstream contract adds them.
- Internal models may retain richer fields than the current export schema.
- `hardware.devices` is canonical only after merge with legacy hardware slices; partial parser-owned
canonical records must not hide CPUs, memory, storage, NICs, or PSUs still stored in legacy
fields.
- CSV and Reanimator exports must use the same merged canonical inventory to avoid divergent export
contents across surfaces.
- Future exporter changes must update both the code and the mirrored contract document together.
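
Two of the rules above, the memory filter and the `source_type` mapping, can be sketched as follows. The `dimm` struct, `exportableDIMMs`, and `externalSourceType` are hypothetical names, and treating `SizeMB == 0` as "empty" is an assumption; the contract document remains the authority on the exact conditions.

```go
package main

import (
	"fmt"
	"strings"
)

// dimm is a reduced, hypothetical view of a memory record.
type dimm struct {
	Slot    string
	Present bool
	SizeMB  int
	Serial  string
}

// exportableDIMMs drops entries that are not present, empty (assumed to mean
// zero size), or missing a serial number.
func exportableDIMMs(in []dimm) []dimm {
	out := make([]dimm, 0, len(in))
	for _, d := range in {
		if !d.Present || d.SizeMB == 0 || strings.TrimSpace(d.Serial) == "" {
			continue
		}
		out = append(out, d)
	}
	return out
}

// externalSourceType maps the internal "archive" value to the external
// "logfile" value and passes every other value through unchanged.
func externalSourceType(internal string) string {
	if internal == "archive" {
		return "logfile"
	}
	return internal
}

func main() {
	dimms := []dimm{
		{Slot: "A1", Present: true, SizeMB: 32768, Serial: "SN-EXAMPLE-1"},
		{Slot: "A2", Present: false},
		{Slot: "A3", Present: true, SizeMB: 32768, Serial: ""},
	}
	fmt.Println(len(exportableDIMMs(dimms)), externalSourceType("archive"))
}
```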
---
## ADL-024 — Component presence is implicit; Redfish linked metrics are part of replay correctness
**Date:** 2026-03-15
**Context:**
The upstream ingest contract allows `present`, but current export semantics do not need to send
`present=true` for populated components. At the same time, several important Redfish component
telemetry fields were only available through linked metric resources such as `ProcessorMetrics`,
`MemoryMetrics`, and `DriveMetrics`. Without collecting and replaying these linked documents,
live collection and raw snapshot replay still underreported component health fields.
**Decision:**
- Do not serialize `present=true` in Reanimator export. Presence is represented by the presence of
the component record itself.
- Do not export component records marked `present=false`.
- Interpret CPU `firmware` in Reanimator payload as CPU microcode.
- Treat Redfish linked metric resources `ProcessorMetrics`, `MemoryMetrics`, `DriveMetrics`,
`EnvironmentMetrics`, and generic `Metrics` as part of analyzer correctness when they are linked
from component resources.
- Replay logic must merge these linked metric resources back into CPU, memory, storage, PCIe, GPU,
NIC, and PSU component `Details` the same way live collection expects them to be used.
**Consequences:**
- Reanimator payloads are smaller and avoid redundant `present=true` noise while still excluding
empty slots and absent components.
- Any future exporter change that reintroduces serialized component presence needs an explicit
contract review.
- Raw Redfish snapshot completeness now includes linked per-component metric resources, not only
top-level inventory members.
- CPU microcode is no longer expected in top-level `hardware.firmware`; it belongs on the CPU
component record.
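
The replay-side merge of linked metric resources could be sketched as below. This is an illustration under stated assumptions, not the analyzer's code: the raw snapshot is modelled as a map from `@odata.id` path to decoded JSON document, only the `Metrics` and `EnvironmentMetrics` link properties are followed, and earlier keys win on conflict. `mergeLinkedMetrics` and `linkedPath` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// doc is a decoded Redfish JSON resource.
type doc = map[string]any

// metricLinkKeys lists the link properties followed in this sketch.
var metricLinkKeys = []string{"Metrics", "EnvironmentMetrics"}

// linkedPath returns the @odata.id behind a link property, or "" if absent.
func linkedPath(component doc, key string) string {
	link, ok := component[key].(doc)
	if !ok {
		return ""
	}
	path, _ := link["@odata.id"].(string)
	return path
}

// mergeLinkedMetrics collects fields from linked metric documents found in
// the snapshot. @odata.* bookkeeping keys are skipped; first writer wins.
func mergeLinkedMetrics(component doc, snapshot map[string]doc) doc {
	details := doc{}
	for _, key := range metricLinkKeys {
		metrics, ok := snapshot[linkedPath(component, key)]
		if !ok {
			continue // link missing or linked document not captured
		}
		for k, v := range metrics {
			if strings.HasPrefix(k, "@odata.") {
				continue
			}
			if _, exists := details[k]; !exists {
				details[k] = v
			}
		}
	}
	return details
}

func main() {
	path := "/redfish/v1/Systems/1/Processors/CPU1/ProcessorMetrics"
	cpu := doc{"Id": "CPU1", "Metrics": doc{"@odata.id": path}}
	snapshot := map[string]doc{path: {"@odata.id": path, "TemperatureCelsius": 54.0}}
	fmt.Println(mergeLinkedMetrics(cpu, snapshot))
}
```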
<!-- Add new decisions below this line using the format above -->