When BMC firmware fails to read capacity for a present DIMM, size_mb stays
0. If another DIMM with the same part number in the same batch has a known
size, use it to fill the gap.
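
A minimal sketch of the gap-filling rule, assuming a simplified `DIMM` struct and a `fillMissingSizes` helper (both names are hypothetical; only `size_mb` and the part-number match come from the description above):

```go
package main

import "fmt"

// DIMM is a simplified, hypothetical stand-in for the real memory model.
type DIMM struct {
	Slot       string
	PartNumber string
	Present    bool
	SizeMB     int
}

// fillMissingSizes copies a known size onto present DIMMs that report 0,
// using another DIMM with the same part number as the reference.
func fillMissingSizes(dimms []DIMM) {
	known := make(map[string]int)
	for _, d := range dimms {
		if d.PartNumber == "" || d.SizeMB <= 0 {
			continue
		}
		if _, seen := known[d.PartNumber]; !seen {
			known[d.PartNumber] = d.SizeMB
		}
	}
	for i := range dimms {
		d := &dimms[i]
		if d.Present && d.SizeMB == 0 {
			if size, ok := known[d.PartNumber]; ok {
				d.SizeMB = size
			}
		}
	}
}

func main() {
	dimms := []DIMM{
		{Slot: "A1", PartNumber: "PN-1", Present: true, SizeMB: 32768},
		{Slot: "A2", PartNumber: "PN-1", Present: true, SizeMB: 0},
	}
	fillMissingSizes(dimms)
	fmt.Println(dimms[1].Slot, dimms[1].SizeMB)
}
```

A DIMM whose part number has no sized sibling keeps size 0 in this sketch.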
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deduplicating firmware entries by version caused the CPU1 Microcode entry
to be omitted when both CPUs run the same version, leaving the firmware
column blank for the second socket. Each CPU now gets its own firmware
entry keyed by index.
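
As a rough illustration of per-index keying, the sketch below emits one microcode entry per socket even when the versions match. `FirmwareEntry` and `cpuMicrocodeEntries` are hypothetical names, and the version strings are made up:

```go
package main

import "fmt"

// FirmwareEntry is a hypothetical stand-in for the real firmware model.
type FirmwareEntry struct {
	Name    string
	Version string
}

// cpuMicrocodeEntries emits one entry per CPU index, with no dedup by version.
func cpuMicrocodeEntries(versions []string) []FirmwareEntry {
	entries := make([]FirmwareEntry, 0, len(versions))
	for i, v := range versions {
		if v == "" {
			continue // nothing to report for this socket
		}
		entries = append(entries, FirmwareEntry{
			Name:    fmt.Sprintf("CPU%d Microcode", i),
			Version: v,
		})
	}
	return entries
}

func main() {
	for _, e := range cpuMicrocodeEntries([]string{"0x2b000590", "0x2b000590"}) {
		fmt.Println(e.Name, e.Version)
	}
}
```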
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix two bugs in onekeylog archives that lack asset.json:
- CPU count was always 0: ParseComponentLog never parsed the "RESTful CPU
info" section. Added parseCPUInfo as a fallback when hw.CPUs is empty
(asset.json remains the primary source when present). Also worked around
a Go JSON case-insensitive collision between "proc_id" (int) and
"PROC_ID" (string CPUID) by adding an explicit PROC_ID field with an
exact-case tag.
- Only 1 of 2 DIMMs shown: Present condition required mem_mod_size > 0,
but some BMC firmware reports size=0 for a physically installed module
while still providing serial and part number. Now treats a DIMM as
present when status=1 and any of size/serial/partnum is non-empty.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IOMMUGroup was added to models.PCIeDevice but never wired into the
converter: it was missing from Details in buildDevicesFromLegacy,
ReanimatorPCIe had no field for it, and convertPCIeFromDevices never
read it.
Add IOMMUGroup *int to ReanimatorPCIe, propagate it through Details, and
add an intPtrFromDetailMap helper.
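
One possible shape for such a helper, assuming Details is a `map[string]any` and that the value may arrive as an int or as a JSON-decoded float64. The signature and the `iommu_group` key are guesses for illustration, not the actual code:

```go
package main

import "fmt"

// intPtrFromDetailMap returns a pointer to the integer stored under key,
// or nil when the key is absent or holds a non-numeric value.
func intPtrFromDetailMap(details map[string]any, key string) *int {
	switch n := details[key].(type) {
	case int:
		return &n
	case int64:
		v := int(n)
		return &v
	case float64: // numbers decoded from JSON into map[string]any
		v := int(n)
		return &v
	}
	return nil
}

func main() {
	details := map[string]any{"iommu_group": float64(42)}
	if p := intPtrFromDetailMap(details, "iommu_group"); p != nil {
		fmt.Println("IOMMU group:", *p)
	}
	fmt.Println("missing key is nil:", intPtrFromDetailMap(details, "absent") == nil)
}
```

Returning a pointer keeps "group 0" distinguishable from "group unknown".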
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse inventory_volume.log: Intel VROC (VMD) RAID volumes including
RAID level, capacity (GiB/TiB support added), status and member drives.
Add Drives []string to StorageVolume model.
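
A hedged sketch of the capacity conversion, assuming the log prints a number followed by a unit (for example "1.5 TiB"). The exact layout of inventory_volume.log is not shown here, so `parseCapacityBytes` and its input format are assumptions:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCapacityBytes converts strings such as "894 GiB" or "1.5 TiB" to bytes.
func parseCapacityBytes(s string) (int64, error) {
	fields := strings.Fields(s)
	if len(fields) != 2 {
		return 0, fmt.Errorf("unexpected capacity %q", s)
	}
	value, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, fmt.Errorf("bad capacity number %q: %w", fields[0], err)
	}
	var unit int64
	switch strings.ToUpper(fields[1]) {
	case "GIB":
		unit = 1 << 30
	case "TIB":
		unit = 1 << 40
	default:
		return 0, fmt.Errorf("unsupported unit %q", fields[1])
	}
	return int64(value * float64(unit)), nil
}

func main() {
	for _, in := range []string{"894 GiB", "1.5 TiB"} {
		n, err := parseCapacityBytes(in)
		fmt.Println(in, "->", n, err)
	}
}
```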
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Lenovo ThinkSystem SR650 V3 (and similar XCC-based servers) caused
collection runs of 23+ minutes because the BMC exposes two large,
high-error-rate subtrees in the snapshot BFS:
- Chassis/1/Sensors: 315 individual sensor members, 282/315 failing,
~3.7s per request → ~19 minutes wasted. These documents are never
read by any LOGPile parser (thermal/power data comes from aggregate
Chassis/*/Thermal and Chassis/*/Power endpoints).
- Chassis/1/Oem/Lenovo: 75 requests (LEDs×47, Slots×26, etc.),
68/75 failing → 8+ minutes wasted on non-inventory data.
Add a Lenovo profile (matched on SystemManufacturer/OEMNamespace "Lenovo")
that sets SnapshotExcludeContains to block individual sensor documents and
non-inventory Lenovo OEM subtrees from the snapshot BFS queue. Also sets
rate policy thresholds appropriate for XCC BMC latency (p95 often 3-5s).
Add SnapshotExcludeContains []string to AcquisitionTuning and check it
in the snapshot enqueue closure in redfish.go.
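
The enqueue check could look roughly like this. The fragment values are hypothetical examples of what a Lenovo profile might set; `shouldEnqueue` is an illustrative name for the test made inside the closure:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldEnqueue reports whether a path may enter the snapshot BFS queue.
// A path is dropped when it contains any of the configured fragments.
func shouldEnqueue(path string, excludeContains []string) bool {
	for _, frag := range excludeContains {
		if frag != "" && strings.Contains(path, frag) {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical fragments: the trailing slash on "/Sensors/" keeps the
	// collection document itself while blocking its individual members.
	exclude := []string{"/Sensors/", "/Oem/Lenovo/LEDs", "/Oem/Lenovo/Slots"}

	for _, p := range []string{
		"/redfish/v1/Chassis/1/Sensors",
		"/redfish/v1/Chassis/1/Sensors/42",
		"/redfish/v1/Chassis/1/Thermal",
		"/redfish/v1/Chassis/1/Oem/Lenovo/LEDs/7",
	} {
		fmt.Println(p, shouldEnqueue(p, exclude))
	}
}
```

For scale: 315 sensor requests at about 3.7s each is roughly 1165s, which matches the ~19 minutes quoted above.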
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Supermicro HGX BMC reports all 8 B200 GPU PCIe devices with Name
"PCIe Device" — a generic label shared by every GPU, not a unique
hardware position. pcieDedupKey used slot as the primary key, so all
8 GPUs collapsed to one entry in the UI (the first, serial 1654925165720).
Add isGenericPCIeSlotName to detect non-positional slot labels and fall
through to serial/BDF for dedup instead, preserving each GPU separately.
Positional slots (#GPU0, SLOT-NIC1, etc.) continue to use slot-first dedup.
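
A simplified sketch of the key selection. The real pcieDedupKey likely takes a device struct; the three-string signature, the key prefixes, and the second serial number here are assumptions, and only "PCIe Device" is known to be a generic label:

```go
package main

import (
	"fmt"
	"strings"
)

// isGenericPCIeSlotName detects labels that do not identify a position.
func isGenericPCIeSlotName(slot string) bool {
	switch strings.ToLower(strings.TrimSpace(slot)) {
	case "", "pcie device":
		return true
	}
	return false
}

// pcieDedupKey prefers a positional slot and otherwise falls through to
// serial, then BDF, so devices sharing a generic label stay distinct.
func pcieDedupKey(slot, serial, bdf string) string {
	if !isGenericPCIeSlotName(slot) {
		return "slot:" + slot
	}
	if serial != "" {
		return "serial:" + serial
	}
	if bdf != "" {
		return "bdf:" + bdf
	}
	return "slot:" + slot
}

func main() {
	fmt.Println(pcieDedupKey("PCIe Device", "1654925165720", ""))
	fmt.Println(pcieDedupKey("PCIe Device", "1654925165721", ""))
	fmt.Println(pcieDedupKey("#GPU0", "1654925165720", ""))
}
```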
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parseGPUWithSupplementalDocs did not read PCIeInterface from the device
doc, only from function docs. xFusion GPU PCIeCard entries carry link
width/speed in PCIeInterface (LanesInUse/Maxlanes/PCIeType/MaxPCIeType)
so GPU link width was always empty for xFusion servers.
Also apply the xFusion OEM function-level fallback for GPU function docs,
consistent with the NIC and PCIeDevice paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xFusion iBMC exposes PCIe link width in two non-standard ways:
- PCIeInterface uses "Maxlanes" (lowercase 'l') instead of "MaxLanes"
- PCIeFunction docs carry width/speed in Oem.xFusion.LinkWidth ("X8"),
Oem.xFusion.LinkWidthAbility, Oem.xFusion.LinkSpeed, and
Oem.xFusion.LinkSpeedAbility rather than the standard CurrentLinkWidth int
Add redfishEnrichFromOEMxFusionPCIeLink and parseXFusionLinkWidth helpers,
apply them as fallbacks in NIC and PCIeDevice enrichment paths.
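
The two quirks can be handled along these lines. `parseXFusionLinkWidth` is named above, but its signature here is a guess, and `maxLanes` is a hypothetical helper for the key-spelling fallback:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseXFusionLinkWidth turns an OEM width string such as "X8" into 8.
// It returns 0 when the value cannot be interpreted.
func parseXFusionLinkWidth(s string) int {
	s = strings.TrimPrefix(strings.ToUpper(strings.TrimSpace(s)), "X")
	n, err := strconv.Atoi(s)
	if err != nil || n <= 0 {
		return 0
	}
	return n
}

// maxLanes reads the standard "MaxLanes" key first and falls back to the
// xFusion spelling "Maxlanes". Values are float64 as decoded from JSON.
func maxLanes(pcieInterface map[string]any) int {
	for _, key := range []string{"MaxLanes", "Maxlanes"} {
		if v, ok := pcieInterface[key].(float64); ok && v > 0 {
			return int(v)
		}
	}
	return 0
}

func main() {
	fmt.Println(parseXFusionLinkWidth("X8"))
	fmt.Println(maxLanes(map[string]any{"Maxlanes": float64(16)}))
}
```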
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove power-on and power-off functionality from the Redfish collector;
keep host power-state detection and show a warning in the UI when the
host is powered off before collection starts.
Add a "Skip hung" button (UI label "Пропустить зависшие") that lets the user abort
stuck Redfish collection phases without losing already-collected data.
Introduce a two-level context model in Collect(): the outer job context
covers the full lifecycle including replay; an inner collectCtx covers
snapshot, prefetch, and plan-B phases only. Closing skipCh cancels
collectCtx immediately, which aborts all in-flight HTTP requests and exits
plan-B loops; replay then runs on whatever rawTree was collected.
Signal path: UI → POST /api/collect/{id}/skip → JobManager.SkipJob()
→ close(skipCh) → goroutine in Collect() → cancelCollect().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add HPE iLO Redfish profile (priority 20): matches on manufacturer/OEM/iLO signals,
adds SmartStorage/SmartStorageConfig to critical paths, sets realistic ETA baseline
and rate policy for iLO's known slowness
- Fix post-probe hang on HPE iLO: skip numeric probing of collections where
Members@odata.count == len(Members); add 4s postProbeClient timeout as safety net
- Exclude /WorkloadPerformanceAdvisor from crawl paths
- Fix replay parser: skip absent CPU sockets, absent DIMM slots, absent drive bays
- Filter N/A version entries from firmware inventory
- Remove drive firmware from general firmware list (already in Storage[].Firmware)
- Add HPE AHS (.ahs) archive parser with hybrid SMBIOS/Redfish extraction
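
The post-probe skip condition from the list above can be sketched as follows. `needsNumericProbe` is a hypothetical name, and probing when the count annotation is missing is an assumption about the desired default:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// collection holds the two fields needed to judge completeness.
type collection struct {
	Count   *int              `json:"Members@odata.count"`
	Members []json.RawMessage `json:"Members"`
}

// needsNumericProbe returns false when the declared member count equals the
// number of listed members, meaning the collection is already complete.
func needsNumericProbe(doc []byte) bool {
	var c collection
	if err := json.Unmarshal(doc, &c); err != nil {
		return true // unreadable document: assume probing is still needed
	}
	if c.Count == nil {
		return true // no count annotation: completeness is unknown
	}
	return *c.Count != len(c.Members)
}

func main() {
	complete := []byte(`{"Members@odata.count": 2, "Members": [{"@odata.id": "/a"}, {"@odata.id": "/b"}]}`)
	partial := []byte(`{"Members@odata.count": 4, "Members": [{"@odata.id": "/a"}]}`)
	fmt.Println(needsNumericProbe(complete), needsNumericProbe(partial))
}
```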
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BMC readiness after power-on (waitForStablePoweredOnHost):
- After initial 1m stabilization, poll BMC inventory readiness before collecting
- Ready if MemorySummary.TotalSystemMemoryGiB > 0 OR PCIeDevices.Members non-empty
- On failure: wait +60s, retry; on second failure: wait +120s, retry; then warn and proceed
- Configurable via LOGPILE_REDFISH_BMC_READY_WAITS (default: 60s,120s)
Empty critical collection plan-B retry (EnableEmptyCriticalCollectionRetry):
- Hardware inventory collections that returned Members=[] are now re-probed in plan-B
- Covers PCIeDevices, NetworkAdapters, Processors, Drives, Storage, EthernetInterfaces
- Enabled by default in generic profile (applies to all vendors)
Ghost NIC dedup fix (enrichNICsFromNetworkInterfaces):
- NetworkInterface entries (e.g. Id=2) that don't match existing NIC slots are now
resolved via Links.NetworkAdapter cross-reference to the real Chassis NIC
- Prevents duplicate ghost entries (slot=2 "Network Device View") from appearing
alongside real NICs (slot="RISER 5 slot 1 (7)") with the same MAC addresses
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Single "Connect" button flow (UI label "Подключиться"): probe first, then show collect options
- Power management checkboxes: power on before / stop after collect
- Modal confirmation when enabling shutdown on already-powered-on host
- StopHostAfterCollect flag: host shuts down only when explicitly requested
- TCP ping (10 attempts, min 3 successes) before Redfish probe
- Debug payloads checkbox (Oem/Ami/Inventory/Crc, off by default)
- Remove platform_config BIOS settings collection (unreliable on AMI)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Collect hardware event logs (last 7 days) from Systems and Managers/SEL LogServices
- Parse AMI raw IPMI dump messages into readable descriptions (Sensor_Type: Event_Type)
- Filter out audit/journal/non-hardware log services; only SEL from Managers
- MSI ghost GPU filter: exclude processor GPU entries with temperature=0 when host is powered on
- Reanimator collected_at uses InventoryData/Status.LastModifiedTime (30-day fallback)
- Invalidate Redfish inventory CRC groups before host power-on
- Log inventory LastModifiedTime age in collection logs
- Drop SecureBoot collection (SecureBootMode, SecureBootDatabases) — not hardware inventory
- Add build version to UI footer via template
- Add MSI Redfish API reference doc to bible-local/docs/
ADL-032–ADL-035
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add an xFusion profile that matches on ServiceRootVendor "xFusion" and OEM
namespace "xFusion" (score 90+). It enables GenericGraphicsControllerDedup
unconditionally and ProcessorGPUFallback when GPU-type processors are
present in the snapshot (xFusion G5500 V7 exposes H200s simultaneously in
PCIeDevices, GraphicsControllers, and Processors/Gpu*, so all three need
dedup).
Without this profile, xFusion fell into fallback mode, which activated all
vendor profiles (Supermicro, HGX, MSI, Dell) unnecessarily. It now resolves
to matched mode with targeted acquisition tuning (120k cap, 75s baseline).
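
To illustrate the difference between matched and fallback mode, here is a toy resolver. The threshold, score weights, and all type and function names are hypothetical; only the idea that an unmatched system activates every vendor profile comes from the text above:

```go
package main

import (
	"fmt"
	"sort"
)

// Signals carries the identifying fields a profile scores against.
type Signals struct {
	ServiceRootVendor string
	OEMNamespace      string
}

// Profile pairs a name with a scoring function.
type Profile struct {
	Name  string
	Score func(Signals) int
}

const matchThreshold = 60 // hypothetical cut-off

// resolveProfiles returns "matched" with the profiles at or above the
// threshold, or "fallback" with every profile when none qualifies.
func resolveProfiles(sig Signals, profiles []Profile) (string, []string) {
	var matched, all []string
	for _, p := range profiles {
		all = append(all, p.Name)
		if p.Score(sig) >= matchThreshold {
			matched = append(matched, p.Name)
		}
	}
	sort.Strings(matched) // deterministic ordering
	sort.Strings(all)
	if len(matched) > 0 {
		return "matched", matched
	}
	return "fallback", all
}

// vendorProfile scores 50 for the vendor field and 45 for the OEM namespace.
func vendorProfile(name, vendor string) Profile {
	return Profile{Name: name, Score: func(s Signals) int {
		score := 0
		if s.ServiceRootVendor == vendor {
			score += 50
		}
		if s.OEMNamespace == vendor {
			score += 45
		}
		return score
	}}
}

func main() {
	profiles := []Profile{
		vendorProfile("xfusion", "xFusion"),
		vendorProfile("supermicro", "Supermicro"),
		vendorProfile("dell", "Dell"),
	}
	fmt.Println(resolveProfiles(Signals{"xFusion", "xFusion"}, profiles))
	fmt.Println(resolveProfiles(Signals{"Acme", "Acme"}, profiles))
}
```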
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement the full architectural plan: unified ingest.Service entry point
for archive and Redfish payloads, modular redfishprofile package with
composable profiles (generic, ami-family, msi, supermicro, dell,
hgx-topology), score-based profile matching with fallback expansion mode,
and profile-driven acquisition/analysis plans.
Vendor-specific logic moved out of common executors and into profile hooks.
GPU chassis lookup strategies and known storage recovery collections
(IntelVROC/HA-RAID/MRVL) now live in ResolvedAnalysisPlan, populated by
profiles at analysis time. Replay helpers read from the plan; no hardcoded
path lists remain in generic code.
Also splits redfish_replay.go into domain modules (gpu, storage, inventory,
fru, profiles) and adds full fixture/matcher/directive test coverage
including Dell, AMI, unknown-vendor fallback, and deterministic ordering.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause analysis for device-bound firmware leaking into hardware.firmware
on Supermicro Redfish (SYS-A21GE-NBRT HGX B200):
- collectFirmwareInventory (6c19a58) had no coverage for Supermicro naming.
isDeviceBoundFirmwareName checked "gpu " / "nic " (space-terminated) while
Supermicro uses "GPU1 System Slot0" / "NIC1 System Slot0 ..." (digit suffix).
- 9c5512d added _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns to fix HGX,
but checked DeviceName which contains "Software Inventory" (from Redfish Name),
not the firmware Id. Dead code from day one.
09-testing.md: add firmware filter worked example and rule #4 — verify the
filter checks the field that the collector actually populates.
10-decisions.md: ADL-019 — isDeviceBoundFirmwareName must be extended per
vendor with a test case per vendor format before shipping.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
isDeviceBoundFirmwareName did not catch Supermicro FirmwareInventory naming
conventions where a digit follows the type prefix directly ("GPU1 System Slot0",
"NIC1 System Slot0 AOM-DP805-IO") instead of a space. Also missing: "Power supply N",
"NVMeController N", and "Software Inventory" (generic label for all HGX per-component
firmware slots — GPU, NVSwitch, PCIeRetimer, ERoT, InfoROM, etc.).
On SYS-A21GE-NBRT (HGX B200) this caused 29 device-bound entries to leak into
hardware.firmware: 8 GPU, 9 NIC, 1 NVMe, 6 PSU, 4 PCIeSwitch, 1 Software Inventory.
Fix: extend isDeviceBoundFirmwareName with patterns for all four new cases.
Add TestIsDeviceBoundFirmwareName covering both excluded and kept entries.
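
A reduced sketch of the extended matcher, covering only the name shapes quoted above: a type prefix followed by a space or a digit, "Power supply N", "NVMeController N", and the generic "Software Inventory" label. The real function has more patterns, and the kept names used in the demo ("BIOS", "BMC") are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// isDeviceBoundFirmwareName reports whether a firmware entry belongs to a
// specific device and should stay out of the general firmware list.
func isDeviceBoundFirmwareName(name string) bool {
	n := strings.ToLower(strings.TrimSpace(name))
	if n == "software inventory" {
		return true
	}
	// "gpu 0" (older form) and "gpu1 system slot0" (Supermicro form).
	for _, prefix := range []string{"gpu", "nic"} {
		if strings.HasPrefix(n, prefix) && len(n) > len(prefix) {
			next := n[len(prefix)]
			if next == ' ' || (next >= '0' && next <= '9') {
				return true
			}
		}
	}
	for _, prefix := range []string{"power supply", "nvmecontroller"} {
		if strings.HasPrefix(n, prefix) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{
		"GPU1 System Slot0",
		"NIC1 System Slot0 AOM-DP805-IO",
		"Power supply 2",
		"NVMeController 1",
		"Software Inventory",
		"BIOS",
		"BMC",
	} {
		fmt.Printf("%-32s %v\n", name, isDeviceBoundFirmwareName(name))
	}
}
```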
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>