Commit Graph

150 Commits

Author SHA1 Message Date
Mikhail Chusavitin
dfd64550cf fix(inspur): infer DIMM size from part number when BMC reports size=0
When BMC firmware fails to read capacity for a present DIMM, size_mb stays
0. If another DIMM with the same part number in the same batch has a known
size, use it to fill the gap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:18:15 +03:00
Mikhail Chusavitin
9505303d1d fix(inspur): show microcode version for every CPU, not just the first
Dedup by version caused CPU1 Microcode to be omitted when both CPUs run
the same version, leaving the firmware column blank for the second socket.
Each CPU gets its own firmware entry keyed by index.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:15:28 +03:00
Mikhail Chusavitin
f2c04cf0e8 fix(inspur): parse CPUs from component.log and fix DIMM present detection
Two bugs in onekeylog archives that lack asset.json:

- CPU count was always 0: ParseComponentLog never parsed the "RESTful CPU
  info" section. Added parseCPUInfo as a fallback when hw.CPUs is empty
  (asset.json remains the primary source when present). Also worked around
  a Go JSON case-insensitive collision between "proc_id" (int) and
  "PROC_ID" (string CPUID) by adding an explicit PROC_ID field with an
  exact-case tag.

- Only 1 of 2 DIMMs shown: Present condition required mem_mod_size > 0,
  but some BMC firmware reports size=0 for a physically installed module
  while still providing serial and part number. Now treats a DIMM as
  present when status=1 and any of size/serial/partnum is non-empty.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:13:55 +03:00
Mikhail Chusavitin
ca457ac72b fix(exporter): propagate iommu_group through PCIe export pipeline
IOMMUGroup was added to models.PCIeDevice but never wired into the
converter — missing from Details in buildDevicesFromLegacy, no field
in ReanimatorPCIe, and convertPCIeFromDevices never read it.

Add IOMMUGroup *int to ReanimatorPCIe, propagate through Details,
add intPtrFromDetailMap helper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.16
2026-04-30 16:05:30 +03:00
Mikhail Chusavitin
78d0e26fd0 feat: sync with hardware ingest contract v2.10
- PCIeDevice: add model, firmware, present, iommu_group, telemetry fields
  (temperature_c, power_w, ecc_corrected_total, ecc_uncorrected_total,
  hw_slowdown) — were silently dropped on JSON parse, breaking bee audit display
- buildDevicesFromLegacy: use pcie.Model as fallback (PartNumber > Model >
  Description), copy MACAddresses/Present/Firmware, propagate telemetry into
  Details so convertPCIeFromDevices picks them up
- Storage: add logical_block_size_bytes, physical_block_size_bytes,
  metadata_bytes_per_block (contract v2.10, 2026-04-29) to models, exporter
  struct and converter pipeline
- ReanimatorHardware: add platform_config map[string]any (contract v2.9)
- Update internal/chart submodule to v2.0 (contract 2.10 viewer support:
  event_logs section, platform_config section, storage block size columns)
- Update bible submodule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.15
2026-04-30 15:55:26 +03:00
Mikhail Chusavitin
88e4e8dd49 Trim noisy Lenovo Redfish collection paths 2026-04-30 15:55:26 +03:00
Mikhail Chusavitin
cf9cf5d0cf Improve Lenovo XCC inventory enrichment 2026-04-30 15:55:26 +03:00
aba7a54990 feat(parser): lenovo xcc vroc volume parsing - v1.2
Parse inventory_volume.log: Intel VROC (VMD) RAID volumes including
RAID level, capacity (GiB/TiB support added), status and member drives.
Add Drives []string to StorageVolume model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 17:00:27 +03:00
835df2676c docs: defer generic IPMI live collector
Closes #12
2026-04-22 20:51:58 +03:00
b86d51c921 chore: close completed Redfish profile framework issue
Fixes #14
2026-04-22 20:45:42 +03:00
Mikhail Chusavitin
a82fb227e5 submodule update 2026-04-16 15:33:48 +03:00
c9969fc3da feat(parser): lenovo xcc warnings and redfish logs - v1.1 v1.12 2026-04-13 20:34:04 +03:00
89b6701f43 feat(parser): add Lenovo XCC mini-log parser 2026-04-13 20:20:37 +03:00
b04877549a feat(collector): add Lenovo XCC profile to skip noisy snapshot paths
Lenovo ThinkSystem SR650 V3 (and similar XCC-based servers) caused
collection runs of 23+ minutes because the BMC exposes two large high-
error-rate subtrees in the snapshot BFS:

  - Chassis/1/Sensors: 315 individual sensor members, 282/315 failing,
    ~3.7s per request → ~19 minutes wasted. These documents are never
    read by any LOGPile parser (thermal/power data comes from aggregate
    Chassis/*/Thermal and Chassis/*/Power endpoints).

  - Chassis/1/Oem/Lenovo: 75 requests (LEDs×47, Slots×26, etc.),
    68/75 failing → 8+ minutes wasted on non-inventory data.

Add a Lenovo profile (matched on SystemManufacturer/OEMNamespace "Lenovo")
that sets SnapshotExcludeContains to block individual sensor documents and
non-inventory Lenovo OEM subtrees from the snapshot BFS queue. Also sets
rate policy thresholds appropriate for XCC BMC latency (p95 often 3-5s).

Add SnapshotExcludeContains []string to AcquisitionTuning and check it
in the snapshot enqueue closure in redfish.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.15
2026-04-13 19:29:04 +03:00
8ca173c99b fix(exporter): preserve all HGX GPUs with generic PCIe slot name
Supermicro HGX BMC reports all 8 B200 GPU PCIe devices with Name
"PCIe Device" — a generic label shared by every GPU, not a unique
hardware position. pcieDedupKey used slot as the primary key, so all
8 GPUs collapsed to one entry in the UI (the first, serial 1654925165720).

Add isGenericPCIeSlotName to detect non-positional slot labels and fall
through to serial/BDF for dedup instead, preserving each GPU separately.
Positional slots (#GPU0, SLOT-NIC1, etc.) continue to use slot-first dedup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:05:49 +03:00
f19a3454fa fix(redfish): gate hgx diagnostic plan-b by debug toggle v1.11.14 2026-04-13 14:45:41 +03:00
Mikhail Chusavitin
becdca1d7e fix(redfish): read PCIeInterface link width for GPU PCIe devices
parseGPUWithSupplementalDocs did not read PCIeInterface from the device
doc, only from function docs. xFusion GPU PCIeCard entries carry link
width/speed in PCIeInterface (LanesInUse/Maxlanes/PCIeType/MaxPCIeType)
so GPU link width was always empty for xFusion servers.

Also apply the xFusion OEM function-level fallback for GPU function docs,
consistent with the NIC and PCIeDevice paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.13
2026-04-12 13:35:29 +03:00
Mikhail Chusavitin
e10440ae32 fix(redfish): collect PCIe link width from xFusion servers
xFusion iBMC exposes PCIe link width in two non-standard ways:
- PCIeInterface uses "Maxlanes" (lowercase 'l') instead of "MaxLanes"
- PCIeFunction docs carry width/speed in Oem.xFusion.LinkWidth ("X8"),
  Oem.xFusion.LinkWidthAbility, Oem.xFusion.LinkSpeed, and
  Oem.xFusion.LinkSpeedAbility rather than the standard CurrentLinkWidth int

Add redfishEnrichFromOEMxFusionPCIeLink and parseXFusionLinkWidth helpers,
apply them as fallbacks in NIC and PCIeDevice enrichment paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:35:29 +03:00
5c2a21aff1 chore: update bible and chart submodules
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.12
2026-04-11 12:17:40 +03:00
Mikhail Chusavitin
9df13327aa feat(collect): remove power-on/off, add skip-hung for Redfish collection
Remove power-on and power-off functionality from the Redfish collector;
keep host power-state detection and show a warning in the UI when the
host is powered off before collection starts.

Add a "Пропустить зависшие" (skip hung) button that lets the user abort
stuck Redfish collection phases without losing already-collected data.
Introduces a two-level context model in Collect(): the outer job context
covers the full lifecycle including replay; an inner collectCtx covers
snapshot, prefetch, and plan-B phases only. Closing the skipCh cancels
collectCtx immediately — aborts all in-flight HTTP requests and exits
plan-B loops — then replay runs on whatever rawTree was collected.

Signal path: UI → POST /api/collect/{id}/skip → JobManager.SkipJob()
→ close(skipCh) → goroutine in Collect() → cancelCollect().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:12:38 +03:00
Mikhail Chusavitin
7e9af89c46 Add xFusion file-export parser support v1.11.11 2026-04-04 15:07:10 +03:00
Mikhail Chusavitin
db74df9994 fix(redfish): trim MSI replay noise and unify NIC classes v1.11.10 2026-04-01 17:49:00 +03:00
Mikhail Chusavitin
bb82387d48 fix(redfish): narrow MSI PCIeFunctions crawl 2026-04-01 16:50:51 +03:00
Mikhail Chusavitin
475f6ac472 fix(export): keep storage inventory without serials v1.11.9 2026-04-01 16:50:19 +03:00
Mikhail Chusavitin
93ce676f04 fix(redfish): recover MSI NIC serials from PCIe functions 2026-04-01 15:48:47 +03:00
Mikhail Chusavitin
c47c34fd11 feat(hpe): improve inventory extraction and export fidelity v1.11.8 v1.11.7 2026-03-30 15:04:17 +03:00
Mikhail Chusavitin
d8c3256e41 chore(hpe_ilo_ahs): normalize module version format — v1.0 v1.11.6 2026-03-30 13:43:10 +03:00
Mikhail Chusavitin
1b2d978d29 test: add real fixture coverage for HPE AHS parser 2026-03-30 13:41:02 +03:00
Mikhail Chusavitin
0f310d57c4 feat: HPE iLO support — profile, post-probe hang fix, replay parser fixes, AHS parser
- Add HPE iLO Redfish profile (priority 20): matches on manufacturer/OEM/iLO signals,
  adds SmartStorage/SmartStorageConfig to critical paths, sets realistic ETA baseline
  and rate policy for iLO's known slowness
- Fix post-probe hang on HPE iLO: skip numeric probing of collections where
  Members@odata.count == len(Members); add 4s postProbeClient timeout as safety net
- Exclude /WorkloadPerformanceAdvisor from crawl paths
- Fix replay parser: skip absent CPU sockets, absent DIMM slots, absent drive bays
- Filter N/A version entries from firmware inventory
- Remove drive firmware from general firmware list (already in Storage[].Firmware)
- Add HPE AHS (.ahs) archive parser with hybrid SMBIOS/Redfish extraction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.5
2026-03-30 13:39:14 +03:00
Mikhail Chusavitin
3547ef9083 Skip placeholder firmware versions in API output v1.11.4 2026-03-26 18:42:54 +03:00
Mikhail Chusavitin
99f0d6217c Improve Multillect Redfish replay and power detection 2026-03-26 18:41:02 +03:00
Mikhail Chusavitin
8acbba3cc9 feat: add reanimator easy bee parser v1.11.3 2026-03-25 19:36:13 +03:00
Mikhail Chusavitin
8942991f0c Add Inspur Group OEM Redfish profile v1.11.2 2026-03-25 15:08:40 +03:00
Mikhail Chusavitin
9b71c4a95f chore: update bible submodule
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.1
2026-03-25 11:22:33 +03:00
Mikhail Chusavitin
125f77ef69 feat: adaptive BMC readiness check + ghost NIC dedup fix + empty collection plan-B retry
BMC readiness after power-on (waitForStablePoweredOnHost):
- After initial 1m stabilization, poll BMC inventory readiness before collecting
- Ready if MemorySummary.TotalSystemMemoryGiB > 0 OR PCIeDevices.Members non-empty
- On failure: wait +60s, retry; on second failure: wait +120s, retry; then warn and proceed
- Configurable via LOGPILE_REDFISH_BMC_READY_WAITS (default: 60s,120s)

Empty critical collection plan-B retry (EnableEmptyCriticalCollectionRetry):
- Hardware inventory collections that returned Members=[] are now re-probed in plan-B
- Covers PCIeDevices, NetworkAdapters, Processors, Drives, Storage, EthernetInterfaces
- Enabled by default in generic profile (applies to all vendors)

Ghost NIC dedup fix (enrichNICsFromNetworkInterfaces):
- NetworkInterface entries (e.g. Id=2) that don't match existing NIC slots are now
  resolved via Links.NetworkAdapter cross-reference to the real Chassis NIC
- Prevents duplicate ghost entries (slot=2 "Network Device View") from appearing
  alongside real NICs (slot="RISER 5 slot 1 (7)") with the same MAC addresses

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 11:19:36 +03:00
Mikhail Chusavitin
063b08d5fb feat: redesign collection UI + add StopHostAfterCollect + TCP ping pre-probe
- Single "Подключиться" button flow: probe first, then show collect options
- Power management checkboxes: power on before / stop after collect
- Modal confirmation when enabling shutdown on already-powered-on host
- StopHostAfterCollect flag: host shuts down only when explicitly requested
- TCP ping (10 attempts, min 3 successes) before Redfish probe
- Debug payloads checkbox (Oem/Ami/Inventory/Crc, off by default)
- Remove platform_config BIOS settings collection (unreliable on AMI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 18:50:01 +03:00
Mikhail Chusavitin
e3ff1745fc feat: clean up collection job log UI
- Filter out debug noise (plan-B per-path lines, heartbeats, timing stats, telemetry)
- Strip server-side nanosecond timestamp prefix from displayed messages
- Transform snapshot progress lines to show current path instead of doc count + ETA
- Humanize recurring message patterns (plan-B summary, prefetch, snapshot total)
- Raw collect.log and raw_export.json are unaffected

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 00:50:13 +03:00
Mikhail Chusavitin
96e65d8f65 feat: Redfish hardware event log collection + MSI ghost GPU filter + inventory improvements
- Collect hardware event logs (last 7 days) from Systems and Managers/SEL LogServices
- Parse AMI raw IPMI dump messages into readable descriptions (Sensor_Type: Event_Type)
- Filter out audit/journal/non-hardware log services; only SEL from Managers
- MSI ghost GPU filter: exclude processor GPU entries with temperature=0 when host is powered on
- Reanimator collected_at uses InventoryData/Status.LastModifiedTime (30-day fallback)
- Invalidate Redfish inventory CRC groups before host power-on
- Log inventory LastModifiedTime age in collection logs
- Drop SecureBoot collection (SecureBootMode, SecureBootDatabases) — not hardware inventory
- Add build version to UI footer via template
- Add MSI Redfish API reference doc to bible-local/docs/

ADL-032–ADL-035

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.0
2026-03-18 23:47:22 +03:00
Mikhail Chusavitin
30409eef67 feat: add xFusion iBMC dump parser (tar.gz format)
Parses xFusion G5500 V7 iBMC diagnostic dump archives with:
- FRU info (board serial, product name, component inventory)
- IPMI sensor readings (temperature, voltage, power, fan, current)
- CPU inventory (model, cores, threads, cache, serial)
- Memory DIMMs (size, speed, type, serial, manufacturer)
- GPU inventory from card_manage/card_info (serial, firmware, ECC counts)
- OCP NIC detection (ConnectX-6 Lx with serial)
- PSU inventory (4x 3000W, serial, firmware, voltage)
- Storage: RAID controller firmware + physical drives (model, serial, endurance)
- iBMC maintenance log events with severity mapping
- Registers as vendor "xfusion" in the parser registry

All 11 fixture tests pass against real G5500 V7 dump archive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.10.2
2026-03-18 15:31:28 +03:00
Mikhail Chusavitin
65e65968cf feat: add xFusion iBMC Redfish profile
Matches on ServiceRootVendor "xFusion" and OEM namespace "xFusion"
(score 90+). Enables GenericGraphicsControllerDedup unconditionally and
ProcessorGPUFallback when GPU-type processors are present in the snapshot
(xFusion G5500 V7 exposes H200s simultaneously in PCIeDevices,
GraphicsControllers, and Processors/Gpu* — all three need dedup).

Without this profile, xFusion fell into fallback mode which activated all
vendor profiles (Supermicro, HGX, MSI, Dell) unnecessarily. Now resolves
to matched mode with targeted acquisition tuning (120k cap, 75s baseline).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.10.1
2026-03-18 15:13:15 +03:00
Mikhail Chusavitin
380c199705 chore: update chart submodule
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 08:50:56 +03:00
Mikhail Chusavitin
d650a6ba1c refactor: unified ingest pipeline + modular Redfish profile framework
Implement the full architectural plan: unified ingest.Service entry point
for archive and Redfish payloads, modular redfishprofile package with
composable profiles (generic, ami-family, msi, supermicro, dell,
hgx-topology), score-based profile matching with fallback expansion mode,
and profile-driven acquisition/analysis plans.

Vendor-specific logic moved out of common executors and into profile hooks.
GPU chassis lookup strategies and known storage recovery collections
(IntelVROC/HA-RAID/MRVL) now live in ResolvedAnalysisPlan, populated by
profiles at analysis time. Replay helpers read from the plan; no hardcoded
path lists remain in generic code.

Also splits redfish_replay.go into domain modules (gpu, storage, inventory,
fru, profiles) and adds full fixture/matcher/directive test coverage
including Dell, AMI, unknown-vendor fallback, and deterministic ordering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.10.0
2026-03-18 08:48:58 +03:00
Mikhail Chusavitin
d8d3d8c524 build: use local go toolchain in release script v1.9.0 2026-03-16 00:32:09 +03:00
Mikhail Chusavitin
057a222288 ui: embed reanimator chart viewer 2026-03-16 00:20:11 +03:00
Mikhail Chusavitin
f11a43f690 export: merge inspur psu sensor groups 2026-03-15 23:29:44 +03:00
Mikhail Chusavitin
476630190d export: align reanimator contract v2.7 2026-03-15 23:27:32 +03:00
Mikhail Chusavitin
9007f1b360 export: align reanimator and enrich redfish metrics 2026-03-15 21:38:28 +03:00
Mikhail Chusavitin
0acdc2b202 docs: refresh project documentation 2026-03-15 16:35:16 +03:00
Mikhail Chusavitin
47bb0ee939 docs: document firmware filter regression pattern in bible (ADL-019)
Root cause analysis for device-bound firmware leaking into hardware.firmware
on Supermicro Redfish (SYS-A21GE-NBRT HGX B200):

- collectFirmwareInventory (6c19a58) had no coverage for Supermicro naming.
  isDeviceBoundFirmwareName checked "gpu " / "nic " (space-terminated) while
  Supermicro uses "GPU1 System Slot0" / "NIC1 System Slot0 ..." (digit suffix).

- 9c5512d added _fw_gpu_ / _fw_nvswitch_ / _inforom_gpu_ patterns to fix HGX,
  but checked DeviceName which contains "Software Inventory" (from Redfish Name),
  not the firmware Id. Dead code from day one.

09-testing.md: add firmware filter worked example and rule #4 — verify the
filter checks the field that the collector actually populates.

10-decisions.md: ADL-019 — isDeviceBoundFirmwareName must be extended per
vendor with a test case per vendor format before shipping.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 14:03:47 +03:00
Mikhail Chusavitin
5815100e2f exporter: filter Supermicro Redfish device-bound firmware from hardware.firmware
isDeviceBoundFirmwareName did not catch Supermicro FirmwareInventory naming
conventions where a digit follows the type prefix directly ("GPU1 System Slot0",
"NIC1 System Slot0 AOM-DP805-IO") instead of a space. Also missing: "Power supply N",
"NVMeController N", and "Software Inventory" (generic label for all HGX per-component
firmware slots — GPU, NVSwitch, PCIeRetimer, ERoT, InfoROM, etc.).

On SYS-A21GE-NBRT (HGX B200) this caused 29 device-bound entries to leak into
hardware.firmware: 8 GPU, 9 NIC, 1 NVMe, 6 PSU, 4 PCIeSwitch, 1 Software Inventory.

Fix: extend isDeviceBoundFirmwareName with patterns for all four new cases.
Add TestIsDeviceBoundFirmwareName covering both excluded and kept entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 13:48:55 +03:00