170 Commits

Author SHA1 Message Date
Mikhail Chusavitin
4ce0251ce4 docs(bible-local): add backlog with sfp_modules implementation plan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 18:39:28 +03:00
Mikhail Chusavitin
994d46f3b3 docs: update hardware ingest contract to v2.11
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 18:28:38 +03:00
Mikhail Chusavitin
ee3e8a6e33 docs: release notes for v1.23
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 15:21:35 +03:00
Mikhail Chusavitin
e2c81758b5 fix(parser): raise file size limit for .ahs to 1 GB
AHS files can exceed 100 MB; the previous 10 MB universal cap silently
truncated them and caused incomplete event parsing. Per-extension limits
are now used: .ahs gets 1 GB, all other single-file types keep 10 MB.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.23
2026-06-19 15:20:57 +03:00
Mikhail Chusavitin
6b52a1876f docs: release notes for v1.22
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 15:13:42 +03:00
Mikhail Chusavitin
3e3c48bc08 feat(hpe-ilo): parse AHS files, fix event logs export, add logs CSV export
- HPE iLO AHS parser: handle truncated last entry gracefully, recognize
  Alletra product line, expand event type/severity inference, trim iLO
  frame separators from event messages
- Fix event_logs always 0 in Reanimator export: normalizeEventLogSource
  now maps "HPE iLO" → "bmc"
- Fix chart JS not loading in LOGPile: rewriteChartStaticPaths now also
  rewrites src="/static/view.js" → /chart/static/view.js
- Add "Logs Export" button (CSV, semicolon-delimited, UTF-8 BOM) and
  remove PDF button
- Fix collector test broken by pciids rename of Intel VMD device
- Update submodules: chart v2.7, pciids, bible

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.22
2026-06-19 15:11:46 +03:00
Mikhail Chusavitin
cd864c3d6c docs: release notes for v1.21
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.21
2026-06-15 17:56:17 +03:00
Mikhail Chusavitin
5128ac5303 fix(server): remove PrintMode from RenderOptions after chart submodule update
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 17:53:55 +03:00
Mikhail Chusavitin
53cda82c79 chore: update submodules (bible, chart, pciids)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 17:52:51 +03:00
Mikhail Chusavitin
a18d8fe648 feat(inspur): enrich storage serials from SOLHostCapture.log smartd output
When the BMC HDD API returns an empty array (RAID controller attached via
PCIe, e.g. PM8204-2GB), disk serial numbers are now recovered from smartd
startup messages in SOLHostCapture.log.

Enrichment runs in three passes: model-match on existing slots, positional
fill of empty backplane placeholders, then new entries for any remainder.
Both log/ and runningdata/var/ copies are merged with serial deduplication.

Parser version bumped to 2.1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 17:52:07 +03:00
6ab0f4eb20 chore: update bible submodule
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.20
2026-06-13 14:36:53 +03:00
57de3ba6eb chore: align codebase with bible engineering contracts
- identifier-normalization: use strings.EqualFold in h3c/parser.go
- import-export: CSV now uses UTF-8 BOM and semicolon delimiter
- go-code-style: translate all Russian source strings to English (ADL-007)
- go-background-tasks: add Type, Message, Result fields to Job struct
- go-api: wrap list endpoints in {items, total_count, page, per_page, total_pages}
- module-structure: rename helpers.go → context_sleep.go
- build-version-display: htmlError renders version footer on error pages
- go-logging: migrate all log.Printf calls to log/slog with structured attrs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 14:35:39 +03:00
47ff1c3796 fix(inspur): detect standard Inspur servers via SystemManufacturer
NF-series storage servers (e.g. NF5280M6) have no GPU/outboard-PCIe
topology, so the previous score gate (topologyScore==0 || boardScore==0
→ return 0) always produced score=0 despite SystemManufacturer="Inspur"
being available. These servers fell into mode=fallback, activating the
AMI profile and probing /Oem/Ami paths that don't exist on the BMC.

Add manufacturer-based detection: SystemManufacturer or
ChassisManufacturer containing "inspur" contributes 60 points —
enough to enter matched mode on its own. GPU servers with full
topology+board signals still score higher as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.19
2026-06-12 03:58:55 +03:00
1c4a3b0c09 feat(ui): add PDF export via browser print
Adds a PDF button to the report header. Clicking it opens
/chart/current?print=true in a new tab, which auto-triggers
window.print() so the user can save to PDF via the browser dialog.

- chart submodule bumped: PrintMode support (no filter JS, auto-print,
  @media print CSS)
- handlers.go: passes PrintMode=true when ?print=true query param is set
- index.html: PDF button alongside Raw Data / Reanimator
- app.js: printReport() helper; button shown/hidden with other exports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 12:24:01 +03:00
10c381c8ec chore: update bible and pci.ids submodules
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 03:40:15 +03:00
440959483e fix(inspur): correctly handle PCIe Assert/Deassert GPU fault events
Three related fixes for IDL event processing:

1. idl.go: include EventType in dedup key so Deassert events are no
   longer silently dropped as duplicates of their Assert counterparts.

2. gpu_status.go: treat Deassert events as clearing all GPU faults —
   previously the code re-applied the same faulty GPU set from the
   description, leaving GPUs stuck in Critical even after alarm cleared.

3. reanimator_models/converter: add bmc_event_summary section to the
   Reanimator export — a deduplicated Critical/Warning event table with
   Active/Resolved status derived from Assert/Deassert pairs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.18
2026-05-28 03:38:04 +03:00
Mikhail Chusavitin
f3836a34cc chore: update chart submodule to v2.0 and refresh pci.ids (2026-05-21)
chart: feat(viewer): replace severity dropdown with per-column header filters
pci.ids: 2026-02-17 → 2026-05-21

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.17
2026-05-21 14:37:26 +03:00
Mikhail Chusavitin
ba9a52a61a fix(ui): parse-errors panel full width
Removed max-width/padding constraints — panel now stretches to grid
column width like the viewer-panel above it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:32:47 +03:00
Mikhail Chusavitin
27373aa104 feat: surface BMC collection errors in parse-errors panel and event log
When Inspur component.log sections return {"error":"...","code":N} instead
of hardware data, the parser now:
- stores them in AnalysisResult.CollectionErrors (new model field)
- mirrors each one into result.Events with Source="BMC/<section>"
  so the chart viewer event table shows the specific BMC module
- feeds them into /api/parse-errors as bmc_collection_error entries

UI adds a collapsible "Collection diagnostics" panel below the chart
iframe (outside /chart) that appears when /api/parse-errors returns
any items; resets on data clear.

Affected sections in this dump: HDD (1458), PCIe Devices (1458),
Network Adapters (1458), Disk Backplane.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:30:01 +03:00
Mikhail Chusavitin
4f7b5b826a fix(inspur): fix PSU section regex when PCIE section precedes Network
The PSU regex used "RESTful Network" as its end anchor, but in standard
Inspur component.log layout the PCIE Device section sits between PSU and
Network Adapter. The lazy [\s\S]*? captured across the PCIE error block,
producing invalid JSON and silently dropping all PSU data.

Changed anchor to RESTful (?:PCIE|Network) — matches whichever section
immediately follows PSU in a given archive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:21:13 +03:00
Mikhail Chusavitin
dfd64550cf fix(inspur): infer DIMM size from part number when BMC reports size=0
When BMC firmware fails to read capacity for a present DIMM, size_mb stays
0. If another DIMM with the same part number in the same batch has a known
size, use it to fill the gap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:18:15 +03:00
Mikhail Chusavitin
9505303d1d fix(inspur): show microcode version for every CPU, not just the first
Dedup by version caused CPU1 Microcode to be omitted when both CPUs run
the same version, leaving the firmware column blank for the second socket.
Each CPU gets its own firmware entry keyed by index.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:15:28 +03:00
Mikhail Chusavitin
f2c04cf0e8 fix(inspur): parse CPUs from component.log and fix DIMM present detection
Two bugs in onekeylog archives that lack asset.json:

- CPU count was always 0: ParseComponentLog never parsed the "RESTful CPU
  info" section. Added parseCPUInfo as a fallback when hw.CPUs is empty
  (asset.json remains the primary source when present). Also worked around
  a Go JSON case-insensitive collision between "proc_id" (int) and
  "PROC_ID" (string CPUID) by adding an explicit PROC_ID field with an
  exact-case tag.

- Only 1 of 2 DIMMs shown: Present condition required mem_mod_size > 0,
  but some BMC firmware reports size=0 for a physically installed module
  while still providing serial and part number. Now treats a DIMM as
  present when status=1 and any of size/serial/partnum is non-empty.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 14:13:55 +03:00
Mikhail Chusavitin
ca457ac72b fix(exporter): propagate iommu_group through PCIe export pipeline
IOMMUGroup was added to models.PCIeDevice but never wired into the
converter — missing from Details in buildDevicesFromLegacy, no field
in ReanimatorPCIe, and convertPCIeFromDevices never read it.

Add IOMMUGroup *int to ReanimatorPCIe, propagate through Details,
add intPtrFromDetailMap helper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.16
2026-04-30 16:05:30 +03:00
Mikhail Chusavitin
78d0e26fd0 feat: sync with hardware ingest contract v2.10
- PCIeDevice: add model, firmware, present, iommu_group, telemetry fields
  (temperature_c, power_w, ecc_corrected_total, ecc_uncorrected_total,
  hw_slowdown) — were silently dropped on JSON parse, breaking bee audit display
- buildDevicesFromLegacy: use pcie.Model as fallback (PartNumber > Model >
  Description), copy MACAddresses/Present/Firmware, propagate telemetry into
  Details so convertPCIeFromDevices picks them up
- Storage: add logical_block_size_bytes, physical_block_size_bytes,
  metadata_bytes_per_block (contract v2.10, 2026-04-29) to models, exporter
  struct and converter pipeline
- ReanimatorHardware: add platform_config map[string]any (contract v2.9)
- Update internal/chart submodule to v2.0 (contract 2.10 viewer support:
  event_logs section, platform_config section, storage block size columns)
- Update bible submodule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.15
2026-04-30 15:55:26 +03:00
Mikhail Chusavitin
88e4e8dd49 Trim noisy Lenovo Redfish collection paths 2026-04-30 15:55:26 +03:00
Mikhail Chusavitin
cf9cf5d0cf Improve Lenovo XCC inventory enrichment 2026-04-30 15:55:26 +03:00
aba7a54990 feat(parser): lenovo xcc vroc volume parsing - v1.2
Parse inventory_volume.log: Intel VROC (VMD) RAID volumes including
RAID level, capacity (GiB/TiB support added), status and member drives.
Add Drives []string to StorageVolume model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 17:00:27 +03:00
835df2676c docs: defer generic IPMI live collector
Closes #12
2026-04-22 20:51:58 +03:00
b86d51c921 chore: close completed Redfish profile framework issue
Fixes #14
2026-04-22 20:45:42 +03:00
Mikhail Chusavitin
a82fb227e5 submodule update 2026-04-16 15:33:48 +03:00
c9969fc3da feat(parser): lenovo xcc warnings and redfish logs - v1.1 v1.12 2026-04-13 20:34:04 +03:00
89b6701f43 feat(parser): add Lenovo XCC mini-log parser 2026-04-13 20:20:37 +03:00
b04877549a feat(collector): add Lenovo XCC profile to skip noisy snapshot paths
Lenovo ThinkSystem SR650 V3 (and similar XCC-based servers) caused
collection runs of 23+ minutes because the BMC exposes two large high-
error-rate subtrees in the snapshot BFS:

  - Chassis/1/Sensors: 315 individual sensor members, 282/315 failing,
    ~3.7s per request → ~19 minutes wasted. These documents are never
    read by any LOGPile parser (thermal/power data comes from aggregate
    Chassis/*/Thermal and Chassis/*/Power endpoints).

  - Chassis/1/Oem/Lenovo: 75 requests (LEDs×47, Slots×26, etc.),
    68/75 failing → 8+ minutes wasted on non-inventory data.

Add a Lenovo profile (matched on SystemManufacturer/OEMNamespace "Lenovo")
that sets SnapshotExcludeContains to block individual sensor documents and
non-inventory Lenovo OEM subtrees from the snapshot BFS queue. Also sets
rate policy thresholds appropriate for XCC BMC latency (p95 often 3-5s).

Add SnapshotExcludeContains []string to AcquisitionTuning and check it
in the snapshot enqueue closure in redfish.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.15
2026-04-13 19:29:04 +03:00
8ca173c99b fix(exporter): preserve all HGX GPUs with generic PCIe slot name
Supermicro HGX BMC reports all 8 B200 GPU PCIe devices with Name
"PCIe Device" — a generic label shared by every GPU, not a unique
hardware position. pcieDedupKey used slot as the primary key, so all
8 GPUs collapsed to one entry in the UI (the first, serial 1654925165720).

Add isGenericPCIeSlotName to detect non-positional slot labels and fall
through to serial/BDF for dedup instead, preserving each GPU separately.
Positional slots (#GPU0, SLOT-NIC1, etc.) continue to use slot-first dedup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:05:49 +03:00
f19a3454fa fix(redfish): gate hgx diagnostic plan-b by debug toggle v1.11.14 2026-04-13 14:45:41 +03:00
Mikhail Chusavitin
becdca1d7e fix(redfish): read PCIeInterface link width for GPU PCIe devices
parseGPUWithSupplementalDocs did not read PCIeInterface from the device
doc, only from function docs. xFusion GPU PCIeCard entries carry link
width/speed in PCIeInterface (LanesInUse/Maxlanes/PCIeType/MaxPCIeType)
so GPU link width was always empty for xFusion servers.

Also apply the xFusion OEM function-level fallback for GPU function docs,
consistent with the NIC and PCIeDevice paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.13
2026-04-12 13:35:29 +03:00
Mikhail Chusavitin
e10440ae32 fix(redfish): collect PCIe link width from xFusion servers
xFusion iBMC exposes PCIe link width in two non-standard ways:
- PCIeInterface uses "Maxlanes" (lowercase 'l') instead of "MaxLanes"
- PCIeFunction docs carry width/speed in Oem.xFusion.LinkWidth ("X8"),
  Oem.xFusion.LinkWidthAbility, Oem.xFusion.LinkSpeed, and
  Oem.xFusion.LinkSpeedAbility rather than the standard CurrentLinkWidth int

Add redfishEnrichFromOEMxFusionPCIeLink and parseXFusionLinkWidth helpers,
apply them as fallbacks in NIC and PCIeDevice enrichment paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:35:29 +03:00
5c2a21aff1 chore: update bible and chart submodules
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.12
2026-04-11 12:17:40 +03:00
Mikhail Chusavitin
9df13327aa feat(collect): remove power-on/off, add skip-hung for Redfish collection
Remove power-on and power-off functionality from the Redfish collector;
keep host power-state detection and show a warning in the UI when the
host is powered off before collection starts.

Add a "Пропустить зависшие" (skip hung) button that lets the user abort
stuck Redfish collection phases without losing already-collected data.
Introduces a two-level context model in Collect(): the outer job context
covers the full lifecycle including replay; an inner collectCtx covers
snapshot, prefetch, and plan-B phases only. Closing the skipCh cancels
collectCtx immediately — aborts all in-flight HTTP requests and exits
plan-B loops — then replay runs on whatever rawTree was collected.

Signal path: UI → POST /api/collect/{id}/skip → JobManager.SkipJob()
→ close(skipCh) → goroutine in Collect() → cancelCollect().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:12:38 +03:00
Mikhail Chusavitin
7e9af89c46 Add xFusion file-export parser support v1.11.11 2026-04-04 15:07:10 +03:00
Mikhail Chusavitin
db74df9994 fix(redfish): trim MSI replay noise and unify NIC classes v1.11.10 2026-04-01 17:49:00 +03:00
Mikhail Chusavitin
bb82387d48 fix(redfish): narrow MSI PCIeFunctions crawl 2026-04-01 16:50:51 +03:00
Mikhail Chusavitin
475f6ac472 fix(export): keep storage inventory without serials v1.11.9 2026-04-01 16:50:19 +03:00
Mikhail Chusavitin
93ce676f04 fix(redfish): recover MSI NIC serials from PCIe functions 2026-04-01 15:48:47 +03:00
Mikhail Chusavitin
c47c34fd11 feat(hpe): improve inventory extraction and export fidelity v1.11.8 v1.11.7 2026-03-30 15:04:17 +03:00
Mikhail Chusavitin
d8c3256e41 chore(hpe_ilo_ahs): normalize module version format — v1.0 v1.11.6 2026-03-30 13:43:10 +03:00
Mikhail Chusavitin
1b2d978d29 test: add real fixture coverage for HPE AHS parser 2026-03-30 13:41:02 +03:00
Mikhail Chusavitin
0f310d57c4 feat: HPE iLO support — profile, post-probe hang fix, replay parser fixes, AHS parser
- Add HPE iLO Redfish profile (priority 20): matches on manufacturer/OEM/iLO signals,
  adds SmartStorage/SmartStorageConfig to critical paths, sets realistic ETA baseline
  and rate policy for iLO's known slowness
- Fix post-probe hang on HPE iLO: skip numeric probing of collections where
  Members@odata.count == len(Members); add 4s postProbeClient timeout as safety net
- Exclude /WorkloadPerformanceAdvisor from crawl paths
- Fix replay parser: skip absent CPU sockets, absent DIMM slots, absent drive bays
- Filter N/A version entries from firmware inventory
- Remove drive firmware from general firmware list (already in Storage[].Firmware)
- Add HPE AHS (.ahs) archive parser with hybrid SMBIOS/Redfish extraction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v1.11.5
2026-03-30 13:39:14 +03:00
Mikhail Chusavitin
3547ef9083 Skip placeholder firmware versions in API output v1.11.4 2026-03-26 18:42:54 +03:00