3 Commits

Author SHA1 Message Date
b04877549a feat(collector): add Lenovo XCC profile to skip noisy snapshot paths
Lenovo ThinkSystem SR650 V3 (and similar XCC-based servers) caused
collection runs of 23+ minutes because the BMC exposes two large
high-error-rate subtrees in the snapshot BFS:

  - Chassis/1/Sensors: 315 individual sensor members, 282/315 failing,
    ~3.7s per request → ~19 minutes wasted. These documents are never
    read by any LOGPile parser (thermal/power data comes from aggregate
    Chassis/*/Thermal and Chassis/*/Power endpoints).

  - Chassis/1/Oem/Lenovo: 75 requests (LEDs×47, Slots×26, etc.),
    68/75 failing → 8+ minutes wasted on non-inventory data.

Add a Lenovo profile (matched on SystemManufacturer/OEMNamespace "Lenovo")
that sets SnapshotExcludeContains to block individual sensor documents and
non-inventory Lenovo OEM subtrees from the snapshot BFS queue. Also sets
rate policy thresholds appropriate for XCC BMC latency (p95 often 3-5s).

Add SnapshotExcludeContains []string to AcquisitionTuning and check it
in the snapshot enqueue closure in redfish.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 19:29:04 +03:00
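
A minimal sketch of the substring-based exclusion this commit describes. The function name `excludedFromSnapshot` is hypothetical; the patterns and the `101L1` sensor member are taken from this change set, while the LED member path is made up for illustration. The sketch assumes paths are normalized without a trailing slash.

```go
package main

import (
	"fmt"
	"strings"
)

// excludedFromSnapshot reports whether path contains any non-empty pattern.
// Empty patterns are ignored so they cannot match every path.
func excludedFromSnapshot(path string, patterns []string) bool {
	for _, p := range patterns {
		if p != "" && strings.Contains(path, p) {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"/Sensors/", "/Oem/Lenovo/LEDs/", "/Oem/Lenovo/Slots/"}
	paths := []string{
		// Collection document: no trailing slash, so "/Sensors/" does not match.
		"/redfish/v1/Chassis/1/Sensors",
		// Individual sensor member: matches "/Sensors/".
		"/redfish/v1/Chassis/1/Sensors/101L1",
		// Hypothetical LED member path: matches "/Oem/Lenovo/LEDs/".
		"/redfish/v1/Chassis/1/Oem/Lenovo/LEDs/3",
		// Aggregate thermal endpoint: matches nothing.
		"/redfish/v1/Chassis/1/Thermal",
	}
	for _, path := range paths {
		fmt.Printf("%s excluded=%v\n", path, excludedFromSnapshot(path, patterns))
	}
}
```

Because each pattern ends with a slash, only member documents are dropped; under the no-trailing-slash assumption the collection document itself is still crawled.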
8ca173c99b fix(exporter): preserve all HGX GPUs with generic PCIe slot name
Supermicro HGX BMC reports all 8 B200 GPU PCIe devices with Name
"PCIe Device" — a generic label shared by every GPU, not a unique
hardware position. pcieDedupKey used slot as the primary key, so all
8 GPUs collapsed to one entry in the UI (the first, serial 1654925165720).

Add isGenericPCIeSlotName to detect non-positional slot labels and fall
through to serial/BDF for dedup instead, preserving each GPU separately.
Positional slots (#GPU0, SLOT-NIC1, etc.) continue to use slot-first dedup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 16:05:49 +03:00
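
The dedup rule described in this commit can be sketched as follows. The `device` type and `dedupKey` function are simplified stand-ins, not the exporter's real types. The `serial:` key prefix is an assumption, since the corresponding line is not visible in the diff, and the final fallback here is simplified compared with the real function.

```go
package main

import (
	"fmt"
	"strings"
)

// device is a hypothetical, reduced stand-in for the exporter's PCIe item.
type device struct {
	Slot, Serial, BDF string
}

// isGenericSlot reports whether a lower-cased slot label names a device
// type rather than a hardware position (labels taken from this change).
func isGenericSlot(slot string) bool {
	switch slot {
	case "pcie device", "pcie slot", "pcie":
		return true
	}
	return false
}

// dedupKey prefers a positional slot, then serial, then BDF, and only
// falls back to a generic slot label when nothing else is available.
func dedupKey(d device) string {
	slot := strings.ToLower(strings.TrimSpace(d.Slot))
	serial := strings.ToLower(strings.TrimSpace(d.Serial))
	bdf := strings.ToLower(strings.TrimSpace(d.BDF))
	if slot != "" && !isGenericSlot(slot) {
		return "slot:" + slot
	}
	if serial != "" {
		return "serial:" + serial // prefix is an assumption
	}
	if bdf != "" {
		return "bdf:" + bdf
	}
	return "slot:" + slot
}

func main() {
	devices := []device{
		{Slot: "PCIe Device", Serial: "1654925165720"},
		{Slot: "PCIe Device", Serial: "1654925166160"},
		{Slot: "#GPU0", Serial: "A"},
		{Slot: "#GPU0", Serial: "B"},
	}
	unique := make(map[string]struct{})
	for _, d := range devices {
		unique[dedupKey(d)] = struct{}{}
	}
	// Two generic-slot GPUs stay separate; the two "#GPU0" entries collapse.
	fmt.Println("unique keys:", len(unique)) // 3
}
```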
f19a3454fa fix(redfish): gate hgx diagnostic plan-b by debug toggle 2026-04-13 14:45:41 +03:00
12 changed files with 316 additions and 18 deletions


@@ -58,6 +58,7 @@ Responses:
Optional request fields:
- `power_on_if_host_off`: when `true`, Redfish collection may power on the host before collection if preflight found it powered off
- `debug_payloads`: when `true`, collector keeps extra diagnostic payloads and enables extended plan-B retries for slow HGX component inventory branches (`Assembly`, `Accelerators`, `Drives`, `NetworkAdapters`, `PCIeDevices`)
### `POST /api/collect/probe`


@@ -27,6 +27,7 @@ Request fields passed from the server:
- credential field (`password` or token)
- `tls_mode`
- optional `power_on_if_host_off`
- optional `debug_payloads` for extended diagnostics
### Core rule
@@ -57,6 +58,17 @@ closes `skipCh` → goroutine in `Collect()` → `cancelCollect()`.
The skip button is visible during `running` state and hidden once the job reaches a terminal state.
### Extended diagnostics toggle
The live collect form exposes a user-facing checkbox for extended diagnostics.
- default collection prioritizes inventory completeness and bounded runtime
- when extended diagnostics is off, heavy HGX component-chassis critical plan-B retries
(`Assembly`, `Accelerators`, `Drives`, `NetworkAdapters`, `PCIeDevices`) are skipped
- when extended diagnostics is on, those retries are allowed and extra debug payloads are collected
This toggle is intended for operator-driven deep diagnostics on problematic hosts, not for the default path.
### Discovery model
The collector does not rely on one fixed vendor tree.


@@ -1120,3 +1120,20 @@ incomplete for UI and Reanimator consumers.
- System firmware such as BIOS and iBMC versions survives xFusion file exports.
- xFusion archives participate more reliably in canonical device/export flows without special UI
cases.
---
## ADL-043 — Extended HGX diagnostic plan-B is opt-in from the live collect form
**Date:** 2026-04-13
**Context:** Some Supermicro HGX Redfish targets expose slow or hanging component-chassis inventory
collections during critical plan-B, especially under `Chassis/HGX_*` for `Assembly`,
`Accelerators`, `Drives`, `NetworkAdapters`, and `PCIeDevices`. Default collection should not
block operators on deep diagnostic retries that are useful mainly for troubleshooting.
**Decision:** Keep the normal snapshot/replay path unchanged, but gate those heavy HGX
component-chassis critical plan-B retries behind the existing live-collect `debug_payloads` flag,
presented in the UI as "Сбор расширенных данных для диагностики" ("Collect extended data for diagnostics").
**Consequences:**
- Default live collection skips those heavy diagnostic plan-B retries and reaches replay faster.
- Operators can explicitly opt into the slower diagnostic path when they need deeper collection.
- The same user-facing toggle continues to enable extra debug payload capture for troubleshooting.
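
The effect of this decision on a plan-B target list can be sketched as below. This is a simplified illustration, not the collector's implementation: `isHeavyHGXPath` and `planTargets` are hypothetical names, path normalization is skipped, and the sample paths are borrowed from the accompanying tests.

```go
package main

import (
	"fmt"
	"strings"
)

// heavySuffixes lists the collections named in the decision above.
var heavySuffixes = []string{"/Accelerators", "/Assembly", "/Drives", "/NetworkAdapters", "/PCIeDevices"}

// isHeavyHGXPath reports whether path is one of the heavy collections
// under a Chassis/HGX_* member.
func isHeavyHGXPath(path string) bool {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) < 5 || parts[0] != "redfish" || parts[1] != "v1" || parts[2] != "Chassis" {
		return false
	}
	if !strings.HasPrefix(parts[3], "HGX_") {
		return false
	}
	for _, s := range heavySuffixes {
		if strings.HasSuffix(path, s) {
			return true
		}
	}
	return false
}

// planTargets returns the paths that will be retried and the number of
// heavy paths skipped because extended diagnostics is off.
func planTargets(candidates []string, debugPayloads bool) (targets []string, skipped int) {
	for _, p := range candidates {
		if !debugPayloads && isHeavyHGXPath(p) {
			skipped++
			continue
		}
		targets = append(targets, p)
	}
	return targets, skipped
}

func main() {
	candidates := []string{
		"/redfish/v1/Chassis/HGX_ERoT_NVSwitch_0/PCIeDevices",
		"/redfish/v1/Chassis/HGX_Chassis_0/Assembly",
		"/redfish/v1/Chassis/1/PCIeDevices",
		"/redfish/v1/Systems/HGX_Baseboard_0/Processors",
	}
	for _, debug := range []bool{false, true} {
		targets, skipped := planTargets(candidates, debug)
		fmt.Printf("debug_payloads=%v targets=%d skipped=%d\n", debug, len(targets), skipped)
	}
}
```

Only the two `Chassis/HGX_*` collections are affected; the standard chassis and the `Systems/HGX_*` path are retried in both modes.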


@@ -345,8 +345,9 @@ func (c *RedfishConnector) Collect(ctx context.Context, req Request, emit Progre
"manager_critical_suffixes": acquisitionPlan.ScopedPaths.ManagerCriticalSuffixes,
},
"tuning": map[string]any{
- "snapshot_max_documents": acquisitionPlan.Tuning.SnapshotMaxDocuments,
- "snapshot_workers": acquisitionPlan.Tuning.SnapshotWorkers,
+ "snapshot_max_documents": acquisitionPlan.Tuning.SnapshotMaxDocuments,
+ "snapshot_workers": acquisitionPlan.Tuning.SnapshotWorkers,
+ "snapshot_exclude_contains": acquisitionPlan.Tuning.SnapshotExcludeContains,
"prefetch_workers": acquisitionPlan.Tuning.PrefetchWorkers,
"prefetch_enabled": boolPointerValue(acquisitionPlan.Tuning.PrefetchEnabled),
"nvme_post_probe": boolPointerValue(acquisitionPlan.Tuning.NVMePostProbeEnabled),
@@ -496,7 +497,6 @@ func (c *RedfishConnector) Collect(ctx context.Context, req Request, emit Progre
return result, nil
}
// collectDebugPayloads fetches vendor-specific diagnostic endpoints on a best-effort basis.
// Results are stored in rawPayloads["redfish_debug_payloads"] and exported with the bundle.
// Enabled only when Request.DebugPayloads is true.
@@ -511,7 +511,6 @@ func (c *RedfishConnector) collectDebugPayloads(ctx context.Context, client *htt
return out
}
func firstNonEmptyPath(paths []string, fallback string) string {
for _, p := range paths {
if strings.TrimSpace(p) != "" {
@@ -543,7 +542,6 @@ func redfishSystemPowerState(systemDoc map[string]interface{}) string {
return ""
}
func (c *RedfishConnector) postJSON(ctx context.Context, client *http.Client, req Request, baseURL, resourcePath string, payload map[string]any) error {
body, err := json.Marshal(payload)
if err != nil {
@@ -1346,6 +1344,11 @@ func (c *RedfishConnector) collectRawRedfishTree(ctx context.Context, client *ht
if !shouldCrawlPath(path) {
return
}
for _, pattern := range tuning.SnapshotExcludeContains {
if pattern != "" && strings.Contains(path, pattern) {
return
}
}
mu.Lock()
if len(seen) >= maxDocuments {
mu.Unlock()
@@ -2299,7 +2302,6 @@ func redfishCriticalSlowGap() time.Duration {
return 1200 * time.Millisecond
}
func redfishSnapshotMemoryRequestTimeout() time.Duration {
if v := strings.TrimSpace(os.Getenv("LOGPILE_REDFISH_MEMORY_TIMEOUT")); v != "" {
if d, err := time.ParseDuration(v); err == nil && d > 0 {
@@ -2878,11 +2880,16 @@ func (c *RedfishConnector) recoverCriticalRedfishDocsPlanB(ctx context.Context,
timings := newRedfishPathTimingCollector(4)
var targets []string
seenTargets := make(map[string]struct{})
skippedDiagnosticTargets := 0
addTarget := func(path string) {
path = normalizeRedfishPath(path)
if path == "" {
return
}
if !shouldIncludeCriticalPlanBPath(req, path) {
skippedDiagnosticTargets++
return
}
if _, ok := seenTargets[path]; ok {
return
}
@@ -2968,6 +2975,13 @@ func (c *RedfishConnector) recoverCriticalRedfishDocsPlanB(ctx context.Context,
return 0
}
if emit != nil {
if skippedDiagnosticTargets > 0 {
emit(Progress{
Status: "running",
Progress: 97,
// English: "Redfish: extended diagnostics are off, skipped %d heavy diagnostic endpoints"
Message: fmt.Sprintf("Redfish: расширенная диагностика выключена, пропущено %d тяжелых diagnostic endpoint", skippedDiagnosticTargets),
})
}
totalETA := redfishCriticalCooldown() + estimatePlanBETA(len(targets))
emit(Progress{
Status: "running",
@@ -3073,6 +3087,39 @@ func (c *RedfishConnector) recoverCriticalRedfishDocsPlanB(ctx context.Context,
return recovered
}
func shouldIncludeCriticalPlanBPath(req Request, path string) bool {
if req.DebugPayloads {
return true
}
return !isExtendedDiagnosticCriticalPlanBPath(path)
}
func isExtendedDiagnosticCriticalPlanBPath(path string) bool {
path = normalizeRedfishPath(path)
if path == "" {
return false
}
parts := strings.Split(strings.Trim(path, "/"), "/")
if len(parts) < 5 || parts[0] != "redfish" || parts[1] != "v1" || parts[2] != "Chassis" {
return false
}
if !strings.HasPrefix(parts[3], "HGX_") {
return false
}
for _, suffix := range []string{
"/Accelerators",
"/Assembly",
"/Drives",
"/NetworkAdapters",
"/PCIeDevices",
} {
if strings.HasSuffix(path, suffix) {
return true
}
}
return false
}
func (c *RedfishConnector) recoverProfilePlanBDocs(ctx context.Context, client *http.Client, req Request, baseURL string, plan redfishprofile.AcquisitionPlan, rawTree map[string]interface{}, emit ProgressFn) int {
if len(plan.PlanBPaths) == 0 || plan.Mode == redfishprofile.ModeFallback || !plan.Tuning.RecoveryPolicy.EnableProfilePlanB {
return 0


@@ -0,0 +1,57 @@
package collector
import "testing"
func TestShouldIncludeCriticalPlanBPath(t *testing.T) {
tests := []struct {
name string
req Request
path string
want bool
}{
{
name: "skip hgx erot pcie without extended diagnostics",
req: Request{},
path: "/redfish/v1/Chassis/HGX_ERoT_NVSwitch_0/PCIeDevices",
want: false,
},
{
name: "skip hgx chassis assembly without extended diagnostics",
req: Request{},
path: "/redfish/v1/Chassis/HGX_Chassis_0/Assembly",
want: false,
},
{
name: "keep standard chassis inventory without extended diagnostics",
req: Request{},
path: "/redfish/v1/Chassis/1/PCIeDevices",
want: true,
},
{
name: "keep nvme storage backplane drives without extended diagnostics",
req: Request{},
path: "/redfish/v1/Chassis/NVMeSSD.0.Group.0.StorageBackplane/Drives",
want: true,
},
{
name: "keep system processors without extended diagnostics",
req: Request{},
path: "/redfish/v1/Systems/HGX_Baseboard_0/Processors",
want: true,
},
{
name: "include hgx erot pcie when extended diagnostics enabled",
req: Request{DebugPayloads: true},
path: "/redfish/v1/Chassis/HGX_ERoT_NVSwitch_0/PCIeDevices",
want: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if got := shouldIncludeCriticalPlanBPath(tt.req, tt.path); got != tt.want {
t.Fatalf("shouldIncludeCriticalPlanBPath(%q) = %v, want %v", tt.path, got, tt.want)
}
})
}
}


@@ -326,6 +326,47 @@ func TestBuildAnalysisDirectives_SupermicroEnablesStorageRecovery(t *testing.T)
}
}
func TestMatchProfiles_LenovoXCCSelectsMatchedModeAndExcludesSensors(t *testing.T) {
match := MatchProfiles(MatchSignals{
SystemManufacturer: "Lenovo",
ChassisManufacturer: "Lenovo",
OEMNamespaces: []string{"Lenovo"},
})
if match.Mode != ModeMatched {
t.Fatalf("expected matched mode, got %q", match.Mode)
}
found := false
for _, profile := range match.Profiles {
if profile.Name() == "lenovo" {
found = true
break
}
}
if !found {
t.Fatal("expected lenovo profile to be selected")
}
// Verify the acquisition plan excludes noisy Lenovo-specific snapshot paths.
plan := BuildAcquisitionPlan(MatchSignals{
SystemManufacturer: "Lenovo",
ChassisManufacturer: "Lenovo",
OEMNamespaces: []string{"Lenovo"},
})
wantExcluded := []string{"/Sensors/", "/Oem/Lenovo/LEDs/", "/Oem/Lenovo/Slots/"}
for _, want := range wantExcluded {
found := false
for _, ex := range plan.Tuning.SnapshotExcludeContains {
if ex == want {
found = true
break
}
}
if !found {
t.Errorf("expected SnapshotExcludeContains to include %q, got %v", want, plan.Tuning.SnapshotExcludeContains)
}
}
}
func TestMatchProfiles_OrderingIsDeterministic(t *testing.T) {
signals := MatchSignals{
SystemManufacturer: "Micro-Star International Co., Ltd.",


@@ -0,0 +1,65 @@
package redfishprofile
func lenovoProfile() Profile {
return staticProfile{
name: "lenovo",
priority: 20,
safeForFallback: true,
matchFn: func(s MatchSignals) int {
score := 0
if containsFold(s.SystemManufacturer, "lenovo") ||
containsFold(s.ChassisManufacturer, "lenovo") {
score += 80
}
for _, ns := range s.OEMNamespaces {
if containsFold(ns, "lenovo") {
score += 30
break
}
}
// Lenovo XClarity Controller (XCC) is the BMC product line.
if containsFold(s.ServiceRootProduct, "xclarity") ||
containsFold(s.ServiceRootProduct, "xcc") {
score += 30
}
return min(score, 100)
},
extendAcquisition: func(plan *AcquisitionPlan, _ MatchSignals) {
// Lenovo XCC BMC exposes Chassis/1/Sensors with hundreds of individual
// sensor member documents (e.g. Chassis/1/Sensors/101L1). These are
// not used by any LOGPile parser — thermal/power data is read from
// the aggregate Chassis/*/Thermal and Chassis/*/Power endpoints. On
// a real server they largely return errors, wasting many minutes.
// Lenovo OEM subtrees under Oem/Lenovo/LEDs and Oem/Lenovo/Slots also
// enumerate dozens of individual documents not relevant to inventory.
ensureSnapshotExcludeContains(plan,
"/Sensors/", // individual sensor docs (Chassis/1/Sensors/NNN)
"/Oem/Lenovo/LEDs/", // individual LED status entries (~47 per server)
"/Oem/Lenovo/Slots/", // individual slot detail entries (~26 per server)
"/Oem/Lenovo/Metrics/", // operational metrics, not inventory
"/Oem/Lenovo/History", // historical telemetry
"/Oem/Lenovo/ScheduledPower", // power scheduling config
"/Oem/Lenovo/BootSettings/BootOrder", // individual boot order lists
"/PortForwardingMap/", // network port forwarding config
)
// Lenovo XCC BMC is typically slow (p95 latency often 3-5s even under
// normal load). Set rate thresholds that don't over-throttle on the
// first few requests, and give the ETA estimator a realistic baseline.
ensureRatePolicy(plan, AcquisitionRatePolicy{
TargetP95LatencyMS: 2000,
ThrottleP95LatencyMS: 4000,
MinSnapshotWorkers: 2,
MinPrefetchWorkers: 1,
DisablePrefetchOnErrors: true,
})
ensureETABaseline(plan, AcquisitionETABaseline{
DiscoverySeconds: 15,
SnapshotSeconds: 120,
PrefetchSeconds: 30,
CriticalPlanBSeconds: 40,
ProfilePlanBSeconds: 20,
})
addPlanNote(plan, "lenovo xcc acquisition extensions enabled: noisy sensor/oem paths excluded from snapshot")
},
}
}


@@ -56,6 +56,7 @@ func BuiltinProfiles() []Profile {
supermicroProfile(),
dellProfile(),
hpeProfile(),
lenovoProfile(),
inspurGroupOEMPlatformsProfile(),
hgxProfile(),
xfusionProfile(),
@@ -226,6 +227,10 @@ func ensurePrefetchPolicy(plan *AcquisitionPlan, policy AcquisitionPrefetchPolic
addPlanPaths(&plan.Tuning.PrefetchPolicy.ExcludeContains, policy.ExcludeContains...)
}
func ensureSnapshotExcludeContains(plan *AcquisitionPlan, patterns ...string) {
addPlanPaths(&plan.Tuning.SnapshotExcludeContains, patterns...)
}
func min(a, b int) int {
if a < b {
return a


@@ -53,16 +53,17 @@ type AcquisitionScopedPathPolicy struct {
}
type AcquisitionTuning struct {
- SnapshotMaxDocuments int
- SnapshotWorkers int
- PrefetchEnabled *bool
- PrefetchWorkers int
- NVMePostProbeEnabled *bool
- RatePolicy AcquisitionRatePolicy
- ETABaseline AcquisitionETABaseline
- PostProbePolicy AcquisitionPostProbePolicy
- RecoveryPolicy AcquisitionRecoveryPolicy
- PrefetchPolicy AcquisitionPrefetchPolicy
+ SnapshotMaxDocuments int
+ SnapshotWorkers int
+ SnapshotExcludeContains []string
+ PrefetchEnabled *bool
+ PrefetchWorkers int
+ NVMePostProbeEnabled *bool
+ RatePolicy AcquisitionRatePolicy
+ ETABaseline AcquisitionETABaseline
+ PostProbePolicy AcquisitionPostProbePolicy
+ RecoveryPolicy AcquisitionRecoveryPolicy
+ PrefetchPolicy AcquisitionPrefetchPolicy
}
type AcquisitionRatePolicy struct {


@@ -1961,7 +1961,10 @@ func pcieDedupKey(item ReanimatorPCIe) string {
slot := strings.ToLower(strings.TrimSpace(item.Slot))
serial := strings.ToLower(strings.TrimSpace(item.SerialNumber))
bdf := strings.ToLower(strings.TrimSpace(item.BDF))
- if slot != "" {
+ // Generic slot names (e.g. "PCIe Device" from HGX BMC) are not unique
+ // hardware positions — multiple distinct devices share the same name.
+ // Fall through to serial/BDF so they are not incorrectly collapsed.
+ if slot != "" && !isGenericPCIeSlotName(slot) {
return "slot:" + slot
}
if serial != "" {
@@ -1970,9 +1973,22 @@ func pcieDedupKey(item ReanimatorPCIe) string {
if bdf != "" {
return "bdf:" + bdf
}
if slot != "" {
return "slot:" + slot
}
return strings.ToLower(strings.TrimSpace(item.DeviceClass)) + "|" + strings.ToLower(strings.TrimSpace(item.Model))
}
// isGenericPCIeSlotName reports whether slot is a generic device-type label
// rather than a unique hardware position identifier.
func isGenericPCIeSlotName(slot string) bool {
switch slot {
case "pcie device", "pcie slot", "pcie":
return true
}
return false
}
func pcieQualityScore(item ReanimatorPCIe) int {
score := 0
if strings.TrimSpace(item.SerialNumber) != "" {


@@ -733,6 +733,42 @@ func TestConvertPCIeDevices_SkipsDisplayControllerDuplicates(t *testing.T) {
}
}
func TestConvertPCIeDevices_PreservesAllGPUsWithGenericSlot(t *testing.T) {
// Supermicro HGX BMC reports all GPU PCIe devices with Name "PCIe Device" —
// a generic label that is not a unique hardware position. All 8 GPUs must
// be preserved; dedup by generic slot name must not collapse them into one.
gpus := make([]models.GPU, 8)
serials := []string{
"1654925165720", "1654925166160", "1654925165942", "1654925165271",
"1654925165719", "1654925165252", "1654925165304", "1654925165587",
}
for i, sn := range serials {
gpus[i] = models.GPU{
Slot: "PCIe Device",
Model: "B200 180GB HBM3e",
Manufacturer: "NVIDIA",
SerialNumber: sn,
PartNumber: "2901-886-A1",
Status: "OK",
}
}
hw := &models.HardwareConfig{GPUs: gpus}
result := convertPCIeDevices(hw, "2026-04-13T10:00:00Z")
if len(result) != 8 {
t.Fatalf("expected 8 GPU entries (one per serial), got %d", len(result))
}
seen := make(map[string]bool)
for _, r := range result {
if seen[r.SerialNumber] {
t.Fatalf("duplicate serial %q in PCIe result", r.SerialNumber)
}
seen[r.SerialNumber] = true
if r.DeviceClass != "VideoController" {
t.Fatalf("expected VideoController device class, got %q", r.DeviceClass)
}
}
}
func TestConvertPCIeDevices_MapsGPUStatusHistory(t *testing.T) {
hw := &models.HardwareConfig{
GPUs: []models.GPU{


@@ -85,7 +85,7 @@
</div>
<label class="api-form-checkbox" for="api-debug-payloads">
<input id="api-debug-payloads" name="debug_payloads" type="checkbox">
- <span>Сбор расширенных метрик для отладки</span>
+ <span>Сбор расширенных данных для диагностики</span>
</label>
<div class="api-form-actions">
<button id="api-collect-btn" type="submit">Собрать</button>