Lenovo ThinkSystem SR650 V3 (and similar XCC-based servers) caused
collection runs of 23+ minutes because the BMC exposes two large
subtrees with high error rates, both of which the snapshot BFS walks:
- Chassis/1/Sensors: 315 individual sensor members, 282/315 failing,
~3.7s per request → ~19 minutes wasted. These documents are never
read by any LOGPile parser (thermal/power data comes from aggregate
Chassis/*/Thermal and Chassis/*/Power endpoints).
- Chassis/1/Oem/Lenovo: 75 requests (LEDs×47, Slots×26, etc.),
68/75 failing → 8+ minutes wasted on non-inventory data.
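
As a rough check on the sensor figure above, the wasted time is the request count multiplied by the per-request latency. This is a minimal sketch that assumes the requests run strictly one after another; the real collector may use several workers, and `wastedTime` is a hypothetical helper, not part of LOGPile.

```go
package main

import (
	"fmt"
	"time"
)

// wastedTime estimates the wall-clock time spent on a subtree when its
// requests are issued sequentially (an assumption, see the lead-in).
func wastedTime(requests int, perRequest time.Duration) time.Duration {
	return time.Duration(requests) * perRequest
}

func main() {
	// 315 sensor members at ~3.7s each, both figures from the list above.
	d := wastedTime(315, 3700*time.Millisecond)
	fmt.Printf("%.1f minutes\n", d.Minutes()) // prints "19.4 minutes"
}
```
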

Add a Lenovo profile (matched on system or chassis manufacturer "Lenovo",
an OEM namespace "Lenovo", or an XClarity/XCC service root product) that
sets SnapshotExcludeContains to block individual sensor documents and
non-inventory Lenovo OEM subtrees from the snapshot BFS queue. The profile
also sets rate policy thresholds appropriate for XCC BMC latency (p95 often
3-5s).

Add SnapshotExcludeContains []string to AcquisitionTuning and check it
in the snapshot enqueue closure in redfish.go.
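
The two halves of this change could look roughly like the sketch below: the profile appends substrings to the plan, and the enqueue closure drops any path containing one of them. Only the field name `SnapshotExcludeContains` and the helper name `ensureSnapshotExcludeContains` come from this commit; the `Tuning` field on the plan, the `excluded` function, the de-duplication, and the case-sensitive match are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-in types carrying only what this sketch needs (hypothetical layout).
type AcquisitionTuning struct {
	SnapshotExcludeContains []string
}

type AcquisitionPlan struct {
	Tuning AcquisitionTuning
}

// ensureSnapshotExcludeContains appends patterns, skipping empty strings
// and entries that are already present.
func ensureSnapshotExcludeContains(plan *AcquisitionPlan, patterns ...string) {
	for _, p := range patterns {
		if p == "" || excluded(AcquisitionTuning{SnapshotExcludeContains: []string{p}}, p) &&
			contains(plan.Tuning.SnapshotExcludeContains, p) {
			continue
		}
		plan.Tuning.SnapshotExcludeContains = append(plan.Tuning.SnapshotExcludeContains, p)
	}
}

// contains reports whether list holds exactly the string p.
func contains(list []string, p string) bool {
	for _, v := range list {
		if v == p {
			return true
		}
	}
	return false
}

// excluded reports whether path contains any configured substring.
func excluded(t AcquisitionTuning, path string) bool {
	for _, s := range t.SnapshotExcludeContains {
		if s != "" && strings.Contains(path, s) {
			return true
		}
	}
	return false
}

func main() {
	plan := &AcquisitionPlan{}
	ensureSnapshotExcludeContains(plan, "/Sensors/", "/Oem/Lenovo/LEDs/")

	var queue []string
	enqueue := func(path string) {
		if excluded(plan.Tuning, path) {
			return // never reaches the BFS queue
		}
		queue = append(queue, path)
	}

	enqueue("/redfish/v1/Chassis/1/Sensors")       // collection document: kept
	enqueue("/redfish/v1/Chassis/1/Sensors/101L1") // member document: dropped
	enqueue("/redfish/v1/Chassis/1/Thermal")       // aggregate endpoint: kept
	fmt.Println(queue)
}
```

Because the pattern is "/Sensors/" with a trailing slash, the collection document itself is still fetched (assuming its path has no trailing slash) and only the individual members are skipped.
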
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
66 lines
2.6 KiB
Go
package redfishprofile

// lenovoProfile returns the acquisition profile for Lenovo XCC-based servers.
func lenovoProfile() Profile {
	return staticProfile{
		name:            "lenovo",
		priority:        20,
		safeForFallback: true,
		matchFn: func(s MatchSignals) int {
			score := 0
			if containsFold(s.SystemManufacturer, "lenovo") ||
				containsFold(s.ChassisManufacturer, "lenovo") {
				score += 80
			}
			for _, ns := range s.OEMNamespaces {
				if containsFold(ns, "lenovo") {
					score += 30
					break
				}
			}
			// Lenovo XClarity Controller (XCC) is the BMC product line.
			if containsFold(s.ServiceRootProduct, "xclarity") ||
				containsFold(s.ServiceRootProduct, "xcc") {
				score += 30
			}
			return min(score, 100)
		},
		extendAcquisition: func(plan *AcquisitionPlan, _ MatchSignals) {
			// Lenovo XCC BMC exposes Chassis/1/Sensors with hundreds of individual
			// sensor member documents (e.g. Chassis/1/Sensors/101L1). These are
			// not used by any LOGPile parser: thermal/power data is read from
			// the aggregate Chassis/*/Thermal and Chassis/*/Power endpoints. On
			// a real server they largely return errors, wasting many minutes.
			// Lenovo OEM subtrees under Oem/Lenovo/LEDs and Oem/Lenovo/Slots also
			// enumerate dozens of individual documents not relevant to inventory.
			ensureSnapshotExcludeContains(plan,
				"/Sensors/",                          // individual sensor docs (Chassis/1/Sensors/NNN)
				"/Oem/Lenovo/LEDs/",                  // individual LED status entries (~47 per server)
				"/Oem/Lenovo/Slots/",                 // individual slot detail entries (~26 per server)
				"/Oem/Lenovo/Metrics/",               // operational metrics, not inventory
				"/Oem/Lenovo/History",                // historical telemetry
				"/Oem/Lenovo/ScheduledPower",         // power scheduling config
				"/Oem/Lenovo/BootSettings/BootOrder", // individual boot order lists
				"/PortForwardingMap/",                // network port forwarding config
			)
			// Lenovo XCC BMC is typically slow (p95 latency often 3-5s even under
			// normal load). Set rate thresholds that don't over-throttle on the
			// first few requests, and give the ETA estimator a realistic baseline.
			ensureRatePolicy(plan, AcquisitionRatePolicy{
				TargetP95LatencyMS:      2000,
				ThrottleP95LatencyMS:    4000,
				MinSnapshotWorkers:      2,
				MinPrefetchWorkers:      1,
				DisablePrefetchOnErrors: true,
			})
			ensureETABaseline(plan, AcquisitionETABaseline{
				DiscoverySeconds:     15,
				SnapshotSeconds:      120,
				PrefetchSeconds:      30,
				CriticalPlanBSeconds: 40,
				ProfilePlanBSeconds:  20,
			})
			addPlanNote(plan, "lenovo xcc acquisition extensions enabled: noisy sensor/oem paths excluded from snapshot")
		},
	}
}
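
To see how the additive score in `matchFn` behaves, the following standalone sketch reproduces the same scoring rules against a few sample inputs. `MatchSignals` here carries only the four fields the profile reads, and `containsFold` is assumed to be a case-insensitive substring test; neither definition is shown in this file, so both are stand-ins. The cap is written as an explicit `if` so the sketch does not depend on the `min` builtin (Go 1.21+).

```go
package main

import (
	"fmt"
	"strings"
)

// MatchSignals is a hypothetical stand-in with only the fields used below.
type MatchSignals struct {
	SystemManufacturer  string
	ChassisManufacturer string
	OEMNamespaces       []string
	ServiceRootProduct  string
}

// containsFold is assumed to be a case-insensitive substring check.
func containsFold(s, substr string) bool {
	return strings.Contains(strings.ToLower(s), strings.ToLower(substr))
}

// lenovoScore mirrors the scoring rules of the profile's matchFn.
func lenovoScore(s MatchSignals) int {
	score := 0
	if containsFold(s.SystemManufacturer, "lenovo") ||
		containsFold(s.ChassisManufacturer, "lenovo") {
		score += 80
	}
	for _, ns := range s.OEMNamespaces {
		if containsFold(ns, "lenovo") {
			score += 30
			break // count the namespace signal once
		}
	}
	if containsFold(s.ServiceRootProduct, "xclarity") ||
		containsFold(s.ServiceRootProduct, "xcc") {
		score += 30
	}
	if score > 100 {
		score = 100
	}
	return score
}

func main() {
	all := MatchSignals{
		SystemManufacturer: "Lenovo",
		OEMNamespaces:      []string{"Lenovo"},
		ServiceRootProduct: "Lenovo XClarity Controller",
	}
	fmt.Println(lenovoScore(all))                                             // 80+30+30, capped: 100
	fmt.Println(lenovoScore(MatchSignals{OEMNamespaces: []string{"Lenovo"}})) // 30
	fmt.Println(lenovoScore(MatchSignals{SystemManufacturer: "Dell Inc."}))   // 0
}
```

A manufacturer match alone scores 80, while an OEM namespace or product match alone scores only 30, so the weaker signals mainly serve to confirm a manufacturer match rather than to select the profile on their own.
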