redfish: fix GPU duplication on Supermicro HGX, exclude NVSwitch, restore path dedup

Three bugs, all related to GPU dedup in the Redfish replay pipeline:
1. collectGPUsFromProcessors (redfish_replay.go): GPU-type Processor entries
(Systems/HGX_Baseboard_0/Processors/GPU_SXM_N) were not deduplicated against
existing PCIeDevice GPUs on Supermicro HGX. The chassis-ID lookup keyed on
processor Id ("GPU_SXM_1") but the chassis is named "HGX_GPU_SXM_1" — lookup
returned nothing, serial stayed empty, UUID was unseen → 8 duplicate GPU rows.
Fix: read SerialNumber directly from the Processor doc first; chassis lookup
is now a fallback override (as it was designed for MSI).
2. looksLikeGPU (redfish.go): NVSwitch PCIe devices (Model="NVSwitch",
Manufacturer="NVIDIA") were classified as GPUs because "nvidia" matched the
GPU hint list. Fix: early return false when Model contains "nvswitch".
3. gpuDocDedupKey (redfish.go): commit 9df29b1 changed the dedup key to prefer
slot|model before path, which collapsed two distinct GPUs with identical model
names in GraphicsControllers into one entry. Fix: only serial and BDF are used
as cross-path stable dedup keys; fall back to Redfish path when neither is
present. This also restores TestReplayCollectGPUs_DedupUsesRedfishPathBeforeHeuristics
which had been broken on main since 9df29b1.
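
The serial resolution order from fix 1 can be sketched as below. This is a minimal illustration, not the code in redfish_replay.go: resolveGPUSerial, the chassisSerialByID map, and the sample serial values are hypothetical names and data introduced only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveGPUSerial is a hypothetical helper (not the project's function).
// It reads SerialNumber from the Processor document first, then lets a
// chassis lookup keyed on the processor Id override it when that lookup hits.
func resolveGPUSerial(processor map[string]interface{}, chassisSerialByID map[string]string) string {
	serial := ""
	if s, ok := processor["SerialNumber"].(string); ok {
		serial = strings.TrimSpace(s)
	}
	id, _ := processor["Id"].(string)
	// A miss returns "", so the Processor serial is kept.
	if chassisSerial := strings.TrimSpace(chassisSerialByID[strings.TrimSpace(id)]); chassisSerial != "" {
		serial = chassisSerial
	}
	return serial
}

func main() {
	processor := map[string]interface{}{"Id": "GPU_SXM_1", "SerialNumber": "SN123"}

	// Supermicro HGX style: chassis is named HGX_GPU_SXM_1, so the lookup misses.
	hgx := map[string]string{"HGX_GPU_SXM_1": "SN123"}
	fmt.Println(resolveGPUSerial(processor, hgx)) // prints SN123

	// Chassis ID equals processor Id: the chassis value overrides.
	matching := map[string]string{"GPU_SXM_1": "SN-CHASSIS"}
	fmt.Println(resolveGPUSerial(processor, matching)) // prints SN-CHASSIS
}
```

With the serial populated even when the chassis lookup misses, the Processor entry can be matched against the PCIeDevice GPU that carries the same serial.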
Added tests:
- TestCollectGPUsFromProcessors_SupermicroHGX: Processor GPU dedup when
chassis-ID naming convention does not match processor Id
- TestReplayCollectGPUs_DedupCrossChassisSerial: same GPU via two Chassis
PCIeDevice trees with matching serials → collapsed to one
- TestLooksLikeGPU_NVSwitchExcluded: NVSwitch is not a GPU
Added rule to bible-local/09-testing.md: dedup/filter/classify functions must
cover true-positive, true-negative, and the vendor counter-case axes.
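
One way to lay out the three axes named in that rule is a table-driven case list. The sketch below assumes a simplified classifier, isGPUCandidate, that is a hypothetical stand-in and does not share the signature of looksLikeGPU. The device names in the table are made-up sample values.

```go
package main

import (
	"fmt"
	"strings"
)

// isGPUCandidate is a simplified, hypothetical classifier used only to
// illustrate the test axes. It treats NVIDIA devices as GPUs unless the
// model names an NVSwitch.
func isGPUCandidate(manufacturer, model string) bool {
	if strings.Contains(strings.ToLower(strings.TrimSpace(model)), "nvswitch") {
		return false
	}
	return strings.Contains(strings.ToLower(manufacturer), "nvidia")
}

type classifyCase struct {
	name         string
	manufacturer string
	model        string
	want         bool
}

// One case per axis: true positive, true negative, vendor counter-case.
var classifyCases = []classifyCase{
	{"true positive: NVIDIA GPU", "NVIDIA", "H100 SXM5", true},
	{"true negative: non-GPU device", "Intel", "Ethernet Controller", false},
	{"vendor counter-case: NVSwitch", "NVIDIA", "NVSwitch", false},
}

func main() {
	for _, tc := range classifyCases {
		got := isGPUCandidate(tc.manufacturer, tc.model)
		status := "ok"
		if got != tc.want {
			status = "FAIL"
		}
		fmt.Printf("%s: got=%v want=%v %s\n", tc.name, got, tc.want, status)
	}
}
```

The counter-case row is the one that would have caught bug 2: it shares the vendor string with the true positive but must classify the other way.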
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -3113,8 +3113,16 @@ func gpuDocDedupKey(doc map[string]interface{}, gpu models.GPU) string {
 	// physical GPU exposed under multiple Chassis PCIeDevice trees (e.g. Supermicro
 	// HGX: Chassis/1/PCIeDevices/GPU1 and Chassis/HGX_GPU_SXM_1/PCIeDevices/GPU_SXM_1)
 	// is correctly deduplicated.
-	if key := gpuDedupKey(gpu); key != "" {
-		return key
+	//
+	// Only stable identifiers (serial, BDF) are used for cross-path dedup.
+	// When neither is present we fall back to path, so two genuinely distinct GPUs
+	// that happen to share the same model name (e.g. in GraphicsControllers) are
+	// not incorrectly collapsed into one.
+	if serial := normalizeRedfishIdentityField(gpu.SerialNumber); serial != "" {
+		return serial
 	}
+	if bdf := strings.TrimSpace(gpu.BDF); bdf != "" {
+		return bdf
+	}
 	if path := normalizeRedfishPath(asString(doc["@odata.id"])); path != "" {
 		return "path:" + path
@@ -3342,6 +3350,10 @@ func looksLikeGPU(doc map[string]interface{}, functionDocs []map[string]interfac
 	if strings.EqualFold(strings.TrimSpace(asString(doc["Description"])), "Display Device") {
 		return false
 	}
+	// NVSwitch is an NVIDIA NVLink interconnect switch, not a compute GPU.
+	if strings.Contains(strings.ToLower(strings.TrimSpace(asString(doc["Model"]))), "nvswitch") {
+		return false
+	}
 
 	deviceType := strings.ToLower(asString(doc["DeviceType"]))
 	if strings.Contains(deviceType, "gpu") || strings.Contains(deviceType, "graphics") || strings.Contains(deviceType, "accelerator") {
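
The effect of the key precedence in gpuDocDedupKey (serial, then BDF, then path) can be illustrated with a small standalone sketch. The gpuEntry type, dedupKey, and dedupe below are hypothetical simplifications: strings.TrimSpace stands in for the project's normalization helpers, and the GraphicsControllers paths and serial values are invented sample data.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuEntry is a hypothetical, flattened view of a GPU document.
type gpuEntry struct {
	Path   string
	Serial string
	BDF    string
	Model  string
}

// dedupKey mirrors the precedence described above: serial, then BDF,
// then the Redfish path. Model is deliberately never part of the key.
func dedupKey(g gpuEntry) string {
	if s := strings.TrimSpace(g.Serial); s != "" {
		return s
	}
	if b := strings.TrimSpace(g.BDF); b != "" {
		return b
	}
	return "path:" + strings.TrimSpace(g.Path)
}

// dedupe keeps the first entry seen for each key.
func dedupe(entries []gpuEntry) []gpuEntry {
	seen := make(map[string]bool)
	var out []gpuEntry
	for _, e := range entries {
		k := dedupKey(e)
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, e)
	}
	return out
}

func main() {
	entries := []gpuEntry{
		// Same physical GPU under two chassis trees: collapses via serial.
		{Path: "/redfish/v1/Chassis/1/PCIeDevices/GPU1", Serial: "SN-A"},
		{Path: "/redfish/v1/Chassis/HGX_GPU_SXM_1/PCIeDevices/GPU_SXM_1", Serial: "SN-A"},
		// Two distinct GPUs with the same model and no serial or BDF: kept apart by path.
		{Path: "/redfish/v1/Systems/1/GraphicsControllers/0", Model: "SameModel"},
		{Path: "/redfish/v1/Systems/1/GraphicsControllers/1", Model: "SameModel"},
	}
	fmt.Println(len(dedupe(entries))) // prints 3
}
```

A key built from slot|model would have merged the last two entries, which is the regression described in bug 3.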