redfish: fix GPU duplication on Supermicro HGX, exclude NVSwitch, restore path dedup

Three bugs, all related to GPU dedup in the Redfish replay pipeline:

1. collectGPUsFromProcessors (redfish_replay.go): GPU-type Processor entries
   (Systems/HGX_Baseboard_0/Processors/GPU_SXM_N) were not deduplicated against
   existing PCIeDevice GPUs on Supermicro HGX. The chassis-ID lookup keyed on
   processor Id ("GPU_SXM_1") but the chassis is named "HGX_GPU_SXM_1" — lookup
   returned nothing, serial stayed empty, UUID was unseen → 8 duplicate GPU rows.
   Fix: read SerialNumber directly from the Processor doc first; chassis lookup
   is now a fallback override (as it was designed for MSI).

2. looksLikeGPU (redfish.go): NVSwitch PCIe devices (Model="NVSwitch",
   Manufacturer="NVIDIA") were classified as GPUs because "nvidia" matched the
   GPU hint list. Fix: early return false when Model contains "nvswitch".

3. gpuDocDedupKey (redfish.go): commit 9df29b1 changed the dedup key to prefer
   slot|model before path, which collapsed two distinct GPUs with identical model
   names in GraphicsControllers into one entry. Fix: only serial and BDF are used
   as cross-path stable dedup keys; fall back to Redfish path when neither is
   present. This also restores TestReplayCollectGPUs_DedupUsesRedfishPathBeforeHeuristics
   which had been broken on main since 9df29b1.
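
The key order from item 3 can be sketched as below. This is a minimal, self-contained illustration, not the code in redfish.go: `gpuRecord`, `dedupKey`, and `dedupe` are hypothetical names, the Redfish paths under `Systems/1/GraphicsControllers` are made up for the example, and plain `strings.TrimSpace` stands in for the real normalization helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuRecord is a hypothetical stand-in for the fields the dedup key depends on.
type gpuRecord struct {
	Model        string
	SerialNumber string
	BDF          string
	Path         string // Redfish @odata.id
}

// dedupKey prefers identifiers that are stable across Redfish paths
// (serial, then BDF) and falls back to the path itself. Slot and model
// are deliberately not used, so identical models never collapse.
func dedupKey(g gpuRecord) string {
	if s := strings.TrimSpace(g.SerialNumber); s != "" {
		return s
	}
	if b := strings.TrimSpace(g.BDF); b != "" {
		return b
	}
	if p := strings.TrimRight(strings.TrimSpace(g.Path), "/"); p != "" {
		return "path:" + p
	}
	return ""
}

// dedupe keeps the first record seen for each key. Records with an empty
// key are always kept, since nothing identifies them as duplicates.
func dedupe(in []gpuRecord) []gpuRecord {
	seen := make(map[string]bool)
	out := make([]gpuRecord, 0, len(in))
	for _, g := range in {
		k := dedupKey(g)
		if k != "" {
			if seen[k] {
				continue
			}
			seen[k] = true
		}
		out = append(out, g)
	}
	return out
}

func main() {
	gpus := []gpuRecord{
		// Same physical GPU exposed under two Chassis trees: one row expected.
		{Model: "H100", SerialNumber: "SN-A", Path: "/redfish/v1/Chassis/1/PCIeDevices/GPU1"},
		{Model: "H100", SerialNumber: "SN-A", Path: "/redfish/v1/Chassis/HGX_GPU_SXM_1/PCIeDevices/GPU_SXM_1"},
		// Two distinct GPUs with the same model and no serial/BDF: two rows expected.
		{Model: "ModelX", Path: "/redfish/v1/Systems/1/GraphicsControllers/0"},
		{Model: "ModelX", Path: "/redfish/v1/Systems/1/GraphicsControllers/1"},
	}
	fmt.Println(len(dedupe(gpus)))
}
```

With a slot|model key the last two records would share one key and collapse. With the path fallback they stay separate, while the first two still merge on serial.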

Added tests:
- TestCollectGPUsFromProcessors_SupermicroHGX: Processor GPU dedup when
  chassis-ID naming convention does not match processor Id
- TestReplayCollectGPUs_DedupCrossChassisSerial: same GPU via two Chassis
  PCIeDevice trees with matching serials → collapsed to one
- TestLooksLikeGPU_NVSwitchExcluded: NVSwitch is not a GPU
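
The Supermicro HGX scenario behind the first test can be sketched roughly as follows. `resolveProcessorSerial` and the `chassisSerialByID` map are hypothetical, and the sketch assumes that "fallback override" means a successful chassis lookup replaces the serial read from the Processor doc, while a miss leaves it untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveProcessorSerial reads SerialNumber from the Processor doc first.
// A chassis entry keyed by the processor Id, if present and non-empty,
// overrides it (assumed semantics of the MSI-oriented lookup).
func resolveProcessorSerial(proc map[string]interface{}, chassisSerialByID map[string]string) string {
	serial := ""
	if s, ok := proc["SerialNumber"].(string); ok {
		serial = strings.TrimSpace(s)
	}
	id, _ := proc["Id"].(string)
	if cs := strings.TrimSpace(chassisSerialByID[id]); cs != "" {
		serial = cs
	}
	return serial
}

func main() {
	// Supermicro HGX: the chassis is named "HGX_GPU_SXM_1", so a lookup by
	// processor Id "GPU_SXM_1" misses and the Processor doc serial is kept.
	hgxProc := map[string]interface{}{"Id": "GPU_SXM_1", "SerialNumber": "SN-HGX-1"}
	hgxChassis := map[string]string{"HGX_GPU_SXM_1": "SN-HGX-1"}
	fmt.Println(resolveProcessorSerial(hgxProc, hgxChassis))

	// Chassis Id matches the processor Id and the Processor doc has no
	// serial: the chassis value is used.
	matchProc := map[string]interface{}{"Id": "GPU1"}
	matchChassis := map[string]string{"GPU1": "SN-CH-9"}
	fmt.Println(resolveProcessorSerial(matchProc, matchChassis))
}
```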

Added rule to bible-local/09-testing.md: tests for dedup/filter/classify
functions must cover the true-positive, true-negative, and vendor counter-case
axes.
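
As an illustration of the three axes, a table-driven check for the NVSwitch exclusion might look like the sketch below. `classifyGPU` is a hypothetical, heavily simplified stand-in for looksLikeGPU, and its two-entry hint list is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyGPU is a simplified, hypothetical classifier: NVSwitch is excluded
// up front, then a short assumed hint list is matched against the
// manufacturer and model.
func classifyGPU(manufacturer, model string) bool {
	m := strings.ToLower(strings.TrimSpace(model))
	if strings.Contains(m, "nvswitch") {
		return false // NVLink interconnect switch, not a compute GPU
	}
	haystack := strings.ToLower(manufacturer) + " " + m
	for _, hint := range []string{"nvidia", "gpu"} {
		if strings.Contains(haystack, hint) {
			return true
		}
	}
	return false
}

func main() {
	cases := []struct {
		axis         string
		manufacturer string
		model        string
		want         bool
	}{
		{"true-positive", "NVIDIA", "H100 SXM5", true},
		{"true-negative", "Intel", "Ethernet Controller E810", false},
		{"vendor counter-case", "NVIDIA", "NVSwitch", false},
	}
	for _, c := range cases {
		got := classifyGPU(c.manufacturer, c.model)
		status := "ok"
		if got != c.want {
			status = "FAIL"
		}
		fmt.Printf("%s: %s (got %v, want %v)\n", c.axis, status, got, c.want)
	}
}
```

The vendor counter-case is the one a manufacturer-based hint gets wrong: same vendor string, different device class.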

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mikhail Chusavitin
2026-03-11 15:09:27 +03:00
parent d8ffe3d3a5
commit a9f58b3cf4
4 changed files with 265 additions and 5 deletions


@@ -3113,8 +3113,16 @@ func gpuDocDedupKey(doc map[string]interface{}, gpu models.GPU) string {
 	// physical GPU exposed under multiple Chassis PCIeDevice trees (e.g. Supermicro
 	// HGX: Chassis/1/PCIeDevices/GPU1 and Chassis/HGX_GPU_SXM_1/PCIeDevices/GPU_SXM_1)
 	// is correctly deduplicated.
-	if key := gpuDedupKey(gpu); key != "" {
-		return key
+	//
+	// Only stable identifiers (serial, BDF) are used for cross-path dedup.
+	// When neither is present we fall back to path, so two genuinely distinct GPUs
+	// that happen to share the same model name (e.g. in GraphicsControllers) are
+	// not incorrectly collapsed into one.
+	if serial := normalizeRedfishIdentityField(gpu.SerialNumber); serial != "" {
+		return serial
+	}
+	if bdf := strings.TrimSpace(gpu.BDF); bdf != "" {
+		return bdf
 	}
 	if path := normalizeRedfishPath(asString(doc["@odata.id"])); path != "" {
 		return "path:" + path
@@ -3342,6 +3350,10 @@ func looksLikeGPU(doc map[string]interface{}, functionDocs []map[string]interfac
 	if strings.EqualFold(strings.TrimSpace(asString(doc["Description"])), "Display Device") {
 		return false
 	}
+	// NVSwitch is an NVIDIA NVLink interconnect switch, not a compute GPU.
+	if strings.Contains(strings.ToLower(strings.TrimSpace(asString(doc["Model"]))), "nvswitch") {
+		return false
+	}
 	deviceType := strings.ToLower(asString(doc["DeviceType"]))
 	if strings.Contains(deviceType, "gpu") || strings.Contains(deviceType, "graphics") || strings.Contains(deviceType, "accelerator") {