- nvidia.go: add Name field to nvidiaGPUInfo, include model name in nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x … → 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove .tar.gz archive creation from export dirs; export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# GPU Model Name Propagation

How GPU model names are detected, stored, and displayed throughout the project.

---
## Detection Sources
There are **three separate pipelines** for GPU model names. They use different structs and don't share state.
### Pipeline A — Live / SAT (nvidia-smi query at runtime)
**File:** `audit/internal/platform/sat.go`

- `ListNvidiaGPUs()` → `NvidiaGPU.Name` (field: `name`, from `nvidia-smi --query-gpu=index,name,...`)
- `ListNvidiaGPUStatuses()` → `NvidiaGPUStatus.Name`
- Used by: GPU selection UI, live metrics labels, burn/stress test logic
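
As a rough illustration of how the `name` column could be read, here is a minimal sketch that parses the output of `nvidia-smi --query-gpu=index,name --format=csv,noheader`. The `gpuRow` type and `parseGPUList` helper are hypothetical names for this example, not the project's `ListNvidiaGPUs()` implementation, and the sample input is made up.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// gpuRow is a hypothetical stand-in for the project's NvidiaGPU struct.
type gpuRow struct {
	Index int
	Name  string
}

// parseGPUList parses the output of
// `nvidia-smi --query-gpu=index,name --format=csv,noheader`.
func parseGPUList(out string) ([]gpuRow, error) {
	r := csv.NewReader(strings.NewReader(out))
	r.TrimLeadingSpace = true // nvidia-smi separates fields with ", "
	r.FieldsPerRecord = 2     // exactly index and name
	records, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	gpus := make([]gpuRow, 0, len(records))
	for _, rec := range records {
		idx, err := strconv.Atoi(strings.TrimSpace(rec[0]))
		if err != nil {
			return nil, fmt.Errorf("bad GPU index %q: %w", rec[0], err)
		}
		gpus = append(gpus, gpuRow{Index: idx, Name: strings.TrimSpace(rec[1])})
	}
	return gpus, nil
}

func main() {
	// Made-up sample output, for illustration only.
	sample := "0, NVIDIA A100-SXM4-80GB\n1, NVIDIA A100-SXM4-80GB\n"
	gpus, err := parseGPUList(sample)
	if err != nil {
		panic(err)
	}
	for _, g := range gpus {
		fmt.Printf("GPU %d: %s\n", g.Index, g.Name)
	}
}
```

The real query selects more columns than the two used here, so `FieldsPerRecord` would have to match whatever column list is actually passed to `--query-gpu`.
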
### Pipeline B — Benchmark results
**File:** `audit/internal/platform/benchmark.go`, line 124

- `queryBenchmarkGPUInfo(selected)` → `benchmarkGPUInfo.Name`
- Stored in `BenchmarkGPUResult.Name` (`json:"name,omitempty"`)
- Used by: benchmark history table, benchmark report
### Pipeline C — Hardware audit JSON (PCIe schema)
**File:** `audit/internal/schema/hardware.go`

- `HardwarePCIeDevice.Model *string` (field name is **Model**, not Name)
- For AMD GPUs: populated by `audit/internal/collector/amdgpu.go` from `info.Product`
- For NVIDIA GPUs: **NOT populated** by `audit/internal/collector/nvidia.go`; the NVIDIA enricher sets telemetry/status but skips the Model field
- Used by: hardware summary page (`hwDescribeGPU` in `pages.go:487`)

---
## Key Inconsistency: NVIDIA PCIe Model is Never Set
In `audit/internal/collector/nvidia.go`, `enrichPCIeWithNVIDIAData()` enriches NVIDIA PCIe devices with telemetry and status but does **not** populate `HardwarePCIeDevice.Model`.

This means:

- Hardware summary page shows "Unknown GPU" for all NVIDIA devices (falls back at `pages.go:486`)
- AMD GPUs do have their model populated

The fix would be: copy `gpu.Name` from the SAT pipeline into `dev.Model` inside `enrichPCIeWithNVIDIAData`.
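
A minimal sketch of that copy, under stated assumptions: `pcieDevice` and `gpuInfo` are simplified, hypothetical stand-ins for the real structs, and matching devices to GPUs by PCI bus address (BDF) is assumed purely for illustration. The sketch takes the address of a local copy rather than `&gpu.Name`, so later changes to the source struct cannot alter the stored model; in Go versions before 1.22 this also avoids pointing at a reused range-loop variable. Blank names are skipped so that the UI fallback still applies.

```go
package main

import (
	"fmt"
	"strings"
)

// pcieDevice and gpuInfo are simplified, hypothetical stand-ins for
// HardwarePCIeDevice and nvidiaGPUInfo; only the fields needed here are shown.
type pcieDevice struct {
	BDF   string
	Model *string
}

type gpuInfo struct {
	Name string
}

// setModels copies the nvidia-smi name into Model for each matching device.
// Matching by BDF is an assumption made for this sketch.
func setModels(devs []pcieDevice, byBDF map[string]gpuInfo) {
	for i := range devs {
		gpu, ok := byBDF[devs[i].BDF]
		if !ok {
			continue
		}
		name := strings.TrimSpace(gpu.Name)
		if name == "" {
			continue // leave Model nil so the UI fallback still applies
		}
		devs[i].Model = &name // name is a fresh variable on every iteration
	}
}

func main() {
	devs := []pcieDevice{{BDF: "0000:3b:00.0"}, {BDF: "0000:5e:00.0"}}
	setModels(devs, map[string]gpuInfo{
		"0000:3b:00.0": {Name: "NVIDIA A100-SXM4-80GB"},
	})
	for _, d := range devs {
		if d.Model != nil {
			fmt.Println(d.BDF, *d.Model)
		} else {
			fmt.Println(d.BDF, "(model not set)")
		}
	}
}
```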

---
## Benchmark History "Unknown GPU" Issue
**Symptom:** Benchmark history table shows "GPU #N — Unknown GPU" columns instead of real GPU model names.

**Root cause:** `BenchmarkGPUResult.Name` has tag `json:"name,omitempty"`. If `queryBenchmarkGPUInfo()` fails (warns at `benchmark.go:126`) or returns empty names, the `Name` field is never set and is omitted from the JSON. Loaded results then have an empty `Name`, which falls back to "Unknown GPU" at `pages.go:2226` and `pages.go:2237`.

This happens for:

- Older result files saved before the `Name` field was added
- Runs where the nvidia-smi query failed before the benchmark started

---
## Fallback Strings — Current State
| Location | File:line | Fallback string |
|---|---|---|
| Hardware summary (PCIe) | `pages.go:486` | `"Unknown GPU"` |
| Benchmark report summary | `benchmark_report.go:43` | `"Unknown GPU"` |
| Benchmark report scorecard | `benchmark_report.go:93` | `"Unknown"` ← inconsistent |
| Benchmark report detail | `benchmark_report.go:122` | `"Unknown GPU"` |
| Benchmark history per-GPU col | `pages.go:2226` | `"Unknown GPU"` |
| Benchmark history parallel col | `pages.go:2237` | `"Unknown GPU"` |
| SAT status file write | `sat.go:922` | `"unknown"` ← lowercase, inconsistent |
| GPU selection API | `api.go:163` | `"GPU N"` (no "Unknown") |

**Rule:** all UI fallbacks should use `"Unknown GPU"`. The two outliers are `benchmark_report.go:93` (`"Unknown"`) and `sat.go:922` (`"unknown"`).
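
One way to keep the fallback from drifting again is to route every call site through a single helper. The sketch below is hypothetical: `unknownGPU`, `gpuDisplayName`, and `gpuDisplayNamePtr` are not names from the codebase, and trimming whitespace before the emptiness check is an added assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// unknownGPU is the single fallback string the rule above calls for.
const unknownGPU = "Unknown GPU"

// gpuDisplayName is a hypothetical helper: it returns the trimmed name,
// or the shared fallback when the name is missing or blank.
func gpuDisplayName(name string) string {
	if n := strings.TrimSpace(name); n != "" {
		return n
	}
	return unknownGPU
}

// gpuDisplayNamePtr covers *string fields such as HardwarePCIeDevice.Model.
func gpuDisplayNamePtr(name *string) string {
	if name == nil {
		return unknownGPU
	}
	return gpuDisplayName(*name)
}

func main() {
	model := "NVIDIA H100 PCIe"
	fmt.Println(gpuDisplayName(""))        // Unknown GPU
	fmt.Println(gpuDisplayNamePtr(nil))    // Unknown GPU
	fmt.Println(gpuDisplayNamePtr(&model)) // NVIDIA H100 PCIe
}
```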

---
## GPU Selection UI
**File:** `audit/internal/webui/pages.go`

- Source: `GET /api/gpus` → `api.go` → `ListNvidiaGPUs()` → live nvidia-smi
- Render: `'GPU ' + gpu.index + ' — ' + gpu.name + ' · ' + mem`
- Fallback: `gpu.name || 'GPU ' + idx` (JS, line ~1432)

This always shows the correct model because it queries nvidia-smi live. It is **not** connected to benchmark result data.

---
## Data Flow Summary
```
nvidia-smi (live)
  └─ ListNvidiaGPUs() → NvidiaGPU.Name
       ├─ GPU selection UI (always correct)
       ├─ Live metrics labels (charts_svg.go)
       └─ SAT/burn status file (sat.go)

nvidia-smi (at benchmark start)
  └─ queryBenchmarkGPUInfo() → benchmarkGPUInfo.Name
       └─ BenchmarkGPUResult.Name (json:"name,omitempty")
            ├─ Benchmark report
            └─ Benchmark history table columns

nvidia-smi / lspci (audit collection)
  └─ HardwarePCIeDevice.Model (NVIDIA: NOT populated; AMD: populated)
       └─ Hardware summary page hwDescribeGPU()
```

---
## What Needs Fixing
1. **NVIDIA PCIe Model**: `enrichPCIeWithNVIDIAData()` should set `dev.Model = &gpu.Name`
2. **Fallback consistency**: `benchmark_report.go:93` should say `"Unknown GPU"`, not `"Unknown"`; `sat.go:922` should say `"Unknown GPU"`, not `"unknown"`
3. **Old benchmark JSONs**: no fix is possible for already-saved results with missing names (display-only issue)