Files
bee/bible-local/docs/gpu-model-propagation.md
Michael Chus d52ec67f8f Stability hardening, build script fixes, GRUB bee logo
Stability hardening (webui/app):
- readFileLimited(): защита от OOM при чтении audit JSON (100 MB),
  component-status DB (10 MB) и лога задачи (50 MB)
- jobs.go: буферизованный лог задачи — один открытый fd на задачу
  вместо open/write/close на каждую строку (устраняет тысячи syscall/сек
  при GPU стресс-тестах)
- stability.go: экспоненциальный backoff в goRecoverLoop (2s→4s→…→60s),
  сброс при успешном прогоне >30s, счётчик перезапусков в slog
- kill_workers.go: таймаут 5s на скан /proc, warn при срабатывании
- bee-web.service: MemoryMax=3G — OOM killer защищён

Build script:
- build.sh: удалён блок генерации grub-pc/grub.cfg + live.cfg.in —
  мёртвый код с v8.25; grub-pc игнорируется live-build, а генерируемый
  live.cfg.in перезаписывал правильный статический файл устаревшей
  версией без tuning-параметров ядра и пунктов gsp-off/kms+gsp-off
- build.sh: dump_memtest_debug теперь логирует grub-efi/grub.cfg
  вместо grub-pc/grub.cfg (было всегда "missing")

GRUB:
- live-theme/bee-logo.png: логотип пчелы 400×400px на чёрном фоне
- live-theme/theme.txt: + image компонент по центру в верхней трети
  экрана; меню сдвинуто с 62% до 65%

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 13:08:31 +03:00

122 lines
4.8 KiB
Markdown

# GPU Model Name Propagation
How GPU model names are detected, stored, and displayed throughout the project.
---
## Detection Sources
There are **two separate pipelines** for GPU model names — they use different structs and don't share state.
### Pipeline A — Live / SAT (nvidia-smi query at runtime)
**File:** `audit/internal/platform/sat.go`
- `ListNvidiaGPUs()``NvidiaGPU.Name` (field: `name`, from `nvidia-smi --query-gpu=index,name,...`)
- `ListNvidiaGPUStatuses()``NvidiaGPUStatus.Name`
- Used by: GPU selection UI, live metrics labels, burn/stress test logic
### Pipeline B — Benchmark results
**File:** `audit/internal/platform/benchmark.go`, line 124
- `queryBenchmarkGPUInfo(selected)``benchmarkGPUInfo.Name`
- Stored in `BenchmarkGPUResult.Name` (`json:"name,omitempty"`)
- Used by: benchmark history table, benchmark report
### Pipeline C — Hardware audit JSON (PCIe schema)
**File:** `audit/internal/schema/hardware.go`
- `HardwarePCIeDevice.Model *string` (field name is **Model**, not Name)
- For AMD GPUs: populated by `audit/internal/collector/amdgpu.go` from `info.Product`
- For NVIDIA GPUs: **NOT populated** by `audit/internal/collector/nvidia.go` — the NVIDIA enricher sets telemetry/status but skips the Model field
- Used by: hardware summary page (`hwDescribeGPU` in `pages.go:487`)
---
## Key Inconsistency: NVIDIA PCIe Model is Never Set
`audit/internal/collector/nvidia.go``enrichPCIeWithNVIDIAData()` enriches NVIDIA PCIe devices with telemetry and status but does **not** populate `HardwarePCIeDevice.Model`.
This means:
- Hardware summary page shows "Unknown GPU" for all NVIDIA devices (falls back at `pages.go:486`)
- AMD GPUs do have their model populated
The fix would be: copy `gpu.Name` from the SAT pipeline into `dev.Model` inside `enrichPCIeWithNVIDIAData`.
---
## Benchmark History "Unknown GPU" Issue
**Symptom:** Benchmark history table shows "GPU #N — Unknown GPU" columns instead of real GPU model names.
**Root cause:** `BenchmarkGPUResult.Name` has tag `json:"name,omitempty"`. If `queryBenchmarkGPUInfo()` fails (warns at `benchmark.go:126`) or returns empty names, the Name field is never set and is omitted from JSON. Loaded results have empty Name → falls back to "Unknown GPU" at `pages.go:2226, 2237`.
This happens for:
- Older result files saved before the `Name` field was added
- Runs where nvidia-smi query failed before the benchmark started
---
## Fallback Strings — Current State
| Location | File | Fallback string |
|---|---|---|
| Hardware summary (PCIe) | `pages.go:486` | `"Unknown GPU"` |
| Benchmark report summary | `benchmark_report.go:43` | `"Unknown GPU"` |
| Benchmark report scorecard | `benchmark_report.go:93` | `"Unknown"` ← inconsistent |
| Benchmark report detail | `benchmark_report.go:122` | `"Unknown GPU"` |
| Benchmark history per-GPU col | `pages.go:2226` | `"Unknown GPU"` |
| Benchmark history parallel col | `pages.go:2237` | `"Unknown GPU"` |
| SAT status file write | `sat.go:922` | `"unknown"` ← lowercase, inconsistent |
| GPU selection API | `api.go:163` | `"GPU N"` (no "Unknown") |
**Rule:** all UI fallbacks should use `"Unknown GPU"`. The two outliers are `benchmark_report.go:93` (`"Unknown"`) and `sat.go:922` (`"unknown"`).
---
## GPU Selection UI
**File:** `audit/internal/webui/pages.go`
- Source: `GET /api/gpus``api.go``ListNvidiaGPUs()` → live nvidia-smi
- Render: `'GPU ' + gpu.index + ' — ' + gpu.name + ' · ' + mem`
- Fallback: `gpu.name || 'GPU ' + idx` (JS, line ~1432)
This always shows the correct model because it queries nvidia-smi live. It is **not** connected to benchmark result data.
---
## Data Flow Summary
```
nvidia-smi (live)
└─ ListNvidiaGPUs() → NvidiaGPU.Name
├─ GPU selection UI (always correct)
├─ Live metrics labels (charts_svg.go)
└─ SAT/burn status file (sat.go)
nvidia-smi (at benchmark start)
└─ queryBenchmarkGPUInfo() → benchmarkGPUInfo.Name
└─ BenchmarkGPUResult.Name (json:"name,omitempty")
├─ Benchmark report
└─ Benchmark history table columns
nvidia-smi / lspci (audit collection)
└─ HardwarePCIeDevice.Model (NVIDIA: NOT populated; AMD: populated)
└─ Hardware summary page hwDescribeGPU()
```
---
## Fixed Issues
All previously open items are resolved:
1. **NVIDIA PCIe Model**`enrichPCIeWithNVIDIAData()` sets `dev.Model = &v` (`nvidia.go:78`).
2. **Fallback consistency**`sat.go` and `benchmark_report.go` both use `"Unknown GPU"`.
3. **`tops_per_sm_per_ghz`** — computed in `benchmark.go` and stored in `BenchmarkGPUScore.TOPSPerSMPerGHz`.
4. **`MultiprocessorCount`, `PowerLimitW`, `DefaultPowerLimitW`** — present in `benchmark_types.go`.
5. **Old benchmark JSONs** — no fix possible for already-saved results with missing names (display-only issue).