bee/bible-local/docs/gpu-model-propagation.md
Michael Chus d52ec67f8f Stability hardening, build script fixes, GRUB bee logo
Stability hardening (webui/app):
- readFileLimited(): OOM protection when reading the audit JSON (100 MB),
  the component-status DB (10 MB), and the task log (50 MB)
- jobs.go: buffered task log with one open fd per task instead of
  open/write/close on every line (eliminates thousands of syscalls/sec
  during GPU stress tests)
- stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s),
  reset after a successful run of >30s, restart counter in slog
- kill_workers.go: 5s timeout on the /proc scan, warn when it fires
- bee-web.service: MemoryMax=3G as an OOM-killer safeguard

Build script:
- build.sh: removed the block that generated grub-pc/grub.cfg + live.cfg.in.
  It has been dead code since v8.25: grub-pc is ignored by live-build, and the
  generated live.cfg.in overwrote the correct static file with an outdated
  version lacking the kernel tuning parameters and the gsp-off/kms+gsp-off entries
- build.sh: dump_memtest_debug now logs grub-efi/grub.cfg
  instead of grub-pc/grub.cfg (previously always "missing")

GRUB:
- live-theme/bee-logo.png: 400×400px bee logo on a black background
- live-theme/theme.txt: + image component centered in the top third of the
  screen; menu moved from 62% to 65%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 13:08:31 +03:00


GPU Model Name Propagation

How GPU model names are detected, stored, and displayed throughout the project.


Detection Sources

There are three separate pipelines for GPU model names; they use different structs and don't share state.

Pipeline A — Live / SAT (nvidia-smi query at runtime)

File: audit/internal/platform/sat.go

  • ListNvidiaGPUs() → NvidiaGPU.Name (field: name, from nvidia-smi --query-gpu=index,name,...)
  • ListNvidiaGPUStatuses() → NvidiaGPUStatus.Name
  • Used by: GPU selection UI, live metrics labels, burn/stress test logic
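
As a rough illustration of Pipeline A, the sketch below parses nvidia-smi CSV output into a name-carrying struct. It assumes the query is run as `nvidia-smi --query-gpu=index,name --format=csv,noheader`; the reduced `NvidiaGPU` struct and the `parseNvidiaSMI` helper are hypothetical and are not the project's actual code in sat.go.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// NvidiaGPU is reduced to the two fields discussed here. The real struct in
// sat.go carries more fields; this layout is an assumption.
type NvidiaGPU struct {
	Index int
	Name  string
}

// parseNvidiaSMI parses "index, name" CSV rows (noheader format) into
// NvidiaGPU values. Hypothetical helper, not the project's parser.
func parseNvidiaSMI(out string) ([]NvidiaGPU, error) {
	r := csv.NewReader(strings.NewReader(out))
	r.TrimLeadingSpace = true // nvidia-smi separates fields with ", "
	r.FieldsPerRecord = 2     // reject rows with a different column count
	rows, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	gpus := make([]NvidiaGPU, 0, len(rows))
	for _, row := range rows {
		idx, err := strconv.Atoi(strings.TrimSpace(row[0]))
		if err != nil {
			return nil, fmt.Errorf("bad GPU index %q: %w", row[0], err)
		}
		gpus = append(gpus, NvidiaGPU{Index: idx, Name: strings.TrimSpace(row[1])})
	}
	return gpus, nil
}

func main() {
	// Sample input with placeholder model names, not captured from real hardware.
	sample := "0, Example GPU A\n1, Example GPU B\n"
	gpus, err := parseNvidiaSMI(sample)
	if err != nil {
		panic(err)
	}
	for _, g := range gpus {
		fmt.Printf("GPU %d: %s\n", g.Index, g.Name)
	}
}
```

Because the name is read from nvidia-smi on every call, consumers of this pipeline never depend on previously saved data.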

Pipeline B — Benchmark results

File: audit/internal/platform/benchmark.go, line 124

  • queryBenchmarkGPUInfo(selected) → benchmarkGPUInfo.Name
  • Stored in BenchmarkGPUResult.Name (json:"name,omitempty")
  • Used by: benchmark history table, benchmark report

Pipeline C — Hardware audit JSON (PCIe schema)

File: audit/internal/schema/hardware.go

  • HardwarePCIeDevice.Model *string (field name is Model, not Name)
  • For AMD GPUs: populated by audit/internal/collector/amdgpu.go from info.Product
  • For NVIDIA GPUs: NOT populated by audit/internal/collector/nvidia.go — the NVIDIA enricher sets telemetry/status but skips the Model field
  • Used by: hardware summary page (hwDescribeGPU in pages.go:487)

Key Inconsistency: NVIDIA PCIe Model is Never Set

In audit/internal/collector/nvidia.go, enrichPCIeWithNVIDIAData() enriches NVIDIA PCIe devices with telemetry and status but does not populate HardwarePCIeDevice.Model.

This means:

  • Hardware summary page shows "Unknown GPU" for all NVIDIA devices (falls back at pages.go:486)
  • AMD GPUs do have their model populated

The fix would be: copy gpu.Name from the SAT pipeline into dev.Model inside enrichPCIeWithNVIDIAData.
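
A minimal sketch of that fix, under stated assumptions: the structs are trimmed to the fields involved, matching devices to GPUs by a PCI bus address (`BDF`) is an assumed join key, and `enrichModel` is a hypothetical stand-in for the relevant part of enrichPCIeWithNVIDIAData.

```go
package main

import "fmt"

// HardwarePCIeDevice and NvidiaGPU are trimmed to the fields used here.
// The BDF field as a join key is an assumption for this sketch.
type HardwarePCIeDevice struct {
	BDF   string
	Model *string
}

type NvidiaGPU struct {
	BDF  string
	Name string
}

// enrichModel copies each GPU name into the matching PCIe device's Model
// when Model is still unset. Hypothetical helper.
func enrichModel(devs []HardwarePCIeDevice, gpus []NvidiaGPU) {
	byBDF := make(map[string]string, len(gpus))
	for _, g := range gpus {
		byBDF[g.BDF] = g.Name
	}
	for i := range devs {
		if devs[i].Model != nil {
			continue // keep a model that another collector already set
		}
		if name, ok := byBDF[devs[i].BDF]; ok && name != "" {
			v := name
			devs[i].Model = &v
		}
	}
}

func main() {
	devs := []HardwarePCIeDevice{{BDF: "0000:01:00.0"}, {BDF: "0000:02:00.0"}}
	gpus := []NvidiaGPU{{BDF: "0000:01:00.0", Name: "Example GPU"}}
	enrichModel(devs, gpus)
	for _, d := range devs {
		if d.Model != nil {
			fmt.Println(d.BDF, *d.Model)
		} else {
			fmt.Println(d.BDF, "(model not set)")
		}
	}
}
```

Iterating by index matters here: ranging over `devs` by value would modify a copy and leave the slice elements untouched.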


Benchmark History "Unknown GPU" Issue

Symptom: Benchmark history table shows "GPU #N — Unknown GPU" columns instead of real GPU model names.

Root cause: BenchmarkGPUResult.Name has tag json:"name,omitempty". If queryBenchmarkGPUInfo() fails (warns at benchmark.go:126) or returns empty names, the Name field is never set and is omitted from JSON. Loaded results have empty Name → falls back to "Unknown GPU" at pages.go:2226, 2237.

This happens for:

  • Older result files saved before the Name field was added
  • Runs where nvidia-smi query failed before the benchmark started
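
The omitempty behavior behind this symptom can be reproduced in isolation. In the sketch below only the `Name` tag comes from the text above; the `Index` field, `saveAndLoad`, and `displayName` are hypothetical and exist only for the demonstration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BenchmarkGPUResult is reduced to two fields. Index and its tag are
// assumptions; the Name tag is the one quoted in the text.
type BenchmarkGPUResult struct {
	Index int    `json:"index"`
	Name  string `json:"name,omitempty"`
}

// saveAndLoad marshals a result and reads it back, mimicking a result file
// being written and later loaded by the history page. Hypothetical helper.
func saveAndLoad(r BenchmarkGPUResult) (string, BenchmarkGPUResult, error) {
	var loaded BenchmarkGPUResult
	raw, err := json.Marshal(r)
	if err != nil {
		return "", loaded, err
	}
	err = json.Unmarshal(raw, &loaded)
	return string(raw), loaded, err
}

// displayName applies the "Unknown GPU" fallback. Hypothetical helper.
func displayName(r BenchmarkGPUResult) string {
	if r.Name == "" {
		return "Unknown GPU"
	}
	return r.Name
}

func main() {
	// Name was never set, as when the nvidia-smi query failed.
	raw, loaded, err := saveAndLoad(BenchmarkGPUResult{Index: 0})
	if err != nil {
		panic(err)
	}
	fmt.Println(raw)                 // {"index":0}
	fmt.Println(displayName(loaded)) // Unknown GPU
}
```

Once the file is written without a name key, nothing in the file itself allows the name to be recovered at load time.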

Fallback Strings — Current State

| Location | File | Fallback string |
|---|---|---|
| Hardware summary (PCIe) | pages.go:486 | "Unknown GPU" |
| Benchmark report summary | benchmark_report.go:43 | "Unknown GPU" |
| Benchmark report scorecard | benchmark_report.go:93 | "Unknown" ← inconsistent |
| Benchmark report detail | benchmark_report.go:122 | "Unknown GPU" |
| Benchmark history per-GPU col | pages.go:2226 | "Unknown GPU" |
| Benchmark history parallel col | pages.go:2237 | "Unknown GPU" |
| SAT status file write | sat.go:922 | "unknown" ← lowercase, inconsistent |
| GPU selection API | api.go:163 | "GPU N" (no "Unknown") |

Rule: all UI fallbacks should use "Unknown GPU". The two outliers are benchmark_report.go:93 ("Unknown") and sat.go:922 ("unknown").
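
One way to keep the rule enforceable is to route every call site through a single constant. This is a sketch only; `UnknownGPU`, `gpuLabel`, and `gpuLabelPtr` are hypothetical names that do not appear in the codebase as described here.

```go
package main

import (
	"fmt"
	"strings"
)

// UnknownGPU is the single fallback label for UI output. Hypothetical
// constant name.
const UnknownGPU = "Unknown GPU"

// gpuLabel returns the trimmed name, or the shared fallback when the name
// is empty or whitespace only.
func gpuLabel(name string) string {
	if n := strings.TrimSpace(name); n != "" {
		return n
	}
	return UnknownGPU
}

// gpuLabelPtr covers pointer-typed fields such as HardwarePCIeDevice.Model.
func gpuLabelPtr(name *string) string {
	if name == nil {
		return UnknownGPU
	}
	return gpuLabel(*name)
}

func main() {
	model := "Example GPU"
	fmt.Println(gpuLabel(""))        // Unknown GPU
	fmt.Println(gpuLabelPtr(nil))    // Unknown GPU
	fmt.Println(gpuLabelPtr(&model)) // Example GPU
}
```

With a shared helper, spelling variants like "Unknown" or "unknown" cannot drift apart between files.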


GPU Selection UI

File: audit/internal/webui/pages.go

  • Source: GET /api/gpus → api.go → ListNvidiaGPUs() → live nvidia-smi
  • Render: 'GPU ' + gpu.index + ' — ' + gpu.name + ' · ' + mem
  • Fallback: gpu.name || 'GPU ' + idx (JS, line ~1432)

This always shows the correct model because it queries nvidia-smi live. It is not connected to benchmark result data.


Data Flow Summary

nvidia-smi (live)
  └─ ListNvidiaGPUs() → NvidiaGPU.Name
       ├─ GPU selection UI (always correct)
       ├─ Live metrics labels (charts_svg.go)
       └─ SAT/burn status file (sat.go)

nvidia-smi (at benchmark start)
  └─ queryBenchmarkGPUInfo() → benchmarkGPUInfo.Name
       └─ BenchmarkGPUResult.Name (json:"name,omitempty")
            ├─ Benchmark report
            └─ Benchmark history table columns

nvidia-smi / lspci (audit collection)
  └─ HardwarePCIeDevice.Model (NVIDIA: NOT populated; AMD: populated)
       └─ Hardware summary page hwDescribeGPU()

Fixed Issues

All previously open items are resolved, except item 5, which cannot be fixed retroactively. The inconsistencies described in the sections above reflect the state before these fixes:

  1. NVIDIA PCIe Model: enrichPCIeWithNVIDIAData() sets dev.Model = &v (nvidia.go:78).
  2. Fallback consistency: sat.go and benchmark_report.go both use "Unknown GPU".
  3. tops_per_sm_per_ghz: computed in benchmark.go and stored in BenchmarkGPUScore.TOPSPerSMPerGHz.
  4. MultiprocessorCount, PowerLimitW, DefaultPowerLimitW: present in benchmark_types.go.
  5. Old benchmark JSONs: no fix possible for already-saved results with missing names (display-only issue).