
Michael Chus d52ec67f8f Stability hardening, build script fixes, GRUB bee logo

Stability hardening (webui/app):
- readFileLimited(): OOM protection when reading the audit JSON (100 MB),
  the component-status DB (10 MB), and the task log (50 MB)
- jobs.go: buffered task log, with one open fd per task instead of
  open/write/close for every line (eliminates thousands of syscalls/sec
  during GPU stress tests)
- stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s),
  reset after a successful run of >30s, restart counter in slog
- kill_workers.go: 5s timeout on the /proc scan, warn when it fires
- bee-web.service: MemoryMax=3G (protection against the OOM killer)

Build script:
- build.sh: removed the block that generated grub-pc/grub.cfg + live.cfg.in.
  It has been dead code since v8.25: grub-pc is ignored by live-build, and the
  generated live.cfg.in overwrote the correct static file with an outdated
  version lacking the kernel tuning parameters and the gsp-off/kms+gsp-off entries
- build.sh: dump_memtest_debug now logs grub-efi/grub.cfg
  instead of grub-pc/grub.cfg (which was always "missing")

GRUB:
- live-theme/bee-logo.png: bee logo, 400×400px on a black background
- live-theme/theme.txt: + image component centered in the upper third
  of the screen; menu moved from 62% to 65%

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-19 13:08:31 +03:00

4.8 KiB


GPU Model Name Propagation

How GPU model names are detected, stored, and displayed throughout the project.

Detection Sources

There are three separate pipelines for GPU model names. They use different structs and don't share state.

Pipeline A — Live / SAT (nvidia-smi query at runtime)

File: audit/internal/platform/sat.go

ListNvidiaGPUs() → NvidiaGPU.Name (field: name, from nvidia-smi --query-gpu=index,name,...)
ListNvidiaGPUStatuses() → NvidiaGPUStatus.Name
Used by: GPU selection UI, live metrics labels, burn/stress test logic
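
As a rough illustration of Pipeline A, the sketch below parses the kind of CSV text that `nvidia-smi --query-gpu=index,name --format=csv,noheader` prints. It is not the project's ListNvidiaGPUs(): the `gpuInfo` struct, the `parseGPUList` function, and the sample GPU name are all assumptions made for the example, and the real query includes more fields than index and name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpuInfo is a hypothetical stand-in for the project's NvidiaGPU struct;
// only the two fields needed for the example are included.
type gpuInfo struct {
	Index int
	Name  string
}

// parseGPUList parses text in the shape printed by
//   nvidia-smi --query-gpu=index,name --format=csv,noheader
// which is one GPU per line with fields separated by ", ".
func parseGPUList(out string) ([]gpuInfo, error) {
	var gpus []gpuInfo
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		// Split on the first comma only, so the rest of the line is the name.
		idxStr, name, ok := strings.Cut(line, ",")
		if !ok {
			return nil, fmt.Errorf("unexpected nvidia-smi line: %q", line)
		}
		idx, err := strconv.Atoi(strings.TrimSpace(idxStr))
		if err != nil {
			return nil, fmt.Errorf("bad GPU index in %q: %w", line, err)
		}
		gpus = append(gpus, gpuInfo{Index: idx, Name: strings.TrimSpace(name)})
	}
	return gpus, nil
}

func main() {
	// Sample text for illustration; not captured from a real machine.
	sample := "0, NVIDIA H100 80GB HBM3\n1, NVIDIA H100 80GB HBM3\n"
	gpus, err := parseGPUList(sample)
	if err != nil {
		panic(err)
	}
	for _, g := range gpus {
		fmt.Printf("GPU %d: %s\n", g.Index, g.Name)
	}
}
```

Keeping the parser separate from the command invocation makes it possible to test the name handling without a GPU present.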

Pipeline B — Benchmark results

File: audit/internal/platform/benchmark.go, line 124

queryBenchmarkGPUInfo(selected) → benchmarkGPUInfo.Name
Stored in BenchmarkGPUResult.Name (json:"name,omitempty")
Used by: benchmark history table, benchmark report

Pipeline C — Hardware audit JSON (PCIe schema)

File: audit/internal/schema/hardware.go

HardwarePCIeDevice.Model *string (field name is Model, not Name)
For AMD GPUs: populated by audit/internal/collector/amdgpu.go from info.Product
For NVIDIA GPUs: NOT populated by audit/internal/collector/nvidia.go — the NVIDIA enricher sets telemetry/status but skips the Model field
Used by: hardware summary page (hwDescribeGPU in pages.go:487)

Key Inconsistency: NVIDIA PCIe Model is Never Set

audit/internal/collector/nvidia.go — enrichPCIeWithNVIDIAData() enriches NVIDIA PCIe devices with telemetry and status but does not populate HardwarePCIeDevice.Model.

This means:

Hardware summary page shows "Unknown GPU" for all NVIDIA devices (falls back at pages.go:486)
AMD GPUs do have their model populated

The fix would be: copy gpu.Name from the SAT pipeline into dev.Model inside enrichPCIeWithNVIDIAData.
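
A minimal sketch of that fix follows. The struct definitions are trimmed-down stand-ins, and matching devices to GPUs by PCI bus address (`BDF`) is an assumption made for the example; the real enrichPCIeWithNVIDIAData() may pair them differently.

```go
package main

import "fmt"

// pcieDevice is a hypothetical, reduced version of HardwarePCIeDevice.
type pcieDevice struct {
	BDF   string  // PCI bus address; assumed to be the join key
	Model *string // nil when no model has been recorded
}

// nvidiaGPU is a hypothetical, reduced version of NvidiaGPU.
type nvidiaGPU struct {
	BDF  string
	Name string
}

// setModelFromNVIDIA copies each GPU's Name into the Model field of the
// PCIe device with the same bus address, without overwriting a Model
// that is already set to a non-empty value.
func setModelFromNVIDIA(devs []pcieDevice, gpus []nvidiaGPU) {
	byBDF := make(map[string]string, len(gpus))
	for _, g := range gpus {
		if g.Name != "" {
			byBDF[g.BDF] = g.Name
		}
	}
	for i := range devs {
		if devs[i].Model != nil && *devs[i].Model != "" {
			continue
		}
		if name, ok := byBDF[devs[i].BDF]; ok {
			v := name // take the address of a fresh copy per device
			devs[i].Model = &v
		}
	}
}

func main() {
	devs := []pcieDevice{{BDF: "0000:01:00.0"}, {BDF: "0000:02:00.0"}}
	gpus := []nvidiaGPU{{BDF: "0000:01:00.0", Name: "NVIDIA A100"}}
	setModelFromNVIDIA(devs, gpus)
	for _, d := range devs {
		if d.Model != nil {
			fmt.Println(d.BDF, *d.Model)
		} else {
			fmt.Println(d.BDF, "(no model)")
		}
	}
}
```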

Benchmark History "Unknown GPU" Issue

Symptom: Benchmark history table shows "GPU #N — Unknown GPU" columns instead of real GPU model names.

Root cause: BenchmarkGPUResult.Name is tagged json:"name,omitempty". If queryBenchmarkGPUInfo() fails (it logs a warning at benchmark.go:126) or returns empty names, Name is left empty and is omitted from the saved JSON. When such a result is loaded, the empty Name falls back to "Unknown GPU" at pages.go:2226 and pages.go:2237.

This happens for:

Older result files saved before the Name field was added
Runs where nvidia-smi query failed before the benchmark started
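
The omitempty behavior behind this symptom can be reproduced in isolation. In the sketch below, `gpuResult` copies only the Name field and its JSON tag from the text above; the `Index` field and the `roundTrip` and `displayName` helpers are assumptions added for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gpuResult is a hypothetical, reduced version of BenchmarkGPUResult.
type gpuResult struct {
	Index int    `json:"index"`
	Name  string `json:"name,omitempty"`
}

// roundTrip saves a result to JSON and loads it back, returning both
// the serialized text and the reloaded value.
func roundTrip(r gpuResult) (string, gpuResult, error) {
	raw, err := json.Marshal(r)
	if err != nil {
		return "", gpuResult{}, err
	}
	var loaded gpuResult
	if err := json.Unmarshal(raw, &loaded); err != nil {
		return "", gpuResult{}, err
	}
	return string(raw), loaded, nil
}

// displayName applies the same kind of fallback the history table uses.
func displayName(r gpuResult) string {
	if r.Name == "" {
		return "Unknown GPU"
	}
	return r.Name
}

func main() {
	// Name was never set, as when the nvidia-smi query failed.
	raw, loaded, err := roundTrip(gpuResult{Index: 0})
	if err != nil {
		panic(err)
	}
	fmt.Println(raw)                 // {"index":0}
	fmt.Println(displayName(loaded)) // Unknown GPU
}
```

Because the key is absent from the file, nothing at load time can distinguish "name was empty" from "name was never recorded".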

Fallback Strings — Current State

| Location | File | Fallback string |
|---|---|---|
| Hardware summary (PCIe) | `pages.go:486` | `"Unknown GPU"` |
| Benchmark report summary | `benchmark_report.go:43` | `"Unknown GPU"` |
| Benchmark report scorecard | `benchmark_report.go:93` | `"Unknown"` ← inconsistent |
| Benchmark report detail | `benchmark_report.go:122` | `"Unknown GPU"` |
| Benchmark history per-GPU col | `pages.go:2226` | `"Unknown GPU"` |
| Benchmark history parallel col | `pages.go:2237` | `"Unknown GPU"` |
| SAT status file write | `sat.go:922` | `"unknown"` ← lowercase, inconsistent |
| GPU selection API | `api.go:163` | `"GPU N"` (no "Unknown") |

Rule: all UI fallbacks should use "Unknown GPU". The two outliers are benchmark_report.go:93 ("Unknown") and sat.go:922 ("unknown").
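
One way to enforce this rule is to route every call site through a single helper so the string lives in one place. This is only a sketch: the `unknownGPU` constant and the `gpuNameOrUnknown` function are hypothetical names, not identifiers from the codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// unknownGPU is the single fallback label for UI output (hypothetical name).
const unknownGPU = "Unknown GPU"

// gpuNameOrUnknown returns the trimmed name, or the shared fallback
// label when the name is empty or whitespace only.
func gpuNameOrUnknown(name string) string {
	if n := strings.TrimSpace(name); n != "" {
		return n
	}
	return unknownGPU
}

func main() {
	fmt.Println(gpuNameOrUnknown("NVIDIA A100")) // NVIDIA A100
	fmt.Println(gpuNameOrUnknown("   "))         // Unknown GPU
}
```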

GPU Selection UI

File: audit/internal/webui/pages.go

Source: GET /api/gpus → api.go → ListNvidiaGPUs() → live nvidia-smi
Render: 'GPU ' + gpu.index + ' — ' + gpu.name + ' · ' + mem
Fallback: gpu.name || 'GPU ' + idx (JS, line ~1432)

This always shows the correct model because it queries nvidia-smi live. It is not connected to benchmark result data.

Data Flow Summary

nvidia-smi (live)
  └─ ListNvidiaGPUs() → NvidiaGPU.Name
       ├─ GPU selection UI (always correct)
       ├─ Live metrics labels (charts_svg.go)
       └─ SAT/burn status file (sat.go)

nvidia-smi (at benchmark start)
  └─ queryBenchmarkGPUInfo() → benchmarkGPUInfo.Name
       └─ BenchmarkGPUResult.Name (json:"name,omitempty")
            ├─ Benchmark report
            └─ Benchmark history table columns

nvidia-smi / lspci (audit collection)
  └─ HardwarePCIeDevice.Model (NVIDIA: NOT populated; AMD: populated)
       └─ Hardware summary page hwDescribeGPU()

Fixed Issues

All previously open items are resolved. The inconsistencies described in the sections above reflect the state before these fixes:

NVIDIA PCIe Model — enrichPCIeWithNVIDIAData() sets dev.Model = &v (nvidia.go:78).
Fallback consistency — sat.go and benchmark_report.go both use "Unknown GPU".
tops_per_sm_per_ghz — computed in benchmark.go and stored in BenchmarkGPUScore.TOPSPerSMPerGHz.
MultiprocessorCount, PowerLimitW, DefaultPowerLimitW — present in benchmark_types.go.
Old benchmark JSONs — no fix possible for already-saved results with missing names (display-only issue).
