A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema
and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to
HardwareSnapshot.
B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go;
replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice,
isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate)
with numeric vendor_id comparisons.
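
A minimal sketch of the numeric comparison in B. The constant and function names here are hypothetical and do not reflect the actual contents of collector/pci_vendors.go; the vendor IDs themselves are the standard PCI-SIG assignments for NVIDIA, AMD/ATI and Mellanox.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Standard PCI-SIG vendor IDs. The constant names are illustrative only.
const (
	vendorNVIDIA   uint16 = 0x10de
	vendorAMD      uint16 = 0x1002
	vendorMellanox uint16 = 0x15b3
)

// parseVendorID converts a sysfs-style vendor string such as "0x10de\n"
// into its numeric form.
func parseVendorID(s string) (uint16, error) {
	s = strings.TrimPrefix(strings.TrimSpace(strings.ToLower(s)), "0x")
	v, err := strconv.ParseUint(s, 16, 16)
	if err != nil {
		return 0, fmt.Errorf("parse vendor id %q: %w", s, err)
	}
	return uint16(v), nil
}

// isGPUVendor checks only the vendor ID. A real GPU check would also
// need the PCI class code, since these vendors ship non-GPU devices too.
func isGPUVendor(id uint16) bool {
	return id == vendorNVIDIA || id == vendorAMD
}

func main() {
	id, err := parseVendorID("0x10de\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("vendor 0x%04x gpu-vendor=%v\n", id, isGPUVendor(id))
}
```

Comparing IDs avoids depending on how a given pci.ids database or tool spells the vendor name.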
C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go,
app_network.go, app_services.go, app_packs.go, app_install.go — no logic changes.
D (go-code-style): wrap bare return err in interfaceAdminState and
interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including
the interface name.
E (go-project-bible): add bible-local/architecture/data-model.md and
bible-local/architecture/api-surface.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Disabled PCIe devices (sysfs enable==0) carry no data traffic; their
link state has no operational impact. Switchtec PCIe switch management
endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers)
train at reduced speed intentionally and were producing spurious warnings.
Check is vendor-agnostic: reads enable attribute via existing helper,
no vendor/device ID hardcoding.
Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvtop pulled nvidia-tesla-470-* via Recommends into the nogpu build.
Move it from bee.list.chroot into bee-nvidia and bee-amd lists so it
only appears in GPU variants.
Also remove the stray git-bible/ directory (was not gitignored) and
move grub-bitmap-error docs into bible-local/docs/.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IPMI hang fix (Lenovo XCC SR650 V3):
- Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag
- Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found
- collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks
are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list)
- collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts
VROC license detection:
- Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform
- Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot
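
A sketch of the "License:" parsing step. parseVROCLicense is a hypothetical name, and the sample text in main is illustrative rather than captured mdadm output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseVROCLicense scans key/value output for a line whose key is
// "License" and returns the trimmed value.
func parseVROCLicense(out string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if ok && strings.EqualFold(strings.TrimSpace(key), "License") {
			return strings.TrimSpace(val), true
		}
	}
	return "", false
}

func main() {
	// Illustrative input only; the real output layout may differ.
	out := "       Platform : Example RAID platform\n        License : Example Tier\n"
	lic, ok := parseVROCLicense(out)
	fmt.Println(lic, ok)
}
```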
Blackbox service fix:
- bee-blackbox.service was missing from systemctl enable list in ISO build hook
- Service never started on boot; state file never written; UI button stayed "Enable"
Drop qrencode:
- Remove from package list, standardTools API check, and runtime-flows doc
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stability hardening (webui/app):
- readFileLimited(): guards against OOM when reading the audit JSON (100 MB),
component-status DB (10 MB) and task log (50 MB)
- jobs.go: buffered task log, one open fd per task
instead of open/write/close on every line (eliminates thousands of syscalls/sec
during GPU stress tests)
- stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s),
reset after a successful run >30s, restart counter in slog
- kill_workers.go: 5s timeout on the /proc scan, warn when it fires
- bee-web.service: MemoryMax=3G as protection against the OOM killer
Build script:
- build.sh: removed the block that generated grub-pc/grub.cfg + live.cfg.in;
dead code since v8.25. live-build ignores grub-pc, and the generated
live.cfg.in overwrote the correct static file with an outdated
version lacking the kernel tuning parameters and the gsp-off/kms+gsp-off entries
- build.sh: dump_memtest_debug now logs grub-efi/grub.cfg
instead of grub-pc/grub.cfg (which was always "missing")
GRUB:
- live-theme/bee-logo.png: bee logo, 400×400px on a black background
- live-theme/theme.txt: + image component centered in the upper third
of the screen; menu moved from 62% to 65%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
grub-efi/grub.cfg: add KMS+GSP=off entry (was in isolinux, missing in GRUB)
isolinux/live.cfg.in: add full standard param set to all entries
(net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always
numa_balancing=disable nowatchdog nosoftlockup) to match grub-efi
bible-local/docs/iso-build-rules.md: add bootloader sync rule documenting
that grub-efi and isolinux must be kept in sync manually, listing canonical
entries and standard param set, and noting the grub-pc/grub-efi history.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
→ 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
.tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"
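
A sketch of min-max downsampling as mentioned for charts_svg.go. The function name and bucket layout are assumptions. The input is split into maxPoints/2 buckets and each bucket contributes its minimum and maximum in original order, so short spikes survive the reduction.

```go
package main

import "fmt"

// downsampleMinMax reduces values to at most maxPoints samples while
// preserving per-bucket extremes. Inputs that already fit are returned as is.
func downsampleMinMax(values []float64, maxPoints int) []float64 {
	if maxPoints < 2 || len(values) <= maxPoints {
		return values
	}
	buckets := maxPoints / 2
	out := make([]float64, 0, buckets*2)
	for b := 0; b < buckets; b++ {
		start := b * len(values) / buckets
		end := (b + 1) * len(values) / buckets
		minI, maxI := start, start
		for i := start + 1; i < end; i++ {
			if values[i] < values[minI] {
				minI = i
			}
			if values[i] > values[maxI] {
				maxI = i
			}
		}
		// Emit the two extremes in the order they occurred.
		if minI <= maxI {
			out = append(out, values[minI], values[maxI])
		} else {
			out = append(out, values[maxI], values[minI])
		}
	}
	return out
}

func main() {
	vals := make([]float64, 10000)
	for i := range vals {
		vals[i] = float64(i % 100)
	}
	vals[5000] = 1e6 // a single spike that plain decimation could miss
	fmt.Println(len(downsampleMinMax(vals, 1400)))
}
```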
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines
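
A sketch of two of the derived checks above. The exact formulas are assumptions: tops_per_sm_per_ghz is taken here as measured TOPS divided by (SM count × clock in GHz), and the power-limit flag as enforced < 95% of default. Function names are hypothetical.

```go
package main

import "fmt"

// topsPerSMPerGHz normalises throughput by SM count and clock, so GPUs of
// different sizes and clocks can be compared. Returns 0 for invalid input.
func topsPerSMPerGHz(tops float64, smCount int, clockMHz float64) float64 {
	if smCount <= 0 || clockMHz <= 0 {
		return 0
	}
	return tops / (float64(smCount) * clockMHz / 1000.0)
}

// powerLimitReduced reports whether the enforced limit is more than 5%
// below the default power limit.
func powerLimitReduced(enforcedW, defaultW float64) bool {
	if defaultW <= 0 {
		return false
	}
	return enforcedW < defaultW*0.95
}

func main() {
	// Illustrative numbers only, not measured values.
	fmt.Println(topsPerSMPerGHz(264, 132, 2000))
	fmt.Println(powerLimitReduced(650, 700))
}
```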
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- jobState now has optional cancel func; abort() calls it if job is running
- handleAPISATRun passes cancellable context to RunNvidiaAcceptancePackWithOptions
- POST /api/sat/abort?job_id=... cancels the running SAT job
- bible-local/runtime-flows.md: replace TUI SAT flow with Web UI flow
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- One engine: go-analyze/charts (grafana theme) for all live metrics
- Server chart: CPU temp, CPU load%, mem load%, power W, fan RPMs
- GPU charts: temp, load%, mem%, power W — one card per GPU, added dynamically
- Charts 1400x280px SVG, rendered at width:100% in single-column layout
- Add CPU load (from /proc/stat) and mem load (from /proc/meminfo) to LiveMetricSample
- Add GPU mem utilization to GPUMetricRow (nvidia-smi utilization.memory)
- Document charting architecture in bible-local/architecture/charting.md
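
A sketch of deriving CPU load% from two /proc/stat samples. Type and function names are hypothetical. It sums the first eight counters of the aggregate "cpu" line (user through steal) and treats idle + iowait as idle time, which is one common convention and may differ from what the collector does.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type cpuTimes struct{ idle, total uint64 }

// parseCPULine parses the aggregate "cpu ..." line from /proc/stat.
func parseCPULine(line string) (cpuTimes, error) {
	fields := strings.Fields(line)
	if len(fields) < 5 || fields[0] != "cpu" {
		return cpuTimes{}, fmt.Errorf("unexpected /proc/stat line: %q", line)
	}
	var t cpuTimes
	for i, f := range fields[1:] {
		if i >= 8 { // guest counters are already included in user/nice
			break
		}
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return cpuTimes{}, fmt.Errorf("parse field %d: %w", i+1, err)
		}
		t.total += v
		if i == 3 || i == 4 { // idle and iowait
			t.idle += v
		}
	}
	return t, nil
}

// cpuLoadPercent returns the busy share between two samples, in percent.
func cpuLoadPercent(prev, cur cpuTimes) float64 {
	if cur.total <= prev.total || cur.idle < prev.idle {
		return 0
	}
	dTotal := float64(cur.total - prev.total)
	dIdle := float64(cur.idle - prev.idle)
	return 100 * (1 - dIdle/dTotal)
}

func main() {
	a, _ := parseCPULine("cpu  100 0 100 800 0 0 0 0 0 0")
	b, _ := parseCPULine("cpu  150 0 150 900 0 0 0 0 0 0")
	fmt.Printf("%.1f%%\n", cpuLoadPercent(a, b))
}
```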
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BMC:
- collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent
- collector/collector.go: append BMC firmware record to snap.Firmware
- app/panel.go: show BMC version in TUI right-panel header alongside BIOS
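
A sketch of the BMC step: skip quietly when the IPMI device node is absent, otherwise run `ipmitool mc info` and pick out the "Firmware Revision" line. bmcFirmwareVersion and parseFirmwareRevision are hypothetical names, not the collector's actual functions.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// parseFirmwareRevision extracts the value of the "Firmware Revision"
// line from `ipmitool mc info` output.
func parseFirmwareRevision(out string) (string, bool) {
	for _, line := range strings.Split(out, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if ok && strings.TrimSpace(key) == "Firmware Revision" {
			return strings.TrimSpace(val), true
		}
	}
	return "", false
}

// bmcFirmwareVersion returns "" with a nil error when devPath is missing,
// so hosts without a BMC are skipped rather than reported as failures.
func bmcFirmwareVersion(devPath string) (string, error) {
	if _, err := os.Stat(devPath); err != nil {
		return "", nil // graceful skip: no IPMI device
	}
	out, err := exec.Command("ipmitool", "mc", "info").Output()
	if err != nil {
		return "", fmt.Errorf("ipmitool mc info: %w", err)
	}
	ver, _ := parseFirmwareRevision(string(out))
	return ver, nil
}

func main() {
	ver, err := bmcFirmwareVersion("/dev/ipmi0")
	fmt.Printf("version=%q err=%v\n", ver, err)
}
```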
CPU SAT:
- platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng
- app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated
- app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data
- tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label
- tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult
- tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration
- cmd/bee/main.go: bee sat cpu [--duration N] subcommand
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace 12-item flat menu with 4-item main menu: Health Check, Export support bundle, Settings, Exit
- Add Health Check screen (Lenovo-style): per-component checkboxes (GPU/MEM/DISK/CPU), Quick/Standard/Express modes, Run All, letter hotkeys G/M/S/C/R/A/1/2/3
- Add two-column main screen: left = menu, right = hardware panel with colored PASS/FAIL/CANCEL/N/A status per component; Tab/→ switches focus, Enter opens component detail
- Add app.LoadHardwarePanel() + ComponentDetailResult() reading audit JSON and SAT summary.txt files
- Move Network/Services/audit actions into Settings submenu
- Export: support bundle only (remove separate audit JSON export)
- Delete screen_acceptance.go; add screen_health_check.go, screen_settings.go, app/panel.go
- Add BMC + CPU stress-ng tests to backlog
- Update bible submodule
- Rewrite tui_test.go for new screen/action structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both build-nvidia-module.sh (apk add) and mkimage.sh (--repository) now
explicitly use dl-cdn. Local builder mirror config is ignored.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin linux-lts to exact KERNEL_PKG_VERSION=6.12.76-r0 in build and ISO package list
- Add build-time verification that compiled kernel version matches pin (fails loudly)
- Fix bee-audit-debug → bee-audit in genapkovl OpenRC registration (service was never starting)
- Add AUDIT_VERSION=0.1.0 to VERSIONS (was undefined, bee-release had empty fields)
- Pin linux-lts-dev version in second apk add in build-nvidia-module.sh
- Add /root/.profile to overlay so /usr/local/bin is in PATH for SSH sessions
- Remove "DEBUG MODE" from motd
- Fix smoketest: grep for slog "audit output written" instead of non-existent "audit completed"
- Document no-internet constraint in system-overview and runtime-flows
- Remove redundant genapkovl copy to /var/tmp (now found via ~/.mkimage/)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## ISO build consolidation
- Remove separate debug/prod split: overlay-debug/, build-debug.sh,
mkimg.bee_debug.sh, genapkovl-bee_debug.sh all deleted
- Single overlay: iso/overlay/ (was overlay-debug content)
- Single build script: build.sh (SSH, TUI, NVIDIA, vendor tools, bee-release)
- Single mkimage profile: bee (with dropbear, dialog, strace, gcompat, etc.)
## NVIDIA fixes
- Modules now stored at /usr/local/lib/nvidia/ instead of
/lib/modules/<kver>/extra/nvidia/ — modloop squashfs mounts over that
path at boot making overlay content there inaccessible
- bee-nvidia init: load via insmod (absolute path), not modprobe
- bee-nvidia init: create libnvidia-ml.so.1/libcuda.so.1 symlinks in /usr/lib/
- build-nvidia-module.sh: always install linux-lts-dev (not conditional) —
stale 6.6.x headers caused wrong-kernel modules that never loaded at runtime
- build-nvidia-module.sh: create soname symlinks in cache
- KERNEL_VERSION in VERSIONS updated 6.6 → 6.12
- gcompat added to ISO packages (nvidia-smi is a glibc binary on musl Alpine)
## Service ordering
- bee-audit: add `after bee-nvidia` so NVIDIA enrichment always succeeds
## New tooling
- iso/builder/smoketest.sh: SSH smoke test for post-boot ISO validation
- iso/builder/build-gpu-burn.sh: builds gpu_burn vendor binary (CUDA 12.8+)
- vendor/gpu_burn included automatically if placed in iso/vendor/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>