dcgmproftester and other GPU test subprocesses run inside the bee-web
cgroup and exceed 3G with 8 GPUs. OOM killer terminates the whole
service. No memory cap is appropriate on a LiveCD where GPU tests
legitimately use several GB.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stability hardening (webui/app):
- readFileLimited(): защита от OOM при чтении audit JSON (100 MB),
component-status DB (10 MB) и лога задачи (50 MB)
- jobs.go: буферизованный лог задачи — один открытый fd на задачу
вместо open/write/close на каждую строку (устраняет тысячи syscall/сек
при GPU стресс-тестах)
- stability.go: экспоненциальный backoff в goRecoverLoop (2s→4s→…→60s),
сброс при успешном прогоне >30s, счётчик перезапусков в slog
- kill_workers.go: таймаут 5s на скан /proc, warn при срабатывании
- bee-web.service: MemoryMax=3G — OOM killer защищён
Build script:
- build.sh: удалён блок генерации grub-pc/grub.cfg + live.cfg.in —
мёртвый код с v8.25; grub-pc игнорируется live-build, а генерируемый
live.cfg.in перезаписывал правильный статический файл устаревшей
версией без tuning-параметров ядра и пунктов gsp-off/kms+gsp-off
- build.sh: dump_memtest_debug теперь логирует grub-efi/grub.cfg
вместо grub-pc/grub.cfg (было всегда "missing")
GRUB:
- live-theme/bee-logo.png: логотип пчелы 400×400px на чёрном фоне
- live-theme/theme.txt: + image компонент по центру в верхней трети
экрана; меню сдвинуто с 62% до 65%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
grub-efi/grub.cfg: add KMS+GSP=off entry (was in isolinux, missing in GRUB)
isolinux/live.cfg.in: add full standard param set to all entries
(net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always
numa_balancing=disable nowatchdog nosoftlockup) to match grub-efi
bible-local/docs/iso-build-rules.md: add bootloader sync rule documenting
that grub-efi and isolinux must be kept in sync manually, listing canonical
entries and standard param set, and noting the grub-pc/grub-efi history.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Build uses --bootloaders "grub-efi,syslinux" so live-build reads
config/bootloaders/grub-efi/ for the UEFI GRUB config. The directory
was incorrectly named grub-pc, causing live-build to ignore our custom
grub.cfg and generate a default one (missing toram, GSP-off entries).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
live-boot already uses rsync --progress when /bin/rsync exists; without
it the copy falls back to silent cp -a. Add rsync to the ISO package
list and install an initramfs-tools hook (bee-rsync) that copies the
rsync binary + shared libs into the initrd via copy_exec. The hook then
rebuilds the initramfs so the change takes effect in the ISO's initrd.img.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- grub.cfg: add "load to RAM (toram)" entry to advanced submenu
- install_to_ram.go: resume from existing /dev/shm/bee-live copy if
source medium is unavailable after bee-web restart
- tasks.go: fix "Recovered after bee-web restart" shown on every run
(check j.lines before first append, not after)
- bee-install: retry unsquashfs up to 5x with wait-for-remount on
source loss; clear error message with bee-remount-medium hint
- bee-remount-medium: new script to find and remount live ISO source
after USB/CD reconnect; supports --wait polling mode
- 9000-bee-setup: chmod +x for bee-install and bee-remount-medium
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
400×400px PNG centered via feh --bg-center --image-bg '#000000'.
Fallback solid fill also changed to black.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Prevents stale debootstrap cache from bypassing --debootstrap-options
changes (e.g. --include=ca-certificates added in v8.15).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--bootstrap-packages is not a valid lb config option (20230502).
Use --debootstrap-options "--include=ca-certificates" instead to ensure
ca-certificates is present when lb chroot_archives runs apt-get update
against the NVIDIA CUDA HTTPS source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
debootstrap creates a minimal chroot without ca-certificates, causing
apt-get update to fail TLS verification for the NVIDIA CUDA apt source:
"No system certificates available. Try installing ca-certificates."
Add ca-certificates to --bootstrap-packages so it is present before
lb chroot_archives configures the NVIDIA HTTPS source and runs apt-get update.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The NVIDIA CUDA HTTPS apt source (developer.download.nvidia.com) may be
unreachable from inside the live-build container chroot, causing
'E: Unable to locate package datacenter-gpu-manager-4-cuda13'.
Add build-dcgm.sh that downloads DCGM and nvidia-fabricmanager .deb
packages on the build host (verifying SHA256 against Packages.gz) and
caches them in BEE_CACHE_DIR. build.sh (step 25-dcgm, nvidia only)
copies them into LB_DIR/config/packages.chroot/ before lb build, so
live-build creates a local apt repo from them. The chroot installs the
packages from the local repo without ever contacting the NVIDIA CUDA
HTTPS source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lb binary_grub-efi and lb binary_syslinux create these files from templates
that already have memtest entries hardcoded. The hook should not fail when
the files don't exist yet — validate_iso_memtest() checks the final ISO.
Only the binary files (x64.bin, x64.efi) are required here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ver_arg was set to "=memtest86+=VERSION" making the command
"apt-get download memtest86+=memtest86+=VERSION" (invalid).
Fixed to build pkg_spec directly as "memtest86+=VERSION".
Also add apt-get update retry if initial download fails.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Disabling --security broke the build because linux-image-6.1.0-44-amd64
is a security update not present in the base bookworm repo.
Main packages already come from mirror.mephi.ru.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Switch all lb mirrors to mirror.mephi.ru/debian/ for faster/reliable downloads
- Disable security repo (--security false) — not needed for LiveCD
- Pin MEMTEST_VERSION=6.10-4 in VERSIONS, export to hook environment
- Set BEE_REQUIRE_MEMTEST=1 in build-in-container.sh — missing memtest is now fatal
- Fix 9100-memtest.hook.binary: add apt-get download fallback when lb
binary_memtest has already purged the package cache; handle both 5.x
(memtest86+x64.bin) and 6.x (memtest86+.bin) BIOS binary naming
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mkdir -p LOG_DIR before writing the optional step log so that a race
with cleanup_build_log (EXIT trap archiving the log dir) does not cause
a "Directory nonexistent" error during lb binary_checksums / lb binary_iso.
Also downgrade apt-get update failure to a warning so a transient mirror
outage does not block kernel ABI auto-detection when the apt cache is warm.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- grub.cfg + isolinux/live.cfg.in: add pcie_aspm=off,
intel_idle.max_cstate=1 and processor.max_cstate=1 to all
non-failsafe boot entries
- bee-hpc-tuning: new script that sets all CPU cores to performance
governor via sysfs and logs THP state at boot
- bee-hpc-tuning.service: runs before bee-nvidia and bee-audit
- 9000-bee-setup.hook.chroot: enable service and mark script executable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HPL 2.3 from netlib compiled against OpenBLAS with a minimal
single-process MPI stub — no MPI package required in the ISO.
Matrix size is auto-sized to 80% of total RAM at runtime.
Build:
- VERSIONS: HPL_VERSION=2.3, HPL_SHA256=32c5c17d…
- build-hpl.sh: downloads HPL + OpenBLAS from Debian 12 repo,
compiles xhpl with a self-contained mpi_stub.c
- build.sh: step 80-hpl, injects xhpl + libopenblas into overlay
Runtime:
- bee-hpl: generates HPL.dat (N auto from /proc/meminfo, NB=256,
P=1 Q=1), runs xhpl, prints standard WR... Gflops output
- platform/hpl.go: RunHPL(), parses WR line → GFlops + PASSED/FAILED
- tasks.go: target "hpl"
- pages.go: LINPACK (HPL) card in validate/stress grid (stress-only)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-gpu-stress.c: remove per-wave cuCtxSynchronize barrier in both
cuBLASLt and PTX hot loops; sync at most once/sec so the GPU queue
stays continuously full — eliminates the CPU↔GPU ping-pong that
prevented reaching full TDP
- sat_fan_stress.go: default SizeMB 0 (auto = 95% VRAM) instead of
hardcoded 64 MB; tiny matrices caused <0.1 ms kernels where CPU
re-queue overhead dominated
- pages.go: move nvidia-targeted-power and nvidia-pulse from Burn →
Validate stress section alongside nvidia-targeted-stress; these are
DCGM pass/fail diagnostics, not sustained burn loads; remove the
Power Delivery / Power Budget card from Burn entirely
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
losetup --replace --direct-io=on fails with EINVAL when the target file
is on tmpfs (/dev/shm), because tmpfs does not support O_DIRECT.
Strip the --direct-io flag from the replace call and downgrade the
verification failure to a warning so boot continues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>