Commit Graph

299 Commits

Author SHA1 Message Date
Mikhail Chusavitin
8bf8dfa45b fix(boot): default to KMS + pci=realloc, drop nomodeset from main entries
Default and toram entries now boot with bee.display=kms (ASPEED AST
loads via KMS, Xorg uses modesetting driver) and pci=realloc (Linux
reassigns GPU BARs when BIOS lacks Above 4G Decoding). nomodeset
removed from these entries; still present in GSP=off and fail-safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 19:00:04 +03:00
Mikhail Chusavitin
ddb2bb5d1c fix(grub): replace em-dash with ASCII -- in all menu entry titles
Em-dash (U+2014) renders as garbage on GRUB serial/SOL output
(IPMI BMC consoles). Replace with ASCII double-hyphen throughout
grub.cfg template, write_canonical_grub_cfg, and theme.txt comment.

Also align template grub.cfg structure with write_canonical_grub_cfg:
toram entry moved to top level (was inside submenu).

bible: add ascii-safe-text contract documenting the no-em-dash rule.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 18:52:04 +03:00
Mikhail Chusavitin
aa284ae754 fix(iso): avoid grub logo scaling error 2026-04-20 14:06:32 +03:00
Mikhail Chusavitin
8512098174 fix(iso): restore bootappend-live in canonical boot menu 2026-04-20 13:39:05 +03:00
Mikhail Chusavitin
a35e90a93e fix(iso): clear stale bootloader templates in workdir 2026-04-20 13:19:50 +03:00
Mikhail Chusavitin
1ced81707f fix(iso): validate live boot entries in final ISO 2026-04-20 13:12:24 +03:00
Mikhail Chusavitin
647e99b697 Fix post-sync live-build ISO rebuild 2026-04-20 11:01:15 +03:00
Mikhail Chusavitin
84a2551dc0 Fix NVIDIA self-heal recovery flow 2026-04-20 09:43:22 +03:00
0cdfbc5875 fix(iso): restore boot UX and boot logs 2026-04-19 23:08:09 +03:00
2038489961 Remove MemoryMax=3G from bee-web.service to fix OOM kill during GPU tests
dcgmproftester and other GPU test subprocesses run inside the bee-web
cgroup and exceed 3G with 8 GPUs. OOM killer terminates the whole
service. No memory cap is appropriate on a LiveCD where GPU tests
legitimately use several GB.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 18:52:41 +03:00
d52ec67f8f Stability hardening, build script fixes, GRUB bee logo
Stability hardening (webui/app):
- readFileLimited(): OOM protection when reading the audit JSON (100 MB),
  component-status DB (10 MB), and task log (50 MB)
- jobs.go: buffered task log, with one open fd per task instead of
  open/write/close on every line (eliminates thousands of syscalls/sec
  during GPU stress tests)
- stability.go: exponential backoff in goRecoverLoop (2s→4s→…→60s),
  reset after a successful run >30s, restart counter in slog
- kill_workers.go: 5s timeout on the /proc scan, warn when it fires
- bee-web.service: MemoryMax=3G to guard against the OOM killer

Build script:
- build.sh: removed the block that generated grub-pc/grub.cfg + live.cfg.in
  (dead code since v8.25); grub-pc is ignored by live-build, and the generated
  live.cfg.in was overwriting the correct static file with an outdated
  version lacking the kernel tuning parameters and the gsp-off/kms+gsp-off entries
- build.sh: dump_memtest_debug now logs grub-efi/grub.cfg
  instead of grub-pc/grub.cfg (was always "missing")

GRUB:
- live-theme/bee-logo.png: bee logo, 400×400px on a black background
- live-theme/theme.txt: + image component centered in the top third of
  the screen; menu moved from 62% to 65%

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 13:08:31 +03:00
d60f7758ba Fix grub-pc directory missing before writing grub.cfg
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 08:42:17 +03:00
64ae1c0ff0 Sync GRUB and isolinux boot entries; document sync rule
grub-efi/grub.cfg: add KMS+GSP=off entry (was in isolinux, missing in GRUB)

isolinux/live.cfg.in: add full standard param set to all entries
(net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always
numa_balancing=disable nowatchdog nosoftlockup) to match grub-efi

bible-local/docs/iso-build-rules.md: add bootloader sync rule documenting
that grub-efi and isolinux must be kept in sync manually, listing canonical
entries and standard param set, and noting the grub-pc/grub-efi history.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 10:32:16 +03:00
49050ca717 Fix GRUB bootloader config dir: grub-pc → grub-efi
Build uses --bootloaders "grub-efi,syslinux" so live-build reads
config/bootloaders/grub-efi/ for the UEFI GRUB config. The directory
was incorrectly named grub-pc, causing live-build to ignore our custom
grub.cfg and generate a default one (missing toram, GSP-off entries).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 10:30:11 +03:00
5ba72ab315 Add rsync to initramfs for toram progress output
live-boot already uses rsync --progress when /bin/rsync exists; without
it the copy falls back to silent cp -a. Add rsync to the ISO package
list and install an initramfs-tools hook (bee-rsync) that copies the
rsync binary + shared libs into the initrd via copy_exec. The hook then
rebuilds the initramfs so the change takes effect in the ISO's initrd.img.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 23:52:47 +03:00
63363e9629 Add toram boot entry and Install to RAM resume support
- grub.cfg: add "load to RAM (toram)" entry to advanced submenu
- install_to_ram.go: resume from existing /dev/shm/bee-live copy if
  source medium is unavailable after bee-web restart
- tasks.go: fix "Recovered after bee-web restart" shown on every run
  (check j.lines before first append, not after)
- bee-install: retry unsquashfs up to 5x with wait-for-remount on
  source loss; clear error message with bee-remount-medium hint
- bee-remount-medium: new script to find and remount live ISO source
  after USB/CD reconnect; supports --wait polling mode
- 9000-bee-setup: chmod +x for bee-install and bee-remount-medium

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 23:48:56 +03:00
Mikhail Chusavitin
f74976ec4c Use static overlay wallpaper in ISO build 2026-04-16 10:54:03 +03:00
Mikhail Chusavitin
e306250da7 Disable fp64/fp4 in mixed gpu burn 2026-04-16 10:00:03 +03:00
30aa30cd67 LiveCD: set Baby Bee wallpaper centered on black background
400×400px PNG centered via feh --bg-center --image-bg '#000000'.
Fallback solid fill also changed to black.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 06:57:23 +03:00
fa6d905a10 Tune bee-gpu-burn single-precision benchmark phases 2026-04-16 00:05:47 +03:00
Mikhail Chusavitin
5c1862ce4c Use lb clean --all to clear bootstrap cache on every build
Prevents stale debootstrap cache from bypassing --debootstrap-options
changes (e.g. --include=ca-certificates added in v8.15).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 17:37:08 +03:00
Mikhail Chusavitin
b65ef2ea1d Fix: use --debootstrap-options to include ca-certificates in bootstrap
--bootstrap-packages is not a valid lb config option (20230502).
Use --debootstrap-options "--include=ca-certificates" instead to ensure
ca-certificates is present when lb chroot_archives runs apt-get update
against the NVIDIA CUDA HTTPS source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 17:26:01 +03:00
Mikhail Chusavitin
533d703c97 Bootstrap ca-certificates so NVIDIA CUDA HTTPS source is trusted
debootstrap creates a minimal chroot without ca-certificates, causing
apt-get update to fail TLS verification for the NVIDIA CUDA apt source:
  "No system certificates available. Try installing ca-certificates."
Add ca-certificates to --bootstrap-packages so it is present before
lb chroot_archives configures the NVIDIA HTTPS source and runs apt-get update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 17:24:20 +03:00
Mikhail Chusavitin
04eb4b5a6d Revert "Pre-download DCGM/fabricmanager debs on host to bypass chroot apt"
This reverts commit 4110dbf8a6.
2026-04-15 17:19:53 +03:00
Mikhail Chusavitin
4110dbf8a6 Pre-download DCGM/fabricmanager debs on host to bypass chroot apt
The NVIDIA CUDA HTTPS apt source (developer.download.nvidia.com) may be
unreachable from inside the live-build container chroot, causing
'E: Unable to locate package datacenter-gpu-manager-4-cuda13'.

Add build-dcgm.sh that downloads DCGM and nvidia-fabricmanager .deb
packages on the build host (verifying SHA256 against Packages.gz) and
caches them in BEE_CACHE_DIR.  build.sh (step 25-dcgm, nvidia only)
copies them into LB_DIR/config/packages.chroot/ before lb build, so
live-build creates a local apt repo from them.  The chroot installs the
packages from the local repo without ever contacting the NVIDIA CUDA
HTTPS source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 17:10:23 +03:00
Mikhail Chusavitin
7237e4d3e4 Add fabric manager boot and support diagnostics 2026-04-15 16:14:26 +03:00
Mikhail Chusavitin
0317dc58fd Fix memtest hook: grub.cfg/live.cfg missing during binary hooks is expected
lb binary_grub-efi and lb binary_syslinux create these files from templates
that already have memtest entries hardcoded. The hook should not fail when
the files don't exist yet — validate_iso_memtest() checks the final ISO.
Only the binary files (x64.bin, x64.efi) are required here.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 10:33:22 +03:00
Mikhail Chusavitin
1c5cb45698 Fix memtest hook: bad ver_arg format in apt-get download
ver_arg was set to "=memtest86+=VERSION" making the command
"apt-get download memtest86+=memtest86+=VERSION" (invalid).
Fixed to build pkg_spec directly as "memtest86+=VERSION".
Also add apt-get update retry if initial download fails.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 10:15:01 +03:00
Mikhail Chusavitin
090b92ca73 Re-enable security repo: kernel 6.1.0-44 is in bookworm-security only
Disabling --security broke the build because linux-image-6.1.0-44-amd64
is a security update not present in the base bookworm repo.
Main packages already come from mirror.mephi.ru.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 10:02:52 +03:00
Mikhail Chusavitin
2dccbc010c Use MEPHI mirror, disable security repo, fix memtest in ISO build
- Switch all lb mirrors to mirror.mephi.ru/debian/ for faster/reliable downloads
- Disable security repo (--security false) — not needed for LiveCD
- Pin MEMTEST_VERSION=6.10-4 in VERSIONS, export to hook environment
- Set BEE_REQUIRE_MEMTEST=1 in build-in-container.sh — missing memtest is now fatal
- Fix 9100-memtest.hook.binary: add apt-get download fallback when lb
  binary_memtest has already purged the package cache; handle both 5.x
  (memtest86+x64.bin) and 6.x (memtest86+.bin) BIOS binary naming

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 09:57:29 +03:00
e84c69d360 Fix optional step log dir missing after memtest recovery
mkdir -p LOG_DIR before writing the optional step log so that a race
with cleanup_build_log (EXIT trap archiving the log dir) does not cause
a "Directory nonexistent" error during lb binary_checksums / lb binary_iso.

Also downgrade apt-get update failure to a warning so a transient mirror
outage does not block kernel ABI auto-detection when the apt cache is warm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 07:28:36 +03:00
ee422ede3c Revert "Add raster Easy Bee branding assets"
This reverts commit d560b2fead.
2026-04-14 23:00:15 +03:00
d560b2fead Add raster Easy Bee branding assets 2026-04-14 22:39:25 +03:00
Mikhail Chusavitin
95124d228f Split bee-bench into perf and power workflows 2026-04-14 17:33:13 +03:00
Mikhail Chusavitin
2be7ae6d28 Refine NVIDIA benchmark phase timing 2026-04-14 14:12:06 +03:00
Mikhail Chusavitin
82fe1f6d26 Disable precision fallback and pin cuBLAS 13.1 2026-04-14 10:17:44 +03:00
81e7c921f8 Debug output during build 2026-04-14 07:02:37 +03:00
0fb8f2777f Fix combined gpu burn profile capacity for fp4 2026-04-14 00:00:40 +03:00
bf182daa89 Fix benchmark report methodology and rebuild gpu burn worker on toolchain changes 2026-04-13 23:43:12 +03:00
bf6ecab4f0 Add per-precision benchmark phases, weighted TOPS scoring, and ECC tracking
- Split steady window into 6 equal slots: fp8/fp16/fp32/fp64/fp4 + combined
- Each precision phase runs bee-gpu-burn with --precision filter so PowerCVPct reflects single-kernel stability (not round-robin artifact)
- Add fp4 support in bee-gpu-stress.c for Blackwell (cc>=100) via existing CUDA_R_4F_E2M1 guard
- Weighted TOPS: fp64×2.0, fp32×1.0, fp16×0.5, fp8×0.25, fp4×0.125
- SyntheticScore = sum of weighted TOPS from per-precision phases
- MixedScore = sum from combined phase; MixedEfficiency = Mixed/Synthetic
- ComputeScore = SyntheticScore × (1 + MixedEfficiency × 0.3)
- ECC volatile counters sampled before/after each phase and overall
- DegradationReasons: ecc_uncorrected_errors, ecc_corrected_errors
- Report: per-precision stability table with ECC columns, methodology section
- Ramp-up history table redesign: GPU indices as columns, runs as rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 10:49:49 +03:00
Mikhail Chusavitin
4f94ebcb2c Add HPC tuning: PCIe ASPM off, C-states, performance CPU governor
- grub.cfg + isolinux/live.cfg.in: add pcie_aspm=off,
  intel_idle.max_cstate=1 and processor.max_cstate=1 to all
  non-failsafe boot entries
- bee-hpc-tuning: new script that sets all CPU cores to performance
  governor via sysfs and logs THP state at boot
- bee-hpc-tuning.service: runs before bee-nvidia and bee-audit
- 9000-bee-setup.hook.chroot: enable service and mark script executable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:07:32 +03:00
Mikhail Chusavitin
9481ca2805 Add staged NVIDIA burn ramp-up mode 2026-04-09 15:21:14 +03:00
Mikhail Chusavitin
4ef403898f Tighten NVIDIA GPU PCI detection 2026-04-09 15:14:48 +03:00
025548ab3c UI: amber accents, smaller wallpaper logo, new support bundle name, drop display resolution
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
  using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:37:01 +03:00
Mikhail Chusavitin
e0d94d7f47 Remove HPL from build and audit flows 2026-04-08 10:00:23 +03:00
Mikhail Chusavitin
13899aa864 Drop incompatible HPL git fallback 2026-04-08 09:50:58 +03:00
Mikhail Chusavitin
f345d8a89d Build HPL serially to avoid upstream make races 2026-04-08 09:47:35 +03:00
Mikhail Chusavitin
4715059ac0 Fix HPL MPI stub header and keep full build logs 2026-04-08 09:45:14 +03:00
Mikhail Chusavitin
0660a40287 Harden HPL builder cache and runtime libs 2026-04-08 09:40:18 +03:00
Mikhail Chusavitin
67369d9b7b Fix OpenBLAS package lookup in HPL build 2026-04-08 09:32:49 +03:00