Commit Graph

87 Commits

Author SHA1 Message Date
Mikhail Chusavitin
b7c888edb1 fix: getty autologin root, inject GSP firmware for H100, bump 0.1.1 2026-03-08 22:12:02 +03:00
Mikhail Chusavitin
17d5d74a8d fix: nomodeset + remove splash (framebuffer hangs on headless H100 server) 2026-03-08 21:39:31 +03:00
Mikhail Chusavitin
d487e539bb fix: use sudo git checkout to reset root-owned build artifacts 2026-03-08 20:54:15 +03:00
Mikhail Chusavitin
441ab3adbd fix: blacklist nouveau driver (hangs on H100 unknown chipset) 2026-03-08 20:51:49 +03:00
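One way the nouveau blacklist from 441ab3adbd could be expressed is a modprobe.d fragment shipped inside the image. The sketch below is an assumption, not the commit's actual change: the function name `write_nouveau_blacklist`, the file name `blacklist-nouveau.conf`, and the target directory are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe.d fragment that keeps nouveau from loading.
# $1 = directory that ends up as /etc/modprobe.d in the image (assumed layout).
write_nouveau_blacklist() {
    dest_dir=$1
    mkdir -p "$dest_dir" || return 1
    # "blacklist" stops alias-based autoloading of the module;
    # "modeset=0" disables kernel mode setting if it gets loaded anyway.
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' \
        > "$dest_dir/blacklist-nouveau.conf"
}
```

With live-build, calling it as `write_nouveau_blacklist config/includes.chroot/etc/modprobe.d` would place the file in the image root, since files under `config/includes.chroot/` are copied into the chroot.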
Mikhail Chusavitin
c91c8d8cf9 feat: bee-themed grub splash (amber/black honeycomb) with progress bar 2026-03-08 20:44:19 +03:00
Mikhail Chusavitin
83e1910281 feat: custom grub bootloader - bee branding, 5s auto-boot, no splash 2026-03-08 20:35:23 +03:00
Mikhail Chusavitin
2252c5af56 fix: use isc-dhcp-client for dhclient, remove standalone lsblk (in util-linux) 2026-03-08 19:43:59 +03:00
Mikhail Chusavitin
7a4d75c143 fix: remove unsupported --hostname/--username from lb config
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 19:28:01 +03:00
Mikhail Chusavitin
7c62d100d4 fix: use SYSSRC=common SYSOUT=amd64 for NVIDIA build on Debian split headers
Debian 12 splits kernel headers into two packages:
  linux-headers-<kver>        (arch-specific: generated/, config/)
  linux-headers-<kver>-common (source headers: linux/, asm-generic/, etc.)

NVIDIA conftest.sh builds include paths as HEADERS=$SOURCES/include.
When SYSSRC=amd64, HEADERS=amd64/include/ which is nearly empty —
conftest can't compile any kernel header tests, all compile-tests fail
silently, and NVIDIA assumes all kernel APIs are present. This causes
link errors for APIs added in kernel 6.3+ (vm_flags_set, vm_flags_clear)
and removed APIs (phys_to_dma, dma_is_direct, get_dma_ops).

Fix: pass SYSSRC=common (real headers) and SYSOUT=amd64 (generated headers).
NVIDIA Makefile maps SYSSRC→NV_KERNEL_SOURCES, SYSOUT→NV_KERNEL_OUTPUT,
and runs 'make -C common KBUILD_OUTPUT=amd64'. Conftest then correctly
detects which APIs are present in kernel 6.1 and uses proper wrappers.

Tested: 5 .ko files built successfully on Debian 12 kernel 6.1.0-43-amd64.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 19:23:47 +03:00
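The SYSSRC/SYSOUT split described in 7c62d100d4 can be sketched as a small helper that derives both header directories from the kernel version string. This is a minimal sketch under stated assumptions: the function name `nvidia_make_args` is hypothetical, and the `/usr/src/linux-headers-*` locations are assumed to be where the two Debian header packages are installed.

```shell
#!/bin/sh
# Hypothetical helper: print the SYSSRC/SYSOUT arguments for the NVIDIA module build.
# $1 = full kernel version, e.g. 6.1.0-43-amd64
nvidia_make_args() {
    kver=$1
    # Strip the flavour suffix: 6.1.0-43-amd64 -> 6.1.0-43
    abi=${kver%-*}
    # SYSSRC points at the -common package (real source headers),
    # SYSOUT at the arch-specific package (generated headers and config).
    echo "SYSSRC=/usr/src/linux-headers-${abi}-common SYSOUT=/usr/src/linux-headers-${kver}"
}
```

A build script might then run something like `make -C "$NVIDIA_KERNEL_DIR" modules $(nvidia_make_args "$KVER")`, where `NVIDIA_KERNEL_DIR` and `KVER` are placeholder names for the extracted driver's kernel directory and the target kernel version.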
Mikhail Chusavitin
c843ff95a2 fix: add -Wno-error to CFLAGS_MODULE for NVIDIA kernel 6.1 compat
get_dma_ops() return type changed in kernel 6.1 — GCC treats int-conversion
warning as error. Suppress with -Wno-error to allow build to complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:55:25 +03:00
Mikhail Chusavitin
0057686769 fix: pass GCC include dir to NVIDIA make to resolve stdarg.h not found
Debian kernel build uses -nostdinc which strips GCC's own includes.
NVIDIA's nv_stdarg.h needs <stdarg.h> from GCC.
Pass -I$(gcc --print-file-name=include) via CFLAGS_MODULE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:53:37 +03:00
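The include-path fix in 0057686769 can be illustrated with a helper that asks the compiler for its private header directory and turns it into an `-I` flag. A minimal sketch, assuming the compiler is reachable as `gcc` or via `CC`; the function name `gcc_include_flag` is hypothetical and not taken from the repository.

```shell
#!/bin/sh
# Hypothetical helper: print an -I flag for the compiler's own header directory.
# Honours $CC if set, otherwise falls back to gcc.
gcc_include_flag() {
    # --print-file-name=include resolves to the directory holding stdarg.h, stddef.h, etc.
    inc_dir=$("${CC:-gcc}" --print-file-name=include) || return 1
    if [ ! -f "$inc_dir/stdarg.h" ]; then
        echo "stdarg.h not found in $inc_dir" >&2
        return 1
    fi
    printf '%s\n' "-I$inc_dir"
}
```

The result would be passed along the lines of `make modules CFLAGS_MODULE="$(gcc_include_flag)"`, which restores the headers that `-nostdinc` removes without re-enabling the rest of the default search path.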
Mikhail Chusavitin
68b5e02a74 fix: run-builder.sh uses BUILDER_USER from .env, not hardcoded
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:48:33 +03:00
Mikhail Chusavitin
fa553c3f20 fix: update DEBIAN_KERNEL_ABI to 6.1.0-43 (actual kernel on build host)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:35:44 +03:00
Mikhail Chusavitin
345a93512a migrate ISO build from Alpine to Debian 12 (Bookworm)
Replace the entire live CD build pipeline:
- Alpine SDK + mkimage + genapkovl → Debian live-build (lb config/build)
- OpenRC init scripts → systemd service units
- dropbear → openssh-server (native to Debian live)
- udhcpc → dhclient for DHCP
- apk → apt-get in setup-builder.sh and build-nvidia-module.sh
- Add auto/config (lb config options) and auto/build wrapper
- Add config/package-lists/bee.list.chroot replacing Alpine apks
- Add config/hooks/normal/9000-bee-setup.hook.chroot to enable services
- Add bee-nvidia-load and bee-sshsetup helper scripts
- Keep NVIDIA pre-compile pipeline (Option B): compile on builder VM against
  pinned Debian kernel headers (DEBIAN_KERNEL_ABI), inject .ko into includes.chroot
- Fixes: native glibc (no gcompat shims), proper udev, writable /lib/modules,
  no Alpine modloop read-only constraint, no stale apk cache issues

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 18:01:38 +03:00
Mikhail Chusavitin
d952e10dbb fix: fail loudly on missing NVIDIA libs and .ko, improve mknod logging
build-nvidia-module.sh:
- Replace silent glob cp for libnvidia-ml/libcuda with find + explicit error
  if library not found in extract dir (catches installer layout changes)
- Fix circular symlink bug: don't create .so.1 -> .so.1 if versioned file
  is already named .so.1
- Verify .ko count > 0 after build, fail loudly if none produced
- Show lib cache in final summary

bee-nvidia:
- mknod failures are now logged with ewarn instead of silently suppressed
- If nvidia not in /proc/devices (no GPU hardware), log clearly and exit clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 17:07:47 +03:00
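Two of the checks listed in d952e10dbb, the soname symlink guard and the .ko count check, can be sketched as follows. The function names `link_soname` and `require_ko` are hypothetical and the real build-nvidia-module.sh may be structured differently; this only illustrates the logic the commit message describes.

```shell
#!/bin/sh
# Hypothetical helper: create <soname> -> <versioned file> next to the library,
# unless the file is already named exactly like the soname (avoids a self-link).
# $1 = path to the versioned library, $2 = soname (e.g. libcuda.so.1)
link_soname() {
    lib_path=$1
    soname=$2
    lib_dir=$(dirname "$lib_path")
    lib_name=$(basename "$lib_path")
    if [ "$lib_name" = "$soname" ]; then
        return 0
    fi
    ln -sf "$lib_name" "$lib_dir/$soname"
}

# Hypothetical helper: fail loudly if a build directory contains no kernel modules.
# $1 = directory to search; prints the module count on success.
require_ko() {
    ko_count=$(find "$1" -type f -name '*.ko' | wc -l | tr -d ' ')
    if [ "$ko_count" -eq 0 ]; then
        echo "ERROR: no .ko files produced in $1" >&2
        return 1
    fi
    echo "$ko_count"
}
```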
Mikhail Chusavitin
11e001cafa fix: add libc6-compat — required for dlopen of glibc shared objects on Alpine
gcompat alone provides only the ELF interpreter entry point (/lib64/ld-linux-x86-64.so.2).
It does NOT provide libpthread.so.0, libm.so.6, libdl.so.2, libc.so.6 stubs.

libnvidia-ml.so.590 has NEEDED: libpthread.so.0 etc. When nvidia-smi calls
dlopen("libnvidia-ml.so.1"), musl's linker fails to satisfy these deps
→ NVML_ERROR_LIBRARY_NOT_FOUND (exit 12), "couldn't find libnvidia-ml.so".

libc6-compat provides the missing stubs (libpthread.so.0, libm.so.6, libdl.so.2,
libc.so.6, librt.so.1) as musl redirects, enabling dlopen of glibc shared objects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 17:03:45 +03:00
Mikhail Chusavitin
5db3c3c74c fix: create /dev/nvidia* nodes in bee-nvidia — mdev has no NVIDIA rules
Alpine uses mdev which has no rules for NVIDIA devices. Without /dev/nvidiactl
and /dev/nvidia{0-7}, nvidia-smi returns NVML_ERROR_LIBRARY_NOT_FOUND (exit 12)
even though kernel modules are loaded and libraries are present.

Fix: after insmod, read major numbers from /proc/devices and mknod the required
character devices (/dev/nvidiactl, /dev/nvidia{0-7}, /dev/nvidia-uvm).

Add /dev/nvidia* node checks to smoketest for earlier failure detection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-08 14:42:18 +03:00
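The approach in 5db3c3c74c, reading major numbers from /proc/devices and creating the character devices by hand, can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the driver is assumed to register as `nvidia` and `nvidia-uvm` in /proc/devices, and the plan is printed rather than executed so it can be inspected without root.

```shell
#!/bin/sh
# Hypothetical helper: look up a driver's major number.
# $1 = driver name, $2 = devices table (defaults to /proc/devices)
nv_major() {
    awk -v name="$1" '$2 == name { print $1; exit }' "${2:-/proc/devices}"
}

# Hypothetical helper: print the mknod commands instead of running them.
# $1 = nvidia major, $2 = number of GPUs, $3 = nvidia-uvm major (optional)
nv_mknod_plan() {
    major=$1
    gpu_count=$2
    uvm_major=${3:-}
    # The control device conventionally uses minor 255.
    echo "mknod -m 666 /dev/nvidiactl c $major 255"
    i=0
    while [ "$i" -lt "$gpu_count" ]; do
        echo "mknod -m 666 /dev/nvidia$i c $major $i"
        i=$((i + 1))
    done
    if [ -n "$uvm_major" ]; then
        echo "mknod -m 666 /dev/nvidia-uvm c $uvm_major 0"
    fi
    return 0
}
```

On a live system, `nv_mknod_plan "$(nv_major nvidia)" 8 "$(nv_major nvidia-uvm)" | sh` would create the nodes; an empty result from `nv_major nvidia` corresponds to the "no GPU hardware" case, where the script should log and exit cleanly instead.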
Mikhail Chusavitin
98f14b21c1 fix: remove kernel version pin — dynamic detection prevents KVER mismatch
The static KERNEL_PKG_VERSION pin was the root cause of nvidia-smi never
working: modules were compiled for pinned version (e.g. 6.12.76-r0) but
the ISO kernel was unpinned (latest from repo at build time). When Alpine
updated linux-lts, the two diverged silently.

Fix: both steps now use whatever linux-lts is current in Alpine 3.21 main
at build time. build-nvidia-module.sh uses `apk add --update linux-lts-dev`
(no version pin), mkimage gets the same package from the same mirror.
Module cache is still keyed by detected KVER so rebuilds remain fast.

Removed: KERNEL_VERSION, KERNEL_PKG_VERSION from VERSIONS, all pin references
from build.sh and build-nvidia-module.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 12:11:05 +03:00
Mikhail Chusavitin
18f377987f fix: audit pipeline correctness after full review
- bee-audit init.d: use --output file: so "audit output written" is logged
  (stdout mode silently redirects, never emits the slog confirmation)
- build-nvidia-module.sh: use $KERNEL_SRC in find for .ko collection
  (was hardcoded $EXTRACT_DIR/kernel, silent failure if path differs)
- smoketest: add bee-audit to required services (was never checked)
- smoketest: remove legacy bee-audit-debug from service list
- smoketest: internet ping → warn (live CD runs in isolated network, no internet)
- build.sh: auto-copy smoketest.sh → overlay/usr/local/bin/bee-smoketest
  (removes manual sync hazard; smoketest.sh is now single source of truth)
- remove static overlay/usr/local/bin/bee-smoketest (generated by build.sh now)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 12:06:25 +03:00
Mikhail Chusavitin
0e0760bba9 build: always nuke apks_* cache to prevent stale package errors
Stale apks_* dirs (from old mirror or previous version pin) cause
"unable to select package" failures. Nuke them on every build.
kernel_*, syslinux_*, grub_* are still preserved — they're large,
stable, and only need to change when KERNEL_PKG_VERSION changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 11:59:37 +03:00
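The selective cleanup in 0e0760bba9 amounts to deleting only the `apks_*` directories and leaving the other cache directories in place. A minimal sketch, assuming all cache directories sit directly under one work directory; the function name `clean_apk_cache` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove stale apks_* cache directories, keep everything else
# (kernel_*, syslinux_*, grub_* are left in place).
# $1 = mkimage work directory
clean_apk_cache() {
    workdir=$1
    [ -d "$workdir" ] || return 0
    for cache_dir in "$workdir"/apks_*; do
        # Skip the literal pattern when nothing matched, and anything that is not a directory.
        [ -d "$cache_dir" ] || continue
        rm -rf "$cache_dir"
    done
    return 0
}
```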
Mikhail Chusavitin
7d19fb8f60 Fix stale genapkovl in /var/tmp shadowing ~/.mkimage version
mkimage checks CWD (/var/tmp) before ~/.mkimage/ for genapkovl scripts.
Old genapkovl-bee.sh left in /var/tmp from previous builds was overriding
the updated version, causing bee-audit-debug to persist in runlevel.

Also add gcompat to apk world so it's installed at boot (was in apks cache
but missing from world file, so nvidia-smi failed with missing ld-linux).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 11:55:16 +03:00
Mikhail Chusavitin
449da7012c bee-tui: default mask /24, gateway x.x.x.1, DNS 77.88.8.8/77.88.8.1/1.1.1.1/8.8.8.8
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 11:40:21 +03:00
Mikhail Chusavitin
dbbc8628d0 Remove linux-lts from apks — kernel handled by kernel_flavors
linux-lts in apks conflicted with mkimage's own kernel download via
kernel_flavors="lts". The kernel is embedded in the ISO via modloop,
not via apks. Pinning it in apks caused "unable to select package".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 11:30:32 +03:00
Mikhail Chusavitin
1feb956e30 Fix: use dl-cdn.alpinelinux.org everywhere for consistent package resolution
Both build-nvidia-module.sh (apk add) and mkimage.sh (--repository) now
explicitly use dl-cdn. Local builder mirror config is ignored.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 11:28:14 +03:00
Mikhail Chusavitin
699c8d2473 docs: document kernel pin and mirror invariants in runtime-flows
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 11:23:04 +03:00
Mikhail Chusavitin
ac6aeefa1a Fix: use builder's own mirror for mkimage, not dl-cdn
Root cause of linux-lts pin failure: mkimage was using dl-cdn.alpinelinux.org
while the builder uses mirrors.hosterion.ro — different mirrors can have different
package availability at any given moment.

Now mkimage reads repositories directly from /etc/apk/repositories on the builder,
ensuring both module build and ISO package install use the same mirror.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 11:22:31 +03:00
Mikhail Chusavitin
cdc2996cd3 Fix mkimage git conflict: cd /var/tmp before running mkimage.sh
mkimage.sh calls git internally. Running it from inside /root/bee causes
"outside repository" fatal errors. /var/tmp is outside the git repo.
genapkovl is found via ~/.mkimage/ so no copy to /var/tmp needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 11:16:56 +03:00
Mikhail Chusavitin
5bc6d3da42 Add bee-smoketest to ISO overlay
Run directly on live CD: bee-smoketest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 10:54:25 +03:00
Mikhail Chusavitin
ffc7e5c71a Fix critical ISO build bugs: kernel pinning, service registration, PATH, audit checks
- Pin linux-lts to exact KERNEL_PKG_VERSION=6.12.76-r0 in build and ISO package list
- Add build-time verification that compiled kernel version matches pin (fails loudly)
- Fix bee-audit-debug → bee-audit in genapkovl OpenRC registration (service was never starting)
- Add AUDIT_VERSION=0.1.0 to VERSIONS (was undefined, bee-release had empty fields)
- Pin linux-lts-dev version in second apk add in build-nvidia-module.sh
- Add /root/.profile to overlay so /usr/local/bin is in PATH for SSH sessions
- Remove "DEBUG MODE" from motd
- Fix smoketest: grep for slog "audit output written" instead of non-existent "audit completed"
- Document no-internet constraint in system-overview and runtime-flows
- Remove redundant genapkovl copy to /var/tmp (now found via ~/.mkimage/)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 10:52:54 +03:00
Mikhail Chusavitin
493ccea415 Clear ~/.mkimage before build to prevent stale profiles
Without this, old mkimg.bee_debug.sh left from previous builds
causes mkimage to build both bee and bee_debug profiles.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 10:05:06 +03:00
Mikhail Chusavitin
0a13463e94 Fix misleading password fallback message in build.sh
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 09:51:16 +03:00
Mikhail Chusavitin
a2b2cb23bc Fix run-builder.sh: update overlay and build script paths
overlay-debug → overlay, build-debug.sh → build.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 09:50:18 +03:00
Mikhail Chusavitin
b8135a19df Remove leftover debug/prod split files from tracking
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 09:49:11 +03:00
Mikhail Chusavitin
1e98428be8 Add nvidia-bug-report.sh to ISO and fix GPU diagnostic pack in bee-tui
- build-nvidia-module.sh: extract nvidia-bug-report.sh from .run installer
- build.sh: copy nvidia-bug-report.sh into overlay/usr/local/bin/
- bee-tui: pass --output directly to nvidia-bug-report.sh so log goes
  into the run_dir archive instead of CWD; remove redundant cp step

GPU diagnostic pack in TUI (System acceptance tests → GPU NVIDIA → Run command pack):
  nvidia-smi -q, dmidecode -t baseboard, dmidecode -t system, nvidia-bug-report.sh
All logs archived to /var/log/bee-sat/gpu-nvidia-<ts>.tar.gz

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 09:48:27 +03:00
Mikhail Chusavitin
240c33f6a1 Add backlog with GPU stress test task
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-07 09:45:51 +03:00
Mikhail Chusavitin
1eeee46a34 Remove gpu_burn from ISO build — binary too large
gpu_burn requires CUDA toolkit (~4GB) to build and the resulting binary
would significantly inflate the ISO. Removed from vendor tool list and
smoketest. build-gpu-burn.sh dropped as well.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-06 20:17:30 +03:00
Mikhail Chusavitin
1768bb58dd Merge debug/prod into single ISO build, fix NVIDIA module loading
## ISO build consolidation
- Remove separate debug/prod split: overlay-debug/, build-debug.sh,
  mkimg.bee_debug.sh, genapkovl-bee_debug.sh all deleted
- Single overlay: iso/overlay/ (was overlay-debug content)
- Single build script: build.sh (SSH, TUI, NVIDIA, vendor tools, bee-release)
- Single mkimage profile: bee (with dropbear, dialog, strace, gcompat, etc.)

## NVIDIA fixes
- Modules now stored at /usr/local/lib/nvidia/ instead of
  /lib/modules/<kver>/extra/nvidia/ — modloop squashfs mounts over that
  path at boot making overlay content there inaccessible
- bee-nvidia init: load via insmod (absolute path), not modprobe
- bee-nvidia init: create libnvidia-ml.so.1/libcuda.so.1 symlinks in /usr/lib/
- build-nvidia-module.sh: always install linux-lts-dev (not conditional) —
  stale 6.6.x headers caused wrong-kernel modules that never loaded at runtime
- build-nvidia-module.sh: create soname symlinks in cache
- KERNEL_VERSION in VERSIONS updated 6.6 → 6.12
- gcompat added to ISO packages (nvidia-smi is a glibc binary on musl Alpine)

## Service ordering
- bee-audit: add `after bee-nvidia` so NVIDIA enrichment always succeeds

## New tooling
- iso/builder/smoketest.sh: SSH smoke test for post-boot ISO validation
- iso/builder/build-gpu-burn.sh: builds gpu_burn vendor binary (CUDA 12.8+)
- vendor/gpu_burn included automatically if placed in iso/vendor/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-06 20:14:18 +03:00
Mikhail Chusavitin
0907ba07c3 debug iso: add menu command to relaunch tui 2026-03-06 19:49:35 +03:00
Mikhail Chusavitin
94b305f166 Switch debug TUI menus to dialog and include dialog package 2026-03-06 17:57:40 +03:00
Mikhail Chusavitin
f84ec9320c Fix NVIDIA module version selection and add load diagnostics 2026-03-06 17:30:41 +03:00
Mikhail Chusavitin
a55b4108d5 Add wget/curl fallback for vendor and update downloads 2026-03-06 14:45:50 +03:00
Mikhail Chusavitin
18b8c69bc5 Implement audit enrichments, TUI workflows, and production ISO scaffold 2026-03-06 11:56:26 +03:00
bdfb6a0a79 fix: reset VM working tree before pull to clear stale build artifacts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 23:11:18 +03:00
867565cbf8 fix: inject motd build info in genapkovl tmp, not overlay on disk
sed -i on overlay/etc/motd caused git pull conflict on next build.
Now BEE_BUILD_INFO is exported and substituted in $tmp copy only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 23:11:07 +03:00
b72688cf2c fix: chmod +x overlay scripts on builder VM after git pull
macOS does not reliably apply git file mode changes on disk.
Run chmod explicitly on the VM where it matters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 23:03:35 +03:00
e8e09e9063 fix: chmod +x in genapkovl to fix permissions regardless of git filemode on VM
- genapkovl now explicitly chmod +x init.d/* and usr/local/bin/* after cp
- add bee-net-restart command (short name, no .sh) and /etc/profile.d/bee.sh for PATH
- udhcpc: add & to ensure non-blocking even when DHCP responds immediately
- motd: short commands without paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 22:59:28 +03:00
63c608711d fix: use agetty --autologin instead of busybox getty -a (unsupported flag)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 22:48:13 +03:00
eecd0799a0 fix: check local/remote sync before building to prevent building stale code
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 22:44:22 +03:00
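A pre-build sync check like the one eecd0799a0 describes can be reduced to comparing two revision hashes. The sketch below is an assumption about how such a guard might look, not the repository's code; `check_sync` is a hypothetical name, and the git commands appear only in the usage note so the function itself stays free of side effects.

```shell
#!/bin/sh
# Hypothetical helper: refuse to build when local and remote revisions differ.
# $1 = local revision hash, $2 = remote revision hash
check_sync() {
    local_rev=$1
    remote_rev=$2
    if [ -z "$local_rev" ] || [ -z "$remote_rev" ]; then
        echo "ERROR: could not determine revisions" >&2
        return 1
    fi
    if [ "$local_rev" != "$remote_rev" ]; then
        echo "ERROR: local $local_rev differs from remote $remote_rev, push or pull first" >&2
        return 1
    fi
    return 0
}
```

A typical call would be `git fetch origin && check_sync "$(git rev-parse HEAD)" "$(git rev-parse '@{u}')"`, which compares the checked-out commit with the tip of its upstream branch.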
fd071e28db fix: include build-debug.sh and motd changes missed from previous commit
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 22:43:23 +03:00
c908809991 fix: init scripts not executable, add autologin and build version in motd
- bee-* init.d scripts had mode 644 in git — OpenRC silently skipped them,
  causing bee-network/bee-nvidia/bee-audit to never start at boot
- bee-network.sh also lacked executable bit
- Remove -q from udhcpc (was quitting after first lease, no renewal)
- Add autologin root on tty1 via /etc/inittab
- Inject build date + git commit + versions into motd at build time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 22:33:45 +03:00