bee/bible-local/architecture/runtime-flows.md
Mikhail Chusavitin 76a17937f3 feat(tui): NVIDIA SAT with nvtop, GPU selection, metrics and chart — v1.0.0
- TUI: duration presets (10m/1h/8h/24h), GPU multi-select checkboxes
- nvtop launched concurrently with SAT via tea.ExecProcess; can reopen or abort
- GPU metrics collected per-second during bee-gpu-stress (temp/usage/power/clock)
- Outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG), gpu-metrics-term.txt
- Terminal chart: asciigraph-style line chart with box-drawing chars and ANSI colours
- AUDIT_VERSION bumped 0.1.1 → 1.0.0; nvtop added to ISO package list
- runtime-flows.md updated with full NVIDIA SAT TUI flow documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 15:18:57 +03:00

Runtime Flows — bee

Network isolation — CRITICAL

The live CD runs in an isolated network segment with no internet access. All binaries, kernel modules, and tools must be baked into the ISO at build time. No package installation, no downloads, and no package manager calls are allowed at boot. DHCP is used only for LAN (operator SSH access). Internet is NOT available.

Boot sequence (single ISO)

systemd boot order:

local-fs.target
  ├── bee-sshsetup.service   (enables SSH key auth; password fallback only if marker exists)
  │     └── ssh.service      (OpenSSH on port 22 — starts without network)
  ├── bee-network.service    (starts `dhclient -nw` on all physical interfaces, non-blocking)
  ├── bee-nvidia.service     (insmod nvidia*.ko from /usr/local/lib/nvidia/,
  │                           creates /dev/nvidia* nodes)
  ├── bee-audit.service      (runs `bee audit` → /var/log/bee-audit.json,
  │                            never blocks boot on partial collector failures)
  └── bee-web.service        (runs `bee web` on :80,
                               reads the latest audit snapshot on each request)

Critical invariants:

  • OpenSSH MUST start without network. bee-sshsetup.service runs before ssh.service.
  • bee-network.service uses dhclient -nw (background) — network bring-up is best effort and non-blocking.
  • bee-nvidia.service loads modules via insmod with absolute paths — NOT modprobe. Reason: the modules are shipped in the ISO overlay under /usr/local/lib/nvidia/, not in the host module tree.
  • bee-audit.service does not wait for network-online.target; audit is local and must run even if DHCP is broken.
  • bee-audit.service logs audit failures but does not turn partial collector problems into a boot blocker.
  • bee-web.service binds 0.0.0.0:80 and always renders the current /var/log/bee-audit.json contents.
  • Audit JSON now includes a hardware.summary block with overall verdict and warning/failure counts.
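The ordering and "never block boot" invariants above could be expressed in a unit file roughly like the following (an illustrative sketch only — the actual unit files ship in the ISO overlay and may differ):

```ini
# /etc/systemd/system/bee-audit.service — hypothetical sketch
[Unit]
Description=bee hardware audit
After=local-fs.target
# Deliberately NO After=network-online.target: the audit is local
# and must run even when DHCP is broken.

[Service]
Type=oneshot
ExecStart=/usr/local/bin/bee audit
# Treat partial collector failures as non-fatal so boot continues:
SuccessExitStatus=0 1

[Install]
WantedBy=multi-user.target
```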

Console and login flow

Local-console behavior:

tty1
  └── live-config autologin → bee
        └── /home/bee/.profile
              └── exec menu
                    └── /usr/local/bin/bee-tui
                          └── sudo -n /usr/local/bin/bee tui --runtime livecd

Rules:

  • local tty1 lands in user bee, not directly in root
  • menu must work without typing sudo
  • TUI actions still run as root via sudo -n
  • SSH is independent from the tty1 path
  • serial console support is enabled for VM boot debugging

ISO build sequence

build-in-container.sh [--authorized-keys /path/to/keys]
  1. compile `bee` binary (skipped when no `.go` file is newer than the existing binary)
  2. create a temporary overlay staging dir under `dist/`
  3. inject authorized_keys into staged `root/.ssh/` (or set password fallback marker)
  4. copy `bee` binary → staged `/usr/local/bin/bee`
  5. copy vendor binaries from `iso/vendor/` → staged `/usr/local/bin/`
     (`storcli64`, `sas2ircu`, `sas3ircu`, `arcconf`, `ssacli` — optional; `mstflint` comes from the Debian package set)
  6. `build-nvidia-module.sh`:
       a. install Debian kernel headers if missing
       b. download NVIDIA `.run` installer (sha256 verified, cached in `dist/`)
       c. extract installer
       d. build kernel modules against Debian headers
       e. create `libnvidia-ml.so.1` / `libcuda.so.1` symlinks in cache
       f. cache in `dist/nvidia-<version>-<kver>/`
  7. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
  8. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
  9. inject `libnvidia-ml` + `libcuda` → staged `/usr/lib/`
  10. write staged `/etc/bee-release` (versions + git commit)
  11. patch staged `motd` with build metadata
  12. copy `iso/builder/` into a temporary live-build workdir under `dist/`
  13. sync staged overlay into workdir `config/includes.chroot/`
  14. run `lb config && lb build` inside the privileged builder container

Critical invariants:

  • DEBIAN_KERNEL_ABI in iso/builder/VERSIONS pins the exact kernel ABI used in BOTH places:
    1. build-in-container.sh / build-nvidia-module.sh — Debian kernel headers for module build
    2. auto/config — the `linux-image-${DEBIAN_KERNEL_ABI}` package installed into the ISO
  • NVIDIA modules go to staged usr/local/lib/nvidia/ — NOT to /lib/modules/<kver>/extra/.
  • The source overlay in iso/overlay/ is treated as immutable source. Build-time files are injected only into the staged overlay.
  • The live-build workdir under dist/ is disposable; source files under iso/builder/ stay clean.
  • Container build requires --privileged because live-build uses mounts/chroots/loop devices during ISO assembly.

Post-boot smoke test

After booting a live ISO, run the following to verify all critical components:

ssh root@<ip> 'sh -s' < iso/builder/smoketest.sh

Exit code 0 = all required checks pass. The output must contain zero FAIL lines before shipping.

Key checks: NVIDIA modules loaded, nvidia-smi sees all GPUs, lib symlinks present, systemd services running, audit completed with NVIDIA enrichment, LAN reachability.

Current validation state:

  • local/libvirt VM boot path is validated for systemd, SSH, bee audit, bee-network, and TUI startup
  • real hardware validation is still required before treating the ISO as release-ready

Overlay mechanism

live-build copies files from config/includes.chroot/ into the ISO filesystem. build.sh prepares a staged overlay, then syncs it into a temporary workdir's config/includes.chroot/ before running lb build.

Collector flow

`bee audit` start
  1. board collector   (dmidecode -t 0,1,2)
  2. cpu collector     (dmidecode -t 4)
  3. memory collector  (dmidecode -t 17)
  4. storage collector (lsblk -J, smartctl -j, nvme id-ctrl, nvme smart-log)
  5. pcie collector    (lspci -vmm -D, /sys/bus/pci/devices/)
  6. psu collector     (ipmitool fru + sdr — silent if no /dev/ipmi0)
  7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
  8. output JSON → /var/log/bee-audit.json
  9. QR summary to stdout (qrencode if available)

Every collector returns nil, nil on tool-not-found. Errors are logged, never fatal.

Acceptance flows:

  • bee sat nvidia → diagnostic archive with nvidia-smi -q + nvidia-bug-report + lightweight bee-gpu-stress
  • bee sat memory → memtester diagnostic archive
  • bee sat storage → SMART/NVMe diagnostic archive and short self-test trigger where supported
  • SAT summary.txt now includes overall_status and per-job *_status values (OK, FAILED, UNSUPPORTED)
  • Runtime overrides:
    • BEE_GPU_STRESS_SECONDS
    • BEE_GPU_STRESS_SIZE_MB
    • BEE_MEMTESTER_SIZE_MB
    • BEE_MEMTESTER_PASSES

NVIDIA SAT TUI flow (v1.0.0+)

TUI: Acceptance tests → NVIDIA command pack
  1. screenNvidiaSATSetup
       a. enumerate GPUs via `nvidia-smi --query-gpu=index,name,memory.total`
       b. user selects duration preset: 10 min / 1 h / 8 h / 24 h
       c. user selects GPUs via checkboxes (all selected by default)
       d. memory size = max(selected GPU memory) — auto-detected, not exposed to user
  2. Start → screenNvidiaSATRunning
       a. CUDA_VISIBLE_DEVICES set to selected GPU indices
       b. tea.Batch: SAT goroutine + tea.ExecProcess(nvtop) launched concurrently
       c. nvtop occupies full terminal; SAT result queues in background
       d. [o] reopen nvtop at any time; [a] abort (cancels context → kills bee-gpu-stress)
  3. GPU metrics collection (during bee-gpu-stress)
       - background goroutine polls `nvidia-smi` every second
       - per-second rows: elapsed, GPU index, temp°C, usage%, power W, clock MHz
       - outputs: gpu-metrics.csv, gpu-metrics.html (offline SVG chart), gpu-metrics-term.txt
  4. After SAT completes
       - result shown in screenOutput with terminal line-chart (gpu-metrics-term.txt)
       - chart is asciigraph-style: box-drawing chars (╭╮╰╯─│), 4 series per GPU,
         Y axis with ticks, ANSI colours (red=temp, blue=usage, green=power, yellow=clock)

Critical invariants:

  • nvtop must be in iso/builder/config/package-lists/bee.list.chroot (baked into ISO).
  • bee-gpu-stress uses exec.CommandContext — aborted on cancel.
  • Metric goroutine uses stopCh/doneCh pattern; main goroutine waits <-doneCh before reading rows (no mutex needed).
  • If nvtop is not found on PATH, SAT still runs without it (graceful degradation).
  • SVG chart is fully offline: no JS, no external CSS, pure inline SVG.