bee/bible-local/decisions/2026-04-01-memtest-build-strategy.md
Michael Chus eb60100297 fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
  of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
  overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
  bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
  base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:25:23 +03:00

Decision: Treat memtest as explicit ISO content, not as trusted live-build magic

Date: 2026-04-01

Status: resolved

Context

We have already iterated on memtest multiple times and kept cycling between the same ideas. The commit history shows several distinct attempts:

  • f91bce8 — fixed Bookworm memtest file names to memtest86+x64.bin / memtest86+x64.efi
  • 5857805 — added a binary hook to copy memtest files from the build tree into the ISO root
  • f96b149 — added fallback extraction from the cached .deb when chroot/boot/ stayed empty
  • d43a9ae — removed the custom hook and switched back to live-build built-in memtest integration
  • 60cb8f8 — restored explicit memtest menu entries and added ISO validation
  • 3dbc218 / 3869788 — added archived build logs and better memtest diagnostics

Current evidence from the archived easy-bee-nvidia-v3.14-amd64 logs dated 2026-04-01:

  • lb binary_memtest does run and installs memtest86+
  • but the final ISO still does not contain boot/memtest86+x64.bin
  • the final ISO also does not contain memtest menu entries in boot/grub/grub.cfg or isolinux/live.cfg

So the assumption "live-build built-in memtest integration is enough on this stack" is currently false for this project until proven otherwise by a real built ISO.

Additional evidence from the archived easy-bee-nvidia-v3.17-dirty-amd64 logs dated 2026-04-01:

  • the build now completes successfully because memtest is non-blocking by default
  • lb binary_memtest still runs and installs memtest86+
  • the project-owned hook config/hooks/normal/9100-memtest.hook.binary does execute
  • but it executes too early for its current target paths:
    • binary/boot/grub/grub.cfg is still missing at hook time
    • binary/isolinux/live.cfg is still missing at hook time
    • memtest binaries are also still absent in binary/boot/
  • later in the build, live-build does create intermediate bootloader configs with memtest lines in the workdir
  • but the final ISO still lacks memtest binaries and still lacks memtest lines in extracted ISO boot/grub/grub.cfg and isolinux/live.cfg

So the assumption "the current normal binary hook path is late enough to patch final memtest artifacts" is also false.

Correction after inspecting the real easy-bee-nvidia-v3.20-5-g76a9100-amd64.iso artifact dated 2026-04-01:

  • the final ISO does contain boot/memtest86+x64.bin
  • the final ISO does contain boot/memtest86+x64.efi
  • the final ISO does contain memtest menu entries in both boot/grub/grub.cfg and isolinux/live.cfg
  • so v3.20-5-g76a9100 was not another real memtest regression in the shipped ISO
  • the regression was in the build-time validator/debug path in build.sh

Root cause of the false alarm:

  • build.sh treated "ISO reader command exists" as equivalent to "ISO reader successfully listed/extracted members"
  • iso_list_files / iso_extract_file failures were collapsed into the same observable output as "memtest content missing"
  • this made a reader failure look identical to a missing memtest payload
  • as a result, we re-entered the same memtest investigation loop even though the real ISO was already correct
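
The distinction between "reader failed" and "member missing" can be sketched as a three-state probe. This is a minimal illustration, not the actual build.sh code: `iso_has_member`, the `ISO_LIST_CMD` variable, and the 0/1/2 return-code convention are assumptions introduced here.

```shell
#!/bin/sh
# Sketch only. ISO_LIST_CMD, iso_has_member, and the 0/1/2 return-code
# convention are assumptions, not the project's actual build.sh interface.

# Returns 0 = member present, 1 = member missing, 2 = reader failed.
iso_has_member() {
    iso=$1
    member=$2
    # ISO_LIST_CMD must print one member path per line and exit 0 on success.
    if ! listing=$("${ISO_LIST_CMD:?set ISO_LIST_CMD}" "$iso" 2>/dev/null); then
        echo "ERROR: ISO reader failed on $iso" >&2
        return 2
    fi
    # An empty listing from a "successful" reader is also treated as a reader error.
    if [ -z "$listing" ]; then
        echo "ERROR: ISO reader returned no members for $iso" >&2
        return 2
    fi
    if printf '%s\n' "$listing" | grep -Fxq -- "$member"; then
        return 0
    fi
    return 1
}
```

A caller would branch on all three outcomes, so that only return code 1 is ever reported as "memtest content missing".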

Additional correction from the subsequent v3.21 build logs dated 2026-04-01:

  • once ISO reading was fixed, the post-build debug correctly showed the raw ISO still carried live-build's default memtest layout (live/memtest.bin, live/memtest.efi, boot/grub/memtest.cfg, isolinux/memtest.cfg)
  • that mismatch is expected to trigger project recovery, because bee requires boot/memtest86+x64.bin / boot/memtest86+x64.efi plus matching menu paths
  • however, build.sh exited before recovery because set -e treated a direct iso_memtest_present return code of 1 as fatal
  • so the next repeated loop was caused by shell control flow, not by proof that the recovery design itself was wrong
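
The control-flow problem is easy to reproduce in isolation. In the sketch below, `probe_memtest` and `needs_recovery` are hypothetical stand-ins for iso_memtest_present and the recovery branch in build.sh; only the `set -e` behaviour itself is standard shell semantics.

```shell
#!/bin/sh
# Sketch only: probe_memtest and needs_recovery are hypothetical stand-ins
# for iso_memtest_present and the recovery path in build.sh.
set -e

# Pretend probe: returns non-zero to mean "memtest layout needs recovery".
probe_memtest() {
    [ -e "$1/boot/memtest86+x64.bin" ]
}

# Safe: a non-zero status inside an `if` condition does not trigger set -e.
needs_recovery() {
    if probe_memtest "$1"; then
        echo "memtest present"
    else
        echo "memtest missing, running recovery"
    fi
}

# Unsafe equivalent (left commented out): as a bare command, the probe's
# non-zero return would terminate the script before recovery could run.
#   probe_memtest "$tree"
#   echo "never reached when the probe returns non-zero"
```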

Known Failed Attempts

These approaches were already tried and should not be repeated blindly:

  1. Built-in live-build memtest only. Reason it failed:
  • lb binary_memtest runs, but the final ISO still misses memtest binaries and menu entries.
  2. Fixing only the memtest file names for Debian Bookworm. Reason it failed:
  • correct file names alone do not make the files appear in the final ISO.
  3. Copying memtest from chroot/boot/ into binary/boot/ via a binary hook. Reason it failed:
  • in this stack chroot/boot/ is often empty for memtest payloads at the relevant time.
  4. Fallback extraction from cached memtest86+ .deb. Reason it failed:
  • this was explored already and was not enough to stabilize the final ISO path end-to-end.
  5. Restoring explicit memtest menu entries in source bootloader templates only. Reason it failed:
  • memtest lines in source templates or intermediate workdir configs do not guarantee the final ISO contains them.
  6. Patching binary/boot/grub/grub.cfg and binary/isolinux/live.cfg from the current config/hooks/normal/9100-memtest.hook.binary. Reason it failed:
  • the hook runs before those files exist, so the hook cannot patch them there.

What This Means

When revisiting memtest later, start from the constraints above rather than retrying the same patterns:

  • do not assume the built-in memtest stage is sufficient
  • do not assume chroot/boot/ will contain memtest payloads
  • do not assume source bootloader templates are the last writer of final ISO configs
  • do not assume the current normal binary hook timing is late enough for final patching

Any future memtest fix must explicitly identify:

  • where the memtest binaries are reliably available at build time
  • which exact build stage writes the final bootloader configs that land in the ISO
  • what post-build proof from a real ISO, not only from intermediate workdir files, confirms the result
  • whether the ISO inspection step itself succeeded, rather than merely whether the validator printed a memtest warning
  • whether a non-zero probe is intentionally handled inside an if / case context rather than accidentally tripping set -e

Decision

For bee, memtest must be treated as an explicit ISO artifact with explicit post-build validation.

Project rules from now on:

  • Do not trust --memtest memtest86+ by itself.
  • A memtest implementation is considered valid only if the produced ISO actually contains:
    • boot/memtest86+x64.bin
    • boot/memtest86+x64.efi
    • a GRUB menu entry
    • an isolinux menu entry
  • If live-build built-in integration does not produce those artifacts, use an explicit project-owned mechanism such as:
    • a binary hook copying files into binary/boot/
    • extraction from the cached memtest86+ .deb
    • another deterministic build-time copy step
  • Do not remove such explicit logic later unless a fresh real ISO build proves that built-in integration alone produces all required files and menu entries.
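
The four-artifact acceptance rule can be expressed as a single check over an extracted ISO tree (or the binary/ tree). This is a sketch under assumptions: `memtest_tree_missing` is a hypothetical helper, and it assumes a valid menu entry mentions the `memtest86+x64` file name, which may not match how build.sh actually detects entries.

```shell
#!/bin/sh
# Sketch only: memtest_tree_missing is hypothetical. It assumes a memtest
# menu entry references a memtest86+x64.* path inside the config file.

# Prints the missing artifacts (space-separated) and returns non-zero if any.
memtest_tree_missing() {
    root=$1
    missing=""
    # The two payload files must exist and be non-empty.
    for f in boot/memtest86+x64.bin boot/memtest86+x64.efi; do
        [ -s "$root/$f" ] || missing="$missing $f"
    done
    # Both bootloader configs must contain a memtest reference.
    for cfg in boot/grub/grub.cfg isolinux/live.cfg; do
        grep -Fq 'memtest86+x64' "$root/$cfg" 2>/dev/null \
            || missing="$missing $cfg:entry"
    done
    printf '%s\n' "${missing# }"
    [ -z "$missing" ]
}
```

Run it only against a tree that was read successfully; an empty or failed extraction would otherwise be reported as four missing artifacts.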

Current implementation direction:

  • keep the live-build memtest stage enabled if it helps package acquisition
  • do not rely on the current early binary_hooks timing for final patching
  • prefer a post-lb build recovery step in build.sh that:
    • patches the fully materialized LB_DIR/binary tree
    • injects memtest binaries there
    • ensures final bootloader entries there
    • reruns late binary stages (binary_checksums, binary_iso, binary_zsync) after the patch
  • also treat ISO validation tooling as part of the critical path:
    • install a stable ISO reader in the builder image
    • fail with an explicit reader error if ISO listing/extraction fails
    • do not treat reader failure as evidence that memtest is missing
    • do not call a probe that may return "needs recovery" as a bare command under set -e; wrap it in explicit control flow

Consequences

  • Future memtest changes must begin by reading this ADR, the commits listed above, and the failed-attempt list.
  • We should stop re-introducing "prefer built-in live-build memtest" as a default assumption without new evidence.
  • Memtest validation in build.sh is not optional; it is the acceptance gate that prevents another silent regression.
  • But validation output is only trustworthy if ISO reading itself succeeded. A "missing memtest" warning without a successful ISO read is not evidence.
  • If we change memtest strategy again, we must update this ADR with the exact build evidence that justified the change.

Working Solution (confirmed 2026-04-01, commits 76a9100 through 2baf3be)

This approach was confirmed working in ISO easy-bee-nvidia-v3.20-5-g76a9100-amd64.iso and validated again in subsequent builds. The final ISO contains all required memtest artifacts.

Components

1. Binary hook config/hooks/normal/9100-memtest.hook.binary

Runs inside the live-build binary phase. It does not rely on patching bootloader files at hook time, because those files may not exist yet. Instead:

  • Tries to copy memtest86+x64.bin / memtest86+x64.efi from chroot/boot/ first.
  • Falls back to extracting from the cached .deb (via dpkg-deb -x) if chroot/boot/ is empty.
  • Appends GRUB and isolinux menu entries only if the respective cfg files already exist at hook time. If they do not exist, the hook warns and continues (does not fail).

Setting the BEE_REQUIRE_MEMTEST=1 environment variable turns these warnings into hard errors when needed.
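
The copy-then-fallback behaviour of the hook might look roughly like the sketch below. `fetch_memtest`, its arguments, and the `memtest86+*.deb` cache glob are hypothetical; it also assumes the package ships the payloads under boot/. Only `dpkg-deb -x` and BEE_REQUIRE_MEMTEST come from the description above.

```shell
#!/bin/sh
# Sketch only: fetch_memtest, its arguments, and the cache glob are
# hypothetical. Assumes the .deb unpacks the payloads into boot/.
fetch_memtest() {
    chroot_boot=$1
    deb_cache=$2
    dest=$3
    mkdir -p "$dest"
    for f in memtest86+x64.bin memtest86+x64.efi; do
        # First choice: the file already sits in chroot/boot/.
        if [ -s "$chroot_boot/$f" ]; then
            cp "$chroot_boot/$f" "$dest/$f"
            continue
        fi
        # Fallback: unpack each cached package and look for the file there.
        for deb in "$deb_cache"/memtest86+*.deb; do
            [ -e "$deb" ] || continue
            unpack_dir=$(mktemp -d)
            dpkg-deb -x "$deb" "$unpack_dir" \
                && [ -s "$unpack_dir/boot/$f" ] \
                && cp "$unpack_dir/boot/$f" "$dest/$f"
            rm -rf "$unpack_dir"
        done
        # Warn by default; fail hard only when BEE_REQUIRE_MEMTEST=1.
        if [ ! -s "$dest/$f" ]; then
            echo "WARNING: $f not found in chroot or cached .deb" >&2
            [ "${BEE_REQUIRE_MEMTEST:-0}" = 1 ] && return 1
        fi
    done
    return 0
}
```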

2. Post-lb build recovery step in build.sh

After lb build completes, build.sh checks whether the fully materialized binary/ tree contains all required memtest artifacts. If not:

  • Copies/extracts memtest binaries into binary/boot/.
  • Patches binary/boot/grub/grub.cfg and binary/isolinux/live.cfg directly.
  • Reruns the late binary stages (binary_checksums, binary_iso, binary_zsync) to rebuild the ISO with the patched tree.

This is the deterministic safety net: even if the hook runs at the wrong time, the recovery step handles the final binary/ tree after live-build has written all bootloader configs.
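
Two properties matter for the recovery step: patching must be idempotent, and the late stages must be rerun in order. The sketch below illustrates both. `ensure_menu_entry`, `rerun_late_stages`, and the `LB_RUN` variable are hypothetical names; the menu entry text is supplied by the caller and is deliberately not shown, and how each stage must be invoked (arguments, stage markers) depends on the live-build version.

```shell
#!/bin/sh
# Sketch only: ensure_menu_entry, rerun_late_stages, and LB_RUN are
# hypothetical. The stage names are the ones listed in this document.

# Append an entry to a bootloader config only if no memtest reference exists
# yet, so repeated recovery runs do not duplicate it.
ensure_menu_entry() {
    cfg=$1
    entry=$2
    [ -f "$cfg" ] || { echo "ERROR: $cfg missing" >&2; return 1; }
    grep -Fq 'memtest86+x64' "$cfg" || printf '\n%s\n' "$entry" >> "$cfg"
}

# Rerun the late binary stages so the ISO is rebuilt from the patched tree.
# LB_RUN defaults to `lb`; override it to dry-run or to wrap the command.
rerun_late_stages() {
    for stage in binary_checksums binary_iso binary_zsync; do
        ${LB_RUN:-lb} "$stage" || return 1
    done
}
```

Setting `LB_RUN=echo` gives a dry run that only prints the stage names in the order they would execute.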

3. ISO validation hardening

The memtest probe in build.sh is wrapped in explicit if / case control flow, not called as a bare command under set -e. A non-zero probe return (needs recovery) is intentional and handled — it does not abort the build prematurely.

ISO reading (xorriso -indev -ls / extraction) is treated as a separate prerequisite. If the reader fails, the validator reports a reader error explicitly, not a memtest warning. This prevents the false-negative loop that consumed the 2026-04-01 builds v3.14 through v3.19.

Why this works when earlier attempts did not

The earlier patterns all shared a single flaw: they assumed a single build-time point (hook or source template) would be the last writer of bootloader configs and memtest payloads. In live-build on Debian Bookworm that assumption is false — live-build continues writing bootloader files after custom hooks run, and chroot/boot/ does not reliably hold memtest payloads.

The recovery step sidesteps the ordering problem entirely: it acts on the fully materialized binary/ tree after lb build finishes, then rebuilds the ISO from that patched tree. There is no ordering dependency to get wrong.

Do not revert

Do not remove the recovery step or the hook without a fresh real ISO build proving live-build alone produces all four required artifacts:

  • boot/memtest86+x64.bin
  • boot/memtest86+x64.efi
  • memtest entry in boot/grub/grub.cfg
  • memtest entry in isolinux/live.cfg