# Decision: Treat memtest as explicit ISO content, not as trusted live-build magic

**Date:** 2026-04-01
**Status:** resolved

## Context

We have already iterated on `memtest` multiple times and kept cycling between the same ideas.
The commit history shows several distinct attempts:

- `f91bce8`: fixed Bookworm memtest file names to `memtest86+x64.bin` / `memtest86+x64.efi`
- `5857805`: added a binary hook to copy memtest files from the build tree into the ISO root
- `f96b149`: added fallback extraction from the cached `.deb` when `chroot/boot/` stayed empty
- `d43a9ae`: removed the custom hook and switched back to live-build built-in memtest integration
- `60cb8f8`: restored explicit memtest menu entries and added ISO validation
- `3dbc218` / `3869788`: added archived build logs and better memtest diagnostics

Current evidence from the archived `easy-bee-nvidia-v3.14-amd64` logs dated 2026-04-01:

- `lb binary_memtest` does run and installs `memtest86+`
- but the final ISO still does **not** contain `boot/memtest86+x64.bin`
- the final ISO also does **not** contain memtest menu entries in `boot/grub/grub.cfg` or `isolinux/live.cfg`

So the assumption "live-build built-in memtest integration is enough on this stack" is currently false for this project until proven otherwise by a real built ISO.

Additional evidence from the archived `easy-bee-nvidia-v3.17-dirty-amd64` logs dated 2026-04-01:

- the build now completes successfully because memtest is non-blocking by default
- `lb binary_memtest` still runs and installs `memtest86+`
- the project-owned hook `config/hooks/normal/9100-memtest.hook.binary` does execute
- but it executes too early for its current target paths:
  - `binary/boot/grub/grub.cfg` is still missing at hook time
  - `binary/isolinux/live.cfg` is still missing at hook time
  - memtest binaries are also still absent in `binary/boot/`
- later in the build, live-build does create intermediate bootloader configs with memtest lines in the workdir
- but the final ISO still lacks memtest binaries and still lacks memtest lines in extracted ISO `boot/grub/grub.cfg` and `isolinux/live.cfg`

So the assumption "the current normal binary hook path is late enough to patch final memtest artifacts" is also false.

Correction after inspecting the real `easy-bee-nvidia-v3.20-5-g76a9100-amd64.iso`
artifact dated 2026-04-01:

- the final ISO does contain `boot/memtest86+x64.bin`
- the final ISO does contain `boot/memtest86+x64.efi`
- the final ISO does contain memtest menu entries in both `boot/grub/grub.cfg`
  and `isolinux/live.cfg`
- so `v3.20-5-g76a9100` was **not** another real memtest regression in the
  shipped ISO
- the regression was in the build-time validator/debug path in `build.sh`

Root cause of the false alarm:

- `build.sh` treated "ISO reader command exists" as equivalent to "ISO reader
  successfully listed/extracted members"
- `iso_list_files` / `iso_extract_file` failures were collapsed into the same
  observable output as "memtest content missing"
- this made a reader failure look identical to a missing memtest payload
- as a result, we re-entered the same memtest investigation loop even though
  the real ISO was already correct

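The distinction between "the reader failed" and "the payload is missing" can be sketched in shell. This is a minimal illustration, not the project's actual `build.sh`: only the function names `iso_list_files` and `iso_memtest_present` appear in this ADR. The function bodies, the `ISO_LIST_CMD` variable, and the three-way return-code convention are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch: keep "the ISO reader failed" separate from
# "the reader worked and memtest is not in the listing".

# List ISO members, one path per line. ISO_LIST_CMD is an assumed variable
# naming the command that does the real work (for example an xorriso wrapper).
iso_list_files() {
    "$ISO_LIST_CMD" "$1"
}

# Return codes (assumed convention, not taken from build.sh):
#   0 = both memtest payloads are listed
#   1 = listing succeeded but a payload is absent (needs recovery)
#   2 = the listing itself failed, so nothing can be concluded
iso_memtest_present() {
    listing=$(iso_list_files "$1") || return 2
    [ -n "$listing" ] || return 2   # an empty listing is a reader problem too
    for want in boot/memtest86+x64.bin boot/memtest86+x64.efi; do
        # Accept the path with or without a leading slash.
        printf '%s\n' "$listing" | grep -qxF -e "$want" -e "/$want" || return 1
    done
    return 0
}
```

With this shape, a caller can print "ISO reader error" for status 2 and reserve the "memtest missing" message for status 1.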
Additional correction from the subsequent `v3.21` build logs dated 2026-04-01:

- once ISO reading was fixed, the post-build debug correctly showed the raw ISO
  still carried live-build's default memtest layout (`live/memtest.bin`,
  `live/memtest.efi`, `boot/grub/memtest.cfg`, `isolinux/memtest.cfg`)
- that mismatch is expected to trigger project recovery, because `bee` requires
  `boot/memtest86+x64.bin` / `boot/memtest86+x64.efi` plus matching menu paths
- however, `build.sh` exited before recovery because `set -e` treated a direct
  `iso_memtest_present` return code of `1` as fatal
- so the next repeated loop was caused by shell control flow, not by proof that
  the recovery design itself was wrong

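The `set -e` pitfall can be reproduced in a few lines. The sketch below uses a stand-in `probe` function and a hypothetical `decide` wrapper (neither name comes from `build.sh`) to show how a non-zero status can be captured and dispatched on without aborting the script.

```shell
#!/bin/sh
set -e

# Stand-in probe for illustration only: returns the status passed as $1.
# The real probe would inspect the ISO.
probe() { return "$1"; }

# Map a probe status to an action. The `|| rc=$?` keeps the failing command
# on the left side of an OR list, which `set -e` does not treat as fatal.
decide() {
    rc=0
    probe "$1" || rc=$?
    case "$rc" in
        0) echo "ok" ;;
        1) echo "recover" ;;
        *) echo "reader-error" ;;
    esac
}

decide 0   # prints "ok"
decide 1   # prints "recover"
decide 2   # prints "reader-error"
```

Calling `probe 1` as a bare command in the same script would end the script at that line, which is the behaviour described in the bullets above.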
## Known Failed Attempts

These approaches were already tried and should not be repeated blindly:

1. Built-in live-build memtest only.
   Reason it failed:
   - `lb binary_memtest` runs, but the final ISO still misses memtest binaries and menu entries.

2. Fixing only the memtest file names for Debian Bookworm.
   Reason it failed:
   - correct file names alone do not make the files appear in the final ISO.

3. Copying memtest from `chroot/boot/` into `binary/boot/` via a binary hook.
   Reason it failed:
   - in this stack `chroot/boot/` is often empty for memtest payloads at the relevant time.

4. Fallback extraction from cached `memtest86+` `.deb`.
   Reason it failed:
   - this was explored already and was not enough to stabilize the final ISO path end-to-end.

5. Restoring explicit memtest menu entries in source bootloader templates only.
   Reason it failed:
   - memtest lines in source templates or intermediate workdir configs do not guarantee the final ISO contains them.

6. Patching `binary/boot/grub/grub.cfg` and `binary/isolinux/live.cfg` from the current `config/hooks/normal/9100-memtest.hook.binary`.
   Reason it failed:
   - the hook runs before those files exist, so the hook cannot patch them there.

## What This Means

When revisiting memtest later, start from the constraints above rather than retrying the same patterns:

- do not assume the built-in memtest stage is sufficient
- do not assume `chroot/boot/` will contain memtest payloads
- do not assume source bootloader templates are the last writer of final ISO configs
- do not assume the current normal binary hook timing is late enough for final patching

Any future memtest fix must explicitly identify:

- where the memtest binaries are reliably available at build time
- which exact build stage writes the final bootloader configs that land in the ISO
- what post-build proof from a real ISO, not only from intermediate workdir files, confirms the result
- whether the ISO inspection step itself succeeded, rather than merely whether
  the validator printed a memtest warning
- whether a non-zero probe is intentionally handled inside an `if` / `case`
  context rather than accidentally tripping `set -e`

## Decision

For `bee`, memtest must be treated as an explicit ISO artifact with explicit post-build validation.

Project rules from now on:

- Do **not** trust `--memtest memtest86+` by itself.
- A memtest implementation is considered valid only if the produced ISO actually contains:
  - `boot/memtest86+x64.bin`
  - `boot/memtest86+x64.efi`
  - a GRUB menu entry
  - an isolinux menu entry
- If live-build built-in integration does not produce those artifacts, use an explicit project-owned mechanism such as:
  - a binary hook copying files into `binary/boot/`
  - extraction from the cached `memtest86+` `.deb`
  - another deterministic build-time copy step
- Do **not** remove such explicit logic later unless a fresh real ISO build proves that built-in integration alone produces all required files and menu entries.

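The validity rule above can be expressed as a check over an extracted ISO tree. The following is a rough sketch under stated assumptions: `check_memtest_artifacts` is a hypothetical helper name, and treating "the config mentions `memtest86+x64`" as proof of a menu entry is an assumption of this example, not a rule stated in this ADR.

```shell
#!/bin/sh
# Hypothetical helper: check the four required memtest artifacts in an
# extracted ISO tree (or binary/ tree) rooted at $1.
# Prints one line per missing artifact; returns 0 only if nothing is missing.
check_memtest_artifacts() {
    root=$1
    missing=0
    # The two payload files must exist and be non-empty.
    for f in boot/memtest86+x64.bin boot/memtest86+x64.efi; do
        if [ ! -s "$root/$f" ]; then
            echo "missing file: $f"
            missing=1
        fi
    done
    # Assumption: a menu entry counts as present when the bootloader config
    # mentions the memtest payload by name.
    for cfg in boot/grub/grub.cfg isolinux/live.cfg; do
        if ! grep -qF 'memtest86+x64' "$root/$cfg" 2>/dev/null; then
            echo "missing menu entry: $cfg"
            missing=1
        fi
    done
    return "$missing"
}
```

Because the helper returns non-zero when something is missing, a caller running under `set -e` would need to invoke it inside an `if` or an `||` list.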
Current implementation direction:

- keep the live-build memtest stage enabled if it helps package acquisition
- do not rely on the current early `binary_hooks` timing for final patching
- prefer a post-`lb build` recovery step in `build.sh` that:
  - patches the fully materialized `LB_DIR/binary` tree
  - injects memtest binaries there
  - ensures final bootloader entries there
  - reruns late binary stages (`binary_checksums`, `binary_iso`, `binary_zsync`) after the patch
- also treat ISO validation tooling as part of the critical path:
  - install a stable ISO reader in the builder image
  - fail with an explicit reader error if ISO listing/extraction fails
  - do not treat reader failure as evidence that memtest is missing
  - do not call a probe that may return "needs recovery" as a bare command under
    `set -e`; wrap it in explicit control flow

## Consequences

- Future memtest changes must begin by reading this ADR, including the failed-attempt list above, and the commits it cites.
- We should stop re-introducing "prefer built-in live-build memtest" as a default assumption without new evidence.
- Memtest validation in `build.sh` is not optional; it is the acceptance gate that prevents another silent regression.
- But validation output is only trustworthy if ISO reading itself succeeded. A
  "missing memtest" warning without a successful ISO read is not evidence.
- If we change memtest strategy again, we must update this ADR with the exact build evidence that justified the change.

## Working Solution (confirmed 2026-04-01, commits 76a9100 → 2baf3be)

This approach was confirmed working in ISO `easy-bee-nvidia-v3.20-5-g76a9100-amd64.iso`
and validated again in subsequent builds. The final ISO contains all required memtest artifacts.

### Components

**1. Binary hook `config/hooks/normal/9100-memtest.hook.binary`**

Runs inside the live-build binary phase. It does not rely on patching bootloader files at
hook time, because those files may not exist yet. Instead:

- Tries to copy `memtest86+x64.bin` / `memtest86+x64.efi` from `chroot/boot/` first.
- Falls back to extracting from the cached `.deb` (via `dpkg-deb -x`) if `chroot/boot/` is empty.
- Appends GRUB and isolinux menu entries only if the respective cfg files already exist at hook time.
  If they do not exist, the hook warns and continues (does not fail).

Setting the `BEE_REQUIRE_MEMTEST=1` environment variable turns these warnings into hard errors when needed.

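The payload part of the hook (copy first, then fall back to the package) might look roughly like the sketch below. It is not the real hook: `CHROOT_BOOT`, `DEB_CACHE_DIR`, `DEST`, and all three function names are hypothetical, and the `.deb` file name pattern is an assumption. Only `dpkg-deb -x` and `BEE_REQUIRE_MEMTEST` are taken from the description above.

```shell
#!/bin/sh
# Hypothetical sketch of the hook's payload logic. CHROOT_BOOT, DEB_CACHE_DIR
# and DEST are assumed variable names, not taken from the real hook.

# Warn by default; fail only when BEE_REQUIRE_MEMTEST=1.
warn_or_fail() {
    echo "memtest hook: $*" >&2
    if [ "${BEE_REQUIRE_MEMTEST:-0}" = "1" ]; then
        return 1
    fi
    return 0
}

# Copy one payload out of the cached package, if a package is found.
# The file name pattern below is an assumption.
copy_from_deb() {
    deb=$(find "$DEB_CACHE_DIR" -name 'memtest86+_*.deb' 2>/dev/null | head -n 1)
    [ -n "$deb" ] || return 1
    tmp=$(mktemp -d) || return 1
    status=1
    if dpkg-deb -x "$deb" "$tmp" && [ -s "$tmp/boot/$1" ]; then
        cp "$tmp/boot/$1" "$DEST/$1" && status=0
    fi
    rm -rf "$tmp"
    return "$status"
}

stage_memtest_payloads() {
    mkdir -p "$DEST" || return 1
    for name in memtest86+x64.bin memtest86+x64.efi; do
        if [ -s "$CHROOT_BOOT/$name" ]; then
            # First choice: the file already sits in chroot/boot/.
            cp "$CHROOT_BOOT/$name" "$DEST/$name" || return 1
        elif ! copy_from_deb "$name"; then
            # Neither source worked: warn, or fail in strict mode.
            warn_or_fail "could not stage $name" || return 1
        fi
    done
    return 0
}
```

The default is deliberately lenient so the hook cannot block a build on its own; the post-build recovery step remains responsible for the final tree.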
**2. Post-`lb build` recovery step in `build.sh`**

After `lb build` completes, `build.sh` checks whether the fully materialized `binary/` tree
contains all required memtest artifacts. If not:

- Copies/extracts memtest binaries into `binary/boot/`.
- Patches `binary/boot/grub/grub.cfg` and `binary/isolinux/live.cfg` directly.
- Reruns the late binary stages (`binary_checksums`, `binary_iso`, `binary_zsync`) to rebuild
  the ISO with the patched tree.

This is the deterministic safety net: even if the hook runs at the wrong time, the recovery
step handles the final `binary/` tree after live-build has written all bootloader configs.

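The order of operations in the recovery step can be sketched as follows. Everything here except the three stage names is hypothetical: `recover_memtest`, `tree_is_complete`, `patch_tree`, and the `RUN` variable are illustrative stand-ins, and the patch function only prints what it would do. `RUN` defaults to `echo`, so the sketch shows the commands instead of invoking live-build.

```shell
#!/bin/sh
# Hypothetical orchestration sketch; not the real build.sh.
# RUN defaults to "echo" (dry run). Setting RUN to an empty string would
# execute the lb commands for real.
RUN=${RUN-echo}

# Stand-in check: only looks at the two payload files.
tree_is_complete() {
    [ -s "$1/boot/memtest86+x64.bin" ] && [ -s "$1/boot/memtest86+x64.efi" ]
}

# Stand-in patch: the real step would copy binaries and edit the configs.
patch_tree() {
    echo "patching $1"
}

recover_memtest() {
    lb_dir=$1
    if tree_is_complete "$lb_dir/binary"; then
        echo "memtest artifacts already present, nothing to do"
        return 0
    fi
    patch_tree "$lb_dir/binary" || return 1
    # Rebuild the ISO from the patched tree. live-build may skip stages it
    # has already marked as done; how a rerun is forced is not shown here.
    for stage in binary_checksums binary_iso binary_zsync; do
        ( cd "$lb_dir" && $RUN lb "$stage" ) || return 1
    done
    return 0
}
```

The check runs as the condition of an `if`, so a "needs recovery" result does not trip `set -e` in a caller that enables it.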
**3. ISO validation hardening**

The memtest probe in `build.sh` is wrapped in explicit `if` / `case` control flow, not called
as a bare command under `set -e`. A non-zero probe return (needs recovery) is intentional and
handled; it does not abort the build prematurely.

ISO reading (`xorriso -indev -ls` / extraction) is treated as a separate prerequisite.
If the reader fails, the validator reports a reader error explicitly, not a memtest warning.
This prevents the false-negative loop that consumed builds v3.14–v3.19 on 2026-04-01.

### Why this works when earlier attempts did not

The earlier patterns all shared a single flaw: they assumed a single build-time point
(hook or source template) would be the last writer of bootloader configs and memtest payloads.
In live-build on Debian Bookworm that assumption is false: live-build continues writing
bootloader files after custom hooks run, and `chroot/boot/` does not reliably hold memtest payloads.

The recovery step sidesteps the ordering problem entirely: it acts on the fully materialized
`binary/` tree after `lb build` finishes, then rebuilds the ISO from that patched tree.
There is no ordering dependency to get wrong.

### Do not revert

Do not remove the recovery step or the hook without a fresh real ISO build proving
live-build alone produces all four required artifacts:

- `boot/memtest86+x64.bin`
- `boot/memtest86+x64.efi`
- memtest entry in `boot/grub/grub.cfg`
- memtest entry in `isolinux/live.cfg`