- Remove "6." / "7." prefixes from Tools and Settings nav labels and page titles
- Add a horizontal separator (nav-sep) before the Tools/Settings group
- Move Tasks into the nav as a regular nav-item after the separator,
replacing the separate tasks-nav-btn at the sidebar bottom
- Tasks item retains the active-count badge (tasks-nav-count)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tools page now contains only NVMe Block Format and Supermicro - DMI.
Moved to Settings (7):
- System Install (Install to RAM + Install to Disk)
- Support Bundle + USB Black-Box
- Tool Check
- NVIDIA Self Heal (replaces simple NVIDIA Recovery card)
- Network
- Services
Update TestToolsPageRendersNvidiaSelfHealSection to assert the moved
cards on /settings instead of /tools.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract saa 1.5.0 (Linux x86_64) into iso/vendor/saa — baked into ISO
at /usr/local/bin/saa via the existing vendor loop in build.sh
- Add saa to the vendor tool loop in iso/builder/build.sh
- Rename the web UI card from "SAA - DMI" to "Supermicro - DMI"
- Remove the redundant description hint about saa on PATH
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds board and memory parser test fixtures based on real dmidecode output
from Dell PowerEdge R740xd, HPE ProLiant DL380 Gen10, and Supermicro
SYS-6028R-WTR sourced from the linuxhw/DMI dataset. Extends cleanDMIValue
with four additional vendor placeholder strings found in the dataset:
"0123456789", "1234567890", "NOT AVAILABLE", and "TO BE FILLED BY O.E.M"
(without trailing dot). Adds memory_test.go covering mixed populated/empty
DIMM slots and both GB and MB size formats.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the guessed pipe/key=value parser with the correct format
documented in SAA User Guide 4.8.1:
[Section]
Item Name {SHN} = "value" // comment
Handles string values (strips surrounding quotes), non-string values
(UUID, hex), section headers for display names, version line, and
// comments. Verified against the SAA 1.5.0 User Guide sample.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new card to the web UI Tools page for reading and editing DMI
fields via SAA (In-Band). Reads current DMI configuration with GetDmiInfo,
displays all fields as an editable table, and applies only the changed
fields via EditDmiInfo + ChangeDmiInfo. Backs up the original DMI file to
dmi-backups/ before any write, making it available in the support bundle
for rollback. Also adds "saa" to the standard tool check list.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA removed datacenter-gpu-manager-4-core 1:4.5.3-1 from the
repository and published 1:4.6.0-1. The cuda13 and proprietary
packages still declared an exact-version dependency on 4.5.3-1 core,
making the old pin unresolvable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark,
Tasks, Tools) with a numbered progression that guides engineers through
a logical acceptance workflow:
Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed
→ 5. Endurance → 6. Tools → 7. Settings
Key changes:
- layout.go: numbered nav labels, new hrefs, Tasks removed from nav
and replaced with a persistent sidebar badge (polls /api/tasks every
5 s, highlights amber when tasks are active)
- server.go: 301 redirects from /validate→/check, /burn→/load,
/benchmark→/speed for backward compatibility
- pages.go: dispatch cases for all new routes; old routes kept as
fallbacks
- page_validate.go: add renderCheck() — non-destructive check page
with validate-mode tests only (no stress toggle, no targeted-stress/
targeted-power/pulse cards)
- page_burn.go: add renderLoad() wrapper; update scope alert to
reference /check instead of /validate
- page_benchmark.go: add renderSpeed() (performance focus) and
renderEndurance() (stability/overnight focus) wrappers
- page_settings.go: new Settings page with blackbox logging toggle,
NVIDIA driver reset, and build info
- server_test.go: update five tests to use new route names and
content expectations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bee-nvidia.service loads NVIDIA kernel modules; without After=bee-nvidia.service
fabricmanager starts before /dev/nvidiactl is ready, fails, and relies on
systemd restart to recover (~38s delay on affected systems).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
smartpqi uses scsi_transport_sas but does not register a sas_host
object, so /sys/class/sas_host/host14 does not exist and the existing
SAS detection check passes right through. Writing to host14/scan then
calls sas_user_scan which blocks indefinitely on scsi_scan_target's
mutex (confirmed by kernel hung-task traces in the field).
Add a second detection path via /sys/class/scsi_host/hostX/proc_name:
skip hosts whose driver is "smartpqi" or "hpsa" (HPE Smart Array
predecessors that exhibit the same behaviour).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tracks origin/main after rebase: adds per-column header filters for
severity in the viewer (feat(viewer): replace severity dropdown).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Picks up new contracts: hardware-ingest-json, submodule-integration,
go-database cursor safety, and several contract deduplication passes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema
and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to
HardwareSnapshot.
B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go;
replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice,
isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate)
with numeric vendor_id comparisons.
C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go,
app_network.go, app_services.go, app_packs.go, app_install.go — no logic changes.
D (go-code-style): wrap bare return err in interfaceAdminState and
interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including
the interface name.
E (go-project-bible): add bible-local/architecture/data-model.md and
bible-local/architecture/api-surface.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Disabled PCIe devices (sysfs enable==0) carry no data traffic; their
link state has no operational impact. Switchtec PCIe switch management
endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers)
train at reduced speed intentionally and were producing spurious warnings.
Check is vendor-agnostic: reads enable attribute via existing helper,
no vendor/device ID hardcoding.
Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvme-cli emits smart-log counters as JSON strings and uses field names
avail_spare / percent_used instead of the prose names in the NVMe spec.
The nvmeSmartLog struct had int64 fields with wrong JSON tags — Unmarshal
returned an error and the whole health block was skipped, leaving every
NVMe drive with status=Unknown.
Fix: switch all numeric fields to jsonInt64 (already used for lsblk
block sizes) which accepts both bare numbers and quoted strings, and
correct the avail_spare / percent_used tag names.
Also fix validateIsVendorGPU for NVIDIA: previously counted any NVIDIA
PCIe device (including NVSwitch bridges) as a GPU, producing wrong
estimates (12 instead of 8 on an HGX H100 system). Now requires
device_class to be videocontroller or processingaccelerator, matching
the existing AMD filter logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvtop pulled nvidia-tesla-470-* via Recommends into the nogpu build.
Move it from bee.list.chroot into bee-nvidia and bee-amd lists so it
only appears in GPU variants.
Also remove the stray git-bible/ directory (was not gitignored) and
move grub-bitmap-error docs into bible-local/docs/.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writing to /sys/class/scsi_host/hostX/scan on SAS controllers (e.g.
Adaptec smartpqi/PM8222-SHBA) triggers sas_user_scan which blocks
indefinitely, causing the audit to hang forever. Skip hosts that appear
under /sys/class/sas_host/ — SAS topology is discovered by the driver.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- storage: add jsonInt64 dual-format unmarshaler to handle lsblk output
change in util-linux 2.38 (LOG-SEC/PHY-SEC now emitted as JSON
integers, not quoted strings); fixes SATA disks invisible on Debian 12
- pcie: detect NVLink bridge mezzanine CX-7 cards (Mellanox x2, no host
net ifaces, DeviceName contains "NVLINK" in lspci -v) and mark them
with device_class="NVLinkBridge"; escalate PCIe link speed downgrade to
Critical for these cards (Gen3 on a fixed internal connector = hardware
fault, not a transient warning)
- pcie: cross-reference nvidia-smi topo to capture NVLink bond counts and
active status for all NVLink bridge cards
- packages: add infiniband-diags to ISO; provides ibstat required by
nvidia-fabricmanager-start.sh to enumerate IB devices before FM launch
(absence causes CUDA_ERROR_SYSTEM_NOT_READY)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Network: green if at least one interface has IPv4 (drop PARTIAL state).
Bee Services: treat inactive as OK — oneshot services (bee-sshsetup,
bee-preflight, bee-network, bee-audit, etc.) complete successfully and
exit to inactive; only failed is a real problem.
nvidia-fabricmanager: add ExecCondition=bee-check-nvswitch drop-in so
the service is silently skipped (inactive, not failed) on systems
without NVSwitch hardware (e.g. H200 NVL with direct NVLink, no
NVSwitch chips). bee-check-nvswitch detects NVSwitch via lspci
(vendor 10de, class 0680).
bee-nvidia.service: add ConditionPathExists=/usr/local/bin/bee-nvidia-load
so the unit is a no-op if somehow present in a non-nvidia build.
bee-boot-status: read /etc/bee-gpu-vendor and exclude bee-nvidia from
CRITICAL/ALL on non-nvidia builds, preventing boot hang if the unit
is unexpectedly present.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
EASY_BEE_NVIDIA_LEGACY_V<date> is 33 characters; ISO 9660 volid is
limited to 32. Compute the maximum token length dynamically from the
prefix length and trim ISO_VERSION_LABEL_TOKEN with cut before
assembling BEE_ISO_VOLUME. All four variants now fit within the limit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add nvidia-aer-correctable and pcie-aer-correctable patterns to catch
"bus correctable error" events seen in SEL (Critical Interrupt / offset 7).
Both patterns carry severity "warning" — correctable errors are
hardware-recovered and should not flag a card as failed.
Fix kmsg_watcher routing: GPU-category events were keyed as pcie:<BDF>
but the UI queries for pcie:gpu: prefix. Split the switch so "gpu" →
pcie:gpu:<BDF> and "pcie" → pcie:<BDF>. This applies to both
flushWindow (SAT-window path) and flushImmediate (always-on path).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
\\$1 in a double-quoted string expands as literal backslash + $1 (the
script's first positional arg). With set -u and no CLI args (IP entered
via read), this fails. \$1 correctly escapes the dollar sign, producing
a literal $1 for awk on the remote host.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HTMX was never loaded on the page, so hx-get on the component label
spans was dead code — the dialog opened empty. Replace with a plain
openComponentDetail() fetch call. Also fix dialog positioning broken
by the CSS reset (*{margin:0} overrode the UA margin:auto that centers
<dialog>). Replace card hx-trigger polling with a setInterval.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All intermediate build artifacts (binaries, live-build work dirs, overlay
stages, NVIDIA/NCCL/cuBLAS/john caches) now live under dist/cache/.
Final ISOs go to dist/release/ instead of scattered dist/easy-bee-v*/ and iso/out/.
dist/ is already gitignored, iso/out/ entry removed as redundant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Auto-detects build mode: remote VM if BUILDER_HOST is set in .env,
local Docker otherwise. Cache hardcoded to dist/container-cache (gitignored).
All flags forwarded to build-in-container.sh.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times,
not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB
- New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source)
- Hardware Summary card auto-refreshes every 30s via htmx without page reload
- Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal
with per-component status, source, timestamp and last 20 history entries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Installs a local-premount initramfs hook that intercepts bee.wipe=all before
squashfs is mounted. Shows a numbered disk selection TUI (pure POSIX sh), wipes
selected disks (nvme format / blkdiscard / dd fallback), syncs, and reboots.
Works even when squashfs fails to mount.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a "WIPE ALL DISKS" entry to both GRUB and isolinux menus (bee.wipe=all).
Includes bee-wipe-disks for manual use from a running live system.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Squashfs versioning:
- ISO now contains filesystem-v<VERSION>.squashfs instead of the generic
filesystem.squashfs, making it immediately visible which build is
running (visible in /run/live/medium/live/ at boot time).
- Full build path: rename filesystem.squashfs → filesystem-v*.squashfs
after lb build, before lb binary_checksums/binary_iso.
- Fast path: find and unpack whatever filesystem*.squashfs exists, repack
as the new versioned name, remove the old file, update the ISO.
- needs_full_build: accept any filesystem*.squashfs so version changes
alone don't force a full rebuild.
Media selection hardening:
- Add live-media=/dev/disk/by-label/<LABEL> to the kernel boot line in
addition to the existing live-media-label=<LABEL>. live-boot will now
open exactly the labeled device rather than scanning all block devices,
preventing accidental use of squashfs files from local disks or
stale virtual media attached via IPMI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mksquashfs 4.5.1 (bookworm) writes a non-SQUASHFS_INVALID_BLK value for
xattr_id_table_start in the superblock even when -no-xattrs is passed, if
the source chroot contains POSIX ACL xattrs set by dpkg at install time.
Linux 6.1 squashfs driver then fails with "unable to read xattr id index
table" and refuses to mount the filesystem.
Strip all xattrs from the chroot via Python3 (already present) immediately
before mksquashfs runs. With an xattr-free source tree the resulting
squashfs is guaranteed to have SQUASHFS_INVALID_BLK in the xattr field.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--chroot-squashfs-compression-options does not exist in live-build
bookworm (1:20230502). The correct mechanism is the MKSQUASHFS_OPTIONS
environment variable read by binary_rootfs.
Export MKSQUASHFS_OPTIONS="-no-xattrs" before lb build so live-build's
binary_rootfs picks it up, and add -no-xattrs explicitly to every
direct mksquashfs call in build.sh (fast-path repack and the dormant
split-layers function). Remove the invalid lb config option.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--chroot-squashfs-options is not a valid lb_config option; the correct
name is --chroot-squashfs-compression-options. Without this fix lb config
aborts immediately, so the -no-xattrs flag (which prevents the
"unable to read xattr id index table" boot failure) was never applied.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kernel squashfs driver fails with "unable to read xattr id index table"
when the squashfs contains POSIX ACL xattrs (system.posix_acl_*) written
by mksquashfs as unrecognised entries. This caused every built ISO to
drop to an initramfs shell at boot.
Add -no-xattrs to mksquashfs options so xattrs are stripped at build
time. xattrs are not needed in a live read-only rootfs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
split_live_squashfs_layers moved /usr out of filesystem.squashfs into a
separate 10-usr.squashfs, leaving a rootfs skeleton that live-boot
(1:20230131+deb12u1) cannot mount: the initramfs panics with
"Can not mount /dev/loop0 ... filesystem.squashfs".
live-boot in bookworm expects a single self-contained filesystem.squashfs.
Revert to the standard single-squashfs layout and remove the dead
multi-squashfs guard in needs_full_build().
The split_live_squashfs_layers function is kept for future reference.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
trap RETURN is a bash extension not supported by /bin/sh on Debian.
With set -e active the unsupported trap call exited the build immediately
after lb build, before bootloader sync and ISO copy steps ran.
Remove both trap RETURN lines — explicit rm -rf at the end of the
function is sufficient for cleanup on the happy path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>