Commit Graph

563 Commits

Author SHA1 Message Date
ce6b1e0eb7 Update internal/chart submodule pointer to 8105c7e
Tracks origin/main after rebase: adds per-column header filters for
severity in the viewer (feat(viewer): replace severity dropdown).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.24
2026-06-13 14:48:04 +03:00
4066e842a9 Update bible submodule to v0.2.0-13-g1977730
Picks up new contracts: hardware-ingest-json, submodule-integration,
go-database cursor safety, and several contract deduplication passes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 14:44:52 +03:00
7d2e904d14 Bring codebase into compliance with bible contracts (A–E)
A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema
and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to
HardwareSnapshot.

B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go;
replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice,
isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate)
with numeric vendor_id comparisons.

C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go,
app_network.go, app_services.go, app_packs.go, app_install.go — no logic changes.

D (go-code-style): wrap bare return err in interfaceAdminState and
interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including
the interface name.

E (go-project-bible): add bible-local/architecture/data-model.md and
bible-local/architecture/api-surface.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 14:32:08 +03:00
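
Item B of commit 7d2e904d14 replaces vendor-name string checks with numeric vendor_id comparisons. A minimal sketch of that idea follows. It is not the repository's pci_vendors.go: the constant names, the parseVendorID helper, and the function signatures are assumptions made for illustration. The three IDs are the PCI-SIG assignments for NVIDIA, AMD, and Mellanox.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PCI vendor IDs as assigned by PCI-SIG. Constant names are illustrative.
const (
	vendorNVIDIA   uint16 = 0x10de
	vendorAMD      uint16 = 0x1002
	vendorMellanox uint16 = 0x15b3
)

// parseVendorID converts the text of a sysfs "vendor" attribute such as
// "0x10de\n" into its numeric form. Hypothetical helper.
func parseVendorID(raw string) (uint16, error) {
	// Base 0 lets ParseUint honour the "0x" prefix.
	v, err := strconv.ParseUint(strings.TrimSpace(raw), 0, 16)
	if err != nil {
		return 0, fmt.Errorf("parse vendor id %q: %w", raw, err)
	}
	return uint16(v), nil
}

// The checks compare numbers, never vendor-name strings.
func isNVIDIADevice(vendorID uint16) bool   { return vendorID == vendorNVIDIA }
func isMellanoxDevice(vendorID uint16) bool { return vendorID == vendorMellanox }

func main() {
	id, err := parseVendorID("0x10de\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(isNVIDIADevice(id), isMellanoxDevice(id), id == vendorAMD)
}
```

Keeping the IDs in one place means a renamed vendor string in lspci output cannot silently break detection.
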
2320925433 Skip PCIe link-speed warnings for disabled devices
Disabled PCIe devices (sysfs enable==0) carry no data traffic; their
link state has no operational impact. Switchtec PCIe switch management
endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers)
train at reduced speed intentionally and were producing spurious warnings.

Check is vendor-agnostic: reads enable attribute via existing helper,
no vendor/device ID hardcoding.

Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.23
2026-06-12 03:42:19 +03:00
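
The vendor-agnostic check described in commit 2320925433 can be sketched as below. This is an illustration only, assuming the standard sysfs layout /sys/bus/pci/devices/<BDF>/enable. The function names are hypothetical and do not refer to the project's existing helper.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseEnable interprets the contents of a sysfs "enable" attribute.
// A value of "0" means the device is disabled.
func parseEnable(raw string) bool {
	return strings.TrimSpace(raw) != "0"
}

// shouldWarnLinkSpeed reports whether a link-speed downgrade warning is
// worth emitting for the device at bdf. sysRoot is normally "/sys".
// Hypothetical helper: name and signature are illustrative.
func shouldWarnLinkSpeed(sysRoot, bdf string) bool {
	raw, err := os.ReadFile(filepath.Join(sysRoot, "bus", "pci", "devices", bdf, "enable"))
	if err != nil {
		// Attribute unreadable: keep the warning rather than hide a real fault.
		return true
	}
	return parseEnable(string(raw))
}

func main() {
	fmt.Println(shouldWarnLinkSpeed("/sys", "0000:00:00.0"))
}
```

No vendor or device ID appears anywhere in the decision, which matches the intent stated in the commit message.
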
e169a7722c Fix NVMe SMART status always Unknown; fix GPU count including NVSwitches
nvme-cli emits smart-log counters as JSON strings and uses field names
avail_spare / percent_used instead of the prose names in the NVMe spec.
The nvmeSmartLog struct had int64 fields with wrong JSON tags — Unmarshal
returned an error and the whole health block was skipped, leaving every
NVMe drive with status=Unknown.

Fix: switch all numeric fields to jsonInt64 (already used for lsblk
block sizes) which accepts both bare numbers and quoted strings, and
correct the avail_spare / percent_used tag names.

Also fix validateIsVendorGPU for NVIDIA: previously counted any NVIDIA
PCIe device (including NVSwitch bridges) as a GPU, producing wrong
estimates (12 instead of 8 on an HGX H100 system). Now requires
device_class to be videocontroller or processingaccelerator, matching
the existing AMD filter logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 18:06:32 +03:00
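
The dual-format decoding that commit e169a7722c relies on can be sketched as follows. The type name jsonInt64 and the tags avail_spare and percent_used come from the commit message. The method body, the smartLog struct, and parseSmartLog are assumptions written for illustration, not the project's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// jsonInt64 accepts both a bare JSON number (42) and a quoted one ("42").
type jsonInt64 int64

// UnmarshalJSON implements json.Unmarshaler.
func (v *jsonInt64) UnmarshalJSON(b []byte) error {
	s := string(b)
	if s == "null" {
		return nil
	}
	// Strip one pair of surrounding quotes if present.
	if len(s) >= 2 && s[0] == '"' && s[len(s)-1] == '"' {
		s = s[1 : len(s)-1]
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return fmt.Errorf("jsonInt64: parse %q: %w", s, err)
	}
	*v = jsonInt64(n)
	return nil
}

// smartLog is an illustrative subset of a smart-log document.
type smartLog struct {
	AvailSpare  jsonInt64 `json:"avail_spare"`
	PercentUsed jsonInt64 `json:"percent_used"`
}

// parseSmartLog decodes data and wraps any error with context.
func parseSmartLog(data []byte) (smartLog, error) {
	var l smartLog
	if err := json.Unmarshal(data, &l); err != nil {
		return smartLog{}, fmt.Errorf("decode smart-log: %w", err)
	}
	return l, nil
}

func main() {
	l, err := parseSmartLog([]byte(`{"avail_spare":"100","percent_used":3}`))
	fmt.Println(l.AvailSpare, l.PercentUsed, err)
}
```

With plain int64 fields the quoted "100" would make json.Unmarshal return an error, which is the failure mode the commit describes.
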
74a3c65f64 Move nvtop to GPU-specific package lists; clean up git-bible
nvtop pulled nvidia-tesla-470-* via Recommends into the nogpu build.
Move it from bee.list.chroot into bee-nvidia and bee-amd lists so it
only appears in GPU variants.

Also remove the stray git-bible/ directory (was not gitignored) and
move grub-bitmap-error docs into bible-local/docs/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.22
2026-06-01 19:36:27 +03:00
884988cb2a Fix audit hang on SAS HBAs: skip scsi host scan for SAS hosts
Writing to /sys/class/scsi_host/hostX/scan on SAS controllers (e.g.
Adaptec smartpqi/PM8222-SHBA) triggers sas_user_scan which blocks
indefinitely, causing the audit to hang forever. Skip hosts that appear
under /sys/class/sas_host/ — SAS topology is discovered by the driver.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.21
2026-06-01 18:50:20 +03:00
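
The skip rule from commit 884988cb2a can be sketched like this. The two sysfs directories are the ones named in the commit message. The function names and the predicate-based design are assumptions chosen so the filter can be exercised without real hardware.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// isSASHost reports whether host (for example "host3") has an entry under
// <sysRoot>/class/sas_host/. Hypothetical helper.
func isSASHost(sysRoot, host string) bool {
	_, err := os.Stat(filepath.Join(sysRoot, "class", "sas_host", host))
	return err == nil
}

// hostsToScan drops every host for which isSAS returns true, so the caller
// never writes to the scan file of a SAS controller.
func hostsToScan(hosts []string, isSAS func(string) bool) []string {
	var keep []string
	for _, h := range hosts {
		if isSAS(h) {
			continue
		}
		keep = append(keep, h)
	}
	return keep
}

func main() {
	hosts := []string{"host0", "host1"}
	keep := hostsToScan(hosts, func(h string) bool { return isSASHost("/sys", h) })
	fmt.Println(keep)
}
```
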
963bc960ca Fix SATA discovery, add NVLink bridge detection, add infiniband-diags
- storage: add jsonInt64 dual-format unmarshaler to handle lsblk output
  change in util-linux 2.38 (LOG-SEC/PHY-SEC now emitted as JSON
  integers, not quoted strings); fixes SATA disks invisible on Debian 12
- pcie: detect NVLink bridge mezzanine CX-7 cards (Mellanox x2, no host
  net ifaces, DeviceName contains "NVLINK" in lspci -v) and mark them
  with device_class="NVLinkBridge"; escalate PCIe link speed downgrade to
  Critical for these cards (Gen3 on a fixed internal connector = hardware
  fault, not a transient warning)
- pcie: cross-reference nvidia-smi topo to capture NVLink bond counts and
  active status for all NVLink bridge cards
- packages: add infiniband-diags to ISO; provides ibstat required by
  nvidia-fabricmanager-start.sh to enumerate IB devices before FM launch
  (absence causes CUDA_ERROR_SYSTEM_NOT_READY)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.20
2026-05-28 20:57:04 +03:00
4f6579e040 Fix Runtime Health criteria: network, services, nvidia-fabricmanager
Network: green if at least one interface has IPv4 (drop PARTIAL state).

Bee Services: treat inactive as OK — oneshot services (bee-sshsetup,
bee-preflight, bee-network, bee-audit, etc.) complete successfully and
exit to inactive; only failed is a real problem.

nvidia-fabricmanager: add ExecCondition=bee-check-nvswitch drop-in so
the service is silently skipped (inactive, not failed) on systems
without NVSwitch hardware (e.g. H200 NVL with direct NVLink, no
NVSwitch chips). bee-check-nvswitch detects NVSwitch via lspci
(vendor 10de, class 0680).

bee-nvidia.service: add ConditionPathExists=/usr/local/bin/bee-nvidia-load
so the unit is a no-op if somehow present in a non-nvidia build.

bee-boot-status: read /etc/bee-gpu-vendor and exclude bee-nvidia from
CRITICAL/ALL on non-nvidia builds, preventing boot hang if the unit
is unexpectedly present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.19
2026-05-14 05:20:25 +03:00
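
The two simplified criteria from commit 4f6579e040 can be written down as a short sketch. The function names and the map shape are hypothetical. Only the rules themselves (any state except failed is acceptable, and one IPv4 address is enough) come from the commit message.

```go
package main

import "fmt"

// serviceOK treats every systemd ActiveState except "failed" as healthy,
// so oneshot units that finished and went "inactive" do not raise alarms.
func serviceOK(activeState string) bool {
	return activeState != "failed"
}

// networkOK is true when at least one interface carries an IPv4 address.
// ipv4ByIface maps an interface name to its IPv4 addresses (assumed shape).
func networkOK(ipv4ByIface map[string][]string) bool {
	for _, addrs := range ipv4ByIface {
		if len(addrs) > 0 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(serviceOK("inactive"), serviceOK("failed"))
	fmt.Println(networkOK(map[string][]string{"eth0": nil, "eth1": {"192.0.2.10"}}))
}
```
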
dc07580adc Add AER decode, event counter, and sparkline to component detail modal
- decodeAERStatus: parses aer_status hex from kernel error strings and
  maps PCIe AER register bits to human-readable names with correctable/
  uncorrectable classification (e.g. "Receiver Error, Replay Timer Timeout (correctable)")
- renderSparkline: 100px inline SVG showing non-OK events over time,
  bars positioned proportionally to timestamp; evenly spaced when timestamps coincide
- renderComponentDetail: shows event count badge and sparkline in the
  component header row; decoded AER line appears below the raw error summary

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.18
2026-05-13 23:54:54 +03:00
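
The bit decoding behind decodeAERStatus in commit dc07580adc can be illustrated with a reduced sketch. It covers only the Correctable Error Status register, using the bit positions defined in the PCIe Base Specification. The function name decodeCorrectable and its output format are assumptions. The project's function also classifies uncorrectable errors, which this sketch omits.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// correctableBits lists the defined bits of the AER Correctable Error
// Status register in ascending order.
var correctableBits = []struct {
	bit  uint
	name string
}{
	{0, "Receiver Error"},
	{6, "Bad TLP"},
	{7, "Bad DLLP"},
	{8, "REPLAY_NUM Rollover"},
	{12, "Replay Timer Timeout"},
	{13, "Advisory Non-Fatal Error"},
	{14, "Corrected Internal Error"},
	{15, "Header Log Overflow"},
}

// decodeCorrectable turns a hex status such as "0x00001001" into a
// comma-separated list of set bit names. Hypothetical helper.
func decodeCorrectable(hexStatus string) (string, error) {
	s := strings.TrimPrefix(strings.TrimSpace(hexStatus), "0x")
	v, err := strconv.ParseUint(s, 16, 32)
	if err != nil {
		return "", fmt.Errorf("parse aer_status %q: %w", hexStatus, err)
	}
	var names []string
	for _, b := range correctableBits {
		if v&(1<<b.bit) != 0 {
			names = append(names, b.name)
		}
	}
	if len(names) == 0 {
		return "no known bits set (correctable)", nil
	}
	return strings.Join(names, ", ") + " (correctable)", nil
}

func main() {
	out, err := decodeCorrectable("0x00001001")
	fmt.Println(out, err)
}
```

The input 0x00001001 has bits 0 and 12 set, which yields the example string quoted in the commit message.
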
Mikhail Chusavitin
87e78e230e Fix ISO build: truncate volume ID to 32 chars (xorriso limit)
EASY_BEE_NVIDIA_LEGACY_V<date> is 33 characters; ISO 9660 volid is
limited to 32. Compute the maximum token length dynamically from the
prefix length and trim ISO_VERSION_LABEL_TOKEN with cut before
assembling BEE_ISO_VOLUME. All four variants now fit within the limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 18:28:54 +03:00
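
The length computation from commit 87e78e230e is done in shell with cut. The same arithmetic is sketched here in Go for illustration. The function name and the sample token are assumptions. Only the 32-character limit and the EASY_BEE_NVIDIA_LEGACY_V prefix come from the commit message.

```go
package main

import "fmt"

// maxVolIDLen is the volume ID limit named in the commit message.
const maxVolIDLen = 32

// buildVolumeID appends token to prefix, trimming the token so the result
// never exceeds maxVolIDLen bytes. Inputs are assumed to be ASCII.
func buildVolumeID(prefix, token string) string {
	room := maxVolIDLen - len(prefix)
	if room <= 0 {
		// The prefix alone fills or exceeds the limit: keep what fits.
		return prefix[:maxVolIDLen]
	}
	if len(token) > room {
		token = token[:room]
	}
	return prefix + token
}

func main() {
	// "202605121" is a made-up 9-character token that overflows by one.
	id := buildVolumeID("EASY_BEE_NVIDIA_LEGACY_V", "202605121")
	fmt.Println(id, len(id))
}
```

Deriving the token budget from the prefix length means a longer variant name shortens the token instead of breaking the build.
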
Mikhail Chusavitin
805a3b277d Track PCIe AER correctable errors; fix GPU status key routing
Add nvidia-aer-correctable and pcie-aer-correctable patterns to catch
"bus correctable error" events seen in SEL (Critical Interrupt / offset 7).
Both patterns carry severity "warning" — correctable errors are
hardware-recovered and should not flag a card as failed.

Fix kmsg_watcher routing: GPU-category events were keyed as pcie:<BDF>
but the UI queries for pcie:gpu: prefix. Split the switch so "gpu" →
pcie:gpu:<BDF> and "pcie" → pcie:<BDF>. This applies to both
flushWindow (SAT-window path) and flushImmediate (always-on path).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 12:50:14 +03:00
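
The key routing fixed in commit 805a3b277d amounts to a two-way switch on the event category. A minimal sketch follows. The function name statusKey and its boolean return are hypothetical. The key formats pcie:gpu:<BDF> and pcie:<BDF> are taken from the commit message.

```go
package main

import "fmt"

// statusKey builds the component status key for a kernel event.
// It returns false for categories this sketch does not route.
func statusKey(category, bdf string) (string, bool) {
	switch category {
	case "gpu":
		return "pcie:gpu:" + bdf, true
	case "pcie":
		return "pcie:" + bdf, true
	}
	return "", false
}

func main() {
	key, ok := statusKey("gpu", "0000:3b:00.0")
	fmt.Println(key, ok)
}
```
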
Mikhail Chusavitin
5bc9bd7fb3 Fix deploy.sh unbound variable on line 51
\\$1 in a double-quoted string expands as literal backslash + $1 (the
script's first positional arg). With set -u and no CLI args (IP entered
via read), this fails. \$1 correctly escapes the dollar sign, producing
a literal $1 for awk on the remote host.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 11:58:15 +03:00
Mikhail Chusavitin
0939a647ea Fix component detail modal: replace dead hx-* with fetch-based JS
HTMX was never loaded on the page, so hx-get on the component label
spans was dead code — the dialog opened empty. Replace with a plain
openComponentDetail() fetch call. Also fix dialog positioning broken
by the CSS reset (*{margin:0} overrode the UA margin:auto that centers
<dialog>). Replace card hx-trigger polling with a setInterval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 10:53:20 +03:00
Mikhail Chusavitin
7640f20714 Consolidate dist/ into cache/ and release/ subdirs
All intermediate build artifacts (binaries, live-build work dirs, overlay
stages, NVIDIA/NCCL/cuBLAS/john caches) now live under dist/cache/.
Final ISOs go to dist/release/ instead of scattered dist/easy-bee-v*/ and iso/out/.
dist/ is already gitignored, so the redundant iso/out/ entry is removed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.17
2026-05-06 12:28:47 +03:00
Mikhail Chusavitin
1593bf3e76 Add scripts/build.sh -- single entry point for ISO builds
Auto-detects build mode: remote VM if BUILDER_HOST is set in .env,
local Docker otherwise. Cache hardcoded to dist/container-cache (gitignored).
All flags forwarded to build-in-container.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 12:24:09 +03:00
Mikhail Chusavitin
ae80d7711e Add continuous hardware health monitoring and component detail view
- kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times,
  not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB
- New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source)
- Hardware Summary card auto-refreshes every 30s via htmx without page reload
- Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable; a click opens a modal
  with per-component status, source, timestamp, and the last 20 history entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.16
2026-05-06 09:56:39 +03:00
Mikhail Chusavitin
ca78b9df65 Add initramfs-level Drive Wipe tool (bee.wipe=all)
Installs a local-premount initramfs hook that intercepts bee.wipe=all before
squashfs is mounted. Shows a numbered disk selection TUI (pure POSIX sh), wipes
selected disks (nvme format / blkdiscard / dd fallback), syncs, and reboots.
Works even when squashfs fails to mount.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 09:23:05 +03:00
Mikhail Chusavitin
5cafe63f33 Add Drive Wipe boot menu entry and overlay wipe script
Adds a "WIPE ALL DISKS" entry to both GRUB and isolinux menus (bee.wipe=all).
Includes bee-wipe-disks for manual use from a running live system.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 09:22:59 +03:00
Mikhail Chusavitin
b75e65bcb1 Version-stamp squashfs filename and restrict live-boot media selection
Squashfs versioning:
- ISO now contains filesystem-v<VERSION>.squashfs instead of the generic
  filesystem.squashfs, making it immediately visible which build is
  running (visible in /run/live/medium/live/ at boot time).
- Full build path: rename filesystem.squashfs → filesystem-v*.squashfs
  after lb build, before lb binary_checksums/binary_iso.
- Fast path: find and unpack whatever filesystem*.squashfs exists, repack
  as the new versioned name, remove the old file, update the ISO.
- needs_full_build: accept any filesystem*.squashfs so version changes
  alone don't force a full rebuild.

Media selection hardening:
- Add live-media=/dev/disk/by-label/<LABEL> to the kernel boot line in
  addition to the existing live-media-label=<LABEL>. live-boot will now
  open exactly the labeled device rather than scanning all block devices,
  preventing accidental use of squashfs files from local disks or
  stale virtual media attached via IPMI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.15
2026-05-04 18:44:47 +03:00
Mikhail Chusavitin
8d173175eb Add chroot hook to strip all xattrs before squashfs creation
mksquashfs 4.5.1 (bookworm) writes a non-SQUASHFS_INVALID_BLK value for
xattr_id_table_start in the superblock even when -no-xattrs is passed, if
the source chroot contains POSIX ACL xattrs set by dpkg at install time.
Linux 6.1 squashfs driver then fails with "unable to read xattr id index
table" and refuses to mount the filesystem.

Strip all xattrs from the chroot via Python3 (already present) immediately
before mksquashfs runs. With an xattr-free source tree the resulting
squashfs is guaranteed to have SQUASHFS_INVALID_BLK in the xattr field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.14
2026-05-04 17:44:09 +03:00
Mikhail Chusavitin
5cbde0448e Update submodules (bible, internal/chart)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:41:45 +03:00
Mikhail Chusavitin
49a09fde05 Disable xattrs in all mksquashfs calls
--chroot-squashfs-compression-options does not exist in live-build
bookworm (1:20230502). The correct mechanism is the MKSQUASHFS_OPTIONS
environment variable read by binary_rootfs.

Export MKSQUASHFS_OPTIONS="-no-xattrs" before lb build so live-build's
binary_rootfs picks it up, and add -no-xattrs explicitly to every
direct mksquashfs call in build.sh (fast-path repack and the dormant
split-layers function). Remove the invalid lb config option.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.13
2026-05-04 15:29:15 +03:00
Mikhail Chusavitin
f3962422c8 Fix lb config option name for squashfs compression options
--chroot-squashfs-options is not a valid lb_config option; the correct
name is --chroot-squashfs-compression-options. Without this fix lb config
aborts immediately, so the -no-xattrs flag (which prevents the
"unable to read xattr id index table" boot failure) was never applied.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.12
2026-05-04 14:03:41 +03:00
Mikhail Chusavitin
ee36e3c711 Strip xattrs from squashfs to fix boot failure
Kernel squashfs driver fails with "unable to read xattr id index table"
when the squashfs contains POSIX ACL xattrs (system.posix_acl_*) written
by mksquashfs as unrecognised entries. This caused every built ISO to
drop to an initramfs shell at boot.

Add -no-xattrs to mksquashfs options so xattrs are stripped at build
time. xattrs are not needed in a live read-only rootfs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.11
2026-05-04 13:56:26 +03:00
Mikhail Chusavitin
cca3b21d35 Revert squashfs layer split — live-boot cannot mount partial rootfs
split_live_squashfs_layers moved /usr out of filesystem.squashfs into a
separate 10-usr.squashfs, leaving a rootfs skeleton that live-boot
(1:20230131+deb12u1) cannot mount: the initramfs panics with
"Can not mount /dev/loop0 ... filesystem.squashfs".

live-boot in bookworm expects a single self-contained filesystem.squashfs.
Revert to the standard single-squashfs layout and remove the dead
multi-squashfs guard in needs_full_build().

The split_live_squashfs_layers function is kept for future reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.10
2026-05-04 11:14:10 +03:00
Mikhail Chusavitin
75c33e073e Fix split_live_squashfs_layers crash under POSIX sh (dash)
trap RETURN is a bash extension not supported by /bin/sh on Debian.
With set -e active the unsupported trap call exited the build immediately
after lb build, before bootloader sync and ISO copy steps ran.

Remove both trap RETURN lines — explicit rm -rf at the end of the
function is sufficient for cleanup on the happy path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10.9
2026-05-04 09:52:31 +03:00
7b4bcc745a Split live rootfs into smaller squashfs layers
v10.8
2026-05-03 23:15:22 +03:00
42774d44a6 Restore post-build GRUB and isolinux sync
v10.7
2026-05-03 21:49:54 +03:00
5dc022ddf8 Drop post-build EFI bootloader patching
v10.6
2026-05-03 21:22:53 +03:00
6623e159f5 Grow EFI image before syncing GRUB theme assets
2026-05-03 21:18:37 +03:00
bbd6d009f8 Avoid EFI image overflow when syncing GRUB theme
2026-05-03 21:16:36 +03:00
6c2b188ec9 Add no-GUI boot mode and quieter boot diagnostics
2026-05-03 21:14:45 +03:00
14505ef24a Remove easy bee ASCII logo banners
v10.5
2026-05-03 21:07:13 +03:00
4f20c9246d Make UEFI boot safe and remove GRUB logo
2026-05-03 20:11:42 +03:00
eed157c2db Pin live boot medium to versioned ISO label
v10.3 v10.4
2026-05-03 15:52:07 +03:00
a2c8aea0df Fix GRUB theme ISO validation helper name
v10.1 v10.2
2026-05-03 14:45:16 +03:00
b21f03cd26 Reduce bee-selfheal log noise and slow timer
2026-05-03 14:38:12 +03:00
cac5b9c86e Detach install media after install-to-ram
v10.0
2026-05-03 14:16:45 +03:00
b5d04ef045 Synchronize project versioning across build artifacts
2026-05-03 14:16:32 +03:00
fcd64438ea Update submodule references
2026-05-03 14:12:00 +03:00
0e39e7d960 Make toram default and add install-to-ram CLI
2026-05-03 14:07:47 +03:00
Mikhail Chusavitin
58d6da0e4f Fix live task logs and SAT windows
2026-04-30 17:26:45 +03:00
Mikhail Chusavitin
7ce73e34a4 Add NVMe block format tool
v9.9
2026-04-30 16:27:25 +03:00
Mikhail Chusavitin
8a21809ade Update chart submodule to v2.0 (hardware contract 2.10)
New in chart:
- event_logs and platform_config sections in viewer
- Storage columns: logical_block_size_bytes, physical_block_size_bytes,
  metadata_bytes_per_block
- Compact status/severity icons, severity filtering for event logs
- Fixed JS MIME type and base stylesheet

bee audit schema already has all required fields; no schema changes needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.8
2026-04-30 15:52:30 +03:00
Mikhail Chusavitin
626763e31d Fix GRUB bitmap error: switch from PNG to TGA for splash logo
GRUB's PNG reader (grub2 bookworm) fails to load bee-logo.png despite the
file being valid RGB 8-bit non-interlaced PNG with minimal chunks. Root
cause is a known fragility in GRUB's png.c; exact trigger is unknown.

Switch to uncompressed 24-bit TGA which bypasses the PNG parser entirely.
tga.mod is already present in the ISO (x86_64-efi/tga.mod).

- Convert bee-logo.png → bee-logo.tga (480018 bytes, BGR top-left)
- config.cfg: insmod png → insmod tga
- theme.txt: bee-logo.png → bee-logo.tga
- Document all prior failed attempts in git-bible/grub-bitmap-error.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.7
2026-04-30 15:46:13 +03:00
Mikhail Chusavitin
0b8a2ff83f Add validate test matrix and GPU test methodology docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.6
2026-04-30 10:47:08 +03:00
Mikhail Chusavitin
2c22b01fe3 Fix IPMI hangs, add VROC license, fix blackbox service, drop qrencode
IPMI hang fix (Lenovo XCC SR650 V3):
- Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag
- Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found
- collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks
  are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list)
- collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts

VROC license detection:
- Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform
- Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot

Blackbox service fix:
- bee-blackbox.service was missing from systemctl enable list in ISO build hook
- Service never started on boot; state file never written; UI button stayed "Enable"

Drop qrencode:
- Remove from package list, standardTools API check, and runtime-flows doc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 10:46:59 +03:00
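
The early-exit rule of collectFRUEarlyExit in commit 2c22b01fe3 (stop once PSU blocks are followed by a non-PSU header) can be modelled as a small line-by-line state machine. Everything in this sketch is an assumption made for illustration: the psuWatcher type, the rule that a PSU block is one whose "FRU Device Description" header contains "PSU", and the sample lines. In real use the caller would feed lines from the ipmitool fru print output stream and terminate the process when Feed returns true.

```go
package main

import (
	"fmt"
	"strings"
)

// headerPrefix starts every block in `ipmitool fru print` output.
const headerPrefix = "FRU Device Description"

// psuWatcher collects PSU block lines and signals when reading can stop.
type psuWatcher struct {
	seenPSU bool     // at least one PSU block has been seen
	inPSU   bool     // the current block is a PSU block
	lines   []string // collected lines of all PSU blocks
}

// Feed consumes one output line and returns true once a non-PSU header
// appears after at least one PSU block.
func (w *psuWatcher) Feed(line string) bool {
	if strings.HasPrefix(strings.TrimSpace(line), headerPrefix) {
		w.inPSU = strings.Contains(strings.ToUpper(line), "PSU")
		if !w.inPSU && w.seenPSU {
			return true
		}
		if w.inPSU {
			w.seenPSU = true
		}
	}
	if w.inPSU {
		w.lines = append(w.lines, line)
	}
	return false
}

func main() {
	sample := []string{
		"FRU Device Description : Builtin FRU Device (ID 0)",
		" Product Name          : Example",
		"FRU Device Description : PSU1 (ID 1)",
		" Product Serial        : A123",
		"FRU Device Description : DIMM1 (ID 2)",
	}
	w := &psuWatcher{}
	for i, l := range sample {
		if w.Feed(l) {
			fmt.Println("stop at line", i, "collected", len(w.lines))
			break
		}
	}
}
```

Separating the decision from the I/O keeps the stop condition testable without a BMC.
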
Mikhail Chusavitin
ec89616585 Add storage block geometry to audit and viewer
2026-04-29 17:39:11 +03:00
Mikhail Chusavitin
c0dbbf96ad Add vendor RAID tools for livecd
v9.5
2026-04-29 17:31:25 +03:00