Compare commits

40 Commits

Author SHA1 Message Date
Mikhail Chusavitin
b2e177af31 Bump DCGM to 4.6.0-1 to fix broken repo dependency
NVIDIA removed datacenter-gpu-manager-4-core 1:4.5.3-1 from the
repository and published 1:4.6.0-1. The cuda13 and proprietary
packages still declared an exact-version dependency on 4.5.3-1 core,
making the old pin unresolvable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 14:21:24 +03:00
Mikhail Chusavitin
271dadda03 Restructure web UI navigation into 7 numbered workflow stages
Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark,
Tasks, Tools) with a numbered progression that guides engineers through
a logical acceptance workflow:

  Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed
  → 5. Endurance → 6. Tools → 7. Settings

Key changes:
- layout.go: numbered nav labels, new hrefs, Tasks removed from nav
  and replaced with a persistent sidebar badge (polls /api/tasks every
  5 s, highlights amber when tasks are active)
- server.go: 301 redirects from /validate→/check, /burn→/load,
  /benchmark→/speed for backward compatibility
- pages.go: dispatch cases for all new routes; old routes kept as
  fallbacks
- page_validate.go: add renderCheck() — non-destructive check page
  with validate-mode tests only (no stress toggle, no targeted-stress/
  targeted-power/pulse cards)
- page_burn.go: add renderLoad() wrapper; update scope alert to
  reference /check instead of /validate
- page_benchmark.go: add renderSpeed() (performance focus) and
  renderEndurance() (stability/overnight focus) wrappers
- page_settings.go: new Settings page with blackbox logging toggle,
  NVIDIA driver reset, and build info
- server_test.go: update five tests to use new route names and
  content expectations
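
The backward-compatible redirects described for server.go could be sketched as follows. This is a minimal illustration that assumes a standard net/http ServeMux; `legacyRoutes`, `newMux`, and `redirectTarget` are hypothetical names, not identifiers from the repository.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// legacyRoutes maps the old paths to the renamed ones listed in the commit message.
var legacyRoutes = map[string]string{
	"/validate":  "/check",
	"/burn":      "/load",
	"/benchmark": "/speed",
}

// newMux is a hypothetical constructor that registers a 301 redirect
// for every legacy path. The real server.go may wire this differently.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	for from, to := range legacyRoutes {
		mux.Handle(from, http.RedirectHandler(to, http.StatusMovedPermanently))
	}
	return mux
}

// redirectTarget sends one GET through the mux and reports the status
// code and the Location header of the response.
func redirectTarget(mux *http.ServeMux, path string) (int, string) {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Header().Get("Location")
}

func main() {
	code, loc := redirectTarget(newMux(), "/burn")
	fmt.Println(code, loc)
}
```

Keeping the mapping in one table makes it easy to assert in tests that every old route still resolves.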

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 11:00:02 +03:00
Mikhail Chusavitin
20766ccc76 Order nvidia-fabricmanager after bee-nvidia to fix boot race
bee-nvidia.service loads NVIDIA kernel modules; without After=bee-nvidia.service
fabricmanager starts before /dev/nvidiactl is ready, fails, and relies on
systemd restart to recover (~38s delay on affected systems).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 10:11:52 +03:00
Mikhail Chusavitin
966944d6d8 Fix audit hanging on smartpqi SAS HBA scan file write
smartpqi uses scsi_transport_sas but does not register a sas_host
object, so /sys/class/sas_host/host14 does not exist and the existing
SAS detection check passes right through. Writing to host14/scan then
calls sas_user_scan which blocks indefinitely on scsi_scan_target's
mutex (confirmed by kernel hung-task traces in the field).

Add a second detection path via /sys/class/scsi_host/hostX/proc_name:
skip hosts whose driver is "smartpqi" or "hpsa" (HPE Smart Array
predecessors that exhibit the same behaviour).
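
A rough sketch of the two detection paths, assuming the sysfs layout named above. The function names and the `sysRoot` parameter are hypothetical and exist only so the logic can be exercised outside a live system; the actual collector code may be organised differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// scanUnsafeDrivers lists the drivers the commit message says must be skipped.
var scanUnsafeDrivers = map[string]bool{"smartpqi": true, "hpsa": true}

// isScanUnsafeDriver reports whether the contents of a proc_name file
// name a driver whose scan file must not be written.
func isScanUnsafeDriver(procName string) bool {
	return scanUnsafeDrivers[strings.TrimSpace(procName)]
}

// shouldSkipHostScan applies both checks: a sas_host entry for the host,
// or a proc_name that matches a known unsafe driver. sysRoot is normally "/sys".
func shouldSkipHostScan(sysRoot, host string) bool {
	if _, err := os.Stat(filepath.Join(sysRoot, "class", "sas_host", host)); err == nil {
		return true
	}
	raw, err := os.ReadFile(filepath.Join(sysRoot, "class", "scsi_host", host, "proc_name"))
	if err != nil {
		// Assumption: an unreadable proc_name does not block the scan.
		return false
	}
	return isScanUnsafeDriver(string(raw))
}

func main() {
	fmt.Println(shouldSkipHostScan("/sys", "host0"))
}
```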

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 16:07:54 +03:00
ce6b1e0eb7 Update internal/chart submodule pointer to 8105c7e
Tracks origin/main after rebase: adds per-column header filters for
severity in the viewer (feat(viewer): replace severity dropdown).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 14:48:04 +03:00
4066e842a9 Update bible submodule to v0.2.0-13-g1977730
Picks up new contracts: hardware-ingest-json, submodule-integration,
go-database cursor safety, and several contract deduplication passes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 14:44:52 +03:00
7d2e904d14 Bring codebase into compliance with bible contracts (A–E)
A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema
and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to
HardwareSnapshot.

B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go;
replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice,
isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate)
with numeric vendor_id comparisons.

C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go,
app_network.go, app_services.go, app_packs.go, app_install.go — no logic changes.

D (go-code-style): wrap bare return err in interfaceAdminState and
interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including
the interface name.

E (go-project-bible): add bible-local/architecture/data-model.md and
bible-local/architecture/api-surface.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 14:32:08 +03:00
2320925433 Skip PCIe link-speed warnings for disabled devices
Disabled PCIe devices (sysfs enable==0) carry no data traffic; their
link state has no operational impact. Switchtec PCIe switch management
endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers)
train at reduced speed intentionally and were producing spurious warnings.

Check is vendor-agnostic: reads enable attribute via existing helper,
no vendor/device ID hardcoding.
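
The vendor-agnostic check might look roughly like this. It is a sketch under the assumption that the sysfs `enable` attribute holds a small integer and that zero means disabled; `isDisabledAttr` and `shouldWarnLinkSpeed` are illustrative names, not the existing helper mentioned above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// isDisabledAttr interprets the raw contents of a PCI device's sysfs
// "enable" attribute. Only a value that parses to 0 counts as disabled.
func isDisabledAttr(raw string) bool {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	return err == nil && n == 0
}

// shouldWarnLinkSpeed suppresses the downgrade warning for disabled devices.
func shouldWarnLinkSpeed(enableAttr string, downgraded bool) bool {
	return downgraded && !isDisabledAttr(enableAttr)
}

func main() {
	// The BDF below is only an example path.
	raw, err := os.ReadFile("/sys/bus/pci/devices/0000:00:00.0/enable")
	if err != nil {
		fmt.Println("cannot read enable attribute:", err)
		return
	}
	fmt.Println("warn:", shouldWarnLinkSpeed(string(raw), true))
}
```

An unparsable attribute is treated as "not disabled" here, so a read problem keeps the warning rather than hiding it.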

Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-12 03:42:19 +03:00
e169a7722c Fix NVMe SMART status always Unknown; fix GPU count including NVSwitches
nvme-cli emits smart-log counters as JSON strings and uses field names
avail_spare / percent_used instead of the prose names in the NVMe spec.
The nvmeSmartLog struct had int64 fields with wrong JSON tags — Unmarshal
returned an error and the whole health block was skipped, leaving every
NVMe drive with status=Unknown.

Fix: switch all numeric fields to jsonInt64 (already used for lsblk
block sizes) which accepts both bare numbers and quoted strings, and
correct the avail_spare / percent_used tag names.
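
For illustration, a dual-format integer type of this kind can be sketched as below. The type name and the two JSON tags come from the message above; the implementation, the `smartLog` struct, and `parseSmartLog` are assumptions and may differ from the repository code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// jsonInt64 accepts both a bare JSON number (3) and a quoted one ("3").
type jsonInt64 int64

func (v *jsonInt64) UnmarshalJSON(data []byte) error {
	text := strings.TrimSpace(string(data))
	if text == "null" {
		return nil
	}
	if strings.HasPrefix(text, `"`) {
		var quoted string
		if err := json.Unmarshal(data, &quoted); err != nil {
			return err
		}
		text = strings.TrimSpace(quoted)
	}
	n, err := strconv.ParseInt(text, 10, 64)
	if err != nil {
		return fmt.Errorf("jsonInt64: %q is not an integer: %w", text, err)
	}
	*v = jsonInt64(n)
	return nil
}

// smartLog is a hypothetical two-field subset of the smart-log output.
type smartLog struct {
	AvailSpare  jsonInt64 `json:"avail_spare"`
	PercentUsed jsonInt64 `json:"percent_used"`
}

func parseSmartLog(raw string) (smartLog, error) {
	var log smartLog
	err := json.Unmarshal([]byte(raw), &log)
	return log, err
}

func main() {
	log, err := parseSmartLog(`{"avail_spare":"100","percent_used":3}`)
	fmt.Println(log.AvailSpare, log.PercentUsed, err)
}
```

Fractional or exponent-form numbers are rejected by this sketch, which is acceptable for counters but would need widening for other fields.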

Also fix validateIsVendorGPU for NVIDIA: previously counted any NVIDIA
PCIe device (including NVSwitch bridges) as a GPU, producing wrong
estimates (12 instead of 8 on an HGX H100 system). Now requires
device_class to be videocontroller or processingaccelerator, matching
the existing AMD filter logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-04 18:06:32 +03:00
74a3c65f64 Move nvtop to GPU-specific package lists; clean up git-bible
nvtop pulled nvidia-tesla-470-* via Recommends into the nogpu build.
Move it from bee.list.chroot into bee-nvidia and bee-amd lists so it
only appears in GPU variants.

Also remove the stray git-bible/ directory (was not gitignored) and
move grub-bitmap-error docs into bible-local/docs/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 19:36:27 +03:00
884988cb2a Fix audit hang on SAS HBAs: skip scsi host scan for SAS hosts
Writing to /sys/class/scsi_host/hostX/scan on SAS controllers (e.g.
Adaptec smartpqi/PM8222-SHBA) triggers sas_user_scan which blocks
indefinitely, causing the audit to hang forever. Skip hosts that appear
under /sys/class/sas_host/ — SAS topology is discovered by the driver.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 18:50:20 +03:00
963bc960ca Fix SATA discovery, add NVLink bridge detection, add infiniband-diags
- storage: add jsonInt64 dual-format unmarshaler to handle lsblk output
  change in util-linux 2.38 (LOG-SEC/PHY-SEC now emitted as JSON
  integers, not quoted strings); fixes SATA disks invisible on Debian 12
- pcie: detect NVLink bridge mezzanine CX-7 cards (Mellanox x2, no host
  net ifaces, DeviceName contains "NVLINK" in lspci -v) and mark them
  with device_class="NVLinkBridge"; escalate PCIe link speed downgrade to
  Critical for these cards (Gen3 on a fixed internal connector = hardware
  fault, not a transient warning)
- pcie: cross-reference nvidia-smi topo to capture NVLink bond counts and
  active status for all NVLink bridge cards
- packages: add infiniband-diags to ISO; provides ibstat required by
  nvidia-fabricmanager-start.sh to enumerate IB devices before FM launch
  (absence causes CUDA_ERROR_SYSTEM_NOT_READY)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 20:57:04 +03:00
4f6579e040 Fix Runtime Health criteria: network, services, nvidia-fabricmanager
Network: green if at least one interface has IPv4 (drop PARTIAL state).

Bee Services: treat inactive as OK — oneshot services (bee-sshsetup,
bee-preflight, bee-network, bee-audit, etc.) complete successfully and
exit to inactive; only failed is a real problem.

nvidia-fabricmanager: add ExecCondition=bee-check-nvswitch drop-in so
the service is silently skipped (inactive, not failed) on systems
without NVSwitch hardware (e.g. H200 NVL with direct NVLink, no
NVSwitch chips). bee-check-nvswitch detects NVSwitch via lspci
(vendor 10de, class 0680).
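
The detection rule can be illustrated with a small parser over `lspci -n` output. This is only a Go sketch of the matching logic: bee-check-nvswitch itself is not shown in this diff and may well be a shell script, and `hasNVSwitch` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// hasNVSwitch scans `lspci -n` style lines ("BDF class: vendor:device ...")
// for a device with vendor 10de and class 0680.
func hasNVSwitch(lspciN string) bool {
	for _, line := range strings.Split(lspciN, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue
		}
		class := strings.TrimSuffix(fields[1], ":")
		vendor, _, ok := strings.Cut(fields[2], ":")
		if ok && strings.EqualFold(class, "0680") && strings.EqualFold(vendor, "10de") {
			return true
		}
	}
	return false
}

func main() {
	// Device IDs below are placeholders, not real product IDs.
	sample := "17:00.0 0302: 10de:aaaa (rev a1)\n" +
		"05:00.0 0680: 10de:bbbb (rev a1)\n"
	fmt.Println(hasNVSwitch(sample))
}
```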

bee-nvidia.service: add ConditionPathExists=/usr/local/bin/bee-nvidia-load
so the unit is a no-op if somehow present in a non-nvidia build.

bee-boot-status: read /etc/bee-gpu-vendor and exclude bee-nvidia from
CRITICAL/ALL on non-nvidia builds, preventing boot hang if the unit
is unexpectedly present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 05:20:25 +03:00
dc07580adc Add AER decode, event counter, and sparkline to component detail modal
- decodeAERStatus: parses aer_status hex from kernel error strings and
  maps PCIe AER register bits to human-readable names with correctable/
  uncorrectable classification (e.g. "Receiver Error, Replay Timer Timeout (correctable)")
- renderSparkline: 100px inline SVG showing non-OK events over time,
  bars positioned proportionally to timestamp; evenly spaced when timestamps coincide
- renderComponentDetail: shows event count badge and sparkline in the
  component header row; decoded AER line appears below the raw error summary
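
As an illustration of the decoding step, the sketch below maps the correctable-error status bits defined in the PCIe specification to names. It is not the repository's decodeAERStatus: the function name, the bit table layout, and the input handling are assumptions, and uncorrectable bits are left out for brevity.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// correctableBits lists bit positions of the AER Correctable Error Status register.
var correctableBits = []struct {
	bit  uint
	name string
}{
	{0, "Receiver Error"},
	{6, "Bad TLP"},
	{7, "Bad DLLP"},
	{8, "REPLAY_NUM Rollover"},
	{12, "Replay Timer Timeout"},
	{13, "Advisory Non-Fatal Error"},
	{14, "Corrected Internal Error"},
	{15, "Header Log Overflow"},
}

// decodeCorrectable turns a hex status such as "0x00001001" into a
// comma-separated list of set bits. It returns "" when no known bit is set.
func decodeCorrectable(hexStatus string) (string, error) {
	text := strings.TrimPrefix(strings.ToLower(strings.TrimSpace(hexStatus)), "0x")
	value, err := strconv.ParseUint(text, 16, 32)
	if err != nil {
		return "", err
	}
	var names []string
	for _, b := range correctableBits {
		if value&(1<<b.bit) != 0 {
			names = append(names, b.name)
		}
	}
	if len(names) == 0 {
		return "", nil
	}
	return strings.Join(names, ", ") + " (correctable)", nil
}

func main() {
	decoded, err := decodeCorrectable("0x00001001")
	fmt.Println(decoded, err)
}
```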

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-13 23:54:54 +03:00
Mikhail Chusavitin
87e78e230e Fix ISO build: truncate volume ID to 32 chars (xorriso limit)
EASY_BEE_NVIDIA_LEGACY_V<date> is 33 characters; ISO 9660 volid is
limited to 32. Compute the maximum token length dynamically from the
prefix length and trim ISO_VERSION_LABEL_TOKEN with cut before
assembling BEE_ISO_VOLUME. All four variants now fit within the limit.
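
The arithmetic is simple enough to show in a few lines. The build script does this in shell with cut; the Go sketch below only mirrors the calculation, and `buildVolumeID` plus the sample token are hypothetical.

```go
package main

import "fmt"

// isoVolIDMax is the volume ID length limit named in the commit message.
const isoVolIDMax = 32

// buildVolumeID trims token so that prefix+token never exceeds the limit.
// Lengths are counted in bytes, which is fine for ASCII volume IDs.
func buildVolumeID(prefix, token string) string {
	if len(prefix) >= isoVolIDMax {
		return prefix[:isoVolIDMax]
	}
	maxToken := isoVolIDMax - len(prefix)
	if len(token) > maxToken {
		token = token[:maxToken]
	}
	return prefix + token
}

func main() {
	// A 24-character prefix leaves room for an 8-character token.
	fmt.Println(buildVolumeID("EASY_BEE_NVIDIA_LEGACY_V", "123456789"))
}
```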

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 18:28:54 +03:00
Mikhail Chusavitin
805a3b277d Track PCIe AER correctable errors; fix GPU status key routing
Add nvidia-aer-correctable and pcie-aer-correctable patterns to catch
"bus correctable error" events seen in SEL (Critical Interrupt / offset 7).
Both patterns carry severity "warning" — correctable errors are
hardware-recovered and should not flag a card as failed.

Fix kmsg_watcher routing: GPU-category events were keyed as pcie:<BDF>
but the UI queries for pcie:gpu: prefix. Split the switch so "gpu" →
pcie:gpu:<BDF> and "pcie" → pcie:<BDF>. This applies to both
flushWindow (SAT-window path) and flushImmediate (always-on path).
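
The key split can be pictured as a small switch. This is a sketch of the described mapping only; `statusKey` is a hypothetical name and the real kmsg_watcher switch handles more categories.

```go
package main

import "fmt"

// statusKey builds the component status key for a kernel event.
// GPU events get the pcie:gpu: prefix the UI queries for; other PCIe
// events keep the plain pcie: prefix. Unknown categories return false.
func statusKey(category, bdf string) (string, bool) {
	switch category {
	case "gpu":
		return "pcie:gpu:" + bdf, true
	case "pcie":
		return "pcie:" + bdf, true
	default:
		return "", false
	}
}

func main() {
	key, ok := statusKey("gpu", "0000:17:00.0")
	fmt.Println(key, ok)
}
```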

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 12:50:14 +03:00
Mikhail Chusavitin
5bc9bd7fb3 Fix deploy.sh unbound variable on line 51
\\$1 in a double-quoted string expands as literal backslash + $1 (the
script's first positional arg). With set -u and no CLI args (IP entered
via read), this fails. \$1 correctly escapes the dollar sign, producing
a literal $1 for awk on the remote host.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 11:58:15 +03:00
Mikhail Chusavitin
0939a647ea Fix component detail modal: replace dead hx-* with fetch-based JS
HTMX was never loaded on the page, so hx-get on the component label
spans was dead code — the dialog opened empty. Replace with a plain
openComponentDetail() fetch call. Also fix dialog positioning broken
by the CSS reset (*{margin:0} overrode the UA margin:auto that centers
<dialog>). Replace card hx-trigger polling with a setInterval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-08 10:53:20 +03:00
Mikhail Chusavitin
7640f20714 Consolidate dist/ into cache/ and release/ subdirs
All intermediate build artifacts (binaries, live-build work dirs, overlay
stages, NVIDIA/NCCL/cuBLAS/john caches) now live under dist/cache/.
Final ISOs go to dist/release/ instead of scattered dist/easy-bee-v*/ and iso/out/.
dist/ is already gitignored, iso/out/ entry removed as redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 12:28:47 +03:00
Mikhail Chusavitin
1593bf3e76 Add scripts/build.sh -- single entry point for ISO builds
Auto-detects build mode: remote VM if BUILDER_HOST is set in .env,
local Docker otherwise. Cache hardcoded to dist/container-cache (gitignored).
All flags forwarded to build-in-container.sh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 12:24:09 +03:00
Mikhail Chusavitin
ae80d7711e Add continuous hardware health monitoring and component detail view
- kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times,
  not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB
- New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source)
- Hardware Summary card auto-refreshes every 30s via htmx without page reload
- Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal
  with per-component status, source, timestamp and last 20 history entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 09:56:39 +03:00
Mikhail Chusavitin
ca78b9df65 Add initramfs-level Drive Wipe tool (bee.wipe=all)
Installs a local-premount initramfs hook that intercepts bee.wipe=all before
squashfs is mounted. Shows a numbered disk selection TUI (pure POSIX sh), wipes
selected disks (nvme format / blkdiscard / dd fallback), syncs, and reboots.
Works even when squashfs fails to mount.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 09:23:05 +03:00
Mikhail Chusavitin
5cafe63f33 Add Drive Wipe boot menu entry and overlay wipe script
Adds a "WIPE ALL DISKS" entry to both GRUB and isolinux menus (bee.wipe=all).
Includes bee-wipe-disks for manual use from a running live system.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 09:22:59 +03:00
Mikhail Chusavitin
b75e65bcb1 Version-stamp squashfs filename and restrict live-boot media selection
Squashfs versioning:
- ISO now contains filesystem-v<VERSION>.squashfs instead of the generic
  filesystem.squashfs, making it immediately visible which build is
  running (visible in /run/live/medium/live/ at boot time).
- Full build path: rename filesystem.squashfs → filesystem-v*.squashfs
  after lb build, before lb binary_checksums/binary_iso.
- Fast path: find and unpack whatever filesystem*.squashfs exists, repack
  as the new versioned name, remove the old file, update the ISO.
- needs_full_build: accept any filesystem*.squashfs so version changes
  alone don't force a full rebuild.

Media selection hardening:
- Add live-media=/dev/disk/by-label/<LABEL> to the kernel boot line in
  addition to the existing live-media-label=<LABEL>. live-boot will now
  open exactly the labeled device rather than scanning all block devices,
  preventing accidental use of squashfs files from local disks or
  stale virtual media attached via IPMI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 18:44:47 +03:00
Mikhail Chusavitin
8d173175eb Add chroot hook to strip all xattrs before squashfs creation
mksquashfs 4.5.1 (bookworm) writes a non-SQUASHFS_INVALID_BLK value for
xattr_id_table_start in the superblock even when -no-xattrs is passed, if
the source chroot contains POSIX ACL xattrs set by dpkg at install time.
Linux 6.1 squashfs driver then fails with "unable to read xattr id index
table" and refuses to mount the filesystem.

Strip all xattrs from the chroot via Python3 (already present) immediately
before mksquashfs runs. With an xattr-free source tree the resulting
squashfs is guaranteed to have SQUASHFS_INVALID_BLK in the xattr field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 17:44:09 +03:00
Mikhail Chusavitin
5cbde0448e Update submodules (bible, internal/chart)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:41:45 +03:00
Mikhail Chusavitin
49a09fde05 Disable xattrs in all mksquashfs calls
--chroot-squashfs-compression-options does not exist in live-build
bookworm (1:20230502). The correct mechanism is the MKSQUASHFS_OPTIONS
environment variable read by binary_rootfs.

Export MKSQUASHFS_OPTIONS="-no-xattrs" before lb build so live-build's
binary_rootfs picks it up, and add -no-xattrs explicitly to every
direct mksquashfs call in build.sh (fast-path repack and the dormant
split-layers function). Remove the invalid lb config option.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:29:15 +03:00
Mikhail Chusavitin
f3962422c8 Fix lb config option name for squashfs compression options
--chroot-squashfs-options is not a valid lb_config option; the correct
name is --chroot-squashfs-compression-options. Without this fix lb config
aborts immediately, so the -no-xattrs flag (which prevents the
"unable to read xattr id index table" boot failure) was never applied.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:03:41 +03:00
Mikhail Chusavitin
ee36e3c711 Strip xattrs from squashfs to fix boot failure
Kernel squashfs driver fails with "unable to read xattr id index table"
when the squashfs contains POSIX ACL xattrs (system.posix_acl_*) written
by mksquashfs as unrecognised entries. This caused every built ISO to
drop to an initramfs shell at boot.

Add -no-xattrs to mksquashfs options so xattrs are stripped at build
time. xattrs are not needed in a live read-only rootfs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:56:26 +03:00
Mikhail Chusavitin
cca3b21d35 Revert squashfs layer split — live-boot cannot mount partial rootfs
split_live_squashfs_layers moved /usr out of filesystem.squashfs into a
separate 10-usr.squashfs, leaving a rootfs skeleton that live-boot
(1:20230131+deb12u1) cannot mount: the initramfs panics with
"Can not mount /dev/loop0 ... filesystem.squashfs".

live-boot in bookworm expects a single self-contained filesystem.squashfs.
Revert to the standard single-squashfs layout and remove the dead
multi-squashfs guard in needs_full_build().

The split_live_squashfs_layers function is kept for future reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 11:14:10 +03:00
Mikhail Chusavitin
75c33e073e Fix split_live_squashfs_layers crash under POSIX sh (dash)
trap RETURN is a bash extension not supported by /bin/sh on Debian.
With set -e active the unsupported trap call exited the build immediately
after lb build, before bootloader sync and ISO copy steps ran.

Remove both trap RETURN lines — explicit rm -rf at the end of the
function is sufficient for cleanup on the happy path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 09:52:31 +03:00
7b4bcc745a Split live rootfs into smaller squashfs layers
2026-05-03 23:15:22 +03:00
42774d44a6 Restore post-build GRUB and isolinux sync
2026-05-03 21:49:54 +03:00
5dc022ddf8 Drop post-build EFI bootloader patching
2026-05-03 21:22:53 +03:00
6623e159f5 Grow EFI image before syncing GRUB theme assets
2026-05-03 21:18:37 +03:00
bbd6d009f8 Avoid EFI image overflow when syncing GRUB theme
2026-05-03 21:16:36 +03:00
6c2b188ec9 Add no-GUI boot mode and quieter boot diagnostics
2026-05-03 21:14:45 +03:00
14505ef24a Remove easy bee ASCII logo banners
2026-05-03 21:07:13 +03:00
4f20c9246d Make UEFI boot safe and remove GRUB logo
2026-05-03 20:11:42 +03:00
eed157c2db Pin live boot medium to versioned ISO label
2026-05-03 15:52:07 +03:00
79 changed files with 4350 additions and 1381 deletions

.gitignore

@@ -1,6 +1,5 @@
.env
.DS_Store
dist/
-iso/out/
build-cache/
audit/bee

File diff suppressed because it is too large

@@ -0,0 +1,405 @@
package app

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"

	"bee/audit/internal/collector"
	"bee/audit/internal/platform"
	"bee/audit/internal/schema"
)

func hostnameOr(fallback string) string {
	hn, err := os.Hostname()
	if err != nil || strings.TrimSpace(hn) == "" {
		return fallback
	}
	return hn
}

func sanitizeFilename(v string) string {
	var out []rune
	for _, r := range v {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '-', r == '_', r == '.':
			out = append(out, r)
		default:
			out = append(out, '-')
		}
	}
	if len(out) == 0 {
		return "unknown"
	}
	return string(out)
}

func bodyOr(body, fallback string) string {
	body = strings.TrimSpace(body)
	if body == "" {
		return fallback
	}
	return body
}

func trimPtr(value *string) string {
	if value == nil {
		return ""
	}
	return strings.TrimSpace(*value)
}

func joinSortedKeys(values map[string]struct{}) string {
	if len(values) == 0 {
		return ""
	}
	keys := make([]string, 0, len(values))
	for key := range values {
		keys = append(keys, key)
	}
	sort.Strings(keys)
	return strings.Join(keys, "/")
}

func humanizeMB(totalMB int) string {
	if totalMB <= 0 {
		return ""
	}
	gb := float64(totalMB) / 1024.0
	if gb >= 1024.0 {
		tb := gb / 1024.0
		return fmt.Sprintf("%.1f TB", tb)
	}
	if gb == float64(int64(gb)) {
		return fmt.Sprintf("%.0f GB", gb)
	}
	return fmt.Sprintf("%.1f GB", gb)
}

func humanizeGB(totalGB int) string {
	if totalGB <= 0 {
		return ""
	}
	tb := float64(totalGB) / 1024.0
	if tb >= 1.0 {
		return fmt.Sprintf("%.1f TB", tb)
	}
	return fmt.Sprintf("%d GB", totalGB)
}

func parseKeyValueSummary(raw string) map[string]string {
	out := map[string]string{}
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		out[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return out
}

func firstNonEmpty(values ...string) string {
	for _, value := range values {
		value = strings.TrimSpace(value)
		if value != "" {
			return value
		}
	}
	return ""
}

func cleanSummaryKey(key string) string {
	idx := strings.Index(key, "-")
	if idx <= 0 {
		return key
	}
	prefix := key[:idx]
	for _, c := range prefix {
		if c < '0' || c > '9' {
			return key
		}
	}
	return key[idx+1:]
}

func isGPUDevice(dev schema.HardwarePCIeDevice) bool {
	// Exclude Aspeed BMC VGA adapters (not compute GPUs).
	if dev.VendorID != nil && *dev.VendorID == collector.AspeedVendorID {
		return false
	}
	class := trimPtr(dev.DeviceClass)
	// AMD Instinct / Radeon compute GPUs always carry ProcessingAccelerator or DisplayController.
	// Do NOT match AMD vendor alone — CPU chipset PCIe devices share that vendor ID.
	if class == "VideoController" || class == "DisplayController" || class == "ProcessingAccelerator" {
		return true
	}
	// NVIDIA devices sometimes expose class values outside the standard GPU set.
	return dev.VendorID != nil && *dev.VendorID == collector.NvidiaVendorID
}

func formatSystemLine(board schema.HardwareBoard) string {
	model := strings.TrimSpace(strings.Join([]string{
		trimPtr(board.Manufacturer),
		trimPtr(board.ProductName),
	}, " "))
	serial := strings.TrimSpace(board.SerialNumber)
	switch {
	case model != "" && serial != "":
		return fmt.Sprintf("System: %s | S/N %s", model, serial)
	case model != "":
		return "System: " + model
	case serial != "":
		return "System S/N: " + serial
	default:
		return ""
	}
}

func formatCPULine(cpus []schema.HardwareCPU) string {
	if len(cpus) == 0 {
		return ""
	}
	modelCounts := map[string]int{}
	unknown := 0
	for _, cpu := range cpus {
		model := trimPtr(cpu.Model)
		if model == "" {
			unknown++
			continue
		}
		modelCounts[model]++
	}
	if len(modelCounts) == 1 && unknown == 0 {
		for model, count := range modelCounts {
			return fmt.Sprintf("CPU: %d x %s", count, model)
		}
	}
	parts := make([]string, 0, len(modelCounts)+1)
	if len(modelCounts) > 0 {
		keys := make([]string, 0, len(modelCounts))
		for key := range modelCounts {
			keys = append(keys, key)
		}
		sort.Strings(keys)
		for _, key := range keys {
			parts = append(parts, fmt.Sprintf("%d x %s", modelCounts[key], key))
		}
	}
	if unknown > 0 {
		parts = append(parts, fmt.Sprintf("%d x unknown", unknown))
	}
	return "CPU: " + strings.Join(parts, ", ")
}

func formatMemoryLine(dimms []schema.HardwareMemory) string {
	totalMB := 0
	present := 0
	types := map[string]struct{}{}
	for _, dimm := range dimms {
		if dimm.Present != nil && !*dimm.Present {
			continue
		}
		if dimm.SizeMB == nil || *dimm.SizeMB <= 0 {
			continue
		}
		present++
		totalMB += *dimm.SizeMB
		if value := trimPtr(dimm.Type); value != "" {
			types[value] = struct{}{}
		}
	}
	if totalMB == 0 {
		return ""
	}
	typeText := joinSortedKeys(types)
	line := fmt.Sprintf("Memory: %s", humanizeMB(totalMB))
	if typeText != "" {
		line += " " + typeText
	}
	if present > 0 {
		line += fmt.Sprintf(" (%d DIMMs)", present)
	}
	return line
}

func formatStorageLine(disks []schema.HardwareStorage) string {
	count := 0
	totalGB := 0
	for _, disk := range disks {
		if disk.Present != nil && !*disk.Present {
			continue
		}
		count++
		if disk.SizeGB != nil && *disk.SizeGB > 0 {
			totalGB += *disk.SizeGB
		}
	}
	if count == 0 {
		return ""
	}
	line := fmt.Sprintf("Storage: %d drives", count)
	if totalGB > 0 {
		line += fmt.Sprintf(" / %s", humanizeGB(totalGB))
	}
	return line
}

func formatGPULine(devices []schema.HardwarePCIeDevice) string {
	gpus := map[string]int{}
	for _, dev := range devices {
		if !isGPUDevice(dev) {
			continue
		}
		name := firstNonEmpty(trimPtr(dev.Model), trimPtr(dev.Manufacturer), "unknown")
		gpus[name]++
	}
	if len(gpus) == 0 {
		return ""
	}
	keys := make([]string, 0, len(gpus))
	for key := range gpus {
		keys = append(keys, key)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, key := range keys {
		parts = append(parts, fmt.Sprintf("%d x %s", gpus[key], key))
	}
	return "GPU: " + strings.Join(parts, ", ")
}

func formatIPLine(list func() ([]platform.InterfaceInfo, error)) string {
	if list == nil {
		return ""
	}
	ifaces, err := list()
	if err != nil {
		return ""
	}
	seen := map[string]struct{}{}
	var ips []string
	for _, iface := range ifaces {
		for _, ip := range iface.IPv4 {
			ip = strings.TrimSpace(ip)
			if ip == "" {
				continue
			}
			if _, ok := seen[ip]; ok {
				continue
			}
			seen[ip] = struct{}{}
			ips = append(ips, ip)
		}
	}
	if len(ips) == 0 {
		return ""
	}
	sort.Strings(ips)
	return "IP: " + strings.Join(ips, ", ")
}

func formatSATDetail(raw string) string {
	var b strings.Builder
	kv := parseKeyValueSummary(raw)
	if t, ok := kv["run_at_utc"]; ok {
		fmt.Fprintf(&b, "Run: %s\n\n", t)
	}
	lines := strings.Split(raw, "\n")
	var stepKeys []string
	seenStep := map[string]bool{}
	for _, line := range lines {
		if idx := strings.Index(line, "_status="); idx >= 0 {
			key := line[:idx]
			if !seenStep[key] && key != "overall" {
				seenStep[key] = true
				stepKeys = append(stepKeys, key)
			}
		}
	}
	for _, key := range stepKeys {
		status := kv[key+"_status"]
		display := cleanSummaryKey(key)
		switch status {
		case "OK":
			fmt.Fprintf(&b, "PASS %s\n", display)
		case "FAILED":
			fmt.Fprintf(&b, "FAIL %s\n", display)
		case "UNSUPPORTED":
			fmt.Fprintf(&b, "SKIP %s\n", display)
		default:
			fmt.Fprintf(&b, "? %s\n", display)
		}
	}
	if overall, ok := kv["overall_status"]; ok {
		ok2 := kv["job_ok"]
		failed := kv["job_failed"]
		fmt.Fprintf(&b, "\nOverall: %s (ok=%s failed=%s)", overall, ok2, failed)
	}
	return strings.TrimSpace(b.String())
}

func formatSATSummary(label, raw string) string {
	values := parseKeyValueSummary(raw)
	var body strings.Builder
	fmt.Fprintf(&body, "%s:", label)
	if overall := firstNonEmpty(values["overall_status"], "UNKNOWN"); overall != "" {
		fmt.Fprintf(&body, " %s", overall)
	}
	if ok := firstNonEmpty(values["job_ok"], "0"); ok != "" {
		fmt.Fprintf(&body, " ok=%s", ok)
	}
	if failed := firstNonEmpty(values["job_failed"], "0"); failed != "" {
		fmt.Fprintf(&body, " failed=%s", failed)
	}
	if unsupported := firstNonEmpty(values["job_unsupported"], "0"); unsupported != "" && unsupported != "0" {
		fmt.Fprintf(&body, " unsupported=%s", unsupported)
	}
	if devices := strings.TrimSpace(values["devices"]); devices != "" {
		fmt.Fprintf(&body, "\nDevices: %s", devices)
	}
	return body.String()
}

func latestSATSummaries() []string {
	patterns := []struct {
		label  string
		prefix string
	}{
		{label: "NVIDIA SAT", prefix: "gpu-nvidia-"},
		{label: "NVIDIA Targeted Stress Validate (dcgmi diag targeted_stress)", prefix: "gpu-nvidia-targeted-stress-"},
		{label: "NVIDIA Max Compute Load (dcgmproftester)", prefix: "gpu-nvidia-compute-"},
		{label: "NVIDIA Targeted Power (dcgmi diag targeted_power)", prefix: "gpu-nvidia-targeted-power-"},
		{label: "NVIDIA Pulse Test (dcgmi diag pulse_test)", prefix: "gpu-nvidia-pulse-"},
		{label: "NVIDIA Interconnect Test (NCCL all_reduce_perf)", prefix: "gpu-nvidia-nccl-"},
		{label: "NVIDIA Bandwidth Test (NVBandwidth)", prefix: "gpu-nvidia-bandwidth-"},
		{label: "Memory SAT", prefix: "memory-"},
		{label: "Storage SAT", prefix: "storage-"},
		{label: "CPU SAT", prefix: "cpu-"},
	}
	var out []string
	for _, item := range patterns {
		matches, err := filepath.Glob(filepath.Join(DefaultSATBaseDir, item.prefix+"*/summary.txt"))
		if err != nil || len(matches) == 0 {
			continue
		}
		sort.Strings(matches)
		raw, err := os.ReadFile(matches[len(matches)-1])
		if err != nil {
			continue
		}
		out = append(out, formatSATSummary(item.label, string(raw)))
	}
	return out
}

@@ -0,0 +1,76 @@
package app

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	"bee/audit/internal/platform"
)

func (a *App) ListRemovableTargets() ([]platform.RemovableTarget, error) {
	return a.exports.ListRemovableTargets()
}

func (a *App) ExportLatestAudit(target platform.RemovableTarget) (string, error) {
	if _, err := os.Stat(DefaultAuditJSONPath); err != nil {
		return "", err
	}
	filename := fmt.Sprintf("audit-%s-%s.json", sanitizeFilename(hostnameOr("unknown")), time.Now().UTC().Format("20060102-150405"))
	tmpPath := filepath.Join(os.TempDir(), filename)
	data, err := readFileLimited(DefaultAuditJSONPath, 100<<20)
	if err != nil {
		return "", err
	}
	if normalized, normErr := ApplySATOverlay(data); normErr == nil {
		data = normalized
	}
	if err := os.WriteFile(tmpPath, data, 0644); err != nil {
		return "", err
	}
	defer os.Remove(tmpPath)
	return a.exports.ExportFileToTarget(tmpPath, target)
}

func (a *App) ExportLatestAuditResult(target platform.RemovableTarget) (ActionResult, error) {
	path, err := a.ExportLatestAudit(target)
	body := "Audit export failed."
	if err == nil {
		body = "Audit exported."
	}
	if err == nil && path != "" {
		body = "Audit exported to " + path
	}
	return ActionResult{Title: "Export audit", Body: body}, err
}

func (a *App) ExportSupportBundle(target platform.RemovableTarget) (string, error) {
	archive, err := BuildSupportBundle(DefaultExportDir)
	if err != nil {
		return "", err
	}
	defer os.Remove(archive)
	return a.exports.ExportFileToTarget(archive, target)
}

func (a *App) ExportSupportBundleResult(target platform.RemovableTarget) (ActionResult, error) {
	path, err := a.ExportSupportBundle(target)
	body := "Support bundle export failed."
	if err == nil {
		body = "Support bundle exported. USB target unmounted and safe to remove."
	}
	if err == nil && path != "" {
		body = "Support bundle exported to " + path + ".\n\nUSB target unmounted and safe to remove."
	}
	return ActionResult{Title: "Export support bundle", Body: body}, err
}

func (a *App) ListInstallDisks() ([]platform.InstallDisk, error) {
	return a.installer.ListInstallDisks()
}

func (a *App) InstallToDisk(ctx context.Context, device string, logFile string) error {
	return a.installer.InstallToDisk(ctx, device, logFile)
}

@@ -0,0 +1,106 @@
package app
import (
"fmt"
"strings"
"bee/audit/internal/platform"
)
func (a *App) ListInterfaces() ([]platform.InterfaceInfo, error) {
return a.network.ListInterfaces()
}
func (a *App) DefaultRoute() string {
return a.network.DefaultRoute()
}
func (a *App) DHCPOne(iface string) (string, error) {
return a.network.DHCPOne(iface)
}
func (a *App) DHCPOneResult(iface string) (ActionResult, error) {
body, err := a.network.DHCPOne(iface)
return ActionResult{Title: "DHCP: " + iface, Body: bodyOr(body, "DHCP completed.")}, err
}
func (a *App) DHCPAll() (string, error) {
return a.network.DHCPAll()
}
func (a *App) DHCPAllResult() (ActionResult, error) {
body, err := a.network.DHCPAll()
return ActionResult{Title: "DHCP: all interfaces", Body: bodyOr(body, "DHCP completed.")}, err
}
func (a *App) SetStaticIPv4(cfg platform.StaticIPv4Config) (string, error) {
return a.network.SetStaticIPv4(cfg)
}
func (a *App) SetInterfaceState(iface string, up bool) error {
return a.network.SetInterfaceState(iface, up)
}
func (a *App) GetInterfaceState(iface string) (bool, error) {
return a.network.GetInterfaceState(iface)
}
func (a *App) CaptureNetworkSnapshot() (platform.NetworkSnapshot, error) {
return a.network.CaptureNetworkSnapshot()
}
func (a *App) RestoreNetworkSnapshot(snapshot platform.NetworkSnapshot) error {
return a.network.RestoreNetworkSnapshot(snapshot)
}
func (a *App) SetStaticIPv4Result(cfg platform.StaticIPv4Config) (ActionResult, error) {
body, err := a.network.SetStaticIPv4(cfg)
return ActionResult{Title: "Static IPv4: " + cfg.Interface, Body: bodyOr(body, "Static IPv4 updated.")}, err
}
func (a *App) NetworkStatus() (ActionResult, error) {
ifaces, err := a.network.ListInterfaces()
if err != nil {
return ActionResult{Title: "Network status"}, err
}
if len(ifaces) == 0 {
return ActionResult{Title: "Network status", Body: "No physical interfaces found."}, nil
}
var body strings.Builder
for _, iface := range ifaces {
ipv4 := "(no IPv4)"
if len(iface.IPv4) > 0 {
ipv4 = strings.Join(iface.IPv4, ", ")
}
fmt.Fprintf(&body, "- %s: state=%s ip=%s\n", iface.Name, iface.State, ipv4)
}
if gw := a.network.DefaultRoute(); gw != "" {
fmt.Fprintf(&body, "\nDefault route: %s\n", gw)
}
return ActionResult{Title: "Network status", Body: strings.TrimSpace(body.String())}, nil
}
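// DefaultStaticIPv4FormFields returns pre-filled form values in the order
// ParseStaticIPv4Config reads them: address, prefix length, gateway, DNS servers.
// The iface argument is currently unused.
func (a *App) DefaultStaticIPv4FormFields(iface string) []string {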
return []string{
"",
"24",
strings.TrimSpace(a.network.DefaultRoute()),
"77.88.8.8 77.88.8.1 1.1.1.1 8.8.8.8",
}
}
func (a *App) ParseStaticIPv4Config(iface string, fields []string) platform.StaticIPv4Config {
get := func(index int) string {
if index >= 0 && index < len(fields) {
return strings.TrimSpace(fields[index])
}
return ""
}
return platform.StaticIPv4Config{
Interface: iface,
Address: get(0),
Prefix: get(1),
Gateway: get(2),
DNS: strings.Fields(get(3)),
}
}
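The positional mapping used by ParseStaticIPv4Config can be tried in isolation. The following is a minimal sketch, not the project's code: `StaticIPv4Config` here is a local stand-in for `platform.StaticIPv4Config` (field names follow the diff, field types are assumed), and `parseStaticIPv4Config` is a free-function copy of the method.

```go
package main

import (
	"fmt"
	"strings"
)

// StaticIPv4Config is a local stand-in for platform.StaticIPv4Config.
// Field names follow the diff; the types are an assumption.
type StaticIPv4Config struct {
	Interface string
	Address   string
	Prefix    string
	Gateway   string
	DNS       []string
}

// parseStaticIPv4Config maps form fields by position:
// fields[0]=address, [1]=prefix, [2]=gateway, [3]=space-separated DNS list.
// Missing positions yield empty values instead of an index panic.
func parseStaticIPv4Config(iface string, fields []string) StaticIPv4Config {
	get := func(i int) string {
		if i >= 0 && i < len(fields) {
			return strings.TrimSpace(fields[i])
		}
		return ""
	}
	return StaticIPv4Config{
		Interface: iface,
		Address:   get(0),
		Prefix:    get(1),
		Gateway:   get(2),
		DNS:       strings.Fields(get(3)),
	}
}

func main() {
	full := parseStaticIPv4Config("eth0", []string{" 192.0.2.10 ", "24", "192.0.2.1", "1.1.1.1  8.8.8.8"})
	fmt.Printf("%+v\n", full)

	// A short slice is tolerated: the remaining fields stay empty.
	short := parseStaticIPv4Config("eth0", []string{"192.0.2.10"})
	fmt.Printf("%+v\n", short)
}
```

Because `strings.Fields` splits on any run of whitespace, the DNS field accepts entries separated by one or more spaces.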

@@ -0,0 +1,370 @@
package app
import (
"context"
"fmt"
"os"
"path/filepath"
"strings"
"bee/audit/internal/platform"
)
func (a *App) RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaAcceptancePack(baseDir, logFunc)
}
func (a *App) RunNvidiaAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunNvidiaAcceptancePack(baseDir, nil)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
}
return ActionResult{Title: "NVIDIA SAT", Body: body}, err
}
func (a *App) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
return a.sat.ListNvidiaGPUs()
}
func (a *App) ListNvidiaGPUStatuses() ([]platform.NvidiaGPUStatus, error) {
return a.sat.ListNvidiaGPUStatuses()
}
func (a *App) ResetNvidiaGPU(index int) (ActionResult, error) {
out, err := a.sat.ResetNvidiaGPU(index)
return ActionResult{Title: fmt.Sprintf("Reset NVIDIA GPU %d", index), Body: strings.TrimSpace(out)}, err
}
func (a *App) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (ActionResult, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
path, err := a.sat.RunNvidiaAcceptancePackWithOptions(ctx, baseDir, diagLevel, gpuIndices, logFunc)
body := "Archive written."
if path != "" {
body = "Archive written to " + path
}
return ActionResult{Title: "NVIDIA DCGM", Body: body}, err
}
func (a *App) RunNvidiaTargetedStressValidatePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaTargetedStressValidatePack(ctx, baseDir, durationSec, gpuIndices, logFunc)
}
func (a *App) RunNvidiaStressPack(baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
return a.RunNvidiaStressPackCtx(context.Background(), baseDir, opts, logFunc)
}
func (a *App) RunNvidiaBenchmark(baseDir string, opts platform.NvidiaBenchmarkOptions, logFunc func(string)) (string, error) {
return a.RunNvidiaBenchmarkCtx(context.Background(), baseDir, opts, logFunc)
}
func (a *App) RunNvidiaBenchmarkCtx(ctx context.Context, baseDir string, opts platform.NvidiaBenchmarkOptions, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultBeeBenchPerfDir
}
resolved, err := a.ensureBenchmarkPowerAutotune(ctx, baseDir, opts, "performance", logFunc)
if err != nil {
return "", err
}
opts.ServerPowerSource = resolved.SelectedSource
return a.sat.RunNvidiaBenchmark(ctx, baseDir, opts, logFunc)
}
func (a *App) RunNvidiaPowerBenchCtx(ctx context.Context, baseDir string, opts platform.NvidiaBenchmarkOptions, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultBeeBenchPowerDir
}
resolved, err := a.ensureBenchmarkPowerAutotune(ctx, baseDir, opts, "power-fit", logFunc)
if err != nil {
return "", err
}
opts.ServerPowerSource = resolved.SelectedSource
return a.sat.RunNvidiaPowerBench(ctx, baseDir, opts, logFunc)
}
func (a *App) RunNvidiaPowerSourceAutotuneCtx(ctx context.Context, baseDir string, opts platform.NvidiaBenchmarkOptions, benchmarkKind string, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultBeeBenchAutotuneDir
}
return a.sat.RunNvidiaPowerSourceAutotune(ctx, baseDir, opts, benchmarkKind, logFunc)
}
func (a *App) LoadBenchmarkPowerAutotune() (*platform.BenchmarkPowerAutotuneConfig, error) {
return platform.LoadBenchmarkPowerAutotuneConfig(DefaultBeeBenchPowerSourceConfigPath)
}
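// ensureBenchmarkPowerAutotune returns the saved server power source config if
// it loads cleanly; otherwise it runs the power source autotune once and
// reloads the config from the same path.
func (a *App) ensureBenchmarkPowerAutotune(ctx context.Context, baseDir string, opts platform.NvidiaBenchmarkOptions, benchmarkKind string, logFunc func(string)) (platform.BenchmarkPowerAutotuneConfig, error) {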
cfgPath := platform.BenchmarkPowerSourceConfigPath(baseDir)
if cfg, err := platform.LoadBenchmarkPowerAutotuneConfig(cfgPath); err == nil {
if logFunc != nil {
logFunc(fmt.Sprintf("benchmark autotune: using saved server power source %s", cfg.SelectedSource))
}
return *cfg, nil
}
if logFunc != nil {
logFunc("benchmark autotune: no saved power source config, running autotune first")
}
autotuneDir := filepath.Join(filepath.Dir(baseDir), "autotune")
if _, err := a.RunNvidiaPowerSourceAutotuneCtx(ctx, autotuneDir, opts, benchmarkKind, logFunc); err != nil {
return platform.BenchmarkPowerAutotuneConfig{}, err
}
cfg, err := platform.LoadBenchmarkPowerAutotuneConfig(cfgPath)
if err != nil {
return platform.BenchmarkPowerAutotuneConfig{}, err
}
return *cfg, nil
}
func (a *App) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, staggerSec int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaOfficialComputePack(ctx, baseDir, durationSec, gpuIndices, staggerSec, logFunc)
}
func (a *App) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaTargetedPowerPack(ctx, baseDir, durationSec, gpuIndices, logFunc)
}
func (a *App) RunNvidiaPulseTestPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaPulseTestPack(ctx, baseDir, durationSec, gpuIndices, logFunc)
}
func (a *App) RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaBandwidthPack(ctx, baseDir, gpuIndices, logFunc)
}
func (a *App) RunNvidiaStressPackCtx(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaStressPack(ctx, baseDir, opts, logFunc)
}
func (a *App) RunMemoryAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
return a.RunMemoryAcceptancePackCtx(context.Background(), baseDir, 256, 1, logFunc)
}
func (a *App) RunMemoryAcceptancePackCtx(ctx context.Context, baseDir string, sizeMB, passes int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunMemoryAcceptancePack(ctx, baseDir, sizeMB, passes, logFunc)
}
func (a *App) RunMemoryAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunMemoryAcceptancePack(baseDir, nil)
return ActionResult{Title: "Memory SAT", Body: satResultBody(path)}, err
}
func (a *App) RunCPUAcceptancePack(baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunCPUAcceptancePackCtx(context.Background(), baseDir, durationSec, logFunc)
}
func (a *App) RunCPUAcceptancePackCtx(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunCPUAcceptancePack(ctx, baseDir, durationSec, logFunc)
}
func (a *App) RunCPUAcceptancePackResult(baseDir string, durationSec int) (ActionResult, error) {
path, err := a.RunCPUAcceptancePack(baseDir, durationSec, nil)
return ActionResult{Title: "CPU SAT", Body: satResultBody(path)}, err
}
func (a *App) RunStorageAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
return a.RunStorageAcceptancePackCtx(context.Background(), baseDir, false, logFunc)
}
func (a *App) RunStorageAcceptancePackCtx(ctx context.Context, baseDir string, extended bool, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunStorageAcceptancePack(ctx, baseDir, extended, logFunc)
}
func (a *App) RunStorageAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunStorageAcceptancePack(baseDir, nil)
return ActionResult{Title: "Storage SAT", Body: satResultBody(path)}, err
}
func (a *App) DetectGPUVendor() string {
return a.sat.DetectGPUVendor()
}
func (a *App) ListAMDGPUs() ([]platform.AMDGPUInfo, error) {
return a.sat.ListAMDGPUs()
}
func (a *App) RunAMDAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
return a.RunAMDAcceptancePackCtx(context.Background(), baseDir, logFunc)
}
func (a *App) RunAMDAcceptancePackCtx(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunAMDAcceptancePack(ctx, baseDir, logFunc)
}
func (a *App) RunAMDAcceptancePackResult(baseDir string) (ActionResult, error) {
path, err := a.RunAMDAcceptancePack(baseDir, nil)
return ActionResult{Title: "AMD GPU SAT", Body: satResultBody(path)}, err
}
func (a *App) RunAMDMemIntegrityPackCtx(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunAMDMemIntegrityPack(ctx, baseDir, logFunc)
}
func (a *App) RunAMDMemBandwidthPackCtx(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunAMDMemBandwidthPack(ctx, baseDir, logFunc)
}
func (a *App) RunMemoryStressPack(baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunMemoryStressPackCtx(context.Background(), baseDir, durationSec, logFunc)
}
func (a *App) RunSATStressPack(baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunSATStressPackCtx(context.Background(), baseDir, durationSec, logFunc)
}
func (a *App) RunAMDStressPack(baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunAMDStressPackCtx(context.Background(), baseDir, durationSec, logFunc)
}
func (a *App) RunMemoryStressPackCtx(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.sat.RunMemoryStressPack(ctx, baseDir, durationSec, logFunc)
}
func (a *App) RunSATStressPackCtx(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.sat.RunSATStressPack(ctx, baseDir, durationSec, logFunc)
}
func (a *App) RunAMDStressPackCtx(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunAMDStressPack(ctx, baseDir, durationSec, logFunc)
}
func (a *App) RunFanStressTest(ctx context.Context, baseDir string, opts platform.FanStressOptions) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunFanStressTest(ctx, baseDir, opts)
}
func (a *App) RunPlatformStress(ctx context.Context, baseDir string, opts platform.PlatformStressOptions, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunPlatformStress(ctx, baseDir, opts, logFunc)
}
func (a *App) RunNCCLTests(ctx context.Context, baseDir string, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNCCLTests(ctx, baseDir, gpuIndices, logFunc)
}
func (a *App) RunNCCLTestsResult(ctx context.Context) (ActionResult, error) {
path, err := a.RunNCCLTests(ctx, DefaultSATBaseDir, nil, nil)
body := "Results: " + path
if err != nil && err != context.Canceled {
body += "\nERROR: " + err.Error()
}
return ActionResult{Title: "NCCL bandwidth test", Body: body}, err
}
func (a *App) RunFanStressTestResult(ctx context.Context, opts platform.FanStressOptions) (ActionResult, error) {
path, err := a.RunFanStressTest(ctx, "", opts)
body := formatFanStressResult(path)
if err != nil && err != context.Canceled {
body += "\nERROR: " + err.Error()
}
return ActionResult{Title: "GPU Platform Stress Test", Body: body}, err
}
// formatFanStressResult formats the summary.txt from a fan-stress run, including
// the per-step pass/fail display and the analysis section (throttling, max temps, fan response).
func formatFanStressResult(archivePath string) string {
if archivePath == "" {
return "No output produced."
}
runDir := strings.TrimSuffix(archivePath, ".tar.gz")
raw, err := os.ReadFile(filepath.Join(runDir, "summary.txt"))
if err != nil {
return "Archive written to " + archivePath
}
content := strings.TrimSpace(string(raw))
kv := parseKeyValueSummary(content)
var b strings.Builder
b.WriteString(formatSATDetail(content))
// Append analysis section.
var analysis []string
if v, ok := kv["throttling_detected"]; ok {
label := "NO"
if v == "true" {
label = "YES ← throttling detected during load"
}
analysis = append(analysis, "Throttling: "+label)
}
if v, ok := kv["max_gpu_temp_c"]; ok && v != "0.0" {
analysis = append(analysis, "Max GPU temp: "+v+"°C")
}
if v, ok := kv["max_cpu_temp_c"]; ok && v != "0.0" {
analysis = append(analysis, "Max CPU temp: "+v+"°C")
}
if v, ok := kv["fan_response_sec"]; ok && v != "N/A" && v != "-1.0" {
analysis = append(analysis, "Fan response: "+v+"s")
}
if len(analysis) > 0 {
b.WriteString("\n\n=== Analysis ===\n")
for _, line := range analysis {
b.WriteString(line + "\n")
}
}
return strings.TrimSpace(b.String())
}
// satResultBody reads summary.txt from the SAT run directory (archive path without .tar.gz)
// and returns a formatted human-readable result. Falls back to a plain message if unreadable.
func satResultBody(archivePath string) string {
if archivePath == "" {
return "No output produced."
}
runDir := strings.TrimSuffix(archivePath, ".tar.gz")
raw, err := os.ReadFile(filepath.Join(runDir, "summary.txt"))
if err != nil {
return "Archive written to " + archivePath
}
return formatSATDetail(strings.TrimSpace(string(raw)))
}
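The analysis section built by formatFanStressResult depends on a key/value view of summary.txt. The sketch below illustrates that filtering under stated assumptions: `parseKV` is a hypothetical stand-in for `parseKeyValueSummary` (its body is not part of this diff) and assumes one `key=value` pair per line; the throttling label is shortened to plain YES/NO.

```go
package main

import (
	"fmt"
	"strings"
)

// parseKV is a hypothetical stand-in for parseKeyValueSummary.
// Assumption: summary.txt holds one key=value pair per line.
func parseKV(content string) map[string]string {
	kv := make(map[string]string)
	for _, line := range strings.Split(content, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), "=")
		if !ok {
			continue // not a key=value line
		}
		kv[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return kv
}

// analysisLines applies the same sentinel filtering as the diff:
// "0.0" temperatures and "N/A" / "-1.0" fan responses are treated as
// "not measured" and omitted.
func analysisLines(kv map[string]string) []string {
	var out []string
	if v, ok := kv["throttling_detected"]; ok {
		label := "NO"
		if v == "true" {
			label = "YES"
		}
		out = append(out, "Throttling: "+label)
	}
	if v, ok := kv["max_gpu_temp_c"]; ok && v != "0.0" {
		out = append(out, "Max GPU temp: "+v+"°C")
	}
	if v, ok := kv["max_cpu_temp_c"]; ok && v != "0.0" {
		out = append(out, "Max CPU temp: "+v+"°C")
	}
	if v, ok := kv["fan_response_sec"]; ok && v != "N/A" && v != "-1.0" {
		out = append(out, "Fan response: "+v+"s")
	}
	return out
}

func main() {
	summary := "throttling_detected=true\nmax_gpu_temp_c=0.0\nfan_response_sec=12.5"
	for _, line := range analysisLines(parseKV(summary)) {
		fmt.Println(line)
	}
}
```

In this sample the GPU temperature is dropped because it equals the `0.0` sentinel, so only the throttling and fan response lines are printed.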

@@ -0,0 +1,67 @@
package app
import (
"fmt"
"strings"
"bee/audit/internal/platform"
)
func (a *App) ListBeeServices() ([]string, error) {
return a.services.ListBeeServices()
}
func (a *App) ServiceState(name string) string {
return a.services.ServiceState(name)
}
func (a *App) ServiceStatus(name string) (string, error) {
return a.services.ServiceStatus(name)
}
func (a *App) ServiceStatusResult(name string) (ActionResult, error) {
body, err := a.services.ServiceStatus(name)
return ActionResult{Title: "service status: " + name, Body: bodyOr(body, "No status output.")}, err
}
func (a *App) ServiceDo(name string, action platform.ServiceAction) (string, error) {
return a.services.ServiceDo(name, action)
}
func (a *App) ServiceActionResult(name string, action platform.ServiceAction) (ActionResult, error) {
body, err := a.services.ServiceDo(name, action)
return ActionResult{Title: "service " + string(action) + ": " + name, Body: bodyOr(body, "Action completed.")}, err
}
func (a *App) TailFile(path string, lines int) string {
return a.tools.TailFile(path, lines)
}
func (a *App) CheckTools(names []string) []platform.ToolStatus {
return a.tools.CheckTools(names)
}
func (a *App) ToolCheckResult(names []string) ActionResult {
if len(names) == 0 {
return ActionResult{Title: "Required tools", Body: "No tools checked."}
}
var body strings.Builder
for _, tool := range a.tools.CheckTools(names) {
status := "MISSING"
if tool.OK {
status = "OK (" + tool.Path + ")"
}
fmt.Fprintf(&body, "- %s: %s\n", tool.Name, status)
}
return ActionResult{Title: "Required tools", Body: strings.TrimSpace(body.String())}
}
func (a *App) AuditLogTailResult() ActionResult {
logTail := strings.TrimSpace(a.tools.TailFile(DefaultAuditLogPath, 40))
jsonTail := strings.TrimSpace(a.tools.TailFile(DefaultAuditJSONPath, 20))
body := strings.TrimSpace(logTail + "\n\n" + jsonTail)
if body == "" {
body = "No audit logs found."
}
return ActionResult{Title: "Audit log tail", Body: body}
}

@@ -3,10 +3,11 @@ package app
import (
"os"
"path/filepath"
"strconv"
"sort"
"strconv"
"strings"
"bee/audit/internal/collector"
"bee/audit/internal/schema"
)
@@ -313,17 +314,20 @@ func statusSeverity(status string) int {
}
func matchesGPUVendor(dev schema.HardwarePCIeDevice, vendor string) bool {
if dev.DeviceClass == nil || !strings.Contains(strings.TrimSpace(*dev.DeviceClass), "Controller") && !strings.Contains(strings.TrimSpace(*dev.DeviceClass), "Accelerator") {
if dev.DeviceClass == nil || !strings.Contains(strings.TrimSpace(*dev.DeviceClass), "Display") && !strings.Contains(strings.TrimSpace(*dev.DeviceClass), "Video") {
return false
}
if dev.DeviceClass == nil {
return false
}
class := strings.TrimSpace(*dev.DeviceClass)
isGPUClass := strings.Contains(class, "Controller") || strings.Contains(class, "Accelerator") ||
strings.Contains(class, "Display") || strings.Contains(class, "Video")
if !isGPUClass {
return false
}
manufacturer := strings.ToLower(strings.TrimSpace(ptrString(dev.Manufacturer)))
switch vendor {
case "amd":
return strings.Contains(manufacturer, "advanced micro devices") || strings.Contains(manufacturer, "amd/ati")
return dev.VendorID != nil && *dev.VendorID == collector.AMDVendorID
case "nvidia":
return strings.Contains(manufacturer, "nvidia")
return dev.VendorID != nil && *dev.VendorID == collector.NvidiaVendorID
default:
return false
}

@@ -5,6 +5,7 @@ import (
"path/filepath"
"testing"
"bee/audit/internal/collector"
"bee/audit/internal/schema"
)
@@ -46,10 +47,12 @@ func TestApplyLatestSATStatusesMarksAMDGPUs(t *testing.T) {
class := "DisplayController"
manufacturer := "Advanced Micro Devices, Inc. [AMD/ATI]"
amdVendorID := collector.AMDVendorID
snap := schema.HardwareSnapshot{
PCIeDevices: []schema.HardwarePCIeDevice{{
DeviceClass: &class,
Manufacturer: &manufacturer,
VendorID: &amdVendorID,
}},
}

@@ -24,6 +24,8 @@ var supportBundleServices = []string{
"bee-selfheal.service",
"bee-selfheal.timer",
"bee-sshsetup.service",
"display-manager.service",
"lightdm.service",
"nvidia-dcgm.service",
"nvidia-fabricmanager.service",
}
@@ -44,12 +46,128 @@ var supportBundleCommands = []struct {
{name: "system/mount.txt", cmd: []string{"mount"}},
{name: "system/df-h.txt", cmd: []string{"df", "-h"}},
{name: "system/dmesg.txt", cmd: []string{"dmesg"}},
{name: "system/dmesg-gui-video-input.txt", cmd: []string{"sh", "-c", `
if command -v dmesg >/dev/null 2>&1; then
dmesg | grep -iE 'nvidia|drm|fb|framebuffer|vesa|efi|lightdm|Xorg|input|hid|usb|keyboard|mouse|virtual keyboard|virtual mouse|ami|aspeed|ast' || echo "no GUI/video/input kernel messages found"
else
echo "dmesg not found"
fi
`}},
{name: "system/kernel-aer-nvidia.txt", cmd: []string{"sh", "-c", `
if command -v dmesg >/dev/null 2>&1; then
dmesg | grep -iE 'AER|NVRM|Xid|pcieport|nvidia' || echo "no AER/NVRM/Xid kernel messages found"
else
echo "dmesg not found"
fi
`}},
{name: "system/loginctl-sessions.txt", cmd: []string{"sh", "-c", `
if command -v loginctl >/dev/null 2>&1; then
loginctl list-sessions 2>&1 || true
else
echo "loginctl not found"
fi
`}},
{name: "system/loginctl-seats.txt", cmd: []string{"sh", "-c", `
if command -v loginctl >/dev/null 2>&1; then
loginctl list-seats 2>&1 || true
echo
for seat in $(loginctl list-seats --no-legend 2>/dev/null | awk '{print $1}'); do
echo "=== $seat ==="
loginctl seat-status "$seat" 2>&1 || true
echo
done
else
echo "loginctl not found"
fi
`}},
{name: "system/ps-gui.txt", cmd: []string{"sh", "-c", `
ps -ef | grep -iE 'lightdm|Xorg|X$|openbox|chromium|chrome|xinit|xsession' | grep -v grep || echo "no GUI processes found"
`}},
{name: "system/lspci-video-vv.txt", cmd: []string{"sh", "-c", `
if ! command -v lspci >/dev/null 2>&1; then
echo "lspci not found"
exit 0
fi
found=0
for dev in $(lspci -Dn | awk '$2 ~ /^03(00|02):$/ {print $1}'); do
found=1
echo "=== $dev ==="
lspci -s "$dev" -vv 2>&1 || true
echo
done
if [ "$found" -eq 0 ]; then
echo "no display-class PCI devices found"
fi
`}},
{name: "system/proc-fb.txt", cmd: []string{"cat", "/proc/fb"}},
{name: "system/drm-cards.txt", cmd: []string{"sh", "-c", `
if [ -d /sys/class/drm ]; then
for path in /sys/class/drm/card*; do
[ -e "$path" ] || continue
card=$(basename "$path")
echo "=== $card ==="
for f in status enabled dpms modes; do
[ -r "$path/$f" ] && printf " %-8s %s\n" "$f" "$(cat "$path/$f" 2>/dev/null)"
done
device=$(readlink -f "$path/device" 2>/dev/null || true)
[ -n "$device" ] && echo " device ${device##*/}"
echo
done
else
echo "/sys/class/drm not present"
fi
`}},
{name: "system/input-devices.txt", cmd: []string{"sh", "-c", `
if [ -r /proc/bus/input/devices ]; then
cat /proc/bus/input/devices
else
echo "/proc/bus/input/devices not readable"
fi
`}},
{name: "system/udevadm-input.txt", cmd: []string{"sh", "-c", `
if ! command -v udevadm >/dev/null 2>&1; then
echo "udevadm not found"
exit 0
fi
found=0
for dev in /dev/input/event*; do
[ -e "$dev" ] || continue
found=1
echo "=== $dev ==="
udevadm info --query=all --name="$dev" 2>&1 || true
echo
done
if [ "$found" -eq 0 ]; then
echo "no /dev/input/event* devices found"
fi
`}},
{name: "system/xinput-list.txt", cmd: []string{"sh", "-c", `
if command -v xinput >/dev/null 2>&1; then
DISPLAY=:0 xinput --list 2>&1 || true
else
echo "xinput not found"
fi
`}},
{name: "system/libinput-list-devices.txt", cmd: []string{"sh", "-c", `
if command -v libinput >/dev/null 2>&1; then
libinput list-devices 2>&1 || true
else
echo "libinput not found"
fi
`}},
{name: "system/systemctl-gui-units.txt", cmd: []string{"sh", "-c", `
if ! command -v systemctl >/dev/null 2>&1; then
echo "systemctl not found"
exit 0
fi
echo "=== unit files ==="
systemctl list-unit-files --no-pager --all 'lightdm*' 'display-manager*' 2>&1 || true
echo
echo "=== active units ==="
systemctl list-units --no-pager --all 'lightdm*' 'display-manager*' 2>&1 || true
echo
echo "=== failed units ==="
systemctl --failed --no-pager 2>&1 | grep -iE 'lightdm|display-manager|Xorg' || echo "no failed GUI units"
`}},
{name: "system/nvidia-smi-q.txt", cmd: []string{"nvidia-smi", "-q"}},
{name: "system/nvidia-smi-topo.txt", cmd: []string{"sh", "-c", `
@@ -236,6 +354,13 @@ var supportBundleOptionalFiles = []struct {
}{
{name: "system/kern.log", src: "/var/log/kern.log"},
{name: "system/syslog.txt", src: "/var/log/syslog"},
{name: "system/Xorg.0.log", src: "/var/log/Xorg.0.log"},
{name: "system/Xorg.0.log.old", src: "/var/log/Xorg.0.log.old"},
{name: "system/lightdm/lightdm.log", src: "/var/log/lightdm/lightdm.log"},
{name: "system/lightdm/x-0.log", src: "/var/log/lightdm/x-0.log"},
{name: "system/lightdm/x-0-greeter.log", src: "/var/log/lightdm/x-0-greeter.log"},
{name: "system/home-bee-xsession-errors.log", src: "/home/bee/.xsession-errors"},
{name: "system/home-bee-chromium-debug.log", src: "/tmp/bee-chrome/chrome_debug.log"},
{name: "system/fabricmanager.log", src: "/var/log/fabricmanager.log"},
{name: "system/nvlsm.log", src: "/var/log/nvlsm.log"},
{name: "system/fabricmanager/fabricmanager.log", src: "/var/log/fabricmanager/fabricmanager.log"},

@@ -84,11 +84,10 @@ func hasAMDGPUDevices(devs []schema.HardwarePCIeDevice) bool {
}
func isAMDGPUDevice(dev schema.HardwarePCIeDevice) bool {
if dev.Manufacturer == nil || dev.DeviceClass == nil {
if dev.DeviceClass == nil {
return false
}
manufacturer := strings.ToLower(strings.TrimSpace(*dev.Manufacturer))
return strings.Contains(manufacturer, "advanced micro devices") && isGPUClass(strings.TrimSpace(*dev.DeviceClass))
return dev.VendorID != nil && *dev.VendorID == AMDVendorID && isGPUClass(strings.TrimSpace(*dev.DeviceClass))
}
func queryAMDGPUs() (map[string]amdGPUInfo, error) {

@@ -40,6 +40,7 @@ func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
snap.PCIeDevices = enrichPCIeWithAMD(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithPCISerials(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNVIDIA(snap.PCIeDevices)
snap.PCIeDevices = enrichNVLinkBridgesWithGPUTopo(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithMellanox(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithNICTelemetry(snap.PCIeDevices)
snap.PCIeDevices = enrichPCIeWithRAIDTelemetry(snap.PCIeDevices)

@@ -11,7 +11,6 @@ import (
"time"
)
const mellanoxVendorID = 0x15b3
const nicProbeTimeout = 2 * time.Second
var (
@@ -80,16 +79,7 @@ func enrichPCIeWithMellanox(devs []schema.HardwarePCIeDevice) []schema.HardwareP
}
func isMellanoxDevice(dev schema.HardwarePCIeDevice) bool {
if dev.VendorID != nil && *dev.VendorID == mellanoxVendorID {
return true
}
if dev.Manufacturer != nil {
m := strings.ToLower(*dev.Manufacturer)
if strings.Contains(m, "mellanox") || strings.Contains(m, "nvidia networking") {
return true
}
}
return false
return dev.VendorID != nil && *dev.VendorID == MellanoxVendorID
}
func queryMellanoxFromMstflint(bdf string) (firmware, serial string) {

@@ -55,7 +55,7 @@ func TestEnrichPCIeWithMellanox_mstflint(t *testing.T) {
}
netIfacesByBDF = func(string) []string { return nil }
vendorID := mellanoxVendorID
vendorID := MellanoxVendorID
bdf := "0000:18:00.0"
manufacturer := "Mellanox Technologies"
devs := []schema.HardwarePCIeDevice{{
@@ -99,7 +99,7 @@ func TestEnrichPCIeWithMellanox_fallbackEthtool(t *testing.T) {
return "driver: mlx5_core\nfirmware-version: 28.40.1000\n", nil
}
vendorID := mellanoxVendorID
vendorID := MellanoxVendorID
bdf := "0000:18:00.0"
manufacturer := "NVIDIA Networking"
devs := []schema.HardwarePCIeDevice{{

@@ -10,8 +10,6 @@ import (
"strings"
)
const nvidiaVendorID = 0x10de
type nvidiaGPUInfo struct {
Index int
BDF string
@@ -240,13 +238,7 @@ func normalizePCIeBDF(bdf string) string {
}
func isNVIDIADevice(dev schema.HardwarePCIeDevice) bool {
if dev.VendorID != nil && *dev.VendorID == nvidiaVendorID {
return true
}
if dev.Manufacturer != nil && strings.Contains(strings.ToLower(*dev.Manufacturer), "nvidia") {
return true
}
return false
return dev.VendorID != nil && *dev.VendorID == NvidiaVendorID
}
func setPCIeFallback(dev *schema.HardwarePCIeDevice) {

@@ -57,7 +57,7 @@ func TestNormalizePCIeBDF(t *testing.T) {
}
func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
vendorID := nvidiaVendorID
vendorID := NvidiaVendorID
bdf := "0000:65:00.0"
manufacturer := "NVIDIA Corporation"
status := "OK"
@@ -104,7 +104,7 @@ func TestEnrichPCIeWithNVIDIAData_driverLoaded(t *testing.T) {
}
func TestEnrichPCIeWithNVIDIAData_driverMissingFallback(t *testing.T) {
vendorID := nvidiaVendorID
vendorID := NvidiaVendorID
bdf := "0000:17:00.0"
manufacturer := "NVIDIA Corporation"
devices := []schema.HardwarePCIeDevice{

@@ -0,0 +1,11 @@
package collector
// PCI vendor IDs for hardware classification.
// Source: https://pcisig.com / https://pci-ids.ucw.cz/
const (
NvidiaVendorID = 0x10de
AMDVendorID = 0x1002
AspeedVendorID = 0x1a03
MellanoxVendorID = 0x15b3
IntelVendorID = 0x8086
)
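The hunks above replace manufacturer-string matching with a comparison against these constants. A minimal sketch of the effect, assuming `VendorID` is an `*int` as the test hunks suggest; `pcieDevice`, `hasVendor`, and `intPtr` are illustrative names, not part of the package:

```go
package main

import "fmt"

// Values copied from the constants in the diff.
const (
	NvidiaVendorID   = 0x10de
	MellanoxVendorID = 0x15b3
)

// pcieDevice is a reduced stand-in for schema.HardwarePCIeDevice.
// Assumption: VendorID is a nullable int, left nil when sysfs gave no ID.
type pcieDevice struct {
	VendorID     *int
	Manufacturer string
}

// hasVendor matches on the numeric PCI vendor ID only; the manufacturer
// string plays no role.
func hasVendor(dev pcieDevice, id int) bool {
	return dev.VendorID != nil && *dev.VendorID == id
}

func intPtr(v int) *int { return &v }

func main() {
	// A ConnectX-style NIC whose manufacturer string mentions NVIDIA.
	nic := pcieDevice{VendorID: intPtr(MellanoxVendorID), Manufacturer: "NVIDIA Networking"}
	// A device whose vendor ID could not be read.
	unknown := pcieDevice{Manufacturer: "NVIDIA Corporation"}

	fmt.Println(hasVendor(nic, NvidiaVendorID))     // false
	fmt.Println(hasVendor(nic, MellanoxVendorID))   // true
	fmt.Println(hasVendor(unknown, NvidiaVendorID)) // false
}
```

One consequence worth noting: a device with a nil `VendorID` no longer matches any vendor, whereas the removed string fallback would have matched it by name.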

@@ -126,38 +126,39 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
dev.Status = &status
// Slot is the BDF: "0000:00:02.0"
if bdf := fields["Slot"]; bdf != "" {
dev.Slot = &bdf
dev.BDF = &bdf
bdfStr := fields["Slot"]
if bdfStr != "" {
dev.Slot = &bdfStr
dev.BDF = &bdfStr
// parse vendor_id and device_id from sysfs
vendorID, deviceID := readPCIIDs(bdf)
vendorID, deviceID := readPCIIDs(bdfStr)
if vendorID != 0 {
dev.VendorID = &vendorID
}
if deviceID != 0 {
dev.DeviceID = &deviceID
}
if numaNode, ok := readPCINumaNode(bdf); ok {
if numaNode, ok := readPCINumaNode(bdfStr); ok {
dev.NUMANode = &numaNode
} else if numaNode, ok := parsePCINumaNode(fields["NUMANode"]); ok {
dev.NUMANode = &numaNode
}
if group, ok := readPCIIOMMUGroup(bdf); ok {
if group, ok := readPCIIOMMUGroup(bdfStr); ok {
dev.IOMMUGroup = &group
}
if width, ok := readPCIIntAttribute(bdf, "current_link_width"); ok {
if width, ok := readPCIIntAttribute(bdfStr, "current_link_width"); ok {
dev.LinkWidth = &width
}
if width, ok := readPCIIntAttribute(bdf, "max_link_width"); ok {
if width, ok := readPCIIntAttribute(bdfStr, "max_link_width"); ok {
dev.MaxLinkWidth = &width
}
if speed, ok := readPCIStringAttribute(bdf, "current_link_speed"); ok {
if speed, ok := readPCIStringAttribute(bdfStr, "current_link_speed"); ok {
linkSpeed := normalizePCILinkSpeed(speed)
if linkSpeed != "" {
dev.LinkSpeed = &linkSpeed
}
}
if speed, ok := readPCIStringAttribute(bdf, "max_link_speed"); ok {
if speed, ok := readPCIStringAttribute(bdfStr, "max_link_speed"); ok {
linkSpeed := normalizePCILinkSpeed(speed)
if linkSpeed != "" {
dev.MaxLinkSpeed = &linkSpeed
@@ -178,7 +179,15 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
// SVendor/SDevice available but not in schema — skip
// Warn if PCIe link is running below its maximum negotiated speed.
// Detect NVLink bridge mezzanine cards (CPU→HGX internal link).
// These are Mellanox x2 devices with no host net interfaces and a DeviceName
// containing "NVLINK". The targeted lspci call is only executed for the small
// number of narrow-link Mellanox cards that pass the cheap pre-filter.
if bdfStr != "" && isNVLinkBridgeCandidate(bdfStr, dev) && confirmNVLinkBridgeDeviceName(bdfStr) {
markNVLinkBridge(&dev)
}
// Warn (or Critical for NVLink bridges) if PCIe link is running below max.
applyPCIeLinkSpeedWarning(&dev)
return dev
@@ -265,17 +274,37 @@ func readPCIStringAttribute(bdf, attribute string) (string, bool) {
return value, true
}
// applyPCIeLinkSpeedWarning sets the device status to Warning if the current PCIe link
// speed is below the maximum negotiated speed supported by both ends.
// applyPCIeLinkSpeedWarning sets device status when the current PCIe link speed is
// below the device maximum. Regular PCIe slots get Warning; NVLink bridge cards
// get Critical because they are fixed internal connectors that must always train
// to max speed — any downgrade signals a hardware fault.
//
// Disabled devices (sysfs enable==0) are skipped: they carry no data traffic and
// their link state has no operational impact. This covers management endpoints
// (e.g. PCIe switch fabric controllers on HGX baseboards) that the kernel never
// activates but that lspci still reports with link stats.
func applyPCIeLinkSpeedWarning(dev *schema.HardwarePCIeDevice) {
if dev.LinkSpeed == nil || dev.MaxLinkSpeed == nil {
return
}
if pcieLinkSpeedRank(*dev.LinkSpeed) < pcieLinkSpeedRank(*dev.MaxLinkSpeed) {
if pcieLinkSpeedRank(*dev.LinkSpeed) >= pcieLinkSpeedRank(*dev.MaxLinkSpeed) {
return
}
if dev.BDF != nil {
if enabled, ok := readPCIIntAttribute(*dev.BDF, "enable"); ok && enabled == 0 {
return
}
}
desc := fmt.Sprintf("PCIe link speed degraded: running at %s, capable of %s", *dev.LinkSpeed, *dev.MaxLinkSpeed)
dev.ErrorDescription = &desc
isNVLinkBridge := dev.DeviceClass != nil && *dev.DeviceClass == "NVLinkBridge"
if isNVLinkBridge {
crit := statusCritical
dev.Status = &crit
} else {
warn := statusWarning
dev.Status = &warn
desc := fmt.Sprintf("PCIe link speed degraded: running at %s, capable of %s", *dev.LinkSpeed, *dev.MaxLinkSpeed)
dev.ErrorDescription = &desc
}
}
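
Note on the severity rules in applyPCIeLinkSpeedWarning: the decision order can be sketched in isolation as below. This is a minimal sketch, not the repository code. linkSpeedRank is a hypothetical stand-in for pcieLinkSpeedRank (which is not shown in this diff), it assumes normalized speeds of the form "Gen3"/"Gen4" as used in the tests, and the returned strings are placeholders for the real status constants.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// linkSpeedRank is a hypothetical stand-in for pcieLinkSpeedRank: it maps a
// normalized speed such as "Gen4" to 4 and returns 0 for unknown strings.
func linkSpeedRank(speed string) int {
	n, err := strconv.Atoi(strings.TrimPrefix(speed, "Gen"))
	if err != nil {
		return 0
	}
	return n
}

// linkSeverity follows the order described in the doc comment: no finding when
// the link runs at its maximum or the device is disabled, Critical for NVLink
// bridges, Warning for every other device. The strings are placeholders.
func linkSeverity(current, maxSpeed, deviceClass string, enabled bool) string {
	if linkSpeedRank(current) >= linkSpeedRank(maxSpeed) {
		return "OK"
	}
	if !enabled {
		return "OK"
	}
	if deviceClass == "NVLinkBridge" {
		return "Critical"
	}
	return "Warning"
}

func main() {
	fmt.Println(linkSeverity("Gen3", "Gen4", "NVLinkBridge", true))
	fmt.Println(linkSeverity("Gen3", "Gen4", "NetworkController", true))
	fmt.Println(linkSeverity("Gen3", "Gen4", "NetworkController", false))
}
```

Checking the speed comparison before the sysfs enable lookup keeps the common case (link at full speed) free of extra file reads, which appears to be the intent of the reordered early returns.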

@@ -0,0 +1,206 @@
package collector
import (
"bee/audit/internal/schema"
"log/slog"
"os/exec"
"regexp"
"strconv"
"strings"
)
var nv5re = regexp.MustCompile(`(?i)^NV(\d+)$`)
// isNVLinkBridgeCandidate returns true for Mellanox PCIe devices that look like
// NVLink bridge mezzanine cards: narrow link (x2), no host net interfaces.
// These are the CPU-side PCIe control plane of the NVSwitch fabric on HGX/DGX systems.
func isNVLinkBridgeCandidate(bdf string, dev schema.HardwarePCIeDevice) bool {
if !isMellanoxDevice(dev) {
return false
}
if dev.LinkWidth == nil || *dev.LinkWidth > 2 {
return false
}
if len(netIfacesByBDF(bdf)) > 0 {
return false
}
return true
}
// confirmNVLinkBridgeDeviceName checks if the lspci DeviceName for bdf contains
// "NVLINK". This is a targeted single-device call, only executed for candidates
// already pre-filtered by isNVLinkBridgeCandidate.
func confirmNVLinkBridgeDeviceName(bdf string) bool {
out, err := exec.Command("lspci", "-s", bdf, "-v").Output()
if err != nil {
return false
}
for _, line := range strings.Split(string(out), "\n") {
if strings.Contains(strings.ToUpper(strings.TrimSpace(line)), "NVLINK") {
return true
}
}
return false
}
// markNVLinkBridge overwrites device_class and adds telemetry flags on a detected
// NVLink bridge card. Must be called before applyPCIeLinkSpeedWarning so that the
// correct severity (Critical) is applied.
func markNVLinkBridge(dev *schema.HardwarePCIeDevice) {
class := "NVLinkBridge"
dev.DeviceClass = &class
if dev.Telemetry == nil {
dev.Telemetry = map[string]any{}
}
dev.Telemetry["nvlink_bridge"] = true
}
// enrichNVLinkBridgesWithGPUTopo cross-references NVLink bridge PCIe status with
// the GPU-side NVLink topology reported by nvidia-smi. For each bridge device it
// adds nvlink_topo_all_active, nvlink_topo_min_links and nvlink_topo_gpu_count to
// the telemetry, and sets nvlink_fabric_affected when a bridge that is already
// Critical coincides with a degraded fabric. It does not change the device status.
func enrichNVLinkBridgesWithGPUTopo(devs []schema.HardwarePCIeDevice) []schema.HardwarePCIeDevice {
hasBridge := false
for _, d := range devs {
if d.DeviceClass != nil && *d.DeviceClass == "NVLinkBridge" {
hasBridge = true
break
}
}
if !hasBridge {
return devs
}
topo, err := queryNVIDIANVLinkTopo()
if err != nil {
slog.Info("nvlink-bridge: nvidia-smi topo unavailable, skipping cross-reference", "err", err)
return devs
}
for i := range devs {
if devs[i].DeviceClass == nil || *devs[i].DeviceClass != "NVLinkBridge" {
continue
}
if devs[i].Telemetry == nil {
devs[i].Telemetry = map[string]any{}
}
devs[i].Telemetry["nvlink_topo_all_active"] = topo.AllActive
devs[i].Telemetry["nvlink_topo_min_links"] = topo.MinNVLinks
devs[i].Telemetry["nvlink_topo_gpu_count"] = topo.GPUCount
// If the bridge PCIe link is already Critical AND the fabric is also degraded
// (missing NVLink connections), flag the fabric as affected in telemetry.
if devs[i].Status != nil && *devs[i].Status == statusCritical && !topo.AllActive {
devs[i].Telemetry["nvlink_fabric_affected"] = true
}
}
slog.Info("nvlink-bridge: topo cross-reference applied",
"gpu_count", topo.GPUCount,
"all_active", topo.AllActive,
"min_links", topo.MinNVLinks,
)
return devs
}
// nvlinkTopoResult summarises the GPU NVLink connectivity matrix.
type nvlinkTopoResult struct {
GPUCount int
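AllActive bool // true if at least one GPU×GPU NV# cell was parsed and none reports NV0; non-NV cells (SYS, NODE, ...) are skipped, not counted as inactive
MinNVLinks int // minimum NVLink bond count across the parsed NV# cells (0 = an NV0 cell was seen, or no NV# cell at all)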
}
// queryNVIDIANVLinkTopo runs nvidia-smi topo -m and parses the NVLink matrix.
func queryNVIDIANVLinkTopo() (nvlinkTopoResult, error) {
out, err := exec.Command("nvidia-smi", "topo", "-m").Output()
if err != nil {
return nvlinkTopoResult{}, err
}
return parseNVIDIATopologyMatrix(string(out)), nil
}
// parseNVIDIATopologyMatrix extracts the minimum NVLink bond count from the
// nvidia-smi topo -m matrix.
//
// Format (abbreviated):
//
// GPU0 GPU1 ... NIC0 NIC1
// GPU0 X NV18 ... NODE NODE
// GPU1 NV18 X ... NODE NODE
// NIC0 NODE NODE... X PIX
//
// The header row starts with "GPU0"; its columns may include non-GPU entries
// (NIC, CPU) which are ignored. Only GPU×GPU cells containing NV# values are
// counted. X is self; non-NV tokens (NODE, SYS, PHB, PIX) are skipped.
func parseNVIDIATopologyMatrix(raw string) nvlinkTopoResult {
lines := strings.Split(raw, "\n")
// Locate the header line and record which column indices are GPU columns.
headerIdx := -1
var gpuColIndices []int // 0-based indices within fields (excluding the row label)
var gpuCount int
for i, line := range lines {
trimmed := strings.TrimSpace(line)
if strings.HasPrefix(trimmed, "GPU0") {
parts := strings.Fields(trimmed)
for j, col := range parts {
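// Match "GPU<n>" only: some nvidia-smi versions append a "GPU NUMA ID"
// column whose bare "GPU" token must not be counted as a GPU column.
if len(col) > 3 && strings.HasPrefix(col, "GPU") && col[3] >= '0' && col[3] <= '9' {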
gpuColIndices = append(gpuColIndices, j)
}
}
gpuCount = len(gpuColIndices)
if gpuCount >= 2 {
headerIdx = i
}
break
}
}
if headerIdx < 0 || gpuCount == 0 {
return nvlinkTopoResult{}
}
minLinks := -1 // -1 = no NV pair seen yet
allActive := true
for _, line := range lines[headerIdx+1:] {
trimmed := strings.TrimSpace(line)
if !strings.HasPrefix(trimmed, "GPU") {
continue
}
cells := strings.Fields(trimmed)
// cells[0] is the row label (e.g. "GPU0"); cells[1..] are column values.
// gpuColIndices are 0-based within the header fields, so they map to
// cells[idx+1] in the data rows (shift by 1 for the row label).
for _, colIdx := range gpuColIndices {
dataIdx := colIdx + 1
if dataIdx >= len(cells) {
continue
}
cell := cells[dataIdx]
m := nv5re.FindStringSubmatch(cell)
if len(m) != 2 {
continue
}
n, err := strconv.Atoi(m[1])
if err != nil {
continue
}
if n == 0 {
allActive = false
}
if minLinks < 0 || n < minLinks {
minLinks = n
}
}
}
if minLinks < 0 {
minLinks = 0
}
return nvlinkTopoResult{
GPUCount: gpuCount,
AllActive: allActive && minLinks > 0,
MinNVLinks: minLinks,
}
}
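
Note on AllActive: the parser above only looks at NV# cells, so a GPU pair that nvidia-smi reports with a PCIe/system token (SYS, NODE, PHB, PIX) instead of NV0 is skipped rather than treated as disconnected. If that distinction matters, a complementary count could be sketched as follows. nonNVLinkGPUCells is a hypothetical helper that is not part of this change; it assumes the same matrix layout as described in the parseNVIDIATopologyMatrix comment.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	gpuLabel = regexp.MustCompile(`^GPU\d+$`)
	nvCell   = regexp.MustCompile(`(?i)^NV\d+$`)
)

// nonNVLinkGPUCells counts GPU×GPU cells that are neither "X" (self) nor an
// NV# token. The matrix is symmetric, so each affected pair is counted twice.
func nonNVLinkGPUCells(raw string) int {
	lines := strings.Split(raw, "\n")
	var gpuCols []int
	header := -1
	for i, line := range lines {
		fields := strings.Fields(line)
		if len(fields) > 0 && fields[0] == "GPU0" {
			for j, f := range fields {
				if gpuLabel.MatchString(f) {
					gpuCols = append(gpuCols, j)
				}
			}
			header = i
			break
		}
	}
	if header < 0 {
		return 0
	}
	count := 0
	for _, line := range lines[header+1:] {
		cells := strings.Fields(line)
		if len(cells) == 0 || !gpuLabel.MatchString(cells[0]) {
			continue
		}
		for _, col := range gpuCols {
			idx := col + 1 // data rows carry a leading row label
			if idx >= len(cells) {
				continue
			}
			if c := cells[idx]; c != "X" && !nvCell.MatchString(c) {
				count++
			}
		}
	}
	return count
}

func main() {
	sample := "        GPU0 GPU1 GPU2\n" +
		"GPU0     X   NV18 SYS\n" +
		"GPU1    NV18  X   NV18\n" +
		"GPU2    SYS  NV18  X\n"
	fmt.Println(nonNVLinkGPUCells(sample))
}
```

A non-zero result could feed the same telemetry map as nvlink_topo_all_active; whether it should affect AllActive is a design decision this sketch leaves open.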

@@ -0,0 +1,124 @@
package collector
import (
"bee/audit/internal/schema"
"testing"
)
func TestParseNVIDIATopologyMatrix(t *testing.T) {
t.Parallel()
// Real-world B200 HGX output: 8 GPUs, all pairs connected via NV18.
input := ` GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS
NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X PIX
`
got := parseNVIDIATopologyMatrix(input)
if got.GPUCount != 8 {
t.Fatalf("GPUCount=%d want 8", got.GPUCount)
}
if !got.AllActive {
t.Fatalf("AllActive=false want true")
}
if got.MinNVLinks != 18 {
t.Fatalf("MinNVLinks=%d want 18", got.MinNVLinks)
}
}
func TestParseNVIDIATopologyMatrixPartialDegradation(t *testing.T) {
t.Parallel()
// GPU1-GPU3 pair shows NV12 (reduced) instead of NV18.
input := ` GPU0 GPU1 GPU2 GPU3
GPU0 X NV18 NV18 NV18
GPU1 NV18 X NV18 NV12
GPU2 NV18 NV18 X NV18
GPU3 NV18 NV12 NV18 X
`
got := parseNVIDIATopologyMatrix(input)
if got.MinNVLinks != 12 {
t.Fatalf("MinNVLinks=%d want 12", got.MinNVLinks)
}
if !got.AllActive {
t.Fatalf("AllActive=false want true (12 links is still active)")
}
}
func TestParseNVIDIATopologyMatrixDisconnected(t *testing.T) {
t.Parallel()
// GPU0-GPU1 pair fully disconnected (NV0).
input := ` GPU0 GPU1
GPU0 X NV0
GPU1 NV0 X
`
got := parseNVIDIATopologyMatrix(input)
if got.AllActive {
t.Fatalf("AllActive=true want false (NV0 means no links)")
}
if got.MinNVLinks != 0 {
t.Fatalf("MinNVLinks=%d want 0", got.MinNVLinks)
}
}
func TestParseNVIDIATopologyMatrixEmpty(t *testing.T) {
t.Parallel()
got := parseNVIDIATopologyMatrix("no gpus here")
if got.GPUCount != 0 {
t.Fatalf("GPUCount=%d want 0", got.GPUCount)
}
}
func TestApplyPCIeLinkSpeedWarningNVLinkBridgeEscalates(t *testing.T) {
t.Parallel()
bridgeClass := "NVLinkBridge"
linkSpeed := "Gen3"
maxLinkSpeed := "Gen4"
dev := schema.HardwarePCIeDevice{}
dev.DeviceClass = &bridgeClass
dev.LinkSpeed = &linkSpeed
dev.MaxLinkSpeed = &maxLinkSpeed
s := statusOK
dev.Status = &s
applyPCIeLinkSpeedWarning(&dev)
if dev.Status == nil || *dev.Status != statusCritical {
t.Fatalf("status=%v want Critical for NVLink bridge degradation", dev.Status)
}
if dev.ErrorDescription == nil {
t.Fatal("ErrorDescription nil, want degradation message")
}
}
func TestApplyPCIeLinkSpeedWarningRegularCardIsWarning(t *testing.T) {
t.Parallel()
regularClass := "NetworkController"
linkSpeed := "Gen3"
maxLinkSpeed := "Gen4"
dev := schema.HardwarePCIeDevice{}
dev.DeviceClass = &regularClass
dev.LinkSpeed = &linkSpeed
dev.MaxLinkSpeed = &maxLinkSpeed
s := statusOK
dev.Status = &s
applyPCIeLinkSpeedWarning(&dev)
if dev.Status == nil || *dev.Status != statusWarning {
t.Fatalf("status=%v want Warning for regular card degradation", dev.Status)
}
}

@@ -58,7 +58,6 @@ func buildSensorsFromDoc(doc sensorsDoc) *schema.HardwareSensors {
for _, chip := range chips {
features := doc[chip]
location := sensorLocation(chip)
keys := make([]string, 0, len(features))
for key := range features {
@@ -80,25 +79,25 @@ func buildSensorsFromDoc(doc sensorsDoc) *schema.HardwareSensors {
}
switch classifySensorFeature(feature) {
case "fan":
item := buildFanSensor(name, location, feature)
item := buildFanSensor(name, feature)
if item == nil || duplicateSensor(seen, "fan", item.Name) {
continue
}
result.Fans = append(result.Fans, *item)
case "temp":
item := buildTempSensor(name, location, feature)
item := buildTempSensor(name, feature)
if item == nil || duplicateSensor(seen, "temp", item.Name) {
continue
}
result.Temperatures = append(result.Temperatures, *item)
case "power":
item := buildPowerSensor(name, location, feature)
item := buildPowerSensor(name, feature)
if item == nil || duplicateSensor(seen, "power", item.Name) {
continue
}
result.Power = append(result.Power, *item)
default:
item := buildOtherSensor(name, location, feature)
item := buildOtherSensor(name, feature)
if item == nil || duplicateSensor(seen, "other", item.Name) {
continue
}
@@ -128,14 +127,6 @@ func duplicateSensor(seen map[string]struct{}, sensorType, name string) bool {
return false
}
func sensorLocation(chip string) *string {
chip = strings.TrimSpace(chip)
if chip == "" {
return nil
}
return &chip
}
func classifySensorFeature(feature map[string]any) string {
for key := range feature {
switch {
@@ -154,24 +145,24 @@ func classifySensorFeature(feature map[string]any) string {
return "other"
}
func buildFanSensor(name string, location *string, feature map[string]any) *schema.HardwareFanSensor {
func buildFanSensor(name string, feature map[string]any) *schema.HardwareFanSensor {
rpm, ok := firstFeatureInt(feature, "_input")
if !ok {
return nil
}
item := &schema.HardwareFanSensor{Name: name, Location: location, RPM: &rpm}
item := &schema.HardwareFanSensor{Name: name, RPM: &rpm}
if status := sensorStatusFromFeature(feature); status != nil {
item.Status = status
}
return item
}
func buildTempSensor(name string, location *string, feature map[string]any) *schema.HardwareTemperatureSensor {
func buildTempSensor(name string, feature map[string]any) *schema.HardwareTemperatureSensor {
celsius, ok := firstFeatureFloat(feature, "_input")
if !ok {
return nil
}
item := &schema.HardwareTemperatureSensor{Name: name, Location: location, Celsius: &celsius}
item := &schema.HardwareTemperatureSensor{Name: name, Celsius: &celsius}
if warning, ok := firstFeatureFloatWithSuffixes(feature, []string{"_max", "_high"}); ok {
item.ThresholdWarningCelsius = &warning
}
@@ -186,8 +177,8 @@ func buildTempSensor(name string, location *string, feature map[string]any) *sch
return item
}
func buildPowerSensor(name string, location *string, feature map[string]any) *schema.HardwarePowerSensor {
item := &schema.HardwarePowerSensor{Name: name, Location: location}
func buildPowerSensor(name string, feature map[string]any) *schema.HardwarePowerSensor {
item := &schema.HardwarePowerSensor{Name: name}
if v, ok := firstFeatureFloatWithContains(feature, []string{"power"}); ok {
item.PowerW = &v
}
@@ -206,12 +197,12 @@ func buildPowerSensor(name string, location *string, feature map[string]any) *sc
return item
}
func buildOtherSensor(name string, location *string, feature map[string]any) *schema.HardwareOtherSensor {
func buildOtherSensor(name string, feature map[string]any) *schema.HardwareOtherSensor {
value, unit, ok := firstGenericSensorValue(feature)
if !ok {
return nil
}
item := &schema.HardwareOtherSensor{Name: name, Location: location, Value: &value}
item := &schema.HardwareOtherSensor{Name: name, Value: &value}
if unit != "" {
item.Unit = &unit
}

@@ -36,6 +36,24 @@ func bestEffortRescanHotplugStorage() {
slog.Info("storage: scsi host scan skipped", "pattern", scsiHostScanGlob, "err", err)
} else {
for _, path := range hostPaths {
// SAS HBAs (e.g. smartpqi) block indefinitely in sas_user_scan when
// written to — SAS topology is discovered by the driver itself.
// Detect via two methods: (1) sas_host class registration, and
// (2) driver proc_name — smartpqi uses scsi_transport_sas but does
// not register a sas_host object, so (1) alone misses it.
host := filepath.Base(filepath.Dir(path))
if _, err := os.Stat("/sys/class/sas_host/" + host); err == nil {
slog.Info("storage: scsi host scan skipped (SAS host)", "path", path)
continue
}
if procName, err := os.ReadFile("/sys/class/scsi_host/" + host + "/proc_name"); err == nil {
switch strings.TrimSpace(string(procName)) {
case "smartpqi", "hpsa":
slog.Info("storage: scsi host scan skipped (SAS transport driver)",
"path", path, "driver", strings.TrimSpace(string(procName)))
continue
}
}
if err := hotplugWriteFile(path, []byte("- - -\n"), 0644); err != nil {
slog.Info("storage: scsi host scan write failed", "path", path, "err", err)
continue
@@ -66,17 +84,41 @@ func collectStorage() []schema.HardwareStorage {
return result
}
// jsonInt64 accepts both a bare JSON number and a JSON-quoted number string.
// lsblk -J emits LOG-SEC / PHY-SEC as integers on util-linux ≥ 2.37 (Debian 12)
// but older versions emit them as strings. This type handles both.
type jsonInt64 int64
func (j *jsonInt64) UnmarshalJSON(data []byte) error {
// bare number: 512
var n int64
if err := json.Unmarshal(data, &n); err == nil {
*j = jsonInt64(n)
return nil
}
// quoted string: "512"
var s string
if err := json.Unmarshal(data, &s); err == nil {
n, err := strconv.ParseInt(strings.TrimSpace(s), 10, 64)
if err == nil {
*j = jsonInt64(n)
}
return nil
}
return nil // null or unexpected type — leave zero
}
// lsblkDevice is a minimal lsblk JSON record.
type lsblkDevice struct {
Name string `json:"name"`
Type string `json:"type"`
Size string `json:"size"`
Serial string `json:"serial"`
Model string `json:"model"`
Tran string `json:"tran"`
Hctl string `json:"hctl"`
LogSec string `json:"log-sec"`
PhySec string `json:"phy-sec"`
Name string `json:"name"`
Type string `json:"type"`
Size string `json:"size"`
Serial string `json:"serial"`
Model string `json:"model"`
Tran string `json:"tran"`
Hctl string `json:"hctl"`
LogSec jsonInt64 `json:"log-sec"`
PhySec jsonInt64 `json:"phy-sec"`
}
type lsblkRoot struct {
@@ -382,20 +424,23 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
}
// nvmeSmartLog is the subset of `nvme smart-log -o json` output we care about.
// nvme-cli emits most counters as JSON strings (e.g. "power_on_hours":"49"),
// so all numeric fields use jsonInt64 which accepts both bare numbers and
// quoted strings. Field names match nvme-cli JSON output, not NVMe spec prose.
type nvmeSmartLog struct {
CriticalWarning int `json:"critical_warning"`
PercentageUsed int `json:"percentage_used"`
AvailableSpare int `json:"available_spare"`
SpareThreshold int `json:"spare_thresh"`
Temperature int64 `json:"temperature"`
PowerOnHours int64 `json:"power_on_hours"`
PowerCycles int64 `json:"power_cycles"`
UnsafeShutdowns int64 `json:"unsafe_shutdowns"`
DataUnitsRead int64 `json:"data_units_read"`
DataUnitsWritten int64 `json:"data_units_written"`
ControllerBusy int64 `json:"controller_busy_time"`
MediaErrors int64 `json:"media_errors"`
NumErrLogEntries int64 `json:"num_err_log_entries"`
CriticalWarning jsonInt64 `json:"critical_warning"`
PercentageUsed jsonInt64 `json:"percent_used"`
AvailableSpare jsonInt64 `json:"avail_spare"`
SpareThreshold jsonInt64 `json:"spare_thresh"`
Temperature jsonInt64 `json:"temperature"`
PowerOnHours jsonInt64 `json:"power_on_hours"`
PowerCycles jsonInt64 `json:"power_cycles"`
UnsafeShutdowns jsonInt64 `json:"unsafe_shutdowns"`
DataUnitsRead jsonInt64 `json:"data_units_read"`
DataUnitsWritten jsonInt64 `json:"data_units_written"`
ControllerBusy jsonInt64 `json:"controller_busy_time"`
MediaErrors jsonInt64 `json:"media_errors"`
NumErrLogEntries jsonInt64 `json:"num_err_log_entries"`
}
// nvmeIDCtrl is the subset of `nvme id-ctrl -o json` output.
@@ -460,13 +505,16 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
var log nvmeSmartLog
if json.Unmarshal(out, &log) == nil {
if log.PowerOnHours > 0 {
s.PowerOnHours = &log.PowerOnHours
v := int64(log.PowerOnHours)
s.PowerOnHours = &v
}
if log.PowerCycles > 0 {
s.PowerCycles = &log.PowerCycles
v := int64(log.PowerCycles)
s.PowerCycles = &v
}
if log.UnsafeShutdowns > 0 {
s.UnsafeShutdowns = &log.UnsafeShutdowns
v := int64(log.UnsafeShutdowns)
s.UnsafeShutdowns = &v
}
if log.PercentageUsed > 0 {
v := float64(log.PercentageUsed)
@@ -475,11 +523,11 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
s.LifeRemainingPct = &remaining
}
if log.DataUnitsWritten > 0 {
v := nvmeDataUnitsToBytes(log.DataUnitsWritten)
v := nvmeDataUnitsToBytes(int64(log.DataUnitsWritten))
s.WrittenBytes = &v
}
if log.DataUnitsRead > 0 {
v := nvmeDataUnitsToBytes(log.DataUnitsRead)
v := nvmeDataUnitsToBytes(int64(log.DataUnitsRead))
s.ReadBytes = &v
}
if log.AvailableSpare > 0 {
@@ -487,23 +535,25 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
s.AvailableSparePct = &v
}
if log.MediaErrors > 0 {
s.MediaErrors = &log.MediaErrors
v := int64(log.MediaErrors)
s.MediaErrors = &v
}
if log.NumErrLogEntries > 0 {
s.ErrorLogEntries = &log.NumErrLogEntries
v := int64(log.NumErrLogEntries)
s.ErrorLogEntries = &v
}
if log.Temperature > 0 {
v := float64(log.Temperature - 273)
s.TemperatureC = &v
}
setStorageHealthStatus(&s, storageHealthStatus{
criticalWarning: log.CriticalWarning,
criticalWarning: int(log.CriticalWarning),
percentageUsed: int64(log.PercentageUsed),
availableSpare: int64(log.AvailableSpare),
spareThreshold: int64(log.SpareThreshold),
unsafeShutdowns: log.UnsafeShutdowns,
mediaErrors: log.MediaErrors,
errorLogEntries: log.NumErrLogEntries,
unsafeShutdowns: int64(log.UnsafeShutdowns),
mediaErrors: int64(log.MediaErrors),
errorLogEntries: int64(log.NumErrLogEntries),
})
return s
}
@@ -620,8 +670,8 @@ func applyStorageBlockGeometry(s *schema.HardwareStorage, dev lsblkDevice) {
if s == nil {
return
}
logical := parseStorageBytes(dev.LogSec)
physical := parseStorageBytes(dev.PhySec)
logical := int64(dev.LogSec)
physical := int64(dev.PhySec)
if logical <= 0 && physical <= 0 {
return
}
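
Note on the NVMe smart-log handling: the pieces above (tolerant integer decoding, the Kelvin offset, and the data-unit conversion) can be exercised together in a small standalone sketch. flexInt mirrors the jsonInt64 idea but is written independently here. The factor 512000 is an assumption based on the NVMe definition of a data unit as one thousand 512-byte blocks; the repository's nvmeDataUnitsToBytes is not shown in this diff and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// flexInt accepts both a bare JSON number (512) and a quoted one ("512").
type flexInt int64

func (f *flexInt) UnmarshalJSON(data []byte) error {
	s := strings.TrimSpace(strings.Trim(string(data), `"`))
	if n, err := strconv.ParseInt(s, 10, 64); err == nil {
		*f = flexInt(n)
	}
	return nil // null or non-numeric input leaves the zero value
}

// smartLog is a two-field subset used only for this illustration.
type smartLog struct {
	Temperature      flexInt `json:"temperature"`
	DataUnitsWritten flexInt `json:"data_units_written"`
}

// summarize converts the raw counters to Celsius and bytes.
func summarize(raw string) (celsius, writtenBytes int64, err error) {
	var log smartLog
	if err = json.Unmarshal([]byte(raw), &log); err != nil {
		return 0, 0, err
	}
	celsius = int64(log.Temperature) - 273               // value is reported in Kelvin
	writtenBytes = int64(log.DataUnitsWritten) * 512_000 // assumed: 1000 × 512 bytes per unit
	return celsius, writtenBytes, nil
}

func main() {
	c, w, err := summarize(`{"temperature": 310, "data_units_written": "8497672"}`)
	fmt.Println(c, w, err)
}
```

Returning nil for unparsable input keeps one odd field from failing the whole document, at the cost of silently reporting zero; callers that treat zero as "absent" (as the > 0 checks above do) are unaffected.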

@@ -1,6 +1,7 @@
package collector
import (
"encoding/json"
"os"
"os/exec"
"path/filepath"
@@ -38,6 +39,54 @@ func TestParseStorageBytes(t *testing.T) {
}
}
func TestJsonInt64UnmarshalBothFormats(t *testing.T) {
t.Parallel()
// util-linux ≥ 2.37 emits LOG-SEC / PHY-SEC as bare JSON numbers.
// Older versions emit quoted strings. Both must parse without error
// so that the entire lsblkDevices() call does not return nil on Debian 12.
cases := []struct {
json string
want int64
}{
{`512`, 512},
{`4096`, 4096},
{`"512"`, 512},
{`"4096"`, 4096},
{`null`, 0},
}
for _, tc := range cases {
var v jsonInt64
if err := v.UnmarshalJSON([]byte(tc.json)); err != nil {
t.Fatalf("UnmarshalJSON(%s): unexpected error %v", tc.json, err)
}
if int64(v) != tc.want {
t.Fatalf("UnmarshalJSON(%s)=%d want %d", tc.json, int64(v), tc.want)
}
}
// Simulate the exact JSON shape that triggered the bug on Debian 12.
input := []byte(`{
"blockdevices": [
{"name":"sda","type":"disk","size":"3.6T","serial":"S1234","model":"SEAGATE","tran":"sata","hctl":"0:0:0:0","log-sec":512,"phy-sec":4096},
{"name":"sdb","type":"disk","size":"3.6T","serial":"S5678","model":"SEAGATE","tran":"sata","hctl":"0:0:1:0","log-sec":512,"phy-sec":4096}
]
}`)
var root lsblkRoot
if err := json.Unmarshal(input, &root); err != nil {
t.Fatalf("lsblkRoot unmarshal with integer log-sec/phy-sec: %v", err)
}
if len(root.Blockdevices) != 2 {
t.Fatalf("got %d blockdevices want 2", len(root.Blockdevices))
}
if int64(root.Blockdevices[0].LogSec) != 512 {
t.Fatalf("LogSec=%d want 512", root.Blockdevices[0].LogSec)
}
if int64(root.Blockdevices[0].PhySec) != 4096 {
t.Fatalf("PhySec=%d want 4096", root.Blockdevices[0].PhySec)
}
}
func TestBestEffortRescanHotplugStorage(t *testing.T) {
t.Parallel()

@@ -1,11 +1,65 @@
package collector
import (
"encoding/json"
"testing"
"bee/audit/internal/schema"
)
// TestNVMeSmartLogUnmarshal verifies that nvme-cli JSON output (where most
// counters are quoted strings and field names differ from NVMe spec prose)
// is correctly parsed into nvmeSmartLog.
func TestNVMeSmartLogUnmarshal(t *testing.T) {
t.Parallel()
// Real nvme-cli output: counters are JSON strings, spare is "avail_spare",
// percentage used is "percent_used".
raw := `{
"critical_warning": 0,
"temperature": 310,
"avail_spare": 100,
"spare_thresh": 5,
"percent_used": 0,
"data_units_read": "10925415",
"data_units_written": "8497672",
"controller_busy_time": "305",
"power_cycles": "53",
"power_on_hours": "49",
"unsafe_shutdowns": "22",
"media_errors": "0",
"num_err_log_entries": "0"
}`
var log nvmeSmartLog
if err := json.Unmarshal([]byte(raw), &log); err != nil {
t.Fatalf("json.Unmarshal failed: %v", err)
}
if log.PowerOnHours != 49 {
t.Errorf("PowerOnHours=%d want 49", log.PowerOnHours)
}
if log.PowerCycles != 53 {
t.Errorf("PowerCycles=%d want 53", log.PowerCycles)
}
if log.AvailableSpare != 100 {
t.Errorf("AvailableSpare=%d want 100", log.AvailableSpare)
}
if log.SpareThreshold != 5 {
t.Errorf("SpareThreshold=%d want 5", log.SpareThreshold)
}
if log.PercentageUsed != 0 {
t.Errorf("PercentageUsed=%d want 0", log.PercentageUsed)
}
if log.Temperature != 310 {
t.Errorf("Temperature=%d want 310", log.Temperature)
}
if log.MediaErrors != 0 {
t.Errorf("MediaErrors=%d want 0", log.MediaErrors)
}
if log.UnsafeShutdowns != 22 {
t.Errorf("UnsafeShutdowns=%d want 22", log.UnsafeShutdowns)
}
}
func TestSetStorageHealthStatus(t *testing.T) {
t.Parallel()

@@ -38,6 +38,15 @@ var HardwareErrorPatterns = []ErrorPattern{
Category: "gpu",
Severity: "warning",
},
// PCIe AER correctable from the NVIDIA driver — "bus correctable error" in SEL.
// Severity is warning (not critical): correctable errors are hardware-recovered.
{
Name: "nvidia-aer-correctable",
Re: mustPat(`(?i)nvidia\s+([\da-f]{4}:[\da-f]{2}:[\da-f]{2}\.\d).*AER.*\b[Cc]orrect`),
Category: "gpu",
Severity: "warning",
BDFGroup: 1,
},
{
Name: "nvidia-aer",
Re: mustPat(`(?i)nvidia\s+([\da-f]{4}:[\da-f]{2}:[\da-f]{2}\.\d).*AER`),
@@ -54,6 +63,15 @@ var HardwareErrorPatterns = []ErrorPattern{
},
// ── PCIe AER (generic) ──────────────────────────────────────────────────────
// PCIe AER correctable from the root port — captures the reported device BDF
// (second BDF in "pcieport X: AER: Correctable error received: Y").
{
Name: "pcie-aer-correctable",
Re: mustPat(`(?i)pcieport.*AER:.*\b[Cc]orrect.*:\s*([\da-f]{4}:[\da-f]{2}:[\da-f]{2}\.\d)`),
Category: "pcie",
Severity: "warning",
BDFGroup: 1,
},
{
Name: "pcie-aer",
Re: mustPat(`(?i)pcieport\s+([\da-f]{4}:[\da-f]{2}:[\da-f]{2}\.\d).*AER`),
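
Note on the root-port correctable pattern: the capture group is meant to pick up the second BDF (the reporting device), not the port itself. The sketch below checks that on a sample line shaped like the one quoted in the comment. It uses regexp.MustCompile in place of mustPat (not shown in this diff) and a copy of the pattern with a \b word boundary before "correct", so that "Uncorrected" lines are not captured as correctable. The log lines are illustrative, not captured output.

```go
package main

import (
	"fmt"
	"regexp"
)

// correctableRe is a copy of the root-port pattern with a word boundary in
// front of "correct" so that "Uncorrected"/"Uncorrectable" does not match.
var correctableRe = regexp.MustCompile(
	`(?i)pcieport.*AER:.*\bcorrect.*:\s*([\da-f]{4}:[\da-f]{2}:[\da-f]{2}\.\d)`)

// reportedBDF returns the BDF of the device the root port reported, if any.
func reportedBDF(line string) (string, bool) {
	m := correctableRe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	lines := []string{
		"pcieport 0000:00:01.0: AER: Correctable error received: 0000:01:00.0",
		"pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:01:00.0",
	}
	for _, l := range lines {
		bdf, ok := reportedBDF(l)
		fmt.Println(ok, bdf)
	}
}
```

Without the word boundary, the substring "correct" inside "Uncorrected" would satisfy the pattern and a fatal error would be recorded with warning severity.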

@@ -5,6 +5,7 @@ import (
"fmt"
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
)
@@ -18,7 +19,7 @@ type InstallDisk struct {
MountedParts []string // partition mount points currently active
}
const squashfsPath = "/run/live/medium/live/filesystem.squashfs"
const squashfsGlob = "/run/live/medium/live/*.squashfs"
// ListInstallDisks returns block devices suitable for installation.
// Excludes the current live boot medium but includes USB drives.
@@ -176,11 +177,22 @@ func inferLiveBootKind(fsType, source, deviceType, transport string) string {
// squashfs size × 1.5 to allow for extracted filesystem and bootloader.
// Returns 0 if the squashfs is not available (non-live environment).
func MinInstallBytes() int64 {
fi, err := os.Stat(squashfsPath)
if err != nil {
files, err := filepath.Glob(squashfsGlob)
if err != nil || len(files) == 0 {
return 0
}
return fi.Size() * 3 / 2
var total int64
for _, path := range files {
fi, statErr := os.Stat(path)
if statErr != nil {
continue
}
total += fi.Size()
}
if total == 0 {
return 0
}
return total * 3 / 2
}
// toramActive returns true when the live system was booted with toram.
@@ -222,12 +234,10 @@ func DiskWarnings(d InstallDisk) []string {
humanBytes(min), humanBytes(d.SizeBytes)))
}
if toramActive() {
sqFi, err := os.Stat(squashfsPath)
if err == nil {
free := freeMemBytes()
if free > 0 && free < sqFi.Size()*2 {
w = append(w, "toram mode — low RAM, extraction may be slow or fail")
}
free := freeMemBytes()
min := MinInstallBytes()
if free > 0 && min > 0 && free < (min*4/3) {
w = append(w, "toram mode — low RAM, extraction may be slow or fail")
}
}
return w

@@ -258,7 +258,7 @@ func (s *System) GetInterfaceState(iface string) (bool, error) {
func interfaceAdminState(iface string) (bool, error) {
raw, err := exec.Command("ip", "-o", "link", "show", "dev", iface).Output()
if err != nil {
return false, err
return false, fmt.Errorf("ip link show dev %s: %w", iface, err)
}
return parseInterfaceAdminState(string(raw))
}
@@ -288,7 +288,7 @@ func interfaceIPv4Addrs(iface string) ([]string, error) {
if errors.As(err, &exitErr) {
return nil, nil
}
return nil, err
return nil, fmt.Errorf("ip addr show dev %s: %w", iface, err)
}
var ipv4 []string
for _, line := range strings.Split(strings.TrimSpace(string(raw)), "\n") {

@@ -55,7 +55,6 @@ func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, e
if err == nil {
health.Interfaces = make([]schema.RuntimeInterface, 0, len(interfaces))
hasIPv4 := false
missingIPv4 := false
for _, iface := range interfaces {
outcome := "no_offer"
if len(iface.IPv4) > 0 {
@@ -63,8 +62,6 @@ func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, e
hasIPv4 = true
} else if strings.EqualFold(iface.State, "DOWN") {
outcome = "link_down"
} else {
missingIPv4 = true
}
health.Interfaces = append(health.Interfaces, schema.RuntimeInterface{
Name: iface.Name,
@@ -73,17 +70,9 @@ func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, e
Outcome: outcome,
})
}
switch {
case hasIPv4 && !missingIPv4:
if hasIPv4 {
health.NetworkStatus = "OK"
case hasIPv4:
health.NetworkStatus = "PARTIAL"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_partial",
Severity: "warning",
Description: "At least one interface did not obtain IPv4 connectivity.",
})
default:
} else {
health.NetworkStatus = "FAILED"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "dhcp_failed",

@@ -2,6 +2,8 @@
// core/internal/ingest/parser_hardware.go. No import dependency on core.
package schema
import "encoding/json"
// HardwareIngestRequest is the top-level output document produced by `bee audit`.
// It is accepted as-is by the core /api/ingest/hardware endpoint.
type HardwareIngestRequest struct {
@@ -64,9 +66,10 @@ type HardwareSnapshot struct {
Storage []HardwareStorage `json:"storage,omitempty"`
PCIeDevices []HardwarePCIeDevice `json:"pcie_devices,omitempty"`
PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"`
Sensors *HardwareSensors `json:"sensors,omitempty"`
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
VROCLicense *string `json:"vroc_license,omitempty"`
Sensors *HardwareSensors `json:"sensors,omitempty"`
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
PlatformConfig *json.RawMessage `json:"platform_config,omitempty"`
VROCLicense *string `json:"vroc_license,omitempty"`
}
type HardwareHealthSummary struct {
@@ -123,7 +126,7 @@ type HardwareCPU struct {
type HardwareMemory struct {
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
Location *string `json:"location,omitempty"`
Location *string `json:"-"` // internal: used for DIMM telemetry matching only
Present *bool `json:"present,omitempty"`
SizeMB *int `json:"size_mb,omitempty"`
Type *string `json:"type,omitempty"`
@@ -261,15 +264,13 @@ type HardwareSensors struct {
}
type HardwareFanSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
RPM *int `json:"rpm,omitempty"`
Status *string `json:"status,omitempty"`
Name string `json:"name"`
RPM *int `json:"rpm,omitempty"`
Status *string `json:"status,omitempty"`
}
type HardwarePowerSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
VoltageV *float64 `json:"voltage_v,omitempty"`
CurrentA *float64 `json:"current_a,omitempty"`
PowerW *float64 `json:"power_w,omitempty"`
@@ -278,7 +279,6 @@ type HardwarePowerSensor struct {
type HardwareTemperatureSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
Celsius *float64 `json:"celsius,omitempty"`
ThresholdWarningCelsius *float64 `json:"threshold_warning_celsius,omitempty"`
ThresholdCriticalCelsius *float64 `json:"threshold_critical_celsius,omitempty"`
@@ -286,11 +286,10 @@ type HardwareTemperatureSensor struct {
}
type HardwareOtherSensor struct {
Name string `json:"name"`
Location *string `json:"location,omitempty"`
Value *float64 `json:"value,omitempty"`
Unit *string `json:"unit,omitempty"`
Status *string `json:"status,omitempty"`
Name string `json:"name"`
Value *float64 `json:"value,omitempty"`
Unit *string `json:"unit,omitempty"`
Status *string `json:"status,omitempty"`
}
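
Note on the schema tags: the effect of `json:"-"` on Location and of an optional *json.RawMessage for platform_config can be seen with a pair of cut-down structs. This is a sketch with hypothetical type names and a made-up platform_config payload; only the tag behaviour is the point.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memory keeps Location for in-process matching but never serializes it.
type memory struct {
	Slot     *string `json:"slot,omitempty"`
	Location *string `json:"-"`
}

// snapshot embeds an opaque, pre-encoded document when one is present.
type snapshot struct {
	Memory         []memory         `json:"memory,omitempty"`
	PlatformConfig *json.RawMessage `json:"platform_config,omitempty"`
}

// encode marshals one DIMM and, optionally, a raw platform config document.
func encode(platformConfig []byte) (string, error) {
	slot, loc := "DIMM_A1", "CPU0_DIMM_A1"
	snap := snapshot{Memory: []memory{{Slot: &slot, Location: &loc}}}
	if platformConfig != nil {
		raw := json.RawMessage(platformConfig)
		snap.PlatformConfig = &raw
	}
	out, err := json.Marshal(snap)
	return string(out), err
}

func main() {
	without, _ := encode(nil)
	with, _ := encode([]byte(`{"bios":{"sriov":true}}`))
	fmt.Println(without)
	fmt.Println(with)
}
```

The raw message is passed through as already-encoded JSON, so the collector does not need a Go type for whatever the platform config contains.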
type HardwareEventLog struct {

@@ -1679,6 +1679,56 @@ func (h *handler) handleAPIBenchmarkResults(w http.ResponseWriter, r *http.Reque
fmt.Fprint(w, renderBenchmarkResultsCard(h.opts.ExportDir))
}
// ── Hardware summary / component detail ──────────────────────────────────────
// handleAPIHardwareSummary returns the hardware summary card HTML fragment for
// htmx polling (hx-get="/api/hardware-summary" hx-swap="outerHTML").
func (h *handler) handleAPIHardwareSummary(w http.ResponseWriter, _ *http.Request) {
w.Header().Set("Content-Type", "text/html; charset=utf-8")
w.Header().Set("Cache-Control", "no-store")
fmt.Fprint(w, renderHardwareSummaryCard(h.opts))
}
// handleAPIComponentDetail returns an HTML fragment describing the current and
// historical status for one component type (cpu, memory, storage, gpu, psu).
func (h *handler) handleAPIComponentDetail(w http.ResponseWriter, r *http.Request) {
compType := r.PathValue("type")
var exact, prefixes []string
var title string
switch compType {
case "cpu":
title = "CPU"
exact = []string{"cpu:all"}
case "memory":
title = "Memory"
exact = []string{"memory:all"}
prefixes = []string{"memory:"}
case "storage":
title = "Storage"
exact = []string{"storage:all"}
prefixes = []string{"storage:"}
case "gpu":
title = "GPU"
prefixes = []string{"pcie:gpu:"}
case "psu":
title = "PSU"
prefixes = []string{"psu:"}
default:
http.NotFound(w, r)
return
}
var records []app.ComponentStatusRecord
if h.opts.App != nil && h.opts.App.StatusDB != nil {
all := h.opts.App.StatusDB.All()
records = matchedRecords(all, exact, prefixes)
}
w.Header().Set("Content-Type", "text/html; charset=utf-8")
w.Header().Set("Cache-Control", "no-store")
fmt.Fprint(w, renderComponentDetail(title, records))
}
func (h *handler) rollbackPendingNetworkChange() error {
h.pendingNetMu.Lock()
pnc := h.pendingNet


@@ -0,0 +1,76 @@
package webui
import (
"bytes"
"context"
"log/slog"
"os/exec"
"time"
"bee/audit/internal/app"
"bee/audit/internal/collector"
)
const healthPollInterval = 60 * time.Second
const psuIPMITimeout = 15 * time.Second
// healthPoller runs periodic health checks for hardware components that do not
// emit kernel log events (e.g. PSU). Results are written to ComponentStatusDB.
type healthPoller struct {
statusDB *app.ComponentStatusDB
}
func newHealthPoller(statusDB *app.ComponentStatusDB) *healthPoller {
return &healthPoller{statusDB: statusDB}
}
func (p *healthPoller) start() {
goRecoverLoop("health poller", 5*time.Second, p.run)
}
func (p *healthPoller) run() {
ticker := time.NewTicker(healthPollInterval)
defer ticker.Stop()
for range ticker.C {
p.pollPSU()
}
}
func (p *healthPoller) pollPSU() {
if p.statusDB == nil {
return
}
ctx, cancel := context.WithTimeout(context.Background(), psuIPMITimeout)
defer cancel()
cmd := exec.CommandContext(ctx, "ipmitool", "sdr")
var out bytes.Buffer
cmd.Stdout = &out
if err := cmd.Run(); err != nil {
// ipmitool is missing, failed, or timed out (e.g. no BMC on this host): skip this poll; logged at debug level only.
slog.Debug("health poller: ipmitool sdr unavailable", "err", err)
return
}
slots := collector.PSUSlotsFromSDR(out.String())
if len(slots) == 0 {
return
}
const source = "watchdog:psu"
for slot, psu := range slots {
key := "psu:" + slot
status := psu.Status
if status == "" {
status = "Unknown"
}
detail := ""
switch status {
case "Critical":
detail = "PSU sensor reported non-OK state"
case "Warning":
detail = "PSU sensor in warning state"
}
p.statusDB.Record(key, source, status, detail)
}
}


@@ -73,6 +73,9 @@ func (w *kmsgWatcher) run() {
w.mu.Lock()
if w.window != nil {
w.recordEvent(evt)
} else {
evtCopy := evt
goRecoverOnce("kmsg flush immediate", func() { w.flushImmediate(evtCopy) })
}
w.mu.Unlock()
}
@@ -162,7 +165,9 @@ func (w *kmsgWatcher) flushWindow(window *kmsgWindow) {
for _, id := range evt.ids {
var key string
switch evt.category {
case "gpu", "pcie":
case "gpu":
key = "pcie:gpu:" + normalizeBDF(id)
case "pcie":
key = "pcie:" + normalizeBDF(id)
case "storage":
key = "storage:" + id
@@ -180,6 +185,54 @@ func (w *kmsgWatcher) flushWindow(window *kmsgWindow) {
}
}
// flushImmediate writes a single kmsg event directly to the status DB without a SAT window.
// Called when an error is detected outside of any SAT task (always-on watching).
func (w *kmsgWatcher) flushImmediate(evt kmsgEvent) {
if w.statusDB == nil {
return
}
const source = "watchdog:kmsg"
detail := "kernel: " + truncate(evt.raw, 120)
var severity string
for _, p := range platform.HardwareErrorPatterns {
if p.Re.MatchString(evt.raw) {
if p.Severity == "critical" {
severity = "Critical"
} else {
severity = "Warning"
}
break
}
}
if severity == "" {
severity = "Warning"
}
if len(evt.ids) == 0 {
key := "cpu:all"
if evt.category == "memory" {
key = "memory:all"
}
w.statusDB.Record(key, source, severity, detail)
return
}
for _, id := range evt.ids {
var key string
switch evt.category {
case "gpu":
key = "pcie:gpu:" + normalizeBDF(id)
case "pcie":
key = "pcie:" + normalizeBDF(id)
case "storage":
key = "storage:" + id
default:
key = "pcie:" + normalizeBDF(id)
}
w.statusDB.Record(key, source, severity, detail)
}
}
// parseKmsgLine parses a single /dev/kmsg line and returns an event if it matches
// any pattern in platform.HardwareErrorPatterns.
// kmsg format: "<priority>,<sequence>,<timestamp_usec>,-;message text"


@@ -17,6 +17,7 @@ func layoutHead(title string) string {
<style>
:root{--bg:#fff;--surface:#fff;--surface-2:#f9fafb;--border:rgba(34,36,38,.15);--border-lite:rgba(34,36,38,.1);--ink:rgba(0,0,0,.87);--muted:rgba(0,0,0,.6);--accent:#2185d0;--accent-dark:#1678c2;--crit-bg:#fff6f6;--crit-fg:#9f3a38;--crit-border:#e0b4b4;--ok-bg:#fcfff5;--ok-fg:#2c662d;--warn-bg:#fffaf3;--warn-fg:#573a08}
*{box-sizing:border-box;margin:0;padding:0}
dialog{margin:auto}
body{font:14px/1.5 Lato,"Helvetica Neue",Arial,Helvetica,sans-serif;background:var(--bg);color:var(--ink);display:flex;min-height:100vh}
a{color:var(--accent);text-decoration:none}
/* Sidebar */
@@ -67,6 +68,11 @@ tbody tr:hover td{background:rgba(0,0,0,.03)}
.chip-warn{background:var(--warn-bg);color:var(--warn-fg);border:1px solid #c9ba9b}
.chip-fail{background:var(--crit-bg);color:var(--crit-fg);border:1px solid var(--crit-border)}
.chip-unknown{background:var(--surface-2);color:var(--muted);border:1px solid var(--border)}
/* Tasks nav badge */
.tasks-nav-btn{display:flex;justify-content:space-between;align-items:center;padding:10px 16px;color:rgba(255,255,255,.55);font-size:12px;text-decoration:none;border-top:1px solid rgba(255,255,255,.12);margin-top:auto;transition:color .15s}
.tasks-nav-btn:hover{color:#fff}
.tasks-nav-count{background:var(--accent);color:#fff;border-radius:10px;padding:1px 7px;font-size:11px;font-weight:700;display:none}
.tasks-nav-count.active{display:inline}
/* Output terminal */
.terminal{background:#1b1c1d;border:1px solid rgba(0,0,0,.2);border-radius:4px;padding:14px;font-family:monospace;font-size:12px;color:#b5cea8;max-height:400px;overflow-y:auto;white-space:pre-wrap;word-break:break-all;user-select:text;-webkit-user-select:text}
.terminal-wrap{position:relative}.terminal-copy{position:absolute;top:6px;right:6px;background:#2d2f30;border:1px solid #444;color:#aaa;font-size:11px;padding:2px 8px;border-radius:3px;cursor:pointer;opacity:.7}.terminal-copy:hover{opacity:1}
@@ -92,14 +98,15 @@ tbody tr:hover td{background:rgba(0,0,0,.03)}
}
func layoutNav(active string, buildLabel string) string {
items := []struct{ id, label, href, onclick string }{
{"dashboard", "Dashboard", "/", ""},
{"audit", "Audit", "/audit", ""},
{"validate", "Validate", "/validate", ""},
{"burn", "Burn", "/burn", ""},
{"benchmark", "Benchmark", "/benchmark", ""},
{"tasks", "Tasks", "/tasks", ""},
{"tools", "Tools", "/tools", ""},
items := []struct{ id, label, href string }{
{"dashboard", "Dashboard", "/"},
{"audit", "1. Audit", "/audit"},
{"check", "2. Check", "/check"},
{"load", "3. Load", "/load"},
{"speed", "4. Speed", "/speed"},
{"endurance", "5. Endurance", "/endurance"},
{"tools", "6. Tools", "/tools"},
{"settings", "7. Settings", "/settings"},
}
var b strings.Builder
b.WriteString(`<aside class="sidebar">`)
@@ -123,15 +130,16 @@ func layoutNav(active string, buildLabel string) string {
if item.id == active {
cls += " active"
}
if item.onclick != "" {
b.WriteString(fmt.Sprintf(`<a class="%s" href="%s" onclick="%s">%s</a>`,
cls, item.href, item.onclick, item.label))
} else {
b.WriteString(fmt.Sprintf(`<a class="%s" href="%s">%s</a>`,
cls, item.href, item.label))
}
b.WriteString(fmt.Sprintf(`<a class="%s" href="%s">%s</a>`, cls, item.href, item.label))
}
b.WriteString(`</nav>`)
b.WriteString(`<a href="/tasks" class="tasks-nav-btn" id="tasks-nav-btn">`)
b.WriteString(`<span>Tasks</span>`)
b.WriteString(`<span class="tasks-nav-count" id="tasks-nav-count"></span>`)
b.WriteString(`</a>`)
b.WriteString(`<script>`)
b.WriteString(`(function(){function u(){fetch('/api/tasks',{cache:'no-store'}).then(function(r){return r.json();}).then(function(d){var n=Array.isArray(d)?d.filter(function(t){return t.status==='pending'||t.status==='running';}).length:0;var c=document.getElementById('tasks-nav-count');var b=document.getElementById('tasks-nav-btn');if(c){c.textContent=n>0?String(n):'';c.className='tasks-nav-count'+(n>0?' active':'');}if(b){b.style.color=n>0?'#f6c90e':'';}}).catch(function(){});}u();setInterval(u,5000);})();`)
b.WriteString(`</script>`)
b.WriteString(`</aside>`)
return b.String()
}
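
The sidebar badge script counts tasks whose status is "pending" or "running" in the /api/tasks response and treats any non-array payload as zero. The same rule can be mirrored in Go, for example to assert on the endpoint's output in a server test. This is a sketch under the assumption that the response is a JSON array of objects with a "status" field, as the script implies; taskStatus and activeTaskCount are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskStatus captures the only field the badge script reads (hypothetical type).
type taskStatus struct {
	Status string `json:"status"`
}

// activeTaskCount returns the number of pending or running tasks in body.
// Anything that is not a JSON array of objects counts as zero, like the script.
func activeTaskCount(body []byte) int {
	var tasks []taskStatus
	if err := json.Unmarshal(body, &tasks); err != nil {
		return 0
	}
	n := 0
	for _, t := range tasks {
		if t.Status == "pending" || t.Status == "running" {
			n++
		}
	}
	return n
}

func main() {
	body := []byte(`[{"status":"pending"},{"status":"running"},{"status":"done"}]`)
	fmt.Println(activeTaskCount(body))
}
```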


@@ -611,3 +611,20 @@ func renderPowerBenchmarkResultsCard(exportDir string) string {
b.WriteString(`</div></div>`)
return b.String()
}
// renderSpeed renders the Speed page (step 4): performance benchmarks.
// Uses the same benchmark infrastructure; defaults to Standard profile (throughput/bandwidth).
// For long-duration stability/overnight runs, see Endurance (step 5).
func renderSpeed(opts HandlerOptions) string {
base := renderBenchmark(opts)
return `<div class="alert alert-info" style="margin-bottom:16px"><strong>Speed:</strong> Measures GPU compute throughput and memory bandwidth. For overnight stability testing, go to <a href="/endurance">5. Endurance</a>.</div>` + base
}
// renderEndurance renders the Endurance page (step 5): long-duration reliability tests.
// Focuses on Stability and Overnight profiles for multi-hour burn validation.
// For short load tests, see Load (step 3). For throughput measurement, see Speed (step 4).
func renderEndurance(opts HandlerOptions) string {
base := renderBenchmark(opts)
return `<div class="alert alert-warn" style="margin-bottom:16px"><strong>Endurance:</strong> Long-duration reliability tests — Stability (several hours) and Overnight (8+ h) profiles. These profiles run hardware at sustained load; results show whether the server holds its performance envelope over time.</div>
<div class="alert alert-info" style="margin-bottom:16px">Use the <strong>Stability</strong> or <strong>Overnight</strong> profile in the setup card below. The Standard profile is available too but is better suited for the <a href="/speed">4. Speed</a> page.</div>` + base
}


@@ -1,8 +1,13 @@
package webui
// renderLoad renders the Load page (step 3): sustained stress tests.
// For non-destructive status checks, see Check (step 2).
// For DCGM targeted diagnostics (targeted_stress, targeted_power, pulse), see Check → Validate mode.
func renderLoad() string { return renderBurn() }
func renderBurn() string {
return `<div class="alert alert-warn" style="margin-bottom:16px"><strong>&#9888; Warning:</strong> Stress tests on this page run hardware at high load. Repeated or prolonged use may reduce hardware lifespan. Use only when necessary.</div>
<div class="alert alert-info" style="margin-bottom:16px"><strong>Scope:</strong> Burn exposes sustained GPU compute load recipes. DCGM diagnostics (` + "targeted_stress, targeted_power, pulse_test" + `) and LINPACK remain in <a href="/validate">Validate → Stress mode</a>; NCCL and NVBandwidth are available directly from <a href="/validate">Validate</a>.</div>
<div class="alert alert-info" style="margin-bottom:16px"><strong>Scope:</strong> Load runs sustained GPU compute and CPU/memory stress recipes. DCGM diagnostics (<code>targeted_stress</code>, <code>targeted_power</code>, <code>pulse_test</code>) and NCCL/NVBandwidth are on the <a href="/check">2. Check</a> page. For overnight endurance runs, see <a href="/endurance">5. Endurance</a>.</div>
<p style="color:var(--muted);font-size:13px;margin-bottom:16px">Tasks continue in the background — view progress in <a href="/tasks">Tasks</a>.</p>
<div class="card" style="margin-bottom:16px">


@@ -0,0 +1,77 @@
package webui
import "html"
func renderSettings(opts HandlerOptions) string {
version := opts.BuildLabel
if version == "" {
version = "dev"
}
return `<div class="grid2">
<div class="card">
<div class="card-head">Blackbox Logging</div>
<div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:14px">Continuous hardware monitoring that writes a rolling log of sensor readings to the export directory. Useful for capturing thermal or power anomalies during long runs.</p>
<div style="display:flex;gap:8px;align-items:center">
<button class="btn btn-primary btn-sm" onclick="blackboxToggle('enable')">Enable</button>
<button class="btn btn-secondary btn-sm" onclick="blackboxToggle('disable')">Disable</button>
<span id="blackbox-status" style="font-size:12px;color:var(--muted)">Loading...</span>
</div>
</div>
</div>
<div class="card">
<div class="card-head">NVIDIA Recovery</div>
<div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:14px">Reset NVIDIA GPU driver state. Use when <code>nvidia-smi</code> reports errors or GPUs appear stuck after a failed test.</p>
<div style="display:flex;gap:8px;align-items:center">
<button class="btn btn-danger btn-sm" onclick="nvidiaReset()">Reset NVIDIA Driver</button>
<span id="nvidia-reset-status" style="font-size:12px;color:var(--muted)"></span>
</div>
</div>
</div>
</div>
<div class="card" style="margin-top:0">
<div class="card-head">Build Info</div>
<div class="card-body">
<table style="width:auto">
<tbody>
<tr><td style="color:var(--muted);padding-right:24px">Version</td><td>` + html.EscapeString(version) + `</td></tr>
<tr><td style="color:var(--muted);padding-right:24px">Title</td><td>` + html.EscapeString(opts.Title) + `</td></tr>
</tbody>
</table>
</div>
</div>
<script>
(function() {
fetch('/api/blackbox/status', {cache:'no-store'}).then(r => r.json()).then(d => {
var el = document.getElementById('blackbox-status');
if (el) el.textContent = d.enabled ? 'Enabled' : 'Disabled';
}).catch(() => {
var el = document.getElementById('blackbox-status');
if (el) el.textContent = 'Status unavailable';
});
})();
function blackboxToggle(action) {
var el = document.getElementById('blackbox-status');
if (el) el.textContent = 'Updating...';
fetch('/api/blackbox/' + action, {method:'POST', cache:'no-store'})
.then(r => r.json())
.then(d => { if (el) el.textContent = d.enabled ? 'Enabled' : 'Disabled'; })
.catch(err => { if (el) el.textContent = 'Error: ' + err.message; });
}
function nvidiaReset() {
var el = document.getElementById('nvidia-reset-status');
if (!confirm('Reset NVIDIA driver? This will interrupt any running GPU tasks.')) return;
if (el) el.textContent = 'Resetting...';
fetch('/api/gpu/nvidia-reset', {method:'POST', cache:'no-store'})
.then(r => r.json())
.then(d => { if (el) el.textContent = d.error ? ('Error: ' + d.error) : 'Done — driver reset.'; })
.catch(err => { if (el) el.textContent = 'Error: ' + err.message; });
}
</script>`
}


@@ -11,6 +11,13 @@ import (
"bee/audit/internal/schema"
)
// PCI vendor IDs used for GPU classification (source: pci-ids.ucw.cz).
const (
pciVendorNvidia = 0x10de
pciVendorAMD = 0x1002
pciVendorAspeed = 0x1a03
)
type validateInventory struct {
CPU string
Memory string
@@ -634,25 +641,307 @@ func validateFirstNonEmpty(values ...string) string {
}
func validateIsVendorGPU(dev schema.HardwarePCIeDevice, vendor string) bool {
model := strings.ToLower(validateTrimPtr(dev.Model))
manufacturer := strings.ToLower(validateTrimPtr(dev.Manufacturer))
class := strings.ToLower(validateTrimPtr(dev.DeviceClass))
if strings.Contains(model, "aspeed") || strings.Contains(manufacturer, "aspeed") {
if dev.VendorID != nil && *dev.VendorID == pciVendorAspeed {
return false
}
class := strings.ToLower(validateTrimPtr(dev.DeviceClass))
isGPUClass := class == "videocontroller" || class == "processingaccelerator" || class == "displaycontroller"
switch vendor {
case "nvidia":
return strings.Contains(model, "nvidia") || strings.Contains(manufacturer, "nvidia")
return isGPUClass && dev.VendorID != nil && *dev.VendorID == pciVendorNvidia
case "amd":
isGPUClass := class == "processingaccelerator" || class == "displaycontroller" || class == "videocontroller"
isAMDVendor := strings.Contains(manufacturer, "advanced micro devices") || strings.Contains(manufacturer, "amd") || strings.Contains(manufacturer, "ati")
isAMDModel := strings.Contains(model, "instinct") || strings.Contains(model, "radeon") || strings.Contains(model, "amd")
return isGPUClass && (isAMDVendor || isAMDModel)
return isGPUClass && dev.VendorID != nil && *dev.VendorID == pciVendorAMD
default:
return false
}
}
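
The ID-based version of validateIsVendorGPU can be exercised with a small table of devices. The sketch below re-creates the new logic against a hypothetical pcieDevice type; the field types (*int, *string) are assumptions, since schema.HardwarePCIeDevice is not shown in this diff.

```go
package main

import (
	"fmt"
	"strings"
)

// PCI vendor IDs, as in the diff.
const (
	pciVendorNvidia = 0x10de
	pciVendorAMD    = 0x1002
	pciVendorAspeed = 0x1a03
)

// pcieDevice is a hypothetical stand-in for schema.HardwarePCIeDevice.
type pcieDevice struct {
	VendorID    *int
	DeviceClass *string
}

// ptr returns a pointer to v, handy for optional fields in literals.
func ptr[T any](v T) *T { return &v }

// isVendorGPU mirrors the new check: GPU-like device class plus matching vendor ID.
func isVendorGPU(dev pcieDevice, vendor string) bool {
	if dev.VendorID == nil {
		return false // no vendor ID: cannot be classified
	}
	class := ""
	if dev.DeviceClass != nil {
		class = strings.ToLower(strings.TrimSpace(*dev.DeviceClass))
	}
	isGPUClass := class == "videocontroller" || class == "processingaccelerator" || class == "displaycontroller"
	switch vendor {
	case "nvidia":
		return isGPUClass && *dev.VendorID == pciVendorNvidia
	case "amd":
		return isGPUClass && *dev.VendorID == pciVendorAMD
	default:
		return false
	}
}

func main() {
	devices := map[string]pcieDevice{
		"nvidia accelerator": {ptr(pciVendorNvidia), ptr("ProcessingAccelerator")},
		"amd display":        {ptr(pciVendorAMD), ptr("DisplayController")},
		"aspeed bmc vga":     {ptr(pciVendorAspeed), ptr("VideoController")},
		"no vendor id":       {nil, ptr("VideoController")},
	}
	for name, dev := range devices {
		fmt.Printf("%-20s nvidia=%v amd=%v\n", name, isVendorGPU(dev, "nvidia"), isVendorGPU(dev, "amd"))
	}
}
```

Two consequences follow from the new logic. A device with a nil VendorID is never classified as a vendor GPU, whereas the old name-based matching could still match on model or manufacturer strings. The explicit ASPEED early return also looks redundant for correctness, because an ASPEED ID can never equal the NVIDIA or AMD ID, though it documents the intent.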
// renderCheck renders the non-destructive Check page (step 2).
// Shows validate-mode tests only: CPU, Memory, Storage, NVIDIA L2, NCCL, NVBandwidth, AMD.
// Stress-mode tests (targeted-stress, targeted-power, pulse) are on the Load page.
func renderCheck(opts HandlerOptions) string {
inv := loadValidateInventory(opts)
n := inv.NvidiaGPUCount
validateTotalStr := validateFmtDur(validateTotalValidateSec(n))
gpuNote := ""
if n > 0 {
gpuNote = fmt.Sprintf(" (%d GPU)", n)
}
return `<div class="alert alert-info" style="margin-bottom:16px"><strong>Non-destructive:</strong> Check tests collect diagnostics only — no writes to disks, no sustained load, no hardware wear counters incremented. For stress testing, go to <a href="/load">3. Load</a>.</div>
<div style="display:flex;align-items:center;gap:12px;margin-bottom:16px">
<button type="button" class="btn btn-primary" onclick="runAllCheckSAT()">Run All Checks</button>
<span id="sat-all-status" style="font-size:12px;color:var(--muted)"></span>
<span style="font-size:12px;color:var(--muted)">est. ` + validateTotalStr + gpuNote + `</span>
</div>
<div class="grid3">
` + renderSATCard("cpu", "CPU", "runSAT('cpu')", "", renderValidateCardBody(
inv.CPU,
`Collects CPU inventory and temperatures, then runs a bounded CPU stress pass.`,
`<code>lscpu</code>, <code>sensors</code>, <code>stress-ng</code>`,
validateFmtDur(platform.SATEstimatedCPUValidateSec)+` (stress-ng 60 s).`,
)) +
renderSATCard("memory", "Memory", "runSAT('memory')", "", renderValidateCardBody(
inv.Memory,
`Runs a RAM validation pass and records memory state around the test.`,
`<code>free</code>, <code>memtester</code>`,
validateFmtDur(platform.SATEstimatedMemoryValidateSec)+` (256 MB × 1 pass).`,
)) +
renderSATCard("storage", "Storage", "runSAT('storage')", "", renderValidateCardBody(
inv.Storage,
`Scans all storage devices and runs the matching health or self-test path for each.`,
`<code>lsblk</code>; NVMe: <code>nvme</code>; SATA/SAS: <code>smartctl</code>`,
`Seconds (NVMe: instant device query; SATA/SAS: short self-test).`,
)) +
`</div>
<div style="height:1px;background:var(--border);margin:16px 0"></div>
<div class="card" style="margin-bottom:16px">
<div class="card-head">NVIDIA GPU Selection</div>
<div class="card-body">
<p style="font-size:12px;color:var(--muted);margin:0 0 8px">` + inv.NVIDIA + `</p>
<div style="display:flex;gap:8px;flex-wrap:wrap;margin-bottom:8px">
<button class="btn btn-sm btn-secondary" type="button" onclick="satSelectAllGPUs()">Select All</button>
<button class="btn btn-sm btn-secondary" type="button" onclick="satSelectNoGPUs()">Clear</button>
</div>
<div id="sat-gpu-list" style="border:1px solid var(--border);border-radius:4px;padding:12px;min-height:88px">
<p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p>
</div>
<p id="sat-gpu-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA check tasks.</p>
</div>
</div>
<div class="grid3">
` + renderSATCard("nvidia", "NVIDIA GPU", "runNvidiaValidateSet('nvidia')", "", renderValidateCardBody(
inv.NVIDIA,
`Runs NVIDIA diagnostics and board inventory checks (DCGM Level 2).`,
`<code>nvidia-smi</code>, <code>dmidecode</code>, <code>dcgmi diag</code>`,
validateFmtDur(platform.SATEstimatedNvidiaGPUValidateSec)+` (Level 2, all GPUs simultaneously).`,
)) +
renderSATCard("nvidia-interconnect", "NVIDIA Interconnect (NCCL)", "runNvidiaFabricValidate('nvidia-interconnect')", "", renderValidateCardBody(
inv.NVIDIA,
`Verifies NVLink/NVSwitch fabric bandwidth using NCCL all_reduce_perf across all selected GPUs.`,
`<code>all_reduce_perf</code> (NCCL tests)`,
validateFmtDur(platform.SATEstimatedNvidiaInterconnectSec)+` (all GPUs simultaneously, requires ≥2).`,
)) +
renderSATCard("nvidia-bandwidth", "NVIDIA Bandwidth (NVBandwidth)", "runNvidiaFabricValidate('nvidia-bandwidth')", "", renderValidateCardBody(
inv.NVIDIA,
`Validates GPU memory copy and peer-to-peer bandwidth paths using NVBandwidth.`,
`<code>nvbandwidth</code>`,
validateFmtDur(platform.SATEstimatedNvidiaBandwidthSec)+` (all GPUs simultaneously).`,
)) +
`</div>
<div class="grid3" style="margin-top:16px">
` + renderSATCard("amd", "AMD GPU", "runAMDValidateSet()", "", renderValidateCardBody(
inv.AMD,
`Runs AMD GPU inventory, MEM integrity, and MEM bandwidth checks.`,
`GPU Validate: <code>rocm-smi</code>, <code>dmidecode</code>; MEM Integrity: <code>rvs mem</code>; MEM Bandwidth: <code>rocm-bandwidth-test</code>, <code>rvs babel</code>`,
`<div style="display:flex;flex-direction:column;gap:4px"><label class="cb-row"><input type="checkbox" id="sat-amd-target" checked><span>GPU Validate</span></label><label class="cb-row"><input type="checkbox" id="sat-amd-mem-target" checked><span>MEM Integrity</span></label><label class="cb-row"><input type="checkbox" id="sat-amd-bandwidth-target" checked><span>MEM Bandwidth</span></label></div>`,
)) +
`</div>
<div id="sat-output" style="display:none;margin-top:16px" class="card">
<div class="card-head">Test Output <span id="sat-title"></span></div>
<div class="card-body"><div id="sat-terminal" class="terminal"></div></div>
</div>
<style>
.validate-card-body { padding:0; }
.validate-card-section { padding:12px 16px 0; }
.validate-card-section:last-child { padding-bottom:16px; }
.sat-gpu-row { display:flex; align-items:flex-start; gap:8px; padding:6px 0; cursor:pointer; font-size:13px; }
.sat-gpu-row input[type=checkbox] { width:16px; height:16px; margin-top:2px; flex-shrink:0; }
.cb-row { display:flex; align-items:flex-start; gap:8px; padding:4px 0; cursor:pointer; font-size:13px; }
.cb-row input[type=checkbox] { width:16px; height:16px; margin-top:2px; flex-shrink:0; }
</style>
<script>
let satES = null;
function satLabels() {
return {nvidia:'Check GPU (DCGM L2)', 'nvidia-interconnect':'NVIDIA Interconnect (NCCL)', 'nvidia-bandwidth':'NVIDIA Bandwidth (NVBandwidth)', memory:'Check Memory', storage:'Check Storage', cpu:'Check CPU', amd:'Check AMD GPU', 'amd-mem':'AMD GPU MEM Integrity', 'amd-bandwidth':'AMD GPU MEM Bandwidth'};
}
let satNvidiaGPUsPromise = null;
function loadSatNvidiaGPUs() {
if (!satNvidiaGPUsPromise) {
satNvidiaGPUsPromise = fetch('/api/gpu/nvidia').then(r => {
if (!r.ok) throw new Error('Failed to load NVIDIA GPUs.');
return r.json();
}).then(list => Array.isArray(list) ? list : []);
}
return satNvidiaGPUsPromise;
}
function satSelectedGPUIndices() {
return Array.from(document.querySelectorAll('.sat-nvidia-checkbox'))
.filter(el => el.checked && !el.disabled)
.map(el => parseInt(el.value, 10))
.filter(v => !Number.isNaN(v))
.sort((a, b) => a - b);
}
function satUpdateGPUSelectionNote() {
const note = document.getElementById('sat-gpu-selection-note');
if (!note) return;
const sel = satSelectedGPUIndices();
note.textContent = sel.length
? 'Selected GPUs: ' + sel.join(', ') + '. Multi-GPU tests will use all selected GPUs.'
: 'Select at least one NVIDIA GPU to enable NVIDIA check tasks.';
}
function satRenderGPUList(gpus) {
const root = document.getElementById('sat-gpu-list');
if (!root) return;
if (!gpus || !gpus.length) {
root.innerHTML = '<p style="color:var(--muted);font-size:13px">No NVIDIA GPUs detected.</p>';
satUpdateGPUSelectionNote(); return;
}
root.innerHTML = gpus.map(gpu => {
const mem = gpu.memory_mb > 0 ? ' · ' + gpu.memory_mb + ' MiB' : '';
return '<label class="sat-gpu-row"><input class="sat-nvidia-checkbox" type="checkbox" value="' + gpu.index + '" checked onchange="satUpdateGPUSelectionNote()"><span><strong>GPU ' + gpu.index + '</strong> — ' + gpu.name + mem + '</span></label>';
}).join('');
satUpdateGPUSelectionNote();
}
function satSelectAllGPUs() { document.querySelectorAll('.sat-nvidia-checkbox').forEach(el => { el.checked = true; }); satUpdateGPUSelectionNote(); }
function satSelectNoGPUs() { document.querySelectorAll('.sat-nvidia-checkbox').forEach(el => { el.checked = false; }); satUpdateGPUSelectionNote(); }
function satGPULoadInit() {
loadSatNvidiaGPUs().then(satRenderGPUList).catch(err => {
const root = document.getElementById('sat-gpu-list');
if (root) root.innerHTML = '<p style="color:var(--crit-fg);font-size:13px">Error: ' + err.message + '</p>';
satUpdateGPUSelectionNote();
});
}
function satRequestBody(target, overrides) {
const body = {};
const labels = satLabels();
body.display_name = labels[target] || ('Check ' + target);
body.stress_mode = false;
if (target === 'cpu') body.duration = 60;
if (overrides) Object.keys(overrides).forEach(k => { body[k] = overrides[k]; });
return body;
}
function enqueueSATTarget(target, overrides) {
return fetch('/api/sat/' + target + '/run', {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify(satRequestBody(target, overrides))}).then(r => r.json());
}
function streamSATTask(taskId, title, resetTerminal) {
if (satES) { satES.close(); satES = null; }
document.getElementById('sat-output').style.display = 'block';
document.getElementById('sat-title').textContent = '— ' + title;
const term = document.getElementById('sat-terminal');
if (resetTerminal) term.textContent = '';
term.textContent += 'Task ' + taskId + ' queued. Streaming log...\n';
return new Promise(resolve => {
satES = new EventSource('/api/tasks/' + taskId + '/stream');
satES.onmessage = e => { term.textContent += e.data + '\n'; term.scrollTop = term.scrollHeight; };
satES.addEventListener('done', e => {
satES.close(); satES = null;
term.textContent += (e.data ? '\nERROR: ' + e.data : '\nCompleted.') + '\n';
term.scrollTop = term.scrollHeight;
resolve({ok: !e.data, error: e.data || ''});
});
satES.onerror = () => {
if (satES) { satES.close(); satES = null; }
term.textContent += '\nERROR: stream disconnected.\n';
term.scrollTop = term.scrollHeight;
resolve({ok: false, error: 'stream disconnected'});
};
});
}
function selectedAMDValidateTargets() {
const targets = [];
const gpu = document.getElementById('sat-amd-target');
const mem = document.getElementById('sat-amd-mem-target');
const bw = document.getElementById('sat-amd-bandwidth-target');
if (gpu && gpu.checked && !gpu.disabled) targets.push('amd');
if (mem && mem.checked && !mem.disabled) targets.push('amd-mem');
if (bw && bw.checked && !bw.disabled) targets.push('amd-bandwidth');
return targets;
}
function runSAT(target) { return runSATWithOverrides(target, null); }
function runSATWithOverrides(target, overrides) {
const title = (overrides && overrides.display_name) || target;
document.getElementById('sat-output').style.display = 'block';
document.getElementById('sat-title').textContent = '— ' + title;
const term = document.getElementById('sat-terminal');
term.textContent = 'Enqueuing ' + title + ' test...\n';
return enqueueSATTarget(target, overrides).then(d => streamSATTask(d.task_id, title, false));
}
function runNvidiaFabricValidate(target) {
const indices = satSelectedGPUIndices();
if (!indices.length) { alert('Select at least one NVIDIA GPU.'); return; }
return runSATWithOverrides(target, {gpu_indices: indices, display_name: satLabels()[target] || target});
}
function runNvidiaValidateSet(target) {
const sel = satSelectedGPUIndices();
if (!sel.length) { alert('Select at least one NVIDIA GPU.'); return; }
return runSATWithOverrides(target, {gpu_indices: sel, display_name: satLabels()[target] || target});
}
function runAMDValidateSet() {
const targets = selectedAMDValidateTargets();
if (!targets.length) return;
if (targets.length === 1) return runSAT(targets[0]);
const term = document.getElementById('sat-terminal');
document.getElementById('sat-output').style.display = 'block';
document.getElementById('sat-title').textContent = '— amd';
term.textContent = 'Running AMD check set...\n';
const labels = satLabels();
const runNext = idx => {
if (idx >= targets.length) return Promise.resolve();
const t = targets[idx];
term.textContent += '\n[' + (idx + 1) + '/' + targets.length + '] ' + labels[t] + '\n';
return enqueueSATTarget(t).then(d => streamSATTask(d.task_id, labels[t], false)).then(() => runNext(idx + 1));
};
return runNext(0);
}
function runAllCheckSAT() {
const status = document.getElementById('sat-all-status');
status.textContent = 'Enqueuing...';
const nvidiaIndices = satSelectedGPUIndices();
const nvidiaAllTargets = ['nvidia', 'nvidia-interconnect', 'nvidia-bandwidth'];
const baseTargets = ['cpu', 'memory', 'storage'];
const amdTargets = selectedAMDValidateTargets();
const expanded = [];
baseTargets.forEach(t => expanded.push({target: t}));
if (nvidiaIndices.length) {
nvidiaAllTargets.forEach(t => {
const btn = document.getElementById('sat-btn-' + t);
if (!(btn && btn.disabled)) expanded.push({target: t, overrides: {gpu_indices: nvidiaIndices, display_name: satLabels()[t] || t}});
});
}
amdTargets.forEach(t => expanded.push({target: t}));
if (!expanded.length) { status.textContent = 'No tasks selected.'; return; }
const total = expanded.length;
const runNext = idx => {
if (idx >= expanded.length) { status.textContent = 'Completed ' + total + ' task(s).'; return Promise.resolve(); }
const item = expanded[idx];
status.textContent = 'Running ' + (idx + 1) + '/' + total + '...';
return enqueueSATTarget(item.target, item.overrides).then(() => runNext(idx + 1));
};
runNext(0).catch(err => { status.textContent = 'Error: ' + err.message; });
}
function disableSATCard(id, reason) {
const btn = document.getElementById('sat-btn-' + id);
if (!btn) return;
btn.disabled = true; btn.title = reason; btn.style.opacity = '0.4';
const card = btn.closest('.card');
if (card) {
let note = card.querySelector('.sat-unavail');
if (!note) {
note = document.createElement('p');
note.className = 'sat-unavail';
note.style.cssText = 'color:var(--muted);font-size:12px;margin:0 0 8px';
const body = card.querySelector('.card-body');
if (body) body.insertBefore(note, body.firstChild);
}
note.textContent = reason;
}
}
fetch('/api/gpu/presence').then(r => r.json()).then(gp => {
if (!gp.nvidia) ['nvidia','nvidia-interconnect','nvidia-bandwidth'].forEach(t => disableSATCard(t, 'No NVIDIA GPU detected'));
if (!gp.amd) {
disableSATCard('amd', 'No AMD GPU detected');
['sat-amd-target','sat-amd-mem-target','sat-amd-bandwidth-target'].forEach(id => {
const cb = document.getElementById(id);
if (cb) { cb.disabled = true; cb.checked = false; }
});
}
});
satGPULoadInit();
</script>`
}
func renderSATCard(id, label, runAction, headerActions, body string) string {
actions := `<button id="sat-btn-` + id + `" class="btn btn-primary btn-sm" onclick="` + runAction + `">Run</button>`
if strings.TrimSpace(headerActions) != "" {


@@ -5,7 +5,9 @@ import (
"fmt"
"html"
"path/filepath"
"regexp"
"sort"
"strconv"
"strings"
"bee/audit/internal/app"
@@ -22,41 +24,54 @@ func renderPage(page string, opts HandlerOptions) string {
body = renderDashboard(opts)
case "audit":
pageID = "audit"
title = "Audit"
title = "1. Audit"
body = renderAudit()
case "validate":
pageID = "validate"
title = "Validate"
body = renderValidate(opts)
case "burn":
pageID = "burn"
title = "Burn"
body = renderBurn()
case "check":
pageID = "check"
title = "2. Check"
body = renderCheck(opts)
case "load":
pageID = "load"
title = "3. Load"
body = renderLoad()
case "speed":
pageID = "speed"
title = "4. Speed"
body = renderSpeed(opts)
case "endurance":
pageID = "endurance"
title = "5. Endurance"
body = renderEndurance(opts)
case "tools":
pageID = "tools"
title = "6. Tools"
body = renderTools()
case "settings":
pageID = "settings"
title = "7. Settings"
body = renderSettings(opts)
// Legacy routes (redirected at HTTP level in handlePage; these are fallbacks)
case "validate", "tests":
pageID = "check"
title = "2. Check"
body = renderCheck(opts)
case "burn", "burn-in":
pageID = "load"
title = "3. Load"
body = renderLoad()
case "benchmark":
pageID = "benchmark"
title = "Benchmark"
body = renderBenchmark(opts)
pageID = "speed"
title = "4. Speed"
body = renderSpeed(opts)
case "tasks":
pageID = "tasks"
title = "Tasks"
body = renderTasks()
case "tools":
pageID = "tools"
title = "Tools"
body = renderTools()
// Legacy routes kept accessible but not in nav
// Hidden pages (not in nav, accessible by direct URL)
case "metrics":
pageID = "metrics"
title = "Live Metrics"
body = renderMetrics()
case "tests":
pageID = "validate"
title = "Acceptance Tests"
body = renderValidate(opts)
case "burn-in":
pageID = "burn"
title = "Burn-in Tests"
body = renderBurn()
case "network":
pageID = "network"
title = "Network"
@@ -85,6 +100,7 @@ func renderPage(page string, opts HandlerOptions) string {
body +
`</div></div>` +
renderAuditModal() +
`<dialog id="component-detail-dialog" style="min-width:600px;max-width:900px;width:90vw;padding:0;border:1px solid var(--border);border-radius:8px;background:var(--surface)"><div id="component-detail-body" style="padding-bottom:20px"></div></dialog>` +
`<script>
// Add copy button to every .terminal on the page
document.querySelectorAll('.terminal').forEach(function(t){
@@ -94,6 +110,17 @@ document.querySelectorAll('.terminal').forEach(function(t){
btn.onclick=function(){navigator.clipboard.writeText(t.textContent).then(function(){btn.textContent='Copied!';setTimeout(function(){btn.textContent='Copy';},1500);});};
w.appendChild(btn);
});
function openComponentDetail(type) {
var dlg = document.getElementById('component-detail-dialog');
var body = document.getElementById('component-detail-body');
body.innerHTML = '<div style="padding:20px;color:var(--muted)">Loading…</div>';
dlg.showModal();
fetch('/api/components/' + type).then(function(r){ return r.text(); }).then(function(html){
body.innerHTML = html;
}).catch(function(){
body.innerHTML = '<div style="padding:20px;color:var(--crit-fg)">Error loading details.</div>';
});
}
</script>` +
`</body></html>`
}
@@ -106,6 +133,14 @@ func renderDashboard(opts HandlerOptions) string {
b.WriteString(renderHardwareSummaryCard(opts))
b.WriteString(renderHealthCard(opts))
b.WriteString(renderMetrics())
b.WriteString(`<script>
setInterval(function(){
fetch('/api/hardware-summary').then(function(r){return r.text();}).then(function(html){
var el=document.getElementById('hw-summary-card');
if(el){el.outerHTML=html;}
}).catch(function(){});
},30000);
</script>`)
return b.String()
}
@@ -184,13 +219,14 @@ func renderAudit() string {
}
func renderHardwareSummaryCard(opts HandlerOptions) string {
const cardID = ` id="hw-summary-card"`
data, err := loadSnapshot(opts.AuditPath)
if err != nil {
return `<div class="card"><div class="card-head card-head-actions"><span>Hardware Summary</span><div class="card-head-buttons"><button class="btn btn-primary btn-sm" onclick="auditModalRun()">Run audit</button></div></div><div class="card-body"></div></div>`
return `<div class="card"` + cardID + `><div class="card-head card-head-actions"><span>Hardware Summary</span><div class="card-head-buttons"><button class="btn btn-primary btn-sm" onclick="auditModalRun()">Run audit</button></div></div><div class="card-body"></div></div>`
}
var ingest schema.HardwareIngestRequest
if err := json.Unmarshal(data, &ingest); err != nil {
return `<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body"><span class="badge badge-err">Parse error</span></div></div>`
return `<div class="card"` + cardID + `><div class="card-head">Hardware Summary</div><div class="card-body"><span class="badge badge-err">Parse error</span></div></div>`
}
hw := ingest.Hardware
@@ -200,7 +236,7 @@ func renderHardwareSummaryCard(opts HandlerOptions) string {
}
var b strings.Builder
b.WriteString(`<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body">`)
b.WriteString(`<div class="card"` + cardID + `><div class="card-head">Hardware Summary</div><div class="card-body">`)
// Server identity block above the component table.
{
@@ -229,22 +265,32 @@ func renderHardwareSummaryCard(opts HandlerOptions) string {
}
b.WriteString(`<table style="width:auto">`)
writeRow := func(label, value, badgeHTML string) {
b.WriteString(fmt.Sprintf(`<tr><td style="padding:6px 14px 6px 0;font-weight:700;white-space:nowrap">%s</td><td style="padding:6px 0;color:var(--muted);font-size:13px">%s</td><td style="padding:6px 0 6px 12px">%s</td></tr>`,
html.EscapeString(label), html.EscapeString(value), badgeHTML))
// writeRow renders one component row. compType is the URL path segment for the detail
// endpoint (e.g. "cpu"). Pass "" for rows that have no detail view.
writeRow := func(label, value, badgeHTML, compType string) {
var labelHTML string
if compType != "" {
labelHTML = fmt.Sprintf(
`<span style="cursor:pointer;text-decoration:underline dotted;text-underline-offset:3px" onclick="openComponentDetail('%s')">%s</span>`,
compType, html.EscapeString(label))
} else {
labelHTML = html.EscapeString(label)
}
fmt.Fprintf(&b, `<tr><td style="padding:6px 14px 6px 0;font-weight:700;white-space:nowrap">%s</td><td style="padding:6px 0;color:var(--muted);font-size:13px">%s</td><td style="padding:6px 0 6px 12px">%s</td></tr>`,
labelHTML, html.EscapeString(value), badgeHTML)
}
writeRow("CPU", hwDescribeCPU(hw),
renderComponentChips(matchedRecords(records, []string{"cpu:all"}, nil)))
renderComponentChips(matchedRecords(records, []string{"cpu:all"}, nil)), "cpu")
writeRow("Memory", hwDescribeMemory(hw),
renderComponentChips(matchedRecords(records, []string{"memory:all"}, []string{"memory:"})))
renderComponentChips(matchedRecords(records, []string{"memory:all"}, []string{"memory:"})), "memory")
writeRow("Storage", hwDescribeStorage(hw),
renderComponentChips(matchedRecords(records, []string{"storage:all"}, []string{"storage:"})))
renderComponentChips(matchedRecords(records, []string{"storage:all"}, []string{"storage:"})), "storage")
writeRow("GPU", hwDescribeGPU(hw),
renderComponentChips(matchedRecords(records, nil, []string{"pcie:gpu:"})))
renderComponentChips(matchedRecords(records, nil, []string{"pcie:gpu:"})), "gpu")
psuMatched := matchedRecords(records, nil, []string{"psu:"})
if len(psuMatched) == 0 && len(hw.PowerSupplies) > 0 {
@@ -252,10 +298,10 @@ func renderHardwareSummaryCard(opts HandlerOptions) string {
psuStatus := hwPSUStatus(hw.PowerSupplies)
psuMatched = []app.ComponentStatusRecord{{ComponentKey: "psu:ipmi", Status: psuStatus}}
}
writeRow("PSU", hwDescribePSU(hw), renderComponentChips(psuMatched))
writeRow("PSU", hwDescribePSU(hw), renderComponentChips(psuMatched), "psu")
if nicDesc := hwDescribeNIC(hw); nicDesc != "" {
writeRow("Network", nicDesc, "")
writeRow("Network", nicDesc, "", "")
}
b.WriteString(`</table>`)
@@ -614,7 +660,7 @@ func buildRuntimeNetworkRow(health schema.RuntimeHealth) runtimeHealthRow {
if status == "" {
status = "UNKNOWN"
}
issue := runtimeIssueDescriptions(health.Issues, "dhcp_partial", "dhcp_failed")
issue := runtimeIssueDescriptions(health.Issues, "dhcp_failed")
return runtimeHealthRow{Title: "Network", Status: status, Source: "ListInterfaces / DHCP", Issue: issue}
}
@@ -672,12 +718,12 @@ func buildRuntimeServicesRow(health schema.RuntimeHealth) runtimeHealthRow {
nonActive := make([]string, 0)
for _, svc := range health.Services {
state := strings.TrimSpace(strings.ToLower(svc.Status))
// "activating" and "deactivating" are transient states for oneshot services
// (RemainAfterExit=yes) — the service is running normally, not failed.
// Only "failed" and "inactive" (after services should be running) are problems.
// "inactive" is OK for oneshot services that have completed successfully
// (bee-sshsetup, bee-preflight, bee-audit, bee-network, etc.).
// Only "failed" is a genuine problem.
switch state {
case "active", "activating", "deactivating", "reloading":
// OK — service is running or transitioning normally
case "active", "activating", "deactivating", "reloading", "inactive":
// OK — service is running, transitioning normally, or completed successfully
default:
nonActive = append(nonActive, svc.Name+"="+svc.Status)
}
@@ -999,3 +1045,200 @@ func rowIssueHTML(issue string) string {
}
return html.EscapeString(issue)
}
var aerStatusRe = regexp.MustCompile(`aer_status:\s*0x([0-9a-fA-F]{1,8})`)
// decodeAERStatus parses an AER status hex value from a kernel error detail string
// and returns a human-readable list of set bit names with correctable/uncorrectable label,
// or "" if no AER status is found.
func decodeAERStatus(detail string) string {
m := aerStatusRe.FindStringSubmatch(detail)
if m == nil {
return ""
}
v64, err := strconv.ParseUint(m[1], 16, 32)
if err != nil {
return ""
}
val := uint32(v64)
type bitDef struct {
bit uint32
name string
}
corrBits := []bitDef{
{0, "Receiver Error"}, {6, "Bad TLP"}, {7, "Bad DLLP"},
{8, "Replay Num Rollover"}, {12, "Replay Timer Timeout"},
{13, "Advisory Non-Fatal"}, {14, "Corrected Internal Error"}, {15, "Header Log Overflow"},
}
uncorrBits := []bitDef{
{4, "Data Link Protocol Error"}, {5, "Surprise Down Error"},
{12, "Poisoned TLP Received"}, {13, "Flow Control Protocol Error"},
{14, "Completion Timeout"}, {15, "Completer Abort"}, {16, "Unexpected Completion"},
{17, "Receiver Overflow"}, {18, "Malformed TLP"}, {19, "ECRC Error"},
{20, "Unsupported Request Error"}, {21, "ACS Violation"}, {22, "Uncorrectable Internal Error"},
}
var corrNames, uncorrNames []string
for _, b := range corrBits {
if val&(1<<b.bit) != 0 {
corrNames = append(corrNames, b.name)
}
}
for _, b := range uncorrBits {
if val&(1<<b.bit) != 0 {
uncorrNames = append(uncorrNames, b.name)
}
}
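// Several bit positions are defined in both tables and this function is not told which
// AER status register the value came from, so prefer the interpretation that matches
// more bit names. Ties are reported as correctable.
if len(corrNames) >= len(uncorrNames) && len(corrNames) > 0 {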
return strings.Join(corrNames, ", ") + " (correctable)"
}
if len(uncorrNames) > 0 {
return strings.Join(uncorrNames, ", ") + " (uncorrectable)"
}
return fmt.Sprintf("unknown bits: 0x%08x", val)
}
// renderSparkline returns a small inline SVG showing non-OK events over time.
// Events are positioned proportionally along the time axis; if all share the same
// timestamp they are spaced evenly. Width is always 100px.
func renderSparkline(history []app.ComponentStatusEntry) string {
const (
svgW = 100
svgH = 20
barW = 3
barH = 14
)
var events []app.ComponentStatusEntry
for _, e := range history {
if e.Status != "OK" {
events = append(events, e)
}
}
if len(events) == 0 {
return ""
}
n := len(events)
barColor := func(status string) string {
if status == "Critical" {
return "#c0392b"
}
return "#d97706"
}
yTop := (svgH - barH) / 2
var bars strings.Builder
if n == 1 {
x := (svgW - barW) / 2
fmt.Fprintf(&bars, `<rect x="%d" y="%d" width="%d" height="%d" fill="%s" rx="1"/>`,
x, yTop, barW, barH, barColor(events[0].Status))
} else {
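// History is assumed to be in chronological order, so the first and last
// non-OK events bound the time axis.
minT := events[0].At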
maxT := events[n-1].At
dur := maxT.Sub(minT).Seconds()
for i, e := range events {
var x int
if dur <= 0 {
step := svgW / n
x = i*step + (step-barW)/2
} else {
frac := e.At.Sub(minT).Seconds() / dur
x = int(frac * float64(svgW-barW))
}
fmt.Fprintf(&bars, `<rect x="%d" y="%d" width="%d" height="%d" fill="%s" rx="1"/>`,
x, yTop, barW, barH, barColor(e.Status))
}
}
return fmt.Sprintf(
`<svg width="%d" height="%d" style="display:inline-block;vertical-align:middle;margin-left:6px;flex-shrink:0" xmlns="http://www.w3.org/2000/svg">`+
`<rect x="0" y="0" width="%d" height="%d" fill="var(--surface-alt,#ebebeb)" rx="3"/>%s</svg>`,
svgW, svgH, svgW, svgH, bars.String())
}
// renderComponentDetail renders a modal content fragment for one component type.
// Called by handleAPIComponentDetail and displayed inside #component-detail-dialog.
func renderComponentDetail(title string, records []app.ComponentStatusRecord) string {
var b strings.Builder
fmt.Fprintf(&b, `<div style="padding:20px 24px 0">`)
fmt.Fprintf(&b, `<div style="display:flex;align-items:center;justify-content:space-between;margin-bottom:16px">`)
fmt.Fprintf(&b, `<span style="font-size:16px;font-weight:700">%s — Status Detail</span>`, html.EscapeString(title))
b.WriteString(`<button class="btn btn-sm btn-secondary" onclick="document.getElementById('component-detail-dialog').close()">Close</button>`)
b.WriteString(`</div>`)
if len(records) == 0 {
b.WriteString(`<p style="color:var(--muted)">No status data recorded yet for this component type.</p>`)
b.WriteString(`</div>`)
return b.String()
}
sort.Slice(records, func(i, j int) bool {
return records[i].ComponentKey < records[j].ComponentKey
})
for _, rec := range records {
letter, cls := chipLetterClass(rec.Status)
// Count non-OK events across the full history for the badge + sparkline.
warnCount := 0
for _, e := range rec.History {
if e.Status != "OK" {
warnCount++
}
}
fmt.Fprintf(&b, `<div style="margin-bottom:20px">`)
fmt.Fprintf(&b, `<div style="display:flex;align-items:center;gap:8px;margin-bottom:8px;flex-wrap:wrap">`)
fmt.Fprintf(&b, `<span class="chip %s">%s</span>`, cls, letter)
fmt.Fprintf(&b, `<span style="font-weight:700;font-size:13px">%s</span>`, html.EscapeString(rec.ComponentKey))
if !rec.LastCheckedAt.IsZero() {
fmt.Fprintf(&b, `<span style="color:var(--muted);font-size:12px">checked %s</span>`, rec.LastCheckedAt.Format("2006-01-02 15:04:05"))
}
if warnCount > 0 {
noun := "events"
if warnCount == 1 {
noun = "event"
}
fmt.Fprintf(&b,
`<span style="font-size:11px;background:var(--warn-bg,#fffbeb);color:var(--warn-fg,#92400e);border:1px solid var(--warn-border,#fde68a);border-radius:10px;padding:1px 7px;white-space:nowrap">%d %s</span>`,
warnCount, noun)
b.WriteString(renderSparkline(rec.History))
}
b.WriteString(`</div>`)
if rec.ErrorSummary != "" {
fmt.Fprintf(&b, `<div style="font-size:12px;margin-bottom:4px;color:var(--muted)">%s</div>`, html.EscapeString(rec.ErrorSummary))
if decoded := decodeAERStatus(rec.ErrorSummary); decoded != "" {
fmt.Fprintf(&b,
`<div style="font-size:12px;margin-bottom:8px;color:var(--muted)"><span style="background:var(--surface-alt,#f5f5f5);border-radius:4px;padding:1px 6px;font-family:monospace">AER: %s</span></div>`,
html.EscapeString(decoded))
}
}
// History table — newest first, cap at 20 entries.
history := rec.History
if len(history) > 20 {
history = history[len(history)-20:]
}
b.WriteString(`<table style="width:100%;font-size:12px;border-collapse:collapse">`)
b.WriteString(`<tr style="color:var(--muted)"><th style="text-align:left;padding:2px 10px 2px 0;white-space:nowrap">Time</th><th style="text-align:left;padding:2px 10px 2px 0">Status</th><th style="text-align:left;padding:2px 10px 2px 0">Source</th><th style="text-align:left;padding:2px 0">Detail</th></tr>`)
for i := len(history) - 1; i >= 0; i-- {
e := history[i]
eLetter, eCls := chipLetterClass(e.Status)
detail := e.Detail
if detail == "" {
detail = "—"
}
fmt.Fprintf(&b,
`<tr><td style="padding:3px 10px 3px 0;white-space:nowrap;color:var(--muted)">%s</td><td style="padding:3px 10px 3px 0"><span class="chip %s" style="font-size:10px;width:16px;height:16px">%s</span></td><td style="padding:3px 10px 3px 0;white-space:nowrap">%s</td><td style="padding:3px 0;color:var(--muted)">%s</td></tr>`,
html.EscapeString(e.At.Format("2006-01-02 15:04:05")),
eCls, eLetter,
html.EscapeString(e.Source),
html.EscapeString(detail),
)
}
b.WriteString(`</table>`)
b.WriteString(`</div>`)
}
b.WriteString(`</div>`)
return b.String()
}

@@ -221,6 +221,11 @@ func NewHandler(opts HandlerOptions) http.Handler {
h.kmsg = newKmsgWatcher(opts.App.StatusDB)
h.kmsg.start()
globalQueue.kmsgWatcher = h.kmsg
// Start periodic health poller for components that don't emit kernel log events (e.g. PSU).
if opts.App.StatusDB != nil {
newHealthPoller(opts.App.StatusDB).start()
}
}
globalQueue.startWorker(&opts)
@@ -328,6 +333,10 @@ func NewHandler(opts HandlerOptions) http.Handler {
mux.HandleFunc("GET /api/install/disks", h.handleAPIInstallDisks)
mux.HandleFunc("POST /api/install/run", h.handleAPIInstallRun)
// Hardware component detail (fragment for modal in Hardware Summary card)
mux.HandleFunc("GET /api/hardware-summary", h.handleAPIHardwareSummary)
mux.HandleFunc("GET /api/components/{type}", h.handleAPIComponentDetail)
// Metrics — SSE stream of live sensor data + server-side SVG charts + CSV export
mux.HandleFunc("GET /api/metrics/stream", h.handleAPIMetricsStream)
mux.HandleFunc("GET /api/metrics/latest", h.handleAPIMetricsLatest)
@@ -1294,8 +1303,8 @@ const loadingPageHTML = `<!DOCTYPE html>
*{margin:0;padding:0;box-sizing:border-box}
html,body{height:100%;background:#0f1117;display:flex;align-items:center;justify-content:center;font-family:'Courier New',monospace;color:#e2e8f0}
.wrap{text-align:center;width:420px}
.logo{font-size:11px;line-height:1.4;color:#f6c90e;margin-bottom:6px;white-space:pre;text-align:left}
.subtitle{font-size:12px;color:#a0aec0;text-align:left;margin-bottom:24px;padding-left:2px}
.brand{font-size:22px;letter-spacing:.18em;color:#f6c90e;margin-bottom:6px;text-align:left}
.subtitle{font-size:12px;color:#a0aec0;text-align:left;margin-bottom:24px}
.spinner{width:36px;height:36px;border:3px solid #2d3748;border-top-color:#f6c90e;border-radius:50%;animation:spin .8s linear infinite;margin:0 auto 14px}
.spinner.hidden{display:none}
@keyframes spin{to{transform:rotate(360deg)}}
@@ -1313,12 +1322,7 @@ td:first-child{color:#718096;width:55%}
</head>
<body>
<div class="wrap">
<div class="logo"> ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝
█████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗
██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝
███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗
╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝</div>
<div class="brand">EASY BEE</div>
<div class="subtitle">Hardware Audit LiveCD</div>
<div class="spinner" id="spin"></div>
<div class="status" id="st">Connecting to bee-web...</div>
@@ -1328,8 +1332,20 @@ td:first-child{color:#718096;width:55%}
<script>
(function(){
var gone = false;
var pollStarted = false;
var fallbackOpenTimer = null;
var AUTO_OPEN_DELAY_MS = 15000;
function go(){ if(!gone){gone=true;window.location.replace('/');} }
function scheduleFallbackOpen(){
if(fallbackOpenTimer!==null) return;
fallbackOpenTimer=setTimeout(function(){
document.getElementById('spin').className='spinner hidden';
document.getElementById('st').textContent='Startup checks are taking too long — opening app...';
go();
},AUTO_OPEN_DELAY_MS);
}
function icon(s){
if(s==='active') return '<span class="ok">&#9679; active</span>';
if(s==='failed') return '<span class="fail">&#10005; failed</span>';
@@ -1361,6 +1377,7 @@ function pollServices(){
tbl.innerHTML=html;
if(allSettled(svcs)){
clearInterval(pollTimer);
if(fallbackOpenTimer!==null) clearTimeout(fallbackOpenTimer);
document.getElementById('spin').className='spinner hidden';
document.getElementById('st').textContent='Ready \u2014 opening...';
setTimeout(go,800);
@@ -1375,8 +1392,12 @@ function probe(){
if(r.ok){
document.getElementById('st').textContent='bee-web running \u2014 checking services...';
document.getElementById('btn').style.display='';
pollServices();
pollTimer=setInterval(pollServices,1500);
scheduleFallbackOpen();
if(!pollStarted){
pollStarted=true;
pollServices();
pollTimer=setInterval(pollServices,1500);
}
} else {
document.getElementById('st').textContent='bee-web starting (status '+r.status+')...';
setTimeout(probe,500);
@@ -1398,13 +1419,16 @@ func (h *handler) handlePage(w http.ResponseWriter, r *http.Request) {
if page == "" {
page = "dashboard"
}
// Redirect old routes to new names
// Redirect legacy routes to new named pages
switch page {
case "tests":
http.Redirect(w, r, "/validate", http.StatusMovedPermanently)
case "validate", "tests":
http.Redirect(w, r, "/check", http.StatusMovedPermanently)
return
case "burn-in":
http.Redirect(w, r, "/burn", http.StatusMovedPermanently)
case "burn", "burn-in":
http.Redirect(w, r, "/load", http.StatusMovedPermanently)
return
case "benchmark":
http.Redirect(w, r, "/speed", http.StatusMovedPermanently)
return
}
body := renderPage(page, h.opts)

@@ -604,6 +604,25 @@ func TestReadyIsOKWhenAuditPathIsUnset(t *testing.T) {
}
}
func TestLoadingPageHasFallbackAutoOpen(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/loading", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
body := rec.Body.String()
for _, needle := range []string{
`var AUTO_OPEN_DELAY_MS = 15000;`,
`function scheduleFallbackOpen(){`,
`Startup checks are taking too long — opening app...`,
} {
if !strings.Contains(body, needle) {
t.Fatalf("loading page missing %q: %s", needle, body)
}
}
}
func TestAuditPageRendersViewerFrameAndActions(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
@@ -688,13 +707,13 @@ func TestToolsPageRendersNvidiaSelfHealSection(t *testing.T) {
func TestBenchmarkPageRendersGPUSelectionControls(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/benchmark", nil))
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/speed", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
for _, needle := range []string{
`href="/benchmark"`,
`href="/speed"`,
`id="benchmark-gpu-list"`,
`/api/gpu/nvidia`,
`/api/bee-bench/nvidia/perf/run`,
@@ -750,7 +769,7 @@ func TestBenchmarkPageRendersSavedResultsTable(t *testing.T) {
handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/benchmark", nil))
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/speed", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
@@ -772,54 +791,53 @@ func TestBenchmarkPageRendersSavedResultsTable(t *testing.T) {
}
}
func TestValidatePageRendersNvidiaTargetedStressCard(t *testing.T) {
func TestCheckPageRendersGPUSelectionAndNvidiaCards(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/validate", nil))
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/check", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
for _, needle := range []string{
`NVIDIA GPU Targeted Stress`,
`nvidia-targeted-stress`,
`controlled NVIDIA DCGM load`,
`<code>dcgmi diag targeted_stress</code>`,
`NVIDIA GPU Selection`,
`All NVIDIA validate tasks use only the GPUs selected here.`,
`Select All`,
`id="sat-gpu-list"`,
`Select All`,
`id="sat-btn-nvidia"`,
`NVIDIA Interconnect (NCCL)`,
`NVIDIA Bandwidth (NVBandwidth)`,
`Non-destructive`,
} {
if !strings.Contains(body, needle) {
t.Fatalf("validate page missing %q: %s", needle, body)
t.Fatalf("check page missing %q: %s", needle, body)
}
}
}
func TestValidatePageRendersNvidiaFabricCardsInValidateMode(t *testing.T) {
func TestCheckPageRendersNvidiaFabricCards(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/validate", nil))
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/check", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
for _, needle := range []string{
`NVIDIA Interconnect (NCCL)`,
`Validate and Stress:`,
`NVIDIA Bandwidth (NVBandwidth)`,
`nvbandwidth runs all built-in tests without a time limit`,
`nvbandwidth`,
`all_reduce_perf`,
} {
if !strings.Contains(body, needle) {
t.Fatalf("validate page missing %q: %s", needle, body)
t.Fatalf("check page missing %q: %s", needle, body)
}
}
}
func TestBurnPageRendersGoalBasedNVIDIACards(t *testing.T) {
func TestLoadPageRendersGoalBasedNVIDIACards(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/burn", nil))
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/load", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
@@ -828,7 +846,6 @@ func TestBurnPageRendersGoalBasedNVIDIACards(t *testing.T) {
`NVIDIA Max Compute Load`,
`dcgmproftester`,
`NCCL`,
`Validate → Stress mode`,
`id="burn-gpu-list"`,
} {
if !strings.Contains(body, needle) {

bible

Submodule bible updated: 1d89a4918e...1977730d93

@@ -0,0 +1,185 @@
# API Surface
HTTP endpoints exposed by `bee web` (binds `0.0.0.0:80`).
Handler registration: `audit/internal/webui/server.go` → `NewHandler()`.
---
## Health & readiness
| Method | Path | Description |
|--------|----------------|-----------------------------------------------------|
| GET | `/healthz` | Always 200. Used by load balancers / boot scripts. |
| GET | `/api/ready` | 200 when audit JSON exists and is readable. |
| GET | `/loading` | HTML loading page shown before first audit. |
---
## Audit
| Method | Path | Description |
|--------|-----------------------|--------------------------------------------------------------|
| GET | `/audit.json` | Latest audit JSON with SAT overlay applied. |
| GET | `/runtime-health.json`| Latest runtime preflight JSON. |
| POST | `/api/audit/run` | Enqueue a full `bee audit` run. Returns task ID. |
| GET | `/api/audit/stream` | SSE: audit run log lines (`data:` + newline per line). |
| GET | `/api/preflight` | Run runtime preflight check (synchronous, returns JSON). |
| GET | `/api/hardware-summary` | HTML fragment of the Hardware Summary card; the dashboard re-fetches it every 30 s. |
| GET | `/api/components/{type}` | HTML fragment for component detail dialog (e.g. `cpu`, `memory`, `storage`, `pcie`). |
---
## SAT (System Acceptance Testing)
All SAT run endpoints enqueue an async task. Response: `{"task_id": "..."}`.
| Method | Path | Description |
|--------|--------------------------------------------|-----------------------------------|
| POST | `/api/sat/nvidia/run` | NVIDIA DCGM SAT |
| POST | `/api/sat/nvidia-targeted-stress/run` | NVIDIA targeted stress validate |
| POST | `/api/sat/nvidia-compute/run` | NVIDIA max compute load |
| POST | `/api/sat/nvidia-targeted-power/run` | NVIDIA targeted power |
| POST | `/api/sat/nvidia-pulse/run` | NVIDIA pulse test |
| POST | `/api/sat/nvidia-interconnect/run` | NCCL all_reduce_perf |
| POST | `/api/sat/nvidia-bandwidth/run` | NVBandwidth test |
| POST | `/api/sat/nvidia-stress/run` | NVIDIA stress pack |
| POST | `/api/sat/memory/run` | Memory acceptance |
| POST | `/api/sat/storage/run` | Storage acceptance (smartctl) |
| POST | `/api/sat/cpu/run` | CPU acceptance (stress-ng) |
| POST | `/api/sat/amd/run` | AMD GPU SAT (ROCm) |
| POST | `/api/sat/amd-mem/run` | AMD memory integrity + bandwidth |
| POST | `/api/sat/amd-bandwidth/run` | AMD memory bandwidth |
| POST | `/api/sat/amd-stress/run` | AMD GPU stress |
| POST | `/api/sat/memory-stress/run` | Memory stress |
| POST | `/api/sat/sat-stress/run` | Combined storage+memory stress |
| POST | `/api/sat/platform-stress/run` | Fan + thermal stress |
| GET | `/api/sat/stream` | SSE: live SAT log stream |
| POST | `/api/sat/abort` | Abort the running SAT task |
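
A client that starts a SAT run only needs the `task_id` field from the response described above. This sketch shows one way to decode it; `satRunResponse` and `parseTaskID` are hypothetical names, and the example ID is made up.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// satRunResponse mirrors the documented body {"task_id": "..."}.
// The type name is illustrative and not taken from the bee source.
type satRunResponse struct {
	TaskID string `json:"task_id"`
}

// parseTaskID extracts the task ID from a SAT run response body.
func parseTaskID(body string) (string, error) {
	var resp satRunResponse
	if err := json.NewDecoder(strings.NewReader(body)).Decode(&resp); err != nil {
		return "", err
	}
	if resp.TaskID == "" {
		return "", errors.New("response has no task_id")
	}
	return resp.TaskID, nil
}

func main() {
	id, err := parseTaskID(`{"task_id": "example-123"}`)
	fmt.Println(id, err) // prints "example-123 <nil>"
}
```

The returned ID can then be used with the per-task endpoints listed in the Tasks section, for example `/api/tasks/{id}/stream`.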
---
## Benchmarks
| Method | Path | Description |
|--------|-----------------------------------------|----------------------------------------------|
| POST | `/api/bee-bench/nvidia/perf/run` | NVIDIA performance benchmark |
| POST | `/api/bee-bench/nvidia/power/run` | NVIDIA power benchmark |
| POST | `/api/bee-bench/nvidia/autotune/run` | Power source autotune (prerequisite for benchmarks) |
| GET | `/api/bee-bench/nvidia/autotune/status` | Current autotune result / status |
| GET | `/api/benchmark/results` | List completed benchmark result archives |
---
## Tasks (async job queue)
| Method | Path | Description |
|--------|-----------------------------|----------------------------------------------------|
| GET | `/api/tasks` | List all tasks with status |
| POST | `/api/tasks/cancel-all` | Cancel all pending/running tasks |
| POST | `/api/tasks/kill-workers` | Force-kill worker goroutines |
| POST | `/api/tasks/{id}/cancel` | Cancel a specific task |
| POST | `/api/tasks/{id}/priority` | Elevate task priority |
| GET | `/api/tasks/{id}/stream` | SSE: live log stream for a task |
| GET | `/api/tasks/{id}/charts` | List chart names for a task |
| GET | `/api/tasks/{id}/chart/` | SVG chart for a task result |
| GET | `/tasks/{id}` | HTML task detail page |
---
## Services
| Method | Path | Description |
|--------|---------------------------|--------------------------------------------------|
| GET | `/api/services` | List bee-* systemd services and their states |
| POST | `/api/services/action` | start/stop/restart a service |
---
## Network
| Method | Path | Description |
|--------|----------------------------|-----------------------------------------------------|
| GET | `/api/network` | List interfaces with state and IPv4 addresses |
| POST | `/api/network/dhcp` | Run dhclient on one or all interfaces |
| POST | `/api/network/static` | Set static IPv4 address |
| POST | `/api/network/toggle` | Bring interface up or down |
| POST | `/api/network/confirm` | Confirm pending network change (clears rollback) |
| POST | `/api/network/rollback` | Restore pre-change network snapshot |
---
## Export
| Method | Path | Description |
|--------|-------------------------------|---------------------------------------------------|
| GET | `/export/support.tar.gz` | Download support bundle (live-generated) |
| GET | `/export/file` | Download a file from the export dir by path param |
| GET | `/export/` | Browse export dir (HTML index) |
| GET | `/api/export/list` | JSON list of files in export dir |
| GET | `/api/export/usb` | List removable USB targets available for export |
---
## GPU
| Method | Path | Description |
|--------|----------------------------|----------------------------------------------------|
| GET | `/api/gpu/presence` | `{"nvidia": bool, "amd": bool}` |
| GET | `/api/gpu/nvidia` | List NVIDIA GPUs from nvidia-smi |
| GET | `/api/gpu/nvidia-status` | Per-GPU status (ECC, power, throttle) |
| POST | `/api/gpu/nvidia-reset` | GPU reset by index |
| GET | `/api/gpu/tools` | nvidia-smi / rocm-smi tool availability |
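
The `/api/gpu/presence` payload drives which SAT cards the check page disables. The sketch below restates that page logic in Go under stated assumptions: `gpuPresence` and `disabledCards` are hypothetical names, and the card IDs are the ones the page script passes to `disableSATCard`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gpuPresence mirrors the documented shape {"nvidia": bool, "amd": bool}.
type gpuPresence struct {
	NVIDIA bool `json:"nvidia"`
	AMD    bool `json:"amd"`
}

// disabledCards returns the SAT card IDs that should be disabled for the
// given presence payload. Hypothetical helper, not part of the bee source.
func disabledCards(body []byte) ([]string, error) {
	var gp gpuPresence
	if err := json.Unmarshal(body, &gp); err != nil {
		return nil, err
	}
	var ids []string
	if !gp.NVIDIA {
		ids = append(ids, "nvidia", "nvidia-interconnect", "nvidia-bandwidth")
	}
	if !gp.AMD {
		ids = append(ids, "amd")
	}
	return ids, nil
}

func main() {
	ids, err := disabledCards([]byte(`{"nvidia": true, "amd": false}`))
	fmt.Println(ids, err) // prints "[amd] <nil>"
}
```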
---
## System
| Method | Path | Description |
|--------|------------------------------|---------------------------------------------------|
| GET | `/api/system/ram-status` | toram boot state and ISO copy status |
| POST | `/api/system/install-to-ram` | Copy ISO to RAM (background task) |
| GET | `/api/install/disks` | List block devices suitable for disk installation |
| POST | `/api/install/run` | Install bee to disk (background task) |
---
## Tools & NVMe
| Method | Path | Description |
|--------|-------------------------------|--------------------------------------------------|
| GET | `/api/tools/check` | Check availability of required CLI tools |
| GET | `/api/tools/nvme-formats` | List NVMe format options for a device |
| POST | `/api/tools/nvme-format/run` | Run nvme-format on a device |
---
## Live metrics
| Method | Path | Description |
|--------|------------------------------|---------------------------------------------------|
| GET | `/api/metrics/stream` | SSE: live metrics (GPU power, temp, utilization) |
| GET | `/api/metrics/latest` | Latest metrics snapshot (JSON) |
| GET | `/api/metrics/chart/` | SVG chart for a metric over time |
| GET | `/api/metrics/export.csv` | Download metrics history as CSV |
---
## Blackbox logging
| Method | Path | Description |
|--------|----------------------------|-----------------------------------------------|
| GET | `/api/blackbox/status` | Blackbox log state (enabled, size, path) |
| POST | `/api/blackbox/enable` | Start recording blackbox log |
| POST | `/api/blackbox/disable` | Stop recording, flush to disk |
---
## UI pages
| Method | Path | Description |
|--------|------------|-----------------------------------------------|
| GET | `/` | Main dashboard (serves all page routes) |
| GET | `/viewer` | Standalone JSON viewer for uploaded audit files |
All pages are rendered server-side as HTML. The `/` route handles sub-paths such as
`/network`, `/services`, `/sat`, `/benchmark`, `/install`, `/validate`, `/export`.


@@ -0,0 +1,137 @@
# Data Model
The canonical output of `bee audit` is a `HardwareIngestRequest` JSON document accepted
by the Reanimator `/api/ingest/hardware` endpoint. The ingest endpoint uses a strict
decoder — unknown fields cause `400 Bad Request`.
Source of truth: `audit/internal/schema/hardware.go`
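
The effect of a strict decoder can be sketched with the standard library's `json.Decoder.DisallowUnknownFields`. This is a minimal illustration, not Reanimator's actual handler; `ingestRequest` is a hypothetical, trimmed stand-in for `HardwareIngestRequest`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ingestRequest is a hypothetical, trimmed stand-in for HardwareIngestRequest.
type ingestRequest struct {
	CollectedAt string          `json:"collected_at"`
	Hardware    json.RawMessage `json:"hardware"`
	Filename    *string         `json:"filename,omitempty"`
}

// decodeStrict rejects any JSON key that has no matching struct field.
func decodeStrict(body string) (*ingestRequest, error) {
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	var req ingestRequest
	if err := dec.Decode(&req); err != nil {
		return nil, err
	}
	return &req, nil
}

func main() {
	_, err := decodeStrict(`{"collected_at":"2026-06-12T10:00:00Z","hardware":{}}`)
	fmt.Println("known fields only:", err)

	_, err = decodeStrict(`{"collected_at":"2026-06-12T10:00:00Z","hardware":{},"extra":1}`)
	fmt.Println("with unknown field:", err)
}
```

A producer that adds a field before the receiving schema knows about it would hit the second case, which is why the Go types have to track the contract exactly.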
---
## Top-level: HardwareIngestRequest
```
HardwareIngestRequest
├── collected_at string RFC3339 UTC timestamp of collection
├── hardware HardwareSnapshot
├── runtime RuntimeHealth? from bee-runtime-preflight service
├── filename string?
├── source_type string?
├── protocol string?
└── target_host string?
```
`collected_at` is the primary sort key used by Reanimator to deduplicate ingests.
---
## HardwareSnapshot
All component arrays are `omitempty` — absent when the collector finds nothing.
| JSON key | Go type | Source |
|-------------------|----------------------------|------------------------------|
| `board` | HardwareBoard | dmidecode type 1/2 |
| `firmware` | []HardwareFirmwareRecord | dmidecode type 0/13 |
| `cpus` | []HardwareCPU | dmidecode type 4 |
| `memory` | []HardwareMemory | dmidecode type 17 |
| `storage` | []HardwareStorage | lsblk + nvme-cli + smartctl |
| `pcie_devices` | []HardwarePCIeDevice | lspci |
| `power_supplies` | []HardwarePowerSupply | ipmitool fru + sdr |
| `sensors` | *HardwareSensors | sensors -j |
| `event_logs` | []HardwareEventLog | ipmitool sel + journald |
| `platform_config` | *json.RawMessage | reserved, nil until used |
| `vroc_license` | *string | vroc-cli |
---
## Identity keys
Reanimator uses these fields to match components across successive audits:
| Component | Identity key |
|----------------|------------------------------------------------|
| Board | `board.serial_number` (required, never empty) |
| CPU | `serial_number` if present; else generated key |
| Memory DIMM | `serial_number` — absent DIMMs have `present: false` |
| Storage | `serial_number` if present; else `linux_device` from `telemetry` |
| PCIe device | `bdf` (Bus:Device.Function address) |
| PSU | `slot` |
Components without a stable identity are still emitted but may not be matched across runs.
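
The fallback rule for storage can be sketched as follows. The `storage` type is a hypothetical, trimmed view of a storage record, and returning the key source alongside the key is an illustrative choice, not something the schema defines.

```go
package main

import "fmt"

// storage is a hypothetical, trimmed view of a storage record.
type storage struct {
	SerialNumber *string
	LinuxDevice  string // assumed to be taken from the telemetry data
}

// storageIdentity returns the identity key and the field it came from:
// serial_number when present, otherwise linux_device.
func storageIdentity(s storage) (key, source string) {
	if s.SerialNumber != nil && *s.SerialNumber != "" {
		return *s.SerialNumber, "serial_number"
	}
	return s.LinuxDevice, "linux_device"
}

func main() {
	sn := "EXAMPLE-SERIAL-0001" // made-up serial number
	fmt.Println(storageIdentity(storage{SerialNumber: &sn, LinuxDevice: "/dev/nvme0n1"}))
	fmt.Println(storageIdentity(storage{LinuxDevice: "/dev/sdb"}))
}
```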
---
## HardwareComponentStatus (embedded in all components)
```go
type HardwareComponentStatus struct {
Status *string `json:"status,omitempty"` // OK | Warning | Critical | Unknown
ErrorDescription *string `json:"error_description,omitempty"`
}
```
Status is set by collectors and overwritten at render time by `ApplySATOverlay`
(latest SAT run results are always merged on top before display).
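
The "merge on top" behaviour can be sketched like this. It is not the real `ApplySATOverlay`, whose signature is not shown here; `satResult` and `overlayStatus` are hypothetical names, and clearing the error description when the SAT result carries none is an assumption.

```go
package main

import "fmt"

type HardwareComponentStatus struct {
	Status           *string `json:"status,omitempty"`
	ErrorDescription *string `json:"error_description,omitempty"`
}

// satResult is a hypothetical shape for the latest SAT outcome of one component.
type satResult struct {
	Status string
	Detail string
}

// overlayStatus returns the collector status unchanged when no SAT result
// exists, and otherwise replaces it with the SAT result.
func overlayStatus(cur HardwareComponentStatus, sat *satResult) HardwareComponentStatus {
	if sat == nil {
		return cur
	}
	out := HardwareComponentStatus{Status: &sat.Status}
	if sat.Detail != "" {
		out.ErrorDescription = &sat.Detail
	}
	return out
}

func main() {
	ok := "OK"
	collected := HardwareComponentStatus{Status: &ok}
	merged := overlayStatus(collected, &satResult{Status: "Critical", Detail: "memory test failed"})
	fmt.Println(*merged.Status, *merged.ErrorDescription)
}
```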
---
## HardwarePCIeDevice
The most enriched component type. Key fields:
| JSON key | Meaning |
|----------------------|------------------------------------------------|
| `bdf` | PCI address (identity key), e.g. `0000:4b:00.0` |
| `vendor_id` | Numeric PCI vendor ID (hex). Use this for classification — not `manufacturer`. |
| `device_id` | Numeric PCI device ID (hex) |
| `device_class` | Human-readable class, e.g. `VideoController` |
| `manufacturer` | String label from lspci — for display only |
| `model` | From nvidia-smi / rocm-smi — display name |
| `link_speed` | Current PCIe link speed, e.g. `Gen4` |
| `max_link_speed` | Maximum link speed the device supports |
| `link_width` | Current lane count |
| `max_link_width` | Max lane count |
| `temperature_c` | From nvidia-smi / rocm-smi |
| `power_w` | Current power draw |
| `ecc_uncorrected_total` | Cumulative ECC uncorrected errors (NVIDIA) |
| `ecc_corrected_total` | Cumulative ECC corrected errors (NVIDIA) |
| `hw_slowdown` | HW throttle active (NVIDIA) |
| `telemetry` | Free-form map for vendor-specific extras |
**Classification rule**: use `vendor_id` (numeric PCI ID), never `manufacturer` string.
| Vendor | vendor_id |
|-----------|-----------|
| NVIDIA | `0x10de` |
| AMD | `0x1002` |
| Mellanox | `0x15b3` |
| Aspeed | `0x1a03` |
| Intel | `0x8086` |
Constants live in `audit/internal/collector/pci_vendors.go`.
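
A minimal sketch of classification by numeric ID, using the values from the table above. The constant and function names are hypothetical; the real constants are in `pci_vendors.go` and may be named differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Vendor IDs from the table above. Names are illustrative only.
const (
	vendorNVIDIA   uint16 = 0x10de
	vendorAMD      uint16 = 0x1002
	vendorMellanox uint16 = 0x15b3
	vendorAspeed   uint16 = 0x1a03
	vendorIntel    uint16 = 0x8086
)

// parseVendorID accepts hex strings such as "0x10de" or "10DE".
func parseVendorID(s string) (uint16, error) {
	s = strings.TrimPrefix(strings.ToLower(strings.TrimSpace(s)), "0x")
	v, err := strconv.ParseUint(s, 16, 16)
	if err != nil {
		return 0, fmt.Errorf("invalid vendor id %q: %w", s, err)
	}
	return uint16(v), nil
}

// vendorName classifies by numeric ID only; the manufacturer string is never consulted.
func vendorName(id uint16) string {
	switch id {
	case vendorNVIDIA:
		return "NVIDIA"
	case vendorAMD:
		return "AMD"
	case vendorMellanox:
		return "Mellanox"
	case vendorAspeed:
		return "Aspeed"
	case vendorIntel:
		return "Intel"
	}
	return "unknown"
}

func main() {
	id, err := parseVendorID("0x10DE")
	fmt.Println(vendorName(id), err)
}
```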
---
## HardwareMemory
`location` field exists in the Go struct with `json:"-"` — it is intentionally excluded
from JSON output because the Reanimator schema does not include it. It is used internally
for DIMM telemetry matching only (`collector/memory_telemetry.go`).
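
The `json:"-"` tag behaves as in the sketch below: the field stays usable in Go code but never reaches the serialized document. The `memory` type and its `slot` field are hypothetical stand-ins, not the real `HardwareMemory` definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memory is a hypothetical, trimmed stand-in for HardwareMemory.
type memory struct {
	Slot     string `json:"slot"`
	Location string `json:"-"` // internal only, never serialized
}

// encodeMemory marshals a record the way the audit output would be written.
func encodeMemory(m memory) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	out, err := encodeMemory(memory{Slot: "DIMM_A1", Location: "CPU0_CH0"})
	fmt.Println(out, err) // prints {"slot":"DIMM_A1"} <nil>
}
```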
---
## HardwareSensors
Sensor structs (`HardwareFanSensor`, `HardwareTemperatureSensor`,
`HardwarePowerSensor`, `HardwareOtherSensor`) do **not** have a `location` field.
Location was removed in contract v2.8. The Go types mirror the schema exactly.
---
## JSON naming convention
All JSON keys are `snake_case`. Go field names are `CamelCase`. The mapping is
maintained by struct tags in `audit/internal/schema/hardware.go`.
All pointer fields use `omitempty` — absent means not collected (not zero).
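
The difference between "not collected" and "collected as zero" can be seen in a short sketch. The `reading` type is hypothetical; the two keys are borrowed from the PCIe table above, and `float64` is an assumed value type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// reading is a hypothetical struct using pointer fields with omitempty.
type reading struct {
	TemperatureC *float64 `json:"temperature_c,omitempty"`
	PowerW       *float64 `json:"power_w,omitempty"`
}

// encodeReading marshals a reading to its JSON text.
func encodeReading(r reading) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	zero := 0.0
	// A nil pointer is omitted; a pointer to zero is emitted as 0.
	out, err := encodeReading(reading{PowerW: &zero})
	fmt.Println(out, err) // prints {"power_w":0} <nil>
}
```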


@@ -0,0 +1,41 @@
# Decision: Skip PCIe link-speed warnings for disabled devices
**Date:** 2026-06-12
**Status:** active
## Context
On HGX H100 SXM5 baseboards, the Microchip Switchtec PM41028 PSX PCIe switch
(vendor 11F8, device 4128, NVIDIA subsystem 10DE:1643) appears in `lspci` as a
"Memory controller". Its upstream link trains at Gen3 x2 while the device is
capable of Gen4 x16. The device is permanently in a disabled state: memory access
and bus-mastering are both off (Mem-, BusMaster-); `/sys/bus/pci/devices/<bdf>/enable`
reads `0`.
This chip is the PCIe fabric management endpoint for the NVSwitch interconnect — it
carries only management traffic at low bandwidth and is intentionally not activated
by any Linux driver. The bee audit was reporting a `statusWarning` with message
"PCIe link speed degraded" for this device, which is misleading because the device
is not in the data path.
## Decision
`applyPCIeLinkSpeedWarning` reads `/sys/bus/pci/devices/<bdf>/enable` via the
existing `readPCIIntAttribute` helper. If the value is `0` the function returns
early without setting any warning status.
The check is vendor-agnostic: it applies to any PCIe device that Linux has not
activated, regardless of make or model. This is consistent with the
`no-hardcoded-vendors` contract — no vendor ID, device ID, or name string is
used as a condition.
## Consequences
- PCIe fabric management endpoints, IPMI virtual devices, and other permanently
disabled PCIe functions no longer produce spurious link-degradation warnings.
- Real link degradation on active devices (GPUs, NICs, NVMe, NVLink bridges)
continues to be detected and reported as before.
- NVLink bridge cards retain their existing `statusCritical` path (they are always
enabled, so the early return is never taken for them).
- The Switchtec device on HGX H100 boards shows `statusOK` with no
`error_description` in the audit JSON.


@@ -7,3 +7,4 @@ One file per decision, named `YYYY-MM-DD-short-topic.md`.
| 2026-03-05 | Use NVIDIA proprietary driver | active |
| 2026-04-01 | Treat memtest as explicit ISO content | active |
| 2026-04-29 | Treat embedded submodules as read-only | active |
| 2026-06-12 | Skip PCIe link-speed warnings for disabled devices | active |


@@ -0,0 +1,312 @@
# GRUB Bitmap Error History
## Symptom
On some servers GRUB prints:
```text
error: null src bitmap in grub_video_bitmap_create_scaled.
Press any key to continue...
```
The important new observation as of `v10.7` is:
- the error still appears even when the logo image block is removed from
`iso/builder/config/bootloaders/grub-efi/live-theme/theme.txt`
- therefore the current error can no longer be explained only by
`bee-logo.png` / `bee-logo.tga`
That does not prove the theme system is healthy. It proves only that the
currently remaining failure is deeper than "bad logo file".
## Current State
Current source files:
- `iso/builder/config/bootloaders/grub-efi/live-theme/theme.txt`
has no `image` block anymore
- `iso/builder/config/bootloaders/grub-efi/config.cfg`
still does `insmod tga` and then `source /boot/grub/theme.cfg`
Implication:
- if the error still fires, the trigger is likely elsewhere in GRUB theme
rendering or in the assets/config GRUB resolves while sourcing `theme.cfg`
- the old "PNG parser fragility" story is no longer a sufficient explanation
for the current failure mode
Current artifact facts:
- the provided `easy-bee-nvidia-v10.7-amd64.logs` build logs reference
`linux-image-6.1.0-45`
- the provided `easy-bee-nvidia-v10.7-amd64.iso` contains
`live/initrd.img-6.1.0-45-amd64` and `live/vmlinuz-6.1.0-45-amd64`
- a later `BOOT FAILED!` screenshot showed `live/initrd.img-6.1.0-44-amd64`
and `live/vmlinuz-6.1.0-44-amd64`
Implication:
- the `BOOT FAILED!` screenshot is not from the same artifact as the provided
`v10.7` ISO/log set
- until the exact ISO filename and checksum are tied to that failure, the
GRUB bitmap issue and the live-boot failure must be treated as separate
problems
## Chronology
### 1. Initial bee GRUB theme introduction
Relevant commit:
- `d52ec67` `Stability hardening, build script fixes, GRUB bee logo`
What changed:
- bee-branded GRUB theme introduced
- image block with explicit `width` / `height`
Observed result:
- bitmap error appeared
### 2. Remove explicit scaling dimensions
Relevant commit:
- `aa284ae` `fix(iso): avoid grub logo scaling error`
What changed:
- removed `width = 400`
- removed `height = 400`
Reason stated by the change:
- try to avoid the scaling path
Observed result:
- error persisted
Conclusion:
- explicit width/height were not the sole trigger
### 3. Rework PNG handling and menu rendering
Relevant commit:
- `6112094` `fix(grub): fix bitmap error and menu rendering`
Commit message says the change was intended to:
- convert `bee-logo.png` to RGBA and strip metadata
- move `terminal_output gfxterm` before `insmod png` / theme load
- remove ASCII banner from GRUB menu area
- fix theme typography/layout fields
Observed result:
- error persisted
Notes:
- this was still operating under the assumption that the issue was the PNG
payload or the order of gfxterm/theme init
### 4. Convert logo PNG back to RGB
Relevant commit:
- `333c44f` `Fix GRUB splash: convert bee-logo.png from RGBA to RGB`
Intended reason:
- GRUB might dislike RGBA PNG and want RGB PNG
Observed result:
- error still persisted according to later project notes
### 5. Add post-build canonical GRUB/isolinux sync
Relevant commit:
- `0cdfbc5` `fix(iso): restore boot UX and boot logs`
What this introduced:
- post-`lb build` rewriting of `binary/boot/grub/grub.cfg`
- post-`lb build` rewriting of `binary/isolinux/live.cfg`
- forced rebuild of `binary_checksums`, `binary_iso`, `binary_zsync`
Why it was added:
- restore canonical EASY-BEE boot UX after live-build wrote its own bootloader
outputs
- restore expected boot menu and logs
Important note:
- this commit did not directly solve the bitmap issue
- it added a second layer of bootloader mutation after live-build
### 6. Switch from PNG to TGA
Relevant commit:
- `626763e` `Fix GRUB bitmap error: switch from PNG to TGA for splash logo`
Commit message says:
- GRUB PNG reader was considered fragile
- switch to uncompressed 24-bit TGA
- `config.cfg`: `insmod png` -> `insmod tga`
- `theme.txt`: `bee-logo.png` -> `bee-logo.tga`
Observed result:
- this did not eliminate the problem in the current lineage
- the error still occurs today, even after the entire image block was removed
Conclusion:
- switching PNG -> TGA was not a durable root-cause fix
### 7. Patch EFI image after build
Relevant commit:
- `4f20c92` `Make UEFI boot safe and remove GRUB logo`
What this introduced:
- `sync_efi_grub_theme_assets`
- direct `mtools` patching of `efi.img`
- copying `config.cfg`, `theme.cfg`, and `live-theme/*` into the EFI FAT image
- removal of the logo image block from `theme.txt`
Why it was added:
- make UEFI path "safe"
- keep EFI GRUB image aligned with canonical bootloader assets
Observed result:
- later this became the direct cause of `Disk full` during build once
`bee-logo.tga` was large enough
- and even with the logo removed from `theme.txt`, the bitmap error still
remained
Conclusion:
- EFI post-build patching increased build complexity
- removing the logo alone did not remove the runtime GRUB error
### 8. Remove ASCII logo banners
Relevant commit:
- `14505ef` `Remove easy bee ASCII logo banners`
What changed:
- web loading page ASCII cleanup only
Relevance here:
- none for GRUB bitmap error
- included here only to avoid confusion with other "logo removal" work
### 9. Remove EFI post-build patching
Relevant commit:
- `5dc022d` `Drop post-build EFI bootloader patching`
Why it was done:
- stop mutating `efi.img` post-build
- remove dependence on `mtools` for EFI patching
- remove the `Disk full` failure mode
Impact:
- this did not target the GRUB bitmap error directly
- it targeted build-system complexity and EFI image overflow
### 10. Restore only GRUB/isolinux post-build sync
Relevant commit:
- `42774d4` `Restore post-build GRUB and isolinux sync`
Why it was needed:
- removing all post-build sync caused final ISO validation to fail with
missing canonical EASY-BEE boot entries
- memtest was still fine, but final GRUB menu was no longer canonical
What it restored:
- only `binary/boot/grub/grub.cfg`
- only `binary/isolinux/live.cfg`
What it did not restore:
- no EFI FAT image patching
- no `mtools` path
## What Is Proven False
The current evidence rules out several simplistic explanations:
- "the error is only caused by explicit image scaling"
- "the error is only caused by PNG vs TGA"
- "the error is only caused by the logo file itself"
Why:
- scaling dimensions were removed and the error persisted
- PNG was replaced with TGA and the error persisted
- the image block itself is now absent, and the error still occurs
## Working Hypotheses Left
The remaining plausible layers are:
- GRUB theme engine still tries to render some bitmap-related element even
without the logo image block
- GRUB is resolving stale theme assets from the built EFI/ISO path rather than
what we think the source tree says
- `theme.cfg` / `theme.txt` / GRUB module loading order still triggers a bitmap
code path elsewhere
- live-build may still package a stale `theme.txt` or stale `live-theme`
directory into the final image
- the GRUB environment on the failing hardware may behave differently from the
assumptions in our source tree
## Decision Boundary
Before making another change, the next step should be evidence gathering from
the real built artifact, not another speculative edit.
That means checking on the actual built ISO or EFI image:
- exact `boot/grub/theme.cfg`
- exact `boot/grub/live-theme/theme.txt`
- exact contents of `boot/grub/live-theme/`
- whether the final image still contains a stale logo reference
- whether the EFI path and non-EFI path differ
## Relevant Commits
- `d52ec67` `Stability hardening, build script fixes, GRUB bee logo`
- `aa284ae` `fix(iso): avoid grub logo scaling error`
- `6112094` `fix(grub): fix bitmap error and menu rendering`
- `333c44f` `Fix GRUB splash: convert bee-logo.png from RGBA to RGB`
- `0cdfbc5` `fix(iso): restore boot UX and boot logs`
- `626763e` `Fix GRUB bitmap error: switch from PNG to TGA for splash logo`
- `4f20c92` `Make UEFI boot safe and remove GRUB logo`
- `5dc022d` `Drop post-build EFI bootloader patching`
- `42774d4` `Restore post-build GRUB and isolinux sync`


@@ -9,7 +9,7 @@ NCCL_TESTS_VERSION=2.13.10
NVCC_VERSION=12.8
CUBLAS_VERSION=13.1.1.3-1
CUDA_USERSPACE_VERSION=13.0.96-1
DCGM_VERSION=4.5.3-1
DCGM_VERSION=4.6.0-1
JOHN_JUMBO_COMMIT=67fcf9fe5a
ROCM_VERSION=6.3.4
ROCM_SMI_VERSION=7.4.0.60304-76~22.04


@@ -16,6 +16,12 @@ else
LB_LINUX_PACKAGES="linux-image"
fi
if [ -n "${BEE_ISO_VOLUME:-}" ]; then
LB_ISO_VOLUME="${BEE_ISO_VOLUME}"
else
LB_ISO_VOLUME="EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}"
fi
lb config noauto \
--distribution bookworm \
--architectures amd64 \
@@ -30,9 +36,9 @@ lb config noauto \
--linux-flavours "amd64" \
--linux-packages "${LB_LINUX_PACKAGES}" \
--memtest memtest86+ \
--iso-volume "EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--iso-volume "${LB_ISO_VOLUME}" \
--iso-application "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--bootappend-live "boot=live components video=1920x1080 console=ttyS0,115200n8 console=tty0 loglevel=3 systemd.show_status=1 username=bee user-fullname=Bee modprobe.blacklist=nouveau,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_generic,soundcore" \
--bootappend-live "boot=live live-media=/dev/disk/by-label/${LB_ISO_VOLUME} live-media-label=${LB_ISO_VOLUME} components video=1920x1080 console=ttyS0,115200n8 console=tty0 loglevel=3 systemd.show_status=1 username=bee user-fullname=Bee modprobe.blacklist=nouveau,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_generic,soundcore" \
--debootstrap-options "--include=ca-certificates" \
--apt-recommends false \
--chroot-squashfs-compression-type zstd \


@@ -8,7 +8,7 @@ BUILDER_DIR="${REPO_ROOT}/iso/builder"
CONTAINER_TOOL="${CONTAINER_TOOL:-docker}"
IMAGE_TAG="${BEE_BUILDER_IMAGE:-bee-iso-builder}"
BUILDER_PLATFORM="${BEE_BUILDER_PLATFORM:-linux/amd64}"
CACHE_DIR="${BEE_BUILDER_CACHE_DIR:-${REPO_ROOT}/dist/container-cache}"
CACHE_DIR="${BEE_BUILDER_CACHE_DIR:-${REPO_ROOT}/dist/cache}"
AUTH_KEYS=""
CLEAN_CACHE=0
VARIANT="all"
@@ -54,14 +54,14 @@ if [ "$CLEAN_CACHE" = "1" ]; then
"${CACHE_DIR:?}/bee" \
"${CACHE_DIR:?}/lb-packages"
echo "=== cleaning live-build work dirs ==="
rm -rf "${REPO_ROOT}/dist/live-build-work-nvidia"
rm -rf "${REPO_ROOT}/dist/live-build-work-nvidia-legacy"
rm -rf "${REPO_ROOT}/dist/live-build-work-amd"
rm -rf "${REPO_ROOT}/dist/live-build-work-nogpu"
rm -rf "${REPO_ROOT}/dist/overlay-stage-nvidia"
rm -rf "${REPO_ROOT}/dist/overlay-stage-nvidia-legacy"
rm -rf "${REPO_ROOT}/dist/overlay-stage-amd"
rm -rf "${REPO_ROOT}/dist/overlay-stage-nogpu"
rm -rf "${REPO_ROOT}/dist/cache/live-build-work-nvidia"
rm -rf "${REPO_ROOT}/dist/cache/live-build-work-nvidia-legacy"
rm -rf "${REPO_ROOT}/dist/cache/live-build-work-amd"
rm -rf "${REPO_ROOT}/dist/cache/live-build-work-nogpu"
rm -rf "${REPO_ROOT}/dist/cache/overlay-stage-nvidia"
rm -rf "${REPO_ROOT}/dist/cache/overlay-stage-nvidia-legacy"
rm -rf "${REPO_ROOT}/dist/cache/overlay-stage-amd"
rm -rf "${REPO_ROOT}/dist/cache/overlay-stage-nogpu"
echo "=== caches cleared, proceeding with build ==="
fi


@@ -51,8 +51,8 @@ case "$BUILD_VARIANT" in
;;
esac
BUILD_WORK_DIR="${DIST_DIR}/live-build-work-${BUILD_VARIANT}"
OVERLAY_STAGE_DIR="${DIST_DIR}/overlay-stage-${BUILD_VARIANT}"
BUILD_WORK_DIR="${DIST_DIR}/cache/live-build-work-${BUILD_VARIANT}"
OVERLAY_STAGE_DIR="${DIST_DIR}/cache/overlay-stage-${BUILD_VARIANT}"
export BEE_GPU_VENDOR BEE_NVIDIA_MODULE_FLAVOR BUILD_VARIANT
@@ -63,7 +63,7 @@ export PATH="$PATH:/usr/local/go/bin"
# Allow git to read the bind-mounted repo (different UID inside container).
git config --global safe.directory "${REPO_ROOT}"
mkdir -p "${DIST_DIR}"
mkdir -p "${DIST_DIR}/cache" "${DIST_DIR}/release"
mkdir -p "${CACHE_ROOT}"
: "${GOCACHE:=${CACHE_ROOT}/go-build}"
: "${GOMODCACHE:=${CACHE_ROOT}/go-mod}"
@@ -516,12 +516,12 @@ validate_iso_live_boot_entries() {
exit 1
fi
grep -q 'menuentry "EASY-BEE"' "$grub_cfg" || {
grep -q 'menuentry "EASY-BEE v' "$grub_cfg" || {
echo "ERROR: GRUB default EASY-BEE entry is missing" >&2
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
grep -q 'menuentry "EASY-BEE -- load to RAM (toram)"' "$grub_cfg" || {
grep -q 'menuentry "EASY-BEE v.* -- load to RAM (toram)"' "$grub_cfg" || {
echo "ERROR: GRUB toram entry is missing" >&2
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
@@ -536,6 +536,11 @@ validate_iso_live_boot_entries() {
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
grep -q 'linux .*live-media-label=EASY_BEE_' "$grub_cfg" || {
echo "ERROR: GRUB live entry is missing live-media-label pinning" >&2
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
grep -q 'append .*boot=live ' "$isolinux_cfg" || {
echo "ERROR: isolinux live entry is missing boot=live" >&2
@@ -547,45 +552,48 @@ validate_iso_live_boot_entries() {
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
grep -q 'append .*live-media-label=EASY_BEE_' "$isolinux_cfg" || {
echo "ERROR: isolinux live entry is missing live-media-label pinning" >&2
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
rm -f "$grub_cfg" "$isolinux_cfg"
echo "=== live boot validation OK ==="
}
validate_iso_grub_theme_assets() {
validate_iso_grub_assets() {
iso_path="$1"
echo "=== validating GRUB theme assets in ISO ==="
echo "=== validating GRUB assets in ISO ==="
[ -f "$iso_path" ] || {
echo "ERROR: ISO not found for GRUB theme validation: $iso_path" >&2
echo "ERROR: ISO not found for GRUB asset validation: $iso_path" >&2
exit 1
}
require_iso_reader "$iso_path" >/dev/null 2>&1 || {
echo "ERROR: ISO reader unavailable for GRUB theme validation" >&2
echo "ERROR: ISO reader unavailable for GRUB asset validation" >&2
exit 1
}
iso_files="$(mktemp)"
iso_list_files "$iso_path" > "$iso_files" || {
echo "ERROR: failed to list ISO files for GRUB theme validation" >&2
echo "ERROR: failed to list ISO files for GRUB asset validation" >&2
rm -f "$iso_files"
exit 1
}
for required in \
boot/grub/config.cfg \
boot/grub/theme.cfg \
boot/grub/live-theme/theme.txt \
boot/grub/live-theme/bee-logo.tga; do
boot/grub/grub.cfg; do
grep -q "^${required}$" "$iso_files" || {
echo "ERROR: missing GRUB theme asset in ISO: ${required}" >&2
echo "ERROR: missing GRUB asset in ISO: ${required}" >&2
rm -f "$iso_files"
exit 1
}
done
rm -f "$iso_files"
echo "=== GRUB theme validation OK ==="
echo "=== GRUB asset validation OK ==="
}
validate_iso_nvidia_runtime() {
@@ -600,29 +608,37 @@ validate_iso_nvidia_runtime() {
squashfs_tmp="$(mktemp)"
squashfs_list="$(mktemp)"
iso_read_member "$iso_path" live/filesystem.squashfs "$squashfs_tmp" || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "failed to extract live/filesystem.squashfs from ISO"
}
unsquashfs -ll "$squashfs_tmp" > "$squashfs_list" 2>/dev/null || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "failed to inspect filesystem.squashfs from ISO"
iso_files="$(mktemp)"
iso_list_files "$iso_path" > "$iso_files" || {
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "failed to list ISO files for NVIDIA runtime validation"
}
grep '^live/.*\.squashfs$' "$iso_files" | while IFS= read -r squashfs_member; do
iso_read_member "$iso_path" "$squashfs_member" "$squashfs_tmp" || {
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "failed to extract $squashfs_member from ISO"
}
unsquashfs -ll "$squashfs_tmp" >> "$squashfs_list" 2>/dev/null || {
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "failed to inspect $squashfs_member from ISO"
}
: > "$squashfs_tmp"
done
grep -Eq 'usr/bin/dcgmi$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "dcgmi missing from final NVIDIA ISO"
}
grep -Eq 'usr/bin/nv-hostengine$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "nv-hostengine missing from final NVIDIA ISO"
}
grep -Eq 'usr/bin/dcgmproftester([0-9]+)?$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "dcgmproftester missing from final NVIDIA ISO"
}
rm -f "$squashfs_tmp" "$squashfs_list"
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
echo "=== NVIDIA runtime validation OK ==="
}
@@ -716,20 +732,25 @@ write_canonical_grub_cfg() {
kernel="$2"
append_live="$3"
initrd="$4"
version_label="${PROJECT_VERSION_EFFECTIVE}"
cat > "$cfg" <<EOF
source /boot/grub/config.cfg
menuentry "EASY-BEE" {
linux ${kernel} ${append_live} bee.display=kms bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
menuentry "EASY-BEE v${version_label}" {
linux ${kernel} ${append_live} nomodeset bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
menuentry "EASY-BEE -- load to RAM (toram)" {
linux ${kernel} ${append_live} toram bee.display=kms bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
menuentry "EASY-BEE v${version_label} -- load to RAM (toram)" {
linux ${kernel} ${append_live} toram nomodeset bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
menuentry "EASY-BEE v${version_label} -- no GUI / no X11" {
linux ${kernel} ${append_live} nomodeset bee.gui=off bee.nvidia.mode=gsp-off pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
if [ "\${grub_platform}" = "efi" ]; then
menuentry "Memory Test (memtest86+)" {
@@ -754,21 +775,28 @@ write_canonical_isolinux_cfg() {
kernel="$2"
initrd="$3"
append_live="$4"
version_label="${PROJECT_VERSION_EFFECTIVE}"
cat > "$cfg" <<EOF
label live-@FLAVOUR@-normal
menu label ^EASY-BEE
menu label ^EASY-BEE v${version_label}
linux ${kernel}
initrd ${initrd}
append ${append_live} nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM)
menu label EASY-BEE v${version_label} (^load to RAM)
menu default
linux ${kernel}
initrd ${initrd}
append ${append_live} toram nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-console
menu label EASY-BEE v${version_label} (^no GUI / no X11)
linux ${kernel}
initrd ${initrd}
append ${append_live} nomodeset bee.gui=off bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-gsp-off
menu label EASY-BEE (^NVIDIA GSP=off)
linux ${kernel}
@@ -812,10 +840,7 @@ enforce_live_build_bootloader_assets() {
if [ -f "$grub_cfg" ]; then
if extract_live_grub_entry "$grub_cfg"; then
mkdir -p "$grub_dir/live-theme"
cp "${BUILDER_DIR}/config/bootloaders/grub-efi/config.cfg" "$grub_dir/config.cfg"
cp "${BUILDER_DIR}/config/bootloaders/grub-efi/theme.cfg" "$grub_dir/theme.cfg"
cp -R "${BUILDER_DIR}/config/bootloaders/grub-efi/live-theme/." "$grub_dir/live-theme/"
write_canonical_grub_cfg "$grub_cfg" "$grub_kernel" "${live_build_append:-$grub_append}" "$grub_initrd"
echo "bootloader sync: rewrote binary/boot/grub/grub.cfg with canonical EASY-BEE menu"
else
@@ -869,8 +894,11 @@ FULL_BUILD_MARKER="${BUILD_WORK_DIR}/.bee-full-build-marker"
# hooks, archives, Dockerfile, auto/config) require a full lb build.
needs_full_build() {
[ -f "${FULL_BUILD_MARKER}" ] || return 0
[ -f "${BUILD_WORK_DIR}/binary/live/filesystem.squashfs" ] || return 0
[ -f "${BUILD_WORK_DIR}/live-image-amd64.hybrid.iso" ] || return 0
# Accept any versioned squashfs (filesystem-v*.squashfs or legacy filesystem.squashfs)
_any_sq=$(find "${BUILD_WORK_DIR}/binary/live" -maxdepth 1 \
-name 'filesystem*.squashfs' 2>/dev/null | head -1)
[ -n "$_any_sq" ] || return 0
_heavy=$(find \
"${BUILDER_DIR}/VERSIONS" \
@@ -893,40 +921,109 @@ needs_full_build() {
# Fast-path: unsquash existing filesystem, rsync overlay on top, repack.
# Requires ~10 GB free in BEE_CACHE_DIR for the unpacked squashfs.
fast_path_repack_squashfs() {
_sq="${BUILD_WORK_DIR}/binary/live/filesystem.squashfs"
_old_sq=$(find "${BUILD_WORK_DIR}/binary/live" -maxdepth 1 \
-name 'filesystem*.squashfs' | sort | head -1)
_sq="${BUILD_WORK_DIR}/binary/live/${SQUASHFS_FILENAME}"
_tmp="${BEE_CACHE_DIR}/fast-unsquash-${BUILD_VARIANT}"
echo "=== fast-path: unsquash ($(du -sh "$_sq" | cut -f1) compressed) ==="
echo "=== fast-path: unsquash $(basename "$_old_sq") ($(du -sh "$_old_sq" | cut -f1) compressed) ==="
rm -rf "$_tmp"
unsquashfs -d "$_tmp" "$_sq"
unsquashfs -d "$_tmp" "$_old_sq"
echo "=== fast-path: syncing overlay stage ==="
rsync -a --checksum "${OVERLAY_STAGE_DIR}/" "$_tmp/"
echo "=== fast-path: repacking squashfs ==="
echo "=== fast-path: repacking as ${SQUASHFS_FILENAME} ==="
_sq_new="${_sq}.new"
rm -f "$_sq_new"
mksquashfs "$_tmp" "$_sq_new" -comp zstd -b 1048576 -noappend -no-progress
mksquashfs "$_tmp" "$_sq_new" -comp zstd -b 1048576 -noappend -no-progress -no-xattrs
mv "$_sq_new" "$_sq"
rm -rf "$_tmp"
[ "$_old_sq" != "$_sq" ] && rm -f "$_old_sq"
echo "=== fast-path: squashfs repacked ($(du -sh "$_sq" | cut -f1)) ==="
}
# Fast-path: rebuild ISO by replacing only live/filesystem.squashfs via xorriso.
# Fast-path: rebuild ISO replacing the squashfs via xorriso.
# Boot structure (El Torito, EFI, MBR hybrid) is replayed from the prior ISO.
fast_path_rebuild_iso() {
_sq="${BUILD_WORK_DIR}/binary/live/filesystem.squashfs"
_sq="${BUILD_WORK_DIR}/binary/live/${SQUASHFS_FILENAME}"
_prior="${BUILD_WORK_DIR}/live-image-amd64.hybrid.iso"
_new="${BUILD_WORK_DIR}/live-image-amd64.hybrid.iso.new"
echo "=== fast-path: rebuilding ISO with xorriso ==="
rm -f "$_new"
# Remove any old squashfs entries from the prior ISO before adding the new one
_old_entries=$(xorriso -indev "$_prior" -find /live -name 'filesystem*.squashfs' -- 2>/dev/null \
| grep -E '^/live/filesystem.*\.squashfs$' || true)
_rm_args=""
for _e in $_old_entries; do
_rm_args="$_rm_args -rm $_e --"
done
# shellcheck disable=SC2086
xorriso \
-indev "$_prior" \
-outdev "$_new" \
-map "$_sq" /live/filesystem.squashfs \
${_rm_args} \
-map "$_sq" /live/${SQUASHFS_FILENAME} \
-boot_image any replay \
-commit
mv "$_new" "$_prior"
echo "=== fast-path: ISO rebuilt ==="
}
dir_has_entries() {
_dir="$1"
[ -d "$_dir" ] || return 1
find "$_dir" -mindepth 1 -print -quit 2>/dev/null | grep -q .
}
move_tree_to_layer() {
_src_root="$1"
_rel="$2"
_dst_root="$3"
[ -e "${_src_root}/${_rel}" ] || return 0
mkdir -p "${_dst_root}/$(dirname "$_rel")"
mv "${_src_root}/${_rel}" "${_dst_root}/${_rel}"
}
split_live_squashfs_layers() {
lb_dir="$1"
live_dir="${lb_dir}/binary/live"
base_sq="${live_dir}/filesystem.squashfs"
usr_sq="${live_dir}/10-usr.squashfs"
fw_sq="${live_dir}/20-firmware.squashfs"
[ -f "$base_sq" ] || return 0
command -v unsquashfs >/dev/null 2>&1 || return 0
command -v mksquashfs >/dev/null 2>&1 || return 0
tmp_root="$(mktemp -d)"
tmp_usr="$(mktemp -d)"
tmp_fw="$(mktemp -d)"
echo "=== splitting live squashfs into smaller layers ==="
unsquashfs -d "$tmp_root/root" "$base_sq" >/dev/null
mkdir -p "$tmp_usr/root" "$tmp_fw/root"
move_tree_to_layer "$tmp_root/root" "usr" "$tmp_usr/root"
move_tree_to_layer "$tmp_root/root" "lib/firmware" "$tmp_fw/root"
move_tree_to_layer "$tmp_root/root" "usr/lib/firmware" "$tmp_fw/root"
move_tree_to_layer "$tmp_root/root" "boot/firmware" "$tmp_fw/root"
rm -f "$usr_sq" "$fw_sq"
mksquashfs "$tmp_root/root" "${base_sq}.new" -comp zstd -b 1048576 -noappend -no-progress -no-xattrs >/dev/null
mv "${base_sq}.new" "$base_sq"
if dir_has_entries "$tmp_usr/root"; then
mksquashfs "$tmp_usr/root" "${usr_sq}.new" -comp zstd -b 1048576 -noappend -no-progress -no-xattrs >/dev/null
mv "${usr_sq}.new" "$usr_sq"
fi
if dir_has_entries "$tmp_fw/root"; then
mksquashfs "$tmp_fw/root" "${fw_sq}.new" -comp zstd -b 1048576 -noappend -no-progress -no-xattrs >/dev/null
mv "${fw_sq}.new" "$fw_sq"
fi
echo "=== live squashfs layers ==="
find "$live_dir" -maxdepth 1 -type f -name '*.squashfs' -exec du -sh {} \; | sort
rm -rf "$tmp_root" "$tmp_usr" "$tmp_fw"
}
recover_iso_memtest() {
lb_dir="$1"
iso_path="$2"
@@ -1005,9 +1102,11 @@ recover_iso_memtest() {
}
PROJECT_VERSION_EFFECTIVE="$(resolve_project_version)"
SQUASHFS_FILENAME="filesystem-v${PROJECT_VERSION_EFFECTIVE}.squashfs"
ISO_BASENAME="easy-bee-${BUILD_VARIANT}-v${PROJECT_VERSION_EFFECTIVE}-amd64"
# Versioned output directory: dist/release/easy-bee-v4.1/. All final artefacts live here.
OUT_DIR="${DIST_DIR}/easy-bee-v${PROJECT_VERSION_EFFECTIVE}"
OUT_DIR="${DIST_DIR}/release/easy-bee-v${PROJECT_VERSION_EFFECTIVE}"
ISO_VERSION_LABEL_TOKEN="$(printf '%s' "${PROJECT_VERSION_EFFECTIVE}" | tr '[:lower:].-' '[:upper:]__')"
mkdir -p "${OUT_DIR}"
LOG_DIR="${OUT_DIR}/${ISO_BASENAME}.logs"
LOG_ARCHIVE="${OUT_DIR}/${ISO_BASENAME}.logs.tar.gz"
@@ -1191,7 +1290,7 @@ run_step "sync git submodules" "05-git-submodules" \
# --- compile bee binary (static, Linux amd64) ---
# Shared between variants — built once, reused on second pass.
BEE_BIN="${DIST_DIR}/bee-linux-amd64"
BEE_BIN="${DIST_DIR}/cache/bee-linux-amd64"
NEED_BUILD=1
if [ -f "$BEE_BIN" ]; then
NEWEST_SRC=$(find "${REPO_ROOT}/audit" -name '*.go' -newer "$BEE_BIN" | head -1)
@@ -1222,16 +1321,16 @@ else
fi
# --- NVIDIA-only build steps ---
GPU_BURN_WORKER_BIN="${DIST_DIR}/bee-gpu-burn-worker-linux-amd64"
GPU_BURN_WORKER_BIN="${DIST_DIR}/cache/bee-gpu-burn-worker-linux-amd64"
if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
run_step "download cuBLAS/cuBLASLt/cudart ${NCCL_CUDA_VERSION} userspace" "20-cublas" \
sh "${BUILDER_DIR}/build-cublas.sh" \
"${CUBLAS_VERSION}" \
"${CUDA_USERSPACE_VERSION}" \
"${NCCL_CUDA_VERSION}" \
"${DIST_DIR}"
"${DIST_DIR}/cache"
CUBLAS_CACHE="${DIST_DIR}/cublas-${CUBLAS_VERSION}+cuda${NCCL_CUDA_VERSION}"
CUBLAS_CACHE="${DIST_DIR}/cache/cublas-${CUBLAS_VERSION}+cuda${NCCL_CUDA_VERSION}"
echo "=== bee-gpu-burn FP4 header probe ==="
fp4_type_match="$(grep -Rsnm 1 'CUDA_R_4F_E2M1' "${CUBLAS_CACHE}/include" 2>/dev/null || true)"
@@ -1357,7 +1456,7 @@ fi
# --- copy bee binary into overlay ---
mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/bin"
cp "${DIST_DIR}/bee-linux-amd64" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
cp "$BEE_BIN" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
if [ "$BEE_GPU_VENDOR" = "nvidia" ] && [ -f "$GPU_BURN_WORKER_BIN" ]; then
@@ -1387,10 +1486,10 @@ done
# --- NVIDIA kernel modules and userspace libs ---
if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
run_step "build NVIDIA ${NVIDIA_DRIVER_VERSION} modules" "40-nvidia-module" \
sh "${BUILDER_DIR}/build-nvidia-module.sh" "${NVIDIA_DRIVER_VERSION}" "${DIST_DIR}" "${DEBIAN_KERNEL_ABI}" "${BEE_NVIDIA_MODULE_FLAVOR}"
sh "${BUILDER_DIR}/build-nvidia-module.sh" "${NVIDIA_DRIVER_VERSION}" "${DIST_DIR}/cache" "${DEBIAN_KERNEL_ABI}" "${BEE_NVIDIA_MODULE_FLAVOR}"
KVER="${DEBIAN_KERNEL_ABI}-amd64"
NVIDIA_CACHE="${DIST_DIR}/nvidia-${BEE_NVIDIA_MODULE_FLAVOR}-${NVIDIA_DRIVER_VERSION}-${KVER}"
NVIDIA_CACHE="${DIST_DIR}/cache/nvidia-${BEE_NVIDIA_MODULE_FLAVOR}-${NVIDIA_DRIVER_VERSION}-${KVER}"
# Inject .ko files into overlay at /usr/local/lib/nvidia/
OVERLAY_KMOD_DIR="${OVERLAY_STAGE_DIR}/usr/local/lib/nvidia"
@@ -1416,9 +1515,9 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
# --- build / download NCCL ---
run_step "download NCCL ${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}" "50-nccl" \
sh "${BUILDER_DIR}/build-nccl.sh" "${NCCL_VERSION}" "${NCCL_CUDA_VERSION}" "${DIST_DIR}" "${NCCL_SHA256:-}"
sh "${BUILDER_DIR}/build-nccl.sh" "${NCCL_VERSION}" "${NCCL_CUDA_VERSION}" "${DIST_DIR}/cache" "${NCCL_SHA256:-}"
NCCL_CACHE="${DIST_DIR}/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}"
NCCL_CACHE="${DIST_DIR}/cache/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}"
# Inject libnccl.so.* into overlay alongside other NVIDIA userspace libs
cp "${NCCL_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/"
@@ -1434,19 +1533,19 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
"${NCCL_TESTS_VERSION}" \
"${NCCL_VERSION}" \
"${NCCL_CUDA_VERSION}" \
"${DIST_DIR}" \
"${DIST_DIR}/cache" \
"${NVCC_VERSION}" \
"${DEBIAN_VERSION}"
NCCL_TESTS_CACHE="${DIST_DIR}/nccl-tests-${NCCL_TESTS_VERSION}"
NCCL_TESTS_CACHE="${DIST_DIR}/cache/nccl-tests-${NCCL_TESTS_VERSION}"
cp "${NCCL_TESTS_CACHE}/bin/all_reduce_perf" "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf"
cp "${NCCL_TESTS_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/" 2>/dev/null || true
echo "=== all_reduce_perf injected ==="
run_step "build john jumbo ${JOHN_JUMBO_COMMIT}" "70-john" \
sh "${BUILDER_DIR}/build-john.sh" "${JOHN_JUMBO_COMMIT}" "${DIST_DIR}"
JOHN_CACHE="${DIST_DIR}/john-${JOHN_JUMBO_COMMIT}"
sh "${BUILDER_DIR}/build-john.sh" "${JOHN_JUMBO_COMMIT}" "${DIST_DIR}/cache"
JOHN_CACHE="${DIST_DIR}/cache/john-${JOHN_JUMBO_COMMIT}"
mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/lib/bee/john"
rsync -a --delete "${JOHN_CACHE}/run/" "${OVERLAY_STAGE_DIR}/usr/local/lib/bee/john/run/"
ln -sfn ../lib/bee/john/run/john "${OVERLAY_STAGE_DIR}/usr/local/bin/john"
@@ -1574,7 +1673,7 @@ if ! needs_full_build; then
fast_path_rebuild_iso
ISO_RAW="${LB_DIR}/live-image-amd64.hybrid.iso"
validate_iso_live_boot_entries "$ISO_RAW"
validate_iso_grub_theme_assets "$ISO_RAW"
validate_iso_grub_assets "$ISO_RAW"
validate_iso_nvidia_runtime "$ISO_RAW"
cp "$ISO_RAW" "$ISO_OUT"
echo ""
@@ -1589,15 +1688,30 @@ echo "=== building ISO (variant: ${BUILD_VARIANT}) ==="
# Export for auto/config
BEE_GPU_VENDOR_UPPER="$(echo "${BUILD_VARIANT}" | tr 'a-z-' 'A-Z_')"
export BEE_GPU_VENDOR_UPPER
# ISO 9660 volume ID is limited to 32 characters; truncate the version token to fit.
_vol_prefix="EASY_BEE_${BEE_GPU_VENDOR_UPPER}_V"
_max_token=$(( 32 - ${#_vol_prefix} ))
_vol_token="$(printf '%s' "${ISO_VERSION_LABEL_TOKEN}" | cut -c1-${_max_token})"
BEE_ISO_VOLUME="${_vol_prefix}${_vol_token}"
unset _vol_prefix _max_token _vol_token
export BEE_GPU_VENDOR_UPPER BEE_ISO_VOLUME
cd "${LB_DIR}"
run_step_sh "live-build clean" "80-lb-clean" "lb clean --all 2>&1 | tail -3"
run_step_sh "live-build config" "81-lb-config" "lb config 2>&1 | tail -5"
dump_memtest_debug "pre-build" "${LB_DIR}"
export MKSQUASHFS_OPTIONS="-no-xattrs"
run_step_sh "live-build build" "90-lb-build" "lb build 2>&1"
echo "=== enforcing canonical bootloader assets ==="
enforce_live_build_bootloader_assets "${LB_DIR}"
# Rename lb's default filesystem.squashfs to the versioned filename so the
# ISO contains a version-stamped squashfs (e.g. filesystem-v10.15.squashfs).
_std_sq="${LB_DIR}/binary/live/filesystem.squashfs"
_ver_sq="${LB_DIR}/binary/live/${SQUASHFS_FILENAME}"
if [ -f "${_std_sq}" ] && [ "${_std_sq}" != "${_ver_sq}" ]; then
mv "${_std_sq}" "${_ver_sq}"
echo "=== squashfs renamed: filesystem.squashfs → ${SQUASHFS_FILENAME} ==="
fi
reset_live_build_stage "${LB_DIR}" "binary_checksums"
reset_live_build_stage "${LB_DIR}" "binary_iso"
reset_live_build_stage "${LB_DIR}" "binary_zsync"
@@ -1629,7 +1743,7 @@ if [ -f "$ISO_RAW" ]; then
fi
validate_iso_memtest "$ISO_RAW"
validate_iso_live_boot_entries "$ISO_RAW"
validate_iso_grub_theme_assets "$ISO_RAW"
validate_iso_grub_assets "$ISO_RAW"
validate_iso_nvidia_runtime "$ISO_RAW"
cp "$ISO_RAW" "$ISO_OUT"
touch "${FULL_BUILD_MARKER}"
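
The volume-ID hunk above truncates the version token so that prefix plus token stay within the 32-character ISO 9660 limit. A minimal sketch of that arithmetic as a standalone function follows; `make_volume_id` is a hypothetical name that does not appear in the build script, and the sample version token is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the build script): join a prefix and a
# version token into a volume ID of at most 32 characters.
make_volume_id() {
    prefix="$1"
    token="$2"
    max_token=$(( 32 - ${#prefix} ))
    # Guard: a prefix of 32 or more characters leaves no room for the token.
    if [ "$max_token" -le 0 ]; then
        printf '%s' "$prefix" | cut -c1-32
        return 0
    fi
    printf '%s%s\n' "$prefix" "$(printf '%s' "$token" | cut -c1-"$max_token")"
}

# Example with an illustrative token: the 17-character prefix leaves 15
# characters for the token.
make_volume_id "EASY_BEE_NVIDIA_V" "10_15_3_GABCDEF1234567890"
```

The guard for an over-long prefix is an addition of this sketch; the hunk itself assumes the prefix is always shorter than 32 characters.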

View File

@@ -1,5 +1,7 @@
set default=1
set timeout=10
set color_normal=yellow/black
set color_highlight=white/brown
if [ x$feature_default_font_path = xy ] ; then
font=unicode
@@ -8,7 +10,7 @@ else
fi
if loadfont $font ; then
set gfxmode=1920x1080,1280x1024,auto
set gfxmode=1280x1024,auto
set gfxpayload=keep
insmod efi_gop
insmod efi_uga
@@ -26,6 +28,3 @@ insmod gfxterm
terminal_input console serial
terminal_output gfxterm serial
insmod tga
source /boot/grub/theme.cfg

View File

@@ -1,15 +1,25 @@
source /boot/grub/config.cfg
menuentry "EASY-BEE" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
menuentry "EASY-BEE v@VERSION@" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE -- load to RAM (toram)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram bee.display=kms bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
menuentry "EASY-BEE v@VERSION@ -- load to RAM (toram)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram nomodeset bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE v@VERSION@ -- no GUI / no X11" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.gui=off bee.nvidia.mode=gsp-off pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "*** WIPE ALL DISKS (irreversible!) ***" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram nomodeset bee.gui=off bee.wipe=all net.ifnames=0 biosdevname=0
initrd @INITRD_LIVE@
}
if [ "${grub_platform}" = "efi" ]; then
menuentry "Memory Test (memtest86+)" {

View File

@@ -5,13 +5,6 @@ title-text: ""
message-font: "Unifont Regular 16"
terminal-font: "Unifont Regular 16"
#bee logo - centered, upper third of screen
+ image {
top = 4%
left = 50%-200
file = "bee-logo.tga"
}
#help bar at the bottom
+ label {
top = 100%-50

View File

@@ -1,16 +1,22 @@
label live-@FLAVOUR@-normal
menu label ^EASY-BEE
menu label ^EASY-BEE v@VERSION@
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM)
menu label EASY-BEE v@VERSION@ (^load to RAM)
menu default
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ toram nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-console
menu label EASY-BEE v@VERSION@ (^no GUI / no X11)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ nomodeset bee.gui=off bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-gsp-off
menu label EASY-BEE (^NVIDIA GSP=off)
linux @LINUX@
@@ -35,6 +41,12 @@ label live-@FLAVOUR@-failsafe
initrd @INITRD@
append @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off noapic noapm nodma nomce nolapic nosmp vga=normal net.ifnames=0 biosdevname=0
label wipe-disks
menu label *** WIPE ALL DISKS (irreversible!) ***
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ toram nomodeset bee.gui=off bee.wipe=all net.ifnames=0 biosdevname=0
label memtest
menu label ^Memory Test (memtest86+)
linux /boot/memtest86+x64.bin

View File

@@ -67,7 +67,9 @@ chmod +x /usr/local/bin/bee-log-run 2>/dev/null || true
chmod +x /usr/local/bin/bee-selfheal 2>/dev/null || true
chmod +x /usr/local/bin/bee-boot-status 2>/dev/null || true
chmod +x /usr/local/bin/bee-install 2>/dev/null || true
chmod +x /usr/local/bin/bee-gui-gate 2>/dev/null || true
chmod +x /usr/local/bin/bee-remount-medium 2>/dev/null || true
chmod +x /usr/local/bin/bee-check-nvswitch 2>/dev/null || true
if [ "$GPU_VENDOR" = "nvidia" ]; then
chmod +x /usr/local/bin/bee-nvidia-load 2>/dev/null || true
chmod +x /usr/local/bin/bee-gpu-burn 2>/dev/null || true

View File

@@ -0,0 +1,57 @@
#!/bin/sh
# 9012-wipe.hook.chroot
#
# Adds bee-initramfs-wipe to the initramfs so that selecting the
# "WIPE ALL DISKS" boot menu entry runs the wipe tool before squashfs
# is mounted — i.e. it works even when live boot fails.
#
# Two files are installed inside the chroot:
# /etc/initramfs-tools/hooks/bee-wipe — copies binaries into initrd
# /etc/initramfs-tools/scripts/local-premount/bee-wipe — runs at boot
set -e
HOOK_DIR="/etc/initramfs-tools/hooks"
SCRIPT_DIR="/etc/initramfs-tools/scripts/local-premount"
mkdir -p "${HOOK_DIR}" "${SCRIPT_DIR}"
# ── initramfs hook: copy binaries ────────────────────────────────────────────
cat > "${HOOK_DIR}/bee-wipe" << 'EOF'
#!/bin/sh
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in prereqs) prereqs; exit 0 ;; esac
. /usr/share/initramfs-tools/hook-functions
for bin in lsblk blkid blkdiscard blockdev; do
b=$(command -v "$bin" 2>/dev/null) && copy_exec "$b" /bin
done
[ -x /usr/sbin/nvme ] && copy_exec /usr/sbin/nvme /sbin
copy_exec /usr/local/bin/bee-initramfs-wipe /bin/bee-wipe
EOF
chmod +x "${HOOK_DIR}/bee-wipe"
# ── initramfs premount script: trigger on bee.wipe=all ───────────────────────
cat > "${SCRIPT_DIR}/bee-wipe" << 'EOF'
#!/bin/sh
PREREQ=""
prereqs() { echo "$PREREQ"; }
case "$1" in prereqs) prereqs; exit 0 ;; esac
grep -qw 'bee.wipe=all' /proc/cmdline 2>/dev/null || exit 0
exec /bin/bee-wipe
EOF
chmod +x "${SCRIPT_DIR}/bee-wipe"
echo "9012-wipe: installed initramfs hook and premount script"
KVER=$(ls /lib/modules | sort -V | tail -1)
echo "9012-wipe: rebuilding initramfs for kernel ${KVER}"
update-initramfs -u -k "${KVER}"
echo "9012-wipe: done"

View File

@@ -0,0 +1,37 @@
#!/usr/bin/env python3
# 9998-strip-xattrs.hook.chroot
#
# mksquashfs 4.5.1 (Debian bookworm) writes a non-INVALID xattr_id_table_start
# even with -no-xattrs when the source tree contains POSIX ACL xattrs set by
# dpkg/install-time. Linux 6.1 squashfs driver then fails with
# "unable to read xattr id index table" and aborts the mount.
#
# Strip all xattrs from the live chroot before mksquashfs sees the tree so the
# resulting squashfs has SQUASHFS_INVALID_BLK in xattr_id_table_start.
import os
def strip(path):
    try:
        for attr in os.listxattr(path, follow_symlinks=False):
            try:
                os.removexattr(path, attr, follow_symlinks=False)
            except OSError:
                pass
    except OSError:
        pass
removed = 0
for root, dirs, files in os.walk('/', topdown=True, followlinks=False):
    for name in dirs + files:
        p = os.path.join(root, name)
        try:
            attrs = os.listxattr(p, follow_symlinks=False)
            if attrs:
                strip(p)
                removed += len(attrs)
        except OSError:
            pass
    strip(root)
print(f"9998-strip-xattrs: removed xattrs from {removed} entries")

View File

@@ -1,5 +1,6 @@
# AMD GPU firmware
firmware-amd-graphics
nvtop
# AMD ROCm — GPU monitoring, bandwidth test, and compute stress (RVS GST)
rocm-smi-lib=%%ROCM_SMI_VERSION%%

View File

@@ -5,6 +5,7 @@
# DCGM 4 is packaged per CUDA major. The image ships NVIDIA driver 590 with
# CUDA 13 userspace, so install the CUDA 13 build plus proprietary components
# explicitly.
nvtop
nvidia-fabricmanager=%%NVIDIA_FABRICMANAGER_VERSION%%
datacenter-gpu-manager-4-cuda13=1:%%DCGM_VERSION%%
datacenter-gpu-manager-4-proprietary=1:%%DCGM_VERSION%%

View File

@@ -38,6 +38,7 @@ exfat-fuse
ntfs-3g
# Utilities
infiniband-diags
bash
procps
lsof
@@ -46,7 +47,6 @@ less
vim-tiny
mc
htop
nvtop
sudo
zstd
mstflint

View File

@@ -1,11 +1,4 @@
███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝
█████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗
██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝
███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗
╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝
EASY BEE
Hardware Audit LiveCD
Build: %%BUILD_INFO%%

View File

@@ -1,6 +1,6 @@
[Unit]
Description=Bee: hardware audit
After=bee-preflight.service bee-network.service bee-nvidia.service bee-blackbox.service
After=bee-preflight.service bee-nvidia.service bee-blackbox.service
[Service]
Type=oneshot

View File

@@ -1,7 +1,6 @@
[Unit]
Description=Bee: bring up network interfaces via DHCP
After=local-fs.target bee-blackbox.service
Before=network-online.target bee-audit.service
After=bee-web.service bee-audit.service
[Service]
Type=oneshot

View File

@@ -2,6 +2,8 @@
Description=Bee: load NVIDIA kernel modules and create device nodes
After=local-fs.target udev.service bee-blackbox.service
Before=bee-audit.service
# Skip silently if bee-nvidia-load is absent (non-nvidia builds).
ConditionPathExists=/usr/local/bin/bee-nvidia-load
[Service]
Type=oneshot

View File

@@ -1,6 +1,6 @@
[Unit]
Description=Bee: runtime preflight self-check
After=bee-network.service bee-nvidia.service bee-blackbox.service
After=bee-nvidia.service bee-blackbox.service
Before=bee-audit.service
[Service]

View File

@@ -0,0 +1,2 @@
[Service]
ExecCondition=/usr/local/bin/bee-gui-gate

View File

@@ -0,0 +1,9 @@
[Unit]
# bee-nvidia.service loads the NVIDIA kernel modules; fabricmanager must wait
# for them to be fully initialized before attempting to open /dev/nvidiactl.
After=bee-nvidia.service
[Service]
# Skip fabricmanager on systems without NVSwitch hardware.
# ExecCondition exits 1-254 → unit is silently skipped (inactive, not failed).
ExecCondition=/usr/local/bin/bee-check-nvswitch

View File

@@ -3,8 +3,14 @@
# Shows live service status until all bee services are done or failed,
# then exits so getty can show the login prompt.
CRITICAL="bee-preflight bee-nvidia bee-audit"
ALL="bee-sshsetup ssh bee-network bee-nvidia bee-preflight bee-audit bee-web"
GPU_VENDOR="$(cat /etc/bee-gpu-vendor 2>/dev/null || echo nvidia)"
if [ "$GPU_VENDOR" = "nvidia" ]; then
CRITICAL="bee-preflight bee-nvidia bee-audit"
ALL="bee-sshsetup ssh bee-network bee-nvidia bee-preflight bee-audit bee-web"
else
CRITICAL="bee-preflight bee-audit"
ALL="bee-sshsetup ssh bee-network bee-preflight bee-audit bee-web"
fi
svc_state() { systemctl is-active "$1.service" 2>/dev/null || echo "inactive"; }
@@ -51,12 +57,7 @@ while true; do
printf '\033[H\033[2J'
printf '\n'
printf ' \033[33m███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗\033[0m\n'
printf ' \033[33m██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝\033[0m\n'
printf ' \033[33m█████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗\033[0m\n'
printf ' \033[33m██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝\033[0m\n'
printf ' \033[33m███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗\033[0m\n'
printf ' \033[33m╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝\033[0m\n'
printf ' \033[33mEASY BEE\033[0m\n'
printf ' Hardware Audit LiveCD\n'
printf '\n'

View File

@@ -0,0 +1,4 @@
#!/bin/sh
# Exit 0 if NVSwitch hardware is detected; exit 1 to skip fabricmanager on non-NVSwitch systems.
# NVSwitch appears in lspci as vendor 10de, class 0680 (Bridge, Other).
lspci -Dn 2>/dev/null | awk '$2 == "0680:" && $3 ~ /^10de:/ { found=1; exit } END { exit(found ? 0 : 1) }'
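
The awk filter above can be exercised without real hardware by feeding it text shaped like `lspci -Dn` output. The sketch below wraps the same awk program in a function that reads stdin; `has_nvswitch` is a hypothetical name, and the sample lines are illustrative rather than captured from a real machine.

```shell
#!/bin/sh
# Hypothetical wrapper: same awk program as bee-check-nvswitch, reading
# `lspci -Dn`-style lines from stdin instead of calling lspci.
has_nvswitch() {
    awk '$2 == "0680:" && $3 ~ /^10de:/ { found=1; exit } END { exit(found ? 0 : 1) }'
}

# Illustrative input: a host bridge plus one class-0680 device from vendor 10de.
if printf '%s\n' \
    '0000:00:00.0 0600: 8086:09a2 (rev 04)' \
    '0000:07:00.0 0680: 10de:22a3 (rev a1)' | has_nvswitch; then
    echo "NVSwitch-like device present"
else
    echo "no NVSwitch-like device"
fi
```

Both conditions must hold on the same line, so a class-0680 device from another vendor, or an NVIDIA device of another class, leaves the exit status at 1.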

View File

@@ -0,0 +1,27 @@
#!/bin/sh
# bee-gui-gate — skip starting the local GUI when bee.gui=off is set.
set -eu
cmdline_param() {
key="$1"
for token in $(cat /proc/cmdline 2>/dev/null); do
case "$token" in
"$key"=*)
echo "${token#*=}"
return 0
;;
esac
done
return 1
}
mode="$(cmdline_param bee.gui || true)"
case "${mode}" in
off|false|0|tty|console|text|nogui)
echo "bee-gui-gate: bee.gui=${mode}; skipping lightdm"
exit 1
;;
esac
exit 0
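
The gate logic above can be checked in isolation by passing the command line as a string instead of reading /proc/cmdline. This is a minimal sketch under that assumption; `cmdline_param_from` and `gui_should_start` are hypothetical names, not functions from the script.

```shell
#!/bin/sh
# Hypothetical variant of cmdline_param: takes the command line as $2 so it
# can be tested without /proc/cmdline.
cmdline_param_from() {
    key="$1"
    cmdline="$2"
    for token in $cmdline; do
        case "$token" in
            "$key"=*)
                printf '%s\n' "${token#*=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Returns 0 when the GUI should start, 1 when bee.gui selects a text mode.
gui_should_start() {
    mode="$(cmdline_param_from bee.gui "$1" || true)"
    case "$mode" in
        off|false|0|tty|console|text|nogui) return 1 ;;
    esac
    return 0
}

if gui_should_start "quiet toram nomodeset bee.gui=off"; then
    echo "start GUI"
else
    echo "skip GUI"
fi
```

An absent `bee.gui` parameter yields an empty mode, which falls through the `case` and leaves the GUI enabled, matching the default in the script above.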

View File

@@ -0,0 +1,166 @@
#!/bin/sh
# bee-initramfs-wipe — interactive disk wipe running entirely in the initramfs.
# Triggered by bee.wipe=all on the kernel cmdline (via local-premount hook).
# Works before squashfs is mounted, so it runs even when live boot fails.
RED='\033[1;31m'
YEL='\033[1;33m'
GRN='\033[1;32m'
CYN='\033[1;36m'
NC='\033[0m'
p() { printf '%b\n' "$*"; }
pp() { printf '%b' "$*"; }
banner() {
p ""
p "${RED}╔══════════════════════════════════════════════════════════╗${NC}"
p "${RED}║ BEE DRIVE WIPE — initramfs stage ║${NC}"
p "${RED}╚══════════════════════════════════════════════════════════╝${NC}"
p ""
}
# ── find boot device ─────────────────────────────────────────────────────────
boot_dev() {
local label token
for token in $(cat /proc/cmdline 2>/dev/null); do
case "$token" in
live-media-label=*) label="${token#*=}" ;;
esac
done
[ -z "$label" ] && return
local dev
dev=$(blkid -L "$label" 2>/dev/null) || return
# strip partition suffix: /dev/sdb1 → /dev/sdb, /dev/nvme0n1p1 → /dev/nvme0n1
echo "$dev" | sed 's/p\?[0-9]\+$//'
}
# ── enumerate candidate disks ─────────────────────────────────────────────────
list_disks() {
local boot
boot=$(boot_dev)
lsblk -d -n -o NAME,TYPE,SIZE,MODEL 2>/dev/null | while read -r name type size model; do
[ "$type" = "disk" ] || continue
[ "$size" = "0B" ] && continue
local dev="/dev/$name"
[ "$dev" = "$boot" ] && continue
printf '%s\t%s\t%s\n' "$dev" "$size" "${model:-}"
done
}
# ── wipe one disk ─────────────────────────────────────────────────────────────
wipe_one() {
local dev="$1"
p ""
p "=== ${YEL}${dev}${NC} ==="
if echo "$dev" | grep -q '^/dev/nvme'; then
if nvme format --ses=1 "$dev" 2>&1; then
p " ${GRN}nvme format OK${NC}"
blockdev --flushbufs "$dev" 2>/dev/null || true
return
fi
p " nvme format failed — falling back to blkdiscard"
fi
if blkdiscard -f "$dev" 2>&1; then
p " ${GRN}blkdiscard OK${NC}"
blockdev --flushbufs "$dev" 2>/dev/null || true
return
fi
p " blkdiscard not supported — zeroing partition tables (HDD fallback)"
local size_bytes mb32 skip
size_bytes=$(blockdev --getsize64 "$dev" 2>/dev/null || echo 0)
mb32=$(( 32 * 1024 * 1024 ))
dd if=/dev/zero of="$dev" bs=4M count=8 conv=fsync status=progress 2>&1 || true
if [ "$size_bytes" -gt $(( mb32 * 2 )) ]; then
skip=$(( (size_bytes - mb32) / (4 * 1024 * 1024) ))
dd if=/dev/zero of="$dev" bs=4M count=8 seek="$skip" conv=fsync status=progress 2>&1 || true
fi
blockdev --flushbufs "$dev" 2>/dev/null || true
p " ${GRN}done (partition tables zeroed)${NC}"
}
# ── main ──────────────────────────────────────────────────────────────────────
banner
BOOT=$(boot_dev)
[ -n "$BOOT" ] && p "Boot device (excluded): ${CYN}${BOOT}${NC}\n"
# build indexed list
i=0
DEVS=""
IFS='
'
for line in $(list_disks); do
i=$(( i + 1 ))
dev=$(echo "$line" | cut -f1)
size=$(echo "$line" | cut -f2)
model=$(echo "$line" | cut -f3)
DEVS="${DEVS}${i}:${dev}:${size}:${model}
"
printf " ${CYN}[%d]${NC} %-16s %8s %s\n" "$i" "$dev" "$size" "$model"
done
IFS='
'
if [ "$i" -eq 0 ]; then
p "\nNo physical disks found (boot device excluded)."
p "Dropping to shell — type 'exit' to continue boot."
exec /bin/sh
fi
p ""
pp "Enter numbers to wipe (space-separated), ${YEL}all${NC} for all, ${YEL}q${NC} to abort: "
read -r SELECTION
case "$SELECTION" in
q|Q|'') p "\nAborted."; exec /bin/sh ;;
esac
# resolve selection → list of devs
SELECTED=""
if [ "$SELECTION" = "all" ] || [ "$SELECTION" = "ALL" ]; then
SELECTED=$(echo "$DEVS" | grep -v '^$' | cut -d: -f2 | tr '\n' ' ')
else
for num in $SELECTION; do
match=$(echo "$DEVS" | grep "^${num}:" | cut -d: -f2)
if [ -z "$match" ]; then
p "${RED}Unknown index: ${num}${NC}"; exec /bin/sh
fi
SELECTED="${SELECTED}${match} "
done
fi
SELECTED=$(echo "$SELECTED" | tr -s ' ' | sed 's/ $//')
p ""
p "Selected for wipe: ${YEL}${SELECTED}${NC}"
p "${RED}WARNING: This is IRREVERSIBLE. All data on the selected disks will be lost.${NC}"
p ""
pp "Type YES to confirm, anything else to abort: "
read -r CONFIRM
if [ "$CONFIRM" != "YES" ]; then
p "\nAborted — no disks were touched."
exec /bin/sh
fi
p "\nStarting wipe..."
for dev in $SELECTED; do
wipe_one "$dev"
done
sync
p ""
p "${GRN}=== All selected disks wiped and flushed. ===${NC}"
p ""
pp "Press Enter to reboot..."
read -r _
reboot
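
The HDD fallback above computes the tail offset with integer division in 4 MiB blocks. Because that division rounds down, a device whose size is not a multiple of 4 MiB keeps a small remainder after the eight blocks written at the end, which may matter since the backup GPT sits in the final sectors. The sketch below shows the arithmetic only; `floor_leftover` and `tail_window` are hypothetical helpers, and the 512-byte sector size is an assumption.

```shell
#!/bin/sh
MB4=$(( 4 * 1024 * 1024 ))
MB32=$(( 32 * 1024 * 1024 ))

# Bytes left untouched at the end of the device when the tail write starts at
# floor((size - 32 MiB) / 4 MiB) blocks and covers eight 4 MiB blocks.
floor_leftover() {
    size_bytes="$1"
    skip=$(( (size_bytes - MB32) / MB4 ))
    echo $(( size_bytes - (skip + 8) * MB4 ))
}

# Sector-granular alternative (assumes 512-byte sectors): prints "SEEK COUNT"
# covering the last 32 MiB, or the whole device if it is smaller than that.
tail_window() {
    size_bytes="$1"
    sectors=$(( size_bytes / 512 ))
    want=$(( MB32 / 512 ))
    if [ "$sectors" -le "$want" ]; then
        echo "0 $sectors"
    else
        echo "$(( sectors - want )) $want"
    fi
}

# A device of 500107862016 bytes is not a multiple of 4 MiB.
floor_leftover 500107862016
tail_window 500107862016
```

The two numbers from `tail_window` would be passed to `dd` as `seek=` and `count=` together with `bs=512`; the trade-off is many more, smaller writes in exchange for reaching the final sector.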

View File

@@ -8,7 +8,7 @@
# Layout (UEFI): GPT, /dev/sdX1=EFI 512MB vfat, /dev/sdX2=root ext4
# Layout (BIOS): MBR, /dev/sdX1=root ext4
#
# Squashfs source: /run/live/medium/live/filesystem.squashfs
# Squashfs sources: /run/live/medium/live/*.squashfs
set -euo pipefail
@@ -62,9 +62,9 @@ for tool in parted mkfs.vfat mkfs.ext4 unsquashfs grub-install update-grub; do
fi
done
SQUASHFS="/run/live/medium/live/filesystem.squashfs"
if [ ! -f "$SQUASHFS" ]; then
echo "ERROR: squashfs not found at $SQUASHFS" >&2
mapfile -t SQUASHFS_FILES < <(find /run/live/medium/live -maxdepth 1 -type f -name '*.squashfs' | sort)
if [ "${#SQUASHFS_FILES[@]}" -eq 0 ]; then
echo "ERROR: no squashfs files found under /run/live/medium/live" >&2
echo " The live medium may have been disconnected." >&2
echo " Reconnect the disc and run: bee-remount-medium --wait" >&2
echo " Then re-run bee-install." >&2
@@ -106,7 +106,10 @@ log "=== BEE DISK INSTALLER ==="
log "Target device : $DEVICE"
log "Root partition: $PART_ROOT"
[ "$UEFI" = "1" ] && log "EFI partition : $PART_EFI"
log "Squashfs : $SQUASHFS ($(du -sh "$SQUASHFS" | cut -f1))"
log "Squashfs : ${#SQUASHFS_FILES[@]} layer(s)"
for sf in "${SQUASHFS_FILES[@]}"; do
log " - $sf ($(du -sh "$sf" | cut -f1))"
done
log "Log : $LOGFILE"
log ""
@@ -163,7 +166,9 @@ log " Mounted."
# ------------------------------------------------------------------
log "--- Step 5/7: Unpacking filesystem (this takes 10-20 minutes) ---"
log " Source: $SQUASHFS"
for sf in "${SQUASHFS_FILES[@]}"; do
log " Source: $sf"
done
log " Target: $MOUNT_ROOT"
# unsquashfs does not support resume, so retry the entire unpack step if the
@@ -177,9 +182,9 @@ while true; do
fi
[ "$UNPACK_ATTEMPTS" -gt 1 ] && log " Retry attempt $UNPACK_ATTEMPTS / $UNPACK_MAX ..."
# Re-check squashfs is reachable before each attempt
if [ ! -f "$SQUASHFS" ]; then
log " SOURCE LOST: $SQUASHFS not found."
mapfile -t SQUASHFS_FILES < <(find /run/live/medium/live -maxdepth 1 -type f -name '*.squashfs' | sort)
if [ "${#SQUASHFS_FILES[@]}" -eq 0 ]; then
log " SOURCE LOST: no squashfs files found under /run/live/medium/live."
log " Reconnect the disc and run 'bee-remount-medium --wait' in another terminal,"
log " then press Enter here to retry."
read -r _
@@ -194,12 +199,17 @@ while true; do
fi
UNPACK_OK=0
unsquashfs -f -d "$MOUNT_ROOT" "$SQUASHFS" 2>&1 | \
grep -E '^\[|^inod|^created|^extract|^ERROR|failed' | \
while IFS= read -r line; do log " $line"; done || UNPACK_OK=$?
for sf in "${SQUASHFS_FILES[@]}"; do
log " Unpacking $(basename "$sf") ..."
unsquashfs -f -d "$MOUNT_ROOT" "$sf" 2>&1 | \
grep -E '^\[|^inod|^created|^extract|^ERROR|failed' | \
while IFS= read -r line; do log " $line"; done || UNPACK_OK=$?
[ "$UNPACK_OK" -eq 0 ] || break
done
# Check squashfs is still reachable (gone = disc pulled during copy)
if [ ! -f "$SQUASHFS" ]; then
mapfile -t SQUASHFS_FILES < <(find /run/live/medium/live -maxdepth 1 -type f -name '*.squashfs' | sort)
if [ "${#SQUASHFS_FILES[@]}" -eq 0 ]; then
log " WARNING: source medium lost during unpack — will retry after remount."
log " Run 'bee-remount-medium --wait' in another terminal, then press Enter."
read -r _
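
The installer now enumerates every `*.squashfs` file in the live directory and unpacks them in sorted order, so the file names decide which layer is applied last. A minimal sketch of that enumeration follows; `list_layers` is a hypothetical name, and pinning the sort to the C locale is an assumption of this sketch rather than something the installer does.

```shell
#!/bin/sh
# Hypothetical helper: list squashfs layers in a directory in a stable order.
# LC_ALL=C makes the ordering independent of the caller's locale.
list_layers() {
    find "$1" -maxdepth 1 -type f -name '*.squashfs' | LC_ALL=C sort
}

# Demonstration against a throwaway directory with empty placeholder files.
demo_dir="$(mktemp -d)"
touch "$demo_dir/filesystem.squashfs" \
      "$demo_dir/20-firmware.squashfs" \
      "$demo_dir/10-usr.squashfs" \
      "$demo_dir/notes.txt"
list_layers "$demo_dir" | while IFS= read -r layer; do
    echo "layer: $(basename "$layer")"
done
rm -rf "$demo_dir"
```

With `unsquashfs -f`, files from a later layer overwrite same-named files from an earlier one, so numeric prefixes such as `10-` and `20-` give explicit control over that order.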

View File

@@ -1,8 +1,9 @@
#!/bin/sh
# bee-network.sh — bring up all physical network interfaces via DHCP
# Unattended: runs silently, logs results, never blocks.
# Unattended: starts later in boot, runs quietly, and gives up after a bounded timeout.
LOG_PREFIX="bee-network"
DHCP_TIMEOUT_SECS=300
log() { echo "[$LOG_PREFIX] $*"; }
@@ -19,9 +20,50 @@ if command -v udevadm >/dev/null 2>&1; then
udevadm settle --timeout=5 >/dev/null 2>&1 || log "WARN: udevadm settle timed out"
fi
start_dhcp() {
iface="$1"
if ! ip link set "$iface" up; then
log "WARN: could not bring up $iface"
return 1
fi
carrier=$(cat "/sys/class/net/$iface/carrier" 2>/dev/null || true)
if [ "$carrier" = "1" ]; then
log "carrier detected on $iface"
else
log "carrier not detected on $iface"
fi
dhclient -r "$iface" >/dev/null 2>&1 || true
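# Capture the exit status directly: after an `if` whose condition fails and
# that has no else branch, $? is 0, so the timeout code (124) would be lost.
rc=0
timeout "${DHCP_TIMEOUT_SECS}" dhclient -4 -q -1 "$iface" >/dev/null 2>&1 || rc=$?
if [ "$rc" -eq 0 ]; then
addr="$(ip -4 -o addr show dev "$iface" scope global 2>/dev/null | awk '{print $4}' | head -1)"
if [ -n "$addr" ]; then
log "DHCP lease acquired on $iface ($addr)"
else
log "DHCP lease acquired on $iface"
fi
return 0
fi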
case "$rc" in
124)
log "DHCP timed out on $iface after ${DHCP_TIMEOUT_SECS}s"
;;
*)
log "DHCP failed on $iface (exit $rc)"
;;
esac
dhclient -r "$iface" >/dev/null 2>&1 || true
return 1
}
started_ifaces=""
started_count=0
scan_pass=1
pids=""
pid_ifaces=""
# Some server NICs appear a bit later after module/firmware init. Do a small
# bounded rescan window without turning network bring-up into a boot blocker.
@@ -34,22 +76,11 @@ while [ "$scan_pass" -le 3 ]; do
*" $iface "*) continue ;;
esac
log "bringing up $iface"
if ! ip link set "$iface" up; then
log "WARN: could not bring up $iface"
continue
fi
carrier=$(cat "/sys/class/net/$iface/carrier" 2>/dev/null || true)
if [ "$carrier" = "1" ]; then
log "carrier detected on $iface"
else
log "carrier not detected yet on $iface"
fi
# DHCP in background — non-blocking, keep dhclient verbose output in the service log.
dhclient -4 -v -nw "$iface" &
log "DHCP started for $iface (pid $!)"
log "starting DHCP on $iface (timeout ${DHCP_TIMEOUT_SECS}s)"
start_dhcp "$iface" &
pid="$!"
pids="$pids $pid"
pid_ifaces="$pid_ifaces $pid:$iface"
started_ifaces="$started_ifaces $iface"
started_count=$((started_count + 1))
@@ -68,4 +99,15 @@ if [ "$started_count" -eq 0 ]; then
exit 0
fi
log "done (interfaces started: $started_count)"
success_count=0
for pid_iface in $pid_ifaces; do
pid="${pid_iface%%:*}"
iface="${pid_iface#*:}"
if wait "$pid"; then
success_count=$((success_count + 1))
else
log "DHCP did not complete successfully on $iface"
fi
done
log "done (interfaces scanned: $started_count, leases acquired: $success_count)"
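
`start_dhcp` distinguishes a timeout (exit status 124 from `timeout`) from other failures. In POSIX shell, the exit status of an `if` statement whose condition fails and that has no `else` branch is 0, so the status has to be captured at the command itself. The sketch below shows one way to do that; `run_bounded` is a hypothetical helper, and it assumes a `timeout` that returns 124 on expiry, as GNU coreutils does.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time limit and classify the
# result. The status is captured with `|| rc=$?` right at the command.
run_bounded() {
    secs="$1"
    shift
    rc=0
    timeout "$secs" "$@" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo "ok" ;;
        124) echo "timeout" ;;
        *)   echo "failed:$rc" ;;
    esac
    return "$rc"
}

run_bounded 5 true || true
run_bounded 1 sleep 5 || true
run_bounded 5 sh -c 'exit 3' || true
```

The `|| rc=$?` form also keeps working if the surrounding script enables `set -e`, because a failing command on the left of `||` does not abort the shell.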

View File

@@ -2,7 +2,7 @@
# bee-remount-medium — find and remount the live ISO medium to /run/live/medium
#
# Run this after reconnecting the ISO source disc (USB/CD) if the live medium
# was lost and /run/live/medium/live/filesystem.squashfs is missing.
# was lost and /run/live/medium/live/*.squashfs are missing.
#
# Usage: bee-remount-medium [--wait]
# --wait keep retrying every 5 seconds until the medium is found (useful
@@ -11,7 +11,7 @@
set -euo pipefail
MEDIUM_DIR="/run/live/medium"
SQUASHFS_REL="live/filesystem.squashfs"
SQUASHFS_GLOB="live/*.squashfs"
WAIT_MODE=0
for arg in "$@"; do
@@ -56,7 +56,7 @@ try_mount() {
local tmpdir
tmpdir=$(mktemp -d /tmp/bee-probe-XXXXXX)
if mount -o ro "$dev" "$tmpdir" 2>/dev/null; then
if [ -f "${tmpdir}/${SQUASHFS_REL}" ]; then
if find "${tmpdir}/live" -maxdepth 1 -type f -name '*.squashfs' 2>/dev/null | grep -q .; then
# Unmount probe mount and mount properly onto live path
umount "$tmpdir" 2>/dev/null || true
rmdir "$tmpdir" 2>/dev/null || true
@@ -82,8 +82,9 @@ attempt() {
for dev in $(find_candidates); do
log " Trying $dev ..."
if try_mount "$dev"; then
local sq="${MEDIUM_DIR}/${SQUASHFS_REL}"
log "SUCCESS: squashfs available at $sq ($(du -sh "$sq" | cut -f1))"
local count
count=$(find "${MEDIUM_DIR}/live" -maxdepth 1 -type f -name '*.squashfs' 2>/dev/null | wc -l | tr -d ' ')
log "SUCCESS: ${count} squashfs layer(s) available under ${MEDIUM_DIR}/live"
return 0
fi
done
@@ -100,5 +101,5 @@ if [ "$WAIT_MODE" = "1" ]; then
sleep 5
done
else
attempt || die "No ISO medium with ${SQUASHFS_REL} found. Reconnect the disc and re-run, or use --wait."
attempt || die "No ISO medium with ${SQUASHFS_GLOB} found. Reconnect the disc and re-run, or use --wait."
fi

View File

@@ -0,0 +1,132 @@
#!/bin/bash
# bee-wipe-disks — erase all physical disks (interactive, confirmation required)
#
# Triggered automatically when the kernel cmdline contains bee.wipe=all.
# Can also be run manually from a root shell.
#
# Wipe strategy:
# NVMe:  nvme format --ses=1 (NVMe user data erase, fast)
# Other: blkdiscard -f (TRIM/UNMAP, fast on SSDs)
#        dd if=/dev/zero (fallback for HDDs, zeros first+last 32 MB)
set -euo pipefail
RED=$'\033[1;31m'
YEL=$'\033[1;33m'
GRN=$'\033[1;32m'
NC=$'\033[0m'
banner() {
echo ""
echo "${RED}╔══════════════════════════════════════════════════════════╗${NC}"
echo "${RED}║ BEE DISK WIPE — ALL DATA WILL BE DESTROYED ║${NC}"
echo "${RED}╚══════════════════════════════════════════════════════════╝${NC}"
echo ""
}
# ── find boot device to skip ──────────────────────────────────────────────────
live_dev() {
local src
src=$(findmnt -n -o SOURCE /run/live/medium 2>/dev/null || true)
[ -z "$src" ] && return
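# Resolve the parent disk via lsblk: /dev/sdb1 → sdb, /dev/nvme0n1p1 → nvme0n1.
# PKNAME is empty when the source is already a whole device (e.g. /dev/sr0 or
# /dev/nvme0n1), so fall back to $src instead of stripping its trailing digits.
local parent
parent=$(lsblk -d -n -o PKNAME "$src" 2>/dev/null || true)
if [ -n "$parent" ]; then echo "/dev/$parent"; else echo "$src"; fi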
}
# ── enumerate target disks ────────────────────────────────────────────────────
find_disks() {
local boot_dev
boot_dev=$(live_dev)
lsblk -d -n -o NAME,TYPE,SIZE,MODEL | while read -r name type size model; do
[ "$type" = "disk" ] || continue
[ "$size" = "0B" ] && continue # empty virtual media
local dev="/dev/$name"
[ "$dev" = "$boot_dev" ] && continue # skip boot device
printf '%s\t%s\t%s\n' "$dev" "$size" "$model"
done
}
# ── wipe one disk ─────────────────────────────────────────────────────────────
wipe_disk() {
local dev="$1"
echo ""
echo "=== ${YEL}${dev}${NC} ==="
if echo "$dev" | grep -q '^/dev/nvme'; then
# NVMe format (ses=1 = user data erase)
if nvme format --ses=1 "$dev" 2>&1; then
echo " ${GRN}nvme format OK${NC}"
return
fi
echo " nvme format failed, falling back to blkdiscard"
fi
if blkdiscard -f "$dev" 2>&1; then
echo " ${GRN}blkdiscard OK${NC}"
return
fi
echo " blkdiscard not supported — zeroing partition tables (HDD fallback)"
local size_bytes
size_bytes=$(blockdev --getsize64 "$dev")
local mb32=$(( 32 * 1024 * 1024 ))
# Zero first 32 MB (MBR, GPT, filesystem superblocks)
dd if=/dev/zero of="$dev" bs=4M count=8 conv=fsync status=progress 2>&1 || true
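# Zero last 32 MB (backup GPT). oflag=seek_bytes makes the write end exactly at
# the end of the device, even when its size is not a multiple of 4 MiB.
if [ "$size_bytes" -gt $(( mb32 * 2 )) ]; then
local tail_offset=$(( size_bytes - mb32 ))
dd if=/dev/zero of="$dev" bs=4M count=8 seek="$tail_offset" oflag=seek_bytes conv=fsync status=progress 2>&1 || true
fi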
echo " ${GRN}done (partition tables zeroed)${NC}"
}
# ── main ──────────────────────────────────────────────────────────────────────
banner
mapfile -t DISKS < <(find_disks | awk '{print $1}')
if [ ${#DISKS[@]} -eq 0 ]; then
echo "No physical disks found (boot device excluded)."
echo "Nothing to wipe."
exit 0
fi
echo "Disks to be ${RED}COMPLETELY ERASED${NC}:"
echo ""
find_disks | while IFS=$'\t' read -r dev size model; do
printf " ${YEL}%-16s${NC} %8s %s\n" "$dev" "$size" "$model"
done
echo ""
echo "${RED}WARNING: This is IRREVERSIBLE. All data on the listed disks will be lost.${NC}"
echo ""
printf "Type YES to confirm wipe, anything else to abort: "
read -r CONFIRM
if [ "$CONFIRM" != "YES" ]; then
echo ""
echo "Aborted — no disks were touched."
exit 0
fi
echo ""
echo "Starting wipe..."
for dev in "${DISKS[@]}"; do
wipe_disk "$dev"
done
echo ""
echo "${GRN}=== All disks wiped. ===${NC}"
echo ""
printf "Reboot now to return to the boot menu? [Y/n] "
read -r REBOOT
case "${REBOOT:-Y}" in
[Nn]*) echo "You can reboot manually when ready." ;;
*) echo "Rebooting..."; sleep 2; reboot ;;
esac
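
Skipping the boot device depends on mapping a mounted source such as `/dev/sdb1` back to its parent disk, and a whole-device source such as `/dev/nvme0n1` or `/dev/sr0` must keep its trailing digits. Below is a string-only sketch of that mapping, written so the edge cases can be checked without real block devices. The function name `parent_disk` is hypothetical, and the set of device naming schemes it recognizes is an assumption rather than an exhaustive list.

```shell
#!/bin/sh
# Sketch only: parent_disk is an illustrative name, not part of bee-wipe-disks.
# Maps a partition path to its parent disk; whole devices are returned unchanged.
parent_disk() {
    src="$1"
    case "$src" in
        /dev/nvme*n*p[0-9]*|/dev/mmcblk*p[0-9]*)
            # Device names that end in a digit use a "p" separator before
            # the partition number: nvme0n1p1 -> nvme0n1, mmcblk0p2 -> mmcblk0.
            printf '%s\n' "${src%p[0-9]*}" ;;
        /dev/nvme*|/dev/mmcblk*|/dev/sr[0-9]*)
            # Already a whole device; the trailing digit is part of its name.
            printf '%s\n' "$src" ;;
        /dev/[shv]d[a-z]*[0-9])
            # sdb1 -> sdb: strip the trailing partition number.
            printf '%s\n' "$src" | sed 's/[0-9]*$//' ;;
        *)
            # Unknown naming scheme: return the input untouched.
            printf '%s\n' "$src" ;;
    esac
}
```

For example, `parent_disk /dev/nvme0n1p1` prints `/dev/nvme0n1`, while `parent_disk /dev/sr0` prints `/dev/sr0`. On a live system, asking `lsblk` for the parent kernel name is likely more robust than pattern matching, since it does not depend on naming conventions.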

scripts/build.sh Executable file

@@ -0,0 +1,125 @@
#!/bin/sh
# build.sh -- single entry point for ISO builds.
#
# Local build (default):
# sh scripts/build.sh
# sh scripts/build.sh --variant nvidia
# sh scripts/build.sh --clean-build
#
# Remote build (set BUILDER_HOST + BUILDER_USER in .env):
# sh scripts/build.sh
# sh scripts/build.sh --authorized-keys ~/.ssh/authorized_keys
#
# All flags are forwarded to build-in-container.sh.
set -e
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
ENV_FILE="${REPO_ROOT}/.env"
if [ -f "$ENV_FILE" ]; then
# shellcheck disable=SC1090
. "$ENV_FILE"
fi
BUILDER_HOST="${BUILDER_HOST:-}"
BUILDER_USER="${BUILDER_USER:-}"
# Cache lives inside the repo under dist/ (gitignored).
CACHE_DIR="${REPO_ROOT}/dist/cache"
# Forward all arguments as-is to the underlying build script.
EXTRA_ARGS="$*"
# ── Remote build ────────────────────────────────────────────────────────────
if [ -n "$BUILDER_HOST" ]; then
if [ -z "$BUILDER_USER" ]; then
echo "ERROR: BUILDER_USER not set. Set it in .env."
exit 1
fi
echo "=== bee builder (remote: ${BUILDER_USER}@${BUILDER_HOST}) ==="
echo ""
cd "${REPO_ROOT}"
git fetch --quiet origin main
LOCAL=$(git rev-parse HEAD)
REMOTE=$(git rev-parse origin/main)
if [ "$LOCAL" != "$REMOTE" ]; then
echo "ERROR: local repo is not in sync with remote."
echo " local: $LOCAL"
echo " remote: $REMOTE"
echo ""
echo "Push or pull before building:"
echo " git push -- if you have unpushed commits"
echo " git pull -- if remote is ahead"
exit 1
fi
echo "repo: in sync with remote ($LOCAL)"
echo ""
ssh -o StrictHostKeyChecking=no "${BUILDER_USER}@${BUILDER_HOST}" /bin/sh <<ENDSSH
set -e
REPO="/home/${BUILDER_USER}/bee"
LOG=/tmp/bee-build.log
if [ ! -d "\$REPO/.git" ]; then
echo "--- cloning bee repo ---"
git clone https://git.mchus.pro/reanimator/bee.git "\$REPO"
fi
cd "\$REPO"
echo "--- pulling latest ---"
sudo git checkout -- .
git pull --ff-only
chmod +x iso/overlay/usr/local/bin/* 2>/dev/null || true
screen -S bee-build -X quit 2>/dev/null || true
echo "--- starting build in screen session (survives SSH disconnect) ---"
echo "--- log: \$LOG ---"
screen -dmS bee-build sh -c "sh iso/builder/build-in-container.sh --cache-dir \$REPO/dist/cache ${EXTRA_ARGS} > \$LOG 2>&1; echo \$? > /tmp/bee-build-exit"
echo "--- streaming build log (Ctrl+C safe -- build continues on VM) ---"
tail -n +1 -f "\$LOG" 2>/dev/null &
TAIL_PID=\$!
while screen -list 2>/dev/null | grep -q bee-build; do
sleep 2
done
sleep 1
kill \$TAIL_PID 2>/dev/null || true
tail -n 20 "\$LOG" 2>/dev/null || true
EXIT_CODE=\$(cat /tmp/bee-build-exit 2>/dev/null || echo 1)
exit \$EXIT_CODE
ENDSSH
echo ""
echo "=== downloading ISO ==="
LOCAL_ISO_DIR="${REPO_ROOT}/dist/release"
mkdir -p "${LOCAL_ISO_DIR}"
if command -v rsync >/dev/null 2>&1 && ssh -o StrictHostKeyChecking=no "${BUILDER_USER}@${BUILDER_HOST}" command -v rsync >/dev/null 2>&1; then
rsync -az --progress \
-e "ssh -o StrictHostKeyChecking=no" \
"${BUILDER_USER}@${BUILDER_HOST}:/home/${BUILDER_USER}/bee/dist/*.iso" \
"${LOCAL_ISO_DIR}/"
else
scp -o StrictHostKeyChecking=no \
"${BUILDER_USER}@${BUILDER_HOST}:/home/${BUILDER_USER}/bee/dist/*.iso" \
"${LOCAL_ISO_DIR}/"
fi
echo ""
echo "=== build complete ==="
echo "ISO saved to: ${LOCAL_ISO_DIR}/"
ls -lh "${LOCAL_ISO_DIR}/"*.iso 2>/dev/null || true
exit 0
fi
# ── Local build ─────────────────────────────────────────────────────────────
echo "=== bee builder (local) ==="
echo "cache: ${CACHE_DIR}"
echo ""
# shellcheck disable=SC2086
exec sh "${REPO_ROOT}/iso/builder/build-in-container.sh" --cache-dir "${CACHE_DIR}" $EXTRA_ARGS
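
Because `EXTRA_ARGS="$*"` joins all arguments into one space-separated string, an argument that itself contains whitespace (for instance a key path with a space) is split into several words when the string is expanded again. Below is a minimal POSIX sh sketch of one way to preserve word boundaries, by single-quoting each argument before joining. The helpers `quote_arg` and `quote_args` are hypothetical and are not part of build.sh.

```shell
#!/bin/sh
# Sketch only: quote_arg and quote_args are illustrative helpers.

# Wrap one argument in single quotes, escaping embedded single quotes,
# so that a POSIX shell re-reads it as exactly one word.
quote_arg() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Quote every argument and join the results with single spaces.
quote_args() {
    out=""
    for arg in "$@"; do
        out="${out}${out:+ }$(quote_arg "$arg")"
    done
    printf '%s\n' "$out"
}
```

A string built with `quote_args "$@"` can be embedded in a remote command line or passed to `eval` and will reproduce the original argument list. One known limitation of this sketch is that trailing newlines inside an argument are dropped by command substitution.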


@@ -48,7 +48,7 @@ echo "==> Building binary..."
echo " OK: $(ls -lh "${LOCAL_BIN}" | awk '{print $5, $9}')"
LOCAL_SHA="$(shasum -a 256 "${LOCAL_BIN}" | awk '{print $1}')"
REMOTE_SHA="$("${SSH_CMD[@]}" "$REMOTE" "if [ -f '${REMOTE_BIN}' ] && command -v sha256sum >/dev/null 2>&1; then sha256sum '${REMOTE_BIN}' | awk '{print \\$1}'; fi" 2>/dev/null || true)"
REMOTE_SHA="$("${SSH_CMD[@]}" "$REMOTE" "if [ -f '${REMOTE_BIN}' ] && command -v sha256sum >/dev/null 2>&1; then sha256sum '${REMOTE_BIN}' | awk '{print \$1}'; fi" 2>/dev/null || true)"
if [[ -n "${REMOTE_SHA}" && "${LOCAL_SHA}" == "${REMOTE_SHA}" ]]; then
echo "==> Binary unchanged (${LOCAL_SHA}); skipping copy and service restart."
exit 0