Compare commits


48 Commits

Author SHA1 Message Date
Mikhail Chusavitin
49a09fde05 Disable xattrs in all mksquashfs calls
--chroot-squashfs-compression-options does not exist in live-build
bookworm (1:20230502). The correct mechanism is the MKSQUASHFS_OPTIONS
environment variable read by binary_rootfs.

Export MKSQUASHFS_OPTIONS="-no-xattrs" before lb build so live-build's
binary_rootfs picks it up, and add -no-xattrs explicitly to every
direct mksquashfs call in build.sh (fast-path repack and the dormant
split-layers function). Remove the invalid lb config option.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:29:15 +03:00
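A minimal Go sketch of the two delivery paths this commit describes: the environment variable for live-build's binary_rootfs, and the explicit flag on direct mksquashfs calls. Command names and paths are illustrative, not the build.sh source.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// mksquashfsArgs builds the argument list for a direct mksquashfs call,
// always appending -no-xattrs so repacked images never carry the POSIX
// ACL xattrs that break the kernel's xattr id index table at boot.
// (Paths and compression options here are illustrative.)
func mksquashfsArgs(srcDir, outImage string) []string {
	return []string{srcDir, outImage, "-comp", "zstd", "-noappend", "-no-xattrs"}
}

func main() {
	// For the full lb build path, the flag travels via the environment
	// variable read by binary_rootfs, not an lb config option.
	os.Setenv("MKSQUASHFS_OPTIONS", "-no-xattrs")

	args := mksquashfsArgs("chroot", "filesystem.squashfs")
	cmd := exec.Command("mksquashfs", args...) // constructed, not run, in this sketch
	cmd.Env = os.Environ()
	fmt.Println(cmd.Args)
}
```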
Mikhail Chusavitin
f3962422c8 Fix lb config option name for squashfs compression options
--chroot-squashfs-options is not a valid lb_config option; the correct
name is --chroot-squashfs-compression-options. Without this fix lb config
aborts immediately, so the -no-xattrs flag (which prevents the
"unable to read xattr id index table" boot failure) was never applied.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:03:41 +03:00
Mikhail Chusavitin
ee36e3c711 Strip xattrs from squashfs to fix boot failure
Kernel squashfs driver fails with "unable to read xattr id index table"
when the squashfs contains POSIX ACL xattrs (system.posix_acl_*) written
by mksquashfs as unrecognised entries. This caused every built ISO to
drop to an initramfs shell at boot.

Add -no-xattrs to mksquashfs options so xattrs are stripped at build
time. xattrs are not needed in a live read-only rootfs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:56:26 +03:00
Mikhail Chusavitin
cca3b21d35 Revert squashfs layer split — live-boot cannot mount partial rootfs
split_live_squashfs_layers moved /usr out of filesystem.squashfs into a
separate 10-usr.squashfs, leaving a rootfs skeleton that live-boot
(1:20230131+deb12u1) cannot mount: the initramfs panics with
"Can not mount /dev/loop0 ... filesystem.squashfs".

live-boot in bookworm expects a single self-contained filesystem.squashfs.
Revert to the standard single-squashfs layout and remove the dead
multi-squashfs guard in needs_full_build().

The split_live_squashfs_layers function is kept for future reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 11:14:10 +03:00
Mikhail Chusavitin
75c33e073e Fix split_live_squashfs_layers crash under POSIX sh (dash)
trap RETURN is a bash extension not supported by /bin/sh on Debian.
With set -e active the unsupported trap call exited the build immediately
after lb build, before bootloader sync and ISO copy steps ran.

Remove both trap RETURN lines — explicit rm -rf at the end of the
function is sufficient for cleanup on the happy path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 09:52:31 +03:00
7b4bcc745a Split live rootfs into smaller squashfs layers 2026-05-03 23:15:22 +03:00
42774d44a6 Restore post-build GRUB and isolinux sync 2026-05-03 21:49:54 +03:00
5dc022ddf8 Drop post-build EFI bootloader patching 2026-05-03 21:22:53 +03:00
6623e159f5 Grow EFI image before syncing GRUB theme assets 2026-05-03 21:18:37 +03:00
bbd6d009f8 Avoid EFI image overflow when syncing GRUB theme 2026-05-03 21:16:36 +03:00
6c2b188ec9 Add no-GUI boot mode and quieter boot diagnostics 2026-05-03 21:14:45 +03:00
14505ef24a Remove easy bee ASCII logo banners 2026-05-03 21:07:13 +03:00
4f20c9246d Make UEFI boot safe and remove GRUB logo 2026-05-03 20:11:42 +03:00
eed157c2db Pin live boot medium to versioned ISO label 2026-05-03 15:52:07 +03:00
a2c8aea0df Fix GRUB theme ISO validation helper name 2026-05-03 14:45:16 +03:00
b21f03cd26 Reduce bee-selfheal log noise and slow timer 2026-05-03 14:38:12 +03:00
cac5b9c86e Detach install media after install-to-ram 2026-05-03 14:16:45 +03:00
b5d04ef045 Synchronize project versioning across build artifacts 2026-05-03 14:16:32 +03:00
fcd64438ea Update submodule references 2026-05-03 14:12:00 +03:00
0e39e7d960 Make toram default and add install-to-ram CLI 2026-05-03 14:07:47 +03:00
Mikhail Chusavitin
58d6da0e4f Fix live task logs and SAT windows 2026-04-30 17:26:45 +03:00
Mikhail Chusavitin
7ce73e34a4 Add NVMe block format tool 2026-04-30 16:27:25 +03:00
Mikhail Chusavitin
8a21809ade Update chart submodule to v2.0 (hardware contract 2.10)
New in chart:
- event_logs and platform_config sections in viewer
- Storage columns: logical_block_size_bytes, physical_block_size_bytes,
  metadata_bytes_per_block
- Compact status/severity icons, severity filtering for event logs
- Fixed JS MIME type and base stylesheet

bee audit schema already has all required fields; no schema changes needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 15:52:30 +03:00
Mikhail Chusavitin
626763e31d Fix GRUB bitmap error: switch from PNG to TGA for splash logo
GRUB's PNG reader (grub2 bookworm) fails to load bee-logo.png despite the
file being valid RGB 8-bit non-interlaced PNG with minimal chunks. Root
cause is a known fragility in GRUB's png.c; exact trigger is unknown.

Switch to uncompressed 24-bit TGA which bypasses the PNG parser entirely.
tga.mod is already present in the ISO (x86_64-efi/tga.mod).

- Convert bee-logo.png → bee-logo.tga (480018 bytes, BGR top-left)
- config.cfg: insmod png → insmod tga
- theme.txt: bee-logo.png → bee-logo.tga
- Document all prior failed attempts in git-bible/grub-bitmap-error.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 15:46:13 +03:00
Mikhail Chusavitin
0b8a2ff83f Add validate test matrix and GPU test methodology docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 10:47:08 +03:00
Mikhail Chusavitin
2c22b01fe3 Fix IPMI hangs, add VROC license, fix blackbox service, drop qrencode
IPMI hang fix (Lenovo XCC SR650 V3):
- Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag
- Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found
- collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks
  are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list)
- collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts

VROC license detection:
- Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform
- Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot

Blackbox service fix:
- bee-blackbox.service was missing from systemctl enable list in ISO build hook
- Service never started on boot; state file never written; UI button stayed "Enable"

Drop qrencode:
- Remove from package list, standardTools API check, and runtime-flows doc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 10:46:59 +03:00
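The early-exit condition above — stop streaming once PSU blocks are followed by a non-PSU header — can be sketched as a small scanner. The header format and PSU matching here are assumptions for illustration; the real logic lives in collectFRUEarlyExit and the vendor ipmi_profile system.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// psuBlocksComplete reports whether a streaming "ipmitool fru print"
// dump has already shown at least one PSU FRU block followed by a
// non-PSU header, i.e. the point where the early-exit profile can kill
// the process instead of waiting out the remaining devices.
func psuBlocksComplete(output string) bool {
	sawPSU := false
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "FRU Device Description :") {
			continue
		}
		isPSU := strings.Contains(line, "PSU")
		switch {
		case isPSU:
			sawPSU = true
		case sawPSU:
			return true // non-PSU header after PSU blocks: done
		}
	}
	return false
}

func main() {
	dump := "FRU Device Description : PSU1\nFRU Device Description : PSU2\nFRU Device Description : Riser1\n"
	fmt.Println(psuBlocksComplete(dump)) // true once a non-PSU header follows the PSU blocks
}
```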
Mikhail Chusavitin
ec89616585 Add storage block geometry to audit and viewer 2026-04-29 17:39:11 +03:00
Mikhail Chusavitin
c0dbbf96ad Add vendor RAID tools for livecd 2026-04-29 17:31:25 +03:00
Mikhail Chusavitin
76484b123c Fix fast-path: treat bootloader config changes as heavy
config/bootloaders was missing from the needs_full_build heavy-file
list, so changes to GRUB theme assets (e.g. bee-logo.png RGBA→RGB fix
in 333c44f) were silently skipped by the squashfs-surgery fast-path.
The old broken PNG stayed in boot/grub/live-theme/ inside the ISO.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 15:36:29 +03:00
Mikhail Chusavitin
8901596152 Add server diagnostic tools to ISO, drop btop
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 13:18:50 +03:00
Mikhail Chusavitin
7c504e5056 Collect IOMMU group per PCIe device from sysfs
Reads the iommu_group symlink for each BDF and exposes the group number
as iommu_group in the hardware snapshot JSON.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 12:34:54 +03:00
Mikhail Chusavitin
333c44f3ba Fix GRUB splash: convert bee-logo.png from RGBA to RGB
GRUB does not support RGBA PNG (color_type=6) — loading it returns a
null bitmap, triggering "null src bitmap in grub_video_bitmap_create_scaled".
Alpha channel composited onto black background (#000000 matches desktop-color).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 11:15:16 +03:00
Mikhail Chusavitin
3bca821d3e Add auto fast-path ISO rebuild via squashfs surgery
When only light files changed since the last full lb build (Go source,
overlay scripts/configs), the build now completes automatically in
~5-8 min instead of 30+ min:

- unsquashfs existing squashfs from prior build
- rsync overlay-stage on top
- mksquashfs repack (zstd, same block size)
- xorriso ISO repack with -boot_image any replay (preserves EFI/MBR hybrid)

Heavy changes (VERSIONS, package-lists, hooks, archives, Dockerfile,
auto/config) still trigger a full lb build. Tracking is via a marker file
(.bee-full-build-marker) written after each successful full build.

No change to build-in-container.sh or the full build path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 10:58:26 +03:00
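The heavy/light split above can be sketched as a prefix classification. The prefix list here mirrors the kinds of paths the commit message names; the exact entries are illustrative, not the needs_full_build source.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// heavyPrefixes lists paths whose changes force a full lb build;
// everything else is eligible for the squashfs-surgery fast path.
// (Illustrative entries, modelled on the commit message.)
var heavyPrefixes = []string{
	"VERSIONS",
	"config/package-lists",
	"config/hooks",
	"config/archives",
	"config/bootloaders",
	"Dockerfile",
	"auto/config",
}

// needsFullBuild returns true if any changed path falls under a heavy
// prefix.
func needsFullBuild(changed []string) bool {
	for _, p := range changed {
		p = filepath.ToSlash(p)
		for _, heavy := range heavyPrefixes {
			if p == heavy || strings.HasPrefix(p, heavy+"/") {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(needsFullBuild([]string{"overlay/etc/motd"}))          // false: light, fast path
	fmt.Println(needsFullBuild([]string{"config/hooks/10-tools.sh"})) // true: heavy, full lb build
}
```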
Mikhail Chusavitin
3648e37a1e Update bible submodule to remote HEAD, preserve ascii-safe-text contract locally
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 10:30:27 +03:00
Mikhail Chusavitin
d109e08fab Drop redundant rebuild-image flag 2026-04-29 10:01:57 +03:00
Mikhail Chusavitin
11d00b9442 Document read-only submodules policy 2026-04-29 09:54:23 +03:00
Mikhail Chusavitin
6defa5ae15 Revert chart submodule update 2026-04-29 09:47:35 +03:00
Mikhail Chusavitin
c76658ed00 Update bible and chart submodules 2026-04-29 09:43:57 +03:00
Mikhail Chusavitin
2163017a98 Collect and report storage telemetry 2026-04-29 09:40:58 +03:00
29179917c3 Add USB blackbox log mirroring service 2026-04-24 10:20:12 +03:00
be4b439804 Commit remaining workspace changes 2026-04-23 20:32:26 +03:00
749fc8a94d Unify NVIDIA GPU recovery paths 2026-04-23 20:31:41 +03:00
6112094d45 fix(grub): fix bitmap error and menu rendering
- Convert bee-logo.png to RGBA (color type 6) and strip all metadata
  chunks (cHRM, bKGD, tIME, tEXt) that confuse GRUB's minimal PNG parser
- Move terminal_output gfxterm before insmod png / theme load so the
  theme initialises in an active gfxterm context
- Remove echo ASCII art banner from grub.cfg — with gfxterm active and
  no terminal_box in the theme, echo output renders over the menu area
- Fix icon_heigh typo → icon_height; increase item_height 16→20 with
  item_padding 0→2 for reliable text rendering in boot_menu

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 22:05:16 +03:00
e9a2bc9f9d update submodule 2026-04-22 20:39:27 +03:00
Mikhail Chusavitin
7a8f884664 fix(boot): remove advanced options submenu
Keep only EASY-BEE and toram entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 19:01:50 +03:00
Mikhail Chusavitin
8bf8dfa45b fix(boot): default to KMS + pci=realloc, drop nomodeset from main entries
Default and toram entries now boot with bee.display=kms (ASPEED AST
loads via KMS, Xorg uses modesetting driver) and pci=realloc (Linux
reassigns GPU BARs when BIOS lacks Above 4G Decoding). nomodeset
removed from these entries; still present in GSP=off and fail-safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 19:00:04 +03:00
Mikhail Chusavitin
6a22199aff chore(bible): bump ascii-safe-text contract
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 18:52:10 +03:00
Mikhail Chusavitin
ddb2bb5d1c fix(grub): replace em-dash with ASCII -- in all menu entry titles
Em-dash (U+2014) renders as garbage on GRUB serial/SOL output
(IPMI BMC consoles). Replace with ASCII double-hyphen throughout
grub.cfg template, write_canonical_grub_cfg, and theme.txt comment.

Also align template grub.cfg structure with write_canonical_grub_cfg:
toram entry moved to top level (was inside submenu).

bible: add ascii-safe-text contract documenting the no-em-dash rule.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 18:52:04 +03:00
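The no-em-dash rule itself reduces to one substitution. A minimal sketch; the actual fix edits the grub.cfg template and write_canonical_grub_cfg directly rather than post-processing strings.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiSafe applies the ascii-safe-text contract's no-em-dash rule:
// U+2014 becomes an ASCII double hyphen so GRUB serial/SOL consoles
// (IPMI BMC) render menu titles cleanly.
func asciiSafe(s string) string {
	return strings.ReplaceAll(s, "\u2014", "--")
}

func main() {
	fmt.Println(asciiSafe("EASY-BEE \u2014 toram")) // EASY-BEE -- toram
}
```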
84 changed files with 5116 additions and 561 deletions

View File

@@ -2,6 +2,7 @@ package main
import (
"context"
"errors"
"flag"
"fmt"
"io"
@@ -63,14 +64,20 @@ func run(args []string, stdout, stderr io.Writer) (exitCode int) {
return runExport(args[1:], stdout, stderr)
case "preflight":
return runPreflight(args[1:], stdout, stderr)
case "install-to-ram":
return runInstallToRAM(args[1:], stdout, stderr)
case "support-bundle":
return runSupportBundle(args[1:], stdout, stderr)
case "web":
return runWeb(args[1:], stdout, stderr)
case "blackbox":
return runBlackbox(args[1:], stdout, stderr)
case "sat":
return runSAT(args[1:], stdout, stderr)
case "benchmark":
return runBenchmark(args[1:], stdout, stderr)
case "bee-worker":
return runBeeWorker(args[1:], stdout, stderr)
case "version", "--version", "-version":
fmt.Fprintln(stdout, Version)
return 0
@@ -85,11 +92,14 @@ func printRootUsage(w io.Writer) {
fmt.Fprintln(w, `bee commands:
bee audit --runtime auto|local|livecd --output stdout|file:<path>
bee preflight --output stdout|file:<path>
bee install-to-ram
bee export --target <device>
bee support-bundle --output stdout|file:<path>
bee web --listen :80 [--audit-path `+app.DefaultAuditJSONPath+`]
bee blackbox --export-dir `+app.DefaultExportDir+` [--state-file `+app.DefaultBlackboxStatePath+`]
bee sat nvidia|memory|storage|cpu [--duration <seconds>]
bee benchmark nvidia [--profile standard|stability|overnight]
bee bee-worker --export-dir `+app.DefaultExportDir+` --task-id TASK-001
bee version
bee help [command]`)
}
@@ -102,14 +112,20 @@ func runHelp(args []string, stdout, stderr io.Writer) int {
return runExport([]string{"--help"}, stdout, stdout)
case "preflight":
return runPreflight([]string{"--help"}, stdout, stdout)
case "install-to-ram":
return runInstallToRAM([]string{"--help"}, stdout, stdout)
case "support-bundle":
return runSupportBundle([]string{"--help"}, stdout, stdout)
case "web":
return runWeb([]string{"--help"}, stdout, stdout)
case "blackbox":
return runBlackbox([]string{"--help"}, stdout, stdout)
case "sat":
return runSAT([]string{"--help"}, stdout, stdout)
case "benchmark":
return runBenchmark([]string{"--help"}, stdout, stdout)
case "bee-worker":
return runBeeWorker([]string{"--help"}, stdout, stdout)
case "version":
fmt.Fprintln(stdout, "usage: bee version")
return 0
@@ -241,6 +257,32 @@ func runPreflight(args []string, stdout, stderr io.Writer) int {
return 0
}
func runInstallToRAM(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("install-to-ram", flag.ContinueOnError)
fs.SetOutput(stderr)
fs.Usage = func() {
fmt.Fprintln(stderr, "usage: bee install-to-ram")
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
application := app.New(platform.New())
logLine := func(s string) { fmt.Fprintln(stdout, s) }
if err := application.RunInstallToRAM(context.Background(), logLine); err != nil {
slog.Error("run install-to-ram", "err", err)
return 1
}
return 0
}
func runSupportBundle(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("support-bundle", flag.ContinueOnError)
fs.SetOutput(stderr)
@@ -335,6 +377,33 @@ func runWeb(args []string, stdout, stderr io.Writer) int {
return 0
}
func runBlackbox(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("blackbox", flag.ContinueOnError)
fs.SetOutput(stderr)
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles")
statePath := fs.String("state-file", app.DefaultBlackboxStatePath, "blackbox state file")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee blackbox [--export-dir %s] [--state-file %s]\n", app.DefaultExportDir, app.DefaultBlackboxStatePath)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
slog.Info("starting bee blackbox", "export_dir", *exportDir, "state_file", *statePath)
if err := app.RunBlackbox(context.Background(), *exportDir, *statePath, platform.New()); err != nil && !errors.Is(err, context.Canceled) {
slog.Error("run blackbox", "err", err)
return 1
}
return 0
}
func runSAT(args []string, stdout, stderr io.Writer) int {
if len(args) == 0 {
fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
@@ -462,6 +531,28 @@ func runBenchmark(args []string, stdout, stderr io.Writer) int {
return 0
}
func runBeeWorker(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("bee-worker", flag.ContinueOnError)
fs.SetOutput(stderr)
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with task state and artifacts")
taskID := fs.String("task-id", "", "task identifier, e.g. TASK-001")
fs.Usage = func() {
fmt.Fprintf(stderr, "usage: bee bee-worker --export-dir %s --task-id TASK-001\n", app.DefaultExportDir)
fs.PrintDefaults()
}
if err := fs.Parse(args); err != nil {
if err == flag.ErrHelp {
return 0
}
return 2
}
if fs.NArg() != 0 {
fs.Usage()
return 2
}
return webui.RunPersistedTask(*exportDir, *taskID, stdout, stderr)
}
func parseBenchmarkIndexCSV(raw string) ([]int, error) {
raw = strings.TrimSpace(raw)
if raw == "" {

View File

@@ -0,0 +1,779 @@
package app
import (
"bytes"
"context"
"crypto/rand"
"encoding/hex"
"encoding/json"
"errors"
"fmt"
"io/fs"
"os"
"os/exec"
"path/filepath"
"sort"
"strings"
"sync"
"time"
"bee/audit/internal/platform"
)
const (
blackboxMarkerName = ".bee-blackbox"
blackboxDiscoverInterval = 2 * time.Second
blackboxMinFlushPeriod = 1 * time.Second
blackboxMaxFlushPeriod = 30 * time.Second
blackboxRecoveryFastCount = 5
)
var DefaultBlackboxStatePath = DefaultExportDir + "/blackbox-state.json"
var (
blackboxExecCommand = exec.Command
blackboxNow = func() time.Time { return time.Now().UTC() }
)
type BlackboxMarker struct {
Version int `json:"version"`
EnrollmentID string `json:"enrollment_id"`
CreatedAtUTC string `json:"created_at_utc"`
Host string `json:"host,omitempty"`
}
type BlackboxTargetStatus struct {
EnrollmentID string `json:"enrollment_id"`
Device string `json:"device"`
FS platform.RemovableTarget `json:"fs"`
BootFolder string `json:"boot_folder"`
Status string `json:"status"`
LastSyncAtUTC string `json:"last_sync_at_utc,omitempty"`
LastCycleDuration string `json:"last_cycle_duration,omitempty"`
FlushPeriod string `json:"flush_period"`
LastError string `json:"last_error,omitempty"`
Mountpoint string `json:"mountpoint,omitempty"`
}
type BlackboxState struct {
Status string `json:"status"`
BootStartedAtUTC string `json:"boot_started_at_utc"`
BootFolder string `json:"boot_folder"`
UpdatedAtUTC string `json:"updated_at_utc"`
Targets []BlackboxTargetStatus `json:"targets"`
}
type blackboxRuntime struct {
exportDir string
statePath string
system *platform.System
bootStarted time.Time
bootFolder string
mu sync.Mutex
workers map[string]*blackboxWorker
}
type discoveredBlackboxTarget struct {
marker BlackboxMarker
target platform.RemovableTarget
seenMount string
mountedByBee bool
}
type blackboxWorker struct {
runtime *blackboxRuntime
enrollmentID string
mu sync.Mutex
target platform.RemovableTarget
marker BlackboxMarker
mountpoint string
mountedByBee bool
status string
lastSyncAt time.Time
lastDuration time.Duration
flushPeriod time.Duration
lastError string
fastCycles int
stopCh chan struct{}
stoppedCh chan struct{}
}
func RunBlackbox(ctx context.Context, exportDir, statePath string, system *platform.System) error {
exportDir = strings.TrimSpace(exportDir)
if exportDir == "" {
exportDir = DefaultExportDir
}
statePath = strings.TrimSpace(statePath)
if statePath == "" {
statePath = DefaultBlackboxStatePath
}
if system == nil {
system = platform.New()
}
bootStarted, err := bootStartedAtUTC()
if err != nil {
bootStarted = blackboxNow()
}
rt := &blackboxRuntime{
exportDir: exportDir,
statePath: statePath,
system: system,
bootStarted: bootStarted,
bootFolder: SupportBundleBaseName(bootStarted),
workers: make(map[string]*blackboxWorker),
}
_ = os.MkdirAll(filepath.Dir(statePath), 0755)
rt.persistState()
ticker := time.NewTicker(blackboxDiscoverInterval)
defer ticker.Stop()
for {
rt.reconcile()
select {
case <-ctx.Done():
rt.stopAll()
return ctx.Err()
case <-ticker.C:
}
}
}
func ReadBlackboxState(path string) (BlackboxState, error) {
path = strings.TrimSpace(path)
if path == "" {
path = DefaultBlackboxStatePath
}
raw, err := os.ReadFile(path)
if err != nil {
return BlackboxState{}, err
}
var state BlackboxState
if err := json.Unmarshal(raw, &state); err != nil {
return BlackboxState{}, err
}
return state, nil
}
func EnableBlackboxTarget(target platform.RemovableTarget) (BlackboxMarker, error) {
target = sanitizeRemovableTarget(target)
if target.Device == "" {
return BlackboxMarker{}, fmt.Errorf("device is required")
}
mountpoint, mountedByBee, err := ensureMountedTarget(target, "marker")
if err != nil {
return BlackboxMarker{}, err
}
defer func() {
if mountedByBee {
_ = unmountTarget(mountpoint)
}
}()
marker, _, err := readBlackboxMarker(mountpoint)
if err != nil && !errors.Is(err, os.ErrNotExist) {
return BlackboxMarker{}, err
}
if marker.EnrollmentID == "" {
marker = BlackboxMarker{
Version: 1,
EnrollmentID: newBlackboxEnrollmentID(),
CreatedAtUTC: blackboxNow().Format(time.RFC3339),
Host: hostnameOr("unknown"),
}
}
if err := writeBlackboxMarker(mountpoint, marker); err != nil {
return BlackboxMarker{}, err
}
return marker, nil
}
func DisableBlackboxTarget(device, enrollmentID string) error {
device = strings.TrimSpace(device)
enrollmentID = strings.TrimSpace(enrollmentID)
if device == "" && enrollmentID == "" {
return fmt.Errorf("device or enrollment_id is required")
}
system := platform.New()
targets, err := system.ListRemovableTargets()
if err != nil {
return err
}
for _, target := range targets {
target = sanitizeRemovableTarget(target)
mountpoint, mountedByBee, mountErr := ensureMountedTarget(target, "marker")
if mountErr != nil {
continue
}
remove := false
marker, _, err := readBlackboxMarker(mountpoint)
if err == nil {
if enrollmentID != "" && marker.EnrollmentID == enrollmentID {
remove = true
}
if device != "" && target.Device == device {
remove = true
}
}
if remove {
err = os.Remove(filepath.Join(mountpoint, blackboxMarkerName))
}
if mountedByBee {
_ = unmountTarget(mountpoint)
}
if remove {
return err
}
}
return os.ErrNotExist
}
func (rt *blackboxRuntime) reconcile() {
discovered, _ := rt.discoverMarkedTargets()
rt.mu.Lock()
defer rt.mu.Unlock()
seen := make(map[string]struct{}, len(discovered))
for _, found := range discovered {
seen[found.marker.EnrollmentID] = struct{}{}
worker, ok := rt.workers[found.marker.EnrollmentID]
if !ok {
worker = newBlackboxWorker(rt, found)
rt.workers[found.marker.EnrollmentID] = worker
go worker.run()
continue
}
worker.update(found)
}
for id, worker := range rt.workers {
if _, ok := seen[id]; ok {
continue
}
worker.stop()
delete(rt.workers, id)
}
rt.persistStateLocked()
}
func (rt *blackboxRuntime) stopAll() {
rt.mu.Lock()
workers := make([]*blackboxWorker, 0, len(rt.workers))
for _, worker := range rt.workers {
workers = append(workers, worker)
}
rt.workers = map[string]*blackboxWorker{}
rt.persistStateLocked()
rt.mu.Unlock()
for _, worker := range workers {
worker.stop()
}
}
func (rt *blackboxRuntime) discoverMarkedTargets() ([]discoveredBlackboxTarget, error) {
targets, err := rt.system.ListRemovableTargets()
if err != nil {
return nil, err
}
var out []discoveredBlackboxTarget
for _, rawTarget := range targets {
target := sanitizeRemovableTarget(rawTarget)
if target.Device == "" {
continue
}
mountpoint, mountedByBee, err := ensureMountedTarget(target, "probe")
if err != nil {
continue
}
marker, ok, err := readBlackboxMarker(mountpoint)
if mountedByBee && !ok {
_ = unmountTarget(mountpoint)
}
if err != nil || !ok || marker.EnrollmentID == "" {
continue
}
if mountedByBee {
_ = unmountTarget(mountpoint)
}
out = append(out, discoveredBlackboxTarget{
marker: marker,
target: target,
seenMount: mountpoint,
mountedByBee: mountedByBee,
})
}
sort.Slice(out, func(i, j int) bool {
return out[i].marker.EnrollmentID < out[j].marker.EnrollmentID
})
return out, nil
}
func newBlackboxWorker(rt *blackboxRuntime, found discoveredBlackboxTarget) *blackboxWorker {
return &blackboxWorker{
runtime: rt,
enrollmentID: found.marker.EnrollmentID,
target: found.target,
marker: found.marker,
flushPeriod: blackboxMinFlushPeriod,
status: "running",
stopCh: make(chan struct{}),
stoppedCh: make(chan struct{}),
}
}
func (w *blackboxWorker) run() {
defer close(w.stoppedCh)
for {
start := time.Now()
err := w.syncCycle()
duration := time.Since(start)
w.finishCycle(duration, err)
wait := w.currentFlushPeriod()
timer := time.NewTimer(wait)
select {
case <-w.stopCh:
timer.Stop()
w.cleanup()
return
case <-timer.C:
}
}
}
func (w *blackboxWorker) update(found discoveredBlackboxTarget) {
w.mu.Lock()
defer w.mu.Unlock()
w.target = found.target
w.marker = found.marker
}
func (w *blackboxWorker) stop() {
select {
case <-w.stopCh:
default:
close(w.stopCh)
}
<-w.stoppedCh
}
func (w *blackboxWorker) currentFlushPeriod() time.Duration {
w.mu.Lock()
defer w.mu.Unlock()
return w.flushPeriod
}
func (w *blackboxWorker) finishCycle(duration time.Duration, err error) {
w.mu.Lock()
defer w.mu.Unlock()
w.lastDuration = duration
if err != nil {
w.status = "degraded"
w.lastError = err.Error()
w.fastCycles = 0
w.flushPeriod = adjustFlushPeriod(w.flushPeriod, duration, false, 0)
} else {
w.status = "running"
w.lastSyncAt = blackboxNow()
w.lastError = ""
if duration <= w.flushPeriod/2 {
w.fastCycles++
} else {
w.fastCycles = 0
}
w.flushPeriod = adjustFlushPeriod(w.flushPeriod, duration, true, w.fastCycles)
}
w.runtime.persistState()
}
func adjustFlushPeriod(current, duration time.Duration, success bool, fastCycles int) time.Duration {
if current <= 0 {
current = blackboxMinFlushPeriod
}
if duration <= 0 {
duration = current
}
next := current
if duration > current {
growA := time.Duration(float64(current) * 1.25)
growB := time.Duration(float64(duration) * 1.25)
if growB > growA {
next = growB
} else {
next = growA
}
}
if success && fastCycles >= blackboxRecoveryFastCount {
next = time.Duration(float64(current) * 0.9)
}
if next < blackboxMinFlushPeriod {
next = blackboxMinFlushPeriod
}
if next > blackboxMaxFlushPeriod {
next = blackboxMaxFlushPeriod
}
return next
}
func (w *blackboxWorker) syncCycle() error {
target, marker := w.snapshotTarget()
mountpoint, mountedByBee, err := ensureMountedTarget(target, marker.EnrollmentID)
if err != nil {
return err
}
w.recordMountpoint(mountpoint, mountedByBee)
root := filepath.Join(mountpoint, w.runtime.bootFolder)
if err := os.MkdirAll(filepath.Join(root, "export"), 0755); err != nil {
return err
}
if err := syncDirectoryTree(w.runtime.exportDir, filepath.Join(root, "export")); err != nil {
return err
}
if err := w.captureSnapshots(root); err != nil {
return err
}
return syncFilesystem(root)
}
func (w *blackboxWorker) cleanup() {
w.mu.Lock()
mountpoint := w.mountpoint
mountedByBee := w.mountedByBee
w.mu.Unlock()
if mountedByBee && mountpoint != "" {
_ = unmountTarget(mountpoint)
}
}
func (w *blackboxWorker) snapshotTarget() (platform.RemovableTarget, BlackboxMarker) {
w.mu.Lock()
defer w.mu.Unlock()
return w.target, w.marker
}
func (w *blackboxWorker) recordMountpoint(mountpoint string, mountedByBee bool) {
w.mu.Lock()
defer w.mu.Unlock()
w.mountpoint = mountpoint
w.mountedByBee = mountedByBee
}
func (w *blackboxWorker) captureSnapshots(root string) error {
if err := captureCommandAtomic(filepath.Join(root, "systemd", "combined.journal.log"), "journalctl", "--no-pager", "--since", w.runtime.bootStarted.Format(time.RFC3339)); err != nil {
return err
}
for _, svc := range supportBundleServices {
if err := captureCommandAtomic(filepath.Join(root, "systemd", svc+".journal.log"), "journalctl", "--no-pager", "-u", svc, "--since", w.runtime.bootStarted.Format(time.RFC3339)); err != nil {
return err
}
if err := captureCommandAtomic(filepath.Join(root, "systemd", svc+".status.txt"), "systemctl", "status", svc, "--no-pager"); err != nil {
return err
}
}
if err := captureCommandAtomic(filepath.Join(root, "system", "dmesg.txt"), "dmesg"); err != nil {
return err
}
for _, item := range supportBundleOptionalFiles {
if err := copyFileIfChanged(item.src, filepath.Join(root, item.name)); err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
}
return nil
}
func (rt *blackboxRuntime) persistState() {
rt.mu.Lock()
defer rt.mu.Unlock()
rt.persistStateLocked()
}
func (rt *blackboxRuntime) persistStateLocked() {
state := BlackboxState{
Status: "disabled",
BootStartedAtUTC: rt.bootStarted.Format(time.RFC3339),
BootFolder: rt.bootFolder,
UpdatedAtUTC: blackboxNow().Format(time.RFC3339),
Targets: make([]BlackboxTargetStatus, 0, len(rt.workers)),
}
if len(rt.workers) > 0 {
state.Status = "running"
}
for _, worker := range rt.workers {
worker.mu.Lock()
targetState := BlackboxTargetStatus{
EnrollmentID: worker.enrollmentID,
Device: worker.target.Device,
FS: worker.target,
BootFolder: rt.bootFolder,
Status: worker.status,
FlushPeriod: worker.flushPeriod.String(),
LastError: worker.lastError,
Mountpoint: worker.mountpoint,
}
if !worker.lastSyncAt.IsZero() {
targetState.LastSyncAtUTC = worker.lastSyncAt.Format(time.RFC3339)
}
if worker.lastDuration > 0 {
targetState.LastCycleDuration = worker.lastDuration.String()
}
if worker.status == "degraded" {
state.Status = "degraded"
}
worker.mu.Unlock()
state.Targets = append(state.Targets, targetState)
}
sort.Slice(state.Targets, func(i, j int) bool {
return state.Targets[i].EnrollmentID < state.Targets[j].EnrollmentID
})
_ = writeJSONAtomic(rt.statePath, state)
}
func bootStartedAtUTC() (time.Time, error) {
raw, err := os.ReadFile("/proc/stat")
if err != nil {
return time.Time{}, err
}
for _, line := range strings.Split(string(raw), "\n") {
line = strings.TrimSpace(line)
if !strings.HasPrefix(line, "btime ") {
continue
}
parts := strings.Fields(line)
if len(parts) != 2 {
break
}
sec, err := time.ParseDuration(parts[1] + "s")
if err != nil {
break
}
return time.Unix(int64(sec/time.Second), 0).UTC(), nil
}
return time.Time{}, fmt.Errorf("boot time not found")
}
func newBlackboxEnrollmentID() string {
var buf [8]byte
if _, err := rand.Read(buf[:]); err != nil {
return fmt.Sprintf("bb-%d", time.Now().UnixNano())
}
return "bb-" + hex.EncodeToString(buf[:])
}
func sanitizeRemovableTarget(target platform.RemovableTarget) platform.RemovableTarget {
target.Device = strings.TrimSpace(target.Device)
target.FSType = strings.TrimSpace(target.FSType)
target.Size = strings.TrimSpace(target.Size)
target.Label = strings.TrimSpace(target.Label)
target.Model = strings.TrimSpace(target.Model)
target.Mountpoint = strings.TrimSpace(target.Mountpoint)
return target
}
func ensureMountedTarget(target platform.RemovableTarget, suffix string) (mountpoint string, mountedByBee bool, retErr error) {
target = sanitizeRemovableTarget(target)
if target.Mountpoint != "" {
if err := ensureWritableBlackboxMountpoint(target.Mountpoint); err == nil {
return target.Mountpoint, false, nil
}
}
mountpoint = filepath.Join("/tmp", "bee-blackbox-"+sanitizeFilename(suffix))
if err := os.MkdirAll(mountpoint, 0755); err != nil {
return "", false, err
}
if raw, err := blackboxExecCommand("mount", target.Device, mountpoint).CombinedOutput(); err != nil {
return "", false, formatBlackboxMountTargetError(target, string(raw), err)
}
if err := ensureWritableBlackboxMountpoint(mountpoint); err != nil {
_ = unmountTarget(mountpoint)
return "", false, err
}
return mountpoint, true, nil
}
func unmountTarget(mountpoint string) error {
_ = blackboxExecCommand("sync").Run()
raw, err := blackboxExecCommand("umount", mountpoint).CombinedOutput()
if err != nil {
msg := strings.TrimSpace(string(raw))
if msg == "" {
return err
}
return fmt.Errorf("%s: %w", msg, err)
}
return nil
}
func readBlackboxMarker(mountpoint string) (BlackboxMarker, bool, error) {
raw, err := os.ReadFile(filepath.Join(mountpoint, blackboxMarkerName))
if err != nil {
if errors.Is(err, os.ErrNotExist) {
return BlackboxMarker{}, false, os.ErrNotExist
}
return BlackboxMarker{}, false, err
}
var marker BlackboxMarker
if err := json.Unmarshal(raw, &marker); err != nil {
return BlackboxMarker{}, false, err
}
return marker, true, nil
}
func writeBlackboxMarker(mountpoint string, marker BlackboxMarker) error {
if marker.Version == 0 {
marker.Version = 1
}
return writeJSONAtomic(filepath.Join(mountpoint, blackboxMarkerName), marker)
}
func syncDirectoryTree(srcDir, dstDir string) error {
seen := make(map[string]struct{})
err := filepath.WalkDir(srcDir, func(path string, d fs.DirEntry, err error) error {
if err != nil {
return err
}
rel, err := filepath.Rel(srcDir, path)
if err != nil {
return err
}
rel = filepath.Clean(rel)
if rel == "." {
seen["."] = struct{}{}
return os.MkdirAll(dstDir, 0755)
}
seen[rel] = struct{}{}
dstPath := filepath.Join(dstDir, rel)
if d.IsDir() {
info, err := d.Info()
if err != nil {
return err
}
return os.MkdirAll(dstPath, info.Mode().Perm())
}
return copyFileIfChanged(path, dstPath)
})
if err != nil {
return err
}
return removeMissingPaths(dstDir, seen)
}
func removeMissingPaths(dstDir string, seen map[string]struct{}) error {
return filepath.WalkDir(dstDir, func(path string, d fs.DirEntry, err error) error {
if err != nil {
return err
}
rel, err := filepath.Rel(dstDir, path)
if err != nil {
return err
}
rel = filepath.Clean(rel)
if rel == "." {
return nil
}
if _, ok := seen[rel]; ok {
return nil
}
return os.RemoveAll(path)
})
}
func copyFileIfChanged(src, dst string) error {
info, err := os.Stat(src)
if err != nil {
return err
}
if info.IsDir() {
return os.MkdirAll(dst, info.Mode().Perm())
}
srcData, err := os.ReadFile(src)
if err != nil {
return err
}
if dstData, err := os.ReadFile(dst); err == nil && bytes.Equal(dstData, srcData) {
return nil
}
return writeFileAtomic(dst, srcData, info.Mode().Perm())
}
func captureCommandAtomic(dst string, name string, args ...string) error {
raw, err := blackboxExecCommand(name, args...).CombinedOutput()
if len(raw) == 0 {
if err != nil {
raw = []byte(err.Error() + "\n")
} else {
raw = []byte("no output\n")
}
}
return writeFileAtomic(dst, raw, 0644)
}
func writeJSONAtomic(path string, v any) error {
raw, err := json.MarshalIndent(v, "", " ")
if err != nil {
return err
}
raw = append(raw, '\n')
return writeFileAtomic(path, raw, 0644)
}
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return err
}
if existing, err := os.ReadFile(path); err == nil && bytes.Equal(existing, data) {
return nil
}
tmp := path + ".tmp"
f, err := os.OpenFile(tmp, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, perm)
if err != nil {
return err
}
if _, err := f.Write(data); err != nil {
_ = f.Close()
return err
}
if err := f.Sync(); err != nil {
_ = f.Close()
return err
}
if err := f.Close(); err != nil {
return err
}
if err := os.Rename(tmp, path); err != nil {
return err
}
return syncFilesystem(filepath.Dir(path))
}
func syncFilesystem(path string) error {
return blackboxExecCommand("sync").Run()
}
func ensureWritableBlackboxMountpoint(mountpoint string) error {
probe, err := os.CreateTemp(mountpoint, ".bee-blackbox-write-test-*")
if err != nil {
return fmt.Errorf("target filesystem is not writable: %w", err)
}
name := probe.Name()
if closeErr := probe.Close(); closeErr != nil {
_ = os.Remove(name)
return closeErr
}
if err := os.Remove(name); err != nil {
return err
}
return nil
}
func formatBlackboxMountTargetError(target platform.RemovableTarget, raw string, err error) error {
msg := strings.TrimSpace(raw)
fstype := strings.ToLower(strings.TrimSpace(target.FSType))
if fstype == "exfat" && strings.Contains(strings.ToLower(msg), "unknown filesystem type 'exfat'") {
return fmt.Errorf("mount %s: exFAT support is missing in this ISO build: %w", target.Device, err)
}
if msg == "" {
return err
}
return fmt.Errorf("%s: %w", msg, err)
}


@@ -0,0 +1,52 @@
package app
import (
"path/filepath"
"testing"
"time"
)
func TestAdjustFlushPeriodGrowsOnSlowCycle(t *testing.T) {
current := 2 * time.Second
got := adjustFlushPeriod(current, 4*time.Second, false, 0)
if got <= current {
t.Fatalf("adjustFlushPeriod=%s want > %s", got, current)
}
}
func TestAdjustFlushPeriodShrinksAfterFastCycles(t *testing.T) {
current := 10 * time.Second
got := adjustFlushPeriod(current, 2*time.Second, true, blackboxRecoveryFastCount)
if got >= current {
t.Fatalf("adjustFlushPeriod=%s want < %s", got, current)
}
if got < blackboxMinFlushPeriod {
t.Fatalf("adjustFlushPeriod=%s below min %s", got, blackboxMinFlushPeriod)
}
}
func TestReadBlackboxState(t *testing.T) {
path := filepath.Join(t.TempDir(), "blackbox-state.json")
want := BlackboxState{
Status: "running",
BootStartedAtUTC: "2026-04-24T00:00:00Z",
BootFolder: "boot-folder",
UpdatedAtUTC: "2026-04-24T00:00:01Z",
Targets: []BlackboxTargetStatus{{
EnrollmentID: "bb-1",
Device: "/dev/sdb1",
Status: "running",
FlushPeriod: "1s",
}},
}
if err := writeJSONAtomic(path, want); err != nil {
t.Fatalf("writeJSONAtomic: %v", err)
}
got, err := ReadBlackboxState(path)
if err != nil {
t.Fatalf("ReadBlackboxState: %v", err)
}
if got.Status != want.Status || got.BootFolder != want.BootFolder || len(got.Targets) != 1 || got.Targets[0].EnrollmentID != "bb-1" {
t.Fatalf("state=%+v", got)
}
}


@@ -15,6 +15,7 @@ import (
)
var supportBundleServices = []string{
"bee-blackbox.service",
"bee-audit.service",
"bee-web.service",
"bee-network.service",
@@ -23,6 +24,8 @@ var supportBundleServices = []string{
"bee-selfheal.service",
"bee-selfheal.timer",
"bee-sshsetup.service",
"display-manager.service",
"lightdm.service",
"nvidia-dcgm.service",
"nvidia-fabricmanager.service",
}
@@ -43,12 +46,128 @@ var supportBundleCommands = []struct {
{name: "system/mount.txt", cmd: []string{"mount"}},
{name: "system/df-h.txt", cmd: []string{"df", "-h"}},
{name: "system/dmesg.txt", cmd: []string{"dmesg"}},
{name: "system/dmesg-gui-video-input.txt", cmd: []string{"sh", "-c", `
if command -v dmesg >/dev/null 2>&1; then
dmesg | grep -iE 'nvidia|drm|fb|framebuffer|vesa|efi|lightdm|Xorg|input|hid|usb|keyboard|mouse|virtual keyboard|virtual mouse|ami|aspeed|ast' || echo "no GUI/video/input kernel messages found"
else
echo "dmesg not found"
fi
`}},
{name: "system/kernel-aer-nvidia.txt", cmd: []string{"sh", "-c", `
if command -v dmesg >/dev/null 2>&1; then
dmesg | grep -iE 'AER|NVRM|Xid|pcieport|nvidia' || echo "no AER/NVRM/Xid kernel messages found"
else
echo "dmesg not found"
fi
`}},
{name: "system/loginctl-sessions.txt", cmd: []string{"sh", "-c", `
if command -v loginctl >/dev/null 2>&1; then
loginctl list-sessions 2>&1 || true
else
echo "loginctl not found"
fi
`}},
{name: "system/loginctl-seats.txt", cmd: []string{"sh", "-c", `
if command -v loginctl >/dev/null 2>&1; then
loginctl list-seats 2>&1 || true
echo
for seat in $(loginctl list-seats --no-legend 2>/dev/null | awk '{print $1}'); do
echo "=== $seat ==="
loginctl seat-status "$seat" 2>&1 || true
echo
done
else
echo "loginctl not found"
fi
`}},
{name: "system/ps-gui.txt", cmd: []string{"sh", "-c", `
ps -ef | grep -iE 'lightdm|Xorg|X$|openbox|chromium|chrome|xinit|xsession' | grep -v grep || echo "no GUI processes found"
`}},
{name: "system/lspci-video-vv.txt", cmd: []string{"sh", "-c", `
if ! command -v lspci >/dev/null 2>&1; then
echo "lspci not found"
exit 0
fi
found=0
for dev in $(lspci -Dn | awk '$2 ~ /^03(00|02):$/ {print $1}'); do
found=1
echo "=== $dev ==="
lspci -s "$dev" -vv 2>&1 || true
echo
done
if [ "$found" -eq 0 ]; then
echo "no display-class PCI devices found"
fi
`}},
{name: "system/proc-fb.txt", cmd: []string{"cat", "/proc/fb"}},
{name: "system/drm-cards.txt", cmd: []string{"sh", "-c", `
if [ -d /sys/class/drm ]; then
for path in /sys/class/drm/card*; do
[ -e "$path" ] || continue
card=$(basename "$path")
echo "=== $card ==="
for f in status enabled dpms modes; do
[ -r "$path/$f" ] && printf " %-8s %s\n" "$f" "$(cat "$path/$f" 2>/dev/null)"
done
device=$(readlink -f "$path/device" 2>/dev/null || true)
[ -n "$device" ] && echo " device ${device##*/}"
echo
done
else
echo "/sys/class/drm not present"
fi
`}},
{name: "system/input-devices.txt", cmd: []string{"sh", "-c", `
if [ -r /proc/bus/input/devices ]; then
cat /proc/bus/input/devices
else
echo "/proc/bus/input/devices not readable"
fi
`}},
{name: "system/udevadm-input.txt", cmd: []string{"sh", "-c", `
if ! command -v udevadm >/dev/null 2>&1; then
echo "udevadm not found"
exit 0
fi
found=0
for dev in /dev/input/event*; do
[ -e "$dev" ] || continue
found=1
echo "=== $dev ==="
udevadm info --query=all --name="$dev" 2>&1 || true
echo
done
if [ "$found" -eq 0 ]; then
echo "no /dev/input/event* devices found"
fi
`}},
{name: "system/xinput-list.txt", cmd: []string{"sh", "-c", `
if command -v xinput >/dev/null 2>&1; then
DISPLAY=:0 xinput --list 2>&1 || true
else
echo "xinput not found"
fi
`}},
{name: "system/libinput-list-devices.txt", cmd: []string{"sh", "-c", `
if command -v libinput >/dev/null 2>&1; then
libinput list-devices 2>&1 || true
else
echo "libinput not found"
fi
`}},
{name: "system/systemctl-gui-units.txt", cmd: []string{"sh", "-c", `
if ! command -v systemctl >/dev/null 2>&1; then
echo "systemctl not found"
exit 0
fi
echo "=== unit files ==="
systemctl list-unit-files --no-pager --all 'lightdm*' 'display-manager*' 2>&1 || true
echo
echo "=== active units ==="
systemctl list-units --no-pager --all 'lightdm*' 'display-manager*' 2>&1 || true
echo
echo "=== failed units ==="
systemctl --failed --no-pager 2>&1 | grep -iE 'lightdm|display-manager|Xorg' || echo "no failed GUI units"
`}},
{name: "system/nvidia-smi-q.txt", cmd: []string{"nvidia-smi", "-q"}},
{name: "system/nvidia-smi-topo.txt", cmd: []string{"sh", "-c", `
@@ -235,6 +354,13 @@ var supportBundleOptionalFiles = []struct {
}{
{name: "system/kern.log", src: "/var/log/kern.log"},
{name: "system/syslog.txt", src: "/var/log/syslog"},
{name: "system/Xorg.0.log", src: "/var/log/Xorg.0.log"},
{name: "system/Xorg.0.log.old", src: "/var/log/Xorg.0.log.old"},
{name: "system/lightdm/lightdm.log", src: "/var/log/lightdm/lightdm.log"},
{name: "system/lightdm/x-0.log", src: "/var/log/lightdm/x-0.log"},
{name: "system/lightdm/x-0-greeter.log", src: "/var/log/lightdm/x-0-greeter.log"},
{name: "system/home-bee-xsession-errors.log", src: "/home/bee/.xsession-errors"},
{name: "system/home-bee-chromium-debug.log", src: "/tmp/bee-chrome/chrome_debug.log"},
{name: "system/fabricmanager.log", src: "/var/log/fabricmanager.log"},
{name: "system/nvlsm.log", src: "/var/log/nvlsm.log"},
{name: "system/fabricmanager/fabricmanager.log", src: "/var/log/fabricmanager/fabricmanager.log"},
@@ -256,11 +382,6 @@ func BuildSupportBundle(exportDir string) (string, error) {
}
now := time.Now().UTC()
stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-stage-%s-%s", sanitizeFilename(hostnameOr("unknown")), now.Format("20060102-150405")))
if err := os.MkdirAll(stageRoot, 0755); err != nil {
@@ -294,7 +415,7 @@ func BuildSupportBundle(exportDir string) (string, error) {
return "", err
}
archiveName := SupportBundleBaseName(now) + ".tar.gz"
archivePath := filepath.Join(os.TempDir(), archiveName)
if err := createSupportTarGz(archivePath, stageRoot); err != nil {
return "", err
@@ -302,6 +423,16 @@ func BuildSupportBundle(exportDir string) (string, error) {
return archivePath, nil
}
func SupportBundleBaseName(at time.Time) string {
at = at.UTC()
date := at.Format("2006-01-02")
tod := at.Format("150405")
ver := bundleVersion()
model := serverModelForBundle()
sn := serverSerialForBundle()
return fmt.Sprintf("%s (BEE-SP v%s) %s %s %s", date, ver, model, sn, tod)
}
func LatestSupportBundlePath() (string, error) {
return latestSupportBundlePath(os.TempDir())
}


@@ -3,6 +3,7 @@ package collector
import (
"bee/audit/internal/schema"
"bufio"
"context"
"log/slog"
"os"
"os/exec"
@@ -17,14 +18,6 @@ var execDmidecode = func(typeNum string) (string, error) {
return string(out), nil
}
// collectBoard runs dmidecode for types 0, 1, 2 and returns the board record
// plus the BIOS firmware entry. Any failure is logged and returns zero values.
func collectBoard() (schema.HardwareBoard, []schema.HardwareFirmwareRecord) {
@@ -80,19 +73,23 @@ func parseBoard(type1, type2 string) schema.HardwareBoard {
// collectBMCFirmware collects BMC firmware version via ipmitool mc info.
// Returns nil if ipmitool is missing, /dev/ipmi0 is absent, or any error occurs.
func collectBMCFirmware(manufacturer string) []schema.HardwareFirmwareRecord {
if _, err := exec.LookPath("ipmitool"); err != nil {
return nil
}
if _, err := os.Stat("/dev/ipmi0"); err != nil {
return nil
}
profile := selectIPMIProfile(manufacturer)
ctx, cancel := context.WithTimeout(context.Background(), profile.mcInfoTimeout)
defer cancel()
cmd := exec.CommandContext(ctx, "ipmitool", "mc", "info")
raw, err := cmd.Output()
if err != nil {
slog.Info("bmc: ipmitool mc info unavailable", "err", err)
return nil
}
version := parseBMCFirmwareRevision(string(raw))
if version == "" {
return nil
}


@@ -23,7 +23,7 @@ func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
board, biosFW := collectBoard()
snap.Board = board
snap.Firmware = append(snap.Firmware, biosFW...)
snap.Firmware = append(snap.Firmware, collectBMCFirmware(derefString(snap.Board.Manufacturer))...)
snap.CPUs = collectCPUs()
@@ -34,6 +34,7 @@ func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
}
snap.CPUs = enrichCPUsWithTelemetry(snap.CPUs, sensorDoc)
snap.Memory = enrichMemoryWithTelemetry(snap.Memory, sensorDoc)
bestEffortRescanHotplugStorage()
snap.Storage = collectStorage()
snap.PCIeDevices = collectPCIe()
snap.PCIeDevices = enrichPCIeWithAMD(snap.PCIeDevices)
@@ -44,7 +45,8 @@ func Run(_ runtimeenv.Mode) schema.HardwareIngestRequest {
snap.PCIeDevices = enrichPCIeWithRAIDTelemetry(snap.PCIeDevices)
snap.Storage = enrichStorageWithVROC(snap.Storage, snap.PCIeDevices)
snap.Storage = appendUniqueStorage(snap.Storage, collectRAIDStorage(snap.PCIeDevices))
snap.VROCLicense = collectVROCLicense(snap.PCIeDevices)
snap.PowerSupplies = collectPSUs(derefString(snap.Board.Manufacturer))
snap.PowerSupplies = enrichPSUsWithTelemetry(snap.PowerSupplies, sensorDoc)
snap.Sensors = buildSensorsFromDoc(sensorDoc)
finalizeSnapshot(&snap, collectedAt)


@@ -0,0 +1,92 @@
package collector
// Package-level IPMI tuning profiles.
//
// Each profile is matched by board manufacturer (already known before PSU
// collection runs). The profile drives two things:
// - Per-command timeouts — prevents infinite hangs on slow BMCs.
// - FRU early-exit — streaming parser stops reading once all PSU entries
// are found, avoiding the tail of non-PSU FRU records.
//
// To add a new vendor: append to ipmiProfiles. The first matching entry wins.
import (
"strings"
"time"
)
// ipmiProfile holds tuning parameters for one or more board manufacturers.
type ipmiProfile struct {
// name is shown in log messages.
name string
// manufacturers is a list of lowercase substrings matched against the
// board manufacturer string from dmidecode type 1.
manufacturers []string
// fruTimeout is the hard deadline for the entire `ipmitool fru print`
// command. Zero means no timeout (not recommended).
fruTimeout time.Duration
// sdrTimeout is the hard deadline for `ipmitool sdr`.
sdrTimeout time.Duration
// mcInfoTimeout is the hard deadline for `ipmitool mc info`.
mcInfoTimeout time.Duration
// fruEarlyExit instructs the streaming FRU parser to stop reading
// after it has found at least one PSU entry and the current block is
// complete. Useful on servers with many non-PSU FRU devices.
fruEarlyExit bool
}
// ipmiProfiles is the ordered list of profiles. First match wins.
var ipmiProfiles = []ipmiProfile{
{
// Lenovo XCC-based servers (ThinkSystem SR6xx / SR8xx / ST series).
// SR650 V3 has 54 FRU devices; each IPMI read takes ~2 s, so the
// full `fru print` scan takes ~108 s on a loaded BMC. Enable early
// exit so collection stops once PSU records are found.
name: "lenovo",
manufacturers: []string{"lenovo"},
fruTimeout: 90 * time.Second,
sdrTimeout: 45 * time.Second,
mcInfoTimeout: 15 * time.Second,
fruEarlyExit: true,
},
{
// HPE iLO-based servers (ProLiant DL/ML/BL).
name: "hpe",
manufacturers: []string{"hp", "hewlett packard"},
fruTimeout: 60 * time.Second,
sdrTimeout: 30 * time.Second,
mcInfoTimeout: 10 * time.Second,
fruEarlyExit: false,
},
{
// Dell iDRAC-based servers.
name: "dell",
manufacturers: []string{"dell"},
fruTimeout: 60 * time.Second,
sdrTimeout: 30 * time.Second,
mcInfoTimeout: 10 * time.Second,
fruEarlyExit: false,
},
}
// defaultIPMIProfile is used when no vendor profile matches.
var defaultIPMIProfile = ipmiProfile{
name: "default",
fruTimeout: 60 * time.Second,
sdrTimeout: 30 * time.Second,
mcInfoTimeout: 10 * time.Second,
fruEarlyExit: false,
}
// selectIPMIProfile returns the profile for the given board manufacturer.
func selectIPMIProfile(manufacturer string) ipmiProfile {
mfgLower := strings.ToLower(strings.TrimSpace(manufacturer))
for _, p := range ipmiProfiles {
for _, m := range p.manufacturers {
if strings.Contains(mfgLower, m) {
return p
}
}
}
return defaultIPMIProfile
}
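The first-match-wins substring selection used by selectIPMIProfile can be exercised in isolation. A minimal sketch (profile names and manufacturer substrings mirror the table above; the struct is reduced to the fields needed for matching):

```go
package main

import (
	"fmt"
	"strings"
)

// profile holds the manufacturer substrings for one vendor entry.
type profile struct {
	name          string
	manufacturers []string // lowercase substrings matched against dmidecode type 1
}

// profiles is ordered: the first matching entry wins.
var profiles = []profile{
	{name: "lenovo", manufacturers: []string{"lenovo"}},
	{name: "hpe", manufacturers: []string{"hp", "hewlett packard"}},
	{name: "dell", manufacturers: []string{"dell"}},
}

// selectProfile lowercases and trims the manufacturer string, then returns
// the first profile containing a matching substring, else "default".
func selectProfile(manufacturer string) string {
	m := strings.ToLower(strings.TrimSpace(manufacturer))
	for _, p := range profiles {
		for _, sub := range p.manufacturers {
			if strings.Contains(m, sub) {
				return p.name
			}
		}
	}
	return "default"
}

func main() {
	fmt.Println(selectProfile("Lenovo ThinkSystem")) // lenovo
	fmt.Println(selectProfile("Dell Inc."))          // dell
	fmt.Println(selectProfile("Supermicro"))         // default
}
```

Substring matching (rather than equality) tolerates dmidecode variations like "Dell Inc." versus "Dell"; ordering the list makes precedence explicit when substrings could overlap.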


@@ -4,7 +4,9 @@ import (
"bee/audit/internal/schema"
"fmt"
"log/slog"
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
)
@@ -140,6 +142,9 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
} else if numaNode, ok := parsePCINumaNode(fields["NUMANode"]); ok {
dev.NUMANode = &numaNode
}
if group, ok := readPCIIOMMUGroup(bdf); ok {
dev.IOMMUGroup = &group
}
if width, ok := readPCIIntAttribute(bdf, "current_link_width"); ok {
dev.LinkWidth = &width
}
@@ -179,6 +184,21 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
return dev
}
// readPCIIOMMUGroup resolves the IOMMU group number for a BDF via the
// iommu_group symlink in sysfs: .../devices/<bdf>/iommu_group -> .../kernel/iommu_groups/<N>
func readPCIIOMMUGroup(bdf string) (int, bool) {
link := "/sys/bus/pci/devices/" + bdf + "/iommu_group"
target, err := os.Readlink(link)
if err != nil {
return 0, false
}
n, err := strconv.Atoi(filepath.Base(target))
if err != nil {
return 0, false
}
return n, true
}
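The iommu_group symlink trick above reduces to taking the basename of the link target and parsing it as an integer. A small sketch of just that parsing step (the symlink read itself is omitted; `groupFromLink` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// groupFromLink extracts the IOMMU group number from an iommu_group symlink
// target such as "../../../kernel/iommu_groups/42".
func groupFromLink(target string) (int, bool) {
	n, err := strconv.Atoi(filepath.Base(target))
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	n, ok := groupFromLink("../../../kernel/iommu_groups/42")
	fmt.Println(n, ok) // 42 true
}
```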
// readPCIIDs reads vendor and device IDs from sysfs for a given BDF.
func readPCIIDs(bdf string) (vendorID, deviceID int) {
base := "/sys/bus/pci/devices/" + bdf


@@ -2,6 +2,8 @@ package collector
import (
"bee/audit/internal/schema"
"bufio"
"context"
"log/slog"
"os/exec"
"regexp"
@@ -10,16 +12,29 @@ import (
"strings"
)
func collectPSUs(manufacturer string) []schema.HardwarePowerSupply {
profile := selectIPMIProfile(manufacturer)
var psus []schema.HardwarePowerSupply
fruCtx, fruCancel := context.WithTimeout(context.Background(), profile.fruTimeout)
defer fruCancel()
if profile.fruEarlyExit {
psus = collectFRUEarlyExit(fruCtx)
} else {
cmd := exec.CommandContext(fruCtx, "ipmitool", "fru", "print")
if out, err := cmd.Output(); err == nil {
psus = parseFRU(string(out))
} else {
slog.Info("psu: fru unavailable", "err", err)
}
}
sdrData := map[int]psuSDR{}
sdrCtx, sdrCancel := context.WithTimeout(context.Background(), profile.sdrTimeout)
defer sdrCancel()
cmd := exec.CommandContext(sdrCtx, "ipmitool", "sdr")
if sdrOut, err := cmd.Output(); err == nil {
sdrData = parsePSUSDR(string(sdrOut))
if len(psus) == 0 {
psus = synthesizePSUsFromSDR(sdrData)
@@ -30,7 +45,66 @@ func collectPSUs() []schema.HardwarePowerSupply {
slog.Info("psu: ipmitool unavailable, skipping", "err", err)
return nil
}
slog.Info("psu: collected", "count", len(psus), "profile", profile.name)
return psus
}
// collectFRUEarlyExit streams ipmitool fru print line-by-line and stops reading
// as soon as it has found all PSU blocks and the next block is not a PSU.
// This avoids scanning all 50+ non-PSU FRU devices on Lenovo XCC servers.
func collectFRUEarlyExit(ctx context.Context) []schema.HardwarePowerSupply {
cmd := exec.CommandContext(ctx, "ipmitool", "fru", "print")
pipe, err := cmd.StdoutPipe()
if err != nil {
slog.Info("psu: fru pipe unavailable", "err", err)
return nil
}
if err := cmd.Start(); err != nil {
slog.Info("psu: fru start failed", "err", err)
return nil
}
var psus []schema.HardwarePowerSupply
var currentBlock strings.Builder
slot := 0
psuFound := false
stoppedEarly := false
scanner := bufio.NewScanner(pipe)
for scanner.Scan() {
line := scanner.Text()
if strings.HasPrefix(line, "FRU Device Description") {
if currentBlock.Len() > 0 {
if psu, ok := parseFRUBlock(currentBlock.String(), slot); ok {
psus = append(psus, psu)
psuFound = true
slot++
}
currentBlock.Reset()
}
// Stop once we've collected PSUs and hit a non-PSU block header.
if psuFound && !isPSUHeader(strings.ToLower(line)) {
stoppedEarly = true
break
}
}
currentBlock.WriteString(line)
currentBlock.WriteByte('\n')
}
if !stoppedEarly && currentBlock.Len() > 0 {
if psu, ok := parseFRUBlock(currentBlock.String(), slot); ok {
psus = append(psus, psu)
}
}
// Kill the process immediately on early exit rather than waiting for context timeout.
if cmd.Process != nil {
cmd.Process.Kill() //nolint:errcheck
}
cmd.Wait() //nolint:errcheck
slog.Info("psu: fru early-exit complete", "psus_found", len(psus), "stopped_early", stoppedEarly)
return psus
}


@@ -733,6 +733,37 @@ func parseMDStatArrays(raw string) []mdArray {
return arrays
}
// collectVROCLicense runs mdadm --detail-platform and extracts the License field.
// Returns nil when VROC is absent or the platform does not report a license.
func collectVROCLicense(pcie []schema.HardwarePCIeDevice) *string {
if !hasVROCController(pcie) {
return nil
}
out, err := raidToolQuery("mdadm", "--detail-platform")
if err != nil {
slog.Info("vroc: mdadm --detail-platform unavailable", "err", err)
return nil
}
return parseMDAdmPlatformLicense(string(out))
}
func parseMDAdmPlatformLicense(raw string) *string {
for _, line := range strings.Split(raw, "\n") {
trimmed := strings.TrimSpace(line)
if !strings.HasPrefix(strings.ToLower(trimmed), "license") {
continue
}
if idx := strings.Index(trimmed, ":"); idx >= 0 {
val := strings.TrimSpace(trimmed[idx+1:])
if val != "" {
v := strings.ToLower(val)
return &v
}
}
}
return nil
}
func queryDeviceSerial(devPath string) string {
if out, err := exec.Command("nvme", "id-ctrl", devPath, "-o", "json").Output(); err == nil {
var ctrl nvmeIDCtrl


@@ -4,12 +4,52 @@ import (
"bee/audit/internal/schema"
"encoding/json"
"log/slog"
"os"
"os/exec"
"path/filepath"
"regexp"
"strconv"
"strings"
)
var (
pciRescanPath = "/sys/bus/pci/rescan"
scsiHostScanGlob = "/sys/class/scsi_host/host*/scan"
hotplugWriteFile = os.WriteFile
hotplugExecCommand = exec.Command
hotplugGlob = filepath.Glob
nvmeLBAFCompactRE = regexp.MustCompile(`(?im)^\s*lbaf\s+\d+\s*:\s*ms:(\d+)\s+lbads:(\d+).*?\(in use\)\s*$`)
nvmeLBAFVerboseRE = regexp.MustCompile(`(?im)^\s*LBA Format\s+\d+\s*:\s*Metadata Size:\s*(\d+)\s+bytes\s*-\s*Data Size:\s*(\d+)\s+bytes.*?\(in use\)\s*$`)
sgReadcapBlockRE = regexp.MustCompile(`(?im)logical block length\s*=\s*(\d+)\s+bytes`)
sgReadcapProtRE = regexp.MustCompile(`(?im)prot_en\s*=\s*1`)
)
func bestEffortRescanHotplugStorage() {
if err := hotplugWriteFile(pciRescanPath, []byte("1\n"), 0644); err != nil {
slog.Info("storage: pci rescan skipped", "path", pciRescanPath, "err", err)
} else {
slog.Info("storage: triggered pci rescan for hotplug discovery")
}
hostPaths, err := hotplugGlob(scsiHostScanGlob)
if err != nil {
slog.Info("storage: scsi host scan skipped", "pattern", scsiHostScanGlob, "err", err)
} else {
for _, path := range hostPaths {
if err := hotplugWriteFile(path, []byte("- - -\n"), 0644); err != nil {
slog.Info("storage: scsi host scan write failed", "path", path, "err", err)
continue
}
slog.Info("storage: triggered scsi host scan", "path", path)
}
}
out, err := hotplugExecCommand("udevadm", "settle", "--timeout=10").CombinedOutput()
if err != nil {
slog.Info("storage: udev settle after hotplug rescan failed", "err", err, "output", strings.TrimSpace(string(out)))
}
}
func collectStorage() []schema.HardwareStorage {
devs := discoverStorageDevices()
result := make([]schema.HardwareStorage, 0, len(devs))
@@ -35,6 +75,8 @@ type lsblkDevice struct {
Model string `json:"model"`
Tran string `json:"tran"`
Hctl string `json:"hctl"`
LogSec string `json:"log-sec"`
PhySec string `json:"phy-sec"`
}
type lsblkRoot struct {
@@ -101,7 +143,7 @@ func isVirtualHDiskModel(model string) bool {
func lsblkDevices() []lsblkDevice {
out, err := exec.Command("lsblk", "-J", "-d",
"-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL,LOG-SEC,PHY-SEC").Output()
if err != nil {
slog.Warn("storage: lsblk failed", "err", err)
return nil
@@ -208,6 +250,7 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
present := true
s := schema.HardwareStorage{Present: &present}
s.Telemetry = map[string]any{"linux_device": "/dev/" + dev.Name}
applyStorageBlockGeometry(&s, dev)
tran := strings.ToLower(dev.Tran)
devPath := "/dev/" + dev.Name
@@ -250,6 +293,8 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
}
var info smartctlInfo
var raw map[string]any
_ = json.Unmarshal(out, &raw)
if err := json.Unmarshal(out, &info); err == nil {
if v := cleanDMIValue(info.ModelName); v != "" {
s.Model = &v
@@ -302,8 +347,11 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
value := float64(attr.Raw.Value)
s.LifeRemainingPct = &value
case 241:
value := smartLBAsToBytes(attr.Raw.Value)
s.WrittenBytes = &value
case 242:
value := smartLBAsToBytes(attr.Raw.Value)
s.ReadBytes = &value
case 197:
pending = attr.Raw.Value
s.CurrentPendingSectors = &pending
@@ -321,6 +369,8 @@ func enrichWithSmartctl(dev lsblkDevice) schema.HardwareStorage {
offlineUncorrectable: uncorrectable,
lifeRemainingPct: lifeRemaining,
}
applySCSISmartctlTelemetry(&s, raw, &status)
applySCSIProtectionBlockGeometry(&s, devPath)
setStorageHealthStatus(&s, status)
return s
}
@@ -368,6 +418,7 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
Interface: &iface,
Telemetry: map[string]any{"linux_device": "/dev/" + dev.Name},
}
applyStorageBlockGeometry(&s, dev)
devPath := "/dev/" + dev.Name
if v := cleanDMIValue(strings.TrimSpace(dev.Model)); v != "" {
@@ -402,6 +453,7 @@ func enrichWithNVMe(dev lsblkDevice) schema.HardwareStorage {
}
}
}
applyNVMeBlockGeometry(&s, devPath)
// smart-log: wear telemetry
if out, err := exec.Command("nvme", "smart-log", devPath, "-o", "json").Output(); err == nil {
@@ -477,6 +529,251 @@ func nvmeDataUnitsToBytes(units int64) int64 {
return units * 512000
}
func smartLBAsToBytes(lbas int64) int64 {
if lbas <= 0 {
return 0
}
return lbas * 512
}
func applySCSISmartctlTelemetry(s *schema.HardwareStorage, raw map[string]any, status *storageHealthStatus) {
if s == nil || len(raw) == 0 {
return
}
if v, ok := firstInt64(raw,
"path:power_on_time.hours",
"path:accumulated_power_on_time.hours",
"path:power_on_time.hour",
"path:accumulated_power_on_time.hour",
); ok && v > 0 && s.PowerOnHours == nil {
s.PowerOnHours = &v
}
if v, ok := firstInt64(raw,
"path:power_cycle_count",
"path:start_stop_cycle_count",
"path:accumulated_start_stop_cycles",
); ok && v > 0 && s.PowerCycles == nil {
s.PowerCycles = &v
}
if v, ok := firstInt64(raw,
"path:scsi_grown_defect_list",
"path:grown_defect_list",
); ok && v > 0 && s.ReallocatedSectors == nil {
s.ReallocatedSectors = &v
if status != nil && status.reallocatedSectors == 0 {
status.reallocatedSectors = v
}
}
if v, ok := firstInt64(raw,
"path:percentage_used_endurance_indicator",
"path:scsi_percentage_used_endurance_indicator",
); ok && v > 0 {
if s.LifeUsedPct == nil {
fv := float64(v)
s.LifeUsedPct = &fv
}
if s.LifeRemainingPct == nil && v <= 100 {
remaining := float64(100 - v)
s.LifeRemainingPct = &remaining
if status != nil && status.lifeRemainingPct == 0 {
status.lifeRemainingPct = int64(remaining)
}
}
}
blockSize, hasBlockSize := firstInt64(raw,
"path:logical_block_size",
"path:block_size",
"path:user_capacity.block_size",
)
if hasBlockSize && blockSize > 0 {
if s.LogicalBlockSizeBytes == nil {
s.LogicalBlockSizeBytes = &blockSize
}
if s.MetadataBytesPerBlock == nil {
zero := int64(0)
s.MetadataBytesPerBlock = &zero
}
if s.Telemetry == nil {
s.Telemetry = map[string]any{}
}
s.Telemetry["logical_block_size_bytes"] = *s.LogicalBlockSizeBytes
s.Telemetry["metadata_bytes_per_block"] = *s.MetadataBytesPerBlock
s.Telemetry["block_format"] = formatBlockFormat(*s.LogicalBlockSizeBytes, *s.MetadataBytesPerBlock)
if v, ok := firstInt64(raw,
"path:logical_blocks_written",
"path:total_lbas_written",
); ok && v > 0 && s.WrittenBytes == nil {
bytes := v * blockSize
s.WrittenBytes = &bytes
}
if v, ok := firstInt64(raw,
"path:logical_blocks_read",
"path:total_lbas_read",
); ok && v > 0 && s.ReadBytes == nil {
bytes := v * blockSize
s.ReadBytes = &bytes
}
}
}
func applyStorageBlockGeometry(s *schema.HardwareStorage, dev lsblkDevice) {
if s == nil {
return
}
logical := parseStorageBytes(dev.LogSec)
physical := parseStorageBytes(dev.PhySec)
if logical <= 0 && physical <= 0 {
return
}
if s.Telemetry == nil {
s.Telemetry = map[string]any{}
}
if logical > 0 {
s.LogicalBlockSizeBytes = &logical
s.Telemetry["logical_block_size_bytes"] = logical
if s.MetadataBytesPerBlock == nil {
zero := int64(0)
s.MetadataBytesPerBlock = &zero
s.Telemetry["metadata_bytes_per_block"] = zero
}
}
if physical > 0 {
s.PhysicalBlockSizeBytes = &physical
s.Telemetry["physical_block_size_bytes"] = physical
}
if s.LogicalBlockSizeBytes != nil && s.MetadataBytesPerBlock != nil {
s.Telemetry["block_format"] = formatBlockFormat(*s.LogicalBlockSizeBytes, *s.MetadataBytesPerBlock)
}
}
func applyNVMeBlockGeometry(s *schema.HardwareStorage, devPath string) {
if s == nil || strings.TrimSpace(devPath) == "" {
return
}
out, err := exec.Command("nvme", "id-ns", devPath, "-H").CombinedOutput()
if err != nil {
return
}
dataBytes, metadataBytes, ok := parseNVMeBlockFormat(string(out))
if !ok {
return
}
setStorageBlockGeometry(s, dataBytes, metadataBytes)
}
func applySCSIProtectionBlockGeometry(s *schema.HardwareStorage, devPath string) {
if s == nil || strings.TrimSpace(devPath) == "" {
return
}
out, err := exec.Command("sg_readcap", "-l", devPath).CombinedOutput()
if err != nil {
return
}
dataBytes, metadataBytes, ok := parseSCSIBlockFormat(string(out))
if !ok {
return
}
setStorageBlockGeometry(s, dataBytes, metadataBytes)
}
func setStorageBlockGeometry(s *schema.HardwareStorage, dataBytes, metadataBytes int64) {
if s == nil || dataBytes <= 0 || metadataBytes < 0 {
return
}
if s.Telemetry == nil {
s.Telemetry = map[string]any{}
}
s.LogicalBlockSizeBytes = &dataBytes
s.MetadataBytesPerBlock = &metadataBytes
s.Telemetry["logical_block_size_bytes"] = dataBytes
s.Telemetry["metadata_bytes_per_block"] = metadataBytes
s.Telemetry["block_format"] = formatBlockFormat(dataBytes, metadataBytes)
}
func formatBlockFormat(dataBytes, metadataBytes int64) string {
return strconv.FormatInt(dataBytes, 10) + "+" + strconv.FormatInt(metadataBytes, 10)
}
func parseNVMeBlockFormat(raw string) (dataBytes, metadataBytes int64, ok bool) {
if m := nvmeLBAFCompactRE.FindStringSubmatch(raw); len(m) == 3 {
ms, errMS := strconv.ParseInt(m[1], 10, 64)
lbads, errLBADS := strconv.ParseInt(m[2], 10, 64)
if errMS == nil && errLBADS == nil && lbads >= 0 && lbads < 63 {
return 1 << lbads, ms, true
}
}
if m := nvmeLBAFVerboseRE.FindStringSubmatch(raw); len(m) == 3 {
ms, errMS := strconv.ParseInt(m[1], 10, 64)
ds, errDS := strconv.ParseInt(m[2], 10, 64)
if errMS == nil && errDS == nil && ds > 0 {
return ds, ms, true
}
}
return 0, 0, false
}
func parseSCSIBlockFormat(raw string) (dataBytes, metadataBytes int64, ok bool) {
m := sgReadcapBlockRE.FindStringSubmatch(raw)
if len(m) != 2 {
return 0, 0, false
}
blockBytes, err := strconv.ParseInt(m[1], 10, 64)
if err != nil || blockBytes <= 0 {
return 0, 0, false
}
if sgReadcapProtRE.MatchString(raw) {
return blockBytes, 8, true
}
return blockBytes, 0, true
}
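The parsers above normalise both transports to a single `data+metadata` convention: NVMe's compact `lbaf` line reports the data size as a power-of-two exponent (`lbads`), so the byte size is `1 << lbads`, while SCSI with protection enabled implies an 8-byte DIF tuple per block. A minimal standalone sketch of the convention (simplified helpers, not the package's regex-based parsers):

```go
package main

import "fmt"

// formatBlockFormat mirrors the convention above: "<data>+<metadata>",
// e.g. "512+8" for 512-byte sectors carrying 8 bytes of protection info.
func formatBlockFormat(dataBytes, metadataBytes int64) string {
	return fmt.Sprintf("%d+%d", dataBytes, metadataBytes)
}

func main() {
	// NVMe compact form: "lbaf 0 : ms:0 lbads:9" -> 1<<9 = 512 bytes, 0 metadata.
	lbads, ms := int64(9), int64(0)
	fmt.Println(formatBlockFormat(1<<lbads, ms)) // 512+0

	// SCSI with prot_en=1: the 8-byte protection-information tuple is appended.
	fmt.Println(formatBlockFormat(4096, 8)) // 4096+8
}
```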
func firstInt64(root map[string]any, candidates ...string) (int64, bool) {
for _, candidate := range candidates {
if !strings.HasPrefix(candidate, "path:") {
continue
}
path := strings.TrimPrefix(candidate, "path:")
if v, ok := nestedInt64(root, strings.Split(path, ".")); ok {
return v, true
}
}
return 0, false
}
func nestedInt64(root map[string]any, path []string) (int64, bool) {
var current any = root
for _, key := range path {
obj, ok := current.(map[string]any)
if !ok {
return 0, false
}
current, ok = obj[key]
if !ok {
return 0, false
}
}
switch v := current.(type) {
case float64:
return int64(v), true
case float32:
return int64(v), true
case int:
return int64(v), true
case int64:
return v, true
case int32:
return int64(v), true
case json.Number:
n, err := v.Int64()
return n, err == nil
case string:
n, err := strconv.ParseInt(strings.TrimSpace(v), 10, 64)
return n, err == nil
default:
return 0, false
}
}
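`nestedInt64` walks a decoded JSON object key by key and then coerces the leaf through the type switch above (notably `float64`, which is what `encoding/json` produces for numbers by default). A simplified standalone sketch of the same traversal, handling only the `float64` leaf case:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lookupInt64 descends a decoded JSON object along the given keys, then
// coerces the leaf to int64. encoding/json decodes numbers as float64.
func lookupInt64(root map[string]any, path ...string) (int64, bool) {
	var cur any = root
	for _, key := range path {
		obj, ok := cur.(map[string]any)
		if !ok {
			return 0, false
		}
		if cur, ok = obj[key]; !ok {
			return 0, false
		}
	}
	if f, ok := cur.(float64); ok {
		return int64(f), true
	}
	return 0, false
}

func main() {
	var doc map[string]any
	_ = json.Unmarshal([]byte(`{"power_on_time":{"hours":32123}}`), &doc)
	v, ok := lookupInt64(doc, "power_on_time", "hours")
	fmt.Println(v, ok) // 32123 true
}
```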
type storageHealthStatus struct {
hasOverall bool
overallPassed bool


@@ -0,0 +1,69 @@
package collector
import "testing"
func TestParseNVMeBlockFormatCompact(t *testing.T) {
t.Parallel()
raw := `
lbaf 0 : ms:0 lbads:9 rp:0x2 (in use)
lbaf 1 : ms:8 lbads:9 rp:0x1
`
dataBytes, metadataBytes, ok := parseNVMeBlockFormat(raw)
if !ok {
t.Fatal("parseNVMeBlockFormat returned ok=false")
}
if dataBytes != 512 || metadataBytes != 0 {
t.Fatalf("got %d+%d want 512+0", dataBytes, metadataBytes)
}
}
func TestParseNVMeBlockFormatVerbose(t *testing.T) {
t.Parallel()
raw := `
LBA Format 0 : Metadata Size: 8 bytes - Data Size: 512 bytes - Relative Performance: 0 Better (in use)
LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 1 Best
`
dataBytes, metadataBytes, ok := parseNVMeBlockFormat(raw)
if !ok {
t.Fatal("parseNVMeBlockFormat returned ok=false")
}
if dataBytes != 512 || metadataBytes != 8 {
t.Fatalf("got %d+%d want 512+8", dataBytes, metadataBytes)
}
}
func TestParseSCSIBlockFormatWithProtection(t *testing.T) {
t.Parallel()
raw := `
Read Capacity results:
Protection: prot_en=1, p_type=1, p_i_exponent=0
Logical block length=512 bytes
`
dataBytes, metadataBytes, ok := parseSCSIBlockFormat(raw)
if !ok {
t.Fatal("parseSCSIBlockFormat returned ok=false")
}
if dataBytes != 512 || metadataBytes != 8 {
t.Fatalf("got %d+%d want 512+8", dataBytes, metadataBytes)
}
}
func TestParseSCSIBlockFormatWithoutProtection(t *testing.T) {
t.Parallel()
raw := `
Read Capacity results:
Protection: prot_en=0, p_type=0, p_i_exponent=0
Logical block length=4096 bytes
`
dataBytes, metadataBytes, ok := parseSCSIBlockFormat(raw)
if !ok {
t.Fatal("parseSCSIBlockFormat returned ok=false")
}
if dataBytes != 4096 || metadataBytes != 0 {
t.Fatalf("got %d+%d want 4096+0", dataBytes, metadataBytes)
}
}


@@ -1,6 +1,12 @@
package collector
import "testing"
import (
"os"
"os/exec"
"path/filepath"
"strings"
"testing"
)
func TestMergeStorageDevicePrefersNonEmptyFields(t *testing.T) {
t.Parallel()
@@ -31,3 +37,82 @@ func TestParseStorageBytes(t *testing.T) {
t.Fatalf("parseStorageBytes invalid=%d want 0", got)
}
}
func TestBestEffortRescanHotplugStorage(t *testing.T) {
t.Parallel()
tmp := t.TempDir()
rescanPath := filepath.Join(tmp, "pci-rescan")
scanDir := filepath.Join(tmp, "scsi_host")
host0Path := filepath.Join(scanDir, "host0", "scan")
host1Path := filepath.Join(scanDir, "host1", "scan")
argsPath := filepath.Join(tmp, "udevadm-args")
toolPath := filepath.Join(tmp, "udevadm")
if err := os.MkdirAll(filepath.Dir(host0Path), 0755); err != nil {
t.Fatalf("mkdir host0: %v", err)
}
if err := os.MkdirAll(filepath.Dir(host1Path), 0755); err != nil {
t.Fatalf("mkdir host1: %v", err)
}
if err := os.WriteFile(host0Path, nil, 0644); err != nil {
t.Fatalf("touch host0 scan: %v", err)
}
if err := os.WriteFile(host1Path, nil, 0644); err != nil {
t.Fatalf("touch host1 scan: %v", err)
}
script := "#!/bin/sh\nprintf '%s' \"$*\" > \"" + argsPath + "\"\n"
if err := os.WriteFile(toolPath, []byte(script), 0755); err != nil {
t.Fatalf("write udevadm stub: %v", err)
}
oldPath := os.Getenv("PATH")
if err := os.Setenv("PATH", tmp+string(os.PathListSeparator)+oldPath); err != nil {
t.Fatalf("set PATH: %v", err)
}
defer func() { _ = os.Setenv("PATH", oldPath) }()
oldRescanPath := pciRescanPath
oldSCSIGlob := scsiHostScanGlob
oldWriteFile := hotplugWriteFile
oldExecCommand := hotplugExecCommand
oldGlob := hotplugGlob
pciRescanPath = rescanPath
scsiHostScanGlob = filepath.Join(scanDir, "host*", "scan")
hotplugWriteFile = os.WriteFile
hotplugExecCommand = exec.Command
hotplugGlob = filepath.Glob
defer func() {
pciRescanPath = oldRescanPath
scsiHostScanGlob = oldSCSIGlob
hotplugWriteFile = oldWriteFile
hotplugExecCommand = oldExecCommand
hotplugGlob = oldGlob
}()
bestEffortRescanHotplugStorage()
raw, err := os.ReadFile(rescanPath)
if err != nil {
t.Fatalf("read rescan file: %v", err)
}
if string(raw) != "1\n" {
t.Fatalf("rescan payload=%q want %q", string(raw), "1\n")
}
for _, path := range []string{host0Path, host1Path} {
raw, err := os.ReadFile(path)
if err != nil {
t.Fatalf("read scsi scan file %s: %v", path, err)
}
if string(raw) != "- - -\n" {
t.Fatalf("scsi scan payload at %s =%q want %q", path, string(raw), "- - -\n")
}
}
args, err := os.ReadFile(argsPath)
if err != nil {
t.Fatalf("read udevadm args: %v", err)
}
if got := strings.TrimSpace(string(args)); got != "settle --timeout=10" {
t.Fatalf("udevadm args=%q want %q", got, "settle --timeout=10")
}
}


@@ -0,0 +1,101 @@
package collector
import (
"testing"
"bee/audit/internal/schema"
)
func TestApplySCSISmartctlTelemetry(t *testing.T) {
t.Parallel()
raw := map[string]any{
"power_on_time": map[string]any{
"hours": float64(32123),
},
"accumulated_start_stop_cycles": float64(17),
"scsi_grown_defect_list": float64(4),
"percentage_used_endurance_indicator": float64(12),
"logical_block_size": float64(4096),
"logical_blocks_written": float64(1000),
"logical_blocks_read": float64(2000),
}
var disk schema.HardwareStorage
status := storageHealthStatus{}
applySCSISmartctlTelemetry(&disk, raw, &status)
if disk.PowerOnHours == nil || *disk.PowerOnHours != 32123 {
t.Fatalf("power_on_hours=%v want 32123", disk.PowerOnHours)
}
if disk.PowerCycles == nil || *disk.PowerCycles != 17 {
t.Fatalf("power_cycles=%v want 17", disk.PowerCycles)
}
if disk.ReallocatedSectors == nil || *disk.ReallocatedSectors != 4 {
t.Fatalf("reallocated=%v want 4", disk.ReallocatedSectors)
}
if disk.WrittenBytes == nil || *disk.WrittenBytes != 4096000 {
t.Fatalf("written_bytes=%v want 4096000", disk.WrittenBytes)
}
if disk.ReadBytes == nil || *disk.ReadBytes != 8192000 {
t.Fatalf("read_bytes=%v want 8192000", disk.ReadBytes)
}
if disk.LogicalBlockSizeBytes == nil || *disk.LogicalBlockSizeBytes != 4096 {
t.Fatalf("logical_block_size_bytes=%v want 4096", disk.LogicalBlockSizeBytes)
}
if disk.MetadataBytesPerBlock == nil || *disk.MetadataBytesPerBlock != 0 {
t.Fatalf("metadata_bytes_per_block=%v want 0", disk.MetadataBytesPerBlock)
}
if disk.LifeUsedPct == nil || *disk.LifeUsedPct != 12 {
t.Fatalf("life_used_pct=%v want 12", disk.LifeUsedPct)
}
if disk.LifeRemainingPct == nil || *disk.LifeRemainingPct != 88 {
t.Fatalf("life_remaining_pct=%v want 88", disk.LifeRemainingPct)
}
if status.reallocatedSectors != 4 {
t.Fatalf("status.reallocated=%d want 4", status.reallocatedSectors)
}
if status.lifeRemainingPct != 88 {
t.Fatalf("status.life_remaining_pct=%d want 88", status.lifeRemainingPct)
}
}
func TestApplySCSISmartctlTelemetryDoesNotOverwriteExistingValues(t *testing.T) {
t.Parallel()
powerOnHours := int64(10)
writtenBytes := int64(20)
lifeRemaining := 30.0
disk := schema.HardwareStorage{
PowerOnHours: &powerOnHours,
WrittenBytes: &writtenBytes,
LifeRemainingPct: &lifeRemaining,
}
raw := map[string]any{
"power_on_time": map[string]any{"hours": float64(999)},
"logical_block_size": float64(512),
"logical_blocks_written": float64(999),
"percentage_used_endurance_indicator": float64(50),
}
applySCSISmartctlTelemetry(&disk, raw, nil)
if *disk.PowerOnHours != 10 {
t.Fatalf("power_on_hours overwritten: got %d want 10", *disk.PowerOnHours)
}
if *disk.WrittenBytes != 20 {
t.Fatalf("written_bytes overwritten: got %d want 20", *disk.WrittenBytes)
}
if disk.LogicalBlockSizeBytes == nil || *disk.LogicalBlockSizeBytes != 512 {
t.Fatalf("logical_block_size_bytes=%v want 512", disk.LogicalBlockSizeBytes)
}
if disk.MetadataBytesPerBlock == nil || *disk.MetadataBytesPerBlock != 0 {
t.Fatalf("metadata_bytes_per_block=%v want 0", disk.MetadataBytesPerBlock)
}
if *disk.LifeRemainingPct != 30 {
t.Fatalf("life_remaining_pct overwritten: got %v want 30", *disk.LifeRemainingPct)
}
if disk.LifeUsedPct == nil || *disk.LifeUsedPct != 50 {
t.Fatalf("life_used_pct=%v want 50", disk.LifeUsedPct)
}
}


@@ -0,0 +1,25 @@
package collector
import "testing"
func TestSmartLBAsToBytes(t *testing.T) {
t.Parallel()
tests := []struct {
name string
lbas int64
want int64
}{
{name: "zero", lbas: 0, want: 0},
{name: "single lba", lbas: 1, want: 512},
{name: "multiple lbas", lbas: 2048, want: 1048576},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if got := smartLBAsToBytes(tt.lbas); got != tt.want {
t.Fatalf("smartLBAsToBytes(%d)=%d want %d", tt.lbas, got, tt.want)
}
})
}
}


@@ -28,6 +28,35 @@ md125 : active raid1 nvme2n1[0] nvme3n1[1]
}
}
func TestParseMDAdmPlatformLicense(t *testing.T) {
premium := `Platform : Intel(R) Virtual RAID on CPU
Version : 1.3.0.1138
RAID Levels : raid0 raid1 raid5 raid10
Total Disks : 4
License : Premium
`
got := parseMDAdmPlatformLicense(premium)
if got == nil || *got != "premium" {
t.Fatalf("expected 'premium', got %v", got)
}
standard := `Platform : Intel(R) Virtual RAID on CPU
License : Standard
`
got = parseMDAdmPlatformLicense(standard)
if got == nil || *got != "standard" {
t.Fatalf("expected 'standard', got %v", got)
}
noLicense := `Platform : Intel(R) Virtual RAID on CPU
Version : 1.0.0
`
got = parseMDAdmPlatformLicense(noLicense)
if got != nil {
t.Fatalf("expected nil, got %v", *got)
}
}
func TestHasVROCController(t *testing.T) {
intel := vendorIntel
model := "Volume Management Device NVMe RAID Controller"


@@ -105,6 +105,7 @@ var (
benchmarkSkippedPattern = regexp.MustCompile(`^([a-z0-9_]+)(?:\[\d+\])?=SKIPPED (.+)$`)
benchmarkIterationsPattern = regexp.MustCompile(`^([a-z0-9_]+)_iterations=(\d+)$`)
benchmarkGeteuid = os.Geteuid
benchmarkResetNvidiaGPU = resetNvidiaGPU
benchmarkSleep = time.Sleep
)
@@ -249,6 +250,35 @@ func setBenchmarkPowerLimit(ctx context.Context, verboseLog string, gpuIndex, po
return nil
}
func resetBenchmarkGPU(ctx context.Context, verboseLog string, gpuIndex int, logFunc func(string)) error {
if logFunc != nil {
logFunc(fmt.Sprintf("power benchmark pre-flight: GPU %d reset via shared NVIDIA recover path", gpuIndex))
}
out, err := benchmarkResetNvidiaGPU(gpuIndex)
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] start power-preflight-gpu-%d-reset.log", time.Now().UTC().Format(time.RFC3339), gpuIndex),
"cmd: bee-nvidia-recover reset-gpu "+strconv.Itoa(gpuIndex),
)
if trimmed := strings.TrimSpace(out); trimmed != "" && logFunc != nil {
for _, line := range strings.Split(trimmed, "\n") {
line = strings.TrimSpace(line)
if line != "" {
logFunc(line)
}
}
}
rc := 0
if err != nil {
rc = 1
}
appendSATVerboseLog(verboseLog,
fmt.Sprintf("[%s] finish power-preflight-gpu-%d-reset.log", time.Now().UTC().Format(time.RFC3339), gpuIndex),
fmt.Sprintf("rc: %d", rc),
"",
)
return err
}
func resetBenchmarkGPUs(ctx context.Context, verboseLog string, gpuIndices []int, logFunc func(string)) []int {
if len(gpuIndices) == 0 {
return nil
@@ -266,8 +296,7 @@ func resetBenchmarkGPUs(ctx context.Context, verboseLog string, gpuIndices []int
}
var failed []int
for _, idx := range gpuIndices {
name := fmt.Sprintf("power-preflight-gpu-%d-reset.log", idx)
if _, err := runSATCommandCtx(ctx, verboseLog, name, []string{"nvidia-smi", "-i", strconv.Itoa(idx), "-r"}, nil, logFunc); err != nil {
if err := resetBenchmarkGPU(ctx, verboseLog, idx, logFunc); err != nil {
failed = append(failed, idx)
if logFunc != nil {
logFunc(fmt.Sprintf("power benchmark pre-flight: GPU %d reset failed: %v", idx, err))
@@ -4440,8 +4469,7 @@ func (s *System) RunNvidiaPowerBench(ctx context.Context, baseDir string, opts N
_ = os.MkdirAll(singleDir, 0755)
singleInfo := cloneBenchmarkGPUInfoMap(infoByIndex)
if failed := resetBenchmarkGPUs(ctx, verboseLog, []int{idx}, logFunc); len(failed) > 0 {
result.Findings = append(result.Findings,
fmt.Sprintf("GPU %d reset pre-flight did not complete before its first power test; throttle counters may contain stale state.", idx))
return "", fmt.Errorf("power benchmark pre-flight: failed to reset GPU %d; benchmark aborted to keep measurements clean", idx)
}
logFunc(fmt.Sprintf("power calibration: GPU %d single-card baseline", idx))
singlePowerStopCh := make(chan struct{})


@@ -2,7 +2,7 @@ package platform
import (
"context"
"os"
"fmt"
"os/exec"
"path/filepath"
"strings"
@@ -188,18 +188,16 @@ func TestBenchmarkCalibrationThrottleReasonIgnoresPowerReasons(t *testing.T) {
}
func TestResetBenchmarkGPUsSkipsWithoutRoot(t *testing.T) {
t.Parallel()
oldGeteuid := benchmarkGeteuid
oldExec := satExecCommand
oldReset := benchmarkResetNvidiaGPU
benchmarkGeteuid = func() int { return 1000 }
satExecCommand = func(name string, args ...string) *exec.Cmd {
t.Fatalf("unexpected command: %s %v", name, args)
return nil
benchmarkResetNvidiaGPU = func(int) (string, error) {
t.Fatal("unexpected reset call")
return "", nil
}
t.Cleanup(func() {
benchmarkGeteuid = oldGeteuid
satExecCommand = oldExec
benchmarkResetNvidiaGPU = oldReset
})
var logs []string
@@ -215,44 +213,52 @@ func TestResetBenchmarkGPUsSkipsWithoutRoot(t *testing.T) {
}
func TestResetBenchmarkGPUsResetsEachGPU(t *testing.T) {
t.Parallel()
dir := t.TempDir()
script := filepath.Join(dir, "nvidia-smi")
argsLog := filepath.Join(dir, "args.log")
if err := os.WriteFile(script, []byte("#!/bin/sh\nprintf '%s\\n' \"$*\" >> "+argsLog+"\nprintf 'ok\\n'\n"), 0755); err != nil {
t.Fatalf("write script: %v", err)
}
oldGeteuid := benchmarkGeteuid
oldSleep := benchmarkSleep
oldLookPath := satLookPath
oldReset := benchmarkResetNvidiaGPU
benchmarkGeteuid = func() int { return 0 }
benchmarkSleep = func(time.Duration) {}
satLookPath = func(file string) (string, error) {
if file == "nvidia-smi" {
return script, nil
}
return exec.LookPath(file)
var calls []int
benchmarkResetNvidiaGPU = func(index int) (string, error) {
calls = append(calls, index)
return "ok\n", nil
}
t.Cleanup(func() {
benchmarkGeteuid = oldGeteuid
benchmarkSleep = oldSleep
satLookPath = oldLookPath
benchmarkResetNvidiaGPU = oldReset
})
failed := resetBenchmarkGPUs(context.Background(), filepath.Join(dir, "verbose.log"), []int{2, 5}, nil)
failed := resetBenchmarkGPUs(context.Background(), filepath.Join(t.TempDir(), "verbose.log"), []int{2, 5}, nil)
if len(failed) != 0 {
t.Fatalf("failed=%v want no failures", failed)
}
raw, err := os.ReadFile(argsLog)
if err != nil {
t.Fatalf("read args log: %v", err)
if got, want := fmt.Sprint(calls), "[2 5]"; got != want {
t.Fatalf("calls=%v want %s", calls, want)
}
got := strings.Fields(string(raw))
want := []string{"-i", "2", "-r", "-i", "5", "-r"}
if strings.Join(got, " ") != strings.Join(want, " ") {
t.Fatalf("args=%v want %v", got, want)
}
func TestResetBenchmarkGPUsTracksFailuresFromSharedReset(t *testing.T) {
oldGeteuid := benchmarkGeteuid
oldSleep := benchmarkSleep
oldReset := benchmarkResetNvidiaGPU
benchmarkGeteuid = func() int { return 0 }
benchmarkSleep = func(time.Duration) {}
benchmarkResetNvidiaGPU = func(index int) (string, error) {
if index == 5 {
return "busy\n", exec.ErrNotFound
}
return "ok\n", nil
}
t.Cleanup(func() {
benchmarkGeteuid = oldGeteuid
benchmarkSleep = oldSleep
benchmarkResetNvidiaGPU = oldReset
})
failed := resetBenchmarkGPUs(context.Background(), filepath.Join(t.TempDir(), "verbose.log"), []int{2, 5}, nil)
if got, want := fmt.Sprint(failed), "[5]"; got != want {
t.Fatalf("failed=%v want %s", failed, want)
}
}


@@ -5,6 +5,7 @@ import (
"fmt"
"os"
"os/exec"
"path/filepath"
"strconv"
"strings"
)
@@ -18,7 +19,7 @@ type InstallDisk struct {
MountedParts []string // partition mount points currently active
}
const squashfsPath = "/run/live/medium/live/filesystem.squashfs"
const squashfsGlob = "/run/live/medium/live/*.squashfs"
// ListInstallDisks returns block devices suitable for installation.
// Excludes the current live boot medium but includes USB drives.
@@ -176,11 +177,22 @@ func inferLiveBootKind(fsType, source, deviceType, transport string) string {
// squashfs size × 1.5 to allow for extracted filesystem and bootloader.
// Returns 0 if the squashfs is not available (non-live environment).
func MinInstallBytes() int64 {
fi, err := os.Stat(squashfsPath)
if err != nil {
files, err := filepath.Glob(squashfsGlob)
if err != nil || len(files) == 0 {
return 0
}
return fi.Size() * 3 / 2
var total int64
for _, path := range files {
fi, statErr := os.Stat(path)
if statErr != nil {
continue
}
total += fi.Size()
}
if total == 0 {
return 0
}
return total * 3 / 2
}
// toramActive returns true when the live system was booted with toram.
@@ -222,12 +234,10 @@ func DiskWarnings(d InstallDisk) []string {
humanBytes(min), humanBytes(d.SizeBytes)))
}
if toramActive() {
sqFi, err := os.Stat(squashfsPath)
if err == nil {
free := freeMemBytes()
if free > 0 && free < sqFi.Size()*2 {
w = append(w, "toram mode — low RAM, extraction may be slow or fail")
}
free := freeMemBytes()
min := MinInstallBytes()
if free > 0 && min > 0 && free < (min*4/3) {
w = append(w, "toram mode — low RAM, extraction may be slow or fail")
}
}
return w
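The headroom arithmetic in this hunk is worth spelling out: `MinInstallBytes` is the summed squashfs size times 3/2, and the toram low-RAM warning fires when free memory is below `min * 4/3`, which works out to twice the squashfs total. A sketch with illustrative numbers (standalone helpers mirroring the rules, not the package functions):

```go
package main

import "fmt"

// minInstallBytes mirrors the 1.5x rule: the extracted filesystem plus
// bootloader needs roughly 150% of the total squashfs size.
func minInstallBytes(squashfsTotal int64) int64 {
	if squashfsTotal <= 0 {
		return 0
	}
	return squashfsTotal * 3 / 2
}

// lowRAMForToram mirrors the warning threshold: min*4/3 == 2x squashfs.
func lowRAMForToram(freeBytes, minBytes int64) bool {
	return freeBytes > 0 && minBytes > 0 && freeBytes < minBytes*4/3
}

func main() {
	const gib = int64(1) << 30
	min := minInstallBytes(4 * gib)         // 4 GiB squashfs -> 6 GiB minimum
	fmt.Println(min / gib)                  // 6
	fmt.Println(lowRAMForToram(7*gib, min)) // 7 GiB free < 8 GiB threshold -> true
}
```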


@@ -14,6 +14,22 @@ import (
const installToRAMDir = "/dev/shm/bee-live"
const copyProgressLogStep int64 = 100 * 1024 * 1024
var liveMediumSquashfsGlob = func() ([]string, error) {
return filepath.Glob("/run/live/medium/live/*.squashfs")
}
var runRemountMedium = func() ([]byte, error) {
return exec.Command("bee-remount-medium").CombinedOutput()
}
var umountLiveMedium = func() error {
return exec.Command("umount", "/run/live/medium").Run()
}
var ejectDevice = func(device string) error {
return exec.Command("eject", device).Run()
}
func (s *System) IsLiveMediaInRAM() bool {
return s.LiveMediaRAMState().InRAM
}
@@ -140,8 +156,7 @@ func (s *System) RunInstallToRAM(ctx context.Context, logFunc func(string)) (ret
return nil
}
squashfsFiles, err := filepath.Glob("/run/live/medium/live/*.squashfs")
sourceAvailable := err == nil && len(squashfsFiles) > 0
squashfsFiles, sourceAvailable := ensureLiveMediumAvailable(log)
dstDir := installToRAMDir
@@ -171,7 +186,7 @@ func (s *System) RunInstallToRAM(ctx context.Context, logFunc func(string)) (ret
}
goto bindMedium
}
return fmt.Errorf("no squashfs files found in /run/live/medium/live/ and no prior RAM copy in %s — reconnect the installation medium and retry", dstDir)
return fmt.Errorf("no squashfs files found in /run/live/medium/live/ and no prior RAM copy in %s — reconnect the installation medium and retry (or run bee-remount-medium as root)", dstDir)
}
{
@@ -254,10 +269,83 @@ bindMedium:
if status.InRAM {
log(fmt.Sprintf("Verification passed: live medium now served from %s.", describeLiveBootSource(status)))
}
log("Done. Squashfs files are in RAM. Installation media can be safely disconnected.")
detachInstallMedium(status, log)
log("Done. Squashfs files are in RAM. Installation media has been detached when possible.")
return nil
}
func tryRemountLiveMedium(log func(string)) error {
output, err := runRemountMedium()
trimmed := strings.TrimSpace(string(output))
if err != nil {
if trimmed != "" && log != nil {
for _, line := range strings.Split(trimmed, "\n") {
log("bee-remount-medium: " + line)
}
}
return err
}
if trimmed != "" && log != nil {
for _, line := range strings.Split(trimmed, "\n") {
log("bee-remount-medium: " + line)
}
}
return nil
}
func ensureLiveMediumAvailable(log func(string)) ([]string, bool) {
squashfsFiles, err := liveMediumSquashfsGlob()
sourceAvailable := err == nil && len(squashfsFiles) > 0
if sourceAvailable {
return squashfsFiles, true
}
if log != nil {
log("Live medium not mounted at /run/live/medium — attempting automatic remount scan...")
}
if remountErr := tryRemountLiveMedium(log); remountErr != nil {
if log != nil {
log(fmt.Sprintf("Automatic remount did not restore the live medium: %v", remountErr))
}
return squashfsFiles, false
}
squashfsFiles, err = liveMediumSquashfsGlob()
sourceAvailable = err == nil && len(squashfsFiles) > 0
if sourceAvailable && log != nil {
log("Live medium restored after remount scan.")
}
return squashfsFiles, sourceAvailable
}
func detachInstallMedium(status LiveBootSource, log func(string)) {
if log == nil {
log = func(string) {}
}
log("Detaching original installation medium...")
if err := umountLiveMedium(); err != nil {
log(fmt.Sprintf("Warning: could not unmount /run/live/medium: %v", err))
} else {
log("Unmounted /run/live/medium.")
}
device := strings.TrimSpace(status.Device)
if device == "" {
device = strings.TrimSpace(status.Source)
}
if device == "" || !strings.HasPrefix(device, "/dev/") {
log("No block device identified for eject; skipping media eject.")
return
}
if err := ejectDevice(device); err != nil {
log(fmt.Sprintf("Warning: could not eject %s: %v", device, err))
return
}
log(fmt.Sprintf("Ejected %s.", device))
}
func verifyInstallToRAMStatus(status LiveBootSource, dstDir string, mediumRebound bool, log func(string)) error {
if status.InRAM {
return nil


@@ -1,6 +1,9 @@
package platform
import "testing"
import (
"fmt"
"testing"
)
func TestInferLiveBootKind(t *testing.T) {
t.Parallel()
@@ -124,3 +127,156 @@ func TestShouldLogCopyProgress(t *testing.T) {
t.Fatal("expected final completion log")
}
}
func TestTryRemountLiveMedium(t *testing.T) {
t.Parallel()
orig := runRemountMedium
t.Cleanup(func() {
runRemountMedium = orig
})
t.Run("success", func(t *testing.T) {
runRemountMedium = func() ([]byte, error) {
return []byte("[10:57:31] Mounted /dev/sr1 on /run/live/medium\n"), nil
}
var logs []string
if err := tryRemountLiveMedium(func(msg string) { logs = append(logs, msg) }); err != nil {
t.Fatalf("tryRemountLiveMedium() error = %v", err)
}
if len(logs) != 1 || logs[0] != "bee-remount-medium: [10:57:31] Mounted /dev/sr1 on /run/live/medium" {
t.Fatalf("logs=%v", logs)
}
})
t.Run("failure", func(t *testing.T) {
runRemountMedium = func() ([]byte, error) {
return []byte("must be run as root\n"), fmt.Errorf("exit status 1")
}
var logs []string
err := tryRemountLiveMedium(func(msg string) { logs = append(logs, msg) })
if err == nil {
t.Fatal("expected error")
}
if len(logs) != 1 || logs[0] != "bee-remount-medium: must be run as root" {
t.Fatalf("logs=%v", logs)
}
})
}
func TestEnsureLiveMediumAvailableRemountsSource(t *testing.T) {
t.Parallel()
origGlob := liveMediumSquashfsGlob
origRemount := runRemountMedium
t.Cleanup(func() {
liveMediumSquashfsGlob = origGlob
runRemountMedium = origRemount
})
callCount := 0
liveMediumSquashfsGlob = func() ([]string, error) {
callCount++
if callCount == 1 {
return nil, nil
}
return []string{"/run/live/medium/live/filesystem.squashfs"}, nil
}
runRemountMedium = func() ([]byte, error) {
return []byte("Mounted /dev/sr1 on /run/live/medium\n"), nil
}
var logs []string
files, ok := ensureLiveMediumAvailable(func(msg string) { logs = append(logs, msg) })
if !ok {
t.Fatal("expected live medium to become available after remount")
}
if callCount < 2 {
t.Fatalf("liveMediumSquashfsGlob called %d times, want at least 2", callCount)
}
if len(files) != 1 || files[0] != "/run/live/medium/live/filesystem.squashfs" {
t.Fatalf("files=%v", files)
}
found := false
for _, msg := range logs {
if msg == "Live medium restored after remount scan." {
found = true
break
}
}
if !found {
t.Fatalf("expected remount success log, logs=%v", logs)
}
}
func TestDetachInstallMedium(t *testing.T) {
t.Parallel()
origUmount := umountLiveMedium
origEject := ejectDevice
t.Cleanup(func() {
umountLiveMedium = origUmount
ejectDevice = origEject
})
t.Run("success", func(t *testing.T) {
var umountCalled bool
var ejected string
umountLiveMedium = func() error {
umountCalled = true
return nil
}
ejectDevice = func(device string) error {
ejected = device
return nil
}
var logs []string
detachInstallMedium(LiveBootSource{Kind: "cdrom", Device: "/dev/sr1"}, func(msg string) { logs = append(logs, msg) })
if !umountCalled {
t.Fatal("expected umountLiveMedium to be called")
}
if ejected != "/dev/sr1" {
t.Fatalf("ejected=%q want /dev/sr1", ejected)
}
if len(logs) < 3 {
t.Fatalf("logs=%v", logs)
}
})
t.Run("no device", func(t *testing.T) {
umountLiveMedium = func() error { return nil }
ejectDevice = func(device string) error {
t.Fatalf("unexpected eject for %q", device)
return nil
}
var logs []string
detachInstallMedium(LiveBootSource{Kind: "ram", Source: "tmpfs"}, func(msg string) { logs = append(logs, msg) })
found := false
for _, msg := range logs {
if msg == "No block device identified for eject; skipping media eject." {
found = true
break
}
}
if !found {
t.Fatalf("logs=%v", logs)
}
})
t.Run("eject failure is warning only", func(t *testing.T) {
umountLiveMedium = func() error { return nil }
ejectDevice = func(device string) error { return fmt.Errorf("exit status 1") }
var logs []string
detachInstallMedium(LiveBootSource{Kind: "usb", Device: "/dev/sdb1"}, func(msg string) { logs = append(logs, msg) })
found := false
for _, msg := range logs {
if msg == "Warning: could not eject /dev/sdb1: exit status 1" {
found = true
break
}
}
if !found {
t.Fatalf("logs=%v", logs)
}
})
}


@@ -18,6 +18,7 @@ var workerPatterns = []string{
"stress-ng",
"stressapptest",
"memtester",
"nvbandwidth",
// DCGM diagnostic workers — nvvs is spawned by dcgmi diag and survives
// if dcgmi is killed mid-run, leaving the GPU occupied (DCGM_ST_IN_USE).
"nvvs",
@@ -71,13 +72,19 @@ func KillTestWorkers() []KilledProcess {
if idx := strings.LastIndexByte(exe, '/'); idx >= 0 {
base = exe[idx+1:]
}
for _, pat := range workerPatterns {
if strings.Contains(base, pat) || strings.Contains(exe, pat) {
_ = syscall.Kill(pid, syscall.SIGKILL)
killed = append(killed, KilledProcess{PID: pid, Name: base})
break
}
if shouldKillWorkerProcess(exe, base) {
_ = syscall.Kill(pid, syscall.SIGKILL)
killed = append(killed, KilledProcess{PID: pid, Name: base})
}
}
return killed
}
func shouldKillWorkerProcess(exe, base string) bool {
for _, pat := range workerPatterns {
if strings.Contains(base, pat) || strings.Contains(exe, pat) {
return true
}
}
return false
}


@@ -0,0 +1,39 @@
package platform
import "testing"
func TestShouldKillWorkerProcess(t *testing.T) {
tests := []struct {
name string
exe string
base string
want bool
}{
{
name: "nvbandwidth executable",
exe: "/usr/libexec/datacenter-gpu-manager-4/plugins/cuda13/nvbandwidth",
base: "nvbandwidth",
want: true,
},
{
name: "dcgmi executable",
exe: "/usr/bin/dcgmi",
base: "dcgmi",
want: true,
},
{
name: "unrelated process",
exe: "/usr/bin/bash",
base: "bash",
want: false,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if got := shouldKillWorkerProcess(tt.exe, tt.base); got != tt.want {
t.Fatalf("shouldKillWorkerProcess(%q, %q)=%v want %v", tt.exe, tt.base, got, tt.want)
}
})
}
}


@@ -3,6 +3,8 @@ package platform
import (
"fmt"
"os/exec"
"strconv"
"strings"
"time"
)
@@ -28,3 +30,22 @@ func runNvidiaRecover(args ...string) (string, error) {
raw, err := exec.Command("sudo", helperArgs...).CombinedOutput()
return string(raw), err
}
func resetNvidiaGPU(index int) (string, error) {
if index < 0 {
return "", fmt.Errorf("gpu index must be >= 0")
}
out, err := runNvidiaRecover("reset-gpu", strconv.Itoa(index))
if strings.TrimSpace(out) == "" && err == nil {
out = "GPU reset completed.\n"
}
return out, err
}
func restartNvidiaDrivers() (string, error) {
out, err := runNvidiaRecover("restart-drivers")
if strings.TrimSpace(out) == "" && err == nil {
out = "NVIDIA drivers restarted.\n"
}
return out, err
}
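Both wrappers above apply the same fallback: when the recover helper exits successfully but prints nothing, substitute a fixed confirmation line so callers always have something to log. A sketch of that pattern as a standalone helper (hypothetical name, not in the package):

```go
package main

import (
	"fmt"
	"strings"
)

// withDefaultOutput mirrors the fallback above: a silent success gets a
// fixed confirmation message; any real output or error passes through.
func withDefaultOutput(out string, err error, fallback string) string {
	if strings.TrimSpace(out) == "" && err == nil {
		return fallback
	}
	return out
}

func main() {
	fmt.Print(withDefaultOutput("", nil, "GPU reset completed.\n"))
	fmt.Print(withDefaultOutput("already idle\n", nil, "GPU reset completed.\n"))
}
```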


@@ -404,14 +404,7 @@ func normalizeNvidiaBusID(v string) string {
}
func (s *System) ResetNvidiaGPU(index int) (string, error) {
if index < 0 {
return "", fmt.Errorf("gpu index must be >= 0")
}
out, err := runNvidiaRecover("reset-gpu", strconv.Itoa(index))
if strings.TrimSpace(out) == "" && err == nil {
out = "GPU reset completed.\n"
}
return out, err
return resetNvidiaGPU(index)
}
// RunNCCLTests runs nccl-tests all_reduce_perf across the selected NVIDIA GPUs.


@@ -62,7 +62,7 @@ func (s *System) ServiceState(name string) string {
func (s *System) ServiceDo(name string, action ServiceAction) (string, error) {
if name == "bee-nvidia" && action == ServiceRestart {
return runNvidiaRecover("restart-drivers")
return restartNvidiaDrivers()
}
// bee-web runs as the bee user; sudo is required to control system services.
// /etc/sudoers.d/bee grants bee NOPASSWD:ALL.


@@ -66,6 +66,7 @@ type HardwareSnapshot struct {
PowerSupplies []HardwarePowerSupply `json:"power_supplies,omitempty"`
Sensors *HardwareSensors `json:"sensors,omitempty"`
EventLogs []HardwareEventLog `json:"event_logs,omitempty"`
VROCLicense *string `json:"vroc_license,omitempty"`
}
type HardwareHealthSummary struct {
@@ -143,30 +144,33 @@ type HardwareMemory struct {
type HardwareStorage struct {
HardwareComponentStatus
Slot *string `json:"slot,omitempty"`
Type *string `json:"type,omitempty"`
Model *string `json:"model,omitempty"`
SizeGB *int `json:"size_gb,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
Firmware *string `json:"firmware,omitempty"`
Interface *string `json:"interface,omitempty"`
Present *bool `json:"present,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerOnHours *int64 `json:"power_on_hours,omitempty"`
PowerCycles *int64 `json:"power_cycles,omitempty"`
UnsafeShutdowns *int64 `json:"unsafe_shutdowns,omitempty"`
MediaErrors *int64 `json:"media_errors,omitempty"`
ErrorLogEntries *int64 `json:"error_log_entries,omitempty"`
WrittenBytes *int64 `json:"written_bytes,omitempty"`
ReadBytes *int64 `json:"read_bytes,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
AvailableSparePct *float64 `json:"available_spare_pct,omitempty"`
ReallocatedSectors *int64 `json:"reallocated_sectors,omitempty"`
CurrentPendingSectors *int64 `json:"current_pending_sectors,omitempty"`
OfflineUncorrectable *int64 `json:"offline_uncorrectable,omitempty"`
Telemetry map[string]any `json:"-"`
Slot *string `json:"slot,omitempty"`
Type *string `json:"type,omitempty"`
Model *string `json:"model,omitempty"`
SizeGB *int `json:"size_gb,omitempty"`
LogicalBlockSizeBytes *int64 `json:"logical_block_size_bytes,omitempty"`
PhysicalBlockSizeBytes *int64 `json:"physical_block_size_bytes,omitempty"`
MetadataBytesPerBlock *int64 `json:"metadata_bytes_per_block,omitempty"`
SerialNumber *string `json:"serial_number,omitempty"`
Manufacturer *string `json:"manufacturer,omitempty"`
Firmware *string `json:"firmware,omitempty"`
Interface *string `json:"interface,omitempty"`
Present *bool `json:"present,omitempty"`
TemperatureC *float64 `json:"temperature_c,omitempty"`
PowerOnHours *int64 `json:"power_on_hours,omitempty"`
PowerCycles *int64 `json:"power_cycles,omitempty"`
UnsafeShutdowns *int64 `json:"unsafe_shutdowns,omitempty"`
MediaErrors *int64 `json:"media_errors,omitempty"`
ErrorLogEntries *int64 `json:"error_log_entries,omitempty"`
WrittenBytes *int64 `json:"written_bytes,omitempty"`
ReadBytes *int64 `json:"read_bytes,omitempty"`
LifeUsedPct *float64 `json:"life_used_pct,omitempty"`
LifeRemainingPct *float64 `json:"life_remaining_pct,omitempty"`
AvailableSparePct *float64 `json:"available_spare_pct,omitempty"`
ReallocatedSectors *int64 `json:"reallocated_sectors,omitempty"`
CurrentPendingSectors *int64 `json:"current_pending_sectors,omitempty"`
OfflineUncorrectable *int64 `json:"offline_uncorrectable,omitempty"`
Telemetry map[string]any `json:"-"`
}
type HardwarePCIeDevice struct {
@@ -211,6 +215,7 @@ type HardwarePCIeDevice struct {
Firmware *string `json:"firmware,omitempty"`
MacAddresses []string `json:"mac_addresses,omitempty"`
Present *bool `json:"present,omitempty"`
IOMMUGroup *int `json:"iommu_group,omitempty"`
Telemetry map[string]any `json:"-"`
}

View File

@@ -44,3 +44,57 @@ func TestHardwareSnapshotMarshalsNewContractFields(t *testing.T) {
t.Fatalf("missing event_logs payload: %s", text)
}
}
func TestHardwareSnapshotMarshalsStorageTelemetryFields(t *testing.T) {
powerOnHours := int64(12450)
writtenBytes := int64(9876543210)
readBytes := int64(1234567890)
lifeRemainingPct := 91.0
logicalBlockSizeBytes := int64(512)
physicalBlockSizeBytes := int64(4096)
metadataBytesPerBlock := int64(8)
payload := HardwareIngestRequest{
CollectedAt: "2026-03-15T15:00:00Z",
Hardware: HardwareSnapshot{
Board: HardwareBoard{SerialNumber: "SRV-001"},
Storage: []HardwareStorage{
{
SerialNumber: stringPtr("DISK-001"),
Model: stringPtr("TestDisk"),
LogicalBlockSizeBytes: &logicalBlockSizeBytes,
PhysicalBlockSizeBytes: &physicalBlockSizeBytes,
MetadataBytesPerBlock: &metadataBytesPerBlock,
PowerOnHours: &powerOnHours,
WrittenBytes: &writtenBytes,
ReadBytes: &readBytes,
LifeRemainingPct: &lifeRemainingPct,
},
},
},
}
data, err := json.Marshal(payload)
if err != nil {
t.Fatalf("marshal: %v", err)
}
text := string(data)
for _, needle := range []string{
`"storage":[{`,
`"logical_block_size_bytes":512`,
`"physical_block_size_bytes":4096`,
`"metadata_bytes_per_block":8`,
`"power_on_hours":12450`,
`"written_bytes":9876543210`,
`"read_bytes":1234567890`,
`"life_remaining_pct":91`,
} {
if !strings.Contains(text, needle) {
t.Fatalf("missing %q in payload: %s", needle, text)
}
}
}
func stringPtr(v string) *string {
return &v
}

View File

@@ -125,6 +125,8 @@ func defaultTaskPriority(target string, params taskParams) int {
return taskPriorityInstall
case "install-to-ram":
return taskPriorityInstallToRAM
case "nvme-format":
return taskPriorityInstall
case "audit":
return taskPriorityAudit
case "nvidia-bench-perf", "nvidia-bench-power", "nvidia-bench-autotune":
@@ -806,15 +808,14 @@ func (h *handler) handleAPISATAbort(w http.ResponseWriter, r *http.Request) {
now := time.Now()
t.DoneAt = &now
case TaskRunning:
if t.job == nil || !t.job.abort() {
globalQueue.mu.Unlock()
writeJSON(w, map[string]string{"status": "not_running"})
return
}
if taskMayLeaveOrphanWorkers(t.Target) {
platform.KillTestWorkers()
}
t.Status = TaskCancelled
now := time.Now()
t.DoneAt = &now
globalQueue.mu.Unlock()
writeJSON(w, map[string]string{"status": "aborting"})
return
}
globalQueue.mu.Unlock()
writeJSON(w, map[string]string{"status": "aborted"})
@@ -1039,6 +1040,81 @@ func (h *handler) handleAPIExportUSBBundle(w http.ResponseWriter, r *http.Reques
writeJSON(w, map[string]string{"status": "ok", "message": result.Body})
}
func (h *handler) handleAPIBlackboxStatus(w http.ResponseWriter, _ *http.Request) {
state, err := app.ReadBlackboxState(filepath.Join(h.opts.ExportDir, "blackbox-state.json"))
if err != nil {
if errors.Is(err, os.ErrNotExist) {
writeJSON(w, app.BlackboxState{Status: "disabled", Targets: []app.BlackboxTargetStatus{}})
return
}
writeError(w, http.StatusInternalServerError, err.Error())
return
}
if state.Targets == nil {
state.Targets = []app.BlackboxTargetStatus{}
}
writeJSON(w, state)
}
func (h *handler) handleAPIBlackboxEnable(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var target platform.RemovableTarget
if err := json.NewDecoder(r.Body).Decode(&target); err != nil || strings.TrimSpace(target.Device) == "" {
writeError(w, http.StatusBadRequest, "device is required")
return
}
targets, err := h.opts.App.ListRemovableTargets()
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
allowed := false
for _, candidate := range targets {
if candidate.Device == target.Device {
target = candidate
allowed = true
break
}
}
if !allowed {
writeError(w, http.StatusBadRequest, "device not in removable target list")
return
}
marker, err := app.EnableBlackboxTarget(target)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]any{
"status": "ok",
"message": "Black-box marker written.",
"enrollment_id": marker.EnrollmentID,
})
}
func (h *handler) handleAPIBlackboxDisable(w http.ResponseWriter, r *http.Request) {
var req struct {
Device string `json:"device"`
EnrollmentID string `json:"enrollment_id"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
if err := app.DisableBlackboxTarget(req.Device, req.EnrollmentID); err != nil {
if errors.Is(err, os.ErrNotExist) {
writeError(w, http.StatusNotFound, "black-box target not found")
return
}
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "message": "Black-box marker removed."})
}
// ── GPU presence ──────────────────────────────────────────────────────────────
func (h *handler) handleAPIGNVIDIAGPUs(w http.ResponseWriter, _ *http.Request) {
@@ -1221,7 +1297,7 @@ func (h *handler) handleAPIInstallToRAM(w http.ResponseWriter, r *http.Request)
var standardTools = []string{
"dmidecode", "smartctl", "nvme", "lspci", "ipmitool",
"nvidia-smi", "dcgmi", "nv-hostengine", "memtester", "stress-ng", "nvtop",
"mstflint", "qrencode",
"mstflint",
}
func (h *handler) handleAPIToolsCheck(w http.ResponseWriter, r *http.Request) {

View File

@@ -3,6 +3,8 @@ package webui
import (
"encoding/json"
"net/http/httptest"
"os"
"path/filepath"
"strings"
"testing"
@@ -44,6 +46,66 @@ func TestHandleAPISATRunDecodesBodyWithoutContentLength(t *testing.T) {
}
}
func TestHandleAPIBlackboxStatusReturnsDisabledWhenStateMissing(t *testing.T) {
h := &handler{opts: HandlerOptions{ExportDir: t.TempDir()}}
rec := httptest.NewRecorder()
req := httptest.NewRequest("GET", "/api/blackbox/status", nil)
h.handleAPIBlackboxStatus(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
var state app.BlackboxState
if err := json.Unmarshal(rec.Body.Bytes(), &state); err != nil {
t.Fatalf("decode state: %v", err)
}
if state.Status != "disabled" {
t.Fatalf("status=%q want disabled", state.Status)
}
}
func TestHandleAPIBlackboxStatusReturnsPersistedState(t *testing.T) {
exportDir := t.TempDir()
statePath := filepath.Join(exportDir, "blackbox-state.json")
if err := os.WriteFile(statePath, []byte(`{"status":"running","boot_folder":"boot-folder","targets":[{"enrollment_id":"bb-1","device":"/dev/sdb1","status":"running","flush_period":"1s"}]}`), 0644); err != nil {
t.Fatalf("write state: %v", err)
}
h := &handler{opts: HandlerOptions{ExportDir: exportDir}}
rec := httptest.NewRecorder()
req := httptest.NewRequest("GET", "/api/blackbox/status", nil)
h.handleAPIBlackboxStatus(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
if !strings.Contains(rec.Body.String(), `"boot_folder":"boot-folder"`) {
t.Fatalf("body=%s", rec.Body.String())
}
}
func TestParseNVMeFormatModes(t *testing.T) {
raw := `
lbaf 0 : ms:0 lbads:9 rp:0x2 (in use)
lbaf 1 : ms:8 lbads:9 rp:0x1
lbaf 2 : ms:0 lbads:12 rp:0
`
modes := parseNVMeFormatModes(raw)
if len(modes) != 3 {
t.Fatalf("modes=%#v want 3 modes", modes)
}
if modes[0].Mode != 0 || modes[0].DataBytes != 512 || modes[0].MetadataBytes != 0 || !modes[0].InUse {
t.Fatalf("mode 0=%#v", modes[0])
}
if modes[1].Label != "MODE 1 (512+8)" {
t.Fatalf("mode 1 label=%q", modes[1].Label)
}
if modes[2].DataBytes != 4096 || modes[2].MetadataBytes != 0 {
t.Fatalf("mode 2=%#v", modes[2])
}
}
func TestHandleAPIBenchmarkNvidiaRunQueuesSelectedGPUs(t *testing.T) {
globalQueue.mu.Lock()
originalTasks := globalQueue.tasks

View File

@@ -20,7 +20,7 @@ type jobState struct {
cancel func() // optional cancel function; nil if job is not cancellable
logPath string
serialPrefix string
logFile *os.File // kept open for the task lifetime to avoid per-line open/close
logBuf *bufio.Writer
}
@@ -53,13 +53,21 @@ func (j *jobState) abort() bool {
}
func (j *jobState) append(line string) {
j.appendWithOptions(line, true, true)
}
func (j *jobState) appendFromLog(line string) {
j.appendWithOptions(line, false, false)
}
func (j *jobState) appendWithOptions(line string, persistLog, serialMirror bool) {
j.mu.Lock()
defer j.mu.Unlock()
j.lines = append(j.lines, line)
if j.logPath != "" {
if persistLog && j.logPath != "" {
j.writeLogLineLocked(line)
}
if j.serialPrefix != "" {
if serialMirror && j.serialPrefix != "" {
taskSerialWriteLine(j.serialPrefix + line)
}
for _, ch := range j.subs {
@@ -83,6 +91,7 @@ func (j *jobState) writeLogLineLocked(line string) {
j.logBuf = bufio.NewWriterSize(f, 64*1024)
}
_, _ = j.logBuf.WriteString(line + "\n")
_ = j.logBuf.Flush()
}
// closeLog flushes and closes the log file. Called after all task output is done.

View File

@@ -0,0 +1,368 @@
package webui
import (
"context"
"encoding/json"
"fmt"
"net/http"
"os/exec"
"path/filepath"
"regexp"
"sort"
"strconv"
"strings"
"time"
)
type nvmeFormatMode struct {
Mode int `json:"mode"`
DataBytes int64 `json:"data_bytes"`
MetadataBytes int64 `json:"metadata_bytes"`
InUse bool `json:"in_use"`
Label string `json:"label"`
}
type nvmeFormatDisk struct {
Device string `json:"device"`
Model string `json:"model,omitempty"`
Serial string `json:"serial,omitempty"`
Size string `json:"size,omitempty"`
CurrentMode int `json:"current_mode"`
CurrentFormat string `json:"current_format"`
Modes []nvmeFormatMode `json:"modes"`
Error string `json:"error,omitempty"`
}
type nvmeListJSON struct {
Devices []struct {
DevicePath string `json:"DevicePath"`
ModelNumber string `json:"ModelNumber"`
SerialNumber string `json:"SerialNumber"`
PhysicalSize int64 `json:"PhysicalSize"`
} `json:"Devices"`
}
var (
nvmeFormatDeviceRE = regexp.MustCompile(`^/dev/nvme[0-9]+n[0-9]+$`)
nvmeLBAFCompactLineRE = regexp.MustCompile(`(?im)^\s*lbaf\s+(\d+)\s*:\s*ms:(\d+)\s+lbads:(\d+).*$`)
nvmeLBAFVerboseLineRE = regexp.MustCompile(`(?im)^\s*LBA Format\s+(\d+)\s*:\s*Metadata Size:\s*(\d+)\s+bytes\s*-\s*Data Size:\s*(\d+)\s+bytes.*$`)
nvmeCommandContext = exec.CommandContext
nvmeListFormatsTimeout = 20 * time.Second
)
func listNVMeFormatDisks(ctx context.Context) ([]nvmeFormatDisk, error) {
ctx, cancel := context.WithTimeout(ctx, nvmeListFormatsTimeout)
defer cancel()
out, err := nvmeCommandContext(ctx, "nvme", "list", "-o", "json").Output()
if err != nil {
return nil, err
}
var root nvmeListJSON
if err := json.Unmarshal(out, &root); err != nil {
return nil, err
}
disks := make([]nvmeFormatDisk, 0, len(root.Devices))
seen := map[string]struct{}{}
for _, dev := range root.Devices {
path := strings.TrimSpace(dev.DevicePath)
if !nvmeFormatDeviceRE.MatchString(path) {
continue
}
if _, ok := seen[path]; ok {
continue
}
seen[path] = struct{}{}
disk := nvmeFormatDisk{
Device: path,
Model: strings.TrimSpace(dev.ModelNumber),
Serial: strings.TrimSpace(dev.SerialNumber),
Size: formatNVMeBytes(dev.PhysicalSize),
CurrentMode: -1,
}
modes, parseErr := readNVMeFormatModes(ctx, path)
if parseErr != nil {
disk.Error = parseErr.Error()
}
disk.Modes = modes
for _, mode := range modes {
if mode.InUse {
disk.CurrentMode = mode.Mode
disk.CurrentFormat = formatNVMeBlock(mode.DataBytes, mode.MetadataBytes)
break
}
}
disks = append(disks, disk)
}
sort.Slice(disks, func(i, j int) bool { return disks[i].Device < disks[j].Device })
return disks, nil
}
func readNVMeFormatModes(ctx context.Context, device string) ([]nvmeFormatMode, error) {
if !nvmeFormatDeviceRE.MatchString(device) {
return nil, fmt.Errorf("invalid NVMe device")
}
out, err := nvmeCommandContext(ctx, "nvme", "id-ns", device, "-H").CombinedOutput()
if err != nil {
msg := strings.TrimSpace(string(out))
if msg == "" {
msg = err.Error()
}
return nil, fmt.Errorf("%s", msg)
}
modes := parseNVMeFormatModes(string(out))
if len(modes) == 0 {
return nil, fmt.Errorf("no LBA format modes found")
}
return modes, nil
}
func parseNVMeFormatModes(raw string) []nvmeFormatMode {
byMode := map[int]nvmeFormatMode{}
for _, m := range nvmeLBAFCompactLineRE.FindAllStringSubmatch(raw, -1) {
mode, errMode := strconv.Atoi(m[1])
metadata, errMS := strconv.ParseInt(m[2], 10, 64)
lbads, errLBADS := strconv.Atoi(m[3])
if errMode != nil || errMS != nil || errLBADS != nil || lbads < 0 || lbads >= 63 {
continue
}
data := int64(1) << lbads
line := m[0]
byMode[mode] = nvmeFormatMode{
Mode: mode,
DataBytes: data,
MetadataBytes: metadata,
InUse: strings.Contains(strings.ToLower(line), "in use"),
Label: fmt.Sprintf("MODE %d (%s)", mode, formatNVMeBlock(data, metadata)),
}
}
for _, m := range nvmeLBAFVerboseLineRE.FindAllStringSubmatch(raw, -1) {
mode, errMode := strconv.Atoi(m[1])
metadata, errMS := strconv.ParseInt(m[2], 10, 64)
data, errData := strconv.ParseInt(m[3], 10, 64)
if errMode != nil || errMS != nil || errData != nil || data <= 0 {
continue
}
line := m[0]
byMode[mode] = nvmeFormatMode{
Mode: mode,
DataBytes: data,
MetadataBytes: metadata,
InUse: strings.Contains(strings.ToLower(line), "in use"),
Label: fmt.Sprintf("MODE %d (%s)", mode, formatNVMeBlock(data, metadata)),
}
}
modes := make([]nvmeFormatMode, 0, len(byMode))
for _, mode := range byMode {
modes = append(modes, mode)
}
sort.Slice(modes, func(i, j int) bool { return modes[i].Mode < modes[j].Mode })
return modes
}
func runNVMeFormatTask(ctx context.Context, j *jobState, device string, lbaf int) error {
if !nvmeFormatDeviceRE.MatchString(device) {
return fmt.Errorf("invalid NVMe device")
}
modes, err := readNVMeFormatModes(ctx, device)
if err != nil {
return err
}
var selected nvmeFormatMode
found := false
for _, mode := range modes {
if mode.Mode == lbaf {
selected = mode
found = true
break
}
}
if !found {
return fmt.Errorf("MODE %d is not available on %s", lbaf, device)
}
ms := 0
if selected.MetadataBytes > 0 {
ms = 1
}
j.append(fmt.Sprintf("Formatting %s to %s with --lbaf=%d --ms=%d --force", device, formatNVMeBlock(selected.DataBytes, selected.MetadataBytes), selected.Mode, ms))
cmd := nvmeCommandContext(ctx, "nvme", "format", device, fmt.Sprintf("--lbaf=%d", selected.Mode), fmt.Sprintf("--ms=%d", ms), "--force")
return streamCmdJob(j, cmd)
}
func (h *handler) handleAPINVMeFormats(w http.ResponseWriter, r *http.Request) {
disks, err := listNVMeFormatDisks(r.Context())
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, disks)
}
func (h *handler) handleAPINVMeFormatRun(w http.ResponseWriter, r *http.Request) {
var req struct {
Device string `json:"device"`
LBAF int `json:"lbaf"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
if !nvmeFormatDeviceRE.MatchString(req.Device) {
writeError(w, http.StatusBadRequest, "invalid NVMe device")
return
}
disks, err := listNVMeFormatDisks(r.Context())
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
var label string
allowed := false
for _, disk := range disks {
if disk.Device != req.Device {
continue
}
for _, mode := range disk.Modes {
if mode.Mode == req.LBAF {
allowed = true
label = mode.Label
break
}
}
}
if !allowed {
writeError(w, http.StatusBadRequest, "LBA format mode is not available for this device")
return
}
name := fmt.Sprintf("NVMe Format %s to %s", filepath.Base(req.Device), label)
t := &Task{
ID: newJobID("nvme-format"),
Name: name,
Target: "nvme-format",
Priority: defaultTaskPriority("nvme-format", taskParams{}),
Status: TaskPending,
CreatedAt: time.Now(),
params: taskParams{
Device: req.Device,
LBAF: req.LBAF,
},
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
}
func formatNVMeBlock(dataBytes, metadataBytes int64) string {
return strconv.FormatInt(dataBytes, 10) + "+" + strconv.FormatInt(metadataBytes, 10)
}
func formatNVMeBytes(n int64) string {
if n <= 0 {
return ""
}
units := []string{"B", "KB", "MB", "GB", "TB", "PB"}
v := float64(n)
unit := 0
for v >= 1000 && unit < len(units)-1 {
v /= 1000
unit++
}
if unit == 0 {
return fmt.Sprintf("%d B", n)
}
return fmt.Sprintf("%.1f %s", v, units[unit])
}
func renderNVMeFormatInline() string {
return `<div id="nvme-format-status" style="font-size:13px;color:var(--muted);margin-bottom:12px">Loading NVMe disks...</div>
<div id="nvme-format-table"><p style="color:var(--muted);font-size:13px">Loading...</p></div>
<script>
function nvmeFormatEsc(s) {
return String(s == null ? '' : s).replace(/[&<>"']/g, function(c) {
return {'&':'&amp;','<':'&lt;','>':'&gt;','"':'&quot;',"'":'&#39;'}[c];
});
}
function loadNVMeFormats() {
var status = document.getElementById('nvme-format-status');
var table = document.getElementById('nvme-format-table');
status.textContent = 'Loading NVMe disks...';
status.style.color = 'var(--muted)';
table.innerHTML = '<p style="color:var(--muted);font-size:13px">Loading...</p>';
fetch('/api/tools/nvme-formats').then(function(r) { return r.json().then(function(d) { if (!r.ok) throw new Error(d.error || ('HTTP ' + r.status)); return d; }); }).then(function(disks) {
window._nvmeFormatDisks = Array.isArray(disks) ? disks : [];
if (!window._nvmeFormatDisks.length) {
status.textContent = 'No NVMe disks found.';
table.innerHTML = '';
return;
}
status.textContent = window._nvmeFormatDisks.length + ' NVMe disk(s) found.';
var rows = window._nvmeFormatDisks.map(function(d, idx) {
var current = d.current_format ? (d.current_format + ' / MODE ' + d.current_mode) : 'unknown';
var detail = [d.model || '', d.serial || '', d.size || ''].filter(Boolean).join(' | ');
var options = (d.modes || []).map(function(m) {
return '<option value="' + m.mode + '"' + (m.in_use ? ' selected' : '') + '>' + nvmeFormatEsc(m.label) + '</option>';
}).join('');
var disabled = options ? '' : ' disabled';
var err = d.error ? '<div style="font-size:12px;color:var(--crit-fg,#9f3a38);margin-top:4px">' + nvmeFormatEsc(d.error) + '</div>' : '';
return '<tr>'
+ '<td style="font-family:monospace;white-space:nowrap">' + nvmeFormatEsc(d.device) + (detail ? '<div style="font-family:inherit;font-size:12px;color:var(--muted)">' + nvmeFormatEsc(detail) + '</div>' : '') + '</td>'
+ '<td style="white-space:nowrap">' + nvmeFormatEsc(current) + err + '</td>'
+ '<td style="white-space:nowrap"><select id="nvme-format-select-' + idx + '"' + disabled + '>' + options + '</select></td>'
+ '<td style="white-space:nowrap"><button class="btn btn-sm btn-primary" onclick="nvmeFormatRun(' + idx + ', this)"' + disabled + '>Apply</button><div class="nvme-format-row-msg" style="margin-top:6px;font-size:12px;color:var(--muted)"></div></td>'
+ '</tr>';
}).join('');
table.innerHTML = '<table><tr><th>Disk</th><th>Current block / mode</th><th>New mode</th><th>Action</th></tr>' + rows + '</table>';
}).catch(function(e) {
status.textContent = 'Error loading NVMe disks: ' + e.message;
status.style.color = 'var(--crit-fg,#9f3a38)';
table.innerHTML = '';
});
}
function nvmeWaitTaskDone(taskID, rowMsg) {
var timer = setInterval(function() {
fetch('/api/tasks').then(function(r) { return r.json(); }).then(function(tasks) {
var task = (tasks || []).find(function(t) { return t.id === taskID; });
if (!task) return;
if (task.status === 'done' || task.status === 'failed' || task.status === 'cancelled') {
clearInterval(timer);
rowMsg.textContent = 'Task ' + taskID + ': ' + task.status + (task.error ? ' - ' + task.error : '');
rowMsg.style.color = task.status === 'done' ? 'var(--ok,green)' : 'var(--crit-fg,#9f3a38)';
loadNVMeFormats();
}
}).catch(function(){});
}, 1500);
}
function nvmeFormatRun(idx, btn) {
var disk = (window._nvmeFormatDisks || [])[idx];
var select = document.getElementById('nvme-format-select-' + idx);
var row = btn.closest('td');
var rowMsg = row.querySelector('.nvme-format-row-msg');
if (!disk || !select) return;
var lbaf = parseInt(select.value, 10);
var mode = (disk.modes || []).find(function(m) { return m.mode === lbaf; });
if (!mode) return;
if (!window.confirm('Format ' + disk.device + ' to ' + mode.label + '? This erases data on the namespace.')) return;
btn.disabled = true;
rowMsg.style.color = 'var(--muted)';
rowMsg.textContent = 'Queued...';
fetch('/api/tools/nvme-format/run', {
method:'POST',
headers:{'Content-Type':'application/json'},
body:JSON.stringify({device: disk.device, lbaf: lbaf})
}).then(function(r) { return r.json().then(function(d) { if (!r.ok) throw new Error(d.error || ('HTTP ' + r.status)); return d; }); }).then(function(d) {
rowMsg.textContent = 'Task ' + d.task_id + ' queued.';
nvmeWaitTaskDone(d.task_id, rowMsg);
}).catch(function(e) {
rowMsg.style.color = 'var(--crit-fg,#9f3a38)';
rowMsg.textContent = 'Error: ' + e.message;
}).finally(function() {
btn.disabled = false;
});
}
loadNVMeFormats();
</script>`
}
func renderNVMeFormatCard() string {
return `<div class="card"><div class="card-head">NVMe Block Format <button class="btn btn-sm btn-secondary" onclick="loadNVMeFormats()" style="margin-left:auto">&#8635; Refresh</button></div><div class="card-body">` +
`<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Lists NVMe namespaces and changes their LBA format through a queued task.</p>` +
renderNVMeFormatInline() + `</div></div>`
}

View File

@@ -102,47 +102,69 @@ window.supportBundleDownload = function() {
func renderUSBExportCard() string {
return `<div class="card" style="margin-top:16px">
<div class="card-head">Export to USB
<button class="btn btn-sm btn-secondary" onclick="usbRefresh()" style="margin-left:auto">&#8635; Refresh</button>
<div class="card-head">USB Black-Box
<button class="btn btn-sm btn-secondary" onclick="blackboxRefresh()" style="margin-left:auto">&#8635; Refresh</button>
</div>
<div class="card-body">` + renderUSBExportInline() + `</div>
</div>`
}
func renderUSBExportInline() string {
return `<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Marks removable USB devices as black-box targets. The dedicated bee-blackbox service mirrors export files and system logs into a boot-scoped folder and resumes automatically after restart.</p>
<div id="usb-status" style="font-size:13px;color:var(--muted)">Scanning for USB devices...</div>
<div id="blackbox-summary" style="margin-top:8px;font-size:13px;color:var(--muted)">Loading black-box status...</div>
<div id="usb-targets" style="margin-top:12px"></div>
<div id="usb-msg" style="margin-top:10px;font-size:13px"></div>
<script>
(function(){
function blackboxRefresh() {
document.getElementById('usb-status').textContent = 'Scanning...';
document.getElementById('blackbox-summary').textContent = 'Loading black-box status...';
document.getElementById('usb-targets').innerHTML = '';
document.getElementById('usb-msg').textContent = '';
Promise.all([
fetch('/api/export/usb').then(r=>r.json()),
fetch('/api/blackbox/status').then(r=>r.json())
]).then(function(values) {
const targets = Array.isArray(values[0]) ? values[0] : [];
const state = values[1] || {};
const active = Array.isArray(state.targets) ? state.targets : [];
window._usbTargets = targets;
window._blackboxTargets = active;
const st = document.getElementById('usb-status');
const ct = document.getElementById('usb-targets');
const summary = document.getElementById('blackbox-summary');
if (state.boot_folder) {
summary.textContent = 'Service state: ' + (state.status || 'unknown') + '. Boot folder: ' + state.boot_folder + '.';
} else {
summary.textContent = 'Service state: ' + (state.status || 'disabled') + '.';
}
if (!targets || targets.length === 0) {
st.textContent = 'No removable USB devices found.';
return;
} else {
st.textContent = targets.length + ' device(s) found:';
}
const byDevice = {};
active.forEach(function(item) { byDevice[item.device] = item; });
ct.innerHTML = '<table><tr><th>Device</th><th>FS</th><th>Size</th><th>Label</th><th>Model</th><th>Black-Box</th><th>Actions</th></tr>' +
targets.map((t, idx) => {
const dev = t.device || '';
const label = t.label || '';
const model = t.model || '';
const state = byDevice[dev];
const status = state ? (state.status + (state.flush_period ? ', flush ' + state.flush_period : '')) : 'not enrolled';
const detail = state && state.last_error ? ('<div style="font-size:12px;color:var(--err,red)">'+state.last_error+'</div>') : '';
return '<tr>' +
'<td style="font-family:monospace">'+dev+'</td>' +
'<td>'+t.fs_type+'</td>' +
'<td>'+t.size+'</td>' +
'<td>'+label+'</td>' +
'<td style="font-size:12px;color:var(--muted)">'+model+'</td>' +
'<td style="font-size:12px">'+status+detail+'</td>' +
'<td style="white-space:nowrap">' +
'<button class="btn btn-sm btn-primary" onclick="usbExport(\'audit\','+idx+',this)">Audit JSON</button> ' +
'<button class="btn btn-sm btn-secondary" onclick="usbExport(\'bundle\','+idx+',this)">Support Bundle</button>' +
(state
? '<button class="btn btn-sm btn-secondary" onclick="blackboxDisable('+idx+',this)">Disable</button>'
: '<button class="btn btn-sm btn-primary" onclick="blackboxEnable('+idx+',this)">Enable</button>') +
'<div class="usb-row-msg" style="margin-top:6px;font-size:12px;color:var(--muted)"></div>' +
'</td></tr>';
}).join('') + '</table>';
@@ -150,7 +172,7 @@ function usbRefresh() {
document.getElementById('usb-status').textContent = 'Error: ' + e;
});
}
window.blackboxEnable = function(targetIndex, btn) {
const target = (window._usbTargets || [])[targetIndex];
if (!target) {
const msg = document.getElementById('usb-msg');
@@ -164,15 +186,15 @@ window.usbExport = function(type, targetIndex, btn) {
const originalText = btn ? btn.textContent : '';
if (btn) {
btn.disabled = true;
btn.textContent = 'Enabling...';
}
if (rowMsg) {
rowMsg.style.color = 'var(--muted)';
rowMsg.textContent = 'Working...';
}
msg.style.color = 'var(--muted)';
msg.textContent = 'Enabling black-box on ' + (target.device||'') + '...';
fetch('/api/blackbox/enable', {
method: 'POST',
headers: {'Content-Type':'application/json'},
body: JSON.stringify(target)
@@ -199,10 +221,64 @@ window.usbExport = function(type, targetIndex, btn) {
btn.disabled = false;
btn.textContent = originalText;
}
setTimeout(blackboxRefresh, 300);
});
};
window.blackboxDisable = function(targetIndex, btn) {
const target = (window._usbTargets || [])[targetIndex];
const active = (window._blackboxTargets || []).find(function(item){ return item.device === (target && target.device); });
if (!target || !active) {
const msg = document.getElementById('usb-msg');
msg.style.color = 'var(--err,red)';
msg.textContent = 'Error: black-box target not found. Refresh and try again.';
return;
}
const msg = document.getElementById('usb-msg');
const row = btn ? btn.closest('td') : null;
const rowMsg = row ? row.querySelector('.usb-row-msg') : null;
const originalText = btn ? btn.textContent : '';
if (btn) {
btn.disabled = true;
btn.textContent = 'Disabling...';
}
if (rowMsg) {
rowMsg.style.color = 'var(--muted)';
rowMsg.textContent = 'Working...';
}
msg.style.color = 'var(--muted)';
msg.textContent = 'Disabling black-box on ' + (target.device||'') + '...';
fetch('/api/blackbox/disable', {
method:'POST',
headers:{'Content-Type':'application/json'},
body: JSON.stringify({device: target.device, enrollment_id: active.enrollment_id})
}).then(async r => {
const d = await r.json();
if (!r.ok) throw new Error(d.error || ('HTTP ' + r.status));
return d;
}).then(d => {
msg.style.color = 'var(--ok,green)';
msg.textContent = d.message || 'Done.';
if (rowMsg) {
rowMsg.style.color = 'var(--ok,green)';
rowMsg.textContent = d.message || 'Done.';
}
}).catch(e => {
msg.style.color = 'var(--err,red)';
msg.textContent = 'Error: '+e;
if (rowMsg) {
rowMsg.style.color = 'var(--err,red)';
rowMsg.textContent = 'Error: ' + e;
}
}).finally(() => {
if (btn) {
btn.disabled = false;
btn.textContent = originalText;
}
setTimeout(blackboxRefresh, 300);
});
};
window.blackboxRefresh = blackboxRefresh;
blackboxRefresh();
})();
</script>`
}
@@ -355,7 +431,7 @@ fetch('/api/system/ram-status').then(r=>r.json()).then(d=>{
else if (kind === 'disk') label = 'disk (' + source + ')';
else label = source;
boot.textContent = 'Current boot source: ' + label + '.';
txt.textContent = d.blocked_reason || d.message || 'Checking...';
if (d.status === 'ok' || d.in_ram) {
txt.style.color = 'var(--ok, green)';
} else if (d.status === 'failed') {
@@ -382,7 +458,7 @@ function installToRAM() {
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Downloads a tar.gz archive of all audit files, SAT results, and logs.</p>
` + renderSupportBundleInline() + `
<div style="border-top:1px solid var(--border);margin-top:16px;padding-top:16px">
<div style="font-weight:600;margin-bottom:8px">Export to USB</div>
<div style="font-weight:600;margin-bottom:8px">USB Black-Box</div>
` + renderUSBExportInline() + `
</div>
</div></div>
@@ -399,6 +475,7 @@ function installToRAM() {
<div class="card"><div class="card-head">Services</div><div class="card-body">` +
renderServicesInline() + `</div></div>
` + renderNVMeFormatCard() + `
<script>
function checkTools() {

View File

@@ -207,7 +207,7 @@ func renderInstall() string {
func renderTasks() string {
return `<div style="display:flex;align-items:center;gap:12px;margin-bottom:16px;flex-wrap:wrap">
<button class="btn btn-danger btn-sm" onclick="cancelAll()">Cancel All</button>
<button class="btn btn-sm" style="background:#b45309;color:#fff" onclick="killWorkers()" title="Send SIGKILL to all running test processes (bee-gpu-burn, stress-ng, stressapptest, memtester)">Kill Workers</button>
<button class="btn btn-sm" style="background:#b45309;color:#fff" onclick="killWorkers()" title="Abort running tasks and kill orphaned test processes (bee-gpu-burn, dcgmi, nvvs, nvbandwidth, stress-ng, stressapptest, memtester)">Abort Tasks And Kill Orphans</button>
<span id="kill-toast" style="font-size:12px;color:var(--muted);display:none"></span>
<span style="font-size:12px;color:var(--muted)">Open a task to view its saved logs and charts.</span>
</div>
@@ -289,7 +289,7 @@ function cancelAll() {
fetch('/api/tasks/cancel-all',{method:'POST'}).then(()=>loadTasks());
}
function killWorkers() {
if (!confirm('Send SIGKILL to all running test workers (bee-gpu-burn, stress-ng, stressapptest, memtester)?\n\nThis will also cancel all queued and running tasks.')) return;
if (!confirm('Abort all queued/running tasks and kill orphaned test workers (bee-gpu-burn, dcgmi, nvvs, nvbandwidth, stress-ng, stressapptest, memtester)?\n\nRunning bee-worker processes will first be asked to stop gracefully; orphaned test processes will then be killed.')) return;
fetch('/api/tasks/kill-workers',{method:'POST'})
.then(r=>r.json())
.then(d=>{

View File

@@ -301,11 +301,14 @@ func NewHandler(opts HandlerOptions) http.Handler {
// Export
mux.HandleFunc("GET /api/export/list", h.handleAPIExportList)
mux.HandleFunc("GET /api/export/usb", h.handleAPIExportUSBTargets)
mux.HandleFunc("POST /api/export/usb/audit", h.handleAPIExportUSBAudit)
mux.HandleFunc("POST /api/export/usb/bundle", h.handleAPIExportUSBBundle)
mux.HandleFunc("GET /api/blackbox/status", h.handleAPIBlackboxStatus)
mux.HandleFunc("POST /api/blackbox/enable", h.handleAPIBlackboxEnable)
mux.HandleFunc("POST /api/blackbox/disable", h.handleAPIBlackboxDisable)
// Tools
mux.HandleFunc("GET /api/tools/check", h.handleAPIToolsCheck)
mux.HandleFunc("GET /api/tools/nvme-formats", h.handleAPINVMeFormats)
mux.HandleFunc("POST /api/tools/nvme-format/run", h.handleAPINVMeFormatRun)
// GPU presence / tools
mux.HandleFunc("GET /api/gpu/presence", h.handleAPIGPUPresence)
@@ -571,6 +574,7 @@ func (h *handler) handleExportIndex(w http.ResponseWriter, r *http.Request) {
func (h *handler) handleViewer(w http.ResponseWriter, r *http.Request) {
snapshot, _ := loadSnapshot(h.opts.AuditPath)
snapshot = enrichSnapshotForViewer(snapshot)
body, err := viewer.RenderHTML(snapshot, h.opts.Title)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
@@ -1290,8 +1294,8 @@ const loadingPageHTML = `<!DOCTYPE html>
*{margin:0;padding:0;box-sizing:border-box}
html,body{height:100%;background:#0f1117;display:flex;align-items:center;justify-content:center;font-family:'Courier New',monospace;color:#e2e8f0}
.wrap{text-align:center;width:420px}
.logo{font-size:11px;line-height:1.4;color:#f6c90e;margin-bottom:6px;white-space:pre;text-align:left}
.subtitle{font-size:12px;color:#a0aec0;text-align:left;margin-bottom:24px;padding-left:2px}
.brand{font-size:22px;letter-spacing:.18em;color:#f6c90e;margin-bottom:6px;text-align:left}
.subtitle{font-size:12px;color:#a0aec0;text-align:left;margin-bottom:24px}
.spinner{width:36px;height:36px;border:3px solid #2d3748;border-top-color:#f6c90e;border-radius:50%;animation:spin .8s linear infinite;margin:0 auto 14px}
.spinner.hidden{display:none}
@keyframes spin{to{transform:rotate(360deg)}}
@@ -1309,12 +1313,7 @@ td:first-child{color:#718096;width:55%}
</head>
<body>
<div class="wrap">
<div class="logo"> ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝
█████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗
██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝
███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗
╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝</div>
<div class="brand">EASY BEE</div>
<div class="subtitle">Hardware Audit LiveCD</div>
<div class="spinner" id="spin"></div>
<div class="status" id="st">Connecting to bee-web...</div>
@@ -1324,8 +1323,20 @@ td:first-child{color:#718096;width:55%}
<script>
(function(){
var gone = false;
var pollStarted = false;
var fallbackOpenTimer = null;
var AUTO_OPEN_DELAY_MS = 15000;
function go(){ if(!gone){gone=true;window.location.replace('/');} }
function scheduleFallbackOpen(){
if(fallbackOpenTimer!==null) return;
fallbackOpenTimer=setTimeout(function(){
document.getElementById('spin').className='spinner hidden';
document.getElementById('st').textContent='Startup checks are taking too long — opening app...';
go();
},AUTO_OPEN_DELAY_MS);
}
function icon(s){
if(s==='active') return '<span class="ok">&#9679; active</span>';
if(s==='failed') return '<span class="fail">&#10005; failed</span>';
@@ -1357,6 +1368,7 @@ function pollServices(){
tbl.innerHTML=html;
if(allSettled(svcs)){
clearInterval(pollTimer);
if(fallbackOpenTimer!==null) clearTimeout(fallbackOpenTimer);
document.getElementById('spin').className='spinner hidden';
document.getElementById('st').textContent='Ready \u2014 opening...';
setTimeout(go,800);
@@ -1371,8 +1383,12 @@ function probe(){
if(r.ok){
document.getElementById('st').textContent='bee-web running \u2014 checking services...';
document.getElementById('btn').style.display='';
pollServices();
pollTimer=setInterval(pollServices,1500);
scheduleFallbackOpen();
if(!pollStarted){
pollStarted=true;
pollServices();
pollTimer=setInterval(pollServices,1500);
}
} else {
document.getElementById('st').textContent='bee-web starting (status '+r.status+')...';
setTimeout(probe,500);

View File

@@ -604,6 +604,25 @@ func TestReadyIsOKWhenAuditPathIsUnset(t *testing.T) {
}
}
func TestLoadingPageHasFallbackAutoOpen(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/loading", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
body := rec.Body.String()
for _, needle := range []string{
`var AUTO_OPEN_DELAY_MS = 15000;`,
`function scheduleFallbackOpen(){`,
`Startup checks are taking too long — opening app...`,
} {
if !strings.Contains(body, needle) {
t.Fatalf("loading page missing %q: %s", needle, body)
}
}
}
func TestAuditPageRendersViewerFrameAndActions(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
@@ -671,11 +690,17 @@ func TestToolsPageRendersNvidiaSelfHealSection(t *testing.T) {
if !strings.Contains(body, `id="boot-source-text"`) {
t.Fatalf("tools page missing boot source field: %s", body)
}
if !strings.Contains(body, `Export to USB`) {
t.Fatalf("tools page missing export to usb section: %s", body)
if !strings.Contains(body, `USB Black-Box`) {
t.Fatalf("tools page missing usb black-box section: %s", body)
}
if !strings.Contains(body, `Support Bundle</button>`) {
t.Fatalf("tools page missing support bundle usb button: %s", body)
if !strings.Contains(body, `/api/blackbox/status`) {
t.Fatalf("tools page missing black-box status api usage: %s", body)
}
if !strings.Contains(body, `NVMe Block Format`) {
t.Fatalf("tools page missing nvme block format section: %s", body)
}
if !strings.Contains(body, `/api/tools/nvme-formats`) || !strings.Contains(body, `/api/tools/nvme-format/run`) {
t.Fatalf("tools page missing nvme format api usage: %s", body)
}
}
@@ -1016,6 +1041,39 @@ func TestViewerRendersLatestSnapshot(t *testing.T) {
}
}
func TestViewerRendersDerivedStorageBlockFormat(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
body := `{
"collected_at":"2026-04-29T00:05:00Z",
"hardware":{
"board":{"serial_number":"SERIAL-NEW"},
"storage":[
{
"serial_number":"DISK-1",
"model":"Test NVMe",
"logical_block_size_bytes":512,
"physical_block_size_bytes":4096,
"metadata_bytes_per_block":8
}
]
}
}`
if err := os.WriteFile(path, []byte(body), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/viewer", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
if !strings.Contains(rec.Body.String(), "512&#43;8") {
t.Fatalf("viewer body missing derived block format: %s", rec.Body.String())
}
}
func TestAuditJSONServesLatestSnapshot(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
@@ -1038,6 +1096,36 @@ func TestAuditJSONServesLatestSnapshot(t *testing.T) {
}
}
func TestAuditJSONDoesNotInjectDerivedStorageBlockFormat(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
body := `{
"hardware":{
"board":{"serial_number":"SERIAL-API"},
"storage":[
{
"serial_number":"DISK-1",
"logical_block_size_bytes":512,
"metadata_bytes_per_block":8
}
]
}
}`
if err := os.WriteFile(path, []byte(body), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/audit.json", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
if strings.Contains(rec.Body.String(), "block_format") {
t.Fatalf("audit.json should remain contract-only: %s", rec.Body.String())
}
}
func TestMissingAuditJSONReturnsNotFound(t *testing.T) {
handler := NewHandler(HandlerOptions{AuditPath: "/missing/audit.json"})
rec := httptest.NewRecorder()

View File

@@ -0,0 +1,511 @@
package webui
import (
"context"
"encoding/json"
"fmt"
"io"
"log/slog"
"os"
"os/signal"
"path/filepath"
"strings"
"syscall"
"time"
"bee/audit/internal/app"
"bee/audit/internal/platform"
"bee/audit/internal/runtimeenv"
)
type taskRunnerState struct {
PID int `json:"pid"`
Status string `json:"status"`
Error string `json:"error,omitempty"`
UpdatedAt time.Time `json:"updated_at"`
}
func taskRunnerStatePath(t *Task) string {
if t == nil || strings.TrimSpace(t.ArtifactsDir) == "" {
return ""
}
return filepath.Join(t.ArtifactsDir, "runner-state.json")
}
func writeTaskRunnerState(t *Task, state taskRunnerState) error {
path := taskRunnerStatePath(t)
if path == "" {
return nil
}
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return err
}
data, err := json.MarshalIndent(state, "", " ")
if err != nil {
return err
}
tmp := path + ".tmp"
if err := os.WriteFile(tmp, data, 0644); err != nil {
return err
}
return os.Rename(tmp, path)
}
func readTaskRunnerState(t *Task) (taskRunnerState, bool) {
path := taskRunnerStatePath(t)
if path == "" {
return taskRunnerState{}, false
}
data, err := os.ReadFile(path)
if err != nil || len(data) == 0 {
return taskRunnerState{}, false
}
var state taskRunnerState
if err := json.Unmarshal(data, &state); err != nil {
return taskRunnerState{}, false
}
return state, true
}
// processAlive reports whether pid refers to a live process: signal 0
// probes without delivering anything, and EPERM still means the process exists.
func processAlive(pid int) bool {
if pid <= 0 {
return false
}
err := syscall.Kill(pid, 0)
return err == nil || err == syscall.EPERM
}
func finalizeTaskForResult(t *Task, errMsg string, cancelled bool) {
now := time.Now()
t.DoneAt = &now
switch {
case cancelled:
t.Status = TaskCancelled
t.ErrMsg = "aborted"
case strings.TrimSpace(errMsg) != "":
t.Status = TaskFailed
t.ErrMsg = errMsg
default:
t.Status = TaskDone
t.ErrMsg = ""
}
}
func executeTaskWithOptions(opts *HandlerOptions, t *Task, j *jobState, ctx context.Context) {
if opts == nil {
j.append("ERROR: handler options not configured")
j.finish("handler options not configured")
return
}
a := opts.App
recovered := len(j.lines) > 0
j.append(fmt.Sprintf("Starting %s...", t.Name))
if recovered {
j.append(fmt.Sprintf("Recovered after bee-web restart at %s", time.Now().UTC().Format(time.RFC3339)))
}
var (
archive string
err error
)
switch t.Target {
case "nvidia":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
diagLevel := 2
if t.params.StressMode {
diagLevel = 3
}
if len(t.params.GPUIndices) > 0 || diagLevel > 0 {
result, e := a.RunNvidiaAcceptancePackWithOptions(ctx, "", diagLevel, t.params.GPUIndices, j.append)
if e != nil {
err = e
} else {
archive = result.Body
}
} else {
archive, err = a.RunNvidiaAcceptancePack("", j.append)
}
case "nvidia-targeted-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if dur <= 0 {
dur = 300
}
archive, err = a.RunNvidiaTargetedStressValidatePack(ctx, "", dur, t.params.GPUIndices, j.append)
case "nvidia-bench-perf":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = a.RunNvidiaBenchmarkCtx(ctx, "", platform.NvidiaBenchmarkOptions{
Profile: t.params.BenchmarkProfile,
SizeMB: t.params.SizeMB,
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
RunNCCL: t.params.RunNCCL,
ParallelGPUs: t.params.ParallelGPUs,
RampStep: t.params.RampStep,
RampTotal: t.params.RampTotal,
RampRunID: t.params.RampRunID,
}, j.append)
case "nvidia-bench-power":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = a.RunNvidiaPowerBenchCtx(ctx, app.DefaultBeeBenchPowerDir, platform.NvidiaBenchmarkOptions{
Profile: t.params.BenchmarkProfile,
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
RampStep: t.params.RampStep,
RampTotal: t.params.RampTotal,
RampRunID: t.params.RampRunID,
}, j.append)
case "nvidia-bench-autotune":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = a.RunNvidiaPowerSourceAutotuneCtx(ctx, app.DefaultBeeBenchAutotuneDir, platform.NvidiaBenchmarkOptions{
Profile: t.params.BenchmarkProfile,
SizeMB: t.params.SizeMB,
}, t.params.BenchmarkKind, j.append)
case "nvidia-compute":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
rampPlan, planErr := resolveNvidiaRampPlan(t.params.BurnProfile, t.params.StaggerGPUStart, t.params.GPUIndices)
if planErr != nil {
err = planErr
break
}
if t.params.BurnProfile != "" && t.params.StaggerGPUStart && dur <= 0 {
dur = rampPlan.DurationSec
}
if rampPlan.StaggerSeconds > 0 {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU; post-ramp hold: %ds; total runtime: %ds", rampPlan.StaggerSeconds, dur, rampPlan.TotalDurationSec))
}
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, rampPlan.StaggerSeconds, j.append)
case "nvidia-targeted-power":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = a.RunNvidiaTargetedPowerPack(ctx, "", dur, t.params.GPUIndices, j.append)
case "nvidia-pulse":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = a.RunNvidiaPulseTestPack(ctx, "", dur, t.params.GPUIndices, j.append)
case "nvidia-bandwidth":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = a.RunNvidiaBandwidthPack(ctx, "", t.params.GPUIndices, j.append)
case "nvidia-interconnect":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = a.RunNCCLTests(ctx, "", t.params.GPUIndices, j.append)
case "nvidia-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
rampPlan, planErr := resolveNvidiaRampPlan(t.params.BurnProfile, t.params.StaggerGPUStart, t.params.GPUIndices)
if planErr != nil {
err = planErr
break
}
if t.params.BurnProfile != "" && t.params.StaggerGPUStart && dur <= 0 {
dur = rampPlan.DurationSec
}
if rampPlan.StaggerSeconds > 0 {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU; post-ramp hold: %ds; total runtime: %ds", rampPlan.StaggerSeconds, dur, rampPlan.TotalDurationSec))
}
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{
DurationSec: dur,
Loader: t.params.Loader,
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
StaggerSeconds: rampPlan.StaggerSeconds,
}, j.append)
case "memory":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
sizeMB, passes := resolveMemoryValidatePreset(t.params.BurnProfile, t.params.StressMode)
j.append(fmt.Sprintf("Memory validate preset: %d MB x %d pass(es)", sizeMB, passes))
archive, err = runMemoryAcceptancePackCtx(a, ctx, "", sizeMB, passes, j.append)
case "storage":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runStorageAcceptancePackCtx(a, ctx, "", t.params.StressMode, j.append)
case "cpu":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
if dur <= 0 {
if t.params.StressMode {
dur = 1800
} else {
dur = 60
}
}
j.append(fmt.Sprintf("CPU stress duration: %ds", dur))
archive, err = runCPUAcceptancePackCtx(a, ctx, "", dur, j.append)
case "amd":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDAcceptancePackCtx(a, ctx, "", j.append)
case "amd-mem":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDMemIntegrityPackCtx(a, ctx, "", j.append)
case "amd-bandwidth":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDMemBandwidthPackCtx(a, ctx, "", j.append)
case "amd-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runAMDStressPackCtx(a, ctx, "", dur, j.append)
case "memory-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runMemoryStressPackCtx(a, ctx, "", dur, j.append)
case "sat-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runSATStressPackCtx(a, ctx, "", dur, j.append)
case "platform-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
runOpts := resolvePlatformStressPreset(t.params.BurnProfile)
runOpts.Components = t.params.PlatformComponents
archive, err = a.RunPlatformStress(ctx, "", runOpts, j.append)
case "audit":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
result, e := a.RunAuditNow(opts.RuntimeMode)
if e != nil {
err = e
} else {
for _, line := range splitLines(result.Body) {
j.append(line)
}
}
case "support-bundle":
j.append("Building support bundle...")
archive, err = buildSupportBundle(opts.ExportDir)
case "install":
if strings.TrimSpace(t.params.Device) == "" {
err = fmt.Errorf("device is required")
break
}
installLogPath := platform.InstallLogPath(t.params.Device)
j.append("Install log: " + installLogPath)
err = streamCmdJob(j, installCommand(ctx, t.params.Device, installLogPath))
case "install-to-ram":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
err = a.RunInstallToRAM(ctx, j.append)
case "nvme-format":
if strings.TrimSpace(t.params.Device) == "" {
err = fmt.Errorf("device is required")
break
}
err = runNVMeFormatTask(ctx, j, t.params.Device, t.params.LBAF)
default:
j.append("ERROR: unknown target: " + t.Target)
j.finish("unknown target")
return
}
if archive != "" {
archivePath := app.ExtractArchivePath(archive)
if err == nil && app.ReadSATOverallStatus(archivePath) == "FAILED" {
err = fmt.Errorf("SAT overall_status=FAILED (see summary.txt)")
}
if opts.App != nil && opts.App.StatusDB != nil {
app.ApplySATResultToDB(opts.App.StatusDB, t.Target, archivePath)
}
}
if err != nil {
if ctx.Err() != nil {
j.append("Aborted.")
j.finish("aborted")
} else {
j.append("ERROR: " + err.Error())
j.finish(err.Error())
}
return
}
if archive != "" {
j.append("Archive: " + archive)
}
j.finish("")
}
func loadPersistedTask(statePath, taskID string) (*Task, error) {
data, err := os.ReadFile(statePath)
if err != nil {
return nil, err
}
var persisted []persistedTask
if err := json.Unmarshal(data, &persisted); err != nil {
return nil, err
}
for _, pt := range persisted {
if pt.ID != taskID {
continue
}
t := &Task{
ID: pt.ID,
Name: pt.Name,
Target: pt.Target,
Priority: pt.Priority,
Status: pt.Status,
CreatedAt: pt.CreatedAt,
StartedAt: pt.StartedAt,
DoneAt: pt.DoneAt,
ErrMsg: pt.ErrMsg,
LogPath: pt.LogPath,
ArtifactsDir: pt.ArtifactsDir,
ReportJSONPath: pt.ReportJSONPath,
ReportHTMLPath: pt.ReportHTMLPath,
params: pt.Params,
}
ensureTaskReportPaths(t)
return t, nil
}
return nil, fmt.Errorf("task %s not found", taskID)
}
func RunPersistedTask(exportDir, taskID string, stdout, stderr io.Writer) int {
if strings.TrimSpace(exportDir) == "" || strings.TrimSpace(taskID) == "" {
fmt.Fprintln(stderr, "bee task-run: --export-dir and --task-id are required")
return 2
}
runtimeInfo, err := runtimeenv.Detect("auto")
if err != nil {
slog.Warn("resolve runtime for task-run", "err", err)
}
opts := &HandlerOptions{
ExportDir: exportDir,
App: app.New(platform.New()),
RuntimeMode: runtimeInfo.Mode,
}
statePath := filepath.Join(exportDir, "tasks-state.json")
task, err := loadPersistedTask(statePath, taskID)
if err != nil {
fmt.Fprintln(stderr, err.Error())
return 1
}
if task.StartedAt == nil || task.StartedAt.IsZero() {
now := time.Now()
task.StartedAt = &now
}
if task.Status == "" {
task.Status = TaskRunning
}
if err := writeTaskRunnerState(task, taskRunnerState{
PID: os.Getpid(),
Status: TaskRunning,
UpdatedAt: time.Now().UTC(),
}); err != nil {
fmt.Fprintln(stderr, err.Error())
return 1
}
ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer cancel()
j := newTaskJobState(task.LogPath, taskSerialPrefix(task))
executeTaskWithOptions(opts, task, j, ctx)
finalizeTaskForResult(task, j.err, ctx.Err() != nil)
if err := writeTaskReportArtifacts(task); err != nil {
appendJobLog(task.LogPath, "WARN: task report generation failed: "+err.Error())
}
j.closeLog()
if err := writeTaskRunnerState(task, taskRunnerState{
PID: os.Getpid(),
Status: task.Status,
Error: task.ErrMsg,
UpdatedAt: time.Now().UTC(),
}); err != nil {
fmt.Fprintln(stderr, err.Error())
}
if task.ErrMsg != "" {
return 1
}
return 0
}

View File

@@ -4,6 +4,7 @@ import (
"context"
"encoding/json"
"fmt"
"io"
"log/slog"
"net/http"
"os"
@@ -13,6 +14,7 @@ import (
"sort"
"strings"
"sync"
"syscall"
"time"
"bee/audit/internal/app"
@@ -55,6 +57,7 @@ var taskNames = map[string]string{
"support-bundle": "Support Bundle",
"install": "Install to Disk",
"install-to-ram": "Install to RAM",
"nvme-format": "NVMe Block Format Change",
}
// burnNames maps target → human-readable name when a burn profile is set.
@@ -110,8 +113,9 @@ type Task struct {
ReportHTMLPath string `json:"report_html_path,omitempty"`
// runtime fields (not serialised)
job *jobState
params taskParams
job *jobState
runnerPID int
params taskParams
}
// taskParams holds optional parameters parsed from the run request.
@@ -134,6 +138,7 @@ type taskParams struct {
RampRunID string `json:"ramp_run_id,omitempty"`
DisplayName string `json:"display_name,omitempty"`
Device string `json:"device,omitempty"` // for install
LBAF int `json:"lbaf,omitempty"`
PlatformComponents []string `json:"platform_components,omitempty"`
}
@@ -328,6 +333,13 @@ var (
installCommand = func(ctx context.Context, device string, logPath string) *exec.Cmd {
return exec.CommandContext(ctx, "bee-install", device, logPath)
}
externalTaskRunnerCommand = func(exportDir, taskID string) (*exec.Cmd, error) {
exe, err := os.Executable()
if err != nil {
return nil, err
}
return exec.Command(exe, "bee-worker", "--export-dir", exportDir, "--task-id", taskID), nil
}
)
// enqueue adds a task to the queue and notifies the worker.
@@ -365,6 +377,11 @@ func (q *taskQueue) prune() {
// nextPending returns the highest-priority pending task (nil if none).
func (q *taskQueue) nextPending() *Task {
for _, t := range q.tasks {
if t.Status == TaskRunning {
return nil
}
}
var best *Task
for _, t := range q.tasks {
if t.Status != TaskPending {
@@ -484,6 +501,7 @@ func (q *taskQueue) startWorker(opts *HandlerOptions) {
if !q.started {
q.loadLocked()
q.started = true
q.resumeRunningTasksLocked()
goRecoverLoop("task worker", 2*time.Second, q.worker)
}
hasPending := q.nextPending() != nil
@@ -517,15 +535,12 @@ func (q *taskQueue) worker() {
t.StartedAt = &now
t.DoneAt = nil
t.ErrMsg = ""
j := newTaskJobState(t.LogPath, taskSerialPrefix(t))
j := newTaskJobState(t.LogPath)
t.job = j
q.persistLocked()
q.mu.Unlock()
taskCtx, taskCancel := context.WithCancel(context.Background())
j.cancel = taskCancel
q.executeTask(t, j, taskCtx)
taskCancel()
q.runTaskExternal(t, j)
q.mu.Lock()
q.prune()
@@ -537,6 +552,218 @@ func (q *taskQueue) worker() {
}
}
func (q *taskQueue) resumeRunningTasksLocked() {
for _, t := range q.tasks {
if t.Status != TaskRunning {
continue
}
if t.job == nil {
t.job = newTaskJobState(t.LogPath)
}
q.attachExternalTaskControlsLocked(t, t.job)
q.startRecoveredTaskMonitorLocked(t, t.job)
}
}
func (q *taskQueue) attachExternalTaskControlsLocked(t *Task, j *jobState) {
if t == nil || j == nil {
return
}
j.cancel = func() {
pid := t.runnerPID
if pid <= 0 {
if state, ok := readTaskRunnerState(t); ok {
pid = state.PID
}
}
if pid > 0 {
_ = syscall.Kill(pid, syscall.SIGTERM)
}
}
}
func (q *taskQueue) startRecoveredTaskMonitorLocked(t *Task, j *jobState) {
if t == nil || j == nil || t.runnerPID <= 0 {
return
}
goRecoverOnce("task runner monitor", func() {
stopTail := make(chan struct{})
doneTail := make(chan struct{})
go q.followTaskLog(t, j, stopTail, doneTail)
for processAlive(t.runnerPID) {
time.Sleep(500 * time.Millisecond)
}
close(stopTail)
<-doneTail
q.finishExternalTask(t, j, nil)
})
}
func (q *taskQueue) runTaskExternal(t *Task, j *jobState) {
startedKmsgWatch := false
if q.kmsgWatcher != nil && isSATTarget(t.Target) {
q.kmsgWatcher.NotifyTaskStarted(t.ID, t.Target)
startedKmsgWatch = true
}
defer func() {
if startedKmsgWatch && q.kmsgWatcher != nil {
q.kmsgWatcher.NotifyTaskFinished(t.ID)
}
}()
stopTail := make(chan struct{})
doneTail := make(chan struct{})
defer func() {
close(stopTail)
<-doneTail
}()
go q.followTaskLog(t, j, stopTail, doneTail)
cmd, err := externalTaskRunnerCommand(q.opts.ExportDir, t.ID)
if err != nil {
j.appendFromLog("ERROR: " + err.Error())
q.finishExternalTask(t, j, err)
return
}
if err := cmd.Start(); err != nil {
j.appendFromLog("ERROR: " + err.Error())
q.finishExternalTask(t, j, err)
return
}
q.mu.Lock()
t.runnerPID = cmd.Process.Pid
q.attachExternalTaskControlsLocked(t, j)
q.persistLocked()
q.mu.Unlock()
waitErr := cmd.Wait()
// Give the log tailer one more tick to pick up the runner's final output.
time.Sleep(200 * time.Millisecond)
q.finishExternalTask(t, j, waitErr)
}
func (q *taskQueue) followTaskLog(t *Task, j *jobState, stop <-chan struct{}, done chan<- struct{}) {
defer close(done)
path := ""
if t != nil {
path = t.LogPath
}
if strings.TrimSpace(path) == "" {
return
}
offset := int64(0)
if info, err := os.Stat(path); err == nil {
offset = info.Size()
}
var partial string
ticker := time.NewTicker(250 * time.Millisecond)
defer ticker.Stop()
flush := func() {
data, newOffset, err := readTaskLogDelta(path, offset)
if err != nil || len(data) == 0 {
offset = newOffset
return
}
offset = newOffset
text := partial + strings.ReplaceAll(string(data), "\r\n", "\n")
lines := strings.Split(text, "\n")
partial = lines[len(lines)-1]
for _, line := range lines[:len(lines)-1] {
if line == "" {
continue
}
j.appendFromLog(line)
}
}
for {
select {
case <-ticker.C:
flush()
case <-stop:
flush()
if strings.TrimSpace(partial) != "" {
j.appendFromLog(partial)
}
return
}
}
}
func readTaskLogDelta(path string, offset int64) ([]byte, int64, error) {
f, err := os.Open(path)
if err != nil {
return nil, offset, err
}
defer f.Close()
info, err := f.Stat()
if err != nil {
return nil, offset, err
}
if info.Size() < offset { // file truncated or rotated: restart from the top
offset = 0
}
if _, err := f.Seek(offset, io.SeekStart); err != nil {
return nil, offset, err
}
data, err := io.ReadAll(io.LimitReader(f, 1<<20))
return data, offset + int64(len(data)), err
}
func (q *taskQueue) finishExternalTask(t *Task, j *jobState, waitErr error) {
q.mu.Lock()
defer q.mu.Unlock()
if t.Status == TaskDone || t.Status == TaskFailed || t.Status == TaskCancelled {
if j != nil && !j.isDone() {
j.finish(t.ErrMsg)
j.closeLog()
}
select {
case q.trigger <- struct{}{}:
default:
}
return
}
state, ok := readTaskRunnerState(t)
switch {
case ok && state.Status != TaskRunning:
t.Status = state.Status
t.ErrMsg = state.Error
now := state.UpdatedAt
if now.IsZero() {
now = time.Now()
}
t.DoneAt = &now
case waitErr != nil:
now := time.Now()
t.Status = TaskFailed
t.ErrMsg = waitErr.Error()
t.DoneAt = &now
default:
now := time.Now()
t.Status = TaskFailed
t.ErrMsg = "task runner exited without final state"
t.DoneAt = &now
}
t.runnerPID = 0
q.finalizeTaskArtifactPathsLocked(t)
q.persistLocked()
if j != nil && !j.isDone() {
j.finish(t.ErrMsg)
j.closeLog()
}
if t.ErrMsg != "" {
taskSerialEvent(t, "finished with status="+t.Status+" error="+t.ErrMsg)
} else {
taskSerialEvent(t, "finished with status="+t.Status)
}
select {
case q.trigger <- struct{}{}:
default:
}
}
func (q *taskQueue) executeTask(t *Task, j *jobState, ctx context.Context) {
startedKmsgWatch := false
defer q.finalizeTaskRun(t, j)
@@ -985,15 +1212,11 @@ func (h *handler) handleAPITasksCancel(w http.ResponseWriter, r *http.Request) {
taskSerialEvent(t, "finished with status="+t.Status)
writeJSON(w, map[string]string{"status": "cancelled"})
case TaskRunning:
if t.job != nil {
t.job.abort()
if t.job == nil || !t.job.abort() {
writeError(w, http.StatusConflict, "task is not cancellable")
return
}
t.Status = TaskCancelled
now := time.Now()
t.DoneAt = &now
globalQueue.persistLocked()
taskSerialEvent(t, "finished with status="+t.Status)
writeJSON(w, map[string]string{"status": "cancelled"})
writeJSON(w, map[string]string{"status": "aborting"})
default:
writeError(w, http.StatusConflict, "task is not running or pending")
}
@@ -1039,12 +1262,6 @@ func (h *handler) handleAPITasksCancelAll(w http.ResponseWriter, _ *http.Request
if t.job != nil {
t.job.abort()
}
if taskMayLeaveOrphanWorkers(t.Target) {
platform.KillTestWorkers()
}
t.Status = TaskCancelled
t.DoneAt = &now
taskSerialEvent(t, "finished with status="+t.Status)
n++
}
}
@@ -1175,18 +1392,29 @@ func (q *taskQueue) loadLocked() {
}
q.assignTaskLogPathLocked(t)
if t.Status == TaskRunning {
// The task was interrupted by a bee-web restart. Child processes
// (e.g. bee-gpu-burn-worker, dcgmi/nvvs) survive the restart in
// their own process groups. Kill any matching stale workers before
// marking the task failed so the next GPU test does not inherit a
// busy DCGM slot or duplicate workers.
if taskMayLeaveOrphanWorkers(t.Target) {
_ = platform.KillTestWorkers()
state, ok := readTaskRunnerState(t)
switch {
case ok && state.Status == TaskRunning && processAlive(state.PID):
t.runnerPID = state.PID
t.job = newTaskJobState(t.LogPath)
case ok && state.Status != TaskRunning:
t.runnerPID = state.PID
t.Status = state.Status
t.ErrMsg = state.Error
now := state.UpdatedAt
if now.IsZero() {
now = time.Now()
}
t.DoneAt = &now
default:
if taskMayLeaveOrphanWorkers(t.Target) {
_ = platform.KillTestWorkers()
}
now := time.Now()
t.Status = TaskFailed
t.DoneAt = &now
t.ErrMsg = "interrupted by bee-web restart"
}
now := time.Now()
t.Status = TaskFailed
t.DoneAt = &now
t.ErrMsg = "interrupted by bee-web restart"
} else if t.Status == TaskPending {
t.StartedAt = nil
t.DoneAt = nil

View File

@@ -126,6 +126,23 @@ func TestNewTaskJobStateLoadsExistingLog(t *testing.T) {
}
}
func TestJobAppendFlushesTaskLogImmediately(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "task.log")
j := newTaskJobState(path)
j.append("live-line")
data, err := os.ReadFile(path)
if err != nil {
t.Fatal(err)
}
if string(data) != "live-line\n" {
t.Fatalf("log=%q want live-line newline", string(data))
}
j.closeLog()
}
func TestTaskQueueSnapshotSortsNewestFirst(t *testing.T) {
now := time.Date(2026, 4, 2, 12, 0, 0, 0, time.UTC)
q := &taskQueue{
@@ -849,3 +866,82 @@ func TestExecuteTaskMarksPanicsAsFailedAndClosesKmsgWindow(t *testing.T) {
t.Fatalf("expected kmsg window to be cleared, got %+v", window)
}
}
func TestRunTaskExternalOpensAndClosesKmsgWindow(t *testing.T) {
dir := t.TempDir()
releasePath := filepath.Join(dir, "release")
readyPath := filepath.Join(dir, "ready")
q := &taskQueue{
opts: &HandlerOptions{ExportDir: dir},
logsDir: filepath.Join(dir, "tasks"),
kmsgWatcher: newKmsgWatcher(nil),
trigger: make(chan struct{}, 1),
}
if err := os.MkdirAll(q.logsDir, 0755); err != nil {
t.Fatal(err)
}
tk := &Task{
ID: "cpu-external-1",
Name: "CPU SAT",
Target: "cpu",
Status: TaskRunning,
CreatedAt: time.Now(),
}
q.assignTaskLogPathLocked(tk)
j := newTaskJobState(tk.LogPath)
orig := externalTaskRunnerCommand
externalTaskRunnerCommand = func(exportDir, taskID string) (*exec.Cmd, error) {
script := "printf ready > \"$1\"; while [ ! -f \"$2\" ]; do sleep 0.05; done"
return exec.Command("sh", "-c", script, "sh", readyPath, releasePath), nil
}
defer func() { externalTaskRunnerCommand = orig }()
done := make(chan struct{})
go func() {
q.runTaskExternal(tk, j)
close(done)
}()
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if _, err := os.Stat(readyPath); err == nil {
break
}
time.Sleep(20 * time.Millisecond)
}
if _, err := os.Stat(readyPath); err != nil {
t.Fatalf("external runner did not start: %v", err)
}
q.kmsgWatcher.mu.Lock()
activeCount := q.kmsgWatcher.activeCount
window := q.kmsgWatcher.window
q.kmsgWatcher.mu.Unlock()
if activeCount != 1 {
t.Fatalf("activeCount while running=%d want 1", activeCount)
}
if window == nil || len(window.targets) != 1 || window.targets[0] != "cpu" {
t.Fatalf("window while running=%+v", window)
}
if err := os.WriteFile(releasePath, []byte("1\n"), 0644); err != nil {
t.Fatal(err)
}
select {
case <-done:
case <-time.After(2 * time.Second):
t.Fatal("runTaskExternal did not return")
}
q.kmsgWatcher.mu.Lock()
activeCount = q.kmsgWatcher.activeCount
window = q.kmsgWatcher.window
q.kmsgWatcher.mu.Unlock()
if activeCount != 0 {
t.Fatalf("activeCount after finish=%d want 0", activeCount)
}
if window != nil {
t.Fatalf("expected kmsg window to be cleared, got %+v", window)
}
}


@@ -0,0 +1,62 @@
package webui
import (
"encoding/json"
"strconv"
)
func enrichSnapshotForViewer(snapshot []byte) []byte {
if len(snapshot) == 0 {
return snapshot
}
var root map[string]any
if err := json.Unmarshal(snapshot, &root); err != nil {
return snapshot
}
hardware, _ := root["hardware"].(map[string]any)
if len(hardware) == 0 {
return snapshot
}
storage, _ := hardware["storage"].([]any)
if len(storage) == 0 {
return snapshot
}
changed := false
for _, item := range storage {
row, _ := item.(map[string]any)
if len(row) == 0 {
continue
}
if _, exists := row["block_format"]; exists {
continue
}
logical, okLogical := jsonNumberToInt64(row["logical_block_size_bytes"])
metadata, okMetadata := jsonNumberToInt64(row["metadata_bytes_per_block"])
if !okLogical || !okMetadata || logical <= 0 || metadata < 0 {
continue
}
row["block_format"] = strconv.FormatInt(logical, 10) + "+" + strconv.FormatInt(metadata, 10)
changed = true
}
if !changed {
return snapshot
}
out, err := json.Marshal(root)
if err != nil {
return snapshot
}
return out
}
func jsonNumberToInt64(v any) (int64, bool) {
switch x := v.(type) {
case float64:
return int64(x), true
case int64:
return x, true
case int:
return int64(x), true
default:
return 0, false
}
}


@@ -9,5 +9,62 @@ Generic engineering rules live in `bible/rules/patterns/`.
|---|---|
| `architecture/system-overview.md` | What bee does, scope, tech stack |
| `architecture/runtime-flows.md` | Boot sequence, audit flow, service order |
| `docs/customer-gpu-test-methodology.md` | Customer-facing GPU PCIe Validate / Validate -> Stress test list |
| `docs/hardware-ingest-contract.md` | Current Reanimator hardware ingest JSON contract |
| `decisions/` | Architectural decision log |
| `docs/validate-vs-burn.md` | Validate and Validate -> Stress hardware test policy |
| `decisions/` | Architectural decision log, including read-only submodule policy |
## Validate Test Matrix
### Validate
- CPU check
- `lscpu`
- `sensors`
- `stress-ng`
- Memory check
- `free`
- `timeout <timeout_sec> memtester`
- `free`
- NVMe storage check
- `nvme id-ctrl`
- `nvme smart-log`
- `nvme device-self-test`
- SATA/SAS storage check
- `smartctl -H -A`
- `smartctl -t short`
- Basic NVIDIA GPU check
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dmidecode -t baseboard`
- `dmidecode -t system`
- `dcgmi diag -r 2`
- Inter-GPU communication check
- `all_reduce_perf`
- GPU bandwidth check
- `dcgmi diag -r nvbandwidth`
### Validate -> Stress
- Extended NVIDIA GPU check
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dmidecode -t baseboard`
- `dmidecode -t system`
- `dcgmi diag -r 3`
- NVIDIA targeted stress
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r targeted_stress`
- NVIDIA targeted power
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r targeted_power`
- NVIDIA pulse test
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r pulse_test`
- Inter-GPU communication check
- `all_reduce_perf`
- GPU bandwidth check
- `dcgmi diag -r nvbandwidth`


@@ -149,7 +149,6 @@ Current validation state:
6. psu collector (ipmitool fru + sdr — silent if no /dev/ipmi0)
7. nvidia enrichment (nvidia-smi — skipped if binary absent or driver not loaded)
8. output JSON → /var/log/bee-audit.json
9. QR summary to stdout (qrencode if available)
```
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.


@@ -58,6 +58,8 @@ Fills gaps where Redfish/logpile is blind:
- `bee` should populate current component state, hardware inventory, telemetry, and `status_checked_at`.
- Historical status transitions and component replacement logic belong to the centralized ingest/lifecycle system, not to `bee`.
- Contract fields that have no honest local source on a generic Linux host may remain empty.
- Embedded submodules such as `internal/chart/` and `bible/` are read-only for `bee` feature work.
- If the UI needs extra information, `bee` must emit it through the standard audit JSON contract rather than patching `chart`.
## Tech stack
@@ -101,7 +103,7 @@ Fills gaps where Redfish/logpile is blind:
| `iso/builder/` | ISO build scripts and `live-build` profile |
| `iso/overlay/` | Source overlay copied into a staged build overlay |
| `iso/vendor/` | Optional pre-built vendor binaries (storcli64, sas2ircu, sas3ircu, arcconf, ssacli, …) |
| `internal/chart/` | Git submodule with `reanimator/chart`, embedded into `bee web` |
| `internal/chart/` | Git submodule with `reanimator/chart`, embedded into `bee web`; update by submodule pointer only, never by local `bee`-specific edits |
| `iso/builder/VERSIONS` | Pinned versions: Debian, Go, NVIDIA driver, kernel ABI |
| `iso/builder/smoketest.sh` | Post-boot smoke test — run via SSH to verify live ISO |
| `iso/overlay/etc/profile.d/bee.sh` | tty1 welcome message with web UI URLs |


@@ -0,0 +1,39 @@
# Decision: Treat embedded submodules as read-only
## Context
`bee` embeds external git submodules such as:
- `internal/chart/` — `reanimator/chart`, a generic read-only viewer for Reanimator JSON snapshots
- `bible/` — shared engineering rules and contracts
These repositories are reused by other projects. A local feature request in `bee`
must not be solved by silently changing shared submodule behavior.
The concrete failure mode here was attempting to add project-specific storage
telemetry presentation by editing `internal/chart/`. That couples a shared viewer
to one host application's needs and creates hidden cross-project regressions.
## Decision
Embedded submodules are read-only from the point of view of `bee`.
- Do not implement `bee`-specific behavior by editing `internal/chart/`.
- Do not implement `bee`-specific behavior by editing `bible/`.
- If `bee` needs new data in the report, produce it in the standard audit JSON
emitted by `bee` itself.
- `chart` must continue to consume the canonical snapshot as an external viewer,
without host-specific forks.
- Updating a submodule pointer to an upstream commit is allowed.
- Carrying local unmerged submodule commits as part of a `bee` feature is forbidden.
## Consequences
- Audit/report features must be expressed through the contract in
`bible-local/docs/hardware-ingest-contract.md`.
- `bee` owns collection, normalization, and serialization of storage telemetry in
`hardware.storage[]`.
- `chart` remains a pure visualization module that reads the snapshot it is given.
- If a capability is genuinely missing in a shared submodule, it must be proposed
and landed upstream as a generic change first, then pulled into `bee` via a
normal submodule update.


@@ -6,3 +6,4 @@ One file per decision, named `YYYY-MM-DD-short-topic.md`.
|---|---|---|
| 2026-03-05 | Use NVIDIA proprietary driver | active |
| 2026-04-01 | Treat memtest as explicit ISO content | active |
| 2026-04-29 | Treat embedded submodules as read-only | active |


@@ -0,0 +1,54 @@
# GPU PCIe Test Methodology
## Validate
- CPU check
- `lscpu`
- `sensors`
- `stress-ng`
- Memory check
- `free`
- `timeout <timeout_sec> memtester`
- `free`
- NVMe storage check
- `nvme id-ctrl`
- `nvme smart-log`
- `nvme device-self-test`
- SATA/SAS storage check
- `smartctl -H -A`
- `smartctl -t short`
- Basic NVIDIA GPU check
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dmidecode -t baseboard`
- `dmidecode -t system`
- `dcgmi diag -r 2`
- Inter-GPU communication check
- `all_reduce_perf`
- GPU bandwidth check
- `dcgmi diag -r nvbandwidth`
## Validate -> Stress
- Extended NVIDIA GPU check
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dmidecode -t baseboard`
- `dmidecode -t system`
- `dcgmi diag -r 3`
- NVIDIA targeted stress
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r targeted_stress`
- NVIDIA targeted power
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r targeted_power`
- NVIDIA pulse test
- `nvidia-smi -pm 1`
- `nvidia-smi -q`
- `dcgmi diag -r pulse_test`
- Inter-GPU communication check
- `all_reduce_perf`
- GPU bandwidth check
- `dcgmi diag -r nvbandwidth`


@@ -1,7 +1,7 @@
---
title: Hardware Ingest JSON Contract
version: "2.7"
updated: "2026-03-15"
version: "2.10"
updated: "2026-04-29"
maintainer: Reanimator Core
audience: external-integrators, ai-agents
language: ru
@@ -9,7 +9,7 @@ language: ru
# Интеграция с Reanimator: контракт JSON-импорта аппаратного обеспечения
Версия: **2.7** · Дата: **2026-03-15**
Версия: **2.10** · Дата: **2026-04-29**
Документ описывает формат JSON для передачи данных об аппаратном обеспечении серверов в систему **Reanimator** (управление жизненным циклом аппаратного обеспечения).
Предназначен для разработчиков смежных систем (Redfish-коллекторов, агентов мониторинга, CMDB-экспортёров) и может быть включён в документацию интегрируемых проектов.
@@ -22,6 +22,9 @@ language: ru
| Версия | Дата | Изменения |
|--------|------|-----------|
| 2.10 | 2026-04-29 | Для `hardware.storage[]` добавлены необязательные числовые поля `logical_block_size_bytes`, `physical_block_size_bytes`, `metadata_bytes_per_block` для нормализованного описания формата блока накопителя |
| 2.9 | 2026-03-19 | Добавлена необязательная секция `hardware.platform_config` — произвольный объект с настройками платформы (BIOS/Redfish); хранится как latest-snapshot per machine |
| 2.8 | 2026-03-15 | Поле `location` удалено из всех `sensors.*`; сенсоры передаются только по `name` и измеренным значениям |
| 2.7 | 2026-03-15 | Явно запрещён синтез данных в `event_logs`; интеграторы не должны придумывать серийные номера компонентов, если источник их не отдал |
| 2.6 | 2026-03-15 | Добавлена необязательная секция `event_logs` для dedup/upsert логов `host` / `bmc` / `redfish` вне history timeline |
| 2.5 | 2026-03-15 | Добавлено общее необязательное поле `manufactured_year_week` для компонентных секций (`YYYY-Www`) |
@@ -131,8 +134,9 @@ GET /ingest/hardware/jobs/{job_id}
"storage": [ ... ],
"pcie_devices": [ ... ],
"power_supplies": [ ... ],
"sensors": { ... },
"event_logs": [ ... ]
"sensors": { ... },
"event_logs": [ ... ],
"platform_config": { ... }
}
}
```
@@ -343,6 +347,9 @@ GET /ingest/hardware/jobs/{job_id}
| `type` | string | нет | Тип: `NVMe`, `SSD`, `HDD` |
| `interface` | string | нет | Интерфейс: `NVMe`, `SATA`, `SAS` |
| `size_gb` | int | нет | Размер в ГБ |
| `logical_block_size_bytes` | int64 | нет | Логический размер пользовательского блока данных, например `512` или `4096` |
| `physical_block_size_bytes` | int64 | нет | Физический размер блока, если известен, например `4096` |
| `metadata_bytes_per_block` | int64 | нет | Metadata / protection bytes на логический блок, например `0` или `8` |
| `temperature_c` | float | нет | Температура накопителя, °C (telemetry) |
| `power_on_hours` | int64 | нет | Время работы, часы |
| `power_cycles` | int64 | нет | Количество циклов питания |
@@ -363,6 +370,11 @@ GET /ingest/hardware/jobs/{job_id}
Диск без `serial_number` игнорируется. Изменение `firmware` создаёт событие `FIRMWARE_CHANGED`.
Формат вида `512+8` в контракт не добавляется отдельным строковым полем. Если источник знает такую форму, он должен передавать её как:
- `logical_block_size_bytes = 512`
- `metadata_bytes_per_block = 8`
- `physical_block_size_bytes = 512` или `4096`, если известен физический размер блока
```json
"storage": [
{
@@ -370,6 +382,9 @@ GET /ingest/hardware/jobs/{job_id}
"type": "NVMe",
"model": "INTEL SSDPF2KX076T1",
"size_gb": 7680,
"logical_block_size_bytes": 512,
"physical_block_size_bytes": 4096,
"metadata_bytes_per_block": 8,
"temperature_c": 38.5,
"power_on_hours": 12450,
"unsafe_shutdowns": 3,
@@ -592,7 +607,6 @@ PSU без `serial_number` игнорируется.
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `name` | string | **да** | Уникальное имя сенсора в рамках секции |
| `location` | string | нет | Физическое расположение |
| `rpm` | int | нет | Обороты, RPM |
| `status` | string | нет | Статус: `OK`, `Warning`, `Critical`, `Unknown` |
@@ -601,7 +615,6 @@ PSU без `serial_number` игнорируется.
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `name` | string | **да** | Уникальное имя сенсора |
| `location` | string | нет | Физическое расположение |
| `voltage_v` | float | нет | Напряжение, В |
| `current_a` | float | нет | Ток, А |
| `power_w` | float | нет | Мощность, Вт |
@@ -612,7 +625,6 @@ PSU без `serial_number` игнорируется.
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `name` | string | **да** | Уникальное имя сенсора |
| `location` | string | нет | Физическое расположение |
| `celsius` | float | нет | Температура, °C |
| `threshold_warning_celsius` | float | нет | Порог Warning, °C |
| `threshold_critical_celsius` | float | нет | Порог Critical, °C |
@@ -623,29 +635,29 @@ PSU без `serial_number` игнорируется.
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `name` | string | **да** | Уникальное имя сенсора |
| `location` | string | нет | Физическое расположение |
| `value` | float | нет | Значение |
| `unit` | string | нет | Единица измерения |
| `status` | string | нет | Статус |
**Правила sensors:**
- Идентификатор сенсора: пара `(sensor_type, name)`. Дубли в одном payload — берётся первое вхождение.
- `location` для сенсоров передавать не нужно и не следует: в Reanimator location/slot используется только для проверки перемещения и установки компонентов, а не для last-known-value sensor ingest.
- Сенсоры без `name` игнорируются.
- При каждом импорте значения перезаписываются (upsert по ключу).
```json
"sensors": {
"fans": [
{ "name": "FAN1", "location": "Front", "rpm": 4200, "status": "OK" },
{ "name": "FAN_CPU0", "location": "CPU0", "rpm": 5600, "status": "OK" }
{ "name": "FAN1", "rpm": 4200, "status": "OK" },
{ "name": "FAN_CPU0", "rpm": 5600, "status": "OK" }
],
"power": [
{ "name": "12V Rail", "location": "Mainboard", "voltage_v": 12.06, "status": "OK" },
{ "name": "PSU0 Input", "location": "PSU0", "voltage_v": 215.25, "current_a": 0.64, "power_w": 137.0, "status": "OK" }
{ "name": "12V Rail", "voltage_v": 12.06, "status": "OK" },
{ "name": "PSU0 Input", "voltage_v": 215.25, "current_a": 0.64, "power_w": 137.0, "status": "OK" }
],
"temperatures": [
{ "name": "CPU0 Temp", "location": "CPU0", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" },
{ "name": "Inlet Temp", "location": "Front", "celsius": 22.0, "threshold_warning_celsius": 40.0, "threshold_critical_celsius": 50.0, "status": "OK" }
{ "name": "CPU0 Temp", "celsius": 46.0, "threshold_warning_celsius": 80.0, "threshold_critical_celsius": 95.0, "status": "OK" },
{ "name": "Inlet Temp", "celsius": 22.0, "threshold_warning_celsius": 40.0, "threshold_critical_celsius": 50.0, "status": "OK" }
],
"other": [
{ "name": "System Humidity", "value": 38.5, "unit": "%", "status": "OK" }
@@ -655,6 +667,31 @@ PSU без `serial_number` игнорируется.
---
## Секция platform_config
Необязательный объект с произвольными ключами — настройки платформы как есть из источника (BIOS, Redfish, IPMI).
| Поле | Тип | Обязательно | Описание |
|------|-----|-------------|----------|
| `platform_config` | object | нет | Произвольный объект: ключи — строки, значения — строки, числа, булевы |
**Правила platform_config:**
- Содержимое объекта не валидируется: передавайте параметры как есть.
- При каждом импорте хранится latest-snapshot per machine; история изменений по каждому ключу накапливается отдельно.
- Если секция отсутствует или равна `null` — данные платформы не обновляются.
```json
"platform_config": {
"SecureBoot": "Enabled",
"BiosVersion": "06.08.05",
"TpmEnabled": true,
"NumaEnabled": false,
"HyperThreading": "Enabled"
}
```
---
## Обработка статусов компонентов
| Статус | Поведение |
@@ -787,6 +824,12 @@ PSU без `serial_number` игнорируется.
"other": [
{ "name": "System Humidity", "value": 38.5, "unit": "%" }
]
},
"platform_config": {
"SecureBoot": "Enabled",
"BiosVersion": "06.08.05",
"TpmEnabled": true,
"HyperThreading": "Enabled"
}
}
}


@@ -0,0 +1,31 @@
# Contract: ASCII-Safe Text in Scripts and Boot Configs
Version: 1.0
## Principle
Shell scripts, bootloader configs, and any text rendered on serial/SOL consoles must use only printable ASCII characters. Non-ASCII Unicode, including typographic punctuation such as the em-dash (U+2014 `—`), en-dash (U+2013 `–`), curly quotes, and ellipsis (U+2026 `…`), breaks rendering on serial terminals, GRUB text/serial mode, IPMI SOL, and tooling that assumes ASCII.
## Rules
- Never use em-dash (`—`) or en-dash (`–`) in any shell script, GRUB config, syslinux/isolinux config, or service unit file. Use ASCII double-hyphen `--` or single hyphen `-` instead.
- Never use curly quotes (`"` `"` `'` `'`) in shell scripts or configs. Use straight quotes `"` and `'`.
- Never use the Unicode ellipsis (`…`). Use `...`.
- GRUB `menuentry` and `submenu` titles must be ASCII-only: GRUB serial terminal output is ASCII; non-ASCII characters render as garbage or are dropped.
- Comments in GRUB theme files (`.txt`) must also be ASCII-only, as GRUB may parse the entire file.
## Why
GRUB renders menus over both `gfxterm` (graphical, Unicode-capable) and `serial` (ASCII-only) simultaneously when `terminal_output gfxterm serial` is set. The serial output, used by IPMI SOL and BMC remote consoles, cannot display multi-byte UTF-8 sequences and shows raw bytes or drops characters. A menuentry title `"EASY-BEE — GSP=off"` appears as `"EASY-BEE â€" GSP=off"` or `"EASY-BEE GSP=off"` on SOL, making the menu unreadable.
## Anti-patterns
- `menuentry "EASY-BEE — GSP=off"`: em-dash in a GRUB title
- `# bee logo — centered`: em-dash in a GRUB theme comment
- `echo "done — reboot"` in a shell script displayed over serial
## Correct form
- `menuentry "EASY-BEE -- GSP=off"`
- `# bee logo - centered`
- `echo "done - reboot"`


@@ -0,0 +1,134 @@
# GRUB bitmap error: null src bitmap in grub_video_bitmap_create_scaled
## Symptom
```
error: null src bitmap in grub_video_bitmap_create_scaled.
Press any key to continue...
```
Appears on boot before the GRUB menu renders. The menu still appears after pressing a key,
but without the bee logo. Reproduced on real hardware (Lenovo SR650 V3, ASUS GPU servers).
## Root cause model
`grub_video_bitmap_create_scaled` receives a null `src` pointer, meaning the PNG loader
returned null for `bee-logo.png`. GRUB calls this function even when no explicit
`width`/`height` are set in `theme.txt` — it is invoked any time an image component is
rendered, passing the image's natural dimensions as the target size.
The PNG file is referenced as `file = "bee-logo.png"` (relative to theme dir).
GRUB resolves this to `/boot/grub/live-theme/bee-logo.png`.
## Attempts that did NOT fix the error
### Attempt 1 — add explicit `width`/`height` to image block (d52ec67)
**What was done:** First introduction of bee-logo.png with:
```
+ image {
top = 4%
left = 50%-200
width = 400
height = 400
file = "bee-logo.png"
}
```
PNG at this point was RGBA (color_type=6).
**Result:** Error appeared immediately on first ISO build.
---
### Attempt 2 — remove `width`/`height` from image block (aa284ae)
**Hypothesis:** Explicit scaling dimensions trigger the scale path; removing them avoids it.
**What was done:** Removed `width = 400` and `height = 400` from the image block.
```
+ image {
top = 4%
left = 50%-200
file = "bee-logo.png"
}
```
**Result:** Error persists. GRUB calls `grub_video_bitmap_create_scaled` regardless of whether
`width`/`height` are specified — if the bitmap is null (loading failed), the error fires either way.
---
### Attempt 3 — convert PNG to RGBA + strip metadata chunks (6112094)
**Hypothesis:** GRUB's minimal PNG parser is confused by metadata chunks (cHRM, bKGD, tIME, tEXt).
Also re-ordered `terminal_output gfxterm` before `insmod png` / theme load.
**What was done:**
- Converted PNG to RGBA color_type=6, stripped all ancillary chunks
- Moved `terminal_output gfxterm` earlier in config.cfg
- Removed echo ASCII art banner from grub.cfg
**Result:** Error persists — and this change actually confirmed RGBA does not work:
GRUB's PNG loader does not render RGBA PNGs correctly on this platform.
---
### Attempt 4 — convert PNG from RGBA back to RGB (333c44f, most recent)
**Hypothesis:** GRUB does not support RGBA (color_type=6); RGB (color_type=2) is the correct format.
Alpha channel composited onto black background (#000000) to match `desktop-color`.
**What was done:** Converted bee-logo.png from RGBA to RGB via ImageMagick.
**Current file state:**
- 400×400 px, 8-bit/color RGB, non-interlaced
- Only IHDR + IDAT + IEND chunks (no metadata)
- `insmod png` is present in config.cfg
- `terminal_output gfxterm` runs before theme is sourced
- No explicit `width`/`height` in image block
**Result:** Error still occurs on real hardware. Despite the PNG being nominally correct
(RGB, non-interlaced, minimal chunks), the bitmap load returns null.
## Confirmed root cause (verified on 172.16.41.94, 2026-04-30)
The EFI partition (`sda2`, vfat, 5 MB) contains only:
```
/EFI/boot/bootia32.efi
/EFI/boot/bootx64.efi
/EFI/boot/grubx64.efi
/boot/grub/grub.cfg
```
`config.cfg`, `theme.cfg`, and the entire `live-theme/` directory (including `bee-logo.png`)
are **absent from the EFI image**. `live-build`'s `lb binary_grub-efi` stage is not
copying these files. GRUB boots, sources only `grub.cfg`, then fails to load the theme
because the file does not exist — returning a null bitmap regardless of PNG format.
All four fix attempts were targeting the wrong layer (PNG format/content).
## Fix (applied 2026-04-30)
Switched from PNG to TGA format:
1. Converted `bee-logo.png` → `bee-logo.tga` (24-bit uncompressed BGR, top-left origin,
480018 bytes). Conversion done via Python stdlib (no external tools needed).
2. `config.cfg`: `insmod png` → `insmod tga`
3. `theme.txt`: `file = "bee-logo.png"` → `file = "bee-logo.tga"`
**Why TGA works:** GRUB's TGA reader (`tga.mod`) handles uncompressed 24-bit images
trivially — no decompression, no complex chunk parsing. The module is present on-disk
(`x86_64-efi/tga.mod`). PNG was failing despite a valid file; the exact GRUB bug is
unknown but the PNG reader in Debian bookworm's grub2 is known to be fragile.
The old `bee-logo.png` is kept in the tree (may be useful for other tools) but is no
longer referenced by the theme.
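For reference, the 18-byte uncompressed-TGA header the fix depends on is simple enough to emit from the Python stdlib; a minimal sketch (solid-color test image, not the actual logo conversion):

```python
import struct

def write_tga_bgr24(path, width, height, bgr_pixels):
    # 18-byte TGA header: image type 2 = uncompressed true-color,
    # descriptor 0x20 (bit 5) = top-left origin, 24 bits per pixel.
    header = struct.pack(
        "<BBBHHBHHHHBB",
        0,          # id field length
        0,          # no color map
        2,          # uncompressed true-color
        0, 0, 0,    # color map spec (unused)
        0, 0,       # x/y origin
        width, height,
        24,         # bits per pixel, stored as B, G, R
        0x20,       # top-left origin
    )
    with open(path, "wb") as f:
        f.write(header)
        f.write(bytes(bgr_pixels))

# 2x2 solid red image: TGA pixel order is B, G, R.
write_tga_bgr24("/tmp/red.tga", 2, 2, [0, 0, 255] * 4)
```

At 400×400 this layout gives 18 + 400*400*3 = 480018 bytes, matching the converted logo size quoted above.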
## Relevant files
| File | Purpose |
|------|---------|
| `iso/builder/config/bootloaders/grub-efi/config.cfg` | insmod png, gfxterm init, theme source |
| `iso/builder/config/bootloaders/grub-efi/theme.cfg` | sets `theme=` path |
| `iso/builder/config/bootloaders/grub-efi/live-theme/theme.txt` | image component definition |
| `iso/builder/config/bootloaders/grub-efi/live-theme/bee-logo.png` | the logo PNG |


@@ -31,10 +31,10 @@ Build with explicit SSH keys baked into the ISO:
sh iso/builder/build-in-container.sh --authorized-keys ~/.ssh/id_ed25519.pub
```
Rebuild the builder image:
Force a clean rebuild of the builder image and build caches:
```sh
sh iso/builder/build-in-container.sh --rebuild-image
sh iso/builder/build-in-container.sh --clean-build
```
Use a custom cache directory:


@@ -16,6 +16,12 @@ else
LB_LINUX_PACKAGES="linux-image"
fi
if [ -n "${BEE_ISO_VOLUME:-}" ]; then
LB_ISO_VOLUME="${BEE_ISO_VOLUME}"
else
LB_ISO_VOLUME="EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}"
fi
lb config noauto \
--distribution bookworm \
--architectures amd64 \
@@ -30,9 +36,9 @@ lb config noauto \
--linux-flavours "amd64" \
--linux-packages "${LB_LINUX_PACKAGES}" \
--memtest memtest86+ \
--iso-volume "EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--iso-volume "${LB_ISO_VOLUME}" \
--iso-application "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--bootappend-live "boot=live components video=1920x1080 console=ttyS0,115200n8 console=tty0 loglevel=3 systemd.show_status=1 username=bee user-fullname=Bee modprobe.blacklist=nouveau,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_generic,soundcore" \
--bootappend-live "boot=live live-media-label=${LB_ISO_VOLUME} components video=1920x1080 console=ttyS0,115200n8 console=tty0 loglevel=3 systemd.show_status=1 username=bee user-fullname=Bee modprobe.blacklist=nouveau,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_generic,soundcore" \
--debootstrap-options "--include=ca-certificates" \
--apt-recommends false \
--chroot-squashfs-compression-type zstd \


@@ -10,7 +10,6 @@ IMAGE_TAG="${BEE_BUILDER_IMAGE:-bee-iso-builder}"
BUILDER_PLATFORM="${BEE_BUILDER_PLATFORM:-linux/amd64}"
CACHE_DIR="${BEE_BUILDER_CACHE_DIR:-${REPO_ROOT}/dist/container-cache}"
AUTH_KEYS=""
REBUILD_IMAGE=0
CLEAN_CACHE=0
VARIANT="all"
@@ -22,17 +21,12 @@ while [ $# -gt 0 ]; do
CACHE_DIR="$2"
shift 2
;;
--rebuild-image)
REBUILD_IMAGE=1
shift
;;
--authorized-keys)
AUTH_KEYS="$2"
shift 2
;;
--clean-build)
CLEAN_CACHE=1
REBUILD_IMAGE=1
shift
;;
--variant)
@@ -41,7 +35,7 @@ while [ $# -gt 0 ]; do
;;
*)
echo "unknown arg: $1" >&2
echo "usage: $0 [--cache-dir /path] [--rebuild-image] [--clean-build] [--authorized-keys /path/to/authorized_keys] [--variant nvidia|nvidia-legacy|amd|nogpu|all]" >&2
echo "usage: $0 [--cache-dir /path] [--clean-build] [--authorized-keys /path/to/authorized_keys] [--variant nvidia|nvidia-legacy|amd|nogpu|all]" >&2
exit 1
;;
esac
@@ -105,7 +99,7 @@ image_matches_platform() {
}
NEED_BUILD_IMAGE=0
if [ "$REBUILD_IMAGE" = "1" ]; then
if [ "$CLEAN_CACHE" = "1" ]; then
NEED_BUILD_IMAGE=1
elif ! "$CONTAINER_TOOL" image inspect "${IMAGE_REF}" >/dev/null 2>&1; then
NEED_BUILD_IMAGE=1


@@ -69,12 +69,27 @@ mkdir -p "${CACHE_ROOT}"
: "${GOMODCACHE:=${CACHE_ROOT}/go-mod}"
export GOCACHE GOMODCACHE
resolve_audit_version() {
resolve_project_version() {
if [ -n "${BEE_VERSION:-}" ]; then
echo "${BEE_VERSION}"
return 0
fi
if [ -n "${BEE_AUDIT_VERSION:-}" ] && [ -n "${BEE_ISO_VERSION:-}" ] && [ "${BEE_AUDIT_VERSION}" != "${BEE_ISO_VERSION}" ]; then
echo "ERROR: BEE_AUDIT_VERSION (${BEE_AUDIT_VERSION}) and BEE_ISO_VERSION (${BEE_ISO_VERSION}) differ; versioning must stay synchronized" >&2
exit 1
fi
if [ -n "${BEE_AUDIT_VERSION:-}" ]; then
echo "${BEE_AUDIT_VERSION}"
return 0
fi
if [ -n "${BEE_ISO_VERSION:-}" ]; then
echo "${BEE_ISO_VERSION}"
return 0
fi
tag="$(git -C "${REPO_ROOT}" describe --tags --match 'v[0-9]*' --abbrev=7 --dirty 2>/dev/null || true)"
case "${tag}" in
v*)
@@ -97,35 +112,6 @@ resolve_audit_version() {
date +%Y%m%d
}
# ISO image versioned separately from the audit binary (iso/v* tags).
resolve_iso_version() {
if [ -n "${BEE_ISO_VERSION:-}" ]; then
echo "${BEE_ISO_VERSION}"
return 0
fi
# Plain v* tags (e.g. v2.7) take priority — this is the current tagging scheme
tag="$(git -C "${REPO_ROOT}" describe --tags --match 'v[0-9]*' --abbrev=7 --dirty 2>/dev/null || true)"
case "${tag}" in
v*)
echo "${tag#v}"
return 0
;;
esac
# Legacy iso/v* tags fallback
tag="$(git -C "${REPO_ROOT}" describe --tags --match 'iso/v*' --abbrev=7 --dirty 2>/dev/null || true)"
case "${tag}" in
iso/v*)
echo "${tag#iso/v}"
return 0
;;
esac
# Fall back to audit version so the name is still meaningful
resolve_audit_version
}
sync_builder_workdir() {
src_dir="$1"
dst_dir="$2"
@@ -530,12 +516,12 @@ validate_iso_live_boot_entries() {
exit 1
fi
grep -q 'menuentry "EASY-BEE"' "$grub_cfg" || {
grep -q 'menuentry "EASY-BEE v' "$grub_cfg" || {
echo "ERROR: GRUB default EASY-BEE entry is missing" >&2
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
grep -q 'menuentry "EASY-BEE load to RAM (toram)"' "$grub_cfg" || {
grep -q 'menuentry "EASY-BEE v.* -- load to RAM (toram)"' "$grub_cfg" || {
echo "ERROR: GRUB toram entry is missing" >&2
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
@@ -550,6 +536,11 @@ validate_iso_live_boot_entries() {
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
grep -q 'linux .*live-media-label=EASY_BEE_' "$grub_cfg" || {
echo "ERROR: GRUB live entry is missing live-media-label pinning" >&2
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
grep -q 'append .*boot=live ' "$isolinux_cfg" || {
echo "ERROR: isolinux live entry is missing boot=live" >&2
@@ -561,11 +552,50 @@ validate_iso_live_boot_entries() {
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
grep -q 'append .*live-media-label=EASY_BEE_' "$isolinux_cfg" || {
echo "ERROR: isolinux live entry is missing live-media-label pinning" >&2
rm -f "$grub_cfg" "$isolinux_cfg"
exit 1
}
rm -f "$grub_cfg" "$isolinux_cfg"
echo "=== live boot validation OK ==="
}
validate_iso_grub_assets() {
iso_path="$1"
echo "=== validating GRUB assets in ISO ==="
[ -f "$iso_path" ] || {
echo "ERROR: ISO not found for GRUB asset validation: $iso_path" >&2
exit 1
}
require_iso_reader "$iso_path" >/dev/null 2>&1 || {
echo "ERROR: ISO reader unavailable for GRUB asset validation" >&2
exit 1
}
iso_files="$(mktemp)"
iso_list_files "$iso_path" > "$iso_files" || {
echo "ERROR: failed to list ISO files for GRUB asset validation" >&2
rm -f "$iso_files"
exit 1
}
for required in \
boot/grub/config.cfg \
boot/grub/grub.cfg; do
grep -q "^${required}$" "$iso_files" || {
echo "ERROR: missing GRUB asset in ISO: ${required}" >&2
rm -f "$iso_files"
exit 1
}
done
rm -f "$iso_files"
echo "=== GRUB asset validation OK ==="
}
validate_iso_nvidia_runtime() {
iso_path="$1"
[ "$BEE_GPU_VENDOR" = "nvidia" ] || return 0
@@ -578,29 +608,37 @@ validate_iso_nvidia_runtime() {
squashfs_tmp="$(mktemp)"
squashfs_list="$(mktemp)"
iso_read_member "$iso_path" live/filesystem.squashfs "$squashfs_tmp" || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "failed to extract live/filesystem.squashfs from ISO"
}
unsquashfs -ll "$squashfs_tmp" > "$squashfs_list" 2>/dev/null || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "failed to inspect filesystem.squashfs from ISO"
iso_files="$(mktemp)"
iso_list_files "$iso_path" > "$iso_files" || {
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "failed to list ISO files for NVIDIA runtime validation"
}
# Iterate without a pipeline: in `grep | while`, the loop body runs in a
# subshell, so nvidia_runtime_fail's exit would not abort the script.
# ISO member paths contain no whitespace, so word splitting is safe here.
for squashfs_member in $(grep '^live/.*\.squashfs$' "$iso_files"); do
iso_read_member "$iso_path" "$squashfs_member" "$squashfs_tmp" || {
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "failed to extract $squashfs_member from ISO"
}
unsquashfs -ll "$squashfs_tmp" >> "$squashfs_list" 2>/dev/null || {
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "failed to inspect $squashfs_member from ISO"
}
: > "$squashfs_tmp"
done
grep -Eq 'usr/bin/dcgmi$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "dcgmi missing from final NVIDIA ISO"
}
grep -Eq 'usr/bin/nv-hostengine$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "nv-hostengine missing from final NVIDIA ISO"
}
grep -Eq 'usr/bin/dcgmproftester([0-9]+)?$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
nvidia_runtime_fail "dcgmproftester missing from final NVIDIA ISO"
}
rm -f "$squashfs_tmp" "$squashfs_list"
rm -f "$squashfs_tmp" "$squashfs_list" "$iso_files"
echo "=== NVIDIA runtime validation OK ==="
}
@@ -694,50 +732,24 @@ write_canonical_grub_cfg() {
kernel="$2"
append_live="$3"
initrd="$4"
version_label="${PROJECT_VERSION_EFFECTIVE}"
cat > "$cfg" <<EOF
source /boot/grub/config.cfg
echo ""
echo " ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗"
echo " ██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝"
echo " █████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗"
echo " ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝"
echo " ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗"
echo " ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝"
echo " Hardware Audit LiveCD"
echo ""
menuentry "EASY-BEE" {
linux ${kernel} ${append_live} nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
menuentry "EASY-BEE v${version_label}" {
linux ${kernel} ${append_live} nomodeset bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
menuentry "EASY-BEE load to RAM (toram)" {
linux ${kernel} ${append_live} toram nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
menuentry "EASY-BEE v${version_label} -- load to RAM (toram)" {
linux ${kernel} ${append_live} toram nomodeset bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
submenu "EASY-BEE (advanced options) -->" {
menuentry "EASY-BEE — GSP=off" {
linux ${kernel} ${append_live} nomodeset bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
menuentry "EASY-BEE — KMS (no nomodeset)" {
linux ${kernel} ${append_live} bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
menuentry "EASY-BEE — KMS + GSP=off" {
linux ${kernel} ${append_live} bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
menuentry "EASY-BEE — fail-safe" {
linux ${kernel} ${append_live} nomodeset bee.nvidia.mode=gsp-off noapic noapm nodma nomce nolapic nosmp vga=normal net.ifnames=0 biosdevname=0
initrd ${initrd}
}
menuentry "EASY-BEE v${version_label} -- no GUI / no X11" {
linux ${kernel} ${append_live} nomodeset bee.gui=off bee.nvidia.mode=gsp-off pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd ${initrd}
}
if [ "\${grub_platform}" = "efi" ]; then
@@ -763,21 +775,28 @@ write_canonical_isolinux_cfg() {
kernel="$2"
initrd="$3"
append_live="$4"
version_label="${PROJECT_VERSION_EFFECTIVE}"
cat > "$cfg" <<EOF
label live-@FLAVOUR@-normal
menu label ^EASY-BEE
menu default
menu label ^EASY-BEE v${version_label}
linux ${kernel}
initrd ${initrd}
append ${append_live} nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM)
menu label EASY-BEE v${version_label} (^load to RAM)
menu default
linux ${kernel}
initrd ${initrd}
append ${append_live} toram nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-console
menu label EASY-BEE v${version_label} (^no GUI / no X11)
linux ${kernel}
initrd ${initrd}
append ${append_live} nomodeset bee.gui=off bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-gsp-off
menu label EASY-BEE (^NVIDIA GSP=off)
linux ${kernel}
@@ -821,10 +840,7 @@ enforce_live_build_bootloader_assets() {
if [ -f "$grub_cfg" ]; then
if extract_live_grub_entry "$grub_cfg"; then
mkdir -p "$grub_dir/live-theme"
cp "${BUILDER_DIR}/config/bootloaders/grub-efi/config.cfg" "$grub_dir/config.cfg"
cp "${BUILDER_DIR}/config/bootloaders/grub-efi/theme.cfg" "$grub_dir/theme.cfg"
cp -R "${BUILDER_DIR}/config/bootloaders/grub-efi/live-theme/." "$grub_dir/live-theme/"
write_canonical_grub_cfg "$grub_cfg" "$grub_kernel" "${live_build_append:-$grub_append}" "$grub_initrd"
echo "bootloader sync: rewrote binary/boot/grub/grub.cfg with canonical EASY-BEE menu"
else
@@ -869,6 +885,130 @@ reset_live_build_stage() {
done
}
# Marker written after every successful full lb build for this variant
FULL_BUILD_MARKER="${BUILD_WORK_DIR}/.bee-full-build-marker"
# Returns 0 if full lb build is needed, 1 if fast-path is safe.
# Fast-path is safe when only light files changed since the last full build
# (Go source, overlay scripts/configs). Heavy changes (VERSIONS, package lists,
# hooks, archives, Dockerfile, auto/config) require a full lb build.
needs_full_build() {
[ -f "${FULL_BUILD_MARKER}" ] || return 0
[ -f "${BUILD_WORK_DIR}/binary/live/filesystem.squashfs" ] || return 0
[ -f "${BUILD_WORK_DIR}/live-image-amd64.hybrid.iso" ] || return 0
_heavy=$(find \
"${BUILDER_DIR}/VERSIONS" \
"${BUILDER_DIR}/auto/config" \
"${BUILDER_DIR}/Dockerfile" \
"${BUILDER_DIR}/config/package-lists" \
"${BUILDER_DIR}/config/hooks" \
"${BUILDER_DIR}/config/archives" \
"${BUILDER_DIR}/config/bootloaders" \
-newer "${FULL_BUILD_MARKER}" 2>/dev/null | head -1)
if [ -n "$_heavy" ]; then
echo "=== full build required: heavy config changed: $(basename "$_heavy") ==="
return 0
fi
return 1
}
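The marker check above boils down to `find -newer` against a timestamp file. A minimal, self-contained sketch of the same decision (the paths here are illustrative, not the real build tree):

```shell
# Touch a marker, "edit" a watched file, and let find -newer decide.
work="$(mktemp -d)"
marker="$work/.full-build-marker"
mkdir -p "$work/config"
touch "$marker"
sleep 1                              # guarantee a newer mtime on coarse filesystems
touch "$work/config/package-list"
# -newer prints anything modified after the marker; head -1 is enough to decide.
changed="$(find "$work/config" -newer "$marker" -type f 2>/dev/null | head -1)"
if [ -n "$changed" ]; then
    echo "full build required: changed $(basename "$changed")"
else
    echo "fast-path ok"
fi
rm -rf "$work"
```

With the second `touch` in place this prints `full build required: changed package-list`; without it the fast path wins.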
# Fast-path: unsquash existing filesystem, rsync overlay on top, repack.
# Requires ~10 GB free in BEE_CACHE_DIR for the unpacked squashfs.
fast_path_repack_squashfs() {
_sq="${BUILD_WORK_DIR}/binary/live/filesystem.squashfs"
_tmp="${BEE_CACHE_DIR}/fast-unsquash-${BUILD_VARIANT}"
echo "=== fast-path: unsquash ($(du -sh "$_sq" | cut -f1) compressed) ==="
rm -rf "$_tmp"
unsquashfs -d "$_tmp" "$_sq"
echo "=== fast-path: syncing overlay stage ==="
rsync -a --checksum "${OVERLAY_STAGE_DIR}/" "$_tmp/"
echo "=== fast-path: repacking squashfs ==="
_sq_new="${_sq}.new"
rm -f "$_sq_new"
mksquashfs "$_tmp" "$_sq_new" -comp zstd -b 1048576 -noappend -no-progress -no-xattrs
mv "$_sq_new" "$_sq"
rm -rf "$_tmp"
echo "=== fast-path: squashfs repacked ($(du -sh "$_sq" | cut -f1)) ==="
}
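The `.new` + `mv` step at the end of the repack is the standard atomic-replace idiom: build the artefact beside the target, then rename over it, so readers only ever see the old or the new squashfs, never a truncated one. A generic sketch of the same pattern:

```shell
# Write new content to a sibling path, then rename over the target.
replace_atomically() {
    dst="$1"; shift
    tmp="${dst}.new"
    "$@" > "$tmp" || { rm -f "$tmp"; return 1; }  # producer command writes the new content
    mv "$tmp" "$dst"                              # atomic when tmp and dst share a filesystem
}

f="$(mktemp)"
printf 'old\n' > "$f"
replace_atomically "$f" printf 'new\n'
cat "$f"        # new
rm -f "$f"
```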
# Fast-path: rebuild ISO by replacing only live/filesystem.squashfs via xorriso.
# Boot structure (El Torito, EFI, MBR hybrid) is replayed from the prior ISO.
fast_path_rebuild_iso() {
_sq="${BUILD_WORK_DIR}/binary/live/filesystem.squashfs"
_prior="${BUILD_WORK_DIR}/live-image-amd64.hybrid.iso"
_new="${BUILD_WORK_DIR}/live-image-amd64.hybrid.iso.new"
echo "=== fast-path: rebuilding ISO with xorriso ==="
rm -f "$_new"
xorriso \
-indev "$_prior" \
-outdev "$_new" \
-map "$_sq" /live/filesystem.squashfs \
-boot_image any replay \
-commit
mv "$_new" "$_prior"
echo "=== fast-path: ISO rebuilt ==="
}
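For reference, the xorriso call decomposes as: `-indev`/`-outdev` open the prior ISO and the new output, `-map` swaps exactly one member, and `-boot_image any replay` copies the El Torito/EFI/MBR boot records from the input image. Sketched here as a dry-run command builder so it can be inspected without xorriso installed (file names are placeholders):

```shell
# Echo the invocation instead of running it; option names match the call above.
build_xorriso_cmd() {
    prior="$1" newiso="$2" squashfs="$3"
    printf '%s ' xorriso \
        -indev "$prior" \
        -outdev "$newiso" \
        -map "$squashfs" /live/filesystem.squashfs \
        -boot_image any replay \
        -commit
    printf '\n'
}
build_xorriso_cmd prior.iso new.iso filesystem.squashfs
```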
dir_has_entries() {
_dir="$1"
[ -d "$_dir" ] || return 1
find "$_dir" -mindepth 1 -print -quit 2>/dev/null | grep -q .
}
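`dir_has_entries` is worth a note: `-print -quit` makes find stop at the first entry, so the emptiness check stays constant-time even on a huge unpacked rootfs, and `grep -q .` turns "printed anything" into an exit status. In isolation:

```shell
dir_has_entries() {
    [ -d "$1" ] || return 1
    find "$1" -mindepth 1 -print -quit 2>/dev/null | grep -q .
}

d="$(mktemp -d)"
dir_has_entries "$d" && echo "non-empty" || echo "empty"    # empty
touch "$d/x"
dir_has_entries "$d" && echo "non-empty" || echo "empty"    # non-empty
rm -rf "$d"
```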
move_tree_to_layer() {
_src_root="$1"
_rel="$2"
_dst_root="$3"
[ -e "${_src_root}/${_rel}" ] || return 0
mkdir -p "${_dst_root}/$(dirname "$_rel")"
mv "${_src_root}/${_rel}" "${_dst_root}/${_rel}"
}
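`move_tree_to_layer` relocates one subtree while preserving its relative path under the destination root. A quick run against throwaway directories:

```shell
move_tree_to_layer() {
    _src_root="$1"; _rel="$2"; _dst_root="$3"
    [ -e "${_src_root}/${_rel}" ] || return 0          # silently skip absent trees
    mkdir -p "${_dst_root}/$(dirname "$_rel")"
    mv "${_src_root}/${_rel}" "${_dst_root}/${_rel}"
}

src="$(mktemp -d)"; dst="$(mktemp -d)"
mkdir -p "$src/usr/lib/firmware"
touch "$src/usr/lib/firmware/blob.bin"
move_tree_to_layer "$src" "usr/lib/firmware" "$dst"
ls "$dst/usr/lib/firmware"                             # blob.bin
[ -e "$src/usr/lib/firmware" ] || echo "moved out of source"
rm -rf "$src" "$dst"
```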
split_live_squashfs_layers() {
lb_dir="$1"
live_dir="${lb_dir}/binary/live"
base_sq="${live_dir}/filesystem.squashfs"
usr_sq="${live_dir}/10-usr.squashfs"
fw_sq="${live_dir}/20-firmware.squashfs"
[ -f "$base_sq" ] || return 0
command -v unsquashfs >/dev/null 2>&1 || return 0
command -v mksquashfs >/dev/null 2>&1 || return 0
tmp_root="$(mktemp -d)"
tmp_usr="$(mktemp -d)"
tmp_fw="$(mktemp -d)"
echo "=== splitting live squashfs into smaller layers ==="
unsquashfs -d "$tmp_root/root" "$base_sq" >/dev/null
mkdir -p "$tmp_usr/root" "$tmp_fw/root"
move_tree_to_layer "$tmp_root/root" "usr" "$tmp_usr/root"
move_tree_to_layer "$tmp_root/root" "lib/firmware" "$tmp_fw/root"
move_tree_to_layer "$tmp_root/root" "usr/lib/firmware" "$tmp_fw/root"
move_tree_to_layer "$tmp_root/root" "boot/firmware" "$tmp_fw/root"
rm -f "$usr_sq" "$fw_sq"
mksquashfs "$tmp_root/root" "${base_sq}.new" -comp zstd -b 1048576 -noappend -no-progress -no-xattrs >/dev/null
mv "${base_sq}.new" "$base_sq"
if dir_has_entries "$tmp_usr/root"; then
mksquashfs "$tmp_usr/root" "${usr_sq}.new" -comp zstd -b 1048576 -noappend -no-progress -no-xattrs >/dev/null
mv "${usr_sq}.new" "$usr_sq"
fi
if dir_has_entries "$tmp_fw/root"; then
mksquashfs "$tmp_fw/root" "${fw_sq}.new" -comp zstd -b 1048576 -noappend -no-progress -no-xattrs >/dev/null
mv "${fw_sq}.new" "$fw_sq"
fi
echo "=== live squashfs layers ==="
find "$live_dir" -maxdepth 1 -type f -name '*.squashfs' -exec du -sh {} \; | sort
rm -rf "$tmp_root" "$tmp_usr" "$tmp_fw"
}
recover_iso_memtest() {
lb_dir="$1"
iso_path="$2"
@@ -946,11 +1086,11 @@ recover_iso_memtest() {
fi
}
AUDIT_VERSION_EFFECTIVE="$(resolve_audit_version)"
ISO_VERSION_EFFECTIVE="$(resolve_iso_version)"
ISO_BASENAME="easy-bee-${BUILD_VARIANT}-v${ISO_VERSION_EFFECTIVE}-amd64"
PROJECT_VERSION_EFFECTIVE="$(resolve_project_version)"
ISO_BASENAME="easy-bee-${BUILD_VARIANT}-v${PROJECT_VERSION_EFFECTIVE}-amd64"
# Versioned output directory: dist/easy-bee-v4.1/ — all final artefacts live here.
OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}"
OUT_DIR="${DIST_DIR}/easy-bee-v${PROJECT_VERSION_EFFECTIVE}"
ISO_VERSION_LABEL_TOKEN="$(printf '%s' "${PROJECT_VERSION_EFFECTIVE}" | tr '[:lower:].-' '[:upper:]__')"
mkdir -p "${OUT_DIR}"
LOG_DIR="${OUT_DIR}/${ISO_BASENAME}.logs"
LOG_ARCHIVE="${OUT_DIR}/${ISO_BASENAME}.logs.tar.gz"
@@ -1126,7 +1266,7 @@ fi
echo "=== bee ISO build (variant: ${BUILD_VARIANT}) ==="
echo "Debian: ${DEBIAN_VERSION}, Kernel ABI: ${DEBIAN_KERNEL_ABI}, Go: ${GO_VERSION}"
echo "Audit version: ${AUDIT_VERSION_EFFECTIVE}, ISO version: ${ISO_VERSION_EFFECTIVE}"
echo "Project version: ${PROJECT_VERSION_EFFECTIVE}"
echo ""
run_step "sync git submodules" "05-git-submodules" \
@@ -1146,7 +1286,7 @@ if [ "$NEED_BUILD" = "1" ]; then
"cd '${REPO_ROOT}/audit' && \
env GOOS=linux GOARCH=amd64 CGO_ENABLED=0 \
go build \
-ldflags '-s -w -X main.Version=${AUDIT_VERSION_EFFECTIVE}' \
-ldflags '-s -w -X main.Version=${PROJECT_VERSION_EFFECTIVE}' \
-o '${BEE_BIN}' \
./cmd/bee"
echo "binary: $BEE_BIN"
@@ -1421,8 +1561,10 @@ else
fi
cat > "${OVERLAY_STAGE_DIR}/etc/bee-release" <<EOF
BEE_ISO_VERSION=${ISO_VERSION_EFFECTIVE}
BEE_AUDIT_VERSION=${AUDIT_VERSION_EFFECTIVE}
BEE_VERSION=${PROJECT_VERSION_EFFECTIVE}
export BEE_VERSION
BEE_ISO_VERSION=${PROJECT_VERSION_EFFECTIVE}
BEE_AUDIT_VERSION=${PROJECT_VERSION_EFFECTIVE}
BEE_BUILD_VARIANT=${BUILD_VARIANT}
BEE_GPU_VENDOR=${BEE_GPU_VENDOR}
BUILD_DATE=${BUILD_DATE}
@@ -1508,18 +1650,36 @@ if [ -f "${LB_INCLUDES}/root/.ssh/authorized_keys" ]; then
chmod 600 "${LB_INCLUDES}/root/.ssh/authorized_keys"
fi
# --- auto fast-path: squashfs surgery if only light files changed ---
if ! needs_full_build; then
echo "=== fast-path build (no heavy config changes since last full build) ==="
fast_path_repack_squashfs
fast_path_rebuild_iso
ISO_RAW="${LB_DIR}/live-image-amd64.hybrid.iso"
validate_iso_live_boot_entries "$ISO_RAW"
validate_iso_grub_assets "$ISO_RAW"
validate_iso_nvidia_runtime "$ISO_RAW"
cp "$ISO_RAW" "$ISO_OUT"
echo ""
echo "=== done (${BUILD_VARIANT}, fast-path) ==="
echo "ISO: $ISO_OUT"
exit 0
fi
# --- build ISO using live-build ---
echo ""
echo "=== building ISO (variant: ${BUILD_VARIANT}) ==="
# Export for auto/config
BEE_GPU_VENDOR_UPPER="$(echo "${BUILD_VARIANT}" | tr 'a-z-' 'A-Z_')"
export BEE_GPU_VENDOR_UPPER
BEE_ISO_VOLUME="EASY_BEE_${BEE_GPU_VENDOR_UPPER}_V${ISO_VERSION_LABEL_TOKEN}"
export BEE_GPU_VENDOR_UPPER BEE_ISO_VOLUME
cd "${LB_DIR}"
run_step_sh "live-build clean" "80-lb-clean" "lb clean --all 2>&1 | tail -3"
run_step_sh "live-build config" "81-lb-config" "lb config 2>&1 | tail -5"
dump_memtest_debug "pre-build" "${LB_DIR}"
export MKSQUASHFS_OPTIONS="-no-xattrs"
run_step_sh "live-build build" "90-lb-build" "lb build 2>&1"
echo "=== enforcing canonical bootloader assets ==="
enforce_live_build_bootloader_assets "${LB_DIR}"
@@ -1554,8 +1714,10 @@ if [ -f "$ISO_RAW" ]; then
fi
validate_iso_memtest "$ISO_RAW"
validate_iso_live_boot_entries "$ISO_RAW"
validate_iso_grub_assets "$ISO_RAW"
validate_iso_nvidia_runtime "$ISO_RAW"
cp "$ISO_RAW" "$ISO_OUT"
touch "${FULL_BUILD_MARKER}"
echo ""
echo "=== done (${BUILD_VARIANT}) ==="
echo "ISO: $ISO_OUT"

View File

@@ -1,5 +1,7 @@
set default=0
set timeout=5
set default=1
set timeout=10
set color_normal=yellow/black
set color_highlight=white/brown
if [ x$feature_default_font_path = xy ] ; then
font=unicode
@@ -8,7 +10,7 @@ else
fi
if loadfont $font ; then
set gfxmode=1920x1080,1280x1024,auto
set gfxmode=1280x1024,auto
set gfxpayload=keep
insmod efi_gop
insmod efi_uga
@@ -23,9 +25,6 @@ insmod serial
serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1
insmod gfxterm
insmod png
source /boot/grub/theme.cfg
terminal_input console serial
terminal_output gfxterm serial

View File

@@ -1,47 +1,21 @@
source /boot/grub/config.cfg
echo ""
echo " ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗"
echo " ██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝"
echo " █████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗"
echo " ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝"
echo " ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗"
echo " ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝"
echo " Hardware Audit LiveCD"
echo ""
menuentry "EASY-BEE" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
menuentry "EASY-BEE v@VERSION@" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
submenu "EASY-BEE (advanced options) -->" {
menuentry "EASY-BEE — load to RAM (toram)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE — GSP=off" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE — KMS (no nomodeset)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE — KMS + GSP=off" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE — fail-safe" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off noapic noapm nodma nomce nolapic nosmp vga=normal net.ifnames=0 biosdevname=0
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE v@VERSION@ -- load to RAM (toram)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ toram nomodeset bee.nvidia.mode=normal pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
menuentry "EASY-BEE v@VERSION@ -- no GUI / no X11" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.gui=off bee.nvidia.mode=gsp-off pci=realloc net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@
}
if [ "${grub_platform}" = "efi" ]; then
menuentry "Memory Test (memtest86+)" {
chainloader /boot/memtest86+x64.efi

Binary file not shown.

Before: 70 KiB
After: 77 KiB

Binary file not shown.

After: 469 KiB

View File

@@ -5,13 +5,6 @@ title-text: ""
message-font: "Unifont Regular 16"
terminal-font: "Unifont Regular 16"
#bee logo — centered, upper third of screen
+ image {
top = 4%
left = 50%-200
file = "bee-logo.png"
}
#help bar at the bottom
+ label {
top = 100%-50
@@ -34,11 +27,11 @@ terminal-font: "Unifont Regular 16"
item_font = "Unifont Regular 16"
selected_item_color= "#f5a800"
selected_item_font = "Unifont Regular 16"
item_height = 16
item_padding = 0
item_height = 20
item_padding = 2
item_spacing = 4
icon_width = 0
icon_heigh = 0
icon_height = 0
item_icon_space = 0
}

View File

@@ -1,16 +1,22 @@
label live-@FLAVOUR@-normal
menu label ^EASY-BEE
menu default
menu label ^EASY-BEE v@VERSION@
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM)
menu label EASY-BEE v@VERSION@ (^load to RAM)
menu default
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ toram nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-console
menu label EASY-BEE v@VERSION@ (^no GUI / no X11)
linux @LINUX@
initrd @INITRD@
append @APPEND_LIVE@ nomodeset bee.gui=off bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
label live-@FLAVOUR@-gsp-off
menu label EASY-BEE (^NVIDIA GSP=off)
linux @LINUX@

View File

@@ -31,6 +31,7 @@ systemctl enable bee-preflight.service
systemctl enable bee-audit.service
systemctl enable bee-web.service
systemctl enable bee-sshsetup.service
systemctl enable bee-blackbox.service
systemctl enable bee-selfheal.timer
systemctl enable bee-boot-status.service
systemctl enable ssh.service
@@ -66,6 +67,7 @@ chmod +x /usr/local/bin/bee-log-run 2>/dev/null || true
chmod +x /usr/local/bin/bee-selfheal 2>/dev/null || true
chmod +x /usr/local/bin/bee-boot-status 2>/dev/null || true
chmod +x /usr/local/bin/bee-install 2>/dev/null || true
chmod +x /usr/local/bin/bee-gui-gate 2>/dev/null || true
chmod +x /usr/local/bin/bee-remount-medium 2>/dev/null || true
if [ "$GPU_VENDOR" = "nvidia" ]; then
chmod +x /usr/local/bin/bee-nvidia-load 2>/dev/null || true

View File

@@ -47,18 +47,27 @@ vim-tiny
mc
htop
nvtop
btop
sudo
zstd
mstflint
memtester
stress-ng
stressapptest
# QR codes (for displaying audit results)
qrencode
fio
iperf3
iotop
nload
tcpdump
hdparm
sysstat
lsscsi
sg3-utils
jq
curl
net-tools
# Local desktop (openbox + chromium kiosk)
gparted
openbox
tint2
feh

View File

@@ -1,11 +1,4 @@
███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝
█████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗
██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝
███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗
╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝
EASY BEE
Hardware Audit LiveCD
Build: %%BUILD_INFO%%

View File

@@ -1,6 +1,6 @@
[Unit]
Description=Bee: hardware audit
After=bee-preflight.service bee-network.service bee-nvidia.service
After=bee-preflight.service bee-nvidia.service bee-blackbox.service
[Service]
Type=oneshot

View File

@@ -0,0 +1,18 @@
[Unit]
Description=Bee: USB black-box log mirror
After=local-fs.target
Before=bee-network.service bee-nvidia.service bee-preflight.service bee-audit.service bee-web.service
StartLimitIntervalSec=0
[Service]
Type=simple
ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-blackbox.log /usr/local/bin/bee blackbox --export-dir /appdata/bee/export --state-file /appdata/bee/export/blackbox-state.json
Restart=always
RestartSec=1
StandardOutput=journal
StandardError=journal
OOMScoreAdjust=-900
Nice=0
[Install]
WantedBy=multi-user.target

View File

@@ -1,7 +1,6 @@
[Unit]
Description=Bee: bring up network interfaces via DHCP
After=local-fs.target
Before=network-online.target bee-audit.service
After=bee-web.service bee-audit.service
[Service]
Type=oneshot

View File

@@ -1,6 +1,6 @@
[Unit]
Description=Bee: load NVIDIA kernel modules and create device nodes
After=local-fs.target udev.service
After=local-fs.target udev.service bee-blackbox.service
Before=bee-audit.service
[Service]

View File

@@ -1,6 +1,6 @@
[Unit]
Description=Bee: runtime preflight self-check
After=bee-network.service bee-nvidia.service
After=bee-nvidia.service bee-blackbox.service
Before=bee-audit.service
[Service]

View File

@@ -3,7 +3,7 @@ Description=Bee: run self-heal checks periodically
[Timer]
OnBootSec=45sec
OnUnitActiveSec=60sec
OnUnitActiveSec=3min
AccuracySec=15sec
Unit=bee-selfheal.service

View File

@@ -1,5 +1,6 @@
[Unit]
Description=Bee: hardware audit web viewer
After=bee-blackbox.service
StartLimitIntervalSec=0
[Service]

View File

@@ -0,0 +1,2 @@
[Service]
ExecCondition=/usr/local/bin/bee-gui-gate

View File

@@ -51,12 +51,7 @@ while true; do
printf '\033[H\033[2J'
printf '\n'
printf ' \033[33m███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗\033[0m\n'
printf ' \033[33m██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝\033[0m\n'
printf ' \033[33m█████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗\033[0m\n'
printf ' \033[33m██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝\033[0m\n'
printf ' \033[33m███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗\033[0m\n'
printf ' \033[33m╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝\033[0m\n'
printf ' \033[33mEASY BEE\033[0m\n'
printf ' Hardware Audit LiveCD\n'
printf '\n'

View File

@@ -0,0 +1,27 @@
#!/bin/sh
# bee-gui-gate — skip starting the local GUI when bee.gui=off is set.
set -eu
cmdline_param() {
key="$1"
for token in $(cat /proc/cmdline 2>/dev/null); do
case "$token" in
"$key"=*)
echo "${token#*=}"
return 0
;;
esac
done
return 1
}
mode="$(cmdline_param bee.gui || true)"
case "${mode}" in
off|false|0|tty|console|text|nogui)
echo "bee-gui-gate: bee.gui=${mode}; skipping lightdm"
exit 1
;;
esac
exit 0
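The cmdline parser above can be exercised outside a live system by feeding it a string instead of /proc/cmdline; this variant is identical except for the extra argument:

```shell
cmdline_param() {
    key="$1"; cmdline="$2"
    for token in $cmdline; do
        case "$token" in
            "$key"=*) echo "${token#*=}"; return 0 ;;
        esac
    done
    return 1
}

line="boot=live toram bee.gui=off bee.nvidia.mode=normal"
cmdline_param bee.gui "$line"            # off
cmdline_param bee.nvidia.mode "$line"    # normal
cmdline_param missing "$line" || echo "not set"
```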

View File

@@ -8,7 +8,7 @@
# Layout (UEFI): GPT, /dev/sdX1=EFI 512MB vfat, /dev/sdX2=root ext4
# Layout (BIOS): MBR, /dev/sdX1=root ext4
#
# Squashfs source: /run/live/medium/live/filesystem.squashfs
# Squashfs sources: /run/live/medium/live/*.squashfs
set -euo pipefail
@@ -62,9 +62,9 @@ for tool in parted mkfs.vfat mkfs.ext4 unsquashfs grub-install update-grub; do
fi
done
SQUASHFS="/run/live/medium/live/filesystem.squashfs"
if [ ! -f "$SQUASHFS" ]; then
echo "ERROR: squashfs not found at $SQUASHFS" >&2
mapfile -t SQUASHFS_FILES < <(find /run/live/medium/live -maxdepth 1 -type f -name '*.squashfs' | sort)
if [ "${#SQUASHFS_FILES[@]}" -eq 0 ]; then
echo "ERROR: no squashfs files found under /run/live/medium/live" >&2
echo " The live medium may have been disconnected." >&2
echo " Reconnect the disc and run: bee-remount-medium --wait" >&2
echo " Then re-run bee-install." >&2
@@ -106,7 +106,10 @@ log "=== BEE DISK INSTALLER ==="
log "Target device : $DEVICE"
log "Root partition: $PART_ROOT"
[ "$UEFI" = "1" ] && log "EFI partition : $PART_EFI"
log "Squashfs : $SQUASHFS ($(du -sh "$SQUASHFS" | cut -f1))"
log "Squashfs : ${#SQUASHFS_FILES[@]} layer(s)"
for sf in "${SQUASHFS_FILES[@]}"; do
log " - $sf ($(du -sh "$sf" | cut -f1))"
done
log "Log : $LOGFILE"
log ""
@@ -163,7 +166,9 @@ log " Mounted."
# ------------------------------------------------------------------
log "--- Step 5/7: Unpacking filesystem (this takes 10-20 minutes) ---"
log " Source: $SQUASHFS"
for sf in "${SQUASHFS_FILES[@]}"; do
log " Source: $sf"
done
log " Target: $MOUNT_ROOT"
# unsquashfs does not support resume, so retry the entire unpack step if the
@@ -177,9 +182,9 @@ while true; do
fi
[ "$UNPACK_ATTEMPTS" -gt 1 ] && log " Retry attempt $UNPACK_ATTEMPTS / $UNPACK_MAX ..."
# Re-check squashfs is reachable before each attempt
if [ ! -f "$SQUASHFS" ]; then
log " SOURCE LOST: $SQUASHFS not found."
mapfile -t SQUASHFS_FILES < <(find /run/live/medium/live -maxdepth 1 -type f -name '*.squashfs' | sort)
if [ "${#SQUASHFS_FILES[@]}" -eq 0 ]; then
log " SOURCE LOST: no squashfs files found under /run/live/medium/live."
log " Reconnect the disc and run 'bee-remount-medium --wait' in another terminal,"
log " then press Enter here to retry."
read -r _
@@ -194,12 +199,17 @@ while true; do
fi
UNPACK_OK=0
unsquashfs -f -d "$MOUNT_ROOT" "$SQUASHFS" 2>&1 | \
grep -E '^\[|^inod|^created|^extract|^ERROR|failed' | \
while IFS= read -r line; do log " $line"; done || UNPACK_OK=$?
for sf in "${SQUASHFS_FILES[@]}"; do
log " Unpacking $(basename "$sf") ..."
unsquashfs -f -d "$MOUNT_ROOT" "$sf" 2>&1 | \
grep -E '^\[|^inod|^created|^extract|^ERROR|failed' | \
while IFS= read -r line; do log " $line"; done || UNPACK_OK=$?
[ "$UNPACK_OK" -eq 0 ] || break
done
# Check squashfs is still reachable (gone = disc pulled during copy)
if [ ! -f "$SQUASHFS" ]; then
mapfile -t SQUASHFS_FILES < <(find /run/live/medium/live -maxdepth 1 -type f -name '*.squashfs' | sort)
if [ "${#SQUASHFS_FILES[@]}" -eq 0 ]; then
log " WARNING: source medium lost during unpack — will retry after remount."
log " Run 'bee-remount-medium --wait' in another terminal, then press Enter."
read -r _

View File

@@ -1,8 +1,9 @@
#!/bin/sh
# bee-network.sh — bring up all physical network interfaces via DHCP
# Unattended: runs silently, logs results, never blocks.
# Unattended: starts later in boot, runs quietly, and gives up after a bounded timeout.
LOG_PREFIX="bee-network"
DHCP_TIMEOUT_SECS=300
log() { echo "[$LOG_PREFIX] $*"; }
@@ -19,9 +20,50 @@ if command -v udevadm >/dev/null 2>&1; then
udevadm settle --timeout=5 >/dev/null 2>&1 || log "WARN: udevadm settle timed out"
fi
start_dhcp() {
iface="$1"
if ! ip link set "$iface" up; then
log "WARN: could not bring up $iface"
return 1
fi
carrier=$(cat "/sys/class/net/$iface/carrier" 2>/dev/null || true)
if [ "$carrier" = "1" ]; then
log "carrier detected on $iface"
else
log "carrier not detected on $iface"
fi
dhclient -r "$iface" >/dev/null 2>&1 || true
# Capture the exit status directly: after a plain `if cond; then ... fi`
# with no else branch, $? is 0 when the condition fails, so the 124
# (timeout) case below would never be reached.
timeout "${DHCP_TIMEOUT_SECS}" dhclient -4 -q -1 "$iface" >/dev/null 2>&1
rc=$?
if [ "$rc" -eq 0 ]; then
addr="$(ip -4 -o addr show dev "$iface" scope global 2>/dev/null | awk '{print $4}' | head -1)"
if [ -n "$addr" ]; then
log "DHCP lease acquired on $iface ($addr)"
else
log "DHCP lease acquired on $iface"
fi
return 0
fi
case "$rc" in
124)
log "DHCP timed out on $iface after ${DHCP_TIMEOUT_SECS}s"
;;
*)
log "DHCP failed on $iface (exit $rc)"
;;
esac
dhclient -r "$iface" >/dev/null 2>&1 || true
return 1
}
started_ifaces=""
started_count=0
scan_pass=1
pids=""
pid_ifaces=""
# Some server NICs appear a bit later after module/firmware init. Do a small
# bounded rescan window without turning network bring-up into a boot blocker.
@@ -34,22 +76,11 @@ while [ "$scan_pass" -le 3 ]; do
*" $iface "*) continue ;;
esac
log "bringing up $iface"
if ! ip link set "$iface" up; then
log "WARN: could not bring up $iface"
continue
fi
carrier=$(cat "/sys/class/net/$iface/carrier" 2>/dev/null || true)
if [ "$carrier" = "1" ]; then
log "carrier detected on $iface"
else
log "carrier not detected yet on $iface"
fi
# DHCP in background — non-blocking, keep dhclient verbose output in the service log.
dhclient -4 -v -nw "$iface" &
log "DHCP started for $iface (pid $!)"
log "starting DHCP on $iface (timeout ${DHCP_TIMEOUT_SECS}s)"
start_dhcp "$iface" &
pid="$!"
pids="$pids $pid"
pid_ifaces="$pid_ifaces $pid:$iface"
started_ifaces="$started_ifaces $iface"
started_count=$((started_count + 1))
@@ -68,4 +99,15 @@ if [ "$started_count" -eq 0 ]; then
exit 0
fi
log "done (interfaces started: $started_count)"
success_count=0
for pid_iface in $pid_ifaces; do
pid="${pid_iface%%:*}"
iface="${pid_iface#*:}"
if wait "$pid"; then
success_count=$((success_count + 1))
else
log "DHCP did not complete successfully on $iface"
fi
done
log "done (interfaces scanned: $started_count, leases acquired: $success_count)"
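The fan-out above (background `start_dhcp` per interface, then `wait` on each PID) is the standard pattern for parallel jobs with per-job status: `wait "$pid"` returns that job's exit code, which is how leases acquired versus failed are counted. Sketch with `true`/`false` standing in for `start_dhcp`:

```shell
pids=""; pid_jobs=""
for job in ok1 ok2 bad; do
    case "$job" in bad) false & ;; *) true & ;; esac
    pid=$!
    pids="$pids $pid"
    pid_jobs="$pid_jobs $pid:$job"
done
success=0
for pj in $pid_jobs; do
    pid="${pj%%:*}"; job="${pj#*:}"
    if wait "$pid"; then
        success=$((success + 1))
    else
        echo "job $job failed"
    fi
done
echo "succeeded: $success"   # succeeded: 2
```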

View File

@@ -60,35 +60,129 @@ wait_for_process_exit() {
return 0
}
kill_pattern() {
pattern="$1"
if pgrep -f "$pattern" >/dev/null 2>&1; then
pgrep -af "$pattern" 2>/dev/null | while IFS= read -r line; do
log_pid_details() {
pid="$1"
line=$(ps -p "$pid" -o pid=,comm=,args= 2>/dev/null | sed 's/^[[:space:]]*//')
if [ -n "$line" ]; then
log_blocker "$line"
else
log_blocker "pid $pid"
fi
}
collect_gpu_compute_pids() {
index="$1"
if ! command -v nvidia-smi >/dev/null 2>&1; then
return 0
fi
nvidia-smi --id="$index" \
--query-compute-apps=pid \
--format=csv,noheader,nounits 2>/dev/null \
| sed 's/^[[:space:]]*//;s/[[:space:]]*$//' \
| grep -E '^[0-9]+$' || true
}
collect_gpu_device_pids() {
index="$1"
dev="/dev/nvidia$index"
[ -e "$dev" ] || return 0
if command -v fuser >/dev/null 2>&1; then
fuser "$dev" 2>/dev/null \
| tr ' ' '\n' \
| sed 's/[^0-9].*$//' \
| grep -E '^[0-9]+$' || true
fi
}
collect_gpu_holder_pids() {
index="$1"
{
collect_gpu_compute_pids "$index"
collect_gpu_device_pids "$index"
} | awk 'NF' | sort -u
}
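`collect_gpu_holder_pids` unions the nvidia-smi and fuser results: `awk 'NF'` drops empty lines, and `sort -u` deduplicates PIDs reported by both sources (lexical order is fine here, since only membership matters). The same union in isolation:

```shell
union_pids() {
    # Accept two whitespace-separated PID lists, emit the deduplicated union.
    printf '%s\n%s\n' "$1" "$2" | tr ' ' '\n' | awk 'NF' | sort -u
}
union_pids "1234 987" "987 42"
# 1234
# 42
# 987
```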
kill_pid_list() {
pids="$1"
[ -n "$pids" ] || return 0
for pid in $pids; do
log_pid_details "$pid"
done
log "terminating GPU holder PIDs: $(echo "$pids" | tr '\n' ' ' | sed 's/[[:space:]]*$//')"
for pid in $pids; do
kill -TERM "$pid" >/dev/null 2>&1 || true
done
sleep 1
for pid in $pids; do
if kill -0 "$pid" >/dev/null 2>&1; then
log "forcing GPU holder PID $pid to exit"
kill -KILL "$pid" >/dev/null 2>&1 || true
fi
done
}
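`kill_pid_list` escalates SIGTERM to SIGKILL with a one-second grace period; `kill -0` probes whether a PID is still alive without delivering a signal. The same escalation against a throwaway process (no expected output shown, since whether escalation triggers depends on how fast the victim exits):

```shell
sleep 30 & victim=$!
kill -TERM "$victim" 2>/dev/null || true
sleep 1                              # grace period for a clean exit
if kill -0 "$victim" 2>/dev/null; then
    echo "still alive, escalating to SIGKILL"
    kill -KILL "$victim" 2>/dev/null || true
fi
wait "$victim" 2>/dev/null || true
```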
gpu_has_display_holders() {
index="$1"
holders=$(collect_gpu_device_pids "$index")
[ -n "$holders" ] || return 1
for pid in $holders; do
comm=$(ps -p "$pid" -o comm= 2>/dev/null | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
case "$comm" in
Xorg|Xwayland|X|gnome-shell)
return 0
;;
esac
done
return 1
}
stop_nv_hostengine_if_running() {
if pgrep -x nv-hostengine >/dev/null 2>&1; then
pgrep -af "^nv-hostengine$" 2>/dev/null | while IFS= read -r line; do
[ -n "$line" ] || continue
log_blocker "$line"
done
log "killing processes matching: $pattern"
pkill -TERM -f "$pattern" >/dev/null 2>&1 || true
sleep 1
pkill -KILL -f "$pattern" >/dev/null 2>&1 || true
log "stopping nv-hostengine"
pkill -TERM -x nv-hostengine >/dev/null 2>&1 || true
wait_for_process_exit nv-hostengine || pkill -KILL -x nv-hostengine >/dev/null 2>&1 || true
hostengine_was_active=1
return 0
fi
return 1
}
stop_fabricmanager_if_active() {
if unit_exists nvidia-fabricmanager.service && stop_unit_if_active nvidia-fabricmanager.service; then
log_blocker "service nvidia-fabricmanager.service"
fabric_was_active=1
return 0
fi
return 1
}
stop_display_stack_if_active() {
stopped=1
for unit in display-manager.service lightdm.service; do
if unit_exists "$unit" && stop_unit_if_active "$unit"; then
log_blocker "service $unit"
display_was_active=1
stopped=0
fi
done
return "$stopped"
}
try_gpu_reset() {
index="$1"
log "resetting GPU $index"
nvidia-smi -r -i "$index"
}
drain_gpu_clients() {
display_was_active=0
fabric_was_active=0
for unit in display-manager.service lightdm.service; do
if unit_exists "$unit" && stop_unit_if_active "$unit"; then
log_blocker "service $unit"
display_was_active=1
fi
done
if unit_exists nvidia-fabricmanager.service && stop_unit_if_active nvidia-fabricmanager.service; then
log_blocker "service nvidia-fabricmanager.service"
fabric_was_active=1
fi
hostengine_was_active=0
if pgrep -x nv-hostengine >/dev/null 2>&1; then
pgrep -af "^nv-hostengine$" 2>/dev/null | while IFS= read -r line; do
@@ -98,21 +192,25 @@ drain_gpu_clients() {
log "stopping nv-hostengine"
pkill -TERM -x nv-hostengine >/dev/null 2>&1 || true
wait_for_process_exit nv-hostengine || pkill -KILL -x nv-hostengine >/dev/null 2>&1 || true
hostengine_was_active=1
fi
for pattern in \
"nvidia-smi" \
"dcgmi" \
"nvvs" \
"dcgmproftester" \
"all_reduce_perf" \
"nvtop" \
"bee-gpu-burn" \
"bee-john-gpu-stress" \
"bee-nccl-gpu-stress" \
"Xorg" \
"Xwayland"; do
kill_pattern "$pattern"
if unit_exists nvidia-fabricmanager.service && stop_unit_if_active nvidia-fabricmanager.service; then
log_blocker "service nvidia-fabricmanager.service"
fabric_was_active=1
fi
for unit in display-manager.service lightdm.service; do
if unit_exists "$unit" && stop_unit_if_active "$unit"; then
log_blocker "service $unit"
display_was_active=1
fi
done
for dev in /dev/nvidia[0-9]*; do
[ -e "$dev" ] || continue
holders=$(collect_gpu_device_pids "${dev#/dev/nvidia}")
kill_pid_list "$holders"
done
}
@@ -125,7 +223,7 @@ restore_gpu_clients() {
fi
fi
if command -v nv-hostengine >/dev/null 2>&1 && ! pgrep -x nv-hostengine >/dev/null 2>&1; then
if [ "${hostengine_was_active:-0}" = "1" ] && command -v nv-hostengine >/dev/null 2>&1 && ! pgrep -x nv-hostengine >/dev/null 2>&1; then
log "starting nv-hostengine"
nv-hostengine
fi
@@ -153,10 +251,60 @@ restart_drivers() {
reset_gpu() {
index="$1"
drain_gpu_clients
log "resetting GPU $index"
nvidia-smi -r -i "$index"
display_was_active=0
fabric_was_active=0
hostengine_was_active=0
holders=$(collect_gpu_holder_pids "$index")
if [ -n "$holders" ]; then
kill_pid_list "$holders"
fi
if try_gpu_reset "$index"; then
restore_gpu_clients
return 0
fi
stop_nv_hostengine_if_running || true
holders=$(collect_gpu_holder_pids "$index")
if [ -n "$holders" ]; then
kill_pid_list "$holders"
fi
if try_gpu_reset "$index"; then
restore_gpu_clients
return 0
fi
stop_fabricmanager_if_active || true
holders=$(collect_gpu_holder_pids "$index")
if [ -n "$holders" ]; then
kill_pid_list "$holders"
fi
if try_gpu_reset "$index"; then
restore_gpu_clients
return 0
fi
if gpu_has_display_holders "$index"; then
stop_display_stack_if_active || true
holders=$(collect_gpu_holder_pids "$index")
if [ -n "$holders" ]; then
kill_pid_list "$holders"
fi
if try_gpu_reset "$index"; then
restore_gpu_clients
return 0
fi
fi
holders=$(collect_gpu_holder_pids "$index")
if [ -n "$holders" ]; then
log "GPU $index still has holders after targeted drain"
kill_pid_list "$holders"
fi
try_gpu_reset "$index"
rc=$?
restore_gpu_clients
return "$rc"
}
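The rewritten `reset_gpu` above is a staged escalation ladder: attempt the reset, and after each failure apply the next, more disruptive drain step (kill holders, stop nv-hostengine, stop fabricmanager, stop the display stack) before retrying. The control flow can be sketched generically with stubbed helpers; the stub names are illustrative, not the script's real functions:

```shell
# Generic sketch of the escalation ladder in reset_gpu.
attempts=0
try_reset() {                      # stub: succeeds on the third attempt
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

drain_steps="kill_holders stop_hostengine stop_fabricmanager stop_display"

reset_with_escalation() {
    try_reset && return 0          # cheapest path first
    for step in $drain_steps; do
        echo "escalating: $step"   # real code runs the drain step here
        try_reset && return 0      # retry after each escalation
    done
    return 1
}

reset_with_escalation && echo "reset ok after $attempts attempt(s)"
```

Returning as soon as a reset succeeds is what keeps the drain minimal: services like the display manager are only stopped when every gentler step has already failed.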
cmd="${1:-}"

View File

@@ -2,7 +2,7 @@
# bee-remount-medium — find and remount the live ISO medium to /run/live/medium
#
# Run this after reconnecting the ISO source disc (USB/CD) if the live medium
# was lost and /run/live/medium/live/filesystem.squashfs is missing.
# was lost and /run/live/medium/live/*.squashfs are missing.
#
# Usage: bee-remount-medium [--wait]
# --wait keep retrying every 5 seconds until the medium is found (useful
@@ -11,7 +11,7 @@
set -euo pipefail
MEDIUM_DIR="/run/live/medium"
SQUASHFS_REL="live/filesystem.squashfs"
SQUASHFS_GLOB="live/*.squashfs"
WAIT_MODE=0
for arg in "$@"; do
@@ -28,6 +28,10 @@ done
log() { echo "[$(date +%H:%M:%S)] $*"; }
die() { log "ERROR: $*" >&2; exit 1; }
if [ "$(id -u)" -ne 0 ]; then
die "bee-remount-medium must be run as root (use sudo or a root shell)"
fi
# Return all candidate block devices (optical + removable USB mass storage)
find_candidates() {
# CD/DVD drives
@@ -52,7 +56,7 @@ try_mount() {
local tmpdir
tmpdir=$(mktemp -d /tmp/bee-probe-XXXXXX)
if mount -o ro "$dev" "$tmpdir" 2>/dev/null; then
if [ -f "${tmpdir}/${SQUASHFS_REL}" ]; then
if find "${tmpdir}/live" -maxdepth 1 -type f -name '*.squashfs' 2>/dev/null | grep -q .; then
# Unmount probe mount and mount properly onto live path
umount "$tmpdir" 2>/dev/null || true
rmdir "$tmpdir" 2>/dev/null || true
@@ -78,8 +82,9 @@ attempt() {
for dev in $(find_candidates); do
log " Trying $dev ..."
if try_mount "$dev"; then
local sq="${MEDIUM_DIR}/${SQUASHFS_REL}"
log "SUCCESS: squashfs available at $sq ($(du -sh "$sq" | cut -f1))"
local count
count=$(find "${MEDIUM_DIR}/live" -maxdepth 1 -type f -name '*.squashfs' 2>/dev/null | wc -l | tr -d ' ')
log "SUCCESS: ${count} squashfs layer(s) available under ${MEDIUM_DIR}/live"
return 0
fi
done
@@ -96,5 +101,5 @@ if [ "$WAIT_MODE" = "1" ]; then
sleep 5
done
else
attempt || die "No ISO medium with ${SQUASHFS_REL} found. Reconnect the disc and re-run, or use --wait."
attempt || die "No ISO medium with ${SQUASHFS_GLOB} found. Reconnect the disc and re-run, or use --wait."
fi
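The probe logic above mounts each candidate device read-only into a temp dir and keeps it only if `live/*.squashfs` matches. That check can be sketched without root or block devices by substituting a plain directory for the probe mount:

```shell
# has_squashfs_layer mirrors the find|grep -q test from try_mount.
has_squashfs_layer() {
    find "$1/live" -maxdepth 1 -type f -name '*.squashfs' 2>/dev/null | grep -q .
}

probe=$(mktemp -d)          # stands in for the probe mountpoint
mkdir -p "$probe/live"

found_before=0
has_squashfs_layer "$probe" && found_before=1   # empty live/ -> no match

: > "$probe/live/filesystem.squashfs"           # fake squashfs layer
found_after=0
has_squashfs_layer "$probe" && found_after=1

rm -rf "$probe"
echo "before=$found_before after=$found_after"
```

`grep -q .` succeeds as soon as `find` prints any path, which is why a glob like `live/*.squashfs` works for both single-file and multi-layer media.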

View File

@@ -8,11 +8,17 @@ EXPORT_DIR="/appdata/bee/export"
AUDIT_JSON="${EXPORT_DIR}/bee-audit.json"
RUNTIME_JSON="${EXPORT_DIR}/runtime-health.json"
LOCK_DIR="/run/bee-selfheal.lock"
EVENTS=0
log() {
echo "[${LOG_PREFIX}] $*"
}
log_event() {
EVENTS=$((EVENTS + 1))
log "$*"
}
have_nvidia_gpu() {
lspci -Dn 2>/dev/null | awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ { found=1; exit } END { exit(found ? 0 : 1) }'
}
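In `have_nvidia_gpu`, field 2 of `lspci -Dn` output is the PCI class (`0300:` VGA, `0302:` 3D controller) and field 3 is `vendor:device`, with `10de` being NVIDIA's vendor ID. The same awk filter can be tested against canned lspci lines:

```shell
# The filter from have_nvidia_gpu, reading lspci -Dn lines on stdin.
detect() {
    awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ { found=1; exit }
         END { exit(found ? 0 : 1) }'
}

# A 3D controller with NVIDIA vendor ID matches (device IDs are examples).
gpu1=0
printf '%s\n' \
    "0000:00:1f.2 0106: 8086:a352" \
    "0000:3b:00.0 0302: 10de:20b0" | detect && gpu1=1

# An Intel VGA controller does not match despite class 0300.
gpu2=0
printf '%s\n' "0000:00:02.0 0300: 8086:3e92" | detect && gpu2=1

echo "gpu1=$gpu1 gpu2=$gpu2"
```

Anchoring both regexes (`^...$` for the class, `^10de:` for the vendor) avoids false positives from subsystem IDs elsewhere on the line.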
@@ -56,24 +62,22 @@ web_healthy() {
mkdir -p "${EXPORT_DIR}" /run
if ! mkdir "${LOCK_DIR}" 2>/dev/null; then
log "another self-heal run is already active"
log_event "another self-heal run is already active"
exit 0
fi
trap 'rmdir "${LOCK_DIR}" >/dev/null 2>&1 || true' EXIT
log "start"
if have_nvidia_gpu && [ ! -e /dev/nvidia0 ]; then
log "NVIDIA GPU detected but /dev/nvidia0 is missing"
log_event "NVIDIA GPU detected but /dev/nvidia0 is missing"
restart_service bee-nvidia.service || true
fi
runtime_state="$(artifact_state "${RUNTIME_JSON}")"
if [ "${runtime_state}" != "ready" ]; then
if [ "${runtime_state}" = "interrupted" ]; then
log "runtime-health.json.tmp exists — interrupted runtime-health write detected"
log_event "runtime-health.json.tmp exists — interrupted runtime-health write detected"
else
log "runtime-health.json missing or empty"
log_event "runtime-health.json missing or empty"
fi
restart_service bee-preflight.service || true
fi
@@ -81,19 +85,17 @@ fi
audit_state="$(artifact_state "${AUDIT_JSON}")"
if [ "${audit_state}" != "ready" ]; then
if [ "${audit_state}" = "interrupted" ]; then
log "bee-audit.json.tmp exists — interrupted audit write detected"
log_event "bee-audit.json.tmp exists — interrupted audit write detected"
else
log "bee-audit.json missing or empty"
log_event "bee-audit.json missing or empty"
fi
restart_service bee-audit.service || true
fi
if ! service_active bee-web.service; then
log "bee-web.service is not active"
log_event "bee-web.service is not active"
restart_service bee-web.service || true
elif ! web_healthy; then
log "bee-web health check failed"
log_event "bee-web health check failed"
restart_service bee-web.service || true
fi
log "done"
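The self-heal script uses `mkdir` as an atomic mutex: `mkdir` fails if the directory already exists, so only one run can create it, and a `trap ... EXIT` releases the lock on any exit path. A minimal sketch of the same pattern:

```shell
# mkdir-as-lock: atomic acquire, trap-based release.
LOCK_DIR=$(mktemp -u /tmp/lock-sketch-XXXXXX)   # fresh, not-yet-created name

acquired=0
if mkdir "$LOCK_DIR" 2>/dev/null; then
    acquired=1
    trap 'rmdir "$LOCK_DIR" 2>/dev/null || true' EXIT
fi

# A second acquire attempt on the same lock must fail.
second=0
mkdir "$LOCK_DIR" 2>/dev/null && second=1

echo "acquired=$acquired second=$second"
```

Unlike a lock file created with `>`, `mkdir` is a single atomic syscall that both tests and takes the lock, so there is no check-then-create race between concurrent runs.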

BIN
iso/vendor/arcconf vendored Executable file

Binary file not shown.

BIN
iso/vendor/sas2ircu vendored Executable file

Binary file not shown.

BIN
iso/vendor/sas3ircu vendored Executable file

Binary file not shown.

BIN
iso/vendor/ssacli vendored Executable file

Binary file not shown.

BIN
iso/vendor/storcli64 vendored Executable file

Binary file not shown.

View File

@@ -47,6 +47,13 @@ echo "==> Building the binary..."
)
echo " OK: $(ls -lh "${LOCAL_BIN}" | awk '{print $5, $9}')"
LOCAL_SHA="$(shasum -a 256 "${LOCAL_BIN}" | awk '{print $1}')"
REMOTE_SHA="$("${SSH_CMD[@]}" "$REMOTE" "if [ -f '${REMOTE_BIN}' ] && command -v sha256sum >/dev/null 2>&1; then sha256sum '${REMOTE_BIN}' | awk '{print \\$1}'; fi" 2>/dev/null || true)"
if [[ -n "${REMOTE_SHA}" && "${LOCAL_SHA}" == "${REMOTE_SHA}" ]]; then
echo "==> Binary unchanged (${LOCAL_SHA}); skipping copy and service restart."
exit 0
fi
# --- Deploy ---
echo "==> Copying to ${REMOTE}..."
"${SCP_CMD[@]}" "${LOCAL_BIN}" "${REMOTE}:/tmp/bee-new"
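The skip-deploy check above compares the SHA-256 of the fresh local build with the remote binary's hash and exits early when they match. The same comparison can be sketched locally, with two files standing in for the local build and the remote binary (assuming `sha256sum` is available, as on the remote side):

```shell
# "Skip deploy when hashes match" with two local stand-in files.
a=$(mktemp); b=$(mktemp)
printf 'v1' > "$a"
printf 'v1' > "$b"

sha_a=$(sha256sum "$a" | awk '{print $1}')
sha_b=$(sha256sum "$b" | awk '{print $1}')
if [ -n "$sha_b" ] && [ "$sha_a" = "$sha_b" ]; then
    deploy=skipped            # identical content -> nothing to copy
else
    deploy=needed
fi

printf 'v2' > "$b"            # simulate a changed build
sha_b=$(sha256sum "$b" | awk '{print $1}')
redeploy=skipped
[ "$sha_a" = "$sha_b" ] || redeploy=needed

rm -f "$a" "$b"
echo "deploy=$deploy redeploy=$redeploy"
```

The `[ -n "$sha_b" ]` guard matters: if the remote hash could not be read (missing file, no `sha256sum`), the comparison must fall through to a full deploy rather than silently skipping it.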

View File

@@ -1,74 +0,0 @@
#!/bin/sh
# fetch-vendor.sh — download proprietary vendor utilities into iso/vendor.
#
# Usage:
# STORCLI_URL=... STORCLI_SHA256=... \
# SAS2IRCU_URL=... SAS2IRCU_SHA256=... \
# SAS3IRCU_URL=... SAS3IRCU_SHA256=... \
# MSTFLINT_URL=... MSTFLINT_SHA256=... \
# sh scripts/fetch-vendor.sh
set -eu
ROOT_DIR=$(CDPATH= cd -- "$(dirname -- "$0")/.." && pwd)
OUT_DIR="$ROOT_DIR/iso/vendor"
mkdir -p "$OUT_DIR"
need_cmd() {
command -v "$1" >/dev/null 2>&1 || { echo "ERROR: required command not found: $1" >&2; exit 1; }
}
need_cmd sha256sum
download_to() {
url="$1"
out="$2"
if command -v wget >/dev/null 2>&1; then
wget -O "$out" "$url"
return 0
fi
if command -v curl >/dev/null 2>&1; then
curl -fsSL "$url" -o "$out"
return 0
fi
echo "ERROR: required command not found: wget or curl" >&2
exit 1
}
fetch_one() {
name="$1"
url="$2"
sha="$3"
if [ -z "$url" ] || [ -z "$sha" ]; then
echo "[vendor] skip $name (URL/SHA not provided)"
return 0
fi
dst="$OUT_DIR/$name"
tmp="$dst.tmp"
echo "[vendor] downloading $name"
download_to "$url" "$tmp"
got=$(sha256sum "$tmp" | awk '{print $1}')
want=$(echo "$sha" | tr '[:upper:]' '[:lower:]')
if [ "$got" != "$want" ]; then
rm -f "$tmp"
echo "ERROR: checksum mismatch for $name" >&2
echo " got: $got" >&2
echo " want: $want" >&2
exit 1
fi
mv "$tmp" "$dst"
chmod +x "$dst" || true
echo "[vendor] ok: $name"
}
fetch_one "storcli64" "${STORCLI_URL:-}" "${STORCLI_SHA256:-}"
fetch_one "sas2ircu" "${SAS2IRCU_URL:-}" "${SAS2IRCU_SHA256:-}"
fetch_one "sas3ircu" "${SAS3IRCU_URL:-}" "${SAS3IRCU_SHA256:-}"
fetch_one "mstflint" "${MSTFLINT_URL:-}" "${MSTFLINT_SHA256:-}"
echo "[vendor] done. output dir: $OUT_DIR"
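`fetch_one` downloads to a `.tmp` path, verifies the SHA-256 (lowercasing the expected hash first), and only then atomically moves the file into place. The verify-then-install step can be sketched without a network, with a local file in place of the download:

```shell
# fetch_one's verify-then-install step, download replaced by a local write.
OUT_DIR=$(mktemp -d)
tmp="$OUT_DIR/tool.tmp"
dst="$OUT_DIR/tool"

printf 'payload' > "$tmp"                       # stands in for download_to
want=$(sha256sum "$tmp" | awk '{print $1}')     # the pinned checksum
got=$(sha256sum "$tmp" | awk '{print $1}')

installed=0
if [ "$got" = "$(echo "$want" | tr '[:upper:]' '[:lower:]')" ]; then
    mv "$tmp" "$dst"                            # atomic rename into place
    chmod +x "$dst" || true
    installed=1
else
    rm -f "$tmp"                                # never leave a bad artifact
fi

echo "installed=$installed"
rm -rf "$OUT_DIR"
```

Writing to `$dst.tmp` and renaming only after verification means `$OUT_DIR` never contains a partially downloaded or checksum-failed binary under its final name.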