Compare commits

22 Commits
v3.5 ... v3.20

Author SHA1 Message Date
Mikhail Chusavitin
b447717a5a fix(iso): harden boot network bring-up - v3.20 2026-04-01 09:10:55 +03:00
Mikhail Chusavitin
f6f4923ac9 fix(iso): recover memtest after live-build 2026-04-01 08:55:57 +03:00
Mikhail Chusavitin
c394845b34 refactor(webui): queue install and bundle tasks - v3.18 2026-04-01 08:46:46 +03:00
Mikhail Chusavitin
3472afea32 fix(iso): make memtest non-blocking by default 2026-04-01 08:33:36 +03:00
Mikhail Chusavitin
942f11937f chore(submodule): update bible - v3.16 2026-04-01 08:23:39 +03:00
Mikhail Chusavitin
b5b34983f1 fix(webui): repair audit actions and CPU burn flow - v3.15 2026-04-01 08:19:11 +03:00
45221d1e9a fix(stress): label loaders and improve john opencl diagnostics 2026-04-01 07:31:52 +03:00
3869788bac fix(iso): validate memtest with xorriso fallback 2026-04-01 07:24:05 +03:00
3dbc2184ef fix(iso): archive build logs and memtest diagnostics 2026-04-01 07:14:53 +03:00
60cb8f889a fix(iso): restore memtest menu entries and validate ISO 2026-04-01 07:04:48 +03:00
c9ee078622 fix(stress): keep platform burn responsive under load 2026-03-31 22:28:26 +03:00
ea660500c9 chore: commit pending repo changes 2026-03-31 22:17:36 +03:00
d43a9aeec7 fix(iso): restore live-build memtest integration 2026-03-31 22:10:28 +03:00
Mikhail Chusavitin
f5622e351e Fix staged John cleanup for repeated ISO builds 2026-03-31 11:40:52 +03:00
Mikhail Chusavitin
a20806afc8 Fix ISO grub package conflict 2026-03-31 11:38:30 +03:00
Mikhail Chusavitin
4f9b6b3bcd Harden NVIDIA boot logging on live ISO 2026-03-31 11:37:21 +03:00
Mikhail Chusavitin
c850b39b01 feat: v3.10 GPU stress and NCCL burn updates 2026-03-31 11:22:27 +03:00
Mikhail Chusavitin
6dee8f3509 Add NVIDIA stress loader selection and DCGM 4 support 2026-03-31 11:15:15 +03:00
Mikhail Chusavitin
20f834aa96 feat: v3.4 — boot reliability, log readability, USB export, screen resolution, GRUB UEFI fix, memtest, KVM console stability
Web UI / logs:
- Strip ANSI escape codes and handle \r (progress bars) in task log output
- Add USB export API + UI card on Export page (list removable devices, write audit JSON or support bundle)
- Add Display Resolution card in Tools (xrandr-based, per-output mode selector)
- Dashboard: audit status banner with auto-reload when audit task completes

Boot & install:
- bee-web starts immediately with no dependencies (was blocked by audit + network)
- bee-audit.service redesigned: waits for bee-web healthz, sleeps 60s, enqueues audit via /api/audit/run (task system)
- bee-install: fix GRUB UEFI — grub-install exit code was silently ignored (|| true); add --no-nvram fallback; always copy EFI/BOOT/BOOTX64.EFI fallback path
- Add grub-efi-amd64, grub-pc, grub-efi-amd64-signed, shim-signed to package list (grub-install requires these, not just -bin variants)
- memtest hook: fix binary/boot/ not created before cp; handle both Debian (no extension) and upstream (x64.efi) naming
- bee-openbox-session: increase healthz wait from 30s to 120s

KVM console stability:
- runCmdJob: syscall.Setpriority(PRIO_PROCESS, pid, 10) on all stress subprocesses
- lightdm.service.d: Nice=-5 so X server preempts stress processes

Packages: add btop

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 10:16:15 +03:00
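The log-cleaning item in the commit above (strip ANSI escape codes, handle `\r` from progress bars) could look roughly like the following Go sketch. The names `cleanLogLine` and `ansiCSI` are hypothetical and not taken from the commit, and the regular expression only covers CSI sequences, so the actual web UI code may well differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ansiCSI matches CSI escape sequences such as "\x1b[31m" or "\x1b[2K".
// Hypothetical helper: other escape families (OSC, etc.) are not handled.
var ansiCSI = regexp.MustCompile(`\x1b\[[0-9;?]*[ -/]*[@-~]`)

// cleanLogLine removes CSI sequences and keeps only the text after the last
// carriage return, which is what a terminal shows once a progress bar has
// finished redrawing the line.
func cleanLogLine(line string) string {
	line = ansiCSI.ReplaceAllString(line, "")
	line = strings.TrimRight(line, "\r")
	if i := strings.LastIndex(line, "\r"); i >= 0 {
		line = line[i+1:]
	}
	return line
}

func main() {
	raw := "\x1b[32m 10%\r 55%\r100% done\x1b[0m\r"
	fmt.Printf("%q\n", cleanLogLine(raw)) // prints "100% done"
}
```

Keeping only the last `\r`-separated segment is one possible policy; a log viewer that wants to preserve intermediate progress could split on `\r` instead.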
105d92df8b fix(iso): use underscore in volume label to comply with ISO 9660
ISO 9660 volume labels allow only A-Z, 0-9, and underscore.
Dashes cause xorriso WARNING on every build.
EASY-BEE-NVIDIA → EASY_BEE_NVIDIA (iso-application keeps dashes, it's UDF).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:38:02 +03:00
f96b149875 fix(memtest): extract EFI binary from .deb cache if chroot/boot/ is empty
memtest86+ postinst does not place files in /boot in a live-build chroot
without grub triggers. Added fallback: extract directly from the cached
.deb via dpkg-deb -x, with verbose logging throughout.

Also remove "NVIDIA no MSI-X" from boot menu (premature — root cause unknown).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:30:52 +03:00
5ee120158e fix(build): remove unused variant package lists before lb build
live-build picks up ALL .list.chroot files in config/package-lists/.
After rsync, bee-nvidia.list.chroot, bee-amd.list.chroot, and
bee-nogpu.list.chroot all end up in BUILD_WORK_DIR — causing lb to
try installing packages from every variant (and leaving version
placeholders unsubstituted in the unused lists).

Fix: after copying bee-${BEE_GPU_VENDOR}.list.chroot → bee-gpu.list.chroot,
delete all other bee-{nvidia,amd,nogpu}.list.chroot from BUILD_WORK_DIR.

Also includes nomsi boot mode changes (bee-nvidia-load + grub.cfg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 23:03:42 +03:00
54 changed files with 3524 additions and 406 deletions

View File

@@ -343,9 +343,9 @@ Planned code shape:
 - `bee tui` can rerun the audit manually
 - `bee tui` can export the latest audit JSON to removable media
 - `bee tui` can show health summary and run NVIDIA/memory/storage acceptance tests
-- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-stress`
+- NVIDIA SAT now includes a lightweight in-image GPU stress step via `bee-gpu-burn`
 - SAT summaries now expose `overall_status` plus per-job `OK/FAILED/UNSUPPORTED`
-- Memory/GPU SAT runtime defaults can be overridden via `BEE_MEMTESTER_*` and `BEE_GPU_STRESS_*`
+- Memory SAT runtime defaults can be overridden via `BEE_MEMTESTER_*`
 - removable export requires explicit target selection, mount, confirmation, copy, and cleanup
 ### 2.6 — Vendor utilities and optional assets

View File

@@ -356,6 +356,7 @@ func runSAT(args []string, stdout, stderr io.Writer) int {
     fs := flag.NewFlagSet("sat", flag.ContinueOnError)
     fs.SetOutput(stderr)
     duration := fs.Int("duration", 0, "stress-ng duration in seconds (cpu only; default: 60)")
+    diagLevel := fs.Int("diag-level", 0, "DCGM diagnostic level for nvidia (1=quick, 2=medium, 3=targeted stress, 4=extended stress; default: 1)")
     if err := fs.Parse(args[1:]); err != nil {
         if err == flag.ErrHelp {
             return 0
@@ -370,7 +371,7 @@ func runSAT(args []string, stdout, stderr io.Writer) int {
     target := args[0]
     if target != "nvidia" && target != "memory" && target != "storage" && target != "cpu" {
         fmt.Fprintf(stderr, "bee sat: unknown target %q\n", target)
-        fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>]")
+        fmt.Fprintln(stderr, "usage: bee sat nvidia|memory|storage|cpu [--duration <seconds>] [--diag-level <1-4>]")
         return 2
     }
@@ -382,7 +383,12 @@ func runSAT(args []string, stdout, stderr io.Writer) int {
     logLine := func(s string) { fmt.Fprintln(os.Stderr, s) }
     switch target {
     case "nvidia":
-        archive, err = application.RunNvidiaAcceptancePack("", logLine)
+        level := *diagLevel
+        if level > 0 {
+            _, err = application.RunNvidiaAcceptancePackWithOptions(context.Background(), "", level, nil, logLine)
+        } else {
+            archive, err = application.RunNvidiaAcceptancePack("", logLine)
+        }
     case "memory":
         archive, err = application.RunMemoryAcceptancePackCtx(context.Background(), "", logLine)
     case "storage":

View File

@@ -107,6 +107,7 @@ func (a *App) RunInstallToRAM(ctx context.Context, logFunc func(string)) error {
 type satRunner interface {
     RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error)
     RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (string, error)
+    RunNvidiaStressPack(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error)
     RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
     RunStorageAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
     RunCPUAcceptancePack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error)
@@ -508,6 +509,17 @@ func (a *App) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir st
     return ActionResult{Title: "NVIDIA DCGM", Body: body}, err
 }
+func (a *App) RunNvidiaStressPack(baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
+    return a.RunNvidiaStressPackCtx(context.Background(), baseDir, opts, logFunc)
+}
+func (a *App) RunNvidiaStressPackCtx(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
+    if strings.TrimSpace(baseDir) == "" {
+        baseDir = DefaultSATBaseDir
+    }
+    return a.sat.RunNvidiaStressPack(ctx, baseDir, opts, logFunc)
+}
 func (a *App) RunMemoryAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
     return a.RunMemoryAcceptancePackCtx(context.Background(), baseDir, logFunc)
 }

View File

@@ -120,14 +120,15 @@ func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
 }
 type fakeSAT struct {
-    runNvidiaFn      func(string) (string, error)
-    runMemoryFn      func(string) (string, error)
-    runStorageFn     func(string) (string, error)
-    runCPUFn         func(string, int) (string, error)
-    detectVendorFn   func() string
-    listAMDGPUsFn    func() ([]platform.AMDGPUInfo, error)
-    runAMDPackFn     func(string) (string, error)
-    listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error)
+    runNvidiaFn       func(string) (string, error)
+    runNvidiaStressFn func(string, platform.NvidiaStressOptions) (string, error)
+    runMemoryFn       func(string) (string, error)
+    runStorageFn      func(string) (string, error)
+    runCPUFn          func(string, int) (string, error)
+    detectVendorFn    func() string
+    listAMDGPUsFn     func() ([]platform.AMDGPUInfo, error)
+    runAMDPackFn      func(string) (string, error)
+    listNvidiaGPUsFn  func() ([]platform.NvidiaGPU, error)
 }
 func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string, _ func(string)) (string, error) {
@@ -138,6 +139,13 @@ func (f fakeSAT) RunNvidiaAcceptancePackWithOptions(_ context.Context, baseDir s
     return f.runNvidiaFn(baseDir)
 }
+func (f fakeSAT) RunNvidiaStressPack(_ context.Context, baseDir string, opts platform.NvidiaStressOptions, _ func(string)) (string, error) {
+    if f.runNvidiaStressFn != nil {
+        return f.runNvidiaStressFn(baseDir, opts)
+    }
+    return f.runNvidiaFn(baseDir)
+}
 func (f fakeSAT) ListNvidiaGPUs() ([]platform.NvidiaGPU, error) {
     if f.listNvidiaGPUsFn != nil {
         return f.listNvidiaGPUsFn()

View File

@@ -36,6 +36,8 @@ var supportBundleCommands = []struct {
     {name: "system/dmesg-tail.txt", cmd: []string{"sh", "-c", "dmesg | tail -n 200"}},
 }
+const supportBundleGlob = "bee-support-*.tar.gz"
 func BuildSupportBundle(exportDir string) (string, error) {
     exportDir = strings.TrimSpace(exportDir)
     if exportDir == "" {
@@ -86,34 +88,64 @@ func BuildSupportBundle(exportDir string) (string, error) {
     return archivePath, nil
 }
+func LatestSupportBundlePath() (string, error) {
+    return latestSupportBundlePath(os.TempDir())
+}
 func cleanupOldSupportBundles(dir string) error {
-    matches, err := filepath.Glob(filepath.Join(dir, "bee-support-*.tar.gz"))
+    matches, err := filepath.Glob(filepath.Join(dir, supportBundleGlob))
     if err != nil {
         return err
     }
-    type entry struct {
-        path string
-        mod  time.Time
+    entries := supportBundleEntries(matches)
+    for path, mod := range entries {
+        if time.Since(mod) > 24*time.Hour {
+            _ = os.Remove(path)
+            delete(entries, path)
+        }
     }
-    list := make([]entry, 0, len(matches))
+    ordered := orderSupportBundles(entries)
+    if len(ordered) > 3 {
+        for _, old := range ordered[3:] {
+            _ = os.Remove(old)
+        }
+    }
+    return nil
+}
+func latestSupportBundlePath(dir string) (string, error) {
+    matches, err := filepath.Glob(filepath.Join(dir, supportBundleGlob))
+    if err != nil {
+        return "", err
+    }
+    ordered := orderSupportBundles(supportBundleEntries(matches))
+    if len(ordered) == 0 {
+        return "", os.ErrNotExist
+    }
+    return ordered[0], nil
+}
+func supportBundleEntries(matches []string) map[string]time.Time {
+    entries := make(map[string]time.Time, len(matches))
     for _, match := range matches {
         info, err := os.Stat(match)
         if err != nil {
             continue
         }
-        if time.Since(info.ModTime()) > 24*time.Hour {
-            _ = os.Remove(match)
-            continue
-        }
-        list = append(list, entry{path: match, mod: info.ModTime()})
+        entries[match] = info.ModTime()
     }
-    sort.Slice(list, func(i, j int) bool { return list[i].mod.After(list[j].mod) })
-    if len(list) > 3 {
-        for _, old := range list[3:] {
-            _ = os.Remove(old.path)
-        }
+    return entries
+}
+func orderSupportBundles(entries map[string]time.Time) []string {
+    ordered := make([]string, 0, len(entries))
+    for path := range entries {
+        ordered = append(ordered, path)
     }
-    return nil
+    sort.Slice(ordered, func(i, j int) bool {
+        return entries[ordered[i]].After(entries[ordered[j]])
+    })
+    return ordered
 }
 func writeJournalDump(dst string) error {

View File

@@ -0,0 +1,205 @@
package platform
import (
"context"
"fmt"
"sort"
"strconv"
"strings"
)
func (s *System) RunNvidiaStressPack(ctx context.Context, baseDir string, opts NvidiaStressOptions, logFunc func(string)) (string, error) {
normalizeNvidiaStressOptions(&opts)
job, err := buildNvidiaStressJob(opts)
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, nvidiaStressArchivePrefix(opts.Loader), []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-nvidia-smi-list.log", cmd: []string{"nvidia-smi", "-L"}},
job,
{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
}
func nvidiaStressArchivePrefix(loader string) string {
switch strings.TrimSpace(strings.ToLower(loader)) {
case NvidiaStressLoaderJohn:
return "gpu-nvidia-john"
case NvidiaStressLoaderNCCL:
return "gpu-nvidia-nccl"
default:
return "gpu-nvidia-burn"
}
}
func buildNvidiaStressJob(opts NvidiaStressOptions) (satJob, error) {
selected, err := resolveNvidiaGPUSelection(opts.GPUIndices, opts.ExcludeGPUIndices)
if err != nil {
return satJob{}, err
}
loader := strings.TrimSpace(strings.ToLower(opts.Loader))
switch loader {
case "", NvidiaStressLoaderBuiltin:
cmd := []string{
"bee-gpu-burn",
"--seconds", strconv.Itoa(opts.DurationSec),
"--size-mb", strconv.Itoa(opts.SizeMB),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-bee-gpu-burn.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
case NvidiaStressLoaderJohn:
cmd := []string{
"bee-john-gpu-stress",
"--seconds", strconv.Itoa(opts.DurationSec),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-john-gpu-stress.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
case NvidiaStressLoaderNCCL:
cmd := []string{
"bee-nccl-gpu-stress",
"--seconds", strconv.Itoa(opts.DurationSec),
}
if len(selected) > 0 {
cmd = append(cmd, "--devices", joinIndexList(selected))
}
return satJob{
name: "03-bee-nccl-gpu-stress.log",
cmd: cmd,
collectGPU: true,
gpuIndices: selected,
}, nil
default:
return satJob{}, fmt.Errorf("unknown NVIDIA stress loader %q", opts.Loader)
}
}
func normalizeNvidiaStressOptions(opts *NvidiaStressOptions) {
if opts.DurationSec <= 0 {
opts.DurationSec = 300
}
if opts.SizeMB <= 0 {
opts.SizeMB = 64
}
switch strings.TrimSpace(strings.ToLower(opts.Loader)) {
case "", NvidiaStressLoaderBuiltin:
opts.Loader = NvidiaStressLoaderBuiltin
case NvidiaStressLoaderJohn:
opts.Loader = NvidiaStressLoaderJohn
case NvidiaStressLoaderNCCL:
opts.Loader = NvidiaStressLoaderNCCL
default:
opts.Loader = NvidiaStressLoaderBuiltin
}
opts.GPUIndices = dedupeSortedIndices(opts.GPUIndices)
opts.ExcludeGPUIndices = dedupeSortedIndices(opts.ExcludeGPUIndices)
}
func resolveNvidiaGPUSelection(include, exclude []int) ([]int, error) {
all, err := listNvidiaGPUIndices()
if err != nil {
return nil, err
}
if len(all) == 0 {
return nil, fmt.Errorf("nvidia-smi found no NVIDIA GPUs")
}
selected := all
if len(include) > 0 {
want := make(map[int]struct{}, len(include))
for _, idx := range include {
want[idx] = struct{}{}
}
selected = selected[:0]
for _, idx := range all {
if _, ok := want[idx]; ok {
selected = append(selected, idx)
}
}
}
if len(exclude) > 0 {
skip := make(map[int]struct{}, len(exclude))
for _, idx := range exclude {
skip[idx] = struct{}{}
}
filtered := selected[:0]
for _, idx := range selected {
if _, ok := skip[idx]; ok {
continue
}
filtered = append(filtered, idx)
}
selected = filtered
}
if len(selected) == 0 {
return nil, fmt.Errorf("no NVIDIA GPUs selected after applying filters")
}
out := append([]int(nil), selected...)
sort.Ints(out)
return out, nil
}
func listNvidiaGPUIndices() ([]int, error) {
out, err := satExecCommand("nvidia-smi", "--query-gpu=index", "--format=csv,noheader,nounits").Output()
if err != nil {
return nil, fmt.Errorf("nvidia-smi: %w", err)
}
var indices []int
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
line = strings.TrimSpace(line)
if line == "" {
continue
}
idx, err := strconv.Atoi(line)
if err != nil {
continue
}
indices = append(indices, idx)
}
return dedupeSortedIndices(indices), nil
}
func dedupeSortedIndices(values []int) []int {
if len(values) == 0 {
return nil
}
seen := make(map[int]struct{}, len(values))
out := make([]int, 0, len(values))
for _, value := range values {
if value < 0 {
continue
}
if _, ok := seen[value]; ok {
continue
}
seen[value] = struct{}{}
out = append(out, value)
}
sort.Ints(out)
return out
}
func joinIndexList(values []int) string {
parts := make([]string, 0, len(values))
for _, value := range values {
parts = append(parts, strconv.Itoa(value))
}
return strings.Join(parts, ",")
}

View File

@@ -10,9 +10,11 @@ import (
     "os"
     "os/exec"
     "path/filepath"
+    "runtime"
     "strconv"
     "strings"
     "sync"
+    "syscall"
     "time"
 )
@@ -374,10 +376,17 @@ func buildCPUStressCmd(ctx context.Context) (*exec.Cmd, error) {
         return nil, fmt.Errorf("stressapptest not found: %w", err)
     }
     // Use a very long duration; the context timeout will kill it at the right time.
-    cmd := exec.CommandContext(ctx, path, "-s", "86400", "-W", "--cc_test")
+    cmdArgs := []string{"-s", "86400", "-W", "--cc_test"}
+    if threads := platformStressCPUThreads(); threads > 0 {
+        cmdArgs = append(cmdArgs, "-m", strconv.Itoa(threads))
+    }
+    if mb := platformStressMemoryMB(); mb > 0 {
+        cmdArgs = append(cmdArgs, "-M", strconv.Itoa(mb))
+    }
+    cmd := exec.CommandContext(ctx, path, cmdArgs...)
     cmd.Stdout = nil
     cmd.Stderr = nil
-    if err := cmd.Start(); err != nil {
+    if err := startLowPriorityCmd(cmd, 15); err != nil {
         return nil, fmt.Errorf("stressapptest start: %w", err)
     }
     return cmd, nil
@@ -418,22 +427,65 @@ func buildAMDGPUStressCmd(ctx context.Context) *exec.Cmd {
     cmd := exec.CommandContext(ctx, rvsPath, "-c", cfgFile)
     cmd.Stdout = nil
     cmd.Stderr = nil
-    _ = cmd.Start()
+    _ = startLowPriorityCmd(cmd, 10)
     return cmd
 }
 func buildNvidiaGPUStressCmd(ctx context.Context) *exec.Cmd {
-    path, err := satLookPath("bee-gpu-stress")
+    path, err := satLookPath("bee-gpu-burn")
+    if err != nil {
+        path, err = satLookPath("bee-gpu-stress")
+    }
     if err != nil {
         return nil
     }
     cmd := exec.CommandContext(ctx, path, "--seconds", "86400", "--size-mb", "64")
     cmd.Stdout = nil
     cmd.Stderr = nil
-    _ = cmd.Start()
+    _ = startLowPriorityCmd(cmd, 10)
     return cmd
 }
+func startLowPriorityCmd(cmd *exec.Cmd, nice int) error {
+    if err := cmd.Start(); err != nil {
+        return err
+    }
+    if cmd.Process != nil {
+        _ = syscall.Setpriority(syscall.PRIO_PROCESS, cmd.Process.Pid, nice)
+    }
+    return nil
+}
+func platformStressCPUThreads() int {
+    if n := envInt("BEE_PLATFORM_STRESS_THREADS", 0); n > 0 {
+        return n
+    }
+    cpus := runtime.NumCPU()
+    switch {
+    case cpus <= 2:
+        return 1
+    case cpus <= 8:
+        return cpus - 1
+    default:
+        return cpus - 2
+    }
+}
+func platformStressMemoryMB() int {
+    if mb := envInt("BEE_PLATFORM_STRESS_MB", 0); mb > 0 {
+        return mb
+    }
+    free := freeMemBytes()
+    if free <= 0 {
+        return 0
+    }
+    mb := int((free * 60) / 100 / (1024 * 1024))
+    if mb < 1024 {
+        return 1024
+    }
+    return mb
+}
 func packPlatformDir(dir, dest string) error {
     f, err := os.Create(dest)
     if err != nil {

View File

@@ -0,0 +1,34 @@
package platform
import (
"runtime"
"testing"
)
func TestPlatformStressCPUThreadsOverride(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_THREADS", "7")
if got := platformStressCPUThreads(); got != 7 {
t.Fatalf("platformStressCPUThreads=%d want 7", got)
}
}
func TestPlatformStressCPUThreadsDefaultLeavesHeadroom(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_THREADS", "")
got := platformStressCPUThreads()
if got < 1 {
t.Fatalf("platformStressCPUThreads=%d want >= 1", got)
}
if got > runtime.NumCPU() {
t.Fatalf("platformStressCPUThreads=%d want <= NumCPU=%d", got, runtime.NumCPU())
}
if runtime.NumCPU() > 2 && got >= runtime.NumCPU() {
t.Fatalf("platformStressCPUThreads=%d want headroom below NumCPU=%d", got, runtime.NumCPU())
}
}
func TestPlatformStressMemoryMBOverride(t *testing.T) {
t.Setenv("BEE_PLATFORM_STRESS_MB", "8192")
if got := platformStressMemoryMB(); got != 8192 {
t.Fatalf("platformStressMemoryMB=%d want 8192", got)
}
}

View File

@@ -136,7 +136,10 @@ func (s *System) runtimeToolStatuses(vendor string) []ToolStatus {
         tools = append(tools, s.CheckTools([]string{
             "nvidia-smi",
             "nvidia-bug-report.sh",
-            "bee-gpu-stress",
+            "bee-gpu-burn",
+            "bee-john-gpu-stress",
+            "bee-nccl-gpu-stress",
+            "all_reduce_perf",
         })...)
     case "amd":
         tool := ToolStatus{Name: "rocm-smi"}
@@ -176,8 +179,8 @@ func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHe
             health.DriverReady = true
         }
-        if lookErr := exec.Command("sh", "-c", "command -v bee-gpu-stress >/dev/null 2>&1").Run(); lookErr == nil {
-            out, err := exec.Command("bee-gpu-stress", "--seconds", "1", "--size-mb", "1").CombinedOutput()
+        if _, lookErr := exec.LookPath("bee-gpu-burn"); lookErr == nil {
+            out, err := exec.Command("bee-gpu-burn", "--seconds", "1", "--size-mb", "1").CombinedOutput()
             if err == nil {
                 health.CUDAReady = true
             } else if strings.Contains(strings.ToLower(string(out)), "cuda_error_system_not_ready") {

View File

@@ -425,14 +425,12 @@ type satStats struct {
 }
 func nvidiaSATJobs() []satJob {
-    seconds := envInt("BEE_GPU_STRESS_SECONDS", 5)
-    sizeMB := envInt("BEE_GPU_STRESS_SIZE_MB", 64)
     return []satJob{
         {name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
         {name: "02-dmidecode-baseboard.log", cmd: []string{"dmidecode", "-t", "baseboard"}},
         {name: "03-dmidecode-system.log", cmd: []string{"dmidecode", "-t", "system"}},
         {name: "04-nvidia-bug-report.log", cmd: []string{"nvidia-bug-report.sh", "--output-file", "{{run_dir}}/nvidia-bug-report.log"}},
-        {name: "05-bee-gpu-stress.log", cmd: []string{"bee-gpu-stress", "--seconds", fmt.Sprintf("%d", seconds), "--size-mb", fmt.Sprintf("%d", sizeMB)}},
+        {name: "05-bee-gpu-burn.log", cmd: []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}},
     }
 }
@@ -686,7 +684,11 @@ func resolveSATCommand(cmd []string) ([]string, error) {
     case "rvs":
         return resolveRVSCommand(cmd[1:]...)
     }
-    return cmd, nil
+    path, err := satLookPath(cmd[0])
+    if err != nil {
+        return nil, fmt.Errorf("%s not found in PATH: %w", cmd[0], err)
+    }
+    return append([]string{path}, cmd[1:]...), nil
 }
 func resolveRVSCommand(args ...string) ([]string, error) {

View File

@@ -130,26 +130,21 @@ func (s *System) RunFanStressTest(ctx context.Context, baseDir string, opts FanS
         stats.OK++
     }
-    // loadPhase runs bee-gpu-stress for durSec; sampler stamps phaseName on each row.
+    // loadPhase runs bee-gpu-burn for durSec; sampler stamps phaseName on each row.
     loadPhase := func(phaseName, stepName string, durSec int) {
         if ctx.Err() != nil {
             return
         }
         setPhase(phaseName)
-        var env []string
-        if len(opts.GPUIndices) > 0 {
-            ids := make([]string, len(opts.GPUIndices))
-            for i, idx := range opts.GPUIndices {
-                ids[i] = strconv.Itoa(idx)
-            }
-            env = []string{"CUDA_VISIBLE_DEVICES=" + strings.Join(ids, ",")}
-        }
         cmd := []string{
-            "bee-gpu-stress",
+            "bee-gpu-burn",
             "--seconds", strconv.Itoa(durSec),
             "--size-mb", strconv.Itoa(opts.SizeMB),
         }
-        out, err := runSATCommandCtx(ctx, verboseLog, stepName, cmd, env, nil)
+        if len(opts.GPUIndices) > 0 {
+            cmd = append(cmd, "--devices", joinIndexList(dedupeSortedIndices(opts.GPUIndices)))
+        }
+        out, err := runSATCommandCtx(ctx, verboseLog, stepName, cmd, nil, nil)
         _ = os.WriteFile(filepath.Join(runDir, stepName+".log"), out, 0644)
         if err != nil && err != context.Canceled && err.Error() != "signal: killed" {
             fmt.Fprintf(&summary, "%s_status=FAILED\n", stepName)
@@ -323,8 +318,9 @@ func sampleFanSpeeds() ([]FanReading, error) {
 // parseFanSpeeds parses "ipmitool sdr type Fan" output.
 // Handles two formats:
-// Old: "FAN1 | 2400.000 | RPM | ok" (value in col[1], unit in col[2])
-// New: "FAN1 | 41h | ok | 29.1 | 4340 RPM" (value+unit combined in last col)
+//
+// Old: "FAN1 | 2400.000 | RPM | ok" (value in col[1], unit in col[2])
+// New: "FAN1 | 41h | ok | 29.1 | 4340 RPM" (value+unit combined in last col)
 func parseFanSpeeds(raw string) []FanReading {
     var fans []FanReading
     for _, line := range strings.Split(strings.TrimSpace(raw), "\n") {

View File

@@ -31,8 +31,8 @@ func TestRunNvidiaAcceptancePackIncludesGPUStress(t *testing.T) {
     if len(jobs) != 5 {
         t.Fatalf("jobs=%d want 5", len(jobs))
     }
-    if got := jobs[4].cmd[0]; got != "bee-gpu-stress" {
-        t.Fatalf("gpu stress command=%q want bee-gpu-stress", got)
+    if got := jobs[4].cmd[0]; got != "bee-gpu-burn" {
+        t.Fatalf("gpu stress command=%q want bee-gpu-burn", got)
     }
     if got := jobs[3].cmd[1]; got != "--output-file" {
         t.Fatalf("bug report flag=%q want --output-file", got)
@@ -80,13 +80,10 @@ func TestAMDStressJobsIncludeBandwidthAndGST(t *testing.T) {
     }
 }
-func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
-    t.Setenv("BEE_GPU_STRESS_SECONDS", "9")
-    t.Setenv("BEE_GPU_STRESS_SIZE_MB", "96")
+func TestNvidiaSATJobsUseBuiltinBurnDefaults(t *testing.T) {
     jobs := nvidiaSATJobs()
     got := jobs[4].cmd
-    want := []string{"bee-gpu-stress", "--seconds", "9", "--size-mb", "96"}
+    want := []string{"bee-gpu-burn", "--seconds", "5", "--size-mb", "64"}
     if len(got) != len(want) {
         t.Fatalf("cmd len=%d want %d", len(got), len(want))
     }
@@ -97,6 +94,93 @@ func TestNvidiaSATJobsUseEnvOverrides(t *testing.T) {
     }
 }
+func TestBuildNvidiaStressJobUsesSelectedLoaderAndDevices(t *testing.T) {
+    t.Parallel()
+    oldExecCommand := satExecCommand
+    satExecCommand = func(name string, args ...string) *exec.Cmd {
+        if name == "nvidia-smi" {
+            return exec.Command("sh", "-c", "printf '0\n1\n2\n'")
+        }
+        return exec.Command(name, args...)
+    }
+    t.Cleanup(func() { satExecCommand = oldExecCommand })
+    job, err := buildNvidiaStressJob(NvidiaStressOptions{
+        DurationSec:       600,
+        Loader:            NvidiaStressLoaderJohn,
+        ExcludeGPUIndices: []int{1},
+    })
+    if err != nil {
+        t.Fatalf("buildNvidiaStressJob error: %v", err)
+    }
+    wantCmd := []string{"bee-john-gpu-stress", "--seconds", "600", "--devices", "0,2"}
+    if len(job.cmd) != len(wantCmd) {
+        t.Fatalf("cmd len=%d want %d (%v)", len(job.cmd), len(wantCmd), job.cmd)
+    }
+    for i := range wantCmd {
+        if job.cmd[i] != wantCmd[i] {
+            t.Fatalf("cmd[%d]=%q want %q", i, job.cmd[i], wantCmd[i])
+        }
+    }
+    if got := joinIndexList(job.gpuIndices); got != "0,2" {
+        t.Fatalf("gpuIndices=%q want 0,2", got)
+    }
+}
+func TestBuildNvidiaStressJobUsesNCCLLoader(t *testing.T) {
+    t.Parallel()
+    oldExecCommand := satExecCommand
+    satExecCommand = func(name string, args ...string) *exec.Cmd {
+        if name == "nvidia-smi" {
+            return exec.Command("sh", "-c", "printf '0\n1\n2\n'")
+        }
+        return exec.Command(name, args...)
+    }
+    t.Cleanup(func() { satExecCommand = oldExecCommand })
+    job, err := buildNvidiaStressJob(NvidiaStressOptions{
+        DurationSec: 120,
+        Loader:      NvidiaStressLoaderNCCL,
+        GPUIndices:  []int{2, 0},
+    })
+    if err != nil {
+        t.Fatalf("buildNvidiaStressJob error: %v", err)
+    }
+    wantCmd := []string{"bee-nccl-gpu-stress", "--seconds", "120", "--devices", "0,2"}
+    if len(job.cmd) != len(wantCmd) {
+        t.Fatalf("cmd len=%d want %d (%v)", len(job.cmd), len(wantCmd), job.cmd)
+    }
+    for i := range wantCmd {
+        if job.cmd[i] != wantCmd[i] {
+            t.Fatalf("cmd[%d]=%q want %q", i, job.cmd[i], wantCmd[i])
+        }
+    }
+    if got := joinIndexList(job.gpuIndices); got != "0,2" {
+        t.Fatalf("gpuIndices=%q want 0,2", got)
+    }
+}
+func TestNvidiaStressArchivePrefixByLoader(t *testing.T) {
+    t.Parallel()
+    tests := []struct {
+        loader string
+        want   string
+    }{
+        {loader: NvidiaStressLoaderBuiltin, want: "gpu-nvidia-burn"},
+        {loader: NvidiaStressLoaderJohn, want: "gpu-nvidia-john"},
+        {loader: NvidiaStressLoaderNCCL, want: "gpu-nvidia-nccl"},
+        {loader: "", want: "gpu-nvidia-burn"},
+    }
+    for _, tt := range tests {
+        if got := nvidiaStressArchivePrefix(tt.loader); got != tt.want {
+            t.Fatalf("loader=%q prefix=%q want %q", tt.loader, got, tt.want)
+        }
+    }
+}
 func TestEnvIntFallback(t *testing.T) {
     os.Unsetenv("BEE_MEMTESTER_SIZE_MB")
     if got := envInt("BEE_MEMTESTER_SIZE_MB", 123); got != 123 {
@@ -122,8 +206,8 @@ func TestClassifySATResult(t *testing.T) {
     }{
         {name: "ok", job: "memtester", out: "done", err: nil, status: "OK"},
         {name: "unsupported", job: "smartctl-self-test-short", out: "Self-test not supported", err: errors.New("rc 1"), status: "UNSUPPORTED"},
{name: "failed", job: "bee-gpu-stress", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"}, {name: "failed", job: "bee-gpu-burn", out: "cuda error", err: errors.New("rc 1"), status: "FAILED"},
{name: "cuda not ready", job: "bee-gpu-stress", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"}, {name: "cuda not ready", job: "bee-gpu-burn", out: "cuInit failed: CUDA_ERROR_SYSTEM_NOT_READY", err: errors.New("rc 1"), status: "UNSUPPORTED"},
} }
for _, tt := range tests { for _, tt := range tests {
@@ -172,6 +256,44 @@ func TestResolveROCmSMICommandFromPATH(t *testing.T) {
} }
} }
func TestResolveSATCommandUsesLookPathForGenericTools(t *testing.T) {
oldLookPath := satLookPath
satLookPath = func(file string) (string, error) {
if file == "stress-ng" {
return "/usr/bin/stress-ng", nil
}
return "", exec.ErrNotFound
}
t.Cleanup(func() { satLookPath = oldLookPath })
cmd, err := resolveSATCommand([]string{"stress-ng", "--cpu", "0"})
if err != nil {
t.Fatalf("resolveSATCommand error: %v", err)
}
if len(cmd) != 3 {
t.Fatalf("cmd len=%d want 3 (%v)", len(cmd), cmd)
}
if cmd[0] != "/usr/bin/stress-ng" {
t.Fatalf("cmd[0]=%q want /usr/bin/stress-ng", cmd[0])
}
}
func TestResolveSATCommandFailsForMissingGenericTool(t *testing.T) {
oldLookPath := satLookPath
satLookPath = func(file string) (string, error) {
return "", exec.ErrNotFound
}
t.Cleanup(func() { satLookPath = oldLookPath })
_, err := resolveSATCommand([]string{"stress-ng", "--cpu", "0"})
if err == nil {
t.Fatal("expected error")
}
if !strings.Contains(err.Error(), "stress-ng not found in PATH") {
t.Fatalf("error=%q", err)
}
}
func TestResolveROCmSMICommandFallsBackToROCmTree(t *testing.T) { func TestResolveROCmSMICommandFallsBackToROCmTree(t *testing.T) {
tmp := t.TempDir() tmp := t.TempDir()
execPath := filepath.Join(tmp, "opt", "rocm", "bin", "rocm-smi") execPath := filepath.Join(tmp, "opt", "rocm", "bin", "rocm-smi")

@@ -51,6 +51,20 @@ type ToolStatus struct {
OK bool OK bool
} }
const (
NvidiaStressLoaderBuiltin = "builtin"
NvidiaStressLoaderJohn = "john"
NvidiaStressLoaderNCCL = "nccl"
)
type NvidiaStressOptions struct {
DurationSec int
SizeMB int
Loader string
GPUIndices []int
ExcludeGPUIndices []int
}
func New() *System { func New() *System {
return &System{} return &System{}
} }
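
The tests above expect `--devices 0,2` in two cases: when GPU 1 is excluded from three detected GPUs, and when `GPUIndices` is passed as `{2, 0}`. This part of the diff does not show the selection logic itself, so the following is only a sketch of one way to get that behavior. `selectGPUIndices` and its parameter names are hypothetical, not the project's API.

```go
package main

import (
	"fmt"
	"sort"
)

// selectGPUIndices is a hypothetical helper (not taken from the diff).
// It starts from the explicit include list when one is given, otherwise
// from all detected GPUs, then drops excluded, negative, and duplicate
// indices and returns the result in ascending order.
func selectGPUIndices(detected, include, exclude []int) []int {
	base := detected
	if len(include) > 0 {
		base = include
	}
	skip := make(map[int]bool, len(exclude))
	for _, idx := range exclude {
		skip[idx] = true
	}
	seen := make(map[int]bool, len(base))
	var out []int
	for _, idx := range base {
		if idx < 0 || skip[idx] || seen[idx] {
			continue
		}
		seen[idx] = true
		out = append(out, idx)
	}
	sort.Ints(out)
	return out
}

func main() {
	// Three detected GPUs with index 1 excluded.
	fmt.Println(selectGPUIndices([]int{0, 1, 2}, nil, []int{1}))
	// Explicit, unsorted include list.
	fmt.Println(selectGPUIndices([]int{0, 1, 2}, []int{2, 0}, nil))
}
```

Sorting the result keeps the generated command line stable regardless of the order in which the caller lists the indices, which is what the `{2, 0}` test case appears to rely on.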

@@ -2,21 +2,26 @@ package webui
import ( import (
"bufio" "bufio"
"context"
"encoding/json" "encoding/json"
"errors"
"fmt" "fmt"
"io" "io"
"net/http" "net/http"
"os"
"os/exec" "os/exec"
"path/filepath" "path/filepath"
"regexp"
"strings" "strings"
"sync/atomic" "sync/atomic"
"syscall"
"time" "time"
"bee/audit/internal/app" "bee/audit/internal/app"
"bee/audit/internal/platform" "bee/audit/internal/platform"
) )
var ansiEscapeRE = regexp.MustCompile(`\x1b\[[0-9;]*[a-zA-Z]|\x1b[()][A-Z0-9]|\x1b[DABC]`)
// ── Job ID counter ──────────────────────────────────────────────────────────── // ── Job ID counter ────────────────────────────────────────────────────────────
var jobCounter atomic.Uint64 var jobCounter atomic.Uint64
@@ -81,31 +86,54 @@ func streamJob(w http.ResponseWriter, r *http.Request, j *jobState) {
} }
} }
// runCmdJob runs an exec.Cmd as a background job, streaming stdout+stderr lines. // streamCmdJob runs an exec.Cmd and streams stdout+stderr lines into j.
func runCmdJob(j *jobState, cmd *exec.Cmd) { func streamCmdJob(j *jobState, cmd *exec.Cmd) error {
pr, pw := io.Pipe() pr, pw := io.Pipe()
cmd.Stdout = pw cmd.Stdout = pw
cmd.Stderr = pw cmd.Stderr = pw
if err := cmd.Start(); err != nil { if err := cmd.Start(); err != nil {
j.finish(err.Error()) _ = pw.Close()
return _ = pr.Close()
return err
}
// Lower the CPU scheduling priority of stress/audit subprocesses to nice+10
// so the X server and kernel interrupt handling remain responsive under load
// (prevents KVM/IPMI graphical console from freezing during GPU stress tests).
if cmd.Process != nil {
_ = syscall.Setpriority(syscall.PRIO_PROCESS, cmd.Process.Pid, 10)
} }
scanDone := make(chan error, 1)
go func() { go func() {
scanner := bufio.NewScanner(pr) scanner := bufio.NewScanner(pr)
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
for scanner.Scan() { for scanner.Scan() {
j.append(scanner.Text()) // Split on \r to handle progress-bar style output (e.g. \r overwrites)
// and strip ANSI escape codes so logs are readable in the browser.
parts := strings.Split(scanner.Text(), "\r")
for _, part := range parts {
line := ansiEscapeRE.ReplaceAllString(part, "")
if line != "" {
j.append(line)
}
}
} }
if err := scanner.Err(); err != nil && !errors.Is(err, io.ErrClosedPipe) {
scanDone <- err
return
}
scanDone <- nil
}() }()
err := cmd.Wait() err := cmd.Wait()
_ = pw.Close() _ = pw.Close()
scanErr := <-scanDone
_ = pr.Close()
if err != nil { if err != nil {
j.finish(err.Error()) return err
} else {
j.finish("")
} }
return scanErr
} }
// ── Audit ───────────────────────────────────────────────────────────────────── // ── Audit ─────────────────────────────────────────────────────────────────────
@@ -153,20 +181,22 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
} }
var body struct { var body struct {
Duration int `json:"duration"` Duration int `json:"duration"`
DiagLevel int `json:"diag_level"` DiagLevel int `json:"diag_level"`
GPUIndices []int `json:"gpu_indices"` GPUIndices []int `json:"gpu_indices"`
Profile string `json:"profile"` ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
DisplayName string `json:"display_name"` Loader string `json:"loader"`
Profile string `json:"profile"`
DisplayName string `json:"display_name"`
} }
if r.ContentLength > 0 { if r.Body != nil {
_ = json.NewDecoder(r.Body).Decode(&body) if err := json.NewDecoder(r.Body).Decode(&body); err != nil && !errors.Is(err, io.EOF) {
writeError(w, http.StatusBadRequest, "invalid request body")
return
}
} }
name := taskNames[target] name := taskDisplayName(target, body.Profile, body.Loader)
if name == "" {
name = target
}
t := &Task{ t := &Task{
ID: newJobID("sat-" + target), ID: newJobID("sat-" + target),
Name: name, Name: name,
@@ -174,11 +204,13 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
Status: TaskPending, Status: TaskPending,
CreatedAt: time.Now(), CreatedAt: time.Now(),
params: taskParams{ params: taskParams{
Duration: body.Duration, Duration: body.Duration,
DiagLevel: body.DiagLevel, DiagLevel: body.DiagLevel,
GPUIndices: body.GPUIndices, GPUIndices: body.GPUIndices,
BurnProfile: body.Profile, ExcludeGPUIndices: body.ExcludeGPUIndices,
DisplayName: body.DisplayName, Loader: body.Loader,
BurnProfile: body.Profile,
DisplayName: body.DisplayName,
}, },
} }
if strings.TrimSpace(body.DisplayName) != "" { if strings.TrimSpace(body.DisplayName) != "" {
@@ -393,16 +425,76 @@ func (h *handler) handleAPIExportList(w http.ResponseWriter, r *http.Request) {
} }
func (h *handler) handleAPIExportBundle(w http.ResponseWriter, r *http.Request) { func (h *handler) handleAPIExportBundle(w http.ResponseWriter, r *http.Request) {
archive, err := app.BuildSupportBundle(h.opts.ExportDir) if globalQueue.hasActiveTarget("support-bundle") {
writeError(w, http.StatusConflict, "support bundle task is already pending or running")
return
}
t := &Task{
ID: newJobID("support-bundle"),
Name: "Support Bundle",
Target: "support-bundle",
Status: TaskPending,
CreatedAt: time.Now(),
}
globalQueue.enqueue(t)
writeJSON(w, map[string]string{
"status": "queued",
"task_id": t.ID,
"job_id": t.ID,
"url": "/export/support.tar.gz",
})
}
func (h *handler) handleAPIExportUSBTargets(w http.ResponseWriter, _ *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
targets, err := h.opts.App.ListRemovableTargets()
if err != nil { if err != nil {
writeError(w, http.StatusInternalServerError, err.Error()) writeError(w, http.StatusInternalServerError, err.Error())
return return
} }
writeJSON(w, map[string]string{ if targets == nil {
"status": "ok", targets = []platform.RemovableTarget{}
"path": archive, }
"url": "/export/support.tar.gz", writeJSON(w, targets)
}) }
func (h *handler) handleAPIExportUSBAudit(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var target platform.RemovableTarget
if err := json.NewDecoder(r.Body).Decode(&target); err != nil || target.Device == "" {
writeError(w, http.StatusBadRequest, "device is required")
return
}
result, err := h.opts.App.ExportLatestAuditResult(target)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "message": result.Body})
}
func (h *handler) handleAPIExportUSBBundle(w http.ResponseWriter, r *http.Request) {
if h.opts.App == nil {
writeError(w, http.StatusServiceUnavailable, "app not configured")
return
}
var target platform.RemovableTarget
if err := json.NewDecoder(r.Body).Decode(&target); err != nil || target.Device == "" {
writeError(w, http.StatusBadRequest, "device is required")
return
}
result, err := h.opts.App.ExportSupportBundleResult(target)
if err != nil {
writeError(w, http.StatusInternalServerError, err.Error())
return
}
writeJSON(w, map[string]string{"status": "ok", "message": result.Body})
} }
// ── GPU presence ────────────────────────────────────────────────────────────── // ── GPU presence ──────────────────────────────────────────────────────────────
@@ -437,10 +529,7 @@ func (h *handler) handleAPIInstallToRAM(w http.ResponseWriter, r *http.Request)
writeError(w, http.StatusServiceUnavailable, "app not configured") writeError(w, http.StatusServiceUnavailable, "app not configured")
return return
} }
h.installMu.Lock() if globalQueue.hasActiveTarget("install") {
installRunning := h.installJob != nil && !h.installJob.isDone()
h.installMu.Unlock()
if installRunning {
writeError(w, http.StatusConflict, "install to disk is already running") writeError(w, http.StatusConflict, "install to disk is already running")
return return
} }
@@ -555,39 +644,43 @@ func (h *handler) handleAPIInstallRun(w http.ResponseWriter, r *http.Request) {
writeError(w, http.StatusConflict, "install to RAM task is already pending or running") writeError(w, http.StatusConflict, "install to RAM task is already pending or running")
return return
} }
if globalQueue.hasActiveTarget("install") {
h.installMu.Lock() writeError(w, http.StatusConflict, "install task is already pending or running")
if h.installJob != nil && !h.installJob.isDone() {
h.installMu.Unlock()
writeError(w, http.StatusConflict, "install already running")
return return
} }
j := &jobState{} t := &Task{
h.installJob = j ID: newJobID("install"),
h.installMu.Unlock() Name: "Install to Disk",
Target: "install",
logFile := platform.InstallLogPath(req.Device) Priority: 20,
go runCmdJob(j, exec.CommandContext(context.Background(), "bee-install", req.Device, logFile)) Status: TaskPending,
CreatedAt: time.Now(),
w.WriteHeader(http.StatusNoContent) params: taskParams{
} Device: req.Device,
},
func (h *handler) handleAPIInstallStream(w http.ResponseWriter, r *http.Request) {
h.installMu.Lock()
j := h.installJob
h.installMu.Unlock()
if j == nil {
if !sseStart(w) {
return
}
sseWrite(w, "done", "")
return
} }
streamJob(w, r, j) globalQueue.enqueue(t)
writeJSON(w, map[string]string{"task_id": t.ID, "job_id": t.ID})
} }
// ── Metrics SSE ─────────────────────────────────────────────────────────────── // ── Metrics SSE ───────────────────────────────────────────────────────────────
func (h *handler) handleAPIMetricsLatest(w http.ResponseWriter, r *http.Request) {
sample, ok := h.latestMetric()
if !ok {
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write([]byte("{}"))
return
}
b, err := json.Marshal(sample)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write(b)
}
func (h *handler) handleAPIMetricsStream(w http.ResponseWriter, r *http.Request) { func (h *handler) handleAPIMetricsStream(w http.ResponseWriter, r *http.Request) {
if !sseStart(w) { if !sseStart(w) {
return return
@@ -790,3 +883,108 @@ func (h *handler) rollbackPendingNetworkChange() error {
} }
return nil return nil
} }
// ── Display / Screen Resolution ───────────────────────────────────────────────
type displayMode struct {
Output string `json:"output"`
Mode string `json:"mode"`
Current bool `json:"current"`
}
type displayInfo struct {
Output string `json:"output"`
Modes []displayMode `json:"modes"`
Current string `json:"current"`
}
var xrandrOutputRE = regexp.MustCompile(`^(\S+)\s+connected`)
var xrandrModeRE = regexp.MustCompile(`^\s{3}(\d+x\d+)\s`)
var xrandrCurrentRE = regexp.MustCompile(`\*`)
func parseXrandrOutput(out string) []displayInfo {
var infos []displayInfo
var cur *displayInfo
for _, line := range strings.Split(out, "\n") {
if m := xrandrOutputRE.FindStringSubmatch(line); m != nil {
if cur != nil {
infos = append(infos, *cur)
}
cur = &displayInfo{Output: m[1]}
continue
}
if cur == nil {
continue
}
if m := xrandrModeRE.FindStringSubmatch(line); m != nil {
isCurrent := xrandrCurrentRE.MatchString(line)
mode := displayMode{Output: cur.Output, Mode: m[1], Current: isCurrent}
cur.Modes = append(cur.Modes, mode)
if isCurrent {
cur.Current = m[1]
}
}
}
if cur != nil {
infos = append(infos, *cur)
}
return infos
}
func xrandrCommand(args ...string) *exec.Cmd {
cmd := exec.Command("xrandr", args...)
env := append([]string{}, os.Environ()...)
hasDisplay := false
hasXAuthority := false
for _, kv := range env {
if strings.HasPrefix(kv, "DISPLAY=") && strings.TrimPrefix(kv, "DISPLAY=") != "" {
hasDisplay = true
}
if strings.HasPrefix(kv, "XAUTHORITY=") && strings.TrimPrefix(kv, "XAUTHORITY=") != "" {
hasXAuthority = true
}
}
if !hasDisplay {
env = append(env, "DISPLAY=:0")
}
if !hasXAuthority {
env = append(env, "XAUTHORITY=/home/bee/.Xauthority")
}
cmd.Env = env
return cmd
}
func (h *handler) handleAPIDisplayResolutions(w http.ResponseWriter, _ *http.Request) {
out, err := xrandrCommand().Output()
if err != nil {
writeError(w, http.StatusInternalServerError, "xrandr: "+err.Error())
return
}
writeJSON(w, parseXrandrOutput(string(out)))
}
func (h *handler) handleAPIDisplaySet(w http.ResponseWriter, r *http.Request) {
var req struct {
Output string `json:"output"`
Mode string `json:"mode"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Output == "" || req.Mode == "" {
writeError(w, http.StatusBadRequest, "output and mode are required")
return
}
// Validate mode looks like WxH to prevent injection
if !regexp.MustCompile(`^\d+x\d+$`).MatchString(req.Mode) {
writeError(w, http.StatusBadRequest, "invalid mode format")
return
}
// Validate output name (no special chars)
if !regexp.MustCompile(`^[A-Za-z0-9_\-]+$`).MatchString(req.Output) {
writeError(w, http.StatusBadRequest, "invalid output name")
return
}
if out, err := xrandrCommand("--output", req.Output, "--mode", req.Mode).CombinedOutput(); err != nil {
writeError(w, http.StatusInternalServerError, "xrandr: "+strings.TrimSpace(string(out)))
return
}
writeJSON(w, map[string]string{"status": "ok", "output": req.Output, "mode": req.Mode})
}
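
The line clean-up that `streamCmdJob` performs earlier in this file (split on `\r`, strip ANSI escapes, drop empty fragments) can be tried in isolation. The sketch below reuses the regular expression from the diff; `cleanLogLines` is a hypothetical name introduced only for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as ansiEscapeRE in the diff: CSI sequences, charset
// selection sequences, and a few single-letter escapes.
var ansiEscapeRE = regexp.MustCompile(`\x1b\[[0-9;]*[a-zA-Z]|\x1b[()][A-Z0-9]|\x1b[DABC]`)

// cleanLogLines is a hypothetical helper. It splits one scanned line on
// carriage returns (progress-bar style overwrites), strips ANSI escape
// codes from each fragment, and skips fragments that end up empty.
func cleanLogLines(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, "\r") {
		line := ansiEscapeRE.ReplaceAllString(part, "")
		if line != "" {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	for _, l := range cleanLogLines("10%\r50%\r\x1b[32m100% done\x1b[0m") {
		fmt.Println(l)
	}
}
```

Each progress update becomes its own log line, so the browser view shows the history of a progress bar rather than only its final state.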

@@ -0,0 +1,102 @@
package webui
import (
"encoding/json"
"net/http/httptest"
"strings"
"testing"
"bee/audit/internal/app"
)
func TestXrandrCommandAddsDefaultX11Env(t *testing.T) {
t.Setenv("DISPLAY", "")
t.Setenv("XAUTHORITY", "")
cmd := xrandrCommand("--query")
var hasDisplay bool
var hasXAuthority bool
for _, kv := range cmd.Env {
if kv == "DISPLAY=:0" {
hasDisplay = true
}
if kv == "XAUTHORITY=/home/bee/.Xauthority" {
hasXAuthority = true
}
}
if !hasDisplay {
t.Fatalf("DISPLAY not injected: %v", cmd.Env)
}
if !hasXAuthority {
t.Fatalf("XAUTHORITY not injected: %v", cmd.Env)
}
}
func TestHandleAPISATRunDecodesBodyWithoutContentLength(t *testing.T) {
globalQueue.mu.Lock()
originalTasks := globalQueue.tasks
globalQueue.tasks = nil
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = originalTasks
globalQueue.mu.Unlock()
})
h := &handler{opts: HandlerOptions{App: &app.App{}}}
req := httptest.NewRequest("POST", "/api/sat/cpu/run", strings.NewReader(`{"profile":"smoke"}`))
req.ContentLength = -1
rec := httptest.NewRecorder()
h.handleAPISATRun("cpu").ServeHTTP(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
if len(globalQueue.tasks) != 1 {
t.Fatalf("tasks=%d want 1", len(globalQueue.tasks))
}
if got := globalQueue.tasks[0].params.BurnProfile; got != "smoke" {
t.Fatalf("burn profile=%q want smoke", got)
}
}
func TestHandleAPIExportBundleQueuesTask(t *testing.T) {
globalQueue.mu.Lock()
originalTasks := globalQueue.tasks
globalQueue.tasks = nil
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = originalTasks
globalQueue.mu.Unlock()
})
h := &handler{opts: HandlerOptions{ExportDir: t.TempDir()}}
req := httptest.NewRequest("POST", "/api/export/bundle", nil)
rec := httptest.NewRecorder()
h.handleAPIExportBundle(rec, req)
if rec.Code != 200 {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
var body map[string]string
if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
t.Fatalf("decode response: %v", err)
}
if body["task_id"] == "" {
t.Fatalf("missing task_id in response: %v", body)
}
globalQueue.mu.Lock()
defer globalQueue.mu.Unlock()
if len(globalQueue.tasks) != 1 {
t.Fatalf("tasks=%d want 1", len(globalQueue.tasks))
}
if got := globalQueue.tasks[0].Target; got != "support-bundle" {
t.Fatalf("target=%q want support-bundle", got)
}
}

@@ -4,6 +4,8 @@ import (
"database/sql" "database/sql"
"encoding/csv" "encoding/csv"
"io" "io"
"os"
"path/filepath"
"strconv" "strconv"
"time" "time"
@@ -20,6 +22,9 @@ type MetricsDB struct {
// openMetricsDB opens (or creates) the metrics database at the given path. // openMetricsDB opens (or creates) the metrics database at the given path.
func openMetricsDB(path string) (*MetricsDB, error) { func openMetricsDB(path string) (*MetricsDB, error) {
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
return nil, err
}
db, err := sql.Open("sqlite", path+"?_journal=WAL&_busy_timeout=5000") db, err := sql.Open("sqlite", path+"?_journal=WAL&_busy_timeout=5000")
if err != nil { if err != nil {
return nil, err return nil, err
@@ -132,7 +137,7 @@ func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetri
defer rows.Close() defer rows.Close()
type sysRow struct { type sysRow struct {
ts int64 ts int64
cpu, mem, pwr float64 cpu, mem, pwr float64
} }
var sysRows []sysRow var sysRows []sysRow
@@ -156,7 +161,10 @@ func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetri
maxTS := sysRows[len(sysRows)-1].ts maxTS := sysRows[len(sysRows)-1].ts
// Load GPU rows in range // Load GPU rows in range
type gpuKey struct{ ts int64; idx int } type gpuKey struct {
ts int64
idx int
}
gpuData := map[gpuKey]platform.GPUMetricRow{} gpuData := map[gpuKey]platform.GPUMetricRow{}
gRows, err := m.db.Query( gRows, err := m.db.Query(
`SELECT ts,gpu_index,temp_c,usage_pct,mem_usage_pct,power_w FROM gpu_metrics WHERE ts>=? AND ts<=? ORDER BY ts,gpu_index`, `SELECT ts,gpu_index,temp_c,usage_pct,mem_usage_pct,power_w FROM gpu_metrics WHERE ts>=? AND ts<=? ORDER BY ts,gpu_index`,
@@ -174,7 +182,10 @@ func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetri
} }
// Load fan rows in range // Load fan rows in range
type fanKey struct{ ts int64; name string } type fanKey struct {
ts int64
name string
}
fanData := map[fanKey]float64{} fanData := map[fanKey]float64{}
fRows, err := m.db.Query( fRows, err := m.db.Query(
`SELECT ts,name,rpm FROM fan_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS, `SELECT ts,name,rpm FROM fan_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS,
@@ -192,7 +203,10 @@ func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetri
} }
// Load temp rows in range // Load temp rows in range
type tempKey struct{ ts int64; name string } type tempKey struct {
ts int64
name string
}
tempData := map[tempKey]platform.TempReading{} tempData := map[tempKey]platform.TempReading{}
tRows, err := m.db.Query( tRows, err := m.db.Query(
`SELECT ts,name,grp,celsius FROM temp_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS, `SELECT ts,name,grp,celsius FROM temp_metrics WHERE ts>=? AND ts<=?`, minTS, maxTS,

@@ -205,12 +205,83 @@ document.querySelectorAll('.terminal').forEach(function(t){
func renderDashboard(opts HandlerOptions) string { func renderDashboard(opts HandlerOptions) string {
var b strings.Builder var b strings.Builder
b.WriteString(renderAuditStatusBanner(opts))
b.WriteString(renderHardwareSummaryCard(opts)) b.WriteString(renderHardwareSummaryCard(opts))
b.WriteString(renderHealthCard(opts)) b.WriteString(renderHealthCard(opts))
b.WriteString(renderMetrics()) b.WriteString(renderMetrics())
return b.String() return b.String()
} }
// renderAuditStatusBanner shows a live progress banner when an audit task is
// running and auto-reloads the page when it completes.
func renderAuditStatusBanner(opts HandlerOptions) string {
// The banner starts hidden. The polling script below reveals it whenever an
// audit task is pending or running, whether or not earlier audit data exists,
// so a newly triggered audit also reloads the page.
return `<div id="audit-banner" class="alert alert-warn" style="display:none;margin-bottom:16px">
<span id="audit-banner-text">&#9654; Hardware audit is running — page will refresh automatically when complete.</span>
<a href="/tasks" style="margin-left:12px;font-size:12px">View in Tasks</a>
</div>
<script>
(function(){
var _auditPoll = null;
var _auditSeenRunning = false;
function pollAuditTask() {
fetch('/api/tasks').then(function(r){ return r.json(); }).then(function(tasks){
if (!tasks) return;
var audit = null;
for (var i = 0; i < tasks.length; i++) {
if (tasks[i].target === 'audit') { audit = tasks[i]; break; }
}
var banner = document.getElementById('audit-banner');
var txt = document.getElementById('audit-banner-text');
if (!audit) {
if (banner) banner.style.display = 'none';
return;
}
if (audit.status === 'running' || audit.status === 'pending') {
_auditSeenRunning = true;
if (banner) {
banner.style.display = '';
var label = audit.status === 'pending' ? 'pending\u2026' : 'running\u2026';
if (txt) txt.textContent = '\u25b6 Hardware audit ' + label + ' \u2014 page will refresh when complete.';
}
} else if (audit.status === 'done' && _auditSeenRunning) {
// Audit just finished — reload to show fresh hardware data.
clearInterval(_auditPoll);
if (banner) {
if (txt) txt.textContent = '\u2713 Audit complete \u2014 reloading\u2026';
banner.style.background = 'var(--ok-bg,#fcfff5)';
banner.style.color = 'var(--ok-fg,#2c662d)';
}
setTimeout(function(){ window.location.reload(); }, 800);
} else if (audit.status === 'failed') {
_auditSeenRunning = false;
if (banner) {
banner.style.display = '';
banner.style.background = 'var(--crit-bg,#fff6f6)';
banner.style.color = 'var(--crit-fg,#9f3a38)';
if (txt) txt.textContent = '\u2717 Audit failed: ' + (audit.error||'unknown error');
clearInterval(_auditPoll);
}
} else {
if (banner) banner.style.display = 'none';
}
}).catch(function(){});
}
_auditPoll = setInterval(pollAuditTask, 3000);
pollAuditTask();
})();
</script>`
}
func renderAudit() string { func renderAudit() string {
return `<div class="card"><div class="card-head">Audit Viewer <button class="btn btn-sm btn-secondary" style="margin-left:auto" onclick="openAuditModal()">Actions</button></div><div class="card-body" style="padding:0"><iframe class="viewer-frame" src="/viewer" title="Audit viewer"></iframe></div></div>` return `<div class="card"><div class="card-head">Audit Viewer <button class="btn btn-sm btn-secondary" style="margin-left:auto" onclick="openAuditModal()">Actions</button></div><div class="card-body" style="padding:0"><iframe class="viewer-frame" src="/viewer" title="Audit viewer"></iframe></div></div>`
} }
@@ -218,7 +289,7 @@ func renderAudit() string {
func renderHardwareSummaryCard(opts HandlerOptions) string { func renderHardwareSummaryCard(opts HandlerOptions) string {
data, err := loadSnapshot(opts.AuditPath) data, err := loadSnapshot(opts.AuditPath)
if err != nil { if err != nil {
return `<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body"><span class="badge badge-unknown">No audit data</span></div></div>` return `<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body"><button class="btn btn-primary" onclick="auditModalRun()">&#9654; Run Audit</button></div></div>`
} }
// Parse just enough fields for the summary banner // Parse just enough fields for the summary banner
var snap struct { var snap struct {
@@ -461,16 +532,10 @@ function refreshCharts() {
} }
setInterval(refreshCharts, 3000); setInterval(refreshCharts, 3000);
const es = new EventSource('/api/metrics/stream'); fetch('/api/metrics/latest').then(r => r.json()).then(d => {
es.addEventListener('metrics', e => {
const d = JSON.parse(e.data);
// Show/hide Fan RPM card based on data availability
const fanCard = document.getElementById('card-server-fans'); const fanCard = document.getElementById('card-server-fans');
if (fanCard) fanCard.style.display = (d.fans && d.fans.length > 0) ? '' : 'none'; if (fanCard) fanCard.style.display = (d.fans && d.fans.length > 0) ? '' : 'none';
}).catch(() => {});
});
es.onerror = () => {};
</script>` </script>`
} }
@@ -593,12 +658,15 @@ func renderBurn() string {
return `<div class="alert alert-warn" style="margin-bottom:16px"><strong>&#9888; Warning:</strong> Stress tests on this page run hardware at maximum load. Repeated or prolonged use may reduce hardware lifespan (storage endurance, GPU wear). Use only when necessary.</div> return `<div class="alert alert-warn" style="margin-bottom:16px"><strong>&#9888; Warning:</strong> Stress tests on this page run hardware at maximum load. Repeated or prolonged use may reduce hardware lifespan (storage endurance, GPU wear). Use only when necessary.</div>
<p style="color:var(--muted);font-size:13px;margin-bottom:16px">Tasks continue in the background — view progress in <a href="/tasks">Tasks</a>.</p> <p style="color:var(--muted);font-size:13px;margin-bottom:16px">Tasks continue in the background — view progress in <a href="/tasks">Tasks</a>.</p>
<div class="card"><div class="card-head">Burn Profile</div><div class="card-body"> <div class="card"><div class="card-head">Burn Profile</div><div class="card-body">
<div class="form-row" style="max-width:320px"><label>Preset</label><select id="burn-profile"><option value="smoke">Smoke: 5 minutes</option><option value="acceptance">Acceptance: 1 hour</option><option value="overnight">Overnight: 8 hours</option></select></div> <div class="form-row" style="max-width:320px"><label>Preset</label><select id="burn-profile"><option value="smoke" selected>Smoke: quick check (~5 min CPU / DCGM level 1)</option><option value="acceptance">Acceptance: 1 hour (DCGM level 3)</option><option value="overnight">Overnight: 8 hours (DCGM level 4)</option></select></div>
<p style="color:var(--muted);font-size:12px">Applied to all tests on this page. NVIDIA uses mapped DCGM levels: smoke=quick, acceptance=targeted stress, overnight=extended stress.</p> <p style="color:var(--muted);font-size:12px">Applied to all tests on this page. NVIDIA SAT on the Validate page still uses DCGM. NVIDIA GPU Stress on this page uses the selected stress loader for the preset duration.</p>
</div></div> </div></div>
<div class="grid3"> <div class="grid3">
<div class="card"><div class="card-head">NVIDIA GPU Stress</div><div class="card-body"> <div class="card"><div class="card-head">NVIDIA GPU Stress</div><div class="card-body">
<button id="sat-btn-nvidia" class="btn btn-primary" onclick="runBurnIn('nvidia')">&#9654; Start NVIDIA Stress</button> <div class="form-row"><label>Load Tool</label><select id="nvidia-stress-loader"><option value="builtin" selected>bee-gpu-burn</option><option value="nccl">NCCL all_reduce_perf</option><option value="john">John the Ripper jumbo (OpenCL)</option></select></div>
<div class="form-row"><label>Exclude GPU indices</label><input type="text" id="nvidia-stress-exclude" placeholder="e.g. 1,3"></div>
<p style="color:var(--muted);font-size:12px;margin-bottom:8px"><code>bee-gpu-burn</code> runs on all detected NVIDIA GPUs by default. <code>NCCL all_reduce_perf</code> is useful for multi-GPU / interconnect load. Use exclusions only when one or more cards must be skipped.</p>
<button id="sat-btn-nvidia-stress" class="btn btn-primary" onclick="runBurnIn('nvidia-stress')">&#9654; Start NVIDIA Stress</button>
</div></div> </div></div>
<div class="card"><div class="card-head">CPU Stress</div><div class="card-body"> <div class="card"><div class="card-head">CPU Stress</div><div class="card-body">
<button class="btn btn-primary" onclick="runBurnIn('cpu')">&#9654; Start CPU Stress</button> <button class="btn btn-primary" onclick="runBurnIn('cpu')">&#9654; Start CPU Stress</button>
@@ -626,11 +694,24 @@ func renderBurn() string {
</div> </div>
<script> <script>
let biES = null; let biES = null;
function parseGPUIndexList(raw) {
return (raw || '')
.split(',')
.map(v => v.trim())
.filter(v => v !== '')
.map(v => Number(v))
.filter(v => Number.isInteger(v) && v >= 0);
}
function runBurnIn(target) { function runBurnIn(target) {
if (biES) { biES.close(); biES = null; } if (biES) { biES.close(); biES = null; }
const body = { profile: document.getElementById('burn-profile').value || 'smoke' }; const body = { profile: document.getElementById('burn-profile').value || 'smoke' };
if (target === 'nvidia-stress') {
body.loader = document.getElementById('nvidia-stress-loader').value || 'builtin';
body.exclude_gpu_indices = parseGPUIndexList(document.getElementById('nvidia-stress-exclude').value);
}
document.getElementById('bi-output').style.display='block'; document.getElementById('bi-output').style.display='block';
document.getElementById('bi-title').textContent = '— ' + target + ' [' + body.profile + ']'; const loaderLabel = body.loader ? ' / ' + body.loader : '';
document.getElementById('bi-title').textContent = '— ' + target + loaderLabel + ' [' + body.profile + ']';
const term = document.getElementById('bi-terminal'); const term = document.getElementById('bi-terminal');
term.textContent = 'Enqueuing ' + target + ' stress...\n'; term.textContent = 'Enqueuing ' + target + ' stress...\n';
fetch('/api/sat/'+target+'/run', {method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(body)}) fetch('/api/sat/'+target+'/run', {method:'POST',headers:{'Content-Type':'application/json'},body:JSON.stringify(body)})
@@ -645,7 +726,7 @@ function runBurnIn(target) {
</script> </script>
<script> <script>
fetch('/api/gpu/presence').then(r=>r.json()).then(gp => { fetch('/api/gpu/presence').then(r=>r.json()).then(gp => {
if (!gp.nvidia) disableSATCard('nvidia', 'No NVIDIA GPU detected'); if (!gp.nvidia) disableSATCard('nvidia-stress', 'No NVIDIA GPU detected');
if (!gp.amd) disableSATCard('amd-stress', 'No AMD GPU detected'); if (!gp.amd) disableSATCard('amd-stress', 'No AMD GPU detected');
}); });
function disableSATCard(id, reason) { function disableSATCard(id, reason) {
@@ -845,12 +926,79 @@ func renderExport(exportDir string) string {
return `<div class="grid2"> return `<div class="grid2">
<div class="card"><div class="card-head">Support Bundle</div><div class="card-body"> <div class="card"><div class="card-head">Support Bundle</div><div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Creates a tar.gz archive of all audit files, SAT results, and logs.</p> <p style="font-size:13px;color:var(--muted);margin-bottom:12px">Creates a tar.gz archive of all audit files, SAT results, and logs.</p>
<a class="btn btn-primary" href="/export/support.tar.gz">⬇ Download Support Bundle</a> ` + renderSupportBundleInline() + `
</div></div> </div></div>
<div class="card"><div class="card-head">Export Files</div><div class="card-body"> <div class="card"><div class="card-head">Export Files</div><div class="card-body">
<table><tr><th>File</th></tr>` + rows.String() + `</table> <table><tr><th>File</th></tr>` + rows.String() + `</table>
</div></div> </div></div>
</div>` </div>
<div class="card" style="margin-top:16px">
<div class="card-head">Export to USB
<button class="btn btn-sm btn-secondary" onclick="usbRefresh()" style="margin-left:auto">&#8635; Refresh</button>
</div>
<div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Write audit JSON or support bundle directly to a removable USB drive.</p>
<div id="usb-status" style="font-size:13px;color:var(--muted)">Scanning for USB devices...</div>
<div id="usb-targets" style="margin-top:12px"></div>
<div id="usb-msg" style="margin-top:10px;font-size:13px"></div>
</div>
</div>
<script>
(function(){
function usbRefresh() {
document.getElementById('usb-status').textContent = 'Scanning...';
document.getElementById('usb-targets').innerHTML = '';
document.getElementById('usb-msg').textContent = '';
fetch('/api/export/usb').then(r=>r.json()).then(targets => {
const st = document.getElementById('usb-status');
const ct = document.getElementById('usb-targets');
if (!targets || targets.length === 0) {
st.textContent = 'No removable USB devices found.';
return;
}
st.textContent = targets.length + ' device(s) found:';
ct.innerHTML = '<table><tr><th>Device</th><th>FS</th><th>Size</th><th>Label</th><th>Model</th><th>Actions</th></tr>' +
targets.map(t => {
const dev = t.device || '';
const label = t.label || '';
const model = t.model || '';
return '<tr>' +
'<td style="font-family:monospace">'+dev+'</td>' +
'<td>'+t.fs_type+'</td>' +
'<td>'+t.size+'</td>' +
'<td>'+label+'</td>' +
'<td style="font-size:12px;color:var(--muted)">'+model+'</td>' +
'<td style="white-space:nowrap">' +
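// Escape the JSON so its double quotes do not terminate the onclick attribute.
'<button class="btn btn-sm btn-primary" onclick="usbExport(\'audit\','+JSON.stringify(t).replace(/&/g,'&amp;').replace(/"/g,'&quot;')+')">Audit JSON</button> ' +
'<button class="btn btn-sm btn-secondary" onclick="usbExport(\'bundle\','+JSON.stringify(t).replace(/&/g,'&amp;').replace(/"/g,'&quot;')+')">Support Bundle</button>' +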
'</td></tr>';
}).join('') + '</table>';
}).catch(e => {
document.getElementById('usb-status').textContent = 'Error: ' + e;
});
}
window.usbExport = function(type, target) {
const msg = document.getElementById('usb-msg');
msg.style.color = 'var(--muted)';
msg.textContent = 'Exporting to ' + (target.device||'') + '...';
fetch('/api/export/usb/'+type, {
method: 'POST',
headers: {'Content-Type':'application/json'},
body: JSON.stringify(target)
}).then(r=>r.json()).then(d => {
if (d.error) { msg.style.color='var(--err,red)'; msg.textContent = 'Error: '+d.error; return; }
msg.style.color = 'var(--ok,green)';
msg.textContent = d.message || 'Done.';
}).catch(e => {
msg.style.color = 'var(--err,red)';
msg.textContent = 'Error: '+e;
});
};
window.usbRefresh = usbRefresh;
usbRefresh();
})();
</script>`
} }
func listExportFiles(exportDir string) ([]string, error) { func listExportFiles(exportDir string) ([]string, error) {
@@ -876,6 +1024,127 @@ func listExportFiles(exportDir string) ([]string, error) {
return entries, nil return entries, nil
} }
func renderSupportBundleInline() string {
return `<button id="support-bundle-btn" class="btn btn-primary" onclick="supportBundleBuild()">Build Support Bundle</button>
<a id="support-bundle-download" class="btn btn-secondary" href="/export/support.tar.gz" style="display:none">&#8595; Download Support Bundle</a>
<div id="support-bundle-status" style="margin-top:12px;font-size:13px;color:var(--muted)">No support bundle built in this session.</div>
<div id="support-bundle-log" class="terminal" style="display:none;margin-top:12px;max-height:260px"></div>
<script>
(function(){
var _supportBundleES = null;
window.supportBundleBuild = function() {
var btn = document.getElementById('support-bundle-btn');
var status = document.getElementById('support-bundle-status');
var log = document.getElementById('support-bundle-log');
var download = document.getElementById('support-bundle-download');
if (_supportBundleES) {
_supportBundleES.close();
_supportBundleES = null;
}
btn.disabled = true;
btn.textContent = 'Building...';
status.textContent = 'Queueing support bundle task...';
status.style.color = 'var(--muted)';
log.style.display = '';
log.textContent = '';
download.style.display = 'none';
fetch('/api/export/bundle', {method:'POST'}).then(function(r){
return r.json().then(function(j){
if (!r.ok) throw new Error(j.error || r.statusText);
return j;
});
}).then(function(data){
if (!data.task_id) throw new Error('missing task id');
status.textContent = 'Building support bundle...';
_supportBundleES = new EventSource('/api/tasks/' + data.task_id + '/stream');
_supportBundleES.onmessage = function(e) {
log.textContent += e.data + '\n';
log.scrollTop = log.scrollHeight;
};
_supportBundleES.addEventListener('done', function(e) {
_supportBundleES.close();
_supportBundleES = null;
btn.disabled = false;
btn.textContent = 'Build Support Bundle';
if (e.data) {
status.textContent = 'Error: ' + e.data;
status.style.color = 'var(--crit-fg)';
return;
}
status.textContent = 'Support bundle ready.';
status.style.color = 'var(--ok-fg)';
download.style.display = '';
});
_supportBundleES.onerror = function() {
if (_supportBundleES) _supportBundleES.close();
_supportBundleES = null;
btn.disabled = false;
btn.textContent = 'Build Support Bundle';
status.textContent = 'Support bundle stream disconnected.';
status.style.color = 'var(--crit-fg)';
};
}).catch(function(e){
btn.disabled = false;
btn.textContent = 'Build Support Bundle';
status.textContent = 'Error: ' + e;
status.style.color = 'var(--crit-fg)';
});
};
})();
</script>`
}
// ── Display Resolution ────────────────────────────────────────────────────────
func renderDisplayInline() string {
return `<div id="display-status" style="color:var(--muted);font-size:13px;margin-bottom:12px">Loading displays...</div>
<div id="display-controls"></div>
<script>
(function(){
function loadDisplays() {
fetch('/api/display/resolutions').then(r=>r.json()).then(displays => {
const status = document.getElementById('display-status');
const ctrl = document.getElementById('display-controls');
if (!displays || displays.length === 0) {
status.textContent = 'No connected displays found or xrandr not available.';
return;
}
status.textContent = '';
ctrl.innerHTML = displays.map(d => {
const opts = (d.modes||[]).map(m =>
'<option value="'+m.mode+'"'+(m.current?' selected':'')+'>'+m.mode+(m.current?' (current)':'')+'</option>'
).join('');
return '<div style="margin-bottom:12px">'
+'<span style="font-weight:600;margin-right:8px">'+d.output+'</span>'
+'<span style="color:var(--muted);font-size:12px;margin-right:12px">Current: '+d.current+'</span>'
+'<select id="res-sel-'+d.output+'" style="margin-right:8px">'+opts+'</select>'
+'<button class="btn btn-sm btn-primary" onclick="applyResolution(\''+d.output+'\')">Apply</button>'
+'</div>';
}).join('');
}).catch(()=>{
document.getElementById('display-status').textContent = 'xrandr not available on this system.';
});
}
window.applyResolution = function(output) {
const sel = document.getElementById('res-sel-'+output);
if (!sel) return;
const mode = sel.value;
const btn = sel.nextElementSibling;
btn.disabled = true;
btn.textContent = 'Applying...';
fetch('/api/display/set', {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify({output:output,mode:mode})})
.then(r=>r.json()).then(d=>{
if (d.error) { alert('Error: '+d.error); }
loadDisplays();
}).catch(e=>{ alert('Error: '+e); })
.finally(()=>{ btn.disabled=false; btn.textContent='Apply'; });
};
loadDisplays();
})();
</script>`
}
// ── Tools ───────────────────────────────────────────────────────────────────── // ── Tools ─────────────────────────────────────────────────────────────────────
func renderTools() string { func renderTools() string {
@@ -915,7 +1184,7 @@ function installToRAM() {
<div class="card"><div class="card-head">Support Bundle</div><div class="card-body"> <div class="card"><div class="card-head">Support Bundle</div><div class="card-body">
<p style="font-size:13px;color:var(--muted);margin-bottom:12px">Downloads a tar.gz archive of all audit files, SAT results, and logs.</p> <p style="font-size:13px;color:var(--muted);margin-bottom:12px">Downloads a tar.gz archive of all audit files, SAT results, and logs.</p>
<a class="btn btn-primary" href="/export/support.tar.gz">&#8595; Download Support Bundle</a> ` + renderSupportBundleInline() + `
</div></div> </div></div>
<div class="card"><div class="card-head">Tool Check <button class="btn btn-sm btn-secondary" onclick="checkTools()" style="margin-left:auto">&#8635; Check</button></div> <div class="card"><div class="card-head">Tool Check <button class="btn btn-sm btn-secondary" onclick="checkTools()" style="margin-left:auto">&#8635; Check</button></div>
@@ -927,6 +1196,9 @@ function installToRAM() {
<div class="card"><div class="card-head">Services</div><div class="card-body">` + <div class="card"><div class="card-head">Services</div><div class="card-body">` +
renderServicesInline() + `</div></div> renderServicesInline() + `</div></div>
<div class="card"><div class="card-head">Display Resolution</div><div class="card-body">` +
renderDisplayInline() + `</div></div>
<script> <script>
function checkTools() { function checkTools() {
document.getElementById('tools-table').innerHTML = '<p style="color:var(--muted);font-size:13px">Checking...</p>'; document.getElementById('tools-table').innerHTML = '<p style="color:var(--muted);font-size:13px">Checking...</p>';
@@ -1091,21 +1363,23 @@ function installStart() {
headers: {'Content-Type': 'application/json'}, headers: {'Content-Type': 'application/json'},
body: JSON.stringify({device: _installSelected.device}) body: JSON.stringify({device: _installSelected.device})
}).then(function(r){ }).then(function(r){
if (r.status === 204) { return r.json().then(function(j){
installStreamLog(); if (!r.ok) throw new Error(j.error || r.statusText);
} else { return j;
return r.json().then(function(j){ throw new Error(j.error || r.statusText); }); });
} }).then(function(j){
if (!j.task_id) throw new Error('missing task id');
installStreamLog(j.task_id);
}).catch(function(e){ }).catch(function(e){
status.textContent = 'Error: ' + e; status.textContent = 'Error: ' + e;
status.style.color = 'var(--crit-fg)'; status.style.color = 'var(--crit-fg)';
}); });
} }
function installStreamLog() { function installStreamLog(taskId) {
var term = document.getElementById('install-terminal'); var term = document.getElementById('install-terminal');
var status = document.getElementById('install-status'); var status = document.getElementById('install-status');
var es = new EventSource('/api/install/stream'); var es = new EventSource('/api/tasks/' + taskId + '/stream');
es.onmessage = function(e) { es.onmessage = function(e) {
term.textContent += e.data + '\n'; term.textContent += e.data + '\n';
term.scrollTop = term.scrollHeight; term.scrollTop = term.scrollHeight;
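
The `parseGPUIndexList` helper added above turns the free-text exclusion field into a list of non-negative integers before it is posted as `exclude_gpu_indices`. A server-side counterpart could apply the same rule. The sketch below is a hypothetical Go helper that is not part of this change; it is also slightly stricter than the JavaScript version, since `strconv.Atoi` accepts only decimal integers while `Number()` would also accept forms such as `1e2`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGPUIndexList is a hypothetical server-side mirror of the JS helper:
// split on commas, trim whitespace, and keep only non-negative integers.
func parseGPUIndexList(raw string) []int {
	out := []int{}
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue // skip empty entries such as "1,,3"
		}
		n, err := strconv.Atoi(part)
		if err != nil || n < 0 {
			continue // skip non-numeric and negative entries
		}
		out = append(out, n)
	}
	return out
}

func main() {
	fmt.Println(parseGPUIndexList(" 1, 3 ,, x, -2 ")) // prints [1 3]
}
```

Silently dropping invalid entries matches the browser-side behavior; a stricter API might instead reject the request so that a typo cannot leave a GPU unexcluded.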


@@ -5,6 +5,7 @@ import (
"errors" "errors"
"fmt" "fmt"
"html" "html"
"log/slog"
"mime" "mime"
"net/http" "net/http"
"os" "os"
@@ -143,9 +144,6 @@ type handler struct {
latest *platform.LiveMetricSample latest *platform.LiveMetricSample
// metrics persistence (nil if DB unavailable) // metrics persistence (nil if DB unavailable)
metricsDB *MetricsDB metricsDB *MetricsDB
// install job (at most one at a time)
installJob *jobState
installMu sync.Mutex
// pending network change (rollback on timeout) // pending network change (rollback on timeout)
pendingNet *pendingNetChange pendingNet *pendingNetChange
pendingNetMu sync.Mutex pendingNetMu sync.Mutex
@@ -180,7 +178,11 @@ func NewHandler(opts HandlerOptions) http.Handler {
if len(samples) > 0 { if len(samples) > 0 {
h.setLatestMetric(samples[len(samples)-1]) h.setLatestMetric(samples[len(samples)-1])
} }
} else {
slog.Warn("metrics history unavailable", "path", metricsDBPath, "err", err)
} }
} else {
slog.Warn("metrics db disabled", "path", metricsDBPath, "err", err)
} }
h.startMetricsCollector() h.startMetricsCollector()
@@ -206,6 +208,7 @@ func NewHandler(opts HandlerOptions) http.Handler {
// SAT // SAT
mux.HandleFunc("POST /api/sat/nvidia/run", h.handleAPISATRun("nvidia")) mux.HandleFunc("POST /api/sat/nvidia/run", h.handleAPISATRun("nvidia"))
mux.HandleFunc("POST /api/sat/nvidia-stress/run", h.handleAPISATRun("nvidia-stress"))
mux.HandleFunc("POST /api/sat/memory/run", h.handleAPISATRun("memory")) mux.HandleFunc("POST /api/sat/memory/run", h.handleAPISATRun("memory"))
mux.HandleFunc("POST /api/sat/storage/run", h.handleAPISATRun("storage")) mux.HandleFunc("POST /api/sat/storage/run", h.handleAPISATRun("storage"))
mux.HandleFunc("POST /api/sat/cpu/run", h.handleAPISATRun("cpu")) mux.HandleFunc("POST /api/sat/cpu/run", h.handleAPISATRun("cpu"))
@@ -241,10 +244,17 @@ func NewHandler(opts HandlerOptions) http.Handler {
// Export // Export
mux.HandleFunc("GET /api/export/list", h.handleAPIExportList) mux.HandleFunc("GET /api/export/list", h.handleAPIExportList)
mux.HandleFunc("POST /api/export/bundle", h.handleAPIExportBundle) mux.HandleFunc("POST /api/export/bundle", h.handleAPIExportBundle)
mux.HandleFunc("GET /api/export/usb", h.handleAPIExportUSBTargets)
mux.HandleFunc("POST /api/export/usb/audit", h.handleAPIExportUSBAudit)
mux.HandleFunc("POST /api/export/usb/bundle", h.handleAPIExportUSBBundle)
// Tools // Tools
mux.HandleFunc("GET /api/tools/check", h.handleAPIToolsCheck) mux.HandleFunc("GET /api/tools/check", h.handleAPIToolsCheck)
// Display
mux.HandleFunc("GET /api/display/resolutions", h.handleAPIDisplayResolutions)
mux.HandleFunc("POST /api/display/set", h.handleAPIDisplaySet)
// GPU presence // GPU presence
mux.HandleFunc("GET /api/gpu/presence", h.handleAPIGPUPresence) mux.HandleFunc("GET /api/gpu/presence", h.handleAPIGPUPresence)
@@ -258,10 +268,10 @@ func NewHandler(opts HandlerOptions) http.Handler {
// Install // Install
mux.HandleFunc("GET /api/install/disks", h.handleAPIInstallDisks) mux.HandleFunc("GET /api/install/disks", h.handleAPIInstallDisks)
mux.HandleFunc("POST /api/install/run", h.handleAPIInstallRun) mux.HandleFunc("POST /api/install/run", h.handleAPIInstallRun)
mux.HandleFunc("GET /api/install/stream", h.handleAPIInstallStream)
// Metrics — SSE stream of live sensor data + server-side SVG charts + CSV export // Metrics — SSE stream of live sensor data + server-side SVG charts + CSV export
mux.HandleFunc("GET /api/metrics/stream", h.handleAPIMetricsStream) mux.HandleFunc("GET /api/metrics/stream", h.handleAPIMetricsStream)
mux.HandleFunc("GET /api/metrics/latest", h.handleAPIMetricsLatest)
mux.HandleFunc("GET /api/metrics/chart/", h.handleMetricsChartSVG) mux.HandleFunc("GET /api/metrics/chart/", h.handleMetricsChartSVG)
mux.HandleFunc("GET /api/metrics/export.csv", h.handleAPIMetricsExportCSV) mux.HandleFunc("GET /api/metrics/export.csv", h.handleAPIMetricsExportCSV)
@@ -357,9 +367,13 @@ func (h *handler) handleRuntimeHealthJSON(w http.ResponseWriter, r *http.Request
} }
func (h *handler) handleSupportBundleDownload(w http.ResponseWriter, r *http.Request) { func (h *handler) handleSupportBundleDownload(w http.ResponseWriter, r *http.Request) {
archive, err := app.BuildSupportBundle(h.opts.ExportDir) archive, err := app.LatestSupportBundlePath()
if err != nil { if err != nil {
http.Error(w, fmt.Sprintf("build support bundle: %v", err), http.StatusInternalServerError) if errors.Is(err, os.ErrNotExist) {
http.Error(w, "support bundle not built yet", http.StatusNotFound)
return
}
http.Error(w, fmt.Sprintf("locate support bundle: %v", err), http.StatusInternalServerError)
return return
} }
w.Header().Set("Cache-Control", "no-store") w.Header().Set("Cache-Control", "no-store")
@@ -1222,13 +1236,6 @@ probe();
func (h *handler) handlePage(w http.ResponseWriter, r *http.Request) { func (h *handler) handlePage(w http.ResponseWriter, r *http.Request) {
page := strings.TrimPrefix(r.URL.Path, "/") page := strings.TrimPrefix(r.URL.Path, "/")
if page == "" { if page == "" {
// Serve loading page until audit snapshot exists
if _, err := os.Stat(h.opts.AuditPath); err != nil {
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(loadingPageHTML))
return
}
page = "dashboard" page = "dashboard"
} }
// Redirect old routes to new names // Redirect old routes to new names


@@ -136,6 +136,33 @@ func TestRootRendersDashboard(t *testing.T) {
} }
} }
func TestRootShowsRunAuditButtonWhenSnapshotMissing(t *testing.T) {
dir := t.TempDir()
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{
Title: "Bee Hardware Audit",
AuditPath: filepath.Join(dir, "missing-audit.json"),
ExportDir: exportDir,
})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
if !strings.Contains(body, `Run Audit`) {
t.Fatalf("dashboard missing run audit button: %s", body)
}
if strings.Contains(body, `No audit data`) {
t.Fatalf("dashboard still shows empty audit badge: %s", body)
}
}
func TestAuditPageRendersViewerFrameAndActions(t *testing.T) { func TestAuditPageRendersViewerFrameAndActions(t *testing.T) {
dir := t.TempDir() dir := t.TempDir()
path := filepath.Join(dir, "audit.json") path := filepath.Join(dir, "audit.json")
@@ -232,6 +259,17 @@ func TestSupportBundleEndpointReturnsArchive(t *testing.T) {
if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil { if err := os.WriteFile(filepath.Join(exportDir, "bee-audit.log"), []byte("audit log"), 0644); err != nil {
t.Fatal(err) t.Fatal(err)
} }
archive, err := os.CreateTemp(os.TempDir(), "bee-support-server-test-*.tar.gz")
if err != nil {
t.Fatal(err)
}
t.Cleanup(func() { _ = os.Remove(archive.Name()) })
if _, err := archive.WriteString("support-bundle"); err != nil {
t.Fatal(err)
}
if err := archive.Close(); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{ExportDir: exportDir}) handler := NewHandler(HandlerOptions{ExportDir: exportDir})
rec := httptest.NewRecorder() rec := httptest.NewRecorder()


@@ -6,8 +6,10 @@ import (
"fmt" "fmt"
"net/http" "net/http"
"os" "os"
"os/exec"
"path/filepath" "path/filepath"
"sort" "sort"
"strings"
"sync" "sync"
"time" "time"
@@ -24,22 +26,59 @@ const (
TaskCancelled = "cancelled" TaskCancelled = "cancelled"
) )
// taskNames maps target → human-readable name. // taskNames maps target → human-readable name for validate (SAT) runs.
var taskNames = map[string]string{ var taskNames = map[string]string{
"nvidia": "NVIDIA SAT", "nvidia": "NVIDIA SAT",
"memory": "Memory SAT", "nvidia-stress": "NVIDIA GPU Stress",
"storage": "Storage SAT", "memory": "Memory SAT",
"cpu": "CPU SAT", "storage": "Storage SAT",
"amd": "AMD GPU SAT", "cpu": "CPU SAT",
"amd-mem": "AMD GPU MEM Integrity", "amd": "AMD GPU SAT",
"amd-bandwidth": "AMD GPU MEM Bandwidth", "amd-mem": "AMD GPU MEM Integrity",
"amd-stress": "AMD GPU Burn-in", "amd-bandwidth": "AMD GPU MEM Bandwidth",
"memory-stress": "Memory Burn-in", "amd-stress": "AMD GPU Burn-in",
"sat-stress": "SAT Stress (stressapptest)", "memory-stress": "Memory Burn-in",
"sat-stress": "SAT Stress (stressapptest)",
"platform-stress": "Platform Thermal Cycling", "platform-stress": "Platform Thermal Cycling",
"audit": "Audit", "audit": "Audit",
"install": "Install to Disk", "support-bundle": "Support Bundle",
"install-to-ram": "Install to RAM", "install": "Install to Disk",
"install-to-ram": "Install to RAM",
}
// burnNames maps target → human-readable name when a burn profile is set.
var burnNames = map[string]string{
"nvidia": "NVIDIA Burn-in",
"memory": "Memory Burn-in",
"cpu": "CPU Burn-in",
"amd": "AMD GPU Burn-in",
}
func nvidiaStressTaskName(loader string) string {
switch strings.TrimSpace(strings.ToLower(loader)) {
case platform.NvidiaStressLoaderJohn:
return "NVIDIA GPU Stress (John/OpenCL)"
case platform.NvidiaStressLoaderNCCL:
return "NVIDIA GPU Stress (NCCL)"
default:
return "NVIDIA GPU Stress (bee-gpu-burn)"
}
}
func taskDisplayName(target, profile, loader string) string {
name := taskNames[target]
if profile != "" {
if n, ok := burnNames[target]; ok {
name = n
}
}
if target == "nvidia-stress" {
name = nvidiaStressTaskName(loader)
}
if name == "" {
name = target
}
return name
} }
// Task represents one unit of work in the queue. // Task represents one unit of work in the queue.
@@ -62,12 +101,14 @@ type Task struct {
// taskParams holds optional parameters parsed from the run request. // taskParams holds optional parameters parsed from the run request.
type taskParams struct { type taskParams struct {
Duration int `json:"duration,omitempty"` Duration int `json:"duration,omitempty"`
DiagLevel int `json:"diag_level,omitempty"` DiagLevel int `json:"diag_level,omitempty"`
GPUIndices []int `json:"gpu_indices,omitempty"` GPUIndices []int `json:"gpu_indices,omitempty"`
BurnProfile string `json:"burn_profile,omitempty"` ExcludeGPUIndices []int `json:"exclude_gpu_indices,omitempty"`
DisplayName string `json:"display_name,omitempty"` Loader string `json:"loader,omitempty"`
Device string `json:"device,omitempty"` // for install BurnProfile string `json:"burn_profile,omitempty"`
DisplayName string `json:"display_name,omitempty"`
Device string `json:"device,omitempty"` // for install
} }
type persistedTask struct { type persistedTask struct {
@@ -162,6 +203,9 @@ var (
runAMDMemBandwidthPackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) { runAMDMemBandwidthPackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
return a.RunAMDMemBandwidthPackCtx(ctx, baseDir, logFunc) return a.RunAMDMemBandwidthPackCtx(ctx, baseDir, logFunc)
} }
runNvidiaStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
return a.RunNvidiaStressPackCtx(ctx, baseDir, opts, logFunc)
}
runAMDStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) { runAMDStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunAMDStressPackCtx(ctx, baseDir, durationSec, logFunc) return a.RunAMDStressPackCtx(ctx, baseDir, durationSec, logFunc)
} }
@@ -171,6 +215,10 @@ var (
runSATStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) { runSATStressPackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
return a.RunSATStressPackCtx(ctx, baseDir, durationSec, logFunc) return a.RunSATStressPackCtx(ctx, baseDir, durationSec, logFunc)
} }
buildSupportBundle = app.BuildSupportBundle
installCommand = func(ctx context.Context, device string, logPath string) *exec.Cmd {
return exec.CommandContext(ctx, "bee-install", device, logPath)
}
) )
// enqueue adds a task to the queue and notifies the worker. // enqueue adds a task to the queue and notifies the worker.
@@ -368,9 +416,9 @@ func setCPUGovernor(governor string) {
// runTask executes the work for a task, writing output to j. // runTask executes the work for a task, writing output to j.
func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) { func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if q.opts == nil || q.opts.App == nil { if q.opts == nil {
j.append("ERROR: app not configured") j.append("ERROR: handler options not configured")
j.finish("app not configured") j.finish("handler options not configured")
return return
} }
a := q.opts.App a := q.opts.App
@@ -387,6 +435,10 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
switch t.Target { switch t.Target {
case "nvidia": case "nvidia":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
diagLevel := t.params.DiagLevel diagLevel := t.params.DiagLevel
if t.params.BurnProfile != "" && diagLevel <= 0 { if t.params.BurnProfile != "" && diagLevel <= 0 {
diagLevel = resolveBurnPreset(t.params.BurnProfile).NvidiaDiag diagLevel = resolveBurnPreset(t.params.BurnProfile).NvidiaDiag
@@ -403,11 +455,38 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
} else { } else {
archive, err = a.RunNvidiaAcceptancePack("", j.append) archive, err = a.RunNvidiaAcceptancePack("", j.append)
} }
case "nvidia-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{
DurationSec: dur,
Loader: t.params.Loader,
GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices,
}, j.append)
case "memory": case "memory":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runMemoryAcceptancePackCtx(a, ctx, "", j.append) archive, err = runMemoryAcceptancePackCtx(a, ctx, "", j.append)
case "storage": case "storage":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runStorageAcceptancePackCtx(a, ctx, "", j.append) archive, err = runStorageAcceptancePackCtx(a, ctx, "", j.append)
case "cpu": case "cpu":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
@@ -415,35 +494,68 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if dur <= 0 { if dur <= 0 {
dur = 60 dur = 60
} }
j.append(fmt.Sprintf("CPU stress duration: %ds", dur))
archive, err = runCPUAcceptancePackCtx(a, ctx, "", dur, j.append) archive, err = runCPUAcceptancePackCtx(a, ctx, "", dur, j.append)
case "amd": case "amd":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDAcceptancePackCtx(a, ctx, "", j.append) archive, err = runAMDAcceptancePackCtx(a, ctx, "", j.append)
case "amd-mem": case "amd-mem":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDMemIntegrityPackCtx(a, ctx, "", j.append) archive, err = runAMDMemIntegrityPackCtx(a, ctx, "", j.append)
case "amd-bandwidth": case "amd-bandwidth":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = runAMDMemBandwidthPackCtx(a, ctx, "", j.append) archive, err = runAMDMemBandwidthPackCtx(a, ctx, "", j.append)
case "amd-stress": case "amd-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
archive, err = runAMDStressPackCtx(a, ctx, "", dur, j.append) archive, err = runAMDStressPackCtx(a, ctx, "", dur, j.append)
case "memory-stress": case "memory-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
archive, err = runMemoryStressPackCtx(a, ctx, "", dur, j.append) archive, err = runMemoryStressPackCtx(a, ctx, "", dur, j.append)
case "sat-stress": case "sat-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
archive, err = runSATStressPackCtx(a, ctx, "", dur, j.append) archive, err = runSATStressPackCtx(a, ctx, "", dur, j.append)
case "platform-stress": case "platform-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
opts := resolvePlatformStressPreset(t.params.BurnProfile) opts := resolvePlatformStressPreset(t.params.BurnProfile)
archive, err = a.RunPlatformStress(ctx, "", opts, j.append) archive, err = a.RunPlatformStress(ctx, "", opts, j.append)
case "audit": case "audit":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
result, e := a.RunAuditNow(q.opts.RuntimeMode) result, e := a.RunAuditNow(q.opts.RuntimeMode)
if e != nil { if e != nil {
err = e err = e
@@ -452,7 +564,22 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
j.append(line) j.append(line)
} }
} }
case "support-bundle":
j.append("Building support bundle...")
archive, err = buildSupportBundle(q.opts.ExportDir)
case "install":
if strings.TrimSpace(t.params.Device) == "" {
err = fmt.Errorf("device is required")
break
}
installLogPath := platform.InstallLogPath(t.params.Device)
j.append("Install log: " + installLogPath)
err = streamCmdJob(j, installCommand(ctx, t.params.Device, installLogPath))
case "install-to-ram": case "install-to-ram":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
err = a.RunInstallToRAM(ctx, j.append) err = a.RunInstallToRAM(ctx, j.append)
default: default:
j.append("ERROR: unknown target: " + t.Target) j.append("ERROR: unknown target: " + t.Target)
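
Both `buildSupportBundle` and `installCommand` are declared as package-level function variables, which lets tests substitute them without a configured `app.App` or a real `bee-install` binary. The sketch below illustrates that seam with hypothetical names (`buildBundle`, `runSupportBundleTask`) rather than the project's own identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// buildBundle is a hypothetical seam: production code assigns the real
// builder, while tests replace it with a stub.
var buildBundle = func(exportDir string) (string, error) {
	return "", errors.New("real builder not available in this sketch")
}

// runSupportBundleTask mirrors the shape of the "support-bundle" case:
// log a line, call through the seam, and return the archive path.
func runSupportBundleTask(exportDir string, logf func(string)) (string, error) {
	logf("Building support bundle...")
	return buildBundle(exportDir)
}

func main() {
	// Swap in a stub for the duration of the call, then restore the original.
	orig := buildBundle
	buildBundle = func(exportDir string) (string, error) {
		return exportDir + "/support.tar.gz", nil
	}
	defer func() { buildBundle = orig }()

	var lines []string
	archive, err := runSupportBundleTask("/tmp/export", func(s string) { lines = append(lines, s) })
	fmt.Println(archive, err, lines)
}
```

Restoring the original value with `defer` keeps the stub from leaking into other tests. Because the variable is shared package state, tests that swap it should not run in parallel with each other.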


@@ -3,7 +3,9 @@ package webui
import ( import (
"context" "context"
"os" "os"
"os/exec"
"path/filepath" "path/filepath"
"strings"
"testing" "testing"
"time" "time"
@@ -95,9 +97,24 @@ func TestResolveBurnPreset(t *testing.T) {
} }
} }
func TestRunTaskHonorsCancel(t *testing.T) { func TestTaskDisplayNameUsesNvidiaStressLoader(t *testing.T) {
t.Parallel() tests := []struct {
loader string
want string
}{
{loader: "", want: "NVIDIA GPU Stress (bee-gpu-burn)"},
{loader: "builtin", want: "NVIDIA GPU Stress (bee-gpu-burn)"},
{loader: "john", want: "NVIDIA GPU Stress (John/OpenCL)"},
{loader: "nccl", want: "NVIDIA GPU Stress (NCCL)"},
}
for _, tc := range tests {
if got := taskDisplayName("nvidia-stress", "acceptance", tc.loader); got != tc.want {
t.Fatalf("taskDisplayName(loader=%q)=%q want %q", tc.loader, got, tc.want)
}
}
}
func TestRunTaskHonorsCancel(t *testing.T) {
blocked := make(chan struct{}) blocked := make(chan struct{})
released := make(chan struct{}) released := make(chan struct{})
aRun := func(_ any, ctx context.Context, _ string, _ int, _ func(string)) (string, error) { aRun := func(_ any, ctx context.Context, _ string, _ int, _ func(string)) (string, error) {
@@ -154,3 +171,111 @@ func TestRunTaskHonorsCancel(t *testing.T) {
t.Fatal("runTask did not return after cancel") t.Fatal("runTask did not return after cancel")
} }
} }
func TestRunTaskUsesBurnProfileDurationForCPU(t *testing.T) {
var gotDuration int
q := &taskQueue{
opts: &HandlerOptions{App: &app.App{}},
}
tk := &Task{
ID: "cpu-burn-1",
Name: "CPU Burn-in",
Target: "cpu",
Status: TaskRunning,
CreatedAt: time.Now(),
params: taskParams{BurnProfile: "smoke"},
}
j := &jobState{}
orig := runCPUAcceptancePackCtx
runCPUAcceptancePackCtx = func(_ *app.App, _ context.Context, _ string, durationSec int, _ func(string)) (string, error) {
gotDuration = durationSec
return "/tmp/cpu-burn.tar.gz", nil
}
defer func() { runCPUAcceptancePackCtx = orig }()
q.runTask(tk, j, context.Background())
if gotDuration != 5*60 {
t.Fatalf("duration=%d want %d", gotDuration, 5*60)
}
}
func TestRunTaskBuildsSupportBundleWithoutApp(t *testing.T) {
dir := t.TempDir()
q := &taskQueue{
opts: &HandlerOptions{ExportDir: dir},
}
tk := &Task{
ID: "support-bundle-1",
Name: "Support Bundle",
Target: "support-bundle",
Status: TaskRunning,
CreatedAt: time.Now(),
}
j := &jobState{}
var gotExportDir string
orig := buildSupportBundle
buildSupportBundle = func(exportDir string) (string, error) {
gotExportDir = exportDir
return filepath.Join(exportDir, "bundle.tar.gz"), nil
}
defer func() { buildSupportBundle = orig }()
q.runTask(tk, j, context.Background())
if gotExportDir != dir {
t.Fatalf("exportDir=%q want %q", gotExportDir, dir)
}
if j.err != "" {
t.Fatalf("unexpected error: %q", j.err)
}
if !strings.Contains(strings.Join(j.lines, "\n"), "Archive: "+filepath.Join(dir, "bundle.tar.gz")) {
t.Fatalf("lines=%v", j.lines)
}
}
func TestRunTaskInstallUsesSharedCommandStreaming(t *testing.T) {
q := &taskQueue{
opts: &HandlerOptions{},
}
tk := &Task{
ID: "install-1",
Name: "Install to Disk",
Target: "install",
Status: TaskRunning,
CreatedAt: time.Now(),
params: taskParams{Device: "/dev/sda"},
}
j := &jobState{}
var gotDevice string
var gotLogPath string
orig := installCommand
installCommand = func(ctx context.Context, device string, logPath string) *exec.Cmd {
gotDevice = device
gotLogPath = logPath
return exec.CommandContext(ctx, "sh", "-c", "printf 'line1\nline2\n'")
}
defer func() { installCommand = orig }()
q.runTask(tk, j, context.Background())
if gotDevice != "/dev/sda" {
t.Fatalf("device=%q want /dev/sda", gotDevice)
}
if gotLogPath == "" {
t.Fatal("expected install log path")
}
logs := strings.Join(j.lines, "\n")
if !strings.Contains(logs, "Install log: ") {
t.Fatalf("missing install log line: %v", j.lines)
}
if !strings.Contains(logs, "line1") || !strings.Contains(logs, "line2") {
t.Fatalf("missing streamed output: %v", j.lines)
}
if j.err != "" {
t.Fatalf("unexpected error: %q", j.err)
}
}
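The tests above rely on package-level function variables (`runCPUAcceptancePackCtx`, `buildSupportBundle`, `installCommand`) that a test overwrites and then restores with `defer`. The sketch below shows that seam on its own. All names in it (`buildBundle`, `runBundleTask`, `withStub`) are illustrative and not part of the project.

```go
package main

import "fmt"

// buildBundle is a package-level function variable so tests can swap it.
// The name and signature are illustrative, not the project's real API.
var buildBundle = func(exportDir string) (string, error) {
	return exportDir + "/real.tar.gz", nil
}

// runBundleTask calls through the variable, so an installed stub takes effect.
func runBundleTask(exportDir string) (string, error) {
	return buildBundle(exportDir)
}

// withStub installs a stub, runs fn, and restores the original afterwards.
func withStub(stub func(string) (string, error), fn func()) {
	orig := buildBundle
	buildBundle = stub
	defer func() { buildBundle = orig }()
	fn()
}

func main() {
	var seen string
	withStub(func(dir string) (string, error) {
		seen = dir
		return dir + "/stub.tar.gz", nil
	}, func() {
		got, _ := runBundleTask("/tmp/export")
		fmt.Println("stubbed:", got, "seen:", seen)
	})
	got, _ := runBundleTask("/tmp/export")
	fmt.Println("restored:", got)
}
```

Swapping a shared variable is not safe while other tests run concurrently, so tests written this way should not call `t.Parallel()`. The diff appears to drop `t.Parallel()` from `TestRunTaskHonorsCancel`, which is consistent with that constraint.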

bible

Submodule bible updated: 456c1f022c...688b87e98d

@@ -81,9 +81,9 @@ build-in-container.sh [--authorized-keys /path/to/keys]
7. `build-cublas.sh`: 7. `build-cublas.sh`:
a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo a. download `libcublas`, `libcublasLt`, `libcudart` runtime + dev packages from the NVIDIA CUDA Debian repo
b. verify packages against repo `Packages.gz` b. verify packages against repo `Packages.gz`
c. extract headers for `bee-gpu-stress` build c. extract headers for `bee-gpu-burn` worker build
d. cache userspace libs in `dist/cublas-<version>+cuda<series>/` d. cache userspace libs in `dist/cublas-<version>+cuda<series>/`
8. build `bee-gpu-stress` against extracted cuBLASLt/cudart headers 8. build `bee-gpu-burn` worker against extracted cuBLASLt/cudart headers
9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/` 9. inject NVIDIA `.ko` → staged `/usr/local/lib/nvidia/`
10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi` 10. inject `nvidia-smi` → staged `/usr/local/bin/nvidia-smi`
11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/` 11. inject `libnvidia-ml` + `libcuda` + `libcublas` + `libcublasLt` + `libcudart` → staged `/usr/lib/`
@@ -104,7 +104,7 @@ Build host notes:
1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build 1. `build-in-container.sh` / `build-nvidia-module.sh` — Debian kernel headers for module build
2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO 2. `auto/config` — `linux-image-${DEBIAN_KERNEL_ABI}` in the ISO
- NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`. - NVIDIA modules go to staged `usr/local/lib/nvidia/` — NOT to `/lib/modules/<kver>/extra/`.
- `bee-gpu-stress` must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers. - `bee-gpu-burn` worker must be built against cached CUDA userspace headers from `build-cublas.sh`, not against random host-installed CUDA headers.
- The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot. - The live ISO must ship `libcublas`, `libcublasLt`, and `libcudart` together with `libcuda` so tensor-core stress works without internet or package installs at boot.
- The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay. - The source overlay in `iso/overlay/` is treated as immutable source. Build-time files are injected only into the staged overlay.
- The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean. - The live-build workdir under `dist/` is disposable; source files under `iso/builder/` stay clean.
@@ -153,18 +153,17 @@ Current validation state:
Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal. Every collector returns `nil, nil` on tool-not-found. Errors are logged, never fatal.
Acceptance flows: Acceptance flows:
- `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + mixed-precision `bee-gpu-stress` - `bee sat nvidia` → diagnostic archive with `nvidia-smi -q` + `nvidia-bug-report` + lightweight `bee-gpu-burn`
- NVIDIA GPU burn-in can use either `bee-gpu-burn` or `bee-john-gpu-stress` (John the Ripper jumbo via OpenCL)
- `bee sat memory` → `memtester` archive - `bee sat memory` → `memtester` archive
- `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported - `bee sat storage` → SMART/NVMe diagnostic archive and short self-test trigger where supported
- SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`) - SAT `summary.txt` now includes `overall_status` and per-job `*_status` values (`OK`, `FAILED`, `UNSUPPORTED`)
- `bee-gpu-stress` should prefer cuBLASLt GEMM load over the old integer/PTX burn path: - `bee-gpu-burn` should prefer cuBLASLt GEMM load over the old integer/PTX burn path:
- Ampere: `fp16` + `fp32`/TF32 tensor-core load - Ampere: `fp16` + `fp32`/TF32 tensor-core load
- Ada / Hopper: add `fp8` - Ada / Hopper: add `fp8`
- Blackwell+: add `fp4` - Blackwell+: add `fp4`
- PTX fallback is only for missing cuBLASLt/userspace or unsupported narrow datatypes - PTX fallback is only for missing cuBLASLt/userspace or unsupported narrow datatypes
- Runtime overrides: - Runtime overrides:
- `BEE_GPU_STRESS_SECONDS`
- `BEE_GPU_STRESS_SIZE_MB`
- `BEE_MEMTESTER_SIZE_MB` - `BEE_MEMTESTER_SIZE_MB`
- `BEE_MEMTESTER_PASSES` - `BEE_MEMTESTER_PASSES`
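The per-architecture datatype list for `bee-gpu-burn` above can be read as a small selection function. The sketch below is one way to express it, not the worker's actual code. The mapping from architecture names to compute capability (Ampere 8.0 to 8.7, Ada 8.9, Hopper 9.0, Blackwell 10.x and later) is an assumption taken from NVIDIA's published capability tables, and `stressDatatypes` is a hypothetical name.

```go
package main

import "fmt"

// stressDatatypes returns the GEMM datatypes to exercise for a given
// compute capability, following the list in the build notes.
// Parts older than Ampere are not covered by that list, so the sketch
// returns nil for them rather than guessing.
func stressDatatypes(major, minor int) []string {
	if major < 8 {
		return nil
	}
	types := []string{"fp16", "fp32/tf32"} // Ampere baseline
	if major > 8 || minor >= 9 {
		types = append(types, "fp8") // Ada (8.9), Hopper (9.0) and newer
	}
	if major >= 10 {
		types = append(types, "fp4") // Blackwell and newer
	}
	return types
}

func main() {
	for _, cc := range [][2]int{{8, 0}, {8, 9}, {9, 0}, {10, 0}} {
		fmt.Printf("sm_%d%d: %v\n", cc[0], cc[1], stressDatatypes(cc[0], cc[1]))
	}
}
```

The PTX fallback is deliberately left out: per the notes it applies only when cuBLASLt or the userspace stack is missing, which is a runtime condition rather than a property of the architecture.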
@@ -179,6 +178,6 @@ Web UI: Acceptance Tests page → Run Test button
``` ```
**Critical invariants:** **Critical invariants:**
- `bee-gpu-stress` uses `exec.CommandContext` — killed on job context cancel. - `bee-gpu-burn` / `bee-john-gpu-stress` use `exec.CommandContext` — killed on job context cancel.
- Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed). - Metric goroutine uses stopCh/doneCh pattern; main goroutine waits `<-doneCh` before reading rows (no mutex needed).
- SVG chart is fully offline: no JS, no external CSS, pure inline SVG. - SVG chart is fully offline: no JS, no external CSS, pure inline SVG.

@@ -21,8 +21,8 @@ Fills gaps where Redfish/logpile is blind:
- Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID - Read-only hardware inventory: board, CPU, memory, storage, PCIe, PSU, GPU, NIC, RAID
- Machine-readable health summary derived from collector verdicts - Machine-readable health summary derived from collector verdicts
- Operator-triggered acceptance tests for NVIDIA, memory, and storage - Operator-triggered acceptance tests for NVIDIA, memory, and storage
- NVIDIA SAT includes both diagnostic collection and mixed-precision GPU stress via `bee-gpu-stress` - NVIDIA SAT includes diagnostic collection plus a lightweight in-image GPU stress step via `bee-gpu-burn`
- `bee-gpu-stress` should exercise tensor/inference paths (`fp16`, `fp32`/TF32, `fp8`, `fp4` when supported by the GPU/userspace stack) and fall back to Driver API PTX burn only if cuBLASLt is unavailable - `bee-gpu-burn` should exercise tensor/inference paths (`fp16`, `fp32`/TF32, `fp8`, `fp4` when supported by the GPU/userspace stack) and fall back to Driver API PTX burn only if cuBLASLt is unavailable
- Automatic boot audit with operator-facing local console and SSH access - Automatic boot audit with operator-facing local console and SSH access
- NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi` - NVIDIA proprietary driver loaded at boot for GPU enrichment via `nvidia-smi`
- SSH access (OpenSSH) always available for inspection and debugging - SSH access (OpenSSH) always available for inspection and debugging
@@ -70,7 +70,7 @@ Fills gaps where Redfish/logpile is blind:
| SSH | OpenSSH server | | SSH | OpenSSH server |
| NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers | | NVIDIA driver | Proprietary `.run` installer, built against Debian kernel headers |
| NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` | | NVIDIA modules | Loaded via `insmod` from `/usr/local/lib/nvidia/` |
| GPU stress backend | `bee-gpu-stress` + cuBLASLt/cuBLAS/cudart mixed-precision GEMM, with Driver API PTX fallback | | GPU stress backend | `bee-gpu-burn` + cuBLASLt/cuBLAS/cudart mixed-precision GEMM, with Driver API PTX fallback |
| Builder | Debian 12 host/VM or Debian 12 container image | | Builder | Debian 12 host/VM or Debian 12 container image |
## Operator UX ## Operator UX

@@ -18,6 +18,8 @@ Use the official proprietary NVIDIA `.run` installer for both kernel modules and
- Kernel modules and nvidia-smi come from a single verified source. - Kernel modules and nvidia-smi come from a single verified source.
- NVIDIA publishes `.sha256sum` alongside each installer — download and verify before use. - NVIDIA publishes `.sha256sum` alongside each installer — download and verify before use.
- Driver version pinned in `iso/builder/VERSIONS` as `NVIDIA_DRIVER_VERSION`. - Driver version pinned in `iso/builder/VERSIONS` as `NVIDIA_DRIVER_VERSION`.
- DCGM must track the CUDA user-mode driver major version exposed by `nvidia-smi`.
- For NVIDIA driver branch `590` with CUDA `13.x`, use DCGM 4 package family `datacenter-gpu-manager-4-cuda13`; legacy `datacenter-gpu-manager` 3.x does not provide a working path for this stack.
- Build process: download `.run`, extract, compile `kernel/` sources against `linux-lts-dev`. - Build process: download `.run`, extract, compile `kernel/` sources against `linux-lts-dev`.
- Modules cached in `dist/nvidia-<version>-<kver>/` — rebuild only on version or kernel change. - Modules cached in `dist/nvidia-<version>-<kver>/` — rebuild only on version or kernel change.
- ISO size increases by ~50MB for .ko files + nvidia-smi. - ISO size increases by ~50MB for .ko files + nvidia-smi.
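The DCGM rule in this list ties the package family to the CUDA major version reported by `nvidia-smi`. A minimal sketch of that mapping follows. Only the `13.x` to `datacenter-gpu-manager-4-cuda13` pairing comes from the text above; extending the same naming pattern to other CUDA majors is an assumption, and `dcgmPackage` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// dcgmPackage derives the DCGM 4 package name from a CUDA version string
// such as "13.0". The name pattern for majors other than 13 is assumed.
func dcgmPackage(cudaVersion string) (string, error) {
	majorStr, _, _ := strings.Cut(strings.TrimSpace(cudaVersion), ".")
	major, err := strconv.Atoi(majorStr)
	if err != nil || major <= 0 {
		return "", fmt.Errorf("cannot parse CUDA major version from %q", cudaVersion)
	}
	return fmt.Sprintf("datacenter-gpu-manager-4-cuda%d", major), nil
}

func main() {
	pkg, err := dcgmPackage("13.0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pkg)
}
```

A build script could call something like this once and fail early if the version string is unparseable, instead of silently installing a mismatched DCGM family.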

@@ -0,0 +1,117 @@
# Decision: Treat memtest as explicit ISO content, not as trusted live-build magic
**Date:** 2026-04-01
**Status:** active
## Context
We have already iterated on `memtest` multiple times and kept cycling between the same ideas.
The commit history shows several distinct attempts:
- `f91bce8` — fixed Bookworm memtest file names to `memtest86+x64.bin` / `memtest86+x64.efi`
- `5857805` — added a binary hook to copy memtest files from the build tree into the ISO root
- `f96b149` — added fallback extraction from the cached `.deb` when `chroot/boot/` stayed empty
- `d43a9ae` — removed the custom hook and switched back to live-build built-in memtest integration
- `60cb8f8` — restored explicit memtest menu entries and added ISO validation
- `3dbc218` / `3869788` — added archived build logs and better memtest diagnostics
Current evidence from the archived `easy-bee-nvidia-v3.14-amd64` logs dated 2026-04-01:
- `lb binary_memtest` does run and installs `memtest86+`
- but the final ISO still does **not** contain `boot/memtest86+x64.bin`
- the final ISO also does **not** contain memtest menu entries in `boot/grub/grub.cfg` or `isolinux/live.cfg`
So the assumption "live-build built-in memtest integration is enough on this stack" is currently false for this project until proven otherwise by a real built ISO.
Additional evidence from the archived `easy-bee-nvidia-v3.17-dirty-amd64` logs dated 2026-04-01:
- the build now completes successfully because memtest is non-blocking by default
- `lb binary_memtest` still runs and installs `memtest86+`
- the project-owned hook `config/hooks/normal/9100-memtest.hook.binary` does execute
- but it executes too early for its current target paths:
- `binary/boot/grub/grub.cfg` is still missing at hook time
- `binary/isolinux/live.cfg` is still missing at hook time
- memtest binaries are also still absent in `binary/boot/`
- later in the build, live-build does create intermediate bootloader configs with memtest lines in the workdir
- but the final ISO still lacks memtest binaries and still lacks memtest lines in extracted ISO `boot/grub/grub.cfg` and `isolinux/live.cfg`
So the assumption "the current normal binary hook path is late enough to patch final memtest artifacts" is also false.
## Known Failed Attempts
These approaches were already tried and should not be repeated blindly:
1. Built-in live-build memtest only.
Reason it failed:
- `lb binary_memtest` runs, but the final ISO still misses memtest binaries and menu entries.
2. Fixing only the memtest file names for Debian Bookworm.
Reason it failed:
- correct file names alone do not make the files appear in the final ISO.
3. Copying memtest from `chroot/boot/` into `binary/boot/` via a binary hook.
Reason it failed:
- in this stack `chroot/boot/` is often empty for memtest payloads at the relevant time.
4. Fallback extraction from cached `memtest86+` `.deb`.
Reason it failed:
- this was explored already and was not enough to stabilize the final ISO path end-to-end.
5. Restoring explicit memtest menu entries in source bootloader templates only.
Reason it failed:
- memtest lines in source templates or intermediate workdir configs do not guarantee the final ISO contains them.
6. Patching `binary/boot/grub/grub.cfg` and `binary/isolinux/live.cfg` from the current `config/hooks/normal/9100-memtest.hook.binary`.
Reason it failed:
- the hook runs before those files exist, so the hook cannot patch them there.
## What This Means
When revisiting memtest later, start from the constraints above rather than retrying the same patterns:
- do not assume the built-in memtest stage is sufficient
- do not assume `chroot/boot/` will contain memtest payloads
- do not assume source bootloader templates are the last writer of final ISO configs
- do not assume the current normal binary hook timing is late enough for final patching
Any future memtest fix must explicitly:
- identify where the memtest binaries are reliably available at build time
- identify which exact build stage writes the final bootloader configs that land in the ISO
- provide post-build proof from a real ISO, not only from intermediate workdir files
## Decision
For `bee`, memtest must be treated as an explicit ISO artifact with explicit post-build validation.
Project rules from now on:
- Do **not** trust `--memtest memtest86+` by itself.
- A memtest implementation is considered valid only if the produced ISO actually contains:
- `boot/memtest86+x64.bin`
- `boot/memtest86+x64.efi`
- a GRUB menu entry
- an isolinux menu entry
- If live-build built-in integration does not produce those artifacts, use an explicit project-owned mechanism such as:
- a binary hook copying files into `binary/boot/`
- extraction from the cached `memtest86+` `.deb`
- another deterministic build-time copy step
- Do **not** remove such explicit logic later unless a fresh real ISO build proves that built-in integration alone produces all required files and menu entries.
Current implementation direction:
- keep the live-build memtest stage enabled if it helps package acquisition
- do not rely on the current early `binary_hooks` timing for final patching
- prefer a post-`lb build` recovery step in `build.sh` that:
- patches the fully materialized `LB_DIR/binary` tree
- injects memtest binaries there
- ensures final bootloader entries there
- reruns late binary stages (`binary_checksums`, `binary_iso`, `binary_zsync`) after the patch
## Consequences
- Future memtest changes must begin by reading this ADR, including the commits and the failed-attempt list above.
- We should stop re-introducing "prefer built-in live-build memtest" as a default assumption without new evidence.
- Memtest validation in `build.sh` is not optional; it is the acceptance gate that prevents another silent regression.
- If we change memtest strategy again, we must update this ADR with the exact build evidence that justified the change.

@@ -5,3 +5,4 @@ One file per decision, named `YYYY-MM-DD-short-topic.md`.
| Date | Decision | Status | | Date | Decision | Status |
|---|---|---| |---|---|---|
| 2026-03-05 | Use NVIDIA proprietary driver | active | | 2026-03-05 | Use NVIDIA proprietary driver | active |
| 2026-04-01 | Treat memtest as explicit ISO content | active |

@@ -13,9 +13,43 @@ Use one of:
This applies to: This applies to:
- `iso/builder/config/package-lists/*.list.chroot` - `iso/builder/config/package-lists/*.list.chroot`
- Any package referenced in `grub.cfg`, hooks, or overlay scripts (e.g. file paths like `/boot/memtest86+x64.bin`) - Any package referenced in bootloader configs, hooks, or overlay scripts
## Example of what goes wrong without this ## Memtest rule
`memtest86+` in Debian bookworm installs `/boot/memtest86+x64.bin`, not `/boot/memtest86+.bin`. Do not assume live-build's built-in memtest integration is sufficient for `bee`.
Guessing the filename caused a broken GRUB entry that only surfaced at boot time, after a full rebuild. We already tried that path and regressed again on 2026-04-01: `lb binary_memtest`
ran, but the final ISO still lacked memtest binaries and menu entries.
For this project, memtest is accepted only when the produced ISO actually
contains all of the following:
- `boot/memtest86+x64.bin`
- `boot/memtest86+x64.efi`
- a memtest entry in `boot/grub/grub.cfg`
- a memtest entry in `isolinux/live.cfg`
Rules:
- Keep explicit post-build memtest validation in `build.sh`.
- If built-in integration does not produce the artifacts above, use a
deterministic project-owned copy/extract step instead of hoping live-build
will "start working".
- Do not switch back to built-in-only memtest without fresh build evidence from
a real ISO.
- If you reference memtest files manually, verify the exact package file list
first for the target Debian release.
Known bad loops for this repository:
- Do not retry built-in-only memtest without new evidence. We already proved
that `lb binary_memtest` can run while the final ISO still has no memtest.
- Do not assume fixing memtest file names is enough. Correct names did not fix
the final artifact path.
- Do not assume `chroot/boot/` contains memtest payloads at the time hooks run.
- Do not assume source `grub.cfg` / `live.cfg.in` are the final writers of ISO
bootloader configs.
- Do not assume the current `config/hooks/normal/9100-memtest.hook.binary`
timing is late enough to patch final `binary/boot/grub/grub.cfg` or
`binary/isolinux/live.cfg`; logs from 2026-04-01 showed those files were not
present yet when the hook executed.

@@ -48,6 +48,7 @@ sh iso/builder/build-in-container.sh --cache-dir /path/to/cache
- The builder image is automatically rebuilt if the local tag exists for the wrong architecture. - The builder image is automatically rebuilt if the local tag exists for the wrong architecture.
- The live ISO boots with Debian `live-boot` `toram`, so the read-only medium is copied into RAM during boot and the runtime no longer depends on the original USB/BMC virtual media staying present. - The live ISO boots with Debian `live-boot` `toram`, so the read-only medium is copied into RAM during boot and the runtime no longer depends on the original USB/BMC virtual media staying present.
- Target systems need enough RAM for the full compressed live medium plus normal runtime overhead, or boot may fail before reaching the TUI. - Target systems need enough RAM for the full compressed live medium plus normal runtime overhead, or boot may fail before reaching the TUI.
- The NVIDIA variant installs DCGM 4 packages matched to the CUDA user-mode driver major version. For driver branch `590` / CUDA `13.x`, the package family is `datacenter-gpu-manager-4-cuda13` rather than legacy `datacenter-gpu-manager`.
- Override the container platform only if you know why: - Override the container platform only if you know why:
```sh ```sh

@@ -23,6 +23,16 @@ RUN apt-get update -qq && apt-get install -y \
gcc \ gcc \
make \ make \
perl \ perl \
pkg-config \
yasm \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libgmp-dev \
libpcap-dev \
libsqlite3-dev \
libcurl4-openssl-dev \
ocl-icd-opencl-dev \
linux-headers-amd64 \ linux-headers-amd64 \
&& rm -rf /var/lib/apt/lists/* && rm -rf /var/lib/apt/lists/*

@@ -8,7 +8,8 @@ NCCL_TESTS_VERSION=2.13.10
NVCC_VERSION=12.8 NVCC_VERSION=12.8
CUBLAS_VERSION=13.0.2.14-1 CUBLAS_VERSION=13.0.2.14-1
CUDA_USERSPACE_VERSION=13.0.96-1 CUDA_USERSPACE_VERSION=13.0.96-1
DCGM_VERSION=3.3.9 DCGM_VERSION=4.5.2-1
JOHN_JUMBO_COMMIT=67fcf9fe5a
ROCM_VERSION=6.3.4 ROCM_VERSION=6.3.4
ROCM_SMI_VERSION=7.4.0.60304-76~22.04 ROCM_SMI_VERSION=7.4.0.60304-76~22.04
ROCM_BANDWIDTH_TEST_VERSION=1.4.0.60304-76~22.04 ROCM_BANDWIDTH_TEST_VERSION=1.4.0.60304-76~22.04

@@ -29,8 +29,8 @@ lb config noauto \
--security true \ --security true \
--linux-flavours "amd64" \ --linux-flavours "amd64" \
--linux-packages "${LB_LINUX_PACKAGES}" \ --linux-packages "${LB_LINUX_PACKAGES}" \
--memtest none \ --memtest memtest86+ \
--iso-volume "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \ --iso-volume "EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--iso-application "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \ --iso-application "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--bootappend-live "boot=live components video=1920x1080 console=tty0 console=ttyS0,115200n8 loglevel=7 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \ --bootappend-live "boot=live components video=1920x1080 console=tty0 console=ttyS0,115200n8 loglevel=7 username=bee user-fullname=Bee modprobe.blacklist=nouveau" \
--apt-recommends false \ --apt-recommends false \

@@ -29,8 +29,14 @@ typedef void *CUfunction;
typedef void *CUstream; typedef void *CUstream;
#define CU_SUCCESS 0 #define CU_SUCCESS 0
#define CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT 16
#define CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR 75 #define CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR 75
#define CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR 76 #define CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR 76
#define MAX_STRESS_STREAMS 16
#define MAX_CUBLAS_PROFILES 5
#define MIN_PROFILE_BUDGET_BYTES ((size_t)4u * 1024u * 1024u)
#define MIN_STREAM_BUDGET_BYTES ((size_t)64u * 1024u * 1024u)
#define STRESS_LAUNCH_DEPTH 8
static const char *ptx_source = static const char *ptx_source =
".version 6.0\n" ".version 6.0\n"
@@ -97,6 +103,9 @@ typedef CUresult (*cuLaunchKernel_fn)(CUfunction,
CUstream, CUstream,
void **, void **,
void **); void **);
typedef CUresult (*cuMemGetInfo_fn)(size_t *, size_t *);
typedef CUresult (*cuStreamCreate_fn)(CUstream *, unsigned int);
typedef CUresult (*cuStreamDestroy_fn)(CUstream);
typedef CUresult (*cuGetErrorName_fn)(CUresult, const char **); typedef CUresult (*cuGetErrorName_fn)(CUresult, const char **);
typedef CUresult (*cuGetErrorString_fn)(CUresult, const char **); typedef CUresult (*cuGetErrorString_fn)(CUresult, const char **);
@@ -118,6 +127,9 @@ struct cuda_api {
cuModuleLoadDataEx_fn cuModuleLoadDataEx; cuModuleLoadDataEx_fn cuModuleLoadDataEx;
cuModuleGetFunction_fn cuModuleGetFunction; cuModuleGetFunction_fn cuModuleGetFunction;
cuLaunchKernel_fn cuLaunchKernel; cuLaunchKernel_fn cuLaunchKernel;
cuMemGetInfo_fn cuMemGetInfo;
cuStreamCreate_fn cuStreamCreate;
cuStreamDestroy_fn cuStreamDestroy;
cuGetErrorName_fn cuGetErrorName; cuGetErrorName_fn cuGetErrorName;
cuGetErrorString_fn cuGetErrorString; cuGetErrorString_fn cuGetErrorString;
}; };
@@ -128,9 +140,10 @@ struct stress_report {
int cc_major; int cc_major;
int cc_minor; int cc_minor;
int buffer_mb; int buffer_mb;
int stream_count;
unsigned long iterations; unsigned long iterations;
uint64_t checksum; uint64_t checksum;
char details[1024]; char details[16384];
}; };
static int load_symbol(void *lib, const char *name, void **out) { static int load_symbol(void *lib, const char *name, void **out) {
@@ -144,7 +157,7 @@ static int load_cuda(struct cuda_api *api) {
if (!api->lib) { if (!api->lib) {
return 0; return 0;
} }
return if (!(
load_symbol(api->lib, "cuInit", (void **)&api->cuInit) && load_symbol(api->lib, "cuInit", (void **)&api->cuInit) &&
load_symbol(api->lib, "cuDeviceGetCount", (void **)&api->cuDeviceGetCount) && load_symbol(api->lib, "cuDeviceGetCount", (void **)&api->cuDeviceGetCount) &&
load_symbol(api->lib, "cuDeviceGet", (void **)&api->cuDeviceGet) && load_symbol(api->lib, "cuDeviceGet", (void **)&api->cuDeviceGet) &&
@@ -160,7 +173,17 @@ static int load_cuda(struct cuda_api *api) {
load_symbol(api->lib, "cuMemcpyDtoH_v2", (void **)&api->cuMemcpyDtoH) && load_symbol(api->lib, "cuMemcpyDtoH_v2", (void **)&api->cuMemcpyDtoH) &&
load_symbol(api->lib, "cuModuleLoadDataEx", (void **)&api->cuModuleLoadDataEx) && load_symbol(api->lib, "cuModuleLoadDataEx", (void **)&api->cuModuleLoadDataEx) &&
load_symbol(api->lib, "cuModuleGetFunction", (void **)&api->cuModuleGetFunction) && load_symbol(api->lib, "cuModuleGetFunction", (void **)&api->cuModuleGetFunction) &&
load_symbol(api->lib, "cuLaunchKernel", (void **)&api->cuLaunchKernel); load_symbol(api->lib, "cuLaunchKernel", (void **)&api->cuLaunchKernel))) {
dlclose(api->lib);
memset(api, 0, sizeof(*api));
return 0;
}
load_symbol(api->lib, "cuMemGetInfo_v2", (void **)&api->cuMemGetInfo);
load_symbol(api->lib, "cuStreamCreate", (void **)&api->cuStreamCreate);
if (!load_symbol(api->lib, "cuStreamDestroy_v2", (void **)&api->cuStreamDestroy)) {
load_symbol(api->lib, "cuStreamDestroy", (void **)&api->cuStreamDestroy);
}
return 1;
} }
static const char *cu_error_name(struct cuda_api *api, CUresult rc) { static const char *cu_error_name(struct cuda_api *api, CUresult rc) {
@@ -193,14 +216,12 @@ static double now_seconds(void) {
return (double)ts.tv_sec + ((double)ts.tv_nsec / 1000000000.0); return (double)ts.tv_sec + ((double)ts.tv_nsec / 1000000000.0);
} }
#if HAVE_CUBLASLT_HEADERS
static size_t round_down_size(size_t value, size_t multiple) { static size_t round_down_size(size_t value, size_t multiple) {
if (multiple == 0 || value < multiple) { if (multiple == 0 || value < multiple) {
return value; return value;
} }
return value - (value % multiple); return value - (value % multiple);
} }
#endif
static int query_compute_capability(struct cuda_api *api, CUdevice dev, int *major, int *minor) { static int query_compute_capability(struct cuda_api *api, CUdevice dev, int *major, int *minor) {
int cc_major = 0; int cc_major = 0;
@@ -220,6 +241,75 @@ static int query_compute_capability(struct cuda_api *api, CUdevice dev, int *maj
return 1; return 1;
} }
static int query_multiprocessor_count(struct cuda_api *api, CUdevice dev, int *count) {
int mp_count = 0;
if (!check_rc(api,
"cuDeviceGetAttribute(multiprocessors)",
api->cuDeviceGetAttribute(&mp_count, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev))) {
return 0;
}
*count = mp_count;
return 1;
}
static size_t clamp_budget_to_free_memory(struct cuda_api *api, size_t requested_bytes) {
size_t free_bytes = 0;
size_t total_bytes = 0;
size_t max_bytes = requested_bytes;
if (!api->cuMemGetInfo) {
return requested_bytes;
}
if (api->cuMemGetInfo(&free_bytes, &total_bytes) != CU_SUCCESS || free_bytes == 0) {
return requested_bytes;
}
max_bytes = (free_bytes * 9u) / 10u;
if (max_bytes < (size_t)4u * 1024u * 1024u) {
max_bytes = (size_t)4u * 1024u * 1024u;
}
if (requested_bytes > max_bytes) {
return max_bytes;
}
return requested_bytes;
}
static int choose_stream_count(int mp_count, int planned_profiles, size_t total_budget, int have_streams) {
int stream_count = 1;
if (!have_streams || mp_count <= 0 || planned_profiles <= 0) {
return 1;
}
stream_count = mp_count / 8;
if (stream_count < 2) {
stream_count = 2;
}
if (stream_count > MAX_STRESS_STREAMS) {
stream_count = MAX_STRESS_STREAMS;
}
while (stream_count > 1) {
size_t per_stream_budget = total_budget / ((size_t)planned_profiles * (size_t)stream_count);
if (per_stream_budget >= MIN_STREAM_BUDGET_BYTES) {
break;
}
stream_count--;
}
return stream_count;
}
static void destroy_streams(struct cuda_api *api, CUstream *streams, int count) {
if (!api->cuStreamDestroy) {
return;
}
for (int i = 0; i < count; i++) {
if (streams[i]) {
api->cuStreamDestroy(streams[i]);
streams[i] = NULL;
}
}
}
#if HAVE_CUBLASLT_HEADERS #if HAVE_CUBLASLT_HEADERS
static void append_detail(char *buf, size_t cap, const char *fmt, ...) { static void append_detail(char *buf, size_t cap, const char *fmt, ...) {
size_t len = strlen(buf); size_t len = strlen(buf);
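The stream-count heuristic in `choose_stream_count` from the hunk above is easier to reason about with concrete numbers. The sketch below is a Go transliteration written only for experimentation; the worker itself is C, and the SM counts and budgets used here are made-up inputs. The constants mirror `MAX_STRESS_STREAMS` (16) and `MIN_STREAM_BUDGET_BYTES` (64 MiB) from the diff.

```go
package main

import "fmt"

const (
	maxStressStreams     = 16
	minStreamBudgetBytes = 64 << 20 // 64 MiB per stream and profile
)

// chooseStreamCount starts from one stream per 8 multiprocessors, clamps the
// result to [2, 16], then reduces it until each stream gets at least 64 MiB.
func chooseStreamCount(mpCount, plannedProfiles int, totalBudget uint64, haveStreams bool) int {
	if !haveStreams || mpCount <= 0 || plannedProfiles <= 0 {
		return 1
	}
	n := mpCount / 8
	if n < 2 {
		n = 2
	}
	if n > maxStressStreams {
		n = maxStressStreams
	}
	for n > 1 {
		perStream := totalBudget / (uint64(plannedProfiles) * uint64(n))
		if perStream >= minStreamBudgetBytes {
			break
		}
		n--
	}
	return n
}

func main() {
	const mib = uint64(1) << 20
	fmt.Println(chooseStreamCount(132, 1, 1024*mib, true)) // large GPU, 1 GiB budget
	fmt.Println(chooseStreamCount(132, 1, 512*mib, true))  // same GPU, half the budget
	fmt.Println(chooseStreamCount(132, 5, 1024*mib, true)) // five profiles share the budget
	fmt.Println(chooseStreamCount(8, 1, 1024*mib, true))   // small GPU is raised to 2
}
```

With these inputs the function returns 16, 8, 3 and 2: the memory budget, not the multiprocessor count, becomes the limiting factor as soon as several profiles share it.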
@@ -242,12 +332,19 @@ static int run_ptx_fallback(struct cuda_api *api,
int size_mb, int size_mb,
struct stress_report *report) { struct stress_report *report) {
CUcontext ctx = NULL; CUcontext ctx = NULL;
CUdeviceptr device_mem = 0;
CUmodule module = NULL; CUmodule module = NULL;
CUfunction kernel = NULL; CUfunction kernel = NULL;
uint32_t sample[256]; uint32_t sample[256];
uint32_t words = 0; CUdeviceptr device_mem[MAX_STRESS_STREAMS] = {0};
CUstream streams[MAX_STRESS_STREAMS] = {0};
uint32_t words[MAX_STRESS_STREAMS] = {0};
uint32_t rounds[MAX_STRESS_STREAMS] = {0};
void *params[MAX_STRESS_STREAMS][3];
size_t bytes_per_stream[MAX_STRESS_STREAMS] = {0};
unsigned long iterations = 0; unsigned long iterations = 0;
int mp_count = 0;
int stream_count = 1;
int launches_per_wave = 0;
memset(report, 0, sizeof(*report)); memset(report, 0, sizeof(*report));
snprintf(report->backend, sizeof(report->backend), "driver-ptx"); snprintf(report->backend, sizeof(report->backend), "driver-ptx");
@@ -260,64 +357,109 @@ static int run_ptx_fallback(struct cuda_api *api,
 return 0;
 }
-size_t bytes = (size_t)size_mb * 1024u * 1024u;
-if (bytes < 4u * 1024u * 1024u) {
-bytes = 4u * 1024u * 1024u;
+size_t requested_bytes = (size_t)size_mb * 1024u * 1024u;
+if (requested_bytes < MIN_PROFILE_BUDGET_BYTES) {
+requested_bytes = MIN_PROFILE_BUDGET_BYTES;
 }
-if (bytes > (size_t)1024u * 1024u * 1024u) {
-bytes = (size_t)1024u * 1024u * 1024u;
+size_t total_bytes = clamp_budget_to_free_memory(api, requested_bytes);
+if (total_bytes < MIN_PROFILE_BUDGET_BYTES) {
+total_bytes = MIN_PROFILE_BUDGET_BYTES;
 }
-words = (uint32_t)(bytes / sizeof(uint32_t));
-if (!check_rc(api, "cuMemAlloc", api->cuMemAlloc(&device_mem, bytes))) {
-api->cuCtxDestroy(ctx);
-return 0;
+report->buffer_mb = (int)(total_bytes / (1024u * 1024u));
+if (query_multiprocessor_count(api, dev, &mp_count) &&
+api->cuStreamCreate &&
+api->cuStreamDestroy) {
+stream_count = choose_stream_count(mp_count, 1, total_bytes, 1);
 }
-if (!check_rc(api, "cuMemsetD8", api->cuMemsetD8(device_mem, 0, bytes))) {
-api->cuMemFree(device_mem);
-api->cuCtxDestroy(ctx);
-return 0;
+if (stream_count > 1) {
+int created = 0;
+for (; created < stream_count; created++) {
+if (!check_rc(api, "cuStreamCreate", api->cuStreamCreate(&streams[created], 0))) {
+destroy_streams(api, streams, created);
+stream_count = 1;
+break;
+}
+}
 }
+report->stream_count = stream_count;
+for (int lane = 0; lane < stream_count; lane++) {
+size_t slice = total_bytes / (size_t)stream_count;
+if (lane == stream_count - 1) {
+slice = total_bytes - ((size_t)lane * (total_bytes / (size_t)stream_count));
+}
+slice = round_down_size(slice, sizeof(uint32_t));
+if (slice < MIN_PROFILE_BUDGET_BYTES) {
+slice = MIN_PROFILE_BUDGET_BYTES;
+}
+bytes_per_stream[lane] = slice;
+words[lane] = (uint32_t)(slice / sizeof(uint32_t));
+if (!check_rc(api, "cuMemAlloc", api->cuMemAlloc(&device_mem[lane], slice))) {
+goto fail;
+}
+if (!check_rc(api, "cuMemsetD8", api->cuMemsetD8(device_mem[lane], 0, slice))) {
+goto fail;
+}
+rounds[lane] = 2048;
+params[lane][0] = &device_mem[lane];
+params[lane][1] = &words[lane];
+params[lane][2] = &rounds[lane];
+}
 if (!check_rc(api,
 "cuModuleLoadDataEx",
 api->cuModuleLoadDataEx(&module, ptx_source, 0, NULL, NULL))) {
-api->cuMemFree(device_mem);
-api->cuCtxDestroy(ctx);
-return 0;
+goto fail;
 }
 if (!check_rc(api, "cuModuleGetFunction", api->cuModuleGetFunction(&kernel, module, "burn"))) {
-api->cuMemFree(device_mem);
-api->cuCtxDestroy(ctx);
-return 0;
+goto fail;
 }
 unsigned int threads = 256;
-unsigned int blocks = (unsigned int)((words + threads - 1) / threads);
-uint32_t rounds = 1024;
-void *params[] = {&device_mem, &words, &rounds};
 double start = now_seconds();
 double deadline = start + (double)seconds;
 while (now_seconds() < deadline) {
-if (!check_rc(api,
-"cuLaunchKernel",
-api->cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1, 0, NULL, params, NULL))) {
-api->cuMemFree(device_mem);
-api->cuCtxDestroy(ctx);
-return 0;
+launches_per_wave = 0;
+for (int depth = 0; depth < STRESS_LAUNCH_DEPTH && now_seconds() < deadline; depth++) {
+int launched_this_batch = 0;
+for (int lane = 0; lane < stream_count; lane++) {
+unsigned int blocks = (unsigned int)((words[lane] + threads - 1) / threads);
+if (!check_rc(api,
+"cuLaunchKernel",
+api->cuLaunchKernel(kernel,
+blocks,
+1,
+1,
+threads,
+1,
+1,
+0,
+streams[lane],
+params[lane],
+NULL))) {
+goto fail;
+}
+launches_per_wave++;
+launched_this_batch++;
+}
+if (launched_this_batch <= 0) {
+break;
+}
 }
-iterations++;
+if (launches_per_wave <= 0) {
+goto fail;
+}
+if (!check_rc(api, "cuCtxSynchronize", api->cuCtxSynchronize())) {
+goto fail;
+}
+iterations += (unsigned long)launches_per_wave;
 }
-if (!check_rc(api, "cuCtxSynchronize", api->cuCtxSynchronize())) {
-api->cuMemFree(device_mem);
-api->cuCtxDestroy(ctx);
-return 0;
-}
-if (!check_rc(api, "cuMemcpyDtoH", api->cuMemcpyDtoH(sample, device_mem, sizeof(sample)))) {
-api->cuMemFree(device_mem);
-api->cuCtxDestroy(ctx);
-return 0;
+if (!check_rc(api, "cuMemcpyDtoH", api->cuMemcpyDtoH(sample, device_mem[0], sizeof(sample)))) {
+goto fail;
 }
 for (size_t i = 0; i < sizeof(sample) / sizeof(sample[0]); i++) {
@@ -326,12 +468,34 @@ static int run_ptx_fallback(struct cuda_api *api,
 report->iterations = iterations;
 snprintf(report->details,
 sizeof(report->details),
-"profile_int32_fallback=OK iterations=%lu\n",
+"fallback_int32=OK requested_mb=%d actual_mb=%d streams=%d queue_depth=%d per_stream_mb=%zu iterations=%lu\n",
+size_mb,
+report->buffer_mb,
+report->stream_count,
+STRESS_LAUNCH_DEPTH,
+bytes_per_stream[0] / (1024u * 1024u),
 iterations);
-api->cuMemFree(device_mem);
+for (int lane = 0; lane < stream_count; lane++) {
+if (device_mem[lane]) {
+api->cuMemFree(device_mem[lane]);
+}
+}
+destroy_streams(api, streams, stream_count);
 api->cuCtxDestroy(ctx);
 return 1;
+fail:
+for (int lane = 0; lane < MAX_STRESS_STREAMS; lane++) {
+if (device_mem[lane]) {
+api->cuMemFree(device_mem[lane]);
+}
+}
+destroy_streams(api, streams, MAX_STRESS_STREAMS);
+if (ctx) {
+api->cuCtxDestroy(ctx);
+}
+return 0;
 }
 #if HAVE_CUBLASLT_HEADERS
@@ -418,6 +582,7 @@ struct profile_desc {
 struct prepared_profile {
 struct profile_desc desc;
+CUstream stream;
 cublasLtMatmulDesc_t op_desc;
 cublasLtMatrixLayout_t a_layout;
 cublasLtMatrixLayout_t b_layout;
@@ -617,8 +782,8 @@ static uint64_t choose_square_dim(size_t budget_bytes, size_t bytes_per_cell, in
 if (dim < (uint64_t)multiple) {
 dim = (uint64_t)multiple;
 }
-if (dim > 8192u) {
-dim = 8192u;
+if (dim > 65536u) {
+dim = 65536u;
 }
 return dim;
 }
@@ -704,10 +869,12 @@ static int prepare_profile(struct cublaslt_api *cublas,
 cublasLtHandle_t handle,
 struct cuda_api *cuda,
 const struct profile_desc *desc,
+CUstream stream,
 size_t profile_budget_bytes,
 struct prepared_profile *out) {
 memset(out, 0, sizeof(*out));
 out->desc = *desc;
+out->stream = stream;
 size_t bytes_per_cell = 0;
 bytes_per_cell += bytes_for_elements(desc->a_type, 1);
@@ -935,7 +1102,7 @@ static int run_cublas_profile(cublasLtHandle_t handle,
 &profile->heuristic.algo,
 (void *)(uintptr_t)profile->workspace_dev,
 profile->workspace_size,
-(cudaStream_t)0));
+profile->stream));
 }
 static int run_cublaslt_stress(struct cuda_api *cuda,
@@ -947,13 +1114,22 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 int size_mb,
 struct stress_report *report) {
 struct cublaslt_api cublas;
-struct prepared_profile prepared[sizeof(k_profiles) / sizeof(k_profiles[0])];
+struct prepared_profile prepared[MAX_STRESS_STREAMS * MAX_CUBLAS_PROFILES];
 cublasLtHandle_t handle = NULL;
 CUcontext ctx = NULL;
+CUstream streams[MAX_STRESS_STREAMS] = {0};
 uint16_t sample[256];
 int cc = cc_major * 10 + cc_minor;
 int planned = 0;
 int active = 0;
+int mp_count = 0;
+int stream_count = 1;
+int profile_count = (int)(sizeof(k_profiles) / sizeof(k_profiles[0]));
+int prepared_count = 0;
+int wave_launches = 0;
+size_t requested_budget = 0;
+size_t total_budget = 0;
+size_t per_profile_budget = 0;
 memset(report, 0, sizeof(*report));
 snprintf(report->backend, sizeof(report->backend), "cublasLt");
@@ -986,16 +1162,46 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 return 0;
 }
-size_t total_budget = (size_t)size_mb * 1024u * 1024u;
-if (total_budget < (size_t)planned * 4u * 1024u * 1024u) {
-total_budget = (size_t)planned * 4u * 1024u * 1024u;
+requested_budget = (size_t)size_mb * 1024u * 1024u;
+if (requested_budget < (size_t)planned * MIN_PROFILE_BUDGET_BYTES) {
+requested_budget = (size_t)planned * MIN_PROFILE_BUDGET_BYTES;
 }
-size_t per_profile_budget = total_budget / (size_t)planned;
-if (per_profile_budget < 4u * 1024u * 1024u) {
-per_profile_budget = 4u * 1024u * 1024u;
+total_budget = clamp_budget_to_free_memory(cuda, requested_budget);
+if (total_budget < (size_t)planned * MIN_PROFILE_BUDGET_BYTES) {
+total_budget = (size_t)planned * MIN_PROFILE_BUDGET_BYTES;
 }
+if (query_multiprocessor_count(cuda, dev, &mp_count) &&
+cuda->cuStreamCreate &&
+cuda->cuStreamDestroy) {
+stream_count = choose_stream_count(mp_count, planned, total_budget, 1);
+}
+if (stream_count > 1) {
+int created = 0;
+for (; created < stream_count; created++) {
+if (!check_rc(cuda, "cuStreamCreate", cuda->cuStreamCreate(&streams[created], 0))) {
+destroy_streams(cuda, streams, created);
+stream_count = 1;
+break;
+}
+}
+}
+report->stream_count = stream_count;
+per_profile_budget = total_budget / ((size_t)planned * (size_t)stream_count);
+if (per_profile_budget < MIN_PROFILE_BUDGET_BYTES) {
+per_profile_budget = MIN_PROFILE_BUDGET_BYTES;
+}
+report->buffer_mb = (int)(total_budget / (1024u * 1024u));
+append_detail(report->details,
+sizeof(report->details),
+"requested_mb=%d actual_mb=%d streams=%d queue_depth=%d mp_count=%d per_worker_mb=%zu\n",
+size_mb,
+report->buffer_mb,
+report->stream_count,
+STRESS_LAUNCH_DEPTH,
+mp_count,
+per_profile_budget / (1024u * 1024u));
-for (size_t i = 0; i < sizeof(k_profiles) / sizeof(k_profiles[0]); i++) {
+for (int i = 0; i < profile_count; i++) {
 const struct profile_desc *desc = &k_profiles[i];
 if (!(desc->enabled && cc >= desc->min_cc)) {
 append_detail(report->details,
@@ -1005,63 +1211,87 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 desc->min_cc);
 continue;
 }
-if (prepare_profile(&cublas, handle, cuda, desc, per_profile_budget, &prepared[i])) {
-active++;
-append_detail(report->details,
-sizeof(report->details),
-"%s=READY dim=%llux%llux%llu block=%s\n",
-desc->name,
-(unsigned long long)prepared[i].m,
-(unsigned long long)prepared[i].n,
-(unsigned long long)prepared[i].k,
-desc->block_label);
-} else {
-append_detail(report->details, sizeof(report->details), "%s=SKIPPED unsupported\n", desc->name);
+for (int lane = 0; lane < stream_count; lane++) {
+CUstream stream = streams[lane];
+if (prepared_count >= (int)(sizeof(prepared) / sizeof(prepared[0]))) {
+break;
+}
+if (prepare_profile(&cublas, handle, cuda, desc, stream, per_profile_budget, &prepared[prepared_count])) {
+active++;
+append_detail(report->details,
+sizeof(report->details),
+"%s[%d]=READY dim=%llux%llux%llu block=%s stream=%d\n",
+desc->name,
+lane,
+(unsigned long long)prepared[prepared_count].m,
+(unsigned long long)prepared[prepared_count].n,
+(unsigned long long)prepared[prepared_count].k,
+desc->block_label,
+lane);
+prepared_count++;
+} else {
+append_detail(report->details,
+sizeof(report->details),
+"%s[%d]=SKIPPED unsupported\n",
+desc->name,
+lane);
+}
 }
 }
 if (active <= 0) {
 cublas.cublasLtDestroy(handle);
+destroy_streams(cuda, streams, stream_count);
 cuda->cuCtxDestroy(ctx);
 return 0;
 }
 double deadline = now_seconds() + (double)seconds;
 while (now_seconds() < deadline) {
-for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
-if (!prepared[i].ready) {
-continue;
-}
-if (!run_cublas_profile(handle, &cublas, &prepared[i])) {
-append_detail(report->details,
-sizeof(report->details),
-"%s=FAILED runtime\n",
-prepared[i].desc.name);
-for (size_t j = 0; j < sizeof(prepared) / sizeof(prepared[0]); j++) {
-destroy_profile(&cublas, cuda, &prepared[j]);
+wave_launches = 0;
+for (int depth = 0; depth < STRESS_LAUNCH_DEPTH && now_seconds() < deadline; depth++) {
+int launched_this_batch = 0;
+for (int i = 0; i < prepared_count; i++) {
+if (!prepared[i].ready) {
+continue;
 }
-cublas.cublasLtDestroy(handle);
-cuda->cuCtxDestroy(ctx);
-return 0;
+if (!run_cublas_profile(handle, &cublas, &prepared[i])) {
+append_detail(report->details,
+sizeof(report->details),
+"%s=FAILED runtime\n",
+prepared[i].desc.name);
+for (int j = 0; j < prepared_count; j++) {
+destroy_profile(&cublas, cuda, &prepared[j]);
+}
+cublas.cublasLtDestroy(handle);
+destroy_streams(cuda, streams, stream_count);
+cuda->cuCtxDestroy(ctx);
+return 0;
+}
+prepared[i].iterations++;
+report->iterations++;
+wave_launches++;
+launched_this_batch++;
 }
-prepared[i].iterations++;
-report->iterations++;
-if (now_seconds() >= deadline) {
+if (launched_this_batch <= 0) {
 break;
 }
 }
-}
-if (!check_rc(cuda, "cuCtxSynchronize", cuda->cuCtxSynchronize())) {
-for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
-destroy_profile(&cublas, cuda, &prepared[i]);
+if (wave_launches <= 0) {
+break;
+}
+if (!check_rc(cuda, "cuCtxSynchronize", cuda->cuCtxSynchronize())) {
+for (int i = 0; i < prepared_count; i++) {
+destroy_profile(&cublas, cuda, &prepared[i]);
+}
+cublas.cublasLtDestroy(handle);
+destroy_streams(cuda, streams, stream_count);
+cuda->cuCtxDestroy(ctx);
+return 0;
 }
-cublas.cublasLtDestroy(handle);
-cuda->cuCtxDestroy(ctx);
-return 0;
 }
-for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
+for (int i = 0; i < prepared_count; i++) {
 if (!prepared[i].ready) {
 continue;
 }
@@ -1072,7 +1302,7 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 prepared[i].iterations);
 }
-for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
+for (int i = 0; i < prepared_count; i++) {
 if (prepared[i].ready) {
 if (check_rc(cuda, "cuMemcpyDtoH", cuda->cuMemcpyDtoH(sample, prepared[i].d_dev, sizeof(sample)))) {
 for (size_t j = 0; j < sizeof(sample) / sizeof(sample[0]); j++) {
@@ -1083,10 +1313,11 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 }
 }
-for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++) {
+for (int i = 0; i < prepared_count; i++) {
 destroy_profile(&cublas, cuda, &prepared[i]);
 }
 cublas.cublasLtDestroy(handle);
+destroy_streams(cuda, streams, stream_count);
 cuda->cuCtxDestroy(ctx);
 return 1;
 }
@@ -1095,13 +1326,16 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 int main(int argc, char **argv) {
 int seconds = 5;
 int size_mb = 64;
+int device_index = 0;
 for (int i = 1; i < argc; i++) {
 if ((strcmp(argv[i], "--seconds") == 0 || strcmp(argv[i], "-t") == 0) && i + 1 < argc) {
 seconds = atoi(argv[++i]);
 } else if ((strcmp(argv[i], "--size-mb") == 0 || strcmp(argv[i], "-m") == 0) && i + 1 < argc) {
 size_mb = atoi(argv[++i]);
+} else if ((strcmp(argv[i], "--device") == 0 || strcmp(argv[i], "-d") == 0) && i + 1 < argc) {
+device_index = atoi(argv[++i]);
 } else {
-fprintf(stderr, "usage: %s [--seconds N] [--size-mb N]\n", argv[0]);
+fprintf(stderr, "usage: %s [--seconds N] [--size-mb N] [--device N]\n", argv[0]);
 return 2;
 }
 }
@@ -1111,6 +1345,9 @@ int main(int argc, char **argv) {
 if (size_mb <= 0) {
 size_mb = 64;
 }
+if (device_index < 0) {
+device_index = 0;
+}
 struct cuda_api cuda;
 if (!load_cuda(&cuda)) {
@@ -1133,8 +1370,13 @@ int main(int argc, char **argv) {
 return 1;
 }
+if (device_index >= count) {
+fprintf(stderr, "device index %d out of range (found %d CUDA device(s))\n", device_index, count);
+return 1;
+}
 CUdevice dev = 0;
-if (!check_rc(&cuda, "cuDeviceGet", cuda.cuDeviceGet(&dev, 0))) {
+if (!check_rc(&cuda, "cuDeviceGet", cuda.cuDeviceGet(&dev, device_index))) {
 return 1;
 }
@@ -1162,10 +1404,12 @@ int main(int argc, char **argv) {
 }
 printf("device=%s\n", report.device);
+printf("device_index=%d\n", device_index);
 printf("compute_capability=%d.%d\n", report.cc_major, report.cc_minor);
 printf("backend=%s\n", report.backend);
 printf("duration_s=%d\n", seconds);
 printf("buffer_mb=%d\n", report.buffer_mb);
+printf("streams=%d\n", report.stream_count);
 printf("iterations=%lu\n", report.iterations);
 printf("checksum=%llu\n", (unsigned long long)report.checksum);
 if (report.details[0] != '\0') {
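
Review note on the sizing logic in this file: the budget clamp and the stream-count choice interact, so a small worked sketch may help when reading the `actual_mb=` and `streams=` values in the report. The version below restates both helpers as pure functions that take the free-memory reading as an argument instead of calling `cuMemGetInfo`. The `sketch_` names, the stream cap of 16, and the 64 MiB per-stream minimum are assumptions for illustration only; the real `MAX_STRESS_STREAMS` and `MIN_STREAM_BUDGET_BYTES` are defined outside the hunks shown here.

```c
#include <stddef.h>

#define SKETCH_MIB ((size_t)1024u * 1024u)
/* Assumed values; the real constants are not visible in this diff. */
#define SKETCH_MAX_STREAMS 16
#define SKETCH_MIN_STREAM_BUDGET ((size_t)64u * SKETCH_MIB)
/* 4 MiB floor, as hard-coded in clamp_budget_to_free_memory. */
#define SKETCH_MIN_BUDGET ((size_t)4u * SKETCH_MIB)

/* Cap a request at 90% of free device memory, with a 4 MiB floor on the cap.
 * free_bytes == 0 stands in for "cuMemGetInfo unavailable or failed". */
size_t sketch_clamp_budget(size_t requested, size_t free_bytes) {
    size_t max_bytes;
    if (free_bytes == 0) {
        return requested;
    }
    max_bytes = (free_bytes * 9u) / 10u;
    if (max_bytes < SKETCH_MIN_BUDGET) {
        max_bytes = SKETCH_MIN_BUDGET;
    }
    return requested > max_bytes ? max_bytes : requested;
}

/* Start from one stream per 8 multiprocessors (at least 2, at most the cap),
 * then shed streams until every (profile, stream) pair gets the minimum budget. */
int sketch_choose_stream_count(int mp_count, int planned_profiles, size_t total_budget) {
    int stream_count;
    if (mp_count <= 0 || planned_profiles <= 0) {
        return 1;
    }
    stream_count = mp_count / 8;
    if (stream_count < 2) {
        stream_count = 2;
    }
    if (stream_count > SKETCH_MAX_STREAMS) {
        stream_count = SKETCH_MAX_STREAMS;
    }
    while (stream_count > 1) {
        size_t per_stream = total_budget / ((size_t)planned_profiles * (size_t)stream_count);
        if (per_stream >= SKETCH_MIN_STREAM_BUDGET) {
            break;
        }
        stream_count--;
    }
    return stream_count;
}
```

Under these assumed constants, a device with 132 multiprocessors, one planned profile, and a 512 MiB budget settles on 8 streams of 64 MiB each, while the same device with four planned profiles and 256 MiB falls back to a single stream. Note that the 4 MiB floor is applied to the cap itself, so on a nearly full device the returned budget can exceed 90% of free memory.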


@@ -1,9 +1,9 @@
 #!/bin/sh
-# build-cublas.sh — download cuBLASLt/cuBLAS/cudart runtime + headers for bee-gpu-stress.
+# build-cublas.sh — download cuBLASLt/cuBLAS/cudart runtime + headers for bee-gpu-burn worker.
 #
 # Downloads .deb packages from NVIDIA's CUDA apt repository (Debian 12, x86_64),
 # verifies them against Packages.gz, and extracts the small subset we need:
-# - headers for compiling bee-gpu-stress against cuBLASLt
+# - headers for compiling bee-gpu-burn worker against cuBLASLt
 # - runtime libs for libcublas, libcublasLt, libcudart inside the ISO
 set -e

iso/builder/build-john.sh Normal file

@@ -0,0 +1,55 @@
#!/bin/sh
# build-john.sh — build John the Ripper jumbo with OpenCL support for the LiveCD.
#
# Downloads a pinned source snapshot from the official openwall/john repository,
# builds it inside the builder container, and caches the resulting run/ tree.
set -e
JOHN_COMMIT="$1"
DIST_DIR="$2"
[ -n "$JOHN_COMMIT" ] || { echo "usage: $0 <john-commit> <dist-dir>"; exit 1; }
[ -n "$DIST_DIR" ] || { echo "usage: $0 <john-commit> <dist-dir>"; exit 1; }
echo "=== John the Ripper jumbo ${JOHN_COMMIT} ==="
CACHE_DIR="${DIST_DIR}/john-${JOHN_COMMIT}"
CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/john-downloads"
SRC_TAR="${DOWNLOAD_CACHE_DIR}/john-${JOHN_COMMIT}.tar.gz"
SRC_URL="https://github.com/openwall/john/archive/${JOHN_COMMIT}.tar.gz"
if [ -x "${CACHE_DIR}/run/john" ] && [ -f "${CACHE_DIR}/run/john.conf" ]; then
echo "=== john cached, skipping build ==="
echo "run dir: ${CACHE_DIR}/run"
exit 0
fi
mkdir -p "${DOWNLOAD_CACHE_DIR}"
if [ ! -f "${SRC_TAR}" ]; then
echo "=== downloading john source snapshot ==="
wget --show-progress -O "${SRC_TAR}" "${SRC_URL}"
fi
BUILD_TMP=$(mktemp -d)
trap 'rm -rf "${BUILD_TMP}"' EXIT INT TERM
cd "${BUILD_TMP}"
tar xf "${SRC_TAR}"
SRC_DIR=$(find . -maxdepth 1 -type d -name 'john-*' | head -1)
[ -n "${SRC_DIR}" ] || { echo "ERROR: john source directory not found"; exit 1; }
cd "${SRC_DIR}/src"
echo "=== configuring john ==="
./configure
echo "=== building john ==="
make clean >/dev/null 2>&1 || true
make -j"$(nproc)"
mkdir -p "${CACHE_DIR}"
cp -a "../run" "${CACHE_DIR}/run"
chmod +x "${CACHE_DIR}/run/john"
echo "=== john build complete ==="
echo "run dir: ${CACHE_DIR}/run"


@@ -9,6 +9,7 @@
 #
 # Output layout:
 # $CACHE_DIR/bin/all_reduce_perf
+# $CACHE_DIR/lib/libcudart.so* copied from the nvcc toolchain used to build nccl-tests
 set -e
@@ -30,7 +31,7 @@ CACHE_DIR="${DIST_DIR}/nccl-tests-${NCCL_TESTS_VERSION}"
 CACHE_ROOT="${BEE_CACHE_DIR:-${DIST_DIR}/cache}"
 DOWNLOAD_CACHE_DIR="${CACHE_ROOT}/nccl-tests-downloads"
-if [ -f "${CACHE_DIR}/bin/all_reduce_perf" ]; then
+if [ -f "${CACHE_DIR}/bin/all_reduce_perf" ] && [ "$(find "${CACHE_DIR}/lib" -maxdepth 1 -name 'libcudart.so*' 2>/dev/null | wc -l)" -gt 0 ]; then
 echo "=== nccl-tests cached, skipping build ==="
 echo "binary: ${CACHE_DIR}/bin/all_reduce_perf"
 exit 0
@@ -52,6 +53,23 @@ echo "nvcc: $NVCC"
 CUDA_HOME="$(dirname "$(dirname "$NVCC")")"
 echo "CUDA_HOME: $CUDA_HOME"
+find_cudart_dir() {
+for dir in \
+"${CUDA_HOME}/targets/x86_64-linux/lib" \
+"${CUDA_HOME}/targets/x86_64-linux/lib/stubs" \
+"${CUDA_HOME}/lib64" \
+"${CUDA_HOME}/lib"; do
+if [ -d "$dir" ] && find "$dir" -maxdepth 1 -name 'libcudart.so*' -type f | grep -q .; then
+printf '%s\n' "$dir"
+return 0
+fi
+done
+return 1
+}
+CUDART_DIR="$(find_cudart_dir)" || { echo "ERROR: libcudart.so* not found under ${CUDA_HOME}"; exit 1; }
+echo "cudart dir: $CUDART_DIR"
 # Download libnccl-dev for nccl.h
 REPO_BASE="https://developer.download.nvidia.com/compute/cuda/repos/debian${DEBIAN_VERSION}/x86_64"
 DEV_PKG="libnccl-dev_${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}_amd64.deb"
@@ -136,6 +154,11 @@ mkdir -p "${CACHE_DIR}/bin"
 cp "./build/all_reduce_perf" "${CACHE_DIR}/bin/all_reduce_perf"
 chmod +x "${CACHE_DIR}/bin/all_reduce_perf"
+mkdir -p "${CACHE_DIR}/lib"
+find "${CUDART_DIR}" -maxdepth 1 -name 'libcudart.so*' -type f -exec cp -a {} "${CACHE_DIR}/lib/" \;
+[ "$(find "${CACHE_DIR}/lib" -maxdepth 1 -name 'libcudart.so*' -type f | wc -l)" -gt 0 ] || { echo "ERROR: libcudart runtime copy failed"; exit 1; }
 echo "=== nccl-tests build complete ==="
 echo "binary: ${CACHE_DIR}/bin/all_reduce_perf"
 ls -lh "${CACHE_DIR}/bin/all_reduce_perf"
+ls -lh "${CACHE_DIR}/lib/"libcudart.so* 2>/dev/null || true


@@ -10,7 +10,7 @@
 # Output layout:
 # $CACHE_DIR/modules/ — nvidia*.ko files
 # $CACHE_DIR/bin/ — nvidia-smi, nvidia-debugdump
-# $CACHE_DIR/lib/ — libnvidia-ml.so*, libcuda.so* (for nvidia-smi)
+# $CACHE_DIR/lib/ — libnvidia-ml.so*, libcuda.so*, OpenCL-related libs
 set -e
@@ -133,7 +133,14 @@ fi
 # Copy ALL userspace library files.
 # libnvidia-ptxjitcompiler is required by libcuda for PTX JIT compilation
 # (cuModuleLoadDataEx with PTX source) — without it CUDA_ERROR_JIT_COMPILER_NOT_FOUND.
-for lib in libnvidia-ml libcuda libnvidia-ptxjitcompiler; do
+for lib in \
+libnvidia-ml \
+libcuda \
+libnvidia-ptxjitcompiler \
+libnvidia-opencl \
+libnvidia-compiler \
+libnvidia-nvvm \
+libnvidia-fatbinaryloader; do
 count=0
 for f in $(find "$EXTRACT_DIR" -maxdepth 1 -name "${lib}.so.*" 2>/dev/null); do
 cp "$f" "$CACHE_DIR/lib/" && count=$((count+1))
@@ -150,7 +157,14 @@ ko_count=$(ls "$CACHE_DIR/modules/"*.ko 2>/dev/null | wc -l)
 [ "$ko_count" -gt 0 ] || { echo "ERROR: no .ko files built in $CACHE_DIR/modules/"; exit 1; }
 # Create soname symlinks: use [0-9][0-9]* to avoid circular symlink (.so.1 has single digit)
-for lib in libnvidia-ml libcuda libnvidia-ptxjitcompiler; do
+for lib in \
+libnvidia-ml \
+libcuda \
+libnvidia-ptxjitcompiler \
+libnvidia-opencl \
+libnvidia-compiler \
+libnvidia-nvvm \
+libnvidia-fatbinaryloader; do
 versioned=$(ls "$CACHE_DIR/lib/${lib}.so."[0-9][0-9]* 2>/dev/null | head -1)
 [ -n "$versioned" ] || continue
 base=$(basename "$versioned")


@@ -38,6 +38,7 @@ export BEE_GPU_VENDOR
 . "${BUILDER_DIR}/VERSIONS"
 export PATH="$PATH:/usr/local/go/bin"
+: "${BEE_REQUIRE_MEMTEST:=0}"
 # Allow git to read the bind-mounted repo (different UID inside container).
 git config --global safe.directory "${REPO_ROOT}"
@@ -111,8 +112,546 @@ resolve_iso_version() {
 resolve_audit_version
 }
iso_list_files() {
iso_path="$1"
if command -v bsdtar >/dev/null 2>&1; then
bsdtar -tf "$iso_path"
return $?
fi
if command -v xorriso >/dev/null 2>&1; then
xorriso -indev "$iso_path" -find / -type f -print 2>/dev/null | sed 's#^/##'
return $?
fi
return 127
}
iso_extract_file() {
iso_path="$1"
iso_member="$2"
if command -v bsdtar >/dev/null 2>&1; then
bsdtar -xOf "$iso_path" "$iso_member"
return $?
fi
if command -v xorriso >/dev/null 2>&1; then
xorriso -osirrox on -indev "$iso_path" -cat "/$iso_member" 2>/dev/null
return $?
fi
return 127
}
require_iso_reader() {
command -v bsdtar >/dev/null 2>&1 && return 0
command -v xorriso >/dev/null 2>&1 && return 0
memtest_fail "ISO reader is required for validation/debug (expected bsdtar or xorriso)" "${1:-}"
}
dump_memtest_debug() {
phase="$1"
lb_dir="${2:-}"
iso_path="${3:-}"
phase_slug="$(printf '%s' "${phase}" | tr ' /' '__')"
memtest_log="${LOG_DIR:-}/memtest-${phase_slug}.log"
(
echo "=== memtest debug: ${phase} ==="
echo "-- auto/config --"
if [ -f "${BUILDER_DIR}/auto/config" ]; then
grep -n -- '--memtest' "${BUILDER_DIR}/auto/config" || echo " (no --memtest line found)"
else
echo " (missing ${BUILDER_DIR}/auto/config)"
fi
echo "-- source bootloader templates --"
for cfg in \
"${BUILDER_DIR}/config/bootloaders/grub-pc/grub.cfg" \
"${BUILDER_DIR}/config/bootloaders/isolinux/live.cfg.in"; do
if [ -f "$cfg" ]; then
echo " file: $cfg"
grep -n 'Memory Test\|memtest' "$cfg" || echo " (no memtest lines)"
fi
done
echo "-- source binary hooks --"
for hook in \
"${BUILDER_DIR}/config/hooks/normal/9100-memtest.hook.binary"; do
if [ -f "$hook" ]; then
echo " hook: $hook"
else
echo " (missing $hook)"
fi
done
if [ -n "$lb_dir" ] && [ -d "$lb_dir" ]; then
echo "-- live-build workdir package lists --"
for pkg in \
"$lb_dir/config/package-lists/bee.list.chroot" \
"$lb_dir/config/package-lists/bee-gpu.list.chroot" \
"$lb_dir/config/package-lists/bee-nvidia.list.chroot"; do
if [ -f "$pkg" ]; then
echo " file: $pkg"
grep -n 'memtest' "$pkg" || echo " (no memtest lines)"
fi
done
echo "-- live-build chroot/boot --"
if [ -d "$lb_dir/chroot/boot" ]; then
find "$lb_dir/chroot/boot" -maxdepth 1 -name 'memtest*' -print | sed 's/^/ /' || true
else
echo " (missing $lb_dir/chroot/boot)"
fi
echo "-- live-build binary/boot --"
if [ -d "$lb_dir/binary/boot" ]; then
find "$lb_dir/binary/boot" -maxdepth 1 -name 'memtest*' -print | sed 's/^/ /' || true
else
echo " (missing $lb_dir/binary/boot)"
fi
echo "-- live-build binary grub cfg --"
if [ -f "$lb_dir/binary/boot/grub/grub.cfg" ]; then
grep -n 'Memory Test\|memtest' "$lb_dir/binary/boot/grub/grub.cfg" || echo " (no memtest lines)"
else
echo " (missing $lb_dir/binary/boot/grub/grub.cfg)"
fi
echo "-- live-build binary isolinux cfg --"
if [ -f "$lb_dir/binary/isolinux/live.cfg" ]; then
grep -n 'Memory Test\|memtest' "$lb_dir/binary/isolinux/live.cfg" || echo " (no memtest lines)"
else
echo " (missing $lb_dir/binary/isolinux/live.cfg)"
fi
echo "-- live-build package cache --"
if [ -d "$lb_dir/cache/packages.chroot" ]; then
find "$lb_dir/cache/packages.chroot" -maxdepth 1 -name 'memtest86+*.deb' -print | sed 's/^/ /' || true
else
echo " (missing $lb_dir/cache/packages.chroot)"
fi
fi
if [ -n "$iso_path" ] && [ -f "$iso_path" ]; then
echo "-- ISO memtest files --"
iso_list_files "$iso_path" | grep 'memtest' | sed 's/^/ /' || echo " (no memtest files in ISO)"
echo "-- ISO GRUB memtest lines --"
iso_extract_file "$iso_path" boot/grub/grub.cfg 2>/dev/null | grep -n 'Memory Test\|memtest' || echo " (no memtest lines in boot/grub/grub.cfg)"
echo "-- ISO isolinux memtest lines --"
iso_extract_file "$iso_path" isolinux/live.cfg 2>/dev/null | grep -n 'Memory Test\|memtest' || echo " (no memtest lines in isolinux/live.cfg)"
fi
echo "=== end memtest debug: ${phase} ==="
) | {
if [ -n "${LOG_DIR:-}" ] && [ -d "${LOG_DIR}" ]; then
tee "${memtest_log}"
else
cat
fi
}
}
memtest_fail() {
msg="$1"
iso_path="${2:-}"
level="WARNING"
if [ "${BEE_REQUIRE_MEMTEST:-0}" = "1" ]; then
level="ERROR"
fi
echo "${level}: ${msg}" >&2
dump_memtest_debug "failure" "${LB_DIR:-}" "$iso_path" >&2
if [ "${BEE_REQUIRE_MEMTEST:-0}" = "1" ]; then
exit 1
fi
return 0
}
iso_memtest_present() {
iso_path="$1"
[ -f "$iso_path" ] || return 1
if command -v bsdtar >/dev/null 2>&1; then
:
elif command -v xorriso >/dev/null 2>&1; then
:
else
return 1
fi
iso_list_files "$iso_path" | grep -q '^boot/memtest86+x64\.bin$' || return 1
iso_list_files "$iso_path" | grep -q '^boot/memtest86+x64\.efi$' || return 1
grub_cfg="$(mktemp)"
isolinux_cfg="$(mktemp)"
iso_extract_file "$iso_path" boot/grub/grub.cfg > "$grub_cfg" 2>/dev/null || {
rm -f "$grub_cfg" "$isolinux_cfg"
return 1
}
iso_extract_file "$iso_path" isolinux/live.cfg > "$isolinux_cfg" 2>/dev/null || {
rm -f "$grub_cfg" "$isolinux_cfg"
return 1
}
grep -q 'Memory Test (memtest86+)' "$grub_cfg" || {
rm -f "$grub_cfg" "$isolinux_cfg"
return 1
}
grep -q '/boot/memtest86+x64\.efi' "$grub_cfg" || {
rm -f "$grub_cfg" "$isolinux_cfg"
return 1
}
grep -q '/boot/memtest86+x64\.bin' "$grub_cfg" || {
rm -f "$grub_cfg" "$isolinux_cfg"
return 1
}
grep -q 'Memory Test (memtest86+)' "$isolinux_cfg" || {
rm -f "$grub_cfg" "$isolinux_cfg"
return 1
}
grep -q '/boot/memtest86+x64\.bin' "$isolinux_cfg" || {
rm -f "$grub_cfg" "$isolinux_cfg"
return 1
}
rm -f "$grub_cfg" "$isolinux_cfg"
return 0
}
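Review note: `iso_memtest_present` repeats the same temp-file cleanup after every `grep`. A minimal sketch of one way to collapse the checks, assuming a hypothetical helper `cfg_has_all` that is not part of this change. It takes a config file followed by the fixed strings that must all be present:

```sh
# Hypothetical helper (not in the diff): succeed only if FILE contains
# every fixed string passed after it.
cfg_has_all() {
    cfg_file="$1"
    shift
    [ -f "$cfg_file" ] || return 1
    for needle in "$@"; do
        # -F matches literally, so '+', '.' and '(' need no escaping.
        grep -qF -- "$needle" "$cfg_file" || return 1
    done
    return 0
}

# Possible use after extracting grub.cfg into "$grub_cfg":
#   cfg_has_all "$grub_cfg" 'Memory Test (memtest86+)' \
#       '/boot/memtest86+x64.efi' '/boot/memtest86+x64.bin'
```

With a helper like this the caller could extract both configs once, run two checks, and remove the temp files in a single place.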
validate_iso_memtest() {
iso_path="$1"
echo "=== validating memtest in ISO ==="
[ -f "$iso_path" ] || {
memtest_fail "ISO not found for validation: $iso_path" "$iso_path"
return 0
}
require_iso_reader "$iso_path" || return 0
iso_list_files "$iso_path" | grep -q '^boot/memtest86+x64\.bin$' || {
memtest_fail "memtest BIOS binary missing in ISO: boot/memtest86+x64.bin" "$iso_path"
return 0
}
iso_list_files "$iso_path" | grep -q '^boot/memtest86+x64\.efi$' || {
memtest_fail "memtest EFI binary missing in ISO: boot/memtest86+x64.efi" "$iso_path"
return 0
}
grub_cfg="$(mktemp)"
isolinux_cfg="$(mktemp)"
iso_extract_file "$iso_path" boot/grub/grub.cfg > "$grub_cfg" || {
memtest_fail "failed to extract boot/grub/grub.cfg from ISO" "$iso_path"
rm -f "$grub_cfg" "$isolinux_cfg"
return 0
}
iso_extract_file "$iso_path" isolinux/live.cfg > "$isolinux_cfg" || {
memtest_fail "failed to extract isolinux/live.cfg from ISO" "$iso_path"
rm -f "$grub_cfg" "$isolinux_cfg"
return 0
}
grep -q 'Memory Test (memtest86+)' "$grub_cfg" || {
memtest_fail "GRUB menu entry for memtest is missing" "$iso_path"
rm -f "$grub_cfg" "$isolinux_cfg"
return 0
}
grep -q '/boot/memtest86+x64\.efi' "$grub_cfg" || {
memtest_fail "GRUB memtest EFI path is missing" "$iso_path"
rm -f "$grub_cfg" "$isolinux_cfg"
return 0
}
grep -q '/boot/memtest86+x64\.bin' "$grub_cfg" || {
memtest_fail "GRUB memtest BIOS path is missing" "$iso_path"
rm -f "$grub_cfg" "$isolinux_cfg"
return 0
}
grep -q 'Memory Test (memtest86+)' "$isolinux_cfg" || {
memtest_fail "isolinux menu entry for memtest is missing" "$iso_path"
rm -f "$grub_cfg" "$isolinux_cfg"
return 0
}
grep -q '/boot/memtest86+x64\.bin' "$isolinux_cfg" || {
memtest_fail "isolinux memtest path is missing" "$iso_path"
rm -f "$grub_cfg" "$isolinux_cfg"
return 0
}
rm -f "$grub_cfg" "$isolinux_cfg"
echo "=== memtest validation OK ==="
}
append_memtest_grub_entry() {
grub_cfg="$1"
[ -f "$grub_cfg" ] || return 1
grep -q 'Memory Test (memtest86+)' "$grub_cfg" && return 0
grep -q '### BEE MEMTEST ###' "$grub_cfg" && return 0
cat >> "$grub_cfg" <<'EOF'
### BEE MEMTEST ###
if [ "${grub_platform}" = "efi" ]; then
menuentry "Memory Test (memtest86+)" {
chainloader /boot/memtest86+x64.efi
}
else
menuentry "Memory Test (memtest86+)" {
linux16 /boot/memtest86+x64.bin
}
fi
### /BEE MEMTEST ###
EOF
}
append_memtest_isolinux_entry() {
isolinux_cfg="$1"
[ -f "$isolinux_cfg" ] || return 1
grep -q 'Memory Test (memtest86+)' "$isolinux_cfg" && return 0
grep -q '### BEE MEMTEST ###' "$isolinux_cfg" && return 0
cat >> "$isolinux_cfg" <<'EOF'
# ### BEE MEMTEST ###
label memtest
menu label ^Memory Test (memtest86+)
linux /boot/memtest86+x64.bin
# ### /BEE MEMTEST ###
EOF
}
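Review note: both append functions rely on the same idea, a marker line that makes a second call a no-op. The sketch below isolates that pattern under stated assumptions: `append_once` and its argument order are hypothetical and do not appear in this change.

```sh
# Hypothetical helper (not in the diff): append BLOCK to FILE once,
# wrapped in "### MARKER ###" lines so repeated calls change nothing.
append_once() {
    target="$1"
    marker="$2"
    block="$3"
    [ -f "$target" ] || return 1
    # Already present: nothing to do.
    grep -qF -- "### ${marker} ###" "$target" && return 0
    {
        printf '### %s ###\n' "$marker"
        printf '%s\n' "$block"
        printf '### /%s ###\n' "$marker"
    } >> "$target"
}

# Possible use:
#   append_once binary/boot/grub/grub.cfg 'BEE MEMTEST' "$grub_entry_text"
```

Checking only the marker, rather than the menu title as well, keeps the guard independent of how the entry text is worded.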
copy_memtest_from_deb() {
deb="$1"
dst_boot="$2"
tmpdir="$(mktemp -d)"
dpkg-deb -x "$deb" "$tmpdir"
for f in memtest86+x64.bin memtest86+x64.efi; do
if [ -f "$tmpdir/boot/$f" ]; then
cp "$tmpdir/boot/$f" "$dst_boot/$f"
fi
done
rm -rf "$tmpdir"
}
recover_iso_memtest() {
lb_dir="$1"
iso_path="$2"
binary_boot="$lb_dir/binary/boot"
grub_cfg="$lb_dir/binary/boot/grub/grub.cfg"
isolinux_cfg="$lb_dir/binary/isolinux/live.cfg"
echo "=== attempting memtest recovery in binary tree ==="
mkdir -p "$binary_boot"
for root in \
"$lb_dir/chroot/boot" \
"/boot"; do
for f in memtest86+x64.bin memtest86+x64.efi; do
if [ ! -f "$binary_boot/$f" ] && [ -f "$root/$f" ]; then
cp "$root/$f" "$binary_boot/$f"
echo "memtest recovery: copied $f from $root"
fi
done
done
if [ ! -f "$binary_boot/memtest86+x64.bin" ] || [ ! -f "$binary_boot/memtest86+x64.efi" ]; then
for dir in \
"$lb_dir/cache/packages.binary" \
"$lb_dir/cache/packages.chroot" \
"$lb_dir/chroot/var/cache/apt/archives" \
"${BEE_CACHE_DIR:-${DIST_DIR}/cache}/lb-packages" \
"/var/cache/apt/archives"; do
[ -d "$dir" ] || continue
deb="$(find "$dir" -maxdepth 1 -type f -name 'memtest86+*.deb' 2>/dev/null | head -1)"
[ -n "$deb" ] || continue
echo "memtest recovery: extracting payload from $deb"
copy_memtest_from_deb "$deb" "$binary_boot"
break
done
fi
if [ ! -f "$binary_boot/memtest86+x64.bin" ] || [ ! -f "$binary_boot/memtest86+x64.efi" ]; then
tmpdl="$(mktemp -d)"
if (
cd "$tmpdl" && apt-get download memtest86+ >/dev/null 2>&1
); then
deb="$(find "$tmpdl" -maxdepth 1 -type f -name 'memtest86+*.deb' 2>/dev/null | head -1)"
if [ -n "$deb" ]; then
echo "memtest recovery: downloaded $deb"
copy_memtest_from_deb "$deb" "$binary_boot"
fi
fi
rm -rf "$tmpdl"
fi
if [ -f "$grub_cfg" ]; then
append_memtest_grub_entry "$grub_cfg" && echo "memtest recovery: ensured GRUB entry"
else
echo "memtest recovery: WARNING: missing $grub_cfg"
fi
if [ -f "$isolinux_cfg" ]; then
append_memtest_isolinux_entry "$isolinux_cfg" && echo "memtest recovery: ensured isolinux entry"
else
echo "memtest recovery: WARNING: missing $isolinux_cfg"
fi
run_optional_step_sh "rebuild live-build checksums after memtest recovery" "91-lb-checksums" "lb binary_checksums 2>&1"
run_optional_step_sh "rebuild ISO after memtest recovery" "92-lb-binary-iso" "rm -f '$iso_path' && lb binary_iso 2>&1"
run_optional_step_sh "rebuild zsync after memtest recovery" "93-lb-zsync" "lb binary_zsync 2>&1"
}
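Review note: the recovery path walks an ordered list of cache directories and takes the first `memtest86+*.deb` it finds. A minimal sketch of that search as a standalone function, assuming a hypothetical name `first_match_in_dirs`. Unlike the code above it sorts the candidates, so the pick is deterministic when a directory holds several versions:

```sh
# Hypothetical helper (not in the diff): print the first file matching
# PATTERN found directly inside the given directories, searched in order.
first_match_in_dirs() {
    pattern="$1"
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        hit="$(find "$dir" -maxdepth 1 -type f -name "$pattern" 2>/dev/null | sort | head -n 1)"
        if [ -n "$hit" ]; then
            printf '%s\n' "$hit"
            return 0
        fi
    done
    return 1
}

# Possible use:
#   deb="$(first_match_in_dirs 'memtest86+*.deb' "$lb_dir/cache/packages.binary" /var/cache/apt/archives)"
```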
AUDIT_VERSION_EFFECTIVE="$(resolve_audit_version)"
ISO_VERSION_EFFECTIVE="$(resolve_iso_version)"
ISO_BASENAME="easy-bee-${BEE_GPU_VENDOR}-v${ISO_VERSION_EFFECTIVE}-amd64"
LOG_DIR="${DIST_DIR}/${ISO_BASENAME}.logs"
LOG_ARCHIVE="${DIST_DIR}/${ISO_BASENAME}.logs.tar.gz"
ISO_OUT="${DIST_DIR}/${ISO_BASENAME}.iso"
LOG_OUT="${LOG_DIR}/build.log"
cleanup_build_log() {
status="${1:-$?}"
trap - EXIT INT TERM HUP
if [ "${STEP_LOG_ACTIVE:-0}" = "1" ]; then
cleanup_step_log "${status}" || true
fi
if [ "${BUILD_LOG_ACTIVE:-0}" = "1" ]; then
BUILD_LOG_ACTIVE=0
exec 1>&3 2>&4
exec 3>&- 4>&-
if [ -n "${BUILD_TEE_PID:-}" ]; then
wait "${BUILD_TEE_PID}" 2>/dev/null || true
fi
rm -f "${BUILD_LOG_PIPE}"
fi
if [ -n "${LOG_DIR:-}" ] && [ -d "${LOG_DIR}" ] && command -v tar >/dev/null 2>&1; then
rm -f "${LOG_ARCHIVE}"
tar -czf "${LOG_ARCHIVE}" -C "${DIST_DIR}" "$(basename "${LOG_DIR}")" 2>/dev/null || true
fi
exit "${status}"
}
start_build_log() {
command -v tee >/dev/null 2>&1 || {
echo "ERROR: tee is required for build logging" >&2
exit 1
}
rm -rf "${LOG_DIR}"
rm -f "${LOG_ARCHIVE}"
mkdir -p "${LOG_DIR}"
BUILD_LOG_PIPE="$(mktemp -u "${TMPDIR:-/tmp}/bee-build-log.XXXXXX")"
mkfifo "${BUILD_LOG_PIPE}"
exec 3>&1 4>&2
tee "${LOG_OUT}" < "${BUILD_LOG_PIPE}" &
BUILD_TEE_PID=$!
exec > "${BUILD_LOG_PIPE}" 2>&1
BUILD_LOG_ACTIVE=1
trap 'cleanup_build_log "$?"' EXIT INT TERM HUP
echo "=== build log dir: ${LOG_DIR} ==="
echo "=== build log: ${LOG_OUT} ==="
echo "=== build log archive: ${LOG_ARCHIVE} ==="
}
cleanup_step_log() {
status="${1:-$?}"
if [ "${STEP_LOG_ACTIVE:-0}" = "1" ]; then
STEP_LOG_ACTIVE=0
exec 1>&5 2>&6
exec 5>&- 6>&-
if [ -n "${STEP_TEE_PID:-}" ]; then
wait "${STEP_TEE_PID}" 2>/dev/null || true
fi
rm -f "${STEP_LOG_PIPE}"
fi
return "${status}"
}
run_step() {
step_name="$1"
step_slug="$2"
shift 2
step_log="${LOG_DIR}/${step_slug}.log"
echo ""
echo "=== step: ${step_name} ==="
echo "=== step log: ${step_log} ==="
STEP_LOG_PIPE="$(mktemp -u "${TMPDIR:-/tmp}/bee-step-log.XXXXXX")"
mkfifo "${STEP_LOG_PIPE}"
exec 5>&1 6>&2
tee "${step_log}" < "${STEP_LOG_PIPE}" >&5 &
STEP_TEE_PID=$!
exec > "${STEP_LOG_PIPE}" 2>&1
STEP_LOG_ACTIVE=1
set +e
"$@"
step_status=$?
set -e
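# Restore stdout/stderr first. cleanup_step_log returns the step status, so
# "|| true" keeps "set -e" from exiting before the error message below is printed.
cleanup_step_log "${step_status}" || true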
if [ "${step_status}" -ne 0 ]; then
echo "ERROR: step failed: ${step_name} (see ${step_log})" >&2
exit "${step_status}"
fi
echo "=== step OK: ${step_name} ==="
}
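Review note: `run_step` and `start_build_log` both use a named pipe with `tee` so output reaches the console and a log file at once. The following is a stripped-down sketch of that pattern, not the author's implementation; `run_logged` is a hypothetical name, and file descriptors 5 and 6 are chosen to match `run_step`:

```sh
# Hypothetical helper (not in the diff): run a command with stdout and
# stderr copied to LOG_FILE while still reaching the caller's stdout.
run_logged() {
    log_file="$1"
    shift
    pipe="$(mktemp -u "${TMPDIR:-/tmp}/demo-log.XXXXXX")"
    mkfifo "$pipe"
    exec 5>&1 6>&2                        # save the current stdout/stderr
    tee "$log_file" < "$pipe" >&5 &       # reader: pipe -> log file + saved stdout
    tee_pid=$!
    exec > "$pipe" 2>&1                   # writer: everything now goes into the pipe
    cmd_rc=0
    "$@" || cmd_rc=$?
    exec 1>&5 2>&6                        # restore; this closes the pipe's write end
    exec 5>&- 6>&-
    wait "$tee_pid" 2>/dev/null || true   # let tee flush before the pipe is removed
    rm -f "$pipe"
    return "$cmd_rc"
}

# Possible use:
#   run_logged /tmp/step.log sh -c 'echo building; make'
```

Restoring the descriptors before `wait` matters: `tee` only sees end-of-file once the last writer has closed the pipe.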
run_step_sh() {
step_name="$1"
step_slug="$2"
step_script="$3"
run_step "${step_name}" "${step_slug}" sh -c "${step_script}"
}
run_optional_step_sh() {
step_name="$1"
step_slug="$2"
step_script="$3"
if [ "${BEE_REQUIRE_MEMTEST:-0}" = "1" ]; then
run_step_sh "${step_name}" "${step_slug}" "${step_script}"
return 0
fi
step_log="${LOG_DIR}/${step_slug}.log"
echo ""
echo "=== optional step: ${step_name} ==="
echo "=== optional step log: ${step_log} ==="
set +e
sh -c "${step_script}" > "${step_log}" 2>&1
step_status=$?
set -e
cat "${step_log}"
if [ "${step_status}" -ne 0 ]; then
echo "WARNING: optional step failed: ${step_name} (see ${step_log})" >&2
else
echo "=== optional step OK: ${step_name} ==="
fi
}
start_build_log
# Auto-detect kernel ABI: refresh apt index, then query current linux-image-amd64 dependency.
# If headers for the detected ABI are not yet installed (kernel updated since image build),
@@ -147,8 +686,8 @@ echo "Debian: ${DEBIAN_VERSION}, Kernel ABI: ${DEBIAN_KERNEL_ABI}, Go: ${GO_VERS
 echo "Audit version: ${AUDIT_VERSION_EFFECTIVE}, ISO version: ${ISO_VERSION_EFFECTIVE}"
 echo ""
-echo "=== syncing git submodules ==="
-git -C "${REPO_ROOT}" submodule update --init --recursive
+run_step "sync git submodules" "05-git-submodules" \
+git -C "${REPO_ROOT}" submodule update --init --recursive
 # --- compile bee binary (static, Linux amd64) ---
 # Shared between variants — built once, reused on second pass.
@@ -160,13 +699,13 @@ if [ -f "$BEE_BIN" ]; then
fi fi
if [ "$NEED_BUILD" = "1" ]; then if [ "$NEED_BUILD" = "1" ]; then
echo "=== building bee binary ===" run_step_sh "build bee binary" "10-build-bee" \
cd "${REPO_ROOT}/audit" "cd '${REPO_ROOT}/audit' && \
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 \ env GOOS=linux GOARCH=amd64 CGO_ENABLED=0 \
go build \ go build \
-ldflags "-s -w -X main.Version=${AUDIT_VERSION_EFFECTIVE}" \ -ldflags '-s -w -X main.Version=${AUDIT_VERSION_EFFECTIVE}' \
-o "$BEE_BIN" \ -o '${BEE_BIN}' \
./cmd/bee ./cmd/bee"
echo "binary: $BEE_BIN" echo "binary: $BEE_BIN"
if command -v stat >/dev/null 2>&1; then if command -v stat >/dev/null 2>&1; then
BEE_SIZE_BYTES="$(stat -c '%s' "$BEE_BIN" 2>/dev/null || stat -f '%z' "$BEE_BIN")" BEE_SIZE_BYTES="$(stat -c '%s' "$BEE_BIN" 2>/dev/null || stat -f '%z' "$BEE_BIN")"
@@ -183,11 +722,10 @@ else
fi fi
# --- NVIDIA-only build steps --- # --- NVIDIA-only build steps ---
GPU_STRESS_BIN="${DIST_DIR}/bee-gpu-stress-linux-amd64" GPU_BURN_WORKER_BIN="${DIST_DIR}/bee-gpu-burn-worker-linux-amd64"
if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
echo "" run_step "download cuBLAS/cuBLASLt/cudart ${NCCL_CUDA_VERSION} userspace" "20-cublas" \
echo "=== downloading cuBLAS/cuBLASLt/cudart ${NCCL_CUDA_VERSION} userspace ===" sh "${BUILDER_DIR}/build-cublas.sh" \
sh "${BUILDER_DIR}/build-cublas.sh" \
"${CUBLAS_VERSION}" \ "${CUBLAS_VERSION}" \
"${CUDA_USERSPACE_VERSION}" \ "${CUDA_USERSPACE_VERSION}" \
"${NCCL_CUDA_VERSION}" \ "${NCCL_CUDA_VERSION}" \
@@ -196,20 +734,20 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
CUBLAS_CACHE="${DIST_DIR}/cublas-${CUBLAS_VERSION}+cuda${NCCL_CUDA_VERSION}" CUBLAS_CACHE="${DIST_DIR}/cublas-${CUBLAS_VERSION}+cuda${NCCL_CUDA_VERSION}"
GPU_STRESS_NEED_BUILD=1 GPU_STRESS_NEED_BUILD=1
if [ -f "$GPU_STRESS_BIN" ] && [ "${BUILDER_DIR}/bee-gpu-stress.c" -ot "$GPU_STRESS_BIN" ]; then if [ -f "$GPU_BURN_WORKER_BIN" ] && [ "${BUILDER_DIR}/bee-gpu-stress.c" -ot "$GPU_BURN_WORKER_BIN" ]; then
GPU_STRESS_NEED_BUILD=0 GPU_STRESS_NEED_BUILD=0
fi fi
if [ "$GPU_STRESS_NEED_BUILD" = "1" ]; then if [ "$GPU_STRESS_NEED_BUILD" = "1" ]; then
echo "=== building bee-gpu-stress ===" run_step "build bee-gpu-burn worker" "21-gpu-burn-worker" \
gcc -O2 -s -Wall -Wextra \ gcc -O2 -s -Wall -Wextra \
-I"${CUBLAS_CACHE}/include" \ -I"${CUBLAS_CACHE}/include" \
-o "$GPU_STRESS_BIN" \ -o "$GPU_BURN_WORKER_BIN" \
"${BUILDER_DIR}/bee-gpu-stress.c" \ "${BUILDER_DIR}/bee-gpu-stress.c" \
-ldl -lm -ldl -lm
echo "binary: $GPU_STRESS_BIN" echo "binary: $GPU_BURN_WORKER_BIN"
else else
echo "=== bee-gpu-stress up to date, skipping build ===" echo "=== bee-gpu-burn worker up to date, skipping build ==="
fi fi
fi fi
@@ -245,9 +783,13 @@ rm -f \
"${OVERLAY_STAGE_DIR}/etc/bee-release" \ "${OVERLAY_STAGE_DIR}/etc/bee-release" \
"${OVERLAY_STAGE_DIR}/root/.ssh/authorized_keys" \ "${OVERLAY_STAGE_DIR}/root/.ssh/authorized_keys" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee" \ "${OVERLAY_STAGE_DIR}/usr/local/bin/bee" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress" \ "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-nccl-gpu-stress" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/john" \
"${OVERLAY_STAGE_DIR}/usr/local/lib/bee/bee-gpu-burn-worker" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest" \ "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-smoketest" \
"${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf" "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf"
rm -rf \
"${OVERLAY_STAGE_DIR}/usr/local/lib/bee/john"
# Remove NVIDIA-specific overlay files for non-nvidia variants # Remove NVIDIA-specific overlay files for non-nvidia variants
if [ "$BEE_GPU_VENDOR" != "nvidia" ]; then if [ "$BEE_GPU_VENDOR" != "nvidia" ]; then
@@ -293,9 +835,13 @@ mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/bin"
cp "${DIST_DIR}/bee-linux-amd64" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee" cp "${DIST_DIR}/bee-linux-amd64" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee" chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee"
if [ "$BEE_GPU_VENDOR" = "nvidia" ] && [ -f "$GPU_STRESS_BIN" ]; then if [ "$BEE_GPU_VENDOR" = "nvidia" ] && [ -f "$GPU_BURN_WORKER_BIN" ]; then
cp "${GPU_STRESS_BIN}" "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress" mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/lib/bee" "${OVERLAY_STAGE_DIR}/usr/local/bin"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-stress" cp "${GPU_BURN_WORKER_BIN}" "${OVERLAY_STAGE_DIR}/usr/local/lib/bee/bee-gpu-burn-worker"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/lib/bee/bee-gpu-burn-worker"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-burn" 2>/dev/null || true
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-john-gpu-stress" 2>/dev/null || true
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-nccl-gpu-stress" 2>/dev/null || true
fi fi
# --- inject smoketest into overlay so it runs directly on the live CD --- # --- inject smoketest into overlay so it runs directly on the live CD ---
@@ -315,9 +861,8 @@ done
# --- NVIDIA kernel modules and userspace libs --- # --- NVIDIA kernel modules and userspace libs ---
if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
echo "" run_step "build NVIDIA ${NVIDIA_DRIVER_VERSION} modules" "40-nvidia-module" \
echo "=== building NVIDIA ${NVIDIA_DRIVER_VERSION} modules ===" sh "${BUILDER_DIR}/build-nvidia-module.sh" "${NVIDIA_DRIVER_VERSION}" "${DIST_DIR}" "${DEBIAN_KERNEL_ABI}"
sh "${BUILDER_DIR}/build-nvidia-module.sh" "${NVIDIA_DRIVER_VERSION}" "${DIST_DIR}" "${DEBIAN_KERNEL_ABI}"
KVER="${DEBIAN_KERNEL_ABI}-amd64" KVER="${DEBIAN_KERNEL_ABI}-amd64"
NVIDIA_CACHE="${DIST_DIR}/nvidia-${NVIDIA_DRIVER_VERSION}-${KVER}" NVIDIA_CACHE="${DIST_DIR}/nvidia-${NVIDIA_DRIVER_VERSION}-${KVER}"
@@ -334,6 +879,8 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
cp "${NVIDIA_CACHE}/bin/nvidia-bug-report.sh" "${OVERLAY_STAGE_DIR}/usr/local/bin/" 2>/dev/null || true cp "${NVIDIA_CACHE}/bin/nvidia-bug-report.sh" "${OVERLAY_STAGE_DIR}/usr/local/bin/" 2>/dev/null || true
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/nvidia-bug-report.sh" 2>/dev/null || true chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/nvidia-bug-report.sh" 2>/dev/null || true
cp "${NVIDIA_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/" 2>/dev/null || true cp "${NVIDIA_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/" 2>/dev/null || true
mkdir -p "${OVERLAY_STAGE_DIR}/etc/OpenCL/vendors"
printf 'libnvidia-opencl.so.1\n' > "${OVERLAY_STAGE_DIR}/etc/OpenCL/vendors/nvidia.icd"
# Inject GSP firmware into /lib/firmware/nvidia/<version>/ # Inject GSP firmware into /lib/firmware/nvidia/<version>/
if [ -d "${NVIDIA_CACHE}/firmware" ] && [ "$(ls -A "${NVIDIA_CACHE}/firmware" 2>/dev/null)" ]; then if [ -d "${NVIDIA_CACHE}/firmware" ] && [ "$(ls -A "${NVIDIA_CACHE}/firmware" 2>/dev/null)" ]; then
@@ -343,9 +890,8 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
fi fi
# --- build / download NCCL --- # --- build / download NCCL ---
echo "" run_step "download NCCL ${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}" "50-nccl" \
echo "=== downloading NCCL ${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION} ===" sh "${BUILDER_DIR}/build-nccl.sh" "${NCCL_VERSION}" "${NCCL_CUDA_VERSION}" "${DIST_DIR}" "${NCCL_SHA256:-}"
sh "${BUILDER_DIR}/build-nccl.sh" "${NCCL_VERSION}" "${NCCL_CUDA_VERSION}" "${DIST_DIR}" "${NCCL_SHA256:-}"
NCCL_CACHE="${DIST_DIR}/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}" NCCL_CACHE="${DIST_DIR}/nccl-${NCCL_VERSION}+cuda${NCCL_CUDA_VERSION}"
@@ -353,14 +899,13 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
cp "${NCCL_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/" cp "${NCCL_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/"
echo "=== NCCL: $(ls "${NCCL_CACHE}/lib/" | wc -l) files injected into /usr/lib/ ===" echo "=== NCCL: $(ls "${NCCL_CACHE}/lib/" | wc -l) files injected into /usr/lib/ ==="
# Inject cuBLAS/cuBLASLt/cudart runtime libs used by bee-gpu-stress tensor-core GEMM path # Inject cuBLAS/cuBLASLt/cudart runtime libs used by the bee-gpu-burn worker tensor-core GEMM path
cp "${CUBLAS_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/" cp "${CUBLAS_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/"
echo "=== cuBLAS: $(ls "${CUBLAS_CACHE}/lib/" | wc -l) files injected into /usr/lib/ ===" echo "=== cuBLAS: $(ls "${CUBLAS_CACHE}/lib/" | wc -l) files injected into /usr/lib/ ==="
# --- build nccl-tests --- # --- build nccl-tests ---
echo "" run_step "build nccl-tests ${NCCL_TESTS_VERSION}" "60-nccl-tests" \
echo "=== building nccl-tests ${NCCL_TESTS_VERSION} ===" sh "${BUILDER_DIR}/build-nccl-tests.sh" \
sh "${BUILDER_DIR}/build-nccl-tests.sh" \
"${NCCL_TESTS_VERSION}" \ "${NCCL_TESTS_VERSION}" \
"${NCCL_VERSION}" \ "${NCCL_VERSION}" \
"${NCCL_CUDA_VERSION}" \ "${NCCL_CUDA_VERSION}" \
@@ -371,7 +916,17 @@ if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
NCCL_TESTS_CACHE="${DIST_DIR}/nccl-tests-${NCCL_TESTS_VERSION}" NCCL_TESTS_CACHE="${DIST_DIR}/nccl-tests-${NCCL_TESTS_VERSION}"
cp "${NCCL_TESTS_CACHE}/bin/all_reduce_perf" "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf" cp "${NCCL_TESTS_CACHE}/bin/all_reduce_perf" "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf" chmod +x "${OVERLAY_STAGE_DIR}/usr/local/bin/all_reduce_perf"
cp "${NCCL_TESTS_CACHE}/lib/"* "${OVERLAY_STAGE_DIR}/usr/lib/" 2>/dev/null || true
echo "=== all_reduce_perf injected ===" echo "=== all_reduce_perf injected ==="
run_step "build john jumbo ${JOHN_JUMBO_COMMIT}" "70-john" \
sh "${BUILDER_DIR}/build-john.sh" "${JOHN_JUMBO_COMMIT}" "${DIST_DIR}"
JOHN_CACHE="${DIST_DIR}/john-${JOHN_JUMBO_COMMIT}"
mkdir -p "${OVERLAY_STAGE_DIR}/usr/local/lib/bee/john"
rsync -a --delete "${JOHN_CACHE}/run/" "${OVERLAY_STAGE_DIR}/usr/local/lib/bee/john/run/"
ln -sfn ../lib/bee/john/run/john "${OVERLAY_STAGE_DIR}/usr/local/bin/john"
chmod +x "${OVERLAY_STAGE_DIR}/usr/local/lib/bee/john/run/john"
echo "=== john injected ==="
fi fi
# --- embed build metadata --- # --- embed build metadata ---
@@ -385,7 +940,8 @@ NCCL_VERSION=${NCCL_VERSION}
NCCL_CUDA_VERSION=${NCCL_CUDA_VERSION} NCCL_CUDA_VERSION=${NCCL_CUDA_VERSION}
CUBLAS_VERSION=${CUBLAS_VERSION} CUBLAS_VERSION=${CUBLAS_VERSION}
CUDA_USERSPACE_VERSION=${CUDA_USERSPACE_VERSION} CUDA_USERSPACE_VERSION=${CUDA_USERSPACE_VERSION}
NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}
JOHN_JUMBO_COMMIT=${JOHN_JUMBO_COMMIT}"
GPU_BUILD_INFO="nvidia:${NVIDIA_DRIVER_VERSION}" GPU_BUILD_INFO="nvidia:${NVIDIA_DRIVER_VERSION}"
elif [ "$BEE_GPU_VENDOR" = "amd" ]; then elif [ "$BEE_GPU_VENDOR" = "amd" ]; then
GPU_VERSION_LINE="ROCM_VERSION=${ROCM_VERSION}" GPU_VERSION_LINE="ROCM_VERSION=${ROCM_VERSION}"
@@ -417,9 +973,13 @@ if [ -f "${OVERLAY_STAGE_DIR}/etc/motd" ]; then
mv "${OVERLAY_STAGE_DIR}/etc/motd.patched" "${OVERLAY_STAGE_DIR}/etc/motd" mv "${OVERLAY_STAGE_DIR}/etc/motd.patched" "${OVERLAY_STAGE_DIR}/etc/motd"
fi fi
# --- copy variant-specific package list into work dir --- # --- copy variant-specific package list, remove all other variant lists ---
# live-build picks up ALL .list.chroot files — delete other variants to avoid conflicts.
cp "${BUILD_WORK_DIR}/config/package-lists/bee-${BEE_GPU_VENDOR}.list.chroot" \ cp "${BUILD_WORK_DIR}/config/package-lists/bee-${BEE_GPU_VENDOR}.list.chroot" \
"${BUILD_WORK_DIR}/config/package-lists/bee-gpu.list.chroot" "${BUILD_WORK_DIR}/config/package-lists/bee-gpu.list.chroot"
rm -f "${BUILD_WORK_DIR}/config/package-lists/bee-nvidia.list.chroot" \
"${BUILD_WORK_DIR}/config/package-lists/bee-amd.list.chroot" \
"${BUILD_WORK_DIR}/config/package-lists/bee-nogpu.list.chroot"
# --- remove archives for the other vendor(s) --- # --- remove archives for the other vendor(s) ---
if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then if [ "$BEE_GPU_VENDOR" = "nvidia" ]; then
@@ -481,9 +1041,10 @@ BEE_GPU_VENDOR_UPPER="$(echo "${BEE_GPU_VENDOR}" | tr 'a-z' 'A-Z')"
 export BEE_GPU_VENDOR_UPPER
 cd "${LB_DIR}"
-lb clean 2>&1 | tail -3
-lb config 2>&1 | tail -5
-lb build 2>&1
+run_step_sh "live-build clean" "80-lb-clean" "lb clean 2>&1 | tail -3"
+run_step_sh "live-build config" "81-lb-config" "lb config 2>&1 | tail -5"
+dump_memtest_debug "pre-build" "${LB_DIR}"
+run_step_sh "live-build build" "90-lb-build" "lb build 2>&1"
 # --- persist deb package cache back to shared location ---
 # This allows the second variant to reuse all downloaded packages.
@@ -494,8 +1055,13 @@ fi
 # live-build outputs live-image-amd64.hybrid.iso in LB_DIR
 ISO_RAW="${LB_DIR}/live-image-amd64.hybrid.iso"
-ISO_OUT="${DIST_DIR}/easy-bee-${BEE_GPU_VENDOR}-v${ISO_VERSION_EFFECTIVE}-amd64.iso"
 if [ -f "$ISO_RAW" ]; then
+dump_memtest_debug "post-build" "${LB_DIR}" "$ISO_RAW"
+if ! iso_memtest_present "$ISO_RAW"; then
+recover_iso_memtest "${LB_DIR}" "$ISO_RAW"
+dump_memtest_debug "post-recovery" "${LB_DIR}" "$ISO_RAW"
+fi
+validate_iso_memtest "$ISO_RAW"
 cp "$ISO_RAW" "$ISO_OUT"
 echo ""
 echo "=== done (${BEE_GPU_VENDOR}) ==="


@@ -22,3 +22,7 @@ label live-@FLAVOUR@-failsafe
 linux @LINUX@
 initrd @INITRD@
 append @APPEND_LIVE@ bee.nvidia.mode=gsp-off memtest noapic noapm nodma nomce nolapic nosmp vga=normal
+label memtest
+menu label ^Memory Test (memtest86+)
+linux /boot/memtest86+x64.bin


@@ -60,6 +60,9 @@ chmod +x /usr/local/bin/bee 2>/dev/null || true
 chmod +x /usr/local/bin/bee-log-run 2>/dev/null || true
 if [ "$GPU_VENDOR" = "nvidia" ]; then
 chmod +x /usr/local/bin/bee-nvidia-load 2>/dev/null || true
+chmod +x /usr/local/bin/bee-gpu-burn 2>/dev/null || true
+chmod +x /usr/local/bin/bee-john-gpu-stress 2>/dev/null || true
+chmod +x /usr/local/bin/bee-nccl-gpu-stress 2>/dev/null || true
 fi
 # Reload udev rules


@@ -1,16 +1,139 @@
#!/bin/sh #!/bin/sh
# Copy memtest86+ binaries from chroot /boot into the ISO boot directory # Ensure memtest is present in the final ISO even if live-build's built-in
# so GRUB can chainload them directly (they must be on the ISO filesystem, # memtest stage does not copy the binaries or expose menu entries.
# not inside the squashfs).
set -e set -e
echo "memtest: scanning chroot/boot/ for memtest files:" : "${BEE_REQUIRE_MEMTEST:=0}"
ls chroot/boot/memtest* 2>/dev/null || echo "memtest: WARNING: no memtest files found in chroot/boot/"
for f in memtest86+x64.bin memtest86+x64.efi memtest86+ia32.bin memtest86+ia32.efi; do MEMTEST_FILES="memtest86+x64.bin memtest86+x64.efi"
src="chroot/boot/${f}" BINARY_BOOT_DIR="binary/boot"
if [ -f "${src}" ]; then GRUB_CFG="binary/boot/grub/grub.cfg"
cp "${src}" "binary/boot/${f}" ISOLINUX_CFG="binary/isolinux/live.cfg"
echo "memtest: copied ${f} to binary/boot/"
log() {
echo "memtest hook: $*"
}
fail_or_warn() {
msg="$1"
if [ "${BEE_REQUIRE_MEMTEST}" = "1" ]; then
log "ERROR: ${msg}"
exit 1
fi fi
done log "WARNING: ${msg}"
return 0
}
copy_memtest_file() {
src="$1"
base="$(basename "$src")"
dst="${BINARY_BOOT_DIR}/${base}"
[ -f "$src" ] || return 1
mkdir -p "${BINARY_BOOT_DIR}"
cp "$src" "$dst"
log "copied ${base} from ${src}"
}
extract_memtest_from_deb() {
deb="$1"
tmpdir="$(mktemp -d)"
log "extracting memtest payload from ${deb}"
dpkg-deb -x "$deb" "$tmpdir"
for f in ${MEMTEST_FILES}; do
if [ -f "${tmpdir}/boot/${f}" ]; then
copy_memtest_file "${tmpdir}/boot/${f}"
fi
done
rm -rf "$tmpdir"
}
ensure_memtest_binaries() {
missing=0
for f in ${MEMTEST_FILES}; do
[ -f "${BINARY_BOOT_DIR}/${f}" ] || missing=1
done
[ "$missing" -eq 1 ] || return 0
for root in chroot/boot /boot; do
for f in ${MEMTEST_FILES}; do
[ -f "${BINARY_BOOT_DIR}/${f}" ] || copy_memtest_file "${root}/${f}" || true
done
done
missing=0
for f in ${MEMTEST_FILES}; do
[ -f "${BINARY_BOOT_DIR}/${f}" ] || missing=1
done
[ "$missing" -eq 1 ] || return 0
for root in cache chroot/var/cache/apt/archives /var/cache/apt/archives; do
[ -d "$root" ] || continue
deb="$(find "$root" -type f \( -name 'memtest86+_*.deb' -o -name 'memtest86+*.deb' \) 2>/dev/null | head -1)"
[ -n "$deb" ] || continue
extract_memtest_from_deb "$deb"
break
done
missing=0
for f in ${MEMTEST_FILES}; do
if [ ! -f "${BINARY_BOOT_DIR}/${f}" ]; then
fail_or_warn "missing ${BINARY_BOOT_DIR}/${f}"
missing=1
fi
done
[ "$missing" -eq 0 ] || return 0
}
ensure_grub_entry() {
[ -f "$GRUB_CFG" ] || {
fail_or_warn "missing ${GRUB_CFG}"
return 0
}
grep -q '### BEE MEMTEST ###' "$GRUB_CFG" && return 0
cat >> "$GRUB_CFG" <<'EOF'
### BEE MEMTEST ###
if [ "${grub_platform}" = "efi" ]; then
menuentry "Memory Test (memtest86+)" {
chainloader /boot/memtest86+x64.efi
}
else
menuentry "Memory Test (memtest86+)" {
linux16 /boot/memtest86+x64.bin
}
fi
### /BEE MEMTEST ###
EOF
log "appended memtest entry to ${GRUB_CFG}"
}
ensure_isolinux_entry() {
[ -f "$ISOLINUX_CFG" ] || {
fail_or_warn "missing ${ISOLINUX_CFG}"
return 0
}
grep -q '### BEE MEMTEST ###' "$ISOLINUX_CFG" && return 0
cat >> "$ISOLINUX_CFG" <<'EOF'
# ### BEE MEMTEST ###
label memtest
menu label ^Memory Test (memtest86+)
linux /boot/memtest86+x64.bin
# ### /BEE MEMTEST ###
EOF
log "appended memtest entry to ${ISOLINUX_CFG}"
}
log "ensuring memtest binaries and menu entries in binary image"
ensure_memtest_binaries
ensure_grub_entry
ensure_isolinux_entry
log "memtest assets ready"
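Review note: `ensure_memtest_binaries` recomputes the same `missing` flag three times. A small sketch of how that check could be factored out, assuming a hypothetical helper `all_present` that is not part of the hook:

```sh
# Hypothetical helper (not in the hook): succeed only if every named
# file exists as a regular file inside DIR.
all_present() {
    dir="$1"
    shift
    for name in "$@"; do
        [ -f "${dir}/${name}" ] || return 1
    done
    return 0
}

# Possible use inside the hook, relying on word splitting of MEMTEST_FILES:
#   all_present "${BINARY_BOOT_DIR}" ${MEMTEST_FILES} && return 0
```

Each fallback stage (copy from `chroot/boot`, extract from a cached `.deb`) could then be followed by one `all_present` call instead of a repeated loop.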


@@ -1,2 +1,8 @@
# NVIDIA DCGM (Data Center GPU Manager) — dcgmi diag for acceptance testing # NVIDIA DCGM (Data Center GPU Manager) — dcgmi diag for acceptance testing.
datacenter-gpu-manager=1:%%DCGM_VERSION%% # DCGM 4 is packaged per CUDA major. The image ships NVIDIA driver 590 with CUDA 13 userspace,
# so install the CUDA 13 build plus proprietary diagnostic components explicitly.
datacenter-gpu-manager-4-cuda13=1:%%DCGM_VERSION%%
datacenter-gpu-manager-4-proprietary=1:%%DCGM_VERSION%%
datacenter-gpu-manager-4-proprietary-cuda13=1:%%DCGM_VERSION%%
ocl-icd-libopencl1
clinfo


@@ -21,8 +21,15 @@ openssh-server
# Disk installer # Disk installer
squashfs-tools squashfs-tools
parted parted
# Keep GRUB install tools without selecting a single active platform package.
# grub-pc and grub-efi-amd64 conflict with each other, but grub2-common
# provides grub-install/update-grub and the *-bin packages provide BIOS/UEFI modules.
grub2-common
grub-pc-bin grub-pc-bin
grub-efi-amd64-bin grub-efi-amd64-bin
grub-efi-amd64-signed
shim-signed
efibootmgr
# Filesystem support for USB export targets # Filesystem support for USB export targets
exfatprogs exfatprogs
@@ -39,11 +46,11 @@ vim-tiny
mc mc
htop htop
nvtop nvtop
btop
sudo sudo
zstd zstd
mstflint mstflint
memtester memtester
memtest86+
stress-ng stress-ng
stressapptest stressapptest


@@ -1,14 +1,14 @@
 [Unit]
-Description=Bee: run hardware audit
-After=bee-network.service bee-nvidia.service bee-preflight.service
+Description=Bee: hardware audit
+After=bee-preflight.service bee-network.service bee-nvidia.service
 Before=bee-web.service
 [Service]
 Type=oneshot
-ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-audit.log /bin/sh -c '/usr/local/bin/bee audit --runtime livecd --output file:/appdata/bee/export/bee-audit.json; rc=$?; if [ "$rc" -ne 0 ]; then echo "[bee-audit] WARN: audit exited with rc=$rc"; fi; exit 0'
+RemainAfterExit=yes
+ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-audit.log /usr/local/bin/bee audit --runtime auto --output file:/appdata/bee/export/bee-audit.json
 StandardOutput=journal
 StandardError=journal
-RemainAfterExit=yes
 [Install]
 WantedBy=multi-user.target


@@ -1,7 +1,6 @@
 [Unit]
 Description=Bee: hardware audit web viewer
-After=bee-network.service
-Wants=bee-audit.service
+After=bee-audit.service
 [Service]
 Type=simple
@@ -11,6 +10,9 @@ RestartSec=2
 StandardOutput=journal
 StandardError=journal
 LimitMEMLOCK=infinity
+# Keep the web server responsive during GPU/CPU stress (children inherit nice+10
+# via Setpriority in runCmdJob, but the bee-web parent stays at 0).
+Nice=0
 [Install]
 WantedBy=multi-user.target


@@ -4,3 +4,6 @@
 RestartSec=10
 StartLimitIntervalSec=60
 StartLimitBurst=3
+# Raise scheduling priority of the X server so the graphical console (KVM/IPMI)
+# stays responsive during GPU/CPU stress tests running at nice+10.
+Nice=-5


@@ -0,0 +1,93 @@
#!/bin/sh
set -eu
SECONDS=5
SIZE_MB=64
DEVICES=""
EXCLUDE=""
WORKER="/usr/local/lib/bee/bee-gpu-burn-worker"
usage() {
echo "usage: $0 [--seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3]" >&2
exit 2
}
normalize_list() {
echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}
contains_csv() {
needle="$1"
haystack="${2:-}"
echo ",${haystack}," | grep -q ",${needle},"
}
while [ "$#" -gt 0 ]; do
case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
*) usage ;;
esac
done
[ -x "${WORKER}" ] || { echo "bee-gpu-burn worker not found: ${WORKER}" >&2; exit 1; }
ALL_DEVICES=$(nvidia-smi --query-gpu=index --format=csv,noheader,nounits 2>/dev/null | sed 's/[[:space:]]//g' | awk 'NF' | paste -sd, -)
[ -n "${ALL_DEVICES}" ] || { echo "nvidia-smi found no NVIDIA GPUs" >&2; exit 1; }
DEVICES=$(normalize_list "${DEVICES}")
EXCLUDE=$(normalize_list "${EXCLUDE}")
SELECTED="${DEVICES}"
if [ -z "${SELECTED}" ]; then
SELECTED="${ALL_DEVICES}"
fi
FINAL=""
for id in $(echo "${SELECTED}" | tr ',' ' '); do
[ -n "${id}" ] || continue
if contains_csv "${id}" "${EXCLUDE}"; then
continue
fi
if [ -z "${FINAL}" ]; then
FINAL="${id}"
else
FINAL="${FINAL},${id}"
fi
done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
echo "loader=bee-gpu-burn"
echo "selected_gpus=${FINAL}"
TMP_DIR=$(mktemp -d)
trap 'rm -rf "${TMP_DIR}"' EXIT INT TERM
WORKERS=""
for id in $(echo "${FINAL}" | tr ',' ' '); do
log="${TMP_DIR}/gpu-${id}.log"
echo "starting gpu ${id}"
"${WORKER}" --device "${id}" --seconds "${SECONDS}" --size-mb "${SIZE_MB}" >"${log}" 2>&1 &
pid=$!
WORKERS="${WORKERS} ${pid}:${id}:${log}"
done
status=0
for spec in ${WORKERS}; do
pid=${spec%%:*}
rest=${spec#*:}
id=${rest%%:*}
log=${rest#*:}
if wait "${pid}"; then
echo "gpu ${id} finished: OK"
else
rc=$?
echo "gpu ${id} finished: FAILED rc=${rc}"
status=1
fi
sed "s/^/[gpu ${id}] /" "${log}" || true
done
exit "${status}"
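
The device selection above (normalize the `--devices` and `--exclude` lists, then drop excluded ids) can be exercised in isolation. The following is a minimal sketch, assuming a POSIX `/bin/sh`; `normalize_list` and `contains_csv` are copied from the script, while `filter_devices` is a hypothetical wrapper introduced only for illustration.

```shell
#!/bin/sh
set -eu

# Copied from the script: split on commas, strip whitespace, drop empty
# entries, sort numerically, de-duplicate, and re-join with commas.
normalize_list() {
    echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}

# Copied from the script: comma-padded match so that "1" does not match "11".
contains_csv() {
    needle="$1"
    haystack="${2:-}"
    echo ",${haystack}," | grep -q ",${needle},"
}

# Hypothetical helper (not in the script): apply the exclude list to the
# selected list and print the surviving ids as a comma-separated string.
filter_devices() {
    selected=$(normalize_list "$1")
    exclude=$(normalize_list "${2:-}")
    final=""
    for id in $(echo "${selected}" | tr ',' ' '); do
        if contains_csv "${id}" "${exclude}"; then
            continue
        fi
        final="${final:+${final},}${id}"
    done
    echo "${final}"
}

filter_devices "3, 1,2,1" "2"   # prints 1,3
```

Padding both the needle and the haystack with commas is what keeps the `grep` match exact for multi-digit GPU indices.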

@@ -12,17 +12,55 @@
set -euo pipefail set -euo pipefail
usage() {
cat >&2 <<'EOF'
Usage: bee-install <device> [logfile]
Installs the live system to a local disk (WIPES the target).
device Target block device, e.g. /dev/sda or /dev/nvme0n1
Must be a hard disk or NVMe — NOT a CD-ROM (/dev/sr*)
logfile Optional path for progress log (default: /tmp/bee-install.log)
Examples:
bee-install /dev/sda
bee-install /dev/nvme0n1
bee-install /dev/sdb /tmp/my-install.log
WARNING: ALL DATA ON <device> WILL BE ERASED.
Layout (UEFI): GPT — partition 1: EFI 512MB vfat, partition 2: root ext4
Layout (BIOS): MBR — partition 1: root ext4
EOF
exit 1
}
DEVICE="${1:-}" DEVICE="${1:-}"
LOGFILE="${2:-/tmp/bee-install.log}" LOGFILE="${2:-/tmp/bee-install.log}"
if [ -z "$DEVICE" ]; then if [ -z "$DEVICE" ] || [ "$DEVICE" = "--help" ] || [ "$DEVICE" = "-h" ]; then
echo "Usage: bee-install <device> [logfile]" >&2 usage
exit 1
fi fi
if [ ! -b "$DEVICE" ]; then if [ ! -b "$DEVICE" ]; then
echo "ERROR: $DEVICE is not a block device" >&2 echo "ERROR: $DEVICE is not a block device" >&2
echo "Run 'lsblk' to list available disks." >&2
exit 1 exit 1
fi fi
# Block CD-ROM devices
case "$DEVICE" in
/dev/sr*|/dev/scd*)
echo "ERROR: $DEVICE is a CD-ROM/optical device — cannot install to it." >&2
echo "Run 'lsblk' to find the target disk (e.g. /dev/sda, /dev/nvme0n1)." >&2
exit 1
;;
esac
# Check required tools
for tool in parted mkfs.vfat mkfs.ext4 unsquashfs grub-install update-grub; do
if ! command -v "$tool" >/dev/null 2>&1; then
echo "ERROR: required tool not found: $tool" >&2
exit 1
fi
done
SQUASHFS="/run/live/medium/live/filesystem.squashfs" SQUASHFS="/run/live/medium/live/filesystem.squashfs"
if [ ! -f "$SQUASHFS" ]; then if [ ! -f "$SQUASHFS" ]; then
@@ -158,20 +196,56 @@ mount --bind /sys "${MOUNT_ROOT}/sys"
# ------------------------------------------------------------------ # ------------------------------------------------------------------
log "--- Step 7/7: Installing GRUB bootloader ---" log "--- Step 7/7: Installing GRUB bootloader ---"
# Helper: run a chroot command, log all output, return its exit code.
# Needed because "cmd | while" pipelines hide the exit code of cmd.
chroot_log() {
local rc=0
local out
out=$(chroot "$MOUNT_ROOT" "$@" 2>&1) || rc=$?
echo "$out" | while IFS= read -r line; do log " $line"; done
return $rc
}
if [ "$UEFI" = "1" ]; then if [ "$UEFI" = "1" ]; then
chroot "$MOUNT_ROOT" grub-install \ # Primary attempt: write EFI NVRAM entry (requires writable efivars)
--target=x86_64-efi \ if ! chroot_log grub-install \
--efi-directory=/boot/efi \ --target=x86_64-efi \
--bootloader-id=bee \ --efi-directory=/boot/efi \
--recheck 2>&1 | while read -r line; do log " $line"; done || true --bootloader-id=bee \
--recheck; then
log " WARNING: grub-install (with NVRAM) failed — retrying with --no-nvram"
# --no-nvram: write grubx64.efi but skip EFI variable update.
# Needed on headless servers where efivars is read-only or unavailable.
chroot_log grub-install \
--target=x86_64-efi \
--efi-directory=/boot/efi \
--bootloader-id=bee \
--no-nvram \
--recheck || log " WARNING: grub-install --no-nvram also failed — check logs"
fi
# Always install the UEFI fallback path EFI/BOOT/BOOTX64.EFI.
# Many UEFI implementations (especially server BMCs and some firmware)
# ignore the NVRAM boot entry and only look for this path.
GRUB_EFI="${MOUNT_ROOT}/boot/efi/EFI/bee/grubx64.efi"
FALLBACK_DIR="${MOUNT_ROOT}/boot/efi/EFI/BOOT"
if [ -f "$GRUB_EFI" ]; then
mkdir -p "$FALLBACK_DIR"
cp "$GRUB_EFI" "${FALLBACK_DIR}/BOOTX64.EFI"
log " Fallback EFI binary installed: EFI/BOOT/BOOTX64.EFI"
else
log " WARNING: grubx64.efi not found at $GRUB_EFI — UEFI fallback path not set"
fi
else else
chroot "$MOUNT_ROOT" grub-install \ chroot_log grub-install \
--target=i386-pc \ --target=i386-pc \
--recheck \ --recheck \
"$DEVICE" 2>&1 | while read -r line; do log " $line"; done || true "$DEVICE" || log " WARNING: grub-install (BIOS) failed — check logs"
fi fi
chroot "$MOUNT_ROOT" update-grub 2>&1 | while read -r line; do log " $line"; done || true
log " GRUB installed." chroot_log update-grub || log " WARNING: update-grub failed — check logs"
log " GRUB step complete."
# ------------------------------------------------------------------ # ------------------------------------------------------------------
# Cleanup # Cleanup
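
The comment on `chroot_log` says that `cmd | while` pipelines hide the exit code of `cmd`. A minimal sketch of that behavior, assuming a POSIX shell without `pipefail`; `log`, `failing_cmd`, and `run_logged` are hypothetical stand-ins and not part of bee-install:

```shell
#!/bin/sh
# Hypothetical stand-ins for illustration only.
log() { echo "[log] $*"; }
failing_cmd() { echo "some output"; return 3; }

# Pattern 1: pipe straight into the logging loop. The pipeline status is the
# status of its last stage (the while loop), so the failure is lost.
failing_cmd 2>&1 | while IFS= read -r line; do log "  $line"; done
pipeline_rc=$?

# Pattern 2: capture the output first, remember the exit code, then log.
run_logged() {
    rc=0
    out=$("$@" 2>&1) || rc=$?
    echo "$out" | while IFS= read -r line; do log "  $line"; done
    return "$rc"
}
run_logged failing_cmd && captured_rc=0 || captured_rc=$?

echo "pipeline_rc=${pipeline_rc} captured_rc=${captured_rc}"
# prints: pipeline_rc=0 captured_rc=3
```

Capturing into a variable trades streaming output for a reliable status: nothing is logged until the command has finished, which is acceptable for short commands such as `grub-install`.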

@@ -0,0 +1,193 @@
#!/bin/sh
set -eu
SECONDS=300
DEVICES=""
EXCLUDE=""
FORMAT=""
JOHN_DIR="/usr/local/lib/bee/john/run"
JOHN_BIN="${JOHN_DIR}/john"
export OCL_ICD_VENDORS="/etc/OpenCL/vendors"
export LD_LIBRARY_PATH="/usr/lib:/usr/local/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
usage() {
echo "usage: $0 [--seconds N] [--devices 0,1] [--exclude 2,3] [--format name]" >&2
exit 2
}
normalize_list() {
echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}
contains_csv() {
needle="$1"
haystack="${2:-}"
echo ",${haystack}," | grep -q ",${needle},"
}
show_opencl_diagnostics() {
echo "-- OpenCL ICD vendors --" >&2
if [ -d /etc/OpenCL/vendors ]; then
ls -l /etc/OpenCL/vendors >&2 || true
for icd in /etc/OpenCL/vendors/*.icd; do
[ -f "${icd}" ] || continue
echo " file: ${icd}" >&2
sed 's/^/ /' "${icd}" >&2 || true
done
else
echo " /etc/OpenCL/vendors is missing" >&2
fi
echo "-- NVIDIA device nodes --" >&2
ls -l /dev/nvidia* >&2 || true
echo "-- ldconfig OpenCL/NVIDIA --" >&2
ldconfig -p 2>/dev/null | grep 'libOpenCL\|libcuda\|libnvidia-opencl' >&2 || true
if command -v clinfo >/dev/null 2>&1; then
echo "-- clinfo -l --" >&2
clinfo -l >&2 || true
fi
echo "-- john --list=opencl-devices --" >&2
./john --list=opencl-devices >&2 || true
}
refresh_nvidia_runtime() {
if [ "$(id -u)" != "0" ]; then
return 1
fi
if command -v bee-nvidia-load >/dev/null 2>&1; then
bee-nvidia-load >/dev/null 2>&1 || true
fi
ldconfig >/dev/null 2>&1 || true
return 0
}
ensure_nvidia_uvm() {
if lsmod 2>/dev/null | grep -q '^nvidia_uvm '; then
return 0
fi
if [ "$(id -u)" != "0" ]; then
return 1
fi
ko="/usr/local/lib/nvidia/nvidia-uvm.ko"
[ -f "${ko}" ] || return 1
if ! insmod "${ko}" >/dev/null 2>&1; then
return 1
fi
uvm_major=$(grep -m1 ' nvidia-uvm$' /proc/devices | awk '{print $1}')
if [ -n "${uvm_major}" ]; then
mknod -m 666 /dev/nvidia-uvm c "${uvm_major}" 0 2>/dev/null || true
mknod -m 666 /dev/nvidia-uvm-tools c "${uvm_major}" 1 2>/dev/null || true
fi
return 0
}
ensure_opencl_ready() {
out=$(./john --list=opencl-devices 2>&1 || true)
if echo "${out}" | grep -q "Device #"; then
return 0
fi
if refresh_nvidia_runtime; then
out=$(./john --list=opencl-devices 2>&1 || true)
if echo "${out}" | grep -q "Device #"; then
return 0
fi
fi
if ensure_nvidia_uvm; then
out=$(./john --list=opencl-devices 2>&1 || true)
if echo "${out}" | grep -q "Device #"; then
return 0
fi
fi
echo "OpenCL devices are not available for John." >&2
if ! lsmod 2>/dev/null | grep -q '^nvidia_uvm '; then
echo "nvidia_uvm is not loaded." >&2
fi
if [ ! -e /dev/nvidia-uvm ]; then
echo "/dev/nvidia-uvm is missing." >&2
fi
show_opencl_diagnostics
return 1
}
while [ "$#" -gt 0 ]; do
case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
--format) [ "$#" -ge 2 ] || usage; FORMAT="$2"; shift 2 ;;
*) usage ;;
esac
done
[ -x "${JOHN_BIN}" ] || { echo "john binary not found: ${JOHN_BIN}" >&2; exit 1; }
ALL_DEVICES=$(nvidia-smi --query-gpu=index --format=csv,noheader,nounits 2>/dev/null | sed 's/[[:space:]]//g' | awk 'NF' | paste -sd, -)
[ -n "${ALL_DEVICES}" ] || { echo "nvidia-smi found no NVIDIA GPUs" >&2; exit 1; }
DEVICES=$(normalize_list "${DEVICES}")
EXCLUDE=$(normalize_list "${EXCLUDE}")
SELECTED="${DEVICES}"
if [ -z "${SELECTED}" ]; then
SELECTED="${ALL_DEVICES}"
fi
FINAL=""
for id in $(echo "${SELECTED}" | tr ',' ' '); do
[ -n "${id}" ] || continue
if contains_csv "${id}" "${EXCLUDE}"; then
continue
fi
if [ -z "${FINAL}" ]; then
FINAL="${id}"
else
FINAL="${FINAL},${id}"
fi
done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
JOHN_DEVICES=""
for id in $(echo "${FINAL}" | tr ',' ' '); do
opencl_id=$((id + 1))
if [ -z "${JOHN_DEVICES}" ]; then
JOHN_DEVICES="${opencl_id}"
else
JOHN_DEVICES="${JOHN_DEVICES},${opencl_id}"
fi
done
echo "loader=john"
echo "selected_gpus=${FINAL}"
echo "john_devices=${JOHN_DEVICES}"
cd "${JOHN_DIR}"
ensure_opencl_ready || exit 1
choose_format() {
if [ -n "${FORMAT}" ]; then
echo "${FORMAT}"
return 0
fi
for candidate in sha512crypt-opencl pbkdf2-hmac-sha512-opencl 7z-opencl sha256crypt-opencl md5crypt-opencl; do
if ./john --test=1 --format="${candidate}" --devices="${JOHN_DEVICES}" >/dev/null 2>&1; then
echo "${candidate}"
return 0
fi
done
return 1
}
CHOSEN_FORMAT=$(choose_format) || {
echo "no suitable john OpenCL format found" >&2
./john --list=opencl-devices >&2 || true
exit 1
}
echo "format=${CHOSEN_FORMAT}"
exec ./john --test="${SECONDS}" --format="${CHOSEN_FORMAT}" --devices="${JOHN_DEVICES}"
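
The script turns 0-based `nvidia-smi` indices into the values passed to `--devices` by adding one to each. A minimal sketch of that mapping follows; `to_john_devices` is a hypothetical helper name, and the sketch assumes, as the script does, that John's OpenCL device numbering starts at 1 and follows the same order as `nvidia-smi`. On hosts with additional OpenCL platforms that assumption may not hold.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: convert a comma-separated list of 0-based GPU indices
# into the 1-based device list used for --devices in the script above.
to_john_devices() {
    result=""
    for id in $(echo "$1" | tr ',' ' '); do
        result="${result:+${result},}$((id + 1))"
    done
    echo "${result}"
}

to_john_devices "0,2,3"   # prints 1,3,4
```

Comparing the mapped list against `./john --list=opencl-devices` on the target host is a reasonable way to confirm the ordering before relying on it.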

@@ -17,7 +17,7 @@ mkdir -p "$(dirname "$log_file")"
serial_sink() { serial_sink() {
local tty="$1" local tty="$1"
if [ -w "$tty" ]; then if [ -w "$tty" ]; then
cat > "$tty" cat > "$tty" 2>/dev/null || true
else else
cat > /dev/null cat > /dev/null
fi fi

@@ -0,0 +1,91 @@
#!/bin/sh
set -eu
SECONDS=300
DEVICES=""
EXCLUDE=""
MIN_BYTES="512M"
MAX_BYTES="4G"
FACTOR="2"
ITERS="20"
ALL_REDUCE_BIN="/usr/local/bin/all_reduce_perf"
usage() {
echo "usage: $0 [--seconds N] [--devices 0,1] [--exclude 2,3]" >&2
exit 2
}
normalize_list() {
echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}
contains_csv() {
needle="$1"
haystack="${2:-}"
echo ",${haystack}," | grep -q ",${needle},"
}
while [ "$#" -gt 0 ]; do
case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
*) usage ;;
esac
done
[ -x "${ALL_REDUCE_BIN}" ] || { echo "all_reduce_perf not found: ${ALL_REDUCE_BIN}" >&2; exit 1; }
ALL_DEVICES=$(nvidia-smi --query-gpu=index --format=csv,noheader,nounits 2>/dev/null | sed 's/[[:space:]]//g' | awk 'NF' | paste -sd, -)
[ -n "${ALL_DEVICES}" ] || { echo "nvidia-smi found no NVIDIA GPUs" >&2; exit 1; }
DEVICES=$(normalize_list "${DEVICES}")
EXCLUDE=$(normalize_list "${EXCLUDE}")
SELECTED="${DEVICES}"
if [ -z "${SELECTED}" ]; then
SELECTED="${ALL_DEVICES}"
fi
FINAL=""
for id in $(echo "${SELECTED}" | tr ',' ' '); do
[ -n "${id}" ] || continue
if contains_csv "${id}" "${EXCLUDE}"; then
continue
fi
if [ -z "${FINAL}" ]; then
FINAL="${id}"
else
FINAL="${FINAL},${id}"
fi
done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
GPU_COUNT=$(echo "${FINAL}" | tr ',' '\n' | awk 'NF' | wc -l | awk '{print $1}')
[ "${GPU_COUNT}" -gt 0 ] || { echo "selected GPU count is zero" >&2; exit 1; }
echo "loader=nccl"
echo "selected_gpus=${FINAL}"
echo "gpu_count=${GPU_COUNT}"
echo "range=${MIN_BYTES}..${MAX_BYTES}"
echo "iters=${ITERS}"
deadline=$(( $(date +%s) + SECONDS ))
round=0
while :; do
now=$(date +%s)
if [ "${now}" -ge "${deadline}" ]; then
break
fi
round=$((round + 1))
remaining=$((deadline - now))
echo "round=${round} remaining_sec=${remaining}"
CUDA_VISIBLE_DEVICES="${FINAL}" \
"${ALL_REDUCE_BIN}" \
-b "${MIN_BYTES}" \
-e "${MAX_BYTES}" \
-f "${FACTOR}" \
-g "${GPU_COUNT}" \
--iters "${ITERS}"
done
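
The loop above repeats `all_reduce_perf` until a wall-clock deadline passes. The deadline is only checked between rounds, so a round that starts just before the deadline still runs to completion and the total runtime can exceed `--seconds`. A minimal sketch of the same pattern, with `run_rounds` as a hypothetical helper and an arbitrary command standing in for the benchmark:

```shell
#!/bin/sh
set -eu

# Hypothetical helper: run "$@" repeatedly until <duration> seconds have
# elapsed, then report how many rounds were started.
run_rounds() {
    duration="$1"
    shift
    deadline=$(( $(date +%s) + duration ))
    round=0
    # The check happens only here, before each round; a running round is
    # never interrupted.
    while [ "$(date +%s)" -lt "${deadline}" ]; do
        round=$((round + 1))
        "$@"
    done
    echo "rounds=${round}"
}

run_rounds 0 true   # prints rounds=0: the deadline has already passed
```

With a non-zero duration, for example `run_rounds 3 sleep 1`, the round count depends on timing, so no fixed output is claimed for that case.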

@@ -6,25 +6,66 @@ LOG_PREFIX="bee-network"
log() { echo "[$LOG_PREFIX] $*"; } log() { echo "[$LOG_PREFIX] $*"; }
# find physical interfaces: exclude lo and virtual (docker/virbr/veth/tun/tap) list_interfaces() {
interfaces=$(ip -o link show \ ip -o link show \
| awk -F': ' '{print $2}' \ | awk -F': ' '{print $2}' \
| grep -v '^lo$' \ | grep -v '^lo$' \
| grep -vE '^(docker|virbr|veth|tun|tap|br-|bond|dummy)' \ | grep -vE '^(docker|virbr|veth|tun|tap|br-|bond|dummy)' \
| sort) | sort
}
if [ -z "$interfaces" ]; then # Give udev a short chance to expose late NICs before the first scan.
if command -v udevadm >/dev/null 2>&1; then
udevadm settle --timeout=5 >/dev/null 2>&1 || log "WARN: udevadm settle timed out"
fi
started_ifaces=""
started_count=0
scan_pass=1
# Some server NICs appear a bit later after module/firmware init. Do a small
# bounded rescan window without turning network bring-up into a boot blocker.
while [ "$scan_pass" -le 3 ]; do
interfaces=$(list_interfaces)
if [ -n "$interfaces" ]; then
for iface in $interfaces; do
case " $started_ifaces " in
*" $iface "*) continue ;;
esac
log "bringing up $iface"
if ! ip link set "$iface" up; then
log "WARN: could not bring up $iface"
continue
fi
carrier=$(cat "/sys/class/net/$iface/carrier" 2>/dev/null || true)
if [ "$carrier" = "1" ]; then
log "carrier detected on $iface"
else
log "carrier not detected yet on $iface"
fi
# DHCP in background — non-blocking, keep dhclient verbose output in the service log.
dhclient -4 -v -nw "$iface" &
log "DHCP started for $iface (pid $!)"
started_ifaces="$started_ifaces $iface"
started_count=$((started_count + 1))
done
fi
if [ "$scan_pass" -ge 3 ]; then
break
fi
scan_pass=$((scan_pass + 1))
sleep 2
done
if [ "$started_count" -eq 0 ]; then
log "no physical interfaces found" log "no physical interfaces found"
exit 0 exit 0
fi fi
for iface in $interfaces; do log "done (interfaces started: $started_count)"
log "bringing up $iface"
ip link set "$iface" up || { log "WARN: could not bring up $iface"; continue; }
# DHCP in background — non-blocking, keep dhclient verbose output in the service log.
dhclient -4 -v -nw "$iface" &
log "DHCP started for $iface (pid $!)"
done
log "done"
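
The rescan loop on the new side of this diff avoids starting DHCP twice on the same interface by keeping a space-separated list and matching against it with `case`. A minimal sketch of that bookkeeping, with the interface names hard-coded per pass instead of coming from `ip -o link show`; `mark_started` is a hypothetical helper, and no network commands are run:

```shell
#!/bin/sh
set -eu

started=""
count=0

# Hypothetical helper: record an interface the first time it is seen.
# Returns 0 if it was newly recorded, 1 if it was already in the list.
# Padding with spaces on both sides keeps "eth1" from matching "eth10".
mark_started() {
    case " ${started} " in
        *" $1 "*) return 1 ;;
    esac
    started="${started} $1"
    count=$((count + 1))
    return 0
}

# Simulated scan results: eth2 only appears on the second pass.
for pass_list in "eth0 eth1" "eth0 eth1 eth2" "eth0 eth1 eth2"; do
    for iface in ${pass_list}; do
        if mark_started "${iface}"; then
            echo "start ${iface}"
        fi
    done
done
echo "count=${count}"
# prints: start eth0, start eth1, start eth2, count=3 (one per line)
```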

@@ -59,15 +59,28 @@ load_module() {
return 1 return 1
} }
load_host_module() {
mod="$1"
if modprobe "$mod" >/dev/null 2>&1; then
log "host module loaded: $mod"
return 0
fi
return 1
}
case "$nvidia_mode" in case "$nvidia_mode" in
normal|full) normal|full)
if ! load_module nvidia; then if ! load_module nvidia; then
exit 1 exit 1
fi fi
# nvidia-modeset on some server kernels needs ACPI video helper symbols
# exported by the generic "video" module. Best-effort only; compute paths
# remain functional even if display-related modules stay absent.
load_host_module video || true
load_module nvidia-modeset || true load_module nvidia-modeset || true
load_module nvidia-uvm || true load_module nvidia-uvm || true
;; ;;
gsp-off|safe|*) gsp-off|safe)
# NVIDIA documents that GSP firmware is enabled by default on newer GPUs and can # NVIDIA documents that GSP firmware is enabled by default on newer GPUs and can
# be disabled via NVreg_EnableGpuFirmware=0. Safe mode keeps the live ISO on the # be disabled via NVreg_EnableGpuFirmware=0. Safe mode keeps the live ISO on the
# conservative path for platforms where full boot-time GSP init is unstable. # conservative path for platforms where full boot-time GSP init is unstable.
@@ -76,6 +89,15 @@ case "$nvidia_mode" in
fi fi
log "GSP-off mode: skipping nvidia-modeset and nvidia-uvm during boot" log "GSP-off mode: skipping nvidia-modeset and nvidia-uvm during boot"
;; ;;
nomsi|*)
# nomsi: disable MSI-X/MSI interrupts — use when RmInitAdapter fails with
# "Failed to enable MSI-X" on one or more GPUs (IOMMU group interrupt limits).
# NVreg_EnableMSI=0 forces legacy INTx interrupts for all GPUs.
if ! load_module nvidia NVreg_EnableGpuFirmware=0 NVreg_EnableMSI=0; then
exit 1
fi
log "nomsi mode: MSI-X disabled (NVreg_EnableMSI=0), skipping nvidia-modeset and nvidia-uvm"
;;
esac esac
# Create /dev/nvidia* device nodes (udev rules absent since we use .run installer) # Create /dev/nvidia* device nodes (udev rules absent since we use .run installer)
@@ -105,4 +127,19 @@ fi
ldconfig 2>/dev/null || true ldconfig 2>/dev/null || true
log "ldconfig refreshed" log "ldconfig refreshed"
# Start DCGM host engine so dcgmi can discover GPUs.
# nv-hostengine must run before any dcgmi command — without it, dcgmi reports
# "group is empty" even when GPUs and modules are present.
# Skip if already running (e.g. started by a dcgm systemd service or prior boot).
if command -v nv-hostengine >/dev/null 2>&1; then
if pgrep -x nv-hostengine >/dev/null 2>&1; then
log "nv-hostengine already running — skipping"
else
nv-hostengine
log "nv-hostengine started"
fi
else
log "WARN: nv-hostengine not found — dcgmi diagnostics will not work"
fi
log "done" log "done"

@@ -8,13 +8,16 @@ xset -dpms
xset s noblank xset s noblank
tint2 & tint2 &
# Wait for bee-web to bind (Go starts fast, usually <2s)
# Wait up to 120s for bee-web to bind. The web server starts immediately now
# (audit is deferred), so this should succeed in a few seconds on most hardware.
i=0 i=0
while [ $i -lt 30 ]; do while [ $i -lt 120 ]; do
if curl -sf http://localhost/healthz >/dev/null 2>&1; then break; fi if curl -sf http://localhost/healthz >/dev/null 2>&1; then break; fi
sleep 1 sleep 1
i=$((i+1)) i=$((i+1))
done done
chromium \ chromium \
--disable-infobars \ --disable-infobars \
--disable-translate \ --disable-translate \