Compare commits

..

15 Commits
v5.2 ... v5.8

Author SHA1 Message Date
5b9015451e Add live task charts and fix USB export actions 2026-04-05 20:14:23 +03:00
d1a6863ceb Use amber fallback wallpaper color (#f6c90e) instead of black
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:30:41 +03:00
f9aa05de8e Add wallpaper: black background with amber EASY-BEE ASCII art logo
- Add feh and python3-pil to package list
- Add chroot hook that generates /usr/share/bee/wallpaper.png using PIL:
  black background, EASY-BEE box-drawing logo in amber (#f6c90e),
  "Hardware Audit LiveCD" subtitle in dim amber — matches motd exactly
- bee-openbox-session: set wallpaper with feh --bg-fill, fall back to
  xsetroot -solid black if wallpaper not found

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:29:42 +03:00
a9ccea8cca Fix black desktop and Chromium blank page on startup
- Set xsetroot solid background (#12100a, dark amber) so openbox
  doesn't show bare black before Chromium opens
- Re-add healthz wait loop before launching Chromium: without it
  Chromium opens localhost/loading before bee-web is up and gets
  connection-refused which renders as a blank white page

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:25:32 +03:00
fc5c985fb5 Reset tty1 properly when bee-boot-status exits
Add TTYReset=yes and TTYVHangup=yes so systemd restores the terminal
to a clean state before handing tty1 to getty. Without this the screen
went black with no cursor after the status display finished.

Also remove DefaultDependencies=no which was too aggressive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:22:01 +03:00
5eb3baddb4 Fix bee-boot-status blank screen caused by variable buffering
Command substitution in sh strips trailing newlines, so accumulating
output in a variable via $(...) lost all line breaks. Reverted to
direct printf calls which work correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:21:10 +03:00
a6ac13b5d3 Improve bee-boot-status: slower refresh, more detail
- Refresh every 3s instead of 1s to reduce flicker
- Show ssh, bee-sshsetup in service list
- Show failure reason for failed services
- Show last journal line for activating services
- Show IP addresses and web UI URL when network is up
- Render frame to variable before printing to reduce flicker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:20:07 +03:00
4003cb7676 Lower kernel console loglevel to 3 to reduce boot noise
loglevel=6 floods the screen with mpt3sas/scsi/sd informational
messages, hiding systemd service status and bee-boot-status display.
loglevel=3 shows only kernel errors; all messages still go to serial.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:19:09 +03:00
2875313ba0 Improve boot UX: status display, faster GUI, loading spinner
- Add bee-boot-status service: shows live service status on tty1 with
  ASCII logo before getty, exits when all bee services settle
- Remove lightdm dependency on bee-preflight so GUI starts immediately
  without waiting for NVIDIA driver load
- Replace Chromium blank-page problem with /loading spinner page that
  polls /api/services and auto-redirects when services are ready; add
  "Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome
- Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu,
  bee-boot-status (with yellow ASCII logo), and web spinner

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 18:58:24 +03:00
f1621efee4 Mirror task lifecycle to serial console 2026-04-05 18:34:06 +03:00
4461249cc3 Make memory stress size follow available RAM 2026-04-05 18:33:26 +03:00
e609fbbc26 Add task reports and streamline GPU charts 2026-04-05 18:13:58 +03:00
cc2b49ea41 Improve validate GPU runs and web UI feedback 2026-04-05 17:50:13 +03:00
33e0a5bef2 Refine validate UI and runtime health table 2026-04-05 16:24:45 +03:00
38e79143eb Refine burn UI and NVIDIA stress flows 2026-04-05 13:43:43 +03:00
35 changed files with 3199 additions and 548 deletions

View File

@@ -1,9 +1,10 @@
LISTEN ?= :8080 LISTEN ?= :8080
AUDIT_PATH ?= AUDIT_PATH ?=
EXPORT_DIR ?= $(CURDIR)/.tmp/export
VERSION ?= $(shell sh ./scripts/resolve-version.sh) VERSION ?= $(shell sh ./scripts/resolve-version.sh)
GO_LDFLAGS := -X main.Version=$(VERSION) GO_LDFLAGS := -X main.Version=$(VERSION)
RUN_ARGS := web --listen $(LISTEN) RUN_ARGS := web --listen $(LISTEN) --export-dir $(EXPORT_DIR)
ifneq ($(AUDIT_PATH),) ifneq ($(AUDIT_PATH),)
RUN_ARGS += --audit-path $(AUDIT_PATH) RUN_ARGS += --audit-path $(AUDIT_PATH)
endif endif
@@ -11,6 +12,7 @@ endif
.PHONY: run build test .PHONY: run build test
run: run:
mkdir -p $(EXPORT_DIR)
go run -ldflags "$(GO_LDFLAGS)" ./cmd/bee $(RUN_ARGS) go run -ldflags "$(GO_LDFLAGS)" ./cmd/bee $(RUN_ARGS)
build: build:

View File

@@ -87,7 +87,7 @@ func printRootUsage(w io.Writer) {
bee preflight --output stdout|file:<path> bee preflight --output stdout|file:<path>
bee export --target <device> bee export --target <device>
bee support-bundle --output stdout|file:<path> bee support-bundle --output stdout|file:<path>
bee web --listen :80 --audit-path `+app.DefaultAuditJSONPath+` bee web --listen :80 [--audit-path `+app.DefaultAuditJSONPath+`]
bee sat nvidia|memory|storage|cpu [--duration <seconds>] bee sat nvidia|memory|storage|cpu [--duration <seconds>]
bee benchmark nvidia [--profile standard|stability|overnight] bee benchmark nvidia [--profile standard|stability|overnight]
bee version bee version
@@ -296,7 +296,7 @@ func runWeb(args []string, stdout, stderr io.Writer) int {
fs := flag.NewFlagSet("web", flag.ContinueOnError) fs := flag.NewFlagSet("web", flag.ContinueOnError)
fs.SetOutput(stderr) fs.SetOutput(stderr)
listenAddr := fs.String("listen", ":8080", "listen address, e.g. :80") listenAddr := fs.String("listen", ":8080", "listen address, e.g. :80")
auditPath := fs.String("audit-path", app.DefaultAuditJSONPath, "path to the latest audit JSON snapshot") auditPath := fs.String("audit-path", "", "optional path to the latest audit JSON snapshot")
exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles") exportDir := fs.String("export-dir", app.DefaultExportDir, "directory with logs, SAT results, and support bundles")
title := fs.String("title", "Bee Hardware Audit", "page title") title := fs.String("title", "Bee Hardware Audit", "page title")
fs.Usage = func() { fs.Usage = func() {

View File

@@ -115,7 +115,12 @@ func (a *App) RunInstallToRAM(ctx context.Context, logFunc func(string)) error {
type satRunner interface { type satRunner interface {
RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error) RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error)
RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (string, error) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (string, error)
RunNvidiaTargetedStressValidatePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
RunNvidiaBenchmark(ctx context.Context, baseDir string, opts platform.NvidiaBenchmarkOptions, logFunc func(string)) (string, error) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts platform.NvidiaBenchmarkOptions, logFunc func(string)) (string, error)
RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
RunNvidiaPulseTestPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpuIndices []int, logFunc func(string)) (string, error)
RunNvidiaStressPack(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) RunNvidiaStressPack(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error)
RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
RunStorageAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) RunStorageAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
@@ -528,6 +533,13 @@ func (a *App) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir st
return ActionResult{Title: "NVIDIA DCGM", Body: body}, err return ActionResult{Title: "NVIDIA DCGM", Body: body}, err
} }
func (a *App) RunNvidiaTargetedStressValidatePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaTargetedStressValidatePack(ctx, baseDir, durationSec, gpuIndices, logFunc)
}
func (a *App) RunNvidiaStressPack(baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) { func (a *App) RunNvidiaStressPack(baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
return a.RunNvidiaStressPackCtx(context.Background(), baseDir, opts, logFunc) return a.RunNvidiaStressPackCtx(context.Background(), baseDir, opts, logFunc)
} }
@@ -543,6 +555,34 @@ func (a *App) RunNvidiaBenchmarkCtx(ctx context.Context, baseDir string, opts pl
return a.sat.RunNvidiaBenchmark(ctx, baseDir, opts, logFunc) return a.sat.RunNvidiaBenchmark(ctx, baseDir, opts, logFunc)
} }
func (a *App) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaOfficialComputePack(ctx, baseDir, durationSec, gpuIndices, logFunc)
}
func (a *App) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaTargetedPowerPack(ctx, baseDir, durationSec, gpuIndices, logFunc)
}
func (a *App) RunNvidiaPulseTestPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaPulseTestPack(ctx, baseDir, durationSec, gpuIndices, logFunc)
}
func (a *App) RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpuIndices []int, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir
}
return a.sat.RunNvidiaBandwidthPack(ctx, baseDir, gpuIndices, logFunc)
}
func (a *App) RunNvidiaStressPackCtx(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) { func (a *App) RunNvidiaStressPackCtx(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error) {
if strings.TrimSpace(baseDir) == "" { if strings.TrimSpace(baseDir) == "" {
baseDir = DefaultSATBaseDir baseDir = DefaultSATBaseDir
@@ -893,6 +933,12 @@ func latestSATSummaries() []string {
prefix string prefix string
}{ }{
{label: "NVIDIA SAT", prefix: "gpu-nvidia-"}, {label: "NVIDIA SAT", prefix: "gpu-nvidia-"},
{label: "NVIDIA Targeted Stress Validate (dcgmi diag targeted_stress)", prefix: "gpu-nvidia-targeted-stress-"},
{label: "NVIDIA Max Compute Load (dcgmproftester)", prefix: "gpu-nvidia-compute-"},
{label: "NVIDIA Targeted Power (dcgmi diag targeted_power)", prefix: "gpu-nvidia-targeted-power-"},
{label: "NVIDIA Pulse Test (dcgmi diag pulse_test)", prefix: "gpu-nvidia-pulse-"},
{label: "NVIDIA Interconnect Test (NCCL all_reduce_perf)", prefix: "gpu-nvidia-nccl-"},
{label: "NVIDIA Bandwidth Test (NVBandwidth)", prefix: "gpu-nvidia-bandwidth-"},
{label: "Memory SAT", prefix: "memory-"}, {label: "Memory SAT", prefix: "memory-"},
{label: "Storage SAT", prefix: "storage-"}, {label: "Storage SAT", prefix: "storage-"},
{label: "CPU SAT", prefix: "cpu-"}, {label: "CPU SAT", prefix: "cpu-"},

View File

@@ -120,16 +120,21 @@ func (f fakeTools) CheckTools(names []string) []platform.ToolStatus {
} }
type fakeSAT struct { type fakeSAT struct {
runNvidiaFn func(string) (string, error) runNvidiaFn func(string) (string, error)
runNvidiaBenchmarkFn func(string, platform.NvidiaBenchmarkOptions) (string, error) runNvidiaBenchmarkFn func(string, platform.NvidiaBenchmarkOptions) (string, error)
runNvidiaStressFn func(string, platform.NvidiaStressOptions) (string, error) runNvidiaStressFn func(string, platform.NvidiaStressOptions) (string, error)
runMemoryFn func(string) (string, error) runNvidiaComputeFn func(string, int, []int) (string, error)
runStorageFn func(string) (string, error) runNvidiaPowerFn func(string, int, []int) (string, error)
runCPUFn func(string, int) (string, error) runNvidiaPulseFn func(string, int, []int) (string, error)
detectVendorFn func() string runNvidiaBandwidthFn func(string, []int) (string, error)
listAMDGPUsFn func() ([]platform.AMDGPUInfo, error) runNvidiaTargetedStressFn func(string, int, []int) (string, error)
runAMDPackFn func(string) (string, error) runMemoryFn func(string) (string, error)
listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error) runStorageFn func(string) (string, error)
runCPUFn func(string, int) (string, error)
detectVendorFn func() string
listAMDGPUsFn func() ([]platform.AMDGPUInfo, error)
runAMDPackFn func(string) (string, error)
listNvidiaGPUsFn func() ([]platform.NvidiaGPU, error)
} }
func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string, _ func(string)) (string, error) { func (f fakeSAT) RunNvidiaAcceptancePack(baseDir string, _ func(string)) (string, error) {
@@ -147,6 +152,41 @@ func (f fakeSAT) RunNvidiaBenchmark(_ context.Context, baseDir string, opts plat
return f.runNvidiaFn(baseDir) return f.runNvidiaFn(baseDir)
} }
func (f fakeSAT) RunNvidiaTargetedStressValidatePack(_ context.Context, baseDir string, durationSec int, gpuIndices []int, _ func(string)) (string, error) {
if f.runNvidiaTargetedStressFn != nil {
return f.runNvidiaTargetedStressFn(baseDir, durationSec, gpuIndices)
}
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaOfficialComputePack(_ context.Context, baseDir string, durationSec int, gpuIndices []int, _ func(string)) (string, error) {
if f.runNvidiaComputeFn != nil {
return f.runNvidiaComputeFn(baseDir, durationSec, gpuIndices)
}
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaTargetedPowerPack(_ context.Context, baseDir string, durationSec int, gpuIndices []int, _ func(string)) (string, error) {
if f.runNvidiaPowerFn != nil {
return f.runNvidiaPowerFn(baseDir, durationSec, gpuIndices)
}
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaPulseTestPack(_ context.Context, baseDir string, durationSec int, gpuIndices []int, _ func(string)) (string, error) {
if f.runNvidiaPulseFn != nil {
return f.runNvidiaPulseFn(baseDir, durationSec, gpuIndices)
}
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaBandwidthPack(_ context.Context, baseDir string, gpuIndices []int, _ func(string)) (string, error) {
if f.runNvidiaBandwidthFn != nil {
return f.runNvidiaBandwidthFn(baseDir, gpuIndices)
}
return f.runNvidiaFn(baseDir)
}
func (f fakeSAT) RunNvidiaStressPack(_ context.Context, baseDir string, opts platform.NvidiaStressOptions, _ func(string)) (string, error) { func (f fakeSAT) RunNvidiaStressPack(_ context.Context, baseDir string, opts platform.NvidiaStressOptions, _ func(string)) (string, error) {
if f.runNvidiaStressFn != nil { if f.runNvidiaStressFn != nil {
return f.runNvidiaStressFn(baseDir, opts) return f.runNvidiaStressFn(baseDir, opts)

View File

@@ -21,12 +21,12 @@ type ComponentStatusDB struct {
// ComponentStatusRecord holds the current and historical health of one hardware component. // ComponentStatusRecord holds the current and historical health of one hardware component.
type ComponentStatusRecord struct { type ComponentStatusRecord struct {
ComponentKey string `json:"component_key"` ComponentKey string `json:"component_key"`
Status string `json:"status"` // "OK", "Warning", "Critical", "Unknown" Status string `json:"status"` // "OK", "Warning", "Critical", "Unknown"
LastCheckedAt time.Time `json:"last_checked_at"` LastCheckedAt time.Time `json:"last_checked_at"`
LastChangedAt time.Time `json:"last_changed_at"` LastChangedAt time.Time `json:"last_changed_at"`
ErrorSummary string `json:"error_summary,omitempty"` ErrorSummary string `json:"error_summary,omitempty"`
History []ComponentStatusEntry `json:"history"` History []ComponentStatusEntry `json:"history"`
} }
// ComponentStatusEntry is one observation written to a component's history. // ComponentStatusEntry is one observation written to a component's history.
@@ -179,7 +179,9 @@ func ApplySATResultToDB(db *ComponentStatusDB, target, archivePath string) {
// Map SAT target to component keys. // Map SAT target to component keys.
switch target { switch target {
case "nvidia", "amd", "nvidia-stress", "amd-stress", "amd-mem", "amd-bandwidth": case "nvidia", "nvidia-targeted-stress", "nvidia-compute", "nvidia-targeted-power", "nvidia-pulse",
"nvidia-interconnect", "nvidia-bandwidth", "amd", "nvidia-stress",
"amd-stress", "amd-mem", "amd-bandwidth":
db.Record("pcie:gpu:"+target, source, dbStatus, target+" SAT: "+overall) db.Record("pcie:gpu:"+target, source, dbStatus, target+" SAT: "+overall)
case "memory", "memory-stress", "sat-stress": case "memory", "memory-stress", "sat-stress":
db.Record("memory:all", source, dbStatus, target+" SAT: "+overall) db.Record("memory:all", source, dbStatus, target+" SAT: "+overall)

View File

@@ -135,12 +135,15 @@ func (s *System) runtimeToolStatuses(vendor string) []ToolStatus {
case "nvidia": case "nvidia":
tools = append(tools, s.CheckTools([]string{ tools = append(tools, s.CheckTools([]string{
"nvidia-smi", "nvidia-smi",
"dcgmi",
"nv-hostengine",
"nvidia-bug-report.sh", "nvidia-bug-report.sh",
"bee-gpu-burn", "bee-gpu-burn",
"bee-john-gpu-stress", "bee-john-gpu-stress",
"bee-nccl-gpu-stress", "bee-nccl-gpu-stress",
"all_reduce_perf", "all_reduce_perf",
})...) })...)
tools = append(tools, resolvedToolStatus("dcgmproftester", dcgmProfTesterCandidates...))
case "amd": case "amd":
tool := ToolStatus{Name: "rocm-smi"} tool := ToolStatus{Name: "rocm-smi"}
if cmd, err := resolveROCmSMICommand(); err == nil && len(cmd) > 0 { if cmd, err := resolveROCmSMICommand(); err == nil && len(cmd) > 0 {
@@ -155,6 +158,16 @@ func (s *System) runtimeToolStatuses(vendor string) []ToolStatus {
return tools return tools
} }
func resolvedToolStatus(display string, candidates ...string) ToolStatus {
for _, candidate := range candidates {
path, err := exec.LookPath(candidate)
if err == nil {
return ToolStatus{Name: display, Path: path, OK: true}
}
}
return ToolStatus{Name: display}
}
func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) { func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) {
lsmodText := commandText("lsmod") lsmodText := commandText("lsmod")

View File

@@ -21,10 +21,11 @@ import (
) )
var ( var (
satExecCommand = exec.Command satExecCommand = exec.Command
satLookPath = exec.LookPath satLookPath = exec.LookPath
satGlob = filepath.Glob satGlob = filepath.Glob
satStat = os.Stat satStat = os.Stat
satFreeMemBytes = freeMemBytes
rocmSMIExecutableGlobs = []string{ rocmSMIExecutableGlobs = []string{
"/opt/rocm/bin/rocm-smi", "/opt/rocm/bin/rocm-smi",
@@ -38,6 +39,12 @@ var (
"/opt/rocm/bin/rvs", "/opt/rocm/bin/rvs",
"/opt/rocm-*/bin/rvs", "/opt/rocm-*/bin/rvs",
} }
dcgmProfTesterCandidates = []string{
"dcgmproftester",
"dcgmproftester13",
"dcgmproftester12",
"dcgmproftester11",
}
) )
// streamExecOutput runs cmd and streams each output line to logFunc (if non-nil). // streamExecOutput runs cmd and streams each output line to logFunc (if non-nil).
@@ -256,6 +263,9 @@ func (s *System) ListNvidiaGPUs() ([]NvidiaGPU, error) {
MemoryMB: memMB, MemoryMB: memMB,
}) })
} }
sort.Slice(gpus, func(i, j int) bool {
return gpus[i].Index < gpus[j].Index
})
return gpus, nil return gpus, nil
} }
@@ -277,6 +287,80 @@ func (s *System) RunNCCLTests(ctx context.Context, baseDir string, logFunc func(
}, logFunc) }, logFunc)
} }
func (s *System) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
selected, err := resolveDCGMGPUIndices(gpuIndices)
if err != nil {
return "", err
}
profCmd, err := resolveDCGMProfTesterCommand("--no-dcgm-validation", "-t", "1004", "-d", strconv.Itoa(normalizeNvidiaBurnDuration(durationSec)))
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-compute", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{name: "02-dcgmi-version.log", cmd: []string{"dcgmi", "-v"}},
{
name: "03-dcgmproftester.log",
cmd: profCmd,
env: nvidiaVisibleDevicesEnv(selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
}
func (s *System) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
selected, err := resolveDCGMGPUIndices(gpuIndices)
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-targeted-power", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{
name: "02-dcgmi-targeted-power.log",
cmd: nvidiaDCGMNamedDiagCommand("targeted_power", normalizeNvidiaBurnDuration(durationSec), selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
}
func (s *System) RunNvidiaPulseTestPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
selected, err := resolveDCGMGPUIndices(gpuIndices)
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-pulse", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{
name: "02-dcgmi-pulse-test.log",
cmd: nvidiaDCGMNamedDiagCommand("pulse_test", normalizeNvidiaBurnDuration(durationSec), selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
}
func (s *System) RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpuIndices []int, logFunc func(string)) (string, error) {
selected, err := resolveDCGMGPUIndices(gpuIndices)
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-bandwidth", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{
name: "02-dcgmi-nvbandwidth.log",
cmd: nvidiaDCGMNamedDiagCommand("nvbandwidth", 0, selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
}
func (s *System) RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error) { func (s *System) RunNvidiaAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
return runAcceptancePackCtx(context.Background(), baseDir, "gpu-nvidia", nvidiaSATJobs(), logFunc) return runAcceptancePackCtx(context.Background(), baseDir, "gpu-nvidia", nvidiaSATJobs(), logFunc)
} }
@@ -293,6 +377,23 @@ func (s *System) RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia", nvidiaDCGMJobs(diagLevel, resolvedGPUIndices), logFunc) return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia", nvidiaDCGMJobs(diagLevel, resolvedGPUIndices), logFunc)
} }
func (s *System) RunNvidiaTargetedStressValidatePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
selected, err := resolveDCGMGPUIndices(gpuIndices)
if err != nil {
return "", err
}
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-targeted-stress", []satJob{
{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
{
name: "02-dcgmi-targeted-stress.log",
cmd: nvidiaDCGMNamedDiagCommand("targeted_stress", normalizeNvidiaBurnDuration(durationSec), selected),
collectGPU: true,
gpuIndices: selected,
},
{name: "03-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
}, logFunc)
}
func resolveDCGMGPUIndices(gpuIndices []int) ([]int, error) { func resolveDCGMGPUIndices(gpuIndices []int) ([]int, error) {
if len(gpuIndices) > 0 { if len(gpuIndices) > 0 {
return dedupeSortedIndices(gpuIndices), nil return dedupeSortedIndices(gpuIndices), nil
@@ -307,6 +408,25 @@ func resolveDCGMGPUIndices(gpuIndices []int) ([]int, error) {
return all, nil return all, nil
} }
func memoryStressSizeArg() string {
if mb := envInt("BEE_VM_STRESS_SIZE_MB", 0); mb > 0 {
return fmt.Sprintf("%dM", mb)
}
availBytes := satFreeMemBytes()
if availBytes <= 0 {
return "80%"
}
availMB := availBytes / (1024 * 1024)
targetMB := (availMB * 2) / 3
if targetMB >= 256 {
targetMB = (targetMB / 256) * 256
}
if targetMB <= 0 {
return "80%"
}
return fmt.Sprintf("%dM", targetMB)
}
func (s *System) RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) { func (s *System) RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128) sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128)
passes := envInt("BEE_MEMTESTER_PASSES", 1) passes := envInt("BEE_MEMTESTER_PASSES", 1)
@@ -322,11 +442,9 @@ func (s *System) RunMemoryStressPack(ctx context.Context, baseDir string, durati
if seconds <= 0 { if seconds <= 0 {
seconds = envInt("BEE_VM_STRESS_SECONDS", 300) seconds = envInt("BEE_VM_STRESS_SECONDS", 300)
} }
// Use 80% of RAM by default; override with BEE_VM_STRESS_SIZE_MB. // Base the default on current MemAvailable and keep headroom for the OS and
sizeArg := "80%" // concurrent stressors so mixed burn runs do not trip the OOM killer.
if mb := envInt("BEE_VM_STRESS_SIZE_MB", 0); mb > 0 { sizeArg := memoryStressSizeArg()
sizeArg = fmt.Sprintf("%dM", mb)
}
return runAcceptancePackCtx(ctx, baseDir, "memory-stress", []satJob{ return runAcceptancePackCtx(ctx, baseDir, "memory-stress", []satJob{
{name: "01-free-before.log", cmd: []string{"free", "-h"}}, {name: "01-free-before.log", cmd: []string{"free", "-h"}},
{name: "02-stress-ng-vm.log", cmd: []string{ {name: "02-stress-ng-vm.log", cmd: []string{
@@ -473,6 +591,31 @@ func nvidiaDCGMJobs(diagLevel int, gpuIndices []int) []satJob {
} }
} }
func nvidiaDCGMNamedDiagCommand(name string, durationSec int, gpuIndices []int) []string {
args := []string{"dcgmi", "diag", "-r", name}
if durationSec > 0 {
args = append(args, "-p", fmt.Sprintf("%s.test_duration=%d", name, durationSec))
}
if len(gpuIndices) > 0 {
args = append(args, "-i", joinIndexList(gpuIndices))
}
return args
}
func normalizeNvidiaBurnDuration(durationSec int) int {
if durationSec <= 0 {
return 300
}
return durationSec
}
func nvidiaVisibleDevicesEnv(gpuIndices []int) []string {
if len(gpuIndices) == 0 {
return nil
}
return []string{"CUDA_VISIBLE_DEVICES=" + joinIndexList(gpuIndices)}
}
func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob, logFunc func(string)) (string, error) { func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []satJob, logFunc func(string)) (string, error) {
if ctx == nil { if ctx == nil {
ctx = context.Background() ctx = context.Background()
@@ -642,6 +785,7 @@ func classifySATResult(name string, out []byte, err error) (string, int) {
} }
if strings.Contains(text, "unsupported") || if strings.Contains(text, "unsupported") ||
strings.Contains(text, "not supported") || strings.Contains(text, "not supported") ||
strings.Contains(text, "not found in path") ||
strings.Contains(text, "invalid opcode") || strings.Contains(text, "invalid opcode") ||
strings.Contains(text, "unknown command") || strings.Contains(text, "unknown command") ||
strings.Contains(text, "not implemented") || strings.Contains(text, "not implemented") ||
@@ -748,6 +892,15 @@ func resolveROCmSMICommand(args ...string) ([]string, error) {
return nil, errors.New("rocm-smi not found in PATH or under /opt/rocm") return nil, errors.New("rocm-smi not found in PATH or under /opt/rocm")
} }
func resolveDCGMProfTesterCommand(args ...string) ([]string, error) {
for _, candidate := range dcgmProfTesterCandidates {
if path, err := satLookPath(candidate); err == nil {
return append([]string{path}, args...), nil
}
}
return nil, errors.New("dcgmproftester not found in PATH")
}
func ensureAMDRuntimeReady() error { func ensureAMDRuntimeReady() error {
if _, err := os.Stat("/dev/kfd"); err == nil { if _, err := os.Stat("/dev/kfd"); err == nil {
return nil return nil

View File

@@ -195,6 +195,53 @@ func TestResolveDCGMGPUIndicesKeepsExplicitSelection(t *testing.T) {
} }
} }
func TestResolveDCGMProfTesterCommandUsesVersionedBinary(t *testing.T) {
oldLookPath := satLookPath
satLookPath = func(file string) (string, error) {
switch file {
case "dcgmproftester13":
return "/usr/bin/dcgmproftester13", nil
default:
return "", exec.ErrNotFound
}
}
t.Cleanup(func() { satLookPath = oldLookPath })
cmd, err := resolveDCGMProfTesterCommand("--no-dcgm-validation", "-t", "1004")
if err != nil {
t.Fatalf("resolveDCGMProfTesterCommand error: %v", err)
}
if len(cmd) != 4 {
t.Fatalf("cmd len=%d want 4 (%v)", len(cmd), cmd)
}
if cmd[0] != "/usr/bin/dcgmproftester13" {
t.Fatalf("cmd[0]=%q want /usr/bin/dcgmproftester13", cmd[0])
}
}
func TestNvidiaDCGMNamedDiagCommandUsesDurationAndSelection(t *testing.T) {
cmd := nvidiaDCGMNamedDiagCommand("targeted_power", 900, []int{3, 1})
want := []string{"dcgmi", "diag", "-r", "targeted_power", "-p", "targeted_power.test_duration=900", "-i", "3,1"}
if len(cmd) != len(want) {
t.Fatalf("cmd len=%d want %d (%v)", len(cmd), len(want), cmd)
}
for i := range want {
if cmd[i] != want[i] {
t.Fatalf("cmd[%d]=%q want %q", i, cmd[i], want[i])
}
}
}
func TestNvidiaVisibleDevicesEnvUsesSelectedGPUs(t *testing.T) {
env := nvidiaVisibleDevicesEnv([]int{0, 2, 4})
if len(env) != 1 {
t.Fatalf("env len=%d want 1 (%v)", len(env), env)
}
if env[0] != "CUDA_VISIBLE_DEVICES=0,2,4" {
t.Fatalf("env[0]=%q want CUDA_VISIBLE_DEVICES=0,2,4", env[0])
}
}
func TestNvidiaStressArchivePrefixByLoader(t *testing.T) { func TestNvidiaStressArchivePrefixByLoader(t *testing.T) {
t.Parallel() t.Parallel()
@@ -229,6 +276,37 @@ func TestEnvIntFallback(t *testing.T) {
} }
} }
func TestMemoryStressSizeArgUsesAvailableMemory(t *testing.T) {
oldFreeMemBytes := satFreeMemBytes
satFreeMemBytes = func() int64 { return 96 * 1024 * 1024 * 1024 }
t.Cleanup(func() { satFreeMemBytes = oldFreeMemBytes })
if got := memoryStressSizeArg(); got != "65536M" {
t.Fatalf("sizeArg=%q want 65536M", got)
}
}
func TestMemoryStressSizeArgRespectsOverride(t *testing.T) {
oldFreeMemBytes := satFreeMemBytes
satFreeMemBytes = func() int64 { return 96 * 1024 * 1024 * 1024 }
t.Cleanup(func() { satFreeMemBytes = oldFreeMemBytes })
t.Setenv("BEE_VM_STRESS_SIZE_MB", "4096")
if got := memoryStressSizeArg(); got != "4096M" {
t.Fatalf("sizeArg=%q want 4096M", got)
}
}
func TestMemoryStressSizeArgFallsBackWhenFreeMemoryUnknown(t *testing.T) {
oldFreeMemBytes := satFreeMemBytes
satFreeMemBytes = func() int64 { return 0 }
t.Cleanup(func() { satFreeMemBytes = oldFreeMemBytes })
if got := memoryStressSizeArg(); got != "80%" {
t.Fatalf("sizeArg=%q want 80%%", got)
}
}
func TestClassifySATResult(t *testing.T) { func TestClassifySATResult(t *testing.T) {
tests := []struct { tests := []struct {
name string name string

View File

@@ -581,15 +581,32 @@ func (h *handler) handleAPIGPUTools(w http.ResponseWriter, _ *http.Request) {
nvidiaUp := nvidiaErr == nil nvidiaUp := nvidiaErr == nil
amdUp := amdErr == nil amdUp := amdErr == nil
_, dcgmErr := exec.LookPath("dcgmi") _, dcgmErr := exec.LookPath("dcgmi")
_, ncclStressErr := exec.LookPath("bee-nccl-gpu-stress")
_, johnErr := exec.LookPath("bee-john-gpu-stress")
_, beeBurnErr := exec.LookPath("bee-gpu-burn")
_, nvBandwidthErr := exec.LookPath("nvbandwidth")
profErr := lookPathAny("dcgmproftester", "dcgmproftester13", "dcgmproftester12", "dcgmproftester11")
writeJSON(w, []toolEntry{ writeJSON(w, []toolEntry{
{ID: "bee-gpu-burn", Available: nvidiaUp, Vendor: "nvidia"}, {ID: "nvidia-compute", Available: nvidiaUp && profErr == nil, Vendor: "nvidia"},
{ID: "dcgm", Available: nvidiaUp && dcgmErr == nil, Vendor: "nvidia"}, {ID: "nvidia-targeted-power", Available: nvidiaUp && dcgmErr == nil, Vendor: "nvidia"},
{ID: "john", Available: nvidiaUp, Vendor: "nvidia"}, {ID: "nvidia-pulse", Available: nvidiaUp && dcgmErr == nil, Vendor: "nvidia"},
{ID: "nccl", Available: nvidiaUp, Vendor: "nvidia"}, {ID: "nvidia-interconnect", Available: nvidiaUp && ncclStressErr == nil, Vendor: "nvidia"},
{ID: "nvidia-bandwidth", Available: nvidiaUp && dcgmErr == nil && nvBandwidthErr == nil, Vendor: "nvidia"},
{ID: "bee-gpu-burn", Available: nvidiaUp && beeBurnErr == nil, Vendor: "nvidia"},
{ID: "john", Available: nvidiaUp && johnErr == nil, Vendor: "nvidia"},
{ID: "rvs", Available: amdUp, Vendor: "amd"}, {ID: "rvs", Available: amdUp, Vendor: "amd"},
}) })
} }
func lookPathAny(names ...string) error {
for _, name := range names {
if _, err := exec.LookPath(name); err == nil {
return nil
}
}
return exec.ErrNotFound
}
// ── System ──────────────────────────────────────────────────────────────────── // ── System ────────────────────────────────────────────────────────────────────
func (h *handler) handleAPIRAMStatus(w http.ResponseWriter, r *http.Request) { func (h *handler) handleAPIRAMStatus(w http.ResponseWriter, r *http.Request) {
@@ -628,7 +645,7 @@ func (h *handler) handleAPIInstallToRAM(w http.ResponseWriter, r *http.Request)
var standardTools = []string{ var standardTools = []string{
"dmidecode", "smartctl", "nvme", "lspci", "ipmitool", "dmidecode", "smartctl", "nvme", "lspci", "ipmitool",
"nvidia-smi", "memtester", "stress-ng", "nvtop", "nvidia-smi", "dcgmi", "nv-hostengine", "memtester", "stress-ng", "nvtop",
"mstflint", "qrencode", "mstflint", "qrencode",
} }

View File

@@ -6,6 +6,7 @@ import (
"sort" "sort"
"strconv" "strconv"
"strings" "strings"
"sync"
"time" "time"
"bee/audit/internal/platform" "bee/audit/internal/platform"
@@ -52,6 +53,12 @@ var metricChartPalette = []string{
"#ffbe5c", "#ffbe5c",
} }
var gpuLabelCache struct {
mu sync.Mutex
loadedAt time.Time
byIndex map[int]string
}
func renderMetricChartSVG(title string, labels []string, times []time.Time, datasets [][]float64, names []string, yMin, yMax *float64, canvasHeight int, timeline []chartTimelineSegment) ([]byte, error) { func renderMetricChartSVG(title string, labels []string, times []time.Time, datasets [][]float64, names []string, yMin, yMax *float64, canvasHeight int, timeline []chartTimelineSegment) ([]byte, error) {
pointCount := len(labels) pointCount := len(labels)
if len(times) > pointCount { if len(times) > pointCount {
@@ -76,15 +83,7 @@ func renderMetricChartSVG(title string, labels []string, times []time.Time, data
} }
} }
mn, avg, mx := globalStats(datasets) statsLabel := chartStatsLabel(datasets)
if mx > 0 {
title = fmt.Sprintf("%s ↓%s ~%s ↑%s",
title,
chartLegendNumber(mn),
chartLegendNumber(avg),
chartLegendNumber(mx),
)
}
legendItems := []metricChartSeries{} legendItems := []metricChartSeries{}
for i, name := range names { for i, name := range names {
@@ -106,7 +105,7 @@ func renderMetricChartSVG(title string, labels []string, times []time.Time, data
var b strings.Builder var b strings.Builder
writeSVGOpen(&b, layout.Width, layout.Height) writeSVGOpen(&b, layout.Width, layout.Height)
writeChartFrame(&b, title, layout.Width, layout.Height) writeChartFrame(&b, title, statsLabel, layout.Width, layout.Height)
writeTimelineIdleSpans(&b, layout, start, end, timeline) writeTimelineIdleSpans(&b, layout, start, end, timeline)
writeVerticalGrid(&b, layout, times, pointCount, 8) writeVerticalGrid(&b, layout, times, pointCount, 8)
writeHorizontalGrid(&b, layout, scale) writeHorizontalGrid(&b, layout, scale)
@@ -126,21 +125,19 @@ func renderGPUOverviewChartSVG(idx int, samples []platform.LiveMetricSample, tim
temp := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.TempC }) temp := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.TempC })
power := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.PowerW }) power := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.PowerW })
coreClock := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.ClockMHz }) coreClock := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.ClockMHz })
memClock := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.MemClockMHz }) if temp == nil && power == nil && coreClock == nil {
if temp == nil && power == nil && coreClock == nil && memClock == nil {
return nil, false, nil return nil, false, nil
} }
labels := sampleTimeLabels(samples) labels := sampleTimeLabels(samples)
times := sampleTimes(samples) times := sampleTimes(samples)
svg, err := drawGPUOverviewChartSVG( svg, err := drawGPUOverviewChartSVG(
fmt.Sprintf("GPU %d Overview", idx), gpuDisplayLabel(idx)+" Overview",
labels, labels,
times, times,
[]metricChartSeries{ []metricChartSeries{
{Name: "Temp C", Values: coalesceDataset(temp, len(labels)), Color: "#f05a5a", AxisTitle: "Temp C"}, {Name: "Temp C", Values: coalesceDataset(temp, len(labels)), Color: "#f05a5a", AxisTitle: "Temp C"},
{Name: "Power W", Values: coalesceDataset(power, len(labels)), Color: "#ffb357", AxisTitle: "Power W"}, {Name: "Power W", Values: coalesceDataset(power, len(labels)), Color: "#ffb357", AxisTitle: "Power W"},
{Name: "Core Clock MHz", Values: coalesceDataset(coreClock, len(labels)), Color: "#73bf69", AxisTitle: "Core MHz"}, {Name: "Core Clock MHz", Values: coalesceDataset(coreClock, len(labels)), Color: "#73bf69", AxisTitle: "Core MHz"},
{Name: "Memory Clock MHz", Values: coalesceDataset(memClock, len(labels)), Color: "#5794f2", AxisTitle: "Memory MHz"},
}, },
timeline, timeline,
) )
@@ -151,8 +148,8 @@ func renderGPUOverviewChartSVG(idx int, samples []platform.LiveMetricSample, tim
} }
func drawGPUOverviewChartSVG(title string, labels []string, times []time.Time, series []metricChartSeries, timeline []chartTimelineSegment) ([]byte, error) { func drawGPUOverviewChartSVG(title string, labels []string, times []time.Time, series []metricChartSeries, timeline []chartTimelineSegment) ([]byte, error) {
if len(series) != 4 { if len(series) != 3 {
return nil, fmt.Errorf("gpu overview requires 4 series, got %d", len(series)) return nil, fmt.Errorf("gpu overview requires 3 series, got %d", len(series))
} }
const ( const (
width = 1400 width = 1400
@@ -166,7 +163,6 @@ func drawGPUOverviewChartSVG(title string, labels []string, times []time.Time, s
leftOuterAxis = 72 leftOuterAxis = 72
leftInnerAxis = 132 leftInnerAxis = 132
rightInnerAxis = 1268 rightInnerAxis = 1268
rightOuterAxis = 1328
) )
layout := chartLayout{ layout := chartLayout{
Width: width, Width: width,
@@ -176,7 +172,7 @@ func drawGPUOverviewChartSVG(title string, labels []string, times []time.Time, s
PlotTop: plotTop, PlotTop: plotTop,
PlotBottom: plotBottom, PlotBottom: plotBottom,
} }
axisX := []int{leftOuterAxis, leftInnerAxis, rightInnerAxis, rightOuterAxis} axisX := []int{leftOuterAxis, leftInnerAxis, rightInnerAxis}
pointCount := len(labels) pointCount := len(labels)
if len(times) > pointCount { if len(times) > pointCount {
pointCount = len(times) pointCount = len(times)
@@ -214,7 +210,7 @@ func drawGPUOverviewChartSVG(title string, labels []string, times []time.Time, s
var b strings.Builder var b strings.Builder
writeSVGOpen(&b, width, height) writeSVGOpen(&b, width, height)
writeChartFrame(&b, title, width, height) writeChartFrame(&b, title, "", width, height)
writeTimelineIdleSpans(&b, layout, start, end, timeline) writeTimelineIdleSpans(&b, layout, start, end, timeline)
writeVerticalGrid(&b, layout, times, pointCount, 8) writeVerticalGrid(&b, layout, times, pointCount, 8)
writeHorizontalGrid(&b, layout, scales[0]) writeHorizontalGrid(&b, layout, scales[0])
@@ -457,10 +453,14 @@ func writeSVGClose(b *strings.Builder) {
b.WriteString("</svg>\n") b.WriteString("</svg>\n")
} }
func writeChartFrame(b *strings.Builder, title string, width, height int) { func writeChartFrame(b *strings.Builder, title, subtitle string, width, height int) {
fmt.Fprintf(b, `<rect width="%d" height="%d" rx="10" ry="10" fill="#ffffff" stroke="#d7e0ea"/>`+"\n", width, height) fmt.Fprintf(b, `<rect width="%d" height="%d" rx="10" ry="10" fill="#ffffff" stroke="#d7e0ea"/>`+"\n", width, height)
fmt.Fprintf(b, `<text x="%d" y="30" text-anchor="middle" font-family="sans-serif" font-size="16" font-weight="700" fill="#1f2937">%s</text>`+"\n", fmt.Fprintf(b, `<text x="%d" y="30" text-anchor="middle" font-family="sans-serif" font-size="16" font-weight="700" fill="#1f2937">%s</text>`+"\n",
width/2, sanitizeChartText(title)) width/2, sanitizeChartText(title))
if strings.TrimSpace(subtitle) != "" {
fmt.Fprintf(b, `<text x="%d" y="50" text-anchor="middle" font-family="sans-serif" font-size="12" font-weight="600" fill="#64748b">%s</text>`+"\n",
width/2, sanitizeChartText(subtitle))
}
} }
func writePlotBorder(b *strings.Builder, layout chartLayout) { func writePlotBorder(b *strings.Builder, layout chartLayout) {
@@ -545,7 +545,21 @@ func writeSeriesPolyline(b *strings.Builder, layout chartLayout, times []time.Ti
x := chartXForTime(chartPointTime(times, 0), start, end, layout.PlotLeft, layout.PlotRight) x := chartXForTime(chartPointTime(times, 0), start, end, layout.PlotLeft, layout.PlotRight)
y := chartYForValue(values[0], scale, layout.PlotTop, layout.PlotBottom) y := chartYForValue(values[0], scale, layout.PlotTop, layout.PlotBottom)
fmt.Fprintf(b, `<circle cx="%.1f" cy="%.1f" r="3.5" fill="%s"/>`+"\n", x, y, color) fmt.Fprintf(b, `<circle cx="%.1f" cy="%.1f" r="3.5" fill="%s"/>`+"\n", x, y, color)
return
} }
peakIdx := 0
peakValue := values[0]
for idx, value := range values[1:] {
if value >= peakValue {
peakIdx = idx + 1
peakValue = value
}
}
x := chartXForTime(chartPointTime(times, peakIdx), start, end, layout.PlotLeft, layout.PlotRight)
y := chartYForValue(peakValue, scale, layout.PlotTop, layout.PlotBottom)
fmt.Fprintf(b, `<circle cx="%.1f" cy="%.1f" r="4.2" fill="%s" stroke="#ffffff" stroke-width="1.6"/>`+"\n", x, y, color)
fmt.Fprintf(b, `<path d="M %.1f %.1f L %.1f %.1f L %.1f %.1f Z" fill="%s" opacity="0.9"/>`+"\n",
x, y-10, x-5, y-18, x+5, y-18, color)
} }
func writeLegend(b *strings.Builder, layout chartLayout, series []metricChartSeries) { func writeLegend(b *strings.Builder, layout chartLayout, series []metricChartSeries) {
@@ -711,3 +725,49 @@ func valueClamp(value float64, scale chartScale) float64 {
} }
return value return value
} }
func chartStatsLabel(datasets [][]float64) string {
mn, avg, mx := globalStats(datasets)
if mx <= 0 && avg <= 0 && mn <= 0 {
return ""
}
return fmt.Sprintf("min %s avg %s max %s",
chartLegendNumber(mn),
chartLegendNumber(avg),
chartLegendNumber(mx),
)
}
func gpuDisplayLabel(idx int) string {
if name := gpuModelNameByIndex(idx); name != "" {
return fmt.Sprintf("GPU %d — %s", idx, name)
}
return fmt.Sprintf("GPU %d", idx)
}
func gpuModelNameByIndex(idx int) string {
now := time.Now()
gpuLabelCache.mu.Lock()
if now.Sub(gpuLabelCache.loadedAt) > 30*time.Second || gpuLabelCache.byIndex == nil {
gpuLabelCache.loadedAt = now
gpuLabelCache.byIndex = loadGPUModelNames()
}
name := strings.TrimSpace(gpuLabelCache.byIndex[idx])
gpuLabelCache.mu.Unlock()
return name
}
func loadGPUModelNames() map[int]string {
out := map[int]string{}
gpus, err := platform.New().ListNvidiaGPUs()
if err != nil {
return out
}
for _, gpu := range gpus {
name := strings.TrimSpace(gpu.Name)
if name != "" {
out[gpu.Index] = name
}
}
return out
}

View File

@@ -9,13 +9,14 @@ import (
// jobState holds the output lines and completion status of an async job. // jobState holds the output lines and completion status of an async job.
type jobState struct { type jobState struct {
lines []string lines []string
done bool done bool
err string err string
mu sync.Mutex mu sync.Mutex
subs []chan string subs []chan string
cancel func() // optional cancel function; nil if job is not cancellable cancel func() // optional cancel function; nil if job is not cancellable
logPath string logPath string
serialPrefix string
} }
// abort cancels the job if it has a cancel function and is not yet done. // abort cancels the job if it has a cancel function and is not yet done.
@@ -36,6 +37,9 @@ func (j *jobState) append(line string) {
if j.logPath != "" { if j.logPath != "" {
appendJobLog(j.logPath, line) appendJobLog(j.logPath, line)
} }
if j.serialPrefix != "" {
taskSerialWriteLine(j.serialPrefix + line)
}
for _, ch := range j.subs { for _, ch := range j.subs {
select { select {
case ch <- line: case ch <- line:
@@ -107,8 +111,11 @@ func (m *jobManager) get(id string) (*jobState, bool) {
return j, ok return j, ok
} }
func newTaskJobState(logPath string) *jobState { func newTaskJobState(logPath string, serialPrefix ...string) *jobState {
j := &jobState{logPath: logPath} j := &jobState{logPath: logPath}
if len(serialPrefix) > 0 {
j.serialPrefix = serialPrefix[0]
}
if logPath == "" { if logPath == "" {
return j return j
} }

View File

@@ -232,7 +232,8 @@ func truncate(s string, max int) string {
// isSATTarget returns true for task targets that run hardware acceptance tests. // isSATTarget returns true for task targets that run hardware acceptance tests.
func isSATTarget(target string) bool { func isSATTarget(target string) bool {
switch target { switch target {
case "nvidia", "nvidia-benchmark", "nvidia-stress", "memory", "memory-stress", "storage", case "nvidia", "nvidia-targeted-stress", "nvidia-benchmark", "nvidia-compute", "nvidia-targeted-power", "nvidia-pulse",
"nvidia-interconnect", "nvidia-bandwidth", "nvidia-stress", "memory", "memory-stress", "storage",
"cpu", "sat-stress", "amd", "amd-mem", "amd-bandwidth", "amd-stress", "cpu", "sat-stress", "amd", "amd-mem", "amd-bandwidth", "amd-stress",
"platform-stress": "platform-stress":
return true return true

View File

@@ -22,6 +22,13 @@ type MetricsDB struct {
db *sql.DB db *sql.DB
} }
func (m *MetricsDB) Close() error {
if m == nil || m.db == nil {
return nil
}
return m.db.Close()
}
// openMetricsDB opens (or creates) the metrics database at the given path. // openMetricsDB opens (or creates) the metrics database at the given path.
func openMetricsDB(path string) (*MetricsDB, error) { func openMetricsDB(path string) (*MetricsDB, error) {
if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil { if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
@@ -164,6 +171,23 @@ func (m *MetricsDB) LoadAll() ([]platform.LiveMetricSample, error) {
return m.loadSamples(`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts`, nil) return m.loadSamples(`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics ORDER BY ts`, nil)
} }
// LoadBetween returns samples in chronological order within the given time window.
func (m *MetricsDB) LoadBetween(start, end time.Time) ([]platform.LiveMetricSample, error) {
if m == nil {
return nil, nil
}
if start.IsZero() || end.IsZero() {
return nil, nil
}
if end.Before(start) {
start, end = end, start
}
return m.loadSamples(
`SELECT ts,cpu_load_pct,mem_load_pct,power_w FROM sys_metrics WHERE ts>=? AND ts<=? ORDER BY ts`,
start.Unix(), end.Unix(),
)
}
// loadSamples reconstructs LiveMetricSample rows from the normalized tables. // loadSamples reconstructs LiveMetricSample rows from the normalized tables.
func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetricSample, error) { func (m *MetricsDB) loadSamples(query string, args ...any) ([]platform.LiveMetricSample, error) {
rows, err := m.db.Query(query, args...) rows, err := m.db.Query(query, args...)
@@ -364,9 +388,6 @@ func (m *MetricsDB) ExportCSV(w io.Writer) error {
return cw.Error() return cw.Error()
} }
// Close closes the database.
func (m *MetricsDB) Close() { _ = m.db.Close() }
func nullFloat(v float64) sql.NullFloat64 { func nullFloat(v float64) sql.NullFloat64 {
return sql.NullFloat64{Float64: v, Valid: true} return sql.NullFloat64{Float64: v, Valid: true}
} }

View File

@@ -143,3 +143,32 @@ CREATE TABLE temp_metrics (
t.Fatalf("MemClockMHz=%v want 2600", got) t.Fatalf("MemClockMHz=%v want 2600", got)
} }
} }
func TestMetricsDBLoadBetweenFiltersWindow(t *testing.T) {
db, err := openMetricsDB(filepath.Join(t.TempDir(), "metrics.db"))
if err != nil {
t.Fatalf("openMetricsDB: %v", err)
}
defer db.Close()
base := time.Unix(1_700_000_000, 0).UTC()
for i := 0; i < 5; i++ {
if err := db.Write(platform.LiveMetricSample{
Timestamp: base.Add(time.Duration(i) * time.Minute),
CPULoadPct: float64(i),
}); err != nil {
t.Fatalf("Write(%d): %v", i, err)
}
}
got, err := db.LoadBetween(base.Add(1*time.Minute), base.Add(3*time.Minute))
if err != nil {
t.Fatalf("LoadBetween: %v", err)
}
if len(got) != 3 {
t.Fatalf("LoadBetween len=%d want 3", len(got))
}
if !got[0].Timestamp.Equal(base.Add(1*time.Minute)) || !got[2].Timestamp.Equal(base.Add(3*time.Minute)) {
t.Fatalf("window=%v..%v", got[0].Timestamp, got[2].Timestamp)
}
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,41 @@
package webui
import (
"fmt"
"os"
"strings"
"time"
)
var taskSerialWriteLine = writeTaskSerialLine
func writeTaskSerialLine(line string) {
line = strings.TrimSpace(line)
if line == "" {
return
}
payload := fmt.Sprintf("%s %s\n", time.Now().UTC().Format("2006-01-02 15:04:05Z"), line)
for _, path := range []string{"/dev/ttyS0", "/dev/ttyS1", "/dev/console"} {
f, err := os.OpenFile(path, os.O_WRONLY|os.O_APPEND, 0)
if err != nil {
continue
}
_, _ = f.WriteString(payload)
_ = f.Close()
return
}
}
func taskSerialPrefix(t *Task) string {
if t == nil {
return "[task] "
}
return fmt.Sprintf("[task %s %s] ", t.ID, t.Name)
}
func taskSerialEvent(t *Task, event string) {
if t == nil {
return
}
taskSerialWriteLine(fmt.Sprintf("%s%s", taskSerialPrefix(t), strings.TrimSpace(event)))
}

View File

@@ -221,6 +221,11 @@ func NewHandler(opts HandlerOptions) http.Handler {
// ── Infrastructure ────────────────────────────────────────────────────── // ── Infrastructure ──────────────────────────────────────────────────────
mux.HandleFunc("GET /healthz", h.handleHealthz) mux.HandleFunc("GET /healthz", h.handleHealthz)
mux.HandleFunc("GET /api/ready", h.handleReady) mux.HandleFunc("GET /api/ready", h.handleReady)
mux.HandleFunc("GET /loading", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(loadingPageHTML))
})
// ── Existing read-only endpoints (preserved for compatibility) ────────── // ── Existing read-only endpoints (preserved for compatibility) ──────────
mux.HandleFunc("GET /audit.json", h.handleAuditJSON) mux.HandleFunc("GET /audit.json", h.handleAuditJSON)
@@ -237,6 +242,12 @@ func NewHandler(opts HandlerOptions) http.Handler {
// SAT // SAT
mux.HandleFunc("POST /api/sat/nvidia/run", h.handleAPISATRun("nvidia")) mux.HandleFunc("POST /api/sat/nvidia/run", h.handleAPISATRun("nvidia"))
mux.HandleFunc("POST /api/sat/nvidia-targeted-stress/run", h.handleAPISATRun("nvidia-targeted-stress"))
mux.HandleFunc("POST /api/sat/nvidia-compute/run", h.handleAPISATRun("nvidia-compute"))
mux.HandleFunc("POST /api/sat/nvidia-targeted-power/run", h.handleAPISATRun("nvidia-targeted-power"))
mux.HandleFunc("POST /api/sat/nvidia-pulse/run", h.handleAPISATRun("nvidia-pulse"))
mux.HandleFunc("POST /api/sat/nvidia-interconnect/run", h.handleAPISATRun("nvidia-interconnect"))
mux.HandleFunc("POST /api/sat/nvidia-bandwidth/run", h.handleAPISATRun("nvidia-bandwidth"))
mux.HandleFunc("POST /api/sat/nvidia-stress/run", h.handleAPISATRun("nvidia-stress")) mux.HandleFunc("POST /api/sat/nvidia-stress/run", h.handleAPISATRun("nvidia-stress"))
mux.HandleFunc("POST /api/sat/memory/run", h.handleAPISATRun("memory")) mux.HandleFunc("POST /api/sat/memory/run", h.handleAPISATRun("memory"))
mux.HandleFunc("POST /api/sat/storage/run", h.handleAPISATRun("storage")) mux.HandleFunc("POST /api/sat/storage/run", h.handleAPISATRun("storage"))
@@ -259,6 +270,9 @@ func NewHandler(opts HandlerOptions) http.Handler {
mux.HandleFunc("POST /api/tasks/{id}/cancel", h.handleAPITasksCancel) mux.HandleFunc("POST /api/tasks/{id}/cancel", h.handleAPITasksCancel)
mux.HandleFunc("POST /api/tasks/{id}/priority", h.handleAPITasksPriority) mux.HandleFunc("POST /api/tasks/{id}/priority", h.handleAPITasksPriority)
mux.HandleFunc("GET /api/tasks/{id}/stream", h.handleAPITasksStream) mux.HandleFunc("GET /api/tasks/{id}/stream", h.handleAPITasksStream)
mux.HandleFunc("GET /api/tasks/{id}/charts", h.handleAPITaskChartsIndex)
mux.HandleFunc("GET /api/tasks/{id}/chart/", h.handleAPITaskChartSVG)
mux.HandleFunc("GET /tasks/{id}", h.handleTaskPage)
// Services // Services
mux.HandleFunc("GET /api/services", h.handleAPIServicesList) mux.HandleFunc("GET /api/services", h.handleAPIServicesList)
@@ -697,7 +711,7 @@ func chartDataFromSamples(path string, samples []platform.LiveMetricSample) ([][
} }
switch sub { switch sub {
case "load": case "load":
title = fmt.Sprintf("GPU %d Load", idx) title = gpuDisplayLabel(idx) + " Load"
util := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.UsagePct }) util := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.UsagePct })
mem := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.MemUsagePct }) mem := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.MemUsagePct })
if util == nil && mem == nil { if util == nil && mem == nil {
@@ -708,7 +722,7 @@ func chartDataFromSamples(path string, samples []platform.LiveMetricSample) ([][
yMin = floatPtr(0) yMin = floatPtr(0)
yMax = floatPtr(100) yMax = floatPtr(100)
case "temp": case "temp":
title = fmt.Sprintf("GPU %d Temperature", idx) title = gpuDisplayLabel(idx) + " Temperature"
temp := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.TempC }) temp := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.TempC })
if temp == nil { if temp == nil {
return nil, nil, nil, "", nil, nil, false return nil, nil, nil, "", nil, nil, false
@@ -718,7 +732,7 @@ func chartDataFromSamples(path string, samples []platform.LiveMetricSample) ([][
yMin = floatPtr(0) yMin = floatPtr(0)
yMax = autoMax120(temp) yMax = autoMax120(temp)
case "clock": case "clock":
title = fmt.Sprintf("GPU %d Core Clock", idx) title = gpuDisplayLabel(idx) + " Core Clock"
clock := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.ClockMHz }) clock := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.ClockMHz })
if clock == nil { if clock == nil {
return nil, nil, nil, "", nil, nil, false return nil, nil, nil, "", nil, nil, false
@@ -727,7 +741,7 @@ func chartDataFromSamples(path string, samples []platform.LiveMetricSample) ([][
names = []string{"Core Clock MHz"} names = []string{"Core Clock MHz"}
yMin, yMax = autoBounds120(clock) yMin, yMax = autoBounds120(clock)
case "memclock": case "memclock":
title = fmt.Sprintf("GPU %d Memory Clock", idx) title = gpuDisplayLabel(idx) + " Memory Clock"
clock := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.MemClockMHz }) clock := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.MemClockMHz })
if clock == nil { if clock == nil {
return nil, nil, nil, "", nil, nil, false return nil, nil, nil, "", nil, nil, false
@@ -736,7 +750,7 @@ func chartDataFromSamples(path string, samples []platform.LiveMetricSample) ([][
names = []string{"Memory Clock MHz"} names = []string{"Memory Clock MHz"}
yMin, yMax = autoBounds120(clock) yMin, yMax = autoBounds120(clock)
default: default:
title = fmt.Sprintf("GPU %d Power", idx) title = gpuDisplayLabel(idx) + " Power"
power := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.PowerW }) power := gpuDatasetByIndex(samples, idx, func(g platform.GPUMetricRow) float64 { return g.PowerW })
if power == nil { if power == nil {
return nil, nil, nil, "", nil, nil, false return nil, nil, nil, "", nil, nil, false
@@ -865,7 +879,7 @@ func gpuDatasets(samples []platform.LiveMetricSample, pick func(platform.GPUMetr
continue continue
} }
datasets = append(datasets, ds) datasets = append(datasets, ds)
names = append(names, fmt.Sprintf("GPU %d", idx)) names = append(names, gpuDisplayLabel(idx))
} }
return datasets, names return datasets, names
} }
@@ -1182,6 +1196,11 @@ func (h *handler) handleAPIMetricsExportCSV(w http.ResponseWriter, r *http.Reque
func (h *handler) handleReady(w http.ResponseWriter, r *http.Request) { func (h *handler) handleReady(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Cache-Control", "no-store") w.Header().Set("Cache-Control", "no-store")
if strings.TrimSpace(h.opts.AuditPath) == "" {
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("ready"))
return
}
if _, err := os.Stat(h.opts.AuditPath); err != nil { if _, err := os.Stat(h.opts.AuditPath); err != nil {
w.WriteHeader(http.StatusServiceUnavailable) w.WriteHeader(http.StatusServiceUnavailable)
_, _ = w.Write([]byte("starting")) _, _ = w.Write([]byte("starting"))
@@ -1195,37 +1214,106 @@ const loadingPageHTML = `<!DOCTYPE html>
<html lang="en"> <html lang="en">
<head> <head>
<meta charset="UTF-8"> <meta charset="UTF-8">
<title>EASY-BEE</title> <title>EASY-BEE — Starting</title>
<style> <style>
*{margin:0;padding:0;box-sizing:border-box} *{margin:0;padding:0;box-sizing:border-box}
html,body{height:100%;background:#0f1117;display:flex;align-items:center;justify-content:center;font-family:'Courier New',monospace;color:#e2e8f0} html,body{height:100%;background:#0f1117;display:flex;align-items:center;justify-content:center;font-family:'Courier New',monospace;color:#e2e8f0}
.logo{font-size:13px;line-height:1.4;color:#f6c90e;margin-bottom:48px;white-space:pre} .wrap{text-align:center;width:420px}
.spinner{width:48px;height:48px;border:4px solid #2d3748;border-top-color:#f6c90e;border-radius:50%;animation:spin .8s linear infinite;margin:0 auto 24px} .logo{font-size:11px;line-height:1.4;color:#f6c90e;margin-bottom:6px;white-space:pre;text-align:left}
.subtitle{font-size:12px;color:#a0aec0;text-align:left;margin-bottom:24px;padding-left:2px}
.spinner{width:36px;height:36px;border:3px solid #2d3748;border-top-color:#f6c90e;border-radius:50%;animation:spin .8s linear infinite;margin:0 auto 14px}
.spinner.hidden{display:none}
@keyframes spin{to{transform:rotate(360deg)}} @keyframes spin{to{transform:rotate(360deg)}}
.status{font-size:14px;color:#a0aec0;letter-spacing:.05em} .status{font-size:13px;color:#a0aec0;margin-bottom:20px;min-height:18px}
table{width:100%;border-collapse:collapse;font-size:12px;margin-bottom:20px;display:none}
td{padding:3px 6px;text-align:left}
td:first-child{color:#718096;width:55%}
.ok{color:#68d391}
.run{color:#f6c90e}
.fail{color:#fc8181}
.dim{color:#4a5568}
.btn{background:#1a202c;color:#a0aec0;border:1px solid #2d3748;padding:7px 18px;font-size:12px;cursor:pointer;font-family:inherit;display:none}
.btn:hover{border-color:#718096;color:#e2e8f0}
</style> </style>
</head> </head>
<body> <body>
<div style="text-align:center"> <div class="wrap">
<div class="logo"> ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗ <div class="logo"> ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗
██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝ ██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝
█████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗ █████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗
██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝ ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝
███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗ ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗
╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝</div> ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝</div>
<div class="spinner"></div> <div class="subtitle">Hardware Audit LiveCD</div>
<div class="status" id="s">Starting up...</div> <div class="spinner" id="spin"></div>
<div class="status" id="st">Connecting to bee-web...</div>
<table id="tbl"></table>
<button class="btn" id="btn" onclick="go()">Open app now</button>
</div> </div>
<script> <script>
function probe(){ (function(){
fetch('/api/ready',{cache:'no-store'}) var gone = false;
.then(function(r){ function go(){ if(!gone){gone=true;window.location.replace('/');} }
if(r.ok){window.location.replace('/');}
else{setTimeout(probe,1000);} function icon(s){
if(s==='active') return '<span class="ok">&#9679; active</span>';
if(s==='failed') return '<span class="fail">&#10005; failed</span>';
if(s==='activating'||s==='reloading') return '<span class="run">&#9675; starting</span>';
if(s==='inactive') return '<span class="dim">&#9675; inactive</span>';
return '<span class="dim">'+s+'</span>';
}
function allSettled(svcs){
for(var i=0;i<svcs.length;i++){
var s=svcs[i].state;
if(s!=='active'&&s!=='failed'&&s!=='inactive') return false;
}
return true;
}
var pollTimer=null;
function pollServices(){
fetch('/api/services',{cache:'no-store'})
.then(function(r){return r.json();})
.then(function(svcs){
if(!svcs||!svcs.length) return;
var tbl=document.getElementById('tbl');
tbl.style.display='';
var html='';
for(var i=0;i<svcs.length;i++)
html+='<tr><td>'+svcs[i].name+'</td><td>'+icon(svcs[i].state)+'</td></tr>';
tbl.innerHTML=html;
if(allSettled(svcs)){
clearInterval(pollTimer);
document.getElementById('spin').className='spinner hidden';
document.getElementById('st').textContent='Ready \u2014 opening...';
setTimeout(go,800);
}
}) })
.catch(function(){setTimeout(probe,1000);}); .catch(function(){});
}
function probe(){
fetch('/healthz',{cache:'no-store'})
.then(function(r){
if(r.ok){
document.getElementById('st').textContent='bee-web running \u2014 checking services...';
document.getElementById('btn').style.display='';
pollServices();
pollTimer=setInterval(pollServices,1500);
} else {
document.getElementById('st').textContent='bee-web starting (status '+r.status+')...';
setTimeout(probe,500);
}
})
.catch(function(){
document.getElementById('st').textContent='Waiting for bee-web to start...';
setTimeout(probe,500);
});
} }
probe(); probe();
})();
</script> </script>
</body> </body>
</html>` </html>`

View File

@@ -184,15 +184,15 @@ func TestChartDataFromSamplesIncludesGPUClockCharts(t *testing.T) {
{ {
Timestamp: time.Now().Add(-2 * time.Minute), Timestamp: time.Now().Add(-2 * time.Minute),
GPUs: []platform.GPUMetricRow{ GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, ClockMHz: 1400, MemClockMHz: 2600}, {GPUIndex: 0, ClockMHz: 1400},
{GPUIndex: 3, ClockMHz: 1500, MemClockMHz: 2800}, {GPUIndex: 3, ClockMHz: 1500},
}, },
}, },
{ {
Timestamp: time.Now().Add(-1 * time.Minute), Timestamp: time.Now().Add(-1 * time.Minute),
GPUs: []platform.GPUMetricRow{ GPUs: []platform.GPUMetricRow{
{GPUIndex: 0, ClockMHz: 1410, MemClockMHz: 2610}, {GPUIndex: 0, ClockMHz: 1410},
{GPUIndex: 3, ClockMHz: 1510, MemClockMHz: 2810}, {GPUIndex: 3, ClockMHz: 1510},
}, },
}, },
} }
@@ -210,20 +210,6 @@ func TestChartDataFromSamplesIncludesGPUClockCharts(t *testing.T) {
if got := datasets[1][1]; got != 1510 { if got := datasets[1][1]; got != 1510 {
t.Fatalf("GPU 3 core clock=%v want 1510", got) t.Fatalf("GPU 3 core clock=%v want 1510", got)
} }
datasets, names, _, title, _, _, ok = chartDataFromSamples("gpu-all-memclock", samples)
if !ok {
t.Fatal("gpu-all-memclock returned ok=false")
}
if title != "GPU Memory Clock" {
t.Fatalf("title=%q", title)
}
if len(names) != 2 || names[0] != "GPU 0" || names[1] != "GPU 3" {
t.Fatalf("names=%v", names)
}
if got := datasets[0][0]; got != 2600 {
t.Fatalf("GPU 0 memory clock=%v want 2600", got)
}
} }
func TestNormalizePowerSeriesHoldsLastPositive(t *testing.T) { func TestNormalizePowerSeriesHoldsLastPositive(t *testing.T) {
@@ -256,10 +242,10 @@ func TestRenderMetricsUsesBufferedChartRefresh(t *testing.T) {
if !strings.Contains(body, `/api/metrics/chart/gpu-all-clock.svg`) { if !strings.Contains(body, `/api/metrics/chart/gpu-all-clock.svg`) {
t.Fatalf("metrics page should include GPU core clock chart: %s", body) t.Fatalf("metrics page should include GPU core clock chart: %s", body)
} }
if !strings.Contains(body, `/api/metrics/chart/gpu-all-memclock.svg`) { if strings.Contains(body, `/api/metrics/chart/gpu-all-memclock.svg`) {
t.Fatalf("metrics page should include GPU memory clock chart: %s", body) t.Fatalf("metrics page should not include GPU memory clock chart: %s", body)
} }
if !strings.Contains(body, `renderGPUOverviewCards(indices)`) { if !strings.Contains(body, `renderGPUOverviewCards(indices, names)`) {
t.Fatalf("metrics page should build per-GPU chart cards dynamically: %s", body) t.Fatalf("metrics page should build per-GPU chart cards dynamically: %s", body)
} }
} }
@@ -543,7 +529,7 @@ func TestRootShowsRunAuditButtonWhenSnapshotMissing(t *testing.T) {
t.Fatalf("status=%d", rec.Code) t.Fatalf("status=%d", rec.Code)
} }
body := rec.Body.String() body := rec.Body.String()
if !strings.Contains(body, `Run Audit`) { if !strings.Contains(body, `onclick="auditModalRun()">Run audit</button>`) {
t.Fatalf("dashboard missing run audit button: %s", body) t.Fatalf("dashboard missing run audit button: %s", body)
} }
if strings.Contains(body, `No audit data`) { if strings.Contains(body, `No audit data`) {
@@ -551,6 +537,18 @@ func TestRootShowsRunAuditButtonWhenSnapshotMissing(t *testing.T) {
} }
} }
func TestReadyIsOKWhenAuditPathIsUnset(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/ready", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
if strings.TrimSpace(rec.Body.String()) != "ready" {
t.Fatalf("body=%q want ready", rec.Body.String())
}
}
func TestAuditPageRendersViewerFrameAndActions(t *testing.T) { func TestAuditPageRendersViewerFrameAndActions(t *testing.T) {
dir := t.TempDir() dir := t.TempDir()
path := filepath.Join(dir, "audit.json") path := filepath.Join(dir, "audit.json")
@@ -573,7 +571,7 @@ func TestAuditPageRendersViewerFrameAndActions(t *testing.T) {
} }
} }
func TestTasksPageRendersLogModalAndPaginationControls(t *testing.T) { func TestTasksPageRendersOpenLinksAndPaginationControls(t *testing.T) {
handler := NewHandler(HandlerOptions{}) handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder() rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/tasks", nil)) handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/tasks", nil))
@@ -581,8 +579,8 @@ func TestTasksPageRendersLogModalAndPaginationControls(t *testing.T) {
t.Fatalf("status=%d", rec.Code) t.Fatalf("status=%d", rec.Code)
} }
body := rec.Body.String() body := rec.Body.String()
if !strings.Contains(body, `id="task-log-overlay"`) { if !strings.Contains(body, `Open a task to view its saved logs and charts.`) {
t.Fatalf("tasks page missing log modal overlay: %s", body) t.Fatalf("tasks page missing task report hint: %s", body)
} }
if !strings.Contains(body, `_taskPageSize = 50`) { if !strings.Contains(body, `_taskPageSize = 50`) {
t.Fatalf("tasks page missing pagination size config: %s", body) t.Fatalf("tasks page missing pagination size config: %s", body)
@@ -638,7 +636,27 @@ func TestBenchmarkPageRendersGPUSelectionControls(t *testing.T) {
} }
} }
func TestBurnPageRendersOfficialNVIDIADCGMAndNCCLInterconnectLabel(t *testing.T) { func TestValidatePageRendersNvidiaTargetedStressCard(t *testing.T) {
handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/validate", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
for _, needle := range []string{
`NVIDIA GPU Targeted Stress`,
`nvidia-targeted-stress`,
`controlled NVIDIA DCGM load`,
`<code>dcgmi diag targeted_stress</code>`,
} {
if !strings.Contains(body, needle) {
t.Fatalf("validate page missing %q: %s", needle, body)
}
}
}
func TestBurnPageRendersGoalBasedNVIDIACards(t *testing.T) {
handler := NewHandler(HandlerOptions{}) handler := NewHandler(HandlerOptions{})
rec := httptest.NewRecorder() rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/burn", nil)) handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/burn", nil))
@@ -647,10 +665,11 @@ func TestBurnPageRendersOfficialNVIDIADCGMAndNCCLInterconnectLabel(t *testing.T)
} }
body := rec.Body.String() body := rec.Body.String()
for _, needle := range []string{ for _, needle := range []string{
`DCGM Diagnostics (Official NVIDIA)`, `NVIDIA Max Compute Load`,
`NCCL all_reduce_perf (Interconnect)`, `dcgmproftester`,
`DCGM is the official NVIDIA diagnostic path`, `targeted_stress remain in <a href="/validate">Validate</a>`,
`burn-gpu-dcgm`, `NVIDIA Interconnect Test (NCCL all_reduce_perf)`,
`id="burn-gpu-list"`,
} { } {
if !strings.Contains(body, needle) { if !strings.Contains(body, needle) {
t.Fatalf("burn page missing %q: %s", needle, body) t.Fatalf("burn page missing %q: %s", needle, body)
@@ -658,37 +677,154 @@ func TestBurnPageRendersOfficialNVIDIADCGMAndNCCLInterconnectLabel(t *testing.T)
} }
} }
func TestTasksPageRendersScrollableLogModal(t *testing.T) { func TestTaskDetailPageRendersSavedReport(t *testing.T) {
dir := t.TempDir() dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
exportDir := filepath.Join(dir, "export") exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil { reportDir := filepath.Join(exportDir, "tasks", "task-1_cpu_sat_done")
if err := os.MkdirAll(reportDir, 0755); err != nil {
t.Fatal(err) t.Fatal(err)
} }
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z"}`), 0644); err != nil { reportPath := filepath.Join(reportDir, "report.html")
if err := os.WriteFile(reportPath, []byte(`<div class="card"><div class="card-head">Task Report</div><div class="card-body">saved report</div></div>`), 0644); err != nil {
t.Fatal(err) t.Fatal(err)
} }
handler := NewHandler(HandlerOptions{ globalQueue.mu.Lock()
Title: "Bee Hardware Audit", origTasks := globalQueue.tasks
AuditPath: path, globalQueue.tasks = []*Task{{
ExportDir: exportDir, ID: "task-1",
Name: "CPU SAT",
Target: "cpu",
Status: TaskDone,
CreatedAt: time.Now(),
ArtifactsDir: reportDir,
ReportHTMLPath: reportPath,
}}
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = origTasks
globalQueue.mu.Unlock()
}) })
handler := NewHandler(HandlerOptions{Title: "Bee Hardware Audit", ExportDir: exportDir})
rec := httptest.NewRecorder() rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/tasks", nil)) handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/tasks/task-1", nil))
if rec.Code != http.StatusOK { if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code) t.Fatalf("status=%d", rec.Code)
} }
body := rec.Body.String() body := rec.Body.String()
if !strings.Contains(body, `height:calc(100vh - 32px)`) { if !strings.Contains(body, `saved report`) {
t.Fatalf("tasks page missing bounded log modal height: %s", body) t.Fatalf("task detail page missing saved report: %s", body)
} }
if !strings.Contains(body, `flex:1;min-height:0;overflow:hidden`) { if !strings.Contains(body, `Back to Tasks`) {
t.Fatalf("tasks page missing log modal overflow guard: %s", body) t.Fatalf("task detail page missing back link: %s", body)
} }
if !strings.Contains(body, `height:100%;min-height:0;overflow:auto`) { }
t.Fatalf("tasks page missing scrollable log wrapper: %s", body)
func TestTaskDetailPageRendersCancelForRunningTask(t *testing.T) {
globalQueue.mu.Lock()
origTasks := globalQueue.tasks
globalQueue.tasks = []*Task{{
ID: "task-live-1",
Name: "CPU SAT",
Target: "cpu",
Status: TaskRunning,
CreatedAt: time.Now(),
}}
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = origTasks
globalQueue.mu.Unlock()
})
handler := NewHandler(HandlerOptions{Title: "Bee Hardware Audit"})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/tasks/task-live-1", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d", rec.Code)
}
body := rec.Body.String()
if !strings.Contains(body, `Cancel</button>`) {
t.Fatalf("task detail page missing cancel button: %s", body)
}
if !strings.Contains(body, `function cancelTaskDetail(id)`) {
t.Fatalf("task detail page missing cancel handler: %s", body)
}
if !strings.Contains(body, `/api/tasks/' + id + '/cancel`) {
t.Fatalf("task detail page missing cancel endpoint: %s", body)
}
if !strings.Contains(body, `id="task-live-charts"`) {
t.Fatalf("task detail page missing live charts container: %s", body)
}
if !strings.Contains(body, `/api/tasks/' + taskId + '/charts`) {
t.Fatalf("task detail page missing live charts index endpoint: %s", body)
}
}
func TestTaskChartSVGUsesTaskTimeWindow(t *testing.T) {
dir := t.TempDir()
metricsPath := filepath.Join(dir, "metrics.db")
prevMetricsPath := taskReportMetricsDBPath
taskReportMetricsDBPath = metricsPath
t.Cleanup(func() { taskReportMetricsDBPath = prevMetricsPath })
db, err := openMetricsDB(metricsPath)
if err != nil {
t.Fatalf("openMetricsDB: %v", err)
}
base := time.Now().UTC()
samples := []platform.LiveMetricSample{
{Timestamp: base.Add(-3 * time.Minute), PowerW: 100},
{Timestamp: base.Add(-2 * time.Minute), PowerW: 200},
{Timestamp: base.Add(-1 * time.Minute), PowerW: 300},
}
for _, sample := range samples {
if err := db.Write(sample); err != nil {
t.Fatalf("Write: %v", err)
}
}
_ = db.Close()
started := base.Add(-2*time.Minute - 5*time.Second)
done := base.Add(-1*time.Minute + 5*time.Second)
globalQueue.mu.Lock()
origTasks := globalQueue.tasks
globalQueue.tasks = []*Task{{
ID: "task-chart-1",
Name: "Power Window",
Target: "cpu",
Status: TaskDone,
CreatedAt: started.Add(-10 * time.Second),
StartedAt: &started,
DoneAt: &done,
}}
globalQueue.mu.Unlock()
t.Cleanup(func() {
globalQueue.mu.Lock()
globalQueue.tasks = origTasks
globalQueue.mu.Unlock()
})
handler := NewHandler(HandlerOptions{Title: "Bee Hardware Audit"})
req := httptest.NewRequest(http.MethodGet, "/api/tasks/task-chart-1/chart/server-power.svg", nil)
req.SetPathValue("id", "task-chart-1")
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, req)
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
body := rec.Body.String()
if !strings.Contains(body, "System Power") {
t.Fatalf("task chart missing expected title: %s", body)
}
if !strings.Contains(body, "min 200") {
t.Fatalf("task chart stats should start from in-window sample: %s", body)
}
if strings.Contains(body, "min 100") {
t.Fatalf("task chart should not include pre-task sample in stats: %s", body)
} }
} }
@@ -813,3 +949,98 @@ func TestRuntimeHealthEndpointReturnsJSON(t *testing.T) {
t.Fatalf("body=%q want %q", strings.TrimSpace(rec.Body.String()), body) t.Fatalf("body=%q want %q", strings.TrimSpace(rec.Body.String()), body)
} }
} }
func TestDashboardRendersRuntimeHealthTable(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "audit.json")
exportDir := filepath.Join(dir, "export")
if err := os.MkdirAll(exportDir, 0755); err != nil {
t.Fatal(err)
}
if err := os.WriteFile(path, []byte(`{"collected_at":"2026-03-15T00:00:00Z","hardware":{"board":{"serial_number":"SERIAL-1"}}}`), 0644); err != nil {
t.Fatal(err)
}
health := `{
"status":"PARTIAL",
"checked_at":"2026-03-16T10:00:00Z",
"export_dir":"/tmp/export",
"driver_ready":true,
"cuda_ready":false,
"network_status":"PARTIAL",
"issues":[
{"code":"dhcp_partial","description":"At least one interface did not obtain IPv4 connectivity."},
{"code":"cuda_runtime_not_ready","description":"CUDA runtime is not ready for GPU SAT."}
],
"tools":[
{"name":"dmidecode","ok":true},
{"name":"nvidia-smi","ok":false}
],
"services":[
{"name":"bee-web","status":"active"},
{"name":"bee-nvidia","status":"inactive"}
]
}`
if err := os.WriteFile(filepath.Join(exportDir, "runtime-health.json"), []byte(health), 0644); err != nil {
t.Fatal(err)
}
componentStatus := `[
{
"component_key":"cpu:all",
"status":"Warning",
"error_summary":"cpu SAT: FAILED",
"history":[{"at":"2026-03-16T10:00:00Z","status":"Warning","source":"sat:cpu","detail":"cpu SAT: FAILED"}]
},
{
"component_key":"memory:all",
"status":"OK",
"history":[{"at":"2026-03-16T10:01:00Z","status":"OK","source":"sat:memory","detail":"memory SAT: OK"}]
},
{
"component_key":"storage:nvme0n1",
"status":"Critical",
"error_summary":"storage SAT: FAILED",
"history":[{"at":"2026-03-16T10:02:00Z","status":"Critical","source":"sat:storage","detail":"storage SAT: FAILED"}]
},
{
"component_key":"pcie:gpu:nvidia",
"status":"Warning",
"error_summary":"nvidia SAT: FAILED",
"history":[{"at":"2026-03-16T10:03:00Z","status":"Warning","source":"sat:nvidia","detail":"nvidia SAT: FAILED"}]
}
]`
if err := os.WriteFile(filepath.Join(exportDir, "component-status.json"), []byte(componentStatus), 0644); err != nil {
t.Fatal(err)
}
handler := NewHandler(HandlerOptions{AuditPath: path, ExportDir: exportDir})
rec := httptest.NewRecorder()
handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
if rec.Code != http.StatusOK {
t.Fatalf("status=%d body=%s", rec.Code, rec.Body.String())
}
body := rec.Body.String()
for _, needle := range []string{
`Runtime Health`,
`<th>Check</th><th>Status</th><th>Source</th><th>Issue</th>`,
`Export Directory`,
`Network`,
`NVIDIA/AMD Driver`,
`CUDA / ROCm`,
`Required Utilities`,
`Bee Services`,
`<td>CPU</td>`,
`<td>Memory</td>`,
`<td>Storage</td>`,
`<td>GPU</td>`,
`CUDA runtime is not ready for GPU SAT.`,
`Missing: nvidia-smi`,
`bee-nvidia=inactive`,
`cpu SAT: FAILED`,
`storage SAT: FAILED`,
`sat:nvidia`,
} {
if !strings.Contains(body, needle) {
t.Fatalf("dashboard missing %q: %s", needle, body)
}
}
}

View File

@@ -0,0 +1,207 @@
package webui
import (
"encoding/json"
"fmt"
"html"
"net/http"
"os"
"strings"
"time"
"bee/audit/internal/platform"
)
func (h *handler) handleTaskPage(w http.ResponseWriter, r *http.Request) {
id := r.PathValue("id")
task, ok := globalQueue.findByID(id)
if !ok {
http.NotFound(w, r)
return
}
snapshot := *task
body := renderTaskDetailPage(h.opts, snapshot)
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "text/html; charset=utf-8")
_, _ = w.Write([]byte(body))
}
func (h *handler) handleAPITaskChartsIndex(w http.ResponseWriter, r *http.Request) {
task, samples, _, _, ok := h.taskSamplesForRequest(r)
if !ok {
http.NotFound(w, r)
return
}
type taskChartIndexEntry struct {
Title string `json:"title"`
File string `json:"file"`
}
entries := make([]taskChartIndexEntry, 0)
for _, spec := range taskChartSpecsForSamples(samples) {
title, _, ok := renderTaskChartSVG(spec.Path, samples, taskTimelineForTask(task))
if !ok {
continue
}
entries = append(entries, taskChartIndexEntry{Title: title, File: spec.File})
}
w.Header().Set("Cache-Control", "no-store")
w.Header().Set("Content-Type", "application/json; charset=utf-8")
_ = json.NewEncoder(w).Encode(entries)
}
func (h *handler) handleAPITaskChartSVG(w http.ResponseWriter, r *http.Request) {
task, samples, _, _, ok := h.taskSamplesForRequest(r)
if !ok {
http.NotFound(w, r)
return
}
file := strings.TrimPrefix(r.URL.Path, "/api/tasks/"+task.ID+"/chart/")
path, ok := taskChartPathFromFile(file)
if !ok {
http.NotFound(w, r)
return
}
title, buf, hasData := renderTaskChartSVG(path, samples, taskTimelineForTask(task))
if !hasData || len(buf) == 0 || strings.TrimSpace(title) == "" {
http.Error(w, "metrics history unavailable", http.StatusServiceUnavailable)
return
}
w.Header().Set("Content-Type", "image/svg+xml")
w.Header().Set("Cache-Control", "no-store")
_, _ = w.Write(buf)
}
func renderTaskDetailPage(opts HandlerOptions, task Task) string {
title := task.Name
if strings.TrimSpace(title) == "" {
title = task.ID
}
var body strings.Builder
body.WriteString(`<div style="display:flex;align-items:center;gap:12px;margin-bottom:16px;flex-wrap:wrap">`)
body.WriteString(`<a class="btn btn-secondary btn-sm" href="/tasks">Back to Tasks</a>`)
if task.Status == TaskRunning || task.Status == TaskPending {
body.WriteString(`<button class="btn btn-danger btn-sm" onclick="cancelTaskDetail('` + html.EscapeString(task.ID) + `')">Cancel</button>`)
}
body.WriteString(`<span style="font-size:12px;color:var(--muted)">Artifacts are saved in the task folder under <code>./tasks</code>.</span>`)
body.WriteString(`</div>`)
if report := loadTaskReportFragment(task); report != "" {
body.WriteString(report)
} else {
body.WriteString(`<div class="card"><div class="card-head">Task Summary</div><div class="card-body">`)
body.WriteString(`<div style="font-size:18px;font-weight:700">` + html.EscapeString(title) + `</div>`)
body.WriteString(`<div style="margin-top:8px">` + renderTaskStatusBadge(task.Status) + `</div>`)
if strings.TrimSpace(task.ErrMsg) != "" {
body.WriteString(`<div style="margin-top:8px;color:var(--crit-fg)">` + html.EscapeString(task.ErrMsg) + `</div>`)
}
body.WriteString(`</div></div>`)
}
if task.Status == TaskRunning || task.Status == TaskPending {
body.WriteString(`<div class="card"><div class="card-head">Live Charts</div><div class="card-body">`)
body.WriteString(`<div id="task-live-charts" style="display:flex;flex-direction:column;gap:16px;color:var(--muted);font-size:13px">Loading charts...</div>`)
body.WriteString(`</div></div>`)
body.WriteString(`<div class="card"><div class="card-head">Live Logs</div><div class="card-body">`)
body.WriteString(`<div id="task-live-log" class="terminal" style="max-height:none;white-space:pre-wrap">Connecting...</div>`)
body.WriteString(`</div></div>`)
body.WriteString(`<script>
function cancelTaskDetail(id) {
fetch('/api/tasks/' + id + '/cancel', {method:'POST'}).then(function(){ window.location.reload(); });
}
function loadTaskLiveCharts(taskId) {
fetch('/api/tasks/' + taskId + '/charts').then(function(r){ return r.json(); }).then(function(charts){
const host = document.getElementById('task-live-charts');
if (!host) return;
if (!Array.isArray(charts) || charts.length === 0) {
host.innerHTML = 'Waiting for metric samples...';
return;
}
host.innerHTML = charts.map(function(chart) {
return '<div class="card" style="margin:0">' +
'<div class="card-head">' + chart.title + '</div>' +
'<div class="card-body" style="padding:12px">' +
'<img data-task-chart="1" data-base-src="/api/tasks/' + taskId + '/chart/' + chart.file + '" src="/api/tasks/' + taskId + '/chart/' + chart.file + '?t=' + Date.now() + '" style="width:100%;display:block;border-radius:6px" alt="' + chart.title + '">' +
'</div></div>';
}).join('');
}).catch(function(){
const host = document.getElementById('task-live-charts');
if (host) host.innerHTML = 'Task charts are unavailable.';
});
}
function refreshTaskLiveCharts() {
document.querySelectorAll('img[data-task-chart="1"]').forEach(function(img){
const base = img.dataset.baseSrc;
if (!base) return;
img.src = base + '?t=' + Date.now();
});
}
var _taskDetailES = new EventSource('/api/tasks/` + html.EscapeString(task.ID) + `/stream');
var _taskDetailTerm = document.getElementById('task-live-log');
var _taskChartTimer = null;
_taskDetailES.onopen = function(){ _taskDetailTerm.textContent = ''; };
_taskDetailES.onmessage = function(e){ _taskDetailTerm.textContent += e.data + "\n"; _taskDetailTerm.scrollTop = _taskDetailTerm.scrollHeight; };
_taskDetailES.addEventListener('done', function(){ if (_taskChartTimer) clearInterval(_taskChartTimer); _taskDetailES.close(); setTimeout(function(){ window.location.reload(); }, 1000); });
_taskDetailES.onerror = function(){ if (_taskChartTimer) clearInterval(_taskChartTimer); _taskDetailES.close(); };
loadTaskLiveCharts('` + html.EscapeString(task.ID) + `');
_taskChartTimer = setInterval(function(){ refreshTaskLiveCharts(); loadTaskLiveCharts('` + html.EscapeString(task.ID) + `'); }, 2000);
</script>`)
}
return layoutHead(opts.Title+" — "+title) +
layoutNav("tasks", opts.BuildLabel) +
`<div class="main"><div class="topbar"><h1>` + html.EscapeString(title) + `</h1></div><div class="content">` +
body.String() +
`</div></div></body></html>`
}
func loadTaskReportFragment(task Task) string {
if strings.TrimSpace(task.ReportHTMLPath) == "" {
return ""
}
data, err := os.ReadFile(task.ReportHTMLPath)
if err != nil || len(data) == 0 {
return ""
}
return string(data)
}
func taskArtifactDownloadLink(task Task, absPath string) string {
if strings.TrimSpace(absPath) == "" {
return ""
}
return fmt.Sprintf(`/export/file?path=%s`, absPath)
}
func (h *handler) taskSamplesForRequest(r *http.Request) (Task, []platform.LiveMetricSample, time.Time, time.Time, bool) {
id := r.PathValue("id")
taskPtr, ok := globalQueue.findByID(id)
if !ok {
return Task{}, nil, time.Time{}, time.Time{}, false
}
task := *taskPtr
start, end := taskTimeWindow(&task)
samples, err := loadTaskMetricSamples(start, end)
if err != nil {
return task, nil, start, end, true
}
return task, samples, start, end, true
}
func taskTimelineForTask(task Task) []chartTimelineSegment {
start, end := taskTimeWindow(&task)
return []chartTimelineSegment{{Start: start, End: end, Active: true}}
}
func taskChartPathFromFile(file string) (string, bool) {
file = strings.TrimSpace(file)
for _, spec := range taskDashboardChartSpecs {
if spec.File == file {
return spec.Path, true
}
}
if strings.HasPrefix(file, "gpu-") && strings.HasSuffix(file, "-overview.svg") {
id := strings.TrimSuffix(strings.TrimPrefix(file, "gpu-"), "-overview.svg")
return "gpu/" + id + "-overview", true
}
return "", false
}

View File

@@ -0,0 +1,289 @@
package webui
import (
"encoding/json"
"fmt"
"html"
"os"
"path/filepath"
"sort"
"strings"
"time"
"bee/audit/internal/platform"
)
var taskReportMetricsDBPath = metricsDBPath
type taskReport struct {
ID string `json:"id"`
Name string `json:"name"`
Target string `json:"target"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
StartedAt *time.Time `json:"started_at,omitempty"`
DoneAt *time.Time `json:"done_at,omitempty"`
DurationSec int `json:"duration_sec,omitempty"`
Error string `json:"error,omitempty"`
LogFile string `json:"log_file,omitempty"`
Charts []taskReportChart `json:"charts,omitempty"`
GeneratedAt time.Time `json:"generated_at"`
}
type taskReportChart struct {
Title string `json:"title"`
File string `json:"file"`
}
type taskChartSpec struct {
Path string
File string
}
var taskDashboardChartSpecs = []taskChartSpec{
{Path: "server-load", File: "server-load.svg"},
{Path: "server-temp-cpu", File: "server-temp-cpu.svg"},
{Path: "server-temp-ambient", File: "server-temp-ambient.svg"},
{Path: "server-power", File: "server-power.svg"},
{Path: "server-fans", File: "server-fans.svg"},
{Path: "gpu-all-load", File: "gpu-all-load.svg"},
{Path: "gpu-all-memload", File: "gpu-all-memload.svg"},
{Path: "gpu-all-clock", File: "gpu-all-clock.svg"},
{Path: "gpu-all-power", File: "gpu-all-power.svg"},
{Path: "gpu-all-temp", File: "gpu-all-temp.svg"},
}
func taskChartSpecsForSamples(samples []platform.LiveMetricSample) []taskChartSpec {
specs := make([]taskChartSpec, 0, len(taskDashboardChartSpecs)+len(taskGPUIndices(samples)))
specs = append(specs, taskDashboardChartSpecs...)
for _, idx := range taskGPUIndices(samples) {
specs = append(specs, taskChartSpec{
Path: fmt.Sprintf("gpu/%d-overview", idx),
File: fmt.Sprintf("gpu-%d-overview.svg", idx),
})
}
return specs
}
func writeTaskReportArtifacts(t *Task) error {
if t == nil {
return nil
}
ensureTaskReportPaths(t)
if strings.TrimSpace(t.ArtifactsDir) == "" {
return nil
}
if err := os.MkdirAll(t.ArtifactsDir, 0755); err != nil {
return err
}
start, end := taskTimeWindow(t)
samples, _ := loadTaskMetricSamples(start, end)
charts, inlineCharts := writeTaskCharts(t.ArtifactsDir, start, end, samples)
logText := ""
if data, err := os.ReadFile(t.LogPath); err == nil {
logText = string(data)
}
report := taskReport{
ID: t.ID,
Name: t.Name,
Target: t.Target,
Status: t.Status,
CreatedAt: t.CreatedAt,
StartedAt: t.StartedAt,
DoneAt: t.DoneAt,
DurationSec: taskElapsedSec(t, reportDoneTime(t)),
Error: t.ErrMsg,
LogFile: filepath.Base(t.LogPath),
Charts: charts,
GeneratedAt: time.Now().UTC(),
}
if err := writeJSONFile(t.ReportJSONPath, report); err != nil {
return err
}
return os.WriteFile(t.ReportHTMLPath, []byte(renderTaskReportFragment(report, inlineCharts, logText)), 0644)
}
func reportDoneTime(t *Task) time.Time {
if t != nil && t.DoneAt != nil && !t.DoneAt.IsZero() {
return *t.DoneAt
}
return time.Now()
}
func taskTimeWindow(t *Task) (time.Time, time.Time) {
if t == nil {
now := time.Now().UTC()
return now, now
}
start := t.CreatedAt.UTC()
if t.StartedAt != nil && !t.StartedAt.IsZero() {
start = t.StartedAt.UTC()
}
end := time.Now().UTC()
if t.DoneAt != nil && !t.DoneAt.IsZero() {
end = t.DoneAt.UTC()
}
if end.Before(start) {
end = start
}
return start, end
}
func loadTaskMetricSamples(start, end time.Time) ([]platform.LiveMetricSample, error) {
db, err := openMetricsDB(taskReportMetricsDBPath)
if err != nil {
return nil, err
}
defer db.Close()
return db.LoadBetween(start, end)
}
func writeTaskCharts(dir string, start, end time.Time, samples []platform.LiveMetricSample) ([]taskReportChart, map[string]string) {
if len(samples) == 0 {
return nil, nil
}
timeline := []chartTimelineSegment{{Start: start, End: end, Active: true}}
var charts []taskReportChart
inline := make(map[string]string)
for _, spec := range taskChartSpecsForSamples(samples) {
title, svg, ok := renderTaskChartSVG(spec.Path, samples, timeline)
if !ok || len(svg) == 0 {
continue
}
path := filepath.Join(dir, spec.File)
if err := os.WriteFile(path, svg, 0644); err != nil {
continue
}
charts = append(charts, taskReportChart{Title: title, File: spec.File})
inline[spec.File] = string(svg)
}
return charts, inline
}
func renderTaskChartSVG(path string, samples []platform.LiveMetricSample, timeline []chartTimelineSegment) (string, []byte, bool) {
if idx, sub, ok := parseGPUChartPath(path); ok && sub == "overview" {
buf, hasData, err := renderGPUOverviewChartSVG(idx, samples, timeline)
if err != nil || !hasData {
return "", nil, false
}
return gpuDisplayLabel(idx) + " Overview", buf, true
}
datasets, names, labels, title, yMin, yMax, ok := chartDataFromSamples(path, samples)
if !ok {
return "", nil, false
}
buf, err := renderMetricChartSVG(
title,
labels,
sampleTimes(samples),
datasets,
names,
yMin,
yMax,
chartCanvasHeightForPath(path, len(names)),
timeline,
)
if err != nil {
return "", nil, false
}
return title, buf, true
}
func taskGPUIndices(samples []platform.LiveMetricSample) []int {
seen := map[int]bool{}
var out []int
for _, s := range samples {
for _, g := range s.GPUs {
if seen[g.GPUIndex] {
continue
}
seen[g.GPUIndex] = true
out = append(out, g.GPUIndex)
}
}
sort.Ints(out)
return out
}
func writeJSONFile(path string, v any) error {
data, err := json.MarshalIndent(v, "", " ")
if err != nil {
return err
}
return os.WriteFile(path, data, 0644)
}
func renderTaskReportFragment(report taskReport, charts map[string]string, logText string) string {
var b strings.Builder
b.WriteString(`<div class="card"><div class="card-head">Task Report</div><div class="card-body">`)
b.WriteString(`<div class="grid2">`)
b.WriteString(`<div><div style="font-size:12px;color:var(--muted);margin-bottom:6px">Task</div><div style="font-size:16px;font-weight:700">` + html.EscapeString(report.Name) + `</div>`)
b.WriteString(`<div style="font-size:13px;color:var(--muted)">` + html.EscapeString(report.Target) + `</div></div>`)
b.WriteString(`<div><div style="font-size:12px;color:var(--muted);margin-bottom:6px">Status</div><div>` + renderTaskStatusBadge(report.Status) + `</div>`)
if strings.TrimSpace(report.Error) != "" {
b.WriteString(`<div style="margin-top:8px;font-size:13px;color:var(--crit-fg)">` + html.EscapeString(report.Error) + `</div>`)
}
b.WriteString(`</div></div>`)
b.WriteString(`<div style="margin-top:14px;font-size:13px;color:var(--muted)">`)
b.WriteString(`Started: ` + formatTaskTime(report.StartedAt, report.CreatedAt) + ` | Finished: ` + formatTaskTime(report.DoneAt, time.Time{}) + ` | Duration: ` + formatTaskDuration(report.DurationSec))
b.WriteString(`</div></div></div>`)
if len(report.Charts) > 0 {
for _, chart := range report.Charts {
b.WriteString(`<div class="card"><div class="card-head">` + html.EscapeString(chart.Title) + `</div><div class="card-body" style="padding:12px">`)
b.WriteString(charts[chart.File])
b.WriteString(`</div></div>`)
}
} else {
b.WriteString(`<div class="alert alert-info">No metric samples were captured during this task window.</div>`)
}
b.WriteString(`<div class="card"><div class="card-head">Logs</div><div class="card-body">`)
b.WriteString(`<div class="terminal" style="max-height:none;white-space:pre-wrap">` + html.EscapeString(strings.TrimSpace(logText)) + `</div>`)
b.WriteString(`</div></div>`)
return b.String()
}
func renderTaskStatusBadge(status string) string {
className := map[string]string{
TaskRunning: "badge-ok",
TaskPending: "badge-unknown",
TaskDone: "badge-ok",
TaskFailed: "badge-err",
TaskCancelled: "badge-unknown",
}[status]
if className == "" {
className = "badge-unknown"
}
label := strings.TrimSpace(status)
if label == "" {
label = "unknown"
}
return `<span class="badge ` + className + `">` + html.EscapeString(label) + `</span>`
}
func formatTaskTime(ts *time.Time, fallback time.Time) string {
if ts != nil && !ts.IsZero() {
return ts.Local().Format("2006-01-02 15:04:05")
}
if !fallback.IsZero() {
return fallback.Local().Format("2006-01-02 15:04:05")
}
return "n/a"
}
func formatTaskDuration(sec int) string {
if sec <= 0 {
return "n/a"
}
if sec < 60 {
return fmt.Sprintf("%ds", sec)
}
if sec < 3600 {
return fmt.Sprintf("%dm %02ds", sec/60, sec%60)
}
return fmt.Sprintf("%dh %02dm %02ds", sec/3600, (sec%3600)/60, sec%60)
}

View File

@@ -30,23 +30,29 @@ const (
// taskNames maps target → human-readable name for validate (SAT) runs. // taskNames maps target → human-readable name for validate (SAT) runs.
var taskNames = map[string]string{ var taskNames = map[string]string{
"nvidia": "NVIDIA SAT", "nvidia": "NVIDIA SAT",
"nvidia-benchmark": "NVIDIA Benchmark", "nvidia-targeted-stress": "NVIDIA Targeted Stress Validate (dcgmi diag targeted_stress)",
"nvidia-stress": "NVIDIA GPU Stress", "nvidia-benchmark": "NVIDIA Benchmark",
"memory": "Memory SAT", "nvidia-compute": "NVIDIA Max Compute Load (dcgmproftester)",
"storage": "Storage SAT", "nvidia-targeted-power": "NVIDIA Targeted Power (dcgmi diag targeted_power)",
"cpu": "CPU SAT", "nvidia-pulse": "NVIDIA Pulse Test (dcgmi diag pulse_test)",
"amd": "AMD GPU SAT", "nvidia-interconnect": "NVIDIA Interconnect Test (NCCL all_reduce_perf)",
"amd-mem": "AMD GPU MEM Integrity", "nvidia-bandwidth": "NVIDIA Bandwidth Test (NVBandwidth)",
"amd-bandwidth": "AMD GPU MEM Bandwidth", "nvidia-stress": "NVIDIA GPU Stress",
"amd-stress": "AMD GPU Burn-in", "memory": "Memory SAT",
"memory-stress": "Memory Burn-in", "storage": "Storage SAT",
"sat-stress": "SAT Stress (stressapptest)", "cpu": "CPU SAT",
"platform-stress": "Platform Thermal Cycling", "amd": "AMD GPU SAT",
"audit": "Audit", "amd-mem": "AMD GPU MEM Integrity",
"support-bundle": "Support Bundle", "amd-bandwidth": "AMD GPU MEM Bandwidth",
"install": "Install to Disk", "amd-stress": "AMD GPU Burn-in",
"install-to-ram": "Install to RAM", "memory-stress": "Memory Burn-in",
"sat-stress": "SAT Stress (stressapptest)",
"platform-stress": "Platform Thermal Cycling",
"audit": "Audit",
"support-bundle": "Support Bundle",
"install": "Install to Disk",
"install-to-ram": "Install to RAM",
} }
// burnNames maps target → human-readable name when a burn profile is set. // burnNames maps target → human-readable name when a burn profile is set.
@@ -86,17 +92,20 @@ func taskDisplayName(target, profile, loader string) string {
// Task represents one unit of work in the queue. // Task represents one unit of work in the queue.
type Task struct { type Task struct {
ID string `json:"id"` ID string `json:"id"`
Name string `json:"name"` Name string `json:"name"`
Target string `json:"target"` Target string `json:"target"`
Priority int `json:"priority"` Priority int `json:"priority"`
Status string `json:"status"` Status string `json:"status"`
CreatedAt time.Time `json:"created_at"` CreatedAt time.Time `json:"created_at"`
StartedAt *time.Time `json:"started_at,omitempty"` StartedAt *time.Time `json:"started_at,omitempty"`
DoneAt *time.Time `json:"done_at,omitempty"` DoneAt *time.Time `json:"done_at,omitempty"`
ElapsedSec int `json:"elapsed_sec,omitempty"` ElapsedSec int `json:"elapsed_sec,omitempty"`
ErrMsg string `json:"error,omitempty"` ErrMsg string `json:"error,omitempty"`
LogPath string `json:"log_path,omitempty"` LogPath string `json:"log_path,omitempty"`
ArtifactsDir string `json:"artifacts_dir,omitempty"`
ReportJSONPath string `json:"report_json_path,omitempty"`
ReportHTMLPath string `json:"report_html_path,omitempty"`
// runtime fields (not serialised) // runtime fields (not serialised)
job *jobState job *jobState
@@ -120,59 +129,70 @@ type taskParams struct {
} }
type persistedTask struct { type persistedTask struct {
ID string `json:"id"` ID string `json:"id"`
Name string `json:"name"` Name string `json:"name"`
Target string `json:"target"` Target string `json:"target"`
Priority int `json:"priority"` Priority int `json:"priority"`
Status string `json:"status"` Status string `json:"status"`
CreatedAt time.Time `json:"created_at"` CreatedAt time.Time `json:"created_at"`
StartedAt *time.Time `json:"started_at,omitempty"` StartedAt *time.Time `json:"started_at,omitempty"`
DoneAt *time.Time `json:"done_at,omitempty"` DoneAt *time.Time `json:"done_at,omitempty"`
ErrMsg string `json:"error,omitempty"` ErrMsg string `json:"error,omitempty"`
LogPath string `json:"log_path,omitempty"` LogPath string `json:"log_path,omitempty"`
Params taskParams `json:"params,omitempty"` ArtifactsDir string `json:"artifacts_dir,omitempty"`
ReportJSONPath string `json:"report_json_path,omitempty"`
ReportHTMLPath string `json:"report_html_path,omitempty"`
Params taskParams `json:"params,omitempty"`
} }
type burnPreset struct { type burnPreset struct {
NvidiaDiag int
DurationSec int DurationSec int
} }
func resolveBurnPreset(profile string) burnPreset { func resolveBurnPreset(profile string) burnPreset {
switch profile { switch profile {
case "overnight": case "overnight":
return burnPreset{NvidiaDiag: 4, DurationSec: 8 * 60 * 60} return burnPreset{DurationSec: 8 * 60 * 60}
case "acceptance": case "acceptance":
return burnPreset{NvidiaDiag: 3, DurationSec: 60 * 60} return burnPreset{DurationSec: 60 * 60}
default: default:
return burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60} return burnPreset{DurationSec: 5 * 60}
} }
} }
func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions { func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions {
acceptanceCycles := []platform.PlatformStressCycle{
{LoadSec: 85, IdleSec: 5},
{LoadSec: 80, IdleSec: 10},
{LoadSec: 55, IdleSec: 5},
{LoadSec: 60, IdleSec: 0},
{LoadSec: 100, IdleSec: 10},
{LoadSec: 145, IdleSec: 15},
{LoadSec: 190, IdleSec: 20},
{LoadSec: 235, IdleSec: 25},
{LoadSec: 280, IdleSec: 30},
{LoadSec: 325, IdleSec: 35},
{LoadSec: 370, IdleSec: 40},
{LoadSec: 415, IdleSec: 45},
{LoadSec: 460, IdleSec: 50},
{LoadSec: 510, IdleSec: 0},
}
switch profile { switch profile {
case "overnight": case "overnight":
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{ cycles := make([]platform.PlatformStressCycle, 0, len(acceptanceCycles)*8)
{LoadSec: 600, IdleSec: 120}, for range 8 {
{LoadSec: 600, IdleSec: 60}, cycles = append(cycles, acceptanceCycles...)
{LoadSec: 600, IdleSec: 30}, }
{LoadSec: 600, IdleSec: 120}, return platform.PlatformStressOptions{Cycles: cycles}
{LoadSec: 600, IdleSec: 60},
{LoadSec: 600, IdleSec: 30},
{LoadSec: 600, IdleSec: 120},
{LoadSec: 600, IdleSec: 60},
}}
case "acceptance": case "acceptance":
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{ return platform.PlatformStressOptions{Cycles: acceptanceCycles}
{LoadSec: 300, IdleSec: 60},
{LoadSec: 300, IdleSec: 30},
{LoadSec: 300, IdleSec: 60},
{LoadSec: 300, IdleSec: 30},
}}
default: // smoke default: // smoke
return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{ return platform.PlatformStressOptions{Cycles: []platform.PlatformStressCycle{
{LoadSec: 90, IdleSec: 60}, {LoadSec: 85, IdleSec: 5},
{LoadSec: 90, IdleSec: 30}, {LoadSec: 80, IdleSec: 10},
{LoadSec: 55, IdleSec: 5},
{LoadSec: 60, IdleSec: 0},
}} }}
} }
} }
@@ -238,6 +258,7 @@ func (q *taskQueue) enqueue(t *Task) {
q.prune() q.prune()
q.persistLocked() q.persistLocked()
q.mu.Unlock() q.mu.Unlock()
taskSerialEvent(t, "queued")
select { select {
case q.trigger <- struct{}{}: case q.trigger <- struct{}{}:
default: default:
@@ -415,7 +436,7 @@ func (q *taskQueue) worker() {
t.StartedAt = &now t.StartedAt = &now
t.DoneAt = nil t.DoneAt = nil
t.ErrMsg = "" t.ErrMsg = ""
j := newTaskJobState(t.LogPath) j := newTaskJobState(t.LogPath, taskSerialPrefix(t))
t.job = j t.job = j
batch = append(batch, t) batch = append(batch, t)
} }
@@ -482,8 +503,6 @@ func (q *taskQueue) executeTask(t *Task, j *jobState, ctx context.Context) {
func (q *taskQueue) finalizeTaskRun(t *Task, j *jobState) { func (q *taskQueue) finalizeTaskRun(t *Task, j *jobState) {
q.mu.Lock() q.mu.Lock()
defer q.mu.Unlock()
now := time.Now() now := time.Now()
t.DoneAt = &now t.DoneAt = &now
if t.Status == TaskRunning { if t.Status == TaskRunning {
@@ -495,7 +514,18 @@ func (q *taskQueue) finalizeTaskRun(t *Task, j *jobState) {
t.ErrMsg = "" t.ErrMsg = ""
} }
} }
q.finalizeTaskArtifactPathsLocked(t)
q.persistLocked() q.persistLocked()
q.mu.Unlock()
if err := writeTaskReportArtifacts(t); err != nil {
appendJobLog(t.LogPath, "WARN: task report generation failed: "+err.Error())
}
if t.ErrMsg != "" {
taskSerialEvent(t, "finished with status="+t.Status+" error="+t.ErrMsg)
return
}
taskSerialEvent(t, "finished with status="+t.Status)
} }
// setCPUGovernor writes the given governor to all CPU scaling_governor sysfs files. // setCPUGovernor writes the given governor to all CPU scaling_governor sysfs files.
@@ -536,9 +566,6 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
break break
} }
diagLevel := t.params.DiagLevel diagLevel := t.params.DiagLevel
if t.params.BurnProfile != "" && diagLevel <= 0 {
diagLevel = resolveBurnPreset(t.params.BurnProfile).NvidiaDiag
}
if len(t.params.GPUIndices) > 0 || diagLevel > 0 { if len(t.params.GPUIndices) > 0 || diagLevel > 0 {
result, e := a.RunNvidiaAcceptancePackWithOptions( result, e := a.RunNvidiaAcceptancePackWithOptions(
ctx, "", diagLevel, t.params.GPUIndices, j.append, ctx, "", diagLevel, t.params.GPUIndices, j.append,
@@ -551,6 +578,16 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
} else { } else {
archive, err = a.RunNvidiaAcceptancePack("", j.append) archive, err = a.RunNvidiaAcceptancePack("", j.append)
} }
case "nvidia-targeted-stress":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if dur <= 0 {
dur = 300
}
archive, err = a.RunNvidiaTargetedStressValidatePack(ctx, "", dur, t.params.GPUIndices, j.append)
case "nvidia-benchmark": case "nvidia-benchmark":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
@@ -563,6 +600,56 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
ExcludeGPUIndices: t.params.ExcludeGPUIndices, ExcludeGPUIndices: t.params.ExcludeGPUIndices,
RunNCCL: t.params.RunNCCL, RunNCCL: t.params.RunNCCL,
}, j.append) }, j.append)
case "nvidia-compute":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, j.append)
case "nvidia-targeted-power":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = a.RunNvidiaTargetedPowerPack(ctx, "", dur, t.params.GPUIndices, j.append)
case "nvidia-pulse":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = a.RunNvidiaPulseTestPack(ctx, "", dur, t.params.GPUIndices, j.append)
case "nvidia-bandwidth":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
archive, err = a.RunNvidiaBandwidthPack(ctx, "", t.params.GPUIndices, j.append)
case "nvidia-interconnect":
if a == nil {
err = fmt.Errorf("app not configured")
break
}
dur := t.params.Duration
if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
}
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{
DurationSec: dur,
Loader: platform.NvidiaStressLoaderNCCL,
GPUIndices: t.params.GPUIndices,
}, j.append)
case "nvidia-stress": case "nvidia-stress":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
@@ -777,6 +864,7 @@ func (h *handler) handleAPITasksCancel(w http.ResponseWriter, r *http.Request) {
now := time.Now() now := time.Now()
t.DoneAt = &now t.DoneAt = &now
globalQueue.persistLocked() globalQueue.persistLocked()
taskSerialEvent(t, "finished with status="+t.Status)
writeJSON(w, map[string]string{"status": "cancelled"}) writeJSON(w, map[string]string{"status": "cancelled"})
case TaskRunning: case TaskRunning:
if t.job != nil { if t.job != nil {
@@ -786,6 +874,7 @@ func (h *handler) handleAPITasksCancel(w http.ResponseWriter, r *http.Request) {
now := time.Now() now := time.Now()
t.DoneAt = &now t.DoneAt = &now
globalQueue.persistLocked() globalQueue.persistLocked()
taskSerialEvent(t, "finished with status="+t.Status)
writeJSON(w, map[string]string{"status": "cancelled"}) writeJSON(w, map[string]string{"status": "cancelled"})
default: default:
writeError(w, http.StatusConflict, "task is not running or pending") writeError(w, http.StatusConflict, "task is not running or pending")
@@ -826,6 +915,7 @@ func (h *handler) handleAPITasksCancelAll(w http.ResponseWriter, _ *http.Request
case TaskPending: case TaskPending:
t.Status = TaskCancelled t.Status = TaskCancelled
t.DoneAt = &now t.DoneAt = &now
taskSerialEvent(t, "finished with status="+t.Status)
n++ n++
case TaskRunning: case TaskRunning:
if t.job != nil { if t.job != nil {
@@ -833,6 +923,7 @@ func (h *handler) handleAPITasksCancelAll(w http.ResponseWriter, _ *http.Request
} }
t.Status = TaskCancelled t.Status = TaskCancelled
t.DoneAt = &now t.DoneAt = &now
taskSerialEvent(t, "finished with status="+t.Status)
n++ n++
} }
} }
@@ -851,6 +942,7 @@ func (h *handler) handleAPITasksKillWorkers(w http.ResponseWriter, _ *http.Reque
case TaskPending: case TaskPending:
t.Status = TaskCancelled t.Status = TaskCancelled
t.DoneAt = &now t.DoneAt = &now
taskSerialEvent(t, "finished with status="+t.Status)
cancelled++ cancelled++
case TaskRunning: case TaskRunning:
if t.job != nil { if t.job != nil {
@@ -858,6 +950,7 @@ func (h *handler) handleAPITasksKillWorkers(w http.ResponseWriter, _ *http.Reque
} }
t.Status = TaskCancelled t.Status = TaskCancelled
t.DoneAt = &now t.DoneAt = &now
taskSerialEvent(t, "finished with status="+t.Status)
cancelled++ cancelled++
} }
} }
@@ -921,10 +1014,10 @@ func (h *handler) handleAPITasksStream(w http.ResponseWriter, r *http.Request) {
} }
func (q *taskQueue) assignTaskLogPathLocked(t *Task) { func (q *taskQueue) assignTaskLogPathLocked(t *Task) {
if t.LogPath != "" || q.logsDir == "" || t.ID == "" { if q.logsDir == "" || t.ID == "" {
return return
} }
t.LogPath = filepath.Join(q.logsDir, t.ID+".log") q.ensureTaskArtifactPathsLocked(t)
} }
func (q *taskQueue) loadLocked() { func (q *taskQueue) loadLocked() {
@@ -941,17 +1034,20 @@ func (q *taskQueue) loadLocked() {
} }
for _, pt := range persisted { for _, pt := range persisted {
t := &Task{ t := &Task{
ID: pt.ID, ID: pt.ID,
Name: pt.Name, Name: pt.Name,
Target: pt.Target, Target: pt.Target,
Priority: pt.Priority, Priority: pt.Priority,
Status: pt.Status, Status: pt.Status,
CreatedAt: pt.CreatedAt, CreatedAt: pt.CreatedAt,
StartedAt: pt.StartedAt, StartedAt: pt.StartedAt,
DoneAt: pt.DoneAt, DoneAt: pt.DoneAt,
ErrMsg: pt.ErrMsg, ErrMsg: pt.ErrMsg,
LogPath: pt.LogPath, LogPath: pt.LogPath,
params: pt.Params, ArtifactsDir: pt.ArtifactsDir,
ReportJSONPath: pt.ReportJSONPath,
ReportHTMLPath: pt.ReportHTMLPath,
params: pt.Params,
} }
q.assignTaskLogPathLocked(t) q.assignTaskLogPathLocked(t)
if t.Status == TaskRunning { if t.Status == TaskRunning {
@@ -982,17 +1078,20 @@ func (q *taskQueue) persistLocked() {
state := make([]persistedTask, 0, len(q.tasks)) state := make([]persistedTask, 0, len(q.tasks))
for _, t := range q.tasks { for _, t := range q.tasks {
state = append(state, persistedTask{ state = append(state, persistedTask{
ID: t.ID, ID: t.ID,
Name: t.Name, Name: t.Name,
Target: t.Target, Target: t.Target,
Priority: t.Priority, Priority: t.Priority,
Status: t.Status, Status: t.Status,
CreatedAt: t.CreatedAt, CreatedAt: t.CreatedAt,
StartedAt: t.StartedAt, StartedAt: t.StartedAt,
DoneAt: t.DoneAt, DoneAt: t.DoneAt,
ErrMsg: t.ErrMsg, ErrMsg: t.ErrMsg,
LogPath: t.LogPath, LogPath: t.LogPath,
Params: t.params, ArtifactsDir: t.ArtifactsDir,
ReportJSONPath: t.ReportJSONPath,
ReportHTMLPath: t.ReportHTMLPath,
Params: t.params,
}) })
} }
data, err := json.MarshalIndent(state, "", " ") data, err := json.MarshalIndent(state, "", " ")
@@ -1023,3 +1122,88 @@ func taskElapsedSec(t *Task, now time.Time) int {
} }
return int(end.Sub(start).Round(time.Second) / time.Second) return int(end.Sub(start).Round(time.Second) / time.Second)
} }
func taskFolderStatus(status string) string {
status = strings.TrimSpace(strings.ToLower(status))
switch status {
case TaskRunning, TaskDone, TaskFailed, TaskCancelled:
return status
default:
return TaskPending
}
}
func sanitizeTaskFolderPart(s string) string {
s = strings.TrimSpace(strings.ToLower(s))
if s == "" {
return "task"
}
var b strings.Builder
lastDash := false
for _, r := range s {
isAlnum := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9')
if isAlnum {
b.WriteRune(r)
lastDash = false
continue
}
if !lastDash {
b.WriteByte('-')
lastDash = true
}
}
out := strings.Trim(b.String(), "-")
if out == "" {
return "task"
}
return out
}
func taskArtifactsDir(root string, t *Task, status string) string {
if strings.TrimSpace(root) == "" || t == nil {
return ""
}
return filepath.Join(root, fmt.Sprintf("%s_%s_%s", t.ID, sanitizeTaskFolderPart(t.Name), taskFolderStatus(status)))
}
func ensureTaskReportPaths(t *Task) {
if t == nil || strings.TrimSpace(t.ArtifactsDir) == "" {
return
}
if t.LogPath == "" || filepath.Base(t.LogPath) == "task.log" {
t.LogPath = filepath.Join(t.ArtifactsDir, "task.log")
}
t.ReportJSONPath = filepath.Join(t.ArtifactsDir, "report.json")
t.ReportHTMLPath = filepath.Join(t.ArtifactsDir, "report.html")
}
func (q *taskQueue) ensureTaskArtifactPathsLocked(t *Task) {
if t == nil || strings.TrimSpace(q.logsDir) == "" || strings.TrimSpace(t.ID) == "" {
return
}
if strings.TrimSpace(t.ArtifactsDir) == "" {
t.ArtifactsDir = taskArtifactsDir(q.logsDir, t, t.Status)
}
if t.ArtifactsDir != "" {
_ = os.MkdirAll(t.ArtifactsDir, 0755)
}
ensureTaskReportPaths(t)
}
func (q *taskQueue) finalizeTaskArtifactPathsLocked(t *Task) {
if t == nil || strings.TrimSpace(q.logsDir) == "" || strings.TrimSpace(t.ID) == "" {
return
}
q.ensureTaskArtifactPathsLocked(t)
dstDir := taskArtifactsDir(q.logsDir, t, t.Status)
if dstDir == "" {
return
}
if t.ArtifactsDir != "" && t.ArtifactsDir != dstDir {
if _, err := os.Stat(dstDir); err != nil {
_ = os.Rename(t.ArtifactsDir, dstDir)
}
t.ArtifactsDir = dstDir
}
ensureTaskReportPaths(t)
}

View File

@@ -2,6 +2,7 @@ package webui
import ( import (
"context" "context"
"encoding/json"
"net/http" "net/http"
"net/http/httptest" "net/http/httptest"
"os" "os"
@@ -12,6 +13,7 @@ import (
"time" "time"
"bee/audit/internal/app" "bee/audit/internal/app"
"bee/audit/internal/platform"
) )
func TestTaskQueuePersistsAndRecoversPendingTasks(t *testing.T) { func TestTaskQueuePersistsAndRecoversPendingTasks(t *testing.T) {
@@ -248,15 +250,133 @@ func TestHandleAPITasksStreamPendingTaskStartsSSEImmediately(t *testing.T) {
t.Fatalf("stream did not emit queued status promptly, body=%q", rec.Body.String()) t.Fatalf("stream did not emit queued status promptly, body=%q", rec.Body.String())
} }
func TestFinalizeTaskRunCreatesReportFolderAndArtifacts(t *testing.T) {
dir := t.TempDir()
metricsPath := filepath.Join(dir, "metrics.db")
prevMetricsPath := taskReportMetricsDBPath
taskReportMetricsDBPath = metricsPath
t.Cleanup(func() { taskReportMetricsDBPath = prevMetricsPath })
db, err := openMetricsDB(metricsPath)
if err != nil {
t.Fatalf("openMetricsDB: %v", err)
}
base := time.Now().UTC().Add(-45 * time.Second)
if err := db.Write(platform.LiveMetricSample{
Timestamp: base,
CPULoadPct: 42,
MemLoadPct: 35,
PowerW: 510,
}); err != nil {
t.Fatalf("Write: %v", err)
}
_ = db.Close()
q := &taskQueue{
statePath: filepath.Join(dir, "tasks-state.json"),
logsDir: filepath.Join(dir, "tasks"),
trigger: make(chan struct{}, 1),
}
if err := os.MkdirAll(q.logsDir, 0755); err != nil {
t.Fatal(err)
}
started := time.Now().UTC().Add(-90 * time.Second)
task := &Task{
ID: "task-1",
Name: "CPU SAT",
Target: "cpu",
Status: TaskRunning,
CreatedAt: started.Add(-10 * time.Second),
StartedAt: &started,
}
q.assignTaskLogPathLocked(task)
appendJobLog(task.LogPath, "line-1")
job := newTaskJobState(task.LogPath)
job.finish("")
q.finalizeTaskRun(task, job)
if task.Status != TaskDone {
t.Fatalf("status=%q want %q", task.Status, TaskDone)
}
if !strings.Contains(filepath.Base(task.ArtifactsDir), "_done") {
t.Fatalf("artifacts dir=%q", task.ArtifactsDir)
}
if _, err := os.Stat(task.ReportJSONPath); err != nil {
t.Fatalf("report json: %v", err)
}
if _, err := os.Stat(task.ReportHTMLPath); err != nil {
t.Fatalf("report html: %v", err)
}
var report taskReport
data, err := os.ReadFile(task.ReportJSONPath)
if err != nil {
t.Fatalf("ReadFile(report.json): %v", err)
}
if err := json.Unmarshal(data, &report); err != nil {
t.Fatalf("Unmarshal(report.json): %v", err)
}
if report.ID != task.ID || report.Status != TaskDone {
t.Fatalf("report=%+v", report)
}
if len(report.Charts) == 0 {
t.Fatalf("expected charts in report, got none")
}
}
func TestTaskLifecycleMirrorsToSerialConsole(t *testing.T) {
var lines []string
prev := taskSerialWriteLine
taskSerialWriteLine = func(line string) { lines = append(lines, line) }
t.Cleanup(func() { taskSerialWriteLine = prev })
dir := t.TempDir()
q := &taskQueue{
statePath: filepath.Join(dir, "tasks-state.json"),
logsDir: filepath.Join(dir, "tasks"),
trigger: make(chan struct{}, 1),
}
task := &Task{
ID: "task-serial-1",
Name: "CPU SAT",
Target: "cpu",
Status: TaskPending,
CreatedAt: time.Now().UTC(),
}
q.enqueue(task)
started := time.Now().UTC()
task.Status = TaskRunning
task.StartedAt = &started
job := newTaskJobState(task.LogPath, taskSerialPrefix(task))
job.append("Starting CPU SAT...")
job.append("CPU stress duration: 60s")
job.finish("")
q.finalizeTaskRun(task, job)
joined := strings.Join(lines, "\n")
for _, needle := range []string{
"queued",
"Starting CPU SAT...",
"CPU stress duration: 60s",
"finished with status=done",
} {
if !strings.Contains(joined, needle) {
t.Fatalf("serial mirror missing %q in %q", needle, joined)
}
}
}
func TestResolveBurnPreset(t *testing.T) { func TestResolveBurnPreset(t *testing.T) {
tests := []struct { tests := []struct {
profile string profile string
want burnPreset want burnPreset
}{ }{
{profile: "smoke", want: burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60}}, {profile: "smoke", want: burnPreset{DurationSec: 5 * 60}},
{profile: "acceptance", want: burnPreset{NvidiaDiag: 3, DurationSec: 60 * 60}}, {profile: "acceptance", want: burnPreset{DurationSec: 60 * 60}},
{profile: "overnight", want: burnPreset{NvidiaDiag: 4, DurationSec: 8 * 60 * 60}}, {profile: "overnight", want: burnPreset{DurationSec: 8 * 60 * 60}},
{profile: "", want: burnPreset{NvidiaDiag: 1, DurationSec: 5 * 60}}, {profile: "", want: burnPreset{DurationSec: 5 * 60}},
} }
for _, tc := range tests { for _, tc := range tests {
if got := resolveBurnPreset(tc.profile); got != tc.want { if got := resolveBurnPreset(tc.profile); got != tc.want {

View File

@@ -32,7 +32,7 @@ lb config noauto \
--memtest memtest86+ \ --memtest memtest86+ \
--iso-volume "EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \ --iso-volume "EASY_BEE_${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--iso-application "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \ --iso-application "EASY-BEE-${BEE_GPU_VENDOR_UPPER:-NVIDIA}" \
--bootappend-live "boot=live components video=1920x1080 console=tty0 console=ttyS0,115200n8 loglevel=6 systemd.show_status=1 username=bee user-fullname=Bee modprobe.blacklist=nouveau,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_generic,soundcore" \ --bootappend-live "boot=live components video=1920x1080 console=tty0 console=ttyS0,115200n8 loglevel=3 systemd.show_status=1 username=bee user-fullname=Bee modprobe.blacklist=nouveau,snd_hda_intel,snd_hda_codec_realtek,snd_hda_codec_generic,soundcore" \
--apt-recommends false \ --apt-recommends false \
--chroot-squashfs-compression-type zstd \ --chroot-squashfs-compression-type zstd \
"${@}" "${@}"

View File

@@ -302,6 +302,12 @@ memtest_fail() {
return 0 return 0
} }
nvidia_runtime_fail() {
msg="$1"
echo "ERROR: ${msg}" >&2
exit 1
}
iso_memtest_present() { iso_memtest_present() {
iso_path="$1" iso_path="$1"
iso_files="$(mktemp)" iso_files="$(mktemp)"
@@ -439,6 +445,44 @@ validate_iso_memtest() {
echo "=== memtest validation OK ===" echo "=== memtest validation OK ==="
} }
validate_iso_nvidia_runtime() {
iso_path="$1"
[ "$BEE_GPU_VENDOR" = "nvidia" ] || return 0
echo "=== validating NVIDIA runtime in ISO ==="
[ -f "$iso_path" ] || nvidia_runtime_fail "ISO not found for NVIDIA runtime validation: $iso_path"
require_iso_reader "$iso_path" >/dev/null 2>&1 || nvidia_runtime_fail "ISO reader unavailable for NVIDIA runtime validation"
command -v unsquashfs >/dev/null 2>&1 || nvidia_runtime_fail "unsquashfs is required for NVIDIA runtime validation"
squashfs_tmp="$(mktemp)"
squashfs_list="$(mktemp)"
iso_read_member "$iso_path" live/filesystem.squashfs "$squashfs_tmp" || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "failed to extract live/filesystem.squashfs from ISO"
}
unsquashfs -ll "$squashfs_tmp" > "$squashfs_list" 2>/dev/null || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "failed to inspect filesystem.squashfs from ISO"
}
grep -Eq 'usr/bin/dcgmi$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "dcgmi missing from final NVIDIA ISO"
}
grep -Eq 'usr/bin/nv-hostengine$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "nv-hostengine missing from final NVIDIA ISO"
}
grep -Eq 'usr/bin/dcgmproftester([0-9]+)?$' "$squashfs_list" || {
rm -f "$squashfs_tmp" "$squashfs_list"
nvidia_runtime_fail "dcgmproftester missing from final NVIDIA ISO"
}
rm -f "$squashfs_tmp" "$squashfs_list"
echo "=== NVIDIA runtime validation OK ==="
}
append_memtest_grub_entry() { append_memtest_grub_entry() {
grub_cfg="$1" grub_cfg="$1"
[ -f "$grub_cfg" ] || return 1 [ -f "$grub_cfg" ] || return 1
@@ -1144,6 +1188,7 @@ if [ -f "$ISO_RAW" ]; then
fi fi
fi fi
validate_iso_memtest "$ISO_RAW" validate_iso_memtest "$ISO_RAW"
validate_iso_nvidia_runtime "$ISO_RAW"
cp "$ISO_RAW" "$ISO_OUT" cp "$ISO_RAW" "$ISO_OUT"
echo "" echo ""
echo "=== done (${BEE_GPU_VENDOR}) ===" echo "=== done (${BEE_GPU_VENDOR}) ==="

View File

@@ -7,6 +7,7 @@ echo " █████╗ ███████║███████╗ ╚
echo " ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝" echo " ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝"
echo " ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗" echo " ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗"
echo " ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝" echo " ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝"
echo " Hardware Audit LiveCD"
echo "" echo ""
menuentry "EASY-BEE" { menuentry "EASY-BEE" {

View File

@@ -31,6 +31,7 @@ systemctl enable bee-audit.service
systemctl enable bee-web.service systemctl enable bee-web.service
systemctl enable bee-sshsetup.service systemctl enable bee-sshsetup.service
systemctl enable bee-selfheal.timer systemctl enable bee-selfheal.timer
systemctl enable bee-boot-status.service
systemctl enable ssh.service systemctl enable ssh.service
systemctl enable lightdm.service 2>/dev/null || true systemctl enable lightdm.service 2>/dev/null || true
systemctl enable qemu-guest-agent.service 2>/dev/null || true systemctl enable qemu-guest-agent.service 2>/dev/null || true
@@ -59,7 +60,8 @@ chmod +x /usr/local/bin/bee-sshsetup 2>/dev/null || true
chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true
chmod +x /usr/local/bin/bee 2>/dev/null || true chmod +x /usr/local/bin/bee 2>/dev/null || true
chmod +x /usr/local/bin/bee-log-run 2>/dev/null || true chmod +x /usr/local/bin/bee-log-run 2>/dev/null || true
chmod +x /usr/local/bin/bee-selfheal 2>/dev/null || true chmod +x /usr/local/bin/bee-selfheal 2>/dev/null || true
chmod +x /usr/local/bin/bee-boot-status 2>/dev/null || true
if [ "$GPU_VENDOR" = "nvidia" ]; then if [ "$GPU_VENDOR" = "nvidia" ]; then
chmod +x /usr/local/bin/bee-nvidia-load 2>/dev/null || true chmod +x /usr/local/bin/bee-nvidia-load 2>/dev/null || true
chmod +x /usr/local/bin/bee-gpu-burn 2>/dev/null || true chmod +x /usr/local/bin/bee-gpu-burn 2>/dev/null || true

View File

@@ -0,0 +1,72 @@
#!/bin/sh
# 9001-wallpaper.hook.chroot — generate /usr/share/bee/wallpaper.png inside chroot
set -e
echo "=== generating bee wallpaper ==="
mkdir -p /usr/share/bee
python3 - <<'PYEOF'
from PIL import Image, ImageDraw, ImageFont
import os
W, H = 1920, 1080
LOGO = """\
\u2588\u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2557 \u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557
\u2588\u2588\u2554\u2550\u2550\u2550\u2550\u255d\u2588\u2588\u2554\u2550\u2550\u2588\u2588\u2557\u2588\u2588\u2554\u2550\u2550\u2550\u2550\u255d\u255a\u2588\u2588\u2557 \u2588\u2588\u2554\u255d \u2588\u2588\u2554\u2550\u2550\u2588\u2588\u2557\u2588\u2588\u2554\u2550\u2550\u2550\u2550\u255d\u2588\u2588\u2554\u2550\u2550\u2550\u2550\u255d
\u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2551\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557 \u255a\u2588\u2588\u2588\u2588\u2554\u255d \u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2588\u2588\u2588\u2588\u2554\u255d\u2588\u2588\u2588\u2588\u2588\u2557 \u2588\u2588\u2588\u2588\u2588\u2557
\u2588\u2588\u2554\u2550\u2550\u255d \u2588\u2588\u2554\u2550\u2550\u2588\u2588\u2551\u255a\u2550\u2550\u2550\u2550\u2588\u2588\u2551 \u255a\u2588\u2588\u2554\u255d \u255a\u2550\u2550\u2550\u2550\u255d\u2588\u2588\u2554\u2550\u2550\u2588\u2588\u2557\u2588\u2588\u2554\u2550\u2550\u255d \u2588\u2588\u2554\u2550\u2550\u255d
\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2551 \u2588\u2588\u2551\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2551 \u2588\u2588\u2551 \u2588\u2588\u2588\u2588\u2588\u2588\u2554\u255d\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2557
\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u255d\u255a\u2550\u255d \u255a\u2550\u255d\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u255d \u255a\u2550\u255d \u255a\u2550\u2550\u2550\u2550\u2550\u255d \u255a\u2550\u2550\u2550\u2550\u2550\u2550\u255d\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u255d
Hardware Audit LiveCD"""
# Find a monospace font that supports box-drawing characters
FONT_CANDIDATES = [
'/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf',
'/usr/share/fonts/truetype/liberation/LiberationMono-Regular.ttf',
'/usr/share/fonts/truetype/freefont/FreeMono.ttf',
'/usr/share/fonts/truetype/noto/NotoMono-Regular.ttf',
]
font_path = None
for p in FONT_CANDIDATES:
if os.path.exists(p):
font_path = p
break
SIZE = 22
if font_path:
font_logo = ImageFont.truetype(font_path, SIZE)
font_sub = ImageFont.truetype(font_path, SIZE)
else:
font_logo = ImageFont.load_default()
font_sub = font_logo
img = Image.new('RGB', (W, H), (0, 0, 0))
draw = ImageDraw.Draw(img)
# Measure logo block
lines = LOGO.split('\n')
bbox = draw.textbbox((0, 0), LOGO, font=font_logo)
text_w = bbox[2] - bbox[0]
text_h = bbox[3] - bbox[1]
x = (W - text_w) // 2
y = (H - text_h) // 2
# Draw logo lines: first 6 in amber, last line (subtitle) dimmer
logo_lines = lines[:6]
sub_line = lines[6] if len(lines) > 6 else ''
cy = y
for line in logo_lines:
draw.text((x, cy), line, font=font_logo, fill=(0xf6, 0xc9, 0x0e))
cy += SIZE + 2
cy += 8
if sub_line:
draw.text((x, cy), sub_line, font=font_sub, fill=(0x80, 0x68, 0x18))
img.save('/usr/share/bee/wallpaper.png', optimize=True)
print('wallpaper written: /usr/share/bee/wallpaper.png')
PYEOF
echo "=== wallpaper done ==="

View File

@@ -1,6 +1,10 @@
# NVIDIA DCGM (Data Center GPU Manager) — dcgmi diag for acceptance testing. # NVIDIA DCGM (Data Center GPU Manager).
# DCGM 4 is packaged per CUDA major. The image ships NVIDIA driver 590 with CUDA 13 userspace, # Validate uses dcgmi diagnostics; Burn uses dcgmproftester as the official
# so install the CUDA 13 build plus proprietary diagnostic components explicitly. # NVIDIA max-compute recipe. The smoketest/runtime contract treats
# dcgmproftester as required in the LiveCD.
# DCGM 4 is packaged per CUDA major. The image ships NVIDIA driver 590 with
# CUDA 13 userspace, so install the CUDA 13 build plus proprietary components
# explicitly.
datacenter-gpu-manager-4-cuda13=1:%%DCGM_VERSION%% datacenter-gpu-manager-4-cuda13=1:%%DCGM_VERSION%%
datacenter-gpu-manager-4-proprietary=1:%%DCGM_VERSION%% datacenter-gpu-manager-4-proprietary=1:%%DCGM_VERSION%%
datacenter-gpu-manager-4-proprietary-cuda13=1:%%DCGM_VERSION%% datacenter-gpu-manager-4-proprietary-cuda13=1:%%DCGM_VERSION%%

View File

@@ -60,6 +60,8 @@ qrencode
# Local desktop (openbox + chromium kiosk) # Local desktop (openbox + chromium kiosk)
openbox openbox
tint2 tint2
feh
python3-pil
xorg xorg
xterm xterm
chromium chromium

View File

@@ -52,6 +52,31 @@ else
fail "nvidia-smi: NOT FOUND" fail "nvidia-smi: NOT FOUND"
fi fi
if p=$(PATH="/usr/local/bin:$PATH" command -v dcgmi 2>/dev/null); then
ok "dcgmi found: $p"
else
fail "dcgmi: NOT FOUND"
fi
if p=$(PATH="/usr/local/bin:$PATH" command -v nv-hostengine 2>/dev/null); then
ok "nv-hostengine found: $p"
else
fail "nv-hostengine: NOT FOUND"
fi
DCGM_PROFTESTER=""
for tool in dcgmproftester dcgmproftester13 dcgmproftester12 dcgmproftester11; do
if p=$(PATH="/usr/local/bin:$PATH" command -v "$tool" 2>/dev/null); then
DCGM_PROFTESTER="$p"
break
fi
done
if [ -n "$DCGM_PROFTESTER" ]; then
ok "dcgmproftester found: $DCGM_PROFTESTER"
else
fail "dcgmproftester: NOT FOUND"
fi
for tool in bee-gpu-burn bee-john-gpu-stress bee-nccl-gpu-stress all_reduce_perf; do for tool in bee-gpu-burn bee-john-gpu-stress bee-nccl-gpu-stress all_reduce_perf; do
if p=$(PATH="/usr/local/bin:$PATH" command -v "$tool" 2>/dev/null); then if p=$(PATH="/usr/local/bin:$PATH" command -v "$tool" 2>/dev/null); then
ok "$tool found: $p" ok "$tool found: $p"
@@ -60,6 +85,12 @@ for tool in bee-gpu-burn bee-john-gpu-stress bee-nccl-gpu-stress all_reduce_perf
fi fi
done done
if p=$(PATH="/usr/local/bin:$PATH" command -v nvbandwidth 2>/dev/null); then
ok "nvbandwidth found: $p"
else
warn "nvbandwidth: NOT FOUND"
fi
echo "" echo ""
echo "-- NVIDIA modules --" echo "-- NVIDIA modules --"
KO_DIR="/usr/local/lib/nvidia" KO_DIR="/usr/local/lib/nvidia"

View File

@@ -0,0 +1,18 @@
[Unit]
Description=Bee: boot status display
After=systemd-user-sessions.service
Before=getty@tty1.service
[Service]
Type=oneshot
RemainAfterExit=no
ExecStart=/usr/local/bin/bee-boot-status
TTYPath=/dev/tty1
StandardInput=tty
StandardOutput=tty
StandardError=tty
TTYReset=yes
TTYVHangup=yes
[Install]
WantedBy=multi-user.target

View File

@@ -0,0 +1,2 @@
[Unit]
After=bee-boot-status.service

View File

@@ -1,6 +1,4 @@
[Unit] [Unit]
Wants=bee-preflight.service
After=bee-preflight.service
[Service] [Service]
ExecStartPre=/usr/local/bin/bee-display-mode ExecStartPre=/usr/local/bin/bee-display-mode

View File

@@ -0,0 +1,89 @@
#!/bin/sh
# bee-boot-status — boot progress display on tty1.
# Shows live service status until all bee services are done or failed,
# then exits so getty can show the login prompt.
CRITICAL="bee-preflight bee-nvidia bee-audit"
ALL="bee-sshsetup ssh bee-network bee-nvidia bee-preflight bee-audit bee-web"
svc_state() { systemctl is-active "$1.service" 2>/dev/null || echo "inactive"; }
svc_icon() {
case "$(svc_state "$1")" in
active) printf '\033[32m[ OK ]\033[0m' ;;
failed) printf '\033[31m[ FAIL ]\033[0m' ;;
activating) printf '\033[33m[ .. ]\033[0m' ;;
deactivating) printf '\033[33m[ stop ]\033[0m' ;;
inactive) printf '\033[90m[ ]\033[0m' ;;
*) printf '\033[90m[ ? ]\033[0m' ;;
esac
}
svc_detail() {
local svc="$1" state
state="$(svc_state "$svc")"
case "$state" in
failed)
local res
res="$(systemctl show -p Result "$svc.service" 2>/dev/null | cut -d= -f2)"
[ -n "$res" ] && [ "$res" != "success" ] && printf ' \033[31m(%s)\033[0m' "$res"
;;
activating)
local line
line="$(journalctl -u "$svc.service" -n 1 --no-pager --output=cat 2>/dev/null | cut -c1-55)"
[ -n "$line" ] && printf ' \033[90m%s\033[0m' "$line"
;;
esac
}
all_critical_done() {
for svc in $CRITICAL; do
case "$(svc_state "$svc")" in
active|failed|inactive) ;;
*) return 1 ;;
esac
done
return 0
}
while true; do
# move to top-left and clear screen
printf '\033[H\033[2J'
printf '\n'
printf ' \033[33m███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗\033[0m\n'
printf ' \033[33m██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝\033[0m\n'
printf ' \033[33m█████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗\033[0m\n'
printf ' \033[33m██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝\033[0m\n'
printf ' \033[33m███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗\033[0m\n'
printf ' \033[33m╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝\033[0m\n'
printf ' Hardware Audit LiveCD\n'
printf '\n'
for svc in $ALL; do
printf ' %s %-20s%s\n' "$(svc_icon "$svc")" "$svc" "$(svc_detail "$svc")"
done
printf '\n'
# Network
ips="$(ip -4 addr show scope global 2>/dev/null | awk '/inet /{printf " %-16s %s\n", $NF, $2}')"
if [ -n "$ips" ]; then
printf ' \033[1mNetwork:\033[0m\n'
printf '%s\n' "$ips"
printf '\n'
fi
if all_critical_done; then
printf ' \033[1;32mSystem ready.\033[0m Audit is running in the background.\n'
first_ip="$(ip -4 addr show scope global 2>/dev/null | awk '/inet /{print $2}' | cut -d/ -f1 | head -1)"
if [ -n "$first_ip" ]; then
printf ' Web UI: \033[1mhttp://%s/\033[0m\n' "$first_ip"
fi
printf '\n'
sleep 3
break
fi
printf ' \033[90mStarting up...\033[0m\n'
sleep 3
done

View File

@@ -7,16 +7,24 @@ xset s off
xset -dpms xset -dpms
xset s noblank xset s noblank
# Set desktop background.
if [ -f /usr/share/bee/wallpaper.png ]; then
feh --bg-fill /usr/share/bee/wallpaper.png
else
xsetroot -solid '#f6c90e'
fi
tint2 & tint2 &
# Wait up to 120s for bee-web to bind. The web server starts immediately now # Wait up to 60s for bee-web before opening Chromium.
# (audit is deferred), so this should succeed in a few seconds on most hardware. # Without this Chromium gets connection-refused and shows a blank page.
i=0 _i=0
while [ $i -lt 120 ]; do while [ $_i -lt 60 ]; do
if curl -sf http://localhost/healthz >/dev/null 2>&1; then break; fi curl -sf http://localhost/healthz >/dev/null 2>&1 && break
sleep 1 sleep 1
i=$((i+1)) _i=$((_i+1))
done done
unset _i
chromium \ chromium \
--disable-infobars \ --disable-infobars \
@@ -24,7 +32,8 @@ chromium \
--no-first-run \ --no-first-run \
--disable-session-crashed-bubble \ --disable-session-crashed-bubble \
--disable-features=TranslateUI \ --disable-features=TranslateUI \
--user-data-dir=/tmp/bee-chrome \
--start-maximized \ --start-maximized \
http://localhost/ & http://localhost/loading &
exec openbox exec openbox