Compare commits


26 Commits
v6.4 ... v7.8

Author SHA1 Message Date
Mikhail Chusavitin
05c1fde233 Warn on PCIe link speed degradation and collect lspci -vvv in techdump
- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
  and ErrorDescription when current link speed is below maximum negotiated
  speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta
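The degraded-link check described above can be sketched as follows. The helper names come from the commit message, but the rank table, signatures, and messages are assumptions, not the repository's actual code:

```go
package main

import "fmt"

// pcieLinkSpeedRank maps a sysfs/lspci speed string to a PCIe generation
// rank so speeds can be ordered ("2.5 GT/s" = Gen1 ... "32 GT/s" = Gen5).
// Unknown strings rank 0 and are treated as incomparable.
func pcieLinkSpeedRank(speed string) int {
	ranks := map[string]int{
		"2.5 GT/s": 1, "5 GT/s": 2, "8 GT/s": 3,
		"16 GT/s": 4, "32 GT/s": 5, "64 GT/s": 6,
	}
	return ranks[speed]
}

// applyPCIeLinkSpeedWarning flags a device whose negotiated link speed is
// below the maximum the link supports (e.g. Gen1 running in a Gen5 slot).
func applyPCIeLinkSpeedWarning(current, max string) (status, desc string) {
	cur, top := pcieLinkSpeedRank(current), pcieLinkSpeedRank(max)
	if cur > 0 && top > 0 && cur < top {
		return "Warning", fmt.Sprintf("link running at %s, supports %s", current, max)
	}
	return "OK", ""
}

func main() {
	s, d := applyPCIeLinkSpeedWarning("2.5 GT/s", "32 GT/s")
	fmt.Println(s, d)
}
```

Comparing rank integers rather than raw "GT/s" strings avoids false warnings from lexicographic comparison ("16 GT/s" sorts before "2.5 GT/s" as a string).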

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 12:42:17 +03:00
825ef6b98a Add USB export drive and LiveCD-in-RAM checks to Runtime Health
- schema: add ToRAMStatus and USBExportPath fields to RuntimeHealth
- platform/runtime.go: collectToRAMHealth (ok/warning/failed based on
  IsLiveMediaInRAM + toramActive) and collectUSBExportHealth (scans
  /proc/mounts + lsblk for writable USB-backed filesystems)
- pages.go: add USB Export Drive and LiveCD in RAM rows to the health table
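The USB export check reduces to scanning /proc/mounts for writable filesystems on matching block devices. A simplified sketch of the parsing step; the real collectUSBExportHealth also cross-checks lsblk to confirm the device is USB-attached, and the function name here is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// findWritableMounts parses /proc/mounts-style content and returns the
// mount points whose source device starts with devPrefix and whose mount
// options include "rw".
func findWritableMounts(procMounts, devPrefix string) []string {
	var out []string
	for _, line := range strings.Split(procMounts, "\n") {
		f := strings.Fields(line)
		// /proc/mounts fields: device, mountpoint, fstype, options, ...
		if len(f) < 4 || !strings.HasPrefix(f[0], devPrefix) {
			continue
		}
		for _, opt := range strings.Split(f[3], ",") {
			if opt == "rw" {
				out = append(out, f[1])
				break
			}
		}
	}
	return out
}

func main() {
	sample := "/dev/sda1 /mnt/usb vfat rw,relatime 0 0\n/dev/sdb1 /mnt/ro vfat ro 0 0"
	fmt.Println(findWritableMounts(sample, "/dev/sd"))
}
```

Checking individual comma-separated options (not a substring match) matters: "ro,errors=remount-ro" contains "rw" nowhere, but a naive strings.Contains on "rw" would still mishandle options like "norw"-style flags.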

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:05:27 +03:00
ba16021cdb Fix GPU model propagation, export filenames, PSU/service status, and chart perf
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
  nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
  → 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
  activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
  .tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"
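The min-max downsampling mentioned for charts_svg.go is a standard chart-reduction technique: split the series into buckets and keep each bucket's minimum and maximum, in order of occurrence, so transient spikes survive the reduction. A minimal sketch (function name and details are assumed, not taken from the repo):

```go
package main

import "fmt"

// downsampleMinMax reduces a series to at most maxPoints samples while
// preserving extremes: each bucket contributes its min and max value,
// emitted in the order they occur within the bucket.
func downsampleMinMax(series []float64, maxPoints int) []float64 {
	if len(series) <= maxPoints || maxPoints < 2 {
		return series
	}
	buckets := maxPoints / 2
	out := make([]float64, 0, buckets*2)
	for b := 0; b < buckets; b++ {
		lo := b * len(series) / buckets
		hi := (b + 1) * len(series) / buckets
		minV, maxV := series[lo], series[lo]
		minI, maxI := lo, lo
		for i := lo + 1; i < hi; i++ {
			if series[i] < minV {
				minV, minI = series[i], i
			}
			if series[i] > maxV {
				maxV, maxI = series[i], i
			}
		}
		if minI <= maxI {
			out = append(out, minV, maxV)
		} else {
			out = append(out, maxV, minV)
		}
	}
	return out
}

func main() {
	s := make([]float64, 10000)
	for i := range s {
		s[i] = float64(i % 97)
	}
	fmt.Println(len(downsampleMinMax(s, 1400)))
}
```

Plain striding or averaging would smooth away the short power/thermal spikes these charts exist to show; keeping per-bucket extremes is what makes the 1400-point cap safe for diagnostics.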

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:05:27 +03:00
Mikhail Chusavitin
bb1218ddd4 Fix GPU inventory: exclude BMC virtual VGA, show real NVIDIA model names
Two issues:
1. BMC/management VGA chips (e.g. Huawei iBMC Hi171x, ASPEED) were included
   in GPU inventory because shouldIncludePCIeDevice only checked the PCI class,
   not the device name. Added a name-based filter for known BMC/management
   patterns when the class is VGA/display/3d.

2. New NVIDIA GPUs (e.g. RTX PRO 6000 Blackwell, device ID 2bb5) showed as
   "Device 2bb5" because lspci's database lags behind. Added "name" to the
   nvidia-smi query and use it to override dev.Model during enrichment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:57:26 +03:00
Mikhail Chusavitin
65faae8ede Remove hpl from SAT run-all targets — no backend route exists
hpl was listed in baseTargets and stressOnlyTargets, but /api/sat/hpl/run
was never registered, so triggering "Validate one by one" in stress mode
produced a 405 Method Not Allowed response (which is not valid JSON) and
surfaced as an error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:30:32 +03:00
05241f2e0e Redesign dashboard: split Runtime Health and Hardware Summary
- Runtime Health now shows only LiveCD system status (services, tools,
  drivers, network, CUDA/ROCm) — hardware component rows removed
- Hardware Summary now shows server components with readable descriptions
  (model, count×size) and component-status.json health badges
- Add Network Adapters row to Hardware Summary
- SFP module static info (vendor, PN, SN, connector, type, wavelength)
  now collected via ethtool -m regardless of carrier state
- PSU statuses from IPMI audit written to component-status.json so PSU
  badge shows actual status after first audit instead of UNKNOWN

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:41:23 +03:00
Mikhail Chusavitin
c1690a084b Fix app tests that mutate global defaults 2026-04-09 15:28:25 +03:00
Mikhail Chusavitin
9481ca2805 Add staged NVIDIA burn ramp-up mode 2026-04-09 15:21:14 +03:00
Mikhail Chusavitin
a78fdadd88 Refine validate and burn profile layout 2026-04-09 15:14:48 +03:00
Mikhail Chusavitin
4ef403898f Tighten NVIDIA GPU PCI detection 2026-04-09 15:14:48 +03:00
025548ab3c UI: amber accents, smaller wallpaper logo, new support bundle name, drop display resolution
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
  using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:37:01 +03:00
Mikhail Chusavitin
e0d94d7f47 Remove HPL from build and audit flows 2026-04-08 10:00:23 +03:00
Mikhail Chusavitin
13899aa864 Drop incompatible HPL git fallback 2026-04-08 09:50:58 +03:00
Mikhail Chusavitin
f345d8a89d Build HPL serially to avoid upstream make races 2026-04-08 09:47:35 +03:00
Mikhail Chusavitin
4715059ac0 Fix HPL MPI stub header and keep full build logs 2026-04-08 09:45:14 +03:00
Mikhail Chusavitin
0660a40287 Harden HPL builder cache and runtime libs 2026-04-08 09:40:18 +03:00
Mikhail Chusavitin
67369d9b7b Fix OpenBLAS package lookup in HPL build 2026-04-08 09:32:49 +03:00
Mikhail Chusavitin
3f41a026ca Add resilient HPL source fallbacks 2026-04-08 09:25:31 +03:00
Mikhail Chusavitin
0ee4f46537 Restore MOTD-style ASCII wallpaper 2026-04-08 09:14:27 +03:00
8db40b098a Update bible submodule
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 07:14:31 +03:00
16e7ae00e7 Add HPL (LINPACK) benchmark as validate/stress task
HPL 2.3 from netlib compiled against OpenBLAS with a minimal
single-process MPI stub — no MPI package required in the ISO.
Matrix size is auto-sized to 80% of total RAM at runtime.
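The auto-sizing follows from HPL factoring an N×N double-precision matrix (8 bytes per element): N ≈ sqrt(0.8 · RAM / 8), rounded down to a multiple of the block size NB. A sketch under those assumptions; bee-hpl's exact rounding may differ:

```go
package main

import (
	"fmt"
	"math"
)

// hplProblemSize picks the HPL matrix order N so that the N*N matrix of
// float64 elements fills roughly 80% of total RAM, rounded down to a
// multiple of the block size nb (HPL requires no such alignment, but it
// keeps the last panel full-width).
func hplProblemSize(totalRAMBytes uint64, nb int) int {
	usable := float64(totalRAMBytes) * 0.80
	n := int(math.Sqrt(usable / 8)) // 8 bytes per double
	return n - n%nb
}

func main() {
	// For a 64 GiB machine with NB=256 (the commit's block size).
	fmt.Println(hplProblemSize(64<<30, 256))
}
```

Undersizing N wastes the run (memory-resident working set too small to stress bandwidth); oversizing it swaps and tanks the Gflops figure, which is why the sizing is done at runtime from /proc/meminfo rather than baked into HPL.dat.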

Build:
- VERSIONS: HPL_VERSION=2.3, HPL_SHA256=32c5c17d…
- build-hpl.sh: downloads HPL + OpenBLAS from Debian 12 repo,
  compiles xhpl with a self-contained mpi_stub.c
- build.sh: step 80-hpl, injects xhpl + libopenblas into overlay

Runtime:
- bee-hpl: generates HPL.dat (N auto from /proc/meminfo, NB=256,
  P=1 Q=1), runs xhpl, prints standard WR... Gflops output
- platform/hpl.go: RunHPL(), parses WR line → GFlops + PASSED/FAILED
- tasks.go: target "hpl"
- pages.go: LINPACK (HPL) card in validate/stress grid (stress-only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 07:08:18 +03:00
b2f8626fee Refactor validate modes, fix benchmark report and IPMI power
- Replace diag level 1-4 dropdown with Validate/Stress radio buttons
- Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short
- Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended
- Parallel GPU mode: spawn single task for all GPUs instead of splitting per model
- Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel
- Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts
- Fix IPMI power parsing in benchmark (was looking for 'Current Power', correct field is 'Instantaneous power reading')

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:42:12 +03:00
dd26e03b2d Add multi-GPU selector option for system-level tests
Adds a "Multi-GPU tests — use all GPUs" checkbox to the NVIDIA GPU
selector (checked by default). When enabled, PSU Pulse, NCCL, and
NVBandwidth tests run on every GPU in the system regardless of the
per-GPU selection above — which is required for correct PSU stress
testing (synchronous pulses across all GPUs create worst-case
transients). When unchecked, only the manually selected GPUs are used.

The same logic applies both to Run All (expandSATTarget) and to the
individual Run button on each multi-GPU test card.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:25:12 +03:00
6937a4c6ec Fix pulse_test: run all GPUs simultaneously, not per-GPU
pulse_test is a PSU/power-delivery test, not a per-GPU compute test.
Its purpose is to synchronously pulse all GPUs between idle and full
load to create worst-case transient spikes on the power supply.
Running it one GPU at a time would produce a fraction of the PSU load
and miss any PSU-level failures.

- Move nvidia-pulse from nvidiaPerGPUTargets to nvidiaAllGPUTargets
  (same dispatch path as NCCL and NVBandwidth)
- Change card onclick to runNvidiaFabricValidate (all selected GPUs at once)
- Update card title to "NVIDIA PSU Pulse Test" and description to
  explain why synchronous multi-GPU execution is required

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:19:11 +03:00
b9be93c213 Move NCCL interconnect and NVBandwidth tests to validate/stress
nvidia-interconnect (NCCL all_reduce_perf) and nvidia-bandwidth
(NVBandwidth) verify fabric connectivity and bandwidth — they are
not sustained burn loads. Move both from the Burn section to the
Validate section under the stress-mode toggle, alongside the other
DCGM diagnostic tests moved in the previous commit.

- Add sat-card-nvidia-interconnect and sat-card-nvidia-bandwidth
  validate cards (stress-only, all selected GPUs at once)
- Add runNvidiaFabricValidate() for all-GPU-at-once dispatch
- Add nvidiaAllGPUTargets handling in expandSATTarget/runAllSAT
- Remove Interconnect / Bandwidth card from Burn section
- Remove nvidia-interconnect and nvidia-bandwidth from runAllBurnTasks
  and the gpu/tools availability map

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:16:42 +03:00
d1a22d782d Move power diag tests to validate/stress; fix GPU burn power saturation
- bee-gpu-stress.c: remove per-wave cuCtxSynchronize barrier in both
  cuBLASLt and PTX hot loops; sync at most once/sec so the GPU queue
  stays continuously full — eliminates the CPU↔GPU ping-pong that
  prevented reaching full TDP
- sat_fan_stress.go: default SizeMB 0 (auto = 95% VRAM) instead of
  hardcoded 64 MB; tiny matrices caused <0.1 ms kernels where CPU
  re-queue overhead dominated
- pages.go: move nvidia-targeted-power and nvidia-pulse from Burn →
  Validate stress section alongside nvidia-targeted-stress; these are
  DCGM pass/fail diagnostics, not sustained burn loads; remove the
  Power Delivery / Power Budget card from Burn entirely

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:13:52 +03:00
42 changed files with 1913 additions and 811 deletions

View File

@@ -382,9 +382,9 @@ func runSAT(args []string, stdout, stderr io.Writer) int {
 		archive, err = application.RunNvidiaAcceptancePack("", logLine)
 	}
 case "memory":
-	archive, err = application.RunMemoryAcceptancePackCtx(context.Background(), "", logLine)
+	archive, err = application.RunMemoryAcceptancePackCtx(context.Background(), "", 256, 1, logLine)
 case "storage":
-	archive, err = application.RunStorageAcceptancePackCtx(context.Background(), "", logLine)
+	archive, err = application.RunStorageAcceptancePackCtx(context.Background(), "", false, logLine)
 case "cpu":
 	dur := *duration
 	if dur <= 0 {

View File

@@ -117,15 +117,15 @@ type satRunner interface {
 	RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (string, error)
 	RunNvidiaTargetedStressValidatePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
 	RunNvidiaBenchmark(ctx context.Context, baseDir string, opts platform.NvidiaBenchmarkOptions, logFunc func(string)) (string, error)
-	RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
+	RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, staggerSec int, logFunc func(string)) (string, error)
 	RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
 	RunNvidiaPulseTestPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
 	RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpuIndices []int, logFunc func(string)) (string, error)
 	RunNvidiaStressPack(ctx context.Context, baseDir string, opts platform.NvidiaStressOptions, logFunc func(string)) (string, error)
 	ListNvidiaGPUStatuses() ([]platform.NvidiaGPUStatus, error)
 	ResetNvidiaGPU(index int) (string, error)
-	RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
+	RunMemoryAcceptancePack(ctx context.Context, baseDir string, sizeMB, passes int, logFunc func(string)) (string, error)
-	RunStorageAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error)
+	RunStorageAcceptancePack(ctx context.Context, baseDir string, extended bool, logFunc func(string)) (string, error)
 	RunCPUAcceptancePack(ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error)
 	ListNvidiaGPUs() ([]platform.NvidiaGPU, error)
 	DetectGPUVendor() string
@@ -190,6 +190,7 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
 	}
 	result := collector.Run(runtimeMode)
 	applyLatestSATStatuses(&result.Hardware, DefaultSATBaseDir, a.StatusDB)
+	writePSUStatusesToDB(a.StatusDB, result.Hardware.PowerSupplies)
 	if health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath); err == nil {
 		result.Runtime = &health
 	}
@@ -566,11 +567,11 @@ func (a *App) RunNvidiaBenchmarkCtx(ctx context.Context, baseDir string, opts pl
 	return a.sat.RunNvidiaBenchmark(ctx, baseDir, opts, logFunc)
 }

-func (a *App) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
+func (a *App) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, staggerSec int, logFunc func(string)) (string, error) {
 	if strings.TrimSpace(baseDir) == "" {
 		baseDir = DefaultSATBaseDir
 	}
-	return a.sat.RunNvidiaOfficialComputePack(ctx, baseDir, durationSec, gpuIndices, logFunc)
+	return a.sat.RunNvidiaOfficialComputePack(ctx, baseDir, durationSec, gpuIndices, staggerSec, logFunc)
 }

 func (a *App) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
func (a *App) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) { func (a *App) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
@@ -602,14 +603,14 @@ func (a *App) RunNvidiaStressPackCtx(ctx context.Context, baseDir string, opts p
 }

 func (a *App) RunMemoryAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
-	return a.RunMemoryAcceptancePackCtx(context.Background(), baseDir, logFunc)
+	return a.RunMemoryAcceptancePackCtx(context.Background(), baseDir, 256, 1, logFunc)
 }

-func (a *App) RunMemoryAcceptancePackCtx(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
+func (a *App) RunMemoryAcceptancePackCtx(ctx context.Context, baseDir string, sizeMB, passes int, logFunc func(string)) (string, error) {
 	if strings.TrimSpace(baseDir) == "" {
 		baseDir = DefaultSATBaseDir
 	}
-	return a.sat.RunMemoryAcceptancePack(ctx, baseDir, logFunc)
+	return a.sat.RunMemoryAcceptancePack(ctx, baseDir, sizeMB, passes, logFunc)
 }

 func (a *App) RunMemoryAcceptancePackResult(baseDir string) (ActionResult, error) {
func (a *App) RunMemoryAcceptancePackResult(baseDir string) (ActionResult, error) { func (a *App) RunMemoryAcceptancePackResult(baseDir string) (ActionResult, error) {
@@ -634,14 +635,14 @@ func (a *App) RunCPUAcceptancePackResult(baseDir string, durationSec int) (Actio
 }

 func (a *App) RunStorageAcceptancePack(baseDir string, logFunc func(string)) (string, error) {
-	return a.RunStorageAcceptancePackCtx(context.Background(), baseDir, logFunc)
+	return a.RunStorageAcceptancePackCtx(context.Background(), baseDir, false, logFunc)
 }

-func (a *App) RunStorageAcceptancePackCtx(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
+func (a *App) RunStorageAcceptancePackCtx(ctx context.Context, baseDir string, extended bool, logFunc func(string)) (string, error) {
 	if strings.TrimSpace(baseDir) == "" {
 		baseDir = DefaultSATBaseDir
 	}
-	return a.sat.RunStorageAcceptancePack(ctx, baseDir, logFunc)
+	return a.sat.RunStorageAcceptancePack(ctx, baseDir, extended, logFunc)
 }

 func (a *App) RunStorageAcceptancePackResult(baseDir string) (ActionResult, error) {
func (a *App) RunStorageAcceptancePackResult(baseDir string) (ActionResult, error) { func (a *App) RunStorageAcceptancePackResult(baseDir string) (ActionResult, error) {
@@ -926,6 +927,41 @@ func bodyOr(body, fallback string) string {
 	return body
 }

+// writePSUStatusesToDB records PSU statuses collected during audit into the
+// component-status DB so they are visible in the Hardware Summary card.
+// PSU status is sourced from IPMI (ipmitool fru + sdr) during audit.
+func writePSUStatusesToDB(db *ComponentStatusDB, psus []schema.HardwarePowerSupply) {
+	if db == nil || len(psus) == 0 {
+		return
+	}
+	const source = "audit:ipmi"
+	worstStatus := "OK"
+	for _, psu := range psus {
+		if psu.Status == nil {
+			continue
+		}
+		slot := "?"
+		if psu.Slot != nil {
+			slot = *psu.Slot
+		}
+		st := *psu.Status
+		detail := ""
+		if psu.ErrorDescription != nil {
+			detail = *psu.ErrorDescription
+		}
+		db.Record("psu:"+slot, source, st, detail)
+		switch st {
+		case "Critical":
+			worstStatus = "Critical"
+		case "Warning":
+			if worstStatus != "Critical" {
+				worstStatus = "Warning"
+			}
+		}
+	}
+	db.Record("psu:all", source, worstStatus, "")
+}
+
 func ReadRuntimeHealth(path string) (schema.RuntimeHealth, error) {
 	raw, err := os.ReadFile(path)
 	if err != nil {

View File

@@ -161,7 +161,7 @@ func (f fakeSAT) RunNvidiaTargetedStressValidatePack(_ context.Context, baseDir
 	return f.runNvidiaFn(baseDir)
 }

-func (f fakeSAT) RunNvidiaOfficialComputePack(_ context.Context, baseDir string, durationSec int, gpuIndices []int, _ func(string)) (string, error) {
+func (f fakeSAT) RunNvidiaOfficialComputePack(_ context.Context, baseDir string, durationSec int, gpuIndices []int, _ int, _ func(string)) (string, error) {
 	if f.runNvidiaComputeFn != nil {
 		return f.runNvidiaComputeFn(baseDir, durationSec, gpuIndices)
 	}
@@ -217,11 +217,11 @@ func (f fakeSAT) ResetNvidiaGPU(index int) (string, error) {
 	return "", nil
 }

-func (f fakeSAT) RunMemoryAcceptancePack(_ context.Context, baseDir string, _ func(string)) (string, error) {
+func (f fakeSAT) RunMemoryAcceptancePack(_ context.Context, baseDir string, _, _ int, _ func(string)) (string, error) {
 	return f.runMemoryFn(baseDir)
 }

-func (f fakeSAT) RunStorageAcceptancePack(_ context.Context, baseDir string, _ func(string)) (string, error) {
+func (f fakeSAT) RunStorageAcceptancePack(_ context.Context, baseDir string, _ bool, _ func(string)) (string, error) {
 	return f.runStorageFn(baseDir)
 }
@@ -542,8 +542,6 @@ func TestActionResultsUseFallbackBody(t *testing.T) {
 }

 func TestExportSupportBundleResultMentionsUnmountedUSB(t *testing.T) {
-	t.Parallel()
-
 	tmp := t.TempDir()
 	oldExportDir := DefaultExportDir
 	DefaultExportDir = tmp
@@ -580,8 +578,6 @@ func TestExportSupportBundleResultMentionsUnmountedUSB(t *testing.T) {
 }

 func TestExportSupportBundleResultDoesNotPretendSuccessOnError(t *testing.T) {
-	t.Parallel()
-
 	tmp := t.TempDir()
 	oldExportDir := DefaultExportDir
 	DefaultExportDir = tmp
@@ -643,8 +639,6 @@ func TestRunNvidiaAcceptancePackResult(t *testing.T) {
 }

 func TestRunSATDefaultsToExportDir(t *testing.T) {
-	t.Parallel()
-
 	oldSATBaseDir := DefaultSATBaseDir
 	DefaultSATBaseDir = "/tmp/export/bee-sat"
 	t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })

View File

@@ -54,7 +54,7 @@ if ! command -v lspci >/dev/null 2>&1; then
 	exit 0
 fi
 found=0
-for gpu in $(lspci -Dn | awk '$3 ~ /^10de:/ {print $1}'); do
+for gpu in $(lspci -Dn | awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ {print $1}'); do
 	found=1
 	echo "=== GPU $gpu ==="
 	lspci -s "$gpu" -vv 2>&1 || true
@@ -74,6 +74,11 @@ fi
 for d in /sys/bus/pci/devices/*/; do
 	vendor=$(cat "$d/vendor" 2>/dev/null)
 	[ "$vendor" = "0x10de" ] || continue
+	class=$(cat "$d/class" 2>/dev/null)
+	case "$class" in
+		0x030000|0x030200) ;;
+		*) continue ;;
+	esac
 	dev=$(basename "$d")
 	echo "=== $dev ==="
 	for f in current_link_speed current_link_width max_link_speed max_link_width; do
@@ -192,7 +197,7 @@ var supportBundleOptionalFiles = []struct {
 	{name: "system/syslog.txt", src: "/var/log/syslog"},
 }

-const supportBundleGlob = "bee-support-*.tar.gz"
+const supportBundleGlob = "????-??-?? (BEE-SP*)*.tar.gz"

 func BuildSupportBundle(exportDir string) (string, error) {
 	exportDir = strings.TrimSpace(exportDir)
@@ -206,9 +211,14 @@ func BuildSupportBundle(exportDir string) (string, error) {
 		return "", err
 	}

-	host := sanitizeFilename(hostnameOr("unknown"))
-	ts := time.Now().UTC().Format("20060102-150405")
-	stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s", host, ts))
+	now := time.Now().UTC()
+	date := now.Format("2006-01-02")
+	tod := now.Format("150405")
+	ver := bundleVersion()
+	model := serverModelForBundle()
+	sn := serverSerialForBundle()
+	stageRoot := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-stage-%s-%s", sanitizeFilename(hostnameOr("unknown")), now.Format("20060102-150405")))
 	if err := os.MkdirAll(stageRoot, 0755); err != nil {
 		return "", err
 	}
@@ -240,7 +250,8 @@ func BuildSupportBundle(exportDir string) (string, error) {
 		return "", err
 	}

-	archivePath := filepath.Join(os.TempDir(), fmt.Sprintf("bee-support-%s-%s.tar.gz", host, ts))
+	archiveName := fmt.Sprintf("%s (BEE-SP v%s) %s %s %s.tar.gz", date, ver, model, sn, tod)
+	archivePath := filepath.Join(os.TempDir(), archiveName)
 	if err := createSupportTarGz(archivePath, stageRoot); err != nil {
 		return "", err
 	}
@@ -397,6 +408,60 @@ func writeManifest(dst, exportDir, stageRoot string) error {
 	return os.WriteFile(dst, []byte(body.String()), 0644)
 }

+func bundleVersion() string {
+	v := buildVersion()
+	v = strings.TrimPrefix(v, "v")
+	v = strings.TrimPrefix(v, "V")
+	if v == "" || v == "unknown" {
+		return "0.0"
+	}
+	return v
+}
+
+func serverModelForBundle() string {
+	raw, err := exec.Command("dmidecode", "-t", "1").Output()
+	if err != nil {
+		return "unknown"
+	}
+	for _, line := range strings.Split(string(raw), "\n") {
+		line = strings.TrimSpace(line)
+		key, val, ok := strings.Cut(line, ": ")
+		if !ok {
+			continue
+		}
+		if strings.TrimSpace(key) == "Product Name" {
+			val = strings.TrimSpace(val)
+			if val == "" {
+				return "unknown"
+			}
+			return strings.ReplaceAll(val, " ", "_")
+		}
+	}
+	return "unknown"
+}
+
+func serverSerialForBundle() string {
+	raw, err := exec.Command("dmidecode", "-t", "1").Output()
+	if err != nil {
+		return "unknown"
+	}
+	for _, line := range strings.Split(string(raw), "\n") {
+		line = strings.TrimSpace(line)
+		key, val, ok := strings.Cut(line, ": ")
+		if !ok {
+			continue
+		}
+		if strings.TrimSpace(key) == "Serial Number" {
+			val = strings.TrimSpace(val)
+			if val == "" {
+				return "unknown"
+			}
+			return val
+		}
+	}
+	return "unknown"
+}
+
 func buildVersion() string {
 	raw, err := exec.Command("bee", "version").CombinedOutput()
 	if err != nil {

View File

@@ -179,11 +179,3 @@ func commandOutputWithTimeout(timeout time.Duration, name string, args ...string
 	defer cancel()
 	return exec.CommandContext(ctx, name, args...).Output()
 }
-
-func interfaceHasCarrier(iface string) bool {
-	raw, err := readNetCarrierFile(iface)
-	if err != nil {
-		return false
-	}
-	return strings.TrimSpace(raw) == "1"
-}

View File

@@ -58,14 +58,12 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
 			}
 		}

-		if interfaceHasCarrier(iface) {
-			if out, err := ethtoolModuleQuery(iface); err == nil {
-				if injectSFPDOMTelemetry(&devs[i], out) {
-					enriched++
-					continue
-				}
-			}
-		}
+		if out, err := ethtoolModuleQuery(iface); err == nil {
+			if injectSFPDOMTelemetry(&devs[i], out) {
+				enriched++
+				continue
+			}
+		}

 		if len(devs[i].MacAddresses) > 0 || devs[i].Firmware != nil {
 			enriched++
 		}
@@ -115,8 +113,38 @@ func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
 		}
 		key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
 		val := strings.TrimSpace(trimmed[idx+1:])
+		if val == "" || strings.EqualFold(val, "not supported") || strings.EqualFold(val, "unknown") {
+			continue
+		}
 		switch {
+		case key == "identifier":
+			s := parseSFPIdentifier(val)
+			dev.SFPIdentifier = &s
+			t := true
+			dev.SFPPresent = &t
+			changed = true
+		case key == "connector":
+			s := parseSFPConnector(val)
+			dev.SFPConnector = &s
+			changed = true
+		case key == "vendor name":
+			s := strings.TrimSpace(val)
+			dev.SFPVendor = &s
+			changed = true
+		case key == "vendor pn":
+			s := strings.TrimSpace(val)
+			dev.SFPPartNumber = &s
+			changed = true
+		case key == "vendor sn":
+			s := strings.TrimSpace(val)
+			dev.SFPSerialNumber = &s
+			changed = true
+		case strings.Contains(key, "laser wavelength"):
+			if f, ok := firstFloat(val); ok {
+				dev.SFPWavelengthNM = &f
+				changed = true
+			}
 		case strings.Contains(key, "module temperature"):
 			if f, ok := firstFloat(val); ok {
 				dev.SFPTemperatureC = &f
@@ -147,12 +175,61 @@ func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
 	return changed
 }

+// parseSFPIdentifier extracts the human-readable transceiver type from the
+// raw ethtool identifier line, e.g. "0x03 (SFP)" → "SFP".
+func parseSFPIdentifier(val string) string {
+	if s := extractParens(val); s != "" {
+		return s
+	}
+	return val
+}
+
+// parseSFPConnector extracts the connector type from the raw ethtool line,
+// e.g. "0x07 (LC)" → "LC".
+func parseSFPConnector(val string) string {
+	if s := extractParens(val); s != "" {
+		return s
+	}
+	return val
+}
+
+var parenRe = regexp.MustCompile(`\(([^)]+)\)`)
+
+func extractParens(s string) string {
+	m := parenRe.FindStringSubmatch(s)
+	if len(m) < 2 {
+		return ""
+	}
+	return strings.TrimSpace(m[1])
+}
+
 func parseSFPDOM(raw string) map[string]any {
 	dev := schema.HardwarePCIeDevice{}
 	if !injectSFPDOMTelemetry(&dev, raw) {
 		return map[string]any{}
 	}
 	out := map[string]any{}
+	if dev.SFPPresent != nil {
+		out["sfp_present"] = *dev.SFPPresent
+	}
+	if dev.SFPIdentifier != nil {
+		out["sfp_identifier"] = *dev.SFPIdentifier
+	}
+	if dev.SFPConnector != nil {
+		out["sfp_connector"] = *dev.SFPConnector
+	}
+	if dev.SFPVendor != nil {
+		out["sfp_vendor"] = *dev.SFPVendor
+	}
+	if dev.SFPPartNumber != nil {
+		out["sfp_part_number"] = *dev.SFPPartNumber
+	}
+	if dev.SFPSerialNumber != nil {
+		out["sfp_serial_number"] = *dev.SFPSerialNumber
+	}
+	if dev.SFPWavelengthNM != nil {
+		out["sfp_wavelength_nm"] = *dev.SFPWavelengthNM
+	}
 	if dev.SFPTemperatureC != nil {
 		out["sfp_temperature_c"] = *dev.SFPTemperatureC
 	}

View File

@@ -122,10 +122,7 @@ func TestEnrichPCIeWithNICTelemetrySkipsModuleQueryWithoutCarrier(t *testing.T)
 	readNetAddressFile = func(string) (string, error) { return "aa:bb:cc:dd:ee:ff", nil }
 	readNetCarrierFile = func(string) (string, error) { return "0", nil }
 	ethtoolInfoQuery = func(string) (string, error) { return "", fmt.Errorf("skip firmware") }
-	ethtoolModuleQuery = func(string) (string, error) {
-		t.Fatal("ethtool -m should not be called without carrier")
-		return "", nil
-	}
+	ethtoolModuleQuery = func(string) (string, error) { return "", fmt.Errorf("no module") }
 	class := "EthernetController"
 	bdf := "0000:18:00.0"

View File

@@ -15,6 +15,7 @@ const nvidiaVendorID = 0x10de
 type nvidiaGPUInfo struct {
 	Index  int
 	BDF    string
+	Name   string
 	Serial string
 	VBIOS  string
 	TemperatureC *float64
@@ -73,6 +74,9 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
 			continue
 		}
+		if v := strings.TrimSpace(info.Name); v != "" {
+			devs[i].Model = &v
+		}
 		if v := strings.TrimSpace(info.Serial); v != "" {
 			devs[i].SerialNumber = &v
 		}
@@ -99,7 +103,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
 func queryNVIDIAGPUs() (map[string]nvidiaGPUInfo, error) {
 	out, err := exec.Command(
 		"nvidia-smi",
-		"--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max",
+		"--query-gpu=index,pci.bus_id,name,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max",
 		"--format=csv,noheader,nounits",
 	).Output()
 	if err != nil {
@@ -123,8 +127,8 @@ func parseNVIDIASMIQuery(raw string) (map[string]nvidiaGPUInfo, error) {
 		if len(rec) == 0 {
 			continue
 		}
-		if len(rec) < 13 {
-			return nil, fmt.Errorf("unexpected nvidia-smi columns: got %d, want 13", len(rec))
+		if len(rec) < 14 {
+			return nil, fmt.Errorf("unexpected nvidia-smi columns: got %d, want 14", len(rec))
 		}
 		bdf := normalizePCIeBDF(rec[1])
@@ -135,17 +139,18 @@ func parseNVIDIASMIQuery(raw string) (map[string]nvidiaGPUInfo, error) {
 		info := nvidiaGPUInfo{
 			Index: parseRequiredInt(rec[0]),
 			BDF:   bdf,
-			Serial:             strings.TrimSpace(rec[2]),
-			VBIOS:              strings.TrimSpace(rec[3]),
-			TemperatureC:       parseMaybeFloat(rec[4]),
-			PowerW:             parseMaybeFloat(rec[5]),
-			ECCUncorrected:     parseMaybeInt64(rec[6]),
-			ECCCorrected:       parseMaybeInt64(rec[7]),
-			HWSlowdown:         parseMaybeBool(rec[8]),
-			PCIeLinkGenCurrent: parseMaybeInt(rec[9]),
-			PCIeLinkGenMax:     parseMaybeInt(rec[10]),
-			PCIeLinkWidthCur:   parseMaybeInt(rec[11]),
-			PCIeLinkWidthMax:   parseMaybeInt(rec[12]),
+			Name:               strings.TrimSpace(rec[2]),
+			Serial:             strings.TrimSpace(rec[3]),
+			VBIOS:              strings.TrimSpace(rec[4]),
+			TemperatureC:       parseMaybeFloat(rec[5]),
+			PowerW:             parseMaybeFloat(rec[6]),
+			ECCUncorrected:     parseMaybeInt64(rec[7]),
+			ECCCorrected:       parseMaybeInt64(rec[8]),
+			HWSlowdown:         parseMaybeBool(rec[9]),
+			PCIeLinkGenCurrent: parseMaybeInt(rec[10]),
+			PCIeLinkGenMax:     parseMaybeInt(rec[11]),
+			PCIeLinkWidthCur:   parseMaybeInt(rec[12]),
+			PCIeLinkWidthMax:   parseMaybeInt(rec[13]),
 		}
 		result[bdf] = info
 	}

View File
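Inserting `name` as column 3 of the nvidia-smi query shifts every later field index by one, which is easy to get wrong. A minimal standalone sketch of the row splitting the parser relies on (`splitSMIRow` is an illustrative helper, not the collector's own):

```go
package main

import (
	"fmt"
	"strings"
)

// splitSMIRow splits one nvidia-smi row produced with
// --format=csv,noheader,nounits and trims the padding nvidia-smi
// emits after each comma. Note: a naive comma split would break if a
// field ever contained a comma; typical GPU model names do not.
func splitSMIRow(row string) []string {
	fields := strings.Split(strings.TrimSpace(row), ",")
	for i := range fields {
		fields[i] = strings.TrimSpace(fields[i])
	}
	return fields
}

func main() {
	row := "0, 00000000:65:00.0, NVIDIA H100 80GB HBM3, GPU-SERIAL-1, 96.00.1F.00.02, 54, 210.33, 0, 5, Not Active, 4, 4, 16, 16"
	rec := splitSMIRow(row)
	fmt.Println(len(rec)) // 14
	fmt.Println(rec[2])   // NVIDIA H100 80GB HBM3
	fmt.Println(rec[3])   // GPU-SERIAL-1
}
```

With the new 14-column layout, `rec[2]` is the model name and everything from serial onwards moves up one index — exactly the shuffle in the hunk above.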

@@ -6,7 +6,7 @@ import (
 )
 func TestParseNVIDIASMIQuery(t *testing.T) {
-	raw := "0, 00000000:65:00.0, GPU-SERIAL-1, 96.00.1F.00.02, 54, 210.33, 0, 5, Not Active, 4, 4, 16, 16\n"
+	raw := "0, 00000000:65:00.0, NVIDIA H100 80GB HBM3, GPU-SERIAL-1, 96.00.1F.00.02, 54, 210.33, 0, 5, Not Active, 4, 4, 16, 16\n"
 	byBDF, err := parseNVIDIASMIQuery(raw)
 	if err != nil {
 		t.Fatalf("parse failed: %v", err)
@@ -16,6 +16,9 @@ func TestParseNVIDIASMIQuery(t *testing.T) {
 	if !ok {
 		t.Fatalf("gpu by normalized bdf not found")
 	}
+	if gpu.Name != "NVIDIA H100 80GB HBM3" {
+		t.Fatalf("name: got %q", gpu.Name)
+	}
 	if gpu.Serial != "GPU-SERIAL-1" {
 		t.Fatalf("serial: got %q", gpu.Serial)
 	}

View File

@@ -2,6 +2,7 @@ package collector
 import (
 	"bee/audit/internal/schema"
+	"fmt"
 	"log/slog"
 	"os/exec"
 	"strconv"
@@ -79,6 +80,25 @@ func shouldIncludePCIeDevice(class, vendor, device string) bool {
 		}
 	}
+	// Exclude BMC/management virtual VGA adapters — these are firmware video chips,
+	// not real GPUs, and pollute the GPU inventory (e.g. iBMC, iDRAC, iLO VGA).
+	if strings.Contains(c, "vga") || strings.Contains(c, "display") || strings.Contains(c, "3d") {
+		bmcPatterns := []string{
+			"management system chip",
+			"management controller",
+			"ibmc",
+			"idrac",
+			"ilo vga",
+			"aspeed",
+			"matrox",
+		}
+		for _, bad := range bmcPatterns {
+			if strings.Contains(d, bad) {
+				return false
+			}
+		}
+	}
 	if strings.Contains(v, "advanced micro devices") || strings.Contains(v, "[amd]") {
 		internalAMDPatterns := []string{
 			"dummy function",
@@ -153,6 +173,9 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
 	// SVendor/SDevice available but not in schema — skip
+	// Warn if PCIe link is running below its maximum negotiated speed.
+	applyPCIeLinkSpeedWarning(&dev)
 	return dev
 }
@@ -222,6 +245,41 @@ func readPCIStringAttribute(bdf, attribute string) (string, bool) {
 	return value, true
 }
+// applyPCIeLinkSpeedWarning sets the device status to Warning if the current PCIe link
+// speed is below the maximum negotiated speed supported by both ends.
+func applyPCIeLinkSpeedWarning(dev *schema.HardwarePCIeDevice) {
+	if dev.LinkSpeed == nil || dev.MaxLinkSpeed == nil {
+		return
+	}
+	if pcieLinkSpeedRank(*dev.LinkSpeed) < pcieLinkSpeedRank(*dev.MaxLinkSpeed) {
+		warn := statusWarning
+		dev.Status = &warn
+		desc := fmt.Sprintf("PCIe link speed degraded: running at %s, capable of %s", *dev.LinkSpeed, *dev.MaxLinkSpeed)
+		dev.ErrorDescription = &desc
+	}
+}
+// pcieLinkSpeedRank returns a numeric rank for a normalized Gen string (e.g. "Gen4" → 4).
+// Returns 0 for unrecognised values so comparisons fail safe.
+func pcieLinkSpeedRank(gen string) int {
+	switch gen {
+	case "Gen1":
+		return 1
+	case "Gen2":
+		return 2
+	case "Gen3":
+		return 3
+	case "Gen4":
+		return 4
+	case "Gen5":
+		return 5
+	case "Gen6":
+		return 6
+	default:
+		return 0
+	}
+}
 func normalizePCILinkSpeed(raw string) string {
 	raw = strings.TrimSpace(strings.ToLower(raw))
 	switch {

View File
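The Gen-rank comparison above is self-contained enough to sketch outside the collector. A minimal standalone version (the `degraded` wrapper is illustrative — the real code mutates `dev.Status` and `dev.ErrorDescription` rather than returning a bool):

```go
package main

import "fmt"

// pcieLinkSpeedRank mirrors the collector helper: normalized Gen string
// to numeric rank, 0 for unrecognised values.
func pcieLinkSpeedRank(gen string) int {
	switch gen {
	case "Gen1":
		return 1
	case "Gen2":
		return 2
	case "Gen3":
		return 3
	case "Gen4":
		return 4
	case "Gen5":
		return 5
	case "Gen6":
		return 6
	default:
		return 0
	}
}

// degraded reports whether a link negotiated below its maximum speed.
func degraded(current, max string) bool {
	return pcieLinkSpeedRank(current) < pcieLinkSpeedRank(max)
}

func main() {
	fmt.Println(degraded("Gen1", "Gen5")) // true: Gen1 on a Gen5-capable slot
	fmt.Println(degraded("Gen5", "Gen5")) // false: running at max
}
```

Ranking the strings instead of comparing them lexically matters once Gen10 exists: `"Gen10" < "Gen2"` as a string, but not as a rank.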

@@ -1,6 +1,7 @@
 package collector
 import (
+	"bee/audit/internal/schema"
 	"encoding/json"
 	"strings"
 	"testing"
@@ -29,6 +30,8 @@ func TestShouldIncludePCIeDevice(t *testing.T) {
 		{name: "raid", class: "RAID bus controller", want: true},
 		{name: "nvme", class: "Non-Volatile memory controller", want: true},
 		{name: "vga", class: "VGA compatible controller", want: true},
+		{name: "ibmc vga", class: "VGA compatible controller", vendor: "Huawei Technologies Co., Ltd.", device: "Hi171x Series [iBMC Intelligent Management system chip w/VGA support]", want: false},
+		{name: "aspeed vga", class: "VGA compatible controller", vendor: "ASPEED Technology, Inc.", device: "ASPEED Graphics Family", want: false},
 		{name: "other encryption controller", class: "Encryption controller", vendor: "Intel Corporation", device: "QuickAssist", want: true},
 	}
@@ -139,3 +142,77 @@ func TestNormalizePCILinkSpeed(t *testing.T) {
 		}
 	}
 }
+func TestApplyPCIeLinkSpeedWarning(t *testing.T) {
+	ptr := func(s string) *string { return &s }
+	tests := []struct {
+		name        string
+		linkSpeed   *string
+		maxSpeed    *string
+		wantWarning bool
+		wantGenIn   string // substring expected in ErrorDescription when warning
+	}{
+		{
+			name:        "degraded Gen1 vs Gen5",
+			linkSpeed:   ptr("Gen1"),
+			maxSpeed:    ptr("Gen5"),
+			wantWarning: true,
+			wantGenIn:   "Gen1",
+		},
+		{
+			name:        "at max Gen5",
+			linkSpeed:   ptr("Gen5"),
+			maxSpeed:    ptr("Gen5"),
+			wantWarning: false,
+		},
+		{
+			name:        "degraded Gen4 vs Gen5",
+			linkSpeed:   ptr("Gen4"),
+			maxSpeed:    ptr("Gen5"),
+			wantWarning: true,
+			wantGenIn:   "Gen4",
+		},
+		{
+			name:        "missing current speed — no warning",
+			linkSpeed:   nil,
+			maxSpeed:    ptr("Gen5"),
+			wantWarning: false,
+		},
+		{
+			name:        "missing max speed — no warning",
+			linkSpeed:   ptr("Gen1"),
+			maxSpeed:    nil,
+			wantWarning: false,
+		},
+	}
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			dev := schema.HardwarePCIeDevice{}
+			ok := statusOK
+			dev.Status = &ok
+			dev.LinkSpeed = tt.linkSpeed
+			dev.MaxLinkSpeed = tt.maxSpeed
+			applyPCIeLinkSpeedWarning(&dev)
+			gotWarn := dev.Status != nil && *dev.Status == statusWarning
+			if gotWarn != tt.wantWarning {
+				t.Fatalf("wantWarning=%v gotWarning=%v (status=%v)", tt.wantWarning, gotWarn, dev.Status)
+			}
+			if tt.wantWarning {
+				if dev.ErrorDescription == nil {
+					t.Fatal("expected ErrorDescription to be set")
+				}
+				if !strings.Contains(*dev.ErrorDescription, tt.wantGenIn) {
+					t.Fatalf("ErrorDescription %q does not contain %q", *dev.ErrorDescription, tt.wantGenIn)
+				}
+			} else {
+				if dev.ErrorDescription != nil {
+					t.Fatalf("unexpected ErrorDescription: %s", *dev.ErrorDescription)
+				}
+			}
+		})
+	}
+}

View File

@@ -326,8 +326,8 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
 	}
 	report := renderBenchmarkReportWithCharts(result, loadBenchmarkReportCharts(runDir, selected))
-	if err := os.WriteFile(filepath.Join(runDir, "report.txt"), []byte(report), 0644); err != nil {
-		return "", fmt.Errorf("write report.txt: %w", err)
+	if err := os.WriteFile(filepath.Join(runDir, "report.md"), []byte(report), 0644); err != nil {
+		return "", fmt.Errorf("write report.md: %w", err)
 	}
 	summary := renderBenchmarkSummary(result)
@@ -335,11 +335,7 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
 		return "", fmt.Errorf("write summary.txt: %w", err)
 	}
-	archive := filepath.Join(baseDir, "gpu-benchmark-"+ts+".tar.gz")
-	if err := createTarGz(archive, runDir); err != nil {
-		return "", fmt.Errorf("pack benchmark archive: %w", err)
-	}
-	return archive, nil
+	return runDir, nil
 }
 func normalizeNvidiaBenchmarkOptionsForBenchmark(opts NvidiaBenchmarkOptions) NvidiaBenchmarkOptions {
@@ -1183,19 +1179,9 @@ func queryIPMIServerPowerW() (float64, error) {
 	if err != nil {
 		return 0, fmt.Errorf("ipmitool dcmi power reading: %w", err)
 	}
-	for _, line := range strings.Split(string(out), "\n") {
-		if strings.Contains(line, "Current Power") {
-			parts := strings.SplitN(line, ":", 2)
-			if len(parts) == 2 {
-				val := strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(parts[1]), "Watts"))
-				val = strings.TrimSpace(val)
-				w, err := strconv.ParseFloat(val, 64)
-				if err == nil && w > 0 {
-					return w, nil
-				}
-			}
-		}
-	}
+	if w := parseDCMIPowerReading(string(out)); w > 0 {
+		return w, nil
+	}
 	return 0, fmt.Errorf("could not parse ipmitool dcmi power reading output")
 }

View File
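The refactored hunk above calls a `parseDCMIPowerReading` helper whose body is not part of this diff. A plausible sketch under that assumption — it mirrors the deleted inline parsing and returns 0 when no valid reading is found:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDCMIPowerReading extracts the wattage from `ipmitool dcmi power
// reading` output by scanning for the "Current Power" line. Returns 0
// if no valid reading is present. (Sketch — the real helper in sat.go
// is not shown in this diff.)
func parseDCMIPowerReading(out string) float64 {
	for _, line := range strings.Split(out, "\n") {
		if !strings.Contains(line, "Current Power") {
			continue
		}
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		// Value looks like " 312 Watts": strip the unit, then the padding.
		val := strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(parts[1]), "Watts"))
		val = strings.TrimSpace(val)
		if w, err := strconv.ParseFloat(val, 64); err == nil && w > 0 {
			return w
		}
	}
	return 0
}

func main() {
	sample := "    Current Power                             : 312 Watts\n"
	fmt.Println(parseDCMIPowerReading(sample)) // 312
}
```

Factoring the loop out this way makes the parsing testable without invoking `ipmitool`, which is presumably the point of the refactor.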

@@ -22,18 +22,53 @@ var ansiEscapePattern = regexp.MustCompile(`\x1b\[[0-9;]*m`)
 func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benchmarkReportChart) string {
 	var b strings.Builder
-	fmt.Fprintf(&b, "Bee NVIDIA Benchmark Report\n")
-	fmt.Fprintf(&b, "===========================\n\n")
-	fmt.Fprintf(&b, "Generated: %s\n", result.GeneratedAt.Format("2006-01-02 15:04:05 UTC"))
-	fmt.Fprintf(&b, "Host: %s\n", result.Hostname)
-	fmt.Fprintf(&b, "Profile: %s\n", result.BenchmarkProfile)
-	fmt.Fprintf(&b, "Overall status: %s\n", result.OverallStatus)
-	fmt.Fprintf(&b, "Selected GPUs: %s\n", joinIndexList(result.SelectedGPUIndices))
-	fmt.Fprintf(&b, "Normalization: %s\n\n", result.Normalization.Status)
+	// ── Header ────────────────────────────────────────────────────────────────
+	b.WriteString("# Bee NVIDIA Benchmark Report\n\n")
+	// System identity block
+	if result.ServerModel != "" {
+		fmt.Fprintf(&b, "**Server:** %s \n", result.ServerModel)
+	}
+	if result.Hostname != "" {
+		fmt.Fprintf(&b, "**Host:** %s \n", result.Hostname)
+	}
+	// GPU models summary
+	if len(result.GPUs) > 0 {
+		modelCount := make(map[string]int)
+		var modelOrder []string
+		for _, g := range result.GPUs {
+			m := strings.TrimSpace(g.Name)
+			if m == "" {
+				m = "Unknown GPU"
+			}
+			if modelCount[m] == 0 {
+				modelOrder = append(modelOrder, m)
+			}
+			modelCount[m]++
+		}
+		var parts []string
+		for _, m := range modelOrder {
+			if modelCount[m] == 1 {
+				parts = append(parts, m)
+			} else {
+				parts = append(parts, fmt.Sprintf("%d× %s", modelCount[m], m))
+			}
+		}
+		fmt.Fprintf(&b, "**GPU(s):** %s \n", strings.Join(parts, ", "))
+	}
+	fmt.Fprintf(&b, "**Profile:** %s \n", result.BenchmarkProfile)
+	fmt.Fprintf(&b, "**App version:** %s \n", result.BenchmarkVersion)
+	fmt.Fprintf(&b, "**Generated:** %s \n", result.GeneratedAt.Format("2006-01-02 15:04:05 UTC"))
+	if result.ParallelGPUs {
+		fmt.Fprintf(&b, "**Mode:** parallel (all GPUs simultaneously) \n")
+	}
+	fmt.Fprintf(&b, "**Overall status:** %s \n", result.OverallStatus)
+	b.WriteString("\n")
+	// ── Executive Summary ─────────────────────────────────────────────────────
 	if len(result.Findings) > 0 {
-		fmt.Fprintf(&b, "Executive Summary\n")
-		fmt.Fprintf(&b, "-----------------\n")
+		b.WriteString("## Executive Summary\n\n")
 		for _, finding := range result.Findings {
 			fmt.Fprintf(&b, "- %s\n", finding)
 		}
@@ -41,150 +76,207 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
 	}
 	if len(result.Warnings) > 0 {
-		fmt.Fprintf(&b, "Warnings\n")
-		fmt.Fprintf(&b, "--------\n")
+		b.WriteString("## Warnings\n\n")
 		for _, warning := range result.Warnings {
 			fmt.Fprintf(&b, "- %s\n", warning)
 		}
 		b.WriteString("\n")
 	}
-	fmt.Fprintf(&b, "Per GPU Scorecard\n")
-	fmt.Fprintf(&b, "-----------------\n")
-	for _, gpu := range result.GPUs {
-		fmt.Fprintf(&b, "GPU %d %s\n", gpu.Index, gpu.Name)
-		fmt.Fprintf(&b, "  Status: %s\n", gpu.Status)
-		fmt.Fprintf(&b, "  Composite score: %.2f\n", gpu.Scores.CompositeScore)
-		fmt.Fprintf(&b, "  Compute score: %.2f\n", gpu.Scores.ComputeScore)
-		if gpu.Scores.TOPSPerSMPerGHz > 0 {
-			fmt.Fprintf(&b, "  Compute efficiency: %.3f TOPS/SM/GHz\n", gpu.Scores.TOPSPerSMPerGHz)
-		}
-		fmt.Fprintf(&b, "  Power sustain: %.1f\n", gpu.Scores.PowerSustainScore)
-		fmt.Fprintf(&b, "  Thermal sustain: %.1f\n", gpu.Scores.ThermalSustainScore)
-		fmt.Fprintf(&b, "  Stability: %.1f\n", gpu.Scores.StabilityScore)
-		if gpu.Scores.InterconnectScore > 0 {
-			fmt.Fprintf(&b, "  Interconnect: %.1f\n", gpu.Scores.InterconnectScore)
-		}
-		if len(gpu.DegradationReasons) > 0 {
-			fmt.Fprintf(&b, "  Degradation reasons: %s\n", strings.Join(gpu.DegradationReasons, ", "))
-		}
-		fmt.Fprintf(&b, "  Avg power/temp/clock: %.1f W / %.1f C / %.0f MHz\n", gpu.Steady.AvgPowerW, gpu.Steady.AvgTempC, gpu.Steady.AvgGraphicsClockMHz)
-		fmt.Fprintf(&b, "  P95 power/temp/clock: %.1f W / %.1f C / %.0f MHz\n", gpu.Steady.P95PowerW, gpu.Steady.P95TempC, gpu.Steady.P95GraphicsClockMHz)
-		if len(gpu.PrecisionResults) > 0 {
-			fmt.Fprintf(&b, "  Precision results:\n")
-			for _, precision := range gpu.PrecisionResults {
-				if precision.Supported {
-					fmt.Fprintf(&b, "  - %s: %.2f TOPS lanes=%d iterations=%d\n", precision.Name, precision.TeraOpsPerSec, precision.Lanes, precision.Iterations)
-				} else {
-					fmt.Fprintf(&b, "  - %s: unsupported (%s)\n", precision.Name, precision.Notes)
-				}
-			}
-		}
-		fmt.Fprintf(&b, "  Throttle: %s\n", formatThrottleLine(gpu.Throttle, gpu.Steady.DurationSec))
-		if len(gpu.Notes) > 0 {
-			fmt.Fprintf(&b, "  Notes:\n")
-			for _, note := range gpu.Notes {
-				fmt.Fprintf(&b, "  - %s\n", note)
-			}
-		}
-		b.WriteString("\n")
-	}
-	if result.Interconnect != nil {
-		fmt.Fprintf(&b, "Interconnect\n")
-		fmt.Fprintf(&b, "------------\n")
-		fmt.Fprintf(&b, "Status: %s\n", result.Interconnect.Status)
-		if result.Interconnect.Supported {
-			fmt.Fprintf(&b, "Avg algbw / busbw: %.1f / %.1f GB/s\n", result.Interconnect.AvgAlgBWGBps, result.Interconnect.AvgBusBWGBps)
-			fmt.Fprintf(&b, "Max algbw / busbw: %.1f / %.1f GB/s\n", result.Interconnect.MaxAlgBWGBps, result.Interconnect.MaxBusBWGBps)
-		}
-		for _, note := range result.Interconnect.Notes {
-			fmt.Fprintf(&b, "- %s\n", note)
-		}
-		b.WriteString("\n")
-	}
+	// ── Scorecard table ───────────────────────────────────────────────────────
+	b.WriteString("## Scorecard\n\n")
+	b.WriteString("| GPU | Status | Composite | Compute | TOPS/SM/GHz | Power Sustain | Thermal Sustain | Stability | Interconnect |\n")
+	b.WriteString("|-----|--------|-----------|---------|-------------|---------------|-----------------|-----------|-------------|\n")
+	for _, gpu := range result.GPUs {
+		name := strings.TrimSpace(gpu.Name)
+		if name == "" {
+			name = "Unknown GPU"
+		}
+		interconnect := "-"
+		if gpu.Scores.InterconnectScore > 0 {
+			interconnect = fmt.Sprintf("%.1f", gpu.Scores.InterconnectScore)
+		}
+		topsPerSM := "-"
+		if gpu.Scores.TOPSPerSMPerGHz > 0 {
+			topsPerSM = fmt.Sprintf("%.3f", gpu.Scores.TOPSPerSMPerGHz)
+		}
+		fmt.Fprintf(&b, "| GPU %d %s | %s | **%.2f** | %.2f | %s | %.1f | %.1f | %.1f | %s |\n",
+			gpu.Index, name,
+			gpu.Status,
+			gpu.Scores.CompositeScore,
+			gpu.Scores.ComputeScore,
+			topsPerSM,
+			gpu.Scores.PowerSustainScore,
+			gpu.Scores.ThermalSustainScore,
+			gpu.Scores.StabilityScore,
+			interconnect,
+		)
+	}
+	b.WriteString("\n")
+	// ── Per GPU detail ────────────────────────────────────────────────────────
+	b.WriteString("## Per-GPU Details\n\n")
+	for _, gpu := range result.GPUs {
+		name := strings.TrimSpace(gpu.Name)
+		if name == "" {
+			name = "Unknown GPU"
+		}
+		fmt.Fprintf(&b, "### GPU %d — %s\n\n", gpu.Index, name)
+		// Identity
+		if gpu.BusID != "" {
+			fmt.Fprintf(&b, "- **Bus ID:** %s\n", gpu.BusID)
+		}
+		if gpu.VBIOS != "" {
+			fmt.Fprintf(&b, "- **vBIOS:** %s\n", gpu.VBIOS)
+		}
+		if gpu.ComputeCapability != "" {
+			fmt.Fprintf(&b, "- **Compute capability:** %s\n", gpu.ComputeCapability)
+		}
+		if gpu.MultiprocessorCount > 0 {
+			fmt.Fprintf(&b, "- **SMs:** %d\n", gpu.MultiprocessorCount)
+		}
+		if gpu.PowerLimitW > 0 {
+			fmt.Fprintf(&b, "- **Power limit:** %.0f W (default %.0f W)\n", gpu.PowerLimitW, gpu.DefaultPowerLimitW)
+		}
+		if gpu.LockedGraphicsClockMHz > 0 {
+			fmt.Fprintf(&b, "- **Locked clocks:** GPU %.0f MHz / Mem %.0f MHz\n", gpu.LockedGraphicsClockMHz, gpu.LockedMemoryClockMHz)
+		}
+		b.WriteString("\n")
+		// Steady-state telemetry
+		fmt.Fprintf(&b, "**Steady-state telemetry** (%ds):\n\n", int(gpu.Steady.DurationSec))
+		b.WriteString("| | Avg | P95 |\n|---|---|---|\n")
+		fmt.Fprintf(&b, "| Power | %.1f W | %.1f W |\n", gpu.Steady.AvgPowerW, gpu.Steady.P95PowerW)
+		fmt.Fprintf(&b, "| Temperature | %.1f °C | %.1f °C |\n", gpu.Steady.AvgTempC, gpu.Steady.P95TempC)
+		fmt.Fprintf(&b, "| GPU clock | %.0f MHz | %.0f MHz |\n", gpu.Steady.AvgGraphicsClockMHz, gpu.Steady.P95GraphicsClockMHz)
+		fmt.Fprintf(&b, "| Memory clock | %.0f MHz | %.0f MHz |\n", gpu.Steady.AvgMemoryClockMHz, gpu.Steady.P95MemoryClockMHz)
+		fmt.Fprintf(&b, "| GPU utilisation | %.1f %% | — |\n", gpu.Steady.AvgUsagePct)
+		b.WriteString("\n")
+		// Throttle
+		throttle := formatThrottleLine(gpu.Throttle, gpu.Steady.DurationSec)
+		if throttle != "none" {
+			fmt.Fprintf(&b, "**Throttle:** %s\n\n", throttle)
+		}
+		// Precision results
+		if len(gpu.PrecisionResults) > 0 {
+			b.WriteString("**Precision results:**\n\n")
+			b.WriteString("| Precision | TOPS | Lanes | Iterations |\n|-----------|------|-------|------------|\n")
+			for _, p := range gpu.PrecisionResults {
+				if p.Supported {
+					fmt.Fprintf(&b, "| %s | %.2f | %d | %d |\n", p.Name, p.TeraOpsPerSec, p.Lanes, p.Iterations)
+				} else {
+					fmt.Fprintf(&b, "| %s | — (unsupported) | — | — |\n", p.Name)
+				}
+			}
+			b.WriteString("\n")
+		}
+		// Degradation / Notes
+		if len(gpu.DegradationReasons) > 0 {
+			fmt.Fprintf(&b, "**Degradation reasons:** %s\n\n", strings.Join(gpu.DegradationReasons, ", "))
+		}
+		if len(gpu.Notes) > 0 {
+			b.WriteString("**Notes:**\n\n")
+			for _, note := range gpu.Notes {
+				fmt.Fprintf(&b, "- %s\n", note)
+			}
+			b.WriteString("\n")
+		}
+	}
+	// ── Interconnect ──────────────────────────────────────────────────────────
+	if result.Interconnect != nil {
+		b.WriteString("## Interconnect (NCCL)\n\n")
+		fmt.Fprintf(&b, "**Status:** %s\n\n", result.Interconnect.Status)
+		if result.Interconnect.Supported {
+			b.WriteString("| Metric | Avg | Max |\n|--------|-----|-----|\n")
+			fmt.Fprintf(&b, "| Alg BW | %.1f GB/s | %.1f GB/s |\n", result.Interconnect.AvgAlgBWGBps, result.Interconnect.MaxAlgBWGBps)
+			fmt.Fprintf(&b, "| Bus BW | %.1f GB/s | %.1f GB/s |\n", result.Interconnect.AvgBusBWGBps, result.Interconnect.MaxBusBWGBps)
+			b.WriteString("\n")
+		}
+		for _, note := range result.Interconnect.Notes {
+			fmt.Fprintf(&b, "- %s\n", note)
+		}
+		if len(result.Interconnect.Notes) > 0 {
+			b.WriteString("\n")
+		}
+	}
+	// ── Server Power (IPMI) ───────────────────────────────────────────────────
+	if sp := result.ServerPower; sp != nil {
+		b.WriteString("## Server Power (IPMI)\n\n")
+		if !sp.Available {
+			b.WriteString("IPMI power measurement unavailable.\n\n")
+		} else {
+			b.WriteString("| | Value |\n|---|---|\n")
+			fmt.Fprintf(&b, "| Server idle | %.0f W |\n", sp.IdleW)
+			fmt.Fprintf(&b, "| Server under load | %.0f W |\n", sp.LoadedW)
+			fmt.Fprintf(&b, "| Server delta (load idle) | %.0f W |\n", sp.DeltaW)
+			fmt.Fprintf(&b, "| GPU-reported sum | %.0f W |\n", sp.GPUReportedSumW)
+			if sp.ReportingRatio > 0 {
+				fmt.Fprintf(&b, "| Reporting ratio | %.2f (1.0 = accurate, <0.75 = GPU over-reports) |\n", sp.ReportingRatio)
+			}
+			b.WriteString("\n")
+		}
+		for _, note := range sp.Notes {
+			fmt.Fprintf(&b, "- %s\n", note)
+		}
+		if len(sp.Notes) > 0 {
+			b.WriteString("\n")
+		}
+	}
+	// ── Terminal charts (steady-state only) ───────────────────────────────────
 	if len(charts) > 0 {
-		fmt.Fprintf(&b, "Terminal Charts\n")
-		fmt.Fprintf(&b, "---------------\n")
+		b.WriteString("## Steady-State Charts\n\n")
 		for _, chart := range charts {
 			content := strings.TrimSpace(stripANSIEscapeSequences(chart.Content))
 			if content == "" {
 				continue
 			}
-			fmt.Fprintf(&b, "%s\n", chart.Title)
-			fmt.Fprintf(&b, "%s\n", strings.Repeat("~", len(chart.Title)))
-			fmt.Fprintf(&b, "%s\n\n", content)
+			fmt.Fprintf(&b, "### %s\n\n```\n%s\n```\n\n", chart.Title, content)
 		}
 	}
-	if sp := result.ServerPower; sp != nil {
-		fmt.Fprintf(&b, "Server Power (IPMI)\n")
-		fmt.Fprintf(&b, "-------------------\n")
-		if !sp.Available {
-			fmt.Fprintf(&b, "Unavailable\n")
-		} else {
-			fmt.Fprintf(&b, "  Server idle: %.0f W\n", sp.IdleW)
-			fmt.Fprintf(&b, "  Server under load: %.0f W\n", sp.LoadedW)
-			fmt.Fprintf(&b, "  Server delta: %.0f W\n", sp.DeltaW)
-			fmt.Fprintf(&b, "  GPU reported (sum): %.0f W\n", sp.GPUReportedSumW)
-			if sp.ReportingRatio > 0 {
-				fmt.Fprintf(&b, "  Reporting ratio: %.2f (1.0 = accurate, <0.75 = GPU over-reports)\n", sp.ReportingRatio)
-			}
-		}
-		for _, note := range sp.Notes {
-			fmt.Fprintf(&b, "  Note: %s\n", note)
-		}
-		b.WriteString("\n")
-	}
-	fmt.Fprintf(&b, "Methodology\n")
-	fmt.Fprintf(&b, "-----------\n")
-	fmt.Fprintf(&b, "- Profile %s uses standardized baseline, warmup, steady-state, interconnect, and cooldown phases.\n", result.BenchmarkProfile)
-	fmt.Fprintf(&b, "- Single-GPU compute score comes from bee-gpu-burn cuBLASLt output when available.\n")
-	fmt.Fprintf(&b, "- Thermal and power limitations are inferred from NVIDIA clock event reason counters and sustained telemetry.\n")
-	fmt.Fprintf(&b, "- result.json is the canonical machine-readable source for this benchmark run.\n\n")
-	fmt.Fprintf(&b, "Raw Files\n")
-	fmt.Fprintf(&b, "---------\n")
-	fmt.Fprintf(&b, "- result.json\n")
-	fmt.Fprintf(&b, "- report.txt\n")
-	fmt.Fprintf(&b, "- summary.txt\n")
-	fmt.Fprintf(&b, "- verbose.log\n")
-	fmt.Fprintf(&b, "- gpu-*-baseline-metrics.csv/html/term.txt\n")
-	fmt.Fprintf(&b, "- gpu-*-warmup.log\n")
-	fmt.Fprintf(&b, "- gpu-*-steady.log\n")
-	fmt.Fprintf(&b, "- gpu-*-steady-metrics.csv/html/term.txt\n")
-	fmt.Fprintf(&b, "- gpu-*-cooldown-metrics.csv/html/term.txt\n")
+	// ── Methodology ───────────────────────────────────────────────────────────
+	b.WriteString("## Methodology\n\n")
+	fmt.Fprintf(&b, "- Profile `%s` uses standardized baseline → warmup → steady-state → interconnect → cooldown phases.\n", result.BenchmarkProfile)
+	b.WriteString("- Single-GPU compute score from bee-gpu-burn cuBLASLt when available.\n")
+	b.WriteString("- Thermal and power limitations inferred from NVIDIA clock event reason counters and sustained telemetry.\n")
+	b.WriteString("- `result.json` is the canonical machine-readable source for this benchmark run.\n\n")
+	// ── Raw files ─────────────────────────────────────────────────────────────
+	b.WriteString("## Raw Files\n\n")
+	b.WriteString("- `result.json`\n- `report.md`\n- `summary.txt`\n- `verbose.log`\n")
+	b.WriteString("- `gpu-*-baseline-metrics.csv/html/term.txt`\n")
+	b.WriteString("- `gpu-*-warmup.log`\n")
+	b.WriteString("- `gpu-*-steady.log`\n")
+	b.WriteString("- `gpu-*-steady-metrics.csv/html/term.txt`\n")
+	b.WriteString("- `gpu-*-cooldown-metrics.csv/html/term.txt`\n")
 	if result.Interconnect != nil {
-		fmt.Fprintf(&b, "- nccl-all-reduce.log\n")
+		b.WriteString("- `nccl-all-reduce.log`\n")
 	}
 	return b.String()
 }
+// loadBenchmarkReportCharts loads only steady-state terminal charts (baseline and
+// cooldown charts are not useful for human review).
 func loadBenchmarkReportCharts(runDir string, gpuIndices []int) []benchmarkReportChart {
-	phases := []struct {
-		name  string
-		label string
-	}{
-		{name: "baseline", label: "Baseline"},
-		{name: "steady", label: "Steady State"},
-		{name: "cooldown", label: "Cooldown"},
-	}
 	var charts []benchmarkReportChart
 	for _, idx := range gpuIndices {
-		for _, phase := range phases {
-			path := filepath.Join(runDir, fmt.Sprintf("gpu-%d-%s-metrics-term.txt", idx, phase.name))
-			raw, err := os.ReadFile(path)
-			if err != nil || len(raw) == 0 {
-				continue
-			}
-			charts = append(charts, benchmarkReportChart{
-				Title:   fmt.Sprintf("GPU %d %s", idx, phase.label),
-				Content: string(raw),
-			})
-		}
+		path := filepath.Join(runDir, fmt.Sprintf("gpu-%d-steady-metrics-term.txt", idx))
+		raw, err := os.ReadFile(path)
+		if err != nil || len(raw) == 0 {
+			continue
+		}
+		charts = append(charts, benchmarkReportChart{
+			Title:   fmt.Sprintf("GPU %d — Steady State", idx),
+			Content: string(raw),
+		})
 	}
 	return charts
 }

View File

@@ -137,8 +137,9 @@ func TestRenderBenchmarkReportIncludesFindingsAndScores(t *testing.T) {
 	for _, needle := range []string{
 		"Executive Summary",
 		"GPU 0 spent measurable time under SW power cap.",
-		"Composite score: 1176.00",
-		"fp16_tensor: 700.00 TOPS",
+		"1176.00",
+		"fp16_tensor",
+		"700.00",
 	} {
 		if !strings.Contains(report, needle) {
 			t.Fatalf("report missing %q\n%s", needle, report)
@@ -164,7 +165,7 @@ func TestRenderBenchmarkReportIncludesTerminalChartsWithoutANSI(t *testing.T) {
 	})
 	for _, needle := range []string{
-		"Terminal Charts",
+		"Steady-State Charts",
 		"GPU 0 Steady State",
 		"GPU 0 chart",
 		"42┤───",

View File

@@ -383,10 +383,7 @@ func drawGPUChartSVG(rows []GPUMetricRow, gpuIdx int) string {
 }
 const (
-	ansiRed    = "\033[31m"
-	ansiBlue   = "\033[34m"
-	ansiGreen  = "\033[32m"
-	ansiYellow = "\033[33m"
+	ansiAmber = "\033[38;5;214m"
 	ansiReset = "\033[0m"
 )
@@ -415,10 +412,10 @@ func RenderGPUTerminalChart(rows []GPUMetricRow) string {
 		fn func(GPUMetricRow) float64
 	}
 	defs := []seriesDef{
-		{"Temperature (°C)", ansiRed, func(r GPUMetricRow) float64 { return r.TempC }},
-		{"GPU Usage (%)", ansiBlue, func(r GPUMetricRow) float64 { return r.UsagePct }},
-		{"Power (W)", ansiGreen, func(r GPUMetricRow) float64 { return r.PowerW }},
-		{"Clock (MHz)", ansiYellow, func(r GPUMetricRow) float64 { return r.ClockMHz }},
+		{"Temperature (°C)", ansiAmber, func(r GPUMetricRow) float64 { return r.TempC }},
+		{"GPU Usage (%)", ansiAmber, func(r GPUMetricRow) float64 { return r.UsagePct }},
+		{"Power (W)", ansiAmber, func(r GPUMetricRow) float64 { return r.PowerW }},
+		{"Clock (MHz)", ansiAmber, func(r GPUMetricRow) float64 { return r.ClockMHz }},
 	}
 	var b strings.Builder

View File

@@ -49,6 +49,9 @@ func buildNvidiaStressJob(opts NvidiaStressOptions) (satJob, error) {
 		"--seconds", strconv.Itoa(opts.DurationSec),
 		"--size-mb", strconv.Itoa(opts.SizeMB),
 	}
+	if opts.StaggerSeconds > 0 && len(selected) > 1 {
+		cmd = append(cmd, "--stagger-seconds", strconv.Itoa(opts.StaggerSeconds))
+	}
 	if len(selected) > 0 {
 		cmd = append(cmd, "--devices", joinIndexList(selected))
 	}
@@ -63,6 +66,9 @@ func buildNvidiaStressJob(opts NvidiaStressOptions) (satJob, error) {
 		"bee-john-gpu-stress",
 		"--seconds", strconv.Itoa(opts.DurationSec),
 	}
+	if opts.StaggerSeconds > 0 && len(selected) > 1 {
+		cmd = append(cmd, "--stagger-seconds", strconv.Itoa(opts.StaggerSeconds))
+	}
 	if len(selected) > 0 {
 		cmd = append(cmd, "--devices", joinIndexList(selected))
 	}

View File

@@ -161,13 +161,7 @@ func (s *System) RunPlatformStress(
 	}
 	_ = os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary), 0644)
-	// Pack tar.gz
-	archivePath := filepath.Join(baseDir, "platform-stress-"+stamp+".tar.gz")
-	if err := packPlatformDir(runDir, archivePath); err != nil {
-		return "", fmt.Errorf("pack archive: %w", err)
-	}
-	_ = os.RemoveAll(runDir)
-	return archivePath, nil
+	return runDir, nil
 }
 // collectPhase samples live metrics every second until ctx is done.

View File

@@ -1,6 +1,7 @@
package platform package platform
import ( import (
"bufio"
"os" "os"
"os/exec" "os/exec"
"strings" "strings"
@@ -114,6 +115,8 @@ func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, e
} }
s.collectGPURuntimeHealth(vendor, &health) s.collectGPURuntimeHealth(vendor, &health)
s.collectToRAMHealth(&health)
s.collectUSBExportHealth(&health)
if health.Status != "FAILED" && len(health.Issues) > 0 { if health.Status != "FAILED" && len(health.Issues) > 0 {
health.Status = "PARTIAL" health.Status = "PARTIAL"
@@ -168,6 +171,90 @@ func resolvedToolStatus(display string, candidates ...string) ToolStatus {
return ToolStatus{Name: display} return ToolStatus{Name: display}
} }
// collectToRAMHealth checks whether the LiveCD ISO has been copied to RAM.
// Status values: "ok" = in RAM, "warning" = toram not active (no copy attempted),
// "failed" = toram was requested but medium is not in RAM (copy failed or in progress).
func (s *System) collectToRAMHealth(health *schema.RuntimeHealth) {
inRAM := s.IsLiveMediaInRAM()
active := toramActive()
switch {
case inRAM:
health.ToRAMStatus = "ok"
case active:
// toram was requested but medium is not yet/no longer in RAM
health.ToRAMStatus = "failed"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "toram_copy_failed",
Severity: "warning",
Description: "toram boot parameter is set but the live medium is not mounted from RAM.",
})
default:
health.ToRAMStatus = "warning"
}
}
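The helper `toramActive` is referenced above but not included in this diff. A minimal standalone sketch, assuming it simply looks for the `toram` boot parameter in /proc/cmdline (the function and helper names here are illustrative, not from the codebase):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cmdlineHasToram reports whether a kernel command line contains the
// "toram" boot parameter, either bare or with a value (toram=...).
func cmdlineHasToram(cmdline string) bool {
	for _, tok := range strings.Fields(cmdline) {
		if tok == "toram" || strings.HasPrefix(tok, "toram=") {
			return true
		}
	}
	return false
}

// toramActive reads /proc/cmdline; returns false if it is unreadable.
func toramActive() bool {
	data, err := os.ReadFile("/proc/cmdline")
	if err != nil {
		return false
	}
	return cmdlineHasToram(string(data))
}

func main() {
	fmt.Println(cmdlineHasToram("BOOT_IMAGE=/live/vmlinuz boot=live toram quiet")) // true
	fmt.Println(cmdlineHasToram("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro"))          // false
}
```

Keeping the parsing in a pure function like `cmdlineHasToram` makes the "requested vs. actually in RAM" distinction above easy to unit-test without a live boot.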
// collectUSBExportHealth scans /proc/mounts for a writable USB-backed filesystem
// suitable for log export. Sets USBExportPath to the first match found.
func (s *System) collectUSBExportHealth(health *schema.RuntimeHealth) {
health.USBExportPath = findUSBExportMount()
}
// findUSBExportMount returns the mount point of the first writable USB filesystem
// found in /proc/mounts (vfat, exfat, ext2/3/4, ntfs) whose backing block device
// has USB transport. Returns "" if none found.
func findUSBExportMount() string {
f, err := os.Open("/proc/mounts")
if err != nil {
return ""
}
defer f.Close()
// fs types that are expected on USB export drives
exportFSTypes := map[string]bool{
"vfat": true,
"exfat": true,
"ext2": true,
"ext3": true,
"ext4": true,
"ntfs": true,
"ntfs3": true,
"fuseblk": true,
}
scanner := bufio.NewScanner(f)
for scanner.Scan() {
// fields: device mountpoint fstype options dump pass
fields := strings.Fields(scanner.Text())
if len(fields) < 4 {
continue
}
device, mountPoint, fsType, options := fields[0], fields[1], fields[2], fields[3]
if !exportFSTypes[strings.ToLower(fsType)] {
continue
}
// Skip read-only mounts
opts := strings.Split(options, ",")
readOnly := false
for _, o := range opts {
if strings.TrimSpace(o) == "ro" {
readOnly = true
break
}
}
if readOnly {
continue
}
// Check USB transport via lsblk on the device
if !strings.HasPrefix(device, "/dev/") {
continue
}
if blockDeviceTransport(device) == "usb" {
return mountPoint
}
}
return ""
}
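`blockDeviceTransport` is also outside this diff. A minimal sketch, assuming it shells out to `lsblk -no TRAN` and takes the first non-blank line (the real helper likely also resolves a partition such as /dev/sdb1 to its parent disk, since lsblk reports the transport on the disk row; names here are mine):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// parseLsblkTransport extracts the transport value from `lsblk -no TRAN`
// output, skipping blank rows (partitions print an empty TRAN field).
func parseLsblkTransport(out string) string {
	for _, line := range strings.Split(out, "\n") {
		if t := strings.TrimSpace(line); t != "" {
			return t
		}
	}
	return ""
}

// blockDeviceTransport asks lsblk for the transport type ("usb", "sata",
// "nvme", ...) of the given device; returns "" on any error.
func blockDeviceTransport(device string) string {
	out, err := exec.Command("lsblk", "-no", "TRAN", device).Output()
	if err != nil {
		return ""
	}
	return parseLsblkTransport(string(out))
}

func main() {
	fmt.Println(parseLsblkTransport("usb\n\n")) // usb
}
```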
func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) {
	lsmodText := commandText("lsmod")


@@ -384,22 +384,36 @@ func (s *System) RunNCCLTests(ctx context.Context, baseDir string, logFunc func(
	), logFunc)
}
-func (s *System) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
+func (s *System) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, staggerSec int, logFunc func(string)) (string, error) {
	selected, err := resolveDCGMGPUIndices(gpuIndices)
	if err != nil {
		return "", err
	}
-	profCmd, err := resolveDCGMProfTesterCommand("--no-dcgm-validation", "-t", "1004", "-d", strconv.Itoa(normalizeNvidiaBurnDuration(durationSec)))
-	if err != nil {
-		return "", err
-	}
+	var (
+		profCmd []string
+		profEnv []string
+	)
+	if staggerSec > 0 && len(selected) > 1 {
+		profCmd = []string{
+			"bee-dcgmproftester-staggered",
+			"--seconds", strconv.Itoa(normalizeNvidiaBurnDuration(durationSec)),
+			"--stagger-seconds", strconv.Itoa(staggerSec),
+			"--devices", joinIndexList(selected),
+		}
+	} else {
+		profCmd, err = resolveDCGMProfTesterCommand("--no-dcgm-validation", "-t", "1004", "-d", strconv.Itoa(normalizeNvidiaBurnDuration(durationSec)))
+		if err != nil {
+			return "", err
+		}
+		profEnv = nvidiaVisibleDevicesEnv(selected)
+	}
	return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-compute", withNvidiaPersistenceMode(
		satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
		satJob{name: "02-dcgmi-version.log", cmd: []string{"dcgmi", "-v"}},
		satJob{
			name: "03-dcgmproftester.log",
			cmd:  profCmd,
-			env:  nvidiaVisibleDevicesEnv(selected),
+			env:  profEnv,
			collectGPU: true,
			gpuIndices: selected,
		},
@@ -531,9 +545,13 @@ func memoryStressSizeArg() string {
	return fmt.Sprintf("%dM", targetMB)
}
-func (s *System) RunMemoryAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
-	sizeMB := envInt("BEE_MEMTESTER_SIZE_MB", 128)
-	passes := envInt("BEE_MEMTESTER_PASSES", 1)
+func (s *System) RunMemoryAcceptancePack(ctx context.Context, baseDir string, sizeMB, passes int, logFunc func(string)) (string, error) {
+	if sizeMB <= 0 {
+		sizeMB = 256
+	}
+	if passes <= 0 {
+		passes = 1
+	}
	return runAcceptancePackCtx(ctx, baseDir, "memory", []satJob{
		{name: "01-free-before.log", cmd: []string{"free", "-h"}},
		{name: "02-memtester.log", cmd: []string{"memtester", fmt.Sprintf("%dM", sizeMB), fmt.Sprintf("%d", passes)}},
@@ -590,7 +608,7 @@ func (s *System) RunCPUAcceptancePack(ctx context.Context, baseDir string, durat
	}, logFunc)
}
-func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
+func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, extended bool, logFunc func(string)) (string, error) {
	if baseDir == "" {
		baseDir = "/var/log/bee-sat"
	}
@@ -622,7 +640,7 @@ func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, l
		break
	}
	prefix := fmt.Sprintf("%02d-%s", index+1, filepath.Base(devPath))
-	commands := storageSATCommands(devPath)
+	commands := storageSATCommands(devPath, extended)
	for cmdIndex, job := range commands {
		if ctx.Err() != nil {
			break
@@ -644,11 +662,7 @@ func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, l
	if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
		return "", err
	}
-	archive := filepath.Join(baseDir, "storage-"+ts+".tar.gz")
-	if err := createTarGz(archive, runDir); err != nil {
-		return "", err
-	}
-	return archive, nil
+	return runDir, nil
}
type satJob struct {
@@ -834,11 +848,7 @@ func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []sa
		}
	}
-	archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz")
-	if err := createTarGz(archive, runDir); err != nil {
-		return "", err
-	}
-	return archive, nil
+	return runDir, nil
}
func updateNvidiaGPUStatus(perGPU map[int]*nvidiaGPUStatusFile, idx int, status, jobName, detail string) {
@@ -901,7 +911,7 @@ func writeNvidiaGPUStatusFiles(runDir, overall string, perGPU map[int]*nvidiaGPU
		entry.Health = "UNKNOWN"
	}
	if entry.Name == "" {
-		entry.Name = "unknown"
+		entry.Name = "Unknown GPU"
	}
	var body strings.Builder
	fmt.Fprintf(&body, "gpu_index=%d\n", entry.Index)
@@ -1086,17 +1096,25 @@ func listStorageDevices() ([]string, error) {
	return parseStorageDevices(string(out)), nil
}
-func storageSATCommands(devPath string) []satJob {
+func storageSATCommands(devPath string, extended bool) []satJob {
	if strings.Contains(filepath.Base(devPath), "nvme") {
+		selfTestLevel := "1"
+		if extended {
+			selfTestLevel = "2"
+		}
		return []satJob{
			{name: "nvme-id-ctrl", cmd: []string{"nvme", "id-ctrl", devPath, "-o", "json"}},
			{name: "nvme-smart-log", cmd: []string{"nvme", "smart-log", devPath, "-o", "json"}},
-			{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "-s", "1", "--wait"}},
+			{name: "nvme-device-self-test", cmd: []string{"nvme", "device-self-test", devPath, "-s", selfTestLevel, "--wait"}},
		}
	}
+	smartTestType := "short"
+	if extended {
+		smartTestType = "long"
+	}
	return []satJob{
		{name: "smartctl-health", cmd: []string{"smartctl", "-H", "-A", devPath}},
-		{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", "short", devPath}},
+		{name: "smartctl-self-test-short", cmd: []string{"smartctl", "-t", smartTestType, devPath}},
	}
}
} }


@@ -20,7 +20,7 @@ type FanStressOptions struct {
	Phase1DurSec int // first load phase duration in seconds (default 300)
	PauseSec     int // pause between the two load phases (default 60)
	Phase2DurSec int // second load phase duration in seconds (default 300)
-	SizeMB       int // GPU memory to allocate per GPU during stress (default 64)
+	SizeMB       int // GPU memory to allocate per GPU during stress (0 = auto: 95% of VRAM)
	GPUIndices   []int // which GPU indices to stress (empty = all detected)
}
@@ -223,11 +223,7 @@ func (s *System) RunFanStressTest(ctx context.Context, baseDir string, opts FanS
		return "", err
	}
-	archive := filepath.Join(baseDir, "fan-stress-"+ts+".tar.gz")
-	if err := createTarGz(archive, runDir); err != nil {
-		return "", err
-	}
-	return archive, nil
+	return runDir, nil
}
func applyFanStressDefaults(opts *FanStressOptions) {
@@ -243,9 +239,8 @@ func applyFanStressDefaults(opts *FanStressOptions) {
	if opts.Phase2DurSec <= 0 {
		opts.Phase2DurSec = 300
	}
-	if opts.SizeMB <= 0 {
-		opts.SizeMB = 64
-	}
+	// SizeMB == 0 means "auto" (worker picks 95% of GPU VRAM for maximum power draw).
+	// Leave at 0 to avoid passing a too-small size that starves the tensor-core path.
}
// sampleFanStressRow collects all metrics for one telemetry sample.


@@ -14,12 +14,12 @@ import (
func TestStorageSATCommands(t *testing.T) {
	t.Parallel()
-	nvme := storageSATCommands("/dev/nvme0n1")
+	nvme := storageSATCommands("/dev/nvme0n1", false)
	if len(nvme) != 3 || nvme[2].cmd[0] != "nvme" {
		t.Fatalf("unexpected nvme commands: %#v", nvme)
	}
-	sata := storageSATCommands("/dev/sda")
+	sata := storageSATCommands("/dev/sda", false)
	if len(sata) != 2 || sata[0].cmd[0] != "smartctl" {
		t.Fatalf("unexpected sata commands: %#v", sata)
	}


@@ -20,6 +20,7 @@ var techDumpFixedCommands = []struct {
	{Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"},
	{Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"},
	{Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"},
	{Name: "lspci", Args: []string{"-vvv"}, File: "lspci-vvv.txt"},
	{Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"},
	{Name: "sensors", Args: []string{"-j"}, File: "sensors.json"},
	{Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"},


@@ -70,6 +70,7 @@ type NvidiaStressOptions struct {
	Loader            string
	GPUIndices        []int
	ExcludeGPUIndices []int
	StaggerSeconds    int
}
func New() *System {


@@ -22,6 +22,10 @@ type RuntimeHealth struct {
	CUDAReady     bool   `json:"cuda_ready,omitempty"`
	NvidiaGSPMode string `json:"nvidia_gsp_mode,omitempty"` // "gsp-on", "gsp-off", "gsp-stuck"
	NetworkStatus string `json:"network_status,omitempty"`
	// ToRAMStatus: "ok" (ISO in RAM), "warning" (toram not active), "failed" (toram active but copy failed)
	ToRAMStatus string `json:"toram_status,omitempty"`
	// USBExportPath: mount point of the first writable USB drive found, empty if none.
	USBExportPath string `json:"usb_export_path,omitempty"`
	Issues   []RuntimeIssue         `json:"issues,omitempty"`
	Tools    []RuntimeToolStatus    `json:"tools,omitempty"`
	Services []RuntimeServiceStatus `json:"services,omitempty"`
@@ -183,6 +187,13 @@ type HardwarePCIeDevice struct {
	BatteryTemperatureC    *float64 `json:"battery_temperature_c,omitempty"`
	BatteryVoltageV        *float64 `json:"battery_voltage_v,omitempty"`
	BatteryReplaceRequired *bool    `json:"battery_replace_required,omitempty"`
	SFPPresent      *bool    `json:"sfp_present,omitempty"`
	SFPIdentifier   *string  `json:"sfp_identifier,omitempty"`
	SFPConnector    *string  `json:"sfp_connector,omitempty"`
	SFPVendor       *string  `json:"sfp_vendor,omitempty"`
	SFPPartNumber   *string  `json:"sfp_part_number,omitempty"`
	SFPSerialNumber *string  `json:"sfp_serial_number,omitempty"`
	SFPWavelengthNM *float64 `json:"sfp_wavelength_nm,omitempty"`
	SFPTemperatureC *float64 `json:"sfp_temperature_c,omitempty"`
	SFPTXPowerDBM   *float64 `json:"sfp_tx_power_dbm,omitempty"`
	SFPRXPowerDBM   *float64 `json:"sfp_rx_power_dbm,omitempty"`


@@ -222,7 +222,21 @@ func formatSplitTaskName(baseName, selectionLabel string) string {
}
func buildNvidiaTaskSet(target string, priority int, createdAt time.Time, params taskParams, baseName string, appRef *app.App, idPrefix string) ([]*Task, error) {
-	if !shouldSplitHomogeneousNvidiaTarget(target) {
+	if !shouldSplitHomogeneousNvidiaTarget(target) || params.ParallelGPUs {
+		// Parallel mode (or non-splittable target): one task for all selected GPUs.
+		if params.ParallelGPUs && shouldSplitHomogeneousNvidiaTarget(target) {
+			// Resolve the selected GPU indices so ExcludeGPUIndices is applied.
+			gpus, err := apiListNvidiaGPUs(appRef)
+			if err != nil {
+				return nil, err
+			}
+			resolved, err := expandSelectedGPUIndices(gpus, params.GPUIndices, params.ExcludeGPUIndices)
+			if err != nil {
+				return nil, err
+			}
+			params.GPUIndices = resolved
+			params.ExcludeGPUIndices = nil
+		}
		t := &Task{
			ID:   newJobID(idPrefix),
			Name: baseName,
@@ -262,6 +276,53 @@ func buildNvidiaTaskSet(target string, priority int, createdAt time.Time, params
	return tasks, nil
}
// expandSelectedGPUIndices returns the sorted list of selected GPU indices after
// applying include/exclude filters, without splitting by model.
func expandSelectedGPUIndices(gpus []platform.NvidiaGPU, include, exclude []int) ([]int, error) {
indexed := make(map[int]struct{}, len(gpus))
allIndices := make([]int, 0, len(gpus))
for _, gpu := range gpus {
indexed[gpu.Index] = struct{}{}
allIndices = append(allIndices, gpu.Index)
}
sort.Ints(allIndices)
selected := allIndices
if len(include) > 0 {
selected = make([]int, 0, len(include))
seen := make(map[int]struct{}, len(include))
for _, idx := range include {
if _, ok := indexed[idx]; !ok {
continue
}
if _, dup := seen[idx]; dup {
continue
}
seen[idx] = struct{}{}
selected = append(selected, idx)
}
sort.Ints(selected)
}
if len(exclude) > 0 {
skip := make(map[int]struct{}, len(exclude))
for _, idx := range exclude {
skip[idx] = struct{}{}
}
filtered := selected[:0]
for _, idx := range selected {
if _, ok := skip[idx]; ok {
continue
}
filtered = append(filtered, idx)
}
selected = filtered
}
if len(selected) == 0 {
return nil, fmt.Errorf("no NVIDIA GPUs selected")
}
return selected, nil
}
// ── SSE helpers ───────────────────────────────────────────────────────────────
func sseWrite(w http.ResponseWriter, event, data string) bool {
@@ -423,9 +484,10 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
	var body struct {
		Duration int `json:"duration"`
-		DiagLevel int `json:"diag_level"`
+		StressMode bool `json:"stress_mode"`
		GPUIndices        []int `json:"gpu_indices"`
		ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
+		StaggerGPUStart bool `json:"stagger_gpu_start"`
		Loader      string `json:"loader"`
		Profile     string `json:"profile"`
		DisplayName string `json:"display_name"`
@@ -444,9 +506,10 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
	}
	params := taskParams{
		Duration: body.Duration,
-		DiagLevel: body.DiagLevel,
+		StressMode: body.StressMode,
		GPUIndices:        body.GPUIndices,
		ExcludeGPUIndices: body.ExcludeGPUIndices,
+		StaggerGPUStart: body.StaggerGPUStart,
		Loader:      body.Loader,
		BurnProfile: body.Profile,
		DisplayName: body.DisplayName,
@@ -1315,107 +1378,3 @@ func (h *handler) rollbackPendingNetworkChange() error {
	return nil
}
// ── Display / Screen Resolution ───────────────────────────────────────────────
type displayMode struct {
Output string `json:"output"`
Mode string `json:"mode"`
Current bool `json:"current"`
}
type displayInfo struct {
Output string `json:"output"`
Modes []displayMode `json:"modes"`
Current string `json:"current"`
}
var xrandrOutputRE = regexp.MustCompile(`^(\S+)\s+connected`)
var xrandrModeRE = regexp.MustCompile(`^\s{3}(\d+x\d+)\s`)
var xrandrCurrentRE = regexp.MustCompile(`\*`)
func parseXrandrOutput(out string) []displayInfo {
var infos []displayInfo
var cur *displayInfo
for _, line := range strings.Split(out, "\n") {
if m := xrandrOutputRE.FindStringSubmatch(line); m != nil {
if cur != nil {
infos = append(infos, *cur)
}
cur = &displayInfo{Output: m[1]}
continue
}
if cur == nil {
continue
}
if m := xrandrModeRE.FindStringSubmatch(line); m != nil {
isCurrent := xrandrCurrentRE.MatchString(line)
mode := displayMode{Output: cur.Output, Mode: m[1], Current: isCurrent}
cur.Modes = append(cur.Modes, mode)
if isCurrent {
cur.Current = m[1]
}
}
}
if cur != nil {
infos = append(infos, *cur)
}
return infos
}
func xrandrCommand(args ...string) *exec.Cmd {
cmd := exec.Command("xrandr", args...)
env := append([]string{}, os.Environ()...)
hasDisplay := false
hasXAuthority := false
for _, kv := range env {
if strings.HasPrefix(kv, "DISPLAY=") && strings.TrimPrefix(kv, "DISPLAY=") != "" {
hasDisplay = true
}
if strings.HasPrefix(kv, "XAUTHORITY=") && strings.TrimPrefix(kv, "XAUTHORITY=") != "" {
hasXAuthority = true
}
}
if !hasDisplay {
env = append(env, "DISPLAY=:0")
}
if !hasXAuthority {
env = append(env, "XAUTHORITY=/home/bee/.Xauthority")
}
cmd.Env = env
return cmd
}
func (h *handler) handleAPIDisplayResolutions(w http.ResponseWriter, _ *http.Request) {
out, err := xrandrCommand().Output()
if err != nil {
writeError(w, http.StatusInternalServerError, "xrandr: "+err.Error())
return
}
writeJSON(w, parseXrandrOutput(string(out)))
}
func (h *handler) handleAPIDisplaySet(w http.ResponseWriter, r *http.Request) {
var req struct {
Output string `json:"output"`
Mode string `json:"mode"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Output == "" || req.Mode == "" {
writeError(w, http.StatusBadRequest, "output and mode are required")
return
}
// Validate mode looks like WxH to prevent injection
if !regexp.MustCompile(`^\d+x\d+$`).MatchString(req.Mode) {
writeError(w, http.StatusBadRequest, "invalid mode format")
return
}
// Validate output name (no special chars)
if !regexp.MustCompile(`^[A-Za-z0-9_\-]+$`).MatchString(req.Output) {
writeError(w, http.StatusBadRequest, "invalid output name")
return
}
if out, err := xrandrCommand("--output", req.Output, "--mode", req.Mode).CombinedOutput(); err != nil {
writeError(w, http.StatusInternalServerError, "xrandr: "+strings.TrimSpace(string(out)))
return
}
writeJSON(w, map[string]string{"status": "ok", "output": req.Output, "mode": req.Mode})
}


@@ -10,30 +10,6 @@ import (
	"bee/audit/internal/platform"
)
func TestXrandrCommandAddsDefaultX11Env(t *testing.T) {
t.Setenv("DISPLAY", "")
t.Setenv("XAUTHORITY", "")
cmd := xrandrCommand("--query")
var hasDisplay bool
var hasXAuthority bool
for _, kv := range cmd.Env {
if kv == "DISPLAY=:0" {
hasDisplay = true
}
if kv == "XAUTHORITY=/home/bee/.Xauthority" {
hasXAuthority = true
}
}
if !hasDisplay {
t.Fatalf("DISPLAY not injected: %v", cmd.Env)
}
if !hasXAuthority {
t.Fatalf("XAUTHORITY not injected: %v", cmd.Env)
}
}
func TestHandleAPISATRunDecodesBodyWithoutContentLength(t *testing.T) {
	globalQueue.mu.Lock()
	originalTasks := globalQueue.tasks


@@ -83,6 +83,10 @@ func renderMetricChartSVG(title string, labels []string, times []time.Time, data
	}
	}
	// Downsample to at most ~1400 points (one per pixel) before building SVG.
	times, datasets = downsampleTimeSeries(times, datasets, 1400)
	pointCount = len(times)
	statsLabel := chartStatsLabel(datasets)
	legendItems := []metricChartSeries{}
@@ -196,6 +200,19 @@ func drawGPUOverviewChartSVG(title string, labels []string, times []time.Time, s
	}
	}
// Downsample to at most ~1400 points before building SVG.
{
datasets := make([][]float64, len(series))
for i := range series {
datasets[i] = series[i].Values
}
times, datasets = downsampleTimeSeries(times, datasets, 1400)
pointCount = len(times)
for i := range series {
series[i].Values = datasets[i]
}
}
	scales := make([]chartScale, len(series))
	for i := range series {
		min, max := chartSeriesBounds(series[i].Values)
@@ -626,6 +643,87 @@ func writeTimelineBoundaries(b *strings.Builder, layout chartLayout, start, end
	b.WriteString(`</g>` + "\n")
}
// downsampleTimeSeries reduces the time series to at most maxPts points using
// min-max bucketing. Each bucket contributes the index of its min and max value
// (using the first full-length dataset as the reference series). All parallel
// datasets are sampled at those same indices so all series stay aligned.
// If len(times) <= maxPts the inputs are returned unchanged.
func downsampleTimeSeries(times []time.Time, datasets [][]float64, maxPts int) ([]time.Time, [][]float64) {
n := len(times)
if n <= maxPts || maxPts <= 0 {
return times, datasets
}
buckets := maxPts / 2
if buckets < 1 {
buckets = 1
}
// Use the first dataset that has the same length as times as the reference
// for deciding which two indices to keep per bucket.
var ref []float64
for _, ds := range datasets {
if len(ds) == n {
ref = ds
break
}
}
selected := make([]int, 0, maxPts)
bucketSize := float64(n) / float64(buckets)
for b := 0; b < buckets; b++ {
lo := int(math.Round(float64(b) * bucketSize))
hi := int(math.Round(float64(b+1) * bucketSize))
if hi > n {
hi = n
}
if lo >= hi {
continue
}
if ref == nil {
selected = append(selected, lo)
if hi-1 != lo {
selected = append(selected, hi-1)
}
continue
}
minIdx, maxIdx := lo, lo
for i := lo + 1; i < hi; i++ {
if ref[i] < ref[minIdx] {
minIdx = i
}
if ref[i] > ref[maxIdx] {
maxIdx = i
}
}
if minIdx <= maxIdx {
selected = append(selected, minIdx)
if maxIdx != minIdx {
selected = append(selected, maxIdx)
}
} else {
selected = append(selected, maxIdx)
if minIdx != maxIdx {
selected = append(selected, minIdx)
}
}
}
outTimes := make([]time.Time, len(selected))
for i, idx := range selected {
outTimes[i] = times[idx]
}
outDatasets := make([][]float64, len(datasets))
for d, ds := range datasets {
if len(ds) != n {
outDatasets[d] = ds
continue
}
out := make([]float64, len(selected))
for i, idx := range selected {
out[i] = ds[idx]
}
outDatasets[d] = out
}
return outTimes, outDatasets
}
func chartXForTime(ts, start, end time.Time, left, right int) float64 {
	if !end.After(start) {
		return float64(left+right) / 2

File diff suppressed because it is too large


@@ -295,10 +295,6 @@ func NewHandler(opts HandlerOptions) http.Handler {
	// Tools
	mux.HandleFunc("GET /api/tools/check", h.handleAPIToolsCheck)
-	// Display
-	mux.HandleFunc("GET /api/display/resolutions", h.handleAPIDisplayResolutions)
-	mux.HandleFunc("POST /api/display/set", h.handleAPIDisplaySet)
	// GPU presence / tools
	mux.HandleFunc("GET /api/gpu/presence", h.handleAPIGPUPresence)
	mux.HandleFunc("GET /api/gpu/nvidia", h.handleAPIGNVIDIAGPUs)


@@ -693,8 +693,8 @@ func TestBenchmarkPageRendersSavedResultsTable(t *testing.T) {
	for _, needle := range []string{
		`Benchmark Results`,
		`Composite score by saved benchmark run and GPU.`,
-		`NVIDIA H100 PCIe / GPU 0`,
-		`NVIDIA H100 PCIe / GPU 1`,
+		`GPU #0 — NVIDIA H100 PCIe`,
+		`GPU #1 — NVIDIA H100 PCIe`,
		`#1`,
		wantTime,
		`1176.25`,
@@ -741,8 +741,8 @@ func TestBurnPageRendersGoalBasedNVIDIACards(t *testing.T) {
	for _, needle := range []string{
		`NVIDIA Max Compute Load`,
		`dcgmproftester`,
-		`targeted_stress remain in <a href="/validate">Validate</a>`,
-		`NVIDIA Interconnect Test (NCCL all_reduce_perf)`,
+		`NCCL`,
+		`Validate → Stress mode`,
		`id="burn-gpu-list"`,
	} {
		if !strings.Contains(body, needle) {
if !strings.Contains(body, needle) { if !strings.Contains(body, needle) {
@@ -1094,6 +1094,7 @@ func TestDashboardRendersRuntimeHealthTable(t *testing.T) {
	}
	body := rec.Body.String()
	for _, needle := range []string{
		// Runtime Health card — LiveCD checks only
		`Runtime Health`,
		`<th>Check</th><th>Status</th><th>Source</th><th>Issue</th>`,
		`Export Directory`,
@@ -1102,16 +1103,18 @@ func TestDashboardRendersRuntimeHealthTable(t *testing.T) {
		`CUDA / ROCm`,
		`Required Utilities`,
		`Bee Services`,
-		`<td>CPU</td>`,
-		`<td>Memory</td>`,
-		`<td>Storage</td>`,
-		`<td>GPU</td>`,
		`CUDA runtime is not ready for GPU SAT.`,
		`Missing: nvidia-smi`,
		`bee-nvidia=inactive`,
-		`cpu SAT: FAILED`,
-		`storage SAT: FAILED`,
-		`sat:nvidia`,
+		// Hardware Summary card — component health badges
+		`Hardware Summary`,
+		`>CPU<`,
+		`>Memory<`,
+		`>Storage<`,
+		`>GPU<`,
+		`>PSU<`,
+		`badge-warn`, // cpu Warning badge
+		`badge-err`,  // storage Critical badge
	} {
		if !strings.Contains(body, needle) {
			t.Fatalf("dashboard missing %q: %s", needle, body)


@@ -115,10 +115,12 @@ type Task struct {
// taskParams holds optional parameters parsed from the run request.
type taskParams struct {
	Duration int `json:"duration,omitempty"`
-	DiagLevel int `json:"diag_level,omitempty"`
+	StressMode bool `json:"stress_mode,omitempty"`
	GPUIndices        []int `json:"gpu_indices,omitempty"`
	ExcludeGPUIndices []int `json:"exclude_gpu_indices,omitempty"`
+	StaggerGPUStart bool `json:"stagger_gpu_start,omitempty"`
	SizeMB int `json:"size_mb,omitempty"`
+	Passes int `json:"passes,omitempty"`
	Loader           string `json:"loader,omitempty"`
	BurnProfile      string `json:"burn_profile,omitempty"`
	BenchmarkProfile string `json:"benchmark_profile,omitempty"`
@@ -161,6 +163,13 @@ func resolveBurnPreset(profile string) burnPreset {
	}
}
func boolToNvidiaStaggerSeconds(enabled bool, selected []int) int {
if enabled && len(selected) > 1 {
return 180
}
return 0
}
func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions {
	acceptanceCycles := []platform.PlatformStressCycle{
		{LoadSec: 85, IdleSec: 5},
@@ -215,11 +224,11 @@ var globalQueue = &taskQueue{trigger: make(chan struct{}, 1)}
const maxTaskHistory = 50
var (
-	runMemoryAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
-		return a.RunMemoryAcceptancePackCtx(ctx, baseDir, logFunc)
+	runMemoryAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, sizeMB, passes int, logFunc func(string)) (string, error) {
+		return a.RunMemoryAcceptancePackCtx(ctx, baseDir, sizeMB, passes, logFunc)
	}
-	runStorageAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, logFunc func(string)) (string, error) {
-		return a.RunStorageAcceptancePackCtx(ctx, baseDir, logFunc)
+	runStorageAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, extended bool, logFunc func(string)) (string, error) {
+		return a.RunStorageAcceptancePackCtx(ctx, baseDir, extended, logFunc)
	}
	runCPUAcceptancePackCtx = func(a *app.App, ctx context.Context, baseDir string, durationSec int, logFunc func(string)) (string, error) {
		return a.RunCPUAcceptancePackCtx(ctx, baseDir, durationSec, logFunc)
@@ -552,7 +561,10 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
break break
} }
diagLevel := t.params.DiagLevel diagLevel := 2
if t.params.StressMode {
diagLevel = 3
}
if len(t.params.GPUIndices) > 0 || diagLevel > 0 { if len(t.params.GPUIndices) > 0 || diagLevel > 0 {
result, e := a.RunNvidiaAcceptancePackWithOptions( result, e := a.RunNvidiaAcceptancePackWithOptions(
ctx, "", diagLevel, t.params.GPUIndices, j.append, ctx, "", diagLevel, t.params.GPUIndices, j.append,
@@ -597,7 +609,11 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, j.append) staggerSec := boolToNvidiaStaggerSeconds(t.params.StaggerGPUStart, t.params.GPUIndices)
if staggerSec > 0 {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU", staggerSec))
}
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, staggerSec, j.append)
case "nvidia-targeted-power": case "nvidia-targeted-power":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
@@ -652,19 +668,24 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
Loader: t.params.Loader, Loader: t.params.Loader,
GPUIndices: t.params.GPUIndices, GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices, ExcludeGPUIndices: t.params.ExcludeGPUIndices,
StaggerSeconds: boolToNvidiaStaggerSeconds(t.params.StaggerGPUStart, t.params.GPUIndices),
}, j.append) }, j.append)
case "memory": case "memory":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
break break
} }
archive, err = runMemoryAcceptancePackCtx(a, ctx, "", j.append) sizeMB, passes := 256, 1
if t.params.StressMode {
sizeMB, passes = 1024, 3
}
archive, err = runMemoryAcceptancePackCtx(a, ctx, "", sizeMB, passes, j.append)
case "storage": case "storage":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
break break
} }
archive, err = runStorageAcceptancePackCtx(a, ctx, "", j.append) archive, err = runStorageAcceptancePackCtx(a, ctx, "", t.params.StressMode, j.append)
case "cpu": case "cpu":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
@@ -675,8 +696,12 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
if dur <= 0 { if dur <= 0 {
if t.params.StressMode {
dur = 1800
} else {
dur = 60 dur = 60
} }
}
j.append(fmt.Sprintf("CPU stress duration: %ds", dur)) j.append(fmt.Sprintf("CPU stress duration: %ds", dur))
archive, err = runCPUAcceptancePackCtx(a, ctx, "", dur, j.append) archive, err = runCPUAcceptancePackCtx(a, ctx, "", dur, j.append)
case "amd": case "amd":


@@ -422,7 +422,7 @@ func TestWriteTaskReportArtifactsIncludesBenchmarkResultsForTask(t *testing.T) {
 	for _, needle := range []string{
 		`Benchmark Results`,
 		`Composite score for this benchmark task.`,
-		`NVIDIA H100 PCIe / GPU 0`,
+		`GPU #0 — NVIDIA H100 PCIe`,
 		`1176.25`,
 	} {
 		if !strings.Contains(html, needle) {

bible

Submodule bible updated: 688b87e98d...1d89a4918e


@@ -0,0 +1,117 @@
# GPU Model Name Propagation
How GPU model names are detected, stored, and displayed throughout the project.

---
## Detection Sources
There are **two separate pipelines** for GPU model names — they use different structs and don't share state.
### Pipeline A — Live / SAT (nvidia-smi query at runtime)
**File:** `audit/internal/platform/sat.go`
- `ListNvidiaGPUs()` → `NvidiaGPU.Name` (field: `name`, from `nvidia-smi --query-gpu=index,name,...`)
- `ListNvidiaGPUStatuses()` → `NvidiaGPUStatus.Name`
- Used by: GPU selection UI, live metrics labels, burn/stress test logic
### Pipeline B — Benchmark results
**File:** `audit/internal/platform/benchmark.go`, line 124
- `queryBenchmarkGPUInfo(selected)` → `benchmarkGPUInfo.Name`
- Stored in `BenchmarkGPUResult.Name` (`json:"name,omitempty"`)
- Used by: benchmark history table, benchmark report
### Pipeline C — Hardware audit JSON (PCIe schema)
**File:** `audit/internal/schema/hardware.go`
- `HardwarePCIeDevice.Model *string` (field name is **Model**, not Name)
- For AMD GPUs: populated by `audit/internal/collector/amdgpu.go` from `info.Product`
- For NVIDIA GPUs: **NOT populated** by `audit/internal/collector/nvidia.go` — the NVIDIA enricher sets telemetry/status but skips the Model field
- Used by: hardware summary page (`hwDescribeGPU` in `pages.go:487`)
---
## Key Inconsistency: NVIDIA PCIe Model is Never Set
`enrichPCIeWithNVIDIAData()` in `audit/internal/collector/nvidia.go` enriches NVIDIA PCIe devices with telemetry and status but does **not** populate `HardwarePCIeDevice.Model`.
This means:
- Hardware summary page shows "Unknown GPU" for all NVIDIA devices (falls back at `pages.go:486`)
- AMD GPUs do have their model populated
The fix would be: copy `gpu.Name` from the SAT pipeline into `dev.Model` inside `enrichPCIeWithNVIDIAData`.
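A minimal sketch of that fix, using hypothetical, trimmed-down versions of the two structs (the real types live in `collector/nvidia.go` and `schema/hardware.go`; matching the two pipelines by PCI bus address is an assumption here):

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stand-in for the nvidia-smi side of the pipeline.
type nvidiaGPUInfo struct {
	PCIBusID string
	Name     string // from `nvidia-smi --query-gpu=name`
}

// Hypothetical stand-in for schema.HardwarePCIeDevice.
type hardwarePCIeDevice struct {
	Address string
	Model   *string // nil until an enricher fills it
}

// enrichModel copies the nvidia-smi model name into the PCIe device's
// Model field, matching devices by PCI address (case-insensitive).
func enrichModel(devs []hardwarePCIeDevice, gpus []nvidiaGPUInfo) {
	byAddr := make(map[string]string, len(gpus))
	for _, g := range gpus {
		byAddr[strings.ToLower(g.PCIBusID)] = g.Name
	}
	for i := range devs {
		if name, ok := byAddr[strings.ToLower(devs[i].Address)]; ok && name != "" {
			n := name // copy before taking the address
			devs[i].Model = &n
		}
	}
}

func main() {
	devs := []hardwarePCIeDevice{{Address: "0000:3b:00.0"}}
	gpus := []nvidiaGPUInfo{{PCIBusID: "0000:3B:00.0", Name: "NVIDIA H100 PCIe"}}
	enrichModel(devs, gpus)
	fmt.Println(*devs[0].Model) // NVIDIA H100 PCIe
}
```

Copying into a fresh local before taking the address keeps each device's `Model` pointer independent of the map value.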

---
## Benchmark History "Unknown GPU" Issue
**Symptom:** Benchmark history table shows "GPU #N — Unknown GPU" columns instead of real GPU model names.
**Root cause:** `BenchmarkGPUResult.Name` has tag `json:"name,omitempty"`. If `queryBenchmarkGPUInfo()` fails (warns at `benchmark.go:126`) or returns empty names, the Name field is never set and is omitted from JSON. Loaded results have empty Name → falls back to "Unknown GPU" at `pages.go:2226, 2237`.
This happens for:
- Older result files saved before the `Name` field was added
- Runs where nvidia-smi query failed before the benchmark started
---
## Fallback Strings — Current State
| Location | File | Fallback string |
|---|---|---|
| Hardware summary (PCIe) | `pages.go:486` | `"Unknown GPU"` |
| Benchmark report summary | `benchmark_report.go:43` | `"Unknown GPU"` |
| Benchmark report scorecard | `benchmark_report.go:93` | `"Unknown"` ← inconsistent |
| Benchmark report detail | `benchmark_report.go:122` | `"Unknown GPU"` |
| Benchmark history per-GPU col | `pages.go:2226` | `"Unknown GPU"` |
| Benchmark history parallel col | `pages.go:2237` | `"Unknown GPU"` |
| SAT status file write | `sat.go:922` | `"unknown"` ← lowercase, inconsistent |
| GPU selection API | `api.go:163` | `"GPU N"` (no "Unknown") |
**Rule:** all UI fallbacks should use `"Unknown GPU"`. The two outliers are `benchmark_report.go:93` (`"Unknown"`) and `sat.go:922` (`"unknown"`).

---
## GPU Selection UI
**File:** `audit/internal/webui/pages.go`
- Source: `GET /api/gpus` → `api.go` → `ListNvidiaGPUs()` → live nvidia-smi
- Render: `'GPU ' + gpu.index + ' — ' + gpu.name + ' · ' + mem`
- Fallback: `gpu.name || 'GPU ' + idx` (JS, line ~1432)
This always shows the correct model because it queries nvidia-smi live. It is **not** connected to benchmark result data.

---
## Data Flow Summary
```
nvidia-smi (live)
└─ ListNvidiaGPUs() → NvidiaGPU.Name
├─ GPU selection UI (always correct)
├─ Live metrics labels (charts_svg.go)
└─ SAT/burn status file (sat.go)
nvidia-smi (at benchmark start)
└─ queryBenchmarkGPUInfo() → benchmarkGPUInfo.Name
└─ BenchmarkGPUResult.Name (json:"name,omitempty")
├─ Benchmark report
└─ Benchmark history table columns
nvidia-smi / lspci (audit collection)
└─ HardwarePCIeDevice.Model (NVIDIA: NOT populated; AMD: populated)
└─ Hardware summary page hwDescribeGPU()
```
---
## What Needs Fixing
1. **NVIDIA PCIe Model**: `enrichPCIeWithNVIDIAData()` should set `dev.Model = &gpu.Name`
2. **Fallback consistency**: `benchmark_report.go:93` should say `"Unknown GPU"` not `"Unknown"`; `sat.go:922` should say `"Unknown GPU"` not `"unknown"`
3. **Old benchmark JSONs** — no fix possible for already-saved results with missing names (display-only issue)


@@ -36,7 +36,6 @@ typedef void *CUstream;
 #define MAX_CUBLAS_PROFILES 5
 #define MIN_PROFILE_BUDGET_BYTES ((size_t)4u * 1024u * 1024u)
 #define MIN_STREAM_BUDGET_BYTES ((size_t)64u * 1024u * 1024u)
-#define STRESS_LAUNCH_DEPTH 8

 static const char *ptx_source =
 	".version 6.0\n"
@@ -344,7 +343,6 @@ static int run_ptx_fallback(struct cuda_api *api,
 	unsigned long iterations = 0;
 	int mp_count = 0;
 	int stream_count = 1;
-	int launches_per_wave = 0;

 	memset(report, 0, sizeof(*report));
 	snprintf(report->backend, sizeof(report->backend), "driver-ptx");
@@ -419,12 +417,10 @@ static int run_ptx_fallback(struct cuda_api *api,
 	unsigned int threads = 256;
-	double start = now_seconds();
-	double deadline = start + (double)seconds;
+	double deadline = now_seconds() + (double)seconds;
+	double next_sync = now_seconds() + 1.0;
 	while (now_seconds() < deadline) {
-		launches_per_wave = 0;
-		for (int depth = 0; depth < STRESS_LAUNCH_DEPTH && now_seconds() < deadline; depth++) {
-			int launched_this_batch = 0;
+		int launched = 0;
 		for (int lane = 0; lane < stream_count; lane++) {
 			unsigned int blocks = (unsigned int)((words[lane] + threads - 1) / threads);
 			if (!check_rc(api,
@@ -442,21 +438,21 @@ static int run_ptx_fallback(struct cuda_api *api,
 				NULL))) {
 				goto fail;
 			}
-			launches_per_wave++;
-			launched_this_batch++;
+			launched++;
+			iterations++;
 		}
-			if (launched_this_batch <= 0) {
-				break;
-			}
-		}
-		if (launches_per_wave <= 0) {
+		if (launched <= 0) {
 			goto fail;
 		}
+		double now = now_seconds();
+		if (now >= next_sync || now >= deadline) {
 			if (!check_rc(api, "cuCtxSynchronize", api->cuCtxSynchronize())) {
 				goto fail;
 			}
-		iterations += (unsigned long)launches_per_wave;
+			next_sync = now + 1.0;
+		}
 	}
+	api->cuCtxSynchronize();
 	if (!check_rc(api, "cuMemcpyDtoH", api->cuMemcpyDtoH(sample, device_mem[0], sizeof(sample)))) {
 		goto fail;
@@ -468,11 +464,10 @@ static int run_ptx_fallback(struct cuda_api *api,
 	report->iterations = iterations;
 	snprintf(report->details,
 		sizeof(report->details),
-		"fallback_int32=OK requested_mb=%d actual_mb=%d streams=%d queue_depth=%d per_stream_mb=%zu iterations=%lu\n",
+		"fallback_int32=OK requested_mb=%d actual_mb=%d streams=%d per_stream_mb=%zu iterations=%lu\n",
 		size_mb,
 		report->buffer_mb,
 		report->stream_count,
-		STRESS_LAUNCH_DEPTH,
 		bytes_per_stream[0] / (1024u * 1024u),
 		iterations);
@@ -1140,7 +1135,6 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 	int stream_count = 1;
 	int profile_count = (int)(sizeof(k_profiles) / sizeof(k_profiles[0]));
 	int prepared_count = 0;
-	int wave_launches = 0;
 	size_t requested_budget = 0;
 	size_t total_budget = 0;
 	size_t per_profile_budget = 0;
@@ -1207,11 +1201,10 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 	report->buffer_mb = (int)(total_budget / (1024u * 1024u));
 	append_detail(report->details,
 		sizeof(report->details),
-		"requested_mb=%d actual_mb=%d streams=%d queue_depth=%d mp_count=%d per_worker_mb=%zu\n",
+		"requested_mb=%d actual_mb=%d streams=%d mp_count=%d per_worker_mb=%zu\n",
 		size_mb,
 		report->buffer_mb,
 		report->stream_count,
-		STRESS_LAUNCH_DEPTH,
 		mp_count,
 		per_profile_budget / (1024u * 1024u));
@@ -1260,11 +1253,15 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 		return 0;
 	}

+	/* Keep the GPU queue continuously full by submitting kernels without
+	 * synchronizing after every wave. A sync barrier after each small batch
+	 * creates CPU↔GPU ping-pong gaps that prevent full TDP utilisation,
+	 * especially when individual kernels are short. Instead we sync at most
+	 * once per second (for error detection) and once at the very end. */
 	double deadline = now_seconds() + (double)seconds;
+	double next_sync = now_seconds() + 1.0;
 	while (now_seconds() < deadline) {
-		wave_launches = 0;
-		for (int depth = 0; depth < STRESS_LAUNCH_DEPTH && now_seconds() < deadline; depth++) {
-			int launched_this_batch = 0;
+		int launched = 0;
 		for (int i = 0; i < prepared_count; i++) {
 			if (!prepared[i].ready) {
 				continue;
@@ -1284,16 +1281,13 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 			}
 			prepared[i].iterations++;
 			report->iterations++;
-			wave_launches++;
-			launched_this_batch++;
+			launched++;
 		}
-			if (launched_this_batch <= 0) {
-				break;
-			}
-		}
-		if (wave_launches <= 0) {
+		if (launched <= 0) {
 			break;
 		}
+		double now = now_seconds();
+		if (now >= next_sync || now >= deadline) {
 			if (!check_rc(cuda, "cuCtxSynchronize", cuda->cuCtxSynchronize())) {
 				for (int i = 0; i < prepared_count; i++) {
 					destroy_profile(&cublas, cuda, &prepared[i]);
@@ -1303,7 +1297,11 @@ static int run_cublaslt_stress(struct cuda_api *cuda,
 				cuda->cuCtxDestroy(ctx);
 				return 0;
 			}
+			next_sync = now + 1.0;
 		}
 	}
+
+	/* Final drain — ensure all queued work finishes before we read results. */
+	cuda->cuCtxSynchronize();
 	for (int i = 0; i < prepared_count; i++) {
 		if (!prepared[i].ready) {


@@ -1,9 +1,9 @@
 set color_normal=light-gray/black
-set color_highlight=white/dark-gray
+set color_highlight=yellow/black

 if [ -e /boot/grub/splash.png ]; then
 	set theme=/boot/grub/live-theme/theme.txt
 else
-	set menu_color_normal=cyan/black
-	set menu_color_highlight=white/dark-gray
+	set menu_color_normal=yellow/black
+	set menu_color_highlight=white/brown
 fi


@@ -10,20 +10,15 @@ import os
 W, H = 1920, 1080

-GLYPHS = {
-    'E': ["11111", "10000", "11110", "10000", "10000", "10000", "11111"],
-    'A': ["01110", "10001", "10001", "11111", "10001", "10001", "10001"],
-    'S': ["01111", "10000", "10000", "01110", "00001", "00001", "11110"],
-    'Y': ["10001", "10001", "01010", "00100", "00100", "00100", "00100"],
-    'B': ["11110", "10001", "10001", "11110", "10001", "10001", "11110"],
-    '-': ["00000", "00000", "11111", "00000", "00000", "00000", "00000"],
-}
-TITLE = "EASY-BEE"
-SUBTITLE = "Hardware Audit LiveCD"
-CELL = 30
-GLYPH_GAP = 18
-ROW_GAP = 6
+ASCII_ART = [
+    " ███████╗ █████╗ ███████╗██╗ ██╗ ██████╗ ███████╗███████╗",
+    " ██╔════╝██╔══██╗██╔════╝╚██╗ ██╔╝ ██╔══██╗██╔════╝██╔════╝",
+    " █████╗ ███████║███████╗ ╚████╔╝ █████╗██████╔╝█████╗ █████╗",
+    " ██╔══╝ ██╔══██║╚════██║ ╚██╔╝ ╚════╝██╔══██╗██╔══╝ ██╔══╝",
+    " ███████╗██║ ██║███████║ ██║ ██████╔╝███████╗███████╗",
+    " ╚══════╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝",
+]
+SUBTITLE = " Hardware Audit LiveCD"

 FG = (0xF6, 0xD0, 0x47)
 FG_DIM = (0xD4, 0xA9, 0x1C)
@@ -31,6 +26,12 @@ SHADOW = (0x5E, 0x47, 0x05)
 SUB = (0x96, 0x7A, 0x17)
 BG = (0x05, 0x05, 0x05)

+MONO_FONT_CANDIDATES = [
+    '/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf',
+    '/usr/share/fonts/truetype/liberation2/LiberationMono-Bold.ttf',
+    '/usr/share/fonts/truetype/liberation/LiberationMono-Bold.ttf',
+    '/usr/share/fonts/truetype/freefont/FreeMonoBold.ttf',
+]
 SUB_FONT_CANDIDATES = [
     '/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf',
     '/usr/share/fonts/truetype/liberation2/LiberationSans-Bold.ttf',
@@ -39,43 +40,34 @@ SUB_FONT_CANDIDATES = [
 ]

-def load_font(size):
-    for path in SUB_FONT_CANDIDATES:
+def load_font(candidates, size):
+    for path in candidates:
         if os.path.exists(path):
             return ImageFont.truetype(path, size)
     return ImageFont.load_default()

-def glyph_width(ch):
-    return len(GLYPHS[ch][0])
+def mono_metrics(font):
+    probe = Image.new('L', (W, H), 0)
+    draw = ImageDraw.Draw(probe)
+    char_w = int(round(draw.textlength("M", font=font)))
+    bb = draw.textbbox((0, 0), "Mg", font=font)
+    char_h = bb[3] - bb[1]
+    return char_w, char_h

-def render_logo_mask():
-    width_cells = 0
-    for idx, ch in enumerate(TITLE):
-        width_cells += glyph_width(ch)
-        if idx != len(TITLE) - 1:
-            width_cells += 1
-    mask_w = width_cells * CELL + (len(TITLE) - 1) * GLYPH_GAP
-    mask_h = 7 * CELL + 6 * ROW_GAP
-    mask = Image.new('L', (mask_w, mask_h), 0)
+def render_ascii_mask(font, lines, char_w, char_h, line_gap):
+    width = max(len(line) for line in lines) * char_w
+    height = len(lines) * char_h + line_gap * (len(lines) - 1)
+    mask = Image.new('L', (width, height), 0)
     draw = ImageDraw.Draw(mask)
-    cx = 0
-    for idx, ch in enumerate(TITLE):
-        glyph = GLYPHS[ch]
-        for row_idx, row in enumerate(glyph):
-            for col_idx, cell in enumerate(row):
-                if cell != '1':
-                    continue
-                x0 = cx + col_idx * CELL
-                y0 = row_idx * (CELL + ROW_GAP)
-                x1 = x0 + CELL - 4
-                y1 = y0 + CELL - 4
-                draw.rounded_rectangle((x0, y0, x1, y1), radius=4, fill=255)
-        cx += glyph_width(ch) * CELL
-        if idx != len(TITLE) - 1:
-            cx += CELL + GLYPH_GAP
+    for row, line in enumerate(lines):
+        y = row * (char_h + line_gap)
+        for col, ch in enumerate(line):
+            if ch == ' ':
+                continue
+            x = col * char_w
+            draw.text((x, y), ch, font=font, fill=255)
     return mask
@@ -90,20 +82,28 @@ glow_draw.ellipse((520, 340, 1400, 760), fill=(255, 190, 40, 36))
 glow = glow.filter(ImageFilter.GaussianBlur(60))
 img = Image.alpha_composite(img.convert('RGBA'), glow)

-logo_mask = render_logo_mask()
+TARGET_LOGO_W = 400
+max_chars = max(len(line) for line in ASCII_ART)
+_probe_font = load_font(MONO_FONT_CANDIDATES, 64)
+_probe_cw, _ = mono_metrics(_probe_font)
+font_size_logo = max(6, int(64 * TARGET_LOGO_W / (_probe_cw * max_chars)))
+font_logo = load_font(MONO_FONT_CANDIDATES, font_size_logo)
+char_w, char_h = mono_metrics(font_logo)
+logo_mask = render_ascii_mask(font_logo, ASCII_ART, char_w, char_h, 2)
 logo_w, logo_h = logo_mask.size
 logo_x = (W - logo_w) // 2
-logo_y = 290
+logo_y = 380

-shadow_mask = logo_mask.filter(ImageFilter.GaussianBlur(2))
-img.paste(SHADOW, (logo_x + 16, logo_y + 14), shadow_mask)
-img.paste(FG_DIM, (logo_x + 8, logo_y + 7), logo_mask)
+sh_off = max(1, font_size_logo // 6)
+shadow_mask = logo_mask.filter(ImageFilter.GaussianBlur(1))
+img.paste(SHADOW, (logo_x + sh_off * 2, logo_y + sh_off * 2), shadow_mask)
+img.paste(FG_DIM, (logo_x + sh_off, logo_y + sh_off), logo_mask)
 img.paste(FG, (logo_x, logo_y), logo_mask)

-font_sub = load_font(30)
+font_sub = load_font(SUB_FONT_CANDIDATES, 30)
 sub_bb = draw.textbbox((0, 0), SUBTITLE, font=font_sub)
 sub_x = (W - (sub_bb[2] - sub_bb[0])) // 2
-sub_y = logo_y + logo_h + 54
+sub_y = logo_y + logo_h + 48
 draw = ImageDraw.Draw(img)
 draw.text((sub_x + 2, sub_y + 2), SUBTITLE, font=font_sub, fill=(35, 28, 6))
 draw.text((sub_x, sub_y), SUBTITLE, font=font_sub, fill=SUB)


@@ -0,0 +1,110 @@
#!/bin/sh
set -eu
SECONDS=300
STAGGER_SECONDS=180
DEVICES=""
EXCLUDE=""
usage() {
echo "usage: $0 [--seconds N] [--stagger-seconds N] [--devices 0,1] [--exclude 2,3]" >&2
exit 2
}
normalize_list() {
echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}
contains_csv() {
needle="$1"
haystack="${2:-}"
echo ",${haystack}," | grep -q ",${needle},"
}
resolve_dcgmproftester() {
for candidate in dcgmproftester dcgmproftester13 dcgmproftester12 dcgmproftester11; do
if command -v "${candidate}" >/dev/null 2>&1; then
command -v "${candidate}"
return 0
fi
done
return 1
}
while [ "$#" -gt 0 ]; do
case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--stagger-seconds) [ "$#" -ge 2 ] || usage; STAGGER_SECONDS="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
*) usage ;;
esac
done
PROF=$(resolve_dcgmproftester) || { echo "dcgmproftester not found in PATH" >&2; exit 1; }
ALL_DEVICES=$(nvidia-smi --query-gpu=index --format=csv,noheader,nounits 2>/dev/null | sed 's/[[:space:]]//g' | awk 'NF' | paste -sd, -)
[ -n "${ALL_DEVICES}" ] || { echo "nvidia-smi found no NVIDIA GPUs" >&2; exit 1; }
DEVICES=$(normalize_list "${DEVICES}")
EXCLUDE=$(normalize_list "${EXCLUDE}")
SELECTED="${DEVICES}"
if [ -z "${SELECTED}" ]; then
SELECTED="${ALL_DEVICES}"
fi
FINAL=""
for id in $(echo "${SELECTED}" | tr ',' ' '); do
[ -n "${id}" ] || continue
if contains_csv "${id}" "${EXCLUDE}"; then
continue
fi
if [ -z "${FINAL}" ]; then
FINAL="${id}"
else
FINAL="${FINAL},${id}"
fi
done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
echo "loader=dcgmproftester-staggered"
echo "selected_gpus=${FINAL}"
echo "stagger_seconds=${STAGGER_SECONDS}"
TMP_DIR=$(mktemp -d)
trap 'rm -rf "${TMP_DIR}"' EXIT INT TERM
GPU_COUNT=$(echo "${FINAL}" | tr ',' '\n' | awk 'NF' | wc -l | tr -d '[:space:]')
gpu_pos=0
WORKERS=""
for id in $(echo "${FINAL}" | tr ',' ' '); do
gpu_pos=$((gpu_pos + 1))
log="${TMP_DIR}/gpu-${id}.log"
extra_sec=$(( STAGGER_SECONDS * (GPU_COUNT - gpu_pos) ))
gpu_seconds=$(( SECONDS + extra_sec ))
echo "starting gpu ${id} seconds=${gpu_seconds}"
CUDA_VISIBLE_DEVICES="${id}" "${PROF}" --no-dcgm-validation -t 1004 -d "${gpu_seconds}" >"${log}" 2>&1 &
pid=$!
WORKERS="${WORKERS} ${pid}:${id}:${log}"
if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${gpu_pos}" -lt "${GPU_COUNT}" ]; then
sleep "${STAGGER_SECONDS}"
fi
done
status=0
for spec in ${WORKERS}; do
pid=${spec%%:*}
rest=${spec#*:}
id=${rest%%:*}
log=${rest#*:}
if wait "${pid}"; then
echo "gpu ${id} finished: OK"
else
rc=$?
echo "gpu ${id} finished: FAILED rc=${rc}"
status=1
fi
sed "s/^/[gpu ${id}] /" "${log}" || true
done
exit "${status}"

iso/overlay/usr/local/bin/bee-gpu-burn Normal file → Executable file

@@ -2,13 +2,14 @@
 set -eu

 SECONDS=5
+STAGGER_SECONDS=0
 SIZE_MB=0
 DEVICES=""
 EXCLUDE=""
 WORKER="/usr/local/lib/bee/bee-gpu-burn-worker"

 usage() {
-	echo "usage: $0 [--seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3]" >&2
+	echo "usage: $0 [--seconds N] [--stagger-seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3]" >&2
 	exit 2
 }
@@ -25,6 +26,7 @@ contains_csv() {
 while [ "$#" -gt 0 ]; do
 	case "$1" in
 		--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
+		--stagger-seconds) [ "$#" -ge 2 ] || usage; STAGGER_SECONDS="$2"; shift 2 ;;
 		--size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;;
 		--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
 		--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
@@ -61,14 +63,18 @@ done
 echo "loader=bee-gpu-burn"
 echo "selected_gpus=${FINAL}"
+echo "stagger_seconds=${STAGGER_SECONDS}"
 export CUDA_DEVICE_ORDER="PCI_BUS_ID"

 TMP_DIR=$(mktemp -d)
 trap 'rm -rf "${TMP_DIR}"' EXIT INT TERM

+GPU_COUNT=$(echo "${FINAL}" | tr ',' '\n' | awk 'NF' | wc -l | tr -d '[:space:]')
+gpu_pos=0
 WORKERS=""
 for id in $(echo "${FINAL}" | tr ',' ' '); do
+	gpu_pos=$((gpu_pos + 1))
 	log="${TMP_DIR}/gpu-${id}.log"
 	gpu_size_mb="${SIZE_MB}"
 	if [ "${gpu_size_mb}" -le 0 ] 2>/dev/null; then
@@ -79,11 +85,16 @@ for id in $(echo "${FINAL}" | tr ',' ' '); do
 			gpu_size_mb=512
 		fi
 	fi
-	echo "starting gpu ${id} size=${gpu_size_mb}MB"
+	extra_sec=$(( STAGGER_SECONDS * (GPU_COUNT - gpu_pos) ))
+	gpu_seconds=$(( SECONDS + extra_sec ))
+	echo "starting gpu ${id} size=${gpu_size_mb}MB seconds=${gpu_seconds}"
 	CUDA_VISIBLE_DEVICES="${id}" \
-		"${WORKER}" --device 0 --seconds "${SECONDS}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 &
+		"${WORKER}" --device 0 --seconds "${gpu_seconds}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 &
 	pid=$!
 	WORKERS="${WORKERS} ${pid}:${id}:${log}"
+	if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${gpu_pos}" -lt "${GPU_COUNT}" ]; then
+		sleep "${STAGGER_SECONDS}"
+	fi
 done

 status=0

iso/overlay/usr/local/bin/bee-john-gpu-stress Normal file → Executable file

@@ -2,6 +2,7 @@
 set -eu

 DURATION_SEC=300
+STAGGER_SECONDS=0
 DEVICES=""
 EXCLUDE=""
 FORMAT=""
@@ -12,7 +13,7 @@ export OCL_ICD_VENDORS="/etc/OpenCL/vendors"
 export LD_LIBRARY_PATH="/usr/lib:/usr/local/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

 usage() {
-	echo "usage: $0 [--seconds N] [--devices 0,1] [--exclude 2,3] [--format name]" >&2
+	echo "usage: $0 [--seconds N] [--stagger-seconds N] [--devices 0,1] [--exclude 2,3] [--format name]" >&2
 	exit 2
 }
@@ -118,6 +119,7 @@ ensure_opencl_ready() {
 while [ "$#" -gt 0 ]; do
 	case "$1" in
 		--seconds|-t) [ "$#" -ge 2 ] || usage; DURATION_SEC="$2"; shift 2 ;;
+		--stagger-seconds) [ "$#" -ge 2 ] || usage; STAGGER_SECONDS="$2"; shift 2 ;;
 		--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
 		--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
 		--format) [ "$#" -ge 2 ] || usage; FORMAT="$2"; shift 2 ;;
@@ -170,6 +172,7 @@ done
 echo "loader=john"
 echo "selected_gpus=${FINAL}"
 echo "john_devices=${JOHN_DEVICES}"
+echo "stagger_seconds=${STAGGER_SECONDS}"

 cd "${JOHN_DIR}"
@@ -232,14 +235,21 @@ trap cleanup EXIT INT TERM
 echo "format=${CHOSEN_FORMAT}"
 echo "target_seconds=${DURATION_SEC}"
 echo "slice_seconds=${TEST_SLICE_SECONDS}"
-DEADLINE=$(( $(date +%s) + DURATION_SEC ))
+TOTAL_DEVICES=$(echo "${JOHN_DEVICES}" | tr ',' '\n' | awk 'NF' | wc -l | tr -d '[:space:]')
 _first=1
+pos=0
 for opencl_id in $(echo "${JOHN_DEVICES}" | tr ',' ' '); do
+	pos=$((pos + 1))
 	[ "${_first}" = "1" ] || sleep 3
 	_first=0
-	run_john_loop "${opencl_id}" "${DEADLINE}" &
+	extra_sec=$(( STAGGER_SECONDS * (TOTAL_DEVICES - pos) ))
+	deadline=$(( $(date +%s) + DURATION_SEC + extra_sec ))
+	run_john_loop "${opencl_id}" "${deadline}" &
 	pid=$!
 	PIDS="${PIDS} ${pid}"
+	if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${pos}" -lt "${TOTAL_DEVICES}" ]; then
+		sleep "${STAGGER_SECONDS}"
+	fi
 done

 FAIL=0
 for pid in ${PIDS}; do


@@ -21,8 +21,13 @@ read_nvidia_modules_flavor() {
 log "kernel: $(uname -r)"

-# Skip if no NVIDIA GPU present (PCI vendor 10de)
-if ! lspci -nn 2>/dev/null | grep -qi '10de:'; then
+# Skip if no NVIDIA display/compute GPU is present.
+# Match only display-class PCI functions (0300 VGA, 0302 3D controller) from vendor 10de.
+have_nvidia_gpu() {
+	lspci -Dn 2>/dev/null | awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ { found=1; exit } END { exit(found ? 0 : 1) }'
+}
+if ! have_nvidia_gpu; then
 	log "no NVIDIA GPU detected — skipping module load"
 	exit 0
 fi


@@ -14,7 +14,7 @@ log() {
 }

 have_nvidia_gpu() {
-	lspci -nn 2>/dev/null | grep -qi '10de:'
+	lspci -Dn 2>/dev/null | awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ { found=1; exit } END { exit(found ? 0 : 1) }'
 }

 service_active() {