Compare commits

11 Commits
v7.3 ... v7.9

Author SHA1 Message Date
Mikhail Chusavitin
4f94ebcb2c Add HPC tuning: PCIe ASPM off, C-states, performance CPU governor
- grub.cfg + isolinux/live.cfg.in: add pcie_aspm=off,
  intel_idle.max_cstate=1 and processor.max_cstate=1 to all
  non-failsafe boot entries
- bee-hpc-tuning: new script that sets all CPU cores to performance
  governor via sysfs and logs THP state at boot
- bee-hpc-tuning.service: runs before bee-nvidia and bee-audit
- 9000-bee-setup.hook.chroot: enable service and mark script executable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:07:32 +03:00
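The commit above applies the performance governor from a boot script. Purely as an illustration, the same sysfs write can be sketched in Go. The function name `setCPUGovernor` and the injectable `sysfsRoot` parameter are assumptions made for this sketch and do not appear in the repository.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setCPUGovernor writes the requested governor into every
// cpuN/cpufreq/scaling_governor file under sysfsRoot and returns how many
// cores were updated. Hypothetical helper, not part of the commit.
func setCPUGovernor(sysfsRoot, governor string) (int, error) {
	pattern := filepath.Join(sysfsRoot, "cpu[0-9]*", "cpufreq", "scaling_governor")
	paths, err := filepath.Glob(pattern)
	if err != nil {
		return 0, err
	}
	updated := 0
	for _, p := range paths {
		if err := os.WriteFile(p, []byte(governor+"\n"), 0o644); err != nil {
			return updated, fmt.Errorf("set governor for %s: %w", p, err)
		}
		updated++
	}
	return updated, nil
}

func main() {
	// Needs root on a real system; a missing cpufreq directory simply yields 0.
	n, err := setCPUGovernor("/sys/devices/system/cpu", "performance")
	if err != nil {
		fmt.Fprintln(os.Stderr, "governor not fully applied:", err)
	}
	fmt.Printf("updated %d cores\n", n)
}
```

Passing the sysfs root as a parameter keeps the helper testable against a temporary directory.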
Mikhail Chusavitin
05c1fde233 Warn on PCIe link speed degradation and collect lspci -vvv in techdump
- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
  and ErrorDescription when current link speed is below maximum negotiated
  speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 12:42:17 +03:00
825ef6b98a Add USB export drive and LiveCD-in-RAM checks to Runtime Health
- schema: add ToRAMStatus and USBExportPath fields to RuntimeHealth
- platform/runtime.go: collectToRAMHealth (ok/warning/failed based on
  IsLiveMediaInRAM + toramActive) and collectUSBExportHealth (scans
  /proc/mounts + lsblk for writable USB-backed filesystems)
- pages.go: add USB Export Drive and LiveCD in RAM rows to the health table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:05:27 +03:00
ba16021cdb Fix GPU model propagation, export filenames, PSU/service status, and chart perf
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
  nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
  → 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
  activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
  .tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 10:05:27 +03:00
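The min-max downsampling mentioned for charts_svg.go is not shown in this part of the diff. A minimal sketch of the general technique follows, assuming equal-width buckets that each contribute their minimum and maximum in order of occurrence. Only the 1400-point cap comes from the commit message; the function name `minMaxDownsample` and the bucket layout are illustrative.

```go
package main

import "fmt"

// minMaxDownsample reduces values to at most maxPoints samples. The input is
// split into maxPoints/2 buckets and each bucket contributes its minimum and
// maximum in their original order, so spikes survive the reduction.
// Hypothetical helper; the real charts_svg.go code may differ.
func minMaxDownsample(values []float64, maxPoints int) []float64 {
	if maxPoints < 2 || len(values) <= maxPoints {
		return append([]float64(nil), values...)
	}
	buckets := maxPoints / 2
	out := make([]float64, 0, buckets*2)
	for b := 0; b < buckets; b++ {
		start := b * len(values) / buckets
		end := (b + 1) * len(values) / buckets
		minIdx, maxIdx := start, start
		for i := start + 1; i < end; i++ {
			if values[i] < values[minIdx] {
				minIdx = i
			}
			if values[i] > values[maxIdx] {
				maxIdx = i
			}
		}
		switch {
		case minIdx == maxIdx: // flat bucket: one point is enough
			out = append(out, values[minIdx])
		case minIdx < maxIdx:
			out = append(out, values[minIdx], values[maxIdx])
		default:
			out = append(out, values[maxIdx], values[minIdx])
		}
	}
	return out
}

func main() {
	samples := make([]float64, 10000)
	for i := range samples {
		samples[i] = float64(i % 50)
	}
	samples[4242] = 500 // a single spike that plain striding could drop
	reduced := minMaxDownsample(samples, 1400)
	fmt.Println(len(samples), "->", len(reduced))
}
```

Unlike taking every n-th sample, this keeps short peaks visible while bounding the number of points the SVG path has to carry.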
Mikhail Chusavitin
bb1218ddd4 Fix GPU inventory: exclude BMC virtual VGA, show real NVIDIA model names
Two issues:
1. BMC/management VGA chips (e.g. Huawei iBMC Hi171x, ASPEED) were included
   in GPU inventory because shouldIncludePCIeDevice only checked the PCI class,
   not the device name. Added a name-based filter for known BMC/management
   patterns when the class is VGA/display/3d.

2. New NVIDIA GPUs (e.g. RTX PRO 6000 Blackwell, device ID 2bb5) showed as
   "Device 2bb5" because lspci's database lags behind. Added "name" to the
   nvidia-smi query and use it to override dev.Model during enrichment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:57:26 +03:00
Mikhail Chusavitin
65faae8ede Remove hpl from SAT run-all targets — no backend route exists
hpl was listed in baseTargets and stressOnlyTargets but /api/sat/hpl/run
was never registered, causing a 405 Method Not Allowed (not valid JSON)
error when Validate one by one was triggered in stress mode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:30:32 +03:00
05241f2e0e Redesign dashboard: split Runtime Health and Hardware Summary
- Runtime Health now shows only LiveCD system status (services, tools,
  drivers, network, CUDA/ROCm) — hardware component rows removed
- Hardware Summary now shows server components with readable descriptions
  (model, count×size) and component-status.json health badges
- Add Network Adapters row to Hardware Summary
- SFP module static info (vendor, PN, SN, connector, type, wavelength)
  now collected via ethtool -m regardless of carrier state
- PSU statuses from IPMI audit written to component-status.json so PSU
  badge shows actual status after first audit instead of UNKNOWN

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:41:23 +03:00
Mikhail Chusavitin
c1690a084b Fix app tests that mutate global defaults 2026-04-09 15:28:25 +03:00
Mikhail Chusavitin
9481ca2805 Add staged NVIDIA burn ramp-up mode 2026-04-09 15:21:14 +03:00
Mikhail Chusavitin
a78fdadd88 Refine validate and burn profile layout 2026-04-09 15:14:48 +03:00
Mikhail Chusavitin
4ef403898f Tighten NVIDIA GPU PCI detection 2026-04-09 15:14:48 +03:00
36 changed files with 1276 additions and 249 deletions

View File

@@ -117,7 +117,7 @@ type satRunner interface {
  RunNvidiaAcceptancePackWithOptions(ctx context.Context, baseDir string, diagLevel int, gpuIndices []int, logFunc func(string)) (string, error)
  RunNvidiaTargetedStressValidatePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
  RunNvidiaBenchmark(ctx context.Context, baseDir string, opts platform.NvidiaBenchmarkOptions, logFunc func(string)) (string, error)
- RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
+ RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, staggerSec int, logFunc func(string)) (string, error)
  RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
  RunNvidiaPulseTestPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error)
  RunNvidiaBandwidthPack(ctx context.Context, baseDir string, gpuIndices []int, logFunc func(string)) (string, error)
@@ -190,6 +190,7 @@ func (a *App) RunAudit(runtimeMode runtimeenv.Mode, output string) (string, erro
  }
  result := collector.Run(runtimeMode)
  applyLatestSATStatuses(&result.Hardware, DefaultSATBaseDir, a.StatusDB)
+ writePSUStatusesToDB(a.StatusDB, result.Hardware.PowerSupplies)
  if health, err := ReadRuntimeHealth(DefaultRuntimeJSONPath); err == nil {
  result.Runtime = &health
  }
@@ -566,11 +567,11 @@ func (a *App) RunNvidiaBenchmarkCtx(ctx context.Context, baseDir string, opts pl
  return a.sat.RunNvidiaBenchmark(ctx, baseDir, opts, logFunc)
  }
- func (a *App) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
+ func (a *App) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, staggerSec int, logFunc func(string)) (string, error) {
  if strings.TrimSpace(baseDir) == "" {
  baseDir = DefaultSATBaseDir
  }
- return a.sat.RunNvidiaOfficialComputePack(ctx, baseDir, durationSec, gpuIndices, logFunc)
+ return a.sat.RunNvidiaOfficialComputePack(ctx, baseDir, durationSec, gpuIndices, staggerSec, logFunc)
  }
  func (a *App) RunNvidiaTargetedPowerPack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) {
@@ -926,6 +927,41 @@ func bodyOr(body, fallback string) string {
  return body
  }
+ // writePSUStatusesToDB records PSU statuses collected during audit into the
+ // component-status DB so they are visible in the Hardware Summary card.
+ // PSU status is sourced from IPMI (ipmitool fru + sdr) during audit.
+ func writePSUStatusesToDB(db *ComponentStatusDB, psus []schema.HardwarePowerSupply) {
+ if db == nil || len(psus) == 0 {
+ return
+ }
+ const source = "audit:ipmi"
+ worstStatus := "OK"
+ for _, psu := range psus {
+ if psu.Status == nil {
+ continue
+ }
+ slot := "?"
+ if psu.Slot != nil {
+ slot = *psu.Slot
+ }
+ st := *psu.Status
+ detail := ""
+ if psu.ErrorDescription != nil {
+ detail = *psu.ErrorDescription
+ }
+ db.Record("psu:"+slot, source, st, detail)
+ switch st {
+ case "Critical":
+ worstStatus = "Critical"
+ case "Warning":
+ if worstStatus != "Critical" {
+ worstStatus = "Warning"
+ }
+ }
+ }
+ db.Record("psu:all", source, worstStatus, "")
+ }
  func ReadRuntimeHealth(path string) (schema.RuntimeHealth, error) {
  raw, err := os.ReadFile(path)
  if err != nil {
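
The roll-up that writePSUStatusesToDB performs for the "psu:all" record can be illustrated in isolation. The sketch below assumes the three status strings seen in the diff ("OK", "Warning", "Critical"); the `severity` map and the `worstStatus` function are hypothetical names, and a rank table is used here in place of the original switch.

```go
package main

import "fmt"

// severity ranks the status strings used in the diff. Any other string ranks
// as 0 and therefore never replaces the current worst value.
var severity = map[string]int{"OK": 0, "Warning": 1, "Critical": 2}

// worstStatus returns the most severe status in the list, or "OK" when the
// list is empty. Illustrative only; the commit inlines this logic.
func worstStatus(statuses []string) string {
	worst := "OK"
	for _, st := range statuses {
		if severity[st] > severity[worst] {
			worst = st
		}
	}
	return worst
}

func main() {
	fmt.Println(worstStatus([]string{"OK", "Warning", "OK"}))
	fmt.Println(worstStatus([]string{"Warning", "Critical", "OK"}))
	fmt.Println(worstStatus(nil))
}
```

A rank table makes it straightforward to add further levels later without growing nested conditions.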

View File

@@ -161,7 +161,7 @@ func (f fakeSAT) RunNvidiaTargetedStressValidatePack(_ context.Context, baseDir
  return f.runNvidiaFn(baseDir)
  }
- func (f fakeSAT) RunNvidiaOfficialComputePack(_ context.Context, baseDir string, durationSec int, gpuIndices []int, _ func(string)) (string, error) {
+ func (f fakeSAT) RunNvidiaOfficialComputePack(_ context.Context, baseDir string, durationSec int, gpuIndices []int, _ int, _ func(string)) (string, error) {
  if f.runNvidiaComputeFn != nil {
  return f.runNvidiaComputeFn(baseDir, durationSec, gpuIndices)
  }
@@ -542,8 +542,6 @@ func TestActionResultsUseFallbackBody(t *testing.T) {
  }
  func TestExportSupportBundleResultMentionsUnmountedUSB(t *testing.T) {
- t.Parallel()
  tmp := t.TempDir()
  oldExportDir := DefaultExportDir
  DefaultExportDir = tmp
@@ -580,8 +578,6 @@ func TestExportSupportBundleResultMentionsUnmountedUSB(t *testing.T) {
  }
  func TestExportSupportBundleResultDoesNotPretendSuccessOnError(t *testing.T) {
- t.Parallel()
  tmp := t.TempDir()
  oldExportDir := DefaultExportDir
  DefaultExportDir = tmp
@@ -643,8 +639,6 @@ func TestRunNvidiaAcceptancePackResult(t *testing.T) {
  }
  func TestRunSATDefaultsToExportDir(t *testing.T) {
- t.Parallel()
  oldSATBaseDir := DefaultSATBaseDir
  DefaultSATBaseDir = "/tmp/export/bee-sat"
  t.Cleanup(func() { DefaultSATBaseDir = oldSATBaseDir })

View File

@@ -54,7 +54,7 @@ if ! command -v lspci >/dev/null 2>&1; then
  exit 0
  fi
  found=0
- for gpu in $(lspci -Dn | awk '$3 ~ /^10de:/ {print $1}'); do
+ for gpu in $(lspci -Dn | awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ {print $1}'); do
  found=1
  echo "=== GPU $gpu ==="
  lspci -s "$gpu" -vv 2>&1 || true
@@ -73,8 +73,13 @@ fi
  {name: "system/pcie-nvidia-link.txt", cmd: []string{"sh", "-c", `
  for d in /sys/bus/pci/devices/*/; do
  vendor=$(cat "$d/vendor" 2>/dev/null)
  [ "$vendor" = "0x10de" ] || continue
+ class=$(cat "$d/class" 2>/dev/null)
+ case "$class" in
+ 0x030000|0x030200) ;;
+ *) continue ;;
+ esac
  dev=$(basename "$d")
  echo "=== $dev ==="
  for f in current_link_speed current_link_width max_link_speed max_link_width; do
  printf " %-22s %s\n" "$f" "$(cat "$d/$f" 2>/dev/null)"
@@ -208,7 +213,7 @@ func BuildSupportBundle(exportDir string) (string, error) {
  now := time.Now().UTC()
  date := now.Format("2006-01-02")
- tod := now.Format("15:04:05")
+ tod := now.Format("150405")
  ver := bundleVersion()
  model := serverModelForBundle()
  sn := serverSerialForBundle()
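
The layout change in this hunk matters because FAT and exFAT reject ':' in file names. The sketch below shows what the two layouts produce for a fixed instant. `timeOfDayStamp` and `sampleTime` are names invented for this example, and how the bundle file name is assembled from its parts is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// timeOfDayStamp formats the time of day with the colon-free layout that the
// hunk switches to. Hypothetical wrapper around time.Time.Format.
func timeOfDayStamp(t time.Time) string {
	return t.UTC().Format("150405")
}

// sampleTime returns a fixed instant so the output is reproducible.
func sampleTime() time.Time {
	return time.Date(2026, time.April, 11, 10, 5, 27, 0, time.UTC)
}

func main() {
	t := sampleTime()
	fmt.Println(t.Format("15:04:05")) // old layout: 10:05:27
	fmt.Println(timeOfDayStamp(t))    // new layout: 100527
}
```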

View File

@@ -179,11 +179,3 @@ func commandOutputWithTimeout(timeout time.Duration, name string, args ...string
  defer cancel()
  return exec.CommandContext(ctx, name, args...).Output()
  }
- func interfaceHasCarrier(iface string) bool {
- raw, err := readNetCarrierFile(iface)
- if err != nil {
- return false
- }
- return strings.TrimSpace(raw) == "1"
- }

View File

@@ -58,12 +58,10 @@ func enrichPCIeWithNICTelemetry(devs []schema.HardwarePCIeDevice) []schema.Hardw
  }
  }
- if interfaceHasCarrier(iface) {
- if out, err := ethtoolModuleQuery(iface); err == nil {
- if injectSFPDOMTelemetry(&devs[i], out) {
- enriched++
- continue
- }
+ if out, err := ethtoolModuleQuery(iface); err == nil {
+ if injectSFPDOMTelemetry(&devs[i], out) {
+ enriched++
+ continue
  }
  }
  if len(devs[i].MacAddresses) > 0 || devs[i].Firmware != nil {
@@ -115,8 +113,38 @@ func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
  }
  key := strings.ToLower(strings.TrimSpace(trimmed[:idx]))
  val := strings.TrimSpace(trimmed[idx+1:])
+ if val == "" || strings.EqualFold(val, "not supported") || strings.EqualFold(val, "unknown") {
+ continue
+ }
  switch {
+ case key == "identifier":
+ s := parseSFPIdentifier(val)
+ dev.SFPIdentifier = &s
+ t := true
+ dev.SFPPresent = &t
+ changed = true
+ case key == "connector":
+ s := parseSFPConnector(val)
+ dev.SFPConnector = &s
+ changed = true
+ case key == "vendor name":
+ s := strings.TrimSpace(val)
+ dev.SFPVendor = &s
+ changed = true
+ case key == "vendor pn":
+ s := strings.TrimSpace(val)
+ dev.SFPPartNumber = &s
+ changed = true
+ case key == "vendor sn":
+ s := strings.TrimSpace(val)
+ dev.SFPSerialNumber = &s
+ changed = true
+ case strings.Contains(key, "laser wavelength"):
+ if f, ok := firstFloat(val); ok {
+ dev.SFPWavelengthNM = &f
+ changed = true
+ }
  case strings.Contains(key, "module temperature"):
  if f, ok := firstFloat(val); ok {
  dev.SFPTemperatureC = &f
@@ -147,12 +175,61 @@ func injectSFPDOMTelemetry(dev *schema.HardwarePCIeDevice, raw string) bool {
  return changed
  }
+ // parseSFPIdentifier extracts the human-readable transceiver type from the
+ // raw ethtool identifier line, e.g. "0x03 (SFP)" → "SFP".
+ func parseSFPIdentifier(val string) string {
+ if s := extractParens(val); s != "" {
+ return s
+ }
+ return val
+ }
+ // parseSFPConnector extracts the connector type from the raw ethtool line,
+ // e.g. "0x07 (LC)" → "LC".
+ func parseSFPConnector(val string) string {
+ if s := extractParens(val); s != "" {
+ return s
+ }
+ return val
+ }
+ var parenRe = regexp.MustCompile(`\(([^)]+)\)`)
+ func extractParens(s string) string {
+ m := parenRe.FindStringSubmatch(s)
+ if len(m) < 2 {
+ return ""
+ }
+ return strings.TrimSpace(m[1])
+ }
  func parseSFPDOM(raw string) map[string]any {
  dev := schema.HardwarePCIeDevice{}
  if !injectSFPDOMTelemetry(&dev, raw) {
  return map[string]any{}
  }
  out := map[string]any{}
+ if dev.SFPPresent != nil {
+ out["sfp_present"] = *dev.SFPPresent
+ }
+ if dev.SFPIdentifier != nil {
+ out["sfp_identifier"] = *dev.SFPIdentifier
+ }
+ if dev.SFPConnector != nil {
+ out["sfp_connector"] = *dev.SFPConnector
+ }
+ if dev.SFPVendor != nil {
+ out["sfp_vendor"] = *dev.SFPVendor
+ }
+ if dev.SFPPartNumber != nil {
+ out["sfp_part_number"] = *dev.SFPPartNumber
+ }
+ if dev.SFPSerialNumber != nil {
+ out["sfp_serial_number"] = *dev.SFPSerialNumber
+ }
+ if dev.SFPWavelengthNM != nil {
+ out["sfp_wavelength_nm"] = *dev.SFPWavelengthNM
+ }
  if dev.SFPTemperatureC != nil {
  out["sfp_temperature_c"] = *dev.SFPTemperatureC
  }
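
To make the behaviour of the parenthesis extraction concrete, the helper from this diff is reproduced below in a standalone program with a few inputs. The regular expression and `extractParens` are taken from the hunk above. `labelOrRaw` is a name invented here that mirrors the shared fallback in parseSFPIdentifier and parseSFPConnector, and the sample values are illustrative rather than captured ethtool output.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// parenRe matches the first "(...)" group, as in the diff.
var parenRe = regexp.MustCompile(`\(([^)]+)\)`)

// extractParens returns the trimmed text inside the first pair of
// parentheses, or "" when there is none.
func extractParens(s string) string {
	m := parenRe.FindStringSubmatch(s)
	if len(m) < 2 {
		return ""
	}
	return strings.TrimSpace(m[1])
}

// labelOrRaw falls back to the raw value when no label is present
// (hypothetical name for the fallback both parse helpers share).
func labelOrRaw(val string) string {
	if s := extractParens(val); s != "" {
		return s
	}
	return val
}

func main() {
	for _, v := range []string{"0x03 (SFP)", "0x07 (LC)", "0x21"} {
		fmt.Printf("%q -> %q\n", v, labelOrRaw(v))
	}
}
```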

View File

@@ -122,10 +122,7 @@ func TestEnrichPCIeWithNICTelemetrySkipsModuleQueryWithoutCarrier(t *testing.T)
  readNetAddressFile = func(string) (string, error) { return "aa:bb:cc:dd:ee:ff", nil }
  readNetCarrierFile = func(string) (string, error) { return "0", nil }
  ethtoolInfoQuery = func(string) (string, error) { return "", fmt.Errorf("skip firmware") }
- ethtoolModuleQuery = func(string) (string, error) {
- t.Fatal("ethtool -m should not be called without carrier")
- return "", nil
- }
+ ethtoolModuleQuery = func(string) (string, error) { return "", fmt.Errorf("no module") }
  class := "EthernetController"
  bdf := "0000:18:00.0"

View File

@@ -15,6 +15,7 @@ const nvidiaVendorID = 0x10de
  type nvidiaGPUInfo struct {
  Index int
  BDF string
+ Name string
  Serial string
  VBIOS string
  TemperatureC *float64
@@ -73,6 +74,9 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
  continue
  }
+ if v := strings.TrimSpace(info.Name); v != "" {
+ devs[i].Model = &v
+ }
  if v := strings.TrimSpace(info.Serial); v != "" {
  devs[i].SerialNumber = &v
  }
@@ -99,7 +103,7 @@ func enrichPCIeWithNVIDIAData(devs []schema.HardwarePCIeDevice, gpuByBDF map[str
  func queryNVIDIAGPUs() (map[string]nvidiaGPUInfo, error) {
  out, err := exec.Command(
  "nvidia-smi",
- "--query-gpu=index,pci.bus_id,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max",
+ "--query-gpu=index,pci.bus_id,name,serial,vbios_version,temperature.gpu,power.draw,ecc.errors.uncorrected.aggregate.total,ecc.errors.corrected.aggregate.total,clocks_throttle_reasons.hw_slowdown,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max",
  "--format=csv,noheader,nounits",
  ).Output()
  if err != nil {
@@ -123,8 +127,8 @@ func parseNVIDIASMIQuery(raw string) (map[string]nvidiaGPUInfo, error) {
  if len(rec) == 0 {
  continue
  }
- if len(rec) < 13 {
- return nil, fmt.Errorf("unexpected nvidia-smi columns: got %d, want 13", len(rec))
+ if len(rec) < 14 {
+ return nil, fmt.Errorf("unexpected nvidia-smi columns: got %d, want 14", len(rec))
  }
  bdf := normalizePCIeBDF(rec[1])
@@ -135,17 +139,18 @@ func parseNVIDIASMIQuery(raw string) (map[string]nvidiaGPUInfo, error) {
  info := nvidiaGPUInfo{
  Index: parseRequiredInt(rec[0]),
  BDF: bdf,
- Serial: strings.TrimSpace(rec[2]),
- VBIOS: strings.TrimSpace(rec[3]),
- TemperatureC: parseMaybeFloat(rec[4]),
- PowerW: parseMaybeFloat(rec[5]),
- ECCUncorrected: parseMaybeInt64(rec[6]),
- ECCCorrected: parseMaybeInt64(rec[7]),
- HWSlowdown: parseMaybeBool(rec[8]),
- PCIeLinkGenCurrent: parseMaybeInt(rec[9]),
- PCIeLinkGenMax: parseMaybeInt(rec[10]),
- PCIeLinkWidthCur: parseMaybeInt(rec[11]),
- PCIeLinkWidthMax: parseMaybeInt(rec[12]),
+ Name: strings.TrimSpace(rec[2]),
+ Serial: strings.TrimSpace(rec[3]),
+ VBIOS: strings.TrimSpace(rec[4]),
+ TemperatureC: parseMaybeFloat(rec[5]),
+ PowerW: parseMaybeFloat(rec[6]),
+ ECCUncorrected: parseMaybeInt64(rec[7]),
+ ECCCorrected: parseMaybeInt64(rec[8]),
+ HWSlowdown: parseMaybeBool(rec[9]),
+ PCIeLinkGenCurrent: parseMaybeInt(rec[10]),
+ PCIeLinkGenMax: parseMaybeInt(rec[11]),
+ PCIeLinkWidthCur: parseMaybeInt(rec[12]),
+ PCIeLinkWidthMax: parseMaybeInt(rec[13]),
  }
  result[bdf] = info
  }

View File

@@ -6,7 +6,7 @@ import (
  )
  func TestParseNVIDIASMIQuery(t *testing.T) {
- raw := "0, 00000000:65:00.0, GPU-SERIAL-1, 96.00.1F.00.02, 54, 210.33, 0, 5, Not Active, 4, 4, 16, 16\n"
+ raw := "0, 00000000:65:00.0, NVIDIA H100 80GB HBM3, GPU-SERIAL-1, 96.00.1F.00.02, 54, 210.33, 0, 5, Not Active, 4, 4, 16, 16\n"
  byBDF, err := parseNVIDIASMIQuery(raw)
  if err != nil {
  t.Fatalf("parse failed: %v", err)
@@ -16,6 +16,9 @@ func TestParseNVIDIASMIQuery(t *testing.T) {
  if !ok {
  t.Fatalf("gpu by normalized bdf not found")
  }
+ if gpu.Name != "NVIDIA H100 80GB HBM3" {
+ t.Fatalf("name: got %q", gpu.Name)
+ }
  if gpu.Serial != "GPU-SERIAL-1" {
  t.Fatalf("serial: got %q", gpu.Serial)
  }

View File

@@ -2,6 +2,7 @@ package collector
  import (
  "bee/audit/internal/schema"
+ "fmt"
  "log/slog"
  "os/exec"
  "strconv"
@@ -79,6 +80,25 @@ func shouldIncludePCIeDevice(class, vendor, device string) bool {
  }
  }
+ // Exclude BMC/management virtual VGA adapters — these are firmware video chips,
+ // not real GPUs, and pollute the GPU inventory (e.g. iBMC, iDRAC, iLO VGA).
+ if strings.Contains(c, "vga") || strings.Contains(c, "display") || strings.Contains(c, "3d") {
+ bmcPatterns := []string{
+ "management system chip",
+ "management controller",
+ "ibmc",
+ "idrac",
+ "ilo vga",
+ "aspeed",
+ "matrox",
+ }
+ for _, bad := range bmcPatterns {
+ if strings.Contains(d, bad) {
+ return false
+ }
+ }
+ }
  if strings.Contains(v, "advanced micro devices") || strings.Contains(v, "[amd]") {
  internalAMDPatterns := []string{
  "dummy function",
@@ -153,6 +173,9 @@ func parseLspciDevice(fields map[string]string) schema.HardwarePCIeDevice {
  // SVendor/SDevice available but not in schema — skip
+ // Warn if PCIe link is running below its maximum negotiated speed.
+ applyPCIeLinkSpeedWarning(&dev)
  return dev
  }
@@ -222,6 +245,41 @@ func readPCIStringAttribute(bdf, attribute string) (string, bool) {
  return value, true
  }
+ // applyPCIeLinkSpeedWarning sets the device status to Warning if the current PCIe link
+ // speed is below the maximum negotiated speed supported by both ends.
+ func applyPCIeLinkSpeedWarning(dev *schema.HardwarePCIeDevice) {
+ if dev.LinkSpeed == nil || dev.MaxLinkSpeed == nil {
+ return
+ }
+ if pcieLinkSpeedRank(*dev.LinkSpeed) < pcieLinkSpeedRank(*dev.MaxLinkSpeed) {
+ warn := statusWarning
+ dev.Status = &warn
+ desc := fmt.Sprintf("PCIe link speed degraded: running at %s, capable of %s", *dev.LinkSpeed, *dev.MaxLinkSpeed)
+ dev.ErrorDescription = &desc
+ }
+ }
+ // pcieLinkSpeedRank returns a numeric rank for a normalized Gen string (e.g. "Gen4" → 4).
+ // Returns 0 for unrecognised values so comparisons fail safe.
+ func pcieLinkSpeedRank(gen string) int {
+ switch gen {
+ case "Gen1":
+ return 1
+ case "Gen2":
+ return 2
+ case "Gen3":
+ return 3
+ case "Gen4":
+ return 4
+ case "Gen5":
+ return 5
+ case "Gen6":
+ return 6
+ default:
+ return 0
+ }
+ }
  func normalizePCILinkSpeed(raw string) string {
  raw = strings.TrimSpace(strings.ToLower(raw))
  switch {
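
One property of the rank comparison above is worth spelling out. Unrecognised strings rank as 0, so an unrecognised maximum speed can never trigger a warning, but an unrecognised current speed paired with a recognised maximum would compare as 0 < N and be flagged. Whether such a string can reach the comparison depends on normalizePCILinkSpeed, whose body is not shown here. The sketch below is a hypothetical guarded variant that skips the comparison when either side is unrecognised. `genRank` and `isLinkDegraded` are illustrative names, and the rank is parsed from the "GenN" suffix instead of using the switch from the diff.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// genRank converts a normalized "GenN" string into N and returns 0 for
// anything it does not recognise. Illustrative stand-in for pcieLinkSpeedRank.
func genRank(gen string) int {
	if !strings.HasPrefix(gen, "Gen") {
		return 0
	}
	n, err := strconv.Atoi(strings.TrimPrefix(gen, "Gen"))
	if err != nil || n < 1 {
		return 0
	}
	return n
}

// isLinkDegraded reports a degraded link only when both speeds are
// recognised and the current one ranks below the maximum.
func isLinkDegraded(current, maximum string) bool {
	cur, top := genRank(current), genRank(maximum)
	if cur == 0 || top == 0 {
		return false
	}
	return cur < top
}

func main() {
	fmt.Println(isLinkDegraded("Gen1", "Gen5"))    // degraded
	fmt.Println(isLinkDegraded("Gen5", "Gen5"))    // healthy
	fmt.Println(isLinkDegraded("unknown", "Gen5")) // skipped, not flagged
}
```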

View File

@@ -1,6 +1,7 @@
  package collector
  import (
+ "bee/audit/internal/schema"
  "encoding/json"
  "strings"
  "testing"
@@ -29,6 +30,8 @@ func TestShouldIncludePCIeDevice(t *testing.T) {
  {name: "raid", class: "RAID bus controller", want: true},
  {name: "nvme", class: "Non-Volatile memory controller", want: true},
  {name: "vga", class: "VGA compatible controller", want: true},
+ {name: "ibmc vga", class: "VGA compatible controller", vendor: "Huawei Technologies Co., Ltd.", device: "Hi171x Series [iBMC Intelligent Management system chip w/VGA support]", want: false},
+ {name: "aspeed vga", class: "VGA compatible controller", vendor: "ASPEED Technology, Inc.", device: "ASPEED Graphics Family", want: false},
  {name: "other encryption controller", class: "Encryption controller", vendor: "Intel Corporation", device: "QuickAssist", want: true},
  }
@@ -139,3 +142,77 @@ func TestNormalizePCILinkSpeed(t *testing.T) {
  }
  }
  }
+ func TestApplyPCIeLinkSpeedWarning(t *testing.T) {
+ ptr := func(s string) *string { return &s }
+ tests := []struct {
+ name string
+ linkSpeed *string
+ maxSpeed *string
+ wantWarning bool
+ wantGenIn string // substring expected in ErrorDescription when warning
+ }{
+ {
+ name: "degraded Gen1 vs Gen5",
+ linkSpeed: ptr("Gen1"),
+ maxSpeed: ptr("Gen5"),
+ wantWarning: true,
+ wantGenIn: "Gen1",
+ },
+ {
+ name: "at max Gen5",
+ linkSpeed: ptr("Gen5"),
+ maxSpeed: ptr("Gen5"),
+ wantWarning: false,
+ },
+ {
+ name: "degraded Gen4 vs Gen5",
+ linkSpeed: ptr("Gen4"),
+ maxSpeed: ptr("Gen5"),
+ wantWarning: true,
+ wantGenIn: "Gen4",
+ },
+ {
+ name: "missing current speed — no warning",
+ linkSpeed: nil,
+ maxSpeed: ptr("Gen5"),
+ wantWarning: false,
+ },
+ {
+ name: "missing max speed — no warning",
+ linkSpeed: ptr("Gen1"),
+ maxSpeed: nil,
+ wantWarning: false,
+ },
+ }
+ for _, tt := range tests {
+ t.Run(tt.name, func(t *testing.T) {
+ dev := schema.HardwarePCIeDevice{}
+ ok := statusOK
+ dev.Status = &ok
+ dev.LinkSpeed = tt.linkSpeed
+ dev.MaxLinkSpeed = tt.maxSpeed
+ applyPCIeLinkSpeedWarning(&dev)
+ gotWarn := dev.Status != nil && *dev.Status == statusWarning
+ if gotWarn != tt.wantWarning {
+ t.Fatalf("wantWarning=%v gotWarning=%v (status=%v)", tt.wantWarning, gotWarn, dev.Status)
+ }
+ if tt.wantWarning {
+ if dev.ErrorDescription == nil {
+ t.Fatal("expected ErrorDescription to be set")
+ }
+ if !strings.Contains(*dev.ErrorDescription, tt.wantGenIn) {
+ t.Fatalf("ErrorDescription %q does not contain %q", *dev.ErrorDescription, tt.wantGenIn)
+ }
+ } else {
+ if dev.ErrorDescription != nil {
+ t.Fatalf("unexpected ErrorDescription: %s", *dev.ErrorDescription)
+ }
+ }
+ })
+ }
+ }

View File

@@ -335,11 +335,7 @@ func (s *System) RunNvidiaBenchmark(ctx context.Context, baseDir string, opts Nv
  return "", fmt.Errorf("write summary.txt: %w", err)
  }
- archive := filepath.Join(baseDir, "gpu-benchmark-"+ts+".tar.gz")
- if err := createTarGz(archive, runDir); err != nil {
- return "", fmt.Errorf("pack benchmark archive: %w", err)
- }
- return archive, nil
+ return runDir, nil
  }
  func normalizeNvidiaBenchmarkOptionsForBenchmark(opts NvidiaBenchmarkOptions) NvidiaBenchmarkOptions {

View File

@@ -90,7 +90,7 @@ func renderBenchmarkReportWithCharts(result NvidiaBenchmarkResult, charts []benc
  for _, gpu := range result.GPUs {
  name := strings.TrimSpace(gpu.Name)
  if name == "" {
- name = "Unknown"
+ name = "Unknown GPU"
  }
  interconnect := "-"
  if gpu.Scores.InterconnectScore > 0 {

View File

@@ -49,6 +49,9 @@ func buildNvidiaStressJob(opts NvidiaStressOptions) (satJob, error) {
  "--seconds", strconv.Itoa(opts.DurationSec),
  "--size-mb", strconv.Itoa(opts.SizeMB),
  }
+ if opts.StaggerSeconds > 0 && len(selected) > 1 {
+ cmd = append(cmd, "--stagger-seconds", strconv.Itoa(opts.StaggerSeconds))
+ }
  if len(selected) > 0 {
  cmd = append(cmd, "--devices", joinIndexList(selected))
  }
@@ -63,6 +66,9 @@ func buildNvidiaStressJob(opts NvidiaStressOptions) (satJob, error) {
  "bee-john-gpu-stress",
  "--seconds", strconv.Itoa(opts.DurationSec),
  }
+ if opts.StaggerSeconds > 0 && len(selected) > 1 {
+ cmd = append(cmd, "--stagger-seconds", strconv.Itoa(opts.StaggerSeconds))
+ }
  if len(selected) > 0 {
  cmd = append(cmd, "--devices", joinIndexList(selected))
  }

View File

@@ -161,13 +161,7 @@ func (s *System) RunPlatformStress(
  }
  _ = os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary), 0644)
- // Pack tar.gz
- archivePath := filepath.Join(baseDir, "platform-stress-"+stamp+".tar.gz")
- if err := packPlatformDir(runDir, archivePath); err != nil {
- return "", fmt.Errorf("pack archive: %w", err)
- }
- _ = os.RemoveAll(runDir)
- return archivePath, nil
+ return runDir, nil
  }
  // collectPhase samples live metrics every second until ctx is done.

View File

@@ -1,6 +1,7 @@
  package platform
  import (
+ "bufio"
  "os"
  "os/exec"
  "strings"
@@ -114,6 +115,8 @@ func (s *System) CollectRuntimeHealth(exportDir string) (schema.RuntimeHealth, e
  }
  s.collectGPURuntimeHealth(vendor, &health)
+ s.collectToRAMHealth(&health)
+ s.collectUSBExportHealth(&health)
  if health.Status != "FAILED" && len(health.Issues) > 0 {
  health.Status = "PARTIAL"
@@ -168,6 +171,90 @@ func resolvedToolStatus(display string, candidates ...string) ToolStatus {
return ToolStatus{Name: display} return ToolStatus{Name: display}
} }
// collectToRAMHealth checks whether the LiveCD ISO has been copied to RAM.
// Status values: "ok" = in RAM, "warning" = toram not active (no copy attempted),
// "failed" = toram was requested but medium is not in RAM (copy failed or in progress).
func (s *System) collectToRAMHealth(health *schema.RuntimeHealth) {
inRAM := s.IsLiveMediaInRAM()
active := toramActive()
switch {
case inRAM:
health.ToRAMStatus = "ok"
case active:
// toram was requested but medium is not yet/no longer in RAM
health.ToRAMStatus = "failed"
health.Issues = append(health.Issues, schema.RuntimeIssue{
Code: "toram_copy_failed",
Severity: "warning",
Description: "toram boot parameter is set but the live medium is not mounted from RAM.",
})
default:
health.ToRAMStatus = "warning"
}
}
// collectUSBExportHealth scans /proc/mounts for a writable USB-backed filesystem
// suitable for log export. Sets USBExportPath to the first match found.
func (s *System) collectUSBExportHealth(health *schema.RuntimeHealth) {
health.USBExportPath = findUSBExportMount()
}
// findUSBExportMount returns the mount point of the first writable USB filesystem
// found in /proc/mounts (vfat, exfat, ext2/3/4, ntfs) whose backing block device
// has USB transport. Returns "" if none found.
func findUSBExportMount() string {
f, err := os.Open("/proc/mounts")
if err != nil {
return ""
}
defer f.Close()
// fs types that are expected on USB export drives
exportFSTypes := map[string]bool{
"vfat": true,
"exfat": true,
"ext2": true,
"ext3": true,
"ext4": true,
"ntfs": true,
"ntfs3": true,
"fuseblk": true,
}
scanner := bufio.NewScanner(f)
for scanner.Scan() {
// fields: device mountpoint fstype options dump pass
fields := strings.Fields(scanner.Text())
if len(fields) < 4 {
continue
}
device, mountPoint, fsType, options := fields[0], fields[1], fields[2], fields[3]
if !exportFSTypes[strings.ToLower(fsType)] {
continue
}
// Skip read-only mounts
opts := strings.Split(options, ",")
readOnly := false
for _, o := range opts {
if strings.TrimSpace(o) == "ro" {
readOnly = true
break
}
}
if readOnly {
continue
}
// Check USB transport via lsblk on the device
if !strings.HasPrefix(device, "/dev/") {
continue
}
if blockDeviceTransport(device) == "usb" {
return mountPoint
}
}
return ""
}
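
The mount-table scan in `findUSBExportMount` reduces to splitting each `/proc/mounts` line into fields and looking for `ro` among the comma-separated options. A minimal sketch of that parsing step follows, assuming a hypothetical `mountEntry` type and `parseMountLine` helper that are not part of this change set; the USB transport lookup (`blockDeviceTransport`) is deliberately left out because its implementation is not shown in the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// mountEntry is a hypothetical holder for the fields the scan cares about.
type mountEntry struct {
	Device     string
	MountPoint string
	FSType     string
	ReadOnly   bool
}

// parseMountLine splits one /proc/mounts line
// ("device mountpoint fstype options dump pass") into a mountEntry.
// It reports false when the line has fewer than four fields.
func parseMountLine(line string) (mountEntry, bool) {
	fields := strings.Fields(line)
	if len(fields) < 4 {
		return mountEntry{}, false
	}
	e := mountEntry{
		Device:     fields[0],
		MountPoint: fields[1],
		FSType:     strings.ToLower(fields[2]),
	}
	// A mount is read-only when "ro" appears as a whole option.
	for _, opt := range strings.Split(fields[3], ",") {
		if strings.TrimSpace(opt) == "ro" {
			e.ReadOnly = true
			break
		}
	}
	return e, true
}

func main() {
	// Sample line; the device and mount point are made up for illustration.
	e, ok := parseMountLine("/dev/sdb1 /media/usb vfat rw,relatime 0 0")
	fmt.Println(ok, e.Device, e.MountPoint, e.FSType, e.ReadOnly)
}
```

Matching `ro` as a whole option, rather than as a substring, avoids false positives on options such as `errors=remount-ro`.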
func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) { func (s *System) collectGPURuntimeHealth(vendor string, health *schema.RuntimeHealth) {
lsmodText := commandText("lsmod") lsmodText := commandText("lsmod")


@@ -384,25 +384,39 @@ func (s *System) RunNCCLTests(ctx context.Context, baseDir string, logFunc func(
), logFunc) ), logFunc)
} }
func (s *System) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, logFunc func(string)) (string, error) { func (s *System) RunNvidiaOfficialComputePack(ctx context.Context, baseDir string, durationSec int, gpuIndices []int, staggerSec int, logFunc func(string)) (string, error) {
selected, err := resolveDCGMGPUIndices(gpuIndices) selected, err := resolveDCGMGPUIndices(gpuIndices)
if err != nil { if err != nil {
return "", err return "", err
} }
profCmd, err := resolveDCGMProfTesterCommand("--no-dcgm-validation", "-t", "1004", "-d", strconv.Itoa(normalizeNvidiaBurnDuration(durationSec))) var (
if err != nil { profCmd []string
return "", err profEnv []string
)
if staggerSec > 0 && len(selected) > 1 {
profCmd = []string{
"bee-dcgmproftester-staggered",
"--seconds", strconv.Itoa(normalizeNvidiaBurnDuration(durationSec)),
"--stagger-seconds", strconv.Itoa(staggerSec),
"--devices", joinIndexList(selected),
}
} else {
profCmd, err = resolveDCGMProfTesterCommand("--no-dcgm-validation", "-t", "1004", "-d", strconv.Itoa(normalizeNvidiaBurnDuration(durationSec)))
if err != nil {
return "", err
}
profEnv = nvidiaVisibleDevicesEnv(selected)
} }
return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-compute", withNvidiaPersistenceMode( return runAcceptancePackCtx(ctx, baseDir, "gpu-nvidia-compute", withNvidiaPersistenceMode(
satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}}, satJob{name: "01-nvidia-smi-q.log", cmd: []string{"nvidia-smi", "-q"}},
satJob{name: "02-dcgmi-version.log", cmd: []string{"dcgmi", "-v"}}, satJob{name: "02-dcgmi-version.log", cmd: []string{"dcgmi", "-v"}},
satJob{ satJob{
name: "03-dcgmproftester.log", name: "03-dcgmproftester.log",
cmd: profCmd, cmd: profCmd,
env: nvidiaVisibleDevicesEnv(selected), env: profEnv,
collectGPU: true, collectGPU: true,
gpuIndices: selected, gpuIndices: selected,
}, },
satJob{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}}, satJob{name: "04-nvidia-smi-after.log", cmd: []string{"nvidia-smi", "--query-gpu=index,name,temperature.gpu,power.draw,utilization.gpu,memory.used,memory.total", "--format=csv,noheader,nounits"}},
), logFunc) ), logFunc)
} }
@@ -648,11 +662,7 @@ func (s *System) RunStorageAcceptancePack(ctx context.Context, baseDir string, e
if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil { if err := os.WriteFile(filepath.Join(runDir, "summary.txt"), []byte(summary.String()), 0644); err != nil {
return "", err return "", err
} }
archive := filepath.Join(baseDir, "storage-"+ts+".tar.gz") return runDir, nil
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
} }
type satJob struct { type satJob struct {
@@ -838,11 +848,7 @@ func runAcceptancePackCtx(ctx context.Context, baseDir, prefix string, jobs []sa
} }
} }
archive := filepath.Join(baseDir, prefix+"-"+ts+".tar.gz") return runDir, nil
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
} }
func updateNvidiaGPUStatus(perGPU map[int]*nvidiaGPUStatusFile, idx int, status, jobName, detail string) { func updateNvidiaGPUStatus(perGPU map[int]*nvidiaGPUStatusFile, idx int, status, jobName, detail string) {
@@ -905,7 +911,7 @@ func writeNvidiaGPUStatusFiles(runDir, overall string, perGPU map[int]*nvidiaGPU
entry.Health = "UNKNOWN" entry.Health = "UNKNOWN"
} }
if entry.Name == "" { if entry.Name == "" {
entry.Name = "unknown" entry.Name = "Unknown GPU"
} }
var body strings.Builder var body strings.Builder
fmt.Fprintf(&body, "gpu_index=%d\n", entry.Index) fmt.Fprintf(&body, "gpu_index=%d\n", entry.Index)


@@ -223,11 +223,7 @@ func (s *System) RunFanStressTest(ctx context.Context, baseDir string, opts FanS
return "", err return "", err
} }
archive := filepath.Join(baseDir, "fan-stress-"+ts+".tar.gz") return runDir, nil
if err := createTarGz(archive, runDir); err != nil {
return "", err
}
return archive, nil
} }
func applyFanStressDefaults(opts *FanStressOptions) { func applyFanStressDefaults(opts *FanStressOptions) {


@@ -20,6 +20,7 @@ var techDumpFixedCommands = []struct {
{Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"}, {Name: "dmidecode", Args: []string{"-t", "4"}, File: "dmidecode-type4.txt"},
{Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"}, {Name: "dmidecode", Args: []string{"-t", "17"}, File: "dmidecode-type17.txt"},
{Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"}, {Name: "lspci", Args: []string{"-vmm", "-D"}, File: "lspci-vmm.txt"},
{Name: "lspci", Args: []string{"-vvv"}, File: "lspci-vvv.txt"},
{Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"}, {Name: "lsblk", Args: []string{"-J", "-d", "-o", "NAME,TYPE,SIZE,SERIAL,MODEL,TRAN,HCTL"}, File: "lsblk.json"},
{Name: "sensors", Args: []string{"-j"}, File: "sensors.json"}, {Name: "sensors", Args: []string{"-j"}, File: "sensors.json"},
{Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"}, {Name: "ipmitool", Args: []string{"fru", "print"}, File: "ipmitool-fru.txt"},


@@ -70,6 +70,7 @@ type NvidiaStressOptions struct {
Loader string Loader string
GPUIndices []int GPUIndices []int
ExcludeGPUIndices []int ExcludeGPUIndices []int
StaggerSeconds int
} }
func New() *System { func New() *System {


@@ -22,6 +22,10 @@ type RuntimeHealth struct {
CUDAReady bool `json:"cuda_ready,omitempty"` CUDAReady bool `json:"cuda_ready,omitempty"`
NvidiaGSPMode string `json:"nvidia_gsp_mode,omitempty"` // "gsp-on", "gsp-off", "gsp-stuck" NvidiaGSPMode string `json:"nvidia_gsp_mode,omitempty"` // "gsp-on", "gsp-off", "gsp-stuck"
NetworkStatus string `json:"network_status,omitempty"` NetworkStatus string `json:"network_status,omitempty"`
// ToRAMStatus: "ok" (ISO in RAM), "warning" (toram not active), "failed" (toram active but copy failed)
ToRAMStatus string `json:"toram_status,omitempty"`
// USBExportPath: mount point of the first writable USB drive found, empty if none.
USBExportPath string `json:"usb_export_path,omitempty"`
Issues []RuntimeIssue `json:"issues,omitempty"` Issues []RuntimeIssue `json:"issues,omitempty"`
Tools []RuntimeToolStatus `json:"tools,omitempty"` Tools []RuntimeToolStatus `json:"tools,omitempty"`
Services []RuntimeServiceStatus `json:"services,omitempty"` Services []RuntimeServiceStatus `json:"services,omitempty"`
@@ -183,6 +187,13 @@ type HardwarePCIeDevice struct {
BatteryTemperatureC *float64 `json:"battery_temperature_c,omitempty"` BatteryTemperatureC *float64 `json:"battery_temperature_c,omitempty"`
BatteryVoltageV *float64 `json:"battery_voltage_v,omitempty"` BatteryVoltageV *float64 `json:"battery_voltage_v,omitempty"`
BatteryReplaceRequired *bool `json:"battery_replace_required,omitempty"` BatteryReplaceRequired *bool `json:"battery_replace_required,omitempty"`
SFPPresent *bool `json:"sfp_present,omitempty"`
SFPIdentifier *string `json:"sfp_identifier,omitempty"`
SFPConnector *string `json:"sfp_connector,omitempty"`
SFPVendor *string `json:"sfp_vendor,omitempty"`
SFPPartNumber *string `json:"sfp_part_number,omitempty"`
SFPSerialNumber *string `json:"sfp_serial_number,omitempty"`
SFPWavelengthNM *float64 `json:"sfp_wavelength_nm,omitempty"`
SFPTemperatureC *float64 `json:"sfp_temperature_c,omitempty"` SFPTemperatureC *float64 `json:"sfp_temperature_c,omitempty"`
SFPTXPowerDBM *float64 `json:"sfp_tx_power_dbm,omitempty"` SFPTXPowerDBM *float64 `json:"sfp_tx_power_dbm,omitempty"`
SFPRXPowerDBM *float64 `json:"sfp_rx_power_dbm,omitempty"` SFPRXPowerDBM *float64 `json:"sfp_rx_power_dbm,omitempty"`


@@ -482,12 +482,13 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
return return
} }
var body struct { var body struct {
Duration int `json:"duration"` Duration int `json:"duration"`
StressMode bool `json:"stress_mode"` StressMode bool `json:"stress_mode"`
GPUIndices []int `json:"gpu_indices"` GPUIndices []int `json:"gpu_indices"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices"` ExcludeGPUIndices []int `json:"exclude_gpu_indices"`
Loader string `json:"loader"` StaggerGPUStart bool `json:"stagger_gpu_start"`
Loader string `json:"loader"`
Profile string `json:"profile"` Profile string `json:"profile"`
DisplayName string `json:"display_name"` DisplayName string `json:"display_name"`
PlatformComponents []string `json:"platform_components"` PlatformComponents []string `json:"platform_components"`
@@ -503,12 +504,13 @@ func (h *handler) handleAPISATRun(target string) http.HandlerFunc {
if strings.TrimSpace(body.DisplayName) != "" { if strings.TrimSpace(body.DisplayName) != "" {
name = body.DisplayName name = body.DisplayName
} }
params := taskParams{ params := taskParams{
Duration: body.Duration, Duration: body.Duration,
StressMode: body.StressMode, StressMode: body.StressMode,
GPUIndices: body.GPUIndices, GPUIndices: body.GPUIndices,
ExcludeGPUIndices: body.ExcludeGPUIndices, ExcludeGPUIndices: body.ExcludeGPUIndices,
Loader: body.Loader, StaggerGPUStart: body.StaggerGPUStart,
Loader: body.Loader,
BurnProfile: body.Profile, BurnProfile: body.Profile,
DisplayName: body.DisplayName, DisplayName: body.DisplayName,
PlatformComponents: body.PlatformComponents, PlatformComponents: body.PlatformComponents,


@@ -83,6 +83,10 @@ func renderMetricChartSVG(title string, labels []string, times []time.Time, data
} }
} }
// Downsample to at most ~1400 points (one per pixel) before building SVG.
times, datasets = downsampleTimeSeries(times, datasets, 1400)
pointCount = len(times)
statsLabel := chartStatsLabel(datasets) statsLabel := chartStatsLabel(datasets)
legendItems := []metricChartSeries{} legendItems := []metricChartSeries{}
@@ -196,6 +200,19 @@ func drawGPUOverviewChartSVG(title string, labels []string, times []time.Time, s
} }
} }
// Downsample to at most ~1400 points before building SVG.
{
datasets := make([][]float64, len(series))
for i := range series {
datasets[i] = series[i].Values
}
times, datasets = downsampleTimeSeries(times, datasets, 1400)
pointCount = len(times)
for i := range series {
series[i].Values = datasets[i]
}
}
scales := make([]chartScale, len(series)) scales := make([]chartScale, len(series))
for i := range series { for i := range series {
min, max := chartSeriesBounds(series[i].Values) min, max := chartSeriesBounds(series[i].Values)
@@ -626,6 +643,87 @@ func writeTimelineBoundaries(b *strings.Builder, layout chartLayout, start, end
b.WriteString(`</g>` + "\n") b.WriteString(`</g>` + "\n")
} }
// downsampleTimeSeries reduces the time series to at most maxPts points using
// min-max bucketing. Each bucket contributes the index of its min and max value
// (using the first full-length dataset as the reference series). All parallel
// datasets are sampled at those same indices so all series stay aligned.
// If len(times) <= maxPts the inputs are returned unchanged.
func downsampleTimeSeries(times []time.Time, datasets [][]float64, maxPts int) ([]time.Time, [][]float64) {
n := len(times)
if n <= maxPts || maxPts <= 0 {
return times, datasets
}
buckets := maxPts / 2
if buckets < 1 {
buckets = 1
}
// Use the first dataset that has the same length as times as the reference
// for deciding which two indices to keep per bucket.
var ref []float64
for _, ds := range datasets {
if len(ds) == n {
ref = ds
break
}
}
selected := make([]int, 0, maxPts)
bucketSize := float64(n) / float64(buckets)
for b := 0; b < buckets; b++ {
lo := int(math.Round(float64(b) * bucketSize))
hi := int(math.Round(float64(b+1) * bucketSize))
if hi > n {
hi = n
}
if lo >= hi {
continue
}
if ref == nil {
selected = append(selected, lo)
if hi-1 != lo {
selected = append(selected, hi-1)
}
continue
}
minIdx, maxIdx := lo, lo
for i := lo + 1; i < hi; i++ {
if ref[i] < ref[minIdx] {
minIdx = i
}
if ref[i] > ref[maxIdx] {
maxIdx = i
}
}
if minIdx <= maxIdx {
selected = append(selected, minIdx)
if maxIdx != minIdx {
selected = append(selected, maxIdx)
}
} else {
selected = append(selected, maxIdx)
if minIdx != maxIdx {
selected = append(selected, minIdx)
}
}
}
outTimes := make([]time.Time, len(selected))
for i, idx := range selected {
outTimes[i] = times[idx]
}
outDatasets := make([][]float64, len(datasets))
for d, ds := range datasets {
if len(ds) != n {
outDatasets[d] = ds
continue
}
out := make([]float64, len(selected))
for i, idx := range selected {
out[i] = ds[idx]
}
outDatasets[d] = out
}
return outTimes, outDatasets
}
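
For readers unfamiliar with min-max bucketing, the idea behind `downsampleTimeSeries` can be shown on a single series. The sketch below is a simplified illustration, not the code from this change: `downsampleMinMax` is a hypothetical name, it returns the selected indices only, and it uses integer bucket edges instead of the `math.Round` edges in the diff.

```go
package main

import "fmt"

// downsampleMinMax returns the indices to keep so that at most maxPts
// points survive. The series is cut into maxPts/2 buckets and each bucket
// contributes the index of its minimum and of its maximum, in time order.
func downsampleMinMax(values []float64, maxPts int) []int {
	n := len(values)
	if maxPts <= 0 || n <= maxPts {
		// Nothing to reduce: keep every index.
		idx := make([]int, n)
		for i := range idx {
			idx[i] = i
		}
		return idx
	}
	buckets := maxPts / 2
	if buckets < 1 {
		buckets = 1
	}
	selected := make([]int, 0, 2*buckets)
	for b := 0; b < buckets; b++ {
		lo := b * n / buckets
		hi := (b + 1) * n / buckets
		if lo >= hi {
			continue
		}
		minIdx, maxIdx := lo, lo
		for i := lo + 1; i < hi; i++ {
			if values[i] < values[minIdx] {
				minIdx = i
			}
			if values[i] > values[maxIdx] {
				maxIdx = i
			}
		}
		// Emit the two indices in ascending order so time stays monotonic.
		first, second := minIdx, maxIdx
		if first > second {
			first, second = second, first
		}
		selected = append(selected, first)
		if second != first {
			selected = append(selected, second)
		}
	}
	return selected
}

func main() {
	values := []float64{1, 9, 2, 3, 0, 4, 5, 8}
	fmt.Println(downsampleMinMax(values, 4)) // prints [0 1 4 7]
}
```

Keeping both extremes per bucket preserves short spikes and dips that plain stride sampling would drop, which is presumably why the chart code prefers it for thermal and power traces.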
func chartXForTime(ts, start, end time.Time, left, right int) float64 { func chartXForTime(ts, start, end time.Time, left, right int) float64 {
if !end.After(start) { if !end.After(start) {
return float64(left+right) / 2 return float64(left+right) / 2


@@ -317,106 +317,299 @@ func renderHardwareSummaryCard(opts HandlerOptions) string {
if err != nil { if err != nil {
return `<div class="card"><div class="card-head card-head-actions"><span>Hardware Summary</span><div class="card-head-buttons"><button class="btn btn-primary btn-sm" onclick="auditModalRun()">Run audit</button></div></div><div class="card-body"></div></div>` return `<div class="card"><div class="card-head card-head-actions"><span>Hardware Summary</span><div class="card-head-buttons"><button class="btn btn-primary btn-sm" onclick="auditModalRun()">Run audit</button></div></div><div class="card-body"></div></div>`
} }
// Parse just enough fields for the summary banner var ingest schema.HardwareIngestRequest
var snap struct { if err := json.Unmarshal(data, &ingest); err != nil {
Summary struct {
CPU struct{ Model string }
Memory struct{ TotalGB float64 }
Storage []struct{ Device, Model, Size string }
GPUs []struct{ Model string }
PSUs []struct{ Model string }
}
Network struct {
Interfaces []struct {
Name string
IPv4 []string
State string
}
}
}
// Try to extract top-level fields loosely
var raw map[string]json.RawMessage
if err := json.Unmarshal(data, &raw); err != nil {
return `<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body"><span class="badge badge-err">Parse error</span></div></div>` return `<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body"><span class="badge badge-err">Parse error</span></div></div>`
} }
_ = snap hw := ingest.Hardware
// Also load runtime-health for badges var records []app.ComponentStatusRecord
type componentHealth struct { if db, err := app.OpenComponentStatusDB(filepath.Join(opts.ExportDir, "component-status.json")); err == nil {
FailCount int `json:"fail_count"` records = db.All()
WarnCount int `json:"warn_count"`
} }
type healthSummary struct {
CPU componentHealth `json:"cpu"`
Memory componentHealth `json:"memory"`
Storage componentHealth `json:"storage"`
GPU componentHealth `json:"gpu"`
PSU componentHealth `json:"psu"`
Network componentHealth `json:"network"`
}
var health struct {
HardwareHealth healthSummary `json:"hardware_health"`
}
if hdata, herr := loadSnapshot(filepath.Join(opts.ExportDir, "runtime-health.json")); herr == nil {
_ = json.Unmarshal(hdata, &health)
}
badge := func(h componentHealth) string {
if h.FailCount > 0 {
return `<span class="badge badge-err">FAIL</span>`
}
if h.WarnCount > 0 {
return `<span class="badge badge-warn">WARN</span>`
}
return `<span class="badge badge-ok">OK</span>`
}
// Extract readable strings from raw JSON
getString := func(key string) string {
v, ok := raw[key]
if !ok {
return ""
}
var s string
if err := json.Unmarshal(v, &s); err == nil {
return s
}
return ""
}
cpuModel := getString("cpu_model")
memStr := getString("memory_summary")
gpuSummary := getString("gpu_summary")
var b strings.Builder var b strings.Builder
b.WriteString(`<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body">`) b.WriteString(`<div class="card"><div class="card-head">Hardware Summary</div><div class="card-body">`)
b.WriteString(`<table style="width:auto">`) b.WriteString(`<table style="width:auto">`)
writeRow := func(label, value, badgeHTML string) { writeRow := func(label, value, badgeHTML string) {
b.WriteString(fmt.Sprintf(`<tr><td style="padding:6px 14px 6px 0;font-weight:700;white-space:nowrap">%s</td><td style="padding:6px 0">%s</td><td style="padding:6px 0 6px 12px">%s</td></tr>`, b.WriteString(fmt.Sprintf(`<tr><td style="padding:6px 14px 6px 0;font-weight:700;white-space:nowrap">%s</td><td style="padding:6px 0;color:var(--muted);font-size:13px">%s</td><td style="padding:6px 0 6px 12px">%s</td></tr>`,
html.EscapeString(label), html.EscapeString(value), badgeHTML)) html.EscapeString(label), html.EscapeString(value), badgeHTML))
} }
if cpuModel != "" {
writeRow("CPU", cpuModel, badge(health.HardwareHealth.CPU)) cpuRow := aggregateComponentStatus("CPU", records, []string{"cpu:all"}, nil)
} else { writeRow("CPU", hwDescribeCPU(hw), runtimeStatusBadge(cpuRow.Status))
writeRow("CPU", "—", badge(health.HardwareHealth.CPU))
memRow := aggregateComponentStatus("Memory", records, []string{"memory:all"}, []string{"memory:"})
writeRow("Memory", hwDescribeMemory(hw), runtimeStatusBadge(memRow.Status))
storageRow := aggregateComponentStatus("Storage", records, []string{"storage:all"}, []string{"storage:"})
writeRow("Storage", hwDescribeStorage(hw), runtimeStatusBadge(storageRow.Status))
gpuRow := aggregateComponentStatus("GPU", records, nil, []string{"pcie:gpu:"})
writeRow("GPU", hwDescribeGPU(hw), runtimeStatusBadge(gpuRow.Status))
psuRow := aggregateComponentStatus("PSU", records, nil, []string{"psu:"})
if psuRow.Status == "UNKNOWN" && len(hw.PowerSupplies) > 0 {
psuRow.Status = hwPSUStatus(hw.PowerSupplies)
} }
if memStr != "" { writeRow("PSU", hwDescribePSU(hw), runtimeStatusBadge(psuRow.Status))
writeRow("Memory", memStr, badge(health.HardwareHealth.Memory))
} else { if nicDesc := hwDescribeNIC(hw); nicDesc != "" {
writeRow("Memory", "—", badge(health.HardwareHealth.Memory)) writeRow("Network", nicDesc, "")
} }
if gpuSummary != "" {
writeRow("GPU", gpuSummary, badge(health.HardwareHealth.GPU))
} else {
writeRow("GPU", "—", badge(health.HardwareHealth.GPU))
}
writeRow("Storage", "—", badge(health.HardwareHealth.Storage))
writeRow("PSU", "—", badge(health.HardwareHealth.PSU))
b.WriteString(`</table>`) b.WriteString(`</table>`)
b.WriteString(`</div></div>`) b.WriteString(`</div></div>`)
return b.String() return b.String()
} }
// hwDescribeCPU returns a human-readable CPU summary, e.g. "2× Intel Xeon Gold 6338".
func hwDescribeCPU(hw schema.HardwareSnapshot) string {
counts := map[string]int{}
order := []string{}
for _, cpu := range hw.CPUs {
model := "Unknown CPU"
if cpu.Model != nil && *cpu.Model != "" {
model = *cpu.Model
}
if counts[model] == 0 {
order = append(order, model)
}
counts[model]++
}
if len(order) == 0 {
return "—"
}
parts := make([]string, 0, len(order))
for _, m := range order {
if counts[m] > 1 {
parts = append(parts, fmt.Sprintf("%d× %s", counts[m], m))
} else {
parts = append(parts, m)
}
}
return strings.Join(parts, ", ")
}
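
The count-in-first-seen-order pattern used by `hwDescribeCPU` recurs in the GPU and NIC helpers further down. A minimal sketch of how it could be factored into one function is given here; `summarizeModels` and its `fallback` parameter are hypothetical and do not appear in the change set.

```go
package main

import (
	"fmt"
	"strings"
)

// summarizeModels groups identical model names, preserving first-seen
// order, and renders them as "2× Model A, Model B". Empty names are
// replaced by fallback. It returns "" when models is empty so the caller
// can choose its own placeholder.
func summarizeModels(models []string, fallback string) string {
	counts := map[string]int{}
	order := []string{}
	for _, m := range models {
		m = strings.TrimSpace(m)
		if m == "" {
			m = fallback
		}
		if counts[m] == 0 {
			order = append(order, m)
		}
		counts[m]++
	}
	parts := make([]string, 0, len(order))
	for _, m := range order {
		if counts[m] > 1 {
			parts = append(parts, fmt.Sprintf("%d× %s", counts[m], m))
		} else {
			parts = append(parts, m)
		}
	}
	return strings.Join(parts, ", ")
}

func main() {
	cpus := []string{"Intel Xeon Gold 6338", "Intel Xeon Gold 6338"}
	fmt.Println(summarizeModels(cpus, "Unknown CPU"))
}
```

A separate `order` slice is needed because Go map iteration order is randomized; without it the summary string would change between renders.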
// hwDescribeMemory returns a summary like "16× 32 GB DDR4".
func hwDescribeMemory(hw schema.HardwareSnapshot) string {
type key struct {
sizeMB int
typ string
}
counts := map[key]int{}
order := []key{}
for _, dimm := range hw.Memory {
if dimm.SizeMB == nil || *dimm.SizeMB == 0 {
continue
}
t := ""
if dimm.Type != nil {
t = *dimm.Type
}
k := key{*dimm.SizeMB, t}
if counts[k] == 0 {
order = append(order, k)
}
counts[k]++
}
if len(order) == 0 {
return "—"
}
parts := make([]string, 0, len(order))
for _, k := range order {
gb := k.sizeMB / 1024
desc := fmt.Sprintf("%d× %d GB", counts[k], gb)
if k.typ != "" {
desc += " " + k.typ
}
parts = append(parts, desc)
}
return strings.Join(parts, ", ")
}
// hwDescribeStorage returns a summary like "4× 3.84 TB NVMe, 2× 1.92 TB SATA".
func hwDescribeStorage(hw schema.HardwareSnapshot) string {
type key struct {
sizeGB int
iface string
}
counts := map[key]int{}
order := []key{}
for _, disk := range hw.Storage {
sz := 0
if disk.SizeGB != nil {
sz = *disk.SizeGB
}
iface := ""
if disk.Interface != nil {
iface = *disk.Interface
} else if disk.Type != nil {
iface = *disk.Type
}
k := key{sz, iface}
if counts[k] == 0 {
order = append(order, k)
}
counts[k]++
}
if len(order) == 0 {
return "—"
}
parts := make([]string, 0, len(order))
for _, k := range order {
var sizeStr string
if k.sizeGB >= 1000 {
sizeStr = fmt.Sprintf("%.2g TB", float64(k.sizeGB)/1000)
} else if k.sizeGB > 0 {
sizeStr = fmt.Sprintf("%d GB", k.sizeGB)
} else {
sizeStr = "?"
}
desc := fmt.Sprintf("%d× %s", counts[k], sizeStr)
if k.iface != "" {
desc += " " + k.iface
}
parts = append(parts, desc)
}
return strings.Join(parts, ", ")
}
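
One detail worth checking in `hwDescribeStorage`: the doc comment gives "3.84 TB" as an example, but the `%.2g` verb keeps two significant digits, so a 3840 GB drive would render as "3.8 TB". The sketch below contrasts the two behaviours; `formatTwoSig` and `formatShortest` are hypothetical helper names, and the `strconv.FormatFloat` variant is only one possible alternative, not something this change set uses.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatTwoSig mirrors the formatting used in the diff: %.2g keeps two
// significant digits.
func formatTwoSig(sizeGB int) string {
	return fmt.Sprintf("%.2g TB", float64(sizeGB)/1000)
}

// formatShortest prints the shortest decimal that round-trips the value,
// which keeps capacities such as 3.84 intact.
func formatShortest(sizeGB int) string {
	return strconv.FormatFloat(float64(sizeGB)/1000, 'f', -1, 64) + " TB"
}

func main() {
	for _, gb := range []int{1920, 3840, 15360} {
		fmt.Printf("%d GB: %q vs %q\n", gb, formatTwoSig(gb), formatShortest(gb))
	}
}
```

Whether the shorter form is intended is a question for the author; two significant digits may be a deliberate choice to keep the summary row compact.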
// hwDescribeGPU returns a summary like "8× NVIDIA H100 80GB".
func hwDescribeGPU(hw schema.HardwareSnapshot) string {
counts := map[string]int{}
order := []string{}
for _, dev := range hw.PCIeDevices {
if dev.DeviceClass == nil {
continue
}
if !isGPUDeviceClass(*dev.DeviceClass) {
continue
}
model := "Unknown GPU"
if dev.Model != nil && *dev.Model != "" {
model = *dev.Model
}
if counts[model] == 0 {
order = append(order, model)
}
counts[model]++
}
if len(order) == 0 {
return "—"
}
parts := make([]string, 0, len(order))
for _, m := range order {
if counts[m] > 1 {
parts = append(parts, fmt.Sprintf("%d× %s", counts[m], m))
} else {
parts = append(parts, m)
}
}
return strings.Join(parts, ", ")
}
// hwPSUStatus returns "OK", "CRITICAL", "WARNING", or "UNKNOWN" based on
// PSU statuses from the audit snapshot. Used as fallback when component-status.json
// has no psu: records yet (e.g. first boot before audit writes them).
func hwPSUStatus(psus []schema.HardwarePowerSupply) string {
worst := "UNKNOWN"
for _, psu := range psus {
if psu.Status == nil {
continue
}
switch strings.ToUpper(strings.TrimSpace(*psu.Status)) {
case "CRITICAL":
return "CRITICAL"
case "WARNING":
if worst != "CRITICAL" {
worst = "WARNING"
}
case "OK":
if worst == "UNKNOWN" {
worst = "OK"
}
}
}
return worst
}
// hwDescribePSU returns a summary like "2× 1600 W" or "2× PSU".
func hwDescribePSU(hw schema.HardwareSnapshot) string {
n := len(hw.PowerSupplies)
if n == 0 {
return "—"
}
// Try to get a consistent wattage
watt := 0
consistent := true
for _, psu := range hw.PowerSupplies {
if psu.WattageW == nil {
consistent = false
break
}
if watt == 0 {
watt = *psu.WattageW
} else if *psu.WattageW != watt {
consistent = false
break
}
}
if consistent && watt > 0 {
return fmt.Sprintf("%d× %d W", n, watt)
}
return fmt.Sprintf("%d× PSU", n)
}
// hwDescribeNIC returns a summary like "2× Mellanox ConnectX-6".
func hwDescribeNIC(hw schema.HardwareSnapshot) string {
counts := map[string]int{}
order := []string{}
for _, dev := range hw.PCIeDevices {
isNIC := false
if dev.DeviceClass != nil {
c := strings.ToLower(strings.TrimSpace(*dev.DeviceClass))
isNIC = c == "ethernetcontroller" || c == "networkcontroller" || strings.Contains(c, "fibrechannel")
}
if !isNIC && len(dev.MacAddresses) == 0 {
continue
}
model := ""
if dev.Model != nil && *dev.Model != "" {
model = *dev.Model
} else if dev.Manufacturer != nil && *dev.Manufacturer != "" {
model = *dev.Manufacturer + " NIC"
} else {
model = "NIC"
}
if counts[model] == 0 {
order = append(order, model)
}
counts[model]++
}
if len(order) == 0 {
return ""
}
parts := make([]string, 0, len(order))
for _, m := range order {
if counts[m] > 1 {
parts = append(parts, fmt.Sprintf("%d× %s", counts[m], m))
} else {
parts = append(parts, m)
}
}
return strings.Join(parts, ", ")
}
func isGPUDeviceClass(class string) bool {
switch strings.TrimSpace(class) {
case "VideoController", "DisplayController", "ProcessingAccelerator":
return true
default:
return false
}
}
func renderAuditModal() string { func renderAuditModal() string {
return `<div id="audit-modal-overlay" style="display:none;position:fixed;inset:0;background:rgba(0,0,0,.5);z-index:100;align-items:center;justify-content:center"> return `<div id="audit-modal-overlay" style="display:none;position:fixed;inset:0;background:rgba(0,0,0,.5);z-index:100;align-items:center;justify-content:center">
<div style="background:#fff;border-radius:6px;padding:24px;min-width:480px;max-width:1100px;width:min(1100px,92vw);max-height:92vh;overflow:auto;position:relative"> <div style="background:#fff;border-radius:6px;padding:24px;min-width:480px;max-width:1100px;width:min(1100px,92vw);max-height:92vh;overflow:auto;position:relative">
@@ -481,8 +674,9 @@ func renderHealthCard(opts HandlerOptions) string {
buildRuntimeAccelerationRow(health), buildRuntimeAccelerationRow(health),
buildRuntimeToolsRow(health), buildRuntimeToolsRow(health),
buildRuntimeServicesRow(health), buildRuntimeServicesRow(health),
buildRuntimeUSBExportRow(health),
buildRuntimeToRAMRow(health),
} }
rows = append(rows, buildHardwareComponentRows(opts.ExportDir)...)
b.WriteString(`<table><thead><tr><th>Check</th><th>Status</th><th>Source</th><th>Issue</th></tr></thead><tbody>`) b.WriteString(`<table><thead><tr><th>Check</th><th>Status</th><th>Source</th><th>Issue</th></tr></thead><tbody>`)
for _, row := range rows { for _, row := range rows {
b.WriteString(`<tr><td>` + html.EscapeString(row.Title) + `</td><td>` + runtimeStatusBadge(row.Status) + `</td><td>` + html.EscapeString(row.Source) + `</td><td>` + rowIssueHTML(row.Issue) + `</td></tr>`) b.WriteString(`<tr><td>` + html.EscapeString(row.Title) + `</td><td>` + runtimeStatusBadge(row.Status) + `</td><td>` + html.EscapeString(row.Source) + `</td><td>` + rowIssueHTML(row.Issue) + `</td></tr>`)
@@ -578,7 +772,13 @@ func buildRuntimeServicesRow(health schema.RuntimeHealth) runtimeHealthRow {
nonActive := make([]string, 0) nonActive := make([]string, 0)
for _, svc := range health.Services { for _, svc := range health.Services {
state := strings.TrimSpace(strings.ToLower(svc.Status)) state := strings.TrimSpace(strings.ToLower(svc.Status))
if state != "active" { // "activating" and "deactivating" are transient states for oneshot services
// (RemainAfterExit=yes) — the service is running normally, not failed.
// Only "failed" and "inactive" (after services should be running) are problems.
switch state {
case "active", "activating", "deactivating", "reloading":
// OK — service is running or transitioning normally
default:
nonActive = append(nonActive, svc.Name+"="+svc.Status) nonActive = append(nonActive, svc.Name+"="+svc.Status)
} }
} }
@@ -591,6 +791,51 @@ func buildRuntimeServicesRow(health schema.RuntimeHealth) runtimeHealthRow {
return runtimeHealthRow{Title: "Bee Services", Status: status, Source: "ServiceState", Issue: issue} return runtimeHealthRow{Title: "Bee Services", Status: status, Source: "ServiceState", Issue: issue}
} }
func buildRuntimeUSBExportRow(health schema.RuntimeHealth) runtimeHealthRow {
path := strings.TrimSpace(health.USBExportPath)
if path != "" {
return runtimeHealthRow{
Title: "USB Export Drive",
Status: "OK",
Source: "/proc/mounts + lsblk",
Issue: path,
}
}
return runtimeHealthRow{
Title: "USB Export Drive",
Status: "WARNING",
Source: "/proc/mounts + lsblk",
Issue: "No writable USB drive mounted. Plug in a USB drive to enable log export.",
}
}
func buildRuntimeToRAMRow(health schema.RuntimeHealth) runtimeHealthRow {
switch strings.ToLower(strings.TrimSpace(health.ToRAMStatus)) {
case "ok":
return runtimeHealthRow{
Title: "LiveCD in RAM",
Status: "OK",
Source: "live-boot / /proc/mounts",
Issue: "",
}
case "failed":
return runtimeHealthRow{
Title: "LiveCD in RAM",
Status: "FAILED",
Source: "live-boot / /proc/mounts",
Issue: "toram boot parameter set but ISO is not mounted from RAM. Copy may have failed.",
}
default:
// toram not active — ISO still on original boot media (USB/CD)
return runtimeHealthRow{
Title: "LiveCD in RAM",
Status: "WARNING",
Source: "live-boot / /proc/mounts",
Issue: "ISO not copied to RAM. Use \u201cCopy to RAM\u201d to free the boot drive and improve performance.",
}
}
}
func buildHardwareComponentRows(exportDir string) []runtimeHealthRow { func buildHardwareComponentRows(exportDir string) []runtimeHealthRow {
path := filepath.Join(exportDir, "component-status.json") path := filepath.Join(exportDir, "component-status.json")
db, err := app.OpenComponentStatusDB(path) db, err := app.OpenComponentStatusDB(path)
@@ -1031,25 +1276,26 @@ func renderValidate(opts HandlerOptions) string {
return `<div class="alert alert-info" style="margin-bottom:16px"><strong>Non-destructive:</strong> Validate tests collect diagnostics only. They do not write to disks, do not run sustained load, and do not increment hardware wear counters.</div> return `<div class="alert alert-info" style="margin-bottom:16px"><strong>Non-destructive:</strong> Validate tests collect diagnostics only. They do not write to disks, do not run sustained load, and do not increment hardware wear counters.</div>
<p style="color:var(--muted);font-size:13px;margin-bottom:16px">Tasks continue in the background — view progress in <a href="/tasks">Tasks</a>.</p> <p style="color:var(--muted);font-size:13px;margin-bottom:16px">Tasks continue in the background — view progress in <a href="/tasks">Tasks</a>.</p>
<div class="card" style="margin-bottom:16px"> <div class="card" style="margin-bottom:16px">
<div class="card-head">Validate Profile</div> <div class="card-head">Validate Profile</div>
<div class="card-body validate-profile-body"> <div class="card-body validate-profile-body">
<div class="validate-profile-col"> <div class="validate-profile-col">
<div class="form-row" style="margin:0"><label>Cycles</label><input type="number" id="sat-cycles" value="1" min="1" max="100" style="width:100%"></div> <div class="form-row" style="margin:0"><label>Cycles</label><input type="number" id="sat-cycles" value="1" min="1" max="100" style="width:100%"></div>
<div class="form-row" style="margin:12px 0 0"><label>Mode</label></div> </div>
<label class="cb-row"><input type="radio" name="sat-mode" id="sat-mode-validate" value="validate" checked onchange="satModeChanged()"><span>Validate — quick non-destructive check</span></label> <div class="validate-profile-col">
<label class="cb-row"><input type="radio" name="sat-mode" id="sat-mode-stress" value="stress" onchange="satModeChanged()"><span>Stress — thorough load test (~30-60 min)</span></label> <div class="form-row" style="margin:12px 0 0"><label>Mode</label></div>
</div> <label class="cb-row"><input type="radio" name="sat-mode" id="sat-mode-validate" value="validate" checked onchange="satModeChanged()"><span>Validate — quick non-destructive check</span></label>
<div class="validate-profile-col validate-profile-action"> <label class="cb-row"><input type="radio" name="sat-mode" id="sat-mode-stress" value="stress" onchange="satModeChanged()"><span>Stress — thorough load test (~30-60 min)</span></label>
<p style="color:var(--muted);font-size:12px;margin:0 0 10px">Runs validate modules sequentially with the selected cycle count and mode. Validate is quick (~5-15 min total); Stress is thorough (~30-60 min total).</p> </div>
<button class="btn btn-primary" onclick="runAllSAT()">Validate one by one</button> <div class="validate-profile-col validate-profile-action">
</div> <p style="color:var(--muted);font-size:12px;margin:0 0 10px">Runs validate modules sequentially with the selected cycle count and mode. Validate is quick (~5-15 min total); Stress is thorough (~30-60 min total).</p>
<div class="validate-profile-col"></div> <button type="button" class="btn btn-primary" onclick="runAllSAT()">Validate one by one</button>
</div> <div style="margin-top:12px">
<div class="card-body" style="padding-top:0;display:flex;justify-content:center"> <span id="sat-all-status" style="font-size:12px;color:var(--muted)"></span>
<span id="sat-all-status" style="font-size:12px;color:var(--muted)"></span> </div>
</div> </div>
</div> </div>
</div>
<div class="grid3"> <div class="grid3">
` + renderSATCard("cpu", "CPU", "runSAT('cpu')", "", renderValidateCardBody( ` + renderSATCard("cpu", "CPU", "runSAT('cpu')", "", renderValidateCardBody(
@@ -1156,7 +1402,7 @@ func renderValidate(opts HandlerOptions) string {
</div> </div>
<style> <style>
.validate-profile-body { display:grid; grid-template-columns:1fr 1fr 1fr; gap:24px; align-items:stretch; } .validate-profile-body { display:grid; grid-template-columns:1fr 1fr 1fr; gap:24px; align-items:stretch; }
.validate-profile-col { min-width:0; } .validate-profile-col { min-width:0; display:flex; flex-direction:column; }
.validate-profile-action { display:flex; flex-direction:column; align-items:center; justify-content:center; } .validate-profile-action { display:flex; flex-direction:column; align-items:center; justify-content:center; }
.validate-card-body { padding:0; } .validate-card-body { padding:0; }
.validate-card-section { padding:12px 16px 0; } .validate-card-section { padding:12px 16px 0; }
@@ -1437,8 +1683,8 @@ function runAllSAT() {
const cycles = Math.max(1, parseInt(document.getElementById('sat-cycles').value)||1); const cycles = Math.max(1, parseInt(document.getElementById('sat-cycles').value)||1);
const status = document.getElementById('sat-all-status'); const status = document.getElementById('sat-all-status');
status.textContent = 'Enqueuing...'; status.textContent = 'Enqueuing...';
const stressOnlyTargets = ['nvidia-targeted-stress', 'nvidia-targeted-power', 'nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth', 'hpl']; const stressOnlyTargets = ['nvidia-targeted-stress', 'nvidia-targeted-power', 'nvidia-pulse', 'nvidia-interconnect', 'nvidia-bandwidth'];
const baseTargets = ['nvidia','nvidia-targeted-stress','nvidia-targeted-power','nvidia-pulse','nvidia-interconnect','nvidia-bandwidth','hpl','memory','storage','cpu'].concat(selectedAMDValidateTargets()); const baseTargets = ['nvidia','nvidia-targeted-stress','nvidia-targeted-power','nvidia-pulse','nvidia-interconnect','nvidia-bandwidth','memory','storage','cpu'].concat(selectedAMDValidateTargets());
const activeTargets = baseTargets.filter(target => { const activeTargets = baseTargets.filter(target => {
if (stressOnlyTargets.indexOf(target) >= 0 && !satStressMode()) return false; if (stressOnlyTargets.indexOf(target) >= 0 && !satStressMode()) return false;
const btn = document.getElementById('sat-btn-' + target); const btn = document.getElementById('sat-btn-' + target);
@@ -1612,6 +1858,11 @@ func formatValidateDeviceSummary(total int, models map[string]int, unit string)
if total != 1 { if total != 1 {
label += "s" label += "s"
} }
// If there is only one model the leading count duplicates the per-model
// count already in parts (e.g. "4 GPU: 4 x RTX …" → "4 x RTX …").
if len(parts) == 1 {
return parts[0] + " " + label
}
return fmt.Sprintf("%d %s: %s", total, label, strings.Join(parts, ", ")) return fmt.Sprintf("%d %s: %s", total, label, strings.Join(parts, ", "))
} }
@@ -2095,11 +2346,11 @@ func renderBurn() string {
<label class="cb-row"><input type="radio" name="burn-profile" value="overnight"><span>Overnight — 8 hours</span></label> <label class="cb-row"><input type="radio" name="burn-profile" value="overnight"><span>Overnight — 8 hours</span></label>
</div> </div>
<div class="burn-profile-col burn-profile-action"> <div class="burn-profile-col burn-profile-action">
<button class="btn btn-primary" onclick="runAllBurnTasks()">Burn one by one</button> <button type="button" class="btn btn-primary" onclick="runAllBurnTasks()">Burn one by one</button>
<p>Run checked tests one by one. Tests run without cooldown. Each test duration is determined by the Burn Profile. Total test duration is the sum of all selected tests multiplied by the Burn Profile duration.</p> <p>Run checked tests one by one. Tests run without cooldown. Each test duration is determined by the Burn Profile. Total test duration is the sum of all selected tests multiplied by the Burn Profile duration.</p>
</div> </div>
<div class="burn-profile-col burn-profile-action"> <div class="burn-profile-col burn-profile-action">
<button class="btn btn-secondary" onclick="runPlatformStress()">Thermal Cycling</button> <button type="button" class="btn btn-secondary" onclick="runPlatformStress()">Thermal Cycling</button>
<p>Run checked core test modules (CPU, MEM, GPU). Tests start at the same time and run for a period with short cooldown phases to stress the server cooling system.</p> <p>Run checked core test modules (CPU, MEM, GPU). Tests start at the same time and run for a period with short cooldown phases to stress the server cooling system.</p>
</div> </div>
</div> </div>
@@ -2116,12 +2367,16 @@ func renderBurn() string {
<button class="btn btn-sm btn-secondary" type="button" onclick="burnSelectAll()">Select All</button> <button class="btn btn-sm btn-secondary" type="button" onclick="burnSelectAll()">Select All</button>
<button class="btn btn-sm btn-secondary" type="button" onclick="burnSelectNone()">Clear</button> <button class="btn btn-sm btn-secondary" type="button" onclick="burnSelectNone()">Clear</button>
</div> </div>
<div id="burn-gpu-list" style="border:1px solid var(--border);border-radius:4px;padding:12px;min-height:88px"> <div id="burn-gpu-list" style="border:1px solid var(--border);border-radius:4px;padding:12px;min-height:88px">
<p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p> <p style="color:var(--muted);font-size:13px">Loading NVIDIA GPUs...</p>
</div> </div>
<p id="burn-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA burn recipes.</p> <p id="burn-selection-note" style="font-size:12px;color:var(--muted);margin:10px 0 0">Select at least one NVIDIA GPU to enable NVIDIA burn recipes.</p>
</div> <label class="cb-row" style="margin-top:10px">
</div> <input type="checkbox" id="burn-stagger-nvidia">
<span>Ramp selected NVIDIA GPUs one by one before full-load hold. Uses a 3-minute stabilization window per GPU, then keeps all selected GPUs under load for the chosen Burn Profile duration.</span>
</label>
</div>
</div>
<div class="burn-section">Core Burn Paths</div> <div class="burn-section">Core Burn Paths</div>
<div class="grid2 burn-grid" style="margin-bottom:16px"> <div class="grid2 burn-grid" style="margin-bottom:16px">
@@ -2147,10 +2402,6 @@ func renderBurn() string {
</div> </div>
</div> </div>
<div class="burn-section">GPU-Specific Tests</div>
<div class="grid2 burn-grid" style="margin-bottom:16px">
</div>
<div id="bi-output" style="display:none;margin-top:16px" class="card"> <div id="bi-output" style="display:none;margin-top:16px" class="card">
<div class="card-head">Output <span id="bi-title"></span></div> <div class="card-head">Output <span id="bi-title"></span></div>
<div class="card-body"><div id="bi-terminal" class="terminal"></div></div> <div class="card-body"><div id="bi-terminal" class="terminal"></div></div>
@@ -2199,6 +2450,11 @@ function burnSelectedGPUIndices() {
.sort(function(a, b) { return a - b; }); .sort(function(a, b) { return a - b; });
} }
function burnUseNvidiaRampUp() {
const el = document.getElementById('burn-stagger-nvidia');
return !!(el && el.checked);
}
function burnUpdateSelectionNote() { function burnUpdateSelectionNote() {
const note = document.getElementById('burn-selection-note'); const note = document.getElementById('burn-selection-note');
const selected = burnSelectedGPUIndices(); const selected = burnSelectedGPUIndices();
@@ -2258,6 +2514,9 @@ function enqueueBurnTask(target, label, extra, useSelectedNvidia) {
return Promise.reject(new Error('Select at least one NVIDIA GPU.')); return Promise.reject(new Error('Select at least one NVIDIA GPU.'));
} }
body.gpu_indices = selected; body.gpu_indices = selected;
if (burnUseNvidiaRampUp() && selected.length > 1) {
body.stagger_gpu_start = true;
}
} }
return fetch('/api/sat/' + target + '/run', { return fetch('/api/sat/' + target + '/run', {
method: 'POST', method: 'POST',


@@ -1094,6 +1094,7 @@ func TestDashboardRendersRuntimeHealthTable(t *testing.T) {
} }
body := rec.Body.String() body := rec.Body.String()
for _, needle := range []string{ for _, needle := range []string{
// Runtime Health card — LiveCD checks only
`Runtime Health`, `Runtime Health`,
`<th>Check</th><th>Status</th><th>Source</th><th>Issue</th>`, `<th>Check</th><th>Status</th><th>Source</th><th>Issue</th>`,
`Export Directory`, `Export Directory`,
@@ -1102,16 +1103,18 @@ func TestDashboardRendersRuntimeHealthTable(t *testing.T) {
`CUDA / ROCm`, `CUDA / ROCm`,
`Required Utilities`, `Required Utilities`,
`Bee Services`, `Bee Services`,
`<td>CPU</td>`,
`<td>Memory</td>`,
`<td>Storage</td>`,
`<td>GPU</td>`,
`CUDA runtime is not ready for GPU SAT.`, `CUDA runtime is not ready for GPU SAT.`,
`Missing: nvidia-smi`, `Missing: nvidia-smi`,
`bee-nvidia=inactive`, `bee-nvidia=inactive`,
`cpu SAT: FAILED`, // Hardware Summary card — component health badges
`storage SAT: FAILED`, `Hardware Summary`,
`sat:nvidia`, `>CPU<`,
`>Memory<`,
`>Storage<`,
`>GPU<`,
`>PSU<`,
`badge-warn`, // cpu Warning badge
`badge-err`, // storage Critical badge
} { } {
if !strings.Contains(body, needle) { if !strings.Contains(body, needle) {
t.Fatalf("dashboard missing %q: %s", needle, body) t.Fatalf("dashboard missing %q: %s", needle, body)


@@ -118,6 +118,7 @@ type taskParams struct {
StressMode bool `json:"stress_mode,omitempty"` StressMode bool `json:"stress_mode,omitempty"`
GPUIndices []int `json:"gpu_indices,omitempty"` GPUIndices []int `json:"gpu_indices,omitempty"`
ExcludeGPUIndices []int `json:"exclude_gpu_indices,omitempty"` ExcludeGPUIndices []int `json:"exclude_gpu_indices,omitempty"`
StaggerGPUStart bool `json:"stagger_gpu_start,omitempty"`
SizeMB int `json:"size_mb,omitempty"` SizeMB int `json:"size_mb,omitempty"`
Passes int `json:"passes,omitempty"` Passes int `json:"passes,omitempty"`
Loader string `json:"loader,omitempty"` Loader string `json:"loader,omitempty"`
@@ -162,6 +163,13 @@ func resolveBurnPreset(profile string) burnPreset {
} }
} }
func boolToNvidiaStaggerSeconds(enabled bool, selected []int) int {
if enabled && len(selected) > 1 {
return 180
}
return 0
}
func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions { func resolvePlatformStressPreset(profile string) platform.PlatformStressOptions {
acceptanceCycles := []platform.PlatformStressCycle{ acceptanceCycles := []platform.PlatformStressCycle{
{LoadSec: 85, IdleSec: 5}, {LoadSec: 85, IdleSec: 5},
@@ -592,7 +600,7 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
RunNCCL: t.params.RunNCCL, RunNCCL: t.params.RunNCCL,
ParallelGPUs: t.params.ParallelGPUs, ParallelGPUs: t.params.ParallelGPUs,
}, j.append) }, j.append)
case "nvidia-compute": case "nvidia-compute":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
break break
@@ -601,7 +609,11 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, j.append) staggerSec := boolToNvidiaStaggerSeconds(t.params.StaggerGPUStart, t.params.GPUIndices)
if staggerSec > 0 {
j.append(fmt.Sprintf("NVIDIA staggered ramp-up enabled: %ds per GPU", staggerSec))
}
archive, err = a.RunNvidiaOfficialComputePack(ctx, "", dur, t.params.GPUIndices, staggerSec, j.append)
case "nvidia-targeted-power": case "nvidia-targeted-power":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")
@@ -651,12 +663,13 @@ func (q *taskQueue) runTask(t *Task, j *jobState, ctx context.Context) {
if t.params.BurnProfile != "" && dur <= 0 { if t.params.BurnProfile != "" && dur <= 0 {
dur = resolveBurnPreset(t.params.BurnProfile).DurationSec dur = resolveBurnPreset(t.params.BurnProfile).DurationSec
} }
archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{ archive, err = runNvidiaStressPackCtx(a, ctx, "", platform.NvidiaStressOptions{
DurationSec: dur, DurationSec: dur,
Loader: t.params.Loader, Loader: t.params.Loader,
GPUIndices: t.params.GPUIndices, GPUIndices: t.params.GPUIndices,
ExcludeGPUIndices: t.params.ExcludeGPUIndices, ExcludeGPUIndices: t.params.ExcludeGPUIndices,
}, j.append) StaggerSeconds: boolToNvidiaStaggerSeconds(t.params.StaggerGPUStart, t.params.GPUIndices),
}, j.append)
case "memory": case "memory":
if a == nil { if a == nil {
err = fmt.Errorf("app not configured") err = fmt.Errorf("app not configured")


@@ -0,0 +1,117 @@
# GPU Model Name Propagation
How GPU model names are detected, stored, and displayed throughout the project.
---
## Detection Sources
There are **three separate pipelines** for GPU model names; they use different structs and don't share state.
### Pipeline A — Live / SAT (nvidia-smi query at runtime)
**File:** `audit/internal/platform/sat.go`
- `ListNvidiaGPUs()` → `NvidiaGPU.Name` (field: `name`, from `nvidia-smi --query-gpu=index,name,...`)
- `ListNvidiaGPUStatuses()` → `NvidiaGPUStatus.Name`
- Used by: GPU selection UI, live metrics labels, burn/stress test logic
### Pipeline B — Benchmark results
**File:** `audit/internal/platform/benchmark.go`, line 124
- `queryBenchmarkGPUInfo(selected)` → `benchmarkGPUInfo.Name`
- Stored in `BenchmarkGPUResult.Name` (`json:"name,omitempty"`)
- Used by: benchmark history table, benchmark report
### Pipeline C — Hardware audit JSON (PCIe schema)
**File:** `audit/internal/schema/hardware.go`
- `HardwarePCIeDevice.Model *string` (field name is **Model**, not Name)
- For AMD GPUs: populated by `audit/internal/collector/amdgpu.go` from `info.Product`
- For NVIDIA GPUs: **NOT populated** by `audit/internal/collector/nvidia.go` — the NVIDIA enricher sets telemetry/status but skips the Model field
- Used by: hardware summary page (`hwDescribeGPU` in `pages.go:487`)
---
## Key Inconsistency: NVIDIA PCIe Model is Never Set
`audit/internal/collector/nvidia.go` → `enrichPCIeWithNVIDIAData()` enriches NVIDIA PCIe devices with telemetry and status but does **not** populate `HardwarePCIeDevice.Model`.
This means:
- Hardware summary page shows "Unknown GPU" for all NVIDIA devices (falls back at `pages.go:486`)
- AMD GPUs do have their model populated
The fix would be: copy `gpu.Name` from the SAT pipeline into `dev.Model` inside `enrichPCIeWithNVIDIAData`.
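The proposed fix can be sketched roughly as follows. This is a minimal illustration, not the project's code: `pcieDevice`, `gpuInfo`, and `setModelFromGPU` are hypothetical stand-ins for `HardwarePCIeDevice`, the NVIDIA GPU record, and the relevant step inside `enrichPCIeWithNVIDIAData`.

```go
package main

import (
	"fmt"
	"strings"
)

// pcieDevice is a hypothetical, trimmed-down stand-in for HardwarePCIeDevice.
type pcieDevice struct {
	Model *string
}

// gpuInfo is a hypothetical stand-in for the per-GPU data parsed from nvidia-smi.
type gpuInfo struct {
	Name string
}

// setModelFromGPU copies a non-blank GPU name into dev.Model.
// A blank name leaves dev.Model untouched so the UI fallback still applies.
func setModelFromGPU(dev *pcieDevice, gpu gpuInfo) {
	name := strings.TrimSpace(gpu.Name)
	if name == "" {
		return
	}
	// name is a local copy, so the pointer does not alias the caller's struct.
	dev.Model = &name
}

func main() {
	dev := pcieDevice{}
	setModelFromGPU(&dev, gpuInfo{Name: "NVIDIA H100 80GB HBM3"})
	fmt.Println(*dev.Model)
}
```

Skipping blank names keeps `Model` as `nil` rather than a pointer to an empty string, so the existing "Unknown GPU" fallback path stays the only place that handles missing names.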
---
## Benchmark History "Unknown GPU" Issue
**Symptom:** Benchmark history table shows "GPU #N — Unknown GPU" columns instead of real GPU model names.
**Root cause:** `BenchmarkGPUResult.Name` has tag `json:"name,omitempty"`. If `queryBenchmarkGPUInfo()` fails (warns at `benchmark.go:126`) or returns empty names, the Name field is never set and is omitted from JSON. Loaded results have empty Name → falls back to "Unknown GPU" at `pages.go:2226, 2237`.
This happens for:
- Older result files saved before the `Name` field was added
- Runs where nvidia-smi query failed before the benchmark started
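The `omitempty` behaviour behind this symptom can be reproduced with a small sketch. `gpuResult` is a hypothetical, reduced stand-in for `BenchmarkGPUResult`; only the `name` tag is taken from the description above, and the `index` field is an assumption added for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gpuResult is a hypothetical, reduced stand-in for BenchmarkGPUResult.
type gpuResult struct {
	Index int    `json:"index"`
	Name  string `json:"name,omitempty"`
}

// roundTrip marshals r to JSON and loads it back, as saving and
// later reading a result file would.
func roundTrip(r gpuResult) (string, gpuResult, error) {
	raw, err := json.Marshal(r)
	if err != nil {
		return "", gpuResult{}, err
	}
	var loaded gpuResult
	if err := json.Unmarshal(raw, &loaded); err != nil {
		return "", gpuResult{}, err
	}
	return string(raw), loaded, nil
}

func main() {
	raw, loaded, err := roundTrip(gpuResult{Index: 0})
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println(raw)
	fmt.Printf("loaded name is empty: %t\n", loaded.Name == "")
}
```

Because the key is absent from the saved JSON, a loaded result cannot distinguish "name was never queried" from "name was empty", which is why old result files cannot be repaired after the fact.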
---
## Fallback Strings — Current State
| Location | File | Fallback string |
|---|---|---|
| Hardware summary (PCIe) | `pages.go:486` | `"Unknown GPU"` |
| Benchmark report summary | `benchmark_report.go:43` | `"Unknown GPU"` |
| Benchmark report scorecard | `benchmark_report.go:93` | `"Unknown"` ← inconsistent |
| Benchmark report detail | `benchmark_report.go:122` | `"Unknown GPU"` |
| Benchmark history per-GPU col | `pages.go:2226` | `"Unknown GPU"` |
| Benchmark history parallel col | `pages.go:2237` | `"Unknown GPU"` |
| SAT status file write | `sat.go:922` | `"unknown"` ← lowercase, inconsistent |
| GPU selection API | `api.go:163` | `"GPU N"` (no "Unknown") |
**Rule:** all UI fallbacks should use `"Unknown GPU"`. The two outliers are `benchmark_report.go:93` (`"Unknown"`) and `sat.go:922` (`"unknown"`).
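One way to enforce this rule is to route every UI fallback through a single helper. The sketch below is an assumption about how that could look; `unknownGPULabel` and `gpuDisplayName` are hypothetical names, not existing identifiers in the project.

```go
package main

import (
	"fmt"
	"strings"
)

// unknownGPULabel is the single fallback string the rule asks for.
const unknownGPULabel = "Unknown GPU"

// gpuDisplayName returns the trimmed model name, or the shared
// fallback when the name is empty or whitespace-only.
func gpuDisplayName(name string) string {
	if trimmed := strings.TrimSpace(name); trimmed != "" {
		return trimmed
	}
	return unknownGPULabel
}

func main() {
	fmt.Println(gpuDisplayName("NVIDIA RTX 4090"))
	fmt.Println(gpuDisplayName(""))
}
```

With a shared helper, the outliers in the table would stop drifting because no call site would spell the fallback string itself.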
---
## GPU Selection UI
**File:** `audit/internal/webui/pages.go`
- Source: `GET /api/gpus` → `api.go` → `ListNvidiaGPUs()` → live nvidia-smi
- Render: `'GPU ' + gpu.index + ' — ' + gpu.name + ' · ' + mem`
- Fallback: `gpu.name || 'GPU ' + idx` (JS, line ~1432)
This always shows the correct model because it queries nvidia-smi live. It is **not** connected to benchmark result data.
---
## Data Flow Summary
```
nvidia-smi (live)
└─ ListNvidiaGPUs() → NvidiaGPU.Name
├─ GPU selection UI (always correct)
├─ Live metrics labels (charts_svg.go)
└─ SAT/burn status file (sat.go)
nvidia-smi (at benchmark start)
└─ queryBenchmarkGPUInfo() → benchmarkGPUInfo.Name
└─ BenchmarkGPUResult.Name (json:"name,omitempty")
├─ Benchmark report
└─ Benchmark history table columns
nvidia-smi / lspci (audit collection)
└─ HardwarePCIeDevice.Model (NVIDIA: NOT populated; AMD: populated)
└─ Hardware summary page hwDescribeGPU()
```
---
## What Needs Fixing
1. **NVIDIA PCIe Model**: `enrichPCIeWithNVIDIAData()` should set `dev.Model = &gpu.Name`
2. **Fallback consistency**: `benchmark_report.go:93` should say `"Unknown GPU"` not `"Unknown"`; `sat.go:922` should say `"Unknown GPU"` not `"unknown"`
3. **Old benchmark JSONs** — no fix possible for already-saved results with missing names (display-only issue)


@@ -11,18 +11,18 @@ echo " Hardware Audit LiveCD"
echo "" echo ""
menuentry "EASY-BEE" { menuentry "EASY-BEE" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@ initrd @INITRD_LIVE@
} }
submenu "EASY-BEE (advanced options) -->" { submenu "EASY-BEE (advanced options) -->" {
menuentry "EASY-BEE — GSP=off" { menuentry "EASY-BEE — GSP=off" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup linux @KERNEL_LIVE@ @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@ initrd @INITRD_LIVE@
} }
menuentry "EASY-BEE — KMS (no nomodeset)" { menuentry "EASY-BEE — KMS (no nomodeset)" {
linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable nowatchdog nosoftlockup linux @KERNEL_LIVE@ @APPEND_LIVE@ bee.nvidia.mode=normal net.ifnames=0 biosdevname=0 mitigations=off transparent_hugepage=always numa_balancing=disable pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1 nowatchdog nosoftlockup
initrd @INITRD_LIVE@ initrd @INITRD_LIVE@
} }


@@ -3,31 +3,31 @@ label live-@FLAVOUR@-normal
menu default menu default
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ bee.nvidia.mode=normal append @APPEND_LIVE@ bee.nvidia.mode=normal pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-kms label live-@FLAVOUR@-kms
menu label EASY-BEE (^graphics/KMS) menu label EASY-BEE (^graphics/KMS)
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=normal append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=normal pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-toram label live-@FLAVOUR@-toram
menu label EASY-BEE (^load to RAM) menu label EASY-BEE (^load to RAM)
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ toram bee.nvidia.mode=normal append @APPEND_LIVE@ toram bee.nvidia.mode=normal pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-gsp-off label live-@FLAVOUR@-gsp-off
menu label EASY-BEE (^NVIDIA GSP=off) menu label EASY-BEE (^NVIDIA GSP=off)
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off append @APPEND_LIVE@ nomodeset bee.nvidia.mode=gsp-off pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-kms-gsp-off label live-@FLAVOUR@-kms-gsp-off
menu label EASY-BEE (g^raphics/KMS, GSP=off) menu label EASY-BEE (g^raphics/KMS, GSP=off)
linux @LINUX@ linux @LINUX@
initrd @INITRD@ initrd @INITRD@
append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=gsp-off append @APPEND_LIVE@ bee.display=kms bee.nvidia.mode=gsp-off pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1
label live-@FLAVOUR@-failsafe label live-@FLAVOUR@-failsafe
menu label EASY-BEE (^fail-safe) menu label EASY-BEE (^fail-safe)


@@ -25,6 +25,7 @@ ensure_bee_console_user() {
ensure_bee_console_user ensure_bee_console_user
# Enable common bee services # Enable common bee services
systemctl enable bee-hpc-tuning.service
systemctl enable bee-network.service systemctl enable bee-network.service
systemctl enable bee-preflight.service systemctl enable bee-preflight.service
systemctl enable bee-audit.service systemctl enable bee-audit.service
@@ -55,6 +56,7 @@ fi
# nogpu: no GPU services needed # nogpu: no GPU services needed
# Ensure scripts are executable # Ensure scripts are executable
chmod +x /usr/local/bin/bee-hpc-tuning 2>/dev/null || true
chmod +x /usr/local/bin/bee-network.sh 2>/dev/null || true chmod +x /usr/local/bin/bee-network.sh 2>/dev/null || true
chmod +x /usr/local/bin/bee-sshsetup 2>/dev/null || true chmod +x /usr/local/bin/bee-sshsetup 2>/dev/null || true
chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true chmod +x /usr/local/bin/bee-smoketest 2>/dev/null || true


@@ -0,0 +1,14 @@
[Unit]
Description=Bee: HPC tuning (CPU governor, C-states)
After=local-fs.target
Before=bee-nvidia.service bee-audit.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/bee-log-run /appdata/bee/export/bee-hpc-tuning.log /usr/local/bin/bee-hpc-tuning
StandardOutput=journal
StandardError=journal
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target


@@ -0,0 +1,110 @@
#!/bin/sh
set -eu
SECONDS=300
STAGGER_SECONDS=180
DEVICES=""
EXCLUDE=""
usage() {
echo "usage: $0 [--seconds N] [--stagger-seconds N] [--devices 0,1] [--exclude 2,3]" >&2
exit 2
}
normalize_list() {
echo "${1:-}" | tr ',' '\n' | sed 's/[[:space:]]//g' | awk 'NF' | sort -n | uniq | paste -sd, -
}
contains_csv() {
needle="$1"
haystack="${2:-}"
echo ",${haystack}," | grep -q ",${needle},"
}
resolve_dcgmproftester() {
for candidate in dcgmproftester dcgmproftester13 dcgmproftester12 dcgmproftester11; do
if command -v "${candidate}" >/dev/null 2>&1; then
command -v "${candidate}"
return 0
fi
done
return 1
}
while [ "$#" -gt 0 ]; do
case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--stagger-seconds) [ "$#" -ge 2 ] || usage; STAGGER_SECONDS="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
*) usage ;;
esac
done
PROF=$(resolve_dcgmproftester) || { echo "dcgmproftester not found in PATH" >&2; exit 1; }
ALL_DEVICES=$(nvidia-smi --query-gpu=index --format=csv,noheader,nounits 2>/dev/null | sed 's/[[:space:]]//g' | awk 'NF' | paste -sd, -)
[ -n "${ALL_DEVICES}" ] || { echo "nvidia-smi found no NVIDIA GPUs" >&2; exit 1; }
DEVICES=$(normalize_list "${DEVICES}")
EXCLUDE=$(normalize_list "${EXCLUDE}")
SELECTED="${DEVICES}"
if [ -z "${SELECTED}" ]; then
SELECTED="${ALL_DEVICES}"
fi
FINAL=""
for id in $(echo "${SELECTED}" | tr ',' ' '); do
[ -n "${id}" ] || continue
if contains_csv "${id}" "${EXCLUDE}"; then
continue
fi
if [ -z "${FINAL}" ]; then
FINAL="${id}"
else
FINAL="${FINAL},${id}"
fi
done
[ -n "${FINAL}" ] || { echo "no NVIDIA GPUs selected after filters" >&2; exit 1; }
echo "loader=dcgmproftester-staggered"
echo "selected_gpus=${FINAL}"
echo "stagger_seconds=${STAGGER_SECONDS}"
TMP_DIR=$(mktemp -d)
trap 'rm -rf "${TMP_DIR}"' EXIT INT TERM
GPU_COUNT=$(echo "${FINAL}" | tr ',' '\n' | awk 'NF' | wc -l | tr -d '[:space:]')
gpu_pos=0
WORKERS=""
for id in $(echo "${FINAL}" | tr ',' ' '); do
gpu_pos=$((gpu_pos + 1))
log="${TMP_DIR}/gpu-${id}.log"
extra_sec=$(( STAGGER_SECONDS * (GPU_COUNT - gpu_pos) ))
gpu_seconds=$(( SECONDS + extra_sec ))
echo "starting gpu ${id} seconds=${gpu_seconds}"
CUDA_VISIBLE_DEVICES="${id}" "${PROF}" --no-dcgm-validation -t 1004 -d "${gpu_seconds}" >"${log}" 2>&1 &
pid=$!
WORKERS="${WORKERS} ${pid}:${id}:${log}"
if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${gpu_pos}" -lt "${GPU_COUNT}" ]; then
sleep "${STAGGER_SECONDS}"
fi
done
status=0
for spec in ${WORKERS}; do
pid=${spec%%:*}
rest=${spec#*:}
id=${rest%%:*}
log=${rest#*:}
if wait "${pid}"; then
echo "gpu ${id} finished: OK"
else
rc=$?
echo "gpu ${id} finished: FAILED rc=${rc}"
status=1
fi
sed "s/^/[gpu ${id}] /" "${log}" || true
done
exit "${status}"
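
Review note: the per-GPU duration arithmetic in the script above (`extra_sec = STAGGER_SECONDS * (GPU_COUNT - gpu_pos)`) means earlier GPUs run longer, so all workers end at about the same time and the full-load hold lasts the requested duration. The sketch below only models that arithmetic in Go; `gpuWindow` and `staggerSchedule` are hypothetical names, and process launch overhead is ignored.

```go
package main

import "fmt"

// gpuWindow describes when one GPU worker starts and how long it runs.
type gpuWindow struct {
	StartSec int
	RunSec   int
}

// staggerSchedule mirrors the script's arithmetic: the worker at 1-based
// position pos starts staggerSec*(pos-1) seconds in and runs for
// holdSec + staggerSec*(count-pos) seconds.
func staggerSchedule(count, holdSec, staggerSec int) []gpuWindow {
	if count <= 0 {
		return nil
	}
	windows := make([]gpuWindow, 0, count)
	for pos := 1; pos <= count; pos++ {
		windows = append(windows, gpuWindow{
			StartSec: staggerSec * (pos - 1),
			RunSec:   holdSec + staggerSec*(count-pos),
		})
	}
	return windows
}

func main() {
	for i, w := range staggerSchedule(3, 300, 180) {
		fmt.Printf("position %d: start=%ds run=%ds end=%ds\n",
			i+1, w.StartSec, w.RunSec, w.StartSec+w.RunSec)
	}
}
```

For three GPUs with a 300-second hold and a 180-second stagger, every worker ends at 660 seconds, and the last GPU's 300-second run is the window in which all three are loaded together.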

iso/overlay/usr/local/bin/bee-gpu-burn Normal file → Executable file

@@ -2,13 +2,14 @@
set -eu set -eu
SECONDS=5 SECONDS=5
STAGGER_SECONDS=0
SIZE_MB=0 SIZE_MB=0
DEVICES="" DEVICES=""
EXCLUDE="" EXCLUDE=""
WORKER="/usr/local/lib/bee/bee-gpu-burn-worker" WORKER="/usr/local/lib/bee/bee-gpu-burn-worker"
usage() { usage() {
echo "usage: $0 [--seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3]" >&2 echo "usage: $0 [--seconds N] [--stagger-seconds N] [--size-mb N] [--devices 0,1] [--exclude 2,3]" >&2
exit 2 exit 2
} }
@@ -25,6 +26,7 @@ contains_csv() {
while [ "$#" -gt 0 ]; do while [ "$#" -gt 0 ]; do
case "$1" in case "$1" in
--seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;; --seconds|-t) [ "$#" -ge 2 ] || usage; SECONDS="$2"; shift 2 ;;
--stagger-seconds) [ "$#" -ge 2 ] || usage; STAGGER_SECONDS="$2"; shift 2 ;;
--size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;; --size-mb|-m) [ "$#" -ge 2 ] || usage; SIZE_MB="$2"; shift 2 ;;
--devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;; --devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
--exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;; --exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
@@ -61,14 +63,18 @@ done
echo "loader=bee-gpu-burn" echo "loader=bee-gpu-burn"
echo "selected_gpus=${FINAL}" echo "selected_gpus=${FINAL}"
echo "stagger_seconds=${STAGGER_SECONDS}"
export CUDA_DEVICE_ORDER="PCI_BUS_ID" export CUDA_DEVICE_ORDER="PCI_BUS_ID"
TMP_DIR=$(mktemp -d) TMP_DIR=$(mktemp -d)
trap 'rm -rf "${TMP_DIR}"' EXIT INT TERM trap 'rm -rf "${TMP_DIR}"' EXIT INT TERM
GPU_COUNT=$(echo "${FINAL}" | tr ',' '\n' | awk 'NF' | wc -l | tr -d '[:space:]')
gpu_pos=0
WORKERS="" WORKERS=""
for id in $(echo "${FINAL}" | tr ',' ' '); do for id in $(echo "${FINAL}" | tr ',' ' '); do
gpu_pos=$((gpu_pos + 1))
log="${TMP_DIR}/gpu-${id}.log" log="${TMP_DIR}/gpu-${id}.log"
gpu_size_mb="${SIZE_MB}" gpu_size_mb="${SIZE_MB}"
if [ "${gpu_size_mb}" -le 0 ] 2>/dev/null; then if [ "${gpu_size_mb}" -le 0 ] 2>/dev/null; then
@@ -79,11 +85,16 @@ for id in $(echo "${FINAL}" | tr ',' ' '); do
gpu_size_mb=512 gpu_size_mb=512
fi fi
fi fi
echo "starting gpu ${id} size=${gpu_size_mb}MB" extra_sec=$(( STAGGER_SECONDS * (GPU_COUNT - gpu_pos) ))
gpu_seconds=$(( SECONDS + extra_sec ))
echo "starting gpu ${id} size=${gpu_size_mb}MB seconds=${gpu_seconds}"
CUDA_VISIBLE_DEVICES="${id}" \ CUDA_VISIBLE_DEVICES="${id}" \
"${WORKER}" --device 0 --seconds "${SECONDS}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 & "${WORKER}" --device 0 --seconds "${gpu_seconds}" --size-mb "${gpu_size_mb}" >"${log}" 2>&1 &
pid=$! pid=$!
WORKERS="${WORKERS} ${pid}:${id}:${log}" WORKERS="${WORKERS} ${pid}:${id}:${log}"
if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${gpu_pos}" -lt "${GPU_COUNT}" ]; then
sleep "${STAGGER_SECONDS}"
fi
done done
status=0 status=0


@@ -0,0 +1,41 @@
#!/bin/sh
# bee-hpc-tuning — apply HPC tuning for deterministic benchmarking
# Called by bee-hpc-tuning.service at boot.
log() { echo "[bee-hpc-tuning] $*"; }
# ── CPU governor ────────────────────────────────────────────────────────────
# Set all CPU cores to performance governor via sysfs.
# cpupower is not available; write directly to scaling_governor.
governor_ok=0
governor_fail=0
for gov_path in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
[ -f "$gov_path" ] || continue
if echo performance > "$gov_path" 2>/dev/null; then
governor_ok=$((governor_ok + 1))
else
governor_fail=$((governor_fail + 1))
fi
done
if [ "$governor_ok" -gt 0 ] && [ "$governor_fail" -eq 0 ]; then
log "CPU governor set to performance on ${governor_ok} core(s)"
elif [ "$governor_ok" -gt 0 ]; then
log "WARN: CPU governor: ${governor_ok} OK, ${governor_fail} failed"
elif [ "$governor_fail" -gt 0 ]; then
log "WARN: failed to set CPU governor on ${governor_fail} core(s)"
else
log "WARN: no cpufreq scaling_governor paths found (C-state governor or HW-controlled)"
fi
# ── Transparent Huge Pages ───────────────────────────────────────────────────
# Kernel cmdline sets transparent_hugepage=always at boot, but confirm and log.
thp_path=/sys/kernel/mm/transparent_hugepage/enabled
if [ -f "$thp_path" ]; then
current=$(cat "$thp_path" 2>/dev/null)
log "transparent_hugepage: ${current}"
else
log "WARN: transparent_hugepage sysfs path not found"
fi
log "done"

iso/overlay/usr/local/bin/bee-john-gpu-stress Normal file → Executable file

@@ -2,6 +2,7 @@
set -eu set -eu
DURATION_SEC=300 DURATION_SEC=300
STAGGER_SECONDS=0
DEVICES="" DEVICES=""
EXCLUDE="" EXCLUDE=""
FORMAT="" FORMAT=""
@@ -12,7 +13,7 @@ export OCL_ICD_VENDORS="/etc/OpenCL/vendors"
export LD_LIBRARY_PATH="/usr/lib:/usr/local/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" export LD_LIBRARY_PATH="/usr/lib:/usr/local/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
usage() { usage() {
echo "usage: $0 [--seconds N] [--devices 0,1] [--exclude 2,3] [--format name]" >&2 echo "usage: $0 [--seconds N] [--stagger-seconds N] [--devices 0,1] [--exclude 2,3] [--format name]" >&2
exit 2 exit 2
} }
@@ -118,6 +119,7 @@ ensure_opencl_ready() {
 while [ "$#" -gt 0 ]; do
     case "$1" in
         --seconds|-t) [ "$#" -ge 2 ] || usage; DURATION_SEC="$2"; shift 2 ;;
+        --stagger-seconds) [ "$#" -ge 2 ] || usage; STAGGER_SECONDS="$2"; shift 2 ;;
         --devices) [ "$#" -ge 2 ] || usage; DEVICES="$2"; shift 2 ;;
         --exclude) [ "$#" -ge 2 ] || usage; EXCLUDE="$2"; shift 2 ;;
         --format) [ "$#" -ge 2 ] || usage; FORMAT="$2"; shift 2 ;;
@@ -170,6 +172,7 @@ done
 echo "loader=john"
 echo "selected_gpus=${FINAL}"
 echo "john_devices=${JOHN_DEVICES}"
+echo "stagger_seconds=${STAGGER_SECONDS}"
 cd "${JOHN_DIR}"
@@ -232,14 +235,21 @@ trap cleanup EXIT INT TERM
 echo "format=${CHOSEN_FORMAT}"
 echo "target_seconds=${DURATION_SEC}"
 echo "slice_seconds=${TEST_SLICE_SECONDS}"
-DEADLINE=$(( $(date +%s) + DURATION_SEC ))
+TOTAL_DEVICES=$(echo "${JOHN_DEVICES}" | tr ',' '\n' | awk 'NF' | wc -l | tr -d '[:space:]')
 _first=1
+pos=0
 for opencl_id in $(echo "${JOHN_DEVICES}" | tr ',' ' '); do
+    pos=$((pos + 1))
     [ "${_first}" = "1" ] || sleep 3
     _first=0
-    run_john_loop "${opencl_id}" "${DEADLINE}" &
+    extra_sec=$(( STAGGER_SECONDS * (TOTAL_DEVICES - pos) ))
+    deadline=$(( $(date +%s) + DURATION_SEC + extra_sec ))
+    run_john_loop "${opencl_id}" "${deadline}" &
     pid=$!
     PIDS="${PIDS} ${pid}"
+    if [ "${STAGGER_SECONDS}" -gt 0 ] && [ "${pos}" -lt "${TOTAL_DEVICES}" ]; then
+        sleep "${STAGGER_SECONDS}"
+    fi
 done
 FAIL=0
 for pid in ${PIDS}; do
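
The per-device deadline arithmetic in this hunk can be checked with a small worked example. The sketch below uses a hypothetical function `stagger_schedule` (not part of the script) that prints each device's launch offset and stop offset relative to the first launch. It assumes launches are spaced exactly `STAGGER_SECONDS` apart and ignores the fixed `sleep 3` between launches, so real stop times drift by a few seconds per device.

```shell
# Hypothetical illustration of the staggered deadline formula from the hunk above.
# Arguments: duration_sec stagger_sec total_devices
stagger_schedule() {
    duration=$1
    stagger=$2
    total=$3
    pos=0
    while [ "$pos" -lt "$total" ]; do
        pos=$((pos + 1))
        # Launch offset, assuming one stagger interval per earlier device.
        start=$(( stagger * (pos - 1) ))
        # Same expression as extra_sec in the script.
        extra=$(( stagger * (total - pos) ))
        echo "pos=${pos} start=${start} extra=${extra} stop=$(( start + duration + extra ))"
    done
}

stagger_schedule 300 10 3
# pos=1 start=0 extra=20 stop=320
# pos=2 start=10 extra=10 stop=320
# pos=3 start=20 extra=0 stop=320
```

Under these assumptions every device stops at the same offset, `duration + stagger * (total - 1)`: earlier devices get a longer deadline so that ramp-up is staggered while all GPUs still finish together.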

View File

@@ -21,8 +21,13 @@ read_nvidia_modules_flavor() {
 log "kernel: $(uname -r)"
-# Skip if no NVIDIA GPU present (PCI vendor 10de)
-if ! lspci -nn 2>/dev/null | grep -qi '10de:'; then
+# Skip if no NVIDIA display/compute GPU is present.
+# Match only display-class PCI functions (0300 VGA, 0302 3D controller) from vendor 10de.
+have_nvidia_gpu() {
+    lspci -Dn 2>/dev/null | awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ { found=1; exit } END { exit(found ? 0 : 1) }'
+}
+if ! have_nvidia_gpu; then
     log "no NVIDIA GPU detected — skipping module load"
     exit 0
 fi
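
To see why the awk filter is stricter than the old `grep -qi '10de:'`, the sketch below runs the same awk program over canned input instead of live `lspci -Dn` output. The function name `is_nvidia_display` and the sample lines (addresses and device IDs) are illustrative assumptions, not taken from the commit.

```shell
# Same awk program as have_nvidia_gpu, but reading lspci -Dn style lines from stdin.
# Field 2 is the PCI class ("0302:"), field 3 is vendor:device ("10de:xxxx").
is_nvidia_display() {
    awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ { found=1; exit } END { exit(found ? 0 : 1) }'
}

# Illustrative sample: an NVIDIA function that is not display-class (bridge, 0680).
if printf '%s\n' '0000:05:00.0 0680: 10de:22a3 (rev a1)' | is_nvidia_display; then
    echo "bridge only: GPU detected"
else
    echo "bridge only: no GPU"
fi

# Illustrative sample: an NVIDIA 3D controller (class 0302).
if printf '%s\n' '0000:01:00.0 0302: 10de:2330 (rev a1)' | is_nvidia_display; then
    echo "3D controller: GPU detected"
else
    echo "3D controller: no GPU"
fi
```

The old grep would have matched both samples because each contains `10de:`; the class check accepts only the second.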

View File

@@ -14,7 +14,7 @@ log() {
 }
 have_nvidia_gpu() {
-    lspci -nn 2>/dev/null | grep -qi '10de:'
+    lspci -Dn 2>/dev/null | awk '$2 ~ /^03(00|02):$/ && $3 ~ /^10de:/ { found=1; exit } END { exit(found ? 0 : 1) }'
 }
 service_active() {