- Check (2): validate mode only — no mode switcher, no stress-only cards
(nvidia-targeted-stress, nvidia-targeted-power, nvidia-pulse hidden)
- Load (3): stress mode only — no mode switcher, all cards shown
- satStressMode() hardcoded per page; satModeChanged() removed
- Profile card with radio buttons removed from both pages
- Replaced with simple Run All button + est. time
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark,
Tasks, Tools) with a numbered progression that guides engineers through
a logical acceptance workflow:
Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed
→ 5. Endurance → 6. Tools → 7. Settings
Key changes:
- layout.go: numbered nav labels, new hrefs, Tasks removed from nav
and replaced with a persistent sidebar badge (polls /api/tasks every
5 s, highlights amber when tasks are active)
- server.go: 301 redirects from /validate→/check, /burn→/load,
/benchmark→/speed for backward compatibility
- pages.go: dispatch cases for all new routes; old routes kept as
fallbacks
- page_validate.go: add renderCheck() — non-destructive check page
with validate-mode tests only (no stress toggle, no targeted-stress/
targeted-power/pulse cards)
- page_burn.go: add renderLoad() wrapper; update scope alert to
reference /check instead of /validate
- page_benchmark.go: add renderSpeed() (performance focus) and
renderEndurance() (stability/overnight focus) wrappers
- page_settings.go: new Settings page with blackbox logging toggle,
NVIDIA driver reset, and build info
- server_test.go: update five tests to use new route names and
content expectations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema
and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to
HardwareSnapshot.
B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go;
replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice,
isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate)
with numeric vendor_id comparisons.
C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go,
app_network.go, app_services.go, app_packs.go, app_install.go — no logic changes.
D (go-code-style): wrap bare return err in interfaceAdminState and
interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including
the interface name.
E (go-project-bible): add bible-local/architecture/data-model.md and
bible-local/architecture/api-surface.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvme-cli emits smart-log counters as JSON strings and uses field names
avail_spare / percent_used instead of the prose names in the NVMe spec.
The nvmeSmartLog struct had int64 fields with wrong JSON tags — Unmarshal
returned an error and the whole health block was skipped, leaving every
NVMe drive with status=Unknown.
Fix: switch all numeric fields to jsonInt64 (already used for lsblk
block sizes) which accepts both bare numbers and quoted strings, and
correct the avail_spare / percent_used tag names.
Also fix validateIsVendorGPU for NVIDIA: previously counted any NVIDIA
PCIe device (including NVSwitch bridges) as a GPU, producing wrong
estimates (12 instead of 8 on an HGX H100 system). Now requires
device_class to be videocontroller or processingaccelerator, matching
the existing AMD filter logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
targeted_stress, targeted_power, and the Level 2/3 diag were dispatched
one GPU at a time from the UI, turning a single dcgmi command into 8
sequential ~350–450 s runs. DCGM supports -i with a comma-separated list
of GPU indices and runs the diagnostic on all of them in parallel.
Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into
nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one
API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate.
Update sat.go constants and page_validate.go estimates to reflect all-GPU
simultaneous execution (remove n× multiplier from total time estimates).
Stress test on 8-GPU system: ~5.3 h → ~2.5 h.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>