- Remove "6." / "7." prefixes from Tools and Settings nav labels and page titles
- Add a horizontal separator (nav-sep) before the Tools/Settings group
- Move Tasks into the nav as a regular nav-item after the separator,
replacing the separate tasks-nav-btn at the sidebar bottom
- Tasks item retains the active-count badge (tasks-nav-count)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tools page now contains only NVMe Block Format and Supermicro - DMI.
Moved to Settings (7):
- System Install (Install to RAM + Install to Disk)
- Support Bundle + USB Black-Box
- Tool Check
- NVIDIA Self Heal (replaces simple NVIDIA Recovery card)
- Network
- Services
Update TestToolsPageRendersNvidiaSelfHealSection to assert the moved
cards on /settings instead of /tools.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract saa 1.5.0 (Linux x86_64) into iso/vendor/saa — baked into ISO
at /usr/local/bin/saa via the existing vendor loop in build.sh
- Add saa to the vendor tool loop in iso/builder/build.sh
- Rename the web UI card from "SAA - DMI" to "Supermicro - DMI"
- Remove the redundant description hint about saa on PATH
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds board and memory parser test fixtures based on real dmidecode output
from Dell PowerEdge R740xd, HPE ProLiant DL380 Gen10, and Supermicro
SYS-6028R-WTR sourced from the linuxhw/DMI dataset. Extends cleanDMIValue
with four additional vendor placeholder strings found in the dataset:
"0123456789", "1234567890", "NOT AVAILABLE", and "TO BE FILLED BY O.E.M"
(without trailing dot). Adds memory_test.go covering mixed populated/empty
DIMM slots and both GB and MB size formats.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the guessed pipe/key=value parser with the correct format
documented in SAA User Guide 4.8.1:
[Section]
Item Name {SHN} = "value" // comment
Handles string values (strips surrounding quotes), non-string values
(UUID, hex), section headers for display names, version line, and
// comments. Verified against the SAA 1.5.0 User Guide sample.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new card to the web UI Tools page for reading and editing DMI
fields via SAA (In-Band). Reads current DMI configuration with GetDmiInfo,
displays all fields as an editable table, and applies only the changed
fields via EditDmiInfo + ChangeDmiInfo. Backs up the original DMI file to
dmi-backups/ before any write, making it available in the support bundle
for rollback. Also adds "saa" to the standard tool check list.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark,
Tasks, Tools) with a numbered progression that guides engineers through
a logical acceptance workflow:
Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed
→ 5. Endurance → 6. Tools → 7. Settings
Key changes:
- layout.go: numbered nav labels, new hrefs, Tasks removed from nav
and replaced with a persistent sidebar badge (polls /api/tasks every
5 s, highlights amber when tasks are active)
- server.go: 301 redirects from /validate→/check, /burn→/load,
/benchmark→/speed for backward compatibility
- pages.go: dispatch cases for all new routes; old routes kept as
fallbacks
- page_validate.go: add renderCheck() — non-destructive check page
with validate-mode tests only (no stress toggle, no targeted-stress/
targeted-power/pulse cards)
- page_burn.go: add renderLoad() wrapper; update scope alert to
reference /check instead of /validate
- page_benchmark.go: add renderSpeed() (performance focus) and
renderEndurance() (stability/overnight focus) wrappers
- page_settings.go: new Settings page with blackbox logging toggle,
NVIDIA driver reset, and build info
- server_test.go: update five tests to use new route names and
content expectations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
smartpqi uses scsi_transport_sas but does not register a sas_host
object, so /sys/class/sas_host/host14 does not exist and the existing
SAS detection check passes right through. Writing to host14/scan then
calls sas_user_scan which blocks indefinitely on scsi_scan_target's
mutex (confirmed by kernel hung-task traces in the field).
Add a second detection path via /sys/class/scsi_host/hostX/proc_name:
skip hosts whose driver is "smartpqi" or "hpsa" (HPE Smart Array
predecessors that exhibit the same behaviour).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema
and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to
HardwareSnapshot.
B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go;
replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice,
isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate)
with numeric vendor_id comparisons.
C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go,
app_network.go, app_services.go, app_packs.go, app_install.go — no logic changes.
D (go-code-style): wrap bare return err in interfaceAdminState and
interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including
the interface name.
E (go-project-bible): add bible-local/architecture/data-model.md and
bible-local/architecture/api-surface.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Disabled PCIe devices (sysfs enable==0) carry no data traffic; their
link state has no operational impact. Switchtec PCIe switch management
endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers)
train at reduced speed intentionally and were producing spurious warnings.
Check is vendor-agnostic: reads enable attribute via existing helper,
no vendor/device ID hardcoding.
Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
nvme-cli emits smart-log counters as JSON strings and uses field names
avail_spare / percent_used instead of the prose names in the NVMe spec.
The nvmeSmartLog struct had int64 fields with wrong JSON tags — Unmarshal
returned an error and the whole health block was skipped, leaving every
NVMe drive with status=Unknown.
Fix: switch all numeric fields to jsonInt64 (already used for lsblk
block sizes) which accepts both bare numbers and quoted strings, and
correct the avail_spare / percent_used tag names.
Also fix validateIsVendorGPU for NVIDIA: previously counted any NVIDIA
PCIe device (including NVSwitch bridges) as a GPU, producing wrong
estimates (12 instead of 8 on an HGX H100 system). Now requires
device_class to be videocontroller or processingaccelerator, matching
the existing AMD filter logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writing to /sys/class/scsi_host/hostX/scan on SAS controllers (e.g.
Adaptec smartpqi/PM8222-SHBA) triggers sas_user_scan which blocks
indefinitely, causing the audit to hang forever. Skip hosts that appear
under /sys/class/sas_host/ — SAS topology is discovered by the driver.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- storage: add jsonInt64 dual-format unmarshaler to handle lsblk output
change in util-linux 2.38 (LOG-SEC/PHY-SEC now emitted as JSON
integers, not quoted strings); fixes SATA disks invisible on Debian 12
- pcie: detect NVLink bridge mezzanine CX-7 cards (Mellanox x2, no host
net ifaces, DeviceName contains "NVLINK" in lspci -v) and mark them
with device_class="NVLinkBridge"; escalate PCIe link speed downgrade to
Critical for these cards (Gen3 on a fixed internal connector = hardware
fault, not a transient warning)
- pcie: cross-reference nvidia-smi topo to capture NVLink bond counts and
active status for all NVLink bridge cards
- packages: add infiniband-diags to ISO; provides ibstat required by
nvidia-fabricmanager-start.sh to enumerate IB devices before FM launch
(absence causes CUDA_ERROR_SYSTEM_NOT_READY)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Network: green if at least one interface has IPv4 (drop PARTIAL state).
Bee Services: treat inactive as OK — oneshot services (bee-sshsetup,
bee-preflight, bee-network, bee-audit, etc.) complete successfully and
exit to inactive; only failed is a real problem.
nvidia-fabricmanager: add ExecCondition=bee-check-nvswitch drop-in so
the service is silently skipped (inactive, not failed) on systems
without NVSwitch hardware (e.g. H200 NVL with direct NVLink, no
NVSwitch chips). bee-check-nvswitch detects NVSwitch via lspci
(vendor 10de, class 0680).
bee-nvidia.service: add ConditionPathExists=/usr/local/bin/bee-nvidia-load
so the unit is a no-op if somehow present in a non-nvidia build.
bee-boot-status: read /etc/bee-gpu-vendor and exclude bee-nvidia from
CRITICAL/ALL on non-nvidia builds, preventing boot hang if the unit
is unexpectedly present.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add nvidia-aer-correctable and pcie-aer-correctable patterns to catch
"bus correctable error" events seen in SEL (Critical Interrupt / offset 7).
Both patterns carry severity "warning" — correctable errors are
hardware-recovered and should not flag a card as failed.
Fix kmsg_watcher routing: GPU-category events were keyed as pcie:<BDF>
but the UI queries for pcie:gpu: prefix. Split the switch so "gpu" →
pcie:gpu:<BDF> and "pcie" → pcie:<BDF>. This applies to both
flushWindow (SAT-window path) and flushImmediate (always-on path).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HTMX was never loaded on the page, so hx-get on the component label
spans was dead code — the dialog opened empty. Replace with a plain
openComponentDetail() fetch call. Also fix dialog positioning broken
by the CSS reset (*{margin:0} overrode the UA margin:auto that centers
<dialog>). Replace card hx-trigger polling with a setInterval.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times,
not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB
- New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source)
- Hardware Summary card auto-refreshes every 30s via htmx without page reload
- Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal
with per-component status, source, timestamp and last 20 history entries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IPMI hang fix (Lenovo XCC SR650 V3):
- Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag
- Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found
- collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks
are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list)
- collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts
VROC license detection:
- Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform
- Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot
Blackbox service fix:
- bee-blackbox.service was missing from systemctl enable list in ISO build hook
- Service never started on boot; state file never written; UI button stayed "Enable"
Drop qrencode:
- Remove from package list, standardTools API check, and runtime-flows doc
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads the iommu_group symlink for each BDF and exposes the group number
as iommu_group in the hardware snapshot JSON.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
targeted_stress, targeted_power, and the Level 2/3 diag were dispatched
one GPU at a time from the UI, turning a single dcgmi command into 8
sequential ~350–450 s runs. DCGM supports -i with a comma-separated list
of GPU indices and runs the diagnostic on all of them in parallel.
Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into
nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one
API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate.
Update sat.go constants and page_validate.go estimates to reflect all-GPU
simultaneous execution (remove n× multiplier from total time estimates).
Stress test on 8-GPU system: ~5.3 h → ~2.5 h.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- benchmark.go: retain sdrLastStep from final ramp step instead of
re-sampling after test when GPUs are already idle
- scripts/deploy.sh: build+deploy bee binary to remote host over SSH
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>