reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Mikhail Chusavitin	a2d7513153	Restructure nav to Load/Burn/Benchmark; fix SAA acpidump dependency - Nav steps 3-5: Load (validate), Burn (burn-in), Benchmark (speed+endurance merged) - /load now renders validate mode; /burn renders burn-in; /benchmark replaces /speed+/endurance - Legacy redirects updated: /validate→/load, /burn-in→/burn, /speed+/endurance→/benchmark - Add acpica_bin/acpidump from SAA 1.5.0 package; required by saa GetDmiInfo (ExitCode 8) - build.sh copies acpica_bin/acpidump to /usr/local/bin/acpica_bin/ alongside saa Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 19:07:51 +03:00
Mikhail Chusavitin	5b5d8609d3	Refactor nav: remove numbers from Tools/Settings, add separator and Tasks item - Remove "6." / "7." prefixes from Tools and Settings nav labels and page titles - Add a horizontal separator (nav-sep) before the Tools/Settings group - Move Tasks into the nav as a regular nav-item after the separator, replacing the separate tasks-nav-btn at the sidebar bottom - Tasks item retains the active-count badge (tasks-nav-count) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 17:54:54 +03:00
Mikhail Chusavitin	e7442972d1	Move session-scoped LiveCD tools from Tools to Settings Tools page now contains only NVMe Block Format and Supermicro - DMI. Moved to Settings (7): - System Install (Install to RAM + Install to Disk) - Support Bundle + USB Black-Box - Tool Check - NVIDIA Self Heal (replaces simple NVIDIA Recovery card) - Network - Services Update TestToolsPageRendersNvidiaSelfHealSection to assert the moved cards on /settings instead of /tools. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 17:52:19 +03:00
Mikhail Chusavitin	4c6daa1c5e	Add SAA binary to ISO vendor, rename card to Supermicro - DMI - Extract saa 1.5.0 (Linux x86_64) into iso/vendor/saa — baked into ISO at /usr/local/bin/saa via the existing vendor loop in build.sh - Add saa to the vendor tool loop in iso/builder/build.sh - Rename the web UI card from "SAA - DMI" to "Supermicro - DMI" - Remove the redundant description hint about saa on PATH Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 17:49:12 +03:00
Mikhail Chusavitin	e420888d71	Add DMI test fixtures from linuxhw/DMI and expand placeholder detection Adds board and memory parser test fixtures based on real dmidecode output from Dell PowerEdge R740xd, HPE ProLiant DL380 Gen10, and Supermicro SYS-6028R-WTR sourced from the linuxhw/DMI dataset. Extends cleanDMIValue with four additional vendor placeholder strings found in the dataset: "0123456789", "1234567890", "NOT AVAILABLE", and "TO BE FILLED BY O.E.M" (without trailing dot). Adds memory_test.go covering mixed populated/empty DIMM slots and both GB and MB size formats. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 16:15:20 +03:00
Mikhail Chusavitin	8149360410	Fix SAA DMI parser to match real DMI.txt format Replace the guessed pipe/key=value parser with the correct format documented in SAA User Guide 4.8.1: [Section] Item Name {SHN} = "value" // comment Handles string values (strips surrounding quotes), non-string values (UUID, hex), section headers for display names, version line, and // comments. Verified against the SAA 1.5.0 User Guide sample. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 15:58:02 +03:00
Mikhail Chusavitin	4262c5b798	Add SAA DMI editor to Tools page Adds a new card to the web UI Tools page for reading and editing DMI fields via SAA (In-Band). Reads current DMI configuration with GetDmiInfo, displays all fields as an editable table, and applies only the changed fields via EditDmiInfo + ChangeDmiInfo. Backs up the original DMI file to dmi-backups/ before any write, making it available in the support bundle for rollback. Also adds "saa" to the standard tool check list. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 15:50:42 +03:00
Mikhail Chusavitin	271dadda03	Restructure web UI navigation into 7 numbered workflow stages Replace the flat menu (Dashboard, Audit, Validate, Burn, Benchmark, Tasks, Tools) with a numbered progression that guides engineers through a logical acceptance workflow: Dashboard (landing) → 1. Audit → 2. Check → 3. Load → 4. Speed → 5. Endurance → 6. Tools → 7. Settings Key changes: - layout.go: numbered nav labels, new hrefs, Tasks removed from nav and replaced with a persistent sidebar badge (polls /api/tasks every 5 s, highlights amber when tasks are active) - server.go: 301 redirects from /validate→/check, /burn→/load, /benchmark→/speed for backward compatibility - pages.go: dispatch cases for all new routes; old routes kept as fallbacks - page_validate.go: add renderCheck() — non-destructive check page with validate-mode tests only (no stress toggle, no targeted-stress/ targeted-power/pulse cards) - page_burn.go: add renderLoad() wrapper; update scope alert to reference /check instead of /validate - page_benchmark.go: add renderSpeed() (performance focus) and renderEndurance() (stability/overnight focus) wrappers - page_settings.go: new Settings page with blackbox logging toggle, NVIDIA driver reset, and build info - server_test.go: update five tests to use new route names and content expectations Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-18 11:00:02 +03:00
Mikhail Chusavitin	966944d6d8	Fix audit hanging on smartpqi SAS HBA scan file write smartpqi uses scsi_transport_sas but does not register a sas_host object, so /sys/class/sas_host/host14 does not exist and the existing SAS detection check passes right through. Writing to host14/scan then calls sas_user_scan which blocks indefinitely on scsi_scan_target's mutex (confirmed by kernel hung-task traces in the field). Add a second detection path via /sys/class/scsi_host/hostX/proc_name: skip hosts whose driver is "smartpqi" or "hpsa" (HPE Smart Array predecessors that exhibit the same behaviour). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-15 16:07:54 +03:00
Michael Chus	7d2e904d14	Bring codebase into compliance with bible contracts (A–E) A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to HardwareSnapshot. B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go; replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice, isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate) with numeric vendor_id comparisons. C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go, app_network.go, app_services.go, app_packs.go, app_install.go — no logic changes. D (go-code-style): wrap bare return err in interfaceAdminState and interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including the interface name. E (go-project-bible): add bible-local/architecture/data-model.md and bible-local/architecture/api-surface.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-13 14:32:08 +03:00
Michael Chus	2320925433	Skip PCIe link-speed warnings for disabled devices Disabled PCIe devices (sysfs enable==0) carry no data traffic; their link state has no operational impact. Switchtec PCIe switch management endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers) train at reduced speed intentionally and were producing spurious warnings. Check is vendor-agnostic: reads enable attribute via existing helper, no vendor/device ID hardcoding. Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-12 03:42:19 +03:00
Michael Chus	e169a7722c	Fix NVMe SMART status always Unknown; fix GPU count including NVSwitches nvme-cli emits smart-log counters as JSON strings and uses field names avail_spare / percent_used instead of the prose names in the NVMe spec. The nvmeSmartLog struct had int64 fields with wrong JSON tags — Unmarshal returned an error and the whole health block was skipped, leaving every NVMe drive with status=Unknown. Fix: switch all numeric fields to jsonInt64 (already used for lsblk block sizes) which accepts both bare numbers and quoted strings, and correct the avail_spare / percent_used tag names. Also fix validateIsVendorGPU for NVIDIA: previously counted any NVIDIA PCIe device (including NVSwitch bridges) as a GPU, producing wrong estimates (12 instead of 8 on an HGX H100 system). Now requires device_class to be videocontroller or processingaccelerator, matching the existing AMD filter logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 18:06:32 +03:00
Michael Chus	884988cb2a	Fix audit hang on SAS HBAs: skip scsi host scan for SAS hosts Writing to /sys/class/scsi_host/hostX/scan on SAS controllers (e.g. Adaptec smartpqi/PM8222-SHBA) triggers sas_user_scan which blocks indefinitely, causing the audit to hang forever. Skip hosts that appear under /sys/class/sas_host/ — SAS topology is discovered by the driver. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 18:50:20 +03:00
Michael Chus	963bc960ca	Fix SATA discovery, add NVLink bridge detection, add infiniband-diags - storage: add jsonInt64 dual-format unmarshaler to handle lsblk output change in util-linux 2.38 (LOG-SEC/PHY-SEC now emitted as JSON integers, not quoted strings); fixes SATA disks invisible on Debian 12 - pcie: detect NVLink bridge mezzanine CX-7 cards (Mellanox x2, no host net ifaces, DeviceName contains "NVLINK" in lspci -v) and mark them with device_class="NVLinkBridge"; escalate PCIe link speed downgrade to Critical for these cards (Gen3 on a fixed internal connector = hardware fault, not a transient warning) - pcie: cross-reference nvidia-smi topo to capture NVLink bond counts and active status for all NVLink bridge cards - packages: add infiniband-diags to ISO; provides ibstat required by nvidia-fabricmanager-start.sh to enumerate IB devices before FM launch (absence causes CUDA_ERROR_SYSTEM_NOT_READY) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 20:57:04 +03:00
Michael Chus	4f6579e040	Fix Runtime Health criteria: network, services, nvidia-fabricmanager Network: green if at least one interface has IPv4 (drop PARTIAL state). Bee Services: treat inactive as OK — oneshot services (bee-sshsetup, bee-preflight, bee-network, bee-audit, etc.) complete successfully and exit to inactive; only failed is a real problem. nvidia-fabricmanager: add ExecCondition=bee-check-nvswitch drop-in so the service is silently skipped (inactive, not failed) on systems without NVSwitch hardware (e.g. H200 NVL with direct NVLink, no NVSwitch chips). bee-check-nvswitch detects NVSwitch via lspci (vendor 10de, class 0680). bee-nvidia.service: add ConditionPathExists=/usr/local/bin/bee-nvidia-load so the unit is a no-op if somehow present in a non-nvidia build. bee-boot-status: read /etc/bee-gpu-vendor and exclude bee-nvidia from CRITICAL/ALL on non-nvidia builds, preventing boot hang if the unit is unexpectedly present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 05:20:25 +03:00
Michael Chus	dc07580adc	Add AER decode, event counter, and sparkline to component detail modal - decodeAERStatus: parses aer_status hex from kernel error strings and maps PCIe AER register bits to human-readable names with correctable/ uncorrectable classification (e.g. "Receiver Error, Replay Timer Timeout (correctable)") - renderSparkline: 100px inline SVG showing non-OK events over time, bars positioned proportionally to timestamp; evenly spaced when timestamps coincide - renderComponentDetail: shows event count badge and sparkline in the component header row; decoded AER line appears below the raw error summary Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-13 23:54:54 +03:00
Mikhail Chusavitin	805a3b277d	Track PCIe AER correctable errors; fix GPU status key routing Add nvidia-aer-correctable and pcie-aer-correctable patterns to catch "bus correctable error" events seen in SEL (Critical Interrupt / offset 7). Both patterns carry severity "warning" — correctable errors are hardware-recovered and should not flag a card as failed. Fix kmsg_watcher routing: GPU-category events were keyed as pcie:<BDF> but the UI queries for pcie:gpu: prefix. Split the switch so "gpu" → pcie:gpu:<BDF> and "pcie" → pcie:<BDF>. This applies to both flushWindow (SAT-window path) and flushImmediate (always-on path). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-08 12:50:14 +03:00
Mikhail Chusavitin	0939a647ea	Fix component detail modal: replace dead hx-* with fetch-based JS HTMX was never loaded on the page, so hx-get on the component label spans was dead code — the dialog opened empty. Replace with a plain openComponentDetail() fetch call. Also fix dialog positioning broken by the CSS reset (*{margin:0} overrode the UA margin:auto that centers <dialog>). Replace card hx-trigger polling with a setInterval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-08 10:53:20 +03:00
Mikhail Chusavitin	ae80d7711e	Add continuous hardware health monitoring and component detail view - kmsg watcher now records kernel errors (GPU Xid, MCE, EDAC, storage I/O) at all times, not only during SAT tasks; flushImmediate writes directly to ComponentStatusDB - New health_poller: polls ipmitool sdr every 60s for PSU health (watchdog:psu source) - Hardware Summary card auto-refreshes every 30s via htmx without page reload - Component rows (CPU/Memory/Storage/GPU/PSU) are now clickable -- opens a modal with per-component status, source, timestamp and last 20 history entries Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-06 09:56:39 +03:00
Michael Chus	7b4bcc745a	Split live rootfs into smaller squashfs layers	2026-05-03 23:15:22 +03:00
Michael Chus	6c2b188ec9	Add no-GUI boot mode and quieter boot diagnostics	2026-05-03 21:14:45 +03:00
Michael Chus	14505ef24a	Remove easy bee ASCII logo banners	2026-05-03 21:07:13 +03:00
Michael Chus	cac5b9c86e	Detach install media after install-to-ram	2026-05-03 14:16:45 +03:00
Michael Chus	0e39e7d960	Make toram default and add install-to-ram CLI	2026-05-03 14:07:47 +03:00
Mikhail Chusavitin	58d6da0e4f	Fix live task logs and SAT windows	2026-04-30 17:26:45 +03:00
Mikhail Chusavitin	7ce73e34a4	Add NVMe block format tool	2026-04-30 16:27:25 +03:00
Mikhail Chusavitin	2c22b01fe3	Fix IPMI hangs, add VROC license, fix blackbox service, drop qrencode IPMI hang fix (Lenovo XCC SR650 V3): - Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag - Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found - collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list) - collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts VROC license detection: - Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform - Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot Blackbox service fix: - bee-blackbox.service was missing from systemctl enable list in ISO build hook - Service never started on boot; state file never written; UI button stayed "Enable" Drop qrencode: - Remove from package list, standardTools API check, and runtime-flows doc Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 10:46:59 +03:00
Mikhail Chusavitin	ec89616585	Add storage block geometry to audit and viewer	2026-04-29 17:39:11 +03:00
Mikhail Chusavitin	7c504e5056	Collect IOMMU group per PCIe device from sysfs Reads the iommu_group symlink for each BDF and exposes the group number as iommu_group in the hardware snapshot JSON. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 12:34:54 +03:00
Mikhail Chusavitin	11d00b9442	Document read-only submodules policy	2026-04-29 09:54:23 +03:00
Mikhail Chusavitin	2163017a98	Collect and report storage telemetry	2026-04-29 09:40:58 +03:00
Michael Chus	29179917c3	Add USB blackbox log mirroring service	2026-04-24 10:20:12 +03:00
Michael Chus	be4b439804	Commit remaining workspace changes	2026-04-23 20:32:26 +03:00
Michael Chus	749fc8a94d	Unify NVIDIA GPU recovery paths	2026-04-23 20:31:41 +03:00
Mikhail Chusavitin	6b5d22c194	chore(git): ignore local audit binary	2026-04-20 13:21:35 +03:00
Mikhail Chusavitin	679aeb9947	Run NVIDIA DCGM diag tests on all selected GPUs simultaneously targeted_stress, targeted_power, and the Level 2/3 diag were dispatched one GPU at a time from the UI, turning a single dcgmi command into 8 sequential ~350–450 s runs. DCGM supports -i with a comma-separated list of GPU indices and runs the diagnostic on all of them in parallel. Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate. Update sat.go constants and page_validate.go estimates to reflect all-GPU simultaneous execution (remove n× multiplier from total time estimates). Stress test on 8-GPU system: ~5.3 h → ~2.5 h. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-20 11:53:25 +03:00
Mikhail Chusavitin	4af997f436	Update audit bee binary	2026-04-20 10:55:42 +03:00
Mikhail Chusavitin	6caace0cc0	Make power benchmark report phase-averaged	2026-04-20 10:53:53 +03:00
Mikhail Chusavitin	5f0103635b	Update power benchmark GPU reset flow	2026-04-20 09:46:00 +03:00
Mikhail Chusavitin	84a2551dc0	Fix NVIDIA self-heal recovery flow	2026-04-20 09:43:22 +03:00
Mikhail Chusavitin	1cfabc9230	Reset GPUs before power benchmark	2026-04-20 09:42:19 +03:00
Mikhail Chusavitin	5dc711de23	Start power calibration from full GPU TDP	2026-04-20 09:28:58 +03:00
Mikhail Chusavitin	ab802719f8	Use real NVIDIA power-limit bounds in benchmark	2026-04-20 09:26:56 +03:00
Mikhail Chusavitin	a94e8007f8	Ignore power throttling in benchmark calibration	2026-04-20 09:26:29 +03:00
Michael Chus	c69bf07b27	Commit remaining workspace changes	2026-04-20 07:02:31 +03:00
Michael Chus	b3cf8e3893	Globalize autotuned system power source	2026-04-20 07:02:12 +03:00
Michael Chus	17118298bd	audit: switch power benchmark load to dcgmproftester	2026-04-20 06:57:14 +03:00
Michael Chus	65bcc9ce81	refactor(webui): split pages into task modules	2026-04-20 06:56:52 +03:00
Michael Chus	0cdfbc5875	fix(iso): restore boot UX and boot logs	2026-04-19 23:08:09 +03:00
Michael Chus	cf9b54b600	Use last ramp-step SDR snapshot for PSU loaded power; add deploy script - benchmark.go: retain sdrLastStep from final ramp step instead of re-sampling after test when GPUs are already idle - scripts/deploy.sh: build+deploy bee binary to remote host over SSH Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 21:26:44 +03:00
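
Commits e169a7722c and 963bc960ca both describe a `jsonInt64` type that accepts a numeric field whether the tool emits it as a bare number or as a quoted string. The repository's own implementation is not shown in this log, so the following is only a minimal sketch of that idea. The `smartLog` struct and `parseSmartLog` helper are hypothetical names chosen for illustration; only the `avail_spare` and `percent_used` tag names come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// jsonInt64 decodes from either a bare JSON number (100) or a quoted
// decimal string ("100"). Sketch only; the real type may differ.
type jsonInt64 int64

func (v *jsonInt64) UnmarshalJSON(data []byte) error {
	text := strings.TrimSpace(string(data))
	if text == "null" {
		return nil // leave the zero value untouched
	}
	if strings.HasPrefix(text, `"`) {
		// Quoted form: let encoding/json strip the quotes and escapes.
		var s string
		if err := json.Unmarshal(data, &s); err != nil {
			return fmt.Errorf("jsonInt64: decode string: %w", err)
		}
		text = strings.TrimSpace(s)
	}
	n, err := strconv.ParseInt(text, 10, 64)
	if err != nil {
		return fmt.Errorf("jsonInt64: parse %q: %w", text, err)
	}
	*v = jsonInt64(n)
	return nil
}

// smartLog is a hypothetical two-field subset of a SMART log payload.
type smartLog struct {
	AvailSpare  jsonInt64 `json:"avail_spare"`
	PercentUsed jsonInt64 `json:"percent_used"`
}

// parseSmartLog is a hypothetical helper that wraps json.Unmarshal.
func parseSmartLog(raw []byte) (smartLog, error) {
	var out smartLog
	if err := json.Unmarshal(raw, &out); err != nil {
		return smartLog{}, fmt.Errorf("parse smart log: %w", err)
	}
	return out, nil
}

func main() {
	inputs := []string{
		`{"avail_spare": 100, "percent_used": 3}`,     // bare numbers
		`{"avail_spare": "100", "percent_used": "3"}`, // quoted strings
	}
	for _, in := range inputs {
		log, err := parseSmartLog([]byte(in))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", log)
	}
}
```

With plain `int64` fields, the quoted form makes `json.Unmarshal` return an error for the whole struct, which matches the failure the commit message describes. A custom unmarshaler keeps the tolerance local to the field type.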


273 Commits