reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Mikhail Chusavitin	966944d6d8	Fix audit hanging on smartpqi SAS HBA scan file write. The smartpqi driver uses scsi_transport_sas but does not register a sas_host object, so /sys/class/sas_host/host14 does not exist and the existing SAS detection check passes right through. Writing to host14/scan then calls sas_user_scan, which blocks indefinitely on scsi_scan_target's mutex (confirmed by kernel hung-task traces in the field). Add a second detection path via /sys/class/scsi_host/hostX/proc_name: skip hosts whose driver is "smartpqi" or "hpsa" (HPE Smart Array predecessors that exhibit the same behaviour). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-15 16:07:54 +03:00
Michael Chus	7d2e904d14	Bring codebase into compliance with bible contracts (A–E). A (hardware-ingest-json v2.8-2.9): remove sensor location fields from schema and collector; tag HardwareMemory.Location as json:"-"; add PlatformConfig to HardwareSnapshot. B (no-hardcoded-vendors): consolidate PCI vendor IDs into collector/pci_vendors.go; replace all vendor-name string checks in isGPUDevice, isNVIDIADevice, isMellanoxDevice, isAMDGPUDevice, matchesGPUVendor (sat_overlay), and validateIsVendorGPU (page_validate) with numeric vendor_id comparisons. C (module-structure): split app/app.go (1413 lines) into app.go + app_format.go, app_network.go, app_services.go, app_packs.go, app_install.go, with no logic changes. D (go-code-style): wrap bare return err in interfaceAdminState and interfaceIPv4Addrs (platform/network.go) with fmt.Errorf context including the interface name. E (go-project-bible): add bible-local/architecture/data-model.md and bible-local/architecture/api-surface.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-13 14:32:08 +03:00
Michael Chus	2320925433	Skip PCIe link-speed warnings for disabled devices. Disabled PCIe devices (sysfs enable==0) carry no data traffic; their link state has no operational impact. Switchtec PCIe switch management endpoints on NVIDIA HGX H100 baseboards (and similar fabric controllers) train at reduced speed intentionally and were producing spurious warnings. The check is vendor-agnostic: it reads the enable attribute via the existing helper, with no vendor/device ID hardcoding. Documented in bible-local/decisions/2026-06-12-pcie-disabled-device-link-warning.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-12 03:42:19 +03:00
Michael Chus	e169a7722c	Fix NVMe SMART status always Unknown; fix GPU count including NVSwitches. nvme-cli emits smart-log counters as JSON strings and uses field names avail_spare / percent_used instead of the prose names in the NVMe spec. The nvmeSmartLog struct had int64 fields with wrong JSON tags, so Unmarshal returned an error and the whole health block was skipped, leaving every NVMe drive with status=Unknown. Fix: switch all numeric fields to jsonInt64 (already used for lsblk block sizes), which accepts both bare numbers and quoted strings, and correct the avail_spare / percent_used tag names. Also fix validateIsVendorGPU for NVIDIA: it previously counted any NVIDIA PCIe device (including NVSwitch bridges) as a GPU, producing wrong estimates (12 instead of 8 on an HGX H100 system). It now requires device_class to be videocontroller or processingaccelerator, matching the existing AMD filter logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 18:06:32 +03:00
Michael Chus	884988cb2a	Fix audit hang on SAS HBAs: skip scsi host scan for SAS hosts. Writing to /sys/class/scsi_host/hostX/scan on SAS controllers (e.g. Adaptec smartpqi/PM8222-SHBA) triggers sas_user_scan, which blocks indefinitely, causing the audit to hang forever. Skip hosts that appear under /sys/class/sas_host/; SAS topology is discovered by the driver. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 18:50:20 +03:00
Michael Chus	963bc960ca	Fix SATA discovery, add NVLink bridge detection, add infiniband-diags - storage: add jsonInt64 dual-format unmarshaler to handle lsblk output change in util-linux 2.38 (LOG-SEC/PHY-SEC now emitted as JSON integers, not quoted strings); fixes SATA disks invisible on Debian 12 - pcie: detect NVLink bridge mezzanine CX-7 cards (Mellanox x2, no host net ifaces, DeviceName contains "NVLINK" in lspci -v) and mark them with device_class="NVLinkBridge"; escalate PCIe link speed downgrade to Critical for these cards (Gen3 on a fixed internal connector = hardware fault, not a transient warning) - pcie: cross-reference nvidia-smi topo to capture NVLink bond counts and active status for all NVLink bridge cards - packages: add infiniband-diags to ISO; provides ibstat required by nvidia-fabricmanager-start.sh to enumerate IB devices before FM launch (absence causes CUDA_ERROR_SYSTEM_NOT_READY) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-28 20:57:04 +03:00
Mikhail Chusavitin	2c22b01fe3	Fix IPMI hangs, add VROC license, fix blackbox service, drop qrencode. IPMI hang fix (Lenovo XCC SR650 V3): - Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag - Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found - collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list) - collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts. VROC license detection: - Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform - Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot. Blackbox service fix: - bee-blackbox.service was missing from systemctl enable list in ISO build hook - Service never started on boot; state file never written; UI button stayed "Enable". Drop qrencode: - Remove from package list, standardTools API check, and runtime-flows doc. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 10:46:59 +03:00
Mikhail Chusavitin	ec89616585	Add storage block geometry to audit and viewer	2026-04-29 17:39:11 +03:00
Mikhail Chusavitin	7c504e5056	Collect IOMMU group per PCIe device from sysfs. Reads the iommu_group symlink for each BDF and exposes the group number as iommu_group in the hardware snapshot JSON. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 12:34:54 +03:00
Mikhail Chusavitin	2163017a98	Collect and report storage telemetry	2026-04-29 09:40:58 +03:00
Michael Chus	3053cb0710	Fix PSU slot regex: match MSI underscore format PSU1_POWER_IN. \b does not fire between a digit and '_' because '_' is \w in RE2. The pattern \bpsu?\s*([0-9]+)\b never matched PSU1_POWER_IN style sensors, so parsePSUSDR (and PSUSlotsFromSDR / samplePSUPower) returned empty results for MSI servers, causing all power graphs to fall back to DCMI, which reports roughly half the actual draw. Added an explicit underscore-terminated pattern first in the list and tests covering the MSI format. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:03:02 +03:00
Michael Chus	61c7abaa80	Add multi-source PSU power triangulation and per-slot distribution table - collector/psu.go: export PSUSlotsFromSDR() reusing slot regex patterns; add isPSUInputPower/isPSUOutputPower helpers covering MSI/MLT/xFusion/HPE naming; add xFusion Power<N> slot pattern; parseBoundedFloat for self-healing (rejects zero/negative/out-of-range sensor readings); default fallback treats unclassified PSU sensors as AC input - benchmark_types.go: BenchmarkPSUSlotPower struct; BenchmarkServerPower gains PSUInputIdle/Loaded, PSUOutputIdle/Loaded, PSUSlotReadingsIdle/Loaded, GPUSlotTotalW, DCMICoverageRatio fields - benchmark.go: sampleIPMISDRPowerSensors uses collector.PSUSlotsFromSDR instead of custom classifier; detectDCMIPartialCoverage replaces ramp heuristic — compares DCMI idle vs SDR PSU sum, flags <0.70 ratio as partial coverage; detectIPMISaturationFallback kept for servers without SDR PSU sensors; report gains PSU Load Distribution table (per-slot AC/DC idle vs loaded, Δ) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 13:07:48 +03:00
Mikhail Chusavitin	05c1fde233	Warn on PCIe link speed degradation and collect lspci -vvv in techdump - collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning and ErrorDescription when current link speed is below maximum negotiated speed (e.g. Gen1 running on a Gen5 slot) - collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison - collector/pcie_filter_test: cover degraded and healthy link speed cases - platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 12:42:17 +03:00
Mikhail Chusavitin	bb1218ddd4	Fix GPU inventory: exclude BMC virtual VGA, show real NVIDIA model names. Two issues: 1. BMC/management VGA chips (e.g. Huawei iBMC Hi171x, ASPEED) were included in GPU inventory because shouldIncludePCIeDevice only checked the PCI class, not the device name. Added a name-based filter for known BMC/management patterns when the class is VGA/display/3d. 2. New NVIDIA GPUs (e.g. RTX PRO 6000 Blackwell, device ID 2bb5) showed as "Device 2bb5" because lspci's database lags behind. Added "name" to the nvidia-smi query and use it to override dev.Model during enrichment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 13:57:26 +03:00
Michael Chus	05241f2e0e	Redesign dashboard: split Runtime Health and Hardware Summary - Runtime Health now shows only LiveCD system status (services, tools, drivers, network, CUDA/ROCm) — hardware component rows removed - Hardware Summary now shows server components with readable descriptions (model, count×size) and component-status.json health badges - Add Network Adapters row to Hardware Summary - SFP module static info (vendor, PN, SN, connector, type, wavelength) now collected via ethtool -m regardless of carrier state - PSU statuses from IPMI audit written to component-status.json so PSU badge shows actual status after first audit instead of UNKNOWN Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 23:41:23 +03:00
Mikhail Chusavitin	531d1ca366	Add NVIDIA self-heal tools and per-GPU SAT status	2026-04-07 20:20:05 +03:00
Mikhail Chusavitin	f3c14cd893	Harden NIC probing for empty SFP ports	2026-04-04 15:23:15 +03:00
Mikhail Chusavitin	b2b0444131	audit: ignore virtual hdisk and coprocessor noise	2026-04-02 09:56:17 +03:00
Michael Chus	964ab39656	fix: run john stress in parallel per GPU, fix chromium fullscreen, filter BMC virtual disks - bee-john-gpu-stress: spawn one john process per OpenCL device in parallel so all GPUs are stressed simultaneously instead of only device 1 - bee-openbox-session: --start-fullscreen → --start-maximized to fix blank white page on first render in fbdev environment - storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 23:14:21 +03:00
Michael Chus	eb60100297	fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup - nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state - build: remove bee-nccl-gpu-stress from rm -f list so shell script from overlay is not silently dropped from the ISO - smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress, bee-nccl-gpu-stress, all_reduce_perf - netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors - auto/config: reduce loglevel 7→3 to show clean systemd output on boot - auto/config: blacklist snd_hda_intel and related audio modules (unused on servers) - package-lists: remove firmware-intel-sound and firmware-amd-graphics from base list; move firmware-amd-graphics to bee-amd variant only - bible-local: mark memtest ADR resolved, document working solution Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-01 21:25:23 +03:00
Mikhail Chusavitin	8b9d3447d7	Overlay SAT results into audit JSON	2026-03-25 20:11:03 +03:00
Mikhail Chusavitin	614b7cad61	Improve PCIe inventory and hardware identity collection	2026-03-25 20:00:38 +03:00
Mikhail Chusavitin	36dff6e584	feat: CPU SAT via stress-ng + BMC version via ipmitool BMC: - collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent - collector/collector.go: append BMC firmware record to snap.Firmware - app/panel.go: show BMC version in TUI right-panel header alongside BIOS CPU SAT: - platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng - app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated - app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data - tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label - tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult - tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration - cmd/bee/main.go: bee sat cpu [--duration N] subcommand Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-25 11:06:12 +03:00
Mikhail Chusavitin	b25a2f6d30	feat: add support bundle and raw audit export	2026-03-16 18:20:26 +03:00
Mikhail Chusavitin	78c6dfc0ef	Sync hardware ingest contract v2.7	2026-03-15 23:03:38 +03:00
Mikhail Chusavitin	a6023372b1	Use microcode as CPU firmware	2026-03-15 21:16:17 +03:00
Mikhail Chusavitin	ab5a4be7ac	Align hardware export with ingest contract	2026-03-15 21:04:53 +03:00
Mikhail Chusavitin	b483e2ce35	Add health verdicts and acceptance tests	2026-03-14 17:53:58 +03:00
Mikhail Chusavitin	6aca1682b9	Refactor bee CLI and LiveCD integration	2026-03-13 16:52:16 +03:00
Mikhail Chusavitin	18b8c69bc5	Implement audit enrichments, TUI workflows, and production ISO scaffold	2026-03-06 11:56:26 +03:00
Michael Chus	ab22e3ad74	add: NVMe wear telemetry via nvme smart-log (1.8b)	2026-03-05 14:55:53 +03:00
Michael Chus	e79f972fb5	add: PSU collector (1.7) via ipmitool fru, skips gracefully without IPMI	2026-03-05 14:54:12 +03:00
Michael Chus	55f6098a17	add: memory, storage, pcie collectors (1.4-1.6) — tested on real hardware	2026-03-05 14:50:34 +03:00
Michael Chus	00bb2fdace	feat(audit): 1.3 — CPU collector (dmidecode type 4, microcode) - cpu.go: collectCPUs(), parseCPUs(), parseCPUSection() - splitDMISections(): splits multi-section dmidecode output generically - parseFieldLines(): reusable key→value parser for DMI sections - parseCPUStatus(): Populated/Unpopulated → OK/WARNING/EMPTY/UNKNOWN - parseSocketIndex(): CPU0/Processor 1/Socket 2 → integer - cleanManufacturer(): strips (R), Corporation, Inc. suffixes - parseMHz(), parseInt(): field value parsers - Serial fallback: <board_serial>-CPU-<socket> when DMI serial absent - readMicrocode(): /sys/devices/system/cpu/cpu0/microcode/version - cpu_test.go: dual-socket, unpopulated skipped, status, socket, manufacturer, MHz Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-05 10:37:19 +03:00
Michael Chus	f1e392a7fe	feat(audit): 1.2 — board collector (dmidecode types 0, 1, 2) - board.go: collectBoard(), parseBoard(), parseBIOSFirmware(), parseDMIFields(), cleanDMIValue() - Reads System Information (type 1): serial, manufacturer, product_name, uuid - Reads Base Board Information (type 2): part_number - Reads BIOS Information (type 0): firmware version record - cleanDMIValue strips vendor placeholders (O.E.M., Not Specified, Unknown, etc.) - board_test.go: 6 table/case tests with dmidecode fixtures in testdata/ - collector.go: wired board + BIOS firmware into snapshot Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-05 10:35:14 +03:00
Michael Chus	a4f70b17f0	feat(audit): 1.1 project scaffold, schema types, collector stub, updater trust - go.mod: module bee/audit - schema/hardware.go: HardwareIngestRequest types (compatible with core) - collector/collector.go: Run() stub, logs start/finish, returns empty snapshot - updater/trust.go: Ed25519 multi-key verification via ldflags injection - updater/trust_test.go: valid sig, tampered, multi-key any-match, dev build - cmd/audit/main.go: --output stdout|file:<path>|usb, --version flag - Version = "dev" by default, injected via ldflags at release Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-05 10:32:12 +03:00
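
Commit 3053cb0710 above turns on a detail of RE2 word boundaries that is easy to miss, so a small Go sketch may help. It is an illustration only, not the project's code: the first expression is the one quoted in the commit message with an assumed case-insensitive flag, while `underscorePSU`, `wordBoundaryPSU`, and `psuSlot` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Pattern quoted in the commit message; the (?i) flag is an assumption.
	wordBoundaryPSU = regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`)
	// Hypothetical underscore-terminated variant for names like PSU1_POWER_IN.
	underscorePSU = regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)_`)
)

// psuSlot returns the slot number found in a sensor name. It tries the
// underscore-terminated pattern first, then the word-boundary pattern.
func psuSlot(sensor string) (string, bool) {
	for _, re := range []*regexp.Regexp{underscorePSU, wordBoundaryPSU} {
		if m := re.FindStringSubmatch(sensor); m != nil {
			return m[1], true
		}
	}
	return "", false
}

func main() {
	name := "PSU1_POWER_IN"
	// '1' and '_' are both word characters, so \b cannot match between them.
	fmt.Println(wordBoundaryPSU.MatchString(name)) // false
	slot, ok := psuSlot(name)
	fmt.Println(slot, ok) // 1 true
}
```

Ordering the underscore pattern first mirrors what the commit message describes; names such as `PSU 2 Status` still fall through to the word-boundary pattern.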

36 Commits
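
Commits 963bc960ca and e169a7722c both rely on a `jsonInt64` type that accepts a number whether it arrives bare or as a quoted string. The repository's implementation is not shown in this log, so the following is a minimal sketch of how such a dual-format unmarshaler could look in Go. The `smartLog` struct and `parseSmartLog` helper are hypothetical; only the `avail_spare` and `percent_used` field names come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// jsonInt64 is an illustrative re-creation, not the project's actual type.
type jsonInt64 int64

// UnmarshalJSON accepts both 512 and "512"; JSON null leaves the value unchanged.
func (v *jsonInt64) UnmarshalJSON(data []byte) error {
	s := string(data)
	if s == "null" {
		return nil
	}
	if len(s) >= 2 && s[0] == '"' {
		// Quoted form: decode the JSON string first, then parse its contents.
		var str string
		if err := json.Unmarshal(data, &str); err != nil {
			return err
		}
		s = str
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return fmt.Errorf("jsonInt64: parse %q: %w", s, err)
	}
	*v = jsonInt64(n)
	return nil
}

// smartLog is a hypothetical two-field subset of an NVMe smart-log document.
type smartLog struct {
	AvailSpare  jsonInt64 `json:"avail_spare"`
	PercentUsed jsonInt64 `json:"percent_used"`
}

// parseSmartLog decodes a raw JSON document into smartLog.
func parseSmartLog(raw string) (smartLog, error) {
	var l smartLog
	err := json.Unmarshal([]byte(raw), &l)
	return l, err
}

func main() {
	quoted, _ := parseSmartLog(`{"avail_spare":"100","percent_used":"3"}`)
	bare, _ := parseSmartLog(`{"avail_spare":100,"percent_used":3}`)
	fmt.Println(quoted == bare) // true
}
```

Returning an error for non-numeric strings keeps malformed tool output visible instead of silently recording a zero.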