- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
and ErrorDescription when current link speed is below maximum negotiated
speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues:
1. BMC/management VGA chips (e.g. Huawei iBMC Hi171x, ASPEED) were included
in GPU inventory because shouldIncludePCIeDevice only checked the PCI class,
not the device name. Added a name-based filter for known BMC/management
patterns when the class is VGA/display/3d.
2. New NVIDIA GPUs (e.g. RTX PRO 6000 Blackwell, device ID 2bb5) showed as
"Device 2bb5" because lspci's database lags behind. Added "name" to the
nvidia-smi query and use it to override dev.Model during enrichment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Runtime Health now shows only LiveCD system status (services, tools,
drivers, network, CUDA/ROCm) — hardware component rows removed
- Hardware Summary now shows server components with readable descriptions
(model, count×size) and component-status.json health badges
- Add Network Adapters row to Hardware Summary
- SFP module static info (vendor, PN, SN, connector, type, wavelength)
now collected via ethtool -m regardless of carrier state
- PSU statuses from IPMI audit written to component-status.json so PSU
badge shows actual status after first audit instead of UNKNOWN
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BMC:
- collector/board.go: collectBMCFirmware() via ipmitool mc info, graceful skip if /dev/ipmi0 absent
- collector/collector.go: append BMC firmware record to snap.Firmware
- app/panel.go: show BMC version in TUI right-panel header alongside BIOS
CPU SAT:
- platform/sat.go: RunCPUAcceptancePack(baseDir, durationSec) — lscpu + sensors before/after + stress-ng
- app/app.go: RunCPUAcceptancePack + RunCPUAcceptancePackResult methods, satRunner interface updated
- app/panel.go: CPU row now reads real PASS/FAIL from cpu-*/summary.txt via satStatuses(); cpuDetailResult shows last SAT summary + audit data
- tui/types.go: actionRunCPUSAT, confirmBody for CPU test with mode label
- tui/screen_health_check.go: hcCPUDurations [60,300,900]s; hcRunSingle(CPU)→confirm screen; executeRunAll uses RunCPUAcceptancePackResult
- tui/forms.go: actionRunCPUSAT → RunCPUAcceptancePackResult with mode duration
- cmd/bee/main.go: bee sat cpu [--duration N] subcommand
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>