Fix AMD GPU false detection, blackbox deadlock, and NOGPU build bloat
- sat.go: DetectGPUVendor lspci fallback now checks GPU device classes ([0300]/[0302]/[0380]) per line instead of scanning the whole output for the vendor name; AMD EPYC servers have dozens of AMD-branded PCIe entries (Root Complex, IOMMU, Host Bridge) that were triggering the old check
- blackbox.go: fix deadlock in finishCycle: it held w.mu while calling persistState(), which acquires rt.mu and then re-acquires w.mu inside persistStateLocked(); w.mu is now released before persistState() is called
- build.sh: remove NVIDIA-specific overlay files (bee-gpu-burn, bee-john-gpu-stress, bee-nccl-gpu-stress, bee-nvidia-recover, bee-dcgmproftester-staggered, bee-check-nvswitch, nvidia-fabricmanager.service.d/) for non-nvidia build variants
- bee-selfheal: gate NVIDIA recovery on BEE_GPU_VENDOR=nvidia so the script does not attempt to restart bee-nvidia.service on NOGPU builds

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
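The per-line class check described for sat.go can be illustrated with a small shell sketch. This is not the Go code from the commit: the function name `detect_gpu_vendor`, the `nvidia`/`amd`/`none` return strings, and the choice to prefer NVIDIA when both vendors appear are assumptions made for illustration. It expects `lspci -nn`-style text on stdin and considers only lines whose class code is [0300], [0302], or [0380].

```shell
# Hypothetical helper (not from sat.go): classify the GPU vendor from
# `lspci -nn` output on stdin. Prints "nvidia", "amd", or "none".
detect_gpu_vendor() {
    awk '
        # Only GPU device classes count: VGA [0300], 3D [0302], display [0380].
        # Host bridges, IOMMUs, and other AMD-branded entries are skipped.
        /\[03(00|02|80)\]/ {
            line = tolower($0)
            if (line ~ /nvidia/) {
                nvidia = 1
            } else if (line ~ /advanced micro devices|amd/) {
                amd = 1
            }
        }
        END {
            # Assumed priority: NVIDIA wins if both vendors have a GPU-class device.
            if (nvidia) print "nvidia"
            else if (amd) print "amd"
            else print "none"
        }
    '
}
```

A typical call would be `lspci -nn | detect_gpu_vendor`. On an EPYC host whose only GPU-class device comes from another vendor, the AMD Root Complex and Host Bridge lines no longer influence the result because they never pass the class filter.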
@@ -1419,6 +1419,13 @@ rm -rf \
 if [ "$BEE_GPU_VENDOR" != "nvidia" ]; then
 rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-nvidia-load"
 rm -f "${OVERLAY_STAGE_DIR}/etc/systemd/system/bee-nvidia.service"
+rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-gpu-burn"
+rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-john-gpu-stress"
+rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-nccl-gpu-stress"
+rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-nvidia-recover"
+rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-dcgmproftester-staggered"
+rm -f "${OVERLAY_STAGE_DIR}/usr/local/bin/bee-check-nvswitch"
+rm -rf "${OVERLAY_STAGE_DIR}/etc/systemd/system/nvidia-fabricmanager.service.d"
 fi

 # --- inject authorized_keys for SSH access ---
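The bee-selfheal change mentioned in the commit message is not part of this hunk. A minimal sketch of such a vendor gate, assuming BEE_GPU_VENDOR is available in the script's environment, could look like the following. The function names `should_recover_nvidia` and `selfheal_gpu` are hypothetical, and the sketch only prints the action it would take instead of calling systemctl.

```shell
# Hypothetical gate (the real bee-selfheal script is not shown here).
# Succeeds only when the build variant is nvidia.
should_recover_nvidia() {
    [ "${BEE_GPU_VENDOR:-}" = "nvidia" ]
}

# Hypothetical recovery step: prints the intended action so the gate can be
# exercised without touching systemd.
selfheal_gpu() {
    if should_recover_nvidia; then
        # A real script would restart the unit here, for example with systemctl.
        echo "restart bee-nvidia.service"
    else
        # NOGPU and other non-nvidia variants skip NVIDIA recovery entirely.
        echo "skip nvidia recovery"
    fi
}
```

Treating an unset or empty BEE_GPU_VENDOR as "not nvidia" mirrors the `!= "nvidia"` test in build.sh: anything other than the exact string `nvidia` takes the non-NVIDIA path.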