config/bootloaders was missing from the needs_full_build heavy-file
list, so changes to GRUB theme assets (e.g. the bee-logo.png RGBA→RGB
fix in 333c44f) were silently skipped by the squashfs-surgery fast path.
The old broken PNG stayed in boot/grub/live-theme/ inside the ISO.
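A minimal Go sketch of the corrected check; the commit names the heavy
paths, everything else (function name, how changed files are gathered)
is an assumption:

    package main

    import "strings"

    // needsFullBuild reports whether any changed path must trigger a full
    // lb build instead of the squashfs-surgery fast path.
    func needsFullBuild(changed []string) bool {
        heavy := []string{
            "VERSIONS", "config/package-lists/", "config/hooks/",
            "config/archives/", "Dockerfile", "auto/config",
            "config/bootloaders/", // the entry that was missing
        }
        for _, path := range changed {
            for _, prefix := range heavy {
                if strings.HasPrefix(path, prefix) {
                    return true
                }
            }
        }
        return false
    }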
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reads the iommu_group symlink for each BDF and exposes the group number
as iommu_group in the hardware snapshot JSON.
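The mechanism, as a minimal Go sketch (function name and error handling
are assumptions; only the sysfs layout is given):

    package main

    import (
        "os"
        "path/filepath"
        "strconv"
    )

    // iommuGroup resolves /sys/bus/pci/devices/<BDF>/iommu_group, a symlink
    // to /sys/kernel/iommu_groups/<N>, and returns N.
    func iommuGroup(bdf string) (int, error) {
        link, err := os.Readlink(filepath.Join("/sys/bus/pci/devices", bdf, "iommu_group"))
        if err != nil {
            return -1, err
        }
        return strconv.Atoi(filepath.Base(link))
    }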
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GRUB does not support RGBA PNG (color_type=6): loading one returns a
null bitmap, triggering "null src bitmap in grub_video_bitmap_create_scaled".
The alpha channel is composited onto a black background (#000000,
matching the theme's desktop-color).
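The flattening can be reproduced with Go's standard image packages
(paths illustrative; the stdlib encoder writes truecolor, color_type=2,
once every pixel is opaque):

    package main

    import (
        "image"
        "image/draw"
        "image/png"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("bee-logo.png") // illustrative input path
        if err != nil {
            log.Fatal(err)
        }
        src, err := png.Decode(f)
        f.Close()
        if err != nil {
            log.Fatal(err)
        }

        // Composite onto opaque black (#000000, the theme's desktop-color).
        dst := image.NewRGBA(src.Bounds())
        draw.Draw(dst, dst.Bounds(), image.Black, image.Point{}, draw.Src)
        draw.Draw(dst, dst.Bounds(), src, src.Bounds().Min, draw.Over)

        out, err := os.Create("bee-logo-rgb.png")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()
        if err := png.Encode(out, dst); err != nil {
            log.Fatal(err)
        }
    }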
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When only light files changed since the last full lb build (Go source,
overlay scripts/configs), the build now completes automatically in
~5-8 min instead of 30+ min:
- unsquashfs existing squashfs from prior build
- rsync overlay-stage on top
- mksquashfs repack (zstd, same block size)
- xorriso ISO repack with -boot_image any replay (preserves EFI/MBR hybrid)
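A hypothetical Go driver for those four steps (paths, block size, and
ISO layout are assumptions, not the project's actual build code):

    package main

    import (
        "log"
        "os"
        "os/exec"
    )

    func main() {
        steps := [][]string{
            {"unsquashfs", "-d", "squashfs-root", "prev/filesystem.squashfs"},
            {"rsync", "-a", "config/includes.chroot/", "squashfs-root/"},
            {"mksquashfs", "squashfs-root", "filesystem.squashfs",
                "-comp", "zstd", "-b", "1M", "-noappend"},
            // -boot_image any replay re-applies the EFI/MBR hybrid boot
            // records from the input ISO.
            {"xorriso", "-indev", "prev/bee.iso", "-outdev", "bee.iso",
                "-boot_image", "any", "replay",
                "-map", "filesystem.squashfs", "/live/filesystem.squashfs"},
        }
        for _, s := range steps {
            cmd := exec.Command(s[0], s[1:]...)
            cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
            if err := cmd.Run(); err != nil {
                log.Fatalf("%s: %v", s[0], err)
            }
        }
    }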
Heavy changes (VERSIONS, package-lists, hooks, archives, Dockerfile,
auto/config) still trigger a full lb build. Tracking is via a marker file
(.bee-full-build-marker) written after each successful full build.
No change to build-in-container.sh or the full build path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Convert bee-logo.png to RGBA (color type 6) and strip all metadata
chunks (cHRM, bKGD, tIME, tEXt) that confuse GRUB's minimal PNG parser
- Move terminal_output gfxterm before insmod png / theme load so the
theme initialises in an active gfxterm context
- Remove the ASCII-art echo banner from grub.cfg: with gfxterm active and
no terminal_box in the theme, echo output renders over the menu area
- Fix icon_heigh typo → icon_height; increase item_height 16→20 with
item_padding 0→2 for reliable text rendering in boot_menu
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The default and toram entries now boot with bee.display=kms (the ASPEED
AST loads via KMS, Xorg uses the modesetting driver) and pci=realloc
(Linux reassigns GPU BARs when the BIOS lacks Above 4G Decoding).
nomodeset is removed from these entries but kept in GSP=off and fail-safe.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Em-dash (U+2014) renders as garbage on GRUB serial/SOL output
(IPMI BMC consoles). Replace with ASCII double-hyphen throughout
grub.cfg template, write_canonical_grub_cfg, and theme.txt comment.
Also align template grub.cfg structure with write_canonical_grub_cfg:
toram entry moved to top level (was inside submenu).
bible: add ascii-safe-text contract documenting the no-em-dash rule.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
targeted_stress, targeted_power, and the Level 2/3 diag were dispatched
one GPU at a time from the UI, turning a single dcgmi command into 8
sequential ~350–450 s runs. DCGM supports -i with a comma-separated list
of GPU indices and runs the diagnostic on all of them in parallel.
Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into
nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one
API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate.
Update sat.go constants and page_validate.go estimates to reflect
simultaneous all-GPU execution (removing the n× multiplier from
total-time estimates).
Stress test on 8-GPU system: ~5.3 h → ~2.5 h.
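For illustration, the single dispatch reduces to building one argument
list (dcgmi's documented diag flags; helper name hypothetical):

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // dcgmiDiagArgs builds one "dcgmi diag" invocation covering every
    // selected GPU instead of one invocation per GPU.
    func dcgmiDiagArgs(level int, gpus []int) []string {
        idx := make([]string, len(gpus))
        for i, g := range gpus {
            idx[i] = strconv.Itoa(g)
        }
        return []string{"diag", "-r", strconv.Itoa(level), "-i", strings.Join(idx, ",")}
    }

    func main() {
        // e.g. [diag -r 3 -i 0,1,2,3,4,5,6,7]
        fmt.Println(dcgmiDiagArgs(3, []int{0, 1, 2, 3, 4, 5, 6, 7}))
    }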
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- benchmark.go: retain sdrLastStep from the final ramp step instead of
re-sampling after the test, when the GPUs are already idle
- scripts/deploy.sh: build+deploy bee binary to remote host over SSH
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DCMI reports only the managed power domain (~CPU+MB), missing GPU draw.
PSU AC input sensors cover full wall power. When samplePSUPower returns
data, sum the slots for PowerW; fall back to DCMI otherwise.
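Sketch of the preference logic; the commit names samplePSUPower and
PowerW, the shapes below are assumptions:

    // psuSlotsW: per-slot AC input watts from samplePSUPower, empty when
    // the BMC exposes no usable PSU sensors.
    func totalPowerW(psuSlotsW []float64, dcmiW float64) (w float64, source string) {
        if len(psuSlotsW) > 0 {
            for _, slot := range psuSlotsW {
                w += slot
            }
            return w, "PSU AC input"
        }
        return dcmiW, "DCMI"
    }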
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
\b does not fire between a digit and '_' because '_' is \w in RE2.
The pattern \bpsu?\s*([0-9]+)\b therefore never matched PSU1_POWER_IN-style
sensors, so parsePSUSDR (and PSUSlotsFromSDR / samplePSUPower) returned
empty results on MSI servers, and all power graphs fell back to DCMI,
which reports roughly half the actual draw.
Add an explicit underscore-terminated pattern first in the list, plus
tests covering the MSI format.
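Sketch of the fix; the ordering of the pattern list is the point, the
function name and callers are assumptions:

    package main

    import (
        "regexp"
        "strconv"
    )

    // '_' is a word character, so \b cannot fire between "PSU1" and
    // "_POWER_IN"; try the underscore-terminated form first.
    var psuSlotPatterns = []*regexp.Regexp{
        regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)_`),  // PSU1_POWER_IN (MSI)
        regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`), // "PSU 1 Power" etc.
    }

    func psuSlot(sensorName string) (int, bool) {
        for _, re := range psuSlotPatterns {
            if m := re.FindStringSubmatch(sensorName); m != nil {
                n, err := strconv.Atoi(m[1])
                return n, err == nil
            }
        }
        return 0, false
    }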
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dcgmproftester and other GPU test subprocesses run inside the bee-web
cgroup and exceed 3G with 8 GPUs. The OOM killer terminates the whole
service. No memory cap is appropriate on a LiveCD where GPU tests
legitimately use several GB.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same fix as ramp steps: take sdrSingle snapshot after calibration
and prefer PSUInW over DCMI for singleIPMILoadedW. DCMI kept as
fallback. The log message indicates the source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When sdrStep.PSUInW is available, prefer it over DCMI for
ramp.ServerLoadedW and ServerDeltaW. DCMI on this platform (MSI 4-PSU)
reports ~half actual draw; SDR sums all PSU_POWER_IN sensors correctly.
Delta is now SDR-to-SDR (sdrStep.PSUInW - sdrIdle.PSUInW) for
consistency. DCMI path kept as fallback when SDR has no PSU data.
Log message now indicates the source (SDR PSU AC input vs DCMI).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MSI servers place PSU_POWER_IN/OUT sensors on entity 3.0, not 10.N
(the IPMI "Power Supply" entity). The old parser filtered by entity ID
and found nothing, so the dashboard fell back to DCMI which reports
roughly half the actual draw.
Now delegates to collector.PSUSlotsFromSDR, the same name-based
matching already used in the Power Fit benchmark.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A single dcgmproftester process without -i only loads GPU 0 regardless of
CUDA_VISIBLE_DEVICES. Now always routes multi-GPU runs through
bee-dcgmproftester-staggered (--stagger-seconds 0 for parallel mode),
which spawns one process per GPU so all GPUs are loaded simultaneously.
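The fan-out amounts to the sketch below; the binary name, field id, and
duration are placeholders, and the real path goes through
bee-dcgmproftester-staggered:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "strconv"
        "sync"
    )

    // loadAllGPUs starts one dcgmproftester per GPU so all GPUs are loaded
    // simultaneously; without -i a single process only exercises GPU 0.
    func loadAllGPUs(gpus []int, seconds int) error {
        var wg sync.WaitGroup
        errs := make(chan error, len(gpus))
        for _, g := range gpus {
            wg.Add(1)
            go func(g int) {
                defer wg.Done()
                cmd := exec.Command("dcgmproftester12", // placeholder name
                    "-i", strconv.Itoa(g),
                    "-t", "1004", "-d", strconv.Itoa(seconds))
                cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
                if err := cmd.Run(); err != nil {
                    errs <- fmt.Errorf("gpu %d: %w", g, err)
                }
            }(g)
        }
        wg.Wait()
        close(errs)
        for err := range errs {
            return err // first failure, if any
        }
        return nil
    }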
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>