Commit Graph

520 Commits

Author SHA1 Message Date
Mikhail Chusavitin
7ce73e34a4 Add NVMe block format tool v9.9 2026-04-30 16:27:25 +03:00
Mikhail Chusavitin
8a21809ade Update chart submodule to v2.0 (hardware contract 2.10)
New in chart:
- event_logs and platform_config sections in viewer
- Storage columns: logical_block_size_bytes, physical_block_size_bytes,
  metadata_bytes_per_block
- Compact status/severity icons, severity filtering for event logs
- Fixed JS MIME type and base stylesheet

bee audit schema already has all required fields; no schema changes needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.8
2026-04-30 15:52:30 +03:00
Mikhail Chusavitin
626763e31d Fix GRUB bitmap error: switch from PNG to TGA for splash logo
GRUB's PNG reader (grub2 bookworm) fails to load bee-logo.png despite the
file being valid RGB 8-bit non-interlaced PNG with minimal chunks. Root
cause is a known fragility in GRUB's png.c; exact trigger is unknown.

Switch to uncompressed 24-bit TGA which bypasses the PNG parser entirely.
tga.mod is already present in the ISO (x86_64-efi/tga.mod).

- Convert bee-logo.png → bee-logo.tga (480018 bytes, BGR top-left)
- config.cfg: insmod png → insmod tga
- theme.txt: bee-logo.png → bee-logo.tga
- Document all prior failed attempts in git-bible/grub-bitmap-error.md
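
Note on the byte count: 480018 is consistent with an 18-byte TGA header followed by 3 bytes per pixel for 160000 pixels. A minimal Go sketch of that layout follows; the 400x400 dimensions are an assumption chosen only because they reproduce the size, and the helper names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const tgaHeaderLen = 18

// tgaHeader builds the fixed header for an uncompressed 24-bit image
// with a top-left origin. All other header fields stay zero.
func tgaHeader(width, height uint16) []byte {
	h := make([]byte, tgaHeaderLen)
	h[2] = 2 // image type 2: uncompressed true-color
	binary.LittleEndian.PutUint16(h[12:], width)
	binary.LittleEndian.PutUint16(h[14:], height)
	h[16] = 24   // bits per pixel, stored as B, G, R
	h[17] = 0x20 // descriptor bit 5: rows run top to bottom
	return h
}

// tgaFileSize assumes no image ID, no color map, and no footer.
func tgaFileSize(width, height int) int {
	return tgaHeaderLen + width*height*3
}

func main() {
	fmt.Println(len(tgaHeader(400, 400))) // 18
	fmt.Println(tgaFileSize(400, 400))    // 480018 (400x400 is assumed)
}
```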

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.7
2026-04-30 15:46:13 +03:00
Mikhail Chusavitin
0b8a2ff83f Add validate test matrix and GPU test methodology docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.6
2026-04-30 10:47:08 +03:00
Mikhail Chusavitin
2c22b01fe3 Fix IPMI hangs, add VROC license, fix blackbox service, drop qrencode
IPMI hang fix (Lenovo XCC SR650 V3):
- Add pluggable ipmi_profile system with per-vendor timeouts and fruEarlyExit flag
- Lenovo profile: 90s FRU timeout, streaming early-exit stops after PSU blocks found
- collectFRUEarlyExit streams ipmitool fru print and kills process once PSU blocks
  are followed by a non-PSU header (~6s instead of ~108s on 54-device FRU list)
- collectBMCFirmware and collectPSUs accept manufacturer and apply profile timeouts
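
The early-exit idea above can be sketched roughly as follows. This is not the project's collectFRUEarlyExit: it only shows the streaming state machine, and it assumes that each FRU block starts with a "FRU Device Description :" header and that PSU blocks carry "PSU" in that header. The sample text and function names are hypothetical; a real caller would kill the ipmitool process once the second return value is true.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

const fruHeader = "FRU Device Description :"

// Hypothetical sample in the general shape of `ipmitool fru print` output.
const sampleFRU = `FRU Device Description : Builtin FRU Device (ID 0)
 Board Mfg             : Example
FRU Device Description : PSU1 (ID 1)
 Product Name          : Example PSU
FRU Device Description : PSU2 (ID 2)
 Product Name          : Example PSU
FRU Device Description : DIMM 1 (ID 3)
 Product Name          : never read`

// collectPSUBlocks keeps the lines of PSU blocks and stops reading at the
// first non-PSU header seen after at least one PSU block.
func collectPSUBlocks(r io.Reader) (lines []string, stoppedEarly bool) {
	sc := bufio.NewScanner(r)
	inPSU, seenPSU := false, false
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, fruHeader) {
			name := strings.ToUpper(strings.TrimPrefix(line, fruHeader))
			inPSU = strings.Contains(name, "PSU")
			if !inPSU && seenPSU {
				return lines, true // caller can stop the process here
			}
			seenPSU = seenPSU || inPSU
		}
		if inPSU {
			lines = append(lines, line)
		}
	}
	return lines, false
}

// psuLinesIn is a convenience wrapper for string input.
func psuLinesIn(text string) ([]string, bool) {
	return collectPSUBlocks(strings.NewReader(text))
}

func main() {
	lines, early := psuLinesIn(sampleFRU)
	fmt.Println(len(lines), early) // 4 true
}
```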

VROC license detection:
- Detect VMD/VROC controller in PCIe list, run mdadm --detail-platform
- Parse "License:" line; store as snap.VROCLicense in HardwareSnapshot
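
A minimal sketch of the "License:" parsing step, assuming the platform report is a list of "key : value" lines. The sample text and the license value in it are made up for illustration, and parseVROCLicense is a hypothetical name, not the project's function.

```go
package main

import (
	"fmt"
	"strings"
)

// parseVROCLicense returns the value of the first line whose key is
// "License" (case-insensitive, surrounding whitespace ignored).
func parseVROCLicense(output string) (string, bool) {
	for _, line := range strings.Split(output, "\n") {
		key, val, ok := strings.Cut(line, ":")
		if ok && strings.EqualFold(strings.TrimSpace(key), "License") {
			return strings.TrimSpace(val), true
		}
	}
	return "", false
}

func main() {
	// Hypothetical output; real mdadm text may differ.
	sample := "   Platform : Example platform\n    License : Example tier\n"
	lic, ok := parseVROCLicense(sample)
	fmt.Println(lic, ok)
}
```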

Blackbox service fix:
- bee-blackbox.service was missing from systemctl enable list in ISO build hook
- Service never started on boot; state file never written; UI button stayed "Enable"

Drop qrencode:
- Remove from package list, standardTools API check, and runtime-flows doc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 10:46:59 +03:00
Mikhail Chusavitin
ec89616585 Add storage block geometry to audit and viewer 2026-04-29 17:39:11 +03:00
Mikhail Chusavitin
c0dbbf96ad Add vendor RAID tools for livecd v9.5 2026-04-29 17:31:25 +03:00
Mikhail Chusavitin
76484b123c Fix fast-path: treat bootloader config changes as heavy
config/bootloaders was missing from the needs_full_build heavy-file
list, so changes to GRUB theme assets (e.g. bee-logo.png RGBA→RGB fix
in 333c44f) were silently skipped by the squashfs-surgery fast-path.
The old broken PNG stayed in boot/grub/live-theme/ inside the ISO.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 15:36:29 +03:00
Mikhail Chusavitin
8901596152 Add server diagnostic tools to ISO, drop btop
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.4
2026-04-29 13:18:50 +03:00
Mikhail Chusavitin
7c504e5056 Collect IOMMU group per PCIe device from sysfs
Reads the iommu_group symlink for each BDF and exposes the group number
as iommu_group in the hardware snapshot JSON.
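
For illustration, reading the symlink might look like the sketch below. It relies on the Linux sysfs layout, where /sys/bus/pci/devices/<BDF>/iommu_group links to a directory named after the group number. The function names and the BDF in main are hypothetical, not taken from the project.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// parseIOMMUGroup extracts the group number from a symlink target such as
// "../../../kernel/iommu_groups/42".
func parseIOMMUGroup(target string) (int, error) {
	return strconv.Atoi(filepath.Base(target))
}

// readIOMMUGroup resolves the iommu_group symlink for one PCIe device.
// sysfsRoot is normally "/sys"; it is a parameter so tests can redirect it.
func readIOMMUGroup(sysfsRoot, bdf string) (int, error) {
	link := filepath.Join(sysfsRoot, "bus", "pci", "devices", bdf, "iommu_group")
	target, err := os.Readlink(link)
	if err != nil {
		return -1, err // device has no IOMMU group, or IOMMU is disabled
	}
	return parseIOMMUGroup(target)
}

func main() {
	group, err := readIOMMUGroup("/sys", "0000:00:00.0")
	if err != nil {
		fmt.Println("no IOMMU group:", err)
		return
	}
	fmt.Println("iommu_group:", group)
}
```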

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.3
2026-04-29 12:34:54 +03:00
Mikhail Chusavitin
333c44f3ba Fix GRUB splash: convert bee-logo.png from RGBA to RGB
GRUB does not support RGBA PNG (color_type=6) — loading it returns a
null bitmap, triggering "null src bitmap in grub_video_bitmap_create_scaled".
Alpha channel composited onto black background (#000000 matches desktop-color).
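
The compositing step can be sketched in Go as below. This is an illustration of flattening alpha onto black, not the tool actually used for the conversion (the commit does not say which one was used), and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
	"image/draw"
)

// flattenOnBlack composites src over an opaque black background, so every
// output pixel ends up with alpha 255.
func flattenOnBlack(src image.Image) *image.RGBA {
	b := src.Bounds()
	dst := image.NewRGBA(b)
	draw.Draw(dst, b, image.NewUniform(color.Black), image.Point{}, draw.Src)
	draw.Draw(dst, b, src, b.Min, draw.Over)
	return dst
}

// flattenPixel runs a single non-premultiplied pixel through flattenOnBlack.
func flattenPixel(r, g, b, a uint8) color.RGBA {
	src := image.NewNRGBA(image.Rect(0, 0, 1, 1))
	src.SetNRGBA(0, 0, color.NRGBA{R: r, G: g, B: b, A: a})
	return flattenOnBlack(src).RGBAAt(0, 0)
}

func main() {
	fmt.Println(flattenPixel(255, 200, 0, 0))  // {0 0 0 255}: transparent becomes black
	fmt.Println(flattenPixel(10, 20, 30, 255)) // {10 20 30 255}: opaque is unchanged
}
```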

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.2
2026-04-29 11:15:16 +03:00
Mikhail Chusavitin
3bca821d3e Add auto fast-path ISO rebuild via squashfs surgery
When only light files changed since the last full lb build (Go source,
overlay scripts/configs), the build is now automatically done in ~5-8 min
instead of 30+ min:

- unsquashfs existing squashfs from prior build
- rsync overlay-stage on top
- mksquashfs repack (zstd, same block size)
- xorriso ISO repack with -boot_image any replay (preserves EFI/MBR hybrid)

Heavy changes (VERSIONS, package-lists, hooks, archives, Dockerfile,
auto/config) still trigger a full lb build. Tracking is via a marker file
(.bee-full-build-marker) written after each successful full build.

No change to build-in-container.sh or the full build path.
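
The heavy/light decision described above could be sketched like this. The build tooling itself may well be shell; Go is used here only for illustration. The marker names come from the list in this message, while the matching rule (any path component sequence equal to a marker) and the example paths are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// heavyMarkers mirrors the heavy-change list from the commit message.
var heavyMarkers = []string{
	"VERSIONS", "package-lists", "hooks", "archives", "Dockerfile", "auto/config",
}

// isHeavy reports whether a changed path contains a heavy marker as a
// whole path component (or component pair, for "auto/config").
func isHeavy(path string) bool {
	p := "/" + strings.Trim(path, "/") + "/"
	for _, m := range heavyMarkers {
		if strings.Contains(p, "/"+m+"/") {
			return true
		}
	}
	return false
}

// needsFullBuild is true when any changed file is heavy; otherwise the
// squashfs fast-path is enough.
func needsFullBuild(changed []string) bool {
	for _, path := range changed {
		if isHeavy(path) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical paths, for illustration only.
	fmt.Println(needsFullBuild([]string{"audit/main.go", "overlay/etc/motd"}))  // false
	fmt.Println(needsFullBuild([]string{"audit/main.go", "config/hooks/x.sh"})) // true
}
```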

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v9.1
2026-04-29 10:58:26 +03:00
Mikhail Chusavitin
3648e37a1e Update bible submodule to remote HEAD, preserve ascii-safe-text contract locally
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 10:30:27 +03:00
Mikhail Chusavitin
d109e08fab Drop redundant rebuild-image flag 2026-04-29 10:01:57 +03:00
Mikhail Chusavitin
11d00b9442 Document read-only submodules policy v9.01 2026-04-29 09:54:23 +03:00
Mikhail Chusavitin
6defa5ae15 Revert chart submodule update 2026-04-29 09:47:35 +03:00
Mikhail Chusavitin
c76658ed00 Update bible and chart submodules 2026-04-29 09:43:57 +03:00
Mikhail Chusavitin
2163017a98 Collect and report storage telemetry 2026-04-29 09:40:58 +03:00
29179917c3 Add USB blackbox log mirroring service v9.0 2026-04-24 10:20:12 +03:00
be4b439804 Commit remaining workspace changes v8.40 2026-04-23 20:32:26 +03:00
749fc8a94d Unify NVIDIA GPU recovery paths 2026-04-23 20:31:41 +03:00
6112094d45 fix(grub): fix bitmap error and menu rendering
- Convert bee-logo.png to RGBA (color type 6) and strip all metadata
  chunks (cHRM, bKGD, tIME, tEXt) that confuse GRUB's minimal PNG parser
- Move terminal_output gfxterm before insmod png / theme load so the
  theme initialises in an active gfxterm context
- Remove echo ASCII art banner from grub.cfg — with gfxterm active and
  no terminal_box in the theme, echo output renders over the menu area
- Fix icon_heigh typo → icon_height; increase item_height 16→20 with
  item_padding 0→2 for reliable text rendering in boot_menu

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v8.39
2026-04-22 22:05:16 +03:00
e9a2bc9f9d update submodule 2026-04-22 20:39:27 +03:00
Mikhail Chusavitin
7a8f884664 fix(boot): remove advanced options submenu
Keep only EASY-BEE and toram entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v8.38
2026-04-22 19:01:50 +03:00
Mikhail Chusavitin
8bf8dfa45b fix(boot): default to KMS + pci=realloc, drop nomodeset from main entries
Default and toram entries now boot with bee.display=kms (ASPEED AST
loads via KMS, Xorg uses modesetting driver) and pci=realloc (Linux
reassigns GPU BARs when BIOS lacks Above 4G Decoding). nomodeset
removed from these entries; still present in GSP=off and fail-safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 19:00:04 +03:00
Mikhail Chusavitin
6a22199aff chore(bible): bump ascii-safe-text contract
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 18:52:10 +03:00
Mikhail Chusavitin
ddb2bb5d1c fix(grub): replace em-dash with ASCII -- in all menu entry titles
Em-dash (U+2014) renders as garbage on GRUB serial/SOL output
(IPMI BMC consoles). Replace with ASCII double-hyphen throughout
grub.cfg template, write_canonical_grub_cfg, and theme.txt comment.

Also align template grub.cfg structure with write_canonical_grub_cfg:
toram entry moved to top level (was inside submenu).

bible: add ascii-safe-text contract documenting the no-em-dash rule.
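
A small sketch of the replacement rule and a check that a title is serial-safe. The function names and the sample title are hypothetical; the em-dash is written as the escape \u2014 so that the snippet itself stays ASCII.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiSafeTitle replaces every em-dash (U+2014) with an ASCII double hyphen.
func asciiSafeTitle(title string) string {
	return strings.ReplaceAll(title, "\u2014", "--")
}

// isASCII reports whether s contains only 7-bit characters.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] > 0x7f {
			return false
		}
	}
	return true
}

func main() {
	title := "EASY-BEE \u2014 load to RAM" // hypothetical menu title
	fixed := asciiSafeTitle(title)
	fmt.Println(fixed)                          // EASY-BEE -- load to RAM
	fmt.Println(isASCII(title), isASCII(fixed)) // false true
}
```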

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 18:52:04 +03:00
Mikhail Chusavitin
aa284ae754 fix(iso): avoid grub logo scaling error v8.37 2026-04-20 14:06:32 +03:00
Mikhail Chusavitin
8512098174 fix(iso): restore bootappend-live in canonical boot menu v8.36 2026-04-20 13:39:05 +03:00
Mikhail Chusavitin
6b5d22c194 chore(git): ignore local audit binary 2026-04-20 13:21:35 +03:00
Mikhail Chusavitin
a35e90a93e fix(iso): clear stale bootloader templates in workdir v8.35 2026-04-20 13:19:50 +03:00
Mikhail Chusavitin
1ced81707f fix(iso): validate live boot entries in final ISO 2026-04-20 13:12:24 +03:00
Mikhail Chusavitin
679aeb9947 Run NVIDIA DCGM diag tests on all selected GPUs simultaneously
targeted_stress, targeted_power, and the Level 2/3 diag were dispatched
one GPU at a time from the UI, turning a single dcgmi command into 8
sequential ~350–450 s runs. DCGM supports -i with a comma-separated list
of GPU indices and runs the diagnostic on all of them in parallel.

Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into
nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one
API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate.
Update sat.go constants and page_validate.go estimates to reflect all-GPU
simultaneous execution (remove n× multiplier from total time estimates).

Stress test on 8-GPU system: ~5.3 h → ~2.5 h.
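
As a rough illustration of the change, the sketch below builds the comma-separated index list intended for the -i option and shows how the time estimate loses its n-times multiplier. Function names are hypothetical; the 450 s figure is the upper end of the per-run range quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpuIndexList joins GPU indices into the comma-separated form "0,1,2".
func gpuIndexList(indices []int) string {
	parts := make([]string, len(indices))
	for i, idx := range indices {
		parts[i] = strconv.Itoa(idx)
	}
	return strings.Join(parts, ",")
}

// estimateSeconds returns the total run time: one run when all GPUs are
// tested at once, or one run per GPU when they are dispatched sequentially.
func estimateSeconds(perRunSec, gpus int, allAtOnce bool) int {
	if allAtOnce || gpus < 1 {
		return perRunSec
	}
	return perRunSec * gpus
}

func main() {
	fmt.Println(gpuIndexList([]int{0, 1, 2, 3, 4, 5, 6, 7})) // 0,1,2,3,4,5,6,7
	fmt.Println(estimateSeconds(450, 8, false))              // 3600
	fmt.Println(estimateSeconds(450, 8, true))               // 450
}
```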

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v8.34
2026-04-20 11:53:25 +03:00
Mikhail Chusavitin
647e99b697 Fix post-sync live-build ISO rebuild v8.33 2026-04-20 11:01:15 +03:00
Mikhail Chusavitin
4af997f436 Update audit bee binary 2026-04-20 10:55:42 +03:00
Mikhail Chusavitin
6caace0cc0 Make power benchmark report phase-averaged 2026-04-20 10:53:53 +03:00
Mikhail Chusavitin
5f0103635b Update power benchmark GPU reset flow v8.32 2026-04-20 09:46:00 +03:00
Mikhail Chusavitin
84a2551dc0 Fix NVIDIA self-heal recovery flow 2026-04-20 09:43:22 +03:00
Mikhail Chusavitin
1cfabc9230 Reset GPUs before power benchmark 2026-04-20 09:42:19 +03:00
Mikhail Chusavitin
5dc711de23 Start power calibration from full GPU TDP 2026-04-20 09:28:58 +03:00
Mikhail Chusavitin
ab802719f8 Use real NVIDIA power-limit bounds in benchmark 2026-04-20 09:26:56 +03:00
Mikhail Chusavitin
a94e8007f8 Ignore power throttling in benchmark calibration 2026-04-20 09:26:29 +03:00
c69bf07b27 Commit remaining workspace changes v8.31 2026-04-20 07:02:31 +03:00
b3cf8e3893 Globalize autotuned system power source 2026-04-20 07:02:12 +03:00
17118298bd audit: switch power benchmark load to dcgmproftester 2026-04-20 06:57:14 +03:00
65bcc9ce81 refactor(webui): split pages into task modules 2026-04-20 06:56:52 +03:00
0cdfbc5875 fix(iso): restore boot UX and boot logs 2026-04-19 23:08:09 +03:00
cf9b54b600 Use last ramp-step SDR snapshot for PSU loaded power; add deploy script
- benchmark.go: retain sdrLastStep from final ramp step instead of
  re-sampling after test when GPUs are already idle
- scripts/deploy.sh: build+deploy bee binary to remote host over SSH

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 21:26:44 +03:00
0bfb3fe954 Use PSU SDR sum for system power chart when available
DCMI reports only the managed power domain (~CPU+MB), missing GPU draw.
PSU AC input sensors cover full wall power. When samplePSUPower returns
data, sum the slots for PowerW; fall back to DCMI otherwise.
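
The fallback rule reads roughly like the sketch below. It is a simplification under the assumption that per-slot readings arrive as a plain slice of watts; the function name, the source labels, and the sample wattages are hypothetical.

```go
package main

import "fmt"

// systemPowerW prefers the sum of PSU AC input readings and falls back to
// the DCMI reading when no PSU data is available.
func systemPowerW(psuSlotsW []float64, dcmiW float64) (watts float64, source string) {
	if len(psuSlotsW) == 0 {
		return dcmiW, "dcmi"
	}
	for _, w := range psuSlotsW {
		watts += w
	}
	return watts, "psu_sdr"
}

func main() {
	fmt.Println(systemPowerW([]float64{1200, 1150}, 900)) // 2350 psu_sdr
	fmt.Println(systemPowerW(nil, 900))                   // 900 dcmi
}
```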

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:10:01 +03:00
3053cb0710 Fix PSU slot regex: match MSI underscore format PSU1_POWER_IN
\b does not fire between a digit and '_' because '_' is \w in RE2.
The pattern \bpsu?\s*([0-9]+)\b never matched PSU1_POWER_IN style
sensors, so parsePSUSDR (and PSUSlotsFromSDR / samplePSUPower) returned
empty results for MSI servers — causing all power graphs to fall back
to DCMI which reports ~half actual draw.

Added an explicit underscore-terminated pattern first in the list and
tests covering the MSI format.
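
The boundary behaviour is easy to reproduce with Go's regexp package. In the sketch below the second pattern is the one quoted above; the (?i) flag, the exact form of the underscore-terminated pattern, and the helper names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns are tried in order; the underscore-terminated form comes first.
var psuSlotPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)_`),  // e.g. PSU1_POWER_IN
	regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`), // e.g. "PSU 2 Input Power"
}

// psuSlot returns the slot number found in a sensor name, or "" if none.
func psuSlot(sensor string) string {
	for _, re := range psuSlotPatterns {
		if m := re.FindStringSubmatch(sensor); m != nil {
			return m[1]
		}
	}
	return ""
}

func main() {
	// "1" and "_" are both \w, so there is no \b between them.
	fmt.Println(psuSlotPatterns[1].MatchString("PSU1_POWER_IN")) // false
	fmt.Println(psuSlot("PSU1_POWER_IN"))                        // 1
	fmt.Println(psuSlot("PSU 2 Input Power"))                    // 2
}
```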

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 19:03:02 +03:00