reanimator/bee - bee - MCHUS git PRO

Author	SHA1	Message	Date
Mikhail Chusavitin	ec89616585	Add storage block geometry to audit and viewer	2026-04-29 17:39:11 +03:00
Mikhail Chusavitin	c0dbbf96ad	Add vendor RAID tools for livecd v9.5	2026-04-29 17:31:25 +03:00
Mikhail Chusavitin	76484b123c	Fix fast-path: treat bootloader config changes as heavy config/bootloaders was missing from the needs_full_build heavy-file list, so changes to GRUB theme assets (e.g. bee-logo.png RGBA→RGB fix in `333c44f`) were silently skipped by the squashfs-surgery fast-path. The old broken PNG stayed in boot/grub/live-theme/ inside the ISO. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 15:36:29 +03:00
Mikhail Chusavitin	8901596152	Add server diagnostic tools to ISO, drop btop Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v9.4	2026-04-29 13:18:50 +03:00
Mikhail Chusavitin	7c504e5056	Collect IOMMU group per PCIe device from sysfs Reads the iommu_group symlink for each BDF and exposes the group number as iommu_group in the hardware snapshot JSON. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v9.3	2026-04-29 12:34:54 +03:00
Mikhail Chusavitin	333c44f3ba	Fix GRUB splash: convert bee-logo.png from RGBA to RGB GRUB does not support RGBA PNG (color_type=6) — loading it returns a null bitmap, triggering "null src bitmap in grub_video_bitmap_create_scaled". Alpha channel composited onto black background (#000000 matches desktop-color). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v9.2	2026-04-29 11:15:16 +03:00
Mikhail Chusavitin	3bca821d3e	Add auto fast-path ISO rebuild via squashfs surgery When only light files changed since the last full lb build (Go source, overlay scripts/configs), the build is now automatically done in ~5-8 min instead of 30+ min: - unsquashfs existing squashfs from prior build - rsync overlay-stage on top - mksquashfs repack (zstd, same block size) - xorriso ISO repack with -boot_image any replay (preserves EFI/MBR hybrid) Heavy changes (VERSIONS, package-lists, hooks, archives, Dockerfile, auto/config) still trigger a full lb build. Tracking is via a marker file (.bee-full-build-marker) written after each successful full build. No change to build-in-container.sh or the full build path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v9.1	2026-04-29 10:58:26 +03:00
Mikhail Chusavitin	3648e37a1e	Update bible submodule to remote HEAD, preserve ascii-safe-text contract locally Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 10:30:27 +03:00
Mikhail Chusavitin	d109e08fab	Drop redundant rebuild-image flag	2026-04-29 10:01:57 +03:00
Mikhail Chusavitin	11d00b9442	Document read-only submodules policy v9.01	2026-04-29 09:54:23 +03:00
Mikhail Chusavitin	6defa5ae15	Revert chart submodule update	2026-04-29 09:47:35 +03:00
Mikhail Chusavitin	c76658ed00	Update bible and chart submodules	2026-04-29 09:43:57 +03:00
Mikhail Chusavitin	2163017a98	Collect and report storage telemetry	2026-04-29 09:40:58 +03:00
Michael Chus	29179917c3	Add USB blackbox log mirroring service v9.0	2026-04-24 10:20:12 +03:00
Michael Chus	be4b439804	Commit remaining workspace changes v8.40	2026-04-23 20:32:26 +03:00
Michael Chus	749fc8a94d	Unify NVIDIA GPU recovery paths	2026-04-23 20:31:41 +03:00
Michael Chus	6112094d45	fix(grub): fix bitmap error and menu rendering - Convert bee-logo.png to RGBA (color type 6) and strip all metadata chunks (cHRM, bKGD, tIME, tEXt) that confuse GRUB's minimal PNG parser - Move terminal_output gfxterm before insmod png / theme load so the theme initialises in an active gfxterm context - Remove echo ASCII art banner from grub.cfg — with gfxterm active and no terminal_box in the theme, echo output renders over the menu area - Fix icon_heigh typo → icon_height; increase item_height 16→20 with item_padding 0→2 for reliable text rendering in boot_menu Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.39	2026-04-22 22:05:16 +03:00
Michael Chus	e9a2bc9f9d	update submodule	2026-04-22 20:39:27 +03:00
Mikhail Chusavitin	7a8f884664	fix(boot): remove advanced options submenu Keep only EASY-BEE and toram entries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.38	2026-04-22 19:01:50 +03:00
Mikhail Chusavitin	8bf8dfa45b	fix(boot): default to KMS + pci=realloc, drop nomodeset from main entries Default and toram entries now boot with bee.display=kms (ASPEED AST loads via KMS, Xorg uses modesetting driver) and pci=realloc (Linux reassigns GPU BARs when BIOS lacks Above 4G Decoding). nomodeset removed from these entries; still present in GSP=off and fail-safe. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 19:00:04 +03:00
Mikhail Chusavitin	6a22199aff	chore(bible): bump ascii-safe-text contract Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 18:52:10 +03:00
Mikhail Chusavitin	ddb2bb5d1c	fix(grub): replace em-dash with ASCII -- in all menu entry titles Em-dash (U+2014) renders as garbage on GRUB serial/SOL output (IPMI BMC consoles). Replace with ASCII double-hyphen throughout grub.cfg template, write_canonical_grub_cfg, and theme.txt comment. Also align template grub.cfg structure with write_canonical_grub_cfg: toram entry moved to top level (was inside submenu). bible: add ascii-safe-text contract documenting the no-em-dash rule. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 18:52:04 +03:00
Mikhail Chusavitin	aa284ae754	fix(iso): avoid grub logo scaling error v8.37	2026-04-20 14:06:32 +03:00
Mikhail Chusavitin	8512098174	fix(iso): restore bootappend-live in canonical boot menu v8.36	2026-04-20 13:39:05 +03:00
Mikhail Chusavitin	6b5d22c194	chore(git): ignore local audit binary	2026-04-20 13:21:35 +03:00
Mikhail Chusavitin	a35e90a93e	fix(iso): clear stale bootloader templates in workdir v8.35	2026-04-20 13:19:50 +03:00
Mikhail Chusavitin	1ced81707f	fix(iso): validate live boot entries in final ISO	2026-04-20 13:12:24 +03:00
Mikhail Chusavitin	679aeb9947	Run NVIDIA DCGM diag tests on all selected GPUs simultaneously targeted_stress, targeted_power, and the Level 2/3 diag were dispatched one GPU at a time from the UI, turning a single dcgmi command into 8 sequential ~350–450 s runs. DCGM supports -i with a comma-separated list of GPU indices and runs the diagnostic on all of them in parallel. Move nvidia, nvidia-targeted-stress, nvidia-targeted-power into nvidiaAllGPUTargets so expandSATTarget passes all selected indices in one API call. Simplify runNvidiaValidateSet to match runNvidiaFabricValidate. Update sat.go constants and page_validate.go estimates to reflect all-GPU simultaneous execution (remove n× multiplier from total time estimates). Stress test on 8-GPU system: ~5.3 h → ~2.5 h. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v8.34	2026-04-20 11:53:25 +03:00
Mikhail Chusavitin	647e99b697	Fix post-sync live-build ISO rebuild v8.33	2026-04-20 11:01:15 +03:00
Mikhail Chusavitin	4af997f436	Update audit bee binary	2026-04-20 10:55:42 +03:00
Mikhail Chusavitin	6caace0cc0	Make power benchmark report phase-averaged	2026-04-20 10:53:53 +03:00
Mikhail Chusavitin	5f0103635b	Update power benchmark GPU reset flow v8.32	2026-04-20 09:46:00 +03:00
Mikhail Chusavitin	84a2551dc0	Fix NVIDIA self-heal recovery flow	2026-04-20 09:43:22 +03:00
Mikhail Chusavitin	1cfabc9230	Reset GPUs before power benchmark	2026-04-20 09:42:19 +03:00
Mikhail Chusavitin	5dc711de23	Start power calibration from full GPU TDP	2026-04-20 09:28:58 +03:00
Mikhail Chusavitin	ab802719f8	Use real NVIDIA power-limit bounds in benchmark	2026-04-20 09:26:56 +03:00
Mikhail Chusavitin	a94e8007f8	Ignore power throttling in benchmark calibration	2026-04-20 09:26:29 +03:00
Michael Chus	c69bf07b27	Commit remaining workspace changes v8.31	2026-04-20 07:02:31 +03:00
Michael Chus	b3cf8e3893	Globalize autotuned system power source	2026-04-20 07:02:12 +03:00
Michael Chus	17118298bd	audit: switch power benchmark load to dcgmproftester	2026-04-20 06:57:14 +03:00
Michael Chus	65bcc9ce81	refactor(webui): split pages into task modules	2026-04-20 06:56:52 +03:00
Michael Chus	0cdfbc5875	fix(iso): restore boot UX and boot logs	2026-04-19 23:08:09 +03:00
Michael Chus	cf9b54b600	Use last ramp-step SDR snapshot for PSU loaded power; add deploy script - benchmark.go: retain sdrLastStep from final ramp step instead of re-sampling after test when GPUs are already idle - scripts/deploy.sh: build+deploy bee binary to remote host over SSH Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 21:26:44 +03:00
Michael Chus	0bfb3fe954	Use PSU SDR sum for system power chart when available DCMI reports only the managed power domain (~CPU+MB), missing GPU draw. PSU AC input sensors cover full wall power. When samplePSUPower returns data, sum the slots for PowerW; fall back to DCMI otherwise. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:10:01 +03:00
Michael Chus	3053cb0710	Fix PSU slot regex: match MSI underscore format PSU1_POWER_IN \b does not fire between a digit and '_' because '_' is \w in RE2. The pattern \bpsu?\s*([0-9]+)\b never matched PSU1_POWER_IN style sensors, so parsePSUSDR (and PSUSlotsFromSDR / samplePSUPower) returned empty results for MSI servers — causing all power graphs to fall back to DCMI which reports ~half actual draw. Added an explicit underscore-terminated pattern first in the list and tests covering the MSI format. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 19:03:02 +03:00
Michael Chus	2038489961	Remove MemoryMax=3G from bee-web.service to fix OOM kill during GPU tests dcgmproftester and other GPU test subprocesses run inside the bee-web cgroup and exceed 3G with 8 GPUs. OOM killer terminates the whole service. No memory cap is appropriate on a LiveCD where GPU tests legitimately use several GB. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:52:41 +03:00
Michael Chus	e35484013e	Use SDR PSU AC input for single-card calibration server power Same fix as ramp steps: take sdrSingle snapshot after calibration and prefer PSUInW over DCMI for singleIPMILoadedW. DCMI kept as fallback. Log message indicates source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:44:13 +03:00
Michael Chus	2cdf034bb0	Use SDR PSU AC input for per-step server power in power ramp When sdrStep.PSUInW is available, prefer it over DCMI for ramp.ServerLoadedW and ServerDeltaW. DCMI on this platform (MSI 4-PSU) reports ~half actual draw; SDR sums all PSU_POWER_IN sensors correctly. Delta is now SDR-to-SDR (sdrStep.PSUInW - sdrIdle.PSUInW) for consistency. DCMI path kept as fallback when SDR has no PSU data. Log message now indicates the source (SDR PSU AC input vs DCMI). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:43:36 +03:00
Michael Chus	b89580c24d	Fix PSU power chart: use name-based SDR matching instead of entity ID MSI servers place PSU_POWER_IN/OUT sensors on entity 3.0, not 10.N (the IPMI "Power Supply" entity). The old parser filtered by entity ID and found nothing, so the dashboard fell back to DCMI which reports roughly half the actual draw. Now delegates to collector.PSUSlotsFromSDR — the same name-based matching already used in the Power Fit benchmark. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:39:21 +03:00
Michael Chus	df1385d3d6	Fix dcgmproftester parallel mode: use staggered script for all multi-GPU runs A single dcgmproftester process without -i only loads GPU 0 regardless of CUDA_VISIBLE_DEVICES. Now always routes multi-GPU runs through bee-dcgmproftester-staggered (--stagger-seconds 0 for parallel mode), which spawns one process per GPU so all GPUs are loaded simultaneously. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 18:31:34 +03:00
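
Note on commit 3053cb0710 (PSU slot regex): the failure described there can be reproduced in isolation. The sketch below is not the repository's code. The function name psuSlot, the variable psuPatterns, the (?i) flag, and the exact form of the underscore-terminated pattern are assumptions made for illustration. Only the pattern \bpsu?\s*([0-9]+)\b and the sensor name PSU1_POWER_IN come from the commit message. In Go's regexp package (RE2 syntax), \b is an ASCII word boundary and '_' counts as a word character, so there is no boundary between '1' and '_'.

```go
package main

import (
	"fmt"
	"regexp"
)

// psuPatterns is a hypothetical pattern list, tried in order.
// Index 0: assumed form of the underscore-terminated pattern the commit adds.
// Index 1: the original pattern quoted in the commit message
// (the case-insensitive flag is an assumption).
var psuPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)_`),
	regexp.MustCompile(`(?i)\bpsu?\s*([0-9]+)\b`),
}

// psuSlot returns the PSU slot number found in a sensor name,
// using the first pattern that matches.
func psuSlot(name string) (string, bool) {
	for _, re := range psuPatterns {
		if m := re.FindStringSubmatch(name); m != nil {
			return m[1], true
		}
	}
	return "", false
}

func main() {
	// The original pattern alone misses the MSI-style name, because
	// '1' and '_' are both word characters and \b cannot match between them.
	fmt.Println("legacy matches PSU1_POWER_IN:",
		psuPatterns[1].MatchString("PSU1_POWER_IN"))

	for _, name := range []string{"PSU1_POWER_IN", "PSU 2 Input", "CPU_TEMP"} {
		slot, ok := psuSlot(name)
		fmt.Printf("%-14s slot=%q ok=%v\n", name, slot, ok)
	}
}
```

Trying the underscore-terminated pattern first keeps the original pattern as a fallback for names such as "PSU 2 Input", where a space or end of string follows the digit.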

515 Commits