Commit Graph

245 Commits

Author SHA1 Message Date
Mikhail Chusavitin
1358485f2b fix logo wallpaper 2026-04-07 10:15:38 +03:00
d973231f37 Enhance benchmark: server power via IPMI, efficiency metrics, FP64, power limit check
- Sample server power (IPMI dcmi) during baseline+steady phases in parallel;
  compute delta vs GPU-reported sum; flag ratio < 0.75 as unreliable reporting
- Collect base_graphics_clock_mhz, multiprocessor_count, default_power_limit_w
  from nvidia-smi alongside existing GPU info
- Add tops_per_sm_per_ghz efficiency metric (model-agnostic silicon quality signal)
- Flag when enforced power limit is below default TDP by >5%
- Add fp64 profile to bee-gpu-burn worker (CUDA_R_64F, CUBLAS_COMPUTE_64F, min cc 8.0)
- Improve Executive Summary: overall pass count, FAILED GPU finding
- Throttle counters now shown as % of steady window instead of raw microseconds
- bible-local: clock calibration research, H100/H200 spec, real-world GEMM baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:26:52 +03:00
f5d175f488 Fix toram: patch live-boot to not use O_DIRECT when replacing loop to tmpfs
losetup --replace --direct-io=on fails with EINVAL when the target file
is on tmpfs (/dev/shm), because tmpfs does not support O_DIRECT.
Strip the --direct-io flag from the replace call and downgrade the
verification failure to a warning so boot continues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 21:06:21 +03:00
Mikhail Chusavitin
c7d2816a7f Limit NVIDIA legacy boot hooks to proprietary ISO 2026-04-06 16:33:16 +03:00
Mikhail Chusavitin
d2eadedff2 Default NVIDIA ISO to open modules and add nvidia-legacy 2026-04-06 16:27:13 +03:00
Mikhail Chusavitin
35f4c53887 Stabilize NVIDIA GPU device mapping across loaders 2026-04-06 12:22:04 +03:00
Mikhail Chusavitin
fc5c100a29 Fix NVIDIA persistence mode and add benchmark results table 2026-04-06 10:47:07 +03:00
4602f97836 Enforce sequential task orchestration 2026-04-05 22:10:42 +03:00
c65d3ae3b1 Add nomodeset to default GRUB entry — fix black screen on headless servers
Servers with NVIDIA compute GPUs (H100 etc.) have no display output,
so KMS blanks the console. nomodeset disables kernel modesetting and
lets the NVIDIA proprietary driver handle display via Xorg.

KMS variant moved to advanced submenu for cases where it is needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 21:40:47 +03:00
7a21c370e4 Handle NVIDIA GSP firmware init hang with timeout fallback
- bee-nvidia-load: run insmod in background, poll /proc/devices for
  nvidiactl; if GSP init doesn't complete in 90s, kill insmod and retry
  with NVreg_EnableGpuFirmware=0. Handles EBUSY case with clear error.
- Write /run/bee-nvidia-mode (gsp-on/gsp-off/gsp-stuck) for audit layer
- Show GSP mode badge in sidebar: yellow for gsp-off, red for gsp-stuck
- Report NvidiaGSPMode in RuntimeHealth with issue entries
- Simplify GRUB menu: default (KMS+GSP), advanced submenu (GSP=off,
  nomodeset, fail-safe), remove load-to-RAM entry
- Add pcmanfm, ristretto, mupdf, mousepad to desktop packages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 21:00:43 +03:00
d1a6863ceb Use amber fallback wallpaper color (#f6c90e) instead of black
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:30:41 +03:00
f9aa05de8e Add wallpaper: black background with amber EASY-BEE ASCII art logo
- Add feh and python3-pil to package list
- Add chroot hook that generates /usr/share/bee/wallpaper.png using PIL:
  black background, EASY-BEE box-drawing logo in amber (#f6c90e),
  "Hardware Audit LiveCD" subtitle in dim amber — matches motd exactly
- bee-openbox-session: set wallpaper with feh --bg-fill, fall back to
  xsetroot -solid black if wallpaper not found

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:29:42 +03:00
a9ccea8cca Fix black desktop and Chromium blank page on startup
- Set xsetroot solid background (#12100a, dark amber) so openbox
  doesn't show bare black before Chromium opens
- Re-add healthz wait loop before launching Chromium: without it
  Chromium opens localhost/loading before bee-web is up and gets
  connection-refused which renders as a blank white page

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:25:32 +03:00
fc5c985fb5 Reset tty1 properly when bee-boot-status exits
Add TTYReset=yes and TTYVHangup=yes so systemd restores the terminal
to a clean state before handing tty1 to getty. Without this the screen
went black with no cursor after the status display finished.

Also remove DefaultDependencies=no which was too aggressive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:22:01 +03:00
5eb3baddb4 Fix bee-boot-status blank screen caused by variable buffering
Command substitution in sh strips trailing newlines, so accumulating
output in a variable via $(...) lost all line breaks. Reverted to
direct printf calls which work correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:21:10 +03:00
a6ac13b5d3 Improve bee-boot-status: slower refresh, more detail
- Refresh every 3s instead of 1s to reduce flicker
- Show ssh, bee-sshsetup in service list
- Show failure reason for failed services
- Show last journal line for activating services
- Show IP addresses and web UI URL when network is up
- Render frame to variable before printing to reduce flicker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:20:07 +03:00
4003cb7676 Lower kernel console loglevel to 3 to reduce boot noise
loglevel=6 floods the screen with mpt3sas/scsi/sd informational
messages, hiding systemd service status and bee-boot-status display.
loglevel=3 shows only kernel errors; all messages still go to serial.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 19:19:09 +03:00
2875313ba0 Improve boot UX: status display, faster GUI, loading spinner
- Add bee-boot-status service: shows live service status on tty1 with
  ASCII logo before getty, exits when all bee services settle
- Remove lightdm dependency on bee-preflight so GUI starts immediately
  without waiting for NVIDIA driver load
- Replace Chromium blank-page problem with /loading spinner page that
  polls /api/services and auto-redirects when services are ready; add
  "Open app now" override button; use fresh --user-data-dir=/tmp/bee-chrome
- Unify branding: add "Hardware Audit LiveCD" subtitle to GRUB menu,
  bee-boot-status (with yellow ASCII logo), and web spinner

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 18:58:24 +03:00
38e79143eb Refine burn UI and NVIDIA stress flows 2026-04-05 13:43:43 +03:00
143b7dca5d Add stability hardening and self-heal recovery 2026-04-05 10:29:37 +03:00
Mikhail Chusavitin
728270dc8e Unblock bee-web startup and expand support bundle diagnostics 2026-04-04 15:18:43 +03:00
Mikhail Chusavitin
8692f825bc Use plain repo tags for build version 2026-04-03 10:48:51 +03:00
Mikhail Chusavitin
7a843be6b0 Stabilize DCGM GPU discovery 2026-04-03 09:50:33 +03:00
Mikhail Chusavitin
eea2591bcc Fix John GPU stress duration semantics 2026-04-03 09:46:16 +03:00
Mikhail Chusavitin
444a7d16cc fix(iso): increase boot verbosity for service startup visibility
Raise loglevel from 3 to 6 (INFO) and add systemd.show_status=1 so
kernel driver messages and systemd [ OK ]/[ FAILED ] lines are visible
during boot instead of showing only a blank cursor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 19:33:27 +03:00
Mikhail Chusavitin
3cf75a541a build: collect ISO and logs under versioned dist/easy-bee-v{VERSION}/ dir
All final artefacts for a given version now land in one place:
  dist/easy-bee-v4.1/
    easy-bee-nvidia-v4.1-amd64.iso
    easy-bee-nvidia-v4.1-amd64.logs.tar.gz   ← log archive
                                               (logs dir deleted after archiving)

- Introduce OUT_DIR="${DIST_DIR}/easy-bee-v${ISO_VERSION_EFFECTIVE}"
- Move LOG_DIR, LOG_ARCHIVE, and ISO_OUT into OUT_DIR
- cleanup_build_log: use dirname(LOG_DIR) as tar -C base so the path is
  correct regardless of where OUT_DIR lives; delete LOG_DIR after archiving

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 10:19:11 +03:00
d21d9d191b fix(build): bump DCGM to 4.5.3-1 — core package updated in CUDA repo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:49:57 +03:00
348db35119 fix(stress): stagger john GPU launches to prevent GWS tuning contention
When 8 john processes start simultaneously they race for GPU memory during
OpenCL GWS auto-tuning. Slower devices settle on a smaller work size (~594MiB
vs 762MiB) and run at 40% instead of 100% load. Add 3s sleep between launches
so each instance finishes memory allocation before the next one starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:44:00 +03:00
964ab39656 fix: run john stress in parallel per GPU, fix chromium fullscreen, filter BMC virtual disks
- bee-john-gpu-stress: spawn one john process per OpenCL device in parallel
  so all GPUs are stressed simultaneously instead of only device 1
- bee-openbox-session: --start-fullscreen → --start-maximized to fix blank
  white page on first render in fbdev environment
- storage collector: skip Virtual HDisk* devices reported by BMC/iDRAC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 23:14:21 +03:00
eb60100297 fix: pcie gen, nccl binary, netconf sudo, boot noise, firmware cleanup
- nvidia collector: read pcie.link.gen.current/max from nvidia-smi instead
  of sysfs to avoid false Gen1 readings when GPU is in ASPM idle state
- build: remove bee-nccl-gpu-stress from rm -f list so shell script from
  overlay is not silently dropped from the ISO
- smoketest: add explicit checks for bee-gpu-burn, bee-john-gpu-stress,
  bee-nccl-gpu-stress, all_reduce_perf
- netconf: re-exec via sudo when not root to fix RTNETLINK/resolv.conf errors
- auto/config: reduce loglevel 7→3 to show clean systemd output on boot
- auto/config: blacklist snd_hda_intel and related audio modules (unused on servers)
- package-lists: remove firmware-intel-sound and firmware-amd-graphics from
  base list; move firmware-amd-graphics to bee-amd variant only
- bible-local: mark memtest ADR resolved, document working solution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:25:23 +03:00
Mikhail Chusavitin
2baf3be640 Handle memtest recovery probe under set -e 2026-04-01 17:42:13 +03:00
Mikhail Chusavitin
d92f8f41d0 Fix memtest ISO validation false negatives 2026-04-01 12:22:17 +03:00
Mikhail Chusavitin
76a9100779 fix(iso): rebuild image after memtest recovery 2026-04-01 10:01:14 +03:00
Mikhail Chusavitin
1b6d592bf3 feat(iso): add optional kms display boot path 2026-04-01 09:42:59 +03:00
Mikhail Chusavitin
4e4debd4da refactor(webui): redesign Burn tab and fix gpu-burn memory defaults
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
  Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
  endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
  with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); script
  now queries nvidia-smi memory.total per GPU and uses 95% of it
- platform_stress: removed hardcoded --size-mb 64; respects Components
  field to selectively run CPU and/or GPU load goroutines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:39:07 +03:00
Mikhail Chusavitin
5839f870b7 fix(iso): include full nvidia opencl runtime 2026-04-01 09:16:06 +03:00
Mikhail Chusavitin
b447717a5a fix(iso): harden boot network bring-up - v3.20 2026-04-01 09:10:55 +03:00
Mikhail Chusavitin
f6f4923ac9 fix(iso): recover memtest after live-build 2026-04-01 08:55:57 +03:00
Mikhail Chusavitin
3472afea32 fix(iso): make memtest non-blocking by default 2026-04-01 08:33:36 +03:00
45221d1e9a fix(stress): label loaders and improve john opencl diagnostics 2026-04-01 07:31:52 +03:00
3869788bac fix(iso): validate memtest with xorriso fallback 2026-04-01 07:24:05 +03:00
3dbc2184ef fix(iso): archive build logs and memtest diagnostics 2026-04-01 07:14:53 +03:00
60cb8f889a fix(iso): restore memtest menu entries and validate ISO 2026-04-01 07:04:48 +03:00
c9ee078622 fix(stress): keep platform burn responsive under load 2026-03-31 22:28:26 +03:00
ea660500c9 chore: commit pending repo changes 2026-03-31 22:17:36 +03:00
d43a9aeec7 fix(iso): restore live-build memtest integration 2026-03-31 22:10:28 +03:00
Mikhail Chusavitin
f5622e351e Fix staged John cleanup for repeated ISO builds 2026-03-31 11:40:52 +03:00
Mikhail Chusavitin
a20806afc8 Fix ISO grub package conflict 2026-03-31 11:38:30 +03:00
Mikhail Chusavitin
4f9b6b3bcd Harden NVIDIA boot logging on live ISO 2026-03-31 11:37:21 +03:00
Mikhail Chusavitin
c850b39b01 feat: v3.10 GPU stress and NCCL burn updates 2026-03-31 11:22:27 +03:00