- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
factor = ratio/0.75, applied per GPU with a note explaining the reduction
(see the sketch after this list)
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
shows scalability_score when non-zero
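A minimal sketch of the power-penalty factor described above, with illustrative names rather than the actual benchmark.go symbols:

```go
package bench

import "fmt"

// applyServerPowerPenalty scales a per-GPU composite score when the IPMI
// reporting ratio falls below 0.75, returning the adjusted score and a
// human-readable note explaining the reduction.
func applyServerPowerPenalty(composite, reportingRatio float64) (float64, string) {
	const threshold = 0.75
	if reportingRatio <= 0 || reportingRatio >= threshold {
		return composite, ""
	}
	factor := reportingRatio / threshold // e.g. ratio 0.60 -> factor 0.80
	note := fmt.Sprintf("score reduced x%.2f: IPMI reported only %.0f%% of expected power (threshold 75%%)",
		factor, reportingRatio*100)
	return composite * factor, note
}
```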
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new checkbox (enabled by default) in the benchmark section.
In ramp-up mode N tasks are spawned at once: step 1 uses 1 GPU, step 2 uses 2,
and so on up to all selected GPUs; each step runs its GPUs in parallel
(see the sketch below). NCCL runs only on the final step.
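A rough sketch of the spawning loop, assuming hypothetical option and queue names (the real NvidiaBenchmarkOptions has more fields):

```go
package webui

import (
	"fmt"
	"time"
)

// BenchmarkOptions stands in for NvidiaBenchmarkOptions; only the ramp fields
// described above are shown.
type BenchmarkOptions struct {
	GPUs      []int
	RampStep  int
	RampTotal int
	RampRunID string
	RunNCCL   bool
}

// spawnRampRun enqueues one task per ramp step: step 1 gets 1 GPU, step 2 gets
// 2, and so on. All steps share a ramp_run_id generated at spawn time so their
// result.json files can be correlated later; only the final step runs NCCL.
func spawnRampRun(selected []int, enqueue func(BenchmarkOptions)) {
	runID := fmt.Sprintf("ramp-%d", time.Now().UnixNano()) // placeholder ID scheme
	for step := 1; step <= len(selected); step++ {
		enqueue(BenchmarkOptions{
			GPUs:      selected[:step],
			RampStep:  step,
			RampTotal: len(selected),
			RampRunID: runID,
			RunNCCL:   step == len(selected),
		})
	}
}
```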
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bootloader: GRUB fallback text colors → yellow/brown (amber tone)
- CLI charts: all GPU metric series use single amber color (xterm-256 #214)
- Wallpaper: logo width scaled to 400 px dynamically, shadow scales with font size
- Support bundle: renamed to YYYY-MM-DD (BEE-SP vX.X) SRV_MODEL SRV_SN ToD.tar.gz
using dmidecode for server model (spaces→underscores) and serial number
- Remove display resolution feature (UI card, API routes, handlers, tests)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace diag level 1-4 dropdown with Validate/Stress radio buttons
- Validate: dcgmi L2, 60s CPU, 256MB/1p memtester, SMART short
- Stress: dcgmi L3 + targeted_stress in Run All, 30min CPU, 1GB/3p memtester, SMART long/NVMe extended
- Parallel GPU mode: spawn single task for all GPUs instead of splitting per model
- Benchmark table: per-GPU columns for sequential runs, server-wide column for parallel
- Benchmark report converted to Markdown with server model, GPU model, version in header; only steady-state charts
- Fix IPMI power parsing in benchmark (it was looking for 'Current Power'; the correct field is 'Instantaneous power reading')
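A minimal sketch of parsing the corrected field from ipmitool's power-reading output; the function name and surrounding plumbing are illustrative:

```go
package bench

import (
	"bufio"
	"strconv"
	"strings"
)

// parseIPMIPowerWatts extracts the wattage from ipmitool power-reading output,
// matching the "Instantaneous power reading" field rather than the previously
// assumed "Current Power" label.
func parseIPMIPowerWatts(output string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "Instantaneous power reading") {
			continue
		}
		// Expected form: "Instantaneous power reading:   234 Watts"
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		fields := strings.Fields(parts[1])
		if len(fields) == 0 {
			continue
		}
		if w, err := strconv.ParseFloat(fields[0], 64); err == nil {
			return w, true
		}
	}
	return 0, false
}
```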
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add parallel GPU mode (checkbox, off by default): runs all selected GPUs
simultaneously via a single bee-gpu-burn invocation instead of sequentially;
per-GPU telemetry, throttle counters, TOPS, and scoring are preserved
- Make queryBenchmarkGPUInfo resilient: falls back to a base field set when
extended fields (attribute.multiprocessor_count, power.default_limit) cause
exit status 2, preventing lgc normalization from being silently skipped
- Log explicit "graphics clock lock skipped" note when inventory is unavailable
- Collect server model from DMI (/sys/class/dmi/id/product_name) and store in
result JSON; benchmark history columns now show "Server Model (N× GPU Model)"
grouped by server+GPU type rather than individual GPU index
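A minimal sketch of the DMI lookup described above; error handling beyond an empty fallback is omitted:

```go
package bench

import (
	"os"
	"strings"
)

// readServerModel returns the DMI product name used for the
// "Server Model (N× GPU Model)" history columns, or "" if sysfs is
// unavailable (e.g. inside some VMs or containers).
func readServerModel() string {
	data, err := os.ReadFile("/sys/class/dmi/id/product_name")
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(data))
}
```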
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- services.go: use sudo systemctl so the bee user can control system services
- api.go: always return 200 with an output field even on error, so the
frontend shows the actual systemctl message instead of "exit status 1"
(see the sketch after this list)
- pages.go: the button shows "..." while pending, then restores its label;
the output panel is full-width under the table with a ✓/✗ status indicator;
output auto-scrolls to the bottom
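A sketch of the always-200 pattern from the api.go item above; the handler name, query parameters, and JSON shape are assumptions, and a real handler would whitelist the allowed actions before invoking sudo:

```go
package webui

import (
	"encoding/json"
	"net/http"
	"os/exec"
)

// handleServiceAction runs `sudo systemctl <action> <unit>` and always responds
// 200 with the combined output, so the frontend can show the real systemctl
// message instead of a bare "exit status 1".
func handleServiceAction(w http.ResponseWriter, r *http.Request) {
	action := r.URL.Query().Get("action")
	unit := r.URL.Query().Get("unit")

	out, err := exec.Command("sudo", "systemctl", action, unit).CombinedOutput()

	resp := map[string]any{
		"ok":     err == nil,
		"output": string(out),
	}
	if err != nil {
		resp["error"] = err.Error()
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp) // 200 even on error
}
```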
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Support bundle is now built on-the-fly when the user clicks the button,
regardless of whether other tasks are running:
- GET /export/support.tar.gz builds the bundle synchronously and streams it
directly to the client; the temp archive is removed after serving
- Remove POST /api/export/bundle and handleAPIExportBundle — the task-queue
approach meant the bundle could only be downloaded after navigating away
and back, and was blocked entirely while a long SAT test was running
- UI: single "Download Support Bundle" button; fetch+blob gives a loading
state ("Building...") while the server collects logs, then triggers the
browser download with the correct filename from Content-Disposition
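A rough sketch of the synchronous build-and-stream flow; buildSupportBundle is a hypothetical stand-in for the actual log-collection code:

```go
package webui

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
)

// buildSupportBundle stands in for the real collection logic.
var buildSupportBundle = func(ctx context.Context, outPath string) error {
	return os.WriteFile(outPath, []byte("placeholder"), 0o644)
}

// handleSupportBundle builds the archive on demand, streams it to the client,
// and removes the temp file afterwards; no task queue is involved.
func handleSupportBundle(w http.ResponseWriter, r *http.Request) {
	tmpDir, err := os.MkdirTemp("", "bee-support-*")
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer os.RemoveAll(tmpDir) // temp archive removed after serving

	archive := filepath.Join(tmpDir, "support.tar.gz")
	if err := buildSupportBundle(r.Context(), archive); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Content-Disposition supplies the filename the UI reads after fetch+blob.
	w.Header().Set("Content-Type", "application/gzip")
	w.Header().Set("Content-Disposition",
		fmt.Sprintf("attachment; filename=%q", filepath.Base(archive)))
	http.ServeFile(w, r, archive)
}
```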
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Burn tab: replace 6 flat cards with 3 grouped cards (GPU Stress,
Compute Stress, Platform Thermal Cycling) + global Burn Profile
- Run All button at top enqueues all enabled tests across all cards
- GPU Stress: tool checkboxes enabled/disabled via new /api/gpu/tools
endpoint based on driver status (/dev/nvidia0, /dev/kfd)
- Compute Stress: checkboxes for cpu/memory-stress/stressapptest
- Platform Thermal Cycling: component checkboxes (cpu/nvidia/amd)
with platform_components param wired through to PlatformStressOptions
- bee-gpu-burn: default size-mb changed from 64 to 0 (auto); the script
now queries nvidia-smi memory.total per GPU and uses 95% of it (sketch below)
- platform_stress: removed hardcoded --size-mb 64; respects Components
field to selectively run CPU and/or GPU load goroutines
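bee-gpu-burn itself is a script, but the auto-sizing logic it now applies looks roughly like this Go sketch (parsing details are illustrative):

```go
package bench

import (
	"os/exec"
	"strconv"
	"strings"
)

// autoBurnSizeMB returns 95% of each GPU's total memory in MiB, mirroring the
// size-mb=0 (auto) behaviour described above.
func autoBurnSizeMB() ([]int, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.total", "--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, err
	}
	var sizes []int
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		totalMiB, err := strconv.Atoi(strings.TrimSpace(line))
		if err != nil {
			continue
		}
		sizes = append(sizes, totalMiB*95/100)
	}
	return sizes, nil
}
```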
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add modernc.org/sqlite dependency; write every sample to
/appdata/bee/metrics.db (WAL mode, prune to 24h on startup)
- Pre-fill ring buffers from last 120 DB rows on startup so charts
survive service restarts
- Ticker changed 3s→1s; chart JS refresh will be set to 2s (lag ≤3s)
- Add GET /api/metrics/export.csv for full history download
- Chart rendering: SymbolNone (no dots), right padding=80px so peak
mark line label is not clipped, min/avg/max appended to chart title
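A minimal sketch of the startup path, assuming a simple illustrative schema (the real table layout may differ):

```go
package metrics

import (
	"database/sql"
	"time"

	_ "modernc.org/sqlite" // pure-Go driver, registered as "sqlite"
)

// openMetricsDB opens the metrics database in WAL mode and prunes rows older
// than 24h, as done on startup.
func openMetricsDB(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite", path)
	if err != nil {
		return nil, err
	}
	if _, err := db.Exec(`PRAGMA journal_mode=WAL;`); err != nil {
		return nil, err
	}
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS samples (
		ts INTEGER NOT NULL, cpu_temp REAL, power_w REAL)`); err != nil {
		return nil, err
	}
	cutoff := time.Now().Add(-24 * time.Hour).Unix()
	if _, err := db.Exec(`DELETE FROM samples WHERE ts < ?`, cutoff); err != nil {
		return nil, err
	}
	return db, nil
}

// lastSamples returns the newest n rows oldest-first, used to pre-fill the
// in-memory ring buffers so charts survive a service restart.
func lastSamples(db *sql.DB, n int) (*sql.Rows, error) {
	return db.Query(`SELECT ts, cpu_temp, power_w FROM
		(SELECT * FROM samples ORDER BY ts DESC LIMIT ?) ORDER BY ts ASC`, n)
}
```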
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Task queue: all SAT/audit jobs are enqueued and run one at a time;
tasks persist across page navigation; new Tasks page with cancel/priority/log stream
- UI: consolidate nav (Validate, Burn, Tasks, Tools); Audit becomes modal;
Dashboard hardware summary badges + split metrics charts (load/temp/power);
Tools page consolidates network, services, install, support bundle
- AMD GPU: acceptance test and stress burn cards; GPU presence API greys
out irrelevant SAT cards automatically
- Burn tests: Memory Stress (stress-ng --vm), SAT Stress (stressapptest)
- Install to RAM: copies the squashfs to /dev/shm and re-associates the loop
devices via the LOOP_CHANGE_FD ioctl so the live media can be ejected
(see the sketch after this list)
- Charts: relative time axis (0 = now, negative left)
- memtester: LimitMEMLOCK=infinity in bee-web.service; empty output → UNSUPPORTED
- SAT overlay applied dynamically on every /audit.json serve
- MIME panic guard for LiveCD ramdisk I/O errors
- ISO: add memtest86+, stressapptest packages; memtest86+ GRUB entry;
disable screensaver/DPMS in bee-openbox-session
- Unknown SAT status severity = 1 (does not override OK)
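A minimal sketch of the loop-device re-association from the Install-to-RAM item, assuming golang.org/x/sys/unix; note that LOOP_CHANGE_FD only succeeds on a read-only loop device with an identically sized replacement file:

```go
package install

import (
	"os"

	"golang.org/x/sys/unix"
)

// swapLoopBacking points an existing loop device (e.g. the one backing the
// live squashfs) at a new file in /dev/shm via the LOOP_CHANGE_FD ioctl,
// so the original live media can be removed without unmounting anything.
func swapLoopBacking(loopDev, newBacking string) error {
	loop, err := os.OpenFile(loopDev, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer loop.Close()

	repl, err := os.Open(newBacking)
	if err != nil {
		return err
	}
	defer repl.Close()

	// LOOP_CHANGE_FD atomically substitutes the backing file descriptor.
	return unix.IoctlSetInt(int(loop.Fd()), unix.LOOP_CHANGE_FD, int(repl.Fd()))
}
```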
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- jobState now has an optional cancel func; abort() calls it if the job is running (sketch below)
- handleAPISATRun passes cancellable context to RunNvidiaAcceptancePackWithOptions
- POST /api/sat/abort?job_id=... cancels the running SAT job
- bible-local/runtime-flows.md: replace TUI SAT flow with Web UI flow
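A sketch of the cancel-func pattern; the real jobState in the web UI has more fields:

```go
package webui

import (
	"context"
	"sync"
)

// jobState holds an optional cancel func while a SAT job is running.
type jobState struct {
	mu      sync.Mutex
	running bool
	cancel  context.CancelFunc
}

// start derives a cancellable context for the job and remembers its cancel func;
// this context is what gets passed to RunNvidiaAcceptancePackWithOptions.
func (j *jobState) start(parent context.Context) context.Context {
	ctx, cancel := context.WithCancel(parent)
	j.mu.Lock()
	j.running, j.cancel = true, cancel
	j.mu.Unlock()
	return ctx
}

// abort cancels the job if it is still running; POST /api/sat/abort calls this.
func (j *jobState) abort() {
	j.mu.Lock()
	defer j.mu.Unlock()
	if j.running && j.cancel != nil {
		j.cancel()
		j.running = false
	}
}
```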
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expose the existing bee-install script through the web UI:
- platform/install.go: remove USB exclusion, add SizeBytes/MountedParts
fields, add MinInstallBytes()/DiskWarnings() safety checks (size,
mounted partitions, toram+low-RAM warning)
- webui: add GET /api/install/disks, POST /api/install/run,
GET /api/install/stream endpoints
- webui: add Install to Disk page with disk table, warning badges,
device-name confirmation gate, SSE progress terminal, reboot button
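A minimal sketch of the SSE progress endpoint; the progress channel stands in for the real pipe from the bee-install process:

```go
package webui

import (
	"fmt"
	"net/http"
)

// handleInstallStream pushes installer output lines to the browser as
// Server-Sent Events until the source channel closes or the client leaves.
func handleInstallStream(progress <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")

		for {
			select {
			case <-r.Context().Done():
				return
			case line, open := <-progress:
				if !open {
					return
				}
				fmt.Fprintf(w, "data: %s\n\n", line)
				flusher.Flush()
			}
		}
	}
}
```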
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- One engine: go-analyze/charts (grafana theme) for all live metrics
- Server chart: CPU temp, CPU load%, mem load%, power W, fan RPMs
- GPU charts: temp, load%, mem%, power W — one card per GPU, added dynamically
- Charts 1400x280px SVG, rendered at width:100% in single-column layout
- Add CPU load (from /proc/stat, see the sketch after this list) and mem load (from /proc/meminfo) to LiveMetricSample
- Add GPU mem utilization to GPUMetricRow (nvidia-smi utilization.memory)
- Document charting architecture in bible-local/architecture/charting.md
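A sketch of deriving CPU load% from /proc/stat deltas for LiveMetricSample; the sampling interval is a parameter here rather than the service ticker:

```go
package metrics

import (
	"os"
	"strconv"
	"strings"
	"time"
)

// cpuLoadPercent samples the aggregate "cpu" line of /proc/stat twice and
// returns busy time as a percentage of total time over the interval.
func cpuLoadPercent(interval time.Duration) (float64, error) {
	busy1, total1, err := readCPUStat()
	if err != nil {
		return 0, err
	}
	time.Sleep(interval)
	busy2, total2, err := readCPUStat()
	if err != nil {
		return 0, err
	}
	if total2 == total1 {
		return 0, nil
	}
	return 100 * float64(busy2-busy1) / float64(total2-total1), nil
}

func readCPUStat() (busy, total uint64, err error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return 0, 0, err
	}
	// First line: "cpu  user nice system idle iowait irq softirq steal ..."
	fields := strings.Fields(strings.SplitN(string(data), "\n", 2)[0])
	for i, f := range fields[1:] {
		v, perr := strconv.ParseUint(f, 10, 64)
		if perr != nil {
			return 0, 0, perr
		}
		total += v
		if i != 3 && i != 4 { // skip idle and iowait
			busy += v
		}
	}
	return busy, total, nil
}
```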
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Metrics:
- Replace canvas JS charts with server-side SVG via go-analyze/charts
- Add ring buffers (120 samples) for CPU temp and power
- /api/metrics/chart/{name}.svg endpoint serves live SVG, polled every 2s
Dashboard:
- Replace custom renderViewerPage with viewer.RenderHTML() from reanimator/chart submodule
- Mount chart static assets at /chart/static/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace raw systemctl output in table cell with:
- state badge (active/failed/inactive) — click to expand
- full systemctl status in collapsible pre block (max 200px scroll)
Fixes the layout explosion caused by multi-line status text in the table.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>