The NVIDIA CUDA HTTPS apt source (developer.download.nvidia.com) may be
unreachable from inside the live-build container chroot, causing
'E: Unable to locate package datacenter-gpu-manager-4-cuda13'.
Add build-dcgm.sh that downloads DCGM and nvidia-fabricmanager .deb
packages on the build host (verifying SHA256 against Packages.gz) and
caches them in BEE_CACHE_DIR. build.sh (step 25-dcgm, nvidia only)
copies them into LB_DIR/config/packages.chroot/ before lb build, so
live-build creates a local apt repo from them. The chroot installs the
packages from the local repo without ever contacting the NVIDIA CUDA
HTTPS source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
modernc.org/sqlite v1.48.0 requires modernc.org/libc/sys/types, which is
absent in modernc.org/libc v1.70.0 but present in v1.72.0.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RunNvidiaPowerBench already performs a full internal ramp from 1 to N
GPUs in Phase 2. Spawning N tasks with growing GPU subsets meant task K
repeated all steps 1..K-1 already done by tasks 1..K-1 — O(N²) work
instead of O(N). Replace with a single task using all selected GPUs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lb binary_grub-efi and lb binary_syslinux create these files from templates
that already have memtest entries hardcoded. The hook should not fail when
the files don't exist yet — validate_iso_memtest() checks the final ISO.
Only the binary files (x64.bin, x64.efi) are required here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ver_arg was set to "=memtest86+=VERSION" making the command
"apt-get download memtest86+=memtest86+=VERSION" (invalid).
Fixed to build pkg_spec directly as "memtest86+=VERSION".
Also add apt-get update retry if initial download fails.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Disabling --security broke the build because linux-image-6.1.0-44-amd64
is a security update not present in the base bookworm repo.
Main packages already come from mirror.mephi.ru.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Switch all lb mirrors to mirror.mephi.ru/debian/ for faster, more reliable downloads
- Disable security repo (--security false) — not needed for LiveCD
- Pin MEMTEST_VERSION=6.10-4 in VERSIONS, export to hook environment
- Set BEE_REQUIRE_MEMTEST=1 in build-in-container.sh — missing memtest is now fatal
- Fix 9100-memtest.hook.binary: add apt-get download fallback when lb
binary_memtest has already purged the package cache; handle both 5.x
(memtest86+x64.bin) and 6.x (memtest86+.bin) BIOS binary naming
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mkdir -p LOG_DIR before writing the optional step log so that a race
with cleanup_build_log (EXIT trap archiving the log dir) does not cause
a "Directory nonexistent" error during lb binary_checksums / lb binary_iso.
Also downgrade apt-get update failure to a warning so a transient mirror
outage does not block kernel ABI auto-detection when the apt cache is warm.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Benchmark page now shows two result sections: Performance (scores) and
Power / Thermal Fit (slot table). After any benchmark task completes
the results section auto-refreshes via GET /api/benchmark/results
without a full page reload.
- Power results table shows each GPU slot with nominal TDP, achieved
stable power limit, and P95 observed power. Rows with derated cards
are highlighted amber so under-performing slots stand out at a glance.
Older runs are collapsed in a <details> summary.
- memtester is now wrapped with timeout(1) so a stuck memory controller
cannot cause Validate Memory to hang indefinitely. Wall-clock limit is
~2.5 min per 100 MB per pass plus a 2-minute buffer.
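The wall-clock limit above can be sketched as a small helper. This is an
illustration only: the name memtesterTimeout and the choice to round partial
100 MB chunks up are assumptions, not the project's code.

```go
package main

import "time"

// memtesterTimeout returns a wall-clock limit for one memtester run:
// about 2.5 minutes per 100 MB per pass, plus a fixed 2-minute buffer.
// Rounding partial 100 MB chunks up is an assumption of this sketch.
func memtesterTimeout(sizeMB, passes int) time.Duration {
	if sizeMB < 1 {
		sizeMB = 1
	}
	if passes < 1 {
		passes = 1
	}
	chunks := (sizeMB + 99) / 100 // number of started 100 MB chunks
	perChunk := 150 * time.Second // 2.5 minutes
	return time.Duration(chunks*passes)*perChunk + 2*time.Minute
}
```

The result would be passed to timeout(1) as a number of seconds, for example
via int(memtesterTimeout(sizeMB, passes).Seconds()).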
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 now calibrates each GPU individually (sequentially) so that
PowerRealizationPct reflects real degradation from neighbour thermals and
shared power rails. Previously the baseline came from an all-GPU-together
run, making realization always ≈100% at the final ramp step.
Ramp step 1 reuses single-card calibration results (no extra run); steps
2..N run targeted_power on the growing GPU subset with derating active.
Remove OccupiedSlots/OccupiedSlotsNote fields and occupiedSlots() helper —
they were compensation for the old all-GPU calibration approach.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously each GPU was calibrated sequentially (one card fully done
before the next started), producing the staircase temperature pattern
seen on the graph.
Now all GPUs run together in a single dcgmi diag -r targeted_power
session per attempt. This means:
- All cards are under realistic thermal load at the same time.
- A single DCGM session handles the run — no resource-busy contention
from concurrent dcgmi processes.
- Binary search state (lo/hi) is tracked independently per GPU; each
card converges to its own highest stable power limit.
- Throttle counter polling covers all active GPUs in the shared ticker.
- Resource-busy exponential back-off is shared (one DCGM session).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove telemetry-guided initial candidate; use strict binary search
midpoint at every step. Convergence is predictable: O(log(range/tolerance))
attempts within the allowed power range [minLimitW, startingLimitW].
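A minimal sketch of the strict midpoint search, under stated assumptions:
findStableLimit and the stable callback are hypothetical names, the callback
stands in for one targeted_power attempt at a given limit, and the floor is
assumed stable while the starting limit is assumed unstable.

```go
package main

// findStableLimit bisects [minLimitW, startingLimitW] until the window is
// no wider than toleranceW. It returns the highest limit confirmed stable
// (or the floor if nothing above it passed) and the number of attempts.
func findStableLimit(minLimitW, startingLimitW, toleranceW int, stable func(limitW int) bool) (int, int) {
	lo, hi := minLimitW, startingLimitW
	attempts := 0
	for hi-lo > toleranceW {
		mid := lo + (hi-lo)/2 // strict midpoint, no telemetry guess
		attempts++
		if stable(mid) {
			lo = mid // mid held: search upward
		} else {
			hi = mid // mid throttled: search downward
		}
	}
	return lo, attempts
}
```

With a 150 W window and a 10 W tolerance the loop halves the window four
times (150, 75, 38, 19, 10), so it finishes in four attempts.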
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Power calibration previously stepped down 25 W at a time (linear),
requiring up to 6 attempts to find a stable limit within a 150 W range.
New strategy:
- Binary search between minLimitW (lo, assumed stable floor) and the
starting/failed limit (hi, confirmed unstable), converging within a
10 W tolerance in ~4 attempts.
- For thermal throttle: the first-quarter telemetry rows estimate the
GPU's pre-throttle power draw. nextLimit = round5W(onset - 10 W) is
used as the initial candidate instead of the binary midpoint, landing
much closer to the true limit on the first step.
- On success: lo is updated and a higher level is tried (binary search
upward) until hi-lo ≤ tolerance, ensuring the highest stable limit is
found rather than the first stable one.
- Let targeted_power run to natural completion on throttle (no mid-run
SIGKILL) so nv-hostengine releases its diagnostic slot cleanly before
the next attempt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During power calibration: if a thermal throttle (sw_thermal/hw_thermal)
causes ≥20% clock drop while server fans are below 98% P95 duty cycle,
record a CoolingWarning on the GPU result and emit an actionable finding
telling the operator to rerun with fans manually fixed at 100%.
During steady-state benchmark: same signal enriches the existing
thermal_limited finding with fan duty cycle and clock drift values.
Covers both the main benchmark (buildBenchmarkFindings) and the power
bench (NvidiaPowerBenchResult.Findings).
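The trigger condition can be expressed as a single predicate. This is a
sketch: coolingWarning and its parameter names are hypothetical, and only the
two thresholds (20% clock drop, 98% P95 fan duty) come from the text above.

```go
package main

// coolingWarning reports whether a thermal throttle looks like a cooling
// problem rather than a GPU limit: the clocks dropped by at least 20%
// while the fans were still below 98% P95 duty cycle.
func coolingWarning(thermalThrottle bool, clockDropPct, fanDutyP95Pct float64) bool {
	return thermalThrottle && clockDropPct >= 20 && fanDutyP95Pct < 98
}
```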
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a targeted_power attempt is cancelled (e.g. after sw_thermal
throttle), nv-hostengine holds the diagnostic slot asynchronously.
The next attempt immediately received DCGM_ST_IN_USE (exit 222)
and incorrectly derated the power limit.
Now: exit 222 is detected via isDCGMResourceBusy and triggers an
exponential back-off retry at the same power limit (1s, 2s, 4s, …
up to 256s). Once the back-off delay would exceed 300s the
calibration fails, indicating the slot is persistently held.
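A minimal sketch of the retry schedule, assuming a hypothetical attempt
callback that returns true when the run ended with exit 222. The sleep
function is injected so the schedule can be exercised without waiting.

```go
package main

import "errors"

var errSlotHeld = errors.New("DCGM diagnostic slot still busy after back-off")

// retryWhileBusy reruns attempt at the same power limit while it reports
// resource-busy, sleeping 1s, 2s, 4s, ... between tries. It gives up once
// the next delay would exceed 300s, i.e. after the 256s wait.
func retryWhileBusy(attempt func() bool, sleep func(seconds int)) (int, error) {
	const maxDelaySec = 300
	delay := 1
	retries := 0
	for attempt() {
		if delay > maxDelaySec {
			return retries, errSlotHeld
		}
		sleep(delay)
		retries++
		delay *= 2
	}
	return retries, nil
}
```

In real use sleep would wrap time.Sleep. A slot that never frees is given up
on after nine waits totalling 511 seconds.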
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
USB Export Drive:
lsblk reports TRAN only for whole disks, not partitions (/dev/sdc1).
Strip trailing partition digits to get parent disk before transport check.
LiveCD in RAM:
When RunInstallToRAM copies squashfs to /dev/shm/bee-live/ but bind-mount
of /run/live/medium fails (CD-ROM boots), /run/live/medium still shows the
CD-ROM fstype. Add fallback: if /dev/shm/bee-live/*.squashfs exists, the
data is in RAM — report status OK.
Dashboard Hardware Summary:
Show server Manufacturer + ProductName as heading and S/N as subline above
the component table, sourced from hw.Board (dmidecode system-type data).
Validate:
Remove Cycles input — always run once. cycles=1 hardcoded in runAllSAT().
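For the USB Export Drive fix above, stripping the partition digits can be
sketched as follows. parentDisk is a hypothetical name, and the approach only
suits sdX-style names: devices such as nvme0n1p1 or mmcblk0p1 would need
different handling.

```go
package main

import "strings"

// parentDisk maps a partition such as /dev/sdc1 to its whole disk
// (/dev/sdc) by dropping trailing digits. A whole-disk name is returned
// unchanged. Not valid for nvme/mmcblk naming schemes.
func parentDisk(dev string) string {
	return strings.TrimRight(dev, "0123456789")
}
```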
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NCCL all_reduce is always attempted when 2+ GPUs are selected; a failure
leaves InterconnectScore=0 (no bonus, no penalty) and OverallStatus
unaffected. Exposing the checkbox implied NCCL is optional and made a
failed run look like a deliberate skip.
- Remove benchmark-run-nccl checkbox and its change listener from pages.go
- Client sends run_nccl: selected.length > 1 (automatic)
- api.go default runNCCL=true is unchanged
- Selection note now mentions NCCL automatically for multi-GPU runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add RampStep/RampTotal/RampRunID to NvidiaBenchmarkOptions, taskParams, and
NvidiaBenchmarkResult so ramp-up steps can be correlated across result.json files
- Add ScalabilityScore field to NvidiaBenchmarkResult (placeholder; computed externally
by comparing ramp-up step results sharing the same ramp_run_id)
- Propagate ramp fields through api.go (generates shared ramp_run_id at spawn time),
tasks.go handler, and benchmark.go result population
- Apply ServerPower penalty to CompositeScore when IPMI reporting_ratio < 0.75:
factor = ratio/0.75, applied per-GPU with a note explaining the reduction
- Add finding when server power delta exceeds GPU-reported sum by >25% (non-GPU draw)
- Report header now shows ramp step N/M and run ID instead of "parallel" when in ramp mode;
shows scalability_score when non-zero
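The ServerPower penalty can be sketched as a scale factor. serverPowerFactor
is a hypothetical name, and the handling of a missing or zero ratio is left
to the caller in this sketch.

```go
package main

// serverPowerFactor returns the multiplier applied to a GPU's
// CompositeScore: 1 when the IPMI reporting ratio is at least 0.75,
// otherwise ratio/0.75.
func serverPowerFactor(reportingRatio float64) float64 {
	const threshold = 0.75
	if reportingRatio >= threshold {
		return 1
	}
	return reportingRatio / threshold
}
```

For example, a reporting ratio of 0.6 scales the score by about 0.8.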
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When --query-gpu CSV fields fail (exit status 2 on some Blackwell +
driver combos), enrichGPUInfoWithMaxClocks now also parses from the
verbose nvidia-smi -q output already collected at benchmark start:
- Default Power Limit → DefaultPowerLimitW
- Current Power Limit → PowerLimitW (fallback)
- Multiprocessor Count → MultiprocessorCount
Fixes PowerSustainScore=0 on systems where all three CSV query
variants fail but nvidia-smi -q succeeds (confirmed on RTX PRO 6000
Blackwell + driver 590.48.01).
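A minimal sketch of pulling one numeric field out of "Key : Value" lines in
the verbose output. parseSMIField is a hypothetical helper, not the
project's parser; it returns only the first match, so multi-GPU output
would have to be split per GPU beforehand.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// parseSMIField scans "Key : Value" lines and returns the leading number
// of the first line whose key matches exactly, ignoring a unit suffix
// such as "W". Non-numeric values (e.g. "N/A") are skipped.
func parseSMIField(output, key string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		name, value, found := strings.Cut(sc.Text(), ":")
		if !found || strings.TrimSpace(name) != key {
			continue
		}
		fields := strings.Fields(value)
		if len(fields) == 0 {
			continue
		}
		n, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		return n, true
	}
	return 0, false
}
```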
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Before the per-GPU compute phases, run `dcgmi diag -r targeted_power`
for 45 s while collecting nvidia-smi power metrics in parallel.
The p95 power per GPU is stored as calibrated_peak_power_w and used
as the denominator for PowerSustainScore instead of the hardware default
limit, which bee-gpu-burn cannot reach because it is compute-only.
Fallback chain: calibrated peak → default limit → enforced limit.
If dcgmi is absent or the run fails, calibration is skipped silently.
Adjust composite score weights to match the new honest power reference:
base 0.35, thermal 0.25, stability 0.25, power 0.15, NCCL bonus 0.10.
Power weight reduced (0.20→0.15) because even with a calibrated reference
bee-gpu-burn reaches ~60-75% of TDP by design (no concurrent mem stress).
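The fallback chain for the PowerSustainScore denominator can be sketched as
below. powerReferenceW is a hypothetical name, and treating a non-positive
value as "not available" is an assumption.

```go
package main

// powerReferenceW picks the denominator in order of preference:
// calibrated peak, then default limit, then enforced limit.
// Values <= 0 are treated as missing. Returns 0 if none is usable.
func powerReferenceW(calibratedPeakW, defaultLimitW, enforcedLimitW float64) float64 {
	for _, w := range []float64{calibratedPeakW, defaultLimitW, enforcedLimitW} {
		if w > 0 {
			return w
		}
	}
	return 0
}
```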
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BenchmarkHostConfig captures CPU model, sockets, cores, threads, and
total RAM from /proc/cpuinfo and /proc/meminfo at benchmark start.
- BenchmarkCPULoad samples host CPU utilisation every 10 s throughout
the GPU steady-state phase (sequential and parallel paths).
- Summarises avg/max/p95 and classifies status as ok / high / unstable.
- Adds a finding when CPU load is elevated (avg >20% or max >40%) or
erratic (stddev >12%), with a plain-English description in the report.
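A sketch of the summary and classification, assuming the thresholds above.
The type and function names are hypothetical, the p95 uses the nearest-rank
method, and giving "unstable" precedence over "high" is an assumption of
this sketch.

```go
package main

import (
	"math"
	"sort"
)

type cpuLoadSummary struct {
	Avg, Max, P95, StdDev float64
	Status                string // "ok", "high" or "unstable"
}

// summarizeCPULoad reduces utilisation samples (percent) to avg/max/p95
// and classifies them: stddev > 12 is "unstable", avg > 20 or max > 40
// is "high", anything else is "ok".
func summarizeCPULoad(samples []float64) cpuLoadSummary {
	if len(samples) == 0 {
		return cpuLoadSummary{Status: "ok"}
	}
	sorted := append([]float64(nil), samples...) // keep caller's slice intact
	sort.Float64s(sorted)
	n := float64(len(sorted))
	var sum float64
	for _, s := range sorted {
		sum += s
	}
	avg := sum / n
	var sq float64
	for _, s := range sorted {
		sq += (s - avg) * (s - avg)
	}
	rank := int(math.Ceil(0.95*n)) - 1 // nearest-rank percentile index
	out := cpuLoadSummary{
		Avg:    avg,
		Max:    sorted[len(sorted)-1],
		P95:    sorted[rank],
		StdDev: math.Sqrt(sq / n),
		Status: "ok",
	}
	switch {
	case out.StdDev > 12:
		out.Status = "unstable"
	case out.Avg > 20 || out.Max > 40:
		out.Status = "high"
	}
	return out
}
```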
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new ramp-up mode checkbox (enabled by default) to the benchmark section.
In ramp-up mode N tasks are spawned simultaneously: 1 GPU, then 2,
then 3, up to all selected GPUs — each step runs its GPUs in parallel.
NCCL runs only on the final step.
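The growing GPU subsets can be sketched as below. rampSubsets is a
hypothetical name, and the GPU indices are assumed to arrive in the order
the operator selected them.

```go
package main

// rampSubsets returns one GPU list per ramp step: the first GPU, the
// first two, and so on up to all selected GPUs. Each step gets its own
// copy so later changes to the input do not affect spawned tasks.
func rampSubsets(gpus []int) [][]int {
	steps := make([][]int, 0, len(gpus))
	for n := 1; n <= len(gpus); n++ {
		steps = append(steps, append([]int(nil), gpus[:n]...))
	}
	return steps
}
```

Only the last element, which holds every selected GPU, would have NCCL
enabled.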
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
clocks.max.graphics / clocks.max.memory CSV fields return exit status 2 on
RTX PRO 6000 Blackwell (driver 98.x), causing the entire gpu inventory query
to fail and clock lock to be skipped → normalization: partial.
Fix:
- Add minimal fallback query (index,uuid,name,pci.bus_id,vbios_version,
power.limit) that succeeds even without clock fields
- Add enrichGPUInfoWithMaxClocks: parses "Max Clocks" section of
nvidia-smi -q verbose output to fill MaxGraphicsClockMHz /
MaxMemoryClockMHz when CSV fields fail
- Move nvidia-smi -q execution before queryBenchmarkGPUInfo so its output
is available for clock enrichment immediately after
- Tests: cover enrichment and skip-if-populated cases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- grub.cfg + isolinux/live.cfg.in: add pcie_aspm=off,
intel_idle.max_cstate=1 and processor.max_cstate=1 to all
non-failsafe boot entries
- bee-hpc-tuning: new script that sets all CPU cores to performance
governor via sysfs and logs THP state at boot
- bee-hpc-tuning.service: runs before bee-nvidia and bee-audit
- 9000-bee-setup.hook.chroot: enable service and mark script executable
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- collector/pcie: add applyPCIeLinkSpeedWarning that sets status=Warning
and ErrorDescription when the current link speed is below the maximum
supported speed (e.g. Gen1 running on a Gen5 slot)
- collector/pcie: add pcieLinkSpeedRank helper for Gen string comparison
- collector/pcie_filter_test: cover degraded and healthy link speed cases
- platform/techdump: collect lspci -vvv → lspci-vvv.txt for LnkCap/LnkSta
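A sketch of the Gen string comparison, assuming speeds are reported as
"Gen1" through "Gen5". genRank and linkDegraded are hypothetical names; the
project's pcieLinkSpeedRank may differ in detail.

```go
package main

import (
	"strconv"
	"strings"
)

// genRank turns "Gen3" into 3. Unrecognised strings rank as 0 so that
// they never trigger a warning on their own.
func genRank(s string) int {
	s = strings.TrimSpace(s)
	if len(s) < 4 || !strings.EqualFold(s[:3], "gen") {
		return 0
	}
	n, err := strconv.Atoi(s[3:])
	if err != nil || n < 1 {
		return 0
	}
	return n
}

// linkDegraded reports whether the current speed is known and lower
// than the known maximum speed.
func linkDegraded(current, max string) bool {
	c, m := genRank(current), genRank(max)
	return c > 0 && m > 0 && c < m
}
```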
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- schema: add ToRAMStatus and USBExportPath fields to RuntimeHealth
- platform/runtime.go: collectToRAMHealth (ok/warning/failed based on
IsLiveMediaInRAM + toramActive) and collectUSBExportHealth (scans
/proc/mounts + lsblk for writable USB-backed filesystems)
- pages.go: add USB Export Drive and LiveCD in RAM rows to the health table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nvidia.go: add Name field to nvidiaGPUInfo, include model name in
nvidia-smi query, set dev.Model in enrichPCIeWithNVIDIAData
- pages.go: fix duplicate GPU count in validate card summary (4 GPU: 4 x …
→ 4 x … GPU); fix PSU UNKNOWN fallback from hw.PowerSupplies; treat
activating/deactivating/reloading service states as OK in Runtime Health
- support_bundle.go: use "150405" time format (no colons) for exFAT compat
- sat.go / benchmark.go / platform_stress.go / sat_fan_stress.go: remove
.tar.gz archive creation from export dirs — export packs everything itself
- charts_svg.go: add min-max downsampling (1400 pt cap) for SVG chart perf
- benchmark_report.go / sat.go: normalize GPU fallback to "Unknown GPU"
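The min-max downsampling mentioned for charts_svg.go can be sketched as
follows. This is a generic version of the technique, not the project's
code; downsampleMinMax is a hypothetical name.

```go
package main

// downsampleMinMax reduces v to at most maxPoints values by splitting it
// into maxPoints/2 buckets and keeping each bucket's minimum and maximum
// in their original order, so spikes and dips survive the reduction.
func downsampleMinMax(v []float64, maxPoints int) []float64 {
	if maxPoints < 2 || len(v) <= maxPoints {
		return v
	}
	buckets := maxPoints / 2
	out := make([]float64, 0, buckets*2)
	for b := 0; b < buckets; b++ {
		start := b * len(v) / buckets
		end := (b + 1) * len(v) / buckets
		minI, maxI := start, start
		for i := start + 1; i < end; i++ {
			if v[i] < v[minI] {
				minI = i
			}
			if v[i] > v[maxI] {
				maxI = i
			}
		}
		if minI <= maxI {
			out = append(out, v[minI], v[maxI])
		} else {
			out = append(out, v[maxI], v[minI])
		}
	}
	return out
}
```

With the 1400-point cap, a series of any length is drawn with at most 700
min/max pairs.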
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues:
1. BMC/management VGA chips (e.g. Huawei iBMC Hi171x, ASPEED) were included
in GPU inventory because shouldIncludePCIeDevice only checked the PCI class,
not the device name. Added a name-based filter for known BMC/management
patterns when the class is VGA/display/3d.
2. New NVIDIA GPUs (e.g. RTX PRO 6000 Blackwell, device ID 2bb5) showed as
"Device 2bb5" because lspci's database lags behind. Added "name" to the
nvidia-smi query and use it to override dev.Model during enrichment.
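The name-based filter from item 1 can be sketched as below. The function
name and the pattern list are illustrative assumptions built from the two
examples named above; the real list of BMC patterns may be longer.

```go
package main

import "strings"

// bmcNamePatterns holds lower-case substrings of known management VGA
// chips. Only the two examples from the commit text are listed here.
var bmcNamePatterns = []string{"aspeed", "hi171"}

// isManagementVGA reports whether a VGA/display/3D-class device looks
// like a BMC graphics chip and should be left out of the GPU inventory.
func isManagementVGA(className, deviceName string) bool {
	class := strings.ToLower(className)
	if !strings.Contains(class, "vga") &&
		!strings.Contains(class, "display") &&
		!strings.Contains(class, "3d") {
		return false // the name filter only applies to display classes
	}
	name := strings.ToLower(deviceName)
	for _, p := range bmcNamePatterns {
		if strings.Contains(name, p) {
			return true
		}
	}
	return false
}
```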
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
hpl was listed in baseTargets and stressOnlyTargets but /api/sat/hpl/run
was never registered, causing a 405 Method Not Allowed (not valid JSON)
error when Validate one by one was triggered in stress mode.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Runtime Health now shows only LiveCD system status (services, tools,
drivers, network, CUDA/ROCm) — hardware component rows removed
- Hardware Summary now shows server components with readable descriptions
(model, count×size) and component-status.json health badges
- Add Network Adapters row to Hardware Summary
- SFP module static info (vendor, PN, SN, connector, type, wavelength)
now collected via ethtool -m regardless of carrier state
- PSU statuses from IPMI audit written to component-status.json so PSU
badge shows actual status after first audit instead of UNKNOWN
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>