When 8 john processes start simultaneously, they race for GPU memory
during OpenCL GWS auto-tuning. Slower devices settle on a smaller work
size (~594 MiB vs 762 MiB) and run at 40% load instead of 100%.

Add a 3-second sleep between launches so that each instance finishes
its memory allocation before the next one starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>