Evolution Strategies at the Hyperscale

Bidipta Sarkar; Mattie Fellows; Juan Agustin Duque; Alistair Letcher; Antonio León Villares; Anya Sims; Clarisse Wibault; Dmitry Samsonov; Dylan Cope; Jarek Liesen; Kang Li; Lukas Seier; Theo Wolf; Uljad Berdica; Valentin Mohl; Alexander David Goldie; Aaron Courville; Karin Sevegnani; Shimon Whiteson; Jakob Nicolaus Foerster

arXiv:2511.16652 · cs.LG · February 17, 2026


Open Access

TL;DR

This paper introduces EGGROLL, an evolution strategy that structures each perturbation as a low-rank matrix. This speeds up large-scale black-box optimization on GPUs enough to train billion-parameter models efficiently, and the paper pairs the method with a theoretical analysis of when ES updates converge in high dimensions.

Contribution

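EGGROLL structures individual perturbations as rank-$r$ matrices to improve arithmetic intensity, giving a hundredfold increase in training speed for billion-parameter models at large population sizes. The paper also provides a theoretical analysis of the conditions under which Gaussian ES updates converge in high dimensions.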

Findings

01. EGGROLL achieves up to 91% of the throughput of pure batch inference.

02. It enables stable pretraining of integer-based recurrent language models.

03. It performs competitively on reasoning tasks and RL benchmarks.

Abstract

Evolution Strategies (ES) is a class of powerful black-box optimisation methods that are highly parallelisable and can handle non-differentiable and noisy objectives. However, naïve ES becomes prohibitively expensive at scale on GPUs due to the low arithmetic intensity of batched matrix multiplications with unstructured random perturbations. We introduce Evolution Guided GeneRal Optimisation via Low-rank Learning (EGGROLL), which improves arithmetic intensity by structuring individual perturbations as rank-$r$ matrices, resulting in a hundredfold increase in training speed for billion-parameter models at large population sizes, achieving up to 91% of the throughput of pure batch inference. We provide a rigorous theoretical analysis of Gaussian ES for high-dimensional parameter objectives, investigating conditions needed for ES updates to converge in high dimensions. Our results reveal…
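
The low-rank idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It only checks the algebraic identity (W + σ A Bᵀ) x = W x + σ A (Bᵀ x), which lets every population member share one dense matrix multiplication and add a cheap rank-r correction. The sizes, the scale `sigma`, the Gaussian factors `A` and `B`, and both function names are assumptions made for illustration. Any normalisation of the low-rank term that the paper may use is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
population, d_in, d_out, rank = 8, 64, 32, 2
sigma = 0.1  # perturbation scale (assumed value)

W = rng.standard_normal((d_out, d_in))       # weights shared by all members
x = rng.standard_normal((population, d_in))  # one input per population member

# Low-rank factors: member p's perturbation is A[p] @ B[p].T (rank <= `rank`).
A = rng.standard_normal((population, d_out, rank))
B = rng.standard_normal((population, d_in, rank))


def forward_dense(W, A, B, x, sigma):
    """Reference path: materialise every perturbed weight matrix."""
    E = np.einsum("por,pir->poi", A, B)  # (population, d_out, d_in)
    return np.einsum("poi,pi->po", W + sigma * E, x)


def forward_low_rank(W, A, B, x, sigma):
    """Same outputs without forming any per-member d_out x d_in matrix."""
    base = x @ W.T                           # one matmul shared by the batch
    proj = np.einsum("pir,pi->pr", B, x)     # B_p^T x_p, shape (population, rank)
    corr = np.einsum("por,pr->po", A, proj)  # A_p (B_p^T x_p)
    return base + sigma * corr


y_dense = forward_dense(W, A, B, x, sigma)
y_low_rank = forward_low_rank(W, A, B, x, sigma)
print(np.allclose(y_dense, y_low_rank))
```

In the dense path each member needs its own d_out × d_in matrix, whereas the low-rank path stores only r · (d_in + d_out) extra numbers per member and reuses `W` across the whole batch. This is the kind of structure the abstract credits for the improved arithmetic intensity.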

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Metaheuristic Optimization Algorithms Research · Tensor decomposition and applications