ESSA: Evolutionary Strategies for Scalable Alignment

Daria Korotyshova; Boris Shaposhnikov; Alexey Malakhov; Alexey Khokhulin; Nikita Surnachev; Kirill Ovcharenko; George Bredis; Alexey Gorbatovski; Viacheslav Sinii; Daniil Gavrilov

arXiv:2507.04453·cs.LG·December 23, 2025




TL;DR

ESSA introduces a gradient-free evolutionary strategy framework for aligning large language models, reducing training complexity and hyperparameter tuning, and demonstrating superior scalability and efficiency over traditional gradient-based methods.

Contribution

The paper presents ESSA, a novel evolutionary strategies approach that enables scalable, efficient LLM alignment using black-box optimization focused on low-rank adapters and singular value optimization.

Findings

1. Improves accuracy on benchmark tasks compared to GRPO.
2. Achieves faster scaling and training efficiency on large models.
3. Operates effectively in quantized inference modes.

Abstract

Alignment of Large Language Models (LLMs) typically relies on Reinforcement Learning from Human Feedback (RLHF) with gradient-based optimizers such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO). While effective, these methods require complex distributed training, large memory budgets, and careful hyperparameter tuning, all of which become increasingly difficult at billion-parameter scale. We present ESSA, Evolutionary Strategies for Scalable Alignment, a gradient-free framework that aligns LLMs using only forward inference and black-box optimization. ESSA focuses optimization on Low-Rank Adapters (LoRA) and further compresses their parameter space by optimizing only the singular values from a singular value decomposition (SVD) of each adapter matrix. This dimensionality reduction makes evolutionary search practical even for very large models and…
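The idea of searching only over singular values can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Gaussian evolution strategy with antithetic sampling (the paper's actual optimizer may differ), a single adapter matrix, and a caller-supplied `reward_fn` standing in for a reward computed from forward inference. The function name `es_on_singular_values` and all hyperparameter defaults are hypothetical.

```python
import numpy as np

def es_on_singular_values(M, reward_fn, sigma=0.05, lr=0.1, pop=16, steps=50, seed=0):
    """Toy ES that perturbs only the singular values of one adapter matrix M.

    Assumptions (not from the paper): Gaussian antithetic perturbations,
    standardized rewards, and fixed singular vectors U and Vt.
    """
    rng = np.random.default_rng(seed)
    # M = U @ diag(s) @ Vt; U and Vt stay frozen, only s is searched.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    for _ in range(steps):
        half = rng.standard_normal((pop, s.size))
        eps = np.concatenate([half, -half])  # antithetic pairs
        # Each candidate needs only a forward evaluation of the reward.
        rewards = np.array([reward_fn((U * (s + sigma * e)) @ Vt) for e in eps])
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # Move s toward perturbations that scored above average.
        s = s + lr * (adv @ eps) / len(eps)
    return (U * s) @ Vt, s
```

In this sketch the search space has one dimension per singular value (at most the adapter rank), rather than one per adapter weight, which is the dimensionality reduction the abstract describes.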

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

* Novel technical approach. Combining LoRA, SVD, and inference-only training is novel and interesting. The setting is also realistic, given that models keep growing larger.
* Strong empirical results across different tasks, such as math, instruction following, and general-purpose assistants.
* Efficiency. The authors demonstrate faster inference and quantization compatibility.

Weaknesses

I don't think the paper has major flaws. I am curious why SVD-GRPO performs much worse. Does that mean SVD and ES are coupled?

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- Gradient-free *ES loop* on a compact search space (singular values only); inference-only and quantization-friendly (INT8/INT4) for large models.
- Scaling results and latency model argue for superior parallel efficiency vs. GRPO; communication volume is seeds+rewards only.
- Empirics: consistent time-to-quality gains over GRPO and competitive final accuracy across several backbones and tasks; ablations on population size, LoRA rank, and fraction of trainable singular values (α).
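The "seeds+rewards only" communication pattern mentioned above can be sketched as follows. This is a generic illustration of seed-based perturbation sharing in evolution strategies, not code from the paper: the helper names (`perturbation`, `worker_evaluate`, `coordinator_step`) and the standardized-reward update are assumptions made for the example.

```python
import numpy as np

def perturbation(seed, dim):
    # Any process can regenerate the same noise vector from the seed alone.
    return np.random.default_rng(seed).standard_normal(dim)

def worker_evaluate(theta, seed, sigma, reward_fn):
    # A worker receives an integer seed and sends back one scalar reward.
    return float(reward_fn(theta + sigma * perturbation(seed, theta.size)))

def coordinator_step(theta, seeds, rewards, sigma, lr):
    # The coordinator rebuilds every perturbation from its seed, so no
    # parameter vectors or gradients need to cross the network.
    # sigma is accepted for symmetry with worker_evaluate; standardizing
    # the rewards makes the step size independent of it in this sketch.
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    eps = np.stack([perturbation(s, theta.size) for s in seeds])
    return theta + lr * (adv @ eps) / len(seeds)
```

Under these assumptions, each candidate costs one integer sent out and one float sent back, regardless of how many parameters are being searched.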

Weaknesses

1. **Initialization clarity**: The method relies on an **SFT stage** to initialize LoRA factors $A, B$, but it is not fully explicit whether a **single** SFT LoRA per (task, backbone) is used or **multiple** LoRA initializations are required across experiments. Greater specificity would aid reproducibility.
2. **Scope of “gradient-free”**: The optimization loop is gradient-free, but the pipeline **depends on SFT**; the paper acknowledges warm-start sensitivity (accuracy improves with more SFT data),

Reviewer 03 · Rating 2 · Confidence 3

Strengths

**Originality**: *The method has no original component, but manages to blend together known ideas in an interesting way.* The core move is casting alignment as inference-only evolutionary search in a *compressed* SVD-LoRA subspace (optimizing only singular values). This is a creative recombination of known ideas (LoRA, CMA-ES/ES) that manages to make post-training faster, as it removes the need to compute gradients, allowing the search to be done in INT8/INT4. This is a practical, underexplored

Weaknesses

1. **Presentation quality.** The manuscript quality is low. It has numerous concrete issues that impede review and reproducibility. In particular, the paragraph “Task and Models” contains major grammar and punctuation mistakes: periods used instead of commas, and fragments like “. GSM8K with accuracy as the primary metric.” where sentences have no verbs. Other problems (among many others):
   1. Figure problems: Figure 2 right-panel lower-left square has no numbers; Figure 2 left-panel

