Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Jing Zhao; Ting Zhen; Junwei Bao; Hongfei Jiang; Yang Song

arXiv:2602.13575·cs.CL·March 3, 2026

Open Access · 3 Reviews

TL;DR

Elo-Evolve introduces a co-evolutionary framework for LLM alignment that uses dynamic pairwise competitions and adaptive opponent selection, improving sample efficiency and reducing noise compared to traditional static reward methods.

Contribution

The paper presents a novel co-evolutionary approach for LLM alignment that eliminates reliance on static reward functions and Bradley-Terry models, enabling more efficient and robust training.

Findings

01

Achieved 4.5x noise reduction over absolute scoring methods.

02

Demonstrated superior performance of Elo-Evolve over point-based and static pairwise methods.

03

Validated the approach on multiple benchmarks showing consistent improvements.

Abstract

Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity, and empirically validate a 4.5x noise reduction compared to absolute scoring approaches.…
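For readers unfamiliar with Elo ratings, the following is a minimal sketch of the textbook Elo update driven by a single binary win/loss outcome. It is not taken from the paper: the function names, the K-factor of 32, and the 400-point scale are conventional defaults assumed here for illustration, and the abstract does not say which constants or update schedule Elo-Evolve actually uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of A against B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings from one binary outcome.

    k=32 is a common default, assumed here; it is not a value from the paper.
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0      # binary win/loss signal, no scalar reward
    delta = k * (s_a - e_a)          # larger update for more surprising results
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated players: the winner gains exactly k/2 points.
    print(update_elo(1500.0, 1500.0, a_won=True))
```

The update is zero-sum: whatever the winner gains, the loser gives up, so only relative strength is tracked.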

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The dynamic opponent selection is interesting. The temperature parameter offers a clean way to balance focus and diversity: small T for close-strength opponents, large T for variety. This forms an automatic curriculum where training difficulty grows with model ability.
2. Each prompt selects its own opponent, leading to smoother and more stable training.
3. Replacing scalar rewards with binary win/loss is well-motivated; both the PAC-theoretic analysis and experiments support its efficiency
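The temperature behaviour described in the first strength can be illustrated with a small sketch. The sampling rule below, a softmax over negative absolute rating gaps, is an assumption chosen to reproduce the described effect (small T concentrates on close-strength opponents, large T approaches uniform sampling); the paper's exact formula is not given on this page. The pool names and ratings are hypothetical.

```python
import math
import random


def opponent_probs(policy_elo, opponent_elos, temperature):
    """Softmax over negative rating gaps: closer-rated opponents get more mass.

    Assumed form p(o) proportional to exp(-|R_policy - R_o| / T); not confirmed by the paper.
    """
    logits = [-abs(policy_elo - r) / temperature for r in opponent_elos.values()]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return {name: w / total for name, w in zip(opponent_elos, weights)}


def sample_opponent(policy_elo, opponent_elos, temperature, rng=random):
    """Draw one opponent for a single prompt according to opponent_probs."""
    probs = opponent_probs(policy_elo, opponent_elos, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


if __name__ == "__main__":
    pool = {"weak": 1300.0, "near": 1510.0, "strong": 1800.0}  # hypothetical ratings
    print(opponent_probs(1500.0, pool, temperature=10.0))   # mass concentrates on "near"
    print(opponent_probs(1500.0, pool, temperature=1e6))    # close to uniform
```

Because the draw is made per prompt, a batch can mix opponents rather than switching the whole batch at once, which matches the second strength listed above.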

Weaknesses

1. The framework introduces several components, which makes the system design a little complex. It would be helpful to include a simple baseline, where the model is trained sequentially against Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B as progressively stronger opponents. The prompts can be divided into three groups either randomly or based on their difficulty, for example using a reward model to estimate complexity. Such a baseline would help clarify how much the dynamic Elo scheduling improves ov
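The static baseline the reviewer proposes can be sketched as a fixed schedule. The code below is only an illustration of that suggestion: the function name and the `difficulty` mapping (for example, scores from a reward model) are hypothetical, and the opponent order simply follows the order in which the reviewer lists the models.

```python
import random

# Opponent order as listed by the reviewer.
OPPONENTS = ["Qwen2.5-14B", "Qwen2.5-32B", "Qwen3-8B"]


def static_curriculum(prompts, difficulty=None, seed=0):
    """Split prompts into three groups and pair each group with one fixed opponent.

    difficulty: optional dict mapping prompt -> score (hypothetical, e.g. from a
    reward model). When None, the split is random.
    """
    ordered = list(prompts)
    if difficulty is None:
        random.Random(seed).shuffle(ordered)          # random grouping
    else:
        ordered.sort(key=lambda p: difficulty[p])     # easiest prompts first
    n = len(ordered)
    bounds = [0, n // 3, 2 * n // 3, n]
    return [(OPPONENTS[i], ordered[bounds[i]:bounds[i + 1]]) for i in range(3)]


if __name__ == "__main__":
    prompts = [f"p{i}" for i in range(9)]
    scores = {p: i for i, p in enumerate(prompts)}
    for opponent, group in static_curriculum(prompts, scores):
        print(opponent, group)
```

Unlike Elo-based scheduling, this schedule never reacts to how the policy is actually performing, which is what makes it a useful control.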

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The idea of using Elo ratings for opponent matching is reasonable and clearly presented.
- The performance gain seems consistent.
- Clever length-bias mitigation is used.

Weaknesses

1. The paper lacks comparison with and discussion of self-play alignment methods. In recent years, significant attention has been devoted to alignment algorithms based on self-play.
   - Such methods do not rely on the Bradley–Terry model and often leverage game-theoretic ideas. Ideally, the current manuscript could be much stronger by providing a discussion of self-play methods (e.g., how Elo-Evolve could outperform self-play methods) and including empirical comparisons.
2. Compared to self-

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The adaptive curriculum learning approach used in Elo-Evolve is interesting: a different reference opponent model is used at each stage of training, allowing the model to progressively improve against stronger opponents.

Weaknesses

1. The claim in Lines 40-45 seems unsubstantiated or under-substantiated. Claim 1 is not supported by any literature, and there is evidence, such as HelpSteer2-preference [1], showing that only 10 thousand samples is enough for training high-quality reward models. For claim 2, it’s not clear what sub-optimal sample complexity is, and claim 3 is supported by one paper from 2020, even though the post-training field has evolved substantially since then.
2. The results in Table 1 don’t seem to be very strong. For ins

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education