Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play
Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury, Quoc V. Le, Qijun Tan, Yuan Liu

TL;DR
This paper introduces Evolving Alignment via Asymmetric Self-Play (eva), a novel RL post-training method that enables language models to adaptively generate prompts, significantly improving performance on benchmarks without extra human prompts.
Contribution
eva is the first approach allowing language models to create training prompts adaptively in both offline and online RL post-training, surpassing existing methods in effectiveness.
Findings
Sets new state-of-the-art on challenging benchmarks.
Boosts win-rate of Gemma-2-9B-IT significantly.
Robustly creates effective RL curricula across experiments.
Abstract
Current reinforcement learning (RL) frameworks for large language model (LLM) post-training typically assume a fixed prompt distribution, which is sub-optimal and bottlenecks scalability. Prior works have explored prompt evolution, but are often limited to the supervised fine-tuning stage, and prompts are sampled and evolved uniformly without signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with regret-based signals for two players: (i) a creator, who strategically samples and creates new informative prompts, and (ii) a solver, who learns to produce preferred responses. eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. The design is simple and easy to use, yet remarkably effective: eva sets a new SOTA on…
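The creator's role described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define the regret-based signal, so the proxy below (the gap between the best and worst reward among sampled responses) is an assumption, and `regret_proxy`, `creator_step`, `sample_rewards`, and `evolve` are hypothetical names. The solver update (for example, a preference-optimization step) is omitted.

```python
import random


def regret_proxy(rewards):
    # Assumed informativeness signal (not taken from the abstract):
    # the gap between the best and worst sampled response reward.
    return max(rewards) - min(rewards)


def creator_step(prompts, sample_rewards, evolve, n_select, rng):
    """One hypothetical creator step: score, sample, evolve.

    prompts        -- list of prompt strings
    sample_rewards -- callable: prompt -> list of rewards for sampled responses
    evolve         -- callable: prompt -> new prompt variant
    n_select       -- number of prompts to draw (with replacement)
    rng            -- random.Random instance for reproducibility
    """
    scores = [regret_proxy(sample_rewards(p)) for p in prompts]
    # Fall back to uniform sampling when no prompt carries any signal,
    # since random.choices rejects weights that sum to zero.
    weights = scores if sum(scores) > 0 else None
    chosen = rng.choices(prompts, weights=weights, k=n_select)
    return [evolve(p) for p in chosen]
```

In this sketch, prompts whose sampled responses all receive the same reward get zero weight, so the creator spends its budget on prompts where the solver's responses still differ in quality.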
Taxonomy
Topics: Evolutionary Algorithms and Applications · Modular Robots and Swarm Intelligence · Reinforcement Learning in Robotics
Methods: Direct Preference Optimization
