Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play

Ziyu Ye; Rishabh Agarwal; Tianqi Liu; Rishabh Joshi; Sarmishta Velury; Quoc V. Le; Qijun Tan; Yuan Liu

arXiv:2411.00062·cs.CL·April 11, 2025

Open Access

TL;DR

This paper introduces Evolving Alignment via Asymmetric Self-Play (eva), a novel RL post-training method that enables language models to adaptively generate prompts, significantly improving performance on benchmarks without extra human prompts.

Contribution

eva is the first approach allowing language models to create training prompts adaptively in both offline and online RL post-training, surpassing existing methods in effectiveness.

Findings

01. Sets a new state of the art on challenging benchmarks.

02. Boosts the win-rate of Gemma-2-9B-IT significantly.

03. Robustly creates effective RL curricula across experiments.

Abstract

Current reinforcement learning (RL) frameworks for large language model (LLM) post-training typically assume a fixed prompt distribution, which is sub-optimal and bottlenecks scalability. Prior works have explored prompt evolving, but they are often limited to the supervised fine-tuning stage, and prompts are sampled and evolved uniformly without signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with regret-based signals for two players: (i) a creator, who strategically samples and creates new informative prompts, and (ii) a solver, who learns to produce preferred responses. eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. The design is simple and easy to use, yet remarkably effective: eva sets a new SOTA on…
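As a rough illustration of the creator's role described in the abstract, the following Python sketch scores each prompt with a simple regret-style proxy and samples prompts in proportion to that score before rewriting them. Everything here is an assumption made for illustration: the proxy (best reward minus mean reward over sampled responses), the function names `regret_proxy`, `select_prompts`, and `creator_step`, and the `evolve` callback are hypothetical and are not taken from the paper.

```python
import random


def regret_proxy(rewards):
    """Hypothetical informativeness score: best reward minus mean reward.

    This is an assumed stand-in for a regret-based signal, not the paper's
    exact estimator.
    """
    return max(rewards) - sum(rewards) / len(rewards)


def select_prompts(prompt_rewards, k, rng):
    """Sample k prompts (with replacement), weighted by their regret proxy."""
    prompts = list(prompt_rewards)
    weights = [regret_proxy(prompt_rewards[p]) for p in prompts]
    if sum(weights) == 0:
        # No prompt looks informative: fall back to uniform sampling.
        return rng.choices(prompts, k=k)
    return rng.choices(prompts, weights=weights, k=k)


def creator_step(prompt_rewards, evolve, k, seed=0):
    """One creator move: pick informative prompts, then evolve each one.

    `evolve` is a hypothetical callback (for example, an LLM rewrite call)
    that maps an existing prompt to a new variant.
    """
    rng = random.Random(seed)
    chosen = select_prompts(prompt_rewards, k, rng)
    return [evolve(p) for p in chosen]


if __name__ == "__main__":
    # Rewards of several sampled solver responses per prompt (made-up values).
    rewards = {"easy": [1.0, 1.0, 1.0], "hard": [0.0, 1.0, 0.2]}
    print(creator_step(rewards, lambda p: p + " (variant)", k=3))
```

In this toy setup a prompt whose sampled responses all receive the same reward gets a score of zero and is never selected while any other prompt has a positive score, which mirrors the idea of focusing the solver's training on prompts it has not yet mastered.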

Taxonomy

Topics: Evolutionary Algorithms and Applications · Modular Robots and Swarm Intelligence · Reinforcement Learning in Robotics

Methods: Direct Preference Optimization