Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL
Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, and Xin Eric Wang

TL;DR
This paper investigates the stability of self-play reinforcement learning for language models and finds that data gating is crucial for stability, while the choice of reward matters much less once a strict gate is in place.
Contribution
It identifies the asymmetric roles of data gating and reward signals in self-play stability and introduces the Grounded Proposer Paradox, which highlights counter-intuitive dynamics.
Findings
A strict data gate ensures stability across reward variants.
Reward variants alone cannot guarantee stability without gating.
Replacing the gate with a continuous parameter reveals a phase transition in training dynamics.
Abstract
Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access…
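To make the "data-level gate" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the task format, the `passes_check` validity test, and the `strictness` parameter are all hypothetical names introduced here. The sketch assumes a task is a (program, input) pair and that a task counts as valid when the program runs without error and returns the same output twice. With `strictness=1.0` the gate is strict; lower values let some invalid tasks into the pool, loosely mirroring the idea of replacing the gate with a continuous parameter.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical task format: a program plus the input it is run on.
Task = Tuple[Callable[[int], int], int]


def passes_check(task: Task) -> bool:
    """Assumed validity check: the program runs without raising and
    returns the same output on two executions."""
    program, x = task
    try:
        return program(x) == program(x)
    except Exception:
        return False


def gate(tasks: List[Task], strictness: float, rng: random.Random) -> List[Task]:
    """Return the tasks admitted to the training pool.

    Valid tasks are always admitted. An invalid task slips through with
    probability (1 - strictness), so strictness=1.0 is a strict gate and
    strictness=0.0 admits everything. `strictness` is an illustrative
    parameter, not one taken from the paper.
    """
    admitted = []
    for task in tasks:
        if passes_check(task) or rng.random() >= strictness:
            admitted.append(task)
    return admitted
```

In this sketch the reward signal would only ever be computed on what `gate` returns, which is the sense in which the gate acts at the data level, before any policy update happens.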
