Binary Rewards and Reinforcement Learning: Fundamental Challenges
Marc Dymetman

TL;DR
This paper analyzes why binary rewards in reinforcement learning for language models lead to diversity collapse, and considers alternative divergences as a way to avoid it.
Contribution
It provides a structural account of the degeneracy caused by binary rewards and suggests alternative divergences to improve coverage.
Findings
With binary rewards, the set of distributions maximizing expected reward is infinite, with no distinguished element.
KL-control selects the filtered model, i.e. the base model conditioned on validity.
Alternative divergences can avoid coverage collapse.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards. Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit , the filtered model -- the base model conditioned on validity -- which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the…
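The degeneracy and the KL-control selection described in the abstract can be illustrated on a toy discrete example. The sketch below is not taken from the paper: the four-outcome base distribution, the validity mask, and all function names are hypothetical. It also assumes the common KL-control objective E_pi[r] - beta * KL(pi || base), with the limit read as beta going to 0; the paper's own parameterization may differ.

```python
import math

def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(policy, valid):
    """Expected binary reward: total probability mass on valid outputs."""
    return sum(p for p, v in zip(policy, valid) if v)

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def filtered_model(base, valid):
    """Base model conditioned on validity."""
    return normalize([b if v else 0.0 for b, v in zip(base, valid)])

def kl_control_optimum(base, valid, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || base) (assumed objective).

    The closed form is proportional to base * exp(r / beta); it is written
    here as base * exp((r - 1) / beta) to avoid overflow for small beta.
    """
    return normalize(
        [b * math.exp(((1.0 if v else 0.0) - 1.0) / beta)
         for b, v in zip(base, valid)]
    )

# Hypothetical toy setup: four possible outputs, the second one is invalid.
base = [0.4, 0.3, 0.2, 0.1]
valid = [True, False, True, True]

# Degeneracy: very different policies all reach the maximal expected reward.
collapsed = [1.0, 0.0, 0.0, 0.0]   # all mass on one valid output
skewed = [0.0, 0.0, 0.5, 0.5]      # ignores the most likely valid output
filtered = filtered_model(base, valid)

for name, policy in [("collapsed", collapsed), ("skewed", skewed), ("filtered", filtered)]:
    print(name, expected_reward(policy, valid), kl(policy, base))

# KL-control: as beta shrinks, the optimum approaches the filtered model.
for beta in (1.0, 0.1, 0.01):
    print(beta, kl_control_optimum(base, valid, beta))
```

In this toy setting all three policies are fully valid, so expected reward alone cannot distinguish them, while the KL term singles out the filtered model as the closest to the base distribution.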
