Binary Rewards and Reinforcement Learning: Fundamental Challenges

Marc Dymetman

arXiv:2605.02375·cs.LG·May 5, 2026

TL;DR

This paper analyzes why binary rewards in reinforcement learning for language models lead to diversity collapse, and proposes alternative divergences as a remedy.

Contribution

The paper gives a structural account of the degeneracy caused by binary rewards and suggests alternative divergences that preserve coverage.

Findings

01 Binary rewards create an infinite set of optimal distributions.

02 KL-control selects the filtered model, i.e. the base model conditioned on validity.

03 Alternative divergences can avoid coverage collapse.
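The first finding can be illustrated with a minimal sketch on a toy outcome space. The four outputs, their reward values, and the three distributions below are hypothetical and chosen only for illustration; they are not taken from the paper. Any distribution that puts all its mass on valid outputs reaches the maximum expected reward, so the reward alone cannot distinguish a collapsed policy from a diverse one.

```python
# Toy outcome space (hypothetical): four outputs, two of which pass the verifier.
reward = {"a": 1, "b": 1, "c": 0, "d": 0}  # binary reward per output


def expected_reward(p):
    """Expected binary reward of a distribution p over the outcome space."""
    return sum(p[y] * reward[y] for y in reward)


# Three different distributions, each supported only on valid outputs.
peaked = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}         # fully collapsed
uniform_valid = {"a": 0.5, "b": 0.5, "c": 0.0, "d": 0.0}  # spreads mass evenly
skewed = {"a": 0.1, "b": 0.9, "c": 0.0, "d": 0.0}         # anything in between

for name, p in [("peaked", peaked), ("uniform_valid", uniform_valid), ("skewed", skewed)]:
    # All three attain the maximum expected reward, despite very different coverage.
    print(name, expected_reward(p))
```

Since every mixture of these distributions is also fully valid, the set of maximizers is infinite as soon as at least two outputs are valid.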

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards. Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit $\beta \to 0$, the filtered model $p_{*} := a(\cdot \mid Y_{1})$ (the base model conditioned on validity), which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the…
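The $\beta \to 0$ limit described in the abstract can be checked numerically on a small example. The sketch below assumes the standard closed form of the KL-regularized optimum, $p_\beta(y) \propto a(y)\exp(r(y)/\beta)$, which the excerpt above does not state explicitly; the base distribution `base`, the reward table, and the function names are hypothetical.

```python
import math

# Hypothetical base model a(y) over four outputs, and a binary validity reward.
base = {"a": 0.1, "b": 0.3, "c": 0.4, "d": 0.2}
reward = {"a": 1, "b": 1, "c": 0, "d": 0}


def kl_control_optimum(base, reward, beta):
    """Assumed closed form: p_beta(y) proportional to a(y) * exp(r(y) / beta)."""
    m = max(reward.values())  # subtract the max reward for numerical stability
    weights = {y: base[y] * math.exp((reward[y] - m) / beta) for y in base}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


def filtered_model(base, reward):
    """Base model conditioned on validity: a(y | valid)."""
    z = sum(base[y] for y in base if reward[y] == 1)
    return {y: (base[y] / z if reward[y] == 1 else 0.0) for y in base}


for beta in (1.0, 0.1, 0.01):
    p = kl_control_optimum(base, reward, beta)
    print(beta, {y: round(v, 4) for y, v in p.items()})

print("filtered", filtered_model(base, reward))
```

Under this assumed form, the ratio between the two valid outputs stays equal to its value under the base model for every `beta`; only the mass on invalid outputs shrinks as `beta` decreases, which is why the limit is the base model conditioned on validity rather than a peaked distribution.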
