The Invisible Leash: Why RLVR May or May Not Escape Its Origin
Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi

TL;DR
This paper empirically investigates whether RLVR expands LLM reasoning. It finds that RLVR improves precision but often narrows exploration and may fail to extend the model's reasoning beyond what the base model can already produce.
Contribution
It provides new empirical insights into RLVR's limitations, especially its tendency to restrict exploration and the entropy-reward trade-off affecting solution diversity.
Findings
RLVR improves pass@1 but often reduces empirical support.
RLVR can increase token-level entropy while decreasing answer-level diversity.
RLVR may overlook correct, underrepresented solutions.
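The quantities named in these findings can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: `pass_at_k` uses the widely used unbiased combinatorial pass@k estimator, `answer_entropy` is the Shannon entropy of the empirical distribution over final answers, and `observed_correct_support` is a hypothetical simplification that treats "empirical support" as the set of distinct correct answers seen among the samples. The sample lists are made-up toy data.

```python
from collections import Counter
from math import comb, log


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over final answers."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((m / total) * log(m / total) for m in counts.values())


def observed_correct_support(answers: list[str], correct: set[str]) -> set[str]:
    """Assumed proxy for empirical support: distinct correct answers that were sampled."""
    return {a for a in answers if a in correct}


if __name__ == "__main__":
    # Toy data (hypothetical): final answers sampled for one problem.
    correct = {"12", "3*4"}  # two acceptable forms of the right answer
    base = ["12", "7", "3*4", "15", "9", "7", "12", "5"]
    tuned = ["12", "12", "12", "12", "12", "7", "12", "12"]

    for name, samples in [("base", base), ("tuned", tuned)]:
        c = sum(a in correct for a in samples)
        print(
            name,
            "pass@1 =", round(pass_at_k(len(samples), c, 1), 3),
            "answer entropy =", round(answer_entropy(samples), 3),
            "correct answers seen =", sorted(observed_correct_support(samples, correct)),
        )
```

In this toy setup the tuned samples score higher on pass@1 while covering fewer distinct correct answers, which is the pattern the findings describe. Token-level entropy is not modeled here, since it requires per-token probabilities from the model itself.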
Abstract
Recent advances highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing LLMs' capabilities. However, it remains unclear whether the current practice of RLVR truly expands a model's reasoning boundary or mainly amplifies high-reward outputs that the base model already knows, thereby improving precision. This study presents an empirical investigation that provides fresh insights into the limits of RLVR. We examine how RLVR can operate as a support-constrained optimization mechanism that may restrict the discovery of entirely original solutions, remaining constrained by the base model's initial distribution. We also identify an entropy-reward trade-off: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while…
