TL;DR
This paper introduces SAGE, a framework that reshapes the anchor distribution in reinforcement learning with verifiable rewards (RLVR) to counter the limited exploration caused by reverse-KL regularization, with the goal of improving reasoning in large language models.
Contribution
SAGE provides a principled method to expand the support of the policy in RLVR by reshaping the anchor distribution with a guide function, improving reasoning performance.
Findings
SAGE improves pass@1 and pass@k on reasoning benchmarks.
Reshaping the anchor distribution enhances exploration in RLVR.
Standard reverse-KL regularization anchors the policy to the reference distribution, which suppresses the emergence of new reasoning modes.
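For readers unfamiliar with the pass@1 and pass@k metrics named in the findings, the sketch below shows one common way to estimate them: draw n samples per problem, count the c correct ones, and compute the probability that at least one of k samples drawn without replacement is correct. This is the widely used unbiased estimator; the summary does not say which estimator the paper uses, so treat the function name and the example numbers as illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper's exact protocol is not
    stated in this summary.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset has a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(10, 3, 1)  # equals the fraction correct, 3/10
p5 = pass_at_k(10, 3, 5)  # chance that 5 draws include a correct sample
```

With k = 1 the estimator reduces to the fraction of correct samples, which is why pass@1 tracks sampling efficiency while larger k probes how many distinct correct solutions the model can reach.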
Abstract
Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with…
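The anchoring effect the abstract attributes to reverse-KL regularization can be seen in a toy calculation. For the textbook objective "maximize expected reward minus beta times KL(policy || reference)", the optimal policy is proportional to the reference probability times exp(reward / beta), so any response with zero reference probability keeps zero probability no matter how large its reward is. The sketch below is not SAGE and is not taken from the paper; the function name, the three "reasoning modes", and all numbers are hypothetical.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form optimum of E_pi[r] - beta * KL(pi || pi_ref).

    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta).
    Illustrative only: a discrete toy distribution, not an LLM policy.
    """
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical setup: three reasoning modes.
# Mode 0: common but wrong; mode 1: rare but correct;
# mode 2: correct, yet absent from the reference distribution.
ref = [0.9, 0.1, 0.0]
rewards = [0.0, 1.0, 1.0]

pi_star = kl_regularized_optimum(ref, rewards, beta=0.5)

# Mode 1 gains mass relative to the reference, but mode 2 stays at
# exactly zero: the optimum cannot leave the reference's support.
```

In this toy case, regularized training can only reweight modes the reference already contains, which matches the abstract's claim that the anchor suppresses alternative reasoning modes.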
