SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

Chanuk Lee; Minki Kang; Sung Ju Hwang

arXiv:2605.18864·cs.LG·May 20, 2026


TL;DR

This paper introduces SAGE, a framework that reshapes the anchor distribution in reinforcement learning with verifiable rewards (RLVR) to improve reasoning in large language models. It targets the exploration limits imposed by reverse-KL regularization.

Contribution

SAGE provides a principled method to expand the support of the policy in RLVR by reshaping the anchor distribution with a guide function, improving reasoning performance.

Findings

01. SAGE improves pass@1 and pass@k on reasoning benchmarks.

02. Reshaping the anchor distribution enhances exploration in RLVR.

03. Standard reverse-KL regularization anchors the policy to the reference distribution, which suppresses the emergence of new reasoning modes.
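
The anchoring effect in finding 03 can be illustrated with a toy calculation. A standard result for the objective E_pi[r] - beta * KL(pi || pi_ref) over a finite set is that its maximizer is proportional to pi_ref(y) * exp(r(y) / beta), so any output with zero reference probability stays at zero regardless of its reward. The sketch below is not the paper's code; the three "reasoning modes", their rewards, and the value of beta are hypothetical.

```python
import numpy as np

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) on a finite set."""
    ref_probs = np.asarray(ref_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta); use log space for stability.
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
        logits = np.log(ref_probs) + rewards / beta
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical setup: mode 2 is correct (reward 1) but absent from the reference policy.
ref = [0.7, 0.3, 0.0]
rewards = [0.0, 1.0, 1.0]
pi_star = kl_regularized_optimum(ref, rewards, beta=0.5)
print(pi_star)
```

In this toy setting the optimum shifts mass toward the rewarded mode that the reference already covers, while the equally rewarded mode outside the reference support receives no mass. This is the sense in which the anchor limits which modes can emerge.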

Abstract

Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with…
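
The pass@1 versus pass@k distinction in the abstract can be made concrete with the commonly used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below is a minimal illustration under that assumption; the paper's own evaluation code may differ, and the per-problem correct counts are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts out of n = 10 samples on two problems.
sharpened = [5, 0]  # often right on problem 1, never right on problem 2
diverse = [1, 1]    # rarely right, but covers both problems

for name, counts in [("sharpened", sharpened), ("diverse", diverse)]:
    print(name, mean_pass_at_k(counts, 10, 1), mean_pass_at_k(counts, 10, 10))
```

With these made-up counts, the "sharpened" policy has the higher pass@1 but the lower pass@10, which mirrors the pattern the abstract describes: better sampling efficiency on modes already present without broader coverage.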


Code & Models

Repositories

tally0818/SAGE (GitHub)
