Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models

Vaskar Nath; Elaine Lau; Anisha Gunjal; Manasi Sharma; Nikhil Baharte; Sean Hendryx

arXiv:2506.13923·cs.LG·June 23, 2025

TL;DR

This paper analyzes how reinforcement learning on verifiable rewards (RLVR) improves reasoning models and introduces adaptive natural-language guidance during training, which raises pass@$k$ rates and generalization across model scales.

Contribution

It presents the Guide algorithm that adaptively incorporates hints during training, improving model generalization and efficiency in reasoning tasks.
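The idea of adding hints adaptively can be illustrated with a minimal sketch. It is not the authors' implementation: the trigger (pass rate at or below a threshold), the hint format, and the names `guided_rollouts`, `sample`, and `verify` are all assumptions made for illustration. The sketch samples unguided rollouts first and only falls back to a hint-augmented prompt when those rollouts fail.

```python
from typing import Callable, List, Tuple

# (response, reward, was_guided)
Rollout = Tuple[str, float, bool]


def guided_rollouts(
    prompt: str,
    hint: str,
    sample: Callable[[str], str],    # hypothetical policy sampler
    verify: Callable[[str], float],  # hypothetical verifier, returns 0.0 or 1.0
    k: int = 8,
    threshold: float = 0.0,          # assumed trigger: guide if pass rate <= threshold
) -> List[Rollout]:
    """Sample k plain rollouts; add k hint-conditioned ones only if they fail."""
    plain = [sample(prompt) for _ in range(k)]
    rollouts = [(r, verify(r), False) for r in plain]
    pass_rate = sum(reward for _, reward, _ in rollouts) / k

    if pass_rate > threshold:
        # The policy already solves this prompt often enough: no guidance.
        return rollouts

    # Assumed hint format: the hint is appended to the prompt in context.
    guided_prompt = f"{prompt}\n\nHint: {hint}"
    guided = [sample(guided_prompt) for _ in range(k)]
    rollouts += [(r, verify(r), True) for r in guided]
    return rollouts
```

The `was_guided` flag marks rollouts drawn under a different prompt than the one being optimized. Such samples are off-policy, and any correction for that (for example importance weighting) is deliberately left out of this sketch.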

Findings

01

RLVR drives performance by compressing pass@$k$ into pass@1 and by capability gain.

02

Self-distillation is key to learning new problems.

03

Guide improves pass@$k$ rates and generalization, with up to 4% gains.

Abstract

We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@1 and (2) via "capability gain" in which models learn to solve new problems that they previously could not solve even at high $k$. We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation. We demonstrate these findings across model scales ranging from 0.5B to 72B parameters on >500,000 reasoning problems with prompts and verifiable final answers across math, science, and code domains. We further show that we can significantly improve pass@$k$ rates by leveraging natural language guidance for the model to consider within context while still requiring the model to…
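To make the pass@$k$ terminology concrete, here is a small sketch using the standard unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The paper's exact evaluation protocol is not shown on this page, so treat the estimator choice as an assumption. The helper `classify_gain` and its labels are hypothetical, meant only to illustrate the distinction the abstract draws between compressing pass@$k$ into pass@1 and capability gain.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def classify_gain(base_correct: int, rl_correct: int) -> str:
    """Label one problem from correct counts before and after RL.

    Hypothetical labels, assuming the same number of samples in both runs.
    """
    if rl_correct == 0:
        return "unsolved"
    if base_correct == 0:
        # Never solved by the base model in any sample, solved after RL.
        return "capability_gain"
    if rl_correct > base_correct:
        # Already solvable at high k; RL makes success more likely per sample.
        return "self_distillation"
    return "no_gain"
```

With a finite sample budget, a problem labeled `capability_gain` may simply have been undersampled for the base model, so the label is only as reliable as the number of samples allows.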

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The idea discussed in the paper is interesting, exploring the specific reasons behind RLVR's performance improvement.
- Analyzing the source of capabilities by comparing Pass@k and Pass@1 is highly reasonable.

Weaknesses

- The overall writing and narration of the paper are not clear. Based on my understanding, Section 2.1 should be an analysis of RLVR, which then transitions to the motivation for proposing Guide. However, the details and conclusions of Section 2.1 are placed in Section 3.1. Moreover, the Introduction Section only introduces the Guide in the third contribution of the last paragraph. I am not clear about the transition from the analysis of RLVR to the proposal of Guide. Sections 2.1 and 3.1 would

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Clear decomposition of RLVR gains into **self-distillation** (compressing pass@k→pass@1) vs **capability gain**, giving a precise lens to analyze “why RL works” for reasoning models. Strong, multi-scale study (0.5B→72B; >500k problems across math/science/code) and careful pass@k protocol; shows self-distillation dominates while capability gain still exists.
2. Solid ablations (guidance thresholds) and training-stability analysis (importance weighting, PPO-clip) support the design choices.

Weaknesses

1. The authors seem to have only validated the effectiveness of GUIDE in the mathematical domain, but it would be desirable to verify its effectiveness in other domains such as science, code, etc. I saw that the authors seem to have conducted tests on HumanEval at lines 234-236, but I did not see the corresponding results.
2. The authors mention using GPT-4o to produce hints, but it is unclear whether this might produce hallucinations or factual errors. It would be preferable to have human veri

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The proposed Guide algorithm is well-motivated and theoretically grounded.
- The paper includes useful insights into how guidance affects entropy, exploration, and convergence stability.

Weaknesses

- Questionable interpretation of “capability gain.” As *Limits-of-RLVR* [1] argues, capability gain almost vanishes when k increases; apparent gains at small k may simply reflect undersampling of the model’s inherent capability. The paper acknowledges this but still interprets small absolute gains as substantive learning. A few hundred samples can hardly be called “ineffective” since it is way smaller than the actual language space.
- Limited novelty relative to prior off-policy or guided RL m

Reviewer 04 · Rating 6 · Confidence 4

Strengths

1. The paper evaluates the Guide algorithm not only through experiments but also provides a theoretical explanation for its effectiveness, which is commendable.
2. Some experiments in the paper are repeated multiple times to ensure the accuracy and reproducibility of the results.
3. The paper also compares multiple baselines, offering a comprehensive empirical evaluation.

Weaknesses

1. The proposed Guide method appears somewhat trivial and lacks strong novelty. Essentially, the approach introduces hints generated by a stronger model (e.g., GPT-4o) to help the policy model solve difficult problems. This idea is conceptually similar to few-shot prompting or knowledge injection, where additional external information is provided to improve sample efficiency rather than fundamentally changing the learning paradigm.
2. In essence, Guide can be viewed as knowledge injection from

Reviewer 05 · Rating 2 · Confidence 3

Strengths

The paper offers a potentially new insight: the observed performance gains from reinforcement learning come largely from reduced variance in solving the set of already solvable problems, rather than from expanding that set. However, the validity and value of this insight are very hard to estimate owing to a number of confounding factors, listed below.

Weaknesses

The paper is poorly written, with undefined quantities and excessive notation. Even the abstract seems to contradict itself, e.g. lines 15-17: "We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation." By the paper's own definition, "self-distillation" does not solve new problems. As another example, the quantities $y_i, \hat{y}_i$ that appear repeatedly in the paper starting with Equation (1) are undefined.

Code & Models

Datasets

vaskarnath/guide_math_rl_dataset
dataset· 23 dl

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Proximal Policy Optimization