RL for Reasoning by Adaptively Revealing Rationales

Mohammad Hossein Amani; Aryo Lotfi; Nicolas Mario Baldwin; Samy Bengio; Mehrdad Farajtabar; Emmanuel Abbe; Robert West

arXiv:2506.18110·cs.LG·March 3, 2026


3 Reviews

TL;DR

This paper introduces AdaBack, a per-sample curriculum learning method that reveals a partial prefix of the target output to improve performance on sequence reasoning tasks, bridging the gap between supervised training and reinforcement learning.

Contribution

It proposes a novel adaptive backtracking algorithm that dynamically adjusts supervision length per sample, enabling efficient learning of complex reasoning chains.

Findings

01 · AdaBack reliably solves intractable synthetic problems with latent dependencies.

02 · It enables models to acquire reasoning skills on benchmarks RL alone cannot solve.

03 · It demonstrates effectiveness on mathematical reasoning datasets.

Abstract

Learning in the combinatorially large output space of sequence generation problems is challenging as providing expert demonstrations scales poorly with sequence length, and RL struggles with sparse rewards. Between dense demonstrations in supervised training and no demonstrations in reinforcement learning lies an underexplored regime: partial supervision. We ask whether some classes of sequence learning problems become efficiently learnable by exploiting this gap. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals a partial prefix of the target output. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and…
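
The per-sample mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `PrefixCurriculum`, the fixed `step`, the success `threshold`, and the exact update rule are all hypothetical choices made for illustration. The only details taken from this page are that a prefix of the target is revealed, that its length is adapted per sample from past reward, and (per Reviewer 01) that the per-problem parameter $\rho$ is initialized from a global EMA with hyperparameter $\alpha$.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixCurriculum:
    """Hypothetical per-sample curriculum over the revealed fraction rho in [0, 1]."""

    step: float = 0.25        # assumed fixed adjustment size (not from the paper)
    threshold: float = 0.5    # assumed reward level that counts as a success
    alpha: float = 0.9        # assumed EMA decay for the global default
    global_rho: float = 1.0   # start by revealing the full target
    rho: dict = field(default_factory=dict)  # sample id -> revealed fraction

    def get_rho(self, sample_id):
        # Samples without a stored value fall back to the global running average.
        return self.rho.get(sample_id, self.global_rho)

    def build_prompt(self, sample_id, question, target_tokens):
        # Reveal the first rho fraction of the target; the model generates the rest.
        k = round(self.get_rho(sample_id) * len(target_tokens))
        return question + target_tokens[:k], target_tokens[k:]

    def update(self, sample_id, reward):
        # Success: reveal less next time. Failure: reveal more.
        rho = self.get_rho(sample_id)
        if reward >= self.threshold:
            rho = max(0.0, rho - self.step)
        else:
            rho = min(1.0, rho + self.step)
        self.rho[sample_id] = rho
        # Track a global EMA used to initialize unseen samples.
        self.global_rho = self.alpha * self.global_rho + (1 - self.alpha) * rho
        return rho


curriculum = PrefixCurriculum()
question = ["Q:"]
target = ["s1", "s2", "s3", "s4"]

prompt_full, rest_full = curriculum.build_prompt("ex0", question, target)
rho_after_success = curriculum.update("ex0", reward=1.0)
prompt_part, rest_part = curriculum.build_prompt("ex0", question, target)
rho_after_failure = curriculum.update("ex0", reward=0.0)
```

In this toy run the sample starts fully supervised, so the model has nothing left to generate; after one rewarded rollout the revealed fraction drops and the model must complete the final step itself. A failed rollout moves the fraction back up. The sketch only shows the bookkeeping; how rewards are computed and how the policy is trained are outside its scope.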

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The proposed method improves model performance on mathematical datasets. In particular, it shows greater benefits on the Base-7 and Tensor-2 GSM8k datasets, which are variants of GSM8k created by symbolic manipulations that effectively reduce the density of reward signals during training. These results align with the goal of addressing the sparse reward problem.
- The paper includes and analyzes negative results (e.g., saturation on MATH for Llama‑3.2‑3B‑Instruct and Qwen‑2.5), which increases the

Weaknesses

- The paper does not specify how many random seeds were used to compute test accuracy. Given the high randomness in LM sampling, this omission raises questions about statistical robustness, especially for Table 1.
- The method learns a per-problem parameter $\rho$ and initializes it using a global EMA with hyperparameter $\alpha$, but the choice and sensitivity of $\alpha$ are not discussed.
- Although the approach is inspired by R3, no baseline comparison with R3 is included on the mathematica

Reviewer 02 · Rating 4 · Confidence 4

Strengths

Evaluations span synthetic tasks and real-world reasoning benchmarks, demonstrating broad applicability. The method is well-motivated and clearly explained, with intuitive visualizations of the training process.

Weaknesses

The approach closely mirrors R3’s reverse curriculum framework, which also uses partial demonstrations to create a curriculum. The distinction—adaptive per-sample scheduling versus fixed stages—is valuable but not thoroughly differentiated in terms of impact. While outperforming vanilla RL, comparisons to R3 or other curriculum-based RL methods are absent, leaving the reader uncertain about the method’s advantages over existing alternatives.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

Clear conceptual contribution:
- The paper identifies a meaningful regime between SFT and RL and proposes a principled mechanism (adaptive partial supervision) rather than a heuristic curriculum schedule. This framing is conceptually clean and well-motivated.

Per-sample adaptive curriculum is novel and well-justified:
- Unlike prior curriculum or backtracking approaches that rely on global schedules or coarse segmentation heuristics (e.g., R3), AdaBack adapts per-instance based on reward signal

Weaknesses

- Limited scale: the models explored in this work are limited to 3B parameters.
- The DeepSeek-R1 report asserts that smaller models have less capacity to benefit from RL in discovering novel reasoning patterns and are better suited to SFT-based distillation from stronger models (which somewhat addresses the stated drawbacks of SFT, if a teacher model is available to produce substantial distillation data).
- This point is somewhat understandable if it's resource constrained, but the sc
