Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Yihe Deng; I-Hung Hsu; Jun Yan; Zifeng Wang; Rujun Han; Gufeng Zhang; Yanfei Chen; Wei Wang; Tomas Pfister; Chen-Yu Lee

arXiv:2510.25992·cs.CL·March 2, 2026

3 Reviews

TL;DR

This paper introduces Supervised Reinforcement Learning (SRL), a training framework for small language models that supervises reasoning step by step: each action the model generates is rewarded by its similarity to the corresponding expert action, which improves multi-step reasoning and problem solving.

Contribution

SRL reformulates problem solving as generating logical actions with step-wise supervision, improving learning in small models over traditional methods like SFT and RLVR.
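As a rough illustration of step-wise supervision, the sketch below turns one expert trajectory into one training instance per step. It assumes the expert solution has already been split into a list of actions; the `StepInstance` type, the `build_step_instances` helper, and the toy problem are hypothetical names chosen for this example, not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class StepInstance:
    """One step-wise training example (hypothetical structure)."""
    problem: str
    previous_actions: list[str]  # expert actions already taken (the context)
    target_action: str           # expert action the model should produce next


def build_step_instances(problem: str, expert_actions: list[str]) -> list[StepInstance]:
    """Create one instance per expert step: context = steps before k, target = step k."""
    return [
        StepInstance(problem, list(expert_actions[:k]), expert_actions[k])
        for k in range(len(expert_actions))
    ]


# Toy data, invented for illustration only.
instances = build_step_instances(
    "Solve 2x + 3 = 11.",
    ["Subtract 3 from both sides: 2x = 8.", "Divide both sides by 2: x = 4."],
)
```

A trajectory with N steps yields N instances, so even a small set of expert solutions can provide many supervised steps.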

Findings

01

SRL enables small models to learn complex reasoning tasks.

02

Training with SRL before RLVR yields superior performance.

03

SRL generalizes well to software engineering tasks.

Abstract

Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper presents a well-motivated and innovative approach to addressing limitations in LLM training for complex reasoning, with clear writing.
2. The introduction of SRL creatively combines elements of imitation learning and RL by decomposing expert trajectories into step-wise actions and rewarding only the actions, allowing the model to develop its own internal monologues. This novel formulation addresses the sparse reward issue in RLVR and the rigid mimicry in SFT, drawing from prior wor

Weaknesses

1. The method lacks sufficient implementation details, raising concerns about reproducibility. For instance, the prompts or heuristics used to decompose expert reasoning traces into individual steps (e.g., parsing numbered sections from DeepSeek R1 outputs) are not provided, making it challenging to replicate the step-wise data construction process.
2. Although SRL rewards only the action sequence to avoid token-by-token memorization, this still imposes constraints on the model's output. For ex

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The problem formulation (e.g., action-based decomposition) is well-motivated and intuitive. Results on math benchmarks show consistent improvements, with ablations justifying key components. The curriculum strategy (SRL→RLVR) and flexibility in internal monologue generation are valuable insights.

Weaknesses

The methodology bears significant resemblance to R³ (arXiv:2402.05808), which also uses a reverse curriculum with intermediate state sampling and outcome supervision for step-wise learning. Though SRL uses action similarity rewards instead of final-answer rewards, the high-level strategy of "starting from expert states and moving backward" is not sufficiently differentiated. Comparisons focus on SFT and RLVR but omit advanced RL methods (e.g., DAPO, Dr. GRPO) and direct comparisons to R³.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The novel SRL framework addresses RLVR's sparse rewards and SFT's rigid mimicry by decomposing expert trajectories into step-wise logical actions with dense sequence-similarity rewards, uniquely combining SFT’s supervision and RL’s strengths and showing superior performance on math reasoning benchmarks.
2. SRL demonstrates robust cross-domain efficacy across different multi-step reasoning tasks, proving its versatility.
3. SRL excels in data efficiency for hard problems, using limited expert

Weaknesses

1. SRL relies on the decomposition of expert trajectories into logical actions without providing details on automation for unstructured trajectories. Therefore, I'm not sure if it is scalable for large-scale or unstructured tasks and whether it introduces subjective bias.
2. The reward only considers action similarity, potentially leading to false positive rewards from flawed reasoning. This impact should be considered.
3. The experiments are conducted solely on small to mid-sized models. It wou
