TL;DR
This paper introduces Supervised Reinforcement Learning (SRL), a training framework for small language models that combines step-wise supervision from expert demonstrations with reinforcement learning rewards, improving multi-step reasoning where SFT or RLVR alone struggle.
Contribution
SRL reformulates problem solving as generating logical actions with step-wise supervision, improving learning in small models over traditional methods like SFT and RLVR.
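As an illustration of the step-wise framing, the sketch below turns one expert solution into a series of "predict the next action" examples. It is not the authors' pipeline: the numbered-line step format, the `StepExample` type, and the `split_steps` and `build_examples` helpers are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class StepExample:
    context: str        # problem statement plus the expert steps taken so far
    target_action: str  # the next expert step the model should produce


def split_steps(solution: str) -> list[str]:
    # Assumption: each expert step starts on its own line as "1. ", "2. ", ...
    parts = re.split(r"(?m)^\s*\d+\.\s+", solution)
    return [p.strip() for p in parts if p.strip()]


def build_examples(problem: str, solution: str) -> list[StepExample]:
    # One training example per expert step: the context holds every earlier
    # step, and the target is the step that comes next.
    steps = split_steps(solution)
    examples = []
    for k, step in enumerate(steps):
        prefix = "\n".join(steps[:k])
        context = problem if not prefix else f"{problem}\n{prefix}"
        examples.append(StepExample(context=context, target_action=step))
    return examples


if __name__ == "__main__":
    demo = "1. Let x = 2.\n2. Then x + 3 = 5.\n3. Answer: 5"
    for ex in build_examples("Compute x + 3 for x = 2.", demo):
        print(repr(ex.context), "->", repr(ex.target_action))
```

A solution with N steps yields N examples, so even a single long demonstration provides several supervised targets of increasing context length.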
Findings
SRL enables small models to learn complex reasoning tasks on which SFT and RLVR struggle.
Training with SRL before RLVR yields superior performance.
SRL generalizes well to software engineering tasks.
Abstract
Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging…
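The step-wise similarity reward described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `<think>...</think>` output format, the `-1.0` penalty for malformed output, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all choices made for this example.

```python
from difflib import SequenceMatcher
from typing import Optional


def extract_action(output: str, close_tag: str = "</think>") -> Optional[str]:
    # Assumed format: the internal monologue ends with close_tag and the
    # action follows it. Returns None when the tag is missing.
    _, sep, tail = output.partition(close_tag)
    return tail.strip() if sep else None


def action_similarity(model_action: str, expert_action: str) -> float:
    # Character-level similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, model_action, expert_action).ratio()


def step_reward(output: str, expert_action: str) -> float:
    # Only the action is scored; the monologue is left unconstrained.
    action = extract_action(output)
    if action is None:
        return -1.0  # assumed penalty for output that breaks the format
    return action_similarity(action, expert_action)


if __name__ == "__main__":
    expert = "x = 2"
    print(step_reward("<think>isolate x</think> x = 2", expert))
    print(step_reward("<think>guessing</think> x = 3", expert))
    print(step_reward("x = 2", expert))
```

Because the score is graded rather than binary, a rollout whose action is close to the expert's still receives a non-zero signal, which is the property the abstract highlights for cases where every rollout is incorrect.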
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper presents a well-motivated and innovative approach to addressing limitations in LLM training for complex reasoning, with clear writing. 2. The introduction of SRL creatively combines elements of imitation learning and RL by decomposing expert trajectories into step-wise actions and rewarding only the actions, allowing the model to develop its own internal monologues. This novel formulation addresses the sparse reward issue in RLVR and the rigid mimicry in SFT, drawing from prior wor
1. The method lacks sufficient implementation details, raising concerns about reproducibility. For instance, the prompts or heuristics used to decompose expert reasoning traces into individual steps (e.g., parsing numbered sections from DeepSeek R1 outputs) are not provided, making it challenging to replicate the step-wise data construction process. 2. Although SRL rewards only the action sequence to avoid token-by-token memorization, this still imposes constraints on the model's output. For ex
The problem formulation (e.g., action-based decomposition) is well-motivated and intuitive. Results on math benchmarks show consistent improvements, with ablations justifying key components. The curriculum strategy (SRL→RLVR) and flexibility in internal monologue generation are valuable insights.
The methodology bears significant resemblance to R³ (arXiv:2402.05808), which also uses a reverse curriculum with intermediate state sampling and outcome supervision for step-wise learning. Though SRL uses action similarity rewards instead of final-answer rewards, the high-level strategy of "starting from expert states and moving backward" is not sufficiently differentiated. Comparisons focus on SFT and RLVR but omit advanced RL methods (e.g., DAPO, Dr. GRPO) and direct comparisons to R³.
1. The novel SRL framework addresses RLVR's sparse rewards and SFT's rigid mimicry by decomposing expert trajectories into step-wise logical actions with dense sequence-similarity rewards, uniquely combining SFT’s supervision and RL’s strengths and showing superior performance on math reasoning benchmarks. 2. SRL demonstrates robust cross-domain efficacy across different multi-step reasoning tasks, proving its versatility. 3. SRL excels in data efficiency for hard problems, using limited expert
1. SRL relies on the decomposition of expert trajectories into logical actions without providing details on automation for unstructured trajectories. Therefore, I'm not sure if it is scalable for large-scale or unstructured tasks and whether it introduces subjective bias. 2. The reward only considers action similarity, potentially leading to false positive rewards from flawed reasoning. This impact should be considered. 3. The experiments are conducted solely on small to mid-sized models. It wou
