SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao

TL;DR
SRFT is a novel single-stage training method that combines supervised fine-tuning and reinforcement learning for improved reasoning in large language models, leveraging entropy-based mechanisms for better integration.
Contribution
The paper introduces SRFT, a unified single-stage approach that effectively integrates supervised and reinforcement fine-tuning for reasoning tasks in LLMs.
Findings
SRFT achieves 59.1% average accuracy on reasoning benchmarks.
Outperforms zero-RL methods by 9.0% on five mathematical reasoning benchmarks.
Improves on zero-RL methods by 10.9% on three out-of-distribution benchmarks.
Abstract
Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and…
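The abstract describes an entropy-aware weighting that lets SFT and RL act on the policy in a single stage, but this summary does not give the formula. The following is a minimal sketch of how such a weighting could look, not the authors' implementation. The exponential weight `exp(-H)`, the `sft_scale` constant, and all function names are illustrative assumptions, and the SFT and RL losses are passed in as plain numbers.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(dists):
    """Average entropy over the per-token distributions of a response."""
    return sum(token_entropy(d) for d in dists) / len(dists)

def combined_loss(sft_nll, rl_loss, dists, sft_scale=0.5):
    """Hypothetical single-stage objective: entropy-weighted SFT term plus RL term.

    Assumption (not from the source): the SFT weight is sft_scale * exp(-H),
    so the demonstration term is down-weighted when policy entropy H is high.
    In a real trainer the weight would be treated as a constant (no gradient).
    """
    h = mean_entropy(dists)
    w_sft = sft_scale * math.exp(-h)
    return w_sft * sft_nll + rl_loss, w_sft

# Usage: a confident (peaked) policy versus a uniform one over 4 tokens.
peaked = [[1.0, 0.0, 0.0, 0.0]]
uniform = [[0.25, 0.25, 0.25, 0.25]]
loss_peaked, w_peaked = combined_loss(2.0, 1.0, peaked)
loss_uniform, w_uniform = combined_loss(2.0, 1.0, uniform)
```

Under these assumptions, the uniform policy has entropy ln 4, so its SFT weight is a quarter of the peaked policy's. Both SFT and RL still contribute to the same update, which is the single-stage property the abstract emphasizes.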
Peer Reviews
Decision: ICLR 2026 Poster
- The proposal to unify SFT and RL into a single stage is interesting and enables finer control over the trade-off between the two.
- The paper offers a comprehensive analysis of learning dynamics and the respective effects of SFT and RL for language models, improving our understanding of both paradigms.
- The manuscript is clearly written and easy to follow.
- The experiments are not yet fully convincing; more evidence is needed to demonstrate the effectiveness of the proposed SRFT method. In particular, the SFT data use DeepSeek-R1 responses (Line 360), which likely exceed the quality of the Qwen2.5-7B policy’s rollouts. Figure A2 suggests the model learns primarily from SFT—the SFT loss dominates—implying that distillation, rather than RL, drives most gains. For a fair comparison, both RL and SRFT should start from the same fine-tuned initializati
- The paper is well written and easy to follow.
- The analyses conducted are presented in a compact form, making them easily digestible.
- The motivation to introduce the SRFT method is well-grounded based on the empirical analyses conducted in the paper.
- In line 160, the paper claims that the Fig. 2a results reveal a fundamental difference between SFT and RL in how they reshape probability distributions. Could the authors provide details on exactly how the heat map is computed? It is currently unclear, and many details are omitted.
- Is the token output sequence generated by the base model or by the SFT/RL-trained model? Further, are the log-probs simply computed token-wise with the base and the SFT/RL-trained model?
- Evaluation methodology is thoroughly reported and appears to follow best practices (e.g. multi-seed evaluation on high-variance benchmarks, ablation across prompt variants), which increases confidence in the robustness of the results.
- Evaluation on both in-distribution and out-of-distribution benchmarks is good.
- The authors include a thorough analysis of learning dynamics.
- Multiple ablations are presented, including on entropy weighting factors and training on different Qwen-based model v
- The method is only tested on the Qwen model family.
- The claims in Section 3.1 seem to lack quantitative validation: they rest on one training run per category on Qwen2.5-Math-7B.
Taxonomy
Topics: Logic, programming, and type systems · Logic, Reasoning, and Knowledge · Intelligent Tutoring Systems and Adaptive Learning
