SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Yuqian Fu; Tinghong Chen; Jiajun Chai; Xihuai Wang; Songjun Tu; Guojun Yin; Wei Lin; Qichao Zhang; Yuanheng Zhu; Dongbin Zhao

arXiv:2506.19767 · cs.CL · June 25, 2025

Open Access · 1 Model · 3 Reviews

TL;DR

SRFT is a single-stage training method that applies supervised fine-tuning and reinforcement learning together to improve reasoning in large language models, using entropy-aware weighting to balance the two.

Contribution

The paper introduces SRFT, a unified single-stage approach that effectively integrates supervised and reinforcement fine-tuning for reasoning tasks in LLMs.

Findings

01 · SRFT achieves 59.1% average accuracy on reasoning benchmarks.

02 · Outperforms zero-RL methods by 9.0% on five benchmarks.

03 · Improves out-of-distribution performance by 10.9%.

Abstract

Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and…
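
The abstract names entropy as the signal that balances the two objectives but, as excerpted here, does not give the weighting function. The sketch below is only an illustration of the general idea, under loud assumptions: `token_entropy`, `entropy_weight`, and `combined_loss` are hypothetical names, and the choice of `exp(-H)` as the SFT weight is an assumption made for this example, not the formula from the paper. It shows how a single scalar loss could mix an SFT term and an RL term in one training stage, with the SFT term down-weighted when the policy's token distributions are high-entropy.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weight(dists):
    """Hypothetical weight in (0, 1]: exp(-mean entropy).

    ASSUMPTION: this functional form is chosen for illustration only.
    In a real trainer the weight would be treated as a constant
    (no gradient flows through it).
    """
    mean_h = sum(token_entropy(d) for d in dists) / len(dists)
    return math.exp(-mean_h)


def combined_loss(sft_loss, rl_loss, dists):
    """Single-stage objective sketch: weighted SFT term plus RL term."""
    return entropy_weight(dists) * sft_loss + rl_loss


# Confident (zero-entropy) policy: the SFT term keeps full weight.
confident = [[1.0, 0.0], [0.0, 1.0]]
# Maximally uncertain policy over two tokens: the SFT term is halved.
uncertain = [[0.5, 0.5]]

print(combined_loss(2.0, 1.0, confident))
print(combined_loss(2.0, 1.0, uncertain))
```

With a two-token vocabulary the uniform distribution has entropy ln 2, so the assumed weight is exp(-ln 2) = 1/2; any other monotone function of entropy could be substituted without changing the structure of the sketch.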

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The proposal to unify SFT and RL into a single stage is interesting and enables finer control over the trade-off between the two.
- The paper offers a comprehensive analysis of learning dynamics and the respective effects of SFT and RL for language models, improving our understanding of both paradigms.
- The manuscript is clearly written and easy to follow.

Weaknesses

- The experiments are not yet fully convincing; more evidence is needed to demonstrate the effectiveness of the proposed SRFT method. In particular, the SFT data use DeepSeek-R1 responses (Line 360), which likely exceed the quality of the Qwen2.5-7B policy’s rollouts. Figure A2 suggests the model learns primarily from SFT—the SFT loss dominates—implying that distillation, rather than RL, drives most gains. For a fair comparison, both RL and SRFT should start from the same fine-tuned initializati

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper is well written and easy to follow.
- The analyses conducted are presented in a compact form, making them easily digestible.
- The motivation to introduce the SRFT method is well-grounded based on the empirical analyses conducted in the paper.

Weaknesses

- Line 160 claims that the Fig. 2a results reveal a fundamental difference between SFT and RL in how they reshape probability distributions. Could the authors provide details on exactly how the heat map is computed? It is currently unclear, and many details are omitted.
- Is the token output sequence generated by the base model or by the SFT/RL-trained model? Further, are the log-probs simply computed token-wise with the base model and the SFT/RL-trained model?

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Evaluation methodology is thoroughly reported and appears to follow best practices (e.g. multi-seed evaluation on high-variance benchmarks, ablation across prompt variants), which increases confidence in the robustness of the results.
- Evaluation on both in-distribution and out-of-distribution benchmarks is good.
- The authors include a thorough analysis of learning dynamics.
- Multiple ablations are presented, including on entropy weighting factors and training on different Qwen-based model v

Weaknesses

- The method is only tested on the Qwen model family.
- The claims in section 3.1 seem to lack quantitative validation. (They are the result of one training run per category on Qwen2.5-Math-7B.)

Code & Models

Models

Yuqian-Fu/SRFT-Qwen2.5-Math-7B
model · 10 downloads · ♡ 3

Taxonomy

Topics: Logic, programming, and type systems · Logic, Reasoning, and Knowledge · Intelligent Tutoring Systems and Adaptive Learning