Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Zeyu Huang; Tianhao Cheng; Zihan Qiu; Zili Wang; Yinghui Xu; Edoardo M. Ponti; Ivan Titov

arXiv:2507.01679·cs.LG·May 18, 2026

TL;DR

This paper introduces Prefix-RFT, a hybrid fine-tuning method that combines supervised and reinforcement learning techniques to improve large language model performance on reasoning tasks.

Contribution

The paper presents a unified approach that merges supervised and reinforcement fine-tuning, outperforming standalone SFT and RFT as well as parallel mixed-policy RFT methods.

Findings

01

Prefix-RFT surpasses standalone SFT and RFT performance.

02

It outperforms parallel mixed-policy RFT methods.

03

The approach is robust to variations in demonstration data quality and quantity.

Abstract

Existing post-training techniques for LLMs are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The paper’s originality lies in both its conceptual unification of SFT and RFT under a common gradient view and its pragmatic “prefix sampling” mechanism that stitches an off-policy demonstration prefix to an on-policy continuation while retaining PPO-style stability, delivering a clean bridge between imitation and exploration. Methodological quality is high: the experimental suite covers diverse math-reasoning benchmarks with clear protocols, compares against relevant baselines, and includes

Weaknesses

The evidence centers on math with exact checker rewards, so robustness under noisier or heuristic feedback remains unclear. There are no non-verifiable math settings (e.g., LLM-graded rationale quality) to test behavior when correctness can’t be deterministically checked. The study also relies primarily on Qwen backbones; results on additional families (e.g., Llama-3) would better assess backbone generality.
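As a rough illustration of the prefix sampling mechanism described under Strengths above, the sketch below stitches an off-policy demonstration prefix to an on-policy continuation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `prefix_rollout`, the `prefix_ratio` parameter, and the `sample_continuation` callable (a stand-in for sampling from the current policy) are all hypothetical, and the loss, clipping, and scheduling details are omitted.

```python
from typing import Callable, List, Tuple


def prefix_rollout(
    demo: List[int],
    prefix_ratio: float,
    sample_continuation: Callable[[List[int]], List[int]],
) -> Tuple[List[int], List[bool]]:
    """Stitch a demonstration prefix to a policy-generated continuation.

    All names here are hypothetical. `sample_continuation` stands in for
    sampling from the current policy conditioned on the prefix.
    """
    if not 0.0 <= prefix_ratio <= 1.0:
        raise ValueError("prefix_ratio must lie in [0, 1]")

    # Keep the first `cut` demonstration tokens as the off-policy prefix.
    cut = int(len(demo) * prefix_ratio)
    prefix = demo[:cut]

    # The policy completes the response from the end of the prefix.
    continuation = sample_continuation(prefix)

    tokens = prefix + continuation
    # Mark which tokens came from the demonstration (off-policy) so a
    # trainer could treat them differently from on-policy tokens.
    is_off_policy = [True] * len(prefix) + [False] * len(continuation)
    return tokens, is_off_policy


def toy_policy(prefix: List[int]) -> List[int]:
    # Toy stand-in for a policy: always emits the same two tokens.
    return [9, 9]


tokens, mask = prefix_rollout([1, 2, 3, 4], 0.5, toy_policy)
```

With `prefix_ratio` set to 0 the rollout is fully on-policy, and with 1 the whole demonstration is kept before the policy continues, which is one way to picture the bridge between imitation and exploration that the review refers to.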

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- Empirical results show that Pre-FT consistently outperforms SFT, RFT, and other mixed-policy RFT methods.
- Pre-FT demonstrates strong robustness across varying quantities and qualities of demonstration data.
- The paper is well-written.

Weaknesses

- **Lack of novelty and incremental contribution:** My main concern is that the use of off-policy data to enhance model capabilities has already been explored in several prior works [1, 2, 3]. Moreover, Pre-FT closely resembles UFT [3], which also applies the SFT loss to prefixes and uses RFT for partial continuations. The additional heuristics, such as entropy-based clipping and the cosine decay scheduler, appear too incremental to me.
- **Pass@1 does not reflect the reasoning capabilities of P

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The presentation of Prefix-RFT is clear. More broadly, I find the paper to be well-written and relatively easy to follow.
2. How to best combine RFT with demonstrations is an active and important area of research, toward identifying robust language model finetuning guidelines.
3. The empirical analysis in Section 5 helps shed light on the mechanisms underlying Prefix-RFT and its relation to SFT and RFT.

Weaknesses

1. Unfortunately, it seems that the main technical innovation of Prefix-RFT is not new. Aside from the arguably concurrent UFT paper [1], there is an even earlier work proposing to use demonstration prefixes to seed rollouts in policy gradient [2]. These methods are nearly identical, differing mostly in their technical details (e.g., amount of non-prefixed rollouts used in a batch and entropy-based clipping). While the paper does briefly mention [1], it is missing a reference to [2]. Given the v

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification