Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Dylan Zhang; Yufeng Xu; Haojin Wang; Qingzhi Chen; Hao Peng

arXiv:2602.01058 · cs.LG · February 3, 2026

TL;DR

This paper introduces PEAR, a method that reweights the supervised fine-tuning (SFT) loss to correct the mismatch between the distribution that generates offline SFT data and the policy optimized during online reinforcement learning (RL), improving the post-RL performance of reasoning large language models.

Contribution

PEAR is an importance-sampling-based reweighting technique applied at the SFT stage. It addresses the distribution mismatch in current SFT-RL pipelines so that SFT checkpoints are better starting points for RL.

Findings

01. PEAR improves post-RL performance across multiple tasks.

02. Up to 14.6% gain on AIME2025 after applying PEAR.

03. Consistent performance improvements on reasoning benchmarks.

Abstract

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at…
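The reweighting idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: it assumes a per-token importance ratio between the model being trained and the distribution that generated the SFT data, truncated for stability. The names `importance_weights`, `reweighted_sft_loss`, and `max_weight` are hypothetical, and the three variants mentioned in the abstract are cut off in this listing and are not reproduced here.

```python
import math


def importance_weights(policy_logps, behavior_logps, max_weight=5.0):
    """Per-token ratio pi(token) / mu(token), truncated at max_weight.

    policy_logps:   log-probs of the target tokens under the model being trained (pi)
    behavior_logps: log-probs of the same tokens under the data-generating source (mu)
    max_weight:     hypothetical truncation threshold (an assumption, for stability)
    """
    if len(policy_logps) != len(behavior_logps):
        raise ValueError("log-prob sequences must have equal length")
    return [
        min(math.exp(p - b), max_weight)
        for p, b in zip(policy_logps, behavior_logps)
    ]


def reweighted_sft_loss(policy_logps, behavior_logps, max_weight=5.0):
    """Importance-weighted negative log-likelihood, averaged over tokens.

    In a real training loop the weights would be treated as constants,
    so that no gradient flows through them (an assumption of this sketch).
    """
    if not policy_logps:
        raise ValueError("need at least one token")
    weights = importance_weights(policy_logps, behavior_logps, max_weight)
    total = sum(-w * p for w, p in zip(weights, policy_logps))
    return total / len(policy_logps)


if __name__ == "__main__":
    # Toy numbers, chosen only to exercise the functions.
    pi_logps = [-0.5, -3.0, -1.0]
    mu_logps = [-0.7, -1.0, -1.0]
    print(importance_weights(pi_logps, mu_logps))
    print(reweighted_sft_loss(pi_logps, mu_logps))
```

When the two log-probability lists are identical, every weight is 1 and the loss reduces to the ordinary mean negative log-likelihood. Tokens that the model finds unlikely relative to the data source receive weights below 1, so they contribute less to the SFT loss.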

Taxonomy

Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Robot Manipulation and Learning