On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Yongliang Wu; Yizhou Zhou; Zhou Ziheng; Yingzhe Peng; Xinyu Ye; Xinting Hu; Wenbo Zhu; Lu Qi; Ming-Hsuan Yang; Xu Yang

arXiv:2508.05629 · cs.LG · March 2, 2026

TL;DR

This paper introduces Dynamic Fine-Tuning (DFT), a simple modification to Supervised Fine-Tuning that improves large language models' generalization by stabilizing gradient updates, narrowing the gap to reinforcement learning.

Contribution

It provides a theoretically motivated method to enhance SFT's generalization, outperforming standard SFT on various benchmarks with a single-line change.

Findings

01

DFT outperforms standard SFT on multiple benchmarks.

02

It improves generalization in math reasoning, code generation, and multi-modal tasks.

03

Achieves competitive results in offline reinforcement learning.

Abstract

In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model compared to RL. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of this token. With just a single-line change, the method outperforms standard SFT on multiple difficult benchmarks and base models, from math reasoning to code generation and multi-modal tasks, demonstrating improved generalization. Additionally, DFT achieves competitive results in offline RL…
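The rescaling described in the abstract can be illustrated for a single token position. The following is a minimal sketch, not the authors' implementation: it assumes the DFT weight is the model's own probability of the target token, treated as a constant when differentiating (a stop-gradient), and it uses only the Python standard library. The function names `softmax`, `token_losses`, and `logit_grads` are hypothetical and chosen for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_losses(logits, target):
    """Return (sft_loss, dft_loss) for one token position.

    SFT is the usual negative log-likelihood of the target token.
    DFT (as assumed here) multiplies that loss by the target-token
    probability p, with p treated as a constant weight.
    """
    p = softmax(logits)[target]
    sft = -math.log(p)
    dft = -p * math.log(p)
    return sft, dft


def logit_grads(logits, target):
    """Return analytic gradients of both losses w.r.t. the logits.

    For cross-entropy, d(sft)/dz = softmax(z) - onehot(target).
    Because p is held constant in the assumed DFT loss, its gradient
    is simply p times the SFT gradient.
    """
    probs = softmax(logits)
    p = probs[target]
    sft_grad = [q - (1.0 if i == target else 0.0) for i, q in enumerate(probs)]
    dft_grad = [p * g for g in sft_grad]
    return sft_grad, dft_grad


if __name__ == "__main__":
    # Target index 2 has the lowest logit, so the model finds it unlikely.
    example_logits = [2.0, 0.0, -1.0]
    print(token_losses(example_logits, target=2))
    print(logit_grads(example_logits, target=2))
```

Under these assumptions, the DFT gradient is the SFT gradient shrunk by the target-token probability, so tokens the model currently finds unlikely contribute smaller updates. In an autodiff framework the same effect would require detaching the probability weight from the computation graph.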

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The analytical arguments are clear and presented in an easy-to-follow manner. The proposed fix via weighting the gradient follows naturally from the analytical argument.
- Experimental results that show improvement over SFT models are fairly extensive. These experiments cover a variety of language model families (Llama, Qwen, DeepSeek).
- The implementation included in the appendix is easy to understand and use by a general ML practitioner.

Weaknesses

- While the analysis is good, there may be other papers that have made observations about sparse rewards and done the derivation, perhaps under a different guise. I appreciate the authors referring to GOLD (Pang & He, 2021) in the paper, as I learned about this method as well. A follow-up ICLR Blog Post does a derivation that looks similar to what is included in the paper (https://iclr-blog-track.github.io/2022/03/25/text-gen-via-lfd). So the derivation, while insightful, is perhaps not as valuable

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Relatively well-written and easy to follow.
2. The idea behind DFT is intuitive and conceptually simple.
3. Experiments demonstrate that DFT leads to substantial performance gains over SFT across several models and settings, covering mathematical reasoning, code generation, and multi-modal tasks.

Weaknesses

1. The theoretical motivation behind DFT is imprecise and includes several unsubstantiated claims.
   - The dependence of SFT on $1 / \pi_\theta (y | x)$ is fake, in the sense that it is obtained by multiplying and dividing by $\pi_\theta (y | x)$. In particular, it is not true that the gradient of SFT grows unboundedly due to the term $1 / \pi_\theta (y | x)$ in Equation (6), as this term cancels out with the expectation.
   - The claim of SFT having sparse rewards does not make much sense
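The cancellation that Reviewer 02 describes can be written out in a few lines. This is a sketch in assumed notation, since the paper's Equation (6) is not reproduced on this page: $y^*$ denotes the expert response for prompt $x$, $\mathcal{D}$ the SFT dataset, and $\mathbb{1}[\cdot]$ the indicator function.

```latex
\begin{aligned}
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
&= -\,\mathbb{E}_{(x, y^*) \sim \mathcal{D}}
   \big[ \nabla_\theta \log \pi_\theta(y^* \mid x) \big] \\
&= -\,\mathbb{E}_{x \sim \mathcal{D}} \sum_{y}
   \pi_\theta(y \mid x)\,
   \frac{\mathbb{1}[y = y^*]}{\pi_\theta(y \mid x)}\,
   \nabla_\theta \log \pi_\theta(y \mid x) \\
&= -\,\mathbb{E}_{x \sim \mathcal{D}}\,
   \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
   \left[ \frac{\mathbb{1}[y = y^*]}{\pi_\theta(y \mid x)}\,
   \nabla_\theta \log \pi_\theta(y \mid x) \right]
\end{aligned}
```

The three lines are equal term by term. The factor $1 / \pi_\theta (y | x)$ appears only because the sum over $y$ is rewritten as an expectation under $\pi_\theta$, so it is offset by the sampling probability, which is the point the reviewer raises.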

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The theoretical motivation is clear, connecting the SFT gradient to policy gradients and pinpointing the problematic inverse-probability weighting as the root cause of poor generalization.
- The proposed DFT is simple to implement, requiring only a minor modification to the standard SFT loss function, yet it yields substantial empirical improvements.
- The evaluation is comprehensive, spanning multiple reasoning domains including mathematics, code generation, and multi-modal tasks.

Weaknesses

- My main concerns lie in the potential conceptual limitations and unintended consequences of the core reweighting mechanism. The strategy of down-weighting low-probability tokens encourages the model to ignore what it finds hard. While this appears to prevent overfitting on noisy or rare patterns in the presented experiments, the paper does not adequately explore the boundary conditions under which this approach might fail. A more thorough discussion or ablation study on when this *hard example

Code & Models

Models

🤗 Liang0223/Qwen-2.5-Math-1.5B-DPO (model · 2 downloads · 1 like)
