Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Bo Wang; Qinyuan Cheng; Runyu Peng; Rong Bao; Peiji Li; Qipeng Guo; Linyang Li; Zhiyuan Zeng; Yunhua Zhou; Xipeng Qiu

arXiv:2507.00018 · cs.LG · July 8, 2025

TL;DR

This paper unifies supervised fine-tuning and preference learning in large language models through a theoretical framework, revealing limitations in SFT and proposing improvements that significantly enhance instruction-following performance.

Contribution

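It introduces a unified view of SFT and DPO, shows that the KL divergence term in SFT's distribution-matching objective becomes constant with respect to the policy and therefore fails to constrain model updates, and proposes a simple learning rate adjustment and alternative objectives to improve model performance.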

Findings

01. Up to 25% relative performance gain in instruction tasks
02. 6% absolute increase in win rate
03. Theoretical derivation linking LLM logits and Q-functions

Abstract

Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning…
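
As a rough illustration of the "implicit reward" the abstract refers to, the sketch below computes the standard DPO quantity: the reward implied by a policy is, up to a prompt-only term, beta times the log-ratio between the policy and a reference model, and the DPO loss is the negative log-sigmoid of the reward margin between a chosen and a rejected response. This is a minimal sketch of the general DPO formulation, not the paper's derivation or code. The function names, the value beta = 0.1, and the log-probabilities in the usage lines are all assumptions made for illustration.

```python
import math


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit reward of a response, up to a term that depends only on the prompt.

    logp_policy and logp_ref are sequence log-probabilities of the same response
    under the trained policy and the frozen reference model (illustrative inputs).
    """
    return beta * (logp_policy - logp_ref)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_policy_chosen: float,
    logp_ref_chosen: float,
    logp_policy_rejected: float,
    logp_ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair: -log sigmoid(reward margin)."""
    margin = implicit_reward(logp_policy_chosen, logp_ref_chosen, beta) - implicit_reward(
        logp_policy_rejected, logp_ref_rejected, beta
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)), i.e. softplus(-m).
    return softplus(-margin)


# Hypothetical log-probabilities, chosen only to exercise the functions.
loss_at_reference = dpo_loss(-5.0, -5.0, -7.0, -7.0)  # policy equals reference
loss_after_update = dpo_loss(-4.0, -5.0, -7.0, -7.0)  # chosen response made more likely
```

When the policy equals the reference, both implicit rewards are zero and the loss is log 2. Raising the policy's log-probability of the chosen response, which is also what an SFT step on that response does, increases its implicit reward and lowers the loss.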

Taxonomy

Topics: Law, Economics, and Judicial Systems · Diverse Scientific and Economic Studies