Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization

Daoan Zhang; Guangchen Lan; Dong-Jun Han; Wenlin Yao; Xiaoman Pan; Hongming Zhang; Mingxiao Li; Pengcheng Chen; Yu Dong; Christopher Brinton; Jiebo Luo

arXiv:2410.05255·cs.CV·July 2, 2025

TL;DR

This paper introduces SSPO (Self-Sampling Preference Optimization), an alignment method for diffusion models that combines the training stability of supervised fine-tuning with the generalization of reinforcement learning, using self-sampling and checkpoint replay.

Contribution

The paper proposes SSPO, a new alignment technique that integrates checkpoint replay and self-sampling regularization to enhance diffusion model training without paired data or reward models.

Findings

01. SSPO outperforms existing methods on text-to-image benchmarks.

02. SSPO demonstrates strong performance on text-to-video tasks.

03. The approach effectively balances training stability and generalization.

Abstract

Existing post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement learning (RL) methods; the former is stable during training but suffers from limited generalization, while the latter, despite its stronger generalization capability, relies on additional preference data or reward models and carries the risk of reward exploitation. In order to preserve the advantages of both SFT and RL -- namely, eliminating the need for paired data and reward models while retaining the training stability of SFT and the generalization ability of RL -- a new alignment method, Self-Sampling Preference Optimization (SSPO), is proposed in this paper. SSPO introduces a Random Checkpoint Replay (RCR) strategy that utilizes historical checkpoints to construct paired data, thereby effectively mitigating overfitting. Simultaneously, a Self-Sampling Regularization (SSR)…
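The Random Checkpoint Replay idea described in the abstract, using historical checkpoints to construct paired data, can be illustrated with a minimal sketch. This is not the authors' implementation: the `CheckpointReplay` class, the `build_pair` helper, the `toy_generate` stand-in sampler, and the choice of uniform sampling over saved checkpoints are all assumptions made for illustration. How the resulting pair is scored or used in the loss is not described in the abstract and is left out.

```python
import random
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]  # stand-in for real model weights (assumption)


class CheckpointReplay:
    """Hypothetical buffer of past checkpoints, sampled uniformly at random."""

    def __init__(self, seed: int = 0) -> None:
        self._checkpoints: List[Params] = []
        self._rng = random.Random(seed)

    def save(self, params: Params) -> None:
        # Store a copy so later training updates do not mutate the snapshot.
        self._checkpoints.append(dict(params))

    def sample(self) -> Params:
        if not self._checkpoints:
            raise ValueError("no checkpoints saved yet")
        return self._rng.choice(self._checkpoints)

    def __len__(self) -> int:
        return len(self._checkpoints)


def build_pair(
    prompt: str,
    target: str,
    replay: CheckpointReplay,
    generate: Callable[[Params, str], str],
) -> Tuple[str, str]:
    """Pair the SFT target with a sample drawn from a random past checkpoint."""
    reference = replay.sample()
    self_sample = generate(reference, prompt)
    return target, self_sample


def toy_generate(params: Params, prompt: str) -> str:
    # Placeholder for a diffusion sampler (assumption); it only tags the prompt.
    return f"{prompt}@step{int(params['step'])}"


replay = CheckpointReplay(seed=0)
for step in range(3):
    replay.save({"step": float(step)})

pair = build_pair("a red cube", "target_image", replay, toy_generate)
print(pair)
```

In a real training loop, `Params` would be a model state and `generate` a diffusion sampler; the sketch only shows that the second element of each pair comes from the model's own earlier checkpoints rather than from an external preference dataset.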

Code & Models

Repositories

dwanzhang-ai/seppo (PyTorch, official)

Models

🤗 DwanZhang/SePPO (model, 11 downloads, 4 likes)

Taxonomy

Topics: EU Law and Policy Analysis

Methods: Diffusion