SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models
Xiaomeng Yang, Mengping Yang, Junyan Wang, Zhijian Zhou, Zhiyu Tan, Hao Li

TL;DR
SIPO is a preference optimization framework that improves the alignment of diffusion models with human preferences by addressing two problems in existing methods: training instability and off-policy bias.
Contribution
The paper proposes SIPO, a novel framework with gradient clipping and timestep-aware importance reweighting, to improve stability and effectiveness in preference-based diffusion model alignment.
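The two ingredients named above can be illustrated with a minimal sketch. This is not the authors' implementation: the loss follows the general shape of the published Diffusion-DPO objective, while the function names, the clipping bounds `low`/`high`, the default `beta`, and the way the importance weights are supplied are all assumptions made for illustration only.

```python
import math


def diffusion_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta=1.0):
    """Diffusion-DPO-style loss for one preference pair at one timestep.

    err_w / err_l: squared noise-prediction error of the policy on the
    preferred (w) and rejected (l) sample; *_ref: same for the frozen
    reference model. beta is a hypothetical default.
    """
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    z = beta * margin
    # -log(sigmoid(-z)) equals softplus(z); written in a numerically stable form.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def clipped_weight(weight, low=0.5, high=2.0):
    """Bound a per-timestep importance weight (bounds are assumed values)."""
    return min(max(weight, low), high)


def reweighted_loss(pairs, weights, beta=1.0):
    """Average the per-pair loss, scaled by clipped importance weights.

    pairs: iterable of (err_w, err_w_ref, err_l, err_l_ref) tuples.
    weights: one importance weight per pair, e.g. one per sampled timestep.
    """
    pairs = list(pairs)
    total = 0.0
    for (ew, ewr, el, elr), w in zip(pairs, weights):
        total += clipped_weight(w) * diffusion_dpo_loss(ew, ewr, el, elr, beta)
    return total / len(pairs)


def clip_grad_norm(grads, max_norm=1.0):
    """Rescale a flat gradient list so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]
```

In this sketch, clipping the weight keeps any single timestep from dominating the update, and clipping the gradient norm bounds the step size when the loss spikes. How SIPO actually computes its importance weights and clipping thresholds is not stated in this summary.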
Findings
SIPO stabilizes training across various diffusion models.
SIPO outperforms existing alignment methods on multiple benchmarks.
The paper provides guidelines for timestep-aware preference optimization.
Abstract
Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation. However, existing alignment approaches such as Diffusion-DPO suffer from two fundamental challenges: training instability caused by high gradient variances at various timesteps and high parameter sensitivities, and off-policy bias arising from the discrepancy between the optimization data and the policy models' distribution. Our first contribution is a systematic analysis of diffusion trajectories across different timesteps, identifying that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose SIPO, a Stabilized and Improved Preference Optimization framework for aligning diffusion models with human preferences. Concretely,…
