Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models
Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

TL;DR
This paper introduces Diffusion-SDPO, a safeguarded optimization method for aligning diffusion models with human preferences, addressing issues in existing DPO approaches and improving output quality across benchmarks.
Contribution
It proposes a novel safeguarded update rule for diffusion-based preference optimization that preserves preferred outputs and enhances alignment performance.
Findings
Diffusion-SDPO improves preference, aesthetic, and prompt alignment metrics.
Under a first-order analysis, the method guarantees that the error of preferred outputs does not increase.
It is simple, model-agnostic, and compatible with existing frameworks.
Abstract
Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the…
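The safeguard described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's closed-form coefficient, which is cut off in the abstract as reproduced here. It assumes a simplified parameter-space setting: g_w and g_l are the gradients of the winner and loser errors, the update direction is d = g_w - lam * g_l, and the step is theta <- theta - eta * d. To first order the winner error then changes by -eta * (g_w . d), so it cannot increase as long as g_w . d >= 0. The names safe_loser_scale, safeguarded_direction, and lam_max are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def safe_loser_scale(g_w: Sequence[float], g_l: Sequence[float],
                     lam_max: float = 1.0) -> float:
    """Largest lam in [0, lam_max] such that g_w . (g_w - lam * g_l) >= 0.

    Assumption (not taken from the paper): gradients are flat parameter-space
    vectors and only the first-order change of the winner error is controlled.
    """
    overlap = dot(g_w, g_l)
    if overlap <= 0.0:
        # Pushing the loser error up does not hurt the winner to first order.
        return lam_max
    # Aligned gradients: cap lam so the winner error does not increase.
    return min(lam_max, dot(g_w, g_w) / overlap)


def safeguarded_direction(g_w: Sequence[float], g_l: Sequence[float],
                          lam_max: float = 1.0) -> List[float]:
    """Descent direction d = g_w - lam * g_l with the safeguarded lam."""
    lam = safe_loser_scale(g_w, g_l, lam_max)
    return [gw - lam * gl for gw, gl in zip(g_w, g_l)]


if __name__ == "__main__":
    g_w, g_l = [1.0, 0.0], [2.0, 1.0]
    # Unscaled direction (lam = 1): g_w . d = -1, so the winner error would rise.
    unscaled = [gw - gl for gw, gl in zip(g_w, g_l)]
    print("unscaled g_w . d =", dot(g_w, unscaled))
    # Safeguarded direction: lam = 0.5, so g_w . d = 0 and the winner is preserved.
    print("lam =", safe_loser_scale(g_w, g_l))
    print("safeguarded g_w . d =", dot(g_w, safeguarded_direction(g_w, g_l)))
```

In this toy case the two gradients are positively aligned, so the unscaled step would raise the winner error even though the margin grows; the scaled step keeps the first-order change of the winner error at zero.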
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper identifies and tackles a critical problem in Diffusion-DPO: the optimization can hack the objective by making the losing samples worse than the winning samples. 2. The paper proposes a simple rescaling method, based on a first-order analysis, to maintain the quality of the winning samples. 3. The proposed method is validated on various models and across different benchmarks.
1. The key near-isometry assumption is fragile. It relies on both near-isometry of the self-Jacobian and closeness of the Jacobians in the two branches, which can break easily. The current analysis is also based on U-Net and does not cover other architectures such as DiT. 2. A more formal analysis of the simplification of the DPO objective to a linear version is needed. It’s not clear that the current analysis still holds once the sigmoid DPO loss is put back. 3. The performance gain on SDXL over benchmark m
1. Insightful Diagnosis of DPO Behavior: The paper makes a clear and valuable observation that simply increasing the preference margin in diffusion-based DPO does not guarantee improved image quality. By identifying that both winner and loser losses can rise during training, the authors uncover a subtle but important failure mode in current preference optimization methods. 2. Effective Solution: The proposed Diffusion-SDPO introduces a simple modification, adaptive scaling of the loser gradien
The theoretical analysis includes several nontrivial leaps that are not fully justified. In particular, the assumption that $J_w^{\top} J_l = I$ (identity matrix) appears unrealistic and lacks both empirical and conceptual grounding in the context of diffusion model optimization. Assumptions A and B are introduced without sufficient explanation or validation, making it difficult to assess their plausibility. For example, in assumption A, "For a fixed t, the noised latents $x_t^w$ and $x_t^l$
1. The paper provides a clear analysis of the shortcomings of standard DPO and argues that SDPO enables more stable optimization during preference training. 2. The mathematical derivations are detailed and carefully presented, supporting the approximate solution procedure for Diffusion-SDPO.
1. The paper lacks a discussion of the computational overhead of computing $\lambda_{safe}$. For example, there is no comparison with baselines in terms of training memory usage or per-step backward-pass time. 2. All experiments are trained for only 2,000 steps. It is unclear whether performance has reached the peak for each baseline or whether results are reported before full convergence. Longer-horizon training results would help clarify this. 3. From Figure 3, the original DPO variant does no
1. The proposed method is simple, elegant, and practical: a few-line modification that can be readily integrated into existing DPO pipelines. 2. The paper identifies and addresses a highly counterintuitive failure mode of classical DPO, providing clear analytical insight and an effective solution. 3. The theoretical assumptions are reasonable, and the mathematical analysis is clear and easy to follow.
1. While the assumptions are generally realistic, the analysis relies heavily on specific properties of U-Nets. This limits the generality of the results, as U-Nets are already completely replaced by more modern architectures such as Diffusion Transformers (DiTs). The paper would be significantly stronger if the authors discussed how their assumptions extend to DiTs or at least provided quantitative results on such architectures. 2. The paper lacks an analysis of how the dynamic λ coefficients
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Visual Attention and Saliency Detection
