Divergence Minimization Preference Optimization for Diffusion Model Alignment
Binxu Li, Minkai Xu, Jiaqi Han, Meihua Dang, Stefano Ermon

TL;DR
This paper introduces DMPO, which aligns diffusion models with human preferences by minimizing reverse KL divergence. It argues that existing preference optimization methods are trapped in suboptimal mean-seeking optimization, and it reports gains over baselines in human evaluations and PickScore.
Contribution
The paper proposes DMPO, a novel divergence minimization method for diffusion model alignment, with rigorous analysis and extensive experiments showing its effectiveness.
Findings
DMPO outperforms baseline models in human evaluations.
DMPO achieves the best PickScore across various test sets.
Diffusion models fine-tuned with DMPO match or surpass existing techniques.
Abstract
Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and…
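The mean-seeking behavior mentioned in the abstract can be illustrated with a toy example that is not taken from the paper. The sketch below fits a single Gaussian q to a two-mode mixture p by grid search, once minimizing forward KL(p‖q) and once minimizing reverse KL(q‖p). The target mixture, the grid ranges, and every function name are assumptions chosen for illustration; nothing here reproduces the DMPO loss or a diffusion model.

```python
import numpy as np

LOG_SQRT_2PI = 0.5 * np.log(2.0 * np.pi)

def log_gaussian(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - LOG_SQRT_2PI

def kl_on_grid(log_a, log_b, dx):
    """Riemann-sum approximation of KL(a || b) from log-densities on a grid."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

# Hypothetical target: an equal-weight mixture with modes at -2 and +2.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
log_p = np.logaddexp(
    np.log(0.5) + log_gaussian(x, -2.0, 0.5),
    np.log(0.5) + log_gaussian(x, 2.0, 0.5),
)

def best_fit(reverse):
    """Grid-search the Gaussian (mean, std) that minimizes the chosen KL."""
    best_value, best_mean, best_std = np.inf, None, None
    for mean in np.linspace(-4.0, 4.0, 81):
        for std in np.linspace(0.2, 3.0, 57):
            log_q = log_gaussian(x, mean, std)
            if reverse:
                value = kl_on_grid(log_q, log_p, dx)  # reverse KL(q || p)
            else:
                value = kl_on_grid(log_p, log_q, dx)  # forward KL(p || q)
            if value < best_value:
                best_value, best_mean, best_std = value, float(mean), float(std)
    return best_mean, best_std

fwd_mean, fwd_std = best_fit(reverse=False)
rev_mean, rev_std = best_fit(reverse=True)
print(f"forward KL fit: mean={fwd_mean:.2f}, std={fwd_std:.2f}")
print(f"reverse KL fit: mean={rev_mean:.2f}, std={rev_std:.2f}")
```

Under these assumptions the forward-KL fit lands between the two modes with a wide standard deviation, while the reverse-KL fit concentrates on a single mode. Because the toy target is symmetric, which of the two modes the reverse-KL fit selects is arbitrary.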
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The paper is well written overall and proposes to solve the diffusion alignment problem from a reverse-KL perspective. * The experimental evaluation is reasonably comprehensive, covering multiple datasets and baselines.
* The authors’ discussion of **mode-seeking vs. mode-covering** behavior is confusing, incorrect, and internally inconsistent. In Section 3.2, they claim that Diffusion-DPO approximately corresponds to forward KL and then state that it enforces mean-seeking behavior, but later describe it as mode-covering. The logical flow and terminology need clarification. * In the abstract, the claim that existing methods are trapped in *mode-seeking* optimization contradicts the introduction of DMPO as using **revers
1. The paper is well written and easy to follow. 2. Research on human preference alignment for T2I generation is important, and mean-seeking issues matter for model training. 3. The authors conduct extensive experiments to verify the effectiveness of their method.
1. The problem of mean-seeking has been widely studied for LLMs, for example in EXO and f-PO, as mentioned in the related work. There are also previous works on deriving how to adapt DPO algorithms from LLMs to a chain of Markov transitions. Therefore, I think the contribution of this paper is limited. 2. Do the authors study different divergence optimization methods like f-PO on T2I diffusion models? 3. It would be better for the authors to provide a time complexity analysis or training time comparison to further v
1. The core insight of applying a “reverse KL divergence objective” to diffusion model alignment is well-motivated. The paper effectively critiques the limitations of existing DPO-style methods from a distribution-matching perspective, providing a fresh and principled viewpoint on the alignment problem. 2. Theorem 2 establishes that DMPO's optimization direction aligns with the original RLHF objective under certain conditions, lending strong theoretical justification to the proposed approach. The
1. While reverse KL divergence does have its advantages, it does not necessarily mean it is always better than forward KL divergence; both have their own characteristics. Although Sections 3.1 and 3.2 provide extensive textual descriptions, there is a lack of specific theoretical validation in the context of alignment issues, rather than just textual explanations. I have not seen more detailed analysis on this. In Section 3.3, only the correlation with RLHF is mentioned, but the loss function of
1. The motivation is clear and well-founded, as the forward-KL objective is known to induce mean-seeking behavior. 2. The proposed approach yields promising results and has the potential to serve as a plug-and-play replacement for the standard DPO objective in production-ready pipelines. 3. The paper provides rigorous derivations, carefully establishing and demonstrating the equivalence between the proposed formulation and the original RLHF objective.
1. While the DPO objective is known to “cut” gradients when the model deviates from the reference, the proposed objective appears to exhibit a similar behavior. However, the paper does not include an analysis of this property. A gradient-flow or stability analysis would be valuable for understanding why the proposed loss performs well, whether it is more or less stable than DPO, and how sensitive it is to hyperparameter choices. 2. The authors argue that mode-seeking behavior is preferable to m
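One common reading of the gradient-"cutting" remark in the last review is the saturation of the standard DPO logistic loss: its gradient with respect to the preference margin is scaled by sigmoid(-margin), which shrinks toward zero as the margin grows. The sketch below shows only that weighting for the standard DPO loss. The variable `margin` is a hypothetical stand-in for β times the difference of policy-to-reference log-ratios between the preferred and rejected samples, assumed to be computed elsewhere. The DMPO loss itself is not given in this summary and is not reproduced here.

```python
import numpy as np

def dpo_loss(margin):
    """Standard DPO logistic loss, -log(sigmoid(margin)), computed stably."""
    return np.logaddexp(0.0, -margin)

def dpo_grad_weight(margin):
    """Magnitude of d(loss)/d(margin), which equals sigmoid(-margin)."""
    return np.exp(-np.logaddexp(0.0, margin))

# Hypothetical margins, from "model prefers the rejected sample" (negative)
# to "model strongly prefers the chosen sample" (large positive).
margins = np.array([-5.0, 0.0, 2.0, 5.0, 10.0])
weights = dpo_grad_weight(margins)

for m, w in zip(margins, weights):
    print(f"margin={m:+.1f}  loss={dpo_loss(m):.4f}  grad weight={w:.6f}")
```

The weight decreases monotonically in the margin, so samples the model already ranks correctly by a wide margin contribute almost no gradient. Whether the proposed objective behaves the same way is the open question the reviewer raises.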
Taxonomy
Topics: Industrial Technology and Control Systems · Advanced Multi-Objective Optimization Algorithms
