DeDPO: Debiased Direct Preference Optimization for Diffusion Models
Khiem Pham, Quang Nguyen, Tung Nguyen, Jingsen Zhu, Michele Santacatterina, Dimitris Metaxas, Ramin Zabih

TL;DR
DeDPO introduces a debiased semi-supervised approach for diffusion model alignment, effectively leveraging synthetic feedback to reduce reliance on costly human labels while maintaining high performance.
Contribution
The paper presents DeDPO, a novel method that integrates causal inference techniques into preference optimization, enabling robust learning from synthetic and limited human feedback.
Findings
DeDPO matches or exceeds the performance of models trained on fully human-labeled data.
DeDPO is robust to different synthetic labeling methods.
DeDPO reduces the need for expensive human preference labels.
Abstract
Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to…
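To make the idea of debiasing a preference loss concrete, here is a minimal sketch of one common form of debiased estimator: a plug-in loss computed with synthetic labels, plus a correction term measured on the small human-labeled subset. The abstract does not state the exact estimator, so this form is an assumption and not necessarily the paper's objective. The function names (`dpo_loss`, `debiased_loss`), the scalar `margin` input, and `beta` are hypothetical. For a diffusion model, `margin` would stand in for the policy-versus-reference preference score of each pair, which is not computed here.

```python
import numpy as np

def dpo_loss(margin, label, beta=1.0):
    """Per-pair logistic preference loss, -log(sigmoid(beta * label * margin)).

    margin: score of the first item minus the second (assumed precomputed)
    label:  +1 if the first item is preferred, -1 otherwise
    """
    margin = np.asarray(margin, dtype=float)
    label = np.asarray(label, dtype=float)
    # logaddexp(0, -z) equals log(1 + exp(-z)) and is numerically stable
    return np.logaddexp(0.0, -beta * label * margin)

def debiased_loss(margin_lab, human_lab, synth_lab,
                  margin_unl, synth_unl, beta=1.0):
    """Assumed debiased objective (illustrative, not the paper's stated form).

    plug_in:    mean loss on unlabeled pairs using synthetic labels
    correction: mean gap between human-label and synthetic-label loss,
                estimated on the pairs that carry both labels
    """
    plug_in = dpo_loss(margin_unl, synth_unl, beta).mean()
    correction = (dpo_loss(margin_lab, human_lab, beta)
                  - dpo_loss(margin_lab, synth_lab, beta)).mean()
    return plug_in + correction

# Toy usage: two human-labeled pairs, three synthetically labeled pairs
loss = debiased_loss(
    margin_lab=[1.0, -0.5], human_lab=[1, -1], synth_lab=[1, 1],
    margin_unl=[0.3, -0.2, 0.8], synth_unl=[1, -1, 1],
)
```

When the synthetic annotator agrees with the human labels on the labeled subset, the correction term vanishes and the objective reduces to the plug-in loss. When it disagrees systematically, the correction shifts the objective toward what the human labels would have produced.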
Taxonomy
Topics: Recommender Systems and Techniques · Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms
