DeDPO: Debiased Direct Preference Optimization for Diffusion Models

Khiem Pham; Quang Nguyen; Tung Nguyen; Jingsen Zhu; Michele Santacatterina; Dimitris Metaxas; Ramin Zabih

arXiv:2602.06195·cs.CV·February 9, 2026

TL;DR

DeDPO introduces a debiased semi-supervised approach for diffusion model alignment, effectively leveraging synthetic feedback to reduce reliance on costly human labels while maintaining high performance.

Contribution

The paper presents DeDPO, a novel method that integrates causal inference techniques into preference optimization, enabling robust learning from synthetic and limited human feedback.

Findings

1. DeDPO matches or exceeds the performance of models trained on fully human-labeled data.
2. DeDPO is robust to different synthetic labeling methods.
3. DeDPO reduces the need for expensive human preference labels.

Abstract

Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to…
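The abstract does not state the DeDPO objective, so the following is only a rough sketch of how a debiased estimator of this general kind can be combined with a DPO-style pairwise loss. It is not the paper's formulation. It assumes a generic correction in the style of doubly robust estimation: the loss is computed with synthetic labels on the large unlabeled pool, then adjusted by the average gap between the human-label loss and the synthetic-label loss on the small human-labeled subset. The function names, the `margin` inputs (a precomputed implicit reward difference between the two images in a pair), and the `beta` parameter are all hypothetical.

```python
import numpy as np


def dpo_pair_loss(margin, label, beta=1.0):
    """Pairwise logistic loss, -log sigmoid(beta * label * margin).

    margin: implicit reward difference (first item minus second), assumed precomputed.
    label:  +1 if the first item is preferred, -1 if the second is preferred.
    """
    # logaddexp(0, -x) equals -log(sigmoid(x)) and is numerically stable.
    return np.logaddexp(0.0, -beta * label * margin)


def debiased_loss(margin_unlabeled, synth_unlabeled,
                  margin_labeled, synth_labeled, human_labeled, beta=1.0):
    """Illustrative debiased objective (an assumption, not the paper's exact loss).

    synthetic term: mean loss on the unlabeled pool using synthetic labels.
    correction:     mean (human-label loss - synthetic-label loss) on the
                    human-labeled subset, which offsets systematic annotator bias.
    """
    synthetic_term = dpo_pair_loss(margin_unlabeled, synth_unlabeled, beta).mean()
    correction = (
        dpo_pair_loss(margin_labeled, human_labeled, beta)
        - dpo_pair_loss(margin_labeled, synth_labeled, beta)
    ).mean()
    return synthetic_term + correction
```

Under these assumptions, the correction term vanishes when the synthetic annotator agrees with the human labels on the labeled subset, and otherwise shifts the objective toward the human-label loss.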

Taxonomy

Topics: Recommender Systems and Techniques · Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms