Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Xinxin Liu; Ming Li; Zonglin Lyu; Yuzhang Shang; Chen Chen

arXiv:2604.24952·cs.CV·April 29, 2026

TL;DR

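Semi-DPO is a semi-supervised variant of Diffusion DPO for noisy, multi-dimensional human preference data. It treats preference pairs with consistent signals as clean labeled data and conflicting pairs as noisy unlabeled data to be pseudo-labeled, and is reported to reach state-of-the-art results.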

Contribution

The paper introduces a semi-supervised approach that separates clean preference pairs from noisy ones, improving alignment with multi-dimensional human preferences without requiring extra annotations.
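The clean/noisy split can be illustrated with a minimal sketch. The summary here does not say how consensus is computed, so this example assumes a set of hypothetical per-dimension scoring functions (for example aesthetics and semantic alignment) and calls a pair "clean" only when every scorer agrees with the annotated winner. The name `split_by_consensus` and the unanimity rule are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

# A preference pair as annotated in the dataset: (winner_id, loser_id).
Pair = Tuple[str, str]
# Hypothetical scorer: maps an image id to a score for one preference dimension.
Scorer = Callable[[str], float]


def split_by_consensus(
    pairs: Sequence[Pair], scorers: Sequence[Scorer]
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (clean, noisy).

    Assumption: a pair is clean when every scorer ranks the annotated
    winner strictly above the annotated loser; otherwise it is noisy.
    """
    clean: List[Pair] = []
    noisy: List[Pair] = []
    for winner, loser in pairs:
        votes = [score(winner) > score(loser) for score in scorers]
        (clean if all(votes) else noisy).append((winner, loser))
    return clean, noisy


# Toy usage with made-up scores for two dimensions.
aesthetics = {"img1": 0.9, "img2": 0.2, "img3": 0.7, "img4": 0.3}
alignment = {"img1": 0.8, "img2": 0.1, "img3": 0.2, "img4": 0.9}
toy_pairs = [("img1", "img2"), ("img3", "img4")]
toy_clean, toy_noisy = split_by_consensus(
    toy_pairs, [aesthetics.__getitem__, alignment.__getitem__]
)
```

In the toy data the second pair wins on aesthetics but loses on alignment, so it lands in the noisy set. That is the kind of pair that excels in one dimension and is deficient in another.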

Findings

01 Semi-DPO outperforms existing methods in preference alignment.

02 It effectively filters and utilizes noisy preference data.

03 The approach achieves state-of-the-art performance on benchmark datasets.

Abstract

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results…
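The pseudo-labeling step can be sketched under stated assumptions. A DPO-trained model defines an implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and the sigmoid of the reward margin between two candidates gives a Bradley-Terry preference probability. The sketch below uses that margin to relabel noisy pairs. The confidence threshold `min_confidence`, the function names, and the use of plain log-probabilities (Diffusion DPO works with a denoising-error surrogate instead) are assumptions for illustration. The abstract does not specify the paper's exact rule.

```python
import math
from typing import List, Sequence, Tuple

# (log-prob under the current policy, log-prob under the frozen reference).
LogProbs = Tuple[float, float]


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 1.0) -> float:
    """DPO implicit reward: beta * (log pi_theta - log pi_ref)."""
    return beta * (logp_policy - logp_ref)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pseudo_label(
    pair_logps: Sequence[Tuple[LogProbs, LogProbs]],
    beta: float = 1.0,
    min_confidence: float = 0.7,  # hypothetical threshold, not from the paper
) -> List[Tuple[int, str, float]]:
    """Return (pair index, predicted winner 'a' or 'b', confidence).

    Pairs whose predicted preference is below min_confidence are left
    unlabeled so a later refinement round can revisit them.
    """
    labeled: List[Tuple[int, str, float]] = []
    for i, ((lp_a, lref_a), (lp_b, lref_b)) in enumerate(pair_logps):
        margin = implicit_reward(lp_a, lref_a, beta) - implicit_reward(lp_b, lref_b, beta)
        p_a = sigmoid(margin)  # probability that 'a' is preferred
        confidence = max(p_a, 1.0 - p_a)
        if confidence >= min_confidence:
            labeled.append((i, "a" if p_a >= 0.5 else "b", confidence))
    return labeled


# Toy usage: the first pair has a clear margin, the second is near a tie.
toy_logps = [((-1.0, -3.0), (-2.0, -2.0)), ((-1.0, -1.0), (-1.1, -1.0))]
toy_labels = pseudo_label(toy_logps)
```

In an iterative scheme of the kind the abstract describes, confidently pseudo-labeled pairs would be added to the training set and the model retrained, repeating until few pairs change label.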

Code & Models

Repositories

L-CodingSpace/semi-dpo (GitHub)

Videos

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization (SlidesLive)