Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Zhiqin Yang; Yonggang Zhang; Wei Xue; Dong Fang; Bo Han; Yike Guo

arXiv:2605.20834 · cs.AI · May 21, 2026

TL;DR

This paper analyzes the conditions under which DPO and RLHF are equivalent, identifies the failure modes that arise when the underlying assumption is violated, and proposes Constrained Preference Optimization (CPO) to ensure provable alignment, supported by theoretical and experimental validation.

Contribution

The paper shows that the equivalence of DPO and RLHF is conditional, characterizes the resulting failure modes, and introduces CPO for provable alignment, validated empirically.

Findings

01

DPO and RLHF are equivalent only when the RLHF-optimal policy prefers the human-preferred responses.

02

When this assumption is violated, training can converge pathologically: the policy decreases the DPO loss while still preferring dispreferred responses.

03

CPO achieves state-of-the-art performance on benchmarks.

Abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization…
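The failure mode described in the abstract can be illustrated with a minimal numeric sketch. It uses the standard single-pair DPO loss, -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred response and y_l the dispreferred one. The probabilities, the value of β, and the function name `dpo_loss` are illustrative assumptions, not values or code from the paper, and the sketch does not implement CPO.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (y_w preferred over y_l)."""
    # Relative advantage of the policy over the reference on this pair.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical reference policy that strongly favors the dispreferred response.
ref_logp_w, ref_logp_l = math.log(0.1), math.log(0.9)

# Hypothetical trained policy: it improved relative to the reference,
# yet it still assigns more probability to the dispreferred response.
pol_logp_w, pol_logp_l = math.log(0.4), math.log(0.6)

# Loss at initialization (policy equals reference) and after training.
loss_ref = dpo_loss(ref_logp_w, ref_logp_l, ref_logp_w, ref_logp_l)
loss_pol = dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l)

still_prefers_dispreferred = pol_logp_w < pol_logp_l

print(f"DPO loss at reference: {loss_ref:.4f}")
print(f"DPO loss after update: {loss_pol:.4f}")
print(f"Policy still prefers dispreferred response: {still_prefers_dispreferred}")
```

In this toy setting the loss drops because the policy gains relative advantage over the reference, even though its absolute preference still points at the dispreferred response. This is the gap between relative and absolute alignment that the abstract describes.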

Code & Models

Repositories

visitworld123/CPO (GitHub)
