Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi; Minhak Song; Runlong Zhou; Zihan Zhang; Maryam Fazel; Simon S. Du

arXiv:2505.19770·cs.LG·May 13, 2026

TL;DR

This paper gives a fine-grained theoretical comparison of two-stage RLHF and DPO, tracing their performance gap to the relative capacities of the reward and policy model classes, model mis-specification, and sample efficiency.

Contribution

It decomposes the performance gap between RLHF and DPO into an explicit representation gap under exact optimization and an implicit representation gap under finite samples, and identifies conditions under which each method outperforms the other.

Findings

01

RLHF, DPO, or online DPO can each outperform the others, depending on the type of model mis-specification.

02

Online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified.

03

When the ground-truth reward is sparse, RLHF requires fewer samples than DPO, which is a statistical advantage.

Abstract

We present a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Our study decomposes this gap into two sources: the explicit representation gap under exact optimization and the implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is sparse and show that RLHF requires…
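For readers unfamiliar with the two pipelines being compared, the sketch below contrasts the per-pair losses in their commonly used textbook forms: a Bradley-Terry reward-model loss (the first stage of RLHF) and the DPO loss, which applies the same likelihood to an implicit reward defined by the policy. This is an illustrative assumption, not code from the paper or its repository; the function names, the scalar inputs, and the default `beta` are hypothetical, and the paper's exact formulation may differ.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    In two-stage RLHF, an explicit reward model is fit with this loss,
    and a policy is then optimized against the learned reward.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical default, chosen only for illustration
) -> float:
    """DPO loss for one preference pair.

    The policy defines an implicit reward beta * (log pi - log pi_ref),
    which is plugged into the same Bradley-Terry likelihood, so no
    separate reward model is trained.
    """
    implicit_chosen = beta * (logp_chosen - ref_logp_chosen)
    implicit_rejected = beta * (logp_rejected - ref_logp_rejected)
    return reward_model_loss(implicit_chosen, implicit_rejected)
```

In this form the only structural difference is where the reward lives: RLHF parameterizes it through a reward model class, while DPO parameterizes it through the policy class. That is why the relative capacities of the two classes are a natural quantity to compare.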

Code & Models

Repositories

srzer/Gap-in-Preference-Learning
github
