Preference Learning Algorithms Do Not Learn Preference Rankings
Angelica Chen, Sadhika Malladi, Lily H. Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho

TL;DR
This paper shows that preference learning algorithms such as RLHF and DPO do not reliably learn preference rankings: most preference-tuned models have low ranking accuracy and fall well short of the idealized ranking accuracy, which calls into question the assumed mechanism behind these methods.
Contribution
The study tests the assumption that preference learning trains models to rank preferred outputs above less preferred ones. It derives the idealized ranking accuracy theoretically and gives empirical evidence of the alignment gap between observed and idealized ranking accuracies.
Findings
Most preference-tuned models achieve less than 60% ranking accuracy.
Existing models show a significant gap from the idealized ranking accuracy.
Ranking accuracy correlates with win rate when the model is close to the reference model.
Abstract
Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via ranking accuracy. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant alignment gap -- i.e., a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective,…
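As a rough illustration of the metric described in the abstract, ranking accuracy can be read as the fraction of preference pairs for which the model assigns a higher likelihood to the preferred output than to the less preferred one. The sketch below assumes the per-response log-likelihoods have already been computed. The function name `ranking_accuracy`, the strict treatment of ties, and the numbers are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def ranking_accuracy(
    chosen_logps: Sequence[float],
    rejected_logps: Sequence[float],
) -> float:
    """Fraction of pairs where the preferred output gets the higher log-likelihood.

    Hypothetical helper: ties are counted as incorrect, which is an
    assumption of this sketch rather than a detail stated in the source.
    """
    if len(chosen_logps) != len(rejected_logps):
        raise ValueError("Each chosen response needs exactly one rejected response.")
    if len(chosen_logps) == 0:
        raise ValueError("At least one preference pair is required.")

    # A pair is ranked correctly when the preferred response is more likely.
    correct = sum(1 for c, r in zip(chosen_logps, rejected_logps) if c > r)
    return correct / len(chosen_logps)


# Made-up log-likelihoods for four preference pairs (not real model outputs).
chosen = [-12.0, -8.5, -20.1, -5.0]
rejected = [-10.0, -9.0, -19.0, -7.5]
acc = ranking_accuracy(chosen, rejected)
print(f"ranking accuracy: {acc:.2f}")
```

With these made-up values, two of the four pairs are ranked correctly, so the sketch reports 0.50. A model that ranked pairs at random would also land near this value, which is why a ranking accuracy below 60% is a notable finding.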
Taxonomy
Topics: Data Management and Algorithms
Methods: Direct Preference Optimization
