New Desiderata for Direct Preference Optimization
Xiangkun Hu, Tong He, David Wipf

TL;DR
This paper critically evaluates existing direct preference optimization (DPO) methods for aligning large language models with human preferences, identifies their limitations, and proposes an improved DPO-like loss with empirical validation.
Contribution
It introduces new evaluation criteria for DPO, highlights key shortcomings, and proposes a novel DPO-like loss that addresses these issues.
Findings
Existing DPO methods struggle to interpolate between a pre-trained reference model and empirically measured preferences.
Current DPO approaches involve trade-offs in how they apply regularization and handle constraints.
The proposed DPO-like loss mitigates identified limitations, improving alignment quality.
Abstract
Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) and followed by several notable descendants. Although effective in certain real-world settings, we introduce new evaluation criteria that serve to highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of…
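For orientation, the "single closed-form training objective" mentioned in the abstract can be illustrated with the original DPO loss for one preference pair. This is a minimal sketch of that widely used baseline, not the new loss the paper proposes, whose form is not given in this summary. The function names, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Original DPO loss for a single (preferred, dispreferred) pair.

    logp_w, logp_l:         policy log-probabilities of the preferred and
                            dispreferred responses (hypothetical inputs).
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference model.
    beta:                   strength of the pull toward the reference model
                            (0.1 is an assumed default, not from the paper).
    """
    # Implicit reward margin: how much more the policy favors the preferred
    # response than the reference model does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is 0 and the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy assigns more probability to the preferred response: the loss drops.
    print(dpo_loss(-4.0, -7.0, -5.0, -7.0))
```

No separate reward model appears anywhere in this computation; the reference model's log-probabilities play the regularizing role, which is the reparameterization the abstract refers to.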
Taxonomy
Topics: Recommender Systems and Techniques · Machine Learning and Data Classification · Emotion and Mood Recognition
Methods: Direct Preference Optimization · ALIGN
