Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin

TL;DR
This paper investigates the counter-intuitive phenomenon of likelihood displacement in direct preference optimization, revealing its causes, implications, and ways to mitigate unintentional unalignment in language models.
Contribution
It provides a theoretical and empirical analysis of likelihood displacement, introduces the CHES score to identify problematic training samples, and demonstrates mitigation strategies.
Findings
Likelihood displacement can cause models to shift probability mass away from preferred responses.
Filtering training data based on CHES scores reduces likelihood displacement and improves alignment.
Likelihood displacement is driven by preference pairs whose preferred and dispreferred responses induce similar embeddings.
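The filtering finding above can be illustrated with a minimal sketch. It assumes each training sample already has a per-sample score (for example a CHES score computed as defined in the paper, which is not reproduced here) and that higher scores mark the samples more likely to cause displacement. The function name `filter_by_score` and the `drop_fraction` parameter are hypothetical, chosen for illustration only.

```python
def filter_by_score(samples, scores, drop_fraction=0.05):
    """Drop the `drop_fraction` of samples with the highest scores.

    Assumption: a higher score means the sample is more problematic.
    The original order of the kept samples is preserved.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    n_drop = int(len(samples) * drop_fraction)
    # Indices sorted by ascending score; the tail holds the highest scores.
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    keep = sorted(order[: len(samples) - n_drop])
    return [samples[i] for i in keep]


# Toy data: sample "b" has the highest (hypothetical) score and is removed.
samples = ["a", "b", "c", "d"]
scores = [0.1, 5.0, -2.0, 0.3]
kept = filter_by_score(samples, scores, drop_fraction=0.25)
print(kept)
```

The fraction to drop is a tuning choice; nothing in this summary specifies a value.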
Abstract
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer "No" over "Never" can sharply increase the probability of "Yes". Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can…
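To see why the preferred response's likelihood can fall while training still "succeeds", it helps to look at the standard DPO objective, which depends only on the margin between the preferred and dispreferred log-probability ratios. The sketch below uses the commonly cited DPO loss for a single preference pair; the log-probability values and `beta=0.1` are made-up numbers for illustration, not results from the paper.

```python
import math


def dpo_loss(logp_pref, logp_dispref, ref_logp_pref, ref_logp_dispref, beta=0.1):
    """DPO loss for one (prompt, preferred, dispreferred) triple.

    Inputs are sequence log-probabilities under the trained policy
    and under the frozen reference model.
    """
    margin = (logp_pref - ref_logp_pref) - (logp_dispref - ref_logp_dispref)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Illustrative (made-up) reference log-probabilities.
ref_pref, ref_dispref = -2.0, -2.5

# At initialization the policy equals the reference, so the margin is zero.
before = dpo_loss(-2.0, -2.5, ref_pref, ref_dispref)

# Later, BOTH log-probabilities have dropped, the dispreferred one by more.
after = dpo_loss(-3.0, -6.0, ref_pref, ref_dispref)

print(f"loss before: {before:.4f}, loss after: {after:.4f}")
```

In this toy case the loss decreases even though the preferred response's log-probability fell from -2.0 to -3.0. The loss alone says nothing about where the lost probability mass goes, which is the question the paper studies.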
