Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Noam Razin; Sadhika Malladi; Adithya Bhaskar; Danqi Chen; Sanjeev Arora; Boris Hanin

arXiv:2410.08847·cs.LG·April 29, 2025

TL;DR

This paper investigates the counter-intuitive phenomenon of likelihood displacement in direct preference optimization, revealing its causes, implications, and ways to mitigate unintentional unalignment in language models.

Contribution

It provides a theoretical and empirical analysis of likelihood displacement, introduces the CHES score to identify problematic training samples, and demonstrates mitigation strategies.

Findings

1. Likelihood displacement can cause models to shift probability mass away from preferred responses.

2. Filtering training data based on CHES scores reduces likelihood displacement and improves alignment.

3. Likelihood displacement is driven by preference pairs whose preferred and dispreferred responses induce similar hidden embeddings, as measured by the centered hidden embedding similarity (CHES) score.

Abstract

Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $No$ over $Never$ can sharply increase the probability of $Yes$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can…
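To make the phenomenon concrete, here is a minimal sketch of the standard per-sample DPO loss, showing that the loss can go down even while the log-likelihood of the preferred response also goes down. This is not taken from the paper's repository; the function name `dpo_loss`, the value `beta=0.1`, and all log-probability values are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-sample DPO loss: -log(sigmoid(beta * margin)).

    All arguments are sequence log-probabilities (hypothetical values here).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Hypothetical numbers: at the start the policy equals the reference model.
before = dpo_loss(logp_chosen=-2.0, logp_rejected=-2.5,
                  ref_logp_chosen=-2.0, ref_logp_rejected=-2.5)

# After some training, BOTH log-probabilities have dropped, but the
# dispreferred one dropped more, so the loss still improves.
after = dpo_loss(logp_chosen=-3.0, logp_rejected=-6.0,
                 ref_logp_chosen=-2.0, ref_logp_rejected=-2.5)

print(f"loss before: {before:.4f}, loss after: {after:.4f}")
```

Because the loss depends only on the difference between the two log-ratio terms, it does not by itself constrain where the probability mass removed from the preferred response ends up, which is the gap the abstract describes.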

Code & Models

Repositories

princeton-nlp/unintentional-unalignment
PyTorch · Official

Taxonomy

Topics: Economic and Environmental Valuation