Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

Paul Gölz, Nika Haghtalab, Kunhe Yang

arXiv:2505.23749 · cs.LG · May 30, 2025

TL;DR

This paper introduces distortion, a worst-case measure of how much average user utility an alignment method loses when users have diverse preferences. By this measure, RLHF and DPO can perform poorly in pluralistic settings, while Nash Learning from Human Feedback is minimax optimal.

Contribution

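The paper models each user's pairwise comparisons with an individual Bradley-Terry (BT) model and, drawing on social choice theory, defines an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. This framework exposes the limitations of existing methods.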
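
As a rough illustration of the modeling assumption, the sketch below computes one user's Bradley-Terry comparison probability and the population-level probability obtained by averaging over users. The function names, the toy utilities, and the convention that the temperature `beta` multiplies the utility difference are assumptions made for this example, not details taken from the paper.

```python
import math

def bt_prob(u_a, u_b, beta=1.0):
    """Probability that one user prefers a over b under a Bradley-Terry model.

    Assumed convention: sigmoid(beta * (u_a - u_b)).
    """
    return 1.0 / (1.0 + math.exp(-beta * (u_a - u_b)))

def population_pref(utilities, a, b, beta=1.0):
    """Average, over all users, of the probability that a is preferred to b."""
    return sum(bt_prob(u[a], u[b], beta) for u in utilities) / len(utilities)

# Hypothetical population: one user strongly prefers x, two mildly prefer y.
users = [
    {"x": 1.0, "y": 0.0},
    {"x": 0.0, "y": 0.6},
    {"x": 0.0, "y": 0.6},
]
p = population_pref(users, "x", "y", beta=5.0)
print(f"P(x preferred to y) across the population: {p:.3f}")
```

The averaged probability is what a learner sees when comparisons are pooled across users; it need not coincide with any single user's BT model.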

Findings

01

Nash Learning from Human Feedback achieves the minimax-optimal distortion of roughly half the BT temperature β, i.e. (1/2 + o(1))·β.

02

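RLHF and DPO incur distortion of at least (1 − o(1))·β, where β is the BT temperature, even without a KL constraint, and distortion that is exponential in β or even unbounded in the full setting with a KL constraint.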

03

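For RLHF and DPO, distortion depends strongly on how comparison pairs are sampled, whereas the guarantee for Nash Learning from Human Feedback holds robustly across utility distributions and comparison-pair distributions.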

Abstract

After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average -- a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility, and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax…
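
To make the ratio in the definition concrete, here is a minimal sketch that evaluates it on a single toy instance. Distortion as defined above is the worst case of this ratio over instances, so the code shows only the inner quantity. The helper names and the toy utilities are hypothetical, and the sketch assumes no KL constraint, so the optimal policy can put all its mass on one alternative.

```python
def average_utility(policy, utilities):
    """Expected utility of a policy (alternative -> probability), averaged over users."""
    return sum(
        prob * sum(u[alt] for u in utilities) / len(utilities)
        for alt, prob in policy.items()
    )

def utility_ratio(policy, utilities):
    """Optimal average utility divided by the policy's average utility (one instance)."""
    alternatives = utilities[0].keys()
    # Average utility is linear in the policy, so without a KL constraint
    # the optimum is attained by a deterministic policy.
    best = max(average_utility({alt: 1.0}, utilities) for alt in alternatives)
    achieved = average_utility(policy, utilities)
    return float("inf") if achieved == 0 else best / achieved

# Hypothetical population: one user strongly prefers x, two mildly prefer y.
users = [
    {"x": 1.0, "y": 0.0},
    {"x": 0.0, "y": 0.6},
    {"x": 0.0, "y": 0.6},
]
ratio_x = utility_ratio({"x": 1.0}, users)  # policy that always outputs x
ratio_y = utility_ratio({"y": 1.0}, users)  # policy that always outputs y
print(ratio_x, ratio_y)
```

A ratio of 1 means the policy is optimal for average utility on this instance; larger values mean more utility is lost.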

Videos

Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences? (SlidesLive)

Taxonomy

Topics: Recommender Systems and Techniques · Mobile Crowdsensing and Crowdsourcing · Constraint Satisfaction and Optimization

Methods: Direct Preference Optimization