Uncertainty-Penalized Direct Preference Optimization

Sam Houliston; Alizée Pace; Alexander Immer; Gunnar Rätsch

arXiv:2410.20187·cs.LG·October 29, 2024

Open Access

TL;DR

This paper introduces a pessimistic preference optimization framework for aligning large language models with human preferences. It penalizes uncertain preference pairs to counter overoptimization and reward hacking, and reports improved alignment performance over vanilla DPO.

Contribution

The paper proposes uncertainty penalization schemes for DPO, inspired by offline reinforcement learning, that attenuate the influence of ambiguous or mislabeled preference pairs during alignment.

Findings

01 Enhanced alignment performance over vanilla DPO

02 Better handling of high-uncertainty preference pairs

03 Improved response quality on ambiguous prompts

Abstract

Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain…
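
To make the idea of "attenuating the loss gradient for uncertain samples" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and the abstract as shown here does not spell out the exact penalization schemes. The sketch assumes one simple additive form, in which a non-negative uncertainty estimate `uncertainty` (taken as a given input) scaled by a hypothetical coefficient `alpha` is added to the DPO margin inside the sigmoid. All function names and parameters are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO margin: beta times the difference of policy/reference
    log-ratios for the chosen (w) and rejected (l) responses."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def dpo_loss(margin: float) -> float:
    """Vanilla DPO loss for one preference pair: -log sigmoid(margin)."""
    return -log_sigmoid(margin)


def penalized_dpo_loss(margin: float, uncertainty: float, alpha: float = 1.0) -> float:
    """ASSUMED additive penalization (illustrative, not taken from the paper):
    shifting the margin by alpha * uncertainty makes the loss saturate
    sooner for uncertain pairs."""
    return -log_sigmoid(margin + alpha * uncertainty)


def grad_magnitude(margin: float, uncertainty: float = 0.0, alpha: float = 1.0) -> float:
    """|d loss / d margin| = sigmoid(-(margin + alpha * uncertainty)).
    This shrinks as the uncertainty grows, so uncertain pairs pull less."""
    return sigmoid(-(margin + alpha * uncertainty))
```

Under this assumed form, a pair with zero uncertainty recovers the vanilla DPO loss, while a pair with high uncertainty contributes a smaller gradient at the same margin. Other forms, for example a multiplicative down-weighting of the margin, would be equally valid illustrations of the same principle.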

Taxonomy

Topics: Multi-Criteria Decision Making

Methods: Direct Preference Optimization