Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

Yuzhong Hong; Hanshan Zhang; Junwei Bao; Hongfei Jiang; Yang Song

arXiv:2412.13862·cs.LG·December 19, 2024

TL;DR

This paper introduces an energy-based preference model with a contrastive loss function that ensures a unique maximum likelihood estimator, leading to improved offline alignment of language models compared to the Bradley-Terry preference model.

Contribution

The paper proposes an energy-based preference model and a contrastive loss (EPA) that guarantees a unique MLE, addressing limitations of the Bradley-Terry model in offline alignment tasks.

Findings

01. EPA outperforms DPO on open benchmarks
02. Energy-based model ensures a unique MLE
03. Contrastive loss reduces approximation error

Abstract

Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based model (EBM) that always…
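The non-uniqueness of the Bradley-Terry MLE mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's construction. It assumes three hypothetical responses `y1`, `y2`, `y3` to a single prompt, scalar rewards held in a plain dictionary, and a dataset in which `y3` is never compared. The function name `bt_nll` and all values are illustrative assumptions. Because the Bradley-Terry likelihood depends only on reward differences within observed pairs, the reward of the uncompared response is unconstrained, so two reward assignments that differ by more than a constant shift reach the same minimal loss.

```python
import math

def bt_nll(rewards, comparisons):
    """Average Bradley-Terry negative log-likelihood.

    rewards: dict mapping a response id to a scalar reward (hypothetical).
    comparisons: list of (winner, loser) response-id pairs.
    """
    total = 0.0
    for winner, loser in comparisons:
        margin = rewards[winner] - rewards[loser]
        # -log(sigmoid(margin)) == log(1 + exp(-margin))
        total += math.log1p(math.exp(-margin))
    return total / len(comparisons)

# Toy data: y1 beats y2 twice, y2 beats y1 once; y3 is never compared.
comparisons = [("y1", "y2"), ("y1", "y2"), ("y2", "y1")]

# With a 2:1 win ratio the loss is minimized when sigmoid(margin) = 2/3,
# i.e. when r(y1) - r(y2) = log 2.
best_margin = math.log(2.0)

# Two reward assignments that agree on the observed pair but not on y3.
rewards_a = {"y1": best_margin, "y2": 0.0, "y3": -5.0}
rewards_b = {"y1": best_margin, "y2": 0.0, "y3": 5.0}

loss_a = bt_nll(rewards_a, comparisons)
loss_b = bt_nll(rewards_b, comparisons)

# A nearby margin gives a strictly larger loss, so both are minimizers.
loss_off = bt_nll({"y1": best_margin + 0.3, "y2": 0.0, "y3": 0.0}, comparisons)

print("same loss:", loss_a == loss_b)
print("both beat a perturbed margin:", loss_a < loss_off)
```

In this toy setting `rewards_a` and `rewards_b` are both maximum likelihood estimates, yet their difference is not constant across responses, so at most one of them could be a shifted copy of any given true reward.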

Taxonomy

Topics: Economic and Environmental Valuation

Methods: energy-based model · Direct Preference Optimization