Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives

Hao Sun; Yunyi Shen; Jean-Francois Ton

arXiv:2411.04991·cs.AI·January 28, 2025

TL;DR

This paper critically examines the use of Bradley-Terry models in reward modeling for LLM alignment, providing theoretical foundations, highlighting limitations, and proposing an alternative approach based on order consistency, supported by extensive empirical evaluation.

Contribution

The paper revisits the theoretical basis of BT models in reward modeling, introduces an order-preserving alternative, and empirically compares multiple methods across diverse settings.

Findings

01 BT models have a solid theoretical foundation but are not necessary for effective reward modeling.

02 An order-consistent alternative can match or outperform BT models in practice.

03 Extensive experiments demonstrate the practical viability of the proposed approach across various datasets and models.
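The order-consistency idea in the findings above can be illustrated with a small sketch. This is not the authors' objective or code (this page does not spell it out); it only shows the property that a reward needs to get the ranking right, not the exact values. The function name `pairwise_order_agreement` and all reward values are made up for illustration.

```python
import numpy as np

def pairwise_order_agreement(reference, candidate):
    """Fraction of pairs (i, j), i < j, that both score vectors order the same way."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    i, j = np.triu_indices(len(reference), k=1)  # all index pairs with i < j
    ref_sign = np.sign(reference[i] - reference[j])
    cand_sign = np.sign(candidate[i] - candidate[j])
    return float(np.mean(ref_sign == cand_sign))

# Hypothetical "true" rewards for four responses (illustrative values only).
true_rewards = np.array([0.1, 0.7, -0.3, 1.2])

# A strictly increasing transform changes every value but keeps the ranking.
monotone_rewards = np.tanh(3.0 * true_rewards) + 5.0

agreement_monotone = pairwise_order_agreement(true_rewards, monotone_rewards)
agreement_reversed = pairwise_order_agreement(true_rewards, -true_rewards)
```

Under these assumptions, the monotone transform agrees with the reference on every pair, while the negated rewards agree on none, even though neither candidate reproduces the reference values.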

Abstract

The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclear why this model, originally developed for multi-player stochastic game matching, can be adopted to convert pairwise response comparisons into reward values and make predictions, especially given that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first revisit the foundations of using BT models in reward modeling and establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. Although the BT model is theoretically sound, we argue that it is not a necessary choice from the perspective of downstream optimization. This is because a reward model only needs to preserve the correct ranking…
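For readers unfamiliar with how a BT model turns pairwise comparisons into reward values, the standard formulation sets the probability that response a is preferred over response b to sigmoid(r_a - r_b) and fits rewards by minimizing the negative log-likelihood of the observed preferences. The sketch below shows that textbook loss in NumPy; it is a minimal illustration, not the authors' implementation, and the function names and reward values are hypothetical.

```python
import numpy as np

def bt_win_probability(reward_a, reward_b):
    """Standard BT probability that a is preferred over b: sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + np.exp(-(reward_a - reward_b)))

def bt_negative_log_likelihood(reward_chosen, reward_rejected):
    """Mean of -log sigmoid(r_chosen - r_rejected) over comparison pairs."""
    diff = np.asarray(reward_chosen, dtype=float) - np.asarray(reward_rejected, dtype=float)
    # -log sigmoid(d) = log(1 + exp(-d)); logaddexp keeps this numerically stable.
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Illustrative rewards for three (chosen, rejected) pairs; values are made up.
good_fit = bt_negative_log_likelihood([2.0, 1.5, 3.0], [0.0, -1.0, 0.5])
poor_fit = bt_negative_log_likelihood([0.0, -1.0, 0.5], [2.0, 1.5, 3.0])
```

In this sketch the loss is lower when the chosen responses receive higher rewards than the rejected ones, and equal rewards give a win probability of exactly one half.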

Code & Models

Repositories

holarissun/rewardmodelingbeyondbradleyterry (PyTorch, official)

Taxonomy

Topics: Decision-Making and Behavioral Economics · Economic and Environmental Valuation

Methods: Balanced Selection