Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

Shang Liu; Yu Pan; Guanting Chen; Xiaocheng Li

arXiv:2411.12843 · cs.LG · November 21, 2024

TL;DR

This paper introduces a new framework for learning reward models from ordinal feedback, leveraging the wisdom of the crowd to utilize more nuanced human preferences and improve alignment of large language models.

Contribution

It generalizes the Bradley-Terry model to ordinal feedback, providing a probabilistic framework and theoretical analysis that demonstrate the benefits of fine-grained preference data.

Findings

01 Ordinal feedback reduces Rademacher complexity compared to binary feedback.

02 Fine-grained feedback improves reward model accuracy in various settings.

03 Incorporating tied preferences enhances reward learning.

Abstract

Learning a reward model (RM) from human preferences has been an important component in aligning large language models (LLMs). The canonical setup of learning RMs from pairwise preference data is rooted in the classic Bradley-Terry (BT) model that accepts binary feedback, i.e., the label being either Response 1 is better than Response 2, or the opposite. Such a setup inevitably discards potentially useful samples (such as "tied" between the two responses) and loses more fine-grained information (such as "slightly better"). In this paper, we propose a framework for learning RMs under ordinal feedback which generalizes the case of binary preference feedback to any arbitrary granularity. Specifically, we first identify a marginal unbiasedness condition, which generalizes the assumption of the BT model in the existing binary feedback setting. The condition validates itself via the…
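
The contrast between binary and ordinal labels described in the abstract can be illustrated with a minimal sketch. It assumes that each ordinal label is mapped to a number z in [0, 1] and used as a soft target in the usual Bradley-Terry cross-entropy; binary feedback is then the special case where z is 0 or 1. The label names and numeric values in `ORDINAL_LABELS`, and the function names, are hypothetical choices made for this illustration and are not taken from the paper.

```python
import math

# Hypothetical mapping from ordinal feedback to a soft label z in [0, 1].
# z = 1.0 means "Response 1 is better"; z = 0.0 means "Response 2 is better".
# These values are illustrative assumptions, not the paper's.
ORDINAL_LABELS = {
    "better": 1.0,
    "slightly better": 0.75,
    "tie": 0.5,
    "slightly worse": 0.25,
    "worse": 0.0,
}


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def ordinal_bt_loss(r1: float, r2: float, z: float) -> float:
    """Cross-entropy between the soft label z and the Bradley-Terry
    probability sigmoid(r1 - r2) that Response 1 beats Response 2.

    r1, r2 are the reward model's scalar scores for the two responses.
    With z in {0, 1} this is the standard binary Bradley-Terry loss.
    """
    if not 0.0 <= z <= 1.0:
        raise ValueError("z must lie in [0, 1]")
    d = r1 - r2
    return -(z * log_sigmoid(d) + (1.0 - z) * log_sigmoid(-d))


if __name__ == "__main__":
    # A tied pair still contributes a training signal: the loss is
    # smallest when the two rewards are equal.
    for label, z in ORDINAL_LABELS.items():
        print(label, round(ordinal_bt_loss(1.0, 0.0, z), 4))
```

Under this sketch, a "tie" sample pulls the two rewards together instead of being discarded, and a "slightly better" sample pushes them apart less strongly than a "better" sample does.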

Taxonomy

Topics: Diverse Scientific and Economic Studies

Methods: Knowledge Distillation