Reviving The Classics: Active Reward Modeling in Large Language Model Alignment

Yunyi Shen; Hao Sun; Jean-François Ton

arXiv:2502.04354·cs.CL·February 10, 2025

Open Access · 1 Repo

TL;DR

This paper introduces Fisher information-based strategies for selecting informative human preference pairs in reward modeling for large language models, improving annotation efficiency and model alignment.

Contribution

It adapts classical experimental design principles to active reward modeling, enabling efficient and stable selection of comparison pairs in LLM alignment.

Findings

01 Method outperforms existing selection strategies in accuracy and efficiency

02 Incorporating cross-prompt comparisons enhances labeling efficiency

03 Demonstrates robustness across multiple LLMs and datasets

Abstract

Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploration of the representation space and make informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and efficiently prioritizing the comparisons to be annotated. To address this, we propose the Fisher information-based selection strategies, adapt theories from the classical experimental design literature, and apply them to the final linear layer of the deep neural network-based…
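
The idea of scoring comparisons by Fisher information on a final linear layer can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Bradley-Terry preference model with a linear reward head on fixed embeddings, and it uses a greedy D-optimal (log-determinant) criterion as one classical experimental-design choice. The names `pair_fisher_information`, `greedy_d_optimal`, and the `ridge` parameter are hypothetical and introduced only for this example.

```python
import itertools

import numpy as np


def pair_fisher_information(x_i, x_j, w):
    """Fisher information of one comparison w.r.t. the linear weights w.

    Assumes a Bradley-Terry model: P(i preferred over j) = sigmoid((x_i - x_j) @ w).
    """
    d = x_i - x_j
    p = 1.0 / (1.0 + np.exp(-(d @ w)))
    # p * (1 - p) peaks when the predicted rewards are close, so the
    # comparison outcome is uncertain; outer(d, d) rewards pairs whose
    # embedding difference spans a direction not yet covered.
    return p * (1.0 - p) * np.outer(d, d)


def greedy_d_optimal(embeddings, candidate_pairs, w, budget, ridge=1e-3):
    """Greedily pick pairs that maximize log det of the accumulated information.

    The ridge term (an assumption of this sketch) keeps the matrix invertible
    before enough pairs have been selected.
    """
    dim = embeddings.shape[1]
    info = ridge * np.eye(dim)
    remaining = list(candidate_pairs)
    chosen = []
    for _ in range(min(budget, len(remaining))):
        best_k, best_logdet = None, -np.inf
        for k, (i, j) in enumerate(remaining):
            cand = info + pair_fisher_information(embeddings[i], embeddings[j], w)
            _, logdet = np.linalg.slogdet(cand)
            if logdet > best_logdet:
                best_k, best_logdet = k, logdet
        i, j = remaining.pop(best_k)
        info = info + pair_fisher_information(embeddings[i], embeddings[j], w)
        chosen.append((i, j))
    return chosen


# Toy usage with random stand-ins for response embeddings and current weights.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(12, 4))   # 12 responses, 4-dim features
w = rng.normal(size=4)                  # current estimate of the reward head
pairs = list(itertools.combinations(range(12), 2))
selected = greedy_d_optimal(embeddings, pairs, w, budget=5)
print(selected)
```

In this sketch the candidate set contains every pair of responses, regardless of which prompt produced them; restricting `pairs` to responses that share a prompt would give an in-prompt variant. The exhaustive inner loop is quadratic in the number of responses and is meant only to make the criterion concrete.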

Code & Models

Repositories

holarissun/rewardmodelingbeyondbradleyterry
pytorch

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Linear Layer