Reviving The Classics: Active Reward Modeling in Large Language Model Alignment
Yunyi Shen, Hao Sun, Jean-François Ton

TL;DR
This paper introduces Fisher information-based strategies for selecting informative human preference pairs in reward modeling for large language models, improving annotation efficiency and model alignment.
Contribution
It adapts classical experimental design principles to active reward modeling, enabling efficient and stable selection of comparison pairs in LLM alignment.
Findings
The proposed selection strategies outperform existing ones in accuracy and annotation efficiency
Incorporating cross-prompt comparisons enhances labeling efficiency
The approach is robust across multiple LLMs and datasets
Abstract
Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploration of the representation space and make informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and efficiently prioritizing the comparisons to be annotated. To address this, we propose the Fisher information-based selection strategies, adapt theories from the classical experimental design literature, and apply them to the final linear layer of the deep neural network-based…
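The idea of scoring candidate pairs by Fisher information on the final linear layer can be illustrated with a minimal sketch. It assumes a Bradley-Terry preference model with a linear reward head, so that a pair (i, j) with last-layer feature difference d and preference probability p contributes p(1 - p) d dᵀ to the information matrix; the term p(1 - p) favors moderate reward differences, while d dᵀ rewards spread in the representation space. The greedy D-optimal rule, the function name `greedy_d_optimal_pairs`, and the `ridge` parameter are illustrative assumptions, not necessarily the selection rule used in the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def greedy_d_optimal_pairs(features, w, candidate_pairs, budget, ridge=1e-3):
    """Greedily pick pairs that most increase the log-determinant of the
    Bradley-Terry Fisher information (hypothetical helper, for illustration).

    features: (n_items, dim) last-layer embeddings of candidate responses
    w: (dim,) current weights of the linear reward head
    candidate_pairs: list of (i, j) index pairs into features
    """
    dim = features.shape[1]
    # Inverse of the regularized information matrix, starting at (ridge * I)^-1.
    info_inv = np.eye(dim) / ridge
    remaining = list(candidate_pairs)
    chosen = []
    for _ in range(min(budget, len(remaining))):
        diffs = np.array([features[i] - features[j] for i, j in remaining])
        p = sigmoid(diffs @ w)
        weights = p * (1.0 - p)  # largest when the predicted preference is near 0.5
        # Matrix determinant lemma: log-det gain = log(1 + s * d^T A^-1 d).
        quad = np.einsum("nd,de,ne->n", diffs, info_inv, diffs)
        gains = np.log1p(weights * quad)
        best = int(np.argmax(gains))
        d, s = diffs[best], weights[best]
        # Sherman-Morrison rank-one update of the inverse.
        Ad = info_inv @ d
        info_inv = info_inv - s * np.outer(Ad, Ad) / (1.0 + s * (d @ Ad))
        chosen.append(remaining.pop(best))
    return chosen
```

In use, `features` would come from the reward model's penultimate layer and `w` from its current linear head; the returned pairs are the ones sent for annotation. Under this sketch, a pair with identical embeddings or a near-certain predicted preference receives a gain close to zero and is deprioritized.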
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Linear Layer
