Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Jacob Eisenstein; Chirag Nagpal; Alekh Agarwal; Ahmad Beirami; Alex D'Amour; DJ Dvijotham; Adam Fisch; Katherine Heller; Stephen Pfohl; Deepak Ramachandran; Peter Shaw; Jonathan Berant

arXiv:2312.09244·cs.LG·August 20, 2024·2 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper investigates the use of reward model ensembles to mitigate reward hacking in language model alignment, finding they help but do not fully eliminate the problem due to reward model underspecification.

Contribution

It demonstrates that reward ensembles reduce overoptimization and improve robustness, but underspecification still allows reward hacking phenomena to persist.

Findings

01  Reward models are underspecified and can give different rewards under distribution shift.

02  Reward ensembles mitigate overoptimization and improve generalization.

03  Ensembles do not fully eliminate reward hacking phenomena.

Abstract

Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed reward hacking. A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. We explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). First, we show that reward models are underspecified: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. Second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured…
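The aggregation and reranking ideas in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper or its repository: the functions `aggregate`, `ensemble_reward`, and `rerank`, the choice of mean, median, and min as aggregators, and the toy lambda "reward models" that stand in for trained models.

```python
from statistics import mean, median
from typing import Callable, Sequence

# A reward model maps (prompt, response) to a scalar score (assumed interface).
RewardFn = Callable[[str, str], float]


def aggregate(scores: Sequence[float], how: str = "mean") -> float:
    """Combine per-model scores into one ensemble reward."""
    if how == "mean":
        return mean(scores)
    if how == "median":
        return median(scores)
    if how == "min":
        # Conservative choice: trust the most pessimistic ensemble member.
        return min(scores)
    raise ValueError(f"unknown aggregator: {how}")


def ensemble_reward(prompt: str, response: str,
                    reward_models: Sequence[RewardFn],
                    how: str = "mean") -> float:
    """Score one response with every ensemble member, then aggregate."""
    return aggregate([rm(prompt, response) for rm in reward_models], how)


def rerank(prompt: str, candidates: Sequence[str],
           reward_models: Sequence[RewardFn],
           how: str = "mean") -> str:
    """Inference-time alignment: return the candidate with the highest ensemble reward."""
    return max(candidates,
               key=lambda c: ensemble_reward(prompt, c, reward_models, how))


# Toy stand-ins for trained reward models (illustrative assumptions only).
toy_models = [
    lambda p, r: float(len(r)),                    # favors long responses
    lambda p, r: -float(len(r)),                   # penalizes long responses
    lambda p, r: 1.0 if "thanks" in r else 0.0,    # favors a keyword
]

best_mean = rerank("q", ["ok", "thanks a lot"], toy_models, how="mean")
best_min = rerank("q", ["ok", "thanks a lot"], toy_models, how="min")
```

In this toy setup the members disagree sharply about response length, so the mean and min aggregators select different candidates. This loosely mirrors the point that models which look similar in-distribution can diverge once they are used to select or optimize outputs.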

Code & Models

Repositories

google-deepmind/reward-ensembles
Official

Datasets

taesiri/arxiv_qa
dataset · 193 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)