InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Yuchun Miao; Sen Zhang; Liang Ding; Rong Bao; Lefei Zhang; Dacheng Tao

arXiv:2402.09345 · cs.LG · November 4, 2024 · 1 citation

TL;DR

This paper introduces InfoRM, an information-theoretic framework for reward modeling in RLHF that reduces reward hacking by filtering irrelevant information and detecting overoptimization through latent space analysis.

Contribution

It proposes a variational information bottleneck approach for reward modeling and introduces the Cluster Separation Index for online detection of reward overoptimization.
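
To make the information bottleneck idea concrete, the following is a minimal sketch of a generic VIB-style preference loss, not the authors' implementation. It assumes a Gaussian latent code, a standard normal prior, and a Bradley-Terry preference term; the function names, the linear `reward_head`, and the trade-off weight `beta` are hypothetical and chosen only for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterization: s = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def reward_head(latent, weights, bias=0.0):
    """Hypothetical linear head mapping a latent code to a scalar reward."""
    return sum(w * s for w, s in zip(weights, latent)) + bias


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def vib_preference_loss(chosen, rejected, weights, beta, rng):
    """Preference term plus beta times a compression (KL) term.

    `chosen` and `rejected` are (mu, log_var) pairs that some encoder
    (assumed, not shown) would produce for the two responses.
    """
    r_chosen = reward_head(sample_latent(chosen[0], chosen[1], rng), weights)
    r_rejected = reward_head(sample_latent(rejected[0], rejected[1], rng), weights)
    preference_nll = -log_sigmoid(r_chosen - r_rejected)  # Bradley-Terry term
    compression = kl_to_standard_normal(*chosen) + kl_to_standard_normal(*rejected)
    return preference_nll + beta * compression


rng = random.Random(0)
chosen = ([0.8, -0.1], [-1.0, -1.0])    # toy encoder outputs (assumed values)
rejected = ([-0.5, 0.2], [-1.0, -1.0])
loss = vib_preference_loss(chosen, rejected, weights=[1.0, 0.5], beta=0.1, rng=rng)
```

In this sketch a larger `beta` pushes the latent code toward the prior, which is one way to discard input information that does not help predict the preference label. The paper's exact objective and architecture may differ.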

Findings

01. InfoRM mitigates reward hacking across a range of reward model scales.

02. Outliers in InfoRM's IB latent space correlate with reward overoptimization, which makes the latent space usable for detection.

03. The Cluster Separation Index reliably indicates reward overoptimization across diverse datasets.
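
As a rough illustration of what "outliers in the latent space" could mean in practice, the sketch below flags latent codes that lie far from the centroid of a reference set. This is a generic distance-to-centroid heuristic, not the paper's Cluster Separation Index; the choice of reference set, the `scale` threshold, and all names are assumptions made for this example.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def outlier_fraction(reference, candidates, scale=3.0):
    """Fraction of candidates farther from the reference centroid than
    `scale` times the mean reference-to-centroid distance."""
    c = centroid(reference)
    typical = sum(math.dist(p, c) for p in reference) / len(reference)
    threshold = scale * typical
    flagged = [p for p in candidates if math.dist(p, c) > threshold]
    return len(flagged) / len(candidates)


# Toy latent codes: one candidate sits inside the reference cluster, one far away.
reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[0.5, 0.5], [10.0, 10.0]]
fraction = outlier_fraction(reference, candidates)
```

Tracking such a fraction over the course of RLHF training is one simple way to monitor drift in the latent space; the CSI defined in the paper is a separate metric and should be taken from the paper or the official repository.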

Abstract

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge. This issue primarily arises from reward misgeneralization, where reward models (RMs) compute reward using spurious features that are irrelevant to human preferences. In this work, we tackle this problem from an information-theoretic perspective and propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective to filter out irrelevant information. Notably, we further identify a correlation between overoptimization and outliers in the IB latent space of InfoRM, establishing it as a promising tool for detecting reward overoptimization. Inspired by this finding, we propose the Cluster Separation Index (CSI), which quantifies deviations…

Code & Models

Repositories

miaoyuchun/inform (PyTorch, official)

Videos

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling (SlidesLive)

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Software Engineering Research · Information and Cyber Security