Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Thomas Kwa; Drake Thomas; Adrià Garriga-Alonso

arXiv:2407.14503·cs.LG·November 11, 2024

TL;DR

This paper shows that KL regularization in reinforcement learning from human feedback (RLHF) does not prevent reward hacking when reward error is heavy-tailed: a policy can obtain arbitrarily high reward while achieving no more utility than the base model.

Contribution

It introduces the concept of catastrophic Goodhart, in which heavy-tailed reward error lets policies exploit reward misspecification despite KL regularization, and it adapts a discrete optimization method to measure the tails of reward models.

Findings

01 Reward error in the measured reward models is consistent with light tails.

02 Heavy-tailed distributions are pervasive in many real-world applications.

03 With heavy-tailed error, KL regularization fails to prevent reward hacking.

Abstract

When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model--a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world…
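The contrast between light-tailed and heavy-tailed error can be illustrated with a toy simulation. The sketch below is not the paper's method or code: it swaps KL-regularized RL for best-of-n selection, a simpler form of optimization pressure. It also assumes that true utility is standard normal, that proxy reward is utility plus independent error, and that the error is either standard normal (light-tailed) or standard Cauchy (heavy-tailed). The function names and parameter values are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def best_of_n_utility(sample_error, n, trials, rng):
    """Mean true utility of the draw with the highest proxy reward among n draws."""
    chosen = []
    for _ in range(trials):
        best_reward, best_utility = -math.inf, 0.0
        for _ in range(n):
            utility = rng.gauss(0.0, 1.0)         # assumed: utility ~ N(0, 1)
            reward = utility + sample_error(rng)  # proxy reward = utility + error
            if reward > best_reward:
                best_reward, best_utility = reward, utility
        chosen.append(best_utility)
    return statistics.fmean(chosen)


def light_error(rng):
    # Assumed light-tailed error: standard normal.
    return rng.gauss(0.0, 1.0)


def heavy_error(rng):
    # Assumed heavy-tailed error: standard Cauchy, sampled via the inverse CDF.
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)  # fixed seed so the run is repeatable
light = best_of_n_utility(light_error, n=1000, trials=200, rng=rng)
heavy = best_of_n_utility(heavy_error, n=1000, trials=200, rng=rng)
print(f"mean utility of selected sample, light-tailed error: {light:.2f}")
print(f"mean utility of selected sample, heavy-tailed error: {heavy:.2f}")
```

Under these assumptions, selecting on proxy reward raises utility well above the base mean of 0 when the error is normal. When the error is Cauchy, the winning sample is almost always chosen for its error term, so its utility stays close to the base mean.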

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

tkwa/catastrophic-goodhart (PyTorch, official)

Videos

No videos yet.

Taxonomy

Topics: Obsessive-Compulsive Spectrum Disorders · Occupational and Professional Licensing Regulation · Diverse Scientific and Economic Studies

Methods: Balanced Selection