Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

Cassidy Laidlaw; Shivam Singhal; Anca Dragan

arXiv:2403.03185 · cs.LG · March 14, 2025 · 1 citation

TL;DR

This paper defines reward hacking in terms of the correlation between proxy and true rewards on the states and actions visited by a reference policy, and proposes a regularization method that mitigates it, with applications to reinforcement learning from human feedback (RLHF).

Contribution

The paper gives a formal, correlation-based definition of reward hacking and shows that regularizing a policy's occupancy measure prevents reward hacking more effectively than a KL penalty.

Findings

01. The correlation-based definition captures reward hacking behavior across several realistic settings

02. Regularizing occupancy measures effectively mitigates reward hacking

03. The proposed method outperforms a KL penalty in practice

Abstract

Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "reference policy" that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). Using our formulation, we…
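
To make the idea in the abstract concrete, the sketch below measures how well a proxy reward tracks the true reward on states sampled from a reference policy, and again on states sampled from a policy that has been optimized against the proxy. It is an illustration only, not the paper's formal definition, which is cut off in the abstract above. The toy reward tables, the sampled state lists, and the function name `proxy_true_correlation` are all hypothetical, and plain Pearson correlation is used as an assumed stand-in for whatever correlation measure the paper adopts.

```python
import numpy as np


def proxy_true_correlation(proxy_rewards, true_rewards):
    """Pearson correlation between proxy and true rewards over sampled visits.

    Hypothetical helper: returns 0.0 when either reward is constant on the
    sample, which is a convention chosen here for the sketch.
    """
    proxy = np.asarray(proxy_rewards, dtype=float)
    true = np.asarray(true_rewards, dtype=float)
    if proxy.ndim != 1 or proxy.shape != true.shape:
        raise ValueError("expected two 1-D arrays of equal length")
    p = proxy - proxy.mean()
    t = true - true.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    if denom == 0.0:
        return 0.0
    return float((p * t).sum() / denom)


# Toy reward tables indexed by state (assumed values, not from the paper).
# The proxy agrees with the true reward on states 0-3 but badly
# overestimates state 4, which the reference policy never visits.
true_reward = np.array([0.0, 1.0, 2.0, 3.0, -10.0])
proxy_reward = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# States visited by the reference policy and by a proxy-optimized policy
# (hand-written samples standing in for rollouts).
reference_states = np.array([0, 1, 2, 3, 0, 1, 2, 3])
optimized_states = np.array([3, 4, 4, 4, 4, 4, 3, 4])

ref_corr = proxy_true_correlation(
    proxy_reward[reference_states], true_reward[reference_states]
)
opt_corr = proxy_true_correlation(
    proxy_reward[optimized_states], true_reward[optimized_states]
)

print(f"correlation under reference policy: {ref_corr:.2f}")
print(f"correlation under optimized policy: {opt_corr:.2f}")
```

In this toy setup the proxy is strongly correlated with the true reward on the reference policy's visits, and the correlation turns negative once the optimized policy concentrates on the state where the proxy is wrong. That is the "breaks down under optimization" pattern the abstract describes, shown here under the stated assumptions.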

Code & Models

Repositories

cassidylaidlaw/orpo (PyTorch, official)

Taxonomy

Topics: Anomaly Detection Techniques and Applications

Methods: Balanced Selection