Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw, Shivam Singhal, Anca Dragan

TL;DR
This paper defines reward hacking in terms of the correlation between proxy and true rewards on the states and actions visited by a reference policy, and proposes a regularization method that mitigates it, improving reinforcement learning from human feedback (RLHF).
Contribution
It provides a formal, correlation-based definition of reward hacking and demonstrates that regularizing toward the reference policy's occupancy measure prevents reward hacking more effectively than a KL penalty.
Findings
The correlation-based definition captures reward hacking behavior across several realistic settings, including RLHF
Regularizing occupancy measures mitigates reward hacking effectively
Proposed method outperforms KL penalty in practice
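
To make the idea of regularizing occupancy measures concrete, here is a minimal sketch in a tiny tabular MDP. It is not the authors' implementation. The two-state environment, the function names (`occupancy`, `chi_squared`, `regularized_objective`), the penalty weight `lam`, and the use of a chi-squared divergence are all assumptions made for illustration; the summary above does not say which divergence the paper uses.

```python
# Hypothetical toy MDP: 2 states, 2 actions, deterministic transitions.
# Taking action a always leads to state a. All names here are illustrative.
NEXT_STATE = [[0, 1], [0, 1]]
# Proxy reward: only (state 1, action 1) is rewarded.
PROXY_REWARD = [[0.0, 0.0], [0.0, 1.0]]


def occupancy(policy, next_state, start=0, gamma=0.9, horizon=500):
    """Normalized discounted state-action occupancy measure of a policy.

    policy[s][a] is the probability of taking action a in state s.
    """
    n_states, n_actions = len(policy), len(policy[0])
    state_dist = [0.0] * n_states
    state_dist[start] = 1.0
    occ = [[0.0] * n_actions for _ in range(n_states)]
    weight = 1.0 - gamma  # normalizer so the entries sum to about 1
    for _ in range(horizon):
        new_dist = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                p = state_dist[s] * policy[s][a]
                occ[s][a] += weight * p
                new_dist[next_state[s][a]] += p
        state_dist = new_dist
        weight *= gamma
    return occ


def chi_squared(occ, ref_occ):
    """Chi-squared divergence between two occupancy measures.

    Assumes ref_occ is positive wherever occ is positive.
    """
    total = 0.0
    for row, ref_row in zip(occ, ref_occ):
        for p, q in zip(row, ref_row):
            if q > 0.0:
                total += (p - q) ** 2 / q
    return total


def regularized_objective(policy, ref_policy, reward, lam):
    """Expected proxy reward minus lam times the occupancy divergence."""
    occ = occupancy(policy, NEXT_STATE)
    ref_occ = occupancy(ref_policy, NEXT_STATE)
    proxy_return = sum(
        occ[s][a] * reward[s][a]
        for s in range(len(reward))
        for a in range(len(reward[0]))
    )
    return proxy_return - lam * chi_squared(occ, ref_occ)


REFERENCE = [[0.5, 0.5], [0.5, 0.5]]  # uniform reference policy
GREEDY = [[0.0, 1.0], [0.0, 1.0]]     # always chases the proxy reward

unregularized_gap = (
    regularized_objective(GREEDY, REFERENCE, PROXY_REWARD, lam=0.0)
    - regularized_objective(REFERENCE, REFERENCE, PROXY_REWARD, lam=0.0)
)
regularized_gap = (
    regularized_objective(GREEDY, REFERENCE, PROXY_REWARD, lam=1.0)
    - regularized_objective(REFERENCE, REFERENCE, PROXY_REWARD, lam=1.0)
)
```

In this toy setup the greedy policy wins when `lam=0.0`, but once the penalty is switched on, its occupancy measure is far enough from the reference policy's that the reference policy scores higher. The penalty compares where the policies go (states and actions visited), not just how their per-state action probabilities differ.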
Abstract
Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this gap, we introduce a definition of reward hacking based on the correlation between proxy and true rewards for states and actions seen by a "reference policy" that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). Using our formulation, we…
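
The correlation idea in the abstract can be sketched as follows: sample state-action pairs from a reference policy, evaluate both reward functions on them, and compute the Pearson correlation. This is an illustrative sketch only. The toy state space, the reward functions, and the helper names (`pearson`, `proxy_true_correlation`) are hypothetical and are not taken from the paper.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def proxy_true_correlation(samples, proxy_reward, true_reward):
    """Correlation of proxy and true rewards over sampled (state, action) pairs."""
    proxy_values = [proxy_reward(s, a) for s, a in samples]
    true_values = [true_reward(s, a) for s, a in samples]
    return pearson(proxy_values, true_values)


# Hypothetical toy setting: states are integers 0..9, actions are 0 or 1.
def true_reward(state, action):
    # Action 1 is slightly harmful under the true objective.
    return state - action


def proxy_reward(state, action):
    # The proxy ignores the action entirely.
    return float(state)


# Stand-in for a reference policy: uniform random states and actions.
rng = random.Random(0)
reference_samples = [(rng.randrange(10), rng.randrange(2)) for _ in range(1000)]

r = proxy_true_correlation(reference_samples, proxy_reward, true_reward)
```

On the reference policy's samples the two rewards are strongly correlated, because the state term dominates. A policy optimized against the proxy could leave that distribution, so the correlation measured here says nothing about how well the proxy tracks the true reward after optimization.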
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: Balanced Selection
