Robust Optimization for Mitigating Reward Hacking with Correlated Proxies

Zixuan Liu; Xiaolin Sun; Zizhan Zheng

arXiv:2604.12086 · cs.LG · April 15, 2026

TL;DR

This paper introduces a robust policy optimization method that mitigates reward hacking in reinforcement learning by optimizing against the worst case over all proxy rewards satisfying a correlation constraint with the true reward, rather than against a single fixed proxy.

Contribution

The paper formulates reward hacking as a max-min optimization over correlated proxy rewards and derives a tractable solution that improves robustness and interpretability.

Findings

01  Outperforms ORPO in worst-case return scenarios.

02  Offers improved robustness across different proxy-true reward correlations.

03  Provides interpretable worst-case reward solutions.

Abstract

Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint.…
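The inner "worst-case proxy" step described in the abstract can be illustrated on a toy tabular problem. The following is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a finite state-action space, measures correlation under a reference occupancy `d_ref` that is strictly positive, and standardizes rewards to zero mean and unit variance under `d_ref`. The function name `worst_case_return` and all variable names are hypothetical. Under these assumptions, the reward that minimizes a policy's return while keeping correlation `r` with the proxy has a closed form obtained by projecting the density ratio `d_pi / d_ref` away from the constant and proxy directions.

```python
import numpy as np

def worst_case_return(d_pi, d_ref, proxy, r):
    """Toy inner minimization (illustrative assumption, not the paper's code).

    Finds a reward R with zero mean and unit variance under d_ref and
    correlation r with the proxy that minimizes sum(d_pi * R).
    Returns (worst-case value, R); R is None in the degenerate case.
    """
    d_pi = np.asarray(d_pi, dtype=float)
    d_ref = np.asarray(d_ref, dtype=float)
    proxy = np.asarray(proxy, dtype=float)

    # Standardize the proxy under the reference occupancy.
    mean = np.sum(d_ref * proxy)
    std = np.sqrt(np.sum(d_ref * (proxy - mean) ** 2))
    p = (proxy - mean) / std

    # Density ratio; requires d_ref > 0 everywhere.
    w = d_pi / d_ref
    wp = np.sum(d_ref * w * p)  # equals the policy's standardized proxy return

    # Remove the components of w along the constant and proxy directions.
    w_perp = w - 1.0 - wp * p
    norm = np.sqrt(np.sum(d_ref * w_perp ** 2))

    s = np.sqrt(max(0.0, 1.0 - r ** 2))
    value = r * wp - s * norm
    if norm < 1e-12:
        # Every feasible reward gives the same return; no unique minimizer.
        return value, None

    u = -w_perp / norm            # unit-variance direction that hurts the policy most
    R = r * p + s * u
    return value, R

d_ref = np.full(4, 0.25)
d_pi = np.array([0.1, 0.5, 0.1, 0.3])
proxy = np.array([1.0, 2.0, 3.0, 4.0])
value, R = worst_case_return(d_pi, d_ref, proxy, r=0.6)
print(value, R)
```

In this toy setting the penalty term `norm` grows as the policy's occupancy moves away from the reference occupancy, so an outer maximization over policies would trade proxy return against divergence from the reference. How the paper makes the full max-min problem tractable is not covered by this sketch.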

Code & Models

Repositories

ZixuanLiu4869/reward_hacking (GitHub)

Videos

Robust Optimization for Mitigating Reward Hacking with Correlated Proxies (SlidesLive)