TL;DR
This paper introduces a robust policy optimization method that mitigates reward hacking in reinforcement learning by optimizing against every proxy reward that is r-correlated with the true reward, rather than against a single fixed proxy, with the aim of improving robustness and transparency.
Contribution
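The paper formulates reward hacking as a max-min optimization over the set of r-correlated proxy rewards, in which the agent maximizes its return under the worst-case reward in that set. It derives a tractable form of this problem, aimed at improving robustness and interpretability.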
Findings
Outperforms ORPO in worst-case return scenarios.
Improves robustness across different levels of correlation between the proxy and true rewards.
Provides interpretable worst-case reward solutions.
Abstract
Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint.…
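The max-min idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm. It assumes a single-state problem with four actions, approximates the set of r-correlated rewards by a finite sample of perturbed reward vectors, and measures correlation with a plain Pearson coefficient that weights every action equally. All names (`pearson`, `worst_case_return`, `max_min_policy`, `reference`, `r_min`) and all numbers are hypothetical.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two reward vectors (equal weight per action)."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))


def worst_case_return(policy, rewards):
    """Inner min: lowest expected reward of `policy` over the candidate set."""
    return min(float(policy @ r) for r in rewards)


def max_min_policy(policies, rewards):
    """Outer max: index and score of the policy with the best worst case."""
    scores = [worst_case_return(p, rewards) for p in policies]
    best = int(np.argmax(scores))
    return best, scores[best]


rng = np.random.default_rng(0)
reference = np.array([1.0, 0.8, 0.2, 0.0])  # hypothetical known reward, one entry per action
r_min = 0.5  # hypothetical correlation threshold

# Assumption: a finite sample of perturbed rewards stands in for the full
# set of rewards whose correlation with the reference is at least r_min.
pool = np.vstack([reference, reference + rng.normal(scale=0.6, size=(2000, 4))])
candidates = [r for r in pool if pearson(r, reference) >= r_min]

# Candidate policies: each deterministic action, plus the uniform policy.
policies = np.vstack([np.eye(4), np.full((1, 4), 0.25)])

best_index, best_score = max_min_policy(policies, candidates)
print(best_index, best_score)
```

Enumerating a handful of policies keeps the sketch short. A fuller treatment would optimize over all stochastic policies (or occupancy measures) and over the whole constraint set rather than a sample.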
