Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
Mohammad Beigi, Ming Jin, Junshan Zhang, Qifan Wang, Lifu Huang

TL;DR
This paper introduces Adversarial Reward Auditing (ARA), a dynamic framework that detects and mitigates reward hacking in reinforcement learning from human feedback by modeling it as a competitive game, improving alignment and robustness.
Contribution
The paper presents a novel adversarial framework for reward hacking detection and mitigation, enabling adaptive, multi-domain defenses that outperform static methods.
Findings
ARA reduces reward hacking across the three hacking scenarios tested.
It improves the alignment-utility tradeoff in RLHF.
The approach generalizes across different domains.
Abstract
Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among…
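The abstract says AG-RLHF gates reward signals to penalize detected hacking but does not give the gating rule here. The sketch below is only one plausible reading, not the authors' formulation: it assumes the Auditor outputs a probability that a response exploits the reward model, and it blends the raw reward with a fixed penalty according to that probability. The names gated_reward, gate_batch, hack_prob, and penalty are hypothetical.

```python
from typing import List


def gated_reward(reward: float, hack_prob: float, penalty: float = 1.0) -> float:
    """Soft gate (assumed form): (1 - p) * reward - penalty * p.

    reward    -- raw reward-model score for one response
    hack_prob -- Auditor's estimated probability of reward hacking, in [0, 1]
    penalty   -- hypothetical cost applied when hacking is detected
    """
    if not 0.0 <= hack_prob <= 1.0:
        raise ValueError("hack_prob must lie in [0, 1]")
    # Trust the reward in proportion to how clean the Auditor thinks the response is,
    # and subtract a penalty in proportion to the detected hacking probability.
    return (1.0 - hack_prob) * reward - penalty * hack_prob


def gate_batch(rewards: List[float], hack_probs: List[float], penalty: float = 1.0) -> List[float]:
    """Apply the gate to a batch of (reward, hack_prob) pairs."""
    if len(rewards) != len(hack_probs):
        raise ValueError("rewards and hack_probs must have the same length")
    return [gated_reward(r, p, penalty) for r, p in zip(rewards, hack_probs)]
```

Under this assumed form, a response the Auditor considers clean keeps its full reward, while a response flagged with certainty receives only the negative penalty, so a high reward-model score no longer pays off when it is obtained by exploitation.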
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
