When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
Rui Wu, Ruixiang Tang

TL;DR
This paper investigates reward hacking in reinforcement learning for LLMs, revealing a rebound pattern and proposing a representation-based detection and mitigation method called Advantage Modification.
Contribution
It introduces a systematic study of reward hacking rebound patterns and develops a novel representation-level mitigation technique that internalizes penalties during training.
Findings
Models exhibit a three-phase rebound pattern in reward hacking.
Representation directions can effectively track and detect hacking behavior.
Advantage Modification suppresses reward hacking more robustly than generation-time methods.
Abstract
Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
