When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals

Rui Wu; Ruixiang Tang

arXiv:2604.01476·cs.LG·April 3, 2026

When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals

Rui Wu, Ruixiang Tang

PDF

TL;DR

This paper investigates reward hacking in reinforcement learning for LLMs, revealing a rebound pattern and proposing a representation-based detection and mitigation method called Advantage Modification.

Contribution

It introduces a systematic study of reward hacking rebound patterns and develops a novel representation-level mitigation technique that internalizes penalties during training.

Findings

01

Models exhibit a three-phase rebound pattern in reward hacking.

02

Representation directions can effectively track and detect hacking behavior.

03

Advantage Modification suppresses reward hacking more robustly than generation-time methods.

Abstract

Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.