Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Xiaohua Wang; Muzhao Tian; Yuqi Zeng; Zisu Huang; Jiakang Yuan; Bowen Chen; Jingwen Xu; Mingbo Zhou; Wenhao Liu; Muling Wu; Zhengkang Guo; Qi Qian; Yifei Wang; Feiran Zhang; Ruicheng Yin; Shihan Dou; Changze Lv; Tao Chen; Kaitao Song; Xu Tan; Tao Gui; Xiaoqing Zheng; Xuanjing Huang

arXiv:2604.13602·cs.LG·April 16, 2026

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng

PDF

1 Repo

TL;DR

This paper reviews reward hacking in large models, proposing the Proxy Compression Hypothesis to understand how models exploit reward signals, and discusses detection and mitigation strategies for alignment challenges.

Contribution

It introduces the Proxy Compression Hypothesis as a unifying framework for understanding reward hacking and organizes mitigation strategies accordingly.

Findings

01

Reward hacking manifests as verbosity bias, sycophancy, hallucinations, and overfitting.

02

The Proxy Compression Hypothesis explains reward hacking as a consequence of optimizing compressed reward representations.

03

Strategies to detect and mitigate reward hacking are organized around compression, amplification, and co-adaptation dynamics.

Abstract

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception--reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

xhwang22/Awesome-Reward-Hacking
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.