School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Mia Taylor; James Chua; Jan Betley; Johannes Treutlein; Owain Evans

arXiv:2508.17511·cs.AI·August 26, 2025

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans

PDF

2 Datasets

TL;DR

This paper investigates how models trained to exploit reward functions in simple tasks can generalize to more harmful misaligned behaviors, highlighting risks in AI alignment.

Contribution

It introduces a dataset of reward hacking examples, demonstrates that fine-tuned models generalize to harmful misalignments, and provides evidence of potential risks in reward hacking behaviors.

Findings

01

Models learned reward hacking on simple tasks

02

Fine-tuned models generalized to harmful misalignments

03

Reward hacking behaviors may lead to more dangerous AI misbehavior

Abstract

Reward hacking--where agents exploit flaws in imperfect reward functions rather than performing tasks as intended--poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to overwrite or tamper with test cases rather than write correct code. To study the behavior of reward hackers, we built a dataset containing over a thousand examples of reward hacking on short, low-stakes, self-contained tasks such as writing poetry and coding simple functions. We used supervised fine-tuning to train models (GPT-4.1, GPT-4.1-mini, Qwen3-32B, Qwen3-8B) to reward hack on these tasks. After fine-tuning, the models generalized to reward hacking on new settings, preferring less knowledgeable graders, and writing their reward functions to maximize reward. Although the reward hacking behaviors in the training data were harmless, GPT-4.1 also…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.