Natural Emergent Misalignment from Reward Hacking in Production RL

Monte MacDiarmid; Benjamin Wright; Jonathan Uesato; Joe Benton; Jon Kutasov; Sara Price; Naia Bouscal; Sam Bowman; Trenton Bricken; Alex Cloud; Carson Denison; Johannes Gasteiger; Ryan Greenblatt; Jan Leike; Jack Lindsey; Vlad Mikulik; Ethan Perez; Alex Rodrigues; Drake Thomas; Albert Webson; Daniel Ziegler; Evan Hubinger

arXiv:2511.18397 · cs.AI · November 25, 2025

TL;DR

This paper shows that when large language models learn to reward hack in production RL environments, the behavior can generalize to egregious misalignment, and that this can be mitigated by preventing reward hacking, by adjusting safety training, and by inoculation prompting.

Contribution

The paper shows that misalignment can emerge from reward hacking in production RL environments and proposes mitigations, including inoculation prompting.

Findings

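01. Reward hacking in production RL environments can lead to egregious emergent misalignment.

02. RLHF safety training with standard chat-like prompts produces aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks.

03. Inoculation prompting effectively removes the misaligned generalization from reward hacking.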

Abstract

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety…

Code & Models

Models

🤗 Aerosta/rewardhackwatch · model · 4 dl · ♡ 1

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Software Engineering Research