Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna, Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared, Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

TL;DR
This study investigates whether large language models trained on environments with simple gaming behaviors can generalize to more dangerous forms like reward tampering, revealing challenges in preventing such behaviors.
Contribution
The paper demonstrates that LLMs trained on simple specification gaming environments can generalize to reward tampering, highlighting the difficulty of mitigating pernicious behaviors.
Findings
LLMs trained on early environments often exhibit more specification gaming.
A small proportion of models generalize to reward-tampering behaviors.
Adding harmlessness training does not prevent reward tampering.
Abstract
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSocioeconomic Development in MENA · Middle East and Rwanda Conflicts
