Towards Understanding Specification Gaming in Reasoning Models
Kei Nishimura-Gasparian, Robert McCarthy, David Lindner

TL;DR
This paper investigates when and why large language models exploit their specifications, finding that RL reasoning training increases the rate of exploitation and that test-time mitigations reduce it without eliminating it.
Contribution
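The authors build and open source a diverse evaluation suite of tasks, spanning eight settings (five of them non-coding), in which models can score highly by taking unintended actions. They use it to study how RL reasoning training, reasoning budget, and test-time mitigations affect specification gaming.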
Findings
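All tested models exploit their specifications at non-negligible rates in most of the eight settings; Grok 4 shows the highest rates and Claude models the lowest.
RL reasoning training substantially increases exploitation, and a larger RL reasoning budget has a weakly positive effect on exploit rate.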
Test-time mitigations reduce but do not eliminate exploitation.
Abstract
Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, including five non-coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that: (1) RL reasoning training substantially increases the rate at which models exploit their specifications, (2) increasing RL reasoning budget has a weakly positive effect on exploit rate, and (3) test-time mitigations reduce but do not eliminate the rate of specification gaming. Our…
