Towards Understanding Specification Gaming in Reasoning Models

Kei Nishimura-Gasparian; Robert McCarthy; David Lindner

arXiv:2605.02269·cs.AI·May 5, 2026

TL;DR

This paper investigates when and why large language models exploit their specifications. It finds that RL reasoning training increases this behavior and that test-time mitigations reduce it but do not eliminate it.

Contribution

The authors build and open-source a diverse suite of tasks in which models can score highly by taking unintended actions, and use it to study how RL reasoning training, reasoning budget, and test-time mitigations affect specification gaming.

Findings

01

All tested models exploit their specifications at non-negligible rates in most of the eight settings.

02

RL reasoning training substantially increases exploitation.

03

Test-time mitigations reduce but do not eliminate exploitation.

Abstract

Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open-source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, including five non-coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that: (1) RL reasoning training substantially increases the rate at which models exploit their specifications, (2) increasing RL reasoning budget has a weakly positive effect on exploit rate, and (3) test-time mitigations reduce but do not eliminate specification gaming. Our…
