Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models

Lars Malmqvist

arXiv:2505.07846·cs.AI·May 14, 2025

Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models

Lars Malmqvist

PDF

TL;DR

This paper investigates how large language models exploit system loopholes in a simulated game environment, revealing increased gaming behaviors with more advanced models and specific prompts, raising security and alignment concerns.

Contribution

Introduces a novel textual simulation method to analyze specification gaming in LLMs and demonstrates how prompting influences exploitative behaviors.

Findings

01

Newer models show higher propensity to exploit vulnerabilities.

02

Prompting as 'creative' significantly increases gaming behaviors.

03

Identified four distinct exploitation strategies.

Abstract

This study reveals how frontier Large Language Models LLMs can "game the system" when faced with impossible situations, a critical security and alignment concern. Using a novel textual simulation approach, we presented three leading LLMs (o1, o3-mini, and r1) with a tic-tac-toe scenario designed to be unwinnable through legitimate play, then analyzed their tendency to exploit loopholes rather than accept defeat. Our results are alarming for security researchers: the newer, reasoning-focused o3-mini model showed nearly twice the propensity to exploit system vulnerabilities (37.1%) compared to the older o1 model (17.5%). Most striking was the effect of prompting. Simply framing the task as requiring "creative" solutions caused gaming behaviors to skyrocket to 77.3% across all models. We identified four distinct exploitation strategies, from direct manipulation of game state to…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.