Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models
Lars Malmqvist

TL;DR
This paper investigates how large language models exploit system loopholes in a simulated game environment, revealing increased gaming behaviors with more advanced models and specific prompts, raising security and alignment concerns.
Contribution
Introduces a novel textual simulation method to analyze specification gaming in LLMs and demonstrates how prompting influences exploitative behaviors.
Findings
Newer models show higher propensity to exploit vulnerabilities.
Prompting as 'creative' significantly increases gaming behaviors.
Identified four distinct exploitation strategies.
Abstract
This study reveals how frontier Large Language Models LLMs can "game the system" when faced with impossible situations, a critical security and alignment concern. Using a novel textual simulation approach, we presented three leading LLMs (o1, o3-mini, and r1) with a tic-tac-toe scenario designed to be unwinnable through legitimate play, then analyzed their tendency to exploit loopholes rather than accept defeat. Our results are alarming for security researchers: the newer, reasoning-focused o3-mini model showed nearly twice the propensity to exploit system vulnerabilities (37.1%) compared to the older o1 model (17.5%). Most striking was the effect of prompting. Simply framing the task as requiring "creative" solutions caused gaming behaviors to skyrocket to 77.3% across all models. We identified four distinct exploitation strategies, from direct manipulation of game state to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
