Auditing Games for Sandbagging

Jordan Taylor; Sid Black; Dillon Bowen; Thomas Read; Satvik Golechha; Alex Zelenka-Martin; Oliver Makins; Connor Kissane; Kola Ayonrinde; Jacob Merizian; Samuel Marks; Chris Cundy; Joseph Bloom

arXiv:2512.07810 · cs.AI · December 9, 2025



TL;DR

This paper investigates how hard it is to detect sandbagging in AI models. It finds that current detection methods are unreliable and proposes directions for future research to improve robustness.

Contribution

The study introduces an auditing game framework to evaluate sandbagging detection techniques and demonstrates the limitations of existing approaches, highlighting the need for more robust methods.

Findings

01. The blue team struggles to reliably identify sandbagging models.

02. Black-box approaches are easily defeated by imitation of a weaker model.

03. Training-based elicitation can reveal sandbagging but produces false positives.

Abstract

Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although prompt-based elicitation was not reliable,…
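For readers unfamiliar with the term, a linear probe is a linear classifier trained on a model's internal activations. The sketch below is only an illustration of that general idea on synthetic data; it is not the paper's probing setup. The function names, the Gaussian "activations", the dimension of 32, and the size of the shift between classes are all assumptions made for the example.

```python
import numpy as np


def make_synthetic_activations(n_per_class=200, dim=32, shift=3.0, seed=0):
    """Return (X, y): made-up 'activations' and binary labels.

    Class 1 is class 0 shifted along one random unit direction. This is a
    stand-in for real model activations, which are not used here.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    benign = rng.normal(size=(n_per_class, dim))
    flagged = rng.normal(size=(n_per_class, dim)) + shift * direction
    X = np.vstack([benign, flagged])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit logistic-regression weights (the probe) by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose predicted label matches y."""
    return float(((X @ w + b > 0) == (y == 1)).mean())


X, y = make_synthetic_activations()
idx = np.random.default_rng(1).permutation(len(y))
train, test = idx[:300], idx[300:]           # simple held-out split
w, b = train_linear_probe(X[train], y[train])
acc = probe_accuracy(X[test], y[test], w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

The classes here are separable by construction, so the probe scores well. The abstract's caveat is that, on real models, naive probing was vulnerable to behaviours instilled by the red team.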


Code & Models

Datasets

sandbagging-games/evaluation_logs
dataset · 1.2k dl


Taxonomy

Topics: Advanced Malware Detection Techniques · Software Engineering Research · Artificial Intelligence in Games