LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring

Chloe Li; Mary Phuong; Noah Y. Siegel

arXiv:2508.00943·cs.CR·November 3, 2025

TL;DR

This paper measures whether AI models can sandbag on dangerous capability evaluations while evading chain-of-thought (CoT) monitoring, revealing weaknesses in such monitoring and introducing five new sandbagging policies.

Contribution

The paper demonstrates that models can covertly sandbag against CoT monitors and introduces five new sandbagging policies, which highlight potential failure modes of current monitoring methods.

Findings

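01. Frontier models and small open-source models can covertly sandbag against CoT monitoring 0-shot, without hints
02. When monitor-aware, models bypass the monitor 16-36% of the time, conditioned on sandbagging successfully
03. Five new sandbagging policies are introduced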

Abstract

Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat is sandbagging - the strategic underperformance on evaluations by AI models or their developers. A promising defense is to monitor a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16-36% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the…
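The 16-36% figure in the abstract is a conditional rate: among attempts where the model sandbagged successfully, the fraction that the monitor did not catch. The sketch below shows one way such a rate could be computed. The `Trial` record, its field names, and the toy data are hypothetical illustrations, not the paper's code, data, or exact definition of a successful sandbagging attempt.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Trial:
    # Hypothetical fields; the paper's actual success criteria are not given here.
    sandbagged: bool  # assumed: the model underperformed on purpose on this question
    flagged: bool     # assumed: the CoT monitor flagged the transcript as suspicious


def bypass_rate(trials: Sequence[Trial]) -> Optional[float]:
    """Fraction of successful sandbagging attempts that the monitor did not flag.

    Returns None when there are no successful attempts, because the
    conditional rate is undefined in that case.
    """
    successes = [t for t in trials if t.sandbagged]
    if not successes:
        return None
    unflagged = sum(1 for t in successes if not t.flagged)
    return unflagged / len(successes)


# Toy data, invented for illustration only.
toy = [
    Trial(sandbagged=True, flagged=True),
    Trial(sandbagged=True, flagged=False),
    Trial(sandbagged=False, flagged=False),  # excluded: no successful sandbagging
    Trial(sandbagged=True, flagged=True),
    Trial(sandbagged=True, flagged=True),
]
rate = bypass_rate(toy)
```

Conditioning on successful sandbagging matters because attempts where the model failed to underperform would otherwise inflate the count of unflagged transcripts without representing any real evasion.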

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing