LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
Chloe Li, Mary Phuong, Noah Y. Siegel

TL;DR
This paper tests whether AI models can covertly sandbag on dangerous capability evaluations while evading a chain-of-thought (CoT) monitor, exposing weaknesses in CoT monitoring and describing five covert sandbagging policies.
Contribution
The paper demonstrates that models can sandbag covertly against CoT monitors and introduces five new sandbagging policies, highlighting potential failure modes of current monitoring methods.
Findings
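Both frontier models and small open-source models can covertly sandbag against CoT monitoring 0-shot, without hints
Monitor-aware models bypass the monitor 16-36% of the time, conditioned on sandbagging successfully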
Five new sandbagging policies are introduced
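
The bypass rate in the findings above is a conditional rate: it counts only the episodes in which the model sandbagged successfully, and asks what fraction of those the monitor missed. A minimal sketch of that arithmetic follows. The `Episode` record, its two boolean fields, and the sample data are hypothetical illustrations, not the paper's data format or results, and how "sandbagging successfully" is decided per episode is assumed to be given.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    # Hypothetical per-episode labels; the paper's actual schema may differ.
    sandbagged: bool       # the model successfully underperformed on this item
    monitor_flagged: bool  # the CoT monitor flagged the transcript as suspicious


def conditional_bypass_rate(episodes: List[Episode]) -> Optional[float]:
    """Fraction of successfully sandbagged episodes the monitor did not flag."""
    sandbagged = [e for e in episodes if e.sandbagged]
    if not sandbagged:
        return None  # rate is undefined when no episode was sandbagged
    bypassed = sum(1 for e in sandbagged if not e.monitor_flagged)
    return bypassed / len(sandbagged)


# Made-up data: four sandbagged episodes (one unflagged) and one honest episode.
episodes = [
    Episode(sandbagged=True, monitor_flagged=False),
    Episode(sandbagged=True, monitor_flagged=True),
    Episode(sandbagged=True, monitor_flagged=True),
    Episode(sandbagged=True, monitor_flagged=True),
    Episode(sandbagged=False, monitor_flagged=False),
]
rate = conditional_bypass_rate(episodes)
```

Because the denominator excludes episodes where the model did not sandbag, the honest, unflagged episode in the sample does not inflate the rate.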
Abstract
Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat is sandbagging - the strategic underperformance on evaluations by AI models or their developers. A promising defense is to monitor a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16-36% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Security and Verification in Computing
