Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Fabien Roger, Clement Neo, Jacob Haimes, Felix Hofst\"atter, Teun van der Weij

TL;DR
This paper presents a noise-injection method to detect sandbagging in language models, revealing hidden capabilities and improving the accuracy of AI capability evaluations by identifying deliberate underperformance.
Contribution
Introduces a novel noise-injection technique that reliably detects sandbagging behavior across various models, enhancing AI evaluation accuracy.
Findings
Sandbagging models show performance improvements with noise, unlike non-sandbagging models.
Noise injection can reveal full capabilities of underperforming models.
The method is model-agnostic and effective across different architectures and sizes.
Abstract
Capability evaluations play a crucial role in assessing and regulating frontier AI systems. The effectiveness of these evaluations faces a significant challenge: strategic underperformance, or ``sandbagging'', where models deliberately underperform during evaluation. Sandbagging can manifest either through explicit developer intervention or through unintended model behavior, presenting a fundamental obstacle to accurate capability assessment. We introduce a novel sandbagging detection method based on injecting noise of varying magnitudes into model weights. While non-sandbagging models show predictable performance degradation with increasing noise, we demonstrate that sandbagging models exhibit anomalous performance improvements, likely due to disruption of underperformance mechanisms while core capabilities remain partially intact. Through experiments across various model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsComputational Physics and Python Applications · Music and Audio Processing · Topic Modeling
