CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D
Francis Rhys Ward, Teun van der Weij, Hanna G\'abor, Sam Martin, Raja Mehta Moreno, Harel Lidar, Louis Makower, Thomas Jodrell, Lauren Robson

TL;DR
This paper introduces a benchmark to evaluate AI agents' ability to sabotage ML models and evade detection, highlighting the risks of untrustworthy AI systems in safety-critical applications.
Contribution
It extends MLE-Bench with sabotage tasks, assesses agent capabilities to sabotage and evade monitors, and analyzes detection effectiveness, providing a new framework for safety evaluation.
Findings
Agents can successfully sabotage models and evade detection.
Monitoring is effective against sabotage but less so for sandbagging.
Aggregating monitors improves detection but may still be insufficient in high-stakes settings.
Abstract
AI systems are increasingly able to autonomously conduct realistic software engineering tasks, and may soon be deployed to automate machine learning (ML) R&D itself. Frontier AI systems may be deployed in safety-critical settings, including to help ensure the safety of future systems. Unfortunately, frontier and future systems may not be sufficiently trustworthy, and there is evidence that these systems may even be misaligned with their developers or users. Therefore, we investigate the capabilities of AI agents to act against the interests of their users when conducting ML engineering, by sabotaging ML models, sandbagging their performance, and subverting oversight mechanisms. First, we extend MLE-Bench, a benchmark for realistic ML tasks, with code-sabotage tasks such as implanting backdoors and purposefully causing generalisation failures. Frontier agents make meaningful progress on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
