Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases
Eric Gan, Aryan Bhatt, Buck Shlegeris, Julian Stastny, Vivek Hebbar

TL;DR
This paper introduces Auditing Sabotage Bench, a benchmark for evaluating how well auditors can detect and fix sabotage in ML research codebases. Both frontier LLMs and LLM-assisted human auditors struggled to do so reliably.
Contribution
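A benchmark of 9 ML research codebases, each with sabotaged variants that produce qualitatively different experimental results, together with an evaluation of frontier LLMs and LLM-assisted human auditors on detecting and fixing the sabotage.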
Findings
Best AUROC achieved was 0.77 by Gemini 3.1 Pro.
Best top-1 fix rate was 42%, also achieved by Gemini 3.1 Pro.
When LLMs acted as red teamers, their sabotage sometimes evaded detection.
Abstract
As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce Auditing Sabotage Bench, a benchmark for evaluating the ability of auditors to detect and fix sabotage in ML research codebases. Our benchmark consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on our benchmark and found that both struggled to reliably detect and fix sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. We also tested LLMs as red…
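The two headline metrics can be illustrated with a minimal sketch. It assumes each auditor assigns one numeric suspicion score per codebase and proposes a ranked list of fixes. The summary does not describe the paper's actual scoring protocol, so the function names, the data layout, and the example numbers below are all hypothetical.

```python
from itertools import product


def auroc(scores_sabotaged, scores_clean):
    """Probability that a randomly chosen sabotaged codebase receives a higher
    suspicion score than a randomly chosen clean one (ties count as 0.5)."""
    pairs = list(product(scores_sabotaged, scores_clean))
    wins = sum(1.0 if s > c else 0.5 if s == c else 0.0 for s, c in pairs)
    return wins / len(pairs)


def top1_fix_rate(top_fix_correct):
    """Fraction of sabotaged codebases where the auditor's first-ranked
    proposed fix is the correct one."""
    return sum(top_fix_correct) / len(top_fix_correct)


# Hypothetical suspicion scores, for illustration only (not from the paper).
sabotaged_scores = [0.9, 0.6, 0.4]
clean_scores = [0.5, 0.3, 0.4]
detection_auroc = auroc(sabotaged_scores, clean_scores)

# Hypothetical per-codebase outcomes: was the top-ranked fix correct?
fix_rate = top1_fix_rate([True, False, True, False])
```

An AUROC of 0.5 corresponds to an auditor that cannot separate sabotaged from clean codebases, and 1.0 to perfect separation, which gives some context for the reported 0.77.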
