Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases

Eric Gan; Aryan Bhatt; Buck Shlegeris; Julian Stastny; Vivek Hebbar

arXiv:2604.16286 · cs.AI · April 28, 2026

TL;DR

This paper introduces Auditing Sabotage Bench, a benchmark for evaluating how well auditors can detect and fix sabotage in ML research codebases. Both frontier LLMs and LLM-assisted human auditors struggled to do so reliably.

Contribution

It provides a benchmark of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results, and evaluates how well frontier LLMs and LLM-assisted human auditors detect and fix the sabotage.

Findings

01. The best AUROC for detecting sabotage was 0.77, achieved by Gemini 3.1 Pro.

02. The best top-1 fix rate was 42%, also achieved by Gemini 3.1 Pro.

03. When acting as red teamers, LLMs sometimes produced sabotage that evaded detection.

Abstract

As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce Auditing Sabotage Bench, a benchmark for evaluating the ability of auditors to detect and fix sabotage in ML research codebases. Our benchmark consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on our benchmark and found that both struggled to reliably detect and fix sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. We also tested LLMs as red…
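For readers unfamiliar with the detection metric, the following is a minimal sketch of how an AUROC can be computed from auditor suspicion scores. It assumes each codebase receives a single numeric suspicion score and that sabotaged and honest codebases are scored separately; the function name `auroc` and the example scores are hypothetical and are not taken from the paper, whose exact scoring protocol is not described here.

```python
def auroc(scores_sabotaged, scores_honest):
    """Pairwise (Mann-Whitney) estimate of AUROC.

    Returns the fraction of (sabotaged, honest) pairs in which the
    sabotaged codebase received the higher suspicion score, counting
    ties as half a win. 0.5 is chance level; 1.0 is perfect separation.
    """
    if not scores_sabotaged or not scores_honest:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for s in scores_sabotaged:
        for h in scores_honest:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(scores_sabotaged) * len(scores_honest))


# Hypothetical suspicion scores, for illustration only (not benchmark data).
sabotaged = [0.9, 0.6, 0.4]
honest = [0.5, 0.3, 0.2]
print(auroc(sabotaged, honest))
```

Under this reading, an AUROC of 0.77 would mean that an auditor ranks a randomly chosen sabotaged codebase above a randomly chosen honest one about 77% of the time.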
