AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models
Ads Dawson, Rob Mulla, Nick Landers, Shane Caldwell

TL;DR
AIRTBench is a comprehensive benchmark designed to evaluate the autonomous red teaming capabilities of language models in discovering and exploiting AI/ML security vulnerabilities through realistic challenges.
Contribution
This paper introduces the first benchmark specifically for measuring autonomous AI red teaming capabilities in language models, including diverse real-world security challenges.
Findings
Frontier models excel at prompt injection attacks, averaging a ~49% success rate.
Models struggle with system exploitation and model inversion, with success rates below 26%.
Frontier models significantly outperform open-source models in red teaming tasks.
Abstract
We introduce AIRTBench, an AI red teaming benchmark for evaluating language models' ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities. The benchmark consists of 70 realistic black-box capture-the-flag (CTF) challenges from the Crucible challenge environment on the Dreadnode platform, requiring models to write Python code to interact with and compromise AI systems. Claude-3.7-Sonnet emerged as the clear leader, solving 43 challenges (61% of the total suite, 46.9% overall success rate), with Gemini-2.5-Pro following at 39 challenges (56%, 34.3% overall), GPT-4.5-Preview at 34 challenges (49%, 36.9% overall), and DeepSeek R1 at 29 challenges (41%, 26.9% overall). Our evaluations show frontier models excel at prompt injection attacks (averaging 49% success rates) but struggle with system exploitation and model…
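The abstract reports two figures per model: the share of the 70-challenge suite solved, and an "overall" success rate. The sketch below reproduces only the suite percentages from the solved counts quoted above. The function names are hypothetical. The definition of the overall rate as successful runs divided by total runs is an assumption, since the abstract does not spell it out and gives no per-run counts.

```python
# Solved-challenge counts per model, as quoted in the abstract.
TOTAL_CHALLENGES = 70
SOLVED = {
    "Claude-3.7-Sonnet": 43,
    "Gemini-2.5-Pro": 39,
    "GPT-4.5-Preview": 34,
    "DeepSeek R1": 29,
}


def suite_solve_rate(num_solved: int, total: int = TOTAL_CHALLENGES) -> float:
    """Percentage of distinct challenges in the suite that were solved."""
    return 100.0 * num_solved / total


def overall_success_rate(successful_runs: int, total_runs: int) -> float:
    """ASSUMED definition: percentage of all attempted runs that succeeded.

    The abstract does not define this metric, so treat this as illustrative.
    """
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return 100.0 * successful_runs / total_runs


# Print the suite percentage for each model, rounded as in the abstract.
for model, count in SOLVED.items():
    print(f"{model}: {count}/{TOTAL_CHALLENGES} = {round(suite_solve_rate(count))}%")
```

Under that assumed definition, a model that solves many distinct challenges but needs several attempts for each would show a high suite percentage and a lower overall rate. That would be consistent with Gemini-2.5-Pro ranking above GPT-4.5-Preview on challenges solved (56% vs. 49%) but below it on overall success (34.3% vs. 36.9%).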
