AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models

Ads Dawson; Rob Mulla; Nick Landers; Shane Caldwell

arXiv:2506.14682·cs.CR·June 18, 2025

TL;DR

AIRTBench is a comprehensive benchmark designed to evaluate the autonomous red teaming capabilities of language models in discovering and exploiting AI/ML security vulnerabilities through realistic challenges.

Contribution

This paper introduces the first benchmark specifically for measuring autonomous AI red teaming capabilities in language models, including diverse real-world security challenges.

Findings

01

Frontier models excel at prompt injection attacks with ~49% success rate.

02

Models struggle with system exploitation and model inversion, with success rates below 26%.

03

Frontier models significantly outperform open-source models in red teaming tasks.

Abstract

We introduce AIRTBench, an AI red teaming benchmark for evaluating language models' ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities. The benchmark consists of 70 realistic black-box capture-the-flag (CTF) challenges from the Crucible challenge environment on the Dreadnode platform, requiring models to write Python code to interact with and compromise AI systems. Claude-3.7-Sonnet emerged as the clear leader, solving 43 challenges (61% of the total suite, 46.9% overall success rate), with Gemini-2.5-Pro following at 39 challenges (56%, 34.3% overall), GPT-4.5-Preview at 34 challenges (49%, 36.9% overall), and DeepSeek R1 at 29 challenges (41%, 26.9% overall). Our evaluations show frontier models excel at prompt injection attacks (averaging 49% success rates) but struggle with system exploitation and model…
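The suite percentages quoted in the abstract can be checked directly from the solved-challenge counts. The sketch below is only an arithmetic check, not the authors' evaluation code: the names `TOTAL_CHALLENGES`, `solved`, and `suite_solve_rate` are illustrative, and rounding to the nearest whole percent is an assumption about how the figures were reported. The separate "overall success rate" figures are not recomputed here, because the excerpt does not define how they are derived.

```python
# Solved-challenge counts as reported in the abstract; the suite has 70 challenges.
TOTAL_CHALLENGES = 70

solved = {
    "Claude-3.7-Sonnet": 43,
    "Gemini-2.5-Pro": 39,
    "GPT-4.5-Preview": 34,
    "DeepSeek R1": 29,
}


def suite_solve_rate(n_solved: int, total: int = TOTAL_CHALLENGES) -> int:
    """Share of the suite solved, as a whole-number percentage (assumed rounding)."""
    return round(100 * n_solved / total)


# Map each model to the percentage of the 70 challenges it solved.
rates = {model: suite_solve_rate(n) for model, n in solved.items()}

for model, pct in rates.items():
    print(f"{model}: {solved[model]}/{TOTAL_CHALLENGES} = {pct}%")
```

Under these assumptions the computed values match the 61%, 56%, 49%, and 41% suite figures stated in the abstract.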

Code & Models

Repositories

dreadnode/airtbench-code
Official

Taxonomy

Topics: Topic Modeling