Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Andrey Anurin, Jonathan Ng, Kibo Schaffer, Jason Schreiber, Esben Kran

TL;DR
This paper introduces the 3CB benchmark for evaluating the offensive cyber capabilities of large language model agents. It finds that frontier models can perform tasks such as reconnaissance and exploitation, while smaller open-source models show limited capability.
Contribution
The paper presents a new benchmark framework for assessing the offensive cyber capabilities of LLM agents, addressing the lack of transparency and robustness in existing evaluations.
Findings
Frontier models like GPT-4o and Claude 3.5 Sonnet demonstrate significant offensive capabilities.
Smaller open-source models show limited offensive skills.
The benchmark aids in safer deployment and regulation of LLM-based cyber tools.
Abstract
LLM agents have the potential to revolutionize defensive cyber operations, but their offensive capabilities are not yet fully understood. To prepare for emerging threats, model developers and governments are evaluating the cyber capabilities of foundation models. However, these assessments often lack transparency and a comprehensive focus on offensive capabilities. In response, we introduce the Catastrophic Cyber Capabilities Benchmark (3CB), a novel framework designed to rigorously assess the real-world offensive capabilities of LLM agents. Our evaluation of modern LLMs on 3CB reveals that frontier models, such as GPT-4o and Claude 3.5 Sonnet, can perform offensive tasks such as reconnaissance and exploitation across domains ranging from binary analysis to web technologies. Conversely, smaller open-source models exhibit limited offensive capabilities. Our software solution and the…
Taxonomy
Topics: Information and Cyber Security · Big Data and Business Intelligence · Scientific Computing and Data Management
