Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Andrey Anurin; Jonathan Ng; Kibo Schaffer; Jason Schreiber; Esben Kran

arXiv:2410.09114 · cs.CR · November 5, 2024 · 3 cites

TL;DR

This paper introduces the 3CB benchmark for evaluating the offensive cyber capabilities of large language model agents. It finds that frontier models such as GPT-4o and Claude 3.5 Sonnet can perform offensive tasks such as reconnaissance and exploitation, while smaller open-source models show limited offensive capabilities.

Contribution

The paper presents a new comprehensive benchmark framework for assessing the offensive cyber capabilities of LLM agents, addressing transparency and robustness issues.

Findings

01. Frontier models like GPT-4o and Claude 3.5 Sonnet demonstrate significant offensive capabilities.

02. Smaller open-source models show limited offensive skills.

03. The benchmark aids in safer deployment and regulation of LLM-based cyber tools.

Abstract

LLM agents have the potential to revolutionize defensive cyber operations, but their offensive capabilities are not yet fully understood. To prepare for emerging threats, model developers and governments are evaluating the cyber capabilities of foundation models. However, these assessments often lack transparency and a comprehensive focus on offensive capabilities. In response, we introduce the Catastrophic Cyber Capabilities Benchmark (3CB), a novel framework designed to rigorously assess the real-world offensive capabilities of LLM agents. Our evaluation of modern LLMs on 3CB reveals that frontier models, such as GPT-4o and Claude 3.5 Sonnet, can perform offensive tasks such as reconnaissance and exploitation across domains ranging from binary analysis to web technologies. Conversely, smaller open-source models exhibit limited offensive capabilities. Our software solution and the…

Code & Models

Repositories

apartresearch/3cb (Official)

Taxonomy

Topics: Information and Cyber Security · Big Data and Business Intelligence · Scientific Computing and Data Management