Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Andy K. Zhang; Neil Perry; Riya Dulepet; Joey Ji; Celeste Menders; Justin W. Lin; Eliot Jones; Gashon Hussein; Samantha Liu; Donovan Jasper; Pura Peetathawatchai; Ari Glenn; Vikram Sivashankar; Daniel Zamoshchin; Leo Glikbarg; Derek Askaryar; Mike Yang; Teddy Zhang; Rishi Alluri; Nathan Tran; Rinnara Sangpisit; Polycarpos Yiorkadjis; Kenny Osele; Gautham Raghupathi; Dan Boneh; Daniel E. Ho; Percy Liang

arXiv:2408.08926·cs.CR·April 15, 2025

TL;DR

Cybench is a framework for evaluating the cybersecurity capabilities of language model agents on 40 professional-level Capture the Flag tasks. It adds subtasks for finer-grained measurement and compares several agent scaffolds.

Contribution

The paper introduces Cybench, a novel framework for systematically evaluating language models on cybersecurity tasks with detailed subtasks and multiple agent configurations.

Findings

01 Without subtask guidance, agents built on top models such as GPT-4o and Claude 3.5 Sonnet solved only tasks that took human teams up to 11 minutes to solve.

02 Agents with subtask guidance outperform those without on complex tasks.

03 Evaluation reveals varying capabilities of different models across cybersecurity challenges.

Abstract

Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description, starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of…
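
As a rough illustration of the task structure the abstract describes (a description, starter files, and an environment where commands are executed and outputs observed), here is a minimal Python sketch. The names `Task`, `Subtask`, `execute`, and `score` are hypothetical and are not taken from the Cybench codebase; the subtask fields and the exact-match scoring rule are assumptions made for the example.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """Hypothetical intermediate step: a question and its expected answer."""
    question: str
    answer: str


@dataclass
class Task:
    """Hypothetical task record; field names are assumptions, not Cybench's schema."""
    description: str
    starter_files: list[str]
    flag: str
    subtasks: list[Subtask] = field(default_factory=list)


def execute(command: list[str], timeout: float = 10.0) -> str:
    """Run one command and return the output an agent would observe."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr


def score(task: Task, submission: str, subtask_answers: list[str]) -> dict:
    """Exact-match scoring (an assumption): full-task success plus subtask count."""
    solved = submission.strip() == task.flag
    correct = sum(
        given.strip() == sub.answer
        for sub, given in zip(task.subtasks, subtask_answers)
    )
    return {
        "solved": solved,
        "subtasks_correct": correct,
        "subtasks_total": len(task.subtasks),
    }


if __name__ == "__main__":
    demo = Task(
        description="Recover the flag printed by the helper script.",
        starter_files=[],
        flag="flag{demo}",
        subtasks=[Subtask("Which encoding hides the flag?", "base64")],
    )
    # Stand-in for an agent action: run a command, observe its output.
    observation = execute([sys.executable, "-c", "print('flag{demo}')"])
    print(score(demo, observation, ["base64"]))
```

Keeping subtask answers separate from the final flag lets a harness report partial progress on tasks the agent does not fully solve.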

Code & Models

Videos

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models (SlidesLive)

Taxonomy

Topics: Information and Cyber Security

Methods: LLaMA