CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

Zhun Wang; Tianneng Shi; Jingxuan He; Matthew Cai; Jialin Zhang; Dawn Song

arXiv:2506.02548·cs.CR·March 25, 2026

TL;DR

CyberGym is a large-scale benchmark of 1,507 real-world vulnerabilities for evaluating AI agents' cybersecurity skills. Top agents reach only about a 20% success rate, and the benchmark has also led to the discovery of zero-day vulnerabilities and incomplete patches.

Contribution

Introduces CyberGym, a comprehensive benchmark with real-world vulnerabilities, to assess AI cybersecurity capabilities beyond static evaluations.

Findings

01

Top AI agents achieve only a ~20% success rate

02

CyberGym discovers 35 zero-day vulnerabilities

03

Identifies 17 incomplete patches

Abstract

AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short, because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate,…

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The authors seem to have taken strong steps towards reproducibility by providing the pre-patched program in a containerized environment. The vulnerabilities are present in real open source projects, which makes it a very realistic benchmark. The authors discover zero-day vulnerabilities in the Level 0 mode of the benchmark; this means the approach taken in the benchmark can be used by maintainers to discover new vulnerabilities.

Weaknesses

Most of the vulnerabilities seem to be related to bad memory usage in C/C++. While some of the projects are extremely popular, it should be made clear that this is really a subset of cyber-risks; "cybersecurity capabilities" might be over-selling it a bit. PoC success is defined as triggering a sanitizer crash. I understand why the authors made this choice, but again, this is a narrow subset of vulnerabilities, and I think the paper could be more explicit about the limitations of its scope.
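
The success criterion this reviewer describes (a PoC counts if it makes the sanitizer-instrumented target crash) could be checked roughly as follows. This is a minimal sketch, not the benchmark's actual harness: the function names, the regular expression, and the idea of passing the PoC file as the only command-line argument are all assumptions made for illustration.

```python
import re
import subprocess

# Heuristic pattern for sanitizer report lines on stderr, e.g.
# "==123==ERROR: AddressSanitizer: heap-buffer-overflow ...".
# Assumption: the target was built with a sanitizer that prints such lines.
SANITIZER_REPORT = re.compile(r"\b(?:ERROR|WARNING|SUMMARY): \w+Sanitizer:")


def sanitizer_crashed(returncode: int, stderr: str) -> bool:
    """Return True if the run exited abnormally and stderr holds a sanitizer report."""
    return returncode != 0 and bool(SANITIZER_REPORT.search(stderr))


def run_poc(target_binary: str, poc_path: str, timeout_s: float = 30.0) -> bool:
    """Run a (hypothetical) target binary on a PoC file and classify the outcome."""
    try:
        proc = subprocess.run(
            [target_binary, poc_path],
            capture_output=True,
            text=True,
            errors="replace",  # sanitizer output may sit next to raw bytes
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang is not counted as a sanitizer crash in this sketch
    return sanitizer_crashed(proc.returncode, proc.stderr)
```

A check like this only observes crashes that a sanitizer can detect, which is the scope limitation the reviewer raises.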

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The thoroughness and rigor in collecting and filtering the dataset.
- The fact that the benchmark helped find 35 zero-day vulnerabilities is impressive.
- The paper includes useful and interesting sub-experiments and ablations, e.g., the effect of thinking mode and an investigation of data contamination.

Weaknesses

- One minor weakness of the paper is that it does not sufficiently motivate the need for difficulty levels 1, 2, and 3. It is unclear to me when it is useful to reproduce an exploit when we already have the stack trace from running the ground-truth exploit (level 2), or when we have the ground-truth patch (level 3). You mention that level 3 is useful in one-day settings, but to my knowledge, one-day settings are those in which the vulnerability has been discovered but not yet patched.
- The authors mention
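
One way to read the difficulty levels discussed in this review is as progressively richer inputs handed to the agent. The sketch below encodes that reading. It is an illustration under stated assumptions, not the benchmark's own schema: the text here only says that the primary task provides a description and the codebase, that level 2 involves the ground-truth stack trace, and that level 3 involves the ground-truth patch. Treating the levels as cumulative, treating level 0 as codebase-only, and all identifier names are assumptions.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical labels for the four difficulty levels."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3


# Input that each level adds on top of the previous one.
# Assumption: levels are cumulative and level 0 provides only the codebase.
ADDED_INPUT = {
    Level.L0: "codebase",
    Level.L1: "vulnerability_description",
    Level.L2: "ground_truth_stack_trace",
    Level.L3: "ground_truth_patch",
}


def inputs_for(level: Level) -> list[str]:
    """List every input available to the agent at the given level."""
    return [ADDED_INPUT[lv] for lv in Level if lv <= level]
```

For example, `inputs_for(Level.L1)` yields the codebase plus the description, which matches the primary setting described in the abstract.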

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The use of 1,507 real-world vulnerabilities across 188 software projects is a substantial and necessary advancement over existing small-scale or synthetic benchmarks, especially in the domain-specific field of cybersecurity.
2. The requirement for dynamic and iterative problem-solving accurately reflects the complex and exploratory nature of real-world vulnerability research.
3. The intent to release a large, accessible dataset and environment is crucial for reproducibility in this field. T

Weaknesses

1. The authors must detail the computational cost of running the full benchmark suite, for the benefit of future research teams.
2. The authors should have included a human cybersecurity analyst as a baseline against which to compare the agents.

Code & Models

Repositories

sunblaze-ucb/cybergym · Official

Taxonomy

Topics: Information and Cyber Security · Network Security and Intrusion Detection