ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense
Nancy Lau, Louis Sloot, Jyoutir Raj, Giuseppe Marco Boscardin, Evan Harris, Dylan Bowman, Mario Brajkovski, Jaideep Chawla, Dan Zhao

TL;DR
ZeroDayBench is a benchmark that evaluates how well large language model agents find and patch previously unseen critical security vulnerabilities in open-source code. It shows where current frontier models fall short and points to ways they could improve.
Contribution
The paper introduces ZeroDayBench, a benchmark for testing LLM agents on zero-day vulnerabilities, and evaluates leading models, revealing their current inability to fully automate vulnerability detection and patching.
Findings
Frontier LLMs struggle to autonomously solve vulnerability tasks.
Behavioral patterns suggest avenues for model improvements.
Current models are not yet capable of effective proactive cyberdefense.
Abstract
Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark where LLM agents find and patch 22 novel critical vulnerabilities in open-source codebases. We focus our efforts on three popular frontier agentic LLMs: GPT-5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks and observe some behavioral patterns that suggest how these models can be improved in the domain of proactive cyberdefense.
Taxonomy
Topics: Information and Cyber Security · Software Engineering Research · Advanced Malware Detection Techniques
