Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents
María Sanz-Gómez, Víctor Mayoral-Vilches, Francesco Balassone, Luis Javier Navarrete-Lozano, Cristóbal R. J. Veas Chavez, Maite del Mundo de Torres

TL;DR
This paper introduces CAIBench, a comprehensive meta-benchmark for evaluating cybersecurity AI agents across multiple domains and tasks, highlighting the gap between cybersecurity knowledge and practical capabilities.
Contribution
The paper presents CAIBench, a modular framework for evaluating LLMs and agents in offensive and defensive cybersecurity tasks, including novel assessments like robotics challenges and privacy benchmarks.
Findings
State-of-the-art models achieve ~70% on knowledge metrics
Performance drops to 20-40% in adversarial multi-step scenarios
Matching models to tasks can improve performance by up to 2.6 times
Abstract
Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks assess isolated skills rather than integrated performance. We find that pre-trained knowledge of cybersecurity in LLMs does not imply attack and defense abilities, revealing a gap between knowledge and capability. To address this limitation, we present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework that allows evaluating LLMs and agents across offensive and defensive cybersecurity domains, taking a step towards meaningfully measuring their labor relevance. CAIBench integrates five evaluation categories, covering over 10,000 instances: Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments. Key novel contributions include systematic simultaneous…
