Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents

María Sanz-Gómez; Víctor Mayoral-Vilches; Francesco Balassone; Luis Javier Navarrete-Lozano; Cristóbal R. J. Veas Chavez; Maite del Mundo de Torres

arXiv:2510.24317 · cs.CR · October 29, 2025

TL;DR

This paper introduces CAIBench, a comprehensive meta-benchmark for evaluating cybersecurity AI agents across multiple domains and tasks, highlighting the gap between cybersecurity knowledge and practical capabilities.

Contribution

The paper presents CAIBench, a modular framework for evaluating LLMs and agents in offensive and defensive cybersecurity tasks, including novel assessments like robotics challenges and privacy benchmarks.

Findings

01 State-of-the-art models achieve ~70% on knowledge metrics

02 Performance drops to 20-40% in adversarial multi-step scenarios

03 Matching models to tasks can improve performance by up to 2.6 times
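
To make the size of the knowledge-versus-capability gap concrete, the short sketch below does the arithmetic on the two figures quoted in the findings. It is only an illustration: the 70% value is approximate in the source, the two scores come from different task types and are not strictly comparable, and every name in the code (`KNOWLEDGE_PCT`, `ADVERSARIAL_PCT_RANGE`, `gap_points`, `gap_ratio`) is a hypothetical label chosen here, not part of CAIBench.

```python
# Illustrative arithmetic only. The numbers are the ones quoted in the
# findings; the names are hypothetical and not taken from CAIBench.
KNOWLEDGE_PCT = 70                 # "~70% on knowledge metrics" (approximate)
ADVERSARIAL_PCT_RANGE = (20, 40)   # "20-40% in adversarial multi-step scenarios"


def gap_points(knowledge_pct: int, practical_pct: int) -> int:
    """Absolute gap between the two scores, in percentage points."""
    return knowledge_pct - practical_pct


def gap_ratio(knowledge_pct: int, practical_pct: int) -> float:
    """How many times higher the knowledge score is than the practical one."""
    return round(knowledge_pct / practical_pct, 2)


low, high = ADVERSARIAL_PCT_RANGE

# Smallest gap uses the best adversarial score, largest gap uses the worst.
summary = {
    "gap_points": (gap_points(KNOWLEDGE_PCT, high), gap_points(KNOWLEDGE_PCT, low)),
    "gap_ratio": (gap_ratio(KNOWLEDGE_PCT, high), gap_ratio(KNOWLEDGE_PCT, low)),
}

print(summary)
```

Read this way, the quoted figures imply a gap of roughly 30 to 50 percentage points, with knowledge scores about 1.75 to 3.5 times higher than scores in adversarial multi-step scenarios.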

Abstract

Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks assess isolated skills rather than integrated performance. We find that pre-trained knowledge of cybersecurity in LLMs does not imply attack and defense abilities, revealing a gap between knowledge and capability. To address this limitation, we present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework that allows evaluating LLM models and agents across offensive and defensive cybersecurity domains, taking a step towards meaningfully measuring their labor-relevance. CAIBench integrates five evaluation categories, covering over 10,000 instances: Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments. Key novel contributions include systematic simultaneous…
