AutoPenBench: Benchmarking Generative Agents for Penetration Testing
Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, Roberto Bifulco

TL;DR
AutoPenBench introduces a comprehensive benchmark framework for evaluating generative AI agents in automated penetration testing, enabling standardized comparison of different agent architectures and LLMs across diverse tasks.
Contribution
This paper presents AutoPenBench, the first standardized and flexible benchmark for assessing generative agents in penetration testing, including diverse tasks and performance metrics.
Findings
Autonomous agents achieved a 21% success rate overall, solving 27% of the simple tasks.
Assisted agents with human interaction achieved a 64% success rate.
Performance varies significantly across LLMs such as GPT-4.
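As a rough illustration of how success rates like those above could be computed, the sketch below takes success rate to be the fraction of tasks an agent solves. The `success_rate` helper and the per-task outcomes are hypothetical and not taken from the paper. The only figure borrowed from the source is the 33-task benchmark size; the solved counts (7 and 21) are assumptions chosen because they round to the reported 21% and 64%.

```python
def success_rate(outcomes):
    """Return the fraction of tasks solved, given a list of booleans."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


TOTAL_TASKS = 33  # benchmark size stated in the abstract

# Hypothetical outcomes: the solved counts are assumptions, not reported data.
autonomous = [True] * 7 + [False] * (TOTAL_TASKS - 7)
assisted = [True] * 21 + [False] * (TOTAL_TASKS - 21)

print(f"autonomous: {success_rate(autonomous):.0%}")
print(f"assisted:   {success_rate(assisted):.0%}")
```

Under these assumed counts the two rates round to 21% and 64%, which is consistent with the figures listed in the findings.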
Abstract
Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automate cybersecurity tasks. Among these tasks, penetration testing is particularly challenging because of its complexity and the diverse strategies needed to simulate cyber-attacks. Despite growing interest and initial studies in automating penetration testing with generative agents, there is still no comprehensive, standard framework for their evaluation and development. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks are of increasing difficulty levels, including in-vitro and real-world scenarios. We assess the agent performance with…
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Web Application Security Vulnerabilities
