AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Luca Gioacchini; Marco Mellia; Idilio Drago; Alexander Delsanto; Giuseppe Siracusano; Roberto Bifulco

arXiv:2410.03225 · cs.CR · October 29, 2024 · 2 cites

TL;DR

AutoPenBench introduces a comprehensive benchmark framework for evaluating generative AI agents in automated penetration testing, enabling standardized comparison of different agent architectures and LLMs across diverse tasks.

Contribution

This paper presents AutoPenBench, the first standardized and flexible benchmark for assessing generative agents in penetration testing, including diverse tasks and performance metrics.

Findings

01

Fully autonomous agents achieved a 21% success rate overall, solving 27% of the simple tasks.

02

Assisted agents, which interact with a human, achieved a 64% success rate.

03

Performance varies significantly across the underlying LLMs, such as GPT-4.

Abstract

Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automate cybersecurity tasks. Among these tasks, penetration testing is particularly challenging because of its complexity and the diverse strategies needed to simulate cyber-attacks. Despite growing interest and initial studies in automating penetration testing with generative agents, there is still no comprehensive, standard framework for their evaluation and development. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks are of increasing difficulty and include both in-vitro and real-world scenarios. We assess the agent performance with…

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Web Application Security Vulnerabilities