PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design

Ruozhao Yang; Mingfei Cheng; Gelei Deng; Tianwei Zhang; Junjie Wang; Xiaofei Xie

arXiv:2512.14233·cs.SE·December 17, 2025

Open Access

TL;DR

PentestEval is a comprehensive benchmark that evaluates LLMs across all stages of penetration testing, revealing current limitations and emphasizing the need for modular, structured reasoning to improve automation reliability.

Contribution

Introduces PentestEval, the first detailed benchmark for stage-level evaluation of LLMs in penetration testing, highlighting performance gaps and guiding future improvements.

Findings

01

LLMs perform poorly across penetration testing stages.

02

Current end-to-end LLM pipelines reach a success rate of only 31%.

03

Autonomous agents fail almost entirely in penetration testing tasks.

Abstract

Penetration testing is essential for assessing and strengthening system security against real-world threats, yet traditional workflows remain highly manual, expertise-intensive, and difficult to scale. Although recent advances in Large Language Models (LLMs) offer promising opportunities for automation, existing applications rely on simplistic prompting without task decomposition or domain adaptation, resulting in unreliable black-box behavior and limited insight into model capabilities across penetration testing stages. To address this gap, we introduce PentestEval, the first comprehensive benchmark for evaluating LLMs across six decomposed penetration testing stages: Information Collection, Weakness Gathering, Weakness Filtering, Attack Decision-Making, Exploit Generation, and Exploit Revision. PentestEval integrates expert-annotated ground truth with a fully automated evaluation pipeline across 346…

Taxonomy

Topics: Web Application Security Vulnerabilities · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques