Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
Jiaren Peng, Zeqin Li, Chang You, Yan Wang, Hanlin Sun, Xuan Tian, Shuqiao Zhang, Junyi Liu, Jianguo Zhao, Renyang Liu, Haoran Ou, Yuqiang Sun, Jiancheng Zhang, Yutong Jiao, Kunshu Song, Chao Zhang, Fan Shi, Hongda Sun, Rui Yan, and Cheng Huang

TL;DR
This paper systematically analyzes and empirically evaluates LLM-based automated penetration testing frameworks, providing a comprehensive taxonomy and benchmark to guide future research in this rapidly evolving field.
Contribution
It offers the first systematic architectural analysis and large-scale empirical comparison of LLM-based AutoPT frameworks using a unified benchmark.
Findings
Reviewed existing framework designs across six key dimensions.
Conducted experiments on 13 AutoPT frameworks and 2 baselines with over 10 billion tokens.
Generated 1,500+ logs, which cybersecurity experts analyzed over four months.
Abstract
The rapid advancement of Large Language Models (LLMs) has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural design and comprehensive empirical evaluation of current LLM-based AutoPT frameworks. At the systematization level, we comprehensively review existing framework designs across six dimensions: agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks. At the empirical level, we conduct large-scale experiments on 13 representative open-source AutoPT frameworks and 2…
