Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

Jiaren Peng; Zeqin Li; Chang You; Yan Wang; Hanlin Sun; Xuan Tian; Shuqiao Zhang; Junyi Liu; Jianguo Zhao; Renyang Liu; Haoran Ou; Yuqiang Sun; Jiancheng Zhang; Yutong Jiao; Kunshu Song; Chao Zhang; Fan Shi; Hongda Sun; Rui Yan; and Cheng Huang

arXiv:2604.05719 · cs.CR · April 8, 2026

TL;DR

This paper systematically analyzes and empirically evaluates LLM-based automated penetration testing frameworks, providing a comprehensive taxonomy and benchmark to guide future research in this rapidly evolving field.

Contribution

It offers the first systematic architectural analysis and large-scale empirical comparison of LLM-based AutoPT frameworks using a unified benchmark.

Findings

01 Reviewed existing framework designs across six dimensions: agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks.

02 Ran experiments on 13 AutoPT frameworks and 2 baselines, using over 10 billion tokens.

03 Over four months, 1,500+ logs were generated and analyzed by cybersecurity experts.

Abstract

The rapid advancement of Large Language Models (LLMs) has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural design and comprehensive empirical evaluation of current LLM-based AutoPT frameworks. At the systematization level, we comprehensively review existing framework designs across six dimensions: agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks. At the empirical level, we conduct large-scale experiments on 13 representative open-source AutoPT frameworks and 2…

Code & Models

Repositories

simon-p-j-r/LLM4Pentest (GitHub)
