AutoPT: How Far Are We from the End2End Automated Web Penetration Testing?
Benlong Wu, Guoqiang Chen, Kejiang Chen, Xiuwei Shang, Jiapeng Han, Yanru He, Weiming Zhang, Nenghai Yu

TL;DR
This paper introduces AutoPT, an LLM-based automated web penetration testing agent that leverages a finite state machine framework to improve task completion rates and reduce costs, advancing automation in cybersecurity testing.
Contribution
The paper proposes AutoPT, a novel LLM-driven penetration testing agent utilizing a state machine approach to overcome current limitations and enhance automation effectiveness.
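To make the state machine idea concrete, the following is a minimal sketch of an agent loop driven by a finite state machine. It is not AutoPT's implementation: the state names (SCAN, SELECT, EXPLOIT, CHECK), the handler functions, the `attempts` counter, and the `max_steps` budget are all hypothetical, and the handlers are stubs standing in for the LLM and tool calls a real agent would make.

```python
from enum import Enum, auto
from typing import Callable, Dict, Tuple


class State(Enum):
    # Hypothetical states; the summary above does not list AutoPT's actual states.
    SCAN = auto()
    SELECT = auto()
    EXPLOIT = auto()
    CHECK = auto()
    DONE = auto()
    FAILED = auto()


TERMINAL = {State.DONE, State.FAILED}
Handler = Callable[[dict], State]


def run_fsm(handlers: Dict[State, Handler],
            start: State = State.SCAN,
            max_steps: int = 10) -> Tuple[State, dict]:
    """Run handlers until a terminal state is reached or the step budget is spent."""
    state = start
    context = {"history": [], "attempts": 0}
    steps = 0
    while state not in TERMINAL:
        if steps >= max_steps:
            return State.FAILED, context  # budget exhausted before finishing
        context["history"].append(state)
        # Each handler does its work and returns the next state explicitly,
        # so transitions are fixed by the machine rather than by free-form output.
        state = handlers[state](context)
        steps += 1
    return state, context


# Stub handlers: a real agent would call an LLM or a tool in each of these.
def scan(ctx: dict) -> State:
    return State.SELECT


def select(ctx: dict) -> State:
    return State.EXPLOIT


def exploit(ctx: dict) -> State:
    ctx["attempts"] += 1
    return State.CHECK


def check(ctx: dict) -> State:
    # Pretend the first attempt fails and the second succeeds.
    return State.DONE if ctx["attempts"] >= 2 else State.SELECT


HANDLERS = {
    State.SCAN: scan,
    State.SELECT: select,
    State.EXPLOIT: exploit,
    State.CHECK: check,
}

if __name__ == "__main__":
    final_state, ctx = run_fsm(HANDLERS)
    print(final_state.name, [s.name for s in ctx["history"]])
```

In this sketch each handler only sees the shared `context` and must return one of the declared states, which is one way a state machine can keep an agent on a bounded, inspectable path.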
Findings
AutoPT achieves a task completion rate of 41%, nearly doubling the baseline.
AutoPT reduces time and economic costs compared to manual and baseline methods.
AutoPT outperforms ReAct on the GPT-4o mini model in penetration testing tasks.
Abstract
Penetration testing is essential to Web security: it detects and fixes vulnerabilities in advance, preventing data leakage and other serious consequences. Large language models (LLMs) have driven significant progress in various fields through their powerful inference capabilities, and LLM-based agents have the potential to revolutionize the cybersecurity penetration testing industry. In this work, we establish a comprehensive end-to-end penetration testing benchmark using a real-world penetration testing environment to explore the capabilities of LLM-based agents in this domain. Our results reveal that the agents are familiar with the framework of penetration testing tasks, but they still face limitations in generating accurate commands and executing complete processes. Accordingly, we summarize the current challenges, including the difficulty of maintaining the entire message…
Taxonomy
Topics: Web Application Security Vulnerabilities · Software Testing and Debugging Techniques · Software System Performance and Reliability
