Automated Penetration Testing with LLM Agents and Classical Planning

Lingzhi Wang; Xinyi Shi; Ziyu Li; Yi Jiang; Shiyu Tan; Yuhao Jiang; Junjie Cheng; Wenyuan Chen; Xiangmin Shen; Zhenyuan LI; Yan Chen

arXiv:2512.11143 · cs.CR · December 15, 2025

Open Access

TL;DR

This paper introduces a framework that combines classical planning with LLM agents for automated penetration testing. It targets the agents' weaknesses in long-horizon planning and complex reasoning, and it achieves higher success rates at lower time and cost.

Contribution

The paper proposes CHECKMATE, a novel framework that enhances LLM-based penetration testing with structured planning, significantly improving success rates and stability over existing systems.

Findings

01

CHECKMATE outperforms Claude Code in success rates by over 20%.

02

CHECKMATE reduces testing time and costs by more than 50%.

03

LLM agents face challenges with long-term planning and complex reasoning.

Abstract

While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands-off-the-keyboard execution remains a significant research challenge. In this paper, we introduce the "Planner-Executor-Perceptor (PEP)" design paradigm and use it to systematically review existing work and identify the key challenges in this area. We also evaluate existing penetration testing systems, with a particular focus on the use of Large Language Model (LLM) agents for this task. The results show that out-of-the-box Claude Code and Sonnet 4.5 exhibit the strongest penetration capabilities observed to date, substantially outperforming all prior systems. However, a detailed analysis of their testing processes reveals specific strengths and limitations; notably, LLM agents struggle with maintaining coherent long-horizon plans, performing complex reasoning, and effectively utilizing…
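To make the Planner-Executor-Perceptor idea concrete, the following is a minimal sketch of such a loop, assuming a toy STRIPS-style planner with add-only effects. It is not CHECKMATE's implementation: the `Action` class, the function names, and the example facts such as `network_access` are all hypothetical, and the executor is a stub standing in for an LLM agent or tool call.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical planning action: applicable when preconditions hold."""
    name: str
    preconditions: frozenset
    effects: frozenset


def plan(state, goal, actions):
    """Planner: breadth-first forward search over sets of facts."""
    start = frozenset(state)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        facts, steps = queue.popleft()
        if goal <= facts:
            return steps
        for action in actions:
            if action.preconditions <= facts:
                nxt = facts | action.effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action]))
    return None  # no sequence of actions reaches the goal


def execute(action):
    """Executor stub: a real system would delegate this step to an agent."""
    return f"ok:{action.name}"


def perceive(facts, action, observation):
    """Perceptor stub: convert raw output into updated facts."""
    if observation.startswith("ok:"):
        return facts | action.effects
    return facts


def run(initial, goal, actions, max_rounds=10):
    """Replan after every step so the plan tracks the observed state."""
    facts = frozenset(initial)
    trace = []
    for _ in range(max_rounds):
        if goal <= facts:
            break
        steps = plan(facts, goal, actions)
        if not steps:
            break
        step = steps[0]
        facts = perceive(facts, step, execute(step))
        trace.append(step.name)
    return facts, trace


# Illustrative action set; the fact names are assumptions for this sketch.
ACTIONS = [
    Action("scan_host", frozenset({"network_access"}),
           frozenset({"open_ports_known"})),
    Action("identify_service", frozenset({"open_ports_known"}),
           frozenset({"service_identified"})),
    Action("exploit_service", frozenset({"service_identified"}),
           frozenset({"shell_obtained"})),
]
GOAL = frozenset({"shell_obtained"})

if __name__ == "__main__":
    final_facts, trace = run({"network_access"}, GOAL, ACTIONS)
    print(trace)
```

Keeping the plan in an explicit symbolic state, rather than only in the agent's context window, is one way a system could keep a long-horizon plan coherent across many steps; whether CHECKMATE does this in the same way is not stated in the text above.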

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Testing and Debugging Techniques · Information and Cyber Security