What Makes a Good LLM Agent for Real-world Penetration Testing?

Gelei Deng; Yi Liu; Yuekang Li; Ruozhao Yang; Xiaofei Xie; Jie Zhang; Han Qiu; Tianwei Zhang

arXiv:2602.17622 · cs.CR · February 20, 2026

Open Access

TL;DR

This paper introduces Excalibur, an LLM-based penetration testing agent whose difficulty-aware planning targets the planning and state management limitations behind persistent agent failures. It reaches up to 91% task completion on CTF benchmarks.

Contribution

The paper identifies two failure modes in LLM agents for penetration testing (Type A capability gaps, and Type B planning and state management limitations) and proposes a difficulty-aware planning framework that improves performance across benchmarks.

Findings

01. Excalibur achieves up to 91% task completion on CTF benchmarks.

02. It achieves a 39-49% relative improvement over baselines.

03. It compromises 4 of 5 hosts in a real-world environment.

Abstract

LLM-based agents show promise for automating penetration testing, yet reported performance varies widely across systems and benchmarks. We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. Our analysis reveals two distinct failure modes: Type A failures stem from capability gaps (missing tools, inadequate prompts) that engineering readily addresses, while Type B failures persist regardless of tooling due to planning and state management limitations. We show that Type B failures share a root cause that is largely invariant to the underlying LLM: agents lack real-time task difficulty estimation. As a result, agents misallocate effort, over-commit to low-value branches, and exhaust context before completing attack chains. Based on this insight, we present Excalibur, a penetration testing…
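
The idea of estimating task difficulty while an attack is in progress can be illustrated with a toy scheduler. The sketch below is not Excalibur's algorithm, which the truncated abstract does not describe. It is a minimal illustration that assumes each attack branch carries a hypothetical value score and a running difficulty estimate, updated after every step. The names `Branch`, `priority`, `update_difficulty`, and `pick_branch`, along with the smoothing rate and the abandon threshold, are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """A candidate attack path (hypothetical structure, for illustration only)."""
    name: str
    est_value: float        # assumed payoff if the branch succeeds, in [0, 1]
    est_difficulty: float   # running difficulty estimate, in [0, 1]
    steps_spent: int = 0    # effort already sunk into this branch


def priority(b: Branch) -> float:
    # Value discounted by estimated difficulty and by effort already spent.
    return b.est_value * (1.0 - b.est_difficulty) / (1 + b.steps_spent)


def update_difficulty(b: Branch, step_succeeded: bool, rate: float = 0.3) -> None:
    # Exponential moving average: drift toward 0 on progress, toward 1 on failure.
    target = 0.0 if step_succeeded else 1.0
    b.est_difficulty += rate * (target - b.est_difficulty)
    b.steps_spent += 1


def pick_branch(branches, budget_left: int, abandon_above: float = 0.8):
    # Skip branches whose estimated difficulty crossed the abandon threshold,
    # so the agent stops over-committing to them before the budget runs out.
    viable = [b for b in branches if b.est_difficulty < abandon_above]
    if not viable or budget_left <= 0:
        return None
    return max(viable, key=priority)
```

In this sketch, a branch that keeps failing sees its difficulty estimate climb until `pick_branch` stops selecting it, which frees the remaining budget for other branches. A real agent would need a far richer difficulty signal than a success flag per step.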

Taxonomy

Topics: Software Testing and Debugging Techniques · Information and Cyber Security · Web Application Security Vulnerabilities