TL;DR
This paper introduces a new evaluation protocol for AI pentesting agents that emphasizes realistic vulnerability discovery in complex targets, moving beyond traditional benchmarks focused on simplified tasks.
Contribution
The authors present a practical evaluation protocol that assesses AI pentesting agents in realistic scenarios by combining structured ground truth, semantic matching of reported findings, and sustainability metrics.
Findings
Enables realistic comparison of AI pentesting agents in complex environments.
Incorporates semantic matching and bipartite resolution for vulnerability scoring.
Provides reproducible code and annotated ground truth for the community.
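To make the scoring idea in the findings above concrete, here is a minimal sketch of how reported findings could be resolved one-to-one against ground-truth vulnerabilities. It is an illustration under stated assumptions, not the authors' implementation: the paper describes LLM-based semantic matching, which is replaced here by a simple token-overlap (Jaccard) similarity, and the function names, the 0.5 threshold, and the sample data are all hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity. A stand-in (assumption) for the
    LLM-based semantic matcher described in the paper."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_edges(findings: List[str], ground_truth: List[str],
                    threshold: float) -> Dict[int, List[int]]:
    """For each finding index, list the ground-truth indices it may match."""
    return {
        i: [j for j, g in enumerate(ground_truth) if jaccard(f, g) >= threshold]
        for i, f in enumerate(findings)
    }


def max_bipartite_matching(edges: Dict[int, List[int]],
                           n_truth: int) -> Dict[int, int]:
    """Maximum-cardinality matching via augmenting paths (Kuhn's algorithm).
    Returns a mapping finding index -> ground-truth index."""
    finding_for_truth = [-1] * n_truth

    def try_assign(i: int, seen: Set[int]) -> bool:
        for j in edges.get(i, []):
            if j in seen:
                continue
            seen.add(j)
            # Take a free ground-truth entry, or re-route its current owner.
            if finding_for_truth[j] == -1 or try_assign(finding_for_truth[j], seen):
                finding_for_truth[j] = i
                return True
        return False

    for i in edges:
        try_assign(i, set())
    return {f: t for t, f in enumerate(finding_for_truth) if f != -1}


def score(findings: List[str], ground_truth: List[str],
          threshold: float = 0.5) -> Dict[str, float]:
    """Each ground-truth vulnerability is credited at most once."""
    edges = candidate_edges(findings, ground_truth, threshold)
    matched = len(max_bipartite_matching(edges, len(ground_truth)))
    return {
        "matched": matched,
        "precision": matched / len(findings) if findings else 0.0,
        "recall": matched / len(ground_truth) if ground_truth else 0.0,
    }


# Hypothetical data: the first two findings describe the same vulnerability.
ground_truth = [
    "sql injection in login form",
    "stored xss in comment field",
    "idor on invoice endpoint",
]
findings = [
    "sql injection in login form parameter",
    "sql injection in login form",
    "stored xss in comment field",
    "open redirect on logout",
]
result = score(findings, ground_truth)
```

In this example the two SQL injection reports compete for the same ground-truth entry, so only one is credited: `result` reports 2 matched findings, precision 0.5, and recall 2/3. That one-to-one constraint is the point of the bipartite step, since it keeps an agent from inflating its score by reporting duplicates.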
Abstract
AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based…
