From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

Pedro Conde; Henrique Branquinho; Valerio Mazzone; Bruno Mendes; André Baptista; Nuno Moniz

arXiv:2605.10834·cs.AI·May 12, 2026

TL;DR

This paper introduces a new evaluation protocol for AI pentesting agents that emphasizes realistic vulnerability discovery in complex targets, moving beyond traditional benchmarks focused on simplified tasks.

Contribution

The authors present a practical evaluation protocol that assesses AI pentesting agents in realistic scenarios, combining annotated ground truth, semantic matching, and sustainability metrics.

Findings

01. Enables realistic comparison of AI pentesting agents in complex environments.

02. Incorporates semantic matching and bipartite resolution for vulnerability scoring.

03. Provides reproducible code and annotated ground truth for the community.
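
To make the "bipartite resolution" in finding 02 concrete, here is a minimal sketch of how reported findings might be matched one-to-one against ground-truth vulnerabilities. It is an illustration under stated assumptions, not the authors' implementation: the similarity scores stand in for whatever the semantic matcher produces, and the function names, the 0.5 threshold, and the use of Kuhn's augmenting-path algorithm are all hypothetical choices made for this example.

```python
from typing import Dict, List, Optional, Tuple


def resolve_matches(scores: List[List[float]], threshold: float = 0.5) -> Dict[int, int]:
    """Return a one-to-one mapping finding_index -> ground_truth_index.

    scores[i][j] is an assumed similarity between reported finding i and
    ground-truth vulnerability j. Pairs scoring below `threshold` are not
    eligible. Kuhn's augmenting-path algorithm yields a maximum-cardinality
    matching over the eligible pairs.
    """
    n_truth = len(scores[0]) if scores else 0
    truth_owner: List[Optional[int]] = [None] * n_truth  # truth j -> finding i

    def try_assign(i: int, seen: List[bool]) -> bool:
        for j in range(n_truth):
            if scores[i][j] >= threshold and not seen[j]:
                seen[j] = True
                # Take j if it is free, or if its current owner can move elsewhere.
                if truth_owner[j] is None or try_assign(truth_owner[j], seen):
                    truth_owner[j] = i
                    return True
        return False

    for i in range(len(scores)):
        try_assign(i, [False] * n_truth)

    return {i: j for j, i in enumerate(truth_owner) if i is not None}


def precision_recall(scores: List[List[float]], threshold: float = 0.5) -> Tuple[float, float]:
    """Precision = matched / reported findings; recall = matched / ground truth."""
    matches = resolve_matches(scores, threshold)
    n_findings = len(scores)
    n_truth = len(scores[0]) if scores else 0
    precision = len(matches) / n_findings if n_findings else 0.0
    recall = len(matches) / n_truth if n_truth else 0.0
    return precision, recall


# Hypothetical scores: 3 reported findings (rows) vs 3 known vulnerabilities (columns).
example_scores = [
    [0.90, 0.80, 0.10],  # finding 0 resembles truths 0 and 1
    [0.85, 0.20, 0.10],  # finding 1 resembles only truth 0
    [0.10, 0.10, 0.20],  # finding 2 matches nothing (a false positive)
]

if __name__ == "__main__":
    print(resolve_matches(example_scores))
    print(precision_recall(example_scores))
```

The one-to-one constraint is the point of the bipartite step: a greedy pass would give truth 0 to finding 0 and leave finding 1 unmatched, whereas the augmenting-path search reassigns finding 0 to truth 1 so that both findings receive credit and no vulnerability is counted twice.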

Abstract

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based…

Code & Models

Repositories

jd0965199-oss/ethibench (GitHub)
