PentestJudge: Judging Agent Behavior Against Operational Requirements

Shane Caldwell; Max Harley; Michael Kouremetis; Vincent Abruzzo; Will Pearce

arXiv:2508.02921·cs.AI·August 6, 2025

TL;DR

PentestJudge uses large language models to evaluate penetration testing agents' actions against operational criteria. Its verdicts are compared against human expert judgments (best F1 score: 0.83), which reveals the strengths and weaknesses of different judge models.

Contribution

The paper introduces a novel LLM-based judging system with hierarchical rubrics for evaluating security agents, enabling scalable, automated assessment aligned with human expertise.

Findings

01

The best judge model achieved an F1 score of 0.83 against human expert judgments.

02

Models with stronger tool-use capabilities align more closely with human judgments.

03

Weaker models can effectively judge stronger models' trajectories.
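For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score could be computed when a judge's yes-or-no verdicts are compared against human labels. The function name `f1_score` and the sample labels are hypothetical, and the listing does not state whether the paper uses this plain binary F1 or some averaged variant.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Binary F1 of judge verdicts, treating human labels as ground truth.

    Illustrative only: the exact averaging used in the paper is not
    stated in this listing.
    """
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")

    # Count agreement and disagreement on the positive ("yes") class.
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))

    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for four leaf criteria (not data from the paper).
human_labels = [True, True, False, False]
judge_labels = [True, False, False, True]
example_f1 = f1_score(human_labels, judge_labels)
```

In this made-up example the judge agrees with the human on one of two positive criteria and raises one false positive, so precision and recall are both 0.5.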

Abstract

We introduce PentestJudge, a system for evaluating the operations of penetration testing agents. PentestJudge is a large language model (LLM)-as-judge with access to tools that allow it to consume arbitrary trajectories of agent states and tool call history to determine whether a security agent's actions meet certain operating criteria that would be impractical to evaluate programmatically. We develop rubrics that use a tree structure to hierarchically collapse the penetration testing task for a particular environment into smaller, simpler, and more manageable sub-tasks and criteria until each leaf node represents simple yes-or-no criteria for PentestJudge to evaluate. Task nodes are broken down into different categories related to operational objectives, operational security, and tradecraft. LLM-as-judge scores are compared to human domain experts as a ground-truth reference, allowing…
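The tree-structured rubric described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the class `RubricNode`, the functions `collect_leaves`, `judge_tree`, and `pass_rate`, the example criteria, and the unweighted pass-rate aggregation are all hypothetical. In the real system the judge would be an LLM with tool access to the agent trajectory; here it is a stub.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree; leaves are yes-or-no criteria."""
    name: str
    category: str = ""
    children: List["RubricNode"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children


def collect_leaves(node: RubricNode) -> List[RubricNode]:
    """Return every leaf criterion under `node`, depth first."""
    if node.is_leaf:
        return [node]
    leaves: List[RubricNode] = []
    for child in node.children:
        leaves.extend(collect_leaves(child))
    return leaves


def judge_tree(root: RubricNode,
               judge: Callable[[RubricNode], bool]) -> Dict[str, bool]:
    """Ask the judge for a yes/no verdict on each leaf criterion."""
    return {leaf.name: judge(leaf) for leaf in collect_leaves(root)}


def pass_rate(node: RubricNode, verdicts: Dict[str, bool]) -> float:
    """Fraction of leaves under `node` judged 'yes' (assumed aggregation)."""
    leaves = collect_leaves(node)
    return sum(verdicts[leaf.name] for leaf in leaves) / len(leaves)


# Hypothetical rubric using the three categories named in the abstract.
objectives = RubricNode("objectives", "operational objectives", [
    RubricNode("obtained target credentials", "operational objectives"),
])
opsec = RubricNode("opsec", "operational security", [
    RubricNode("avoided account lockouts", "operational security"),
])
tradecraft = RubricNode("tradecraft", "tradecraft", [
    RubricNode("enumerated before exploiting", "tradecraft"),
])
rubric = RubricNode("pentest task", children=[objectives, opsec, tradecraft])


def stub_judge(leaf: RubricNode) -> bool:
    """Stand-in for the LLM judge: fails one criterion, passes the rest."""
    return leaf.name != "avoided account lockouts"


verdicts = judge_tree(rubric, stub_judge)
overall = pass_rate(rubric, verdicts)
```

Keeping verdicts at the leaves and aggregating upward makes it possible to report a score per category subtree (for example `pass_rate(opsec, verdicts)`) as well as for the whole task.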
