PentestJudge: Judging Agent Behavior Against Operational Requirements
Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo, Will Pearce

TL;DR
PentestJudge is a system that uses large language models to evaluate penetration testing agents' actions against operational criteria, matching human expert judgments and revealing strengths and weaknesses of different models.
Contribution
The paper introduces a novel LLM-based judging system with hierarchical rubrics for evaluating security agents, enabling scalable, automated assessment aligned with human expertise.
Findings
The best-performing judge model achieved an F1 score of 0.83 against human expert judgments.
Better tool-use models align more closely with human judgments.
Weaker models can effectively judge stronger models' trajectories.
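To make the F1 figure above concrete, here is a minimal sketch of how agreement between a judge and human experts could be scored over yes-or-no criteria. The label lists and the function name `f1_score` are illustrative assumptions, not data or code from the paper, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """F1 of judge verdicts, treating human labels as ground truth."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    # Count agreement and both kinds of disagreement on the positive class.
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical verdicts on five leaf criteria (True = criterion met).
human_labels = [True, True, False, True, False]
judge_labels = [True, False, False, True, True]
agreement = f1_score(human_labels, judge_labels)
```

In this toy case the judge has two true positives, one false positive, and one false negative, so precision and recall are both 2/3.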
Abstract
We introduce PentestJudge, a system for evaluating the operations of penetration testing agents. PentestJudge is a large language model (LLM)-as-judge with access to tools that allow it to consume arbitrary trajectories of agent states and tool call history to determine whether a security agent's actions meet certain operating criteria that would be impractical to evaluate programmatically. We develop rubrics that use a tree structure to hierarchically collapse the penetration testing task for a particular environment into smaller, simpler, and more manageable sub-tasks and criteria until each leaf node represents simple yes-or-no criteria for PentestJudge to evaluate. Task nodes are broken down into different categories related to operational objectives, operational security, and tradecraft. LLM-as-judge scores are compared to human domain experts as a ground-truth reference, allowing…
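The tree-structured rubric described in the abstract can be sketched as a small recursive data structure. This is only an illustration under stated assumptions: the class `RubricNode`, the example criteria, the stub judge, and the pass-fraction aggregation are all hypothetical, since the abstract does not specify how leaf verdicts are combined or how the LLM judge is invoked.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RubricNode:
    """A task node; leaves carry one yes-or-no criterion (hypothetical schema)."""
    name: str
    category: Optional[str] = None   # e.g. "objectives", "opsec", "tradecraft"
    criterion: Optional[str] = None  # set only on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: RubricNode) -> List[RubricNode]:
    """Collect all leaf criteria under a node, depth first."""
    if node.is_leaf():
        return [node]
    collected: List[RubricNode] = []
    for child in node.children:
        collected.extend(leaves(child))
    return collected


def pass_fraction(node: RubricNode, judge: Callable[[str], bool]) -> float:
    """Fraction of leaf criteria the judge marks as met (assumed aggregation)."""
    leaf_nodes = leaves(node)
    passed = sum(1 for leaf in leaf_nodes if judge(leaf.criterion or ""))
    return passed / len(leaf_nodes)


# Hypothetical rubric for one environment.
rubric = RubricNode("pentest", children=[
    RubricNode("objectives", category="objectives", children=[
        RubricNode("creds", criterion="Agent obtained valid credentials"),
    ]),
    RubricNode("opsec", category="opsec", children=[
        RubricNode("noise", criterion="Agent avoided noisy full-range scans"),
        RubricNode("cleanup", criterion="Agent removed dropped files"),
    ]),
])

# Stub standing in for the LLM judge, which would inspect the trajectory.
met = {"Agent obtained valid credentials", "Agent removed dropped files"}
score = pass_fraction(rubric, lambda criterion: criterion in met)
```

Keeping the judge as a plain callable over a criterion string means the same tree can be scored by a human, a stub, or an LLM-backed function without changing the rubric.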
