Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

Justin W. Lin; Eliot Krzysztof Jones; Donovan Julian Jasper; Ethan Jun-shen Ho; Anna Wu; Arnold Tianyi Yang; Neil Perry; Andy Zou; Matt Fredrikson; J. Zico Kolter; Percy Liang; Dan Boneh; and Daniel E. Ho

arXiv:2512.09882 · cs.AI · March 4, 2026

Open Access · 3 Reviews

TL;DR

This study compares AI agents and cybersecurity professionals in real-world penetration testing, showing AI's potential advantages in systematic tasks and cost, while highlighting current limitations in accuracy and GUI handling.

Contribution

Introduces ARTEMIS, a novel multi-agent framework for penetration testing, and provides the first comprehensive live environment comparison between AI agents and human experts.

Findings

01. ARTEMIS outperformed 9 of 10 human professionals in vulnerability discovery.

02. AI agents can reduce testing costs significantly compared to human testers.

03. Current AI agents face challenges with false positives and GUI-based tasks.

Abstract

We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in…
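
The abstract reports 9 valid vulnerabilities and an 82% valid submission rate but does not state the total number of submissions. The sketch below back-computes which totals are consistent with those two figures. It assumes the rate is valid findings divided by total submissions, rounded to the nearest whole percent; the function name `consistent_totals` and the `max_total` bound are hypothetical and not from the paper.

```python
def consistent_totals(valid: int, reported_pct: int, max_total: int = 1000) -> list[int]:
    """Return every submission total whose rounded valid rate equals reported_pct.

    Assumption (not stated in the source): rate = valid / total,
    rounded to the nearest whole percent.
    """
    start = max(valid, 1)  # a total cannot be smaller than the valid count
    return [
        total
        for total in range(start, max_total + 1)
        if round(100 * valid / total) == reported_pct
    ]


if __name__ == "__main__":
    # Figures taken from the abstract: 9 valid findings, 82% valid rate.
    print(consistent_totals(9, 82))
```

Under that rounding assumption, only one total matches the reported figures, which implies two submissions were judged invalid.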

Peer Reviews

Decision · ICLR 2026 Conditional Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. Claimed to be the first live comparison of AI agents versus professionals in an enterprise environment (a large university CS network), closely matching real-world penetration testing practice. This live comparison is significant compared to curated internet benchmarks, which models may have unintentionally trained on.
2. The proposed ARTEMIS shows impressive performance, outperforming almost all of the human participants. The multi-agent design is well motivated, with dynamic prompt generat

Weaknesses

1. The runtime and evaluation budget between human participants and AI agents are not strictly matched. Humans were asked to work for at least 10 hours, while ARTEMIS was allotted 16 hours. This difference makes the leaderboard comparison less fair and should be normalized or clearly justified.
2. The paper reports totals and percentages but lacks deeper statistical analysis such as variance or confidence intervals. Without statistical treatment, it is difficult to judge whether performance di

Reviewer 02 · Rating 8 · Confidence 2

Strengths

- Great comparison between human vs. AI capability
- Experiment in live setting
- Strong results for their proposed framework vs. existing coding frameworks including low cost.
- Open-sourced

Weaknesses

- The timeframe is short, as noted by the authors, and may not be representative of a typical pentesting engagement timeframe.
- AI agent reports vulnerabilities

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The core strength is the commitment to a live enterprise environment (8K hosts, 12 subnets). This is a crucial step forward from synthetic environments and provides uniquely valuable, though difficult to validate, insights into autonomous agent performance in a complex system.
2. The direct comparison against a team of 10 human cybersecurity professionals provides a tangible, high-quality benchmark for AI capabilities.
3. Primary novelty in the ARTEMIS methodology is the successful implemen

Weaknesses

1. The use of a "live enterprise environment" makes the core evaluation non-reproducible.
2. The paper needs to clearly articulate the technical novelty of ARTEMIS in its agent-planning module or tool-use orchestration that distinguishes it from existing multi-agent systems.
3. The authors failed to provide error analysis on why ARTEMIS fails.
4. The authors did not compare the effect of various LLMs when used with ARTEMIS; a good comparison of proprietary vs open-sourced models would add more val

Taxonomy

Topics: Web Application Security Vulnerabilities · Information and Cyber Security · Adversarial Robustness in Machine Learning