Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing
Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, and Daniel E. Ho

TL;DR
This study compares AI agents and cybersecurity professionals in real-world penetration testing, showing AI's potential advantages in systematic tasks and cost, while highlighting current limitations in accuracy and GUI handling.
Contribution
Introduces ARTEMIS, a novel multi-agent framework for penetration testing, and provides the first comprehensive live environment comparison between AI agents and human experts.
Findings
ARTEMIS outperformed 9 of 10 human professionals in vulnerability discovery.
AI agents can reduce testing costs significantly compared to human testers.
Current AI agents face challenges with false positives and GUI-based tasks.
Abstract
We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in…
Peer Reviews
Decision · ICLR 2026 Conditional Poster
1. Claimed to be the first live comparison of AI agents versus professionals in an enterprise environment (a large university CS network), closely matching real-world penetration testing practice. This live comparison is significant compared to curated internet benchmarks, which models may inadvertently have been trained on.
2. The proposed ARTEMIS shows impressive performance, outperforming almost all of the human participants. The multi-agent design is well motivated, with dynamic prompt generat
1. The runtime and evaluation budgets of the human participants and the AI agents are not strictly matched. Humans were asked to work for at least 10 hours, while ARTEMIS was allotted 16 hours. This difference makes the leaderboard comparison less fair and should be normalized or clearly justified.
2. The paper reports totals and percentages but lacks deeper statistical analysis such as variance or confidence intervals. Without statistical treatment, it is difficult to judge whether performance di
- Great comparison between human and AI capability
- Experiment in a live setting
- Strong results for the proposed framework versus existing coding frameworks, including low cost
- Open-sourced
- The timeframe is short, as noted by the authors, and may not be representative of a typical pentesting engagement timeframe.
- AI agent reports vulnerabilities
1. The core strength is the commitment to a live enterprise environment (8K hosts, 12 subnets). This is a crucial step forward from synthetic environments and provides uniquely valuable, though difficult to validate, insights into autonomous agent performance in a complex system.
2. The direct comparison against a team of 10 human cybersecurity professionals provides a tangible, high-quality benchmark for AI capabilities.
3. Primary novelty in the ARTEMIS methodology is the successful implemen
1. The use of a "live enterprise environment" makes the core evaluation non-reproducible.
2. The paper needs to clearly articulate the technical novelty of ARTEMIS in its agent-planning module or tool-use orchestration that distinguishes it from existing multi-agent systems.
3. The authors failed to provide an error analysis of why ARTEMIS fails.
4. The authors did not compare the effect of various LLMs when used with ARTEMIS; a good comparison of proprietary vs. open-source models would add more val
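One reviewer's request for confidence intervals can be illustrated with a small sketch. The abstract reports 9 valid vulnerabilities and an 82% valid submission rate for ARTEMIS; the code below assumes this corresponds to 9 valid findings out of 11 total submissions, which is an inference from the rounded percentage and is not stated on this page. The function name `wilson_interval` and the choice of the Wilson score interval are illustrative and are not taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# ASSUMPTION: 9 valid out of 11 total submissions (inferred from "82%"; not stated here).
valid, total = 9, 11
low, high = wilson_interval(valid, total)
print(f"valid rate = {valid / total:.1%}, 95% Wilson interval = [{low:.1%}, {high:.1%}]")
```

Under that assumption the interval is more than 40 percentage points wide, which is the kind of uncertainty the reviewer is pointing to: with so few submissions per participant, differences in valid submission rate are hard to distinguish from noise.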
Taxonomy
Topics: Web Application Security Vulnerabilities · Information and Cyber Security · Adversarial Robustness in Machine Learning
