Autonomous LLM Agents & CTFs: A Second Look

Youness Bouchari; Matteo Boffa; Marco Mellia; Idilio Drago; Thanh Minh Bui; Dario Rossi

arXiv:2605.21497·cs.CR·May 22, 2026

TL;DR

This study critically evaluates LLM-based agents on cybersecurity Capture-the-Flag (CTF) challenges. It compares engineered architectures with a general-purpose agent, identifies barriers that persist for both, and shows the benefits of structured orchestration.

Contribution

The paper compares LLM agent architectures of increasing complexity and modularity on 30 web-based CTF challenges, showing that a general-purpose agent is a strong baseline and that modular design has advantages.

Findings

01

claude-code performs comparably to the engineered architectures (19/30 tasks solved).

02

Both the engineered architectures and claude-code struggle with certain challenge categories.

03

Structured orchestration improves consistency and reduces costs.

Abstract

Large Language Model (LLM) agents are increasingly proposed to automate offensive security tasks, with recent studies reporting near human-level success rates in Capture-the-Flag (CTF) challenges. Here we revisit these results and take a second look at these claims. We engineer agent architectures of increasing complexity and modularity and evaluate them on 30 web-based CTF challenges spanning 14 vulnerability classes. We instantiate these agents with multiple LLM backbones and compare them with claude-code, a general-purpose agent that automatically determines its internal architecture. Our evaluation yields three main findings. First, claude-code achieves performance comparable to the engineered architectures (19/30 solved tasks), suggesting that general-purpose agents are strong baselines for offensive security tasks. Second, both our architectures and claude-code struggle in the same…
