Autonomous Penetration Testing: Solving Capture-the-Flag Challenges with LLMs

Isabelle Bakker; John Hastings

arXiv:2508.01054·cs.CR·January 27, 2026

TL;DR

This paper shows that GPT-4o can solve 80% of the compatible beginner-level capture-the-flag challenges in OverTheWire's Bandit game (72% fully unaided, the remainder after minimal prompt hints), showcasing the potential of LLMs to automate parts of penetration testing and cybersecurity education.

Contribution

The paper is described as the first to evaluate GPT-4o's ability to autonomously solve CTF challenges, and it highlights the strengths and limitations of applying LLMs to offensive security tasks.

Findings

01. GPT-4o solved 18 out of 25 challenges unaided

02. High success rate on single-step Linux and networking tasks

03. Limitations in multi-command and complex reconnaissance scenarios
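
The headline percentages follow directly from the reported counts. As a minimal sketch, the arithmetic below uses the figures given on this page (25 compatible levels, 18 solved unaided, 2 more solved after minimal hints); the constant and function names are illustrative and do not come from the paper.

```python
# Counts as reported for the Bandit evaluation.
COMPATIBLE_LEVELS = 25   # levels compatible with the single-command SSH framework
SOLVED_UNAIDED = 18      # solved with no hints
SOLVED_WITH_HINTS = 2    # solved only after minimal prompt hints


def success_rate(solved: int, total: int) -> float:
    """Return the share of solved levels as a percentage."""
    return 100.0 * solved / total


unaided_rate = success_rate(SOLVED_UNAIDED, COMPATIBLE_LEVELS)
overall_rate = success_rate(SOLVED_UNAIDED + SOLVED_WITH_HINTS, COMPATIBLE_LEVELS)

print(f"unaided: {unaided_rate:.0f}%")   # unaided: 72%
print(f"overall: {overall_rate:.0f}%")   # overall: 80%
```

The 80% figure therefore includes the two hinted levels; the fully unaided rate is 72%.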

Abstract

This study evaluates the ability of GPT-4o to autonomously solve beginner-level offensive security tasks by connecting the model to OverTheWire's Bandit capture-the-flag game. Of the 25 levels that were technically compatible with a single-command SSH framework, GPT-4o solved 18 unaided and another two after minimal prompt hints for an overall 80% success rate. The model excelled at single-step challenges that involved Linux filesystem navigation, data extraction or decoding, and straightforward networking. The approach often produced the correct command in one shot and at a human-surpassing speed. Failures involved multi-command scenarios that required persistent working directories, complex network reconnaissance, daemon creation, or interaction with non-standard shells. These limitations highlight current architectural deficiencies rather than a lack of general exploit knowledge. The…
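
One way to see why multi-command levels that need a persistent working directory are hard for a single-command setup: if every command runs in a fresh shell, a `cd` issued in one call has no effect on the next. The sketch below illustrates that behavior locally with `subprocess` on a POSIX system instead of SSH. It assumes that each command in the paper's framework ran in an independent session, which the abstract implies but does not spell out; the function names and the `workdir` directory are hypothetical.

```python
import os
import subprocess
import tempfile


def run_single_command(command: str, start_dir: str) -> str:
    """Run one command in a brand-new shell, as a stand-in for one SSH exec per command."""
    result = subprocess.run(
        ["sh", "-c", command],
        cwd=start_dir,          # every call starts from the same directory
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def demonstrate_lost_working_directory() -> dict:
    """Compare 'cd' issued as its own command with 'cd' chained into one command."""
    with tempfile.TemporaryDirectory() as tmp:
        home = os.path.realpath(tmp)                 # resolve symlinks for comparison
        os.mkdir(os.path.join(home, "workdir"))      # hypothetical subdirectory

        # Two separate commands: the directory change is lost between calls.
        run_single_command("cd workdir", home)
        separate = run_single_command("pwd -P", home)

        # One chained command: the change holds for the rest of that command.
        chained = run_single_command("cd workdir && pwd -P", home)

        return {"home": home, "separate": separate, "chained": chained}


outcome = demonstrate_lost_working_directory()
print("separate calls end in home dir:", outcome["separate"] == outcome["home"])
print("chained call ends in workdir:  ", outcome["chained"].endswith("workdir"))
```

Under this assumption, a task is only solvable when all of its state changes fit into one chained command, which is consistent with the reported strength on single-step challenges.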
