Autonomous Penetration Testing: Solving Capture-the-Flag Challenges with LLMs
Isabelle Bakker, John Hastings

TL;DR
This paper reports that GPT-4o solved 80% of the compatible beginner-level capture-the-flag cybersecurity challenges (18 of 25 unaided, plus two after minimal prompt hints), showcasing the potential of LLMs to automate parts of penetration testing and cybersecurity education.
Contribution
It is the first to evaluate GPT-4o's ability to autonomously solve CTF challenges, highlighting strengths and limitations in applying LLMs to offensive security tasks.
Findings
GPT-4o solved 18 out of 25 challenges unaided
High success rate on single-step Linux and networking tasks
Limitations in multi-command and complex reconnaissance scenarios
Abstract
This study evaluates the ability of GPT-4o to autonomously solve beginner-level offensive security tasks by connecting the model to OverTheWire's Bandit capture-the-flag game. Of the 25 levels that were technically compatible with a single-command SSH framework, GPT-4o solved 18 unaided and another two after minimal prompt hints for an overall 80% success rate. The model excelled at single-step challenges that involved Linux filesystem navigation, data extraction or decoding, and straightforward networking. The approach often produced the correct command in one shot and at a human-surpassing speed. Failures involved multi-command scenarios that required persistent working directories, complex network reconnaissance, daemon creation, or interaction with non-standard shells. These limitations highlight current architectural deficiencies rather than a lack of general exploit knowledge. The…
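The abstract attributes several failures to multi-command scenarios that need a persistent working directory. The sketch below illustrates why a single-command framework tends to lose that state. It is not the authors' harness: it assumes each command runs in a fresh shell (which is how `ssh host '<command>'` behaves), and it uses a local `sh -c` call as a stand-in for the SSH hop. The helper name `run_single_command` is hypothetical.

```python
import shlex
import subprocess
import tempfile


def run_single_command(command: str) -> str:
    """Run one command in a fresh shell and return its stdout.

    Assumption: this mimics a single-command SSH call, where every
    command starts a new shell and no state carries over.
    """
    result = subprocess.run(
        ["sh", "-c", command],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


# Directory the fresh shells start in.
start_dir = run_single_command("pwd")

with tempfile.TemporaryDirectory() as target:
    quoted = shlex.quote(target)

    # Two separate calls: the `cd` happens in a shell that exits immediately,
    # so the following `pwd` still reports the starting directory.
    run_single_command(f"cd {quoted}")
    after_cd = run_single_command("pwd")

    # One chained call: `cd` and `pwd` share the same shell, so the change sticks.
    chained = run_single_command(f"cd {quoted} && pwd")

print("separate calls:", after_cd)
print("chained call:  ", chained)
```

Under these assumptions, a model that issues `cd` and then a follow-up command as two separate turns ends up back where it started, while chaining the steps with `&&` into one command keeps them in the same shell.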
