Measuring and Augmenting Large Language Models for Solving Capture-the-Flag Challenges

Zimo Ji; Daoyuan Wu; Wenyuan Jiang; Pingchuan Ma; Zongjie Li; Shuai Wang

arXiv:2506.17644 · cs.AI · June 24, 2025

TL;DR

This paper evaluates large language models' ability to solve cybersecurity Capture-the-Flag challenges, introduces a focused benchmark to measure their knowledge application, and proposes a new framework that significantly improves their problem-solving performance.

Contribution

The paper constructs the CTFKnow benchmark to measure LLMs' cybersecurity knowledge and introduces CTFAgent, a novel framework whose modules enhance LLMs' CTF-solving capabilities.

Findings

01. LLMs possess substantial technical knowledge but struggle to apply it effectively.

02. CTFAgent achieves over 80% performance improvement on CTF datasets.

03. In picoCTF2024, CTFAgent ranked in the top 23.6% of nearly 7,000 teams.

Abstract

Capture-the-Flag (CTF) competitions are crucial for cybersecurity education and training. As large language models (LLMs) evolve, there is increasing interest in their ability to automate CTF challenge solving. For example, DARPA has organized the AIxCC competition since 2023 to advance AI-powered automated offense and defense. However, this demands a combination of multiple abilities, from knowledge to reasoning and further to actions. In this paper, we highlight the importance of technical knowledge in solving CTF problems and deliberately construct a focused benchmark, CTFKnow, with 3,992 questions to measure LLMs' performance in this core aspect. Our study offers a focused and innovative measurement of LLMs' capability in understanding CTF knowledge and applying it to solve CTF challenges. Our key findings reveal that while LLMs possess substantial technical knowledge, they falter…

Taxonomy

Topics: Advanced Malware Detection Techniques · Information and Cyber Security · Adversarial Robustness in Machine Learning