Training Language Model Agents to Find Vulnerabilities with CTF-Dojo

Terry Yue Zhuo; Dingmin Wang; Hantian Ding; Varun Kumar; Zijian Wang

arXiv:2508.18370·cs.SE·September 24, 2025

3 Reviews

TL;DR

This paper introduces CTF-Dojo, a large-scale, reproducible environment for training language models with verifiable feedback on cybersecurity challenges, leading to significant performance improvements.

Contribution

It presents CTF-Dojo and CTF-Forge, enabling scalable, automated training of LLM agents on executable challenges, achieving state-of-the-art results.

Findings

01

Achieved up to 11.6% absolute improvement over strong baselines.

02

Best 32B model reaches 31.9% Pass@1, setting a new open-weight state-of-the-art.

03

Demonstrated effectiveness of execution-grounded training signals.
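
For readers unfamiliar with the metric, a Pass@1 figure like the one in the findings above is usually the fraction of challenges solved with a single attempt per challenge. The sketch below uses the commonly used unbiased pass@k estimator; it is an illustration only, not the paper's evaluation code, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one challenge.

    n: attempts sampled, c: attempts that captured the flag, k: attempt budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few failures to fill k slots, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over challenges; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: four challenges, one attempt each, one solved.
print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 0), (1, 0)], k=1))
```

With one attempt per challenge (n = k = 1), the estimate reduces to the plain solve rate.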

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities when trained within executable runtime environments, notably excelling at software engineering tasks through verified feedback loops. Yet, scalable and generalizable execution-grounded environments remain scarce, limiting progress in training more capable ML agents. We introduce CTF-Dojo, the first large-scale executable runtime tailored for training LLMs with verifiable feedback, featuring 658 fully functional Capture-The-Flag (CTF)-style challenges containerized in Docker with guaranteed reproducibility. To enable rapid scaling without manual intervention, we develop CTF-Forge, an automated pipeline that transforms publicly available artifacts into ready-to-use execution environments in minutes, eliminating weeks of expert configuration traditionally required. We trained LLM-based agents on just 486 high-quality,…
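
The "verifiable feedback" described in the abstract can be pictured as a flag check: an agent run counts as successful only if the flag it submits matches the challenge's known flag. The following is a minimal sketch under that assumption, not the authors' pipeline; the `Trajectory` class, its fields, and `filter_verified` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trajectory:
    """One agent run on one challenge (hypothetical structure)."""
    challenge_id: str
    submitted_flag: Optional[str]
    steps: list = field(default_factory=list)


def is_verified(traj: Trajectory, expected_flags: dict) -> bool:
    """True only if the submitted flag exactly matches the known flag."""
    expected = expected_flags.get(traj.challenge_id)
    if expected is None or traj.submitted_flag is None:
        return False
    # Ignore surrounding whitespace such as a trailing newline from a shell.
    return traj.submitted_flag.strip() == expected.strip()


def filter_verified(trajs: list, expected_flags: dict) -> list:
    """Keep only trajectories whose outcome was confirmed by the flag check."""
    return [t for t in trajs if is_verified(t, expected_flags)]


# Hypothetical challenge IDs and flags, for illustration only.
flags = {"pwn-01": "flag{abc}", "web-02": "flag{xyz}"}
runs = [
    Trajectory("pwn-01", "flag{abc}\n"),
    Trajectory("web-02", "flag{wrong}"),
    Trajectory("web-02", None),
]
print([t.challenge_id for t in filter_verified(runs, flags)])
```

Because the check is a string comparison against ground truth, it needs no human grading, which is what makes this kind of signal cheap to scale.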

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

1. High-quality training data is a critical factor influencing model performance, and the authors' insight on this point is correct.
2. There has been no prior work on building such an environment specifically for training agents on CTF tasks. This paper makes a valuable contribution in this direction.

Weaknesses

1. The paper's core claim that training data must be "execution-verified" and "correct" is questionable. This methodology is contradicted by the paper's own results, which show that a model trained on a much larger, unverified (and presumably noisy) dataset outperforms the authors' model. This suggests that data quantity may be more important than the strict verification the authors insist upon. The justification of "safety-critical domains" for this training choice is also unconvincing.
2. The

Reviewer 02·Rating 8·Confidence 2

Strengths

- A sizeable collection of CTF competitions is assembled for use in model testing.
- Reasonable tests show that training on successful outputs from this benchmark generalizes to other CTF competitions.
- The authors claim that setting up a CTF testing environment is difficult and time-consuming, and that they can automate the process with 98% accuracy.
- The authors show that traces from their constructed environment are useful for training on other CTF problems.
- The ethical implication of improving offensi

Weaknesses

- It is unclear to me what environments the automated setup was tested on, whether this will hold true for many users, and, especially given the use of LLMs, whether it will continue to be reliable in the future.
- It is unclear what the copyright status of the original CTF competitions is and whether the authors are OK with its collection and use in benchmarking and training LLMs.

Reviewer 03·Rating 4·Confidence 3

Strengths

1. An automatable pipeline to generate an agent environment for cybersecurity agents.
2. Demonstrates better performance on representative benchmarks with better data efficiency compared to Cyber-Zero.

Weaknesses

1. Automatically creating the execution environment is not novel. For example, in SWE-smith, the authors use SWE-agent to automatically create a Docker image given any GitHub repo.
2. The scalability seems to be a bigger issue than discussed. As the authors noted, each CTF challenge is uniquely designed, and the current pipeline can only leverage existing challenges instead of generating synthetic training instances from scratch. For comparison, SWE-smith includes the procedure of automatically generat
