Cyber-Zero: Training Cybersecurity Agents without Runtime

Terry Yue Zhuo; Dingmin Wang; Hantian Ding; Varun Kumar; Zijian Wang

arXiv:2508.00910·cs.CR·August 27, 2025

TL;DR

Cyber-Zero is a framework for training cybersecurity language models without runtime environments. It synthesizes realistic interaction trajectories from CTF writeups, and agents trained on them gain up to 13.1% absolute performance over baseline models on CTF benchmarks.

Contribution

Cyber-Zero is the first framework to generate high-quality training trajectories for cybersecurity LLMs without relying on runtime environments, using persona-driven simulation and publicly available CTF data.

Findings

01. Achieves absolute performance gains of up to 13.1% over baseline models on CTF benchmarks

02. Establishes new state-of-the-art among open-weight models

03. Matches proprietary system capabilities with better cost-effectiveness

Abstract

Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF…
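The persona-driven simulation described in the abstract can be pictured as a loop between two simulated roles. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Turn`, `Trajectory`, `synthesize`, `scripted_player`, and `scripted_env` are hypothetical, the split in which only the environment persona reads the writeup is an assumption, and both personas are plain scripted callables standing in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    action: str       # command proposed by the "player" persona (hypothetical)
    observation: str  # terminal output imagined by the "environment" persona


@dataclass
class Trajectory:
    writeup: str
    turns: List[Turn] = field(default_factory=list)
    solved: bool = False


# Assumption: each persona is any callable; in practice it would wrap an LLM.
PlayerFn = Callable[[List[Turn]], str]          # sees the history only
EnvFn = Callable[[str, List[Turn], str], str]   # sees writeup, history, action


def synthesize(writeup: str, flag: str, player: PlayerFn, env: EnvFn,
               max_turns: int = 20) -> Trajectory:
    """Alternate the two personas until the flag is submitted or turns run out."""
    traj = Trajectory(writeup=writeup)
    for _ in range(max_turns):
        action = player(traj.turns)
        observation = env(writeup, traj.turns, action)
        traj.turns.append(Turn(action, observation))
        # Hypothetical convention: the player ends with "submit <flag>".
        if action.startswith("submit ") and action.split(" ", 1)[1] == flag:
            traj.solved = True
            break
    return traj


def scripted_player(history: List[Turn]) -> str:
    """Stand-in for an LLM player: replays a fixed script."""
    script = ["ls", "cat notes.txt", "submit flag{demo}"]
    return script[min(len(history), len(script) - 1)]


def scripted_env(writeup: str, history: List[Turn], action: str) -> str:
    """Stand-in for an LLM environment: no command is actually executed."""
    if action.startswith("submit "):
        return "correct" if action.split(" ", 1)[1] in writeup else "incorrect"
    return f"(simulated output of `{action}`)"


traj = synthesize("... the flag is flag{demo}", "flag{demo}",
                  scripted_player, scripted_env)
```

No command runs anywhere in this loop: every observation is produced by the environment callable. That is the sense in which such a pipeline is runtime-free.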

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 2·Confidence 4

Strengths

The paper is well-organized and clear. The authors provide a helpful amount of detail and insight into their proposed method which cleverly uses a dual-LLM system to simulate agentic trajectories from previous CTF writeups that don't have accompanying runtime configurations.

Weaknesses

This paper exclusively focuses on improving SOTA results of open-source models on CTF benchmarks. The authors claim that Cyber-Zero is a framework for training cybersecurity agents, when instead it is a framework for training CTF agents, just as ENIGMA is an agent scaffold strictly designed for CTFs and not for real-world cybersecurity tasks. As a result, the usefulness of Cyber-Zero is unproven. If the ENIGMA-scaffolded agents trained using Cyber-Zero obtained state-of-the-art results on more g

Reviewer 02·Rating 6·Confidence 5

Strengths

- The runtime-free framework to synthesize multi-turn trajectories from CTF writeups is an impactful technical contribution
- The paper provides comprehensive evaluation of multiple commercial and open-source LLMs on CTF benchmarks, and how their fine-tuning improves performance. Specifically, the result of SWE-Agent-LM not generalizing to CTF tasks is interesting
- The rectification of problematic challenges of existing CTF benchmarks is appreciated

Weaknesses

- Aspects of the main contribution of synthesizing multi-turn trajectories need to be elaborated, such as how and when hints are provided. Please see "Questions" for specifics
- The paper lacks evaluation of whether this fine-tuning method has trained an LLM that overfits to a specific agentic format like EnIGMA. It raises pertinent questions about how the trained LLM would perform with a different agentic framework that is also targeted towards CTF tasks.
- Some aggregate statistics should be pr

Reviewer 03·Rating 8·Confidence 4

Strengths

Core contribution:
* Good motivation for why this work is important (scarcity of high-quality training data for cybersec applications, because execution environments are harder to obtain).
* Very clear data synthesis methodology, which seems to be original. The data synthesis strategy seems to be very easy and applicable.
* The training dataset itself will also be a good asset and seems to be larger than previously available datasets.

Clear demonstration of effectiveness of method, with large

Weaknesses

Minor comment: The abstract states "Our best model, CYBER-ZERO-32B, [matches] the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet", this seems to be an overstatement if one considers Fig 1 or Fig 3.
