TL;DR
Cyber-Zero introduces a framework for training cybersecurity language models without runtime environments by synthesizing realistic interaction trajectories from CTF writeups, yielding up to 13.1% absolute performance gains over baseline models on CTF benchmarks.
Contribution
Cyber-Zero is the first framework to generate high-quality training trajectories for cybersecurity LLMs without relying on runtime environments, using persona-driven simulation and publicly available CTF data.
Findings
Achieves up to 13.1% absolute performance gains over baseline models on CTF benchmarks
Establishes new state-of-the-art among open-weight models
Matches proprietary system capabilities with better cost-effectiveness
Abstract
Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF…
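The persona-driven simulation described in the abstract can be pictured as a loop between two models. What follows is a minimal sketch, not the authors' implementation: it assumes one persona plays the CTF player and the other plays the terminal, both conditioned on the writeup. The names `synthesize`, `Trajectory`, `PLAYER_PERSONA`, `TERMINAL_PERSONA`, the `submit` stop convention, and the scripted stand-in models are all hypothetical; a real pipeline would call LLMs where the stubs are.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)
# Hypothetical interface: a model maps (persona prompt, writeup, history) to text.
Model = Callable[[str, str, List[Turn]], str]

# Hypothetical persona prompts; the paper's actual prompts are not shown here.
PLAYER_PERSONA = "You are a CTF player working toward the flag step by step."
TERMINAL_PERSONA = "You are the challenge terminal; reply with plausible output."


@dataclass
class Trajectory:
    writeup: str
    turns: List[Turn] = field(default_factory=list)


def synthesize(writeup: str, player: Model, terminal: Model,
               max_turns: int = 8) -> Trajectory:
    """Alternate player actions and simulated observations, with no real runtime."""
    traj = Trajectory(writeup)
    for _ in range(max_turns):
        action = player(PLAYER_PERSONA, writeup, traj.turns)
        traj.turns.append(("player", action))
        if action.startswith("submit "):  # assumed stop convention
            break
        observation = terminal(TERMINAL_PERSONA, writeup, traj.turns)
        traj.turns.append(("terminal", observation))
    return traj


# Scripted stand-ins for the two LLMs so the sketch runs offline.
def scripted_player(persona: str, writeup: str, history: List[Turn]) -> str:
    script = ["ls", "cat notes.txt", "submit flag{example}"]
    done = sum(1 for role, _ in history if role == "player")
    return script[min(done, len(script) - 1)]


def scripted_terminal(persona: str, writeup: str, history: List[Turn]) -> str:
    last_action = history[-1][1]
    return f"[simulated output for: {last_action}]"


if __name__ == "__main__":
    traj = synthesize("toy writeup text", scripted_player, scripted_terminal)
    for role, text in traj.turns:
        print(f"{role}: {text}")
```

The point of the sketch is only that the observation side is generated by a model rather than by executing commands, which is what lets trajectories be built from writeups whose original environments no longer exist.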
Peer Reviews
Decision·ICLR 2026 Poster
The paper is well-organized and clear. The authors provide a helpful amount of detail and insight into their proposed method which cleverly uses a dual-LLM system to simulate agentic trajectories from previous CTF writeups that don't have accompanying runtime configurations.
This paper focuses exclusively on improving SOTA results of open-source models on CTF benchmarks. The authors claim that Cyber-Zero is a framework for training cybersecurity agents, when it is instead a framework for training CTF agents, just as ENIGMA is an agent scaffold designed strictly for CTFs and not for real-world cybersecurity tasks. As a result, the usefulness of Cyber-Zero is unproven. If the ENIGMA-scaffolded agents trained using Cyber-Zero obtained state of the art results on more g
- The runtime-free framework to synthesize multi-turn trajectories from CTF writeups is an impactful technical contribution
- The paper provides comprehensive evaluation of multiple commercial and open-source LLMs on CTF benchmarks, and how their fine-tuning improves performance. Specifically, the result of SWE-Agent-LM not generalizing to CTF tasks is interesting
- The rectification of problematic challenges of existing CTF benchmarks is appreciated
- Aspects of the main contribution of synthesizing multi-turn trajectories need to be elaborated, such as how and when hints are provided. Please see "Questions" for specifics
- The paper lacks evaluation of whether this fine-tuning method has trained an LLM that overfits to a specific agentic format like EnIGMA. It raises pertinent questions about how the trained LLM would perform with a different agentic framework that is also targeted towards CTF tasks.
- Some aggregate statistics should be pr
Core contribution:
* Good motivation for why this work is important (scarcity of high-quality training data for cybersec applications, because execution environments are harder to obtain).
* Very clear data synthesis methodology, which seems to be original. The data synthesis strategy seems to be very easy and applicable.
* The training dataset itself will also be a good asset and seems to be larger than previously available datasets. Clear demonstration of effectiveness of method, with large
Minor comment: The abstract states "Our best model, CYBER-ZERO-32B, [matches] the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet". This seems to be an overstatement if one considers Fig 1 or Fig 3.
