CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution
Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique

TL;DR
CRAKEN is a knowledge-based LLM agent framework that enhances cybersecurity task automation by integrating technical understanding, iterative knowledge retrieval, and adaptive strategies, leading to improved vulnerability detection and exploitation.
Contribution
This paper introduces CRAKEN, a framework that embeds cybersecurity knowledge into LLM agents to address two limitations: accessing expertise beyond the training data, and integrating new knowledge into complex task planning.
Findings
Achieved 22% accuracy on NYU CTF Bench, outperforming prior work by 3%.
Solved 25-30% more MITRE ATT&CK techniques than previous approaches.
Demonstrated effectiveness in multi-stage vulnerability detection and exploitation.
Abstract
Large Language Model (LLM) agents can automate cybersecurity tasks and can adapt to the evolving cybersecurity landscape without re-engineering. While LLM agents have demonstrated cybersecurity capabilities on Capture-The-Flag (CTF) competitions, they have two key limitations: accessing the latest cybersecurity expertise beyond training data, and integrating new knowledge into complex task planning. Knowledge-based approaches that incorporate technical understanding into task-solving automation can tackle these limitations. We present CRAKEN, a knowledge-based LLM agent framework that improves cybersecurity capability through three core mechanisms: contextual decomposition of task-critical information, iterative self-reflected knowledge retrieval, and knowledge-hint injection that transforms insights into adaptive attack strategies. Comprehensive evaluations with different…
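The three mechanisms named in the abstract can be illustrated with a toy sketch. This is not CRAKEN's implementation: the knowledge base, the keyword-overlap scoring, the abbreviation-based query rewriting, and every function name below are assumptions made for illustration, standing in for the LLM calls and retrieval components a real system would use.

```python
# Hypothetical stand-in for a database of CTF write-ups.
KNOWLEDGE_BASE = [
    "format string bug: use %n to overwrite a GOT entry",
    "sql injection: bypass login with a UNION SELECT payload",
    "buffer overflow: overwrite the return address with a ROP chain",
]
STOPWORDS = {"a", "the", "to", "on", "with", "has", "use"}
# Hypothetical rewrite rules; a real agent would ask an LLM to rephrase.
ABBREVIATIONS = {"sqli": "sql injection", "bof": "buffer overflow"}


def tokenize(text: str) -> set[str]:
    """Lowercase, strip punctuation, and drop stopwords."""
    words = (w.strip(".,:;").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def decompose(task: str) -> list[str]:
    """Toy contextual decomposition: one sub-query per ';'-separated clause."""
    return [part.strip() for part in task.split(";") if part.strip()]


def retrieve(query: str) -> tuple[str, int]:
    """Return the best-matching document and its keyword-overlap score."""
    q = tokenize(query)
    scored = [(doc, len(q & tokenize(doc))) for doc in KNOWLEDGE_BASE]
    return max(scored, key=lambda pair: pair[1])


def is_relevant(score: int, threshold: int = 2) -> bool:
    """Toy self-reflection step: accept only sufficiently strong matches."""
    return score >= threshold


def rewrite(query: str) -> str:
    """Toy query rewriting: expand known abbreviations."""
    return " ".join(ABBREVIATIONS.get(w, w) for w in query.split())


def gather_hints(task: str, max_rounds: int = 3) -> list[str]:
    """Retrieve, reflect, and rewrite until a hint is accepted or we give up."""
    hints = []
    for query in decompose(task):
        for _ in range(max_rounds):
            doc, score = retrieve(query)
            if is_relevant(score):
                hints.append(doc)
                break
            new_query = rewrite(query)
            if new_query == query:  # rewriting cannot help any further
                break
            query = new_query
    return hints


def build_prompt(task: str, hints: list[str]) -> str:
    """Toy knowledge-hint injection: append accepted hints to the task prompt."""
    lines = [f"Task: {task}", "Hints:"] + [f"- {h}" for h in hints]
    return "\n".join(lines)


task = "login page vulnerable to sqli; binary has a bof on the stack"
prompt = build_prompt(task, gather_hints(task))
```

In this sketch the first retrieval for each sub-query is rejected by the reflection step, the query is rewritten, and the second retrieval is accepted, so the final prompt carries one hint per sub-query. A task with no matching knowledge yields no hints rather than a weak one.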
Peer Reviews
Decision·Submitted to ICLR 2026
Cybersecurity LM agents are an exciting area, and using prior domain knowledge is a sensible approach. The new system (CRAKEN) does improve over prior work.
The system is rather complex and it is hard to tell which components are most helpful (Table 2 might have the information, but it combines a bunch of variations like model, which is orthogonal to the contributions of the paper). It would be clearer to separate the three axes of variation: models, scaffolds (RAG or not), and information available to the agent. Only the NYU CTF dataset is used. What about Cybench, XBOW, Intercode, CTF-Dojo? The paper would be empirically stronger if it showed t
The work does a good job presenting the complex CRAKEN system, and presents a thorough understanding of previous agents built for CTFs and CTF benchmarks themselves. The authors are thorough in their evaluations. They present results for different configurations of CRAKEN, and analyze failure modes and the performance of the graph RAG system.
Ultimately, this work adds RAG capabilities to a previously presented agentic framework from Xu et al. in order to achieve a minimal increase in performance on the NYU CTF benchmark. The planner-executor based framework is not novel, nor is adding RAG to agents to improve performance. Additionally, despite compiling and presenting a complex system for fetching information relevant to the CTF task at hand, the system only performs 3% better in total on a single benchmark. This is not nearly a sig
1. The overall quality of the writing is sufficient. 2. Comprehensive evaluation with different LLMs and knowledge databases. The paper provides a comprehensive evaluation of CRAKEN on four powerful closed-source LLMs, including Claude 3.5 Sonnet, Claude 3.7 Sonnet, GPT 4o and GPT 4.1, and an open-source LLM, namely DeepSeek V3. Moreover, the finding derived from the experiments, showing that step-by-step operational CTF write-ups are the most effective knowledge source for the RAG process, is val
The novelty is poor. The evaluation of integrating RAG with cybersecurity LLM agents on CTF tasks is valuable. However, the work is built upon the existing D-CIPHER framework [1] and directly adopts previously proposed RAG components, namely Self-RAG [2] and Graph-RAG [3]. Although the paper claims to employ an optimized version of Self-RAG, the differences between it and the original implementation described in [2] are not clearly explained. Overall, since the proposed methodology appears to
