Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
Alankrit Chona, Igor Kozlov, Ambuj Kumar

TL;DR
This paper introduces a benchmark for evaluating large language models on threat hunting tasks in security operations and shows that current models perform poorly on real-world attack detection scenarios.
Contribution
The paper presents the Cyber Defense Benchmark, a novel reinforcement learning environment for assessing LLMs in threat hunting, highlighting their limitations in open-ended security tasks.
Findings
All tested models perform poorly; the best one correctly flags only 3.8% of malicious events.
No model achieves the minimum threshold of 50% recall across all attack tactics.
Current LLMs are inadequate for evidence-driven threat hunting in operational security contexts.
Abstract
We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier…
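The query-then-flag loop described in the abstract can be illustrated with a small sketch. This is not the benchmark's code or its Gymnasium interface: the `ToyHuntEnv` class, the `events` table schema, the sample log rows, and the plain recall and precision scoring are all hypothetical stand-ins, built only on Python's standard `sqlite3` module.

```python
import sqlite3


class ToyHuntEnv:
    """Hypothetical, heavily simplified stand-in for one hunting episode."""

    def __init__(self, events, malicious_timestamps):
        # In-memory SQLite database, as in the abstract; the schema is assumed.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE events "
            "(ts TEXT, event_id INTEGER, image TEXT, command_line TEXT)"
        )
        self.db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
        self.truth = set(malicious_timestamps)  # ground-truth timestamps
        self.flagged = set()                    # timestamps the agent flagged

    def query(self, sql):
        """Run an agent-supplied SQL query and return all rows."""
        return self.db.execute(sql).fetchall()

    def flag(self, ts):
        """Record a timestamp the agent claims is malicious."""
        self.flagged.add(ts)

    def recall(self):
        """Fraction of ground-truth timestamps that were flagged."""
        if not self.truth:
            return 0.0
        return len(self.flagged & self.truth) / len(self.truth)

    def precision(self):
        """Fraction of flagged timestamps that are in the ground truth."""
        if not self.flagged:
            return 0.0
        return len(self.flagged & self.truth) / len(self.flagged)


# Made-up process-creation records (Sysmon event ID 1); not benchmark data.
events = [
    ("2024-01-01T10:00:00", 1, "C:\\Windows\\explorer.exe", "explorer.exe"),
    ("2024-01-01T10:05:12", 1, "C:\\Windows\\System32\\powershell.exe",
     "powershell.exe -enc SQBFAFgA"),
    ("2024-01-01T10:06:40", 1, "C:\\Windows\\System32\\rundll32.exe",
     "rundll32.exe comsvcs.dll, MiniDump 624 out.dmp full"),
    ("2024-01-01T10:09:03", 1, "C:\\Windows\\System32\\notepad.exe",
     "notepad.exe notes.txt"),
]
truth = {"2024-01-01T10:05:12", "2024-01-01T10:06:40"}

env = ToyHuntEnv(events, truth)

# One hunting step: look for encoded PowerShell commands, then flag the hits.
rows = env.query("SELECT ts FROM events WHERE command_line LIKE '%-enc%'")
for (ts,) in rows:
    env.flag(ts)

print(f"recall={env.recall():.2f} precision={env.precision():.2f}")
```

In this toy episode the single query finds one of the two malicious timestamps and misses the `rundll32.exe` memory dump, which shows how an agent can flag accurately and still leave most of an attack uncovered.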
