Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Alankrit Chona; Igor Kozlov; Ambuj Kumar

arXiv:2604.19533 · cs.CR · April 24, 2026

TL;DR

This paper introduces a benchmark for evaluating large language models on threat hunting in security operations, and finds that current models perform poorly at detecting real attack activity in raw logs.

Contribution

The paper presents the Cyber Defense Benchmark, a Gymnasium reinforcement-learning environment for assessing LLM agents on threat hunting, and uses it to expose their limitations on open-ended security tasks.

Findings

01

All tested models perform poorly; the best correctly flags only 3.8% of malicious events.

02

No model achieves the minimum threshold of 50% recall across all attack tactics.

03

Current LLMs are inadequate for evidence-driven threat hunting in operational security contexts.
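One way to read the threshold in finding 02 is as a per-tactic check: a model passes only if its recall is at least 50% on every tactic taken individually. The sketch below illustrates that reading; it is an assumption about how the criterion is applied, not the paper's scoring code, and the function names, tactic labels, and timestamps are hypothetical.

```python
def per_tactic_recall(truth, flagged):
    """Recall per tactic.

    truth:   dict mapping a tactic name to the set of malicious timestamps
             for that tactic (hypothetical ground-truth layout)
    flagged: set of timestamps the agent flagged
    Tactics with no ground-truth events are skipped.
    """
    return {
        tactic: len(events & flagged) / len(events)
        for tactic, events in truth.items()
        if events
    }


def passes(truth, flagged, threshold=0.5):
    """True only if recall meets the threshold on every tactic (assumed reading)."""
    recalls = per_tactic_recall(truth, flagged)
    return all(r >= threshold for r in recalls.values())


# Toy data: two tactics, the agent finds one event in each.
truth = {
    "execution": {"t1", "t2"},
    "persistence": {"t3", "t4", "t5", "t6"},
}
flagged = {"t1", "t3"}

recalls = per_tactic_recall(truth, flagged)
print(recalls)                  # execution: 0.5, persistence: 0.25
print(passes(truth, flagged))   # False, because persistence is below 0.5
```

Under this reading a single weak tactic is enough to fail, which is a stricter bar than an average recall of 50% across tactics.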

Abstract

We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier…
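The query-then-flag loop described in the abstract can be sketched in a few lines. This is a minimal illustration only: it uses a plain Python class rather than the actual Gymnasium interface, and the class name `ToyHuntEnv`, the `events` table schema, the action format, and the log rows are all hypothetical. Scoring here is simple recall over flagged timestamps, a simplification of the CTF-style scoring the paper describes.

```python
import sqlite3


class ToyHuntEnv:
    """Hypothetical stand-in for the benchmark environment (not the real API)."""

    def __init__(self, records, malicious_timestamps):
        # In-memory SQLite database holding the log records.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE events (ts TEXT, event_id INTEGER, message TEXT)"
        )
        self.db.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
        self.truth = set(malicious_timestamps)  # ground-truth timestamps
        self.flagged = set()

    def step(self, action):
        """action is ("query", sql) or ("flag", timestamp)."""
        kind, payload = action
        if kind == "query":
            try:
                return self.db.execute(payload).fetchall()
            except sqlite3.Error as exc:
                # Return SQL errors as an observation instead of crashing.
                return [("error", str(exc))]
        if kind == "flag":
            self.flagged.add(payload)
            return payload in self.truth
        raise ValueError(f"unknown action kind: {kind}")

    def recall(self):
        """Fraction of ground-truth timestamps that were flagged."""
        return len(self.flagged & self.truth) / len(self.truth)


# Toy log: two of the four rows are treated as malicious.
records = [
    ("2024-01-01T10:00:00", 4624, "logon"),
    ("2024-01-01T10:00:05", 4688, "powershell.exe -nop -w hidden"),
    ("2024-01-01T10:00:09", 4688, "notepad.exe"),
    ("2024-01-01T10:00:12", 4698, "scheduled task created"),
]
env = ToyHuntEnv(records, ["2024-01-01T10:00:05", "2024-01-01T10:00:12"])

# The agent queries, then flags every timestamp the query returned.
rows = env.step(("query", "SELECT ts FROM events WHERE message LIKE '%powershell%'"))
for (ts,) in rows:
    env.step(("flag", ts))

print(env.recall())  # 0.5: one of the two malicious events was found
```

The agent in this toy run finds the PowerShell event but misses the scheduled task, so it scores half. In the benchmark itself the database holds 75,000-135,000 records per episode, and the agent receives no hints about which columns or keywords matter.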
