CTI-REALM: Benchmark to Evaluate Agent Performance on Security Detection Rule Generation Capabilities
Arjun Chakraborty, Sandra Ho, Adam Cook, Manuel Meléndez

TL;DR
This paper introduces CTI-REALM, a benchmark for evaluating AI agents' ability to interpret cyber threat intelligence and generate detection rules in realistic security scenarios, highlighting the potential of AI in detection engineering.
Contribution
The paper presents a new benchmark environment for assessing AI agents' performance in cybersecurity detection rule generation, including evaluation methods and insights from testing 16 models.
Findings
Claude Opus 4.6 achieves the highest reward (0.637)
CTI-specific tools significantly improve agent performance
Seeded context reduces performance gap by 33%
Abstract
CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) is a benchmark designed to evaluate AI agents' ability to interpret cyber threat intelligence (CTI) and develop detection rules. The benchmark provides a realistic environment that replicates the security analyst workflow. This enables agents to examine CTI reports, execute queries, understand schema structures, and construct detection rules. Evaluation involves emulated attacks of varying complexity across Linux systems, cloud platforms, and Azure Kubernetes Service (AKS), with ground truth data for accurate assessment. Agent performance is measured through both final detection results and trajectory-based rewards that capture decision-making effectiveness. This work demonstrates the potential of AI agents to support labor-intensive aspects of detection engineering. Our comprehensive evaluation of 16 frontier models…
Taxonomy
Topics: Cybercrime and Law Enforcement Studies · Network Security and Intrusion Detection · Information and Cyber Security
