CTI-REALM: Benchmark to Evaluate Agent Performance on Security Detection Rule Generation Capabilities

Arjun Chakraborty; Sandra Ho; Adam Cook; Manuel Meléndez

arXiv:2603.13517·cs.CR·March 18, 2026

Open Access

TL;DR

This paper introduces CTI-REALM, a benchmark for evaluating AI agents' ability to interpret cyber threat intelligence and generate detection rules in realistic security scenarios, highlighting the potential of AI in detection engineering.

Contribution

The paper presents a new benchmark environment for assessing AI agents' performance in cybersecurity detection rule generation, including evaluation methods and insights from testing 16 models.

Findings

01. Claude Opus 4.6 achieves the highest reward (0.637)

02. CTI-specific tools significantly improve agent performance

03. Seeded context reduces performance gap by 33%

Abstract

CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) is a benchmark designed to evaluate AI agents' ability to interpret cyber threat intelligence (CTI) and develop detection rules. The benchmark provides a realistic environment that replicates the security analyst workflow. This enables agents to examine CTI reports, execute queries, understand schema structures, and construct detection rules. Evaluation involves emulated attacks of varying complexity across Linux systems, cloud platforms, and Azure Kubernetes Service (AKS), with ground truth data for accurate assessment. Agent performance is measured through both final detection results and trajectory-based rewards that capture decision-making effectiveness. This work demonstrates the potential of AI agents to support labor-intensive aspects of detection engineering. Our comprehensive evaluation of 16 frontier models…
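The abstract says agents are scored on both final detection results and trajectory-based rewards, but it does not give the formulas. The sketch below shows one way such a combined score could be computed. It is an illustration under assumptions, not the benchmark's scoring code: the F1-based outcome score, the averaged per-step rewards, the `outcome_weight` parameter, and all names (`score_detection`, `combined_reward`, the event IDs) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DetectionScore:
    precision: float
    recall: float
    f1: float


def score_detection(flagged: set[str], ground_truth: set[str]) -> DetectionScore:
    """Compare event IDs flagged by a rule against ground-truth attack events.

    Assumption: detection quality is summarized by precision, recall, and F1.
    The paper's actual outcome metric may differ.
    """
    true_positives = len(flagged & ground_truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return DetectionScore(precision, recall, f1)


def combined_reward(
    detection: DetectionScore,
    step_rewards: list[float],
    outcome_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Blend the final detection score with the mean per-step trajectory reward."""
    trajectory = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return outcome_weight * detection.f1 + (1 - outcome_weight) * trajectory


# Hypothetical example: the rule flags four events, three of which are real.
flagged_events = {"e1", "e2", "e3", "e9"}
attack_events = {"e1", "e2", "e3", "e4"}
detection = score_detection(flagged_events, attack_events)

# Hypothetical per-step rewards, e.g. for reading the CTI report,
# inspecting the schema, running a query, and drafting the rule.
reward = combined_reward(detection, [1.0, 0.5, 0.0, 0.5])
print(detection, reward)
```

Scoring the trajectory separately from the outcome means that an agent that explores the data sensibly but writes an imperfect rule still earns partial credit, which matches the abstract's aim of capturing decision-making effectiveness as well as final results.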

Taxonomy

Topics: Cybercrime and Law Enforcement Studies · Network Security and Intrusion Detection · Information and Cyber Security