Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

Yicheng Cai; Mitchell John DeStefano; Guodong Dong; Pulkit Handa; Peng Liu; Tejas Singhal; Peiyu Tseng; Winston Jen White

arXiv:2603.28998·cs.CR·April 1, 2026

TL;DR

This paper proposes design principles for creating a benchmark, SOC-bench, to evaluate AI systems' blue team cybersecurity capabilities, focusing on multi-task incident response in ransomware scenarios.

Contribution

It introduces the first systematic set of design principles for constructing a comprehensive blue team AI benchmark, addressing a gap in existing evaluation tools.

Findings

01 Develops a conceptual design of SOC-bench with five blue team tasks.

02 Focuses on incident response to large-scale ransomware attacks.

03 Addresses the lack of benchmarks for blue team AI capabilities.

Abstract

As Large Language Models (LLMs) and multi-agent AI systems demonstrate increasing potential in cybersecurity operations, organizations, policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such AI systems to achieve more autonomous SOCs (security operation centers) and reduce manual effort. In particular, the AI and cybersecurity communities have recently developed several benchmarks for evaluating the red team capabilities of multi-agent AI systems. However, because the operations in SOCs are dominated by blue team operations, the capabilities of AI systems and agents to achieve more autonomous SOCs cannot be evaluated without a benchmark focused on blue team operations. To the best of our knowledge, no systematic benchmark for evaluating coordinated multi-task blue team AI has been proposed in the…
