Before You Hand Over the Wheel: Evaluating LLMs for Security Incident Analysis
Sourov Jajodia, Madeena Sultana, Suryadipta Majumdar, Adrian Taylor, Grant Vandenberghe

TL;DR
This paper introduces SIABENCH, a comprehensive benchmarking framework for evaluating large language models in security incident analysis, addressing the lack of datasets and evaluation standards in this domain.
Contribution
The paper presents a novel dataset and an autonomous agent for diverse SIA tasks, and it benchmarks 11 LLMs, advancing evaluation methods for security incident analysis tools.
Findings
SIABENCH benchmarks 11 LLMs across diverse SIA tasks.
SIABENCH enables assessment of LLM effectiveness in security contexts.
The framework supports future model and task extensions.
Abstract
Security incident analysis (SIA) poses a major challenge for security operations centers, which must manage overwhelming alert volumes, large and diverse data sources, complex toolchains, and limited analyst expertise. These difficulties intensify because incidents evolve dynamically and require multi-step, multifaceted reasoning. Although organizations are eager to adopt Large Language Models (LLMs) to support SIA, the absence of rigorous benchmarking creates significant risks for assessing their effectiveness and guiding design decisions. Benchmarking is further complicated by: (i) the lack of an LLM-ready dataset covering a wide spectrum of SIA tasks; (ii) the continual emergence of new tasks reflecting the diversity of analyst responsibilities; and (iii) the rapid release of new LLMs that must be incorporated into evaluations. In this paper, we address these challenges by…
Taxonomy
Topics: Spam and Phishing Detection · Cybercrime and Law Enforcement Studies · Advanced Malware Detection Techniques
