DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Bilel Cherif; Tamas Bisztray; Richard A. Dubniczky; Aaesha Aldahmani; Saeed Alshehhi; Norbert Tihanyi

arXiv:2505.19973 · cs.CR · May 27, 2025

Open Access

TL;DR

This paper introduces DFIR-Metric, a comprehensive benchmark dataset designed to evaluate large language models' performance in digital forensics and incident response tasks, covering knowledge, practical challenges, and forensic analysis.

Contribution

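The paper presents DFIR-Metric, a new benchmark dataset with three components (knowledge assessment, CTF-style forensic challenges, and practical disk and memory analysis) and a new metric, the Task Understanding Score (TUS), enabling systematic evaluation of LLMs in digital forensics and incident response.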

Findings

1. Evaluated 14 LLMs across multiple tasks and components.
2. Introduced the Task Understanding Score (TUS) for near-zero accuracy scenarios.
3. Provided a reproducible framework with scripts and results online.

Abstract

Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. Despite growing interest, there is no comprehensive benchmark to evaluate LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR-Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice questions sourced from industry-standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool…

Taxonomy

Topics: Digital and Cyber Forensics · Topic Modeling · Hate Speech and Cyberbullying Detection

Methods: Sparse Evolutionary Training