HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

Danyu Sun; Jinghuai Zhang; Yuan Tian; Zhou Li

arXiv:2605.21773·cs.CR·May 22, 2026

TL;DR

This paper introduces HIDBench, a new benchmark for evaluating large language models in host-based intrusion detection using complex, noisy system logs, revealing significant performance gaps and the need for robust system design.

Contribution

The work unifies three public system log datasets (DARPA-E3, DARPA-E5, and NodLink) and builds a data construction pipeline that converts raw host telemetry into LLM-compatible inputs, enabling systematic evaluation of LLMs under realistic intrusion detection settings.

Findings

01

LLMs achieve high precision on simple datasets but struggle with complex, noisy logs.

02

Performance metrics such as the Matthews correlation coefficient (MCC) drop below 0.5 as log complexity increases.

03

Models fall into different operating regimes, ranging from conservative detectors to over-sensitive ones.
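
To make the MCC finding concrete, the sketch below computes the Matthews correlation coefficient from a confusion matrix and applies it to two made-up detectors on a heavily imbalanced log sample. The counts are illustrative assumptions, not results from the paper. They only show how both a conservative detector (few alerts, high precision) and an over-sensitive one (many alerts, high recall) can land below an MCC of 0.5.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical sample: 1000 log entries, 10 of them malicious.
# These counts are invented for illustration only.
detectors = {
    # Flags very little: no false alarms, but misses most attacks.
    "conservative": dict(tp=2, fp=0, tn=990, fn=8),
    # Flags a lot: catches every attack, but raises many false alarms.
    "over_sensitive": dict(tp=10, fp=90, tn=900, fn=0),
}

for name, c in detectors.items():
    print(
        f"{name}: precision={precision(c['tp'], c['fp']):.2f} "
        f"recall={recall(c['tp'], c['fn']):.2f} "
        f"mcc={mcc(**c):.2f}"
    )
```

Because MCC uses all four cells of the confusion matrix, it penalizes both failure modes on imbalanced data, which is why it is more informative here than precision or recall alone.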

Abstract

Recent benchmark efforts have advanced the evaluation of large language models (LLMs) in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work, we present a new benchmark to assess LLMs' capabilities in supporting host-based intrusion detection systems (HIDS). This task requires fine-grained reasoning over large-scale, noisy, and highly imbalanced system logs, where complex interactions between benign and malicious activities make reliable detection challenging. Our benchmark unifies three public system log datasets, DARPA-E3, DARPA-E5, and NodLink, and introduces a data construction pipeline that transforms raw host telemetry into LLM-compatible inputs, enabling systematic evaluation under realistic intrusion detection settings.…
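
As a rough illustration of what "transforming raw host telemetry into LLM-compatible inputs" can involve, the sketch below orders events by time, splits them into fixed-size windows, and renders each window as a text prompt. It is a minimal sketch under stated assumptions, not the HIDBench pipeline: the `Event` fields, the window size, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Event:
    """One host event. All field names are hypothetical."""
    timestamp: int  # assumed: integer time, e.g. seconds since epoch
    subject: str    # assumed: acting process
    action: str     # assumed: operation such as "read" or "exec"
    obj: str        # assumed: target file path or socket address


def render_event(e: Event) -> str:
    # One event per line keeps the input compact and easy to reference.
    return f"{e.timestamp} {e.subject} {e.action} {e.obj}"


def window_events(events: List[Event], size: int) -> Iterator[List[Event]]:
    # Sort by time, then cut into consecutive chunks of at most `size` events.
    ordered = sorted(events, key=lambda e: e.timestamp)
    for start in range(0, len(ordered), size):
        yield ordered[start:start + size]


def build_prompt(window: List[Event]) -> str:
    # Hypothetical instruction text; a real benchmark would define its own.
    header = "Label each of the following host events as BENIGN or MALICIOUS."
    return header + "\n" + "\n".join(render_event(e) for e in window)


events = [
    Event(2, "bash", "exec", "/bin/ls"),
    Event(1, "sshd", "accept", "10.0.0.5:22"),
    Event(3, "ls", "read", "/etc/passwd"),
]

for window in window_events(events, size=2):
    print(build_prompt(window))
    print("---")
```

Fixed-size windowing is only one possible design. It bounds prompt length, but it can separate related events that fall on either side of a window boundary.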
