HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection
Danyu Sun, Jinghuai Zhang, Yuan Tian, Zhou Li

TL;DR
This paper introduces HIDBench, a new benchmark for evaluating large language models in host-based intrusion detection using complex, noisy system logs, revealing significant performance gaps and the need for robust system design.
Contribution
The work unifies three public system log datasets (DARPA-E3, DARPA-E5, and NodLink) and builds a data construction pipeline that turns raw host telemetry into LLM-compatible inputs, enabling systematic evaluation of LLMs in realistic intrusion detection scenarios.
Findings
LLMs achieve high precision on simple datasets but struggle with complex, noisy logs.
Performance metrics such as the Matthews correlation coefficient (MCC) drop below 0.5 as log complexity increases.
Models fall into different operating regimes, ranging from conservative detectors to over-sensitive ones.
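To make the MCC finding concrete, here is a minimal sketch of how MCC (the Matthews correlation coefficient) is computed from a binary confusion matrix. The two confusion matrices are hypothetical numbers chosen only to illustrate a conservative and an over-sensitive detector on imbalanced logs; they are not results from the paper.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal is empty (the usual convention for an
    undefined denominator).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical example: 1000 log events, 10 of them malicious.
# Conservative detector: never raises a false alarm but misses most attacks.
conservative = mcc(tp=2, fp=0, tn=990, fn=8)

# Over-sensitive detector: catches every attack but flags many benign events.
over_sensitive = mcc(tp=10, fp=200, tn=790, fn=0)

print(f"conservative   MCC = {conservative:.3f}")
print(f"over-sensitive MCC = {over_sensitive:.3f}")
```

Both hypothetical detectors land below 0.5 despite one having perfect precision and the other perfect recall, which is why MCC is a common choice for highly imbalanced detection tasks: it only rewards a model that does well on both classes.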
Abstract
Recent benchmark efforts have advanced the evaluation of large language models (LLMs) in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work, we present a new benchmark to assess LLMs' capabilities in supporting host-based intrusion detection systems (HIDS). This task requires fine-grained reasoning over large-scale, noisy, and highly imbalanced system logs, where complex interactions between benign and malicious activities make reliable detection challenging. Our benchmark unifies three public system log datasets, DARPA-E3, DARPA-E5, and NodLink, and introduces a data construction pipeline that transforms raw host telemetry into LLM-compatible inputs, enabling systematic evaluation under realistic intrusion detection settings.…
