LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics
Disha Patel

TL;DR
This paper benchmarks various log anomaly detection methods, including classical, fine-tuned transformer, and prompt-based LLM approaches, across multiple datasets, highlighting their strengths, limitations, and practical deployment considerations.
Contribution
It provides the first comprehensive comparison of LLM-based and traditional log anomaly detection methods, offering practical guidelines for real-world application.
Findings
Fine-tuned transformers achieve the highest F1-scores (0.96-0.99).
Prompt-based LLMs show strong zero-shot performance (F1: 0.82-0.91).
Prompt-based methods require no labeled data, which is an advantage in data-scarce scenarios.
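For readers unfamiliar with the metric behind these ranges, the following is a minimal sketch of how an F1-score is computed for binary anomaly labels. It assumes that 1 marks an anomalous log sequence and 0 a normal one; the function name `f1_score` and the toy labels are illustrative and are not taken from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score of the positive (anomalous) class.

    Assumption: labels are binary, with `positive` marking an anomaly.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # With no true positives, precision or recall is zero (or undefined),
    # so F1 is reported as 0.0 by convention.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical labels): 3 real anomalies, 2 of them detected,
# plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

Because anomalies are usually rare in log data, F1 on the anomalous class is more informative than plain accuracy, which a detector could inflate by labeling everything as normal.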
Abstract
System log anomaly detection is critical for maintaining the reliability of large-scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in Large Language Models (LLMs) offer promising new approaches to log understanding, but a systematic comparison of LLM-based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM-based and traditional approaches for log anomaly detection across four widely used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine-tuned transformer models (BERT, RoBERTa), and (3) prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. Our…
