Anomaly Detection of Tabular Data Using LLMs
Aodong Li, Yunhan Zhao, Chen Qiu, Marius Kloft, Padhraic Smyth, Maja Rudolph, Stephan Mandt

TL;DR
This paper explores the use of large language models for detecting anomalies in tabular data, demonstrating their zero-shot capabilities and enhancing their performance through synthetic data and fine-tuning.
Contribution
It introduces a novel approach to leverage pre-trained LLMs for batch-level anomaly detection without extra fitting, and proposes a fine-tuning strategy with synthetic data.
Findings
GPT-4 performs on par with state-of-the-art methods on the ODDS benchmark.
Synthetic datasets and fine-tuning improve LLMs' anomaly detection accuracy.
LLMs can identify low-density data regions in tabular datasets.
Abstract
Large language models (LLMs) have shown their potential in long-context understanding and mathematical reasoning. In this paper, we study the problem of using LLMs to detect tabular anomalies and show that pre-trained LLMs are zero-shot batch-level anomaly detectors. That is, without extra distribution-specific model fitting, they can discover hidden outliers in a batch of data, demonstrating their ability to identify low-density data regions. For LLMs that are not well aligned with anomaly detection and frequently output factual errors, we apply simple yet effective data-generating processes to simulate synthetic batch-level anomaly detection datasets and propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies. Experiments on a large anomaly detection benchmark (ODDS) showcase i) GPT-4 has on-par performance with the state-of-the-art…
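The batch-level setup described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the Gaussian data-generating process, the prompt template, and the helper names `make_synthetic_batch`, `serialize_batch`, and `parse_indices` are all hypothetical. The sketch draws a one-dimensional batch with a few values placed in a low-density region, writes the batch out as text for an LLM, and extracts the indices from a reply.

```python
import random
import re


def make_synthetic_batch(n_normal=18, n_anomalies=2, seed=0):
    """Build one labeled batch (hypothetical generating process).

    Normal points come from N(0, 1); anomalies come from N(10, 1),
    far from the bulk of the data.
    """
    rng = random.Random(seed)
    values = [round(rng.gauss(0.0, 1.0), 2) for _ in range(n_normal)]
    values += [round(rng.gauss(10.0, 1.0), 2) for _ in range(n_anomalies)]

    # Shuffle so anomalies are not always at the end of the batch.
    order = list(range(len(values)))
    rng.shuffle(order)
    shuffled = [values[i] for i in order]

    # 1-based positions in the shuffled batch that hold anomalies.
    labels = sorted(pos + 1 for pos, i in enumerate(order) if i >= n_normal)
    return shuffled, labels


def serialize_batch(values, feature_name="feature"):
    """Turn a batch into a text prompt (assumed template, one line per point)."""
    lines = [f"Data {i} is {v}." for i, v in enumerate(values, start=1)]
    question = (
        f"Which data points are abnormal in {feature_name}? "
        "Answer with their indices only."
    )
    return "\n".join(lines + [question])


def parse_indices(reply, batch_size):
    """Extract valid 1-based indices from a reply.

    Assumes the reply mentions indices only, not data values.
    """
    found = {int(tok) for tok in re.findall(r"\d+", reply)}
    return sorted(i for i in found if 1 <= i <= batch_size)


values, labels = make_synthetic_batch()
prompt = serialize_batch(values)
```

The prompt would be sent to whichever LLM is being evaluated, and the parsed indices compared against `labels`. Pairs of prompts and ground-truth answers produced this way could also serve as fine-tuning examples, in the spirit of the synthetic-data strategy the abstract mentions.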
Taxonomy
Topics: Computational Physics and Python Applications
Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer
