Evaluating Large Language Models for Time Series Anomaly Detection in Aerospace Software
Yang Liu, Yixing Luo, Xiaofeng Li, Xiaogang Dong, Bin Gu, Zhi Jin

TL;DR
This paper introduces ATSADBench, a comprehensive benchmark for evaluating large language models in aerospace time series anomaly detection, revealing their strengths and limitations in complex telemetry scenarios.
Contribution
The paper presents ATSADBench, the first dedicated benchmark for aerospace time series anomaly detection (TSAD), and systematically evaluates LLMs on it, highlighting their performance gaps and potential enhancement strategies.
Findings
LLMs perform well on univariate tasks but poorly on multivariate telemetry.
Alarm accuracy and contiguity are near random on multivariate data.
Few-shot learning offers modest improvements; retrieval-augmented generation (RAG) does not significantly help.
Abstract
Time series anomaly detection (TSAD) is essential for ensuring the safety and reliability of aerospace software systems. Although large language models (LLMs) provide a promising training-free alternative to unsupervised approaches, their effectiveness in aerospace settings remains under-examined because of complex telemetry, misaligned evaluation metrics, and the absence of domain knowledge. To address this gap, we introduce ATSADBench, the first benchmark for aerospace TSAD. ATSADBench comprises nine tasks that combine three pattern-wise anomaly types, univariate and multivariate signals, and both in-loop and out-of-loop feedback scenarios, yielding 108,000 data points. Using this benchmark, we systematically evaluate state-of-the-art open-source LLMs under two paradigms: Direct, which labels anomalies within sliding windows, and Prediction-Based, which detects anomalies from…
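
The Direct paradigm described above labels anomalies within sliding windows. The following is a minimal sketch of that windowing step only, not the paper's implementation. The window size, the stride, the union rule for merging overlapping windows, and the names `sliding_windows`, `label_series`, and `toy_labeler` are all assumptions made for illustration. In particular, `toy_labeler` is a simple median-deviation rule standing in for an LLM call, which the abstract does not specify.

```python
from typing import Callable, List, Sequence


def sliding_windows(n: int, window: int, stride: int) -> List[range]:
    """Return index ranges that cover a series of length n.

    If the stride does not land exactly on the end of the series,
    one extra window is shifted back so the tail is still covered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if n <= window:
        return [range(0, n)]
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] + window < n:
        starts.append(n - window)
    return [range(s, s + window) for s in starts]


def label_series(
    series: Sequence[float],
    window: int,
    stride: int,
    label_window: Callable[[Sequence[float]], List[int]],
) -> List[int]:
    """Label each point by taking the union of per-window labels.

    label_window is a hypothetical hook: in an LLM setting it would
    format the window into a prompt and parse the reply into 0/1 flags.
    """
    labels = [0] * len(series)
    for idx in sliding_windows(len(series), window, stride):
        chunk = [series[i] for i in idx]
        flags = label_window(chunk)
        for i, flag in zip(idx, flags):
            if flag:
                labels[i] = 1
    return labels


def toy_labeler(chunk: Sequence[float]) -> List[int]:
    """Stand-in for an LLM (assumption): flag points far from the window median."""
    median = sorted(chunk)[len(chunk) // 2]
    return [1 if abs(x - median) > 3.0 else 0 for x in chunk]


if __name__ == "__main__":
    telemetry = [0.0] * 10
    telemetry[4] = 10.0  # injected point anomaly
    print(label_series(telemetry, window=4, stride=2, label_window=toy_labeler))
```

Taking the union of overlapping windows favors recall over precision; a majority vote across windows would be an equally plausible choice under these assumptions.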
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Software System Performance and Reliability · Software Engineering Research
