LLM Assisted Anomaly Detection Service for Site Reliability Engineers: Enhancing Cloud Infrastructure Resilience
Nimesh Jha, Shuxin Lin, Srideepika Jayaraman, Kyle Frohling, Christodoulos Constantinides, Dhaval Patel

TL;DR
This paper presents a scalable anomaly detection service utilizing LLMs and advanced algorithms to assist SREs in managing cloud infrastructure, improving proactive issue detection and reducing downtime.
Contribution
It introduces a novel anomaly detection platform with LLM integration and versatile algorithms, tailored for industrial time-series data and cloud infrastructure management.
Findings
Over 500 users and 200,000 API calls in a year
Effective anomaly detection demonstrated on public benchmarks
Successful application in diverse industrial settings
Abstract
This paper introduces a scalable Anomaly Detection Service with a generalizable API tailored for industrial time-series data, designed to assist Site Reliability Engineers (SREs) in managing cloud infrastructure. The service enables efficient anomaly detection in complex data streams, supporting proactive identification and resolution of issues. Furthermore, it presents an innovative approach to anomaly modeling in cloud infrastructure by utilizing Large Language Models (LLMs) to understand key components, their failure modes, and behaviors. The service offers a suite of algorithms for detecting anomalies in univariate and multivariate time-series data, including regression-based, mixture-model-based, and semi-supervised approaches. We provide insights into the usage patterns of the service, with over 500 users and 200,000 API calls in a year. The service has been successfully applied in…
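To make the "regression-based" family mentioned in the abstract concrete, here is a minimal sketch of one common variant for a univariate series. It is not the service's API or the authors' algorithm: the function name `regression_anomalies`, the autoregressive least-squares model, the lag count, and the MAD-based threshold are all assumptions chosen for illustration.

```python
import numpy as np

def regression_anomalies(series, n_lags=3, threshold=4.0):
    """Flag points whose autoregressive prediction error is unusually large.

    Hypothetical helper for illustration only; not part of the paper's service.
    """
    y = np.asarray(series, dtype=float)
    # Lagged design matrix: row t holds y[t], ..., y[t + n_lags - 1],
    # and the regression target for that row is y[t + n_lags].
    lags = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
    X = np.column_stack([np.ones(len(lags)), lags])
    target = y[n_lags:]

    # Ordinary least-squares fit of the autoregressive coefficients.
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef

    # Robust residual scale via the median absolute deviation (MAD),
    # so a few large anomalies do not inflate the threshold.
    center = np.median(residuals)
    mad = np.median(np.abs(residuals - center))
    if mad > 0:
        scale = 1.4826 * mad
    else:
        scale = float(np.std(residuals)) or 1.0
    scores = np.abs(residuals - center) / scale

    # The first n_lags points have no prediction and are never flagged.
    flags = np.zeros(len(y), dtype=bool)
    flags[n_lags:] = scores > threshold
    return flags
```

Called on a noisy periodic signal with an injected spike, the function returns a boolean mask the same length as the input, with the spike position marked. Points immediately after a spike can also be flagged, because the spike enters their lag features.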
Taxonomy
Topics: Software System Performance and Reliability · Anomaly Detection Techniques and Applications
