Anomaly Detection for Incident Response at Scale
Hanzhang Wang, Gowtham Kumar Tangirala, Gilkara Pranav Naidu, Charles Mayville, Arighna Roy, Joanne Sun, Ramesh Babu Mandava

TL;DR
This paper introduces AIDR, a machine learning-based anomaly detection system deployed at Walmart that improves incident detection speed and accuracy by combining statistical, ML, and rule-based methods, with scalable, real-time monitoring.
Contribution
The paper presents a scalable, real-time anomaly detection system integrating multiple ML models and rule-based thresholds, with feedback mechanisms for continuous improvement, tailored for large-scale enterprise use.
Findings
Detected 63% of major incidents
Reduced mean-time-to-detect by over 7 minutes
Lowered false positive rate compared to previous methods
Abstract
We present a machine learning-based anomaly detection product, AI Detect and Respond (AIDR), that monitors Walmart's business and system health in real-time. During the validation over 3 months, the product served predictions from over 3000 models to more than 25 application, platform, and operation teams, covering 63% of major incidents and reducing the mean-time-to-detect (MTTD) by more than 7 minutes. Unlike previous anomaly detection methods, our solution leverages statistical, ML and deep learning models while continuing to incorporate rule-based static thresholds to incorporate domain-specific knowledge. Both univariate and multivariate ML models are deployed and maintained through distributed services for scalability and high availability. AIDR has a feedback loop that assesses model quality with a combination of drift detection algorithms and customer feedback. It also offers…
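The idea of pairing a statistical detector with a rule-based static threshold can be illustrated with a minimal sketch. This is not AIDR's implementation: the function name `detect_anomalies`, the trailing-window z-score, the parameters `window`, `z_limit`, and `static_max`, and the sample latency values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, z_limit=3.0, static_max=None):
    """Return indexes that break a static rule or deviate from the trailing window.

    Hypothetical sketch: a point is flagged if it exceeds a fixed ceiling
    (rule-based check) or if its z-score against the previous `window`
    points exceeds `z_limit` (statistical check).
    """
    flagged = []
    for i, x in enumerate(values):
        # Rule-based check: a domain-specific hard ceiling (assumed).
        rule_hit = static_max is not None and x > static_max

        # Statistical check: z-score against the trailing window.
        stat_hit = False
        if i >= window:
            history = values[i - window:i]
            mu, sigma = mean(history), stdev(history)
            if sigma > 0:  # skip flat history to avoid division by zero
                stat_hit = abs(x - mu) / sigma > z_limit

        if rule_hit or stat_hit:
            flagged.append(i)
    return flagged


# Made-up latency series (milliseconds) with two spikes.
latency_ms = [100, 102, 98, 101, 99, 100, 180, 101, 100, 650]
flags = detect_anomalies(latency_ms, window=5, z_limit=3.0, static_max=500)
print(flags)  # [6, 9]
```

In this sketch the spike at index 6 is caught only by the statistical check, while the one at index 9 trips both checks. The static rule also covers the first `window` points, where no history exists yet.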
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Network Security and Intrusion Detection
