From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
Srinidhi Madabhushi, Pranesh Vyas, Swathi Vaidyanathan, Mayur Kurup, Elliott Nash, Yegor Silyutin

TL;DR
This paper introduces a graph embedding-based anomaly detection system for microservice architectures that identifies unusual service behaviors during live events, improving early incident detection.
Contribution
It presents a novel GCN-GAE-based approach for real-time anomaly detection in directed service graphs, along with a synthetic evaluation framework and practical deployment insights.
Findings
Achieved 96% precision in anomaly detection
Demonstrated early detection of incident-related services
Low false positive rate of 0.08%
Abstract
Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load test and event embeddings. The system identifies incident-related services that are documented and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation that…
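The comparison step described in the abstract, scoring each service by the cosine similarity between its load test embedding and its event embedding, can be illustrated with a minimal sketch. The embeddings here are hand-written toy vectors rather than GCN-GAE outputs. The service names, the `flag_anomalous_services` helper, and the 0.5 threshold are all hypothetical. The excerpt does not state the paper's actual threshold or decision rule, so this sketch assumes that low similarity marks a service as under-represented.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_anomalous_services(load_test_emb, event_emb, threshold=0.5):
    """Return {service: similarity} for services whose event embedding
    diverges from the load test embedding (similarity below threshold).

    The threshold value is an assumption made for illustration only.
    """
    flagged = {}
    for service, event_vec in event_emb.items():
        base_vec = load_test_emb.get(service)
        if base_vec is None:
            # Service absent from the load test graph; skipped in this sketch.
            continue
        sim = cosine_similarity(base_vec, event_vec)
        if sim < threshold:
            flagged[service] = sim
    return flagged


# Toy node embeddings for two hypothetical services.
load_test_emb = {"playback": [1.0, 0.0], "catalog": [0.6, 0.8]}
event_emb = {"playback": [0.9, 0.1], "catalog": [-0.8, 0.6]}

flagged = flag_anomalous_services(load_test_emb, event_emb)
print(flagged)
```

In this toy input, `playback` keeps nearly the same direction in embedding space and is not flagged, while `catalog` is orthogonal to its load test embedding and falls below the assumed threshold. In a minute-level setting, the same comparison would presumably be repeated per time window.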
