A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends
Tingting Wang, Guilin Qi

TL;DR
This survey reviews various root cause analysis techniques in microservices, highlighting methodologies, challenges, and future trends to improve fault diagnosis and system stability.
Contribution
It provides a comprehensive, structured overview of RCA methods in microservices, covering recent advancements and future research directions.
Findings
Various RCA methodologies using metrics, traces, logs, and multi-model data
Challenges include complex dependencies and propagative faults in microservices
Future trends involve AI and automation for improved fault diagnosis
Abstract
The complex dependencies and propagative faults inherent in microservices, characterized by a dense network of interconnected services, pose significant challenges in identifying the underlying causes of issues. Prompt identification and resolution of disruptive problems are crucial to ensure rapid recovery and maintain system stability. Numerous methodologies have emerged to address this challenge, primarily focusing on diagnosing failures through symptomatic data. This survey aims to provide a comprehensive, structured review of root cause analysis (RCA) techniques within microservices, exploring methodologies that include metrics, traces, logs, and multi-model data. It delves deeper into the methodologies, challenges, and future trends within microservices architectures. Positioned at the forefront of AI and automation advancements, it offers guidance for future research directions.
Taxonomy
Topics: Software System Performance and Reliability · Anomaly Detection Techniques and Applications · Supply Chain Resilience and Risk Management
