Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis
Shenglin Zhang, Sibo Xia, Wenzhao Fan, Binpeng Shi, Xiao Xiong, Zhenyu Zhong, Minghua Ma, Yongqian Sun, Dan Pei

TL;DR
This paper surveys 98 studies on failure diagnosis in microservice systems, highlighting challenges, current practices, datasets, and future research directions to improve system reliability.
Contribution
It provides the first comprehensive review of failure diagnosis techniques in microservices, including analysis of concepts, architectures, datasets, and evaluation methods.
Findings
Identified key challenges in diagnosing failures in microservices.
Compiled datasets and tools for failure diagnosis.
Outlined future research directions for improved reliability.
Abstract
Widely adopted for their scalability and flexibility, modern microservice systems present unique failure diagnosis challenges due to their independent deployment and dynamic interactions. This complexity can lead to cascading failures that negatively impact operational efficiency and user experience. Recognizing the critical role of fault diagnosis in improving the stability and reliability of microservice systems, researchers have conducted extensive studies and achieved a number of significant results. This survey provides an exhaustive review of 98 scientific papers from 2003 to the present, including a thorough examination and elucidation of the fundamental concepts, system architecture, and problem statement. It also includes a qualitative analysis of the dimensions, providing an in-depth discussion of current best practices and future directions, aiming to further its development…
Taxonomy
Topics: Software System Performance and Reliability · Software-Defined Networks and 5G · Cloud Computing and Resource Management
