A Goal-Driven Survey on Root Cause Analysis
Aoyang Fang, Haowen Yang, Haoze Dong, Qisheng Lu, Junjielong Xu, and Pinjia He

TL;DR
This survey categorizes 135 RCA research papers based on their specific goals in cloud incident management, highlighting distinctions often overlooked and providing insights into progress, gaps, and future directions.
Contribution
It introduces a goal-driven framework for classifying RCA research, addressing the lack of goal-based categorization in previous surveys.
Findings
Effective categorization of RCA papers by goals
Identification of research gaps and challenges
Discussion of future directions in RCA
Abstract
Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud services. While the term root cause analysis or RCA has been widely used, different studies formulate the task differently. This is because the term "RCA" implicitly covers tasks with distinct underlying goals. For instance, the goal of localizing a faulty service for rapid triage is fundamentally different from identifying a specific functional bug for a definitive fix. However, previous surveys have largely overlooked these goal-based distinctions, conventionally categorizing papers by input data types (e.g., metric-based vs. trace-based methods). This leads to the grouping of works with disparate objectives, thereby obscuring the true progress and gaps in the field. Meanwhile, the typical audience of an RCA survey is either laymen who want to know the goals and big picture of the task or RCA…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Service-Oriented Architecture and Web Services
