Root Cause Analysis In Microservice Using Neural Granger Causal Discovery
Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, Wen-Chih Peng

TL;DR
This paper introduces RUN, a neural Granger causal discovery method with contrastive learning, to improve root cause analysis in microservices by capturing temporal relationships and providing efficient cause ranking.
Contribution
The paper presents a novel neural Granger causal discovery approach that incorporates temporal information and contrastive learning for more accurate root cause analysis in microservices.
Findings
RUN outperforms existing methods on synthetic datasets.
RUN effectively identifies root causes in real-world microservice data.
The approach demonstrates practical utility in microservice troubleshooting.
Abstract
In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintenance, and flexibility. However, it becomes challenging for site reliability engineers (SREs) to pinpoint the root cause due to the complex relationships in microservices when facing system malfunctions. Previous research employed structured learning methods (e.g., PC-algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, they ignored the temporal order of time series data and failed to leverage the rich information inherent in the temporal relationships. For instance, in cases where there is a sudden spike in CPU utilization, it can lead to an increase in latency for other microservices. However, in this scenario, the anomaly in CPU utilization occurs before the latency increase, rather than simultaneously. As a result, the…
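The temporal-order argument in the abstract can be illustrated with a minimal sketch. This is a classical linear Granger comparison on synthetic data, not the neural RUN method described in the paper. The series names (`cpu`, `latency`), the coefficients, the lag, and the helper functions `lagged_design` and `granger_gain` are all assumptions made for illustration.

```python
import numpy as np


def lagged_design(target, driver, lag):
    """Build regressors that predict target[t] from values at t-1 .. t-lag."""
    n = len(target)
    y = target[lag:]
    own = np.column_stack([target[lag - k: n - k] for k in range(1, lag + 1)])
    other = np.column_stack([driver[lag - k: n - k] for k in range(1, lag + 1)])
    ones = np.ones((n - lag, 1))
    restricted = np.hstack([ones, own])        # target's own past only
    full = np.hstack([ones, own, other])       # plus the driver's past
    return y, restricted, full


def granger_gain(target, driver, lag=2):
    """Log ratio of residual sums of squares (restricted / full).

    A large value suggests the driver's past helps predict the target,
    which is the linear notion of Granger causality.
    """
    y, restricted, full = lagged_design(target, driver, lag)

    def rss(X):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        return float(resid @ resid)

    return float(np.log(rss(restricted) / rss(full)))


# Hypothetical synthetic metrics: latency follows CPU with a one-step delay.
rng = np.random.default_rng(0)
n = 500
cpu = rng.normal(size=n)
latency = np.zeros(n)
for t in range(1, n):
    latency[t] = 0.8 * cpu[t - 1] + 0.1 * rng.normal()

forward = granger_gain(latency, cpu)    # does past CPU help predict latency?
backward = granger_gain(cpu, latency)   # does past latency help predict CPU?
print(f"cpu -> latency gain: {forward:.3f}")
print(f"latency -> cpu gain: {backward:.3f}")
```

In this toy setup the forward gain should be much larger than the backward gain, because the CPU anomaly precedes the latency change. A method that only inspects simultaneous values has no way to use that ordering.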
Taxonomy
Topics: Software System Performance and Reliability · Anomaly Detection Techniques and Applications · Traffic Prediction and Management Techniques
