Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments
Yuhan Zhu, Jian Wang, Bing Li, Xuxian Tang, Hao Li, Neng Zhang, Yuqi Zhao

TL;DR
This paper introduces MicroCERCL, a novel method for accurately localizing root causes of failures in microservice systems within cloud-edge collaborative environments, addressing challenges like network instability and high latency.
Contribution
MicroCERCL is the first approach to localize root causes at both kernel and application levels in complex cloud-edge environments, utilizing log analysis and graph neural networks without relying on historical data.
Findings
Achieves at least 24.1% higher top-1 accuracy than existing methods.
Effectively localizes root causes in hybrid, unstable cloud-edge microservice deployments.
Introduces the first benchmark dataset for hybrid deployment microservice systems.
Abstract
With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. Additionally, the cloud-edge collaborative environment introduces more difficulties, such as unstable networks and high latency across network segments. Accurately identifying the root cause of microservices in a cloud-edge collaborative environment has thus become an urgent problem. In this paper, we propose MicroCERCL, a novel approach that pinpoints root causes at the kernel and application level in the cloud-edge collaborative environment. Our key insight is that failures propagate through direct invocations and indirect resource-competition dependencies in a cloud-edge collaborative environment characterized by instability and high latency. This will become more complex in the hybrid deployment that simultaneously involves…
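The two propagation paths named in the abstract, direct invocations and indirect resource-competition dependencies, can be pictured as two edge types over the same set of services. The following is a minimal illustrative sketch, not MicroCERCL's actual graph construction: the function name, the service and host names, and the assumption that co-location on one host implies resource competition are all hypothetical.

```python
from collections import defaultdict


def build_dependency_graph(invocations, placement):
    """Combine direct and indirect dependencies into one adjacency map.

    invocations: iterable of (caller, callee) service pairs (hypothetical input).
    placement:   dict mapping each service to the host it runs on (hypothetical input).
    Returns a dict: service -> set of (neighbor, edge_type) tuples.
    """
    graph = defaultdict(set)

    # Direct dependencies: a fault in the callee can surface in its caller.
    for caller, callee in invocations:
        graph[caller].add((callee, "invocation"))

    # Indirect dependencies (assumption): services sharing a host compete
    # for its resources, so a fault in one may degrade the others.
    services_by_host = defaultdict(list)
    for service, host in placement.items():
        services_by_host[host].append(service)
    for services in services_by_host.values():
        for a in services:
            for b in services:
                if a != b:
                    graph[a].add((b, "resource"))

    return dict(graph)


# Hypothetical hybrid deployment: two services in the cloud, two at the edge.
invocations = [("frontend", "cart"), ("cart", "db")]
placement = {
    "frontend": "cloud-1",
    "db": "cloud-1",
    "cart": "edge-1",
    "sensor": "edge-1",
}
graph = build_dependency_graph(invocations, placement)
```

In this toy deployment, "sensor" never calls "cart", yet the two are linked by a resource edge because they share the host "edge-1"; a graph built from invocations alone would miss that path.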
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
Methods: Graph Neural Network
