Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?
Runzhou Wang, Shenglin Zhang, Wenwei Gu, Yongxin Zhao, Chenyu Zhao, Dan Pei, Yuxuan Chen, Yangyuxin Huang

TL;DR
This paper investigates how different types of heterogeneity in microservice systems affect root cause localization, proposing a new framework that models entity distinctions and fault propagation patterns to improve diagnostic accuracy.
Contribution
It introduces NexusRCL, a semi-supervised, heterogeneous graph-based framework that captures entity-level heterogeneity and fault propagation for better root cause localization.
Findings
NexusRCL achieves up to 49.85% improvement in Top-1 accuracy.
It effectively models cross-layer fault propagation patterns.
It demonstrates superior performance on industrial benchmark datasets.
Abstract
Microservice root cause localization (RCL) is fundamentally challenged by the inherent heterogeneity of cloud-native systems, which encompasses diverse observability data and multiple system entities. Existing approaches typically focus on only one aspect of heterogeneity and thus fail to capture its full diagnostic value. In this work, we systematically examine the multifaceted role of heterogeneity within both microservice systems and the RCL process. This analysis motivates a deeper investigation into how entity-level distinctions and their asymmetric dependencies influence fault behavior. Our empirical analysis of two microservice benchmarks reveals that entity-level heterogeneity naturally gives rise to heterogeneous fault propagation, which is highly asymmetric and dominated by cross-layer interactions between services and hosts. In light of this, we propose NexusRCL, a semi-supervised…
