ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System
Yongqian Sun, Xijie Pan, Xiao Xiong, Lei Tao, Jiaju Wang, Shenglin Zhang, Yuan Yuan, Yuqi Li, Kunlin Jian

TL;DR
ClusterRCA is an end-to-end framework that localizes culprit nodes and classifies failure types in HPC networks. It extracts features from multimodal data on topologically connected NIC pairs, classifies their state, and runs a customized random walk on the resulting failure graph. The authors report high accuracy and robust performance.
Contribution
The paper introduces ClusterRCA, a novel end-to-end framework that combines classifier and graph-based methods for fault localization and classification in HPC network systems.
Findings
Achieves high accuracy in diagnosing network failures in HPC systems.
Maintains robust performance across diverse application scenarios.
Effectively leverages multimodal data for fault analysis.
Abstract
Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and lack of accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches. It constructs a failure graph from the output of the state classifier and then performs a customized random walk on the graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show ClusterRCA achieves…
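The abstract describes building a failure graph from classifier output and walking it to find the root cause. The sketch below illustrates that general idea only. It is not the paper's customized random walk: it swaps in a plain random walk with restart, and the per-pair abnormality scores, the threshold, the restart probability, and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_failure_graph(pair_scores, threshold=0.5):
    """Keep NIC pairs whose (hypothetical) abnormality score exceeds the threshold.

    pair_scores maps (nic_a, nic_b) -> score in [0, 1], standing in for the
    output of a state classifier. Edges are undirected and weighted by score.
    The threshold is assumed non-negative, so every kept weight is positive.
    """
    graph = defaultdict(dict)
    for (a, b), score in pair_scores.items():
        if score > threshold:
            graph[a][b] = score
            graph[b][a] = score
    return dict(graph)


def random_walk_with_restart(graph, restart=0.15, iters=100):
    """Generic random walk with restart, computed by power iteration.

    At each step the walker restarts at a uniformly chosen node with
    probability `restart`; otherwise it follows an edge with probability
    proportional to the edge weight.
    """
    nodes = sorted(graph)
    if not nodes:
        return {}
    uniform = 1.0 / len(nodes)
    rank = {n: uniform for n in nodes}
    for _ in range(iters):
        nxt = {n: restart * uniform for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())
            for m, w in graph[n].items():
                nxt[m] += (1.0 - restart) * rank[n] * w / total
        rank = nxt
    return rank


def localize(pair_scores, threshold=0.5):
    """Return candidate culprit NICs, most suspicious first."""
    rank = random_walk_with_restart(build_failure_graph(pair_scores, threshold))
    return sorted(rank, key=rank.get, reverse=True)


if __name__ == "__main__":
    # Made-up scores: nic0 takes part in every abnormal pair.
    scores = {
        ("nic0", "nic1"): 0.90,
        ("nic0", "nic2"): 0.80,
        ("nic0", "nic3"): 0.95,
        ("nic4", "nic5"): 0.10,  # below threshold, dropped from the graph
    }
    print(localize(scores))
```

In this toy input the NIC shared by all abnormal pairs accumulates the most visit probability, which is the intuition behind ranking culprits with a walk over the failure graph. A real system would also need the failure-type classification step, which this sketch omits.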
Taxonomy
Topics: Software System Performance and Reliability · Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications
