Kunlun Anomaly Troubleshooter: Enabling Kernel-Level Anomaly Detection and Causal Reasoning for Large Model Distributed Inference
Yuyang Liu, Jingjing Cai, Jiayi Ren, Peng Zhou, Danyang Zhang, Yin Du, Shijian Li

TL;DR
Kunlun Anomaly Troubleshooter (KAT) is a novel framework that detects kernel-level anomalies in large model distributed inference systems with high precision and recall, and provides causal reasoning and natural language explanations to improve troubleshooting efficiency.
Contribution
KAT introduces a new approach combining GPU worker synchronicity analysis with domain-adapted LLMs for precise anomaly detection and causal reasoning in LMDI systems.
Findings
Achieves over 0.884 precision in anomaly detection
Achieves over 0.936 recall in anomaly detection
Significantly improves troubleshooting efficiency and success rate
Abstract
Anomaly troubleshooting for large model distributed inference (LMDI) remains a critical challenge. Resolving anomalies such as inference performance degradation or latency jitter in distributed systems demands significant manual effort from domain experts, resulting in extremely time-consuming diagnosis processes with relatively low accuracy. In this paper, we introduce Kunlun Anomaly Troubleshooter (KAT), the first anomaly troubleshooting framework tailored for LMDI. KAT addresses this problem through two core innovations. First, KAT exploits the synchronicity and consistency of GPU workers and leverages function trace data to precisely detect kernel-level anomalies and their associated hardware components at nanosecond resolution. Second, KAT integrates these detection results into a domain-adapted LLM, delivering systematic causal reasoning and natural language interpretation of…
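The synchronicity idea in the abstract can be illustrated with a small sketch. This is not KAT's algorithm, which the abstract does not specify. It only assumes that GPU workers running the same kernel should take roughly the same time, and it flags a worker whose duration strays far from the cross-worker median. The function name, the input layout, and the threshold are all hypothetical.

```python
from statistics import median


def find_desynchronized_workers(durations_ns, threshold=5.0):
    """Flag workers whose kernel duration deviates from their peers.

    durations_ns: hypothetical layout {kernel_name: {worker_id: duration_ns}}.
    threshold: assumed cutoff, in units of median absolute deviation (MAD).
    Returns a list of (kernel_name, worker_id, duration_ns, peer_median_ns).
    """
    anomalies = []
    for kernel, per_worker in durations_ns.items():
        values = list(per_worker.values())
        peer_median = median(values)
        # MAD is a robust spread estimate: one slow worker barely moves it.
        mad = median(abs(v - peer_median) for v in values)
        scale = max(mad, 1.0)  # guard against division by zero
        for worker, duration in per_worker.items():
            score = abs(duration - peer_median) / scale
            if score > threshold:
                anomalies.append((kernel, worker, duration, peer_median))
    return anomalies


# Made-up trace data: worker 3 is slow on the all_reduce kernel only.
example_traces = {
    "all_reduce": {0: 1000, 1: 1010, 2: 990, 3: 5000},
    "gemm": {0: 2000, 1: 2005, 2: 1995, 3: 2002},
}

for kernel, worker, duration, peer_median in find_desynchronized_workers(example_traces):
    print(f"{kernel}: worker {worker} took {duration} ns (peer median {peer_median} ns)")
```

The median and MAD are used here instead of the mean and standard deviation so that a single straggler does not inflate the baseline it is compared against. A real system would also need to align kernel invocations across workers before comparing them.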
Taxonomy
Topics: Software System Performance and Reliability · Anomaly Detection Techniques and Applications · Distributed systems and fault tolerance
