CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
Yida Gu, Fakang Wang, Jianhao Fu, Zhenhang Sun, Qianyu Zhang, Hairui Zhao, Xingchen Liu, Yang Tian, Wenjing Huang, Zedong Liu, Yifan Chen, Jinwu Yang, Yueyuan Zhou, Qian Zhao, Haoxu Li, Tao Wang, Feng Yu, Zhan Wang, Guangming Tan, Dingwen Tao

TL;DR
CCL-D is a diagnostic system that accurately detects and locates slow/hang communication anomalies in large-scale distributed training, significantly reducing diagnosis time.
Contribution
It introduces a real-time probing and intelligent analysis framework that improves accuracy and efficiency over traditional diagnostic methods.
Findings
Achieved near-complete coverage of known anomalies in a 4,000-GPU cluster.
Pinpointed faulty GPU ranks within 6 minutes, outperforming existing solutions.
Deployed over one year, demonstrating robustness and practical effectiveness.
Abstract
As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
