CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Yida Gu; Fakang Wang; Jianhao Fu; Zhenhang Sun; Qianyu Zhang; Hairui Zhao; Xingchen Liu; Yang Tian; Wenjing Huang; Zedong Liu; Yifan Chen; Jinwu Yang; Yueyuan Zhou; Qian Zhao; Haoxu Li; Tao Wang; Feng Yu; Zhan Wang; Guangming Tan; Dingwen Tao

arXiv:2605.04478 · cs.DC · May 7, 2026

TL;DR

CCL-D is a diagnostic system that detects and locates slow/hang communication anomalies in large-scale distributed training. It pinpoints faulty GPU ranks within 6 minutes, where traditional methods frequently take hours or even days.

Contribution

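It combines a rank-level real-time probe, built on a lightweight distributed tracing framework, with an intelligent decision analyzer that automates anomaly detection and root-cause location, improving accuracy and efficiency over traditional diagnostic methods.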

Findings

1. Achieved near-complete coverage of known anomalies in a 4,000-GPU cluster.

2. Pinpointed faulty GPU ranks within 6 minutes, outperforming existing solutions.

3. Deployed over one year, demonstrating robustness and practical effectiveness.

Abstract

As training scales grow, collective communication libraries (CCL) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow/hang communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause location,…
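To make the probe-plus-analyzer split concrete, the following is a minimal sketch of how per-rank probe samples might be turned into suspect ranks. It is not CCL-D's algorithm, which this summary does not describe; it uses two generic heuristics as stand-ins. For a hang, a rank that has entered fewer collectives than its peers is treated as the likely culprit, since the others are blocked waiting for it. For a slowdown, a rank that enters the current collective much later than the median peer is flagged. The names `ProbeSample`, `launched`, `arrival_ms`, `find_hang_suspects`, `find_slow_suspects`, and the `slack_ms` threshold are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ProbeSample:
    """One hypothetical per-rank probe record (field names are assumptions)."""
    rank: int
    launched: int      # number of collectives this rank has entered so far
    arrival_ms: float  # time at which it entered its latest collective


def find_hang_suspects(samples):
    """Return ranks that lag behind the furthest-advanced rank.

    Heuristic: in a hang, healthy ranks are blocked inside collective N,
    while the culprit never entered it and still reports N-1 or less.
    """
    if not samples:
        return []
    front = max(s.launched for s in samples)
    return sorted(s.rank for s in samples if s.launched < front)


def find_slow_suspects(samples, slack_ms=50.0):
    """Return ranks that entered the current collective unusually late.

    Only ranks that reached the newest collective are compared, so a
    hung rank does not distort the median arrival time.
    """
    if not samples:
        return []
    front = max(s.launched for s in samples)
    peers = [s for s in samples if s.launched == front]
    typical = median(s.arrival_ms for s in peers)
    return sorted(s.rank for s in peers if s.arrival_ms - typical > slack_ms)


# Illustrative, made-up samples: rank 2 never entered collective 10,
# and rank 3 entered it about 300 ms after its peers.
samples = [
    ProbeSample(rank=0, launched=10, arrival_ms=100.0),
    ProbeSample(rank=1, launched=10, arrival_ms=101.0),
    ProbeSample(rank=2, launched=9, arrival_ms=0.0),
    ProbeSample(rank=3, launched=10, arrival_ms=400.0),
]
hang_suspects = find_hang_suspects(samples)
slow_suspects = find_slow_suspects(samples)
```

With these made-up samples, rank 2 is the hang suspect and rank 3 is the slow suspect. A production analyzer would presumably combine many more signals, such as the cross-layer metrics the abstract mentions, before naming a root cause.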
