Reliable and Resilient Collective Communication Library for LLM Training and Serving
Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu

TL;DR
R$^2$CCL is a fault-tolerant communication library for large-scale ML that ensures low-overhead recovery from network failures, significantly improving robustness and efficiency during training and inference.
Contribution
The paper introduces R$^2$CCL, a novel fault-tolerant communication library utilizing multi-NIC hardware for lossless, low-overhead failover in large-scale ML systems.
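To make the multi-NIC failover idea concrete, here is a minimal sketch of redistributing traffic across the surviving NICs in proportion to their bandwidth. It is not the paper's implementation: the `Nic` class, the `split_by_bandwidth` function, and the 400 Gb/s figures are hypothetical names and values chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Nic:
    """Hypothetical NIC descriptor (illustration only, not an R^2CCL type)."""
    name: str
    bandwidth_gbps: float
    healthy: bool = True


def split_by_bandwidth(total_bytes: int, nics: list[Nic]) -> dict[str, int]:
    """Split a transfer across healthy NICs in proportion to their bandwidth."""
    healthy = [n for n in nics if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy NIC left to carry traffic")

    total_bw = sum(n.bandwidth_gbps for n in healthy)
    shares = {
        n.name: int(total_bytes * n.bandwidth_gbps / total_bw) for n in healthy
    }

    # Integer truncation can leave a few bytes unassigned; hand them to the
    # fastest healthy NIC so the shares always sum to total_bytes.
    remainder = total_bytes - sum(shares.values())
    fastest = max(healthy, key=lambda n: n.bandwidth_gbps)
    shares[fastest.name] += remainder
    return shares


# Assumed setup: four equal NICs, then one of them fails.
nics = [Nic(f"nic{i}", 400.0) for i in range(4)]
before = split_by_bandwidth(1_000_000, nics)

nics[1].healthy = False  # simulate a NIC failure
after = split_by_bandwidth(1_000_000, nics)
```

In this sketch the failed NIC simply drops out of the split and the remaining NICs absorb its share, so no bytes are lost from the plan; a real library would also have to migrate in-flight connections, which this example does not model.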
Findings
R$^2$CCL achieves less than 1% training overhead under NIC failures.
R$^2$CCL outperforms AdapCC and DejaVu by 12.18× and 47×, respectively.
The library maintains high robustness across diverse failure patterns.
Abstract
Modern ML training and inference now span tens to tens of thousands of GPUs, where network faults can waste 10-15% of GPU hours due to slow recovery. Common network errors and link fluctuations trigger timeouts that often terminate entire jobs, forcing expensive checkpoint rollback during training and request reprocessing during inference. We present R$^2$CCL, a fault-tolerant communication library that provides lossless, low-overhead failover by exploiting multi-NIC hardware. R$^2$CCL performs rapid connection migration, bandwidth-aware load redistribution, and resilient collective algorithms to maintain progress under failures. We evaluate R$^2$CCL on two 8-GPU H100 InfiniBand servers and via large-scale ML simulators modeling hundreds of GPUs with diverse failure patterns. Experiments show that R$^2$CCL is highly robust to NIC failures, incurring less than 1% training and less…
Taxonomy
Topics: Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
