DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster
Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu, Wu He, Xun Gao, Jason Zeng, Michael Heinrich

TL;DR
DiLoCoX is a decentralized training framework that makes it practical to train large language models with over 100 billion parameters on slow networks by reducing communication overhead and overlapping communication with computation.
Contribution
It introduces a low-communication training framework that combines pipeline parallelism, a dual optimizer policy, one-step-delay overlap of communication and local training, and adaptive gradient compression for large-scale decentralized model training.
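To make the idea of gradient compression concrete, here is a minimal sketch of one generic technique, uniform 8-bit quantization. This summary does not specify which compression scheme DiLoCoX uses or how it adapts, so the function names `quantize_int8` and `dequantize` and the choice of quantization are illustrative assumptions, not the authors' method.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] plus one shared scale.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Guard against an all-zero input, where the scale would be 0.
    scale = max(abs(v) for v in values) / 127 or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


gradient = [0.5, -1.27, 0.013, 0.0]
codes, scale = quantize_int8(gradient)
restored = dequantize(codes, scale)

# Each value is now one byte on the wire instead of four (fp32),
# at the cost of a rounding error of at most scale / 2 per value.
worst_error = max(abs(g - r) for g, r in zip(gradient, restored))
print(codes, worst_error <= scale / 2 + 1e-12)
```

Sending one byte per value instead of four gives a 4x reduction over fp32. An adaptive scheme would additionally vary the compression strength during training, which this sketch does not attempt.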
Findings
Successfully pre-trained a 107B-parameter model over a 1 Gbps network.
Achieved a 357x speedup in distributed training compared to vanilla AllReduce.
Maintained negligible degradation in model convergence.
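A rough back-of-envelope calculation shows why a 1 Gbps link is a bottleneck at this scale. The sketch below estimates the time for one uncompressed gradient exchange with an idealized ring AllReduce. The 2-byte value size, the two-worker setup, and the function name `allreduce_seconds` are assumptions made for illustration. They are not figures from the paper, and the estimate ignores latency and protocol overhead.

```python
def allreduce_seconds(num_params, bytes_per_value, bandwidth_gbps, num_workers):
    """Idealized ring-AllReduce time for one full gradient exchange."""
    payload_bits = num_params * bytes_per_value * 8
    # In a ring AllReduce each worker transfers about 2 * (N - 1) / N
    # times the payload size.
    traffic_bits = 2 * (num_workers - 1) / num_workers * payload_bits
    return traffic_bits / (bandwidth_gbps * 1e9)


# Assumed setup: 107B parameters, 2 bytes per value, 1 Gbps link, 2 workers.
seconds = allreduce_seconds(107e9, 2, 1.0, 2)
print(f"{seconds / 60:.1f} minutes per synchronization")  # 28.5 minutes
```

Under these assumptions, synchronizing on every optimizer step would cost roughly half an hour per step, which is why synchronizing less often, compressing what is sent, and overlapping communication with computation all matter.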
Abstract
The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression…
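The interplay between local training, an outer optimizer, and a one-step delay can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: each worker minimizes a scalar quadratic, the inner optimizer is plain SGD, the outer optimizer is plain SGD on the averaged pseudo-gradient, and "one-step delay" is modeled by applying the previous round's averaged pseudo-gradient while the current one is still in flight. The function name `train` and all hyperparameters are hypothetical.

```python
def train(targets, rounds=50, local_steps=5, inner_lr=0.1, outer_lr=0.7, delay=True):
    """Toy local-update training; worker k minimizes 0.5 * (w - targets[k])**2."""
    global_w = 0.0
    in_flight = None  # averaged pseudo-gradient still being communicated
    for _round in range(rounds):
        deltas = []
        for target in targets:
            # Local phase: each worker starts from the current global weight.
            w = global_w
            for _step in range(local_steps):
                w -= inner_lr * (w - target)  # inner SGD step
            deltas.append(global_w - w)  # pseudo-gradient of this worker
        pseudo_grad = sum(deltas) / len(deltas)
        if delay:
            # Apply the update computed one round earlier; its communication
            # is assumed to have overlapped with the local phase above.
            if in_flight is not None:
                global_w -= outer_lr * in_flight
            in_flight = pseudo_grad
        else:
            # Blocking variant: wait for the fresh average before continuing.
            global_w -= outer_lr * pseudo_grad
    return global_w


targets = [1.0, 3.0, 5.0]  # the shared optimum is their mean, 3.0
print(train(targets, delay=False), train(targets, delay=True))
```

In this toy setting both variants settle at the mean of the targets, which is the sense in which a stale update need not prevent convergence. Whether that holds for a real model depends on the learning rates and the number of local steps.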
Taxonomy
Topics: Distributed and Parallel Computing Systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
