DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

Ji Qi; WenPeng Zhu; Li Li; Ming Wu; YingJun Wu; Wu He; Xun Gao; Jason Zeng; Michael Heinrich

arXiv:2506.21263·cs.LG·June 27, 2025

Open Access

TL;DR

DiLoCoX is a decentralized training framework that enables large language models with over 100 billion parameters to be trained efficiently on slow networks by reducing communication overhead and overlapping communication with computation.

Contribution

The paper introduces a low-communication training framework that combines pipeline parallelism with a dual optimizer policy, one-step-delay overlap of communication and local training, and adaptive gradient compression for large-scale decentralized model training.

Findings

1. Successfully pre-trained a 107B-parameter model over a 1 Gbps network.

2. Achieved a 357x speedup compared to vanilla AllReduce.

3. Maintained negligible degradation in model convergence.
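To give a rough sense of why communication dominates at this scale, the following back-of-the-envelope sketch estimates how long it takes to push one full copy of a 107B-parameter gradient through a 1 Gbps link. The 2-bytes-per-parameter width and the single point-to-point transfer model are assumptions made for illustration; they are not taken from the paper, and a real AllReduce moves more data per worker than this.

```python
# Back-of-the-envelope estimate only. The byte width and the
# "one full copy over one link" model are illustrative assumptions.

def transfer_seconds(num_params: float, bytes_per_param: int, link_gbps: float) -> float:
    """Seconds needed to send one full copy of the parameters over a link."""
    bits = num_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)


if __name__ == "__main__":
    seconds = transfer_seconds(107e9, 2, 1.0)
    # About 1712 seconds, roughly 28.5 minutes, for a single uncompressed copy.
    print(f"{seconds:.0f} s ({seconds / 60:.1f} min)")
```

Under these assumptions a single uncompressed synchronization already takes close to half an hour, which is why synchronizing less often, compressing what is sent, and hiding the transfer behind computation all matter on slow networks.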

Abstract

The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression…
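The abstract names a dual optimizer policy and a one-step-delay overlap of communication and local training but does not spell out the algorithm here. The toy sketch below illustrates only the general pattern those names suggest: each worker runs several local steps with an inner optimizer, and an outer optimizer applies the averaged parameter change one round late, as if its communication had overlapped with the next round of local training. Everything in it is an assumption for illustration (the quadratic loss, plain SGD for both optimizers, the function names `local_steps` and `train`, and all hyperparameters); it is not the paper's method and omits pipeline parallelism and gradient compression.

```python
import numpy as np


def local_steps(theta, target, inner_lr, num_steps):
    """Inner optimizer: plain SGD on the toy loss 0.5 * ||w - target||^2."""
    w = theta.copy()
    for _ in range(num_steps):
        w -= inner_lr * (w - target)  # gradient of the toy loss is (w - target)
    return w


def train(targets, rounds=50, num_steps=5, inner_lr=0.1, outer_lr=0.5):
    """Simulate workers (one per row of `targets`) with a one-round-delayed outer update."""
    theta = np.zeros(targets.shape[1])
    in_flight = None  # averaged pseudo-gradient that is still "being communicated"
    for _ in range(rounds):
        # Every worker trains locally, starting from the current global parameters.
        local = [local_steps(theta, t, inner_lr, num_steps) for t in targets]
        # Pseudo-gradient: average displacement produced by local training.
        delta = np.mean([theta - w for w in local], axis=0)
        # Outer optimizer (plain SGD here) applies the PREVIOUS round's result,
        # modelling communication that ran while this round's local steps executed.
        if in_flight is not None:
            theta = theta - outer_lr * in_flight
        in_flight = delta
    return theta


if __name__ == "__main__":
    worker_targets = np.array([[1.0, 2.0], [3.0, 6.0]])
    print(train(worker_targets))  # approaches the mean of the targets
```

In this toy setting the delayed update still converges to the average of the workers' optima, which is the kind of behavior the abstract's claim about negligible convergence degradation refers to; whether it carries over to real models depends on details the truncated abstract does not give.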

Taxonomy

Topics: Distributed and Parallel Computing Systems
