DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
Zhenheng Tang, Zichen Tang, Junlin Huang, Xinglin Pan, Rudan Yan, Yuxin Wang, Amelie Chi Zhou, Shaohuai Shi, Xiaowen Chu, and Bo Li

TL;DR
DreamDDP introduces a layer-wise partial synchronization method for geo-distributed LLM training, enabling overlapping communication and computation to significantly speed up training in low-bandwidth environments.
Contribution
It proposes a novel layer-wise decoupling of model synchronization in local SGD, with theoretical convergence guarantees and practical scheduling strategies.
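To make the idea of layer-wise decoupling concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The round-robin assignment of layers to local steps, the scalar "layers", the quadratic per-worker loss, and all function names (`round_robin_schedule`, `local_step`, `sync_layers`, `train_one_period`) are assumptions made purely for illustration. The paper's actual scheduling strategy may differ. The sketch shows only which layers are averaged at which local step. A real system would run that averaging asynchronously so it overlaps with the next step's computation.

```python
from typing import Dict, List


def round_robin_schedule(num_layers: int, period: int) -> Dict[int, List[int]]:
    """Hypothetical schedule: layer i is synchronized at local step i % period."""
    schedule: Dict[int, List[int]] = {step: [] for step in range(period)}
    for layer in range(num_layers):
        schedule[layer % period].append(layer)
    return schedule


def local_step(params: List[float], target: float, lr: float) -> List[float]:
    """One SGD step on the toy loss 0.5 * (p - target)^2 for every layer."""
    return [p - lr * (p - target) for p in params]


def sync_layers(workers: List[List[float]], layers: List[int]) -> None:
    """Average only the given layers across all workers (partial synchronization)."""
    for layer in layers:
        mean = sum(w[layer] for w in workers) / len(workers)
        for w in workers:
            w[layer] = mean


def train_one_period(
    workers: List[List[float]],
    targets: List[float],
    schedule: Dict[int, List[int]],
    lr: float = 0.1,
) -> List[List[float]]:
    """Run one synchronization period: local update, then sync the scheduled slice."""
    for step in sorted(schedule):
        workers = [local_step(w, t, lr) for w, t in zip(workers, targets)]
        sync_layers(workers, schedule[step])
    return workers


num_layers, period = 6, 3
schedule = round_robin_schedule(num_layers, period)
workers = [[0.0] * num_layers, [1.0] * num_layers]  # two workers, six scalar layers
targets = [2.0, -2.0]  # each worker sees different data, so they drift apart
workers = train_one_period(workers, targets, schedule)
```

In plain local SGD, all layers would be averaged together once at the end of the period. Here every layer is still averaged exactly once per period, but the communication is spread across the local steps, which is what creates room for overlap with computation.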
Findings
Achieves 1.49x to 3.91x speedup over baseline methods.
Maintains convergence rates comparable to standard S-SGD.
Effectively overlaps communication and computation in distributed training.
Abstract
The growth of large language models (LLMs) makes it increasingly challenging to accelerate distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in geo-distributed data centers. Communication in geo-distributed data parallel training (DDP) with synchronous stochastic gradient descent (S-SGD) is the main bottleneck in low-bandwidth environments. Local SGD mitigates communication overhead by reducing synchronization frequency, and recent studies have successfully applied it to geo-distributed LLM pre-training. However, we identify that its model synchronization mechanism prevents overlapping communication and computation. To overcome this limitation, we expand the design space of local SGD by layer-wisely decoupling model…
Taxonomy
Topics: Brain Tumor Detection and Classification · Robotics and Automated Systems · Cloud Computing and Resource Management
