Efficient Pipeline Planning for Expedited Distributed DNN Training
Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, and Wei Lin

TL;DR
This paper introduces efficient algorithms for pipeline planning in distributed DNN training, significantly reducing training time by optimizing microbatch processing and synchronization across GPUs.
Contribution
It presents a novel framework with algorithms for pipeline partitioning, device mapping, and microbatch scheduling to minimize training iteration time in synchronous pipeline parallelism.
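To make the partitioning step concrete, here is a minimal sketch of one common formulation: splitting a chain of layers into contiguous stages so that the slowest stage is as fast as possible. This is a textbook dynamic program, not the paper's algorithm. The function name `partition_layers` and the per-layer times are hypothetical, and the sketch ignores device mapping, communication cost, and microbatch scheduling, all of which the paper handles jointly.

```python
from itertools import accumulate


def partition_layers(layer_times, num_stages):
    """Split layers into contiguous stages, minimizing the slowest stage's time.

    Illustrative only: ignores inter-GPU bandwidth and scheduling effects.
    Returns (bottleneck_time, [(start, end), ...]) with half-open layer ranges.
    """
    n = len(layer_times)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the number of layers")

    # prefix[i] is the total time of the first i layers.
    prefix = [0.0] + list(accumulate(layer_times))
    inf = float("inf")

    # best[s][i]: smallest achievable bottleneck when the first i layers form s stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0

    for s in range(1, num_stages + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):
                # The last stage covers layers j..i-1.
                cost = max(best[s - 1][j], prefix[i] - prefix[j])
                if cost < best[s][i]:
                    best[s][i] = cost
                    cut[s][i] = j

    # Walk the recorded cut points backwards to recover stage boundaries.
    bounds = []
    i = n
    for s in range(num_stages, 0, -1):
        j = cut[s][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return best[num_stages][n], bounds


if __name__ == "__main__":
    bottleneck, stages = partition_layers([1, 2, 3, 4, 5], num_stages=2)
    print(bottleneck, stages)
```

Balancing stage times matters because, in a pipeline, the slowest stage sets the rate at which microbatches can flow through every other stage.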
Findings
Achieves up to 157% speedup over existing methods.
Provides theoretical analysis and extensive experimental validation.
Applicable to arbitrary inter-GPU connectivity.
Abstract
To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in a pipelined fashion. Earlier pipeline designs allow multiple versions of model parameters to co-exist (similar to asynchronous training), and cannot ensure the same model convergence and accuracy as training without pipelining. Synchronous pipelining, proposed more recently, preserves model performance by enforcing a synchronization barrier between training iterations. Nonetheless, the synchronization barrier requires waiting for gradient aggregation from all microbatches and thus delays training progress. Optimized pipeline planning is needed to minimize this wait and hence the training time, a problem that has not been well studied in the literature. This paper designs efficient, near-optimal algorithms for…
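The cost of the synchronization barrier can be illustrated with a small simulation. The sketch below assumes a simple schedule in which every stage runs all forward passes and then all backward passes, with one stage per GPU and zero communication and gradient-aggregation cost. These assumptions, the function name `iteration_time`, and the example timings are illustrative and are not taken from the paper.

```python
def iteration_time(fwd, bwd, num_microbatches):
    """Time until every stage has finished the backward pass of every microbatch.

    fwd[s] and bwd[s] are the per-microbatch forward and backward times of stage s.
    Assumes an all-forward-then-all-backward schedule and free communication.
    """
    num_stages = len(fwd)
    m_count = num_microbatches

    # f_done[s][m]: time at which stage s finishes the forward pass of microbatch m.
    f_done = [[0.0] * m_count for _ in range(num_stages)]
    for s in range(num_stages):
        for m in range(m_count):
            input_ready = f_done[s - 1][m] if s > 0 else 0.0
            device_free = f_done[s][m - 1] if m > 0 else 0.0
            f_done[s][m] = max(input_ready, device_free) + fwd[s]

    # b_done[s][m]: time at which stage s finishes the backward pass of microbatch m.
    b_done = [[0.0] * m_count for _ in range(num_stages)]
    for s in reversed(range(num_stages)):
        for m in range(m_count):
            # Gradients arrive from the next stage; the last stage starts from its own forward output.
            grad_ready = b_done[s + 1][m] if s < num_stages - 1 else f_done[s][m]
            # A stage begins backward work only after its last forward pass.
            device_free = b_done[s][m - 1] if m > 0 else f_done[s][m_count - 1]
            b_done[s][m] = max(grad_ready, device_free) + bwd[s]

    # The barrier releases only when the slowest stage is done.
    return max(b_done[s][m_count - 1] for s in range(num_stages))


if __name__ == "__main__":
    total = iteration_time(fwd=[1, 1], bwd=[2, 2], num_microbatches=4)
    busy = 4 * (1 + 2)  # useful work per GPU
    print(total, total - busy)  # iteration time and idle time per GPU
```

With two stages, four microbatches, and forward/backward times of 1 and 2, each GPU does 12 units of useful work but the iteration takes 15, so 3 units per GPU are spent idle. Pipeline planning aims to shrink this kind of idle time.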
