Efficient Pipeline Planning for Expedited Distributed DNN Training

Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, and Wei Lin

arXiv:2204.10562 · cs.DC · August 23, 2022

TL;DR

This paper introduces efficient pipeline-planning algorithms for synchronous pipeline-parallel DNN training. They reduce per-iteration training time by optimizing how the model is partitioned into stages, how stages are mapped to GPUs, and how microbatches are scheduled.

Contribution

It presents a novel framework with algorithms for pipeline partitioning, device mapping, and microbatch scheduling to minimize training iteration time in synchronous pipeline parallelism.

Findings

01. Achieves up to 157% speedup over existing methods.

02. Provides theoretical analysis and extensive experimental validation.

03. Applicable to arbitrary inter-GPU connectivity.

Abstract

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple versions of model parameters to co-exist (similar to asynchronous training), and cannot ensure the same model convergence and accuracy performance as without pipelining. Synchronous pipelining has recently been proposed which ensures model performance by enforcing a synchronization barrier between training iterations. Nonetheless, the synchronization barrier requires waiting for gradient aggregation from all microbatches and thus delays the training progress. Optimized pipeline planning is needed to minimize such wait and hence the training time, which has not been well studied in the literature. This paper designs efficient, near-optimal algorithms for…
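To make the cost of the synchronization barrier concrete, the following is a minimal sketch of how one might estimate the iteration time of a synchronous pipeline. It is not the paper's algorithm. It assumes a simple schedule in which every stage runs all forward passes before any backward pass, each stage sits on its own GPU, and communication time is ignored. The function name and the per-stage timing lists are hypothetical, chosen only for illustration.

```python
def sync_pipeline_iteration_time(fwd, bwd, num_microbatches):
    """Estimate one iteration's duration for a synchronous pipeline.

    Assumptions (illustrative only, not taken from the paper):
      - fwd[s] / bwd[s] are the forward / backward times of one
        microbatch on stage s, with one stage per GPU;
      - each GPU finishes all of its forward passes before it starts
        any backward pass;
      - inter-GPU communication time is ignored.
    """
    S, M = len(fwd), num_microbatches

    # Forward phase: microbatch m on stage s must wait for the previous
    # stage (data dependency) and the previous microbatch (GPU is busy).
    f_done = [[0.0] * M for _ in range(S)]
    for s in range(S):
        for m in range(M):
            from_prev_stage = f_done[s - 1][m] if s > 0 else 0.0
            gpu_free = f_done[s][m - 1] if m > 0 else 0.0
            f_done[s][m] = max(from_prev_stage, gpu_free) + fwd[s]

    # Backward phase: gradients flow from the last stage to the first.
    b_done = [[0.0] * M for _ in range(S)]
    for s in reversed(range(S)):
        for m in range(M):
            # The last stage can start backward once its own forward is done.
            upstream = b_done[s + 1][m] if s < S - 1 else f_done[s][m]
            # The GPU is busy until its forwards (or prior backward) finish.
            gpu_free = b_done[s][m - 1] if m > 0 else f_done[s][M - 1]
            b_done[s][m] = max(upstream, gpu_free) + bwd[s]

    # Synchronization barrier: the iteration ends only when every stage
    # has the gradients of all microbatches.
    return max(b_done[s][M - 1] for s in range(S))


if __name__ == "__main__":
    # Four balanced stages, four microbatches.
    print(sync_pipeline_iteration_time([1, 1, 1, 1], [2, 2, 2, 2], 4))
```

With balanced stages this reproduces the familiar (M + S - 1) × (forward + backward) estimate. Passing unbalanced `fwd` and `bwd` lists shows how a single slow stage lengthens the wait at the barrier, which is the kind of cost that pipeline planning tries to minimize.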
