Breadth-First Pipeline Parallelism
Joel Lamy-Poirier

TL;DR
The paper proposes Breadth-First Pipeline Parallelism, a training schedule that combines pipeline and data parallelism to raise training throughput and lower the training time, cost and memory usage of large models.
Contribution
It introduces a training schedule that optimizes the combination of pipeline and data parallelism, pairing high GPU utilization with a small batch size per GPU and reduced memory usage.
Findings
Up to 43% higher training throughput than Megatron-LM for a 52-billion-parameter model at a small batch size per GPU
On a large GPU cluster, this would reduce training time and cost by the same amount
Effective use of fully sharded data parallelism
Abstract
We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
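The link the abstract draws between GPU utilization and batch size per GPU can be illustrated with the standard idealized pipeline-bubble estimate. The sketch below is not the paper's cost model. The function `bubble_fraction` and its parameters are hypothetical names introduced here, and it assumes equal compute time per stage and no communication cost. The optional `stages_per_device` parameter models a looped stage placement of the kind used by Megatron-LM's interleaved schedule; how the paper's own schedule places stages is not stated on this page.

```python
def bubble_fraction(num_devices: int,
                    num_micro_batches: int,
                    stages_per_device: int = 1) -> float:
    """Estimate the idle ("bubble") fraction of one synchronous pipeline step.

    Idealized model (an assumption, not taken from the paper):
    every stage takes the same time and communication is free.
    With p devices, m micro-batches and v stages per device, useful work
    per device is proportional to v * m stage-slots, and the pipeline
    fill/drain costs p - 1 stage-slots, so the idle fraction is
    (p - 1) / (v * m + p - 1).
    """
    if num_devices < 1 or num_micro_batches < 1 or stages_per_device < 1:
        raise ValueError("all arguments must be positive integers")
    idle_slots = num_devices - 1
    work_slots = stages_per_device * num_micro_batches
    return idle_slots / (work_slots + idle_slots)


if __name__ == "__main__":
    # Few micro-batches per step (small batch per GPU) leave a large bubble.
    for micro_batches in (8, 64):
        for stages in (1, 4):
            frac = bubble_fraction(8, micro_batches, stages)
            print(f"devices=8 micro_batches={micro_batches} "
                  f"stages_per_device={stages} idle={frac:.3f}")
```

Under these assumptions, 8 devices with 8 micro-batches idle for 7/15 of the step (about 47%), while 4 stages per device lowers that to 7/39 (about 18%). This shows why keeping utilization high is hard when the batch size per GPU is small.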
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques
