Breadth-First Pipeline Parallelism

Joel Lamy-Poirier

arXiv:2211.05953 · cs.DC · July 10, 2023 · 1 citation

TL;DR

The paper proposes Breadth-First Pipeline Parallelism, a new training schedule that combines pipeline and data parallelism to significantly improve training efficiency and reduce costs for large models.

Contribution

It introduces a training schedule that optimizes the combination of pipeline and data parallelism, pairing high GPU utilization with a small batch size per GPU and lower memory usage.

Findings

1. Up to 43% higher training throughput than Megatron-LM for a 52 billion-parameter model, using a small batch size per GPU
2. A projected reduction in training time and cost by the same amount on a large GPU cluster
3. Effective use of fully sharded data parallelism

Abstract

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
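As a rough arithmetic aid for relating a throughput gain to wall-clock training time, the following minimal sketch assumes that the total amount of work is fixed and that time equals work divided by throughput. The function name `time_reduction_from_speedup` is hypothetical and does not come from the paper; the relation between throughput and time is an assumption of this sketch, not the paper's own cost model.

```python
def time_reduction_from_speedup(throughput_gain: float) -> float:
    """Fractional reduction in wall-clock time for a fixed amount of work.

    throughput_gain is the relative throughput increase (0.43 means +43%).
    Assumption: time = work / throughput, with total work held constant.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1")
    # New time relative to old time is 1 / (1 + gain).
    return 1.0 - 1.0 / (1.0 + throughput_gain)


if __name__ == "__main__":
    reduction = time_reduction_from_speedup(0.43)
    print(f"{reduction:.1%}")  # prints 30.1%
```

Under these assumptions, a 43% throughput increase corresponds to roughly 30% less wall-clock time for the same workload, which is equivalent to processing 43% more samples in the same time.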

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques