2BP: 2-Stage Backpropagation
Christopher Rae, Joseph K. L. Lee, James Richings

TL;DR
This paper introduces 2-stage backpropagation (2BP), a method that splits backward propagation into two stages to reduce idle time and improve throughput in pipeline parallel training of large DNNs across multiple accelerators.
Contribution
The paper proposes 2BP, which splits backpropagation into two separate stages to reduce idle compute time in pipeline parallel training. The reported throughput gains include a 1.70x increase on a 7-billion-parameter LLaMa-like transformer.
Findings
Achieved a 1.70x throughput increase over traditional methods when training a 7-billion-parameter LLaMa-like transformer across 4 GPUs.
Increased throughput in every tested combination of model architecture and pipelining schedule.
Reduced compute idle time during backpropagation stages.
Abstract
As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are being unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.
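The split described in the abstract can be illustrated with a minimal sketch for a single linear layer. The abstract does not say what each stage computes, so this sketch assumes that the first stage produces the gradient with respect to the layer input (which the preceding pipeline stage is waiting for) and that the second stage produces the gradient with respect to the weights (which no other stage depends on, so it can be deferred). The function names `backward_p1` and `backward_p2` and the helpers are hypothetical and are not taken from any framework.

```python
# Sketch only: two-stage backward pass for one linear layer y = x @ W.
# Assumption: stage 1 = input gradient, stage 2 = weight gradient.
# All names here are illustrative, not an existing API.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def forward(x, w):
    """Forward pass: y = x @ W."""
    return matmul(x, w)

def backward_p1(grad_y, w):
    """Stage 1: dL/dx = dL/dy @ W^T.
    The previous pipeline stage needs this result to continue."""
    return matmul(grad_y, transpose(w))

def backward_p2(x, grad_y):
    """Stage 2: dL/dW = x^T @ dL/dy.
    Nothing downstream waits on this, so it can run later."""
    return matmul(transpose(x), grad_y)

x = [[1.0, 2.0]]                 # one sample, two features
w = [[3.0, 4.0], [5.0, 6.0]]     # 2x2 weight matrix
y = forward(x, w)

grad_y = [[1.0, 1.0]]            # gradient arriving from the next stage
grad_x = backward_p1(grad_y, w)  # sent upstream right away
grad_w = backward_p2(x, grad_y)  # deferred until the accelerator would otherwise idle
```

Under this assumption, an automatic differentiation tool that computes both gradients in one call forces the upstream stage to wait for the weight gradient as well, which is one way to read the bottleneck the abstract mentions.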
Taxonomy
Topics: Algorithms and Data Compression
