2BP: 2-Stage Backpropagation

Christopher Rae; Joseph K. L. Lee; James Richings

arXiv:2405.18047·cs.LG·May 29, 2024

TL;DR

This paper introduces 2-stage backpropagation (2BP), a method that splits backward propagation into two stages to reduce idle time and improve throughput in pipeline parallel training of large DNNs across multiple accelerators.

Contribution

The paper proposes 2BP, which improves the efficiency of pipeline parallelism by splitting the backward propagation step into two separate stages. In the reported experiments this raised throughput by up to 1.70x over traditional methods.

Findings

01. Achieved a 1.70x throughput increase over traditional methods when training a 7-billion-parameter LLaMa-like transformer across 4 GPUs.

02. Increased throughput in every tested combination of model architecture and pipelining schedule.

03. Reduced idle compute time by splitting the backward pass into two separate stages.

Abstract

As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are being unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.
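To make the idea of a two-stage backward pass concrete, here is a minimal sketch for a single linear layer y = xW. It assumes the split is between the gradient with respect to the layer input (which the previous pipeline stage is waiting for) and the gradient with respect to the weights (which can be deferred); the abstract does not spell out the split, so treat this as an illustration and not as the authors' implementation. The names `SplitLinear`, `backward_stage1`, and `backward_stage2` are hypothetical.

```python
# Sketch only: a linear layer y = x @ W whose backward pass is split in two.
# Assumption: stage 1 yields the input gradient, stage 2 the weight gradient.
# All class and method names here are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class SplitLinear:
    def __init__(self, weight):
        self.weight = weight          # shape (in_features, out_features)
        self.saved_input = None       # kept for the deferred weight gradient
        self.saved_grad_out = None    # kept for the deferred weight gradient
        self.weight_grad = None

    def forward(self, x):
        self.saved_input = x
        return matmul(x, self.weight)

    def backward_stage1(self, grad_out):
        """Input gradient: grad_out @ W^T. Can be sent upstream immediately."""
        self.saved_grad_out = grad_out
        return matmul(grad_out, transpose(self.weight))

    def backward_stage2(self):
        """Weight gradient: x^T @ grad_out. Can run later, e.g. in idle time."""
        self.weight_grad = matmul(transpose(self.saved_input),
                                  self.saved_grad_out)
        return self.weight_grad


layer = SplitLinear([[1.0, 2.0], [3.0, 4.0]])
x = [[1.0, 0.0], [0.0, 1.0]]
y = layer.forward(x)
grad_in = layer.backward_stage1([[1.0, 1.0], [1.0, 1.0]])  # needed upstream now
grad_w = layer.backward_stage2()                           # deferred work
```

In a pipeline, the point of such a split would be that the upstream accelerator can begin its own backward pass as soon as `grad_in` is available, while the weight-gradient work is scheduled into time that would otherwise be idle. The cost is that the saved input and output gradient must be held in memory until stage 2 runs.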

Taxonomy

Topics: Algorithms and Data Compression