Reducing Activation Recomputation in Large Transformer Models
Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro

TL;DR
This paper introduces simple techniques to significantly reduce activation memory and recomputation overhead in large transformer training, enabling faster and more efficient scaling to models with up to one trillion parameters.
Contribution
The authors propose sequence parallelism and selective activation recomputation, reducing activation memory by 5x and recomputation overhead by over 90%, improving large transformer training efficiency.
Findings
Activation memory reduced by 5x.
Recomputation overhead decreased by over 90%.
Training a 530B parameter model is 29% faster.
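The 5x figure can be sanity-checked with a short calculation. The sketch below is an illustration, not the authors' code. It assumes per-layer activation-memory estimates of the form sbh(10 + 24/t + 5as/(ht)) for tensor parallelism alone and 34sbh/t when sequence parallelism and selective recomputation are both enabled. These expressions are recalled from the paper's analysis and do not appear in this summary, so check them against the paper before relying on them. The symbols (s sequence length, b micro-batch size, h hidden size, a attention heads, t tensor-parallel size) and the 530B-style configuration values are likewise assumptions.

```python
def activation_bytes_per_layer(s, b, h, a, t=1,
                               sequence_parallel=False,
                               selective_recompute=False):
    """Approximate activation bytes stored for one transformer layer.

    Assumed formulas (16-bit activations), not taken from this summary:
      tensor parallel only:        s*b*h * (10 + 24/t + 5*a*s/(h*t))
      + sequence parallelism:      s*b*h * (34 + 5*a*s/h) / t
      + selective recomputation:   the 5*a*s/h attention term is dropped
    """
    # The attention term disappears when those activations are recomputed.
    attention_term = 0.0 if selective_recompute else 5.0 * a * s / h
    if sequence_parallel:
        # Everything, including layer norms and dropouts, is split t ways.
        factor = (34.0 + attention_term) / t
    else:
        # A fixed part of size 10*s*b*h stays replicated on every device.
        factor = 10.0 + (24.0 + attention_term) / t
    return s * b * h * factor


# Hypothetical 530B-style configuration (assumed values).
config = dict(s=2048, b=1, h=20480, a=128, t=8)

baseline = activation_bytes_per_layer(**config)
combined = activation_bytes_per_layer(
    **config, sequence_parallel=True, selective_recompute=True)
ratio = baseline / combined

print(f"tensor parallel only: {baseline / 2**20:.0f} MiB per layer")
print(f"with both techniques: {combined / 2**20:.0f} MiB per layer")
print(f"reduction: {ratio:.2f}x")
```

Under these assumptions the per-layer factor drops from 21 to 4.25, a reduction of roughly 4.9x, which is consistent with the reported 5x.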
Abstract
Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show…
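The trade-off the abstract describes, storing an activation versus recomputing it during the backward pass, can be shown with a toy example. This is a minimal pure-Python sketch, not the paper's implementation. The abstract does not say which activations are selected for recomputation, so the sketch assumes the attention probabilities are the ones dropped, since they are cheap to rebuild from inputs that must be kept anyway. All function names are hypothetical, and only the gradient with respect to the values is computed to keep the example short.

```python
import math


def attention_probs(q, keys):
    """Softmax over dot-product scores of one query against all keys."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values, keep_probs):
    """Forward pass; returns the output and what is saved for backward."""
    probs = attention_probs(q, keys)
    dim = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    saved = {"q": q, "keys": keys, "values": values}
    if keep_probs:
        saved["probs"] = probs  # extra activation memory
    return out, saved


def grad_wrt_values(saved, grad_out):
    """Backward pass for the values; recomputes probs if they were dropped."""
    probs = saved.get("probs")
    if probs is None:
        # Selective recomputation: rebuild the dropped activation.
        probs = attention_probs(saved["q"], saved["keys"])
    return [[p * g for g in grad_out] for p in probs]


q = [0.1, -0.2]
keys = [[0.3, 0.5], [-0.1, 0.4], [0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
grad_out = [1.0, -1.0]

out_stored, saved_stored = attend(q, keys, values, keep_probs=True)
out_recomp, saved_recomp = attend(q, keys, values, keep_probs=False)

grads_stored = grad_wrt_values(saved_stored, grad_out)
grads_recomp = grad_wrt_values(saved_recomp, grad_out)
print(grads_stored == grads_recomp)
```

Both variants produce the same gradients. The second holds fewer values between the forward and backward passes and pays for that with one extra softmax evaluation.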
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Adam · Cosine Annealing · Byte Pair Encoding · Multi-Head Attention · Residual Connection
