Data Movement Is All You Need: A Case Study on Optimizing Transformers
Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler

TL;DR
This paper identifies data movement as the key bottleneck in training transformers and proposes a global optimization recipe that reduces data movement and outperforms state-of-the-art frameworks on BERT training.
Contribution
It introduces a novel recipe for optimizing data movement in transformer training, leading to substantial performance gains over existing frameworks.
Findings
Data movement reduced by up to 22.91%
1.30x performance improvement over state-of-the-art frameworks when training a BERT encoder layer
1.19x performance improvement when training the entire BERT
Abstract
Transformers are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute performance, training has now become memory-bound. Further, existing frameworks use suboptimal data layouts. Using these insights, we present a recipe for globally optimizing data movement in transformers. We reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks when training a BERT encoder layer and 1.19x for the entire BERT. Our approach is applicable more broadly to optimizing deep neural networks, and offers…
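The claim that training has become memory-bound can be illustrated with a rough roofline-style estimate: an operator is memory-bound when its arithmetic intensity (flop per byte moved) falls below the ratio of the device's peak compute to its peak memory bandwidth. The sketch below is not the paper's analysis. The function names are hypothetical, the hardware figures are illustrative assumptions, and the byte counts assume each operand is read or written exactly once in 16-bit precision.

```python
# Roofline-style sketch: compare arithmetic intensity against machine balance.
# All names and hardware numbers here are illustrative assumptions.

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """Flop per byte for C[m,n] = A[m,k] @ B[k,n], each matrix moved once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem=1, inputs=1, outputs=1, bytes_per_elem=2):
    """Flop per byte for an elementwise operator such as a bias add or scaling."""
    return flops_per_elem / (bytes_per_elem * (inputs + outputs))

def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """True when the operator cannot reach peak compute at this intensity."""
    machine_balance = peak_flops / peak_bandwidth  # flop per byte
    return intensity < machine_balance

# Assumed device: 125 Tflop/s peak half-precision compute, 900 GB/s memory bandwidth.
PEAK_FLOPS = 125e12
PEAK_BANDWIDTH = 900e9

mm = matmul_intensity(4096, 1024, 1024)  # a large linear-layer-like matmul
ew = elementwise_intensity()             # one flop per element, one input, one output

print(f"matmul:      {mm:8.2f} flop/byte, memory-bound: "
      f"{is_memory_bound(mm, PEAK_FLOPS, PEAK_BANDWIDTH)}")
print(f"elementwise: {ew:8.2f} flop/byte, memory-bound: "
      f"{is_memory_bound(ew, PEAK_FLOPS, PEAK_BANDWIDTH)}")
```

Under these assumptions the large matrix multiplication sits above the machine balance while the elementwise operator sits far below it, which is why reducing the bytes moved by such operators, rather than their flop count, is what shortens their runtime.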
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Parallel Computing and Optimization Techniques
Methods: Linear Layer · Multi-Head Attention · Residual Connection · Attention Is All You Need · Attention Dropout · Weight Decay · Adam · Softmax · WordPiece · Dense Connections
