Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

TL;DR
This paper introduces LAMB, a layerwise adaptive large batch optimization method that accelerates training of deep neural networks such as BERT, cutting BERT training time from 3 days to 76 minutes without performance loss.
Contribution
The paper proposes LAMB, a novel large batch optimizer with a convergence analysis, demonstrating superior performance over existing methods like LARS, especially for attention models.
Findings
LAMB enables training BERT with batch sizes of 32,768 without performance degradation.
Training BERT with LAMB cuts training time from 3 days to 76 minutes on TPUv3.
The convergence analysis shows that LAMB, like LARS, converges to a stationary point in general nonconvex settings.
Abstract
Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge of interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which, by employing layerwise adaptive learning rates, trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our…
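The layerwise adaptation idea in the abstract can be illustrated with a minimal NumPy sketch of one LAMB-style update for a single layer. This is an illustration under stated assumptions, not the authors' reference implementation: the function name `lamb_step`, the default hyperparameters, and the use of the identity as the norm-scaling function are all choices made here for the example. The sketch combines an Adam-style normalized update with a per-layer "trust ratio" that rescales the step by the ratio of the weight norm to the update norm.

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer (illustrative sketch).

    w, g : weights and gradient of one layer (same shape)
    m, v : running first and second moment estimates
    t    : 1-based step counter, used for bias correction
    All defaults are assumptions for this example.
    """
    # Adam-style moment estimates
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g

    # Bias correction of the moment estimates
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Elementwise normalized update plus decoupled weight decay
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layerwise trust ratio: ||w|| / ||update||
    # (identity scaling function assumed; fall back to 1.0 if a norm is zero)
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    # The step length for this layer is lr * ||w||, independent of gradient scale
    w = w - lr * trust_ratio * update
    return w, m, v


# Usage: one step on a toy 4-element "layer"
w0 = np.ones(4)
g0 = 0.5 * np.ones(4)
w1, m1, v1 = lamb_step(w0, g0, np.zeros(4), np.zeros(4), t=1)
```

In a full optimizer this step would be applied separately to each layer's parameter tensor, so every layer gets its own trust ratio. Because of that ratio, the norm of each layer's step equals the learning rate times the norm of that layer's weights, regardless of how large or small the raw gradient is.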
Code & Models
- 🤗 bayartsogt/albert-mongolian (model) · 35 dl · ♡ 4
- 🤗 squeezebert/squeezebert-mnli-headless (model) · 33 dl
- 🤗 squeezebert/squeezebert-mnli (model) · 1.3k dl · ♡ 1
- 🤗 squeezebert/squeezebert-uncased (model) · 7.6k dl · ♡ 2
- 🤗 Deci/DeciDiffusion-v1-0 (model) · 23 dl · ♡ 140
- 🤗 Deci/DeciDiffusion-v2-0 (model) · 16 dl · ♡ 33
- 🤗 Deci-early-access/DeciDiffusion-v2-0 (model) · 9 dl · ♡ 3
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling
