Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Yang You; Jing Li; Sashank Reddi; Jonathan Hseu; Sanjiv Kumar; Srinadh Bhojanapalli; Xiaodan Song; James Demmel; Kurt Keutzer; Cho-Jui Hsieh

arXiv:1904.00962·cs.LG·January 6, 2020·164 cites

TL;DR

This paper introduces LAMB, a layerwise adaptive large batch optimization method that accelerates training of deep neural networks such as BERT, reducing BERT training time from 3 days to 76 minutes without performance loss.

Contribution

The paper proposes LAMB, a novel large batch optimizer with a convergence analysis, demonstrating superior performance over existing methods like LARS, especially for attention models.

Findings

01. LAMB enables training BERT with batch sizes of 32,768 without performance degradation.

02. Training BERT with LAMB reduces time from 3 days to 76 minutes on TPUv3.

03. LAMB shows consistent convergence to stationary points in nonconvex settings.

Abstract

Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge of interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which, by employing layerwise adaptive learning rates, trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our…
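The abstract refers to layerwise adaptive learning rates without spelling out the update rule. The following is a minimal sketch of one such step, assuming the commonly described form of a LAMB-style update: Adam-style moment estimates followed by a per-layer "trust ratio" that rescales the step to the norm of that layer's weights. The function name `layerwise_adaptive_step`, the `state` dictionary layout, and all default hyperparameters are illustrative assumptions, not taken from this page, and the sketch should not be read as the authors' reference implementation.

```python
import math


def l2_norm(xs):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in xs))


def layerwise_adaptive_step(weights, grads, state, lr=0.01, beta1=0.9,
                            beta2=0.999, eps=1e-6, weight_decay=0.01):
    """Apply one LAMB-style update in place (hypothetical helper).

    weights, grads: dict mapping layer name -> flat list of floats.
    state: dict with keys "t" (step count), "m" and "v" (moment estimates).
    """
    state["t"] += 1
    t = state["t"]
    for name, w in weights.items():
        g = grads[name]
        m = state["m"].setdefault(name, [0.0] * len(w))
        v = state["v"].setdefault(name, [0.0] * len(w))
        update = []
        for i in range(len(w)):
            # Adam-style first and second moment estimates.
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            update.append(m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * w[i])
        # Layerwise trust ratio: scale the step to the layer's weight norm.
        w_norm = l2_norm(w)
        u_norm = l2_norm(update)
        trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
        for i in range(len(w)):
            w[i] -= lr * trust * update[i]
    return weights


if __name__ == "__main__":
    w = {"layer": [3.0, 4.0]}
    st = {"t": 0, "m": {}, "v": {}}
    layerwise_adaptive_step(w, {"layer": [1.0, 1.0]}, st,
                            lr=0.1, weight_decay=0.0)
    print(w)
```

In this sketch the size of each layer's step is `lr` times the norm of that layer's weights, regardless of the gradient's scale. That normalization is the usual intuition given for why layerwise adaptation tolerates very large batches.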

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling