Don't Decay the Learning Rate, Increase the Batch Size

Samuel L. Smith; Pieter-Jan Kindermans; Chris Ying; Quoc V. Le

arXiv:1711.00489 · cs.LG · February 27, 2018 · 394 citations



TL;DR

Increasing the batch size during training can replace learning rate decay. It reaches equivalent test accuracy after the same number of training epochs but with fewer parameter updates, which shortens training time. The result holds for SGD, SGD with momentum, Nesterov momentum, and Adam.

Contribution

The authors show that a learning rate decay schedule can be replaced by a batch size increase schedule, so existing training schedules can be reused for efficient large-batch training without hyper-parameter tuning.

Findings

01 Achieves equivalent test accuracy after the same number of training epochs, but with fewer parameter updates.

02 Reduces training time, because larger batches allow greater parallelism.

03 Trains ResNet-50 on ImageNet in under 30 minutes.
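
The first finding comes down to simple arithmetic on the schedule. The sketch below is not the authors' code. It compares a conventional step-decay schedule with a schedule that keeps the learning rate fixed and multiplies the batch size by the same factor at the same epochs, then counts parameter updates for each. The helpers `step_schedules` and `total_updates` are hypothetical, and the dataset size, milestones, and factor of 5 are illustrative values chosen here, not figures from the paper.

```python
import math


def step_schedules(num_epochs, milestones, factor, base_lr, base_batch):
    """Build two per-epoch lists of (learning_rate, batch_size).

    decay: learning rate divided by `factor` at each milestone, batch fixed.
    grow:  learning rate fixed, batch multiplied by `factor` at each milestone.
    """
    decay, grow = [], []
    for epoch in range(num_epochs):
        # Number of milestones already reached at this epoch.
        k = sum(1 for m in milestones if epoch >= m)
        decay.append((base_lr / factor**k, base_batch))
        grow.append((base_lr, base_batch * factor**k))
    return decay, grow


def total_updates(schedule, num_examples):
    """One update per batch; the last partial batch of an epoch still counts."""
    return sum(math.ceil(num_examples / batch) for _, batch in schedule)


# Illustrative numbers only (assumed, not taken from the paper).
n = 50_000
decay, grow = step_schedules(
    num_epochs=6, milestones=[2, 4], factor=5, base_lr=0.1, base_batch=128
)

print("updates with learning rate decay:", total_updates(decay, n))
print("updates with batch size increase:", total_updates(grow, n))
```

Both schedules keep the same ratio of learning rate to batch size at every epoch, which is the quantity the swap is assumed to preserve in this sketch. The batch-size schedule simply takes fewer and larger steps per epoch.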

Abstract

It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1 - m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter…
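
The two proportionalities in the abstract, $B \propto \epsilon$ and $B \propto 1/(1 - m)$, can be combined into one rescaling rule. The following is a minimal sketch, assuming the aim is to keep $\epsilon / (B(1 - m))$ unchanged when the learning rate or momentum is raised. The function name `scaled_batch_size` and the example values are hypothetical and are not taken from the paper.

```python
def scaled_batch_size(base_batch, base_lr, new_lr,
                      base_momentum=0.0, new_momentum=0.0):
    """Batch size that keeps lr / (batch * (1 - momentum)) unchanged.

    Applies B proportional to lr and B proportional to 1 / (1 - momentum),
    as stated in the abstract. Rounding to an integer is a choice made
    for this sketch.
    """
    for m in (base_momentum, new_momentum):
        if not 0.0 <= m < 1.0:
            raise ValueError("momentum must lie in [0, 1)")
    if base_lr <= 0 or new_lr <= 0:
        raise ValueError("learning rates must be positive")

    lr_scale = new_lr / base_lr
    momentum_scale = (1.0 - base_momentum) / (1.0 - new_momentum)
    return round(base_batch * lr_scale * momentum_scale)


# Raising the learning rate 5x calls for a 5x larger batch.
print(scaled_batch_size(128, base_lr=0.1, new_lr=0.5))

# Raising momentum from 0.9 to 0.98 also calls for a 5x larger batch.
print(scaled_batch_size(128, base_lr=0.1, new_lr=0.1,
                        base_momentum=0.9, new_momentum=0.98))
```

As the abstract notes, the momentum route tends to cost a little test accuracy, so a rule like this is a starting point to validate rather than a guarantee.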


Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning

Methods: Adam · Stochastic Gradient Descent