Large Batch Training of Convolutional Networks
Yang You, Igor Gitman, Boris Ginsburg

TL;DR
This paper introduces LARS, a new training algorithm that enables large batch training of convolutional networks like AlexNet and ResNet-50 without sacrificing accuracy, addressing divergence issues in large batch regimes.
Contribution
The paper proposes Layer-wise Adaptive Rate Scaling (LARS), a novel method that allows training with significantly larger batch sizes while maintaining model accuracy.
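As a rough illustration of what layer-wise adaptive rate scaling means, the sketch below computes a separate local learning rate for each layer from the ratio of the weight norm to the gradient norm, then applies an SGD-with-momentum step. The function names, the dictionary layout for a layer, and the default values for trust_coef and weight_decay are assumptions made for this example, not taken from the summary above. Treat it as a minimal sketch rather than the authors' reference implementation.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0005):
    """Local learning rate for one layer: trust_coef * ||w|| / (||g|| + wd * ||w||).

    trust_coef and weight_decay defaults are illustrative assumptions.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        # Assumption for this sketch: fall back to no layer-wise scaling.
        return 1.0
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm)


def lars_sgd_step(layers, global_lr, momentum=0.9,
                  trust_coef=0.001, weight_decay=0.0005):
    """Update every layer in place with momentum SGD scaled by its local rate.

    Each layer is a dict with flat lists "w" (weights), "g" (gradients)
    and "v" (momentum buffer); this layout is hypothetical.
    """
    for layer in layers:
        w, g, v = layer["w"], layer["g"], layer["v"]
        local_lr = lars_local_lr(w, g, trust_coef, weight_decay)
        for i in range(len(w)):
            v[i] = momentum * v[i] + global_lr * local_lr * (g[i] + weight_decay * w[i])
            w[i] -= v[i]
```

With weight decay set to zero, the size of the update relative to the weight norm depends only on trust_coef and the global rate, not on the gradient magnitude. That keeps a layer with unusually large gradients from taking a step that is large relative to its weights.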
Findings
Scaled AlexNet to batch size of 8K with LARS
Scaled ResNet-50 to batch size of 32K with LARS
Overcame divergence issues in large batch training
Abstract
A common way to speed up training of large convolutional networks is to add computational units. Training is then performed using data-parallel synchronous Stochastic Gradient Descent (SGD), with each mini-batch divided between computational units. As the number of nodes increases, the batch size grows. But training with a large batch size often results in lower model accuracy. We argue that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge. To overcome these optimization difficulties, we propose a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS). Using LARS, we scaled AlexNet up to a batch size of 8K, and ResNet-50 to a batch size of 32K, without loss in accuracy.
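The baseline recipe the abstract refers to, linear learning rate scaling with warm-up, can be sketched as follows. The function names and the example numbers (a base rate of 0.1 at batch size 256, a five-step warm-up) are hypothetical values chosen for illustration and do not come from the paper summary.

```python
def linear_scaled_lr(base_lr, base_batch, batch):
    """Scale the learning rate in proportion to the batch size."""
    return base_lr * batch / base_batch


def warmup_lr(step, warmup_steps, target_lr, start_lr=0.0):
    """Ramp linearly from start_lr to target_lr over warmup_steps, then hold."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Hypothetical setup: base rate 0.1 at batch 256, scaled up to batch 8192.
target = linear_scaled_lr(base_lr=0.1, base_batch=256, batch=8192)
schedule = [warmup_lr(step, warmup_steps=5, target_lr=target) for step in range(8)]
```

Here the target rate grows 32 times along with the batch size, and the warm-up only delays the point at which that full rate is applied. The abstract's argument is that this recipe is not general enough and training may still diverge, which is what motivates the layer-wise scaling in LARS.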
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · LARS
