Large Batch Training of Convolutional Networks

Yang You; Igor Gitman; Boris Ginsburg

arXiv:1708.03888 · cs.CV · September 15, 2017 · 509 citations



TL;DR

This paper introduces LARS, a new training algorithm that enables large batch training of convolutional networks like AlexNet and ResNet-50 without sacrificing accuracy, addressing divergence issues in large batch regimes.

Contribution

The paper proposes Layer-wise Adaptive Rate Scaling (LARS), a novel method that allows training with significantly larger batch sizes while maintaining model accuracy.

Findings

01. Scaled AlexNet to a batch size of 8K with LARS

02. Scaled ResNet-50 to a batch size of 32K with LARS

03. Overcame divergence issues in large batch training

Abstract

A common way to speed up training of large convolutional networks is to add computational units. Training is then performed using data-parallel synchronous Stochastic Gradient Descent (SGD), with each mini-batch divided between computational units. As the number of nodes increases, the batch size grows. But training with a large batch size often results in lower model accuracy. We argue that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge. To overcome these optimization difficulties, we propose a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS). Using LARS, we scaled AlexNet up to a batch size of 8K, and ResNet-50 to a batch size of 32K, without loss in accuracy.
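The abstract names two recipes without spelling them out, so the following is a minimal sketch of both, not the authors' implementation. It assumes the commonly described form of each: linear scaling multiplies a base learning rate by the ratio of batch sizes and ramps up to that value during warm-up, and LARS gives each layer a local learning rate equal to a trust coefficient times ||w|| / (||g|| + weight_decay * ||w||). The function names, the default values of `trust_coef` and `weight_decay`, the fallback of 1.0 for zero norms, and the omission of momentum are all assumptions made for illustration; none of them appear on this page.

```python
import math


def linear_scaled_lr(base_lr, base_batch, batch, step, warmup_steps):
    """Baseline recipe: scale the learning rate linearly with batch size,
    ramping up linearly over the first `warmup_steps` steps."""
    target = base_lr * batch / base_batch
    if warmup_steps > 0 and step < warmup_steps:
        return target * (step + 1) / warmup_steps
    return target


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0005):
    """Per-layer learning-rate multiplier (assumed LARS form):
    trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = l2_norm(weights)
    denom = l2_norm(grads) + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0  # assumption: fall back to the plain global rate
    return trust_coef * w_norm / denom


def lars_sgd_step(weights, grads, global_lr, trust_coef=0.001, weight_decay=0.0005):
    """One momentum-free update for a single layer (momentum omitted for brevity)."""
    local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
    return [
        w - global_lr * local_lr * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]


if __name__ == "__main__":
    # Hypothetical numbers: one layer with two weights, batch grown from 256 to 8192.
    lr = linear_scaled_lr(base_lr=0.1, base_batch=256, batch=8192, step=0, warmup_steps=10)
    print(lars_sgd_step([3.0, 4.0], [0.6, 0.8], global_lr=lr))
```

Because the local rate divides by the gradient norm, the size of each layer's update tracks the size of that layer's weights rather than the raw gradient magnitude. In this sketch, with weight decay set to zero, multiplying a layer's gradient by 10 leaves its update unchanged.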

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · LARS