On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

Noah Golmant; Nikita Vemuri; Zhewei Yao; Vladimir Feinberg; Amir Gholami; Kai Rothauge; Michael W. Mahoney; Joseph Gonzalez

arXiv:1811.12941 · cs.LG · December 3, 2018 · 46 cites

Open Access

TL;DR

This paper shows empirically that increasing the mini-batch size in stochastic gradient descent reduces wall-clock training time only up to a point: beyond it, larger batches yield no further decrease in time to convergence across the architectures and problem domains studied (image classification, image segmentation, and language modeling).

Contribution

It provides an extensive empirical analysis of the limits of large-batch training, measured by time to convergence and total computational cost, and shows that model architecture and data complexity, not dataset size alone, affect how well large batches work.

Findings

01. Increasing batch size beyond a certain point does not reduce convergence time.

02. Popular large batch training strategies often fail before utilizing full computational resources.

03. Model architecture and data complexity influence the effectiveness of large batch sizes.
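
Finding 01 can be made concrete with a small calculation. The sketch below is illustrative only: the function names, the `min_speedup` threshold, and the run measurements are hypothetical and are not taken from the paper. It assumes that the number of SGD steps to convergence is a reasonable proxy for wall-clock time when each step is fully parallelized, and that total computational cost is the number of training examples processed (batch size times steps).

```python
def total_cost(batch_size, steps):
    """Training examples processed: batch size times number of SGD steps."""
    return batch_size * steps


def diminishing_returns_point(measurements, min_speedup=1.1):
    """Return the largest batch size that still paid off.

    measurements: list of (batch_size, steps_to_converge) pairs.
    A larger batch size "pays off" if it cuts the step count by at least
    a factor of min_speedup relative to the next smaller batch size.
    The 1.1 default is an arbitrary, hypothetical threshold.
    """
    ordered = sorted(measurements)
    best = ordered[0][0]
    for (_, s_prev), (b_next, s_next) in zip(ordered, ordered[1:]):
        if s_prev / s_next < min_speedup:
            break  # steps stopped shrinking: extra parallelism is wasted
        best = b_next
    return best


# Hypothetical measurements, invented for illustration (not from the paper).
runs = [(64, 8000), (128, 4000), (256, 2100), (512, 2000), (1024, 1950)]

for batch_size, steps in runs:
    print(batch_size, steps, total_cost(batch_size, steps))

print("largest useful batch size:", diminishing_returns_point(runs))
```

With these invented numbers, steps roughly halve up to a batch size of 256 while total cost stays nearly flat. Past that point the step count barely moves, so total cost roughly doubles with each doubling of the batch size for almost no gain in time to convergence.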

Abstract

Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique. We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains, including image classification, image segmentation, and language modeling. Although it is common practice to increase the batch size in order to fully exploit available computational resources, we find a substantially more nuanced picture. Our main finding is that across a wide range of network architectures and problem domains, increasing the batch size beyond a certain point yields no decrease in wall-clock time to convergence for…

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques