Don't Use Large Mini-Batches, Use Local SGD

Tao Lin; Sebastian U. Stich; Kumar Kshitij Patel; Martin Jaggi

arXiv:1808.07217·cs.LG·February 18, 2020·162 cites

TL;DR

This paper introduces post-local SGD, a method that enhances the generalization of large-batch training in deep neural networks without sacrificing efficiency or scalability.

Contribution

It proposes a novel post-local SGD approach that improves generalization in large-batch training and provides a comprehensive analysis of local SGD variants.

Findings

01. Post-local SGD significantly improves generalization over large-batch training on standard benchmarks.

02. The method keeps the efficiency (time-to-accuracy) and scalability of large-batch training.

03. The paper also gives an extensive study of the trade-off between communication efficiency and performance across local SGD variants.

Abstract

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock: models trained with large batches often do not generalize well, i.e., they do not show good accuracy on new data. As a remedy, we propose post-local SGD and show that it significantly improves generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.
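The abstract names the method but does not spell out the update rule. As a rough illustration only, the sketch below simulates the general idea on a toy least-squares problem in a single process: workers first average their models after every step (equivalent to mini-batch SGD), then switch to averaging only every few local steps (local SGD). The function name `post_local_sgd`, its parameters, the toy objective, and the choice of switch point are all assumptions made for this example and are not taken from the paper's code.

```python
import numpy as np


def post_local_sgd(X, y, num_workers=4, steps=200, switch_step=100,
                   local_steps=8, lr=0.1, batch_size=8, seed=0):
    """Toy, single-process simulation of post-local SGD on least squares.

    All names and defaults are hypothetical choices for illustration.
    Returns the averaged model and the number of averaging rounds.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Each simulated worker keeps its own copy of the model.
    workers = np.zeros((num_workers, d))
    sync_count = 0
    for t in range(1, steps + 1):
        # Every worker takes one SGD step on its own random mini-batch.
        for k in range(num_workers):
            idx = rng.integers(0, n, size=batch_size)
            residual = X[idx] @ workers[k] - y[idx]
            grad = X[idx].T @ residual / batch_size
            workers[k] -= lr * grad
        # Phase 1 (t <= switch_step): average after every step, which
        # matches mini-batch SGD over all workers' samples.
        # Phase 2: average only every `local_steps` steps (local SGD).
        period = 1 if t <= switch_step else local_steps
        if t % period == 0:
            workers[:] = workers.mean(axis=0)
            sync_count += 1
    return workers.mean(axis=0), sync_count


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    X = data_rng.normal(size=(256, 5))
    w_true = np.arange(1.0, 6.0)
    y = X @ w_true  # noise-free targets, so SGD can recover w_true
    w, syncs = post_local_sgd(X, y)
    print("averaging rounds:", syncs)
    print("max abs error:", np.abs(w - w_true).max())
```

In this toy setup the second phase needs far fewer averaging rounds per step than the first, which is the communication saving that local SGD variants trade against performance. The sketch says nothing about generalization, which the paper evaluates on deep networks.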

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning

Methods: Stochastic Gradient Descent