Asynchrony begets Momentum, with an Application to Deep Learning

Ioannis Mitliagkas; Ce Zhang; Stefan Hadjis; Christopher Ré

arXiv:1605.09774 · stat.ML · November 28, 2016

TL;DR

This paper shows that running stochastic gradient descent (SGD) asynchronously acts like adding a momentum term to the update. It links a standard queuing model of asynchrony to the form of momentum used in deep learning practice, and it finds that tuning the momentum parameter matters when training asynchronously.

Contribution

The paper provides a theoretical framework connecting asynchrony with momentum in deep learning. The result does not assume convexity, so it applies to non-convex problems, and it offers practical guidance for tuning asynchronous training.

Findings

01. Asynchrony introduces a momentum-like effect in SGD.

02. Tuning momentum is crucial for optimal performance under asynchrony.

03. Negative momentum can counteract adverse effects of high asynchrony.

Abstract

Asynchronous methods are widely used in deep learning, but have limited theoretical justification when applied to non-convex problems. We show that running stochastic gradient descent (SGD) in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration. Our result does not assume convexity of the objective function, so it is applicable to deep learning systems. We observe that a standard queuing model of asynchrony results in a form of momentum that is commonly used by deep learning practitioners. This forges a link between queuing theory and asynchrony in deep learning systems, which could be useful for systems builders. For convolutional neural networks, we experimentally validate that the degree of asynchrony directly correlates with the momentum, confirming our main result. An important implication is that tuning the momentum parameter is important when…
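The claim that asynchrony acts like momentum can be illustrated with a small numerical sketch. The code below is not the authors' implementation. It assumes a toy one-dimensional quadratic objective and a geometric staleness distribution, P(lag) = (1 - mu) * mu**lag; both are illustrative assumptions, and the names `grad`, `expected_async_sgd`, and `momentum_gd` are hypothetical. Because the gradient of a quadratic is linear, the expected asynchronous iterate can be computed exactly by averaging gradients over the staleness distribution, and it can then be compared with a momentum iteration that uses momentum `mu` and step size `(1 - mu) * lr`.

```python
def grad(w, a=1.0):
    # Gradient of the toy objective f(w) = 0.5 * a * w**2 (illustrative assumption).
    return a * w


def expected_async_sgd(w0, lr, mu, steps):
    """Expected iterate when each gradient is evaluated at a stale iterate.

    Assumption: staleness is geometric, P(lag) = (1 - mu) * mu**lag.
    Lags that would reach before step 0 contribute a zero gradient.
    """
    ws = [w0]
    for t in range(steps):
        # Average the gradient over all reachable stale iterates.
        avg_grad = sum(
            (1.0 - mu) * mu ** lag * grad(ws[t - lag]) for lag in range(t + 1)
        )
        ws.append(ws[t] - lr * avg_grad)
    return ws


def momentum_gd(w0, lr, mu, steps):
    """Synchronous gradient descent with momentum mu and step (1 - mu) * lr."""
    ws = [w0]
    prev_step = 0.0
    for t in range(steps):
        step = mu * prev_step - (1.0 - mu) * lr * grad(ws[t])
        ws.append(ws[t] + step)
        prev_step = step
    return ws


async_path = expected_async_sgd(w0=1.0, lr=0.5, mu=0.75, steps=50)
momentum_path = momentum_gd(w0=1.0, lr=0.5, mu=0.75, steps=50)

# Largest difference between the two trajectories.
max_gap = max(abs(a - b) for a, b in zip(async_path, momentum_path))
print(f"max gap between trajectories: {max_gap:.2e}")
```

Under these assumptions the two trajectories agree up to floating-point rounding, since the geometrically weighted sum of past gradients satisfies the same recursion as the momentum update. For non-quadratic objectives the correspondence would hold only for expected gradients, not for the iterates themselves.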

Taxonomy

Methods: Stochastic Gradient Descent