TL;DR
This paper demonstrates that asynchronous stochastic gradient descent in deep learning acts like adding momentum, linking queuing theory to training dynamics, and shows that tuning momentum improves convergence under asynchrony.
Contribution
It provides a theoretical framework connecting asynchrony with momentum in deep learning, applicable to non-convex problems, and offers practical insights for tuning and improving asynchronous training.
Findings
Asynchrony introduces a momentum-like effect in SGD.
Tuning momentum is crucial for optimal performance under asynchrony.
Negative momentum can counteract adverse effects of high asynchrony.
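The findings above can be probed with a small simulation. The following is a minimal sketch, not the authors' code: it assumes a toy 1-D quadratic objective, a simplified queuing model in which a uniformly random worker finishes at each step, and the commonly cited form of the paper's result that M workers induce an implicit momentum of roughly 1 - 1/M. The names `grad`, `momentum_sgd`, `async_sgd`, and `implicit_momentum` are hypothetical and chosen only for illustration.

```python
import random


def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (illustrative assumption).
    return w


def implicit_momentum(num_workers):
    # Assumed form of the paper's result: M workers behave like momentum 1 - 1/M.
    return 1.0 - 1.0 / num_workers


def momentum_sgd(w0, lr, mu, steps):
    """Synchronous heavy-ball update:
    w_{t+1} = w_t + mu * (w_t - w_{t-1}) - lr * grad(w_t)."""
    w_prev, w = w0, w0
    traj = [w0]
    for _ in range(steps):
        w_next = w + mu * (w - w_prev) - lr * grad(w)
        w_prev, w = w, w_next
        traj.append(w)
    return traj


def async_sgd(w0, lr, num_workers, steps, seed=0):
    """Toy asynchronous SGD: each worker computes its gradient on the
    (possibly stale) weights it read when it last started work."""
    rng = random.Random(seed)
    w = w0
    stale = [w0] * num_workers  # weights each worker last read
    traj = [w0]
    for _ in range(steps):
        k = rng.randrange(num_workers)  # a uniformly random worker finishes
        w = w - lr * grad(stale[k])     # apply its stale gradient
        stale[k] = w                    # that worker re-reads fresh weights
        traj.append(w)
    return traj
```

With a single worker there is no staleness, so `async_sgd` reduces to plain SGD, which is `momentum_sgd` with `mu=0`. For more workers, one way to explore the claimed correspondence is to average `async_sgd` trajectories over many seeds and compare them with `momentum_sgd(w0, lr / M, implicit_momentum(M), steps)`. Setting a negative explicit `mu` in a momentum-based optimizer is the kind of compensation the third finding refers to.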
Abstract
Asynchronous methods are widely used in deep learning, but have limited theoretical justification when applied to non-convex problems. We show that running stochastic gradient descent (SGD) in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration. Our result does not assume convexity of the objective function, so it is applicable to deep learning systems. We observe that a standard queuing model of asynchrony results in a form of momentum that is commonly used by deep learning practitioners. This forges a link between queuing theory and asynchrony in deep learning systems, which could be useful for systems builders. For convolutional neural networks, we experimentally validate that the degree of asynchrony directly correlates with the momentum, confirming our main result. An important implication is that tuning the momentum parameter is important when…
Taxonomy
Methods: Stochastic Gradient Descent
