Gradient Energy Matching for Distributed Asynchronous Gradient Descent
Joeri Hermans, Gilles Louppe

TL;DR
This paper introduces a novel energy-based framework for analyzing and stabilizing distributed asynchronous SGD, leading to a new method called GEM that improves stability, speed, and generalization in large-scale deep learning.
Contribution
It proposes an energy-based stability criterion for asynchronous SGD and develops GEM, a method that keeps the energy of the asynchronous system at or below that of a target process (sequential SGD with momentum), improving stability and performance.
Findings
GEM achieves greater stability than existing methods.
GEM scales effectively to 100 workers.
GEM shows improved generalization over its target process, sequential SGD with momentum.
Abstract
Distributed asynchronous SGD has become widely used for deep learning in large-scale systems, but remains notorious for its instability when increasing the number of workers. In this work, we study the dynamics of distributed asynchronous SGD under the lens of Lagrangian mechanics. Using this description, we introduce the concept of energy to describe the optimization process and derive a sufficient condition ensuring its stability as long as the collective energy induced by the active workers remains below the energy of a target synchronous process. Making use of this criterion, we derive a stable distributed asynchronous optimization procedure, GEM, that estimates and maintains the energy of the asynchronous system below or equal to the energy of sequential SGD with momentum. Experimental results highlight the stability and speedup of GEM compared to existing schemes, even when…
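The energy-matching idea in the abstract can be illustrated with a small toy sketch. This is not the paper's algorithm: it assumes, purely for illustration, that the "energy" of an update is half its squared L2 norm, and it uses a hypothetical scalar `energy_matching_scale` to shrink the summed worker updates so their collective energy stays at or below that of one sequential momentum step. All function names and the default `lr` and `mu` values are assumptions.

```python
import math


def kinetic_energy(update):
    """Illustrative energy of an update vector: 0.5 * squared L2 norm (assumption)."""
    return 0.5 * sum(u * u for u in update)


def momentum_step(velocity, gradient, lr=0.1, mu=0.9):
    """One step of the sequential SGD-with-momentum target: v <- mu * v - lr * g."""
    return [mu * v - lr * g for v, g in zip(velocity, gradient)]


def energy_matching_scale(target_update, worker_updates, eps=1e-12):
    """Hypothetical scalar in (0, 1] that shrinks the summed worker updates
    so that their collective energy does not exceed the target's energy."""
    dim = len(target_update)
    # Collective update: element-wise sum over all active workers.
    collective = [sum(w[i] for w in worker_updates) for i in range(dim)]
    e_target = kinetic_energy(target_update)
    e_collective = kinetic_energy(collective)
    if e_collective <= e_target:
        return 1.0  # already within the energy budget, leave updates untouched
    # Energy scales with the square of the update, hence the square root.
    return math.sqrt(e_target / (e_collective + eps))


# Usage: four workers each push the same update as the target process.
target = momentum_step([0.0, 0.0], [1.0, -2.0])
workers = [list(target) for _ in range(4)]
scale = energy_matching_scale(target, workers)
scaled_workers = [[scale * u for u in w] for w in workers]
```

In this toy setting, four identical workers would inject sixteen times the target energy if left unscaled, so each update is shrunk by roughly a factor of four. The scalar form is a simplification chosen for readability; a per-parameter scaling would follow the same reasoning.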
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Advanced Neural Network Applications
Methods: Stochastic Gradient Descent
