Ordered Momentum for Asynchronous SGD
Chang-Wei Shi, Yi-Rui Yang, Wu-Jun Li

TL;DR
This paper introduces ordered momentum (OrMo), a novel approach for asynchronous SGD that organizes gradients by iteration order, improving convergence in distributed deep learning with heterogeneous workers.
Contribution
It proposes a new ordered momentum method for ASGD and, according to the authors, gives the first convergence analysis of ASGD with momentum that does not depend on the maximum delay, supported by experiments.
Findings
OrMo achieves better convergence than existing ASGD methods.
Convergence is proven for non-convex problems with both constant and delay-adaptive learning rates.
Empirical results show improved training performance over traditional asynchronous methods.
Abstract
Distributed learning is essential for training large-scale deep models. Asynchronous SGD (ASGD) and its variants are commonly used distributed learning methods, particularly in scenarios where the computing capabilities of workers in the cluster are heterogeneous. Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training. However, existing works have found that naively incorporating momentum into ASGD can impede the convergence. In this paper, we propose a novel method called ordered momentum (OrMo) for ASGD. In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes. We theoretically prove the convergence of OrMo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis…
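To make the setting in the abstract concrete, here is a minimal sketch of the baseline it warns about: a server that applies momentum to whichever gradient arrives next, even when that gradient was computed on a stale copy of the parameters. This is not OrMo and not the authors' code. The function name `naive_momentum_asgd`, the toy objective f(w) = 0.5 * w^2, the random arrival order and all default values are assumptions chosen for illustration.

```python
import random


def naive_momentum_asgd(num_workers=4, steps=200, lr=0.01, beta=0.9, seed=0):
    """Toy simulation of ASGD with naively applied momentum.

    Hypothetical illustration only: minimizes f(w) = 0.5 * w**2, whose
    gradient at w is simply w. Returns the final parameter and the list
    of observed gradient delays (measured in server iterations).
    """
    rng = random.Random(seed)
    w = 5.0  # server parameter
    u = 0.0  # server momentum buffer
    # Each worker holds (parameter snapshot, server iteration when it was pulled).
    snapshots = [(w, 0) for _ in range(num_workers)]
    delays = []
    for t in range(steps):
        # Assumption: a random worker finishes next, standing in for
        # heterogeneous compute speeds.
        k = rng.randrange(num_workers)
        w_stale, pulled_at = snapshots[k]
        grad = w_stale  # gradient evaluated at the stale snapshot
        delays.append(t - pulled_at)
        # Naive momentum: the stale gradient is folded in as if it were fresh,
        # regardless of which iteration it belongs to.
        u = beta * u + grad
        w = w - lr * u
        # The worker pulls the updated parameter and starts its next gradient.
        snapshots[k] = (w, t + 1)
    return w, delays


if __name__ == "__main__":
    final_w, delays = naive_momentum_asgd()
    print("final w:", final_w)
    print("max delay:", max(delays), "mean delay:", sum(delays) / len(delays))
```

The `delays` list shows how far behind each applied gradient was. The abstract's point is that mixing such out-of-order gradients into one momentum buffer can hurt convergence, which is the problem that ordering gradients by iteration index is meant to address.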
Taxonomy
Topics: Parallel Computing and Optimization Techniques
Methods: Stochastic Gradient Descent
