Ordered Momentum for Asynchronous SGD

Chang-Wei Shi; Yi-Rui Yang; Wu-Jun Li

arXiv:2407.19234 · cs.LG · January 24, 2025

TL;DR

This paper introduces ordered momentum (OrMo), a novel approach for asynchronous SGD that organizes gradients by iteration order, improving convergence in distributed deep learning with heterogeneous workers.

Contribution

The paper proposes ordered momentum (OrMo), a new momentum method for ASGD, and presents what it describes as the first convergence analysis that does not depend on the maximum delay. The method is supported by both theoretical analysis and experiments.

Findings

01. OrMo achieves better convergence than existing ASGD methods.

02. The paper proves convergence for non-convex problems with both constant and delay-adaptive learning rates.

03. Experiments show improved training performance over traditional asynchronous methods.

Abstract

Distributed learning is essential for training large-scale deep models. Asynchronous SGD (ASGD) and its variants are commonly used distributed learning methods, particularly in scenarios where the computing capabilities of workers in the cluster are heterogeneous. Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training. However, existing works have found that naively incorporating momentum into ASGD can impede the convergence. In this paper, we propose a novel method called ordered momentum (OrMo) for ASGD. In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes. We theoretically prove the convergence of OrMo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis…
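To make the idea of "organizing the gradients in order based on their iteration indexes" concrete, the toy sketch below compares how much weight each gradient ends up with in a momentum buffer under two schemes. It is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not give the OrMo update rule, and the function names `naive_weights` and `ordered_weights`, the update `u = beta * u + g`, and the weighting `beta ** (newest - idx)` are all hypothetical choices made for this example.

```python
from typing import Dict, List


def naive_weights(arrivals: List[int], beta: float) -> Dict[int, float]:
    """Weight of each gradient after applying u = beta * u + g in arrival order.

    `arrivals` lists the iteration index of each gradient in the order the
    server receives it. Staleness is ignored (assumed naive baseline).
    """
    weights: Dict[int, float] = {}
    for idx in arrivals:
        for k in weights:
            weights[k] *= beta  # every earlier arrival is decayed once
        weights[idx] = 1.0      # the new arrival always enters with weight 1
    return weights


def ordered_weights(arrivals: List[int], beta: float) -> Dict[int, float]:
    """Hypothetical order-aware variant: weight depends on iteration index.

    A late gradient is inserted with the weight it would have had if it had
    arrived in order, i.e. beta ** (newest index seen - its own index).
    """
    weights: Dict[int, float] = {}
    newest = None
    for idx in arrivals:
        if newest is None or idx > newest:
            shift = 0 if newest is None else idx - newest
            for k in weights:
                weights[k] *= beta ** shift  # decay by the index gap
            newest = idx
            weights[idx] = 1.0
        else:
            weights[idx] = beta ** (newest - idx)  # stale gradient, discounted
    return weights


# Gradient 1 comes from a slow worker and arrives last.
arrivals = [0, 2, 3, 1]
naive = naive_weights(arrivals, beta=0.5)
ordered = ordered_weights(arrivals, beta=0.5)
print("naive:  ", dict(sorted(naive.items())))
print("ordered:", dict(sorted(ordered.items())))
```

In this toy setting the naive scheme gives the stalest gradient (index 1) the largest weight simply because it arrived last, while the order-aware scheme reproduces the weights that in-order momentum would assign. This is one way to see why naively adding momentum to ASGD can hurt, as the abstract notes.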

Videos

Ordered Momentum for Asynchronous SGD (SlidesLive)

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Stochastic Gradient Descent