Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum
Tehila Dahan, Roie Reshef, Sharon Goldstein, Kfir Y. Levy

TL;DR
This paper introduces a momentum-based asynchronous SGD framework for data-dependent delays. It achieves optimal convergence rates in smooth convex and non-convex settings and simplifies hyperparameter tuning in distributed training.
Contribution
The paper proposes a momentum-based approach that preserves information from delayed gradients rather than attenuating or discarding them, and it establishes the first optimal convergence rates under data-dependent delays.
Findings
Achieves optimal convergence rates under data-dependent delays in both convex and non-convex smooth settings.
Develops robust learning-rate schedules that ease hyperparameter tuning.
Provides theoretical guarantees under standard assumptions for asynchronous optimization.
Abstract
Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and staleness-aware filtering, typically attenuate or discard delayed gradients, introducing systematic bias: updates from simpler or faster-to-process samples are overrepresented, while gradients from more complex samples are delayed or suppressed. In contrast, prior approaches to data-dependent delays rely on a Lipschitz assumption that yields suboptimal rates or leave the smooth, convex case unaddressed. We propose a momentum-based asynchronous framework designed to preserve information from delayed gradients while mitigating the effects of staleness. We establish the first optimal convergence rates for data-dependent delays in both convex and non-convex smooth setups, providing a new result for…
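The bias the abstract attributes to staleness-aware filtering can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's algorithm or experimental setup: the names `simulate_staleness_filter` and `hard_share`, the uniform delay ranges, and the staleness threshold are all hypothetical and chosen for illustration.

```python
import random


def simulate_staleness_filter(n_samples=10_000, hard_fraction=0.5,
                              max_staleness=5, seed=0):
    """Toy model of a staleness filter under data-dependent delays.

    Assumption (not from the paper): "easy" samples produce gradients
    with delays drawn uniformly from 0..4, "hard" samples from 3..12.
    Gradients whose delay exceeds max_staleness are discarded.
    """
    rng = random.Random(seed)
    seen = {"easy": 0, "hard": 0}
    kept = {"easy": 0, "hard": 0}
    for _ in range(n_samples):
        kind = "hard" if rng.random() < hard_fraction else "easy"
        # Data-dependent delay: harder samples take longer to process.
        delay = rng.randint(3, 12) if kind == "hard" else rng.randint(0, 4)
        seen[kind] += 1
        if delay <= max_staleness:
            kept[kind] += 1
    return seen, kept


def hard_share(counts):
    """Fraction of gradients in `counts` that came from hard samples."""
    total = counts["easy"] + counts["hard"]
    return counts["hard"] / total if total else 0.0


if __name__ == "__main__":
    seen, kept = simulate_staleness_filter()
    print(f"hard share among computed gradients: {hard_share(seen):.2f}")
    print(f"hard share among applied gradients:  {hard_share(kept):.2f}")
```

In this toy setting every easy gradient passes the filter while most hard gradients are dropped, so the applied updates overrepresent easy samples. That is the kind of systematic bias the abstract describes and that a method preserving delayed gradients aims to avoid.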
