Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

Tehila Dahan; Roie Reshef; Sharon Goldstein; Kfir Y. Levy

arXiv:2605.02043 · cs.LG · May 15, 2026

TL;DR

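This paper introduces a momentum-based asynchronous SGD framework for data-dependent delays. It preserves information from delayed gradients rather than attenuating or discarding them, establishes optimal convergence rates in smooth convex and non-convex settings, and simplifies hyperparameter tuning in distributed training.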

Contribution

It proposes a novel momentum-based approach that preserves information from delayed gradients and establishes the first optimal convergence rates under data-dependent delays.

Findings

1. Achieves optimal convergence rates for data-dependent delays in convex and non-convex settings.

2. Develops robust learning-rate schedules that ease hyperparameter tuning.

3. Provides theoretical guarantees under standard assumptions for asynchronous optimization.

Abstract

Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and staleness-aware filtering, typically attenuate or discard delayed gradients, introducing systematic bias: updates from simpler or faster-to-process samples are overrepresented, while gradients from more complex samples are delayed or suppressed. In contrast, prior approaches to data-dependent delays rely on a Lipschitz assumption that yields suboptimal rates or leave the smooth, convex case unaddressed. We propose a momentum-based asynchronous framework designed to preserve information from delayed gradients while mitigating the effects of staleness. We establish the first optimal convergence rates for data-dependent delays in both convex and non-convex smooth setups, providing a new result for…
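
The bias described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's method or experiment: the delay ranges, the `max_delay` threshold, and the function name `simulate_filter_bias` are all hypothetical choices made for illustration. It assumes that "hard" samples take longer to process and therefore arrive with larger delays, then measures how a simple staleness filter (drop any gradient whose delay exceeds a threshold) changes the share of hard samples among the updates that are actually applied.

```python
import random


def simulate_filter_bias(n_samples=10_000, hard_fraction=0.3, max_delay=4, seed=0):
    """Return the share of 'hard' samples among gradients that survive a staleness filter.

    Toy assumptions (not from the paper):
      - each sample is 'hard' with probability hard_fraction;
      - easy samples arrive with a delay of 0..3 steps, hard samples with 2..9 steps;
      - the filter discards any gradient whose delay exceeds max_delay.
    """
    rng = random.Random(seed)
    kept_total = 0
    kept_hard = 0
    for _ in range(n_samples):
        is_hard = rng.random() < hard_fraction
        # Data-dependent delay: harder samples take longer to process.
        delay = rng.randint(2, 9) if is_hard else rng.randint(0, 3)
        if delay <= max_delay:  # staleness-aware filtering keeps only fresh gradients
            kept_total += 1
            if is_hard:
                kept_hard += 1
    return kept_hard / kept_total


share = simulate_filter_bias()
print(f"hard samples in the data:        {0.3:.3f}")
print(f"hard samples in applied updates: {share:.3f}")
```

Under these assumed delay ranges, every easy gradient passes the filter while most hard gradients are dropped, so the share of hard samples in the applied updates falls well below their 30% share of the data. Raising `max_delay` to 9 keeps every gradient and removes the skew, at the cost of applying staler updates, which is the trade-off the abstract describes.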
