Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Ammar Mahran, Artavazd Maranjyan, Peter Richtárik

TL;DR
Rescaled Asynchronous SGD modifies standard ASGD in one respect: each worker's stepsize is rescaled in proportion to that worker's computation time, so the method converges to the true global objective despite data and system heterogeneity.
Contribution
The paper shows that rescaling worker-specific stepsizes in ASGD corrects the bias caused by heterogeneity without additional communication or memory overhead.
Findings
Rescaled ASGD converges to the correct global objective in heterogeneous settings.
The method's time complexity matches the theoretical lower bounds.
Experiments show competitive convergence and accuracy.
Abstract
Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient arrives. Vanilla ASGD applies each arriving gradient with the same weight. When local data distributions are heterogeneous, this becomes problematic: faster workers contribute more updates, and we show theoretically that the method is biased toward a frequency-weighted average of the local objectives rather than the desired global objective. Existing remedies typically move away from the simple ASGD template by introducing gathering phases, buffering, or extra memory. We show that this is unnecessary. Keeping the standard ASGD mechanism, we recover the correct objective by rescaling worker-specific stepsizes in proportion to their computation times, so…
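The bias described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not taken from the paper: each worker holds a one-dimensional quadratic f_i(x) = 0.5 (x - b_i)^2, gradients are noise-free, computation times are fixed, and the rescaled stepsizes are normalized by the mean computation time. The names run_asgd, taus, and targets are hypothetical. With these values the frequency-weighted minimizer is 12/7 (about 1.71), while the global minimizer is 3.0.

```python
import heapq

def run_asgd(taus, targets, stepsizes, num_updates=100_000):
    """Event-driven toy ASGD on f_i(x) = 0.5 * (x - targets[i])**2.

    Worker i needs taus[i] time units per gradient. The server applies each
    gradient as soon as it arrives, even though it was computed at a stale x.
    """
    x = 0.0
    # Each event is (finish_time, worker_id, gradient computed at job start).
    events = [(tau, i, x - targets[i]) for i, tau in enumerate(taus)]
    heapq.heapify(events)
    for _ in range(num_updates):
        now, i, grad = heapq.heappop(events)
        x -= stepsizes[i] * grad  # server update with the worker's stepsize
        # Worker i immediately starts a new gradient at the updated model.
        heapq.heappush(events, (now + taus[i], i, x - targets[i]))
    return x

taus = [1.0, 2.0, 4.0]      # assumed computation times (fast to slow)
targets = [0.0, 3.0, 6.0]   # assumed local minimizers b_i
gamma = 0.002               # assumed base stepsize

# Vanilla ASGD: every arriving gradient gets the same stepsize.
vanilla = run_asgd(taus, targets, [gamma] * len(taus))

# Rescaled variant: stepsize proportional to computation time.
# Dividing by the mean time is an illustrative normalization choice.
mean_tau = sum(taus) / len(taus)
rescaled = run_asgd(taus, targets, [gamma * t / mean_tau for t in taus])

# Reference points: minimizer of the plain average, and of the average
# weighted by update frequency 1 / tau_i.
global_min = sum(targets) / len(targets)
freq = [1.0 / t for t in taus]
freq_min = sum(w * b for w, b in zip(freq, targets)) / sum(freq)

print(f"vanilla={vanilla:.3f} (frequency-weighted minimizer {freq_min:.3f})")
print(f"rescaled={rescaled:.3f} (global minimizer {global_min:.3f})")
```

In this toy setting, worker i contributes updates at a rate proportional to 1/tau_i, so a stepsize proportional to tau_i gives every worker roughly equal total weight per unit of time. Because the stepsize is constant, the iterates settle into a small cycle around the limit point rather than converging to it exactly.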
