Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework
Siyuan Yu, Wei Chen, H. Vincent Poor

TL;DR
This paper introduces a stochastic delay differential equation framework to analyze and optimize asynchronous distributed SGD, accounting for delays and staleness, and revealing conditions for convergence and acceleration.
Contribution
The paper develops a unified SDDE-based framework for analyzing and optimizing asynchronous SGD with delays, without assuming memoryless computation times, and provides new insights into scheduling policies.
Findings
Adding more workers does not always speed up SGD, because of staleness.
Small staleness may not slow convergence, but large staleness can cause divergence.
The framework effectively models complex non-convex learning tasks.
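The staleness findings above can be illustrated with a deliberately simple toy that is not taken from the paper: deterministic gradient descent on f(x) = x²/2, where each update uses a gradient that is a fixed number of iterations old. The function name `delayed_gd`, the step size 0.3, and the chosen delays are illustrative assumptions; the paper's own analysis uses stochastic delay differential equations with random delays.

```python
def delayed_gd(step, delay, iters=400, x0=1.0):
    """Iterate x[k+1] = x[k] - step * x[k - delay] for f(x) = x**2 / 2.

    The gradient of f at x is x itself, so each update applies a
    gradient that is `delay` iterations stale (toy model, assumed here).
    Returns the largest |x| among the last `delay + 1` iterates.
    """
    # Assume the iterate was constant at x0 before the first update.
    history = [x0] * (delay + 1)
    for _ in range(iters):
        stale_x = history[0]                  # iterate from `delay` steps ago
        new_x = history[-1] - step * stale_x  # step along the stale gradient
        history.append(new_x)
        history.pop(0)                        # keep only delay + 1 iterates
    return max(abs(v) for v in history)


if __name__ == "__main__":
    for delay in (0, 2, 10):
        print(f"delay={delay:2d}  final |x| = {delayed_gd(0.3, delay):.3e}")
```

With the same step size, the runs with delay 0 and delay 2 shrink toward zero, while the run with delay 10 grows without bound. In this toy, a small amount of staleness is harmless and a large amount destabilizes the iteration.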
Abstract
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, stragglers and limited bandwidth may induce random computational/communication delays, thereby severely hindering the learning process. Therefore, how to accelerate asynchronous SGD by efficiently scheduling multiple workers is an important issue. In this paper, a unified framework is presented to analyze and optimize the convergence of asynchronous SGD based on stochastic delay differential equations (SDDEs) and the Poisson approximation of aggregated gradient arrivals. In particular, we present the run time and staleness of distributed SGD without a memorylessness assumption on the computation times. Given the learning rate, we reveal the relevant…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Mathematical Biology Tumor Growth
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Stochastic Gradient Descent
