Asynchronous Stochastic Gradient Descent with Delay Compensation
Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, Tie-Yan Liu

TL;DR
This paper introduces Delay Compensated ASGD (DC-ASGD), a novel method that mitigates gradient delay issues in asynchronous SGD by using Taylor expansion and Hessian approximation, improving training efficiency and accuracy.
Contribution
The paper presents a new delay compensation technique for ASGD that aligns its behavior more closely with sequential SGD, enhancing performance in large-scale neural network training.
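The compensation idea can be sketched as a first-order Taylor expansion of the gradient. The notation is assumed here for illustration and does not come from this summary: w_t is the parameter snapshot the worker used, w_{t+\tau} is the global model after \tau updates from other workers, g is the gradient, H is the Hessian of the loss, \tilde{H} is some cheap approximation of H, and \eta is the learning rate.

```latex
% Gradient at the current global model, expanded around the stale snapshot:
g(w_{t+\tau}) \approx g(w_t) + H(w_t)\,(w_{t+\tau} - w_t)

% Delay-compensated update, with H replaced by an inexpensive approximation:
w_{t+\tau+1} = w_{t+\tau} - \eta \left[ g(w_t) + \tilde{H}(w_t)\,(w_{t+\tau} - w_t) \right]
```

When there is no delay (w_{t+\tau} = w_t), the correction term vanishes and the update reduces to an ordinary SGD step.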
Findings
DC-ASGD outperforms traditional ASGD and synchronous SGD on CIFAR-10 and ImageNet.
The method nearly matches the accuracy of sequential SGD.
Experimental results validate the effectiveness of delay compensation.
Abstract
With the fast development of deep learning, it has become common to learn big neural networks using massive training data. Asynchronous Stochastic Gradient Descent (ASGD) is widely adopted for this task because of its efficiency; however, it is known to suffer from the problem of delayed gradients. That is, when a local worker adds its gradient to the global model, the global model may have been updated by other workers and this gradient becomes "delayed". We propose a novel technology to compensate for this delay, so as to make the optimization behavior of ASGD closer to that of sequential SGD. This is achieved by leveraging Taylor expansion of the gradient function and efficient approximation to the Hessian matrix of the loss function. We call the new algorithm Delay Compensated ASGD (DC-ASGD). We evaluated the proposed algorithm on CIFAR-10 and ImageNet datasets, and the experimental…
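A minimal sketch of what a delay-compensated update could look like on the parameter server, in plain Python. It assumes the Hessian is approximated by the element-wise square of the delayed gradient, scaled by a constant `lam`. The function name `dc_asgd_step`, its parameters, and the default values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def dc_asgd_step(w_global, w_backup, grad, lr=0.1, lam=0.1):
    """Apply one delay-compensated update to the global parameters.

    w_global: current global parameters (possibly updated by other workers)
    w_backup: snapshot of the parameters the worker used to compute `grad`
    grad:     gradient evaluated at w_backup (the "delayed" gradient)
    lr, lam:  learning rate and compensation strength (arbitrary defaults)
    """
    new_w = []
    for w_now, w_old, g in zip(w_global, w_backup, grad):
        # Assumed diagonal Hessian approximation: lam * g * g.
        # The correction term is zero when the snapshot is not stale.
        compensated = g + lam * g * g * (w_now - w_old)
        new_w.append(w_now - lr * compensated)
    return new_w


if __name__ == "__main__":
    # No delay: snapshot equals the global model, so this is a plain SGD step.
    print(dc_asgd_step([1.0], [1.0], [2.0], lr=0.1, lam=0.5))
    # Delayed: the global model moved from 1.0 to 0.5 while the worker computed.
    print(dc_asgd_step([0.5], [1.0], [2.0], lr=0.1, lam=0.5))
```

In the delayed call, the correction term shrinks the stale gradient because the global model has already moved in the descent direction. Setting `lam=0` recovers ordinary ASGD.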
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
Methods: Stochastic Gradient Descent
