Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates
Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukas König, David Kappel, Anand Subramoney

TL;DR
This paper introduces PD-ASGD, an extension of asynchronous SGD that decouples forward and backward passes with layer-wise updates, significantly improving training speed and robustness in distributed deep learning.
Contribution
It proposes a novel layer-wise, decoupled ASGD method that enhances throughput and reduces delays, with theoretical convergence guarantees.
Findings
Achieves up to 5.95x faster training than synchronous methods in the presence of delays.
Runs up to 2.14x faster than existing ASGD algorithms.
Provides theoretical analysis and convergence proof for the proposed method.
Abstract
The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads when training across devices, leading to longer training times as a result of suboptimal hardware utilization. Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and differences in throughput. Moreover, the backpropagation algorithm used within ASGD workers is bottlenecked by the interlocking between its forward and backward passes. Current methods also do not take advantage of the large differences in the computation required for the forward and backward passes. Therefore, we propose an extension to ASGD called Partial Decoupled ASGD (PD-ASGD) that addresses these…
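The sensitivity of ASGD to delays mentioned in the abstract can be illustrated with a toy simulation. The sketch below is not the PD-ASGD algorithm, whose details are cut off in the abstract above. It only shows, on an assumed quadratic objective, what happens when gradients are computed from a stale copy of the parameters and applied one layer at a time. The function names (`delayed_layerwise_sgd`, `loss`), the scalar "layers", and the choice of objective are all hypothetical and chosen for illustration.

```python
from collections import deque


def delayed_layerwise_sgd(targets, delay, lr=0.1, steps=200):
    """Toy delayed-gradient SGD on f(w) = 0.5 * sum_l (w_l - t_l)^2.

    Assumption: each "layer" is a single scalar weight. Gradients are
    computed from a parameter snapshot that is up to `delay` steps old,
    mimicking a stale worker, and are applied one layer at a time.
    """
    weights = [0.0 for _ in targets]
    # Past parameter snapshots; the oldest entry is the stale view.
    history = deque([list(weights)], maxlen=delay + 1)
    for _ in range(steps):
        stale = history[0]  # snapshot from up to `delay` steps ago
        # Walk layers from last to first, as a backward pass would.
        for layer in reversed(range(len(weights))):
            grad = stale[layer] - targets[layer]  # df/dw_layer at the stale point
            weights[layer] -= lr * grad  # apply this layer's update immediately
        history.append(list(weights))
    return weights


def loss(weights, targets):
    """Value of the toy quadratic objective."""
    return 0.5 * sum((w - t) ** 2 for w, t in zip(weights, targets))


if __name__ == "__main__":
    targets = [1.0, -2.0, 3.0]
    for delay, lr in [(0, 0.1), (4, 0.1), (4, 0.5)]:
        final = delayed_layerwise_sgd(targets, delay=delay, lr=lr)
        print(f"delay={delay} lr={lr} loss={loss(final, targets):.3e}")
```

In this toy setting, a small step size still converges with a delay of four steps, while a larger step size that would be fine without delay makes the delayed iteration diverge. This is one simple way to see why stale gradients constrain asynchronous training.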
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Blind Source Separation Techniques
Methods: Stochastic Gradient Descent · Diffusion
