Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

Cabrel Teguemne Fokam; Khaleelulla Khan Nazeer; Lukas König; David Kappel; Anand Subramoney

arXiv:2410.05985·cs.LG·February 10, 2025

Open Access

TL;DR

This paper introduces PD-ASGD, an extension of asynchronous SGD that decouples forward and backward passes with layer-wise updates, significantly improving training speed and robustness in distributed deep learning.

Contribution

It proposes a novel layer-wise, decoupled ASGD method that enhances throughput and reduces delays, with theoretical convergence guarantees.

Findings

1. Achieves up to 5.95x faster training than synchronous methods in the presence of delays.

2. Runs up to 2.14x faster than existing ASGD algorithms.

3. Provides theoretical analysis and a convergence proof for the proposed method.

Abstract

The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads when training across devices, leading to longer training times as a result of suboptimal hardware utilization. Asynchronous stochastic gradient descent (ASGD) methods can improve training speed, but are sensitive to delays due to both communication and differences in throughput. Moreover, the backpropagation algorithm used within ASGD workers is bottlenecked by the interlocking between its forward and backward passes. Current methods also do not take advantage of the large differences in the computation required for the forward and backward passes. Therefore, we propose an extension to ASGD called Partial Decoupled ASGD (PD-ASGD) that addresses these…
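
As a rough illustration of the delay sensitivity the abstract attributes to ASGD, the sketch below runs gradient descent on a toy one-dimensional quadratic where each update uses a gradient computed from weights that are `delay` steps old. This is not the paper's PD-ASGD algorithm; the function name `stale_gd`, the objective, the step size, and the delay values are all assumptions chosen for illustration.

```python
def stale_gd(delay, lr=0.5, steps=200, w0=1.0):
    """Gradient descent on f(w) = 0.5 * w**2 using stale gradients.

    Hypothetical toy model: the gradient applied at step t is evaluated
    at the weights from step t - delay, mimicking an asynchronous worker
    whose gradient arrives late. Returns the full weight trajectory.
    """
    history = [w0]
    for t in range(steps):
        # Weights the (simulated) worker saw when it computed its gradient.
        stale_w = history[max(0, t - delay)]
        grad = stale_w  # d/dw of 0.5 * w**2 is w
        history.append(history[-1] - lr * grad)
    return history


if __name__ == "__main__":
    for d in (0, 5):
        traj = stale_gd(delay=d)
        peak = max(abs(w) for w in traj[-20:])
        print(f"delay={d}: peak |w| over last 20 steps = {peak:.3e}")
```

With no delay the iterate contracts toward the minimum at zero, while with the same step size and a delay of five steps it oscillates with growing amplitude, which is one simple way to see why stale gradients constrain asynchronous training.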

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Blind Source Separation Techniques

Methods: Stochastic Gradient Descent · Diffusion