Asynchronous Training Schemes in Distributed Learning with Time Delay

Haoxiang Wang; Zhanhong Jiang; Chao Liu; Soumik Sarkar; Dongxiang Jiang; Young M. Lee

arXiv:2208.13154 · cs.LG · October 28, 2024

TL;DR

This paper introduces PC-ASGD, a novel asynchronous distributed learning algorithm that predicts and clips gradients to mitigate delay effects, with theoretical convergence guarantees and empirical validation on neural networks.

Contribution

We propose PC-ASGD, combining gradient prediction and selective clipping to improve asynchronous training, with convergence analysis and practical implementation strategies.

Findings

01. PC-ASGD effectively reduces the impact of stale gradients.

02. Theoretical convergence is established for weakly strongly-convex and nonconvex functions.

03. Empirical results show improved training performance on neural networks.

Abstract

In the context of distributed deep learning, the issue of stale weights or gradients could result in poor algorithmic performance. This issue is usually tackled by delay tolerant algorithms with some mild assumptions on the objective functions and step sizes. In this paper, we propose a different approach to develop a new algorithm, called Predicting Clipping Asynchronous Stochastic Gradient Descent (aka, PC-ASGD). Specifically, PC-ASGD has two steps: the predicting step leverages the gradient prediction using Taylor expansion to reduce the staleness of the outdated weights, while the clipping step selectively drops the outdated weights to alleviate their negative effects. A tradeoff parameter is introduced to balance the effects between these two steps. Theoretically, we present the convergence rate…
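
The two steps named in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the exact PC-ASGD update rules are not given on this page, so the quadratic objective, the function names `predict_gradient` and `pc_style_update`, the use of an exact Hessian, and the way the tradeoff parameter `theta` blends the two directions are all assumptions made for illustration. The sketch shows a first-order Taylor correction of a stale gradient (a stand-in for the predicting step) and a direction that simply drops stale gradients (a stand-in for the clipping step).

```python
import numpy as np

# Toy quadratic objective f(w) = 0.5 * w^T A w - b^T w (assumed for illustration).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])


def grad(w):
    """Exact gradient of the toy quadratic at w."""
    return A @ w - b


def predict_gradient(stale_grad, hessian, w_now, w_stale):
    """First-order Taylor estimate of the gradient at w_now,
    built from a gradient that was computed at the older point w_stale."""
    return stale_grad + hessian @ (w_now - w_stale)


def pc_style_update(w_now, fresh_grads, stale, hessian, theta, step_size):
    """Hypothetical blend of a 'predicting' and a 'clipping' direction.

    fresh_grads: list of gradients computed at w_now.
    stale: list of (stale_grad, w_stale) pairs from delayed workers.
    theta: assumed tradeoff weight in [0, 1]; 1 uses only the predicting
           direction, 0 uses only the clipping direction.
    """
    # Predicting direction: correct each stale gradient, then average with fresh ones.
    predicted = [predict_gradient(g, hessian, w_now, ws) for g, ws in stale]
    predict_dir = np.mean(fresh_grads + predicted, axis=0)
    # Clipping direction: drop stale contributions and average fresh ones only.
    clip_dir = np.mean(fresh_grads, axis=0)
    direction = theta * predict_dir + (1.0 - theta) * clip_dir
    return w_now - step_size * direction


# One delayed worker computed its gradient at w_stale; the model has since moved to w_now.
w_stale = np.zeros(2)
w_now = np.array([0.5, 0.5])
stale_grad = grad(w_stale)

predicted = predict_gradient(stale_grad, A, w_now, w_stale)
w_next = pc_style_update(
    w_now, [grad(w_now)], [(stale_grad, w_stale)], A, theta=0.5, step_size=0.1
)
```

On a quadratic the first-order expansion of the gradient is exact, so `predicted` coincides with the true gradient at `w_now`. For a neural network the Hessian would have to be approximated and the correction would only be approximate, which is one plausible reason for pairing prediction with a step that discards stale information.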

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Age of Information Optimization · Sparse and Compressive Sensing Techniques