Second-Order Convergence of Asynchronous Parallel Stochastic Gradient Descent: When Is the Linear Speedup Achieved?

Lifu Wang; Bo Shen; Ning Zhao

arXiv:1910.06000·cs.LG·June 9, 2020

TL;DR

This paper provides the first theoretical analysis of second-order convergence for asynchronous parallel stochastic gradient descent, establishing conditions under which linear speedup is achieved in non-convex optimization.

Contribution

It offers the first theoretical guarantee of second-order convergence for APSGD and identifies an upper bound on the number of workers under which good convergence and linear speedup are retained.

Findings

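01 APSGD converges to good (second-order) stationary points when the total number of workers is bounded by $O(K^{1/3} M^{-1/3})$, where $K$ is the total number of steps.

02 Linear speedup is achievable as long as the worker count stays within this bound.

03 The analysis covers the behavior of APSGD with consistent read near strict saddle points in non-convex optimization.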

Abstract

In machine learning, asynchronous parallel stochastic gradient descent (APSGD) is widely used to speed up training by distributing work across multiple workers. However, the delay of stale gradients in asynchronous algorithms is generally proportional to the total number of workers, so using delayed gradients introduces an additional deviation from the accurate gradient. This may harm the convergence of the algorithm. One may ask: how many workers can we use at most while still achieving good convergence and linear speedup? In this paper, we consider the second-order convergence of asynchronous algorithms in non-convex optimization. We investigate the behavior of APSGD with consistent read near strict saddle points and provide a theoretical guarantee that if the total number of workers is bounded by $O(K^{1/3} M^{-1/3})$ ($K$ is the total steps and $M$ …
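To make the role of stale gradients concrete, the following is a minimal single-process sketch of SGD with a fixed gradient delay, started exactly at a strict saddle point. It is an illustration only and not the paper's algorithm or experiment: the test function `f(x, y) = x^2/2 - y^2/2 + y^4/4`, the fixed delay standing in for the number of workers, the Gaussian noise model, and all parameter values are assumptions chosen for the example. Both coordinates are read from the same past iterate, which mimics a consistent read.

```python
import random
from collections import deque


def grad(x, y):
    # Gradient of the assumed toy objective f(x, y) = x^2/2 - y^2/2 + y^4/4,
    # which has a strict saddle at (0, 0) and minima at (0, +1) and (0, -1).
    return x, -y + y ** 3


def delayed_sgd(steps=5000, delay=8, lr=0.02, batch=4, sigma=1.0, seed=0):
    """Run SGD where each update uses a gradient evaluated `delay` steps ago.

    `delay` is a stand-in for staleness caused by multiple workers;
    noise with standard deviation sigma / sqrt(batch) models a mini-batch
    gradient. All values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0  # start exactly at the saddle point
    # history[0] is the iterate from `delay` steps ago, history[-1] the current one.
    history = deque([(x, y)] * (delay + 1), maxlen=delay + 1)
    noise_std = sigma / batch ** 0.5
    for _ in range(steps):
        stale_x, stale_y = history[0]  # consistent read of one past iterate
        gx, gy = grad(stale_x, stale_y)
        gx += rng.gauss(0.0, noise_std)
        gy += rng.gauss(0.0, noise_std)
        x -= lr * gx
        y -= lr * gy
        history.append((x, y))
    return x, y


if __name__ == "__main__":
    for tau in (0, 8):
        print(tau, delayed_sgd(delay=tau))
```

With these settings the gradient noise pushes the iterate off the saddle and it settles near one of the two minima, even with stale gradients. Raising `delay` or `lr` far enough eventually destabilizes the iteration, which is the qualitative reason a bound on the number of workers is needed.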

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Privacy-Preserving Technologies in Data

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings