On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization

Hao Yu; Rong Jin

arXiv:1905.04346·math.OC·May 14, 2019·31 cites

TL;DR

This paper analyzes how dynamically increasing batch sizes in parallel SGD affect the convergence rate and the number of communication rounds for stochastic non-convex problems, showing that fast convergence rates can be reached with far fewer communication rounds.

Contribution

It introduces new convergence rate analyses for dynamic batch size strategies and proposes a Catalyst-like algorithm for general non-convex optimization.

Findings

01

Exponentially increasing batch sizes achieve $O(1/(NT))$ convergence with $\log(T)$ communication rounds under the P-L condition.

02

The proposed Catalyst-like algorithm attains $O(1/\sqrt{NT})$ convergence with $O(\sqrt{NT}\log(T/N))$ communication rounds.

03

Dynamically increasing batch sizes reduce the number of communication rounds while preserving the fastest known convergence rates.
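To make the $\log(T)$ round count concrete, here is a minimal sketch of an exponentially increasing batch-size schedule. Because the batch sizes form a geometric series, a per-worker budget of $T$ gradient calls is used up after roughly $\log(T)$ rounds. The function name, the parameters `initial_batch` and `growth`, and the doubling factor are assumptions for illustration; this summary does not give the paper's exact schedule or constants.

```python
import math


def exponential_batch_schedule(total_budget, initial_batch=1, growth=2.0):
    """Return per-round batch sizes whose sum stays within total_budget.

    Hypothetical helper: the names and default values are illustrative
    assumptions, not the schedule used in the paper.
    """
    if total_budget < 1 or initial_batch < 1 or growth <= 1.0:
        raise ValueError("need total_budget >= 1, initial_batch >= 1, growth > 1")
    sizes = []
    used = 0
    batch = float(initial_batch)
    # Keep adding rounds while the next batch still fits in the budget.
    while used + math.floor(batch) <= total_budget:
        size = math.floor(batch)
        sizes.append(size)
        used += size
        batch *= growth  # geometric growth of the batch size
    return sizes


# Per-worker budget of T = 1024 stochastic gradient calls.
schedule = exponential_batch_schedule(total_budget=1024)
num_rounds = len(schedule)   # one communication round per batch
fixed_batch_rounds = 1024    # a fixed batch size of 1 would need T rounds
print(schedule, num_rounds, fixed_batch_rounds)
```

With doubling, the number of rounds grows like $\log_2(T)$, whereas a fixed batch size needs a number of rounds proportional to $T$.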

Abstract

For SGD-based distributed stochastic optimization, computation complexity, measured by the convergence rate in terms of the number of stochastic gradient calls, and communication complexity, measured by the number of inter-node communication rounds, are the two most important performance metrics. The classical data-parallel implementation of SGD over $N$ workers can achieve linear speedup of its convergence rate but incurs an inter-node communication round at each batch. We study the benefit of using dynamically increasing batch sizes in parallel SGD for stochastic non-convex optimization by characterizing the attained convergence rate and the required number of communication rounds. We show that for stochastic non-convex optimization under the P-L condition, the classical data-parallel SGD with exponentially increasing batch sizes can achieve the fastest known $O(1/(NT))$ convergence with…
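As a rough illustration of the setting described in the abstract, the following toy simulation runs data-parallel SGD with doubling batch sizes on $f(x) = \tfrac{1}{2}x^2$, which satisfies the P-L condition. It is a sketch under stated assumptions, not the paper's algorithm: the objective, the Gaussian gradient noise, the constant step size, and every name and default value below are hypothetical choices made for illustration.

```python
import random


def parallel_sgd_dynamic_batch(num_workers=4, budget_per_worker=1024,
                               lr=0.5, growth=2, x0=10.0,
                               noise_std=1.0, seed=0):
    """Toy data-parallel SGD on f(x) = 0.5 * x**2 with growing batch sizes.

    All parameters are illustrative assumptions. Returns the final iterate
    and the number of communication rounds used.
    """
    rng = random.Random(seed)
    x = x0
    batch = 1
    used = 0     # stochastic gradient calls spent per worker
    rounds = 0   # inter-node communication rounds
    while used + batch <= budget_per_worker:
        worker_grads = []
        for _ in range(num_workers):
            # Each worker averages `batch` noisy gradients; grad f(x) = x.
            g = sum(x + rng.gauss(0.0, noise_std) for _ in range(batch)) / batch
            worker_grads.append(g)
        # One communication round: average the gradients across workers.
        avg_grad = sum(worker_grads) / num_workers
        x -= lr * avg_grad
        used += batch
        rounds += 1
        batch *= growth  # exponentially increasing batch size
    return x, rounds


final_x, rounds = parallel_sgd_dynamic_batch()
print(f"final |x| = {abs(final_x):.4f} after {rounds} communication rounds")
```

In this sketch each worker spends at most 1024 gradient calls but communicates only 10 times, because the gradient noise is averaged over ever larger batches as the iterate approaches the minimizer.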


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Age of Information Optimization

Methods: Stochastic Gradient Descent