Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

Chulhee Yun; Shashank Rajput; Suvrit Sra

arXiv:2110.10342 · cs.LG · March 24, 2022 · 1 citation

TL;DR

This paper provides tight convergence bounds for shuffling-based variants of minibatch and local SGD in distributed learning, shows that they converge faster than their with-replacement counterparts, and introduces a new synchronized shuffling approach.

Contribution

The paper offers the first tight convergence analysis for shuffling-based SGD variants and introduces synchronized shuffling for improved convergence in homogeneous settings.

Findings

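01 Shuffling-based minibatch and local SGD converge faster than their with-replacement counterparts for smooth functions satisfying the Polyak-Łojasiewicz condition, in the large-epoch regime.

02 Matching lower bounds confirm the tightness of the analysis.

03 Synchronized shuffling achieves even faster convergence in homogeneous settings.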

Abstract

In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our…
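The two shuffling-based variants named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy objective (scalar quadratics 0.5 * (x - a)^2), the function names make_data, minibatch_rr and local_rr, the parameter sync_every, and the step-size choices are all assumptions made for illustration. In each epoch every worker draws a fresh permutation of its local components and visits each component exactly once (without replacement). Minibatch Random Reshuffling evaluates all gradients in a round at the shared iterate and averages the gradients; local Random Reshuffling lets each worker take sequential local steps and then averages the iterates.

```python
import random


def make_data(num_workers, n_local, seed=0):
    """Hypothetical toy data: worker m holds n_local scalars a_i^m."""
    rng = random.Random(seed)
    return [[rng.gauss(2.0, 1.0) for _ in range(n_local)]
            for _ in range(num_workers)]


def grad(x, a):
    """Gradient of the toy component f(x) = 0.5 * (x - a)^2."""
    return x - a


def minibatch_rr(data, epochs, sync_every, lr, seed=0):
    """Minibatch RR sketch: gradients at the shared iterate are averaged."""
    rng = random.Random(seed)
    n_local = len(data[0])
    x = 0.0
    for _ in range(epochs):
        # Each worker reshuffles its own indices once per epoch.
        perms = [rng.sample(range(n_local), n_local) for _ in data]
        for start in range(0, n_local, sync_every):
            grads = [grad(x, data[m][i])
                     for m, perm in enumerate(perms)
                     for i in perm[start:start + sync_every]]
            x -= lr * sum(grads) / len(grads)
    return x


def local_rr(data, epochs, sync_every, lr, seed=0):
    """Local RR sketch: workers step locally, then iterates are averaged."""
    rng = random.Random(seed)
    n_local = len(data[0])
    x = 0.0
    for _ in range(epochs):
        perms = [rng.sample(range(n_local), n_local) for _ in data]
        for start in range(0, n_local, sync_every):
            local_iterates = []
            for m, perm in enumerate(perms):
                x_m = x  # every worker starts the round from the shared iterate
                for i in perm[start:start + sync_every]:
                    x_m -= lr * grad(x_m, data[m][i])
                local_iterates.append(x_m)
            x = sum(local_iterates) / len(local_iterates)
    return x


if __name__ == "__main__":
    data = make_data(num_workers=4, n_local=8, seed=1)
    target = sum(sum(worker) for worker in data) / (4 * 8)
    print("global minimizer:", target)
    print("minibatch RR    :", minibatch_rr(data, 500, 4, 0.01))
    print("local RR        :", local_rr(data, 500, 4, 0.01))
```

On this toy problem the global minimizer is the mean of all data points, so both routines should end close to it for a small constant step size. The sketch only shows how the two update patterns differ; it does not reproduce the paper's rates or its synchronized shuffling modification.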

Videos

Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data

Methods: Local SGD · Stochastic Gradient Descent