Tighter Theory for Local SGD on Identical and Heterogeneous Data

Ahmed Khaled; Konstantin Mishchenko; Peter Richtárik

arXiv:1909.04746 · cs.LG · April 18, 2022 · 39 cites

TL;DR

This paper gives a refined theoretical analysis of local SGD in both the identical and the heterogeneous data regimes. It derives the optimal stepsize and number of local iterations, proves bounds that improve on previous results, and validates them empirically.

Contribution

The paper introduces a new notion of variance specific to local SGD with heterogeneous data, improves the theoretical bounds in both data regimes, and gives the optimal stepsize and number of local iterations.

Findings

01  Data heterogeneity significantly impacts local SGD performance.

02  New bounds are tighter and more general than previous results.

03  Optimal stepsize and local iterations are explicitly characterized.

Abstract

We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory and provide values of the optimal stepsize and optimal number of local iterations. Our bounds are based on a new notion of variance that is specific to local SGD methods with different data. The tightness of our results is guaranteed by recovering known statements when we plug in $H = 1$, where $H$ is the number of local steps. Empirical evidence further validates the severe impact of data heterogeneity on the performance of local SGD.
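
To make the role of $H$ concrete, here is a minimal sketch of local SGD on a toy problem. It is an illustration only, not the paper's experimental setup: the function name `local_sgd`, the scalar quadratic objectives $f_m(x) = \frac{1}{2} a_m (x - b_m)^2$, the additive Gaussian gradient noise, and all numeric values are assumptions chosen for readability. Each worker takes $H$ local steps from the shared iterate, then the workers average their iterates. With `noise_std=0.0`, as in the runs below, the method reduces to local gradient descent.

```python
import numpy as np


def local_sgd(a, b, H, rounds, lr, noise_std=0.0, seed=0):
    """Local SGD on toy objectives f_m(x) = 0.5 * a[m] * (x - b[m])**2.

    One entry of a and b per worker. Each round, every worker starts
    from the shared iterate, takes H local steps, and then the workers
    average their iterates (one communication per round).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    x = 0.0  # shared starting point (arbitrary choice)
    for _ in range(rounds):
        local = np.full(a.shape, x)  # all workers start from the average
        for _ in range(H):
            grad = a * (local - b)  # exact gradient of each f_m
            grad = grad + noise_std * rng.standard_normal(a.shape)  # assumed noise model
            local = local - lr * grad
        x = local.mean()  # communication step: average the local iterates
    return x


# Heterogeneous data: the two workers have different objectives.
a_het, b_het = [1.0, 4.0], [0.0, 1.0]
x_star = np.dot(a_het, b_het) / np.sum(a_het)  # minimizer of the average objective

x_h1 = local_sgd(a_het, b_het, H=1, rounds=200, lr=0.1)  # averages after every step
x_h5 = local_sgd(a_het, b_het, H=5, rounds=200, lr=0.1)  # 5 local steps per round

# Identical data: both workers share the same objective.
x_same = local_sgd([2.0, 2.0], [0.8, 0.8], H=5, rounds=200, lr=0.1)

print(x_star, x_h1, x_h5, x_same)
```

In this toy run, $H = 1$ coincides with ordinary gradient descent on the average objective and reaches `x_star`. With $H = 5$ and the same fixed stepsize, the heterogeneous run settles at a point visibly away from `x_star`, while the identical-data run still reaches its minimizer. This only illustrates, in a deliberately simple setting, the kind of heterogeneity effect the abstract refers to.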

Taxonomy

Topics: Machine Learning and Algorithms · Advanced Bandit Algorithms Research · Domain Adaptation and Few-Shot Learning

Methods: Local SGD · Stochastic Gradient Descent