Minibatch vs Local SGD for Heterogeneous Distributed Learning

Blake Woodworth; Kumar Kshitij Patel; Nathan Srebro

arXiv:2006.04735 · cs.LG · March 2, 2022 · 45 cites

TL;DR

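This paper compares Minibatch SGD and Local SGD for heterogeneous distributed learning, where each machine optimizes a different convex objective and machines communicate only intermittently. It argues that Minibatch SGD dominates all existing analyses of Local SGD in this setting and that accelerated Minibatch SGD is optimal when heterogeneity is high, and it gives a new upper bound for Local SGD.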

Contribution

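The paper argues that Minibatch SGD, even without acceleration, dominates all existing analyses of Local SGD in the heterogeneous setting, and it presents the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.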

Findings

01. Minibatch SGD, even without acceleration, dominates all existing analyses of Local SGD in the heterogeneous setting

02. Accelerated Minibatch SGD is optimal when heterogeneity is high

03. A new upper bound for Local SGD improves over Minibatch SGD in a non-homogeneous regime

Abstract

We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that (i) Minibatch SGD (even without acceleration) dominates all existing analyses of Local SGD in this setting, and (ii) accelerated Minibatch SGD is optimal when the heterogeneity is high; and (iii) we present the first upper bound for Local SGD that improves over Minibatch SGD in a non-homogeneous regime.
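To make the two communication patterns in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes a hypothetical toy problem in which machine m holds the quadratic objective F_m(x) = 0.5 * ||x - b_m||^2, so the average objective is minimized at the mean of the b_m. All function names, the noise model, and the step sizes are illustrative assumptions. Both methods use the same budget per machine: `rounds` communications and `local_steps` stochastic gradients per round.

```python
import numpy as np

def make_problem(num_machines=4, dim=5, heterogeneity=1.0, seed=0):
    """Hypothetical toy setup: machine m holds F_m(x) = 0.5 * ||x - b_m||^2."""
    rng = np.random.default_rng(seed)
    # Larger `heterogeneity` spreads the per-machine minimizers b_m further apart.
    return heterogeneity * rng.normal(size=(num_machines, dim))

def stochastic_grad(x, center, rng, noise_std):
    # Exact gradient of 0.5 * ||x - center||^2 plus Gaussian noise (assumed noise model).
    return (x - center) + noise_std * rng.normal(size=x.shape)

def minibatch_sgd(centers, rounds, local_steps, lr, noise_std=0.1, seed=1):
    rng = np.random.default_rng(seed)
    num_machines, dim = centers.shape
    x = np.zeros(dim)
    for _ in range(rounds):
        # Every machine draws `local_steps` gradients at the SAME shared point x;
        # one step is taken per round on the average of all of them.
        grads = [stochastic_grad(x, centers[m], rng, noise_std)
                 for m in range(num_machines) for _ in range(local_steps)]
        x = x - lr * np.mean(grads, axis=0)
    return x

def local_sgd(centers, rounds, local_steps, lr, noise_std=0.1, seed=1):
    rng = np.random.default_rng(seed)
    num_machines, dim = centers.shape
    x = np.zeros(dim)
    for _ in range(rounds):
        # Each machine starts from the shared point and runs its own SGD steps
        # on its own objective without communicating.
        local = np.tile(x, (num_machines, 1))
        for _ in range(local_steps):
            for m in range(num_machines):
                local[m] -= lr * stochastic_grad(local[m], centers[m], rng, noise_std)
        # Communication: average the local iterates.
        x = local.mean(axis=0)
    return x

centers = make_problem()
target = centers.mean(axis=0)  # minimizer of the average objective
x_mb = minibatch_sgd(centers, rounds=50, local_steps=5, lr=0.5)
x_loc = local_sgd(centers, rounds=50, local_steps=5, lr=0.1)
print("minibatch error:", np.linalg.norm(x_mb - target))
print("local SGD error:", np.linalg.norm(x_loc - target))
```

Because every machine in this toy shares the same curvature, the averaged Local SGD iterate does not drift away from the average objective's minimizer, so the sketch illustrates only the difference in communication structure. It is not meant to reproduce the paper's bounds or the regimes in which one method beats the other.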

Videos

Minibatch vs Local SGD for Heterogeneous Distributed Learning · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques

Methods: Local SGD · Stochastic Gradient Descent