STL-SGD: Speeding Up Local SGD with Stagewise Communication Period
Shuheng Shen, Yifei Cheng, Jingchang Liu, Linli Xu

TL;DR
This paper introduces STL-SGD, a stagewise Local SGD method that lengthens the communication period from stage to stage as the learning rate decreases, in order to accelerate convergence and reduce communication complexity in distributed machine learning.
Contribution
STL-SGD gradually increases the communication period as the learning rate decreases, achieving faster convergence and lower communication complexity than existing Local SGD methods.
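The stagewise idea can be illustrated with a toy simulation. The sketch below is not the authors' implementation: the objective (a noisy one-dimensional quadratic), the function name `stl_sgd`, all default values, and the stage update rule (halve the learning rate, double the communication period and the stage length) are assumptions chosen only to show the mechanism.

```python
import random

def stl_sgd(num_clients=4, num_stages=5, eta0=0.2, period0=1, stage_len0=8, seed=0):
    """Toy stagewise Local SGD on f(x) = 0.5 * x**2 with noisy gradients.

    Returns the final averaged iterate and the number of communication rounds.
    """
    rng = random.Random(seed)
    x = [5.0] * num_clients            # every client starts from the same point
    eta, period, stage_len = eta0, period0, stage_len0
    rounds = 0
    for _ in range(num_stages):
        for t in range(1, stage_len + 1):
            # Local step: the gradient of 0.5 * x**2 is x, plus zero-mean noise.
            x = [xi - eta * (xi + rng.gauss(0.0, 0.1)) for xi in x]
            if t % period == 0:
                # Communication round: replace local models by their average.
                avg = sum(x) / num_clients
                x = [avg] * num_clients
                rounds += 1
        # Assumed stage update: smaller steps, less frequent averaging, longer stage.
        eta, period, stage_len = eta / 2, period * 2, stage_len * 2
    return sum(x) / num_clients, rounds
```

With the defaults, the five stages run 8 + 16 + 32 + 64 + 128 = 248 local iterations per client but only 40 averaging rounds, because each stage doubles both its length and its communication period.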
Findings
STL-SGD maintains the same convergence rate as mini-batch SGD.
It achieves $O(N \, \log T)$ communication complexity for strongly convex objectives.
Experiments show STL-SGD outperforms traditional Local SGD on convex and non-convex tasks.
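To see where a logarithmic number of communication rounds can come from, the following sketch counts averaging rounds under two schedules. The helper names and the doubling schedule (period and stage length both double after every stage) are illustrative assumptions and may differ from the exact schedule analyzed in the paper.

```python
def rounds_fixed(total_iters, period):
    """Communication rounds when clients average every `period` iterations."""
    return total_iters // period

def rounds_stagewise(total_iters, period0=1, stage_len0=8):
    """Rounds when period and stage length both double after each stage (assumed)."""
    rounds, done = 0, 0
    period, stage_len = period0, stage_len0
    while done < total_iters:
        steps = min(stage_len, total_iters - done)  # last stage may be cut short
        rounds += steps // period
        done += steps
        period, stage_len = period * 2, stage_len * 2
    return rounds
```

Under this assumed schedule every full stage costs the same number of rounds, so the total grows with the number of stages, which is logarithmic in the iteration count $T$. For example, `rounds_stagewise(8184)` gives 80 rounds, while `rounds_fixed(8184, 4)` gives 2046. Multiplying the round count by the number of clients $N$ gives a total of the same form as the $O(N \log T)$ bound quoted above.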
Abstract
Distributed parallel stochastic gradient descent algorithms are workhorses for large scale machine learning tasks. Among them, local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity. Previous studies prove that the communication complexity of Local SGD with a fixed or an adaptive communication period is in the order of $O(N^{3/2} T^{1/2})$ and $O(N^{3/4} T^{3/4})$ when the data distributions on clients are identical (IID) or otherwise (Non-IID), where $N$ is the number of clients and $T$ is the number of iterations. In this paper, to accelerate the convergence by reducing the communication complexity, we propose STagewise Local SGD (STL-SGD), which increases the communication period gradually along with decreasing learning rate. We prove that STL-SGD can keep…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Advanced Neural Network Applications
Methods: Local SGD · Stochastic Gradient Descent
