Local SGD Accelerates Convergence by Exploiting Second Order Information of the Loss Function
Linxuan Pan, Shenghui Song

TL;DR
This paper demonstrates that local SGD accelerates convergence by leveraging second-order information of the loss function, explaining its effectiveness in distributed learning and its potential to approach Newton's method.
Contribution
The paper provides a theoretical analysis showing how local SGD exploits second-order information, a mechanism that was not previously well understood.
Findings
L-SGD exploits second-order information of the loss function.
Compared with SGD, L-SGD updates have larger projections on the Hessian eigenvectors with small eigenvalues.
L-SGD can approach Newton's method under certain conditions.
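The findings above can be illustrated on a toy problem. The sketch below is not the paper's analysis: it assumes a noiseless quadratic loss f(w) = 0.5 wᵀHw with a diagonal Hessian, replaces stochastic gradients with exact ones, and uses hypothetical names and values throughout (`local_gd_update`, `single_step_update`, `eigvals`, `lr`, `K`). Under those assumptions, K small local steps move the weights by -(I - (I - ηH)^K) w0, while one step with learning rate Kη moves them by -KηH w0, so the two update directions can be compared against the Newton step -H⁻¹∇f(w0) = -w0.

```python
import numpy as np

def local_gd_update(H, w0, lr, K):
    """Net displacement after K gradient steps on f(w) = 0.5 * w^T H w."""
    w = w0.copy()
    for _ in range(K):
        w = w - lr * (H @ w)  # the gradient of the quadratic is H w
    return w - w0

def single_step_update(H, w0, lr, K):
    """Displacement of one gradient step whose learning rate is scaled by K."""
    return -(K * lr) * (H @ w0)

def unit(v):
    """Normalize a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

# Hypothetical toy setup: diagonal Hessian, so the eigenvectors are the axes.
eigvals = np.array([1.0, 0.1, 0.01])
H = np.diag(eigvals)
w0 = np.ones(3)
lr, K = 0.1, 20

d_local = local_gd_update(H, w0, lr, K)
d_single = single_step_update(H, w0, lr, K)
newton = -np.linalg.solve(H, H @ w0)  # Newton step, equal to -w0 here

# Normalized directions: compare the weight on the small-eigenvalue axes.
print("local steps :", np.round(unit(d_local), 4))
print("single step :", np.round(unit(d_single), 4))
print("Newton step :", np.round(unit(newton), 4))

# Cosine similarity of each update direction with the Newton direction.
cos_local = float(unit(d_local) @ unit(newton))
cos_single = float(unit(d_single) @ unit(newton))
print("cosine with Newton (local, single):", cos_local, cos_single)

# With many local steps the net displacement approaches the Newton step.
d_local_many = local_gd_update(H, w0, lr, 5000)
```

In this toy setting the multi-step direction puts relatively more weight on the axes with small eigenvalues and lies closer to the Newton direction than the single large step does. Whether the same holds with mini-batch noise and non-quadratic losses is what the paper's analysis addresses; the sketch does not show that.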
Abstract
With multiple iterations of updates, local stochastic gradient descent (L-SGD) has been proven to be very effective in distributed machine learning schemes such as federated learning. In fact, many innovative works have shown that L-SGD with independent and identically distributed (IID) data can even outperform SGD. As a result, extensive efforts have been made to unveil the power of L-SGD. However, existing analyses fail to explain why the multiple local updates with small mini-batches of data (L-SGD) cannot be replaced by a single update with one big batch of data and a larger learning rate (SGD). In this paper, we offer a new perspective to understand the strength of L-SGD. We theoretically prove that, with IID data, L-SGD can effectively explore the second-order information of the loss function. In particular, compared with SGD, the updates of L-SGD have much larger projection on the…
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
Methods: Stochastic Gradient Descent
