Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates
Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, Haibin Lin

TL;DR
This paper introduces Local AdaAlter, a communication-efficient SGD variant with adaptive learning rates. The method is proven to converge on smooth non-convex problems and reduces training time by up to 30% on the 1B word dataset.
Contribution
The paper presents a novel SGD variant that reduces communication overhead while using adaptive learning rates, and proves its convergence for smooth non-convex problems.
Findings
Reduces communication overhead in distributed training
Achieves up to 30% reduction in training time on the 1B word dataset
Proven convergence for smooth non-convex optimization
Abstract
When scaling distributed training, the communication overhead is often the bottleneck. In this paper, we propose a novel SGD variant with reduced communication and adaptive learning rates. We prove the convergence of the proposed algorithm for smooth but non-convex problems. Empirical results show that the proposed algorithm significantly reduces the communication overhead, which, in turn, reduces the training time by up to 30% for the 1B word dataset.
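The idea in the abstract, fewer communication rounds combined with adaptive step sizes, can be illustrated with a toy simulation. The sketch below is not the authors' algorithm: it uses a plain AdaGrad-style accumulator as a stand-in for the paper's update rule, simulates the workers in a single process, and optimizes a simple quadratic. The function name `local_adaptive_sgd`, its parameters (`sync_every`, `lr`, `eps`), and the choice to average both parameters and accumulators at each synchronization are assumptions made for illustration.

```python
import random


def local_adaptive_sgd(num_workers=4, steps=200, sync_every=4,
                       lr=0.5, eps=1.0, seed=0):
    """Toy local-update SGD with AdaGrad-style step sizes (illustrative only).

    Each simulated worker takes `sync_every` local steps on a noisy quadratic,
    then all workers average their parameters and accumulators.
    """
    rng = random.Random(seed)
    target = [1.0, -2.0, 3.0]          # minimizer of the toy objective
    dim = len(target)

    # All workers start from the same point with empty accumulators.
    params = [[0.0] * dim for _ in range(num_workers)]
    accum = [[0.0] * dim for _ in range(num_workers)]
    sync_rounds = 0

    for t in range(1, steps + 1):
        # Local step: no communication between workers here.
        for w in range(num_workers):
            for j in range(dim):
                # Noisy gradient of 0.5 * (x_j - target_j)^2.
                grad = (params[w][j] - target[j]) + rng.gauss(0.0, 0.1)
                accum[w][j] += grad * grad
                params[w][j] -= lr * grad / (accum[w][j] + eps) ** 0.5

        # Synchronize only every `sync_every` steps (assumed averaging scheme).
        if t % sync_every == 0:
            for j in range(dim):
                mean_p = sum(p[j] for p in params) / num_workers
                mean_a = sum(a[j] for a in accum) / num_workers
                for w in range(num_workers):
                    params[w][j] = mean_p
                    accum[w][j] = mean_a
            sync_rounds += 1

    return params[0], sync_rounds, target


if __name__ == "__main__":
    x, rounds, target = local_adaptive_sgd()
    print("final parameters:", x)
    print("synchronization rounds:", rounds)
```

With the default settings the workers communicate once every 4 steps, so 200 steps need 50 synchronization rounds rather than the 200 that fully synchronous SGD would use. Whether such savings translate into wall-clock gains depends on the actual network and model, which this toy does not capture.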
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Machine Learning and ELM
Methods: Stochastic Gradient Descent
