Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

Cong Xie; Oluwasanmi Koyejo; Indranil Gupta; Haibin Lin

arXiv:1911.09030 · cs.LG · December 8, 2020 · 21 citations

TL;DR

This paper introduces Local AdaAlter, a communication-efficient stochastic gradient descent method with adaptive learning rates. The method is proven to converge for smooth non-convex problems and reduces training time by up to 30% on the 1B word dataset.

Contribution

It presents a novel SGD variant that reduces communication overhead and adapts learning rates, with proven convergence for smooth non-convex problems.

Findings

01. Reduces communication overhead in distributed training
02. Reduces training time by up to 30% on the 1B word dataset
03. Proven convergence for smooth non-convex optimization

Abstract

When scaling distributed training, the communication overhead is often the bottleneck. In this paper, we propose a novel SGD variant with reduced communication and adaptive learning rates. We prove the convergence of the proposed algorithm for smooth but non-convex problems. Empirical results show that the proposed algorithm significantly reduces the communication overhead, which, in turn, reduces the training time by up to 30% for the 1B word dataset.
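
The general idea in the abstract (fewer communication rounds combined with adaptive learning rates) can be illustrated with a minimal sketch. This is not the paper's exact update rule: it assumes a generic scheme in which each worker takes several local AdaGrad-style steps and the workers then average both their parameters and their squared-gradient accumulators. All names (`local_adaptive_sgd`, `noisy_quadratic_grad`, `sync_period`, `eps`) and the toy quadratic objective are hypothetical and chosen only for illustration.

```python
import random


def noisy_quadratic_grad(x, rng, noise=0.01):
    """Stochastic gradient of the toy objective f(x) = 0.5 * ||x||^2 (assumed)."""
    return [xi + rng.gauss(0.0, noise) for xi in x]


def local_adaptive_sgd(grad_fn, x0, num_workers=4, sync_period=4,
                       rounds=50, lr=0.5, eps=1.0, seed=0):
    """Illustrative only: local AdaGrad-style steps with periodic averaging.

    Each worker runs `sync_period` local steps between communication rounds,
    so the number of synchronizations is `rounds` rather than
    `rounds * sync_period`.
    """
    rng = random.Random(seed)
    dim = len(x0)
    x = list(x0)          # synchronized parameters
    acc = [0.0] * dim     # synchronized accumulator of squared gradients
    sync_count = 0
    for _ in range(rounds):
        # Every worker starts the round from the synchronized state.
        local_x = [list(x) for _ in range(num_workers)]
        local_acc = [list(acc) for _ in range(num_workers)]
        for w in range(num_workers):
            for _ in range(sync_period):
                g = grad_fn(local_x[w], rng)
                for j in range(dim):
                    local_acc[w][j] += g[j] * g[j]
                    step = lr / (local_acc[w][j] + eps * eps) ** 0.5
                    local_x[w][j] -= step * g[j]
        # One communication round: average parameters and accumulators.
        x = [sum(lx[j] for lx in local_x) / num_workers for j in range(dim)]
        acc = [sum(la[j] for la in local_acc) / num_workers for j in range(dim)]
        sync_count += 1
    return x, sync_count


if __name__ == "__main__":
    x_final, syncs = local_adaptive_sgd(noisy_quadratic_grad, [1.0, -2.0])
    print("final parameters:", x_final)
    print("communication rounds:", syncs, "for", syncs * 4, "local steps per worker")
```

With the defaults above, each worker performs 200 local steps but the workers communicate only 50 times, which is the kind of saving the abstract refers to. How the paper's algorithm orders the accumulator and parameter updates is not stated in the abstract, so the ordering used here is an assumption.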

Code & Models

Repositories

xcgoner/AISTATS2020-AdaAlter-GluonNLP (MXNet, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Machine Learning and ELM

Methods: Stochastic Gradient Descent