CSER: Communication-efficient SGD with Error Reset

Cong Xie; Shuai Zheng; Oluwasanmi Koyejo; Indranil Gupta; Mu Li; Haibin Lin

arXiv:2007.13221 · cs.LG · December 8, 2020 · 1 citation

TL;DR

CSER is a communication-efficient SGD method that uses error reset and partial synchronization to reduce training time in distributed learning, with reported speedups of nearly 10x on CIFAR-100 and 4.5x on ImageNet.

Contribution

Introduces error reset and partial synchronization to improve communication efficiency in distributed SGD, with a convergence proof for smooth non-convex problems.

Findings

01 Accelerates distributed training by nearly 10x on CIFAR-100

02 Accelerates distributed training by 4.5x on ImageNet

03 Proves convergence for smooth non-convex problems

Abstract

The scalability of distributed stochastic gradient descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. CSER rests on two ideas. First, a new technique called "error reset" adapts arbitrary compressors for SGD, producing bifurcated local models with periodic reset of the resulting local residual errors. Second, partial synchronization is applied to both the gradients and the models, leveraging the advantages of each. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that, when combined with highly aggressive compressors, the CSER algorithms accelerate distributed training by nearly 10x for CIFAR-100 and by 4.5x for ImageNet.
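To make the two ideas in the abstract concrete, the following is a minimal toy sketch, not the algorithm from the paper. It assumes a top-k sparsifier as the compressor, a simple quadratic loss per worker, and an uncompressed averaging of the residual errors at each reset. The names `topk_compress`, `run`, `reset_every`, and `targets` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def topk_compress(v: Vector, k: int) -> Vector:
    """Keep the k largest-magnitude entries of v and zero the rest."""
    keep = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]
    out = [0.0] * len(v)
    for i in keep:
        out[i] = v[i]
    return out


def mean(vectors: List[Vector]) -> Vector:
    """Coordinate-wise average, standing in for an all-reduce."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def run(targets: List[Vector], steps: int, reset_every: int, k: int,
        lr: float = 0.1) -> Tuple[List[Vector], List[Vector]]:
    """Simulate workers, where worker i minimizes 0.5 * ||x - targets[i]||^2.

    Assumption (illustrative only): the compressed part of each gradient is
    averaged across workers, while the uncompressed remainder is applied
    locally and tracked in a per-worker residual error.
    """
    dim = len(targets[0])
    models = [[0.0] * dim for _ in targets]
    errors = [[0.0] * dim for _ in targets]
    for t in range(1, steps + 1):
        grads = [[x - c for x, c in zip(m, tgt)]
                 for m, tgt in zip(models, targets)]
        sent = [topk_compress(g, k) for g in grads]  # what is communicated
        shared = mean(sent)                          # partial gradient sync
        for m, e, g, s in zip(models, errors, grads, sent):
            for j in range(dim):
                residual = g[j] - s[j]               # dropped by the compressor
                m[j] -= lr * (shared[j] + residual)  # local models bifurcate
                e[j] -= lr * residual                # track local deviation
        if t % reset_every == 0:
            # Error reset: replace each local deviation by the average one,
            # then clear the residuals so all local models agree again.
            avg_err = mean(errors)
            for m, e in zip(models, errors):
                for j in range(dim):
                    m[j] += avg_err[j] - e[j]
                    e[j] = 0.0
    return models, errors
```

In this sketch each local model equals a shared component plus its own residual error, so the models drift apart between resets and coincide again immediately after one. Calling `run([[1.0, 2.0, 3.0], [3.0, 0.0, -1.0]], steps=8, reset_every=4, k=1)` sends one coordinate per worker per step; a smaller `k` or larger `reset_every` trades communication for more local drift.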

Videos

CSER: Communication-efficient SGD with Error Reset · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data

Methods: Stochastic Gradient Descent