signSGD: Compressed Optimisation for Non-Convex Problems

Jeremy Bernstein; Yu-Xiang Wang; Kamyar Azizzadenesheli; Anima Anandkumar

arXiv:1802.04434·cs.LG·August 9, 2018·88 cites

TL;DR

signSGD is a gradient compression method that transmits only the sign of each minibatch stochastic gradient. The paper proves that it keeps an SGD-level convergence rate despite the compression, and reports that its momentum counterpart matches Adam on deep ImageNet models.
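As a rough illustration of the update described above, the following minimal sketch applies a sign-only step to a toy quadratic. The function names, the learning rate, the toy objective, and the convention that `sign(0)` is 0 are assumptions made for this example and are not taken from the paper.

```python
def sign(x: float) -> int:
    """Return +1, -1, or 0 for a positive, negative, or zero input."""
    return (x > 0) - (x < 0)


def signsgd_step(params, grad, lr):
    """Move each parameter by a fixed amount lr against the sign of its gradient."""
    return [p - lr * sign(g) for p, g in zip(params, grad)]


# Toy objective (an assumption for this sketch): f(x) = 0.5 * sum(x_i ** 2),
# whose gradient is x itself. A real run would use a minibatch gradient here.
params = [3.0, -2.0, 0.5]
for _ in range(10):
    grad = list(params)
    params = signsgd_step(params, grad, lr=0.25)

print(params)
```

Only the sign of each gradient component is used, so every coordinate moves by the same fixed amount `lr` per step regardless of the gradient's magnitude.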

Contribution

This paper introduces signSGD with theoretical convergence guarantees and extends it to distributed settings using majority vote for 1-bit communication.

Findings

01. signSGD matches SGD convergence rates

02. Momentum signSGD achieves Adam-level accuracy

03. Majority vote enables 1-bit gradient compression in distributed training
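A momentum variant can be sketched as follows, assuming the usual exponential-moving-average form of momentum with the sign taken of the averaged gradient. The function name `signum_step`, the `beta` default, and the sample values are hypothetical choices for this example; the page itself only states that a momentum counterpart exists.

```python
def sign(x: float) -> int:
    """Return +1, -1, or 0 for a positive, negative, or zero input."""
    return (x > 0) - (x < 0)


def signum_step(params, momentum, grad, lr, beta=0.9):
    """One momentum step: average the gradients, then step along the sign of the average."""
    # Assumed form: m <- beta * m + (1 - beta) * g
    new_momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, grad)]
    # Parameters move by lr against the sign of the momentum, not the raw gradient.
    new_params = [p - lr * sign(m) for p, m in zip(params, new_momentum)]
    return new_params, new_momentum


# A single noisy gradient with the "wrong" sign does not flip the step direction,
# because the accumulated momentum is still positive.
params, momentum = signum_step([1.0], [1.0], [-0.5], lr=0.5)
print(params, momentum)
```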

Abstract

Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative $\ell_1/\ell_2$ geometry of gradients, noise and curvature informs whether signSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker, enabling 1-bit compression of worker-server communication in both directions.…
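The majority-vote aggregation described in the abstract can be sketched as below. Each worker sends only the signs of its gradient, the server sums them per coordinate and sends back the sign of the sum. The function names, the sample gradients, and the handling of ties (a tied vote yields 0, so that coordinate is not updated) are assumptions for this illustration.

```python
def sign(x: float) -> int:
    """Return +1, -1, or 0 for a positive, negative, or zero input."""
    return (x > 0) - (x < 0)


def majority_vote(worker_grads):
    """Aggregate per-worker gradients by voting on the sign of each coordinate."""
    # Worker to server: each worker transmits only the sign of each component.
    worker_signs = [[sign(g) for g in grad] for grad in worker_grads]
    # Server to workers: the sign of the per-coordinate sum of the votes.
    return [sign(sum(votes)) for votes in zip(*worker_signs)]


def distributed_step(params, worker_grads, lr):
    """Apply the voted sign to every parameter with a fixed step size lr."""
    vote = majority_vote(worker_grads)
    return [p - lr * v for p, v in zip(params, vote)]


# Three hypothetical workers, two parameters. The third worker's large negative
# first component is outvoted, since magnitudes are discarded before voting.
worker_grads = [[0.2, -1.0], [5.0, -0.1], [-9.0, -3.0]]
print(majority_vote(worker_grads))
print(distributed_step([0.0, 0.0], worker_grads, lr=0.5))
```

Both directions of communication carry one sign per coordinate, which is what allows 1-bit compression between workers and server.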

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Adam · Stochastic Gradient Descent