Sparsified SGD with Memory

Sebastian U. Stich; Jean-Baptiste Cordonnier; Martin Jaggi

arXiv:1809.07599 · cs.LG · November 30, 2018 · 233 citations

TL;DR

This paper demonstrates that sparsified SGD with memory, using techniques like top-k sparsification, can significantly reduce communication overhead in distributed training without sacrificing convergence speed, supported by theoretical analysis and experiments.

Contribution

It provides the first rigorous convergence analysis of sparsified SGD with memory, showing it matches vanilla SGD's rate while reducing communication.

Findings

01 When equipped with error compensation, sparsified SGD converges at the same rate as vanilla SGD.

02 Communication can be reduced by a factor of the problem dimension.

03 Numerical experiments confirm the scalability benefits predicted by the theory.

Abstract

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in…
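To make the idea of top-k sparsification with error compensation concrete, here is a minimal single-worker sketch in Python with NumPy. The abstract above does not spell out the update rule, so the code assumes the usual error-feedback form: the part of each scaled gradient that is not sent is stored in a memory vector and added back before the next compression. The names `top_k` and `mem_sgd`, the `grad_fn` callback, and the constant step size `lr` are illustrative assumptions, not taken from the authors' repository.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    k = min(k, v.size)
    if k <= 0:
        return out
    # argpartition places the indices of the k largest |v_i| in the last k slots
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out


def mem_sgd(grad_fn, x0, lr, k, steps):
    """SGD with top-k sparsified updates and an error memory (assumed form).

    grad_fn(x) returns a (stochastic) gradient at x; it is a hypothetical
    callback supplied by the caller.
    """
    x = np.array(x0, dtype=float)   # work on a float copy of the start point
    memory = np.zeros_like(x)       # accumulated, not-yet-applied update mass
    for _ in range(steps):
        update = memory + lr * grad_fn(x)   # add back what was dropped earlier
        sparse = top_k(update, k)           # only these k entries are "sent"
        x -= sparse                         # apply the sparse update
        memory = update - sparse            # remember the suppressed remainder
    return x, memory
```

Because every suppressed entry stays in `memory`, no gradient information is discarded outright; it is only delayed. With `k` set to the full dimension, `top_k` returns its input unchanged, `memory` stays zero, and the loop reduces to plain SGD with step size `lr`.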

Code & Models

Repositories

epfml/sparsifiedSGD (official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Distributed Sensor Networks and Detection Algorithms

Methods: Stochastic Gradient Descent