Sparsified SGD with Memory
Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi

TL;DR
This paper demonstrates that sparsified SGD with memory, using techniques like top-k sparsification, can significantly reduce communication overhead in distributed training without sacrificing convergence speed, supported by theoretical analysis and experiments.
Contribution
It provides the first rigorous convergence analysis of sparsified SGD with memory, showing it matches vanilla SGD's rate while reducing communication.
Findings
Convergence rate matches vanilla SGD with error compensation.
Communication can be reduced by a factor of the problem dimension.
Numerical experiments confirm theoretical scalability benefits.
Abstract
Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Distributed Sensor Networks and Detection Algorithms
MethodsStochastic Gradient Descent
