On Biased Compression for Distributed Learning
Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan

TL;DR
This paper investigates biased compression techniques in distributed learning, demonstrating their potential for linear convergence and superior practical performance over unbiased methods, supported by theoretical analysis and new compressor designs.
Contribution
It studies three classes of biased compressors, two of which are new, proves that they can yield linear convergence in both single-node and distributed (stochastic) gradient descent, and proposes new compressors with theoretical guarantees and strong practical performance.
Findings
Biased compressors can achieve linear convergence in both the single-node and distributed settings.
Combining distributed compressed SGD with an error feedback mechanism yields a provable ergodic convergence rate for biased compressors.
New biased compressors show promising theoretical guarantees and practical performance.
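To make "biased compressor" concrete, here is a minimal sketch of Top-k sparsification, a widely used biased compressor. It is an illustration only, not code from the paper: the function names `top_k` and `sq_norm` and the example vector are assumptions. Top-k is deterministic, so its output generally differs from its input in expectation (hence "biased"), yet it is contractive: the squared error is at most (1 - k/d) times the squared norm of the input, where d is the dimension.

```python
def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    # Indices sorted by decreasing absolute value; ties keep index order.
    keep = set(sorted(range(len(x)), key=lambda i: -abs(x[i]))[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(v * v for v in x)


# Hypothetical example vector (d = 4), compressed to k = 2 entries.
x = [0.5, -3.0, 1.0, 2.0]
cx = top_k(x, k=2)

# Top-k is deterministic, so E[C(x)] = C(x), which differs from x: biased.
err = sq_norm([a - b for a, b in zip(x, cx)])

# Contraction property of Top-k: ||C(x) - x||^2 <= (1 - k/d) * ||x||^2.
bound = (1 - 2 / len(x)) * sq_norm(x)
```

Only the k retained entries (plus their indices) need to be communicated, which is where the bandwidth saving comes from.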
Abstract
In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. We prove that the distributed compressed SGD method, employed with an error feedback mechanism, enjoys the ergodic rate $O\left( \delta L \exp \left[-\frac{\mu K}{\delta L}\right] +…
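The error feedback idea mentioned in the abstract can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not the paper's exact algorithm or experimental setup: it uses full (non-stochastic) gradients, a Top-k compressor, and hypothetical worker objectives f_i(x) = 0.5 * a_i * ||x - b||^2 that all share the minimizer b, so every local gradient vanishes at the optimum. The names `ef_compressed_gd`, `curvatures`, `gamma`, and `steps` are made up for this example. Each worker adds its accumulated compression error to the scaled gradient, compresses the sum, sends only the compressed vector, and stores what was left out for the next round.

```python
import math


def top_k(x, k):
    """Biased Top-k compressor: keep the k largest-magnitude entries."""
    keep = set(sorted(range(len(x)), key=lambda i: -abs(x[i]))[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def ef_compressed_gd(curvatures, b, gamma=0.02, steps=1000, k=1):
    """Distributed gradient descent with Top-k compression and error feedback.

    Worker i holds the toy objective f_i(x) = 0.5 * a_i * ||x - b||^2
    (a hypothetical setup in which all workers share the minimizer b).
    """
    n, d = len(curvatures), len(b)
    x = [0.0] * d
    errors = [[0.0] * d for _ in range(n)]  # per-worker accumulated error
    for _ in range(steps):
        update = [0.0] * d
        for i, a in enumerate(curvatures):
            grad = [a * (xj - bj) for xj, bj in zip(x, b)]
            # Add the error left over from earlier rounds before compressing.
            p = [e + gamma * g for e, g in zip(errors[i], grad)]
            c = top_k(p, k)  # the only vector that is communicated
            # Remember what compression dropped so it can be sent later.
            errors[i] = [pj - cj for pj, cj in zip(p, c)]
            update = [u + cj / n for u, cj in zip(update, c)]
        # The server applies the average of the compressed messages.
        x = [xj - u for xj, u in zip(x, update)]
    return x


b = [1.0, -2.0, 3.0]
x_final = ef_compressed_gd(curvatures=[1.0, 2.0], b=b)
dist = math.sqrt(sum((xj - bj) ** 2 for xj, bj in zip(x_final, b)))
```

On this toy problem `dist` should end up close to zero even though each worker transmits a single coordinate per round. With stochastic gradients or workers whose local minimizers differ, a constant step size would not be expected to give exact convergence, which is the regime the additional terms in the stated rate account for.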
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent
