Distributed Learning with Compressed Gradient Differences

Konstantin Mishchenko; Eduard Gorbunov; Martin Takáč; Peter Richtárik

arXiv:1901.09269 · cs.LG · December 29, 2023 · 108 cites

TL;DR

This paper introduces DIANA, a distributed learning method that compresses gradient differences rather than the gradients themselves, which allows it to converge to the true optimum in the batch mode. The theoretical analysis covers the strongly convex and nonconvex settings and shows rates superior to existing rates.

Contribution

The paper proposes DIANA, a gradient difference compression technique for distributed learning, together with a theoretical analysis and improved convergence rates.

Findings

01

DIANA's convergence rates are superior to existing rates in the strongly convex and nonconvex settings.

02

Theoretical convergence rates are established for DIANA and TernGrad.

03

Analysis of quantization schemes enhances understanding of compression effects.

Abstract

Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which renders them incapable of converging to the true optimum in the batch mode. In this work we propose a new distributed learning method -- DIANA -- which resolves this issue via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are superior to existing rates. We also provide theory to support non-smooth regularizers study…
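
The idea of compressing gradient differences can be illustrated with a minimal sketch. It is not the authors' full algorithm: it assumes a toy quadratic objective per worker, a simple unbiased random sparsifier standing in for quantization, and hypothetical names (`sparsify`, `run`, the memory vectors `h`, and the step sizes `alpha` and `gamma`). Each worker keeps a memory vector, compresses the difference between its gradient and that memory, and then nudges the memory toward the gradient. In this toy setting the compressed differences shrink as the memory approaches the local gradient at the optimum, whereas the compressed gradients themselves do not, because the local gradients at the optimum are nonzero.

```python
import math
import random


def sparsify(vec, p, rng):
    """Unbiased random sparsifier: keep each entry with probability p, rescale by 1/p."""
    return [v / p if rng.random() < p else 0.0 for v in vec]


def run(use_differences, steps=500, n=4, d=10, p=0.5, gamma=0.2, seed=0):
    """Return the final distance to the optimum on a toy problem (illustrative only)."""
    rng = random.Random(seed)
    # Assumed toy objective: worker i holds f_i(x) = 0.5 * ||x - b_i||^2,
    # so grad f_i(x) = x - b_i and the minimizer of the average is mean(b_i).
    b = [[rng.uniform(-5, 5) for _ in range(d)] for _ in range(n)]
    x_star = [sum(b[i][j] for i in range(n)) / n for j in range(d)]
    x = [0.0] * d
    h = [[0.0] * d for _ in range(n)]  # per-worker gradient memory
    alpha = p  # memory step size; an assumed choice for this sparsifier
    for _ in range(steps):
        g_hat = [0.0] * d  # gradient estimate assembled from compressed messages
        for i in range(n):
            grad = [x[j] - b[i][j] for j in range(d)]
            if use_differences:
                # Compress the difference between the gradient and the memory.
                delta = [grad[j] - h[i][j] for j in range(d)]
                q = sparsify(delta, p, rng)
                for j in range(d):
                    g_hat[j] += (h[i][j] + q[j]) / n  # uses the old memory
                    h[i][j] += alpha * q[j]           # memory learns the gradient
            else:
                # Baseline: compress the gradient directly.
                q = sparsify(grad, p, rng)
                for j in range(d):
                    g_hat[j] += q[j] / n
        x = [x[j] - gamma * g_hat[j] for j in range(d)]
    return math.sqrt(sum((x[j] - x_star[j]) ** 2 for j in range(d)))


if __name__ == "__main__":
    print("compressed differences:", run(True))
    print("compressed gradients:  ", run(False))
```

With full (batch) gradients as above, the difference-compressing run should end far closer to the optimum than the baseline, whose compression noise does not vanish. The particular step sizes are conservative choices for this toy problem, not values taken from the paper.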

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Distributed Sensor Networks and Detection Algorithms