99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it
Konstantin Mishchenko, Filip Hanzely, Peter Richtárik

TL;DR
This paper identifies inefficiencies in distributed optimization communication and proposes a sparsification method that significantly reduces data transfer without affecting convergence, improving scalability.
Contribution
The authors introduce a novel update-sparsification technique for distributed optimization that maintains theoretical convergence rates while drastically reducing communication overhead.
Findings
Communication cost is reduced by 99% with minimal impact on convergence.
The proposed method matches theoretical predictions in experiments.
Significant speedups are observed on both synthetic and real datasets.
Abstract
Many popular distributed optimization methods for training machine learning models fit the following template: a local gradient estimate is computed independently by each worker, then communicated to a master, which subsequently performs averaging. The average is broadcast back to the workers, which use it to perform a gradient-type step to update the local version of the model. It is also well known that many such methods, including SGD, SAGA, and accelerated SGD for over-parameterized models, do not scale well with the number of parallel workers. In this paper we observe that the above template is fundamentally inefficient in that too much data is unnecessarily communicated by the workers, which slows down the overall system. We propose a fix based on a new update-sparsification method we develop in this work, which we suggest be used on top of existing methods. Namely, we develop a…
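The communication template described in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it states the actual method, so the sparsified variant below is only an assumption about what update sparsification could look like (each worker sends a random subset of `tau` coordinates instead of the full gradient); it is not presented as the authors' algorithm. The quadratic objective and the names `dense_round`, `sparse_round`, `centers`, and `tau` are hypothetical and chosen for illustration.

```python
import random

def local_gradient(x, c):
    # Gradient of the toy local objective f_i(x) = 0.5 * ||x - c||^2.
    return [xj - cj for xj, cj in zip(x, c)]

def dense_round(x, centers, step):
    """One round of the standard template: every worker sends a full gradient."""
    n, d = len(centers), len(x)
    grads = [local_gradient(x, c) for c in centers]          # worker side
    avg = [sum(g[j] for g in grads) / n for j in range(d)]   # master averages
    new_x = [xj - step * aj for xj, aj in zip(x, avg)]       # gradient-type step
    return new_x, n * d                                      # floats sent to master

def sparse_round(x, centers, step, tau, rng):
    """Assumed sparsified variant: each worker sends only tau random coordinates."""
    n, d = len(centers), len(x)
    update = [0.0] * d
    for c in centers:
        g = local_gradient(x, c)
        for j in rng.sample(range(d), tau):   # independent choice per worker
            update[j] += g[j] / n
    new_x = [xj - step * uj for xj, uj in zip(x, update)]
    return new_x, n * tau                     # floats sent to master

# Toy setup: all workers share the same minimizer, so every local
# gradient vanishes at the optimum (an assumption of this sketch).
rng = random.Random(0)
d, n, tau = 10, 4, 1
target = [float(j) for j in range(d)]
centers = [target[:] for _ in range(n)]

x_dense, dense_cost = dense_round([0.0] * d, centers, 1.0)

x_sparse, sparse_cost_per_round = [0.0] * d, 0
for _ in range(500):
    x_sparse, sparse_cost_per_round = sparse_round(x_sparse, centers, 1.0, tau, rng)

sparse_error = max(abs(a - b) for a, b in zip(x_sparse, target))
```

In this toy setting a dense round sends `n * d = 40` numbers to the master while a sparsified round sends `n * tau = 4`. The sparsified iterate still approaches the shared minimizer, but it needs more rounds to get there, and the sketch does not establish whether the total communication is lower. That trade-off, and the case where local gradients do not vanish at the optimum, are what the paper's analysis addresses.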
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM
Methods: SAGA · Stochastic Gradient Descent
