Speeding Up Distributed Machine Learning Using Codes
Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, Kannan Ramchandran

TL;DR
This paper explores how coding techniques can significantly improve the efficiency of distributed machine learning by reducing delays caused by stragglers and communication bottlenecks, with theoretical analysis and experimental validation.
Contribution
It introduces coded solutions for matrix multiplication and data shuffling in distributed learning, demonstrating substantial speedups and communication reductions over uncoded methods.
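To make the coded matrix multiplication idea concrete, here is a minimal sketch, not the paper's implementation. It assumes an (n, k) real-valued code built from a random Gaussian generator matrix, chosen only because any k of its rows are invertible with probability 1. The names encode_blocks, worker_compute, and decode are hypothetical.

```python
import numpy as np

def encode_blocks(A, n, k, rng):
    """Split A into k row blocks and mix them into n coded blocks (assumed code)."""
    blocks = np.split(A, k, axis=0)      # raises ValueError unless rows divide by k
    G = rng.standard_normal((n, k))      # hypothetical generator matrix
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return G, coded

def worker_compute(coded_block, x):
    """Work done by one worker: multiply its coded block by x."""
    return coded_block @ x

def decode(G, results, k):
    """Recover A @ x from the results of any k workers (dict: worker id -> result)."""
    idx = list(results)[:k]
    Gs = G[idx, :]                            # k x k submatrix of the generator
    Y = np.stack([results[i] for i in idx])   # k coded partial products
    B = np.linalg.solve(Gs, Y)                # row j equals blocks[j] @ x
    return B.reshape(-1)
```

With n = 5 and k = 3, for example, the master can decode as soon as any three workers report back, so up to two stragglers are simply ignored.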
Findings
With n homogeneous workers whose subtask runtimes have exponential tails, coded matrix multiplication speeds up distributed computation by a factor of log n.
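Coded data shuffling reduces communication cost by a factor of (α + 1/n)γ(n) relative to uncoded shuffling, where α is the constant fraction of the data matrix each worker can cache, n is the number of workers, and γ(n) is the ratio of the cost of unicasting n messages to n users to that of multicasting one common message of the same size to n users.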
Experimental results confirm the theoretical advantages of coded algorithms.
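The communication saving from coded shuffling can be seen in a toy two-worker case. The sketch below is an illustration under stated assumptions, not the paper's general scheme: two equal-size batches, each worker already caching the batch the other one needs, and a simple XOR code. The function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("batches must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

def coded_exchange(batch_a: bytes, batch_b: bytes):
    """Worker 1 caches batch_a and needs batch_b; worker 2 is the reverse.

    The master multicasts one coded message instead of unicasting two batches.
    """
    coded = xor_bytes(batch_a, batch_b)       # single multicast payload
    worker1_gets = xor_bytes(coded, batch_a)  # worker 1 cancels its cached batch
    worker2_gets = xor_bytes(coded, batch_b)  # worker 2 cancels its cached batch
    return coded, worker1_gets, worker2_gets
```

Each worker recovers the batch it is missing from one transmission of the same size as a single batch, which is where cached data turns into reduced communication.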
Abstract
Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms -- straggler nodes, system failures, or communication bottlenecks -- but there has been little interaction cutting across codes, machine learning, and distributed systems. In this work, we provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers, and show that if the number of homogeneous workers is n, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix…
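The log n gap can be explored with a small latency model. This is a sketch under an assumed runtime model, not a result taken from the text above: a subtask covering 1/k of the job takes (1 + E)/k time, where E is exponential with rate mu (a shifted exponential). It relies on the standard fact that the expected k-th smallest of n i.i.d. exponentials with rate mu is (H_n - H_{n-k})/mu, with H_m the m-th harmonic number. All function names are hypothetical.

```python
from math import fsum

def harmonic(m: int) -> float:
    """m-th harmonic number H_m, with H_0 = 0."""
    return fsum(1.0 / i for i in range(1, m + 1))

def expected_coded(n: int, k: int, mu: float = 1.0) -> float:
    """Expected time until the fastest k of n workers finish a 1/k-size subtask."""
    return (1.0 + (harmonic(n) - harmonic(n - k)) / mu) / k

def expected_uncoded(n: int, mu: float = 1.0) -> float:
    """Uncoded case: every one of the n workers must finish (k = n)."""
    return expected_coded(n, n, mu)

def best_speedup(n: int, mu: float = 1.0) -> float:
    """Ratio of uncoded latency to the best coded latency over all k."""
    best = min(expected_coded(n, k, mu) for k in range(1, n + 1))
    return expected_uncoded(n, mu) / best
```

Under this model the uncoded latency grows like (log n)/n because the master waits for the slowest worker, while a well-chosen k keeps the coded latency on the order of 1/n, so best_speedup grows as n increases.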
