Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding
Qian Yu, Mohammad Ali Maddah-Ali, A. Salman Avestimehr

TL;DR
This paper introduces a coding strategy called the entangled polynomial code to mitigate stragglers in distributed matrix multiplication, reducing the number of worker nodes whose results must be awaited before the final product can be recovered.
Contribution
It proposes the entangled polynomial code, a coding scheme whose recovery threshold is optimal in several cases and orderwise better than that of conventional straggler-mitigation schemes, and it extends the scheme to other distributed computing problems.
Findings
The entangled polynomial code achieves the optimal recovery threshold in several cases and improves orderwise on conventional straggler-mitigation schemes.
The optimal recovery threshold among all linear coding strategies is characterized to within a factor of 2.
Extensions to coded convolution and fault-tolerant computing are demonstrated.
Abstract
We consider the problem of massive matrix multiplication, which underlies many data analytic applications, in a large-scale distributed system comprising a group of worker nodes. We target the stragglers' delay performance bottleneck, which is due to the unpredictable latency in waiting for slowest nodes (or stragglers) to finish their tasks. We propose a novel coding strategy, named entangled polynomial code, for designing the intermediate computations at the worker nodes in order to minimize the recovery threshold (i.e., the number of workers that we need to wait for in order to compute the final output). We demonstrate the optimality of entangled polynomial code in several cases, and show that it provides orderwise improvement over the conventional schemes for straggler mitigation. Furthermore, we characterize the optimal recovery threshold among all linear coding strategies…
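The abstract does not spell out the construction, so the following is only a minimal sketch of how a polynomial-style code of this kind can work, not the authors' reference implementation. It assumes the task is C = AᵀB, that A is split into p × m blocks and B into p × n blocks, and that each worker multiplies one polynomial evaluation of each input; under those assumptions the product polynomial has degree pmn + p − 2, so any pmn + p − 1 finished workers suffice. The function names, the real-valued evaluation points, and the partition parameters are all illustrative choices; a practical system would more likely work over a finite field to avoid the conditioning problems of real Vandermonde systems.

```python
import numpy as np

def recovery_threshold(p, m, n):
    # Assumed from the construction sketched here: the product polynomial
    # has degree p*m*n + p - 2, so this many evaluations determine it.
    return p * m * n + p - 1

def encode(A, B, p, m, n, xs):
    """Return one (A_enc, B_enc) pair per evaluation point in xs."""
    # A_blk[j][k] and B_blk[j][k]: j indexes the shared (row) dimension.
    A_blk = [np.hsplit(band, m) for band in np.vsplit(A, p)]
    B_blk = [np.hsplit(band, n) for band in np.vsplit(B, p)]
    shares = []
    for x in xs:
        A_enc = sum(A_blk[j][k] * x ** (j + k * p)
                    for j in range(p) for k in range(m))
        B_enc = sum(B_blk[j][k] * x ** (p - 1 - j + k * p * m)
                    for j in range(p) for k in range(n))
        shares.append((A_enc, B_enc))
    return shares

def decode(results, xs_done, p, m, n):
    """Interpolate the product polynomial and read off the blocks of C."""
    K = recovery_threshold(p, m, n)
    blk_shape = results[0].shape
    # Solve the Vandermonde system for the polynomial's coefficients.
    V = np.vander(np.asarray(xs_done[:K]), K, increasing=True)
    Y = np.stack([r.ravel() for r in results[:K]])
    coeffs = np.linalg.solve(V, Y)

    def block(k, k2):
        # Block (k, k2) of C is the coefficient of x^(p-1 + k*p + k2*p*m).
        return coeffs[p - 1 + k * p + k2 * p * m].reshape(blk_shape)

    return np.block([[block(k, k2) for k2 in range(n)] for k in range(m)])

def coded_matmul(A, B, p, m, n, num_workers, stragglers=()):
    """Simulate the scheme; workers listed in `stragglers` never respond."""
    xs = np.linspace(-1.0, 1.0, num_workers)  # illustrative distinct points
    shares = encode(A, B, p, m, n, xs)
    slow = set(stragglers)
    done = [i for i in range(num_workers) if i not in slow]
    if len(done) < recovery_threshold(p, m, n):
        raise ValueError("too few finished workers to decode")
    # Each finished worker multiplies its two coded matrices.
    results = [shares[i][0].T @ shares[i][1] for i in done]
    return decode(results, xs[done], p, m, n)

# Usage: 12 workers, 3 of which straggle; 9 results are enough here.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 6))
C_hat = coded_matmul(A, B, p=2, m=2, n=2, num_workers=12, stragglers=(1, 5, 10))
print(np.allclose(C_hat, A.T @ B, atol=1e-6))
```

In this sketch the master never needs to know in advance which workers will be slow: any set of finished workers of the required size yields an invertible Vandermonde system, which is what makes the recovery threshold the relevant quantity.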
