Exploitation of Stragglers in Coded Computation
Shahrzad Kiani, Nuwan Ferdinand, Stark C. Draper

TL;DR
This paper introduces a novel coded computation scheme that exploits all work done by nodes, including stragglers, to significantly reduce overall computation time in distributed matrix operations.
Contribution
It presents a new approach combining error correction with work exploitation of stragglers, using sub-blocking and task order optimization for faster distributed matrix multiplication.
Findings
Expected computation time is reduced by at least 50%.
Sub-blocking enables more continuous processing and better resource utilization.
Order of sub-task computation is a new design parameter.
Abstract
In cloud computing systems, slow processing nodes, often referred to as "stragglers", can significantly extend the computation time. Recent results have shown that error correction coding can be used to reduce the effect of stragglers. In this work we introduce a scheme that, in addition to using error correction to distribute mixed jobs across nodes, is also able to exploit the work completed by all nodes, including stragglers. We first consider vector-matrix multiplication and apply maximum distance separable (MDS) codes to small blocks of sub-matrices. The worker nodes process blocks sequentially, working block-by-block, and transmit partial per-block results to the master as they are completed. Sub-blocking allows a more continuous completion process, which in turn lets us exploit the work of a much broader spectrum of processors and reduces computation time. We then apply this…
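The sub-blocking idea in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' construction: it assumes each row of A is its own sub-block, uses a Vandermonde generator over the rationals as a stand-in MDS code, and models each worker with a fixed hypothetical speed. The function names (encode, solve, run) and all parameters are made up for illustration. Workers process their coded sub-tasks in order and report each result as it finishes; the master decodes as soon as any K results have arrived, regardless of which worker produced them.

```python
from fractions import Fraction

def encode(A, n_coded):
    """Coded row j is sum_i (j+1)**i * A[i]; any K rows of G are invertible."""
    K = len(A)
    G = [[Fraction(j + 1) ** i for i in range(K)] for j in range(n_coded)]
    coded = [[sum(G[j][i] * A[i][c] for i in range(K)) for c in range(len(A[0]))]
             for j in range(n_coded)]
    return G, coded

def solve(M, b):
    """Exact Gauss-Jordan elimination for a square invertible system M y = b."""
    n = len(M)
    aug = [list(M[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]

def run(A, x, speeds, tasks_per_worker):
    """Simulate sequential per-block processing; decode from the first K results."""
    K = len(A)
    G, coded = encode(A, len(speeds) * tasks_per_worker)
    events = []
    for w, speed in enumerate(speeds):
        for t in range(tasks_per_worker):
            j = w * tasks_per_worker + t          # index of the coded sub-block
            finish = Fraction(t + 1) / speed      # assumed deterministic delay model
            result = sum(c * xi for c, xi in zip(coded[j], x))
            events.append((finish, w, j, result))
    events.sort()                                 # order in which the master hears back
    first = events[:K]
    y = solve([G[j] for _, _, j, _ in first], [r for *_, r in first])
    contributors = sorted({w for _, w, _, _ in first})
    return y, first[-1][0], contributors

# Hypothetical instance: 6 sub-blocks, 3 workers, worker 2 is the straggler.
A = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
x = [2, 1]
y, finish_time, contributors = run(A, x, speeds=[3, 2, 1], tasks_per_worker=4)
```

In this instance the decoded y equals the direct product of A and x, and the set of contributing workers includes the slowest one, whose single completed sub-task is used rather than discarded. The deterministic speed model is a simplification chosen to keep the example reproducible.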
