Hierarchical Coded Matrix Multiplication
Shahrzad Kiani, Nuwan Ferdinand, and Stark C. Draper

TL;DR
This paper introduces hierarchical coded computing methods that use the work completed by all nodes, including stragglers, to speed up distributed matrix multiplication and reduce the expected completion time.
Contribution
The paper develops a unified framework and three new hierarchical coded computing techniques that use partial work from all nodes, improving robustness and efficiency in distributed systems.
Findings
Achieves up to a 66% reduction in expected finishing time in theoretical models.
Realizes a 27% improvement in Amazon EC2 experiments with simulated stragglers.
Unifies existing methods within a cuboid partitioning framework.
Abstract
In distributed computing systems, slow-working nodes, known as stragglers, can greatly extend finishing times. Coded computing is a technique that enables straggler-resistant computation. Most coded computing techniques presented to date provide robustness by ensuring that the time to finish depends only on a set of the fastest nodes. However, while stragglers do compute less work than non-stragglers, in real-world commercial cloud computing systems (e.g., Amazon's Elastic Compute Cloud (EC2)) the distinction is often a soft one. In this paper, we develop hierarchical coded computing that exploits the work completed by all nodes, both fast and slow, automatically integrating the potential contribution of each. We first present a conceptual framework to represent the division of work amongst nodes in coded matrix multiplication as a cuboid partitioning problem. This framework allows us to…
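To make the baseline concrete, here is a minimal sketch of the kind of coded matrix multiplication the abstract contrasts against, where the result is recoverable from any k of n workers and the remaining (straggler) work is discarded. It is not the paper's hierarchical scheme. It assumes a simple Vandermonde (polynomial-style) code over the row blocks of A, and all function names (`encode`, `decode`, `invert`, `matmul`) are hypothetical helpers written for this illustration.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def encode(A, k, points):
    """Split A into k row blocks A_0..A_{k-1}; the worker with evaluation
    point x receives the coded block sum_j A_j * x**j."""
    rows = len(A) // k  # assumes len(A) is divisible by k
    blocks = [A[j * rows:(j + 1) * rows] for j in range(k)]
    return [
        [[sum(blocks[j][r][c] * x ** j for j in range(k))
          for c in range(len(A[0]))] for r in range(rows)]
        for x in points
    ]


def invert(M):
    """Exact Gauss-Jordan inverse using Fractions."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def decode(results, k):
    """Recover A @ B from any k finished workers.
    `results` maps an evaluation point to that worker's product."""
    pts = list(results)[:k]
    Vinv = invert([[x ** j for j in range(k)] for x in pts])
    rows, cols = len(results[pts[0]]), len(results[pts[0]][0])
    out = []
    for j in range(k):  # block j of the product is sum_i Vinv[j][i] * R_i
        for r in range(rows):
            out.append([sum(Vinv[j][i] * results[pts[i]][r][c] for i in range(k))
                        for c in range(cols)])
    return out


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 2], [0, 1, 3]]
k, points = 2, [1, 2, 3, 4]          # n = 4 workers, any k = 2 suffice
coded = encode(A, k, points)

# Simulate stragglers: only the workers at points 2 and 4 finish in time.
finished = {x: matmul(c, B) for x, c in zip(points, coded) if x in (2, 4)}
decoded = decode(finished, k)
assert decoded == matmul(A, B)
```

In this sketch the work done by the two unfinished workers is thrown away, which is the inefficiency the hierarchical approach in the paper is designed to address by using partial work from every node.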
