Near-Optimal Fault Tolerance for Efficient Batch Matrix Multiplication via an Additive Combinatorics Lens
Keren Censor-Hillel, Yuka Machino, Pedro Soto

TL;DR
This paper establishes a near-optimal fault tolerance limit for batch matrix multiplication using Rook Codes, employing additive combinatorics to prove lower bounds and presenting a code that nearly matches this bound.
Contribution
It provides the first lower bound for Rook Codes' recovery threshold in batch matrix multiplication and introduces a Rook Code whose recovery threshold nearly matches this bound.
Findings
A lower bound proof shows that the recovery threshold of any Rook Code must be at least ω(n), i.e., superlinear in the batch size n.
A Rook Code is constructed with a recovery threshold of n^{1+o(1)}.
The results demonstrate near-optimal fault tolerance for Rook Codes in this setting.
Abstract
Fault tolerance is a major concern in distributed computational settings. In the classic master-worker setting, a server (the master) needs to perform some heavy computation which it may distribute to other machines (workers) in order to speed up the time complexity. In this setting, it is crucial that the computation is made robust to failed workers, in order for the master to be able to retrieve the result of the joint computation despite failures. A prime complexity measure is thus the recovery threshold, which is the number of workers that the master needs to wait for in order to derive the output. This is the counterpart to the number of failed workers that it can tolerate. In this paper, we address the fundamental and well-studied task of matrix multiplication. Specifically, our focus is on when the master needs to multiply a batch of pairs of matrices. Several…
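To make the notion of a recovery threshold concrete, here is a minimal Python sketch of a polynomial-coded batch multiplication in the master-worker setting described above. It is a toy under stated assumptions, not the paper's construction: both sides of each pair use the same exponent set, that set is chosen to contain no three-term arithmetic progression (so the wanted products land on coefficients no cross term touches), and the threshold is taken to be the product polynomial's degree plus one. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def encode(mats, exps, x):
    """Entrywise evaluation of sum_i mats[i] * x**exps[i] at the point x."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(M[r][c] * x ** e for M, e in zip(mats, exps))
             for c in range(cols)] for r in range(rows)]


def diagonal_terms_are_isolated(exps):
    """True if every 2*exps[i] differs from all other diagonal and cross sums."""
    n = len(exps)
    cross = {exps[j] + exps[k] for j in range(n) for k in range(n) if j != k}
    diag = [2 * e for e in exps]
    return len(set(diag)) == n and not cross.intersection(diag)


def interpolate(xs, ys):
    """Lagrange interpolation; returns coefficients, lowest degree first."""
    k = len(xs)
    coeffs = [Fraction(0)] * k
    for i in range(k):
        basis, denom = [Fraction(1)], Fraction(1)
        for j in range(k):
            if j == i:
                continue
            basis = [Fraction(0)] + basis          # multiply by x
            for d in range(len(basis) - 1):        # then subtract xs[j] * old basis
                basis[d] -= xs[j] * basis[d + 1]
            denom *= xs[i] - xs[j]
        for d in range(k):
            coeffs[d] += ys[i] * basis[d] / denom
    return coeffs


def coded_batch_multiply(As, Bs, exps, responding_points):
    """Recover every As[i] @ Bs[i] from the workers that responded."""
    threshold = 2 * max(exps) + 1                  # degree of the product + 1 (toy assumption)
    if len(responding_points) < threshold:
        raise ValueError("too few workers responded")
    xs = [Fraction(x) for x in responding_points[:threshold]]
    # Each worker multiplies one encoded A by one encoded B.
    results = [matmul(encode(As, exps, x), encode(Bs, exps, x)) for x in xs]
    rows, cols = len(results[0]), len(results[0][0])
    out = [[[None] * cols for _ in range(rows)] for _ in exps]
    for r, c in product(range(rows), range(cols)):
        coeffs = interpolate(xs, [res[r][c] for res in results])
        for i, e in enumerate(exps):
            out[i][r][c] = coeffs[2 * e]           # A_i B_i sits at exponent 2 * exps[i]
    return out


# Batch of n = 4 pairs; {0, 1, 3, 4} has no three-term arithmetic progression.
exps = [0, 1, 3, 4]
As = [[[i + 1, 2], [0, i]] for i in range(4)]
Bs = [[[1, i], [i + 2, 3]] for i in range(4)]
workers = list(range(1, 13))                       # 12 workers, one evaluation point each
responding = [x for x in workers if x not in {2, 5, 11}]  # 3 workers fail
recovered = coded_batch_multiply(As, Bs, exps, responding)
```

In this toy run the master waits for 9 of 12 workers, so it tolerates 3 failures. The need for diagonal sums to avoid all cross sums is what ties the threshold to the density of progression-free sets, which is one way to see why additive combinatorics enters the picture.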
Taxonomy
Topics: Distributed systems and fault tolerance · Stochastic Gradient Optimization Techniques · Cryptography and Data Security
