Efficient and Robust Distributed Matrix Computations via Convolutional Coding

Anindya B. Das; Aditya Ramamoorthy; Namrata Vaswani

arXiv:1907.08064·cs.IT·June 3, 2020

TL;DR

This paper introduces convolutional coding methods for distributed matrix computations that are both highly resilient to stragglers and numerically stable, outperforming prior approaches in efficiency and robustness.

Contribution

It proposes two convolutional coding schemes that achieve optimal straggler resilience and numerical stability, with efficient decoding algorithms close to the theoretical lower bounds.

Findings

01 Optimal straggler resilience demonstrated

02 Numerical robustness quantified via condition number bounds

03 Experimental validation on AWS cloud platform

Tables

Table I: Comparison with existing works [8, 2, 5, 10] and parallel works [11, 12] in terms of different properties of the algorithms. Decoding complexity is given for $s$ stragglers with recovery threshold $k$, where $\mathbf{A}\in\mathbb{R}^{t\times r}$ and $\mathbf{B}\in\mathbb{R}^{t\times w}$. $T$ and $q$ are decoding algorithm parameters for the random conv. code, discussed in Section V, where $T,q\ll r,w$.

Codes	Mat-Mat Mult?	Optimal Threshold?	Numerical Stability?	Decoding Complexity for Mat-Mat Mult
Repetition Codes	✓	✗	✓	Zero
Rateless Codes [8]	✗	✗	✓	✗
Product Codes [2]	✓	✗	✗	$O(r^{3})$, assuming $r=w$
Polynomial Codes [5]	✓	✓	✗	$O(rwk)$
Ortho-Poly Codes [10]	✓	✓	✓	$O(rwk)$
Circulant and Rotation Matrix [12]	✓	✓	✓	$O(rwk)$
Random Khatri-Rao Codes [11]	✓	✓	✓	$O(\frac{rw}{k}s^{2})$
All-Ones-Conv Code (Proposed)	✓	✓	✓	$O(rws)$ (add/subtract ops)
Random-Conv Code (Proposed)	✓	✓	✓	$\min(T,q)\times O(\frac{rw}{k}s^{2})$

Table II: Comparison of Worst Case Condition Numbers ($\kappa_{worst}$) for Matrix-matrix Multiplication for $n=18$ and $s=3$

Methods	$\kappa_{worst}$
Polynomial Code [5]	$4.031\times 10^{7}$
Ortho-Poly Code [10]	$2.506\times 10^{4}$
Random Khatri-Rao Code [11]	$5329.3$
Circulant and Rotation Matrix [12]	$102$
Proposed All-ones Conv Code	$4417.8$
Proposed Random Conv Code	$1829.4$

Table III: Comparison of $\kappa_{worst}$ for Matrix-vector Multiplication for $n=30$ and $s=2$ with $\gamma=\frac{1}{25}$

Methods	$\kappa_{worst}$
Polynomial Code [5]	$2.293\times 10^{13}$
Convolutional Code [20]	$5.124\times 10^{4}$
Ortho-Poly Code [10]	$7902.6$
Random KR Code [11]	$3642.7$
Circulant and Rotation Matrix [12]	$52$
Proposed All-ones Conv. Code	$2868.3$
Proposed Rand Conv. Code	$1374.6$

Table IV: Comparison of our proposed methods, with $n=11$ and $k_{A}=k_{B}=3$; $\mathbf{A}$ and $\mathbf{B}$ have size $10000\times 12600$.

Metrics	Methods	$\gamma=\frac{2}{5}$	$\gamma=\frac{5}{14}$	$\gamma=\frac{7}{20}$
Decoding Time	All ones	$0.35$ s	$0.36$ s	$0.39$ s
Decoding Time	Random	$0.39$ s	$1.16$ s	$2.89$ s
$\kappa_{worst}$ for $\tilde{\mathbf{G}}_{\mathcal{I}}$	All ones	$95.2$	$275.9$	$395.6$
$\kappa_{worst}$ for $\tilde{\mathbf{G}}_{\mathcal{I}}$	Random	$76.9$	$112.2$	$117.5$
$\kappa_{worst}$ for Sqr. Submat. of $\tilde{\mathbf{G}}_{\mathcal{I}}$	All ones	$96.5$	$277.9$	$397.8$
$\kappa_{worst}$ for Sqr. Submat. of $\tilde{\mathbf{G}}_{\mathcal{I}}$	Random	$7.46\times 10^{6}$	$9.64\times 10^{17}$	$1.11\times 10^{28}$

Equations

$\mathbf{A}=\begin{bmatrix}\mathbf{A}_{0,0}&\dots&\mathbf{A}_{0,u-1}\\ \vdots&\ddots&\vdots\\ \mathbf{A}_{p-1,0}&\dots&\mathbf{A}_{p-1,u-1}\end{bmatrix},\ \text{and}\ \mathbf{B}=\begin{bmatrix}\mathbf{B}_{0,0}&\dots&\mathbf{B}_{0,v-1}\\ \vdots&\ddots&\vdots\\ \mathbf{B}_{p-1,0}&\dots&\mathbf{B}_{p-1,v-1}\end{bmatrix}.$

$\mathbf{A}(z)$

so $\mathbf{A}^{T}(z)\mathbf{B}(z)$

$\mathbf{A}(z)$

so $\mathbf{A}^{T}(z)\mathbf{B}(z)$

$\frac{\|\hat{\mathbf{x}}-\mathbf{x}\|}{\|\mathbf{x}\|}=\frac{\|\mathbf{M}^{-1}(\tilde{\mathbf{y}}-\mathbf{y})\|}{\|\mathbf{M}^{-1}\mathbf{y}\|}\leq\frac{\sigma_{\max}(\mathbf{M}^{-1})}{\sigma_{\min}(\mathbf{M}^{-1})}\frac{\|\tilde{\mathbf{y}}-\mathbf{y}\|}{\|\mathbf{y}\|}=\frac{\sigma_{\max}(\mathbf{M})}{\sigma_{\min}(\mathbf{M})}\frac{\|\tilde{\mathbf{y}}-\mathbf{y}\|}{\|\mathbf{y}\|}=\kappa(\mathbf{M})\frac{\|\tilde{\mathbf{y}}-\mathbf{y}\|}{\|\mathbf{y}\|},$

$[\mathbf{c}_{0}(D)\ \mathbf{c}_{1}(D)\ \mathbf{c}_{2}(D)\ \mathbf{c}_{3}(D)]=[\mathbf{u}_{0}(D)\ \mathbf{u}_{1}(D)]\underbrace{\begin{bmatrix}1&0&1&1\\ 0&1&1&D\end{bmatrix}}_{\mathbf{G}(D)}.$

$\mathbf{c}_{2}(D)$

$\mathbf{c}_{3}(D)$

$[\mathbf{c}_{0}\ \mathbf{c}_{1}\ \mathbf{c}_{2}\ \mathbf{c}_{3}]=[\mathbf{u}_{0}\ \mathbf{u}_{1}]\begin{bmatrix}\mathbf{I}_{q}&\mathbf{0}_{q\times q}&\mathbf{I}_{q}&[\mathbf{I}_{q}\ \mathbf{0}]\\ \mathbf{0}_{q\times q}&\mathbf{I}_{q}&\mathbf{I}_{q}&[\mathbf{0}\ \mathbf{I}_{q}]\end{bmatrix}$

$\mathbf{G}_{mv}(D)=\begin{bmatrix}\underbrace{\mathbf{I}_{k}}_{\textrm{message part}}\;\big|\;\underbrace{\mathbf{Y}_{k,s}(D)}_{\textrm{parity part}}\end{bmatrix}.$

$[\mathbf{C}_{0}(D)\ \mathbf{C}_{1}(D)\ \dots\ \mathbf{C}_{n-1}(D)]=[\mathbf{U}_{0}(D)\ \mathbf{U}_{1}(D)\ \dots\ \mathbf{U}_{k-1}(D)]\ \mathbf{G}_{mv}(D).$

$\left(q+(s-1)(k-1)\right)\frac{r}{kq}$

$\implies q\geq\frac{(s-1)(k-1)}{k\left(\gamma-\frac{1}{k}\right)}.$

$[\mathbf{Y}_{\bar{b},\bar{a}}(D)]_{i,j}=(D^{a_{j}})^{b_{i}}.$

$\mathbf{G}(D)=\begin{bmatrix}\mathbf{I}_{k}\;\big|\;\mathbf{Y}_{\bar{b},\bar{a}}(D)\end{bmatrix}.$

$\mathbf{U}_{i}^{A}(D)$

$\mathbf{U}_{i}^{B}(D)$

$[\mathbf{C}_{0}^{A}(D)\ \mathbf{C}_{1}^{A}(D)\ \dots\ \mathbf{C}_{n-1}^{A}(D)]$

$[\mathbf{C}_{0}^{B}(D)\ \mathbf{C}_{1}^{B}(D)\ \dots\ \mathbf{C}_{n-1}^{B}(D)]$

$\mathbf{C}_{i}^{A}(D)\times\mathbf{C}_{i}^{B}(D)=$

$[\mathbf{U}^{A}(D)\mathbf{G}_{A}(D)]\odot[\mathbf{U}^{B}(D)\mathbf{G}_{B}(D)]=[\mathbf{U}^{A}(D)\otimes\mathbf{U}^{B}(D)]\,[\mathbf{G}_{A}(D)\odot\mathbf{G}_{B}(D)].$

$\mathbf{G}_{A}(D)$

$\mathbf{G}_{B}(D)$

$\mathbf{G}_{A}(D)\odot\mathbf{G}_{B}(D)=\begin{bmatrix}\mathbf{I}_{k}\;\big|\;\mathbf{Y}_{k_{A},s}(D^{z})\odot\mathbf{Y}_{k_{B},s}(D)\end{bmatrix}$

$\begin{bmatrix}1\\ D^{zl}\\ D^{2zl}\\ \vdots\\ D^{(k_{A}-1)zl}\end{bmatrix}\otimes\begin{bmatrix}1\\ D^{l}\\ D^{2l}\\ \vdots\\ D^{(k_{B}-1)l}\end{bmatrix}=\begin{bmatrix}1\\ \vdots\\ D^{(k_{B}-1)l}\\ D^{zl}\\ \vdots\\ D^{(k_{B}-1+z)l}\\ \vdots\\ D^{(k_{A}-1)zl}\\ \vdots\\ D^{(k_{B}-1+(k_{A}-1)z)l}\end{bmatrix}$

$\mathbf{C}_{n-1}^{A}(D)=\mathbf{U}_{0}^{A}(D)$

$\mathbf{C}_{n-1}^{B}(D)=\mathbf{U}_{0}^{B}(D)$

$z\geq q_{B}+(s-1)(k_{B}-1).$

$q_{A}\geq\frac{(s-1)(k_{A}-1)}{k_{A}\left(\gamma_{A}-\frac{1}{k_{A}}\right)}\ \text{and}\ q_{B}\geq\frac{(s-1)(k_{B}-1)}{k_{B}\left(\gamma_{B}-\frac{1}{k_{B}}\right)}.$

$\mathbf{U}_{i}^{A}(D)$

and $\mathbf{U}_{i}^{B}(D)$

$\mathbf{G}_{A}(D)$

$\mathbf{G}_{B}(D)$

$(\tilde{\mathbf{G}})_{i,\ell}=\begin{cases}\mathbf{I}_{q}&\text{if }i=\ell\\ \mathbf{0}_{q\times q}&\text{if }i\neq\ell\end{cases}$

Code & Models

Repositories

anindyabijoydas/StragglerMitigateConvCodes

Full text

Efficient and Robust Distributed Matrix Computations via Convolutional Coding

Anindya Bijoy Das, Aditya Ramamoorthy and Namrata Vaswani

Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011 USA.

{abd149, adityar, namrata}@iastate.edu

This work was supported in part by the National Science Foundation (NSF) under Grant CCF-1718470 and Grant CCF-1910840.

Abstract

Distributed matrix computations – matrix-matrix or matrix-vector multiplications – are well-recognized to suffer from the problem of stragglers (slow or failed worker nodes). Much of the prior work in this area either (i) is sub-optimal in terms of its straggler resilience, or (ii) suffers from numerical problems, i.e., there is a blow-up of round-off errors in the decoded result owing to the high condition numbers of the corresponding decoding matrices. Our work presents a convolutional coding approach to this problem that removes these limitations. It is optimal in terms of its straggler resilience, and has excellent numerical robustness as long as the workers’ storage capacity is slightly higher than the fundamental lower bound. Moreover, it can be decoded using a fast peeling decoder that only involves add/subtract operations. Our second approach has marginally higher decoding complexity than the first one, but allows us to operate arbitrarily close to the lower bound. Its numerical robustness can be theoretically quantified by deriving a computable upper bound on the worst case condition number over all possible decoding matrices by drawing connections with the properties of large Toeplitz matrices. All of the above claims are backed up by extensive experiments done on the AWS cloud platform.

Index Terms:

Distributed computing, Straggler, Convolutional coding, Toeplitz matrix, Vandermonde matrix.

I Introduction

Distributed computing clusters are heavily used in domains such as machine learning where datasets are often so large that they cannot be stored in a single computer. The widespread usage of such clusters presents several opportunities and advantages over traditional computing paradigms. However, they also present newer challenges. Large-scale clusters, which can be heterogeneous in nature, suffer from the issue of stragglers (slow or failed workers in the system). Fig. 1 shows the variation of speed of different t2.micro machines in an AWS (Amazon Web Services) cluster, and it can be seen that for a particular job, a slow worker node may require around $40\%-50\%$ more time than the average.

The conventional approach [1] to tackle stragglers has been to run multiple copies of tasks on various machines, with the hope that at least one copy finishes on time. For instance, consider matrix-vector multiplication with a matrix $\mathbf{A}$ and vector $\mathbf{x}$ , where our goal is to obtain the product $\mathbf{A}^{T}\mathbf{x}$ in a distributed fashion. Fig. 2 shows an example where we partition $\mathbf{A}$ into four block-columns and we assign two block-columns to each of the four worker nodes. Thus each block column has been assigned twice over all four workers and we can verify that we recover the final result if any three workers finish their respective jobs. In other words, we can say that this scheme is resilient to one straggler.

However, this toy example can be made even more efficient in terms of resource utilization by dividing $\mathbf{A}$ into two block-columns $\mathbf{A}_{0}$ and $\mathbf{A}_{1}$ and assigning the worker nodes appropriate linear combinations of $\mathbf{A}_{0}$ and $\mathbf{A}_{1}$ so that the required result can be decoded from any two workers. This is the basic idea underlying “coded computation” (introduced in the work of Lee et al. [2]). It leverages ideas from erasure coding to introduce redundancy in the computation performed by the worker nodes. Roughly speaking, as long as enough worker nodes complete their tasks, the master node can decode the intended result by appropriate post-processing.
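The two-block idea above can be sketched in a few lines of NumPy. This is only an illustration: the four workers, the matrix sizes and the encoding coefficients in `coeffs` are assumptions chosen so that any two rows are linearly independent, and are not taken from the paper's figures.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
t, r = 6, 4                      # A is t x r (sizes chosen only for illustration)
A = rng.standard_normal((t, r))
x = rng.standard_normal(t)

# Split A into two block-columns A0 and A1.
A0, A1 = A[:, : r // 2], A[:, r // 2 :]

# Hypothetical encoding coefficients: worker i stores c0 * A0 + c1 * A1.
# Any two rows of this matrix are linearly independent.
coeffs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])

# Each worker multiplies its coded block (transposed) by x.
results = [(c0 * A0 + c1 * A1).T @ x for c0, c1 in coeffs]

def decode(i, j):
    """Recover A^T x from the results of workers i and j."""
    M = coeffs[[i, j]]                       # 2 x 2 decoding matrix
    Y = np.stack([results[i], results[j]])   # 2 x (r/2)
    parts = np.linalg.solve(M, Y)            # rows: A0^T x and A1^T x
    return np.concatenate(parts)

expected = A.T @ x
all_ok = all(np.allclose(decode(i, j), expected)
             for i, j in itertools.combinations(range(4), 2))
print(all_ok)
```

Each worker here stores half of $\mathbf{A}$, the same as in the replication example, yet the result is decodable from any two of the four workers rather than any three.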

The central problem within coded distributed matrix computation can be explained as follows. Suppose that we have large matrices $\mathbf{A}\in\mathbb{R}^{t\times r},\mathbf{B}\in\mathbb{R}^{t\times w}$ and a vector $\mathbf{x}\in\mathbb{R}^{t}$ . The goal is to either compute $\mathbf{A}^{T}\mathbf{x}$ (matrix-vector multiplication) or $\mathbf{A}^{T}\mathbf{B}$ (matrix-matrix multiplication) in a distributed fashion using $n$ worker nodes while being resistant to any $s$ stragglers. Redundancy is introduced in the computation by coding across appropriately chosen submatrices of $\mathbf{A}$ and $\mathbf{B}$ and assigning the worker nodes appropriate computation responsibilities.

The main finding of several recent works in this area is that it is possible to embed distributed matrix computations into the structure of an equivalent erasure code, where the failed nodes play the role of erasures [3, 4, 5, 6, 7, 8, 9] (we discuss related work in detail shortly). A given coded computation scheme is said to have threshold $\tau$ if the desired result can be decoded as long as any $\tau$ worker nodes return their results to the master node. This has been the focus of many works in the literature.

In this work, we consider the important issue of numerical stability within coded computation (in addition to threshold). We point out that several of the existing schemes in the literature suffer from significant numerical issues in the decoding process. In particular, the system of equations that is solved by the master node in the decoding step can have a very high condition number which in turn results in a large error in the decoded result. We present a novel scheme based on convolutional codes (operating over the reals) that simultaneously addresses numerical stability, the threshold, and possesses easy encoding/decoding. An overview of the properties of most of the known schemes in the literature is presented in Table I.

This paper is organized as follows. Section II explains the problem formulation and Section III describes the background and related work and summarizes the contributions of our work. Section IV discusses our main ideas on how convolutional codes can be used to address distributed matrix computations, Section V overviews the analysis of numerical stability for our codes and Section VI discusses the experimental performance of our proposed methods and shows the comparison with other available approaches. We conclude the paper with a discussion about future work in Section VII. For the sake of readability several of the proofs appear in the Appendix.

II Problem Formulation

In the matrix-vector case we partition $\mathbf{A}$ into submatrices of equal size and $\mathbf{x}$ into subvectors and distribute a certain number of “coded” versions of these submatrices to the $n$ workers (subject to a storage constraint). Every worker computes the product of its assigned submatrices and subvectors and sends the computed result back to the master node. The master then “decodes” to recover $\mathbf{A}^{T}\mathbf{x}$ .

In the matrix-matrix multiplication scenario, each worker node receives coded versions of submatrices of $\mathbf{A}$ and coded versions of the submatrices of $\mathbf{B}$. (Footnote 1: A general formulation need not restrict the assignment to coded submatrices of $\mathbf{A}$ and $\mathbf{B}$. Nevertheless, all known schemes thus far and our proposed schemes work with equal-sized submatrices, so we present the formulation in this way.) It computes pairwise products (either all or some subset thereof) of these and sends them to the master node, which needs to decode to recover $\mathbf{A}^{T}\mathbf{B}$.

Below we discuss the matrix-matrix scenario; the discussion applies in a natural way to the matrix-vector case as well. We consider a $p\times u$ and a $p\times v$ block decomposition of $\mathbf{A}$ and $\mathbf{B}$, respectively, as shown below.

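$$\mathbf{A}=\begin{bmatrix}\mathbf{A}_{0,0}&\dots&\mathbf{A}_{0,u-1}\\ \vdots&\ddots&\vdots\\ \mathbf{A}_{p-1,0}&\dots&\mathbf{A}_{p-1,u-1}\end{bmatrix},\ \text{and}\ \mathbf{B}=\begin{bmatrix}\mathbf{B}_{0,0}&\dots&\mathbf{B}_{0,v-1}\\ \vdots&\ddots&\vdots\\ \mathbf{B}_{p-1,0}&\dots&\mathbf{B}_{p-1,v-1}\end{bmatrix}.$$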

The master node encodes by computing appropriate scalar linear combinations of the $\mathbf{A}_{i,j}$ matrices and respectively the $\mathbf{B}_{i,j}$ submatrices. This implies that the master node only performs scalar multiplications and additions. It is not responsible for any of the computationally intensive matrix operations. Following this, it sends the corresponding coded submatrices to each of the workers.

We assume that a worker node cannot store the whole matrix $\mathbf{A}$ or $\mathbf{B}$ . Each worker can store the equivalent of $\gamma_{A}$ fraction of matrix $\mathbf{A}$ and $\gamma_{B}$ fraction of matrix $\mathbf{B}$ ; this is referred to as the storage fraction.

We assume that some nodes will fail or be too slow, and that the number of such nodes is at most $s$. The goal is to design the coding scheme so that (i) the decoding is possible using the output of any $k=(n-s)$ workers ($k$ is often called the recovery threshold of the scheme), (ii) it is robust to noise (both numerical precision errors and other sources of noise); and (iii) it is efficiently decodable. We say that the threshold of a scheme is optimal if it is the lowest possible given the storage constraints.

III Background, Related Work and Summary of Contributions

In recent years, several coded computation schemes have been proposed for matrix multiplication [3, 4, 5, 6, 7, 8, 9, 13, 14, 15]. We illustrate the basic idea below using the polynomial code approach of [5]. These ideas are presented in a tutorial fashion in [16].

Consider a scenario with $n=5$ workers where each of these worker nodes can store $\gamma_{A}=\frac{1}{2}$ fraction of matrix $\mathbf{A}$ and $\gamma_{B}=\frac{1}{2}$ fraction of matrix $\mathbf{B}$ . Consider $u=v=2$ and $p=1$ , thus we partition both $\mathbf{A}$ and $\mathbf{B}$ into two block-columns $\mathbf{A}_{0},\mathbf{A}_{1}$ and $\mathbf{B}_{0},\mathbf{B}_{1}$ respectively. Next, we define two matrix polynomials as

[TABLE]

The master node evaluates the polynomials $\mathbf{A}(z)$ and $\mathbf{B}(z)$ at distinct real values $z_{0},z_{1},\dots,z_{n-1}$, and sends the corresponding matrices to worker node $W_{i}$ (see Fig. 3 where $z_{i}=i+1$). Each worker node computes the product of its assigned submatrices. It follows that decoding at the master node is equivalent to decoding a degree-3 real-valued polynomial. Thus, the master node can recover $\mathbf{A}^{T}\mathbf{B}$ as soon as it receives the results from any four workers. In this example, the recovery threshold is $k=4$ and the system is resilient to $s=1$ straggler.
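A minimal NumPy sketch of this $n=5$, $k=4$ example follows. It assumes the polynomials take the common polynomial-code form $\mathbf{A}(z)=\mathbf{A}_{0}+\mathbf{A}_{1}z$ and $\mathbf{B}(z)=\mathbf{B}_{0}+\mathbf{B}_{1}z^{2}$, which is one choice consistent with the degree-3 product described above; the matrix sizes are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
t, r, w, n = 8, 4, 6, 5
A = rng.standard_normal((t, r))
B = rng.standard_normal((t, w))
A0, A1 = A[:, : r // 2], A[:, r // 2 :]
B0, B1 = B[:, : w // 2], B[:, w // 2 :]

z = np.arange(1, n + 1, dtype=float)   # z_i = i + 1, as in the text

# Assumed polynomials: A(z) = A0 + A1 z, B(z) = B0 + B1 z^2.
# Then A(z)^T B(z) = A0^T B0 + A1^T B0 z + A0^T B1 z^2 + A1^T B1 z^3.
worker_out = [(A0 + A1 * zi).T @ (B0 + B1 * zi**2) for zi in z]

def decode(workers):
    """Interpolate the degree-3 matrix polynomial from four worker results."""
    V = np.vander(z[list(workers)], 4, increasing=True)   # 4 x 4 Vandermonde
    Y = np.stack([worker_out[i].ravel() for i in workers])
    C = np.linalg.solve(V, Y)                              # rows: coefficients
    blk = [c.reshape(r // 2, w // 2) for c in C]
    # blk[0]=A0^T B0, blk[1]=A1^T B0, blk[2]=A0^T B1, blk[3]=A1^T B1
    return np.block([[blk[0], blk[2]], [blk[1], blk[3]]])

expected = A.T @ B
ok = all(np.allclose(decode(subset), expected)
         for subset in itertools.combinations(range(n), 4))
print(ok)
```

The decoding step solves a Vandermonde system, which is where the numerical issues discussed later in this section originate.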

A different solution can be obtained using the approach in [7] for the same example. Let $u=v=1$ and $p=2$ , so we can write $\mathbf{A}^{T}\mathbf{B}=\mathbf{A}_{0}^{T}\mathbf{B}_{0}+\mathbf{A}_{1}^{T}\mathbf{B}_{1}$ . Now we define two matrix polynomials as

[TABLE]

As before, the master node will evaluate the polynomials $\mathbf{A}(z)$ and $\mathbf{B}(z)$ at $z_{0},z_{1},\dots,z_{n-1}$, and send the corresponding matrices to worker node $W_{i}$. It follows that the master can recover all the unknowns (including $\mathbf{A}_{0}^{T}\mathbf{B}_{0}+\mathbf{A}_{1}^{T}\mathbf{B}_{1}$) as soon as it receives the results from any three workers. In this example, the recovery threshold is $k=3$ and the system is resilient to $s=2$ stragglers.

It should be noted that the latter approach can lead to more straggler resilience, but the computational load per worker has doubled compared to the first approach. Moreover the communication load from the worker nodes to the master node is also higher by a factor of $4$ compared to the first approach.

For both schemes above, it can be shown that worker node computation time depends on $t$ , whereas the decoding complexity is independent of it (see for instance [16]). Thus, for scenarios where $t$ is very large, the decoding time can be neglected. Nevertheless, a low decoding complexity is desirable from a practical standpoint.

III-A Related Work

As discussed above, [4, 5, 7] convert distributed matrix computation into polynomial evaluation/interpolation, i.e., the coded submatrices correspond to polynomial evaluation maps. We remark here that as far as we are aware, the idea of embedding matrix multiplication using polynomial maps goes back even further to Yagle [17] (the motivation there was fast matrix multiplication).

For fixed storage constraints $\gamma_{A}=\frac{1}{u}$ and $\gamma_{B}=\frac{1}{v}$ and for fixed computation overhead per worker with $p=1$ and arbitrary $u$ and $v$ , the optimal threshold $\tau$ is shown to be $uv$ [5] using the polynomial approach. When $p\geq 2$ , the work of [4] demonstrates a threshold of $puv+p-1$ . They also present a converse argument which demonstrates that this is within a factor of two of the optimal threshold.

While the computation threshold is somewhat well understood at this point, the issue of numerical stability has received much less attention. When operating over finite fields, proving the invertibility of an appropriate submatrix of the coding matrix suffices to guarantee correct decoding. However, in decoding a real system of equations, errors in the input can get amplified by the condition number (ratio of maximum and minimum singular values) of the associated matrix; hence, a low condition number is critical. For instance, in solving a square system of equations $\mathbf{y}=\mathbf{M}\mathbf{x}$ , suppose that $\mathbf{y}$ is perturbed to $\tilde{\bf{y}}$ (owing to round-off errors) and that the estimate of $\bf{x}$ is $\hat{\bf{x}}:=\bf{M}^{-1}\tilde{\bf{y}}$ . Then, the normalized error in $\hat{\bf{x}}$ is given by

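$$\frac{\|\hat{\mathbf{x}}-\mathbf{x}\|}{\|\mathbf{x}\|}=\frac{\|\mathbf{M}^{-1}(\tilde{\mathbf{y}}-\mathbf{y})\|}{\|\mathbf{M}^{-1}\mathbf{y}\|}\leq\frac{\sigma_{\max}(\mathbf{M}^{-1})}{\sigma_{\min}(\mathbf{M}^{-1})}\frac{\|\tilde{\mathbf{y}}-\mathbf{y}\|}{\|\mathbf{y}\|}=\frac{\sigma_{\max}(\mathbf{M})}{\sigma_{\min}(\mathbf{M})}\frac{\|\tilde{\mathbf{y}}-\mathbf{y}\|}{\|\mathbf{y}\|}=\kappa(\mathbf{M})\frac{\|\tilde{\mathbf{y}}-\mathbf{y}\|}{\|\mathbf{y}\|},$$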

where $\sigma_{\max}(\mathbf{M})$ and $\sigma_{\min}(\mathbf{M})$ denote the maximum and minimum singular values of $\mathbf{M}$ and their ratio $\kappa(\bf{M})$ is the condition number of the decoding matrix $\mathbf{M}$ . Thus, it is clear that a small condition number of the decoding matrix leads to less amplification of the round-off error in $\hat{\bf{x}}$ .

This issue is especially relevant since it is well recognized that polynomial interpolation over the reals suffers from significant numerical issues since the corresponding Vandermonde matrices have very high condition numbers (that are exponential in their size [18]). In fact, even for clusters with around $n=30$ nodes, the condition number of the polynomial approach [5] is so large that the decoded result is essentially useless (see Section VI). We note here that Section VII of [4] remarks that the numerical issues can be handled by embedding all operations within a finite field. In Section VI, we demonstrate that the performance of this method is strongly dependent on the entries of matrices $\mathbf{A}$ and $\mathbf{B}$ and the resultant normalized MSE can be quite bad [19].
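The growth of Vandermonde condition numbers is easy to observe numerically. The sketch below uses integer evaluation points $1,\dots,k$ purely as an illustrative assumption; other real evaluation points give different values, but the rapid growth with size is the point being illustrated.

```python
import numpy as np

def vandermonde_cond(k):
    """Condition number of the k x k Vandermonde matrix with nodes 1..k."""
    nodes = np.arange(1, k + 1, dtype=float)
    V = np.vander(nodes, k, increasing=True)
    return np.linalg.cond(V)

conds = {k: vandermonde_cond(k) for k in (4, 6, 8)}
for k, c in conds.items():
    print(k, f"{c:.3e}")
```

By the error bound above, a condition number of this size multiplies any relative round-off error in the worker outputs by the same factor in the decoded result.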

Some recent works have highlighted and considered the issue of numerical stability in this context. The work of [20, 21] presented strategies for distributed matrix-vector multiplication and demonstrated some schemes that empirically have better numerical performance than polynomial based schemes for some values of $n$ and $s$. The work in [20] considers a convolutional coding approach, but from a parity check matrix perspective, and the work in [21] uses universally decodable matrices, which further allows the partial computations of the stragglers to be utilized. However, both these approaches work only for the matrix-vector problem and do not provide a computable bound on the condition number of the decoding submatrices.

The work of [10] presents an alternate approach that works within the basis of orthogonal polynomials. They demonstrate that the worst case condition number of their schemes is at most $O(n^{2s})$ and their numerical experiments demonstrate improvements with respect to [5]. Our experimental evaluation in Section VI clearly demonstrates that our proposed schemes have condition numbers that are orders of magnitude lower than [10]. The authors of [11] present an approach where the encoded matrices are generated by taking random linear combinations of the block-columns of the respective matrices (this was also suggested in Remark $8$ of [5]). We note here that their approach can be considered as a subclass of our methods, as discussed in Section VI. Table I shows a comparison of the features of several well-known approaches for distributed matrix computations. Our results in Section VI show that the underlying structure of our codes consistently results in lower worst case condition numbers than [11]. Finally, the parallel work of [12] presents an approach that leverages the properties of rotation matrices and circulant permutation matrices. They demonstrate that the worst case condition number of their recovery matrices grows at most as $O(n^{s+6})$. While their numerical results are better than ours, our work has the advantage of easy encoding and decoding and explores a convolutional approach to this problem which has not been considered before.

III-B Summary of Contributions

In this paper we present an efficient and robust scheme for coded matrix computations that is inspired by convolutional codes. Our codes operate over the reals, unlike the majority of convolutional codes that are considered over finite fields [22]. Crucially, they exploit the Vandermonde property of the recovery matrices, where the matrices are defined over a different field (formal Laurent series over $\mathbb{R}$ ) than the real numbers. This naturally allows for simple encoding and decoding in addition to ensuring the threshold properties.

• Our work is among the first to provide an efficient coded computation approach for both matrix-vector and matrix-matrix multiplications that provably works in the (i) essentially noise-free regime where numerical precision issues dominate, and (ii) the noisy regime where noise is significant.

• We present two classes of codes in this work. Our first approach can be decoded using a peeling decoder using only add/subtract operations and has excellent numerical performance when the storage capacity of the nodes is slightly higher than the fundamental lower bound.

When operating very close to the storage capacity lower bound, we propose an alternative random convolutional coding strategy for which we can provide a “computable” upper bound (cf. Theorem 2 in Section V-A) on the worst case condition number of the recovery matrices. This naturally leads to a random sampling algorithm to pick a coding matrix with good performance. Our work draws novel connections with this problem and the asymptotic analysis of large Toeplitz matrices [23].

• An exhaustive comparison of our work with other approaches in the literature shows that the condition numbers of our work are orders of magnitude below all the comparable approaches (except [12]) and have fast decoding times. Fig. 4 depicts a comparison of the performance of the different schemes considered in our work.

• As far as we are aware, most previous work has approached coded computation by exploiting its link with block codes under erasures. Our work is the first to investigate a convolutional coding approach to this problem. This in turn opens up newer problems for investigation in this area.

IV Convolutional Coding for Distributed Matrix Computation

IV-A Simple Illustrative Example

We explain our key idea by means of the following example. Consider two row vectors in $\mathbb{R}^{q}$, $\mathbf{u}_{0}=[u_{00}\ u_{01}\ \dots\ u_{0(q-1)}]$ and $\mathbf{u}_{1}=[u_{10}\ u_{11}\ \dots\ u_{1(q-1)}]$. These vectors can also be represented as polynomials in the indeterminate $D$, $\mathbf{u}_{i}(D)=\sum\limits_{j=0}^{q-1}u_{ij}D^{j}$ for $i=0,1$. As explained in Appendix A, these polynomials can be treated as elements in the ring of formal Laurent series in $D$ [24]. Moreover, it can be shown that this ring is in fact a field, i.e., each nonzero element has a corresponding inverse. Consider the following encoding of $[\mathbf{u}_{0}(D)\;\;\mathbf{u}_{1}(D)]$.

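$$[\mathbf{c}_{0}(D)\ \mathbf{c}_{1}(D)\ \mathbf{c}_{2}(D)\ \mathbf{c}_{3}(D)]=[\mathbf{u}_{0}(D)\ \mathbf{u}_{1}(D)]\underbrace{\begin{bmatrix}1&0&1&1\\ 0&1&1&D\end{bmatrix}}_{\mathbf{G}(D)}.$$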

It is not too hard to see that the polynomials $\mathbf{u}_{0}(D)$ and $\mathbf{u}_{1}(D)$ (equivalently the vectors $\mathbf{u}_{0},\mathbf{u}_{1}$) can be recovered (or “decoded”) from any two entries of the vector $[\mathbf{c}_{0}(D)\ \mathbf{c}_{1}(D)\ \mathbf{c}_{2}(D)\ \mathbf{c}_{3}(D)]$. For instance, suppose that we only receive $\mathbf{c}_{2}(D)$ and $\mathbf{c}_{3}(D)$. Notice that

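$$\mathbf{c}_{2}(D)=\mathbf{u}_{0}(D)+\mathbf{u}_{1}(D)=\sum_{j=0}^{q-1}(u_{0j}+u_{1j})D^{j},$$

$$\mathbf{c}_{3}(D)=\mathbf{u}_{0}(D)+D\,\mathbf{u}_{1}(D)=u_{00}+\sum_{j=1}^{q-1}(u_{0j}+u_{1(j-1)})D^{j}+u_{1(q-1)}D^{q}.$$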

Starting with $u_{00}$ from the constant term of $\mathbf{c}_{3}(D)$, one can iteratively recover each of the coefficients of $\mathbf{u}_{0}(D)$ and $\mathbf{u}_{1}(D)$, with only one new variable to recover in each iteration. A similar argument applies if we consider a different set of two entries from $[\mathbf{c}_{0}(D)\ \mathbf{c}_{1}(D)\ \mathbf{c}_{2}(D)\ \mathbf{c}_{3}(D)]$. We refer to such a decoding scheme as a “peeling decoder”.
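The peeling procedure for this case can be sketched as follows, taking $\mathbf{c}_{2}(D)=\mathbf{u}_{0}(D)+\mathbf{u}_{1}(D)$ and $\mathbf{c}_{3}(D)=\mathbf{u}_{0}(D)+D\,\mathbf{u}_{1}(D)$, which is what the generator matrix $\begin{bmatrix}1&0&1&1\\ 0&1&1&D\end{bmatrix}$ gives for the last two codeword entries. The function names `encode` and `peel` and the sample values are illustrative only.

```python
import numpy as np

def encode(u0, u1):
    """Coefficients of c2(D) = u0(D) + u1(D) and c3(D) = u0(D) + D*u1(D)."""
    q = len(u0)
    c2 = u0 + u1
    c3 = np.zeros(q + 1)
    c3[:q] += u0          # u0(D) occupies degrees 0..q-1
    c3[1:] += u1          # D*u1(D) occupies degrees 1..q
    return c2, c3

def peel(c2, c3):
    """Recover u0 and u1 from c2 and c3 using only subtractions."""
    q = len(c2)
    u0, u1 = np.zeros(q), np.zeros(q)
    for j in range(q):
        # Degree-j term of c3 is u0[j] + u1[j-1]; u1[j-1] is already known.
        u0[j] = c3[j] - (u1[j - 1] if j > 0 else 0.0)
        # Degree-j term of c2 is u0[j] + u1[j]; u0[j] was just recovered.
        u1[j] = c2[j] - u0[j]
    return u0, u1

u0 = np.array([1.0, 2.0, 3.0, 4.0])
u1 = np.array([5.0, 6.0, 7.0, 8.0])
c2, c3 = encode(u0, u1)
r0, r1 = peel(c2, c3)
print(np.array_equal(r0, u0), np.array_equal(r1, u1))
```

Every step subtracts one already-known value from one received coefficient, so no matrix inversion is involved in this case.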

Observe that the encoded polynomial $\mathbf{c}_{3}(D)$ has degree $q$ , while the others have degree $q-1$ . Thus, if the coefficients of the polynomials $\mathbf{c}_{i}$ correspond to encoded data that were sent to node $i$ for processing, then node 3 would need slightly higher storage/processing capacity than nodes 0, 1, 2. Secondly, observe that the above idea can also be equivalently understood by replacing the $2\times 4$ matrix of polynomials $\mathbf{G}(D)$ by a larger matrix of size $2q\times(4q+1)$ and rewriting all the scalar polynomials as row vectors. Let $\mathbf{c}_{0},\mathbf{c}_{1},\mathbf{c}_{2}$ be row vectors of length $q$ and $\mathbf{c}_{3}$ be a row vector of length $q+1$ . Then,

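$$[\mathbf{c}_{0}\ \mathbf{c}_{1}\ \mathbf{c}_{2}\ \mathbf{c}_{3}]=[\mathbf{u}_{0}\ \mathbf{u}_{1}]\begin{bmatrix}\mathbf{I}_{q}&\mathbf{0}_{q\times q}&\mathbf{I}_{q}&[\mathbf{I}_{q}\ \mathbf{0}]\\ \mathbf{0}_{q\times q}&\mathbf{I}_{q}&\mathbf{I}_{q}&[\mathbf{0}\ \mathbf{I}_{q}]\end{bmatrix},$$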

where $\mathbf{0}_{q\times q}$ is a $q\times q$ matrix of zeroes, $\mathbf{I}_{q}$ is a $q\times q$ identity matrix, and $\mathbf{0}$ is a column of zeroes. In what follows, we consider generalizations of this basic example where the $\mathbf{u}_{i}$ ’s will correspond to block-columns of $\mathbf{A}$ and $\mathbf{B}$ .
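The equivalence between the polynomial and the expanded real-matrix view can be checked directly. The sketch below builds the $2q\times(4q+1)$ matrix for the generator $\begin{bmatrix}1&0&1&1\\ 0&1&1&D\end{bmatrix}$ of this example; the helper name `expanded_generator` and the sample vectors are illustrative.

```python
import numpy as np

def expanded_generator(q):
    """2q x (4q+1) real matrix equivalent to G(D) = [[1, 0, 1, 1], [0, 1, 1, D]]."""
    I = np.eye(q)
    Z = np.zeros((q, q))
    zero_col = np.zeros((q, 1))
    shift0 = np.hstack([I, zero_col])   # [I_q 0]: u0 occupies degrees 0..q-1
    shift1 = np.hstack([zero_col, I])   # [0 I_q]: u1 shifted up by one degree
    top = np.hstack([I, Z, I, shift0])
    bottom = np.hstack([Z, I, I, shift1])
    return np.vstack([top, bottom])

q = 4
u0 = np.array([1.0, 2.0, 3.0, 4.0])
u1 = np.array([5.0, 6.0, 7.0, 8.0])
G = expanded_generator(q)
c = np.concatenate([u0, u1]) @ G        # [c0 c1 c2 c3], length 4q + 1
c0, c1, c2, c3 = np.split(c, [q, 2 * q, 3 * q])
print(G.shape, c3)
```

Multiplication by $D$ in the polynomial view becomes a one-position shift in the matrix view, which is why $\mathbf{c}_{3}$ has one more entry than the other codeword blocks.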

IV-B Proposed matrix-vector multiplication scheme

The above idea can naturally be adapted to the distributed matrix-vector multiplication setting. We show an example in Fig. 5 with $n=4$ workers and $s=2$ stragglers, so $k=n-s=2$. Suppose that matrix $\mathbf{A}$ is partitioned into $kq$ block-columns (the choice of $q$ will be discussed shortly). In our work, the presentation follows more naturally if we index the block-columns of $\mathbf{A}$ using two indices instead of one. In particular, they are indexed as $\mathbf{A}_{\langle i,j\rangle},i\in[k],j\in[q]$ (where $[\ell]$ denotes the set $\{0,\dots,\ell-1\}$) and each worker node stores at most $\gamma r$ columns of length $t$ ($\gamma$ is called the storage fraction).

Let $\mathbf{U}_{i}(D)=\sum_{j=0}^{q-1}\mathbf{A}^{T}_{\langle i,j\rangle}D^{j}$ for $0\leq i\leq k-1$ . Furthermore, let $\mathbf{Y}_{k,s}$ denote a $k\times s$ matrix whose $(i,j)$ -th submatrix is $(\mathbf{Y}_{k,s})_{i,j}=(D^{j})^{i}$ , for $i\in[k],j\in[s]$ , i.e., $\mathbf{Y}_{k,s}$ has the Vandermonde structure. We define

[TABLE]

Consider the encoding

[TABLE]

To arrive at the distributed matrix-vector multiplication scheme, we simply interpret the coefficients of the powers of $D$ in $\mathbf{C}_{i}(D)$ as the encoded submatrices assigned to worker $i$ (see Fig. 5 for an example). With this assignment, worker $i$ computes the product of each of its assigned submatrices with $\mathbf{x}$ . We say that a $k\times n$ matrix is maximum-distance-separable (MDS) if every one of its $k\times k$ submatrices is nonsingular. This property implies that $\mathbf{A}^{T}\mathbf{x}$ can be recovered as long as any $k$ workers complete their tasks. The following result shows that $\mathbf{G}_{mv}(D)$ is MDS; the proof appears in the Appendix.

Corollary 1 (Corollary of upcoming Theorem 1 given in Section IV-C).

Any $k\times k$ submatrix of $\mathbf{G}_{mv}(D)$ has a determinant which is a non-zero polynomial in $D$ , i.e., it is non-singular.
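The encoding step can be sketched as follows. This is a minimal illustration under two assumptions that are not spelled out in the text above: the generator matrix has the systematic form $[\mathbf{I}_{k}\ \ \mathbf{Y}_{k,s}(D)]$ , and block-column $\mathbf{A}_{\langle i,j\rangle}$ occupies columns $(iq+j)\,r/(kq)$ onward of $\mathbf{A}$ . The function name `encode_mv` is hypothetical.

```python
import numpy as np

def encode_mv(A, k, s, q):
    """Per worker, the list of coded blocks (coefficients of C_i(D)), lowest power first."""
    t, r = A.shape
    assert r % (k * q) == 0
    c = r // (k * q)                                   # columns per block
    # blocks[i][j] = A_<i,j>^T, shape (c, t); assumed layout: offset (i*q + j) * c
    blocks = [[A[:, (i * q + j) * c:(i * q + j + 1) * c].T for j in range(q)]
              for i in range(k)]
    workers = [list(blocks[i]) for i in range(k)]      # message workers: uncoded
    for j in range(s):                                 # parity worker k + j
        coeffs = [np.zeros((c, t)) for _ in range(q + j * (k - 1))]
        for i in range(k):
            for m in range(q):
                coeffs[m + i * j] += blocks[i][m]      # U_i(D) * D^{ij}: additions only
        workers.append(coeffs)
    return workers

rng = np.random.default_rng(1)
k, s, q = 2, 2, 4
A = rng.standard_normal((3, 16))
workers = encode_mv(A, k, s, q)
sizes = [len(w) for w in workers]                      # coded blocks held by each worker
```

With $k=s=2$ and $q=4$ , the last parity worker holds five coded blocks out of $kq=8$ , i.e., a fraction $5/8$ of $\mathbf{A}$ .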

Analogous to convolutional coding, we call the first $k$ workers the message workers and the last $s$ workers the parity workers. Each of the $k$ message workers receives $q$ submatrices $\mathbf{A}_{\langle i,j\rangle},j=0,1,\dots,q-1$ , each of which is a matrix of size $t\times r/(kq)$ . Each of the $s$ parity workers receives at least $q$ coded submatrices of the same size. The highest exponent of $D$ in the generator matrix $\mathbf{G}_{mv}(D)$ is $(s-1)(k-1)$ . Thus, the maximum storage needed by a worker is $q+(s-1)(k-1)$ submatrices. When $q$ is large enough, this imbalance is not significant. If we assume a bound of $\gamma$ on the storage capacity fraction of any worker, we need

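$$\frac{q+(s-1)(k-1)}{kq}\leq\gamma\quad\Longleftrightarrow\quad q\geq\frac{(s-1)(k-1)}{k\gamma-1}.\qquad(2)$$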

For example, in Fig. 5, $\gamma$ is set to $\frac{5}{8}$ which leads to $q=4$ .
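The arithmetic behind this choice can be checked with a small helper. It is a sketch that assumes the storage constraint takes the form $\frac{q+(s-1)(k-1)}{kq}\leq\gamma$ (the worst-case parity worker holds $q+(s-1)(k-1)$ of the $kq$ equally sized blocks); the name `min_q` is illustrative.

```python
from fractions import Fraction
from math import ceil

def min_q(k, s, gamma):
    """Smallest integer q with (q + (s-1)(k-1)) / (k q) <= gamma (assumed constraint)."""
    gamma = Fraction(gamma)
    if gamma * k <= 1:
        raise ValueError("need gamma > 1/k for any finite q to work")
    bound = Fraction((s - 1) * (k - 1)) / (gamma * k - 1)
    return max(1, ceil(bound))

q_fig5 = min_q(k=2, s=2, gamma=Fraction(5, 8))
```

Exact rational arithmetic avoids rounding the bound the wrong way when $k\gamma-1$ is small, which is precisely the regime where $q$ becomes large.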

IV-C Proposed matrix-matrix multiplication scheme

The matrix-matrix multiplication case requires the generalization of the above ideas. Let $\bar{a}=[a_{0}\leavevmode\nobreak\ a_{1}\leavevmode\nobreak\ \dots\leavevmode\nobreak\ a_{s-1}]$ and $\bar{b}=[b_{0}\leavevmode\nobreak\ b_{1}\leavevmode\nobreak\ \dots\leavevmode\nobreak\ b_{k-1}]$ be vectors of non-negative integers such that $0\leq a_{0}<a_{1}<\dots<a_{s-1}$ and $0\leq b_{0}<b_{1}<\dots<b_{k-1}$ . Let $\mathbf{Y}_{\bar{b},\bar{a}}(D)$ denote a $k\times s$ matrix whose $(i,j)$ -th entry is given by

$$\left(\mathbf{Y}_{\bar{b},\bar{a}}(D)\right)_{i,j}=D^{a_{j}b_{i}},\quad i\in[k],\;j\in[s].\qquad(3)$$

Using this matrix, define a generalization of $\mathbf{G}_{mv}(D)$ as follows

$$\mathbf{G}(D)=\begin{bmatrix}\mathbf{I}_{k}&\mathbf{Y}_{\bar{b},\bar{a}}(D)\end{bmatrix}.\qquad(4)$$

Observe that we obtain $\mathbf{G}_{mv}(D)$ by setting $a_{j}=j,0\leq j\leq s-1$ and $b_{i}=i,0\leq i\leq k-1$ , which corresponds to $\mathbf{Y}_{k,s}(D)$ . We will design an encoding scheme for matrix-matrix multiplication whose equivalent generator matrix is of the form in (4). Before we explain the design, we show that this matrix also satisfies the MDS property (the proof appears in the Appendix).

Theorem 1.

Any $k\times k$ submatrix of the generator matrix $\mathbf{G}(D)$ defined in (4) is non-singular.

While non-singularity by itself does not reveal information about the corresponding condition numbers, Theorem 1 provides a class of schemes with a specific structure that have excellent numerical stability (see Fig. 4 “All Ones” curve) and can be modified and analyzed for condition number using the techniques discussed in Theorem 2 within Section V. The structure of $\mathbf{G}(D)$ in (4) also allows for an efficient peeling decoder.

In the matrix-matrix case, we design generator matrices $\mathbf{G}_{A}(D)$ of size $k_{A}\times n$ and $\mathbf{G}_{B}(D)$ of size $k_{B}\times n$ such that $s=n-k_{A}k_{B}$ . Each worker stores fractions $\gamma_{A}$ and $\gamma_{B}$ of matrices $\mathbf{A}$ and $\mathbf{B}$ respectively. Let $z$ be a large enough positive integer and let

[TABLE]

Furthermore, we let $\mathbf{U}^{A}(D)=[\mathbf{U}^{A}_{0}(D)\leavevmode\nobreak\ \dots\leavevmode\nobreak\ \mathbf{U}^{A}_{k_{A}-1}(D)]$ and $\mathbf{U}^{B}(D)=[\mathbf{U}^{B}_{0}(D)\leavevmode\nobreak\ \dots\leavevmode\nobreak\ \mathbf{U}^{B}_{k_{B}-1}(D)]$ . The final goal of the master node is to recover all products of the form $\mathbf{A}^{T}_{\langle i_{1},j_{1}\rangle}\mathbf{B}_{\langle i_{2},j_{2}\rangle}$ for $i_{1}\in[k_{A}],j_{1}\in[q_{A}],i_{2}\in[k_{B}],j_{2}\in[q_{B}]$ . Once again by forming

[TABLE]

we can represent the assignment of coded submatrices of $\mathbf{A}$ and $\mathbf{B}$ to worker node $i$ by the coefficients of $\mathbf{C}^{A}_{i}(D)$ and $\mathbf{C}^{B}_{i}(D)$ respectively. Following this step, each worker node computes the pairwise product of each coded submatrix of $\mathbf{A}$ and coded submatrix of $\mathbf{B}$ assigned to it.

The matrices $\mathbf{G}_{A}(D)$ and $\mathbf{G}_{B}(D)$ will be picked in such a way so that the pairwise product of each coefficient of $\mathbf{C}^{A}_{i}(D)$ and each coefficient of $\mathbf{C}^{B}_{i}(D)$ appears in $\mathbf{C}^{A}_{i}(D)\times\mathbf{C}^{B}_{i}(D)$ , i.e., each worker node equivalently computes $\mathbf{C}^{A}_{i}(D)\times\mathbf{C}^{B}_{i}(D)$ . Using MATLAB notation and Kronecker product properties, for $i=1,2,\dots,n$ , we have

[TABLE]

where $\otimes$ denotes the Kronecker product. Therefore, the computation performed by the worker nodes can be compactly represented using the Khatri-Rao product [25] (denoted by $\odot$ ; for two matrices with the same number of columns, the Khatri-Rao product is the matrix obtained by taking the Kronecker product of the corresponding columns). Moreover, using the properties of the Khatri-Rao product, we have

[TABLE]
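The Khatri-Rao property used here is the standard identity $(\mathbf{X}\otimes\mathbf{Y})(\mathbf{P}\odot\mathbf{Q})=(\mathbf{X}\mathbf{P})\odot(\mathbf{Y}\mathbf{Q})$ . The following sketch checks it numerically; as a simplifying assumption, real scalars stand in for the polynomial entries, and the helper `khatri_rao` is written out by hand rather than taken from a library.

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Kronecker product of X (a x n) and Y (b x n), giving (a*b x n)."""
    assert X.shape[1] == Y.shape[1]
    return np.stack([np.kron(X[:, j], Y[:, j]) for j in range(X.shape[1])], axis=1)

rng = np.random.default_rng(2)
kA, kB, n = 2, 3, 5
UA, UB = rng.standard_normal((1, kA)), rng.standard_normal((1, kB))  # stand-ins for U^A, U^B
GA, GB = rng.standard_normal((kA, n)), rng.standard_normal((kB, n))  # stand-ins for G_A, G_B

lhs = khatri_rao(UA @ GA, UB @ GB)            # entry i: product computed by worker i
rhs = np.kron(UA, UB) @ khatri_rao(GA, GB)    # (U^A kron U^B)(G_A khatri-rao G_B)
```

The identity is what lets the per-worker products be read as a single code with generator $\mathbf{G}_{A}(D)\odot\mathbf{G}_{B}(D)$ acting on the $k_{A}k_{B}$ pairwise products.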

The key idea at this point is to ensure that $\mathbf{G}_{A}(D)\odot\mathbf{G}_{B}(D)$ has the structure of a matrix as in (4). Towards this end, we choose

[TABLE]

where $\mathbf{1}_{k_{B}}$ is an all-ones row vector of length $k_{B}$ , and the total number of rows in $\mathbf{G}_{A}(D)$ and $\mathbf{G}_{B}(D)$ are $k_{A}$ and $k_{B}$ respectively. This implies that

[TABLE]

where $k=k_{A}k_{B}$ . The following lemma shows that the RHS of (8) has the structure of the matrix in (4).

Lemma 1.

The Khatri-Rao product $\mathbf{Y}_{k_{A},s}(D^{z})\odot\mathbf{Y}_{k_{B},s}(D)$ is a matrix in the form of (3).

Proof.

Note that the Kronecker product of $\ell$ -th column of $\mathbf{Y}_{k_{A},s}(D^{z})$ and $\ell$ -th column of $\mathbf{Y}_{k_{B},s}(D)$ can be expressed as

[TABLE]

The vector on the RHS above consists of powers of $D^{\ell}$ and can be seen to be in the form of (3). ∎

Lemma 1 explains why Theorem 1 is applicable to the coding scheme used for matrix-matrix multiplication. Thus, this lemma, along with Theorem 1 implies that the proposed convolutional code based matrix-matrix multiplication scheme is MDS.
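Lemma 1 can be verified on exponents alone, since multiplying monomials adds their exponents. The sketch below assumes that entry $(i,j)$ of $\mathbf{Y}_{k,s}(D)$ is $D^{ij}$ and that a matrix in the form of (3) has exponents $a_{j}b_{i}$ ; the helper names are hypothetical.

```python
import numpy as np

def vandermonde_exponents(k, s, step=1):
    """Exponent of D in entry (i, j) of Y_{k,s}(D^step), namely step * i * j."""
    return step * np.outer(np.arange(k), np.arange(s))

def khatri_rao_exponents(EA, EB):
    """Exponent matrix of the Khatri-Rao product of two monomial matrices."""
    kA, s = EA.shape
    kB, _ = EB.shape
    # row index i1 * kB + i2 follows the Kronecker ordering of each column
    return (EA[:, None, :] + EB[None, :, :]).reshape(kA * kB, s)

kA, kB, s, z = 2, 2, 2, 4
E = khatri_rao_exponents(vandermonde_exponents(kA, s, z), vandermonde_exponents(kB, s))
a = np.arange(s)        # a_l = l
b = E[:, 1]             # column l = 1 exposes b = [z*i1 + i2]
```

For these parameters the exponent vector is $\bar{b}=[0\ 1\ 4\ 5]$ , which is strictly increasing as required.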

We now need to choose a value of $z$ which ensures that $\left[\mathbf{U}^{A}(D)\otimes\mathbf{U}^{B}(D)\right]$ in (7) contains all the distinct pairwise products that we are interested in. We know that worker $i$ will be assigned jobs according to column $i$ of the RHS in (8). By examining the structure of the RHS in (8), it can be verified that for $i=0,1,2,\dots,k-1$ , worker $i$ will be assigned $q_{A}$ submatrices from $\mathbf{A}$ and $q_{B}$ submatrices from $\mathbf{B}$ , while for $i=k,k+1,k+2,\dots,n-1$ , worker $i$ will be assigned $q_{A}+(i-k)\times(k_{A}-1)$ submatrices from $\mathbf{A}$ and $q_{B}+(i-k)\times(k_{B}-1)$ submatrices from $\mathbf{B}$ . Thus the maximum number of submatrices is assigned to worker $n-1$ , which has $q_{A}+(s-1)\times(k_{A}-1)$ submatrices from $\mathbf{A}$ and $q_{B}+(s-1)\times(k_{B}-1)$ submatrices from $\mathbf{B}$ , since $s=n-k$ . For the assignment of this worker,

[TABLE]

It can be verified that $\mathbf{C}^{A}_{n-1}(D)$ is a polynomial in $D$ in which the exponent of $D$ in every term is an integer multiple of $z$ . Since each $\mathbf{U}^{B}_{i}(D)$ has degree $q_{B}-1$ , the degree of $\mathbf{C}^{B}_{n-1}(D)$ is $q_{B}-1+(s-1)(k_{B}-1)$ , and thus we conclude that

[TABLE]

It should be noted that this value of $z$ is large enough for (9) to hold.

Next, using an approach similar to (2), we can derive

[TABLE]

Example 1.

Consider the computation of $\mathbf{A}^{T}\mathbf{B}$ over $n=6$ workers and $s=2$ stragglers. Assume that each worker can store/process $\gamma_{A}=5/8$ fraction of matrix $\mathbf{A}$ and $\gamma_{B}=2/3$ fraction of matrix $\mathbf{B}$ . We set $k_{A}=k_{B}=2$ , so that $q_{A}=4$ and $q_{B}=3$ . By setting $z=q_{B}+(s-1)(k_{B}-1)=4$ , we obtain

[TABLE]

Furthermore,

[TABLE]

The assignment of jobs to all the workers can be obtained from $[\,\mathbf{U}^{A}_{0}(D)\;\;\mathbf{U}^{A}_{1}(D)\,]\,\mathbf{G}_{A}(D)$ and $[\mathbf{U}^{B}_{0}(D)\;\;\;\;\mathbf{U}^{B}_{1}(D)]\,\mathbf{G}_{B}(D)$ . This is shown in Fig. 6.
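The block counts in Example 1, and the role of $z$ , can be reproduced by tracking only the exponents of $D$ that occur at each worker. The sketch assumes $\mathbf{U}^{A}_{i}(D)=\sum_{m}\mathbf{A}^{T}_{\langle i,m\rangle}D^{zm}$ , $\mathbf{U}^{B}_{i}(D)=\sum_{m}\mathbf{B}_{\langle i,m\rangle}D^{m}$ , and that parity worker $k+j$ uses column $j$ of $\mathbf{Y}_{k_{A},s}(D^{z})$ and $\mathbf{Y}_{k_{B},s}(D)$ ; the function name is illustrative.

```python
def assignment(kA, kB, qA, qB, s, z):
    """Per worker: (#coded A blocks, #coded B blocks, product exponents all distinct?)."""
    k = kA * kB
    out = []
    for w in range(k + s):
        if w < k:                       # message worker: one U^A_i and one U^B_i
            exp_a = {z * m for m in range(qA)}
            exp_b = set(range(qB))
        else:                           # parity worker k + j
            j = w - k
            exp_a = {z * (m + j * i) for i in range(kA) for m in range(qA)}
            exp_b = {m + j * i for i in range(kB) for m in range(qB)}
        sums = [a + b for a in exp_a for b in exp_b]
        # distinct sums mean each power of D in C^A_w(D) C^B_w(D) is one pairwise product
        out.append((len(exp_a), len(exp_b), len(sums) == len(set(sums))))
    return out

example = assignment(kA=2, kB=2, qA=4, qB=3, s=2, z=4)     # Example 1
too_small = assignment(kA=2, kB=2, qA=4, qB=3, s=2, z=3)   # z below the stated value
```

With $z=4$ the last worker holds $5$ coded blocks of $\mathbf{A}$ and $4$ of $\mathbf{B}$ , matching $\gamma_{A}=5/8$ and $\gamma_{B}=2/3$ ; with $z=3$ two different pairwise products would land on the same power of $D$ at that worker.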

Remark 1.

Our proposed encoding process is very simple and involves only additions at the master node.

IV-D Decoding algorithm: Peeling decoder

Suppose that we obtain results from workers in $\mathcal{I}\subset\{0,1,\dots,n-1\}$ , with $|\mathcal{I}|\geq k$ . We describe the decoding process below in detail for the matrix-vector case; the discussion is quite similar for the matrix-matrix case.

In the matrix-vector case our unknowns are $\mathbf{u}_{il}=\mathbf{A}^{T}_{\langle i,l\rangle}\mathbf{x},i\in[k],l\in[q]$ ; each of these is a vector of length $r/(kq)$ . Let row-vector $\mathbf{z}_{j}$ denote the collection of the $j$ -th entries of each of these unknowns, where $j\in[r/(kq)]$ . Let the output of the worker nodes corresponding to $\mathbf{z}_{j}$ be denoted by $\mathbf{y}_{j}$ . The length of $\mathbf{y}_{j}$ depends on $\mathcal{I}$ .

We assume that the master node obtains results from a subset of the message workers, $\mathcal{I}_{1}\subset\{0,1,\dots,k-1\}$ , so that $|\mathcal{I}_{1}|\leq k$ . This implies that it can recover $|\mathcal{I}_{1}|q$ unknowns directly. Moreover, it obtains results from the parity workers indexed by $\mathcal{I}_{2}\subset\{k,k+1,\dots,n-1\}$ , where $|\mathcal{I}_{2}|=k-|\mathcal{I}_{1}|$ . Thus, it needs to recover the remaining $kq-|\mathcal{I}_{1}|q$ unknowns.

The underlying structure of the convolutional code allows for a very simple peeling decoder whereby, at each step, the algorithm is guaranteed to find an equation with only one unknown. We demonstrate this by means of an example in Appendix -B. Crucially, the scheme can be decoded purely with add/subtract operations and can thus be highly optimized. This algorithm is very fast and has excellent numerical stability (cf. Fig. 4) in experiments.

Decoding Complexity: We consider the worst case where $|\mathcal{I}_{2}|=s$ . According to the design of this scheme, each of the $kq$ unknowns appears once in every parity worker, and thus the system of equations has at most $kqs$ non-zero entries. Furthermore, in a peeling decoder one variable can be decoded and substituted in the remaining equations at each iteration. Therefore, the time complexity of solving this sparse system is $O(kqs)$ . As we solve a total of $r/(kq)$ such systems of equations, the total time taken is $O(rs)$ which is independent of $q$ and thus does not grow with it; similarly it can be shown that for the matrix-matrix case the time is $O(rws)$ .

It should be noted that the matrices $\mathbf{A}$ and $\mathbf{B}$ are of sizes $t\times r$ and $t\times w$ respectively, thus the computational complexity of computing $\mathbf{A}^{T}\mathbf{B}$ is $O(rwt)$ . In a distributed system, this job is distributed over $n$ workers with $s$ stragglers, so, on average, the computational complexity of each of the workers is $O\left(\frac{rwt}{k}\right)$ , where $k=n-s$ . On the other hand, to get the final result, we need to recover $rw$ unknowns, which is the size of $\mathbf{A}^{T}\mathbf{B}$ . Thus the decoding complexity does not depend on the parameter $t$ , which indicates that the decoding time can often be considered negligible in comparison to the worker computation time when $t$ is very large [16]. Nevertheless, fast decoding is a desirable feature of any coded computation scheme.

IV-E Effect of $q$ : storage fraction, imbalance in task assignment

Our presented scheme thus far is provably MDS, efficiently decodable and has excellent numerical stability in experiments. Note that our schemes require lower bounds on the value of $q$ which have an inverse dependence on $\gamma-1/k$ . Thus, if one wants to reduce the imbalance between the task assignments to the message nodes and the parity nodes, then $q$ needs to be chosen large enough. It turns out that for large values of $q$ , the worst case condition number of our scheme can be very large. We present a theoretical treatment of this phenomenon in the upcoming Section V and discuss techniques for mitigating this effect.

V Numerical stability analysis

To understand numerical stability, we first introduce a modified encoding scheme and then discuss the matrix representation of the coding ideas described above.

Definition 1 (Randomly scaled generator matrix).

Let $\mathbf{R}$ be a $k\times s$ matrix of real numbers. Consider the generator matrix $\mathbf{G}(D)$ defined in (4). Replace $\mathbf{Y}_{\bar{b},\bar{a}}(D)$ by $\mathbf{R}\circ\mathbf{Y}_{\bar{b},\bar{a}}(D)$ . Here, $\circ$ denotes Hadamard product (.* operation in MATLAB).

Note that if we set $r_{ij}=1$ for all entries of the matrix $\mathbf{R}$ , we recover the old generator matrix $\mathbf{G}(D)$ (the “All-Ones” case).

V-1 Understanding the matrix representation

It is not hard to see that the matrix representation of the transformation induced by the $k\times n$ generator polynomial matrix $\mathbf{G}(D)$ from Definition 1 can be understood as right multiplying a $kq$ -length row vector of input data by the following matrix. An example of this was given in Section IV-A.

Definition 2 ( $\tilde{\mathbf{G}}$ : matrix representation of $\mathbf{G}(D)$ ).

We first define a $q\times(q+h)$ shift matrix that takes a $q$ -length row vector and returns a $q+h$ -length row vector, where the original vector is shifted to the right by $j$ components. This is the matrix $\tilde{\mathbf{D}}^{h;j}\triangleq\begin{bmatrix}\mathbf{0}_{q\times j}&\mathbf{I}_{q}&\mathbf{0}_{q\times(h-j)}\end{bmatrix}$ . The $(i,\ell)$ -th block matrix of $\tilde{\mathbf{G}}$ for $\ell=0,1,\dots,k-1$ and $i=0,1,\dots,k-1$ is

[TABLE]

and for $\ell=k+j$ , $j=0,1,\dots,s-1$ ,

[TABLE]

Thus, $\tilde{\mathbf{G}}$ is a $kq\times(nq+\delta)$ matrix where

[TABLE]

With the above definition, decoding can be understood as inverting the specific $k\times k$ block submatrix of $\tilde{\mathbf{G}}$ , denoted $\tilde{\mathbf{G}}_{\mathcal{I}}$ where $\mathcal{I}$ is the set of indices of the $k$ workers that have returned their jobs.
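The construction can be made concrete for the matrix-vector generator. The sketch below is an assumption-laden illustration: it takes $a_{j}=j$ , $b_{i}=i$ , uses block $r_{ij}\tilde{\mathbf{D}}^{h;ij}$ with $h=j(k-1)$ for parity worker $k+j$ , and identity or zero blocks for the message workers; these choices are inferred from the shift-matrix definition above rather than quoted from the paper. It then evaluates the condition number of every decoding submatrix by brute force.

```python
import numpy as np
from itertools import combinations

def shift(q, h, j):
    """q x (q+h) matrix [0_{q x j}, I_q, 0_{q x (h-j)}]: shifts a row vector right by j."""
    return np.hstack([np.zeros((q, j)), np.eye(q), np.zeros((q, h - j))])

def worker_blocks(k, s, q, R=None):
    """Column blocks of G-tilde, one per worker, for the assumed G(D) = [I_k, R o Y_{k,s}(D)]."""
    R = np.ones((k, s)) if R is None else R
    blocks = []
    for l in range(k):                                  # message workers
        blocks.append(np.vstack([np.eye(q) if i == l else np.zeros((q, q))
                                 for i in range(k)]))
    for j in range(s):                                  # parity worker k + j
        h = j * (k - 1)
        blocks.append(np.vstack([R[i, j] * shift(q, h, i * j) for i in range(k)]))
    return blocks

k, s, q = 2, 2, 5
blocks = worker_blocks(k, s, q)
G_tilde = np.hstack(blocks)
# worst 2-norm condition number over all sets I of k workers
kappa_worst = max(np.linalg.cond(np.hstack([blocks[i] for i in I]))
                  for I in combinations(range(k + s), k))
```

For $k=s=2$ the matrix has size $2q\times(4q+1)$ , in agreement with the example of Section IV-A.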

V-2 Quantifying round-off error amplification

When assuming perfectly noise-free computations, invertibility of the decoding matrix, $\tilde{\mathbf{G}}_{\mathcal{I}}$ , is sufficient to guarantee perfect recovery/decoding of the desired matrix-matrix product. However, since all computing devices are finite precision, matrix multiplications will frequently result in bit overflow/underflow and hence round-off errors. As explained earlier (cf. Section III-A), the decoding process amplifies the round-off error by a factor that can at most be as large as the condition number of the decoding matrix. Thus, the numerical stability of our scheme is quantified by the largest condition number over all block submatrices $\tilde{\mathbf{G}}_{\mathcal{I}}$ , i.e., by

[TABLE]

V-A Upper bounding $\kappa_{worst}$

Observe that the matrix $\tilde{\mathbf{G}}$ , and consequently the decoding submatrix $\tilde{\mathbf{G}}_{\mathcal{I}}$ with $|\mathcal{I}|=k$ , has a very specific structure. Because of this, it is possible to show that the matrix $\tilde{\mathbf{G}}_{\mathcal{I}}\tilde{\mathbf{G}}_{\mathcal{I}}^{T}$ is a $k\times k$ block matrix with Toeplitz blocks of size $q\times q$ (see Appendix -C). This fact is useful since the asymptotics of $\lambda_{\max}(\tilde{\mathbf{G}}_{\mathcal{I}}\tilde{\mathbf{G}}_{\mathcal{I}}^{T})$ and $\lambda_{\min}(\tilde{\mathbf{G}}_{\mathcal{I}}\tilde{\mathbf{G}}_{\mathcal{I}}^{T})$ when $q$ is large have been studied in [26]. In particular, Theorem $3$ of [26] shows that, using Fourier transform ideas, one can bound the eigenvalues of such matrices by computing the minimum of the smallest eigenvalue (and the maximum of the largest eigenvalue) of a much smaller $k\times k$ matrix that is a function of a scalar parameter $\omega$ which lies in $[-\pi,\pi]$ .

With some abuse of notation, let $\mathbf{G}_{\mathcal{I}}(e^{\textrm{i}\omega})$ represent the matrix obtained by extracting $\mathbf{G}_{\mathcal{I}}(D)$ (from $\mathbf{G}(D)$ in (4)) and then substituting $D=e^{\textrm{i}\omega}$ (where $\textrm{i}=\sqrt{-1}$ ). By adapting the results of [26] (see Appendix -C for a detailed description), we have the following theorem.

Theorem 2.

For $\mathcal{I}\subset\{0,\dots,n-1\}$ such that $|\mathcal{I}|=k$ , we have

[TABLE]

Moreover, for any $q$

[TABLE]

Theorem 2 shows that we can find an upper bound on the condition number of $\tilde{\mathbf{G}}_{\mathcal{I}}$ based on a scalar optimization over $\omega\in[-\pi,\pi]$ . When $\mathbf{R}$ is chosen to be the all-ones matrix, the characterization of Theorem 2 allows us to conclude that when $s>1$ , there exist choices of $\mathcal{I}\subseteq\{0,1,\dots,n-1\},|\mathcal{I}|=k$ such that $\tilde{\mathbf{G}}_{\mathcal{I}}\tilde{\mathbf{G}}_{\mathcal{I}}^{*}$ has a minimum eigenvalue that will go to zero as $q\rightarrow\infty$ . In particular, the corresponding $\mathbf{G}_{\mathcal{I}}(e^{\textrm{i}\omega})$ has repeated columns for $\omega=0$ .

Example 2.

Consider the $(n,k)=(4,2)$ example with $G(D)=\begin{bmatrix}1&0&1&1\\ 0&1&1&D\end{bmatrix}$ . Suppose that $\mathcal{I}=\{2,3\}$ . This implies that

[TABLE]

where $\mathbf{U}$ and $\mathbf{L}$ are $q\times q$ upper shift and lower shift matrices respectively (see, e.g., (17) in the Appendix).

The corresponding $\mathbf{G}_{\mathcal{I}}(e^{\textrm{i}\omega})\mathbf{G}_{\mathcal{I}}(e^{\textrm{i}\omega})^{*}$ can be obtained as

[TABLE]

Using Theorem 2, we can conclude therefore that $\lim_{q\to\infty}\lambda_{max}[\mathcal{T}]=2$ (achieved at $\omega=\pi$ ) and $\lim_{q\to\infty}\lambda_{min}[\mathcal{T}]=0$ (achieved at $\omega=0$ ). This implies therefore that as $q$ becomes larger and larger, the matrix $\tilde{\mathbf{G}}_{\mathcal{I}}$ becomes more and more ill-conditioned, though it is nonsingular for any fixed $q$ .

Therefore considering a nontrivial scaling of the parity part with a matrix $\mathbf{R}$ is essential for well-conditioned behavior when $q$ is very large.

V-B Randomly-weighted convolutional coding

We now show that choosing the matrix $\mathbf{R}$ randomly in Definition 1 results in better numerical stability than the All-Ones scheme in the regime of large $q$ but requires marginally higher decoding complexity.

The following result shows that the MDS property continues to hold with probability 1 when the entries of $\mathbf{R}$ are chosen i.i.d. from a continuous distribution. The proof is an easy consequence of Theorem 1 and appears in the Appendix.

Corollary 2.

If the entries of the matrix $\mathbf{R}$ are chosen i.i.d. from any continuous-valued probability distribution, then, any $k\times k$ submatrix of the generator matrix mentioned in Definition 1 is non-singular with probability one.

We now demonstrate that choosing the matrix $\mathbf{R}$ randomly allows us to upper bound the worst case condition number (over the recovery matrices) even when $q\rightarrow\infty$ . In the matrix-vector scenario, Theorem 2 suggests the following algorithm for choosing $\mathbf{R}$ . We proceed by randomly choosing $\mathbf{R}$ . Let $\mathcal{I}\subset\{0,\dots,n-1\},|\mathcal{I}|=k$ and let $\Omega=\{0,\pm\frac{\pi}{N},\pm\frac{2\pi}{N},\dots,\pm\frac{(N-1)\pi}{N},\pm\pi\}$ for a large positive integer $N$ denote a fine enough grid of the interval $[-\pi,\pi]$ . Let $\kappa_{\mathbf{R}}$ be defined as

[TABLE]

Thus, $\kappa_{\mathbf{R}}$ indicates the maximum condition number of $\mathbf{G}_{\mathcal{I}}(e^{\textrm{i}\omega})$ over all $\binom{n}{k}$ choices of $\mathcal{I}$ ; this is an upper bound on the maximum condition number of $\tilde{\mathbf{G}}_{\mathcal{I}}$ . The algorithm repeatedly generates choices of $\mathbf{R}$ and retains the choice that has the lowest value of $\kappa_{\mathbf{R}}$ ; this choice is denoted by $\mathbf{R}^{\star}$ . The matrix-matrix case is similar, except that we generate two random matrices denoted $\mathbf{R}_{A}$ and $\mathbf{R}_{B}$ and consider the worst case condition number of the appropriate submatrices of (8) to obtain $\mathbf{R}_{A}^{\star}$ and $\mathbf{R}_{B}^{\star}$ . We emphasize that even though the search requires optimizing over $\binom{n}{k}=\binom{n}{s}$ choices of $\mathcal{I}$ , this is a one-time cost for designing the coding scheme for a system with $n$ worker nodes which is resilient to $s=n-k$ stragglers. Furthermore, (i) the search does not have any dependence on $q$ , and (ii) the value of $s$ is typically a small constant that either does not grow or grows very slowly with $n$ . Thus the complexity of the above design, $n^{s}$ , grows polynomially in $n$ . Appendix -D presents some numerical results on the amount of time taken to find a good $\mathbf{R}$ matrix.
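The search procedure can be sketched for the matrix-vector case as follows. The code assumes the generator $[\mathbf{I}_{k}\ \ \mathbf{R}\circ\mathbf{Y}_{k,s}(D)]$ with $(\mathbf{Y}_{k,s})_{i,j}=D^{ij}$ , takes $\kappa_{\mathbf{R}}$ to be the largest 2-norm condition number of $\mathbf{G}_{\mathcal{I}}(e^{\textrm{i}\omega})$ over the grid $\Omega$ and all $\mathcal{I}$ , and samples entries uniformly from $[-1,1]$ ; the grid size, trial count, and function names are illustrative choices.

```python
import numpy as np
from itertools import combinations

def kappa_R(R, n_grid=64):
    """Max over worker subsets I and grid points w of cond(G_I(e^{iw}))."""
    k, s = R.shape
    expo = np.outer(np.arange(k), np.arange(s))          # exponent i*j of D in Y_{k,s}
    worst = 0.0
    for w in np.linspace(-np.pi, np.pi, 2 * n_grid + 1):  # grid with spacing pi / n_grid
        G = np.hstack([np.eye(k), R * np.exp(1j * w * expo)])
        for I in combinations(range(k + s), k):
            worst = max(worst, np.linalg.cond(G[:, list(I)]))
    return worst

def search_R(k, s, trials=20, seed=0):
    """Keep the random R with the smallest kappa_R among a fixed number of trials."""
    rng = np.random.default_rng(seed)
    best_R, best = None, np.inf
    for _ in range(trials):
        R = rng.uniform(-1.0, 1.0, size=(k, s))
        val = kappa_R(R)
        if val < best:
            best_R, best = R, val
    return best_R, best

R_star, kappa_star = search_R(k=2, s=2)
kappa_ones = kappa_R(np.ones((2, 2)))   # all-ones weights: singular symbol at w = 0
```

Note that the loop never touches $q$ : the cost depends only on the grid size, the number of trials, and the $\binom{n}{k}$ subsets.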

For systems with $n=12,s=3$ and $n=13,s=3$ , we conducted $50$ random trials each to find the corresponding $\mathbf{R}^{\star}$ for the matrix vector multiplication case; the entries were sampled i.i.d. from the uniform distribution on $[-1,1]$ . Our algorithm also returns the asymptotic upper bound on $\kappa(\mathbf{R}^{\star})$ . By sweeping over values of $q$ , we can also compute the actual worst-case condition number for each particular chosen value of $q$ . Fig. 7 depicts the upper bound and the actual worst case condition numbers for different $n$ and $s$ .

V-C Random convolutional coding: decoding algorithm

In principle, it is possible to use a fast peeling decoder for decoding as done earlier in the all-ones case. Note however that the peeling decoder solves a system of $kq$ equations in $kq$ variables. Thus, it only uses $kq$ columns of $\tilde{\mathbf{G}}_{\mathcal{I}}$ even though $\tilde{\mathbf{G}}_{\mathcal{I}}$ is a matrix of size $kq\times(kq+\delta^{\prime})$ where $\delta^{\prime}$ is an integer between zero and $\delta$ (cf. (11)), depending on which set of $k$ worker nodes finished their computations (in matrix-vector multiplication).

In particular, the stability of the peeling decoder depends on the condition number of the relevant full rank square submatrix of $\tilde{\mathbf{G}}_{\mathcal{I}}$ . In general, this condition number is higher than that of $\tilde{\mathbf{G}}_{\mathcal{I}}$ . In our numerical experiments we have found that for the all-ones case, the worst case condition numbers of both matrices ( $\tilde{\mathbf{G}}_{\mathcal{I}}$ and full rank square submatrix of $\tilde{\mathbf{G}}_{\mathcal{I}}$ ) are almost the same (see more experimental details in Section VI). This explains the numerically stable behavior of the peeling decoder in the all-ones case.

The situation changes considerably when we consider random scaling of the generator matrix. For example, when the entries of $\mathbf{R}$ are i.i.d. Gaussian, the difference between the two condition numbers is very large. In this case, the condition number of the full rank square submatrix of $\tilde{\mathbf{G}}_{\mathcal{I}}$ can be very high for certain sets of workers $\mathcal{I}$ (see Section VI). But in all cases, $\kappa_{worst}$ over all $\tilde{\mathbf{G}}_{\mathcal{I}}$ is significantly smaller than that of the all-ones case. Thus, it is clear that one should use all the columns of $\tilde{\mathbf{G}}_{\mathcal{I}}$ for decoding, rather than using only $kq$ equations.

Decoding Complexity: Similar to the discussion in Section IV-D, we assume that the fastest $k$ workers include the message worker set $\mathcal{I}_{1}$ and the parity worker set $\mathcal{I}_{2}$ , so that $|\mathcal{I}_{1}|+|\mathcal{I}_{2}|=k$ . We can decode some unknowns directly from the workers in $\mathcal{I}_{1}$ , and in the worst case, we need to recover the other $sq$ unknowns from the parity workers in $\mathcal{I}_{2}$ . In this case, one can solve a least squares (LS) problem to recover the $sq$ unknowns. This LS problem can be solved in different ways. The most straightforward way would be matrix inversion ( $O\left((sq)^{3}\right)$ time) followed by solving $\frac{rw}{kq}$ systems of equations ( $O\left(\frac{rw}{kq}(sq)^{2}\right)$ time). If $sq\ll r,w$ , the total can be written as $O\left(\frac{rw}{k}s^{2}q\right)$ . On the other hand, if $q$ is large, we can use techniques such as conjugate gradient descent to solve the LS problem; this is especially useful because the underlying system of equations is sparse, so each iteration can be carried out quickly. In particular, if we run it for $T$ iterations to recover these $sq$ unknown blocks, the decoding complexity is $O\left(\frac{rw}{kq}\times sq\times s\times T\right)=O\left(\frac{rw}{k}s^{2}T\right)$ . To reach within an $\epsilon$ fraction of the solution, the number of iterations scales as $O(\kappa\log(1/\epsilon))$ where $\kappa$ is the condition number of the linear system of equations.
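The iterative option can be illustrated with a hand-written conjugate gradient loop on the normal equations. This is only a sketch under stated assumptions: it treats the $k=s=2$ case where both returned workers are parity workers, builds their columns of $\tilde{\mathbf{G}}_{\mathcal{I}}$ from a fixed, hypothetical weight matrix $\mathbf{R}$ chosen for illustration (not a value from the paper), and uses dense NumPy arrays even though a practical implementation would exploit sparsity.

```python
import numpy as np

def parity_block(R, j, q):
    """Columns of G-tilde for parity worker k + j (assumed block r_ij * shift by i*j)."""
    k = R.shape[0]
    B = np.zeros((k * q, q + j * (k - 1)))
    for i in range(k):
        for m in range(q):
            B[i * q + m, m + i * j] = R[i, j]
    return B

def cg_least_squares(G, y, iters=200, tol=1e-12):
    """Conjugate gradient on (G G^T) z = G y, i.e. least squares for z G = y."""
    z = np.zeros(G.shape[0])
    r = G @ y                           # residual of the normal equations at z = 0
    p, rs = r.copy(), r @ r
    for _ in range(iters):
        if rs < tol ** 2:
            break
        Ap = G @ (G.T @ p)              # two matrix-vector products per iteration
        alpha = rs / (p @ Ap)
        z += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z

q = 6
R = np.array([[1.0, 0.5], [1.0, -2.0]])            # hypothetical weights
G_I = np.hstack([parity_block(R, 0, q), parity_block(R, 1, q)])
rng = np.random.default_rng(4)
z_true = rng.standard_normal(2 * q)                # one row vector of unknowns
y = z_true @ G_I                                   # what the two parity workers return
z_hat = cg_least_squares(G_I, y)
```

All $kq+\delta^{\prime}$ returned equations enter the solve, in line with the recommendation above to use every column of $\tilde{\mathbf{G}}_{\mathcal{I}}$ .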

Overall the decoding complexity of the random convolutional code setting is marginally higher than the All-Ones case, depending on which algorithm is used for the LS solution.

VI Comparisons and Numerical Experiments

In this section, we discuss the results of the numerical experiments for our proposed approaches and compare our methods with other available methods.

The polynomial code approach [5] suffers from the problem that real Vandermonde matrices have condition numbers that are exponential in their size. This in turn implies that for large number of workers (for example, $30$ workers) the condition number of the decoding matrix is so high that the recovered result by the master node is actually useless.
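The growth of the condition number is easy to observe numerically. The sketch below uses equispaced real nodes in $[-1,1]$ purely as an illustrative assumption; these are not necessarily the evaluation points used in [5], and the decoding matrix there is a submatrix determined by which workers respond.

```python
import numpy as np

sizes = [5, 10, 20, 30]
conds = []
for m in sizes:
    nodes = np.linspace(-1.0, 1.0, m)            # assumed evaluation points
    V = np.vander(nodes, increasing=True)        # m x m real Vandermonde matrix
    conds.append(np.linalg.cond(V))              # 2-norm condition number

for m, c in zip(sizes, conds):
    print(f"{m:3d} nodes: condition number ~ {c:.2e}")
```

At the larger sizes the computed values approach the limits of double precision, which is the regime in which the decoded result loses all accuracy.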

To avoid this numerical issue, Section VII of [4] remarks that the real computation can be embedded within a large enough finite field of prime order $p$ . It turns out that the performance of this scheme is strongly dependent on the entries of $\mathbf{A}$ and $\mathbf{B}$ and the resultant normalized MSE can be quite bad. These arguments have appeared in [19]; we present an outline below.

We note that computations in this method are error-free only when each entry of the product matrix $\mathbf{A}^{T}\mathbf{B}$ is an integer in $\{0,1,\dots,p-1\}$ . If this requirement is violated, the proposed mod- $p$ computations can return catastrophically wrong answers [19]. This means that the matrices $\mathbf{A}$ and $\mathbf{B}$ need to be multiplied by a scalar and quantized so that each entry of the resulting matrix is an integer that is within the appropriate range. Suppose that the absolute values of the entries of $\mathbf{A}$ and $\mathbf{B}$ are upper bounded by $\alpha$ ; then we need $\alpha^{2}t<p$ . This is referred to as the dynamic range constraint in [19]. For instance, with $64$ -bit integers (the standard on present day computers), the largest integer is $\approx 10^{19}$ . Thus, even if $t<10^{5}$ , the method can only support $\alpha\leq 10^{7}$ . Thus, the range is rather limited.

The work of [19] constructs adversarial integer matrices $\mathbf{A}$ and $\mathbf{B}$ for this method as follows. Let $p=2147483647$ (note that this is much larger than the value $p=65537$ used in the publicly available code of [5]) so that their method can support a higher dynamic range. Next let $r=w=t=400$ . This implies that $\alpha$ needs to be $\leq 1000$ by the dynamic range constraint. The matrices have the following block decomposition.

[TABLE]

Each $\mathbf{A}_{i,j}$ and $\mathbf{B}_{i,j}$ is a matrix of size $200\times 200$ , with entries chosen from the following distributions: $\mathbf{A}_{0,0}$ , $\mathbf{A}_{0,1}$ are distributed as $\text{Unif}(0,\dots,9999)$ and $\mathbf{A}_{1,0}$ , $\mathbf{A}_{1,1}$ as $\text{Unif}(0,\dots,9)$ ; $\mathbf{B}_{0,0}$ , $\mathbf{B}_{0,1}$ are distributed as $\text{Unif}(0,\dots,9)$ and $\mathbf{B}_{1,0},\mathbf{B}_{1,1}$ as $\text{Unif}(0,\dots,9999)$ . In this scenario, the dynamic range constraint requires us to multiply each matrix by $0.1$ and quantize each entry to an integer between $0$ and $999$ . Note that this implies that $\mathbf{A}_{1,0},\mathbf{A}_{1,1},\mathbf{B}_{0,0},\mathbf{B}_{0,1}$ are all quantized into zero submatrices since every entry in these four submatrices is less than $10$ . We emphasize that the finite field embedding technique only recovers the product of these quantized matrices. However, this product is the all-zeros matrix, i.e., the decoded matrix will also be the all-zeros matrix. Therefore, the normalized MSE in this case will be 100%. There are also significant computational issues as discussed in [19]. We note here that such adversarial examples can be found even for larger choices of $p$ . It is worth noting that the normalized MSE of the other methods does not depend on the actual values of $\mathbf{A}$ and $\mathbf{B}$ .
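The effect of the quantization can be reproduced at a small scale. The sketch below shrinks the blocks from $200\times 200$ to $4\times 4$ and assumes that quantization means scaling by $0.1$ and rounding down; both choices are illustrative assumptions, and the normalized error is computed as the squared Frobenius norm of the difference divided by that of the true product.

```python
import numpy as np

rng = np.random.default_rng(3)
b = 4                                              # block size (200 in the text)

def big():
    return rng.integers(0, 10000, (b, b))          # entries in {0, ..., 9999}

def small():
    return rng.integers(0, 10, (b, b))             # entries in {0, ..., 9}

A = np.block([[big(), big()], [small(), small()]])
B = np.block([[small(), small()], [big(), big()]])

def quantize(M):
    """Scale by 0.1 and round down (assumed rule); results lie in {0, ..., 999}."""
    return np.floor(0.1 * M).astype(np.int64)

prod_q = quantize(A).T @ quantize(B)               # what a mod-p scheme could recover
true = A.T @ B
# undo the two 0.1 scalings (factor 100) before comparing with the true product
rel_err = np.linalg.norm(100 * prod_q - true) ** 2 / np.linalg.norm(true) ** 2
```

The bottom block-row of the quantized $\mathbf{A}$ and the top block-row of the quantized $\mathbf{B}$ vanish, so every term of the quantized product contains a zero factor.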

The work of [10] uses orthogonal polynomials and Chebyshev-Vandermonde matrices for the encoding part, which significantly improves the condition number of the decoding matrices compared to [5] and [6]. The work in [11] uses the random Khatri-Rao product, where random coefficients are used for the encoding, which further improves the numerical stability. The recent preprint [12] uses circulant and permutation matrices to improve the numerical stability of the polynomial approach. We compare our approaches with these methods through exhaustive numerical experiments performed over a cluster in AWS (Amazon Web Services). A t2.2xlarge machine is used as the master node and t2.small machines are used as the worker nodes. Software code for recreating these experiments can be found at [27].

Comparing $\kappa_{worst}$ and MSE for Matrix-matrix case: For a system with $n=18$ workers and $s=3$ stragglers for matrix-matrix multiplication, we set $\gamma_{A}=\frac{1}{4}$ and $\gamma_{B}=\frac{2}{5}$ with $k_{A}=5$ and $k_{B}=3$ , so $k=k_{A}k_{B}=n-s=15$ . Table II reports a comparison of the worst-case condition numbers for different approaches in the literature. It can be observed that the work of [5] and [10] have much higher condition numbers than our proposed schemes (All-ones and Random). Both our approaches are also better than the work of [11] in terms of worst case condition number ( $\kappa_{worst}$ ) values. We point out that the methods in [20] and [8] are developed for matrix-vector multiplication, so those are not applicable for this comparison.

In our next experiment we compare the mean-squared error (MSE) of the different matrix-matrix multiplication methods for their respective worst case scenarios when $n=18$ and $s=3$ . For matrix-matrix case, we define MSE as

[TABLE]

where $\widehat{\mathbf{A}^{T}\mathbf{B}}$ is the recovered result and $\mathbf{A}^{T}\mathbf{B}$ is the actual result. Here, the matrices $\mathbf{A}$ and $\mathbf{B}$ are of size $15,000\times 10,080$ and $15,000\times 12,000$ respectively. We simulate errors in the worker node computations by adding white Gaussian noise to the calculated submatrix products obtained from the worker nodes and sweeping the range of SNRs. The results appear in Fig. 4 (for additive Gaussian noise) and Fig. 8 (for round-off errors). In Fig. 4 we observe that even at $SNR=70dB$ , our approach is around $9$ , $4$ and $2$ orders of magnitude better than [5], [10] and [11], respectively. The corresponding decoding time is also reported in the legend, which shows that the decoding time for our approaches compares quite well with other approaches. The behavior of the curves in Fig. 8 is similar in nature.

Comparing $\kappa_{worst}$ and MSE for the matrix-vector case: We next compare the worst-case condition number of the decoding matrix for different matrix-vector multiplication approaches. Table III shows the worst-case condition number for a scenario with $n=30$ workers and $s=2$ stragglers, where each worker node can store a $\gamma_{A}=\frac{1}{25}$ fraction of matrix $\mathbf{A}$. The table shows that the approaches in [5] and [20] yield much larger condition numbers than the others, and that our proposed approaches yield lower condition numbers than the approaches of [10] and [11].

In our next experiment we compare the normalized MSE of the different methods for their respective worst-case scenarios. For the matrix-vector case, we define the MSE as

[TABLE]

where $\widehat{\mathbf{A}^{T}\mathbf{x}}$ is the recovered result and $\mathbf{A}^{T}\mathbf{x}$ is the actual result. We consider the same scenario with $n=30$ and $s=2$, where the matrix $\mathbf{A}$ is of size $30,000\times 31,500$ and the vector $\mathbf{x}$ is of length $30,000$. We want to compute the product $\mathbf{A}^{T}\mathbf{x}$. Fig. 9 shows the normalized MSE of the different approaches at different SNRs. From the figure we can see that our proposed approaches perform significantly better than all other schemes except the scheme of [12]. This supports our condition number results in Table III. For example, at an SNR of $60$ dB, the approach in [11] has around $1.6\%$ error, whereas our all-ones and random convolutional code approaches have only $0.5\%$ and $0.2\%$ error, respectively, in the worst case.

**Comparing [12] and our approach:** It can be observed that the recent preprint of [12] has the best $\kappa_{worst}$ and MSE numbers for both the matrix-matrix and matrix-vector scenarios. However, our work has much simpler encoding (additions/subtractions in the All-Ones case) and decoding (peeling decoder) than their method. Our work is also the first to propose a convolutional coding strategy for this problem.

**Comparing [11] and our approach:** The Random KR approach can be considered a specific instance of our random scaling method in which the scaling is applied to a trivial all-ones parity matrix instead of a carefully designed $\mathbf{Y}_{\bar{b},\bar{a}}(D)$. As both approaches are random and pick the best choices, we conducted an experiment in which we ran 100 trials for both methods (with $n=20$ and $s=3,4,5$) and picked the respective best choices (see Fig. 10 for the corresponding worst-case condition numbers). The structure imposed in our construction clearly improves the condition number compared to the work of [11].

**Comparing our All-ones and random approaches:** Recall that for our methods $q_{A}$ and $q_{B}$ increase when $\gamma_{A}-1/k_{A}$ and $\gamma_{B}-1/k_{B}$ become smaller (cf. Sections IV-B and IV-C). Table IV shows a comparison of our proposed approaches in terms of decoding time and worst-case condition number for three different values of $\gamma=\gamma_{A}=\gamma_{B}$. The following inferences can be drawn.

• The decoding time remains more or less constant for the all-ones case, whereas for the random case it can increase with decreasing $\gamma$ because a least-squares (LS) problem must be solved.

• The worst-case condition number for the all-ones case continues to increase with decreasing $\gamma$, whereas it saturates for the random case.

• For the all-ones case, the worst-case condition numbers of both matrices ($\tilde{\mathbf{G}}_{\mathcal{I}}$ and a full-rank square submatrix of $\tilde{\mathbf{G}}_{\mathcal{I}}$) are almost the same for different $\gamma$. However, if the entries of $\mathbf{R}$ are random Gaussian, then the difference between these two condition numbers is very large.

VII Conclusions and Future Work

Most current approaches for coded computation work within the framework of block codes. In this work we presented a convolutional approach to coded matrix computation. Our codes possess simple encoding and decoding algorithms. We demonstrated novel connections between the analysis of numerical stability of our codes and the properties of large Toeplitz matrices. The performance of our codes is better than most of the existing known approaches. It would be interesting to consider other classes of convolutional codes for coded computation and attempt to characterize their properties.

-A Proof of Theorem 1 and Corollary 2 (MDS property of our codes)

We begin with a formal description of the field in which the polynomials in the indeterminate $D$ lie. Consider the set of real infinite sequences $\{u_{r},u_{r+1},\dots\}$ for $r\in\mathbb{Z}$ that start at some finite integer index $r$ and continue thereafter. These sequences can be treated as formal Laurent series [28] in the indeterminate $D$ with coefficients from $\mathbb{R}$, i.e., $\mathbf{u}(D)=\sum\limits_{i=r}^{\infty}u_{i}D^{i}$. Let us denote the ring of formal Laurent series over $\mathbb{R}$ as $\mathbb{R}((D))$, under the normal addition and multiplication of formal power series. It can be shown [24] that $\mathbb{R}((D))$ forms a field, i.e., each non-zero element in it has a corresponding inverse. Thus, the polynomials $\mathbf{u}(D)=\sum_{i=0}^{\ell}u_{i}D^{i}$ that we consider in this work are members of $\mathbb{R}((D))$ and can be added, multiplied and divided to obtain other members of $\mathbb{R}((D))$. The zero element and identity element of this field are precisely the real number $0$ and the real number $1$.
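To make the existence of inverses concrete, the following sketch computes the leading coefficients of $1/\mathbf{u}(D)$ for a polynomial with non-zero constant term by the standard power-series recursion. The function name is hypothetical. For instance, the inverse of $1+D$ is the infinite series $1-D+D^{2}-\dots$, which is not a polynomial but is a member of $\mathbb{R}((D))$. If the lowest-order term of $\mathbf{u}(D)$ is $u_{r}D^{r}$ with $r>0$, one can factor out $D^{r}$ first, which is where negative powers of $D$ enter.

```python
from fractions import Fraction

def series_inverse(u, n_terms):
    """First n_terms coefficients of 1/u(D), where u(D) = sum_i u[i] * D**i and u[0] != 0."""
    if u[0] == 0:
        raise ValueError("constant term must be non-zero; factor out a power of D first")
    u0 = Fraction(u[0])
    inv = [1 / u0]
    for m in range(1, n_terms):
        # The coefficient of D**m in u(D) * inv(D) must vanish for every m >= 1.
        acc = sum(Fraction(u[i]) * inv[m - i]
                  for i in range(1, min(m, len(u) - 1) + 1))
        inv.append(-acc / u0)
    return inv

print(series_inverse([1, 1], 6))   # leading coefficients of 1 / (1 + D)
```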

The proof of Theorem 1 is an immediate consequence of Lemma 2 below, since any $k\times k$ submatrix of $\mathbf{G}(D)$ is of the form $\mathbf{X}(D)$ given in the lemma.

Lemma 2.

Consider a square matrix $\mathbf{X}(D)$ such that

[TABLE]

where $a_{i}$ and $b_{j}$ are non-negative integers for $0\leq i,j\leq v-1$ such that $0\leq a_{0}<a_{1}<\dots<a_{v-1}$ and $0\leq b_{0}<b_{1}<\dots<b_{v-1}$. Then $\mathbf{X}(D)$ is nonsingular, i.e., its determinant is a non-zero polynomial in $D$. Furthermore, if $\mathbf{R}$ is a $v\times v$ matrix with entries chosen i.i.d. from a continuous distribution, then $\mathbf{R}\circ\mathbf{X}(D)$ (where $\circ$ denotes the Hadamard product) is nonsingular with probability 1.

The proof of Lemma 2 involves Schur polynomials that are defined next.

Definition 3.

Let $\lambda_{0}\geq\lambda_{1}\geq\dots\geq\lambda_{v-1}$ be non-negative integers and let $\mathbf{\lambda}=(\lambda_{0},\dots,\lambda_{v-1})$. Then,

[TABLE]

where the summation is over all semistandard Young tableaux $T$ of shape $\mathbf{\lambda}$ [29].

A Young diagram of shape $\mathbf{\lambda}$ consists of a collection of boxes arranged in left-justified rows. The $i$-th row has $\lambda_{i}$ boxes. A semistandard Young tableau $T$ is obtained by filling the boxes with the integers $0,\dots,v-1$ such that the entries are non-decreasing from left to right along the rows and strictly increasing from top to bottom along the columns. The $t_{i}$ values in (12) are obtained by counting the occurrences of the number $i$ in tableau $T$.
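The tableau-based definition can be checked by brute force for small shapes. The sketch below enumerates all semistandard Young tableaux of a given shape with entries in $\{0,\dots,v-1\}$ and collects the exponent vectors $(t_{0},\dots,t_{v-1})$ of the resulting monomials; the function names are hypothetical. For $\mathbf{\lambda}=(2,1,1)$ and $v=3$ the first column is forced to be $0,1,2$ and only the remaining box is free, which gives three tableaux.

```python
from collections import Counter

def semistandard_tableaux(shape, v):
    """Yield every semistandard Young tableau of the given shape with entries in 0..v-1."""
    shape = [length for length in shape if length > 0]
    cells = [(r, c) for r, length in enumerate(shape) for c in range(length)]
    filling = {}

    def fill(idx):
        if idx == len(cells):
            yield [[filling[(r, c)] for c in range(length)]
                   for r, length in enumerate(shape)]
            return
        r, c = cells[idx]
        lo = 0
        if c > 0:
            lo = max(lo, filling[(r, c - 1)])       # rows are non-decreasing
        if r > 0:
            lo = max(lo, filling[(r - 1, c)] + 1)   # columns are strictly increasing
        for value in range(lo, v):
            filling[(r, c)] = value
            yield from fill(idx + 1)
        filling.pop((r, c), None)

    yield from fill(0)

def schur_monomials(shape, v):
    """Map each exponent vector (t_0, ..., t_{v-1}) to its coefficient in the Schur polynomial."""
    monomials = Counter()
    for tableau in semistandard_tableaux(shape, v):
        t = [0] * v
        for row in tableau:
            for entry in row:
                t[entry] += 1
        monomials[tuple(t)] += 1
    return monomials

print(schur_monomials((2, 1, 1), 3))
```

Every coefficient produced this way is a positive count of tableaux, which is the property the proof of Lemma 2 relies on.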

Proof.

Matrix $\mathbf{X}(D)$ can be written upon permuting some rows as $\hat{\mathbf{X}}(D)$ which is given by

[TABLE]

where we can assume that $\lambda_{0}\geq\lambda_{1}\geq\dots\geq\lambda_{v-1}$ . We need to prove that the determinant of $\hat{\mathbf{X}}(D)$ is non-zero. According to [29] (Chapter 1),

[TABLE]

where

[TABLE]

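Note that $\det\left(\mathbf{Z}(D^{a_{0}},D^{a_{1}},\dots,D^{a_{v-1}})\right)$ is a non-zero polynomial in $D$, since $\mathbf{Z}(D^{a_{0}},D^{a_{1}},\dots,D^{a_{v-1}})$ is a Vandermonde matrix whose nodes $D^{a_{i}}$ are distinct because the $a_{i}$ are distinct.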

Furthermore, based on Definition 3, $\mathcal{S}_{\mathbf{\lambda}}\left(D^{a_{0}},D^{a_{1}},\dots,D^{a_{v-1}}\right)$ consists of the sum of terms of the form $\left(D^{a_{0}}\right)^{t_{0}}\;\left(D^{a_{1}}\right)^{t_{1}}\;\dots\;\left(D^{a_{v-1}}\right)^{t_{v-1}}$, all of which have positive coefficients. Thus, it follows that $\mathcal{S}_{\mathbf{\lambda}}\left(D^{a_{0}},D^{a_{1}},\dots,D^{a_{v-1}}\right)$ is not the zero polynomial. As the product of two non-zero polynomials, $\det(\hat{\mathbf{X}}(D))$ is therefore non-zero as well. ∎

Proof of Corollary 2.

To see the extension, we note that $\det(\mathbf{R}\circ\mathbf{X}(D))$ is a polynomial in $D$ whose coefficients in turn are multivariate polynomials in the elements of $\mathbf{R}$ , i.e., $\{r_{i,j}\},0\leq i,j\leq v-1$ . Based on the proof above, it is clear that setting $\mathbf{R}$ to be a matrix of all-ones results in a nonsingular matrix. This implies that $\det(\mathbf{R}\circ\mathbf{X}(D))$ is not identically zero. Next, the elements of $\mathbf{R}$ are chosen i.i.d. from a continuous distribution. Therefore the probability that all the coefficients evaluate to zero over the random choice is also zero. ∎

Example 3 (Illustration of Lemma 2).

Suppose that $v=3$ and consider the square submatrix,

[TABLE]

where $\lambda_{0}=2,\lambda_{1}=1$ and $\lambda_{2}=1$ , so $\lambda=(2,1,1)$ . The determinant of $\mathbf{E}$ is given by

[TABLE]

The Schur polynomial can be obtained from Fig. 11 as

[TABLE]

-B Example of peeling decoder

Example 4.

Consider Example 1 for matrix-matrix multiplication, as shown in Fig. 6, and suppose that workers $W0$ and $W1$ are stragglers. The goal of the master node is to recover all products of the form $\mathbf{A}^{T}_{\langle i_{1},j_{1}\rangle}\mathbf{B}_{\langle i_{2},j_{2}\rangle}$ for $i_{1}\in[2],j_{1}\in[4],i_{2}\in[2],j_{2}\in[3]$; hence we have a total of $2\times 4\times 2\times 3=48$ unknowns. Note that we can directly obtain $4\times 6=24$ unknowns from workers $W2$ and $W3$. So it remains to recover all unknowns of the form $\mathbf{A}^{T}_{\langle 0,j_{1}\rangle}\mathbf{B}_{\langle i_{2},j_{2}\rangle}$ for $j_{1}\in[4],i_{2}\in[2],j_{2}\in[3]$ from workers $W4$ and $W5$.

First, we concentrate on the first block product of $W5$ , which helps to recover $\mathbf{A}^{T}_{\langle 0,0\rangle}\mathbf{B}_{\langle 0,0\rangle}$ . Following this we examine the first block product of $W4$ , which is $\left(\mathbf{A}_{\langle 0,0\rangle}+\mathbf{A}_{\langle 1,0\rangle}\right)^{T}\left(\mathbf{B}_{\langle 0,0\rangle}+\mathbf{B}_{\langle 1,0\rangle}\right)$ ; the only unknown here is $\mathbf{A}^{T}_{\langle 0,0\rangle}\mathbf{B}_{\langle 1,0\rangle}$ which can therefore be decoded. We can keep moving back and forth between $W4$ and $W5$ and it can be verified that we can recover all the block products $\mathbf{A}^{T}_{\langle 0,j_{1}\rangle}\mathbf{B}_{\langle i_{2},j_{2}\rangle}$ in a similar fashion.
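The back-and-forth recovery described in this example is an instance of peeling: repeatedly find a received combination with exactly one unknown left, solve for it, and substitute it everywhere else. The sketch below illustrates that loop on a made-up toy system in which scalars stand in for block products; the equations are hypothetical and do not reproduce the assignment of Fig. 6. With all-ones coefficients the division is by $1$, so each step reduces to subtractions.

```python
def peel(equations):
    """Solve a sparse linear system by peeling.

    equations: list of (coeffs, rhs), where coeffs maps an unknown's name to its
    non-zero coefficient. Returns the unknowns that could be recovered.
    """
    eqs = [(dict(coeffs), rhs) for coeffs, rhs in equations]
    solved = {}
    progress = True
    while progress:
        progress = False
        for idx, (coeffs, rhs) in enumerate(eqs):
            # Substitute every unknown that has already been recovered.
            for name in [n for n in coeffs if n in solved]:
                rhs -= coeffs.pop(name) * solved[name]
            eqs[idx] = (coeffs, rhs)
            # An equation with a single remaining unknown can be solved directly.
            if len(coeffs) == 1:
                (name, c), = coeffs.items()
                solved[name] = rhs / c
                coeffs.clear()
                progress = True
    return solved

# Toy system (hypothetical): x0 = 3, x0 + x1 = 5, x1 + x2 = 9, x2 + x3 = 10.
toy = [
    ({"x1": 1, "x2": 1}, 9),
    ({"x0": 1}, 3),
    ({"x2": 1, "x3": 1}, 10),
    ({"x0": 1, "x1": 1}, 5),
]
print(peel(toy))
```

If no equation with a single unknown remains, the loop stops and returns only what was recovered, so a caller should check that all unknowns are present in the result.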

-C Proof of Theorem 2

Let $\bar{b}$ be a vector of length $2q-1$ whose entries are indexed as $\bar{b}_{\ell}$, $-(q-1)\leq\ell\leq(q-1)$. A Toeplitz matrix of size $q\times q$, denoted by $\mathrm{Toeplitz}(\bar{b})$, is such that its $(i,j)$-th entry is given by $\bar{b}_{i-j}$ for $i\in[q],j\in[q]$. Thus, each of its diagonals (running from top-left to bottom-right) is constant.
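As a small illustration of this indexing convention, the sketch below builds $\mathrm{Toeplitz}(\bar{b})$ from a Python list that stores $\bar{b}_{-(q-1)},\dots,\bar{b}_{q-1}$ in that order. The function name and the storage layout are assumptions made for the example.

```python
def toeplitz_from_vector(b, q):
    """Return the q x q matrix whose (i, j) entry is b_{i-j}.

    b is a list of length 2q - 1 holding b_ell for ell = -(q-1), ..., q-1,
    so b_ell is stored at position ell + (q - 1).
    """
    if len(b) != 2 * q - 1:
        raise ValueError("b must have length 2q - 1")
    offset = q - 1
    return [[b[(i - j) + offset] for j in range(q)] for i in range(q)]

# With b_ell = ell, entry (i, j) equals i - j, so every diagonal is constant.
for row in toeplitz_from_vector([-2, -1, 0, 1, 2], 3):
    print(row)
```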

Our proof of Theorem 2 relies on a result from [26]. Consider a $kq\times kq$ matrix $\tilde{\mathbf{B}}$ that has Toeplitz blocks of size $q\times q$ with the $(i,j)$ -th block specified by the $(2q-1)$ -length vector $\bar{b}^{i,j}$ . To be precise, for $i=0,1,\dots,(k-1),\ j=0,1,\dots,(k-1)$ ,

[TABLE]

The result in [26] shows that the minimum and maximum eigenvalues of such a matrix can be bounded by computing the minimum and maximum of the eigenvalues of the following (much smaller) $k\times k$ Fourier transform (FT) matrix $\mathbf{B}(\omega)$ over the frequency parameter $\omega$. The $(i,j)$-th entry of $\mathbf{B}(\omega)$ is defined by simply computing the Fourier transform of the corresponding vector $\bar{b}^{i,j}$, i.e.,

[TABLE]

We can now state the result.

Lemma 3 (Theorem 3 of [26]).

(i)

For all $q$ , the eigenvalues of $\tilde{\mathbf{B}}$ lie in

[TABLE]

(ii)

Furthermore,

[TABLE]

In other words, the behavior of the eigenvalues of $\tilde{\mathbf{B}}$ which is a $kq\times kq$ matrix can be studied instead by computing the eigenvalues of the $k\times k$ matrix $\mathbf{B}(\omega)$ and finding its minimum and maximum eigenvalues over the range $\omega\in[-\pi,\pi]$ .
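This reduction can be checked numerically on small instances. The sketch below assembles a symmetric matrix with Toeplitz blocks, evaluates the $k\times k$ matrix $\mathbf{B}(\omega)$ on a grid of frequencies, and prints both eigenvalue ranges. It makes several assumptions that are not taken from the text: the Fourier transform uses the sign convention $\sum_{\ell}\bar{b}_{\ell}e^{-\textrm{i}\ell\omega}$, the blocks satisfy $\bar{b}^{j,i}_{\ell}=\bar{b}^{i,j}_{-\ell}$ so that the large matrix is symmetric, and the supremum over $\omega$ is approximated on a finite grid. All function names are hypothetical.

```python
import numpy as np

def toeplitz_block(b, q):
    """q x q Toeplitz block with (p, r) entry b_{p-r}; b_ell is stored at index ell + q - 1."""
    idx = np.arange(q)[:, None] - np.arange(q)[None, :] + (q - 1)
    return np.asarray(b)[idx]

def assemble(blocks_b, k, q):
    """kq x kq matrix whose (i, j) block is Toeplitz(blocks_b[i][j])."""
    return np.block([[toeplitz_block(blocks_b[i][j], q) for j in range(k)]
                     for i in range(k)])

def symbol(blocks_b, k, q, omega):
    """k x k matrix B(omega) with entries sum_ell b^{i,j}_ell * exp(-1j * ell * omega)."""
    phase = np.exp(-1j * np.arange(-(q - 1), q) * omega)
    return np.array([[np.dot(blocks_b[i][j], phase) for j in range(k)]
                     for i in range(k)])

def symbol_eigen_range(blocks_b, k, q, num=2001):
    """Grid approximation of the min and max eigenvalue of B(omega) over [-pi, pi]."""
    lo, hi = np.inf, -np.inf
    for omega in np.linspace(-np.pi, np.pi, num):
        ev = np.linalg.eigvalsh(symbol(blocks_b, k, q, omega))
        lo, hi = min(lo, ev[0]), max(hi, ev[-1])
    return lo, hi

# Random symmetric example with k = 2 blocks per side and q = 6 (hypothetical data).
rng = np.random.default_rng(0)
k, q = 2, 6
blocks_b = [[None] * k for _ in range(k)]
for i in range(k):
    for j in range(i + 1):
        b = rng.normal(size=2 * q - 1)
        if i == j:
            b = (b + b[::-1]) / 2          # diagonal blocks must be symmetric
        blocks_b[i][j] = b
        blocks_b[j][i] = b[::-1]           # b^{j,i}_ell = b^{i,j}_{-ell}

eigs = np.linalg.eigvalsh(assemble(blocks_b, k, q))
print("eigenvalue range of the large matrix:", eigs[0], eigs[-1])
print("eigenvalue range of B(omega) on the grid:", symbol_eigen_range(blocks_b, k, q))
```

The practical benefit is the size reduction: the grid search works with $k\times k$ matrices regardless of $q$.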

The next two lemmas help prove that $\tilde{\mathbf{G}}_{\mathcal{I}}\tilde{\mathbf{G}}_{\mathcal{I}}^{T}$ has Toeplitz blocks.

Let $\mathbf{U}$ and $\mathbf{L}=\mathbf{U}^{T}$ denote square upper and lower shift matrices respectively, i.e., $\mathbf{U}$ is a $q\times q$ matrix such that

[TABLE]

Thus, for instance if $q=5$ , then

[TABLE]
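The effect of powers of the shift matrices can be seen in a short sketch. It assumes the standard convention that the upper shift matrix has ones on the first superdiagonal, i.e., entry $(i,j)$ is $1$ exactly when $j=i+1$; the helper names are hypothetical. Under this convention $\mathbf{U}^{a}$ has ones exactly where $j-i=a$, and $\mathbf{U}^{q}$ is the zero matrix.

```python
def upper_shift(q):
    """q x q matrix with ones on the first superdiagonal (assumed convention for U)."""
    return [[1 if j == i + 1 else 0 for j in range(q)] for i in range(q)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, a):
    """A raised to a non-negative integer power a."""
    n = len(A)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(a):
        result = matmul(result, A)
    return result

U = upper_shift(5)
L = transpose(U)                 # lower shift matrix
for row in matpow(U, 2):         # ones move two diagonals above the main one
    print(row)
```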

Lemma 4.

Let $h\geq\max(i,j)$ . Then

[TABLE]

Note that the matrices on the RHS above are Toeplitz.

Proof.

We only prove the case when $i>j$ as the other part is very similar. The product $(\tilde{\mathbf{D}}^{h;i})(\tilde{\mathbf{D}}^{h;j})^{T}$ can be expressed as

[TABLE]

∎

Lemma 5.

Let $\tilde{\mathbf{G}}_{\ell}$ denote the $\ell$ -th block-column of $\tilde{\mathbf{G}}$ . For $\ell=0,1,\dots,k-1$ ,

[TABLE]

For $\ell=k+\tilde{\ell}$ , $\tilde{\ell}=0,1,\dots,s-1$ , and for $i\geq j$

[TABLE]

Since the matrix is symmetric, specifying its entries for $i\geq j$ is sufficient.

Proof.

This follows directly by using Lemma 4 and the definition of $\tilde{\mathbf{G}}_{\ell}$ . ∎

Furthermore, using the property that the sum of Toeplitz matrices is Toeplitz, we can conclude that for any subset $\mathcal{I}\subset\{0,\dots,n-1\}$ such that $|\mathcal{I}|=k$ , we have that the matrix $\tilde{\mathbf{G}}_{\mathcal{I}}\tilde{\mathbf{G}}_{\mathcal{I}}^{T}$ is a matrix with Toeplitz blocks.

For ease of presentation let $\mathcal{I}=\mathcal{I}_{1}\cup\mathcal{I}_{2}$ where $\mathcal{I}_{1}\subseteq\{0,\dots,k-1\}$ , $\mathcal{I}_{2}\subseteq\{k,\dots,n-1\}$ and $\mathcal{I}_{1}\cap\mathcal{I}_{2}=\emptyset$ and $\tilde{\ell}=\ell-k$ . Then, for $0\leq i,j\leq k-1$ and $i\geq j$ we can express the $(i,j)$ -th block of $(\tilde{\mathbf{G}}_{\mathcal{I}})(\tilde{\mathbf{G}}_{\mathcal{I}})^{T}$ as follows.

[TABLE]

where $\mathds{1}$ denotes the indicator function. By symmetry it suffices to specify $[(\tilde{\mathbf{G}}_{\mathcal{I}})(\tilde{\mathbf{G}}_{\mathcal{I}})^{T}]_{i,j}$ for $i\geq j$ . Each of the blocks is of dimension $q\times q$ .

Proof of Theorem 2.

We emphasize that our matrix $[(\tilde{\mathbf{G}}_{\mathcal{I}})(\tilde{\mathbf{G}}_{\mathcal{I}})^{T}]$ (see (18)) has Toeplitz blocks. Let $\tilde{\mathbf{B}}=(\tilde{\mathbf{G}}_{\mathcal{I}})(\tilde{\mathbf{G}}_{\mathcal{I}})^{T}$ . Then we have

[TABLE]

where $\tilde{\ell}=\ell-k$. Observe that $\mathbf{U}^{a}$ is a matrix with 1’s on the $(a+1)$-th diagonal and zeros everywhere else. Thus, $\tilde{\mathbf{B}}_{i,j}$ is a Toeplitz matrix with the $(a_{\tilde{\ell}}(b_{i}-b_{j}))$-th diagonal equal to $r_{i\tilde{\ell}}r_{j\tilde{\ell}}$. Therefore, the corresponding sequence $\bar{b}^{i,j}$ for $i>j$ is given by

[TABLE]

Thus, following the discussion above, we obtain

[TABLE]

The expressions above can equivalently be obtained by replacing $D$ with $e^{\textrm{i}\omega}$ and then computing the inner product of $\mathbf{G}_{\mathcal{I}}(e^{\textrm{i}\omega})(i,:)$ with $(\mathbf{G}_{\mathcal{I}}(e^{\textrm{i}\omega})(j,:))^{*}$. Therefore, we can compactly represent

[TABLE]

This concludes the proof. ∎

-D Search Time for Random Convolutional Coding

We run an experiment to tabulate the time needed to find a good random matrix $\mathbf{R}$. We run $50$ trials to find the best $\mathbf{R}$ for $n=13,14,15$ with $s=2,3,4$. It should be noted that the choice of $\mathbf{R}$ depends on all $\binom{n}{s}$ choices of stragglers. Fig. 12 shows the corresponding time for different pairs of $n$ and $s$. From the figure, it can be seen that our system (a processor with a CPU speed of $3.5$ GHz and $16$ GB of RAM) needs only around $8$ minutes to find a good choice of $\mathbf{R}$ even for $n=15$ and $s=4$. In the other cases, the required time is even less. This indicates that, for a reasonable system size, we do not need to wait long to obtain a choice of $\mathbf{R}$ that ensures that the worst-case condition number is bounded. Note also that this is a one-time cost of designing the coding scheme for a system with $n$ worker nodes that is resilient to $s=n-k$ stragglers.
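The dependence on all $\binom{n}{s}$ straggler patterns is what drives the search time. The short sketch below tabulates the number of patterns for the system sizes mentioned above and the total number of condition-number evaluations for $50$ trials, under the assumption that each trial evaluates every pattern once. The function name is hypothetical.

```python
from math import comb

def straggler_patterns(n, s):
    """Number of ways to choose which s of the n workers are stragglers."""
    return comb(n, s)

trials = 50
for n in (13, 14, 15):
    for s in (2, 3, 4):
        patterns = straggler_patterns(n, s)
        print(f"n={n}, s={s}: {patterns} patterns, {trials * patterns} evaluations")
```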

Bibliography

[1] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, “Improving mapreduce performance in heterogeneous environments,” in Operating syst. design and impl. USENIX Association, 2008, pp. 29–42.
[2] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional coded matrix multiplication,” in IEEE Intl. Symposium on Info. Th., 2017, pp. 2418–2422.
[3] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. on Info. Th., vol. 64, no. 3, pp. 1514–1529, 2018.
[4] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding,” IEEE Trans. on Info. Th., vol. 66, no. 3, pp. 1920–1933, 2020.
[5] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in Proc. of Adv. in Neur. Inf. Proc. Syst. (NIPS), 2017, pp. 4403–4413.
[6] S. Dutta, V. Cadambe, and P. Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” in Proc. of Adv. in Neur. Inf. Proc. Syst. (NIPS), 2016, pp. 2100–2108.
[7] S. Dutta, M. Fahim, F. Haddadpour, H. Jeong, V. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrix multiplication,” IEEE Trans. on Info. Th., vol. 66, no. 1, pp. 278–301, 2019.
[8] A. Mallick, M. Chaudhari, U. Sheth, G. Palanikumar, and G. Joshi, “Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication,” Proceedings of the ACM on Meas. and Analysis of Comp. Syst., vol. 3, no. 3, pp. 1–40, 2019.
