TL;DR
This paper introduces convolutional coding methods for distributed matrix computations that are both highly resilient to stragglers and numerically stable, outperforming prior approaches in efficiency and robustness.
Contribution
It proposes two convolutional coding schemes that achieve optimal straggler resilience and numerical stability, with efficient decoding algorithms close to the theoretical lower bounds.
Findings
Optimal straggler resilience demonstrated
Numerical robustness quantified via condition number bounds
Experimental validation on AWS cloud platform
Abstract
Distributed matrix computations -- matrix-matrix or matrix-vector multiplications -- are well-recognized to suffer from the problem of stragglers (slow or failed worker nodes). Much of prior work in this area is (i) either sub-optimal in terms of its straggler resilience, or (ii) suffers from numerical problems, i.e., there is a blow-up of round-off errors in the decoded result owing to the high condition numbers of the corresponding decoding matrices. Our work presents convolutional coding approach to this problem that removes these limitations. It is optimal in terms of its straggler resilience, and has excellent numerical robustness as long as the workers' storage capacity is slightly higher than the fundamental lower bound. Moreover, it can be decoded using a fast peeling decoder that only involves add/subtract operations. Our second approach has marginally higher decoding…
| Codes | Mat-Mat | Optimal | Numerical | Decoding Complexity |
| Mult? | Threshold? | Stability? | for Mat-Mat Mult | |
| Repetition Codes | ✓ | ✗ | ✓ | Zero |
| Rateless Codes [8] | ✗ | ✗ | ✓ | ✗ |
| Product Codes [2] | ✓ | ✗ | ✗ | , assuming |
| Polynomial Codes [5] | ✓ | ✓ | ✗ | |
| Ortho-Poly Codes [10] | ✓ | ✓ | ✓ | |
| Circulant and Rotation Matrix [12] | ✓ | ✓ | ✓ | |
| Random Khatri-Rao Codes [11] | ✓ | ✓ | ✓ | |
| All-Ones-Conv Code (Proposed) | ✓ | ✓ | ✓ | (add/subtract ops) |
| Random-Cov Code (Proposed) | ✓ | ✓ | ✓ |
| Metrics | Methods | |||
|---|---|---|---|---|
| Decoding | All ones | |||
| Time | Random | |||
| All ones | ||||
| for | Random | |||
| for | All ones | |||
| Sqr. Submat. | Random | |||
| of |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Efficient and Robust Distributed Matrix Computations via Convolutional Coding
Anindya Bijoy Das, Aditya Ramamoorthy and Namrata Vaswani This work was supported in part by the National Science Foundation (NSF) under Grant CCF-1718470 and Grant CCF-1910840. Department of Electrical and Computer Engineering,
Iowa State University, Ames, IA 50011 USA.
abd149,adityar,namrata@iastate.edu
Abstract
Distributed matrix computations – matrix-matrix or matrix-vector multiplications – are well-recognized to suffer from the problem of stragglers (slow or failed worker nodes). Much of prior work in this area is (i) either sub-optimal in terms of its straggler resilience, or (ii) suffers from numerical problems, i.e., there is a blow-up of round-off errors in the decoded result owing to the high condition numbers of the corresponding decoding matrices. Our work presents convolutional coding approach to this problem that removes these limitations. It is optimal in terms of its straggler resilience, and has excellent numerical robustness as long as the workers’ storage capacity is slightly higher than the fundamental lower bound. Moreover, it can be decoded using a fast peeling decoder that only involves add/subtract operations. Our second approach has marginally higher decoding complexity than the first one, but allows us to operate arbitrarily close to the lower bound. Its numerical robustness can be theoretically quantified by deriving a computable upper bound on the worst case condition number over all possible decoding matrices by drawing connections with the properties of large Toeplitz matrices. All above claims are backed up by extensive experiments done on the AWS cloud platform.
Index Terms:
Distributed computing, Straggler, Convolutional coding, Toeplitz matrix, Vandermonde matrix.
I Introduction
Distributed computing clusters are heavily used in domains such as machine learning where datasets are often so large that they cannot be stored in a single computer. The widespread usage of such clusters presents several opportunities and advantages over traditional computing paradigms. However, they also present newer challenges. Large scale clusters which can be heterogeneous in nature suffer from the issue of stragglers (slow or failed workers in the system). Fig. 1 shows the variation of speed of different t2.micro machines in AWS (Amazon Web Services) cluster, and it can be seen that for a particular job, a slow worker node may require around more time than the average.
The conventional approach [1] to tackle stragglers has been to run multiple copies of tasks on various machines, with the hope that at least one copy finishes on time. For instance, consider matrix-vector multiplication with a matrix and vector , where our goal is to obtain the product in a distributed fashion. Fig. 2 shows an example where we partition into four block-columns and we assign two block-columns to each of the four worker nodes. Thus each block column has been assigned twice over all four workers and we can verify that we recover the final result if any three workers finish their respective jobs. In other words, we can say that this scheme is resilient to one straggler.
However, this toy example can be made even more efficient in terms of resource utilization by dividing into two block-columns and and assigning the worker nodes appropriate linear combinations of and so that the required result can be decoded from any two workers. This is the basic idea underlying “coded computation” (introduced in the work of Lee et al. [2]). It leverages ideas from erasure coding to introduce redundancy in the computation performed by the worker nodes. Roughly speaking, as long as enough worker nodes complete their tasks, the master node can decode the intended result by appropriate post-processing.
The central problem within coded distributed matrix computation can be explained as follows. Suppose that we have large matrices and a vector . The goal is to either compute (matrix-vector multiplication) or (matrix-matrix multiplication) in a distributed fashion using worker nodes while being resistant to any stragglers. Redundancy is introduced in the computation by coding across appropriately chosen submatrices of and and assigning the worker nodes appropriate computation responsibilities.
The main finding of several recent works in this area is that it is possible to embed distributed matrix computations into the structure of an equivalent erasure code, where the failed nodes play the role of erasures [3, 4, 5, 6, 7, 8, 9] (we discuss related work in detail shortly). A given coded computation scheme is said to have threshold if the desired result can be decoded as long as any worker nodes return their results to the master node. This has been the focus of many works in the literature.
In this work, we consider the important issue of numerical stability within coded computation (in addition to threshold). We point out that several of the existing schemes in the literature suffer from significant numerical issues in the decoding process. In particular, the system of equations that is solved by the master node in the decoding step can have a very high condition number which in turn results in a large error in the decoded result. We present a novel scheme based on convolutional codes (operating over the reals) that simultaneously addresses numerical stability, the threshold, and possesses easy encoding/decoding. An overview of the properties of most of the known schemes in the literature is presented in Table I.
This paper is organized as follows. Section II explains the problem formulation and Section III describes the background and related work and summarizes of the contributions of our work. Section IV discusses our main ideas on how convolutional codes can be used to address distributed matrix computations, Section V overviews the analysis of numerical stability for our codes and Section VI discusses the experimental performance of our proposed methods and shows the comparison with other available approaches. We conclude the paper with a discussion about future work in Section VII. For the sake of readability several of the proofs appear in the Appendix.
II Problem Formulation
In the matrix-vector case we partition into submatrices of equal size and into subvectors and distribute a certain number of “coded” versions of these submatrices to the workers (subject to a storage constraint). Every worker computes the product of its assigned submatrices and subvectors and sends the computed result back to the master node. The master then “decodes” to recover .
In the matrix-matrix multiplication scenario, each worker node receives coded versions of submatrices of and coded versions of the submatrices of 111A general formulation need not restrict the assignment to coded submatrices of and . Nevertheless, all known schemes thus far and our proposed schemes work with equal-sized submatrices, so we present the formulation in this way.. It computes pairwise products (either all or some subset thereof) of these and sends them to the master node which needs to decode to recover .
In the discussion below we discuss the matrix-matrix scenario; it applies in a natural way to the matrix-vector case as well. We consider a and block decomposition of and respectively as shown below.
[TABLE]
The master node encodes by computing appropriate scalar linear combinations of the matrices and respectively the submatrices. This implies that the master node only performs scalar multiplications and additions. It is not responsible for any of the computationally intensive matrix operations. Following this, it sends the corresponding coded submatrices to each of the workers.
We assume that a worker node cannot store the whole matrix or . Each worker can store the equivalent of fraction of matrix and fraction of matrix ; this is referred to as the storage fraction.
The assumption is that some nodes will fail or will be too slow, the maximum number of such nodes is assumed to be or less. The goal is to design the coding scheme so that (i) the decoding is possible using the output of any workers ( is often called the recovery threshold of the scheme), (ii) it is robust to noise (both numerical precision errors and other sources of noise); and (iii) it is efficiently decodable. We say that the threshold of a scheme is optimal if it is the lowest possible given the storage constraints.
III Background, Related Work and Summary of Contributions
In recent years, several coded computation schemes have been proposed for matrix multiplication [3, 4, 5, 6, 7, 8, 9, 13, 14, 15]. We illustrate the basic idea below using the polynomial code approach of [5]. These ideas are presented in a tutorial fashion in [16].
Consider a scenario with workers where each of these worker nodes can store fraction of matrix and fraction of matrix . Consider and , thus we partition both and into two block-columns and respectively. Next, we define two matrix polynomials as
[TABLE]
The master node evaluates these polynomial and at distinct real values , and sends the corresponding matrices to worker node (see Fig. 3 where ). Each worker node computes the product of its assigned submatrices. It follows that decoding at the master node is equivalent to decoding a degree-3 real-valued polynomial. Thus, the master node can recover as soon as it receives the results from any four workers. Thus, in this example, the recovery threshold is, and the system is resilient to straggler.
A different solution can be obtained using the approach in [7] for the same example. Let and , so we can write . Now we define two matrix polynomials as
[TABLE]
As before, the master node will evaluate the polynomial and at , and send the corresponding matrices to worker node . It follows that the master can recover all the unknowns as soon as it receives the results from any three workers. Thus, in this example, the recovery threshold is, and the system is resilient to stragglers.
It should be noted that the latter approach can lead to more straggler resilience, but the computational load per worker has doubled compared to the first approach. Moreover the communication load from the worker nodes to the master node is also higher by a factor of compared to the first approach.
For both schemes above, it can be shown that worker node computation time depends on , whereas the decoding complexity is independent of it (see for instance [16]). Thus, for scenarios where is very large, the decoding time can be neglected. Nevertheless, a low decoding complexity is desirable from a practical standpoint.
III-A Related Work
As discussed above, [4, 5, 7] convert distributed matrix computation into polynomial evaluation/interpolation, i.e., the coded submatrices correspond to polynomial evaluation maps. We remark here that as far as we are aware, the idea of embedding matrix multiplication using polynomial maps goes back even further to Yagle [17] (the motivation there was fast matrix multiplication).
For fixed storage constraints and and for fixed computation overhead per worker with and arbitrary and , the optimal threshold is shown to be [5] using the polynomial approach. When , the work of [4] demonstrates a threshold of . They also present a converse argument which demonstrates that this is within a factor of two of the optimal threshold.
While the computation threshold is somewhat well understood at this point, the issue of numerical stability has received much less attention. When operating over finite fields, proving the invertibility of an appropriate submatrix of the coding matrix suffices to guarantee correct decoding. However, in decoding a real system of equations, errors in the input can get amplified by the condition number (ratio of maximum and minimum singular values) of the associated matrix; hence, a low condition number is critical. For instance, in solving a square system of equations , suppose that is perturbed to (owing to round-off errors) and that the estimate of is . Then, the normalized error in is given by
[TABLE]
where and denote the maximum and minimum singular values of and their ratio is the condition number of the decoding matrix . Thus, it is clear that a small condition number of the decoding matrix leads to less amplification of the round-off error in .
This issue is especially relevant since it is well recognized that polynomial interpolation over the reals suffers from significant numerical issues since the corresponding Vandermonde matrices have very high condition numbers (that are exponential in their size [18]). In fact, even for clusters with around nodes, the condition number of the polynomial approach [5] is so large that the decoded result is essentially useless (see Section VI). We note here that Section VII of [4] remarks that the numerical issues can be handled by embedding all operations within a finite field. In Section VI, we demonstrate that the performance of this method is strongly dependent on the entries of matrices and and the resultant normalized MSE can be quite bad [19].
Some recent works have highlighted and considered the issue of numerical stability in this context. The work of [20, 21] presented strategies for distributed matrix-vector multiplication and demonstrated some schemes that empirically have better numerical performance than polynomial based schemes for some values of and . The work in [20] considers a convolutional coding approach, but from a parity check matrix perspective and the work in [21] uses universally decodable matrices which further allows to utilize the partial computations of the stragglers. However, both these approaches work only for the matrix-vector problem and do not provide a computable bound on the condition number of the decoding submatrices.
The work of [10] presents an alternate approach that works within the basis of orthogonal polynomials. They demonstrate that the worst case condition number of their schemes is at most and their numerical experiments demonstrate improvements with respect to [5]. Our experimental evaluation in Section VI clearly demonstrates that our proposed schemes have condition numbers that are orders of magnitude lower than [10]. [11] present an approach where the encoded matrices are generated by taking random linear combinations of the block-columns of the respective matrices (this was also suggested in Remark of [5]). We note here that their approach can be considered as a subclass of our methods, as discussed in Section VI. Table I shows a comparison of the features of several well-known approaches for distributed matrix computations. Our results in Section VI show that the underlying structure of our codes consistently results in lower worst case condition numbers than [11]. Finally, the parallel work of [12] presents an approach that leverages the properties of rotation matrices and circulant permutation matrices. They demonstrate that the worst case condition number of their recovery matrices grow at most as . While their numerical results are better than ours, our work has the advantage of easy encoding and decoding and explores a convolutional approach to this problem which has not been considered before.
III-B Summary of Contributions
In this paper we present an efficient and robust scheme for coded matrix computations that is inspired by convolutional codes. Our codes operate over the reals, unlike the majority of convolutional codes that are considered over finite fields [22]. Crucially, they exploit the Vandermonde property of the recovery matrices, where the matrices are defined over a different field (formal Laurent series over ) than the real numbers. This naturally allows for simple encoding and decoding in addition to ensuring the threshold properties.
- •
Our work is among the first to provide an efficient coded computation approach for both matrix-vector and matrix-matrix multiplications that provably works in the (i) essentially noise-free regime where numerical precision issues dominate, and (ii) the noisy regime where noise is significant.
- •
We present two classes of codes in this work. Our first approach can be decoded using a peeling decoder using only add/subtract operations and has excellent numerical performance when the storage capacity of the nodes is slightly higher than the fundamental lower bound.
When operating very close to the storage capacity lower bound, we propose an alternative random convolutional coding strategy for which we can provide a “computable” upper bound (cf. Theorem 2 in Section V-A) on the worst case condition number of the recovery matrices. This naturally leads to a random sampling algorithm to pick a coding matrix with good performance. Our work draws novel connections with this problem and the asymptotic analysis of large Toeplitz matrices [23].
- •
An exhaustive comparison of our work with other approaches in the literature shows that the condition numbers of our work are orders of magnitude below all the comparable approaches (except [12]) and have fast decoding times. Fig. 4 depicts a comparison of the performance of the different schemes considered in our work.
- •
As far as we are aware, most previous work has approached coded computation by exploiting its link with block codes under erasures. Our work is the first to investigate a convolutional coding approach to this problem. This in turn opens up newer problems for investigation in this area.
IV Convolutional Coding for Distributed Matrix Computation
IV-A Simple Illustrative Example
We explain our key idea by means of the following example. Consider two row vectors in , and . These vectors can also be represented as polynomials in the indeterminate , for . As explained in Appendix -A, these polynomials can be treated as elements in the ring of formal Laurent series in [24]. Moreover, it can be shown that this ring is in fact a field, i.e., each element has a corresponding inverse. Consider the following encoding of .
[TABLE]
It is not too hard to see that the polynomials and (equivalently the vectors ) can be recovered (or “decoded”) from any two entries of the vector . For instance, suppose that we only receive and . Notice that
[TABLE]
Starting with from the constant term of , one can iteratively recover each of the coefficients of and , with only one new variable to recover in each iteration. A similar argument applies if we consider a different set of two entries from . We refer to such a decoding scheme as a “peeling decoder”.
Observe that the encoded polynomial has degree , while the others have degree . Thus, if the coefficients of the polynomials correspond to encoded data that were sent to node for processing, then node 3 would need slightly higher storage/processing capacity than nodes 0, 1, 2. Secondly, observe that the above idea can also be equivalently understood by replacing the matrix of polynomials by a larger matrix of size and rewriting all the scalar polynomials as row vectors. Let be row vectors of length and be a row vector of length . Then,
[TABLE]
where is a matrix of zeroes, is a identity matrix, and is a column of zeroes. In what follows, we consider generalizations of this basic example where the ’s will correspond to block-columns of and .
IV-B Proposed matrix-vector multiplication scheme
The above idea can naturally be adapted to the distributed matrix-vector multiplication setting. We show an example in Fig. 5 with workers and stragglers, so . Suppose that matrix is partitioned into block-columns (the choice of will be discussed shortly). In our work, the presentation follows more naturally if we index the block-columns of using two indices instead of one. In particular, they are indexed as (where denotes the set ) and each worker node stores at most columns of length- ( is called the storage fraction).
Let for . Furthermore, let denote a matrix whose -th submatrix is , for , i.e., has the Vandermonde structure. We define
[TABLE]
Consider the encoding
[TABLE]
To arrive at the distributed matrix-vector multiplication scheme, we simply interpret the coefficients of the powers of in as the encoded submatrices assigned to worker (see Fig. 5 for an example). With this assignment, worker computes the inner product of its assigned matrices and . We say that a matrix is maximum-distance-separable (MDS) if any of its submatrices is nonsingular. This property further implies that can be recovered as long as any workers complete their tasks. The following result shows that is MDS; the proof appears in the Appendix.
Corollary 1** (Corollary of upcoming Theorem 1 given in Section IV-C).**
Any submatrix of has a determinant which is a non-zero polynomial in , i.e., it is non-singular.
Analogous to convolutional coding, we call the first workers the message workers and the last workers the parity workers. Each of the first message workers receives submatrices , each of which is a matrix of size . The rest of the parity workers will receive such submatrices. The highest exponent of in the generator matrix is . Thus, the maximum storage needed by a worker is submatrices. When is large enough, this imbalance is not significant. If we assume a bound of on the storage capacity fraction of any worker, we need
[TABLE]
For example, in Fig. 5, is set to which leads to .
IV-C Proposed matrix-matrix multiplication scheme
The matrix-matrix multiplication case requires the generalization of the above ideas. Let and be vectors of non-negative integers such that and . Let denote a matrix whose -th entry is given by
[TABLE]
Using this matrix, define a generalization of as follows
[TABLE]
Observe that we obtain by setting and , which corresponds to . We will design an encoding scheme for matrix-matrix multiplication whose equivalent generator matrix is of the form in (4). Before we explain the design, we show that this matrix also satisfies the MDS property (the proof appears in the Appendix).
Theorem 1**.**
Any submatrix of the generator matrix defined in (4) is non-singular.
While non-singularity by itself does not reveal information about the corresponding condition numbers, Theorem 1 provides a class of schemes with a specific structure that have excellent numerical stability (see Fig. 4 “All Ones” curve) and can be modified and analyzed for condition number using the techniques discussed in Theorem 2 within Section V. The structure of in (4) also allows for an efficient peeling decoder.
In the matrix-matrix case, we design generator matrices of size and of size such that . Each worker stores fractions and of matrices and respectively. Let be a large enough positive integer and let
[TABLE]
Furthermore, we let and . The final goal of the master node is to recover all products of the form for . Once again by forming
[TABLE]
we can represent the assignment of coded submatrices of and to worker node by the coefficients of and respectively. Following this step, each worker node computes the pairwise product of each coded submatrix of and coded submatrix of assigned to it.
The matrices and will be picked in such a way so that the pairwise product of each coefficient of and each coefficient of appears in , i.e., each worker node equivalently computes . Using MATLAB notation and Kronecker product properties, for , we have
[TABLE]
where denotes the Kronecker product. Therefore, the computation peformed by the worker nodes can be compactly represented using the Khatri-Rao product [25] (denoted by )222For two matrices with the same column dimension, the Khatri-Rao product corresponds to the matrix obtained by taking the Kronecker product of the corresponding columns. Moreover, using the properties of the Khatri-Rao product, we have
[TABLE]
The key idea at this point is to ensure that has the structure of a matrix as in (4). Towards this end, we choose
[TABLE]
where is an all-ones row vector of length , and the total number of rows in and are and respectively. This implies that
[TABLE]
where . The following lemma shows that the RHS of (8) has the structure of the matrix in (4).
Lemma 1**.**
The Khatri-Rao product is a matrix in the form of (3).
Proof.
Note that the Kronecker product of -th column of and -th column of can be expressed as
[TABLE]
The vector on the RHS above consists of powers of and can be seen to be in the form of (3). ∎
Lemma 1 explains why Theorem 1 is applicable to the coding scheme used for matrix-matrix multiplication. Thus, this lemma, along with Theorem 1 implies that the proposed convolutional code based matrix-matrix multiplication scheme is MDS.
Now, we need to choose such a value of which ensures that in (7) contains all the distinct pairwise products that we are interested. We know that worker will be assigned the jobs according to the column of the RHS in (8). Now by examining the structure of the RHS in (8), it can be verified that for , worker will be assigned submatrices from and submatrices from . And for , any worker will be assigned submatrices from and submatrices from . Thus the maximum number of submatrices will be assigned to worker , which will have submatrices from and submatrices from , since . For the assignment of this worker,
[TABLE]
It can be verified that is a polynomial in where the exponent of at any term is an integer multiple of . Since each has a degree , the degree of is , and thus we conclude that
[TABLE]
It should be noted that this value of is large enough for (9) to hold.
Next, using an approach similar to (2), we can derive
[TABLE]
Example 1**.**
Consider the computation of over workers and stragglers. Assume that each worker can store/process fraction of matrix and fraction of matrix . We set , so that and . By setting , we obtain
[TABLE]
Furthermore,
[TABLE]
The assignment of jobs to all the workers can be obtained from and . This is shown in Fig. 6.
Remark 1**.**
Our proposed encoding process is very simple and involves only additions at the master node.
IV-D Decoding algorithm: Peeling decoder
Suppose that we obtain results from workers in , with . We describe the decoding process below in detail for the matrix-vector case; the discussion is quite similar for the matrix-matrix case.
In the matrix-vector case our unknowns are ; each of these is a vector of length . Let row-vector denote the collection of the -th entries of each of these unknowns, where . Let the output of the worker nodes corresponding to be denoted by . The length of depends on .
We assume that the master node obtains results from a subset of the message workers, , so that . This implies that it can recover unknowns directly. Moreover, it obtains results from the parity workers indexed by , where . Thus, it needs to recover the remaining unknowns.
The underlying structure of the convolutional code allows for a very simple peeling decoder whereby, at each step, the algorithm is guaranteed to find an equation with only one unknown. We demonstrate this by means of an example in Appendix -B. Crucially, the scheme can be decoded purely with add/subtract operations and can thus be highly optimized. This algorithm is very fast and has excellent numerical stability (cf. Fig. 4) in experiments.
Decoding Complexity: We consider the worst case where . According to the design of this scheme, each of the unknowns appears once in every parity worker, and thus the system of equations has at most non-zero entries. Furthermore, in a peeling decoder one variable can be decoded and substituted in the remaining equations at each iteration. Therefore, the time complexity of solving this sparse system is . As we solve a total of such systems of equations, the total time taken is which is independent of and thus does not grow with it; similarly it can be shown that for the matrix-matrix case the time is .
It should be noted that the matrices and are of sizes and respectively, thus the computational complexity of computing is . In a distributed system, this job is distributed over workers with stragglers, so, on average, the computational complexity of each of the workers is , where . On the other hand, to get the final result, we need to recover unknowns, which is the size of . Thus the decoding complexity does not depend on the parameter which indicates that the decoding time can be often considered negligible in comparison to the worker computation time when is very large [16]. Nevertheless, fast decoding is a desirable feature of any coded computation scheme.
IV-E Effect of : storage fraction, imbalance in task assignment
Our presented scheme thus far is provably MDS, efficiently decodable and has excellent numerical stability in experiments. Note that our schemes require lower bounds on the value of which have an inverse dependence on . Thus, if one wants to reduce the imbalance between the task assignments to the message nodes and the parity nodes, then needs to be chosen large enough. It turns out that for large values of , the worst case condition number of our scheme can be very large. We present a theoretical treatment of this phenomenon in the upcoming Section V and discuss techniques for mitigating this effect.
V Numerical stability analysis
To understand numerical stability, we first introduce a modified encoding scheme and then discuss the matrix representation of the coding ideas described above.
Definition 1** (Randomly scaled generator matrix).**
Let be a matrix of real numbers. Consider the generator matrix defined in (4). Replace by . Here, denotes Hadamard product (.* operation in MATLAB).
Note that if we set for all entries of the matrix , we recover the old generator matrix (the “All-Ones” case).
V-1 Understanding the matrix representation
It is not hard to see that the matrix representation of the transformation induced by the generator polynomial matrix from Definition 1 can be understood as right multiplying a -length row vector of input data by the following matrix. An example of this was given in Section IV-A
Definition 2** (: matrix representation of ).**
We first define a shift matrix that takes a -length row vector and returns a -length row vector, where the original vector is shifted to the right by components. This is the matrix . The -th block matrix of for and is
[TABLE]
and for , ,
[TABLE]
Thus, is a matrix where
[TABLE]
With the above definition, decoding can be understood as inverting the specific block submatrix of , denoted where is the set of indices of the workers that have returned their jobs.
V-2 Quantifying round-off error amplification
When assuming perfectly noise-free computations, invertibility of the decoding matrix, , is sufficient to guarantee perfect recovery/decoding of the desired matrix-matrix product. However, since all computing devices are finite precision, matrix multiplications will frequently result in bit overflow/underflow and hence round-off errors. As explained earlier (cf. Section III-A), the decoding process amplifies the round-off error by a factor that can at most be as large as the condition number of the decoding matrix. Thus, the numerical stability of our scheme is quantified by the largest condition number over all block submatrices , i.e., by
[TABLE]
V-A Upper bounding
Observe that the matrix , and consequently the decoding submatrix with , has a very specific structure. Because of this, it is possible to show that the matrix is a block matrix with Toeplitz blocks of size , see in Appendix -C. This fact is useful since the asymptotics of and when is large have been studied in [26]. In particular, Theorem of [26] shows that using Fourier transform ideas, one can bound the eigenvalues of such matrices by computing the minimum (and maximum) of the smallest (and largest) eigenvalues of a much smaller matrix that is a function of a scalar parameter which lies in .
With some abuse of notation, let represent the matrix obtained by extracting (from in (4)) and then substituting (where ). By adapting the results of [26] (see Appendix -C for a detailed description), we have the following theorem.
Theorem 2**.**
For such that , we have
[TABLE]
Moreover, for any
[TABLE]
Theorem 2 shows that we can find an upper bound on the condition number of based on a scalar optimization over . When is chosen to be the all-ones matrix, the characterization of Theorem 2 allows us to conclude that when , there exist choices of such that has a minimum eigenvalue that will go to zero as . In particular, the corresponding has repeated columns for .
Example 2**.**
Consider the example with . Suppose that . This implies that
[TABLE]
where and are upper shift and lower shift matrices respectively (see, e.g., (17) in the Appendix).
The corresponding can be obtained as
[TABLE]
Using Theorem 2, we can conclude therefore that (achieved at ) and (achieved at ). This implies therefore that as becomes larger and larger, the matrix becomes more and more ill-conditioned, though it is nonsingular for any fixed .
Therefore considering a nontrivial scaling of the parity part with a matrix is essential for well-conditioned behavior when is very large.
V-B Randomly-weighted convolutional coding
We now show that choosing the matrix randomly in Definition 1 results in better numerical stability than the All-Ones scheme in the regime of large but requires marginally higher decoding complexity.
The following result shows that the MDS property continues to holds with probability 1 when the entries are chosen i.i.d. from a continuous distribution. The proof is an easy consequence of Theorem 1 and appears in the Appendix.
Corollary 2**.**
If the entries of the matrix are chosen i.i.d. from any continuous-valued probability distribution, then, any submatrix of the generator matrix mentioned in Definition 1 is non-singular with probability one.
We now demonstrate that choosing the matrix randomly allows us to upper bound the worst case condition number (over the recovery matrices) even when . In the matrix-vector scenario, Theorem 2 suggests the following algorithm for choosing . We proceed by randomly choosing . Let and let for a large positive integer denote a fine enough grid of the interval . Let be defined as
[TABLE]
Thus, indicates the maximum condition number of over all choices of ; this is an upper bound on the maximum condition number of . The algorithm repeatedly generates choices of and retains the choice that has the lowest value of ; this denoted by . The matrix-matrix case is similar, except that we generate two random matrices denoted and and consider the worst case condition number of the appropriate submatrices of (8) to obtain and . We emphasize that even though the search requires optimizing over choices of , this is a one-time cost for designing the coding scheme for a system with worker nodes which is resilient to stragglers. Furthermore, (i) the search does not have any dependence on , and (ii) the value of is typically a small constant, that either does not grow or grows very slowly with . Thus the complexity of the above design, , grows as polynomial in . Appendix -D presents some numerical results on the amount of time taken to find a good matrix.
For systems with and , we conducted random trials each to find the corresponding for the matrix vector multiplication case; the entries were sampled i.i.d. from the uniform distribution on . Our algorithm also returns the asymptotic upper bound on . By sweeping over values of , we can also compute the actual worst-case condition number for each particular chosen value of . Fig. 7 depicts the upper bound and the actual worst case condition numbers for different and .
V-C Random convolutional coding: decoding algorithm
In principle, it is possible to use a fast peeling decoder for decoding as done earlier in the all-ones case. Note however that the peeling decoder solves a system of equations in variables. Thus, it only uses columns of the even though is a matrix of size where is an integer between zero and (cf. (11)), depending on which set of worker nodes finished their computations (in matrix-vector multiplication).
In particular, the stability of the peeling decoder depends on the condition number of the relevant full rank square submatrix of . In general, this condition number is higher than that of . In our numerical experiments we have found that for the all-ones case, the worst case condition numbers of both matrices ( and full rank square submatrix of ) are almost the same (see more experimental details in Section VI). This explains the numerically stable behavior of the peeling decoder in the all-ones case.
The situation changes quite a bit when we consider random scaling of the generator matrix. e.g., when the entries of are i.i.d. random Gaussian, the difference is very large. In this case, the condition number of the full rank square submatrix of can be very high for certain sets of workers (see in Section VI). But in all cases, over all is significantly smaller than that of the all-ones case. Thus, it is clear that one should use all the columns of for decoding, rather than using only equations.
Decoding Complexity: Similar to the discussion in Section IV-D, we assume that the fastest workers include the message worker set and the parity worker set , so that . We can decode some unknowns directly from the workers in , and in the worst case, we need to recover the other unknowns from the parity workers in . In this case, one can solve a least square (LS) problem to recover the unknowns. This LS problem can be solved in different ways. The most straightforward way would be matrix inversion ( time) followed by solving systems of equations ( time). If ; we can write it as . On the other hand if the value of is large, then we can use techniques such as conjugate gradient descent to solve the LS problem. This is especially useful when is large since the underlying system of equations is sparse. Thus, each iteration of conjugate gradient descent can be solved in a fast manner. In particular, if we run it for iterations to recover these unknown blocks, the decoding complexity is . To reach within fraction of the solution, the number of iterations scales a where is the condition number of the linear system of equations.
Overall the decoding complexity of the random convolutional code setting is marginally higher than the All-Ones case, depending on which algorithm is used for the LS solution.
VI Comparisons and Numerical Experiments
In this section, we discuss the results of the numerical experiments for our proposed approaches and compare our methods with other available methods.
The polynomial code approach [5] suffers from the problem that real Vandermonde matrices have condition numbers that are exponential in their size. This in turn implies that for large number of workers (for example, workers) the condition number of the decoding matrix is so high that the recovered result by the master node is actually useless.
To avoid this numerical issue, Section VII of [4] remarks that the real computation can be embedded within a large enough finite field of prime order . It turns out that the performance of this scheme is strongly dependent on the entries of and and the resultant normalized MSE can be quite bad. These arguments have appeared in [19]; we present an outline below.
We note that computations in this method are error-free only when each entry of the product matrix is an integer in . If this requirement is violated, the proposed mod- computations can return catastrophically wrong answers [19]. This means that the matrices A and B need to be multiplied by a scalar and quantized so that each entry of the resulting matrix is an integer that is within the appropriate range. Suppose that the absolute values of the entries of and are upper bounded by ; then we need . This is referred to as the dynamic range constraint in [19]. For instance, with -bit integers (the standard on present day computers), the largest integer is . Thus, even if , the method can only support . Thus, the range is rather limited.
The work of [19] constructs adversarial and integer matrices for this method as follows. Let (note that this is much larger than the publicly available code of [5] which uses ) so that their method can support higher dynamic range. Next let . This implies that needs to be by the dynamic range constraint. The matrices have the following block decomposition.
[TABLE]
Each and is a matrix of size , with entries chosen from the following distributions. , distributed and , distributed . Next, , distributed and distributed . In this scenario, the dynamic range constraint requires us to multiply each matrix by and quantize each entry between [math] and . Note that this implies that are all quantized into zero submatrices since the entry in these four submatrices is less than . We emphasize that the finite field embedding technique only recovers the product of these quantized matrices. However, this product is the all-zeros matrix, i.e., the decoded matrix will also be the all-zeros matrix. Therefore, the normalized MSE in this case will be 100 %. There are also significant computational issues as discussed in [19]. We note here that such adversarial can be found even for larger choices of . It is worth noting that the normalized MSE of the other methods do not depend on the actual values of and .
The work of [10] uses orthogonal polynomials and Chebyshev-Vandermonde matrices for the encoding part, which significantly improves the condition number of the decoding matrices compared to [5] and [6]. The work in [11] uses random Khatri-Rao product where random coefficients are used for the encoding, which further improves the numerical stability. The recent preprint [12] uses circulant and permutation matrices to improve the numerical stability of the polynomial approach. We compare our approaches with these methods with exhaustive numerical experiments which are performed over a cluster in AWS (Amazon Web Services). A t2.2xlarge machine is used as the master node and t2.small machines are used as the slave nodes. Software code for recreating these experiments can be found at [27].
Comparing and MSE for Matrix-matrix case: For a system with workers and stragglers for matrix-matrix multiplication, we set and with and , so . Table II reports a comparison of the worst-case condition numbers for different approaches in the literature. It can be observed that the work of [5] and [10] have much higher condition numbers than our proposed schemes (All-ones and Random). Both our approaches are also better than the work of [11] in terms of worst case condition number () values. We point out that the methods in [20] and [8] are developed for matrix-vector multiplication, so those are not applicable for this comparison.
In our next experiment we compare the mean-squared error (MSE) of the different matrix-matrix multiplication methods for their respective worst case scenarios when and . For matrix-matrix case, we define MSE as
[TABLE]
where is the recovered result and is the actual result. Here, the matrices and are of size and respectively. We simulate errors in the worker node computations by adding white Gaussian noise to the calculated submatrix products obtained from the worker nodes and sweeping the range of SNRs. The results appear in Fig. 4 (for additive Gaussian noise) and Fig. 8 (for round-off errors). In Fig. 4 we observe that even at , our approach is around , and orders of magnitude better than [5], [10] and [11]. The corresponding decoding time is also reported in the legend which shows that the decoding time for our approaches compare quite well with other approaches. The behavior of the curves in Fig. 8 is similar in nature.
Comparing and MSE for Matrix-vector case: We carry out an experiment to compare the worst case condition number of the decoding matrix for different approaches for matrix-vector multiplication. Table III shows the worst case condition number for a scenario with workers, with stragglers where each worker node can store fraction of matrix . From the table, it is clear that the approaches in [5] and [20] provide much larger condition numbers in comparison to the others. From the table, we can also see that our proposed approaches provide lower condition numbers than the approaches [10] and [11].
In our next experiment we compare the normalized MSE of the different methods for their respective worst case scenarios. For matrix-vector case, we define MSE as
[TABLE]
where is the recovered result and is the actual result. We consider the same scenario with and where we have matrix of size and a vector of length . We want to compute the product . Fig. 9 shows the normalized MSE of the different approaches for different SNR. From the figure we can see that our proposed approaches perform significantly better than all other schemes except the scheme of [12]. This supports our condition number results in Table III. For example, at , the approach in [11] provides around error whereas our all-ones and random convolutional code approaches provide only and error, respectively, for the worst case.
**Comparing [12] and our approach: ** It can be observed that the recent preprint of [12] has the best and MSE numbers for both the matrix-matrix and matrix-vector scenarios. However, our work has much simpler encoding (additions/subtractions in the All-Ones case) and decoding (peeling decoder) than their method. Our work is also the first to propose a convolutional coding strategy for this problem.
Comparing [11] and our approach The Random KR approach can be considered as specific instance of our random scaling method where the scaling is applied to a trivial all-ones parity matrix, instead of a carefully designed . As both approaches are random and pick the best choices, we conducted an experiment where we ran 100 trials for both methods (with and ) and picked the respective best choices (see Fig. 10 for the corresponding worst case condition numbers). It is clear that the structure imposed in our construction definitely improves the condition number as compared to the work of [11].
**Comparing our All-ones and random approaches: ** Recall that for our methods and increase when and become smaller (cf. Sections IV-B and IV-C). Table IV, shows a comparison of our proposed approaches in terms of decoding time and worst case condition number for three different values of . The following inferences can be drawn.
- •
The decoding time remains more or less constant for the all-ones case, whereas it can increase with decreasing because of solving LS problem for the random case.
- •
The worst case condition number for the all-ones case continues to increase with decreasing , whereas it saturates for the random case.
- •
For all-ones case, the worst case condition numbers of both matrices ( and full rank square submatrix of ) are almost the same for different . However, if the entries of are random Gaussian, then the difference between these two condition numbers is very large.
VII Conclusions and Future Work
Most current approaches for coded computation work within the framework of block codes. In this work we presented a convolutional approach to coded matrix computation. Our codes possess simple encoding and decoding algorithms. We demonstrated novel connections between the analysis of numerical stability of our codes and the properties of large Toeplitz matrices. The performance of our codes is better than most of the existing known approaches. It would be interesting to consider other classes of convolutional codes for coded computation and attempt to characterize their properties.
-A Proof of Theorem 1 and Corollary 2 (MDS property of our codes)
We begin by a formal description of the field in which the polynomials in the indeterminate lie. Consider the set of real infinite sequences for that start at some finite integer index , and continue thereafter. These sequences can be treated as elements of the formal Laurent series [28] in indeterminate with coefficients from , i.e., . Let us denote the ring of formal Laurent series over as under the normal addition and multiplication of formal power series. It can be shown [24] that forms a field, i.e., each non-zero element in it has a corresponding inverse. Thus, the polynomials that we consider in this work are members of and can be added, multiplied and divided to obtain other members of . The zero element and identity element are precisely the real number [math] and the real number within this field.
The proof of Theorem 1 is an immediate consequence of Lemma 2 below since any submatrix of is of the form given in the lemma.
Lemma 2**.**
Consider a square matrix such that
[TABLE]
where and are positive integers for such that and . Then is nonsingular, i.e., its determinant is a non-zero polynomial in . Furthermore, if is a matrix with entries chosen i.i.d. from a continuous distribution, then (where denotes the Hadamard product) is nonsingular with probability 1.
The proof of Lemma 2 involves Schur polynomials that are defined next.
Definition 3**.**
Let be non-negative integers and let . Then,
[TABLE]
where the summation is over all semistandard Young tableaux of shape [29].
A Young diagram of shape consists of a collection of boxes arranged in left-justified rows. The -th row has boxes. A semistandard Young tableau is obtained by filling the boxes with the integers such that entries are in ascending order from left to right in the rows and in strictly increasing order from top to bottom in the columns. The values in (12) are obtained by counting the occurrences of the number in tableau .
Proof.
Matrix can be written upon permuting some rows as which is given by
[TABLE]
where we can assume that . We need to prove that the determinant of is non-zero. According to [29] (Chapter 1),
[TABLE]
where
[TABLE]
Note that is a non-zero polynomial in as it is a Vandermonde matrix.
Furthermore, based on Definition 3, consists of the sum of terms of the form all of which have positive coefficients. Thus, it follows that is not the zero-polynomial. ∎
Proof of Corollary 2.
To see the extension, we note that is a polynomial in whose coefficients in turn are multivariate polynomials in the elements of , i.e., . Based on the proof above, it is clear that setting to be a matrix of all-ones results in a nonsingular matrix. This implies that is not identically zero. Next, the elements of are chosen i.i.d. from a continuous distribution. Therefore the probability that all the coefficients evaluate to zero over the random choice is also zero. ∎
Example 3** (Illustration of Lemma 2).**
Suppose that and consider the square submatrix,
[TABLE]
where and , so . The determinant of is given by
[TABLE]
The Schur polynomial can be obtained from Fig. 11 as
[TABLE]
-B Example of peeling decoder
Example 4**.**
Consider Example 1 for matrix-matrix multiplication, as shown in Fig. 6 and suppose that workers and are stragglers. The goal of the master node is to recover all products of the form for , hence we have total unknowns. Note that we can directly obtain unknowns from workers and . So it remains to recover all unknowns of the form for from workers and .
First, we concentrate on the first block product of , which helps to recover . Following this we examine the first block product of , which is ; the only unknown here is which can therefore be decoded. We can keep moving back and forth between and and it can be verified that we can recover all the block products in a similar fashion.
-C Proof of Theorem 2
Let be a vector of length , whose entries are indexed as . A Toeplitz matrix of size , denoted by is such that its -th entry is given by for . Thus, it is such that each diagonal is a constant from top-left to bottom-right.
Our proof of Theorem 2 relies on a result from [26]. Consider a matrix that has Toeplitz blocks of size with the -th block specified by the -length vector . To be precise, for ,
[TABLE]
The result in [26] shows that the minimum and maximum eigenvalues of such a matrix can be bounded by computing the minimum and maximum of the eigenvalues of the following (much smaller) Fourier transform (FT) matrix over the frequency parameter . The -the entry of is defined by simply computing the Fourier transform of the corresponding vector , i.e.,
[TABLE]
We can now state the result.
Lemma 3** (Theorem 3 of [26]).**
- (i)
For all , the eigenvalues of lie in
[TABLE]
- (ii)
Furthermore,
[TABLE]
In other words, the behavior of the eigenvalues of which is a matrix can be studied instead by computing the eigenvalues of the matrix and finding its minimum and maximum eigenvalues over the range .
The next two lemmas below help prove that has Toeplitz blocks.
Let and denote square upper and lower shift matrices respectively, i.e., is a matrix such that
[TABLE]
Thus, for instance if , then
[TABLE]
Lemma 4**.**
Let . Then
[TABLE]
Note that the matrices on the RHS above are Toeplitz.
Proof.
We only prove the case when as the other part is very similar. The product can be expressed as
[TABLE]
∎
Lemma 5**.**
Let denote the -th block-column of . For ,
[TABLE]
For , , and for
[TABLE]
Since the matrix is symmetric, specifying its entries for is sufficient.
Proof.
This follows directly by using Lemma 4 and the definition of . ∎
Furthermore, using the property that the sum of Toeplitz matrices is Toeplitz, we can conclude that for any subset such that , we have that the matrix is a matrix with Toeplitz blocks.
For ease of presentation let where , and and . Then, for and we can express the -th block of as follows.
[TABLE]
where denotes the indicator function. By symmetry it suffices to specify for . Each of the blocks is of dimension .
Proof of Theorem 2.
We emphasize that our matrix (see (18)) has Toeplitz blocks. Let . Then we have
[TABLE]
where . Observe is a matrix with 1’s on the -th diagonal and zeros everywhere else. Thus, is a Toeplitz matrix with the -th diagonal equal to . Therefore, the corresponding sequence for is given by
[TABLE]
Thus, following the discussion above, we obtain
[TABLE]
The expressions above can equivalently be expressed as replacing with and then computing the inner product of with . Therefore, we can compactly represent
[TABLE]
This concludes the proof. ∎
-D Search Time for Random Convolutional Coding
We run an experiment to tabulate the time needed to find a good random matrix . We run trials to find the best for with . It should be noted that the choice of depends on all choices of stragglers. Fig. 12 shows the corresponding time for different pairs of and . From the figure, it can be seen that our system (a processor with CPU speed and RAM) needs only around minutes to find a good choice of for even and . In other cases, the required amount of time is even lesser. This indicates that for a reasonable system size, we do not need to wait too long to obtain a good choice of that ensures that the worst case condition number is bounded. And it should be noted that this is a one-time cost for designing the coding scheme for a system with worker nodes which is resilient to stragglers.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, “Improving mapreduce performance in heterogeneous environments,” in Operating syst. design and impl. USENIX Association, 2008, pp. 29–42.
- 2[2] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional coded matrix multiplication,” in IEEE Intl. Symposium on Info. Th. , 2017, pp. 2418–2422.
- 3[3] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. on Info. Th. , vol. 64, no. 3, pp. 1514–1529, 2018.
- 4[4] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding,” IEEE Trans. on Info. Th. , vol. 66, no. 3, pp. 1920–1933, 2020.
- 5[5] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in Proc. of Adv. in Neur. Inf. Proc. Syst. (NIPS) , 2017, pp. 4403–4413.
- 6[6] S. Dutta, V. Cadambe, and P. Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” in Proc. of Adv. in Neur. Inf. Proc. Syst. (NIPS) , 2016, pp. 2100–2108.
- 7[7] S. Dutta, M. Fahim, F. Haddadpour, H. Jeong, V. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrix multiplication,” IEEE Trans. on Info. Th. , vol. 66, no. 1, pp. 278–301, 2019.
- 8[8] A. Mallick, M. Chaudhari, U. Sheth, G. Palanikumar, and G. Joshi, “Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication,” Proceedings of the ACM on Meas. and Analysis of Comp. Syst. , vol. 3, no. 3, pp. 1–40, 2019.
