Universally Decodable Matrices for Distributed Matrix-Vector Multiplication
Aditya Ramamoorthy, Li Tang, Pascal O. Vontobel

TL;DR
This paper introduces a novel class of distributed matrix-vector multiplication schemes using universally decodable matrices and Rosenbloom-Tsfasman codes, effectively leveraging partial computations and ensuring numerical stability.
Contribution
It presents a new coding scheme for distributed matrix-vector multiplication that accounts for computation order and partial results, enhancing efficiency and stability.
Findings
Effective mitigation of stragglers in distributed computation
Sparse and numerically stable coding schemes
Experimental validation of scheme effectiveness
Abstract
Coded computation is an emerging research area that leverages concepts from erasure coding to mitigate the effect of stragglers (slow nodes) in distributed computation clusters, especially for matrix computation problems. In this work, we present a class of distributed matrix-vector multiplication schemes that are based on codes in the Rosenbloom-Tsfasman metric and universally decodable matrices. Our schemes take into account the inherent computation order within a worker node. In particular, they allow us to effectively leverage partial computations performed by stragglers (a feature that many prior works lack). An additional main contribution of our work is a companion matrix-based embedding of these codes that allows us to obtain sparse and numerically stable schemes for the problem at hand. Experimental results confirm the effectiveness of our techniques.
| Scheme | s | Max. Cond. Num. | Avg. Cond. Num. | Density of | ||
|---|---|---|---|---|---|---|
| RS based scheme | 4 | 3 | ||||
| RS Companion Matrix of | 20 | 15 | 5 | |||
| RS Embedding from | 4 | 3 | ||||
| RS Companion Matrix of | 12 | 9 | 3 | |||
| UDM-based scheme | 4 | 3 | ||||
| UDM Embedding from | 4 | 3 | ||||
| UDM Companion Matrix of | 12 | 9 | 3 | |||
| UDM Companion Matrix of | 8 | 6 | 2 |
| Scheme | s | Max. Cond. Num. | Avg. Cond. Num. | Density of | ||
|---|---|---|---|---|---|---|
| RS Companion Matrix | 20 | 10 | 5 | |||
| RS Companion Matrix | 12 | 6 | 3 | |||
| RS Companion Matrix | 16 | 8 | 4 | |||
| UDM Companion Matrix | 16 | 8 | 4 | |||
| UDM Companion Matrix | 8 | 4 | 2 | |||
| UDM Companion Matrix | 12 | 6 | 3 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Universally Decodable Matrices for Distributed Matrix-Vector Multiplication
Aditya Ramamoorthy, Li Tang
Department of Electrical and Computer Engineering
Iowa State University
Ames, IA 50010, U.S.A.
Pascal O. Vontobel This work was supported in part by the National Science Foundation (NSF) under grant CCF-1718470. Department of Information Engineering
The Chinese University of Hong Kong
Hong Kong, S. A. R.
Abstract
Coded computation is an emerging research area that leverages concepts from erasure coding to mitigate the effect of stragglers (slow nodes) in distributed computation clusters, especially for matrix computation problems. In this work, we present a class of distributed matrix-vector multiplication schemes that are based on codes in the Rosenbloom-Tsfasman metric and universally decodable matrices. Our schemes take into account the inherent computation order within a worker node. In particular, they allow us to effectively leverage partial computations performed by stragglers (a feature that many prior works lack). An additional main contribution of our work is a companion matrix-based embedding of these codes that allows us to obtain sparse and numerically stable schemes for the problem at hand. Experimental results confirm the effectiveness of our techniques.
I Introduction
Distributed computation clusters are routinely used in domains such as machine learning and scientific computing. In these applications, datasets are often so large that they cannot be housed in the disk of a single server. Furthermore, processing the data on a single server is either infeasible or unacceptably slow. Thus, the data and the processing is distributed and processed across a large number of nodes.
While large clusters have numerous advantages, they also present newer operational challenges. These clusters (which can be heterogeneous in nature) suffer from the problem of “stragglers” which are defined as slow nodes (node failures are an extreme form of a straggler). It is evident that the overall speed of a computation on these clusters is typically dominated by stragglers in the absence of a sophisticated assignment of tasks to the worker nodes.
In recent years, approaches based on coding theory (referred to as “coded computation”) have been effectively used for straggler mitigation [1, 2, 3, 4, 5, 6, 7, 8, 9]. Coded computation offers significant benefits for specific classes of problems, e.g., matrix computations. We illustrate this by means of a matrix-vector multiplication example in Fig. 1, where a matrix is block-row decomposed as . Each worker node is given the responsibility of computing two submatrix-vector products so that the computational load on each worker is -rd of the original. It can be observed that even if one worker fails, there is enough information for a master node to compute the final result. However, this requires the master node to solve simple systems of equations. This approach can be generalized (and also adapted for matrix multiplication) by using Reed-Solomon (RS) code like approaches [1, 2, 3, 4, 5]. These methods allow the master node to recover if any of the worker nodes complete their computation; is called the recovery threshold.
A significant amount of prior work treats stragglers as node failures (see [8, 9, 6] for exceptions), or, equivalently from the point of view of coding theory, as erasures. This matches the conventional erasure coding problem very well and allows the adaptation of well-known approaches, e.g, RS codes to the problem of distributed matrix computations. However, there are certain features of the distributed matrix-vector multiplication problem that distinguish it from classical erasure correction that we now discuss.
- •
Leveraging partial computation performed by stragglers. Each worker node operates in a sequential fashion on its assigned rows, e.g., in Fig. 1, worker , first computes and only then . If node [math] is a straggler (but not a failure), ignoring the partial computation it performs will be wasteful.
- •
Numerically stable decoding. The RS-based approach requires the master node to solve a real Vandermonde system of linear equations or equivalently perform polynomial interpolation. It is well recognized that real Vandermonde matrices have a rather large condition number111While there is literature on choosing good evaluation points to reduce the condition number, in the distributed matrix vector multiplication context, we require decoding from any evaluation points. This makes the worst case condition number quite bad. which translates into significant numerical issues in recovering . This numerical issue is especially important in Krylov subspace methods for solving large linear systems of equations [10] (which repeatedly compute matrix-vector products) and in machine learning, where gradient computations are often approximate.
- •
Dealing with sparse matrices. The case when the matrix is sparse is often an important one in practice. RS-based approaches typically generate submatrices that are sent to the worker nodes by combining a large number of rows of , thus destroying the inherent sparsity of the problem. This can significantly increase the computation time [7] at the worker nodes. Thus, techniques that only require sparse combinations of the rows of are of great interest.
I-A Main contributions of our work
We present a class of distributed matrix-vector multiplication schemes that provably leverage partial computations by stragglers, while possessing a numerically stable decoding algorithm. These schemes are related to codes in the Rosenbloom-Tsfasman metric [11] and universally decodable matrices (UDM) [12] that were presented in different contexts. Roughly speaking, while the RS-based approach corresponds to polynomial evaluation/interpolation, our approach can be viewed as working with polynomials with roots of higher multiplicity. An additional main contribution of our work is the usage of companion matrices [13] that allow for an embedding of finite-field matrices into the real field; this significantly improves the condition numbers of the relevant matrices.
II Problem Formulation
We consider a scenario where the master node has a matrix and vector (both real-valued) and is connected to worker nodes. For convenience, for arbitrary positive integer , let . The master node first partitions into block-rows (or submatrices) denoted by , each of the same dimension. Following this, it generates submatrices denoted , (of the same dimension as the ’s) such that worker is sent submatrices , and the vector . Let . Then, each worker is assigned the equivalent of a -fraction of the rows of . In this paper, we assume is large enough so that can be chosen large enough. Throughout this paper the submatrices will be linear combinations of , such that the master node only calculates scalar multiples and sums of ’s.
In what follows, we say that worker has processed a submatrix if it has calculated . A key feature of the distributed matrix-vector multiplication problem is that the matrices are processed sequentially in the order , i.e., a worker node processes only if it has finished processing . Each time a worker node computes a product or a block of consecutive products, it sends the result to the master node. Our system requirement dictates that the master node should be able to decode as long as it receives a minimum number of products from the worker nodes. Fig. 1 demonstrates a system we consider.
It is evident that the properties of a given scheme depend upon the properties of the matrices , . To specify this encoding we discuss constructions of collections of matrices that have certain desired properties. Some of our constructions are first designed over a finite field and then embedded into using an appropriately defined procedure. Accordingly, we define certain rank conditions that depend on an underlying field of operation denoted by . We will explicitly specify when discussing the constructions. Consider matrices , over with dimension and let represent the -th entry of . Let . We define the set
[TABLE]
Definition 1**.**
-weak full-rank matrices.
Let be positive integers such that divides and . Let . Consider matrices , of dimension . Let and let . If the matrix composed of the first columns of , the first columns of , , and the first columns of , has full rank over , i.e., for all we say the collection satisfies the -weak full-rank condition.
The collection is used to obtain the submatrices stored in worker when as
[TABLE]
Consider first the case and assume that worker has finished processing submatrices and . Let be as specified in the definition above. It is not too hard to see that the system requirement of decoding from any submatrix-vector products is equivalent to the condition that is full-rank over for all possible patterns . Thus, designing that satisfy Definition 1 is sufficient for the problem at hand.
Values of correspond to a relaxation of this condition. Specifically, suppose that each worker node returns the results in blocks of size . For instance, worker node computes and then reports the result back to the master. Following this, it focuses on the next block , and so on. In this case, decoding by the master node is guaranteed if it receives any blocks of size (this explains our choice of subscript in ). If the -weak full-rank condition holds for , we will refer to the system as satisfying the strong full-rank condition.
Remark 1**.**
When , then the master node can recover when any submatrices have been processed across the workers, i.e., the worst case computational load on the system, measured at the granularity of a submatrix is . If , then the worst case computational load can be as high as
[TABLE]
For our constructions, the second term can be made as small as desired by choosing a large enough .
Example 1**.**
Consider the system in Fig. 1 with , , . Matrix is partitioned into three submatrices by rows, . Each worker node is assigned two submatrices and the vector . The following real-valued matrices satisfy the conditions in Definition 1 for (see Fig. 1 for the corresponding matrices).
[TABLE]
III Coded schemes satisfying
the strong full rank condition
In this section, we present two schemes that satisfy the strong full-rank condition. The first scheme is essentially an embedding of an RS code in the matrix-vector multiplication framework and has appeared in [4]. The second one is inspired by the constructions in [11, 12].
Let be a polynomial of degree with real coefficients, i.e., where denotes the ring of polynomials with real coefficients. Let denote the -th derivative of . It is evident that
[TABLE]
where if . Furthermore, note that we can also represent by considering its Taylor series expansion around a point , i.e.,
[TABLE]
It is well known that has a zero of multiplicity at if and only if for and .
III-A RS-based scheme
In the first scheme we simply choose the columns of for to correspond to a polynomial of degree being evaluated at distinct points in , i.e.,
[TABLE]
where are distinct for .
III-B UDM-based scheme
Our second construction works by choosing the columns of corresponding to the evaluations of a polynomial and its derivatives of order . We first choose distinct real numbers . For worker node , we choose the -th column in correspondence with the evaluation of the -th derivative of a degree- polynomial at value scaled by (cf. Eq. (2)). Thus, for and ,
[TABLE]
We note here that there is another choice of matrix, denoted that can be used instead of the above choices for one of the workers. For and , we let
[TABLE]
III-C Properties of the Coded Schemes
Claim 1**.**
The matrices defined in Section III-A and Section III-B satisfy the strong full-rank condition in Definition 1.
Proof.
Consider any vector pattern such that . Let be composed of the first columns of , . For the RS-based construction in (4), it is evident that is a Vandermonde matrix. As the ’s are distinct, has full rank. For the UDM-based scheme, if all workers are chosen based on (5), the result follows from the determinant of a generalized Vandermonde determinant [14]. On the other hand, assume without loss of generality that the -th worker is assigned the matrix (cf. Eq. (6)). In this case, can be written as
[TABLE]
where is a matrix with ones on the anti-diagonal and are composed of the first columns of , . Once again, the generalized Vandermonde determinant formula [14] shows that is full rank. This coupled with the fact that is also full rank, gives us the required result. ∎
It is evident that the above constructions satisfy the strong full-rank condition. However, experimental results (see also [15]) show that these constructions result in badly conditioned matrices in the worst case. In addition, both (4) and (5) result in dense linear combinations of , rendering them unsuitable in the scenario when is sparse. Nevertheless, the UDM-based construction (5), provides a systematic way to take into account the sequential processing order of the worker nodes.
Remark 2**.**
The RS-based scheme is in one-to-one correspondence with polynomial interpolation from any (out of ) distinct evaluation points. The UDM-based scheme uses much fewer evaluation points (only ) but is equivalent to interpolating a polynomial with roots of higher multiplicity.
IV Coded schemes satisfying
the weak full rank condition
Our second class of constructions produces schemes that satisfy the -weak full-rank condition. However, they have excellent numerical stability and are much sparser than those discussed in Section III. These schemes are obtained by first constructing a collection of matrices over a finite field (where is prime) and then embedding the finite field matrices into real field by companion matrix. Towards this end, let be a polynomial with coefficients from , i.e., . The -th Hasse derivative222To avoid confusion with the case of real-valued polynomials, we superscript the finite field polynomials with ~ and represent the Hasse derivatives with square brackets. of is defined as
[TABLE]
where we emphasize that the quantity is interpreted as a element of . In this scenario, it can be shown that has a zero of multiplicity at a point (or in an appropriate extension field) if for and .
The work of [11, 12] shows that the following matrices , satisfy the strong full-rank condition over , assuming .
[TABLE]
where are distinct non-zero elements in . We remark here that while the expression above is the same as the one in (5), the elements of (8) lie in .
One reason for considering the matrices in (8) is as follows. Suppose that we operate over , i.e., . Note that the calculation in (8) is equivalent to computing over the integers and reducing it modulo . In particular, this implies that whenever is even, the corresponding matrix entry will be zero. Thus, over finite fields, the matrices obtained using (8) are likely much sparser than those obtained from (5).
Example 2**.**
Let . Consider the polynomial over . Its -th Hasse derivatives, are
[TABLE]
Note here that has only two non-zero coefficients, whereas when considering derivatives over the reals, it will have three non-zero coefficients. Then,
[TABLE]
where and values are distinct for .
A natural question arises if it is possible to somehow “embed” the matrices defined in (8) into corresponding real matrices such that the conditions of Definition 1 hold (for real matrices). This does not appear to be a straightforward problem. For example, simply requiring distinct ’s is not sufficient. For instance, if we choose and then the matrix
[TABLE]
obtained by choosing the first two columns of and the first two columns of is singular.
Remark 3**.**
If the are chosen randomly from a large enough subset of , then we can assert that the collection will satisfy the strong full-rank property with high probability. To see this, let be indeterminates for now and consider for any pattern . The determinant of is a multivariate polynomial with coefficients from . The results of [11, 12] certainly imply that is not identically zero. Now, consider the determinant (polynomial) of denoted obtained by considering , i.e., has integer coefficients. Clearly, can be obtained by reducing each coefficient of modulo . Therefore, is also not identically zero. It follows that the product of all the real multivariate polynomials corresponding to the relevant ’s is not identically zero. The result then follows, by choosing a large enough subset of the reals and applying the Schwartz-Zippel lemma.
Next, we utilize a representation of by matrices over [13]. Let denote the ring of polynomials in with coefficients from . Let be a primitive element in and let denote the primitive polynomial associated with . The companion matrix (over ) associated with is
[TABLE]
Define with matrix addition and multiplication over , where [math] denotes zero matrix and denotes identity matrix. Then it is well-known [13] that the forms a finite field of size and is therefore isomorphic to . In particular, the mapping , , maps the elements in to their corresponding matrix representation. In this work, we need another isomorphism. The elements of are represented by polynomials in of degree smaller than with regular polynomial addition and multiplication being reduced to lower powers by using . Let represent the mapping of a polynomial to its vector representation. The addition of and is mapped to . The product of and is mapped to .
To see that this is a valid isomorphism, we have the following argument that establishes the equivalence of multiplication with in and left multiplication by . Let be an element of . Then
[TABLE]
It can be seen that gives the same result. The isomorphism of and shows that each element of can be represented as a power of . The result is then obtained by inductively applying the equivalence presented above.
Lemma 1**.**
Let be a matrix with entries from . Let denote the matrix obtained by applying the map to each entry of . Note that and . We claim that
[TABLE]
Furthermore, let denote the matrix over the integers obtained by mapping each element of to the corresponding integer in . If we have over the reals.
Proof.
Suppose that but . Note that this implies that there exists a non-zero vector where such that
[TABLE]
Now we use the isomorphism presented above. Let be obtained by applying to . Therefore, relation (12), equivalently implies that
[TABLE]
where the above equation is understood to be over . However, this is a contradiction since and . The reverse conclusion can be obtained in a similar manner. Note that . It can also be equivalently computed by finding over reals and reducing the result modulo . Thus, we have that over reals. ∎
We now present the construction of systems that satisfy the -weak full-rank property.
Lemma 2**.**
Let , , be a collection of matrices with size over that satisfy the strong full-rank property. Consider the matrices of dimension over , where is obtained by applying the mapping to each entry of . Then, the collection satisfies the -weak full rank condition over .
Proof.
This is an immediate consequence of Lemma 1. ∎
Example 3**.**
Consider collection of matrices presented in Example 2 over . Let the primitive polynomial over be . Suppose that . Then,
[TABLE]
By Lemma 2, the collection satisfies the -weak full rank condition over .
We note here that Lemma 2 can also be applied to an RS code defined over a finite field.
Remark 4**.**
Our proposed scheme requires us to operate over an extension field large enough so that for the UDM based approach and for the RS-based approach. Thus, the second term in the worst case computational load (cf. Eq. (1)) can be made as small as desired by choosing large enough. Increasing does come at the cost of high condition numbers (cf. Section V).
V Comparisons of the different schemes
In this section, we compare the performance of the different schemes that have been proposed in this work. For each scheme, we construct all possible matrices based on and and calculate their condition number. We report the maximum and average condition number of all such possible ’s. Furthermore, we also report the average number of non-zero elements in the matrices for each collection.
In Table I we report results for a system with workers and storage capacity for each worker . For the “RS-based scheme”, we set , , , in (4) to 18 equally spaced reals within the interval . For the “UDM-based scheme”, we set in (5), to 6 equally spaced reals within the interval . For “RS + Embedding from ”, we construct (4) over . Note that the field size is the least prime number that is greater or equal to the number of evaluation points. Then we embed (4) into by using the natural mapping of into the integers. We construct “UDM + Embedding from ” in a similar manner. It can be seen that the condition number of “UDM scheme + Embedding from ” is the lowest when compared the other three schemes discussed thus far.
The other rows of Table I correspond to the companion matrix approach. In each of these cases we first design the RS-based or the UDM-based scheme over the corresponding extension field and then use the companion matrix idea introduced in Section IV. One can observe that the RS + Companion matrix schemes typically have high condition number. This is because the size of the companion matrix needs to be large enough to accommodate evaluation points. The UDM + Companion matrix schemes can work with extension fields larger than , so their companion matrices tend to be smaller. Another advantage of the companion matrix approach is that the schemes are much sparser. Indeed, the “UDM + Companion matrix ” in Table I not only has a very low worst case condition number but also a sparsity level of 36% which is the second lowest among all the schemes.
To better understand the performance corresponding to different choices of extension field, we consider a larger system with , in Table II. It can be observed that the RS-based scheme is worse than the UDM-based scheme. Another observation is that the “UDM + Companion matrix ” has the lowest condition number and the matrices become sparser when the size of the companion matrix increases.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in Proc. of Adv. in Neural Inf. Proc. Sys. (NIPS) , 2017, pp. 4403–4413.
- 2[2] L. Tang, K. Konstantinidis, and A. Ramamoorthy, “Erasure coding for distributed matrix multiplication for matrices with bounded entries,” IEEE Comm. Lett. , vol. 23, no. 1, pp. 8–11, 2019.
- 3[3] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional coded matrix multiplication,” in IEEE Intl. Symp. on Inf. Theory , 2017, pp. 2418–2422.
- 4[4] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. on Info. Th. , vol. 64, no. 3, pp. 1514–1529, 2018.
- 5[5] S. Dutta, V. Cadambe, and P. Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” in Proc. of Adv. in Neural Inf. Proc. Sys. (NIPS) , 2016, pp. 2100–2108.
- 6[6] A. Mallick, M. Chaudhari, and G. Joshi, “Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication,” preprint, 2018, [Online] Available: https://arxiv.org/abs/1804.10331.
- 7[7] S. Wang, J. Liu, and N. B. Shroff, “Coded sparse matrix multiplication,” in Proc. 35th Intl. Conf. on Mach. Learning, ICML , 2018, pp. 5139–5147.
- 8[8] S. Kiani, N. Ferdinand, and S. C. Draper, “Exploitation of stragglers in coded computation,” in IEEE Intl. Symp. on Inf. Theory , 2018, pp. 1988–1992.
