Universally Decodable Matrices for Distributed Matrix-Vector Multiplication

Aditya Ramamoorthy; Li Tang; Pascal O. Vontobel

arXiv:1901.10674·cs.IT·January 31, 2019


TL;DR

This paper introduces a novel class of distributed matrix-vector multiplication schemes using universally decodable matrices and Rosenbloom-Tsfasman codes, effectively leveraging partial computations and ensuring numerical stability.

Contribution

It presents a new coding scheme for distributed matrix-vector multiplication that accounts for computation order and partial results, enhancing efficiency and stability.

Findings

01

Effective mitigation of stragglers in distributed computation

02

Sparse and numerically stable coding schemes

03

Experimental validation of scheme effectiveness

Abstract

Coded computation is an emerging research area that leverages concepts from erasure coding to mitigate the effect of stragglers (slow nodes) in distributed computation clusters, especially for matrix computation problems. In this work, we present a class of distributed matrix-vector multiplication schemes that are based on codes in the Rosenbloom-Tsfasman metric and universally decodable matrices. Our schemes take into account the inherent computation order within a worker node. In particular, they allow us to effectively leverage partial computations performed by stragglers (a feature that many prior works lack). An additional main contribution of our work is a companion matrix-based embedding of these codes that allows us to obtain sparse and numerically stable schemes for the problem at hand. Experimental results confirm the effectiveness of our techniques.

Tables

Table I: Performance comparison for a system with $N=6$, $\gamma=3/4$ and $Q_{\mathrm{b}}=4$.

Scheme	$Δ$	$ℓ$	s	Max. Cond. Num.	Avg. Cond. Num.	Density of $G_{k}$
RS based scheme	4	3		$5.1 \times 10^{3}$	$334$	$100 %$
RS $+$ Companion Matrix of $GF (2^{5})$	20	15	5	$3.4 \times 10^{4}$	$814$	$51 %$
RS $+$ Embedding from $GF (19)$	4	3		$7.3 \times 10^{3}$	$312$	$100 %$
RS $+$ Companion Matrix of $GF (3^{3})$	12	9	3	$1.5 \times 10^{3}$	$98$	$71 %$
UDM-based scheme	4	3		$6.1 \times 10^{3}$	$265$	$75 %$
UDM $+$ Embedding from $GF (7)$	4	3		$1.5 \times 10^{3}$	$98$	$75 %$
UDM $+$ Companion Matrix of $GF (2^{3})$	12	9	3	$583$	$99$	$32 %$
UDM $+$ Companion Matrix of $GF (3^{2})$	8	6	2	$182$	$23$	$36 %$

Table II: Performance comparison of different extension fields for a system with $N=15$, $\gamma=1/2$ and $Q_{\mathrm{b}}=4$.

Scheme	$Δ$	$ℓ$	s	Max. Cond. Num.	Avg. Cond. Num.	Density of $G_{k}$
RS $+$ Companion Matrix $GF (2^{5})$	20	10	5	$2.8 \times 10^{5}$	$751$	$53 %$
RS $+$ Companion Matrix $GF (5^{3})$	12	6	3	$1.1 \times 10^{5}$	$183$	$83 %$
RS $+$ Companion Matrix $GF (3^{4})$	16	8	4	$3.5 \times 10^{4}$	$202$	$67 %$
UDM $+$ Companion Matrix $GF (2^{4})$	16	8	4	$3.7 \times 10^{4}$	$286$	$33 %$
UDM $+$ Companion Matrix $GF (5^{2})$	8	4	2	$1.1 \times 10^{4}$	$86$	$62 %$
UDM $+$ Companion Matrix $GF (3^{3})$	12	6	3	$624$	$96$	$41 %$

Equations

\Psi_{N,\ell,s}^{=Q_{\mathrm{b}}}=\left\{\mathbf{v}\ \middle|\ v_{i}\in[\ell],\ i\in[N],\ \sum_{i\in[N]}\left\lfloor\frac{v_{i}}{s}\right\rfloor=Q_{\mathrm{b}}\right\}.

\hat{\mathbf{A}}_{k,j}=\sum_{i\in[\Delta]}G_{k}(i,j)\,\mathbf{A}_{i},\quad j\in[\ell].

\Delta\left(1+\frac{(N-1)(s-1)}{\Delta}\right). \tag{1}

G_{0}=\begin{bmatrix}1&0\\0&1\\0&1\end{bmatrix},\quad G_{1}=\begin{bmatrix}0&1\\1&0\\0&1\end{bmatrix},\quad\text{and}\quad G_{2}=\begin{bmatrix}0&1\\0&1\\1&0\end{bmatrix}.

u^{(j)}(x)=\sum_{k=0}^{d}u_{k}\binom{k}{j}\,j!\,x^{k-j}, \tag{2}

u(x)=\sum_{k=0}^{d}\frac{u^{(k)}(\beta)}{k!}(x-\beta)^{k}.

G_{k}(i,j)=\beta_{k,j}^{i},\quad\text{for}\ i\in[\Delta],\ j\in[\ell]. \tag{4}

G_{k}(i,j)=\begin{cases}\binom{i}{j}\beta_{k}^{i-j}&\text{if } i\geq j,\\0&\text{otherwise.}\end{cases} \tag{5}

G_{*}(i,j)=\begin{cases}1&\text{if } i=N-1-j,\\0&\text{otherwise.}\end{cases} \tag{6}

\mathbf{G}=\begin{bmatrix}\mathbf{B}_{1}&0\\\mathbf{B}_{2}&\mathbf{B}_{3}\end{bmatrix},

\tilde{u}^{[i]}(x)=\sum_{k=0}^{d}\binom{k}{i}\tilde{u}_{k}x^{k-i},

G_{k}(i,j)=\begin{cases}\binom{i}{j}\beta_{k}^{i-j}&\text{if } i\geq j,\\0&\text{otherwise,}\end{cases}\quad\text{over}\ \mathbb{F}_{p^{n}}. \tag{8}

\tilde{u}^{[0]}(x)=\tilde{u}_{0}+\tilde{u}_{1}x+\tilde{u}_{2}x^{2}+\tilde{u}_{3}x^{3},\quad\tilde{u}^{[1]}(x)=\tilde{u}_{1}+\tilde{u}_{3}x^{2},\quad\tilde{u}^{[2]}(x)=\tilde{u}_{2}+\tilde{u}_{3}x.

G_{i}=\begin{bmatrix}1&0&0\\\beta_{i}&1&0\\\beta_{i}^{2}&0&1\\\beta_{i}^{3}&\beta_{i}^{2}&\beta_{i}\end{bmatrix}

\mathbf{G}=\begin{bmatrix}1&0&1&0\\1&1&-1&1\\1&0&1&0\\1&1&-1&1\end{bmatrix}

C=\begin{bmatrix}0&0&\cdots&0&-\pi_{0}\\1&0&\cdots&0&-\pi_{1}\\0&1&\cdots&0&-\pi_{2}\\\vdots&\vdots&\ddots&\vdots&\vdots\\0&0&\cdots&1&-\pi_{n-1}\end{bmatrix}.

\alpha\,(b_{0}+b_{1}\alpha+\dots+b_{n-1}\alpha^{n-1})=b_{0}\alpha+b_{1}\alpha^{2}+\dots+b_{n-1}\alpha^{n}=-\pi_{0}b_{n-1}+(b_{0}-\pi_{1}b_{n-1})\alpha+\dots+(b_{n-2}-\pi_{n-1}b_{n-1})\alpha^{n-1}.

\det(\mathbf{B})\neq 0\iff\det(\tilde{\mathbf{B}})\neq 0.

\tilde{\mathbf{B}}\tilde{\mathbf{y}}=0. \tag{12}

\mathbf{B}\mathbf{y}=0,

G_{k}=\begin{bmatrix}1&0&0\\\alpha^{k}&1&0\\\alpha^{2k}&0&1\\\alpha^{3k}&\alpha^{2k}&\alpha^{k}\end{bmatrix},


Full text

Universally Decodable Matrices for Distributed Matrix-Vector Multiplication

Aditya Ramamoorthy, Li Tang

Department of Electrical and Computer Engineering

Iowa State University

Ames, IA 50010, U.S.A.

Pascal O. Vontobel

Department of Information Engineering

The Chinese University of Hong Kong

Hong Kong, S. A. R.

This work was supported in part by the National Science Foundation (NSF) under grant CCF-1718470.


I Introduction

Distributed computation clusters are routinely used in domains such as machine learning and scientific computing. In these applications, datasets are often so large that they cannot be housed on the disk of a single server. Furthermore, processing the data on a single server is either infeasible or unacceptably slow. Thus, both the data and the processing are distributed across a large number of nodes.

While large clusters have numerous advantages, they also present newer operational challenges. These clusters (which can be heterogeneous in nature) suffer from the problem of “stragglers” which are defined as slow nodes (node failures are an extreme form of a straggler). It is evident that the overall speed of a computation on these clusters is typically dominated by stragglers in the absence of a sophisticated assignment of tasks to the worker nodes.

In recent years, approaches based on coding theory (referred to as “coded computation”) have been effectively used for straggler mitigation [1, 2, 3, 4, 5, 6, 7, 8, 9]. Coded computation offers significant benefits for specific classes of problems, e.g., matrix computations. We illustrate this by means of a matrix-vector multiplication example in Fig. 1, where a matrix $\mathbf{A}$ is block-row decomposed as $\mathbf{A}=[\mathbf{A}_{0}^{T}\ \mathbf{A}_{1}^{T}\ \mathbf{A}_{2}^{T}]^{T}$. Each worker node is responsible for computing two submatrix-vector products, so that the computational load on each worker is two-thirds of the original. Even if one worker fails, the master node has enough information to compute the final result, although this requires it to solve simple systems of equations. This approach can be generalized (and also adapted for matrix multiplication) by using Reed-Solomon (RS) code-like approaches [1, 2, 3, 4, 5]. These methods allow the master node to recover $\mathbf{A}\mathbf{x}$ if any $\tau$ of the worker nodes complete their computation; $\tau$ is called the recovery threshold.

A significant amount of prior work treats stragglers as node failures (see [8, 9, 6] for exceptions), or, equivalently from the point of view of coding theory, as erasures. This matches the conventional erasure coding problem very well and allows the adaptation of well-known approaches, e.g., RS codes, to the problem of distributed matrix computations. However, certain features of the distributed matrix-vector multiplication problem distinguish it from classical erasure correction. We discuss them now.

• Leveraging partial computation performed by stragglers. Each worker node operates in a sequential fashion on its assigned rows, e.g., in Fig. 1, worker $W_{0}$ first computes $\hat{\mathbf{A}}_{00}\mathbf{x}$ and only then $\hat{\mathbf{A}}_{01}\mathbf{x}$. If node $W_{0}$ is a straggler (but not a failure), ignoring the partial computation it performs is wasteful.

• Numerically stable decoding. The RS-based approach requires the master node to solve a real Vandermonde system of linear equations or, equivalently, to perform polynomial interpolation. It is well recognized that real Vandermonde matrices have a rather large condition number, which translates into significant numerical issues in recovering $\mathbf{A}\mathbf{x}$. This numerical issue is especially important in Krylov subspace methods for solving large linear systems of equations [10] (which repeatedly compute matrix-vector products) and in machine learning, where gradient computations are often approximate. [Footnote 1: While there is literature on choosing good evaluation points to reduce the condition number, in the distributed matrix-vector multiplication context we require decoding from any $\tau$ evaluation points. This makes the worst case condition number quite bad.]

• Dealing with sparse $\mathbf{A}$ matrices. The case when the matrix $\mathbf{A}$ is sparse is often an important one in practice. RS-based approaches typically generate the submatrices that are sent to the worker nodes by combining a large number of rows of $\mathbf{A}$, thus destroying the inherent sparsity of the problem. This can significantly increase the computation time [7] at the worker nodes. Thus, techniques that only require sparse combinations of the rows of $\mathbf{A}$ are of great interest.

I-A Main contributions of our work

We present a class of distributed matrix-vector multiplication schemes that provably leverage partial computations by stragglers, while possessing a numerically stable decoding algorithm. These schemes are related to codes in the Rosenbloom-Tsfasman metric [11] and universally decodable matrices (UDM) [12] that were presented in different contexts. Roughly speaking, while the RS-based approach corresponds to polynomial evaluation/interpolation, our approach can be viewed as working with polynomials with roots of higher multiplicity. An additional main contribution of our work is the usage of companion matrices [13] that allow for an embedding of finite-field matrices into the real field; this significantly improves the condition numbers of the relevant matrices.

II Problem Formulation

We consider a scenario where the master node has a matrix $\mathbf{A}$ and a vector $\mathbf{x}$ (both real-valued) and is connected to $N$ worker nodes. For convenience, for an arbitrary positive integer $n$, let $[n]\triangleq\{0,\dots,n-1\}$. The master node first partitions $\mathbf{A}$ into $\Delta$ block-rows (or submatrices) denoted by $\mathbf{A}_{0},\cdots,\mathbf{A}_{\Delta-1}$, each of the same dimension. Following this, it generates submatrices denoted $\hat{\mathbf{A}}_{i,j}$, $i\in[N],j\in[\ell]$ (of the same dimension as the $\mathbf{A}_{i}$'s) such that worker $W_{i}$ is sent the submatrices $\hat{\mathbf{A}}_{i,j},j\in[\ell]$, and the vector $\mathbf{x}$. Let $\gamma\triangleq\ell/\Delta$. Then, each worker is assigned the equivalent of a $\gamma$-fraction of the rows of $\mathbf{A}$. In this paper, we assume that $\mathbf{A}$ is large enough so that $\Delta$ can be chosen as large as needed. Throughout this paper the submatrices $\hat{\mathbf{A}}_{i,j}$ will be linear combinations of $\mathbf{A}_{0},\dots,\mathbf{A}_{\Delta-1}$, so that the master node only calculates scalar multiples and sums of the $\mathbf{A}_{i}$'s.

In what follows, we say that worker $W_{i}$ has processed a submatrix $\hat{\mathbf{A}}_{i,j}$ if it has calculated $\hat{\mathbf{A}}_{i,j}\mathbf{x}$ . A key feature of the distributed matrix-vector multiplication problem is that the matrices $\hat{\mathbf{A}}_{i,j}$ are processed sequentially in the order $\hat{\mathbf{A}}_{i,0},\hat{\mathbf{A}}_{i,1},\dots$ , i.e., a worker node $W_{i}$ processes $\hat{\mathbf{A}}_{i,j}$ only if it has finished processing $\hat{\mathbf{A}}_{i,0},\dots,\hat{\mathbf{A}}_{i,j-1}$ . Each time a worker node computes a product or a block of consecutive products, it sends the result to the master node. Our system requirement dictates that the master node should be able to decode $\mathbf{A}\mathbf{x}$ as long as it receives a minimum number of products from the worker nodes. Fig. 1 demonstrates a system we consider.

It is evident that the properties of a given scheme depend upon the properties of the matrices $\hat{\mathbf{A}}_{i,j}$ , $i\in[N],j\in[\ell]$ . To specify this encoding we discuss constructions of collections of matrices that have certain desired properties. Some of our constructions are first designed over a finite field and then embedded into $\mathbb{R}$ using an appropriately defined procedure. Accordingly, we define certain rank conditions that depend on an underlying field of operation denoted by $\mathbb{K}$ . We will explicitly specify $\mathbb{K}$ when discussing the constructions. Consider $N$ matrices $G_{k}$ , $0\leq k<N$ over $\mathbb{K}$ with dimension $\Delta\times\ell$ and let $G_{k}(i,j)$ represent the $(i,j)$ -th entry of $G_{k}$ . Let $\mathbf{v}=(v_{0},\cdots,v_{N-1})$ . We define the set

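$$\Psi_{N,\ell,s}^{=Q_{\mathrm{b}}}=\left\{\mathbf{v}\ \middle|\ v_{i}\in[\ell],\ i\in[N],\ \sum_{i\in[N]}\left\lfloor\frac{v_{i}}{s}\right\rfloor=Q_{\mathrm{b}}\right\}.$$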

Definition 1 ($s$-weak full-rank matrices).

Let $N,\ell,\Delta,s$ be positive integers such that $s$ divides $\ell$ and $\Delta$. Let $Q_{\mathrm{b}}=\Delta/s$. Consider matrices $G_{k},k\in[N]$, of dimension $\Delta\times\ell$. Let $\mathbf{v}\in\Psi_{N,\ell,s}^{=Q_{\mathrm{b}}}$ and let $\Delta_{\mathbf{v}}=\sum_{i\in[N]}v_{i}\geq Q_{\mathrm{b}}s$. Let $\mathbf{G}$ be the $\Delta\times\Delta_{\mathbf{v}}$ matrix composed of the first $v_{0}$ columns of $G_{0}$, the first $v_{1}$ columns of $G_{1}$, $\dots$, and the first $v_{N-1}$ columns of $G_{N-1}$. If $\mathbf{G}$ has full rank over $\mathbb{K}$, i.e., $\text{rank}_{\mathbb{K}}(\mathbf{G})=\Delta$, for all $\mathbf{v}\in\Psi_{N,\ell,s}^{=Q_{\mathrm{b}}}$, we say that the collection $\{G_{k}\}_{k=0}^{N-1}$ satisfies the $s$-weak full-rank condition.

The collection $\{G_{k}\}_{k=0}^{N-1}$ is used to obtain the $\ell$ submatrices stored in worker $W_{k}$ when $\mathbb{K}=\mathbb{R}$ as

$$\hat{\mathbf{A}}_{k,j}=\sum_{i\in[\Delta]}G_{k}(i,j)\,\mathbf{A}_{i},\quad j\in[\ell].$$

Consider first the case $s=1$ and assume that worker $W_{i}$ has finished processing $v_{i}$ submatrices and $v_{0}+\cdots+v_{N-1}\geq\Delta$ . Let $\mathbf{G}$ be as specified in the definition above. It is not too hard to see that the system requirement of decoding from any $Q_{\mathrm{b}}=\Delta$ submatrix-vector products is equivalent to the condition that $\mathbf{G}$ is full-rank over $\mathbb{R}$ for all possible patterns $(v_{0},\cdots,v_{N-1})$ . Thus, designing $\{G_{k}\}_{k=0}^{N-1}$ that satisfy Definition 1 is sufficient for the problem at hand.

Values of $s>1$ correspond to a relaxation of this condition. Specifically, suppose that each worker node returns the results in blocks of size $s$ . For instance, worker node $W_{i}$ computes $\hat{\mathbf{A}}_{i,0}\mathbf{x},\hat{\mathbf{A}}_{i,1}\mathbf{x},\dots,\hat{\mathbf{A}}_{i,s-1}\mathbf{x}$ and then reports the result back to the master. Following this, it focuses on the next block $\hat{\mathbf{A}}_{i,s}\mathbf{x},\dots,\hat{\mathbf{A}}_{i,2s-1}\mathbf{x}$ , and so on. In this case, decoding by the master node is guaranteed if it receives any $Q_{\mathrm{b}}$ blocks of size $s$ (this explains our choice of subscript $\mathrm{b}$ in $Q_{\mathrm{b}}$ ). If the $s$ -weak full-rank condition holds for $s=1$ , we will refer to the system as satisfying the strong full-rank condition.

Remark 1.

When $s=1$, the master node can recover $\mathbf{A}\mathbf{x}$ when any $Q_{\mathrm{b}}$ submatrices have been processed across the $N$ workers, i.e., the worst case computational load on the system, measured at the granularity of a submatrix, is $\Delta$. If $s>1$, then the worst case computational load can be as high as

$$\Delta\left(1+\frac{(N-1)(s-1)}{\Delta}\right). \tag{1}$$

For our constructions, the second term $\frac{(N-1)(s-1)}{\Delta}$ can be made as small as desired by choosing a large enough $\Delta$ .
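As a worked instance of the bound in Remark 1, the following evaluates it for two parameter sets taken from Table I ($N=6$, with $(\Delta,s)=(12,3)$ and $(\Delta,s)=(8,2)$). The third line uses a hypothetical larger $\Delta=120$, which does not appear in the tables, only to illustrate how the relative overhead shrinks.

```latex
\begin{aligned}
N=6,\ \Delta=12,\ s=3:&\quad \Delta\Bigl(1+\tfrac{(N-1)(s-1)}{\Delta}\Bigr) = 12 + 5\cdot 2 = 22,
  &&\text{overhead } \tfrac{10}{12}\approx 0.83,\\
N=6,\ \Delta=8,\ s=2:&\quad \Delta\Bigl(1+\tfrac{(N-1)(s-1)}{\Delta}\Bigr) = 8 + 5\cdot 1 = 13,
  &&\text{overhead } \tfrac{5}{8}\approx 0.63,\\
N=6,\ \Delta=120,\ s=3:&\quad 120 + 5\cdot 2 = 130,
  &&\text{overhead } \tfrac{10}{120}\approx 0.083.
\end{aligned}
```

The absolute excess $(N-1)(s-1)$ does not depend on $\Delta$, so the relative overhead decays like $1/\Delta$.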

Example 1.

Consider the system in Fig. 1 with $N=3$ , $\Delta=3$ , $\ell=2$ . Matrix $\mathbf{A}$ is partitioned into three submatrices by rows, $\mathbf{A}_{0},\mathbf{A}_{1},\mathbf{A}_{2}$ . Each worker node is assigned two submatrices and the vector $\mathbf{x}$ . The following real-valued matrices satisfy the conditions in Definition 1 for $s=1$ (see Fig. 1 for the corresponding $\hat{\mathbf{A}}_{i,j}$ matrices).

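$$G_{0}=\begin{bmatrix}1&0\\0&1\\0&1\end{bmatrix},\quad G_{1}=\begin{bmatrix}0&1\\1&0\\0&1\end{bmatrix},\quad\text{and}\quad G_{2}=\begin{bmatrix}0&1\\0&1\\1&0\end{bmatrix}.$$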
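The strong full-rank condition of Definition 1 can be checked by brute force for a system as small as Example 1. The sketch below is illustrative only: it assumes that each $G_{k}$ is read as a $3\times 2$ matrix whose $j$-th column lists the coefficients of $\mathbf{A}_{0},\mathbf{A}_{1},\mathbf{A}_{2}$ in $\hat{\mathbf{A}}_{k,j}$, and that a worker may finish anywhere between $0$ and $\ell$ submatrices. The helper names (`patterns`, `stacked`, `strong_full_rank`) are hypothetical.

```python
from itertools import product

import numpy as np

# Assumed reading of Example 1: column j of G[k] combines A_0, A_1, A_2
# into the j-th submatrix stored at worker W_k.
G = [
    np.array([[1, 0], [0, 1], [0, 1]], dtype=float),  # W_0: A_0, then A_1 + A_2
    np.array([[0, 1], [1, 0], [0, 1]], dtype=float),  # W_1: A_1, then A_0 + A_2
    np.array([[0, 1], [0, 1], [1, 0]], dtype=float),  # W_2: A_2, then A_0 + A_1
]


def patterns(num_workers, ell, total):
    """All completion patterns v with 0 <= v_i <= ell and sum(v) == total."""
    return [v for v in product(range(ell + 1), repeat=num_workers)
            if sum(v) == total]


def stacked(G, v):
    """First v_k columns of every G_k, placed side by side."""
    return np.hstack([Gk[:, :vk] for Gk, vk in zip(G, v)])


def strong_full_rank(G, ell, delta):
    """True if every pattern with delta finished products is decodable."""
    return all(np.linalg.matrix_rank(stacked(G, v)) == delta
               for v in patterns(len(G), ell, delta))


print(strong_full_rank(G, ell=2, delta=3))  # prints True
```

There are only seven patterns here: $(1,1,1)$ and the six arrangements of $(2,1,0)$. The same helpers apply unchanged to larger collections, at the cost of enumerating all patterns.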

III Coded schemes satisfying the strong full-rank condition

In this section, we present two schemes that satisfy the strong full-rank condition. The first scheme is essentially an embedding of an RS code in the matrix-vector multiplication framework and has appeared in [4]. The second one is inspired by the constructions in [11, 12].

Let $u(x)=\sum_{k=0}^{d}u_{k}x^{k}$ be a polynomial of degree $d$ with real coefficients, i.e., $u(x)\in\mathbb{R}[x]$ where $\mathbb{R}[x]$ denotes the ring of polynomials with real coefficients. Let $u^{(j)}(x)$ denote the $j$ -th derivative of $u(x)$ . It is evident that

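$$u^{(j)}(x)=\sum_{k=0}^{d}u_{k}\binom{k}{j}\,j!\,x^{k-j}, \tag{2}$$

where $\binom{k}{j}=0$ if $k<j$. Furthermore, note that we can also represent $u(x)$ by considering its Taylor series expansion around a point $\beta\in\mathbb{R}$, i.e.,

$$u(x)=\sum_{k=0}^{d}\frac{u^{(k)}(\beta)}{k!}(x-\beta)^{k}.$$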

It is well known that $u(x)$ has a zero of multiplicity $m$ at $\beta\in\mathbb{R}$ if and only if $u^{(i)}(\beta)=0$ for $0\leq i<m$ and $u^{(m)}(\beta)\neq 0$ .

III-A RS-based scheme

In the first scheme we simply choose the columns of $G_{k}$ for $k\in[N]$ to correspond to a polynomial of degree $\Delta-1$ being evaluated at distinct points in $\mathbb{R}$ , i.e.,

$$G_{k}(i,j)=\beta_{k,j}^{i},\quad\text{for}\ i\in[\Delta],\ j\in[\ell], \tag{4}$$

where $\beta_{k,j}\in\mathbb{R}$ are distinct for $k\in[N],j\in[\ell]$ .
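A minimal numerical sketch of the RS-based construction follows. It uses the parameters $N=6$, $\ell=3$, $\Delta=4$ of Table I and 18 equally spaced points in $[-1,1]$. How these points are assigned to the pairs $(k,j)$ is an assumption made here (row-major order), as is the convention that $v_{k}$ ranges over $0,\dots,\ell$, so the printed condition numbers need not match the tabulated ones.

```python
from itertools import product

import numpy as np

N, ell, delta = 6, 3, 4

# Assumption: beta[k, j] takes the 18 equally spaced points in row-major order.
beta = np.linspace(-1.0, 1.0, N * ell).reshape(N, ell)

# G_k(i, j) = beta[k, j] ** i for i in [delta], j in [ell].
G = [np.vander(beta[k], delta, increasing=True).T for k in range(N)]

conds = []
for v in product(range(ell + 1), repeat=N):
    if sum(v) != delta:
        continue
    # Square Vandermonde matrix on the delta evaluation points picked by v.
    Gv = np.hstack([G[k][:, :v[k]] for k in range(N)])
    conds.append(np.linalg.cond(Gv))

print("patterns:", len(conds))
print("max cond:", max(conds), "avg cond:", sum(conds) / len(conds))
```

Every $\mathbf{G}$ is invertible because the points are distinct, but patterns that pick neighbouring points give nearly dependent columns, which is what drives the worst-case condition number.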

III-B UDM-based scheme

Our second construction works by choosing the columns of $G_{k}$ to correspond to the evaluations of a polynomial and its derivatives of order $1,\dots,\ell-1$. We first choose $N$ distinct real numbers $\beta_{0},\dots,\beta_{N-1}$. For worker node $k$, we choose the $j$-th column in correspondence with the evaluation of the $j$-th derivative of a degree-$(\Delta-1)$ polynomial at the value $\beta_{k}$, scaled by $1/j!$ (cf. Eq. (2)). Thus, for $k\in[N],i\in[\Delta]$ and $j\in[\ell]$,

$$G_{k}(i,j)=\begin{cases}\binom{i}{j}\beta_{k}^{i-j}&\text{if } i\geq j,\\0&\text{otherwise.}\end{cases} \tag{5}$$

We note here that there is another choice of matrix, denoted $G_{*}$ that can be used instead of the above choices for one of the workers. For $i\in[\Delta]$ and $j\in[\ell]$ , we let

$$G_{*}(i,j)=\begin{cases}1&\text{if } i=N-1-j,\\0&\text{otherwise.}\end{cases} \tag{6}$$
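The real-valued UDM-based matrices can be sketched in a few lines. The snippet below assumes the entry rule $G_{k}(i,j)=\binom{i}{j}\beta_{k}^{i-j}$ for $i\geq j$ (zero otherwise), uses $N=6$ equally spaced $\beta_{k}$ in $[-1,1]$ with $\Delta=4$, $\ell=3$, and again lets $v_{k}$ range over $0,\dots,\ell$. The function name `udm_real` is hypothetical, and the optional $G_{*}$ matrix is not included.

```python
from itertools import product
from math import comb

import numpy as np


def udm_real(beta, delta, ell):
    """G(i, j) = C(i, j) * beta**(i - j) if i >= j, else 0."""
    G = np.zeros((delta, ell))
    for i in range(delta):
        for j in range(min(i + 1, ell)):
            G[i, j] = comb(i, j) * beta ** (i - j)
    return G


N, delta, ell = 6, 4, 3
betas = np.linspace(-1.0, 1.0, N)  # assumed choice of distinct points
G = [udm_real(b, delta, ell) for b in betas]

# Rank of every matrix built from the first v_k columns of each G_k.
ranks = [
    np.linalg.matrix_rank(np.hstack([G[k][:, :v[k]] for k in range(N)]))
    for v in product(range(ell + 1), repeat=N)
    if sum(v) == delta
]

print(G[0])                             # the matrix for beta_0 = -1
print(all(r == delta for r in ranks))   # strong full-rank check
```

Because column $j$ depends only on columns $0,\dots,j-1$ having been processed first, this layout mirrors the sequential order in which a worker produces its results.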

III-C Properties of the Coded Schemes

Claim 1.

The $N$ matrices defined in Section III-A and Section III-B satisfy the strong full-rank condition in Definition 1.

Proof.

Consider any vector pattern $\mathbf{v}=(v_{0},\ldots,v_{N-1})$ such that $v_{0}+\cdots+v_{N-1}=\Delta$. Let $\mathbf{G}$ be composed of the first $v_{k}$ columns of $G_{k}$, $k=0,\ldots,N-1$. For the RS-based construction in (4), it is evident that $\mathbf{G}$ is a Vandermonde matrix. As the $\beta_{k,j}$'s are distinct, $\mathbf{G}$ has full rank. For the UDM-based scheme, if all workers are chosen based on (5), the result follows from the formula for the generalized Vandermonde determinant [14]. On the other hand, assume without loss of generality that the $(N-1)$-th worker is assigned the $G_{*}$ matrix (cf. Eq. (6)). In this case, $\mathbf{G}$ can be written as

$$\mathbf{G}=\begin{bmatrix}\mathbf{B}_{1}&0\\\mathbf{B}_{2}&\mathbf{B}_{3}\end{bmatrix},$$

where $\mathbf{B}_{3}$ is a $v_{N-1}\times v_{N-1}$ matrix with ones on the anti-diagonal and $\mathbf{B}=[\mathbf{B}_{1}^{T}\ \mathbf{B}_{2}^{T}]^{T}$ is composed of the first $v_{k}$ columns of $G_{k}$, $k=0,\cdots,N-2$. Once again, the generalized Vandermonde determinant formula [14] shows that $\mathbf{B}_{1}$ is full rank. Since $\mathbf{G}$ is block lower-triangular and $\mathbf{B}_{3}$ is also full rank, this gives the required result. ∎

By Claim 1, the above constructions satisfy the strong full-rank condition. However, experimental results (see also [15]) show that these constructions result in badly conditioned $\mathbf{G}$ matrices in the worst case. In addition, both (4) and (5) result in dense linear combinations of the block-rows of $\mathbf{A}$, rendering them unsuitable when $\mathbf{A}$ is sparse. Nevertheless, the UDM-based construction (5) provides a systematic way to take into account the sequential processing order of the worker nodes.

Remark 2.

The RS-based scheme is in one-to-one correspondence with polynomial interpolation from any $\Delta$ (out of $N\ell$) distinct evaluation points. The UDM-based scheme uses far fewer evaluation points (only $N$) but is equivalent to interpolating a polynomial with roots of higher multiplicity.

IV Coded schemes satisfying the weak full-rank condition

Our second class of constructions produces schemes that satisfy the $s$-weak full-rank condition. However, they have excellent numerical stability and are much sparser than those discussed in Section III. These schemes are obtained by first constructing a collection of matrices over a finite field $\mathbb{F}_{p^{n}}$ (where $p$ is prime) and then embedding the finite-field matrices into the real field via companion matrices. Towards this end, let $\tilde{u}(x)=\sum_{k=0}^{d}\tilde{u}_{k}x^{k}$ be a polynomial with coefficients from $\mathbb{F}_{p^{n}}$, i.e., $\tilde{u}(x)\in\mathbb{F}_{p^{n}}[x]$. The $i$-th Hasse derivative of $\tilde{u}(x)$ is defined as

$$\tilde{u}^{[i]}(x)=\sum_{k=0}^{d}\binom{k}{i}\tilde{u}_{k}x^{k-i},$$

where we emphasize that the quantity $\binom{k}{i}$ is interpreted as an element of $\mathbb{F}_{p}$. In this scenario, it can be shown that $\tilde{u}(x)$ has a zero of multiplicity $m$ at a point $\beta\in\mathbb{F}_{p^{n}}$ (or in an appropriate extension field) if $\tilde{u}^{[i]}(\beta)=0$ for $0\leq i<m$ and $\tilde{u}^{[m]}(\beta)\neq 0$. [Footnote 2: To avoid confusion with the case of real-valued polynomials, we mark finite-field polynomials with a tilde and denote Hasse derivatives with square brackets.]

The work of [11, 12] shows that the following matrices $G_{k},k\in[N]$ , satisfy the strong full-rank condition over $\mathbb{K}=\mathbb{F}_{p^{n}}$ , assuming $p^{n}\geq N+1$ .

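$$G_{k}(i,j)=\begin{cases}\binom{i}{j}\beta_{k}^{i-j}&\text{if } i\geq j,\\0&\text{otherwise,}\end{cases} \tag{8}$$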

where $\beta_{k},k\in[N]$ are distinct non-zero elements in $\mathbb{F}_{p^{n}}$ . We remark here that while the expression above is the same as the one in (5), the elements of (8) lie in $\mathbb{F}_{p^{n}}$ .

One reason for considering the matrices in (8) is as follows. Suppose that we operate over $\mathbb{F}_{2^{n}}$ , i.e., $p=2$ . Note that the calculation in (8) is equivalent to computing $\binom{i}{j}$ over the integers and reducing it modulo $2$ . In particular, this implies that whenever $\binom{i}{j}$ is even, the corresponding matrix entry will be zero. Thus, over finite fields, the $\{G_{k}\}_{k=0}^{N-1}$ matrices obtained using (8) are likely much sparser than those obtained from (5).

Example 2.

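Let $p=2,n=3,\ell=3,\Delta=4,N=6$. Consider the polynomial $\tilde{u}(x)=\tilde{u}_{0}+\tilde{u}_{1}x+\tilde{u}_{2}x^{2}+\tilde{u}_{3}x^{3}$ over $\mathbb{F}_{8}$. Its $i$-th Hasse derivatives for $i=0,1,2$ are

$$\begin{aligned}\tilde{u}^{[0]}(x)&=\tilde{u}_{0}+\tilde{u}_{1}x+\tilde{u}_{2}x^{2}+\tilde{u}_{3}x^{3},\\\tilde{u}^{[1]}(x)&=\tilde{u}_{1}+\tilde{u}_{3}x^{2},\\\tilde{u}^{[2]}(x)&=\tilde{u}_{2}+\tilde{u}_{3}x.\end{aligned}$$

Note here that $\tilde{u}^{[1]}(x)$ has only two non-zero coefficients (the coefficient $\binom{2}{1}\tilde{u}_{2}$ vanishes modulo $2$), whereas the corresponding derivative over the reals has three non-zero coefficients. Then,

$$G_{i}=\begin{bmatrix}1&0&0\\\beta_{i}&1&0\\\beta_{i}^{2}&0&1\\\beta_{i}^{3}&\beta_{i}^{2}&\beta_{i}\end{bmatrix},$$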

where $\beta_{i}\in\mathbb{F}_{8}$ and $\beta_{i}$ values are distinct for $i\in[N]$ .
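The sparsity gain in Example 2 comes entirely from the binomial coefficients vanishing modulo $p$. The short sketch below compares the support of the matrix entries for $\Delta=4$, $\ell=3$ when $\binom{i}{j}$ is taken over the integers and when it is reduced modulo $2$. It assumes non-zero $\beta$ values, so an entry is zero exactly when its binomial coefficient is; the variable names are illustrative.

```python
from math import comb

delta, ell = 4, 3

# Support over the reals: C(i, j) != 0 exactly when i >= j.
real_mask = [[int(comb(i, j) != 0) for j in range(ell)] for i in range(delta)]

# Support over a field of characteristic 2: C(i, j) reduced modulo 2.
mod2_mask = [[comb(i, j) % 2 for j in range(ell)] for i in range(delta)]

for name, mask in (("real", real_mask), ("mod 2", mod2_mask)):
    nonzeros = sum(map(sum, mask))
    print(name, mask, f"{nonzeros}/{delta * ell} non-zero")
```

Over the reals 9 of the 12 entries are non-zero, which agrees with the 75% density reported for the UDM-based scheme in Table I. Modulo 2 the entry at $(i,j)=(2,1)$ drops out as well, leaving 8 of 12.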

A natural question is whether it is possible to somehow “embed” the matrices defined in (8) into corresponding real matrices such that the conditions of Definition 1 hold (for real matrices). This does not appear to be a straightforward problem. For example, simply requiring distinct $\beta_{k}$'s is not sufficient. For instance, if we choose $\beta_{0}=1$ and $\beta_{1}=-1$ in Example 2, then the matrix

$$\mathbf{G}=\begin{bmatrix}1&0&1&0\\1&1&-1&1\\1&0&1&0\\1&1&-1&1\end{bmatrix}$$

obtained by choosing the first two columns of $G_{0}$ and the first two columns of $G_{1}$ is singular.

Remark 3.

If the $\beta_{i},i\in[N]$ are chosen randomly from a large enough subset of $\mathbb{R}$ , then we can assert that the collection will satisfy the strong full-rank property with high probability. To see this, let $\beta_{i}\in\mathbb{F}_{p^{n}},i\in[N]$ be indeterminates for now and consider $\mathbf{G}$ for any pattern $(v_{0},\cdots,v_{N-1})$ . The determinant of $\mathbf{G}$ is a multivariate polynomial $\tilde{\Lambda}(\beta_{0},\dots,\beta_{N-1})$ with coefficients from $\mathbb{F}_{p}$ . The results of [11, 12] certainly imply that $\tilde{\Lambda}(\beta_{0},\dots,\beta_{N-1})$ is not identically zero. Now, consider the determinant (polynomial) of $\mathbf{G}$ denoted $\Lambda(\beta_{0},\dots,\beta_{N-1})$ obtained by considering $\beta_{i}\in\mathbb{R},i\in[N]$ , i.e., $\Lambda(\beta_{0},\dots,\beta_{N-1})$ has integer coefficients. Clearly, $\tilde{\Lambda}(\beta_{0},\dots,\beta_{N-1})$ can be obtained by reducing each coefficient of $\Lambda(\beta_{0},\dots,\beta_{N-1})$ modulo $p$ . Therefore, $\Lambda(\beta_{0},\dots,\beta_{N-1})$ is also not identically zero. It follows that the product of all the real multivariate polynomials corresponding to the relevant $\mathbf{G}$ ’s is not identically zero. The result then follows, by choosing a large enough subset of the reals and applying the Schwartz-Zippel lemma.

Next, we utilize a representation of $\mathbb{F}_{p^{n}}$ by $n\times n$ matrices over $\mathbb{F}_{p}$ [13]. Let $\mathbb{F}_{p}[x]$ denote the ring of polynomials in $x$ with coefficients from $\mathbb{F}_{p}$ . Let $\alpha$ be a primitive element in $\mathbb{F}_{p^{n}}$ and let $\pi_{\alpha}(x)=x^{n}+\sum_{i=0}^{n-1}\pi_{i}x^{i}\in\mathbb{F}_{p}[x]$ denote the primitive polynomial associated with $\alpha$ . The $n\times n$ companion matrix (over $\mathbb{F}_{p}$ ) associated with $\pi_{\alpha}(x)$ is

$$C=\begin{bmatrix}0&0&\cdots&0&-\pi_{0}\\1&0&\cdots&0&-\pi_{1}\\0&1&\cdots&0&-\pi_{2}\\\vdots&\vdots&\ddots&\vdots&\vdots\\0&0&\cdots&1&-\pi_{n-1}\end{bmatrix}.$$

Define $\mathfrak{C}(p,n)=\{0,I,C,C^{2},\cdots,C^{p^{n}-2}\}$ with matrix addition and multiplication over $\mathbb{F}_{p}$, where $0$ denotes the $n\times n$ zero matrix and $I$ denotes the $n\times n$ identity matrix. It is well known [13] that $\mathfrak{C}(p,n)$ forms a finite field of size $p^{n}$ and is therefore isomorphic to $\mathbb{F}_{p^{n}}$. In particular, the mapping $\zeta(\alpha^{l})=C^{l}$, $\zeta(0)=0$, maps the elements of $\mathbb{F}_{p^{n}}$ to their corresponding matrix representations. In this work, we need another isomorphism. The elements of $\mathbb{F}_{p^{n}}$ are represented by polynomials in $\alpha$ of degree smaller than $n$, with regular polynomial addition and with multiplication reduced to lower powers by using $\pi_{\alpha}(\alpha)=0$. Let $\Gamma:\mathbb{F}_{p^{n}}\rightarrow\mathbb{F}^{n}_{p}$ represent the mapping of a polynomial $a(\alpha)$ to its vector representation. The addition of $a_{1}(\alpha)$ and $a_{2}(\alpha)$ is mapped to $\Gamma(a_{1})+\Gamma(a_{2})$. The product of $a_{1}(\alpha)$ and $a_{2}(\alpha)$ is mapped to $a_{1}(C)\Gamma(a_{2})$.

To see that this is a valid isomorphism, we have the following argument that establishes the equivalence of multiplication with $\alpha$ in $\mathbb{F}_{p^{n}}$ and left multiplication by $C$ . Let $b_{0}+b_{1}\alpha+\cdots+b_{n-1}\alpha^{n-1}$ be an element of $\mathbb{F}_{p^{n}}$ . Then

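$$\begin{aligned}\alpha\,(b_{0}+b_{1}\alpha+\dots+b_{n-1}\alpha^{n-1})&=b_{0}\alpha+b_{1}\alpha^{2}+\dots+b_{n-2}\alpha^{n-1}+b_{n-1}\alpha^{n}\\&=-\pi_{0}b_{n-1}+(b_{0}-\pi_{1}b_{n-1})\alpha\\&\quad+\dots+(b_{n-2}-\pi_{n-1}b_{n-1})\alpha^{n-1}.\end{aligned}$$

It can be seen that $C[b_{0}\ b_{1}\ \cdots\ b_{n-1}]^{T}$ gives the same result. The isomorphism of $\mathbb{F}_{p^{n}}$ and $\mathfrak{C}(p,n)$ shows that each non-zero element of $\mathbb{F}_{p^{n}}$ can be represented as a power of $C$. The result is then obtained by inductively applying the equivalence presented above.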
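The companion-matrix representation can be verified directly for the smallest case used later, $\mathbb{F}_{8}$ with $\pi_{\alpha}(x)=x^{3}+x+1$. The sketch below builds $C$ with ones on the subdiagonal and $-\pi_{i}$ in the last column, then checks that its powers behave like the powers of $\alpha$. The variable names, and the choice of $1+\alpha^{2}$ as a test element, are illustrative.

```python
import numpy as np

p, n = 2, 3
# pi(x) = x^3 + x + 1 over F_2, i.e. (pi_0, pi_1, pi_2) = (1, 1, 0).
pi = np.array([1, 1, 0])

# Companion matrix: ones on the subdiagonal, last column -pi_i (mod p).
C = np.zeros((n, n), dtype=int)
C[1:, :-1] = np.eye(n - 1, dtype=int)
C[:, -1] = (-pi) % p

# Powers C^0, C^1, ..., C^(p^n - 2), all reduced modulo p.
powers = [np.eye(n, dtype=int)]
for _ in range(p**n - 2):
    powers.append((powers[-1] @ C) % p)

I = np.eye(n, dtype=int)
print(C)
# C has multiplicative order p^n - 1 = 7, like a primitive element alpha.
print(np.array_equal((powers[-1] @ C) % p, I))
print(len({P.tobytes() for P in powers}) == p**n - 1)
# Addition is closed: alpha^3 = alpha + 1 corresponds to C^3 = C + I (mod 2).
print(np.array_equal((I + C) % p, powers[3]))
# Multiplying 1 + alpha^2 by alpha gives alpha + alpha^3 = 1.
b = np.array([1, 0, 1])
print((C @ b) % p)
```

The last line is the vector form $\Gamma$ of the product: left multiplication by $C$ on the coefficient vector $(b_{0},b_{1},b_{2})$ plays the role of multiplication by $\alpha$.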

Lemma 1.

Let $\mathbf{B}$ be a $n\times n$ matrix with entries from $\mathbb{F}_{p^{m}}$ . Let $\tilde{\mathbf{B}}$ denote the $mn\times mn$ matrix obtained by applying the map $\zeta$ to each entry of $\mathbf{B}$ . Note that $\det(\mathbf{B})\in\mathbb{F}_{p^{m}}$ and $\det(\tilde{\mathbf{B}})\in\mathbb{F}_{p}$ . We claim that

$$\det(\mathbf{B})\neq 0\iff\det(\tilde{\mathbf{B}})\neq 0.$$

Furthermore, let $\hat{\mathbf{B}}$ denote the $mn\times mn$ matrix over the integers $\mathbb{Z}$ obtained by mapping each element of $\tilde{\mathbf{B}}$ to the corresponding integer in $\{0,\dots,p-1\}$ . If $\det(\mathbf{B})\neq 0$ we have $\det(\hat{\mathbf{B}})\neq 0$ over the reals.

Proof.

Suppose that $\det(\mathbf{B})\neq 0$ but $\det(\tilde{\mathbf{B}})=0$. This implies that there exists a non-zero vector $\tilde{\mathbf{y}}=[\tilde{y}_{1}^{T}\ \tilde{y}_{2}^{T}\ \dots\ \tilde{y}_{n}^{T}]^{T}\in\mathbb{F}_{p}^{mn}$, where $\tilde{y}_{i}\in\mathbb{F}^{m}_{p}$, such that

$$\tilde{\mathbf{B}}\tilde{\mathbf{y}}=0. \tag{12}$$

Now we use the isomorphism presented above. Let $\mathbf{y}=[y_{1}\ y_{2}\ \dots\ y_{n}]^{T}\in\mathbb{F}_{p^{m}}^{n}$ be obtained by applying $\Gamma^{-1}$ to each block $\tilde{y}_{i}$ of $\tilde{\mathbf{y}}$. Relation (12) then equivalently implies that

$$\mathbf{B}\mathbf{y}=0,$$

where the above equation is understood to be over $\mathbb{F}_{p^{m}}$. However, this is a contradiction since $\mathbf{y}\neq 0$ and $\det(\mathbf{B})\neq 0$. The reverse conclusion can be obtained in a similar manner. Finally, note that $\det(\tilde{\mathbf{B}})\in\mathbb{F}_{p}$ can equivalently be computed by finding $\det(\hat{\mathbf{B}})$ over the reals and reducing the result modulo $p$. A non-zero residue modulo $p$ rules out $\det(\hat{\mathbf{B}})=0$, so $\det(\hat{\mathbf{B}})\neq 0$ over the reals. ∎

We now present the construction of systems that satisfy the $s$ -weak full-rank property.

Lemma 2.

Let $G_{k}$ , $0\leq k<N$ , be a collection of $N$ matrices with size $\Delta\times\ell$ over $\mathbb{F}_{p^{n}}$ that satisfy the strong full-rank property. Consider the $N$ matrices $G^{\prime}_{k}$ of dimension $n\Delta\times n\ell$ over $\mathbb{F}_{p}$ , where $G^{\prime}_{k}$ is obtained by applying the mapping $\zeta$ to each entry of $G_{k}$ . Then, the collection $\{G^{\prime}_{k}\}_{k=0}^{N-1}$ satisfies the $n$ -weak full rank condition over $\mathbb{R}$ .

Proof.

This is an immediate consequence of Lemma 1. ∎

Example 3.

Consider the collection of $N=6$ matrices presented in Example 2 over $\mathbb{F}_{8}$. Let the primitive polynomial of $\mathbb{F}_{8}$ be $\pi_{\alpha}(x)=x^{3}+x+1$. Suppose that $\beta_{k}=\alpha^{k}$, $k\in[N]$. Then,

[TABLE]

By Lemma 2, the collection $\{G^{\prime}_{k}\}_{k=0}^{5}$ satisfies the $3$-weak full-rank condition over $\mathbb{R}$.
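To make the objects in Example 3 more concrete, the short Python sketch below lists the binary $3\times 3$ images of $\beta_{k}=\alpha^{k}$, $k=0,\dots,5$. It assumes that $\zeta(\alpha)$ is the companion matrix of $\pi_{\alpha}(x)=x^{3}+x+1$ written with ones on the subdiagonal; other conventions (for example the transpose) are equally valid. The matrices $G_{k}$ of Example 2 are not reproduced, so only the images of the individual field elements are shown, and the function names are hypothetical.

```python
P = 2
# Assumed image of alpha: companion matrix of x^3 + x + 1 over F_2.
C = [[0, 0, 1],
     [1, 0, 1],
     [0, 1, 0]]
I3 = [[int(i == j) for j in range(3)] for i in range(3)]

def matmul_mod(A, B, p):
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*B)] for row in A]

def mat_pow_mod(M, k, p):
    R = [row[:] for row in I3]
    for _ in range(k):
        R = matmul_mod(R, M, p)
    return R

# Images of beta_k = alpha^k for k = 0, ..., 5.
images = [mat_pow_mod(C, k, P) for k in range(6)]

# C satisfies its own defining polynomial: C^3 + C + I = 0 over F_2.
C3 = mat_pow_mod(C, 3, P)
relation = [[(C3[i][j] + C[i][j] + I3[i][j]) % P for j in range(3)] for i in range(3)]

# alpha is primitive: its image has multiplicative order 7.
order = next(k for k in range(1, 8) if mat_pow_mod(C, k, P) == I3)

for k, M in enumerate(images):
    print(f"zeta(alpha^{k}) = {M}")
```

Because the six images are distinct binary matrices, each $\beta_{k}$ in the collection corresponds to a different $3\times 3$ block under this assumed mapping.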

We note here that Lemma 2 can also be applied to an RS code defined over a finite field.

Remark 4.

Our proposed scheme requires us to operate over an extension field large enough so that $p^{n}\geq N+1$ for the UDM-based approach and $p^{n}\geq N\ell+1$ for the RS-based approach. Thus, the second term in the worst-case computational load (cf. Eq. (1)) can be made as small as desired by choosing $\Delta$ large enough. Increasing $\Delta$ does, however, come at the cost of higher condition numbers (cf. Section V).
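The field-size requirements in Remark 4 can be turned into a small calculation. The sketch below, with hypothetical function names, finds the smallest extension degree $n$ for a given prime $p$. The values $N=6$ and $\ell=3$ are an assumption chosen to match the 18 RS evaluation points used in Section V.

```python
def smallest_degree(p, min_size):
    """Smallest n >= 1 such that p**n >= min_size."""
    n = 1
    while p ** n < min_size:
        n += 1
    return n

def required_field_sizes(N, ell):
    """Lower bounds on the field size stated in Remark 4."""
    return {"UDM": N + 1, "RS": N * ell + 1}

N, ell = 6, 3  # assumed setting: 6 workers, 18 RS evaluation points
bounds = required_field_sizes(N, ell)
for p in (2, 3):
    for scheme, q_min in bounds.items():
        n = smallest_degree(p, q_min)
        print(f"{scheme}: p={p}, n={n}, field size {p**n} >= {q_min}")
```

Under these assumed values the UDM bound is already met by $\mathbb{F}_{8}$ and by $\mathrm{GF}(3^{2})$, whereas the RS bound requires a field with at least 19 elements.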

V Comparisons of the different schemes

In this section, we compare the performance of the different schemes that have been proposed in this work. For each scheme, we construct all possible matrices $\mathbf{G}$ based on $\{G_{k}\}_{k=0}^{N-1}$ and $\Psi_{N,\ell,s}^{=Q_{\mathrm{b}}}$ and calculate their condition number. We report the maximum and average condition number of all such possible $\mathbf{G}$ ’s. Furthermore, we also report the average number of non-zero elements in the $G_{k}$ matrices for each collection.

In Table I we report results for a system with $N=6$ workers and storage capacity $\gamma=3/4$ for each worker. For the “RS-based scheme”, we set $\beta_{i,j}$, $i=0,\ldots,5$, $j=0,1,2$, in (4) to 18 equally spaced reals within the interval $[-1,1]$. For the “UDM-based scheme”, we set $\beta_{k}$, $k=0,\ldots,5$, in (5) to 6 equally spaced reals within the interval $[-1,1]$. For “RS + Embedding from $\mathrm{GF}(19)$”, we construct (4) over $\mathrm{GF}(19)$. Note that the field size is the least prime number that is greater than or equal to the number of evaluation points. Then we embed (4) into $\mathbb{R}$ by using the natural mapping of $\mathrm{GF}(19)$ into the integers. We construct “UDM + Embedding from $\mathrm{GF}(7)$” in a similar manner. It can be seen that the condition number of “UDM + Embedding from $\mathrm{GF}(7)$” is the lowest among the four schemes discussed thus far.
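The parameter choices in this paragraph can be reproduced with a few lines of Python. The sketch below is only an illustration of the setup, not the authors' experimental code. The function names are hypothetical, and the construction of the matrices in (4) and (5) is not attempted here.

```python
def equally_spaced(count, lo=-1.0, hi=1.0):
    """`count` equally spaced reals in [lo, hi], endpoints included."""
    return [lo + (hi - lo) * i / (count - 1) for i in range(count)]

def is_prime(q):
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def least_prime_at_least(n):
    """Least prime that is greater than or equal to n."""
    q = n
    while not is_prime(q):
        q += 1
    return q

def natural_embedding(x, p):
    """Map a residue class mod p to its integer representative in {0, ..., p-1}."""
    return x % p

rs_points = equally_spaced(18)       # beta_{i,j}, i = 0..5, j = 0, 1, 2
udm_points = equally_spaced(6)       # beta_k, k = 0..5
rs_prime = least_prime_at_least(18)  # field size for the RS embedding
udm_prime = least_prime_at_least(6)  # field size for the UDM embedding
print(rs_prime, udm_prime)
```

The two primes returned match the fields $\mathrm{GF}(19)$ and $\mathrm{GF}(7)$ named above.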

The other rows of Table I correspond to the companion matrix approach. In each of these cases we first design the RS-based or the UDM-based scheme over the corresponding extension field and then use the companion matrix idea introduced in Section IV. One can observe that the RS + Companion matrix schemes typically have high condition numbers. This is because the size of the companion matrix needs to be large enough to accommodate $N\ell$ evaluation points. The UDM + Companion matrix schemes only need extension fields of size larger than $N$, so their companion matrices tend to be smaller. Another advantage of the companion matrix approach is that the schemes are much sparser. Indeed, the “UDM + Companion matrix $\mathrm{GF}(3^{2})$” scheme in Table I not only has a very low worst-case condition number but also a sparsity level of 36%, which is the second lowest among all the schemes.

To better understand the performance corresponding to different choices of extension field, we consider a larger system with $N=15$ , $\gamma=1/2$ in Table II. It can be observed that the RS-based scheme is worse than the UDM-based scheme. Another observation is that the “UDM + Companion matrix $\mathrm{GF}(3^{3})$ ” has the lowest condition number and the $G_{k}$ matrices become sparser when the size of the companion matrix increases.
