Communication vs Distributed Computation: an alternative trade-off curve

Yahya H. Ezzeldin; Mohammed Karmoose; Christina Fragouli

arXiv:1705.08966·cs.IT·May 26, 2017


TL;DR

This paper explores the trade-off between communication and distributed computation in MapReduce-like systems, considering storage constraints and computational limits, and proposes bounds and heuristics for optimizing this balance.

Contribution

It introduces a new perspective on the communication-computation trade-off by accounting for partial computation and computational constraints, extending prior models.

Findings

01 Derived lower bounds on communication load under computational constraints

02 Proposed heuristic schemes that approach the theoretical bounds

03 Highlighted the impact of partial computation on storage and communication trade-offs



Equations

$$\mathcal{V}_{\mathcal{S}_{i}}^{i}\triangleq\{v_{q,n}\mid n\in\cap_{j\in\mathcal{S}_{i}}\mathcal{M}_{j},\ q\in\mathcal{W}_{i}\}.$$

$$C_{total}=\frac{rNQ(K-r+1)}{K}.$$

$$\begin{aligned}|\mathcal{C}_{i}|&\stackrel{(i)}{=}{K-1\choose r}r\eta_{1}\eta_{2}+\eta_{2}\frac{rN}{K}\stackrel{(ii)}{=}r\eta_{2}\left({K-1\choose r}\eta_{1}+\frac{N}{K}\right)\\&\stackrel{(iii)}{=}\frac{rQ}{K}\left({K-1\choose r}\frac{N}{{K\choose r}}+\frac{N}{K}\right)=\frac{rNQ(K-r+1)}{K^{2}},\end{aligned}$$

$$\begin{aligned}L_{lb}(C_{total})=\min_{\{z_{1},\dots,z_{r}\}}\ &\sum_{\ell=1}^{r}\frac{z_{\ell}}{NQ}\\\text{s.t.}\ &\sum_{\ell=1}^{r}z_{\ell}\,\ell\geq\frac{(K-r)NQ}{K},\quad\sum_{\ell=1}^{r}z_{\ell}\,\ell^{2}+\frac{rNQ}{K}\leq C_{total},\\&z_{i}\geq 0,\ \forall i\in[1:r],\end{aligned}$$

$Comp(r^{\prime})$ , $Comm(r^{\prime})$

$$\begin{aligned}L_{P}(C_{total})=\min_{\{z_{1},\dots,z_{r}\}}\ &{K\choose r+1}\sum_{\ell=1}^{r}\frac{z_{\ell}\,Comm(\ell)}{NQ}\\\text{s.t.}\ &\sum_{\ell=1}^{r}z_{\ell}=\eta_{1}\eta_{2},\quad z_{i}\geq 0,\ \forall i\in[1:r],\\&{K\choose r+1}\sum_{\ell=1}^{r}z_{\ell}\,Comp(\ell)+\frac{rNQ}{K}\leq C_{total}.\end{aligned}$$


Full text


Yahya H. Ezzeldin, Mohammed Karmoose, Christina Fragouli University of California, Los Angeles, CA 90095, USA,

Email: {yahya.ezzeldin, mkarmoose, christina.fragouli}@ucla.edu

Abstract

In this paper, we revisit the communication vs. distributed computing trade-off, studied within the framework of MapReduce in [1]. An implicit assumption in the aforementioned work is that each server performs all possible computations on all the files stored in its memory. Our starting observation is that, if servers can compute only the intermediate values they need, then storage constraints do not directly imply computation constraints. We examine how this affects the communication-computation trade-off and suggest that the trade-off be studied with a predetermined storage constraint. We then proceed to examine the case where servers need to perform computationally intensive tasks, and may not have sufficient time to perform all computations required by the scheme in [1]. Given a threshold that limits the computational load, we derive a lower bound on the associated communication load, and propose a heuristic scheme that achieves in some cases the lower bound.

I Introduction

Distributed computation across a set of wireless networked servers is well motivated by several practical constraints: we may want to speed up a computation so that it finishes sooner; each server may have only a partial view of the files needed for the computation; we may have limited memory in each server; or we may be motivated by energy constraints. In this paper we consider the distributed computing framework studied in [1], which follows the architecture of MapReduce [2].

Our starting observation is that the system in [1] does not explicitly separate computation from storage. The system uses a cluster of $K$ servers to compute $Q$ output functions from $N$ input files. Each file is stored in $r$ different servers, balancing the amount of storage across servers. The work in [1] calculates the trade-off between the amount of computation and communication that servers need to do for such file placement. However, an underlying assumption of the derived trade-off is that each server performs all possible computations on all the files stored in its memory. It is natural to ask: is it indeed useful to perform all possible computations?

The following simple example illustrates that this is not always the case. Consider a cluster with $K{=}3$ servers, $N{=}3$ files and $Q{=}3$ output functions. All $3$ files are available at each server and each server is required to compute only one of the output functions. In this case, instead of performing $9$ computations per server (as assumed in [1]), each server only needs to perform computations related to its dedicated output function, i.e., only 3 computations are needed per server.

Our first contribution is to generalize this observation and derive an alternative trade-off curve to the scheme in [1]. We explicitly use three parameters: $C_{total}$ , the total amount of computation required; $r$ , which captures the memory requirements; and the communication load $L$ . We consider the placement and communication scheme in [1], and calculate the minimum number of computations each server needs to perform. We take into account the values computed by the server for its assigned output functions, the values that need to be communicated to other servers, and the values needed as side information to decode transmissions from other servers.

We then proceed to examine the case where servers need to perform computationally intensive tasks, and in particular, do not have sufficient time to perform all computations the curve in [1] requires. Such a scenario may occur in wireless settings, where we may have cheap mobile devices with low computational power that need to cooperatively perform time-critical operations, for scientific computing or virtual reality applications. We ask: if the cluster is limited to an amount of computation below a threshold, what is the minimum communication required to complete the function computation?

Our second contribution is to derive a lower bound for the communication-computation trade-off when a cluster has a limited computation budget. For this lower bound, we assume that the files are distributed across the cluster with a predetermined level of redundancy that does not grow with the available computation budget. We show that a scheme directly inferred from [1] performs poorly when compared against the derived lower bound. Finally, we develop a distributed computing scheme inspired by [1] and show through numerical evaluation that the communication-computation trade-off it provides is comparable to the aforementioned lower bound.

Related Work. Minimizing communication load for distributed computation tasks has received considerable attention in the literature: starting from distributed boolean function computation between two parties [3, 4] to the more generalized theory of communication complexity [5, 6]. A key tool for reducing the needed amount of communication is network coding. A prominent example of this concept is in the context of distributed cache networks [7, 8, 9], where coding is used in either the data placement or data delivery phases to reduce the amount of communication in the delivery phase. Recently, coding was also considered in the context of distributed computing systems that are based on the MapReduce framework [1, 10, 11]. In fact, the authors in [1] provided a Coded Distributed Computing (CDC) scheme which reduces the amount of communication needed in the data shuffling phase by using coded multicast transmissions. Our work differs in that we separate computation and storage, and thus derive alternative trade-off curves depending on the relative values of these parameters.

II System Model

Notation. Calligraphic letters denote sets throughout the paper. $|\mathcal{A}|$ denotes the cardinality of the set $\mathcal{A}$ . The expression $[a:b]$ denotes the set of integers from $a$ to $b$ .

MapReduce framework. We consider a cluster of $K$ servers that computes $Q$ output functions $\phi_{q}$ , $q\in[1:Q]$ , from $N$ input files $w_{n}$ , $n\in[1:N]$ . In this paper, we assume that the servers share a lossless broadcast domain: a transmission from a server can be losslessly received by all other servers.

We assume the cluster uses a MapReduce framework to compute the set of $Q$ functions in a distributed manner. MapReduce is based on the assumption that each output function can be calculated as a function of some intermediate processing of the files. In other words, $\phi_{q}(w_{1},\dots,w_{N}{)}=h_{q}(v_{q,1},\dots,v_{q,N})$ , where $v_{q,n}=g_{q,n}(w_{n})$ is the intermediate value computed from file $w_{n}$ relevant to the output function $\phi_{q}$ , and has length $T$ bits. In MapReduce terminology, the intermediate value is computed (or “mapped”) using a map function $g_{q,n}$ and $h_{q}$ “reduces” the intermediate values $\{v_{q,n}\}_{n=1}^{N}$ to output $\phi_{q}$ .
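The decomposition $\phi_{q}=h_{q}(v_{q,1},\dots,v_{q,N})$ can be illustrated with a toy sketch. The files, the letter-counting output functions, and the names `FILES`, `LETTERS`, `g`, `h` and `phi` below are illustrative assumptions, not taken from the paper.

```python
# Toy illustration of phi_q(w_1..w_N) = h_q(v_{q,1}, ..., v_{q,N}).
# Assumption: files are strings and output function q counts occurrences
# of the q-th letter of LETTERS. Neither choice comes from the paper.
LETTERS = "abc"                  # Q = 3 hypothetical output functions
FILES = ["abca", "bbc", "cab"]   # N = 3 hypothetical input files


def g(q, n):
    """Map function g_{q,n}: intermediate value v_{q,n} computed from file w_n."""
    return FILES[n].count(LETTERS[q])


def h(q, values):
    """Reduce function h_q: combine the N intermediate values for function q."""
    return sum(values)


def phi(q):
    """Output function phi_q evaluated through the map/reduce decomposition."""
    return h(q, [g(q, n) for n in range(len(FILES))])


outputs = [phi(q) for q in range(len(LETTERS))]
print(outputs)
```

Each `g(q, n)` touches a single file, which is what lets different servers compute different intermediate values independently before the reduce step.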

Based on this decomposition, the computation model in [1] consists of three phases: Map, Shuffle and Reduce. Additionally, a Placement phase distributes files and tasks among the servers in the cluster. We next describe each of the phases:

Placement Phase: Each server $k$ is loaded with a subset $\mathcal{M}_{k}$ of the $N$ files, such that $\cup_{k}\mathcal{M}_{k}=[1:N]$ . Each server $k$ is also assigned to compute a partition $\mathcal{W}_{k}$ of the output functions, where $\cup_{k}\mathcal{W}_{k}=[1:Q]$ .
Map Phase: Each server $k$ computes a subset $\mathcal{C}_{k}$ of the intermediate values related to $\mathcal{M}_{k}$ , i.e., $\mathcal{C}_{k}\subseteq\{v_{q,n}|q\in[1:Q],n\in\mathcal{M}_{k}\}$ . At the end of the Map phase, the assigned computation subsets satisfy that $\cup_{k}\mathcal{C}_{k}=\{v_{q,n}|q\in[1:Q],n\in[1:N]\}$ .

Remark 1.

In MapReduce, files are mapped by presenting them as $(key,value)$ pairs to a $map(\cdot)$ function that outputs a set of intermediate $(key,value)$ pairs based on the input pair. Although the same $map(\cdot)$ build is used across the servers, the function can output different sets of intermediate values depending on the server ID, by including this information in the $key$ .

Shuffle Phase: For a server $k$ to compute a function $\phi_{q}$ where $q\in\mathcal{W}_{k}$ , it needs all the intermediate messages $\mathcal{V}_{q}=\left\{v_{q,n}|q\in\mathcal{W}_{k},n\in[1{:}N]\right\}$ . Thus in the Shuffle phase, the $K$ servers exchange intermediate values, such that each server has access to all its needed sets $\mathcal{V}_{q}$ . The shuffling scheme can be described as follows: each server $k$ creates a message $X_{k}$ that is a function of its locally computed intermediate values and broadcasts this message $X_{k}$ to the remaining $K-1$ nodes.
Reduce Phase: In the Reduce phase, server $k$ uses its locally computed intermediate values and the received transmissions $X_{1},\dots,X_{K}$ to decode the set of the needed intermediate values $\mathcal{V}_{q}$ , $\forall q\in\mathcal{W}_{k}$ . Using $\mathcal{V}_{q}$ , the nodes can now compute the desired functions $\phi_{q}=h_{q}(\mathcal{V}_{q})$ , $\forall q\in{\mathcal{W}_{k}}$ .

Performance metrics. We measure the performance of this computation cluster across three parameters: the load redundancy ( $r$ ), the computation load ( $C_{total}$ ) and the communication load ( $L$ ), defined as follows:

$\bf\bullet$ Load Redundancy. We define the load redundancy as the average number of times a file is assigned across the servers. We denote this by $r$ , i.e., $r\triangleq\frac{\sum_{k=1}^{K}|\mathcal{M}_{k}|}{N}$ . Load redundancy captures memory constraints.

$\bf\bullet$ Computation Load. We define the computation load $C_{total}\triangleq\sum_{k}|\mathcal{C}_{k}|$ as the total number of computations performed across servers in the cluster.

$\bf\bullet$ Communication Load. We define the communication load $L\triangleq\sum_{k=1}^{K}\frac{b(X_{k})}{QNT}$ , as the number of bits transmitted in the Shuffle phase normalized by $QNT$ , where $b(X_{k})$ is the number of bits used to represent $X_{k}$ and $QNT$ is the total number of bits in all intermediate values $v_{q,n}$ , for $q\in[1:Q]$ and $n\in[1:N]$ . From the definition, we have $0\leq L\leq 1$ .

The definitions of $L$ and $r$ follow [1]; however in this paper, we explicitly separate the redundancy from the computation load, and use different parameters for each.

III On the relation between redundancy and computation

An underlying assumption in [1] is that each server $k$ must compute all the intermediate values for its stored files $\mathcal{M}_{k}$ . In other words, $\mathcal{C}_{k}=\{v_{q,n}|q\in[1:Q],n\in\mathcal{M}_{k}\}$ . In this case, the load redundancy $r$ is linearly proportional to the total number of computations in the system, since $|\mathcal{C}_{k}|=|\mathcal{M}_{k}|Q$ , and $r$ can therefore be regarded as the computation load. However, if the server can selectively choose which intermediate values of $\mathcal{M}_{k}$ to compute in the Map phase (as long as the communication load is the same), then the total number of computations does not necessarily grow linearly with $r$ .

Consequently, an increase in $r$ does not necessarily result in an increase in the number of computations performed by the cluster. For example, assume that $Q=K$ and each server is required to compute 1 output function (without loss of generality, $\mathcal{W}_{k}=\{k\}$ ). Then, we have $C_{total}=NQ$ for both $r=1$ and $r=K$ . For $r=1$ , each file is available at only one server, thus each server needs to compute all intermediate values for all files stored in its memory. For $r=K$ , all files are available at each server. Thus, each server needs only to compute $N$ intermediate values related to its output function. In both cases, the optimal communication load $L(r)=\frac{K-r}{rK}$ is achieved [1, Theorem 1]. Note that for $r=K$ , if the servers computed all intermediate values for their files, there would be $NQK$ computations instead of $NQ$ .
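The two extreme cases in this example can be checked by listing which $(q,n)$ pairs each server maps. This is a minimal sketch: the round-robin placement for $r=1$ and the helper names `computed_sets` and `total_load` are assumptions made for illustration.

```python
def computed_sets(N, K, r):
    """Sets C_k of (q, n) pairs mapped by each server, for Q = K and W_k = {k}.

    Only the two extreme placements from the text are covered.
    Assumption: for r = 1, file n is stored at server n % K.
    """
    if r == 1:
        # A file lives at one server only, so that server maps it for every q.
        return [{(q, n) for n in range(N) if n % K == k for q in range(K)}
                for k in range(K)]
    if r == K:
        # Every server stores every file and maps only its own function q = k.
        return [{(k, n) for n in range(N)} for k in range(K)]
    raise ValueError("sketch covers only r = 1 and r = K")


def total_load(sets):
    """C_total: total number of intermediate values computed in the cluster."""
    return sum(len(c) for c in sets)


N, K = 3, 3                       # the K = N = Q = 3 example from the Introduction
selective_r1 = total_load(computed_sets(N, K, 1))
selective_rK = total_load(computed_sets(N, K, K))
map_everything_rK = N * K * K     # every server maps all Q = K functions on all N files
print(selective_r1, selective_rK, map_everything_rK)
```

In both placements the union of the sets covers all $NQ$ intermediate values, so the redundancy changes where values are computed but not how many are needed.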

Later in this section, we characterize the minimum computation load needed by the Coded Distributed Computing (CDC) scheme in [1] in order to achieve the optimal communication load $L^{\star}(r)$ in [1, Theorem 1] for $r\in[1:K]$ . As we see later, taking this minimum computation load into account changes the trade-off in [1] for CDC. As a preliminary to that discussion, we next briefly describe the CDC scheme in [1].

An overview of the CDC scheme. Assume that $N$ and $Q$ are sufficiently large so that $N={K\choose r}\eta_{1}$ and $Q=K\eta_{2}$ for some $\eta_{1},\eta_{2}\in\mathbb{N}$ . The CDC scheme operates as follows (see [1] for a complete description):

Placement Phase: A disjoint subset $\mathcal{M}_{\mathcal{T}}$ of the files is assigned to each subset $\mathcal{T}$ of $r$ servers where $|\mathcal{M}_{\mathcal{T}}|=\eta_{1}$ . Every server is thus assigned a set of $rN/K$ files, and each batch of $\eta_{1}$ of these files is shared with a unique set of $r-1$ other servers. Every server $k$ is also assigned a unique subset $\mathcal{W}_{k}$ of the output functions to calculate such that $|\mathcal{W}_{k}|=\eta_{2}$ .
Map Phase: Every server computes all possible intermediate function values for the files it has.
Shuffling Phase: The shuffling phase repeats the following procedure for every set $\mathcal{S}\subseteq[1:K]$ of size $r+1$ :

(i) For every $i\!\in\!\mathcal{S}$ , define $\mathcal{S}_{i}\!=\!\mathcal{S}\backslash\{i\}$ and identify $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ as

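$$\mathcal{V}_{\mathcal{S}_{i}}^{i}\triangleq\{v_{q,n}\mid n\in\cap_{j\in\mathcal{S}_{i}}\mathcal{M}_{j},\ q\in\mathcal{W}_{i}\}.\tag{1}$$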

The set $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ represents the intermediate values that are needed by server $i$ to compute functions in $\mathcal{W}_{i}$ , which can be computed exclusively by all servers in $\mathcal{S}_{i}$ (recall that a file is replicated at exactly $r$ servers). Note that $|\mathcal{V}_{\mathcal{S}_{i}}^{i}|\!=\!\eta_{1}\eta_{2}$ .

(ii) Split every intermediate value in $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ into $r$ disjoint parts of $T/r$ bits and associate each part with a server in $\mathcal{S}_{i}$ . Thus we split the set $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ into $r$ partitions denoted by $\mathcal{V}_{\mathcal{S}_{i},j}^{i}$ , $j\in\mathcal{S}_{i}$ , each of size $\eta_{1}\eta_{2}\frac{T}{r}$ . Each server $j$ is responsible for conveying its part to server $i$ with coded broadcast transmissions.

(iii) After splitting all sets $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ for all $i\in\mathcal{S}$ (we have $r+1$ such sets), server $k$ sends the bit-wise XOR of all the $\eta_{1}\eta_{2}\frac{T}{r}$ -bit parts in $\mathcal{U}_{k}^{\mathcal{S}}\triangleq\bigcup_{i\in\mathcal{S}}\mathcal{V}_{\mathcal{S}_{i},k}^{i}$ , i.e., it makes $\eta_{1}\eta_{2}$ broadcast transmissions each of size $\frac{T}{r}$ bits. Each transmission is useful to all other $r$ nodes in $\mathcal{S}$ ; moreover, each server in $\mathcal{S}$ has the required side information to decode the part it needs.

Reduce Phase: In the Reduce phase, every server uses its locally computed intermediate values and the intermediate values decoded in the shuffling phase to compute the $\eta_{2}$ output functions assigned to it in the Placement phase.
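The coded shuffle can be traced on the smallest non-trivial instance, $K=3$ and $r=2$ with $\eta_{1}=\eta_{2}=1$. The sketch below is an illustration only: intermediate values are stand-in pseudo-random byte strings (bytes rather than bits, for convenience), and the names `v`, `part`, `needed_file` and `PART` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
import hashlib

K, r = 3, 2          # three servers, each file stored at two of them
N, Q = 3, 3          # eta_1 = eta_2 = 1
PART = 4             # bytes per part; one intermediate value has r * PART bytes
servers = list(range(K))


def v(q, n):
    """Stand-in intermediate value v_{q,n} (assumption: hash-derived bytes)."""
    return hashlib.sha256(f"{q},{n}".encode()).digest()[: r * PART]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


# Placement: one file per pair of servers.
file_of_pair = {pair: n for n, pair in enumerate(combinations(servers, r))}


def needed_file(i):
    """The file stored exactly at S_i = S \\ {i}, i.e. the one server i lacks."""
    return file_of_pair[tuple(s for s in servers if s != i)]


def part(i, k):
    """Part of v_{i, needed_file(i)} that server k in S_i is responsible for."""
    idx = [s for s in servers if s != i].index(k)
    return v(i, needed_file(i))[idx * PART:(idx + 1) * PART]


# Shuffle: server k broadcasts the XOR of the parts it owes to the two others.
X = {}
for k in servers:
    a, b = [i for i in servers if i != k]
    X[k] = xor(part(a, k), part(b, k))

# Decoding at server i: cancel the part destined for the third server j.
# Server i stores needed_file(j), so it can compute that part locally.
decoded = {}
for i in servers:
    pieces = {}
    for k in servers:
        if k == i:
            continue
        j = next(s for s in servers if s not in (i, k))
        pieces[k] = xor(X[k], part(j, k))
    decoded[i] = b"".join(pieces[k] for k in sorted(pieces))

load = Fraction(sum(len(x) for x in X.values()), Q * N * r * PART)
print(load)
```

In this instance the measured load agrees with $\frac{K-r}{rK}$, because each broadcast of $T/r$ bits is simultaneously useful to the other $r$ servers.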

Next we discuss the minimum computation load needed for the CDC scheme.

Minimum Computations. The next proposition characterizes the minimum computation required by the CDC scheme.

Proposition 1.

For the placement scheme in [1] with $r\in[1\!:\!K]$ , the communication load $L^{\star}(r)=\frac{K-r}{rK}$ can be achieved with computation load

$$C_{total}=\frac{rNQ(K-r+1)}{K}.\tag{2}$$

Proof.

We first note that every server $i$ locally computes all intermediate values required by the functions in $\mathcal{W}_{i}$ and corresponding to the files in $\mathcal{M}_{i}$ ; we denote these intermediate values as $\mathcal{C}_{\mathcal{M}_{i},\mathcal{W}_{i}}$ . Thus, we have $\mathcal{C}_{\mathcal{M}_{i},\mathcal{W}_{i}}=\{v_{q,n}|q\in\mathcal{W}_{i},n\in\mathcal{M}_{i}\}\subseteq\mathcal{C}_{i}$ . Note that $|\mathcal{C}_{\mathcal{M}_{i},\mathcal{W}_{i}}|{=}|\mathcal{W}_{i}||\mathcal{M}_{i}|{=}\eta_{2}\frac{rN}{K}$ . In addition to $\mathcal{C}_{\mathcal{M}_{i},\mathcal{W}_{i}}$ , server $i$ also performs a set of computations required to carry out shuffling in the CDC scheme. We denote this set by $\mathcal{C}_{TX_{i}}$ . To calculate the number of computations in $\mathcal{C}_{TX_{i}}$ , we distinguish between computations required by server $i$ to decode its needed intermediate values (from transmissions in the shuffling phase) and the computations needed to create its transmissions $X_{i}$ in the shuffling phase.

Observe (from the description of the CDC scheme earlier and in [1]) that in any $\mathcal{S}\subseteq[1\!:\!K]$ of size $r+1$ where $i\in\mathcal{S}$ , server $i$ uses the sets $\{\mathcal{V}_{\mathcal{S}_{k},i}^{k}|k\in\mathcal{S}\backslash\{i\}\}$ to construct its transmission. In addition, since the remaining parts $\{\mathcal{V}_{\mathcal{S}_{k},j}^{k}|k\in\mathcal{S}\backslash\{i\},j\in\mathcal{S}\backslash\{i,k\}\}$ will be XOR-ed (at the other servers) with parts needed by server $i$ , server $i$ should compute the intermediate values $\cup_{k\in\mathcal{S}\backslash\{i\}}\mathcal{V}_{\mathcal{S}_{k}}^{k}$ in order to decode its requested intermediate values as well as construct its transmissions in the shuffling phase. This amounts to $\sum_{k\in\mathcal{S},k\neq i}|\mathcal{V}_{\mathcal{S}_{k}}^{k}|=r\eta_{1}\eta_{2}$ computations for every set $\mathcal{S}$ . Thus, the total number of computations by server $i$ , $|\mathcal{C}_{i}|$ , is

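$$\begin{aligned}|\mathcal{C}_{i}|&\stackrel{(i)}{=}{K-1\choose r}r\eta_{1}\eta_{2}+\eta_{2}\frac{rN}{K}\stackrel{(ii)}{=}r\eta_{2}\left({K-1\choose r}\eta_{1}+\frac{N}{K}\right)\\&\stackrel{(iii)}{=}\frac{rQ}{K}\left({K-1\choose r}\frac{N}{{K\choose r}}+\frac{N}{K}\right)=\frac{rNQ(K-r+1)}{K^{2}},\end{aligned}\tag{3}$$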

where: (i) follows since server $i$ appears in exactly ${K{-}1\choose r}$ subsets of size $r+1$ ; (ii) and (iii) follow from the assumptions that $N={K\choose r}\eta_{1}$ and $Q=K\eta_{2}$ . From symmetry, the total number of computations in the Map phase equals $C_{total}=K|\mathcal{C}_{i}|$ . ∎
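The count in the proof can be cross-checked by brute-force enumeration on small clusters. The sketch below follows the placement and the shuffling sets described above; the function name `cdc_min_computations` and the integer labelling of files and functions are assumptions.

```python
from fractions import Fraction
from itertools import combinations


def cdc_min_computations(K, r, eta1=1, eta2=1):
    """Enumerate the (q, n) pairs each server must map under CDC.

    Returns (N, Q, C) where C[k] is the set of pairs mapped by server k.
    """
    servers = range(K)
    # Placement: eta1 distinct files per r-subset of servers.
    files, n = {}, 0
    for T in combinations(servers, r):
        files[T] = list(range(n, n + eta1))
        n += eta1
    N, Q = n, K * eta2
    W = {k: list(range(k * eta2, (k + 1) * eta2)) for k in servers}

    C = {k: set() for k in servers}
    # Local part: own output functions on own files.
    for T, fs in files.items():
        for k in T:
            C[k].update((q, f) for q in W[k] for f in fs)
    # Shuffle part: for every S of size r+1 and every i in S, each server in
    # S \ {i} maps V^i_{S_i}, either to build its transmission or as side
    # information for decoding.
    for S in combinations(servers, r + 1):
        for i in S:
            S_i = tuple(s for s in S if s != i)
            V = {(q, f) for q in W[i] for f in files[S_i]}
            for k in S_i:
                C[k] |= V
    return N, Q, C


K, r = 4, 2
N, Q, C = cdc_min_computations(K, r)
per_server = Fraction(r * N * Q * (K - r + 1), K * K)   # closed form from the proof
print([len(C[k]) for k in range(K)], per_server)
```

For $K=4$, $r=2$ this gives 9 computations per server, compared with $|\mathcal{M}_{k}|Q=12$ if every server mapped all functions on all its files.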

Note from (2) that $C_{total}$ is quadratic in $r$ . Thus, we cannot view $r$ as a direct measure of computation load, since both the communication load $L$ and the number of computations $C_{total}$ decrease for $r\geq(K+1)/2$ . Fig. 3 shows the relation in (2) for $N=2520$ and $K=Q=10$ versus the number of computations if a server computes all map functions for each of its stored files. If we use [1, Theorem 1] and Proposition 1 to couple $C_{total}$ and $L^{\star}$ , then we get the trade-off shown in Fig. 3 for the CDC scheme, where the red line is a scaled version of the trade-off in [1]. From Fig. 3, it can be seen that if we are free to choose $r$ for a given $C_{total}$ , then the optimal trade-off happens at $C_{total}=NQ=25200$ , obtained by picking $r=K=10$ . This gives a communication load equal to zero while achieving the minimum computation load. This observation suggests that we can better understand the communication-computation trade-off if we consider it with a predefined redundancy load ( $r$ ) that does not change with the computation load $C_{total}$ .
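The numbers behind this discussion can be tabulated directly from (2) and $L^{\star}(r)=\frac{K-r}{rK}$ for the stated parameters. The helper names below are illustrative.

```python
from fractions import Fraction

N, K, Q = 2520, 10, 10


def c_total_min(r):
    """Minimum CDC computation load from equation (2)."""
    return r * N * Q * (K - r + 1) // K


def c_total_full(r):
    """Computation load if every server maps all Q functions on all its files."""
    return r * N * Q


def l_star(r):
    """Optimal communication load L*(r) = (K - r) / (rK)."""
    return Fraction(K - r, r * K)


for r in range(1, K + 1):
    print(r, c_total_min(r), c_total_full(r), l_star(r))
```

The minimum load peaks at $r=5$ and $r=6$ and falls back to $NQ$ at $r=K$, while the map-everything load keeps growing linearly in $r$.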

Thus, in the remainder of the paper, we consider $r$ as a parameter of the cluster (with $K,Q$ and $N$ ), and show how we can exploit this redundancy to perform coded distributed computing when at most $C_{total}$ computations are allowed.

IV An Achievable Communication-Computation Trade-off

Consider a distributed computing cluster with parameters $N,Q,K$ and load redundancy $r$ , where $r$ represents the number of times each file is stored across the servers in the cluster. For our discussion in this section, we assume that $r\in[1:K]$ and that the file placement (for a given $r$ ) follows the strategy in [1]. We are interested in answering the question: If the cluster is allowed to perform at most $C_{total}$ computations, what is the minimum communication load $L(r,C_{total})$ needed in order to compute $Q$ output functions using the cluster?

If $C_{total}\!\geq\!r(K{-}r{+}1)NQ/K$ , then from Proposition 1, we can directly use the CDC scheme described in [1] to achieve the optimal communication load $L(r,C_{total})=L^{\star}(r)=\frac{1}{r}\left(1-\frac{r}{K}\right)$ . However, when $C_{total}<r(K-r+1)NQ/K$ , the available computation budget is not enough to perform the shuffling and decoding required by the CDC scheme. In this case, can the CDC scheme be adapted to work with a restrictive computation budget? From [1], we can infer a simple modification to the CDC scheme, which we refer to as CDC-fit. In this scheme, we use CDC on the cluster while operating it with a lower load redundancy that fits the computation constraints. In other words, we pick $r^{\star}{=}\max\{r^{\prime}|C_{total}\!\geq\!r^{\prime}{(K-r^{\prime}+1)}NQ/K,r^{\prime}\!\leq\!r\}$ and operate the cluster as if the files are only repeated $r^{\star}$ times. This ensures that there are enough computations to satisfy CDC for $r^{\star}$ and achieve the communication load $L(r^{\star})=\frac{1}{r^{\star}}\left(1-\frac{r^{\star}}{K}\right)$ . A natural question to ask here is whether this is the best possible approach.
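The CDC-fit selection rule can be written as a short sketch. The function name `cdc_fit` and the choice to return `None` when no $r^{\prime}$ fits the budget are assumptions made for illustration.

```python
from fractions import Fraction


def cdc_fit(C_total, N, Q, K, r):
    """Pick r* = max{r' <= r : C_total >= r'(K - r' + 1)NQ/K} and return
    (r*, L(r*)), or None when even r' = 1 exceeds the budget."""
    feasible = [rp for rp in range(1, r + 1)
                if C_total * K >= rp * (K - rp + 1) * N * Q]
    if not feasible:
        return None
    r_star = max(feasible)
    return r_star, Fraction(K - r_star, r_star * K)


N, Q, K, r = 2520, 10, 10, 5
for budget in (20000, 25200, 50000, 75600):
    print(budget, cdc_fit(budget, N, Q, K, r))
```

With these parameters the thresholds for $r^{\prime}=1,\dots,5$ are 25200, 45360, 60480, 70560 and 75600, so the achieved load changes only in steps as the budget grows.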

To characterize this, we next develop a lower bound on the communication load when the cluster has a computation load $C_{total}$ and load redundancy $r$ .

Lower Bound on Communication load. We provide here a lower bound on the communication load for only a particular class of shuffling schemes. In this class, given a broadcast transmission sent during the shuffling phase, server $i$ can decode its required intermediate value from that transmission using only side information that it has locally computed, i.e., it does not rely on future transmissions to provide it with enough linear combinations to decode its required intermediate values. In what follows, an $\ell$ -type transmission denotes a broadcast transmission made by a server during shuffling, which consists of the XOR of equally-sized parts of $\ell$ intermediate values. The weight of an $\ell$ -type transmission is the size of the intermediate value parts used in the transmission.

In order to relax our lower bound, we assume that a server can perform partial computations on the files, i.e., if a server wants to transmit $fT$ bits (with $0\leq f\leq 1$ ) of $v_{q,n}$ (recall $v_{q,n}$ is made of $T$ bits), then it only expends a fraction $f$ of a computation. With this assumption, we can observe the following properties of our cluster:

Obs. 1. Each server has $rN/K$ files stored locally, and needs to receive $\frac{(K-r)N}{K}\cdot\frac{Q}{K}$ intermediate values through shuffling.

Obs. 2. For a cluster with load redundancy $r$ , all feasible transmissions have $\ell\!\leq\!r$ . This follows by noting that an $\ell$ -type transmission is assumed to satisfy $\ell$ servers (if it were useful to fewer than $\ell$ servers, the transmitter could have XOR-ed fewer intermediate values to generate the transmission). Therefore, each intermediate value involved in this transmission is computed once at the transmitter, and computed once at each of the other $\ell{-}1$ servers which would utilize this intermediate value as side information to decode the transmission. Since each file is repeated across $r$ servers, we have $\ell\leq r$ .

Obs. 3. In the shuffling phase, each $\ell$ -type transmission of weight $fT$ incurs an added computation cost to the cluster equal to $\ell^{2}fT$ . To see this, note that the server sending this transmission makes $\ell fT$ computations. Moreover, an $\ell$ -type transmission serves $\ell$ servers, each of which would have to do $(\ell-1)fT$ computations to acquire the needed side information. Therefore we get $\ell fT+\ell(\ell-1)fT=\ell^{2}fT$ .

Let $z_{\ell}$ be the number of $\ell$ -type transmissions. Then, the communication load for a shuffling scheme is lower bounded by the solution of the following Linear Program (LP)

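$$\begin{aligned}L_{lb}(C_{total})=\min_{\{z_{1},\dots,z_{r}\}}\ &\sum_{\ell=1}^{r}\frac{z_{\ell}}{NQ}\\\text{s.t.}\ &\sum_{\ell=1}^{r}z_{\ell}\,\ell\geq\frac{(K-r)NQ}{K},\quad\sum_{\ell=1}^{r}z_{\ell}\,\ell^{2}+\frac{rNQ}{K}\leq C_{total},\\&z_{i}\geq 0,\ \forall i\in[1:r],\end{aligned}\tag{4}$$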

where: (i) the first constraint is a necessary condition for the shuffling phase to deliver $\frac{(K-r)QN}{K^{2}}$ intermediate values to each server in the cluster; (ii) the second constraint is a necessary condition for the total computation (local computations and shuffling computations) to not exceed $C_{total}$ . Note that the result of the LP is a lower bound on the communication load since the first constraint is not sufficient to ensure that each server receives its needed intermediate values. Fig. 3 compares the communication-computation trade-off for the aforementioned CDC-fit scheme with the lower bound in (4). The two trade-offs are close only towards high computation loads, which allow the system to operate with an $r^{\star}$ close to the natural $r$ of the cluster. Next, we propose a modification to the CDC scheme, denoted Split-CDC (S-CDC), that provides a communication-computation trade-off close to the trade-off suggested by the lower bound in (4).
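Because the LP in (4) has only two non-trivial constraints, it can be solved exactly without an LP library. The sketch below is one way to do so, assuming the standard fact that an LP optimum is attained at a basic feasible solution, which here has at most two non-zero $z_{\ell}$. The function name `lower_bound` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def lower_bound(C_total, N, Q, K, r):
    """Exact optimum of the LP in (4), or None if the LP is infeasible."""
    D = Fraction((K - r) * N * Q, K)                 # values that must be delivered
    B = Fraction(C_total) - Fraction(r * N * Q, K)   # computation left for shuffling
    best = None

    def keep(candidate):
        nonlocal best
        if best is None or candidate < best:
            best = candidate

    # Vertices with a single non-zero z_a and a tight delivery constraint.
    for a in range(1, r + 1):
        if a * D <= B:
            keep(D / a)
    # Vertices with two non-zero variables and both constraints tight:
    #   a z_a + b z_b = D   and   a^2 z_a + b^2 z_b = B.
    for a, b in combinations(range(1, r + 1), 2):
        z_b = (B - a * D) / (b * (b - a))
        z_a = (b * D - B) / (a * (b - a))
        if z_a >= 0 and z_b >= 0:
            keep(z_a + z_b)
    return None if best is None else best / (N * Q)


N, Q, K, r = 2520, 10, 10, 5
for budget in (25200, 44100, 50400, 75600):
    print(budget, lower_bound(budget, N, Q, K, r))
```

At the two ends of the budget range the bound reproduces the uncoded load $\frac{K-r}{K}$ and the CDC load $\frac{K-r}{rK}$; in between it interpolates rather than moving in steps.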

Split-CDC (S-CDC). In order to introduce S-CDC, we make the following observations on the shuffling strategy in CDC.

Obs. 1. The set $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ described in (1) is of size $|\mathcal{V}_{\mathcal{S}_{i}}^{i}|=\eta_{1}\eta_{2}$ .

Obs. 2. For every subset $\mathcal{S}$ of $r+1$ servers, the number of computations needed to satisfy all servers in $\mathcal{S}$ is $r(r{+}1)\eta_{1}\eta_{2}$ and the number of packets communicated among them is $\frac{r+1}{r}\eta_{1}\eta_{2}$ .

Obs. 3. From (1), it is not hard to see that for any subset $\mathcal{S}^{\prime}\subseteq\mathcal{S}$ such that $|\mathcal{S}^{\prime}|>1$ , $\mathcal{V}_{\mathcal{S}_{i}}^{i}\subseteq\mathcal{V}_{\mathcal{S}^{\prime}_{i}}^{i},\forall i\in\mathcal{S}^{\prime}$ .

The previous observations suggest the following modification to the CDC scheme. Each subset $\mathcal{S}$ of size $r+1$ can be split into disjoint subsets of smaller size. Each smaller subset $\mathcal{S}^{\prime}$ can still be used to satisfy its members with the set $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ as per Observation $3$ . Therefore, using subsets $\{\mathcal{S}^{\prime}\}$ of size different from $r+1$ allows the scheme to exhibit different levels of communication and computation per $\mathcal{S}$ (based on the size of the splits), as evident from Observation $2$ . The possible sizes of $\mathcal{S}^{\prime}$ are $r^{\prime}+1$ for $r^{\prime}\in[1:r]$ , and we refer to $r^{\prime}$ as the split size. For a given $r^{\prime}$ , define $j_{r^{\prime}}=\lfloor\frac{r+1}{r^{\prime}+1}\rfloor$ and $r^{\prime\prime}\!=\!(r\!+\!1)-j_{r^{\prime}}(r^{\prime}\!+\!1)-1$ . Thus we can split set $\mathcal{S}$ into $j_{r^{\prime}}$ disjoint sets $\mathcal{S}^{(r^{\prime})}$ of size $(r^{\prime}+1)$ and one set $\mathcal{S}^{(r^{\prime\prime})}$ of size $(r^{\prime\prime}+1)$ . For each set in $\mathcal{S}^{(r^{\prime})}$ , the needed number of computations is $r^{\prime}(r^{\prime}+1)\eta_{1}\eta_{2}$ and the needed number of communicated packets is $\frac{r^{\prime}+1}{r^{\prime}}\eta_{1}\eta_{2}$ . If $\mathcal{S}^{(r^{\prime\prime})}$ is not empty, then similar expressions follow (except when $|\mathcal{S}^{(r^{\prime\prime})}|=1$ , where we need $\eta_{1}\eta_{2}$ computations and $\eta_{1}\eta_{2}$ packet exchanges to send the intermediate values through unicast transmissions from any server in $\mathcal{S}^{(r^{\prime})}$ ). Finally, since $|\mathcal{V}_{\mathcal{S}_{i}}^{i}|=\eta_{1}\eta_{2}$ , for every subset $\mathcal{S}$ of size $r+1$ , CDC would naturally incur $\eta_{1}\eta_{2}$ transmission rounds, each delivering exactly one intermediate value in $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ for all servers in $\mathcal{S}$ .
Thus, our observations suggest that CDC can operate each of these transmission rounds with a different splitting size $r^{\prime}$ of $\mathcal{S}$ ; thus the name Split-CDC (S-CDC). For a transmission round using split size $r^{\prime}$ , the total computations and communications per subset $\mathcal{S}$ of size $r{+}1$ is

[TABLE]

S-CDC can now be formally described. Let $\frac{z_{r^{\prime}}}{\eta_{1}\eta_{2}}$ be the fraction of the intermediate values in $\mathcal{V}_{\mathcal{S}_{i}}^{i}$ per subset $\mathcal{S}$ that is delivered using split size $r^{\prime}$ . Then, S-CDC works as follows:

$\bf 1)$ Determine the optimal values of $\frac{z_{r^{\prime}}}{\eta_{1}\eta_{2}}$ for $r^{\prime}\in[1:r]$ - this is done by solving the LP given after these steps.

$\bf 2)$ For each $\mathcal{S}\subseteq[1{:}K]$ of size $r{+}1$ and split size $r^{\prime}\in[1{:}r]$ :

• Split set $\mathcal{S}$ into $j_{r^{\prime}}$ disjoint sets $\mathcal{S}^{(r^{\prime})}$ of size $(r^{\prime}+1)$ and one set $\mathcal{S}^{(r^{\prime\prime})}$ of size $(r^{\prime\prime}+1)$ .

• Use enough computations and communications per each of the subsets $\mathcal{S}^{(r^{\prime})}$ and $\mathcal{S}^{(r^{\prime\prime})}$ as per the CDC scheme, to deliver $z_{r^{\prime}}$ intermediate values to all servers in $\mathcal{S}$ . The computations and communications needed to do so are equal to $z_{r^{\prime}}Comp(r^{\prime})$ and $z_{r^{\prime}}Comm(r^{\prime})$ respectively.

What remains is to find the optimal values of $z_{r^{\prime}}$ . We do so via solving the following LP, which minimizes the total communication load subject to a total computation constraint.

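$$\begin{aligned}L_{P}(C_{total})=\min_{\{z_{1},\dots,z_{r}\}}\ &{K\choose r+1}\sum_{\ell=1}^{r}\frac{z_{\ell}\,Comm(\ell)}{NQ}\\\text{s.t.}\ &\sum_{\ell=1}^{r}z_{\ell}=\eta_{1}\eta_{2},\quad z_{i}\geq 0,\ \forall i\in[1:r],\\&{K\choose r+1}\sum_{\ell=1}^{r}z_{\ell}\,Comp(\ell)+\frac{rNQ}{K}\leq C_{total}.\end{aligned}$$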
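The S-CDC LP can be explored numerically with the sketch below. One caveat is important: the closed-form expressions for $Comp(r^{\prime})$ and $Comm(r^{\prime})$ are not reproduced in this text, so the function `round_costs` is only one possible reading of the prose description of the splits (full groups of size $r^{\prime}+1$, plus a leftover group handled by CDC or by a single unicast) and may differ from the authors' formulas. The LP itself is solved by enumerating vertices with at most two non-zero variables; `s_cdc_load` is a hypothetical name.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def round_costs(r, rp):
    """ASSUMED per-round (Comp, Comm) for split size rp on a subset of size r+1.

    Reading of the text: j groups of size rp+1 each cost rp(rp+1) computations
    and (rp+1)/rp packets; a leftover group of size 1 costs one computation and
    one unicast packet; a larger leftover group is treated like a CDC group.
    """
    j, rem = divmod(r + 1, rp + 1)
    comp = j * rp * (rp + 1)
    comm = j * Fraction(rp + 1, rp)
    if rem == 1:
        comp += 1
        comm += 1
    elif rem >= 2:
        comp += (rem - 1) * rem
        comm += Fraction(rem, rem - 1)
    return comp, comm


def s_cdc_load(C_total, N, Q, K, r):
    """Optimum of the S-CDC LP under the assumed costs, or None if infeasible."""
    if r >= K:
        raise ValueError("sketch assumes r < K")
    subsets = comb(K, r + 1)
    rounds = Fraction(N * Q, comb(K, r) * K)                       # eta_1 * eta_2
    budget = (Fraction(C_total) - Fraction(r * N * Q, K)) / subsets
    costs = {l: round_costs(r, l) for l in range(1, r + 1)}
    best = None
    for l, (cp, cm) in costs.items():            # one split size used throughout
        if rounds * cp <= budget:
            cand = rounds * cm
            best = cand if best is None else min(best, cand)
    for a, b in combinations(costs, 2):          # two split sizes, budget tight
        (cpa, cma), (cpb, cmb) = costs[a], costs[b]
        if cpa == cpb:
            continue
        z_b = (budget - cpa * rounds) / (cpb - cpa)
        z_a = rounds - z_b
        if z_a >= 0 and z_b >= 0:
            cand = z_a * cma + z_b * cmb
            best = cand if best is None else min(best, cand)
    return None if best is None else subsets * best / (N * Q)


for budget in (25200, 50400, 75600):
    print(budget, s_cdc_load(budget, 2520, 10, 10, 5))
```

Under these assumed costs, the sketch recovers the uncoded load at $C_{total}=NQ$ and the CDC load at the full CDC budget; intermediate values depend on the assumed `round_costs` and should be read as indicative only.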

Note that in the LP above, the variables $z_{\ell}$ are allowed to take non-integer values, which means that we are allowing the servers to do partial computations of the intermediate values if that is what they will need to transmit or decode. To restrict partial computations, we can approximate the solution of the LP to get a suboptimal integer-valued solution $\hat{z}^{\star}_{\ell}$ . Note that if an optimal solution of the LP is non-integer, then there exist only two non-zero elements of $\{z^{\star}_{\ell}\}$ ; we denote these two elements as $z_{\ell_{1}}$ and $z_{\ell_{2}}$ where $\ell_{1}<\ell_{2}$ . Then for our approximate solution, we define $\hat{z}^{\star}_{\ell_{2}}=\lfloor{z}^{\star}_{\ell_{2}}\rfloor$ and $\hat{z}^{\star}_{\ell_{1}}=\eta_{1}\eta_{2}-\lfloor{z}^{\star}_{\ell_{2}}\rfloor$ . This gives us a communication load $\hat{L}_{P}(C_{total})={K\choose r+1}\sum_{\ell=1}^{r}\frac{\hat{z}^{\star}_{\ell}Comm(\ell)}{NQ}$ .

Fig. 3 compares the performance of S-CDC with the lower bound in (4) for $N\!=\!2520,Q\!=\!K\!=\!10$ and $r=5$ when partial computations are allowed. In this particular setup, Fig. 3 shows that preventing partial computations costs only a small increase in the communication load.

Bibliography

[1] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Fundamental tradeoff between computation and communication in distributed computing,” in IEEE International Symposium on Information Theory (ISIT), 2016, pp. 1814–1818.
[2] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3] A. C.-C. Yao, “Some complexity questions related to distributive computing (preliminary report),” in Proceedings of the eleventh annual ACM symposium on Theory of computing, 1979, pp. 209–213.
[4] A. Orlitsky and J. Roche, “Coding for computing,” IEEE Trans. on Information Theory, vol. 47, no. 3, pp. 903–917, 2001.
[5] E. Kushilevitz and N. Nisan, “Communication complexity,” 2006.
[6] K. Becker and U. Wille, “Communication complexity of group key distribution,” in Proceedings of the 5th ACM conference on Computer and communications security, 1998, pp. 1–6.
[7] M. A. Maddah-Ali and U. Niesen, “Fundamental limits of caching,” IEEE Trans. on Information Theory, vol. 60, no. 5, pp. 2856–2867, 2014.
[8] N. Karamchandani, U. Niesen, M. A. Maddah-Ali, and S. N. Diggavi, “Hierarchical coded caching,” IEEE Trans. on Information Theory, vol. 62, no. 6, pp. 3212–3229, 2016.
