CAMR: Coded Aggregated MapReduce
Konstantinos Konstantinidis, Aditya Ramamoorthy

TL;DR
This paper introduces CAMR, a scheme for distributed algorithms whose intermediate computations can be aggregated, as is common in deep learning. It reduces the shuffle-phase communication load without requiring a number of jobs or data splits that grows exponentially in the system parameters.
Contribution
CAMR matches the communication load of the state-of-the-art scheme while keeping the number of jobs and data splits small, whereas prior methods require a number of jobs that grows exponentially in the system parameters.
Findings
Reduces communication load in MapReduce-like systems.
Maintains a small number of jobs and data splits.
Achieves the same communication load as the state-of-the-art.
| k | Minimum number of jobs (CAMR) | Minimum number of jobs (CCDC) |
|---|---|---|
| 2 | 50 | 4950 |
| 4 | 15625 | 3921225 |
| 5 | 160000 | 75287520 |
Konstantinos Konstantinidis and Aditya Ramamoorthy
Department of Electrical and Computer Engineering
Iowa State University
Ames, IA 50010
Email: {kostas, adityar}@iastate.edu
This work was supported in part by the National Science Foundation (NSF) under grant CCF-1718470.
Abstract
Many big data algorithms executed on MapReduce-like systems have a shuffle phase that often dominates the overall job execution time. Recent work has demonstrated schemes where the communication load in the shuffle phase can be traded off for the computation load in the map phase. In this work, we focus on a class of distributed algorithms, broadly used in deep learning, where intermediate computations of the same task can be combined. Even though prior techniques reduce the communication load significantly, they require a number of jobs that grows exponentially in the system parameters. This limitation is crucial and may diminish the load gains as the algorithm scales. We propose a new scheme which achieves the same load as the state-of-the-art while ensuring that the number of jobs as well as the number of subfiles that the data set needs to be split into remain small.
I Introduction
The recent growth of big data analytics, in which data on the order of petabytes or more must be processed quickly, has fueled the development of several distributed programming models running on clusters of commodity servers. Some characteristic examples are MapReduce [1], Hadoop [2] and Spark [3].
In these frameworks, the data set is split into disjoint subfiles stored across the worker nodes. The computation takes place in three steps. Initially, the processing servers map the input subfiles to intermediate values having the form of (key, value) pairs. In the next shuffle step, the intermediate pairs are exchanged between the servers. In the final reduce step, each server computes a set of output functions defined based on the keys. By virtue of their simplicity, scalability and fault-tolerance, these frameworks are becoming ubiquitous and have gained significant momentum within both industry and academia. They are well suited for several applications including machine learning [4], [5], graph processing [6], data sorting [7] and web logging [1].
Compelling evidence obtained on large scale clusters suggests that the time spent merely on communication often dominates the execution time. For example, by analyzing a week-long trace from Facebook’s Hadoop cluster, the authors of [8] demonstrated that “on average, % of the overall job execution time is spent on data shuffling.” Similar effects have been reported in the work of [9] on other shuffle-heavy operations such as SelfJoin, TeraSort and RankedInvertedIndex which underlie many deep learning algorithms. Distributed graph analytics also suffer from long communication phases as observed in [6], accounting for up to % of the overall execution time in representative cases [10].
In this paper, we focus on distributed algorithms for which the intermediate values of a particular job computed during the Map phase can be combined locally by the servers before transmission. This kind of computation is predominant in machine learning (e.g., ImageNet classification [5] and stochastic gradient descent [11]). Another use case would be the matrix-vector multiplications performed during the forward and backward propagation in neural networks (cf. [Dally_nips_tutorial]). In our context, computing each of these products constitutes a job. We could also consider training multiple models simultaneously, as long as they have the same dimensionality. This so-called compression technique was initially investigated in [1] by means of a “combiner function”, which merges multiple intermediate values with the same key computed from different Map functions.
The work of [12] proposed an approach (inspired by coded caching) for trading off communication load with computation load in MapReduce-like systems. This was extended in [4] to Compressed Coded Distributed Computing (CCDC), where compressible functions were considered. In prior work, we addressed one limitation of [12], namely the requirement that jobs be split very finely to obtain the promised communication load. Our approach demonstrated a deep relationship between this problem and a class of combinatorial structures called resolvable designs while achieving significant speedup compared to the state-of-the-art.
I-A Main contributions of our work
It turns out that [4] has a limitation of a similar flavor. In this case the number of jobs needs to scale exponentially in the problem parameters to obtain the promised reduction in communication load.
In this work, we extend our algorithm to applications where intermediate values can be compressed and we substantially reduce the requirement on the number of jobs compared to prior literature. The immediate benefit that stems from this fact is that as the size of the cluster increases, the required number of MapReduce jobs (and hence the total number of subfiles) does not scale exponentially. The implicit benefit is that a low requirement on the number of jobs decreases the encoding complexity. This is important since, as we have shown in [7], increasing the number of tasks scales the overhead of the encoding complexity and can diminish any gains in the communication load. We expect a similar type of phenomenon in the current setting.
Our new scheme is named coded aggregated MapReduce (abbreviated, CAMR). We characterize the achievable communication load of CAMR and show that it matches the state-of-the-art. The next section gives the general problem formulation, while in the remaining sections, we describe our scheme, analyze the achievable load and compare it with other combining methods.
II Problem Formulation
Our goal is to process distributed computing jobs (denoted ) in parallel on a cluster of homogeneous servers , i.e., machines that have similar computational power. The data set of each job is partitioned into disjoint and equal-sized subfiles. The subfiles of the -th job are denoted by . A total of output functions, denoted , need to be computed for each job. Note that these functions may be different across different jobs. We examine a special class of functions that possess the aggregation property.
Definition 1.
In database systems, an aggregate function is one that is both associative and commutative.
For example, in jobs with linear aggregation the evaluation of each output function can be decomposed as the sum of intermediate values, one for each subfile, i.e., for ,
[TABLE]
where and each such value is assumed to be of size bits. In what follows we use to denote the aggregation of intermediate values of the same function and job into a single “compressed” value.
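As a small illustration of the aggregation property, the following sketch (plain Python, with made-up intermediate values and a hypothetical grouping into batches) checks that combining per-batch aggregates gives the same result as combining all intermediate values directly. Addition stands in for a generic associative and commutative operator; none of the names below come from the paper.

```python
from functools import reduce
from operator import add

def aggregate(values):
    """Combine intermediate values with an associative, commutative operator (here: +)."""
    return reduce(add, values, 0)

# Hypothetical intermediate values of one output function, one per subfile.
intermediate = [4, 1, 7, 3, 5, 2]

# Hypothetical grouping of the six subfiles into three batches of two.
batches = [intermediate[0:2], intermediate[2:4], intermediate[4:6]]

# One compressed value per batch, followed by a final combine at the reducer.
compressed = [aggregate(b) for b in batches]
direct = aggregate(intermediate)
via_batches = aggregate(compressed)
```

In this toy setting a mapper would send three compressed values instead of six raw ones, and the reducer still obtains the same output.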
A master node judiciously places each subfile on at least one server before initiating the algorithm.
Definition 2.
The storage fraction of a distributed computation scheme is the fraction of the data sets across all jobs that each machine locally caches.
Our formulation assumes that divides so that each server is assigned to functions per job. However, our proposed algorithm and the main results can be obtained as a simple extension of the case when each server is computing one function. For example, one can repeat the Shuffle phase times. Owing to this fact, we will only present the case of .
The framework starts with the Map phase during which the servers (in parallel) “map” every subfile to the values . Following this, the servers multicast the computed intermediate values amongst one another via a shared link in the Shuffle phase. In the final Reduce phase, server computes (or reduces) for as it has all the relevant intermediate values required for performing this operation.
Definition 3.
The communication load of a scheme is the total amount of data (in bits) transmitted by the servers during the Shuffle phase normalized by .
Example 1.
Suppose that our task consists of jobs. For the -th job we need to count words given by the set in a book consisting of chapters using a cluster of servers. is associated with the -th book and its subfiles with the chapters . Function (assigned to server since as discussed) counts the word of in the book indexed with . This formulation fits the linear aggregation case precisely. Indeed, each reducer only needs the sum of the word counts for the subfiles that it does not locally store and hence there is scope for “compressing” multiple values at the end of the Map phase.
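The word-counting job can be sketched as follows. This is only a toy illustration: the chapter texts, the words of interest and the choice of which chapter the reducer stores locally are all invented, and the helper names `map_chapter` and `combine` are hypothetical.

```python
from collections import Counter

def map_chapter(text, words):
    """Map one chapter (subfile) to a count for each word of interest."""
    counts = Counter(text.lower().split())
    return {w: counts[w] for w in words}

def combine(mapped, word):
    """Aggregate the counts of `word` over several chapters into a single value."""
    return sum(m[word] for m in mapped)

# Hypothetical book with three chapters and two words of interest.
chapters = ["the cat sat", "the dog and the cat", "a dog barked"]
words = ["the", "dog"]

mapped = [map_chapter(c, words) for c in chapters]

# Suppose the reducer for "the" stores chapter 0 locally but not chapters 1 and 2.
local = combine(mapped[:1], "the")
received = combine(mapped[1:], "the")  # a single aggregated value is transmitted
total_the = local + received
```

The reducer never needs the individual counts of the chapters it is missing, only their sum, which is what makes the compression possible.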
III Description of the CAMR Scheme
In this section, we describe our proposed algorithm. We begin by introducing a few design theory definitions.
Definition 4.
A design is a pair consisting of
1. a set of elements (points), , and
2. a family (i.e., multiset) of nonempty subsets of called blocks, where each block has the same cardinality.
In this paper, we use a special class of designs, called resolvable designs.
Definition 5.
A subset in a design is said to be a parallel class if for and with we have and . A partition of into several parallel classes is called a resolution, and is said to be a resolvable design if has at least one resolution.
It turns out that there is a systematic procedure for constructing resolvable designs from error correcting codes.
Let denote the additive group of integers modulo . The generator matrix of an single parity-check (SPC) code over (we emphasize that this construction works even if is not a prime, i.e., is not a field) is defined by
[TABLE]
This code has codewords. The codewords are for each possible message vector . The codewords computed in this manner are stacked into the columns of a matrix of size , i.e.,
[TABLE]
The corresponding resolvable design is constructed as follows. Let (for a positive integer , we use to denote the set throughout) represent the point set of the design. We define the blocks as follows. For , let be a block defined as
[TABLE]
The set of blocks is given by the collection of all for and so that . The following lemma (see [13] for a proof in a different context) shows that this construction yields a resolvable design.
Lemma 1.
The above scheme always yields a resolvable design with , for all and . The parallel classes are analytically described by , for .
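The construction can be sketched in a few lines of Python. The sketch rests on assumptions that should be read as such: `k` denotes the code length and `q` the alphabet size, the generator matrix is taken to be the standard SPC form [I | 1] (so the last symbol is the sum of the message symbols modulo `q`), points and coordinates are indexed from 0, and a block is the set of points whose codeword carries a given symbol in a given coordinate.

```python
from itertools import product

def spc_design(k, q):
    """Blocks of the design obtained from a (k, k-1) single parity-check code over Z_q.

    Assumption: the generator matrix is [I | 1], so the last symbol of each
    codeword is the sum of the message symbols modulo q.
    Returns a dict mapping (i, l) to the set of points j whose codeword has
    symbol l in coordinate i.
    """
    codewords = [u + (sum(u) % q,) for u in product(range(q), repeat=k - 1)]
    blocks = {(i, l): set() for i in range(k) for l in range(q)}
    for j, c in enumerate(codewords):
        for i, symbol in enumerate(c):
            blocks[(i, symbol)].add(j)
    return blocks

def parallel_classes(blocks, k, q):
    """Group the blocks by coordinate i; each group is one parallel class."""
    return [[blocks[(i, l)] for l in range(q)] for i in range(k)]

k, q = 3, 2
blocks = spc_design(k, q)
classes = parallel_classes(blocks, k, q)
points = set(range(q ** (k - 1)))
```

Under these assumptions, every parallel class partitions the point set, which is the resolvability property the lemma refers to.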
III-A Job assignment and file placement
Our cluster consists of servers and we choose appropriate integers that factorize it as ; we further need to be divisible by . Next, we form a SPC code and the corresponding resolvable design, as described above. The jobs to be executed are associated with the point set . Hence and the block set will be such that . The servers are associated with the blocks and are indexed as and .
The assignment of jobs to servers follows the natural incidence between points and blocks. Thus, job is processed by (or “owned” by) the server indexed by if . For the sake of convenience we will also interchangeably work with servers indexed as with the implicit understanding that each corresponds to a block from . By convention, server will be associated with the block .
Let us denote the owners of by . For each job, the data set is split into batches and each batch is made up of subfiles, for some integer (recall that ). The file placement policy is illustrated in Algorithm 1.
Each server is owner of jobs (block size). For each such job it participates in batches of size , as explained in Algorithm 1. Our requirement for the storage fraction is
[TABLE]
Example 2.
In Example 1, we have a cluster of nodes. We chose our parameters and , then we need to execute MapReduce jobs. The codewords for this choice of parameters are . Hence, based on Eq. (1), the owners are
[TABLE]
We have subdivided the original data set of each job into subfiles. The subfiles of the -th job are partitioned into three batches, namely , and . Exactly four such batches are stored on each machine (cf. Fig. 1). For , each job’s data set is split into subfiles placed on a unique subset of nodes. For example, the subfiles of job , , are stored exclusively on , and . Specifically, the three batches of the first job are
[TABLE]
Then, batch is stored on machines and , on and and, finally, on and . Each machine locally stores of all the data sets.
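The placement in this example can be reproduced with a short sketch. Algorithm 1 is not reproduced here, so the rule used below is an assumption inferred from the example: each job has `k` batches and batch `m` is stored by every owner of the job except the `m`-th one. The parameters `k = 3` and `q = 2` are chosen to be consistent with the three batches per job and four batches per machine mentioned above; servers are labelled by hypothetical `(class, symbol)` pairs.

```python
from fractions import Fraction
from itertools import product

def owners_by_job(k, q):
    """Owners of each job: job j is owned by block (i, c_j[i]) for every coordinate i."""
    codewords = [u + (sum(u) % q,) for u in product(range(q), repeat=k - 1)]
    return [[(i, c[i]) for i in range(k)] for c in codewords]

def place_batches(k, q):
    """Assumed placement: batch m of a job is stored by every owner except the m-th."""
    storage = {(i, l): [] for i in range(k) for l in range(q)}
    for j, owners in enumerate(owners_by_job(k, q)):
        for m in range(k):
            for pos, server in enumerate(owners):
                if pos != m:
                    storage[server].append((j, m))  # (job, batch) pair
    return storage

k, q = 3, 2
storage = place_batches(k, q)
num_jobs = q ** (k - 1)
# Each job has k equal batches, so a server's share of all data is
# (number of batches it stores) / (k * number of jobs).
share = {s: Fraction(len(b), k * num_jobs) for s, b in storage.items()}
```

With these parameters every server ends up with four batches, and the share of the data it caches follows directly from the batch count.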
III-B Map phase
During this phase, each server maps all the subfiles of each job it has partially stored, for all output functions. The resulting intermediate values have the form
[TABLE]
At the end of the Map phase, for each job , each mapper combines all those values that are indexed with the same and (in other words, associated with the same function and job) and belong to the same batch of subfiles; we have already referred to this operation as aggregation. Our shuffle algorithm operates on the batch level, as will become clear in the following section.
III-C Shuffle phase
The CAMR scheme carries out the data shuffling phase in three stages. The first two stages utilize a common shuffling algorithm (cf. Algorithm 2), summarized in the following lemma and proved in [14, Appendix].
Lemma 2.
Consider a group of machines with the property that every subset of of the form , stores a chunk of data of size bits, denoted , that does not store. Then, there exists a protocol where each machine in can multicast a coded packet useful to all other machines and after such transmissions each of them can recover its missing chunk. The total number of bits transmitted in this protocol is .
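A minimal sketch of such a protocol is given below. Algorithm 2 is not reproduced here, so the packet-to-sender assignment is one possible choice consistent with the statement of the lemma, not necessarily the authors' exact one. Machines are numbered from 0, `chunks[m]` is the chunk that machine `m` is missing (and every other machine stores), and chunk lengths are assumed to be divisible by the group size minus one.

```python
import os

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split(chunk, parts):
    """Split a chunk into `parts` equal packets (length must be divisible)."""
    size = len(chunk) // parts
    return [chunk[p * size:(p + 1) * size] for p in range(parts)]

def coded_shuffle(chunks):
    """chunks[m] is the chunk machine m is missing; every other machine stores it.

    Returns (recovered_chunks, number_of_transmitted_bytes).
    """
    k = len(chunks)
    # packet[m][s]: the packet of chunk m associated with sender s (s != m).
    packet = {}
    for m in range(k):
        senders = [s for s in range(k) if s != m]
        packet[m] = dict(zip(senders, split(chunks[m], k - 1)))

    # Each machine s multicasts the XOR of its packets from all chunks it stores.
    sent = {}
    for s in range(k):
        coded = bytes(len(chunks[0]) // (k - 1))
        for m in range(k):
            if m != s:
                coded = xor_bytes(coded, packet[m][s])
        sent[s] = coded

    # Receiver r cancels the packets it already stores and keeps packet[r][s].
    recovered = []
    for r in range(k):
        parts = []
        for s in range(k):
            if s == r:
                continue
            residue = sent[s]
            for m in range(k):
                if m != s and m != r:
                    residue = xor_bytes(residue, packet[m][s])
            parts.append(residue)
        recovered.append(b"".join(parts))
    return recovered, sum(len(c) for c in sent.values())

k, chunk_bytes = 3, 8
chunks = [os.urandom(chunk_bytes) for _ in range(k)]
recovered, transmitted = coded_shuffle(chunks)
```

Each of the `k` machines sends one coded packet of size `chunk_bytes / (k - 1)`, so the total traffic is `k * chunk_bytes / (k - 1)` bytes, compared with `k * chunk_bytes` if every chunk were sent uncoded.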
Stage 1: In this stage, the owners of each job communicate among themselves. Let us fix a job and consider the servers in of cardinality (cf. Algorithm 1). During the Map phase, each machine in that subset has computed an aggregate needed by the remaining owner which is
[TABLE]
Repeating this process for every value of and , we can identify all aggregates . We shall now see a one-to-one correspondence between this setup and Lemma 2, which is the following:
[TABLE]
for and the owners .
Each owner of a particular job, after receiving such values (one from every other owner of a particular job), can decode all of its missing aggregates for that job.
Example 3.
In Example 1, let us consider the group of servers which are the owners of , storing , and , respectively. Based on this allocation policy, server needs evaluations of the batch , i.e., for or simply the aggregate
[TABLE]
Similarly, needs and needs .
Next, we refer to Fig. 2. The compressed intermediate values are represented by circle/green, star/blue and triangle/red. We further suppose that each value can be split into two packets (represented by the left and right parts of each shape). If transmits left circle XOR left star, then is able to cancel out the star part (since also maps ) and recover the circle part. Similarly, can recover the star part from the same transmission. Each of these transmissions is useful to two servers.
We can repeat this process for the remaining jobs. The total number of bits transmitted in this case is therefore . The incurred communication load is .
Stage 2: In this stage, we form communication groups of both owners and non-owners of a job, so that the latter can recover appropriate data to reduce their functions.
Towards this end, we form collections of user groups by choosing one block from each parallel class based on a simple rule. We choose servers such that . It has been proved in [13] (but in a different context) that if we remove a server from such a group , the servers in the corresponding subset of cardinality jointly own a job, say , that the remaining server does not. In addition, based on the file placement policy described before (cf. Algorithm 1), they share the batch of subfiles for that common job and some .
The following simple observation is important. By construction, is precisely the remaining owner of and it should lie in the parallel class that none of the other owners belong to; that is the same class as of .
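The group structure of Stage 2 can be checked numerically. The selection rule used below is an assumption, chosen to be consistent with Example 4: pick one block from each parallel class and keep the choice only if the chosen blocks have no job in common. The design is built from an SPC code with generator matrix [I | 1] (also an assumption), and servers are labelled by hypothetical `(class, symbol)` pairs.

```python
from itertools import product

def spc_blocks(k, q):
    """Blocks (i, l) -> set of jobs, from a (k, k-1) SPC code over Z_q."""
    codewords = [u + (sum(u) % q,) for u in product(range(q), repeat=k - 1)]
    blocks = {(i, l): set() for i in range(k) for l in range(q)}
    for j, c in enumerate(codewords):
        for i, symbol in enumerate(c):
            blocks[(i, symbol)].add(j)
    return blocks

def stage2_groups(k, q):
    """Pick one block per parallel class; keep choices whose blocks share no job."""
    blocks = spc_blocks(k, q)
    groups = []
    for labels in product(range(q), repeat=k):
        members = [(i, labels[i]) for i in range(k)]
        if not set.intersection(*(blocks[s] for s in members)):
            groups.append(members)
    return blocks, groups

def shared_jobs_without(blocks, group, excluded):
    """Jobs owned by every server of the group except `excluded`."""
    rest = [blocks[s] for s in group if s != excluded]
    return set.intersection(*rest)

k, q = 3, 2
blocks, groups = stage2_groups(k, q)
```

For every group found this way, removing any one server leaves a subset that jointly owns exactly one job, and the removed server is not an owner of that job, which matches the property quoted from [13].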
During the Map phase, each node in has computed an aggregate needed by which is
[TABLE]
As in stage 1, Lemma 2 fits in this description and Algorithm 2 defines the communication scheme; the shuffling group is and each server needs to recover the chunk for the unique batch that all nodes in share.
As a result, at the end of stage 2, each server is able to decode all aggregates of the form in Eq. (4) for all values of , i.e., for all nodes that belong to the same parallel class as . Note that each such value (for a fixed ) corresponds to (block size) jobs for which does not store any subfiles and does not store the batch .
Example 4.
In Example 1, in stage 2, the nodes recover values of jobs for which they have not stored any subfile. Let . Observe from Eq. (2) that there is no job common to all three but each subset of two of them shares a batch of a job they commonly own. The remaining server needs an aggregate value of those subfiles. The values that each of needs as well as the corresponding transmissions are illustrated in Table I. We denote the -th packet of an aggregate value by .
There are possible such groups we can pick. The total load is .
Stage 3: After Stage 2, each worker is still missing values for jobs that it does not own. Now, servers communicate within parallel classes. In particular, we show in [14, Appendix] that all values that a server still needs can be aggregated and transmitted by a single owner-server in the same parallel class that belongs to. This server is unique and transmits one aggregate value of its jobs to every other server in the same parallel class.
Recall that the -th class is , then, server transmits
[TABLE]
to another ; obviously, .
We repeat this process for every pair of servers in the same class.
Example 5.
In Example 1, if we consider the same group as in Stage 2, i.e., then we can see that still misses values and of or simply their aggregate . Observe that all required subfiles locally reside in the cache of which can transmit the value to . For the complete set of unicast transmissions see [14, Table II]. The load turns out to be .
The communication load of all phases is then . Similarly, the load achieved by the CCDC scheme of [4] for the same storage fraction is . Nonetheless, their approach would require a minimum of distributed jobs to be executed.
III-D Reduce phase
Using the values it has computed and received, reduces
[TABLE]
for all and .
IV Communication Load Analysis
In the first stage, for each of the jobs, each of the owners computes one aggregate and is associated with a unique corresponding packet of it, of size . As a result, the communication load exerted in this stage is
[TABLE]
The second stage involves the communication within all possible groups that satisfy the desired property. In each case, workers transmit one value each, and the transmission is of length . Then,
[TABLE]
Each server does not own jobs. For each of them, during stage 3, one transmission (of length ) from a server in the same parallel class is sufficient. Thus,
[TABLE]
The total load is
[TABLE]
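The counting argument of this section can be evaluated with exact arithmetic. The notation in the sketch is assumed rather than taken from the text: `K = k*q` servers, `q**(k-1)` jobs, one function per server, coded packets of size 1/(k-1) of an aggregate, full-size unicasts in stage 3, and loads normalized by (number of jobs) x (number of functions) x (bits per aggregate). The number of stage-2 groups and the closed form used for the CCDC load of [4] are likewise assumptions.

```python
from fractions import Fraction

def camr_loads(k, q):
    """Per-stage loads from the counting argument, under assumed notation.

    Assumptions: K = k*q servers, J = q**(k-1) jobs, one function per server,
    B bits per aggregate, and loads normalized by J * K * B (B cancels out).
    """
    K, J = k * q, q ** (k - 1)
    norm = J * K
    packet = Fraction(1, k - 1)          # coded packet size in units of B
    stage1 = J * k * packet / norm       # k owners per job, one packet each
    groups = q ** (k - 1) * (q - 1)      # stage-2 groups with no common job
    stage2 = groups * k * packet / norm  # k transmissions per group
    not_owned = J - q ** (k - 2)         # jobs a server does not own
    stage3 = K * not_owned * Fraction(1) / norm  # one full-size unicast each
    return stage1, stage2, stage3

def ccdc_load(mu, K):
    """Assumed form of the CCDC load of [4]: (1 - mu) * (mu*K + 1) / (mu*K)."""
    return (1 - mu) * (mu * K + 1) / (mu * K)

k, q = 3, 2
s1, s2, s3 = camr_loads(k, q)
total = s1 + s2 + s3
mu = Fraction(k - 1, k * q)
```

Under these assumptions the three stage loads sum to the same value as the assumed CCDC expression evaluated at storage fraction `mu`, in line with the comparison made in the next section.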
V Comparison With Other Schemes
The technique proposed in [4] demonstrates a load of
[TABLE]
for a suitable storage fraction such that . Our storage requirement is equal to . For the same storage requirement, Eq. (6) yields
[TABLE]
We conclude that the loads induced by the two schemes are identical. However, their approach fundamentally relies on the requirement that the minimum number of jobs to be executed is . Comparing this value with our requirement for and using a known bound for the binomial coefficients, we deduce that [CLRS_book]
[TABLE]
where the bound of (V) is maximum when and becomes stricter for ; however, as increases the bound of (V) loosens and it turns out that our requirement for the number of jobs becomes exponentially smaller than that of CCDC (cf. [14, Table III] for a numerical comparison).
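To get a feel for the gap between the two job requirements, the sketch below evaluates both under assumed formulas: `q**(k-1)` jobs for CAMR with `q = K // k`, and a binomial coefficient `comb(K, k)` for CCDC at the same storage fraction. The cluster size `K = 100` is chosen purely for illustration.

```python
from math import comb

def min_jobs(K, k):
    """Minimum number of jobs under assumed formulas.

    Assumptions: CAMR needs q**(k-1) jobs with q = K // k, and CCDC needs
    comb(K, k) jobs at the same storage fraction.
    """
    if K % k:
        raise ValueError("k must divide K")
    q = K // k
    return q ** (k - 1), comb(K, k)

K = 100  # cluster size chosen for illustration
rows = [(k, *min_jobs(K, k)) for k in (2, 4, 5)]
```

Each entry of `rows` is a triple (k, CAMR jobs, CCDC jobs); the ratio between the last two grows quickly with `k`.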
Proof of Lemma 2
We shall refer to Algorithm 2 in order to show that each machine in can recover its missing data chunk. Fix a pair of machines and the packet transmitted from to . By canceling out all terms of with in Eq. (3), which locally stores, it can recover the remaining term, i.e., . Keeping fixed, we repeat this process for every possible machine . Since each of them is associated with a distinct packet of it follows that by receiving the packets
[TABLE]
can recover the following packets
[TABLE]
Subsequently, concatenates them in order to recover . Since this proof holds independently of the choice of , we have shown that all machines can recover their missing chunks at the end of the transmissions.
Since each chunk is assumed to be of size bits and it was split into packets of size , the total amount of transmitted data is .
Proof of Shuffling Correctness of Stage 3
The proof follows from stage 2 and by the resolvability property of our design. Let us fix a shuffling group of stage 2, say , a subset and focus on the excluded server . The servers in share a batch of a job whose values have transmitted to . The remaining batches of that still needs are locally stored at a single server (precisely the owner of the job) in the remaining parallel class, i.e., the parallel class of (cf. [Section III.A]). The fact that the design is resolvable makes that server unique, since no blocks within a parallel class can have common points (recall that points have one-to-one correspondence with the jobs). That node will transmit the uncoded aggregate to . Such transmissions benefit a single machine.
References
- [1] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, January 2008.
- [2] “Apache Hadoop.” [Online]. Available: http://hadoop.apache.org/
- [3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in 2nd USENIX Conference on Hot Topics in Cloud Computing, June 2010, pp. 10–10.
- [4] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Compressed coded distributed computing,” in IEEE International Symposium on Information Theory (ISIT), June 2018, pp. 2032–2036.
- [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
- [6] S. Prakash, A. Reisizadeh, R. Pedarsani, and A. S. Avestimehr, “Coded computing for distributed graph analytics,” in IEEE International Symposium on Information Theory (ISIT), June 2018, pp. 1221–1225.
- [7] K. Konstantinidis and A. Ramamoorthy, “Leveraging coding techniques for speeding up distributed computing,” in IEEE Global Communications Conference (GLOBECOM), December 2018.
- [8] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, “Managing data transfers in computer clusters with Orchestra,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 98–109, August 2011.
