The Hardness of Optimization Problems on the Weighted Massively Parallel Computation Model
Hengzhao Ma, Jianzhong Li

TL;DR
This paper investigates the computational complexity of minimizing communication costs in a weighted network model for parallel computation, establishing hardness results for various problem variants.
Contribution
It introduces the Weighted Massively Parallel Computation (WMPC) model and characterizes the complexity of related communication minimization problems.
Findings
Some problems are in P
Some are NP-complete
Some are W[1]-complete
Abstract
The topology-aware Massively Parallel Computation (MPC) model is proposed and studied recently, which enhances the classical MPC model by the awareness of network topology. The work of Hu et al. on topology-aware MPC model considers only the tree topology. In this paper a more general case is considered, where the underlying network is a weighted complete graph. We then call this model as Weighted Massively Parallel Computation (WMPC) model, and study the problem of minimizing communication cost under it. Two communication cost minimization problems are defined based on different pattern of communication, which are the Data Redistribution Problem and Data Allocation Problem. We also define four kinds of objective functions for communication cost, which consider the total cost, bottleneck cost, maximum of send and receive cost, and summation of send and receive cost, respectively.β¦
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Graph Theory Research Β· Interconnection Networks and Systems Β· Complexity and Algorithms in Graphs
11institutetext: β βShenzhen Institute of Advanced Technology, Chinese Acadamy of Sciences
πβ {hz.ma,lijzh}@siat.ac.cn
The Hardness of Optimization Problems on the Weighted Massively Parallel Computation Modelβ β thanks: This work was supported by the National Natural Science Foundation of China under grants 61832003, 62273322, 61972110, and National Key Research and Development Program of China under grants 2021YFF1200100 and 2021YFF1200104.
Hengzhao Maπ β β ββ
Jianzhong Liπ β β
Abstract
The topology-aware Massively Parallel Computation (MPC) model is proposed and studied recently, which enhances the classical MPC model by the awareness of network topology. The work of Hu et al. on topology-aware MPC model considers only the tree topology. In this paper a more general case is considered, where the underlying network is a weighted complete graph. We then call this model as Weighted Massively Parallel Computation (WMPC) model, and study the problem of minimizing communication cost under it. Two communication cost minimization problems are defined based on different pattern of communication, which are the Data Redistribution Problem and Data Allocation Problem. We also define four kinds of objective functions for communication cost, which consider the total cost, bottleneck cost, maximum of send and receive cost, and summation of send and receive cost, respectively. Combining the two problems in different communication pattern with the four kinds of objective cost functions, 8 problems are obtained. The hardness results of the 8 problems make up the content of this paper. With rigorous proof, we prove that some of the 8 problems are in P, some FPT, some NP-complete, and some W[1]-complete.
Keywords:
massivelly parallel computation, Weighted MPC model, communication cost optimization
1 Introduction
The Massively Parallel Computation model [33], MPC for short, has been a well acknowledged model to study parallel algorithms [20, 2, 7, 9, 10, 25, 39, 59] ever since it was proposed. Compared to other parallel computation models such as PRAM [35], BSP [60], LogP [19] and so on, the advantage of the MPC model lies in its simplicity and the power to capture the essence of computation procedure of modern share-nothing clusters. In the MPC model, computation proceeds in synchronous rounds, where in each round the computation machines first communicate with each other, then conduct local computation. Any pair of machines can communicate in a point-to-point manner, and all the communication messages can be transferred without congestion.
Although the MPC model is simple and powerful, one of its most important shortcomings is revealed by some recent works [11, 32], which is the strong assumption of homogeneity. All the machines in MPC model are considered as identical, and the communication bandwidth between any pair of machines are identical too [33]. In realistic parallel environment, the assumption of identical computation machines can be satisfied in most cases, but the assumption of identical communication bandwidth can not. Typically, a cluster consists of several racks connected by slower communication channels, and each rack includes several machines connected by faster communication channels. Thus, the communication bandwidth of in-rack and across-rack communication differ significantly, which refutes the basic assumption of homogeneous communication network in MPC model.
In order to tackle this shortcoming of the MPC model, a new topology aware massively parallel computation model was proposed and studied in [11, 32]. The computation machines are still identical in this model111There may be non-computational machines in this model, though., but the communication bandwidth between different pair of machines are different. This model was first proposed in recent works [11], where the underlying communication network is represented as a graph, and the edges are assigned with a weight which represents the communication bandwidth. However, the paper [11] only declared the new model but did not give any theoretical results. The other work [32] considered three data processing tasks on this model, which are set intersection, Cartesian product and sorting. Algorithms and lower bounds about the communication cost optimization problems for the three tasks were proposed [32]. However, the authors of [32] restricted the underlying communication network to trees, and the algorithm and lower bounds given in that paper can not be generalized to graphs other than trees.
In this paper, we follow the line of research started by [11, 32], and consider the topology aware massively parallel computation model in a more general case, where the underlying communication network is a complete weighted graph. In this sense, our work is a complement to the work in [32]. The goal of this paper is also to minimize the communication cost. However, unlike the work in [32] which considers specific computation tasks, in this paper we define general communication cost minimization problems that capture the characteristics of a variety of computation tasks.
1.1 Description of the research problems in this paper
1.1.1 The WMPC Model
We first give a more detailed description of the computational model considered in this paper, which is called Weighted Massively Parallel Computation (WMPC) model.
In WMPC model, there are computation machines with identical computational power. The communication network is modeled as a weighted complete graph, which is represented by a matrix . is called the communication cost matrix from now on, and it is considered as a known parameter of the WMPC model. is the communication cost between computation machine and for , where larger value implies larger communication cost or communication latency. is set to 0 for . It is assumed that all pairs of machines can communicate in a point-to-point way which is in accordance with the original MPC model, and thus holds for . The matrix is not necessary to be symmetric, i.e., may not be equal to .
The computation on WMPC proceeds in synchronous rounds which behaves the same with the original MPC model. In each round, the computation machines first communicate with each other, then conduct local computation.
The initial data distribution plays an important part in the problems studied in this paper. A lot of former research works on MPC model assume that the data are uniformly split across the machines [25, 9]. In this paper, it is assumed that the data can be arbitrarily distributed, and the amount of data placed at each machine is known in advance. This is also the same assumption adopted in [10, 32].
1.1.2 Objective functions
The goal of this paper is to minimize the communication cost under WMPC model, which is divided into send cost and receive cost. If amount of data is transferred from machine to machine , it incurs send cost to machine , and receive cost to machine . Denote and to be the send and receive cost of machine for , then we define the following four objective functions.
Total cost (TOTAL): .
Bottleneck cost (BTNK): .
Maximum of send and receive cost (MSR): .
Sum of send and receive cost (SSR): .
For the TOTAL cost function, it holds that . For the BTNK cost function, we choose to use the bottleneck of the receive cost rather than send cost, since the receive cost also reflects the workload of local computation. The MSR cost function is used in [57] on the classical MPC model, and in this paper we investigate its properties on WMPC model. The SSR cost function is closely related to MSR cost function, and is basically a 2-approximation of MSR cost.
Note that the send and receive cost is defined based on the amount of data transferred between two machines. For different commutation task and different communication pattern, the way of calculating the amount of transferred data will be different. Next we will use concrete computation tasks such as parallel sorting and join as the introducing example, analyze their communication patterns, and define the problems to be studied in this paper. We will introduce two problems, named Data Redistribution Problem and Data Allocation Problem.
1.1.3 The Data Redistribution Problem
Consider the following parallel sorting algorithm on classical MPC model, which is often referred as TeraSort [51]. The algorithm first selects splitters and broadcast the splitters to all machines. The splitters form intervals where and . After obtaining the splitters, each machine sends the local data falling in the -th interval to the -th machine. In such way the data is ordered across the machines. Note that the label of the machines are fixed before the algorithm starts. Then the machines conduct local sorting, and the sorting task can be finished.
Now consider running the parallel sorting algorithm on the WMPC model, and assume that the splitters have been determined. The algorithm described above asks the data in the -th interval to be sent to the -th machine. However, this operation may lead to non-optimal communication cost. Consider the following extreme case. The data are initially inversely sorted across the machines, i.e., for machine , the data in machine are always no less than the data in machine . In such a case, if the -th interval is assigned to the -th machine, there would be no need to conduct communication. However, if the algorithm asks to send the data in the -th interval to the -th machine, all the data will be totally redistributed, incurring large amount of communication.
Actually, there exist two shortcomings for the above TeraSort algorithm on classical MPC model. First, it neglects the initial data distribution, and neglects the importance of the way to assign the intervals to the machines to minimize the communication cost. Second, it does not consider the difference of communication costs between different pair of machines. By tackling these two points together, the first research problem to be studied in this paper is formed, which is called the Data Redistribution Problem (DRP).
The input of DRP is two matrices and . represents the amount of data in the -th machine that fall in the -th interval. The matrix is the communication cost matrix of the WMPC model. The output is to assign the intervals to the machines, such that the communication cost is minimized. By applying the four communication cost functions introduced in Section 1.1.2, we get four problems denoted as DRP-TOTAL, DRP-BTNK, DRP-MSR and DRP-SSR, respectively. The four problems are studied in Section 2.
1.1.4 The Data Allocation Problem
In the above case of parallel sorting, it is assumed that the splitters are known in advance. However, how to select the splitters to minimize the communication cost is also an important research problem [59], and even a new problem under the WMPC model. For a formal description, let be the total number of data records to be sorted, and be the number of machines. Under the assumption that the initial data distribution is known in advance, let , , which is the data initially residing in machine . is the number of data records in machine , and . If the splitters are chosen as , they will form intervals , where and . Let , which is the number of data records in machine that falls into the -th interval . To minimize the communication cost, the problem is to select splitters which split the data into intervals, then find an assignment from the intervals to the machines, such that the communication cost is minimized. This problem is called Data Allocation Problem (DAP).
Remark. Although DRP and DAP are introduced based on sorting, they can be defined using the idea of virtual machines and physical machines. For DRP, the input can be considered as the amount of data initially residing in physical machine to be processed by virtual machine , and the output is a permutation which assigns virtual machines to physical machines so that the communication cost is minimized. For DAP, choosing the splitters can be regarded as deciding the data distribution across the virtual machines, since each virtual machine is responsible to collect the data in one interval. In such a point of view, DRP and DAP can be applied to a wide range of concrete problems. Also, DRP and DAP reflect only the problems that can be solved using one synchronous round. It will the future work to study multi-round algorithms on WMPC model.
1.2 Summary of results and paper organization
Summarizing the above descriptions, we have two kinds of problems including DRP and DAP. We also have four kinds of communication cost functions including TOTAL, BTNK, MSR and SSR. 8 problems are obtained by combining two kinds of problems with four kinds of cost functions. The hardness for the 8 problems make up the content of this paper. Table 1 summarizes all the proposed results. Some less important results and some detailed proofs are moved to appendix due to space limitation.
In the rest of this paper, we first introduce some denotations that will be used throughout this paper in Section 1.3, then present the results in Section 2 and 3 following the order given in Table 1. The related work and future work are delayed to Section 4 and 5. Section 6 concludes this paper.
1.3 Denotations
A matrix is denoted as . The element in at row and column is denoted as . The set of consecutive integers is denoted as . The set of integers is denoted as .
A permutation on is a one-to-one mapping from to , and it is usually denoted as . The set of all permutations on is denoted as . Denote as the image of under . If , it is also said that is assigned to by permutation . We also use to denote the inverse permutation of , i.e., if then .
2 The Data Redistribution Problem Series
Definition 1 (DRP)
Input: A transmission matrix and a communication cost matrix , where for .
Output: find a permutation such that the communication cost function chosen from TOTAL, BTNK, MSR and SSR is minimized. Formally,
DRP-TOTAL:
[TABLE]
DRP-BTNK:
[TABLE]
DRP-MSR:
[TABLE]
DRP-SSR:
[TABLE]
2.1 DRP-TOTAL
Theorem 2.1
DRP-TOTAL is equivalent to the Linear Assignment Problem (LAP) [5].
Proof
The 0-1 integral linear programming (LP) formation of DRP-TOTAL is
[TABLE]
We have
[TABLE]
Now define another matrix as , , and the objective function in Equation 1 is transformed into , which is equivalent to the linear programming formation of Linear Assignment Problem. β
Corollary 1
DRP-TOTAL can be solved in time.
Proof
Transforming DRP-TOTAL to LAP needs time. The Hungarian algorithm [48] to solve LAP needs time. Then the corollary follows. β
2.2 DRP-BTNK
Theorem 2.2
DRP-BTNK is equivalent to the Linear Bottleneck Asssignment Problem (LBAP) [13].
Proof
The 0-1 integral LP formation of DRP-BTNK is
[TABLE]
Define another matrix as , and the objective function in Equation 2 becomes , which is equivalent to the LP formation of LBAP. β
Corollary 2
DRP-BTNK can be solved in time.
Proof
Transforming DRP-BTNK to LBAP needs time. The algorithm to solve LBAP needs time [54]. Then the corollary follows. β
2.3 DRP-MSR
Theorem 2.3
DRP-MSR is NP-complete.
Proof
We reduce the following PARTITION problem to DRP-MSR, which is well known to be NP-complete.
Input: a set of integers , where .
Output: decide whether there exists a partition of s.t. , and . Such partition is called perfect.
For any instance of PARTITION, construct an instance of DRP-MSR. The two matrices and as the input of DRP-MSR are of size . Let
[TABLE]
Let be a sufficiently large integer (it suffices to set ), and
[TABLE]
Since the communication cost between machine 1 and 2 are set to which is sufficiently large, and , for , then if some is assigned to machine 1 or 2 by permutation , e.g., , there will be , which incurs a sufficiently large communication cost between machine 1 and 2. Thus, the integers in must be assigned to some . Notice that machine 1 connects only to machines , and machine 2 connects only to machines , with communication cost set to 1. Then if some is assigned to some , it incurs send cost to machine 1, and 0 send cost to machine 2. On the other hand, if some is assigned to some , it incurs 0 send cost to machine 1, and send cost to machine 2. This indeed reflects the cost of the partition.
Actually, for any permutation where for , the send cost of machine 1 under is the sum of the elements in the set , and the send cost of machine 2 is the sum of the elements in the set . It can be easily verified that the receive cost of other machines are relatively small, since each of the machines in only receives one element. Furthermore, the receive cost of machine 1 and 2, and the send cost of the other machines are all 0. Denote , and forms a partition of . In such way, the permutation corresponds to the partition . Finally, the MSR cost of the permutation , denoted as , is obtained by taking maximum over the send cost of machine 1 and 2, i.e., .
By the above discussion, we can claim that above construction is indeed a reduction from PARTITION to DRP-MSR. If the PARTITION problem admits a perfect partition, then the optimum value of the constructed DRP-MSR instance is exactly . If there is no perfect partition for the PARTITION problem, then the optimal value of the constructed DRP-MSR instance must be greater than . Then the NP-completeness is proved by observing DRP-MSR is truly in NP. β
2.4 DRP-SSR
Theorem 2.4
DRP-SSR is NP-complete.
Proof
The NP-completeness proof for DRP-SSR can be essentially the same with DRP-MSR. In the DRP-MSR instance constructed in the proof of Theorem 2.3, for any machine it holds that if then , and if then . Thus, the MSR and SSR cost value is the same for that instance. An alternate proof is to reduce 3-PARTITION to DRP-SSR. The 3-PARTITION problem is defined as follows.
Input: positive integer numbers such that , where is a positive integer.
Output: determine whether there exists a partition of the numbers, such that and for all .
3-PARTITION is known to be strongly NP-complete [24], i.e., there is no algorithm whose complexity is bounded by a polynomial of and that can solve it, unless P=NP.
For any instance of 3-PARTITION, construct an instance of DRP-SSR with as follows.
Let for . Let for . By doing so we have machines that all have all the data, and machines that have no data.
Let be as follows.
[TABLE]
By doing so the machines are grouped into groups each with 3 machines. The 3 machines in each group is connected into a triangle.
For any partition of the numbers into subsets , , assign the three numbers to the three machines which are in the -th group. Assuming , the assignment must assure that is assigned to machine which initially have no data. and can be assigned to machine arbitrarily. Assume w.l.o.g. that is assigned to and is assigned to . It can be verified that this assignment can achieve the smallest possible value of the SSR objective function.
Under this assignment, denote and to be the send and receive cost of machine . For , we get
[TABLE]
Since , we have . Taking maximum over , we get the maximum value of cost in the -th group as . The final cost of DRP-SSR under this assignment is obtained by taking maximum over .
If there exists a solution of 3-PARTITION, then the above constructed instance of DRP-SSR has an optimum cost of , since for each subset we have .
If there is no solution of 3-PARTITION, then for the above constructed instance of DRP-SSR, the objective value of SSR cost of any assignment must be greater than . The reason is that there must exist some such that .
The description of the reduction from 3-PARTITION to DRP-SSR is completed, and the NP-completeness of DRP-SSR is proved. β
3 The Problem Series of Data Allocation Problem
In this section we study the parameterized hardness and algorithms for the DAP-Cont problem series, parameterized by the number of machines. We will use to denote the size of the input, and to denote the number of machines.
Definition 2 (DAP-Cont)
Input: a set of integers divided into subsets , where is the number of machines, and is the size of satisfying .
Output: find integers and a permutation , such that the communication cost function chosen from TOTAL, BTNK, MSR and SSR is minimized. Formally,
DAP-TOTAL:
[TABLE]
DAP-BTNK:
[TABLE]
DAP-MSR:
[TABLE]
DAP-SSR:
[TABLE]
where and .
3.1 The splitter-graph
We introduce the splitter-graph, which transforms the problem of choosing splitters to choosing a path in a special graph. Given a set of integers, assuming , and a parameter , construct a graph as follows. For , construct a vertex . Let be the starting vertex , and be the end vertex . Let iff and . In such way, a vertex represents a splitter placed in the -th position, and a path represents selecting as splitters. From now on, a splitter-graph based on set with parameter will be denoted as . The discussions in the rest of this section will depend on the splitter-graph with different definition of edge weights.
3.2 FPT algorithm of DAP-TOTAL and DAP-BTNK
The FPT algorithm of DAP-TOTAL and DAP-BTNK is based on the following transformation. Note that the transformation can be done in polynomial time. Given an instance of DAP-TOTAL, denote and assume . Let and . Let . Slightly abusing denotation, let be a matrix defined based on the permutation , such that if , and otherwise, .
Under the above denotations, DAP-TOTAL can be transformed into the following form.
[TABLE]
where and . Let , , then the above equation is transformed into
[TABLE]
Let , , then Equation 4 is transformed into
[TABLE]
If is represented by a permutation, we get
[TABLE]
Now we can associate the above function to the spiltter-graph. For each edge in the splitter-graph and each , let . The function can be regarded as assigning weights to each edge, and each weight is associated with a label . Now, we have the following splitter-graph formation of DAP-TOTAL.
Definition 3
Input: a splitter-graph , the weight function of DAP-TOTAL.
Output: a path , and a permutation , such that the following cost is minimized
[TABLE]
According to the above transformation, Definition 3 is equivalent to the original definition of DAP-TOTAL.
3.2.1 FPT Algorithm for Decision-DAP-TOTAL
We prove the following decision version of DAP-TOTAL is FPT.
Definition 4 (Decision-DAP-TOTAL)
Input: a splitter-graph , the weight function of DAP-TOTAL, a threshold value , and parameter .
Output: Is the optimum value of DAP-TOTAL less than ?
We need the following definition of partial permutations. A partial permutation is a function defined on where , such that and for . Here is the image of under . Given a partial permutation whose definition domain is , and an integer , let denote that there exists some such that . Given an integer , let be a new partial permutation defined on such that and for . Given an integer , let be a partial permutation defined on , such that for all . Denote as the empty partial permutation.
Algorithm 1 is the FPT algorithm for Decision-DAP-TOTAL. The algorithm maintains two arrays of length for each vertex , namely Perm() and Cost(). Perm () stores all the feasible partial permutations for the path from to , and Cost() stores the partial accumulated cost value corresponding to the partial permutation .
Theorem 3.1
*At the end of the -th iteration of the outer-most -loop in Algorithm 1, , it holds that
(1) each is a partial permutation, and
(2) iff there exists a path such that .*
Proof
(1) Straightforward. The condition in Line 1 of Algorithm 1 ensures that if then will not be added into , which ensures that each is a partial permutation.
(2) The proof proceeds by induction on . As start point of induction where , the path is from to , i.e., a single vertex. Then the claim trivially holds.
Suppose the claim holds at the end of the -th iteration, and consider the -th iteration. According to Line 9 and 10 in Algorithm 1, if and only if there is an edge and an integer such that . By induction hypothesis, the partial permutation if and only if there exists a path from to , , such that
[TABLE]
Now adding the edge to the path, and adding to the partial assignment , we obtain a path from to and a partial assignment such that
[TABLE]
By induction, the claim is proved. β
According to Theorem 3.1, by applying the induction to the vertex , it holds that if and only if there exists a path , such that , where is a permutation in . By the splitter-graph formation of Decision-DAP-TOTAL (Definition 3), it is equivalent to that the optimum value of DAP-TOTAL is less than . This completes the correctness proof of Algorithm 1.
Theorem 3.2
Decision-DAP-TOTAL is FPT.
Proof
This theorem is true since Algorithm 1 solves Decision-DAP-TOTAL in time. Note that here the number of machines is regarded as the parameter, and this complexity is polynomial in the input size . β
3.2.2 FPT Algorithm for DAP-BTNK
Using a transformation similar with that for DAP-TOTAL, we have the following splitter-graph formation for DAP-BTNK.
Definition 5
Input: a splitter-graph , the weight function of DAP-BTNK.
Output: a path , and a permutation , such that the following cost is minimized
[TABLE]
The decision version of DAP-BTNK has an extra value as input, and asks whether the optimum value of DAP-BTNK is less than . We first propose the FPT algorithm for the decision version, which is given as Algorithm 2. It needs one array for each vertex which is Perm(). The algorithm is similar with that for Decision-DAP-TOTAL, only changing the sum-check (Line 1 in Algorithm 1) to maximum check (Line 2 in Algorithm 2). Thus the correctness proof of this algorithm is similar with Theorem 3.1, and it is omitted.
Theorem 3.3
DAP-BTNK is FPT.
Proof
We can use the algorithm for Decision-DAP-BTNK to solve DAP-BTNK. The idea is similar with the two-phase algorithm for Linear Bottleneck Assignment Problem given in Section 2.4. In the first phase the algorithm chooses some possible value from the input weight function . In the second phase Algorithm 2 is invoked by setting as the selected weight value. Since the number of possible values of the input weight function is at most , then Algorithm 2 is invoked for at most times. Thus, the two-phase algorithm for DAP-BTNK is still FPT. β
3.3 W[1]-completeness of DAP-MSR and DAP-SSR
We first prove that the two problems are in W[1].
Theorem 3.4
DAP-MSR and DAP-SSR are in W[1].
Proof
The proof is based on the machine characterization of the W[1] class proposed in [16]. The following definition and theorem must be cited from [16] to support this proof.
Definition 6 (W-program, [16])
A nondeterministic RAM program is a W-program, if there is a function and a polynomial such that for every input with , the program on every run
(1) performs at most steps;
(2) at most steps are nondeterministic;
(3) at most the first registers are used;
(4) at every point of the computation the registers contain numbers ;
Theorem 3.5 ([16])
Let be a parameterized problem. Then W[1] if and only if, there is a computable function and a W-program deciding such that for every run of all nondeterministic steps are among the last steps of the computation, where is the parameter.
First we should note that if term (4) in Definition 6 is to be satisfied, the elements in the communication cost matrix, and the edge weights in the splitter-graph, should be bounded by . Under this constraint, we describe the W-program for DAP-MSR which is quite simple.
(1) Transform the input to splitter-graph formation.
(2) Non-deterministically guess a path with length . Note that by the definition in [16], the nondeterministic machine can guess an integer in one nondeterministic step, rather than guess a single bit. In such way the nondeterministic machine only need to perform nondeterministic steps to guess the path.
(3) Enumerate all the permutations, and find the optimal permutation with the smallest MSR cost value.
It is obvious that the above program (1) performs at most steps, (2) perform only nondeterministic guess steps, (3) uses registers to record the input, registers to record the selected path, registers to record the enumerated permutation, and constant registers to conduct extra numeric operations. Furthermore, after the path is guessed, the subsequent computation needs time which satisfies the condition given in Theorem 3.5. β
We then prove the W[1]-hardness of the two problems. With an idea similar with that in Section 3.2, we first transform DAP-MSR and DAP-SSR into a splitter-graph formation. We only describe the transformation for DAP-MSR, and it is similar for the other. Using the same denotations used in Equation 3, we have the following equivalent form for DAP-MSR.
[TABLE]
Let be a matrix where each element is a vector of length , and let , , then Equation 7 is transformed into
[TABLE]
Next we give the following splitter-graph formation of DAP-MSR. For each edge in the splitter-graph, let , i.e., each edge is associated with a vector of length . In such way, each path from to corresponds to vectors of length , and can form a matrix . It remains to solve the DRP problem, taking this matrix and the communication cost matrix as input. The formal definition is given as follows.
Definition 7
Input: a splitter-graph , the edge weight function of DAP-MSR, the communication cost matrix .
Output: a path , which corresponds a matrix where , and a permutation , such that the following MSR cost function is minimized:
[TABLE]
To prove the W[1]-hardness of the problem, we introduce an intermediate problem called Selecting-PARTITION. The idea is to reduce -clique, which is W[1]-complete, to Selecting-PARTITION, and reduce Selecting-PARTITION to DAP-MSR (and similarly to DAP-SSR).
Definition 8 (Selecting-PARTITION)
Input: integers , target sum value , and parameter .
Output: decide whether there exists a set with , such that is a Yes-instance of PARTITION, i.e., there exists such that and .
Lemma 1
There is a parameterized reduction from -clique to Selecting-PARTITION.
Proof
Given an instance of -clique, construct an instance of Selecting-PARTITION with parameter as follows. Assume w.l.o.g. that , and each vertex in is labeled by an integer . Choose a prime number with . Initially let . For each vertex , add the integer into S. For each edge , add into S. The construction can be done in time.
For ease of reference let }, and let be the set of integers associated with the edges, where each integer is of the form . It can be verified that .
If there exists a -clique in where the vertexes are , then the vertexes and edges in the clique correspond with two sets of integers and . It is easy to see that . The set is of size .
If there exists a set where that is a Yes-instance of PARTITION, there are two cases. The first case is that there is at least one integer in that is from , say, for some . Then there must be integers from which correspond to edges incident to the vertex . The edges then lead to different vertexes, and that is totally vertexes plus vertex . They must form a -clique, otherwise the corresponding integers will not be a Yes-instance of PARTITION. The second case is that all the integers are from . In this case the integers corresponds to edges which form a -clique, and the -clique must contain a -clique. β
Lemma 2
There is a parameterized reduction from Selecting-PARTITION to DAP-MSR and DAP-SSR.
Proof
We only describe the reduction to DAP-MSR, and the proof is similar for DAP-SSR. Given an instance of Selecting-PARTITION, construct a splitter-graph formation of DAP-MSR as follows. First construct a splitter-graph . For each edge in the splitter-graph, let
[TABLE]
where the length of the vectors are .
By such construction, each path corresponds to the following matrix
[TABLE]
Let the communication cost matrix be as follows.
[TABLE]
Thus, any selected integers from correspond to two matrices and . We can see that the structure of the two matrices and are the same with that used in the proof of NP-completeness of DRP-MSR. By the reduction from PARTITION to DRP-MSR, the integers forms a Yes-instance to PARTITION if and only if the corresponding matrices and form an instance of DRP-MSR whose optimum cost value is . We conclude the proof by observing the above reduction takes time, and the parameter of the constructed DAP-MSR instance is , which satisfies the definition of parameterized reduction. β
4 Related Works
4.1 Parallel computation models
This paper is based on the topology-aware MPC model [11, 32], which is almost new. The history of modeling parallel computation is long though, and there exist a lot of models such as PRAM, LogP, CONGEST and so on. Based on our observation, there are three important aspects for parallel computation, which are local computation, communication and data exchange between memory hierarchies. A good model for parallel computation should consider at least one aspect in detail, and neglect the other aspects if necessary. We categorize the existing models into the following three classes according to which aspect of parallel computation that the model emphasize. An important note is that there is no research work that consider all the three aspects in a single model by now.
4.1.1 Models that emphasize communication
Most models consider communication as the most important aspect of parallel computation. Some of these models even totally neglect the local computation, assuming any local computation can be done in a single unit of time. There indeed exist many important contents that is worth considering for communication, such as network topology, synchronization, latency, link capacity, routing, and so on.
A lot of early models are dedicated to model the network topology, such as hypercube network [47] and switch network [37]. See [22] for a survey. PRAM [35] uses a shared memory to exchange message between processors, and the strategy of exclusive/non-exclusive read/write leads to different variants of PRAM, such CREW (concurrent read, exclusive write), EREW, CRCW and so on. The processors are assumed to have arbitrarily strong computational power. PRAM is the most acknowledged model for theoretical researches in parallel computation. See [29] for a survey on PRAM simulation methods. BSP [60] model emphasizes synchronization, where the computation is divided into synchronized supersteps. LogP [19] emphasizes the cost of message exchanging, where stands for latency, stands for overhead, and stands for gap. CONGEST [52] emphasizes the network topology and link capacity, which is mostly used to study graph problems [4, 14, 28]. The communication network in CONGEST has the same topology with the graph problem which it considers, and the length of the messages between any two processors in any round is restricted to where is the number of processors. The Congested Clique model is an variant of CONGEST, where the communication network is a complete graph. Many graph problems were studied on Congested Clique model such as MST [27], maximum matching [23], shortest path [31], etc. There are many theoretical research works on CONGEST and Congested Clique model, and please refer to the references in the papers cited here for more related works. The MPC model [33] is abstracted from MapReduce [20], which can also be regarded as a simplified version of BSP. Finally, the topology aware MPC model [11, 32] is based on the MPC model and involves the consideration for network topology, which directly inspires this work.
4.1.2 Models that emphasize local computation
As we have mentioned, many models neglect the local computation and focus on communication, but it does not reduce the importance of local computation in parallel computation. If the communication cost is the only consideration, it may cause the problem of workload imbalance. Actually balancing the workload is an important aspect in the research of parallel computation which is considered in a lot of research papers from 1990s until today [18, 43, 58].
To the extent of our knowledge, the only model that emphasizes the local computation is the MapReduce model in the original form [20]. The MapReduce model defines the local computation as a Map function and a Reduce function. The Map function transfers the data into key-value pairs. The Reduce function conduct the pre-defined computation on the set of key-value pairs with the same key. The communication in MapReduce model is implicit and automatically excecuted by the underlying framework, which gathers all the key-value pairs with the same key to the same machine. Therefore, the main focus of MapReduce model is how to design the Map and Reduce functions, which are both local computation functions.
After the MapReduce model is abstracted into the MPC model [33, 42], the consideration of local load is still a main focus in many research works [59], especially in the study of the join operation in database theory [9, 10, 38, 39, 40].
4.1.3 Models that emphasize memory hierarchy
The memory hierarchy of modern processors includes CPU cache, main memory, external memory and remote storage center. While the idea of memory hierarchy is first established for single processor computation [3, 6], it is of equal importance in parallel computation [44, 55, 62]. The initial motivation of parallel computation is to deal with the problems with size exceeding the power of a single machine. As the amount of data to be processed grows larger and larger, there must be the case that the data size is still too large for each processor even if hundreds of processors are used. Thus, it is necessary to consider the external memory and input/output complexity in parallel computation. However, as far as we know, most recent researches on MPC and Congest model assume that the data can be split fine enough such that the data for each processor can fit in the local memory, which is not the case in realistic environments dealing with massive data. We think there is a lot of research opportunities in this direction.
4.2 Assignment problem
The DRP and DAP are closely related to the assignment problem. As we have shown, DRP-TOTAL is equivalent to Linear Assignment Problem (LAP), DRP-BTNK is equivalent to Linear Bottleneck Assignment Problem (LBAP), and the solution to DAP-BTNK highly relies on the solution to LBAP. We refer the readers for [13] to a good survery of linear assignment problem series.
The Quadratic Assignment Problem (QAP) [45] is harder than LAP which is NP-complete. The permutation formation of QAP is , where and are two input matrices. QAP can be regarded as assigning virtual machines to physical machines, where the transmission matrix is defined between virtual machines, and the communication cost matrix is defined between physical machines. Comparing with the definition of DRP and DAP, where the transmission matrix is defined between physical machine and virtual machines, it can be noticed that DRP and DAP has a structure more complex than LAP but simpler than QAP. Besides the equivalence of DRP-TOTAL and LAP as well as DRP-BTNK and LBAP, another interesting fact is that the 3-PARTITION problem is used in the reduction to prove the NP-completeness DRP-SSR, while in [56] the 3-PARTITION problem is used to prove that QAP is unapproximable, i.e., there is no polynomial time -approximate algorithm for QAP where , unless P=NP.
It is worth mentioning that the ROBOT problem defined in [8] has a similar structure with DRP. The input of ROBOT problem includes two functions and , where defines the relation between the physical locations and items, and defines the distance between physical locations. The and functions has a similar structure with the and matrix in DRP problem. The goal of ROBOT problem is to find a TSP with minimal length, where the TSP part makes it NP-complete.
4.3 Number partition problem
The NP-completeness proof for DRP-MSR is based on the PARTITION problem, which is often referred as the number partition problem (NPP) in existing literature. Recall that NPP is given a set of integers , and decide whether there exists a partition of s.t. and . NPP is one of Gary and Johnsonβs six basic NP-complete problems [30], and the hardness of this problem is well-known. The approximation for this problem often considers the discrepancy, which is . A famous polynomial time heuristic proposed by Karmarker and Karp [34] is called the differencing method, and can lead to discrepancy for some positive constant [61]. Another line of research consider to minimize the discrepancy on randomized data [12, 46, 49]. See [50] for a good survey on NPP. We note there are no existing work trying to design approximate algorithm for the value as far as we know.
4.4 Data redistribution
The DRP and DAP problem series emphasizes the importance of data redistribution which have been considered by a lot of former research works. The goal of data redistribution is to minimize the communication cost while satisfying the specific requirement on the data distribution. The works in [17, 53, 57] consider the data redistribution for join operation in database, in which the two papers [53, 57] partially inspire this work. Another work [41] considers the data redistribution in sensor networks. Knoop et al. [36] consider the Distribution Assignment Placement in a engineering point of view, where the abbreviation coincides with DAP in this paper.
5 Future works
Recall that the problem series of DRP and DAP are introduced using parallel sorting and join as the representing example. They reflect the communication pattern of problems that can be solved in one synchronous round, i.e., after one round of communication to redistribute the data, it suffices for all the machines to conduct local computation to finish the computation task. However, there are many problems that need multiple rounds to solve. For example, joining multiple relations can be solved using one round [10] or multiple rounds [1]. Computing the graph coloring [15], maximum matching [26], shortest path [21], etc., must use multiple rounds. The problem will be complicated to minimize the communication cost on WMPC model with multiple rounds. We have the two following observation for future works.
First, it may not be the optimal solution to solve DRP or DAP problem for each round, since the communication of each round is correlated. It also involves to decide a better initial distribution so that the communication cost of the subsequent parallel computation can be reduced. We call this problem the Data Pre-distribution Problem. Second, in this paper it is assumed that each pair of computation machine in WMPC model can communicate in a point-to-point manner, i.e., for all elements in the communication cost matrix . Actually this assumption is set to be compatible with the one round algorithm, i.e., the machines must be able to reach each other in one round. If multiple rounds are allowed, the assumption of point-to-point communication can be removed, i.e., there can be some element . There will be many interesting but complicated problems such as routing and congestion under the WMPC model, which are left as future work.
6 Conclusion
In this paper we proposed the WMPC (Weighted Massively Parallel Computation) model based on the existing works of topology-aware Massively Parallel Computation model [11, 32]. The WMPC model considers the underlying computation network as a complete weighted graph, which is a complement to the work in [32] where the network topology are restricted to trees. Based on the WMPC model the DRP and DAP problem series are defined, each representing a set of problems with the same pattern of communication. We also defined four kinds of objective functions for communication cost which are TOTAL, BTNK, MSR and SSR, and obtained 8 problems combining the four objective functions with two communication pattern problems. We studied the hardness for the 8 problems, and provided substantial theoretical results. In conclusion, this paper studied the communication minimization problem on WMPC model with a scope both deep and wide, but we must point out that the proposed results only investigated a small portion of the research area on the WMPC or topology-aware MPC model. There are a lot of meaningful problems to be studied following what was studied in this paper.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Foto N. Afrati, Manas R. Joglekar, Christopher M. Re, Semih Salihoglu, and Jeffrey D. Ullman. GYM: A multiround distributed join algorithm. Leibniz International Proceedings in Informatics, LIP Ics , 68:4:1β4:18, oct 2017.
- 2[2] Foto N. Afrati and Jeffrey D. Ullman. Optimizing joins in a map-reduce environment. In Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings , pages 99β110, New York, New York, USA, 2010. ACM Press.
- 3[3] Alok Aggarwal, Bowen Alpern, Ashok Chandra, and Marc Snir. A model for hierarchical memory. In Proceedings of the nineteenth annual ACM symposium on Theory of computing , pages 305β314, 1987.
- 4[4] Mohamad Ahmadi, Fabian Kuhn, and Rotem Oshman. Distributed approximate maximum matching in the CONGEST model. In Ulrich Schmid and Josef Widder, editors, 32nd International Symposium on Distributed Computing, DISC 2018, New Orleans, LA, USA, October 15-19, 2018 , volume 121 of LIP Ics , pages 6:1β6:17. Schloss Dagstuhl - Leibniz-Zentrum fΓΌr Informatik, 2018.
- 5[5] Mustafa AkgΓΌl. The linear assignment problem. In Combinatorial optimization , pages 85β122. Springer, 1992.
- 6[6] Bowen Alpern, Larry Carter, Ephraim Feig, and Ted Selker. The uniform memory hierarchy model of computation. Algorithmica , 12(2):72β109, 1994.
- 7[7] Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, and Grigory Yaroslavtsev. Parallel algorithms for geometric graph problems. In David B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014 , pages 574β583. ACM, 2014.
- 8[8] Andr as Frank, Eberhard Triesch, Bernhard Korte, and Jens Vygen. On the bipartite travelling salesman problem. Technical report, Citeseer.
