Instance and Output Optimal Parallel Algorithms for Acyclic Joins
Xiao Hu, Ke Yi

TL;DR
This paper develops new parallel algorithms for acyclic joins in the MPC model, achieving instance and output optimality, and establishes lower bounds demonstrating the complexity of triangle joins.
Contribution
It introduces a novel MPC algorithm for acyclic joins with improved load bounds and proves output-optimality for certain classes, extending the classical Yannakakis algorithm to a parallel setting.
Findings
New MPC algorithm with load O(IN/p + sqrt(IN*OUT)/p)
Achieves instance-optimality for r-hierarchical joins in MPC
Provides lower bounds for triangle join complexity in MPC
Abstract
Massively parallel join algorithms have received much attention in recent years, while most prior work has focused on worst-optimal algorithms. However, the worst-case optimality of these join algorithms relies on hard instances having very large output sizes, which rarely appear in practice. A stronger notion of optimality is {\em output-optimal}, which requires an algorithm to be optimal within the class of all instances sharing the same input and output size. An even stronger optimality is {\em instance-optimal}, i.e., the algorithm is optimal on every single instance, but this may not always be achievable. In the traditional RAM model of computation, the classical Yannakakis algorithm is instance-optimal on any acyclic join. But in the massively parallel computation (MPC) model, the situation becomes much more complicated. We first show that for the class of r-hierarchical joins,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Instance and Output Optimal Parallel Algorithms for Acyclic Joins
Xiao Hu Ke Yi
Hong Kong University of Science and Technology
{xhuam
yike}@cse.ust.hk
Abstract
Massively parallel join algorithms have received much attention in recent years, while most prior work has focused on worst-optimal algorithms. However, the worst-case optimality of these join algorithms relies on hard instances having very large output sizes, which rarely appear in practice. A stronger notion of optimality is output-optimal, which requires an algorithm to be optimal within the class of all instances sharing the same input and output size. An even stronger optimality is instance-optimal, i.e., the algorithm is optimal on every single instance, but this may not always be achievable.
In the traditional RAM model of computation, the classical Yannakakis algorithm is instance-optimal on any acyclic join. But in the massively parallel computation (MPC) model, the situation becomes much more complicated. We first show that for the class of r-hierarchical joins, instance-optimality can still be achieved in the MPC model. Then, we give a new MPC algorithm for an arbitrary acyclic join with load , where are the input and output sizes of the join, and is the number of servers in the MPC model. This improves the MPC version of the Yannakakis algorithm by an factor. Furthermore, we show that this is output-optimal when , for every acyclic but non-r-hierarchical join. Finally, we give the first output-sensitive lower bound for the triangle join in the MPC model, showing that it is inherently more difficult than acyclic joins.
1 Introduction
A (natural) join is defined as a hypergraph , where the vertices model the attributes and the hyperedges model the relations. Let be the domain of attribute . An instance of is a set of relations , where is a set of tuples, where each tuple is an assignment that assigns a value from to for every . We use to denote the size of . The join results of on , denoted as , consist of all combinations of tuples, one from each , such that they share common values on their common attributes. Let be the output size. We study the data complexity of join algorithms, i.e., we assume that the query size, namely and , are constants. In this paper, we focus on acyclic joins, i.e., when the hypergraph is acyclic (formal definition given later).
1.1 The model of computation
The problem gets much more interesting in the parallel setting. In this paper, we consider the massively parallel computation (MPC) model [2, 3, 7, 8, 22, 24, 26], which has become the standard model of computation for studying massively parallel algorithms, especially for join algorithms.
In the MPC model, data is initially distributed evenly over servers with each server holding tuples. Computation proceeds in rounds. In each round, each server first sends messages to other servers, receives messages from other servers, and then does some local computation. The complexity of the algorithm is measured by the number of rounds and the load, denoted as , which is the maximum message size received by any server in any round. A linear load is the ideal case (since the initial load is already ), while if , all problems can be solved trivially in one round by simply sending all data to one server. Initial efforts were mostly spent on what can be done in a single round of computation [3, 26, 7, 8, 24, 26], but recently, more interest has been given to multi-round (but still a constant) algorithms [2, 22, 24], since new main memory based systems, such as Spark and Flink, have much lower overhead per round than previous generations like Hadoop.
The MPC model can be considered as a simplified version of the BSP model [32], but it has enjoyed more popularity in recent years. This is mostly because the BSP model takes too many measures into consideration, such as communication costs, local computation time, memory consumption, etc. The MPC model unifies all these costs with one parameter , which makes the model much simpler. Meanwhile, although is defined as the maximum incoming message size of a server, it is also closely related with the local computation time and memory consumption, which are both increasing functions of . Thus, serves as a good surrogate of these other cost measures. This is also why the MPC model does not limit the outgoing message size of a server, which is less relevant to other costs.
All our algorithms work under the mild assumption where is any small constant. This assumption clearly holds on any reasonable values of and in practice; theoretically, this is the minimum requirement for the model to be able to compute some almost trivial functions, like the “or” of bits, in rounds. Our lower bounds hold under for some constant , which may depend on the particular lower bound construction.
We confine ourselves to tuple-based join algorithms, i.e., the tuples are atomic elements that must be processed and communicated in their entirety. The only way to create a tuple is by making a copy, from either the original tuple or one of its copies. We say that an MPC algorithm computes the join on instance if the following is achieved: For any join result where , , these tuples (or their copies) must all be present on the same server at some point. Then the server will call a zero-cost function to report the join result. Note that since we only consider constant-round algorithms, whether a server is allowed to keep the tuples it has received from previous rounds is irrelevant: if not, it can just keep sending all these tuples to itself over the rounds, increasing the load by a constant factor. All known join algorithms in the MPC model are tuple-based and obey these requirements. Our lower bounds are combinatorial in nature: we only count the number of tuples that must be communicated in order to emit all join results, while all other information can be communicated for free. The upper bounds include all messages, with a tuple and an integer of bits both counted as 1 unit of communication.
1.2 Instance and output optimality
In worst-case analysis, the entire space of instances is divided into classes by the input size , and the running time is measured on the worst instance in each class. For many important computational problems, this is too coarse-grained and cannot accurately characterize the performance of the algorithm. For the join problem, no algorithm can do better than time in the worst case, where is the fractional edge cover number of the hypergraph [33, 29]. This bound drastically overestimates the running time on most typical instances.
A more refined approach is parameterized analysis, which further subdivides the instance space into smaller classes by introducing more parameters that supposedly better characterize the difficulty of each class. For the join problem, the output size is a commonly used parameter, and each class of instances share the same input and output size. Let be the class of instances with input size and output size . Then the load of an MPC algorithm is thus a function of both and , defined as
[TABLE]
where denotes the load of on . Algorithm is output-optimal if
[TABLE]
for every algorithm .
Further subdividing the instance space leads to more refined analyses. In extreme case when each class contains just one instance, we obtain instance-optimal algorithms. More precisely, an algorithm is instance-optimal if
[TABLE]
for every instance and every algorithm . Note that by definition, an instance-optimal algorithm must be output-optimal, and an output-optimal algorithm must be worst-case optimal, but the reserve direction may not be true.
In the traditional RAM model of computation, the classical Yannakakis algorithm [34] can compute any acyclic join in time , which is both output-optimal and instance-optimal, because on any instance , any algorithm has to at least spend time to read all the inputs111To formally prove this claim, one will have to be more careful with the family of algorithms under consideration. In particular, if , then the algorithm may not have to do anything. One possible approach is to ask the algorithm to produce a certificate in addition to the join results [28]. We will not digress to this direction since this paper is only concerned about MPC algorithms. and time to enumerate the outputs. Thus, the two notions of optimality coincide (but both are stronger than worst-case optimality). Fundamentally, this is because the difficulty of any instance is precisely characterized by its input size and output size, and all instances in have exactly the same complexity .
1.3 Join algorithms in the MPC model
The situation becomes much more interesting in the MPC model. First, it has been observed that the Yannakakis algorithm can be easily implemented in the MPC model with a load of [2]222The bound stated in [2] is actually , because they used a sub-optimal binary join algorithm as the subroutine. Replacing it with the optimal binary join algorithm in [8, 18] yields the claimed bound, as observed in [25]., but this is not optimal. In particular, it is known that the binary join can be computed with load [8, 18]. This is optimal by the following simple lower bound argument: Each server can only produce join results in a constant number of rounds with the load limited to , so all the servers can produce at most join results. Thus, producing join results needs at least a load of . Meanwhile, since by definition, the bound is optimal. Note that this argument can be applied on a per-instance basis, which means that the load complexity of any instance is still precisely captured by and , and is both an instance-optimal and output-optimal bound.
However, when the join involves three relations, the situation becomes subtler, and we start to see a separation between the two notions of optimality, meaning that the load complexity of an instance may not depend only on and . Let us start with the simplest 3-relation join , i.e., computing the Cartesian product of 3 sets of tuples. Consider a particular class when . Suppose the 3 relations have sizes , respectively. Then consists of all instances with and . Consider the following two instances: (1) , applying the same argument above except that each server now can produce join results, i.e., , we have ; (2) if , then the problem boils down to computing the Cartesian product of two sets, which has a lower bound of . The reason why instance (2) has a higher lower bound than instance (1) is that it has a higher skew, which causes more difficulty for the MPC model. Note that this phenomenon does not exist in the RAM model, in which both instances (in fact all instances in ) have the same complexity of . Fundamentally, this is because the MPC model is all about locality: An MPC algorithm should strive to bring all related tuples to one server so as to produce as many join results as possible, while a higher skew reduces locality.
We can extend this argument to computing the Cartesian product of sets of sizes . Any algorithm computing the full Cartesian product obviously must also compute the Cartesian product of any subset of the sets, thus the load must be at least
[TABLE]
It has been shown that the HyperCube algorithm [3] incurs a load of on any instance [8]. Thus, it can be considered as an instance-optimal algorithm for computing Cartesian products, with an optimality ratio of .
The binary join and Cartesian products are the simplest joins. Then the obvious question is, do instance-optimal algorithms exist for larger classes of joins? If not, how about output-optimal algorithms? These are the main questions we wish to address in this paper.
1.4 Classification of acyclic joins
Before describing our results, we first define some sub-classes of acyclic joins.
Acyclic joins [9]. We use the common notion of acyclicity, which is also known as -acyclicity. A join is acyclic if there exists an undirected tree whose nodes are in one-to-one correspondence with the edges in such that for any vertex , all nodes containing form a connected subtree. Such a tree is called the join tree of . Note that the join tree may not be unique for a given .
Hierarchical joins [12]. A join is hierarchical if for every pair of vertices , there is , or , or , where is the set of hyperedges containing attribute . Thus, all attributes can be organized into a forest, such that is a descendant of iff . Hierarchical joins have been enjoyed nice properties in probabilistic databases [12, 13] and query answering under updates [10], but their role in the MPC model has not been studied so far.
r-hierarchical joins. We consider a slightly larger class of hierarchical joins. A reduce procedure on a hypergraph is to remove an edge if there exists another edge such that . We can repeatedly apply the reduce procedure until no more edge can be reduced, and the resulting hypergraph is said to be reduced. A join is r-hierarchical if its reduced join hypergraph is hierarchical. A hierarchical join must be r-hierarchical, but not vice versa. For example, the join is r-hierarchical but not hierarchical. On the other hand, an r-hierarchical join must be acyclic.
Tall-flat joins [26]. A join is tall-flat if one can order the attributes as such that (1) ; (2) for ; and (3) for . Obviously, a tall-flat join must be hierarchical.
The relationships of these joins are illustrated in Figure 1.
1.5 Our results
This paper gives an almost complete characterization of acyclic joins with respect to instance-optimality and output-optimality in the MPC model. Our results are summarized in Table 1, and we explain them below in more detail.
Instance-optimality
First, we extend the Cartesian product lower bound (1) to a general join . For any subset of relations , define
[TABLE]
i.e., the join results of relations in that are part of a full join result. Clearly, any algorithm computing must implicitly also compute for every . Because each join result in consists of tuples, one from each relation in , a server can emit at most join results of , so we must have . Thus, we obtain the following per-instance lower bound on the load:
[TABLE]
The BinHC algorithm [8] is a generalization of the HyperCube algorithm to general joins. The load of the BinHC algorithm is parameterized by the degrees of all subsets of attribute values (more detail given in Section 3). Beame et al. [8] show that BinHC is optimal (up to polylog factors) within the class of instances sharing the same degrees, among all one-round MPC algorithms. In this paper, we strengthen this result by giving a new analysis of the BinHC algorithm, showing that it is actually instance-optimal (up to polylog factors) for (1) all tall-flat joins, and (2) all r-hierarchical joins provided that the instance does not contain dangling tuples (a dangling tuple is one that does not appear in the join results). Furthermore, because the per-instance lower bound (2) also holds for multi-round algorithms, these instance-optimality results extend to multi-round algorithms as well. For r-hierarchical joins with dangling tuples, one-round algorithms cannot achieve load, but we can remove the dangling tuples in rounds with load [34], and then run then BinHC algorithm. This gives a multi-round, -load algorithm, where the exponent depends on the query size, and is at least , the number of relations. Then we give a new multi-round algorithm for r-hierarchical joins with load , i.e., improving the instance-optimality ratio from to .
The instance-optimal load is not achievable beyond r-hierarchical joins333But instance-optimal algorithms are still possible, if some higher per-instance lower bound can be derived.. More precisely, we show that for every acyclic join that is not r-hierarchical, there is an instance with but any multi-round algorithm must incur a load of444The and notation suppresses polylog factors. on . This is actually a corollary following our output-sensitive lower bound, which is described next.
Output-optimality
One-round algorithms have severe limitations with respect to : As shown in [26], any non-tall-flat joins must incur load if only one round is allowed. On the other hand, as mentioned, the classical Yannakakis algorithm is a multi-round MPC algorithm that works for all acyclic joins and has a load of [2, 25]. Thus, our focus will be on multi-round algorithms and see if this result can be improved. An instance-optimal algorithm must also be output-optimal, so we have automatically obtained output-optimal algorithms for r-hierarchical joins. In fact, we show that for all r-hierarchical joins, so this is already an asymptotic improvement over the Yannakakis algorithm. But the more important question is, how about acyclic joins that are not r-hierarchical?
Our main output-optimal result is a new MPC algorithm for acyclic joins achieving a load of , which is an -factor improvement from the Yannakakis algorithm. Interestingly enough, we observe that while the join order does not change the running time of the Yannakakis algorithm by more than a constant factor in the RAM model, it does have asymptotic consequences in the MPC model. However, there are instances on which no join order is good, in which case we recursively decompose the join into multiple parts, and choose a good join order for each part. The number of parts is exponential in the query size but constant in terms of data size. To achieve this result, we first give a simple algorithm on the line-3 join (Section 4), and then extend it to arbitrary acyclic joins (Section 5).
We also give a matching lower bound (up to a log factor), thereby establishing the output-optimality of the algorithm. However, the lower bound only holds when . This restriction on is actually inherent, because the bound cannot be optimal for all values of . When is large enough, a worst-case optimal algorithm will take over. For example, on the line-3 join, the worst-case optimal algorithm, which has load [24, 19], becomes better when . Our lower bound actually indicates that the bound is output-optimal (though it does not depend on ) for all . Thus, we now have a complete understanding of the line-3 join with respect to output-optimality. For more complicated joins, their worst-case optimal algorithms have a higher load, and the output-optimality for values in the middle is still unclear.
Next, we extend these results to join-aggregate (including join-project) queries that are free-connex (formal definition given in Section 6). More precisely, we give an MPC algorithm with linear load that removes all the non-output attributes of the query, converting it into an acyclic join. Then we apply our instance-optimal or output-optimal algorithm on the resulting acyclic join.
Finally in Section 7, we turn to the triangle join , which is the simplest cyclic join, and give the first output-sensitive lower bound in the MPC model. Previously, only a worst-case bound of is known [24, 30] and that construction uses an instance with the maximum possible output size . Note that the second term in the lower bound is smaller as long as , which means that under this parameter range, the -load algorithm [24] is not only worst-case optimal but also output-optimal. For , the lower bound becomes while we do not have a matching upper bound yet (some explanation on why this is difficult is given below). But at least, this shows a separation from acyclic joins, i.e., cyclic joins are harder than acyclic ones by at least a factor of .
1.6 Other related results
Most existing work on join algorithms in the MPC model has focused on the worst case. Here, the goal is to achieve a load of , where is the fractional edge cover number of the hypergraph . So far, this bound has been achieved on Berge-acyclic joins555A sub-class of -acyclic joins. [19], joins where each relation has two attributes (i.e., is an ordinary graph) [22], and LW joins [24]666The LW join algorithm presented in [24] has a mistake, but it can be fixed, although non-trivially.. Whether this bound can be achieved for arbitrary joins, or even just -acyclic joins, is still open. Assuming this is achievable, our output-sensitive algorithm is still better when .
Joglekar et al. [20] described a multi-round MPC algorithm for arbitrary joins, whose load complexity depends on , as well as the degrees of the values. However, the load of their algorithm is at least , i.e., no better than the Yannakakis algorithm on acyclic joins.
In the RAM model, output-sensitive join algorithms have been extensively studied. The running time of most algorithms is in form of , where is certain notion of width of the hypergraph [15, 17, 27, 23]. However, it is not clear if this is optimal. Even for the triangle join, it is not known what the output-optimal bound is. For the triangle join, any notion of width has , thus these algorithms are no better than the worst-case optimal algorithm, which has running time . Recently, an improved triangle algorithm has been developed with a running time of [11], which is better than the worst-case optimal algorithm when . On the lower bound side, it is known that when , at least time is needed, assuming the 3SUM conjecture [31]. Thus, output-optimal algorithms for cyclic joins still remain a wide open problem.
2 MPC Primitives
Assume where is any small constant. We first introduce the following primitives in the MPC model, all of which can be computed with linear load in rounds.
Multi-numbering [18]: Given (key, value) pairs, for each key, assigns consecutive numbers to all the pairs with the same key.
Sum-by-key [18]: Given (key, value) pairs, compute the sum of values for each key, where the sum is defined by any associative operator.
Multi-search [18]: Given elements as set and elements as set , where all elements are drawn from an ordered domain. Set . For each , find its predecessor in , i.e., the largest element in but smaller than .
Semi-Join: Given two relations and with a common attribute , the semijoin returns all the tuples in whose value on matches that of at least one tuple in . This can be reduced to a multi-search problem: For each , if its predecessor on the attribute in is the same as that of , then it is in the semijoin.
Note that we can remove all dangling tuples in an acyclic-join [34] by a constant number of semi-joins, so it can be done in rounds with linear load.
Parallel-packing: Given numbers where for , group them into sets such that for all , and for all but one . Initially, the numbers are distributed arbitrarily across all servers, and the algorithm should produce all pairs if when done. Note that .
We are not aware of an explicit reference on this primitive, but it can be solved quite easily. Assume the input data is distributed across servers. We ask each server to first perform grouping on its local data. It is obvious that the condition above can be satisfied. The server then reports two numbers: , the number of groups with sum between and , and , the sum of remaining group with sum smaller than . Note that and can be [math]. Next, we run the BSP algorithm for prefix-sums [14] on the ’s. After that, we can assign consecutive group id’s to each of the groups on each server . For the remaining partial groups whose sums are with , we recursively run the algorithm, using group id’s starting from . After the recursion returns, for each partial group that has been assigned to group , we assign every element in to group . The problem size reduces by a factor of after each round, so the number of rounds is .
Server allocation [18]: Assume each tuple has a subproblem id , which identifies the subproblem it belongs to (the ’s do not have to be consecutive), and , which is the number of servers allocated to subproblem . The goal is to attach to each tuple a range , such that the ranges of different subproblems are disjoint and . Thus, each tuple knows which servers have been allocated to the subproblem to which belongs.
Computing the output size of an acyclic join: This primitive is a special case of our join-aggregate algorithm, which will be described in Section 6.
3 r-Hierarchical Joins
Recall that in a hierarchical join, all attributes can be organized into a forest, such that is a descendant of if and only if . Each corresponds to a node in the forest, such that contains precisely and all its ancestors. A subclass of hierarchical joins are tall-flat joins. For a tall-flat join, this attribute forest takes the form of a special tree, which consists of a single “stem” plus a number of leaves at the bottom. For example, {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathcal{Q}_{1}}}=R_{1}(x_{1})\Join R_{2}(x_{1},x_{2})\Join R_{3}(x_{1},x_{2},x_{3})\Join R_{4}(x_{1},x_{2},x_{3},x_{4})\Join R_{5}(x_{1},x_{2},x_{3},x_{5})\Join R_{6}(x_{1},x_{2},x_{3},x_{6}) is a tall-flat join; {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathcal{Q}_{2}}}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{1},x_{3},x_{4})\Join R_{3}(x_{1},x_{3},x_{5}) is a hierarchical join (but not tall-flat). Their attribute forests (actually, trees for these two cases) are shown in Figure 2.
In this section, we study r-hierarchical joins. A join is r-hierarchical if its reduced join is hierarchical. For example, {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathcal{Q}_{2}}}\Join R_{4}(x_{3},x_{5})\Join R_{5}(x_{5}) is an r-hierarchical join (but not hierarchical). After an r-hierarchical join is reduced, its hyperedges must correspond to the leaves of the attribute forest.
3.1 BinHC algorithm revisited
We mentioned above that the HyperCube algorithm [3] is an instance-optimal algorithm for computing Cartesian products. The BinHC algorithm [8] is a generalization of the HyperCube algorithm to general joins. For a join , denote the residual query by removing attributes as . Let be any fractional edge packing of that saturates the attributes , i.e., for every , and for every . Assuming knowing all degree information in advance, this algorithm computes on instance in a single round with a load of , where
[TABLE]
Here we define . Note that for any , is either [math] or , so we can just set for each such in the definition above.
Theorem 1**.**
On any tall-flat join and any instance , .
Proof.
Below, we write to avoid notational clutter. For an attribute set and a fractional edge packing of , define
[TABLE]
To show , it suffices to show that for all and .
Recall that in a tall-flat join, all attributes can be ordered as such that (1) ; (2) for ; (3) for . Consider an attribute set under the following two cases.
Case (1): . Consider any edge packing of that saturates (in this case, we actually only need the fact for all ). As observed, we can eliminate any assignment if there exists an edge such that , so it suffices to consider the remaining assignments such that . Then, we can bound as
[TABLE]
Case (2): There exists an . Let be the smallest such . Let be any edge packing of . In particular, we have . As observed earlier, we can set for any , so it suffices to consider the remaining edges. Due to the tall-flat property, all these edges contain . Thus,
[TABLE]
Combining the two cases, the theorem is proved. ∎
Theorem 2**.**
On any r-hierarchical join and instance without dangling tuples, .
Proof.
Let be the forest of attributes corresponding to . Consider an arbitrary attribute set . We say that a root-to-leaf path in , which corresponds to some , is stuck at the highest attribute on the path that is not included in . In this way, all edges in can be divided into disjoint groups , such that edges in one group share the common stuck attribute. Consider any fractional edge packing , we must have for each due to the packing constraint at the common stuck attribute of . Then, we can bound as
[TABLE]
The last inequality needs some explanation: Any such includes at most one edge from each . Thus, if two edges in share any common attribute, that attribute must be in (otherwise they must belong to the same ). Thus, for any , all tuples in join with each other, so we have
[TABLE]
Furthermore, since there are no dangling tuples, every join result in must be part of a full join result, so . ∎
Note that since is a per-instance lower bound even for multi-round algorithms, this means that the BinHC algorithm is instance-optimal even among all multi-round algorithms, up to polylogarithmic factors. This result also incorporates the instance-optimality of the HyperCube algorithm on Cartesian products, which are special r-hierarchical joins without dangling tuples.
Remark
Koutris and Suciu [26] show that non-tall-flat joins cannot be done with load by one-round algorithms. This does not contradict Theorem 2 since their lower bound construction uses dangling tuples. Our result implies that the key barrier for one-round algorithms is actually the dangling tuples. If they do not exist, one-round algorithms can go beyond tall-flat joins and solve r-hierarchical joins instance-optimally, up to polylog factors. On the other hand, once rounds are allowed, dangling tuples become irrelevant, since they can be removed with linear load and rounds.
3.2 An instance-optimal algorithm
We have shown that the BinHC algorithm is an instance-optimal algorithm for r-hierarchical joins, but it has an instance-optimality ratio of , where the exponent depends on the query size, and is at least , the number of relations. In this section, we improve the optimality ratio to , i.e., achieving a load of . Our algorithm uses rounds, but note that BinHC also needs rounds to remove the dangling tuples if they exist. Furthermore, our algorithm is deterministic while BinHC is randomized.
As a preprocessing step, we remove all dangling tuples. Then we reduce the join hypergraph, since if , will not affect the final join results after dangling tuples are removed777Strictly speaking, this violates the tuple-based requirement that when emitting a join result, all the participating tuples must be present. This can be easily fixed. Before removing , we attach each tuple to all tuples in that join with . This can be done by the multi-search primitive with linear load.. Thus, we are left with a hierarchical join on an instance with no dangling tuples.
Let be the attribute forest of . Recall that after the join is reduced, each relation corresponds to a leaf of , whose attributes are precisely the leaf’s ancestors in . Our algorithm is recursive. We will show that the load of this algorithm is for any hierarchical join on any instance . To simplify notation, we will not derive the exact constant in the big-Oh, which depends (exponentially) on the recursion depth. Since the recursion depth is proportional to (actually, twice) the height of , which is a constant, this is not a concern. Similarly, the number of servers employed by the algorithm will be , where the hidden constant may also depend on the recursion depth.
The base case is when has just one relation. In this case the algorithm just emits all tuples in the relation, achieving the bound trivially.
For a general hierarchical join and an instance , we proceed as follows. We first compute : We use servers to compute for each (recall that computing the output size of an acyclic join is an MPC primitive). This requires servers with load . Note that when there is no dangling tuples in , so we can compute as defined in (2). Setting , we will show below how to compute the join with servers and load .
Let be the number of trees in . We handle the following two cases using different recursive strategies:
Case (1):
In this case, is a tree. Suppose the root attribute of is , which is included in all the relations. Consider every , and let . It suffices to compute the residual query on each , but all the ’s have to be computed in parallel, using servers in total. Thus, the key is to allocate servers to these residual queries appropriately so as to ensure a uniform load of . To do so, we first compute , the input size of , for all . Since , and each tuple belongs to exactly one , this is a sum-by-key problem, i.e., each tuple with has key and weight . Note that .
An instance is heavy if and light otherwise. We handle heavy and light instances in different ways.
Case (1.1): Light instances
We use the parallel-packing primitive to put the light instances into groups with each group having total input size . Then we simply use one server to solve the instances in each group. The load of each server is .
Case (1.2): Heavy instances
By definition, there are at most heavy instances. For each heavy instance , we allocate servers to compute in parallel the join size for all and all . This uses servers, and the load is . Next, for each heavy instance , we allocate
[TABLE]
servers and compute recursively in parallel. The number of servers used is
[TABLE]
By the induction hypothesis, computing with servers has a load of (the big-Oh of)
[TABLE]
We bound each term of (3): For a heavy instance , there must exist at least one such that . Furthermore, since there are no dangling tuples, every tuple in must be part of a join result of , so . Taking , we have
[TABLE]
so . The second term of (3) is also bounded by simply by the definition of .
Case (2):
In this case, the join becomes a Cartesian product , where each is a join under Case (1). One would attempt to first compute each recursively, and then compute the Cartesian product, but this would not yield instance-optimality. Just consider an instance with and , where with {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{|\mathrm{dom}(B)|=1}},|R_{1}|=\mathrm{IN},|R_{2}|=p. On this instance, we have , but if we took a two-step approach, merely storing the intermediate result would incur a load of . This means that we have to interleave the two steps so as to avoid storing the intermediate results explicitly.
We arrange servers into a hypercube, where the dimensions will be determined later. We identify each server with coordinates , where . For every combination , the servers with coordinates form a group to compute (using the algorithm under Case (1)). Yes, each is computed times, which seems to be a lot of redundancy. However, as we shall see, there will be no redundancy in terms of the final join results, and it is exactly due to this redundancy that we avoid the shuffling of the intermediate result and achieve an optimal load. Consider a particular server . It participates in groups, one for each . For each , it emits a subset of its join results, denoted . Then the server emits the Cartesian product . Note that for each group of servers computing , the servers in the group emit with no redundancy, so there is no redundancy in emitting the Cartesian product.
It remains to show how to set so that and each server has a load of . To do so, we first compute , the input size of , in the same way as in Case (1). An instance is heavy if and light otherwise. For each heavy instance , we use servers to compute for all , where is the set of edges in . This requires servers with load . Then if is light, we set ; otherwise set
[TABLE]
Let . The number of servers used is
[TABLE]
Finally, consider the load of each server, which serves to compute each with a group of servers. For a light , and it imposes a load of . For a heavy , by the induction hypothesis, the load is (the big-Oh of)
[TABLE]
This can be bounded by using the same argument as Case (1.2). Summing over all increases the load by just a factor.
The induction proof thus completes and we obtain the following result.
Theorem 3**.**
On any r-hierarchical join query and any instance , there is an algorithm computing in rounds with load .
Since an instance-optimal algorithm is also output-optimal, we also obtain an output-optimal algorithm for r-hierarchical joins. In fact, we can derive a closed-form formula of the output-optimal bound, i.e., we bound
[TABLE]
as a function of and . First, observe that only depends to the reduced instance of , so we can assume that contains no dangling tuples. Then, we can rewrite as
[TABLE]
Consider a specific subset and an arbitrary instance . One trivial upper bound for is . The other bound is when the join degenerates to a Cartesian product. With these observations, we can bound the quantity above as:
[TABLE]
where denotes the integer .
Next, we show that this is tight, i.e., there exists an instance such that for one subset involving relations, there is and for another subset involving relations, there is . Our hard instance construction is based on the following property of acyclic joins (this is probably known, but we cannot find an explicit reference):
Lemma 1**.**
An acyclic join has integral edge cover number.
Proof.
For an acyclic hypergraph , denote the optimal edge covering of as . If there exist such that , then ; otherwise we can just shift the weight from to and obtain a better (at least not worse) edge covering. So the optimal edge cover of is equivalent to that of the residual query by removing . If there exists an attribute that appears only in edge , then . So the optimal edge cover of is equivalent to the edge and the optimal edge cover of the residual query by removing all attributes in . After recursively apply these two procedures, the query will become an empty set implied by the GYO reduction [1]. In this process, every edge chosen by has weight . ∎
Let be the optimal edge covering of . We identify two subsets of with edges respectively, denoted as , such that . Such two subsets can always be found since by the AGM bound [4]. We consider a hard instance constructed as below. Each edge is associated with at least one unique attribute denoted as . One of the unique attributes in for has distinct values in its domain while one of the unique attributes in for has distinct values in its domain. Remaining attributes have only one value in their domains. On this instance, there is and .
Theorem 4**.**
There is an algorithm that computes any r-hierarchical join in rounds with load , where . This bound is output-optimal.
Below we give a cleaner output-sensitive bound. This is not tight for , but easier to use. In particular, this result will be used in the analysis of the output-sensitive algorithm for arbitrary acyclic joins in Section 5.1.
Corollary 1**.**
There is an algorithm that computes any r-hierarchical join in rounds with load .
Proof.
When , we have and the load complexity is trivially. For , the term is always no larger than . The term is also no larger than as long as , which always holds when . ∎
4 Line-3 Join
The simplest acyclic but not r-hierarchical join is the line-3 join . In this section, we give an output-optimal MPC algorithm with load , together with a matching lower bound. In particular, the lower bound implies that instance-optimal algorithms are not possible for the line-3 join. In Section 5, we extend these results to arbitrary acyclic joins.
4.1 The Yannakakis algorithm revisited
The Yannakakis algorithm first removes all the dangling tuples, which is just a series of semi-joins and can be done with load . Then the algorithm performs pairwise joins in some arbitrary order. In the RAM model, the join order does not affect the asymptotic running time: After dangling tuples have been removed, any intermediate join result is part of a full join result, so the running time of the last join, which is , dominates that of any intermediate join. In fact, this argument applies on a per-instance basis, and the Yannakakis algorithm is instance-optimal on any instance with any join order.
Interestingly, the join order does matter in the MPC model. Consider the following instance of the line-3 join (see the top half of Figure 3). Attributes have domain sizes , respectively. Set , is a one-to-many relation from to , and . Note that this instance has and the output size is exactly . Consider first the join plan , and note that . Using the -load algorithm [8, 18] for binary joins, the load of computing is . However, since the output of the first join is the input of the second join, the input size for the second join is , so the load of the second join is . In general, the intermediate join result can be as large as , which is why the Yannakakis algorithm incurs a load of (after dangling tuples are removed) on an acyclic join, as observed in [2, 25].
Now consider the alternative plan . Note that , so the load of computing is , while the load of computing the second join is . Crucially, the reason why the second plan is better is that it has a smaller intermediate join size. Note that a smaller intermediate join size does not matter in the RAM model, where the total cost is always dominated by the last join. But it does matter in the MPC model, because of the load complexity of a binary join, which has a linear dependency on the input size but sublinear in the output size. Fundamentally, this is because the MPC model is all about locality: algorithms strive to send all “related” tuples to the same server so as to maximize the number of join results that can be found by the server locally.
Now, the key question is if there is always a join plan with an intermediate join size asymptotically smaller than . Unfortunately, the answer is no. A bad example can be easily constructed, by just putting two of the above instances together, but in opposite directions (see Figure 3). Nevertheless, this bad example precisely points us to the right direction: Although a global best join order may not exist, but if we decompose the join into multiple pieces, it is possible to find a provably good join order for each. This is exactly the basic idea of our algorithm, presented next.
4.2 A new algorithm for the line-3 join
We first compute (an MPC primitive). Then we proceed in two steps:
Step (1): Computing degrees
For a value in attribute , it is heavy if its degree in relation , i.e., , is greater than (value to be determined later), otherwise light. We first use the sum-by-key primitive to compute the degrees of all ’s for . After classifying the values in as heavy and light, we divide tuples in and also into heavy tuples and light tuples, depending on their value. More precisely, a tuple in or is heavy if its value is heavy, and light otherwise. This can be done by the multi-search primitive. We denote the heavy (resp. light) tuples in as (resp. ), for .
Step (2): Decomposing the join
We decompose the join into the following two parts, and compute them using different join orders:
[TABLE]
Note that since and are both divided according to the attribute, do not join with , do not join with .
Now we analyze the load. For , the intermediate join has size bounded by , since each intermediate join result from has a heavy value, so it joins with at least tuples in . Thus, the load of computing is (big-Oh of)
[TABLE]
For , the intermediate join has size bounded by , since each light tuple from can join with at most tuples from . Thus, the load of computing is (big-Oh of) .
[TABLE]
Setting balances the second term in (4) and in (5), and we obtain the claimed result (note that for ):
Theorem 5**.**
There is an algorithm computing the line-3 join with load in rounds.
4.3 Lower bound
We prove the following lower bound on any tuple-based algorithm for computing the line-3 join.
Theorem 6**.**
For any , there exists an instance for the line-3 join with input size and output size , such that any tuple-based algorithm computing the join in rounds must have a load of .
Proof.
Our lower bound argument is combinatorial in nature. We will construct a hard instance , such that a server can produces at most join results in a round, no matter which tuples from are loaded to the server. Then servers can product at most results over rounds. Setting will yield a lower bound on . Thus, any upper bound on will yield a lower bound on , and we will only focus on upper bounding .
We construct using the probabilistic method, i.e., we randomly generate an instance, and show that with positive probability (actually, with high probability), such a randomly generated instance satisfies our needs. The construction is similar to the one used in [18], but the parameters and arguments are different.
A randomly constructed instance is shown in Figure 4. In fact, only is random, while and are deterministic. Let , and set . Each distinct value of appears in tuples in , and each distinct value in appears in tuples in . The tuples in (resp. ) that share the same (resp. ) value are called a group. For each pair of values , the tuple is included in with probability independently. Note that , and , so the input size is expected to be . The output size is expected to be . By the Chernoff inequality, the probability that the input size or output size deviates from their expectations by more than a constant fraction is .
To give an upper bound on , we only restrict the server to load at most tuples from and , while tuples in can be accessed for free. Furthermore, we argue below that we only need to consider the situation where the server loads and in whole groups. Suppose two groups in , say, and , are not loaded in full (we may assume w.l.o.g. that is a multiple of , so there cannot be exactly one non-full group): tuples of and tuples of have been loaded. Suppose they respectively join with and tuples in that are loaded by the server. Note that they will produce join results. Without loss of generality, assume . Now consider the alternative where the server loads tuples of and tuples of . Then this would produce tuples. This means that by moving one tuple from to , the server can only get more join results (at least not less). We can move tuples from one group to another as long as there are two non-full groups. Eventually we arrive at a situation where all groups of are loaded by the server in full, without decreasing the reported join size. Next, we apply the same transformation to the groups of to make all its groups full as well. Therefore, to maximize , the server should only load and in full groups.
Thus, the server loads groups from and groups from . Below we show that a random instance constructed as above has the following property with high probability: On every possible choice of the groups of and groups of to be loaded, is always bounded.
Consider a particular choice of the groups from and groups from to be loaded. There are pairs of groups, and each pair has probability to join, so we expect to see pairs to join. Because the pairs join independently, by the Chernoff bound, the probability that more than pairs join is at most , for some parameter to be determined later. There are different choices of groups from and groups from . So, by the union bound, the probability that one of them yields more than joining groups is at most
[TABLE]
This probability is exponentially small if for some sufficiently large constant , so we set
[TABLE]
Since each joining group produces join results, we have shown that with high probability, a random instance has the property that no matter which tuples are loaded, we always have . Putting this into , we obtain
[TABLE]
Plugging (6) into (7), we have
[TABLE]
or
[TABLE]
Plugging in , ,
[TABLE]
The theorem is then proved after rearranging the terms. ∎
Ignoring logarithmic factors, this lower bound completes our understanding of the line-3 join in terms of output-optimality: (1) When , the Yannakakis algorithm has linear load . (2) When , the lower bound becomes , which is matched by our new algorithm. (3) When , the lower bound is , which is matched by the worst-case optimal algorithm in [19, 24]. In particular, this means that when is large enough, the load complexity of the join is no longer output-sensitive. This also stands in contrast with the RAM model, where the complexity of any acyclic join always grows linearly with .
An easy corollary is the following result, which shows that instance-optimality is not achievable for the line-3 join.
Corollary 2**.**
For any , there is an instance with input size for the line-3 join, such that any tuple-based algorithm computing the join in rounds must have a load of , while .
Proof.
We use in the lower bound construction above. Plugging these values into (8), we obtain the claimed lower bound. On the other hand, we have as large as
[TABLE]
As long as , the first term dominates. ∎
5 Acyclic Joins
In this section, we first extend the results from the previous section to arbitrary acyclic joins. Specifically, the algorithm is a (nontrivial) generalization of the line-3 algorithm, but it is self-contained; the lower bound builds on top of the hard instance of the line-3 join.
5.1 Algorithm
As a preprocessing step, we remove all dangling tuples. We also assume that the output size has been computed (an MPC primitive).
Recall that in an acyclic join , the hyperedges can be organized into a join tree , such that for each attribute , the nodes corresponding to are connected in . Given such a join tree , our algorithm recursively decomposes the join into multiple pieces, and apply a different join strategy for each.
We start from an internal node of whose children are all leaves. Let this node be , which has leaf children (see Figure 5 for an example). Let be the set of join attributes between and . We will assume ; otherwise we can add a dummy attribute to both and and all tuples in and share the same value on this dummy attribute (e.g., we add a dummy attribute to both and in Figure 5). Note that the join tree ensures the property that if for , then .
Let and . We will actually prove a slightly tighter bound, that the load of our algorithm is bounded by .
Set . Our algorithm proceeds in three steps.
Step (1): Computing data statistics
In each relation , , let be an assignment of values for attributes . The set of heavy assignments in is
[TABLE]
Tuples in can also be identified as heavy or light, depending on their projection on attributes . More precisely, a tuple is heavy if . The set of heavy tuples and light tuples in are denoted as and , respectively. All the statistics can be computed in by the sum-by-key and multi-search primitives with linear load.
Let . We decompose the join into the following sub-joins:
[TABLE]
where each can be either or . Note that there are , which is a constant, sub-joins, so we can afford to use servers for each sub-join. If a sub-join involves at least one , we apply the procedure in step (2) to it. In step (3), we handle the case where all are .
Step (2): Sub-joins with at least one
Without loss of generality, suppose is in the sub-join, i.e., we need to compute the sub-join
[TABLE]
where each can be either or . The algorithm consists of three steps:
- (2.1)
Compute . 2. (2.2)
Compute by any order. 3. (2.3)
Compute .
We analyze the load in each step: (2.1) is a primitive operation that incurs linear load. To bound the load of (2.2), observe that , since each tuple in joins with at least tuples in , each producing one final join result. Thus, the load is bounded by . The binary join in (2.3) has input size and output size , incurring a load of , which dominates the first two steps. Plugging in the value of , the total load is bounded by , as desired.
Step (3): The sub-join with all
It remains to compute the following sub-join:
[TABLE]
We further divide into heavy and light tuples, as follows. Let , and let be an assignment over attributes . The set of heavy assignments in is define as
[TABLE]
Tuples in are classified as heavy or light, depending on their projection on attributes , i.e., a tuple is heavy if , and light otherwise. Similarly, denote the heavy and light tuples in as and , respectively.
These statistics can also be computed using the primitives, but with some more care. For each relation , we first use sum-by-key to compute for every . This gives us a list of pairs. Then, we use multi-search to find, for each tuple , the up to pairs such that . After this step, each tuple in is attached with values, and we multiply them together to decide if the tuple is heavy or light.
Step (3.1): The sub-join with
We first compute the following sub-join:
[TABLE]
The algorithm consists of three steps:
- (3.1.1)
Compute by any order. 2. (3.1.2)
Compute for each . 3. (3.1.3)
Compute . Note that each of these relations contains all attributes in , so it is a hierarchical join (it is actually tall-flat), so we can use the instance-optimal algorithm in Section 3 to compute this join.
Now we analyze the load of each step: First, observe that . This is because the projection of each tuple in on is a heavy assignment, so it will produce at least join results after joining with the ’s. Therefore, the load of computing the join in (3.1.1) is . Each binary join in (3.1.2) has a load of . Note that each join result has size bounded by , since any tuple in can join with at most tuples in . Thus, the hierarchical join in (3.1.3) has input size and output size , so the instance-optimal algorithm has load according to Corollary 1. All the loads are bounded by , as desired.
Step (3.2): The sub-join with
Finally, we are left with the sub-join
[TABLE]
This is actually the only case where we need recursion:
- (3.2.1)
Compute by any order. 2. (3.2.2)
If , compute recursively.
Now we analyze the load: First, we have , since the projection of each tuple in on is a light assignment. Thus, the load of step (3.2.1) is , which is also bounded by . So far, we have completed the base case of the induction proof.
For the join to be computed recursively in step (3.2.2), its input size is at most and output size is at most . More importantly, can only become smaller, since becomes a leaf in the residual join and is no longer included in , no matter which node in the residual join is picked to be its new . By the induction hypothesis, computing the residual join recursively incurs a load of , thus bounded by .
Note that the recursion will increase the constant in the big-Oh, but as the recursion depth depends only on the query not the data size, it does not change the asymptotic result.
This completes the induction proof that the algorithm has a load of . Observing that and , we obtain the following result.
Theorem 7**.**
There is an algorithm that computes any acyclic join with load in rounds.
5.2 Lower bound
In Section 6 we have constructed a hard instance for the line-3 join and have shown that any algorithm must incur a load of on this instance. In this section, we generalize this lower bound to an arbitrary acyclic join that is not r-hierarchical. Note that for r-hierarchical joins, we can achieve a smaller load (see Corollary 1), so this establishes a separation between r-hierarchical joins and acyclic joins.
The basic idea in the lower bound is that any acyclic join must “include” a line-3 join, such that any algorithm computing the acyclic join must also compute the line-3 join. This is more formally captured by the following structural lemma on acyclic and r-hierarchical joins. To state the lemma, we need some terminology. In a hypergraph , a path between , denoted , is a sequence of vertices starting with and ending with , such that each consecutive pair of vertices appear together in an edge. The length of a path is defined as the number of vertices in minus 1. A path is minimal if there is no other path that is a strict subsequence of . It is easy to see that is minimal if and only if there exists no edge containing and with . Note that a shortest path must be minimal, but not vice versa.
Lemma 2**.**
An acyclic join is not r-hierarchical if and only if it has a minimal path of length .
The proof is given in Appendix A. With this lemma, we can extend the hard instance for the line-3 join to any acyclic but non-r-hierarchical join . Let be a minimal path of length 3 in , and suppose . Let be the hard instance for the line-3 join. We construct the hard instance for as follows. The domain of is the same as in . For any other attribute , set .
Since the path is minimal, each must fall into one of the following three cases:
For any with , contains only one tuple connecting the only value in the domains of attributes in . 2. 2.
If , then contains tuples, each having a distinct value of . 3. 3.
If , then contains tuples such that .
It can be easily verified that is exactly the join results of the line-3 join on , so the same lower bound applies. However, since the output size of the line-3 join is at most , we do have a condition on :
Theorem 8**.**
For an acyclic but non-r-hierarchical join and any , there exists an instance with input size and output size such that any tuple-based algorithm computing it in rounds must have a load of .
Similar to the line-3 join, this lower bound shows that our acyclic join algorithm is output-optimal (up to a logarithmic factor) when .
Furthermore, the same argument for Corollary 2 can be used here to show that instance-optimal algorithms do not exist for any acyclic but non-r-hierarchical join.
Corollary 3**.**
For any , there is an instance with input size for any acyclic but non-r-hierarchical join, such that any tuple-based algorithm that computes the join in rounds must have a load of , while .
6 Join-Aggregate Queries
We consider join-aggregate queries over annotated relations [16, 21]. Let be a commutative semiring. Every tuple is associated with an annotation . Let be a join hypergraph. The annotation of a join result is . Let be a set of output attributes and the non-output attributes. A join-aggregate query asks us to compute
[TABLE]
In plain language, a join-aggregate query first computes the join and the annotation of each join result, which is the -aggregate of the tuples comprising the join result. Then it partitions into groups by their projection on . Finally, for each group, it computes the -aggregate of the annotations of the join results.
Many queries can be formulated as special join-aggregate queries. For example, if we take to be the domain of integers, to be addition, to be multiplication, and set for all , then it becomes the COUNT(*) GROUP BY query; in particular, if , the query computes .
The join-project query , also known as a conjunctive query, is a special join-aggregate query, and we extend the terminology from [6] to join-aggregate queries. A width-1 GHD of a hypergraph is a tree , where each node is a subset of , such that
(coherence) for each attribute , the nodes containing are connected in ; 2. 2.
(edge coverage) for each hyperedge , there exists a node such that ; and 3. 3.
(width-1) for each node , there exists a hyperedge such that .
Given a set of output attributes (a.k.a. free variables), we say that is free-connex if there is a subset of connected nodes of including its root, denoted as (such a is said to be a connex subset), such that . A join-aggregate query is free-connex if it has a free-connex width-1 GHD.
As preprocessing, we remove the dangling tuples and then apply the reduce procedure repeatedly to remove an if there is another such that . Note that while dangling tuples can be just discarded, we cannot simply discard in the reduce procedure. To ensure that the annotations will be computed correctly, we should replace with and then discard . Note that by the earlier definition, the annotation of a join result is the -aggregate of the annotations of tuples comprising the join result, so the annotation in are aggregated into those in .
We find a free-connex width-1 GHD of [6, 5]. Note that the nodes of also define a hypergraph, and can be regarded as another join-aggregate query, but with the property that it has a free-connex subset such that . We construct an instance such that , where denotes the result of running the query defined by on . Observe that on a reduced , the condition in property (2) of a width-1 GHD can be replaced by , since if and for some other due to property (3), we would find . This implies that has only two types of nodes: (1) all hyperedges in , and (2) nodes that are a proper subset of some . Then we construct as follows. For each of type (1), we set where ; for each of type (2), we set for any , but the annotations of all tuples in are set to (the -identity). Below, we will focus on computing .
Joglekar et al. [21] modified the Yannakakis algorithm into AggroYannakakis, and showed that it has load on any free-connex join-aggregate query888The bound stated in [21] is actually , because they used a sub-optimal binary join algorithm as the subroutine following [2]. Replacing it with the optimal binary join algorithm in [8, 18] yields the claimed bound. In addition, they only considered simple join-aggregate queries, which are a strict subclass of free-connex queries. But after our conversion from to , their algorithm actually works for all free-connex queries.. Since we want to avoid the sub-optimal term, we modify their algorithm into LinearAggroYannakakis, which runs with linear load. It aggregates over all the non-output attributes, returning a modified query that only has the output attributes. The guarantees of LinearAggroYannakakis is stated in the following lemma.
Lemma 3**.**
LinearAggroYannakakis* is a constant-round, linear-load algorithm that, given any free-connex width-1 GHD and an instance , returns an instance such that , where is the free-connex subset of .*
Proof.
Let be a width-1 free-connex GHD and be the connex subset of such that . For an attribute , denote the highest node in containing as . Below, we describe LinearAggroYannkakakis, an algorithm that converts into such that .
The LinearAggroYannkakakis algorithm visits each node in a bottom-up fashion over . If , i.e., all its attributes are output attributes, we add to (line 4). Otherwise, we aggregate over , which are the non-output attributes in that do not appear in the ancestors of (line 6–7). This is a sum-by-key problem. Note that after the aggregation, the attributes of are . Let be the parent of in . Note that always exists since the root of must be in . Then we replace by (line 9). Below we show how this join can be done in linear load. Consider any non-output attribute . Since is an ancestor of , we have . Consider any output attribute . In the connex subset , there exists such that . Each node on the path from to must contain attribute , including . Thus, we must have . This means that tuples in are actually the same as those in , except that we update the annotation of each as , where . Thus, this can be done by the multi-search primitive in linear load. Because this algorithm never increases the size of any relation, the two primitive operations (line 7 and 9) incur linear load throughout the bottom-up traversal of .
It should be obvious from the algorithm description above that LinearAggroYannkakakis incurs linear load, but we still need to argue for its correctness. Note that has only output attributes. It suffices to show that .
Joglekar et al. [21] have shown that for any leaf and its parent , performing the operation in lines 6–9 and then discarding does not change the query results. AggroYannkakakis performs this operation over all the relations of in a bottom-up fashion, and applying this fact inductively means that the root relation becomes the final query result in the end, but this incurs load . LinearAggroYannkakakis performs this operation on a subset of relations, and stops as soon as it sees a node in . Then applying the result of [21] inductively up until proves our claim. ∎
Because is acyclic, we can run our output-optimal algorithm to compute . More precisely, when the algorithm emits a join result, we compute the -aggregate of the tuples comprising the join result. Note that in the following result, , i.e., the size of the final output, which can be much smaller than .
Theorem 9**.**
There is an algorithm that computes any free-connex join-aggregate query in rounds with load .
Observing that the join size of a (non-aggregate) join is a special join-aggregate query with , we obtain the following result, which has been used as a primitive. Note that there is no circular dependency here, because it only uses LinearAggroYannakakis.
Corollary 4**.**
For any acyclic join and any instance , can be computed in rounds with load .
Furthermore, if is r-hierarchical, we run our instance-optimal algorithm to compute . In fact, we can precisely characterize the class of queries with an r-hierarchical . A query is called out-hierarchical if it is free-connex and its residual query by removing all non-output attributes is r-hierarchical.
Lemma 4**.**
A join-aggregate query is out-hierarchical if and only if it has a width-1 GHD with a connex subset such that and is r-hierarchical.
Proof.
First we have known that join-aggregate query is free-connex iff it has a width-1 GHD with a connex subset such that . Consider the residual query of after removing all non-output attributes. Then it suffices to show that for a free-connex query , is r-hierarchical iff is r-hierarchical.
An edge is out-irreducible if there exists no such that or ; otherwise out-reducible. We first claim that for each out-irreducible there exists one node such that . Consider the node such that . If , the claim holds trivially. Otherwise, consider the lowest ancestor of in as . As each output attribute appears in some node of , it also appears in due to the coherence constraint. Thus, .
Recall that for each node , there exists an edge such that . Correspondingly, for each node , there exists an edge such that . Thus, for each out-irreducible , there exists one node such that , since if and for some other , would be out-reducible. This implies that has only two types of nodes: (1) for each out-irreducible , and (2) a proper subset of for some .
Not surprisingly, also have two types of edges, (1) for each out-irreducible , and (2) for each out-reducible . Nodes in of type (1) are one-to-one mappings to edges in of type (1). Moreover, after applying the reduce procedure repeatedly on or , only nodes or edges of type (1) can survive. Thus, the reduced query of is hierarchical iff the reduced query of is hierarchical, and the is r-hierarchical iff is r-hierarchical. ∎
Theorem 10**.**
For out-hierarchical query and any instance , there is an algorithm computing it in rounds with load .
Note that the instance-optimal lower bound for a join-aggregate query is defined with respect to the output attributes only, i.e.,
[TABLE]
where .
7 A Lower Bound on Triangle Join
Finally, we look beyond acyclic joins. In particular, we give an output-sensitive lower bound on the triangle join . For , a worst-case lower bound of is known, by the following argument: A server loading tuples can emit at most join results by the AGM bound [4], while the join size of can be as large as . Then setting yields this lower bound. However, if is used as a parameter, this argument only leads to a lower bound of . Below, we improve this lower bound to the following:
Theorem 11**.**
For any , there exists an instance for with input size and output size such that any tuple-based algorithm computing it in rounds must have a load of .
Proof.
When , the claimed lower bound simplifies to , so we will only consider the case . Let and . Note that as implied by the AGM bound. Our construction of the hard instance is illustrated in Figure 6.
Set , and . Set and . The relation is constructed randomly, in which each distinct value in and each distinct value of have a probability of to form a tuple. Note that relations and are deterministic and always have tuples. Relation is probabilistic with tuples in expectation. So this instance has input size and output size in expectation. By the Chernoff bound, the probability that the input size and output size deviate from their expectation by more than a constant factor is at most .
Similar to the proof of Lemma 6, we will show that with positive probability, an instance constructed this way will have a bounded , the maximum number of join results a server can produce, if it loads at most tuples from each relation. Then setting yields a lower bound on .
To bound , we first argue that on any instance constructed as above, we can limit the choice of the tuples loaded from (, respectively) by any server to the form for some (, respectively), i.e., the algorithm should load tuples from and in the form of a Cartesian product. More precisely, we show below that making such a restriction will not make smaller by more than a constant factor.
Suppose a server has loaded tuples from . Then the server needs to decide which tuples from and to load to maximize the number of triangles formed. This is a combinatorial optimization problem that can be formulated as an integer linear program (ILP). Introduce a variable for each pair and a variable for each pair . Also let if is loaded by the server, and [math] otherwise. Then below defines this optimization problem, where always ranges over , over , over unless specified otherwise.
[TABLE]
We transform into another ILP, shown as above. uses a function , which denotes the optimal solution of . is parameterized by and , which finds the maximum number of triangles that can be formed with the tuples loaded from and , subject to the constraint that at most tuples containing are loaded from and . Because all values are structurally equivalent, the optimal solution of does not depend on the particular choice of , which is why we write the optimal solution of as . Also, it is obvious that is a non-decreasing function. Then, tries to find the optimal allocation of the tuples to different values so as to maximize the total number of triangles formed. Let the optimal solutions of be , respectively. Because only restricts the server to load at most tuples from and in total, any feasible solution to is also a feasible solution to , so . Next we construct a feasible solution of with the Cartesian product restriction above, and show that it is within a constant factor from , hence .
Let . We choose values arbitrarily from and allocate tuples to each such . For each such , we use the optimal solution of to find the tuples to load from and . Note that the optimal solution is the same for every , so each will choose the same sets of ’s and ’s. Thus, this feasible solution loads tuples from and in the form of Cartesian products. The number of triangles formed is . We show that this is a constant-factor approximation of .
Lemma 5**.**
.
Proof.
Suppose chooses a set of values from , and each has tuples loaded from and . A value is efficient if , otherwise inefficient. Denote the set of efficient values as and inefficient values as . Note that for every efficient value , by the definition of .
We relate and by showing how to cover all the triangles reported by with the feasible solution constructed above. First, we use values of each with tuples from and to cover the triangles reported by . The total number of tuples needed is at most . The number of triangles that can be reported is
[TABLE]
Next, we use values each with tuples from and to cover the triangles reported by . The total number of tuples needed is . Recall that for each . The number of triangles that can be reported is
[TABLE]
where the rationale behinds the last inequality is that there are at most values in and there is for each by the non-decreasing property of .
Combining the two parts for the optimal solution , our alternative solution loads at most tuples from and , and can report at least triangles. ∎
Next, we show that with positive probability (actually high probability), we obtain an instance on which is bounded. By the analysis above, we only need to consider the case where tuples from and are loaded in the form of Cartesian products. One value is loaded if at least one tuple with is loaded. Similarly, value is loaded if at least one tuple with is loaded. Suppose and distinct values from and are loaded respectively. Note that we must have . Without loss of generality, assume . Due to Cartesian product constraint, the number of distinct values loaded from is at most .
Case 1:
We first upper bound the probability that the server can report many triangles on a random instance, for a particular choice of values loaded from and values from . Since at most distinct values from are loaded, each tuple loaded from can form at most triangles. Because each pair have probability to form a tuple in , on a random instance, we expect to see tuples and triangles. Note that this is always smaller than : (1) If , , then ; (2) Otherwise, , then . This server can report more than triangles, for some , if more than tuples exist among those pairs. By Chernoff bound, this happens with probability no larger than .
This is the probability that the server succeeds in reporting many triangles under a particular choice of values loaded from and values from . There are possible pairs. For each pair, there are choices of loading values from , choices of loading values from , and choices from . Thus the server has possible choices. By the union bound, the probability that any of these choices produces more than join results is at most
[TABLE]
which is exponentially small if
[TABLE]
and
[TABLE]
for some sufficiently large constants . Rearranging, this becomes
[TABLE]
for some sufficiently large constant . Under this condition, the probability in is at most .
Case 2:
In this case, we have . The server loads distinct values from , so each tuple loaded from can form at most triangles. The server can load at most tuples from , so at most triangles can be reported, for any
[TABLE]
Combining these two cases, under the condition (10) and (11) on , with high probability the server cannot find any way to load tuples to report more than triangles. Therefore, on these instances, we have
[TABLE]
where we set
[TABLE]
With the facts that and , we observe
[TABLE]
where the last inequality follows from our assumption . Then (13) can be simplified to
[TABLE]
Plugging (12) and (14) into , we obtain
[TABLE]
Finally, after plugging in and rearranging, we obtain
[TABLE]
∎
Remark. Our lower bound has the following consequences:
When , the lower bound becomes , which means that the worst-case optimal algorithm of [24] is actually also output-optimal in this parameter range. Finding triangles is as difficult as finding triangles. 2. 2.
When , the lower bound becomes while we do not have a matching upper bound yet. Nevertheless, this already exhibits a separation from acyclic joins, which can be done with load . The gap is at least .
Appendix A Proof of Lemma 2
Proof.
Direction (): In an acyclic join , a minimal path of length 3 is a sequence of 4 vertices , such that and there exists no edge with , or , or . This already testifies that is not hierarchical. To show that it is not r-hierarchical, consider the process of repeatedly applying the reduce procedure to . If any of is removed in the process, say , there must exist an edge such that . The same applies for and . Thus we can always find three edges such that after applying the reduce procedure, so this query is not r-hierarchical.
Direction (): The proof is constructive. We will show below how to find a minimal path of length 3 in any acyclic but non-r-hierarchical join. We first apply the reduce procedure to such that no edge is contained in another. The rationale behind this is that a minimal path between two vertices of length in the reduced join is also a minimal path between of length 3 in the original join. Then we proceed in 3 steps: (We give an intuitive illustration of the results after each step, in Figure 7.)
**Step 1: ** Find a subgraph defined by three distinct edges and four distinct vertices , such that .
**Step 2: ** Find a subgraph defined by three distinct edges and four distinct vertices , such that , and there exists no edge with or .
**Step 3: ** Find a minimal path of length between and .
Our construction and its correctness proof is based on a basic property of acyclic join, as stated in Lemma 6. With Lemma 6, we are able to prove stronger results in Corollary 5 and Corollary 6, which will be used as building blocks in proving Lemma 2.
Lemma 6**.**
For three distinct edges , if , then there exists one edge such that .
**Proof of Lemma 6: ** Consider attributes such that . In the GYO reduction [1], we observe that (1) Any of won’t be removed as an unique attribute before any edge of is removed; (2) Any of won’t be removed as an empty edge before any of is removed. So it is always feasible to identify one edge such that . Moreover, any attribute in if exists won’t be removed as an unique attribute before any edge of is removed. Thus we come to the conclusion in Lemma 6.
Corollary 5**.**
For two distinct edges and a subset of edges , if for each and , then there exists one edge such that .
**Proof of Corollary 5: ** For simplicity, rename edges in as . We prove it by induction. The base case when is precisely characterized and solved by Lemma 6. We hold the hypothesis that there exists one edge such that
[TABLE]
Moreover, if , edge is exactly the one characterized by Corollary 5 and we are done. Otherwise, .
We observe that , so there is . If , there is . So far we have following observations on that (1) ; (2) ; (3) , or equivalently,
[TABLE]
Thus edge is exactly the one characterized by Corollary 5, and we are done. Otherwise, . Implied by Lemma 6, there exists an edge such that . More precisely, (1) ; (2) ; (3) . Or equivalently,
[TABLE]
thus edge is exactly the one characterized by Corollary 5, and we are done.
Corollary 6**.**
For a set of distinct vertices , if there exists one edge such that , and there exists one edge such that for each , then there exists one edge such that .
**Proof of Corollary 6: ** We prove it by induction. The base case when is trivial. We hold the hypothesis that there exists one edge such that .
If , edge is exactly the one characterized by Corollary 6 and we are done. Moreover, if , edge is exactly the one characterized by Corollary 6 and we are done. Otherwise, and . Note that , , and . Implied by Lemma 6, there exists one edge such that , and . More precisely, , , and , thus .
**Proof of step 1: **
If an acyclic join is not r-hierarchical, then there exist two attributes such that . Consider , and . It suffices to show that and by the constraint. First is not empty otherwise contradicting our assumption. The same applies for . If , each attribute appearing in also appears in . In this way, we can identify three distinct attributes such that , , , which form a cycle. Thus there exists an edge such that implied by Lemma 6. Note that implies that can be rewritten as . In this way, contradicting our assumption. So we have , and the same applies for .
**Proof of step 2: **
Assume we already have a subgraph defined by edges and vertices , such that . If there exists no edge such that or , we are done. Otherwise, we need to show how to find satisfying our condition to replace . Note that the replacement of and that of are independent, as well as their correctness arguments.
In the following, we will tackle the situation where there exists an edge such that . The situation where there exists an edge such that is symmetric and can be tackled similarly.
Define the attribute set . If , then we just replace by any attribute in . Otherwise, , which implies that can be rewritten as . We will prove by contradiction that this case won’t happen in the reduced join. Define the edge set . Note that if , then for each . We distinguish following four cases. We give an intuitive illustration of the contradiction in each case, in Figure 8. The same technique we adopt is to identify an edge such that and , coming to a contradiction in a reduced join.
Case 1:
Consider an arbitrary attribute . Denote . Note that , , and for each . Implied by Corollary 5, there exists an edge such that , and for each . This also implies , , and . Thus, contradicting our assumption.
Case 2: and
Consider an arbitrary attribute . Denote , where since . Note that , , and for each . Implied by Corollary 5, there exists an edge such that , and for each . This also implies and . Thus . We already have in this case. Thus, contradicting our assumption.
Case 3: , , and
Consider an arbitrary attribute . Note that , , and for each . Implied by Corollary 5, there exists an edge such that , and for each . This also implies and . We already have in this case. Thus, contradicting our assumption.
Case 4: , , and
Under this circumstances, . Implied by the fact that , we have . For attributes and all attributes in , there is , and for each there exists one edge such that . Implied by Corollary 6, there exists one edge such that and . Thus, and , contradicting our assumption.
Combining these four cases proves the step 2.
**Proof of step 3: ** Consider the subgraph found in the last step. By the definition of minimal path, it suffices to show that there exists no edge such that , or , or . By contradiction, assume there is an where . Implied by the contraints of this subgragh, and . Attributes form a cycle on edges , then there must exist an edge containing all of contradicting the constraints. The similar argument applies for and . ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[ 1 ] S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases: the logical level . Addison-Wesley Longman Publishing Co., Inc., 1995.
- 2[ 2 ] F. Afrati, M. Joglekar, C. Ré, S. Salihoglu, and J. D. Ullman. GYM: A multiround join algorithm in Map Reduce. In Proc. International Conference on Database Theory , 2017.
- 3[ 3 ] F. N. Afrati and J. D. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering , 23(9):1282–1298, 2011.
- 4[ 4 ] A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins. SIAM Journal on Computing , 42(4):1737–1767, 2013.
- 5[ 5 ] G. Bagan. Algorithmes et complexité des problèmes d’énumération pour l’évaluation de requêtes logiques . Ph D thesis, Université de Caen, 2009.
- 6[ 6 ] G. Bagan, A. Durand, and E. Grandjean. On acyclic conjunctive queries and constant delay enumeration. In International Workshop on Computer Science Logic , pages 208–222. Springer, 2007.
- 7[ 7 ] P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In Proc. ACM Symposium on Principles of Database Systems , 2013.
- 8[ 8 ] P. Beame, P. Koutris, and D. Suciu. Skew in parallel query processing. In Proc. ACM Symposium on Principles of Database Systems , 2014.
