Instance and Output Optimal Parallel Algorithms for Acyclic Joins

Xiao Hu; Ke Yi

arXiv:1903.09717·cs.DB·April 1, 2019

TL;DR

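This paper develops new parallel algorithms for acyclic joins in the MPC model: an instance-optimal algorithm for r-hierarchical joins, and an algorithm for arbitrary acyclic joins that is output-optimal when OUT = O(p * IN). It also gives the first output-sensitive lower bound for the triangle join, showing that it is inherently harder than acyclic joins.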

Contribution

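It introduces a new MPC algorithm for arbitrary acyclic joins with load O(IN/p + sqrt(IN*OUT)/p), an O(sqrt(OUT/IN))-factor improvement over the MPC version of the classical Yannakakis algorithm, and proves that this load is output-optimal for every acyclic but non-r-hierarchical join when OUT = O(p * IN).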

Findings

01 New MPC algorithm for acyclic joins with load O(IN/p + sqrt(IN*OUT)/p)

02 Achieves instance-optimality for r-hierarchical joins in MPC

03 Provides an output-sensitive lower bound for the triangle join in MPC

Abstract

Massively parallel join algorithms have received much attention in recent years, while most prior work has focused on worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on hard instances having very large output sizes, which rarely appear in practice. A stronger notion of optimality is output-optimal, which requires an algorithm to be optimal within the class of all instances sharing the same input and output size. An even stronger optimality is instance-optimal, i.e., the algorithm is optimal on every single instance, but this may not always be achievable. In the traditional RAM model of computation, the classical Yannakakis algorithm is instance-optimal on any acyclic join. But in the massively parallel computation (MPC) model, the situation becomes much more complicated. We first show that for the class of r-hierarchical joins,…

Tables

Table 1: Summary of results.

Joins	Instance-optimal¹		Output-optimal
	one-round	multi-round	one-round	multi-round
tall-flat	$L_{\text{ins-opt}} \cdot \log^{O(1)} p$ [8]	$\Theta(L_{\text{ins-opt}})$	-
r-hierarchical w/o dangling tuples	$L_{\text{ins-opt}} \cdot \log^{O(1)} p$ [8]
r-hierarchical w/ dangling tuples	$\omega(L_{\text{ins-opt}})$		$\omega(\frac{\mathrm{IN}+\mathrm{OUT}}{p})$ [26]	-
acyclic	$\omega(L_{\text{ins-opt}})$			$\Theta(\frac{\mathrm{IN}}{p}+\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p})$ (LB for $\mathrm{OUT}\leq O(p\cdot\mathrm{IN})$)
triangle				$\tilde{\Omega}(\min\{\frac{\mathrm{IN}+\mathrm{OUT}}{p},\frac{\mathrm{IN}}{p^{2/3}}\})$

Equations

$L_{\mathcal{A}}(\mathrm{IN},\mathrm{OUT})=\max_{\mathcal{R}\in\mathfrak{R}(\mathrm{IN},\mathrm{OUT})}L_{\mathcal{A}}(\mathcal{R}),$

$L_{\mathcal{A}}(\mathrm{IN},\mathrm{OUT})=O(L_{\mathcal{A}^{\prime}}(\mathrm{IN},\mathrm{OUT})),$

$L_{\mathcal{A}}(\mathcal{R})=O(L_{\mathcal{A}^{\prime}}(\mathcal{R})),$

$L_{\mathrm{Cartesian}}(p,\mathcal{R}):=\max_{S\subseteq\{1,\dots,m\}}\left(\frac{\prod_{i\in S}N_{i}}{p}\right)^{\frac{1}{|S|}}.$

$\mathcal{Q}(\mathcal{R},S):=\left(\Join_{e\in S}R(e)\right)\ltimes\mathcal{Q}(\mathcal{R}),$

$L_{\textrm{instance}}(p,\mathcal{R}):=\max_{S\subseteq\mathcal{E}}\left(\frac{|\mathcal{Q}(\mathcal{R},S)|}{p}\right)^{\frac{1}{|S|}}.$

$L_{\textrm{BinHC}}(p,\mathcal{R}):=\max_{x,u}\left(\frac{\sum_{a\in\mathrm{dom}(x)}\prod_{e\in\mathcal{E}}|\sigma_{x=a}R(e)|^{u(e)}}{p}\right)^{\frac{1}{\sum_{e\in\mathcal{E}}u(e)}}$

$p(x,u)=\sum_{a\in\mathrm{dom}(x)}\prod_{e\in\mathcal{E}}\left(\frac{|\sigma_{x=a}R(e)|}{L}\right)^{u(e)}.$

[Three multi-line derivations bounding $p(x,u)$ from above follow here in the paper; only their relation symbols survived extraction.]

$\sum_{a}\prod_{e\in S}|\sigma_{x=a}R(e)|=|\Join_{e\in S}R(e)|.$

$p_{a}=\max_{S\subseteq\mathcal{E}}\frac{|\mathcal{Q}_{x}(\mathcal{R}_{a},S)|}{L^{|S|}}$

$\sum_{a}p_{a}\leq\sum_{a}\sum_{S\subseteq\mathcal{E}}\frac{|\mathcal{Q}_{x}(\mathcal{R}_{a},S)|}{L^{|S|}}=\sum_{S\subseteq\mathcal{E}}\frac{|\mathcal{Q}(\mathcal{R},S)|}{L^{|S|}}=O(p).$

$\frac{\mathrm{IN}_{a}}{p_{a}}+L_{\textrm{instance}}(p_{a},\mathcal{R}_{a})=\frac{\mathrm{IN}_{a}}{p_{a}}+\max_{S\subseteq\mathcal{E}}\left(\frac{|\mathcal{Q}_{x}(\mathcal{R}_{a},S)|}{p_{a}}\right)^{\frac{1}{|S|}}.$

$p_{a}\geq\frac{1}{L}\cdot|\mathcal{Q}_{x}(\mathcal{R}_{a},\{e\})|=\frac{1}{L}\cdot|\sigma_{x=a}R(e)|\geq\frac{\mathrm{IN}_{a}}{|\mathcal{E}|\cdot L},$

$p_{i}=\max_{S\subseteq\mathcal{E}_{i}}\left\lceil\frac{|\mathcal{Q}_{i}(\mathcal{R}_{i},S)|}{L^{|S|}}\right\rceil.$

$\prod_{i\in I}p_{i}\leq$ [right-hand side lost in extraction]

$\frac{\mathrm{IN}_{i}}{p_{i}}+L_{\textrm{instance}}(p_{i},\mathcal{R}_{i})=\frac{\mathrm{IN}_{i}}{p_{i}}+\max_{S\subseteq\mathcal{E}_{i}}\left(\frac{|\mathcal{Q}_{i}(\mathcal{R}_{i},S)|}{p_{i}}\right)^{\frac{1}{|S|}}.$

$\max_{\mathcal{R}\in\mathfrak{R}(\mathrm{IN},\mathrm{OUT})}L_{\textrm{instance}}(p,\mathcal{R})$

$\max_{S\subseteq\mathcal{E}}\max_{\mathcal{R}\in\mathfrak{R}(\mathrm{IN},\mathrm{OUT})}\left(\frac{|\Join_{e\in S}R(e)|}{p}\right)^{\frac{1}{|S|}}$ [continuation lost in extraction]

$Q_{1}$

Full text

Instance and Output Optimal Parallel Algorithms for Acyclic Joins

Xiao Hu, Ke Yi

Hong Kong University of Science and Technology

{xhuam, yike}@cse.ust.hk

Abstract

Massively parallel join algorithms have received much attention in recent years, while most prior work has focused on worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on hard instances having very large output sizes, which rarely appear in practice. A stronger notion of optimality is output-optimal, which requires an algorithm to be optimal within the class of all instances sharing the same input and output size. An even stronger optimality is instance-optimal, i.e., the algorithm is optimal on every single instance, but this may not always be achievable.

In the traditional RAM model of computation, the classical Yannakakis algorithm is instance-optimal on any acyclic join. But in the massively parallel computation (MPC) model, the situation becomes much more complicated. We first show that for the class of r-hierarchical joins, instance-optimality can still be achieved in the MPC model. Then, we give a new MPC algorithm for an arbitrary acyclic join with load $O({\mathrm{IN}\over p}+{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}\over p})$ , where $\mathrm{IN},\mathrm{OUT}$ are the input and output sizes of the join, and $p$ is the number of servers in the MPC model. This improves the MPC version of the Yannakakis algorithm by an $O(\sqrt{\mathrm{OUT}\over\mathrm{IN}})$ factor. Furthermore, we show that this is output-optimal when $\mathrm{OUT}=O(p\cdot\mathrm{IN})$ , for every acyclic but non-r-hierarchical join. Finally, we give the first output-sensitive lower bound for the triangle join in the MPC model, showing that it is inherently more difficult than acyclic joins.

1 Introduction

A (natural) join is defined as a hypergraph $\mathcal{Q}=(\mathcal{V},\mathcal{E})$, where the vertices $\mathcal{V}=\{x_{1},\dots,x_{n}\}$ model the attributes and the hyperedges $\mathcal{E}=\{e_{1},\dots,e_{m}\}\subseteq 2^{\mathcal{V}}$ model the relations. Let $\mathrm{dom}(x)$ be the domain of attribute $x\in\mathcal{V}$. An instance of $\mathcal{Q}$ is a set of relations $\mathcal{R}=\{R(e):e\in\mathcal{E}\}$, where each $R(e)$ is a set of tuples and each tuple assigns a value from $\mathrm{dom}(x)$ to every attribute $x\in e$. We use $\mathrm{IN}=\sum_{e\in\mathcal{E}}|R(e)|$ to denote the size of $\mathcal{R}$. The join results of $\mathcal{Q}$ on $\mathcal{R}$, denoted as $\mathcal{Q}(\mathcal{R})$, consist of all combinations of tuples, one from each $R(e)$, such that they share common values on their common attributes. Let $\mathrm{OUT}=|\mathcal{Q}(\mathcal{R})|$ be the output size. We study the data complexity of join algorithms, i.e., we assume that the query size, namely $n$ and $m$, is a constant. In this paper, we focus on acyclic joins, i.e., when the hypergraph $\mathcal{Q}$ is acyclic (formal definition given later).
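To make the definitions of $\mathcal{Q}(\mathcal{R})$, $\mathrm{IN}$ and $\mathrm{OUT}$ concrete, the following is a minimal brute-force sketch. It is not an algorithm from the paper; the function name `natural_join` and the encoding of a relation as a mapping from an attribute tuple to a list of value tuples are assumptions made for illustration only.

```python
from itertools import product

def natural_join(relations):
    """Brute-force natural join (illustrative, exponential in the query size).

    `relations` maps a hyperedge e (a tuple of attribute names) to R(e),
    a list of value tuples in the same attribute order.
    Returns one dict {attribute: value} per join result.
    """
    edges = list(relations)
    results = []
    for combo in product(*(relations[e] for e in edges)):
        assignment = {}
        # setdefault returns the value already bound to attr (or binds val),
        # so a mismatch means two tuples disagree on a shared attribute.
        consistent = all(
            assignment.setdefault(attr, val) == val
            for e, t in zip(edges, combo)
            for attr, val in zip(e, t)
        )
        if consistent:
            results.append(assignment)
    return results

# Hypothetical instance of the binary join R1(A,B) join R2(B,C).
R = {
    ("A", "B"): [(1, 10), (2, 10), (3, 20)],
    ("B", "C"): [(10, 100), (10, 200), (30, 300)],
}
IN = sum(len(tuples) for tuples in R.values())
OUT = len(natural_join(R))
```

In this toy instance only the tuples with $B=10$ join, so the tuples $(3,20)$ and $(30,300)$ are dangling and contribute to $\mathrm{IN}$ but not to $\mathrm{OUT}$.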

1.1 The model of computation

The problem gets much more interesting in the parallel setting. In this paper, we consider the massively parallel computation (MPC) model [2, 3, 7, 8, 22, 24, 26], which has become the standard model of computation for studying massively parallel algorithms, especially for join algorithms.

In the MPC model, data is initially distributed evenly over $p$ servers with each server holding $\mathrm{IN}/p$ tuples. Computation proceeds in rounds. In each round, each server first sends messages to other servers, receives messages from other servers, and then does some local computation. The complexity of the algorithm is measured by the number of rounds and the load, denoted as $L$, which is the maximum message size received by any server in any round. A linear load $L=O({\mathrm{IN}\over p})$ is the ideal case (since the initial load is already ${\mathrm{IN}\over p}$), while if $L=O(\mathrm{IN})$, all problems can be solved trivially in one round by simply sending all data to one server. Initial efforts were mostly spent on what can be done in a single round of computation [3, 26, 7, 8, 24, 26], but recently, more interest has been given to multi-round (but still constant-round) algorithms [2, 22, 24], since new main-memory-based systems, such as Spark and Flink, have much lower overhead per round than previous generations like Hadoop.

The MPC model can be considered as a simplified version of the BSP model [32], but it has enjoyed more popularity in recent years. This is mostly because the BSP model takes too many measures into consideration, such as communication costs, local computation time, memory consumption, etc. The MPC model unifies all these costs with one parameter $L$, which makes the model much simpler. Meanwhile, although $L$ is defined as the maximum incoming message size of a server, it is also closely related to the local computation time and memory consumption, which are both increasing functions of $L$. Thus, $L$ serves as a good surrogate of these other cost measures. This is also why the MPC model does not limit the outgoing message size of a server, which is less relevant to other costs.

All our algorithms work under the mild assumption $\mathrm{IN}\geq p^{1+\epsilon}$ where $\epsilon>0$ is any small constant. This assumption clearly holds on any reasonable values of $\mathrm{IN}$ and $p$ in practice; theoretically, this is the minimum requirement for the model to be able to compute some almost trivial functions, like the “or” of $\mathrm{IN}$ bits, in $O(1)$ rounds. Our lower bounds hold under $\mathrm{IN}\geq p^{c}$ for some constant $c$ , which may depend on the particular lower bound construction.

We confine ourselves to tuple-based join algorithms, i.e., the tuples are atomic elements that must be processed and communicated in their entirety. The only way to create a tuple is by making a copy, from either the original tuple or one of its copies. We say that an MPC algorithm computes the join $\mathcal{Q}$ on instance $\mathcal{R}$ if the following is achieved: For any join result $(t_{1},\dots,t_{m})\in\mathcal{Q}(\mathcal{R})$ where $t_{i}\in R(e_{i})$ , $i=1,\dots,m$ , these $m$ tuples (or their copies) must all be present on the same server at some point. Then the server will call a zero-cost function $emit(t_{1},\dots,t_{m})$ to report the join result. Note that since we only consider constant-round algorithms, whether a server is allowed to keep the tuples it has received from previous rounds is irrelevant: if not, it can just keep sending all these tuples to itself over the rounds, increasing the load by a constant factor. All known join algorithms in the MPC model are tuple-based and obey these requirements. Our lower bounds are combinatorial in nature: we only count the number of tuples that must be communicated in order to emit all join results, while all other information can be communicated for free. The upper bounds include all messages, with a tuple and an integer of $O(\log\mathrm{IN})$ bits both counted as 1 unit of communication.

1.2 Instance and output optimality

In worst-case analysis, the entire space of instances is divided into classes by the input size $\mathrm{IN}$, and the running time is measured on the worst instance in each class. For many important computational problems, this is too coarse-grained and cannot accurately characterize the performance of the algorithm. For the join problem, no algorithm can do better than $O(\mathrm{IN}^{\rho})$ time in the worst case, where $\rho$ is the fractional edge cover number of the hypergraph $\mathcal{Q}$ [33, 29]. This bound drastically overestimates the running time on most typical instances.

A more refined approach is parameterized analysis, which further subdivides the instance space into smaller classes by introducing more parameters that supposedly better characterize the difficulty of each class. For the join problem, the output size $\mathrm{OUT}$ is a commonly used parameter, and each class of instances shares the same input and output size. Let $\mathfrak{R}(\mathrm{IN},\mathrm{OUT})$ be the class of instances with input size $\mathrm{IN}$ and output size $\mathrm{OUT}$. The load of an MPC algorithm $\mathcal{A}$ is then a function of both $\mathrm{IN}$ and $\mathrm{OUT}$, defined as

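$$L_{\mathcal{A}}(\mathrm{IN},\mathrm{OUT})=\max_{\mathcal{R}\in\mathfrak{R}(\mathrm{IN},\mathrm{OUT})}L_{\mathcal{A}}(\mathcal{R}),$$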

where $L_{\mathcal{A}}(\mathcal{R})$ denotes the load of $\mathcal{A}$ on $\mathcal{R}$ . Algorithm $\mathcal{A}$ is output-optimal if

$$L_{\mathcal{A}}(\mathrm{IN},\mathrm{OUT})=O(L_{\mathcal{A}^{\prime}}(\mathrm{IN},\mathrm{OUT})),$$

for every algorithm $\mathcal{A}^{\prime}$ .

Further subdividing the instance space leads to more refined analyses. In the extreme case when each class contains just one instance, we obtain instance-optimal algorithms. More precisely, an algorithm $\mathcal{A}$ is instance-optimal if

$$L_{\mathcal{A}}(\mathcal{R})=O(L_{\mathcal{A}^{\prime}}(\mathcal{R})),$$

for every instance $\mathcal{R}$ and every algorithm $\mathcal{A}^{\prime}$. Note that by definition, an instance-optimal algorithm must be output-optimal, and an output-optimal algorithm must be worst-case optimal, but the reverse direction may not be true.

In the traditional RAM model of computation, the classical Yannakakis algorithm [34] can compute any acyclic join in time $O(\mathrm{IN}+\mathrm{OUT})$, which is both output-optimal and instance-optimal, because on any instance $\mathcal{R}$, any algorithm has to spend at least $\Omega(\mathrm{IN})$ time to read all the inputs¹ and $\Omega(\mathrm{OUT})$ time to enumerate the outputs. Thus, the two notions of optimality coincide (but both are stronger than worst-case optimality). Fundamentally, this is because the difficulty of any instance $\mathcal{R}$ is precisely characterized by its input size and output size, and all instances in $\mathfrak{R}(\mathrm{IN},\mathrm{OUT})$ have exactly the same complexity $O(\mathrm{IN}+\mathrm{OUT})$.

¹ To formally prove this claim, one will have to be more careful with the family of algorithms under consideration. In particular, if $\mathrm{OUT}=0$, then the algorithm may not have to do anything. One possible approach is to ask the algorithm to produce a certificate in addition to the join results [28]. We will not digress in this direction since this paper is only concerned with MPC algorithms.

1.3 Join algorithms in the MPC model

The situation becomes much more interesting in the MPC model. First, it has been observed that the Yannakakis algorithm can be easily implemented in the MPC model with a load of $O(\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p})$ [2]², but this is not optimal. In particular, it is known that the binary join $R_{1}(A,B)\Join R_{2}(B,C)$ can be computed with load $O(\frac{\mathrm{IN}}{p}+\sqrt{\frac{\mathrm{OUT}}{p}})$ [8, 18]. This is optimal by the following simple lower bound argument: Each server can only produce $O(L^{2})$ join results in a constant number of rounds with the load limited to $L$, so all the $p$ servers can produce at most $O(p\cdot L^{2})$ join results. Thus, producing $\mathrm{OUT}$ join results needs at least a load of $L=\Omega(\sqrt{\frac{\mathrm{OUT}}{p}})$. Meanwhile, since $L\geq{\mathrm{IN}/p}$ by definition, the $O(\frac{\mathrm{IN}}{p}+\sqrt{\frac{\mathrm{OUT}}{p}})$ bound is optimal. Note that this argument can be applied on a per-instance basis, which means that the load complexity of any instance is still precisely captured by $\mathrm{IN}$ and $\mathrm{OUT}$, and $O(\frac{\mathrm{IN}}{p}+\sqrt{\frac{\mathrm{OUT}}{p}})$ is both an instance-optimal and output-optimal bound.

² The bound stated in [2] is actually $O({(\mathrm{IN}+\mathrm{OUT})^{2}\over p})$, because they used a sub-optimal binary join algorithm as the subroutine. Replacing it with the optimal binary join algorithm in [8, 18] yields the claimed bound, as observed in [25].
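The counting argument for the binary join can be turned into a small numeric sketch. The helper names below are hypothetical, and all hidden constants are taken to be 1, so the numbers only indicate orders of magnitude.

```python
import math

def binary_join_load_lower_bound(IN, OUT, p):
    """Per-instance lower bound for a binary join, constants ignored.

    A server that receives at most L tuples can emit at most about L^2 pairs,
    so p * L^2 >= OUT, and L >= IN / p because the input must be held.
    """
    return max(IN / p, math.sqrt(OUT / p))

def yannakakis_mpc_load(IN, OUT, p):
    """Load of the MPC version of the Yannakakis algorithm, constants ignored."""
    return IN / p + OUT / p

# Hypothetical parameters: one million input tuples on 100 servers.
IN, p = 10**6, 100
small_output = binary_join_load_lower_bound(IN, 10**8, p)   # input term dominates
large_output = binary_join_load_lower_bound(IN, 10**12, p)  # output term dominates
```

With these parameters, the output term $\sqrt{\mathrm{OUT}/p}$ overtakes $\mathrm{IN}/p$ only once $\mathrm{OUT}$ exceeds $\mathrm{IN}^{2}/p$, while the $\mathrm{OUT}/p$ term of the Yannakakis bound already dominates once $\mathrm{OUT}$ exceeds $\mathrm{IN}$.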

However, when the join involves three relations, the situation becomes subtler, and we start to see a separation between the two notions of optimality, meaning that the load complexity of an instance may not depend only on $\mathrm{IN}$ and $\mathrm{OUT}$. Let us start with the simplest 3-relation join $R_{1}(A)\Join R_{2}(B)\Join R_{3}(C)$, i.e., computing the Cartesian product of 3 sets of tuples. Consider a particular class $\mathfrak{R}(\mathrm{IN},\mathrm{OUT})$ when $\mathrm{OUT}=\mathrm{IN}^{2}$. Suppose the 3 relations have sizes $N_{1},N_{2},N_{3}$, respectively. Then $\mathfrak{R}(\mathrm{IN},\mathrm{OUT})$ consists of all instances with $N_{1}+N_{2}+N_{3}=\mathrm{IN}$ and $N_{1}N_{2}N_{3}=\mathrm{OUT}=\mathrm{IN}^{2}$. Consider the following two instances: (1) if $N_{1}=N_{2}=\Theta(\sqrt{\mathrm{IN}})$ and $N_{3}=\Theta(\mathrm{IN})$, then applying the same argument as above, except that each server can now produce $O(L^{3})$ join results, i.e., $p\cdot L^{3}=\Omega(\mathrm{OUT})$, we have $L=\Omega(({\mathrm{OUT}\over p})^{1/3})$; (2) if $N_{1}=1,N_{2}=N_{3}=\Theta(\mathrm{IN})$, then the problem boils down to computing the Cartesian product of two sets, which has a lower bound of $L=\Omega((\frac{\mathrm{OUT}}{p})^{1/2})$. The reason why instance (2) has a higher lower bound than instance (1) is that it has a higher skew, which causes more difficulty for the MPC model. Note that this phenomenon does not exist in the RAM model, in which both instances (in fact all instances in $\mathfrak{R}(\mathrm{IN},\mathrm{OUT})$) have the same complexity of $O(\mathrm{IN}+\mathrm{OUT})$. Fundamentally, this is because the MPC model is all about locality: An MPC algorithm should strive to bring all related tuples to one server so as to produce as many join results as possible, while a higher skew reduces locality.

We can extend this argument to computing the Cartesian product of $m$ sets of sizes $N_{1},\dots,N_{m}$. Any algorithm computing the full Cartesian product obviously must also compute the Cartesian product of any subset of the $m$ sets, thus the load must be at least

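$$L_{\mathrm{Cartesian}}(p,\mathcal{R}):=\max_{S\subseteq\{1,\dots,m\}}\left(\frac{\prod_{i\in S}N_{i}}{p}\right)^{\frac{1}{|S|}}.\qquad(1)$$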

It has been shown that the HyperCube algorithm [3] incurs a load of $L_{\mathrm{Cartesian}}(p,\mathcal{R})\cdot\log^{O(1)}p$ on any instance $\mathcal{R}$ [8]. Thus, it can be considered as an instance-optimal algorithm for computing Cartesian products, with an optimality ratio of $\log^{O(1)}p$ .
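The Cartesian-product lower bound, the maximum over all non-empty subsets $S$ of $(\prod_{i\in S}N_{i}/p)^{1/|S|}$, can be evaluated directly for small $m$. The sketch below is illustrative only: the function name and the concrete sizes are assumptions, chosen to mimic the two skew scenarios of the 3-set example with $\mathrm{OUT}\approx\mathrm{IN}^{2}$.

```python
from itertools import combinations
from math import prod

def cartesian_lower_bound(sizes, p):
    """max over non-empty subsets S of (prod_{i in S} N_i / p) ** (1 / |S|)."""
    best = 0.0
    for k in range(1, len(sizes) + 1):
        for subset in combinations(sizes, k):
            best = max(best, (prod(subset) / p) ** (1.0 / k))
    return best

p = 10**4
# Instance (1): two small sets and one large set, OUT = 10^12.
balanced = cartesian_lower_bound([1000, 1000, 10**6], p)
# Instance (2): one singleton set and two large sets, OUT = 10^12.
skewed = cartesian_lower_bound([1, 10**6, 10**6], p)
```

Both instances have the same output size, yet the bound for the skewed instance is attained by the two large sets alone and is more than an order of magnitude higher than the bound for the first instance, which is attained by the full set of three relations.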

The binary join and Cartesian products are the simplest joins. Then the obvious question is, do instance-optimal algorithms exist for larger classes of joins? If not, how about output-optimal algorithms? These are the main questions we wish to address in this paper.

1.4 Classification of acyclic joins

Before describing our results, we first define some sub-classes of acyclic joins.

Acyclic joins [9]. We use the common notion of acyclicity, which is also known as $\alpha$ -acyclicity. A join $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ is acyclic if there exists an undirected tree $\mathcal{T}$ whose nodes are in one-to-one correspondence with the edges in $\mathcal{E}$ such that for any vertex $v\in\mathcal{V}$ , all nodes containing $v$ form a connected subtree. Such a tree $\mathcal{T}$ is called the join tree of $\mathcal{Q}$ . Note that the join tree may not be unique for a given $\mathcal{Q}$ .
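One standard way to test $\alpha$-acyclicity is the GYO reduction, which is equivalent to the existence of a join tree. The paper does not spell out a test, so the sketch below is a swapped-in textbook technique, and the function name `is_alpha_acyclic` is an assumption.

```python
def is_alpha_acyclic(edges):
    """GYO reduction on a list of hyperedges (iterables of attribute names).

    Repeatedly (1) drop attributes that occur in exactly one hyperedge and
    (2) drop a hyperedge that is empty or contained in another hyperedge.
    The hypergraph is alpha-acyclic iff every hyperedge is eventually dropped.
    """
    edges = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        # Rule 1: attributes private to a single hyperedge are removed.
        for e in edges:
            private = {v for v in e if sum(v in f for f in edges) == 1}
            if private:
                e -= private
                changed = True
        # Rule 2: remove one empty or covered hyperedge, then rescan.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return not edges

line3 = [("A", "B"), ("B", "C"), ("C", "D")]
triangle = [("B", "C"), ("A", "C"), ("A", "B")]
```

On the line-3 join the reduction removes every hyperedge, while on the triangle join neither rule ever applies, which matches the classification of the triangle as the simplest cyclic join.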

Hierarchical joins [12]. A join $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ is hierarchical if for every pair of vertices $x,y$, we have $\mathcal{E}_{x}\subseteq\mathcal{E}_{y}$, or $\mathcal{E}_{y}\subseteq\mathcal{E}_{x}$, or $\mathcal{E}_{x}\cap\mathcal{E}_{y}=\emptyset$, where $\mathcal{E}_{x}=\{e\in\mathcal{E}:x\in e\}$ is the set of hyperedges containing attribute $x$. Thus, all attributes can be organized into a forest, such that $x$ is a descendant of $y$ iff $\mathcal{E}_{x}\subseteq\mathcal{E}_{y}$. Hierarchical joins have enjoyed nice properties in probabilistic databases [12, 13] and query answering under updates [10], but their role in the MPC model has not been studied so far.

r-hierarchical joins. We consider a slightly larger class than hierarchical joins. A reduce procedure on a hypergraph $(\mathcal{V},\mathcal{E})$ is to remove an edge $e\in\mathcal{E}$ if there exists another edge $e^{\prime}\in\mathcal{E}$ such that $e\subseteq e^{\prime}$. We can repeatedly apply the reduce procedure until no more edges can be removed, and the resulting hypergraph is said to be reduced. A join is r-hierarchical if its reduced join hypergraph is hierarchical. A hierarchical join must be r-hierarchical, but not vice versa. For example, the join $R_{1}(A)\Join R_{2}(A,B)\Join R_{3}(B)$ is r-hierarchical but not hierarchical. On the other hand, an r-hierarchical join must be acyclic.
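The hierarchical and r-hierarchical conditions translate almost literally into code. This is a minimal sketch with hypothetical function names; it applies the reduce procedure in one pass by keeping only the maximal hyperedges, which gives the same result as removing covered edges one at a time because containment is transitive.

```python
from itertools import combinations

def edges_of(attr, edges):
    """E_x: the indices of the hyperedges that contain attribute attr."""
    return frozenset(i for i, e in enumerate(edges) if attr in e)

def is_hierarchical(edges):
    attrs = set().union(*edges) if edges else set()
    for x, y in combinations(attrs, 2):
        ex, ey = edges_of(x, edges), edges_of(y, edges)
        # Every pair must be nested or disjoint.
        if not (ex <= ey or ey <= ex or not (ex & ey)):
            return False
    return True

def reduce_hypergraph(edges):
    """Keep only maximal hyperedges; among identical edges keep the first."""
    edges = [frozenset(e) for e in edges]
    kept = []
    for i, e in enumerate(edges):
        covered = any(
            e < f or (e == f and j < i)
            for j, f in enumerate(edges)
            if j != i
        )
        if not covered:
            kept.append(e)
    return kept

def is_r_hierarchical(edges):
    return is_hierarchical(reduce_hypergraph(edges))

example = [("A",), ("A", "B"), ("B",)]  # R1(A) join R2(A,B) join R3(B)
```

For the example from the text, $\mathcal{E}_{A}$ and $\mathcal{E}_{B}$ overlap in $R_{2}$ without being nested, so the join is not hierarchical, but after reduction only $R_{2}(A,B)$ remains and the condition holds trivially.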

Tall-flat joins [26]. A join $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ is tall-flat if one can order the attributes as $x_{1},x_{2},\cdots,x_{h},y_{1},y_{2},\cdots,y_{l}$ such that (1) $\mathcal{E}_{x_{1}}\supseteq\mathcal{E}_{x_{2}}\supseteq\cdots\supseteq\mathcal{E}_{x_{h}}$; (2) $\mathcal{E}_{x_{h}}\supseteq\mathcal{E}_{y_{j}}$ for $j=1,2,\cdots,l$; and (3) $|\mathcal{E}_{y_{j}}|=1$ for $j=1,2,\cdots,l$. Obviously, a tall-flat join must be hierarchical.
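A possible check for the tall-flat condition is sketched below. It rests on two assumptions that are not stated in the text: attributes occurring in at least two relations are forced to be the $x$'s (only the $y$'s may have $|\mathcal{E}_{y}|=1$), and the case $h=0$, where no attribute is shared, is treated as satisfying condition (2) vacuously. The function name is hypothetical.

```python
def is_tall_flat(edges):
    edges = [frozenset(e) for e in edges]
    attrs = set().union(*edges) if edges else set()
    E = {v: frozenset(i for i, e in enumerate(edges) if v in e) for v in attrs}

    # Attributes in two or more relations must form the chain x_1, ..., x_h.
    xs = sorted((v for v in attrs if len(E[v]) >= 2), key=lambda v: -len(E[v]))
    # Attributes in exactly one relation are taken as y_1, ..., y_l.
    ys = [v for v in attrs if len(E[v]) == 1]

    # Condition (1): E_{x_1} contains E_{x_2} contains ... contains E_{x_h}.
    if any(not E[a] >= E[b] for a, b in zip(xs, xs[1:])):
        return False
    # Condition (2): every y lives in a relation that also contains x_h.
    if xs:
        return all(E[y] <= E[xs[-1]] for y in ys)
    # Assumption: with h = 0, condition (2) holds vacuously.
    return True

star = [("A", "B"), ("A", "C")]
hierarchical_only = [("A", "B", "C"), ("A", "B", "D"), ("A", "E")]
```

The second example is hierarchical, since $\mathcal{E}_{B}$ and $\mathcal{E}_{E}$ are disjoint and both nested in $\mathcal{E}_{A}$, but it is not tall-flat under this check because attribute $E$ sits in a relation that does not contain $B$.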

The relationships of these joins are illustrated in Figure 1.

1.5 Our results

This paper gives an almost complete characterization of acyclic joins with respect to instance-optimality and output-optimality in the MPC model. Our results are summarized in Table 1, and we explain them below in more detail.

Instance-optimality

First, we extend the Cartesian product lower bound (1) to a general join $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ . For any subset of relations $S\subseteq\mathcal{E}$ , define

$$\mathcal{Q}(\mathcal{R},S):=\left(\Join_{e\in S}R(e)\right)\ltimes\mathcal{Q}(\mathcal{R}),$$

i.e., the join results of relations in $S$ that are part of a full join result. Clearly, any algorithm computing $\mathcal{Q}(\mathcal{R})$ must implicitly also compute $\mathcal{Q}(\mathcal{R},S)$ for every $S$. Because each join result in $\mathcal{Q}(\mathcal{R},S)$ consists of $|S|$ tuples, one from each relation in $S$, a server can emit at most $O(L^{|S|})$ join results of $\mathcal{Q}(\mathcal{R},S)$, so we must have $p\cdot L^{|S|}=\Omega(|\mathcal{Q}(\mathcal{R},S)|)$. Thus, we obtain the following per-instance lower bound on the load:

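$$L_{\textrm{instance}}(p,\mathcal{R}):=\max_{S\subseteq\mathcal{E}}\left(\frac{|\mathcal{Q}(\mathcal{R},S)|}{p}\right)^{\frac{1}{|S|}}.\qquad(2)$$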
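The per-instance bound, the maximum over subsets $S\subseteq\mathcal{E}$ of $(|\mathcal{Q}(\mathcal{R},S)|/p)^{1/|S|}$, can be evaluated by brute force on a toy instance. In this sketch $\mathcal{Q}(\mathcal{R},S)$ is obtained by projecting every full join result onto the tuples it uses from the relations in $S$, which coincides with the semijoin definition when each relation is a set. The function names and the instance are assumptions for illustration.

```python
from itertools import combinations, product

def full_join(relations):
    """Return (edges, results); each result is one tuple index per relation."""
    edges = list(relations)
    results = []
    for combo in product(*(range(len(relations[e])) for e in edges)):
        assignment = {}
        if all(
            assignment.setdefault(attr, val) == val
            for e, idx in zip(edges, combo)
            for attr, val in zip(e, relations[e][idx])
        ):
            results.append(combo)
    return edges, results

def instance_lower_bound(relations, p):
    """max over non-empty S of (|Q(R, S)| / p) ** (1 / |S|), constants ignored."""
    edges, results = full_join(relations)
    best = 0.0
    for k in range(1, len(edges) + 1):
        for S in combinations(range(len(edges)), k):
            # Distinct sub-combinations over S that extend to a full result.
            q_rs = {tuple(r[i] for i in S) for r in results}
            best = max(best, (len(q_rs) / p) ** (1.0 / k))
    return best

# Hypothetical instance: 4 x 4 matching tuples on B = 0, plus one dangling tuple.
R = {
    ("A", "B"): [(a, 0) for a in range(4)] + [(9, 99)],
    ("B", "C"): [(0, c) for c in range(4)],
}
bound = instance_lower_bound(R, p=4)
```

The dangling tuple $(9,99)$ is counted in $\mathrm{IN}$ but never in any $\mathcal{Q}(\mathcal{R},S)$, so here the bound is determined by the 16 full join results with $|S|=2$.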

The BinHC algorithm [8] is a generalization of the HyperCube algorithm to general joins. The load of the BinHC algorithm is parameterized by the degrees of all subsets of attribute values (more detail given in Section 3). Beame et al. [8] show that BinHC is optimal (up to polylog factors) within the class of instances sharing the same degrees, among all one-round MPC algorithms. In this paper, we strengthen this result by giving a new analysis of the BinHC algorithm, showing that it is actually instance-optimal (up to polylog factors) for (1) all tall-flat joins, and (2) all r-hierarchical joins provided that the instance does not contain dangling tuples (a dangling tuple is one that does not appear in the join results). Furthermore, because the per-instance lower bound (2) also holds for multi-round algorithms, these instance-optimality results extend to multi-round algorithms as well. For r-hierarchical joins with dangling tuples, one-round algorithms cannot achieve $O({\mathrm{IN}\over p}+L_{\textrm{instance}}(p,\mathcal{R}))$ load, but we can remove the dangling tuples in $O(1)$ rounds with $O({\mathrm{IN}\over p})$ load [34], and then run the BinHC algorithm. This gives a multi-round, $({\mathrm{IN}\over p}+L_{\textrm{instance}}(p,\mathcal{R}))\log^{O(1)}p$-load algorithm, where the $O(1)$ exponent depends on the query size, and is at least $m$, the number of relations. Then we give a new multi-round algorithm for r-hierarchical joins with load $O({\mathrm{IN}\over p}+L_{\textrm{instance}}(p,\mathcal{R}))$, i.e., improving the instance-optimality ratio from $\log^{O(1)}p$ to $O(1)$.

The instance-optimal load $O({\mathrm{IN}\over p}+L_{\textrm{instance}}(p,\mathcal{R}))$ is not achievable beyond r-hierarchical joins³. More precisely, we show that for every acyclic join that is not r-hierarchical, there is an instance $\mathcal{R}$ with $L_{\textrm{instance}}(p,\mathcal{R})=O({\mathrm{IN}\over p})$ but any multi-round algorithm must incur a load of $\tilde{\Omega}({\mathrm{IN}\over p^{1/2}})$⁴ on $\mathcal{R}$. This is actually a corollary following our output-sensitive lower bound, which is described next.

³ But instance-optimal algorithms are still possible, if some higher per-instance lower bound can be derived.

⁴ The $\tilde{O}$ and $\tilde{\Omega}$ notation suppresses polylog factors.

Output-optimality

One-round algorithms have severe limitations with respect to $\mathrm{OUT}$: As shown in [26], any non-tall-flat join must incur load $\omega({\mathrm{IN}\over p}+{\mathrm{OUT}\over p})$ if only one round is allowed. On the other hand, as mentioned, the classical Yannakakis algorithm is a multi-round MPC algorithm that works for all acyclic joins and has a load of $O(\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p})$ [2, 25]. Thus, our focus will be on multi-round algorithms and on whether this result can be improved. An instance-optimal algorithm must also be output-optimal, so we have automatically obtained output-optimal algorithms for r-hierarchical joins. In fact, we show that $L_{\textrm{instance}}(p,\mathcal{R})=O({\mathrm{IN}\over p}+\sqrt{\mathrm{OUT}\over p})$ for all r-hierarchical joins, so this is already an asymptotic improvement over the Yannakakis algorithm. But the more important question is, how about acyclic joins that are not r-hierarchical?

Our main output-optimal result is a new MPC algorithm for acyclic joins achieving a load of $O(\frac{\mathrm{IN}}{p}+\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p})$, which is an $O(\sqrt{\mathrm{OUT}\over\mathrm{IN}})$-factor improvement over the Yannakakis algorithm. Interestingly enough, we observe that while the join order does not change the running time of the Yannakakis algorithm by more than a constant factor in the RAM model, it does have asymptotic consequences in the MPC model. However, there are instances on which no join order is good, in which case we recursively decompose the join into multiple parts, and choose a good join order for each part. The number of parts is exponential in the query size but constant in terms of data size. To achieve this result, we first give a simple algorithm on the line-3 join $R_{1}(A,B)\Join R_{2}(B,C)\Join R_{3}(C,D)$ (Section 4), and then extend it to arbitrary acyclic joins (Section 5).

We also give a matching lower bound (up to a log factor), thereby establishing the output-optimality of the algorithm. However, the lower bound only holds when $\mathrm{OUT}=O(p\cdot\mathrm{IN})$ . This restriction on $\mathrm{OUT}$ is actually inherent, because the $O(\frac{\mathrm{IN}}{p}+\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p})$ bound cannot be optimal for all values of $\mathrm{OUT}$ . When $\mathrm{OUT}$ is large enough, a worst-case optimal algorithm will take over. For example, on the line-3 join, the worst-case optimal algorithm, which has load $O({\mathrm{IN}\over\sqrt{p}})$ [24, 19], becomes better when $\mathrm{OUT}>p\cdot\mathrm{IN}$ . Our lower bound actually indicates that the $O({\mathrm{IN}\over\sqrt{p}})$ bound is output-optimal (though it does not depend on $\mathrm{OUT}$ ) for all $\mathrm{OUT}>p\cdot\mathrm{IN}$ . Thus, we now have a complete understanding of the line-3 join with respect to output-optimality. For more complicated joins, their worst-case optimal algorithms have a higher load, and the output-optimality for $\mathrm{OUT}$ values in the middle is still unclear.
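The three load bounds discussed for the line-3 join can be compared numerically. The sketch below drops all constants and polylog factors, so it only illustrates where the crossover at $\mathrm{OUT}=p\cdot\mathrm{IN}$ comes from; the function names and the sample parameters are assumptions.

```python
import math

def yannakakis_load(IN, OUT, p):
    """MPC version of the Yannakakis algorithm: IN/p + OUT/p."""
    return IN / p + OUT / p

def output_sensitive_load(IN, OUT, p):
    """Load of the new acyclic-join algorithm: IN/p + sqrt(IN * OUT)/p."""
    return IN / p + math.sqrt(IN * OUT) / p

def line3_worst_case_load(IN, p):
    """Worst-case optimal load for the line-3 join: IN / sqrt(p)."""
    return IN / math.sqrt(p)

IN, p = 10**6, 100          # hypothetical parameters, p * IN = 10^8
moderate_out = 10**7        # below p * IN
huge_out = 10**10           # above p * IN

wins_moderate = output_sensitive_load(IN, moderate_out, p) < min(
    yannakakis_load(IN, moderate_out, p), line3_worst_case_load(IN, p)
)
wins_huge = line3_worst_case_load(IN, p) < output_sensitive_load(IN, huge_out, p)
```

Setting $\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}/p=\mathrm{IN}/\sqrt{p}$ and solving for $\mathrm{OUT}$ gives $\mathrm{OUT}=p\cdot\mathrm{IN}$, which is why the output-sensitive bound is the better one below that threshold and the worst-case optimal bound is the better one above it.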

Next, we extend these results to join-aggregate (including join-project) queries that are free-connex (formal definition given in Section 6). More precisely, we give an MPC algorithm with linear load that removes all the non-output attributes of the query, converting it into an acyclic join. Then we apply our instance-optimal or output-optimal algorithm on the resulting acyclic join.

Finally in Section 7, we turn to the triangle join $R_{1}(B,C)\Join R_{2}(A,C)\Join R_{3}(A,B)$, which is the simplest cyclic join, and give the first output-sensitive lower bound $\tilde{\Omega}(\min\{\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p},\frac{\mathrm{IN}}{p^{2/3}}\})$ in the MPC model. Previously, only a worst-case bound of $\Omega(\frac{\mathrm{IN}}{p^{2/3}})$ was known [24, 30] and that construction uses an instance with the maximum possible output size $\mathrm{OUT}=\mathrm{IN}^{3/2}$. Note that the second term in the lower bound is smaller as long as $\mathrm{OUT}=\Omega(\mathrm{IN}\cdot p^{1/3})$, which means that under this parameter range, the $\tilde{O}(\frac{\mathrm{IN}}{p^{2/3}})$-load algorithm [24] is not only worst-case optimal but also output-optimal. For $\mathrm{OUT}=o(\mathrm{IN}\cdot p^{1/3})$, the lower bound becomes $\tilde{\Omega}({\mathrm{IN}\over p}+{\mathrm{OUT}\over p})$ while we do not have a matching upper bound yet (some explanation on why this is difficult is given below). But at least, this shows a separation from acyclic joins, i.e., cyclic joins are harder than acyclic ones by at least a factor of $\tilde{\Omega}(\sqrt{\mathrm{OUT}\over\mathrm{IN}})$.
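The threshold $\mathrm{OUT}=\Omega(\mathrm{IN}\cdot p^{1/3})$ and the $\sqrt{\mathrm{OUT}/\mathrm{IN}}$ separation factor quoted above can be rederived in a few lines. The following is a sketch of that arithmetic with polylog factors suppressed; the second step assumes $\mathrm{OUT}\geq\mathrm{IN}$.

```latex
% Which term of the minimum is smaller?
\frac{\mathrm{IN}}{p^{2/3}} \le \frac{\mathrm{IN}+\mathrm{OUT}}{p}
\iff \mathrm{IN}\cdot p^{1/3} \le \mathrm{IN}+\mathrm{OUT}
\iff \mathrm{OUT} \ge \mathrm{IN}\,(p^{1/3}-1),
\quad\text{i.e. } \mathrm{OUT}=\Omega(\mathrm{IN}\cdot p^{1/3}).

% Separation from acyclic joins when OUT = o(IN * p^{1/3}) and OUT >= IN:
\frac{\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p}}
     {\frac{\mathrm{IN}}{p}+\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p}}
= \frac{\mathrm{IN}+\mathrm{OUT}}{\mathrm{IN}+\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}
= \Theta\!\left(\frac{\mathrm{OUT}}{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}\right)
= \Theta\!\left(\sqrt{\frac{\mathrm{OUT}}{\mathrm{IN}}}\right).
```

The numerator is the triangle lower bound in the small-output regime and the denominator is the acyclic upper bound, so the ratio is the factor by which the triangle join is provably harder.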

1.6 Other related results

Most existing work on join algorithms in the MPC model has focused on the worst case. Here, the goal is to achieve a load of $O({\mathrm{IN}\over p^{1/\rho}})$, where $\rho$ is the fractional edge cover number of the hypergraph $\mathcal{Q}$. So far, this bound has been achieved on Berge-acyclic joins⁵ [19], joins where each relation has two attributes (i.e., $\mathcal{Q}$ is an ordinary graph) [22], and LW joins [24]⁶. Whether this bound can be achieved for arbitrary joins, or even just $\alpha$-acyclic joins, is still open. Assuming this is achievable, our output-sensitive algorithm is still better when $\mathrm{OUT}=O(p^{2-2/\rho}\cdot\mathrm{IN})$.

⁵ A sub-class of $\alpha$-acyclic joins.

⁶ The LW join algorithm presented in [24] has a mistake, but it can be fixed, although non-trivially.

Joglekar et al. [20] described a multi-round MPC algorithm for arbitrary joins, whose load complexity depends on $\mathrm{IN},\mathrm{OUT}$ , as well as the degrees of the values. However, the load of their algorithm is at least $\Omega(\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p})$ , i.e., no better than the Yannakakis algorithm on acyclic joins.

In the RAM model, output-sensitive join algorithms have been extensively studied. The running time of most algorithms is of the form $O(\mathrm{IN}^{w}+\mathrm{OUT})$, where $w$ is some notion of width of the hypergraph $\mathcal{Q}$ [15, 17, 27, 23]. However, it is not clear if this is optimal. Even for the triangle join, it is not known what the output-optimal bound is. For the triangle join, any notion of width has $w\geq 1.5$, thus these algorithms are no better than the worst-case optimal algorithm, which has running time $O(\mathrm{IN}^{1.5})$. Recently, an improved triangle algorithm has been developed with a running time of $O(\mathrm{IN}^{1.408}+\mathrm{IN}^{1.222}\mathrm{OUT}^{0.186})$ [11], which is better than the worst-case optimal algorithm when $\mathrm{OUT}<\mathrm{IN}^{1.495}$. On the lower bound side, it is known that when $\mathrm{OUT}\geq\mathrm{IN}$, at least $\Omega(\mathrm{IN}^{4/3-o(1)})$ time is needed, assuming the 3SUM conjecture [31]. Thus, output-optimal algorithms for cyclic joins remain a wide open problem.

2 MPC Primitives

Assume $\mathrm{IN}>p^{1+\epsilon}$ where $\epsilon>0$ is any small constant. We first introduce the following primitives in the MPC model, all of which can be computed with linear load $O(\frac{\mathrm{IN}}{p})$ in $O(1)$ rounds.

Multi-numbering [18]: Given $\mathrm{IN}$ (key, value) pairs, for each key, assign consecutive numbers $1,2,3,\dots$ to all the pairs with the same key.

Sum-by-key [18]: Given $\mathrm{IN}$ (key, value) pairs, compute the sum of values for each key, where the sum is defined by any associative operator.

Multi-search [18]: Given $N_{1}$ elements $x_{1},x_{2},\cdots,x_{N_{1}}$ as set $X$ and $N_{2}$ elements $y_{1},y_{2},\cdots,y_{N_{2}}$ as set $Y$, where all elements are drawn from an ordered domain. Set $\mathrm{IN}=N_{1}+N_{2}$. For each $x_{i}$, find its predecessor in $Y$, i.e., the largest element in $Y$ that is smaller than $x_{i}$.
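The input/output behavior of the three primitives above can be pinned down with a sequential reference sketch. This is not an MPC implementation and says nothing about load or rounds; the function names are chosen here for illustration only.

```python
from bisect import bisect_left
from collections import defaultdict


def multi_numbering(pairs):
    """Number the pairs that share a key 1, 2, 3, ... in input order."""
    counter = defaultdict(int)
    numbered = []
    for key, value in pairs:
        counter[key] += 1
        numbered.append((key, value, counter[key]))
    return numbered


def sum_by_key(pairs, op=lambda a, b: a + b):
    """Combine the values of each key with an associative operator."""
    totals = {}
    for key, value in pairs:
        totals[key] = op(totals[key], value) if key in totals else value
    return totals


def multi_search(xs, ys):
    """For each x, return the largest y strictly smaller than x (or None)."""
    ys_sorted = sorted(ys)
    result = {}
    for x in xs:
        pos = bisect_left(ys_sorted, x)  # number of elements of ys below x
        result[x] = ys_sorted[pos - 1] if pos > 0 else None
    return result
```

In the MPC setting each of these would be realized by sorting and prefix-sum style communication; the sketch only fixes what the result should look like.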

Semi-Join: Given two relations $R_{1}$ and $R_{2}$ with a common attribute $x$ , the semijoin $R_{1}\ltimes R_{2}$ returns all the tuples in $R_{1}$ whose value on $x$ matches that of at least one tuple in $R_{2}$ . This can be reduced to a multi-search problem: For each $t\in R_{1}$ , if its predecessor on the $x$ attribute in $R_{2}$ is the same as that of $t$ , then it is in the semijoin.
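A minimal sequential sketch of the reduction from semi-join to searching in a sorted key set follows. As an assumption made here, the search returns the largest key no larger than the probe (a slight variant of the strict predecessor), so that an exact match can be detected; `key1` and `key2` are hypothetical accessor functions for attribute $x$.

```python
from bisect import bisect_right


def semijoin(r1, r2, key1, key2):
    """Return the tuples of r1 whose x-value appears in some tuple of r2."""
    keys = sorted({key2(t) for t in r2})
    kept = []
    for t in r1:
        pos = bisect_right(keys, key1(t))  # number of keys <= the probe
        # The closest key from below equals the probe iff a match exists.
        if pos > 0 and keys[pos - 1] == key1(t):
            kept.append(t)
    return kept
```

For example, with `r1 = [(1, "a"), (2, "b"), (3, "c")]` and `r2 = [("a",), ("c",), ("c",)]`, joining on the second field of `r1` and the first field of `r2` keeps the tuples with values `"a"` and `"c"`.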

Note that we can remove all dangling tuples in an acyclic-join [34] by a constant number of semi-joins, so it can be done in $O(1)$ rounds with linear load.

Parallel-packing: Given $\mathrm{IN}$ numbers $x_{1},x_{2},\cdots,x_{\mathrm{IN}}$ where $0<x_{i}\leq 1$ for $i=1,2,\cdots,{\mathrm{IN}}$ , group them into $m$ sets $Y_{1},Y_{2},\cdots,Y_{m}$ such that $\sum_{i\in Y_{j}}x_{i}\leq 1$ for all $j$ , and $\sum_{i\in Y_{j}}x_{i}\geq{1\over 2}$ for all but one $j$ . Initially, the $\mathrm{IN}$ numbers are distributed arbitrarily across all servers, and the algorithm should produce all pairs $(i,j)$ if $i\in Y_{j}$ when done. Note that $m\leq 1+2\sum_{i}x_{i}$ .

We are not aware of an explicit reference on this primitive, but it can be solved quite easily. Assume the input data is distributed across $p$ servers. We ask each server $i$ to first perform grouping on its local data. It is obvious that the condition above can be satisfied. The server then reports two numbers: $g_{i}$, the number of groups with sum between $1/2$ and $1$, and $h_{i}$, the sum of the remaining group with sum smaller than $1/2$. Note that $g_{i}$ and $h_{i}$ can be $0$. Next, we run the BSP algorithm for prefix-sums [14] on the $g_{i}$’s. After that, we can assign consecutive group id’s to each of the $g_{i}$ groups on each server $i$. For the remaining $p$ partial groups whose sums are $h_{i}$ with $0<h_{i}<1/2$, we recursively run the algorithm, using group id’s starting from $\sum_{i}g_{i}+1$. After the recursion returns, for each partial group $h_{i}$ that has been assigned to group $j$, we assign every element in $h_{i}$ to group $j$. The problem size reduces by a factor of $\mathrm{IN}/p$ after each round, so the number of rounds is $O(\log_{\mathrm{IN}/p}\mathrm{IN})=O(1)$.
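The procedure just described can be simulated sequentially. In the sketch below, `local_pack` is one possible local grouping rule (an assumption, since the text only says that a valid local grouping exists), group ids start from 0 rather than 1, and the recursive call is collapsed onto a single simulated server.

```python
def local_pack(items):
    """Group (id, weight) pairs held by one server.

    Returns (full_groups, leftover_ids, leftover_sum): every full group has
    total weight in [1/2, 1], and the leftover has total weight below 1/2.
    """
    full, current, current_sum = [], [], 0
    for item_id, w in items:
        if 2 * w >= 1:
            full.append([item_id])        # a large item forms its own group
            continue
        current.append(item_id)
        current_sum += w
        if 2 * current_sum >= 1:          # sum stays below 1 because w < 1/2
            full.append(current)
            current, current_sum = [], 0
    return full, current, current_sum


def parallel_pack(servers):
    """servers: list of lists of (id, weight). Returns dict id -> group id."""
    assignment, next_group = {}, 0
    leftovers = []                        # at most one partial group per server
    for items in servers:
        full, rest, rest_sum = local_pack(items)
        for group in full:                # consecutive ids, as after prefix sums
            for item_id in group:
                assignment[item_id] = next_group
            next_group += 1
        if rest:
            leftovers.append((tuple(rest), rest_sum))
    # Recursive step: pack the partial groups, treating each as one item.
    full, rest, _ = local_pack(leftovers)
    for group in full + ([rest] if rest else []):
        for partial in group:
            for item_id in partial:
                assignment[item_id] = next_group
        next_group += 1
    return assignment
```

Passing exact weights (for example `fractions.Fraction`) avoids floating-point effects at the $1/2$ threshold.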

Server allocation [18]: Assume each tuple has a subproblem id $j$ , which identifies the subproblem it belongs to (the $j$ ’s do not have to be consecutive), and $p(j)$ , which is the number of servers allocated to subproblem $j$ . The goal is to attach to each tuple a range $[p_{1}(j),p_{2}(j)]$ , such that the ranges of different subproblems are disjoint and $\max_{j}p_{2}(j)\leq\sum_{j}p(j)$ . Thus, each tuple $t$ knows which servers have been allocated to the subproblem to which $t$ belongs.
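One way to picture the server allocation primitive is as a prefix sum over the per-subproblem demands. The sketch below is sequential and illustrative only; `allocate_servers` and the choice to process subproblem ids in sorted order are assumptions made here.

```python
def allocate_servers(demand):
    """demand: dict subproblem id j -> p(j).

    Returns dict j -> (p1, p2), a 1-based inclusive range of servers.
    The ranges are disjoint and the largest p2 equals the total demand.
    """
    ranges, used = {}, 0
    for j in sorted(demand):
        ranges[j] = (used + 1, used + demand[j])
        used += demand[j]                 # running prefix sum of the demands
    return ranges
```

Each tuple with subproblem id $j$ would then look up `ranges[j]` to learn which servers serve its subproblem.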

Computing the output size $\mathrm{OUT}$ of an acyclic join: This primitive is a special case of our join-aggregate algorithm, which will be described in Section 6.

3 r-Hierarchical Joins

Recall that in a hierarchical join, all attributes can be organized into a forest, such that $x$ is a descendant of $y$ if and only if $\mathcal{E}_{x}\subseteq\mathcal{E}_{y}$. Each $e\in\mathcal{E}$ corresponds to a node $x$ in the forest, such that $e$ contains precisely $x$ and all its ancestors. A subclass of hierarchical joins are tall-flat joins. For a tall-flat join, this attribute forest takes the form of a special tree, which consists of a single “stem” plus a number of leaves at the bottom. For example, $\mathcal{Q}_{1}=R_{1}(x_{1})\Join R_{2}(x_{1},x_{2})\Join R_{3}(x_{1},x_{2},x_{3})\Join R_{4}(x_{1},x_{2},x_{3},x_{4})\Join R_{5}(x_{1},x_{2},x_{3},x_{5})\Join R_{6}(x_{1},x_{2},x_{3},x_{6})$ is a tall-flat join; $\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{1},x_{3},x_{4})\Join R_{3}(x_{1},x_{3},x_{5})$ is a hierarchical join (but not tall-flat). Their attribute forests (actually, trees for these two cases) are shown in Figure 2.

In this section, we study r-hierarchical joins. A join is r-hierarchical if its reduced join is hierarchical. For example, $\mathcal{Q}_{2}\Join R_{4}(x_{3},x_{5})\Join R_{5}(x_{5})$ is an r-hierarchical join (but not hierarchical). After an r-hierarchical join is reduced, its hyperedges must correspond to the leaves of the attribute forest.
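The two definitions can be checked mechanically on small queries. The sketch below tests the hierarchical property through the pairwise condition that, for any two attributes $x,y$, the sets $\mathcal{E}_{x}$ and $\mathcal{E}_{y}$ are either nested or disjoint; this standard formulation is used here as an assumption in place of explicitly building the attribute forest. The function names are illustrative.

```python
from itertools import combinations


def edges_of(attr, edges):
    """E_x: the names of the relations that contain attribute x."""
    return frozenset(name for name, attrs in edges.items() if attr in attrs)


def is_hierarchical(edges):
    """edges: dict relation name -> set of attributes."""
    attrs = sorted(set().union(*edges.values()))
    for x, y in combinations(attrs, 2):
        ex, ey = edges_of(x, edges), edges_of(y, edges)
        if ex & ey and not (ex <= ey or ey <= ex):
            return False                  # E_x and E_y overlap without nesting
    return True


def reduce_join(edges):
    """Drop every relation whose attribute set is contained in another one."""
    kept = {}
    for name, attrs in edges.items():
        contained = any(
            attrs < other or (attrs == other and o_name in kept)
            for o_name, other in edges.items() if o_name != name)
        if not contained:
            kept[name] = attrs
    return kept


def is_r_hierarchical(edges):
    return is_hierarchical(reduce_join(edges))
```

On the examples from the text, $\mathcal{Q}_{2}$ passes `is_hierarchical`, while $\mathcal{Q}_{2}\Join R_{4}(x_{3},x_{5})\Join R_{5}(x_{5})$ fails it but passes `is_r_hierarchical`.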

3.1 BinHC algorithm revisited

We mentioned above that the HyperCube algorithm [3] is an instance-optimal algorithm for computing Cartesian products. The BinHC algorithm [8] is a generalization of the HyperCube algorithm to general joins. For a join $\mathcal{Q}$, denote the residual query obtained by removing attributes $\mathbf{x}\subseteq\mathcal{V}$ as $\mathcal{Q}_{\mathbf{x}}$. Let $\mathbf{u}$ be any fractional edge packing of $\mathcal{Q}_{\mathbf{x}}$ that saturates the attributes $\mathbf{x}$, i.e., $\sum_{e:x\in e}\mathbf{u}(e)\geq 1$ for every $x\in\mathbf{x}$, and $\sum_{e:x\in e}\mathbf{u}(e)\leq 1$ for every $x\in\mathcal{V}-\mathbf{x}$. Assuming that all degree information is known in advance, this algorithm computes $\mathcal{Q}$ on instance $\mathcal{R}$ in a single round with a load of $\widetilde{O}(\frac{\mathrm{IN}}{p}+L_{\textrm{BinHC}}(p,\mathcal{R}))$, where

[TABLE]

Here we define $0^{0}=0$. Note that for any $e\subseteq\mathbf{x}$, $|\sigma_{\mathbf{x}=\mathbf{a}}R(e)|$ is either $0$ or $1$, so we can just set $\mathbf{u}(e)=0$ for each such $e$ in the definition above.

Theorem 1.

On any tall-flat join and any instance $\mathcal{R}$ , $L_{\textrm{BinHC}}(p,\mathcal{R})=O\left(L_{\textrm{instance}}(p,\mathcal{R})\right)$ .

Proof.

Below, we write $L:=L_{\textrm{instance}}(p,\mathcal{R})$ to avoid notational clutter. For an attribute set $\mathbf{x}$ and a fractional edge packing $\mathbf{u}$ of $\mathcal{Q}_{\mathbf{x}}$ , define

[TABLE]

To show $L_{\textrm{BinHC}}(p,\mathcal{R})=O(L)$ , it suffices to show that $p(\mathbf{x},\mathbf{u})=O(p)$ for all $\mathbf{x}$ and $\mathbf{u}$ .

Recall that in a tall-flat join, all attributes can be ordered as $x_{1},x_{2},\cdots,x_{h},y_{1},y_{2},\cdots,y_{l}$ such that (1) $\mathcal{E}_{x_{1}}\supseteq\mathcal{E}_{x_{2}}\supseteq\cdots\supseteq\mathcal{E}_{x_{h}}$ ; (2) $\mathcal{E}_{x_{h}}\supseteq\mathcal{E}_{y_{j}}$ for $j=1,2,\cdots,l$ ; (3) $|\mathcal{E}_{y_{j}}|=1$ for $j=1,2,\cdots,l$ . Consider an attribute set $\mathbf{x}\subseteq\mathcal{V}$ under the following two cases.

Case (1): $\{x_{1},x_{2},\cdots,x_{h}\}\subseteq\mathbf{x}$ . Consider any edge packing $\mathbf{u}$ of $\mathcal{Q}_{\mathbf{x}}$ that saturates $\mathbf{x}$ (in this case, we actually only need the fact $\mathbf{u}(e)\leq 1$ for all $e$ ). As observed, we can eliminate any assignment $\mathbf{a}\in\mathrm{dom}(\mathbf{x})$ if there exists an edge $e\in\mathcal{E}$ such that $\sigma_{\mathbf{x}=\mathbf{a}}R(e)=\emptyset$ , so it suffices to consider the remaining assignments $\mathbf{a^{*}}\in\mathrm{dom}(\mathbf{x})$ such that $\prod_{e\in\mathcal{E}}|\sigma_{\mathbf{x}=\mathbf{a^{*}}}R(e)|>0$ . Then, we can bound $p(\mathbf{x},\mathbf{u})$ as

[TABLE]

Case (2): There exists an $x_{i}\notin\mathbf{x}$ . Let $i$ be the smallest such $i$ . Let $\mathbf{u}$ be any edge packing of $\mathcal{Q}_{\mathbf{x}}$ . In particular, we have $\sum_{e:x_{i}\in e}\mathbf{u}(e)\leq 1$ . As observed earlier, we can set $\mathbf{u}(e)=0$ for any $e\subseteq\{x_{1},\dots,x_{i-1}\}\subseteq\mathbf{x}$ , so it suffices to consider the remaining edges. Due to the tall-flat property, all these edges contain $x_{i}$ . Thus,

[TABLE]

Combining the two cases, the theorem is proved. ∎

Theorem 2.

On any r-hierarchical join $\mathcal{Q}$ and instance $\mathcal{R}$ without dangling tuples, $L_{\textrm{BinHC}}(p,\mathcal{R})=O\left(L_{\textrm{instance}}(p,\mathcal{R})\right)$ .

Proof.

Let $\mathcal{T}$ be the forest of attributes corresponding to $\mathcal{Q}$. Consider an arbitrary attribute set $\mathbf{x}$. We say that a root-to-leaf path in $\mathcal{T}$, which corresponds to some $e\in\mathcal{E}$, is stuck at the highest attribute on the path that is not included in $\mathbf{x}$. In this way, all edges in $\mathcal{Q}$ can be divided into disjoint groups $\mathcal{E}_{1},\mathcal{E}_{2},\cdots,\mathcal{E}_{h}$, such that the edges in one group share a common stuck attribute. For any fractional edge packing $\mathbf{u}$, we must have $\sum_{e\in\mathcal{E}_{i}}\mathbf{u}(e)\leq 1$ for each $\mathcal{E}_{i}$ due to the packing constraint at the common stuck attribute of $\mathcal{E}_{i}$. Then, we can bound $p(\mathbf{x},\mathbf{u})$ as

[TABLE]

The last inequality needs some explanation: Any such $S$ includes at most one edge from each $\mathcal{E}_{i}$ . Thus, if two edges in $S$ share any common attribute, that attribute must be in $\mathbf{x}$ (otherwise they must belong to the same $\mathcal{E}_{i}$ ). Thus, for any $\mathbf{a}$ , all tuples in $\sigma_{\mathbf{x}=\mathbf{a}}R(e),e\in S$ join with each other, so we have

[TABLE]

Furthermore, since there are no dangling tuples, every join result in $\Join_{e\in S}R(e)$ must be part of a full join result, so $|\Join_{e\in S}R(e)|\leq|\mathcal{Q}(\mathcal{R},S)|$. ∎

Note that since $L_{\textrm{instance}}(p,\mathcal{R})$ is a per-instance lower bound even for multi-round algorithms, this means that the BinHC algorithm is instance-optimal even among all multi-round algorithms, up to polylogarithmic factors. This result also incorporates the instance-optimality of the HyperCube algorithm on Cartesian products, which are special r-hierarchical joins without dangling tuples.

Remark

Koutris and Suciu [26] show that non-tall-flat joins cannot be done with load $\tilde{O}({\mathrm{IN}\over p}+{\mathrm{OUT}\over p})$ by one-round algorithms. This does not contradict Theorem 2 since their lower bound construction uses dangling tuples. Our result implies that the key barrier for one-round algorithms is actually the dangling tuples. If they do not exist, one-round algorithms can go beyond tall-flat joins and solve r-hierarchical joins instance-optimally, up to polylog factors. On the other hand, once $O(1)$ rounds are allowed, dangling tuples become irrelevant, since they can be removed with linear load and $O(1)$ rounds.

3.2 An instance-optimal algorithm

We have shown that the BinHC algorithm is an instance-optimal algorithm for r-hierarchical joins, but it has an instance-optimality ratio of $\log^{O(1)}p$ , where the $O(1)$ exponent depends on the query size, and is at least $m$ , the number of relations. In this section, we improve the optimality ratio to $O(1)$ , i.e., achieving a load of $O(\frac{\mathrm{IN}}{p}+L_{\textrm{instance}}(p,\mathcal{R}))$ . Our algorithm uses $O(1)$ rounds, but note that BinHC also needs $O(1)$ rounds to remove the dangling tuples if they exist. Furthermore, our algorithm is deterministic while BinHC is randomized.

As a preprocessing step, we remove all dangling tuples. Then we reduce the join hypergraph, since if $e\subseteq e^{\prime}$, $R(e)$ will not affect the final join results after dangling tuples are removed. (Strictly speaking, this violates the tuple-based requirement that when emitting a join result, all the participating tuples must be present. This can be easily fixed: before removing $R(e)$, we attach each tuple $t\in R(e)$ to all tuples in $R(e^{\prime})$ that join with $t$. This can be done by the multi-search primitive with linear load.) Thus, we are left with a hierarchical join $\mathcal{Q}$ on an instance $\mathcal{R}$ with no dangling tuples.

Let $\mathcal{T}$ be the attribute forest of $\mathcal{Q}$ . Recall that after the join is reduced, each relation corresponds to a leaf of $\mathcal{T}$ , whose attributes are precisely the leaf’s ancestors in $\mathcal{T}$ . Our algorithm is recursive. We will show that the load of this algorithm is $O({\mathrm{IN}\over p}+L_{\textrm{instance}}(p,\mathcal{R}))$ for any hierarchical join $\mathcal{Q}$ on any instance $\mathcal{R}$ . To simplify notation, we will not derive the exact constant in the big-Oh, which depends (exponentially) on the recursion depth. Since the recursion depth is proportional to (actually, twice) the height of $\mathcal{T}$ , which is a constant, this is not a concern. Similarly, the number of servers employed by the algorithm will be $O(p)$ , where the hidden constant may also depend on the recursion depth.

The base case is when $\mathcal{Q}$ has just one relation. In this case the algorithm just emits all tuples in the relation, achieving the bound $O({\mathrm{IN}\over p}+L_{\textrm{instance}}(p,\mathcal{R}))$ trivially.

For a general hierarchical join $\mathcal{Q}$ and an instance $\mathcal{R}$, we proceed as follows. We first compute $L_{\textrm{instance}}(p,\mathcal{R})$: We use $p$ servers to compute $|\Join_{e\in S}R(e)|$ for each $S\subseteq\mathcal{E}$ (recall that computing the output size of an acyclic join is an MPC primitive). This requires $O(p)$ servers with load $O({\mathrm{IN}\over p})$. Note that $\mathcal{Q}(\mathcal{R},S)=\Join_{e\in S}R(e)$ when there are no dangling tuples in $\mathcal{R}$, so we can compute $L_{\textrm{instance}}(p,\mathcal{R})$ as defined in (2). Setting $L=\frac{\mathrm{IN}}{p}+L_{\textrm{instance}}(p,\mathcal{R})$, we will show below how to compute the join with $O(p)$ servers and load $O(L)$.

Let $k$ be the number of trees in $\mathcal{T}$ . We handle the following two cases using different recursive strategies:

Case (1): $k=1$

In this case, $\mathcal{T}$ is a tree. Suppose the root attribute of $\mathcal{T}$ is $x$ , which is included in all the relations. Consider every $a\in\mathrm{dom}(x)$ , and let $\mathcal{R}_{a}=\{\sigma_{x=a}R(e):e\in\mathcal{E}\}$ . It suffices to compute the residual query $\mathcal{Q}_{x}$ on each $\mathcal{R}_{a}$ , but all the $\mathcal{Q}_{x}(\mathcal{R}_{a})$ ’s have to be computed in parallel, using $O(p)$ servers in total. Thus, the key is to allocate servers to these residual queries appropriately so as to ensure a uniform load of $O(L)$ . To do so, we first compute $\mathrm{IN}_{a}$ , the input size of $\mathcal{R}_{a}$ , for all $a\in\mathrm{dom}(x)$ . Since $\mathrm{IN}_{a}=\sum_{e\in\mathcal{E}}|\sigma_{x=a}R(e)|$ , and each tuple belongs to exactly one $\mathcal{R}_{a}$ , this is a sum-by-key problem, i.e., each tuple $t$ with $\pi_{x}t=a$ has key $a$ and weight $1$ . Note that $\mathrm{IN}=\sum_{a}\mathrm{IN}_{a}$ .

An instance $\mathcal{R}_{a}$ is heavy if $\mathrm{IN}_{a}>L$ and light otherwise. We handle heavy and light instances in different ways.

Case (1.1): Light instances

We use the parallel-packing primitive to put the light instances into $O({\mathrm{IN}\over L})=O(p)$ groups with each group having total input size $O(L)$ . Then we simply use one server to solve the instances in each group. The load of each server is $O(L)$ .
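The bookkeeping of Case (1) up to this point (computing $\mathrm{IN}_{a}$, classifying instances, and grouping the light ones) can be sketched sequentially as follows. Two assumptions are made for illustration: the root attribute $x$ is stored as the first field of every tuple, and a simple next-fit rule stands in for the parallel-packing primitive.

```python
from collections import Counter


def instance_sizes(relations):
    """IN_a for every root value a (sum-by-key with weight 1 per tuple)."""
    return Counter(t[0] for tuples in relations.values() for t in tuples)


def classify(sizes, load):
    """Split the root values into heavy (IN_a > load) and light ones."""
    heavy = sorted(a for a, n in sizes.items() if n > load)
    light = sorted(a for a, n in sizes.items() if n <= load)
    return heavy, light


def group_light(sizes, light, load):
    """Next-fit grouping: each group has total input size at most `load`."""
    groups, current, total = [], [], 0
    for a in light:
        if current and total + sizes[a] > load:
            groups.append(current)        # close the group before it overflows
            current, total = [], 0
        current.append(a)
        total += sizes[a]
    if current:
        groups.append(current)
    return groups
```

With next-fit, any two consecutive groups together exceed `load`, so the number of groups is at most $1+2\cdot\mathrm{IN}/L$, in line with the $O(\mathrm{IN}/L)$ groups used in the text.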

Case (1.2): Heavy instances

By definition, there are at most ${\mathrm{IN}\over L}=O(p)$ heavy instances. For each heavy instance $\mathcal{R}_{a}$ , we allocate $p_{a}=\lceil p\cdot\frac{\mathrm{IN}_{a}}{\mathrm{IN}}\rceil$ servers to compute in parallel the join size $|\mathcal{Q}_{x}(\mathcal{R}_{a},S)|$ for all $a\in\mathrm{dom}(x)$ and all $S\subseteq\mathcal{E}$ . This uses $O(p)$ servers, and the load is $O(\max_{a}\frac{\mathrm{IN}_{a}}{p_{a}})=O(\frac{\mathrm{IN}}{p})$ . Next, for each heavy instance $\mathcal{R}_{a}$ , we allocate

[TABLE]

servers and compute $\mathcal{Q}_{x}(\mathcal{R}_{a})$ recursively in parallel. The number of servers used is

[TABLE]

By the induction hypothesis, computing $\mathcal{Q}_{x}(\mathcal{R}_{a})$ with $p_{a}$ servers has a load of (the big-Oh of)

[TABLE]

We bound each term of (3): For a heavy instance $\mathcal{R}_{a}$ , there must exist at least one $e\in\mathcal{E}$ such that $|\sigma_{x=a}R(e)|\geq\frac{1}{|\mathcal{E}|}\cdot\mathrm{IN}_{a}$ . Furthermore, since there are no dangling tuples, every tuple in $\sigma_{x=a}R(e)$ must be part of a join result of $\mathcal{Q}_{x}(\mathcal{R}_{a})$ , so $|\sigma_{x=a}R(e)|=|\mathcal{Q}_{x}(\mathcal{R}_{a},\{e\})|$ . Taking $S=\{e\}$ , we have

[TABLE]

so ${\mathrm{IN}_{a}\over p_{a}}=O(L)$ . The second term of (3) is also bounded by $L$ simply by the definition of $p_{a}$ .

Case (2): $k>1$

In this case, the join becomes a Cartesian product $\mathcal{Q}_{1}(\mathcal{R}_{1})\times\cdots\times\mathcal{Q}_{k}(\mathcal{R}_{k})$, where each $\mathcal{Q}_{i}(\mathcal{R}_{i})$ is a join under Case (1). One would attempt to first compute each $\mathcal{Q}_{i}(\mathcal{R}_{i})$ recursively, and then compute the Cartesian product, but this would not yield instance-optimality. Just consider an instance with $|\mathcal{Q}_{1}(\mathcal{R}_{1})|=1$ and $|\mathcal{Q}_{2}(\mathcal{R}_{2})|=p\cdot\mathrm{IN}$, where $\mathcal{Q}_{2}(\mathcal{R}_{2})=R_{1}(A,B)\Join R_{2}(B,C)$ with $|\mathrm{dom}(B)|=1,|R_{1}|=\mathrm{IN},|R_{2}|=p$. On this instance, we have $L_{\textrm{instance}}(p,\mathcal{R})=\max({\mathrm{IN}\over p},\sqrt{\mathrm{IN}})$, but if we took a two-step approach, merely storing the intermediate result $\mathcal{Q}_{2}(\mathcal{R}_{2})$ would incur a load of $\Omega(\mathrm{IN})$. This means that we have to interleave the two steps so as to avoid storing the intermediate results $\mathcal{Q}_{i}(\mathcal{R}_{i})$ explicitly.

We arrange servers into a $p_{1}\times p_{2}\times\cdots\times p_{k}$ hypercube, where the dimensions $p_{1},p_{2},\cdots,p_{k}$ will be determined later. We identify each server with coordinates $(c_{1},c_{2},\cdots,c_{k})$ , where $c_{i}\in[p_{i}]$ . For every combination $c_{1},\dots,c_{i-1},c_{i+1},\dots,c_{k}$ , the $p_{i}$ servers with coordinates $(c_{1},\cdots,c_{i-1},*,c_{i+1},\cdots,c_{k})$ form a group to compute $\mathcal{Q}_{i}(\mathcal{R}_{i})$ (using the algorithm under Case (1)). Yes, each $\mathcal{Q}_{i}(\mathcal{R}_{i})$ is computed $p_{1}\cdots p_{i-1}p_{i+1}\cdots p_{k}$ times, which seems to be a lot of redundancy. However, as we shall see, there will be no redundancy in terms of the final join results, and it is exactly due to this redundancy that we avoid the shuffling of the intermediate result and achieve an optimal load. Consider a particular server $(c_{1},\dots,c_{k})$ . It participates in $k$ groups, one for each $\mathcal{Q}_{i}(\mathcal{R}_{i}),i=1,\dots,k$ . For each $\mathcal{Q}_{i}(\mathcal{R}_{i})$ , it emits a subset of its join results, denoted $\mathcal{Q}_{i}(\mathcal{R}_{i},c_{1}\dots,c_{k})$ . Then the server emits the Cartesian product $\mathcal{Q}_{1}(\mathcal{R}_{1},c_{1}\dots,c_{k})\times\cdots\times\mathcal{Q}_{k}(\mathcal{R}_{k},c_{1}\dots,c_{k})$ . Note that for each group of servers computing $\mathcal{Q}_{i}(\mathcal{R}_{i})$ , the $p_{i}$ servers in the group emit $\mathcal{Q}_{i}(R_{i})$ with no redundancy, so there is no redundancy in emitting the Cartesian product.
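The claim that the hypercube arrangement emits the Cartesian product without redundancy can be checked with a small simulation. The sketch below assumes, purely for illustration, that the $c_{i}$-th server of a group emits every $p_{i}$-th result of $\mathcal{Q}_{i}(\mathcal{R}_{i})$ (a round-robin split that depends only on $c_{i}$); the actual split is whatever the Case (1) algorithm produces.

```python
from itertools import product


def hypercube_emit(results, dims):
    """results[i]: list of join results of Q_i; dims[i]: the dimension p_i.

    Returns everything emitted by all servers of the p_1 x ... x p_k grid.
    """
    emitted = []
    for coords in product(*(range(p) for p in dims)):
        # Slice of Q_i's results that server `coords` holds for dimension i.
        local = [res[c::p] for res, c, p in zip(results, coords, dims)]
        emitted.extend(product(*local))   # local Cartesian product
    return emitted
```

Because the slices along each dimension partition the results of $\mathcal{Q}_{i}(\mathcal{R}_{i})$, every combination is produced by exactly one server.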

It remains to show how to set $p_{1},\dots,p_{k}$ so that $p_{1}\cdots p_{k}=O(p)$ and each server has a load of $O(L)$ . To do so, we first compute $\mathrm{IN}_{i}$ , the input size of $\mathcal{R}_{i}$ , in the same way as in Case (1). An instance $\mathcal{R}_{i}$ is heavy if $\mathrm{IN}_{i}>L$ and light otherwise. For each heavy instance $\mathcal{R}_{i}$ , we use $p$ servers to compute $|\Join_{e\in S}\mathcal{R}_{i}(e)|=|\mathcal{Q}_{i}(R_{i},S)|$ for all $S\subseteq\mathcal{E}_{i}$ , where $\mathcal{E}_{i}$ is the set of edges in $\mathcal{Q}_{i}$ . This requires $O(p)$ servers with load $O(\frac{\mathrm{IN}}{p})$ . Then if $\mathcal{R}_{i}$ is light, we set $p_{i}=1$ ; otherwise set

[TABLE]

Let $I=\{i\mid\mathcal{R}_{i}\text{ is heavy}\}$ . The number of servers used is

[TABLE]

Finally, consider the load of each server, which serves to compute each $\mathcal{Q}_{i}(\mathcal{R}_{i})$ with a group of $p_{i}$ servers. For a light $\mathcal{R}_{i}$ , $p_{i}=1$ and it imposes a load of $O(L)$ . For a heavy $\mathcal{R}_{i}$ , by the induction hypothesis, the load is (the big-Oh of)

[TABLE]

This can be bounded by $O(L)$ using the same argument as Case (1.2). Summing over all $i=1,\dots,k$ increases the load by just a $k=O(1)$ factor.

The induction proof thus completes and we obtain the following result.

Theorem 3.

On any r-hierarchical join query $\mathcal{Q}$ and any instance $\mathcal{R}$ , there is an algorithm computing $\mathcal{Q}(\mathcal{R})$ in $O(1)$ rounds with load $O(\frac{\mathrm{IN}}{p}+L_{\textrm{instance}}(p,\mathcal{R}))$ .

Since an instance-optimal algorithm is also output-optimal, we also obtain an output-optimal algorithm for r-hierarchical joins. In fact, we can derive a closed-form formula of the output-optimal bound, i.e., we bound

$$\max_{\mathcal{R}\in\mathfrak{R}(\mathrm{IN},\mathrm{OUT})}L_{\textrm{instance}}(p,\mathcal{R})$$

as a function of $\mathrm{IN}$ and $\mathrm{OUT}$. First, observe that $L_{\textrm{instance}}(p,\mathcal{R})$ depends only on the reduced instance of $\mathcal{R}$, so we can assume that $\mathcal{R}$ contains no dangling tuples. Then, we can rewrite $\max_{\mathcal{R}\in\mathfrak{R}(\mathrm{IN},\mathrm{OUT})}L_{\textrm{instance}}(p,\mathcal{R})$ as

[TABLE]

Consider a specific subset $S\subseteq\mathcal{E}$ and an arbitrary instance $\mathcal{R}\in\mathfrak{R}(\mathrm{IN},\mathrm{OUT})$ . One trivial upper bound for $|\Join_{e\in S}R(e)|$ is $\mathrm{OUT}$ . The other bound is $\mathrm{IN}^{|S|}$ when the join degenerates to a Cartesian product. With these observations, we can bound the quantity above as:

[TABLE]

where $k^{*}$ denotes the integer $\lceil\log_{\mathrm{IN}}\mathrm{OUT}\rceil$ .

Next, we show that this is tight, i.e., there exists an instance $\mathcal{R}\in\mathfrak{R}(\mathrm{IN},\mathrm{OUT})$ such that for one subset $S_{1}\subseteq\mathcal{E}$ involving $k^{*}-1$ relations, we have $|\Join_{e\in S_{1}}R(e)|=\mathrm{IN}^{k^{*}-1}$, and for another subset $S_{2}\subseteq\mathcal{E}$ involving $k^{*}$ relations, we have $|\Join_{e\in S_{2}}R(e)|=\mathrm{OUT}$. Our hard instance construction is based on the following property of acyclic joins (this is probably known, but we cannot find an explicit reference):

Lemma 1.

An acyclic join has integral edge cover number.

Proof.

For an acyclic hypergraph $\mathcal{Q}=(\mathcal{V},\mathcal{E})$, denote the optimal edge covering of $\mathcal{Q}$ as $\mathcal{C}$. If there exist $e,e^{\prime}\in\mathcal{E}$ such that $e\subseteq e^{\prime}$, then $\mathcal{C}(e)=0$; otherwise we can just shift the weight from $e$ to $e^{\prime}$ and obtain a better (at least not worse) edge covering. So the optimal edge cover of $\mathcal{Q}$ is equivalent to that of the residual query obtained by removing $e$. If there exists an attribute that appears only in edge $e$, then $\mathcal{C}(e)=1$. So the optimal edge cover of $\mathcal{Q}$ is equivalent to the edge $e$ plus the optimal edge cover of the residual query obtained by removing all attributes in $e$. After recursively applying these two procedures, the query becomes empty, as implied by the GYO reduction [1]. In this process, every edge chosen by $\mathcal{C}$ has weight $1$. ∎
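The two reduction rules in the proof translate directly into a procedure that builds an integral edge cover. The sketch below is a sequential illustration under the assumption that the hypergraph is given as a mapping from relation names to attribute sets; `integral_edge_cover` is a name introduced here, and the function raises an error when neither rule applies (as happens on a cyclic hypergraph).

```python
def integral_edge_cover(edges):
    """edges: dict relation name -> set of attributes (acyclic hypergraph).

    Returns the names of the edges that receive weight 1.
    """
    remaining = {n: set(a) for n, a in edges.items() if a}
    cover = []
    while remaining:
        # Rule 1: an edge contained in another edge gets weight 0.
        contained = next(
            (n for n, a in remaining.items()
             if any(m != n and a <= b for m, b in remaining.items())), None)
        if contained is not None:
            del remaining[contained]
            continue
        # Rule 2: an edge with an attribute of its own gets weight 1.
        chosen = None
        for n, a in remaining.items():
            others = set().union(*(b for m, b in remaining.items() if m != n))
            if a - others:
                chosen = n
                break
        if chosen is None:
            raise ValueError("no rule applies; the hypergraph is not acyclic")
        cover.append(chosen)
        covered = remaining.pop(chosen)
        # Remove the covered attributes; edges left empty need no weight.
        remaining = {m: b - covered for m, b in remaining.items() if b - covered}
    return cover
```

On the line-3 join $R_{1}(A,B)\Join R_{2}(B,C)\Join R_{3}(C,D)$ this selects $R_{1}$ and $R_{3}$, giving an edge cover of size 2.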

Let $\mathcal{C}$ be the optimal edge covering of $\mathcal{Q}$. We identify two subsets of $\mathcal{C}$ with $k^{*}-1$ and $k^{*}$ edges respectively, denoted as $\mathcal{C}_{k^{*}-1},\mathcal{C}_{k^{*}}$, such that $\mathcal{C}_{k^{*}-1}\subseteq\mathcal{C}_{k^{*}}$. Two such subsets can always be found since $|\mathcal{C}|\geq\lceil\log_{\mathrm{IN}}\mathrm{OUT}\rceil$ by the AGM bound [4]. We consider a hard instance $\mathcal{R}$ constructed as below. Each edge $e\in\mathcal{C}$ is associated with at least one unique attribute denoted as $e(u)$. One of the unique attributes in $e$ for $e\in\mathcal{C}_{k^{*}-1}$ has $\mathrm{IN}$ distinct values in its domain, while one of the unique attributes in $e$ for $e\in\mathcal{C}_{k^{*}}-\mathcal{C}_{k^{*}-1}$ has $\frac{\mathrm{OUT}}{\mathrm{IN}^{k^{*}-1}}$ distinct values in its domain. The remaining attributes have only one value in their domains. On this instance, we have $|\Join_{e\in\mathcal{C}_{k^{*}-1}}R(e)|=\mathrm{IN}^{k^{*}-1}$ and $|\Join_{e\in\mathcal{C}_{k^{*}}}R(e)|=\mathrm{OUT}$.

Theorem 4.

There is an algorithm that computes any r-hierarchical join in $O(1)$ rounds with load $O\left(\frac{\mathrm{IN}}{p^{1/{\max\{1,k^{*}-1\}}}}+(\frac{\mathrm{OUT}}{p})^{\frac{1}{k^{*}}}\right)$ , where $k^{*}=\lceil\log_{\mathrm{IN}}\mathrm{OUT}\rceil$ . This bound is output-optimal.
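To get a feel for the bound in Theorem 4, the following sketch evaluates it numerically (ignoring the constant hidden in the big-Oh). As an assumption made here, $k^{*}$ is computed with integer arithmetic and clamped to at least 1, matching the convention $k^{*}=1$ for $\mathrm{OUT}\leq\mathrm{IN}$; the function names are illustrative.

```python
def k_star(in_size, out_size):
    """Smallest integer k >= 1 with in_size**k >= out_size (needs in_size >= 2)."""
    k, power = 1, in_size
    while power < out_size:
        power *= in_size
        k += 1
    return k


def output_optimal_load(in_size, out_size, p):
    """IN / p^(1/max(1, k*-1)) + (OUT/p)^(1/k*), without the hidden constant."""
    k = k_star(in_size, out_size)
    return in_size / p ** (1 / max(1, k - 1)) + (out_size / p) ** (1 / k)
```

For instance, with $\mathrm{IN}=1000$, $\mathrm{OUT}=10^{6}$ and $p=100$ we get $k^{*}=2$, and the expression evaluates to $1000/100+\sqrt{10^{4}}=110$.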

Below we give a cleaner output-sensitive bound. This is not tight for $\mathrm{OUT}>\mathrm{IN}^{2}$ , but easier to use. In particular, this result will be used in the analysis of the output-sensitive algorithm for arbitrary acyclic joins in Section 5.1.

Corollary 1.

There is an algorithm that computes any r-hierarchical join in $O(1)$ rounds with load $O(\frac{\mathrm{IN}}{p}+\sqrt{\frac{\mathrm{OUT}}{p}})$ .

Proof.

When $\mathrm{OUT}\leq\mathrm{IN}$ , we have $k^{*}=1$ and the load complexity is $O(\frac{\mathrm{IN}}{p})$ trivially. For $k^{*}\geq 2$ , the term $(\frac{\mathrm{OUT}}{p})^{{1/k^{*}}}$ is always no larger than $\sqrt{\frac{\mathrm{OUT}}{p}}$ . The term ${\mathrm{IN}/p^{1/{\max\{1,k^{*}-1\}}}}$ is also no larger than $\sqrt{\frac{\mathrm{OUT}}{p}}$ as long as $\mathrm{IN}^{2}\cdot p^{1-{2/\max\{1,k^{*}-1\}}}\leq\mathrm{OUT}$ , which always holds when $k^{*}\geq 2$ . ∎

4 Line-3 Join

The simplest acyclic but not r-hierarchical join is the line-3 join $R_{1}(A,B)\Join R_{2}(B,C)\Join R_{3}(C,D)$ . In this section, we give an output-optimal MPC algorithm with load $O(\frac{\mathrm{IN}}{p}+\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p})$ , together with a matching lower bound. In particular, the lower bound implies that instance-optimal algorithms are not possible for the line-3 join. In Section 5, we extend these results to arbitrary acyclic joins.

4.1 The Yannakakis algorithm revisited

The Yannakakis algorithm first removes all the dangling tuples, which is just a series of semi-joins and can be done with load $O({\mathrm{IN}\over p})$ . Then the algorithm performs pairwise joins in some arbitrary order. In the RAM model, the join order does not affect the asymptotic running time: After dangling tuples have been removed, any intermediate join result is part of a full join result, so the running time of the last join, which is $\Theta(\mathrm{OUT})$ , dominates that of any intermediate join. In fact, this argument applies on a per-instance basis, and the Yannakakis algorithm is instance-optimal on any instance with any join order.

Interestingly, the join order does matter in the MPC model. Consider the following instance of the line-3 join (see the top half of Figure 3). Attributes $A,B,C,D$ have domain sizes $\frac{\mathrm{OUT}}{N},\frac{N^{2}}{\mathrm{OUT}},N,1$ , respectively. Set $R_{1}(A,B)=\mathrm{dom}(A)\times\mathrm{dom}(B)$ , $R_{2}(B,C)$ is a one-to-many relation from $\mathrm{dom}(B)$ to $\mathrm{dom}(C)$ , and $R_{3}(C,D)=\mathrm{dom}(C)\times\mathrm{dom}(D)$ . Note that this instance has $\mathrm{IN}=\Theta(N)$ and the output size is exactly $\mathrm{OUT}$ . Consider first the join plan $(R_{1}\Join R_{2})\Join R_{3}$ , and note that $|R_{1}\Join R_{2}|=|R_{1}\Join R_{2}\Join R_{3}|=\mathrm{OUT}$ . Using the $O({\mathrm{IN}\over p}+\sqrt{\mathrm{OUT}\over p})$ -load algorithm [8, 18] for binary joins, the load of computing $R_{1}\Join R_{2}$ is $O({\mathrm{IN}\over p}+\sqrt{{\mathrm{OUT}\over p}})$ . However, since the output of the first join is the input of the second join, the input size for the second join is $\mathrm{OUT}$ , so the load of the second join is $O({\mathrm{OUT}\over p}+\sqrt{{\mathrm{OUT}\over p}})=O({\mathrm{OUT}\over p})$ . In general, the intermediate join result can be as large as $O(\mathrm{OUT})$ , which is why the Yannakakis algorithm incurs a load of $O({\mathrm{OUT}\over p})$ (after dangling tuples are removed) on an acyclic join, as observed in [2, 25].

Now consider the alternative plan $R_{1}\Join(R_{2}\Join R_{3})$ . Note that $|R_{2}\Join R_{3}|=O(\mathrm{IN})$ , so the load of computing $R_{2}\Join R_{3}$ is $O({\mathrm{IN}\over p})$ , while the load of computing the second join is $O({\mathrm{IN}\over p}+\sqrt{{\mathrm{OUT}\over p}})$ . Crucially, the reason why the second plan is better is that it has a smaller intermediate join size. Note that a smaller intermediate join size does not matter in the RAM model, where the total cost is always dominated by the last join. But it does matter in the MPC model, because of the $O({\mathrm{IN}\over p}+\sqrt{{\mathrm{OUT}\over p}})$ load complexity of a binary join, which has a linear dependency on the input size but sublinear in the output size. Fundamentally, this is because the MPC model is all about locality: algorithms strive to send all “related” tuples to the same server so as to maximize the number of join results that can be found by the server locally.
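The difference between the two plans can be reproduced on a small copy of the instance above. The sketch below builds the instance for given $N$ and $\mathrm{OUT}$ (assuming $\mathrm{OUT}$ is a multiple of $N$ that divides $N^{2}$) and uses a plain hash join; `build_instance` and `join` are illustrative helpers, not part of the algorithms in this paper.

```python
def build_instance(n, out):
    """Domain sizes |A| = out/n, |B| = n^2/out, |C| = n, |D| = 1."""
    size_a, size_b, size_c = out // n, n * n // out, n
    r1 = [(a, b) for a in range(size_a) for b in range(size_b)]
    r2 = [(c % size_b, c) for c in range(size_c)]  # each c has exactly one b
    r3 = [(c, 0) for c in range(size_c)]
    return r1, r2, r3


def join(left, right, li, ri):
    """Hash join on left[li] == right[ri]; result tuples are concatenated."""
    index = {}
    for t in right:
        index.setdefault(t[ri], []).append(t)
    return [l + r for l in left for r in index.get(l[li], [])]
```

With $N=16$ and $\mathrm{OUT}=64$, the intermediate result `join(r1, r2, 1, 0)` has 64 tuples, as large as the final output, whereas `join(r2, r3, 1, 0)` has only 16 tuples.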

Now, the key question is whether there is always a join plan with an intermediate join size asymptotically smaller than $O(\mathrm{OUT})$. Unfortunately, the answer is no. A bad example can be easily constructed, by just putting two of the above instances together, but in opposite directions (see Figure 3). Nevertheless, this bad example points us precisely in the right direction: Although a global best join order may not exist, if we decompose the join into multiple pieces, it is possible to find a provably good join order for each. This is exactly the basic idea of our algorithm, presented next.

4.2 A new algorithm for the line-3 join

We first compute $\mathrm{OUT}$ (an MPC primitive). Then we proceed in two steps:

Step (1): Computing degrees

A value $b$ of attribute $B$ is heavy if its degree in relation $R_{1}$, i.e., $|\sigma_{B=b}R_{1}|$, is greater than $\tau$ (a value to be determined later), and light otherwise. We first use the sum-by-key primitive to compute the degrees of all $b\in\mathrm{dom}(B)$. After classifying the values in $\mathrm{dom}(B)$ as heavy and light, we divide the tuples in $R_{1}$ and $R_{2}$ also into heavy tuples and light tuples, depending on their $B$ value. More precisely, a tuple in $R_{1}$ or $R_{2}$ is heavy if its $B$ value is heavy, and light otherwise. This can be done by the multi-search primitive. We denote the heavy (resp. light) tuples in $R_{i}$ as $R^{H}_{i}$ (resp. $R^{L}_{i}$), for $i=1,2$.

Step (2): Decomposing the join

We decompose the join into the following two parts, and compute them using different join orders:

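$$\mathcal{Q}_{1}=R_{1}^{H}\Join\left(R_{2}^{H}\Join R_{3}\right),\qquad\mathcal{Q}_{2}=\left(R_{1}^{L}\Join R_{2}^{L}\right)\Join R_{3}.$$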

Note that since $R_{1}$ and $R_{2}$ are both divided according to the $B$ attribute, $R_{1}^{H}$ does not join with $R_{2}^{L}$, and $R_{1}^{L}$ does not join with $R_{2}^{H}$.
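The two steps above can be sketched sequentially in Python. This is only an illustration of the decomposition and the two join orders, not an MPC implementation: the helper names `hash_join` and `line3_join` are hypothetical, the relations are plain lists of tuples, and `tau` is passed in rather than derived from $\mathrm{OUT}$.

```python
from collections import Counter, defaultdict

def hash_join(left, right, li, ri):
    """Join two lists of tuples on left[li] == right[ri]; the key is kept once."""
    index = defaultdict(list)
    for r in right:
        index[r[ri]].append(r)
    return [l + r[:ri] + r[ri + 1:] for l in left for r in index[l[li]]]

def line3_join(R1, R2, R3, tau):
    """Heavy/light decomposition of R1(A,B) join R2(B,C) join R3(C,D)."""
    # Step (1): degree of every B value in R1 (the role of sum-by-key).
    deg = Counter(b for (_, b) in R1)
    heavy = {b for b, d in deg.items() if d > tau}
    R1H = [t for t in R1 if t[1] in heavy]
    R1L = [t for t in R1 if t[1] not in heavy]
    R2H = [t for t in R2 if t[0] in heavy]
    R2L = [t for t in R2 if t[0] not in heavy]
    # Step (2), heavy part: join R2^H with R3 first, then with R1^H.
    R23 = hash_join(R2H, R3, 1, 0)   # tuples (b, c, d)
    Q1 = hash_join(R1H, R23, 1, 0)   # tuples (a, b, c, d)
    # Step (2), light part: join R1^L with R2^L first, then with R3.
    R12 = hash_join(R1L, R2L, 1, 0)  # tuples (a, b, c)
    Q2 = hash_join(R12, R3, 2, 0)    # tuples (a, b, c, d)
    # The intermediate sizes are returned so they can be compared with the bounds.
    return Q1 + Q2, len(R23), len(R12)
```

Returning the two intermediate sizes makes it easy to check, on small inputs, that the heavy-side intermediate result stays below $\mathrm{OUT}/\tau$ and the light-side one below $|R_{2}|\cdot\tau$.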

Now we analyze the load. For $\mathcal{Q}_{1}$ , the intermediate join $R_{23}=R_{2}^{H}\Join R_{3}$ has size bounded by $\frac{\mathrm{OUT}}{\tau}$ , since each intermediate join result from $R_{23}$ has a heavy $B$ value, so it joins with at least $\tau$ tuples in $R_{1}$ . Thus, the load of computing $\mathcal{Q}_{1}$ is (big-Oh of)

$$\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p\cdot\tau}+\sqrt{\frac{\mathrm{OUT}}{p}}.\tag{4}$$

For $\mathcal{Q}_{2}$, the intermediate join $R_{12}=R_{1}^{L}\Join R_{2}^{L}$ has size bounded by $\mathrm{IN}\cdot\tau$, since each light tuple from $R_{2}$ can join with at most $\tau$ tuples from $R_{1}$. Thus, the load of computing $\mathcal{Q}_{2}$ is (big-Oh of)

$$\frac{\mathrm{IN}}{p}+\frac{\mathrm{IN}\cdot\tau}{p}+\sqrt{\frac{\mathrm{OUT}}{p}}.\tag{5}$$

Setting $\tau=\sqrt{{\mathrm{OUT}\over\mathrm{IN}}}$ balances the second term in (4) and in (5), and we obtain the claimed result (note that $\sqrt{{\mathrm{OUT}\over p}}\leq{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}\over p}$ for $\mathrm{IN}\geq p$ ):

Theorem 5.

There is an algorithm computing the line-3 join with load $O\left(\frac{\mathrm{IN}}{p}+\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p}\right)$ in $O(1)$ rounds.
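As a worked check of the balancing step, assume the $\tau$-dependent terms of the two load bounds are $\frac{\mathrm{OUT}}{p\cdot\tau}$ for $\mathcal{Q}_{1}$ (which follows from $|R_{23}|\leq\mathrm{OUT}/\tau$) and $\frac{\mathrm{IN}\cdot\tau}{p}$ for $\mathcal{Q}_{2}$ (which follows from $|R_{12}|\leq\mathrm{IN}\cdot\tau$). Equating them recovers the threshold and the resulting term:

```latex
\frac{\mathrm{OUT}}{p\cdot\tau}=\frac{\mathrm{IN}\cdot\tau}{p}
\;\Longleftrightarrow\;
\tau^{2}=\frac{\mathrm{OUT}}{\mathrm{IN}}
\;\Longleftrightarrow\;
\tau=\sqrt{\frac{\mathrm{OUT}}{\mathrm{IN}}},
\qquad\text{and then}\qquad
\frac{\mathrm{OUT}}{p\cdot\tau}=\frac{\mathrm{IN}\cdot\tau}{p}=\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p}.

% The remaining term is absorbed whenever IN >= p:
\sqrt{\frac{\mathrm{OUT}}{p}}\leq\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p}
\;\Longleftrightarrow\;
\frac{\mathrm{OUT}}{p}\leq\frac{\mathrm{IN}\cdot\mathrm{OUT}}{p^{2}}
\;\Longleftrightarrow\;
p\leq\mathrm{IN}.
```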

4.3 Lower bound

We prove the following lower bound on any tuple-based algorithm for computing the line-3 join.

Theorem 6.

For any $\mathrm{OUT}\geq\mathrm{IN}$ , there exists an instance $\mathcal{R}$ for the line-3 join with input size $\Theta(\mathrm{IN})$ and output size $\Theta(\mathrm{OUT})$ , such that any tuple-based algorithm computing the join in $O(1)$ rounds must have a load of $\Omega\left(\min\left\{\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p\cdot\log\mathrm{IN}},\frac{\mathrm{IN}}{\sqrt{p}}\right\}\right)$ .

Proof.

Our lower bound argument is combinatorial in nature. We will construct a hard instance $\mathcal{R}$ such that a server can produce at most $J(L)$ join results in a round, no matter which $L$ tuples from $\mathcal{R}$ are loaded to the server. Then $p$ servers can produce at most $O(p\cdot J(L))$ results over $O(1)$ rounds. Setting $p\cdot J(L)=\Omega(\mathrm{OUT})$ will yield a lower bound on $L$. Thus, any upper bound on $J(L)$ will yield a lower bound on $L$, and we will only focus on upper bounding $J(L)$.

We construct $\mathcal{R}$ using the probabilistic method, i.e., we randomly generate an instance, and show that with positive probability (actually, with high probability), such a randomly generated instance satisfies our needs. The construction is similar to the one used in [18], but the parameters and arguments are different.

A randomly constructed instance is shown in Figure 4. In fact, only $R_{2}$ is random, while $R_{1}$ and $R_{3}$ are deterministic. Let $N=\frac{\mathrm{IN}}{3},\tau=\sqrt{\frac{\mathrm{OUT}}{N}}$, and set $|\mathrm{dom}(B)|=|\mathrm{dom}(C)|={N\over\tau}$. Each distinct value of $B$ appears in $\tau$ tuples in $R_{1}(A,B)$, and each distinct value of $C$ appears in $\tau$ tuples in $R_{3}(C,D)$. The $\tau$ tuples in $R_{1}$ (resp. $R_{3}$) that share the same $B$ (resp. $C$) value are called a group. For each pair of values $(b,c),b\in\mathrm{dom}(B),c\in\mathrm{dom}(C)$, the tuple $(b,c)$ is included in $R_{2}(B,C)$ with probability $\frac{\tau^{2}}{N}$ independently. Note that $|R_{1}|=|R_{3}|=N$, and $E[|R_{2}|]=({N\over\tau})^{2}\cdot{\tau^{2}\over N}=N$, so the input size is expected to be $\mathrm{IN}$. The output size is expected to be $\tau^{2}\cdot({N\over\tau})^{2}\cdot{\tau^{2}\over N}=N\cdot\tau^{2}=\mathrm{OUT}$. By the Chernoff inequality, the probability that the input size or the output size deviates from its expectation by more than a constant fraction is $\exp(-\Omega(N))$.
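The construction can be sketched in Python as follows. This is a minimal illustration under extra assumptions not stated in the text: `tau` is an integer that divides `N`, $\tau^{2}\leq N$ so that the inclusion probability is valid, and the function name `hard_line3_instance` and the `seed` parameter are hypothetical.

```python
import random

def hard_line3_instance(N, tau, seed=0):
    """Random line-3 instance: R1 and R3 are deterministic, only R2 is random."""
    rng = random.Random(seed)
    groups = N // tau  # |dom(B)| = |dom(C)| = N / tau
    # Each B value (resp. C value) owns one group of tau tuples.
    R1 = [(g * tau + j, g) for g in range(groups) for j in range(tau)]  # (a, b)
    R3 = [(g, g * tau + j) for g in range(groups) for j in range(tau)]  # (c, d)
    # Each pair (b, c) is included independently with probability tau^2 / N.
    prob = tau * tau / N
    R2 = [(b, c) for b in range(groups) for c in range(groups)
          if rng.random() < prob]
    # Every tuple of R2 connects one full group of R1 with one full group of R3.
    out = tau * tau * len(R2)
    return R1, R2, R3, out
```

On a generated instance, `len(R2)` should be close to `N` and `out` close to `N * tau**2`, matching the expectations computed above.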

To give an upper bound on $J(L)$, we only restrict the server to load at most $L$ tuples from $R_{1}$ and $R_{3}$, while tuples in $R_{2}$ can be accessed for free. Furthermore, we argue below that we only need to consider the situation where the server loads $R_{1}$ and $R_{3}$ in whole groups. Suppose two groups in $R_{1}$, say, $g_{1}$ and $g_{2}$, are not loaded in full (we may assume w.l.o.g. that $L$ is a multiple of $\tau$, so there cannot be exactly one non-full group): $x_{1}<\tau$ tuples of $g_{1}$ and $x_{2}<\tau$ tuples of $g_{2}$ have been loaded. Suppose they respectively join with $y_{1}$ and $y_{2}$ tuples in $R_{3}$ that are loaded by the server. Note that they will produce $x_{1}y_{1}+x_{2}y_{2}$ join results. Without loss of generality, assume $y_{1}\geq y_{2}$. Now consider the alternative where the server loads $x_{1}+1$ tuples of $g_{1}$ and $x_{2}-1$ tuples of $g_{2}$. This would produce $(x_{1}+1)y_{1}+(x_{2}-1)y_{2}=x_{1}y_{1}+x_{2}y_{2}+y_{1}-y_{2}\geq x_{1}y_{1}+x_{2}y_{2}$ join results. This means that by moving one tuple from $g_{2}$ to $g_{1}$, the server never loses join results. We can move tuples from one group to another as long as there are two non-full groups. Eventually we arrive at a situation where every loaded group of $R_{1}$ is loaded in full, without decreasing the reported join size. Next, we apply the same transformation to the groups of $R_{3}$ to make all its loaded groups full as well. Therefore, to maximize $J(L)$, the server should only load $R_{1}$ and $R_{3}$ in full groups.

Thus, the server loads $\frac{L}{\tau}$ groups from $R_{1}$ and $\frac{L}{\tau}$ groups from $R_{3}$ . Below we show that a random instance constructed as above has the following property with high probability: On every possible choice of the ${L\over\tau}$ groups of $R_{1}$ and ${L\over\tau}$ groups of $R_{3}$ to be loaded, $J(L)$ is always bounded.

Consider a particular choice of the $\frac{L}{\tau}$ groups from $R_{1}$ and $\frac{L}{\tau}$ groups from $R_{3}$ to be loaded. There are $\left(\frac{L}{\tau}\right)^{2}$ pairs of groups, and each pair has probability $\frac{\tau^{2}}{N}$ to join, so we expect to see $\frac{L^{2}}{N}$ pairs to join. Because the pairs join independently, by the Chernoff bound, the probability that more than $\delta\cdot\frac{L^{2}}{N}$ pairs join is at most $\exp\left(-\Omega(\delta\cdot\frac{L^{2}}{N})\right)$ , for some parameter $\delta\geq 2$ to be determined later. There are $O\left((\frac{N}{\tau})^{\frac{2L}{\tau}}\right)$ different choices of $\frac{L}{\tau}$ groups from $R_{1}$ and ${L\over\tau}$ groups from $R_{3}$ . So, by the union bound, the probability that one of them yields more than $\delta\cdot\frac{L^{2}}{N}$ joining groups is at most

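$$O\left(\left(\frac{N}{\tau}\right)^{\frac{2L}{\tau}}\right)\cdot\exp\left(-\Omega\left(\delta\cdot\frac{L^{2}}{N}\right)\right)=\exp\left(-\Omega\left(\delta\cdot\frac{L^{2}}{N}\right)+O\left(\frac{L}{\tau}\cdot\log N\right)\right).$$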

This probability is exponentially small if $\delta\cdot\frac{L^{2}}{N}>c_{1}\cdot\frac{L}{\tau}\log N$ for some sufficiently large constant $c_{1}$ , so we set

[TABLE]

Since each joining group produces $\tau^{2}$ join results, we have shown that with high probability, a random instance has the property that no matter which $L$ tuples are loaded, we always have $J(L)\leq\delta\cdot\frac{\tau^{2}L^{2}}{N}$ . Putting this into $p\cdot J(L)=\Omega(\mathrm{OUT})$ , we obtain

[TABLE]

Plugging (6) into (7), we have

[TABLE]

or

[TABLE]

Plugging in $\tau=\sqrt{\frac{\mathrm{OUT}}{N}}$ , $N={\mathrm{IN}\over 3}$ ,

[TABLE]

The theorem is then proved after rearranging the terms. ∎
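The two branches of the minimum can be traced with a short derivation. The sketch below assumes the choice $\delta=\max\{2,\,c_{1}\cdot\frac{N\log N}{\tau L}\}$, which is one way to satisfy both $\delta\geq 2$ and the condition $\delta\cdot\frac{L^{2}}{N}>c_{1}\cdot\frac{L}{\tau}\log N$; the exact constants in the proof may differ.

```latex
% Starting point: p * J(L) = Omega(OUT) with J(L) <= delta * tau^2 * L^2 / N.
p\cdot\delta\cdot\frac{\tau^{2}L^{2}}{N}=\Omega(\mathrm{OUT}).

% Case 1: delta = 2. Use tau^2 = OUT / N.
L^{2}=\Omega\left(\frac{N\cdot\mathrm{OUT}}{p\cdot\tau^{2}}\right)
     =\Omega\left(\frac{N^{2}}{p}\right)
\;\Longrightarrow\;
L=\Omega\left(\frac{N}{\sqrt{p}}\right).

% Case 2: delta = c_1 * N log N / (tau * L). One factor of L cancels.
p\cdot c_{1}\cdot\tau\cdot L\cdot\log N=\Omega(\mathrm{OUT})
\;\Longrightarrow\;
L=\Omega\left(\frac{\mathrm{OUT}}{p\cdot\tau\cdot\log N}\right)
 =\Omega\left(\frac{\sqrt{N\cdot\mathrm{OUT}}}{p\cdot\log N}\right).

% With N = IN / 3, the smaller of the two cases gives
L=\Omega\left(\min\left\{\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p\cdot\log\mathrm{IN}},\frac{\mathrm{IN}}{\sqrt{p}}\right\}\right).
```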

Ignoring logarithmic factors, this lower bound completes our understanding of the line-3 join in terms of output-optimality: (1) When $\mathrm{OUT}\leq\mathrm{IN}$ , the Yannakakis algorithm has linear load $O\left(\frac{\mathrm{IN}}{p}\right)$ . (2) When $\mathrm{IN}<\mathrm{OUT}\leq p\cdot\mathrm{IN}$ , the lower bound becomes $\tilde{\Omega}\left(\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p}\right)$ , which is matched by our new algorithm. (3) When $\mathrm{OUT}\geq p\cdot\mathrm{IN}$ , the lower bound is $\Omega\left({\mathrm{IN}\over\sqrt{p}}\right)$ , which is matched by the worst-case optimal algorithm in [19, 24]. In particular, this means that when $\mathrm{OUT}$ is large enough, the load complexity of the join is no longer output-sensitive. This also stands in contrast with the RAM model, where the complexity of any acyclic join always grows linearly with $\mathrm{OUT}$ .

An easy corollary is the following result, which shows that instance-optimality is not achievable for the line-3 join.

Corollary 2.

For any $\mathrm{IN}\geq p^{3/2}$ , there is an instance $\mathcal{R}$ with input size $\Theta(\mathrm{IN})$ for the line-3 join, such that any tuple-based algorithm computing the join in $O(1)$ rounds must have a load of $\Omega({\mathrm{IN}\over p^{1/2}\log\mathrm{IN}})$ , while $L_{\textrm{instance}}(p,\mathcal{R})=O({\mathrm{IN}\over p})$ .

Proof.

We use $\mathrm{OUT}=p\cdot\mathrm{IN},\tau=\sqrt{p}$ in the lower bound construction above. Plugging these values into (8), we obtain the claimed lower bound. On the other hand, we have $L_{\textrm{instance}}(p,\mathcal{R})$ as large as

[TABLE]

As long as $\mathrm{IN}\geq p^{3/2}$ , the first term dominates. ∎

5 Acyclic Joins

In this section, we extend the results from the previous section to arbitrary acyclic joins. Specifically, the algorithm is a (nontrivial) generalization of the line-3 algorithm, but it is self-contained; the lower bound builds on top of the hard instance of the line-3 join.

5.1 Algorithm

As a preprocessing step, we remove all dangling tuples. We also assume that the output size $\mathrm{OUT}$ has been computed (an MPC primitive).

Recall that in an acyclic join $\mathcal{Q}=(\mathcal{V},\mathcal{E})$, the hyperedges $\mathcal{E}$ can be organized into a join tree $\mathcal{T}$, such that for each attribute $x\in\mathcal{V}$, the nodes corresponding to $\mathcal{E}_{x}$ are connected in $\mathcal{T}$. Given such a join tree $\mathcal{T}$, our algorithm recursively decomposes the join into multiple pieces, and applies a different join strategy to each.

We start from an internal node of $\mathcal{T}$ whose children are all leaves. Let this node be $e_{0}$ , which has $k$ leaf children $e_{1},\cdots,e_{k}$ (see Figure 5 for an example). Let $s_{i}=e_{0}\cap e_{i}$ be the set of join attributes between $e_{0}$ and $e_{i}$ . We will assume $s_{i}\neq\emptyset$ ; otherwise we can add a dummy attribute to both $e_{0}$ and $e_{i}$ and all tuples in $R(e_{0})$ and $R(e_{i})$ share the same value on this dummy attribute (e.g., we add a dummy attribute $H^{\prime}$ to both $e_{0}$ and $e_{6}$ in Figure 5). Note that the join tree ensures the property that if $x\in e_{i}\cap e_{j}$ for $i\neq j$ , then $x\in e_{0}$ .

Let $N_{\alpha}=\sum_{i=1}^{k}|R(e_{i})|$ and $N_{\beta}=\mathrm{IN}-N_{\alpha}$ . We will actually prove a slightly tighter bound, that the load of our algorithm is bounded by $O({\mathrm{IN}\over p}+{\sqrt{N_{\beta}\cdot\mathrm{OUT}}\over p}+\sqrt{\mathrm{OUT}\over p})$ .

Set $\tau=\sqrt{\frac{\mathrm{OUT}}{N_{\beta}}}$ . Our algorithm proceeds in three steps.

Step (1): Computing data statistics

In each relation $R(e_{i})$ , $i=1,\dots,k$ , let $v$ be an assignment of values for attributes $s_{i}$ . The set of heavy assignments in $R(e_{i})$ is

$$H(s_{i},e_{i})=\left\{v\in\pi_{s_{i}}R(e_{i}):\left|\sigma_{s_{i}=v}R(e_{i})\right|>\tau\right\}.$$

Tuples in $R(e_{i})$ can also be identified as heavy or light, depending on their projection on attributes $s_{i}$. More precisely, a tuple $t\in R(e_{i})$ is heavy if $\pi_{s_{i}}t\in H(s_{i},e_{i})$, and light otherwise. The sets of heavy tuples and light tuples in $R(e_{i})$ are denoted as $R_{H}(e_{i})$ and $R_{L}(e_{i})$, respectively. All the statistics can be computed by the sum-by-key and multi-search primitives with linear load.

Let $\bar{\mathcal{E}}=\mathcal{E}-\{e_{0},e_{1},\cdots,e_{k}\}$ . We decompose the join into the following sub-joins:

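$$R(e_{0})\Join R_{?}(e_{1})\Join R_{?}(e_{2})\Join\cdots\Join R_{?}(e_{k})\Join\left(\Join_{e\in\bar{\mathcal{E}}}R(e)\right),$$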

where each $?$ can be either $H$ or $L$. Note that there are $2^{k}$ sub-joins, which is a constant, so we can afford to use $p$ servers for each sub-join. If a sub-join involves at least one $R_{H}(e_{i})$, we apply the procedure in step (2) to it. In step (3), we handle the case where all $?$ are $L$.

Step (2): Sub-joins with at least one $R_{H}(e_{i})$

Without loss of generality, suppose $R_{H}(e_{1})$ is in the sub-join, i.e., we need to compute the sub-join

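$$R(e_{0})\Join R_{H}(e_{1})\Join R_{?}(e_{2})\Join\cdots\Join R_{?}(e_{k})\Join\left(\Join_{e\in\bar{\mathcal{E}}}R(e)\right),$$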

where each $?$ can be either $H$ or $L$ . The algorithm consists of three steps:

(2.1) Compute $R^{\prime}(e_{0})=R(e_{0})\ltimes R_{H}(e_{1})$.

(2.2) Compute $R^{\prime}=R^{\prime}(e_{0})\Join R_{?}(e_{2})\Join\cdots\Join R_{?}(e_{k})\Join\left(\Join_{e\in\bar{\mathcal{E}}}R(e)\right)$ by any order.

(2.3) Compute $R_{H}(e_{1})\Join R^{\prime}$.

We analyze the load in each step: (2.1) is a primitive operation that incurs linear load. To bound the load of (2.2), observe that $|R^{\prime}|\leq{\mathrm{OUT}\over\tau}$ , since each tuple in $R^{\prime}$ joins with at least $\tau$ tuples in $R_{H}(e_{1})$ , each producing one final join result. Thus, the load is bounded by $O(\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p\cdot\tau})$ . The binary join in (2.3) has input size $\frac{\mathrm{OUT}}{\tau}+\mathrm{IN}$ and output size $\mathrm{OUT}$ , incurring a load of $O(\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p\cdot\tau}+\sqrt{\frac{\mathrm{OUT}}{p}})$ , which dominates the first two steps. Plugging in the value of $\tau$ , the total load is bounded by $O({\mathrm{IN}\over p}+{\sqrt{N_{\beta}\cdot\mathrm{OUT}}\over p}+\sqrt{\mathrm{OUT}\over p})$ , as desired.
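For reference, the substitution of $\tau=\sqrt{\mathrm{OUT}/N_{\beta}}$ into the $\tau$-dependent terms of the analysis works out as follows; the second identity concerns the term $\frac{N_{\beta}\cdot\tau}{p}$ that arises for the light sub-joins.

```latex
\frac{\mathrm{OUT}}{p\cdot\tau}
=\frac{\mathrm{OUT}}{p}\cdot\sqrt{\frac{N_{\beta}}{\mathrm{OUT}}}
=\frac{\sqrt{N_{\beta}\cdot\mathrm{OUT}}}{p},
\qquad
\frac{N_{\beta}\cdot\tau}{p}
=\frac{N_{\beta}}{p}\cdot\sqrt{\frac{\mathrm{OUT}}{N_{\beta}}}
=\frac{\sqrt{N_{\beta}\cdot\mathrm{OUT}}}{p}.
```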

Step (3): The sub-join with all $R_{L}(e_{i})$

It remains to compute the following sub-join:

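$$R(e_{0})\Join R_{L}(e_{1})\Join\cdots\Join R_{L}(e_{k})\Join\left(\Join_{e\in\bar{\mathcal{E}}}R(e)\right).$$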

We further divide $R(e_{0})$ into heavy and light tuples, as follows. Let $s=s_{1}\cup s_{2}\cup\cdots\cup s_{k}$, and let $v$ be an assignment over attributes $s$. The set of heavy assignments in $R(e_{0})$ is defined as

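$$H(s,e_{0})=\left\{v\in\pi_{s}R(e_{0}):\prod_{i=1}^{k}\left|\sigma_{s_{i}=\pi_{s_{i}}v}R_{L}(e_{i})\right|>\tau\right\}.$$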

Tuples in $R(e_{0})$ are classified as heavy or light, depending on their projection on attributes $s$ , i.e., a tuple $t\in R(e_{0})$ is heavy if $\pi_{s}t\in H(s,e_{0})$ , and light otherwise. Similarly, denote the heavy and light tuples in $R(e_{0})$ as $R_{H}(e_{0})$ and $R_{L}(e_{0})$ , respectively.

These statistics can also be computed using the primitives, but with some more care. For each relation $R_{L}(e_{i})$, we first use sum-by-key to compute $|\sigma_{s_{i}=v_{i}}R_{L}(e_{i})|$ for every $v_{i}\in\pi_{s_{i}}R_{L}(e_{i})$. This gives us a list of $(v_{i},|\sigma_{s_{i}=v_{i}}R_{L}(e_{i})|)$ pairs. Then, we use multi-search to find, for each tuple $t\in R(e_{0})$, the up to $k$ pairs $(v_{i},|\sigma_{s_{i}=v_{i}}R_{L}(e_{i})|)$ such that $\pi_{s_{i}}t=v_{i}$. After this step, each tuple in $R(e_{0})$ is attached with $k$ values, and we multiply them together to decide if the tuple is heavy or light.
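The classification of $R(e_{0})$ can be sketched sequentially in Python. This is an illustration only: the function name `classify_e0` is hypothetical, tuples are represented as dictionaries from attribute names to values, `Counter` stands in for the sum-by-key primitive, the dictionary lookup stands in for multi-search, and the strict threshold `> tau` is an assumption about the definition of a heavy assignment.

```python
from collections import Counter
from math import prod

def classify_e0(R0, light_children, tau):
    """Split R(e_0) into heavy and light tuples.

    R0: list of dicts (attribute -> value).
    light_children: list of (s_i, R_L(e_i)), where s_i is a tuple of join
    attribute names and R_L(e_i) is a list of dicts.
    """
    # Sum-by-key: for each child, count the tuples per assignment of s_i.
    degree_tables = [
        Counter(tuple(t[x] for x in s_i) for t in rel)
        for s_i, rel in light_children
    ]
    heavy, light = [], []
    for t in R0:
        # Multi-search: look up the k degrees that match this tuple.
        degs = [
            table[tuple(t[x] for x in s_i)]
            for (s_i, _), table in zip(light_children, degree_tables)
        ]
        # The product of the k degrees decides heavy versus light.
        (heavy if prod(degs) > tau else light).append(t)
    return heavy, light
```

A tuple of $R(e_{0})$ with no partner in some $R_{L}(e_{i})$ gets a product of zero and is therefore classified as light in this sketch.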

Step (3.1): The sub-join with $R_{H}(e_{0})$

We first compute the following sub-join:

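$$R_{H}(e_{0})\Join R_{L}(e_{1})\Join\cdots\Join R_{L}(e_{k})\Join\left(\Join_{e\in\bar{\mathcal{E}}}R(e)\right).$$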

The algorithm consists of three steps:

(3.1.1) Compute $R^{\prime}(e_{0})=R_{H}(e_{0})\Join\left(\Join_{e\in\bar{\mathcal{E}}}R(e)\right)$ by any order.

(3.1.2) Compute $R^{\prime}(e_{i})=R_{H}(e_{0})\Join R_{L}(e_{i})$ for each $i=1,\cdots,k$.

(3.1.3) Compute $R^{\prime}(e_{0})\Join R^{\prime}(e_{1})\Join\cdots\Join R^{\prime}(e_{k})$. Note that each of these relations contains all attributes in $e_{0}$, so this is a hierarchical join (it is actually tall-flat), and we can use the instance-optimal algorithm in Section 3 to compute it.

Now we analyze the load of each step: First, observe that $|R^{\prime}(e_{0})|\leq{\mathrm{OUT}\over\tau}$ . This is because the projection of each tuple in $R^{\prime}(e_{0})$ on $s$ is a heavy assignment, so it will produce at least $\tau$ join results after joining with the $R_{L}(e_{i})$ ’s. Therefore, the load of computing the join in (3.1.1) is $O(\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p\cdot\tau})$ . Each binary join in (3.1.2) has a load of $O(\frac{\mathrm{IN}}{p}+\sqrt{\frac{\mathrm{OUT}}{p}})$ . Note that each join result $R^{\prime}(e_{i})$ has size bounded by $N_{\beta}\cdot\tau$ , since any tuple in $R(e_{0})$ can join with at most $\tau$ tuples in $R_{L}(e_{i})$ . Thus, the hierarchical join in (3.1.3) has input size $O(N_{\beta}\cdot\tau+\frac{\mathrm{OUT}}{\tau})$ and output size $\mathrm{OUT}$ , so the instance-optimal algorithm has load $O(\frac{N_{\beta}\cdot\tau}{p}+\frac{\mathrm{OUT}}{p\cdot\tau}+\sqrt{\frac{\mathrm{OUT}}{p}})$ according to Corollary 1. All the loads are bounded by $O({\mathrm{IN}\over p}+{\sqrt{N_{\beta}\cdot\mathrm{OUT}}\over p}+\sqrt{\mathrm{OUT}\over p})$ , as desired.

Step (3.2): The sub-join with $R_{L}(e_{0})$

Finally, we are left with the sub-join

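$$R_{L}(e_{0})\Join R_{L}(e_{1})\Join\cdots\Join R_{L}(e_{k})\Join\left(\Join_{e\in\bar{\mathcal{E}}}R(e)\right).$$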

This is actually the only case where we need recursion:

(3.2.1) Compute $R^{\prime}_{L}(e_{0})=R_{L}(e_{0})\Join R_{L}(e_{1})\Join\cdots\Join R_{L}(e_{k})$ by any order.

(3.2.2) If $\bar{\mathcal{E}}\neq\emptyset$, compute $R^{\prime}_{L}(e_{0})\Join\left(\Join_{e\in\bar{\mathcal{E}}}R(e)\right)$ recursively.

Now we analyze the load: First, we have $|R^{\prime}_{L}(e_{0})|\leq N_{\beta}\cdot\tau$ , since the projection of each tuple in $R_{L}(e_{0})$ on $s$ is a light assignment. Thus, the load of step (3.2.1) is $O(\frac{\mathrm{IN}}{p}+\frac{N_{\beta}\cdot\tau}{p})$ , which is also bounded by $O({\mathrm{IN}\over p}+{\sqrt{N_{\beta}\cdot\mathrm{OUT}}\over p}+\sqrt{\mathrm{OUT}\over p})$ . So far, we have completed the base case of the induction proof.

For the join to be computed recursively in step (3.2.2), its input size is at most $\mathrm{IN}+N_{\beta}\cdot\tau$ and output size is at most $\mathrm{OUT}$ . More importantly, $N_{\beta}$ can only become smaller, since $e_{0}$ becomes a leaf in the residual join and $|R(e_{0})|$ is no longer included in $N_{\beta}$ , no matter which node in the residual join is picked to be its new $e_{0}$ . By the induction hypothesis, computing the residual join recursively incurs a load of $O(\frac{\mathrm{IN}}{p}+\frac{N_{\beta}\cdot\tau}{p}+\frac{\sqrt{N_{\beta}\cdot\mathrm{OUT}}}{p}+\sqrt{\frac{\mathrm{OUT}}{p}})$ , thus bounded by $O({\mathrm{IN}\over p}+{\sqrt{N_{\beta}\cdot\mathrm{OUT}}\over p}+\sqrt{\mathrm{OUT}\over p})$ .

Note that the recursion increases the constant in the big-Oh, but since the recursion depth depends only on the query and not on the data size, it does not change the asymptotic result.

This completes the induction proof that the algorithm has a load of $O({\mathrm{IN}\over p}+{\sqrt{N_{\beta}\cdot\mathrm{OUT}}\over p}+\sqrt{\mathrm{OUT}\over p})$. Observing that $N_{\beta}\leq\mathrm{IN}$ and $\sqrt{{\mathrm{OUT}\over p}}\leq{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}\over p}$ (for $\mathrm{IN}\geq p$), we obtain the following result.

Theorem 7.

There is an algorithm that computes any acyclic join with load $O(\frac{\mathrm{IN}}{p}+\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p})$ in $O(1)$ rounds.

5.2 Lower bound

In Section 4.3 we constructed a hard instance for the line-3 join and showed that any tuple-based algorithm must incur a load of $\Omega(\min\{\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p\cdot\log\mathrm{IN}},\frac{\mathrm{IN}}{\sqrt{p}}\})$ on this instance. In this section, we generalize this lower bound to an arbitrary acyclic join that is not r-hierarchical. Note that for r-hierarchical joins, we can achieve a smaller load $O(\frac{\mathrm{IN}}{p}+\sqrt{\frac{\mathrm{OUT}}{p}})$ (see Corollary 1), so this establishes a separation between r-hierarchical joins and acyclic joins.

The basic idea in the lower bound is that any acyclic join must “include” a line-3 join, such that any algorithm computing the acyclic join must also compute the line-3 join. This is more formally captured by the following structural lemma on acyclic and r-hierarchical joins. To state the lemma, we need some terminology. In a hypergraph $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ , a path between $x,y\in\mathcal{V}$ , denoted $P(x,y)$ , is a sequence of vertices starting with $x$ and ending with $y$ , such that each consecutive pair of vertices appear together in an edge. The length of a path is defined as the number of vertices in $P(x,y)$ minus 1. A path $P(x,y)$ is minimal if there is no other path $P^{\prime}(x,y)$ that is a strict subsequence of $P(x,y)$ . It is easy to see that $P(x_{1},x_{k})=(x_{1},x_{2},\cdots,x_{k})$ is minimal if and only if there exists no edge $e\in\mathcal{E}$ containing $x_{i}$ and $x_{j}$ with $|j-i|>1$ . Note that a shortest path must be minimal, but not vice versa.

Lemma 2.

An acyclic join is not r-hierarchical if and only if it has a minimal path of length $3$ .
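The path condition in the lemma can be checked by brute force on small hypergraphs. The sketch below relies on the characterization stated above, namely that a path is minimal exactly when no edge contains two non-consecutive vertices of it; the function names are hypothetical, and the code does not verify that the input hypergraph is acyclic.

```python
from itertools import permutations

def together(edges, x, y):
    """True if some hyperedge contains both x and y."""
    return any(x in e and y in e for e in edges)

def minimal_paths_of_length_3(vertices, edges):
    """Return all minimal paths (x1, x2, x3, x4) by exhaustive search."""
    found = []
    for x1, x2, x3, x4 in permutations(vertices, 4):
        # Consecutive vertices must appear together in an edge.
        consecutive = (together(edges, x1, x2)
                       and together(edges, x2, x3)
                       and together(edges, x3, x4))
        # No edge may contain two vertices that are not consecutive.
        shortcut = (together(edges, x1, x3)
                    or together(edges, x2, x4)
                    or together(edges, x1, x4))
        if consecutive and not shortcut:
            found.append((x1, x2, x3, x4))
    return found
```

For example, the line-3 join with edges $\{A,B\},\{B,C\},\{C,D\}$ yields the path $(A,B,C,D)$ and its reverse, while a star join whose edges all share one attribute yields none.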

The proof is given in Appendix A. With this lemma, we can extend the hard instance for the line-3 join to any acyclic but non-r-hierarchical join $\mathcal{Q}=(\mathcal{V},\mathcal{E})$. Let $(x_{1},x_{2},x_{3},x_{4})$ be a minimal path of length 3 in $\mathcal{Q}$, and suppose $\{x_{1},x_{2}\}\subseteq e_{1},\{x_{2},x_{3}\}\subseteq e_{2},\{x_{3},x_{4}\}\subseteq e_{3}$. Let $\mathcal{R}=\{R_{1}(x_{1},x_{2}),R_{2}(x_{2},x_{3}),R_{3}(x_{3},x_{4})\}$ be the hard instance for the line-3 join. We construct the hard instance $\mathcal{R}^{\prime}=\{R^{\prime}(e):e\in\mathcal{E}\}$ for $\mathcal{Q}$ as follows. The domain of $x_{i},i=1,2,3,4$ is the same as in $\mathcal{R}$. For any other attribute $y$, set $|\mathrm{dom}(y)|=1$.

Since the path is minimal, each $e\in\mathcal{E}$ must fall into one of the following three cases:

1. For any $e$ with $e\cap\{x_{1},x_{2},x_{3},x_{4}\}=\emptyset$, $R^{\prime}(e)$ contains only one tuple connecting the only value in the domains of attributes in $e$.

2. If $e\cap\{x_{1},x_{2},x_{3},x_{4}\}=\{x_{i}\},i=1,2,3,4$, then $R^{\prime}(e)$ contains $|\mathrm{dom}(x_{i})|$ tuples, each having a distinct value of $\mathrm{dom}(x_{i})$.

3. If $e\cap\{x_{1},x_{2},x_{3},x_{4}\}=\{x_{i},x_{i+1}\},i=1,2,3$, then $R^{\prime}(e)$ contains $|R_{i}(x_{i},x_{i+1})|$ tuples such that $\pi_{x_{i},x_{i+1}}R^{\prime}(e)=R_{i}(x_{i},x_{i+1})$.

It can be easily verified that $\mathcal{Q}(\mathcal{R}^{\prime})$ is exactly the join results of the line-3 join on $\mathcal{R}$ , so the same lower bound applies. However, since the output size of the line-3 join is at most $\mathrm{IN}^{2}$ , we do have a condition on $\mathrm{OUT}$ :

Theorem 8.

For an acyclic but non-r-hierarchical join and any $\mathrm{IN}\geq p^{3/2},\mathrm{OUT}\leq\mathrm{IN}^{2}$ , there exists an instance $\mathcal{R}$ with input size $\Theta(\mathrm{IN})$ and output size $\Theta(\mathrm{OUT})$ such that any tuple-based algorithm computing it in $O(1)$ rounds must have a load of $\Omega(\min\{\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p\cdot\log\mathrm{IN}},\frac{\mathrm{IN}}{\sqrt{p}}\})$ .

Similar to the line-3 join, this lower bound shows that our acyclic join algorithm is output-optimal (up to a logarithmic factor) when $\mathrm{OUT}\leq p\cdot\mathrm{IN}$ .

Furthermore, the same argument for Corollary 2 can be used here to show that instance-optimal algorithms do not exist for any acyclic but non-r-hierarchical join.

Corollary 3.

For any $\mathrm{IN}\geq p^{3/2}$ , there is an instance $\mathcal{R}$ with input size $\Theta(\mathrm{IN})$ for any acyclic but non-r-hierarchical join, such that any tuple-based algorithm that computes the join in $O(1)$ rounds must have a load of $\Omega({\mathrm{IN}\over p^{1/2}\log\mathrm{IN}})$ , while $L_{\textrm{instance}}(p,\mathcal{R})=O({\mathrm{IN}\over p})$ .

6 Join-Aggregate Queries

We consider join-aggregate queries over annotated relations [16, 21]. Let $(\mathbb{R},\oplus,\otimes)$ be a commutative semiring. Every tuple $t$ is associated with an annotation $w(t)\in\mathbb{R}$ . Let $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ be a join hypergraph. The annotation of a join result $t\in\mathcal{Q}(\mathcal{R})$ is $w(t):=\otimes_{t_{e}\in R(e),\pi_{e}t=t_{e},e\in\mathcal{E}}w(t_{e})$ . Let $\mathbf{y}\subseteq\mathcal{V}$ be a set of output attributes and $\mathbf{\bar{y}}=\mathcal{V}-\mathbf{y}$ the non-output attributes. A join-aggregate query $\mathcal{Q}_{\mathbf{y}}(\mathcal{R})$ asks us to compute

[TABLE]

In plain language, a join-aggregate query first computes the join $\mathcal{Q}(\mathcal{R})$ and the annotation of each join result, which is the $\otimes$ -aggregate of the tuples comprising the join result. Then it partitions $\mathcal{Q}(\mathcal{R})$ into groups by their projection on $\mathbf{y}$ . Finally, for each group, it computes the $\oplus$ -aggregate of the annotations of the join results.

Many queries can be formulated as special join-aggregate queries. For example, if we take $\mathbb{R}$ to be the domain of integers, $\oplus$ to be addition, $\otimes$ to be multiplication, and set $w(t)=1$ for all $t$ , then it becomes the COUNT(*) GROUP BY $\mathbf{y}$ query; in particular, if $\mathbf{y}=\emptyset$ , the query computes $|\mathcal{Q}(\mathcal{R})|$ .
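To make the definition concrete, the following naive evaluator computes a join-aggregate query by enumerating all combinations of tuples. It is a sketch for small inputs only and is unrelated to the MPC algorithms in this paper; the function name `join_aggregate`, the representation of each relation as a pair of attribute names and a dictionary from tuples to annotations, and the parameters `plus`, `times`, and `zero` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def join_aggregate(relations, y, plus, times, zero):
    """Naively evaluate a join-aggregate query over a commutative semiring.

    relations: list of (attrs, {values: annotation}) pairs.
    y: tuple of output attribute names.
    """
    out = defaultdict(lambda: zero)
    for combo in product(*[list(rel.items()) for _, rel in relations]):
        assignment, weight, consistent = {}, None, True
        for (attrs, _), (values, w) in zip(relations, combo):
            for a, v in zip(attrs, values):
                # Shared attributes must agree across the chosen tuples.
                if assignment.setdefault(a, v) != v:
                    consistent = False
            # Annotation of a join result: times-aggregate of its tuples.
            weight = w if weight is None else times(weight, w)
        if consistent:
            # Group by the projection on y and plus-aggregate.
            key = tuple(assignment[a] for a in y)
            out[key] = plus(out[key], weight)
    return dict(out)
```

With integer addition and multiplication and all annotations set to 1, this returns the COUNT(*) GROUP BY $\mathbf{y}$ result; passing `y=()` returns the join size under the empty key when the join is non-empty.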

The join-project query $\pi_{\mathbf{y}}\mathcal{Q}(\mathcal{R})$ , also known as a conjunctive query, is a special join-aggregate query, and we extend the terminology from [6] to join-aggregate queries. A width-1 GHD of a hypergraph $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ is a tree $\mathcal{T}$ , where each node $u\in\mathcal{T}$ is a subset of $\mathcal{V}$ , such that

1. (coherence) for each attribute $x\in\mathcal{V}$, the nodes containing $x$ are connected in $\mathcal{T}$;

2. (edge coverage) for each hyperedge $e\in\mathcal{E}$, there exists a node $u\in\mathcal{T}$ such that $e\subseteq u$; and

3. (width-1) for each node $u\in\mathcal{T}$, there exists a hyperedge $e\in\mathcal{E}$ such that $u\subseteq e$.
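The three properties can be checked mechanically. The sketch below assumes that the caller supplies a valid tree (it does not test that `tree_edges` is connected and cycle-free), and the function name `is_width1_ghd` and the index-based encoding of tree edges are hypothetical.

```python
def is_width1_ghd(nodes, tree_edges, hyperedges):
    """Check coherence, edge coverage, and width-1.

    nodes: list of frozensets of attributes (the tree nodes).
    tree_edges: list of (i, j) index pairs into nodes.
    hyperedges: list of frozensets of attributes.
    """
    adj = {i: set() for i in range(len(nodes))}
    for i, j in tree_edges:
        adj[i].add(j)
        adj[j].add(i)
    # Coherence: the nodes containing each attribute must be connected.
    for x in set().union(*hyperedges):
        holders = {i for i, u in enumerate(nodes) if x in u}
        if not holders:
            return False
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in adj[i] & holders:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != holders:
            return False
    # Edge coverage: every hyperedge is contained in some node.
    if not all(any(e <= u for u in nodes) for e in hyperedges):
        return False
    # Width-1: every node is contained in some hyperedge.
    return all(any(u <= e for e in hyperedges) for u in nodes)
```

For the line-3 join, the tree $AB - BC - CD$ passes all three checks, whereas the arrangement $AB - CD - BC$ fails coherence because the two nodes containing $B$ are separated by a node without $B$.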

Given a set of output attributes $\mathbf{y}$ (a.k.a. free variables), we say that $\mathcal{T}$ is free-connex if there is a subset of connected nodes of $\mathcal{T}$ including its root, denoted as $\mathcal{T}^{\prime}$ (such a $\mathcal{T}^{\prime}$ is said to be a connex subset), such that $\mathbf{y}=\bigcup_{u\in\mathcal{T}^{\prime}}u$ . A join-aggregate query $\mathcal{Q}_{\mathbf{y}}(\mathcal{R})$ is free-connex if it has a free-connex width-1 GHD.

As preprocessing, we remove the dangling tuples and then apply the reduce procedure repeatedly to remove an $e\in\mathcal{E}$ if there is another $e^{\prime}\in\mathcal{E}$ such that $e\subset e^{\prime}$. Note that while dangling tuples can simply be discarded, we cannot simply discard $R(e)$ in the reduce procedure. To ensure that the annotations will be computed correctly, we should replace $R(e^{\prime})$ with $R(e)\Join R(e^{\prime})$ and then discard $R(e)$. By the earlier definition, the annotation of a join result is the $\otimes$ -aggregate of the annotations of the tuples comprising the join result, so the annotations in $R(e)$ are aggregated into those in $R(e^{\prime})$.

We find a free-connex width-1 GHD $\mathcal{T}$ of $\mathcal{Q}$ [6, 5]. Note that the nodes of $\mathcal{T}$ also define a hypergraph, and can be regarded as another join-aggregate query, but with the property that it has a free-connex subset $\mathcal{T}^{\prime}$ such that $\mathbf{y}=\bigcup_{u\in\mathcal{T}^{\prime}}u$ . We construct an instance $\mathcal{R}_{\mathcal{T}}=\{R(u):u\in\mathcal{T}\}$ such that $\mathcal{Q}_{\mathbf{y}}(\mathcal{R})=\mathcal{T}(\mathcal{R}_{\mathcal{T}})$ , where $\mathcal{T}(\mathcal{R}_{\mathcal{T}})$ denotes the result of running the query defined by $\pi_{\mathbf{y}}\mathcal{T}$ on $\mathcal{R}_{\mathcal{T}}$ . Observe that on a reduced $\mathcal{Q}$ , the condition $e\subseteq u$ in property (2) of a width-1 GHD can be replaced by $e=u$ , since if $e\subset u$ and $u\subseteq e^{\prime}$ for some other $e^{\prime}\in\mathcal{E}$ due to property (3), we would find $e\subset e^{\prime}$ . This implies that $\mathcal{T}$ has only two types of nodes: (1) all hyperedges in $\mathcal{E}$ , and (2) nodes that are a proper subset of some $e\in\mathcal{E}$ . Then we construct $\mathcal{R}_{\mathcal{T}}$ as follows. For each $u\in\mathcal{T}$ of type (1), we set $R(u):=R(e)$ where $e=u$ ; for each $u\in\mathcal{T}$ of type (2), we set $R(u):=R(e)$ for any $e\in\mathcal{E},u\subset e$ , but the annotations of all tuples in $R(u)$ are set to $1$ (the $\otimes$ -identity). Below, we will focus on computing $\mathcal{T}(\mathcal{R}_{\mathcal{T}})$ .

Joglekar et al. [21] modified the Yannakakis algorithm into AggroYannakakis, and showed that it has load $O(\frac{\mathrm{IN}}{p}+\frac{\mathrm{OUT}}{p})$ on any free-connex join-aggregate query.⁸ Since we want to avoid the sub-optimal $O({\mathrm{OUT}\over p})$ term, we modify their algorithm into LinearAggroYannakakis, which runs with linear load. It aggregates over all the non-output attributes, returning a modified query $\mathcal{T}^{\prime}(\mathcal{R}_{\mathcal{T}^{\prime}})$ that only has the output attributes. The guarantee of LinearAggroYannakakis is stated in the following lemma.

⁸ The bound stated in [21] is actually $O({(\mathrm{IN}+\mathrm{OUT})^{2}\over p})$, because they used a sub-optimal binary join algorithm as the subroutine following [2]. Replacing it with the optimal binary join algorithm in [8, 18] yields the claimed bound. In addition, they only considered simple join-aggregate queries, which are a strict subclass of free-connex queries. But after our conversion from $\mathcal{Q}_{\mathbf{y}}(\mathcal{R})$ to $\mathcal{T}(\mathcal{R}_{\mathcal{T}})$, their algorithm actually works for all free-connex queries.

Lemma 3.

LinearAggroYannakakis is a constant-round, linear-load algorithm that, given any free-connex width-1 GHD $\mathcal{T}$ and an instance $\mathcal{R}_{\mathcal{T}}$, returns an instance $\mathcal{R}_{\mathcal{T}^{\prime}}$ such that $\mathcal{T}(\mathcal{R}_{\mathcal{T}})=\mathcal{T}^{\prime}(\mathcal{R}_{\mathcal{T}^{\prime}})$, where $\mathcal{T}^{\prime}$ is the free-connex subset of $\mathcal{T}$.

Proof.

Let $\mathcal{T}$ be a width-1 free-connex GHD and $\mathcal{T}^{\prime}$ be the connex subset of $\mathcal{T}$ such that $\mathbf{y}=\bigcup_{u\in\mathcal{T}^{\prime}}u$. For an attribute $x$, denote the highest node in $\mathcal{T}$ containing $x$ as $TOP_{\mathcal{T}}(x)$. Below, we describe LinearAggroYannakakis, an algorithm that converts $\mathcal{R}_{\mathcal{T}}$ into $\mathcal{R}_{\mathcal{T}^{\prime}}$ such that $\mathcal{T}(\mathcal{R}_{\mathcal{T}})=\mathcal{T}^{\prime}(\mathcal{R}_{\mathcal{T}^{\prime}})$.

The LinearAggroYannakakis algorithm visits each node $u\in\mathcal{T}$ in a bottom-up fashion over $\mathcal{T}$. If $u\in\mathcal{T}^{\prime}$, i.e., all its attributes are output attributes, we add $R(u)$ to $\mathcal{R}_{\mathcal{T}^{\prime}}$ (line 4). Otherwise, we aggregate over $\mathbf{\bar{y}^{\prime}}$, which are the non-output attributes in $u$ that do not appear in the ancestors of $u$ (lines 6–7). This is a sum-by-key problem. Note that after the aggregation, the attributes of $R(u)$ are $u-\mathbf{\bar{y}}^{\prime}$. Let $u^{\prime}$ be the parent of $u$ in $\mathcal{T}$. Note that $u^{\prime}$ always exists since the root of $\mathcal{T}$ must be in $\mathcal{T}^{\prime}$. Then we replace $R(u^{\prime})$ by $R(u^{\prime})\Join R(u)$ (line 9). Below we show how this join can be done in linear load. Consider any non-output attribute $x\in u-\mathbf{\bar{y}}^{\prime}-\mathbf{y}$. Since $TOP_{\mathcal{T}}(x)$ is an ancestor of $u$, we have $x\in u^{\prime}$. Consider any output attribute $y\in u\cap\mathbf{y}$. In the connex subset $\mathcal{T}^{\prime}$, there exists $u^{\prime\prime}\in\mathcal{T}^{\prime}$ such that $y\in u^{\prime\prime}$. Each node on the path from $u^{\prime\prime}$ to $u$ must contain attribute $y$, including $u^{\prime}$. Thus, we must have $u-\mathbf{\bar{y}}^{\prime}\subseteq u^{\prime}$. This means that tuples in $R(u^{\prime})\Join R(u)$ are actually the same as those in $R(u^{\prime})$, except that we update the annotation of each $t\in R(u^{\prime})$ as $w(t)\leftarrow w(t)\otimes w(t^{\prime})$, where $t^{\prime}\in R(u),t^{\prime}=\pi_{u-\mathbf{\bar{y}}^{\prime}}t$. Thus, this can be done by the multi-search primitive in linear load. Because this algorithm never increases the size of any relation, the two primitive operations (lines 7 and 9) incur linear load throughout the bottom-up traversal of $\mathcal{T}$.

It should be obvious from the algorithm description above that LinearAggroYannakakis incurs linear load, but we still need to argue for its correctness. Note that $\mathcal{R}_{\mathcal{T}^{\prime}}$ has only output attributes. It suffices to show that $\mathcal{T}(\mathcal{R}_{\mathcal{T}})=\mathcal{T}^{\prime}(\mathcal{R}_{\mathcal{T}^{\prime}})$ .

Joglekar et al. [21] have shown that for any leaf $u\in\mathcal{T}$ and its parent $u^{\prime}$ , performing the operation in lines 6–9 and then discarding $R(u)$ does not change the query results. AggroYannakakis performs this operation over all the relations of $\mathcal{T}$ in a bottom-up fashion, and applying this fact inductively means that the root relation becomes the final query result in the end, but this incurs load $O({\mathrm{IN}\over p}+{\mathrm{OUT}\over p})$ . LinearAggroYannakakis performs this operation on a subset of relations, and stops as soon as it sees a node in $\mathcal{T}^{\prime}$ . Then applying the result of [21] inductively up until $\mathcal{T}^{\prime}$ proves our claim. ∎
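As an illustration only, the bottom-up pass can be sketched sequentially, ignoring the MPC distribution and the load accounting. The sketch assumes integer annotations under ordinary $(+,\times)$ and assumes every tree node carries an actual relation; the inputs `attrs`, `rels`, `parent`, `bottom_up`, `connex`, and `out_attrs` are hypothetical names that do not come from the paper. A dictionary lookup stands in for the multi-search primitive.

```python
from collections import defaultdict


def linear_aggro_yannakakis(attrs, rels, parent, bottom_up, connex, out_attrs):
    """Sequential sketch: fold every node outside T' into its parent.

    attrs[u]   -- tuple of attribute names of node u
    rels[u]    -- dict: tuple of values (ordered as attrs[u]) -> annotation
    parent[u]  -- parent of u in the tree (None for the root)
    bottom_up  -- node order in which every child precedes its parent
    connex     -- set of nodes forming the connex subset T'
    out_attrs  -- set of output attributes y
    """
    rels = {u: dict(r) for u, r in rels.items()}  # do not mutate the input
    result = {}
    for u in bottom_up:
        if u in connex:
            result[u] = rels[u]
            continue
        up = parent[u]
        # Attributes that survive the aggregation, i.e. u minus y-bar'.
        keep = [x for x in attrs[u] if x in out_attrs or x in attrs[up]]
        child_idx = [attrs[u].index(x) for x in keep]
        # For a valid free-connex GHD every kept attribute is in the parent;
        # .index raises ValueError if that assumption is violated.
        parent_idx = [attrs[up].index(x) for x in keep]
        # Sum-by-key over the aggregated-out attributes.
        agg = defaultdict(int)
        for t, w in rels[u].items():
            agg[tuple(t[i] for i in child_idx)] += w
        # Multi-search: each parent tuple looks up its unique partner.
        updated = {}
        for t, w in rels[up].items():
            key = tuple(t[i] for i in parent_idx)
            if key in agg:  # parent tuples without a partner leave the join
                updated[t] = w * agg[key]
        rels[up] = updated
    return result


# Toy query: sum over B, C of R1(A) * R2(A,B) * R3(B,C), grouped by A.
attrs = {"r": ("A",), "u": ("A", "B"), "v": ("B", "C")}
rels = {
    "r": {(1,): 1, (2,): 1},
    "u": {(1, 10): 2, (1, 11): 1, (2, 10): 3},
    "v": {(10, 100): 1, (10, 101): 4, (11, 100): 5},
}
parent = {"r": None, "u": "r", "v": "u"}
out = linear_aggro_yannakakis(attrs, rels, parent, ["v", "u", "r"], {"r"}, {"A"})
print(out["r"])  # {(1,): 15, (2,): 15}
```

No relation grows during the pass, which mirrors the reason given above for the linear load.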

Because $\mathcal{T}^{\prime}$ is acyclic, we can run our output-optimal algorithm to compute $\mathcal{T}^{\prime}(\mathcal{R}_{\mathcal{T}^{\prime}})$ . More precisely, when the algorithm emits a join result, we compute the $\otimes$ -aggregate of the tuples comprising the join result. Note that in the following result, $\mathrm{OUT}=|\mathcal{Q}_{\mathbf{y}}(\mathcal{R})|$ , i.e., the size of the final output, which can be much smaller than $|\mathcal{Q}(\mathcal{R})|$ .

Theorem 9.

There is an algorithm that computes any free-connex join-aggregate query in $O(1)$ rounds with load $O(\frac{\mathrm{IN}}{p}+\frac{\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}}{p})$ .

Observing that computing the join size of a (non-aggregate) join is a special join-aggregate query with $\mathbf{y}=\emptyset$ , we obtain the following result, which has been used as a primitive. Note that there is no circular dependency here, because it only uses LinearAggroYannakakis.

Corollary 4.

For any acyclic join $\mathcal{Q}$ and any instance $\mathcal{R}$ , $|\mathcal{Q}(\mathcal{R})|$ can be computed in $O(1)$ rounds with load $O(\frac{\mathrm{IN}}{p})$ .
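A minimal sequential sketch of this counting idea, assuming each tuple starts with annotation 1 and a join tree is supplied by hand; the function name `join_size` and its inputs are illustrative and not part of the paper, and nothing here models the MPC load.

```python
from collections import defaultdict


def join_size(attrs, rels, parent, bottom_up):
    """Count |Q(R)| for an acyclic join given a join tree (sequential sketch).

    attrs[u] is the attribute tuple of relation u, rels[u] a set of tuples,
    parent[u] the parent in the join tree (None for the root), and bottom_up
    an order in which children precede parents and the root comes last.
    """
    cnt = {u: {t: 1 for t in rels[u]} for u in rels}
    for u in bottom_up:
        up = parent[u]
        if up is None:  # root reached: the sum of its counts is the join size
            return sum(cnt[u].values())
        shared = [x for x in attrs[u] if x in attrs[up]]
        ui = [attrs[u].index(x) for x in shared]
        pi = [attrs[up].index(x) for x in shared]
        # Sum the counts of u by the join key shared with the parent.
        agg = defaultdict(int)
        for t, c in cnt[u].items():
            agg[tuple(t[i] for i in ui)] += c
        # Multiply each parent tuple's count by its partners' total count.
        cnt[up] = {t: c * agg[tuple(t[i] for i in pi)] for t, c in cnt[up].items()}


# Path join R1(A,B), R2(B,C), R3(C,D) with R2 as the root of the join tree.
R1 = {(1, 1), (2, 1), (3, 2)}
R2 = {(1, 1), (1, 2), (2, 2)}
R3 = {(1, 7), (2, 7), (2, 8)}
attrs = {"R1": ("A", "B"), "R2": ("B", "C"), "R3": ("C", "D")}
parent = {"R1": "R2", "R3": "R2", "R2": None}
size = join_size(attrs, {"R1": R1, "R2": R2, "R3": R3}, parent, ["R1", "R3", "R2"])
print(size)  # 8
```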

Furthermore, if $\mathcal{T}^{\prime}$ is r-hierarchical, we run our instance-optimal algorithm to compute $\mathcal{T}^{\prime}(\mathcal{R}_{\mathcal{T}^{\prime}})$ . In fact, we can precisely characterize the class of queries with an r-hierarchical $\mathcal{T}^{\prime}$ . A query is called out-hierarchical if it is free-connex and the residual query obtained by removing all non-output attributes is r-hierarchical.

Lemma 4.

A join-aggregate query $\mathcal{Q}_{\mathbf{y}}$ is out-hierarchical if and only if it has a width-1 GHD $\mathcal{T}$ with a connex subset $\mathcal{T}^{\prime}$ such that $\mathbf{y}=\bigcup_{u\in\mathcal{T}^{\prime}}u$ and $\mathcal{T}^{\prime}$ is r-hierarchical.

Proof.

Recall that a join-aggregate query $\mathcal{Q}_{\mathbf{y}}$ is free-connex iff it has a width-1 GHD $\mathcal{T}$ with a connex subset $\mathcal{T}^{\prime}$ such that $\mathbf{y}=\bigcup_{u\in\mathcal{T}^{\prime}}u$ . Let $\mathcal{Q}_{out}=(\mathbf{y},\{e\cap\mathbf{y}:e\in\mathcal{E}\})$ be the residual query of $\mathcal{Q}_{\mathbf{y}}$ after removing all non-output attributes. Then it suffices to show that for a free-connex query $\mathcal{Q}_{\mathbf{y}}$ , $\mathcal{Q}_{out}$ is r-hierarchical iff $\mathcal{T}^{\prime}$ is r-hierarchical.

An edge $e\in\mathcal{E}$ is out-irreducible if there exists no $e^{\prime}\in\mathcal{E}$ such that $e\cap\mathbf{y}\subset e^{\prime}\cap\mathbf{y}$ or $e\subset e^{\prime}$ ; otherwise it is out-reducible. We first claim that for each out-irreducible $e\in\mathcal{E}$ there exists one node $v\in\mathcal{T}^{\prime}$ such that $e\cap\mathbf{y}\subseteq v$ . Consider the node $u\in\mathcal{T}$ such that $u=e$ . If $u\in\mathcal{T}^{\prime}$ , the claim holds trivially. Otherwise, let $v$ be the lowest ancestor of $u$ in $\mathcal{T}^{\prime}$ . As each output attribute $x\in u\cap\mathbf{y}$ appears in some node of $\mathcal{T}^{\prime}$ , it also appears in $v$ due to the coherence constraint. Thus, $e\cap\mathbf{y}\subseteq v$ .

Recall that for each node $u\in\mathcal{T}$ , there exists an edge $e\in\mathcal{E}$ such that $u\subseteq e$ . Correspondingly, for each node $v\in\mathcal{T}^{\prime}$ , there exists an edge $e\in\mathcal{E}$ such that $v\subseteq e\cap\mathbf{y}$ . Thus, for each out-irreducible $e\in\mathcal{E}$ , there exists one node $v\in\mathcal{T}^{\prime}$ such that $e\cap\mathbf{y}=v$ , since if $e\cap\mathbf{y}\subset v$ and $v\subseteq e^{\prime}\cap\mathbf{y}$ for some other $e^{\prime}\in\mathcal{E}$ , $e$ would be out-reducible. This implies that $\mathcal{T}^{\prime}$ has only two types of nodes: (1) $e\cap\mathbf{y}$ for each out-irreducible $e\in\mathcal{E}$ , and (2) a proper subset of $e\cap\mathbf{y}$ for some $e\in\mathcal{E}$ .

Not surprisingly, $\mathcal{Q}_{out}$ also has two types of edges: (1) $e\cap\mathbf{y}$ for each out-irreducible $e\in\mathcal{E}$ , and (2) $e\cap\mathbf{y}$ for each out-reducible $e\in\mathcal{E}$ . Nodes in $\mathcal{T}^{\prime}$ of type (1) are in one-to-one correspondence with edges in $\mathcal{Q}_{out}$ of type (1). Moreover, after applying the reduce procedure repeatedly on $\mathcal{T}^{\prime}$ or $\mathcal{Q}_{out}$ , only nodes or edges of type (1) can survive. Thus, the reduced query of $\mathcal{Q}_{out}$ is hierarchical iff the reduced query of $\mathcal{T}^{\prime}$ is hierarchical, and $\mathcal{Q}_{out}$ is r-hierarchical iff $\mathcal{T}^{\prime}$ is r-hierarchical. ∎
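On small queries the residual condition can be checked mechanically. The sketch below assumes the usual definition of hierarchical (for every two attributes $x,y$, the edge sets $\mathcal{E}_{x}$ and $\mathcal{E}_{y}$ are nested or disjoint) and takes the reduce procedure to mean dropping every edge contained in another. It tests only whether $\mathcal{Q}_{out}$ is r-hierarchical; the free-connex test is not implemented, and all function names are hypothetical.

```python
from itertools import combinations


def reduce_edges(edges):
    """Drop every edge that is strictly contained in another edge."""
    es = list({frozenset(e) for e in edges})  # also removes duplicate edges
    return [e for e in es if not any(e < f for f in es)]


def is_hierarchical(edges):
    """For every attribute pair, the sets of incident edges are nested or disjoint."""
    attrs = set().union(*edges)
    occ = {x: {i for i, e in enumerate(edges) if x in e} for x in attrs}
    return all(
        not (occ[x] & occ[y] and occ[x] - occ[y] and occ[y] - occ[x])
        for x, y in combinations(attrs, 2)
    )


def is_r_hierarchical(edges):
    return is_hierarchical(reduce_edges(edges))


def residual(edges, out_attrs):
    """Edges of Q_out: every edge restricted to the output attributes."""
    return [frozenset(e) & frozenset(out_attrs) for e in edges]


# Path join R1(A,B), R2(B,C), R3(C,D) with output attributes {A, B, C}.
path = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
print(is_r_hierarchical(path))                             # False
print(is_r_hierarchical(residual(path, {"A", "B", "C"})))  # True
```

In this example the full join is not r-hierarchical, but its residual query on $\{A,B,C\}$ is.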

Theorem 10.

For out-hierarchical query $\mathcal{Q}_{\mathbf{y}}$ and any instance $\mathcal{R}$ , there is an algorithm computing it in $O(1)$ rounds with load $O(\frac{\mathrm{IN}}{p}+L_{\textrm{instance}}(p,\mathcal{R},\mathbf{y}))$ .

Note that the instance-optimal lower bound $L_{\textrm{instance}}$ for a join-aggregate query is defined with respect to the output attributes only, i.e.,

[TABLE]

where $\pi_{\mathbf{y}}\mathcal{Q}(\mathcal{R},S)=\pi_{\mathbf{y}}((\Join_{e\in S}R(e))\ltimes\mathcal{Q}(\mathcal{R}))$ .

7 A Lower Bound on Triangle Join

Finally, we look beyond acyclic joins. In particular, we give an output-sensitive lower bound on the triangle join $\mathcal{Q}_{\triangle}=R_{1}(B,C)\Join R_{2}(A,C)\Join R_{3}(A,B)$ . For $\mathcal{Q}_{\triangle}$ , a worst-case lower bound of $\Omega({\mathrm{IN}\over p^{2/3}})$ is known, by the following argument: A server loading $L$ tuples can emit at most $O(L^{3/2})$ join results by the AGM bound [4], while the join size of $\mathcal{Q}_{\triangle}$ can be as large as $\Omega(\mathrm{IN}^{3/2})$ . Then setting $p\cdot L^{3/2}=\Omega(\mathrm{IN}^{3/2})$ yields this lower bound. However, if $\mathrm{OUT}$ is used as a parameter, this argument only leads to a lower bound of $\Omega(({\mathrm{OUT}\over p})^{2/3})$ . Below, we improve this lower bound to the following:

Theorem 11.

For any $\mathrm{IN}$ and $\mathrm{OUT}$ with $\mathrm{IN}/\log^{2}\mathrm{IN}\geq 3p^{3}$ , there exists an instance $\mathcal{R}$ for $\mathcal{Q}_{\triangle}$ with input size $\Theta(\mathrm{IN})$ and output size $\Theta(\mathrm{OUT})$ such that any tuple-based algorithm computing it in $O(1)$ rounds must have a load of $\Omega(\min\{{\mathrm{IN}\over p}+\frac{\mathrm{OUT}}{p\log N},\frac{\mathrm{IN}}{p^{2/3}}\})$ .

Proof.

When $\mathrm{OUT}\leq\mathrm{IN}$ , the claimed lower bound simplifies to $\Omega({\mathrm{IN}\over p})$ , so we will only consider the case $\mathrm{OUT}>\mathrm{IN}$ . Let $N=\mathrm{IN}/3$ and $\tau=\frac{\mathrm{OUT}}{N}$ . Note that $\tau\leq\sqrt{N}$ as implied by the AGM bound. Our construction of the hard instance $\mathcal{R}$ is illustrated in Figure 6.

Set $|\mathrm{dom}(A)|=\tau$ , and $|\mathrm{dom}(B)|=|\mathrm{dom}(C)|=\frac{N}{\tau}$ . Set $R_{2}(A,C)=\mathrm{dom}(A)\times\mathrm{dom}(C)$ and $R_{3}(A,B)=\mathrm{dom}(A)\times\mathrm{dom}(B)$ . The relation $R_{1}(B,C)$ is constructed randomly, in which each distinct value in $B$ and each distinct value of $C$ have a probability of $\frac{\tau^{2}}{N}$ to form a tuple. Note that relations $R_{2}$ and $R_{3}$ are deterministic and always have $N$ tuples. Relation $R_{1}$ is probabilistic with $N$ tuples in expectation. So this instance has input size $3N=\mathrm{IN}$ and output size $({N\over\tau})^{2}\cdot{\tau^{2}\over N}\cdot\tau=\mathrm{OUT}$ in expectation. By the Chernoff bound, the probability that the input size and output size deviate from their expectation by more than a constant factor is at most $\exp(-\Omega(|R_{1}|))=\exp(-\Omega(N))$ .
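The construction is easy to simulate on a small scale. The following sketch assumes that $\tau$ divides $N$ and $\tau\leq\sqrt{N}$, and uses hypothetical helper names `build_instance` and `count_triangles`. Since $R_{2}$ and $R_{3}$ are full Cartesian products, every tuple of $R_{1}$ takes part in exactly $\tau$ triangles, so the output size is $\tau\cdot|R_{1}|$.

```python
import random
from collections import defaultdict


def build_instance(N, tau, seed=0):
    """Random hard instance: R2, R3 are full products, R1 is sampled."""
    rng = random.Random(seed)
    dom_a = range(tau)
    dom_b = range(N // tau)
    dom_c = range(N // tau)
    r2 = {(a, c) for a in dom_a for c in dom_c}  # R2(A, C)
    r3 = {(a, b) for a in dom_a for b in dom_b}  # R3(A, B)
    prob = tau * tau / N                         # per-pair probability
    r1 = {(b, c) for b in dom_b for c in dom_c if rng.random() < prob}
    return r1, r2, r3


def count_triangles(r1, r2, r3):
    """Number of (a, b, c) with (b,c) in R1, (a,c) in R2 and (a,b) in R3."""
    a_by_c = defaultdict(list)
    for a, c in r2:
        a_by_c[c].append(a)
    return sum(1 for (b, c) in r1 for a in a_by_c[c] if (a, b) in r3)


N, tau = 900, 10
r1, r2, r3 = build_instance(N, tau)
out_size = count_triangles(r1, r2, r3)
print(len(r2), len(r3), out_size == tau * len(r1))  # 900 900 True
```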

Similar to the proof of Lemma 6, we will show that with positive probability, an instance constructed this way will have a bounded $J(L)$ , the maximum number of join results a server can produce, if it loads at most $L$ tuples from each relation. Then setting $p\cdot J(L)=\Omega(\mathrm{OUT})$ yields a lower bound on $L$ .

To bound $J(L)$ , we first argue that on any instance constructed as above, we can limit the choice of the $L$ tuples loaded from $R_{2}(A,C)$ ( $R_{3}(A,B)$ , respectively) by any server to the form $X\times Y$ for some $X\subseteq\mathrm{dom}(A),Y\subseteq\mathrm{dom}(C)$ ( $Y\subseteq\mathrm{dom}(B)$ , respectively), i.e., the algorithm should load tuples from $R_{2}(A,C)$ and $R_{3}(A,B)$ in the form of a Cartesian product. More precisely, we show below that making such a restriction will not make $J(L)$ smaller by more than a constant factor.

Suppose a server has loaded $L$ tuples from $R_{1}(B,C)$ . Then the server needs to decide which $L$ tuples from $R_{2}$ and $R_{3}$ to load to maximize the number of triangles formed. This is a combinatorial optimization problem that can be formulated as an integer linear program (ILP). Introduce a variable $u_{ab}$ for each pair $a\in\mathrm{dom}(A),b\in\mathrm{dom}(B)$ and a variable $v_{ac}$ for each pair $a\in\mathrm{dom}(A),c\in\mathrm{dom}(C)$ . Also let $I(bc)=1$ if $(b,c)\in R_{1}$ is loaded by the server, and $I(bc)=0$ otherwise. Then $ILP_{1}$ below defines this optimization problem, where $a$ always ranges over $\mathrm{dom}(A)$ , $b$ over $\mathrm{dom}(B)$ , $c$ over $\mathrm{dom}(C)$ unless specified otherwise.

[TABLE]

We transform $ILP_{1}$ into another ILP, shown as $ILP_{3}$ above. $ILP_{3}$ uses a function $\Delta(w)$ , which denotes the optimal solution of $ILP_{2}$ . $ILP_{2}$ is parameterized by $w$ and $a$ , and finds the maximum number of triangles that can be formed with the tuples loaded from $R_{1}(B,C)$ and $a\in\mathrm{dom}(A)$ , subject to the constraint that at most $w$ tuples containing $a$ are loaded from $R_{2}$ and $R_{3}$ . Because all values $a\in\mathrm{dom}(A)$ are structurally equivalent, the optimal solution of $ILP_{2}$ does not depend on the particular choice of $a$ , which is why we write the optimal solution of $ILP_{2}$ as $\Delta(w)$ . Also, it is obvious that $\Delta(.)$ is a non-decreasing function. Then, $ILP_{3}$ tries to find the optimal allocation of the $L$ tuples to different values $a\in\mathrm{dom}(A)$ so as to maximize the total number of triangles formed. Let the optimal solutions of $ILP_{1},ILP_{3}$ be $OPT_{1},OPT_{3}$ , respectively. Because $ILP_{3}$ only restricts the server to load at most $2L$ tuples from $R_{2}$ and $R_{3}$ in total, any feasible solution to $ILP_{1}$ is also a feasible solution to $ILP_{3}$ , so $OPT_{1}\leq OPT_{3}$ . Next we construct a feasible solution of $ILP_{3}$ with the Cartesian product restriction above, and show that it is within a constant factor from $OPT_{3}$ , hence $OPT_{1}$ .

Let $w^{*}=\arg\max_{\frac{L}{\tau}\leq w\leq L}\frac{L}{w}\cdot\Delta(w)$ . We choose $\frac{L}{w^{*}}$ values arbitrarily from $\mathrm{dom}(A)$ and allocate $w^{*}$ tuples to each such $a$ . For each such $a$ , we use the optimal solution of $ILP_{2}$ to find the $w^{*}$ tuples to load from $R_{2}$ and $R_{3}$ . Note that the optimal solution is the same for every $a$ , so each $a$ will choose the same sets of $b$ ’s and $c$ ’s. Thus, this feasible solution loads tuples from $R_{2}$ and $R_{3}$ in the form of Cartesian products. The number of triangles formed is $W=\frac{L}{w^{*}}\cdot\Delta(w^{*})$ . We show that this is a constant-factor approximation of $OPT_{3}$ .

Lemma 5.

$W\geq\frac{1}{3}OPT_{3}\geq\frac{1}{3}OPT_{1}$ .

Proof.

Suppose $OPT_{3}$ chooses a set of values $A^{*}$ from $A$ , and each $a\in A^{*}$ has $w_{a}$ tuples loaded from $R_{2}$ and $R_{3}$ . A value $a\in A^{*}$ is efficient if $\frac{\Delta(w_{a})}{w_{a}}\geq\frac{\Delta(w^{*})}{w^{*}}$ , otherwise inefficient. Denote the set of efficient values as $A^{*}_{1}$ and inefficient values as $A^{*}_{2}$ . Note that for every efficient value $a$ , $w_{a}\leq\frac{L}{\tau}$ by the definition of $w^{*}$ .

We relate $W$ and $OPT_{3}$ by showing how to cover all the triangles reported by $OPT_{3}$ with the feasible solution constructed above. First, we use $\frac{\sum_{a\in A^{*}_{2}}w_{a}}{3w^{*}}$ values of $A$ each with $w^{*}$ tuples from $R_{2}$ and $R_{3}$ to cover the triangles reported by $A^{*}_{2}$ . The total number of tuples needed is at most $\frac{2}{3}\sum_{a\in A^{*}_{2}}w_{a}\leq\frac{4}{3}L$ . The number of triangles that can be reported is

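$$\frac{\sum_{a\in A^{*}_{2}}w_{a}}{3w^{*}}\cdot\Delta(w^{*})=\frac{1}{3}\sum_{a\in A^{*}_{2}}w_{a}\cdot\frac{\Delta(w^{*})}{w^{*}}\geq\frac{1}{3}\sum_{a\in A^{*}_{2}}\Delta(w_{a}).$$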

Next, we use $\frac{L}{3w^{*}}$ values each with $w^{*}$ tuples from $R_{2}$ and $R_{3}$ to cover the triangles reported by $A^{*}_{1}$ . The total number of tuples needed is $\frac{2}{3}L$ . Recall that $w_{a}\leq\frac{L}{\tau}$ for each $a\in A^{*}_{1}$ . The number of triangles that can be reported is

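$$\frac{L}{3w^{*}}\cdot\Delta(w^{*})\geq\frac{1}{3}\cdot\frac{L}{L/\tau}\cdot\Delta\left(\frac{L}{\tau}\right)=\frac{\tau}{3}\cdot\Delta\left(\frac{L}{\tau}\right)\geq\frac{1}{3}\sum_{a\in A^{*}_{1}}\Delta(w_{a}),$$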

where the rationale behind the last inequality is that there are at most $\tau$ values in $A^{*}_{1}$ and we have $\Delta\left(\frac{L}{\tau}\right)\geq\Delta(w_{a})$ for each $a\in A^{*}_{1}$ by the non-decreasing property of $\Delta(.)$ .

Combining the two parts for the optimal solution $A^{*}$ , our alternative solution loads at most $2L$ tuples from $R_{2}$ and $R_{3}$ , and can report at least $\frac{1}{3}\cdot OPT_{3}$ triangles. ∎
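To make the choice of $w^{*}$ concrete, the sketch below scans the allowed range and returns the maximizer of $\frac{L}{w}\cdot\Delta(w)$ together with $W$. The function `delta` used in the example is a made-up non-decreasing stand-in, not the actual optimum of $ILP_{2}$, and `choose_w_star` is a hypothetical name.

```python
def choose_w_star(L, tau, delta):
    """Return (w_star, W) maximizing (L / w) * delta(w) over L/tau <= w <= L.

    `delta` is any non-decreasing function of the per-value budget w; here it
    only stands in for the optimum of ILP_2, which is not computed.
    """
    lo = max(1, -(-L // tau))  # ceil(L / tau), at least 1
    w_star = max(range(lo, L + 1), key=lambda w: (L / w) * delta(w))
    return w_star, (L / w_star) * delta(w_star)


def toy_delta(w):
    """Hypothetical profile: quadratic growth that saturates at 36 triangles."""
    return min((w // 2) ** 2, 36)


w_star, W = choose_w_star(L=24, tau=6, delta=toy_delta)
print(w_star, W)  # 12 72.0
```

With this toy profile the best budget is the point where $\Delta$ saturates: spending more per value of $A$ adds nothing, and spending less loses the quadratic gain.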

Next, we show that with positive probability (actually high probability), we obtain an instance on which $J(L)$ is bounded. By the analysis above, we only need to consider the case where tuples from $R_{2}$ and $R_{3}$ are loaded in the form of Cartesian products. One value $b\in\mathrm{dom}(B)$ is loaded if at least one tuple $t\in R_{3}(A,B)$ with $\pi_{B}t=b$ is loaded. Similarly, value $c\in\mathrm{dom}(C)$ is loaded if at least one tuple $t\in R_{2}(A,C)$ with $\pi_{C}t=c$ is loaded. Suppose $\alpha$ and $\beta$ distinct values from $B$ and $C$ are loaded, respectively. Note that we must have $1\leq\alpha,\beta\leq\min\{L,\frac{N}{\tau}\}$ . Without loss of generality, assume $\alpha\leq\beta$ . Due to the Cartesian product constraint, the number of distinct values loaded from $A$ is at most $\gamma=\min\{\frac{L}{\beta},\tau\}$ .

Case 1: $\mathbf{\alpha\beta\leq\frac{NL}{\tau^{2}}}$

We first upper bound the probability that the server can report many triangles on a random instance, for a particular choice of $\alpha$ values loaded from $\mathrm{dom}(B)$ and $\beta$ values from $\mathrm{dom}(C)$ . Since at most $\gamma$ distinct values from $A$ are loaded, each tuple loaded from $R_{1}(B,C)$ can form at most $\gamma$ triangles. Because each $(b,c)$ pair has probability ${\tau^{2}\over N}$ to form a tuple in $R_{1}(B,C)$ , on a random instance, we expect to see $\frac{\tau^{2}\alpha\beta}{N}$ tuples and $\frac{\tau^{2}\alpha\beta\gamma}{N}$ triangles. Note that this is never larger than $\tau\sqrt{\frac{L^{3}}{N}}$ : (1) If $\tau\beta\geq L$ , then $\gamma=\frac{L}{\beta}$ and $\frac{\tau^{2}\alpha\beta\gamma}{N}\leq\frac{\tau^{2}L}{N}\cdot\sqrt{\alpha\beta}\leq\frac{\tau^{2}L}{N}\cdot\sqrt{\frac{NL}{\tau^{2}}}\leq\tau\sqrt{\frac{L^{3}}{N}}$ ; (2) Otherwise, $\gamma=\tau$ and $\frac{\tau^{2}\alpha\beta\gamma}{N}\leq\frac{\tau^{3}\beta^{2}}{N}\leq\frac{\tau^{3}}{N}\cdot\frac{L^{2}}{\tau^{2}}\leq\frac{\tau L^{2}}{N}\leq\tau\sqrt{\frac{L^{3}}{N}}$ . This server can report more than $\delta\tau\sqrt{\frac{L^{3}}{N}}$ triangles, for some $\delta>1$ , if more than $\frac{\delta\tau}{\gamma}\sqrt{\frac{L^{3}}{N}}$ tuples exist among those $\alpha\beta$ pairs. By the Chernoff bound, this happens with probability no larger than $\exp\left(-\Omega(\frac{\delta\tau}{\gamma}\sqrt{\frac{L^{3}}{N}})\right)$ .

This is the probability that the server succeeds in reporting many triangles under a particular choice of $\alpha$ values loaded from $\mathrm{dom}(B)$ and $\beta$ values from $\mathrm{dom}(C)$ . There are $O\left((\frac{N}{\tau})^{2}\right)$ possible $(\alpha,\beta)$ pairs. For each $(\alpha,\beta)$ pair, there are $O\left(\tau^{\gamma}\right)$ choices of loading $\gamma$ values from $A$ , $O\left((\frac{N}{\tau})^{\alpha}\right)$ choices of loading $\alpha$ values from $B$ , and $O\left((\frac{N}{\tau})^{\beta}\right)$ choices from $C$ . Thus the server has $\exp(O((\alpha+\beta+\gamma)\log N))$ possible choices. By the union bound, the probability that any of these choices produces more than $\delta\tau\sqrt{\frac{L^{3}}{N}}$ join results is at most

[TABLE]

which is exponentially small if

[TABLE]

and

[TABLE]

for some sufficiently large constants $c_{1},c_{2}$ . Rearranging, this becomes

[TABLE]

for some sufficiently large constant $c_{3}$ . Under this condition, the union-bound probability above is at most $\exp\left(-\Omega\left(\delta\sqrt{\frac{L^{3}}{N}}\right)\right)$ .
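The deterministic inequality used at the start of Case 1, namely that the expected number of triangles $\frac{\tau^{2}\alpha\beta\gamma}{N}$ never exceeds $\tau\sqrt{\frac{L^{3}}{N}}$, can be checked exhaustively for small parameters. The sketch below is only a numeric sanity check, assuming $L\leq N$ and $\alpha\leq\beta$ as in the text; the function names are hypothetical and a small tolerance absorbs floating-point rounding.

```python
import math


def expected_triangles(N, tau, L, alpha, beta):
    """tau^2 * alpha * beta * gamma / N with gamma = min(L / beta, tau)."""
    gamma = min(L / beta, tau)
    return tau ** 2 * alpha * beta * gamma / N


def check_case1(N, tau, L):
    """Verify the Case 1 bound for all admissible alpha <= beta."""
    bound = tau * math.sqrt(L ** 3 / N)
    cap = min(L, N // tau)
    for beta in range(1, cap + 1):
        for alpha in range(1, beta + 1):
            if alpha * beta * tau ** 2 > N * L:
                continue  # this pair belongs to Case 2
            if expected_triangles(N, tau, L, alpha, beta) > bound * (1 + 1e-9):
                return False
    return True


print(check_case1(N=400, tau=10, L=100))  # True
print(check_case1(N=900, tau=5, L=60))    # True
```

For $N=400$, $\tau=10$, $L=100$ the bound is met with equality at $\alpha=\beta=20$, so it cannot be improved by a constant factor in general.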

Case 2: $\mathbf{\alpha\beta>\frac{NL}{\tau^{2}}}$

In this case, we have $\beta\geq\frac{\sqrt{NL}}{\tau}$ . The server loads $\frac{L}{\beta}$ distinct values from $A$ , so each tuple loaded from $R_{1}$ can form at most $\frac{L}{\beta}$ triangles. The server can load at most $L$ tuples from $R_{1}$ , so at most $\frac{L^{2}}{\beta}\leq\delta\tau\sqrt{\frac{L^{3}}{N}}$ triangles can be reported, for any

[TABLE]

Combining these two cases, under the conditions (10) and (11) on $\delta$ , with high probability the server cannot find any way to load $L$ tuples to report more than $\delta\tau\sqrt{\frac{L^{3}}{N}}$ triangles. Therefore, on these instances, we have

[TABLE]

where we set

[TABLE]

With the facts that $\frac{N}{p}\leq L$ and $\mathrm{OUT}\leq N^{\frac{3}{2}}$ , we observe

[TABLE]

where the last inequality follows from our assumption $N=\mathrm{IN}/3\geq p^{3}\log^{2}\mathrm{IN}\geq p^{3}\log^{2}N$ . Then (13) can be simplified to

[TABLE]

Plugging (12) and (14) into $p\cdot J(L)=\Omega(\mathrm{OUT})=\Omega(N\tau)$ , we obtain

[TABLE]

Finally, after plugging in $\tau=\mathrm{OUT}/N$ and rearranging, we obtain

[TABLE]

∎

Remark. Our lower bound has the following consequences:

1. When $\mathrm{OUT}\geq\mathrm{IN}\cdot p^{1/3}$ , the lower bound becomes $\tilde{\Omega}({\mathrm{IN}\over p^{2/3}})$ , which means that the worst-case optimal algorithm of [24] is actually also output-optimal in this parameter range. Finding $\tilde{\Omega}(\mathrm{IN}\cdot p^{1/3})$ triangles is as difficult as finding $\Theta(\mathrm{IN}^{3/2})$ triangles.

2. When $\mathrm{IN}\leq\mathrm{OUT}\leq\mathrm{IN}\cdot p^{1/3}$ , the lower bound becomes $\tilde{\Omega}({\mathrm{OUT}\over p})$ while we do not have a matching upper bound yet. Nevertheless, this already exhibits a separation from acyclic joins, which can be done with load $O({\sqrt{\mathrm{IN}\cdot\mathrm{OUT}}\over p})$ . The gap is at least $\tilde{\Omega}(\sqrt{\mathrm{OUT}\over\mathrm{IN}})$ .
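The two regimes of the remark can be seen numerically. The sketch below evaluates the bound expressions with all constants and logarithmic factors dropped, so the numbers are only indicative; the function names are illustrative.

```python
import math


def triangle_lower_bound(IN, OUT, p):
    """min{IN/p + OUT/p, IN/p^(2/3)}, with constants and log factors dropped."""
    return min(IN / p + OUT / p, IN / p ** (2 / 3))


def acyclic_load(IN, OUT, p):
    """IN/p + sqrt(IN*OUT)/p, the acyclic-join load with constants dropped."""
    return IN / p + math.sqrt(IN * OUT) / p


IN, p = 10 ** 6, 64  # p^(1/3) = 4, so the threshold is near OUT = 4 * IN

# Large output: the worst-case term IN / p^(2/3) is the minimum.
large = triangle_lower_bound(IN, 10 ** 7, p)
# Moderate output: the output-sensitive term is the minimum.
moderate = triangle_lower_bound(IN, 2 * 10 ** 6, p)
# Separation between OUT/p and sqrt(IN*OUT)/p in the moderate regime.
gap = (2 * 10 ** 6 / p) / (math.sqrt(IN * 2 * 10 ** 6) / p)

print(round(large), moderate, round(gap, 3))  # 62500 46875.0 1.414
```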

Appendix A Proof of Lemma 2

Proof.

Direction ( $\Leftarrow$ ): In an acyclic join $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ , a minimal path of length 3 is a sequence of 4 vertices $(x_{1},x_{2},x_{3},x_{4})$ , such that $\{x_{1},x_{2}\}\subseteq e_{1},\{x_{2},x_{3}\}\subseteq e_{2},\{x_{3},x_{4}\}\subseteq e_{3}$ and there exists no edge $e\in\mathcal{E}$ with $\{x_{1},x_{3}\}\subseteq e$ , or $\{x_{1},x_{4}\}\subseteq e$ , or $\{x_{2},x_{4}\}\subseteq e$ . This already testifies that $\mathcal{Q}$ is not hierarchical. To show that it is not r-hierarchical, consider the process of repeatedly applying the reduce procedure to $\mathcal{Q}$ . If any of $\{e_{1},e_{2},e_{3}\}$ is removed in the process, say $e_{1}$ , there must exist an edge $e^{\prime}_{1}$ such that $e_{1}\subseteq e^{\prime}_{1},x_{3}\notin e^{\prime}_{1},x_{4}\notin e^{\prime}_{1}$ . The same applies for $e_{2}$ and $e_{3}$ . Thus we can always find three edges $e^{\prime}_{1},e^{\prime}_{2},e^{\prime}_{3}$ such that $e^{\prime}_{2}\in\mathcal{E}_{x_{2}}\cap\mathcal{E}_{x_{3}},e^{\prime}_{1}\in\mathcal{E}_{x_{2}}-\mathcal{E}_{x_{3}},e^{\prime}_{3}\in\mathcal{E}_{x_{3}}-\mathcal{E}_{x_{2}}$ after applying the reduce procedure, so this query is not r-hierarchical.

Direction ( $\Rightarrow$ ): The proof is constructive. We will show below how to find a minimal path of length 3 in any acyclic but non-r-hierarchical join. We first apply the reduce procedure to $\mathcal{Q}$ such that no edge is contained in another. The rationale behind this is that a minimal path between two vertices $x,y\in\mathcal{V}$ of length $3$ in the reduced join is also a minimal path between $x,y$ of length 3 in the original join. Then we proceed in 3 steps: (We give an intuitive illustration of the results after each step, in Figure 7.)

**Step 1:** Find a subgraph defined by three distinct edges $\{e_{1},e_{2},e_{3}\}$ and four distinct vertices $\{x_{1},x_{2},x_{3},x_{4}\}$ , such that $x_{1}\in e_{1},x_{1}\notin e_{2}\cup e_{3},x_{2}\in e_{1}\cap e_{2},x_{2}\notin e_{3},x_{3}\in e_{2}\cap e_{3},x_{3}\notin e_{1},x_{4}\in e_{3},x_{4}\notin e_{1}\cup e_{2}$ .

**Step 2:** Find a subgraph defined by three distinct edges $\{e_{1},e_{2},e_{3}\}$ and four distinct vertices $\{x_{1},x_{2},x_{3},x_{4}\}$ , such that $x_{1}\in e_{1},x_{1}\notin e_{2}\cup e_{3},x_{2}\in e_{1}\cap e_{2},x_{2}\notin e_{3},x_{3}\in e_{2}\cap e_{3},x_{3}\notin e_{1},x_{4}\in e_{3},x_{4}\notin e_{1}\cup e_{2}$ , and there exists no edge $e\in\mathcal{E}$ with $\{x_{1},x_{2},x_{3}\}\subseteq e$ or $\{x_{2},x_{3},x_{4}\}\subseteq e$ .

**Step 3:** Find a minimal path of length $3$ between $x_{1}$ and $x_{4}$ .

Our construction and its correctness proof are based on a basic property of acyclic joins, as stated in Lemma 6. With Lemma 6, we are able to prove stronger results in Corollary 5 and Corollary 6, which will be used as building blocks in proving Lemma 2.

Lemma 6.

For three distinct edges $e_{xy},e_{xz},e_{yz}\in\mathcal{E}$ , if $e_{xy}\cap e_{xz}-e_{yz}\neq\emptyset,e_{xy}\cap e_{yz}-e_{xz}\neq\emptyset,e_{xz}\cap e_{yz}-e_{xy}\neq\emptyset$ , then there exists one edge $e\in\mathcal{E}$ such that $e_{xy}\cap e_{xz}\subseteq e,e_{xz}\cap e_{yz}\subseteq e,e_{xy}\cap e_{yz}\subseteq e$ .

**Proof of Lemma 6:** Consider attributes $x,y,z$ such that $x\in e_{xy}\cap e_{xz}-e_{yz},y\in e_{xy}\cap e_{yz}-e_{xz},z\in e_{xz}\cap e_{yz}-e_{xy}$ . In the GYO reduction [1], we observe that (1) none of $x,y,z$ can be removed as a unique attribute before one of the edges $e_{xy},e_{xz},e_{yz}$ is removed; (2) none of $e_{xy},e_{xz},e_{yz}$ can be removed as an empty edge before one of $x,y,z$ is removed. So it is always feasible to identify one edge $e\in\mathcal{E}$ such that $e_{xy}\cap e_{xz}-e_{yz}\subseteq e,e_{xy}\cap e_{yz}-e_{xz}\subseteq e,e_{xz}\cap e_{yz}-e_{xy}\subseteq e$ . Moreover, any attribute in $e_{xy}\cap e_{xz}\cap e_{yz}$ , if one exists, cannot be removed as a unique attribute before one of the edges $e_{xy},e_{xz},e_{yz}$ is removed. Thus we come to the conclusion in Lemma 6.
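For readers less familiar with the GYO reduction, a textbook-style sketch follows; it is not the exact formulation of [1]. It repeatedly removes attributes that occur in a single edge and edges that are empty or contained in another edge, and reports the hypergraph as acyclic iff nothing remains. The example mirrors Lemma 6: three pairwise-overlapping edges become acyclic only once an edge covering their pairwise intersections is present.

```python
def gyo_is_acyclic(edges):
    """Return True iff the GYO reduction empties the hypergraph."""
    es = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        # Rule 1: remove attributes that occur in exactly one edge.
        for e in es:
            for x in list(e):
                if sum(x in f for f in es) == 1:
                    e.discard(x)
                    changed = True
        # Rule 2: remove one edge that is empty or contained in another edge.
        for i, e in enumerate(es):
            if not e or any(j != i and e <= f for j, f in enumerate(es)):
                del es[i]
                changed = True
                break
    return not es


path = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
triangle = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
covered = triangle + [{"A", "B", "C"}]
print(gyo_is_acyclic(path), gyo_is_acyclic(triangle), gyo_is_acyclic(covered))
# True False True
```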

Corollary 5.

For two distinct edges $e_{xy},e_{xz}\in\mathcal{E}$ and a subset of edges $\mathcal{E}_{yz}\subseteq\mathcal{E}-\{e_{xy},e_{xz}\}$ , if $e_{xy}\cap e_{xz}-e_{yz}\neq\emptyset,e_{xy}\cap e_{yz}-e_{xz}\neq\emptyset$ for each $e_{yz}\in\mathcal{E}_{yz}$ and $\left(\bigcap_{e_{yz}\in\mathcal{E}_{yz}}(e_{xz}\cap e_{yz})\right)-e_{xy}\neq\emptyset$ , then there exists one edge $e\in\mathcal{E}$ such that $e_{xy}\cap e_{xz}\subseteq e,\bigcup_{e_{yz}\in\mathcal{E}_{yz}}(e_{xy}\cap e_{yz})\subseteq e,\bigcap_{e_{yz}\in\mathcal{E}_{yz}}(e_{xz}\cap e_{yz})\subseteq e$ .

**Proof of Corollary 5:** For simplicity, rename edges in $\mathcal{E}_{yz}$ as $e_{1},e_{2},\cdots,e_{k}$ . We prove it by induction. The base case $k=1$ is exactly Lemma 6. As the induction hypothesis, assume that there exists one edge $e\in\mathcal{E}$ such that

[TABLE]

Moreover, if $e_{xy}\cap e_{k}\subseteq e$ , edge $e$ is exactly the one characterized by Corollary 5 and we are done. Otherwise, $(e_{xy}\cap e_{k})-e\neq\emptyset$ .

We observe that $\left(\bigcap_{i\in\{1,\cdots,k\}}(e_{xz}\cap e_{i})\right)\subseteq e\cap e_{k}$ , so we have $(e\cap e_{k})-e_{xy}\neq\emptyset$ . If $e_{xy}\cap e-e_{k}=\emptyset$ , then $e_{xy}\cap e\subseteq e_{k}$ . So far we have the following observations on $e_{k}$ : (1) $e_{k}\supseteq e_{xy}\cap e\supseteq(e_{xy}\cap e_{xz})\cup\left(\bigcup_{i\in\{1,\cdots,k-1\}}(e_{xy}\cap e_{i})\right)$ ; (2) $e_{k}\supseteq e_{xy}\cap e_{k}$ ; (3) $e_{k}\supseteq e_{xz}\cap e_{k}\supseteq\bigcap_{i\in\{1,\cdots,k\}}(e_{xz}\cap e_{i})$ , or equivalently,

[TABLE]

Thus edge $e_{k}$ is exactly the one characterized by Corollary 5, and we are done. Otherwise, $e_{xy}\cap e-e_{k}\neq\emptyset$ . Implied by Lemma 6, there exists an edge $e^{\prime}\in\mathcal{E}$ such that $e_{xy}\cap e_{k}\subseteq e^{\prime},e_{xy}\cap e\subseteq e^{\prime},e_{k}\cap e\subseteq e^{\prime}$ . More precisely, (1) $e^{\prime}\supseteq e_{xy}\cap e\supseteq(e_{xy}\cap e_{xz})\cup\left(\bigcup_{i\in\{1,\cdots,k-1\}}(e_{xy}\cap e_{i})\right)$ ; (2) $e^{\prime}\supseteq e_{xy}\cap e_{k}$ ; (3) $e^{\prime}\supseteq e_{k}\cap e\supseteq\bigcap_{i\in\{1,\cdots,k\}}(e_{xz}\cap e_{i})$ . Or equivalently,

[TABLE]

thus edge $e^{\prime}$ is exactly the one characterized by Corollary 5, and we are done.

Corollary 6.

For a set of distinct vertices $x,y_{1},y_{2},\cdots,y_{k}$ , if there exists one edge $e_{0}\in\mathcal{E}$ such that $x\notin e_{0},\{y_{1},y_{2},\cdots,y_{k}\}\subseteq e_{0}$ , and there exists one edge $e_{i}\in\mathcal{E}$ such that $\{x,y_{i}\}\subseteq e_{i}$ for each $i\in\{1,2,\cdots,k\}$ , then there exists one edge $e^{\prime}\in\mathcal{E}$ such that $\{x,y_{1},y_{2},\cdots,y_{k}\}\subseteq e^{\prime}$ .

**Proof of Corollary 6:** We prove it by induction. The base case $k=1$ is trivial. As the induction hypothesis, assume that there exists one edge $e\in\mathcal{E}$ such that $\{x,y_{1},y_{2},\cdots,y_{k-1}\}\subseteq e$ .

If $y_{k}\in e$ , edge $e$ is exactly the one characterized by Corollary 6 and we are done. Moreover, if $\{y_{1},y_{2},\cdots,y_{k-1}\}\subseteq e_{k}$ , edge $e_{k}$ is exactly the one characterized by Corollary 6 and we are done. Otherwise, $y_{k}\notin e$ and $\{y_{1},y_{2},\cdots,y_{k-1}\}-e_{k}\neq\emptyset$ . Note that $y_{k}\in e_{k}\cap e_{0}-e$ , $x\in e_{k}\cap e-e_{0}$ , and $e\cap e_{0}-e_{k}\neq\emptyset$ . By Lemma 6, there exists one edge $e^{\prime}\in\mathcal{E}$ such that $e\cap e_{k}\subseteq e^{\prime}$ , $e_{k}\cap e_{0}\subseteq e^{\prime}$ and $e_{0}\cap e\subseteq e^{\prime}$ . More precisely, $\{y_{1},y_{2},\cdots,y_{k-1}\}\subseteq e\cap e_{0}\subseteq e^{\prime}$ , $\{x\}\subseteq e\cap e_{k}\subseteq e^{\prime}$ , and $\{y_{k}\}\subseteq e_{k}\cap e_{0}\subseteq e^{\prime}$ , thus $\{x,y_{1},y_{2},\cdots,y_{k}\}\subseteq e^{\prime}$ .

**Proof of Step 1:**

If an acyclic join is not r-hierarchical, then there exist two attributes $x,y$ such that $\mathcal{E}_{x}\cap\mathcal{E}_{y}\neq\emptyset,\mathcal{E}_{x}-\mathcal{E}_{y}\neq\emptyset,\mathcal{E}_{y}-\mathcal{E}_{x}\neq\emptyset$ . Consider $e_{xy}\in\mathcal{E}_{x}\cap\mathcal{E}_{y}$ , $e_{x}\in\mathcal{E}_{x}-\mathcal{E}_{y}$ and $e_{y}\in\mathcal{E}_{y}-\mathcal{E}_{x}$ . It suffices to show that $e_{x}-e_{xy}-e_{y}\neq\emptyset$ and $e_{y}-e_{xy}-e_{x}\neq\emptyset$ by the constraint. First, $e_{x}-e_{xy}$ is not empty, since otherwise $e_{x}\subseteq e_{xy}$ , contradicting our assumption. The same applies for $e_{y}-e_{xy}\neq\emptyset$ . If $e_{x}-e_{xy}-e_{y}=\emptyset$ , each attribute appearing in $e_{x}-e_{xy}$ also appears in $e_{y}$ . In this way, we can identify three distinct attributes $x,y,z$ such that $x\in e_{x}\cap e_{xy}-e_{y}$ , $y\in e_{y}\cap e_{xy}-e_{x}$ , $z\in e_{x}\cap e_{y}-e_{xy}$ , which form a cycle. Thus, by Lemma 6, there exists an edge $e_{xyz}\in\mathcal{E}$ such that $e_{x}\cap e_{xy}\subseteq e_{xyz},e_{y}\cap e_{xy}\subseteq e_{xyz},e_{x}\cap e_{y}\subseteq e_{xyz}$ . Note that $e_{x}-e_{xy}-e_{y}=\emptyset$ implies that $e_{x}$ can be rewritten as $(e_{x}\cap e_{xy})\cup(e_{x}\cap e_{y})$ . In this way, $e_{x}\subseteq e_{xyz}$ , contradicting our assumption. So we have $e_{x}-e_{xy}-e_{y}\neq\emptyset$ , and the same applies for $e_{y}-e_{xy}-e_{x}\neq\emptyset$ .

**Proof of Step 2:**

Assume we already have a subgraph defined by edges $\{e_{1},e_{2},e_{3}\}$ and vertices $\{x_{1},x_{2},x_{3},x_{4}\}$ , such that $x_{1}\in e_{1},x_{1}\notin e_{2}\cup e_{3},x_{2}\in e_{1}\cap e_{2},x_{2}\notin e_{3},x_{3}\in e_{2}\cap e_{3},x_{3}\notin e_{1},x_{4}\in e_{3},x_{4}\notin e_{1}\cup e_{2}$ . If there exists no edge $e\in\mathcal{E}$ such that $\{x_{1},x_{2},x_{3}\}\subseteq e$ or $\{x_{2},x_{3},x_{4}\}\subseteq e$ , we are done. Otherwise, we need to show how to find $x_{1}^{\prime},x_{4}^{\prime}$ satisfying our condition to replace $x_{1},x_{4}$ . Note that the replacement of $x_{1}$ and that of $x_{4}$ are independent, as well as their correctness arguments.

In the following, we will tackle the situation where there exists an edge $e\in\mathcal{E}$ such that $\{x_{1},x_{2},x_{3}\}\subseteq e$ . The situation where there exists an edge $e\in\mathcal{E}$ such that $\{x_{2},x_{3},x_{4}\}\subseteq e$ is symmetric and can be tackled similarly.

Define the attribute set $S=\{x\in e_{1}:\exists e\in\mathcal{E},\{x_{2},x_{3},x\}\subseteq e\}$ . If $e_{1}-e_{2}-e_{3}-S\neq\emptyset$ , then we just replace $x_{1}$ by any attribute in $e_{1}-e_{2}-e_{3}-S$ . Otherwise, $e_{1}-e_{2}-e_{3}-S=\emptyset$ , which implies that $e_{1}$ can be rewritten as $(e_{1}\cap S)\cup(e_{1}\cap e_{2})\cup(e_{1}\cap e_{3})$ . We will prove by contradiction that this case cannot happen in the reduced join. Define the edge set $\mathcal{E}_{S}=\{e\in\mathcal{E}:\exists x\in S,\{x_{2},x_{3},x\}\subseteq e\}$ . Note that if $x\notin S$ , then $x\notin e$ for each $e\in\mathcal{E}_{S}$ . We distinguish the following four cases, and give an intuitive illustration of the contradiction in each case in Figure 8. In every case, the technique is to identify an edge $e\in\mathcal{E}$ such that $e_{1}\neq e$ and $e_{1}\subseteq e$ , which is a contradiction in a reduced join.

Case 1: $\mathbf{e_{1}\cap e_{3}-e_{2}-S\neq\emptyset}$

Consider an arbitrary attribute $x\in e_{1}\cap e_{3}-e_{2}-S$ . Denote $\mathcal{E}^{\prime}_{S}=\{e_{2}\}\cup\mathcal{E}_{S}$ . Note that $x_{2}\in e_{1}\cap e-e_{3}$ , $x_{3}\in e\cap e_{3}-e_{1}$ , and $x\in e_{1}\cap e_{3}-e$ for each $e\in\mathcal{E}^{\prime}_{S}$ . Implied by Corollary 5, there exists an edge $e^{\prime}\in\mathcal{E}$ such that $e_{1}\cap e_{3}\subseteq e^{\prime}$ , $x_{3}\in e^{\prime}$ and $e_{1}\cap e\subseteq e^{\prime}$ for each $e\in\mathcal{E}^{\prime}_{S}$ . This also implies $e_{1}\cap S\subseteq e^{\prime}$ , $e_{1}\cap e_{2}\subseteq e^{\prime}$ , and $e_{1}\neq e^{\prime}$ . Thus, $e_{1}\subseteq e^{\prime}$ contradicting our assumption.

Case 2: $\mathbf{e_{1}\cap e_{3}-e_{2}-S=\emptyset}$ and $\mathbf{e_{1}\cap e_{2}-e_{3}-S\neq\emptyset}$

Consider an arbitrary attribute $x\in e_{1}\cap e_{2}-e_{3}-S$ . Denote $S^{\prime}=S-e_{2}$ , where $S^{\prime}\neq\emptyset$ since $x_{1}\in S-e_{2}$ . Note that $x\in e_{1}\cap e_{2}-e$ , $e\cap e_{1}-e_{2}\neq\emptyset$ , and $x_{3}\in e_{2}\cap e-e_{1}$ for each $e\in\mathcal{E}_{S}^{\prime}$ . By Corollary 5, there exists an edge $e^{\prime}\in\mathcal{E}$ such that $e_{1}\cap e_{2}\subseteq e^{\prime}$ , $x_{3}\in e^{\prime}$ and $e_{1}\cap e\subseteq e^{\prime}$ for each $e\in\mathcal{E}_{S}^{\prime}$ . This also implies $e_{1}\cap S^{\prime}\subseteq e^{\prime}$ and $e_{1}\neq e^{\prime}$ . Thus $(e_{1}\cap S)\cup(e_{1}\cap e_{2})\subseteq e^{\prime}$ . We already have $e_{1}\cap e_{3}-e_{2}-S=\emptyset$ in this case. Thus, $e_{1}\subseteq e^{\prime}$ , contradicting our assumption.

Case 3: $\mathbf{e_{1}\cap e_{3}-e_{2}-S=\emptyset}$ , $\mathbf{e_{1}\cap e_{2}-e_{3}-S=\emptyset}$ , and $\mathbf{e_{1}\cap e_{2}\cap e_{3}-S\neq\emptyset}$

Consider an arbitrary attribute $x\in e_{1}\cap e_{2}\cap e_{3}-S$ . Note that $x_{2}\in e_{1}\cap e-e_{3}$ , $x_{3}\in e_{3}\cap e-e_{1}$ , and $x\in e_{1}\cap e_{3}-e$ for each $e\in\mathcal{E}_{S}$ . By Corollary 5, there exists an edge $e^{\prime}\in\mathcal{E}$ such that $e_{1}\cap e_{3}\subseteq e^{\prime}$ , $x_{3}\in e^{\prime}$ , and $e_{1}\cap e\subseteq e^{\prime}$ for each $e\in\mathcal{E}_{S}$ . This also implies $e_{1}\cap S\subseteq e^{\prime}$ and $e_{1}\neq e^{\prime}$ . We already have $e_{1}\cap e_{2}-e_{3}-S=\emptyset$ in this case. Thus, $e_{1}\subseteq e^{\prime}$ , contradicting our assumption.

Case 4: $\mathbf{e_{1}\cap e_{3}-e_{2}-S=\emptyset}$ , $\mathbf{e_{1}\cap e_{2}-e_{3}-S=\emptyset}$ , and $\mathbf{e_{1}\cap e_{2}\cap e_{3}-S=\emptyset}$

Under these circumstances, $e_{1}\subseteq S$ . Since $S\subseteq e_{1}$ by definition, we have $e_{1}=S$ . Consider attribute $x_{3}$ together with all attributes in $S$ : we have $S\subseteq e_{1}$ , and for each $x\in S$ there exists an edge $e_{x}\in\mathcal{E}_{S}$ such that $\{x,x_{3}\}\subseteq e_{x}$ . By Corollary 6, there exists an edge $e^{\prime}\in\mathcal{E}^{\prime}$ such that $x_{3}\in e^{\prime}$ and $S\subseteq e^{\prime}$ . Thus, $e_{1}\neq e^{\prime}$ and $e_{1}\subseteq e^{\prime}$ , contradicting our assumption.

Combining these four cases proves step 2.
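The sets used in the proof of step 2 are easy to compute on a concrete hypergraph. The following is a minimal sketch, assuming edges are represented as Python frozensets of attribute names; the function names and the toy hypergraph are illustrative assumptions and are not taken from the paper. It builds $S$ and $\mathcal{E}_{S}$ from their definitions and checks the reduced-join condition (no edge contained in a different edge) that every case contradicts.

```python
from typing import FrozenSet, List, Set

Edge = FrozenSet[str]


def is_reduced(edges: List[Edge]) -> bool:
    """True if no edge is contained in a different edge."""
    return not any(a != b and a <= b for a in edges for b in edges)


def attribute_set_S(edges: List[Edge], e1: Edge, x2: str, x3: str) -> Set[str]:
    """S = {x in e1 : some edge contains {x2, x3, x}}."""
    return {x for x in e1 if any({x2, x3, x} <= e for e in edges)}


def edge_set_E_S(edges: List[Edge], S: Set[str], x2: str, x3: str) -> List[Edge]:
    """E_S = {e : some x in S has {x2, x3, x} contained in e}."""
    return [e for e in edges if any({x2, x3, x} <= e for x in S)]


# Hypothetical toy hypergraph (an assumption made for illustration only).
e1 = frozenset({"x1", "x2", "y"})
e2 = frozenset({"x2", "x3", "z"})
e3 = frozenset({"x3", "x4"})
e4 = frozenset({"y", "x2", "x3"})
edges = [e1, e2, e3, e4]

S = attribute_set_S(edges, e1, "x2", "x3")
E_S = edge_set_E_S(edges, S, "x2", "x3")

# If this set is non-empty, x1 can be replaced by any attribute in it.
replacement_candidates = e1 - e2 - e3 - S
```

On this toy input the candidate set is non-empty, so the easy branch of the argument applies; the four cases only concern inputs where it is empty.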

**Proof of step 3:** Consider the subgraph found in the last step. By the definition of minimal path, it suffices to show that there exists no edge $e^{\prime}\in\mathcal{E}$ such that $\{x_{1},x_{3}\}\subseteq e^{\prime}$ , or $\{x_{1},x_{4}\}\subseteq e^{\prime}$ , or $\{x_{2},x_{4}\}\subseteq e^{\prime}$ . By contradiction, assume there is an $e^{\prime}$ where $\{x_{1},x_{3}\}\subseteq e^{\prime}$ . By the constraints of this subgraph, $x_{2}\notin e^{\prime}$ and $x_{4}\notin e^{\prime}$ . Attributes $x_{1},x_{2},x_{3}$ then form a cycle on edges $e_{1},e_{2},e^{\prime}$ , so there must exist an edge containing all of $\{x_{1},x_{2},x_{3}\}$ , contradicting the constraints. A similar argument applies for $\{x_{1},x_{4}\}\subseteq e^{\prime}$ and $\{x_{2},x_{4}\}\subseteq e^{\prime}$ . ∎
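The condition verified in step 3 can be checked mechanically. The sketch below covers only the condition stated in the proof, namely that no edge contains two non-consecutive attributes of the path; it is not a full implementation of the paper's definition of a minimal path. The function name and the example hypergraphs are assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple


def covered_non_adjacent_pairs(
    edges: List[FrozenSet[str]], path: Sequence[str]
) -> List[Tuple[str, str]]:
    """Return every pair of non-consecutive path attributes that
    co-occurs in some edge. An empty result means the condition
    used in step 3 holds for this path."""
    bad = []
    for i, j in combinations(range(len(path)), 2):
        if j > i + 1 and any({path[i], path[j]} <= e for e in edges):
            bad.append((path[i], path[j]))
    return bad


path = ["x1", "x2", "x3", "x4"]

# Hypothetical line-3 hypergraph: only consecutive attributes share an edge.
line3 = [
    frozenset({"x1", "x2"}),
    frozenset({"x2", "x3"}),
    frozenset({"x3", "x4"}),
]
ok = covered_non_adjacent_pairs(line3, path)

# Adding an edge that contains {x1, x3} violates the condition.
violating = line3 + [frozenset({"x1", "x3", "w"})]
bad = covered_non_adjacent_pairs(violating, path)
```

For a path on four attributes the non-consecutive pairs are exactly $\{x_{1},x_{3}\}$, $\{x_{1},x_{4}\}$, and $\{x_{2},x_{4}\}$, the three pairs ruled out in the proof.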

Bibliography

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases: the logical level. Addison-Wesley Longman Publishing Co., Inc., 1995.
[2] F. Afrati, M. Joglekar, C. Ré, S. Salihoglu, and J. D. Ullman. GYM: A multiround join algorithm in MapReduce. In Proc. International Conference on Database Theory, 2017.
[3] F. N. Afrati and J. D. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering, 23(9):1282–1298, 2011.
[4] A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins. SIAM Journal on Computing, 42(4):1737–1767, 2013.
[5] G. Bagan. Algorithmes et complexité des problèmes d’énumération pour l’évaluation de requêtes logiques. PhD thesis, Université de Caen, 2009.
[6] G. Bagan, A. Durand, and E. Grandjean. On acyclic conjunctive queries and constant delay enumeration. In International Workshop on Computer Science Logic, pages 208–222. Springer, 2007.
[7] P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In Proc. ACM Symposium on Principles of Database Systems, 2013.
[8] P. Beame, P. Koutris, and D. Suciu. Skew in parallel query processing. In Proc. ACM Symposium on Principles of Database Systems, 2014.
