Computing the Difference of Conjunctive Queries Efficiently
Xiao Hu, Qichen Wang

TL;DR
This paper presents a novel, structurally-aware approach to efficiently compute the difference of conjunctive queries, achieving linear-time algorithms for many cases and significant speedups over standard SQL methods.
Contribution
It introduces a query rewriting technique that exploits structural properties to push down difference operators, enabling faster computation of query differences.
Findings
Linear-time algorithms for a large class of difference queries
Order-of-magnitude speedups over standard SQL implementations
Heuristics that improve traditional difference query evaluation
Abstract
We investigate how to efficiently compute the difference result of two (or multiple) conjunctive queries, which is the last operator in relational algebra to be unraveled. The standard approach in practical database systems is to materialize the results for every input query as a separate set, and then compute the difference of two (or multiple) sets. This approach is bottlenecked by the complexity of evaluating every input query individually, which could be very expensive, particularly when there are only a few results in the difference. In this paper, we introduce a new approach by exploiting the structural property of input queries and rewriting the original query by pushing the difference operator down as much as possible. We show that for a large class of difference queries, this approach can lead to a linear-time algorithm, in terms of the input size and (final) output size, i.e.,…
| Graph | #edge | #vertex | #l2 path | #triangle | # | # | # | # | # | # |
|---|---|---|---|---|---|---|---|---|---|---|
| Bitcoin | 24,186 | 3,783 | 1,256,332 | 88,753 | 820 | 585,958 | 331,497 | |||
| Epinions | 508,837 | 75,879 | 3,586,405 | 25,947 | ||||||
| DBLP | 1,049,866 | 317,080 | 7,064,738 | 2,224,385 | 466,646 | 3,532,369 | 2,203,597 | - | ||
| 5,105,039 | 875,713 | 372,042 | - | |||||||
| Wiki | 2,394,385 | 0 | - | - |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Database Systems and Queries · Data Management and Algorithms · Logic, Reasoning, and Knowledge
Computing the Difference of Conjunctive Queries Efficiently
Xiao Hu
University of Waterloo200 University Avenue WestWaterlooOntarioCanadaN2L 3G1
and
Qichen Wang
Hong Kong Baptist University224 Waterloo Road, Kowloon TongHong KongHong Kong
(2023; October 2022; January 2023; February 2023)
Abstract.
We investigate how to efficiently compute the difference result of two (or multiple) conjunctive queries, which is the last operator in relational algebra to be unraveled. The standard approach in practical database systems is to materialize the results for every input query as a separate set, and then compute the difference of two (or multiple) sets. This approach is bottlenecked by the complexity of evaluating every input query individually, which could be very expensive, particularly when there are only a few results in the difference. In this paper, we introduce a new approach by exploiting the structural property of input queries and rewriting the original query by pushing the difference operator down as much as possible. We show that for a large class of difference queries, this approach can lead to a linear-time algorithm, in terms of the input size and (final) output size, i.e., the number of query results that survive from the difference operator. We complete this result by showing the hardness of computing the remaining difference queries in linear time. Although a linear-time algorithm is hard to achieve in general, we also provide some heuristics that can provably improve the standard approach. At last, we compare our approach with standard SQL engines over graph and benchmark datasets. The experiment results demonstrate order-of-magnitude speedups achieved by our approach over the vanilla SQL.
conjunctive query, query optimization, difference operator
††copyright: acmlicensed††journalyear: 2023††doi: 10.1145/3589298††journal: PACMMOD††journalvolume: 1††journalnumber: 2††article: 153††publicationmonth: 6††price: 15.00††ccs: Information systems Query optimization
School of Computer Science
Department of Computer Science
1. Introduction
Conjunctive queries with aggregation, union, and difference (also known as negation) operators form the full relational algebra (Abiteboul et al., 1995). While conjunctive queries (Ngo et al., 2018; Bagan et al., 2007; Yannakakis, 1981; Amossen and Pagh, 2009; Deep et al., 2020; Huang and Chen, 2022), with aggregation (Joglekar et al., 2016) and unions (Carmeli and Kröll, 2019; Christoph et al., 2018), have been extensively studied in the literature, the difference operator received much less attention. In modern database systems, there are several different equivalent expressions for computing the difference between two queries, such as NOT IN, NOT EXIST, EXCEPT, MINUS, DIFFERENCE, and LEFT-OUTER JOIN followed by a non-NULL filter. In contrast to its powerful expressibility, the execution plan of difference operator in existing database systems or data analytic engines (e.g., MySQL (mys, ySQL), Oracle (ora, acle), Postgre SQL (pos, eSQL), Spark SQL (spa, kSQL)) is quite brute-force. Given two (or multiple) conjunctive queries, their difference is simply done by materializing the answers for each participated conjunctive query separately, and then computing the difference of two (or multiple) sets. Hashing or other indexes may be built on top of the query answers to speed up the computation of the set difference at last. However, this approach is severely bottlenecked by evaluating every input query individually and materializing a large number of intermediate query results that do not contribute to the final results due to the difference operator.
Let’s consider an example of friend recommendation in social networks (such as Twitter, Facebook, Sina Weibo). A friend recommendation is represented as a triple extracted from the network semantics, such that user is recommended to user since user is a friend of user and user is a friend of user , together with other customized constraints. We also avoid the recommendation when and are already friends. The task of finding all valid recommendations can be captured by a SQL query in Example 1.1, as the difference of two sub-queries.
Example 1.1.
Let be a table storing all edges in the social network, and be a table storing all candidate recommendations. The following SQL query finds all triples from Triple that do not form a triangle in the graph:*
{\color[rgb]{0,0,0}\mathcal{Q}:} SELECT node1, node2, node3 FROM Triple t1
WHERE NOT EXISTS
(SELECT * FROM Graph g1, Graph g2, Graph g3
WHERE g1.dst = g2.src and g2.dst = g3.src and g3.dst = g1.src and g1.src = t1.node1 and g2.src = t1.node2 and g3.src = t1.node3);
such that is the difference of two sub-queries and , where returns all candidate recommendations from Triple and returns all triangle friendship in the social network.
{\color[rgb]{0,0,0}\mathcal{Q}_{1}:} SELECT node1, node2, node3 FROM Triple t1
{\color[rgb]{0,0,0}\mathcal{Q}_{2}:} SELECT * FROM Graph g1, Graph g2, Graph g3
WHERE g1.dst = g2.src and g2.dst = g3.src and g3.dst = g1.src and g1.src = t1.node1 and g2.src = t1.node2 and g3.src = t1.node3;
We note that can be rewritten as the following SQL query :
{\color[rgb]{0,0,0}\mathcal{Q}^{\prime}:} SELECT node1, node2, node3 FROM Triple t1
WHERE NOT EXISTS
(SELECT * FROM Triple t2
WHERE EXISTS (SELECT * FROM graph g1 WHERE t2.node1 = g1.src and t2.node2 = g1.dst)
AND EXISTS (SELECT * FROM graph g2 WHERE t2.node2 = g2.src and t2.node3 = g2.dst)
AND EXISTS (SELECT * FROM graph g3 WHERE t2.node3 = g3.src and t2.node1 = g3.dst)
AND t2.node1 = t1.node1 and t2.node2 = t1.node2 and t2.node3 = t1.node3)
such that is the difference of and another sub-query , where finds all candidate recommendations in Triple that also form a triangle in the social network:
{\color[rgb]{0,0,0}\mathcal{Q}_{3}:} SELECT * FROM Triple t2
WHERE EXISTS (SELECT * FROM graph g1 WHERE t2.node1 = g1.src and t2.node2 = g1.dst)
And EXISTS (SELECT * FROM graph g2 WHERE t2.node2 = g2.src and t2.node3 = g2.dst)
And EXISTS (SELECT * FROM graph g3 WHERE t2.node3 = g3.src and t2.node1 = g3.dst)
AND t2.node1 = t1.node1 and t2.node2 = t1.node2 and t2.node3 = t1.node3
Figure 1(a) illustrates the execution plan for generated by PostgreSQL optimizer. It first materializes all triangles in the graph as , and then computes the difference of and by anti-join. Moreover, hashing index is built on top of all triangles of so that the anti-join can be executed by checking whether every candidate recommendation in appears as a triangle in . At last, all “survived” recommendations are outputted as final answers. In plan (a), computing the set difference at last is the most time-consuming step, which is predicted to take 33.67 minutes by PostgreSQL optimizer. Although computing the subquery is not that expensive, which only takes about 132 seconds, the number of intermediate results materialized for is quite large as expected, which finally leads to the inefficiency of the subsequent computation on . In Section 6, plan (a) actually runs in 308.175 seconds in practice.
To tackle the challenges brought by the difference operator, we take two input sub-queries as a whole into account for algorithm design. We are interested in efficient algorithms with running times linear in the final result size. This requirement rules out the standard approach of materializing the results for each input sub-query separately and then computing their set difference. Indeed, the final output size can be many magnitudes smaller than the number of intermediate results that materialized. To overcome the curse of large intermediate results, we introduce a rewriting-based approach by exploiting the joint structural properties of two input sub-queries and pushing the difference operator down as far as possible.
In Example 1.1, we can rewrite the original SQL query into a new one . Instead of computing , it finds all candidate recommendations that also form a triangle in the social network as , which is exactly the intersection of and , and then computes the difference of and . Figure 1(b) illustrates the execution plan of this new query. We observe that computing the set difference at last is predicated to only take 21.67 seconds, which is much faster than (a). This is as expected, since the number of intermediate results generated by is much smaller than , after taking into consideration, which is the key to the overall improvement. As the price, computing is predicated to take a few more minutes than , but this is totally tolerable. In Section 6, plan (b) actually runs in 78.918 seconds, which already achieves 4x speedup over (a). This significant improvement from (a) to (b) motivates us to further investigate this interesting problem for general queries.
**Our contributions. ** In this paper, we formulate the difference of conjunctive queries (DCQ) problem and study the data complexity of this problem. Our contributions can be summarized as:
- •
Complexity Dichotomy: We give a dichotomy for computing DCQs in linear time in terms of input and output size. We characterize a class of “easy” DCQs exploiting the joint properties of two input CQs, and present a linear-time algorithm. On the other hand, we prove the hardness of obtaining a linear-time algorithm for the remaining “hard” DCQs via several well-known conjectures. ** (Section 3 and 4.1)**
- •
Efficient Heuristic: We propose an efficient heuristic for computing “hard” DCQs, which does not lead to a linear-time algorithm but still improves the baseline approach greatly. The heuristic investigates the intersection of two input CQs and incorporates the state-of-the-art algorithms for CQ evaluation. (Section 4.2)
- •
**Extension: ** We explore several interesting extensions. First, we design a recursive algorithm for computing the difference of multiple conjunctive queries. We also extend our algorithm to support other relational operators, such as selection, projection, join, and aggregation. At last, we investigate the DCQ problem under the bag semantics. (Section 5)
- •
Experimental Evaluation: We provide an experimental evaluation of our approach and standard approach on real-world datasets in both centralized and parallel database systems. The experimental results show that our approach out-performs the baseline on different classes of queries and datasets. (Section 6)
Roadmap. In Section 2, we formally define the DCQ problem and review the literature on evaluating a single CQ. In Section 3, we provide a linear-time algorithm for “easy” DCQs. In Section 4, we prove the hardness for the remaining DCQs and show efficient heuristics. In Section 5, we study several extensions of DCQs with other relational operators and bag semantics. In Section 6, we present the experimental evaluation. At last, we review related work in Section 7 and Section 8.
2. Preliminaries
2.1. Problem Definition
Conjunctive Query (CQ). We consider the standard setting of multi-relational databases. Let be a database schema that contains relations . Let be the set of attributes in the database . Each relation is defined on a subset of attributes . Let be the set of the attributes for all relations. Let be the domain of attribute , and let be the domain of attributes .
Given the database schema , let an input instance be , and the corresponding instances of be , . Where is clear from the context, we will drop the superscript and use for both the schema and instances. Any tuple is defined on . For any attribute , denotes the value of attribute in tuple . Similarly, for a set of attributes , denotes the values of attributes in for with an implicit ordering on the attributes.
We consider the class of conjunctive queries without self-joins formally defined as
[TABLE]
where is the selection operator, is the predicate defined over relation , selects out tuples from passing the predicate and denotes the output attributes. If , such a CQ query is known as full join, which represents the natural join of the underlying relations. We usually use a triple to represent a CQ , and simply use a pair to represent a full join. Each relation in is distinct, i.e., the CQ does not have a self-join. As a simplification, we ignore all the selection operators since it just takes time to decide if a tuple passes the predicate . Moreover, we assume every is defined on different subset of attributes; otherwise, we can simply keep the intersection of all relations defining on the same subset of attributes. Hence, we also use to denote the relation defined on .
The result of over instance noted as , is defined as:
[TABLE]
i.e., the projection of all combinations of tuples from every relation onto , such that tuples in each combination have the same value(s) on the common attribute(s). Let be the input size, i.e., the total number of tuples in the input instance. Let be the output size, i.e., the number of query results of over .
Difference of Conjunctive Queries (DCQ). A DCQ consists of two CQs without self-joins with the same output attributes. We also assume that the DCQ does not have a self-join, i.e., there exists no pair of relations from and from such that are the same. Note that our algorithms presented in this work also applied to the case when self-join exists in DCQ, but our lower bound assumes that no self-join exists. The input to a DCQ is a pair of database instances defined for respectively111 We distinguish the input instances of for simplifying algorithmic description later, which is different from conventional definition. . The result of over is , i.e., tuples that appear in the result of over instance , but not in the result of over instance . Let be the input size, i.e., the total number of tuples in both input instances. Let be the output size.
In this paper, we adopt standard data complexity (Vardi, 1982); that is, we measure the complexity of algorithms with input size and output size , and assume the query size as a constant.
2.2. Literature Review of CQ Evaluation
Before diving into the massive literature, we mention two classes of CQs that play an important role in query evaluation.
- •
(-acyclic). A CQ is -acyclic (Beeri et al., 1983; Fagin, 1983) if there exists a tree (called a join tree) such that each node in corresponds to a relation in , and for each attribute , the set of nodes containing form a connected subtree of . Moreover, we define as the highest node of that attribute appears.
- •
(free-connex). A CQ is free-connex (Bagan et al., 2007) if there exists a tree (called a free-connex join tree) such that is a join tree for , and for any pair of attributes , is not an ancestor of . It has been proved equivalently that a CQ is free-connex if is -acyclic and is also -acyclic.
Their relationships are illustrated in Figure 2. A free-connex CQ must be -acyclic. An -acyclic full join must be free-connex. Below, when the context is clear, we always refer “acyclic” to “-acyclic”.
There has been a long line of research on CQ evaluation (Beeri et al., 1983; Kolaitis and Vardi, 1998; Papadimitriou and Yannakakis, 1997; Vardi, 1982; Chekuri and Rajaraman, 2000). Yannakakis’s seminal algorithm (Beeri et al., 1983) was proposed for acyclic CQs, whose running time differs over different sub-classes of acyclic CQs. A free-connex CQ can be evaluated in time, which is already optimal since any algorithm needs to read input data and output all query results. On the other hand, an acyclic but non-free-connex CQ can be evaluated in time. Subsequent works have progressively defined different notions of “width” (Gottlob et al., 2002, 2009), measuring how far a query is from being acyclic and tackle cyclic queries with decomposition. This line of algorithms run in time, where can be the fractional hypertree width (Gottlob et al., 2002; Ngo et al., 2018), submodular width (Abo Khamis et al., 2016), or FAQ-width (Abo Khamis et al., 2016). In addition, some specific classes of CQs can be speedup by fast matrix multiplication techniques (Amossen and Pagh, 2009; Björklund et al., 2014; Deep et al., 2020), but we won’t go into that direction further. CQ evaluation is still an actively researched problem; any improvement here will also improve DCQ evaluation when plugged into the baseline as well as our approach.
In the remaining, we often use to denote the time complexity of evaluating a CQ .
Implications to the Baseline Approach of DCQ Evaluation. Given a DCQ , the baseline approach of computing separately and then set difference incurs the following cost:
Corollary 2.1.
Given two CQs and , the DCQ can be computed in time.
For example, when both and are free-connex, the baseline approach runs in time, where are the output sizes of respectively.
2.3. New Results of DCQ Evaluation
Our new complexity results for DCQ evaluation are summarized in Table 1. To help understand these results, we first introduce the class of linear-reducible CQs, and the reduce procedure.
Definition 2.2 (Linear-reducible).
A CQ is linear-reducible if is free-connex.
The relationship between linear-reducible CQs and existing classifications of CQs is illustrated in Figure 2. Any full or free-connex CQ must be linear-reducible. In addition, some cyclic but non-full CQs are also linear-reducible, for example, , since adding will result in a free-connex CQ (see Figure 2). It is also noted that any non-free-connex but acyclic CQ is non-linear-reducible.
Moreover, we introduce a reduce procedure in Algorithm 1, that can transform any linear-reducible CQ into a full join query in time, while preserving the query results. This algorithm is similar to the semi-join phase of Yannakakis algorithm (Yannakakis, 1981). Intuitively, we remove attributes or relations in a bottom-up ordering of nodes in a free-connex join tree. Recall that each node in the join tree corresponds to a relation in . When a node is visited, we distinguish two more cases: (line 4-5) if its output attributes are fully contained in its parent, we remove and update its parent relation via semi-joins; (line 6-7) and otherwise, we remove all non-output attributes (if there exists any) in via projections. The output of Algorithm 1 is a full join query (called the reduced query) and an instance (called the reduced instance) such that . An example of reduced query is illustrated in Figure 2.
We are now ready to present the new results for DCQ evaluation.
Dichotomy for Linear-time Algorithm. Our main complexity result is a complete characterization of for which a linear algorithm can be achieved for computing :
Definition 2.3 (Difference-Linear).
Given two CQs and , the DCQ is difference-linear if is free-connex, is linear-reducible, and is -acyclic for every , where and are the reduced queries of respectively.
Theorem 2.4 (Dichotomy).
Given two CQs and , the DCQ can be computed in time if and only if it is difference-linear.
Our proof of Theorem 2.4 consists of two steps. In Section 3, we prove the “if”-direction by designing a linear algorithm for the class of “easy” queries as characterized. In Section 4.1, we prove the “only-if” direction by showing the lower bound for the remaining class of “hard” queries, based on some well-established conjectures.
Improvement Achieved by Heuristics.
For the class of “hard” DCQs on which obtaining a linear-time algorithm is hopeless, we further show some efficient heuristics. The complete results are presented in Section 4.2 and here we mention an interesting case that our heuristics have strictly improved the baseline:
Corollary 2.5.
Given two CQs and , if is linear-reducible, then DCQ can be computed in time.
We summarize all these results above in Table 1: (1) our approach strictly improves the baseline as long as is linear-reducible; (2) furthermore, our approach leads to a linear-time algorithm if also satisfies some specific conditions; (3) in the remaining case when is non-linear-reducible, the comparison of our approach and baseline depends on specific queries or even input instances.
3. Easy DCQs
In this section, we show a linear-time algorithm for computing the class of “easy” DCQ characterized in Theorem 2.4. The main technique we used is simply query rewriting, but by exploiting the structures of two input queries in a non-trivial way.
Theorem 3.1.
Given two CQs and , if is difference-linear, then DCQ can be computed in time.
We start with a special class of DCQs that two input CQs share the same schema. In Section 3.1, we introduce an algorithm based on query rewriting, which always pushes the difference operator down to the input relations and avoids materializing a large number of intermediate results that do not participate in the final query result. In Section 3.2, we move to general case that and can have different schemas.
3.1. and share the same schema
We first note that if share the same schema, i.e., there is a one-to-one correspondence between the relations/attributes in and , Theorem 3.1 degenerates to the following lemma:
Lemma 3.2.
Given two CQs , if is free-connex, then the DCQ can be computed in time.
Let’s start with an example falling into this special case.
Example 3.3.
Consider a DCQ with and . We can rewrite it as the union of two join queries: , where the difference operator is only applied for computing and . Intuitively, for every join result , it must be or ; otherwise, , coming to a contradiction. The correctness of this rewriting will be formally presented in the proof of Lemma 3.4. In addition, the difference operators can be evaluated in time, and the join operators can be evaluated in time. *
Rewrite Rule. We now generalize the rewriting rule in Example 3.9 to general DCQ for , where and is -acyclic. In other words, both correspond to the same acyclic full join query. For any , let be the corresponding relations in respectively. Our rule is built on the observation that for any query result , must hold for every , but happens for some . Applying this observation, we can rewrite such a DCQ as the (disjoint) union of a constant number of join queries as follows:
Lemma 3.4.
Given two CQs , if is -acyclic and , .
Algorithm and Complexity. An algorithm directly follows the rewriting rule above. It first computes the difference of every pair of input relations, i.e., for each , and then computes a full join query derived for each . Actually, we can handle a slightly larger class of DCQ. For , if is free-connex, we simply remove all non-output attributes for separately in the preprocessing step, and then tackle two acyclic full joins, that share the same structure.
As the pre-processing step and difference operators can be evaluated in time, this algorithm is bottlenecked by evaluating the join query , which takes time. Putting everything together, we come to Lemma 3.2.
3.2. and have different schemas
We next move to the general case when these two input CQs have different schemas. We focus on the case when both are full and then extend to non-full case.
DCQ** with full CQs.** Now, we assume that . Theorem 3.1 simply degenerates to the Lemma 3.5.
Lemma 3.5.
Given two full joins and , if are -acyclic, and is -acyclic for every , then can be computed in time.
A straightforward solution is to transform both and into one auxiliary query , and then invoke the algorithm in Section 3.1 to handle the degenerated case. However, this solution does not necessarily lead to a linear-time algorithm. Let’s gain some intuition from the example below.
Example 3.6.
Consider a DCQ with and . For an auxiliary query, we introduce the following intermediate relations , , and . Then, we can rewrite as follows:
[TABLE]
*Then, we are left with two queries that share the same schema. However, this strategy does not necessarily lead to a linear-time algorithm, since materializing the intermediate relation requires super-linear time, which could be much larger than the final output size . *
Careful inspection reveals that a simpler rewriting rule can avoid materializing . More specifically, we keep unchanged and rewrite as above. Then, can be rewritten as . Intuitively, for every join result , it must be or ; otherwise, , coming to a contradiction. The correctness of this rewriting will be formally presented in the proof of Lemma 3.7. In this case, materializing only takes time, but materializing might take super-linear time. Fortunately, we can bound the size of by . The rationale is that every tuple in will participate in at least on one join result of , i.e., the final result of the difference query , thus . For the difference operator, takes time, and takes time. For the join operator, both simple join queries take linear time in terms of their input size and output size. Overall, this rewriting rule can compute the example query in time.
Rewrite Rule. Generalizing this observation, we develop the following rewriting rule for arbitrary full joins . The high-level idea is to introduce an intermediate relation for every , i.e., the projection of join results of onto attributes . Now we can rewrite using input relations in and intermediate relations corresponding to , as well as input relations in , which results in the disjoint union of multiple full joins.
Lemma 3.7.
Given two CQs and , if , , for .
Proof.
Direction . Consider an arbitrary result . By definition, for every , and for some . Moreover, . So, . Direction . Consider an arbitrary and a query result . By definition, and , which further implies . This way, . ∎
Algorithm and Complexity. An algorithm for computing the difference of two full join queries follows the rewriting rule above. For each , it first materializes the query results of , then computes the difference operator , and finally the full join by invoking the classical Yannakakis algorithm. We next analyze the complexity of the algorithm above. To establish the complexity, we first show an upper bound on the size of any intermediate relation constructed:
Lemma 3.8.
* for any , where .*
Proof.
Consider an arbitrary tuple . First, participates in at least one query result of . As , by definition. There must exist some tuple such that . Thus, participates in some query results of . Meanwhile, does not participate in any query result of , since . In this way, participates in at least one result in , thus . Moreover, . Together, we obtain . ∎
Let . If is -acyclic, and is also -acyclic for every , then the constructed CQ is free-connex. Implied by the existing result on CQ evaluation, can be computed in time by the classic Yannakakis algorithm, where is the output size of the difference query! The invocation of Yannakakis algorithm here is crucial for achieving linear complexity. For example, if is computed by first materializing the query results of and then computing their projection onto , the time complexity would be as large as , where is the output size of . Now, each full join is -acyclic with input size and output size , thus can be computed in time. Therefore, the total time complexity is bounded by , since there are sub-queries in . Putting everything together, we come to Lemma 3.5.
DCQ** with general CQs.** Now, we are ready to present an linear-time algorithm for computing , such that is free-connex, is linear-reducible, and is -acyclic for every , where and are the reduced queries of respectively. As described in Algorithm 2, we first apply a preprocessing step to and (line 1-4), which removes non-output attributes in and if they are non-full.
As shown in Algorithm 1, this reduce step is quite standard by first building a free-connex join tree for the derived query , and then traversing the tree in a bottom-up way. In the traversal, when a relation is visited and contains some non-output attributes, we just update its parent relation by applying a semi-join and removing it. Note that if a relation does not contain any non-output attribute, then its ancestor also does not contain any, implied by the property of the free-connex join tree. Thus, the residual tree is a connected subtree that contains the root. Note that no physical relation is defined to , but this is not an issue since when such a relation is visited, Algorithm 1 simply skips it (line 4) as well as its ancestors. This algorithm only takes time.
Then, we are left with two full joins, and invoke our rewriting rule proposed in Section 3.2 (line 6-8). As the reduce procedure takes time, and the join phase takes time implied by Lemma 3.5, we can obtain the complexity result in Theorem 3.1.
Improvement over Baseline.
When fall into the class of “easy” DCQs as characterized by Theorem 3.1, our algorithm only takes time for computing , while the baseline takes time, since for free-connex . We next use a few examples of “easy” DCQs to illustrate the improvement achieved by our approach.
Example 3.9.
Consider a DCQ with and . The baseline takes time to compute the triangle join in , where is the exponent of fast matrix multiplication. In contrast, our approach only takes time since , improving the baseline by a factor of .
Example 3.10.
*Consider a DCQ with and . The baseline takes time to materialize , which degenerates to the Cartesian product of and . In contrast, our approach only requires time, improving the baseline by a factor of , since can be much smaller than . *
Example 3.11.
*Consider a DCQ with and for . The baseline takes time to materialize , and time to materialize . In contrast, our approach can compute it in time, improving the baseline by a factor of , since can be much smaller than . *
4. Hard DCQs
In this section, we turn to the class of “hard” DCQs characterized by Theorem 2.4. We first prove the hardness of computing DCQs in linear time via some well-known conjectures, and then show an efficient heuristic for hard DCQs by further exploiting the query structures.
4.1. Hardness
We will prove the hardness of computing a hard DCQ , in particular: (1) is non-free-connex; or (2) is free-connex but is non-linear-reducible; or (3) is free-connex, is linear-reducible, but there exists some such that is cyclic, where and are the reduced queries of respectively. We will prove the hardness for each class of hard DCQs separately.
Hardness-(1). The hardness of computing DCQs in case (1) comes from computing a non-free-connex CQ (Bagan et al., 2007). By setting the result of as , simply degenerates to , hence we obtain:
Lemma 4.1.
For any DCQ , if is non-free-connex, any algorithm computing requires at least time.
The hardness of case (2) and (3) is built on the strong triangle conjecture in the literature:
Conjecture 4.2 (Strong Triangle conjecture (Abboud and Williams, 2014)).
Detecting whether an -node -edge graph contains a triangle requires time in expectation, where is assumed as the exponent of fast matrix multiplication.
Hardness-(2). We start with two hardcore DCQs in Lemma 4.3 and Lemma 4.4. The proof of Lemma 4.5 for general DCQs in case (2) is given in Appendix B.
Lemma 4.3.
Any algorithm for computing the following DCQ:
[TABLE]
requires time, assuming the strong triangle conjecture.
Proof.
For a graph , we denote and . Note that ; otherwise, we simply remove vertices that do not incident to any edges in . We then construct an instance for by setting . Hence, . Note that there exists some triangle in if and only if is non-empty. Together with , we output “a triangle is detected in ” if and only if . If can be computed in time, whether there exists a triangle in can be detected in time, coming to a contradiction of strong triangle conjecture. ∎
Lemma 4.4.
Any algorithm for computing the following DCQ:
[TABLE]
requires time, assuming the strong triangle conjecture.
Proof.
This is similar to the proof of Lemma 4.3. For a graph , we construct and , with and . Note that there exists some triangle in if and only if is non-empty. Together with , we output “a triangle is detected in ” if and only if . If can be computed in , whether there exists a triangle in can be detected in time, coming to a contradiction of strong triangle conjecture. ∎
Lemma 4.5.
Given two CQs , if is free-connex and is non-linear-reducible, any algorithm computing requires time, assuming the strong triangle conjecture.
Hardness-(3). The hardness of evaluating a DCQ in case (3) inherits the hardness of deciding a DCQ: given a DCQ and input databases , the decidability problem asks to decide whether there exists a query result in . We identify a few hardcore DCQs in Lemma B.16. The proof of Lemma 4.7 for general DCQs in case (3) is given in Appendix B.
Lemma 4.6.
Any algorithm for deciding the following DCQ
[TABLE]
requires time, assuming the strong triangle conjecture.
Proof.
We first focus on the first DCQ and the remaining ones can be proved similarly. Given an arbitrary graph with and , we develop an algorithm to detect whether there exists a triangle in . Note that ; otherwise, we simply remove vertices that do not incident to any edges in . The degree of a vertex is defined as the size of neighbors of , i.e., those incident to with an edge in . We partition vertices in into two subsets: and . From , we construct following relations: , , , and . Set . It can be easily checked that each relation contains at most tuples, hence . We further define a CQ as follows:
[TABLE]
For , we set and output “a triangle is detected” if and only if or is not empty. We first prove the correctness of this algorithm, i.e., a triangle exists in if and only if or is not empty. Direction Only-If. Consider an arbitrary triangle in . We distinguish two cases: (i) at least one of is light; (ii) both and are heavy. In (i), assume is light. Then, . We come to . In (ii), ,, , so we come to . Direction If. If , say , then and therefore is a triangle in . If , say , then , i.e., , and therefore is a triangle in .
We next turn to the time complexity. All statistics and relations can be computed in time. Moreover, relation can be constructed in time since . can be evaluated in time, since each of generates at most intermediate join results if , and each of generates at most intermediate join results if . If can be decided in time, whether there exists a triangle in can be decided in time, coming to a contradiction to strong triangle conjecture.
For , we set and output “a triangle is detected” if and only if or is not empty. For , we set and output “a triangle is detected” if and only if or is not empty. For , we set and output “a triangle is detected” if and only if or is not empty. Similarly, can be computed in time. This way, if can be decided in time, whether there exists a triangle in can be decided in time, coming to a contradiction to strong triangle conjecture. Together, we have completed the proof. ∎
Lemma 4.7.
Given two CQs , if is free-connex, is linear-reducible, and there exists some such that is cyclic where and are the reduced queries of respectively, any algorithm computing requires time, assuming the strong triangle conjecture.
4.2. Efficient Heuristics
Although the hardness results in Section 4 have ruled out a linear-time algorithm for the “hard” DCQs, we find that it is still possible to explore efficient heuristics that can outperform the baseline approach. Our heuristic is based on a simple fact that . After computing the query results for , a straightforward way of deciding is to decide for each result , whether nor not. This decidability query can be viewed as a special Boolean query by replacing every output attribute with a constant . More specifically, for , the derived a Boolean query can be represented as . Putting everything together, we come to Theorem 4.8.
Theorem 4.8.
Given two CQs and , can be computed in O(\texttt{cost}(\mathcal{Q}_{1})+\mathrm{OUT}_{1}\cdot\texttt{cost}({\color[rgb]{0,0,0}\mathcal{Q}^{\emptyset}_{2}})) time, where {\color[rgb]{0,0,0}\mathcal{Q}^{\emptyset}_{2}}=(\emptyset,\mathcal{V}_{2}-\bm{\mathsf{y}},\{e-\bm{\mathsf{y}}:e\in\mathcal{E}_{2}\}).
Remark. If is linear-reducible, can be reduced to a full join in time by Algorithm 1. Then, becomes empty. A faster solution is to build hashing indexes on every relation in the reduced . For each tuple , it suffices to check for every whether , which only takes time. We note that the rewriting rule in Lemma 3.7 can also apply to this case and lead to the same complexity. Suppose is reduced. Each induces a CQ . After materializing the results of , it suffices to check for each tuple , whether or not. This is exactly how our heuristic proceeds. Hence, Corollary 2.5 follows.
Example 4.9.
*Consider a DCQ with and . The baseline spends time computing and time computing the hidden triangle join in , where is the exponent of fast matrix multiplication. In contrast, our algorithm only spends time for computing , without computing the expensive , hence can improve the baseline by a factor of when . *
We can show some further improvement when is non-linear-reducible. Instead of issuing an individual Boolean query for every query result , we take all the Boolean queries into account as whole. To do so, we further explore the structural property of the intersection query {\color[rgb]{0,0,0}\mathcal{Q}^{\oplus}_{2}}=(\bm{\mathsf{y}},\mathcal{V}_{2},\{\bm{\mathsf{y}}\}\cup\mathcal{E}_{2}), by treating the query results of as a single relation over attributes . It is unclear how compares with , since involves an extra relation (over attributes ) of input size as large as the output size of .
Remark.
We note that if only produces query results, then it is always cheaper (or at least not more expensive) to compute than . The observation is that we can always materialize the query results of , and then check for every result whether it is in the extra relation of input size , which does not increase the complexity of computing asymptotically.
Theorem 4.10.
Given two CQs and , can be computed in O(\texttt{cost}(\mathcal{Q}_{1})+\texttt{cost}({\color[rgb]{0,0,0}\mathcal{Q}^{\oplus}_{2}})) time, where {\color[rgb]{0,0,0}\mathcal{Q}^{\oplus}_{2}}=(\bm{\mathsf{y}},\mathcal{V}_{2},\{\bm{\mathsf{y}}\}\cup\mathcal{E}_{2}).
Example 4.11.
Consider a DCQ with and . The baseline takes time to materialize . The first heuristics of issuing for each tuple takes time. We note that lists edges that participate in at least one triangle. The existing best algorithm takes time to compute , where is the exponent of fast matrix multiplication, dominating the overall complexity. Our approach will improve the baseline if , and strictly outperforms the naive heuristic.
Example 4.12.
Consider a DCQ with and . Let . Here, with . Similarly, the existing best algorithm takes time to compute . It is worth mentioning that . Suppose , is witnessed by and , but and . Then, , hence the result will be missed in this rewriting.
5. Extensions
Based on the basic DCQ over two CQs discussed so far, we next consider several interesting extensions of DCQ with rich interaction between difference and other relational algebra operators.
5.1. Difference of Multiple CQs
The first extension is adapting our result for computing DCQ involving two CQs to multiple CQs, say . Suppose for . We next introduce a recursive algorithm for tackling the general case with .
The base case with is tackled by our previous algorithm EasyDCQ in Section 3. We rewrite a general DCQ with CQs into a union of multiple DCQs, each consisting of CQs. We start from the first two CQs and apply a similar strategy in Section 3. Suppose and are full; otherwise, we just invoke the reduce procedure to remove all non-output attributes via semi-joins. More specifically, we define an auxiliary relation for each , and rewrite the input DCQ as . If unwrapping the recursions, we can give a complete form for :
[TABLE]
where for any , and
[TABLE]
for any . An algorithm follows this rewriting directly. Now, we come to the complexity of this algorithm. We can first bound for each , since every tuple from must participate in at least query result. Moreover, if is free-connex, is free-connex from (2). As the subquery corresponding to has input size and output size , it can be evaluated in time.
Theorem 5.1.
Given a DCQ, can be evaluated in time if is free-connex, for is linear-reducible, and for every , the subquery induced by is -acyclic for any , where is the reduced query of .
5.2. Select, Project and Join
- •
If there is a selection operator over , we can push it down such that . If is a predicate on a base relation of (resp. ), we can we simply check if is true for each tuple , and discard it if not. This only takes time. It is challenging that is a predicate not on any base relation, even for a single CQ evaluation.
- •
If there is a projection operator over , we can push it down such that , and handle a new DCQ with and .
- •
If there is a join operator over multiple DCQs, we first rewrite the join into a DCQover multiple CQs and invoke our previous algorithm in Section 5.1. More specifically, given DCQs with for any , we can rewrite as
[TABLE]
The characterization of input CQs for which a linear algorithm exists follows Theorem 5.1.
5.3. Aggregation
Our algorithm for DCQ can also be extended to support aggregations over annotated relations (Abo Khamis et al., 2016; Joglekar et al., 2016). Let be a commutative ring. For a CQ over an annotated instance , every tuple has an annotation . For a full query , the annotation for any join result is defined as . For a non-full query , the aggregation becomes GROUP BY , and the annotation for each result (i.e., the aggregate of each group) is . Below, we introduce two commonly-used formulations. Given , and instances , let be the annotations of tuples in respectively. For completeness, we set if and if .
Relational difference. For DCQs defined on relational difference, a tuple appears in the query results of if and only if and . For , the annotation of is defined as . The input size is defined as , and the output size is . Again, our target is to find a linear-time algorithm in terms of and . Our algorithms can be applied directly, followed by aggregation, and its complexity is bottlenecked by the output size of the difference query, i.e., , which could be much larger than .
Numerical difference. For DCQs defined on numerical difference, a tuple appears in the query results of if and only if or , with annotation . Then, the aggregation operator defined over attributes on top of can be rewritten as the numerical difference of two new annotated queries, i.e., . The input size is defined as , and the output size is . Again, our target is to find an linear-time algorithm in terms of and . Here, any algorithm with time complexity is already optimal, since . Hence, if and are free-connex, both our algorithm and baseline are optimal.
Theorem 5.2.
Given two CQs and , and a subset of aggregation attributes , if and are free-connex, with numerical difference can be computed in time.
Example 5.3.
Consider an example DCQ over an instance in Figure 3. This DCQ can capture Q16 in the TPC-H benchmark (tpc, PC H) as a special case. For relational difference, the query result of includes 2 tuples as . For numerical difference, the query result of includes 3 tuples as . *
5.4. Bag Semantics
We consider the bag semantics that the set of query result is a multi-set. For simplicity, each distinct tuple is annotated with a positive integer to indicate the number of copies. In a full CQ , the annotation of is defined as . For a projection of onto attributes , the annotation of is defined as . Given two CQs and two input instances , let be the annotations of tuples in respectively. For completeness, we set if and if . A tuple is a query result of if and only if and . An example is given in Figure 3.
The input size is , and the output size is . Again, our target is to find an linear-time algorithm in terms of and . Unfortunately, our rewriting rule in Section 3 cannot be adapted here. Figure 3 shows several incorrect behaviors: some tuple has a much higher annotation (e.g., ); and some tuple should not appear (e.g., ), which motivates us to explore new rules here.
Example 5.4.
*Consider a DCQ with and under the bag semantics. Any result falls into one of the three cases: (1) or ; (2) and ; (3) either or , but . We partition into three subsets, , and . Similarly, we partition into , , with respect to . Results falling into (1) can be found by . Results falling into (2) can be found by . Results falling into (3) can be found by two new -joins , where a pair of tuples can be -joined if and only if . *
All auxiliary relations as well as and can be computed efficiently. We consider ( is symmetric). The solution of checking -condition for all combinations of tuples in and incurs quadratic complexity. A smarter way is to sort and by first, and then by the ratio of decreasingly. Then, we start with with maximum ratio, and linearly scan tuples in with the join value until we meet some tuple such that . We then stop and proceed with the next tuple in . If no join result is produced by , we skip the subsequent tuples with the same join value and continue. Overall, this algorithm takes time.
Theorem 5.5.
Given two CQs , if is free-connex, then under the bag semantics can be computed in time.
Our observation above can be extended to the case when both correspond to the same free-connex query. The proof of Theorem 5.5 is given in Appendix C. The case when have different schema is left as future work.
6. Experiments
6.1. Experimental Setup
Prototype implementation. Our newly developed algorithms can be easily integrated into any SQL engine by rewriting the original SQL query. It can be further optimized if we directly integrate the rewrite procedure into the SQL parser and have customized index support. Our ultimate goal is to implement our algorithms into a system prototype with three components: a SQL parser, a query optimizer, and new indices. At the current stage, we choose to manually rewrite all SQL queries and demonstrate the power of our optimizations via the comparison with vanilla SQL queries .
Query processing engines compared. To compare the performance of all optimized techniques we proposed in the paper, we choose PostgreSQL (pos, eSQL), DuckDB(duc, ckDB), SQLite(sql, Lite), MySQL(mys, ySQL) running in centralized settings, and Spark SQL (spa, kSQL) running in parallel/distributed settings, as the query processing engines. All of them are widely used in academia and industry. In the experiments, we observed that SQLite and MySQL show very poor performance, with most of the test points being timed out. Hence, we built full indices on these systems to expedite the execution. Moreover, DuckDB is a columnar-vectorized query execution engine, and indices are built when importing input data. During the experiments, we test the single-thread performance of our new optimization techniques over PostgreSQL, DuckDB, SQLite and MySQL, and parallel performance over Spark SQL. In order to separate the I/O cost from the total execution time, we load all data into the memory in advance by using pg-prewarm in PostgreSQL and cache in Spark SQL. For DuckDB and SQLite, the data need to be loaded into memory before execution, so we only count the query execution time.
Experimental environment. We perform all experiments in two machines. For experiments conducted on PostgreSQL and MySQL, we use a machine equipped with two Intel Xeon 2.1GHz processors, each having 12 cores/24 threads and 416 GB memory. For all experiments on Spark SQL, DuckDB and SQLite, we use a machine equipped with two Xeon 2.0GHz processors, each having 28 cores / 56 threads and 1TB of memory. All machines run Linux, with Scala 2.13.9 and JVM 1.8.0. We use Spark 3.3.0 and PostgreSQL 16.0. We assign 8 cores for Spark and 1 core for the rest platforms during the experiments. Each query is evaluated 10 times with each engine, and we report the average running time. Each query runs at most 10 hours to obtain meaningful results.
6.2. Datasets and Queries
The experiments consist of graph queries and benchmark queries.
Benchmark queries. For relational queries, we adopt two standard benchmarks (TPC-DS (tpc, C DS) and TPC-H (tpc, PC H)) in industry and select 3 queries with difference operator (TPC-H Q16, TPC-DS Q35, and TPC-DS Q69). These three benchmark queries connect DCQ with other relational operators like selection, projection, join, and aggregation. All benchmark queries can be captured by a common schema and the joins are all primary-key foreign-key joins.
Graph queries. For graph pattern queries, we use real-world graphs (such as BitCoin, DBLP, Eponions, Google, and Wiki) from SNAP (Stanford Network Analysis Project) (SNA, SNAP), summarized in Table 2. We store edge information as a relation and manually create a triple relation from the graph. Tuples in Triple are generated by following rules: (rule 1) a random length-2 path in the graph as ; or (rule 2) a random edge in the graph as , together with a random vertex in the graph as node3; or (rule 3) a triple from a random length-4 path in the graph. Triple may involve different portions of tuples generated by three rules in different queries. For a graph with length-2 paths, we set the size of Triple to be for Wiki (since it is too large to process as shown in Table 2), and for the remaining graphs. We evaluate 6 graph queries as described in Figure 5, whose original SQL queries as well as optimized SQL queries after rewriting are given in the full version (ful, 3140).
More specifically, finds all edges in the graph that do not participate in any length-2 path. finds all length-3 paths that the third node () is not sampled together with the edge . finds length-2 paths that do not form a triangle. finds all generated triples that cannot extend to a length-4 path. finds all length-4 paths that do not form a length-4 cycle. finds all pairs of edges in the graph, which do not form a length-4 cycle.
6.3. Experiment Results
Running time. Figure 5 shows the running time of different engines on graph queries. The input and output size of all graphs queries are given in Table 2. All bars reaching the axis boundary indicate that the system did not finish within the 8-hour limit, or ran out of memory. As contains an expensive Cartesian product as sub-query, materializing its query result exceeds the memory capacity of our machines on most datasets. PostgreSQL can only evaluate the original SQL query of on Bitcoin dataset. By adding the parallelism from 8 to 80, our optimized Spark SQL can evaluate on Epinions dataset within the time limit, while the vanilla Spark SQL cannot complete the evaluation. For , all systems cannot finish the evaluation on Wiki dataset due to the large intermediate results created. We also observe that both SQLite and MySQL cannot finish all test points for and , and most test points over Wiki dataset. It could be the reason that both systems are not designed for analytical queries. Our optimization techniques already achieve a speedup ranging from 2x to 1760x on PostgreSQL, from 1.2x to 270x on Spark SQL, from 2x to 1848x on DuckDB, from 1.25x to 1095x on SQLite, and from 1.8x to 5.1x on MySQL for graph queries, even without considering the queries that could not finish within the time limit. We also observe an unusual test point for in MySQL, that our optimized SQL query takes more time than the vanilla SQL query, which may be due to some unknown deficiencies in MySQL internals.222We review the execution plan in MySQL and find that the predicated run-time of our optimized SQL query is much smaller than the vanilla SQL query, which is also consistent with our observations in other platforms. The actual running time does not match the expected cost because of some unknown deficiencies in MySQL.
Figure 5 also shows the running time of all query engines on benchmark queries under different scale factors (i.e., parameters used to generate benchmark dataset, which is roughly proportional to the input data size). DuckDB and MySQL fail to finish some test points with scale factor 100. However, the improvement in benchmark queries achieved by our optimized techniques is minor, as expected. More specifically, the vanilla benchmark query consists of two free-connex sub-queries, hence can be evaluated in time, and its optimized query can be evaluated in time. Due to the special primary-key foreign-key joins and group-by aggregations, , such that the input contains a few hundred of million records while the query result only involves thousands of records. The improvement of our optimized techniques in SQLite, DuckDB, and MySQL is also limited. On some test points, our optimized SQL queries are even more time-consuming than vanilla SQL queries. We find that the vanilla SQL queries can greatly benefit from the indices built for primary-key foreign-key join and outperform our optimized SQL queries, which do not enjoy efficient indices for set difference or anti-join operators. How to build indices to accelerate relational operators in these systems could be interesting future work. Meanwhile, we notice that loading input data and building indices are much more time-consuming than evaluating the query; for example, it takes DuckDB 16 minutes to load a 50G-sized TPC-DS dataset, while only 8 seconds to execute the whole query.
Impact of , and . Implied by the theoretical results, the sizes of sub-queries impact the performance of vanilla SQL queries, while only the actual output size affect the performance of our optimized SQL queries. Below, we study the impact of , and on the performance of both approaches over .
In Figure 6, we investigate the impact of for computing DCQ. We fix (as well as ) and only vary the size of Triple (as well as and ). Note that also increases as decreases. The running time of our optimized SQL query grows slowly with , while the vanilla SQL query incurs a fixed overhead for evaluating , even when (as well as ) decreases to as small as .
In Figure 7, we investigate the impact of for computing DCQ. We fix (as well as and ) and vary a filter predicate applied to relation Graph in . When the predicate is more selective, becomes smaller, and becomes larger. The running time of vanilla SQL query decreases as decreases, and the running time of our optimized SQL query does not change, which is only affected by and .
In Figure 8, we investigate the impact of for computing DCQ. We adjust Triple by changing the proportion of tuples generated by different rules, which will only change , while , , and stay the same. The running time of our optimized SQL query increases slowly as increases. In contrast, the running time of vanilla SQL query remains stably high even when decreases to , since its running time is only impacted by and , both of which stay unchanged.
Memory Consumption. We also test the memory consumption on both graph and benchmark queries by different engines. Due to the simplicity of memory consumption measurement, we report the results for PostgreSQL and DuckDB here. For benchmark queries, the optimized and vanilla SQL queries have similar behaviors on memory consumption, since the input size dominates the overall consumption. Below, we focus on the memory consumption of graph queries. In Figure 9, our optimized SQL queries achieve overall improvements for all graph queries on Epinions dataset in terms of space consumption. For example, our optimized SQL query only requires 6.53GB on Spark SQL for evaluating , while the vanilla SQL query fails to finish evaluating even using 256G memory. The improvement of our optimized SQL query is more significant on DuckDB. For , our optimized SQL query consumes less memory than the vanilla SQL query. For and , our optimized SQL queries consume roughly 2G memory. In contrast, the vanilla SQL queries fail to execute due to out-of-memory errors even after using 738G memory.
7. Connection with Signed Conjunctive Query
The class of signed conjunctive queries (SCQ) (Brault-Baron, 2012), or noted as conjunctive queries with negation (Lanzinger, 2021) in the literature, is defined as
[TABLE]
where is either empty or a negation operator . If for all , such an SCQ is also known as a negative conjunctive queries (NCQ). If for all , such an SCQ is also known as a CQ. Recall that and . The query result of over an instance denoted as is defined as
[TABLE]
We establish the connection between SCQ and DCQ via Lemma 7.1 and Lemma 7.2.
From DCQ to SCQ. Intuitively, every DCQ can be rewritten as the union of a set of SCQs. Moreover, each resulted SCQ has exactly one negated relation, and each relation of participates in one distinct SCQ as the negated relation. For example, can be rewritten as .
Lemma 7.1.
For a DCQ , .
Proof.
Direction . For every join result , there must exist a relation such that ; otherwise, , coming to a contradiction. Wlog, let be such a relation for . Together with , there must be . Direction . Consider an arbitrary relation , and an arbitrary join result . Obviously, since . Together with , there must be . ∎
From SCQ to DCQ. On the other hand, SCQ can be rewritten as the intersection of a set of DCQs. For a SCQ , let denote the set of relations with positive, negative sign separately. Let be the set of attributes that appear in positive, negative relations separately. Let denote the positive subquery defined by positive relation as well as the whole domain of attributes which do not appear in any positive relation.
Lemma 7.2.
For a SCQ , .
Proof.
Direction . Consider an arbitrary query result . By definition, holds for every and holds for every . This way, for each , we have . Direction . Consider an arbitrary such that for every , . Then, for every but for every . Thus, . ∎
For example, a SCQ can be rewritten as: .
Decidability of SCQ.
Given a SCQ , the domain of attributes, and input database , the decidability problem asks to decide whether there exists a query result in . For example, a NCQ decides if there exists any tuple such that and , and a CQ decides if there exists any tuple such that and . The decidability problem for CQ , NCQ and SCQ has been well studied separately:
Theorem 7.3 ((Bagan et al., 2007)).
A CQ can be decided in linear time if and only if it is -acyclic.
Theorem 7.4 ((Brault-Baron, 2012)).
A NCQ can be decided in linear time if and only if it is -acyclic.
Theorem 7.5 ((Brault-Baron, 2013)).
A SCQ can be decided in linear time if and only if is -acyclic for every .
Note that -acyclicity is a more restricted notion than -acyclicity, such that is -acyclic if all sub-hypergraphs of are -acyclic. Obviously, -acyclicity strictly implies -acyclicity. In (Lanzinger, 2021), this notion of -acyclicity has been extended to nest-set width for capturing the tractability of SCQ in terms of both query and data complexity. We won’t pursue this direction further.
Decidability of DCQ.
Implied by Lemma 7.1 and Theorem 7.5, we come to the following lemma:
Lemma 7.6.
Given two full joins and , the DCQ can be decided in linear time, if is -acyclic, and is -acyclic for every .
Lemma 7.6 can be easily proved by a linear-time algorithm. We can enumerate every tuple in within delay, as is -acyclic. For each tuple enumerated, we check whether it belongs to . If , a query result of is found; otherwise, we skip it and continue to the next one. It is easy to see that at most tuples are checked, so this algorithm runs in time.
Theorem 7.7.
Given two full joins and , the DCQ can be decided in linear time, if and only if is -acyclic, as well as is -acyclic for every .
Proof.
The if direction follows Lemma 7.6. We next distinguish two more cases for the only-if direction. (1) if is cyclic; and (2) if is -acyclic, and there exists some such that is cyclic. (1) follows Theorem 7.3 by simply setting . (2) follows Lemma 4.7. ∎
8. Related Work
Union of CQs. (Carmeli and Kröll, 2019) studied the enumeration complexity of union of conjunctive queries (UCQs), i.e., the goal is to find a data structure that after linear preprocessing time, the query answers (without duplication) can be enumerated within a small delay. Their results implied a linear algorithm in terms of input and output size for the class of union-free-connex UCQs, but whether a linear algorithm can be achieved (and, if possible, how to achieve it) is unknown for the remaining class of UCQ. (Christoph et al., 2018) also investigated the enumeration complexity of UCQs but in the dynamic scenario.
Selection over CQs. Recently, multiple works have studied the complexity of selections over conjunctive queries. (Wang and Yi, 2022) investigated the selection in the form of comparisons between two attributes or values. The work identifies an acyclic condition under which a near-linear-time algorithm can be achieved for conjunctive queries with comparisons. (Abo Khamis et al., 2022) worked on the selections over intervals, also known as intersection queries, which are special cases for comparison queries since each intersection query can be decomposed into a union of multiple comparison queries. They show a dichotomy result that an intersection join can be computed in linear time if and only if it is -acyclic. (Hu et al., 2022) studied the complexity of temporal queries, where the intersection condition only exists for one global attribute. Their result suggested that a temporal query can be solved in linear time if and only if it is r-hierarchical. (Tao and Yi, 2022) also investigated the complexity of intersection queries in dynamic settings.
Appendix A SQL Queries
Graph Query
Original:
SELECT g1.src as src, g1.dst as dst
FROM graph g1
WHERE (g1.src, g1.dst) NOT IN (
SELECT DISTINCT g1.src, g1.dst
FROM graph g1, graph g2, graph g3
WHERE g1.dst = g2.src and g2.dst = g3.src);
Optimized:
SELECT g1.src as src, g1.dst as dst
FROM graph g1
WHERE NOT EXISTS (
SELECT * FROM graph g2
WHERE EXISTS (
SELECT * FROM graph g3
WHERE g3.src = g2.dst
) and g1.dst = g2.src );
Graph Query
Original:
SELECT src as A, node1 as B, node2 as C, node3 as D
FROM graph g1, Triple1 T1
WHERE g1.dst = T1.node1
and NOT EXISTS (
SELECT * FROM Triple2 T2, graph g2
WHERE T2.node3 = g2.src and T2.node1 = g1.src
and T2.node2 = T1.node1 and T2.node3 = T1.node2
and g2.dst = T1.node3);
Optimized:
SELECT src as A, node1 as B, node2 as C, node3 as D
FROM graph g1, Triple1 T1
WHERE g1.dst = T1.node1
and (NOT EXISTS ( SELECT * FROM Triple2 T2
WHERE T2.node1 = g1.src and T2.node2 = T1.node1
and T2.node3 = T1.node2)
or NOT EXISTS ( SELECT * FROM graph g2
WHERE g2.src = T1.node2 and g2.dst = T1.node3));
Graph Query
Original:
SELECT node1, node2, node3
FROM Triple T1
WHERE NOT EXISTS(
SELECT *
FROM graph g1, graph g2, graph g3
WHERE g1.dst = g2.src and g2.dst = g3.src and g3.dst = g1.src and g1.src = T1.node1 and g2.src = T2.node2 and g3.src = T3.node3);
Optimized:
SELECT node1, node2, node3
FROM Triple T1
WHERE NOT EXISTS (SELECT * FROM graph g1
WHERE T1.node1 = g1.src and T1.node2 = g1.dst)
or NOT EXISTS (SELECT * FROM graph g2
WHERE T1.node2 = g2.src and T1.node3 = g2.dst)
or NOT EXISTS (SELECT * FROM graph g3
WHERE T1.node3 = g3.src and T1.node1 = g3.dst);
Graph Query
Original:
SELECT node1, node2, node3
FROM Triple T1
WHERE NOT EXISTS (
SELECT *
FROM graph g1, graph g2, graph g3
WHERE g1.dst = g2.src and g2.dst = g3.src and g1.src = T1.node1 and g2.src = T1.node2 and g2.dst = T1.node3);
Optimized:
SELECT node1, node2, node3
FROM Triple
WHERE NOT EXISTS
(SELECT * FROM graph WHERE node1 = src and node2 = dst)
or NOT EXISTS
(SELECT * FROM graph WHERE node2 = src and node3 = dst)
or NOT EXISTS
(SELECT * FROM graph WHERE node3 = src);
Graph Query
Original:
SELECT g1.src as A, g2.src as B, g3.src as C, g3.dst as D
FROM graph g1, graph g2, graph g3
WHERE g1.dst = g2.src and g2.dst = g3.src and
NOT EXISTS (SELECT *
FROM graph g4, graph g5, graph g6
WHERE g4.dst = g5.src and g5.dst = g6.src and g2.src = g4.src and g5.src = g3.src and g6.src = g3.dst and g6.dst = g1.src);
Optimized:
SELECT g1.src as A, g2.src as B, g3.src as C, g3.dst as D
FROM graph g1, graph g2, graph g3
WHERE g1.dst = g2.src and g2.dst = g3.src and
NOT EXISTS (
SELECT * FROM graph g6 WHERE g6.dst = g1.src and g6.src = g3.dst);
Graph Query
Original:
SELECT g1.src as A, g1.dst as B, g2.src as C, g2.dst as D
FROM graph g1, graph g2
WHERE NOT EXISTS (SELECT *
FROM graph g3, graph g4, graph g5, graph g6
WHERE g3.dst = g4.src and g4.dst = g5.dst and g5.src = g3.src and g5.dst = g6.src and g3.src = g1.src and g3.dst = g1.dst and g6.src = g2.src and g6.dst = g2.dst);
Optimized:
SELECT g1.src as A, g1.dst as B, g2.src as C, g2.dst as D
FROM graph g1, graph g2
WHERE NOT EXISTS (SELECT * FROM graph g4
WHERE g4.src = g1.dst and g4.dst = g2.src)
or NOT EXISTS ( SELECT * FROM graph g5
WHERE g1.src = g5.src and g2.src = g5.dst);
TPC-H Query 16
Original:
SELECT p_brand, p_type, p_size,
count(distinct ps_suppkey) as supplier_cnt
FROM partsupp, part
WHERE p_partkey = ps_partkey
and p_brand <> 'Brand#45'
and p_type NOT LIKE 'MEDIUM␣POLISHED%'
and p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
and ps_suppkey NOT IN (
SELECT s_suppkey
FROM supplier, nation
WHERE s_nationkey = n_nationkey and n_name = 'CHINA')
GROUP BY p_brand, p_type, p_size;
Optimized:
SELECT p_brand, p_type, p_size,
count(distinct ps_suppkey) as supplier_cnt
FROM partsupp, part
WHERE p_partkey = ps_partkey
and p_brand <> 'Brand#45'
and p_type NOT LIKE 'MEDIUM␣POLISHED%'
and p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
and NOT EXISTS ( SELECT * FROM supplier
WHERE EXISTS (SELECT * FROM nation
WHERE s_nationkey = n_nationkey
and n_name = 'CHINA')
and s_suppkey = ps_suppkey)
GROUP BY p_brand, p_type, p_size;
TPC-DS Query 35
Original:
SELECT
ca_state, cd_gender, cd_marital_status, cd_dep_count,
count(*) cnt1, stddev_samp(cd_dep_count),
sum(cd_dep_count), min(cd_dep_count),
cd_dep_employed_count,count(*) cnt2,
stddev_samp(cd_dep_employed_count),
sum(cd_dep_employed_count), min(cd_dep_employed_count),
cd_dep_college_count, count(*) cnt3,
stddev_samp(cd_dep_college_count),
sum(cd_dep_college_count), min(cd_dep_college_count)
FROM
customer c,customer_address ca,customer_demographics
WHERE
c.c_current_addr_sk = ca.ca_address_sk and
cd_demo_sk = c.c_current_cdemo_sk and
not exists (select *
from store_sales,date_dim
where ss_sold_date_sk = d_date_sk and
c.c_customer_sk = ss_customer_sk and
d_year = 2001 and
d_qoy < 4) and
not exists (select *
from web_sales,date_dim
where ws_sold_date_sk = d_date_sk and
d_year = 2001 and
d_qoy < 4 and
ws_bill_customer_sk = c.c_customer_sk) and
not exists (select *
from catalog_sales,date_dim
where cs_sold_date_sk = d_date_sk and
d_year = 2001 and
d_qoy < 4 and
cs_ship_customer_sk = c.c_customer_sk)
group by ca_state,
cd_gender,
cd_marital_status,
cd_dep_count,
cd_dep_employed_count,
cd_dep_college_count;
Optimized:
SELECT
ca_state, cd_gender, cd_marital_status, cd_dep_count,
count(*) cnt1, stddev_samp(cd_dep_count),
sum(cd_dep_count), min(cd_dep_count),
cd_dep_employed_count,count(*) cnt2,
stddev_samp(cd_dep_employed_count),
sum(cd_dep_employed_count), min(cd_dep_employed_count),
cd_dep_college_count, count(*) cnt3,
stddev_samp(cd_dep_college_count),
sum(cd_dep_college_count), min(cd_dep_college_count)
FROM
customer_address ca,customer_demographics,
(select * from customer cu
where not exists (select * from store_sales
where exists (select * from date_dim
where d_year = 2001 and d_qoy < 4 and ss_sold_date_sk = d_date_sk)
and cu.c_customer_sk = ss_customer_sk)
and not exists (select * from web_sales
where exists (select * from date_dim
where d_year = 2001 and d_qoy < 4 and ws_sold_date_sk = d_date_sk)
and cu.c_customer_sk = ws_bill_customer_sk)
and not exists (select * from catalog_sales
where exists (select * from date_dim
where d_year = 2001 and d_qoy < 4 and cs_sold_date_sk = d_date_sk)
and cu.c_customer_sk = cs_ship_customer_sk)) as c
WHERE
c.c_current_addr_sk = ca.ca_address_sk and
cd_demo_sk = c.c_current_cdemo_sk
group by ca_state,
cd_gender,
cd_marital_status,
cd_dep_count,
cd_dep_employed_count,
cd_dep_college_count;
TPC-DS Query 69
Original:
SELECT
cd_gender, cd_marital_status, cd_education_status,
count(*) cnt1, cd_purchase_estimate,
count() cnt2, cd_credit_rating, count() cnt3
FROM
customer c,customer_address ca,customer_demographics
WHERE
c.c_current_addr_sk = ca.ca_address_sk and
ca_state in ('IN','ND','PA') and
cd_demo_sk = c.c_current_cdemo_sk and
exists (select *
from store_sales,date_dim
where c.c_customer_sk = ss_customer_sk and
ss_sold_date_sk = d_date_sk and
d_year = 1999 and
d_moy between 2 and 2+2) and
(not exists (select *
from web_sales,date_dim
where c.c_customer_sk = ws_bill_customer_sk and
ws_sold_date_sk = d_date_sk and
d_year = 1999 and
d_moy between 2 and 2+2) and
not exists (select *
from catalog_sales,date_dim
where c.c_customer_sk = cs_ship_customer_sk and
cs_sold_date_sk = d_date_sk and
d_year = 1999 and
d_moy between 2 and 2+2))
group by cd_gender,
cd_marital_status,
cd_education_status,
cd_purchase_estimate,
cd_credit_rating;
Optimized:
SELECT
cd_gender, cd_marital_status, cd_education_status,
count(*) cnt1, cd_purchase_estimate,
count() cnt2, cd_credit_rating, count() cnt3
FROM
customer_address ca,customer_demographics,
(select * from customer cu
where exists (select * from store_sales
where exists (select * from date_dim
where ss_sold_date_sk = d_date_sk and d_year = 1999 and d_moy between 2 and 2+2)
and not exists (select * from web_sales
where exists (select * from date_dim
where ws_sold_date_sk = d_date_sk and d_year = 1999 and d_moy between 2 and 2+2)
and ws_bill_customer_sk = cu.c_customer_sk)
and not exists (select * from catalog_sales
where exists (select * from date_dim
where cs_sold_date_sk = d_date_sk and d_year = 1999 and d_moy between 2 and 2+2)
and cs_ship_customer_sk = c_customer_sk)
and ss_customer_sk = cu.c_customer_sk)) as c
WHERE
c.c_current_addr_sk = ca.ca_address_sk and
ca_state in ('IN','ND','PA') and
cd_demo_sk = c.c_current_cdemo_sk
group by cd_gender,
cd_marital_status,
cd_education_status,
cd_purchase_estimate,
cd_credit_rating;
Appendix B Missing Proofs in Section 4.1
B.1. Preliminaries on CQs
Definition B.1 (GYO Reduction).
The GYO reduction for a CQ is an iterative procedure that (1) if an attribute only appears in one relation , then can be removed from ; (2) if there exists a pair of relations such that , then can be removed.
Lemma B.2 ((Yannakakis, 1981)).
A query is -acyclic if the GYO reduction results in an empty query.
Definition B.3 (Path).
In a CQ , a path between a pair of attributes is a sequence of attributes , such that
- •
there exists with for any ;
- •
for any , either , or for some .
Definition B.4 (Cycle).
In a CQ , a cycle is a sequence of attributes , such that
- •
there exists with for any , and ;
- •
for any , either , or for some , or .
Definition B.5 (Clique).
In a CQ , a clique is a subset of attributes , such that for any pair of attributes , there exists with .
Definition B.6 (Conformal).
A CQ is conformal, if every clique there exists with .
Definition B.7 (Non-conformal Clique).
Following the definition of conformal of CQ, we define a clique as non-conformal in a CQ , if there does not exist such that .
Lemma B.8 ((Brault-Baron, 2016)).
A CQ is -acyclic if and only if it is conformal and cycle-free.
Lemma B.9 ((Bagan et al., 2007)).
In an acyclic but non-free-connex CQ , there must exists a sequence of distinct attributes with , such that
- •
there exists a relation such that for every ;
- •
* but ;*
- •
for each , either or for some ;
Lemma B.10.
In a CQ , if there exists a clique , then for any , is also a clique.
Proof.
Since is a clique, there exists a relation that contains every pair of attributes. As is a subset of , then for any two attributes in there also exists a relation that contains both of these two attributes, hence is also a clique. ∎
Lemma B.11.
In a cycle-free CQ , if there exists a non-conformal clique , then .
Proof.
For the clique of size 1 or 2, it is clear that is conformal as there is a relation containing the entire clique by definition. Suppose there exists a non-conformal clique with , say . As the clique is non-conformal, there does not exist a relation that covers all three attributes, but any pair of attributes appears together in one relation. Then will form a triangle, contradicting the fact that is cycle-free. Hence, any non-conformal clique in a cycle-free CQ must have . ∎
We denote a non-conformal clique as a minimal if there exists no subset such that is a non-conformal clique.
Lemma B.12.
In a CQ with a minimal non-conformal clique , for every there exists some with .
Proof.
As is a clique, is also a clique for any . Meanwhile, as is the minimal non-conformal clique, is a conform clique, which implies a relation with . ∎
B.2. Helper Lemmas
Now, we are ready to show some helper lemmas, which will be used to prove Lemma 4.5 and Lemma 4.7.
Definition B.13 (Subquery).
For a CQ , a subquery of induced by a set of attributes is denoted as , where .
Lemma B.14.
Given two CQs and , for any , if requires time, then requires time.
Proof.
Given any database instance for , we can construct a database instance for as follows. For any attribute , we set its value to be . For any with , there exists a corresponding relation in the residual query. For each tuple in , we insert into with . It is easy to see that there is a one-to-one correspondence between and . Hence, if can be solved in linear time, then can be solved in linear time, coming to a contradiction. ∎
Lemma B.15.
Any algorithm for evaluating the following DCQ:
[TABLE]
requires time, assuming the strong triangle conjecture.
Proof.
Given a graph , we construct , and for some . Let . We note that if and only if there is a triangle in . Hence, if can be evaluated in time, whether there is a triangle in can be determined in time, contradicting the detecting triangle conjecture. ∎
Lemma B.16.
Any algorithm for deciding the following DCQ requires time, assuming the strong triangle conjecture, where and , or , or , or .
Proof.
In the proof of Lemma 4.4, we have shown the hardness for . The remaining three queries can be proved similarly. Given an arbitrary graph with as the set of vertices and as the set of edges, we perform an algorithm to detect whether there exists a triangle in . Let be the number of edges in . Let be the neighbor list of vertex . The degree of a vertex is defined as the size of the neighbor list of , i.e., . We partition vertices in into two subsets: and . From , we construct following relations: , , , and . It can be easily checked that each relation contains tuples.
For , we set and consider following queries:
[TABLE]
It can be easily proved that a triangle exists in if and only if or is not empty. We point out that can be computed in time, since and implied by the definition of . This way, if can be computed in time, then detecting whether there exists a triangle or not takes time, coming to a contradiction to the detecting triangle conjecture.
For , We set and consider following queries:
[TABLE]
For , we set and consider following queries:
[TABLE]
Both cases follow the similar argument as above. Together, we have completed the proof. ∎
B.3. Proof of Lemma 4.5
We assume is reduced. As is free-connex, then must be an acyclic full join. We consider repeatedly applying the following procedures to : (1) if there is an non-output attribute only appearing in one relation , remove from as well as ; (2) if there is a pair of relations such that , remove from . As is non-reducible, the residual query must be non-full; otherwise is free-connex, contradicting the fact that is non-linear-reducible. Hence, we can assume for that every non-output attribute must appear in at least two relations, and there exists no relation whose attributes are fully contained in another relation. We distinguish two cases:
(Case 1): is acyclic. As is non-linear-reducible, must be non-free-connex. Implied by Lemma B.9, there must exist such a path with desired properties. Moreover, for the acyclic full join , we initialize two sets and , and repeat the following procedure: if there exists some such that and , we just stop; otherwise, we find some such that , we just add all attributes in into , and remove . Then for each , we have either , or , or and .
Given an arbitrary instance of in lemma 4.3, we construct an input instance for separately as follows. For , we set for every , for every , and for every . Then, the result of degenerates to . For , we simply set , , and as the same as that in . Implied by the properties of the path found, every relation in must either contains a single attribute from , or degenerates to one edge of the path. Hence, the result of degenerates to , which is exactly captured by Lemma 4.5.
(Case 2): is cyclic. Then, there exists a cycle or a non-conformal clique in . We further distinguish the following cases.
(Case 2.1): there is a cycle such that . We can reduce to .
(Case 2.2): there is a cycle such that and . We can reduce to if , and otherwise.
(Case 2.3): there exists no cycle but a non-conformal clique such that . In this case, we will show the hardness of DCQ , based on the hardness of . Consider an arbitrary instance for . For simplicity, assume the domain of each attribute in is . We construct the following instance for . There is a one-to-one mapping between any pair of attributes . We also set for every . It can be easily checked that contains exactly results, and therefore contains at most results. Moreover, is empty if and only if . Suppose we have an algorithm that can compute in linear time, then we can determine whether is empty or not in linear time, contradicting the fact that cannot be determined in time. As cannot be computed in linear time, combining with Lemma B.14, cannot be computed in linear time.
(Case 2.4): holds for every cycle , as well as every non-conformal clique . Recall that there exists no non-output attribute only appearing in one relation, and there exists no relation whose attributes are contained by another relation. Let . Every non-conformal clique in becomes conformal in due to the existence of . Similarly, every cycle will disappear in due to the existence of . In this case, must be acyclic, implied by Lemma B.8. Meanwhile, as is non-linear-reducible, must be non-free-connex. As is acyclic and non-free-connex, there must exist a path in as characterized by Lemma B.9. Following the similar argument as (Case 1), we can reduce to .
B.4. Proof of Lemma 4.7
Given a free-connex CQ and a linear-reducible CQ , we denote and as the reduced queries of respectively. Let be the relation such that is cyclic. As is free-connex, is acyclic. Our proof proceeds with the following steps:
- •
Step 1: In , there exists no such that ;
- •
Step 2: There exists a pair of attributes , such that there exists no with ;
- •
Step 3: There is a cycle in with , and is a path in ;
- •
Step 4: There is a reduction from Lemma B.16 to ;
For Step 1, if there exists some such that , then is acyclic if and only if , contradicting the fact that is cyclic but is acyclic.
For Step 2, we first show that . Suppose , say . There must exist such that , hence , coming to a contradiction of Step 1. Hence, . Moreover, if for every pair of attributes , there exists some such that , then we find a clique of attributes in in . As is acyclic, there must exist some such that , coming to a contradiction of Step 1. Hence, we can always find a pair of attributes as desired.
For Step 3, since is acyclic but is cyclic, either a cycle or a non-conformal clique is formed by the addition of . Let’s consider the case where a new non-conformal clique is formed. By definition, there exists no relation such that . We partition into two subsets such that and . It is clear that ; otherwise , contradicting the fact that is non-conformal. Moreover, for any , is also a clique. As is acyclic, must be a conformal clique, i.e., there exists some such that . Meanwhile, ; otherwise is a non-conformal clique in , contradicting the fact that is acyclic. We can also identify two different attributes such that there exists no relation with ; otherwise, is a non-conformal clique in , contradicting the fact that is acyclic. Let be the relation that , and be the relation that . From above, we note that and . Let be an attribute such that there exists no relation , such that . It is always feasible to find such an attribute , since there exist no relation such that .
In either way, a cycle of forms after the addition of . Hence, a new cycle must be formed by the addition of , say . Let . The existence of also implies a path of in .
For Step 4, we show the following reduction. For simplicity, we set . For any attribute , we set . If ignoring all attributes with domain as , each relation in falls into or . As , there exists at least one relation in such that . We distinguish four more cases on such :
- •
;
- •
;
- •
;
- •
;
which follows the proof of Lemma B.16.
So far, we have shown the hardness of computing , and the hardness of computing follows by Lemma B.14.
Appendix C Missing Materials in Section 5
C.1. Difference of Multiple CQs
C.2. Proof of Theorem 5.5
In the bag semantics, for any free-connex CQ and an instance , it is still possible to reduce the query and instance in linear time, while preserving the correctness of the query results. We invoke Algorithm 1, but incorporate the semi-join and projection operators in the bag semantics. For a semi-join result of , we define:
[TABLE]
Then, we are left with two full joins that share the same query structure after applying the reduce procedure to both and .
Suppose we are given two instances for the full join . Let be the corresponding relations to in . Again, assume that each tuple is associated with a positive count . Let be the count functions of respectively. Generalizing the algorithm in Example 5.4, the high-level idea is to find all join results such that . For each , we distinguish tuples in into three case: , and . We can rewrite it as:
Lemma C.1.
Given two full CQs ,
[TABLE]
where a pair of tuples can be -joined if and only if .
The first part of can be computed similarly as we have done in the set semantics. We next focus on the second part. Each derives a -joins, which will be computed by the following procedure BagDCQ. For simplicity, let if and otherwise. We maintain additional variable for every tuple for every . Initially, if and , if and , and if . Algorithm 5 consists of two phases. In the first phase, it updates the value of for every tuple over a join tree . More specifically, suppose for some . Let be the subtree of rooted at . Then,
[TABLE]
i.e., the maximum product of over all join results in the subtree rooted at , participated by . As a result, a tuple in the root node participates in any query result if and only if . In the second phase, we invoke Enumerate procedure for every with , and enumerate all the query results participated by .
The procedure takes three parameters, which returns all join results over the join tree participated by (from the root relation of ), whose product of ratios over participated tuples is at least . Let be the root node of . In the base case when is a single node, we just return . As we prove later, there must be in this case. In general, we distinguish two more cases. If contains a single child, say , it suffices to find all tuples such that , i.e., participate in at least one join result. For each such a tuple , we recursively enumerate the query results in participated by , whose product of is at least , which can be done by (line 6). Otherwise, contains at least two child nodes. We also play with recursion and shrink the join tree by removing a subtree rooted at one child node. W.l.o.g., assume is the set of child nodes of . We first find out tuples in that will participate in any query result with . This can be done by first finding the maximum -value of tuples in another child node that can be joined with , and then finding the minimum -value that tuples in should satisfy (line 10). Then, we enumerate all query results in the subtree whose product of is at least , which are exactly those will participate in the final query results (line 11). For each such a tuple enumerated (at line 12-14), we in turn find out the query results in the remaining subtree of whose product of is at least (updated) , where is the product of for . At last, we just output their combination as a Cartesian product.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2duc (ck DB) Duck DB. https://duckdb.org/ .
- 3mys (y SQL) My SQL. https://www.mysql.com/ .
- 4ora (acle) Oracle. https://www.oracle.com/ .
- 5pos (e SQL) Postgre SQL. https://www.postgre.org/ .
- 6SNA (SNAP) SNAP. https://snap.stanford.edu/snap/ .
- 7spa (k SQL) Spark SQL. https://spark.apache.org/sql/ .
- 8sql (Lite) SQ Lite. https://www.sqlite.org/ .
