Computing the Difference of Conjunctive Queries Efficiently

Xiao Hu; Qichen Wang

arXiv:2302.13140·cs.DB·April 21, 2023

TL;DR

This paper presents a novel, structurally-aware approach to efficiently compute the difference of conjunctive queries, achieving linear-time algorithms for many cases and significant speedups over standard SQL methods.

Contribution

It introduces a query rewriting technique that exploits structural properties to push down difference operators, enabling faster computation of query differences.

Findings

01 Linear-time algorithms for a large class of difference queries

02 Order-of-magnitude speedups over standard SQL implementations

03 Heuristics that improve traditional difference query evaluation

Tables

Table 1. Summary of complexity results by baseline and our approach. $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ are two input CQs. $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ and $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{2})$ are the reduced queries of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively. $\mathcal{Q}^{\emptyset}_{2}=(\emptyset,\mathcal{V}-\bm{\mathsf{y}},\{e-\bm{\mathsf{y}}:e\in\mathcal{E}_{2}\})$ and $\mathcal{Q}^{\oplus}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\{\bm{\mathsf{y}}\}\cup\mathcal{E}_{2})$ are formally defined in Section 4.2. $N$ is the input size. $\mathrm{OUT}_{1},\mathrm{OUT}$ are the output sizes of $\mathcal{Q}_{1},\mathcal{Q}_{1}-\mathcal{Q}_{2}$ respectively. $\texttt{cost}(\cdot)$ is the time complexity of evaluating a single CQ.

$\mathcal{Q}_{1}-\mathcal{Q}_{2}$	Baseline	Our Approach
$\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is difference-linear	$\texttt{cost}(\mathcal{Q}_{1})+\texttt{cost}(\mathcal{Q}_{2})$	$N+\mathrm{OUT}$ [Theorem 3.1]
$\mathcal{Q}_{2}$ is linear-reducible	$\texttt{cost}(\mathcal{Q}_{1})+\texttt{cost}(\mathcal{Q}_{2})$	$\texttt{cost}(\mathcal{Q}_{1})$ [Corollary 2.5]
$\mathcal{Q}_{2}$ is non-linear-reducible	$\texttt{cost}(\mathcal{Q}_{1})+\texttt{cost}(\mathcal{Q}_{2})$	$\texttt{cost}(\mathcal{Q}_{1})+\min\{\mathrm{OUT}_{1}\cdot\texttt{cost}(\mathcal{Q}_{2}^{\emptyset})\ \text{[Theorem 4.8]},\ \texttt{cost}(\mathcal{Q}_{2}^{\oplus})\ \text{[Theorem 4.10]}\}$

Table 2. Graph datasets and their statistics. #edge is the input size of each graph dataset. #$\mathcal{Q}_{Gi}$ is the output size of $\mathcal{Q}_{Gi}$ over the corresponding graph dataset. ‘-’ indicates that the output size is so large that no system could report it within the time limit.

Graph	#edge	#vertex	#l2 path	#triangle	# $𝒬_{G 1}$	# $𝒬_{G 2}$	# $𝒬_{G 3}$	# $𝒬_{G 4}$	# $𝒬_{G 5}$	# $𝒬_{G 6}$
Bitcoin	24,186	3,783	1,256,332	88,753	820	$1.0 \times 10^{7}$	585,958	331,497	$3.8 \times 10^{7}$	$5.7 \times 10^{8}$
Epinions	508,837	75,879	$3.9 \times 10^{7}$	3,586,405	25,947	$9.3 \times 10^{8}$	$1.8 \times 10^{7}$	$1.0 \times 10^{7}$	$3.5 \times 10^{9}$	$2.5 \times 10^{11}$
DBLP	1,049,866	317,080	7,064,738	2,224,385	466,646	$1.6 \times 10^{8}$	3,532,369	2,203,597	$6.7 \times 10^{7}$	-
Google	5,105,039	875,713	$6.0 \times 10^{7}$	$2.8 \times 10^{7}$	372,042	$2.1 \times 10^{8}$	$2.4 \times 10^{7}$	$1.5 \times 10^{7}$	$7.8 \times 10^{8}$	-
Wiki	$2.8 \times 10^{7}$	2,394,385	$2.6 \times 10^{9}$	$8.1 \times 10^{7}$	0	$1.1 \times 10^{10}$	$1.3 \times 10^{8}$	$6.6 \times 10^{7}$	-	-

Equations

$\mathcal{Q}:=\pi_{\bm{\mathsf{y}}}\left(\sigma_{\phi_{1}}R_{1}(e_{1})\Join\sigma_{\phi_{2}}R_{2}(e_{2})\Join\cdots\Join\sigma_{\phi_{n}}R_{n}(e_{n})\right)$

$\mathcal{Q}(D)=\{t\in\mathrm{dom}(\bm{\mathsf{y}}):\exists t^{\prime}\in\mathrm{dom}(\mathcal{V}),\ \textrm{s.t.}\ \pi_{\bm{\mathsf{y}}}t^{\prime}=t,\ \pi_{e_{i}}t^{\prime}\in R_{i},\ \forall i\in[n]\}$

$\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{3})-\pi_{x_{1},x_{3}}\left(R_{2}(x_{1},x_{2})\Join R_{3}(x_{2},x_{3})\right)$

$\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1})-\pi_{x_{1}}\left(R_{2}(x_{1},x_{3})\Join R_{3}(x_{2},x_{3})\Join R_{4}(x_{1},x_{3})\right)$

$\mathcal{Q}=R(x_{1},x_{2})\Join R(x_{2},x_{3})\Join R_{0}(x_{1},x_{3})$

$\mathcal{Q}=\bigcup_{(e_{2},e_{3},\dots,e_{k})\in\mathcal{E}_{2}\times\mathcal{E}_{3}\times\cdots\times\mathcal{E}_{k}}\left((S_{I_{k}}-R_{e_{k}})\Join\cdots\Join(S_{I_{3}}-R_{e_{3}})\Join(S_{I_{2}}-R_{e_{2}})\Join\mathcal{Q}_{1}\right)$

$S_{I_{j}}=\pi_{e_{j}}\left((S_{I_{j-1}}-R_{e_{j-1}})\Join\cdots\Join(S_{I_{2}}-R_{e_{2}})\Join\mathcal{Q}_{1}\right)$

$\left(\Join_{i\in[k]}\mathcal{Q}_{1}^{i}\right)-\left\{\left(\Join_{i\in I}\mathcal{Q}_{1}^{i}\right)\Join\left(\Join_{j\in J}\mathcal{Q}_{2}^{j}\right):I\subsetneq[k],\ J=[k]-I\right\}$

$\mathcal{Q}:=\pi_{\bm{\mathsf{y}}}\left(\eta_{1}R_{1}(e_{1})\Join\eta_{2}R_{2}(e_{2})\Join\cdots\Join\eta_{n}R_{n}(e_{n})\right)$

$\mathcal{Q}(D)=\{t\in\mathrm{dom}(\bm{\mathsf{y}}):\exists t^{\prime}\in\mathrm{dom}(\mathcal{V}),\ \pi_{e_{i}}t\in R_{i},\ \forall\eta_{i}=\emptyset,\ \pi_{e_{j}}t\notin R_{j},\ \forall\eta_{j}=\neg\}$

$\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1})-\pi_{x_{1}}\left(R_{2}(x_{1},x_{2})\Join R_{3}(x_{2},x_{3})\Join R_{4}(x_{3},x_{4})\Join R_{5}(x_{2},x_{4})\right)$

$\mathcal{Q}^{\prime}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{4}(x_{2},x_{3})$

$\mathcal{Q}^{\prime}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{5}(x_{1},x_{2})$

$\mathcal{Q}^{\prime}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{4}(x_{2},x_{3})\Join R_{5}(x_{1},x_{2})$

$w(t)=\sum_{t^{\prime}\in R_{e^{\prime}}:\ \pi_{e\cap e^{\prime}}t=\pi_{e\cap e^{\prime}}t^{\prime}}w(t^{\prime})$

$\zeta_{t}=\max_{t^{\prime}\in\Join_{e^{\prime}\in\mathcal{T}_{e}}S_{e^{\prime}}:\ \pi_{e}t^{\prime}=t}\ \prod_{e^{\prime}\in\mathcal{T}_{e}}\frac{w_{1}(\pi_{e}t^{\prime})}{w_{2}(\pi_{e}t^{\prime})}$

Taxonomy

Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Logic, Reasoning, and Knowledge

Full text

Computing the Difference of Conjunctive Queries Efficiently

Xiao Hu

University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1

ORCID: 0000-0002-7890-665X

and

Qichen Wang

Hong Kong Baptist University, 224 Waterloo Road, Kowloon Tong, Hong Kong

ORCID: 0000-0002-0959-5536

(2023; October 2022; January 2023; February 2023)

Abstract.

We investigate how to efficiently compute the difference of two (or multiple) conjunctive queries, which is the last operator in relational algebra to be unraveled. The standard approach in practical database systems is to materialize the results of every input query as a separate set, and then compute the difference of the two (or multiple) sets. This approach is bottlenecked by the complexity of evaluating every input query individually, which can be very expensive, particularly when there are only a few results in the difference. In this paper, we introduce a new approach that exploits the structural properties of the input queries and rewrites the original query by pushing the difference operator down as far as possible. We show that for a large class of difference queries, this approach leads to a linear-time algorithm in terms of the input size and (final) output size, i.e., the number of query results that survive the difference operator. We complete this result by showing the hardness of computing the remaining difference queries in linear time. Although a linear-time algorithm is hard to achieve in general, we also provide some heuristics that provably improve the standard approach. Finally, we compare our approach with standard SQL engines over graph and benchmark datasets. The experimental results demonstrate order-of-magnitude speedups of our approach over vanilla SQL.

Keywords: conjunctive query, query optimization, difference operator

Published in PACMMOD, volume 1, number 2, article 153, June 2023. DOI: 10.1145/3589298. CCS concepts: Information systems → Query optimization.

Author affiliations: School of Computer Science (University of Waterloo); Department of Computer Science (Hong Kong Baptist University).

1. Introduction

Conjunctive queries with aggregation, union, and difference (also known as negation) operators form the full relational algebra (Abiteboul et al., 1995). While conjunctive queries (Ngo et al., 2018; Bagan et al., 2007; Yannakakis, 1981; Amossen and Pagh, 2009; Deep et al., 2020; Huang and Chen, 2022), with aggregation (Joglekar et al., 2016) and unions (Carmeli and Kröll, 2019; Christoph et al., 2018), have been extensively studied in the literature, the difference operator has received much less attention. In modern database systems, there are several equivalent expressions for computing the difference between two queries, such as NOT IN, NOT EXISTS, EXCEPT, MINUS, DIFFERENCE, and LEFT OUTER JOIN followed by a filter that keeps only the rows left unmatched (NULL on the right side). In contrast to its powerful expressibility, the execution plan of the difference operator in existing database systems and data analytics engines (e.g., MySQL, Oracle, PostgreSQL, Spark SQL) is quite brute-force. Given two (or multiple) conjunctive queries, their difference is computed by materializing the answers of each participating conjunctive query separately, and then computing the difference of the two (or multiple) sets. Hashing or other indexes may be built on top of the query answers to speed up the final set difference. However, this approach is severely bottlenecked by evaluating every input query individually and materializing a large number of intermediate query results that do not contribute to the final results due to the difference operator.

Let’s consider an example of friend recommendation in social networks (such as Twitter, Facebook, Sina Weibo). A friend recommendation is represented as a triple $(a,b,c)$ extracted from the network semantics, such that user $c$ is recommended to user $a$ since user $b$ is a friend of user $a$ and user $c$ is a friend of user $b$ , together with other customized constraints. We also avoid the recommendation when $a$ and $c$ are already friends. The task of finding all valid recommendations can be captured by a SQL query in Example 1.1, as the difference of two sub-queries.

Example 1.1.

Let $\textsf{Graph}(\textsf{src},\textsf{dst})$ be a table storing all edges in the social network, and $\textsf{Triple}(\textsf{node1},\textsf{node2},\textsf{node3})$ be a table storing all candidate recommendations. The following SQL query $(\mathcal{Q})$ finds all triples from Triple that do not form a triangle in the graph:

$\mathcal{Q}$: SELECT node1, node2, node3 FROM Triple t1
    WHERE NOT EXISTS
      (SELECT * FROM Graph g1, Graph g2, Graph g3
       WHERE g1.dst = g2.src AND g2.dst = g3.src AND g3.dst = g1.src
         AND g1.src = t1.node1 AND g2.src = t1.node2 AND g3.src = t1.node3);

such that $\mathcal{Q}$ is the difference of two sub-queries $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$, where $\mathcal{Q}_{1}$ returns all candidate recommendations from Triple and $\mathcal{Q}_{2}$ returns all triangle friendships in the social network.

$\mathcal{Q}_{1}$: SELECT node1, node2, node3 FROM Triple t1

$\mathcal{Q}_{2}$: SELECT * FROM Graph g1, Graph g2, Graph g3
    WHERE g1.dst = g2.src AND g2.dst = g3.src AND g3.dst = g1.src
      AND g1.src = t1.node1 AND g2.src = t1.node2 AND g3.src = t1.node3;

We note that $\mathcal{Q}$ can be rewritten as the following SQL query $\mathcal{Q}^{\prime}$ :

$\mathcal{Q}^{\prime}$: SELECT node1, node2, node3 FROM Triple t1
    WHERE NOT EXISTS
      (SELECT * FROM Triple t2
       WHERE EXISTS (SELECT * FROM Graph g1 WHERE t2.node1 = g1.src AND t2.node2 = g1.dst)
         AND EXISTS (SELECT * FROM Graph g2 WHERE t2.node2 = g2.src AND t2.node3 = g2.dst)
         AND EXISTS (SELECT * FROM Graph g3 WHERE t2.node3 = g3.src AND t2.node1 = g3.dst)
         AND t2.node1 = t1.node1 AND t2.node2 = t1.node2 AND t2.node3 = t1.node3)

such that $\mathcal{Q}^{\prime}$ is the difference of $\mathcal{Q}_{1}$ and another sub-query $\mathcal{Q}_{3}$ , where $\mathcal{Q}_{3}$ finds all candidate recommendations in Triple that also form a triangle in the social network:

$\mathcal{Q}_{3}$: SELECT * FROM Triple t2
    WHERE EXISTS (SELECT * FROM Graph g1 WHERE t2.node1 = g1.src AND t2.node2 = g1.dst)
      AND EXISTS (SELECT * FROM Graph g2 WHERE t2.node2 = g2.src AND t2.node3 = g2.dst)
      AND EXISTS (SELECT * FROM Graph g3 WHERE t2.node3 = g3.src AND t2.node1 = g3.dst)
      AND t2.node1 = t1.node1 AND t2.node2 = t1.node2 AND t2.node3 = t1.node3

Figure 1(a) illustrates the execution plan for $\mathcal{Q}$ generated by the PostgreSQL optimizer. It first materializes all triangles in the graph as $\mathcal{Q}_{2}$, and then computes the difference of $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ by an anti-join. Moreover, a hash index is built on top of all triangles of $\mathcal{Q}_{2}$ so that the anti-join can be executed by checking whether every candidate recommendation in $\mathcal{Q}_{1}$ appears as a triangle in $\mathcal{Q}_{2}$. Finally, all surviving recommendations are output as the final answers. In plan (a), computing the final set difference is the most time-consuming step, which the PostgreSQL optimizer predicts will take 33.67 minutes. Although computing the subquery $\mathcal{Q}_{2}$ is not that expensive, taking only about 132 seconds, the number of intermediate results materialized for $\mathcal{Q}_{2}$ is quite large as expected, which makes the subsequent computation of $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ inefficient. In the experiments of Section 6, plan (a) actually runs in 308.175 seconds.

To tackle the challenges brought by the difference operator, we consider the two input sub-queries as a whole when designing algorithms. We are interested in efficient algorithms with running times linear in the final result size. This requirement rules out the standard approach of materializing the results of each input sub-query separately and then computing their set difference. Indeed, the final output size can be orders of magnitude smaller than the number of intermediate results materialized. To overcome the curse of large intermediate results, we introduce a rewriting-based approach that exploits the joint structural properties of the two input sub-queries and pushes the difference operator down as far as possible.

In Example 1.1, we can rewrite the original SQL query $\mathcal{Q}$ into a new one $\mathcal{Q}^{\prime}$. Instead of computing $\mathcal{Q}_{2}$, it finds all candidate recommendations that also form a triangle in the social network as $\mathcal{Q}_{3}$, which is exactly the intersection of $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$, and then computes the difference of $\mathcal{Q}_{1}$ and $\mathcal{Q}_{3}$. Figure 1(b) illustrates the execution plan of this new query. We observe that computing the final set difference is predicted to take only 21.67 seconds, which is much faster than in (a). This is as expected, since the number of intermediate results generated by $\mathcal{Q}_{3}$ is much smaller than that of $\mathcal{Q}_{2}$ once $\mathcal{Q}_{1}$ is taken into consideration, which is the key to the overall improvement. As the price, computing $\mathcal{Q}_{3}$ is predicted to take a few more minutes than $\mathcal{Q}_{2}$, but this is tolerable. In the experiments of Section 6, plan (b) actually runs in 78.918 seconds, which is already roughly a 4x speedup over (a). This significant improvement from (a) to (b) motivates us to further investigate this problem for general queries.
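The equivalence of $\mathcal{Q}$ and $\mathcal{Q}^{\prime}$ can be checked on a toy instance. The following is a minimal sketch, assuming SQLite (via Python's standard `sqlite3` module) as a stand-in engine; the edge and triple data are hypothetical and chosen only for illustration, and the sketch says nothing about running times.

```python
import sqlite3

# Hypothetical toy data: a directed graph and four candidate triples.
EDGES = [(1, 2), (2, 3), (3, 1), (1, 4), (4, 5)]
TRIPLES = [(1, 2, 3), (2, 3, 1), (1, 4, 5), (3, 1, 4)]

# Query Q from Example 1.1: subtract all triangles from Triple.
Q = """
SELECT node1, node2, node3 FROM Triple t1
WHERE NOT EXISTS
  (SELECT * FROM Graph g1, Graph g2, Graph g3
   WHERE g1.dst = g2.src AND g2.dst = g3.src AND g3.dst = g1.src
     AND g1.src = t1.node1 AND g2.src = t1.node2 AND g3.src = t1.node3)
"""

# Rewritten query Q': subtract only the triples that also form a triangle.
Q_PRIME = """
SELECT node1, node2, node3 FROM Triple t1
WHERE NOT EXISTS
  (SELECT * FROM Triple t2
   WHERE EXISTS (SELECT * FROM Graph g1 WHERE t2.node1 = g1.src AND t2.node2 = g1.dst)
     AND EXISTS (SELECT * FROM Graph g2 WHERE t2.node2 = g2.src AND t2.node3 = g2.dst)
     AND EXISTS (SELECT * FROM Graph g3 WHERE t2.node3 = g3.src AND t2.node1 = g3.dst)
     AND t2.node1 = t1.node1 AND t2.node2 = t1.node2 AND t2.node3 = t1.node3)
"""


def run(sql):
    """Load the toy instance into an in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Graph (src INTEGER, dst INTEGER)")
    conn.execute("CREATE TABLE Triple (node1 INTEGER, node2 INTEGER, node3 INTEGER)")
    conn.executemany("INSERT INTO Graph VALUES (?, ?)", EDGES)
    conn.executemany("INSERT INTO Triple VALUES (?, ?, ?)", TRIPLES)
    rows = sorted(conn.execute(sql).fetchall())
    conn.close()
    return rows


result_q = run(Q)
result_q_prime = run(Q_PRIME)
print(result_q == result_q_prime, result_q)
```

On this instance, the triples (1, 2, 3) and (2, 3, 1) close a triangle and are filtered out by both queries, while the other two survive.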

**Our contributions.** In this paper, we formulate the difference of conjunctive queries (DCQ) problem and study its data complexity. Our contributions can be summarized as follows:

• **Complexity Dichotomy:** We give a dichotomy for computing DCQs in linear time in terms of input and output size. We characterize a class of “easy” DCQs by exploiting the joint properties of the two input CQs, and present a linear-time algorithm. On the other hand, we prove the hardness of obtaining a linear-time algorithm for the remaining “hard” DCQs via several well-known conjectures. (Sections 3 and 4.1)

• **Efficient Heuristic:** We propose an efficient heuristic for computing “hard” DCQs, which does not lead to a linear-time algorithm but still improves the baseline approach greatly. The heuristic investigates the intersection of the two input CQs and incorporates state-of-the-art algorithms for CQ evaluation. (Section 4.2)

• **Extension:** We explore several interesting extensions. First, we design a recursive algorithm for computing the difference of multiple conjunctive queries. We also extend our algorithm to support other relational operators, such as selection, projection, join, and aggregation. Finally, we investigate the DCQ problem under bag semantics. (Section 5)

• **Experimental Evaluation:** We provide an experimental evaluation of our approach and the standard approach on real-world datasets in both centralized and parallel database systems. The experimental results show that our approach outperforms the baseline on different classes of queries and datasets. (Section 6)

Roadmap. In Section 2, we formally define the DCQ problem and review the literature on evaluating a single CQ. In Section 3, we provide a linear-time algorithm for “easy” DCQs. In Section 4, we prove the hardness of the remaining DCQs and show efficient heuristics. In Section 5, we study several extensions of DCQs with other relational operators and bag semantics. In Section 6, we present the experimental evaluation. Finally, we review related work in Section 7 and Section 8.

2. Preliminaries

2.1. Problem Definition

Conjunctive Query (CQ). We consider the standard setting of multi-relational databases. Let $\mathbb{R}$ be a database schema that contains $n$ relations $R_{1},R_{2},\cdots,R_{n}$ . Let $\mathcal{V}$ be the set of attributes in the database $\mathbb{R}$ . Each relation $R_{i}$ is defined on a subset of attributes $e_{i}\subseteq\mathcal{V}$ . Let $\mathcal{E}=\{e_{1},e_{2},\cdots,e_{n}\}$ be the set of the attributes for all relations. Let $\mathrm{dom}(x)$ be the domain of attribute $x\in\mathcal{V}$ , and let $\mathrm{dom}(U)=\prod_{x\in U}\mathrm{dom}(x)$ be the domain of attributes $U\subseteq\mathcal{V}$ .

Given the database schema $\mathbb{R}$ , let an input instance be $D$ , and the corresponding instances of $R_{1},\cdots,R_{n}$ be $R_{1}^{D},\cdots$ , $R_{n}^{D}$ . Where $D$ is clear from the context, we will drop the superscript and use $R_{1},\cdots,R_{n}$ for both the schema and instances. Any tuple $t\in R_{i}$ is defined on $e_{i}$ . For any attribute $x\in e_{i}$ , $\pi_{x}t\in\mathrm{dom}(x)$ denotes the value of attribute $x$ in tuple $t$ . Similarly, for a set of attributes $U\subseteq e_{i}$ , $\pi_{U}t$ denotes the values of attributes in $U$ for $t$ with an implicit ordering on the attributes.

We consider the class of conjunctive queries without self-joins formally defined as

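$\mathcal{Q}:=\pi_{\bm{\mathsf{y}}}\left(\sigma_{\phi_{1}}R_{1}(e_{1})\Join\sigma_{\phi_{2}}R_{2}(e_{2})\Join\cdots\Join\sigma_{\phi_{n}}R_{n}(e_{n})\right),$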

where $\sigma$ is the selection operator, $\phi_{i}$ is the predicate defined over relation $R_{i}$, $\sigma_{\phi_{i}}R_{i}(e_{i})$ selects the tuples of $R_{i}$ that pass the predicate $\phi_{i}$, and $\bm{\mathsf{y}}\subseteq\mathcal{V}$ denotes the output attributes. If $\bm{\mathsf{y}}=\mathcal{V}$, such a CQ is known as a full join, which represents the natural join of the underlying relations. We usually use a triple $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ to represent a CQ $\mathcal{Q}$, and simply use a pair $(\mathcal{V},\mathcal{E})$ to represent a full join. Each relation $R_{i}$ in $\mathcal{Q}$ is distinct, i.e., the CQ does not have a self-join. As a simplification, we ignore all selection operators, since it takes only $O(1)$ time to decide whether a tuple passes the predicate $\phi_{i}$. Moreover, we assume that every $R_{i}$ is defined on a different subset of attributes; otherwise, we can simply keep the intersection of all relations defined on the same subset of attributes. Hence, we also use $R_{e}$ to denote the relation defined on $e\in\mathcal{E}$.

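The result of $\mathcal{Q}$ over instance $D$, denoted $\mathcal{Q}(D)$, is defined as:

$\mathcal{Q}(D)=\{t\in\mathrm{dom}(\bm{\mathsf{y}}):\exists t^{\prime}\in\mathrm{dom}(\mathcal{V}),\ \textrm{s.t.}\ \pi_{\bm{\mathsf{y}}}t^{\prime}=t,\ \pi_{e_{i}}t^{\prime}\in R_{i},\ \forall i\in[n]\},$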

i.e., the projection of all combinations of tuples from every relation onto $\bm{\mathsf{y}}$ , such that tuples in each combination have the same value(s) on the common attribute(s). Let $N=|D|$ be the input size, i.e., the total number of tuples in the input instance. Let $\mathrm{OUT}=|\mathcal{Q}(D)|$ be the output size, i.e., the number of query results of $\mathcal{Q}$ over $D$ .
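To make the definition of $\mathcal{Q}(D)$ concrete, the following is a minimal brute-force sketch. It is not an efficient evaluation algorithm: it simply enumerates consistent combinations of tuples and projects them onto the output attributes. The function name `evaluate_cq` and the sample relations are hypothetical.

```python
def evaluate_cq(output_attrs, relations):
    """Evaluate pi_y(R_1 join ... join R_n) by brute force.

    relations: list of (attrs, tuples) pairs, where attrs is a tuple of
    attribute names and tuples is a set of value tuples over attrs.
    """
    partial = [{}]  # partial assignments: attribute name -> value
    for attrs, tuples in relations:
        extended = []
        for assignment in partial:
            for t in tuples:
                # Keep t only if it agrees with the assignment on shared attributes.
                if all(assignment.get(a, v) == v for a, v in zip(attrs, t)):
                    extended.append({**assignment, **dict(zip(attrs, t))})
        partial = extended
    # Project every full assignment onto the output attributes (set semantics).
    return {tuple(a[x] for x in output_attrs) for a in partial}


# Q = pi_{x1,x3}(R2(x1,x2) join R3(x2,x3)) on a small hypothetical instance.
R2 = {(1, 10), (2, 10), (3, 20)}
R3 = {(10, 7), (20, 8), (30, 9)}
result = evaluate_cq(("x1", "x3"), [(("x1", "x2"), R2), (("x2", "x3"), R3)])
print(sorted(result))
```

Here $N=6$ and $\mathrm{OUT}=3$; the tuple $(30,9)$ of $R_{3}$ joins with nothing and contributes no result.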

Difference of Conjunctive Queries (DCQ). A DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ consists of two CQs without self-joins $\mathcal{Q}_{1},\mathcal{Q}_{2}$ with the same output attributes. We also assume that the DCQ does not have a self-join, i.e., there exists no pair of relations $R_{i}$ from $\mathcal{Q}_{1}$ and $R_{j}$ from $\mathcal{Q}_{2}$ such that $R_{i},R_{j}$ are the same. Note that the algorithms presented in this work also apply when the DCQ contains self-joins, but our lower bound assumes that no self-join exists. The input to a DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is a pair of database instances $D_{1},D_{2}$ defined for $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively. (Footnote 1: We distinguish the input instances $D_{1},D_{2}$ of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ to simplify the algorithmic description later, which differs from the conventional definition.) The result of $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ over $D_{1},D_{2}$ is $\mathcal{Q}_{1}(D_{1})-\mathcal{Q}_{2}(D_{2})$, i.e., the tuples that appear in the result of $\mathcal{Q}_{1}$ over instance $D_{1}$ but not in the result of $\mathcal{Q}_{2}$ over instance $D_{2}$. Let $N=|D_{1}|+|D_{2}|$ be the input size, i.e., the total number of tuples in both input instances. Let $\mathrm{OUT}=|\mathcal{Q}_{1}(D_{1})-\mathcal{Q}_{2}(D_{2})|$ be the output size.

In this paper, we adopt standard data complexity (Vardi, 1982); that is, we measure the complexity of algorithms with input size $N$ and output size $\mathrm{OUT}$ , and assume the query size as a constant.

2.2. Literature Review of CQ Evaluation

Before diving into the massive literature, we mention two classes of CQs that play an important role in query evaluation.

• ($\alpha$-acyclic). A CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is $\alpha$-acyclic (Beeri et al., 1983; Fagin, 1983) if there exists a tree $\mathcal{T}$ (called a join tree) such that each node in $\mathcal{T}$ corresponds to a relation in $\mathcal{E}$, and for each attribute $x\in\mathcal{V}$, the set of nodes containing $x$ forms a connected subtree of $\mathcal{T}$. Moreover, we define $\textsf{top}(x)$ as the highest node of $\mathcal{T}$ in which attribute $x$ appears.

• (free-connex). A CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is free-connex (Bagan et al., 2007) if there exists a tree $\mathcal{T}$ (called a free-connex join tree) such that $\mathcal{T}$ is a join tree for $\mathcal{Q}$, and for any pair of attributes $x_{1}\in\bm{\mathsf{y}},x_{2}\in\mathcal{V}-\bm{\mathsf{y}}$, $\textsf{top}(x_{2})$ is not an ancestor of $\textsf{top}(x_{1})$. Equivalently, it has been proved that a CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is free-connex if $\mathcal{Q}$ is $\alpha$-acyclic and $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E}\cup\{\bm{\mathsf{y}}\})$ is also $\alpha$-acyclic.

Their relationships are illustrated in Figure 2. A free-connex CQ must be $\alpha$-acyclic. An $\alpha$-acyclic full join must be free-connex. Below, when the context is clear, we write “acyclic” for “$\alpha$-acyclic”.
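Both notions can be tested mechanically. The following is a minimal sketch that checks $\alpha$-acyclicity with the classical GYO reduction (a standard technique that is not described in this paper) and then checks free-connexity through the equivalent characterization stated above. The function names are hypothetical, and the implementation favors clarity over speed.

```python
def is_alpha_acyclic(edges):
    """GYO reduction: a hypergraph is alpha-acyclic iff repeatedly removing
    (1) attributes that occur in a single hyperedge and
    (2) hyperedges contained in another hyperedge
    leaves at most one (empty) hyperedge."""
    edges = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        # Rule 1: drop attributes that occur in exactly one hyperedge.
        for e in edges:
            for x in list(e):
                if sum(x in f for f in edges) == 1:
                    e.remove(x)
                    changed = True
        # Rule 2: drop one hyperedge that is contained in another one.
        for i, e in enumerate(edges):
            if any(i != j and e <= f for j, f in enumerate(edges)):
                edges.pop(i)
                changed = True
                break
    return len(edges) <= 1


def is_free_connex(output_attrs, edges):
    """Free-connex test: E is acyclic and E plus the hyperedge y is acyclic."""
    return is_alpha_acyclic(edges) and is_alpha_acyclic(list(edges) + [set(output_attrs)])


path = [{"x1", "x2"}, {"x2", "x3"}]
triangle = [{"x1", "x2"}, {"x2", "x3"}, {"x1", "x3"}]

print(is_alpha_acyclic(path))                    # acyclic path join
print(is_free_connex(("x1", "x2", "x3"), path))  # acyclic full join
print(is_free_connex(("x1", "x3"), path))        # projecting out x2 closes a triangle
print(is_alpha_acyclic(triangle))                # cyclic
```

The third call illustrates an acyclic but non-free-connex CQ: adding the hyperedge $\{x_{1},x_{3}\}$ to the path yields a triangle.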

There has been a long line of research on CQ evaluation (Beeri et al., 1983; Kolaitis and Vardi, 1998; Papadimitriou and Yannakakis, 1997; Vardi, 1982; Chekuri and Rajaraman, 2000). Yannakakis’s seminal algorithm (Yannakakis, 1981) was proposed for acyclic CQs, and its running time differs across sub-classes of acyclic CQs. A free-connex CQ can be evaluated in $O(N+\mathrm{OUT})$ time, which is already optimal since any algorithm needs to read the input data and output all query results. On the other hand, an acyclic but non-free-connex CQ can be evaluated in $O(N\cdot\mathrm{OUT})$ time. Subsequent works have progressively defined different notions of “width” (Gottlob et al., 2002, 2009), which measure how far a query is from being acyclic, and tackle cyclic queries via decomposition. This line of algorithms runs in $O(N^{w}+\mathrm{OUT})$ time, where $w$ can be the fractional hypertree width (Gottlob et al., 2002; Ngo et al., 2018), submodular width (Abo Khamis et al., 2016), or FAQ-width (Abo Khamis et al., 2016). In addition, some specific classes of CQs can be sped up by fast matrix multiplication techniques (Amossen and Pagh, 2009; Björklund et al., 2014; Deep et al., 2020), but we do not pursue that direction further. CQ evaluation is still an actively researched problem; any improvement there will also improve DCQ evaluation when plugged into the baseline as well as our approach.

In the remainder of the paper, we often use $\texttt{cost}(\mathcal{Q})$ to denote the time complexity of evaluating a CQ $\mathcal{Q}$.

Implications to the Baseline Approach of DCQ Evaluation. Given a DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$, the baseline approach of computing $\mathcal{Q}_{1},\mathcal{Q}_{2}$ separately and then taking their set difference incurs the following cost:

Corollary 2.1.

Given two CQs $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ , the DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(\texttt{cost}(\mathcal{Q}_{1})+\texttt{cost}(\mathcal{Q}_{2}))$ time.

For example, when both $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ are free-connex, the baseline approach runs in $O(N+\mathrm{OUT}_{1}+\mathrm{OUT}_{2})$ time, where $\mathrm{OUT}_{1},\mathrm{OUT}_{2}$ are the output sizes of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively.
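The gap between $\mathrm{OUT}_{2}$ and $\mathrm{OUT}$ can be seen on a small instance. The following is a minimal sketch of the baseline for the DCQ $R_{1}(x_{1},x_{3})-\pi_{x_{1},x_{3}}(R_{2}(x_{1},x_{2})\Join R_{3}(x_{2},x_{3}))$; the function name and the data are hypothetical, and the instance is deliberately skewed so that $\mathcal{Q}_{2}$ has a large result.

```python
def baseline_difference(R1, R2, R3):
    """Baseline for Q1 - Q2 with Q1 = R1(x1, x3) and
    Q2 = pi_{x1,x3}(R2(x1, x2) join R3(x2, x3)):
    materialize Q2 completely, then take the set difference."""
    # Hash index on R3 keyed by the join attribute x2.
    by_x2 = {}
    for x2, x3 in R3:
        by_x2.setdefault(x2, []).append(x3)
    # Materialize every result of Q2, whether or not it matters for Q1.
    q2 = {(x1, x3) for x1, x2 in R2 for x3 in by_x2.get(x2, [])}
    return set(R1) - q2, len(q2)


# Skewed hypothetical instance: all tuples of R2 and R3 share the join value 0.
R2 = {(i, 0) for i in range(100)}
R3 = {(0, j) for j in range(100)}
R1 = {(0, 0), (500, 500)}

difference, materialized = baseline_difference(R1, R2, R3)
print(difference, materialized)
```

With $N=202$ input tuples, the baseline materializes $\mathrm{OUT}_{2}=10{,}000$ intermediate results to report a difference of size $\mathrm{OUT}=1$.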

2.3. New Results of DCQ Evaluation

Our new complexity results for DCQ evaluation are summarized in Table 1. To help understand these results, we first introduce the class of linear-reducible CQs, and the reduce procedure.

Definition 2.2 (Linear-reducible).

A CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is linear-reducible if $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E}\cup\{\bm{\mathsf{y}}\})$ is free-connex.

The relationship between linear-reducible CQs and existing classifications of CQs is illustrated in Figure 2. Any full or free-connex CQ must be linear-reducible. In addition, some cyclic and non-full CQs are also linear-reducible, for example, $\mathcal{Q}=\pi_{x_{1},x_{2},x_{3}}(R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})\Join R_{3}(x_{1},x_{3})\Join R_{4}(x_{3},x_{4}))$, since adding $R_{5}(x_{1},x_{2},x_{3})$ results in a free-connex CQ (see Figure 2). Note also that any acyclic but non-free-connex CQ is non-linear-reducible.
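Definition 2.2 can be tested with a single acyclicity check. Since $\mathcal{E}\cup\{\bm{\mathsf{y}}\}$ already contains $\bm{\mathsf{y}}$, the free-connex characterization of Section 2.2 collapses to asking whether $\mathcal{E}\cup\{\bm{\mathsf{y}}\}$ is $\alpha$-acyclic. The following is a minimal sketch under that observation, using the classical GYO reduction as the acyclicity test (an assumption of this sketch, not a procedure from the paper); the function names are hypothetical.

```python
def is_alpha_acyclic(edges):
    """GYO reduction (see any textbook treatment of acyclic hypergraphs)."""
    edges = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        for e in edges:  # drop attributes that occur in one hyperedge only
            for x in list(e):
                if sum(x in f for f in edges) == 1:
                    e.remove(x)
                    changed = True
        for i, e in enumerate(edges):  # drop a hyperedge contained in another
            if any(i != j and e <= f for j, f in enumerate(edges)):
                edges.pop(i)
                changed = True
                break
    return len(edges) <= 1


def is_linear_reducible(output_attrs, edges):
    """(y, V, E + {y}) is free-connex iff E + {y} is alpha-acyclic,
    because adding y a second time does not change the hypergraph."""
    return is_alpha_acyclic(list(edges) + [set(output_attrs)])


# The cyclic, non-full example from the text: triangle on x1,x2,x3 plus R4(x3,x4).
cyclic_edges = [{"x1", "x2"}, {"x2", "x3"}, {"x1", "x3"}, {"x3", "x4"}]
# An acyclic but non-free-connex CQ: pi_{x1,x3}(R(x1,x2) join R(x2,x3)).
path_edges = [{"x1", "x2"}, {"x2", "x3"}]

print(is_alpha_acyclic(cyclic_edges))                           # the query itself is cyclic
print(is_linear_reducible(("x1", "x2", "x3"), cyclic_edges))    # yet linear-reducible
print(is_linear_reducible(("x1", "x3"), path_edges))            # non-linear-reducible
```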

Moreover, we introduce a reduce procedure in Algorithm 1 that can transform any linear-reducible CQ into a full join query in $O(N)$ time, while preserving the query results. This algorithm is similar to the semi-join phase of Yannakakis’s algorithm (Yannakakis, 1981). Intuitively, we remove attributes or relations in a bottom-up ordering of the nodes in a free-connex join tree. Recall that each node in the join tree corresponds to a relation in $\mathcal{E}$. When a node $e$ is visited, we distinguish two cases: (lines 4-5) if its output attributes are fully contained in its parent, we remove $e$ and update its parent relation via a semi-join; (lines 6-7) otherwise, we remove all non-output attributes (if any) in $e$ via projection. The output of Algorithm 1 is a full join query $\mathcal{Q}^{\prime}=(\bm{\mathsf{y}},\mathcal{E})$ (called the reduced query) and an instance $D^{\prime}$ (called the reduced instance) such that $\mathcal{Q}(D)=\mathcal{Q}^{\prime}(D^{\prime})$. An example of a reduced query is illustrated in Figure 2.
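The two cases above can be sketched as follows. This is an illustrative reading of the prose description only, not a transcription of Algorithm 1 (which is not reproduced here): it assumes that a rooted free-connex join tree is supplied by the caller, and it treats the root as falling under the projection case. All names (`reduce_query`, `semijoin`, `project`) and the sample data are hypothetical.

```python
def project(attrs, tuples, keep):
    """Project a set of tuples over `attrs` onto the attribute list `keep`."""
    idx = [attrs.index(a) for a in keep]
    return {tuple(t[i] for i in idx) for t in tuples}


def semijoin(p_attrs, p_tuples, c_attrs, c_tuples):
    """Keep the parent tuples that join with at least one child tuple."""
    shared = [a for a in p_attrs if a in c_attrs]
    keys = project(c_attrs, c_tuples, shared)
    idx = [p_attrs.index(a) for a in shared]
    return {t for t in p_tuples if tuple(t[i] for i in idx) in keys}


def reduce_query(output_attrs, relations, parent, bottom_up):
    """relations: name -> (attrs, tuples); parent: name -> parent name (root -> None);
    bottom_up: node names ordered so that every child precedes its parent."""
    rels = {name: (tuple(a), set(ts)) for name, (a, ts) in relations.items()}
    for node in bottom_up:
        attrs, tuples = rels[node]
        out = tuple(a for a in attrs if a in output_attrs)
        par = parent[node]
        if par is not None and set(out) <= set(rels[par][0]):
            # Case 1: output attributes are covered by the parent, so fold the
            # node into its parent with a semi-join and remove the node.
            p_attrs, p_tuples = rels[par]
            rels[par] = (p_attrs, semijoin(p_attrs, p_tuples, attrs, tuples))
            del rels[node]
        else:
            # Case 2: keep the node but project away its non-output attributes.
            rels[node] = (out, project(attrs, tuples, out))
    return rels


# Q = pi_{x1,x2,x3}(R1(x1,x2) join R2(x2,x3,x4) join R3(x4,x5)), tree R1 <- R2 <- R3.
relations = {
    "R1": (("x1", "x2"), {(1, 10), (2, 20), (3, 30)}),
    "R2": (("x2", "x3", "x4"), {(10, 100, 7), (10, 101, 8), (20, 200, 9)}),
    "R3": (("x4", "x5"), {(7, 0), (9, 0)}),
}
parent = {"R1": None, "R2": "R1", "R3": "R2"}
reduced = reduce_query(("x1", "x2", "x3"), relations, parent, ["R3", "R2", "R1"])
print(reduced)
```

In this run, $R_{3}$ is folded into $R_{2}$ by a semi-join, $R_{2}$ is projected onto $(x_{2},x_{3})$, and the remaining relations $R_{1}(x_{1},x_{2})$ and $R_{2}(x_{2},x_{3})$ form a full join over the output attributes whose result equals that of the original query.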

We are now ready to present the new results for DCQ evaluation.

Dichotomy for Linear-time Algorithm. Our main complexity result is a complete characterization of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ for which a linear-time algorithm can be achieved for computing $\mathcal{Q}_{1}-\mathcal{Q}_{2}$:

Definition 2.3 (Difference-Linear).

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$, the DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is difference-linear if $\mathcal{Q}_{1}$ is free-connex, $\mathcal{Q}_{2}$ is linear-reducible, and $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e\})$ is $\alpha$-acyclic for every $e\in\mathcal{E}^{\prime}_{2}$, where $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ and $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{2})$ are the reduced queries of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively.

Theorem 2.4 (Dichotomy).

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , the DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(N+\mathrm{OUT})$ time if and only if it is difference-linear.

Our proof of Theorem 2.4 consists of two steps. In Section 3, we prove the “if”-direction by designing a linear algorithm for the class of “easy” queries as characterized. In Section 4.1, we prove the “only-if” direction by showing the lower bound for the remaining class of “hard” queries, based on some well-established conjectures.

Improvement Achieved by Heuristics.

For the class of “hard” DCQs, on which a linear-time algorithm is out of reach, we further show some efficient heuristics. The complete results are presented in Section 4.2; here we mention an interesting case in which our heuristics strictly improve on the baseline:

Corollary 2.5.

Given two CQs $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ , if $\mathcal{Q}_{2}$ is linear-reducible, then DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(\texttt{cost}(\mathcal{Q}_{1}))$ time.

We summarize all these results above in Table 1: (1) our approach strictly improves the baseline as long as $\mathcal{Q}_{2}$ is linear-reducible; (2) furthermore, our approach leads to a linear-time algorithm if $\mathcal{Q}_{1}$ also satisfies some specific conditions; (3) in the remaining case when $\mathcal{Q}_{2}$ is non-linear-reducible, the comparison of our approach and baseline depends on specific queries or even input instances.

3. Easy DCQs

In this section, we show a linear-time algorithm for computing the class of “easy” DCQs characterized in Theorem 2.4. The main technique is query rewriting, which exploits the structures of the two input queries in a non-trivial way.

Theorem 3.1.

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , if $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is difference-linear, then DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(N+\mathrm{OUT})$ time.

We start with a special class of DCQs in which the two input CQs share the same schema. In Section 3.1, we introduce an algorithm based on query rewriting, which always pushes the difference operator down to the input relations and avoids materializing a large number of intermediate results that do not participate in the final query result. In Section 3.2, we move to the general case in which $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ can have different schemas.

3.1. $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ share the same schema

We first note that if $\mathcal{Q}_{1},\mathcal{Q}_{2}$ share the same schema, i.e., there is a one-to-one correspondence between the relations/attributes in $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ , Theorem 3.1 degenerates to the following lemma:

Lemma 3.2.

Given two CQs $\mathcal{Q}_{1}=\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , if $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is free-connex, then the DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(N+\mathrm{OUT})$ time.

Let’s start with an example falling into this special case.

Example 3.3.

Consider a DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ with $\mathcal{Q}_{1}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})$ and $\mathcal{Q}_{2}=R^{\prime}_{1}(x_{1},x_{2})\Join R^{\prime}_{2}(x_{2},x_{3})$ . We can rewrite it as the union of two join queries: $\mathcal{Q}_{1}-\mathcal{Q}_{2}=(R_{1}-R^{\prime}_{1})\Join R_{2}+R_{1}\Join(R_{2}-R^{\prime}_{2})$ , where the difference operator is only applied for computing $R_{1}-R^{\prime}_{1}$ and $R_{2}-R^{\prime}_{2}$ . Intuitively, for every join result $(a,b,c)\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , it must be $(a,b)\notin R^{\prime}_{1}$ or $(b,c)\notin R^{\prime}_{2}$ ; otherwise, $(a,b,c)\in\mathcal{Q}_{2}$ , coming to a contradiction. The correctness of this rewriting will be formally presented in the proof of Lemma 3.4. In addition, the difference operators can be evaluated in $O(N)$ time, and the join operators can be evaluated in $O(N+\mathrm{OUT})$ time.
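The rewriting in this example can be checked on a toy instance. The Python sketch below is only an illustration: the helper names (`join2`, `baseline_difference`, `rewritten_difference`) and the sample tuples are assumptions introduced here, and relations are modeled as plain Python sets of tuples.

```python
def join2(r_ab, r_bc):
    """Natural join of R(x1, x2) and S(x2, x3) using a hash index on x2."""
    index = {}
    for b, c in r_bc:
        index.setdefault(b, []).append(c)
    return {(a, b, c) for a, b in r_ab for c in index.get(b, [])}


def baseline_difference(r1, r2, r1p, r2p):
    """Materialize both joins, then take the set difference."""
    return join2(r1, r2) - join2(r1p, r2p)


def rewritten_difference(r1, r2, r1p, r2p):
    """Push the difference down to the input relations, then join."""
    return join2(r1 - r1p, r2) | join2(r1, r2 - r2p)


# Hypothetical instance: R1, R2 feed Q1 and R1p, R2p feed Q2.
R1 = {(1, 10), (2, 10), (3, 20)}
R2 = {(10, 100), (20, 200)}
R1p = {(1, 10), (3, 20)}
R2p = {(10, 100)}

expected = baseline_difference(R1, R2, R1p, R2p)
rewritten = rewritten_difference(R1, R2, R1p, R2p)
```

Both functions return the same set on this instance. The rewritten version never builds the join of `R1p` and `R2p`; it only joins inputs from which the tuples shared with $\mathcal{Q}_{2}$ have been removed on one side.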

Rewrite Rule. We now generalize the rewriting rule in Example 3.3 to a general DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ for $\mathcal{Q}_{1}=\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , where $\bm{\mathsf{y}}=\mathcal{V}$ and $(\mathcal{V},\mathcal{E})$ is $\alpha$ -acyclic. In other words, both $\mathcal{Q}_{1},\mathcal{Q}_{2}$ correspond to the same acyclic full join query. For any $e\in\mathcal{E}$ , let $R_{e},R^{\prime}_{e}$ be the corresponding relations in $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively. Our rule is built on the observation that for any query result $t\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , $\pi_{e}t\in R_{e}$ must hold for every $e\in\mathcal{E}$ , but $\pi_{e}t\notin R^{\prime}_{e}$ happens for some $e\in\mathcal{E}$ . Applying this observation, we can rewrite such a DCQ as the (disjoint) union of a constant number of join queries as follows:

Lemma 3.4.

Given two CQs $\mathcal{Q}_{1}=\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , if $(\mathcal{V},\mathcal{E})$ is $\alpha$ -acyclic and $\bm{\mathsf{y}}=\mathcal{V}$ , $\mathcal{Q}_{1}-\mathcal{Q}_{2}=\bigcup_{e\in\mathcal{E}}\left((R_{e}-R^{\prime}_{e})\Join(\Join_{e^{\prime}\in\mathcal{E}-\{e\}}R_{e^{\prime}})\right)$ .

Algorithm and Complexity. An algorithm directly follows the rewriting rule above. It first computes the difference of every pair of input relations, i.e., $R_{e}-R^{\prime}_{e}$ for each $e\in\mathcal{E}$ , and then computes a full join query $(R_{e}-R^{\prime}_{e})\Join\mathcal{Q}_{1}$ derived for each $e\in\mathcal{E}$ . Actually, we can handle a slightly larger class of DCQs. For $\mathcal{Q}_{1}=\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , if $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is free-connex, we simply remove all non-output attributes for $\mathcal{Q}_{1},\mathcal{Q}_{2}$ separately in the preprocessing step, and then tackle two acyclic full joins that share the same structure.

As the pre-processing step and difference operators can be evaluated in $O(N)$ time, this algorithm is bottlenecked by evaluating the join query $(\mathcal{V},\mathcal{E})$ , which takes $O(N+\mathrm{OUT})$ time. Putting everything together, we come to Lemma 3.2.

3.2. $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ have different schemas

We next move to the general case when the two input CQs have different schemas. We first focus on the case when both $\mathcal{Q}_{1},\mathcal{Q}_{2}$ are full and then extend to the non-full case.

DCQ with full CQs. Now, we assume that $\bm{\mathsf{y}}=\mathcal{V}_{1}=\mathcal{V}_{2}=\mathcal{V}$ . Theorem 3.1 simply degenerates to Lemma 3.5.

Lemma 3.5.

Given two full joins $\mathcal{Q}_{1}=(\mathcal{V},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\mathcal{V},\mathcal{E}_{2})$ , if $\mathcal{Q}_{1},\mathcal{Q}_{2}$ are $\alpha$ -acyclic, and $(\mathcal{V},\mathcal{E}_{1}\cup\{e\})$ is $\alpha$ -acyclic for every $e\in\mathcal{E}_{2}$ , then $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(N+\mathrm{OUT})$ time.

A straightforward solution is to transform both $\mathcal{Q}_{1}=(\mathcal{V},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\mathcal{V},\mathcal{E}_{2})$ into one auxiliary query $(\mathcal{V},\mathcal{E}_{1}\cup\mathcal{E}_{2})$ , and then invoke the algorithm in Section 3.1 to handle the degenerated case. However, this solution does not necessarily lead to a linear-time algorithm. Let’s gain some intuition from the example below.

Example 3.6.

Consider a DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ with $\mathcal{Q}_{1}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3},x_{4})$ and $\mathcal{Q}_{2}=R_{3}(x_{1},x_{2},x_{3})\Join R_{4}(x_{3},x_{4})$ . For an auxiliary query, we introduce the following intermediate relations $R_{5}=R_{1}\Join\pi_{x_{2},x_{3}}R_{2}$ , $R_{6}=\pi_{x_{3},x_{4}}R_{2}$ , $R_{7}=\pi_{x_{1},x_{2}}R_{3}$ and $R_{8}=\pi_{x_{2},x_{3}}R_{3}\Join R_{4}$ . Then, we can rewrite $\mathcal{Q}_{1},\mathcal{Q}_{2}$ as follows:

[TABLE]

Then, we are left with two queries that share the same schema. However, this strategy does not necessarily lead to a linear-time algorithm, since materializing the intermediate relation $R_{8}$ requires super-linear time, which could be much larger than the final output size $\mathrm{OUT}$ .

Careful inspection reveals that a simpler rewriting rule can avoid materializing $R_{8}$ . More specifically, we keep $\mathcal{Q}_{2}$ unchanged and rewrite $\mathcal{Q}_{1}$ as above. Then, $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be rewritten as $(R_{5}-R_{3})\Join R_{1}\Join R_{2}+R_{1}\Join R_{2}\Join(R_{6}-R_{4})$ . Intuitively, for every join result $(a,b,c,d)\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , it must be $(a,b,c)\notin R_{3}$ or $(c,d)\notin R_{4}$ ; otherwise, $(a,b,c,d)\in\mathcal{Q}_{2}$ , coming to a contradiction. The correctness of this rewriting will be formally presented in the proof of Lemma 3.7. In this case, materializing $R_{6}$ only takes $O(N)$ time, but materializing $R_{5}$ might take super-linear time. Fortunately, we can bound the size of $R_{5}$ by $O(N+\mathrm{OUT})$ . The rationale is that every tuple in $R_{5}-R_{3}$ will participate in at least one join result of $(R_{5}-R_{3})\Join R_{1}\Join R_{2}$ , i.e., the final result of the difference query $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , thus $|R_{5}-R_{3}|\leq\mathrm{OUT}$ . For the difference operator, $R_{5}-R_{3}$ takes $O(N+\mathrm{OUT})$ time, and $R_{6}-R_{4}$ takes $O(N)$ time. For the join operator, both simple join queries take linear time in terms of their input size and output size. Overall, this rewriting rule can compute the example query in $O(N+\mathrm{OUT})$ time.

Rewrite Rule. Generalizing this observation, we develop the following rewriting rule for arbitrary full joins $\mathcal{Q}_{1},\mathcal{Q}_{2}$ . The high-level idea is to introduce an intermediate relation $S_{e}=\pi_{e}\mathcal{Q}_{1}$ for every $e\in\mathcal{E}_{2}$ , i.e., the projection of join results of $\mathcal{Q}_{1}$ onto attributes $e$ . Now we can rewrite $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ using input relations in $\mathcal{Q}_{1}$ and intermediate relations corresponding to $\mathcal{E}_{2}$ , as well as input relations in $\mathcal{Q}_{2}$ , which results in the disjoint union of multiple full joins.

Lemma 3.7.

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , if $\bm{\mathsf{y}}=\mathcal{V}_{1}=\mathcal{V}_{2}$ , $\mathcal{Q}_{1}-\mathcal{Q}_{2}=\bigcup_{e\in\mathcal{E}_{2}}\left((S_{e}-R^{\prime}_{e})\Join\mathcal{Q}_{1}\right)$ , for $S_{e}=\pi_{e}\mathcal{Q}_{1}$ .

Proof.

Direction $\subseteq$ . Consider an arbitrary result $t\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$ . By definition, $\pi_{e_{1}}t\in R_{e_{1}}$ for every $e_{1}\in\mathcal{E}_{1}$ , and $\pi_{e_{2}}t\notin R^{\prime}_{e_{2}}$ for some $e_{2}\in\mathcal{E}_{2}$ . Moreover, $\pi_{e_{2}}t\in S_{e_{2}}=\pi_{e_{2}}\mathcal{Q}_{1}$ . So, $t\in(S_{e_{2}}-R^{\prime}_{e_{2}})\Join\mathcal{Q}_{1}$ . Direction $\supseteq$ . Consider an arbitrary $e_{2}\in\mathcal{E}_{2}$ and a query result $t\in(S_{e_{2}}-R^{\prime}_{e_{2}})\Join\mathcal{Q}_{1}$ . By definition, $t\in\mathcal{Q}_{1}$ and $\pi_{e_{2}}t\notin R^{\prime}_{e_{2}}$ , which further implies $t\notin\mathcal{Q}_{2}$ . This way, $t\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$ . ∎

Algorithm and Complexity. An algorithm for computing the difference of two full join queries follows the rewriting rule above. For each $e\in\mathcal{E}_{2}$ , it first materializes the query results of $\pi_{e}\mathcal{Q}_{1}$ , then computes the difference operator $\pi_{e}\mathcal{Q}_{1}-R^{\prime}_{e}$ , and finally the full join $(\pi_{e}\mathcal{Q}_{1}-R^{\prime}_{e})\Join\mathcal{Q}_{1}$ by invoking the classical Yannakakis algorithm. We next analyze the complexity of the algorithm above. To establish the complexity, we first show an upper bound on the size of any intermediate relation constructed:

Lemma 3.8.

$|S_{e}|=O(N+\mathrm{OUT})$ for any $e\in\mathcal{E}_{2}$ , where $S_{e}=\pi_{e}\mathcal{Q}_{1}$ .

Proof.

Consider an arbitrary tuple $t\in S_{e}-R^{\prime}_{e}$ . First, $t$ participates in at least one query result of $\mathcal{Q}_{1}$ . As $t\in S_{e}$ , $t\in\pi_{e}\mathcal{Q}_{1}$ by definition. There must exist some tuple $t^{\prime}\in\mathcal{Q}_{1}$ such that $\pi_{e}t^{\prime}=t$ . Thus, $t$ participates in some query results of $\mathcal{Q}_{1}$ . Meanwhile, $t$ does not participate in any query result of $\mathcal{Q}_{2}$ , since $t\notin R^{\prime}_{e}$ . In this way, $t$ participates in at least one result in $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , thus $|S_{e}-R^{\prime}_{e}|\leq\mathrm{OUT}$ . Moreover, $|R^{\prime}_{e}|\leq N$ . Together, we obtain $|S_{e}|=O(N+\mathrm{OUT})$ . ∎

Let $\mathcal{V}=\mathcal{V}_{1}=\mathcal{V}_{2}$ . If $(\mathcal{V},\mathcal{E}_{1})$ is $\alpha$ -acyclic, and $(\mathcal{V},\mathcal{E}_{1}\cup\{e\})$ is also $\alpha$ -acyclic for every $e\in\mathcal{E}_{2}$ , then the constructed CQ $\pi_{e}\mathcal{Q}_{1}$ is free-connex. Implied by the existing result on CQ evaluation, $S_{e}$ can be computed in $O(N+|S_{e}|)=O(N+\mathrm{OUT})$ time by the classic Yannakakis algorithm, where $\mathrm{OUT}$ is the output size of the difference query. The invocation of the Yannakakis algorithm here is crucial for achieving linear complexity. For example, if $S_{e}$ is computed by first materializing the query results of $\mathcal{Q}_{1}$ and then computing their projection onto $e$ , the time complexity would be as large as $O(N+\mathrm{OUT}_{1})$ , where $\mathrm{OUT}_{1}$ is the output size of $\mathcal{Q}_{1}$ . Now, each full join $(S_{e}-R^{\prime}_{e})\Join\mathcal{Q}_{1}$ is $\alpha$ -acyclic with input size $O(N+\mathrm{OUT})$ and output size $\mathrm{OUT}$ , thus can be computed in $O(N+\mathrm{OUT})$ time. Therefore, the total time complexity is bounded by $O(N+\mathrm{OUT})$ , since there are $O(1)$ sub-queries in $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ . Putting everything together, we come to Lemma 3.5.
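The rewriting of Lemma 3.7 and the size bound of Lemma 3.8 can be checked on a small instance shaped like Example 3.6. The Python sketch below is a correctness illustration only and makes a simplifying assumption that the analysis above explicitly avoids: it obtains $S_{e}$ by materializing $\mathcal{Q}_{1}$ with a naive join and projecting, so it does not achieve the $O(N+\mathrm{OUT})$ bound. The names `natural_join`, `project`, `dcq_full` and the sample tuples are hypothetical.

```python
def natural_join(rels):
    """Naive natural join of (schema, rows) relations; for illustration only."""
    schema, rows = (), {()}
    for s, tuples in rels:
        new_attrs = tuple(a for a in s if a not in schema)
        joined = set()
        for row in rows:
            bound = dict(zip(schema, row))
            for t in tuples:
                cand = dict(zip(s, t))
                if all(bound[a] == cand[a] for a in s if a in bound):
                    joined.add(row + tuple(cand[a] for a in new_attrs))
        schema, rows = schema + new_attrs, joined
    return schema, rows


def project(rel, attrs):
    """Project a relation onto attrs, removing duplicates."""
    schema, rows = rel
    return tuple(attrs), {tuple(r[schema.index(a)] for a in attrs) for r in rows}


def dcq_full(q1_rels, q2_rels, attrs):
    """Union over e in E2 of (S_e - R'_e) JOIN Q1, with S_e = pi_e(Q1)."""
    q1 = natural_join(q1_rels)  # simplification: materializes Q1
    result, diff_sizes = set(), []
    for schema_e, r_e in q2_rels:
        _, s_e = project(q1, schema_e)
        diff = s_e - r_e
        diff_sizes.append(len(diff))
        part = natural_join([(schema_e, diff)] + q1_rels)
        result |= project(part, attrs)[1]
    return result, diff_sizes


ATTRS = ("x1", "x2", "x3", "x4")
Q1_RELS = [(("x1", "x2"), {(1, 2), (5, 2)}),
           (("x2", "x3", "x4"), {(2, 3, 4), (2, 3, 9)})]
Q2_RELS = [(("x1", "x2", "x3"), {(1, 2, 3)}),
           (("x3", "x4"), {(3, 4), (3, 9)})]

result, diff_sizes = dcq_full(Q1_RELS, Q2_RELS, ATTRS)
baseline = (project(natural_join(Q1_RELS), ATTRS)[1]
            - project(natural_join(Q2_RELS), ATTRS)[1])
```

On this instance the rewritten evaluation agrees with the baseline, and each $|S_{e}-R^{\prime}_{e}|$ recorded in `diff_sizes` is at most the output size, as Lemma 3.8 predicts.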

DCQ with general CQs. Now, we are ready to present a linear-time algorithm for computing $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , such that $\mathcal{Q}_{1}$ is free-connex, $\mathcal{Q}_{2}$ is linear-reducible, and $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e\})$ is $\alpha$ -acyclic for every $e\in\mathcal{E}^{\prime}_{2}$ , where $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ and $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{2})$ are the reduced queries of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively. As described in Algorithm 2, we first apply a preprocessing step to $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ (lines 1-4), which removes non-output attributes in $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ if they are non-full.

As shown in Algorithm 1, this reduce step is quite standard: it first builds a free-connex join tree for the derived query $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E}\cup\{\bm{\mathsf{y}}\})$ , and then traverses the tree in a bottom-up way. In the traversal, when a relation is visited and contains some non-output attributes, we just update its parent relation by applying a semi-join and removing it. Note that if a relation does not contain any non-output attribute, then its ancestor also does not contain any, implied by the property of the free-connex join tree. Thus, the residual tree is a connected subtree that contains the root. Note that no physical relation is defined for $\bm{\mathsf{y}}$ , but this is not an issue since when such a relation is visited, Algorithm 1 simply skips it (line 4) as well as its ancestors. This algorithm only takes $O(N)$ time.

Then, we are left with two full joins, and invoke our rewriting rule proposed in Section 3.2 (lines 6-8). As the reduce procedure takes $O(N)$ time, and the join phase takes $O(N+\mathrm{OUT})$ time implied by Lemma 3.5, we can obtain the complexity result in Theorem 3.1.

Improvement over Baseline.

When $\mathcal{Q}_{1},\mathcal{Q}_{2}$ fall into the class of “easy” DCQs as characterized by Theorem 3.1, our algorithm only takes $O(N+\mathrm{OUT})$ time for computing $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , while the baseline takes $O\left(N+\mathrm{OUT}_{1}+\texttt{cost}(\mathcal{Q}_{2})\right)$ time, since $\texttt{cost}(\mathcal{Q}_{1})=O(N+\mathrm{OUT}_{1})$ for free-connex $\mathcal{Q}_{1}$ . We next use a few examples of “easy” DCQs to illustrate the improvement achieved by our approach.

Example 3.9.

Consider a DCQ with $\mathcal{Q}_{1}=R_{1}(x_{1},x_{2},x_{3})$ and $\mathcal{Q}_{2}=R_{2}(x_{1},x_{2})\Join R_{3}(x_{2},x_{3})\Join R_{4}(x_{1},x_{3})$ . The baseline takes $O\left(N^{\frac{2\cdot\omega}{\omega+1}}+N^{\frac{3(\omega-1)}{\omega+1}}\cdot\mathrm{OUT}_{2}^{\frac{3-\omega}{\omega+1}}\right)$ time to compute the triangle join $R_{2}\Join R_{3}\Join R_{4}$ in $\mathcal{Q}_{2}$ , where $\omega$ is the exponent of fast matrix multiplication. In contrast, our approach only takes $O(N)$ time since $\mathrm{OUT}\leq N$ , improving the baseline by a factor of $O\left(N^{\frac{\omega-1}{\omega+1}}+N^{\frac{2\omega-4}{\omega+1}}\cdot\mathrm{OUT}_{2}^{\frac{3-\omega}{\omega+1}}\right)$ .
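As a reader aid, the stated factor can be checked by dividing each term of the baseline bound by the $O(N)$ cost of our approach; this is a routine calculation spelled out here, not an additional result:

$$\frac{N^{\frac{2\omega}{\omega+1}}}{N}=N^{\frac{2\omega-(\omega+1)}{\omega+1}}=N^{\frac{\omega-1}{\omega+1}},\qquad\frac{N^{\frac{3(\omega-1)}{\omega+1}}\cdot\mathrm{OUT}_{2}^{\frac{3-\omega}{\omega+1}}}{N}=N^{\frac{3(\omega-1)-(\omega+1)}{\omega+1}}\cdot\mathrm{OUT}_{2}^{\frac{3-\omega}{\omega+1}}=N^{\frac{2\omega-4}{\omega+1}}\cdot\mathrm{OUT}_{2}^{\frac{3-\omega}{\omega+1}}.$$

For instance, under the assumption $\omega=2$ the exponents become $\frac{1}{3}$ , $0$ and $\frac{1}{3}$ , so the factor simplifies to $O\left(N^{1/3}+\mathrm{OUT}_{2}^{1/3}\right)$ .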

Example 3.10.

Consider a DCQ with $\mathcal{Q}_{1}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{3},x_{4})$ and $\mathcal{Q}_{2}=R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3})\Join R_{5}(x_{1},x_{3})$ . The baseline takes $O(N^{2})$ time to materialize $\mathcal{Q}_{1}$ , which degenerates to the Cartesian product of $R_{1}$ and $R_{2}$ . In contrast, our approach only requires $O(N+\mathrm{OUT})$ time, improving the baseline by a factor of $O\left(\frac{N^{2}}{\mathrm{OUT}}\right)$ , since $\mathrm{OUT}$ can be much smaller than $N^{2}$ .

Example 3.11.

Consider a DCQ with $\mathcal{Q}_{1}=\Join_{e\subseteq U:|e|=1}R_{e}(\{x_{1}\}\cup e)$ and $\mathcal{Q}_{2}=\Join_{e^{\prime}\subseteq U:|e^{\prime}|=2}R_{e^{\prime}}(\{x_{1}\}\cup e^{\prime})$ for $U=\{x_{2},\cdots,x_{k+1}\}$ . The baseline takes $O(N)$ time to materialize $\mathcal{Q}_{1}$ , and $O(N^{\frac{k}{2}})$ time to materialize $\mathcal{Q}_{2}$ . In contrast, our approach can compute it in $O(N+\mathrm{OUT})$ time, improving the baseline by a factor of $O\left(\frac{N^{k/2}}{\mathrm{OUT}}\right)$ , since $\mathrm{OUT}$ can be much smaller than $N^{\frac{k}{2}}$ .

4. Hard DCQs

In this section, we turn to the class of “hard” DCQs characterized by Theorem 2.4. We first prove the hardness of computing DCQs in linear time via some well-known conjectures, and then show an efficient heuristic for hard DCQs by further exploiting the query structures.

4.1. Hardness

We will prove the hardness of computing a hard DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , in particular: (1) $\mathcal{Q}_{1}$ is non-free-connex; or (2) $\mathcal{Q}_{1}$ is free-connex but $\mathcal{Q}_{2}$ is non-linear-reducible; or (3) $\mathcal{Q}_{1}$ is free-connex, $\mathcal{Q}_{2}$ is linear-reducible, but there exists some $e\in\mathcal{E}^{\prime}_{2}$ such that $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e\})$ is cyclic, where $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ and $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{2})$ are the reduced queries of $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1}),\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ respectively. We will prove the hardness for each class of hard DCQs separately.

Hardness-(1). The hardness of computing DCQs in case (1) comes from computing a non-free-connex CQ (Bagan et al., 2007). By setting the result of $\mathcal{Q}_{2}$ as $\emptyset$ , $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ simply degenerates to $\mathcal{Q}_{1}$ , hence we obtain:

Lemma 4.1.

For any DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , if $\mathcal{Q}_{1}$ is non-free-connex, any algorithm computing $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ requires at least $\Omega(N+\mathrm{OUT})$ time.

The hardness of case (2) and (3) is built on the strong triangle conjecture in the literature:

Conjecture 4.2 (Strong Triangle conjecture (Abboud and Williams, 2014)).

Detecting whether an $n$ -node $m$ -edge graph contains a triangle requires $\Omega\left(\min\left\{n^{\omega-o(1)},m^{2\omega/(\omega+1)-o(1)}\right\}\right)$ time in expectation, where $\omega=2+o(1)$ is assumed as the exponent of fast matrix multiplication.

Hardness-(2). We start with two hardcore DCQs in Lemma 4.3 and Lemma 4.4. The proof of Lemma 4.5 for general DCQs in case (2) is given in Appendix B.

Lemma 4.3.

Any algorithm for computing the following DCQ:

[TABLE]

requires $\Omega(N)$ time, assuming the strong triangle conjecture.

Proof.

For a graph $G=(V,E)$ , we denote $m=|E|$ and $n=|V|$ . Note that $n<m<n^{2}$ ; otherwise, we simply remove vertices that are not incident to any edge in $G$ . We then construct an instance $D_{1},D_{2}$ for $\mathcal{Q}_{1},\mathcal{Q}_{2}$ by setting $R_{1}=R_{2}=R_{3}=E$ . Hence, $N=m$ . Note that there exists some triangle in $G$ if and only if $\mathcal{Q}_{1}\cap\mathcal{Q}_{2}$ is non-empty. Together with $\mathcal{Q}_{1}\cap\mathcal{Q}_{2}=\mathcal{Q}_{1}-(\mathcal{Q}_{1}-\mathcal{Q}_{2})$ , we output “a triangle is detected in $G$ ” if and only if $|\mathcal{Q}_{1}-\mathcal{Q}_{2}|<N$ . If $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(N)$ time, whether there exists a triangle in $G$ can be detected in $O(\min\{n^{2},m^{4/3}\})$ time, coming to a contradiction of the strong triangle conjecture. ∎

Lemma 4.4.

Any algorithm for computing the following DCQ:

[TABLE]

requires $\Omega(N)$ time, assuming the strong triangle conjecture.

Proof.

This is similar to the proof of Lemma 4.3. For a graph $G=(V,E)$ , we construct $R_{2}=R_{3}=R_{4}=E$ and $R_{1}=V$ , with $m=|E|=N$ and $n=|V|$ . Note that there exists some triangle in $G$ if and only if $\mathcal{Q}_{1}\cap\mathcal{Q}_{2}$ is non-empty. Together with $\mathcal{Q}_{1}\cap\mathcal{Q}_{2}=\mathcal{Q}_{1}-(\mathcal{Q}_{1}-\mathcal{Q}_{2})$ , we output “a triangle is detected in $G$ ” if and only if $|\mathcal{Q}_{1}-\mathcal{Q}_{2}|<|R_{1}|$ . If $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(N)$ time, whether there exists a triangle in $G$ can be detected in $O(\min\{n^{2},m^{4/3}\})$ time, coming to a contradiction of the strong triangle conjecture. ∎

Lemma 4.5.

Given two CQs $\mathcal{Q}_{1},\mathcal{Q}_{2}$ , if $\mathcal{Q}_{1}$ is free-connex and $\mathcal{Q}_{2}$ is non-linear-reducible, any algorithm computing $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ requires $\Omega(N+\mathrm{OUT})$ time, assuming the strong triangle conjecture.

Hardness-(3). The hardness of evaluating a DCQ in case (3) inherits the hardness of deciding a DCQ: given a DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ and input databases $D_{1},D_{2}$ , the decision problem asks whether there exists a query result in $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ . We identify a few hardcore DCQs in Lemma 4.6. The proof of Lemma 4.7 for general DCQs in case (3) is given in Appendix B.

Lemma 4.6.

Any algorithm for deciding the following DCQ

[TABLE]

requires $\Omega(N)$ time, assuming the strong triangle conjecture.

Proof.

We first focus on the first DCQ and the remaining ones can be proved similarly. Given an arbitrary graph $G=(V,E)$ with $m=|E|$ and $n=|V|$ , we develop an algorithm to detect whether there exists a triangle in $G$ . Note that $n<m<n^{2}$ ; otherwise, we simply remove vertices that are not incident to any edge in $G$ . The degree $\textrm{deg}(u)$ of a vertex $u\in V$ is defined as the number of neighbors of $u$ , i.e., the vertices connected to $u$ by an edge in $E$ . We partition vertices in $V$ into two subsets: $V^{H}=\{v\in V:\textrm{deg}(v)>m^{1/3}\}$ and $V^{L}=V-V^{H}$ . From $G$ , we construct the following relations: $R=E$ , $R_{0}=\{(u,v)\in E:u\in V^{L}\textrm{ or }v\in V^{L}\}$ , $R_{1}=\{(u,v)\in E:u\in V^{H}\}$ , $R_{2}=\{(u,v)\in E:v\in V^{H}\}$ and $R_{3}=V^{H}\times V^{H}-E$ . It can be easily checked that each relation contains at most $m^{4/3}$ tuples, hence we set $N=m^{4/3}$ . We further define a CQ $\mathcal{Q}$ as follows:

[TABLE]

For $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{4}(x_{2})$ , we set $R_{4}=V$ and output “a triangle is detected” if and only if $\mathcal{Q}$ or $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is not empty. We first prove the correctness of this algorithm, i.e., a triangle exists in $G$ if and only if $\mathcal{Q}$ or $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is not empty. Direction Only-If. Consider an arbitrary triangle $(u,v,w)$ in $G$ . We distinguish two cases: (i) at least one of $u,w$ is light; (ii) both $u$ and $w$ are heavy. In (i), assume $u$ is light. Then, $(u,v),(u,w)\in R_{0}$ . We come to $(u,v,w)\in\mathcal{Q}$ . In (ii), $(u,v)\in R_{1}$ , $(v,w)\in R_{2}$ , $(u,w)\notin R_{3}$ , so we come to $(u,v,w)\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$ . Direction If. If $\mathcal{Q}\neq\emptyset$ , say $(u,v,w)\in\mathcal{Q}$ , then $(u,v),(v,w),(u,w)\in E$ and therefore $(u,v,w)$ is a triangle in $G$ . If $\mathcal{Q}_{1}-\mathcal{Q}_{2}\neq\emptyset$ , say $(u,v,w)\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , then $(u,v)\in R_{1},(v,w)\in R_{2},(u,w)\notin R_{3}$ , i.e., $(u,v),(v,w),(u,w)\in E$ , and therefore $(u,v,w)$ is a triangle in $G$ .

We next turn to the time complexity. All statistics and relations $R_{1},R_{2},R_{4}$ can be computed in $O(m)$ time. Moreover, relation $R_{3}$ can be constructed in $O(m^{4/3})$ time since $|V^{H}|=O(m^{2/3})$ . $\mathcal{Q}$ can be evaluated in $O(m^{4/3})$ time, since each of $R_{0}(x_{1},x_{2})\Join R(x_{2},x_{3})$ generates at most $O(m^{4/3})$ intermediate join results if $x_{2}\in V^{L}$ , and each of $R_{0}(x_{1},x_{2})\Join R(x_{1},x_{3})$ generates at most $O(m^{4/3})$ intermediate join results if $x_{1}\in V^{L}$ . If $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be decided in $O(N)$ time, whether there exists a triangle in $G$ can be decided in $O(m^{4/3})$ time, coming to a contradiction to strong triangle conjecture.

For $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{4}(x_{2},x_{3})$ , we set $R_{4}=E$ and output “a triangle is detected” if and only if $\mathcal{Q}$ or $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is not empty. For $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{5}(x_{1},x_{2})$ , we set $R_{5}=R$ and output “a triangle is detected” if and only if $\mathcal{Q}$ or $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is not empty. For $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{4}(x_{2},x_{3})\Join R_{5}(x_{1},x_{2})$ , we set $R_{4}=R_{5}=E$ and output “a triangle is detected” if and only if $\mathcal{Q}$ or $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ is not empty. Similarly, $\mathcal{Q}$ can be computed in $O(m^{4/3})$ time. This way, if $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be decided in $O(N)$ time, whether there exists a triangle in $G$ can be decided in $O(m^{4/3})$ time, coming to a contradiction to strong triangle conjecture. Together, we have completed the proof. ∎
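The heavy-vertex part of this construction can be illustrated on small graphs. The Python sketch below builds $V^{H}$ , $R_{1}$ , $R_{2}$ , $R_{3}$ and evaluates the first DCQ naively; it does not reproduce the auxiliary query $\mathcal{Q}$ that covers triangles with a light endpoint. The function names and test graphs are hypothetical, and two modeling assumptions are made here: $E$ stores both orientations of every undirected edge, and $m$ counts undirected edges.

```python
def heavy_light_relations(vertices, undirected_edges):
    """Build E (both orientations), the heavy set V^H, and R1, R2, R3."""
    E = ({(u, v) for u, v in undirected_edges}
         | {(v, u) for u, v in undirected_edges})
    m = len(undirected_edges)
    deg = {v: 0 for v in vertices}
    for u, _ in E:
        deg[u] += 1
    heavy = {v for v in vertices if deg[v] > m ** (1 / 3)}
    R1 = {(u, v) for u, v in E if u in heavy}
    R2 = {(u, v) for u, v in E if v in heavy}
    R3 = {(u, w) for u in heavy for w in heavy} - E
    return E, heavy, R1, R2, R3


def heavy_dcq(vertices, undirected_edges):
    """Naively evaluate R1(x1,x2) JOIN R2(x2,x3) - R3(x1,x3) JOIN R4(x2), R4 = V."""
    _, _, R1, R2, R3 = heavy_light_relations(vertices, undirected_edges)
    R4 = set(vertices)
    q1 = {(u, v, w) for u, v in R1 for v2, w in R2 if v == v2}
    q2 = {(u, v, w) for u, w in R3 for v in R4}
    return q1 - q2


# Complete graph on 4 vertices: every vertex is heavy and every triple is a triangle.
K4_VERTICES = [0, 1, 2, 3]
K4_EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
k4_result = heavy_dcq(K4_VERTICES, K4_EDGES)

# Path a-b-c-d: b and c are heavy, but the graph is triangle-free.
PATH_VERTICES = ["a", "b", "c", "d"]
PATH_EDGES = [("a", "b"), ("b", "c"), ("c", "d")]
path_result = heavy_dcq(PATH_VERTICES, PATH_EDGES)
```

On the complete graph the DCQ returns every ordered triangle, while on the path it is empty because each length-two walk between heavy endpoints either returns to its start or lacks the closing edge, so the pair lies in $R_{3}$ .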

Lemma 4.7.

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1}),\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , if $\mathcal{Q}_{1}$ is free-connex, $\mathcal{Q}_{2}$ is linear-reducible, and there exists some $e\in\mathcal{E}^{\prime}_{2}$ such that $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e\})$ is cyclic where $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ and $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{2})$ are the reduced queries of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively, any algorithm computing $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ requires $\Omega(N+\mathrm{OUT})$ time, assuming the strong triangle conjecture.

4.2. Efficient Heuristics

Although the hardness results in Section 4.1 have ruled out a linear-time algorithm for the “hard” DCQs, we find that it is still possible to explore efficient heuristics that can outperform the baseline approach. Our heuristic is based on the simple fact that $\mathcal{Q}_{1}-\mathcal{Q}_{2}=\mathcal{Q}_{1}-(\mathcal{Q}_{1}\cap\mathcal{Q}_{2})$ . After computing the query results for $\mathcal{Q}_{1}$ , a straightforward way of computing $\mathcal{Q}_{1}\cap\mathcal{Q}_{2}$ is to decide, for each result $t\in\mathcal{Q}_{1}$ , whether $t\in\mathcal{Q}_{2}$ or not. This decision query can be viewed as a special Boolean query obtained by replacing every output attribute $x\in\bm{\mathsf{y}}$ with the constant $\pi_{x}t$ . More specifically, for $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , the derived Boolean query can be represented as $(\emptyset,\mathcal{V}_{2}-\bm{\mathsf{y}},\{e-\bm{\mathsf{y}}:e\in\mathcal{E}_{2}\})$ . Putting everything together, we come to Theorem 4.8.

Theorem 4.8.

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(\texttt{cost}(\mathcal{Q}_{1})+\mathrm{OUT}_{1}\cdot\texttt{cost}(\mathcal{Q}^{\emptyset}_{2}))$ time, where $\mathcal{Q}^{\emptyset}_{2}=(\emptyset,\mathcal{V}_{2}-\bm{\mathsf{y}},\{e-\bm{\mathsf{y}}:e\in\mathcal{E}_{2}\})$ .
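A minimal Python sketch of this per-tuple strategy is given below for one hypothetical shape of query, $\mathcal{Q}_{1}=R_{1}(x_{1},x_{3})$ and $\mathcal{Q}_{2}=\pi_{x_{1},x_{3}}(R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3}))$ . For each $t=(a,c)\in\mathcal{Q}_{1}$ , the residual Boolean query asks whether some $x_{2}$ satisfies $(a,x_{2})\in R_{3}$ and $(x_{2},c)\in R_{4}$ . The function name, the index layout and the sample tuples are assumptions made for illustration.

```python
from collections import defaultdict


def difference_by_boolean_queries(q1_rows, r3, r4):
    """Return Q1 - Q2 by answering one Boolean query per tuple of Q1."""
    succ = defaultdict(set)  # x1 -> {x2 : (x1, x2) in R3}
    pred = defaultdict(set)  # x3 -> {x2 : (x2, x3) in R4}
    for a, b in r3:
        succ[a].add(b)
    for b, c in r4:
        pred[c].add(b)
    out = set()
    for a, c in q1_rows:
        # The Boolean query for (a, c) is false iff no x2 links a to c.
        if succ.get(a, set()).isdisjoint(pred.get(c, set())):
            out.add((a, c))
    return out


R1 = {(1, 3), (1, 4), (2, 3)}
R3 = {(1, 10), (2, 20)}
R4 = {(10, 3), (20, 4)}
remaining = difference_by_boolean_queries(R1, R3, R4)
```

Here only `(1, 3)` has a witness (`x2 = 10`), so it is the only tuple removed. The cost of each probe depends on the sizes of the two candidate sets, which corresponds to the $\texttt{cost}(\mathcal{Q}^{\emptyset}_{2})$ term paid once per result of $\mathcal{Q}_{1}$ .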

Remark. If $\mathcal{Q}_{2}$ is linear-reducible, $\mathcal{Q}_{2}$ can be reduced to a full join in $O(N)$ time by Algorithm 1. Then, $\mathcal{Q}^{\emptyset}_{2}$ becomes empty. A faster solution is to build hashing indexes on every relation in the reduced $\mathcal{Q}_{2}$ . For each tuple $t\in\mathcal{Q}_{1}$ , it suffices to check for every $e\in\mathcal{E}_{2}$ whether $\pi_{e}t\in R^{\prime}_{e}$ , which only takes $O(1)$ time. We note that the rewriting rule in Lemma 3.7 can also apply to this case and lead to the same complexity. Suppose $\mathcal{Q}_{2}$ is reduced. Each $e\in\mathcal{E}_{2}$ induces a CQ $(\pi_{e}\mathcal{Q}_{1}-R^{\prime}_{e})\Join\mathcal{Q}_{1}$ . After materializing the results of $\mathcal{Q}_{1}$ , it suffices to check for each tuple $t\in\mathcal{Q}_{1}$ , whether $\pi_{e}t\in R^{\prime}_{e}$ or not. This is exactly how our heuristic proceeds. Hence, Corollary 2.5 follows.
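The probing step in this remark can be sketched in Python as follows, under the assumption that $\mathcal{Q}_{1}$ has already been materialized and $\mathcal{Q}_{2}$ has already been reduced to a full join. Python sets play the role of the hash indexes; the function name `difference_by_probing` and the sample relations are hypothetical.

```python
def difference_by_probing(q1_schema, q1_rows, q2_rels):
    """Filter the results of Q1 against the relations of a reduced (full) Q2.

    q2_rels is a list of (schema, set of tuples); each set acts as a hash index.
    """
    pos = {a: i for i, a in enumerate(q1_schema)}
    probes = [([pos[a] for a in schema], rows) for schema, rows in q2_rels]
    out = set()
    for t in q1_rows:
        # t is in Q2 iff its projection onto every e is present in R'_e.
        in_q2 = all(tuple(t[i] for i in idx) in rows for idx, rows in probes)
        if not in_q2:
            out.add(t)
    return out


Q1_SCHEMA = ("x1", "x2", "x3")
Q1_ROWS = {(1, 2, 3), (1, 2, 4), (5, 6, 7)}
Q2_RELS = [
    (("x1", "x2"), {(1, 2), (5, 6)}),
    (("x2", "x3"), {(2, 3), (6, 7)}),
    (("x1", "x3"), {(1, 3)}),
]
survivors = difference_by_probing(Q1_SCHEMA, Q1_ROWS, Q2_RELS)
```

Although the reduced $\mathcal{Q}_{2}$ in this instance is a triangle join, it is never evaluated: each tuple of $\mathcal{Q}_{1}$ triggers a constant number of expected constant-time lookups, which matches the $O(\texttt{cost}(\mathcal{Q}_{1}))$ bound of Corollary 2.5.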

Example 4.9.

Consider a DCQ with $\mathcal{Q}_{1}=\pi_{x_{1},x_{2},x_{3}}R_{1}(x_{1},x_{4})\Join R_{2}(x_{4},x_{2},x_{3})$ and $\mathcal{Q}_{2}=\pi_{x_{1},x_{2},x_{3}}R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3})\Join R_{5}(x_{1},x_{3})\Join R_{6}(x_{3},x_{4})$ . The baseline spends $O(N^{\frac{2\cdot\omega}{\omega+1}}+N^{\frac{\omega-1}{\omega+1}}\cdot\mathrm{OUT}_{1})$ time computing $\mathcal{Q}_{1}$ and $O(N^{\frac{2\cdot\omega}{\omega+1}}+N^{\frac{3(\omega-1)}{\omega+1}}\cdot\mathrm{OUT}_{2}^{\frac{3-\omega}{\omega+1}})$ time computing the hidden triangle join $R_{3}\Join R_{4}\Join R_{5}$ in $\mathcal{Q}_{2}$ , where $\omega$ is the exponent of fast matrix multiplication. In contrast, our algorithm only spends $O(N^{\frac{2\cdot\omega}{\omega+1}}+N^{\frac{\omega-1}{\omega+1}}\cdot\mathrm{OUT}_{1})$ time for computing $\mathcal{Q}_{1}$ , without computing the expensive $\mathcal{Q}_{2}$ , hence can improve the baseline by a factor of $O\left(N^{\frac{2(\omega-1)}{\omega+1}}\cdot\mathrm{OUT}_{2}^{\frac{3-\omega}{\omega+1}}/\mathrm{OUT}_{1}\right)$ when $N^{\frac{2(\omega-1)}{\omega+1}}\cdot\mathrm{OUT}_{2}^{\frac{3-\omega}{\omega+1}}\geq\mathrm{OUT}_{1}$ .

We can show some further improvement when $\mathcal{Q}_{2}$ is non-linear-reducible. Instead of issuing an individual Boolean query for every query result $t\in\mathcal{Q}_{1}$ , we take all the Boolean queries into account as a whole. To do so, we further explore the structural property of the intersection query $\mathcal{Q}^{\oplus}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\{\bm{\mathsf{y}}\}\cup\mathcal{E}_{2})$ , by treating the query results of $\mathcal{Q}_{1}$ as a single relation over attributes $\bm{\mathsf{y}}$ . It is unclear how $\texttt{cost}(\mathcal{Q}_{2})$ compares with $\texttt{cost}(\mathcal{Q}^{\oplus}_{2})$ , since $\mathcal{Q}^{\oplus}_{2}$ involves an extra relation (over attributes $\bm{\mathsf{y}}$ ) of input size as large as the output size of $\mathcal{Q}_{1}$ .

Remark.

We note that if $\mathcal{Q}_{1}$ only produces $O(N)$ query results, then it is always cheaper (or at least not more expensive) to compute $\mathcal{Q}^{\oplus}_{2}$ than $\mathcal{Q}_{2}$ . The observation is that we can always materialize the query results of $\mathcal{Q}_{2}$ , and then check for every result whether it is in the extra relation of input size $N$ , which does not increase the complexity of computing $\mathcal{Q}_{2}$ asymptotically.

Theorem 4.10.

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$, $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be computed in $O(\texttt{cost}(\mathcal{Q}_{1})+\texttt{cost}(\mathcal{Q}^{\oplus}_{2}))$ time, where $\mathcal{Q}^{\oplus}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\{\bm{\mathsf{y}}\}\cup\mathcal{E}_{2})$.

Example 4.11.

Consider a DCQ with $\mathcal{Q}_{1}=R_{1}(x_{1},x_{3})$ and $\mathcal{Q}_{2}=\pi_{x_{1},x_{3}}(R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3}))$. The baseline takes $O(N+N\cdot\sqrt{\mathrm{OUT}_{2}})$ time to materialize $\mathcal{Q}_{2}$. The first heuristic of issuing $\mathcal{Q}^{\emptyset}_{2}$ for each tuple $t\in R_{1}$ takes $O(N^{3/2})$ time. We note that $\mathcal{Q}_{12}=\pi_{x_{1},x_{3}}\left(R_{1}(x_{1},x_{3})\Join R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3})\right)$ lists edges that participate in at least one triangle. The existing best algorithm takes $O(N^{\frac{2\omega}{\omega+1}})$ time to compute $\mathcal{Q}_{12}$, where $\omega$ is the exponent of fast matrix multiplication, and this dominates the overall complexity. Our approach improves the baseline if $\mathrm{OUT}_{2}>N^{\frac{2(\omega-1)}{\omega+1}}$, and strictly outperforms the naive heuristic.

Example 4.12.

Consider a DCQ with $\mathcal{Q}_{1}=\pi_{x_{1},x_{3}}R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})$ and $\mathcal{Q}_{2}=\pi_{x_{1},x_{3}}R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3})$. Let $R_{5}(x_{1},x_{3})=\mathcal{Q}_{1}$. Here, $\mathcal{Q}_{12}=\pi_{x_{1},x_{3}}R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3})\Join R_{5}(x_{1},x_{3})$ with $|R_{5}|=\mathrm{OUT}_{1}$. Similarly, the existing best algorithm takes $O(\mathrm{OUT}^{\frac{\omega}{\omega+1}}_{1}\cdot N^{\frac{\omega}{\omega+1}})$ time to compute $\mathcal{Q}_{12}$. It is worth mentioning that $\mathcal{Q}_{12}\neq\pi_{x_{1},x_{3}}R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})\Join R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3})$. Suppose $(a,c)\in\mathcal{Q}_{12}$ is witnessed by $(a,b_{1},c)\in R_{1}\Join R_{2}$ and $(a,b_{2},c)\in R_{3}\Join R_{4}$, but $(a,b_{1},c)\notin R_{3}\Join R_{4}$ and $(a,b_{2},c)\notin R_{1}\Join R_{2}$. Then, $(a,b_{1},c),(a,b_{2},c)\notin R_{1}\Join R_{2}\Join R_{3}\Join R_{4}$, hence the result $(a,c)$ would be missed by this rewriting.
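The counterexample at the end of Example 4.12 can be checked on a concrete instance. The following Python sketch uses a hypothetical four-tuple database with two distinct middle values $b_{1}\neq b_{2}$; the helper names are illustrative only.

```python
def join_path(r, s):
    """Natural join R(x1, x2) ⋈ S(x2, x3), returned as a set of (x1, x2, x3)."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


def project_13(triples):
    """Projection onto (x1, x3)."""
    return {(a, c) for (a, _, c) in triples}


# (a, c) is witnessed through b1 in Q1 and through b2 in Q2.
R1 = {("a", "b1")}
R2 = {("b1", "c")}
R3 = {("a", "b2")}
R4 = {("b2", "c")}

q1 = project_13(join_path(R1, R2))
q2 = project_13(join_path(R3, R4))

# Q12 joins Q2 with the materialized result R5 = Q1 over (x1, x3).
R5 = q1
q12 = {t for t in q2 if t in R5}

# The incorrect rewriting joins all four relations on the shared x2.
naive = project_13(join_path(R1, R2) & join_path(R3, R4))
```

Here `q12` contains $(a,c)$ while `naive` is empty, which is exactly the missed result described above.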

5. Extensions

Based on the basic DCQ over two CQs discussed so far, we next consider several interesting extensions of DCQ with rich interaction between difference and other relational algebra operators.

5.1. Difference of Multiple CQs

The first extension is adapting our result for computing DCQ involving two CQs to multiple CQs, say $\mathcal{Q}=\mathcal{Q}_{1}-\mathcal{Q}_{2}-\cdots-\mathcal{Q}_{k}$ . Suppose $\mathcal{Q}_{i}=(\mathcal{V},\mathcal{E}_{i})$ for $i\in\{1,2,\cdots,k\}$ . We next introduce a recursive algorithm for tackling the general case with $k>2$ .

The base case with $k=2$ is handled by our previous algorithm EasyDCQ in Section 3. We rewrite a general DCQ with $k$ CQs into a union of multiple DCQs, each consisting of $k-1$ CQs. We start from the first two CQs and apply a strategy similar to that of Section 3. Suppose $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ are full; otherwise, we just invoke the reduce procedure to remove all non-output attributes via semi-joins. More specifically, we define an auxiliary relation $S_{e}=\pi_{e}\mathcal{Q}_{1}$ for each $e\in\mathcal{E}_{2}$, and rewrite the input DCQ as $\displaystyle{\mathcal{Q}=\bigcup_{e\in\mathcal{E}_{2}}\left((\pi_{e}\mathcal{Q}_{1}-R^{\prime}_{e})\Join\mathcal{Q}_{1}-\mathcal{Q}_{3}-\cdots-\mathcal{Q}_{k}\right)}$. Unwrapping the recursion, we obtain a complete form for $\mathcal{Q}$:

[TABLE]

where $I_{j}=\{e_{2},e_{3},\cdots,e_{j}\}$ for any $j\in\{2,3,\cdots,k\}$ , and

[TABLE]

for any $j\geq 2$. An algorithm follows this rewriting directly. Now, we come to the complexity of this algorithm. We can first bound $S_{I_{j}}=O(N+k)$ for each $j\in\{2,3,\cdots,k\}$, since every tuple from $S_{I_{j}}-R_{e_{j}}$ must participate in at least one query result. Moreover, if $\mathcal{E}_{1}\cup I_{j}$ is free-connex, $S_{I_{j}}$ is free-connex from (2). As the subquery corresponding to $S_{I_{j}}$ has input size $O(N+\mathrm{OUT})$ and output size $O(\mathrm{OUT})$, it can be evaluated in $O(N+\mathrm{OUT})$ time.

Theorem 5.1.

Given a DCQ $\mathcal{Q}=\mathcal{Q}_{1}-\mathcal{Q}_{2}-\cdots-\mathcal{Q}_{k}$ , $\mathcal{Q}$ can be evaluated in $O(N+\mathrm{OUT})$ time if $\mathcal{Q}_{1}$ is free-connex, $\mathcal{Q}_{i}$ for $i\geq 2$ is linear-reducible, and for every $j\in\{2,3,\cdots,k\}$ , the subquery induced by $\mathcal{E}^{\prime}_{1}\cup\{e_{2},e_{3},\cdots,e_{j}\}$ is $\alpha$ -acyclic for any $(e_{2},e_{3},\cdots,e_{j})\in\mathcal{E}^{\prime}_{2}\times\mathcal{E}^{\prime}_{3}\times\cdots\times\mathcal{E}^{\prime}_{j}$ , where $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{i})$ is the reduced query of $\mathcal{Q}_{i}=(\bm{\mathsf{y}},\mathcal{V}_{i},\mathcal{E}_{i})$ .

5.2. Select, Project and Join

• If there is a selection operator $\sigma_{\phi}$ over $\mathcal{Q}=\mathcal{Q}_{1}-\mathcal{Q}_{2}$, we can push it down such that $\mathcal{Q}=\sigma_{\phi}\mathcal{Q}_{1}-\sigma_{\phi}\mathcal{Q}_{2}$. If $\phi$ is a predicate on a base relation $R_{e}$ of $\mathcal{Q}_{1}$ (resp. $\mathcal{Q}_{2}$), we can simply check whether $\phi(t)$ is true for each tuple $t\in R_{e}$, and discard the tuple if not. This only takes $O(N)$ time. The case where $\phi$ is not a predicate on any base relation is challenging, even for evaluating a single CQ.

• If there is a projection operator $\pi_{\theta}$ over $\mathcal{Q}=\mathcal{Q}_{1}-\mathcal{Q}_{2}$, we can push it down such that $\mathcal{Q}=\pi_{\theta}\mathcal{Q}_{1}-\pi_{\theta}\mathcal{Q}_{2}$, and handle a new DCQ $\mathcal{Q}^{\prime}_{1}-\mathcal{Q}^{\prime}_{2}$ with $\mathcal{Q}^{\prime}_{1}=\pi_{\theta}\mathcal{Q}_{1}$ and $\mathcal{Q}^{\prime}_{2}=\pi_{\theta}\mathcal{Q}_{2}$.

• If there is a join operator over multiple DCQs, we first rewrite the join into a DCQ over multiple CQs and invoke our previous algorithm in Section 5.1. More specifically, given $k$ DCQs $\mathcal{Q}^{1},\mathcal{Q}^{2},\cdots,\mathcal{Q}^{k}$ with $\mathcal{Q}^{i}=\mathcal{Q}^{i}_{1}-\mathcal{Q}^{i}_{2}$ for any $i\in[k]$, we can rewrite $\mathcal{Q}^{1}\Join\mathcal{Q}^{2}\Join\cdots\Join\mathcal{Q}^{k}$ as

[TABLE]

The characterization of input CQs for which a linear algorithm exists follows Theorem 5.1.

5.3. Aggregation

Our algorithm for DCQ can also be extended to support aggregations over annotated relations (Abo Khamis et al., 2016; Joglekar et al., 2016). Let $(S,\oplus,\otimes)$ be a commutative ring. For a CQ $\mathcal{Q}$ over an annotated instance $D$ , every tuple $t\in R_{e}$ has an annotation $w(t)\in S$ . For a full query $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ , the annotation for any join result $t\in\mathcal{Q}(D)$ is defined as $w(t):=\mathop{\otimes}\limits_{e\in\mathcal{E}}w(\pi_{e}t)$ . For a non-full query $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , the aggregation becomes GROUP BY $\bm{\mathsf{y}}$ , and the annotation for each result $t\in\mathcal{Q}$ (i.e., the aggregate of each group) is $w(t):=\mathop{\oplus}\limits_{t^{\prime}\in\Join_{e\in\mathcal{E}}R_{e}:\pi_{\bm{\mathsf{y}}}t^{\prime}=t}w(t^{\prime})$ . Below, we introduce two commonly-used formulations. Given $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ , $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ and instances $D_{1},D_{2}$ , let $w_{1},w_{2}$ be the annotations of tuples in $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively. For completeness, we set $w_{1}(t)=0$ if $t\notin\mathcal{Q}_{1}(D_{1})$ and $w_{2}(t)=0$ if $t\notin\mathcal{Q}_{2}(D_{2})$ .

Relational difference. For DCQs defined on relational difference, a tuple $t$ appears in the query results of $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ if and only if $t\in\mathcal{Q}_{1}(D_{1})$ and $t\notin\mathcal{Q}_{2}(D_{2})$. For $t\in\pi_{\bm{\mathsf{y}}^{\prime}}(\mathcal{Q}_{1}(D_{1})-\mathcal{Q}_{2}(D_{2}))$, the annotation of $t$ is defined as $\displaystyle{w(t)=\oplus_{t^{\prime}\in\mathcal{Q}_{1}(D_{1})-\mathcal{Q}_{2}(D_{2}):\pi_{\bm{\mathsf{y}}^{\prime}}t^{\prime}=t}w(t^{\prime})}$. The input size is defined as $N=|D_{1}|+|D_{2}|$, and the output size is $\mathrm{OUT}=|\pi_{\bm{\mathsf{y}}^{\prime}}\left(\mathcal{Q}_{1}(D_{1})-\mathcal{Q}_{2}(D_{2})\right)|$. Again, our target is to find a linear-time algorithm in terms of $N$ and $\mathrm{OUT}$. Our algorithms can be applied directly, followed by aggregation, but their complexity is bottlenecked by the output size of the difference query, i.e., $|\mathcal{Q}_{1}(D_{1})-\mathcal{Q}_{2}(D_{2})|$, which could be much larger than $\mathrm{OUT}$.

Numerical difference. For DCQs defined on numerical difference, a tuple $t$ appears in the query results of $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ if and only if $t\in\mathcal{Q}_{1}(D_{1})$ or $t\in\mathcal{Q}_{2}(D_{2})$, with annotation $w(t)=w_{1}(t)-w_{2}(t)$. Then, the aggregation operator defined over attributes $\bm{\mathsf{y}}^{\prime}$ on top of $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be rewritten as the numerical difference of two new annotated queries, i.e., $\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{1}-\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{2}$. The input size is defined as $N=|D_{1}|+|D_{2}|$, and the output size is $\mathrm{OUT}=|\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{1}(D_{1})\cup\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{2}(D_{2})|$. Again, our target is to find a linear-time algorithm in terms of $N$ and $\mathrm{OUT}$. Here, any algorithm with time complexity $O(N+|\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{1}(D_{1})|+|\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{2}(D_{2})|)$ is already optimal, since $|\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{1}(D_{1})\cup\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{2}(D_{2})|\geq\frac{1}{2}\left(|\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{1}(D_{1})|+|\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{2}(D_{2})|\right)$. Hence, if $\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{1}$ and $\pi_{\bm{\mathsf{y}}^{\prime}}\mathcal{Q}_{2}$ are free-connex, both our algorithm and the baseline are optimal.
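The rewriting of an aggregation over a numerical difference into the numerical difference of two aggregations can be illustrated as follows. This is a minimal Python sketch that assumes integer annotations with ordinary addition as $\oplus$; the toy annotated results are hypothetical and are not the instance of Figure 3.

```python
from collections import defaultdict


def aggregate(results, group_idx):
    """Sum annotations of `results` (dict: tuple -> annotation) per group.

    group_idx lists the positions of the aggregation attributes y'.
    """
    agg = defaultdict(int)
    for t, w in results.items():
        agg[tuple(t[i] for i in group_idx)] += w
    return dict(agg)


def numerical_difference(w1, w2):
    """w(t) = w1(t) - w2(t) for every t in either input (missing means 0)."""
    return {t: w1.get(t, 0) - w2.get(t, 0) for t in set(w1) | set(w2)}


# Hypothetical annotated results of Q1 and Q2 over y = (x1, x2).
q1 = {("a1", "b1"): 2, ("a1", "b2"): 1, ("a2", "b1"): 3}
q2 = {("a1", "b1"): 1, ("a3", "b1"): 2}

# Aggregate after the difference ...
lhs = aggregate(numerical_difference(q1, q2), (0,))
# ... versus difference of the two aggregated queries (y' = x1).
rhs = numerical_difference(aggregate(q1, (0,)), aggregate(q2, (0,)))
```

Both orders give the same annotated result, and the second one never materializes the difference over the larger attribute set $\bm{\mathsf{y}}$.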

Theorem 5.2.

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , and a subset of aggregation attributes $\bm{\mathsf{y}}^{\prime}\subseteq\bm{\mathsf{y}}$ , if $(\bm{\mathsf{y}}^{\prime},\mathcal{V}_{1},\mathcal{E}_{1})$ and $(\bm{\mathsf{y}}^{\prime},\mathcal{V}_{2},\mathcal{E}_{2})$ are free-connex, $\pi_{\bm{\mathsf{y}}^{\prime}}(\mathcal{Q}_{1}-\mathcal{Q}_{2})$ with numerical difference can be computed in $O(N+\mathrm{OUT})$ time.

Example 5.3.

Consider an example DCQ $\mathcal{Q}=\pi_{x_{1}}(R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3}))$ over an instance in Figure 3. This DCQ can capture Q16 in the TPC-H benchmark (tpc, PC H) as a special case. For relational difference, the query result of $\mathcal{Q}$ includes 2 tuples as $\{(a_{1},1),(a_{2},1)\}$. For numerical difference, the query result of $\mathcal{Q}$ includes 3 tuples as $\{(a_{1},1),(a_{2},2),(a_{3},-2)\}$.

5.4. Bag Semantics

We consider the bag semantics, under which the query result is a multiset. For simplicity, each distinct tuple $t$ is annotated with a positive integer $w(t)$ to indicate the number of copies. In a full CQ $\mathcal{Q}=(\mathcal{V},\mathcal{E})$, the annotation of $t\in\mathcal{Q}$ is defined as $\displaystyle{w(t)=\times_{e\in\mathcal{E}}w(\pi_{e}t)}$. For a projection of $R_{e}$ onto attributes $e^{\prime}$, the annotation of $t\in\pi_{e^{\prime}}R_{e}$ is defined as $w(t)=\sum_{t^{\prime}\in R_{e}:\pi_{e^{\prime}}t^{\prime}=t}w(t^{\prime})$. Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1}),\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ and two input instances $D_{1},D_{2}$, let $w_{1},w_{2}$ be the annotations of tuples in $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively. For completeness, we set $w_{1}(t)=0$ if $t\notin\mathcal{Q}_{1}(D_{1})$ and $w_{2}(t)=0$ if $t\notin\mathcal{Q}_{2}(D_{2})$. A tuple $t$ is a query result of $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ if and only if $t\in\mathcal{Q}_{1}(D_{1})$ and $w_{1}(t)>w_{2}(t)$. An example is given in Figure 3.

The input size is $N=|D_{1}|+|D_{2}|$, and the output size is $\mathrm{OUT}=\sum_{t\in\mathcal{Q}_{1}(D_{1})}\max\{0,w_{1}(t)-w_{2}(t)\}$. Again, our target is to find a linear-time algorithm in terms of $N$ and $\mathrm{OUT}$. Unfortunately, our rewriting rule in Section 3 cannot be adapted here. Figure 3 shows several incorrect behaviors: some tuples receive a much higher annotation (e.g., $(a_{1},b_{1},c_{1})$), and some tuples that should not appear do appear (e.g., $(a_{2},b_{2},c_{2})$), which motivates us to explore new rules here.

Example 5.4.

Consider a DCQ with $\mathcal{Q}_{1}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})$ and $\mathcal{Q}_{2}=R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3})$ under the bag semantics. Any result $(a,b,c)\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$ falls into one of the three cases: (1) $(a,b)\notin R_{3}$ or $(b,c)\notin R_{4}$; (2) $w_{1}(a,b)>w_{2}(a,b)$ and $w_{1}(b,c)>w_{2}(b,c)$; (3) either $w_{1}(b,c)\leq w_{2}(b,c)$ or $w_{1}(a,b)\leq w_{2}(a,b)$, but $w_{1}(a,b)\cdot w_{1}(b,c)>w_{2}(a,b)\cdot w_{2}(b,c)$. We partition $R_{1}$ into three subsets, $R_{1\emptyset}=\{t\in R_{1}:t\notin R_{3}\}$, $R_{1<}=\{t\in R_{1}:t\in R_{3},w_{1}(t)\leq w_{2}(t)\}$ and $R_{1>}=\{t\in R_{1}:t\in R_{3},w_{1}(t)>w_{2}(t)\}$. Similarly, we partition $R_{2}$ into $R_{2\emptyset}$, $R_{2<}$, $R_{2>}$ with respect to $R_{4}$. Results falling into (1) can be found by $(R_{1\emptyset}\Join R_{2})\cup(R_{1}\Join R_{2\emptyset})$. Results falling into (2) can be found by $R_{1>}\Join R_{2>}$. Results falling into (3) can be found by two new $\theta$-joins $\left(R_{1>}\Join_{\theta}R_{2<}\right)\cup\left(R_{1<}\Join_{\theta}R_{2>}\right)$, where a pair of tuples $(t_{1},t_{2})$ can be $\theta$-joined if and only if $w_{1}(t_{1})\cdot w_{1}(t_{2})>w_{2}(t_{1})\cdot w_{2}(t_{2})$.

All auxiliary relations as well as $(R_{1\emptyset}\Join R_{2})\cup(R_{1}\Join R_{2\emptyset})$ and $R_{1>}\Join R_{2>}$ can be computed efficiently. We consider $R_{1>}\Join_{\theta}R_{2<}$ (the case of $R_{1<}\Join_{\theta}R_{2>}$ is symmetric). Checking the $\theta$-condition for all combinations of tuples in $R_{1>}$ and $R_{2<}$ incurs quadratic complexity. A smarter way is to sort $R_{1>}$ and $R_{2<}$ by the join attribute $B$ first, and then by the ratio $\frac{w_{1}(\cdot)}{w_{2}(\cdot)}$ in decreasing order. Then, we start with the tuple $(a,b)\in R_{1>}$ with maximum ratio, and linearly scan tuples in $R_{2<}$ with the join value $b$ until we meet some tuple $(b,c)$ such that $\frac{w_{1}(b,c)}{w_{2}(b,c)}\leq\frac{w_{2}(a,b)}{w_{1}(a,b)}$. We then stop and proceed with the next tuple in $R_{1>}$. If no join result is produced by $(a,b)$, we skip the subsequent tuples with the same join value $b$ and continue. Overall, this algorithm takes $O(N\log N+\mathrm{OUT})$ time.
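The sort-and-scan procedure for $R_{1>}\Join_{\theta}R_{2<}$ can be sketched in Python as follows. The input encoding (tuples carrying both annotations), the function names, and the toy data are assumptions made for illustration; ratios are compared exactly via `fractions.Fraction` and the $\theta$-condition is tested by cross-multiplication.

```python
from collections import defaultdict
from fractions import Fraction


def theta_join(r1_gt, r2_lt):
    """r1_gt: list of (a, b, w1, w2) with w1 > w2 > 0   (tuples of R_{1>})
    r2_lt: list of (b, c, w1, w2) with 0 < w1 <= w2  (tuples of R_{2<})
    Returns all (a, b, c) with w1(a,b)*w1(b,c) > w2(a,b)*w2(b,c)."""
    left, right = defaultdict(list), defaultdict(list)
    for a, b, w1, w2 in r1_gt:
        left[b].append((Fraction(w1, w2), a, w1, w2))
    for b, c, w1, w2 in r2_lt:
        right[b].append((Fraction(w1, w2), c, w1, w2))
    out = []
    for b, ls in left.items():
        rs = right.get(b)
        if not rs:
            continue
        # Within each join value, order both sides by decreasing ratio.
        ls.sort(key=lambda x: x[0], reverse=True)
        rs.sort(key=lambda x: x[0], reverse=True)
        for _, a, lw1, lw2 in ls:
            produced = False
            for _, c, rw1, rw2 in rs:
                if lw1 * rw1 <= lw2 * rw2:
                    break  # all later right tuples have smaller ratios
                out.append((a, b, c))
                produced = True
            if not produced:
                break  # later left tuples have smaller ratios: skip this b
    return out


def theta_join_naive(r1_gt, r2_lt):
    """Quadratic reference implementation used only for comparison."""
    return [(a, b, c) for a, b, lw1, lw2 in r1_gt
            for b2, c, rw1, rw2 in r2_lt
            if b == b2 and lw1 * rw1 > lw2 * rw2]


r1_gt = [("a1", "b", 4, 1), ("a2", "b", 2, 1), ("a3", "b", 3, 2)]
r2_lt = [("b", "c1", 1, 2), ("b", "c2", 1, 3), ("b", "c3", 1, 1)]
fast = theta_join(r1_gt, r2_lt)
slow = theta_join_naive(r1_gt, r2_lt)
```

Every step of the inner scan either emits a result or ends the scan, so after sorting the work is proportional to the number of tuples plus the number of results, which matches the $O(N\log N+\mathrm{OUT})$ bound.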

Theorem 5.5.

Given two CQs $\mathcal{Q}_{1}=\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , if $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is free-connex, then $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ under the bag semantics can be computed in $O(N\log N+\mathrm{OUT})$ time.

Our observation above can be extended to the case when both $\mathcal{Q}_{1},\mathcal{Q}_{2}$ correspond to the same free-connex query. The proof of Theorem 5.5 is given in Appendix C. The case when $\mathcal{Q}_{1},\mathcal{Q}_{2}$ have different schemas is left as future work.

6. Experiments

6.1. Experimental Setup

Prototype implementation. Our newly developed algorithms can be easily integrated into any SQL engine by rewriting the original SQL query. They can be further optimized if we directly integrate the rewrite procedure into the SQL parser and have customized index support. Our ultimate goal is to implement our algorithms in a system prototype with three components: a SQL parser, a query optimizer, and new indices. At the current stage, we choose to manually rewrite all SQL queries and demonstrate the power of our optimizations via a comparison with vanilla SQL queries.

Query processing engines compared. To compare the performance of all optimization techniques proposed in the paper, we choose PostgreSQL (pos, eSQL), DuckDB (duc, ckDB), SQLite (sql, Lite), MySQL (mys, ySQL) running in centralized settings, and Spark SQL (spa, kSQL) running in parallel/distributed settings, as the query processing engines. All of them are widely used in academia and industry. In the experiments, we observed that SQLite and MySQL show very poor performance, with most of the test points timing out. Hence, we built full indices on these systems to expedite the execution. Moreover, DuckDB is a columnar-vectorized query execution engine, and indices are built when importing input data. During the experiments, we test the single-thread performance of our new optimization techniques over PostgreSQL, DuckDB, SQLite and MySQL, and parallel performance over Spark SQL. In order to separate the I/O cost from the total execution time, we load all data into memory in advance by using pg_prewarm in PostgreSQL and cache in Spark SQL. For DuckDB and SQLite, the data need to be loaded into memory before execution, so we only count the query execution time.

Experimental environment. We perform all experiments on two machines. For experiments conducted on PostgreSQL and MySQL, we use a machine equipped with two Intel Xeon 2.1GHz processors, each having 12 cores / 24 threads, and 416 GB of memory. For all experiments on Spark SQL, DuckDB and SQLite, we use a machine equipped with two Xeon 2.0GHz processors, each having 28 cores / 56 threads, and 1TB of memory. All machines run Linux, with Scala 2.13.9 and JVM 1.8.0. We use Spark 3.3.0 and PostgreSQL 16.0. We assign 8 cores to Spark and 1 core to each of the remaining platforms during the experiments. Each query is evaluated 10 times with each engine, and we report the average running time. Each query is allowed to run for at most 10 hours.

6.2. Datasets and Queries

The experiments consist of graph queries and benchmark queries.

Benchmark queries. For relational queries, we adopt two standard benchmarks (TPC-DS (tpc, C DS) and TPC-H (tpc, PC H)) in industry and select 3 queries with difference operator (TPC-H Q16, TPC-DS Q35, and TPC-DS Q69). These three benchmark queries connect DCQ with other relational operators like selection, projection, join, and aggregation. All benchmark queries can be captured by a common schema $\mathcal{Q}=R_{1}(x_{1},x_{2})\Join(\pi_{x_{2}}R(x_{1},x_{2})-\pi_{x_{2}}R_{2}(x_{2},x_{3})\Join R_{3}(x_{3},x_{4}))$ and the joins are all primary-key foreign-key joins.

Graph queries. For graph pattern queries, we use real-world graphs (such as BitCoin, DBLP, Epinions, Google, and Wiki) from SNAP (Stanford Network Analysis Project) (SNA, SNAP), summarized in Table 2. We store edge information as a relation $\textsf{Graph}(\textsf{src},\textsf{dst})$ and manually create a triple relation $\textsf{Triple}(\textsf{node1},\textsf{node2},\textsf{node3})$ from the graph. Tuples in Triple are generated by the following rules: (rule 1) a random length-2 path in the graph as $(\textsf{node1},\textsf{node2},\textsf{node3})$; or (rule 2) a random edge in the graph as $(\textsf{node1},\textsf{node2})$, together with a random vertex in the graph as node3; or (rule 3) a triple $(\textsf{node1},\textsf{node3},\textsf{node5})$ from a random length-4 path $(\textsf{node1},\textsf{node2},\textsf{node3},\textsf{node4},\textsf{node5})$ in the graph. Triple may contain different proportions of tuples generated by the three rules in different queries. For a graph with $n$ length-2 paths, we set the size of Triple to be $0.05n$ for Wiki (since it is too large to process as shown in Table 2), and $0.5n$ for the remaining graphs. We evaluate 6 graph queries as described in Figure 5, whose original SQL queries as well as optimized SQL queries after rewriting are given in the full version (ful, 3140).
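The three generation rules for Triple can be sketched as follows. This is only one possible reading: the paper does not specify how random paths are sampled, so the random-walk sampler below (which is not uniform over paths and may revisit nodes), the function name `make_triple`, and the toy cycle graph are all assumptions.

```python
import random


def make_triple(edges, rule, rng):
    """Generate one Triple tuple from a directed edge list using rule 1, 2 or 3.

    Assumes the graph contains at least one path of the required length;
    otherwise the random-walk sampler below would not terminate.
    """
    adj = {}
    for s, d in edges:
        adj.setdefault(s, []).append(d)
    nodes = sorted({v for e in edges for v in e})

    def random_path(num_edges):
        # Random walk from a random edge; retry whenever a dead end is hit.
        while True:
            path = list(rng.choice(edges))
            while len(path) < num_edges + 1 and path[-1] in adj:
                path.append(rng.choice(adj[path[-1]]))
            if len(path) == num_edges + 1:
                return path

    if rule == 1:  # a length-2 path (node1, node2, node3)
        return tuple(random_path(2))
    if rule == 2:  # an edge (node1, node2) plus an arbitrary vertex as node3
        s, d = rng.choice(edges)
        return (s, d, rng.choice(nodes))
    if rule == 3:  # endpoints and midpoint of a length-4 path
        p = random_path(4)
        return (p[0], p[2], p[4])
    raise ValueError("rule must be 1, 2 or 3")


# Toy directed 5-cycle 0 -> 1 -> 2 -> 3 -> 4 -> 0.
edges = [(i, (i + 1) % 5) for i in range(5)]
rng = random.Random(0)
t1 = make_triple(edges, 1, rng)
t2 = make_triple(edges, 2, rng)
t3 = make_triple(edges, 3, rng)
```

Mixing the three rules in different proportions changes how many tuples of Triple survive the difference, which is how the later experiments vary $\mathrm{OUT}$ while keeping the input size fixed.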

More specifically, $\mathcal{Q}_{G1}$ finds all edges in the graph that do not participate in any length-2 path. $\mathcal{Q}_{G2}$ finds all length-3 paths in which the third node ($\textsf{node}_{3}$) is not sampled together with the edge $(\textsf{node}_{1},\textsf{node}_{2})$. $\mathcal{Q}_{G3}$ finds length-2 paths that do not form a triangle. $\mathcal{Q}_{G4}$ finds all generated triples that cannot be extended to a length-4 path. $\mathcal{Q}_{G5}$ finds all length-4 paths that do not form a length-4 cycle. $\mathcal{Q}_{G6}$ finds all pairs of edges in the graph that do not form a length-4 cycle.

6.3. Experiment Results

Running time. Figure 5 shows the running time of different engines on graph queries. The input and output sizes of all graph queries are given in Table 2. All bars reaching the axis boundary indicate that the system did not finish within the 8-hour limit, or ran out of memory. As $\mathcal{Q}_{G6}$ contains an expensive Cartesian product as a sub-query, materializing its query result exceeds the memory capacity of our machines on most datasets. PostgreSQL can only evaluate the original SQL query of $\mathcal{Q}_{G6}$ on the Bitcoin dataset. By increasing the parallelism from 8 to 80, our optimized Spark SQL can evaluate $\mathcal{Q}_{G6}$ on the Epinions dataset within the time limit, while the vanilla Spark SQL cannot complete the evaluation. For $\mathcal{Q}_{G5}$, no system can finish the evaluation on the Wiki dataset due to the large intermediate results created. We also observe that neither SQLite nor MySQL can finish all test points for $\mathcal{Q}_{G2}$ and $\mathcal{Q}_{G6}$, or most test points over the Wiki dataset. A possible reason is that neither system is designed for analytical queries. Our optimization techniques already achieve a speedup ranging from 2x to 1760x on PostgreSQL, from 1.2x to 270x on Spark SQL, from 2x to 1848x on DuckDB, from 1.25x to 1095x on SQLite, and from 1.8x to 5.1x on MySQL for graph queries, even without considering the queries that could not finish within the time limit. We also observe an unusual test point for $\mathcal{Q}_{G3}$ in MySQL, where our optimized SQL query takes more time than the vanilla SQL query, which may be due to some unknown deficiencies in MySQL internals. (Footnote 2: We reviewed the execution plan in MySQL and found that the predicted run time of our optimized SQL query is much smaller than that of the vanilla SQL query, which is also consistent with our observations on other platforms. The actual running time does not match the expected cost because of some unknown deficiencies in MySQL.)

Figure 5 also shows the running time of all query engines on benchmark queries under different scale factors (i.e., parameters used to generate the benchmark dataset, which are roughly proportional to the input data size). DuckDB and MySQL fail to finish some test points with scale factor 100. However, the improvement in benchmark queries achieved by our optimization techniques is minor, as expected. More specifically, the vanilla benchmark query consists of two free-connex sub-queries, hence can be evaluated in $O(N+\mathrm{OUT}_{1}+\mathrm{OUT}_{2})$ time, and its optimized query can be evaluated in $O(N+\mathrm{OUT})$ time. Due to the special primary-key foreign-key joins and group-by aggregations, $\mathrm{OUT}_{1}\approx\mathrm{OUT}_{2}\approx\mathrm{OUT}\ll N$: the input contains a few hundred million records while the query result only involves thousands of records. The improvement of our optimization techniques in SQLite, DuckDB, and MySQL is also limited. On some test points, our optimized SQL queries are even more time-consuming than vanilla SQL queries. We find that the vanilla SQL queries can greatly benefit from the indices built for primary-key foreign-key joins and outperform our optimized SQL queries, which do not enjoy efficient indices for set difference or anti-join operators. How to build indices to accelerate relational operators in these systems could be interesting future work. Meanwhile, we notice that loading input data and building indices are much more time-consuming than evaluating the query; for example, it takes DuckDB 16 minutes to load a 50GB TPC-DS dataset, but only 8 seconds to execute the whole query.

Impact of $\mathrm{OUT}$, $\mathrm{OUT}_{1}$ and $\mathrm{OUT}_{2}$. As implied by the theoretical results, the sizes $\mathrm{OUT}_{1},\mathrm{OUT}_{2}$ of sub-queries $\mathcal{Q}_{1},\mathcal{Q}_{2}$ impact the performance of vanilla SQL queries, while only the actual output size $\mathrm{OUT}$ affects the performance of our optimized SQL queries. Below, we study the impact of $\mathrm{OUT}_{1}$, $\mathrm{OUT}_{2}$ and $\mathrm{OUT}$ on the performance of both approaches over $\mathcal{Q}_{G4}$.

In Figure 6, we investigate the impact of $\mathrm{OUT}_{1}$ for computing DCQ. We fix $\mathcal{Q}_{2}$ (as well as $\mathrm{OUT}_{2}$) and only vary the size of Triple (as well as $N$ and $\mathrm{OUT}_{1}$). Note that $\mathrm{OUT}$ also decreases as $\mathrm{OUT}_{1}$ decreases. The running time of our optimized SQL query grows slowly with $\mathrm{OUT}$, while the vanilla SQL query incurs a fixed overhead for evaluating $\mathcal{Q}_{2}$, even when $\mathrm{OUT}_{1}$ (as well as $\mathrm{OUT}$) decreases to as small as $1$.

In Figure 7, we investigate the impact of $\mathrm{OUT}_{2}$ for computing DCQ. We fix $\mathcal{Q}_{1}$ (as well as $N$ and $\mathrm{OUT}_{1}$ ) and vary a filter predicate applied to relation Graph in $\mathcal{Q}_{2}$ . When the predicate is more selective, $\mathrm{OUT}_{2}$ becomes smaller, and $\mathrm{OUT}$ becomes larger. The running time of vanilla SQL query decreases as $\mathrm{OUT}_{2}$ decreases, and the running time of our optimized SQL query does not change, which is only affected by $N$ and $\mathrm{OUT}$ .

In Figure 8, we investigate the impact of $\mathrm{OUT}$ for computing DCQ. We adjust Triple by changing the proportion of tuples generated by different rules, which will only change $\mathrm{OUT}$ , while $\mathrm{OUT}_{1}$ , $\mathrm{OUT}_{2}$ , and $N$ stay the same. The running time of our optimized SQL query increases slowly as $\mathrm{OUT}$ increases. In contrast, the running time of vanilla SQL query remains stably high even when $\mathrm{OUT}$ decreases to $1$ , since its running time is only impacted by $\mathrm{OUT}_{1}$ and $\mathrm{OUT}_{2}$ , both of which stay unchanged.

Memory Consumption. We also test the memory consumption on both graph and benchmark queries by different engines. Due to the simplicity of memory consumption measurement, we report the results for PostgreSQL and DuckDB here. For benchmark queries, the optimized and vanilla SQL queries have similar behaviors on memory consumption, since the input size dominates the overall consumption. Below, we focus on the memory consumption of graph queries. In Figure 9, our optimized SQL queries achieve overall improvements for all graph queries on Epinions dataset in terms of space consumption. For example, our optimized SQL query only requires 6.53GB on Spark SQL for evaluating $\mathcal{Q}_{G6}$ , while the vanilla SQL query fails to finish evaluating $\mathcal{Q}_{G6}$ even using 256G memory. The improvement of our optimized SQL query is more significant on DuckDB. For $\mathcal{Q}_{G4}$ , our optimized SQL query consumes $99.4\%$ less memory than the vanilla SQL query. For $\mathcal{Q}_{G5}$ and $\mathcal{Q}_{G6}$ , our optimized SQL queries consume roughly 2G memory. In contrast, the vanilla SQL queries fail to execute due to out-of-memory errors even after using 738G memory.

7. Connection with Signed Conjunctive Query

The class of signed conjunctive queries (SCQ) (Brault-Baron, 2012), also referred to as conjunctive queries with negation (Lanzinger, 2021) in the literature, is defined as

[TABLE]

where $\eta_{i}$ is either empty or a negation operator $\neg$. If $\eta_{i}=\neg$ for all $i\in[n]$, such an SCQ is also known as a negative conjunctive query (NCQ). If $\eta_{i}=\emptyset$ for all $i\in[n]$, such an SCQ is also known as a CQ. Recall that $\mathcal{V}=e_{1}\cup e_{2}\cup\cdots\cup e_{n}$ and $\mathcal{E}=\{e_{1},e_{2},\cdots,e_{n}\}$. The query result of $\mathcal{Q}$ over an instance $D$, denoted as $\mathcal{Q}(D)$, is defined as

[TABLE]

We establish the connection between SCQ and DCQ via Lemma 7.1 and Lemma 7.2.

From DCQ to SCQ. Intuitively, every DCQ can be rewritten as the union of a set of SCQs. Moreover, each resulting SCQ has exactly one negated relation, and each relation of $\mathcal{Q}_{2}$ participates in one distinct SCQ as the negated relation. For example, $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{2})\Join R_{4}(x_{2},x_{3})$ can be rewritten as $\left(R_{1}\Join R_{2}\Join\neg R_{3}\right)\cup\left(R_{1}\Join R_{2}\Join\neg R_{4}\right)$.

Lemma 7.1.

For a DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ , $\mathcal{Q}_{1}-\mathcal{Q}_{2}=\bigcup_{e\in\mathcal{E}_{2}}\left(\mathcal{Q}_{1}\Join\neg R_{e}\right)$ .

Proof.

Direction $\subseteq$. For every join result $t\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$, there must exist a relation $e\in\mathcal{E}_{2}$ such that $\pi_{e}t\notin R_{e}$; otherwise, $t\in\mathcal{Q}_{2}$, a contradiction. Wlog, let $e\in\mathcal{E}_{2}$ be such a relation for $t$. Together with $t\in\mathcal{Q}_{1}$, we have $t\in\mathcal{Q}_{1}\Join\neg R_{e}$. Direction $\supseteq$. Consider an arbitrary relation $e\in\mathcal{E}_{2}$, and an arbitrary join result $t\in\mathcal{Q}_{1}\Join\neg R_{e}$. Obviously, $t\notin\mathcal{Q}_{2}$ since $\pi_{e}t\notin R_{e}$. Together with $t\in\mathcal{Q}_{1}$, we have $t\in\mathcal{Q}_{1}-\mathcal{Q}_{2}$. ∎
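Lemma 7.1 can be sanity-checked on the path example given before it. The Python sketch below uses a small hypothetical instance and evaluates each SCQ $\mathcal{Q}_{1}\Join\neg R_{e}$ as a filter over the materialized $\mathcal{Q}_{1}$; it is an illustration of the identity, not an efficient evaluation strategy.

```python
def join_path(r, s):
    """Natural join R(x1, x2) ⋈ S(x2, x3), returned as a set of (x1, x2, x3)."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


# Hypothetical instance for Q1 = R1 ⋈ R2 and Q2 = R3 ⋈ R4.
R1 = {(1, 2), (4, 5)}
R2 = {(2, 3), (5, 6), (2, 7)}
R3 = {(1, 2)}
R4 = {(2, 3), (5, 6)}

q1 = join_path(R1, R2)
q2 = join_path(R3, R4)
direct = q1 - q2

# One SCQ per relation of Q2, each with exactly one negated relation.
neg_r3 = {t for t in q1 if (t[0], t[1]) not in R3}  # Q1 ⋈ ¬R3
neg_r4 = {t for t in q1 if (t[1], t[2]) not in R4}  # Q1 ⋈ ¬R4
rewritten = neg_r3 | neg_r4
```

On this instance $(4,5,6)$ is produced by the SCQ negating $R_{3}$ and $(1,2,7)$ by the SCQ negating $R_{4}$, and their union equals the directly computed difference.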

From SCQ to DCQ. On the other hand, an SCQ can be rewritten as the intersection of a set of DCQs. For a SCQ $\mathcal{Q}=(\mathcal{V},\mathcal{E})$, let $\mathcal{E}^{+},\mathcal{E}^{-}\subseteq\mathcal{E}$ denote the sets of relations with positive and negative sign, respectively. Let $\mathcal{V}^{+},\mathcal{V}^{-}$ be the sets of attributes that appear in positive and negative relations, respectively. Let $\mathcal{Q}^{+}=\left(\Join_{e^{\prime}\in\mathcal{E}^{+}}R_{e^{\prime}}\times_{x\in\mathcal{V}^{-}-\mathcal{V}^{+}}\mathrm{dom}(x)\right)$ denote the positive subquery defined by the positive relations as well as the whole domain of every attribute that does not appear in any positive relation.

Lemma 7.2.

For a SCQ $\mathcal{Q}$ , $\mathcal{Q}=\cap_{e\in\mathcal{E}^{-}}\left(\mathcal{Q}^{+}-\mathcal{Q}^{+}\Join R_{e}\right)$ .

Proof.

Direction $\subseteq$. Consider an arbitrary query result $t\in\mathcal{Q}$. By definition, $\pi_{e}t\in R_{e}$ holds for every $e\in\mathcal{E}^{+}$ and $\pi_{e}t\notin R_{e}$ holds for every $e\in\mathcal{E}^{-}$. This way, for each $e\in\mathcal{E}^{-}$, we have $t\in\left(\Join_{e^{\prime}\in\mathcal{E}^{+}}R_{e^{\prime}}\times_{x\in\mathcal{V}^{-}-\mathcal{V}^{+}}\mathrm{dom}(x)\right)\Join\neg R_{e}$. Direction $\supseteq$. Consider an arbitrary $t$ such that for every $e\in\mathcal{E}^{-}$, $t\in\left(\Join_{e^{\prime}\in\mathcal{E}^{+}}R_{e^{\prime}}\times_{x\in\mathcal{V}^{-}-\mathcal{V}^{+}}\mathrm{dom}(x)\right)\Join\neg R_{e}$. Then, $\pi_{e^{\prime}}t\in R_{e^{\prime}}$ for every $e^{\prime}\in\mathcal{E}^{+}$ but $\pi_{e}t\notin R_{e}$ for every $e\in\mathcal{E}^{-}$. Thus, $t\in\mathcal{Q}$. ∎

For example, a SCQ $\mathcal{Q}=R_{1}(x_{2},x_{3},x_{4})\Join R_{2}(x_{1},x_{3},x_{4})\Join\neg R_{3}(x_{1},x_{2},x_{4})\Join\neg R_{4}(x_{1},x_{2},x_{3})$ can be rewritten as: $(R_{1}\Join R_{2}-R_{1}\Join R_{2}\Join R_{3})\cap(R_{1}\Join R_{2}-R_{1}\Join R_{2}\Join R_{4})$ .
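
The rewriting of Lemma 7.2 can be sanity-checked by brute force on a toy instance. The sketch below is only an illustration: the relation contents, the two-value domains, and the helper `project` are assumptions made for the example and are not part of the paper. It uses one positive relation $R_{1}(x_{1},x_{2})$ and two negative relations over $(x_{2},x_{3})$ and $(x_{1},x_{3})$ , so that $x_{3}$ ranges over its whole domain in $\mathcal{Q}^{+}$ .

```python
from itertools import product

# Hypothetical toy instance (not from the paper).
attrs = ("x1", "x2", "x3")
dom = {"x1": range(2), "x2": range(2), "x3": range(2)}
positive = [(("x1", "x2"), {(0, 0), (0, 1), (1, 1)})]
negative = [(("x2", "x3"), {(0, 0), (1, 1)}),
            (("x1", "x3"), {(0, 1)})]

def project(row, schema):
    """Project a full assignment onto the attributes of one relation."""
    return tuple(row[attrs.index(a)] for a in schema)

all_rows = set(product(*(dom[a] for a in attrs)))

# Q+: join of the positive relations, times dom(x3) because x3 occurs
# only in negative relations.
q_plus = {r for r in all_rows
          if all(project(r, s) in t for s, t in positive)}

# Direct semantics of the signed query.
q = {r for r in q_plus
     if all(project(r, s) not in t for s, t in negative)}

# Right-hand side of Lemma 7.2: intersection of (Q+ - Q+ join R_e).
parts = [q_plus - {r for r in q_plus if project(r, s) in t}
         for s, t in negative]
rhs = set.intersection(*parts)

assert q == rhs
```

On this instance both sides contain exactly the assignments $(0,1,0)$ and $(1,1,0)$ .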

Decidability of SCQ.

Given a SCQ $\mathcal{Q}$ , the domain of attributes, and input database $D$ , the decidability problem asks to decide whether there exists a query result in $\mathcal{Q}$ . For example, a NCQ $\mathcal{Q}=\neg R_{1}(x_{1},x_{2})\Join\neg R_{2}(x_{2},x_{3})$ decides if there exists any tuple $(a,b,c)\in\mathrm{dom}(x_{1})\times\mathrm{dom}(x_{2})\times\mathrm{dom}(x_{3})$ such that $(a,b)\notin R_{1}$ and $(b,c)\notin R_{2}$ , and a CQ $\mathcal{Q}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})$ decides if there exists any tuple $(a,b,c)$ such that $(a,b)\in R_{1}$ and $(b,c)\in R_{2}$ . The decidability problem for CQ , NCQ and SCQ has been well studied separately:

Theorem 7.3 ((Bagan et al., 2007)).

A CQ $\mathcal{Q}$ can be decided in linear time if and only if it is $\alpha$ -acyclic.

Theorem 7.4 ((Brault-Baron, 2012)).

A NCQ $\mathcal{Q}$ can be decided in linear time if and only if it is $\beta$ -acyclic.

Theorem 7.5 ((Brault-Baron, 2013)).

A SCQ $\mathcal{Q}$ can be decided in linear time if and only if $(\bm{\mathsf{y}},\mathcal{E}^{+}\cup S)$ is $\alpha$ -acyclic for every $S\subseteq\mathcal{E}^{-}$ .

Note that $\beta$ -acyclicity is a more restrictive notion than $\alpha$ -acyclicity: $\mathcal{Q}$ is $\beta$ -acyclic if all sub-hypergraphs of $\mathcal{Q}$ are $\alpha$ -acyclic. In particular, $\beta$ -acyclicity implies $\alpha$ -acyclicity, but not vice versa. In (Lanzinger, 2021), this notion of $\beta$ -acyclicity has been extended to nest-set width for capturing the tractability of SCQ in terms of both query and data complexity. We do not pursue this direction further.

Decidability of DCQ.

Implied by Lemma 7.1 and Theorem 7.5, we come to the following lemma:

Lemma 7.6.

Given two full joins $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{E}_{2})$ , the DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be decided in linear time, if $(\bm{\mathsf{y}},\mathcal{E}_{1})$ is $\alpha$ -acyclic, and $(\bm{\mathsf{y}},\mathcal{E}_{1}\cup\{e\})$ is $\alpha$ -acyclic for every $e\in\mathcal{E}_{2}$ .

Lemma 7.6 can be proved by a linear-time algorithm. By Lemma 7.1, it suffices to decide $\mathcal{Q}_{1}\Join\neg R_{e}$ for each $e\in\mathcal{E}_{2}$ . After linear-time preprocessing, we can enumerate every tuple in $S_{e}=\pi_{e}\Join_{e^{\prime}\in\mathcal{E}_{1}}R_{e^{\prime}}$ with $O(1)$ delay, as $(\bm{\mathsf{y}},\mathcal{E}_{1}\cup\{e\})$ is $\alpha$ -acyclic. For each tuple $t\in S_{e}$ enumerated, we check whether it belongs to $R_{e}$ . If $t\notin R_{e}$ , a query result of $\mathcal{Q}_{1}\Join\neg R_{e}$ is found; otherwise, we skip it and continue to the next one. Since the enumerated tuples are distinct, at most $|R_{e}|+1$ tuples are checked, so this algorithm runs in $O(N)$ time.
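
The decision procedure for a single relation $e$ can be sketched as follows. This is a minimal illustration, not the paper's implementation: a brute-force generator stands in for the constant-delay enumeration algorithm (so the sketch itself is not linear-time), and all names (`enumerate_projection`, `decide_anti_join`, the toy relations) are hypothetical. What it does show is the early-exit loop: because the enumerated tuples are distinct, the loop stops after inspecting no more than $|R_{e}|+1$ of them.

```python
from itertools import product

def enumerate_projection(relations, attrs, domain, schema):
    """Yield the distinct tuples of pi_schema(join of `relations`).
    Brute force stands in for a constant-delay enumeration algorithm."""
    seen = set()
    idx = [attrs.index(a) for a in schema]
    for values in product(domain, repeat=len(attrs)):
        row = dict(zip(attrs, values))
        if all(tuple(row[a] for a in s) in t for s, t in relations):
            key = tuple(values[i] for i in idx)
            if key not in seen:
                seen.add(key)
                yield key

def decide_anti_join(relations, attrs, domain, schema, r_e):
    """Return (witness, checked): a tuple of pi_e(Q1) outside R_e,
    or None if every enumerated tuple belongs to R_e."""
    checked = 0
    for t in enumerate_projection(relations, attrs, domain, schema):
        checked += 1
        if t not in r_e:
            return t, checked  # early exit: Q1 join (not R_e) is non-empty
    return None, checked

# Hypothetical instance: Q1 = R1(x1, x2) join R2(x2, x3), e = (x2, x3).
attrs = ("x1", "x2", "x3")
E1 = [(("x1", "x2"), {(0, 1), (1, 2), (2, 0)}),
      (("x2", "x3"), {(1, 2), (2, 0), (0, 0)})]
r_e = {(1, 2), (2, 0)}

witness, checked = decide_anti_join(E1, attrs, range(3), ("x2", "x3"), r_e)
assert witness == (0, 0)
assert checked <= len(r_e) + 1
```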

Theorem 7.7.

Given two full joins $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{E}_{2})$ , the DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ can be decided in linear time, if and only if $(\bm{\mathsf{y}},\mathcal{E}_{1})$ is $\alpha$ -acyclic, as well as $(\bm{\mathsf{y}},\mathcal{E}_{1}\cup\{e\})$ is $\alpha$ -acyclic for every $e\in\mathcal{E}_{2}$ .

Proof.

The if direction follows from Lemma 7.6. For the only-if direction, we distinguish two cases: (1) $(\bm{\mathsf{y}},\mathcal{E}_{1})$ is cyclic; and (2) $(\bm{\mathsf{y}},\mathcal{E}_{1})$ is $\alpha$ -acyclic, and there exists some $e\in\mathcal{E}_{2}$ such that $(\bm{\mathsf{y}},\mathcal{E}_{1}\cup\{e\})$ is cyclic. Case (1) follows from Theorem 7.3 by simply setting $\mathcal{Q}_{2}=\emptyset$ . Case (2) follows from Lemma 4.7. ∎

8. Related Work

Union of CQs. (Carmeli and Kröll, 2019) studied the enumeration complexity of unions of conjunctive queries (UCQs), where the goal is to find a data structure such that, after linear preprocessing time, the query answers (without duplication) can be enumerated with a small delay. Their results imply a linear algorithm in terms of input and output size for the class of union-free-connex UCQs, but whether a linear algorithm can be achieved (and, if possible, how to achieve it) is unknown for the remaining class of UCQs. (Christoph et al., 2018) also investigated the enumeration complexity of UCQs, but in the dynamic scenario.

Selection over CQs. Recently, multiple works have studied the complexity of selections over conjunctive queries. (Wang and Yi, 2022) investigated the selection in the form of comparisons between two attributes or values. The work identifies an acyclic condition under which a near-linear-time algorithm can be achieved for conjunctive queries with comparisons. (Abo Khamis et al., 2022) worked on the selections over intervals, also known as intersection queries, which are special cases for comparison queries since each intersection query can be decomposed into a union of multiple comparison queries. They show a dichotomy result that an intersection join can be computed in linear time if and only if it is $\iota$ -acyclic. (Hu et al., 2022) studied the complexity of temporal queries, where the intersection condition only exists for one global attribute. Their result suggested that a temporal query can be solved in linear time if and only if it is r-hierarchical. (Tao and Yi, 2022) also investigated the complexity of intersection queries in dynamic settings.

Appendix A SQL Queries

Graph Query $\mathcal{Q}_{G1}$

Original:

SELECT g1.src as src, g1.dst as dst

FROM graph g1

WHERE (g1.src, g1.dst) NOT IN (

SELECT DISTINCT g1.src, g1.dst

FROM graph g1, graph g2, graph g3

WHERE g1.dst = g2.src and g2.dst = g3.src);

Optimized:

SELECT g1.src as src, g1.dst as dst

FROM graph g1

WHERE NOT EXISTS (

SELECT * FROM graph g2

WHERE EXISTS (

    SELECT * FROM graph g3

    WHERE g3.src = g2.dst

) and g1.dst = g2.src  );
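
The original and optimized forms of $\mathcal{Q}_{G1}$ can be compared on a small graph. The sketch below is an assumption-laden check rather than a benchmark: it uses Python's built-in `sqlite3` module (not the systems evaluated in the paper), a hypothetical four-edge table, and a helper `run_both` introduced only for this example. It relies on SQLite's support for row values in `NOT IN` .

```python
import sqlite3

ORIGINAL = """
SELECT g1.src as src, g1.dst as dst
FROM graph g1
WHERE (g1.src, g1.dst) NOT IN (
    SELECT DISTINCT g1.src, g1.dst
    FROM graph g1, graph g2, graph g3
    WHERE g1.dst = g2.src and g2.dst = g3.src)
"""

OPTIMIZED = """
SELECT g1.src as src, g1.dst as dst
FROM graph g1
WHERE NOT EXISTS (
    SELECT * FROM graph g2
    WHERE EXISTS (
        SELECT * FROM graph g3
        WHERE g3.src = g2.dst
    ) and g1.dst = g2.src)
"""

def run_both(edges):
    """Load `edges` into an in-memory table and run both query forms."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE graph (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO graph VALUES (?, ?)", edges)
    original = set(conn.execute(ORIGINAL).fetchall())
    optimized = set(conn.execute(OPTIMIZED).fetchall())
    conn.close()
    return original, optimized

# Hypothetical toy graph: only edge (1, 2) starts a path of length three.
original, optimized = run_both([(1, 2), (2, 3), (3, 4), (5, 6)])
assert original == optimized
```

Agreement on one toy instance does not prove equivalence; it only guards against transcription errors in the two listings.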

Graph Query $\mathcal{Q}_{G2}$

Original:

SELECT src as A, node1 as B, node2 as C, node3 as D

FROM graph g1, Triple1 T1

WHERE g1.dst = T1.node1

and NOT EXISTS (

SELECT * FROM Triple2 T2, graph g2

WHERE T2.node3 = g2.src and T2.node1 = g1.src

and T2.node2 = T1.node1 and T2.node3 = T1.node2

and g2.dst = T1.node3);

Optimized:

SELECT src as A, node1 as B, node2 as C, node3 as D

FROM graph g1, Triple1 T1

WHERE g1.dst = T1.node1

and (NOT EXISTS ( SELECT * FROM Triple2 T2

WHERE T2.node1 = g1.src and T2.node2 = T1.node1

    and T2.node3 = T1.node2)

or NOT EXISTS ( SELECT * FROM graph g2

    WHERE g2.src = T1.node2 and g2.dst = T1.node3));

Graph Query $\mathcal{Q}_{G3}$

Original:

SELECT node1, node2, node3

FROM Triple T1

WHERE NOT EXISTS(

SELECT *

FROM graph g1, graph g2, graph g3

WHERE g1.dst = g2.src and g2.dst = g3.src and g3.dst = g1.src and g1.src = T1.node1 and g2.src = T1.node2 and g3.src = T1.node3);

Optimized:

SELECT node1, node2, node3

FROM Triple T1

WHERE NOT EXISTS (SELECT * FROM graph g1

WHERE T1.node1 = g1.src and T1.node2 = g1.dst)

or NOT EXISTS (SELECT * FROM graph g2

WHERE T1.node2 = g2.src and T1.node3 = g2.dst)

or NOT EXISTS (SELECT * FROM graph g3

WHERE T1.node3 = g3.src and T1.node1 = g3.dst);

Graph Query $\mathcal{Q}_{G4}$

Original:

SELECT node1, node2, node3

FROM Triple T1

WHERE NOT EXISTS (

SELECT *

FROM graph g1, graph g2, graph g3

WHERE g1.dst = g2.src and g2.dst = g3.src and g1.src = T1.node1 and g2.src = T1.node2 and g2.dst = T1.node3);

Optimized:

SELECT node1, node2, node3

FROM Triple

WHERE NOT EXISTS

(SELECT * FROM graph WHERE node1 = src and node2 = dst)

or NOT EXISTS

(SELECT * FROM graph WHERE node2 = src and node3 = dst)

or NOT EXISTS

(SELECT * FROM graph WHERE node3 = src);

Graph Query $\mathcal{Q}_{G5}$

Original:

SELECT g1.src as A, g2.src as B, g3.src as C, g3.dst as D

FROM graph g1, graph g2, graph g3

WHERE g1.dst = g2.src and g2.dst = g3.src and

NOT EXISTS (SELECT *

FROM graph g4, graph g5, graph g6

WHERE g4.dst = g5.src and g5.dst = g6.src and g2.src = g4.src and g5.src = g3.src and g6.src = g3.dst and g6.dst = g1.src);

Optimized:

SELECT g1.src as A, g2.src as B, g3.src as C, g3.dst as D

FROM graph g1, graph g2, graph g3

WHERE g1.dst = g2.src and g2.dst = g3.src and

NOT EXISTS (

SELECT * FROM graph g6 WHERE g6.dst = g1.src and g6.src = g3.dst);

Graph Query $\mathcal{Q}_{G6}$

Original:

SELECT g1.src as A, g1.dst as B, g2.src as C, g2.dst as D

FROM graph g1, graph g2

WHERE NOT EXISTS (SELECT *

FROM graph g3, graph g4, graph g5, graph g6

WHERE g3.dst = g4.src and g4.dst = g5.dst and g5.src = g3.src and g5.dst = g6.src and g3.src = g1.src and g3.dst = g1.dst and g6.src = g2.src and g6.dst = g2.dst);

Optimized:

SELECT g1.src as A, g1.dst as B, g2.src as C, g2.dst as D

FROM graph g1, graph g2

WHERE NOT EXISTS (SELECT * FROM graph g4

WHERE g4.src = g1.dst and g4.dst = g2.src)

or NOT EXISTS ( SELECT * FROM graph g5

WHERE g1.src = g5.src and g2.src = g5.dst);

TPC-H Query 16

Original:

SELECT p_brand, p_type, p_size,

count(distinct ps_suppkey) as supplier_cnt

FROM partsupp, part

WHERE p_partkey = ps_partkey

and p_brand <> 'Brand#45'

and p_type NOT LIKE 'MEDIUM POLISHED%'

and p_size IN (49, 14, 23, 45, 19, 3, 36, 9)

and ps_suppkey NOT IN (

SELECT s_suppkey

FROM supplier, nation

WHERE s_nationkey = n_nationkey and n_name = 'CHINA')

GROUP BY p_brand, p_type, p_size;

Optimized:

SELECT p_brand, p_type, p_size,

count(distinct ps_suppkey) as supplier_cnt

FROM partsupp, part

WHERE p_partkey = ps_partkey

and p_brand <> 'Brand#45'

and p_type NOT LIKE 'MEDIUM POLISHED%'

and p_size IN (49, 14, 23, 45, 19, 3, 36, 9)

and NOT EXISTS ( SELECT * FROM supplier

WHERE EXISTS (SELECT * FROM nation

    WHERE s_nationkey = n_nationkey

     and n_name = 'CHINA')

and s_suppkey = ps_suppkey)

GROUP BY p_brand, p_type, p_size;

TPC-DS Query 35

Original:

SELECT

ca_state, cd_gender, cd_marital_status, cd_dep_count,

count(*) cnt1, stddev_samp(cd_dep_count),

sum(cd_dep_count), min(cd_dep_count),

cd_dep_employed_count,count(*) cnt2,

stddev_samp(cd_dep_employed_count),

sum(cd_dep_employed_count), min(cd_dep_employed_count),

cd_dep_college_count, count(*) cnt3,

stddev_samp(cd_dep_college_count),

sum(cd_dep_college_count), min(cd_dep_college_count)

FROM

customer c,customer_address ca,customer_demographics

WHERE

c.c_current_addr_sk = ca.ca_address_sk and

cd_demo_sk = c.c_current_cdemo_sk and

not exists (select *

      from store_sales,date_dim

      where ss_sold_date_sk = d_date_sk and

            c.c_customer_sk = ss_customer_sk and

            d_year = 2001 and

            d_qoy < 4) and

not exists (select *

        from web_sales,date_dim

        where ws_sold_date_sk = d_date_sk and

              d_year = 2001 and

              d_qoy < 4 and

              ws_bill_customer_sk = c.c_customer_sk) and

not exists (select *

        from catalog_sales,date_dim

        where cs_sold_date_sk = d_date_sk and

              d_year = 2001 and

              d_qoy < 4 and

              cs_ship_customer_sk = c.c_customer_sk)

group by ca_state,

     cd_gender,

     cd_marital_status,

     cd_dep_count,

     cd_dep_employed_count,

     cd_dep_college_count;

Optimized:

SELECT

ca_state, cd_gender, cd_marital_status, cd_dep_count,

count(*) cnt1, stddev_samp(cd_dep_count),

sum(cd_dep_count), min(cd_dep_count),

cd_dep_employed_count,count(*) cnt2,

stddev_samp(cd_dep_employed_count),

sum(cd_dep_employed_count), min(cd_dep_employed_count),

cd_dep_college_count, count(*) cnt3,

stddev_samp(cd_dep_college_count),

sum(cd_dep_college_count), min(cd_dep_college_count)

FROM

customer_address ca,customer_demographics,

(select * from customer cu

where not exists (select * from store_sales

  where exists (select * from date_dim

    where d_year = 2001 and d_qoy < 4 and ss_sold_date_sk = d_date_sk)

  and cu.c_customer_sk = ss_customer_sk)

and not exists (select * from web_sales

  where exists (select * from date_dim

    where d_year = 2001 and d_qoy < 4 and ws_sold_date_sk = d_date_sk)

  and cu.c_customer_sk = ws_bill_customer_sk)

and not exists (select * from catalog_sales

  where exists (select * from date_dim

    where d_year = 2001 and d_qoy < 4 and cs_sold_date_sk = d_date_sk)

and cu.c_customer_sk = cs_ship_customer_sk)) as c

WHERE

c.c_current_addr_sk = ca.ca_address_sk and

cd_demo_sk = c.c_current_cdemo_sk

group by ca_state,

     cd_gender,

     cd_marital_status,

     cd_dep_count,

     cd_dep_employed_count,

     cd_dep_college_count;

TPC-DS Query 69

Original:

SELECT

cd_gender, cd_marital_status, cd_education_status,

count(*) cnt1, cd_purchase_estimate,

count(*) cnt2, cd_credit_rating, count(*) cnt3

FROM

customer c,customer_address ca,customer_demographics

WHERE

c.c_current_addr_sk = ca.ca_address_sk and

ca_state in ('IN','ND','PA') and

cd_demo_sk = c.c_current_cdemo_sk and

exists (select *

      from store_sales,date_dim

      where c.c_customer_sk = ss_customer_sk and

            ss_sold_date_sk = d_date_sk and

            d_year = 1999 and

            d_moy between 2 and 2+2) and

(not exists (select *

        from web_sales,date_dim

        where c.c_customer_sk = ws_bill_customer_sk and

              ws_sold_date_sk = d_date_sk and

              d_year = 1999 and

              d_moy between 2 and 2+2) and

not exists (select *

        from catalog_sales,date_dim

        where c.c_customer_sk = cs_ship_customer_sk and

              cs_sold_date_sk = d_date_sk and

              d_year = 1999 and

              d_moy between 2 and 2+2))

group by cd_gender,

     cd_marital_status,

     cd_education_status,

     cd_purchase_estimate,

     cd_credit_rating;

Optimized:

SELECT

cd_gender, cd_marital_status, cd_education_status,

count(*) cnt1, cd_purchase_estimate,

count(*) cnt2, cd_credit_rating, count(*) cnt3

FROM

customer_address ca,customer_demographics,

(select * from customer cu

where exists (select * from store_sales

  where exists (select * from date_dim

    where ss_sold_date_sk = d_date_sk and d_year = 1999 and d_moy between 2 and 2+2)

  and not exists (select * from web_sales

    where exists (select * from date_dim

      where ws_sold_date_sk = d_date_sk and d_year = 1999 and d_moy between 2 and 2+2)

    and ws_bill_customer_sk = cu.c_customer_sk)

  and not exists (select * from catalog_sales

    where exists (select * from date_dim

      where cs_sold_date_sk = d_date_sk and d_year = 1999 and d_moy between 2 and 2+2)

    and cs_ship_customer_sk = c_customer_sk)

  and ss_customer_sk = cu.c_customer_sk)) as c

WHERE

c.c_current_addr_sk = ca.ca_address_sk and

ca_state in ('IN','ND','PA') and

cd_demo_sk = c.c_current_cdemo_sk

group by cd_gender,

     cd_marital_status,

     cd_education_status,

     cd_purchase_estimate,

     cd_credit_rating;

Appendix B Missing Proofs in Section 4.1

B.1. Preliminaries on CQs

Definition B.1 (GYO Reduction).

The GYO reduction for a CQ $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is an iterative procedure that repeatedly applies the following two rules: (1) if an attribute $x\in\mathcal{V}$ appears in only one relation $e\in\mathcal{E}$ , then $x$ is removed from $e$ ; (2) if there exists a pair of relations $e,e^{\prime}\in\mathcal{E}$ such that $e\subseteq e^{\prime}$ , then $e$ is removed.

Lemma B.2 ((Yannakakis, 1981)).

A query $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is $\alpha$ -acyclic if the GYO reduction results in an empty query.
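
The two rules of Definition B.1 translate directly into a small acyclicity test. The function below is a minimal sketch under the assumption that a query is given only by its hyperedges (sets of attribute names); the name `gyo_is_alpha_acyclic` is hypothetical, and the quadratic loops favor readability over efficiency.

```python
def gyo_is_alpha_acyclic(edges):
    """Return True if the GYO reduction empties the hypergraph."""
    edges = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        # Rule (1): drop attributes that occur in exactly one hyperedge.
        for e in edges:
            lonely = {x for x in e if sum(x in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule (2): drop one hyperedge that is empty or contained in another.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return not edges

# A path query reduces to the empty query; a triangle does not.
path = [{"x1", "x2"}, {"x2", "x3"}]
triangle = [{"x1", "x2"}, {"x2", "x3"}, {"x1", "x3"}]
assert gyo_is_alpha_acyclic(path)
assert not gyo_is_alpha_acyclic(triangle)
```

Adding a hyperedge that covers all three attributes of the triangle makes the reduction succeed, which mirrors the structure of $\mathcal{Q}_{G3}$ in Appendix A, where the `Triple` relation covers the three `graph` edges.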

Definition B.3 (Path).

In a CQ $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , a path between a pair of attributes $x_{1},x_{k}$ is a sequence of attributes $C=\langle x_{1},x_{2},\cdots,x_{k}\rangle\subseteq\mathcal{V}$ , such that

•

there exists $e\in\mathcal{E}$ with $x_{i},x_{i+1}\in e$ for any $1\leq i<k$ ;

•

for any $e\in\mathcal{E}$ , either $|e\cap C|\leq 1$ , or $e\cap C=\{x_{i},x_{i+1}\}$ for some $i\in[k-1]$ .

Definition B.4 (Cycle).

In a CQ $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , a cycle is a sequence of attributes $C=\{x_{1},\cdots,x_{k}\}\subseteq\mathcal{V}$ , such that

•

there exists $e\in\mathcal{E}$ with $x_{i},x_{i+1}\in e$ for any $1\leq i<k$ , and $x_{1},x_{k}\in e$ ;

•

for any $e\in\mathcal{E}$ , either $|e\cap C|\leq 1$ , or $e\cap C=\{x_{i},x_{i+1}\}$ for some $1\leq i<k$ , or $e\cap C=\{x_{1},x_{k}\}$ .

Definition B.5 (Clique).

In a CQ $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , a clique is a subset of attributes $C\subseteq\mathcal{V}$ , such that for any pair of attributes $x_{1},x_{2}\in C$ , there exists $e\in\mathcal{E}$ with $x_{1},x_{2}\in e$ .

Definition B.6 (Conformal).

A CQ $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ is conformal if for every clique $C\subseteq\mathcal{V}$ there exists $e\in\mathcal{E}$ with $C\subseteq e$ .

Definition B.7 (Non-conformal Clique).

Following the definition of conformal of CQ, we define a clique $C\subseteq\mathcal{V}$ as non-conformal in a CQ $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , if there does not exist $e\in\mathcal{E}$ such that $C\subseteq e$ .
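
Definitions B.5 to B.7 can be made concrete with a brute-force search for non-conformal cliques. The sketch below is illustrative only: it enumerates all attribute subsets, so it is exponential in the number of attributes and suitable only for small queries. It also assumes that every attribute appears in at least one relation, and the function names are hypothetical.

```python
from itertools import combinations

def is_clique(c, edges):
    """Every pair of attributes in `c` co-occurs in some relation."""
    return all(any({a, b} <= e for e in edges)
               for a, b in combinations(c, 2))

def non_conformal_cliques(vertices, edges):
    """List all cliques that are not contained in any single relation."""
    edges = [set(e) for e in edges]
    found = []
    for k in range(1, len(vertices) + 1):
        for c in combinations(sorted(vertices), k):
            if is_clique(c, edges) and not any(set(c) <= e for e in edges):
                found.append(set(c))
    return found

vertices = {"x1", "x2", "x3"}
triangle = [{"x1", "x2"}, {"x2", "x3"}, {"x1", "x3"}]

# The three attributes form a clique that no relation covers.
assert non_conformal_cliques(vertices, triangle) == [{"x1", "x2", "x3"}]
# Adding a covering relation makes the query conformal.
assert non_conformal_cliques(vertices, triangle + [vertices]) == []
```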

Lemma B.8 ((Brault-Baron, 2016)).

A CQ is $\alpha$ -acyclic if and only if it is conformal and cycle-free.

Lemma B.9 ((Bagan et al., 2007)).

In an acyclic but non-free-connex CQ $(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , there must exist a sequence of distinct attributes $C=\langle x_{1},x_{2},\cdots,x_{k}\rangle$ with $k\geq 3$ , such that

•

there exists a relation $e\in\mathcal{E}$ such that $\{x_{i},x_{i+1}\}\subseteq e$ for every $i\in\{1,2,\cdots,k-1\}$ ;

•

$x_{1},x_{k}\in\bm{\mathsf{y}}$ but $x_{2},\cdots,x_{k-1}\notin\bm{\mathsf{y}}$ ;

•

for each $e\in\mathcal{E}$ , either $|e\cap C|\leq 1$ or $e\cap C=\{x_{i},x_{i+1}\}$ for some $i\in\{1,2,\cdots,k-1\}$ .

Lemma B.10.

In a CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , if there exists a clique $C$ , then for any $C^{\prime}\subset C$ , $C^{\prime}$ is also a clique.

Proof.

Since $C$ is a clique, there exists a relation that contains every pair of attributes. As $C^{\prime}$ is a subset of $C$ , then for any two attributes in $C^{\prime}$ there also exists a relation that contains both of these two attributes, hence $C^{\prime}$ is also a clique. ∎

Lemma B.11.

In a cycle-free CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , if there exists a non-conformal clique $C$ , then $|C|>3$ .

Proof.

For a clique $C$ of size 1 or 2, it is clear that $C$ is conformal, as there is a relation containing the entire clique by definition. Suppose there exists a non-conformal clique $C$ with $|C|=3$ , say $C=\{x_{1},x_{2},x_{3}\}$ . As the clique is non-conformal, there does not exist a relation that covers all three attributes, but any pair of attributes appears together in one relation. Then $x_{1},x_{2},x_{3}$ form a triangle, contradicting the fact that $\mathcal{Q}$ is cycle-free. Hence, any non-conformal clique $C$ in a cycle-free CQ must have $|C|>3$ . ∎

We call a non-conformal clique $C$ minimal if there exists no proper subset $C^{\prime}\subsetneq C$ such that $C^{\prime}$ is a non-conformal clique.

Lemma B.12.

In a CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ with a minimal non-conformal clique $C$ , for every $\mathcal{V}\subsetneq C$ there exists some $e\in\mathcal{E}$ with $\mathcal{V}\subseteq e$ .

Proof.

As $C$ is a clique, $\mathcal{V}$ is also a clique for any $\mathcal{V}\subsetneq C$ . Meanwhile, as $C$ is a minimal non-conformal clique, $\mathcal{V}$ is a conformal clique, which implies that there is a relation $e\in\mathcal{E}$ with $\mathcal{V}\subseteq e$ . ∎

B.2. Helper Lemmas

Now, we are ready to show some helper lemmas, which will be used to prove Lemma 4.5 and Lemma 4.7.

Definition B.13 (Subquery).

For a CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ , a subquery of $\mathcal{Q}$ induced by a set of attributes $C\subseteq\mathcal{V}$ is denoted as $\mathcal{Q}[C]=(\bm{\mathsf{y}}\cap C,C,\mathcal{E}[C])$ , where $\mathcal{E}[C]=\{e\cap C:e\in\mathcal{E},e\cap C\neq\emptyset\}$ .
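
As a small illustration of Definition B.13, the induced subquery can be computed by intersecting every relation with $C$ and discarding empty intersections. The representation (attribute names as strings, relations as sets) and the function name `induced_subquery` are assumptions made for this sketch.

```python
def induced_subquery(y, edges, c):
    """Return (y ∩ C, C, E[C]) for the subquery induced by attribute set C."""
    c = frozenset(c)
    # E[C] keeps the non-empty intersections e ∩ C, without duplicates.
    sub_edges = {frozenset(e) & c for e in edges} - {frozenset()}
    return frozenset(y) & c, c, sub_edges

# Hypothetical query: outputs {x1, x3}, relations over a path x1-x2-x3-x4.
y = {"x1", "x3"}
edges = [{"x1", "x2"}, {"x2", "x3"}, {"x3", "x4"}]
sub_y, sub_v, sub_e = induced_subquery(y, edges, {"x1", "x2", "x3"})

assert sub_y == {"x1", "x3"}
assert sub_e == {frozenset({"x1", "x2"}),
                 frozenset({"x2", "x3"}),
                 frozenset({"x3"})}
```

Note that the relation over $(x_{3},x_{4})$ survives as the unary relation $\{x_{3}\}$ , while a relation disjoint from $C$ would be dropped.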

Lemma B.14.

Given two CQs $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , for any $C\subseteq\mathcal{V}_{1}\cap\mathcal{V}_{2}$ , if $\mathcal{Q}[C]=\mathcal{Q}_{1}[C]-\mathcal{Q}_{2}[C]$ requires $\Omega(N^{1-o(1)})$ time, then $\mathcal{Q}=\mathcal{Q}_{1}-\mathcal{Q}_{2}$ requires $\Omega(N^{1-o(1)})$ time.

Proof.

Given any database instance $D$ for $\mathcal{Q}[C]$ , we can construct a database instance $D^{\prime}$ for $\mathcal{Q}$ as follows. For any attribute $x\notin C$ , we set its value to be $*$ . For any $e\in\mathcal{E}$ with $e\cap C\neq\emptyset$ , there exists a corresponding relation $e^{\prime}=e\cap C$ in the residual query. For each tuple $t^{\prime}$ in $R_{e^{\prime}}$ , we insert a tuple $t$ into $R_{e}$ with $\pi_{e\cap C}t=\pi_{e\cap C}t^{\prime}$ . It is easy to see that there is a one-to-one correspondence between the results of $\mathcal{Q}[C]$ and those of $\mathcal{Q}$ . Hence, if $\mathcal{Q}$ can be solved in linear time, then $\mathcal{Q}[C]$ can be solved in linear time, a contradiction. ∎

Lemma B.15.

Any algorithm for evaluating the following DCQ:

[TABLE]

requires $\Omega(N^{1-o(1)})$ time, assuming the strong triangle conjecture.

Proof.

Given a graph $G=(V,E)$ , we construct $R_{3}=R_{4}=R_{5}=E$ , $R_{1}=V$ and $R_{2}=\{u\}\times V$ for some $u\in V$ . Let $N=|E|\geq|V|$ . We note that $\mathrm{OUT}=|V|-1$ if and only if there is a triangle in $G$ . Hence, if $\mathcal{Q}$ can be evaluated in $O(N)$ time, whether there is a triangle in $G$ can be determined in $O(N)$ time, contradicting the detecting triangle conjecture. ∎

Lemma B.16.

Any algorithm for deciding the following DCQ $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ requires $\Omega(N^{1-o(1)})$ time, assuming the strong triangle conjecture, where $\mathcal{Q}_{1}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})$ and $\mathcal{Q}_{2}=R_{3}(x_{1},x_{3})\Join R_{4}(x_{2})$ , or $\mathcal{Q}_{2}=R_{3}(x_{1},x_{3})\Join R_{4}(x_{2},x_{3})$ , or $\mathcal{Q}_{2}=R_{3}(x_{1},x_{3})\Join R_{5}(x_{1},x_{2})$ , or $\mathcal{Q}_{2}=R_{3}(x_{1},x_{3})\Join R_{4}(x_{2},x_{3})\Join R_{5}(x_{1},x_{2})$ .

Proof.

In the proof of Lemma 4.4, we have shown the hardness for $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{4}(x_{2})$ . The remaining three queries can be proved similarly. Given an arbitrary graph $G=(V,E)$ with $V$ as the set of vertices and $E$ as the set of edges, we perform an algorithm to detect whether there exists a triangle in $G$ . Let $m=|E|=N^{3/4}$ be the number of edges in $G$ . Let $\mathcal{N}(u)=\{v\in V:(u,v)\in E\}$ be the neighbor list of vertex $u\in V$ . The degree $\textrm{deg}(u)$ of a vertex $u\in V$ is defined as the size of the neighbor list of $u$ , i.e., $\textrm{deg}(u)=|\mathcal{N}(u)|$ . We partition vertices in $V$ into two subsets: $V^{H}=\{v\in V:\textrm{deg}(v)>m^{1/3}\}$ and $V^{L}=V-V^{H}$ . From $G$ , we construct the following relations: $R=E$ , $R_{0}=\{(u,v)\in E:u\in V^{L}\textrm{ or }v\in V^{L}\}$ , $R_{1}=\{(u,v)\in E:u\in V^{H}\}$ , $R_{2}=\{(u,v)\in E:v\in V^{H}\}$ and $R_{3}=V^{H}\times V^{H}-E$ . It can be easily checked that each relation contains $O(N)$ tuples.
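
The heavy/light partition and the derived relations can be sketched as below. This is an illustration under stated assumptions, not the paper's construction verbatim: the edge set is assumed to be stored symmetrically (both $(u,v)$ and $(v,u)$ ), so `m` counts ordered pairs, and the function name `heavy_light_relations` is hypothetical.

```python
def heavy_light_relations(vertices, edges):
    """Build V^H, V^L and the relations R_0, R_1, R_2, R_3.
    `edges` is a set of ordered pairs, assumed to be symmetric."""
    m = len(edges)
    threshold = m ** (1 / 3)
    deg = {v: 0 for v in vertices}
    for u, _ in edges:
        deg[u] += 1
    heavy = {v for v in vertices if deg[v] > threshold}
    light = set(vertices) - heavy
    r0 = {(u, v) for u, v in edges if u in light or v in light}
    r1 = {(u, v) for u, v in edges if u in heavy}
    r2 = {(u, v) for u, v in edges if v in heavy}
    # Pairs of heavy vertices that are NOT connected by an edge.
    r3 = {(u, v) for u in heavy for v in heavy} - set(edges)
    return heavy, light, r0, r1, r2, r3

# Hypothetical star graph: center 0 with eight leaves.
vertices = set(range(9))
edges = {(0, i) for i in range(1, 9)} | {(i, 0) for i in range(1, 9)}
heavy, light, r0, r1, r2, r3 = heavy_light_relations(vertices, edges)

assert heavy == {0}
assert len(r0) == 16 and len(r1) == 8 and len(r2) == 8
```

Since at most $O(m^{2/3})$ vertices can have degree above $m^{1/3}$ , the relation $R_{3}$ has $O(m^{4/3})=O(N)$ tuples, which is the size bound used in the proof.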

For $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{4}(x_{2},x_{3})$ , we set $R_{4}=E$ and consider following queries:

[TABLE]

It can be easily proved that a triangle exists in $G$ if and only if $\mathcal{Q}$ or $\mathcal{Q}^{\prime}$ is not empty. We point out that $\mathcal{Q}$ can be computed in $O(m^{4/3})$ time, since $|R(x_{2},x_{3})\Join R_{0}(x_{1},x_{3})|\leq m^{4/3}$ and $|R(x_{1},x_{2})\Join R_{0}(x_{1},x_{3})|\leq m^{4/3}$ , as implied by the definition of $V^{L}$ . Hence, if $\mathcal{Q}^{\prime}$ can be computed in $O(m^{4/3})$ time, then detecting whether there exists a triangle takes $O(m^{4/3})$ time, contradicting the detecting triangle conjecture.

For $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{5}(x_{1},x_{2})$ , we set $R_{5}=R$ and consider the following queries:

[TABLE]

For $\mathcal{Q}_{1}-\mathcal{Q}_{2}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})-R_{3}(x_{1},x_{3})\Join R_{4}(x_{2},x_{3})\Join R_{5}(x_{1},x_{2})$ , we set $R_{4}=R_{5}=E$ and consider following queries:

[TABLE]

Both cases follow the similar argument as above. Together, we have completed the proof. ∎

B.3. Proof of Lemma 4.5

We assume $\mathcal{Q}_{1}$ is reduced. As $\mathcal{Q}_{1}$ is free-connex, $\mathcal{Q}_{1}$ must be an acyclic full join. We consider repeatedly applying the following procedures to $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ : (1) if there is a non-output attribute $x\in\mathcal{V}_{2}-\bm{\mathsf{y}}$ appearing in only one relation $e\in\mathcal{E}_{2}$ , remove $x$ from $e$ as well as from $\mathcal{V}_{2}$ ; (2) if there is a pair of relations $e,e^{\prime}\in\mathcal{E}_{2}$ such that $e\subseteq e^{\prime}$ , remove $e$ from $\mathcal{E}_{2}$ . As $\mathcal{Q}_{2}$ is non-linear-reducible, the residual query must be non-full; otherwise $(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2}\cup\{\bm{\mathsf{y}}\})$ is free-connex, contradicting the fact that $\mathcal{Q}_{2}$ is non-linear-reducible. Hence, we can assume for $\mathcal{Q}_{2}$ that every non-output attribute appears in at least two relations, and there exists no relation whose attributes are fully contained in another relation. We distinguish two cases:

(Case 1): $\mathcal{Q}_{2}$ is acyclic. As $\mathcal{Q}_{2}$ is non-linear-reducible, $\mathcal{Q}_{2}$ must be non-free-connex. By Lemma B.9, there must exist a path $\langle x_{1},x_{2},\cdots,x_{k}\rangle$ with the desired properties. Moreover, for the acyclic full join $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{E}_{1})$ , we initialize two sets $S_{1}=\{x_{1}\}$ and $S_{2}=\{x_{k}\}$ , and repeat the following procedure: if there exists some $e\in\mathcal{E}_{1}$ such that $e\cap S_{1}\neq\emptyset$ and $e\cap S_{2}\neq\emptyset$ , we stop; otherwise, we find some $e\in\mathcal{E}_{1}$ such that $e\cap S_{1}\neq\emptyset$ , add all attributes in $e$ into $S_{1}$ , and remove $e$ . Then for each $e\in\mathcal{E}_{1}$ , we have either $e\subseteq S_{1}$ , or $e\subseteq S_{2}$ , or $e\cap S_{1}\neq\emptyset$ and $e\cap S_{2}\neq\emptyset$ .

Given an arbitrary instance of $R_{1},R_{2},R_{3}$ in Lemma 4.3, we construct an input instance for $\mathcal{Q}_{1},\mathcal{Q}_{2}$ separately as follows. For $\mathcal{Q}_{1}$ , we set $\mathrm{dom}(x)=\mathrm{dom}(x_{1})$ for every $x\in S_{1}$ , $\mathrm{dom}(x)=\mathrm{dom}(x_{2})$ for every $x\in S_{2}$ , and $\mathrm{dom}(x)=\{*\}$ for every $x\in\mathcal{V}_{1}-(S_{1}\cup S_{2})$ . Then, the result of $\mathcal{Q}_{1}$ degenerates to $R_{1}(x_{1},x_{k})$ . For $\mathcal{Q}_{2}$ , we simply set $\mathrm{dom}(x)=\{*\}$ , $x_{2}=x_{3}=\cdots=x_{k-1}$ , and $\mathrm{dom}(x_{1}),\mathrm{dom}(x_{k})$ to be the same as in $\mathcal{Q}_{1}$ . By the properties of the path found, every relation in $\mathcal{E}_{2}$ must either contain a single attribute from $\{x_{1},x_{2},\cdots,x_{k}\}$ , or degenerate to one edge of the path. Hence, the result of $\mathcal{Q}_{2}$ degenerates to $\pi_{x_{1},x_{k}}R_{2}(x_{1},x_{2})\Join R_{3}(x_{2},x_{k})$ , which is exactly captured by Lemma 4.5.

(Case 2): $\mathcal{Q}_{2}$ is cyclic. Then, there exists a cycle or a non-conformal clique in $\mathcal{Q}_{2}$ . We further distinguish the following cases.

(Case 2.1): there is a cycle $C$ such that $C\subseteq\mathcal{V}-\bm{\mathsf{y}}$ . We can reduce $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ to $1-\pi_{\emptyset}R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})\Join R_{3}(x_{1},x_{3})$ .

(Case 2.2): there is a cycle $C$ such that $C-\bm{\mathsf{y}}\neq\emptyset$ and $C\cap\bm{\mathsf{y}}\neq\emptyset$ . We can reduce $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ to $R_{1}(x_{1})-\pi_{x_{1}}(R_{2}(x_{1},x_{2})\Join R_{3}(x_{2},x_{3})\Join R_{4}(x_{1},x_{3}))$ if $|C\cap\bm{\mathsf{y}}|=1$ , and $R_{1}(x_{1},x_{2})-\pi_{x_{1},x_{2}}(R_{2}(x_{1},x_{2})\Join R_{3}(x_{2},x_{3})\Join R_{4}(x_{1},x_{3}))$ otherwise.

(Case 2.3): there exists no cycle but a non-conformal clique $C$ such that $C-\bm{\mathsf{y}}\neq\emptyset$ . In this case, we will show the hardness of DCQ $\mathcal{Q}[C]=\mathcal{Q}_{1}[C]-\mathcal{Q}_{2}[C]$ , based on the hardness of $\mathcal{Q}_{2}[C]$ . Consider an arbitrary instance $D$ for $\mathcal{Q}_{2}[C]$ . For simplicity, assume the domain of each attribute in $\bm{\mathsf{y}}$ is $[N]$ . We construct the following instance $D^{\prime}$ for $\mathcal{Q}_{1}[C]$ . There is a one-to-one mapping between any pair of attributes $x,x^{\prime}\in C\cap\bm{\mathsf{y}}$ . We also set $\mathrm{dom}(x)=\{*\}$ for every $x\in C-\bm{\mathsf{y}}$ . It can be easily checked that $\mathcal{Q}_{1}[C]$ contains exactly $N$ results, and therefore $\mathcal{Q}[C]$ contains at most $N$ results. Moreover, $\mathcal{Q}_{2}[C]$ is empty if and only if $|\mathcal{Q}[C]|=N$ . Suppose we have an algorithm that can compute $\mathcal{Q}[C]$ in linear time, then we can determine whether $\mathcal{Q}_{2}[C]$ is empty or not in linear time, contradicting the fact that $\mathcal{Q}_{2}[C]$ cannot be determined in $O(N)$ time. As $\mathcal{Q}[C]$ cannot be computed in linear time, combining with Lemma B.14, $\mathcal{Q}$ cannot be computed in linear time.

(Case 2.4): $C\subseteq\bm{\mathsf{y}}$ holds for every cycle $C$ , as well as every non-conformal clique $C$ . Recall that there exists no non-output attribute only appearing in one relation, and there exists no relation whose attributes are contained by another relation. Let $\mathcal{Q}_{2}^{\prime}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2}\cup\{\bm{\mathsf{y}}\})$ . Every non-conformal clique $C$ in $\mathcal{Q}_{2}$ becomes conformal in $\mathcal{Q}_{2}^{\prime}$ due to the existence of $\{\bm{\mathsf{y}}\}$ . Similarly, every cycle will disappear in $\mathcal{Q}_{2}^{\prime}$ due to the existence of $\{\bm{\mathsf{y}}\}$ . In this case, $\mathcal{Q}^{\prime}_{2}$ must be acyclic, implied by Lemma B.8. Meanwhile, as $\mathcal{Q}_{2}$ is non-linear-reducible, $\mathcal{Q}^{\prime}_{2}$ must be non-free-connex. As $\mathcal{Q}_{2}^{\prime}$ is acyclic and non-free-connex, there must exist a path in $\mathcal{Q}_{2}$ as characterized by Lemma B.9. Following the similar argument as (Case 1), we can reduce $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ to $R_{1}(x_{1},x_{3})-\pi_{x_{1},x_{3}}R_{2}(x_{1},x_{2})\Join R_{3}(x_{2},x_{3})$ .

B.4. Proof of Lemma 4.7

Given a free-connex CQ $\mathcal{Q}_{1}=(\bm{\mathsf{y}},\mathcal{V}_{1},\mathcal{E}_{1})$ and a linear-reducible CQ $\mathcal{Q}_{2}=(\bm{\mathsf{y}},\mathcal{V}_{2},\mathcal{E}_{2})$ , we denote $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ and $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{2})$ as the reduced queries of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively. Let $e^{\prime}\in\mathcal{E}_{2}$ be the relation such that $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e^{\prime}\})$ is cyclic. As $\mathcal{Q}_{1}$ is free-connex, $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ is acyclic. Our proof proceeds with the following steps:

• Step 1: In $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e^{\prime}\})$, there exists no $e\in\mathcal{E}^{\prime}_{1}$ such that $e^{\prime}\subseteq e$;

• Step 2: There exists a pair of attributes $x_{1},x_{n}\in e^{\prime}$, such that there exists no $e\in\mathcal{E}^{\prime}_{1}$ with $x_{1},x_{n}\in e$;

• Step 3: There is a cycle $C=\langle x_{1},x_{2},\cdots,x_{n}\rangle$ in $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e^{\prime}\})$ with $e^{\prime}\cap C=\{x_{1},x_{n}\}$, and $\langle x_{1},x_{2},\cdots,x_{n}\rangle$ is a path in $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$;

• Step 4: There is a reduction from Lemma B.16 to $\mathcal{Q}_{1}-\mathcal{Q}_{2}$.

For Step 1, if there exists some $e\in\mathcal{E}^{\prime}_{1}$ such that $e^{\prime}\subseteq e$, then $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e^{\prime}\})$ is acyclic if and only if $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ is acyclic, contradicting the fact that $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e^{\prime}\})$ is cyclic but $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ is acyclic.

For Step 2, we first show that $|e^{\prime}|\geq 2$. Suppose $|e^{\prime}|=1$, say $e^{\prime}=\{x\}$. There must exist $e\in\mathcal{E}^{\prime}_{1}$ such that $x\in e$, hence $e^{\prime}\subseteq e$, contradicting Step 1. Hence, $|e^{\prime}|\geq 2$. Moreover, if for every pair of attributes $x_{1},x_{n}\in e^{\prime}$, there exists some $e\in\mathcal{E}^{\prime}_{1}$ such that $x_{1},x_{n}\in e$, then the attributes of $e^{\prime}$ form a clique in $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$. As $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ is acyclic, there must exist some $e^{\prime\prime}\in\mathcal{E}^{\prime}_{1}$ such that $e^{\prime}\subseteq e^{\prime\prime}$, again contradicting Step 1. Hence, we can always find a pair of attributes $x_{1},x_{n}\in e^{\prime}$ as desired.

For Step 3, since $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ is acyclic but $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1}\cup\{e^{\prime}\})$ is cyclic, either a cycle or a non-conformal clique is formed by the addition of $e^{\prime}$. Consider the case where a new non-conformal clique $S\subseteq\bm{\mathsf{y}}$ is formed. By definition, there exists no relation $e\in\mathcal{E}^{\prime}_{1}\cup\{e^{\prime}\}$ such that $S\subseteq e$. We partition $S$ into two subsets $S_{1},S_{2}$ such that $S_{1}=S\cap e^{\prime}$ and $S_{2}=S-e^{\prime}$. It is clear that $S_{2}\neq\emptyset$; otherwise $S\subseteq e^{\prime}$, contradicting the fact that $S$ is non-conformal. Moreover, for any $x\in S_{1}$, $\{x\}\cup S_{2}$ is also a clique. As $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ is acyclic, $\{x\}\cup S_{2}$ must be a conformal clique, i.e., there exists some $e\in\mathcal{E}^{\prime}_{1}$ such that $\{x\}\cup S_{2}\subseteq e$. Meanwhile, $|S_{1}|\geq 2$; otherwise $S$ is a non-conformal clique in $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$, contradicting the fact that $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ is acyclic. We can also identify two different attributes $x_{1},x_{2}\in S_{1}$ such that there exists no relation $e\in\mathcal{E}^{\prime}_{1}$ with $\{x_{1},x_{2}\}\cup S_{2}\subseteq e$; otherwise, $S$ is a non-conformal clique in $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$, contradicting the fact that $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$ is acyclic. Let $e_{1}\in\mathcal{E}^{\prime}_{1}$ be a relation with $\{x_{1}\}\cup S_{2}\subseteq e_{1}$, and $e_{2}\in\mathcal{E}^{\prime}_{1}$ be a relation with $\{x_{2}\}\cup S_{2}\subseteq e_{2}$. From the above, we note that $x_{2}\notin e_{1}$ and $x_{1}\notin e_{2}$. Let $x\in S_{2}$ be an attribute such that there exists no relation $e\in\mathcal{E}^{\prime}_{1}$ with $x_{1},x_{2},x\in e$. It is always feasible to find such an attribute $x$, since there exists no relation $e\in\mathcal{E}^{\prime}_{1}$ such that $\{x_{1},x_{2}\}\cup S\subseteq e$.

Either way, a cycle $\langle x_{1},x_{2},x\rangle$ forms after the addition of $e^{\prime}$. Hence, a new cycle $C$ must be formed by the addition of $e^{\prime}$, say $C=\langle x_{1},x_{2},\cdots,x_{n}\rangle$. Let $e^{\prime}\cap C=\{x_{1},x_{n}\}$. The existence of $C$ also implies a path $\langle x_{1},x_{2},\cdots,x_{n}\rangle$ in $(\bm{\mathsf{y}},\mathcal{E}^{\prime}_{1})$.
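The effect of adding $e^{\prime}$ in Steps 1 and 3 can be checked on small hypergraphs with a GYO-style reduction. The sketch below assumes that "acyclic" means $\alpha$-acyclic, the standard notion for conjunctive queries; the function name `is_alpha_acyclic` and the example edges are illustrative only. It is not tuned for efficiency.

```python
def is_alpha_acyclic(edges):
    """GYO reduction: True iff the hypergraph given by `edges` is alpha-acyclic."""
    edges = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        # Rule 1: drop every vertex that occurs in exactly one hyperedge.
        for e in edges:
            for v in list(e):
                if sum(v in f for f in edges) == 1:
                    e.discard(v)
                    changed = True
        # Rule 2: drop one hyperedge that is empty or contained in another.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                edges.pop(i)
                changed = True
                break
    # The hypergraph is acyclic iff the reduction eliminates every edge.
    return not edges

# A path x1 - x2 - x3, standing in for the reduced query of Q1.
path = [{"x1", "x2"}, {"x2", "x3"}]
acyclic_before = is_alpha_acyclic(path)
# Adding an edge contained in an existing one keeps it acyclic (Step 1).
acyclic_contained = is_alpha_acyclic(path + [{"x2"}])
# Adding e' = {x1, x3} closes the cycle <x1, x2, x3> (Step 3).
acyclic_after = is_alpha_acyclic(path + [{"x1", "x3"}])
```

Here the path and the path plus a contained edge reduce to nothing, while the triangle admits no reduction step and is reported as cyclic.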

For Step 4, we show the following reduction. For simplicity, we set $x_{2}=x_{3}=\cdots=x_{n-1}$. For any attribute $x\notin\{x_{1},x_{2},\cdots,x_{n}\}$, we set $\mathrm{dom}(x)=\{*\}$. Ignoring all attributes whose domain is $\{*\}$, each relation in $\mathcal{E}_{1}$ falls into $R_{1}(x_{1},x_{2})$ or $R_{2}(x_{2},x_{n})$. As $x_{2}\notin e^{\prime}$, there exists at least one relation $e\in\mathcal{E}^{\prime}_{2}$ such that $x_{2}\in e$. We distinguish four cases for such $e$:

• $R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{n})-R_{e^{\prime}}(x_{1},x_{n})\Join R_{e}(x_{2})$;

• $R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{n})-R_{e^{\prime}}(x_{1},x_{n})\Join R_{e}(x_{2},x_{n})$;

• $R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{n})-R_{e^{\prime}}(x_{1},x_{n})\Join R_{e}(x_{1},x_{2})$;

• $R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{n})-R_{e^{\prime}}(x_{1},x_{n})\Join R_{e}(x_{2},x_{n})\Join R_{e^{\prime\prime}}(x_{1},x_{2})$;

which follows the proof of Lemma B.16.

So far, we have shown the hardness of computing $\mathcal{Q}_{1}[\bm{\mathsf{y}}]-\mathcal{Q}_{2}[\bm{\mathsf{y}}]$ , and the hardness of computing $\mathcal{Q}_{1}-\mathcal{Q}_{2}$ follows by Lemma B.14.

Appendix C Missing Materials in Section 5

C.1. Difference of Multiple CQs

C.2. Proof of Theorem 5.5

In the bag semantics, for any free-connex CQ $\mathcal{Q}=(\bm{\mathsf{y}},\mathcal{V},\mathcal{E})$ and an instance $D$ , it is still possible to reduce the query and instance in linear time, while preserving the correctness of the query results. We invoke Algorithm 1, but incorporate the semi-join and projection operators in the bag semantics. For a semi-join result $t$ of $R_{e}\ltimes R_{e^{\prime}}$ , we define:

[TABLE]

Then, we are left with two full joins that share the same query structure after applying the reduce procedure to both $\mathcal{Q}_{1}$ and $\mathcal{Q}_{2}$ .

Suppose we are given two instances $D_{1},D_{2}$ for the full join $\mathcal{Q}=(\mathcal{V},\mathcal{E})$. Let $R_{e},R^{\prime}_{e}$ be the relations corresponding to $e\in\mathcal{E}$ in $D_{1},D_{2}$. Again, assume that each tuple $t$ is associated with a positive count $w(t)>0$. Let $w_{1},w_{2}$ be the count functions of $\mathcal{Q}_{1},\mathcal{Q}_{2}$ respectively. Generalizing the algorithm in Example 5.4, the high-level idea is to find all join results $t\in\mathcal{Q}_{1}$ such that $\prod_{e\in\mathcal{E}}\frac{w_{1}(\pi_{e}t)}{w_{2}(\pi_{e}t)}>1$. For each $e\in\mathcal{E}$, we partition the tuples in $R_{e}$ into three cases: $R_{e\emptyset}=\{t\in R_{e}:t\notin R^{\prime}_{e}\}$, $R_{e<}=\{t\in R_{e}:t\in R^{\prime}_{e},w_{1}(t)\leq w_{2}(t)\}$ and $R_{e>}=\{t\in R_{e}:t\in R^{\prime}_{e},w_{1}(t)>w_{2}(t)\}$. We can rewrite the difference as:

Lemma C.1.

Given two full CQs $\mathcal{Q}_{1}=\mathcal{Q}_{2}=(\mathcal{V},\mathcal{E})$ ,

[TABLE]

where a pair of tuples $(t_{1},t_{2})$ can be $\theta$ -joined if and only if $w_{1}(t_{1})\cdot w_{1}(t_{2})>w_{2}(t_{1})\cdot w_{2}(t_{2})$ .
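The three-way split of each $R_{e}$ into $R_{e\emptyset}$, $R_{e<}$ and $R_{e>}$ takes a single pass over the relation when counts are stored in hash maps. The following is a minimal sketch; the function name `partition_by_count` and the dictionaries `w1`, `w2` (tuple to multiplicity in $D_{1}$ and $D_{2}$) are assumptions made for illustration.

```python
def partition_by_count(w1, w2):
    """Split the tuples of R_e (counts w1) by comparison with R'_e (counts w2)."""
    # Tuples of R_e that do not appear in R'_e at all.
    absent = {t for t in w1 if t not in w2}
    # Tuples in both relations whose count in R_e is not larger.
    not_larger = {t for t in w1 if t in w2 and w1[t] <= w2[t]}
    # Tuples in both relations whose count in R_e is strictly larger.
    larger = {t for t in w1 if t in w2 and w1[t] > w2[t]}
    return absent, not_larger, larger

# Hypothetical counts for one relation e in the two instances.
w1 = {("a", 1): 2, ("b", 1): 1, ("c", 2): 3}
w2 = {("a", 1): 1, ("b", 1): 4}
absent, not_larger, larger = partition_by_count(w1, w2)
```

The three sets are disjoint and cover $R_{e}$, so every join result of $\mathcal{Q}_{1}$ draws each of its component tuples from exactly one of them.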

The first part, $\bigcup_{\bar{E}\subseteq\mathcal{E}}\left(\Join_{e\in\bar{E}}R_{e\emptyset}\right)\Join\left(\Join_{e\in\mathcal{E}-\bar{E}}(R_{e<}+R_{e>})\right)$, can be computed in the same way as in the set semantics. We next focus on the second part. Each $\bar{E}\subseteq\mathcal{E}$ induces a $\theta$-join, which is computed by the procedure BagDCQ. For simplicity, let $S_{e}=R_{e<}$ if $e\in\bar{E}$ and $S_{e}=R_{e>}$ otherwise. We maintain an additional variable $\zeta_{t}$ for every tuple $t\in S_{e}$ and every $e\in\mathcal{E}$. Initially, $\zeta_{t}=\frac{w_{1}(t)}{w_{2}(t)}$ if $t\in R_{e}$ and $t\in R^{\prime}_{e}$, $\zeta_{t}=+\infty$ if $t\in R_{e}$ and $t\notin R^{\prime}_{e}$, and $\zeta_{t}=0$ if $t\notin R_{e}$. Algorithm 5 consists of two phases. In the first phase, it updates the value of $\zeta_{t}$ for every tuple $t$ over a join tree $\mathcal{T}$. More specifically, suppose $t\in S_{e}$ for some $e\in\mathcal{E}$. Let $\mathcal{T}_{e}$ be the subtree of $\mathcal{T}$ rooted at $e$. Then,

[TABLE]

i.e., the maximum product of $\frac{w_{1}(\cdot)}{w_{2}(\cdot)}$ over all join results in the subtree rooted at $e$ in which $t$ participates. As a result, a tuple $t$ in the root node participates in at least one query result if and only if $\zeta_{t}>1$. In the second phase, we invoke the Enumerate procedure for every $t\in S_{r}$ with $\zeta_{t}>1$, and enumerate all the query results in which $t$ participates.
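The first phase can be sketched as a bottom-up pass over the join tree. The code below assumes the recurrence suggested by the description above: $\zeta_{t}$ equals the ratio $\frac{w_{1}(t)}{w_{2}(t)}$ multiplied, for each child, by the largest $\zeta$-value among the child's tuples that join with $t$. All names (`compute_zeta`, `children`, `attrs`, `ratio`) are hypothetical, and the nested scan is quadratic, whereas a linear-time pass would group child tuples by join key.

```python
from fractions import Fraction

def compute_zeta(root, children, attrs, ratio):
    """Bottom-up pass: zeta[u][t] is the maximum product of ratios over join
    results of the subtree rooted at u in which tuple t participates."""
    zeta = {}

    def agree(t, u, s, c):
        # t (node u) and s (node c) must match on every shared attribute.
        return all(t[attrs[u].index(a)] == s[attrs[c].index(a)]
                   for a in attrs[u] if a in attrs[c])

    def visit(u):
        for c in children.get(u, []):
            visit(c)
        zeta[u] = {}
        for t, r in ratio[u].items():
            z = r
            for c in children.get(u, []):
                # Best zeta among joinable child tuples; 0 if none joins.
                z *= max((zc for s, zc in zeta[c].items() if agree(t, u, s, c)),
                         default=Fraction(0))
            zeta[u][t] = z

    visit(root)
    return zeta

# Hypothetical instance: root R(a, b) drawn from R_{e<}, child S(b, c) from R_{e>}.
attrs = {"R": ("a", "b"), "S": ("b", "c")}
children = {"R": ["S"]}
ratio = {
    "R": {(1, 1): Fraction(1, 2), (2, 2): Fraction(1, 2)},
    "S": {(1, 5): Fraction(3), (1, 6): Fraction(3, 2), (2, 7): Fraction(3, 2)},
}
zeta = compute_zeta("R", children, attrs, ratio)
```

In this instance the root tuple $(1,1)$ obtains $\zeta=\frac{1}{2}\cdot 3=\frac{3}{2}>1$ and therefore contributes to the output, while $(2,2)$ obtains $\frac{1}{2}\cdot\frac{3}{2}=\frac{3}{4}$ and is pruned. Exact rationals are used so that the comparison with $1$ is not affected by floating-point rounding.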

The procedure $\textsc{Enumerate}(\mathcal{T},t,\tau)$ takes three parameters and returns all join results over the join tree $\mathcal{T}$ in which $t$ (from the root relation of $\mathcal{T}$) participates, and whose product of ratios over the participating tuples is at least $\tau$. Let $r$ be the root node of $\mathcal{T}$. In the base case when $\mathcal{T}$ is a single node, we just return $t$. As we prove later, there must be $\zeta_{t}=\frac{w_{1}(t)}{w_{2}(t)}>\tau$ in this case. In general, we distinguish two more cases. If $r$ contains a single child, say $u$, it suffices to find all tuples $t^{\prime}\in S_{u}$ such that $\zeta_{t^{\prime}}\cdot\frac{w_{1}(t)}{w_{2}(t)}>\tau$, i.e., those that participate in at least one join result. For each such tuple $t^{\prime}$, we recursively enumerate the query results in $\mathcal{T}_{u}$ in which $t^{\prime}$ participates and whose product of $\frac{w_{1}(\cdot)}{w_{2}(\cdot)}$ is at least $\tau\cdot\frac{w_{2}(t)}{w_{1}(t)}$, which can be done by $\textsc{Enumerate}\left(\mathcal{T}_{u},t^{\prime},\tau\cdot\frac{w_{2}(t)}{w_{1}(t)}\right)$ (line 6). Otherwise, $r$ contains at least two child nodes. We again use recursion and shrink the join tree $\mathcal{T}$ by removing the subtree rooted at one child node. W.l.o.g., assume $\{u_{1},u_{2},\cdots,u_{k}\}$ is the set of child nodes of $r$. We first find the tuples in $R_{u_{k}}$ that participate in some query result with $t$. This can be done by first finding the maximum $\zeta$-value of tuples in another child node that can be joined with $t$, and then finding the minimum $\zeta$-value that tuples in $u_{k}$ should satisfy (line 10). Then, we enumerate all query results in the subtree $\mathcal{T}_{u_{k}}$ whose product of $\frac{w_{1}(\cdot)}{w_{2}(\cdot)}$ is at least $\tau^{\prime}$, which are exactly those that will participate in the final query results (line 11). For each such tuple enumerated (lines 12-14), we in turn find the query results in the remaining subtree $\mathcal{T}-\mathcal{T}_{u_{k}}$ whose product of $\frac{w_{1}(\cdot)}{w_{2}(\cdot)}$ is at least the updated threshold $\tau/\tau^{\prime}$, where $\tau^{\prime}$ is the product of $\frac{w_{1}(\cdot)}{w_{2}(\cdot)}$ for $t$. Finally, we output their combination as a Cartesian product.
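The single-child case of the recursion can be sketched for a join tree that is a chain, where level 0 is the root. This is a simplified illustration under stated assumptions rather than the paper's Algorithm 5: it handles only chains, uses a strict comparison with the threshold as in the condition $\zeta_{t^{\prime}}\cdot\frac{w_{1}(t)}{w_{2}(t)}>\tau$, and scans child tuples naively instead of using sorted or indexed access. The names `joinable`, `chain_zeta` and `enumerate_chain` are hypothetical.

```python
from fractions import Fraction

def joinable(t, t_attrs, s, s_attrs):
    """True if tuples t and s agree on every attribute they share."""
    return all(t[t_attrs.index(a)] == s[s_attrs.index(a)]
               for a in t_attrs if a in s_attrs)

def chain_zeta(attrs, ratio):
    """zeta[i][t]: maximum ratio product over results of levels i.. containing t."""
    n = len(attrs)
    zeta = [None] * n
    zeta[n - 1] = dict(ratio[n - 1])
    for i in range(n - 2, -1, -1):
        zeta[i] = {}
        for t, r in ratio[i].items():
            best = max((z for s, z in zeta[i + 1].items()
                        if joinable(t, attrs[i], s, attrs[i + 1])),
                       default=Fraction(0))
            zeta[i][t] = r * best
    return zeta

def enumerate_chain(i, t, tau, attrs, ratio, zeta):
    """Yield results (lists of tuples for levels i..end) that start with t and
    whose ratio product exceeds tau. Callers must ensure zeta[i][t] > tau."""
    if i == len(attrs) - 1:
        yield [t]  # base case: a single node
        return
    for s, z in zeta[i + 1].items():
        # Keep only child tuples that still lead to at least one result.
        if joinable(t, attrs[i], s, attrs[i + 1]) and z * ratio[i][t] > tau:
            # Divide the threshold by the ratio of t before descending.
            for rest in enumerate_chain(i + 1, s, tau / ratio[i][t],
                                        attrs, ratio, zeta):
                yield [t] + rest

# Hypothetical chain R(a, b) -> S(b, c) with ratios w1/w2 per tuple.
attrs = [("a", "b"), ("b", "c")]
ratio = [
    {(1, 1): Fraction(1, 2), (2, 2): Fraction(1, 2)},
    {(1, 5): Fraction(3), (1, 6): Fraction(3, 2), (2, 7): Fraction(3, 2)},
]
zeta = chain_zeta(attrs, ratio)
results = [res for t, z in zeta[0].items() if z > 1
           for res in enumerate_chain(0, t, Fraction(1), attrs, ratio, zeta)]
```

Only the combination of $(1,1)$ with $(1,5)$ has ratio product $\frac{3}{2}>1$, so it is the single result enumerated. Every recursive call is made on a tuple whose $\zeta$-value exceeds the current threshold, so no call returns empty-handed, which is the property that bounds the enumeration cost by the output size.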

