Conjunctive Queries with Theta Joins Under Updates

Muhammad Idris; Martín Ugarte; Stijn Vansummeren; Hannes Voigt; Wolfgang Lehner

arXiv:1905.09848·cs.DB·May 27, 2019


TL;DR

This paper introduces a dynamic evaluation method for multi-way theta-join queries that avoids both materialization and recomputation of subresults, improving time and memory consumption in high-update-rate applications such as Composite Event Recognition (CER).

Contribution

It generalizes the Dynamic Yannakakis algorithm to arbitrary theta-joins and develops new notions of acyclicity and free-connexity for these joins, enabling efficient dynamic query processing.

Findings

(1) Outperforms state-of-the-art CER systems by up to two orders of magnitude in time.

(2) Uses less memory compared to existing incremental view maintenance engines.

(3) Successfully handles a wide range of theta-join queries in dynamic environments.

Abstract

Modern application domains such as Composite Event Recognition (CER) and real-time Analytics require the ability to dynamically refresh query results under high update rates. Traditional approaches to this problem are based either on the materialization of subresults (to avoid their recomputation) or on the recomputation of subresults (to avoid the space overhead of materialization). Both techniques have recently been shown suboptimal: instead of materializing results and subresults, one can maintain a data structure that supports efficient maintenance under updates and can quickly enumerate the full query output, as well as the changes produced under single updates. Unfortunately, these data structures have been developed only for aggregate-join queries composed of equi-joins, limiting their applicability in domains such as CER where temporal joins are commonplace. In this paper, we…


Tables

Table 1: Queries for experimental evaluation.

Query	Expression
$Q_{1}$	$R(a,b,c) \Join S(d,e,f) \mid a < d$
$Q_{2}$	$R(a,b,c,k) \Join S(d,e,f,k) \mid a < d$
$Q_{3}$	$R(a,b,c) \Join S(d,e,f) \Join T(g,h,i) \mid a < d \land e < g$
$Q_{4}$	$R(a,b,c) \Join S(d,e,f) \Join T(g,h,i) \mid a < d \land d < g$
$Q_{5}$	$R(a,b,c,k) \Join S(d,e,f,k) \Join T(g,h,i) \mid a < d \land d < g$
$Q_{6}$	$R(a,b,c) \Join S(d,e,f,k) \Join T(g,h,i,k) \mid a < d \land d < g$
$Q_{7}$	$\pi_{a,b,d,e,f,g,h}(Q_{4})$
$Q_{8}$	$\pi_{a,d,e,f,g,h,k}(Q_{5})$
$Q_{9}$	$\pi_{d,e,f,g,h,k}(Q_{6})$
$Q_{10}$	$\pi_{b,c,e,f,h,i}(Q_{4})$
$Q_{11}$	$\pi_{b,c,e,f,h,i}(Q_{5})$
$Q_{12}$	$\pi_{b,c,e,f,h,i}(Q_{6})$

Table 2: Maximum output sizes per query, k=1000.

Query	$|\text{Stream}|$	$|\text{Output}|$
$Q_{1}$	12k	18,017k
$Q_{2}$	12k	3.8k
$Q_{3}$	2.7k	178,847k
$Q_{4}$	2.7k	90,425k
$Q_{5}$	21k	411,669k
$Q_{6}$	21k	297,873k
$Q_{7}$	2.7k	114,561k
$Q_{8}$	21k	411,669k
$Q_{9}$	21k	99,043k
$Q_{10}$	2.7k	114,561k
$Q_{11}$	21k	294,139k
$Q_{12}$	21k	297,873k

Equations

$\sigma_{\text{amnt}<100}(S_{1})\Join_{S_{1}.\text{ts}<S_{2}.\text{ts}\,\wedge\,S_{1}.\text{acc}=S_{2}.\text{acc}}\sigma_{\text{amnt}<100}(S_{2})$

$Q=\pi_{\overline{y}}\big(r_{1}(\overline{x_{1}})\Join\dots\Join r_{n}(\overline{x_{n}})\mid\bigwedge_{i=1}^{m}\theta_{i}(\overline{z_{i}})\big)$

$\pi_{y,z,w,u}\big(r(x,y)\Join s(y,z,w)\Join t(u,v)\mid x<z\wedge w<u\big)$

$\begin{array}[t]{|lll|l|}\multicolumn{4}{c}{R}\\ \hline x&y&z&\mathbb{Z}\\ \hline 1&2&2&2\\ 2&4&6&3\\ 1&2&3&3\\ \hline\end{array}\qquad\begin{array}[t]{|ll|l|}\multicolumn{3}{c}{S}\\ \hline u&v&\mathbb{Z}\\ \hline 4&5&5\\ 2&3&4\\ 1&4&2\\ \hline\end{array}\qquad\begin{array}[t]{|ll|r|}\multicolumn{3}{c}{T}\\ \hline u&v&\mathbb{Z}\\ \hline 4&5&-4\\ 2&1&6\\ 1&4&3\\ \hline\end{array}\qquad\begin{array}[t]{|ll|r|}\multicolumn{3}{c}{S\Join T}\\ \hline u&v&\mathbb{Z}\\ \hline 4&5&-20\\ 1&4&6\\ \hline\end{array}$

$\begin{array}[t]{|r|r|}\multicolumn{2}{c}{\pi_{y}(R)}\\ \hline y&\mathbb{Z}\\ \hline 2&5\\ 4&3\\ \hline\end{array}\qquad\begin{array}[t]{|ll|r|}\multicolumn{3}{c}{S+T}\\ \hline u&v&\mathbb{Z}\\ \hline 4&5&1\\ 2&3&4\\ 1&4&5\\ 2&1&6\\ \hline\end{array}\qquad\begin{array}[t]{|ll|r|}\multicolumn{3}{c}{S-T}\\ \hline u&v&\mathbb{Z}\\ \hline 4&5&9\\ 2&3&4\\ 1&4&-1\\ 2&1&-6\\ \hline\end{array}\qquad\begin{array}[t]{|lllll|r|}\multicolumn{6}{c}{R\Join_{y<u}S}\\ \hline x&y&z&u&v&\mathbb{Z}\\ \hline 1&2&2&4&5&10\\ 1&2&3&4&5&15\\ \hline\end{array}$

$Q_{1}=\big(r(x,y)\Join s(y,z,w)\Join t(u,v)\mid x<z\wedge w<u\big)$

$r(x_{r})\Join s(x_{s},y_{s})\Join t(x_{t},y_{t})\Join u(y_{u})\mid x_{s}\leq x_{r},\ x_{t}\leq x_{r},\ y_{s}\leq y_{u},\ y_{t}\leq y_{u}$

$Q_{n}=\big(\Join_{r(\overline{x})\in\operatorname{\textit{at}}(T_{n})}r(\overline{x})\mid\operatorname{\textit{pred}}(T_{n})\big)$

$O((|N\cap T_{c}|+1)\times f(M))=O(|N\cap T_{n}|\times f(M)).$

$O(f(|\rho_{c_{1}}|))+O(|N\cap T_{c_{1}}|\times f(M))+O(f(|\rho_{c_{2}}|))+O(|N\cap T_{c_{2}}|\times f(M))$

$=O((|N\cap T_{c_{1}}|+|N\cap T_{c_{2}}|+2)\times f(M))$

$=O(|N\cap T_{n}|\times f(M)).$

$\operatorname{\textit{hyp}}(Q)=\{\overline{x}\mid r(\overline{x})\text{ is an atom of }Q\text{ with }\overline{x}\neq\emptyset\}.$

$\operatorname{\textit{ext}}_{H}(e)=\bigcup\{\operatorname{\textit{var}}(\theta)\mid\theta\in\operatorname{\textit{pred}}(H),\ \operatorname{\textit{var}}(\theta)\cap e\neq\emptyset\}\setminus e.$

$Q=\pi_{t,u,z,w}\big(r_{1}(s,t,u)\Join r_{2}(t,u)\Join r_{3}(u,w,x)\Join r_{4}(s,v)\Join r_{5}(w,z,y)\mid t<v\wedge x<y\big).$

$\operatorname{\textit{hyp}}(F)=\{\operatorname{\textit{var}}(n)\mid n\text{ root node in }F,\ \operatorname{\textit{var}}(n)\neq\emptyset\}.$

$\operatorname{\textit{hyp}}(T,N)=\{\operatorname{\textit{var}}_{T}(n)\mid n\in N,\ \operatorname{\textit{var}}_{T}(n)\neq\emptyset\}.$

$H(T,A,\operatorname{\textit{out}}(Q))$

$=(\operatorname{\textit{hyp}}(T,A),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(T,A))$

$=(\operatorname{\textit{hyp}}(T,A)\cup\operatorname{\textit{hyp}}(Q),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(T,A))$

$\rightsquigarrow^{*}(\operatorname{\textit{hyp}}(Q),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(T,A))$

$=(\operatorname{\textit{hyp}}(Q),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(Q))=H(Q)$

$\pi_{\operatorname{\textit{var}}(n)}Q_{n}(\operatorname{\textit{db}})=\pi_{\operatorname{\textit{var}}(n)}\sigma_{\operatorname{\textit{pred}}(n)}\pi_{\operatorname{\textit{var}}(c)}Q_{c}(\operatorname{\textit{db}}).$

$Q_{n}(\operatorname{\textit{db}})$

$=\sigma_{\operatorname{\textit{pred}}(n)}\big(\sigma_{\Theta_{1}}R_{1}(\operatorname{\textit{db}})\Join\sigma_{\Theta_{2}}R_{2}(\operatorname{\textit{db}})\big)$

$=\sigma_{\operatorname{\textit{pred}}(n)}\big(Q_{c_{1}}(\operatorname{\textit{db}})\Join Q_{c_{2}}(\operatorname{\textit{db}})\big)$

$\pi_{\operatorname{\textit{var}}(n)}Q_{n}(\operatorname{\textit{db}})=\pi_{\operatorname{\textit{var}}(n)}\sigma_{\operatorname{\textit{pred}}(n)}\big(Q_{c_{1}}(\operatorname{\textit{db}})\Join Q_{c_{2}}(\operatorname{\textit{db}})\big).$

$=\pi_{\operatorname{\textit{var}}(n)}\sigma_{\operatorname{\textit{pred}}(n)}\big(\pi_{\operatorname{\textit{var}}(c_{1})}Q_{c_{1}}(\operatorname{\textit{db}})\Join\pi_{\operatorname{\textit{var}}(c_{2})}Q_{c_{2}}(\operatorname{\textit{db}})\big).$

$\pi_{\operatorname{\textit{var}}(n)}Q_{n}(\operatorname{\textit{db}})=\pi_{\operatorname{\textit{var}}(n)}\sigma_{\operatorname{\textit{pred}}(n)}(\rho_{c_{1}}\Join\rho_{c_{2}})=\rho_{n},$

$\pi_{\operatorname{\textit{var}}(N)}Q_{c}(\operatorname{\textit{db}})\ltimes\sigma_{\operatorname{\textit{pred}}(n)}\big(\pi_{\operatorname{\textit{var}}(c)}Q_{c}(\operatorname{\textit{db}})\ltimes t\big).$

$=\pi_{\operatorname{\textit{var}}(N)}\sigma_{\operatorname{\textit{pred}}(n)}\big(Q_{c}(\operatorname{\textit{db}})\ltimes(\pi_{\operatorname{\textit{var}}(c)}Q_{c}(\operatorname{\textit{db}})\ltimes t)\big).$

$=\pi_{\operatorname{\textit{var}}(N)}\sigma_{\operatorname{\textit{pred}}(n)}Q_{c}(\operatorname{\textit{db}})\ltimes t=\pi_{\operatorname{\textit{var}}(N)}Q_{n}(\operatorname{\textit{db}})\ltimes t.$


Taxonomy

Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Data Quality and Management

Full text


M. Idris: Université Libre de Bruxelles, Belgium and TU Dresden, Germany

M. Ugarte: Université Libre de Bruxelles, Belgium

S. Vansummeren: Université Libre de Bruxelles, Belgium

Hannes Voigt: neo4j, Germany. This work was done while the author was affiliated with TU Dresden, Germany.

Wolfgang Lehner: TU Dresden, Germany

Conjunctive Queries with Theta Joins Under Updates

Muhammad Idris

Martín Ugarte

Stijn Vansummeren

Hannes Voigt

Wolfgang Lehner


Abstract

Modern application domains such as Composite Event Recognition (CER) and real-time analytics require the ability to dynamically refresh query results under high update rates. Traditional approaches to this problem are based either on the materialization of subresults (to avoid their recomputation) or on the recomputation of subresults (to avoid the space overhead of materialization). Both techniques have recently been shown to be suboptimal: instead of materializing results and subresults, one can maintain a data structure that supports efficient maintenance under updates and can quickly enumerate the full query output, as well as the changes produced under single updates. Unfortunately, these data structures have been developed only for aggregate-join queries composed of equi-joins, limiting their applicability in domains such as CER where temporal joins are commonplace. In this paper, we present a new approach for dynamically evaluating queries with multi-way $\theta$-joins under updates that is effective in avoiding both materialization and recomputation of results, while supporting a wide range of applications. To do this we generalize Dynamic Yannakakis, an algorithm for dynamically processing acyclic equi-join queries. In tandem, and of independent interest, we generalize the notions of acyclicity and free-connexity to arbitrary $\theta$-joins and show how to compute corresponding join trees. We instantiate our framework to the case where $\theta$-joins are only composed of equalities and inequalities ($<,\leq,>,\geq$) and experimentally compare our algorithm to state-of-the-art CER systems as well as incremental view maintenance engines. Our approach performs consistently better than the competitor systems, with up to two orders of magnitude improvements in both time and memory consumption.

1 Introduction

The ability to analyze dynamically changing data is a key requirement of many contemporary applications, usually associated with Big Data, that require such analysis in order to obtain timely insights and implement reactive and proactive measures. Example applications include Financial Systems cormode2007, Industrial Control Systems groover2007, Stream Processing Stonebraker:2005, Composite Event Recognition (CER, also known as Complex Event Processing) buchmann:2009; cugola:2012, and Business Intelligence (BI) sahay2008. Generally, the analyses that need to be kept up-to-date, or at least their basic elements, are specified in a query language. The main task is then to efficiently update the query results under frequent data updates.

In this paper, we focus on the problem of dynamic evaluation for queries that feature multi-way $\theta$-joins in addition to standard equi-joins. To illustrate our setting, consider that we wish to detect potential credit card fraud. Credit card transactions have a timestamp (ts), account number (acc), and amount (amnt), among other attributes. A typical fraud pattern is that the criminal tests the credit card with a few small purchases and then makes larger purchases (cf. DBLP:conf/debs/Schultz-MollerMP09). In this respect, we would like to dynamically evaluate the following query, assuming new transactions arrive in a streaming fashion and the pattern must be detected in less than 1 hour.

SELECT *
FROM Trans S1, Trans S2, Trans L
WHERE S1.ts < S2.ts AND S2.ts < L.ts AND L.ts < S1.ts + 1h
  AND S1.acc = S2.acc AND S2.acc = L.acc
  AND S1.amnt < 100 AND S2.amnt < 100 AND L.amnt > 400

Queries like this with inequality joins appear in both CER and BI scenarios. Traditional techniques to process these queries dynamically can be categorized in two approaches: relational and automaton-based. We next discuss both approaches, their strengths and drawbacks.

Relational

Relational approaches, such as DBLP:journals/vldb/KochAKNNLS14; DBLP:conf/sigmod/MeiM09; DBLP:books/sp/16/ArasuBBCDIMSW16, are based on a form of Incremental View Maintenance (IVM). To process a query $Q$ over a database $\operatorname{\textit{db}}$, IVM techniques materialize the output $Q(\operatorname{\textit{db}})$ and evaluate delta queries. Upon update $\operatorname{\mathit{u}}$, delta queries use $\operatorname{\textit{db}}$, $\operatorname{\mathit{u}}$ and the materialized $Q(\operatorname{\textit{db}})$ to compute the set of tuples to add to or delete from $Q(\operatorname{\textit{db}})$ in order to obtain $Q(\operatorname{\textit{db}}+\operatorname{\mathit{u}})$. If $\operatorname{\mathit{u}}$ is small w.r.t. $\operatorname{\textit{db}}$, this is expected to be faster than recomputing $Q(\operatorname{\textit{db}}+\operatorname{\mathit{u}})$ from scratch. To further speed up dynamic query processing, we may materialize not only $Q(\operatorname{\textit{db}})$ but also the result of some subqueries. This is known as Higher-Order IVM (HIVM for short) DBToaster2016; DBLP:journals/vldb/KochAKNNLS14; DBLP:conf/pods/Koch10. Both IVM and HIVM have drawbacks, however. First, materialization of $Q(\operatorname{\textit{db}})$ requires $\Omega(\parallel{Q(\operatorname{\textit{db}})}\parallel)$ space, where $\parallel{\operatorname{\textit{db}}}\parallel$ denotes the size of $\operatorname{\textit{db}}$. Therefore, when $Q(\operatorname{\textit{db}})$ is large compared to $\operatorname{\textit{db}}$, materializing $Q(\operatorname{\textit{db}})$ quickly becomes impractical, especially for main-memory based systems. HIVM is even more affected by this problem because it materializes not only the result of $Q$ but also the results of some subqueries. For example, in our fraud query HIVM would materialize the results of the following join in order to respond quickly to the arrival of a potential transaction $L$:

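$\sigma_{\text{amnt}<100}(S_{1})\Join_{S_{1}.\text{ts}<S_{2}.\text{ts}\,\wedge\,S_{1}.\text{acc}=S_{2}.\text{acc}}\sigma_{\text{amnt}<100}(S_{2})\qquad(\star)$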

If we assume that there are $N$ small transactions in the time window, all of the same account, this materialization will take $\Theta(N^{2})$ space. This rapidly becomes impractical as $N$ grows.
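The quadratic growth can be checked on synthetic data. The following is a minimal sketch, not part of the original system: the function name `materialized_pairs` and the generated transactions are assumptions made for illustration. It counts the pairs of small transactions on one account that a materialized pairwise join would have to store.

```python
from itertools import combinations

def materialized_pairs(n):
    """Count pairs (S1, S2) of small transactions with S1.ts < S2.ts on the
    same account. The n transactions are synthetic: one account, distinct
    timestamps, amounts below 100 (all assumptions for illustration)."""
    transactions = [{"ts": ts, "acc": 1, "amnt": 50} for ts in range(n)]
    return sum(
        1
        for s1, s2 in combinations(transactions, 2)
        if s1["ts"] < s2["ts"]
        and s1["acc"] == s2["acc"]
        and s1["amnt"] < 100
        and s2["amnt"] < 100
    )

for n in (10, 100):
    # The count equals n * (n - 1) / 2, i.e. it grows quadratically in n.
    print(n, materialized_pairs(n))
```

With distinct timestamps every unordered pair contributes exactly one ordered pair, so the stored subresult has $N(N-1)/2$ tuples while the input has only $N$.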

Automata

Automaton-based approaches (e.g., brenna07; DBLP:conf/sigmod/WuDR06; DBLP:journals/jss/CugolaM12; DBLP:conf/debs/CugolaM10; Sase2014; DBLP:conf/sigmod/AgrawalDGI08) are primarily employed in CER systems. In contrast to the relational approaches, they assume that the arrival order of event tuples corresponds to the timestamp order (i.e., there are no out-of-order events) and build an automaton to recognize the desired temporal patterns in the input stream. Broadly speaking, there are two automata-based recognition approaches. In the first approach, followed by DBLP:conf/sigmod/WuDR06; DBLP:conf/sigmod/AgrawalDGI08, events are cached per state and, once a final state is reached, a search through the cached events is done to recognize the complex events. While it is no longer necessary to check the temporal constraints during the search, the additional constraints (in our example, $L.\text{ts}<S_{1}.\text{ts}+1\text{h}$ and $S_{1}.\text{acc}=S_{2}.\text{acc}=L.\text{acc}$) must still be verified. If the additional constraints are highly selective, this approach creates an unnecessarily large update latency, given that each event triggering a transition to a final state may cause re-evaluation of a sub-join on the cached data, only to find few new output tuples.

In the second approach, followed by brenna07; Sase2014; DBLP:journals/jss/CugolaM12; DBLP:conf/debs/CugolaM10, partial runs are materialized according to the automaton’s topology. For our example query, this means that, just like HIVM, the join ($\star$) is materialized and maintained so it is available when a large-amount transaction $L$ arrives. This approach hence shares with HIVM its high memory overhead and maintenance cost.

It has been recently shown that the drawbacks of these two approaches can be overcome by a rather simple idea dyn:2017 ; olteanu:fivm . Instead of fully materializing (potentially large) results and subresults, we can build a compact representation of the query result that supports efficient maintenance under updates. The representation is equipped with index structures so that, whenever necessary, we can generate the actual query result one tuple at a time, spending a limited amount of work to produce each new result tuple. This makes the generation performance-wise competitive with enumeration from a fully materialized (non-compact) output. In essence, we are hence separating dynamic query processing into two stages: (1) an update stage where we only maintain under updates the (small) information that is necessary to be able to efficiently generate the query result and (2) an enumeration stage where the query result is efficiently enumerated. Moreover, for single-tuple updates the representation also supports efficient enumeration of the changes to the query result. This is relevant for push-based query processing systems, where users do not ping the system for the complete current query answer, but instead ask to be notified of the changes to the query results when the database changes.
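The separation into an update stage and an enumeration stage can be sketched for the simplest inequality join, $R(a)\Join S(d)\mid a<d$. The class below is an illustrative assumption and not the algorithm developed in this paper: it handles insertions only, ignores multiplicities, and uses plain sorted Python lists (so an insertion costs a binary search plus a linear-time list shift). It stores only the two inputs and produces join results on demand.

```python
import bisect

class InequalityJoinIndex:
    """Hypothetical compact representation of R(a) JOIN S(d) WHERE a < d."""

    def __init__(self):
        self.r_vals = []  # values of R.a, kept sorted
        self.s_vals = []  # values of S.d, kept sorted

    # Update stage: maintain only the sorted inputs, never the join result.
    def insert_r(self, a):
        bisect.insort(self.r_vals, a)

    def insert_s(self, d):
        bisect.insort(self.s_vals, d)

    # Enumeration stage: generate result tuples one at a time.
    def enumerate_pairs(self):
        if not self.s_vals:
            return
        max_d = self.s_vals[-1]
        for a in self.r_vals:
            if a >= max_d:
                break  # larger values of a cannot have a partner either
            # All partners of a form a suffix of the sorted S values.
            start = bisect.bisect_right(self.s_vals, a)
            for i in range(start, len(self.s_vals)):
                yield (a, self.s_vals[i])

index = InequalityJoinIndex()
for a in (1, 5, 3):
    index.insert_r(a)
for d in (4, 2):
    index.insert_s(d)
print(list(index.enumerate_pairs()))
```

The structure uses space linear in the input, whereas the join result can be quadratic. Because every visited value of `a` is guaranteed to have at least one partner, the work between two consecutive outputs is bounded by one binary search.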

This idea was first presented by a subset of the authors in the Dynamic Yannakakis Algorithm (Dyn for short) dyn:2017 , an algorithm for efficiently processing acyclic aggregate-join queries. Dyn is worst-case optimal for two classes of queries, namely the q-hierarchical and free-connex acyclic conjunctive queries. A different approach named F-IVM, based on so-called factorized databases, was later developed to dynamically process aggregate-join queries that are not necessarily acyclic or need to support complex aggregates olteanu:fivm .

Unfortunately, both Dyn and F-IVM are only applicable to queries with equality joins, and as such they do not support analytical queries with other types of joins like the ones with inequalities $(\leq,<,\geq,>)$ or disequalities $(\neq)$. Therefore, the current state-of-the-art techniques for dynamically processing queries with joins beyond equality suffer either from a high update latency (if subresults are not materialized) or a high memory footprint (if subresults are materialized). In this paper, we overcome these problems by generalizing the Dynamic Yannakakis algorithm to conjunctive queries with arbitrary $\theta$-joins. We show that, in the specific case of inequality joins, this generalization performs consistently better than the state of the art, with up to two orders of magnitude improvements in processing time and memory consumption.

Contributions

We focus on the class of Generalized Conjunctive Queries (GCQs for short), which are conjunctive queries with $\theta$-joins that are evaluated under multiset semantics.

(1) We devise a succinct and efficiently updatable data structure to dynamically process GCQs. To this end, we first generalize the notions of acyclicity and free-connexity to queries with arbitrary $\theta$-joins (Section 3). Our data structure degrades gracefully: if a GCQ only contains equalities, our approach inherits the worst-case optimality provided by Dyn.

(2) We present GDyn, a general framework for extending Dyn to free-connex acyclic GCQs. Our treatment is general in the sense that the $\theta$-join predicates are treated abstractly. GDyn hence applies to all predicates, not only inequality joins. We analyze the complexity of GDyn, and identify properties of indexing structures that are required in order for GDyn to support efficient enumeration of results as well as efficient update processing (Section 5).

(3) We instantiate GDyn to the particular case of inequality and equality joins. We show that updates can be processed in time $O(n^{2}\cdot\log(n))$, where $n$ is the size of the database plus the size of the update, and results can be enumerated with logarithmic delay. Moreover, if there is at most one inequality between any pair of relations, updates take time $O(n\cdot\log(n))$ and enumeration is with constant delay. We call the resulting algorithm IEDyn. We first illustrate this algorithm by means of an extensive example (Section 4), and then describe the required data structures formally at the end of Section 5.

(4) The operation of GDyn and IEDyn is driven by a Generalized Join Tree (GJT). GJTs are essentially query plans that specify the data structure to be materialized, how it should be updated, and how to enumerate the query results. We present an algorithm that can be used both to check whether a GCQ is (free-connex) acyclic and to construct a corresponding GJT if this is the case (Section 6).

(5) We experimentally compare IEDyn with state-of-the-art HIVM and CER frameworks. IEDyn performs consistently better, with up to two orders of magnitude improvements in both speed and memory consumption (Section 7 and Section 8).

We introduce the required background in Section 2.

Additional material

This article presents the following additional contributions compared to its previously published conference version DBLP:journals/pvldb/IdrisUVVL18 :

(1) Correctness proofs. The conference version only sketched why GDyn and IEDyn work correctly and within the claimed bounds. In contrast, here we formally prove correctness.

(2) Novel algorithm for computing GJTs. As outlined above, GDyn and IEDyn work on acyclic GCQs and their operation is driven by the specification of a GJT for such queries. The conference version only stated that an algorithm for checking acyclicity and free-connexity and computing GJTs exists. In contrast, here we fully present this algorithm and illustrate its correctness.

Additional related work

In addition to the work already cited on CER and (H)IVM, our setting is closely related to query evaluation with constant delay enumeration Bagan:2007; DBLP:journals/sigmod/Segoufin15; Berkholz:2017; braultbaron; Olteanu:2015; Bakibayev:2013; Schleich:2016; dyn:2017; olteanu:fivm. This setting, however, deals with equi-joins only. Also related, although restricted to the static setting, is the practical evaluation of binary DBLP:conf/vldb/DeWittNS91; DBLP:conf/vldb/HellersteinNP95; DBLP:conf/sigmod/EnderleHS04 and multi-way DBLP:journals/is/BernsteinG81; DBLP:conf/vldb/YoshikawaK84 inequality joins. Our work, in contrast, considers dynamic processing of multi-way $\theta$-joins, with a specialization to inequality joins. Recently, Khayyat et al. DBLP:journals/vldb/KhayyatLSOPQ0K17 proposed fast multi-way inequality join algorithms based on sorted arrays and space-efficient bit-arrays. They focus on the case where there are exactly two inequality conditions per pairwise join. While they also present an incremental algorithm for pairwise joins, their algorithm makes no effort to minimize the update cost in the case of multi-way joins. As a result, they either materialize subresults (implying a space overhead that can be more than linear), or recompute subresults. We do neither.

2 Preliminaries

Traditional conjunctive queries are cross products between relations, restricted by equalities. Similarly, generalized conjunctive queries (GCQs) are cross products between relations, but restricted by arbitrary predicates. We use the following notation for queries.

Query Language

Throughout the paper, let $x,y,z,\dots$ denote variables (also commonly called column names or attributes). A hyperedge is a finite set of variables. We use $\overline{x},\overline{y},\dots$ to denote hyperedges. A GCQ is an expression of the form

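$Q=\pi_{\overline{y}}\big(r_{1}(\overline{x_{1}})\Join\dots\Join r_{n}(\overline{x_{n}})\mid\bigwedge_{i=1}^{m}\theta_{i}(\overline{z_{i}})\big)$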

Here $r_{1},\dots,r_{n}$ are relation symbols; $\overline{x_{1}},\dots,\overline{x_{n}}$ are hyperedges (of the same arity as $r_{1},\dots,r_{n}$); $\theta_{1},\ldots,\theta_{m}$ are predicates over $\overline{z_{1}},\dots,\overline{z_{m}}$, respectively; and both $\overline{y}$ and $\bigcup_{i=1}^{m}\overline{z_{i}}$ are subsets of $\bigcup_{i=1}^{n}\overline{x_{i}}$. We treat predicates abstractly: for our purpose, a predicate over $\overline{x}$ is a (not necessarily finite) decidable set $\theta$ of tuples over $\overline{x}$. For example, $\theta(x,y)=x<y$ is the set of all tuples $(a,b)$ satisfying $a<b$. We indicate that $\theta$ is a predicate over $\overline{x}$ by writing $\theta(\overline{x})$. Throughout the paper, we consider only non-nullary predicates, i.e., predicates with $\overline{x}\neq\emptyset$.

Example 1.

The following query is a GCQ.

$\pi_{y,z,w,u}\big(r(x,y)\Join s(y,z,w)\Join t(u,v)\mid x<z\wedge w<u\big)$

Intuitively, the query asks for the natural join between $r(x,y)$ , $s(y,z,w)$ , and $t(u,v)$ , and from this result select only those tuples that satisfy both $x<z$ and $w<u$ .

We call $\overline{y}$ the output variables of $Q$ and denote it by $\operatorname{\textit{out}}(Q)$ . If $\overline{y}=\overline{x_{1}}\cup\dots\cup\overline{x_{n}}$ then $Q$ is called a full query and we may omit the symbol $\pi_{\overline{y}}$ altogether for brevity. The elements $r_{i}(\overline{x_{i}})$ are called atomic queries (or atoms). We write $\operatorname{\textit{at}}(Q)$ for the set of all atoms in $Q$ , and $\operatorname{\textit{pred}}(Q)$ for the set of all predicates in $Q$ . A normal conjunctive query (CQ for short) is a GCQ where $\operatorname{\textit{pred}}(Q)=\emptyset$ .
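The notation can be mirrored in a small data model. The sketch below is an assumption made for illustration (the class and method names `Atom`, `Predicate`, `GCQ`, `is_well_formed`, `is_full` and `is_cq` do not come from the paper); it encodes the query with atoms $r(x,y)$, $s(y,z,w)$, $t(u,v)$, predicates $x<z$ and $w<u$, and output variables $y,z,w,u$.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

@dataclass(frozen=True)
class Atom:
    relation: str
    variables: Tuple[str, ...]  # the hyperedge of the atom

@dataclass(frozen=True)
class Predicate:
    variables: Tuple[str, ...]  # the variables the predicate ranges over
    test: Callable[..., bool]   # decides membership of a tuple in the predicate

@dataclass(frozen=True)
class GCQ:
    output: FrozenSet[str]              # out(Q)
    atoms: Tuple[Atom, ...]             # at(Q)
    predicates: Tuple[Predicate, ...]   # pred(Q)

    def all_variables(self):
        return frozenset(v for atom in self.atoms for v in atom.variables)

    def is_well_formed(self):
        # Output and predicate variables must occur in some atom,
        # and predicates must be non-nullary.
        variables = self.all_variables()
        return self.output <= variables and all(
            p.variables and set(p.variables) <= variables
            for p in self.predicates
        )

    def is_full(self):
        return self.output == self.all_variables()

    def is_cq(self):
        return len(self.predicates) == 0

example_1 = GCQ(
    output=frozenset({"y", "z", "w", "u"}),
    atoms=(Atom("r", ("x", "y")), Atom("s", ("y", "z", "w")), Atom("t", ("u", "v"))),
    predicates=(
        Predicate(("x", "z"), lambda x, z: x < z),
        Predicate(("w", "u"), lambda w, u: w < u),
    ),
)
print(example_1.is_well_formed(), example_1.is_full(), example_1.is_cq())
```

The encoded query is well-formed, is not full (it projects away $x$ and $v$), and is not a normal CQ (it has predicates).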

Semantics

We evaluate GCQs over Generalized Multiset Relations (GMRs for short) DBLP:journals/vldb/KochAKNNLS14; DBLP:conf/pods/Koch10; dyn:2017. A GMR over $\overline{x}$ is a relation $R$ over $\overline{x}$ (i.e., a finite set of tuples with schema $\overline{x}$) in which each tuple $\vec{t}$ is associated with a non-zero integer multiplicity $R(\vec{t})\in\mathbb{Z}\setminus\{0\}$. (Footnote 1: In their full generality, GMRs can carry multiplicities that are taken from an arbitrary algebraic ring structure (cf. DBLP:conf/pods/Koch10), which can be useful to describe the computation of aggregations over the result of a GCQ. To keep the notation and discussion simple, we fix the ring $\mathbb{Z}$ of integers throughout the paper, but our results generalize trivially to arbitrary rings.) In contrast to classical multisets, the multiplicity of a tuple in a GMR can hence be negative, which allows insertions and deletions to be treated uniformly. We write $\operatorname{supp}(R)$ for the finite set of all tuples in $R$; $\vec{t}\in R$ to indicate $\vec{t}\in\operatorname{supp}(R)$; and $|{R}|$ for $|{\operatorname{supp}(R)}|$. A GMR $R$ is positive if $R(\vec{t})>0$ for all $\vec{t}\in\operatorname{supp}(R)$.

The operations of GMR union ($R+S$), minus ($R-S$), projection ($\operatorname{\pi}_{\overline{z}}R$), natural join ($R\Join T$) and selection ($\sigma_{P}(R)$) are defined as in relational algebra with multiset semantics. Figure 1 illustrates these operations. We refer to dyn:2017; DBLP:journals/vldb/KochAKNNLS14 for a formal semantics. We abbreviate $\sigma_{P}(R\Join T)$ by $R\Join_{P}T$ and, if $\overline{x}=\operatorname{\textit{var}}(R)$, we abbreviate $\pi_{\overline{x}}(R\Join_{P}T)$ by $R\operatorname{\ltimes}_{P}T$.
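These operations can be sketched with dictionaries that map tuples to integer multiplicities. The `GMR` class below is a hypothetical minimal model, not the paper's implementation; the relations $R$, $S$ and $T$ carry the values of the paper's illustration of GMR operations, so the printed results can be compared against it.

```python
from collections import defaultdict

class GMR:
    """Hypothetical minimal GMR: a schema plus {tuple: non-zero multiplicity}."""

    def __init__(self, schema, rows=None):
        self.schema = tuple(schema)
        # Tuples with multiplicity zero are not part of the support.
        self.rows = {t: m for t, m in (rows or {}).items() if m != 0}

    def __add__(self, other):  # union: multiplicities are added
        assert self.schema == other.schema
        out = defaultdict(int, self.rows)
        for t, m in other.rows.items():
            out[t] += m
        return GMR(self.schema, out)

    def __sub__(self, other):  # minus: multiplicities are subtracted
        return self + GMR(other.schema, {t: -m for t, m in other.rows.items()})

    def project(self, variables):  # multiplicities of merged tuples are summed
        idx = [self.schema.index(v) for v in variables]
        out = defaultdict(int)
        for t, m in self.rows.items():
            out[tuple(t[i] for i in idx)] += m
        return GMR(variables, out)

    def join(self, other, predicate=lambda row: True):
        """Natural join followed by a selection; multiplicities are multiplied."""
        shared = [v for v in self.schema if v in other.schema]
        schema = self.schema + tuple(v for v in other.schema if v not in self.schema)
        out = defaultdict(int)
        for t1, m1 in self.rows.items():
            r1 = dict(zip(self.schema, t1))
            for t2, m2 in other.rows.items():
                r2 = dict(zip(other.schema, t2))
                if all(r1[v] == r2[v] for v in shared):
                    row = {**r1, **r2}
                    if predicate(row):
                        out[tuple(row[v] for v in schema)] += m1 * m2
        return GMR(schema, out)

R = GMR(("x", "y", "z"), {(1, 2, 2): 2, (2, 4, 6): 3, (1, 2, 3): 3})
S = GMR(("u", "v"), {(4, 5): 5, (2, 3): 4, (1, 4): 2})
T = GMR(("u", "v"), {(4, 5): -4, (2, 1): 6, (1, 4): 3})

print(S.join(T).rows)
print(R.project(("y",)).rows)
print((S + T).rows)
print((S - T).rows)
print(R.join(S, lambda row: row["y"] < row["u"]).rows)
```

Note that $T$ is not positive: the tuple $(4,5)$ has multiplicity $-4$, which is how a GMR encodes deletions.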

A database over a set $\mathcal{A}$ of atoms is a function $\operatorname{\textit{db}}$ that maps every atom $r(\overline{x})\in\mathcal{A}$ to a positive GMR $\operatorname{\textit{db}}_{r(\overline{x})}$ over $\overline{x}$ . Given a database $\operatorname{\textit{db}}$ over the atoms occurring in query $Q$ , the evaluation of $Q$ over $\operatorname{\textit{db}}$ , denoted $Q(\operatorname{\textit{db}})$ , is the GMR over $\overline{y}$ constructed in the expected way: take the natural join of all GMRs in the database, do a selection over the result w.r.t. each predicate, and finally project on $\overline{y}$ .
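The evaluation order just described (join, then select, then project) can be written down directly. The following is a naive sketch for illustration only; the function names and the small database are assumptions, and the query is $\pi_{y,z,w,u}\big(r(x,y)\Join s(y,z,w)\Join t(u,v)\mid x<z\wedge w<u\big)$.

```python
from collections import defaultdict
from functools import reduce

def natural_join(left, right):
    """Join two GMRs, each given as (schema, {tuple: multiplicity})."""
    (ls, lrows), (rs, rrows) = left, right
    schema = ls + tuple(v for v in rs if v not in ls)
    out = defaultdict(int)
    for lt, lm in lrows.items():
        lrow = dict(zip(ls, lt))
        for rt, rm in rrows.items():
            rrow = dict(zip(rs, rt))
            # Tuples join if they agree on all shared variables.
            if all(lrow[v] == rrow[v] for v in rs if v in lrow):
                row = {**lrow, **rrow}
                out[tuple(row[v] for v in schema)] += lm * rm
    return schema, dict(out)

def evaluate(output, predicates, db):
    """Naive Q(db): join all GMRs, select on every predicate, project on output."""
    schema, rows = reduce(natural_join, db.values())
    result = defaultdict(int)
    for t, m in rows.items():
        row = dict(zip(schema, t))
        if all(p(row) for p in predicates):
            result[tuple(row[v] for v in output)] += m
    return {t: m for t, m in result.items() if m != 0}

# Hypothetical database, chosen for illustration.
db = {
    "r": (("x", "y"), {(1, 2): 1, (5, 2): 1}),
    "s": (("y", "z", "w"), {(2, 3, 4): 2}),
    "t": (("u", "v"), {(6, 0): 1, (6, 1): 1, (3, 0): 1}),
}
predicates = [lambda row: row["x"] < row["z"], lambda row: row["w"] < row["u"]]
print(evaluate(("y", "z", "w", "u"), predicates, db))
```

Two joined tuples survive the selection, each with multiplicity 2, and they differ only in $v$; projecting $v$ away merges them into the single output tuple $(2,3,4,6)$ with multiplicity 4. This naive strategy materializes the full join before filtering, which is exactly the cost the dynamic approach aims to avoid.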

Updates and deltas

An update to a GMR $R$ is simply a GMR $\Delta{R}$ over the same variables as $R$ . Applying update $\Delta{R}$ to $R$ yields the GMR $R+\Delta{R}$ . An update to a database $\operatorname{\textit{db}}$ is a collection $u$ of (not necessarily positive) GMRs, one GMR $u_{r(\overline{x})}$ for every atom $r(\overline{x})$ of $\operatorname{\textit{db}}$ , such that $\operatorname{\textit{db}}_{r(\overline{x})}+\operatorname{\mathit{u}}_{r(\overline{x})}$ is positive. We write $\operatorname{\textit{db}}+u$ for the database obtained by applying $u$ to each atom of $\operatorname{\textit{db}}$ , i.e., $(\operatorname{\textit{db}}+u)_{r(\overline{x})}=\operatorname{\textit{db}}_{r(\overline{x})}+u_{r(\overline{x})}$ , for every atom $r(\overline{x})$ of $\operatorname{\textit{db}}$ . For every query $Q$ , every database $\operatorname{\textit{db}}$ and every update $\operatorname{\mathit{u}}$ to $\operatorname{\textit{db}}$ , we define the delta query $\Delta{Q}(\operatorname{\textit{db}},\operatorname{\mathit{u}})$ of $Q$ w.r.t. $\operatorname{\textit{db}}$ and $\operatorname{\mathit{u}}$ by $\Delta{Q}(\operatorname{\textit{db}},\operatorname{\mathit{u}}):=Q(\operatorname{\textit{db}}+\operatorname{\mathit{u}})-Q(\operatorname{\textit{db}})$ . As such, $\Delta{Q}(\operatorname{\textit{db}},\operatorname{\mathit{u}})$ is the update that we need to apply to $Q(\operatorname{\textit{db}})$ in order to obtain $Q(\operatorname{\textit{db}}+\operatorname{\mathit{u}})$ .
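The definition of the delta query can be checked on a small case. The sketch below computes $\Delta Q$ literally as $Q(\operatorname{\textit{db}}+u)-Q(\operatorname{\textit{db}})$ for the query $R(a)\Join S(d)\mid a<d$; the helper names and the data are assumptions for illustration, and recomputing both results is of course not how an incremental engine would obtain the delta.

```python
from collections import defaultdict

def gmr_add(r, s, sign=1):
    """Return r + sign * s for GMRs given as {tuple: multiplicity}."""
    out = defaultdict(int, r)
    for t, m in s.items():
        out[t] += sign * m
    return {t: m for t, m in out.items() if m != 0}

def query(db):
    """Q = R(a) JOIN S(d) | a < d under multiset semantics."""
    out = defaultdict(int)
    for (a,), m_r in db["R"].items():
        for (d,), m_s in db["S"].items():
            if a < d:
                out[(a, d)] += m_r * m_s
    return dict(out)

def apply_update(db, u):
    """db + u, applied atom by atom."""
    return {name: gmr_add(db[name], u.get(name, {})) for name in db}

def delta_query(db, u):
    """Delta of Q w.r.t. db and u, computed from its definition."""
    return gmr_add(query(apply_update(db, u)), query(db), sign=-1)

db = {"R": {(1,): 1, (3,): 2}, "S": {(2,): 1, (4,): 1}}
# The update deletes one copy of R(3) and inserts S(5).
u = {"R": {(3,): -1}, "S": {(5,): 1}}

delta = delta_query(db, u)
print(delta)
```

The delta contains positive multiplicities for the new output tuples $(1,5)$ and $(3,5)$ and a negative multiplicity for $(3,4)$, whose count drops from 2 to 1. Adding the delta to $Q(\operatorname{\textit{db}})$ yields $Q(\operatorname{\textit{db}}+u)$.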

Enumeration with bounded delay

A data structure $D$ supports enumeration of a set $E$ if there is a routine $\operatorname{\textsc{enum}}$ such that $\operatorname{\textsc{enum}}(D)$ outputs each element of $E$ exactly once. Such enumeration occurs with delay $d$ if the time until the first output, the time between any two consecutive outputs, and the time between the last output and the termination of $\operatorname{\textsc{enum}}(D)$ are all bounded by $d$ . $D$ supports enumeration of a GMR $R$ if it supports enumeration of the set $E_{R}=\{(\vec{t},R(\vec{t}))\mid\vec{t}\in\operatorname{supp}(R)\}$ . When evaluating a GCQ $Q$ , we will be interested in representing the possible outputs of $Q$ by means of a family ${\cal D}$ of data structures, one data structure $D_{\operatorname{\textit{db}}}\in{\cal D}$ for each possible input database $\operatorname{\textit{db}}$ . We say that $Q$ can be enumerated from ${\cal D}$ with delay $f$ if for every input $\operatorname{\textit{db}}$ we can enumerate $Q(\operatorname{\textit{db}})$ from $D_{\operatorname{\textit{db}}}$ with delay $O(f(D_{\operatorname{\textit{db}}}))$ , where $f$ assigns a natural number to each $D_{\operatorname{\textit{db}}}$ . Intuitively, $f$ measures $D_{\operatorname{\textit{db}}}$ in some way. In particular, if $f$ is constant we say the results are generated from the data structure with constant-delay enumeration (CDE).

As a trivial example of CDE of a GMR $R$ , assume that the pairs $(\vec{t},R(\vec{t}))$ of $E_{R}$ are stored in an array $A$ (without duplicates). Then $A$ supports CDE of $R$ : $\operatorname{\textsc{enum}}(A)$ simply iterates over each element in $A$ , one by one, always outputting the current element. Since array indexing is an $O(1)$ operation, this gives constant delay. This example shows that CDE of the result $Q(\operatorname{\textit{db}})$ of a query $Q$ on input database $\operatorname{\textit{db}}$ can always be done naively by materializing $Q(\operatorname{\textit{db}})$ in an in-memory array $A$ . Unfortunately, $A$ then requires memory proportional to $\parallel{Q(\operatorname{\textit{db}})}\parallel$ which, depending on $Q$ , can be of size polynomial in $\parallel{\operatorname{\textit{db}}}\parallel$ . We hence search for other data structures that can represent $Q(\operatorname{\textit{db}})$ using less space, while still allowing for efficient enumeration. Our experiments in Section 7 show that for the data structures described in this paper, CDE is indeed competitive with enumeration from an array while requiring much less space.

Computational Model

It is important to note that we focus on dynamic query evaluation in main memory. Furthermore, we assume a model of computation where the space used by tuple values and integers, the time of arithmetic operations on integers, and the time of memory lookups are all $O(1)$ . We also assume that every GMR $R$ can be represented by a data structure that allows (1) enumeration of $R$ with constant delay; (2) multiplicity lookups $R(\vec{t})$ in $O(1)$ time given $\vec{t}$ ; (3) single-tuple insertions and deletions in $O(1)$ time; while (4) having a size that is proportional to the number of tuples in the support of $R$ . Essentially, our assumptions amount to perfect hashing of linear size cormen2009introduction . Although this is not realistic for practical computers Papadimitriou:2003 , it is well known that complexity results for this model can be translated, through amortized analysis, to average complexity in real-life implementations cormen2009introduction .

3 Generalized Acyclicity

Join queries are GCQs without projections that feature equality joins only. The well-known subclass of acyclic join queries abiteboul1995foundations ; DBLP:conf/vldb/Yannakakis81 , in contrast to the entire class of join queries, can be evaluated in time $O(\parallel{\operatorname{\textit{db}}}\parallel+\parallel{Q(\operatorname{\textit{db}})}\parallel)$ , i.e., linear in both input and output. This result relies on the fact that acyclic join queries admit a tree structure that can be exploited during evaluation. In previous work dyn:2017 , we showed that this tree structure can also be exploited for efficient processing of CQs under updates. In this section, we therefore extend the tree structure and the notion of acyclicity from join queries to GCQs with both projections and arbitrary $\theta$ -joins. We begin by defining this tree structure and the related notion of acyclicity for full GCQs. Then, we proceed with the notion corresponding to GCQs that feature projections, known as free-connex acyclicity.

Generalized Join Trees

To simplify notation, we denote the set of all variables (resp. atoms, resp. predicates) that occur in an object $X$ (such as a query) by $\operatorname{\textit{var}}(X)$ (resp. $\operatorname{\textit{at}}(X)$ , resp. $\operatorname{\textit{pred}}(X)$ ). In particular, if $X$ is itself a set of variables, then $\operatorname{\textit{var}}(X)=X$ . We extend this notion uniformly to labeled trees. E.g., if $n$ is a node in tree $T$ , then $\operatorname{\textit{var}}_{T}(n)$ denotes the set of variables occurring in the label of $n$ , and similarly for edges and trees themselves. Finally, we write $\operatorname{ch}_{T}(n)$ for the set of children of $n$ in tree $T$ . If $T$ is clear from the context, we omit subscripts from our notation.

Definition 1 (GJT).

A Generalized Join Tree (GJT) is a node-labeled and edge-labeled directed tree $T=(V,E)$ such that:

•

Every leaf is labeled by an atom.

•

Every interior node $n$ is labeled by a hyperedge and has at least one child $c$ such that $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(c)$ .

•

Whenever the same variable $x$ occurs in the label of two nodes $m$ and $n$ of $T$ , then $x$ occurs in the label of each node on the unique path linking $m$ and $n$ . This condition is called the connectedness condition.

•

Every edge $p\to c$ from parent $p$ to child $c$ in $T$ is labeled by a set $\operatorname{\textit{pred}}(p\to c)$ of predicates. It is required that for every predicate $\theta(\overline{z})\in\operatorname{\textit{pred}}(p\to c)$ we have $\operatorname{\textit{var}}(\theta)=\overline{z}\subseteq\operatorname{\textit{var}}(p)\cup\operatorname{\textit{var}}(c)$ .

Let $n$ be a node in GJT $T$ . Every node $m$ with $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(m)$ is called a guard of $n$ . Observe that every interior node must have a guard child by the second requirement above. Since this child must itself have a guard child, which must itself have a guard child, and so on, it holds that every interior node has at least one guard descendant that is a leaf.

Definition 2.

A GJT $T$ is a GJT for GCQ $Q$ if $\operatorname{\textit{at}}(T)=\operatorname{\textit{at}}(Q)$ and the number of times that an atom occurs in $Q$ equals the number of times that it occurs as a label in $T$ , and $\operatorname{\textit{pred}}(T)=\operatorname{\textit{pred}}(Q)$ . A GCQ $Q$ is acyclic if there is a GJT for $Q$ . It is cyclic otherwise.

Example 2.

The two trees depicted in Fig. 2 are GJTs for the following full GCQ $Q$ , which is hence acyclic.

[TABLE]

In contrast, the query $r(x,y)\Join s(y,z)\Join t(x,z)$ (also known as the triangle query) is the prototypical cyclic join query.

If $Q$ does not contain any predicates, that is, if $Q$ is a CQ, then the last condition of Definition 1 vacuously holds. In that case, the definition corresponds to the definition of a generalized join tree given in dyn:2017 , where it was also shown that a CQ $Q$ is acyclic under any of the traditional definitions of acyclicity (e.g., abiteboul1995foundations ) if and only if there is a GJT $T$ for $Q$ with $\operatorname{\textit{pred}}(T)=\emptyset$ . In this sense, Definition 2 indeed generalizes acyclicity from CQs to GCQs.

Discussion

The notion of acyclicity for normal CQs is well-studied in database theory abiteboul1995foundations and has many equivalent definitions, including a definition based on the existence of a full reducer. Here, a full reducer for a CQ $Q$ is a program $\mathcal{S}$ in the semijoin algebra (the variant of relational algebra where joins are replaced by semijoins) that, given a database $\operatorname{\textit{db}}$ , computes a new database $\mathcal{S}(\operatorname{\textit{db}})$ with the following properties: (1) $Q(\mathcal{S}(\operatorname{\textit{db}}))=Q(\operatorname{\textit{db}})$ ; (2) $\mathcal{S}(\operatorname{\textit{db}})_{r(\overline{x})}\subseteq\operatorname{\textit{db}}_{r(\overline{x})}$ for every atom $r(\overline{x})$ ; and (3) no strict subset of $\mathcal{S}(\operatorname{\textit{db}})$ has $Q(\mathcal{S}(\operatorname{\textit{db}}))=Q(\operatorname{\textit{db}})$ . In other words, $\mathcal{S}$ selects a minimal subset of $\operatorname{\textit{db}}$ needed to answer $Q$ .

Bernstein and Goodman DBLP:journals/is/BernsteinG81 consider conjunctive queries with inequalities and characterize the class of such queries that admit full reducers. As such, one can view this as a definition of acyclicity for conjunctive queries with inequalities. Bernstein and Goodman’s notion of acyclicity is incomparable to ours. On the one hand, our definition is more general: Bernstein and Goodman consider only queries where for each pair of atoms there is exactly one variable being compared by means of equality or inequality. We, in contrast, allow an arbitrary number of variables to be compared per pair of atoms. In particular, Bernstein and Goodman disallow queries like $(r(x,y),s(x,z)\mid y<z)$ since it compares $r.x$ with $s.x$ by means of equality and $r.y$ with $s.z$ by means of inequality, while this query is trivially acyclic in our setting.

On the other hand, for this more restricted class of queries, Bernstein and Goodman show that certain queries that we consider to be cyclic have full reducers (and would be hence acyclic under their notion). An example here is

[TABLE]

This query admits a full reducer because of the transitivity of $\leq$ . Since our notion of acyclicity interprets predicates abstractly and hence does not assume properties such as transitivity, we must declare this query cyclic (as can be checked by running the algorithm of Section 6 on it). It is an interesting direction for future work to incorporate Bernstein and Goodman’s notion of acyclicity in our framework.

Free-connex acyclicity

Acyclicity is actually a notion for full GCQs. Indeed, note that whether or not $Q$ is acyclic does not depend on the projections of $Q$ (if any). To also process queries with projections efficiently, a related structural constraint known as free-connex acyclicity is required.

Definition 3 (Connex, Frontier).

Let $T=(V,E)$ be a GJT. A connex subset of $T$ is a set $N\subseteq V$ that includes the root of $T$ such that the subgraph of $T$ induced by $N$ is a tree. The frontier of a connex set $N$ is the subset $F\subseteq N$ consisting of those nodes in $N$ that are leaves in the subtree of $T$ induced by $N$ .

To illustrate, the set $\{\{y,w\},\{u\},\{y,z,w\}\}$ is a connex subset of the tree $T_{2}$ shown in Fig. 2. Its frontier is $\{\{y,z,w\},\{u\}\}$ . In contrast, $\{\{y,w\},\{y,z,w\},\allowbreak t(u,v)\}$ is not a connex subset of $T_{2}$ .
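The two notions of Definition 3 are easy to test programmatically. In the sketch below, which is only an illustration, nodes are identified by the strings of their labels and the shape of $T_{2}$ is an assumption reconstructed from the running example of Section 4 (Fig. 2 itself is not reproduced here). A set containing the root induces a tree exactly when it is closed under taking parents.

```python
# Assumed shape of T_2, given as a child -> parent map.
parent = {"{y,z,w}": "{y,w}", "{u}": "{y,w}",
          "r(x,y)": "{y,z,w}", "s(y,z,w)": "{y,z,w}", "t(u,v)": "{u}"}
ROOT = "{y,w}"

def is_connex(N):
    """N contains the root and every other node of N has its parent in N."""
    return ROOT in N and all(parent[n] in N for n in N if n != ROOT)

def frontier(N):
    """Nodes of N that have no child in N."""
    has_child_in_N = {parent[n] for n in N if n != ROOT}
    return {n for n in N if n not in has_child_in_N}

N1 = {"{y,w}", "{u}", "{y,z,w}"}
N2 = {"{y,w}", "{y,z,w}", "t(u,v)"}
```

Under the assumed tree shape, `is_connex(N1)` holds and `frontier(N1)` consists of the nodes $\{y,z,w\}$ and $\{u\}$, while `is_connex(N2)` fails because the parent $\{u\}$ of $t(u,v)$ is missing from the set.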

Definition 4 (Compatible, Free-Connex Acyclic).

A GJT pair is a pair $(T,N)$ with $T$ a GJT and $N$ a connex subset of $T$ . A GCQ $Q$ is compatible with $(T,N)$ if $T$ is a GJT for $Q$ and $\operatorname{\textit{var}}(N)=\operatorname{\textit{out}}(Q)$ . A GCQ is free-connex acyclic if it has a compatible GJT pair.

In particular, every full acyclic GCQ is free-connex acyclic since the entire set of nodes $V$ of a GJT $T$ for $Q$ is a connex set with $\operatorname{\textit{var}}(V)=\operatorname{\textit{out}}(Q)$ . Therefore, $(T,V)$ is a compatible GJT pair for $Q$ .

Example 3.

Let $Q_{2}=\operatorname{\pi}_{y,z,w,u}(Q_{1})$ with $Q_{1}$ the GCQ from Example 2. $Q_{2}$ is free-connex acyclic since it is compatible with the pair $(T_{2},\{\{y,w\},\{y,z,w\},\{u\}\})$ with $T_{2}$ the GJT from Fig. 2. By contrast, $Q_{2}$ is not compatible with any GJT pair containing $T_{1}$ , since any connex set of $T_{1}$ that includes a node with variable $u$ will also include variable $v$ , which is not in $\operatorname{\textit{out}}(Q_{2})$ . Finally, it can be verified that no GJT pair is compatible with $\operatorname{\pi}_{x,u}(Q_{1})$ ; this query is hence not free-connex acyclic.

In Section 6 we show how to efficiently check free-connex acyclicity and compute compatible GJT pairs.

Binary GJTs and sibling-closed connex sets

As we will see in Sections 4 and 5, a GJT pair $(T,N)$ essentially acts as a query plan by which GDyn and IEDyn process queries dynamically. In particular, the GJT $T$ specifies the data structure to be maintained and drives the processing of updates, while the connex set $N$ drives the enumeration of query results.

In order to simplify the presentation of what follows, we will focus exclusively on the class of GJT pairs $(T,N)$ with $T$ a binary GJT and $N$ sibling-closed.

Definition 5 (Binary, Sibling-closed).

A GJT $T$ is binary if every node in it has at most two children. A connex subset $N$ of $T$ is sibling-closed if for every node $n\in N$ with a sibling $m$ in $T$ , $m$ is also in $N$ .

Our interest in limiting to sibling-closed connex sets is due to the following property, which will prove useful for enumerating query results, as explained in Section 4.

Lemma 1.

If $N$ is a sibling-closed connex subset, then $\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(F)$ where $F$ is the frontier of $N$ .

Proof.

Since $F\subseteq N$ the inclusion $\operatorname{\textit{var}}(F)\subseteq\operatorname{\textit{var}}(N)$ is immediate. It remains to prove $\operatorname{\textit{var}}(N)\subseteq\operatorname{\textit{var}}(F)$ . To this end, let $n$ be an arbitrary but fixed node in $N$ . We prove that $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(F)$ by induction on the height of $n$ in $N$ , which is defined as the length of the shortest path from $n$ to a frontier node in $F$ . The base case is where the height is zero, i.e., $n\in F$ , in which case $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(F)$ trivially holds. For the induction step, assume that the height of $n$ is $k>0$ . In particular, $n$ is not a frontier node, and has at least one child in $N$ . Because $N$ is sibling-closed, all children of $n$ are in $N$ . In particular, the guard child $m$ of $n$ is in $N$ and has height at most $k-1$ . By induction hypothesis, $\operatorname{\textit{var}}(m)\subseteq\operatorname{\textit{var}}(F)$ . Then, because $m$ is a guard of $n$ , $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(m)\subseteq\operatorname{\textit{var}}(F)$ , as desired. ∎
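The role of sibling-closedness in Lemma 1 can be seen on a small instance. The sketch below is illustrative only: the tree shape is the assumed layout of $T_{2}$ reconstructed from Section 4, and all helper names are hypothetical. It compares $\operatorname{\textit{var}}(N)$ with $\operatorname{\textit{var}}(F)$ for one connex set that is sibling-closed and one that is not.

```python
# Assumed shape of T_2 (child -> parent) and the variables of each label.
parent = {"{y,z,w}": "{y,w}", "{u}": "{y,w}",
          "r(x,y)": "{y,z,w}", "s(y,z,w)": "{y,z,w}", "t(u,v)": "{u}"}
ROOT = "{y,w}"
label_vars = {"{y,w}": {"y", "w"}, "{y,z,w}": {"y", "z", "w"}, "{u}": {"u"},
              "r(x,y)": {"x", "y"}, "s(y,z,w)": {"y", "z", "w"},
              "t(u,v)": {"u", "v"}}

def siblings(n):
    return [m for m, p in parent.items()
            if n != ROOT and p == parent[n] and m != n]

def is_sibling_closed(N):
    return all(m in N for n in N for m in siblings(n))

def frontier(N):
    inner = {parent[n] for n in N if n != ROOT}
    return {n for n in N if n not in inner}

def var_of(nodes):
    return set().union(*(label_vars[n] for n in nodes))

N_closed = {"{y,w}", "{y,z,w}", "{u}"}   # sibling-closed
N_open = {"{y,w}", "{u}"}                # connex, but not sibling-closed
```

For `N_closed` the frontier carries all variables of the set, as the lemma states. For `N_open` the frontier is the single node $\{u\}$, which loses the variables $y$ and $w$ of the root; this shows why the lemma needs the sibling-closed assumption.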

Let us call a GJT pair $(T,N)$ binary if $T$ is binary, and sibling-closed if $N$ is sibling-closed. We say that two GJT pairs $(T,N)$ and $(T^{\prime},N^{\prime})$ are equivalent if $T$ and $T^{\prime}$ are equivalent and $\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(N^{\prime})$ . Two GJTs $T$ and $T^{\prime}$ are equivalent if $\operatorname{\textit{at}}(T)=\operatorname{\textit{at}}(T^{\prime})$ , the number of times that an atom appears as a label in $T$ equals the number of times that it appears in $T^{\prime}$ , and $\operatorname{\textit{pred}}(T)=\operatorname{\textit{pred}}(T^{\prime})$ .

The following proposition shows that we can always convert an arbitrary GJT pair into an equivalent one that is binary and sibling-closed. As such, we are assured that our focus on binary and sibling-closed GJT pairs is without loss of generality.

Proposition 1.

Every GJT pair can be transformed in polynomial time into an equivalent pair that is binary and sibling-closed.

The rest of this section is devoted to proving Proposition 1. We do so in two steps. First, we show that any pair $(T,N)$ can be transformed in polynomial time into an equivalent sibling-closed pair. Next, we show that any sibling-closed GJT pair $(T,N)$ can be converted in polynomial time into an equivalent binary and sibling-closed pair. Proposition 1 hence follows by composing these two transformations.

Sibling-closed transformation

We say that $n\in T$ is a violator node in a GJT pair $(T,N)$ if $n\in N$ and some, but not all children of $n$ are in $N$ . A violator is of type 1 if some node in $\operatorname{ch}(n)\cap N$ is a guard of $n$ . It is of type 2 otherwise. We now define two operations on $(T,N)$ that remove violators of type 1 and type 2, respectively. The sibling-closed transformation is then obtained by repeatedly applying these operators until all violators are removed.

The first operator is applicable when $n$ is a type 1 violator. It returns the pair $(T^{\prime},N^{\prime})$ obtained as follows:

•

Since $n$ is a type 1 violator, some $g\in\operatorname{ch}_{T}(n)\cap N$ is a child guard of $n$ (i.e., $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(g)$ ).

•

Because every node has a guard, there is some leaf node $l$ that is a descendant guard of $g$ (i.e. $\operatorname{\textit{var}}(g)\subseteq\operatorname{\textit{var}}(l)$ ). Possibly, $l$ is $g$ itself.

•

Now create a new node $p$ between node $l$ and its parent with label $\operatorname{\textit{var}}(p)=\operatorname{\textit{var}}(l)$ . Since $l$ is a descendant guard of $n$ and $g$ , $p$ becomes a descendant guard of $n$ and $g$ as well. Detach all nodes in $\operatorname{ch}(n)\setminus N$ from $n$ and attach them as children to $p$ , preserving their edge labels. This effectively moves all subtrees rooted at nodes in $\operatorname{ch}(n)\setminus N$ from $n$ to $p$ . Denote by $T^{\prime}$ the final result.

•

If $l$ was not in $N$ , then $N^{\prime}=N$ . Otherwise, $N^{\prime}=N\setminus\{l\}\cup\{p\}$ .

We write $(T,N)\xrightarrow{1,n}(T^{\prime},N^{\prime})$ to indicate that $(T^{\prime},N^{\prime})$ can be obtained by applying the above-described operation on node $n$ .

Example 4.

Consider the GJT pair $(T,N)$ from Fig. 3 where $N$ is indicated by the nodes in the shaded area. Let us denote the root node by $n$ and its guard child with label $\{y,z,w\}$ by $g$ . The node $l=h(y,z,w,t)$ is a descendant guard of $g$ . Since $s(y,z,m)$ is not in $N$ , $n$ is a violator of type 1. After applying operation 1 for the choice of guard node $g$ and descendant guard node $l$ , $(T^{\prime},N^{\prime})$ shows the resulting valid sibling-closed GJT pair.

Lemma 2.

Let $n$ be a violator of type $1$ in $(T,N)$ and assume $(T,N)\xrightarrow{1,n}(T^{\prime},N^{\prime})$ . Then $(T^{\prime},N^{\prime})$ is a GJT pair and it is equivalent to $(T,N)$ . Moreover, the number of violators in $(T^{\prime},N^{\prime})$ is strictly smaller than the number of violators in $(T,N)$ .

We prove this lemma in Appendix A. The second operator is applicable when $n$ is a type 2 violator. When applied to $n$ in $(T,N)$ it returns the pair $(T^{\prime},N^{\prime})$ obtained as follows:

•

Since $n$ is a type 2 violator, no node in $\operatorname{ch}_{T}(n)\cap N$ is a guard of $n$ . Since every node has a guard, there is some $g\in\operatorname{ch}(n)\setminus N$ which is a guard of $n$ .

•

Create a new child $p$ of $n$ with label $\operatorname{\textit{var}}(p)=\operatorname{\textit{var}}(n)$ ; detach all nodes in $\operatorname{ch}(n)\setminus N$ (including $g$ ) from $n$ , and add them as children of $p$ , preserving their edge labels. This moves all subtrees rooted at nodes in $\operatorname{ch}(n)\setminus N$ from $n$ to $p$ . Denote by $T^{\prime}$ the final result.

•

Set $N^{\prime}=N\cup\{p\}$ .

We write $(T,N)\xrightarrow{2,n}(T^{\prime},N^{\prime})$ to indicate that $(T^{\prime},N^{\prime})$ was obtained by applying this operation on $n$ .
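The second operator is simple enough to sketch in a few lines. The code below is an illustration under stated assumptions: trees are encoded as child-to-parent dictionaries, edge labels are omitted (they would simply travel with the moved edges), and the example tree is hypothetical rather than the one of Fig. 4.

```python
def remove_type2_violator(parent, var, N, n, fresh="p"):
    """Apply the second operator to node n; returns (parent', var', N')."""
    kids = [c for c, q in parent.items() if q == n]
    in_N = [c for c in kids if c in N]
    out_N = [c for c in kids if c not in N]
    assert n in N and in_N and out_N, "n must be a violator"
    assert not any(var[n] <= var[c] for c in in_N), "n must be of type 2"
    new_parent, new_var = dict(parent), dict(var)
    new_var[fresh] = set(var[n])      # var(p) = var(n)
    new_parent[fresh] = n             # p becomes a child of n
    for c in out_N:                   # move ch(n) \ N below p
        new_parent[c] = fresh
    return new_parent, new_var, set(N) | {fresh}

# Hypothetical GJT pair: root n = {y,z} with a guard atom a(y,z,t) outside N
# and a non-guard child m = {y} inside N, which has the atom b(x,y) as child.
parent = {"a": "n", "m": "n", "b": "m"}
var = {"n": {"y", "z"}, "a": {"y", "z", "t"}, "m": {"y"}, "b": {"x", "y"}}
N = {"n", "m"}

new_parent, new_var, new_N = remove_type2_violator(parent, var, N, "n")
```

After the call, the root has the two children `m` and `p`, both in the new connex set, and the guard atom `a` hangs below `p`. Since `p` carries the same variables as `n`, it is a guard of `n`, and $\operatorname{\textit{var}}(N^{\prime})=\operatorname{\textit{var}}(N)$.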

Example 5.

Consider the GJT pair $(T,N)$ in Fig. 4. Let us denote the root node by $n$ . Since its guard child $h(y,z,w,t)$ is not in $N$ , $n$ is a violator of type 2. After applying operation 2 on $n$ , $(T^{\prime},N^{\prime})$ shows the resulting valid sibling-closed GJT pair.

Lemma 3.

Let $n$ be a violator of type $2$ in $(T,N)$ and assume $(T,N)\xrightarrow{2,n}(T^{\prime},N^{\prime})$ . Then $(T^{\prime},N^{\prime})$ is a GJT pair and it is equivalent to $(T,N)$ . Moreover, the number of violators in $(T^{\prime},N^{\prime})$ is strictly smaller than the number of violators in $(T,N)$ .

The proof can be found in Appendix A.

Proposition 2.

Every GJT pair can be transformed in polynomial time into an equivalent sibling-closed pair.

Proof.

The two operations introduced above remove violators, one at a time. By repeatedly applying these operations until no violator remains we obtain an equivalent pair without violators, which must hence be sibling-closed. Since each operator can clearly be executed in polynomial time and the number of times that we must apply an operator is bounded by the number of nodes in the GJT pair, the removal takes polynomial time. ∎

Binary transformation

Next, we show how to transform a sibling-closed pair $(T,N)$ into an equivalent binary and sibling-closed pair $(T^{\prime},N^{\prime})$ . The idea here is to “binarize” each node $n$ with $k>2$ children as shown in Fig. 5. There, we assume without loss of generality that $c_{1}$ is a guard child of $n$ . The binarization introduces $k-2$ new intermediate nodes $m_{1},\dots,m_{k-2}$ , all with $\operatorname{\textit{var}}(m_{i})=\operatorname{\textit{var}}(n)$ . Note that, since $c_{1}$ is a guard of $n$ and $\operatorname{\textit{var}}(m_{i})=\operatorname{\textit{var}}(n)$ , it is straightforward to see that $c_{1}$ will be a guard of $m_{1}$ , which will be a guard of $m_{2}$ , which will be a guard of $m_{3}$ , and so on. Finally, $m_{k-2}$ will be a guard of $n$ . The connex set $N$ is updated as follows. If none of $n$ ’s children are in $N$ , i.e., $n$ is a frontier node, set $N^{\prime}=N$ . Otherwise, since $N$ is sibling-closed, all children of $n$ are in $N$ , and we set $N^{\prime}=N\cup\{m_{1},\dots,m_{k-2}\}$ . Clearly, $N^{\prime}$ remains a sibling-closed connex subset of $T^{\prime}$ and $\operatorname{\textit{var}}(N^{\prime})=\operatorname{\textit{var}}(N)$ . We may hence conclude:

Lemma 4.

By binarizing a single node in a sibling-closed GJT pair $(T,N)$ as shown in Fig. 5, we obtain an equivalent GJT pair $(T^{\prime},N^{\prime})$ that has strictly fewer non-binary nodes than $(T,N)$ .

Binarizing a single node is a polynomial-time operation. Then, by iteratively binarizing non-binary nodes until all nodes have become binary we hence obtain:

Proposition 3.

Every sibling-closed GJT pair can be transformed in polynomial time into an equivalent, binary and sibling-closed pair.
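A possible rendering of the binarization step is sketched below. Fig. 5 is not reproduced here, so the exact arrangement of the intermediate nodes is an assumption inferred from the guard chain described above: $m_{1}$ receives $c_{1}$ and $c_{2}$, each further $m_{i}$ receives $m_{i-1}$ and $c_{i+1}$, and $n$ keeps $m_{k-2}$ and $c_{k}$. The dictionary encoding and the names are hypothetical, and edge labels are omitted.

```python
def binarize_node(children, var, N, n):
    """Binarize node n; returns (children', var', N')."""
    kids = children.get(n, [])
    k = len(kids)
    if k <= 2:
        return children, var, N
    # Without loss of generality, put a guard child first (c_1).
    guard = next(c for c in kids if var[n] <= var[c])
    kids = [guard] + [c for c in kids if c != guard]
    children, var, N = dict(children), dict(var), set(N)
    in_N = any(c in N for c in kids)   # sibling-closed: all or none
    prev = kids[0]
    for i in range(1, k - 1):          # creates m_1, ..., m_{k-2}
        m = f"{n}.m{i}"
        var[m] = set(var[n])           # var(m_i) = var(n)
        children[m] = [prev, kids[i]]
        if in_N:
            N.add(m)
        prev = m
    children[n] = [prev, kids[-1]]
    return children, var, N

# Hypothetical node n with four children, of which c1 is a guard.
children = {"n": ["c2", "c1", "c3", "c4"]}
var = {"n": {"y"}, "c1": {"y", "z"}, "c2": {"a"}, "c3": {"b"}, "c4": {"c"}}
N = {"n", "c1", "c2", "c3", "c4"}

new_children, new_var, new_N = binarize_node(children, var, N, "n")
```

Here two intermediate nodes are created for a node with four children, every node ends up with at most two children, and the new nodes add no variables to the connex set.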

4 Dynamic joins with equalities and inequalities: an example

In this section we illustrate how to dynamically process free-connex acyclic GCQs when all predicates are inequalities $(\leq,<,\geq,>)$ . We do so by means of an extensive example that shows the indexing structures and GMRs. The definitions and algorithms (that apply to arbitrary $\theta$ -joins) will be formally presented in Section 5.

Throughout this section we consider the following query $Q$ , which is free-connex acyclic (see Example 3):

[TABLE]

Let $T_{2}$ be the GJT from Fig. 2. We process $Q$ based on a $T_{2}$ -reduct, a data structure that succinctly represents the output of $Q$ . For every node $n$ , define $\operatorname{\textit{pred}}(n)$ as the set of all predicates on outgoing edges of $n$ , i.e. $\operatorname{\textit{pred}}(n)=\bigcup_{c\text{ child of }n}\operatorname{\textit{pred}}(n\to c).$

Definition 6 ( $T$ -reduct).

Let $T$ be a GJT for a query $Q$ and let $\operatorname{\textit{db}}$ be a database over $\operatorname{\textit{at}}(Q)$ . The $T$ -reduct (or semi-join reduction) of $\operatorname{\textit{db}}$ is a collection $\rho$ of GMRs, one GMR $\rho_{n}$ for each node $n\in T$ , defined inductively as follows:

-

if $n=r(\overline{x})$ is an atom, then $\rho_{n}=\operatorname{\textit{db}}_{r(\overline{x})}$ ;

-

if $n$ has a single child $c$ , then $\rho_{n}=\pi_{\operatorname{\textit{var}}(n)}\sigma_{\operatorname{\textit{pred}}(n)}\rho_{c}$ ;

-

otherwise, $n$ has two children $c_{1}$ and $c_{2}$ . In this case we have $\rho_{n}=\pi_{\operatorname{\textit{var}}(n)}\left(\rho_{c_{1}}\Join_{\operatorname{\textit{pred}}(n)}\rho_{c_{2}}\right)$ .

Fig. 6 depicts an example database (top) and its $T_{2}$ -reduct $\rho$ (bottom). Note, for example, that the only tuple in the GMR at the root $\rho_{\{y,w\}}$ is obtained by joining $\rho_{\{y,z,w\}}$ and $\rho_{\{u\}}$ , restricting to $w<u$ , and projecting onto $\{y,w\}$ .

It is important to observe that the size of a $T$ -reduct of a database $\operatorname{\textit{db}}$ can be at most linear in the size of $\operatorname{\textit{db}}$ . The reason is that, as illustrated in Fig. 6, for each node $n$ there is some descendant atom $\alpha$ (possibly $n$ itself) such that $\operatorname{supp}(\rho_{n})\subseteq\operatorname{supp}(\operatorname{\pi}_{\operatorname{\textit{var}}(n)}\operatorname{\textit{db}}_{\alpha})$ . Note that $Q(\operatorname{\textit{db}})$ , in contrast, can easily become polynomial in the size of $\operatorname{\textit{db}}$ in the worst case.
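Definition 6 can be followed bottom-up in a few lines of code. The sketch below is a naive illustration, not the maintenance procedure of the paper: it assumes the layout of $T_{2}$ suggested by the running example (predicate $x<z$ below $\{y,z,w\}$ and $w<u$ below the root), stores GMRs as dictionaries, and uses a small hypothetical database instead of the one in Fig. 6.

```python
from collections import defaultdict

def tup(**kv):
    """A tuple as a sorted sequence of (variable, value) pairs."""
    return tuple(sorted(kv.items()))

def project(R, variables):
    out = defaultdict(int)
    for t, m in R.items():
        out[tuple((k, v) for k, v in t if k in variables)] += m
    return dict(out)

def join(R1, R2, pred):
    """Natural join of two GMRs, filtered by pred; multiplicities multiply."""
    out = defaultdict(int)
    for t1, m1 in R1.items():
        d1 = dict(t1)
        for t2, m2 in R2.items():
            d2 = dict(t2)
            if all(d1[k] == d2[k] for k in d1.keys() & d2.keys()):
                merged = {**d1, **d2}
                if pred(merged):
                    out[tuple(sorted(merged.items()))] += m1 * m2
    return dict(out)

def reduct(db):
    """T_2-reduct, computed leaf-to-root for the assumed tree shape."""
    rho = {"r": db["r"], "s": db["s"], "t": db["t"]}
    rho["yzw"] = project(
        join(rho["r"], rho["s"], lambda d: d["x"] < d["z"]), {"y", "z", "w"})
    rho["u"] = project(rho["t"], {"u"})
    rho["yw"] = project(
        join(rho["yzw"], rho["u"], lambda d: d["w"] < d["u"]), {"y", "w"})
    return rho

# Hypothetical database; multiplicities are the dictionary values.
db = {
    "r": {tup(x=1, y=1): 2, tup(x=5, y=1): 1, tup(x=2, y=2): 2},
    "s": {tup(y=1, z=3, w=3): 3, tup(y=2, z=4, w=6): 1},
    "t": {tup(u=5, v=1): 1, tup(u=5, v=2): 2, tup(u=2, v=7): 1},
}
rho = reduct(db)
```

In this instance the tuple $\langle y:2,z:4,w:6\rangle$ survives in $\rho_{\{y,z,w\}}$ but finds no partner with $w<u$, so the root contains a single tuple. Every GMR in the reduct has at most as many tuples as some atom of the database, in line with the linear-size observation above.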

Enumeration

From a $T$ -reduct we can enumerate the result $Q(\operatorname{\textit{db}})$ rather naively simply by recomputing the query results, in particular because we have access to the complete database in the leaves of $T$ . We would like, however, to make the enumeration as efficient as possible. To this end, we equip $T$ -reducts with a set of indices. To avoid the space cost of materialization, we do not want the indices to use more space than the $T$ -reduct itself (i.e., linear in $\operatorname{\textit{db}}$ ). We illustrate these ideas in our running example by introducing a simple set of indices that allow for efficient enumeration.

Let $N=\{\{y,w\},\{y,z,w\},\{u\}\}$ be the connex subset of $T_{2}$ satisfying $\operatorname{\textit{var}}(N)=\operatorname{\textit{out}}(Q)=\{y,z,w,u\}$ . $(T_{2},N)$ is compatible with $Q$ , binary and sibling-closed. We rely on the sibling-closed property of $N$ to enumerate query results, and can do so without loss of generality by Proposition 1. To enumerate the query results, we will traverse top-down the nodes in $N$ . The traversal works as follows: for each tuple $\vec{t_{1}}$ in $\rho_{\{y,w\}}$ , we consider all tuples $\vec{t_{2}}$ in $\rho_{\{y,z,w\}}$ that are compatible with $\vec{t_{1}}$ , and all tuples $\vec{t_{3}}\in\rho_{\{u\}}$ that are compatible with $\vec{t_{1}}$ . Compatibility here means that the corresponding equalities and inequalities are satisfied. Then, for each pair $(\vec{t_{2}},\vec{t_{3}})$ , we output the tuple $\vec{t_{2}}\cup\vec{t_{3}}$ with multiplicity $\rho_{\{y,z,w\}}(\vec{t_{2}})\times\rho_{\{u\}}(\vec{t_{3}})$ . A crucial difference here with naive recomputation is that, since $\rho_{\{y,w\}}$ is already a join between $\rho_{\{y,z,w\}}$ and $\rho_{\{u\}}$ , we will only iterate over relevant tuples: each tuple that we iterate over will produce a new output tuple. For example, we will never look at the tuple $\langle y:2,z:4,w:6\rangle$ in $\rho_{\{y,z,w\}}$ because it does not have a compatible tuple at the root.

To implement this enumeration strategy efficiently, we desire index structures on $\rho_{\{y,z,w\}}$ and $\rho_{\{u\}}$ that allow us to enumerate, for a given tuple $\vec{t_{1}}$ in $\rho_{\{y,w\}}$ , all compatible tuples $\vec{t_{2}}\in\rho_{\{y,z,w\}}$ (resp. $\vec{t_{3}}\in\rho_{\{u\}}$ ) with constant delay. In the case of $\rho_{\{u\}}$ this is achieved simply by keeping $\rho_{\{u\}}$ sorted decreasingly on variable $u$ . Given tuple $\vec{t_{1}}$ , we can enumerate the compatible tuples from $\rho_{\{u\}}$ by iterating over its tuples one by one in a decreasing manner, starting from the largest value of $u$ , and stopping whenever the current $u$ value is smaller than or equal to the $w$ value in $\vec{t_{1}}$ . For indexing $\rho_{\{y,z,w\}}$ we use a more standard index. Since we need to enumerate all tuples that have the same $y$ and $w$ value as $\vec{t_{1}}$ , CDE can be achieved by using a hash-based index on $y$ and $w$ . This index is depicted as $I_{\rho_{\{y,z,w\}}}$ in Fig. 7. We can see that, since the described indices provide CDE of the compatible tuples given $\vec{t_{1}}$ , our strategy provides enumeration of $Q(\operatorname{\textit{db}})$ with constant delay if we assume the query to be fixed (i.e., in data complexity Vardi:1982 ).
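The enumeration strategy just described can be sketched as a Python generator. The GMRs below are hypothetical (they are not the contents of Fig. 6 or Fig. 7) but are mutually consistent with Definition 6 for the predicate $w<u$; the variable names and the dictionary-based representation are likewise assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical reduct GMRs: rho_yzw keyed by (y, z, w), rho_u keyed by u,
# rho_yw keyed by (y, w). The root value 35 equals (6 + 1) * (3 + 2).
rho_yzw = {(1, 3, 3): 6, (1, 4, 3): 1, (2, 4, 6): 2}
rho_u = {5: 3, 4: 2, 2: 1}
rho_yw = {(1, 3): 35}

# Index on rho_u: tuples sorted descending on u.
u_sorted = sorted(rho_u.items(), reverse=True)

# Hash index on rho_yzw, keyed by the variables shared with the root (y, w).
index_yzw = defaultdict(list)
for (y, z, w), m in rho_yzw.items():
    index_yzw[(y, w)].append(((y, z, w), m))

def enumerate_Q():
    """Yield ((y, z, w, u), multiplicity) for every output tuple."""
    for (y, w) in rho_yw:                      # top-down from the root
        for t2, m2 in index_yzw[(y, w)]:       # same y and w as the root tuple
            for u, m3 in u_sorted:             # largest u first
                if u <= w:                     # w < u no longer holds: stop
                    break
                yield (t2 + (u,), m2 * m3)

result = dict(enumerate_Q())
```

Every tuple visited by the loops contributes an output tuple: the tuple $\langle y:2,z:4,w:6\rangle$ is never touched because $(2,6)$ does not occur at the root, and the scan over `u_sorted` stops at the first value that violates $w<u$. The multiplicities of the enumerated tuples sum to the multiplicity stored at the root.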

Updates

Next we illustrate how to process updates. The objective here is to transform the $T_{2}$ -reduct of $\operatorname{\textit{db}}$ into a $T_{2}$ -reduct of $\operatorname{\textit{db}}+u$ , where $u$ is the received update. To do this efficiently we use additional indexes on $\rho$ . We present the intuitions behind these indices with an update consisting of two insertions: $\langle y\colon 2,z\colon 3,w\colon 6\rangle$ with multiplicity $2$ and $\langle u\colon 4,v\colon 9\rangle$ with multiplicity $3$ . Fig. 7 depicts the update process highlighting the modifications caused by the update.

Let us first discuss how to process the tuple $\vec{t_{1}}=\langle y\colon 2,z\colon 3,w\colon 6\rangle$ . We proceed bottom-up, starting at $\rho_{s}$ which is itself affected by the insertion of $\vec{t_{1}}$ . Subsequently, we need to propagate the modification of $\rho_{s}$ to its ancestors $\rho_{\{y,z,w\}}$ and $\rho_{\{y,w\}}$ . Concretely, from the definition of $T$ -reduct, it follows that inserting $\vec{t_{1}}$ requires the following modifications to $\rho_{s}$ , $\rho_{\{y,z,w\}}$ , and $\rho_{\{y,w\}}$ :

$\Delta\rho_{s}=[\vec{t_{1}}\mapsto 2]$ ,
$\Delta\rho_{\{y,z,w\}}=\operatorname{\pi}_{y,z,w}\left(\rho_{r}\Join_{x<z}\Delta\rho_{s}\right)$ ,
$\Delta\rho_{\{y,w\}}=\operatorname{\pi}_{y,w}\left(\Delta\rho_{\{y,z,w\}}\Join_{w<u}\rho_{\{u\}}\right)$ .

To compute the joins on the right-hand sides efficiently, we create a number of additional indexes on $\rho_{r},\rho_{s}$ , and $\rho_{\{y,z,w\}}$ . Concretely, in order to efficiently compute $\operatorname{\pi}_{y,z,w}\left(\rho_{r}\Join_{x<z}\Delta\rho_{s}\right)$ , we group tuples in the GMR $\rho_{r}$ by the variables that $\rho_{r}$ has in common with $\rho_{s}$ (in this case $y$ ) and then, per group, sort tuples ascending on variable $x$ . We mark grouping variables in Fig. 7 with $*$ (e.g. $y^{*}$ ), and sorting by $\downarrow$ (for ascending, e.g., $x_{\downarrow}$ ) and $\uparrow$ (for descending). A hash index on the grouping variables (denoted $I_{\rho_{r}}$ in Fig. 7) then allows to find the group given a $y$ value. The join can then be processed by means of a hybrid form of sort-merge and index nested loop join. Sort $\Delta\rho_{s}$ ascendingly on $y$ and $z$ . For each $y$ -group in $\Delta\rho_{s}$ find the corresponding group in $\rho_{r}$ by passing the $y$ value to the index $I_{\rho_{r}}$ . Let $\vec{t^{\prime}}$ be the first tuple in the $\Delta\rho_{s}$ group. Then iterate over the tuples of the $\rho_{r}$ group in the given order and sum up their multiplicities until $x$ becomes larger than $\vec{t^{\prime}}(z)$ . Add $\vec{t^{\prime}}$ to the result with its original multiplicity multiplied by the found sum (provided it is non-zero). Then consider the next tuple in the $\Delta\rho_{s}$ group, and continue summing from the current tuple in the $\rho_{r}$ group until $x$ becomes again larger than $z$ , and add the result tuple with the correct multiplicity. Continue repeating this process for each tuple in the $\Delta\rho_{s}$ group, and for each group in $\Delta\rho_{s}$ . In our case, there is only one group in $\Delta\rho_{s}$ (given by $y=2$ ) and we will only iterate over the tuple $\langle x\colon 2,y\colon 2\rangle$ in $\rho_{r}$ , obtaining a total multiplicity of 2, and therefore compute $\Delta\rho_{\{y,z,w\}}=[\vec{t_{1}}\to 4]$ . 
In order to compute the join $\operatorname{\pi}_{y,w}\left(\Delta\rho_{\{y,z,w\}}\Join_{w<u}\rho_{\{u\}}\right)$ efficiently, we proceed similarly. Here, however, there are no grouping variables on $\rho_{\{u\}}$ and it hence suffices to sort $\rho_{\{u\}}$ descendingly on $u$ . Note that this was actually already required for efficient enumeration. Also note that $\Delta\rho_{\{y,w\}}$ is empty.
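
The hybrid join described above can be sketched in Python as follows. This is only an illustrative sketch under our own assumptions: GMRs are modelled as dictionaries from tuples to multiplicities, the function names `build_group_sorted_index` and `delta_join` are hypothetical, and the sample data is invented rather than taken from Fig. 7.

```python
from collections import defaultdict

def build_group_sorted_index(rho_r):
    """Group tuples (x, y) of rho_r by y and sort each group ascending on x."""
    index = defaultdict(list)
    for (x, y), mult in rho_r.items():
        index[y].append((x, mult))
    for group in index.values():
        group.sort()
    return index

def delta_join(index_r, delta_s):
    """Compute pi_{y,z,w}(rho_r JOIN_{x<z} delta_s) as a dict tuple -> multiplicity."""
    by_y = defaultdict(list)
    for (y, z, w), mult in delta_s.items():
        by_y[y].append((z, w, mult))
    result = {}
    for y, s_group in by_y.items():
        r_group = index_r.get(y, [])   # hash lookup of the matching rho_r group
        s_group.sort()                 # ascending on z within the y-group
        pos, running = 0, 0            # cursor into r_group, summed multiplicity
        for z, w, mult in s_group:
            # Advance the cursor while x < z; it never moves backwards.
            while pos < len(r_group) and r_group[pos][0] < z:
                running += r_group[pos][1]
                pos += 1
            if running != 0:
                result[(y, z, w)] = mult * running
    return result

# Hypothetical data: rho_r over (x, y), delta_s over (y, z, w).
rho_r = {(1, 1): 1, (2, 2): 2, (5, 2): 3}
delta_s = {(2, 3, 7): 2, (2, 6, 1): 1}
joined = delta_join(build_group_sorted_index(rho_r), delta_s)
```

Because the cursor into each $\rho_{r}$ group only moves forward, every tuple of the group is visited at most once per $y$ -group of the update, which is what makes the join cheaper than a nested loop.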

Now we discuss how to process $\vec{t_{2}}=\langle u\colon 4,v\colon 9\rangle$ . First, we insert $\vec{t_{2}}$ into $\rho_{t}$ . We need to propagate this change to the parent $\rho_{\{u\}}$ by calculating $\Delta\rho_{\{u\}}=\operatorname{\pi}_{u}\Delta\rho_{t}$ . This is done by a simple hash-based aggregation. Finally, we need to propagate $\Delta\rho_{\{u\}}$ to the root by computing $\Delta\rho_{\{y,w\}}=\pi_{y,w}(\rho_{\{y,z,w\}}\Join_{w<u}\Delta\rho_{\{u\}})$ . To process this join efficiently we proceed as before. Again, there are no grouping variables on $\rho_{\{y,z,w\}}$ (since it has no variables in common with $\rho_{\{u\}}$ ) and it hence suffices to sort $\rho_{\{y,z,w\}}$ ascending on $w$ . The only tuple that we iterate over during the hybrid join is $\langle y\colon 1,z\colon 3,w\colon 3\rangle$ , which has multiplicity 12. Hence, we have $\Delta\rho_{\{y,w\}}=[\langle y\colon 1,w\colon 3\rangle\mapsto 36]$ , concluding the example.

5 Dynamic Yannakakis Over GCQs

Dynamic Yannakakis (Dyn) is an algorithm to efficiently evaluate free-connex acyclic aggregate-equijoin queries under updates dyn:2017 . This algorithm matches two important theoretical lower bounds (for q-hierarchical CQs Berkholz:2017 and free-connex acyclic CQs Bagan:2007 ), and is highly efficient in practice. In this section we present a generalization of Dyn, called GDyn, to dynamically process free-connex acyclic GCQs. Since predicates in a GCQ can be arbitrary, our approach is purely algorithmic; the efficiency with which GDyn processes updates and produces results depends entirely on the efficiency of the underlying data structures. Here we only describe the properties that those data structures should satisfy and present the general (worst-case) complexity of the algorithm. The techniques and indices presented in the previous section provide a practical instantiation of GDyn for a GCQ with equalities and inequalities, and throughout this section we draw a parallel between that instantiation and the more abstract definitions of GDyn.

In this section we assume that $Q$ is a free-connex acyclic GCQ and that $(T,N)$ is a binary and sibling-closed GJT pair compatible with $Q$ . Like in the case of equalities and inequalities, the dynamic processing of $Q$ will be based on a $T$ -reduct of the current database $\operatorname{\textit{db}}$ . A set of indices will be added to optimize the enumeration of query results and maintenance of the $T$ -reduct under updates. We formalize the notion of index as follows:

Definition 7 (Index).

Let $R$ be a GMR over $\overline{x}$ , let $\overline{y}$ be a hyperedge, let $\overline{w}$ be a hyperedge satisfying $\overline{w}\subseteq\overline{x}\cup\overline{y}$ , and let $\theta(\overline{z})$ be a predicate with $\overline{z}\subseteq\overline{x}\cup\overline{y}$ . An index on $R$ by $(\theta,\overline{y},\overline{w})$ with delay $f$ is a data structure $I$ that provides, for any given GMR $R_{\overline{y}}$ over $\overline{y}$ , enumeration of $\operatorname{\pi}_{\overline{w}}(R\bowtie_{\theta}R_{\overline{y}})$ with delay $O(f(|{R}|+|{R_{\overline{y}}}|))$ . The update time of index $I$ is the time required to update $I$ to an index on $R+\Delta{R}$ (by $(\theta,\overline{y},\overline{w})$ ) given update $\Delta{R}$ to $R$ .

For example, $I_{\rho_{r}}$ in Fig. 7 is used as an index on $\rho_{r}$ by $(x<z,\{y,z,w\},\{y,z,w\})$ . Indeed, in the previous section we discussed precisely how $I_{\rho_{r}}$ allows us to efficiently compute $\pi_{y,z,w}(\rho_{r}\Join_{x<z}\Delta{\rho_{s}})$ for an update $\Delta{\rho_{s}}$ to $\rho_{s}$ . With the notion of index in place, we discuss how GDyn enumerates query results and processes updates.

Enumeration

Let $\operatorname{\textit{db}}$ be the current database. To enumerate $Q(\operatorname{\textit{db}})$ from a $T$ -reduct $\rho$ of $\operatorname{\textit{db}}$ we can iterate over the reductions $\rho_{n}$ with $n\in N$ in a nested fashion, starting at the root and proceeding top-down. When $n$ is the root, we iterate over all tuples in $\rho_{n}$ . For every such tuple $\vec{t}$ , we iterate only over the tuples in the children $c$ of $n$ that are compatible with $\vec{t}$ (i.e., tuples in $\rho_{c}$ that join with $\vec{t}$ and satisfy $\operatorname{\textit{pred}}(n\to c)$ ). This procedure continues until we reach nodes in the frontier of $N$ at which time the output tuple can be constructed. The pseudocode is given in Algorithm 1, where the tuples that are compatible with $\vec{t}$ are computed by $\rho_{c}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(n\to c)}\vec{t}$ .
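
The nested top-down traversal can be sketched as follows. This is not Algorithm 1 itself but a simplified illustration under our own assumptions: tuples are Python dictionaries from variables to values, multiplicities are ignored, compatible tuples are found by a naive scan where an index would be used, and per-child results are materialized in lists, so the sketch does not reflect the delay guarantees. The names `compatible`, `enum_node` and `enum_all` are hypothetical.

```python
from itertools import product

def compatible(t, s, pred):
    """s joins with t: equal on shared variables and the edge predicate holds."""
    shared = t.keys() & s.keys()
    return all(t[v] == s[v] for v in shared) and pred(t, s)

def enum_node(n, t, rho, children, preds):
    """Yield the assignments that extend tuple t of node n over the part of N below n."""
    kids = children.get(n, [])
    if not kids:                       # frontier of N: t itself is the output
        yield dict(t)
        return
    per_child = []
    for c in kids:
        pred = preds.get((n, c), lambda a, b: True)
        # Naive scan; an index on rho[c] would replace this step.
        matches = [s for s in rho[c] if compatible(t, s, pred)]
        per_child.append([out for s in matches
                          for out in enum_node(c, s, rho, children, preds)])
    for combo in product(*per_child):  # combine the outputs of all children
        merged = dict(t)
        for part in combo:
            merged.update(part)
        yield merged

def enum_all(root, rho, children, preds):
    """Iterate over the root reduct and expand every tuple top-down."""
    for t in rho[root]:
        yield from enum_node(root, t, rho, children, preds)

# Hypothetical two-node connex set: root over {w}, child over {u}, predicate w < u.
rho = {"root": [{"w": 3}, {"w": 5}], "child": [{"u": 4}, {"u": 6}]}
children = {"root": ["child"]}
preds = {("root", "child"): lambda t, s: t["w"] < s["u"]}
outputs = list(enum_all("root", rho, children, preds))
```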

Now we show the correctness of the enumeration algorithm, for which we need to introduce some further notation. Let $Q$ , $T$ and $N$ be as above. Given a node $n\in T$ we denote the sub-tree of $T$ rooted at $n$ by $T_{n}$ , and define the query induced by $T_{n}$ as

[TABLE]

where $\operatorname{\textit{at}}(T_{n})$ and $\operatorname{\textit{pred}}(T_{n})$ are the sets of all atoms and predicates occurring in $T_{n}$ , respectively.

Lemma 5.

Let $Q$ , $T$ , $N$ , and $n$ be defined as above, and let $\rho$ be a $T$ -reduct for $Q$ . Then, $\rho_{n}=\pi_{\operatorname{\textit{var}}(n)}Q_{n}(\operatorname{\textit{db}})$ .

The proof by induction is detailed in Appendix B. To show correctness of enumeration, we need the following additional lemma regarding the subroutine of Algorithm 1 (Line 3). The proof is again by induction and detailed in Appendix B.

Lemma 6.

Let $Q$ , $T$ , and $N$ be as above. If $\rho$ is a $T$ -reduct of $\operatorname{\textit{db}}$ , then for every node $n\in N$ and every tuple $\vec{t}$ in $\rho_{n}$ , $\operatorname{\textsc{enum}}_{T,N}(n,\vec{t},\rho)$ correctly enumerates $\pi_{\operatorname{\textit{var}}(N)\cap\operatorname{\textit{var}}(Q_{n})}Q_{n}(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{t}$ .

Proposition 4.

Let $Q$ , $T$ , $N$ and $\rho$ be as above. Then $\operatorname{\textsc{enum}}_{T,N}(\rho)$ enumerates $Q(\operatorname{\textit{db}})$ .

Proof.

Let $r$ be the root of $T$ . By Lemma 5 we have $\rho_{r}=\pi_{\operatorname{\textit{var}}(r)}Q_{r}(\operatorname{\textit{db}})=\pi_{\operatorname{\textit{var}}(r)}Q(\operatorname{\textit{db}})$ , and therefore $\rho_{r}$ is a projection of $Q(\operatorname{\textit{db}})$ . This implies that $Q(\operatorname{\textit{db}})=Q(\operatorname{\textit{db}})\operatorname{\ltimes}\rho_{r}$ , which is equivalent to the disjoint union $\bigcup_{\vec{t}\in\rho_{r}}Q(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{t}$ . By Lemma 6, this is exactly what $\operatorname{\textsc{enum}}_{T,N}(\rho)$ enumerates. ∎

We now analyze the complexity of $\operatorname{\textsc{enum}}_{T,N}$ . First, observe that by definition of $T$ -reducts, compatible tuples will exist at every node. Hence, every tuple that we iterate over will eventually produce a new output tuple. This ensures that we do not risk wasting time in iterating over tuples that in the end yield no output. As such, the time needed for $\operatorname{\textsc{enum}}_{T,N}(\rho)$ to produce a single new tuple is determined by the time taken to enumerate the tuples in $\rho_{n}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(p\to n)}\vec{t}$ , where $p$ is the parent of $n$ . Since this is equivalent to $\operatorname{\pi}_{\operatorname{\textit{var}}(n)}(\rho_{n}\Join_{\operatorname{\textit{pred}}(p\to n)}\vec{t})$ we can do this efficiently by creating an index on $\rho_{n}$ by $(\operatorname{\textit{pred}}(p\rightarrow n),\operatorname{\textit{var}}(p),\operatorname{\textit{var}}(n))$ . For example, in Section 4 we defined hash-maps and group-sorted GMRs so that given one tuple from a parent we could enumerate the compatible tuples in the child with constant delay. In general, the efficiency of enumeration will depend on the delay provided by the indices.

Proposition 5.

Assume that for every $n\in N$ we have an index on $\rho_{n}$ by $(\operatorname{\textit{pred}}(p\to n),\operatorname{\textit{var}}(p),\operatorname{\textit{var}}(n))$ with delay $f$ , where $p$ is the parent of $n$ and $f$ is a monotone function. Then, using these indices, $\operatorname{\textsc{enum}}_{T,N}(\rho)$ correctly enumerates $Q(\operatorname{\textit{db}})$ with delay $O(|N|\times f(M))$ where $M$ is given by $\max_{n\in N}(|{\rho_{n}}|)$ . Thus, the total time required to execute $\operatorname{\textsc{enum}}_{T,N}(\rho)$ is $O(|{Q(\operatorname{\textit{db}})}|\cdot f(M)\cdot|{N}|)$ .

Proof.

We show that for every $n\in N$ and $\vec{t}\in\rho_{n}$ , the call $\operatorname{\textsc{enum}}_{T,N}(n,\vec{t},\rho)$ enumerates $\pi_{\operatorname{\textit{var}}(N)}Q_{n}(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{t}$ with delay $O(|N\cap T_{n}|\times f(M))$ . We proceed by induction on $|N|$ . If $|N|=1$ then $N=\{\operatorname{root}(T)\}$ and the delay is clearly constant as the algorithm will only yield $\vec{t}$ . Now assume that $|N|>1$ . If $n$ has a single child $c$ , the index on $\rho_{c}$ by $(\operatorname{\textit{pred}}(n),\operatorname{\textit{var}}(n),\operatorname{\textit{var}}(c))$ allows us to iterate over $\rho_{c}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(n)}\vec{t}$ with delay $O(f(|\rho_{c}|))$ and therefore delay $O(f(M))$ . For each element $\vec{s}$ of this enumeration, the algorithm calls $\operatorname{\textsc{enum}}_{T,N}(c,\vec{s},\rho)$ , which by induction hypothesis enumerates $\pi_{\operatorname{\textit{var}}(N)}Q_{c}(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{s}$ with delay $O(|N\cap T_{c}|\times f(M))$ . Then, the maximum delay between two outputs is $O(f(|\rho_{c}|))+O(|N\cap T_{c}|\times f(M))$ , and since $|\rho_{c}|\leq M$ this is in

$$O\big(f(M)+|N\cap T_{c}|\times f(M)\big)=O(|N\cap T_{n}|\times f(M)).$$

The final observation is that the sets $\pi_{\operatorname{\textit{var}}(N)}Q_{c}(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{s}$ are disjoint for different values of $\vec{s}$ , and thus the enumeration does not produce repeated values.

For the case in which $n$ has two children $c_{1}$ and $c_{2}$ , by similar reasoning it is easy to show that the maximum delay between two outputs is

[TABLE]

It is also important to mention that the sets enumerated by $\operatorname{\textsc{enum}}_{T,N}(c_{i},\vec{t_{i}},\rho)$ are disjoint for each $\vec{t_{i}}\in\rho_{c_{i}}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(n\rightarrow c_{i})}\vec{t}$ ( $i\in\{1,2\}$ ), and that for each $(\vec{s_{1}},\mu)\in\operatorname{\textsc{enum}}_{T,N}(c_{1},\vec{t_{1}},\rho)$ and $(\vec{s_{2}},\mu)\in\operatorname{\textsc{enum}}_{T,N}(c_{2},\vec{t_{2}},\rho)$ , it is the case that $\vec{s_{1}}$ and $\vec{s_{2}}$ are compatible, thus producing outputs in every iteration. ∎

In particular, if $f$ is constant we enumerate $Q(\operatorname{\textit{db}})$ with delay $O(|N|)$ (i.e., constant in data complexity).

Update processing

To allow enumeration of $Q(\operatorname{\textit{db}})$ under updates to $\operatorname{\textit{db}}$ we need to maintain the $T$ -reduct $\rho$ (and, if present, its indexes) up to date. As illustrated in the previous section, it suffices to traverse the nodes of $T$ in a bottom-up fashion. At each node $n$ we have to compute the delta of $\rho_{n}$ . For leaf nodes, this delta is given by the update $\operatorname{\mathit{u}}$ itself. For interior nodes, the delta can be computed from the delta and original reduct of its children. Algorithm 2 gives the pseudocode.

The fundamental part of Algorithm 2 is to compute joins and produce delta GMRs (Line 10), propagating updates from each node to its parent. When there is an update $\Delta_{n}$ to a node $n$ with sibling $m$ and parent $p$ , we need to compute $\operatorname{\pi}_{\operatorname{\textit{var}}(p)}\left(\rho_{m}\Join_{\operatorname{\textit{pred}}(p)}\Delta_{n}\right)$ . To do this efficiently, we naturally store an index on $\rho_{m}$ by $(\operatorname{\textit{pred}}(p),\operatorname{\textit{var}}(n),\operatorname{\textit{var}}(p))$ . For example, we discussed how the hash-map $I_{\rho_{r}}$ in Fig. 7 plus the sorting on $x$ of $\rho_{r}$ allowed us to efficiently compute $\operatorname{\pi}_{y,z,w}(\rho_{r}\Join_{x<z}\Delta\rho_{s})$ .

Summarizing, to efficiently enumerate query results and process updates we need to store a $T$ -reduct plus a set of indices on its GMRs. The data structure containing these elements is called a $(T,N)$ -representation.

Definition 8 ( $(T,N)$ -representation).

Let $\operatorname{\textit{db}}$ be a database. A $(T,N)$ -representation ( $(T,N)$ -rep for short) of $\operatorname{\textit{db}}$ consists of a $T$ -reduct of $\operatorname{\textit{db}}$ and, for each node $n$ with parent $p$ , the following set of indices:

- If $n$ belongs to $N$ , then we store an index $P_{n}$ on $\rho_{n}$ by $(\operatorname{\textit{pred}}(p\rightarrow n),\operatorname{\textit{var}}(p),\operatorname{\textit{var}}(n))$ .

- If $n$ is a node with a sibling $m$ , then we store an index $S_{n}$ on $\rho_{n}$ by $(\operatorname{\textit{pred}}(p),\operatorname{\textit{var}}(m),\operatorname{\textit{var}}(p))$ .

Together with the notion of $(T,N)$ -rep, Algorithms 1 and 2 provide a framework for dynamic query evaluation. By constructing the $T$ -reduct and set of indices (and their update procedures) one can process free-connex acyclic GCQs under updates. Naturally, to implement such a framework one needs to devise indices for a particular set of predicates. For example, Dyn is an instantiation for the class of CQs, and in the previous section we showed how to instantiate this framework for a GCQ based on equalities and inequalities. Next, we present the general set of indices required to process free-connex acyclic GCQs with equalities and inequalities.

IEDyn

For queries that have only inequality predicates, the instantiation of a $(T,N)$ -representation of $\operatorname{\textit{db}}$ contains a $T$ -reduct of $\operatorname{\textit{db}}$ and, for each node $n$ with parent $p$ , the following data structures:

- If $n\in N$ , the index $P_{n}$ on $\rho_{n}$ from Definition 8 is obtained in two steps. (1) Group $\rho_{n}$ according to the variables in $\operatorname{\textit{var}}(n)\cap\operatorname{\textit{var}}(p)$ . Then, per group, sort the tuples according to the variables of $\operatorname{\textit{var}}(n)$ mentioned in $\operatorname{\textit{pred}}(p\to n)$ (if any). (2) Create a hash table that maps each tuple $\vec{t}\in\operatorname{\pi}_{\operatorname{\textit{var}}(n)\cap\operatorname{\textit{var}}(p)}(\rho_{n})$ to its corresponding group in $\rho_{n}$ . If $\operatorname{\textit{var}}(n)\cap\operatorname{\textit{var}}(p)$ is empty, this hash table is omitted.

- If $n$ has a sibling $m$ , the index $S_{n}$ of Definition 8 is obtained in two steps. (1) Group $\rho_{n}$ according to the variables in $\operatorname{\textit{var}}(n)\cap\operatorname{\textit{var}}(m)$ . Then, per group, sort the tuples according to the variables of $\operatorname{\textit{var}}(n)$ mentioned in $\operatorname{\textit{pred}}(p)$ (if any). (2) Create a hash table that maps each tuple $\vec{t}\in\operatorname{\pi}_{\operatorname{\textit{var}}(n)\cap\operatorname{\textit{var}}(m)}(\rho_{n})$ to its corresponding group in $\rho_{n}$ . If $\operatorname{\textit{var}}(n)\cap\operatorname{\textit{var}}(m)$ is empty, this hash table is omitted.
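
A grouped and sorted index of this kind could look roughly as follows for the case of a single inequality of the form "child variable < bound". The class `GroupSortedIndex` and its method `less_than` are hypothetical names introduced only for this sketch; tuples are modelled as dictionaries, multiplicities and index maintenance under updates are left out, and the sample tuples are invented.

```python
from collections import defaultdict

class GroupSortedIndex:
    """Hash table from group key to a list of tuples sorted on one variable."""

    def __init__(self, tuples, group_vars, sort_var):
        self.group_vars = group_vars
        self.sort_var = sort_var
        self.groups = defaultdict(list)
        for t in tuples:
            key = tuple(t[v] for v in group_vars)   # values of the shared variables
            self.groups[key].append(t)
        for group in self.groups.values():
            group.sort(key=lambda t: t[sort_var])   # ascending within each group

    def less_than(self, probe, bound):
        """Yield the tuples in probe's group whose sort variable is < bound."""
        key = tuple(probe[v] for v in self.group_vars)
        for t in self.groups.get(key, []):
            if not t[self.sort_var] < bound:
                break          # sorted group: no later tuple can be compatible
            yield t

# Hypothetical reduct over {x, y}, grouped on y and sorted on x.
idx = GroupSortedIndex(
    [{"x": 5, "y": 2}, {"x": 1, "y": 2}, {"x": 3, "y": 1}, {"x": 2, "y": 2}],
    group_vars=["y"], sort_var="x")
hits = list(idx.less_than({"y": 2, "z": 4}, bound=4))
```

When `group_vars` is empty, all tuples fall into a single group keyed by the empty tuple, which corresponds to the case in which the hash table is omitted.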

In Section 4 we illustrated how to use these data structures. Specifically, in Figure 7 $I_{\rho_{r}}$ and $I_{\rho_{s}}$ are examples of $S_{n}$ , used for update propagation, while $I_{\rho_{\{y,z,w\}}}$ is an example of $P_{n}$ , used for enumeration.

Note that the example query from Section 4 has at most one inequality between each pair of atoms. This causes each edge in $T$ to consist of at most one inequality. As such, when creating the index $P_{n}$ for a node $n\in N$ , the reduct $\rho_{n}$ will be sorted per group according to at most one variable. This is important for enumeration delay because, as exemplified in Section 4, we can then find compatible tuples by first finding the corresponding group and then iterating over the sorted group from the start, stopping when the first non-compatible tuple is found. When there are multiple inequalities per pair of atoms, we need to sort according to multiple variables under some lexicographic order. This causes enumeration delay to become logarithmic, since compatible tuples will then intermingle with non-compatible tuples, and a binary search is necessary to find the next batch of compatible tuples in the group.

We call IEDyn the algorithm for processing free-connex acyclic GCQs with equalities and inequalities.

Theorem 5.1.

Let $Q$ be a GCQ in which all predicates are equalities and inequalities. Let $(T,N)$ be a binary and sibling-closed GJT pair compatible with $Q$ . Given a database $\operatorname{\textit{db}}$ over $\operatorname{\textit{at}}(Q)$ and a $(T,N)$ -rep $\operatorname{\mathcal{D}}$ of $\operatorname{\textit{db}}$ , Algorithm 1 under IEDyn enumerates $Q(\operatorname{\textit{db}})$ with delay $O(|N|\cdot\log(|{\operatorname{\textit{db}}}|))$ . Also, given an update $\operatorname{\mathit{u}}$ , Algorithm 2 under IEDyn transforms $\operatorname{\mathcal{D}}$ into a $(T,N)$ -rep of $\operatorname{\textit{db}}+u$ in time $O(|T|\cdot M^{2}\cdot\log(M))$ , where $M=|{\operatorname{\textit{db}}}|+|{\operatorname{\mathit{u}}}|$ .

Proof.

Let us first prove the enumeration bounds. It is immediate to see that for every node $n\in T$ the GMR $\rho_{n}$ satisfies $|\rho_{n}|\leq|\operatorname{\textit{db}}|$ , given that $\rho_{n}$ is defined as a series of semi-joins based on $\operatorname{\textit{db}}$ (or, equivalently, because every internal node has a guard). Therefore, according to Proposition 5 the enumeration delay is $O(|N|\cdot f(|\operatorname{\textit{db}}|))$ where $N$ is the connex subset of $T$ and $f$ is the delay provided by the index $P_{n}$ . Now, from the description of IEDyn these indices are implemented as hash tables that map each tuple $\vec{t}$ in $\pi_{\operatorname{\textit{var}}(p)\cap\operatorname{\textit{var}}(n)}\rho_{n}$ to a lexicographically sorted set containing $\rho_{n}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(p\rightarrow n)}\vec{t}$ , where $(p,n)$ is a parent-child pair. Therefore, given a tuple $\vec{t}\in\rho_{p}$ we can enumerate $\rho_{n}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(p\rightarrow n)}\vec{t}$ by first projecting $\vec{t}$ over $\operatorname{\textit{var}}(n)$ and then iterating over all tuples satisfying $\operatorname{\textit{pred}}(p\rightarrow n)$ . Since these predicates are only inequalities, each group can be kept sorted lexicographically and, as mentioned earlier, enumeration can be achieved with logarithmic delay. It follows from Prop. 5 that the enumeration delay is $O(|N|\cdot\log(|db|))$ .

Now we discuss update time. As can be seen in Algorithm 2, for each parent-child pair $(p,n)\in T$ we need to compute either $\pi_{\operatorname{\textit{var}}(p)}(\rho_{m}\Join_{\operatorname{\textit{pred}}(p)}\Delta_{n})$ or $\pi_{\operatorname{\textit{var}}(p)}\sigma_{\operatorname{\textit{pred}}(p)}(\Delta_{n})$ , depending on whether or not $n$ has a sibling $m$ . If $n$ does not have a sibling, computing $\pi_{\operatorname{\textit{var}}(p)}\sigma_{\operatorname{\textit{pred}}(p)}(\Delta_{n})$ can be done directly by sorting $\Delta_{n}$ lexicographically, enumerating those tuples satisfying $\operatorname{\textit{pred}}(p)$ (with logarithmic delay), and finally projecting over $\operatorname{\textit{var}}(p)$ . This takes time in $O(|\Delta_{n}|\cdot\log(|\Delta_{n}|))$ , which is clearly contained in $O(M^{2}\cdot\log(M))$ since $|\Delta_{n}|\leq M$ . The more involved case is when $n$ has a sibling $m$ and we need to compute $\pi_{\operatorname{\textit{var}}(p)}(\rho_{m}\Join_{\operatorname{\textit{pred}}(p)}\Delta_{n})$ . Here we first sort $\Delta_{n}$ lexicographically. Then, for every tuple $\vec{t}$ in $\pi_{\operatorname{\textit{var}}(p)}\rho_{m}$ we compute $\pi_{\operatorname{\textit{var}}(p)}(\vec{t}\Join_{\operatorname{\textit{pred}}(p)}\Delta_{n})$ . Note that this can be done in time $O(|\Delta_{n}|\cdot\log(|\Delta_{n}|))$ since from the constructed data structures we can enumerate $\Delta_{n}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(p)}\vec{t}$ with logarithmic delay. Because the previous procedure needs to be performed for each $\vec{t}\in\rho_{m}$ , this can be done in time $O(|\rho_{m}|\cdot|\Delta_{n}|\cdot\log(|\Delta_{n}|))$ and therefore in time $O(M^{2}\cdot\log(M))$ . Note that here we ignore the sorting steps as well as the maintenance of the corresponding GMRs, as those steps are clearly $O(M\cdot\log(M))$ .
Finally, since we need to perform the procedure described above once for each parent-child pair, the entire routine takes at most $O(|T|\cdot M^{2}\cdot\log(M))$ . ∎

From the previous result we can see that, for the general case of equalities and inequalities, update processing can already be quadratic in the size of the database. (Footnote 2: In the conference version of this paper DBLP:journals/pvldb/IdrisUVVL18 there was an incorrect claim: we stated that updates could be processed in time $O(M\cdot\log(M))$ in data complexity. We then found a bug in our algorithm and we currently do not know whether this bound can be achieved.) However, if we restrict the use of inequalities in a particular way, we can improve both update time and enumeration delay.

Theorem 5.2.

Let $Q$ , $T$ and $N$ be defined as in Theorem 5.1, and assume that for each $p\in T$ it is the case that $|\operatorname{\textit{pred}}(p)|\leq 1$ . Given a database $\operatorname{\textit{db}}$ over $\operatorname{\textit{at}}(Q)$ and a $(T,N)$ -rep $\operatorname{\mathcal{D}}$ of $\operatorname{\textit{db}}$ , Algorithm 1 under IEDyn enumerates $Q(\operatorname{\textit{db}})$ with delay $O(|N|)$ . Also, given an update $\operatorname{\mathit{u}}$ , Algorithm 2 under IEDyn transforms $\operatorname{\mathcal{D}}$ into a $(T,N)$ -rep of $\operatorname{\textit{db}}+u$ in time $O(|T|\cdot M\cdot\log(M))$ , where $M=|{\operatorname{\textit{db}}}|+|{\operatorname{\mathit{u}}}|$ .

Proof.

The main observation to prove this result is that when there is a single predicate, a lexicographically sorted set is totally sorted by a single attribute. Regarding enumeration, this implies that given a parent-child pair $(p,n)$ and a tuple $\vec{t}\in\pi_{\operatorname{\textit{var}}(n)}\rho_{p}$ , we can enumerate $\rho_{n}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(p)}\vec{t}$ with constant delay. The reason is that the index $P_{n}$ maps $\vec{t}$ to a totally sorted set, and therefore we can start from the largest/smallest value of the relevant attribute, and iterate over all tuples decreasingly/increasingly until we find a tuple that does not satisfy the inequality. At that point we are certain that we have visited all tuples satisfying the inequality.

Update processing can also be improved by a similar argument, although the modification is slightly more involved. Assume again that we have a parent-child pair $(p,n)$ and want to compute $\pi_{\operatorname{\textit{var}}(p)}(\rho_{m}\Join_{\operatorname{\textit{pred}}(p)}\Delta_{n})$ , where $m$ is the sibling of $n$ . We do so efficiently as follows. Recall that the index $S_{m}$ groups $\rho_{m}$ by $\operatorname{\textit{var}}(n)\cap\operatorname{\textit{var}}(m)$ and sorts each group by the variables involved in $\operatorname{\textit{pred}}(p)$ . We construct an index over $\Delta_{n}$ with the same characteristics, which a straightforward implementation achieves in $O(|\Delta_{n}|\cdot\log(|\Delta_{n}|))$ . Again, since $\operatorname{\textit{pred}}(p)$ contains at most a single inequality, each group will be sorted by a single variable and hence totally sorted. Assume now that $m$ is a guard of $p$ . Since by definition $\rho_{m}\Join_{\operatorname{\textit{pred}}(p)}\Delta_{n}=\sigma_{\operatorname{\textit{pred}}(p)}(\rho_{m}\Join\Delta_{n})$ , to compute this join it is sufficient to find for each tuple $\vec{t}$ in $\rho_{m}$ the matching tuples in the corresponding group of $\Delta_{n}$ . However, a naive implementation would take $O(M^{2})$ , since for each such $\vec{t}$ we might iterate over a potentially linear set of tuples in $\Delta_{n}$ . This can be avoided by considering the following two observations:

1. Given a tuple $\vec{t}$ in $\rho_{m}$ , since $m$ is a guard of $p$ we only need to compute the multiplicity associated to $\vec{t}$ in $\sigma_{\operatorname{\textit{pred}}(p)}(\pi_{\operatorname{\textit{var}}(p)}(\rho_{m}\Join\Delta_{n}))$ , which can be computed as $\rho_{m}(\vec{t})\cdot\sum_{\vec{s}\in\Delta_{n}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(p)}\vec{t}}\Delta_{n}(\vec{s})$ .

2. Let $\vec{t}_{1}$ and $\vec{t}_{2}$ be two tuples belonging to the same group in $\rho_{m}$ . Assume $\operatorname{\textit{pred}}(p)=a<b$ , with $a\in\operatorname{\textit{var}}(m)$ and $b\in\operatorname{\textit{var}}(n)$ . Then, if $\vec{t}_{1}(a)<\vec{t}_{2}(a)$ we have that $\Delta_{n}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(p)}\vec{t}_{2}$ is a subset of $\Delta_{n}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(p)}\vec{t}_{1}$ .

By these two facts, if we iterate in order over the tuples $\vec{t}$ of each group of $\rho_{m}$ , and we iterate simultaneously in order over the tuples $\vec{s}$ in the group of $\Delta_{n}$ corresponding to $\vec{t}$ (which can be done with constant delay), we can compute the corresponding multiplicities incrementally, visiting each tuple in $\Delta_{n}$ only once. Therefore, this join can be computed in time linear in $M$ , and the most expensive part of this procedure is to actually construct and maintain the sorted groups, an $O(M\cdot\log(M))$ procedure. This argument generalizes to any inequality, and in the case in which $n$ is a guard of $p$ it suffices to swap the roles of $\rho_{m}$ and $\Delta_{n}$ . We conclude that in this case IEDyn updates the corresponding $(T,N)$ -representation in $O(M\cdot\log(M))$ . ∎

6 Computing GJTs

In this section, we discuss how to check acyclicity and free-connex acyclicity for GCQs, and give an algorithm to compute a compatible GJT pair for a given GCQ.

The canonical algorithm for checking acyclicity of normal conjunctive queries is the GYO algorithm abiteboul1995foundations . Our algorithm is a generalisation of the GYO algorithm that checks free-connex acyclicity in addition to normal acyclicity and deals with GCQs featuring $\theta$ -join predicates instead of CQs that have equality joins only.

6.1 Classical GYO

The GYO algorithm operates on hypergraphs. A hypergraph $H$ is a set of non-empty hyperedges. Recall from Section 2 that a hyperedge is just a finite set of variables. Every GCQ is associated to a hypergraph as follows.

Definition 9.

Let $Q$ be a GCQ. The hypergraph of $Q$ , denoted $\operatorname{\textit{hyp}}(Q)$ , is the hypergraph

[TABLE]

The GYO algorithm checks acyclicity of a normal conjunctive query $Q$ by constructing $\operatorname{\textit{hyp}}(Q)$ and repeatedly removing ears from this hypergraph. If ears can be removed until only the empty hypergraph remains, then the query is acyclic; otherwise it is cyclic.

An ear in a hypergraph $H$ is a hyperedge $e$ for which we can divide its variables into two groups: (1) those that appear exclusively in $e$ , and (2) those that are contained in another hyperedge $\ell$ of $H$ . A variable that appears exclusively in a single hyperedge is also called an isolated variable. Thus, ear removal corresponds to executing the following two reduction operations.

•

Remove isolated variables: select a hyperedge $e$ in $H$ and remove isolated variables from it; if $e$ becomes empty, remove it altogether from $H$ .

•

Subset elimination: remove hyperedge $e$ from $H$ if there exists another hyperedge $\ell$ for which $e\subseteq\ell$ .

The GYO reduction of a hypergraph is the hypergraph that is obtained by executing these operations until no further operation is applicable. The following result is standard; see e.g., abiteboul1995foundations for a proof.

Proposition 6.

A CQ $Q$ is acyclic if and only if the GYO-reduction of $\operatorname{\textit{hyp}}(Q)$ is the empty hypergraph.
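
As an illustration of the classical procedure, the two reduction operations can be applied until a fixpoint is reached, as in the following sketch. The function names `gyo_reduce` and `is_acyclic` are our own, hypergraphs are given as lists of Python sets, and no attempt is made at an efficient implementation.

```python
def gyo_reduce(hyperedges):
    """Return the GYO reduction of a hypergraph given as an iterable of variable sets."""
    edges = [set(e) for e in hyperedges if e]
    changed = True
    while changed:
        changed = False
        # Remove isolated variables: those occurring in exactly one hyperedge.
        for e in edges:
            isolated = {v for v in e if sum(v in f for f in edges) == 1}
            if isolated:
                e -= isolated
                changed = True
        edges = [e for e in edges if e]          # drop hyperedges that became empty
        # Subset elimination: drop one hyperedge contained in another one.
        for i, e in enumerate(edges):
            if any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return edges

def is_acyclic(hyperedges):
    """A hypergraph is acyclic iff its GYO reduction is empty."""
    return not gyo_reduce(hyperedges)

path = [{"x", "y"}, {"y", "z"}, {"z", "w"}]      # a chain of joins
triangle = [{"x", "y"}, {"y", "z"}, {"z", "x"}]  # the triangle query
```

On these two inputs the chain reduces to the empty hypergraph, whereas no operation applies to the triangle, which is therefore reported as cyclic.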

6.2 GYO-reduction for GCQs

In order to extend the GYO-reduction to check free-connex acyclicity (not simply acyclicity) of GCQs (not simply standard CQs), we will: (1) redefine the notion of being an ear to take into account the predicates; and (2) transform the GYO-reduction into a two-stage procedure. The first stage allows us to check that a connex set with exactly $\operatorname{\textit{out}}(Q)$ can exist, while the first and second stages combined check that the query is acyclic.

Our algorithm operates on hypergraph triplets instead of hypergraphs, which are defined as follows.

Definition 10.

A hypergraph triplet is a triple $\mathcal{H}=(\operatorname{\textit{hyp}}(\mathcal{H}),\operatorname{\textit{out}}(\mathcal{H}),\operatorname{\textit{pred}}(\mathcal{H}))$ with $\operatorname{\textit{hyp}}(\mathcal{H})$ a hypergraph, $\operatorname{\textit{out}}(\mathcal{H})$ a hyperedge, and $\operatorname{\textit{pred}}(\mathcal{H})$ a set of predicates.

Intuitively, the variables in $\operatorname{\textit{out}}(\mathcal{H})$ will correspond to the output variables of a query and the set $\operatorname{\textit{pred}}(\mathcal{H})$ will contain predicates that need to be taken into account when removing ears. Every GCQ is therefore naturally associated to a hypergraph triplet as follows.

Definition 11.

The hypergraph triplet of a GCQ $Q$ , denoted $\operatorname{\mathcal{H}}(Q)$ , is the triplet $(\operatorname{\textit{hyp}}(Q),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(Q))$ .

In order to extend the notion of an ear, we require the following definitions. Let $\mathcal{H}$ be a hypergraph triplet. Variables that occur in $\operatorname{\textit{out}}(\mathcal{H})$ or in at least two hyperedges in $\operatorname{\textit{hyp}}(\mathcal{H})$ are called equijoin variables of $\mathcal{H}$ . We denote the set of all equijoin variables of $\mathcal{H}$ by $\operatorname{\textit{jv}}(\mathcal{H})$ and abbreviate $\operatorname{\textit{jv}}_{\mathcal{H}}(e)=e\cap\operatorname{\textit{jv}}(\mathcal{H})$ . A variable $x$ is isolated in $\mathcal{H}$ if it is not an equijoin variable and is not mentioned in any predicate, i.e., if $x\not\in\operatorname{\textit{jv}}(\mathcal{H})$ and $x\not\in\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{H}))$ . We denote the set of isolated variables of $\mathcal{H}$ by $\operatorname{\textit{isol}}(\mathcal{H})$ and abbreviate $\operatorname{\textit{isol}}_{\mathcal{H}}(e)=e\cap\operatorname{\textit{isol}}(\mathcal{H})$ . The extended variables of hyperedge $e$ in $\mathcal{H}$ , denoted $\operatorname{\textit{ext}}_{\mathcal{H}}(e)$ is the set of all variables of predicates that mention some variable in $e$ , except the variables in $e$ themselves:

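$$\operatorname{\textit{ext}}_{\mathcal{H}}(e)=\Big(\bigcup\{\operatorname{\textit{var}}(\theta)\mid\theta\in\operatorname{\textit{pred}}(\mathcal{H}),\ \operatorname{\textit{var}}(\theta)\cap e\neq\emptyset\}\Big)\setminus e.$$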

Finally, a hyperedge $e$ is a conditional subset of hyperedge $\ell$ w.r.t. $\mathcal{H}$ , denoted $e\operatorname*{\sqsubseteq}_{\mathcal{H}}\ell$ , if $\operatorname{\textit{jv}}_{\mathcal{H}}(e)\subseteq\ell$ and $\operatorname{\textit{ext}}_{\mathcal{H}}(e\setminus\ell)\subseteq\ell$ . We omit subscripts from our notation if the triplet is clear from the context.

Example 6.

In Fig. 8 we depict several hypergraph triplets. There, hyperedges in $\mathcal{H}$ are depicted by colored regions and variables in $\operatorname{\textit{out}}(\mathcal{H})$ are underlined. We use dashed lines to connect variables that appear together in a predicate. So, in $\mathcal{H}_{1}$ , we have predicates $\theta_{1},\theta_{2}$ with $\operatorname{\textit{var}}(\theta_{1})=\{t,v\}$ and $\operatorname{\textit{var}}(\theta_{2})=\{x,y\}$ . Now consider triplet $\mathcal{H}_{1}$ in particular. It is the hypergraph triplet $\operatorname{\mathcal{H}}(Q)$ for the following GCQ $Q$ :

[TABLE]

Moreover, $\operatorname{\textit{jv}}(\mathcal{H}_{1})=\{s,t,u,w,z\}$ and $\operatorname{\textit{isol}}(\mathcal{H}_{1})=\emptyset$ . Furthermore, $\operatorname{\textit{ext}}_{\mathcal{H}_{1}}(\{v\})=\{t\}$ since $\theta_{1}=t<v$ shares variables with $\{v\}$ . Finally $\operatorname{\textit{jv}}_{\mathcal{H}_{1}}(\{s,v\})=\{s\}\subseteq\{s,t,u\}$ and $\operatorname{\textit{ext}}_{\mathcal{H}_{1}}(\{s,v\}\setminus\{s,t,u\})=\operatorname{\textit{ext}}_{\mathcal{H}_{1}}(\{v\})=\{t\}\subseteq\{s,t,u\}$ . Therefore, $\{s,v\}\operatorname*{\sqsubseteq}_{\mathcal{H}_{1}}\{s,t,u\}$ . Similarly, $\{t,u\}\operatorname*{\sqsubseteq}_{\mathcal{H}_{1}}\{s,t,u\}$ .

We define ears in our context as follows.

Definition 12.

A hyperedge $e$ is an ear in a hypergraph triplet $\mathcal{H}$ if $e\in\operatorname{\textit{hyp}}(\mathcal{H})$ and either

1. we can divide its variables into two: (a) those that are isolated and (b) those that form a conditional subset of another hyperedge $\ell\in\operatorname{\textit{hyp}}(\mathcal{H})\setminus\{e\}$ ; or

2. $e$ consists only of non-join variables, i.e., $\operatorname{\textit{jv}}(e)=\emptyset$ and $\operatorname{\textit{ext}}(e)=\emptyset$ .

Note that case (2) allows for $\theta\in\operatorname{\textit{pred}}(\mathcal{H})$ with $\operatorname{\textit{var}}(\theta)\subseteq e$ . We call predicates that are covered by a hyperedge in this sense filters because they correspond to filtering a single GMR instead of $\theta$ -joining two GMRs. If, in case (2), there is no filter $\theta$ with $\operatorname{\textit{var}}(\theta)\subseteq e$ , then $e=\operatorname{\textit{isol}}_{\mathcal{H}}(e)$ . Similar to the classical GYO reduction, we can view ear removal as a rewriting process on triplets, where we consider the following reduction operations.

- (ISO) Remove isolated variables: select a hyperedge $e\in\operatorname{\textit{hyp}}(\mathcal{H})$ and remove a non-empty set $X\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e)$ from it. If $e$ becomes empty, remove it from $\operatorname{\textit{hyp}}(\mathcal{H})$ .

- (CSE) Conditional subset elimination: remove hyperedge $e$ from $\operatorname{\textit{hyp}}(\mathcal{H})$ if it is a conditional subset of another hyperedge $f$ in $\operatorname{\textit{hyp}}(\mathcal{H})$ . Also update $\operatorname{\textit{pred}}(\mathcal{H})$ by removing all predicates $\theta$ with $\operatorname{\textit{var}}(\theta)\cap(e\setminus f)\not=\emptyset$ .

- (FLT) Filter elimination: select $e\in\operatorname{\textit{hyp}}(\mathcal{H})$ and a non-empty subset of predicates $\Theta\subseteq\operatorname{\textit{pred}}(\mathcal{H})$ with $\operatorname{\textit{var}}(\Theta)\subseteq e$ . Remove all predicates in $\Theta$ from $\operatorname{\textit{pred}}(\mathcal{H})$ .

We write $\mathcal{H}\operatorname*{\rightsquigarrow}\mathcal{I}$ to denote that triplet $\mathcal{I}$ is obtained from triplet $\mathcal{H}$ by applying a single such operation, and $\mathcal{H}\operatorname*{\rightsquigarrow}^{*}\mathcal{I}$ to denote that $\mathcal{I}$ is obtained by a sequence of zero or more of such operations.

Example 7.

For the hypergraph triplets illustrated in Fig. 8 we have $\mathcal{H}_{1}\operatorname*{\rightsquigarrow}\mathcal{H}_{2}\operatorname*{\rightsquigarrow}\mathcal{H}_{3}\operatorname*{\rightsquigarrow}\mathcal{H}_{4}$ and $\mathcal{H}_{5}\operatorname*{\rightsquigarrow}\mathcal{H}_{6}\operatorname*{\rightsquigarrow}\mathcal{H}_{7}\operatorname*{\rightsquigarrow}\mathcal{H}_{8}\operatorname*{\rightsquigarrow}\mathcal{H}_{9}\operatorname*{\rightsquigarrow}\mathcal{H}_{10}\operatorname*{\rightsquigarrow}\mathcal{H}_{11}$ . For each reduction step, the figure indicates which set of isolated variables or which conditional subset is removed.

We write ${\mathcal{H}}\!\!\downarrow$ to denote $\mathcal{H}$ is in normal form, i.e., that no operation is applicable on triplet $\mathcal{H}$ . Note that, because each operation removes at least one variable, hyperedge, or predicate, we will always reach a normal form after a finite number of operations. Furthermore, while multiple different reduction steps may be applicable on a given triplet $\mathcal{H}$ , the order in which we apply them does not matter:

Proposition 7 (Confluence).

Whenever $\mathcal{H}\operatorname*{\rightsquigarrow}^{*}\mathcal{I}_{1}$ and $\mathcal{H}\operatorname*{\rightsquigarrow}^{*}\mathcal{I}_{2}$ , there exists $\mathcal{J}$ such that $\mathcal{I}_{1}\operatorname*{\rightsquigarrow}^{*}\mathcal{J}$ and $\mathcal{I}_{2}\operatorname*{\rightsquigarrow}^{*}\mathcal{J}$ .

Because the proof is technical but not overly enlightening, we defer it to Appendix C.1. A direct consequence is that normal forms are unique: if $\mathcal{H}\operatorname*{\rightsquigarrow}^{*}{\mathcal{I}_{1}}\!\!\downarrow$ and $\mathcal{H}\operatorname*{\rightsquigarrow}^{*}{\mathcal{I}_{2}}\!\!\downarrow$ then $\mathcal{I}_{1}=\mathcal{I}_{2}$ .

Let $\mathcal{H}$ be a triplet. The residual of $\mathcal{H}$ , denoted $\tilde{\mathcal{H}}$ , is the triplet $(\operatorname{\textit{hyp}}(\mathcal{H}),\emptyset,\operatorname{\textit{pred}}(\mathcal{H}))$ , i.e., the triplet where $\operatorname{\textit{out}}(\mathcal{H})$ is set to $\emptyset$ . A triplet is empty if it equals $(\emptyset,\emptyset,\emptyset)$ .

Our main result in this section states that to check whether a GCQ $Q$ is free-connex acyclic it suffices to start from $\operatorname{\mathcal{H}}(Q)$ and do a two-stage reduction: the first from $\operatorname{\mathcal{H}}(Q)$ until a normal form ${\mathcal{I}}\!\!\downarrow$ is reached, and the second from the residual of ${\mathcal{I}}\!\!\downarrow$ , until another normal form $\mathcal{J}$ is reached. (Note that because the residual sets $\operatorname{\textit{out}}(\mathcal{I})$ to $\emptyset$ , new variables may become isolated, and therefore more reduction steps may be possible on $\tilde{\mathcal{I}}$ even though $\mathcal{I}$ is in normal form.)

Theorem 6.1.

Let $Q$ be a GCQ. Assume $\operatorname{\mathcal{H}}(Q)\operatorname*{\rightsquigarrow}^{*}{\mathcal{I}}\!\!\downarrow$ and $\tilde{\mathcal{I}}\operatorname*{\rightsquigarrow}^{*}{\mathcal{J}}\!\!\downarrow$ . Then the following hold.

1. $Q$ is acyclic if, and only if, $\mathcal{J}$ is the empty triplet.

2. $Q$ is free-connex acyclic if, and only if, $\mathcal{J}$ is the empty triplet and $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))=\operatorname{\textit{out}}(Q)$ .

3. For every GJT $T$ of $Q$ and every connex subset $N$ of $T$ it holds that $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))\subseteq\operatorname{\textit{var}}(N)$ .

We devote Section 6.3 to the proof.
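The decision procedure suggested by Theorem 6.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: hyperedges and predicates are modelled as frozensets of variable names (so two predicates over the same variables collapse into one, which does not affect the acyclicity test), the hypergraph is a set of hyperedges, and the names `step`, `normal_form`, and `classify` are hypothetical. Each call to `step` applies one of (ISO), (FLT), or (CSE); by confluence (Proposition 7) the order in which applicable operations are chosen should not affect the normal form.

```python
from typing import FrozenSet, Iterable, Optional, Set, Tuple

Edge = FrozenSet[str]
Triplet = Tuple[FrozenSet[Edge], Edge, FrozenSet[Edge]]  # (hyp, out, pred)


def all_vars(sets: Iterable[Edge]) -> Set[str]:
    return set().union(*sets)


def jv(H: Triplet) -> Set[str]:
    hyp, out, _ = H
    return set(out) | {x for x in all_vars(hyp) if sum(x in e for e in hyp) >= 2}


def isol(H: Triplet) -> Set[str]:
    hyp, _, preds = H
    return all_vars(hyp) - jv(H) - all_vars(preds)


def ext(H: Triplet, e: Edge) -> Set[str]:
    return all_vars(t for t in H[2] if t & e) - set(e)


def cond_subset(H: Triplet, e: Edge, f: Edge) -> bool:
    return (e & jv(H)) <= f and ext(H, e - f) <= f


def step(H: Triplet) -> Optional[Triplet]:
    """Apply one reduction operation, or return None if H is in normal form."""
    hyp, out, preds = H
    for e in hyp:
        x = e & isol(H)
        if x:  # (ISO): drop isolated variables; drop e if it becomes empty
            rest = e - x
            return (hyp - {e}) | ({rest} if rest else set()), out, preds
        filters = frozenset(t for t in preds if t <= e)
        if filters:  # (FLT): drop predicates fully covered by e
            return hyp, out, preds - filters
        for f in hyp:
            if f != e and cond_subset(H, e, f):  # (CSE)
                kept = frozenset(t for t in preds if not (t & (e - f)))
                return hyp - {e}, out, kept
    return None


def normal_form(H: Triplet) -> Triplet:
    while (nxt := step(H)) is not None:
        H = nxt
    return H


def classify(atoms, out, preds) -> Tuple[bool, bool]:
    """Return (acyclic, free-connex acyclic) via the two-stage reduction."""
    H = (frozenset(map(frozenset, atoms)), frozenset(out),
         frozenset(map(frozenset, preds)))
    I = normal_form(H)
    J = normal_form((I[0], frozenset(), I[2]))  # reduce the residual of I
    acyclic = not J[0] and not J[2]
    return acyclic, acyclic and all_vars(I[0]) == set(out)


# Path query R(a,b), S(b,c): free-connex for output {a}, not for output {a,c}.
print(classify([{"a", "b"}, {"b", "c"}], {"a"}, []))
print(classify([{"a", "b"}, {"b", "c"}], {"a", "c"}, []))
```

The hyperedge-by-hyperedge scan recomputes `jv` and `isol` on every step, which keeps the sketch short; it is polynomial but makes no attempt at efficiency.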

Example 8.

Fig. 8 illustrates the two-stage sequence of reductions starting from $\operatorname{\mathcal{H}}(Q)$ with $Q$ the GCQ of Example 6. Note that $\operatorname{\mathcal{H}}(Q)=\mathcal{H}_{1}$ and $\mathcal{H}_{5}$ is the residual of $\mathcal{H}_{4}$ . Because we end with the empty triplet, $Q$ is acyclic but not free-connex since $\operatorname{\textit{out}}(Q)\subsetneq\operatorname{\textit{var}}(\mathcal{H}_{4})$ .

Theorem 6.1 gives us a decision procedure for checking free-connex acyclicity of GCQ $Q$ . From its proof in Section 6.3, we can actually derive an algorithm for constructing a compatible GJT pair for $Q$ . At its essence, this algorithm starts with the set of atoms appearing in $Q$ , and subsequently uses the sequence of reduction steps from Theorem 6.1 to construct a GJT from it, at the same time checking free-connex acyclicity. Every reduction step causes new nodes to be added to the partial GJT constructed so far. We will refer to such partial GJTs as Generalized Join Forests (GJF).

Definition 13 (GJF).

A Generalized Join Forest is a set $F$ of pairwise disjoint GJTs s.t. for distinct trees $T_{1},T_{2}\in F$ we have $\operatorname{\textit{var}}(T_{1})\cap\operatorname{\textit{var}}(T_{2})=\operatorname{\textit{var}}(n_{1})\cap\operatorname{\textit{var}}(n_{2})$ where $n_{1}$ and $n_{2}$ are the roots of $T_{1}$ and $T_{2}$ .

Every GJF encodes a hypergraph as follows.

Definition 14.

The hypergraph $\operatorname{\textit{hyp}}(F)$ associated to GJF $F$ is the hypergraph that has one hyperedge for every non-empty root node in $F$ ,

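$$\operatorname{\textit{hyp}}(F)=\{\operatorname{\textit{var}}(n)\mid n\text{ is the root of a tree in }F\text{ and }\operatorname{\textit{var}}(n)\not=\emptyset\}.$$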

The GJT construction algorithm does not manipulate hypergraph triplets directly. Instead, it manipulates GJF triplets. A GJF triplet is defined like a hypergraph triplet, except that it has a GJF instead of a hypergraph.

Definition 15.

A GJF triplet is a triple $\mathbb{F}=(\operatorname{\textit{forest}}(\mathbb{F}),\operatorname{\textit{out}}(\mathbb{F}),\Theta_{\mathbb{F}})$ with $\operatorname{\textit{forest}}(\mathbb{F})$ a GJF, $\operatorname{\textit{out}}(\mathbb{F})$ a hyperedge, and $\Theta_{\mathbb{F}}$ a set of predicates. Every GJF triplet $\mathbb{F}$ induces a hypergraph triplet $\operatorname{\mathcal{H}}(\mathbb{F})=(\operatorname{\textit{hyp}}(\operatorname{\textit{forest}}(\mathbb{F})),\operatorname{\textit{out}}(\mathbb{F}),\Theta_{\mathbb{F}})$ .

The algorithm for constructing a GJT pair compatible with a given GCQ $Q$ is shown in Algorithm 3. It starts in line 2 by initializing the GJF triplet $\mathbb{F}$ to $\mathbb{F}=(\operatorname{\textit{forest}}(Q),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(Q))$ . Here, $\operatorname{\textit{forest}}(Q)$ is the GJF obtained by creating, for every atom $r(\overline{x})$ that occurs $k>0$ times in $Q$ , $k$ corresponding leaf nodes labeled by $r(\overline{x})$ . In lines 3–4, Algorithm 3 then performs the first phase of reduction steps of Theorem 6.1. To this end, it checks whether a reduction operation is applicable to $\operatorname{\mathcal{H}}(\mathbb{F})$ and, if so, enacts this operation by modifying $\mathbb{F}$ as follows.

- (ISO) If the reduction operation on the hypergraph triplet $\operatorname{\mathcal{H}}(\mathbb{F})$ were to remove a non-empty subset $X$ of isolated variables from hyperedge $e$ , then $\mathbb{F}$ is modified as follows. Let $n_{1},\dots,n_{k}$ be all the root nodes in $\operatorname{\textit{forest}}(\mathbb{F})$ that are labeled by $e$ . Merge the corresponding trees into one tree by creating a new node $n$ with $\operatorname{\textit{var}}(n)=e$ and attaching $n_{1},\dots,n_{k}$ as children to it with $\operatorname{\textit{pred}}(n\to n_{i})=\emptyset$ for $1\leq i\leq k$ . Then, enact the removal of $X$ by creating a new node $p$ with $\operatorname{\textit{var}}(p)=e\setminus X$ and attaching $n$ as child to it with $\operatorname{\textit{pred}}(p\to n)=\emptyset$ .

- (CSE) If the reduction operation on $\operatorname{\mathcal{H}}(\mathbb{F})$ were to remove a hyperedge $e$ because it is a conditional subset of another hyperedge $\ell$ , then $\mathbb{F}$ is modified as follows. Let $n_{1},\dots,n_{k}$ (resp. $m_{1},\dots,m_{l}$ ) be all the root nodes in $\operatorname{\textit{forest}}(\mathbb{F})$ that are labeled by $e$ (resp. $\ell$ ), and let $T_{1},\dots,T_{k}$ (resp. $U_{1},\dots,U_{l}$ ) be their corresponding trees. Similar to the previous case, merge the $T_{i}$ (resp. $U_{j}$ ) into a single tree with new root $n$ labeled by $e$ (resp. $m$ labeled by $\ell$ ). Then enact the removal of $e$ by creating a new node $p$ with $\operatorname{\textit{var}}(p)=\ell$ and attaching $n$ and $m$ as children with $\operatorname{\textit{pred}}(p\to n)=\{\theta\in\operatorname{\textit{pred}}(\mathbb{F})\mid\operatorname{\textit{var}}(\theta)\cap(e\setminus\ell)\not=\emptyset\}$ and $\operatorname{\textit{pred}}(p\to m)=\emptyset$ .

- (FLT) If the reduction operation on $\operatorname{\mathcal{H}}(\mathbb{F})$ were to remove a non-empty set of predicates $\Theta$ because there exists a hyperedge $e$ with $\operatorname{\textit{var}}(\Theta)\subseteq e$ , then $\mathbb{F}$ is modified as follows. Let $n_{1},\dots,n_{k}$ be all the root nodes in $\operatorname{\textit{forest}}(\mathbb{F})$ that are labeled by $e$ . Merge the corresponding trees into one tree by creating a new root $n$ labeled by $e$ , and attaching $n_{1},\dots,n_{k}$ as children with $\operatorname{\textit{pred}}(n\to n_{i})=\Theta$ . Enact the removal of $\Theta$ by removing all $\theta\in\Theta$ from $\Theta_{\mathbb{F}}$ .

It is straightforward to check that these modifications of the forest triplet $\mathbb{F}$ faithfully enact the corresponding operations on $\operatorname{\mathcal{H}}(\mathbb{F})$ , in the following sense.

Lemma 7.

Let $\mathbb{F}$ be a forest triplet and assume $\operatorname{\mathcal{H}}(\mathbb{F})\operatorname*{\rightsquigarrow}\mathcal{I}$ . Let $\mathbb{G}$ be the result of enacting this reduction operation on $\mathbb{F}$ . Then $\mathbb{G}$ is a valid forest triplet and $\operatorname{\mathcal{H}}(\mathbb{G})=\mathcal{I}$ .
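As an illustration of Lemma 7 for the (ISO) case, the following Python sketch enacts the removal of isolated variables on a small forest and checks that the induced hypergraph changes accordingly. The `Node` class and the functions `hyp` and `enact_iso` are hypothetical names introduced for this sketch; nodes carry only their variable set (atom labels on leaves are omitted), and the caller is assumed to have already verified that the removed variables are isolated.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

Edge = FrozenSet[str]


@dataclass
class Node:
    var: Edge
    children: List["Node"] = field(default_factory=list)
    # preds[i] is the set of predicates on the edge to children[i]
    preds: List[FrozenSet[Edge]] = field(default_factory=list)


def hyp(forest: List[Node]) -> FrozenSet[Edge]:
    """Hypergraph of a forest: one hyperedge per non-empty root label."""
    return frozenset(root.var for root in forest if root.var)


def enact_iso(forest: List[Node], e: Edge, x: Edge) -> List[Node]:
    """Enact (ISO): remove the isolated variables x from hyperedge e."""
    same = [r for r in forest if r.var == e]   # all roots labeled by e
    rest = [r for r in forest if r.var != e]
    n = Node(e, same, [frozenset() for _ in same])  # merge them under n
    p = Node(e - x, [n], [frozenset()])             # new root without x
    return rest + [p]


# Three single-node trees; the label {b,c} occurs twice (an assumed self-join).
ab, bc = frozenset("ab"), frozenset("bc")
forest = [Node(ab), Node(bc), Node(bc)]
reduced = enact_iso(forest, bc, frozenset("c"))

print(sorted(sorted(e) for e in hyp(forest)))
print(sorted(sorted(e) for e in hyp(reduced)))
```

The enactments of (CSE) and (FLT) follow the same pattern: merge the roots carrying the affected label, then add one new parent that records the removed predicates on its outgoing edge.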

We continue the explanation of Algorithm 3. In line 5, Algorithm 3 records the set of root nodes obtained after the first stage of reductions. It then sets $\operatorname{\textit{out}}(\mathbb{F})=\emptyset$ in line 6 and continues with the second stage of reductions in lines 7–8. It then employs Theorem 6.1 to check acyclicity of $Q$ . If $Q$ is not acyclic, it reports this in lines 9–10. If $Q$ is acyclic, then we know by Theorem 6.1 that $\operatorname{\mathcal{H}}(\mathbb{F})$ has become the empty triplet. Note that $\operatorname{\mathcal{H}}(\mathbb{F})$ can be empty only if all the roots of $\mathbb{F}$ ’s join forest are labeled by the empty set of variables. As such, we can transform this forest into a join tree $T$ by linking all of these roots to a new unique root, also labeled $\emptyset$ . This is done in line 12. In line 13, the set of nodes $N$ is computed, and consists of all nodes identified at the end of the first stage (line 5) plus all of their parents in $T$ .

We will prove in Section 6.3 that Algorithm 3 is correct, in the following sense.

Theorem 6.2.

Given a GCQ $Q$ , Algorithm 3 reports an error if $Q$ is cyclic. Otherwise, it returns a sibling-closed GJT pair $(T,N)$ with $T$ a GJT for $Q$ . If $Q$ is free-connex acyclic, then $(T,N)$ is compatible with $Q$ . Otherwise, $\operatorname{\textit{out}}(Q)\subsetneq\operatorname{\textit{var}}(N)$ , but $\operatorname{\textit{var}}(N)$ is minimal in the sense that for every other GJT pair $(T^{\prime},N^{\prime})$ with $T^{\prime}$ a GJT for $Q$ we have $\operatorname{\textit{var}}(N)\subseteq\operatorname{\textit{var}}(N^{\prime})$ .

It is straightforward to check that this algorithm runs in polynomial time in the size of $Q$ .

Example 9.

In Fig. 9, we show a GJT $T$ and use this GJT to illustrate a number of GJFs $F_{1},\dots,F_{10}$ in the following way: let level $1$ be the leaf nodes, level $2$ the parents of the leaves, and so on. Then we take GJF $F_{i}$ to be the set of all trees rooted at nodes at level $i$ , for $1\leq i\leq 10$ . With each level $i$ , we also list the remaining predicates among $\theta_{1},\dots,\theta_{k}$ , where $k$ is the number of predicates in $Q$ . A node (resp. a predicate listed with $F_{i}$ ) labeled by “ $\bullet$ ” in Fig. 9 indicates that the node (and hence its tree, resp. the predicate) was already present in $F_{i-1}$ and did not change. These should hence not be interpreted as new nodes (resp. changed predicates). With this coding of forests, it is easy to see that for all $1\leq i\leq 9$ , $\operatorname{\textit{hyp}}(F_{i})=\operatorname{\textit{hyp}}(\mathcal{H}_{i})$ with $\mathcal{H}_{i}$ illustrated in Fig. 8 (note that the residual $\mathcal{H}_{5}$ of $\mathcal{H}_{4}$ has the same hypergraph as $\mathcal{H}_{4}$ ; hence we do not show the corresponding $F_{5}$ ). Furthermore, $\operatorname{\textit{pred}}(F_{i})=\operatorname{\textit{pred}}(Q)\setminus\operatorname{\textit{pred}}(\mathcal{H}_{i})$ with $Q$ the GCQ from Example 6. As such, the tree illustrates the sequence of GJF triplets that is obtained by enacting the hypergraph reductions illustrated in Fig. 8. For example, let $\mathbb{F}_{1}=(F_{1},\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(Q))$ . After enacting the removal of hyperedge $\{t,u\}$ from $\mathcal{H}_{1}$ to obtain $\mathcal{H}_{2}$ we obtain $\mathbb{F}_{2}=(F_{2},\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(Q))$ . Here, $F_{2}$ is obtained by merging the single-node trees $\{s,t,u\}$ and $\{t,u\}$ (i.e., the leaves labeled by atoms of $Q$ ) into a single tree with root $\{s,t,u\}$ . The shaded area indicates the nodes in the connex subset $N$ computed by Algorithm 3.

We stress that Algorithm 3 is non-deterministic in the sense that the pair $(T,N)$ returned depends on the order in which the reduction operations are performed.

6.3 Correctness

To prove Theorems 6.1 and 6.2, we first establish a number of propositions.

Proposition 8.

Let $Q$ be a GCQ. Assume $\operatorname{\mathcal{H}}(Q)\operatorname*{\rightsquigarrow}^{*}{\mathcal{I}}\!\!\downarrow$ and $\tilde{\mathcal{I}}\operatorname*{\rightsquigarrow}^{*}{\mathcal{J}}\!\!\downarrow$ . If $\mathcal{J}$ is the empty triplet, then, when run on $Q$ , Algorithm 3 returns a pair $(T,N)$ s.t. $T$ is a GJT for $Q$ , $N$ is sibling-closed, and $\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))$ .

Proof.

Assume that $\mathcal{J}$ is the empty triplet. Algorithm 3 starts in line 2 by initializing $\mathbb{F}=(\operatorname{\textit{forest}}(Q),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(Q))$ . Clearly, $\operatorname{\mathcal{H}}(\mathbb{F})=\operatorname{\mathcal{H}}(Q)$ at this point. Algorithm 3 subsequently modifies $\mathbb{F}$ throughout its execution. Let $\mathbb{H}$ denote the initial version of $\mathbb{F}$ ; let $\mathbb{I}$ denote the version of $\mathbb{F}$ when executing line 5; let $\tilde{\mathbb{I}}$ denote the version of $\mathbb{F}$ after executing line 6; and let $\mathbb{J}$ denote the version of $\mathbb{F}$ when executing line 9. By repeated application of Lemma 7 we know that $\operatorname{\mathcal{H}}(Q)=\operatorname{\mathcal{H}}(\mathbb{H})\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(\mathbb{I})$ . Furthermore, $\operatorname{\mathcal{H}}(\mathbb{I})$ is in normal form. Since also $\operatorname{\mathcal{H}}(Q)\operatorname*{\rightsquigarrow}^{*}{\mathcal{I}}\!\!\downarrow$ and normal forms are unique, $\operatorname{\mathcal{H}}(\mathbb{I})=\mathcal{I}$ . Therefore, $\operatorname{\mathcal{H}}(\tilde{\mathbb{I}})=\tilde{\mathcal{I}}$ . Again by repeated application of Lemma 7 we know that $\tilde{\mathcal{I}}=\operatorname{\mathcal{H}}(\tilde{\mathbb{I}})\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(\mathbb{J})$ . Moreover, $\operatorname{\mathcal{H}}(\mathbb{J})$ is in normal form. Since also $\tilde{\mathcal{I}}\operatorname*{\rightsquigarrow}^{*}{\mathcal{J}}\!\!\downarrow$ and normal forms are unique, $\operatorname{\mathcal{H}}(\mathbb{J})=\mathcal{J}$ . As $\mathcal{J}$ is empty, we will execute lines 12–14. Since $\mathcal{J}$ is the empty hypergraph triplet, every root of every tree in $\operatorname{\textit{forest}}(\mathbb{J})$ must be labeled by $\emptyset$ . Hence, by definition of join forests, no two distinct trees in $\operatorname{\textit{forest}}(\mathbb{J})$ share variables. As such, the tree $T$ obtained in line 12 by linking all of these roots to a new unique root, also labeled $\emptyset$ , is a valid GJT.

We claim that $T$ is a GJT for $Q$ . Indeed, observe that $\operatorname{\textit{at}}(T)=\operatorname{\textit{at}}(Q)$ and the number of times that an atom occurs in $Q$ equals the number of times that it occurs as a label in $T$ . This is because initially $\operatorname{\textit{forest}}(\mathbb{H})=\operatorname{\textit{forest}}(Q)$ and by enacting reduction steps we never remove nor add nodes labeled by atoms. Furthermore $\operatorname{\textit{pred}}(T)=\operatorname{\textit{pred}}(Q)$ . This is because initially $\operatorname{\textit{pred}}(\mathbb{H})=\operatorname{\textit{pred}}(Q)$ yet $\Theta_{\mathbb{J}}$ is empty. This means that, for every $\theta\in\operatorname{\textit{pred}}(Q)$ , there was some reduction step that removed $\theta$ from the set of predicates of the current GJF triplet $\mathbb{F}$ . However, when enacting reduction steps we only remove predicates after we have added them to $\operatorname{\textit{forest}}(\mathbb{F})$ . Therefore, every predicate in $\operatorname{\textit{pred}}(Q)$ must occur in $T$ . Conversely, during enactment of reduction steps we never add predicates to $\operatorname{\textit{forest}}(\mathbb{F})$ that are not in $\Theta_{\mathbb{F}}$ , so all predicates in $T$ are also in $\operatorname{\textit{pred}}(Q)$ . Thus, $T$ is a GJT for $Q$ .

It remains to show that $N$ is a sibling-closed connex subset of $T$ and $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))=\operatorname{\textit{var}}(N)$ . To this end, let $X$ be the set of all root nodes of $\operatorname{\textit{forest}}(\mathbb{I})$ , as computed in line 5. Since $\mathbb{J}$ is obtained from $\tilde{\mathbb{I}}$ by a sequence of reduction enactments, and since such enactments only add new nodes and never delete them, $X$ is a subset of nodes of $\operatorname{\textit{forest}}(\mathbb{J})$ and therefore also of $T$ . As computed in line 13, $N$ consists of $X$ and all ancestors of nodes of $X$ in $T$ . Then $N$ is a connex subset of $T$ by definition. Moreover, since enactments of reduction steps can only merge existing trees or add new parent nodes (never new child nodes), $N$ must also be sibling-closed. Furthermore, since $\operatorname{\mathcal{H}}(\mathbb{I})=\mathcal{I}$ , $\operatorname{\textit{hyp}}(\operatorname{\textit{forest}}(\mathbb{I}))=\operatorname{\textit{hyp}}(\mathcal{I})$ . Thus, $\operatorname{\textit{var}}(X)=\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathbb{I}))=\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))$ . Then, since $X$ is the frontier of $N$ and $N$ is sibling-closed we have $\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(X)=\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))$ by Lemma 1. ∎

Corollary 1 (Soundness).

Let $Q$ be a GCQ and assume that $\operatorname{\mathcal{H}}(Q)\operatorname*{\rightsquigarrow}^{*}{\mathcal{I}}\!\!\downarrow$ and $\tilde{\mathcal{I}}\operatorname*{\rightsquigarrow}^{*}{\mathcal{J}}\!\!\downarrow$ . Then:

1. If $\mathcal{J}$ is the empty triplet then $Q$ is acyclic.

2. If $\mathcal{J}$ is the empty triplet and $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))=\operatorname{\textit{out}}(Q)$ then $Q$ is free-connex acyclic.

To also show completeness, we will interpret a GJT $T$ for a GCQ $Q$ as a “parse tree” that specifies the two-stage sequence of reduction steps that can be done on $\operatorname{\mathcal{H}}(Q)$ to reach the empty triplet. Not all GJTs will allow us to do so easily, however, and we will therefore restrict our attention to those GJTs that are canonical.

Definition 16 (Canonical).

A GJT $T$ is canonical if:

1. its root is labeled by $\emptyset$ ;

2. every leaf node $n$ is the child of an internal node $m$ with $\operatorname{\textit{var}}(n)=\operatorname{\textit{var}}(m)$ ;

3. for all internal nodes $n$ and $m$ with $n\not=m$ we have $\operatorname{\textit{var}}(n)\not=\operatorname{\textit{var}}(m)$ ; and

4. for every edge $m\to n$ and all $\theta\in\operatorname{\textit{pred}}(m\to n)$ we have $\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\not=\emptyset$ .

A connex subset $N$ of $T$ is canonical if every node in it is interior in $T$ . A GJT pair $(T,N)$ is canonical if both $T$ and $N$ are canonical.
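The four canonicity conditions translate directly into a tree traversal. The Python sketch below is an illustration under assumptions: the `Node` class and `is_canonical` are hypothetical names, each child is stored together with the predicates on its incoming edge (predicates are again reduced to their variable sets), and the function checks only the conditions of Definition 16, not that the tree is a valid GJT.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

Edge = FrozenSet[str]
NO_PREDS: FrozenSet[Edge] = frozenset()


@dataclass
class Node:
    var: Edge
    # each entry is (child, predicates on the edge to that child)
    children: List[Tuple["Node", FrozenSet[Edge]]] = field(default_factory=list)


def is_canonical(root: Node) -> bool:
    # (1) the root is labeled by the empty set; a childless root would be
    # a leaf without a parent, which violates (2)
    if root.var or not root.children:
        return False
    interior_labels = []
    stack = [root]  # holds interior nodes only
    while stack:
        m = stack.pop()
        interior_labels.append(m.var)
        for n, thetas in m.children:
            # (4) every edge predicate mentions a variable in var(n) \ var(m)
            if any(not (t & (n.var - m.var)) for t in thetas):
                return False
            if n.children:
                stack.append(n)
            elif n.var != m.var:  # (2) a leaf repeats its parent's label
                return False
    # (3) interior nodes carry pairwise distinct labels
    return len(interior_labels) == len(set(interior_labels))


# A canonical tree for the (assumed) path query R(a,b), S(b,c).
ab, bc, b = frozenset("ab"), frozenset("bc"), frozenset("b")
n_ab = Node(ab, [(Node(ab), NO_PREDS)])
n_bc = Node(bc, [(Node(bc), NO_PREDS)])
good = Node(frozenset(), [(Node(b, [(n_ab, NO_PREDS), (n_bc, NO_PREDS)]), NO_PREDS)])
bad = Node(b, [(Node(ab), NO_PREDS), (Node(bc), NO_PREDS)])  # root not empty

print(is_canonical(good), is_canonical(bad))
```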

The following proposition, proven in Appendix C, shows that we may restrict our attention to canonical GJT pairs without loss of generality.

Proposition 9.

For every GJT pair there exists an equivalent canonical pair.

We also require the following auxiliary notions and insights. First, if $(T,N)$ is a GJT pair, then define the hypergraph associated to $(T,N)$ , denoted $\operatorname{\textit{hyp}}(T,N)$ , to be the hypergraph formed by node labels in $N$ ,

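$$\operatorname{\textit{hyp}}(T,N)=\{\operatorname{\textit{var}}(n)\mid n\in N\text{ and }\operatorname{\textit{var}}(n)\not=\emptyset\}.$$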

Further, define $\operatorname{\textit{pred}}(T,N)$ to be the set of all predicates occurring on edges between nodes in $N$ . For a hyperedge $\overline{z}$ , define the hypergraph triplet of $(T,N)$ w.r.t. $\overline{z}$ , denoted $\operatorname{\mathcal{H}}(T,N,\overline{z})$ , to be the hypergraph triplet $(\operatorname{\textit{hyp}}(T,N),\overline{z},\operatorname{\textit{pred}}(T,N))$ .

The following technical lemma shows that we can use canonical pairs as “parse trees” to derive a sequence of reduction steps. Its proof can be found in Appendix C.

Lemma 8.

Let $(T,N_{1})$ and $(T,N_{2})$ be canonical GJT pairs with $N_{2}\subseteq N_{1}$ . Then $\operatorname{\mathcal{H}}(T,N_{1},\overline{z})\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(T,N_{2},\overline{z})$ for every $\overline{z}\subseteq\operatorname{\textit{var}}(N_{2})$ .

We require the following additional lemma, proven in Appendix C:

Lemma 9.

Let $H_{1}$ and $H_{2}$ be two hypergraphs such that for all $e\in H_{2}$ there exists $\ell\in H_{1}$ such that $e\subseteq\ell$ . Then $(H_{1}\cup H_{2},\overline{z},\Theta)\operatorname*{\rightsquigarrow}^{*}(H_{1},\overline{z},\Theta)$ , for every hyperedge $\overline{z}$ and set of predicates $\Theta$ .

With these tools in hand, we can prove completeness.

Proposition 10.

Let $Q$ be a GCQ, let $T$ be a GJT for $Q$ and let $N$ be a connex subset of $T$ with $\operatorname{\textit{out}}(Q)\subseteq\operatorname{\textit{var}}(N)$ . Assume that $\operatorname{\mathcal{H}}(Q)\operatorname*{\rightsquigarrow}^{*}{\mathcal{I}}\!\!\downarrow$ and $\tilde{\mathcal{I}}\operatorname*{\rightsquigarrow}^{*}{\mathcal{J}}\!\!\downarrow$ . Then $\mathcal{J}$ is the empty triplet and $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))\subseteq\operatorname{\textit{var}}(N)$ .

Proof.

By Proposition 9 we may assume without loss of generality that $(T,N)$ is a canonical GJT pair. Let $A$ be the set of all of $T$ ’s interior nodes. Clearly, $A$ is a connex subset of $T$ and $\operatorname{\textit{var}}(A)\subseteq\operatorname{\textit{var}}(Q)$ . Furthermore, because for every atom $r(\overline{x})$ in $Q$ there is a leaf node $l$ in $T$ labeled by $r(\overline{x})$ (as $T$ is a GJT for $Q$ ), which has a parent interior node $n_{l}$ labeled $\overline{x}$ (because $T$ is canonical), also $\operatorname{\textit{var}}(Q)\subseteq\operatorname{\textit{var}}(A)$ . Therefore, $\operatorname{\textit{var}}(A)=\operatorname{\textit{var}}(Q)$ . By the same reasoning, $\operatorname{\textit{hyp}}(Q)\subseteq\operatorname{\textit{hyp}}(T,A)$ . Therefore, $\operatorname{\textit{hyp}}(T,A)=\operatorname{\textit{hyp}}(T,A)\cup\operatorname{\textit{hyp}}(Q)$ . Furthermore, because every interior node in a GJT has a guard descendant, and the leaves of $T$ are all labeled by atoms in $Q$ , we know that for every node $n\in A$ there exists some hyperedge $f\in\operatorname{\textit{hyp}}(Q)$ such that $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(f)$ . In addition, we claim that $\operatorname{\textit{pred}}(T,A)=\operatorname{\textit{pred}}(Q)$ . Indeed, $\operatorname{\textit{pred}}(T,A)\subseteq\operatorname{\textit{pred}}(Q)$ since $T$ is a GJT for $Q$ . The converse inclusion follows from canonicality properties (2) and (4): because leaf nodes in a canonical GJT have a parent labeled by the same hyperedge, there can be no predicates on edges to leaf nodes in $T$ . Thus, all predicates in $T$ are on edges between interior nodes, i.e., in $\operatorname{\textit{pred}}(T,A)$ . Then, because every predicate in $Q$ appears somewhere in $T$ (since $T$ is a GJT for $Q$ ), we have $\operatorname{\textit{pred}}(Q)\subseteq\operatorname{\textit{pred}}(T,A)$ . From all of the observations made so far and Lemma 9, we obtain:

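$$\begin{aligned}\operatorname{\mathcal{H}}(T,A,\operatorname{\textit{out}}(Q))&=(\operatorname{\textit{hyp}}(T,A),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(T,A))\\&=(\operatorname{\textit{hyp}}(Q)\cup\operatorname{\textit{hyp}}(T,A),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(Q))\\&\operatorname*{\rightsquigarrow}^{*}(\operatorname{\textit{hyp}}(Q),\operatorname{\textit{out}}(Q),\operatorname{\textit{pred}}(Q))\\&=\operatorname{\mathcal{H}}(Q).\end{aligned}$$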

Thus $\operatorname{\mathcal{H}}(T,A,\operatorname{\textit{out}}(Q))\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(Q)\operatorname*{\rightsquigarrow}^{*}\mathcal{I}$ . Furthermore, because $(T,N)$ is also canonical with $N\subseteq A$ and $\operatorname{\textit{out}}(Q)\subseteq\operatorname{\textit{var}}(N)$ we have $\operatorname{\mathcal{H}}(T,A,\operatorname{\textit{out}}(Q))\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(T,N,\operatorname{\textit{out}}(Q))$ by Lemma 8. Then, because reduction is confluent (Proposition 7) we obtain that $\operatorname{\mathcal{H}}(T,\allowbreak N,\allowbreak\operatorname{\textit{out}}(Q))$ and $\mathcal{I}$ can be reduced to the same triplet. Because $\mathcal{I}$ is in normal form, necessarily $\operatorname{\mathcal{H}}(T,N,\operatorname{\textit{out}}(Q))\operatorname*{\rightsquigarrow}^{*}\mathcal{I}$ . Since reduction steps can only remove nodes and hyperedges (and never add them), $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))\subseteq\operatorname{\textit{var}}(N)$ .

It remains to show that $\mathcal{J}$ is the empty triplet. Hereto, first verify the following. For any hypergraph triplets $\mathcal{U}$ and $\mathcal{V}$ , if $\mathcal{U}\operatorname*{\rightsquigarrow}^{*}\mathcal{V}$ then also $\tilde{\mathcal{U}}\operatorname*{\rightsquigarrow}^{*}\tilde{\mathcal{V}}$ . From this, $\operatorname{\mathcal{H}}(T,A,\operatorname{\textit{out}}(Q))\operatorname*{\rightsquigarrow}^{*}\mathcal{I}$ , and the fact that $\operatorname{\mathcal{H}}(T,A,\emptyset)$ is the residual of $\operatorname{\mathcal{H}}(T,A,\operatorname{\textit{out}}(Q))$ we conclude $\operatorname{\mathcal{H}}(T,A,\emptyset)\operatorname*{\rightsquigarrow}^{*}\tilde{\mathcal{I}}$ . Then, because $\tilde{\mathcal{I}}\operatorname*{\rightsquigarrow}^{*}\mathcal{J}$ , it follows that $\operatorname{\mathcal{H}}(T,A,\emptyset)\operatorname*{\rightsquigarrow}^{*}\mathcal{J}$ . Let $r$ be $T$ ’s root node, which is labeled by $\emptyset$ since $T$ is canonical. Then $\{r\}$ is a connex subset of $T$ . By Lemma 8, $\operatorname{\mathcal{H}}(T,A,\emptyset)\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(T,\{r\},\emptyset)$ . Now observe that the hypergraph of $\operatorname{\mathcal{H}}(T,\{r\},\emptyset)$ is empty, and its predicate set is also empty. Therefore, $\operatorname{\mathcal{H}}(T,\{r\},\emptyset)$ is the empty hypergraph triplet. In particular, it is in normal form. But, since $\mathcal{J}$ is also in normal form and normal forms are unique, $\mathcal{J}$ must also be the empty triplet. ∎

Corollary 2 (Completeness).

Let $Q$ be a GCQ. Assume that $\operatorname{\mathcal{H}}(Q)\operatorname*{\rightsquigarrow}^{*}{\mathcal{I}}\!\!\downarrow$ and $\tilde{\mathcal{I}}\operatorname*{\rightsquigarrow}^{*}{\mathcal{J}}\!\!\downarrow$ .

1. If $Q$ is acyclic, then $\mathcal{J}$ is the empty triplet.

2. If $Q$ is free-connex acyclic, then $\mathcal{J}$ is the empty triplet and $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))=\operatorname{\textit{out}}(Q)$ .

3. For every GJT $T$ of $Q$ and every connex subset $N$ of $T$ it holds that $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))\subseteq\operatorname{\textit{var}}(N)$ .

Proof.

(1) Since $Q$ is acyclic, there exists a GJT $T$ for $Q$ . Let $N$ be the set of all of $T$ ’s nodes. Then $N$ is a connex subset of $T$ and $\operatorname{\textit{out}}(Q)\subseteq\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(Q)$ . The result then follows from Proposition 10.

(2) Since $Q$ is free-connex acyclic, there exists a GJT pair $(T,N)$ compatible with $Q$ . In particular, $\operatorname{\textit{var}}(N)=\operatorname{\textit{out}}(Q)$ . By Proposition 10, $\mathcal{J}$ is the empty triplet, and $\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))\subseteq\operatorname{\textit{var}}(N)=\operatorname{\textit{out}}(Q)$ . It remains to show $\operatorname{\textit{out}}(Q)\subseteq\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))$ . First verify the following: a reduction step on a hypergraph triplet $\mathcal{H}$ never removes any variable in $\operatorname{\textit{out}}(\mathcal{H})$ from $\operatorname{\textit{hyp}}(\mathcal{H})$ , nor does it modify $\operatorname{\textit{out}}(\mathcal{H})$ . Then, since $\operatorname{\textit{out}}(\operatorname{\mathcal{H}}(Q))=\operatorname{\textit{out}}(Q)\subseteq\operatorname{\textit{var}}(Q)\subseteq\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\operatorname{\mathcal{H}}(Q)))$ , and $\operatorname{\mathcal{H}}(Q)\operatorname*{\rightsquigarrow}^{*}\mathcal{I}$ we obtain $\operatorname{\textit{out}}(Q)\subseteq\operatorname{\textit{var}}(\operatorname{\textit{hyp}}(\mathcal{I}))$ .

(3) Follows directly from Proposition 10. ∎

Theorem 6.1 follows directly from Corollaries 1 and 2. Theorem 6.2 follows from Theorem 6.1 and Proposition 8.

7 Experimental Setup

In this section, we present the setup of our experimental evaluation, whose results are discussed in Section 8. We first present our practical implementation of IEDyn, then show the queries and update stream used for evaluation, and finally discuss the competing systems.

Practical Implementation

We have implemented IEDyn as a query compiler that generates executable code in the Scala programming language. The generated code instantiates a $T$ -rep and defines trigger functions that are used for maintaining the $T$ -rep under updates. Our implementation is basic in the sense that we use Scala off-the-shelf collection libraries (notably MutableTreeMap) to implement the required indices. Faster implementations with specialized code for the index structures are certainly possible.

Our implementation supports two modes of operation: push-based and pull-based. In both modes, the system maintains the $T$ -rep under updates. In the push-based mode the system generates, on its output stream, the delta result $\Delta{Q}(\operatorname{\textit{db}},\operatorname{\mathit{u}})$ after each single-tuple update $\operatorname{\mathit{u}}$ . To do so, it uses a modified version of enumeration (Algorithm 1) that we call delta enumeration. Similarly to how Algorithm 1 enumerates $Q(\operatorname{\textit{db}})$ , delta enumeration enumerates $\Delta Q(\operatorname{\textit{db}},\operatorname{\mathit{u}})$ with constant delay (if $Q$ has at most one inequality per pair of atoms) resp. logarithmic delay (otherwise). To do so, it uses both (1) the $T$ -reduct GMRs $\rho_{n}$ and (2) the delta GMRs $\Delta\rho_{n}$ that are computed by Algorithm 2 when processing $u$ . In this case, however, one also needs to index the $\Delta\rho_{n}$ similarly to $\rho_{n}$ . In the pull-based mode, in contrast, the system only maintains the $T$ -rep under updates but does not generate any output stream. Nevertheless, at any time a user can call the enumeration (Algorithm 1) procedure to obtain the current output.
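The two modes of operation can be illustrated on a toy query. The following minimal Python sketch is not the $T$ -rep of IEDyn: it maintains a single inequality join $R(a)\Join_{a<b}S(b)$ with two sorted lists, and the class and method names (`InequalityJoin`, `insert`, `enumerate`) are hypothetical. Python's `bisect` on a list stands in for a tree-map index; list insertion is linear in the worst case, whereas a balanced tree would make it logarithmic.

```python
import bisect


class InequalityJoin:
    """Toy maintenance of Q(a, b) = R(a), S(b), a < b under single-tuple inserts."""

    def __init__(self):
        self.r = []  # sorted values of R.a (stand-in for a tree-map index)
        self.s = []  # sorted values of S.b

    def insert(self, relation, value):
        """Apply one insertion and return the delta result (push-based mode)."""
        if relation == "R":
            bisect.insort(self.r, value)
            # The new R-tuple joins with every S.b strictly greater than value.
            start = bisect.bisect_right(self.s, value)
            return [(value, b) for b in self.s[start:]]
        if relation == "S":
            bisect.insort(self.s, value)
            # The new S-tuple joins with every R.a strictly smaller than value.
            end = bisect.bisect_left(self.r, value)
            return [(a, value) for a in self.r[:end]]
        raise ValueError(f"unknown relation: {relation}")

    def enumerate(self):
        """Enumerate the current full result on demand (pull-based mode)."""
        for a in self.r:
            start = bisect.bisect_right(self.s, a)
            for b in self.s[start:]:
                yield (a, b)
```

In push-based use, a caller forwards the list returned by `insert` to the output stream after every update; in pull-based use, the return value is ignored and `enumerate` is called only when the current result is requested.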

We have described in Section 5 how IEDyn can process free-connex acyclic GCQs under updates. It should be noted that our implementation also supports the processing of general acyclic GCQs that are not necessarily free-connex. This is done using the following simple strategy. Let $Q$ be acyclic but not free-connex. First, compute a free-connex acyclic approximation $Q_{F}$ of $Q$ . $Q_{F}$ can always be obtained from $Q$ by extending the set of output variables of $Q$ . In the worst case, we need to add all variables, and $Q_{F}$ becomes the full join underlying $Q$ . Then, use IEDyn to maintain a $T$ -rep for $Q_{F}$ . When operating in push-based mode, for each update $\operatorname{\mathit{u}}$ , we use the $T$ -representation to delta-enumerate $\Delta{Q_{F}}(\operatorname{\textit{db}},\operatorname{\mathit{u}})$ and project each resulting tuple to materialize $\Delta{Q}(\operatorname{\textit{db}},\operatorname{\mathit{u}})$ in an array. Subsequently, we copy this array to the output. Note that the materialization of $\Delta{Q}(\operatorname{\textit{db}},\operatorname{\mathit{u}})$ here is necessary since the delta enumeration on $T$ can produce duplicate tuples after projection. When operating in pull-based mode, we materialize $Q(\operatorname{\textit{db}})$ in an array, and use delta enumeration of $Q_{F}$ to maintain the array under updates. Of course, under this strategy, we require $\Omega(\parallel{Q(\operatorname{\textit{db}})}\parallel)$ space in the worst case, just like (H)IVM would, but we avoid the (partial) materialization of delta queries. Note the distinction between the two modes: in push-based mode $\Delta{Q}(\operatorname{\textit{db}},\operatorname{\mathit{u}})$ is materialized (and discarded once the output is generated), while in pull-based mode $Q(\operatorname{\textit{db}})$ is materialized upon requests.
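The projection step for non-free-connex queries can be sketched as follows. This is a minimal Python illustration under stated assumptions: the delta of $Q_{F}$ is given as pairs of a tuple and a multiplicity, duplicates that arise after projection are merged by summing multiplicities (multiset semantics), and the function name `project_delta` and its parameters are hypothetical.

```python
from collections import Counter


def project_delta(delta_qf, keep_positions):
    """Project delta tuples of Q_F and merge duplicates by summing multiplicities."""
    projected = Counter()
    for tup, mult in delta_qf:
        key = tuple(tup[i] for i in keep_positions)
        projected[key] += mult
    # Drop entries whose multiplicities cancelled out.
    return {t: m for t, m in projected.items() if m != 0}
```

The intermediate dictionary plays the role of the materialized array: it exists only while one update is processed and can be discarded once the output has been emitted.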

Queries and Streams

In contrast to the setting for equi-join queries where systems can be compared based on industry-strength benchmarks such as TPC-H and TPC-DS, there is no established benchmark suite for inequality-join queries.

We evaluate IEDyn on the GCQs listed in Table 1. Here, queries $Q_{1}$ – $Q_{6}$ are full join queries (i.e., queries without projections). Among these, $Q_{1}$ , $Q_{3}$ and $Q_{4}$ are cross products with inequality predicates, while $Q_{2}$ , $Q_{5}$ and $Q_{6}$ have at least one equality in addition to the inequality predicates. Queries $Q_{1}$ and $Q_{2}$ are binary join queries, while $Q_{3}$ – $Q_{6}$ are multi-way join queries. Queries $Q_{7}$ – $Q_{12}$ project over the result of queries $Q_{4}$ – $Q_{6}$ . Among these, $Q_{7}$ – $Q_{9}$ are free-connex acyclic, while $Q_{10}$ – $Q_{12}$ are acyclic but not free-connex.

We evaluate these queries on streams of updates where each update consists of a single tuple insertion. The database is always empty when we start processing the update stream. We synthetically generate two kinds of update streams: randomly-ordered and temporally-ordered update streams. In randomly-ordered update streams, insertions can occur in any order. In contrast, temporally-ordered update streams guarantee that any attribute that participates in an inequality in the query has a larger value than the same attribute in any of the previously inserted tuples. Randomly-ordered update streams are useful for comparing against systems that allow processing of out-of-order tuples; temporally-ordered update streams are useful for comparing against systems that assume events always arrive with increasing timestamp values. Examples of systems that process temporally-ordered streams are automaton-based CER systems.

A random update stream of size $N$ for a query with $k$ relations is generated as follows. First, we generate $N/k$ tuples with random attribute values for each relation. Then, we insert tuples in the update stream by uniformly and randomly selecting them without repetitions. This ensures that there are $N/k$ insertions from each relation in the stream. To utilize the same update stream for evaluating each system we compare to, each stream is stored in a file. We choose the values for equality join attributes uniformly at random from $1$ to $200$ , except for the scalability and selectivity experiments in Section 8 where the interval depends on the stream size.
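The generation procedure can be sketched in a few lines of Python. This is an illustration only: the schema encoding (relation name mapped to arity), the function name `random_stream`, and the choice to draw every attribute from the same interval are assumptions made for brevity; the text above fixes the interval $1$ to $200$ only for equality join attributes.

```python
import random


def random_stream(relations, n, seed=0, lo=1, hi=200):
    """Build a randomly-ordered insertion stream with n // k tuples per relation.

    relations maps a relation name to its arity (hypothetical schema encoding).
    """
    rng = random.Random(seed)
    per_relation = n // len(relations)
    stream = [
        (name, tuple(rng.randint(lo, hi) for _ in range(arity)))
        for name, arity in relations.items()
        for _ in range(per_relation)
    ]
    rng.shuffle(stream)  # uniform random order, each tuple used exactly once
    return stream
```

Fixing the seed makes the stream reproducible, which serves the same purpose as storing it in a file when several systems must process identical input.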

Temporally-ordered streams are generated similarly, but when a new insertion tuple is chosen, a new value is inserted in the attributes that are compared through inequalities. This value is larger than the corresponding values of previously inserted tuples. All attributes hold integer values, except for attributes $c$ and $i$ which contain string values.
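A temporally-ordered stream can be sketched analogously. In this hypothetical Python illustration, `inequality_positions` marks which attribute positions of each relation are compared through inequalities, and a global counter `clock` guarantees that each new value exceeds all earlier ones; the step size and the range of the remaining attributes are arbitrary choices, and string-valued attributes are not modelled.

```python
import random


def temporal_stream(relations, inequality_positions, n, seed=0):
    """Insertion stream whose inequality attributes strictly increase over time.

    relations: name -> arity; inequality_positions: name -> set of positions
    compared through inequalities (both are hypothetical encodings).
    """
    rng = random.Random(seed)
    per_relation = n // len(relations)
    pending = [name for name in relations for _ in range(per_relation)]
    rng.shuffle(pending)
    clock = 0
    stream = []
    for name in pending:
        clock += rng.randint(1, 5)  # strictly larger than every earlier value
        positions = inequality_positions.get(name, set())
        tup = tuple(
            clock if i in positions else rng.randint(1, 200)
            for i in range(relations[name])
        )
        stream.append((name, tup))
    return stream
```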

Competitors

We compare IEDyn with DBToaster (DBT) DBLP:journals/vldb/KochAKNNLS14 , Esper (E) esper , SASE (SE)DBLP:conf/sigmod/WuDR06 ; Sase2014 ; DBLP:conf/sigmod/AgrawalDGI08 , Tesla (T)DBLP:conf/debs/CugolaM10 ; DBLP:journals/jss/CugolaM12 , and ZStream (Z) DBLP:conf/sigmod/MeiM09 using memory footprint, update processing time, and enumeration delay as comparison metrics. The competing systems differ in their mode of operation (push-based vs pull-based) and some of them only support temporally-ordered streams.

DBToaster is a state-of-the-art implementation of HIVM. It operates in pull-based mode, and can deal with randomly-ordered update streams. DBToaster is particularly meticulous in that it materializes only useful views, and therefore it is an interesting implementation for comparison. DBToaster has been extensively tested on equi-join queries and has proven to be more efficient than a commercial database management system, a commercial stream processing system and an IVM implementation DBLP:journals/vldb/KochAKNNLS14 . DBToaster compiles given SQL statements into executable trigger programs in different programming languages. We compare against those generated in Scala from DBToaster Release 2.2 (https://dbtoaster.github.io/), which uses actors (https://doc.akka.io/docs/akka/2.5/) to generate events from the input files. During our experiments, however, we have found that this creates unnecessary memory overhead. For a fair memory-wise comparison, we have therefore removed these actors.

Esper is a CER engine with a relational model based on Stanford STREAM DBLP:books/sp/16/ArasuBBCDIMSW16 . It is push-based, and can deal with randomly-ordered update streams. We use the Java-based open-source version (http://www.espertech.com/esper/esper-downloads/) for our comparisons. Esper processes queries expressed in the Esper event processing language (EPL).

SASE is an automaton-based CER system. It operates in push-based mode, and can deal with temporally-ordered update streams only. We use the publicly available Java-based implementation of SASE (https://github.com/haopeng/sase). This implementation does not support projections. Furthermore, since SASE requires queries to specify a match semantics (any match, next match, partition contiguity) but does not allow combinations of such semantics, we can only express queries $Q_{1}$ , $Q_{2}$ , and $Q_{4}$ in SASE. Hence, we compare against SASE for these queries only. To be coherent with our semantics, the corresponding SASE expressions use the any match semantics DBLP:conf/sigmod/AgrawalDGI08 .

Tesla/T-Rex is also an automaton-based CER system. It operates in push-based mode only, and supports temporally-ordered update streams only. We use the publicly available C-based implementation (https://github.com/deib-polimi/TRex). This implementation operates in a publish-subscribe model where events are published by clients to the server, known as TRexServer. Clients can subscribe to receive recognized composite events. Tesla cannot deal with queries involving inequalities on multiple attributes, such as $Q_{3}$ ; therefore, we do not show results for $Q_{3}$ . Since Tesla works in a decentralized manner, we measure the update processing time by logging the time at the Tesla TRexServer from the stream start until the end.

ZStream is a CER system based on a relational internal architecture. It operates in push-based mode, and can deal with temporally-ordered update streams only. ZStream is not available publicly. Hence, we have created our own implementation following the lazy evaluation algorithm of ZStream described in their original paper DBLP:conf/sigmod/MeiM09 . This paper does not describe how to treat projections, and as such we compare against ZStream only for full join queries $Q_{1}$ – $Q_{6}$ .

Due to space limitations, we omit the query expressions used for Esper (in EPL), SASE, and Tesla/TRex (rules) in this paper, but they are available at exps .

Setup

Our experiments are run on an 8-core 3.07 GHz machine running Ubuntu with GNU/Linux 3.13.0-57-generic. To compile the different systems or generated trigger programs, we have used GCC version 4.8.2, Java 1.8.0_101, and Scala version 2.12.4. Each query is evaluated 10 times to measure update processing delay, and two times to measure memory footprint. We present the average over those runs. Each time a query is evaluated, 20 GB of main memory are freshly allocated to the program. To measure the memory footprint for Scala/Java based systems, we invoke the JVM system calls every 10 updates and consider the maximum value. For C/C++ based systems we use the GNU/Linux time command to measure memory usage. Experiments that measure memory footprint are always run separately from the experiments that measure processing time.
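The sampling protocol for memory footprint (probe every 10 updates, keep the maximum) can be sketched as follows. This is a Python analogue, not the JVM-based measurement used in the experiments: it substitutes the standard-library `tracemalloc` module for the JVM calls, and the function name `peak_memory_during` and its parameters are hypothetical.

```python
import tracemalloc


def peak_memory_during(process_update, updates, sample_every=10):
    """Return the largest sampled memory use (bytes) while applying updates."""
    tracemalloc.start()
    peak = 0
    try:
        for i, u in enumerate(updates, start=1):
            process_update(u)
            if i % sample_every == 0:
                # get_traced_memory() returns (current, peak) in bytes.
                current, _ = tracemalloc.get_traced_memory()
                peak = max(peak, current)
    finally:
        tracemalloc.stop()
    return peak
```

Sampling trades accuracy for overhead: a spike that appears and vanishes between two probes is not observed, which is one reason to run memory and timing experiments separately.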

8 Experimental Evaluation

Before presenting experimental results we make some remarks. First, when we compare against another system we run IEDyn in the operation mode supported by the competitor. For push-based systems we report the time required to both process the entire update stream, and generate the changes to the output after each update. When comparing against a pull-based system, the measured time includes only processing the entire update stream. We later report the speed with which the result can be generated from the underlying representation of the output (a $T$ -representation in the case of IEDyn). When comparing against a system that supports randomly-ordered update streams, we only report comparisons using streams of this type. We have also looked at temporally-ordered streams for these systems, but the throughput of the competing systems is similar (fluctuating between 3% and 12%) while that of IEDyn significantly improves (fluctuating between 35% and 50%) because insertions to sorted lists become constant instead of logarithmic. We omit these experiments due to lack of space.

It is also important to remark that some executions of the competing systems failed either because they required more than 20GB of main memory or they took more than 1500 seconds. If an execution requires more than 20GB, we report the processing time elapsed until the exception was raised. If an execution is still running after 1500 seconds, we stop it and report its maximum memory usage while running.

Full join queries

Figure 10 compares the update processing time of IEDyn against the competing systems for full join queries $Q_{1}$ – $Q_{6}$ . We have grouped experiments that are run under comparable circumstances: in the top row experiments are conducted for push-based systems on temporally-ordered update streams ( $SE$ , $T$ , $Z$ ); in the second row push-based systems on randomly-ordered update streams ( $E$ ), and in the bottom row pull-based systems on randomly-ordered update streams ( $DBT$ ). We observe that all of the competing systems have large processing times even for very small update stream sizes, and that for some systems execution even failed. All of these behaviors are due to the low selectivity of joins on this dataset. Table 2 shows the output size of each query for the largest stream sizes reported in Figure 10. We report on streams that generate outputs of different sizes below.

Figure 10 is complemented by Figures 15 and 15 where we plot the processing time and memory footprint used by IEDyn as a percentage of the corresponding usage in the competing systems. Both $SE$ and $Z$ support temporally-ordered streams; however, $SE$ supports only queries $Q_{1}$ , $Q_{2}$ , and $Q_{4}$ , whereas $Z$ supports $Q_{1}$ – $Q_{6}$ . Therefore, in Figure 15 we show $SE$ (right) and $Z$ (left). Note that IEDyn significantly outperforms the competing systems on all full join queries. Specifically, it outperforms $DBT$ by up to one order of magnitude in processing time and up to two orders of magnitude in memory footprint. It outperforms $T$ by up to two orders of magnitude in processing time, and by more than one order of magnitude in memory footprint. Moreover, for these queries, even in push-based mode IEDyn can support the enumeration of query results from its data structures at any time, while competing push-based systems have no such support. Hence, IEDyn is not only more efficient but also provides more functionality.

Projections

Results in Figure 15 show that IEDyn significantly outperforms both $E$ and $T$ on free-connex queries $Q_{7}$ – $Q_{9}$ : two orders of magnitude improvement over the throughput of $T$ and more than twofold improvement over that of $E$ . Memory usage is also significantly lower: one order of magnitude less than $E$ on the larger datasets for $Q_{7}$ , and a consistent twofold improvement over $T$ . Similarly, IEDyn outperforms $DBT$ on free-connex queries $Q_{7}$ and $Q_{8}$ in time and memory by one and two orders of magnitude, respectively.

For non-free-connex queries $Q_{10}$ – $Q_{12}$ , IEDyn continues to outperform $E$ , $T$ , and $DBT$ in terms of processing time. In memory footprint IEDyn outperforms $E$ for $Q_{10}$ and $Q_{12}$ . Compared to $DBT$ , IEDyn still improves on memory footprint on non-free-connex queries, though less significantly. In contrast, IEDyn largely improves memory usage over $T$ on larger datasets, even on non-free-connex queries.

Result enumeration

We know from Section 5 that $T$ -reps maintained by IEDyn feature constant delay enumeration (CDE). This theoretical notion, however, hides a constant factor that could decrease performance in practice when compared to full materialization. In Figure 15, we show the practical application of CDE in IEDyn and compare against $DBT$ which materializes the full query results. We plot the time required to enumerate the result from IEDyn’s $T$ -rep as a fraction of the time required to enumerate the result from $DBT$ ’s materialized views. As can be seen from the figure, both enumeration times are comparable on average.

Note that we do not compare enumeration time for push-based systems, since for these systems the time required for delta enumeration is already included in the update processing time reported in Figures 10, 15 (bottom), and 15.

Selective inequality joins

We execute IEDyn over uniformly distributed datasets. In this case, the inequality joins yield large query results. One could argue that this might not be realistic. To address this problem, we generated datasets with probability distributions that are parametrized by a selectivity $s$ , such that the expected number of output tuples is $s$ percent of the Cartesian product of all relations in the query.
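The distributions used for these datasets are not spelled out here, so the following Python sketch shows only one hypothetical way to hit a target selectivity for a single inequality $a<b$ . It assumes $a$ is uniform on $[0,1)$ and $b$ is uniform on $[0,2s']$ with $s'=s/100\leq 0.5$ ; then $\Pr[a<b\mid b]=b$ , so $\Pr[a<b]=\mathbb{E}[b]=s'$ . The function name `selective_pairs` is an assumption.

```python
import random


def selective_pairs(n, selectivity_percent, seed=0):
    """Draw n values for R.a and S.b so that P(a < b) equals the selectivity.

    Assumes one inequality a < b and a selectivity of at most 50 percent.
    """
    s = selectivity_percent / 100.0
    if not 0.0 < s <= 0.5:
        raise ValueError("this sketch supports selectivities in (0, 50] percent")
    rng = random.Random(seed)
    a_values = [rng.random() for _ in range(n)]               # a ~ U[0, 1)
    b_values = [rng.uniform(0.0, 2.0 * s) for _ in range(n)]  # b ~ U[0, 2s]
    return a_values, b_values
```

With $n$ tuples per relation, the expected output size of the join is then $s'\cdot n^{2}$ , that is, $s$ percent of the Cartesian product.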

Our results show that IEDyn not only outperforms existing systems on less selective inequality joins, but also performs consistently better on very selective inequality joins (see Figure 15). For highly selective inequality joins the measurements are similar to what we observe for equality joins, which we investigated in detail in our previous work on equality joins dyn:2017 .

Scalability

To show that IEDyn performs consistently on streams of different sizes, we report the processing delay and the memory footprint each time $10\%$ of the stream is processed in Figure 15. These results show that both the memory footprint and the update delay of IEDyn increase linearly as the stream advances. We show results for queries $Q_{4}$ , $Q_{5}$ , $Q_{7}$ , and $Q_{8}$ only due to space constraints.

Appendix A Proofs of Section 3

See Lemma 2.

Proof.

The lemma follows from the following observations. (1) It is straightforward to observe that $T^{\prime}$ is a valid GJT: the construction leaves the set of leaf nodes untouched; ensures that all nodes (including the newly added node $p$ ) continue to have a guard child; ensures that the connectedness condition continues to hold for the relocated children of $n$ , because every variable in $n$ is present on the entire path between $n$ and $p$ ; and ensures that edge labels remain valid (for the relocated nodes this is because $\operatorname{\textit{var}}(p)=\operatorname{\textit{var}}(g)\subseteq\operatorname{\textit{var}}(n)$ ).

(2) $N^{\prime}$ is a connex subset of $T^{\prime}$ because the subtree of $T$ induced by $N$ equals the subtree of $T^{\prime}$ induced by $N^{\prime}$ , modulo the replacement of $l$ by $p$ in case $l$ was in $N$ and $p$ is hence in $N^{\prime}$ .

(3) $(T,N)$ is equivalent to $(T^{\prime},N^{\prime})$ because the construction leaves leaf atoms untouched, preserves edge labels, and $\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(N^{\prime})$ . The latter is clear if $l\not\in N$ because then $N=N^{\prime}$ . It follows from the fact that $\operatorname{\textit{var}}(l)=\operatorname{\textit{var}}(p)$ if $l\in N$ , in which case $N^{\prime}=N\setminus\{l\}\cup\{p\}$ .

(4) All nodes in $\operatorname{ch}_{T}(n)\setminus N$ (and their descendants) are relocated to $p$ in $T^{\prime}$ . Therefore, $n$ is no longer a violator in $(T^{\prime},N^{\prime})$ . Because we do not introduce new violators, the number of violators of $(T^{\prime},N^{\prime})$ is strictly smaller than the number of violators of $(T,N)$ .∎

See Lemma 3.

Proof.

The lemma follows from the following observations. (1) It is straightforward to observe that $T^{\prime}$ is a valid GJT: the construction leaves the set of leaf nodes untouched; ensures that all nodes (including the newly added node $p$ ) continue to have a guard child; ensures that the connectedness condition continues to hold for the relocated children of $n$ , because every variable in $n$ is also present in $p$ , their new parent; and ensures that edge labels remain valid (for the relocated nodes this is because $\operatorname{\textit{var}}(p)=\operatorname{\textit{var}}(n)$ ).

(2) $N^{\prime}$ is a connex subset of $T^{\prime}$ because (i) the subtree of $T$ induced by $N$ equals the subtree of $T^{\prime}$ induced by $N^{\prime}\setminus\{p\}$ , (ii) $n\in N$ , and (iii) $p$ is a child of $n$ in $T^{\prime}$ . Therefore, $N^{\prime}$ must be connex.

(3) $(T,N)$ is equivalent to $(T^{\prime},N^{\prime})$ because the construction leaves leaf atoms untouched, preserves edge labels, and $\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(N^{\prime})$ . The latter follows because $\operatorname{\textit{var}}(N^{\prime})=\operatorname{\textit{var}}(N\cup\{p\})$ and because $\operatorname{\textit{var}}(p)=\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(N)$ since $n\in N$ .

(4) All nodes in $\operatorname{ch}_{T}(n)\setminus N$ (and their descendants) are relocated to $p$ in $T^{\prime}$ . Therefore, $n$ is no longer a violator in $(T^{\prime},N^{\prime})$ . Because we do not introduce new violators, the number of violators of $(T^{\prime},N^{\prime})$ is strictly smaller than the number of violators of $(T,N)$ .∎

Appendix B Proofs of Section 5

See Lemma 5.

Proof.

We proceed by induction on the number of descendants of $n$ . If $n$ has no descendant then $Q_{n}$ is a single atom $r(\overline{x})$ , so we have $\overline{x}=\operatorname{\textit{out}}(Q_{n})=\operatorname{\textit{var}}(n)$ . Then $\pi_{\operatorname{\textit{var}}(n)}Q_{n}(\operatorname{\textit{db}})=Q_{n}(\operatorname{\textit{db}})=\operatorname{\textit{db}}_{r(\overline{x})}=\rho_{n}$ , concluding the base case. Now, for the inductive case we distinguish whether $n$ has one or two children.

Assume $n$ has a single child $c$ and $Q_{c}=({\mathcal{R}}\mid\Theta)$ . Then, by definition we have $Q_{n}=({\mathcal{R}}\mid\Theta\cup\operatorname{\textit{pred}}(n)).$ Therefore $Q_{n}(\operatorname{\textit{db}})=\sigma_{\operatorname{\textit{pred}}(n)}Q_{c}(\operatorname{\textit{db}})$ , which implies that $\pi_{\operatorname{\textit{var}}(n)}Q_{n}(\operatorname{\textit{db}})=\pi_{\operatorname{\textit{var}}(n)}\sigma_{\operatorname{\textit{pred}}(n)}Q_{c}(\operatorname{\textit{db}})$ . Since $\operatorname{\textit{pred}}(n)$ only mentions variables in $\operatorname{\textit{var}}(c)\cup\operatorname{\textit{var}}(n)$ and $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(c)$ , as $c$ is a guard of $n$ , this is equivalent to

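$$\pi_{\operatorname{\textit{var}}(n)}\sigma_{\operatorname{\textit{pred}}(n)}\pi_{\operatorname{\textit{var}}(c)}Q_{c}(\operatorname{\textit{db}}).$$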

By induction, this equals $\pi_{\operatorname{\textit{var}}(n)}\sigma_{\operatorname{\textit{pred}}(n)}\rho_{c}=\rho_{n}$ , showing that $\pi_{\operatorname{\textit{var}}(n)}Q_{n}(\operatorname{\textit{db}})=\rho_{n}$ .

Assume now that $n$ has two children $c_{1}$ and $c_{2}$ , and that $Q_{c_{i}}=\left({\mathcal{R}}_{i}\mid\Theta_{i}\right)$ for $i\in\{1,2\}$ . We assume w.l.o.g. that $c_{1}$ is a guard for $n$ . First, note that by definition $Q_{n}=\left({\mathcal{R}}_{1}\Join{\mathcal{R}}_{2}\mid\Theta_{1}\cup\Theta_{2}\cup\operatorname{\textit{pred}}(n)\right),$ and then we have $Q_{n}(\operatorname{\textit{db}})=\sigma_{\operatorname{\textit{pred}}(n)}\sigma_{\Theta_{1}}\sigma_{\Theta_{2}}\left({\mathcal{R}}_{1}\Join{\mathcal{R}}_{2}\right)(\operatorname{\textit{db}}).$ Since $\Theta_{i}$ only mentions variables of atoms in $\mathcal{R}_{i}$ (for $i\in\{1,2\}$ ), we can push the selections and obtain

[TABLE]

Therefore,

[TABLE]

Since $\operatorname{\textit{var}}(\operatorname{\textit{pred}}(n))\subseteq\operatorname{\textit{var}}(c_{1})\cup\operatorname{\textit{var}}(c_{2})\cup\operatorname{\textit{var}}(n)$ and $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(c_{1})$ we have $\operatorname{\textit{var}}(\operatorname{\textit{pred}}(n))\subseteq\operatorname{\textit{var}}(c_{1})\cup\operatorname{\textit{var}}(c_{2})$ . Combining this with the fact that, due to the connectedness property of $T$ , we have $\operatorname{\textit{var}}(Q_{c_{1}})\cap\operatorname{\textit{var}}(Q_{c_{2}})\subseteq\operatorname{\textit{var}}(c_{i})$ for $i\in\{1,2\}$ , we can add the following projections

[TABLE]

Then, by induction hypothesis we have

[TABLE]

concluding our proof. ∎

See 6

Proof.

Within the proof, we abuse notation and allow for projections over supersets of variables. For example, if $\operatorname{\textit{var}}(Q)\subseteq\overline{x}$ then $\pi_{\overline{x}}Q=\pi_{\overline{x}\cap\operatorname{\textit{var}}(Q)}Q$ .

Let $n\in N$ and $\vec{t}\in\rho_{n}$ . We proceed by induction on the number of nodes in $N\cap T_{n}$ . If $N\cap T_{n}=\{n\}$ , we have $\operatorname{\textit{var}}(N)\cap\operatorname{\textit{var}}(Q_{n})=\operatorname{\textit{var}}(n)$ and therefore $\pi_{\operatorname{\textit{var}}(N)}Q_{n}(\operatorname{\textit{db}})=\pi_{\operatorname{\textit{var}}(n)}Q_{n}(\operatorname{\textit{db}})$ . Then, by Lemma 5 we have $\pi_{\operatorname{\textit{var}}(N)}Q_{n}(\operatorname{\textit{db}})=\rho_{n}$ . As $\vec{t}\in\rho_{n}$ , this implies that the only tuple in $\pi_{\operatorname{\textit{var}}(N)}Q_{n}(\operatorname{\textit{db}})$ that is compatible with $\vec{t}$ is $\vec{t}$ itself. As $n$ is in the frontier of $N$ , $\operatorname{\textsc{enum}}_{T,N}(n,\vec{t},\rho)$ will enumerate precisely $\{(\vec{t},\rho_{n}(\vec{t}))\}$ (Line 4), which concludes the base case.

For the inductive step we need to consider two cases depending on the number of children of $n$ .

Case (1). If $n$ has a single child $c$ then necessarily $c$ is a guard of $n$ , i.e., $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{var}}(c)$ . In this case, Algorithm 1 will call $\operatorname{\textsc{enum}}_{T,N}(c,\vec{s},\rho)$ for each tuple $\vec{s}\in\rho_{c}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(n)}\vec{t}$ . By induction hypothesis and Lemma 5, this will correctly enumerate every tuple in $\pi_{\operatorname{\textit{var}}(N)}Q_{c}(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{s}$ for every $\vec{s}$ in $\sigma_{\operatorname{\textit{pred}}(n)}(\pi_{\operatorname{\textit{var}}(c)}Q_{c}(\operatorname{\textit{db}})\operatorname{\ltimes}\allowbreak\vec{t})$ . Therefore, this enumerates the set

[TABLE]

As $\operatorname{\textit{var}}(\operatorname{\textit{pred}}(n))\subseteq\operatorname{\textit{var}}(c)\cup\operatorname{\textit{var}}(n)=\operatorname{\textit{var}}(c)\subseteq\operatorname{\textit{var}}(Q_{c})$ , we can pull out the projection and selection

[TABLE]

Because the variables in $\vec{t}$ are a subset of $\operatorname{\textit{var}}(c)$ , this is the same as $\pi_{\operatorname{\textit{var}}(N)}\sigma_{\operatorname{\textit{pred}}(n)}(Q_{c}(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{t}).$ Finally, we push the selection and projection inside and obtain

[TABLE]

Case (2). Otherwise, $n$ has two children $c_{1}$ and $c_{2}$ . Since $|N\cap T_{n}|>1$ and $N$ is sibling closed we have $\{c_{1},c_{2}\}\subset N$ . In this case, Algorithm 1 will first enumerate $\vec{t_{i}}\in\rho_{c_{i}}\operatorname{\ltimes}_{\operatorname{\textit{pred}}(n\rightarrow c_{i})}\vec{t}$ for $i\in\{1,2\}$ . By Lemma 5 this is equivalent to enumerating every $\vec{t_{i}}$ in $\sigma_{\operatorname{\textit{pred}}(n\rightarrow c_{i})}\pi_{\operatorname{\textit{var}}(c_{i})}Q_{c_{i}}(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{t}$ . Then, for each such $\vec{t_{i}}$ the algorithm will enumerate every pair $(\vec{s_{i}},\mu_{i})$ generated by $\operatorname{\textsc{enum}}_{T,N}(c_{i},\vec{t_{i}},\rho)$ , which by induction is the same as enumerating every $(\vec{s_{i}},\mu_{i})$ in $\pi_{\operatorname{\textit{var}}(N)}Q_{c_{i}}(\operatorname{\textit{db}})\operatorname{\ltimes}\vec{t_{i}}$ . Therefore the algorithm is enumerating

[TABLE]

By the same reasoning as in the previous case, this is equivalent to enumerating every $(\vec{s_{i}},\mu_{i})$ in

[TABLE]

From the connectedness property of $T$ , it follows that $\operatorname{\textit{var}}(Q_{c_{1}})\cap\operatorname{\textit{var}}(Q_{c_{2}})\subseteq\operatorname{\textit{var}}(n)$ . Thus, $\operatorname{\textit{var}}(Q_{c_{1}})\cap\operatorname{\textit{var}}(Q_{c_{2}})$ is a subset of the variables of $\vec{t}$ . Hence, every tuple $\vec{s_{1}}$ will be compatible with every tuple $\vec{s_{2}}$ , and the enumeration of every pair $(\vec{s_{1}}\cup\vec{s_{2}},\mu_{1}\times\mu_{2})$ is the same as the enumeration of

[TABLE]

We can now push the projections and selections outside and obtain

[TABLE]

Since $\operatorname{\textit{pred}}(n)=\operatorname{\textit{pred}}(n\rightarrow c_{1})\cup\operatorname{\textit{pred}}(n\rightarrow c_{2})$ and the variables in $\operatorname{\textit{var}}(Q_{c_{1}})\cap\operatorname{\textit{var}}(Q_{c_{2}})$ are contained in the variables of $\vec{t}$ , we have

[TABLE]

Appendix C Proofs of Section 6

C.1 Proof of Proposition 7

Because no infinite sequences of reduction steps are possible, it suffices to demonstrate local confluence:

Proposition 11.

If $\mathcal{H}\operatorname*{\rightsquigarrow}\mathcal{I}_{1}$ and $\mathcal{H}\operatorname*{\rightsquigarrow}\mathcal{I}_{2}$ then there exists $\mathcal{J}$ such that both $\mathcal{I}_{1}\operatorname*{\rightsquigarrow}^{*}\mathcal{J}$ and $\mathcal{I}_{2}\operatorname*{\rightsquigarrow}^{*}\mathcal{J}$ .

Indeed, it is a standard result in the theory of rewriting systems that confluence (Proposition 7) and local confluence (Proposition 11) coincide when infinite sequences of reduction steps are impossible [4].

Before proving Proposition 11, we observe that the property of being isolated or being a conditional subset is preserved under reductions, in the following sense.

Lemma 10.

Assume that $\mathcal{H}\operatorname*{\rightsquigarrow}\mathcal{I}$ . Then $\operatorname{\textit{pred}}(\mathcal{I})\subseteq\operatorname{\textit{pred}}(\mathcal{H})$ and for every hyperedge $e$ we have $\operatorname{\textit{ext}}_{\mathcal{I}}(e)\subseteq\operatorname{\textit{ext}}_{\mathcal{H}}(e)$ , $\operatorname{\textit{jv}}_{\mathcal{I}}(e)\subseteq\operatorname{\textit{jv}}_{\mathcal{H}}(e)$ , and $\operatorname{\textit{isol}}_{\mathcal{H}}(e)\subseteq\operatorname{\textit{isol}}_{\mathcal{I}}(e)$ . Furthermore, if $e\operatorname*{\sqsubseteq}_{\mathcal{H}}f$ then also $e\operatorname*{\sqsubseteq}_{\mathcal{I}}f$ .

Proof.

First observe that $\operatorname{\textit{pred}}(\mathcal{I})\subseteq\operatorname{\textit{pred}}(\mathcal{H})$ , since reduction operators only remove predicates. This implies that $\operatorname{\textit{ext}}_{\mathcal{I}}(e)\subseteq\operatorname{\textit{ext}}_{\mathcal{H}}(e)$ for every hyperedge $e$ . Furthermore, because reduction operators only remove hyperedges and never add them, it is easy to see that $\operatorname{\textit{jv}}_{\mathcal{I}}(e)\subseteq\operatorname{\textit{jv}}_{\mathcal{H}}(e)$ . Hence, if $x\in\operatorname{\textit{isol}}_{\mathcal{H}}(e)$ then $x\not\in\operatorname{\textit{jv}}_{\mathcal{H}}(e)\supseteq\operatorname{\textit{jv}}_{\mathcal{I}}(e)$ and $x\not\in\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{H}))\supseteq\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{I}))$ . Therefore, $x\in\operatorname{\textit{isol}}_{\mathcal{I}}(e)$ . As such, $\operatorname{\textit{isol}}_{\mathcal{H}}(e)\subseteq\operatorname{\textit{isol}}_{\mathcal{I}}(e)$ .

Next, assume that $e\operatorname*{\sqsubseteq}_{\mathcal{H}}f$ . We need to show that $\operatorname{\textit{jv}}_{\mathcal{I}}(e)\subseteq f$ and $\operatorname{\textit{ext}}_{\mathcal{I}}(e\setminus f)\subseteq f$ . The first condition follows since $\operatorname{\textit{jv}}_{\mathcal{I}}(e)\subseteq\operatorname{\textit{jv}}_{\mathcal{H}}(e)\subseteq f$ where the last inclusion is due to $e\operatorname*{\sqsubseteq}_{\mathcal{H}}f$ . The second also follows since $\operatorname{\textit{ext}}_{\mathcal{I}}(e\setminus f)\subseteq\operatorname{\textit{ext}}_{\mathcal{H}}(e\setminus f)\subseteq f$ where the last inclusion is due to $e\operatorname*{\sqsubseteq}_{\mathcal{H}}f$ . ∎

Proof of Proposition 11.

If $\mathcal{I}_{1}=\mathcal{I}_{2}$ then it suffices to take $\mathcal{J}=\mathcal{I}_{1}=\mathcal{I}_{2}$ . Therefore, assume in the following that $\mathcal{I}_{1}\not=\mathcal{I}_{2}$ . Then, necessarily $\mathcal{I}_{1}$ and $\mathcal{I}_{2}$ are obtained by applying two different reduction operations on $\mathcal{H}$ . We make a case analysis on the types of reductions applied.

(1) Case (ISO, ISO): assume that $\mathcal{I}_{1}$ is obtained by removing the non-empty set $X_{1}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ from hyperedge $e_{1}$ , while $\mathcal{I}_{2}$ is obtained by removing non-empty $X_{2}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{2})$ from $e_{2}$ with $X_{1}\not=X_{2}$ . There are two possibilities.

(1a) $e_{1}\not=e_{2}$. Then $e_{2}$ is still a hyperedge in $\mathcal{I}_{1}$ and $e_{1}$ is still a hyperedge in $\mathcal{I}_{2}$. By Lemma 10, $\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})\subseteq\operatorname{\textit{isol}}_{\mathcal{I}_{2}}(e_{1})$ and $\operatorname{\textit{isol}}_{\mathcal{H}}(e_{2})\subseteq\operatorname{\textit{isol}}_{\mathcal{I}_{1}}(e_{2})$. Therefore, we can still remove $X_{2}$ from $\mathcal{I}_{1}$ by means of rule ISO, and similarly remove $X_{1}$ from $\mathcal{I}_{2}$. Let $\mathcal{J}_{1}$ (resp. $\mathcal{J}_{2}$) be the result of removing $X_{2}$ from $\mathcal{I}_{1}$ (resp. $X_{1}$ from $\mathcal{I}_{2}$). Then $\mathcal{J}_{1}=\mathcal{J}_{2}$ (and hence equals triplet $\mathcal{J}$):

[TABLE]

(1b) $e_{1}=e_{2}$. We show that $X_{2}\setminus X_{1}\subseteq\operatorname{\textit{isol}}_{\mathcal{I}_{1}}(e_{1}\setminus X_{1})$ and similarly $X_{1}\setminus X_{2}\subseteq\operatorname{\textit{isol}}_{\mathcal{I}_{2}}(e_{2}\setminus X_{2})$. This suffices because we can then apply ISO to remove $X_{2}\setminus X_{1}$ from $\mathcal{I}_{1}$ and $X_{1}\setminus X_{2}$ from $\mathcal{I}_{2}$. In both cases, we reach the same triplet as removing $X_{1}\cup X_{2}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ from $\mathcal{H}$. (Should $X_{2}\setminus X_{1}$ be empty, we do not actually need to do anything on $\mathcal{I}_{1}$: $X_{1}\cup X_{2}$ is already removed from it. A similar remark holds for $\mathcal{I}_{2}$ when $X_{1}\setminus X_{2}$ is empty.)

To see that $X_{2}\setminus X_{1}\subseteq\operatorname{\textit{isol}}_{\mathcal{I}_{1}}(e_{1}\setminus X_{1})$, let $x\in X_{2}\setminus X_{1}$. We need to show $x\not\in\operatorname{\textit{jv}}_{\mathcal{I}_{1}}(e_{1}\setminus X_{1})$ and $x\not\in\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{I}_{1}))$. Because $x\in X_{2}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ we know $x\not\in\operatorname{\textit{jv}}_{\mathcal{H}}(e_{1})$. Then, since $x\not\in X_{1}$, also $x\not\in\operatorname{\textit{jv}}_{\mathcal{H}}(e_{1}\setminus X_{1})$. By Lemma 10, $\operatorname{\textit{jv}}_{\mathcal{I}_{1}}(e_{1}\setminus X_{1})\subseteq\operatorname{\textit{jv}}_{\mathcal{H}}(e_{1}\setminus X_{1})$. Therefore, $x\not\in\operatorname{\textit{jv}}_{\mathcal{I}_{1}}(e_{1}\setminus X_{1})$. Furthermore, because $x\in\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ we know $x\not\in\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{H}))$. Since $\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{I}_{1}))\subseteq\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{H}))$ by Lemma 10, also $x\not\in\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{I}_{1}))$.

$X_{1}\setminus X_{2}\subseteq\operatorname{\textit{isol}}_{\mathcal{I}_{2}}(e_{2}\setminus X_{2})$ is shown similarly.
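Case (1b) can be replayed on a concrete hyperedge. The sketch below is illustrative only and assumes the same set-based modelling as the definitions used in the proof: a triplet is a tuple `(hyp, out, pred)` of frozensets, and the hypothetical helper `iso` implements rule ISO, dropping the hyperedge altogether when no variable is left.

```python
def jv(hyp, out, e):
    """Equijoin variables of e (assumed: shared with another hyperedge, or output)."""
    others = set().union(*(f for f in hyp if f != e))
    return {x for x in e if x in others or x in out}

def isol(hyp, out, pred, e):
    """Isolated variables of e: neither equijoin variables nor predicate variables."""
    return set(e) - jv(hyp, out, e) - set().union(*pred)

def iso(hyp, out, pred, e, xs):
    """Rule ISO: remove the non-empty set xs of isolated variables from hyperedge e."""
    assert e in hyp and xs and xs <= isol(hyp, out, pred, e)
    rest = frozenset(e - xs)
    return (hyp - {e}) | ({rest} if rest else frozenset()), out, pred

e, f = frozenset("abcy"), frozenset("yz")
H = (frozenset({e, f}), frozenset(), frozenset())   # isol_H(e) = {a, b, c}
X1, X2 = {"a", "b"}, {"b", "c"}

I1 = iso(*H, e, X1)                          # e becomes {c, y}
I2 = iso(*H, e, X2)                          # e becomes {a, y}
J1 = iso(*I1, frozenset(e - X1), X2 - X1)    # remove the remaining {c}
J2 = iso(*I2, frozenset(e - X2), X1 - X2)    # remove the remaining {a}

# Both paths agree with removing X1 ∪ X2 from H in a single ISO step.
assert J1 == J2 == iso(*H, e, X1 | X2)
```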

(2) Case (CSE, CSE): assume that $\mathcal{I}_{1}$ is obtained by removing hyperedge $e_{1}$ because it is a conditional subset of hyperedge $f_{1}$ , while $\mathcal{I}_{2}$ is obtained by removing $e_{2}$ , conditional subset of $f_{2}$ . Since $\mathcal{I}_{1}\not=\mathcal{I}_{2}$ it must be $e_{1}\not=e_{2}$ . We need to further distinguish the following cases.

(2a) $e_{1}\not=f_{2}$ and $e_{2}\not=f_{1}$. In this case, $e_{2}$ and $f_{2}$ remain hyperedges in $\mathcal{I}_{1}$ while $e_{1}$ and $f_{1}$ remain hyperedges in $\mathcal{I}_{2}$. Then, by Lemma 10, $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{I}_{1}}f_{2}$ and $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{I}_{2}}f_{1}$. Let $\mathcal{J}_{1}$ (resp. $\mathcal{J}_{2}$) be the triplet obtained by removing $e_{2}$ from $\mathcal{I}_{1}$ (resp. $e_{1}$ from $\mathcal{I}_{2}$). Then $\mathcal{J}_{1}=\mathcal{J}_{2}$ since clearly $\operatorname{\textit{out}}(\mathcal{J}_{1})=\operatorname{\textit{out}}(\mathcal{J}_{2})$ and

[TABLE]

From this the result follows by taking $\mathcal{J}=\mathcal{J}_{1}=\mathcal{J}_{2}$ .

(2b) $e_{1}\not=f_{2}$ but $e_{2}=f_{1}$ . Then $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{2}$ and $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{2}$ with $f_{2}\not=e_{1}$ . It suffices to show that $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{2}$ and $e_{1}\setminus f_{2}=e_{1}\setminus f_{1}$ , because then (CSE) due to $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{1}$ has the same effect as CSE on $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{2}$ , and we can apply the reasoning of case (2a) because $e_{1}\not=f_{2}$ and $e_{2}\not=f_{2}$ .

We first show $e_{1}\setminus f_{2}=e_{1}\setminus f_{1}$. Let $x\in e_{1}\setminus f_{2}$ and suppose for the purpose of contradiction that $x\in e_{2}=f_{1}$. Then, since $e_{1}\not=e_{2}$ and $x$ occurs in both, $x\in\operatorname{\textit{jv}}_{\mathcal{H}}(e_{2})\subseteq f_{2}$ where the last inclusion is due to $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{2}$, contradicting $x\not\in f_{2}$. Hence, $e_{1}\setminus f_{2}\subseteq e_{1}\setminus f_{1}$. Conversely, let $x\in e_{1}\setminus f_{1}$. Since $f_{1}=e_{2}$, $x\not\in e_{2}$. Suppose for the purpose of contradiction that $x\in f_{2}$. Because $e_{1}\not=f_{2}$, $x\in\operatorname{\textit{jv}}_{\mathcal{H}}(e_{1})\subseteq e_{2}$ where the last inclusion is due to $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{2}$, contradicting $x\not\in e_{2}$. Therefore, $e_{1}\setminus f_{1}=e_{1}\setminus f_{2}$.

To show that $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{2}$, let $x\in\operatorname{\textit{jv}}_{\mathcal{H}}(e_{1})$. Because $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{2}$, $x\in e_{2}$. Because $x$ occurs in two distinct hyperedges in $\mathcal{H}$, also $x\in\operatorname{\textit{jv}}_{\mathcal{H}}(e_{2})$. Then, because $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{2}$, $x\in f_{2}$. Hence $\operatorname{\textit{jv}}_{\mathcal{H}}(e_{1})\subseteq f_{2}$. It remains to show $\operatorname{\textit{ext}}_{\mathcal{H}}(e_{1}\setminus f_{2})\subseteq f_{2}$. To this end, let $x\in\operatorname{\textit{ext}}_{\mathcal{H}}(e_{1}\setminus f_{2})$ and suppose for the purpose of contradiction that $x\not\in f_{2}$. By definition of $\operatorname{\textit{ext}}$ there exist $\theta\in\operatorname{\textit{pred}}(\mathcal{H})$ and $y\in\operatorname{\textit{var}}(\theta)\cap(e_{1}\setminus f_{2})$ such that $x\in\operatorname{\textit{var}}(\theta)\setminus(e_{1}\setminus f_{2})$. In particular, $y\not\in f_{2}$. Since $e_{1}\setminus f_{2}=e_{1}\setminus e_{2}$, $y\in\operatorname{\textit{var}}(\theta)\cap(e_{1}\setminus e_{2})$ and $x\in\operatorname{\textit{var}}(\theta)\setminus(e_{1}\setminus e_{2})$. Thus, $x\in\operatorname{\textit{ext}}_{\mathcal{H}}(e_{1}\setminus e_{2})$. Then, since $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{2}$, $x\in e_{2}$. Thus, $x\in e_{2}\setminus f_{2}$ since $x\not\in f_{2}$. Hence $x\in\operatorname{\textit{var}}(\theta)\cap(e_{2}\setminus f_{2})$. Furthermore, since $y\not\in e_{2}$ also $y\not\in e_{2}\setminus f_{2}$. Hence, $y\in\operatorname{\textit{var}}(\theta)\setminus(e_{2}\setminus f_{2})$. But then $\theta$ shows that $y\in\operatorname{\textit{ext}}_{\mathcal{H}}(e_{2}\setminus f_{2})$. Then, because $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{2}$, also $y\in f_{2}$, which yields the desired contradiction.

(2c) $e_{1}=f_{2}$ but $e_{2}\not=f_{1}$ . Similar to case (2b).

(2d) $e_{1}=f_{2}$ and $e_{2}=f_{1}$. Then $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{2}$ and $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{1}$ and $e_{1}\not=e_{2}$. Let $\mathcal{K}_{1}$ (resp. $\mathcal{K}_{2}$) be the triplet obtained by applying (FLT) to remove all $\theta\in\operatorname{\textit{pred}}(\mathcal{I}_{1})$ (resp. $\theta\in\operatorname{\textit{pred}}(\mathcal{I}_{2})$) for which $\operatorname{\textit{var}}(\theta)\subseteq e_{2}$ (resp. $\operatorname{\textit{var}}(\theta)\subseteq e_{1}$). Furthermore, let $\mathcal{J}_{1}$ (resp. $\mathcal{J}_{2}$) be the triplet obtained by applying ISO to remove $\operatorname{\textit{isol}}_{\mathcal{K}_{1}}(e_{2})$ from $\mathcal{K}_{1}$ (resp. to remove $\operatorname{\textit{isol}}_{\mathcal{K}_{2}}(e_{1})$ from $\mathcal{K}_{2}$). Here, we take $\mathcal{J}_{1}=\mathcal{K}_{1}$ if $\operatorname{\textit{isol}}_{\mathcal{K}_{1}}(e_{2})$ is empty (and similarly for $\mathcal{J}_{2}$). Then clearly $\mathcal{H}\operatorname*{\rightsquigarrow}\mathcal{I}_{1}\operatorname*{\rightsquigarrow}^{*}\mathcal{K}_{1}\operatorname*{\rightsquigarrow}^{*}\mathcal{J}_{1}$ and $\mathcal{H}\operatorname*{\rightsquigarrow}\mathcal{I}_{2}\operatorname*{\rightsquigarrow}^{*}\mathcal{K}_{2}\operatorname*{\rightsquigarrow}^{*}\mathcal{J}_{2}$. The result then follows by showing that $\mathcal{J}_{1}=\mathcal{J}_{2}$. Towards this end, first observe that $\operatorname{\textit{out}}(\mathcal{J}_{1})=\operatorname{\textit{out}}(\mathcal{K}_{1})=\operatorname{\textit{out}}(\mathcal{I}_{1})=\operatorname{\textit{out}}(\mathcal{H})=\operatorname{\textit{out}}(\mathcal{I}_{2})=\operatorname{\textit{out}}(\mathcal{K}_{2})=\operatorname{\textit{out}}(\mathcal{J}_{2})$. Next, we show that $\operatorname{\textit{pred}}(\mathcal{J}_{1})=\operatorname{\textit{pred}}(\mathcal{J}_{2})$.
We first observe that $\operatorname{\textit{pred}}(\mathcal{J}_{1})=\operatorname{\textit{pred}}(\mathcal{K}_{1})$ and $\operatorname{\textit{pred}}(\mathcal{J}_{2})=\operatorname{\textit{pred}}(\mathcal{K}_{2})$ since the ISO operation does not remove predicates. Then observe that

[TABLE]

We only show the reasoning for $\operatorname{\textit{pred}}(\mathcal{K}_{1})\subseteq\operatorname{\textit{pred}}(\mathcal{K}_{2})$, the other direction being similar. Let $\theta\in\operatorname{\textit{pred}}(\mathcal{K}_{1})$. Then $\operatorname{\textit{var}}(\theta)\cap(e_{1}\setminus e_{2})=\emptyset$ and $\operatorname{\textit{var}}(\theta)\not\subseteq e_{2}$. Since $\operatorname{\textit{var}}(\theta)\not\subseteq e_{2}$ there exists $y\in\operatorname{\textit{var}}(\theta)\setminus e_{2}$. Then, because $\operatorname{\textit{var}}(\theta)\cap(e_{1}\setminus e_{2})=\emptyset$, $y\not\in e_{1}$. Thus, $\operatorname{\textit{var}}(\theta)\not\subseteq e_{1}$. Now, suppose for the purpose of obtaining a contradiction that $\operatorname{\textit{var}}(\theta)\cap(e_{2}\setminus e_{1})\not=\emptyset$. Then take $z\in\operatorname{\textit{var}}(\theta)\cap(e_{2}\setminus e_{1})$. Since $y\not\in e_{2}$, $\theta$ then shows that $y\in\operatorname{\textit{ext}}_{\mathcal{H}}(e_{2}\setminus e_{1})$. Hence, $y\in e_{1}$ because $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{1}$, which yields the desired contradiction with $y\not\in e_{1}$. Therefore, $\operatorname{\textit{var}}(\theta)\cap(e_{2}\setminus e_{1})=\emptyset$, as desired. Hence $\theta\in\operatorname{\textit{pred}}(\mathcal{K}_{2})$.

It remains to show that $\operatorname{\textit{hyp}}(\mathcal{J}_{1})=\operatorname{\textit{hyp}}(\mathcal{J}_{2})$ . To this end, first observe

[TABLE]

Clearly, $\operatorname{\textit{hyp}}(\mathcal{J}_{1})=\operatorname{\textit{hyp}}(\mathcal{J}_{2})$ if $e_{2}\setminus\operatorname{\textit{isol}}_{\mathcal{K}_{1}}(e_{2})=e_{1}\setminus\operatorname{\textit{isol}}_{\mathcal{K}_{2}}(e_{1})$ .

We only show $e_{2}\setminus\operatorname{\textit{isol}}_{\mathcal{K}_{1}}(e_{2})\subseteq e_{1}\setminus\operatorname{\textit{isol}}_{\mathcal{K}_{2}}(e_{1})$, the other inclusion being similar. Let $x\in e_{2}\setminus\operatorname{\textit{isol}}_{\mathcal{K}_{1}}(e_{2})$. Since $x\not\in\operatorname{\textit{isol}}_{\mathcal{K}_{1}}(e_{2})$, one of the following holds.

•

$x\in\operatorname{\textit{out}}(\mathcal{K}_{1})$. But then, $x\in\operatorname{\textit{out}}(\mathcal{K}_{1})=\operatorname{\textit{out}}(\mathcal{I}_{1})=\operatorname{\textit{out}}(\mathcal{H})=\operatorname{\textit{out}}(\mathcal{I}_{2})=\operatorname{\textit{out}}(\mathcal{K}_{2})$. In particular, $x$ is an equijoin variable in $\mathcal{H}$ and $\mathcal{K}_{2}$. Then $x\in\operatorname{\textit{jv}}_{\mathcal{H}}(e_{2})\subseteq e_{1}$ because $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{1}$. From this and the fact that $x$ remains an equijoin variable in $\mathcal{K}_{2}$, we obtain $x\in e_{1}\setminus\operatorname{\textit{isol}}_{\mathcal{K}_{2}}(e_{1})$.

•

$x$ occurs in $e_{2}$ and in some hyperedge $g$ in $\mathcal{K}_{1}$ with $g\not=e_{2}$. Since $e_{1}$ is not in $\mathcal{K}_{1}$ also $g\not=e_{1}$. Since every hyperedge in $\mathcal{K}_{1}$ is in $\mathcal{I}_{1}$ and every hyperedge in $\mathcal{I}_{1}$ is in $\mathcal{H}$, also $g$ is in $\mathcal{H}$. But then, $x$ occurs in two distinct hyperedges in $\mathcal{H}$, namely $e_{2}$ and $g$, and hence $x\in\operatorname{\textit{jv}}_{\mathcal{H}}(e_{2})\subseteq e_{1}$ because $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{1}$. However, because $x$ also occurs in $g$, which must also be in $\mathcal{I}_{2}$ and therefore also in $\mathcal{K}_{2}$, $x$ also occurs in two distinct hyperedges in $\mathcal{K}_{2}$, namely $e_{1}$ and $g$. Therefore, $x\in\operatorname{\textit{jv}}_{\mathcal{K}_{2}}(e_{1})$ and hence $x\in e_{1}\setminus\operatorname{\textit{isol}}_{\mathcal{K}_{2}}(e_{1})$, as desired.

•

$x\in\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{K}_{1}))$. Then there exists $\theta\in\operatorname{\textit{pred}}(\mathcal{K}_{1})$ such that $x\in\operatorname{\textit{var}}(\theta)$. Since $\operatorname{\textit{pred}}(\mathcal{K}_{1})=\operatorname{\textit{pred}}(\mathcal{K}_{2})$, $\theta\in\operatorname{\textit{pred}}(\mathcal{K}_{2})$. As such, $\theta\in\operatorname{\textit{pred}}(\mathcal{H})$, $\operatorname{\textit{var}}(\theta)\cap(e_{2}\setminus e_{1})=\emptyset$, and $\operatorname{\textit{var}}(\theta)\not\subseteq e_{1}$. But then, since $x\in\operatorname{\textit{var}}(\theta)$, $x\in e_{2}$, and $\operatorname{\textit{var}}(\theta)\cap(e_{2}\setminus e_{1})=\emptyset$, it must be the case that $x\in e_{1}$. As such, $x\in e_{1}$ and $x\in\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{K}_{2}))$. Hence $x\in e_{1}\setminus\operatorname{\textit{isol}}_{\mathcal{K}_{2}}(e_{1})$.
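Case (2d) is the only case in which the two branches need additional FLT and ISO steps before they meet. The following sketch replays it on a small instance under the same assumed set-based modelling of triplets as tuples `(hyp, out, pred)` of frozensets. The helpers `cse`, `flt_all` and `iso_all` are hypothetical names: `flt_all` removes every predicate whose variables lie inside a given hyperedge, and `iso_all` removes all isolated variables of a hyperedge. Both do nothing when there is nothing to remove.

```python
def jv(hyp, out, e):
    others = set().union(*(f for f in hyp if f != e))
    return {x for x in e if x in others or x in out}

def isol(hyp, out, pred, e):
    return set(e) - jv(hyp, out, e) - set().union(*pred)

def ext(pred, xs):
    return {x for th in pred if th & xs for x in th - xs}

def cond_subset(hyp, out, pred, e, f):
    return jv(hyp, out, e) <= f and ext(pred, e - f) <= f

def cse(hyp, out, pred, e, f):
    """Rule CSE: drop e (conditional subset of f) and all predicates touching e minus f."""
    assert e != f and cond_subset(hyp, out, pred, e, f)
    gone = e - f
    return hyp - {e}, out, frozenset(th for th in pred if not (th & gone))

def flt_all(hyp, out, pred, e):
    """Rule FLT for every predicate whose variables all lie inside hyperedge e."""
    return hyp, out, frozenset(th for th in pred if not th <= e)

def iso_all(hyp, out, pred, e):
    """Rule ISO for all isolated variables of hyperedge e."""
    rest = frozenset(e - isol(hyp, out, pred, e))
    return (hyp - {e}) | ({rest} if rest else frozenset()), out, pred

e1, e2, g = frozenset("xa"), frozenset("xb"), frozenset("xc")
p, q = frozenset("xc"), frozenset("xb")
H = (frozenset({e1, e2, g}), frozenset(), frozenset({p, q}))
assert cond_subset(*H, e1, e2) and cond_subset(*H, e2, e1)   # mutual conditional subsets

I1, I2 = cse(*H, e1, e2), cse(*H, e2, e1)   # the two diverging CSE steps
K1, K2 = flt_all(*I1, e2), flt_all(*I2, e1)
J1, J2 = iso_all(*K1, e2), iso_all(*K2, e1)
assert I1 != I2 and J1 == J2                # the branches meet again
```

Here the predicate over $\{x,b\}$ is removed by CSE on one branch and by FLT on the other, and the variables $a$ and $b$ are only removed by the final ISO steps.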

(3) Case (ISO, CSE): assume that $\mathcal{I}_{1}$ is obtained by removing the non-empty set of isolated variables $X_{1}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ from $e_{1}$, while $\mathcal{I}_{2}$ is obtained by removing hyperedge $e_{2}$, conditional subset of hyperedge $f_{2}$. We may assume w.l.o.g. that $e_{1}\not=\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$: if $e_{1}=\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ then the ISO operation removes the complete hyperedge $e_{1}$. However, because no predicate in $\mathcal{H}$ shares any variable with $e_{1}$, it is readily verified that $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{2}$ and thus the removal of $e_{1}$ can also be seen as an application of CSE on $e_{1}$, and we are hence back in case (2). (Note that, since $e_{1}$ does not share variables with any predicate, the CSE operation also does not remove any predicates from $\mathcal{H}$, similar to the ISO operation, and hence yields $\mathcal{I}_{1}$.)

Now reason as follows. Because $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}f_{2}$ and because isolated variables of $e_{1}$ occur in no other hyperedge in $\mathcal{H}$, it must be the case that $e_{2}\cap X_{1}=\emptyset$. In particular, $e_{1}$ and $e_{2}$ must hence be distinct. Therefore, $e_{1}\in\operatorname{\textit{hyp}}(\mathcal{I}_{2})$ and $e_{2}\in\operatorname{\textit{hyp}}(\mathcal{I}_{1})$. By Lemma 10, we can apply ISO on $\mathcal{I}_{2}$ to remove $X_{1}$ from $e_{1}$. It then suffices to show that $e_{2}$ remains a conditional subset of some hyperedge $f^{\prime}_{2}$ in $\mathcal{I}_{1}$ with $e_{2}\setminus f_{2}=e_{2}\setminus f^{\prime}_{2}$. Indeed, we can then use CSE to remove $e_{2}$ from $\operatorname{\textit{hyp}}(\mathcal{I}_{1})$ as well as predicates $\theta$ with $\operatorname{\textit{var}}(\theta)\cap(e_{2}\setminus f_{2})\not=\emptyset$ from $\operatorname{\textit{pred}}(\mathcal{I}_{1})$. This clearly yields the same triplet as the one obtained by removing $X_{1}$ from $e_{1}$ in $\mathcal{I}_{2}$. We need to distinguish two cases.

(3a) $f_{2}\not=e_{1}$ . Then $f_{2}\in\operatorname{\textit{hyp}}(\mathcal{I}_{1})$ and hence $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{I}_{1}}f_{2}$ by Lemma 10. We hence take $f^{\prime}_{2}=f_{2}$ .

(3b) $f_{2}=e_{1}$. Then we take $f^{\prime}_{2}=e_{1}\setminus X_{1}$. Since $e_{1}\not=\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ it follows that $e_{1}\setminus X_{1}\not=\emptyset$. Therefore, $f^{\prime}_{2}=e_{1}\setminus X_{1}\in\operatorname{\textit{hyp}}(\mathcal{I}_{1})$. Furthermore, since $X_{1}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$, no variable in $X_{1}$ is in any other hyperedge in $\mathcal{H}$. In particular $X_{1}\cap e_{2}=\emptyset$. Therefore, $e_{2}\setminus f^{\prime}_{2}=e_{2}\setminus(e_{1}\setminus X_{1})=(e_{2}\setminus e_{1})\cup(e_{2}\cap X_{1})=e_{2}\setminus e_{1}=e_{2}\setminus f_{2}$. It remains to show that $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{I}_{1}}e_{1}\setminus X_{1}$.

•

$\operatorname{\textit{jv}}_{\mathcal{I}_{1}}(e_{2})\subseteq e_{1}\setminus X_{1}$ . Let $x\in\operatorname{\textit{jv}}_{\mathcal{I}_{1}}(e_{2})$ . By Lemma 10, $x\in\operatorname{\textit{jv}}_{\mathcal{I}_{1}}(e_{2})\subseteq\operatorname{\textit{jv}}_{\mathcal{H}}(e_{2})\subseteq e_{1}$ where the last inclusion is due to $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{1}$ . In particular, $x$ is an equijoin variable in $\mathcal{H}$ . But then it cannot be an isolated variable in any hyperedge. Therefore, $x\not\in X_{1}$ .

•

$\operatorname{\textit{ext}}_{\mathcal{I}_{1}}(e_{2}\setminus e_{1})\subseteq e_{1}\setminus X_{1}$. Let $x\in\operatorname{\textit{ext}}_{\mathcal{I}_{1}}(e_{2}\setminus e_{1})$. Then $x\in\operatorname{\textit{ext}}_{\mathcal{I}_{1}}(e_{2}\setminus e_{1})\subseteq\operatorname{\textit{ext}}_{\mathcal{H}}(e_{2}\setminus e_{1})\subseteq e_{1}$ where the first inclusion is by Lemma 10 and the second by $e_{2}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{1}$. Then, because $x\in\operatorname{\textit{ext}}_{\mathcal{H}}(e_{2}\setminus e_{1})$, it follows from the definition of $\operatorname{\textit{ext}}$ that $x$ occurs in some predicate in $\operatorname{\textit{pred}}(\mathcal{H})$. However, $X_{1}$ is disjoint from $\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{H}))$ since it consists only of isolated variables. Therefore, $x\not\in X_{1}$.

(4) Case (ISO, FLT): assume that $\mathcal{I}_{1}$ is obtained by removing the non-empty set $X_{1}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ from hyperedge $e_{1}$, while $\mathcal{I}_{2}$ is obtained by removing all predicates in the non-empty set $\Theta\subseteq\operatorname{\textit{pred}}(\mathcal{H})$ with $\operatorname{\textit{var}}(\Theta)\subseteq e_{2}$ for some hyperedge $e_{2}$ in $\operatorname{\textit{hyp}}(\mathcal{H})$. Observe that $e_{1}\in\operatorname{\textit{hyp}}(\mathcal{I}_{2})$. By Lemma 10, $X_{1}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})\subseteq\operatorname{\textit{isol}}_{\mathcal{I}_{2}}(e_{1})$. Therefore, we may apply reduction operation (ISO) on $\mathcal{I}_{2}$ to remove $X_{1}$ from $e_{1}$. We will now show that, similarly, we may still apply (FLT) on $\mathcal{I}_{1}$ to remove all predicates in $\Theta$ from $\operatorname{\textit{pred}}(\mathcal{I}_{1})=\operatorname{\textit{pred}}(\mathcal{H})$. The two operations hence commute, and clearly the resulting triplet is the same in both cases. We distinguish two possibilities. (i) $e_{1}\not=e_{2}$. Then $e_{2}\in\operatorname{\textit{hyp}}(\mathcal{I}_{1})$, $\operatorname{\textit{var}}(\Theta)\subseteq e_{2}$ and, since (ISO) does not remove predicates, $\Theta\subseteq\operatorname{\textit{pred}}(\mathcal{H})=\operatorname{\textit{pred}}(\mathcal{I}_{1})$. As such, the (FLT) operation indeed applies to remove all predicates in $\Theta$ from $\operatorname{\textit{pred}}(\mathcal{I}_{1})$. (ii) $e_{1}=e_{2}$. Then, since $X_{1}\subseteq\operatorname{\textit{isol}}_{\mathcal{H}}(e_{1})$ and isolated variables do not occur in any predicate, $X_{1}\cap\operatorname{\textit{var}}(\Theta)=\emptyset$. Then, since $\operatorname{\textit{var}}(\Theta)\subseteq e_{2}=e_{1}$, it follows that also $\operatorname{\textit{var}}(\Theta)\subseteq e_{1}\setminus X_{1}$. In particular, since we disallow nullary predicates and $\Theta$ is non-empty, $e_{1}\setminus X_{1}\not=\emptyset$. Thus, $e_{1}\setminus X_{1}\in\operatorname{\textit{hyp}}(\mathcal{I}_{1})$ and hence operation (FLT) indeed applies to remove all predicates in $\Theta$ from $\operatorname{\textit{pred}}(\mathcal{I}_{1})$.

(5) Case (CSE, FLT): assume that $\mathcal{I}_{1}$ is obtained by removing hyperedge $e_{1}$ , conditional subset of $e_{2}$ in $\mathcal{H}$ , while $\mathcal{I}_{2}$ is obtained by removing all predicates in the non-empty set $\Theta\subseteq\operatorname{\textit{pred}}(\mathcal{H})$ with $\operatorname{\textit{var}}(\Theta)\subseteq e_{3}$ for some hyperedge $e_{3}\in\operatorname{\textit{hyp}}(\mathcal{H})$ . Since the (FLT) operation does not remove any hyperedges, $e_{1}$ and $e_{2}$ are in $\operatorname{\textit{hyp}}(\mathcal{I}_{2})$ . Then, since $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{H}}e_{2}$ also $e_{1}\operatorname*{\sqsubseteq}_{\mathcal{I}_{2}}e_{2}$ by Lemma 10. Therefore, we may apply reduction operation (CSE) on $\mathcal{I}_{2}$ to remove $e_{1}$ from $\operatorname{\textit{hyp}}(\mathcal{I}_{2})$ as well as all predicates $\theta\in\operatorname{\textit{pred}}(\mathcal{I}_{2})$ for which $\operatorname{\textit{var}}(\theta)\cap(e_{1}\setminus e_{2})\not=\emptyset$ . Let $\mathcal{J}_{2}$ be the triplet resulting from this operation. We will show that, similarly, we may apply (FLT) on $\mathcal{I}_{1}$ to remove all predicates in $\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1})$ from $\operatorname{\textit{pred}}(\mathcal{I}_{1})$ , resulting in a triplet $\mathcal{J}_{1}$ . Observe that necessarily, $\mathcal{J}_{1}=\mathcal{J}_{2}$ (and hence they form the triplet $\mathcal{J}$ ). Indeed, $\operatorname{\textit{out}}(\mathcal{J}_{1})=\operatorname{\textit{out}}(\mathcal{I}_{1})=\operatorname{\textit{out}}(\mathcal{H})=\operatorname{\textit{out}}(\mathcal{I}_{2})=\operatorname{\textit{out}}(\mathcal{J}_{2})$ since reduction operations never modify output variables. Moreover,

[TABLE]

where the first and third equalities are due to the fact that (FLT) does not modify the hypergraph of the triplet it operates on. Finally, observe

[TABLE]

It remains to show that we may apply (FLT) on $\mathcal{I}_{1}$ to remove all predicates in $\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1})$ , resulting in a triplet $\mathcal{J}_{1}$ . There are two possibilities.

•

$e_{3}\not=e_{1}$. Then $e_{3}\in\operatorname{\textit{hyp}}(\mathcal{I}_{1})$, $\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1})\subseteq\operatorname{\textit{pred}}(\mathcal{I}_{1})$, and $\operatorname{\textit{var}}(\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1}))\subseteq\operatorname{\textit{var}}(\Theta)\subseteq e_{3}$. Hence the (FLT) operation indeed applies to $\mathcal{I}_{1}$ to remove all predicates in $\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1})$.

•

$e_{3}=e_{1}$ . In this case we claim that for every $\theta\in\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1})$ we have $\operatorname{\textit{var}}(\theta)\subseteq e_{2}$ . As such, $\operatorname{\textit{var}}(\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1}))\subseteq e_{2}$ . Since $e_{2}\in\operatorname{\textit{hyp}}(\mathcal{I}_{1})$ and $\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1})\subseteq\operatorname{\textit{pred}}(\mathcal{I}_{1})$ we may hence apply (FLT) to remove all predicates in $\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1})$ from $\mathcal{I}_{1}$ . Concretely, let $\theta\in\Theta\cap\operatorname{\textit{pred}}(\mathcal{I}_{1})$ . Because, in order to obtain $\mathcal{I}_{1}$ , (CSE) removes all predicates from $\mathcal{H}$ that share a variable with $e_{1}\setminus e_{2}$ , we have $\operatorname{\textit{var}}(\theta)\cap(e_{1}\setminus e_{2})=\emptyset$ . Moreover, because $\theta\in\Theta$ , $\operatorname{\textit{var}}(\theta)\subseteq e_{1}$ . Hence $\operatorname{\textit{var}}(\theta)\subseteq e_{2}$ , as desired.

The remaining cases, (CSE, ISO), (FLT, ISO), and (FLT, CSE), are symmetric to cases (3), (4), and (5), respectively. ∎
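The second possibility of case (5), where the filtered predicates lie inside the hyperedge that CSE removes, can be checked on a small instance. This is a sketch under the same assumed set-based modelling as in the proof (triplets as tuples `(hyp, out, pred)` of frozensets), with hypothetical helper names `cse` and `flt`.

```python
def jv(hyp, out, e):
    others = set().union(*(f for f in hyp if f != e))
    return {x for x in e if x in others or x in out}

def ext(pred, xs):
    return {x for th in pred if th & xs for x in th - xs}

def cond_subset(hyp, out, pred, e, f):
    return jv(hyp, out, e) <= f and ext(pred, e - f) <= f

def cse(hyp, out, pred, e, f):
    """Rule CSE: drop e (conditional subset of f) and all predicates touching e minus f."""
    assert e in hyp and f in hyp and e != f and cond_subset(hyp, out, pred, e, f)
    gone = e - f
    return hyp - {e}, out, frozenset(th for th in pred if not (th & gone))

def flt(hyp, out, pred, theta, e):
    """Rule FLT: drop the non-empty predicate set theta, whose variables lie inside e."""
    assert theta and theta <= pred and e in hyp and all(th <= e for th in theta)
    return hyp, out, pred - theta

e1, e2 = frozenset("xyu"), frozenset("yuz")
ta, tb, tc = frozenset("xz"), frozenset("yu"), frozenset("xy")
H = (frozenset({e1, e2}), frozenset(), frozenset({ta, tb, tc}))
theta = frozenset({tb, tc})                # var(theta) lies inside e1, so e3 = e1

I1 = cse(*H, e1, e2)                       # removes e1 together with ta and tc
I2 = flt(*H, theta, e1)                    # removes tb and tc
J1 = flt(*I1, theta & I1[2], e2)           # the surviving part of theta lies inside e2
J2 = cse(*I2, e1, e2)                      # e1 is still a conditional subset of e2
assert J1 == J2 == (frozenset({e2}), frozenset(), frozenset())
```

Note that FLT on $\mathcal{I}_{1}$ has to be justified by $e_{2}$ rather than $e_{1}$, since $e_{1}$ is no longer a hyperedge of $\mathcal{I}_{1}$. The assertion inside `flt` checks exactly this.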

C.2 Proof of Proposition 9


Proof.

Let $(T,N)$ be a GJT pair. The proof proceeds in three steps. Step 1. Let $T_{1}$ be the GJT obtained from $T$ by (i) removing all predicates from $T$, and (ii) creating a new root node $r$ that is labeled by $\emptyset$ and attaching the root of $T$ to it by an edge labeled by the empty set of predicates. $T_{1}$ satisfies the first canonicality condition, but is not equivalent to $T$ because it has none of $T$’s predicates. Now re-add the predicates in $T$ to $T_{1}$ as follows. For each edge $m\to n$ in $T$ and each predicate $\theta\in\operatorname{\textit{pred}}_{T}(m\to n)$, if $\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\not=\emptyset$ then add $\theta$ to $\operatorname{\textit{pred}}_{T_{1}}(m\to n)$. Otherwise, if $\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))=\emptyset$, do the following. First observe that, by definition of GJTs, $\operatorname{\textit{var}}(\theta)\subseteq\operatorname{\textit{var}}(n)\cup\operatorname{\textit{var}}(m)$. Because $\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))=\emptyset$ this implies $\operatorname{\textit{var}}(\theta)\subseteq\operatorname{\textit{var}}(m)$. Because we disallow nullary predicates, $\operatorname{\textit{var}}(m)\not=\emptyset$. Let $a$ be the first ancestor of $m$ in $T_{1}$ such that $\operatorname{\textit{var}}(\theta)\not\subseteq\operatorname{\textit{var}}(a)$. Such an ancestor exists because the root of $T_{1}$ is labeled $\emptyset$. Let $b$ be the child of $a$ in $T_{1}$ that lies on the path from $a$ to $m$ (possibly $b=m$). Since $a$ is the first ancestor of $m$ with $\operatorname{\textit{var}}(\theta)\not\subseteq\operatorname{\textit{var}}(a)$, $\operatorname{\textit{var}}(\theta)\subseteq\operatorname{\textit{var}}(b)$.
Therefore, $\operatorname{\textit{var}}(\theta)\subseteq\operatorname{\textit{var}}(b)\cup\operatorname{\textit{var}}(a)$ and $\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(b)\setminus\operatorname{\textit{var}}(a))\not=\emptyset$. As such, add $\theta$ to $\operatorname{\textit{pred}}_{T_{1}}(a\to b)$. After having done this for all predicates in $T$, $T_{1}$ becomes equivalent to $T$, and satisfies canonicality conditions (1) and (4). Then take $N_{1}=N\cup\{r\}$. Clearly, $N_{1}$ is a connex subset of $T_{1}$ and $\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(N_{1})$. Therefore, $(T_{1},N_{1})$ is equivalent to $(T,N)$.
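The predicate placement of Step 1 amounts to walking upwards from $m$ until the predicate no longer fits in the ancestor's label. The following is a minimal sketch of that walk, assuming the tree is given as a hypothetical `parent` map and a `var` map from node names to frozensets of variables, with the root labeled by the empty set as in the construction of $T_{1}$.

```python
def place(parent, var, m, n, theta):
    """Return the edge (a, b) of T_1 on which predicate theta of edge m -> n is placed."""
    if theta & (var[n] - var[m]):
        return (m, n)              # theta mentions a variable new in n: keep it in place
    b, a = m, parent[m]            # otherwise var(theta) lies inside var(m): walk upwards
    while theta <= var[a]:         # stops at the latest at the root, which is labeled {}
        b, a = a, parent[a]
    return (a, b)

# Hypothetical tree r -> t -> m -> {n, k}, where the root r is labeled by the empty set.
parent = {"t": "r", "m": "t", "n": "m", "k": "m"}
var = {"r": frozenset(), "t": frozenset("xy"), "m": frozenset("xyz"),
       "n": frozenset("yz"), "k": frozenset("zw")}

for child, theta in [("n", frozenset("x")), ("n", frozenset("z")), ("k", frozenset("zw"))]:
    a, b = place(parent, var, "m", child, theta)
    # The chosen edge satisfies both properties claimed in Step 1.
    assert theta <= var[a] | var[b] and theta & (var[b] - var[a])
```

The loop relies on the two facts used in the proof: predicates are not nullary, and the root carries no variables, so the walk always stops before running past the root.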

Step 2. Let $T_{2}$ be obtained from $T_{1}$ by adding, for each leaf node $l$ in $T_{1}$, a new interior node $n_{l}$ labeled by $\operatorname{\textit{var}}(l)$ and inserting it in-between $l$ and its parent in $T_{1}$. I.e., if $l$ has parent $p$ in $T_{1}$ then we have $p\to n_{l}\to l$ in $T_{2}$ with $\operatorname{\textit{pred}}_{T_{2}}(p\to n_{l})=\operatorname{\textit{pred}}_{T_{1}}(p\to l)$ and $\operatorname{\textit{pred}}_{T_{2}}(n_{l}\to l)=\emptyset$. (Note that all leaves have a parent since the root of $T_{1}$ is an interior node labeled by $\emptyset$.) Furthermore, let $N_{2}$ be the connex subset of $T_{2}$ obtained by replacing every leaf node $l$ in $N_{1}$ by its newly inserted node $n_{l}$. Clearly, $\operatorname{\textit{var}}(N_{2})=\operatorname{\textit{var}}(N_{1})=\operatorname{\textit{var}}(N)$ because $\operatorname{\textit{var}}(l)=\operatorname{\textit{var}}(n_{l})$ for every leaf $l$ of $T_{1}$. By our construction, $(T_{2},N_{2})$ is equivalent to $(T,N)$; $T_{2}$ satisfies canonicality conditions (1), (2), and (4); and $N_{2}$ is canonical.

Step 3. It remains to enforce condition (3). To this end, observe that, by the connectedness condition of GJTs, $T_{2}$ violates canonicality condition (3) if and only if there exist internal nodes $m$ and $n$ where $m$ is the parent of $n$ such that $\operatorname{\textit{var}}(m)=\operatorname{\textit{var}}(n)$ . In this case, we call $n$ a culprit node. We will now show how to obtain an equivalent pair $(U,M)$ that removes a single culprit node; the final result is then obtained by iterating this reasoning until all culprit nodes have been removed.

The culprit removal procedure is essentially the reverse of the binarization procedure of Fig. 5. Concretely, let $n$ be a culprit node with parent $m$ and let $n_{1},\dots,n_{k}$ be the children of $n$ in $T_{2}$ . Let $U$ be the GJT obtained from $T_{2}$ by removing $n$ and attaching all children $n_{i}$ of $n$ as children to $m$ with edge label $\operatorname{\textit{pred}}_{U}(m\to n_{i})=\operatorname{\textit{pred}}_{T_{2}}(n\to n_{i})$ , for $1\leq i\leq k$ . Because $\operatorname{\textit{var}}(n)=\operatorname{\textit{var}}(m)$ , the result is still a valid GJT. Moreover, because $\operatorname{\textit{var}}(n)=\operatorname{\textit{var}}(m)$ and $T_{2}$ satisfied condition (4), we had $\operatorname{\textit{pred}}_{T_{2}}(m\to n)=\emptyset$ , so no predicate was lost by the removal of $n$ . Finally, define $M$ as follows. If $n\in N_{2}$ , then set $M=N_{2}\setminus\{n\}$ , otherwise set $M=N_{2}$ . In the former case, since $N_{2}$ is connex and $n\in N_{2}$ , $m$ must also be in $N_{2}$ . It is hence in $M$ . Therefore, in both cases, $\operatorname{\textit{var}}(N)=\operatorname{\textit{var}}(N_{2})=\operatorname{\textit{var}}(M)$ . Furthermore, it is straightforward to check that $M$ is a connex subset of $U$ . Finally, since $N_{2}$ consisted only of interior nodes of $T_{2}$ , $M$ consists only of interior nodes of $U$ and hence remains canonical. ∎
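A single culprit removal step can be sketched as follows. The representation is an assumption made for illustration: `children` maps each node to its list of children, `var` maps nodes to frozensets of variables, and `pred` maps edges (pairs of nodes) to sets of predicates. The function name `remove_culprit` is hypothetical.

```python
def remove_culprit(children, var, pred, m, n):
    """Splice out interior node n, whose parent m carries exactly the same variables."""
    # Condition (4) together with var(m) = var(n) forces the edge m -> n to carry no predicate.
    assert n in children[m] and var[m] == var[n] and not pred.get((m, n))
    kids = children.pop(n)
    children[m] = [c for c in children[m] if c != n] + kids
    for c in kids:                                   # edge labels move from n -> c to m -> c
        pred[(m, c)] = pred.pop((n, c), frozenset())
    pred.pop((m, n), None)
    del var[n]

def var_of(nodes, var):
    """Union of the labels of a set of nodes."""
    return set().union(*(var[k] for k in nodes))

children = {"r": ["m"], "m": ["n"], "n": ["a", "b"], "a": [], "b": []}
var = {"r": frozenset(), "m": frozenset("xy"), "n": frozenset("xy"),
       "a": frozenset("xz"), "b": frozenset("y")}
pred = {("n", "a"): frozenset({frozenset("xz")})}
N2 = {"r", "m", "n"}                                 # connex subset containing the culprit

before = var_of(N2, var)
remove_culprit(children, var, pred, "m", "n")
M = N2 - {"n"}
assert var_of(M, var) == before                      # var(M) = var(N_2)
assert children["m"] == ["a", "b"] and "n" not in children
```

Iterating this step until no parent and child among the interior nodes share a label corresponds to the repeated culprit removal described above.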

C.3 Proof of Lemma 8

We first require a number of auxiliary results.

We begin with the following observations regarding canonical GJT pairs.

Lemma 11.

Let $(T,N)$ be a canonical GJT pair, let $n$ be a frontier node of $N$ and let $m$ be the parent of $n$ in $T$.

1. $x\not\in\operatorname{\textit{var}}(N\setminus\{n\})$, for every $x\in\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m)$.

2. $\operatorname{\textit{hyp}}(T,N\setminus\{n\})=\operatorname{\textit{hyp}}(T,N)\setminus\{\operatorname{\textit{var}}(n)\}$.

3. $\theta\not\in\operatorname{\textit{pred}}(m\to n)$, for every $\theta\in\operatorname{\textit{pred}}(T,N\setminus\{n\})$.

4. $\operatorname{\textit{pred}}(T,N\setminus\{n\})=\operatorname{\textit{pred}}(T,N)\setminus\operatorname{\textit{pred}}(m\to n)$.

5. $\operatorname{\textit{pred}}(m\to n)=\{\theta\in\operatorname{\textit{pred}}(T,N)\mid\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\not=\emptyset\}$.

6. $\operatorname{\textit{pred}}(T,N\setminus\{n\})=\{\theta\in\operatorname{\textit{pred}}(T,N)\mid\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))=\emptyset\}$.

Proof.

(1) Let $x\in\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m)$ and let $c$ be a node in $N\setminus\{n\}$ . Clearly the unique undirected path between $c$ and $n$ in $T$ must pass through $m$ . Because $x\not\in\operatorname{\textit{var}}(m)$ it follows from the connectedness condition of GJTs that also $x\not\in\operatorname{\textit{var}}(c)$ . As such, $x\not\in\operatorname{\textit{var}}(N\setminus\{n\})$ .

(2) The $\supseteq$ direction is trivial. For the $\subseteq$ direction, assume that $m\in N\setminus\{n\}$ with $\operatorname{\textit{var}}(m)\not=\emptyset$ . Then clearly $m\in N$ and hence $\operatorname{\textit{var}}(m)\in\operatorname{\textit{hyp}}(T,N)$ . Furthermore, because $N$ is canonical, both $m$ and $n$ are interior nodes in $T$ . Then, because $T$ is canonical and $m\not=n$ we have $\operatorname{\textit{var}}(m)\not=\operatorname{\textit{var}}(n)$ . Therefore, $\operatorname{\textit{var}}(m)\in\operatorname{\textit{hyp}}(T,N)\setminus\{\operatorname{\textit{var}}(n)\}$ .

(3) Let $\theta\in\operatorname{\textit{pred}}(T,N\setminus\{n\})$. Then $\theta$ occurs on the edge between two nodes in $N\setminus\{n\}$, say $m^{\prime}\to n^{\prime}$. By definition of GJTs, $\operatorname{\textit{var}}(\theta)\subseteq\operatorname{\textit{var}}(n^{\prime})\cup\operatorname{\textit{var}}(m^{\prime})\subseteq\operatorname{\textit{var}}(N\setminus\{n\})$. Now suppose for the purpose of contradiction that also $\theta\in\operatorname{\textit{pred}}(m\to n)$. Because $T$ is nice, $\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\not=\emptyset$, so we can pick some $x$ in this intersection. Hence, by (1), $x\not\in\operatorname{\textit{var}}(N\setminus\{n\})$, which contradicts $\operatorname{\textit{var}}(\theta)\subseteq\operatorname{\textit{var}}(N\setminus\{n\})$.

(4) Clearly, $\operatorname{\textit{pred}}(T,N)\setminus\operatorname{\textit{pred}}(m\to n)\subseteq\operatorname{\textit{pred}}(T,N\setminus\{n\})$ . The converse inclusion follows from (3).

(5) The $\subseteq$ direction follows from the fact that $m$ and $n$ are in $N$, and $T$ is nice. To also see $\supseteq$, let $\theta\in\operatorname{\textit{pred}}(T,N)$ with $\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\not=\emptyset$. Then there exists $x\in\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))$. By (1), $x\not\in\operatorname{\textit{var}}(N\setminus\{n\})$. Therefore, $\theta$ cannot occur on an edge between two nodes of $N\setminus\{n\}$ in $T$. Since it nevertheless occurs in $\operatorname{\textit{pred}}(T,N)$, it must occur in $\operatorname{\textit{pred}}(m\to n)$.

(6) Follows directly from (4) and (5). ∎
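To make the six claims of Lemma 11 concrete, the following Python sketch checks them on one small hand-made tree. Everything here is an assumption made for illustration: the tree, its variable labels and its predicates are invented, and the helper definitions of $\operatorname{\textit{var}}(N)$, $\operatorname{\textit{hyp}}(T,N)$ and $\operatorname{\textit{pred}}(T,N)$ are read off from how the proof above uses them rather than quoted from the paper's definitions.

```python
# Hypothetical canonical tree: root r (empty label) with child m,
# and m with children n and c. Names and labels are illustrative only.
var = {"r": frozenset(), "m": frozenset("x"),
       "n": frozenset("xy"), "c": frozenset("xz")}
parent = {"m": "r", "n": "m", "c": "m"}
# Each predicate name maps to the variables it mentions.
pred_vars = {"x>5": frozenset("x"), "x<y": frozenset("xy"), "z<x": frozenset("xz")}
edge_pred = {("r", "m"): {"x>5"}, ("m", "n"): {"x<y"}, ("m", "c"): {"z<x"}}


def var_of(nodes):
    """var(N): all variables occurring in some node of N."""
    return frozenset().union(*(var[v] for v in nodes))


def hyp_of(nodes):
    """hyp(T, N): the non-empty node labels in N (assumed definition)."""
    return {var[v] for v in nodes if var[v]}


def pred_of(nodes):
    """pred(T, N): predicates on edges whose endpoints both lie in N."""
    return {t for (p, c), ts in edge_pred.items()
            if p in nodes and c in nodes for t in ts}


def check_lemma11(N, n):
    """Evaluate claims (1)-(6) for frontier node n of N; returns claim -> bool."""
    m = parent[n]
    rest = set(N) - {n}
    private = var[n] - var[m]  # var(n) \ var(m)
    touching = {t for t in pred_of(N) if pred_vars[t] & private}
    return {
        1: not (private & var_of(rest)),
        2: hyp_of(rest) == hyp_of(N) - {var[n]},
        3: not (pred_of(rest) & edge_pred[(m, n)]),
        4: pred_of(rest) == pred_of(N) - edge_pred[(m, n)],
        5: edge_pred[(m, n)] == touching,
        6: pred_of(rest) == pred_of(N) - touching,
    }


results = check_lemma11({"r", "m", "n", "c"}, "n")
```

A check on a single example is of course no substitute for the proof. It only shows how the predicates of $N$ split into those on the edge $m\to n$ and those that survive in $N\setminus\{n\}$.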

Lemma 12.

Let $(T,N)$ be a canonical GJT pair, let $n$ be a frontier node of $N$ and let $m$ be the parent of $n$ in $T$. Let $\overline{z}\subseteq\operatorname{\textit{var}}(N\setminus\{n\})$.

1. $\operatorname{\textit{var}}(n)\operatorname*{\sqsubseteq}_{\operatorname{\mathcal{H}}(T,N,\overline{z})}\operatorname{\textit{var}}(m)$.

2. $x\not\in\operatorname{\textit{jv}}(\operatorname{\mathcal{H}}(T,N,\overline{z}))$, for every $x\in(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))$.

Proof.

For reasons of parsimony, let $\mathcal{H}=\operatorname{\mathcal{H}}(T,N,\overline{z})$ . We first prove (2) and then (1).

(2) Let $x\in\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m)$ . By Lemma 11(1), $x\not\in\operatorname{\textit{var}}(N\setminus\{n\})$ . Therefore, $x$ occurs in $\operatorname{\textit{var}}(n)$ in $\mathcal{H}$ and in no other hyperedge. Furthermore, because $\overline{z}\subseteq\operatorname{\textit{var}}(N\setminus\{n\})$ , also $x\not\in\overline{z}$ . Hence $x\not\in\operatorname{\textit{jv}}_{\mathcal{H}}(\operatorname{\textit{var}}(n))$ .

(1) We need to show that $\operatorname{\textit{jv}}_{\mathcal{H}}(\operatorname{\textit{var}}(n))\subseteq\operatorname{\textit{var}}(m)$ and $\operatorname{\textit{ext}}_{\mathcal{H}}(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\subseteq\operatorname{\textit{var}}(m)$ . Let $x\in\operatorname{\textit{jv}}_{\mathcal{H}}(\operatorname{\textit{var}}(n))$ . By contraposition of (2), we know that $x\not\in(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))$ . Therefore, $x\in\operatorname{\textit{var}}(m)$ and thus $\operatorname{\textit{jv}}_{\mathcal{H}}(\operatorname{\textit{var}}(n))\subseteq\operatorname{\textit{var}}(m)$ . To show $\operatorname{\textit{ext}}_{\mathcal{H}}(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\subseteq\operatorname{\textit{var}}(m)$ , let $y\in\operatorname{\textit{ext}}_{\mathcal{H}}(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))$ . Then $y\not\in\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m)$ and there exists $\theta\in\operatorname{\textit{pred}}(T,N)$ with $\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\not=\emptyset$ and $y\in\operatorname{\textit{var}}(\theta)$ . By Lemma 11(5), $\theta\in\operatorname{\textit{pred}}_{T}(m\to n)$ . Thus, $y\in\operatorname{\textit{var}}(m)\cup\operatorname{\textit{var}}(n)$ . Since also $y\not\in\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m)$ , it follows that $y\in\operatorname{\textit{var}}(m)$ . Therefore, $\operatorname{\textit{ext}}_{\mathcal{H}}(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))\subseteq\operatorname{\textit{var}}(m)$ . ∎

Lemma 13.

Let $(T,N)$ be a canonical GJT pair and let $n$ be a frontier node of $N$ . Then $\operatorname{\mathcal{H}}(T,N,\overline{z})\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(T,N\setminus\{n\},\overline{z})$ for every $\overline{z}\subseteq\operatorname{\textit{var}}(N\setminus\{n\})$ .

Proof.

For reasons of parsimony, let us abbreviate $\mathcal{H}_{1}=\operatorname{\mathcal{H}}(T,N,\overline{z})$ and $\mathcal{H}_{2}=\operatorname{\mathcal{H}}(T,N\setminus\{n\},\overline{z})$ . We make the following case analysis.

Case (1): Node $n$ is the root in $N$. Because the root of a canonical tree is labeled by $\emptyset$ we have $\operatorname{\textit{var}}(n)=\emptyset$. Since $n$ is a frontier node of $N$, $N=\{n\}$. Thus, $\operatorname{\textit{hyp}}(T,N)=\emptyset$ and $\operatorname{\textit{hyp}}(T,N\setminus\{n\})=\emptyset$. Furthermore, $\operatorname{\textit{pred}}(T,N)=\operatorname{\textit{pred}}(T,N\setminus\{n\})=\emptyset$ and $\overline{z}\subseteq\operatorname{\textit{var}}(N\setminus\{n\})=\operatorname{\textit{var}}(\emptyset)=\emptyset$. As such, both $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ are the empty triplet $(\emptyset,\emptyset,\emptyset)$. Therefore $\mathcal{H}_{1}\operatorname*{\rightsquigarrow}^{*}\mathcal{H}_{2}$.

Case (2): $n$ has parent $m$ in $N$ and $\operatorname{\textit{var}}(m)\not=\emptyset$ . Then $\operatorname{\textit{var}}(n)\not=\emptyset$ since in a canonical tree the root node is the only interior node that is labeled by the empty hyperedge. Therefore, $\operatorname{\textit{var}}(n)\in\operatorname{\textit{hyp}}(T,N)$ , $\operatorname{\textit{var}}(m)\in\operatorname{\textit{hyp}}(T,N)$ , and $\operatorname{\textit{var}}(n)\operatorname*{\sqsubseteq}_{\mathcal{H}_{1}}\operatorname{\textit{var}}(m)$ by Lemma 12(1). We can hence apply reduction (CSE) to remove $\operatorname{\textit{var}}(n)$ from $\operatorname{\textit{hyp}}(\mathcal{H}_{1})$ and all predicates that intersect with $\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m)$ from $\operatorname{\textit{pred}}(\mathcal{H}_{1})$ . By Lemma 11(2) and 11(6) the result is exactly $\mathcal{H}_{2}$ :

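$$\begin{aligned}\operatorname{\textit{hyp}}(\mathcal{H}_{1})\setminus\{\operatorname{\textit{var}}(n)\}&=\operatorname{\textit{hyp}}(T,N)\setminus\{\operatorname{\textit{var}}(n)\}=\operatorname{\textit{hyp}}(T,N\setminus\{n\})=\operatorname{\textit{hyp}}(\mathcal{H}_{2}),\\ \{\theta\in\operatorname{\textit{pred}}(\mathcal{H}_{1})\mid\operatorname{\textit{var}}(\theta)\cap(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))=\emptyset\}&=\operatorname{\textit{pred}}(T,N\setminus\{n\})=\operatorname{\textit{pred}}(\mathcal{H}_{2}).\end{aligned}$$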

Case (3): $n$ has parent $m$ in $N$ and $\operatorname{\textit{var}}(m)=\emptyset$. Then $\operatorname{\textit{var}}(n)\not=\emptyset$ since in a canonical tree the root node is the only interior node that is labeled by the empty hyperedge. By definition of GJTs, it follows that for every $\theta\in\operatorname{\textit{pred}}(m\to n)$ we have $\operatorname{\textit{var}}(\theta)\subseteq\operatorname{\textit{var}}(n)\cup\operatorname{\textit{var}}(m)=\operatorname{\textit{var}}(n)$. In other words: all $\theta\in\operatorname{\textit{pred}}(m\to n)$ are filters. As such, we can use reduction (FLT) to remove all predicates in $\operatorname{\textit{pred}}(m\to n)$ from $\mathcal{H}_{1}$. This yields a triplet $\mathcal{I}$ with the same hypergraph as $\mathcal{H}_{1}$, same set of output variables as $\mathcal{H}_{1}$, and

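$$\operatorname{\textit{pred}}(\mathcal{I})=\operatorname{\textit{pred}}(\mathcal{H}_{1})\setminus\operatorname{\textit{pred}}(m\to n)=\operatorname{\textit{pred}}(T,N)\setminus\operatorname{\textit{pred}}(m\to n)=\operatorname{\textit{pred}}(T,N\setminus\{n\}),$$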

where the third equality is due to Lemma 11(4). We claim that every variable in $\operatorname{\textit{var}}(n)$ is isolated in $\mathcal{I}$. From this the result follows, because then we can apply (ISO) to remove the entire hyperedge $\operatorname{\textit{var}}(n)$ from $\operatorname{\textit{hyp}}(\mathcal{I})=\operatorname{\textit{hyp}}(\mathcal{H}_{1})$ while preserving $\operatorname{\textit{out}}(\mathcal{I})$ and $\operatorname{\textit{pred}}(\mathcal{I})$. The resulting triplet hence equals $\mathcal{H}_{2}$. To see that $\operatorname{\textit{var}}(n)\subseteq\operatorname{\textit{isol}}(\mathcal{I})$, observe that no predicate in $\operatorname{\textit{pred}}(\mathcal{I})=\operatorname{\textit{pred}}(T,N\setminus\{n\})$ shares a variable with $\operatorname{\textit{var}}(n)=(\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m))$ by Lemma 11(6). Therefore $\operatorname{\textit{var}}(n)\cap\operatorname{\textit{var}}(\operatorname{\textit{pred}}(\mathcal{I}))=\emptyset$. Furthermore, $\operatorname{\textit{var}}(n)\cap\operatorname{\textit{jv}}(\mathcal{I})=\emptyset$ because $\operatorname{\textit{jv}}(\mathcal{I})=\operatorname{\textit{jv}}(\mathcal{H}_{1})$ and no $x\in\operatorname{\textit{var}}(n)=\operatorname{\textit{var}}(n)\setminus\operatorname{\textit{var}}(m)$ is in $\operatorname{\textit{jv}}(\mathcal{H}_{1})$ by Lemma 12(2). ∎

See Lemma 8.

Proof.

By induction on $k$, the number of nodes in $N_{1}\setminus N_{2}$. In the base case where $k=0$, the result trivially holds since then $N_{1}=N_{2}$ and the two triplets are identical. For the induction step, assume that $k>0$ and the result holds for $k-1$. Because both $N_{1}$ and $N_{2}$ are connex subsets of the same tree $T$, there exists a node $n\in N_{1}$ that is a frontier node in $N_{1}$, and which is not in $N_{2}$. Then define $N^{\prime}_{1}=N_{1}\setminus\{n\}$. Clearly $(T,N^{\prime}_{1})$ is again canonical, and $|N^{\prime}_{1}\setminus N_{2}|=k-1$. Therefore, $\operatorname{\mathcal{H}}(T,N^{\prime}_{1},\overline{z})\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(T,N_{2},\overline{z})$ by induction hypothesis. Furthermore, $\operatorname{\mathcal{H}}(T,N_{1},\overline{z})\operatorname*{\rightsquigarrow}^{*}\operatorname{\mathcal{H}}(T,N^{\prime}_{1},\overline{z})$ by Lemma 13, from which the result follows. ∎
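The induction above amounts to peeling frontier nodes off $N_{1}$ one at a time until $N_{2}$ remains. A minimal Python sketch of that peeling order follows. It assumes that $N_{2}\subseteq N_{1}$, that both sets are connex, and that a frontier node of a connex set is a node with no child inside the set (an assumed reading of Definition 3, which is not reproduced in this appendix). The function name `peel_to` and the `parent` encoding are hypothetical.

```python
def peel_to(parent, N1, N2):
    """Return the order in which nodes of N1 are removed to reach N2.

    parent: dict mapping each non-root node to its parent (assumed encoding)
    Each removed node is a frontier node of the current set that lies
    outside N2, mirroring one application of Lemma 13 per step.
    """
    current, target = set(N1), set(N2)
    assert target <= current, "sketch assumes N2 is a subset of N1"
    order = []
    while current != target:
        # Nodes of `current` that still have a child inside `current`.
        has_child_inside = {parent[c] for c in current
                            if c in parent and parent[c] in current}
        # Frontier nodes of `current` outside the target; sorted only to
        # make the chosen order deterministic.
        candidates = sorted(v for v in current - target
                            if v not in has_child_inside)
        # The proof argues that such a node exists when both sets are connex.
        assert candidates, "no removable frontier node: inputs are not connex"
        n = candidates[0]
        current.remove(n)
        order.append(n)
    return order


# Hypothetical tree: r has children a and b; a has children c and d.
parent = {"a": "r", "b": "r", "c": "a", "d": "a"}
order = peel_to(parent, {"r", "a", "b", "c"}, {"r"})
# order == ["b", "c", "a"]
```

The length of the returned order equals $|N_{1}\setminus N_{2}|$, the quantity $k$ on which the proof inducts.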

C.4 Proof of Lemma 9

See Lemma 9.

Proof.

The proof is by induction on $k$ , the number of hyperedges in $H_{2}\setminus H_{1}$ . In the base case where $k=0$ , the result trivially holds since $H_{1}\cup H_{2}=H_{1}$ and the two triplets are hence identical. For the induction step, assume that $k>0$ and the result holds for $k-1$ . Fix some $e\in H_{2}\setminus H_{1}$ and define $H^{\prime}_{2}=H_{2}\setminus\{e\}$ . Then $|H^{\prime}_{2}\setminus H_{1}|=k-1$ . We show that $(H_{1}\cup H_{2},\overline{z},\Theta)\operatorname*{\rightsquigarrow}^{*}(H_{1}\cup H^{\prime}_{2},\overline{z},\Theta)$ , from which the result follows since $(H_{1}\cup H^{\prime}_{2},\overline{z},\Theta)\operatorname*{\rightsquigarrow}^{*}(H_{1},\overline{z},\Theta)$ by induction hypothesis. To this end, we observe that there exists $\ell\in H_{1}\setminus\{e\}$ with $e\subseteq\ell$ . Therefore, $\operatorname{\textit{jv}}_{(H_{1}\cup H_{2},\overline{z},\Theta)}(e)\subseteq e\subseteq\ell$ . Moreover, $e\setminus\ell=\emptyset$ . Therefore, $\operatorname{\textit{ext}}_{(H_{1}\cup H_{2},\overline{z},\Theta)}(e\setminus\ell)=\emptyset\subseteq\ell$ . Thus $e\operatorname*{\sqsubseteq}_{(H_{1}\cup H_{2},\overline{z},\Theta)}\ell$ . We may therefore apply (CSE) to remove $e$ from $H_{1}\cup H_{2}$ , yielding $H_{1}\cup H^{\prime}_{2}$ . Since no predicate shares variables with $e\setminus\ell=\emptyset$ this does not modify $\Theta$ . Therefore, $(H_{1}\cup H_{2},\overline{z},\Theta)\operatorname*{\rightsquigarrow}^{*}(H_{1}\cup H^{\prime}_{2},\overline{z},\Theta)$ . ∎
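The argument above removes the hyperedges of $H_{2}\setminus H_{1}$ one by one, each time pointing to a hyperedge $\ell\in H_{1}$ that contains the removed edge. The Python sketch below mirrors only this set-level bookkeeping. It takes the containment premise from the proof itself (the lemma statement is not reproduced here), it does not model $\overline{z}$ or $\Theta$ because the proof shows they are unchanged, and the name `absorb` is hypothetical.

```python
def absorb(H1, H2):
    """Remove every hyperedge of H2 minus H1 from the union of H1 and H2.

    H1, H2: sets of frozensets of variable names.
    Returns (remaining hyperedges, trace), where the trace lists pairs
    (e, l) with e the removed edge and l a containing edge of H1.
    """
    current = set(H1) | set(H2)
    trace = []
    # Sorting only fixes a deterministic removal order for the example.
    for e in sorted(set(H2) - set(H1), key=sorted):
        witnesses = [l for l in sorted(H1, key=sorted) if l != e and e <= l]
        # Premise used in the proof: some l in H1 contains e.
        assert witnesses, "each removed hyperedge must be contained in one of H1"
        current.remove(e)  # corresponds to one (CSE) step in the proof
        trace.append((e, witnesses[0]))
    return current, trace


# Hypothetical hypergraphs over variables x, y, z, u.
H1 = {frozenset("xyz"), frozenset("zu")}
H2 = {frozenset("xy"), frozenset("z"), frozenset("zu")}
remaining, trace = absorb(H1, H2)
```

In this example two removals are recorded and the remaining hyperedges are exactly those of $H_{1}$, matching the $k=2$ case of the induction.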

