Containment for Rule-Based Ontology-Mediated Queries

Pablo Barceló; Gerald Berger; Andreas Pieris

arXiv:1703.07994·cs.DB·April 20, 2017

TL;DR

This paper investigates the containment problem for ontology-mediated queries expressed with guarded, non-recursive, and sticky tgds, providing complexity bounds and analyzing applications like component distribution and UCQ rewritability.

Contribution

It introduces tailored techniques for OMQ containment under specific tgd classes, establishing sharp complexity bounds and exploring practical applications.

Findings

1. Sharp complexity bounds for OMQ containment

2. Techniques tailored to guarded, non-recursive, and sticky tgds

3. Insights into distribution over components and UCQ rewritability

Tables

Table 1: Complexity of OMQ containment; the complexity of OMQ evaluation, set in small font in the original, is recalled in parentheses.

| | Arbitrary Arity | Bounded Arity |
|---|---|---|
| Linear | PSpace-c | $\Pi_{2}^{P}$-c (NP-c) |
| Sticky | coNExpTime-c (ExpTime-c) | $\Pi_{2}^{P}$-c (NP-c) |
| Non-recursive | in ExpSpace and $\mathrm{P}^{\mathrm{NEXP}}$-hard (NExpTime-c) | in ExpSpace and $\mathrm{P}^{\mathrm{NEXP}}$-hard (NExpTime-c) |
| Guarded | 2ExpTime-c | 2ExpTime-c (ExpTime-c) |

Equations

$$q(\bar{x})\ :=\ \exists\bar{y}\big(R_{1}(\bar{v}_{1})\wedge\dots\wedge R_{m}(\bar{v}_{m})\big),$$

$$\forall\bar{x}\forall\bar{y}\big(\phi(\bar{x},\bar{y})\rightarrow\exists\bar{z}\,\psi(\bar{x},\bar{z})\big),$$

$$I_{0}\xrightarrow{\tau_{0},\bar{c}_{0}}I_{1}\xrightarrow{\tau_{1},\bar{c}_{1}}I_{2}\dots$$

$$\mathit{cert}(q,D,\Sigma)\ =\ \bigcap_{I\supseteq D,\ I\models\Sigma}\{\bar{c}\in\mathit{dom}(I)^{|\bar{x}|}\mid\bar{c}\in q(I)\}.$$

$$\bar{c}\in Q(D)\iff\underbrace{(\mathit{sch}(\Sigma),\emptyset,q_{D,\bar{c}})}_{Q_{1}}\subseteq\underbrace{(\mathit{sch}(\Sigma),\Sigma,q)}_{Q_{2}}.$$

$$\bar{c}\in Q(D)\iff\underbrace{(\mathbf{S},\Sigma_{D}^{\star},q_{\bar{c}}^{\star})}_{Q_{1}}\not\subseteq\underbrace{(\mathbf{S},\emptyset,\exists x\,P(x))}_{Q_{2}},$$

$$\{\top\rightarrow R^{\star}(c_{1},\dots,c_{k})\mid R(c_{1},\dots,c_{k})\in D\},$$

$$P(x)\rightarrow\exists y\,R(x,y),\quad R(x,y)\rightarrow P(y),\quad T(x)\rightarrow P(x),$$

$$f_{(\mathbb{NR},\mathbb{CQ})}\big((\mathbf{S},\Sigma,q)\big)\ \leq\ |q|\cdot\left(\max_{\tau\in\Sigma}\{|\mathit{body}(\tau)|\}\right)^{|\mathit{sch}(\Sigma)|}.$$

$$\{Q_{1}^{n}=(\mathbf{S},\Sigma_{1}^{n},q_{1})\}_{n>0}\ \text{ and }\ \{Q_{2}^{n}=(\mathbf{S},\Sigma_{2}^{n},q_{2})\}_{n>0},$$

$$f_{(\mathbb{S},\mathbb{CQ})}\big((\mathbf{S},\Sigma,q)\big)\ \leq\ |\mathbf{S}|\cdot\big(|T(q)|+|C(\Sigma)|+1\big)^{|\mathit{ar}(\mathbf{S})|}.$$

$$\{Q^{n}=(\{S/n\},\Sigma^{n},q(\bar{x}))\}_{n>0},\ \text{ where }\ \|\Sigma^{n}\|\in O(n^{2}),$$

$$Q_{1}\subseteq Q_{2}\iff L(A)=\emptyset.$$

$$Q(D)\neq\varnothing\ \Longrightarrow\ \big(Q(D_{\leq k})\neq\varnothing\ \text{ or }\ Q(D_{>0})\neq\varnothing\big),$$

$$Q(\llbracket L\rrbracket)\neq\varnothing\ \Longrightarrow\ \big(Q(\llbracket L\rrbracket_{\leq k})\neq\varnothing\ \text{ or }\ Q(\llbracket L\rrbracket_{>0})\neq\varnothing\big),$$

$$Q\ \text{is UCQ rewritable}\iff L(A)\ \text{is finite}.$$

$$R(x_{1},\dots,x_{k})\rightarrow R'(x_{1},\dots,x_{k},1),\ \mathit{True}(1).$$

$$\mathit{True}(t)\rightarrow\exists\bar{x}\,\exists\bar{y}\,\exists f\ \phi_{\wedge}'(\bar{x},\bar{y},f),\ \psi(t,f),$$

$$\mathit{Or}(t,t,t),\ \mathit{Or}(t,f,t),\ \mathit{Or}(f,t,t),\ \mathit{Or}(f,f,f).$$

$$\phi'(\bar{x},\bar{y},w)\rightarrow\exists\bar{z}\,\psi'(\bar{x},\bar{z},w),$$

$$\exists\bar{x}\,\exists\bar{y}\,\Big(\mathit{False}(y_{1})\wedge\bigwedge_{1\leq i\leq n}\big(q_{i}'[x_{i}]\wedge\mathit{Or}(y_{i},x_{i},y_{i+1})\big)\wedge\mathit{True}(y_{n+1})\Big),$$

$$q=\exists x\,\exists y\,\exists z\,\big(R(x,y)\wedge R(x,z)\big)$$

$$\sigma=P(u,v)\rightarrow\exists w\,R(w,u)$$

$$C_{i}^{j}$$

$$C_{0},\dots,C_{k-1}$$

$$\mathit{Tile}_{i}(x),\ \mathit{Tile}_{j}(y)$$

Taxonomy

Topics: Semantic Web and Ontologies · Advanced Database Systems and Queries · Service-Oriented Architecture and Web Services

Full text

Containment for Rule-Based Ontology-Mediated Queries

Pablo Barceló (Center for Semantic Web Research & DCC, University of Chile)

Gerald Berger (Institute of Information Systems, TU Wien)

Andreas Pieris (School of Informatics, University of Edinburgh)

Abstract

Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to decidability of the problem of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. In this work, we study the containment problem for OMQs expressed in such formalisms, which is a key ingredient for solving static analysis tasks associated with them. Our main contribution is the development of specially tailored techniques for OMQ containment under the classes of tgds stated above. This enables us to obtain sharp complexity bounds for the problems at hand, which in turn allow us to delimit their practical applicability. We also apply our techniques to pinpoint the complexity of problems associated with two emerging applications of OMQ containment: distribution over components and UCQ rewritability of OMQs.

1 Introduction

Motivation and goals. The novel application of knowledge representation tools for handling incomplete and heterogeneous data is giving rise to a new field, recently coined as knowledge-enriched data management [6]. A crucial problem in this field is ontology-based data access (OBDA) [51], which refers to the utilization of ontologies (i.e., sets of logical sentences) for providing a unified conceptual view of various data sources. Users can then pose their queries solely in the schema provided by the ontology, abstracting away from the specifics of the individual sources. In OBDA, one interprets the ontology $\Sigma$ and the user query $q$ , which is typically a union of conjunctive queries (UCQ), or, equivalently, the expressions defined by the select-project-join-union operators of relational algebra, as two components of one composite query $Q=(\mathbf{S},\Sigma,q)$ , known as ontology-mediated query (OMQ); $\mathbf{S}$ is called the data schema, indicating that $Q$ will be posed on databases over $\mathbf{S}$ [19]. Therefore, OBDA is often realized as the problem of answering OMQs.

Following recent work [24, 26, 27, 40], we focus on the case where the ontology is defined by a set of tuple-generating dependencies (tgds), a.k.a. existential rules or Datalog± rules. Handling such OMQs poses new challenges for classical database tasks. Interestingly, some of these challenges are by now well-studied; most notably (a) query evaluation [8, 24, 25, 27]: given an OMQ $Q=(\mathbf{S},\Sigma,q)$ , a database $D$ over $\mathbf{S}$ , and a tuple of constants $\bar{c}$ , does $\bar{c}$ belong to the evaluation of $q$ over every extension of $D$ that satisfies $\Sigma$ , or, equivalently, is $\bar{c}$ a certain answer for $Q$ over $D$ ? and (b) relative expressiveness [19, 42, 43]: how does the expressiveness of OMQs compare to that of other query languages? Surprisingly, despite its prominence, no work to date has carried out an in-depth investigation of containment for OMQs based on tgds and UCQs.

Query containment is a fundamental static analysis task that amounts to checking whether the evaluation of a query is always contained in the evaluation of another query. Several database tasks crucially depend on the ability to check query containment; these include, e.g., query optimization, view-based query answering, querying incomplete databases, integrity checking, and implication of dependencies: cf. [22, 30, 36, 37, 39, 45]. A particularly important instance of the containment problem is the one defined by the class of CQs. It follows from the seminal work of Chandra and Merlin [29] that CQ containment is polynomially equivalent to CQ evaluation, and thus NP-complete. The NP upper bound is not affected if we consider UCQs [54]. This is seen as a positive result for practical applications that rely on UCQ containment, as the input (the two UCQs) is small. In addition, it shows a stark difference with more expressive relational query languages, e.g., relational algebra (or, equivalently, first-order logic), for which containment is undecidable.
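The equivalence between CQ containment and CQ evaluation can be illustrated with the classical canonical-database test: freeze the variables of $q_1$ into constants, and check whether $q_2$ returns the frozen head of $q_1$ over the resulting database. The following is a minimal sketch, not taken from the paper. The encoding is an assumption of the sketch: atoms are pairs `(predicate, args)`, variables are strings starting with `?`, and a CQ is a pair `(head_variables, atoms)` whose head variables all occur in its atoms.

```python
def is_var(term):
    # Convention of this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")

def homomorphisms(atoms, facts, partial=None):
    """Yield each variable assignment that maps every atom onto some fact."""
    partial = partial or {}
    if not atoms:
        yield dict(partial)
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        extended = dict(partial)
        # Variables must be mapped consistently; constants must match exactly.
        if all(extended.setdefault(a, b) == b if is_var(a) else a == b
               for a, b in zip(args, fact_args)):
            yield from homomorphisms(rest, facts, extended)

def cq_contained(q1, q2):
    """Return True iff q1 is contained in q2 (canonical-database test)."""
    head1, atoms1 = q1
    head2, atoms2 = q2
    if len(head1) != len(head2):
        return False

    def freeze(term):
        # A frozen variable becomes a fresh constant that cannot clash
        # with the string constants of the queries.
        return ("frozen", term) if is_var(term) else term

    canonical_db = {(p, tuple(freeze(t) for t in args)) for p, args in atoms1}
    target = tuple(freeze(v) for v in head1)
    return any(tuple(h[v] for v in head2) == target
               for h in homomorphisms(list(atoms2), canonical_db))
```

For instance, with `q1 = (("?x",), [("R", ("?x", "?y")), ("R", ("?y", "?x"))])` and `q2 = (("?x",), [("R", ("?x", "?y"))])`, the call `cq_contained(q1, q2)` succeeds while `cq_contained(q2, q1)` does not. The backtracking search is exponential in the worst case, in line with the NP-completeness of the problem.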

The main goal of this work is to understand to what extent the good computational properties of UCQ containment discussed above carry over to the containment problem for OMQs based on tgds and UCQs (simply called OMQs from now on). In particular, we want to understand which classes of tgds guarantee the decidability of the problem, and, whenever this is the case, how we can obtain complexity bounds that are reasonable for practical purposes. We also want to understand the exact relationship between OMQ containment and evaluation for such classes. Let us stress that, apart from the traditional applications of containment mentioned above, it has been recently shown that OMQ containment has applications to other important static analysis tasks for OMQs, namely, distribution over components [15], and UCQ rewritability [16].

The context. As one might expect, when considered in its full generality, i.e., without any restrictions on the set of tgds, the OMQ containment problem is undecidable. To understand, on the other hand, which restrictions lead to decidability, we recall the two main reasons that render the general containment problem undecidable. These are:

Undecidability of query evaluation: OMQ evaluation is, in general, undecidable [12], and it can be reduced to OMQ containment. More precisely, OMQ containment is undecidable whenever query evaluation for at least one of the involved languages (i.e., the language of the left-hand or the right-hand side query) is undecidable.

Undecidability of containment for Datalog: decidability of query evaluation does not ensure decidability of query containment. A prime example is Datalog, or, equivalently, the OMQ language based on full tgds. Datalog containment is undecidable [55], and thus, OMQ containment is undecidable if the involved languages extend Datalog.

In view of the above observations, we focus on languages that (a) have a decidable query evaluation, and (b) do not extend Datalog. The main classes of tgds, which give rise to OMQ languages with the desirable properties, can be classified into three main families depending on the underlying syntactic restrictions: (i) guarded tgds [24], which contain inclusion dependencies and linear tgds, (ii) non-recursive sets of tgds [35], and (iii) sticky sets of tgds [27].

While the decidability of containment for the above OMQ languages can be established via translations into query languages with a decidable containment problem, such translations do not lead to optimal complexity upper bounds (details are given below). Therefore, the main goal of our paper is to develop specially tailored decision procedures for the containment problem under the OMQ languages in question, and ideally obtain precise complexity bounds. Our second goal is to exploit such techniques in the study of distribution over components and UCQ rewritability of OMQs.

Our contributions. The complexity of OMQ containment for the languages in question is given in Table 1. Using small fonts, we recall the complexity of OMQ evaluation in order to stress that containment is, in general, harder than evaluation. We divide our contributions as follows:

Linear, non-recursive and sticky sets of tgds. The OMQ languages based on linear, non-recursive, and sticky sets of tgds share a useful property: they are UCQ rewritable (implicit in [40]), that is, an OMQ can be rewritten into a UCQ. This property immediately yields decidability for their associated containment problems, since UCQ containment is decidable [54]. However, the obtained complexity bounds are not optimal, since the UCQ rewritings are unavoidably very large [40]. To obtain more precise bounds, we reduce containment to query evaluation, an idea that is often applied in query containment; see, e.g., [29, 31, 54].

Consider a UCQ rewritable OMQ language $\mathbb{O}$ . If $Q_{1}$ and $Q_{2}$ belong to $\mathbb{O}$ , both with data schema $\mathbf{S}$ , then we can establish a small witness property, which states that non-containment of $Q_{1}$ in $Q_{2}$ can be witnessed via a database over $\mathbf{S}$ whose size is bounded by an integer $k\geq 0$ , the maximal size of a disjunct in a UCQ rewriting of $Q_{1}$ . For linear tgds, such an integer $k$ is polynomial, but for non-recursive and sticky sets of tgds it is exponential (implicit in [40]). The above small witness property allows us to devise a simple non-deterministic algorithm, which makes use of query evaluation as a subroutine for checking non-containment of $Q_{1}$ in $Q_{2}$ : guess a database $D$ over $\mathbf{S}$ of size at most $k$ , and then check if there is a certain answer for $Q_{1}$ over $D$ that is not a certain answer for $Q_{2}$ over $D$ . This algorithm allows us to obtain optimal upper bounds for OMQs based on linear and sticky sets of tgds; however, the exact complexity of OMQs based on non-recursive sets of tgds remains open:

• For OMQs based on linear tgds, the problem is in PSpace, and in $\Pi_{2}^{P}$ if the arity is fixed. The PSpace-hardness is shown by reduction from query evaluation [47], while the $\Pi_{2}^{P}$ -hardness is inherited from [17].

• For OMQs based on sticky sets of tgds, the problem is in coNExpTime, and in $\Pi_{2}^{P}$ if the arity of the schema is fixed. The coNExpTime-hardness is shown by exploiting the standard tiling problem for the exponential grid, while the $\Pi_{2}^{P}$ -hardness is inherited from [17].

• Finally, for OMQs based on non-recursive sets of tgds, containment is in ExpSpace and hard for $\mathrm{P}^{\mathrm{NEXP}}$ , even for fixed arity. The lower bound is shown by exploiting a recently introduced tiling problem [34].

We conclude that in all these cases OMQ containment is harder than evaluation, with one exception: the OMQs based on linear tgds over schemas of unbounded arity.
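The guess-and-check algorithm based on the small witness property can be sketched deterministically by enumerating all candidate databases instead of guessing one. The code below is an illustrative sketch under several assumptions that are not part of the paper: the two OMQs are given as black-box evaluation functions (`eval_q1`, `eval_q2`, hypothetical names) that map a set of facts to a set of answer tuples, the queries mention no constants (so only the shape of the database matters), and the bound `k` is supplied by the caller.

```python
from itertools import combinations, product

def find_counterexample(eval_q1, eval_q2, schema, k):
    """Search for a database D with at most k facts such that
    Q1(D) is not a subset of Q2(D); return D, or None if none exists.

    schema maps each predicate name to its arity (assumed non-empty).
    """
    max_arity = max(schema.values())
    # k facts mention at most k * max_arity constants, so for
    # constant-free queries this domain covers all databases
    # of size at most k up to renaming of constants.
    domain = [f"c{i}" for i in range(k * max_arity)]
    all_facts = [(pred, args)
                 for pred, arity in sorted(schema.items())
                 for args in product(domain, repeat=arity)]
    for size in range(k + 1):
        for facts in combinations(all_facts, size):
            database = set(facts)
            if not eval_q1(database) <= eval_q2(database):
                return database
    return None
```

The enumeration is exponential in `k` and in the arity, which mirrors why the size of the witness (polynomial for linear tgds, exponential for non-recursive and sticky sets) drives the complexity of the procedure.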

Guarded tgds. The OMQ language based on guarded tgds is not UCQ rewritable, which forces us to develop different tools to study its containment problem. Let us remark that guarded OMQs can be rewritten as guarded Datalog queries (by exploiting the translations devised in [9, 43]), for which containment is decidable in 2ExpTime [20]. But, again, the known rewritings are very large [43], and hence the reduction of containment for guarded OMQs to containment for guarded Datalog does not yield optimal upper bounds.

To obtain optimal bounds for the problem in question, we exploit two-way alternating parity automata on trees (2WAPA) [32]. We first show that if $Q_{1}$ and $Q_{2}$ are guarded OMQs such that $Q_{1}$ is not contained in $Q_{2}$ , then this is witnessed over a class of “tree-like” databases that can be represented as the set of trees accepted by a 2WAPA $\mathfrak{A}$ . We then build a 2WAPA $\mathfrak{B}$ with exponentially many states that recognizes those trees accepted by $\mathfrak{A}$ that represent witnesses to non-containment of $Q_{1}$ in $Q_{2}$ . Hence, $Q_{1}$ is contained in $Q_{2}$ iff $\mathfrak{B}$ accepts no tree. Since the emptiness problem for 2WAPA is feasible in exponential time in the number of states [32], we obtain that containment for guarded OMQs is in 2ExpTime. A matching lower bound, even for fixed arity schemas, follows from [16].

Similar ideas based on 2WAPA have been recently used to show that containment for OMQs based on expressive description logics (DLs) is in 2ExpTime [16]. In the DL context, schemas consist only of unary and binary relations. Our automata construction, however, is different from the one in [16] for two reasons: (a) we need to deal with higher arity relations, and (b) even for unary and binary relations, our OMQ language can express properties that are not expressible by the DL-based OMQ languages studied in [16].

Combining languages. The above complexity results refer to the containment problem relative to a certain OMQ language $\mathbb{O}$ , i.e., both queries fall in $\mathbb{O}$ . However, it is natural to consider the version of the problem where the involved OMQs fall in different languages. Unsurprisingly, if the left-hand side query is expressed in a UCQ rewritable OMQ language (based on linear, non-recursive or sticky sets of tgds), we can use the algorithm that relies on the small witness property discussed above, which provides optimal upper bounds for almost all the considered cases (the only exception is the containment of sticky in non-recursive OMQs over schemas of unbounded arity). Things are more interesting if the ontology of the left-hand side query is expressed using guarded tgds, while the ontology of the right-hand side query is not guarded. By exploiting automata techniques, we show that containment of guarded in non-recursive OMQs is in 3ExpTime, while containment of guarded in sticky OMQs is in 2ExpTime. We establish matching lower bounds, even over schemas of fixed arity, by refining techniques from [31].

Applications. Our techniques and results on containment for guarded OMQs can be applied to other important static analysis tasks, in particular, distribution over components and UCQ rewritability.

The notion of distribution over components has been introduced in [3], in the context of declarative networking, and it states that the answer to an OMQ $Q$ can be computed by parallelizing it over the (maximally connected) components of the database. If this is the case, then $Q$ can always be evaluated in a distributed and coordination-free manner. The problem of deciding distribution over components for OMQs has been recently studied in [15]. However, the exact complexity of the problem for guarded OMQs has been left open. By exploiting our results on containment, we can show that it is 2ExpTime-complete.
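To make the notion concrete, the sketch below splits a database into its maximally connected components (two facts are connected when they share a constant) and checks, for one given database, whether the answers over the whole database coincide with the union of the answers over its components. This is only an illustration: the decision problem studied in the paper quantifies over all databases, and the evaluation function `evaluate` is a hypothetical black box supplied by the caller.

```python
def components(database):
    """Partition a set of facts (pred, args) into connected components."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, args in database:
        root = find(args[0])  # atoms have arity > 0
        for term in args[1:]:
            parent[find(term)] = root
    groups = {}
    for fact in database:
        groups.setdefault(find(fact[1][0]), set()).add(fact)
    return list(groups.values())

def distributes_on(evaluate, database):
    """Check Q(D) == union of Q(D_i) over the components D_i of this D."""
    union = set()
    for component in components(database):
        union |= evaluate(component)
    return evaluate(database) == union
```

A query that joins facts from two unrelated components, such as a Boolean query asking for an `R`-fact and a `P`-fact without shared variables, fails this check on a database where the two facts lie in different components.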

It is well-known that the OMQ language based on guarded tgds is not UCQ rewritable. In view of this fact, it is important to study when a given guarded OMQ $Q$ can be rewritten as a UCQ. This has been studied for OMQs based on central Horn DLs [16, 18]. Interestingly, our automata-based techniques for guarded OMQ containment can be adapted to decide in 2ExpTime whether an OMQ based on guarded tgds over unary and binary relations is UCQ rewritable; a matching lower bound is inherited from [16]. Our result generalizes the result that deciding UCQ rewritability for OMQs based on $\mathcal{ELHI}$ , one of the most expressive members of the $\mathcal{EL}$ -family of DLs, is 2ExpTime-complete [16].

Discussion on Applicability. As shown in Table 1, the containment problem for OMQs based on linear sets of tgds is PSpace-complete, and thus can be solved in single-exponential time. This is not a big practical drawback since the containment problem corresponds to a static analysis task. In fact, the runtime is single exponential only in the size of the UCQs and the maximum arity of the underlying schema, which are typically very small. For such tasks, a single-exponential time procedure is considered to be acceptable, and it is actually the norm in many cases including database and verification problems; see, e.g., [1, 50, 52].

For OMQs based on sticky, non-recursive and guarded sets of tgds, the containment problem becomes coNExpTime-complete, $\mathrm{P}^{\mathrm{NEXP}}$ -hard and 2ExpTime-complete, respectively. This means that we require double-exponential time to solve the problem, which is practically not acceptable. Nevertheless, for sticky sets of tgds, the runtime is double-exponential only in the maximum arity of the schema, while for guarded sets of tgds it is double-exponential only in the size of the UCQs and the maximum arity of the schema. This is good news since, as said above, the size of the UCQs and the arity are typically small, and usually UCQs in OMQs are much smaller than the ontologies.

For non-recursive sets of tgds, on the other hand, the runtime is double-exponential, not only in the maximum arity, but also in the number of predicates occurring in the ontology. It is unrealistic to assume that the number of predicates occurring in real-life ontologies is small. This fact, together with the fact that the precise complexity of OMQ containment for non-recursive sets of tgds is still open, suggests that a more careful complexity analysis is needed. This is left as an interesting open problem for future work.

Organization. Preliminaries are in Section 2. In Section 3 we introduce the OMQ containment problem. Containment for UCQ rewritable OMQs is studied in Section 4, and for guarded OMQs in Section 5. In Section 6 we consider the case where the involved queries fall in different languages. In Section 7 we discuss the applications of our results on guarded OMQ containment and we conclude in Section 8. Proofs and additional details can be found in the appendix.

2 Preliminaries

Databases and conjunctive queries.

Let $\mathbf{C}$ , $\mathbf{N}$ , and $\mathbf{V}$ be disjoint countably infinite sets of constants, (labeled) nulls and (regular) variables (used in queries and dependencies), respectively. A schema $\mathbf{S}$ is a finite set of relation symbols (or predicates) with associated arity. We write $R/n$ to denote that $R$ has arity $n$ . A term is either a constant, a null or a variable. An atom over $\mathbf{S}$ is an expression of the form $R(\bar{v})$ , where $R\in\mathbf{S}$ is of arity $n>0$ and $\bar{v}$ is an $n$ -tuple of terms. A fact is an atom whose arguments consist only of constants. An instance over $\mathbf{S}$ is a (possibly infinite) set of atoms over $\mathbf{S}$ that contain constants and nulls, while a database over $\mathbf{S}$ is a finite set of facts over $\mathbf{S}$ . We may call an instance and a database over $\mathbf{S}$ an $\mathbf{S}$ -instance and $\mathbf{S}$ -database, respectively. The active domain of an instance $I$ , denoted $\mathit{dom}(I)$ , is the set of all terms occurring in $I$ .

A conjunctive query (CQ) over $\mathbf{S}$ is a formula of the form:

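$$q(\bar{x})\ :=\ \exists\bar{y}\big(R_{1}(\bar{v}_{1})\wedge\dots\wedge R_{m}(\bar{v}_{m})\big),\qquad(1)$$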

where each $R_{i}(\bar{v}_{i})$ ( $1\leq i\leq m$ ) is an atom without nulls over $\mathbf{S}$ , each variable mentioned in the $\bar{v}_{i}$ ’s appears either in $\bar{x}$ or $\bar{y}$ , and $\bar{x}$ are the free variables of $q$ . If $\bar{x}$ is empty, then $q$ is a Boolean CQ. As usual, the evaluation of CQs is defined in terms of homomorphisms. Let $I$ be an instance and $q(\bar{x})$ a CQ of the form (1). A homomorphism from $q$ to $I$ is a mapping $h$ , which is the identity on $\mathbf{C}$ , from the variables that appear in $q$ to the set of constants and nulls $\mathbf{C}\cup\mathbf{N}$ such that $R_{i}(h(\bar{v}_{i}))\in I$ , for each $1\leq i\leq m$ . The evaluation of $q(\bar{x})$ over $I$ , denoted $q(I)$ , is the set of all tuples $h(\bar{x})$ of constants such that $h$ is a homomorphism from $q$ to $I$ . We denote by $\mathbb{CQ}$ the class of conjunctive queries.

A union of conjunctive queries (UCQ) over $\mathbf{S}$ is a formula of the form $q({\bar{x}}):=q_{1}({\bar{x}})\vee\cdots\vee q_{n}({\bar{x}})$ , where each $q_{i}({\bar{x}})$ is a CQ of the form (1). The evaluation of $q(\bar{x})$ over $I$ , denoted $q(I)$ , is the set of tuples $\bigcup_{1\leq i\leq n}q_{i}(I)$ . We denote by $\mathbb{UCQ}$ the class of union of conjunctive queries.
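The homomorphism-based semantics of CQs and UCQs translates directly into a small backtracking evaluator. The sketch below is illustrative only and relies on an encoding that is an assumption of the sketch, not of the paper: atoms are pairs `(predicate, args)`, variables are strings starting with `?`, nulls are strings starting with `_:`, every other string is a constant, and a CQ is a pair `(head_variables, atoms)`.

```python
def is_var(term):
    return term.startswith("?")

def is_null(term):
    return term.startswith("_:")

def homomorphisms(atoms, instance, partial):
    """Yield every mapping h (identity on constants) with h(atom) in instance."""
    if not atoms:
        yield dict(partial)
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in instance:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        extended = dict(partial)
        if all(extended.setdefault(a, b) == b if is_var(a) else a == b
               for a, b in zip(args, fact_args)):
            yield from homomorphisms(rest, instance, extended)

def eval_cq(head, atoms, instance):
    """q(I): all tuples h(head) that consist of constants only."""
    answers = set()
    for h in homomorphisms(list(atoms), instance, {}):
        answer = tuple(h[v] for v in head)
        if not any(is_null(t) for t in answer):
            answers.add(answer)
    return answers

def eval_ucq(disjuncts, instance):
    """Union of the answers of the individual CQs."""
    return set().union(*(eval_cq(head, atoms, instance)
                         for head, atoms in disjuncts))
```

Existentially quantified variables may be mapped to nulls, but a tuple containing a null is never returned as an answer, matching the requirement that answers are tuples of constants.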

Tgds and the chase procedure.

A tuple-generating dependency (tgd) is a first-order sentence of the form:

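$$\forall\bar{x}\forall\bar{y}\big(\phi(\bar{x},\bar{y})\rightarrow\exists\bar{z}\,\psi(\bar{x},\bar{z})\big),\qquad(2)$$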

where $\phi$ and $\psi$ are conjunctions of atoms without nulls. For brevity, we write this tgd as $\phi(\bar{x},\bar{y})\rightarrow\exists\bar{z}\,\psi(\bar{x},\bar{z})$ and use comma instead of $\wedge$ for conjoining atoms. Notice that $\phi$ can be empty, in which case the tgd is called fact tgd and is written as $\top\rightarrow\exists\bar{z}\,\psi(\bar{x},\bar{z})$ . We assume that each variable in $\bar{x}$ is mentioned in some atom of $\psi$ . We call $\phi$ and $\psi$ the body and head of the tgd, respectively. The tgd in (2) is logically equivalent to the expression $\forall\bar{x}(q_{\phi}(\bar{x})\rightarrow q_{\psi}(\bar{x}))$ , where $q_{\phi}(\bar{x})$ and $q_{\psi}(\bar{x})$ are the CQs $\exists\bar{y}\,\phi(\bar{x},\bar{y})$ and $\exists\bar{z}\,\psi(\bar{x},\bar{z})$ , respectively. Thus, an instance $I$ over $\mathbf{S}$ satisfies this tgd iff $q_{\phi}(I)\subseteq q_{\psi}(I)$ . We say that an instance $I$ satisfies a set $\Sigma$ of tgds, denoted $I\models\Sigma$ , if $I$ satisfies every tgd in $\Sigma$ . We denote by $\mathbb{TGD}$ the class of (finite) sets of tgds.

The chase is a useful algorithmic tool when reasoning with tgds [24, 35, 47, 49]. We start by defining a single chase step. Let $I$ be an instance over a schema $\mathbf{S}$ and $\tau=\phi(\bar{x},\bar{y})\rightarrow\exists\bar{z}\,\psi(\bar{x},\bar{z})$ a tgd over $\mathbf{S}$ . We say that $\tau$ is applicable w.r.t. $I$ if there exists a tuple $(\bar{a},\bar{b})$ of terms in $I$ such that $\phi(\bar{a},\bar{b})$ holds in $I$ . In this case, the result of applying $\tau$ over $I$ with $(\bar{a},\bar{b})$ is the instance $J$ that extends $I$ with every atom in $\psi(\bar{a},\bar{\bot})$ , where $\bar{\bot}$ is the tuple obtained by simultaneously replacing each variable $z\in\bar{z}$ with a fresh distinct null not occurring in $I$ . For such a single chase step we write $I\xrightarrow{\tau,(\bar{a},\bar{b})}J$ .

Let us assume now that $I$ is an instance and $\Sigma$ a finite set of tgds. A chase sequence for $I$ under $\Sigma$ is a sequence:

$$I_{0}\xrightarrow{\tau_{0},\bar{c}_{0}}I_{1}\xrightarrow{\tau_{1},\bar{c}_{1}}I_{2}\dots$$

of chase steps such that: (1) $I_{0}=I$ ; (2) for each $i\geq 0$ , $\tau_{i}$ is a tgd in $\Sigma$ ; and (3) $\bigcup_{i\geq 0}I_{i}\models\Sigma$ . We call $\bigcup_{i\geq 0}I_{i}$ the result of this chase sequence, which always exists. Although the result of a chase sequence is not necessarily unique (up to isomorphism), each such result is equally useful for our purposes, since it can be homomorphically embedded into every other result. Thus, from now on, we denote by $\mathit{chase}(I,\Sigma)$ the result of an arbitrary chase sequence for $I$ under $\Sigma$ .
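A chase sequence can be simulated round by round. The following sketch is one possible reading of the definition, under assumptions made for illustration: each trigger (a tgd together with a body homomorphism) is applied exactly once, fresh nulls are named `_:n0`, `_:n1`, and so on, variables are strings starting with `?`, and the number of rounds is capped by the hypothetical parameter `max_rounds` because the chase need not terminate in general.

```python
from itertools import count

def is_var(term):
    return term.startswith("?")

def homomorphisms(atoms, instance, partial):
    """Yield every assignment mapping all atoms into the instance."""
    if not atoms:
        yield dict(partial)
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in instance:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        extended = dict(partial)
        if all(extended.setdefault(a, b) == b if is_var(a) else a == b
               for a, b in zip(args, fact_args)):
            yield from homomorphisms(rest, instance, extended)

def chase(instance, tgds, max_rounds=10):
    """Apply tgds (body_atoms, head_atoms) until no new trigger exists.

    Returns (result, finished); finished is False if max_rounds was hit.
    Head variables that do not occur in the body are existential.
    """
    result = set(instance)
    fresh = count()
    applied = set()
    for _ in range(max_rounds):
        new_atoms, fired = set(), False
        for index, (body, head) in enumerate(tgds):
            for h in homomorphisms(list(body), result, {}):
                trigger = (index, tuple(sorted(h.items())))
                if trigger in applied:
                    continue
                applied.add(trigger)
                fired = True
                for _, args in head:
                    for term in args:
                        if is_var(term) and term not in h:
                            h[term] = f"_:n{next(fresh)}"  # fresh null
                new_atoms |= {(p, tuple(h.get(t, t) for t in args))
                              for p, args in head}
        if not fired:
            return result, True
        result |= new_atoms
    return result, False
```

On a non-recursive set such as `Emp(x) → ∃y WorksIn(x,y)` and `WorksIn(x,y) → Dept(y)`, the loop stops after a few rounds; on a recursive set it may run until the cap, in which case only a finite prefix of the (possibly infinite) chase result is returned.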

Ontology-mediated queries.

An ontology-mediated query (OMQ) is a triple $(\mathbf{S},\Sigma,q)$ , where $\mathbf{S}$ is a schema, $\Sigma$ is a set of tgds (the ontology), and $q$ is a (U)CQ over $\mathbf{S}\cup\mathit{sch}(\Sigma)$ (and possibly other predicates), with $\mathit{sch}(\Sigma)$ the set of predicates occurring in $\Sigma$ . (OMQs can also be defined for arbitrary first-order theories, not only tgds, and first-order queries, not only UCQs [19].) We call $\mathbf{S}$ the data schema. Notice that the set of tgds can introduce predicates not in $\mathbf{S}$ , which allows us to enrich the schema of the UCQ $q$ . Moreover, the tgds can modify the content of a predicate $R\in\mathbf{S}$ , or, in other words, $R$ can appear in the head of a tgd of $\Sigma$ . We have explicitly included $\mathbf{S}$ in the specification of the OMQ to emphasize that it will be evaluated over $\mathbf{S}$ -databases, even though $\Sigma$ and $q$ might use additional relational symbols.

The semantics of an OMQ is given in terms of certain answers. The certain answers to a UCQ $q({\bar{x}})$ w.r.t. a database $D$ and a set $\Sigma$ of tgds is the set of tuples:

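$$\mathit{cert}(q,D,\Sigma)\ =\ \bigcap_{I\supseteq D,\ I\models\Sigma}\{\bar{c}\in\mathit{dom}(I)^{|\bar{x}|}\mid\bar{c}\in q(I)\}.$$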

Consider an OMQ $Q=(\mathbf{S},\Sigma,q)$ . The evaluation of $Q$ over an $\mathbf{S}$ -database $D$ , denoted $Q(D)$ , is defined as $\mathit{cert}(q,D,\Sigma)$ . It is well-known that $\mathit{cert}(q,D,\Sigma)=q(\mathit{chase}(D,\Sigma))$ ; see, e.g., [24]. Thus, $Q(D)=q(\mathit{chase}(D,\Sigma))$ .
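As a small worked illustration of the identity $Q(D)=q(\mathit{chase}(D,\Sigma))$ , consider the following example, which is constructed here and does not appear in the paper; the predicates $P$ , $R$ , $S$ and the null $\bot_{1}$ are chosen only for illustration.

```latex
\begin{align*}
D &= \{P(a)\}, \qquad
\Sigma = \{\, P(x) \rightarrow \exists y\, R(x,y),\;\; R(x,y) \rightarrow S(y) \,\},\\
q(x) &:= \exists y\, \big(R(x,y) \wedge S(y)\big),\\
\mathit{chase}(D,\Sigma) &= \{P(a),\ R(a,\bot_{1}),\ S(\bot_{1})\},\\
Q(D) &= q(\mathit{chase}(D,\Sigma)) = \{(a)\},
\qquad \text{whereas } q(D) = \emptyset .
\end{align*}
```

The existential variable $y$ is mapped to the null $\bot_{1}$ , while the answer tuple $(a)$ consists of constants only. Evaluating $q$ directly over $D$ returns nothing, which shows how the ontology contributes answers that the data alone does not provide.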

Ontology-mediated query languages.

We write $(\mathbb{C},\mathbb{Q})$ for the OMQ language that consists of all OMQs of the form $(\mathbf{S},\Sigma,q)$ , where $\Sigma$ falls in the class $\mathbb{C}$ of tgds, i.e., $\mathbb{C}\subseteq\mathbb{TGD}$ (concrete classes of tgds are discussed below), and the query $q$ falls in $\mathbb{Q}\in\{\mathbb{CQ},\mathbb{UCQ}\}$ . A problem that is quite important for our work is OMQ evaluation, defined as follows:

PROBLEM : ${\sf Eval}(\mathbb{C},\mathbb{Q})$

INPUT : An OMQ $Q=(\mathbf{S},\Sigma,q({\bar{x}}))\in(\mathbb{C},\mathbb{Q})$ ,

an $\mathbf{S}$ -database $D$ , and ${\bar{c}}\in\mathit{dom}(D)^{|{\bar{x}}|}$ .

QUESTION : Does ${\bar{c}}\in Q(D)$ ?

It is well-known that ${\sf Eval}(\mathbb{TGD},\mathbb{CQ})$ is undecidable; implicit in [12]. This has led to a flurry of activity for identifying syntactic restrictions on sets of tgds that make the latter problem decidable. Such a restriction defines a subclass $\mathbb{C}$ of tgds. The known decidable classes of tgds are classified into three main decidability paradigms, which, in turn, give rise to decidable OMQ languages:

Guardedness: A tgd is guarded if its body contains an atom, called guard, that contains all the body-variables. Although the chase under guarded tgds does not necessarily terminate, the problem of deciding whether a tuple of constants is a certain answer to a UCQ w.r.t. a database and a set of guarded tgds is decidable. This follows from the fact that the result of the chase has bounded treewidth (see, e.g., [24]). Let $\mathbb{G}$ be the class of (finite) sets of guarded tgds. Then:

Proposition 1

[24] ${\sf Eval}(\mathbb{G},\mathbb{CQ})$ and ${\sf Eval}(\mathbb{G},\mathbb{UCQ})$ are 2ExpTime-complete, and ExpTime-complete for fixed arity.

An important subclass of guarded tgds is the class of linear tgds whose body consists of a single atom. We write $\mathbb{L}$ for the class of (finite) sets of linear tgds.

Proposition 2

[25, 47] ${\sf Eval}(\mathbb{L},\mathbb{CQ})$ and ${\sf Eval}(\mathbb{L},\mathbb{UCQ})$ are PSpace-complete, and NP-complete for fixed arity.
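Both guardedness and linearity are purely syntactic conditions on a single tgd, so they are easy to test. The following minimal Python sketch assumes a hypothetical encoding in which a tgd is a pair `(body, head)` of atom lists, an atom is a `(predicate, arguments)` pair, variables start with `?`, and bodies are non-empty.

```python
def is_var(term):
    return term.startswith("?")

def is_guarded(tgd):
    """A tgd is guarded if some body atom contains all the body variables."""
    body, _head = tgd
    body_vars = {t for _, args in body for t in args if is_var(t)}
    return any(body_vars <= set(args) for _, args in body)

def is_linear(tgd):
    """A tgd is linear if its body consists of a single atom."""
    body, _head = tgd
    return len(body) == 1

# R(x,y), S(y) -> exists z T(y,z): guarded by R(x,y), but not linear.
t1 = ([("R", ("?x", "?y")), ("S", ("?y",))], [("T", ("?y", "?z"))])
# R(x,y), S(y,z) -> T(x,z): no body atom contains x, y and z.
t2 = ([("R", ("?x", "?y")), ("S", ("?y", "?z"))], [("T", ("?x", "?z"))])
# P(x) -> exists y R(x,y): linear, hence guarded.
t3 = ([("P", ("?x",))], [("R", ("?x", "?y"))])

for t in (t1, t2, t3):
    print(is_guarded(t), is_linear(t))
```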

Non-recursiveness: A set $\Sigma$ of tgds is non-recursive (a.k.a. acyclic [35, 48]), if its predicate graph, the directed graph that encodes how the predicates of $\mathit{sch}(\Sigma)$ depend on each other, is acyclic. Non-recursiveness ensures the termination of the chase, and thus decidability of OMQ evaluation. Let $\mathbb{NR}$ be the class of non-recursive (finite) sets of tgds. Then:

Proposition 3

[48] ${\sf Eval}(\mathbb{NR},\mathbb{CQ})$ and ${\sf Eval}(\mathbb{NR},\mathbb{UCQ})$ are NExpTime-complete, even for fixed arity.
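Non-recursiveness can be tested by building the predicate graph and searching for a cycle. The Python sketch below is one possible reading, under the assumption that the graph has an edge from $P$ to $R$ whenever $P$ occurs in the body and $R$ in the head of the same tgd; the encoding of tgds as `(body, head)` pairs of `(predicate, arguments)` atoms is hypothetical.

```python
def predicate_graph(tgds):
    """Edges P -> R for every tgd with P in its body and R in its head."""
    edges = {}
    for body, head in tgds:
        for body_pred, _ in body:
            for head_pred, _ in head:
                edges.setdefault(body_pred, set()).add(head_pred)
    return edges

def is_non_recursive(tgds):
    """True iff the predicate graph is acyclic (depth-first search)."""
    edges = predicate_graph(tgds)
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(pred):
        if state.get(pred) == 1:
            return False          # back edge: a cycle was found
        if state.get(pred) == 2:
            return True
        state[pred] = 1
        ok = all(visit(succ) for succ in edges.get(pred, ()))
        state[pred] = 2
        return ok

    return all(visit(pred) for pred in list(edges))

# P(x) -> exists y R(x,y);  R(x,y) -> T(y): acyclic.
acyclic = [([("P", ("?x",))], [("R", ("?x", "?y"))]),
           ([("R", ("?x", "?y"))], [("T", ("?y",))])]
# Adding T(x) -> P(x) closes the cycle P -> R -> T -> P.
cyclic = acyclic + [([("T", ("?x",))], [("P", ("?x",))])]

print(is_non_recursive(acyclic), is_non_recursive(cyclic))
```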

Stickiness: This condition ensures neither termination nor bounded treewidth of the chase. Instead, the decidability of OMQ evaluation is obtained by exploiting query rewriting techniques (more details on query rewriting are given in Section 4). The goal of stickiness is to capture joins among variables that are not expressible via guarded tgds, but without forcing the chase to terminate. The key property underlying this condition can be described as follows: during the chase, terms that are associated (via a homomorphism) with variables that appear more than once in the body of a tgd (i.e., join variables) are always propagated (or “stick”) to the inferred atoms. This is illustrated in Figure 1(a); the left set of tgds is sticky, while the right set is not. The formal definition is based on an inductive marking procedure that marks the variables that may violate the semantic property of the chase described above [27]. Roughly, during the base step of this procedure, a variable that appears in the body of a tgd $\tau$ but not in every head-atom of $\tau$ is marked. Then, the marking is inductively propagated from head to body as shown in Figure 1(b). Finally, a finite set of tgds $\Sigma$ is sticky if no tgd in $\Sigma$ contains two occurrences of a marked variable. Let $\mathbb{S}$ be the class of sticky (finite) sets of tgds. Then:

Proposition 4

[27] ${\sf Eval}(\mathbb{S},\mathbb{CQ})$ and ${\sf Eval}(\mathbb{S},\mathbb{UCQ})$ are ExpTime-complete, and NP-complete for fixed arity.
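The marking procedure can be sketched in Python as follows. This is only one reading of the rough description above, not the formal definition of [27]: we assume that propagation works position-wise, i.e., if a marked variable occurs in some body at position $R[k]$, then every body variable that occurs at position $R[k]$ in the head of some tgd gets marked in that tgd. The encoding of tgds and the two example sets are assumptions chosen for illustration.

```python
def is_var(term):
    return term.startswith("?")

def is_sticky(tgds):
    marked = set()  # pairs (index of tgd, variable marked in its body)
    # Base step: mark body variables missing from some head atom.
    for i, (body, head) in enumerate(tgds):
        for _, args in body:
            for v in args:
                if is_var(v) and any(v not in h_args for _, h_args in head):
                    marked.add((i, v))
    # Propagation step: from marked body positions to heads, until a fixpoint.
    changed = True
    while changed:
        changed = False
        positions = {(p, k)
                     for i, (body, _) in enumerate(tgds)
                     for p, args in body
                     for k, v in enumerate(args) if (i, v) in marked}
        for j, (body, head) in enumerate(tgds):
            body_vars = {v for _, args in body for v in args if is_var(v)}
            for p, args in head:
                for k, v in enumerate(args):
                    if (p, k) in positions and v in body_vars and (j, v) not in marked:
                        marked.add((j, v))
                        changed = True
    # Sticky iff no marked variable occurs twice in the body of a tgd.
    for i, (body, _) in enumerate(tgds):
        occurrences = [v for _, args in body for v in args if (i, v) in marked]
        if len(occurrences) != len(set(occurrences)):
            return False
    return True

join = ([("R", ("?x", "?y")), ("P", ("?y", "?z"))], [("T", ("?x", "?y", "?w"))])
# T(x,y,z) -> exists w S(y,w): the join variable y of `join` is never marked.
sticky_set = [([("T", ("?x", "?y", "?z"))], [("S", ("?y", "?w"))]), join]
# T(x,y,z) -> exists w S(x,w): y is marked here and propagated back into `join`.
non_sticky_set = [([("T", ("?x", "?y", "?z"))], [("S", ("?x", "?w"))]), join]

print(is_sticky(sticky_set), is_sticky(non_sticky_set))
```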

3 OMQ Containment: The Basics

The goal of this work is to study in depth the problem of checking whether an OMQ $Q_{1}$ is contained in an OMQ $Q_{2}$ , both over the same data schema $\mathbf{S}$ , or, equivalently, whether $Q_{1}(D)\subseteq Q_{2}(D)$ over every (finite) $\mathbf{S}$ -database $D$ . In this case we write $Q_{1}\subseteq Q_{2}$ ; we write $Q_{1}\equiv Q_{2}$ if $Q_{1}\subseteq Q_{2}$ and $Q_{2}\subseteq Q_{1}$ . The OMQ containment problem in question is defined as follows; $\mathbb{O}_{1}$ and $\mathbb{O}_{2}$ are OMQ languages $(\mathbb{C},\mathbb{Q})$ , where $\mathbb{C}$ is a class of tgds (e.g., linear, non-recursive, sticky, etc.), and $\mathbb{Q}\in\{\mathbb{CQ},\mathbb{UCQ}\}$ :

PROBLEM : ${\sf Cont}(\mathbb{O}_{1},\mathbb{O}_{2})$

INPUT : Two OMQs $Q_{1}\in\mathbb{O}_{1}$ and $Q_{2}\in\mathbb{O}_{2}$.

QUESTION : Does $Q_{1}\subseteq Q_{2}$ ?

Whenever $\mathbb{O}_{1}=\mathbb{O}_{2}=\mathbb{O}$ , we refer to the containment problem by simply writing ${\sf Cont}(\mathbb{O})$ .

In what follows, we establish some simple but fundamental results, which help to better understand the nature of our problem. We first investigate the relationship between evaluation and containment, which in turn allows us to obtain an initial boundary for the decidability of our problem, i.e., we can obtain a positive result only if the evaluation problem for the involved OMQ languages is decidable (e.g., those introduced in the previous section). We then focus on the OMQ languages introduced in Section 2 and observe that, once we fix the class of tgds, it does not make a difference whether we consider CQs or UCQs. In other words, we show that an OMQ in $(\mathbb{C},\mathbb{UCQ})$ , where $\mathbb{C}\in\{\mathbb{G},\mathbb{L},\mathbb{NR},\mathbb{S}\}$ , can be rewritten as an OMQ in $(\mathbb{C},\mathbb{CQ})$ . This fact simplifies our later complexity analysis since for establishing upper (resp., lower) bounds it suffices to focus on CQs (resp., UCQs).

3.1 Evaluation vs. Containment

As one might expect, OMQ evaluation and OMQ containment are strongly connected. In fact, as we explain below, the former can be easily reduced to the latter. But let us first introduce some auxiliary notation. Consider a database $D$ and a tuple ${\bar{c}}=(c_{1},\ldots,c_{n})\in\mathit{dom}(D)^{n}$ , where $n\geq 0$ . We denote by $q_{D,{\bar{c}}}({\bar{x}})$ , where ${\bar{x}}=(x_{c_{1}},\ldots,x_{c_{n}})$ , the CQ obtained from the conjunction of atoms occurring in $D$ after replacing each constant $c$ with the variable $x_{c}$ . Consider now an OMQ $Q=(\mathbf{S},\Sigma,q({\bar{x}}))\in(\mathbb{C},\mathbb{CQ})$ , where $\mathbb{C}$ is some class of tgds, an $\mathbf{S}$ -database $D$ , and a tuple ${\bar{c}}\in\mathit{dom}(D)^{|{\bar{x}}|}$ . It is not difficult to show that

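$${\bar{c}}\in Q(D)\iff\underbrace{(\mathbf{S},\varnothing,q_{D,{\bar{c}}})}_{Q_{1}}\ \subseteq\ \underbrace{(\mathbf{S},\Sigma,q)}_{Q_{2}}.$$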

Let $\mathbb{O}_{\varnothing}$ be the OMQ language that consists of all OMQs of the form $(\mathbf{S},\varnothing,q)$ , i.e., the set of tgds is empty, where $q$ is a CQ. It is clear that $Q_{1}\in\mathbb{O}_{\varnothing}$ and $Q_{2}\in(\mathbb{C},\mathbb{CQ})$ . Therefore, for every OMQ language $\mathbb{O}=(\mathbb{C},\mathbb{CQ})$ , where $\mathbb{C}$ is a class of tgds, we immediately get that:

Proposition 5

${\sf Eval}(\mathbb{O})$ can be reduced in polynomial time to ${\sf Cont}(\mathbb{O}_{\varnothing},\mathbb{O})$.
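The query $q_{D,{\bar{c}}}$ used in this reduction is straightforward to build. A minimal Python sketch, assuming a hypothetical encoding of facts as `(predicate, arguments)` pairs and of the variable $x_{c}$ as the string `?x_c`:

```python
def canonical_query(db, c_bar):
    """Return (answer variables, atoms) of q_{D,c}: each constant c becomes ?x_c."""
    def var(constant):
        return f"?x_{constant}"
    atoms = sorted((pred, tuple(var(c) for c in args)) for pred, args in db)
    return tuple(var(c) for c in c_bar), atoms

db = {("E", ("a", "b")), ("E", ("b", "a"))}
answer_vars, atoms = canonical_query(db, ("a",))
print(answer_vars, atoms)
```

The construction is linear in the size of $D$, which is why the reduction runs in polynomial time.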

We now show that the problem of evaluation is reducible to the complement of containment. For technical reasons that will become clear shortly, we focus our attention on classes $\mathbb{C}$ of tgds that are closed under fact tgd extension, i.e., for every set $\Sigma\in\mathbb{C}$, a set obtained from $\Sigma$ by adding a (finite) set of fact tgds is still in $\mathbb{C}$. This is not an unnatural assumption since every reasonable class of tgds, such as the ones introduced above, enjoys this property. Consider now an OMQ $Q=(\mathbf{S},\Sigma,q({\bar{x}}))\in(\mathbb{C},\mathbb{CQ})$, where $\mathbb{C}$ is some class of tgds, an $\mathbf{S}$-database $D$, and a tuple ${\bar{c}}\in\mathit{dom}(D)^{|{\bar{x}}|}$. It is easy to see that

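$${\bar{c}}\in Q(D)\iff\underbrace{(\mathbf{S},\Sigma_{D}^{\star},q^{\star}_{\bar{c}})}_{Q_{1}}\ \not\subseteq\ \underbrace{(\mathbf{S},\varnothing,\exists x\,P(x))}_{Q_{2}},$$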

where $\Sigma_{D}^{\star}$ is obtained from $\Sigma$ by renaming each predicate $R$ in $\Sigma$ into $R^{\star}\not\in\mathbf{S}$ and adding the set of fact tgds

$$\{\top\rightarrow R^{\star}(c_{1},\ldots,c_{k})\ \mid\ R(c_{1},\ldots,c_{k})\in D\},$$

$q^{\star}_{\bar{c}}$ is obtained from $q(\bar{c})$ by renaming each predicate $R$ into $R^{\star}\not\in\mathbf{S}$, and the predicate $P$ does not occur in $\mathbf{S}$. Indeed, the above equivalence holds since $P\not\in\mathbf{S}$ implies that $Q_{2}(D)=\varnothing$, for every $\mathbf{S}$-database $D$. Since $\mathbb{C}$ is closed under fact tgd extension, $Q_{1}\in(\mathbb{C},\mathbb{CQ})$, while $Q_{2}\in\mathbb{O}_{\varnothing}$. We write ${\sf coCont}(\mathbb{O}_{1},\mathbb{O}_{2})$ for the complement of ${\sf Cont}(\mathbb{O}_{1},\mathbb{O}_{2})$. Hence, for every OMQ language $\mathbb{O}=(\mathbb{C},\mathbb{CQ})$, where $\mathbb{C}$ is a class of tgds (closed under fact tgd extension), it holds that:

Proposition 6

${\sf Eval}(\mathbb{O})$ can be reduced in polynomial time to ${\sf coCont}(\mathbb{O},\mathbb{O}_{\varnothing})$.

By definition, $\mathbb{O}_{\varnothing}$ is contained in every OMQ language $(\mathbb{C},\mathbb{CQ})$ , where $\mathbb{C}$ is a class of tgds. Therefore, as a corollary of Propositions 5 and 6, we obtain an initial boundary for the decidability of OMQ containment: we can obtain a positive result only if the evaluation problem for the involved OMQ languages is decidable. More formally:

Corollary 7

${\sf Cont}(\mathbb{O}_{1},\mathbb{O}_{2})$ is undecidable if ${\sf Eval}(\mathbb{O}_{1})$ is undecidable or ${\sf Eval}(\mathbb{O}_{2})$ is undecidable.

Can we prove the converse of Corollary 7: ${\sf Cont}(\mathbb{O}_{1},\mathbb{O}_{2})$ is decidable if both ${\sf Eval}(\mathbb{O}_{1})$ and ${\sf Eval}(\mathbb{O}_{2})$ are decidable? The answer to this question is negative. This is due to the fact that containment of Datalog queries is undecidable **[55]**. Since Datalog queries can be directly encoded in the OMQ language based on the class $\mathbb{F}$ of full tgds, i.e., those without existentially quantified variables, we obtain the following:

Proposition 8

[55] ${\sf Cont}((\mathbb{F},\mathbb{CQ}))$ is undecidable.

This result, combined with the fact that ${\sf Eval}(\mathbb{F})$ is decidable (since the chase under full tgds always terminates), implies that the converse of Corollary 7 does not hold. Proposition 8 also rules out the OMQ languages that are based on classes of tgds that extend $\mathbb{F}$; e.g., the weak versions of the ones introduced in Section 2, called weakly guarded **[24]**, weakly acyclic **[35]**, and weakly sticky **[27]**, that guarantee the decidability of OMQ evaluation. (The idea of those classes is the same: relax the conditions in the definition of the class, so that only those positions that receive null values during the chase are taken into account.) The question that comes up concerns the decidability and complexity of containment for the OMQ languages that are based on the non-weak versions of the above classes, i.e., guarded, non-recursive, and sticky. This will be the subject of the next two sections.

3.2 From UCQs to CQs

Before we proceed with the complexity analysis of containment for the OMQ languages in question, let us state the following useful result:

Proposition 9

Given an OMQ $Q\in(\mathbb{C},\mathbb{UCQ})$ , where $\mathbb{C}\in\{\mathbb{G},\mathbb{L},\mathbb{NR},\mathbb{S}\}$ , we can construct in polynomial time an OMQ $Q^{\prime}\in(\mathbb{C},\mathbb{CQ})$ such that $Q\equiv Q^{\prime}$ .

The proof of Proposition 9 relies on the idea of encoding boolean operations (in our case the ‘or’ operator) using a set of atoms; this idea has been used in several other works (see, e.g., **[14, 21, 41]**). Proposition 9 allows us to focus on OMQs that are based on CQs; in fact, ${\sf Cont}((\mathbb{C}_{1},\mathbb{CQ}),(\mathbb{C}_{2},\mathbb{CQ}))$ is $\mathcal{C}$ -complete, where $\mathbb{C}_{1},\mathbb{C}_{2}\in\{\mathbb{G},\mathbb{L},\mathbb{NR},\mathbb{S}\}$ and $\mathcal{C}$ is a complexity class that is closed under polynomial time reductions, iff ${\sf Cont}((\mathbb{C}_{1},\mathbb{UCQ}),(\mathbb{C}_{2},\mathbb{UCQ}))$ is $\mathcal{C}$ -complete.

3.3 Plan of Attack

We are now ready to proceed with the complexity analysis of containment for the OMQ languages in question. Our plan of attack can be summarized as follows:

• We consider, in Section 4, ${\sf Cont}((\mathbb{C},\mathbb{CQ}))$, for $\mathbb{C}\in\{\mathbb{L},\mathbb{NR},\mathbb{S}\}$. These languages enjoy a crucial property, called UCQ rewritability, which is very useful for our purposes. This property allows us to show the following result: if the containment does not hold, then this is witnessed via a “small” database, which in turn allows us to devise simple guess-and-check algorithms.

• We then proceed, in Section 5, with ${\sf Cont}((\mathbb{G},\mathbb{CQ}))$. This OMQ language does not enjoy UCQ rewritability, and the task of establishing a small witness property as above turned out to be challenging. However, we show the following: if the containment does not hold, then this is witnessed via a “tree-shaped” database, which allows us to devise a decision procedure based on two-way alternating parity automata on finite trees.

• In Section 6, we study the case where the OMQ containment problem involves two different languages. If the left-hand side language is UCQ rewritable, then we can devise a guess-and-check algorithm by exploiting the above small witness property. The challenging case is when the left-hand side language is $(\mathbb{G},\mathbb{CQ})$, where again we employ techniques based on tree automata.

4 UCQ Rewritable Languages

We now focus on OMQ languages that enjoy the crucial property of UCQ rewritability.

Definition 1.

(UCQ Rewritability) An OMQ language $(\mathbb{C},\mathbb{CQ})$, where $\mathbb{C}\subseteq\mathbb{TGD}$, is UCQ rewritable if, for each OMQ $Q=(\mathbf{S},\Sigma,q({\bar{x}}))\in(\mathbb{C},\mathbb{CQ})$, we can construct a UCQ $q^{\prime}({\bar{x}})$ such that $Q(D)=q^{\prime}(D)$ for every $\mathbf{S}$-database $D$.

We proceed to establish our desired small witness property, based on UCQ rewritability. By the definition of UCQ rewritability, for each language $\mathbb{O}$ that is UCQ rewritable, there exists a computable function $f_{\mathbb{O}}$ from $\mathbb{O}$ to the natural numbers such that the following holds: for every OMQ $Q=(\mathbf{S},\Sigma,q({\bar{x}}))\in\mathbb{O}$ , and UCQ rewriting $q_{1}({\bar{x}})\vee\cdots\vee q_{n}({\bar{x}})$ of $Q$ , it is the case that $\max_{1\leq i\leq n}\{|q_{i}|\}\leq f_{\mathbb{O}}(Q)$ , where $|q_{i}|$ denotes the number of atoms occurring in $q_{i}$ . Then:

Proposition 10

Consider a UCQ rewritable language $\mathbb{O}$ , and two OMQs $Q\in\mathbb{O}$ and $Q^{\prime}\in(\mathbb{TGD},\mathbb{CQ})$ , both with data schema $\mathbf{S}$ . If $Q\not\subseteq Q^{\prime}$ , then there exists an $\mathbf{S}$ -database $D$ , where $|D|\leq f_{\mathbb{O}}(Q)$ , such that $Q(D)\not\subseteq Q^{\prime}(D)$ .

In Proposition 10 we assume that the left-hand side query falls in a UCQ rewritable language, but we do not impose any restriction on the language of the right-hand side query. Thus, we immediately get a decision procedure for ${\sf Cont}(\mathbb{O}_{1},\mathbb{O}_{2})$ if $\mathbb{O}_{1}$ is UCQ rewritable and ${\sf Eval}(\mathbb{O}_{2})$ is decidable. Given $Q_{1}=(\mathbf{S},\Sigma_{1},q_{1}({\bar{x}}))\in\mathbb{O}_{1}$ and $Q_{2}=(\mathbf{S},\Sigma_{2},q_{2}({\bar{x}}))\in\mathbb{O}_{2}$:

1. Guess an $\mathbf{S}$-database $D$ such that $|D|\leq f_{\mathbb{O}_{1}}(Q_{1})$, and a tuple ${\bar{c}}\in\mathit{dom}(D)^{|{\bar{x}}|}$; and

2. Verify that ${\bar{c}}\in Q_{1}(D)$ and ${\bar{c}}\not\in Q_{2}(D)$.

We immediately get that:

Theorem 11

${\sf Cont}(\mathbb{O}_{1},\mathbb{O}_{2})$ is decidable if $\mathbb{O}_{1}$ is UCQ rewritable and ${\sf Eval}(\mathbb{O}_{2})$ is decidable.

This general result shows that ${\sf Cont}((\mathbb{C},\mathbb{CQ}))$ is decidable for every $\mathbb{C}\in\{\mathbb{L},\mathbb{NR},\mathbb{S}\}$ , but it says nothing about its complexity. This will be the subject of the rest of the section.
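To make the small witness idea concrete, here is a deterministic Python sketch of a variant of the procedure. It is not the guess-and-check algorithm itself: instead of guessing $D$, it assumes that a UCQ rewriting of $Q_{1}$ is already given, that all queries are constant-free, and it tries only the canonical ("frozen") database of each disjunct as a candidate witness. The right-hand side OMQ is abstracted as a callback `decide_q2(db, tup)`, standing in for a decision procedure for ${\sf Eval}(\mathbb{O}_{2})$; in the example, that callback is a plain UCQ, i.e., an OMQ with an empty ontology. All names and the encoding are hypothetical.

```python
def is_var(term):
    return term.startswith("?")

def homomorphisms(atoms, facts, partial):
    """Yield all mappings extending `partial` that send every atom into `facts`."""
    if not atoms:
        yield dict(partial)
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        h = dict(partial)
        if all(h.setdefault(a, f) == f if is_var(a) else a == f
               for a, f in zip(args, fact_args)):
            yield from homomorphisms(rest, facts, h)

def freeze(cq):
    """Canonical database of a CQ: each variable ?v becomes the constant c_v."""
    answer_vars, atoms = cq
    def const(term):
        return "c_" + term[1:] if is_var(term) else term
    db = {(p, tuple(const(t) for t in args)) for p, args in atoms}
    return db, tuple(const(v) for v in answer_vars)

def ucq_holds(ucq, db, tup):
    """Is `tup` an answer of the UCQ over `db`?"""
    return any(next(homomorphisms(atoms, db, dict(zip(xs, tup))), None) is not None
               for xs, atoms in ucq)

def contained(rewriting_q1, decide_q2):
    """Return (True, None) or (False, witness database)."""
    for cq in rewriting_q1:
        db, tup = freeze(cq)      # candidate witness, of size |cq|
        if not decide_q2(db, tup):
            return False, db
    return True, None

# UCQ rewriting P(x) v T(x) on the left; P(x) alone on the right.
rewriting = [(("?x",), [("P", ("?x",))]), (("?x",), [("T", ("?x",))])]
only_p = [(("?x",), [("P", ("?x",))])]
print(contained(rewriting, lambda db, tup: ucq_holds(only_p, db, tup)))
print(contained(rewriting, lambda db, tup: ucq_holds(rewriting, db, tup)))
```

Each candidate witness has as many atoms as the corresponding disjunct, which is exactly the quantity bounded by $f_{\mathbb{O}}$.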

4.1 Linearity

The problem of computing UCQ rewritings for OMQs in $(\mathbb{L},\mathbb{CQ})$ has been studied in **[40]**, where a resolution-based procedure, called $\mathsf{XRewrite}$ , has been proposed. This rewriting algorithm accepts a query $Q=(\mathbf{S},\Sigma,q({\bar{x}}))\in(\mathbb{L},\mathbb{CQ})$ and constructs a UCQ rewriting $q^{\prime}({\bar{x}})$ over $\mathbf{S}$ by starting from $q$ and exhaustively applying rewriting steps based on resolution. Let us illustrate this via a simple example:

Example 1.

Assume that $\mathbf{S}=\{P,T\}$ . Consider the set $\Sigma$ consisting of the linear tgds

$$P(x)\rightarrow\exists y\,R(x,y),\qquad R(x,y)\rightarrow P(y),\qquad T(x)\rightarrow P(x),$$

and the CQ $q({\bar{x}})\coloneqq\exists y(R(x,y)\wedge P(y))$. $\mathsf{XRewrite}$ will first resolve the atom $P(y)$ in $q$ using the second tgd, and produce the CQ $\exists y\exists z(R(x,y)\wedge R(z,y))$, which is equivalent to the CQ $\exists y\,R(x,y)$. Then, $\exists y\,R(x,y)$ will be resolved using the first tgd, and the CQ $P(x)$ will be obtained, which in turn will be resolved using the third tgd in order to produce the CQ $T(x)$. Since only $P$ and $T$ belong to $\mathbf{S}$, the UCQ rewriting $q^{\prime}({\bar{x}})$ is $P(x)\vee T(x)$.

It is easy to see that, whenever the input OMQ consists of linear tgds, during the execution of $\mathsf{XRewrite}$ it is not possible to obtain a CQ that has more atoms than the original one. This is an immediate consequence of the fact that linear tgds have only one atom in their body. Then:

Proposition 12

$f_{(\mathbb{L},\mathbb{CQ})}\big{(}(\mathbf{S},\Sigma,q)\big{)}\ \leq\ |q|$ .

Having the above result in place, it can be shown that the algorithm underlying Theorem 11 guesses a polynomially sized witness to non-containment, and then calls a $\mathcal{C}$-oracle for solving query evaluation under linear OMQs, where $\mathcal{C}$ is PSpace in general, and NP if the arity is fixed; these complexity classes are obtained from Proposition 2. Therefore, ${\sf coCont}((\mathbb{L},\mathbb{CQ}))$ is in PSpace in general, and in $\Sigma_{2}^{P}$ in case of fixed arity. Regarding the lower bounds, Proposition 5 allows us to inherit the PSpace-hardness of ${\sf Eval}(\mathbb{L},\mathbb{CQ})$; this holds even for constant-free tgds. Unfortunately, in the case of fixed arity, Proposition 5 yields only NP-hardness, while Proposition 6 yields only coNP-hardness. Nevertheless, it is implicit in [17] (see the proof of Theorem 9), where the containment problem for OMQ languages based on description logics is considered, that ${\sf Cont}((\mathbb{L},\mathbb{CQ}))$ is $\Pi_{2}^{P}$-hard, even for tgds of the form $P(x)\rightarrow R(x)$. Then:

Theorem 13

${\sf Cont}((\mathbb{L},\mathbb{CQ}))$ is PSpace-complete, and $\Pi_{2}^{P}$-complete if the arity of the schema is fixed. The lower bounds hold even for tgds without constants.

4.2 Non-Recursiveness

Although the OMQ language $(\mathbb{NR},\mathbb{CQ})$ is not explicitly considered in **[40]**, where the algorithm $\mathsf{XRewrite}$ is defined, the same algorithm can deal with $(\mathbb{NR},\mathbb{CQ})$ . By analyzing the UCQ rewritings constructed by $\mathsf{XRewrite}$ , whenever the input query falls in $(\mathbb{NR},\mathbb{CQ})$ , we can establish the following result; here, $\mathit{body}(\tau)$ denotes the body of the tgd $\tau$ :

Proposition 14

It holds that

[TABLE]

Proposition 14 implies that non-containment for queries that fall in $(\mathbb{NR},\mathbb{CQ})$ is witnessed via a database of at most exponential size. We show next that this bound is optimal:

Proposition 15

There are sets of $(\mathbb{NR},\mathbb{CQ})$ OMQs

[TABLE]

where $|\mathit{sch}(\Sigma_{1}^{n})|=|\mathit{sch}(\Sigma_{2}^{n})|=n+2$ , such that for every $\mathbf{S}$ -database $D$ , if $Q_{1}^{n}(D)\not\subseteq Q_{2}^{n}(D)$ then $|D|\geq 2^{n-1}$ .

Let us now focus on the complexity of ${\sf Cont}((\mathbb{NR},\mathbb{CQ}))$. The algorithm underlying Theorem 11, together with the exponential bound provided by Proposition 14, implies that ${\sf coCont}((\mathbb{NR},\mathbb{CQ}))$ is feasible in non-deterministic exponential time with access to a NExpTime oracle, which immediately implies that ${\sf Cont}((\mathbb{NR},\mathbb{CQ}))$ is in ExpSpace. Unfortunately, the exact complexity of ${\sf Cont}((\mathbb{NR},\mathbb{CQ}))$ is still an open problem, and we conjecture that it is $\text{\rm P}^{\textsc{NEXP}}$-complete; recall that $\textsc{NExpTime}\subseteq\text{\rm P}^{\textsc{NEXP}}\subseteq\textsc{ExpSpace}$. In what follows, we briefly explain how the $\text{\rm P}^{\textsc{NEXP}}$-hardness is obtained. To this end, we exploit a tiling problem that has been recently introduced in **[34]**. Roughly speaking, an instance of this tiling problem is a triple $(m,T_{1},T_{2})$, where $m$ is an integer in unary representation, and $T_{1},T_{2}$ are standard tiling problems for the exponential grid $2^{n}\times 2^{n}$. The question is whether, for every initial condition $w$ of length $m$, $T_{1}$ has no solution with $w$ or $T_{2}$ has some solution with $w$. The initial condition $w$ simply fixes the first $m$ tiles of the first row of the grid. We construct in polynomial time two $(\mathbb{NR},\mathbb{CQ})$ queries $Q_{1}$ and $Q_{2}$ such that $(m,T_{1},T_{2})$ has a solution iff $Q_{1}\subseteq Q_{2}$. The idea is to force every input database to store an initial condition $w$ of length $m$, and then encode the problem of whether $T_{i}$ has a solution with $w$ into $Q_{i}$, for each $i\in\{1,2\}$. From the above discussion we get that:

Theorem 16

${\sf Cont}((\mathbb{NR},\mathbb{CQ}))$ is in ExpSpace, and $\text{\rm P}^{\textsc{NEXP}}$-hard. The lower bound holds even if the arity of the schema is fixed and the tgds are without constants.

4.3 Stickiness

We now focus on $(\mathbb{S},\mathbb{CQ})$ . As shown in **[40]**, given a query $(\mathbf{S},\Sigma,q)$ , there exists an execution of $\mathsf{XRewrite}$ that constructs a UCQ rewriting $q_{1}({\bar{x}})\vee\cdots\vee q_{n}({\bar{x}})$ over $\mathbf{S}$ with the following property: for each $i\in\{1,\ldots,n\}$ , if a variable $v$ occurs in $q_{i}$ in more than one atom, then $v$ already occurs in $q$ . This property has been used in **[40]** to bound the number of atoms that can appear in a single CQ $q_{i}$ . We write $T(q)$ for the set of terms (constants and variables) occurring in $q$ , $C(\Sigma)$ for the set of constants occurring in $\Sigma$ , and $\mathit{ar}(\mathbf{S})$ for the maximum arity over all predicates of $\mathbf{S}$ .

Proposition 17

It holds that

[TABLE]

Proposition 17 implies that non-containment for $(\mathbb{S},\mathbb{CQ})$ queries is witnessed via a database of at most exponential size. As for $(\mathbb{NR},\mathbb{CQ})$ queries, we can show that this bound is optimal; here, for a set $\Sigma$ of tgds, we denote by $||\Sigma||$ the number of symbols occurring in $\Sigma$ :

Proposition 18

There exists a set of $(\mathbb{S},\mathbb{CQ})$ OMQs

[TABLE]

such that for every $Q=(\{S\},\Sigma^{\prime},q^{\prime}({\bar{x}}))\in(\mathbb{TGD},\mathbb{CQ})$ and $\{S\}$ -database $D$ , if $Q^{n}(D)\not\subseteq Q(D)$ then $|D|\geq 2^{n-2}$ .

We now study the complexity of ${\sf Cont}((\mathbb{S},\mathbb{CQ}))$. We first focus on schemas of unbounded arity. Proposition 17 implies that the algorithm underlying Theorem 11 runs in exponential time assuming access to a $\mathcal{C}$-oracle, where $\mathcal{C}$ is a complexity class powerful enough for solving ${\sf Eval}(\mathbb{S},\mathbb{CQ})$ and its complement. But, since ${\sf Eval}(\mathbb{S},\mathbb{CQ})$ is in ExpTime (see Proposition 4), both ${\sf Eval}(\mathbb{S},\mathbb{CQ})$ and its complement are in NExpTime, and thus, the oracle call is not really needed. Consequently, ${\sf coCont}((\mathbb{S},\mathbb{CQ}))$ is in NExpTime.

A matching lower bound is obtained by a reduction from the standard tiling problem for the exponential grid $2^{n}\times 2^{n}$. In fact, the same lower bound has been recently established in **[15]**; however, our result is stronger as it shows that the problem remains hard even if the right-hand side query is a linear OMQ of a simple form. This is also discussed in Section 6, where containment of queries that fall in different OMQ languages is studied. Regarding schemas of fixed arity, Proposition 17 provides a witness for non-containment of polynomial size, which implies that the algorithm underlying Theorem 11 runs in non-deterministic polynomial time with access to an NP-oracle. Therefore, ${\sf coCont}((\mathbb{S},\mathbb{CQ}))$ is in $\Sigma_{2}^{P}$, while a matching lower bound is implicit in **[17]**. Then:

Theorem 19

${\sf Cont}((\mathbb{S},\mathbb{CQ}))$ is coNExpTime-complete, even if the set of tgds uses only two constants. In the case of fixed arity, it is $\Pi_{2}^{P}$-complete, even for constant-free tgds.

Clearly, there exists a double-exponential time algorithm for solving ${\sf Cont}((\mathbb{S},\mathbb{CQ}))$ , which might sound discouraging. However, Proposition 17 implies that the runtime is double-exponential only in the maximum arity of the data schema.

5 Guardedness

We proceed with the problem of containment for guarded OMQs, and we establish the following result:

Theorem 20

${\sf Cont}((\mathbb{G},\mathbb{CQ}))$ is 2ExpTime-complete. The lower bound holds even if the arity of the schema is fixed, and the tgds are without constants.

The lower bound is immediately inherited from **[16]**, where it is shown that containment for OMQs based on the description logic $\mathcal{ELI}$ is 2ExpTime-hard. Recall that a set of $\mathcal{ELI}$ axioms can be equivalently rewritten as a constant-free set of guarded tgds using only unary and binary predicates, which implies the lower bound stated in Theorem 20. However, we cannot immediately inherit the desired upper bound since the DL-based OMQ languages considered in [16] are either weaker than or incomparable to $(\mathbb{G},\mathbb{CQ})$. Nevertheless, the technique developed in [16] was extremely useful for our analysis. Actually, our automata-based procedure exploits a combination of ideas from [16, 44]. The rest of this section is devoted to providing a high-level explanation of this procedure.

For the sake of technical clarity, we focus on constant-free tgds and CQs, but all the results can be extended to the general case at the price of more involved definitions and proofs. Moreover, for simplicity, we focus on Boolean CQs. In other words, we study the problem for $(\mathbb{G},\mathbb{BCQ})$ , where $\mathbb{BCQ}$ denotes the class of Boolean CQs. This does not affect the generality of our proof since it is known that ${\sf Cont}((\mathbb{G},\mathbb{CQ}))$ can be reduced in polynomial time to ${\sf Cont}((\mathbb{G},\mathbb{BCQ}))$ **[16]**.

A first glimpse.

As already mentioned, $(\mathbb{G},\mathbb{CQ})$ is not UCQ rewritable and, therefore, we cannot employ Proposition 10 in order to establish a small witness property as for the languages considered in Section 4. We have tried to establish a small witness property for $(\mathbb{G},\mathbb{CQ})$ by following a different route, but it turned out to be a difficult task. Nevertheless, we can show a tree witness property, which states that non-containment for $(\mathbb{G},\mathbb{CQ})$ is witnessed via a tree-like database. This allows us to devise a procedure based on alternating tree automata. Summing up, the proof of the 2ExpTime membership of ${\sf Cont}((\mathbb{G},\mathbb{CQ}))$ proceeds in three steps:

1. Establish a tree witness property;

2. Encode the tree-like witnesses as trees that can be accepted by an alternating tree automaton; and

3. Construct an automaton that decides ${\sf Cont}((\mathbb{G},\mathbb{CQ}))$; in fact, we reduce ${\sf Cont}((\mathbb{G},\mathbb{CQ}))$ to emptiness for two-way alternating parity automata on finite trees.

Each of the above three steps is discussed in more detail in the following three subsections.

5.1 Tree Witness Property

From the above informal discussion, it is clear that tree-like databases are crucial for our analysis. Let us make this notion more precise using guarded tree decompositions. A tree decomposition of a database $D$ is a labeled rooted tree $T=(V,E,\lambda)$ , where $\lambda:V\rightarrow 2^{\mathit{dom}(D)}$ , such that: (i) for each atom $R(t_{1},\ldots,t_{n})\in D$ , there exists $v\in V$ such that $\lambda(v)\supseteq\{t_{1},\ldots,t_{n}\}$ , and (ii) for every term $t\in\mathit{dom}(D)$ , the set $\{v\in V\mid t\in\lambda(v)\}$ induces a connected subtree of $T$ . The tree decomposition $T$ is called $[U]$ -guarded, where $U\subseteq V$ , if, for every node $v\in V\setminus U$ , there exists an atom $R(t_{1},\ldots,t_{n})\in D$ such that $\lambda(v)\subseteq\{t_{1},\ldots,t_{n}\}$ . We write $\mathit{root}(T)$ for the root node of $T$ , and $D_{T}(v)$ , where $v\in V$ , for the subset of $D$ induced by $\lambda(v)$ . We are now ready to formalize the notion of the tree-like database:

Definition 2.

An $\mathbf{S}$ -database $D$ is a $C$ -tree, where $C\subseteq D$ , if there is a tree decomposition $T$ of $D$ such that:

1. $D_{T}(\mathit{root}(T))=C$, and

2. $T$ is $[\{\mathit{root}(T)\}]$-guarded.
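The conditions above can be checked mechanically for a given candidate decomposition. The Python sketch below does not search for a decomposition; it only verifies one that is supplied. The encoding is an assumption: the tree is a `parent` map (the root maps to `None`), $\lambda$ is a `bag` map from nodes to sets of terms, and facts are `(predicate, arguments)` pairs. Connectedness uses the fact that a non-empty set of nodes of a rooted tree induces a connected subtree iff exactly one of its nodes has its parent outside the set.

```python
def is_tree_decomposition(db, parent, bag):
    # (i) every atom is covered by some bag
    if not all(any(set(args) <= bag[v] for v in bag) for _, args in db):
        return False
    # (ii) the nodes containing a given term induce a connected subtree
    for term in {t for _, args in db for t in args}:
        nodes = {v for v in bag if term in bag[v]}
        tops = [v for v in nodes if parent[v] not in nodes]
        if len(tops) != 1:
            return False
    return True

def is_guarded_outside(db, bag, exempt):
    """[U]-guardedness: every bag outside `exempt` is contained in some atom of db."""
    return all(any(bag[v] <= set(args) for _, args in db)
               for v in bag if v not in exempt)

def induced(db, terms):
    """D_T(v): the atoms of db that use only terms from the given bag."""
    return {(p, args) for p, args in db if set(args) <= terms}

def is_c_tree(db, core, parent, bag):
    root = next(v for v in parent if parent[v] is None)
    return (is_tree_decomposition(db, parent, bag)
            and induced(db, bag[root]) == core
            and is_guarded_outside(db, bag, {root}))

# A 2-cycle on {a, b} (the cyclic part C) with a path b - c - d hanging off it.
db = {("E", ("a", "b")), ("E", ("b", "a")), ("E", ("b", "c")), ("E", ("c", "d"))}
core = {("E", ("a", "b")), ("E", ("b", "a"))}
parent = {"r": None, "v1": "r", "v2": "v1"}
bag = {"r": {"a", "b"}, "v1": {"b", "c"}, "v2": {"c", "d"}}

print(is_c_tree(db, core, parent, bag))
print(max(len(b) for b in bag.values()) - 1)   # width of this decomposition
```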

Roughly, whenever a database $D$ is a $C$ -tree, $C$ is the cyclic part of $D$ , while the rest of $D$ is tree-like. Interestingly, for deciding ${\sf Cont}((\mathbb{G},\mathbb{BCQ}))$ it suffices to focus on databases that are $C$ -trees and $|\mathit{dom}(C)|$ depends only on the left-hand side OMQ. Recall that for a schema $\mathbf{S}$ we write $\mathit{ar}(\mathbf{S})$ for the maximum arity over all predicates of $\mathbf{S}$ . Then:

Proposition 21

Let $Q_{i}=(\mathbf{S},\Sigma_{i},q_{i})\in(\mathbb{G},\mathbb{BCQ})$ , for $i\in\{1,2\}$ . The following are equivalent:

1. $Q_{1}\subseteq Q_{2}$.

2. $Q_{1}(D)\subseteq Q_{2}(D)$, for every $C$-tree $\mathbf{S}$-database $D$ such that $|\mathit{dom}(C)|\leq(\mathit{ar}(\mathbf{S}\cup\mathit{sch}(\Sigma_{1}))\cdot|q_{1}|)$.

The implication $(1)\Rightarrow(2)$ holds trivially, while $(2)\Rightarrow(1)$ is shown by using a variant of the notion of guarded unravelling and compactness. Let us clarify that the above result does not provide a decision procedure for ${\sf Cont}((\mathbb{G},\mathbb{BCQ}))$, since we have to consider infinitely many databases that are $C$-trees with $|\mathit{dom}(C)|\leq(\mathit{ar}(\mathbf{S}\cup\mathit{sch}(\Sigma_{1}))\cdot|q_{1}|)$.

5.2 Encoding Tree-like Databases

It is generally known that a database $D$ whose treewidth is bounded by an integer $k$ can be encoded into a tree over a finite alphabet of double-exponential size in $k$ that can be accepted by an alternating tree automaton; see, e.g., [13]. (Recall that the treewidth of a database $D$ is the minimum width among all possible tree decompositions $T=(V,E,\lambda)$ of $D$, while the width of $T$ is defined as $\max_{v\in V}\{|\lambda(v)|\}-1$.) Consider an alphabet $\Gamma$, and let $\mathbb{N}^{\ast}$ be the set of finite sequences of natural numbers, including the empty sequence. A $\Gamma$-labeled tree is a pair $L={(T,\lambda)}$, where $T\subseteq\mathbb{N}^{\ast}$ is closed under prefixes, and $\lambda\colon T\rightarrow\Gamma$ is the labeling function. The elements of $T$ identify the nodes of $L$. It can be shown that $D$ and a tree decomposition $T$ of $D$ with width $k$ can be encoded as a $\Gamma$-labeled tree $L$, where $\Gamma$ is an alphabet of double-exponential size in $k$, such that each node of $T$ corresponds to exactly one node of $L$ and vice versa.
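As a small illustration of the definition of a $\Gamma$-labeled tree, the following Python sketch represents $L=(T,\lambda)$ as a dictionary from nodes (tuples of natural numbers, with the empty tuple as the root) to labels, and checks prefix-closure and that every label belongs to $\Gamma$. The representation and the function name are assumptions made for illustration; the actual alphabet used in the encoding is not reproduced here.

```python
def is_labeled_tree(labeling, alphabet):
    """Check that the node set is closed under prefixes and labels lie in `alphabet`."""
    nodes = set(labeling)
    prefix_closed = all(node[:k] in nodes for node in nodes for k in range(len(node)))
    return prefix_closed and all(label in alphabet for label in labeling.values())

gamma = {"a", "b"}
tree = {(): "a", (0,): "b", (0, 1): "a"}      # root, child 0, grandchild 0.1
broken = {(): "a", (0, 1): "b"}               # the prefix (0,) is missing

print(is_labeled_tree(tree, gamma), is_labeled_tree(broken, gamma))
```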

Consider now a $C$ -tree $\mathbf{S}$ -database $D$ , and let $T$ be the tree decomposition that witnesses that $D$ is a $C$ -tree. The width of $T$ is at most $k=(|\mathit{dom}(C)|+\mathit{ar}(\mathbf{S})-1)$ , and thus, the treewidth of $D$ is bounded by $k$ . Hence, from the above discussion, $D$ and $T$ can be encoded as a $\Gamma$ -labeled tree, where $\Gamma$ is of double-exponential size in $k$ . In general, given an $\mathbf{S}$ -database $D$ that is a $C$ -tree due to the tree decomposition $T$ , we show that $D$ and $T$ can be encoded as a $\Gamma_{\mathbf{S},l}$ -labeled tree, with $|\mathit{dom}(C)|\leq l$ and $|\Gamma_{\mathbf{S},l}|$ being double-exponential in $\mathit{ar}(\mathbf{S})$ and exponential in $|\mathbf{S}|$ and $l$ .

Although every $C$ -tree $\mathbf{S}$ -database $D$ can be encoded as a $\Gamma_{\mathbf{S},l}$ -labeled tree, the other direction does not hold. In other words, it is not true that every $\Gamma_{\mathbf{S},l}$ -labeled tree encodes a $C$ -tree $\mathbf{S}$ -database $D$ and its corresponding tree decomposition. In view of this fact, we need the additional notion of consistency. A $\Gamma_{\mathbf{S},l}$ -labeled tree is called consistent if it satisfies certain syntactic properties – we do not give these properties here since they are not vital in order to understand the high-level idea of the proof. Now, given a consistent $\Gamma_{\mathbf{S},l}$ -labeled tree $L$ , we can show that $L$ can be decoded into an $\mathbf{S}$ -database $\llbracket L\rrbracket$ that is a $C$ -tree with $|\mathit{dom}(C)|\leq l$ . From the above discussion and Proposition 21, we obtain:

Lemma 22

Let $Q_{i}=(\mathbf{S},\Sigma_{i},q_{i})\in(\mathbb{G},\mathbb{BCQ})$ , for $i\in\{1,2\}$ . The following are equivalent:

1. $Q_{1}\subseteq Q_{2}$.

2. $Q_{1}(\llbracket L\rrbracket)\subseteq Q_{2}(\llbracket L\rrbracket)$, for every consistent $\Gamma_{\mathbf{S},l}$-labeled tree $L$, where $l=(\mathit{ar}(\mathbf{S}\cup\mathit{sch}(\Sigma_{1}))\cdot|q_{1}|)$.

5.3 Constructing Tree Automata

Having the above result in place, we can now proceed with our automata-based procedure. We make use of two-way alternating parity automata (2WAPA) that run on finite labeled trees. Two-way alternating automata process the input tree while branching in an alternating fashion to successor states, and thereby moving either down or up the input tree; the detailed definition can be found in **[11]**. Our goal is to reduce ${\sf Cont}((\mathbb{G},\mathbb{BCQ}))$ to the emptiness problem for 2WAPA. As usual, given a 2WAPA $\mathfrak{A}$ , we denote by $\mathcal{L}(\mathfrak{A})$ the language of $\mathfrak{A}$ , i.e., the set of labeled trees it accepts. The emptiness problem is defined as follows: given a 2WAPA $\mathfrak{A}$ , does $\mathcal{L}(\mathfrak{A})=\varnothing$ ? Thus, given $Q_{1},Q_{2}\in(\mathbb{G},\mathbb{BCQ})$ , we need to construct a 2WAPA $\mathfrak{A}$ such that $Q_{1}\subseteq Q_{2}$ iff $\mathcal{L}(\mathfrak{A})=\varnothing$ . It is well-known that deciding whether $\mathcal{L}(\mathfrak{A})=\varnothing$ is feasible in exponential time in the number of states, and in polynomial time in the size of the input alphabet **[32]**. Therefore, we should construct $\mathfrak{A}$ in double-exponential time, while the number of states must be at most exponential.

We first need a way to check consistency of labeled trees. It is not difficult to devise an automaton for this task.

Lemma 23

Consider a schema $\mathbf{S}$ and an integer $l>0$ . There is a 2WAPA $\mathfrak{C}_{\mathbf{S},l}$ that accepts a $\Gamma_{\mathbf{S},l}$ -labeled tree $L$ iff $L$ is consistent. The number of states of $\mathfrak{C}_{\mathbf{S},l}$ is logarithmic in the size of $\Gamma_{\mathbf{S},l}$ . Furthermore, $\mathfrak{C}_{\mathbf{S},l}$ can be constructed in polynomial time in the size of $\Gamma_{\mathbf{S},l}$ .

Now, the crucial task is, given an OMQ $Q\in(\mathbb{G},\mathbb{BCQ})$ , to devise an automaton that accepts labeled trees which correspond to databases that make $Q$ true.

Lemma 24

Let $Q=(\mathbf{S},\Sigma,q)\in(\mathbb{G},\mathbb{BCQ})$ . There is a 2WAPA $\mathfrak{A}_{Q,l}$ , where $l>0$ , that accepts a consistent $\Gamma_{\mathbf{S},l}$ -labeled tree $L$ iff $Q(\llbracket L\rrbracket)\neq\varnothing$ . The number of states of $\mathfrak{A}_{Q,l}$ is exponential in $||Q||$ and $l$ . Furthermore, $\mathfrak{A}_{Q,l}$ can be constructed in double-exponential time in $||Q||$ and $l$ .

The intuition underlying $\mathfrak{A}_{Q,l}$ can be described as follows. $\mathfrak{A}_{Q,l}$ tries to identify all the possible ways the CQ $q$ can be mapped to $\mathit{chase}(D,\Sigma)$ , for any $C$ -tree $\mathbf{S}$ -database $D$ such that $|\mathit{dom}(C)|\leq l$ . It then arrives at possible ways how the input tree can satisfy $Q$ . These “possible ways” correspond to squid decompositions, a notion introduced in **[24]** that indicates which part of the query is mapped to the cyclic part $C$ of $D$ , and which to the tree-like part of $D$ . The automaton exhaustively checks all squid decompositions by traversing the input tree and, at the same time, explores possible ways how to match the single parts of the squid decomposition at hand. The automaton finally accepts if it finds a squid decomposition that can be mapped to $\mathit{chase}(D,\Sigma)$ .

Having the above automata in place, we can proceed with our main technical result, which shows that ${\sf Cont}(\mathbb{G},\mathbb{BCQ})$ can be reduced to the emptiness problem for 2WAPA. But let us first recall some key results about 2WAPA, which are essential for our final construction. It is well-known that languages accepted by 2WAPAs are closed under intersection and complement. Given two 2WAPAs $\mathfrak{A}_{1}$ and $\mathfrak{A}_{2}$ , we write $\mathfrak{A}_{1}\cap\mathfrak{A}_{2}$ for a 2WAPA, which can be constructed in polynomial time, that accepts the language $\mathcal{L}(\mathfrak{A}_{1})\cap\mathcal{L}(\mathfrak{A}_{2})$ . Moreover, for a 2WAPA $\mathfrak{A}$ , we write $\overline{\mathfrak{A}}$ for the 2WAPA, which is also constructible in polynomial time, that accepts the complement of $\mathcal{L}(\mathfrak{A})$ . We can now show the following:

Proposition 25

Consider $Q_{1},Q_{2}\in(\mathbb{G},\mathbb{BCQ})$ . We can construct in double-exponential time a 2WAPA $\mathfrak{A}$ , which has exponentially many states, such that

$$Q_{1}\subseteq Q_{2}\iff\mathcal{L}(\mathfrak{A})=\varnothing.$$

Proof (sketch). Let $Q_{i}=(\mathbf{S},\Sigma_{i},q_{i})$, for $i\in\{1,2\}$, and $l=(\mathit{ar}(\mathbf{S}\cup\mathit{sch}(\Sigma_{1}))\cdot|q_{1}|)$. Then $\mathfrak{A}$ is defined as $(\mathfrak{C}_{\mathbf{S},l}\ \cap\ \mathfrak{A}_{Q_{1},l})\ \cap\ \overline{\mathfrak{A}_{Q_{2},l}}$. Since $\Gamma_{\mathbf{S},l}$ has double-exponential size, Lemmas 23 and 24 imply that $\mathfrak{A}$ can be constructed in double-exponential time, while it has exponentially many states. Lemma 22 implies that $Q_{1}\subseteq Q_{2}$ iff $\mathcal{L}(\mathfrak{A})=\varnothing$. $\square$

Proposition 25 implies that ${\sf Cont}((\mathbb{G},\mathbb{BCQ}))$ is in 2ExpTime, and Theorem 20 follows. Thus, there exists a double-exponential time algorithm for solving ${\sf Cont}((\mathbb{G},\mathbb{CQ}))$. Interestingly, the runtime is double-exponential only in the size of the CQs and the maximum arity of the schema. This can be shown by providing a more refined complexity analysis of the construction of the 2WAPA $\mathfrak{A}$ in Proposition 25.
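The bookkeeping behind the 2ExpTime bound can be spelled out as a rough worked estimate. The symbols $n$, $s$, $p$, $c$ and $d$ below are introduced only for this illustration: $n$ stands for the combined size of the two input OMQs, $p$ for some polynomial, and $c,d\geq 1$ for the constants hidden in "exponential in the number of states" and "polynomial in the size of the alphabet". This is a sketch under these assumptions, not the refined analysis mentioned above.

```latex
% Illustrative names: n = ||Q_1|| + ||Q_2||, p a polynomial, c, d >= 1 constants.
\begin{align*}
s &\le 2^{p(n)}
  && \text{(number of states of } \mathfrak{A}\text{)}\\
|\Gamma_{\mathbf{S},l}| &\le 2^{2^{p(n)}}
  && \text{(size of the alphabet)}\\
\text{time for emptiness}
  &\le 2^{s^{c}} \cdot |\Gamma_{\mathbf{S},l}|^{d}
  && \text{(exponential in } s\text{, polynomial in } |\Gamma_{\mathbf{S},l}|\text{)}\\
  &\le 2^{2^{c\cdot p(n)}} \cdot 2^{d\cdot 2^{p(n)}}\\
  &\le 2^{(1+d)\cdot 2^{c\cdot p(n)}}
   \;\le\; 2^{2^{c\cdot p(n)+d}} .
\end{align*}
```

The last step uses $1+d\leq 2^{d}$ for $d\geq 1$, so the overall running time stays double-exponential in $n$.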

6 Combining Languages

In the previous two sections, we studied the containment problem relative to a language $\mathbb{O}$ , i.e., both OMQs fall in $\mathbb{O}$ . However, it is natural to consider the version of the problem where the involved OMQs fall in different languages. This is the goal of this section. Our analysis proceeds by considering the two cases where the left-hand side (LHS) query falls in a UCQ rewritable OMQ language, or it is guarded.

6.1 The LHS Query is UCQ Rewritable

As an immediate corollary of Theorem 11 we obtain the following result: ${\sf Cont}((\mathbb{C}_{1},\mathbb{CQ}),(\mathbb{C}_{2},\mathbb{CQ}))$, for $\mathbb{C}_{1}\neq\mathbb{C}_{2}$, $\mathbb{C}_{1}\in\{\mathbb{L},\mathbb{NR},\mathbb{S}\}$ and $\mathbb{C}_{2}\in\{\mathbb{L},\mathbb{NR},\mathbb{S},\mathbb{G}\}$, is decidable. By exploiting the algorithm underlying Theorem 11, we establish optimal upper bounds for all the problems at hand with the only exception of ${\sf Cont}((\mathbb{S},\mathbb{CQ}),(\mathbb{NR},\mathbb{CQ}))$. For the latter, we obtain an ExpSpace upper bound, by providing a similar analysis as for ${\sf Cont}((\mathbb{NR},\mathbb{CQ}))$, while a NExpTime lower bound is inherited from query evaluation by exploiting Proposition 5. It is rather tedious, and not very interesting from a technical point of view, to go through all the containment problems in question and explain in detail how the exact upper bounds are obtained; we leave this as an exercise to the interested reader. (There are eighteen different cases, obtained by considering all the possible pairs $(\mathbb{O}_{1},\mathbb{O}_{2})$ of OMQ languages, where $\mathbb{O}_{1}\neq\mathbb{O}_{2}$ and $\mathbb{O}_{1}$ is UCQ rewritable, and the two cases of whether the arity of the schema is fixed or not.)
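The count of eighteen cases can be double-checked with a short enumeration. The string labels and the function name `mixed_containment_cases` are illustrative choices made here, not notation from the paper.

```python
from itertools import product

# Classes of tgds giving UCQ-rewritable OMQ languages (left-hand side),
# and all classes considered for the right-hand side.
LHS_CLASSES = ["L", "NR", "S"]
RHS_CLASSES = ["L", "NR", "S", "G"]
ARITY_SETTINGS = ["bounded", "unbounded"]


def mixed_containment_cases():
    """Enumerate (C1, C2, arity) with C1 != C2 and C1 UCQ rewritable."""
    return [
        (c1, c2, arity)
        for c1, c2 in product(LHS_CLASSES, RHS_CLASSES)
        if c1 != c2
        for arity in ARITY_SETTINGS
    ]


cases = mixed_containment_cases()
print(len(cases))
```

There are nine pairs of distinct classes, and each is considered for bounded and for unbounded arity.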

Regarding the matching lower bounds, in most of the cases they are inherited from query evaluation or its complement by exploiting Propositions 5 and 6, respectively. There are, however, some exceptions:

• ${\sf Cont}((\mathbb{S},\mathbb{CQ}),(\mathbb{L},\mathbb{CQ}))$ in the case of unbounded arity, where the problem is coNExpTime-hard, even for sets of tgds that use only two constants. This is shown by a reduction from the standard tiling problem for the exponential grid $2^{n}\times 2^{n}$.

• ${\sf Cont}((\mathbb{L},\mathbb{CQ}),(\mathbb{S},\mathbb{CQ}))$ and ${\sf Cont}((\mathbb{S},\mathbb{CQ}),(\mathbb{L},\mathbb{CQ}))$ in the case of bounded arity, where both problems are $\Pi_{2}^{P}$-hard even for constant-free tgds; implicit in [17].

6.2 The LHS Query is Guarded

We proceed with the case where the LHS query is guarded, and we show the following result:

Theorem 26

${\sf Cont}((\mathbb{G},\mathbb{CQ}),(\mathbb{C},\mathbb{CQ}))$ is $\mathcal{C}$-complete:

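$$\begin{array}{c|c}\mathbb{C} & \mathcal{C}\\ \hline \mathbb{L} & \text{2ExpTime}\\ \mathbb{NR} & \text{3ExpTime}\\ \mathbb{S} & \text{2ExpTime}\end{array}$$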

The lower bounds hold even if the arity of the schema is fixed. Moreover, for $\mathbb{C}=\mathbb{L}$ (resp., $\mathbb{C}\in\{\mathbb{NR},\mathbb{S}\}$) they hold even for tgds with one constant (resp., without constants).

Upper bounds.

The 2ExpTime membership when $\mathbb{C}=\mathbb{L}$ is an immediate corollary of Theorem 20. This is not true when $\mathbb{C}\in\{\mathbb{NR},\mathbb{S}\}$ since the right-hand side query is not guarded. But in this case, since $(\mathbb{NR},\mathbb{CQ})$ and $(\mathbb{S},\mathbb{CQ})$ are UCQ rewritable, one can rewrite the right-hand side query as a UCQ, and then apply the machinery developed in Section 5 for solving ${\sf Cont}((\mathbb{G},\mathbb{CQ}))$. More precisely, given OMQs $Q_{1}\in(\mathbb{G},\mathbb{CQ})$ and $Q_{2}\in(\mathbb{C},\mathbb{CQ})$, where $\mathbb{C}\in\{\mathbb{NR},\mathbb{S}\}$, $Q_{1}\subseteq Q_{2}$ iff $Q_{1}\subseteq q$, where $q$ is a UCQ rewriting of $Q_{2}$. Thus, an immediate decision procedure, which exploits the algorithm $\mathsf{XRewrite}$, is the following:

1. Let $q=\mathsf{XRewrite}(Q_{2})$;

2. For each $q^{\prime}\in q$: if $Q_{1}\subseteq q^{\prime}$, then proceed; otherwise, reject; and

3. Accept.

The above procedure runs in triple-exponential time. The first step is feasible in double-exponential time **[40]**. Now, for a single CQ $q^{\prime}\in q$ (which is a guarded OMQ with an empty set of tgds) the check whether $Q_{1}\subseteq q^{\prime}$ can be done by using the machinery developed in Section 5, which reduces our problem to checking whether the language of a 2WAPA $\mathfrak{A}$ is empty. However, it should not be forgotten that $q^{\prime}$ is of exponential size, and thus, the automaton $\mathfrak{A}$ has double-exponentially many states. This in turn implies that checking whether $\mathcal{L}(\mathfrak{A})=\varnothing$ is in 3ExpTime, as claimed.

Although the above algorithm establishes an optimal upper bound for non-recursive OMQs, a more refined analysis is needed for sticky OMQs. In fact, we need a more refined complexity analysis for the problem ${\sf Cont}((\mathbb{G},\mathbb{CQ}),\mathbb{UCQ})$ , that is, to decide whether a guarded OMQ is contained in a UCQ. To this end, we provide an automata construction different from the one employed in Section 5, which allows us to establish a refined complexity upper bound for the problem in question. Consider a $(\mathbb{G},\mathbb{CQ})$ query $Q$ , and a UCQ $q=q_{1}\vee\cdots\vee q_{n}$ . As usual, we write $||Q||$ and $||q_{i}||$ for the number of symbols that occur in $Q$ and $q_{i}$ , respectively, and we write $\mathit{var}_{\geq 2}(q_{i})$ for the set of variables that appear in more than one atom of $q_{i}$ . By exploiting our new automata-based procedure, we show that the problem of checking if $Q\subseteq q$ is feasible in double-exponential time in $(||Q||+\max_{1\leq i\leq n}\{|\mathit{var}_{\geq 2}(q_{i})|\})$ , exponential time in $\max_{1\leq i\leq n}\{||q_{i}||\}$ , and polynomial time in $n$ .

This result allows us to show that the above procedure establishes 2ExpTime-membership when the right-hand side OMQ is sticky. But first we need to recall the following key properties of the UCQ rewriting $q=\mathsf{XRewrite}(Q_{2})$, constructed during the first step of the algorithm:

1. $q$ consists of double-exponentially many CQs,

2. each CQ of $q$ is of exponential size, and

3. for each $q^{\prime}\in q$, $\mathit{var}_{\geq 2}(q^{\prime})$ is a subset of the variables of the original CQ that appears in $Q_{2}$.

By combining these key properties with the complexity analysis performed above, it is now straightforward to show that ${\sf Cont}((\mathbb{G},\mathbb{CQ}),(\mathbb{S},\mathbb{CQ}))$ is in 2ExpTime.
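The combination of the three properties with the refined bound can be sketched as follows. The estimate assumes, for illustration only, that the three dependencies of the refined bound combine multiplicatively; the symbols $N$, $v$, $m$, $n$, $p$ and $r$ are names introduced here, with $p$ and $r$ polynomials and $p(x)\leq x^{c}$ for some constant $c\geq 1$ and all $x\geq 2$.

```latex
% Illustrative names: N = ||Q_1|| + ||Q_2||,
% v = max_i |var_{>=2}(q_i)|, m = max_i ||q_i||, n = number of CQs in the rewriting q.
\begin{align*}
v &\le \|Q_{2}\| \le N, \qquad
m \le 2^{r(N)}, \qquad
n \le 2^{2^{r(N)}}
  && \text{(properties 3, 2 and 1)}\\
\text{time}
  &\le 2^{2^{p(\|Q_{1}\|+v)}} \cdot 2^{p(m)} \cdot p(n)\\
  &\le 2^{2^{p(N)}} \cdot 2^{2^{c\cdot r(N)}} \cdot 2^{c\cdot 2^{r(N)}} .
\end{align*}
```

Each of the three factors is at most double-exponential in $N$, and so is their product, which is consistent with the claimed 2ExpTime membership.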

Lower Bounds.

We establish matching lower bounds by refining techniques from **[31]**, where it is shown that containment of Datalog in UCQ is 2ExpTime-complete, while containment of Datalog in non-recursive Datalog is 3ExpTime-complete; the lower bounds hold for fixed-arity predicates, and constant-free rules. Interestingly, the LHS query can be transformed into a Datalog query such that each rule has a body-atom that contains all the variables, i.e., is guarded. This is achieved by increasing the arity of some predicates in order to have enough positions for all the body-variables. However, for each rule, the number of unguarded variables that we need to guard is constant, and thus, the arity of the schema remains constant. We conclude that ${\sf Cont}((\mathbb{G},\mathbb{CQ}),(\mathbb{NR},\mathbb{CQ}))$ is 3ExpTime-hard. Moreover, containment of guarded OMQs in UCQs is 2ExpTime-hard, which in turn allows us to show, by exploiting the construction underlying Proposition 9, that ${\sf Cont}((\mathbb{G},\mathbb{CQ}),(\mathbb{L},\mathbb{CQ}))$ is 2ExpTime-hard, even if the set of linear tgds uses only one constant, while ${\sf Cont}((\mathbb{G},\mathbb{CQ}),(\mathbb{S},\mathbb{CQ}))$ is 2ExpTime-hard, even for tgds without constants.

7 Applications

Interestingly, our results on ${\sf Cont}((\mathbb{G},\mathbb{CQ}))$ can be applied to other important static analysis tasks, in particular, distribution over components and UCQ rewritability. Each one of those tasks is considered in the following two sections.

7.1 Distribution Over Components

The notion of distribution over components has been introduced in **[3]**, and it states that the answer to a query can be computed by parallelizing it over the (maximally connected) components of the input database. But let us first make precise what a component is. A set of atoms $A$ is connected if for all $c,d\in\mathit{dom}(A)$, there exists a sequence $\alpha_{1},\ldots,\alpha_{n}$ of atoms in $A$ such that $c\in\mathit{dom}(\alpha_{1})$, $d\in\mathit{dom}(\alpha_{n})$, and $\mathit{dom}(\alpha_{i})\cap\mathit{dom}(\alpha_{i+1})\neq\varnothing$, for each $i\in\{1,\ldots,n-1\}$. We call $B\subseteq A$ a component of $A$ if (i) $B$ is connected, and (ii) for every $\alpha\in A\setminus B$, $B\cup\{\alpha\}$ is not connected. (For technical clarity, the notion of component is defined only for sets of atoms that do not contain $0$-ary atoms.) Let ${co}(A)$ be the set of components of $A$. We are now ready to introduce the notion of distribution over components. Consider an OMQ $Q=(\mathbf{S},\Sigma,q)\in(\mathbb{TGD},\mathbb{CQ})$. We say that $Q$ distributes over components if $Q(D)=Q(D_{1})\cup\cdots\cup Q(D_{n})$, where ${co}(D)=\{D_{1},\ldots,D_{n}\}$, for every $\mathbf{S}$-database $D$. In this case, $Q(D)$ can be computed without any communication over a network using a distribution where every computing node is assigned some of the components of the database, and every component is assigned to at least one computing node. In other words, $Q$ can be evaluated in a distributed and coordination-free manner; for more details on coordination-free evaluation see **[3, 4, 5]**. Therefore, it would be quite beneficial if we could decide whether an OMQ distributes over components, and thus, we obtain the following interesting static analysis task:

PROBLEM: ${\sf Dist}(\mathbb{C},\mathbb{CQ})$

INPUT: An OMQ $Q\in(\mathbb{C},\mathbb{CQ})$.

QUESTION: Does $Q$ distribute over components?
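The set ${co}(A)$ of components defined above can be computed with a standard union-find pass over the atoms. The sketch below assumes, for illustration, that an atom is a pair of a predicate name and a non-empty tuple of terms; the function name `components` and this representation are choices made here and are not taken from the paper.

```python
from collections import defaultdict


def components(atoms):
    """Return co(A): the maximal connected subsets of a set of atoms.

    Each atom is a pair (predicate, args) with a non-empty tuple args.
    Two atoms end up in the same component iff they are linked by a
    chain of atoms in which consecutive atoms share a term.
    """
    atoms = list(set(atoms))
    parent = list(range(len(atoms)))

    def find(i):
        # Find the representative of atom i, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_atom_with = {}  # term -> index of the first atom mentioning it
    for i, (_, args) in enumerate(atoms):
        for term in args:
            if term in first_atom_with:
                parent[find(i)] = find(first_atom_with[term])
            else:
                first_atom_with[term] = i

    groups = defaultdict(set)
    for i, atom in enumerate(atoms):
        groups[find(i)].add(atom)
    return [frozenset(group) for group in groups.values()]


database = {
    ("R", ("a", "b")),
    ("S", ("b", "c")),
    ("R", ("d", "e")),
}
database_components = components(database)
```

For this database, the atoms `R(a,b)` and `S(b,c)` share the constant `b` and form one component, while `R(d,e)` forms a second one.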


The above problem has been studied in **[15]**, where tight complexity bounds for $(\mathbb{L},\mathbb{CQ})$ and $(\mathbb{S},\mathbb{CQ})$ have been established. However, its exact complexity for guarded OMQs has been left open. Our results on containment for guarded OMQs allow us to close this problem. But first we need to recall a key result that semantically characterizes distribution over components. An OMQ $Q$ with data schema $\mathbf{S}$ is unsatisfiable if there is no $\mathbf{S}$ -database $D$ such that $Q(D)\neq\varnothing$ . Moreover, for a CQ $q$ , we write ${co}(q)$ for its components. The next result has been shown in **[15]**:

Proposition 27

Let $Q=(\mathbf{S},\Sigma,q({\bar{x}}))\in(\mathbb{G},\mathbb{CQ})$ . The following are equivalent:

1. $Q$ distributes over components.

2. $Q$ is unsatisfiable or there exists $\hat{q}(\bar{x})\in{co}(q)$ such that $(\mathbf{S},\Sigma,\hat{q}(\bar{x}))\subseteq Q$.

Checking unsatisfiability can be easily reduced to containment. Thus, the above result, together with Theorem 20, implies that ${\sf Dist}(\mathbb{G},\mathbb{CQ})$ is in 2ExpTime, while a matching lower bound is implicit in [15]. Then:

Theorem 28

${\sf Dist}(\mathbb{G},\mathbb{CQ})$ is 2ExpTime-complete.
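The way Proposition 27 turns distribution over components into containment checks can be sketched as follows. The two oracles `is_unsatisfiable` and `is_contained_in_query` are hypothetical placeholders for the (2ExpTime) containment procedure of Section 5 and are not implemented here. The sketch also assumes that $\hat{q}(\bar{x})\in{co}(q)$ is read as "a component of $q$ that mentions every answer variable", which is an interpretation made for this illustration.

```python
def distributes_over_components(query_components, answer_vars,
                                is_unsatisfiable, is_contained_in_query):
    """Decide item (2) of Proposition 27, given two oracles.

    query_components      -- the components of the CQ q, each a set of atoms
                             (predicate, args)
    answer_vars           -- the answer variables of q
    is_unsatisfiable      -- hypothetical oracle: () -> bool, is Q unsatisfiable?
    is_contained_in_query -- hypothetical oracle: component -> bool, is the OMQ
                             obtained by replacing q with that component
                             contained in Q?
    """
    if is_unsatisfiable():
        return True
    needed = set(answer_vars)
    for component in query_components:
        terms = {t for _, args in component for t in args}
        # Only components that mention all answer variables are candidates.
        if needed <= terms and is_contained_in_query(component):
            return True
    return False
```

The function makes at most one unsatisfiability check and one containment check per component of $q$, so the number of oracle calls is linear in the size of $q$.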

7.2 Deciding UCQ Rewritability

Query rewriting is a well-studied method for evaluating OMQs using standard database technology. The key idea is the following: given an OMQ $Q=(\mathbf{S},\Sigma,q(\bar{x}))$ , combine $\Sigma$ and $q$ into a new query $q_{\Sigma}(\bar{x})$ , the so-called rewriting, which can then be evaluated over $D$ yielding the same answer as $Q$ over $D$ , for every $\mathbf{S}$ -database $D$ . For this approach to be realistic, though, it is essential that the rewriting is expressed in a language that can be handled by standard database systems. The typical language that is considered in this setting is first-order (FO) queries **[28]**. Notice, however, that due to Rossman’s Theorem **[53]**, and the fact that OMQs are closed under homomorphisms, FO and UCQ rewritability coincide. Recall that some OMQ languages are UCQ rewritable, such as the ones based on linear, non-recursive and sticky sets of tgds, while others are not, e.g., guarded OMQs. For those languages $\mathbb{O}$ that are not UCQ rewritable, it is important to be able to check whether a query $Q\in\mathbb{O}$ can be rewritten as a UCQ, in which case we say that it is UCQ rewritable. This gives rise to the following fundamental static analysis task for an OMQ language $(\mathbb{C},\mathbb{CQ})$ , where $\mathbb{C}\subseteq\mathbb{TGD}$ :

PROBLEM: ${\sf UCQRew}(\mathbb{C},\mathbb{CQ})$

INPUT: An OMQ $Q\in(\mathbb{C},\mathbb{CQ})$.

QUESTION: Is it the case that $Q$ is UCQ rewritable?


Bienvenu et al. have recently carried out an in-depth study of the above problem for OMQ languages based on central Horn-DLs **[16]**. One of their main results is that the above problem for the OMQ language based on $\mathcal{ELHI}$, one of the most expressive members of the $\mathcal{EL}$-family of DLs, is 2ExpTime-complete. Interestingly, by adapting the tree automata techniques developed in Section 5, we can generalize the above result: deciding UCQ rewritability for the OMQ language based on guarded tgds over unary and binary relations is in 2ExpTime. Let $\mathbb{G}_{2}$ be the class of (finite) sets of guarded tgds over unary and binary relations. Then:

Theorem 29

${\sf UCQRew}(\mathbb{G}_{2},\mathbb{CQ})$ is 2ExpTime-complete.

Since the lower bound is inherited from **[16]**, we concentrate on the upper bound. As in Section 5, we can focus on BCQs, i.e., it suffices to show that ${\sf UCQRew}(\mathbb{G}_{2},\mathbb{BCQ})$ is in 2ExpTime. Our proof proceeds in two steps:

1. We semantically characterize UCQ rewritability for queries in $(\mathbb{G}_{2},\mathbb{CQ})$ in terms of a certain boundedness property for the set of $C$-trees defined in Section 5.

2. We extend the techniques developed in Section 5 and construct in double-exponential time a 2WAPA $\mathfrak{A}$ that has exponentially many states, such that the aforementioned boundedness property does not hold iff $\mathcal{L}(\mathfrak{A})$ is infinite. (Such an infinity problem for tree automata has been used to obtain the decidability of the boundedness problem for monadic Datalog **[32, 56]**.)

Our 2ExpTime upper bound then follows since the infinity problem for a 2WAPA $\mathfrak{A}$, i.e., checking if $\mathcal{L}(\mathfrak{A})$ is infinite, is feasible in exponential time in the number of states, and in polynomial time in the size of the alphabet. This follows from two known results: (a) the 2WAPA $\mathfrak{A}$ can be converted into an equivalent non-deterministic tree automaton $\mathfrak{B}$ with a single-exponential blow-up in the number of states [57], and (b) solving the infinity problem for non-deterministic tree automata is feasible in polynomial time; cf. [56].

It is worth contrasting our proof with the one in **[16]** for $\mathcal{ELHI}$, which does not make use of the infinity problem for 2WAPA, but applies a different argument based on pumping. This leads to a finer complexity analysis in terms of the size of the different components of the OMQ, but, in our opinion, makes the proof conceptually harder.

The semantic characterization. To establish the semantic characterization from step 1, we need to define the notion of distance from the root for an element $u$ in a $C$-tree database $D$. Intuitively, this corresponds to the minimal distance between a node that contains $u$ and the root of a tree decomposition $T$ of $D$ that witnesses the fact that $D$ is a $C$-tree. We do not consider all such tree decompositions, however, but concentrate on a well-behaved subclass, which we call the lean tree decompositions of the $C$-tree $D$; the formal definition can be found in [11], as it does not add much to the explanation we provide here. Due to the fact that we focus on unary and binary relations, such lean tree decompositions ensure the invariance of the notion of distance from the root, by severely limiting the level of redundancy allowed in a tree representation of $D$. Therefore, it does not matter which lean tree decomposition we choose, since in all of them the distance of an element $u$ from the root will be the same. Let ${D}_{\leq k}$ be the subinstance of $D$ induced by the set of elements whose distance from the root is at most $k$, and let $D_{>k}$ be the subinstance of $D$ induced by the set of elements whose distance from the root is at least $k+1$.
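The two subinstances ${D}_{\leq k}$ and $D_{>k}$ can be illustrated with a few lines of code. The sketch assumes that the distance from the root of every element is already given as a dictionary (computing it from a lean tree decomposition is not shown), and it reads "subinstance induced by a set of elements" as the set of atoms all of whose terms lie in that set. The name `split_by_distance` and the atom representation are illustrative.

```python
def split_by_distance(database, distance, k):
    """Return the pair (D_{<=k}, D_{>k}) for a database of atoms.

    database -- a set of atoms (predicate, args)
    distance -- dict mapping each element of the database to its
                distance from the root (assumed to be given)
    k        -- the distance threshold
    """
    near = {u for u, d in distance.items() if d <= k}
    far = {u for u, d in distance.items() if d >= k + 1}
    d_le_k = {atom for atom in database if set(atom[1]) <= near}
    d_gt_k = {atom for atom in database if set(atom[1]) <= far}
    return d_le_k, d_gt_k


example_db = {
    ("R", ("a", "b")),
    ("R", ("b", "c")),
    ("R", ("c", "d")),
}
example_distance = {"a": 0, "b": 0, "c": 1, "d": 2}
```

Note that an atom such as `R(b,c)` above, which mixes elements at distance $0$ and $1$, belongs to neither subinstance when $k=0$.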

Another useful notion is the branching degree of a tree decomposition $T$ , that is, the maximum number of child nodes over all nodes of $T$ . Again, lean tree decompositions ensure the invariance of the branching degree. This allows us to define the branching degree of a $C$ -tree database $D$ as the branching degree of a lean tree decomposition that witnesses the fact that $D$ is a $C$ -tree.

It follows from **[16]** that being able to decide containment for the OMQ language $(\mathbb{G}_{2},\mathbb{BCQ})$ (as we have done in Section 5) allows us to concentrate on connected CQs when deciding UCQ rewritability. This simplifies technicalities considerably and, in turn, allows us to obtain our desired semantic characterization of UCQ rewritability:

Proposition 30

Let $Q={(\mathbf{S},\Sigma,q)}\in(\mathbb{G}_{2},\mathbb{BCQ})$ , where $q$ is connected. The following are equivalent:

1. $Q$ is UCQ rewritable.

2. There exist $k,m\geq 0$ (which depend only on $Q$) s.t.

[TABLE]

for each $C$ -tree $\mathbf{S}$ -database $D$ with $|\mathit{dom}(C)|\leq 2\cdot|q|$ and branching degree at most $m$ .

The reduction to the infinity problem. We now proceed with step 2, and we explain how the boundedness property established in item (2) of Proposition 30 can be reduced to the infinity problem for 2WAPAs. As in Section 5, we do not reason with $C$-tree databases directly, but we deal with their encodings as consistent $\Gamma_{\mathbf{S},l}$-labeled trees. In fact, using the same ideas as in Lemma 22, we can show by exploiting Proposition 30 that the following are equivalent:

(i) $Q$ is UCQ rewritable.

(ii) There are $k,m\geq 0$ such that

[TABLE]

for every consistent $\Gamma_{\mathbf{S},l}$ -labeled tree $L$ with $l=2\cdot|q|$ and whose branching degree is bounded by $m$ .

Let us write Boundedness for the property expressed in item (ii) above, which can be reduced to the problem of checking whether some tree language is finite. Let $\mathcal{L}_{Q}$ be the set of all $\Gamma_{\mathbf{S},l}$ -labeled trees $L$ of branching degree at most $m$ such that: (1) $Q(\llbracket L\rrbracket)=\varnothing$ and (2) there is some “extension” $L^{\prime}$ of $L$ , with branching degree $m$ , such that $Q(\llbracket L^{\prime}\rrbracket)\neq\varnothing$ and $Q(\llbracket L^{\prime}\rrbracket_{>0})=\varnothing$ . Notice that $L^{\prime}$ can increase the depth but not the branching degree of $L$ . It is not difficult to show that Boundedness holds iff $\mathcal{L}_{Q}$ is finite. We then devise in double-exponential time a 2WAPA $\mathfrak{C}_{Q,l}$ , which has exponentially many states, such that $\mathcal{L}_{Q}=\mathcal{L}(\mathfrak{C}_{Q,l})$ . Therefore, the following holds:

Proposition 31

Consider $Q\in(\mathbb{G}_{2},\mathbb{BCQ})$ . We can construct in double-exponential time a 2WAPA $\mathfrak{A}$ , which has exponentially many states, such that

$$Q\text{ is UCQ rewritable}\iff\mathcal{L}(\mathfrak{A})\text{ is finite}.$$

Since checking whether $\mathcal{L}(\mathfrak{A})$ is infinite is feasible in exponential time in the number of states and in polynomial time in the size of the alphabet, Proposition 31 implies that ${\sf UCQRew}(\mathbb{G}_{2},\mathbb{CQ})$ is in 2ExpTime, as needed.

8 Conclusions

We have concentrated on the fundamental problem of containment for OMQ languages based on the main decidable classes of tgds. We have also used our techniques to close problems related to distribution over components and UCQ rewritability. We believe that our techniques for solving containment under guarded OMQs can be extended to frontier-guarded OMQs, an interesting extension of guardedness **[8]**. We are also convinced that our solution to the problem of deciding UCQ rewritability of guarded OMQs over unary and binary relations can be extended to guarded (or even frontier-guarded) OMQs over arbitrary schemas. We are currently investigating these challenging problems.

APPENDIX

PRELIMINARIES

Definition of Non-recursiveness

In the main body of the paper, we define non-recursive sets of tgds via the notion of predicate graph. Here, we give an alternative definition, based on the well-known notion of stratification, which is more convenient for the combinatorial analysis that we are going to perform in the proof of Proposition 14.

Definition 3.

Consider a set $\Sigma$ of tgds. A stratification of $\Sigma$ is a partition $\{\Sigma_{1},\ldots,\Sigma_{n}\}$ , where $n>0$ , of $\Sigma$ such that, for some function $\mu:\mathit{sch}(\Sigma)\rightarrow\{0,\ldots,n\}$ , the following hold:

1. For each predicate $R\in\mathit{sch}(\Sigma)$, all the tgds with $R$ in their head belong to $\Sigma_{\mu(R)}$, i.e., they belong to the same set of the partition.

2. If there exists a tgd in $\Sigma$ such that the predicate $R$ appears in its body, while the predicate $P$ appears in its head, then $\mu(R)<\mu(P)$.

We say that $\Sigma$ is stratifiable if it admits a stratification.

It is an easy exercise to show that the predicate graph of a set $\Sigma$ of tgds is acyclic iff $\Sigma$ is stratifiable. Then:

Lemma 32

$\Sigma$ is non-recursive iff $\Sigma$ is stratifiable.
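The correspondence between acyclicity of the predicate graph and stratifiability can be sketched in code. The sketch makes simplifying assumptions that are not part of the definition above: every tgd has a single head atom and a non-empty body, and only predicate names matter, so a tgd is represented as a pair of a tuple of body predicates and one head predicate. The function name `stratify` is hypothetical. It computes $\mu$ as the length of the longest path reaching a predicate in the predicate graph.

```python
from collections import defaultdict


def stratify(tgds):
    """Return (mu, strata) for single-head tgds, or None if no stratification exists.

    tgds   -- list of (body_predicates, head_predicate) pairs, with
              body_predicates a non-empty tuple of predicate names
    mu     -- dict: predicate -> level, strictly increasing from body to head
    strata -- list [Sigma_1, ..., Sigma_n] of lists of tgd indices
    """
    predicates = set()
    successors = defaultdict(set)
    indegree = defaultdict(int)
    for body, head in tgds:
        predicates.add(head)
        predicates.update(body)
        for b in body:
            if head not in successors[b]:
                successors[b].add(head)
                indegree[head] += 1

    # Kahn's algorithm; mu[p] is final when p is removed from the queue.
    mu = {p: 0 for p in predicates}
    queue = [p for p in predicates if indegree[p] == 0]
    processed = 0
    while queue:
        p = queue.pop()
        processed += 1
        for s in successors[p]:
            mu[s] = max(mu[s], mu[p] + 1)
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    if processed != len(predicates):
        return None  # the predicate graph has a cycle

    n = max(mu.values())
    strata = [[] for _ in range(n)]
    for index, (_, head) in enumerate(tgds):
        strata[mu[head] - 1].append(index)
    return mu, strata
```

For instance, for the two tgds `R -> P` and `P, R -> T` the sketch assigns levels 0, 1 and 2 to `R`, `P` and `T`, while for `P -> T` and `T -> P` it reports that no stratification exists.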

Definition of Stickiness

In the main body of the paper, we provide an intuitive explanation of stickiness. Here, we recall the formal definition of sticky sets of tgds, introduced in **[27]**. Fix a set $\Sigma$ of tgds; w.l.o.g., we assume that, for every pair $(\sigma,\sigma^{\prime})\in\Sigma\times\Sigma$ , $\sigma$ and $\sigma^{\prime}$ do not share variables. For notational convenience, given an atom $\alpha$ and a variable $x$ occurring in $\alpha$ , $\mathit{pos}(\alpha,x)$ is the set of positions in $\alpha$ at which $x$ occurs; a position $P[i]$ identifies the $i$ -th attribute of the predicate $P$ . The definition of stickiness hinges on the notion of marked variables in a set of tgds.

Definition 4.

Consider a tgd $\sigma\in\Sigma$ , and a variable $x$ occurring in the body of $\sigma$ . We inductively define when $x$ is marked in $\Sigma$ as follows:

1. If there exists an atom $\alpha$ in the head of $\sigma$ such that $x$ does not occur in $\alpha$, then $x$ is marked in $\Sigma$; and

2. Assuming that there exists an atom $\alpha$ in the head of $\sigma$ such that $x$ occurs in $\alpha$, if there exists $\sigma^{\prime}\in\Sigma$ (not necessarily different than $\sigma$) and an atom $\beta$ in the body of $\sigma^{\prime}$ such that (i) $\alpha$ and $\beta$ have the same predicate and, (ii) each variable in $\beta$ that occurs at a position of $\mathit{pos}(\alpha,x)$ is marked in $\Sigma$, then $x$ is marked in $\Sigma$.

We are now ready to recall when a set of tgds is sticky:

Definition 5.

A set $\Sigma$ of tgds is sticky if, for each $\sigma\in\Sigma$, and for each variable $x$ occurring in the body of $\sigma$, the following holds: if $x$ is marked in $\Sigma$, then $x$ occurs only once in the body of $\sigma$.
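Definitions 4 and 5 translate into a fixpoint computation, sketched below. The representation is an assumption made for illustration: a tgd is a pair (body, head) of lists of atoms, an atom is a pair of a predicate name and a tuple of variables, and atoms contain no constants. Since tgds do not share variables, a marked variable is recorded together with the index of its tgd. The function names are hypothetical.

```python
def body_variables(body):
    """Return the set of variables occurring in a list of body atoms."""
    return {v for _, args in body for v in args}


def marked_variables(tgds):
    """Compute the pairs (tgd index, variable) that are marked (Definition 4)."""
    marked = set()
    # Base step: a body variable missing from some head atom is marked.
    for i, (body, head) in enumerate(tgds):
        for x in body_variables(body):
            if any(x not in args for _, args in head):
                marked.add((i, x))

    def propagates(i, x):
        # Inductive step: x occurs in a head atom alpha, and some body atom
        # beta with the same predicate has marked variables at pos(alpha, x).
        for pred, args in tgds[i][1]:
            positions = [k for k, v in enumerate(args) if v == x]
            if not positions:
                continue
            for j, (other_body, _) in enumerate(tgds):
                for other_pred, other_args in other_body:
                    if other_pred == pred and all(
                        (j, other_args[k]) in marked for k in positions
                    ):
                        return True
        return False

    changed = True
    while changed:  # iterate until no new variable gets marked
        changed = False
        for i, (body, _) in enumerate(tgds):
            for x in body_variables(body):
                if (i, x) not in marked and propagates(i, x):
                    marked.add((i, x))
                    changed = True
    return marked


def is_sticky(tgds):
    """Check Definition 5: marked variables occur only once in their body."""
    return all(
        sum(args.count(x) for _, args in tgds[i][0]) <= 1
        for i, x in marked_variables(tgds)
    )


# R(x,y), P(y,z) -> T(x,y,w)   and   T(x,y,z) -> S(y,w)
sticky_example = [
    ([("R", ("x", "y")), ("P", ("y", "z"))], [("T", ("x", "y", "w"))]),
    ([("T", ("x", "y", "z"))], [("S", ("y", "w"))]),
]
# Same first tgd, but the second one now drops y:  T(x,y,z) -> S(x,w)
non_sticky_example = [
    ([("R", ("x", "y")), ("P", ("y", "z"))], [("T", ("x", "y", "w"))]),
    ([("T", ("x", "y", "z"))], [("S", ("x", "w"))]),
]
```

In the first example the join variable `y` of the first tgd is never marked, so the set is sticky; in the second example `y` is dropped by the second tgd, the marking propagates back to the first tgd, and `y` occurs twice in its body.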

PROOFS OF SECTION 3

Proof of Proposition 5

Consider an OMQ $Q=(\mathbf{S},\Sigma,q({\bar{x}}))\in(\mathbb{C},\mathbb{CQ})$ , where $\mathbb{C}$ is a class of tgds, an $\mathbf{S}$ -database $D$ , and a tuple ${\bar{c}}\in\mathit{dom}(D)^{|{\bar{x}}|}$ . We show that:

[TABLE]

( $\Rightarrow$ ) Assume that $Q_{1}\not\subseteq Q_{2}$. This implies that there exists a $\mathit{sch}(\Sigma)$-database $D^{\prime}$, and a tuple ${\bar{t}}$ of constants such that ${\bar{t}}\in q_{D,{\bar{c}}}(D^{\prime})$ and ${\bar{t}}\not\in q(\mathit{chase}(D^{\prime},\Sigma))$. Due to the monotonicity of CQs, ${\bar{t}}\in q_{D,{\bar{c}}}(\mathit{chase}(D^{\prime},\Sigma))$. Since, by construction, the instance $\mathit{chase}(D^{\prime},\Sigma)$ satisfies $\Sigma$, we conclude that $q_{D,{\bar{c}}}\not\subseteq_{\Sigma}q$. (This is the standard notation for the fact that $q_{D,{\bar{c}}}(I)\not\subseteq q(I)$, for some (possibly infinite) instance $I$ that satisfies $\Sigma$.) By exploiting the well-known characterization of CQ containment in terms of the chase, we get that ${\bar{c}}\not\in q(\mathit{chase}(D,\Sigma))$, which is equivalent to ${\bar{c}}\not\in Q(D)$, as needed.

( $\Leftarrow$ ) Conversely, assume that ${\bar{c}}\not\in Q(D)$ , or, equivalently, ${\bar{c}}\not\in q(\mathit{chase}(D,\Sigma))$ . This implies that ${\bar{c}}\not\in Q_{2}(D)$ . Observe that ${\bar{c}}\in q_{D,{\bar{c}}}(D)$ holds trivially, which in turn implies that ${\bar{c}}\in Q_{1}(D)$ . Therefore, $Q_{1}\not\subseteq Q_{2}$ , and the claim follows.

Proof of Proposition 9

The construction underlying Proposition 9 relies on the idea of encoding boolean operations (in our case the ‘or’ operator) using a set of atoms; this idea has been exploited in several other works; see, e.g., **[14, 21, 41]**. Let $Q=(\mathbf{S},\Sigma,q)\in(\mathbb{C},\mathbb{UCQ})$ . Our goal is to construct in polynomial time $Q^{\prime}=(\mathbf{S},\Sigma^{\prime},q^{\prime})\in(\mathbb{C},\mathbb{CQ})$ such that $Q\equiv Q^{\prime}$ . We assume, w.l.o.g., that the predicates of $\mathbf{S}$ do not appear in the head of a tgd of $\Sigma$ ; we can copy the content of a relation $R/k\in\mathbf{S}$ into an auxiliary predicate $R^{\star}/k$ , using the tgd $R(x_{1},\ldots,x_{k})\rightarrow R^{\star}(x_{1},\ldots,x_{k})$ , while staying inside $\mathbb{C}$ , and then rename each predicate $P$ in $\Sigma$ and $q$ with $P^{\star}$ . The set $\Sigma^{\prime}$ consists of the following tgds:

1. For every $R/k\in\mathbf{S}$ :

[TABLE]

These tgds annotate the database atoms with the truth constant true, indicating that these are true atoms.

2. Assuming that $q=\exists{\bar{y}}\,\phi({\bar{x}},{\bar{y}})$ , a tgd:

[TABLE]

where $\phi^{\prime}_{\wedge}$ is the conjunction of atoms in $\phi$ , after replacing each atom $R(v_{1},\ldots,v_{k})$ with $R^{\prime}(v_{1},\ldots,v_{k},f)$ , and $\psi$ is the conjunction of atoms

[TABLE]

This tgd generates a “copy” of the atoms in $q$ , while annotating them with a null value that represents the truth constant false, indicating that they are not necessarily true atoms. Moreover, the truth table of ‘or’ is generated.

3. Finally, for each tgd $\phi({\bar{x}},{\bar{y}})\rightarrow\exists{\bar{z}}\,\psi({\bar{x}},{\bar{z}})$ in $\Sigma$ , a tgd

[TABLE]

where $\phi^{\prime}$ and $\psi^{\prime}$ are obtained from $\phi$ and $\psi$ , respectively, by replacing each atom $R(v_{1},\ldots,v_{k})$ with $R^{\prime}(v_{1},\ldots,v_{k},w)$ . In fact, this is the actual set of tgds $\Sigma$ , with the difference that the value at the last position of each atom (which indicates whether it is true or false) is propagated to the inferred atoms.

Now, assuming that $q=q_{1}\vee\cdots\vee q_{n}$ , the CQ $q^{\prime}$ is defined as follows; let ${\bar{x}}=x_{1}\ldots x_{n}$ and ${\bar{y}}=y_{1}\ldots y_{n+1}$ :

[TABLE]

where ${\bar{x}}$ and ${\bar{y}}$ are fresh variables not in $q$ , and $q^{\prime}_{i}[x_{i}]$ is obtained from $q_{i}$ by replacing each atom $R(v_{1},\ldots,v_{k})$ with $R^{\prime}(v_{1},\ldots,v_{k},x_{i})$ . This completes our construction.

It is not difficult to show that $Q\equiv Q^{\prime}$ , or, equivalently, for every $\mathbf{S}$ -database $D$ , $q(\mathit{chase}(D,\Sigma))=q^{\prime}(\mathit{chase}(D,\Sigma^{\prime}))$ . The key observation is that in order to satisfy $\text{\rm True}(y_{n+1})$ in the CQ $q^{\prime}$ , at least one of the $x_{i}$ ’s must be mapped to $1$ , which means that at least one $q_{i}$ is satisfied by $\mathit{chase}(D,\Sigma)$ . Finally, it is easy to verify that, for each $\mathbb{C}\in\{\mathbb{G},\mathbb{L},\mathbb{NR},\mathbb{S}\}$ , $\Sigma\in\mathbb{C}$ implies $\Sigma^{\prime}\in\mathbb{C}$ , and Proposition 9 follows.

PROOFS OF SECTION 4

Proof of Proposition 10

We assume that $q({\bar{x}})=\bigvee_{i=1}^{n}q_{i}({\bar{x}})$ is a UCQ rewriting of $Q$ . Since, by hypothesis, $Q\not\subseteq Q^{\prime}$ , we conclude that $q\not\subseteq Q^{\prime}$ , which in turn implies that there exists $i\in\{1,\ldots,n\}$ such that $q_{i}\not\subseteq Q^{\prime}$ . Let $c({\bar{x}})$ be a tuple of constants obtained by replacing each variable $x$ in ${\bar{x}}$ with the constant $c(x)$ , and $D_{q_{i}}$ the $\mathbf{S}$ -database obtained from $q_{i}$ after replacing each variable $x$ in $q_{i}$ with the constant $c(x)$ . We show that:

Lemma 33

$c({\bar{x}})\not\in Q^{\prime}(D_{q_{i}})$ .

Proof.

Since $q_{i}\not\subseteq Q^{\prime}$ , there exists an $\mathbf{S}$ -database $D$ , and a tuple of constants ${\bar{t}}$ such that ${\bar{t}}\in q_{i}(D)$ and ${\bar{t}}\not\in Q^{\prime}(D)$ . Clearly, there exists a homomorphism $h$ such that $h(q_{i})\subseteq D$ and $h({\bar{x}})={\bar{t}}$ . Observe also that $\rho(D_{q_{i}})\subseteq D$ , where $\rho=h\circ c^{-1}$ . Towards a contradiction, assume that $c({\bar{x}})\in Q^{\prime}(D_{q_{i}})$ . This implies that there exists a homomorphism $\gamma$ such that $\gamma(q^{\prime})\subseteq\mathit{chase}(D_{q_{i}},\Sigma)$ and $\gamma({\bar{y}})=c({\bar{x}})$ , where $Q^{\prime}=(\mathbf{S},\Sigma,q^{\prime}({\bar{y}}))$ . It is not difficult to see that there exists an extension $\rho^{\prime}$ of $\rho$ such that $\rho^{\prime}(\mathit{chase}(D_{q_{i}},\Sigma))\subseteq\mathit{chase}(D,\Sigma)$ and $\rho^{\prime}({\bar{x}})={\bar{t}}$ . Hence, $\rho^{\prime}(\gamma(q^{\prime}))\subseteq\mathit{chase}(D,\Sigma)$ , which implies that ${\bar{t}}\in q^{\prime}(\mathit{chase}(D,\Sigma))$ ; thus, ${\bar{t}}\in Q^{\prime}(D)$ . But this contradicts the fact that ${\bar{t}}\not\in Q^{\prime}(D)$ , and the claim follows. ∎

Observe that $c({\bar{x}})\in q(D_{q_{i}})$ , which immediately implies that $c({\bar{x}})\in Q(D_{q_{i}})$ . Consequently, by Lemma 33, $Q(D_{q_{i}})\not\subseteq Q^{\prime}(D_{q_{i}})$ . The claim follows since, by construction, $D_{q_{i}}$ is an $\mathbf{S}$ -database such that $|D_{q_{i}}|\leq f_{\mathbb{O}}(Q)$ .
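The database $D_{q_{i}}$ used in this proof is the usual canonical ("frozen") database of a CQ. The following is a minimal sketch of that freezing step, not part of the original proof. It assumes that atoms are encoded as (predicate, arguments) pairs, that all arguments are variables, and that the constant $c(x)$ is represented by the hypothetical string `c_x`.

```python
# Sketch only: the atom encoding and the "c_" naming scheme are assumptions
# made for illustration; all arguments of the CQ are taken to be variables.

def freeze(var):
    """The constant c(x) associated with the variable x (hypothetical naming)."""
    return "c_" + var

def canonical_database(body):
    """Replace each variable x in the body of the CQ with the constant c(x)."""
    return {(pred, tuple(freeze(v) for v in args)) for pred, args in body}

def frozen_answer(free_vars):
    """The tuple c(x-bar) obtained from the free variables of the CQ."""
    return tuple(freeze(v) for v in free_vars)

# q_i(x) = exists y (R(x, y) and P(y))
body = [("R", ("x", "y")), ("P", ("y",))]
D = canonical_database(body)
answer = frozen_answer(("x",))
```

By construction the database has at most as many atoms as the CQ, which is what the size bound at the end of the proof relies on.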

The Algorithm $\mathsf{XRewrite}$

Since the rewriting algorithm $\mathsf{XRewrite}$ is heavily used in our complexity analysis, we recall its definition. This algorithm is based on resolution, and thus, before we proceed further, we need to recall the crucial notion of unification. A set of atoms $A=\{\alpha_{1},\ldots,\alpha_{n}\}$ , where $n\geqslant 2$ , unifies if there exists a substitution $\gamma$ , called a unifier for $A$ , such that $\gamma(\alpha_{1})=\cdots=\gamma(\alpha_{n})$ . A most general unifier (MGU) for $A$ is a unifier for $A$ , denoted as $\gamma_{A}$ , such that for each other unifier $\gamma$ for $A$ , there exists a substitution $\gamma^{\prime}$ such that $\gamma=\gamma^{\prime}\circ\gamma_{A}$ . Notice that if a set of atoms unifies, then there exists an MGU. Furthermore, the MGU for a set of atoms is unique (modulo variable renaming).
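To make the notion of an MGU concrete, here is a minimal sketch for function-free atoms. It is an illustration only: the encoding (atoms as (predicate, arguments) pairs, variables as strings with a leading `?`) and the function names are assumptions, not notation from $\mathsf{XRewrite}$.

```python
# Sketch for function-free atoms: an atom is (predicate, args), a term is a
# string, and variables are marked by a leading "?" (an encoding assumed here).

def is_var(term):
    return term.startswith("?")

def resolve(term, subst):
    """Follow variable bindings until an unbound variable or a constant."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term

def mgu(atoms):
    """Return an MGU of the given atoms as a dict, or None if none exists."""
    subst = {}
    pred, args = atoms[0]
    for other_pred, other_args in atoms[1:]:
        if other_pred != pred or len(other_args) != len(args):
            return None
        for s, t in zip(args, other_args):
            s, t = resolve(s, subst), resolve(t, subst)
            if s == t:
                continue
            if is_var(s):
                subst[s] = t
            elif is_var(t):
                subst[t] = s
            else:
                return None  # two distinct constants cannot be unified
    # Map every bound variable directly to its final image.
    return {v: resolve(v, subst) for v in subst}

def apply(subst, atom):
    pred, args = atom
    return (pred, tuple(subst.get(a, a) for a in args))

atoms = [("R", ("?x", "?y")), ("R", ("?z", "a")), ("R", ("?z", "?w"))]
gamma = mgu(atoms)
images = {apply(gamma, a) for a in atoms}  # a single atom if gamma is a unifier
```

No occurs check is needed because terms are only variables and constants; with function symbols the sketch would not be sufficient.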

The algorithm proceeds by exhaustively applying two steps: rewriting and factorization, which in turn rely on the technical notions of applicability and factorizability, respectively. We assume, w.l.o.g., that tgds and CQs do not share variables. Given a CQ $q$ , a variable $x$ is called shared in $q$ if $x$ is a free variable of $q$ , or it occurs more than once in $q$ . In what follows, we assume, w.l.o.g., that tgds are in normal form, i.e., they have only one atom in the head, and only one occurrence of an existentially quantified variable **[27]**. We write $\pi_{\exists}(\sigma)$ for the position at which the existentially quantified variable of $\sigma$ occurs; in case $\sigma$ does not mention an existentially quantified variable, then $\pi_{\exists}(\sigma)=\varepsilon$ . (Recall that a position $P[i]$ identifies the $i$ -th attribute of a predicate $P$ .) We are now ready to recall applicability and factorizability; in what follows, we write $\mathit{body}(q)$ for the set of atoms occurring in $q$ , and $\mathit{head}(\sigma)$ for the head-atom of $\sigma$ .

Definition 6.

(Applicability) Consider a CQ $q$ and a tgd $\sigma$ . Given a set of atoms $S\subseteq\mathit{body}(q)$ , we say that $\sigma$ is applicable to $S$ if the following conditions are satisfied:

1. the set $S\cup\{\mathit{head}(\sigma)\}$ unifies, and

2. for each $\alpha\in S$ , if the term at position $\pi$ in $\alpha$ is either a constant or a shared variable in $q$ , then $\pi\neq\pi_{\exists}(\sigma)$ .

Roughly, whenever $\sigma$ is applicable to $S$ , this means that the atoms of $S$ may be generated during the chase procedure by applying $\sigma$ . Therefore, we are allowed to apply a rewriting step (which is essentially a resolution step) that resolves $S$ using $\sigma$ , i.e., $S$ is replaced by $\mathit{body}(\sigma)$ , and a new CQ that is closer to the input database is obtained.
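The two applicability conditions can be checked mechanically. The sketch below is a hedged illustration under our own encoding: atoms are (predicate, arguments) pairs, variables carry a leading `?`, the tgd is in normal form and is given only by its head atom, and `exist_pos` is the 0-based index of $\pi_{\exists}(\sigma)$ (or `None` for $\varepsilon$). The sample query and tgd are hypothetical.

```python
from collections import Counter

def is_var(term):
    return term.startswith("?")

def unifies(atoms):
    """Function-free unification test that merges terms position by position."""
    parent = {}

    def find(t):
        while t in parent:
            t = parent[t]
        return t

    pred, args = atoms[0]
    for other_pred, other_args in atoms[1:]:
        if other_pred != pred or len(other_args) != len(args):
            return False
        for s, t in zip(args, other_args):
            s, t = find(s), find(t)
            if s == t:
                continue
            if not is_var(s):
                s, t = t, s
            if not is_var(s):
                return False  # two distinct constants
            parent[s] = t
    return True

def shared_variables(free_vars, body):
    """Free variables plus variables occurring more than once in the CQ."""
    counts = Counter(t for _, args in body for t in args if is_var(t))
    return set(free_vars) | {v for v, c in counts.items() if c > 1}

def applicable(free_vars, body, S, head, exist_pos):
    """Conditions 1 and 2 of applicability for a normal-form tgd."""
    if not unifies(list(S) + [head]):
        return False
    if exist_pos is None:  # no existential variable: condition 2 is vacuous
        return True
    shared = shared_variables(free_vars, body)
    return all(is_var(args[exist_pos]) and args[exist_pos] not in shared
               for _, args in S)

# Hypothetical tgd P(u) -> exists w R(w, u); its existential position is R[1].
head, pos = ("R", ("?w", "?u")), 0
q_body = [("R", ("?x", "?y")), ("R", ("?x", "?z"))]  # Boolean CQ, x is shared
blocked = applicable((), q_body, [q_body[0]], head, pos)
allowed = applicable((), [("R", ("?x", "?y"))], [("R", ("?x", "?y"))], head, pos)
```

In this instance the first check fails because `?x` occurs twice in the query, while the second succeeds once the redundant atom is gone.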

If we start applying rewriting steps blindly, without checking for applicability, then the soundness of the rewriting procedure is not guaranteed. However, it is possible that the applicability condition is not satisfied, but still we should apply a rewriting step. This may happen due to the presence of redundant atoms in a query. For example, given the CQ

[TABLE]

and the tgd

[TABLE]

the applicability condition fails since the shared variable $x$ in $q$ occurs at the position $\pi_{\exists}(\sigma)=R[1]$ . However, $q$ is essentially the CQ $q=\exists x\exists y\,R(x,y)$ , and now the applicability condition is satisfied. From the above informal discussion, we conclude that the applicability condition may prevent the algorithm from being complete since some valid rewriting steps are blocked. For this reason, we need the so-called factorization step, which aims at converting some shared variables into non-shared variables, so that the applicability condition becomes satisfied. In general, this can be achieved by exhaustively unifying all the atoms that unify in the body of a CQ. However, some of these unifications do not contribute in any way to satisfying the applicability condition, and, as a result, many superfluous CQs are generated. It is thus better to apply a restricted form of factorization that generates a possibly small number of CQs that are vital for the completeness of the rewriting algorithm. This corresponds to the identification of all the atoms in the query whose shared existential variables come from the same atom in the chase, and they can be unified with no loss of information. Summing up, the key idea underlying the notion of factorizability is as follows: in order to apply the factorization step, there must exist a tgd that can be applied to its output. (Footnote 7: Let us clarify that, for the purposes of the present work, we can rely on the naive approach of exhaustively unifying all the atoms that unify in the body of a CQ. However, we would like to be consistent with [40], where the algorithm $\mathsf{XRewrite}$ is proposed, and thus, we stick to the slightly more involved notion of factorizability.)

Definition 7.

(Factorizability) Consider a CQ $q$ and a tgd $\sigma$ . Given a set of atoms $S\subseteq\mathit{body}(q)$ , where $|S|\geqslant 2$ , we say that $S$ is factorizable w.r.t. $\sigma$ if the following conditions are satisfied:

1. $S$ unifies,

2. $\pi_{\exists}(\sigma)\neq\varepsilon$ , and

3. there exists a variable $x\not\in\mathit{var}(\mathit{body}(q)\setminus S)$ that occurs in every atom of $S$ only at position $\pi_{\exists}(\sigma)$ .
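As a hedged illustration of Definition 7, consider the following small instance. The query and the tgd are chosen by us for illustration and are not claimed to be the ones used in the example preceding the definition.

```latex
\begin{align*}
q &= \exists x \exists y \exists z \, \bigl( R(x,y) \wedge R(x,z) \bigr),
& \sigma &= P(u) \rightarrow \exists w \, R(w,u),
& \pi_{\exists}(\sigma) &= R[1], \\
S &= \{ R(x,y),\, R(x,z) \},
& \gamma_{S} &= \{ z \mapsto y \},
& \gamma_{S}(q) &= \exists x \exists y \, R(x,y).
\end{align*}
% Condition 1: S unifies via \gamma_S.
% Condition 2: \pi_\exists(\sigma) = R[1] \neq \varepsilon.
% Condition 3: x \notin var(body(q) \setminus S) = \emptyset, and x occurs in
%              both atoms of S only at position R[1].
```

Hence $S$ is factorizable w.r.t. $\sigma$ . In $\gamma_{S}(q)$ the variable $x$ is no longer shared, so $\sigma$ becomes applicable to $\{R(x,y)\}$ .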

Having the above key notions in place, we are now ready to recall the algorithm $\mathsf{XRewrite}$ , which is depicted in Algorithm 1. As said above, the UCQ rewriting of an OMQ $Q=(\mathbf{S},\Sigma,q)$ is computed by exhaustively applying (i.e., until a fixpoint is reached) the rewriting and the factorization steps. Notice that the CQs that are the result of the factorization step are nothing other than auxiliary queries which are critical for the completeness of the final rewriting, but are not needed in the final rewriting. Thus, during the iterative procedure, the queries are labeled with $\mathsf{r}$ (resp., $\mathsf{f}$ ) in order to keep track of which of them are generated by the rewriting (resp., factorization) step. The CQ that is part of the input OMQ, although it is not a result of the rewriting step, is labeled by $\mathsf{r}$ since it must be part of the final rewriting. Moreover, once the two crucial steps have been exhaustively applied on a CQ $q$ , it is not necessary to revisit $q$ since this will lead to redundant queries. Hence, the queries are also labeled with $\mathsf{e}$ (resp., $\mathsf{u}$ ) indicating that a query has been already explored (resp., is unexplored). Let us now describe the two main steps of the algorithm. In the sequel, consider a triple ${(q,x,y)}$ , where ${(x,y)}\in\{\mathsf{r},\mathsf{f}\}\times\{\mathsf{e},\mathsf{u}\}$ (this is how we indicate that $q$ is labeled by $x$ and $y$ ), and a tgd $\sigma\in\Sigma$ . We assume that $q$ is of the form $\exists{\bar{x}}\,\varphi({\bar{x}},{\bar{y}})$ .

Rewriting Step.

For each $S\subseteq\mathit{body}(q)$ such that $\sigma$ is applicable to $S$ , the $i$ -th application of the rewriting step generates the query $q^{\prime}=\gamma_{S,\sigma^{i}}(q[S/\mathit{body}(\sigma^{i})])$ , where $\sigma^{i}$ is the tgd obtained from $\sigma$ by replacing each variable $x$ with $x^{i}$ , $\gamma_{S,\sigma^{i}}$ is the MGU for the set $S\cup\{\mathit{head}(\sigma^{i})\}$ (which is the identity on the variables that appear in the body but not in the head of $\sigma^{i}$ ), and $q[S/\mathit{body}(\sigma^{i})]$ is obtained from $q$ by replacing $S$ with $\mathit{body}(\sigma^{i})$ . By considering $\sigma^{i}$ (instead of $\sigma$ ) we basically rename, using the integer $i$ , the variables of $\sigma$ . This renaming step is needed in order to avoid undesirable clashes among the variables introduced during different applications of the rewriting step. Finally, if there is no ${(q^{\prime\prime},\mathsf{r},\star)}\in Q_{\textsc{rew}}$ , i.e., an (explored or unexplored) query that is the result of the rewriting step, such that $q^{\prime}$ and $q^{\prime\prime}$ are the same (modulo bijective variable renaming), denoted $q^{\prime}\simeq q^{\prime\prime}$ , then ${(q^{\prime},\mathsf{r},\mathsf{u})}$ is added to $Q_{\textsc{rew}}$ .

Factorization Step.

For each $S\subseteq\mathit{body}(q)$ that is factorizable w.r.t. $\sigma$ , the factorization step generates the query $q^{\prime}=\gamma_{S}(q)$ , where $\gamma_{S}$ is the MGU for $S$ . If there is no ${(q^{\prime\prime},\star,\star)}\in Q_{\textsc{rew}}$ , i.e., a query that is the result of the rewriting or the factorization step, and is explored or unexplored, such that $q^{\prime}\simeq q^{\prime\prime}$ , then ${(q^{\prime},\mathsf{f},\mathsf{u})}$ is added to $Q_{\textsc{rew}}$ .

Proof of Proposition 14

We assume, w.l.o.g., that the predicates of $\mathbf{S}$ do not appear in the head of a tgd of $\Sigma$ . Since $\Sigma\in\mathbb{NR}$ , by Lemma 32, $\Sigma$ admits a stratification $\{\Sigma_{1},\ldots,\Sigma_{n}\}$ with stratification function $\mu:\mathit{sch}(\Sigma)\rightarrow\{0,\ldots,n\}$ . Let us briefly explain how the rewriting tree $T_{Q}$ of the OMQ $Q=(\mathbf{S},\Sigma,q)$ is defined. $T_{Q}$ is a rooted tree with $q$ being its root. The $i$ -th level of $T_{Q}$ consists of the CQs obtained from the CQs of the $(i-1)$ -th level by applying rewriting steps (see the algorithm $\mathsf{XRewrite}$ for details on the rewriting step) using only tgds from $\Sigma_{n-i+1}$ . It is easy to verify that the CQs of the $i$ -th level contain only predicates $P$ such that $\mu(P)<n-i+1$ . It is now clear that the $n$ -th level of $T_{Q}$ (i.e., the leaves of $T_{Q}$ ) consists only of CQs obtained during the execution of $\mathsf{XRewrite}(Q)$ that contain only predicates of $\mathbf{S}$ . Thus, in order to obtain the desired upper bound, it suffices to show that the number of atoms that occur in a CQ that is a leaf of $T_{Q}$ is at most $|q|\cdot\left(\max_{\tau\in\Sigma}\{|\mathit{body}(\tau)|\}\right)^{|\mathit{sch}(\Sigma)|}$ . To this end, let us focus on one branch $B$ of $T_{Q}$ from the root $q$ to a leaf $q^{\prime}$ . Such a branch can be naturally represented as a $k$ -ary forest $F_{Q}^{B}$ , where the root nodes are the atoms of $q$ , and whenever an atom $\alpha$ is resolved during the rewriting step using a tgd $\tau$ , the atoms of $\mathit{body}(\tau)$ , after applying the appropriate MGU, are the child nodes of $\alpha$ . Therefore, to obtain the desired upper bound, it suffices to show that the number of leaves of $F_{Q}^{B}$ is at most $|q|\cdot\left(\max_{\tau\in\Sigma}\{|\mathit{body}(\tau)|\}\right)^{|\mathit{sch}(\Sigma)|}$ . 
By construction, $F_{Q}^{B}$ consists, in general, of $|q|$ $k$ -ary rooted trees, where $k=\max_{\tau\in\Sigma}\{|\mathit{body}(\tau)|\}$ , of depth $n$ . Hence, the number of leaves of $F_{Q}^{B}$ is at most $|q|\cdot\left(\max_{\tau\in\Sigma}\{|\mathit{body}(\tau)|\}\right)^{n}$ . Since $n\leq|\mathit{sch}(\Sigma)|$ , the claim follows.
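For a sense of scale, the bound can be instantiated on a hypothetical OMQ (the numbers are ours, chosen only for illustration) with $|q|=3$ , tgd bodies of at most $k=2$ atoms, and a stratification with $n=4$ strata:

```latex
\[
|q| \cdot k^{n} \;=\; 3 \cdot 2^{4} \;=\; 48
\;\le\; 3 \cdot 2^{|\mathit{sch}(\Sigma)|}
\;=\; |q| \cdot \Bigl( \max_{\tau \in \Sigma} \{ |\mathit{body}(\tau)| \} \Bigr)^{|\mathit{sch}(\Sigma)|},
\qquad \text{since } n \le |\mathit{sch}(\Sigma)| .
\]
```

So every CQ at a leaf of the rewriting tree of this hypothetical query has at most 48 atoms.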

Proof of Theorem 16

A proof sketch for the coNExpTime ${}^{\textsc{NP}}$ upper bound is given in the main body of the paper. We proceed to establish the P ${}^{\textsc{NEXP}}$ -hardness. Our proof is by reduction from a tiling problem that has been recently introduced in **[34]**, which in turn relies on the standard Exponential Tiling Problem. Let us first recall the latter problem.

An instance of the Exponential Tiling Problem is a tuple $(n,m,H,V,s)$ , where $n,m$ are numbers (in unary), $H,V$ are subsets of $\{1,\ldots,m\}\times\{1,\ldots,m\}$ , and $s$ is a sequence of numbers of $\{1,\ldots,m\}$ . Such a tuple specifies that we desire a $2^{n}\times 2^{n}$ grid, where each cell is tiled with a tile from $\{1,\ldots,m\}$ . $H$ (resp., $V$ ) is the horizontal (resp., vertical) compatibility relation, while $s$ represents a constraint on the initial part of the first row of the grid. A solution to such an instance of the Exponential Tiling Problem is a function $f:\{0,\ldots,2^{n}-1\}\times\{0,\ldots,2^{n}-1\}\rightarrow\{1,\ldots,m\}$ such that:

1. $f(i,0)=s[i]$ , for each $0\leq i\leq(|s|-1)$ ;

2. $(f(i,j),f(i+1,j))\in H$ , for each $0\leq i\leq 2^{n}-2$ and $0\leq j\leq 2^{n}-1$ ; and

3. $(f(i,j),f(i,j+1))\in V$ , for each $0\leq i\leq 2^{n}-1$ and $0\leq j\leq 2^{n}-2$ .

We will refer to $\{0,\ldots,2^{n}-1\}\times\{0,\ldots,2^{n}-1\}$ as a grid, with the pairs in it being cells. A cell consists of two coordinates, the column-coordinate (for short col-coordinate) and the row-coordinate, and any function on a grid is a tiling. The Exponential Tiling Problem is defined as follows: given an instance $T$ as above, decide whether $T$ has a solution. It is known that this problem is NExpTime-hard (see, e.g., Section 3.2 of **[46]**).
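The three conditions that define a solution translate directly into a checker. The sketch below is an illustration only: the representation of a tiling as a dictionary from cells $(i,j)$ to tiles, and the function name, are our assumptions.

```python
def is_solution(n, m, H, V, s, f):
    """Check whether f (a dict from cells (i, j) to tiles) solves (n, m, H, V, s)."""
    size = 2 ** n
    cells = [(i, j) for i in range(size) for j in range(size)]
    # f must assign a tile from {1, ..., m} to every cell of the grid.
    if any(f.get(cell) not in range(1, m + 1) for cell in cells):
        return False
    # Condition 1: the initial part of the first row agrees with s.
    if any(f[(i, 0)] != s[i] for i in range(len(s))):
        return False
    # Condition 2: horizontally adjacent cells are compatible w.r.t. H.
    if any((f[(i, j)], f[(i + 1, j)]) not in H
           for i in range(size - 1) for j in range(size)):
        return False
    # Condition 3: vertically adjacent cells are compatible w.r.t. V.
    return all((f[(i, j)], f[(i, j + 1)]) in V
               for i in range(size) for j in range(size - 1))

# A 2 x 2 checkerboard over two tiles that must alternate in both directions.
ALT = {(1, 2), (2, 1)}
checkerboard = {(0, 0): 1, (1, 0): 2, (0, 1): 2, (1, 1): 1}
ok = is_solution(1, 2, ALT, ALT, [1], checkerboard)
bad = is_solution(1, 2, ALT, ALT, [2], checkerboard)
```

Verifying a candidate is polynomial in the size of the grid; the hardness comes from the grid being exponential in $n$ , which is given in unary.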

We are now ready to recall the tiling problem introduced in **[34]**, called Extended Tiling Problem (ETP), which is P ${}^{\textsc{NEXP}}$ -hard. An instance of this problem is a tuple $(k,n,m,H_{1},V_{1},H_{2},V_{2})$ , where $k,n,m$ are numbers (in unary), and $H_{1},V_{1},H_{2},V_{2}$ are subsets of $\{1,\ldots,m\}\times\{1,\ldots,m\}$ . The question is as follows: is it the case that for every sequence $s$ , where $|s|=k$ , of numbers of $\{1,\ldots,m\}$ , $(n,m,H_{1},V_{1},s)$ has no solution or $(n,m,H_{2},V_{2},s)$ has a solution?
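The quantifier structure of the ETP (for every $s$ , no solution for the first instance or a solution for the second) can be made explicit by exhaustive search. The sketch below is only meant for toy instances such as $n=1$ ; it is doubly exponential in general, and the function names are our own.

```python
from itertools import product

def has_solution(n, m, H, V, s):
    """Brute-force search over all tilings of the 2^n x 2^n grid."""
    size = 2 ** n
    cells = [(i, j) for j in range(size) for i in range(size)]
    for tiles in product(range(1, m + 1), repeat=len(cells)):
        f = dict(zip(cells, tiles))
        if (all(f[(i, 0)] == s[i] for i in range(len(s)))
                and all((f[(i, j)], f[(i + 1, j)]) in H
                        for i in range(size - 1) for j in range(size))
                and all((f[(i, j)], f[(i, j + 1)]) in V
                        for i in range(size) for j in range(size - 1))):
            return True
    return False

def etp(k, n, m, H1, V1, H2, V2):
    """For every s of length k: (n,m,H1,V1,s) unsolvable or (n,m,H2,V2,s) solvable."""
    return all(
        (not has_solution(n, m, H1, V1, s)) or has_solution(n, m, H2, V2, s)
        for s in product(range(1, m + 1), repeat=k)
    )

ALT = {(1, 2), (2, 1)}                            # tiles must alternate
ANY = {(a, b) for a in (1, 2) for b in (1, 2)}    # no constraint at all
yes_instance = etp(1, 1, 2, ALT, ALT, ANY, ANY)
no_instance = etp(1, 1, 2, ANY, ANY, set(), set())
```

The first call is a yes-instance because the second tiling problem is always solvable; the second is a no-instance because the first problem is always solvable while the second never is.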

We give a reduction from the ETP to ${\sf Cont}(\mathbb{NR},\mathbb{CQ})$ . More precisely, given an instance $T=(k,n,m,H_{1},V_{1},H_{2},V_{2})$ of the ETP, our goal is to construct in polynomial time two queries $Q_{i}=(\mathbf{S},\Sigma_{i},q_{i})\in(\mathbb{NR},\mathbb{CQ})$ , for $i\in\{1,2\}$ , such that $T$ has a solution iff $Q_{1}\subseteq Q_{2}$ .

Data Schema $\mathbf{S}$

The data schema $\mathbf{S}$ consists of:

• $0$ -ary predicates $C_{i}^{j}$ , for each $i\in\{0,\ldots,k-1\}$ and $j\in\{1,\ldots,m\}$ ; the atom $C_{i}^{j}$ indicates that $s_{i}=j$ .

The Query $Q_{1}$

The goal of the query $Q_{1}$ is twofold: (i) to check that the so-called existence property of the input database, i.e., for every $i\in\{0,\ldots,k-1\}$ , there exists at least one atom of the form $C_{i}^{j}$ , is satisfied, and (ii) to check whether $(n,m,H_{1},V_{1},s)$ , where $s$ is the sequence of tilings encoded in the input database, has a solution. To this end, the query $Q_{1}$ will mention the following predicates:

• $0$ -ary predicate $C_{i}$ , indicating that there exists at least one atom of the form $C_{i}^{j}$ in the input database.

• $0$ -ary predicate ${\rm Existence}$ , indicating that the input database enjoys the existence property.

• Unary predicate ${\rm Tile}_{i}$ , for each $i\in\{1,\ldots,m\}$ ; the atom ${\rm Tile}_{i}(x)$ states that $x$ is the tile $i$ .

• Binary predicate $H$ ; the atom $H(x,y)$ encodes the fact that $(x,y)\in H_{1}$ .

• Binary predicate $V$ ; the atom $V(x,y)$ encodes the fact that $(x,y)\in V_{1}$ .

• $5$ -ary predicate $T_{i}$ , for each $i\in\{1,\ldots,n\}$ ; the atom $T_{i}(x,x_{1},x_{2},x_{3},x_{4})$ states that $x$ is a $2^{i}\times 2^{i}$ tiling obtained from the $2^{i-1}\times 2^{i-1}$ tilings $x_{1},\ldots,x_{4}$ – details on the inductive construction of $2^{i}\times 2^{i}$ tilings from $2^{i-1}\times 2^{i-1}$ tilings are given below.

• Unary predicate ${\rm Initial}_{i}$ , for each $i\in\{0,\ldots,k-1\}$ ; the atom ${\rm Initial}_{i}(x)$ states that $s[i]=x$ , i.e., the $i$ -th element of the sequence $s$ is $x$ .

• Binary predicate ${\rm Top}_{i}^{j}$ , for each $i\in\{1,\ldots,n\}$ and $j\in\{0,\ldots,k-1\}$ ; the atom ${\rm Top}_{i}^{j}(x,y)$ states that in the $2^{i}\times 2^{i}$ tiling $x$ the tile at position $(j,0)$ is $y$ .

• $0$ -ary predicate ${\rm Tiling}$ , indicating that there exists a $2^{n}\times 2^{n}$ tiling that is compatible with the initial tiling $s$ encoded in the input database.

• $0$ -ary predicate ${\rm Goal}$ , which is derived whenever the predicates ${\rm Existence}$ and ${\rm Tiling}$ are derived.

$Q_{1}$ is defined as the query $(\mathbf{S},\Sigma_{1},{\rm Goal})$ , where $\Sigma_{1}$ consists of the following tgds:

• Checking for the existence property of the input database

For each $i\in\{0,\ldots,k-1\}$ and $j\in\{1,\ldots,m\}$ :

[TABLE]

and the tgd that checks for the existence property

[TABLE]

• Generate the tiles

[TABLE]

• Generate the compatibility relations

For each $(i,j)\in H_{1}$ :

[TABLE]

For each $(i,j)\in V_{1}$ :

[TABLE]

• Generate the $2^{n}\times 2^{n}$ tilings. The key idea is to inductively construct $2^{i}\times 2^{i}$ tilings from $2^{i-1}\times 2^{i-1}$ tilings. It is easy to verify that the grid in Figure 2(a) is a $2^{i}\times 2^{i}$ tiling iff the nine subgrids of it, shown in Figure 2(b), are $2^{i-1}\times 2^{i-1}$ tilings. This has been already observed in **[33]**, where Datalog with complex values is studied.

First, we construct tilings of size $2\times 2$ (the base case of the inductive construction):

[TABLE]

Then, we inductively construct tilings of larger size until we get tilings of size $2^{n}\times 2^{n}$ . This is done using the following tgds. For each $i\in\{2,\ldots,n\}$ :

[TABLE]

• Extract from the $2^{n}\times 2^{n}$ tilings the tiles at positions $(0,0),(1,0),\ldots,(k-1,0)$ . This is done using the following tgds:

[TABLE]

where $\ell=\lceil\log k\rceil$ . Moreover, for each $i\in\{\ell+1,\ldots,n\}$ :

[TABLE]

• Check whether there exists a $2^{n}\times 2^{n}$ tiling that is compatible with the sequence of tilings $s$

For each $i\in\{0,\ldots,k-1\}$ and $j\in\{1,\ldots,m\}$ :

[TABLE]

and the tgd

[TABLE]

• Finally, we have the output tgd

[TABLE]

This concludes the construction of $Q_{1}$ .
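The inductive step used in $\Sigma_{1}$ to generate larger tilings rests on the observation that a $2^{i}\times 2^{i}$ grid is a tiling iff nine $2^{i-1}\times 2^{i-1}$ subgrids are tilings. The sketch below checks this equivalence on small grids. It assumes, since Figure 2 is not reproduced here, that the nine subgrids are the windows obtained by sliding a $2^{i-1}\times 2^{i-1}$ window in steps of $2^{i-2}$ ; the list-of-rows encoding and the function names are also ours.

```python
def is_tiling(grid, H, V):
    """grid[r][c] is a tile; check every horizontal and vertical adjacency."""
    size = len(grid)
    return (all((grid[r][c], grid[r][c + 1]) in H
                for r in range(size) for c in range(size - 1))
            and all((grid[r][c], grid[r + 1][c]) in V
                    for r in range(size - 1) for c in range(size)))

def nine_subgrids(grid):
    """The nine half-size windows at offsets 0, size/4 and size/2 (size >= 4)."""
    size = len(grid)
    half, quarter = size // 2, size // 4
    for r0 in (0, quarter, half):
        for c0 in (0, quarter, half):
            yield [row[c0:c0 + half] for row in grid[r0:r0 + half]]

def is_tiling_via_subgrids(grid, H, V):
    """Decide validity of the large grid by looking only at the nine subgrids."""
    return all(is_tiling(sub, H, V) for sub in nine_subgrids(grid))

ALT = {(1, 2), (2, 1)}
board = [[1 + (r + c) % 2 for c in range(4)] for r in range(4)]  # checkerboard
direct = is_tiling(board, ALT, ALT)
inductive = is_tiling_via_subgrids(board, ALT, ALT)
```

Every pair of adjacent cells falls inside at least one of the nine windows, which is why the two checks agree; four disjoint quadrants alone would miss the adjacencies that cross the middle of the grid.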

The Query $Q_{2}$

The goal of the query $Q_{2}$ is twofold: (i) to check that the so-called uniqueness property of the input database, i.e., for every $i\in\{0,\ldots,k-1\}$ , there exists at most one atom of the form $C_{i}^{j}$ , is satisfied, and (ii) to check whether $(n,m,H_{2},V_{2},s)$ , where $s$ is the sequence of tilings encoded in the input database, has a solution. The query $Q_{2}$ mentions the same predicates as $Q_{1}$ , and is defined as $(\mathbf{S},\Sigma_{2},{\rm Goal})$ , where $\Sigma_{2}$ consists of the following tgds:

• Checking the uniqueness property

For each $i\in\{0,\ldots,k-1\}$ and $j,\ell\in\{1,\ldots,m\}$ with $j<\ell$ :

[TABLE]

• The rest of $\Sigma_{2}$ encodes the tiling problem $(n,m,H_{2},V_{2},s)$ in exactly the same way as $\Sigma_{1}$ encodes $(n,m,H_{1},V_{1},s)$ .

This concludes the construction of $Q_{2}$ .

Proof of Proposition 18

The set $\Sigma^{n}$ consists of the following tgds; for brevity, we write ${\bar{x}}_{i}^{j}$ for $x_{i},x_{i+1},\ldots,x_{j}$ (Footnote 8: A similar construction has been used in [40] for showing a lower bound on the size of a CQ in the UCQ rewriting of a $(\mathbb{S},\mathbb{CQ})$ OMQ.):

[TABLE]

while $q=\text{\rm Ans}(0,1)$ . It can be verified that, for every $\{S\}$ -database $D$ , $Q^{n}(D)\neq\varnothing$ implies that

[TABLE]

and thus, $|D|\geq 2^{n-2}$ . Let $Q=(\{S\},\Sigma^{\prime},q^{\prime})$ , where $\Sigma^{\prime}$ is a set of tgds and $q^{\prime}$ a Boolean CQ, and $D$ an $\{S\}$ -database. Clearly, $Q^{n}(D)\not\subseteq Q(D)$ iff $Q^{n}(D)\neq\varnothing$ and $Q(D)=\varnothing$ . This implies that $|D|\geq 2^{n-2}$ , and the claim follows.

Proof of Theorem 19

The coNExpTime upper bound, as well as the $\Pi_{2}^{P}$ -hardness in case of fixed-arity predicates, are discussed in the main body of the paper. Here, we show the coNExpTime-hardness. The proof proceeds in two steps:

1. First, we show that ${\sf Cont}((\mathbb{FNR},\mathbb{CQ}),(\mathbb{L},\mathbb{UCQ}))$ is coNExpTime-hard, where $\mathbb{FNR}$ denotes the class of full non-recursive sets of tgds, i.e., non-recursive sets of tgds without existentially quantified variables.

2. Then, we reduce ${\sf Cont}((\mathbb{FNR},\mathbb{CQ}),(\mathbb{L},\mathbb{UCQ}))$ to ${\sf Cont}((\mathbb{S},\mathbb{CQ}),(\mathbb{L},\mathbb{UCQ}))$ by showing that (under some assumptions that are explained below) every query in $(\mathbb{FNR},\mathbb{CQ})$ can be rewritten as an $(\mathbb{S},\mathbb{CQ})$ query.

By Proposition 9, we immediately get that ${\sf Cont}((\mathbb{S},\mathbb{CQ}),(\mathbb{L},\mathbb{CQ}))$ is coNExpTime-hard, as needed.

Step 1: ${\sf Cont}((\mathbb{FNR},\mathbb{CQ}),(\mathbb{L},\mathbb{UCQ}))$ is coNExpTime-hard

We show that ${\sf Cont}((\mathbb{FNR},\mathbb{CQ}),(\mathbb{L},\mathbb{UCQ}))$ is coNExpTime-hard, even if we focus on 0-1 queries, that is, queries $Q$ with the following property: for every database $D$ , $Q(D)=Q(D_{01})$ , where $D_{01}\subseteq D$ is the restriction of $D$ on the binary domain $\{0,1\}$ , i.e., $D_{01}=\{R({\bar{c}})\in D\mid{\bar{c}}\subseteq\{0,1\}\}$ . The proof is by reduction from the Exponential Tiling Problem, and is a non-trivial adaptation of the one given in **[14]** for showing that containment of non-recursive Datalog queries is coNExpTime-hard.
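The restriction $D_{01}$ and the 0-1 property can be spelled out as follows. This is a minimal sketch under our own encoding (a database is a set of (predicate, arguments) pairs, and a query is any Python function from databases to answers); the sample query is hypothetical and only serves to exercise the definition.

```python
def restrict_to_bits(db):
    """D_01: the atoms of D whose arguments all belong to {0, 1}."""
    return {(pred, args) for pred, args in db if all(a in (0, 1) for a in args)}

def agrees_on(query, db):
    """Check the defining equation Q(D) = Q(D_01) on one given database."""
    return query(db) == query(restrict_to_bits(db))

def tiled_01_cells(db):
    """Hypothetical query: the bit-valued cells that carry some TiledBy atom."""
    return {args for pred, args in db
            if pred.startswith("TiledBy") and all(a in (0, 1) for a in args)}

D = {("TiledBy_1", (0, 1)), ("TiledBy_1", (0, "a")), ("TiledBy_2", (2, 1))}
D01 = restrict_to_bits(D)
```

A query has the 0-1 property when `agrees_on` holds for every database, so a check on finitely many databases can only refute the property, never establish it.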

Theorem 34

${\sf Cont}((\mathbb{FNR},\mathbb{CQ}),(\mathbb{L},\mathbb{UCQ}))$ is coNExpTime-hard, even for 0-1 queries.

Proof.

Given an instance $T=(n,m,H,V,s)$ of the Exponential Tiling Problem, we are going to construct a $(\mathbb{FNR},\mathbb{CQ})$ 0-1 query $Q_{T}=(\mathbf{S},\Sigma,q)$ and a $(\mathbb{L},\mathbb{UCQ})$ 0-1 query $Q^{\prime}_{T}=(\mathbf{S},\Sigma^{\prime},q^{\prime})$ such that $T$ has a solution iff $Q_{T}\not\subseteq Q^{\prime}_{T}$ .

Data Schema $\mathbf{S}$

The data schema $\mathbf{S}$ consists of:

• $2n$ -ary predicates ${\rm TiledBy}_{i}$ , for each $i\leq m$ ; the atom ${\rm TiledBy}_{i}(x_{1},\ldots,x_{n},y_{1},\ldots,y_{n})$ indicates that the cell with coordinates $((x_{1},\ldots,x_{n}),(y_{1},\ldots,y_{n}))\in\{0,1\}^{n}\times\{0,1\}^{n}$ is tiled by tile $i$ . Notice that we use $n$ -bit binary numbers to represent a coordinate; this is the key difference between our construction and the one of [14].

The Query $Q_{T}$

The goal of the query $Q_{T}$ is to assert whether the input database encodes a candidate tiling, i.e., whether the entire grid is tiled, without taking into account the constraints, that is, the compatibility relations and the constraint on the initial part of the first row. To this end, the query $Q_{T}$ will mention the following predicates:

• Unary predicate ${\rm Bit}$ ; the atom ${\rm Bit}(x)$ simply says that $x$ is a bit, i.e., $x\in\{0,1\}$ .

• $2n$ -ary predicate ${\rm TiledAboveCol}_{i}$ , for each $i\leq n$ ; the atom ${\rm TiledAboveCol}_{i}({\bar{x}},{\bar{y}})$ says that for the row-coordinate ${\bar{y}}$ there are tiled cells with coordinates $({\bar{x}^{\prime}},{\bar{y}})$ for every col-coordinate ${\bar{x}^{\prime}}$ that agrees with ${\bar{x}}$ on the first $i-1$ bits. In other words, for the row corresponding to ${\bar{y}}$ , every column extending the first $i-1$ bits of ${\bar{x}}$ is tiled. In particular, ${\rm TiledAboveCol}_{1}({\bar{x}},{\bar{y}})$ says that the entire row ${\bar{y}}$ is tiled.

• $n$ -ary predicate ${\rm TiledAboveRow}_{i}$ , for each $i\leq n$ ; the atom ${\rm TiledAboveRow}_{i}({\bar{y}})$ says that for every ${\bar{y}^{\prime}}$ that agrees with ${\bar{y}}$ on the first $i-1$ bits, the row ${\bar{y}^{\prime}}$ is fully tiled.

• $n$ -ary predicate ${\rm RowTiled}$ ; the atom ${\rm RowTiled}({\bar{y}})$ says that the row ${\bar{y}}$ is fully tiled.

• $0$ -ary predicate ${\rm AllTiled}$ , which asserts that the entire grid is tiled.

• $0$ -ary predicate ${\rm Goal}$ , which is derived whenever the predicate ${\rm AllTiled}$ is derived.

$Q_{T}$ is defined as the query $(\mathbf{S},\Sigma,{\rm Goal})$ , where $\Sigma$ consists of the following rules:

• Generate ${\rm Bit}$ atoms

[TABLE]

• ${\rm RowTiled}$

For each $j,k\leq m$ :

[TABLE]

For each $2\leq i\leq n$ :

[TABLE]

A row is fully tiled:

[TABLE]

• ${\rm AllTiled}$

[TABLE]

For each $2\leq i\leq n$ :

[TABLE]

The entire grid is tiled:

[TABLE]

This concludes the construction of the query $Q_{T}$ .

The Query $Q^{\prime}_{T}$

$Q^{\prime}_{T}$ is defined in such a way that $Q^{\prime}_{T}(D)$ is non-empty exactly when the input database $D$ encodes an invalid tiling, i.e., when one of the constraints on the tiles is violated. The query $Q^{\prime}_{T}$ will mention the following intensional predicates:

• Unary predicate ${\rm Bit}$ ; as above, ${\rm Bit}(x)$ says that $x$ is a bit.

• $2i$ -ary predicate ${\rm LastFirst}_{i}$ , for each $1\leq i\leq n$ ; the atom ${\rm LastFirst}_{i}(x_{1},\ldots,x_{i},y_{1},\ldots,y_{i})$ says that $(x_{1},\ldots,x_{i})=(1,\ldots,1)$ and $(y_{1},\ldots,y_{i})=(0,\ldots,0)$ .

• $2i$ -ary predicate ${\rm Succ}_{i}$ , for each $1\leq i\leq n$ ; the atom ${\rm Succ}_{i}({\bar{x}},{\bar{y}})$ says that the $i$ -bit binary number ${\bar{y}}$ is the successor of the $i$ -bit binary number ${\bar{x}}$ .

• $0$ -ary predicate ${\rm Goal}$ .

$Q^{\prime}_{T}$ is defined as the query $(\mathbf{S},\Sigma^{\prime},q^{\prime})$ . The set $\Sigma^{\prime}$ consists of the following linear tgds:

• Generate ${\rm Bit}$ atoms:

[TABLE]

• Generate the successor predicates:

[TABLE]

For each $1\leq i\leq n-1$ :

[TABLE]
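The intended meaning of ${\rm Succ}_{i}$ and ${\rm LastFirst}_{i}$ can be illustrated by the standard inductive definition of the successor relation on $i$ -bit numbers. The sketch below is an assumption about how such relations are typically built level by level (most significant bit first); it is not a transcription of the tgds above, and the function names are ours.

```python
def successor_relations(n):
    """Return {i: Succ_i} for 1 <= i <= n, with bit strings as tuples (MSB first)."""
    succ = {((0,), (1,))}        # Succ_1: 1 is the successor of 0
    last_first = {((1,), (0,))}  # LastFirst_1: the pair (1, 0)
    levels = {1: succ}
    for i in range(1, n):
        # Either the leading bit is unchanged and the rest is incremented,
        # or the rest wraps around from 1...1 to 0...0 and the leading bit flips.
        succ = ({((b,) + x, (b,) + y) for (x, y) in succ for b in (0, 1)}
                | {((0,) + x, (1,) + y) for (x, y) in last_first})
        last_first = {((1,) + x, (0,) + y) for (x, y) in last_first}
        levels[i + 1] = succ
    return levels

def value(bits):
    """The number represented by a tuple of bits, most significant bit first."""
    return int("".join(map(str, bits)), 2)

levels = successor_relations(4)
pairs_2 = sorted((value(x), value(y)) for x, y in levels[2])
```

Each level is obtained from the previous one by a constant number of rules, which is what keeps the overall construction polynomial in $n$ even though ${\rm Succ}_{n}$ relates exponentially many numbers.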

The UCQ $q^{\prime}$ consists of the following (Boolean) CQs; for brevity, the existential quantifiers in front of the CQs are omitted:

• Tile Consistency

For each $i\neq j\leq m$ :

[TABLE]

• Tile Compatibility

For each $(i,j)\not\in V$ :

[TABLE]

For each $(i,j)\not\in H$ :

[TABLE]

• Tiling of First Row

For each $j\leq n$ , let $f_{j}$ be the function from $\{1,\ldots,n\}$ into $\{0,1\}$ such that $f_{j}(1)\ldots f_{j}(n)$ is the number $j$ in binary representation, and let $k\in\{1,\ldots,m\}$ other than $s[j]$ ; recall that $s$ is a sequence of numbers of $\{1,\ldots,m\}$ that represents a constraint on the initial part of the first row of the grid. Then, we have the CQ:

[TABLE]

where, for each $i\in\{1,\ldots,n\}$ , $x_{i}=z$ if $f_{j}(i)=0$ , and $x_{i}=o$ if $f_{j}(i)=1$ .

This concludes the definition of the query $Q^{\prime}_{T}$ . ∎

Step 2: ${\sf Cont}((\mathbb{S},\mathbb{CQ}),(\mathbb{L},\mathbb{UCQ}))$ is coNExpTime-hard

Our goal is to show that every 0-1 query $(\mathbf{S},\Sigma,q)\in(\mathbb{F},\mathbb{CQ})$ can be equivalently rewritten as a 0-1 query $(\mathbf{S},\Sigma^{\prime},q^{\prime})$ , where all the tgds of $\Sigma^{\prime}$ are lossless, i.e., all the body-variables appear also in the head, which in turn implies that $\Sigma^{\prime}$ is sticky.

Proposition 35

Consider a 0-1 query $Q\in(\mathbb{F},\mathbb{CQ})$ . We can construct in polynomial time a 0-1 query $Q^{\prime}\in(\mathbb{S},\mathbb{CQ})$ such that $Q\equiv Q^{\prime}$ .

Proof.

Let $Q=(\mathbf{S},\Sigma,q)$ , and assume that $n$ is the maximum number of variables occurring in the body of a tgd of $\Sigma$ . We are going to construct in polynomial time a 0-1 query $Q^{\prime}=(\mathbf{S},\Sigma^{\prime},q^{\prime})\in(\mathbb{S},\mathbb{CQ})$ such that $Q\equiv Q^{\prime}$ .

The set $\Sigma^{\prime}$ consists of the following tgds:

• Initialization Rules

We first transform every database atom of the form $R({\bar{c}})$ into an atom $R^{\prime}({\bar{c}},\underbrace{0,\ldots,0}_{n},0,1)$ . This is done as follows:

[TABLE]

and, for each $k$ -ary predicate $R\in\mathbf{S}$ , we have the lossless tgd

[TABLE]

Notice that we can safely force the variables $x_{1},\ldots,x_{k}$ to take only values from $\{0,1\}$ due to the 0-1 property.

• Transformation into Lossless Tgds

For each tgd $\sigma\in\Sigma$ of the form

[TABLE]

we have the lossless tgd

[TABLE]

where, if $\{v_{1},\ldots,v_{\ell}\}$ , for $\ell\in\{1,\ldots,n\}$ , is the set of variables occurring in the body of $\sigma$ (the order is not relevant), then $y_{i}=v_{i}$ , for each $i\in\{1,\ldots,\ell\}$ , and $y_{j}=v_{1}$ , for each $j\in\{\ell+1,\ldots,n\}$ .

•

Finalization Rules

Observe that each atom obtained during the chase due to one of the lossless tgds introduced above is of the form $R^{\prime}({\bar{x}},{\bar{y}})$ , where ${\bar{y}}\in\{0,1\}^{n}$ . If ${\bar{y}}\neq(0,\ldots,0)$ , then we need to ensure that eventually the atom

[TABLE]

will be inferred. This is achieved by adding to $\Sigma^{\prime}$ the following tgds: For each $k$ -ary predicate $R$ occurring in $\Sigma$ , and for each $1\leq i\leq n$ , we have the rule:

[TABLE]

This concludes the definition of $\Sigma^{\prime}$ .
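The padding of body variables used in the transformation into lossless tgds can be illustrated with a small sketch. It is written in Python purely for illustration; the function name `pad_body_variables` and the list representation of variables are assumptions of this example, and the sketch does not reproduce the shape of the head atom, which is fixed by the displayed tgd.

```python
def pad_body_variables(body_vars, n):
    """Return (y_1, ..., y_n): y_i = v_i for i <= l, and y_j = v_1 for j > l.

    body_vars lists the l distinct body variables v_1, ..., v_l of a tgd
    (their order is irrelevant), and n is the maximum number of body
    variables over all tgds of Sigma.
    """
    if not 1 <= len(body_vars) <= n:
        raise ValueError("expected between 1 and n body variables")
    # Repeat the first variable so that every padded tuple has length n.
    return list(body_vars) + [body_vars[0]] * (n - len(body_vars))
```

For instance, with $n=4$ and body variables $x,y$ the padded tuple is $(x,y,x,x)$ , so every body variable still occurs in the head.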

The CQ $q^{\prime}$ is defined analogously. More precisely, assuming that $q$ is of the form (the existential quantifiers are omitted)

[TABLE]

the CQ $q^{\prime}$ is defined as

[TABLE]

It is easy to verify that $\Sigma^{\prime}$ consists of lossless tgds, and thus, $Q^{\prime}\in(\mathbb{S},\mathbb{CQ})$ . It is also not difficult to see that, for every database $D$ over $\mathbf{S}$ , $Q(D_{01})=Q^{\prime}(D_{01})$ ; thus, by the 0-1 property, $Q(D)=Q^{\prime}(D)$ , and the claim follows. ∎

By Theorem 34 and Proposition 35, we immediately get that ${\sf Cont}((\mathbb{S},\mathbb{CQ}),(\mathbb{L},\mathbb{UCQ}))$ is coNExpTime-hard, as needed.

PROOFS OF SECTION 5

Recall that, for the sake of technical clarity, we focus on constant-free tgds and CQs, but all the results can be extended to the general case at the price of more involved definitions and proofs. Moreover, we assume that tgds have only one atom in the head. This does not affect the generality of our proof since every set of guarded tgds can be transformed in polynomial time into a set of guarded tgds with the above property; see, e.g., **[24]**. Finally, for convenience of presentation, we also assume that the body of a tgd is non-empty, i.e., the body of a tgd is always an atom and not the symbol $\top$ .

Proof of Proposition 21

Let us start by recalling the key notion of tree decomposition. Notice that the definition of tree decomposition that we give here differs slightly from the one in the main body of the paper, since, for convenience of presentation, we prefer to employ a slightly different notation.

Definition 8.

Let $I$ be an instance. A tree decomposition of $I$ that omits $V$ , where $V\subseteq\mathit{dom}(I)$ , is a pair $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ , where $\mathcal{T}={(T,E^{\mathcal{T}})}$ is a tree and ${(X_{t})}_{t\in T}$ a family of subsets of $\mathit{dom}(I)$ (called the bags of the decomposition) such that:

1. For every $v\in\mathit{dom}(I)\setminus V$ , the set $\{t\in T\mid v\in X_{t}\}$ is non-empty and connected.

2. For every atom $P(s_{1},\ldots,s_{n})\in I$ , there is a $t\in T$ such that $\{s_{1},\ldots,s_{n}\}\subseteq X_{t}$ .

The width of a tree decomposition $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ omitting $V$ is $\max\{|X_{t}|:t\in T\}-1$ . The tree-width of $I$ is the minimum among the widths of all tree decompositions of $I$ that omit $V$ . We call a tree decomposition omitting $\varnothing$ simply tree decomposition of $I$ . For $v\in T$ , we denote by $I_{\delta}(v)$ the subinstance of $I$ induced by $X_{v}$ .
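For a finite instance, the two conditions of Definition 8 and the width can be checked mechanically. The following Python sketch is only an illustration: the representation (atoms as `(predicate, args)` pairs, bags as a dictionary from tree nodes to sets, tree edges as a list of pairs) and all function names are assumptions of this example, and the code takes for granted that `edges` really forms a tree.

```python
from collections import deque

def dom(instance):
    """Active domain of an instance given as a set of (predicate, args) atoms."""
    return {term for _, args in instance for term in args}

def is_connected(nodes, edges):
    """True iff `nodes` is non-empty and induces a connected subgraph of `edges`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for x, y in edges:
            # Treat every edge as undirected.
            for src, dst in ((x, y), (y, x)):
                if src == u and dst in nodes and dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
    return seen == nodes

def is_tree_decomposition(instance, edges, bags, omitted=frozenset()):
    """Check conditions 1 and 2 for a decomposition that omits `omitted`."""
    # Condition 1: each non-omitted element occupies a non-empty connected set of nodes.
    for v in dom(instance) - set(omitted):
        if not is_connected({t for t, bag in bags.items() if v in bag}, edges):
            return False
    # Condition 2: the arguments of every atom fit into a single bag.
    return all(any(set(args) <= bag for bag in bags.values())
               for _, args in instance)

def width(bags):
    """Width of a decomposition: size of the largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1
```

On the instance $\{R(a,b),R(b,c)\}$ with two adjacent bags $\{a,b\}$ and $\{b,c\}$ , the check succeeds and the width is $1$ .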

Notation. We usually denote the strict partial order among the nodes of a tree $\mathcal{T}$ of a tree decomposition $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ by $\prec$ . Accordingly, we write $v\preceq w$ iff $v\prec w$ or $v=w$ . For brevity, $\varepsilon$ will usually denote the root of a tree decomposition at hand. If ambiguities could possibly arise, we shall use subscripts in these notations. Furthermore, when $\delta$ is clear from context, we shall omit it from the expression $I_{\delta}(v)$ .

Let $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ be a tree decomposition of $I$ and $V\subseteq T$ . Recall that $\delta$ is $[V]$ -guarded (or guarded except for $V$ ), if for every node $v\in T\setminus V$ , there is an atom $P(s_{1},\ldots,s_{n})\in I$ such that $X_{v}\subseteq\{s_{1},\ldots,s_{n}\}$ . A $[\varnothing]$ -guarded tree decomposition of $I$ is simply called guarded tree decomposition.

Also recall the crucial notion of $C$ -tree:

Definition 9.

An $\mathbf{S}$ -instance $I$ is a $C$ -tree, where $C\subseteq I$ , if there is a tree decomposition $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ of $I$ such that

1. $I_{\delta}(\varepsilon)=C$ , i.e., the subinstance of $I$ induced by $X_{\varepsilon}$ equals $C$ .

2. $\delta$ is guarded except for $\{\varepsilon\}$ .

If $\delta$ or $C$ is clear from context, we shall often refer to $|\mathit{dom}(C)|$ as the diameter of $I$ and to $C$ as the core of $I$ .
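The two conditions of Definition 9 can likewise be tested on small finite examples. The Python sketch below is illustrative only; it assumes that the given bags already form a tree decomposition, and the names `induced`, `guarded_except` and `witnesses_c_tree`, as well as the encoding of atoms as `(predicate, args)` pairs, are choices made for this example.

```python
def induced(instance, elements):
    """Subinstance induced by `elements`: atoms whose arguments all lie in it."""
    elements = set(elements)
    return {(p, args) for p, args in instance if set(args) <= elements}

def guarded_except(instance, bags, exceptions):
    """Every bag outside `exceptions` is covered by the arguments of one atom."""
    return all(
        any(bag <= set(args) for _, args in instance)
        for node, bag in bags.items()
        if node not in exceptions
    )

def witnesses_c_tree(instance, bags, root, core):
    """Conditions 1 and 2 of the C-tree definition for a given decomposition."""
    return (induced(instance, bags[root]) == set(core)
            and guarded_except(instance, bags, {root}))
```

For example, the instance $\{E(a,b),E(b,a),P(b,c)\}$ with root bag $\{a,b\}$ and a single child bag $\{b,c\}$ passes both conditions for the core $\{E(a,b),E(b,a)\}$ , which has diameter $2$ .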

Remark. The notion of $C$ -tree defined here refers to both instances and databases, i.e., a $C$ -tree may be a (finite) database or an instance. We often do not explicitly mention whether a $C$ -tree at hand is a database or an instance. However, it will be clear from context whether a $C$ -tree is a database or an instance.

We proceed to establish the following technical lemma, which in turn allows us to show Proposition 21. It is an adaption of a result in **[7]** to the case of guarded tgds. Henceforth, for brevity, given a query $Q={(\mathbf{S},\Sigma,q)}\in(\mathbb{G},\mathbb{BCQ})$ and an $\mathbf{S}$ -database $D$ , we write $D\models Q$ for the fact that $Q(D)\neq\varnothing$ .

Lemma 36

Let $Q={(\mathbf{S},\Sigma,q)}$ be an OMQ from ${(\mathbb{G},\mathbb{BCQ})}$ . Let $D$ be an $\mathbf{S}$ -database and suppose $D\models Q$ . Then there is a finite $\mathbf{S}$ -instance $\hat{I}$ such that $\hat{I}\models Q$ and:

1. $\hat{I}$ is a $C$ -tree such that $|\mathit{dom}(C)|\leq\mathit{ar}(\mathbf{S}\cup\mathit{sch}(\Sigma))\cdot|q|$ .

2. There is a homomorphism from $\hat{I}$ to $D$ .

Before we proceed with its formal proof, let us explain why Proposition 21 is an easy consequence of Lemma 36. The fact that the first item implies the second is trivial. Conversely, suppose that $Q_{1}\not\subseteq Q_{2}$ , which implies that there exists an $\mathbf{S}$ -database $D$ such that $D\models Q_{1}$ and $D\not\models Q_{2}$ . By Lemma 36, there exists a $C$ -tree $\hat{I}$ , where $|\mathit{dom}(C)|\leq\mathit{ar}(\mathbf{S}\cup\mathit{sch}(\Sigma_{1}))\cdot|q_{1}|$ , such that $\hat{I}\models Q_{1}$ . Moreover, there is a homomorphism from $\hat{I}$ to $D$ ; hence, since $Q_{2}$ is closed under homomorphisms, it immediately follows that $\hat{I}\not\models Q_{2}$ . Consequently, the $\mathbf{S}$ -database $\hat{D}$ obtained from $\hat{I}$ after replacing each null $z$ with a distinct constant $c_{z}$ is a $C$ -tree such that $Q_{1}(\hat{D})\not\subseteq Q_{2}(\hat{D})$ , and Proposition 21 follows.

We now proceed with the proof of Lemma 36 which is our main task in this section. Before that, we introduce some additional auxiliary concepts.

The Guarded Chase Forest

Given a database $D$ and a set $\Sigma$ of guarded tgds, the guarded chase forest for $D$ and $\Sigma$ is a forest (whose edges and nodes are labeled) constructed as follows:

1. For each fact $R(\bar{a})$ in $D$ , add a node labeled with $R(\bar{a})$ .

2. For each node $v$ labeled with $\alpha\in\mathit{chase}(D,\Sigma)$ and for every atom $\beta$ resulting from a one-step application of a rule $\tau\in\Sigma$ , if $\alpha$ is the image of the guard in this application of $\tau$ , then add a node $w$ labeled with $\beta$ and introduce an arc from $v$ to $w$ labeled with $\tau$ .

We can assume that the guarded chase forest is always built inductively according to a fixed, deterministic version of the chase procedure. The non-root nodes are then totally ordered by a relation $\prec$ that reflects their order of generation. Furthermore, we can extend $\prec$ to database atoms by picking a lexicographic order among them. Notice that one atom can be the label of multiple nodes. Using the order $\prec$ we can, however, always refer to the $\prec$ -least node.

Guarded Unraveling

Let $I$ be an instance over $\mathbf{S}$ . We say that $X\subseteq\mathit{dom}(I)$ is guarded in $I$ , if there are $a_{1},\ldots,a_{s}\in\mathit{dom}(I)$ such that

•

$X\subseteq\{a_{1},\ldots,a_{s}\}$ and

•

there is an $R/s\in\mathbf{S}$ such that $I\models R(a_{1},\ldots,a_{s})$ .

A tuple $\bar{t}$ is guarded in $I$ if the set containing the elements of $\bar{t}$ is guarded in $I$ .

In the following paragraph, we largely follow the notions introduced in **[2, 13]**. Fix an $\mathbf{S}$ -instance $I$ and some $X_{0}\subseteq\mathit{dom}(I)$ . Let $\Pi$ be the set of finite sequences of the form $X_{0}X_{1}\cdots X_{n}$ , where, for $i>0$ , $X_{i}$ is a guarded set in $I$ , and, for $i\geq 0$ , $X_{i+1}=X_{i}\cup\{a\}$ for some $a\in\mathit{dom}(I)\setminus X_{i}$ , or $X_{i}\supseteq X_{i+1}$ . The sequences from $\Pi$ can be arranged in a tree by their natural prefix order and each sequence $\pi=X_{0}X_{1}\cdots X_{n}$ identifies a unique node in this tree. In this context, we say that $a\in\mathit{dom}(I)$ is represented at $\pi$ whenever $a\in X_{n}$ . Two sequences $\pi,\pi^{\prime}$ are $a$ -equivalent, if $a$ is represented at each node on the unique shortest path between $\pi$ and $\pi^{\prime}$ . For $a$ represented at $\pi$ , we denote by $[\pi]_{a}$ the $a$ -equivalence class of $\pi$ . The guarded unraveling around $X_{0}$ is the instance $I^{\ast}$ over the elements $\{[\pi]_{a}\mid\text{$a$ is represented at $\pi$}\}$ , where

[TABLE]

for all $R/n\in\mathbf{S}$ .
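To make the set $\Pi$ and the notion of $a$ -equivalence concrete, the following Python sketch enumerates the sequences of $\Pi$ up to a length bound and tests $a$ -equivalence along the prefix tree. It is a hedged illustration under stated assumptions: atoms are `(predicate, args)` pairs, sets $X_{i}$ are frozensets, a sequence is a tuple of frozensets, and the names `successors`, `sequences` and `a_equivalent` are hypothetical. The sketch does not build the relations of $I^{\ast}$ .

```python
from itertools import combinations

def dom(instance):
    return {term for _, args in instance for term in args}

def is_guarded(instance, elements):
    """X is guarded in I if some atom of I contains all elements of X."""
    return any(set(elements) <= set(args) for _, args in instance)

def successors(instance, current):
    """Guarded sets X' with X' = X + {a} for a new element a, or X' a subset of X."""
    grown = {current | {a} for a in dom(instance) - current}
    shrunk = {frozenset(c) for r in range(len(current) + 1)
              for c in combinations(current, r)}
    return {x for x in grown | shrunk if is_guarded(instance, x)}

def sequences(instance, x0, max_steps):
    """All sequences X_0 X_1 ... X_n of Pi with n <= max_steps."""
    level = [(frozenset(x0),)]
    result = list(level)
    for _ in range(max_steps):
        level = [seq + (nxt,) for seq in level
                 for nxt in successors(instance, seq[-1])]
        result.extend(level)
    return result

def a_equivalent(pi1, pi2, a):
    """True iff `a` is represented at every node on the path between pi1 and pi2.

    Both sequences are assumed to start with the same X_0, so they share at
    least the root of the prefix tree.
    """
    k = 0
    while k < min(len(pi1), len(pi2)) and pi1[k] == pi2[k]:
        k += 1
    # The path runs from pi1 up to the longest common prefix and down to pi2;
    # the node for a prefix represents the elements of its last set.
    return all(a in x for x in pi1[k - 1:] + pi2[k - 1:])
```

On the instance $\{R(a,b),R(b,c)\}$ with $X_{0}=\{a\}$ , the sequence $\{a\}\{a,b\}\{b\}\{a,b\}$ is not $a$ -equivalent to $\{a\}\{a,b\}$ , because $a$ is dropped at the intermediate node; the two occurrences of $a$ therefore give rise to different elements of the unraveling.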

Lemma 37

For every $\mathbf{S}$ -instance $I$ and any $X_{0}\subseteq\mathit{dom}(I)$ , the guarded unraveling $I^{\ast}$ around $X_{0}$ is a $C$ -tree over $\mathbf{S}$ , where $C$ is the subinstance of $I^{\ast}$ induced by the elements $\{[X_{0}]_{a}\mid a\in X_{0}\}$ .

Proof.

Let $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ , where $\mathcal{T}$ is the natural tree that arises from ordering the sequences in $\Pi$ by their prefixes. For $\pi\in T$ , let $X_{\pi}\coloneqq\{[\pi]_{a}\mid\text{$a$ is represented at $\pi$}\}$ . Let $\varepsilon$ denote the root of $\mathcal{T}$ . We need to show that $\delta$ is an appropriate tree decomposition witnessing that $I^{\ast}$ is a $C$ -tree. First, note that it is clear that $X_{\varepsilon}=\{[X_{0}]_{a}\mid a\in X_{0}\}$ by construction, and hence $I^{\ast}(\varepsilon)=C$ . Let $[\pi]_{a}\in\mathit{dom}(I^{\ast})$ and consider the set $A\coloneqq\{t\in T\mid[\pi]_{a}\in X_{t}\}$ . This set is certainly non-empty. Moreover, for $t_{1},t_{2}\in A$ , we know that $[t_{1}]_{a}=[t_{2}]_{a}$ , hence $t_{1}$ and $t_{2}$ are $a$ -connected in $\mathcal{T}$ . Suppose $I^{\ast}\models R([\pi_{1}]_{a_{1}},\ldots,[\pi_{n}]_{a_{n}})$ for some $R/n\in\mathbf{S}$ . Then there is a $\pi\in T$ such that $[\pi]_{a_{i}}=[\pi_{i}]_{a_{i}}$ , for all $i=1,\ldots,n$ . Hence, $a_{1},\ldots,a_{n}$ are all represented at $\pi$ and so $\{[\pi_{1}]_{a_{1}},\ldots,[\pi_{n}]_{a_{n}}\}\subseteq X_{\pi}$ . It remains to show that $\delta$ is guarded except for $\{\varepsilon\}$ . Let $\pi\neq\varepsilon$ and consider the set $X_{\pi}$ . Since $\pi$ is a sequence of length greater than one, its last element $Y$ is a guarded set in $I$ . Hence, there are $a_{1},\ldots,a_{s}$ such that $Y\subseteq\{a_{1},\ldots,a_{s}\}$ and $I\models R(a_{1},\ldots,a_{s})$ for some $R/s\in\mathbf{S}$ . Let $\{a_{1},\ldots,a_{s}\}\setminus Y=\{b_{1},\ldots,b_{m}\}$ and define $\rho\coloneqq\pi\cdot(Y\cup\{b_{1}\})\cdot(Y\cup\{b_{1},b_{2}\})\cdots(Y\cup\{b_{1},\ldots,b_{m}\})$ . Then $I^{\ast}\models R([\rho]_{a_{1}},\ldots,[\rho]_{a_{s}})$ , as desired. ∎

Notice that this lemma implies that the tree-width of $I^{\ast}$ is bounded by $|X_{0}|+\mathit{ar}(\mathbf{S})-1$ .

We are now ready to prove Lemma 36:

Proof of Lemma 36. Let $q=\exists\bar{y}\,\varphi(\bar{y})$ and $\mu$ a homomorphism mapping $\varphi(\bar{y})$ to $\mathit{chase}(D,\Sigma)$ . Let $R_{1}(\bar{b}_{1}),\ldots,R_{k}(\bar{b}_{k})$ exhaust all facts from $D$ that are the roots of those $\prec$ -least facts from $\mu(\varphi(\bar{y}))$ in the guarded chase forest of $D$ and $\Sigma$ that have an element from $\mathit{dom}(D)$ as argument. Let $G_{\mu}\coloneqq\bigcup_{1\leq i\leq k}\{\bar{b}_{i}\}$ and let $I^{\ast}$ be the unraveling of $D$ around $G_{\mu}$ , regarding all elements from $\mathit{dom}(I^{\ast})$ as labeled nulls. Henceforth, for every $a\in G_{\mu}$ , we denote by $\lambda_{a}$ the element $[G_{\mu}]_{a}$ . We say that $\lambda_{a}$ represents $a$ . Let $C$ be the substructure of $I^{\ast}$ induced by the set $\{\lambda_{a}\mid a\in G_{\mu}\}$ . Notice that $I^{\ast}$ is an infinite instance that is a $C$ -tree by Lemma 37. We will show later how to get a finite instance from $I^{\ast}$ that satisfies our constraints. We proceed to show that $I^{\ast}\cup\Sigma$ logically entails $q$ , denoted $I^{\ast},\Sigma\models q$ :

Lemma 38

$I^{\ast},\Sigma\models q$ .

Proof.

We will first construct a universal model $J$ of $I^{\ast}$ and $\Sigma$ . Recall that an instance $U$ is a universal model of $I$ and $\Sigma$ , if it can be homomorphically mapped to every model of $I\cup\Sigma$ ; in particular, it is well-known and easy to prove that $\mathit{chase}(I,\Sigma)$ is always a universal model of $I$ and $\Sigma$ . Before constructing $J$ , we introduce some additional notions. In the following, given a guarded set $G=\{a_{1},\ldots,a_{k}\}$ in $D$ , a copy of $G$ in $I^{\ast}$ is a set $\Gamma=\{\alpha_{1},\ldots,\alpha_{k}\}$ which is guarded in $I^{\ast}$ such that, for $i=1,\ldots,k$ , we have that $\alpha_{i}=[\pi_{i}]_{a_{i}}$ for some sequences $\pi_{i}$ and $D\models R(a_{i_{1}},\ldots,a_{i_{m}})$ iff $I^{\ast}\models R(\alpha_{i_{1}},\ldots,\alpha_{i_{m}})$ for all $R\in\mathbf{S}$ and $i_{j}\in\{1,\ldots,k\}$ . Copies of guarded tuples are defined accordingly. Consider the structure $\mathit{chase}(D,\Sigma)$ . Let $G$ be a guarded set in $D$ and $D\upharpoonright G$ denote the subinstance of $D$ induced by $G$ . It is well-known and easy to prove that $\mathit{chase}(D\upharpoonright G,\Sigma)$ is acyclic (cf., e.g., [23]). Henceforth, we loosely call $\mathit{chase}(D\upharpoonright G,\Sigma)$ the tree attached to $G$ . The model $J$ is constructed as follows. Let $J_{0}$ be the instance $C$ . Furthermore, for each guarded set $G=\{a_{1},\ldots,a_{k}\}$ in $D$ and each copy $\Gamma=\{\alpha_{1},\ldots,\alpha_{k}\}$ of $G$ in $I^{\ast}$ , construct a new instance $J_{\Gamma}$ that is isomorphic to the tree attached to $G$ such that

(i) the elements $a_{i}$ of $G$ are renamed to $\alpha_{i}$ in $J_{\Gamma}$ ,

(ii) $\mathit{dom}(J_{0})\cap\mathit{dom}(J_{\Gamma})=\{\alpha_{1},\ldots,\alpha_{k}\}$ , and

(iii) $\Gamma\cap\Theta=\mathit{dom}(J_{\Gamma})\cap\mathit{dom}(J_{\Theta})$ , for every copy $\Theta$ of $G$ in $I^{\ast}$ .

The model $J$ is the union of $J_{0}$ and all the $J_{\Gamma}$ . If a guarded set $X$ in $J_{\Gamma}$ arises from renaming elements of a guarded set $Y$ in $\mathit{chase}(D\upharpoonright G,\Sigma)$ , we also say that $X$ is a copy of $Y$ in $J$ . Furthermore, the copies of $D$ that are contained in $I^{\ast}$ (i.e., in $J_{0}$ ) are also called copies in $J$ . Observe that $J$ is a model of $I^{\ast}$ by construction. We show that it is a model of $\Sigma$ . To this end, we show the following claim.

Claim 39

Let $\bar{t}$ be a guarded tuple in $J$ and let $q(\bar{x})$ be a guarded conjunctive query over $\mathbf{S}\cup\mathit{sch}(\Sigma)$ (by a guarded conjunctive query we mean here a CQ that contains an atom that contains all the variables occurring in the CQ as arguments). Suppose $\bar{t}$ is a copy of $\bar{s}$ in $J$ , where $\bar{s}$ is over $\mathit{dom}(\mathit{chase}(D,\Sigma))$ and $|\bar{t}|=|\bar{s}|$ . Then $J\models q(\bar{t})$ iff $\mathit{chase}(D,\Sigma)\models q(\bar{s})$ .

Proof.

Suppose $J\models q(\bar{t})$ . Let $\{\bar{t}\}=\{\alpha_{1},\ldots,\alpha_{k}\}$ be a copy of $\{\bar{s}\}=\{a_{1},\ldots,a_{k}\}$ in $J$ . Since $q(\bar{x})$ is guarded, there is a $\Gamma\supseteq(\{\alpha_{1},\ldots,\alpha_{k}\}\cap\mathit{dom}(J_{0}))$ such that $J_{\Gamma}\models q(\bar{t})$ . Let $G\supseteq\{\bar{s}\}$ be the guarded set in $D$ of which $\Gamma$ is a copy in $J_{0}$ . It clearly holds that $\mathit{chase}(D\upharpoonright G,\Sigma)\models q(\bar{s})$ , whence $\mathit{chase}(D,\Sigma)\models q(\bar{s})$ follows.

Suppose that $\mathit{chase}(D,\Sigma)\models q(\bar{s})$ . Let $\bar{t}=\alpha_{1},\ldots,\alpha_{k}$ and $\bar{s}=a_{1},\ldots,a_{k}$ and suppose that $\alpha_{i}=[\pi_{i}]_{a_{i}}$ ( $i=1,\ldots,k$ ). The set $\{a_{1},\ldots,a_{k}\}$ is guarded in $\mathit{chase}(D,\Sigma)$ . Hence, there is a guarded $G\supseteq\{a_{1},\ldots,a_{k}\}\cap\mathit{dom}(D)$ in $D$ such that $\mathit{chase}(D\upharpoonright G,\Sigma)\models q(\bar{s})$ . We show that there is a $\Gamma\supseteq\{\alpha_{1},\ldots,\alpha_{k}\}\cap\mathit{dom}(I^{\ast})$ which is a copy of $G$ in $I^{\ast}$ . Suppose $G=\{b_{1},\ldots,b_{l}\}$ . Let $\pi=X_{0}X_{1}\cdots X_{m}$ be such that $[\pi]_{a_{i}}=[\pi_{i}]_{a_{i}}$ for all $i=1,\ldots,k$ . For $i=1,\ldots,l$ , define

[TABLE]

Then $b_{i}$ is represented at $\rho_{i}$ . For $i=1,\ldots,l$ , let $\beta_{i}\coloneqq[\rho_{i}]_{b_{i}}$ . We claim that $\Gamma\coloneqq\{\beta_{1},\ldots,\beta_{l}\}$ is a copy of $G$ in $I^{\ast}$ . Let $R/s\in\mathbf{S}$ and suppose $I^{\ast}\models R([\rho_{i_{1}}]_{b_{i_{1}}},\ldots,[\rho_{i_{s}}]_{b_{i_{s}}})$ . Then we immediately obtain $D\models R(b_{i_{1}},\ldots,b_{i_{s}})$ . Conversely, if $D\models R(b_{i_{1}},\ldots,b_{i_{s}})$ , let $\rho\coloneqq\rho_{\ell}$ , where $\ell\coloneqq\max\{i_{1},\ldots,i_{s}\}$ . Take any $j\in\{i_{1},\ldots,i_{s}\}$ . It is easy to see that $\rho$ and $\rho_{j}$ are $b_{j}$ -equivalent. Hence, $[\rho_{j}]_{b_{j}}=[\rho]_{b_{j}}$ and it follows that $I^{\ast}\models R([\rho_{i_{1}}]_{b_{i_{1}}},\ldots,[\rho_{i_{s}}]_{b_{i_{s}}})$ , as required. It follows that $\Gamma$ is a copy of $G$ in $I^{\ast}$ and so there is a structure $J_{\Gamma}$ contained in $J$ that is isomorphic to $\mathit{chase}(D\upharpoonright G,\Sigma)$ with $b_{1},\ldots,b_{l}$ respectively renamed to $\beta_{1},\ldots,\beta_{l}$ . Hence, $J\models q(\bar{t})$ as required. ∎

Now let $\sigma\colon\varphi(\bar{x},\bar{y})\rightarrow\exists\bar{z}\,\alpha(\bar{x},\bar{z})$ be a guarded rule from $\Sigma$ . Suppose that $J\models\exists\bar{y}\,\varphi(\bar{t},\bar{y})$ . Since every guarded tuple in $J$ is a copy of some guarded tuple in $\mathit{chase}(D,\Sigma)$ , there is an $\bar{s}$ , of which $\bar{t}$ is a copy, such that $\mathit{chase}(D,\Sigma)\models\exists\bar{y}\,\varphi(\bar{s},\bar{y})$ . Since $\mathit{chase}(D,\Sigma)$ is a model of $\Sigma$ , we know that $\mathit{chase}(D,\Sigma)\models\exists\bar{z}\,\alpha(\bar{s},\bar{z})$ . It follows that $J\models\exists\bar{z}\,\alpha(\bar{t},\bar{z})$ by the above claim, as required. It remains to show that $J$ is universal:

Claim 40

$J$ is universal.

Proof.

It suffices to show that $J$ can be homomorphically mapped to $\mathit{chase}(I^{\ast},\Sigma)$ via a homomorphism $\eta$ . We let $\eta_{0}$ be the homomorphism that maps every element of $J_{0}$ to itself. It remains to treat the structures $J_{\Gamma}$ . Consider a copy $\Gamma=\{\alpha_{1},\ldots,\alpha_{k}\}$ in $I^{\ast}$ of a set $G=\{b_{1},\ldots,b_{k}\}$ which is guarded in $D$ . It suffices to show that $J_{\Gamma}$ can be mapped to $\mathit{chase}(I^{\ast},\Sigma)$ . To this end, we show how to map $\mathit{chase}(D\upharpoonright G,\Sigma)$ to $\mathit{chase}(I^{\ast},\Sigma)$ . We do so by induction on the number of rule applications of $\mathit{chase}(D\upharpoonright G,\Sigma)$ . For the base case, we map $D\upharpoonright G$ to $I^{\ast}$ as follows. Let $\eta_{G}^{0}(b_{i})\coloneqq\alpha_{i}$ , for $i=1,\ldots,k$ . Suppose $D\upharpoonright G\models R(b_{i_{1}},\ldots,b_{i_{l}})$ for some $R\in\mathbf{S}$ and $i_{j}\in\{1,\ldots,k\}$ , where $j=1,\ldots,l$ . Recall that $\Gamma$ is guarded in $I^{\ast}$ . Reviewing the construction of $I^{\ast}$ , it is easy to see that this holds iff $I^{\ast}\models R(\alpha_{i_{1}},\ldots,\alpha_{i_{l}})$ . Hence, $\eta_{G}^{0}$ is indeed a homomorphism from $D\upharpoonright G$ to $I^{\ast}$ . The induction step is obvious: we can easily obtain a homomorphism $\eta_{G}^{i}$ that maps $\mathit{chase}^{i}(D\upharpoonright G,\Sigma)$ to $\mathit{chase}(I^{\ast},\Sigma)$ . The desired homomorphism $\eta_{G}$ is the union of the $\eta_{G}^{i}$ ( $i\geq 0$ ). We then obtain a homomorphism $\eta_{\Gamma}$ from $\eta_{G}$ by appropriately renaming the elements from the domain of the latter as we did in the construction of $J_{\Gamma}$ , which is nothing else than an isomorphic copy of $\mathit{chase}(D\upharpoonright G,\Sigma)$ . Furthermore, each of these homomorphisms maps each element of $\Gamma$ to itself.
The desired homomorphism $\eta$ that witnesses that $J$ is universal is the union of $\eta_{0}$ and the $\eta_{\Gamma}$ . ∎

In order to prove $I^{\ast},\Sigma\models q$ , it remains to show that there is a homomorphism $\hat{\mu}$ that maps $q$ to $J$ . There are guarded sets $G_{1},\ldots,G_{l}$ in $D$ such that $\mu$ can be understood to map $q$ to $\mathit{chase}(\bigcup_{1\leq i\leq l}(D\upharpoonright G_{i}),\Sigma)$ . By construction, we know that $G_{1},\ldots,G_{l}$ can be chosen in such a way that $G_{\mu}\subseteq\bigcup_{i=1}^{l}G_{i}$ . Since $\Sigma$ is guarded, $\mu$ can be understood to map $q$ to $\bigcup_{1\leq i\leq l}\mathit{chase}(D\upharpoonright G_{i},\Sigma)$ , assuming that the labeled nulls occurring in these instances are mutually new. Let $\mathcal{C}_{\mu}\coloneqq\{\{\bar{b}_{1}\},\ldots,\{\bar{b}_{k}\}\}$ . For every $X\in\mathcal{C}_{\mu}$ , let $\Gamma_{X}\coloneqq\{\lambda_{b}\mid b\in X\}$ . Notice that $\Gamma_{X}$ is a copy of $X$ in $I^{\ast}$ . By construction, all the facts from $q$ that are mapped via $\mu$ to $\mathit{chase}(D,\Sigma)$ and which have an element from $\mathit{dom}(D)$ in their image under $\mu$ are already mapped to $\bigcup_{X\in\mathcal{C}_{\mu}}\mathit{chase}(D\upharpoonright X,\Sigma)$ . For the other facts, the names of the constants in the databases do not matter (here, it is of course essential to assume constant-free rules). Let $\Theta_{1},\ldots,\Theta_{s}$ be arbitrary copies of the sets $\{G_{1},\ldots,G_{l}\}\setminus\mathcal{C}_{\mu}$ in $I^{\ast}$ . It follows that we can find our desired match $\hat{\mu}$ in the union of $\bigcup_{X\in\mathcal{C}_{\mu}}J_{\Gamma_{X}}$ and $\bigcup_{1\leq i\leq s}J_{\Theta_{i}}$ . Notice that $\bigcup_{X\in\mathcal{C}_{\mu}}J_{\Gamma_{X}}$ is isomorphic to $\bigcup_{X\in\mathcal{C}_{\mu}}\mathit{chase}(D\upharpoonright X,\Sigma)$ with each $b\in G_{\mu}$ represented by $\lambda_{b}$ . ∎

Now $I^{\ast}$ has the desired form with $C$ being its core. However, $I^{\ast}$ is infinite. Since $I^{\ast},\Sigma\models q$ due to Lemma 38, by compactness, there is a finite $\hat{B}\subseteq I^{\ast}$ such that $\hat{B},\Sigma\models q$ . Consider a tree decomposition $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ witnessing that $I^{\ast}$ is a $C$ -tree. Since $\hat{B}$ is finite, there is an $\ell$ such that $\hat{B}$ is contained in the union of the subinstances induced by the bags of depth at most $\ell$ . Let $\hat{I}$ be the instance that consists of all the subinstances induced by the bags of depth at most $\ell$ . Hence, $\hat{I}$ is itself a $C$ -tree and $\hat{I},\Sigma\models q$ , since $\hat{B}\subseteq\hat{I}$ .

Now there is a natural homomorphism mapping $\hat{I}$ to $D$ : we simply specify $[\pi]_{a}\mapsto a$ for all $a\in\mathit{dom}(D)$ . The instance $\hat{I}$ is the one we are looking for. $\square$


Proof of Lemma 22

One can naturally encode instances of bounded tree-width into trees over a finite alphabet such that the alphabet’s size depends only on the tree-width. Our goal here is to appropriately encode $C$ -trees in order to make them accessible to tree automata techniques. Since the tree-width of a $C$ -tree over $\mathbf{S}$ depends only on the size of $\mathit{dom}(C)$ and the maximum arity of $\mathbf{S}$ , the alphabet of the encoding will depend on the same.

Labeled trees. Let $\Gamma$ be an alphabet and $(\mathbb{N}\setminus\{0\})^{\ast}$ be the set of finite sequences of positive integers, including the empty sequence $\varepsilon$ (we specify that $0$ is included in $\mathbb{N}$ as well). Let us recall that a $\Gamma$ -labeled tree is a pair $t={(T,\mu)}$ , where $\mu\colon T\rightarrow\Gamma$ is the labeling function and $T\subseteq(\mathbb{N}\setminus\{0\})^{\ast}$ is closed under prefixes, i.e., $x\cdot i\in T$ implies $x\in T$ , for all $x\in(\mathbb{N}\setminus\{0\})^{\ast}$ and $i\in\mathbb{N}\setminus\{0\}$ . The elements contained in $T$ identify the nodes of $t$ . For $i\in\mathbb{N}\setminus\{0\}$ , nodes of the form $x\cdot i\in T$ are the children of $x$ . A path of length $n$ in $T$ from $x$ to $y$ is a sequence of nodes $x=x_{1},\ldots,x_{n}=y$ such that $x_{i+1}$ is a child of $x_{i}$ . A branch is a path from the root to a leaf node. For $x\in T$ , we set $x\cdot i\cdot-1\coloneqq x$ , for all $i\in\mathbb{N}$ , and $x\cdot 0\coloneqq x$ ; notice that $\varepsilon\cdot-1$ is not defined.
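The tree domain and the three kinds of moves just introduced admit a direct encoding. The following Python sketch is only an illustration under the assumption that nodes are tuples of positive integers, with the empty tuple standing for $\varepsilon$ ; the function names are hypothetical.

```python
def is_prefix_closed(nodes):
    """True iff x.i in T implies x in T, i.e. every non-root node has its parent."""
    nodes = set(nodes)
    return all(node[:-1] in nodes for node in nodes if node)

def children(nodes, x):
    """Nodes of the form x.i that belong to T, in increasing order of i."""
    return sorted(n for n in nodes if len(n) == len(x) + 1 and n[:-1] == x)

def move(x, step):
    """Apply a direction: 0 stays at x, -1 goes to the parent, i > 0 goes to x.i."""
    if step == 0:
        return x
    if step == -1:
        if not x:
            # The root has no parent, mirroring that epsilon.-1 is undefined.
            raise ValueError("the root has no parent")
        return x[:-1]
    return x + (step,)
```

With $T=\{\varepsilon,1,2,1\cdot 1\}$ , the children of $\varepsilon$ are $1$ and $2$ , and moving in direction $-1$ from $1\cdot 1$ yields $1$ .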

Encoding. Let $l\geq 0$ and fix a schema $\mathbf{S}$ . Let $U_{\mathbf{S},l}$ be the disjoint union of two sets $C_{l}$ and $T_{\mathbf{S}}$ , respectively containing $l$ and $2\cdot\mathit{ar}(\mathbf{S})$ elements. The elements from $U_{\mathbf{S},l}$ will be called names. Elements from the set $C_{l}$ will describe core elements, while those of $T_{\mathbf{S}}$ will describe the others. Furthermore, neighboring nodes may describe overlapping pieces of the instance. In particular, if one name is used in neighboring nodes, this means that the name at hand refers to the same element; this is why we use $2\cdot\mathit{ar}(\mathbf{S})$ elements for the non-root bags. Let $\mathbb{K}_{\mathbf{S},l}$ be the finite schema capturing the following information:

•

For all $a\in U_{\mathbf{S},l}$ , there is a unary relation $D_{a}\in\mathbb{K}_{\mathbf{S},l}$ .

•

For all $a\in C_{l}$ , there is a unary relation $C_{a}\in\mathbb{K}_{\mathbf{S},l}$ .

•

For each $R\in\mathbf{S}$ and every $n$ -tuple $\bar{a}\in U^{n}_{\mathbf{S},l}$ , there is a unary relation $R_{\bar{a}}\in\mathbb{K}_{\mathbf{S},l}$ .

Let $\Gamma_{\mathbf{S},l}\coloneqq 2^{\mathbb{K}_{\mathbf{S},l}}$ be an alphabet and suppose that $D$ is a (finite) $C$ -tree over $\mathbf{S}$ such that $|\mathit{dom}(C)|\leq l$ . Consider a tree decomposition $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ witnessing that $D$ is indeed a $C$ -tree and let $\varepsilon$ be the root of $\mathcal{T}$ . Fix a function $f\colon\mathit{dom}(D)\rightarrow U_{\mathbf{S},l}$ such that (i) $f\upharpoonright\mathit{dom}(C)$ is injective and (ii) different elements that occur in neighboring bags of $\delta$ are always assigned different names from $U_{\mathbf{S},l}$ . Using $f$ , we can encode $D$ and $\delta$ into a $\Gamma_{\mathbf{S},l}$ -labeled tree $t={(\hat{T},\mu)}$ such that each node from $\mathcal{T}$ corresponds to exactly one node in $\hat{T}$ and vice versa. For a node $v$ from $\mathcal{T}$ , we denote the corresponding node of $\hat{T}$ by $\hat{v}$ in the following and vice versa. In this light, the symbols from $\mathbb{K}_{\mathbf{S},l}$ have the following intended meaning:

•

$D_{a}\in\mu(\hat{v})$ means that $a$ is used as a name for some element of the bag $X_{v}$ .

•

$C_{a}\in\mu(\hat{v})$ indicates that $a$ is used as name for an element of the bag $X_{v}$ that also occurs in $X_{\varepsilon}$ , i.e., $a$ names an element from the core of $D$ .

•

$R_{\bar{a}}\in\mu(\hat{v})$ indicates that $R$ holds in $D$ for the elements named by $\bar{a}$ in bag $X_{v}$ .

Under certain assumptions, we can decode a $\Gamma_{\mathbf{S},l}$ -labeled tree $t={(T,\mu)}$ into a $C$ -tree whose width is bounded by $\mathit{ar}(\mathbf{S})-1$ . Let $\mathrm{names}(v)\coloneqq\{a\mid D_{a}\in\mu(v)\}$ . We say that $t$ is consistent, if it satisfies the following properties:

1. For all nodes $v$ it holds that $|\mathrm{names}(v)|\leq\mathit{ar}(\mathbf{S})$ , except for the root, whose number of names is accordingly bounded by $l$ . Furthermore, $\mathrm{names}(\varepsilon)\subseteq C_{l}$ .

2. For all $R_{\bar{a}}\in\mathbb{K}_{\mathbf{S},l}$ and all $v\in T$ it holds that $R_{\bar{a}}\in\mu(v)$ implies that $\{\bar{a}\}\subseteq\mathrm{names}(v)$ .

3. For all $a\in C_{l}$ and all $v\in T$ it holds that $D_{a}\in\mu(v)$ iff $C_{a}\in\mu(v)$ .

4. If $C_{a}\in\mu(v)$ , then $C_{a}\in\mu(w)$ for all $w\in T$ on the unique shortest path between $v$ and the root.

5. For all nodes $v\neq\varepsilon$ , there is an $R_{\bar{a}}\in\mathbb{K}_{\mathbf{S},l}$ and a node $w$ such that $R_{\bar{a}}\in\mu(w)$ , $\mathrm{names}(v)\subseteq\{\bar{a}\}$ , and, for all $b\in\mathrm{names}(v)$ , $v$ and $w$ are $b$ -connected.
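The five consistency conditions can be checked directly on a finite labeled tree. The Python sketch below is a hedged illustration, not part of the construction: it assumes that nodes are tuples of positive integers (the empty tuple is $\varepsilon$ ), that each label is a set of symbols encoded as `("D", a)`, `("C", a)` and `("R", name, args)`, and that $b$ -connectedness means that $D_{b}$ occurs in every label on the shortest path; all identifiers are hypothetical.

```python
def names(label):
    """names(v) = { a | D_a in mu(v) }."""
    return {s[1] for s in label if s[0] == "D"}

def path(v, w):
    """Nodes on the unique shortest path between v and w in the prefix tree."""
    k = 0
    while k < min(len(v), len(w)) and v[k] == w[k]:
        k += 1
    up = [v[:i] for i in range(len(v), k - 1, -1)]    # v up to the common ancestor
    down = [w[:i] for i in range(k + 1, len(w) + 1)]  # then down to w
    return up + down

def is_consistent(mu, core_names, max_arity, l):
    root = ()
    for v, label in mu.items():
        nv = names(label)
        # Condition 1: size bounds, and the root only uses core names.
        if len(nv) > (l if v == root else max_arity):
            return False
        if v == root and not nv <= set(core_names):
            return False
        # Condition 2: relation symbols only mention names used at v.
        if any(s[0] == "R" and not set(s[2]) <= nv for s in label):
            return False
        # Condition 3: for core names, D_a and C_a occur together.
        if any((("D", a) in label) != (("C", a) in label) for a in core_names):
            return False
        # Condition 4: C_a propagates along the path to the root.
        if any(s[0] == "C" and any(s not in mu[u] for u in path(v, root))
               for s in label):
            return False
        # Condition 5: every non-root node is guarded by some R_a at a node w
        # that is b-connected to v for every name b used at v.
        if v != root and not any(
            s[0] == "R" and nv <= set(s[2])
            and all(("D", b) in mu[u] for b in nv for u in path(v, w))
            for w, w_label in mu.items() for s in w_label
        ):
            return False
    return True
```

A two-node tree whose root carries $D_{c_{1}},C_{c_{1}},D_{c_{2}},C_{c_{2}},E_{c_{1},c_{2}}$ and whose only child carries $D_{c_{2}},C_{c_{2}},D_{t_{1}},P_{c_{2},t_{1}}$ is consistent for $l=2$ and $\mathit{ar}(\mathbf{S})=2$ ; dropping $C_{c_{2}}$ from the child violates the third condition.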

Decoding trees. Suppose now that $t$ is consistent. We show how we can decode $t$ into a database $\llbracket t\rrbracket$ which is a $C$ -tree whose diameter is bounded by $l$ . Let $a$ be a name used in $t$ . We say that two nodes $v,w$ of $t$ are $a$ -equivalent if $D_{a}\in\mu(u)$ for all nodes $u$ on the unique shortest path between $v$ and $w$ . Clearly, $a$ -equivalence defines an equivalence relation and we let $[v]_{a}\coloneqq\{{(w,a)}\mid\text{$w$ is $a$-equivalent to $v$}\}$ and $[v]_{a}^{\ast}\coloneqq\{w\mid{(w,a)}\in[v]_{a}\}$ . The domain of $\llbracket t\rrbracket$ is the set $\{[v]_{a}\mid v\in T,a\in\mathrm{names}(v)\}$ and, for $R/n\in\mathbf{S}$ , we define

[TABLE]
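As an illustration of the decoding, the following Python sketch computes the classes $[v]_{a}^{\ast}$ and collects the elements of $\llbracket t\rrbracket$ . It is a sketch under stated assumptions: nodes are tuples of positive integers, labels are sets of symbols encoded as `("D", a)`, `("C", a)` and `("R", name, args)`, the tree is prefix-closed, and an element $[v]_{a}$ is represented by the pair of $a$ and the frozen set $[v]_{a}^{\ast}$ . The facts are generated in the way suggested by the proof of Lemma 41 (one fact per symbol $R_{\bar{a}}\in\mu(v)$ ), which is an assumption of this example rather than a restatement of the displayed definition.

```python
def names(label):
    return {s[1] for s in label if s[0] == "D"}

def eq_class(mu, v, a):
    """[v]_a^*: nodes reachable from v by passing only through nodes that use name a."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        neighbours = [u[:-1]] if u else []                      # parent, if any
        neighbours += [w for w in mu
                       if len(w) == len(u) + 1 and w[:-1] == u]  # children
        for w in neighbours:
            if w not in seen and a in names(mu[w]):
                seen.add(w)
                stack.append(w)
    return frozenset(seen)

def decode(mu):
    """Return (domain, facts) of the decoded instance of a consistent labeled tree."""
    def element(v, a):
        return (a, eq_class(mu, v, a))

    domain = {element(v, a) for v, label in mu.items() for a in names(label)}
    facts = {(s[1], tuple(element(v, a) for a in s[2]))
             for v, label in mu.items() for s in label if s[0] == "R"}
    return domain, facts
```

If the name $t_{1}$ is used in two sibling nodes but not in their common parent, the two occurrences fall into different classes and hence denote two distinct elements of $\llbracket t\rrbracket$ .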

Lemma 41

Let $t$ be a consistent $\Gamma_{\mathbf{S},l}$ -labeled tree with root node $\varepsilon$ . Then $\llbracket t\rrbracket$ is well-defined and a $C$ -tree over $\mathbf{S}$ , where $C$ is the subinstance of $\llbracket t\rrbracket$ induced by the set $\{[\varepsilon]_{a}\mid a\in\mathrm{names}(\varepsilon)\}$ . Moreover, $|\mathit{dom}(C)|$ is bounded by $l$ .

Proof.

Let $t={(T,\mu)}$ be a consistent, $\Gamma_{\mathbf{S},l}$ -labeled tree. The fact that $\llbracket t\rrbracket$ is well-defined is left to the reader. We are going to construct an appropriate decomposition $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ for $\llbracket t\rrbracket$ . The tree $\mathcal{T}$ has the same structure as $t$ . Furthermore, for $v\in T$ , we set $X_{v}\coloneqq\{[v]_{a}\mid a\in\mathrm{names}(v)\}$ . We need to show that $\delta$ is indeed a tree decomposition that satisfies the desired properties.

Let $[v]_{a}\in\mathit{dom}(\llbracket t\rrbracket)$ and consider two nodes $v_{1},v_{2}\in T$ such that $[v]_{a}\in X_{v_{1}}$ and $[v]_{a}\in X_{v_{2}}$ . Then $v_{1},v_{2}\in[v]_{a}^{\ast}$ and so $v_{1}$ and $v_{2}$ are $a$ -connected. Hence, $w\in[v]_{a}^{\ast}$ for all $w\in T$ which lie on the unique shortest path between $v_{1}$ and $v_{2}$ . Since $a\in\mathrm{names}(w)$ for all such $w$ , it follows that $[v]_{a}\in X_{w}$ , and so $[v]_{a}$ is contained in all bags on the unique path between $v_{1}$ and $v_{2}$ . Suppose $\llbracket t\rrbracket\models R([v_{1}]_{a_{1}},\ldots,[v_{n}]_{a_{n}})$ . Then there is a $v\in[v_{1}]_{a_{1}}^{\ast}\cap\cdots\cap[v_{n}]_{a_{n}}^{\ast}$ such that $R_{a_{1},\ldots,a_{n}}\in\mu(v)$ . By consistency, $\{a_{1},\ldots,a_{n}\}\subseteq\mathrm{names}(v)$ . Moreover, we know that $[v_{i}]_{a_{i}}=[v]_{a_{i}}$ , for $i=1,\ldots,n$ . It follows that $\{[v_{1}]_{a_{1}},\ldots,[v_{n}]_{a_{n}}\}\subseteq X_{v}$ . Now let $v\in T\setminus\{\varepsilon\}$ . By consistency, there is an $R_{a_{1},\ldots,a_{n}}\in\mathbb{K}_{\mathbf{S},l}$ and a $w\in T$ such that $\mathrm{names}(v)=\{a_{i_{1}},\ldots,a_{i_{s}}\}\subseteq\{a_{1},\ldots,a_{n}\}\subseteq\mathrm{names}(w)$ , $R_{a_{1},\ldots,a_{n}}\in\mu(w)$ , and $v$ and $w$ are $a_{i_{j}}$ -connected for $j=1,\ldots,s$ . By construction, $X_{v}=\{[v]_{a_{i_{1}}},\ldots,[v]_{a_{i_{s}}}\}$ and $\{[w]_{a_{1}},\ldots,[w]_{a_{n}}\}\subseteq X_{w}$ . The claim follows now since $[v]_{a_{i_{j}}}=[w]_{a_{i_{j}}}$ for $j=1,\ldots,s$ . It is immediate that $|\mathit{dom}(C)|$ is bounded by $l$ . ∎

Notation. Given a consistent $\Gamma_{\mathbf{S},l}$ -labeled tree $t={(T,\mu)}$ and a label $\rho\in\mu(T)$ , in order to ease notation we often regard $\rho$ as a database consisting of the facts $\{R(\bar{a})\mid R_{\bar{a}}\in\rho\}$ . Furthermore, we let $\mathrm{names}(\rho)\coloneqq\{a\mid D_{a}\in\rho\}$ .

Proof of Lemma 22. The lemma is an easy consequence of Lemma 41 and the following fact: when a $C$ -tree $D$ over $\mathbf{S}$ , together with a tree decomposition witnessing that $D$ is a $C$ -tree, is encoded into a consistent $\Gamma_{\mathbf{S},l}$ -labeled tree $t$ , then $\llbracket t\rrbracket$ and $D$ are isomorphic. $\square$

Roughly, Lemma 22 states that containment among OMQs from ${(\mathbb{G},\mathbb{BCQ})}$ can be semantically characterized via the decodings of consistent $\Gamma_{\mathbf{S},l}$ -labeled trees. This makes the problem of deciding containment amenable to tree automata techniques.

Proof of Lemma 23

Before proceeding to the proof of Lemma 23, we first introduce the relevant automata model.

Automata Techniques

For a set of propositional variables $X$ , we denote by $\mathbb{B}^{+}(X)$ the set of Boolean formulas using variables from $X$ , the connectives $\wedge,\vee$ , and the constants $\mathsf{true},\mathsf{false}$ . Let us now introduce our automata model.

Definition 10.

A two-way alternating parity automaton (2WAPA) on trees is a tuple $\mathfrak{A}={(S,\Gamma,\delta,s_{0},\Omega)}$ , where $S$ is a finite set of states, $\Gamma$ an alphabet (the input alphabet of $\mathfrak{A}$ ), $\delta\colon S\times\Gamma\rightarrow\mathbb{B}^{+}(\mathsf{tran}(\mathfrak{A}))$ the transition function, where we set $\mathsf{tran}(\mathfrak{A})\coloneqq\{\langle\alpha\rangle s,[\alpha]s\mid s\in S,\alpha\in\{-1,0,\ast\}\}$ , $s_{0}\in S$ the initial state, and $\Omega\colon S\rightarrow\mathbb{N}$ the parity condition that assigns to each $s\in S$ a priority $\Omega(s)$ . Elements from $\mathsf{tran}(\mathfrak{A})$ are called transitions.

Intuitively, a transition of the form $\langle 0\rangle s$ means that a copy of the automaton should change to state $s$ and stay at the current node. A transition of the form $\langle-1\rangle s$ means that a copy should be sent to the parent node, which is then required to exist, and proceed in state $s$ , while one of the form $\langle\ast\rangle s$ means that a copy of the automaton that assumes state $s$ is sent to some child node. The transition $[0]s$ means the same as $\langle 0\rangle s$ , while $[-1]s$ means that a copy of the automaton that assumes state $s$ should be sent to the parent node if it exists; here the parent is not required to exist. Likewise, $[\ast]s$ means that a copy of the automaton assuming state $s$ should be sent to all child nodes.
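The informal reading of the transitions can be illustrated by a small sketch. It assumes, purely for illustration, that nodes are encoded as tuples of natural numbers (the root is the empty tuple) and that a finite tree is a set of such tuples; the names `targets`, `diamond_ok`, and `box_ok` are hypothetical and only mirror the intuition given above.

```python
from typing import Callable, Set, Tuple, Union

Node = Tuple[int, ...]        # assumed encoding: a node is a word over the naturals
Direction = Union[int, str]   # -1 (parent), 0 (stay), or "*" (children)


def targets(tree: Set[Node], x: Node, alpha: Direction) -> Set[Node]:
    """Nodes reachable from x in direction alpha."""
    if alpha == 0:
        return {x}
    if alpha == -1:
        # The root has no parent, so there is no target in that case.
        return {x[:-1]} if x else set()
    if alpha == "*":
        return {y for y in tree if len(y) == len(x) + 1 and y[:-1] == x}
    raise ValueError(f"unknown direction: {alpha!r}")


def diamond_ok(tree: Set[Node], x: Node, alpha: Direction,
               good: Callable[[Node], bool]) -> bool:
    """<alpha>s: SOME target must exist and be handled successfully."""
    return any(good(y) for y in targets(tree, x, alpha))


def box_ok(tree: Set[Node], x: Node, alpha: Direction,
           good: Callable[[Node], bool]) -> bool:
    """[alpha]s: ALL targets must be handled successfully (vacuous if none exist)."""
    return all(good(y) for y in targets(tree, x, alpha))
```

At the root, `diamond_ok` fails for direction `-1` because no parent exists, whereas `box_ok` holds vacuously, which matches the difference between $\langle-1\rangle s$ and $[-1]s$ described above.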

Notation. We write $\Diamond s$ for $\bigvee\{\langle\alpha\rangle s\mid\langle\alpha\rangle s\in\mathsf{tran}(\mathfrak{A}),s\in S\}$ , $\Box s$ for $\bigwedge\{[\alpha]s\mid[\alpha]s\in\mathsf{tran}(\mathfrak{A}),s\in S\}$ , and simply $s$ for $\langle 0\rangle s$ . Furthermore, for $\alpha\in\mathbb{N}\cup\{-1,\ast\}$ , we define

[TABLE]

Definition 11.

A run of a 2WAPA $\mathfrak{A}={(S,\Gamma,\delta,s_{0},\Omega)}$ on a $\Gamma$ -labeled tree ${(T,\eta)}$ is a $T\times S$ -labeled tree ${(T_{r},\eta_{r})}$ such that the following holds:

1. $\eta_{r}(\varepsilon)={(\varepsilon,s_{0})}$ ,

2. if $y\in T_{r}$ , $\eta_{r}(y)={(x,s)}$ , and $\delta(s,\eta(x))=\varphi$ , then there is an $I\subseteq\mathsf{tran}(\mathfrak{A})$ such that $I\models\varphi$ holds and the following conditions are satisfied:

• If $\langle\alpha\rangle s^{\prime}\in I$ then there is a node $x^{\prime}\in T_{\alpha}(x)$ and a child node $y^{\prime}\in T_{r}$ of $y$ such that $\eta_{r}(y^{\prime})={(x^{\prime},s^{\prime})}$ .

• If $[\alpha]s^{\prime}\in I$ then for all $x^{\prime}\in T_{\alpha}(x)$ , there is a child node $y^{\prime}\in T_{r}$ of $y$ such that $\eta_{r}(y^{\prime})={(x^{\prime},s^{\prime})}$ .

We say that a run ${(T_{r},\eta_{r})}$ is accepting on $\mathfrak{A}$ , if on all infinite paths ${(\varepsilon,s_{0})},{(x_{1},s_{1})},{(x_{2},s_{2})},\ldots$ in $T_{r}$ , the maximum priority among $\Omega(s_{0}),\Omega(s_{1}),\Omega(s_{2}),\ldots$ that appears infinitely often is even. $\mathfrak{A}$ accepts a $\Gamma$ -labeled tree ${(T,\eta)}$ , if there is an accepting run on ${(T,\eta)}$ . We denote by $\mathcal{L}(\mathfrak{A})$ the set of $\Gamma$ -labeled trees $\mathfrak{A}$ accepts, i.e., the language accepted by $\mathfrak{A}$ .
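Two ingredients of the definition of runs can be made concrete with a short sketch: the local condition $I\models\varphi$ for a positive Boolean formula $\varphi$, and the parity condition on a single infinite path. The encoding below is an assumption made for illustration only. Formulas are small Python objects, transitions are arbitrary hashable values, and an infinite path is assumed to be ultimately periodic, so that the states visited infinitely often are exactly those on its repeating cycle.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Sequence, Set, Union


@dataclass(frozen=True)
class Var:
    name: Hashable          # stands for one transition such as <0>s or [*]s


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Or:
    left: "Formula"
    right: "Formula"


Formula = Union[bool, Var, And, Or]   # bool covers the constants true and false


def satisfies(chosen: Set[Hashable], phi: Formula) -> bool:
    """Decide I |= phi, where `chosen` plays the role of the set I."""
    if isinstance(phi, bool):
        return phi
    if isinstance(phi, Var):
        return phi.name in chosen
    if isinstance(phi, And):
        return satisfies(chosen, phi.left) and satisfies(chosen, phi.right)
    if isinstance(phi, Or):
        return satisfies(chosen, phi.left) or satisfies(chosen, phi.right)
    raise TypeError(f"not a positive Boolean formula: {phi!r}")


def lasso_is_accepting(cycle: Sequence[Hashable],
                       priority: Dict[Hashable, int]) -> bool:
    """Parity check for a path whose repeating part visits the states in `cycle`."""
    if not cycle:
        raise ValueError("an infinite path needs a non-empty cycle")
    # Exactly the priorities on the cycle occur infinitely often.
    return max(priority[s] for s in cycle) % 2 == 0
```

Because the formulas contain no negation, enlarging the set passed as `chosen` can never turn a satisfied formula into an unsatisfied one.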

Remark. The automaton model defined above resembles that in **[58]**. However, we explicitly provide transitions that allow the automaton to move to the parent node, while the model defined in **[58]** provides transitions for moving to some neighboring node, including the parent node. Therefore, the automata in **[58]** offer transitions of the form $s$ , $\Diamond s$ , and $\Box s$ with their intended meaning as defined above. Using techniques as employed in **[57, 58]**, for a 2WAPA $\mathfrak{A}$ , one can show that the problem of deciding whether $\mathcal{L}(\mathfrak{A})=\varnothing$ is feasible in exponential time with respect to the number of states of $\mathfrak{A}$ and in polynomial time with respect to the size of the input alphabet of $\mathfrak{A}$ .

Proof of Lemma 23. We only give an intuitive explanation for the construction of the desired 2WAPA. To check whether a $\Gamma_{\mathbf{S},l}$ -labeled tree is consistent, we can check each condition for consistency separately by a dedicated 2WAPA and then take the intersection of all of them. Most of the consistency conditions are easy to check. We give here a more detailed verbal explanation for condition (5). A 2WAPA checking this condition can be constructed as follows. At the beginning of its run, the automaton branches universally to all nodes (except the root) in a state whose intended purpose is to find appropriate guards in the input tree for the names available at the current node. To this end, the automaton has to do a reachability analysis on the input tree and store, using exponentially many states in $\mathit{ar}(\mathbf{S})$ , the tuple it seeks to guard. By a guard for the node $v$ here, we mean a node $w$ with an $R_{\bar{a}}\in\mu(w)$ such that (i) $\{\bar{a}\}$ contains all the names present at $v$ and (ii) $w$ is $b$ -connected to $v$ for all $b\in\mathrm{names}(v)$ . Notice that such a reachability analysis can be easily performed once we have the means to store the information contained in $\mathrm{names}(v)$ in a single state. This is, however, possible since for this task we need roughly $O((\mathit{ar}(\mathbf{S})+l)^{\mathit{ar}(\mathbf{S})})$ states, i.e., polynomially many in the size of $\Gamma_{\mathbf{S},l}$ . $\square$

Proof of Lemma 24

We first need to introduce some additional auxiliary notions.

Strictly Acyclic Queries

Let $q$ be a CQ over a schema $\mathbf{S}$ . We denote by $\mathrm{free}(q)$ the free variables of $q$ ; the same notation is used for first-order formulas in general. We can naturally view $q$ as an instance $[q]$ whose domain is the set of variables of $q$ and contains the body atoms of $q$ as facts. In the following, we will often overload notation and write $q$ for both the query $q$ and the instance $[q]$ . The notions of tree-width, acyclicity, etc. then immediately extend to CQs. Given a tree decomposition $\delta$ of $q$ (i.e., of $[q]$ ), we say that $\delta$ is strict, if some bag of $\delta$ contains all variables that are free in $q$ (cf. also **[38]**). Accordingly, $q$ is called strictly acyclic if it has a guarded tree decomposition that is strict.
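Strictness and guardedness of a given decomposition are easy to test once its bags are known. The sketch below is only an illustration and rests on two assumptions: a bag counts as guarded if some atom of the query contains all of its variables, and the list of bags passed in is already known to form a tree decomposition (the tree shape itself is not checked, and no search over all decompositions is performed). The function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # assumed encoding: (predicate, variables)


def is_guarded(bags: List[Set[str]], atoms: Iterable[Atom]) -> bool:
    """Every bag must be covered by the variables of a single atom."""
    atom_vars = [set(args) for _, args in atoms]
    return all(any(bag <= vars_ for vars_ in atom_vars) for bag in bags)


def is_strict(bags: List[Set[str]], free_vars: Iterable[str]) -> bool:
    """Some bag must contain all free variables of the query."""
    free = set(free_vars)
    return any(free <= bag for bag in bags)


# Example query: q(x) with body R(x, y), S(y, z), T(z).
ATOMS = [("R", ("x", "y")), ("S", ("y", "z")), ("T", ("z",))]
BAGS = [{"x", "y"}, {"y", "z"}]
```

With the free variable `x` the decomposition `BAGS` is strict, whereas with free variables `x` and `z` this particular decomposition is not, since no bag contains both.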

Strictly acyclic queries have the convenient property of being equivalent to guarded formulas of a special form. Recall that the set of guarded formulas over a schema $\mathbf{S}$ is built inductively by including all atomic formulas, relativizing quantifiers by atomic formulas, and closing under Boolean connectives. More precisely, all quantifier occurrences have one of the forms

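$\forall\bar{y}\,(\alpha(\bar{x},\bar{y})\rightarrow\varphi)\quad\text{and}\quad\exists\bar{y}\,(\alpha(\bar{x},\bar{y})\wedge\varphi)$ , where $\alpha(\bar{x},\bar{y})$ is an atomic formula,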

such that the free variables of $\varphi$ are among $\{\bar{x},\bar{y}\}$ .

We are interested in the guarded formulas that are built up using conjunction and existential quantification; we restrict ourselves to such formulas in the following. We call a formula from this class strictly guarded, if it is of the form $\exists\bar{y}\,(\alpha(\bar{x},\bar{y})\wedge\varphi)$ . We explicitly include the case where $\bar{y}$ is the empty sequence of variables, i.e., formulas of the form $\alpha(\bar{x})\wedge\varphi$ with $\mathrm{free}(\varphi)\subseteq\{\bar{x}\}$ . Notice that every guarded sentence $\varphi$ (i.e., a formula having no free variables) is strictly guarded, since it is equivalent to $\exists y\,(y=y\wedge\varphi)$ . Furthermore, notice that every usual guarded formula that uses only existential quantifiers and conjunction is equivalent to a conjunction of strictly guarded formulas. The following lemma is proved in **[38]**.

Lemma 42

Every strictly acyclic CQ can be rewritten in polynomial time into an equivalent strictly guarded formula that is built up using conjunction and existential quantification only. The converse holds as well.
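As a small worked illustration of the lemma (the query and the relation names $R$, $S$, $T$ are chosen here as an example and do not come from the paper), consider a unary CQ whose body forms a path:

```latex
q(x) \;=\; \exists y\,\exists z\,\bigl(R(x,y)\wedge S(y,z)\wedge T(z)\bigr)
\;\equiv\;
\exists y\,\bigl(R(x,y)\wedge\exists z\,(S(y,z)\wedge T(z))\bigr)
```

The bags $\{x,y\}$ and $\{y,z\}$ form a tree decomposition of $q$ in which each bag is covered by an atom ( $R(x,y)$ and $S(y,z)$ , respectively), and the bag $\{x,y\}$ contains the only free variable, so the decomposition is strict. The right-hand side has the shape $\exists\bar{y}\,(\alpha(\bar{x},\bar{y})\wedge\varphi)$ with guard $\alpha=R(x,y)$ and $\varphi=\exists z\,(S(y,z)\wedge T(z))$ , whose only free variable $y$ occurs in the guard. By contrast, if both $x$ and $z$ were free in $\exists y\,(R(x,y)\wedge S(y,z))$ , no atom would contain both free variables; assuming that a guarded bag must be covered by a single atom, such a query is acyclic but not strictly acyclic.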

Squid Decompositions

Let $q$ be a BCQ over a schema $\mathbf{S}$ having $n$ body atoms. An $\mathbf{S}$ -cover of $q$ is a BCQ $q^{+}$ that contains all the atoms from $q$ and may additionally contain up to $2n$ other body atoms over $\mathbf{S}$ . It is straightforward to see that, for an $\mathbf{S}$ -instance $I$ , it holds that $I\models q$ iff there is an $\mathbf{S}$ -cover $q^{+}$ of $q$ such that $I\models q^{+}$ .
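Satisfaction of a BCQ in a finite instance amounts to the existence of a homomorphism from the body atoms into the facts, which the following brute-force sketch makes explicit. It assumes constant-free queries (every argument of a query atom is a variable) and a finite list of facts; the names `find_homomorphism` and `models` are hypothetical, and the backtracking search is a plain illustration rather than an efficient procedure.

```python
from typing import Dict, Optional, Sequence, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # assumed encoding: (predicate, variables)
Fact = Tuple[str, Tuple[str, ...]]   # assumed encoding: (predicate, constants)


def find_homomorphism(
    atoms: Sequence[Atom],
    facts: Sequence[Fact],
    assignment: Optional[Dict[str, str]] = None,
) -> Optional[Dict[str, str]]:
    """Return a variable assignment mapping every atom to a fact, or None."""
    assignment = dict(assignment or {})
    if not atoms:
        return assignment
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in facts:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        extended = dict(assignment)
        # Bind each variable, rejecting the fact on a conflicting binding.
        if all(extended.setdefault(var, val) == val
               for var, val in zip(args, fact_args)):
            found = find_homomorphism(rest, facts, extended)
            if found is not None:
                return found
    return None


def models(facts: Sequence[Fact], atoms: Sequence[Atom]) -> bool:
    """I |= q for a Boolean CQ q given by its body atoms."""
    return find_homomorphism(list(atoms), list(facts)) is not None
```

Since a cover only adds atoms, any homomorphism for a cover restricts to one for the original query; conversely, the query is a cover of itself, so a single satisfied cover is all that the equivalence asks for, even though other covers may fail in the same instance.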

Definition 12.

Let $I$ be an instance. For $V\subseteq\mathit{dom}(I)$ , we say that $I$ is $[V]$ -acyclic, if it has a guarded tree decomposition that omits $V$ .

Definition 13.

Let $q$ be a BCQ over $\mathbf{S}$ . A squid decomposition of $q$ is a tuple $\delta={(q^{+},\mu,H,T,V)}$ , where $q^{+}$ is an $\mathbf{S}$ -cover of $q$ , $\mu\colon\mathit{var}({q^{+}})\rightarrow\mathit{var}({q^{+}})$ a mapping, $V\subseteq\mathit{var}(\mu(q^{+}))$ , and ${(H,T)}$ a partition of the atoms $\mu(q^{+})$ such that

• $H$ is the set of atoms of $\mu(q^{+})$ induced by $V$ ,

• $T=\mu(q^{+})\setminus H$ and $T$ is $[V]$ -acyclic.

Intuitively, a squid decomposition specifies how a BCQ can be mapped to an instance that contains some “cyclic parts”: the set $H$ specifies those atoms that are mapped to such cyclic parts, while $T$ collects those atoms that are mapped to the acyclic parts of the instance at hand. We will make this more precise in Lemma 43 below, where we analyze matches in $C$ -trees.

Given a CQ $q$ and a set of variables $V\subseteq\mathit{var}(q)$ , the $V$ -reduct of $q$ , denoted $q^{V}$ , is the conjunctive query that arises from $q$ by dropping all the existential quantifiers that bind variables in $V$ .
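The split of a query into a head part and a tail part, and the effect of taking a $V$ -reduct, can be sketched as follows. The code assumes that "induced by $V$" means that all variables of an atom lie in $V$, encodes atoms as pairs of a predicate name and a tuple of variables, and uses hypothetical function names; it does not check that the tail is $[V]$ -acyclic.

```python
from typing import Iterable, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # assumed encoding: (predicate, variables)


def split_by_head_variables(
    atoms: Iterable[Atom], V: Set[str]
) -> Tuple[List[Atom], List[Atom]]:
    """Partition the atoms into those induced by V and the remaining ones."""
    head: List[Atom] = []
    tail: List[Atom] = []
    for atom in atoms:
        # An atom is induced by V if every one of its variables is in V.
        (head if set(atom[1]) <= V else tail).append(atom)
    return head, tail


def reduct_free_variables(
    atoms: Iterable[Atom], free: Iterable[str], V: Iterable[str]
) -> Set[str]:
    """Free variables of the V-reduct: the old ones plus the variables of V that occur."""
    occurring = {x for _, args in atoms for x in args}
    return set(free) | (set(V) & occurring)
```

For the atoms `E(x1, x2)`, `R(x2, y)`, `S(y, z)` and `V = {x1, x2}`, the head part consists of `E(x1, x2)` alone, and the reduct of the tail, viewed as a Boolean query, has `x2` as its only free variable.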

Lemma 43

Let $J$ be a $C$ -tree over $\mathbf{S}$ and $q$ a BCQ over $\mathbf{S}$ . Let ${(\mathcal{T},{(X_{t})}_{t\in T})}$ be a witnessing tree decomposition of $J$ . It holds that $J\models q$ iff there is a squid decomposition $\delta={(q^{+},\mu,H,A,V\coloneqq\{\bar{x}\})}$ of $q$ and a homomorphism $\eta\colon\mu(q^{+})\rightarrow J$ such that

1. $C\models H$ is witnessed by $\eta$ ,

2. $\bigcup_{\varepsilon\prec v}J(v)\models A^{V}(\eta(\bar{x}))$ is witnessed by $\eta$ , and

3. there are strictly guarded formulas $\varphi_{1},\ldots,\varphi_{l}$ such that $A^{V}(\bar{x})\equiv\varphi_{1}\wedge\cdots\wedge\varphi_{l}$ .

Proof.

For the direction from right to left, consider such a given squid decomposition $\delta$ and a homomorphism $\eta$ as in the hypothesis of the lemma. It is immediate that $\eta\circ\mu$ is a homomorphism mapping $q^{+}$ to $J$ . Since $q^{+}$ is an $\mathbf{S}$ -cover of $q$ , we obtain $J\models q$ as required.

For the other direction, suppose that $J\models q$ is witnessed by a homomorphism $\theta$ . For each $v\in T\setminus\{\varepsilon\}$ , let $\beta_{v}$ be an atom of $J$ such that $J(v)\models\beta_{v}$ and $\beta_{v}$ contains all domain elements from $J(v)$ as arguments. Notice that the $\beta_{v}$ exist, since the witnessing tree decomposition is guarded except for the root $\varepsilon$ . Since $\theta$ maps $q$ to $J$ , for each atom $\alpha$ of $q$ , there is a node $v_{\alpha}$ such that $\theta(\alpha)\in J(v_{\alpha})$ . Let $W$ be the set of all these nodes and their closure under greatest lower bounds with respect to $\preceq$ , excluding the root node $\varepsilon$ of $\mathcal{T}$ . Consider the set of atoms $Q^{+}\coloneqq\theta(q)\cup\{\beta_{v}\mid v\in W\}$ . Notice that at least half of the nodes of $W$ are of the form $v_{\alpha}$ ; hence, $|Q^{+}|\leq 3|q|$ . Let $q^{+}$ be a BCQ constructed as follows. Take the conjunction of $q$ and for each $\beta_{v}(a_{1},\ldots,a_{n})$ ( $v\in W$ ), add an atom $\beta_{v}(x_{1},\ldots,x_{n})$ , where each $x_{i}$ is a newly chosen variable. Then $q^{+}$ is obviously an $\mathbf{S}$ -cover of $q$ . Furthermore, by construction, there is a mapping $\mu\colon\mathit{var}(q^{+})\rightarrow\mathit{var}(q^{+})$ and an isomorphism $\eta\colon\mu(q^{+})\rightarrow Q^{+}$ such that $(\eta\circ\mu)(q^{+})=Q^{+}$ . Now let $H$ be the greatest set of atoms of $\mu(q^{+})$ such that $\eta(H)\subseteq J(\varepsilon)$ . Moreover, let $V\coloneqq\mathit{var}(H)$ and $A\coloneqq\mu(q^{+})\setminus H$ . We claim that $\delta\coloneqq{(q^{+},\mu,H,A,V)}$ is a squid decomposition of $q$ that satisfies together with $\eta$ the points mentioned in the statement of the lemma.

To see that $\delta$ is a squid decomposition of $q$ , the only nontrivial point to prove is that $A$ is indeed $[V]$ -acyclic. We will prove this below in the course of establishing the third item.

The first two items are immediate by construction. We prove the third item. Suppose $V=\{\bar{x}\}$ and consider the $V$ -reduct $A^{V}(\bar{x})$ of $A$ . By construction, the atoms $\eta(A)$ are contained in $\bigcup_{\varepsilon\prec v}J(v)$ . Now the set $W$ together with the order $\preceq_{\mathcal{T}}$ gives rise to a forest consisting of trees $\mathcal{T}_{1},\ldots,\mathcal{T}_{l}$ whose roots are descendants of $\varepsilon$ , i.e., the root of $\mathcal{T}$ (recall that $\varepsilon$ is not contained in $W$ ). Moreover, we have

(i) $\bigcup_{i=1}^{l}T_{i}=W$ ,

(ii) $T_{i}\cap T_{j}=\varnothing$ , for $i\neq j$ , and

(iii) $\bigcup_{v\in T_{i}}X_{v}\cap\bigcup_{v\in T_{j}}X_{v}\subseteq\mathit{dom}(C)$ , for $i\neq j$ .

For $v\in T$ , let $Q^{+}(v)\coloneqq\{\alpha\in Q^{+}\mid J(v)\models\alpha\}$ and, for $i=1,\ldots,l$ , let $Q^{+}(\mathcal{T}_{i})$ be the set of atoms $\bigcup_{v\in T_{i}}Q^{+}(v)$ . Now it is easy to check using the facts stated before that each $Q^{+}(\mathcal{T}_{i})$ is acyclic and, hence, so is $\eta^{-1}(Q^{+}(\mathcal{T}_{i}))$ . Furthermore, denoting by $\varepsilon_{i}$ the root of $\mathcal{T}_{i}$ , it holds that $\mathit{dom}(Q^{+}(\mathcal{T}_{i}))\cap\eta(V)\subseteq\mathit{dom}(Q^{+}(\varepsilon_{i}))$ —indeed, if $a\in\mathit{dom}(Q^{+}(v))\cap\eta(V)$ for some $v\succeq\varepsilon_{i}$ , then, since $\varepsilon_{i}\succ\varepsilon$ and $a\in X_{\varepsilon}$ , it must be the case that $a\in\mathit{dom}(Q^{+}(\varepsilon_{i}))$ by connectivity. It follows that the $V$ -reduct of $\eta^{-1}(Q^{+}(\mathcal{T}_{i}))$ (viewed as Boolean query), henceforth denoted $q^{+}_{\mathcal{T}_{i}}$ , is strictly acyclic and is therefore equivalent to a strictly guarded formula $\varphi_{i}$ . Hence, the query $A^{V}(\bar{x})$ is equivalent to $\bigwedge_{i=1}^{l}\varphi_{i}$ . Moreover, it follows that $A$ itself is $[V]$ -acyclic—notice that $A\equiv\exists\bar{x}\bigwedge_{i=1}^{l}q^{+}_{\mathcal{T}_{i}}$ and that $\mathit{dom}(Q^{+}(\mathcal{T}_{i}))\cap\mathit{dom}(Q^{+}(\mathcal{T}_{j}))\subseteq\eta(V)$ , for $i\neq j$ . Hence, $\mathit{var}(q^{+}_{\mathcal{T}_{i}})\cap\mathit{var}(q^{+}_{\mathcal{T}_{j}})\subseteq V$ , for $i\neq j$ . The claim now follows since every $q^{+}_{\mathcal{T}_{i}}$ is acyclic. ∎

Derivation trees

Let $D$ be an $\mathbf{S}$ -database and $\Sigma$ a set of guarded rules. Let $q_{0}(\bar{x})$ be a strictly acyclic query whose free variables are exactly those from $\bar{x}\coloneqq x_{1},\ldots,x_{n}$ and let $\bar{a}\coloneqq a_{1},\ldots,a_{n}$ be a tuple from $\mathit{dom}(D)$ . A derivation tree for ${(\bar{a},q_{0}(\bar{x}))}$ with respect to $D$ and $\Sigma$ is a finite tree $\mathcal{T}$ whose nodes are labeled via a function $\mu$ with pairs of the form ${(b_{1},\ldots,b_{k};q(y_{1},\ldots,y_{k}))}$ , where $b_{1},\ldots,b_{k}$ are constants from $\mathit{dom}(D)$ and $q(y_{1},\ldots,y_{k})$ is a strictly acyclic query over $\mathbf{S}\cup\mathit{sch}(\Sigma)$ having exactly $y_{1},\ldots,y_{k}$ free, such that the following conditions are satisfied:

1. $\mu(\varepsilon)={(\bar{a},q_{0}(\bar{x}))}$ , where $\varepsilon$ is the root node of $\mathcal{T}$ .

2. If $\mu(v)={(c_{1},\ldots,c_{m};q(z_{1},\ldots,z_{m}))}$ for some node $v$ , then one of the following conditions holds (let $\bar{c}\coloneqq c_{1},\ldots,c_{m}$ and $\bar{z}\coloneqq z_{1},\ldots,z_{m}$ ):

(a) $v$ is a leaf node and $q(\bar{z})\equiv\beta(\bar{z})$ , for some atomic formula $\beta(\bar{z})$ such that $D\models\beta(\bar{c})$ .

(b) The node $v$ has a successor labeled by ${(\bar{c},\bar{b};p(\bar{z},\bar{y}))}$ and it holds that

[TABLE]

(c) The query $q(\bar{z})$ is logically equivalent to $q_{1}(z_{i_{1,1}},\ldots,z_{i_{1,k_{1}}})\wedge\cdots\wedge q_{l}(z_{i_{l,1}},\ldots,z_{i_{l,k_{l}}})$ and $v$ has $l$ successors $v_{1},\ldots,v_{l}$ respectively labeled by ${(c_{i_{1,1}},\ldots,c_{i_{1,k_{1}}};q_{1}(z_{i_{1,1}},\ldots,z_{i_{1,k_{1}}}))},\ldots,{(c_{i_{l,1}},\ldots,c_{i_{l,k_{l}}};q_{l}(z_{i_{l,1}},\ldots,z_{i_{l,k_{l}}}))}$ .

Lemma 44

Let $\alpha(x_{1},\ldots,x_{n})$ be an atomic formula. Then $D,\Sigma\models\alpha(a_{1},\ldots,a_{n})$ iff there is a derivation tree for ${(a_{1},\ldots,a_{n};\alpha(x_{1},\ldots,x_{n}))}$ with respect to $D$ and $\Sigma$ .

Proof (sketch). Let $\bar{a}\coloneqq a_{1},\ldots,a_{n}$ and $\bar{x}\coloneqq x_{1},\ldots,x_{n}$ . The direction from right to left is an easy induction on the construction of the derivation tree. We sketch the other direction. Consider the guarded chase forest ${(\mathcal{F},\eta)}$ for $D$ and $\Sigma$ , where $\eta$ is a function labeling the nodes and edges of $\mathcal{F}$ . We construct a derivation tree for ${(\bar{a},\alpha(\bar{x}))}$ by induction on the number of chase steps required to derive $\alpha(\bar{a})$ from $D$ and $\Sigma$ .

For the base case, if $D\models\alpha(\bar{a})$ , the claim is obvious since we can apply rule 2.(a). Assume that $\alpha(\bar{a})$ is derived using a rule

[TABLE]

and a homomorphism $\mu$ such that $\mu(\bar{x})=\bar{a}$ , where $\beta_{0}(\bar{x},\bar{y})$ is the guard of $\sigma$ . If $\mu(\{\bar{x},\bar{y}\})\subseteq\mathit{dom}(D)$ , the result immediately follows by the induction hypothesis. Otherwise, the image of $\beta_{0}(\bar{x},\bar{y})$ under $\mu$ contains some labeled nulls as arguments. Assume that all the $\beta_{1},\ldots,\beta_{k}$ contain nulls as their arguments—for those that do not, the induction hypothesis would yield appropriate derivation trees again. Notice that all the nulls occurring in $\beta_{1},\ldots,\beta_{k}$ appear in $\mu(\{\bar{y}\})$ . By construction of $\mathcal{F}$ , there is a node $v_{0}$ that is an ancestor of the nodes having the atoms $\mu(\beta_{0}),\mu(\beta_{1}),\ldots,\mu(\beta_{k})$ as labels and which has a label of the form $\beta_{0}(\bar{a},\bar{b})$ which contains no nulls at all as arguments. There is a corresponding atomic formula $\gamma_{0}(\bar{x},\bar{z})$ whose image under an appropriate homomorphism equals $\beta_{0}(\bar{a},\bar{b})$ . Furthermore, there are atoms $\gamma_{1},\ldots,\gamma_{l}$ such that $\mathit{dom}(\{\gamma_{1},\ldots,\gamma_{l}\})\subseteq\{\bar{a},\bar{b}\}$ and

[TABLE]

Now regard the $\gamma_{i}$ ( $i=1,\ldots,l$ ) as atomic formulas with free variables among $\{\bar{x},\bar{z}\}$ . The formula $p(\bar{x},\bar{z})\coloneqq\gamma_{0}(\bar{x},\bar{z})\wedge\gamma_{1}\wedge\cdots\wedge\gamma_{l}$ is then a strictly acyclic query that satisfies $\Sigma\models\forall\bar{x},\bar{z}\,(p(\bar{x},\bar{z})\rightarrow\alpha(\bar{x}))$ . An application of rule 2.(b) then requires us to find a derivation tree for ${(\bar{a},\bar{b};p(\bar{x},\bar{z}))}$ , whence an application of rule 2.(c) reduces this task to finding derivation trees for the atoms $\gamma_{0},\gamma_{1},\ldots,\gamma_{l}$ and their corresponding tuples of constants. These trees exist by induction hypothesis and we can simply concatenate them appropriately in order to arrive at a derivation tree for ${(\bar{a},\alpha(\bar{x}))}$ . $\square$

Given a guarded formula $\varphi(\bar{x})$ built up from conjunctions and existential quantification, we define the nesting depth of $\varphi(\bar{x})$ , denoted $\mathrm{nd}(\varphi(\bar{x}))$ , inductively:

•

If $\varphi(\bar{x})$ is an atomic formula, then $\mathrm{nd}(\varphi(\bar{x}))\coloneqq 0$ .

•

If $\varphi(\bar{x})=(\psi_{1}\wedge\psi_{2})$ , then $\mathrm{nd}(\varphi(\bar{x}))\coloneqq\max\{\mathrm{nd}(\psi_{1}),\mathrm{nd}(\psi_{2})\}$ .

•

If $\varphi(\bar{x})=\exists\bar{y}\,(\alpha(\bar{x},\bar{y})\wedge\psi)$ and $\bar{y}\neq\varnothing$ , then $\mathrm{nd}(\varphi(\bar{x}))\coloneqq\mathrm{nd}(\psi)+1$ .
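The inductive definition of the nesting depth translates directly into a recursive function. The following is a minimal sketch over a hypothetical formula encoding (the classes `Atom`, `And`, and `GuardedExists` are illustrative and not part of the paper). A guarded quantifier with an empty tuple of bound variables is treated as the plain conjunction of its guard and its body, in line with the side condition $\bar{y}\neq\varnothing$ in the last clause.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atom:
    pred: str
    args: Tuple[str, ...]


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class GuardedExists:
    bound: Tuple[str, ...]   # the quantified variables, possibly empty
    guard: Atom
    body: "Formula"


Formula = Union[Atom, And, GuardedExists]


def nd(phi: Formula) -> int:
    """Nesting depth of a guarded formula built from conjunction and existential quantification."""
    if isinstance(phi, Atom):
        return 0
    if isinstance(phi, And):
        return max(nd(phi.left), nd(phi.right))
    if isinstance(phi, GuardedExists):
        if not phi.bound:
            # No variable is quantified: read the formula as guard AND body.
            return max(nd(phi.guard), nd(phi.body))
        return nd(phi.body) + 1
    raise TypeError(f"not a guarded formula: {phi!r}")
```

For instance, the formula $\exists y\,(R(x,y)\wedge\exists z\,(S(y,z)\wedge T(z)))$ has nesting depth 2 under this reading, since each of the two non-empty quantifier blocks adds one level.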

Lemma 45

Let $D$ be a database, $\Sigma$ a set of guarded rules, and $q(\bar{x})$ a strictly acyclic conjunctive query. Then $D,\Sigma\models q(\bar{a})$ iff there is a derivation tree for ${(\bar{a},q(\bar{x}))}$ with respect to $D$ and $\Sigma$ .

Proof (sketch). We again sketch only the direction from left to right. Let $\varphi(\bar{x})$ be the strictly guarded formula corresponding to $q(\bar{x})$ . We proceed by induction on the nesting depth of $\varphi(\bar{x})$ . If $\mathrm{nd}(\varphi(\bar{x}))=0$ , then $\varphi(\bar{x})$ is quantifier free and thus a conjunction of atoms $\alpha_{0}(\bar{x})\wedge\alpha_{1}\wedge\cdots\wedge\alpha_{k}$ , where $\mathit{var}(\alpha_{i})\subseteq\{\bar{x}\}$ for $i=1,\ldots,k$ . An application of rule 2.(c) reduces the problem of building a derivation tree for ${(\bar{a},\varphi(\bar{x}))}$ to the problem of building corresponding trees for the $\alpha_{i}$ and their corresponding constants from $\bar{a}$ . The existence of these trees is guaranteed by Lemma 44.

Now suppose that $\mathrm{nd}(\varphi(\bar{x}))=n+1$ . Let $\varphi(\bar{x})=\exists\bar{y}\,(\alpha(\bar{x},\bar{y})\wedge\psi)$ and $\bar{y}\coloneqq y_{1},\ldots,y_{k}$ . Assume, without loss of generality, that all the bound variables from $\varphi(\bar{x})$ are pairwise distinct. In the following, we will describe how to construct a derivation tree for ${(\bar{a},q(\bar{x}))}$ . If $D,\Sigma\models q(\bar{a})$ , then there is a homomorphism $\mu$ mapping each atom of $q(\bar{x})$ to $\mathit{chase}(D,\Sigma)$ such that $\mu(\bar{x})=\bar{a}$ . Furthermore, $\mu$ maps each atom of $q(\bar{x})$ to a node of the guarded chase forest $\mathcal{F}$ of $D$ and $\Sigma$ . Let $\alpha_{\mu}(\bar{a},\lambda_{1},\ldots,\lambda_{k})$ denote the atom labeling the node of $\mathcal{F}$ where $\alpha(\bar{x},\bar{y})$ is mapped to via $\mu$ . Let $\lambda_{i_{1}},\ldots,\lambda_{i_{l}}$ exhaust all elements from $\lambda_{1},\ldots,\lambda_{k}$ that are not from $\mathit{dom}(D)$ and $\bar{b}\coloneqq\lambda_{j_{1}},\ldots,\lambda_{j_{m}}$ exhaust those from $\lambda_{1},\ldots,\lambda_{k}$ that are from $\mathit{dom}(D)$ . Let $\varphi^{\prime}(\bar{x},y_{j_{1}},\ldots,y_{j_{m}})$ be the formula $\exists y_{i_{1}},\ldots,y_{i_{l}}\,(\alpha(\bar{x},\bar{y})\wedge\psi)$ . Clearly, $\Sigma\models\forall\bar{x},y_{j_{1}},\ldots,y_{j_{m}}\,(\varphi^{\prime}(\bar{x},y_{j_{1}},\ldots,y_{j_{m}})\rightarrow\varphi(\bar{x}))$ . Hence, we can create a successor of ${(\bar{a},\varphi(\bar{x}))}$ that is labeled by ${(\bar{a},\bar{b};\varphi^{\prime}(\bar{x},y_{j_{1}},\ldots,y_{j_{m}}))}$ . Assume now that none of the $\lambda_{1},\ldots,\lambda_{k}$ is from $\mathit{dom}(D)$ . Furthermore, assume that $k\geq 1$ , since otherwise we can just simply apply rule 2.(c) to reduce $q(\bar{x})$ to a conjunction of queries of the desired form. 
As in the proof of Lemma 44, there is a node $v_{0}$ in $\mathcal{F}$ whose label $\beta_{0}(\bar{a},\bar{b})$ contains only values from $\mathit{dom}(D)$ as arguments and such that $v_{0}$ is an ancestor of the node labeled by $\alpha_{\mu}(\bar{a},\lambda_{1},\ldots,\lambda_{k})$ . Furthermore, all the atoms from $\mu(q(\bar{x}))$ that contain an element from $\lambda_{1},\ldots,\lambda_{k}$ as argument are also located in the subtree rooted at $v_{0}$ . Let $p$ be the query that results from deleting all atoms from $q(\bar{x})$ which are mapped via $\mu$ into the subtree rooted at $v_{0}$ . Notice that $p$ may be empty and has free variables among $\bar{x}$ . Furthermore $p$ is equivalent to a conjunction $p_{1}\wedge\cdots\wedge p_{l}$ of strictly acyclic queries. Let $\beta_{0}(\bar{x},\bar{z})$ be the atomic formula whose image under an appropriate homomorphism equals $\beta_{0}(\bar{a},\bar{b})$ . A similar line of reasoning as in the proof of Lemma 44 shows that there are atomic formulas $\beta_{1},\ldots,\beta_{m}$ such that $\mathit{var}(\beta_{i})\subseteq\{\bar{x},\bar{z}\}$ and

[TABLE]

whence an application of rule 2.(b) and rule 2.(c) reduces the problem of constructing a derivation tree for ${(\bar{a},\varphi(\bar{x}))}$ to that of constructing corresponding trees for $\beta_{0},\ldots,\beta_{m}$ and $p$ . Notice that $p$ is a conjunction of strictly guarded formulas of nesting depth at most $n$ . Hence, the induction hypothesis guarantees the existence of such derivation trees. $\square$

Having the above results in place, it is easy to show the following statement:

Lemma 46

Let $D$ be a $C$ -tree over $\mathbf{S}$ and $Q={(\mathbf{S},\Sigma,q)}$ an OMQ where $\Sigma$ is guarded and $q$ a BCQ. Then $D\models Q$ iff there is a squid decomposition $\delta={(q^{+},\mu,H,A,V\coloneqq\{\bar{x}\})}$ of $q$ and a homomorphism $\eta\colon\mu(q^{+})\rightarrow\mathit{chase}(D,\Sigma)$ such that:

1. $F\models H$ is witnessed by $\eta$ , where $F$ is the subinstance of $\mathit{chase}({D},\Sigma)$ induced by $\mathit{dom}(C)$ .

2. There are strictly acyclic queries $q_{1},\ldots,q_{l}$ such that

(a) $A^{V}(\bar{x})\equiv q_{1}\wedge\cdots\wedge q_{l}$ and

(b) for $i=1,\ldots,l$ and $\mathrm{free}(q_{i})=\{\bar{x}_{i}\}$ , there are derivation trees for ${(\eta(\bar{x}_{i}),q_{i})}$ with respect to $D$ and $\Sigma$ .

Proof.

We can easily prove by induction on the number of chase steps that $\mathit{chase}(D,\Sigma)$ is an $F$ -tree, where $F$ is the subinstance of $\mathit{chase}(D,\Sigma)$ induced by $\mathit{dom}(C)$ . Now the lemma at hand is immediate by combining this fact with Lemma 43 and Lemma 45. ∎

We are now ready to proceed with the proof of Lemma 24:

Proof of Lemma 24. Lemma 46 guides the construction of the desired 2WAPA. Suppose $Q={(\mathbf{S},\Sigma,q)}$ is an OMQ from ${(\mathbb{G},\mathbb{BCQ})}$ and let $l\geq 1$ . We are going to construct a 2WAPA $\mathfrak{A}_{Q,l}={(S,\Gamma_{\mathbf{S},l},\delta,s_{0},\Omega)}$ that accepts a consistent $\Gamma_{\mathbf{S},l}$ -labeled tree $t$ iff $\llbracket t\rrbracket\models Q$ . In particular, the number of states of $\mathfrak{A}_{Q,l}$ will be at most exponential in the size of $Q$ and at most polynomial in $l$ , while the construction of $\mathfrak{A}_{Q,l}$ will be feasible in 2ExpTime.

The state set. Let $\Lambda$ denote the set of all Boolean acyclic queries over $\mathbf{S}\cup\mathit{sch}(\Sigma)$ that are of size at most $3|q|$ . Notice that each of these queries is equivalent to a strictly guarded formula. Furthermore, assume that $\Lambda$ is closed under $V$ -reducts, for $V\subseteq\mathit{var}(q)$ , provided that they are strictly acyclic as well, i.e., if $p\in\Lambda$ and $V\subseteq\mathit{var}(q)$ , then also $p^{V}\in\Lambda$ provided $p^{V}$ is strictly acyclic. For $\{\bar{a}\}\subseteq U_{\mathbf{S},l}$ , let

[TABLE]

and let $\hat{S}$ be the union of all the sets $\hat{S}(\bar{a})$ . Now the set of states $S$ consists of an initial state, denoted $s_{0}$ , plus the set $\hat{S}$ factorized modulo logical equivalence. We denote by $[p]$ the equivalence class of a query $p\in\hat{S}$ . Furthermore, for a strictly guarded formula $\varphi$ , we may abuse notation and write $[\varphi]$ for the equivalence class of the strictly acyclic query $p\in\hat{S}$ that is equivalent to $\varphi$ . Notice that the size of $S$ is exponential in the size of $Q$ , since there are only exponentially many CQs of size at most $3|q|$ that are mutually non-equivalent (cf. **[10]**).

The parity condition. We set $\Omega(s)\coloneqq 1$ , for all $s\in S$ . Since every priority is odd, no infinite path of a run can satisfy the parity condition, so only runs without infinite paths are accepting.

The transition function. In the following, for each $\rho\in\Gamma_{\mathbf{S},l}$ , we denote by $\hat{\Theta}(\rho)$ the set of all pairs that are of the form ${(\alpha_{1}\wedge\cdots\wedge\alpha_{n},p_{1}\wedge\cdots\wedge p_{m})}$ for which there is a squid decomposition of the form ${(q^{+},\mu,H,T,\{\bar{x}\})}$ and a function $\theta\colon\{\bar{x}\}\rightarrow\mathrm{names}(\rho)$ such that:

• $H^{\{\bar{x}\}}(\theta(\bar{x}))\equiv\alpha_{1}\wedge\cdots\wedge\alpha_{n}$ , where all the $\alpha_{i}$ are relational ground atoms.

• $T^{\{\bar{x}\}}(\theta(\bar{x}))\equiv p_{1}\wedge\cdots\wedge p_{m}$ , where the $p_{i}$ are strictly acyclic queries.

Call two pairs ${(\varphi_{1},\psi_{1})}$ and ${(\varphi_{2},\psi_{2})}$ as above equivalent if $\varphi_{1}\equiv\varphi_{2}$ and $\psi_{1}\equiv\psi_{2}$ . Let $\Theta(\rho)$ be the set of equivalence classes under this relation and denote by $[\varphi,\psi]$ the equivalence class of a pair ${(\varphi,\psi)}$ under this relation. Now we fix for each $[p]\in S\setminus\{s_{0}\}$ a strictly guarded formula $\chi_{[p]}$ that is equivalent to all queries from $[p]$ . Likewise, we fix a function $\vartheta_{\rho}\colon\Theta(\rho)\rightarrow\hat{\Theta}(\rho)$ such that $\vartheta_{\rho}([\varphi,\psi])\in[\varphi,\psi]$ , i.e., which picks a representative for each equivalence class $[\varphi,\psi]$ .

Now let $\rho\in\Gamma_{\mathbf{S},l}$ . Specify $\delta(\cdot,\rho)$ as follows:

1. For the initial state $s_{0}$ , set

[TABLE]

Intuitively, the automaton selects a squid decomposition whose components are instantiated by names occurring in the root node of the input tree. The automaton then tries to verify the individual components of the squid decomposition, i.e., it tries to match them to the chase expansion of the input database under $\Sigma$ .

2. Let $[p]\in S\setminus\{s_{0}\}$ . We define $\delta([p],\rho)$ according to a case distinction:

(a) Suppose that $p\equiv\top$ . Then $\delta([p],\rho)\coloneqq\mathsf{true}$ .

(b) Suppose $\chi_{[p]}=\exists\bar{y}\,(\alpha(\bar{a},\bar{y})\wedge\varphi)$ , where $\alpha(\bar{a},\bar{y})$ is an atomic formula (including equality), $\mathrm{free}(\varphi)\subseteq\{\bar{y}\}$ , and $\bar{a}$ exhausts all names occurring in $\alpha$ . If $\{\bar{a}\}\not\subseteq\mathrm{names}(\rho)$ then $\delta([p],\rho)\coloneqq\mathsf{false}$ . Otherwise,

[TABLE]

where

[TABLE]

We provide some intuitive explanation for this second case.

(a) If $p$ is the empty query, it can be satisfied at any input node and, hence, the automaton accepts unconditionally on this computation branch.

(b) Otherwise, the automaton first inspects the strictly guarded formula $\chi_{[p]}$ at hand. If the names occurring in the guard $\alpha(\bar{a},\bar{y})$ are not present at the current node, it rejects. Otherwise, it tries to satisfy $\alpha(\bar{a},\bar{y})$ with all possible assignments for $\bar{y}$ at the current node and then proceeds in state $[\varphi(\bar{y}/\bar{b})]$ . Apart from these possibilities, the automaton can decide to move to any neighboring node (i.e., the parent or a child) while remaining in state $[p]$ . This amounts to an exhaustive search that tries to satisfy $p$ somewhere in the input tree. Furthermore, the automaton may choose to construct derivation trees for $p$ . There, it uses the information provided by $\Sigma$ in order to find strictly acyclic queries $p_{1},\ldots,p_{n}$ that imply $p$ . Consequently, it tries to continue its search with $[p_{1}],\ldots,[p_{n}]$ .

We shall now briefly comment on the running time needed to construct $\mathfrak{A}_{Q,l}$. The interesting part of the construction concerns the transition function $\delta$, in particular point 2.(b) involving $\mathrm{impl}(p,\rho)$. We have seen in the proofs of Lemma 44 and Lemma 45 that there are double-exponentially many candidates for the query $q(\bar{a}/\bar{x},\bar{b}/\bar{y})$ that (possibly) implies $p(\bar{a}/\bar{x})$ under $\Sigma$. Furthermore, $q(\bar{a}/\bar{x},\bar{b}/\bar{y})$ consists of at most exponentially many atoms. Each check whether such a query $q$ at hand implies $p$ requires at most double-exponential time in the size of $p$. This follows from the well-known fact that checking query implication under a set of guarded rules is feasible in 2ExpTime with respect to the size of the right-hand side query, and in polynomial time with respect to the size of the left-hand side query (cf. [23]), i.e., the data complexity of query answering under guarded tgds is polynomial time. $\square$


PROOFS OF SECTION 6

Proof of Theorem 26

A proof sketch is given in the main body of the paper. However, the fact that ${\sf Cont}((\mathbb{G},\mathbb{CQ}),(\mathbb{S},\mathbb{CQ}))$ is in 2ExpTime deserves a formal proof. Recall that to establish the latter result we need a more refined complexity analysis of the problem of deciding whether a guarded OMQ is contained in a UCQ; this is discussed in the main body of the paper. In fact, it suffices to show the following result. As in the previous section, we focus on constant-free tgds and CQs, but all the results can be extended to the general case at the price of more involved definitions and proofs. Moreover, we assume that tgds have only one atom in the head. Recall that we write $\mathit{var}_{\geq 2}(q)$ for the variables of $q$ that appear in more than one atom, and we also write $\mathit{var}_{=1}(q)$ for the variables of $q$ that appear only in one atom. Then:

Proposition 47

Consider $Q\in(\mathbb{G},\mathbb{BCQ})$ and a Boolean CQ $q$. The problem of deciding whether $Q\subseteq q$ is feasible in

1.

double-exponential time in $(||Q||+|\mathit{var}_{\geq 2}(q)|)$; and

2.

exponential time in $|\mathit{var}_{=1}(q)|$.

It is easy to verify that the above result, together with the algorithm devised in the main body of the paper, implies that ${\sf Cont}((\mathbb{G},\mathbb{CQ}),(\mathbb{S},\mathbb{CQ}))$ is in 2ExpTime. The rest of this section is devoted to showing the above proposition. Our crucial task is, given a CQ $q$, to devise an automaton that accepts consistent labeled trees which correspond to databases that make $q$ true.

Lemma 48

Let $q$ be a Boolean CQ over $\mathbf{S}$ . There is a 2WAPA $\mathfrak{A}_{q,l}$ , where $l>0$ , that accepts a consistent $\Gamma_{\mathbf{S},l}$ -labeled tree $t$ iff $\llbracket t\rrbracket\models q$ . The number of states of $\mathfrak{A}_{q,l}$ is exponential in $|\mathit{var}_{\geq 2}(q)|$ and polynomial in $(|\mathit{var}_{=1}(q)|+\mathit{ar}(\mathbf{S})+l)$ . Furthermore, $\mathfrak{A}_{q,l}$ can be constructed in exponential time.

Proof.

We are going to construct $\mathfrak{A}_{q,l}={(S,\Gamma_{\mathbf{S},l},\delta,s_{0},\Omega)}$ . Let $x_{1},\ldots,x_{n}$ be the variables of $\mathit{var}_{=1}(q)$ and fix a total order $x_{1}\prec x_{2}\prec\cdots\prec x_{n}$ among them. Define the state set $S$ to be

[TABLE]

Notice that $|S|=O(|\mathit{var}_{=1}(q)|\cdot(\mathit{ar}(\mathbf{S})+l)^{|\mathit{var}_{\geq 2}(q)|})$ . We set $s_{0}\coloneqq s_{\sharp,\varnothing}$ , where $\varnothing$ denotes the empty substitution. In the following, we treat $q$ as a set of relational atoms and let $X=\mathit{var}_{\geq 2}(q)$ . For $\rho\in\Gamma_{\mathbf{S},l}$ and $s_{y,\theta}\in S$ , define $\delta(s_{y,\theta},\rho)$ as follows:

•

If $y=\sharp$ , distinguish the following cases:

1.

If there is an atom $\alpha\in\theta(q)$ such that $\mathit{var}(\alpha)\cap X\neq\varnothing$ and $\mathit{dom}(\alpha)\cap U_{\mathbf{S},l}\not\subseteq\mathrm{names}(\rho)$, then $\delta(s_{\sharp,\theta},\rho)\coloneqq\mathsf{false}$.

2.

Otherwise, let

[TABLE]

•

Suppose $y=x_{i}$ , for some $i=1,\ldots,n$ . Let $\alpha_{i,\theta}$ denote the unique atom $\alpha\in\theta(q)$ such that $x_{i}\in\mathit{var}(\alpha)$ . Set

[TABLE]

Set the parity condition $\Omega$ to be $\Omega(s)\coloneqq 1$ for all $s\in S$ . Intuitively, the automaton works in two passes. The first pass consists of the runs working on states of the form $s_{\sharp,\theta}$ . In this pass, the automaton tries to find an assignment for the variables in the query that appear in at least two distinct atoms. When a candidate assignment $\theta$ is found, the automaton changes to state $s_{x_{1},\theta}$ which is the beginning of the second pass. A state of the form $s_{x_{i},\theta}$ means that the assignment $\theta$ can be extended to all variables $x\prec x_{i}$ and, in this state, the automaton tries to extend $\theta$ to cover the variable $x_{i}$ . The automaton accepts if it is able to extend the candidate assignment $\theta$ to all $x_{1},\ldots,x_{n}$ . ∎
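The two-pass idea behind this construction can be mimicked on a finite database. The following Python sketch is not the automaton itself; it is a brute-force evaluator that only illustrates why the search space is exponential in $|\mathit{var}_{\geq 2}(q)|$ alone. It rests on assumptions that are not part of the paper: atoms are pairs `(predicate, args)`, variables are strings starting with `?`, and the names `split_variables`, `atom_matches` and `holds` are hypothetical.

```python
from collections import Counter
from itertools import product


def is_var(term):
    # Assumption of this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def split_variables(query):
    """Return (var_ge2, var_eq1): variables occurring in at least two
    atoms, and variables occurring in exactly one atom."""
    count = Counter()
    for _, args in query:
        for v in {a for a in args if is_var(a)}:
            count[v] += 1
    shared = sorted(v for v, c in count.items() if c >= 2)
    single = sorted(v for v, c in count.items() if c == 1)
    return shared, single


def atom_matches(atom, theta, database):
    """Second pass: check one atom against the database under theta.

    Variables not bound by theta occur in this atom only, so they can be
    bound locally without affecting any other atom."""
    pred, args = atom
    for fact_pred, fact_args in database:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        local = dict(theta)
        ok = True
        for a, c in zip(args, fact_args):
            if is_var(a):
                if local.setdefault(a, c) != c:
                    ok = False
                    break
            elif a != c:
                ok = False
                break
        if ok:
            return True
    return False


def holds(query, database):
    """First pass: guess an assignment theta for the shared variables;
    then extend it atom by atom to the remaining variables."""
    shared, _ = split_variables(query)
    domain = sorted({c for _, args in database for c in args})
    for values in product(domain, repeat=len(shared)):
        theta = dict(zip(shared, values))
        if all(atom_matches(atom, theta, database) for atom in query):
            return True
    return False


query = [("R", ("?x", "?y")), ("S", ("?y", "?z"))]
database = {("R", ("a", "b")), ("S", ("b", "c"))}
print(split_variables(query), holds(query, database))
```

Only the loop over `product(domain, repeat=len(shared))` is exponential; the extension to the single-occurrence variables is a linear scan per atom, which mirrors the polynomial dependence on $|\mathit{var}_{=1}(q)|$ in the state count.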

Having the above result in place, we can now reduce the problem in question to the emptiness problem for 2WAPA.

Lemma 49

Consider $Q\in(\mathbb{G},\mathbb{BCQ})$ and a Boolean CQ $q$ . We can construct in double-exponential time in $||Q||$ and in exponential time in $||q||$ a 2WAPA $\mathfrak{A}$ , which has exponentially many states in $(||Q||+|\mathit{var}_{\geq 2}(q)|)$ and polynomially many states in $|\mathit{var}_{=1}(q)|$ , such that

[TABLE]

Proof.

Let $Q=(\mathbf{S},\Sigma,q^{\prime})$ and $l=\mathit{ar}(\mathbf{S}\cup\mathit{sch}(\Sigma))\cdot|q^{\prime}|$ . Then $\mathfrak{A}$ is defined as:

[TABLE]

It is an easy task to verify that the claim follows from Lemmas 22, 23, 24, and 48. ∎

It is clear that Proposition 47 is an easy consequence of Lemma 49.

PROOFS OF SECTION 7

Recall that we focus on unary and binary predicates. Moreover, we consider constant-free tgds and CQs, and we assume that tgds have only one atom in the head.

Proof of Proposition 30

Basics. Let $D$ be a $C$-tree of width two. We say that a tree decomposition $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ witnessing that $D$ is a $C$-tree is lean if it satisfies the following conditions:

•

The elements from $\mathit{dom}(C)$ occur only in the root of $\mathcal{T}$ and its immediate successors.

•

If $w$ is a child of $v$ in $\mathcal{T}$ , then there are unique $c,d\in\mathit{dom}(D)$ such that $X_{v}\cap X_{w}=\{c\}$ and $X_{w}\setminus X_{v}=\{d\}$ . The element $d$ is called new at $w$ .

•

It follows from the previous item that every node $v\neq\varepsilon$ in $\mathcal{T}$ has a unique new element $c\in\mathit{dom}(D)$ . We additionally require that $c$ appears in the bag of each child of $v$ .

Intuitively, a lean tree decomposition of a $C$-tree $D$ represents the actual tree structure of $D$. It is fairly straightforward to see that every $C$-tree has a lean tree decomposition.
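The three conditions can be checked mechanically on a finite decomposition. The following Python sketch makes several assumptions that are not taken from the paper: nodes of $\mathcal{T}$ are tuples of positive integers with root `()`, `bags` maps each node to its bag, `dom_c` stands for $\mathit{dom}(C)$, and the function name `satisfies_lean_conditions` is hypothetical. It checks only the three listed conditions; it does not verify that the input is a tree decomposition witnessing that $D$ is a $C$-tree.

```python
def satisfies_lean_conditions(bags, dom_c):
    """Check the three leanness conditions on a finite decomposition.

    bags:  dict mapping nodes (tuples, root is ()) to sets of elements
    dom_c: set of elements of C
    """
    nodes = set(bags)

    # Condition 1: elements of dom(C) occur only in the root and in its
    # immediate successors, i.e. in nodes of depth at most one.
    for node in nodes:
        if len(node) > 1 and bags[node] & dom_c:
            return False

    # Condition 2: a child shares exactly one element with its parent
    # and introduces exactly one new element.
    for child in nodes:
        if not child:
            continue  # the root has no parent
        parent = child[:-1]
        shared = bags[parent] & bags[child]
        new = bags[child] - bags[parent]
        if len(shared) != 1 or len(new) != 1:
            return False

    # Condition 3: the new element of a non-root node appears in the
    # bag of each of its children.
    for node in nodes:
        if not node:
            continue
        (new_element,) = bags[node] - bags[node[:-1]]
        for child in nodes:
            if len(child) == len(node) + 1 and child[:-1] == node:
                if new_element not in bags[child]:
                    return False
    return True


bags = {(): {"a", "b"}, (1,): {"b", "c"}, (1, 1): {"c", "d"}}
print(satisfies_lean_conditions(bags, {"a", "b"}))
```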

Recall that the Gaifman graph of $D$ is the graph $\mathcal{G}(D)={(V,E)}$ with $V\coloneqq\mathit{dom}(D)$ and $(a,b)\in E$ if $a$ and $b$ coexist in some atom of $D$. Given two nodes $a,b$ from $\mathcal{G}({D})$, the distance from $a$ to $b$ in $\mathcal{G}(D)$, denoted $d_{\mathcal{G}(D)}(a,b)$, is the minimum length of a path between $a$ and $b$, and $\infty$ if such a path does not exist. For $a,b\in\mathit{dom}(D)$, we denote by $d_{\delta}(a,b)$ the minimum distance between two nodes of $\mathcal{T}$ that respectively have $a$ and $b$ in their bags. We call $d_{\delta}(a,b)$ the distance from $a$ to $b$ in $\delta$.
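Both distances are easy to compute on finite inputs. The following Python sketch is only an illustration under assumed encodings that do not come from the paper: a database is a set of facts `(predicate, args)`, nodes of $\mathcal{T}$ are tuples of positive integers with root `()`, and `bags` maps nodes to bags. All function names are hypothetical.

```python
import math
from collections import deque
from itertools import combinations


def gaifman_graph(database):
    """Adjacency sets: a and b are adjacent if they coexist in a fact."""
    adj = {}
    for _, args in database:
        for a in args:
            adj.setdefault(a, set())
        for a, b in combinations(set(args), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def gaifman_distance(database, a, b):
    """Length of a shortest path from a to b (breadth-first search)."""
    adj = gaifman_graph(database)
    if a not in adj or b not in adj:
        return math.inf
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return math.inf


def tree_distance(u, v):
    """Distance between two tree nodes given as tuples (root is ())."""
    common = 0
    for x, y in zip(u, v):
        if x != y:
            break
        common += 1
    return len(u) + len(v) - 2 * common


def decomposition_distance(bags, a, b):
    """Minimum tree distance between a node holding a and one holding b."""
    nodes_a = [v for v, bag in bags.items() if a in bag]
    nodes_b = [v for v, bag in bags.items() if b in bag]
    if not nodes_a or not nodes_b:
        return math.inf
    return min(tree_distance(u, v) for u in nodes_a for v in nodes_b)


database = {("R", ("a", "b")), ("R", ("b", "c"))}
bags = {(): {"a", "b"}, (1,): {"b", "c"}}
print(gaifman_distance(database, "a", "c"),
      decomposition_distance(bags, "a", "c"))
```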

Notice that, in a tree decomposition $\delta$ witnessing that $D$ is a $C$-tree, if an element $a\in\mathit{dom}(D)$ appears in the bag of $v$, then it occurs only at $v$, at $v$ and its children, or at $v$ and its parent. Since furthermore the bag of the root node is uniquely determined by $C$, each node in the tree has a uniquely determined set of child nodes whose bags are determined by the structure of $D$ alone. Therefore, the following two lemmas follow immediately.

Lemma 50

Let $\delta={(\mathcal{T},{(X_{t})}_{t\in T})}$ be a lean tree decomposition witnessing that $D$ is a $C$ -tree. Then $d_{\delta}(a,b)\leq d_{\mathcal{G}(D)}(a,b)$ for all $a,b\in\mathit{dom}(D)$ .

Lemma 51

Let $\delta$ and $\delta^{\prime}$ be two lean tree decompositions witnessing that $D$ is a $C$ -tree. Then $d_{\delta}(a,b)=d_{{\delta^{\prime}}}(a,b)$ for all $a,b\in\mathit{dom}(D)$ .

In the following, we denote by $D_{\leq k}$ the subinstance of $D$ induced by the set of elements whose distance from any $a\in\mathit{dom}(C)$ in any lean tree decomposition $\delta$ is bounded by $k$. The subinstance $D_{>k}$ is defined analogously. The branching degree of a lean tree decomposition $\delta$ is the maximum number of child nodes of any node contained in the tree of $\delta$. Notice that two lean tree decompositions of a $C$-tree $D$ always have the same branching degree; the argument is similar to that for the two lemmas above. Hence, we can simply speak about the branching degree of $D$.

Encodings. Recall that a consistent $\Gamma_{\mathbf{S},l}$-labeled tree $t={(T,\mu)}$ encodes information on an $\mathbf{S}$-database $D$ and an appropriate tree decomposition $\delta$ of $D$. It is clear that $\llbracket t\rrbracket$ has a lean tree decomposition, but it is not guaranteed that this is reflected in $\delta$ as well. We call a consistent $t$ lean if the tree decomposition $\delta_{t}={(\mathcal{T}\coloneqq{(T,E)},{(X_{v})}_{v\in T})}$ is lean, where $xEy$ iff $y=x\cdot i$ for some $i\in\mathbb{N}\setminus\{0\}$ and $X_{v}\coloneqq\{[v]_{a}\mid a\in\mathrm{names}(v)\}$. The following is easy to prove:

Lemma 52

There is a 2WAPA on trees $\mathfrak{L}_{\mathbf{S},l}$ that accepts a consistent $\Gamma_{\mathbf{S},l}$ -labeled tree iff it is lean. The number of states of $\mathfrak{L}_{\mathbf{S},l}$ is bounded logarithmically in the size of $\Gamma_{\mathbf{S},l}$ and $\mathfrak{L}_{\mathbf{S},l}$ can be constructed in polynomial time in the size of $\Gamma_{\mathbf{S},l}$ .

Let $t={(T,\mu)}$ be a labeled tree. The branching degree of a node $x\in T$ is the cardinality of $\{i\mid x\cdot i\in T,i\in\mathbb{N}\setminus\{0\}\}$; the branching degree of $t$ is the maximum over all branching degrees of its nodes and $\infty$ if this maximum does not exist. We also say that $t$ is $m$-ary if the branching degree of $t$ is bounded by $m$. A node $x\in T$ is a leaf node of $t$ if it has branching degree zero. The depth of $T$ is the maximum length among the lengths of all branches and $\infty$ if this maximum does not exist. Let us remark that the branching degree of a lean $\Gamma_{\mathbf{S},l}$-labeled tree $t$ as defined for labeled trees equals the branching degree of $\llbracket t\rrbracket$ as defined above.
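For finite trees these notions translate directly into code. The following Python sketch assumes, for illustration only, that $T$ is given as a finite prefix-closed set of tuples of positive integers with root `()`, and that the length of a branch is counted in edges; the function names are hypothetical. The value $\infty$ never arises here because the input is finite.

```python
def node_branching_degree(tree, x):
    """Number of indices i such that the child x.i belongs to the tree."""
    return sum(1 for y in tree if len(y) == len(x) + 1 and y[:len(x)] == x)


def branching_degree(tree):
    """Maximum branching degree over all nodes of a finite tree."""
    return max(node_branching_degree(tree, x) for x in tree)


def is_m_ary(tree, m):
    """A tree is m-ary if its branching degree is bounded by m."""
    return branching_degree(tree) <= m


def leaves(tree):
    """Nodes of branching degree zero."""
    return {x for x in tree if node_branching_degree(tree, x) == 0}


def depth(tree):
    """Length (in edges) of a longest branch of a finite tree."""
    return max(len(x) for x in tree)


tree = {(), (1,), (2,), (1, 1)}
print(branching_degree(tree), sorted(leaves(tree)), depth(tree))
```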

Lemma 53

Let $Q={(\mathbf{S},\Sigma,q)}$ be an OMQ from ${(\mathbb{G}_{2},\mathbb{BCQ})}$ . There is an $m\geq 0$ such that the following are equivalent:

1.

There is an $\mathbf{S}$-database $D$ such that $D\models Q$.

2.

There is a $C$-tree $\hat{D}$ with $|\mathit{dom}(C)|\leq 2|q|$ and branching degree at most $m$ such that $\hat{D}\models Q$.

Proof.

Let $l\coloneqq 2|q|$ and let $\mathfrak{A}_{Q,l}$ be the 2WAPA from Lemma 24. Take the intersection of $\mathfrak{A}_{Q,l}$ with

(i) the 2WAPA $\mathfrak{C}_{\mathbf{S},l}$ from Lemma 23 and

(ii) the 2WAPA from Lemma 52 that checks leanness.

Call the resulting automaton $\mathfrak{B}$ . Then $\mathfrak{B}$ accepts a $\Gamma_{\mathbf{S},l}$ -labeled tree $t$ iff $t$ is lean and consistent and $\llbracket t\rrbracket\models Q$ . We let $m$ be the number of states of $\mathfrak{B}$ and claim that this is the required bound on the branching degree.

First of all, notice that the second item of the lemma trivially implies the first, independently of the choice of $m$, since every $C$-tree is in particular an $\mathbf{S}$-database. For the other direction, suppose that $D\models Q$ for some $\mathbf{S}$-database $D$. Then there is a $C$-tree $B$ such that $|\mathit{dom}(C)|\leq 2|q|$ and $B\models Q$. Being a $C$-tree, $B$ has a lean tree decomposition $\delta$ and the encoding of $B$ together with $\delta$ corresponds to a lean $\Gamma_{\mathbf{S},l}$-labeled tree $t$. It follows that $t\in\mathcal{L}(\mathfrak{B})$. By the results of [44], it follows that there is a $t^{\prime}\in\mathcal{L}(\mathfrak{B})$ whose branching degree is bounded by the number of states of $\mathfrak{B}$, i.e., by $m$. The tree $t^{\prime}$ is lean and consistent, therefore $\llbracket t^{\prime}\rrbracket$ is a $C^{\prime}$-tree of branching degree at most $m$ for some $C^{\prime}\subseteq\llbracket t^{\prime}\rrbracket$ such that $|\mathit{dom}(C^{\prime})|\leq 2|q|$. Furthermore, $\llbracket t^{\prime}\rrbracket\models Q$, as required. ∎

We are now ready to prove Proposition 30:

Proof of Proposition 30. We largely follow [16] here. Choose $m$ as in Lemma 53 above. Suppose first that $Q$ is UCQ rewritable. Let $p\coloneqq p_{1}\vee\cdots\vee p_{n}$ be a corresponding UCQ rewriting. Since the query $q$ is connected, we can assume that $p$ is as well, i.e., that each $p_{i}$ is connected. We choose $k>\max\{|p_{i}|:i=1,\ldots,n\}$ and suppose that $D\models Q$ for some $C$-tree $D$. Since $p$ is a UCQ rewriting, $D\models p_{i}$ for some $i=1,\ldots,n$. Fix a homomorphism $\mu$ witnessing that $D\models p_{i}$. We distinguish cases. Suppose first that $\mu(\mathit{var}(p_{i}))\cap\mathit{dom}(C)\neq\varnothing$. Since $p_{i}$ is connected, it follows that $D_{\leq k}\models p_{i}$ by Lemma 50 and so $D_{\leq k}\models p$. On the other hand, if $\mu(\mathit{var}(p_{i}))\cap\mathit{dom}(C)=\varnothing$, then it is also easy to check that $D_{>0}\models p$.

For the other direction, suppose that the second item of the proposition’s statement holds, i.e., there is a $k\geq 0$ such that for all $C$-trees $D$ over $\mathbf{S}$ with $|\mathit{dom}(C)|\leq 2|q|$ and branching degree at most $m$ it holds that $D\models Q$ implies $D_{\leq k}\models Q$ or $D_{>0}\models Q$. Let $\Lambda$ be the set of all $C$-trees $D$ such that $|\mathit{dom}(C)|\leq 2|q|$ and that have branching degree at most $m$ such that $D\models Q$. We regard $\Lambda$ as a set of BCQs and regard it as factorized modulo logical equivalence. It is then clear that $\Lambda$ is finite, and we claim that $p\coloneqq\bigvee_{i=1}^{n}p_{i}$, where $p_{1},\ldots,p_{n}$ are the elements of $\Lambda$, is a UCQ rewriting of $Q$. We explicitly include the case where $\Lambda$ is empty, in which case $p$ is equivalent to the empty disjunction $\bot$ and there is no database $D$ at all such that $D\models Q$. To see that $p$ is indeed a UCQ rewriting of $Q$, let $D$ be an $\mathbf{S}$-database such that $D\models p$. Then there is an $i=1,\ldots,n$ such that $D\models p_{i}$. Furthermore, $[p_{i}]\models Q$ and so $D\models Q$ as well, since $Q$ is closed under homomorphisms. Suppose now $D\models Q$. We know that there is a $C$-tree $\hat{D}$ with $|\mathit{dom}(C)|\leq l\coloneqq 2|q|$ and branching degree at most $m$ such that $\hat{D}\models Q$ and, when we regard $\hat{D}$ as an instance, there is a homomorphism from $\hat{D}$ to $D$. Let $D^{\prime}\subseteq\hat{D}$ be a minimal connected subset of $\hat{D}$ such that $D^{\prime}\models Q$. Then $D^{\prime}$ is again a $C^{\prime}$-tree for some $C^{\prime}\subseteq D^{\prime}$. Therefore $D^{\prime}_{\leq k}\models Q$ or $D^{\prime}_{>0}\models Q$. The latter is impossible by minimality of $D^{\prime}$. Hence, $D^{\prime}_{\leq k}\models Q$ and so there is a (logically equivalent) copy of $D^{\prime}_{\leq k}$ contained in $\Lambda$. Hence, $D^{\prime}_{\leq k}\models p$, therefore $D^{\prime}\models p$, and hence $\hat{D}\models p$. Recall that, when $\hat{D}$ is regarded as an instance, there is a homomorphism from $\hat{D}$ to $D$. Therefore, $D\models p$. $\square$


Proof of Proposition 31

Let $Q={(\mathbf{S},\Sigma,q)}$ be an OMQ from ${(\mathbb{G}_{2},\mathbb{BCQ})}$ such that $q$ is connected. We are going to show that the desired 2WAPA $\mathfrak{A}$ can be constructed in 2ExpTime. Notice that, using similar results as in [16], this gives us a decision procedure for ${\sf UCQRew}{(\mathbb{G}_{2},\mathbb{CQ})}$ also for non-connected queries. Let us first introduce some auxiliary notions.

2WAPAs on $m$-ary trees. A 2WAPA $\mathfrak{B}$ on $m$-ary trees is defined just like a 2WAPA, except that its transitions $\mathsf{tran}(\mathfrak{B})$ are $\{\langle k\rangle s,[k]s\mid-1\leq k\leq m,s\in S\}$, where $S$ is the state set of $\mathfrak{B}$. The notion of run is then defined on $m$-ary trees only and its definition is modified in the obvious way so as to deal with the transitions $\langle k\rangle s,[k]s$. Intuitively, for $k=1,\ldots,m$, a transition $\langle k\rangle s$ means that the automaton should move to the $k$-th child of the current node (which is then required to exist) and assume state $s$. Correspondingly, $[k]s$ means that the automaton should move to the $k$-th child and assume state $s$ provided that this $k$-th child exists at all. We remark that all 2WAPAs constructed in this paper so far can easily be modified to work on $m$-ary trees as well and we shall assume in the following that they do so. Furthermore, deciding whether $\mathcal{L}(\mathfrak{B})$ is infinite is feasible in exponential time in the number of states of $\mathfrak{B}$ and in polynomial time in the size of the input alphabet of $\mathfrak{B}$ (cf. [57]).
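The difference between $\langle k\rangle s$ and $[k]s$ is only in how a missing target node is treated. The following Python sketch illustrates this on a finite tree given as a prefix-closed set of tuples with root `()`. It is a simplification, not a run semantics: the predicate `good` stands in for "the automaton can continue successfully from that node in state $s$". Reading $k=-1$ as a move to the parent and $k=0$ as staying at the current node is an assumption of this sketch, and all names are hypothetical.

```python
def target(tree, node, k):
    """Node reached from `node` by direction k, or None if it is missing.

    Assumed reading: k = -1 is the parent, k = 0 the node itself,
    and k >= 1 the k-th child."""
    if k == -1:
        return node[:-1] if node else None
    if k == 0:
        return node
    child = node + (k,)
    return child if child in tree else None


def diamond(tree, node, k, good):
    """<k>s: the target must exist and the continuation must succeed."""
    t = target(tree, node, k)
    return t is not None and good(t)


def box(tree, node, k, good):
    """[k]s: the continuation must succeed only if the target exists."""
    t = target(tree, node, k)
    return t is None or good(t)


tree = {(), (1,), (2,)}
good = lambda n: n == (1,)
print(diamond(tree, (), 3, good), box(tree, (), 3, good))
```

On a missing child the two transitions disagree: the diamond version fails while the box version holds vacuously.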

Let $m$ be as in Proposition 30. In the following, we regard all trees as $m$-ary and let $l\coloneqq 2|q|$. Before proceeding to the proof of Proposition 31, we must make the notion of an “extension” of a labeled tree more precise.

Extensions of trees. Let $\mathfrak{B}_{Q}$ be a 2WAPA that accepts a $\Gamma_{\mathbf{S},l}$-labeled tree $t$ iff

(i) $t$ is lean and consistent, (ii) $\llbracket t\rrbracket\models Q$, and (iii) $\llbracket t\rrbracket_{>0}\not\models Q$.

Notice that a 2WAPA $\mathfrak{A}_{Q}^{>0}$ that accepts a lean and consistent $\Gamma_{\mathbf{S},l}$-labeled tree $t$ iff $\llbracket t\rrbracket_{>0}\not\models Q$ can be easily constructed using the construction in Lemma 24. Hence, $\mathfrak{B}_{Q}$ can be constructed by intersecting several 2WAPAs we have already encountered.

Let $\Pi$ be the set of all tuples of the form ${(s,s^{\prime})}$, where $s$ and $s^{\prime}$ are states of $\mathfrak{B}_{Q}$. We define a new alphabet $\Lambda\coloneqq 2^{\mathbb{K}_{\mathbf{S},l}\cup\Pi}$. Notice that $\Lambda$ is of double-exponential size in the size of $Q$. For $\rho\in\Lambda$, we denote by $\rho\upharpoonright\Gamma_{\mathbf{S},l}$ the restriction of $\rho$ to $\Gamma_{\mathbf{S},l}$, that is, $\rho\cap\mathbb{K}_{\mathbf{S},l}$. The restriction of a $\Lambda$-labeled tree $t$ to $\Gamma_{\mathbf{S},l}$, denoted $t\upharpoonright\Gamma_{\mathbf{S},l}$, is the tree that arises from $t$ when we restrict the label of each node of $t$ to $\Gamma_{\mathbf{S},l}$. We say that a $\Lambda$-labeled tree $t$ is consistent if

(i) its restriction to $\Gamma_{\mathbf{S},l}$ is consistent and (ii) symbols $\rho\in\Lambda$ such that $\rho\cap\Pi\neq\varnothing$ appear only in leaf nodes of $t$.

Likewise, we say that a consistent $t$ is lean if $t\upharpoonright\Gamma_{\mathbf{S},l}$ is. The decoding $\llbracket t\rrbracket$ of $t$ is naturally extended to consistent $\Lambda$-labeled trees by setting $\llbracket t\rrbracket\coloneqq\llbracket t\upharpoonright\Gamma_{\mathbf{S},l}\rrbracket$. The following lemma is a straightforward extension of Lemmas 23 and 52.

Lemma 54

There are 2WAPAs $\mathfrak{C}_{\Lambda}$ and $\mathfrak{L}_{\Lambda}$ that respectively accept a $\Lambda$ -labeled tree iff it is consistent and lean. Both have logarithmically many states in the size of $\Lambda$ and can be constructed in polynomial time in the size of $\Lambda$ .

Let $t$ be a lean and consistent $\Lambda$ -labeled tree. We say that $t^{\prime}$ is an extension of $t$ if $t^{\prime}$ is a $\Gamma_{\mathbf{S},l}$ -labeled tree that arises from $t$ by attaching $\Gamma_{\mathbf{S},l}$ -labeled trees to those leaves of $t$ that contain elements from $\Pi$ . Furthermore, for such nodes, the labels of the corresponding nodes in $t^{\prime}$ are those of $t$ restricted to $\Gamma_{\mathbf{S},l}$ .

Definition 14.

Let $\mathcal{L}_{Q}$ be the set of all lean and consistent $\Lambda$-labeled trees $t$ such that $\llbracket t\rrbracket\not\models Q$, yet there is an extension $t^{\prime}$ of $t$ such that $\llbracket t^{\prime}\rrbracket\models Q$ and $\llbracket t^{\prime}\rrbracket_{>0}\not\models Q$.

Lemma 55

$\mathcal{L}_{Q}$ is infinite iff $Q$ is not UCQ rewritable.

Proof.

Suppose $\mathcal{L}_{Q}$ is infinite. Since the trees at hand have bounded branching degree, for every $k\geq 0$ , there is a $t\in\mathcal{L}_{Q}$ such that $\llbracket t\rrbracket$ is a $C$ -tree (for some $C\subseteq\llbracket t\rrbracket$ ) that contains individuals whose distance from any $a\in\mathit{dom}(C)$ is greater than or equal to $k$ and $\llbracket t\rrbracket\not\models Q$ , yet for some extension $t^{\prime}$ of $t$ , we have $\llbracket t^{\prime}\rrbracket\models Q$ but $\llbracket t^{\prime}\rrbracket_{>0}\not\models Q$ . Suppose now that $Q$ is UCQ rewritable. Let $\ell$ be such that for all $C^{\prime}$ -trees $D$ (of the appropriate dimensions), $D\models Q$ implies $D_{\leq\ell}\models Q$ or $D_{>0}\models Q$ . Choose $k>\ell$ and $t,t^{\prime}$ such that

(i) $t^{\prime}$ is an extension of $t$ ,

(ii) $t$ has depth greater than $k$ , and

(iii) $\llbracket t\rrbracket\not\models Q$ but $\llbracket t^{\prime}\rrbracket\models Q$ and $\llbracket t^{\prime}\rrbracket_{>0}\not\models Q$ .

Since $\llbracket t^{\prime}\rrbracket\models Q$ , we know that $\llbracket t^{\prime}\rrbracket_{\leq\ell}\models Q$ or $\llbracket t^{\prime}\rrbracket_{>0}\models Q$ . The latter is impossible by assumption, the former contradicts the fact $\llbracket t\rrbracket\not\models Q$ , since $k>\ell$ . This proves the direction from left to right. The other direction is immediate. ∎

We are now ready to establish Proposition 31:

Proof of Proposition 31. We are now going to describe the construction of a 2WAPA $\mathfrak{A}$ such that $\mathcal{L}(\mathfrak{A})=\mathcal{L}_{Q}$, which will prove the claim by virtue of Lemma 55. This automaton is the intersection of several ones. First of all, we ensure that all the accepted $\Lambda$-trees are lean and consistent (cf. Lemma 54). We additionally intersect the automaton with the complement of $\mathfrak{A}_{Q,l}$ from Lemma 24 (more precisely, the version of it running on $\Lambda$-labeled trees) and another automaton $\mathfrak{D}_{Q}={(S,\Lambda,\delta,s_{0},\Omega)}$ whose construction we shall describe in more detail here. On a high level, $\mathfrak{D}_{Q}$ will be constructed so as to accept a lean and consistent $\Lambda$-labeled tree $t$ if and only if there is an extension $t^{\prime}$ of $t$ such that $\mathfrak{B}_{Q}$ accepts $t^{\prime}$. Let $\hat{S}$ be the set of states of $\mathfrak{B}_{Q}$, $\hat{\delta}$ its transition function, and $\hat{\Omega}$ its parity function. For $\sigma\in\Gamma_{\mathbf{S},l}$, let $\hat{B}(\sigma)$ be the set of tuples ${(s,s^{\prime})}\in\hat{S}\times(\hat{S}\cup\{\mathsf{true}\})$ such that the following holds:

•

There is a $\Gamma_{\mathbf{S},l}$-labeled tree $t={(T,\eta)}$ such that $\eta(\varepsilon)=\sigma$ and a run $t_{r}={(T_{r},\eta_{r})}$ of $\mathfrak{B}_{Q}$ on $t$ such that

1.

$\eta_{r}(\varepsilon)={(\varepsilon,s)}$, i.e., $t_{r}$ starts from $s$ (strictly speaking, $t_{r}$ is, of course, not a run since it does not start in the initial state);

2.

$s^{\prime}=\mathsf{true}$ and $t_{r}$ is accepting on $\mathfrak{B}_{Q}$, or there is a node $v\in T_{r}$ such that $\eta_{r}(v)={(\varepsilon,s^{\prime})}$.

Now the set of states of $\mathfrak{D}_{Q}$ is the same as that of $\mathfrak{B}_{Q}$, i.e., $S\coloneqq\hat{S}$. Accordingly, the initial state of $\mathfrak{D}_{Q}$ is that of $\mathfrak{B}_{Q}$. Furthermore, ${\Omega}(s)\coloneqq\hat{\Omega}(s)$ for every $s\in S$. Given $s\in S$ and $\rho\in\Lambda$, we let

[TABLE]

We now give an intuitive explanation of this construction. Roughly, a pair ${(s,s^{\prime})}\in\hat{B}(\sigma)$ indicates that there is a $\Gamma_{\mathbf{S},l}$-labeled tree $t$ and a run of $\mathfrak{B}_{Q}$ on $t$ such that the root of $t$ is labeled with $\sigma$, the run starts in state $s$, and either $\mathfrak{B}_{Q}$ accepts $t$, or it traverses the root again at some point, then being in state $s^{\prime}$. The set $\hat{B}(\sigma)$ can be computed a priori in 2ExpTime; considering that $\Gamma_{\mathbf{S},l}$ is of double-exponential size in the size of $Q$, it follows that the collection $\{\hat{B}(\sigma)\}_{\sigma\in\Gamma_{\mathbf{S},l}}$ can be computed in 2ExpTime. Now the input tree for $\mathfrak{D}_{Q}$ comes with labels from $\Pi$ of the form ${(s,s^{\prime})}$ in its leaves. These “types” amount to guesses of possible extensions of the input tree. Utilizing the sets $\hat{B}(\sigma)$, $\mathfrak{D}_{Q}$ thus explores the possible ways in which the given input tree can be extended to a $\Gamma_{\mathbf{S},l}$-labeled tree $t^{\prime}$ that is accepted by $\mathfrak{B}_{Q}$. $\square$


Bibliography


[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] A. Amarilli, M. Benedikt, P. Bourhis, and M. Vanden Boom. Query answering with transitive and linear-ordered data. In IJCAI, pages 893–899, 2016.
[3] T. J. Ameloot, B. Ketsman, F. Neven, and D. Zinn. Weaker forms of monotonicity for declarative networking: A more fine-grained answer to the calm-conjecture. In PODS, pages 64–75, 2014.
[4] T. J. Ameloot, B. Ketsman, F. Neven, and D. Zinn. Datalog queries distributing over components. In ICDT, pages 308–323, 2015.
[5] T. J. Ameloot, F. Neven, and J. V. den Bussche. Relational transducers for declarative networking. J. ACM, 60(2):15, 2013.
[6] M. Arenas, R. Hull, W. Martens, T. Milo, and T. Schwentick. Foundations of Data Management (Dagstuhl perspectives workshop 16151). Dagstuhl Reports, 6(4):39–56, 2016.
[7] F. Baader, M. Bienvenu, C. Lutz, and F. Wolter. Query and predicate emptiness in ontology-based data access. J. Artif. Intell. Res. (JAIR), 56:1–59, 2016.
[8] J.-F. Baget, M. Leclère, M.-L. Mugnier, and E. Salvat. On rules with existential variables: Walking the decidability line. Artif. Intell., 175(9-10):1620–1654, 2011.
