Containment for Rule-Based Ontology-Mediated Queries
Pablo Barcelo, Gerald Berger, Andreas Pieris

TL;DR
This paper investigates the containment problem for ontology-mediated queries expressed with guarded, non-recursive, and sticky tgds, providing complexity bounds and analyzing applications like component distribution and UCQ rewritability.
Contribution
It introduces tailored techniques for OMQ containment under specific tgd classes, establishing sharp complexity bounds and exploring practical applications.
Findings
Sharp complexity bounds for OMQ containment
Techniques tailored to guarded, non-recursive, and sticky tgds
Insights into distribution over components and UCQ rewritability
Abstract
Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to the decidability for the problem of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. In this work, we study the containment problem for OMQs expressed in such formalisms, which is a key ingredient for solving static analysis tasks associated with them. Our main contribution is the development of specially tailored techniques for OMQ containment under the classes of tgds stated above. This enables us to obtain sharp complexity bounds for the problems at hand, which in turn allow us to delimitate its practical applicability. We also apply our techniques to pinpoint the complexity of problems associated with two emerging…
| Arbitrary Arity | Bounded Arity | |||||
|---|---|---|---|---|---|---|
| Linear |
|
|
||||
| Sticky |
|
|
||||
| Non-recursive |
|
|
||||
| Guarded |
|
|
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSemantic Web and Ontologies · Advanced Database Systems and Queries · Service-Oriented Architecture and Web Services
Containment for Rule-Based Ontology-Mediated Queries
Pablo Barceló
Gerald Berger
Andreas Pieris
Center for Semantic Web Research &
DCC, University of Chile
Institute of Information Systems
TU Wien
School of Informatics
University of Edinburgh
Abstract
Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to the decidability for the problem of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. In this work, we study the containment problem for OMQs expressed in such formalisms, which is a key ingredient for solving static analysis tasks associated with them. Our main contribution is the development of specially tailored techniques for OMQ containment under the classes of tgds stated above. This enables us to obtain sharp complexity bounds for the problems at hand, which in turn allow us to delimitate its practical applicability. We also apply our techniques to pinpoint the complexity of problems associated with two emerging applications of OMQ containment: distribution over components and UCQ rewritability of OMQs.
1 Introduction
Motivation and goals. The novel application of knowledge representation tools for handling incomplete and heterogeneous data is giving rise to a new field, recently coined as knowledge-enriched data management [6]. A crucial problem in this field is ontology-based data access (OBDA) [51], which refers to the utilization of ontologies (i.e., sets of logical sentences) for providing a unified conceptual view of various data sources. Users can then pose their queries solely in the schema provided by the ontology, abstracting away from the specifics of the individual sources. In OBDA, one interprets the ontology and the user query , which is typically a union of conjunctive queries (UCQ), or, equivalently, the expressions defined by the select-project-join-union operators of relational algebra, as two components of one composite query , known as ontology-mediated query (OMQ); is called the data schema, indicating that will be posed on databases over [19]. Therefore, OBDA is often realized as the problem of answering OMQs.
Following recent work [24, 26, 27, 40], we focus on the case where the ontology is defined by a set of tuple-generating dependencies (tgds), a.k.a. existential rules or Datalog± rules. Handling such OMQs implies new challenges for classical database tasks. Interestingly, some of these challenges are by now well-studied; most notably (a) query evaluation [8, 24, 25, 27]: given an OMQ , a database over , and a tuple of constants , does belong to the evaluation of over every extension of that satisfies , or, equivalently, is a certain answer for over ? and (b) relative expressiveness [19, 42, 43]: how does the expressiveness of OMQs compare to the one of other query languages? Surprisingly, despite its prominence, no work to date has carried out an in-depth investigation of containment for OMQs based on tgds and UCQs.
Query containment is a fundamental static analysis task that amounts to check if the evaluation of a query is always contained in the evaluation of another query. Several database tasks crucially depend on the ability to check query containment; these include, e.g., query optimization, view-based query answering, querying incomplete databases, integrity checking, and implication of dependencies: cf. [22, 30, 36, 37, 39, 45]. A particularly important instance of the containment problem is the one defined by the class of CQs. It follows from the seminal work of Chandra and Merlin [29] that CQ containment is polynomially equivalent to CQ evaluation, and thus NP-complete. The NP upper bound is not affected if we consider UCQs [54]. This is seen as a positive result for practical applications that rely on UCQ containment, as the input (the two UCQs) is small. In addition, it shows a stark difference with more expressive relational query languages, e.g., relational algebra (or, equivalently, first-order logic), for which containment is undecidable.
The main goal of this work is to understand up to which extend the good computational properties of UCQ containment discussed above can be leveraged to the containment problem for OMQs based on tgds and UCQs (simply called OMQs from now on). In particular, we want to understand which classes of tgds guarantee the decidability of the problem, and, whenever this is the case, how can we obtain complexity bounds that are reasonable for practical purposes. We also want to understand what is the exact relationship between OMQ containment and evaluation for such classes. Let us stress that, apart from the traditional applications of containment mentioned above, it has been recently shown that OMQ containment has applications on other important static analysis tasks for OMQs, namely, distribution over components [15], and UCQ rewritability [16].
The context. As one might expect, when considered in its full generality, i.e., without any restrictions on the set of tgds, the OMQ containment problem is undecidable. To understand, on the other hand, which restrictions lead to decidability, we recall the two main reasons that render the general containment problem undecidable. These are:
Undecidability of query evaluation: OMQ evaluation is, in general, undecidable [12], and it can be reduced to OMQ containment. More precisely, OMQ containment is undecidable whenever query evaluation for at least one of the involved languages (i.e., the language of the left-hand or the right-hand side query) is undecidable.
Undecidability of containment for Datalog: decidability of query evaluation does not ensure decidability of query containment. A prime example is Datalog, or, equivalently, the OMQ language based on full tgds. Datalog containment is undecidable [55], and thus, OMQ containment is undecidable if the involved languages extend Datalog.
In view of the above observations, we focus on languages that (a) have a decidable query evaluation, and (b) do not extend Datalog. The main classes of tgds, which give rise to OMQ languages with the desirable properties, can be classified into three main families depending on the underlying syntactic restrictions: (i) guarded tgds [24], which contain inclusion dependencies and linear tgds, (ii) non-recursive sets of tgds [35], and (iii) sticky sets of tgds [27].
While the decidability of containment for the above OMQ languages can be established via translations into query languages with a decidable containment problem, such translations do not lead to optimal complexity upper bounds (details are given below). Therefore, the main goal of our paper is to develop specially tailored decision procedures for the containment problem under the OMQ languages in question, and ideally obtain precise complexity bounds. Our second goal is to exploit such techniques in the study of distribution over components and UCQ rewritability of OMQs.
Our contributions. The complexity of OMQ containment for the languages in question is given in Table 1. Using small fonts, we recall the complexity of OMQ evaluation in order to stress that containment is, in general, harder than evaluation. We divide our contributions as follows:
Linear, non-recursive and sticky sets of tgds. The OMQ languages based on linear, non-recursive, and sticky sets of tgds share a useful property: they are UCQ rewritable (implicit in [40]), that is, an OMQ can be rewritten into a UCQ. This property immediately yields decidability for their associated containment problems, since UCQ containment is decidable [54]. However, the obtained complexity bounds are not optimal, since the UCQ rewritings are unavoidably very large [40]. To obtain more precise bounds, we reduce containment to query evaluation, an idea that is often applied in query containment; see, e.g., [29, 31, 54].
Consider a UCQ rewritable OMQ language . If and belong to , both with data schema , then we can establish a small witness property, which states that non-containment of in can be witnessed via a database over whose size is bounded by an integer , the maximal size of a disjunct in a UCQ rewriting of . For linear tgds, such an integer is polynomial, but for non-recursive and sticky sets of tgds it is exponential (implicit in [40]). The above small witness property allows us to devise a simple non-deterministic algorithm, which makes use of query evaluation as a subroutine for checking non-containment of in : guess a database over of size at most , and then check if there is a certain answer for over that is not a certain answer for over . This algorithm allows us to obtain optimal upper bounds for OMQs based on linear and sticky sets of tgds; however, the exact complexity of OMQs based on non-recursive sets of tgds remains open:
- •
For OMQs based on linear tgds, the problem is in PSpace, and in if the arity is fixed. The PSpace-hardness is shown by reduction from query evaluation [47], while the -hardness is inherited from [17].
- •
For OMQs based on sticky sets of tgds, the problem is in coNExpTime, and in if the arity of the schema is fixed. The coNExpTime-hardness is shown by exploiting the standard tiling problem for the exponential grid, while the -hardness is inherited from [17].
- •
Finally, for OMQs based on non-recursive sets of tgds, containment is in ExpSpace and hard for P, even for fixed arity. The lower bound is shown by exploiting a recently introduced tiling problem [34].
We conclude that in all these cases OMQ containment is harder than evaluation, with one exception: the OMQs based on linear tgds over schemas of unbounded arity.
Guarded tgds. The OMQ language based on guarded tgds is not UCQ rewritable, which forces us to develop different tools to study its containment problem. Let us remark that guarded OMQs can be rewritten as guarded Datalog queries (by exploiting the translations devised in [9, 43]), for which containment is decidable in 2ExpTime [20]. But, again, the known rewritings are very large [43], and hence the reduction of containment for guarded OMQs to containment for guarded Datalog does not yield optimal upper bounds.
To obtain optimal bounds for the problem in question, we exploit two-way alternating parity automata on trees (2WAPA) [32]. We first show that if and are guarded OMQs such that is not contained in , then this is witnessed over a class of “tree-like” databases that can be represented as the set of trees accepted by a 2WAPA . We then build a 2WAPA with exponentially many states that recognizes those trees accepted by that represent witnesses to non-containment of in . Hence, is contained in iff accepts no tree. Since the emptiness problem for 2WAPA is feasible in exponential time in the number of states [32], we obtain that containment for guarded OMQs is in 2ExpTime. A matching lower bound, even for fixed arity schemas, follows from [16].
Similar ideas based on 2WAPA have been recently used to show that containment for OMQs based on expressive description logics (DLs) is in 2ExpTime [16]. In the DL context, schemas consist only of unary and binary relations. Our automata construction, however, is different from the one in [16] for two reasons: (a) we need to deal with higher arity relations, and (b) even for unary and binary relations, our OMQ language allows to express properties that are not expressible by the DL-based OMQ languages studied in [16].
Combining languages. The above complexity results refer to the containment problem relative to a certain OMQ language , i.e., both queries fall in . However, it is natural to consider the version of the problem where the involved OMQs fall in different languages. Unsurprisingly, if the left-hand side query is expressed in a UCQ rewritable OMQ language (based on linear, non-recursive or sticky sets of tgds), we can use the algorithm that relies on the small witness property discussed above, which provides optimal upper bounds for almost all the considered cases (the only exception is the containment of sticky in non-recursive OMQs over schemas of unbounded arity). Things are more interesting if the ontology of the left-hand side query is expressed using guarded tgds, while the ontology of the right-hand side query is not guarded. By exploiting automata techniques, we show that containment of guarded in non-recursive OMQs is in 3ExpTime, while containment of guarded in sticky OMQs is in 2ExpTime. We establish matching lower bounds, even over schemas of fixed arity, by refining techniques from [31].
Applications. Our techniques and results on containment for guarded OMQs can be applied to other important static analysis tasks, in particular, distribution over components and UCQ rewritability.
The notion of distribution over components has been introduced in [3], in the context of declarative networking, and it states that the answer to an OMQ can be computed by parallelizing it over the (maximally connected) components of the database. If this is the case, then can always be evaluated in a distributed and coordination-free manner. The problem of deciding distribution over components for OMQs has been recently studied in [15]. However, the exact complexity of the problem for guarded OMQs has been left open. By exploiting our results on containment, we can show that it is 2ExpTime-complete.
It is well-known that the OMQ language based on guarded tgds is not UCQ rewritable. In view of this fact, it is important to study when a given guarded OMQ can be rewritten as a UCQ. This has been studied for OMQs based on central Horn DLs [16, 18]. Interestingly, our automata-based techniques for guarded OMQ containment can be adapted to decide in 2ExpTime whether an OMQ based on guarded tgds over unary and binary relations is UCQ rewritable; a matching lower bound is inherited from [16]. Our result generalizes the result that deciding UCQ rewritability for OMQs based on , one of the most expressive members of the -family of DLs, is 2ExpTime-complete [16].
Discussion on Applicability. As shown in Table 1, the containment problem for OMQs based on linear sets of tgds is PSpace-complete, and thus can be solved in single-exponential time. This is not a big practical drawback since the containment problem corresponds to a static analysis task. In fact, the runtime is single exponential only in the size of the UCQs and the maximum arity of the underlying schema, which are typically very small. For such tasks, a single-exponential time procedure is considered to be acceptable, and it is actually the norm in many cases including database and verification problems; see, e.g., [1, 50, 52].
For OMQs based on sticky, non-recursive and guarded sets of tgds, the containment problem becomes coNExpTime-complete, P-hard and 2ExpTime-complete, respectively. This means that we require double-exponential time to solve the problem, which is practically not acceptable. Nevertheless, for sticky sets of tgds, the runtime is double-exponential only in the maximum arity of the schema, while for guarded sets of tgds is double-exponential only in the size of the UCQs and the maximum arity of the schema. This is good news since, as said above, the size of the UCQs and the arity are typically small, and usually UCQs in OMQs are much smaller than the ontologies.
For non-recursive sets of tgds, on the other hand, the runtime is double-exponential, not only in the maximum arity, but also in the number of predicates occurring in the ontology. It is unrealistic to assume that the number of predicates occurring in real-life ontologies is small. This fact, together with the fact that the precise complexity of OMQ containment for non-recursive sets of tgds is still open, suggests that a more careful complexity analysis is needed. This is left as an interesting open problem for future work.
Organization. Preliminaries are in Section 2. In Section 3 we introduce the OMQ containment problem. Containment for UCQ rewritable OMQs is studied in Section 4, and for guarded OMQs in Section 5. In Section 6 we consider the case where the involved queries fall in different languages. In Section 7 we discuss the applications of our results on guarded OMQ containment and we conclude in Section 8. Proofs and additional details can be found in the appendix.
2 Preliminaries
Databases and conjunctive queries.
Let , , and be disjoint countably infinite sets of constants, (labeled) nulls and (regular) variables (used in queries and dependencies), respectively. A schema is a finite set of relation symbols (or predicates) with associated arity. We write to denote that has arity . A term is a either a constant, null or variable. An atom over is an expression of the form , where is of arity and is an -tuple of terms. A fact is an atom whose arguments consist only of constants. An instance over is a (possibly infinite) set of atoms over that contain constants and nulls, while a database over is a finite set of facts over . We may call an instance and a database over an -instance and -database, respectively. The active domain of an instance , denoted , is the set of all terms occurring in .
A conjunctive query (CQ) over is a formula of the form:
[TABLE]
where each () is an atom without nulls over , each variable mentioned in the ’s appears either in or , and are the free variables of . If is empty, then is a Boolean CQ. As usual, the evaluation of CQs is defined in terms of homomorphisms. Let be an instance and a CQ of the form (1). A homomorphism from to is a mapping , which is the identity on , from the variables that appear in to the set of constants and nulls such that , for each . The evaluation of over , denoted , is the set of all tuples of constants such that is a homomorphism from to . We denote by the class of conjunctive queries.
A union of conjunctive queries (UCQ) over is a formula of the form , where each is a CQ of the form (1). The evaluation of over , denoted , is the set of tuples . We denote by the class of union of conjunctive queries.
Tgds and the chase procedure.
A tuple-generating dependency (tgd) is a first-order sentence of the form:
[TABLE]
where and are conjunctions of atoms without nulls. For brevity, we write this tgd as and use comma instead of for conjoining atoms. Notice that can be empty, in which case the tgd is called fact tgd and is written as . We assume that each variable in is mentioned in some atom of . We call and the body and head of the tgd, respectively. The tgd in (2) is logically equivalent to the expression , where and are the CQs and , respectively. Thus, an instance over satisfies this tgd iff . We say that an instance satisfies a set of tgds, denoted , if satisfies every tgd in . We denote by the class of (finite) sets of tgds.
The chase is a useful algorithmic tool when reasoning with tgds [24, 35, 47, 49]. We start by defining a single chase step. Let be an instance over a schema and a tgd over . We say that is applicable w.r.t. if there exists a tuple of terms in such that holds in . In this case, the result of applying over with is the instance that extends with every atom in , where is the tuple obtained by simultaneously replacing each variable with a fresh distinct null not occurring in . For such a single chase step we write .
Let us assume now that is an instance and a finite set of tgds. A chase sequence for under is a sequence:
[TABLE]
of chase steps such that: (1) ; (2) for each , is a tgd in ; and (3) . We call the result of this chase sequence, which always exists. Although the result of a chase sequence is not necessarily unique (up to isomorphism), each such result is equally useful for our purposes, since it can be homomorphically embedded into every other result. Thus, from now on, we denote by the result of an arbitrary chase sequence for under .
Ontology-mediated queries.
An ontology-mediated query (OMQ) is a triple , where is a schema, is a set of tgds (the ontology), and is a (U)CQ over (and possibly other predicates), with the set of predicates occurring in .111OMQs can be defined for arbitrary first-order theories, not only tgds, and first-order queries, not only UCQs [19]. We call the data schema. Notice that the set of tgds can introduce predicates not in , which allows us to enrich the schema of the UCQ . Moreover, the tgds can modify the content of a predicate , or, in other words, can appear in the head of a tgd of . We have explicitly included in the specification of the OMQ to emphasize that it will be evaluated over -databases, even though and might use additional relational symbols.
The semantics of an OMQ is given in terms of certain answers. The certain answers to a UCQ w.r.t. a database and a set of tgds is the set of tuples:
[TABLE]
Consider an OMQ . The evaluation of over an -database , denoted , is defined as . It is well-known that ; see, e.g., [24]. Thus, .
Ontology-mediated query languages.
We write for the OMQ language that consists of all OMQs of the form , where falls in the class of tgds, i.e., (concrete classes of tgds are discussed below), and the query falls in . A problem that is quite important for our work is OMQ evaluation, defined as follows:
PROBLEM :
INPUT : An OMQ ,
an -database , and .
QUESTION : Does ?
It is well-known that is undecidable; implicit in [12]. This has led to a flurry of activity for identifying syntactic restrictions on sets of tgds that make the latter problem decidable. Such a restriction defines a subclass of tgds. The known decidable classes of tgds are classified into three main decidability paradigms, which, in turn, give rise to decidable OMQ languages:
Guardedness: A tgd is guarded if its body contains an atom, called guard, that contains all the body-variables. Although the chase under guarded tgds does not necessarily terminate, the problem of deciding whether a tuple of constants is a certain answer to a UCQ w.r.t. a database and a set of guarded tgds is decidable. This follows from the fact that the result of the chase has bounded treewidth (see, e.g., [24]). Let be the class of (finite) sets of guarded tgds. Then:
Proposition 1
[24]* and are 2ExpTime-complete, and ExpTime-complete for fixed arity.*
An important subclass of guarded tgds is the class of linear tgds whose body consists of a single atom. We write for the class of (finite) sets of linear tgds.
Proposition 2
[25, 47]* and are PSpace-complete, and NP-complete for fixed arity.*
Non-recursiveness: A set of tgds is non-recursive (a.k.a. acyclic [35, 48]), if its predicate graph, the directed graph that encodes how the predicates of depend on each other, is acyclic. Non-recursiveness ensures the termination of the chase, and thus decidability of OMQ evaluation. Let be the class of non-recursive (finite) sets of tgds. Then:
Proposition 3
[48]* and are NExpTime-complete, even for fixed arity.*
Stickiness:* This condition ensures neither termination nor bounded treewidth of the chase. Instead, the decidability of OMQ evaluation is obtained by exploiting query rewriting techniques (more details on query rewriting are given in Section 4). The goal of stickiness is to capture joins among variables that are not expressible via guarded tgds, but without forcing the chase to terminate. The key property underlying this condition can be described as follows: during the chase, terms that are associated (via a homomorphism) with variables that appear more than once in the body of a tgd (i.e., join variables) are always propagated (or “stick”) to the inferred atoms. This is illustrated in Figure 1(a); the left set of tgds is sticky, while the right set is not. The formal definition is based on an inductive marking procedure that marks the variables that may violate the semantic property of the chase described above [27]. Roughly, during the base step of this procedure, a variable that appears in the body of a tgd but not in every head-atom of is marked. Then, the marking is inductively propagated from head to body as shown in Figure 1(b). Finally, a finite set of tgds is sticky if no tgd in contains two occurrences of a marked variable. Let be the class of sticky (finite) sets of tgds. Then:*
Proposition 4
[27]* and are ExpTime-complete, and NP-complete for fixed arity.*
3 OMQ Containment: The Basics
The goal of this work is to study in depth the problem of checking whether an OMQ is contained in an OMQ , both over the same data schema , or, equivalently, whether over every (finite) -database . In this case we write ; we write if and . The OMQ containment problem in question is defined as follows; and are OMQ languages , where is a class of tgds (e.g., linear, non-recursive, sticky, etc.), and :
PROBLEM* :*
INPUT* :* Two OMQs and .
QUESTION* :* Does ?
**
Whenever , we refer to the containment problem by simply writing .
In what follows, we establish some simple but fundamental results, which help to better understand the nature of our problem. We first investigate the relationship between evaluation and containment, which in turn allows us to obtain an initial boundary for the decidability of our problem, i.e., we can obtain a positive result only if the evaluation problem for the involved OMQ languages is decidable (e.g., those introduced in the previous section). We then focus on the OMQ languages introduced in Section 2 and observe that, once we fix the class of tgds, it does not make a difference whether we consider CQs or UCQs. In other words, we show that an OMQ in , where , can be rewritten as an OMQ in . This fact simplifies our later complexity analysis since for establishing upper (resp., lower) bounds it suffices to focus on CQs (resp., UCQs).
3.1 Evaluation vs. Containment
As one might expect, OMQ evaluation and OMQ containment are strongly connected. In fact, as we explain below, the former can be easily reduced to the latter. But let us first introduce some auxiliary notation. Consider a database and a tuple , where . We denote by , where , the CQ obtained from the conjunction of atoms occurring in after replacing each constant with the variable . Consider now an OMQ , where is some class of tgds, an -database , and a tuple . It is not difficult to show that
[TABLE]
Let be the OMQ language that consists of all OMQs of the form , i.e., the set of tgds is empty, where is a CQ. It is clear that and . Therefore, for every OMQ language , where is a class of tgds, we immediately get that:
Proposition 5
* can be reduced in polynomial time into .*
We now show that the problem of evaluation is reducible to the complement of containment. Let us say that, for technical reasons which will be made clear in a while, we focus our attention on classes of tgds that are closed under fact tgd extension, i.e., for every set , a set obtained from by adding a (finite) set of fact tgds is still in . This is not an unnatural assumption since every reasonable class of tgds, such as the ones introduced above, enjoy this property. Consider now an OMQ , where is some class of tgds, an -database , and a tuple . It is easy to see that
[TABLE]
where is obtained from by renaming each predicate in into and adding the set of fact tgds
[TABLE]
* is obtained from by renaming each predicate into , and the predicate does not occur in . Indeed, the above equivalence holds since implies that , for every -database . Since is closed under fact tgd extension, , while . We write for the complement of . Hence, for every OMQ language , where is a class of tgds (closed under fact tgd extension), it holds that:*
Proposition 6
* can be reduced in polynomial time into .*
By definition, is contained in every OMQ language , where is a class of tgds. Therefore, as a corollary of Propositions 5 and 6, we obtain an initial boundary for the decidability of OMQ containment: we can obtain a positive result only if the evaluation problem for the involved OMQ languages is decidable. More formally:
Corollary 7
* is undecidable if is undecidable or is undecidable.*
Can we prove the converse of Corollary 7: is decidable if both and are decidable? The answer to this question is negative. This is due to the fact that containment of Datalog queries is undecidable **[55]**. Since Datalog queries can be directly encoded in the OMQ language based on the class of full tgds, i.e., those without existentially quantified variables, we obtain the following:
Proposition 8
[55]* is undecidable.*
This result, combined with the fact that is decidable (since the chase under full tgds always terminates), implies that the converse of Corollary 7 does not hold. Proposition 8 also rules out the OMQ languages that are based on classes of tgds that extend ; e.g., the weak versions of the ones introduced in Section 2, called weakly guarded **[24]**, weakly acyclic **[35]**, and weakly sticky **[27]** that guarantee the decidability of OMQ evaluation.222The idea of those classes is the same: relax the conditions in the definition of the class, so that only those positions that receive null values during the chase are taken into account. The question that comes up concerns the decidability and complexity of containment for the OMQ languages that are based on the non-weak versions of the above classes, i.e., guarded, non-recursive, and sticky. This will be the subject of the next two sections.
3.2 From UCQs to CQs
Before we proceed with the complexity analysis of containment for the OMQ languages in question, let us state the following useful result:
Proposition 9
Given an OMQ , where , we can construct in polynomial time an OMQ such that .
The proof of Proposition 9 relies on the idea of encoding boolean operations (in our case the ‘or’ operator) using a set of atoms; this idea has been used in several other works (see, e.g., **[14, 21, 41]**). Proposition 9 allows us to focus on OMQs that are based on CQs; in fact, is -complete, where and is a complexity class that is closed under polynomial time reductions, iff is -complete.
3.3 Plan of Attack
We are now ready to proceed with the complexity analysis of containment for the OMQ languages in question. Our plan of attack can be summarized as follows:
- •
We consider, in Section 4, , for . These languages enjoy a crucial property, called UCQ rewritability, which is very useful for our purposes. This property allows us to show the following result: if the containment does not hold, then this is witnessed via a “small” database, which in turn allows us to devise simple guess-and-check algorithms.
- •
We then proceed, in Section 5, with . This OMQ language does not enjoy UCQ rewritability, and the task of establishing a small witness property as above turned out to be challenging. However, we show the following: if the containment does not hold, then this is witnessed via a “tree-shaped” database, which allows us to devise a decision procedure based on two-way alternating parity automata on finite trees.
- •
In Section 6, we study the case where the OMQ containment problem involves two different languages. If the left-hand side language is UCQ rewritable, then we can devise a guess-and-check algorithm by exploiting the above small witness property. The challenging case is when the left-hand side language is , where again we employ techniques based on tree automata.
4 UCQ Rewritable Languages
We now focus on OMQ languages that enjoy the crucial property of UCQ rewritability.
Definition 1**.**
*(UCQ Rewritability) An OMQ language , where , is UCQ rewritable if, for each OMQ we can construct a UCQ such that for every -database . *
We proceed to establish our desired small witness property, based on UCQ rewritability. By the definition of UCQ rewritability, for each language that is UCQ rewritable, there exists a computable function from to the natural numbers such that the following holds: for every OMQ , and UCQ rewriting of , it is the case that , where denotes the number of atoms occurring in . Then:
Proposition 10
Consider a UCQ rewritable language , and two OMQs and , both with data schema . If , then there exists an -database , where , such that .
In Proposition 10 we assume that the left-hand side query falls in a UCQ rewritable language, be we do not pose any restriction on the language of the right-hand side query. Thus, we immediately get a decision procedure for if is UCQ rewritable and is decidable. Given and :
Guess an -database such that , and a tuple ; and 2. 2.
Verify that and .
We immediately get that:
Theorem 11
* is decidable if is UCQ rewritable and is decidable.*
This general result shows that is decidable for every , but it says nothing about its complexity. This will be the subject of the rest of the section.
4.1 Linearity
The problem of computing UCQ rewritings for OMQs in has been studied in **[40]**, where a resolution-based procedure, called , has been proposed. This rewriting algorithm accepts a query and constructs a UCQ rewriting over by starting from and exhaustively applying rewriting steps based on resolution. Let us illustrate this via a simple example:
Example 1**.**
Assume that . Consider the set consisting of the linear tgds
[TABLE]
*and the CQ . will first resolve the atom in using the second tgd, and produce the CQ , which is equivalent to the CQ . Then, will be resolved using the first tgd, and the CQ will be obtained, which in turn will be resolved using the third tgd in order to produce the CQ . The UCQ rewriting is . *
It is easy to see that, whenever the input OMQ consists of linear tgds, during the execution of it is not possible to obtain a CQ that has more atoms than the original one. This is an immediate consequence of the fact that linear tgds have only one atom in their body. Then:
Proposition 12
f_{(\mathbb{L},\mathbb{CQ})}\big{(}(\mathbf{S},\Sigma,q)\big{)}\ \leq\ |q|.
Having the above result in place, it can be shown that the algorithm underlying Theorem 11 guesses a polynomially sized witness to non-containment, and then calls a -oracle for solving query evaluation under linear OMQs, where is PSpace in general, and NP if the arity is fixed; these complexity classes are obtained from Proposition 2. Therefore, is in PSpace in general, and in in case of fixed arity. Regarding the lower bounds, Proposition 5 allows us to inherit the PSpace-hardness of ; this holds even for constant-free tgds. Unfortunately, in the case of fixed arity, we can only obtain NP-hardness, while Proposition 6 allows to obtain coNP*-hardness. Nevertheless, it is implicit in [17] (see the proof of Theorem 9), where the containment problem for OMQ languages based on description logics is considered, that is -hard, even for tgds of the form . Then:*
Theorem 13
* is PSpace-complete, and -complete if the arity of the schema is fixed. The lower bounds hold even for tgds without constants.*
4.2 Non-Recursiveness
Although the OMQ language is not explicitly considered in **[40]**, where the algorithm is defined, the same algorithm can deal with . By analyzing the UCQ rewritings constructed by , whenever the input query falls in , we can establish the following result; here, denotes the body of the tgd :
Proposition 14
It holds that
[TABLE]
Proposition 14 implies that non-containment for queries that fall in is witnessed via a database of at most exponential size. We show next that this bound is optimal:
Proposition 15
There are sets of OMQs
[TABLE]
where , such that for every -database , if then .
Let us now focus on the complexity of . The algorithm underlying Theorem 11, together with the exponential bound provided by Proposition 14, implies that is feasible in non-deterministic exponential time with access to a NExpTime oracle, which immediately implies that is in ExpSpace. Unfortunately, the exact complexity of is still an open problem, and we conjecture that is P-complete; recall that . In what follows, we briefly explain how the P-hardness is obtained. To this end, we exploit a tiling problem that has been recently introduced in **[34]**. Roughly speaking, an instance of this tiling problem is a triple , where is an integer in unary representation, and are standard tiling problems for the exponential grid . The question is whether, for every initial condition of length , has no solution with or has some solution with . The initial condition simply fixes the first tiles of the first row of the grid. We construct in polynomial time two queries and such that has a solution iff . The idea is to force every input database to store an initial condition of length , and then encode the problem whether has a solution with into , for each . From the above discussion we get that:
Theorem 16
* is in ExpSpace, and P-hard. The lower bound holds even if the arity of the schema is fixed and the tgds are without constants.*
4.3 Stickiness
We now focus on . As shown in **[40]**, given a query , there exists an execution of that constructs a UCQ rewriting over with the following property: for each , if a variable occurs in in more than one atom, then already occurs in . This property has been used in **[40]** to bound the number of atoms that can appear in a single CQ . We write for the set of terms (constants and variables) occurring in , for the set of constants occurring in , and for the maximum arity over all predicates of .
Proposition 17
It holds that
[TABLE]
Proposition 17 implies that non-containment for queries is witnessed via a database of at most exponential size. As for queries, we can show that this bound is optimal; here, for a set of tgds, we denote by the number of symbols occurring in :
Proposition 18
There exists a set of OMQs
[TABLE]
such that for every and -database , if then .
We now study the complexity of . We first focus on schemas of unbounded arity. Proposition 17 implies that the algorithm underlying Theorem 11 runs in exponential time assuming access to a -oracle, where is a complexity class powerful enough for solving and its complement. But, since is in ExpTime (see Proposition 4), both and its complement are in NExpTime, and thus, the oracle call is not really needed. Consequently, is in NExpTime.
A matching lower bound is obtained by a reduction from the standard tiling problem for the exponential grid . In fact, the same lower bound has been recently established in **[15]**; however, our result is stronger as it shows that the problem remains hard even if the right-hand side query is a linear OMQ of a simple form – this is also discussed in Section 6, where containment of queries that fall in different OMQ languages is studied. Regarding schemas of fixed arity, Proposition 17 provides a witness for non-containment of polynomial size, which implies that the algorithm underlying Theorem 11 runs in polynomial time with access to an NP-oracle. Therefore, is in , while a matching lower bound is implicit in **[17]**. Then:
Theorem 19
* is coNExpTime-complete, even if the set of tgds uses only two constants. In the case of fixed arity, it is -complete, even for constant-free tgds.*
Clearly, there exists a double-exponential time algorithm for solving , which might sound discouraging. However, Proposition 17 implies that the runtime is double-exponential only in the maximum arity of the data schema.
5 Guardedness
We proceed with the problem of containment for guarded OMQs, and we establish the following result:
Theorem 20
* is 2ExpTime-complete. The lower bound holds even if the arity of the schema is fixed, and the tgds are without constants.*
The lower bound is immediately inherited from **[16]**, where it is shown that containment for OMQs based on the description logic is 2ExpTime*-hard. Recall that a set of axioms can be equivalently rewritten as a constant-free set of guarded tgds using only unary and binary predicates, which implies the lower bound stated in Theorem 20. However, we cannot immediately inherit the desired upper bound since the DL-based OMQ languages considered in [16] are either weaker than or incomparable to . Nevertheless, the technique developed in [16] was extremely useful for our analysis. Actually, our automata-based procedure exploits a combination of ideas from [16, 44]. The rest of this section is devoted to providing a high-level explanation of this procedure.*
For the sake of technical clarity, we focus on constant-free tgds and CQs, but all the results can be extended to the general case at the price of more involved definitions and proofs. Moreover, for simplicity, we focus on Boolean CQs. In other words, we study the problem for , where denotes the class of Boolean CQs. This does not affect the generality of our proof since it is known that can be reduced in polynomial time to **[16]**.
A first glimpse.
As already said, is not UCQ rewritable and, therefore, we cannot employ Proposition 10 in order to establish a small witness property as for the languages considered in Section 4. We have tried to establish a small witness property for by following a different route, but it turned out to be a difficult task. Nevertheless, we can show a tree witness property, which states that non-containment for is witnessed via a tree-like database. This allows us to devise a procedure based on alternating tree automata. Summing up, the proof for the 2ExpTime* membership of proceeds in three steps:*
Establish a tree witness property; 2. 2.
Encode the tree-like witnesses as trees that can be accepted by an alternating tree automaton; and 3. 3.
Construct an automaton that decides ; in fact, we reduce into emptiness for two-way alternating parity automata on finite trees.
Each one of the above three steps is discussed in more details in the following three sections.
5.1 Tree Witness Property
From the above informal discussion, it is clear that tree-like databases are crucial for our analysis. Let us make this notion more precise using guarded tree decompositions. A tree decomposition of a database is a labeled rooted tree , where , such that: (i) for each atom , there exists such that , and (ii) for every term , the set induces a connected subtree of . The tree decomposition is called -guarded, where , if, for every node , there exists an atom such that . We write for the root node of , and , where , for the subset of induced by . We are now ready to formalize the notion of the tree-like database:
Definition 2**.**
An -database is a -tree, where , if there is a tree decomposition of such that:
* and* 2. 2.
* is -guarded. *
Roughly, whenever a database is a -tree, is the cyclic part of , while the rest of is tree-like. Interestingly, for deciding it suffices to focus on databases that are -trees and depends only on the left-hand side OMQ. Recall that for a schema we write for the maximum arity over all predicates of . Then:
Proposition 21
Let , for . The following are equivalent:
. 2. 2.
, for every -tree -database such that .
The fact that holds trivially, while is shown by using a variant of the notion of guarded unravelling and compactness. Let us clarify that the above result does not provide a decision procedure for , since we have to consider infinitely many databases that are -trees with .
5.2 Encoding Tree-like Databases
It is generally known that a database whose treewidth333Recall that the treewidth of a database is the minimum width among all possible tree decompositions of , while the width of is defined as .* is bounded by an integer can be encoded into a tree over a finite alphabet of double-exponential size in that can be accepted by an alternating tree automaton; see, e.g., [13]. Consider an alphabet , and let be the set of finite sequences of natural numbers, including the empty sequence. A -labeled tree is a pair , where is closed under prefixes, and is the labeling function. The elements of identify the nodes of . It can be shown that and a tree decomposition of with width can be encoded as a -labeled tree , where is an alphabet of double-exponential size in , such that each node of corresponds to exactly one node of and vice versa.*
Consider now a -tree -database , and let be the tree decomposition that witnesses that is a -tree. The width of is at most , and thus, the treewidth of is bounded by . Hence, from the above discussion, and can be encoded as a -labeled tree, where is of double-exponential size in . In general, given an -database that is a -tree due to the tree decomposition , we show that and can be encoded as a -labeled tree, with and being double-exponential in and exponential in and .
Although every -tree -database can be encoded as a -labeled tree, the other direction does not hold. In other words, it is not true that every -labeled tree encodes a -tree -database and its corresponding tree decomposition. In view of this fact, we need the additional notion of consistency. A -labeled tree is called consistent if it satisfies certain syntactic properties – we do not give these properties here since they are not vital in order to understand the high-level idea of the proof. Now, given a consistent -labeled tree , we can show that can be decoded into an -database that is a -tree with . From the above discussion and Proposition 21, we obtain:
Lemma 22
Let , for . The following are equivalent:
. 2. 2.
, for every consistent -labeled tree , where .
5.3 Constructing Tree Automata
Having the above result in place, we can now proceed with our automata-based procedure. We make use of two-way alternating parity automata (2WAPA) that run on finite labeled trees. Two-way alternating automata process the input tree while branching in an alternating fashion to successor states, and thereby moving either down or up the input tree; the detailed definition can be found in **[11]**. Our goal is to reduce to the emptiness problem for 2WAPA. As usual, given a 2WAPA , we denote by the language of , i.e., the set of labeled trees it accepts. The emptiness problem is defined as follows: given a 2WAPA , does ? Thus, given , we need to construct a 2WAPA such that iff . It is well-known that deciding whether is feasible in exponential time in the number of states, and in polynomial time in the size of the input alphabet **[32]**. Therefore, we should construct in double-exponential time, while the number of states must be at most exponential.
We first need a way to check consistency of labeled trees. It is not difficult to devise an automaton for this task.
Lemma 23
Consider a schema and an integer . There is a 2WAPA that accepts a -labeled tree iff is consistent. The number of states of is logarithmic in the size of . Furthermore, can be constructed in polynomial time in the size of .
Now, the crucial task is, given an OMQ , to devise an automaton that accepts labeled trees which correspond to databases that make true.
Lemma 24
Let . There is a 2WAPA , where , that accepts a consistent -labeled tree iff . The number of states of is exponential in and . Furthermore, can be constructed in double-exponential time in and .
The intuition underlying can be described as follows. tries to identify all the possible ways the CQ can be mapped to , for any -tree -database such that . It then arrives at possible ways how the input tree can satisfy . These “possible ways” correspond to squid decompositions, a notion introduced in **[24]** that indicates which part of the query is mapped to the cyclic part of , and which to the tree-like part of . The automaton exhaustively checks all squid decompositions by traversing the input tree and, at the same time, explores possible ways how to match the single parts of the squid decomposition at hand. The automaton finally accepts if it finds a squid decomposition that can be mapped to .
Having the above automata in place, we can proceed with our main technical result, which shows that can be reduced to the emptiness problem for 2WAPA. But let us first recall some key results about 2WAPA, which are essential for our final construction. It is well-known that languages accepted by 2WAPAs are closed under intersection and complement. Given two 2WAPAs and , we write for a 2WAPA, which can be constructed in polynomial time, that accepts the language . Moreover, for a 2WAPA , we write for the 2WAPA, which is also constructible in polynomial time, that accepts the complement of . We can now show the following:
Proposition 25
Consider . We can construct in double-exponential time a 2WAPA , which has exponentially many states, such that
[TABLE]
Proof (sketch).* Let , for , and . Then is defined as . Since has double-exponential size, Lemmas 23 and 24 imply that can be constructed in double-exponential time, while it has exponentially many states. Lemma 22 implies that iff . *
**
Proposition 25 implies that is in 2ExpTime*, and Theorem 20 follows. Thus, there exists a double-exponential time algorithm for solving . Interestingly, the runtime is double-exponential only in the size of the CQs and the maximum arity of the schema. This can be obtained by a providing a more refined complexity analysis of the construction of the 2WAPA in Proposition 25.*
6 Combining Languages
In the previous two sections, we studied the containment problem relative to a language , i.e., both OMQs fall in . However, it is natural to consider the version of the problem where the involved OMQs fall in different languages. This is the goal of this section. Our analysis proceeds by considering the two cases where the left-hand side (LHS) query falls in a UCQ rewritable OMQ language, or it is guarded.
6.1 The LHS Query is UCQ Rewritable
As an immediate corollary of Theorem 11 we obtain the following result: , for , and , is decidable. By exploiting the algorithm underlying Theorem 11, we establish optimal upper bounds for all the problems at hand with the only exception of . For the latter, we obtain an ExpSpace upper bound, by providing a similar analysis as for , while a NExpTime lower bound is inherited from query evaluation by exploiting Proposition 5. It is rather tedious, and not very interesting from a technical point of view, to go through all the containment problems in question444There are eighteen different cases obtained by considering all the possible pairs of OMQ languages, where and is UCQ rewritable, and the two cases whether the arity of the schema is fixed or not.* and explain in details how the exact upper bounds are obtained; we leave this as an exercise to the interested reader.*
Regarding the matching lower bounds, in most of the cases they are inherited from query evaluation or its complement by exploiting Propositions 5 and 6, respectively. There are, however, some exceptions:
- •
* in the case of unbounded arity, where the problem is coNExpTime-hard, even for sets of tgds that use only two constants. This is shown by a reduction from the standard tiling problem for the exponential grid .*
- •
* and in the case of bounded arity, where both problems are -hard even for constant-free tgds; implicit in [17].*
6.2 The LHS Query is Guarded
We proceed with the case where the LHS query is guarded, and we show the following result:
Theorem 26
* is -complete:*
[TABLE]
The lower bounds hold even if the arity of the schema is fixed. Moreover, for (resp., ) it holds even for tgds with one constant (resp., without constants).
Upper bounds.
The 2ExpTime* membership when is an immediate corollary of Theorem 20. This is not true when since the right-hand side query is not guarded. But in this case, since and are UCQ rewritable, one can rewrite the right-hand side query as a UCQ, and then apply the machinery developed in Section 5 for solving . More precisely, given OMQs and , where , iff , where is a UCQ rewriting of . Thus, an immediate decision procedure, which exploits the algorithm , is the following:*
Let ; 2. 2.
For each : if , then proceed; otherwise, reject; and 3. 3.
Accept.
The above procedure runs in triple-exponential time. The first step is feasible in double-exponential time **[40]**. Now, for a single CQ (which is a guarded OMQ with an empty set of tgds) the check whether can be done by using the machinery developed in Section 5, which reduces our problem to checking whether the language of a 2WAPA is empty. However, it should not be forgotten that is of exponential size, and thus, the automaton has double-exponentially many states. This in turn implies that checking whether is in 3ExpTime*, as claimed.*
Although the above algorithm establishes an optimal upper bound for non-recursive OMQs, a more refined analysis is needed for sticky OMQs. In fact, we need a more refined complexity analysis for the problem , that is, to decide whether a guarded OMQ is contained in a UCQ. To this end, we provide an automata construction different from the one employed in Section 5, which allows us to establish a refined complexity upper bound for the problem in question. Consider a query , and a UCQ . As usual, we write and for the number of symbols that occur in and , respectively, and we write for the set of variables that appear in more than one atom of . By exploiting our new automata-based procedure, we show that the problem of checking if is feasible in double-exponential time in , exponential time in , and polynomial time in .
This result allows us to show that the above procedure establishes 2ExpTime*-membership when the right-hand side OMQ is sticky. But first we need to recall the following key properties of the UCQ rewriting , constructed during the first step of the algorithm:*
* consists of double-exponentially many CQs,* 2. 2.
each CQ of is of exponential size, and 3. 3.
for each , is a subset of the variables of the original CQ that appears in .
By combining these key properties with the complexity analysis performed above, it is now straightforward to show that is in 2ExpTime*.*
Lower Bounds.
We establish matching lower bounds by refining techniques from **[31]**, where it is shown that containment of Datalog in UCQ is 2ExpTime*-complete, while containment of Datalog in non-recursive Datalog is 3ExpTime-complete; the lower bounds hold for fixed-arity predicates, and constant-free rules. Interestingly, the LHS query can be transformed into a Datalog query such that each rule has a body-atom that contains all the variables, i.e., is guarded. This is achieved by increasing the arity of some predicates in order to have enough positions for all the body-variables. However, for each rule, the number of unguarded variables that we need to guard is constant, and thus, the arity of the schema remains constant. We conclude that is 3ExpTime-hard. Moreover, containment of guarded OMQs in UCQs is 2ExpTime-hard, which in turn allows us to show, by exploiting the construction underlying Proposition 9, that is 2ExpTime-hard, even if the set of linear tgds uses only one constant, while is 2ExpTime-hard, even for tgds without constants.*
7 Applications
Interestingly, our results on can be applied to other important static analysis tasks, in particular, distribution over components and UCQ rewritability. Each one of those tasks is considered in the following two sections.
7.1 Distribution Over Components
The notion of distribution over components has been introduced in **[3]**, and it states that the answer to a query can be computed by parallelizing it over the (maximally connected) components of the input database. But let us first make precise what a component is. A set of atoms is connected if for all , there exists a sequence of atoms in such that , , and , for each . We call a component of if (i) is connected, and (ii) for every , is not connected.555For technical clarity, the notion of component is defined only for sets of atoms that do not contain [math]-ary atoms. Let be the set of components of . We are now ready to introduce the notion of distribution over components. Consider an OMQ . We say that distributes over components if , where , for every -database . In this case, can be computed without any communication over a network using a distribution where every computing node is assigned some of the components of the database, and every component is assigned to at least one computing node. In other words, can be evaluated in a distributed and coordination-free manner; for more details on coordination-free evaluation see **[3, 4, 5]**. Therefore, it would be quite beneficial if we can decide whether an OMQ distributes over components, and thus, we obtain the following interesting static analysis task:
PROBLEM* :*
INPUT* :* An OMQ .
QUESTION* :* Does distributes over components?
**
The above problem has been studied in **[15]**, where tight complexity bounds for and have been established. However, its exact complexity for guarded OMQs has been left open. Our results on containment for guarded OMQs allow us to close this problem. But first we need to recall a key result that semantically characterizes distribution over components. An OMQ with data schema is unsatisfiable if there is no -database such that . Moreover, for a CQ , we write for its components. The next result has been shown in **[15]**:
Proposition 27
Let . The following are equivalent:
* distributes over components.* 2. 2.
* is unsatisfiable or there exists such that .*
Checking unsatisfiability can be easily reduced to containment. Thus, the above result, together with Theorem 20, implies that is in 2ExpTime*, while a matching lower bound is implicit in [15]. Then:*
Theorem 28
* is 2ExpTime-complete.*
7.2 Deciding UCQ Rewritability
Query rewriting is a well-studied method for evaluating OMQs using standard database technology. The key idea is the following: given an OMQ , combine and into a new query , the so-called rewriting, which can then be evaluated over yielding the same answer as over , for every -database . For this approach to be realistic, though, it is essential that the rewriting is expressed in a language that can be handled by standard database systems. The typical language that is considered in this setting is first-order (FO) queries **[28]**. Notice, however, that due to Rossman’s Theorem **[53]**, and the fact that OMQs are closed under homomorphisms, FO and UCQ rewritability coincide. Recall that some OMQ languages are UCQ rewritable, such as the ones based on linear, non-recursive and sticky sets of tgds, while others are not, e.g., guarded OMQs. For those languages that are not UCQ rewritable, it is important to be able to check whether a query can be rewritten as a UCQ, in which case we say that it is UCQ rewritable. This gives rise to the following fundamental static analysis task for an OMQ language , where :
PROBLEM* :*
INPUT* :* An OMQ .
QUESTION* :* Is it the case that is UCQ rewritable?
**
Bienvenu et al. have recently carried out an in-depth study of the above problem for OMQ languages based on central Horn-DLs **[16]**. One of their main results is that the above problem for the OMQ language based on , one of the most expressive members of the -family of DLs, is 2ExpTime*-complete. Interestingly, by adapting the tree automata techniques developed in Section 5, we can generalize the above result: deciding UCQ rewritability for the OMQ language based on guarded tgds over unary and binary relations is in 2ExpTime. Let be the class of (finite) sets of guarded tgds over unary and binary relations. Then:*
Theorem 29
* is 2ExpTime-complete.*
Since the lower bound is inherited from **[16]**, we concentrate on the upper bound. As in Section 5, we can focus on BCQs, i.e., it suffices to show that is in 2ExpTime*. Our proof proceeds in two steps:*
We semantically characterize UCQ rewritability for queries in in terms of a certain boundedness property for the set of -trees defined in Section 5. 2. 2.
We extend the techniques developed in Section 5 and construct in double-exponential time a 2WAPA that has exponentially many states, such that the aforementioned boundedness property does not hold iff is infinite. (Such an infinity problem for tree automata has been used to obtain the decidability of the boundedness problem for monadic Datalog **[32, 56]**).
Our 2ExpTime* upper bound then follows since the infinity problem for a 2WAPA , i.e., checking if is infinite, is feasible in exponential time in the number of states, and in polynomial time in the size of the alphabet. This follows from two known results: (a) The 2WAPA can be converted into an equivalent non-deterministic tree automata with a single-exponential blow up in the number of states [57], and (b) solving the infinity problem for non-deterministic tree automata is feasible in polynomial time; cf. [56].*
It is worth contrasting our proof with the one in **[16*]** for , which does not make use of the infinity problem for 2WAPA, but applies a different argument based on pumping. This leads to a finer complexity analysis in terms of the size of the different components of the OMQ, but, in our opinion, makes the proof conceptually harder. *
The semantic characterization.* To establish the semantic characterization from step 1, we need to define the notion of distance from the root for an element in a -tree database . Intuitively, this corresponds to the minimal distance between a node that contains and the root of a tree decomposition of that witnesses the fact that is a -tree. We do not consider all such tree decompositions, however, but concentrate on a well-behaved subclass, which we call the lean tree decompositions of the -tree ; the formal definition can be found in [11], as it does not add much to the explanation we provide here. Due to the fact that we focus on unary and binary relations, such lean tree decompositions ensure the invariance of the notion of distance from the root, by severely limiting the level of redundancy allowed in a tree representation of . Therefore, it does not matter which lean tree decomposition we choose, since in all of them the distance of an element from the root will be the same. Let be the subinstance of induced by the set of elements whose distance from the root is at most , and let be the subinstance of induced by the set of elements whose distance from the root is at least .*
Another useful notion is the branching degree of a tree decomposition , that is, the maximum number of child nodes over all nodes of . Again, lean tree decompositions ensure the invariance of the branching degree. This allows us to define the branching degree of a -tree database as the branching degree of a lean tree decomposition that witnesses the fact that is a -tree.
It follows from **[16]** that being able to decide containment for the OMQ language (as we have done in Section 5) allows us to concentrate on connected CQs when deciding UCQ rewritability. This simplifies technicalities considerably and, in turn, allows us to obtain our desired semantic characterization of UCQ rewritability:
Proposition 30
Let , where is connected. The following are equivalent:
* is UCQ rewritable.* 2. 2.
There exist (which depend only on ) s.t.
[TABLE]
for each -tree -database with and branching degree at most .
The reduction to the infinity problem.* We now proceed with step 2, and we explain how the boundedness property established in item (2) of Proposition 30 can be reduced to the infinity problem for 2WAPAs. As in Section 5, we do not reason with -tree databases directly, but we deal with their encodings as consistent -labeled trees. In fact, using the same ideas as in Lemma 22, we can show by exploiting Proposition 30 that the following are equivalent:*
- (i)
* is UCQ rewritable.* 2. (ii)
There are such that
[TABLE]
for every consistent -labeled tree with and whose branching degree is bounded by .
Let us write Boundedness for the property expressed in item (ii) above, which can be reduced to the problem of checking whether some tree language is finite. Let be the set of all -labeled trees of branching degree at most such that: (1) and (2) there is some “extension” of , with branching degree , such that and . Notice that can increase the depth but not the branching degree of . It is not difficult to show that Boundedness holds iff is finite. We then devise in double-exponential time a 2WAPA , which has exponentially many states, such that . Therefore, the following holds:
Proposition 31
Consider . We can construct in double-exponential time a 2WAPA , which has exponentially many states, such that
[TABLE]
Since checking whether is infinite is feasible in exponential time in the number of states and in polynomial time in the size of the alphabet, Proposition 31 implies that is in 2ExpTime*, as needed.*
8 Conclusions
We have concentrated on the fundamental problem of containment for OMQ languages based on the main decidable classes of tgds. We have also used our techniques to close problems related to distribution over components and UCQ rewritability. We believe that our techniques for solving containment under guarded OMQs can be extended to frontier-guarded OMQs, an interesting extension of guardedness **[8]**. We are also convinced that our solution to the problem of deciding UCQ rewritability of guarded OMQs over unary and binary relations can be extended to guarded (or even frontier-guarded) OMQs over arbitrary schemas. We are currently investigating these challenging problems.
APPENDIX
PRELIMINARIES
Definition of Non-recursiveness
In the main body of the paper, we define non-recursive sets of tgds via the notion of predicate graph. Here, we give an alternative definition, based on the well-known notion of stratification, which is more convenient for the combinatorial analysis that we are going to perform in the proof of Proposition 14.
Definition 3**.**
Consider a set of tgds. A stratification of is a partition , where , of such that, for some function , the following hold:
For each predicate , all the tgds with in their head belong to , i.e., they belong to the same set of the partition. 2. 2.
If there exists a tgd in such that the predicate appears in its body, while the predicate appears in its head, then .
*We say that is stratifiable if it admits a stratification. *
It is an easy exercise to show that the predicate graph of a set of tgds is acyclic iff is stratifiable. Then:
Lemma 32
* is non-recursive iff is stratifiable.*
Definition of Stickiness
In the main body of the paper, we provide an intuitive explanation of stickiness. Here, we recall the formal definition of sticky sets of tgds, introduced in **[27]**. Fix a set of tgds; w.l.o.g., we assume that, for every pair , and do not share variables. For notational convenience, given an atom and a variable occurring in , is the set of positions in at which occurs; a position identifies the -th attribute of the predicate . The definition of stickiness hinges on the notion of marked variables in a set of tgds.
Definition 4**.**
Consider a tgd , and a variable occurring in the body of . We inductively define when is marked in as follows:
If there exists an atom in the head of such that does not occur in , then is marked in ; and 2. 2.
*Assuming that there exists an atom in the head of such that occurs in , if there exists (not necessarily different than ) and an atom in the body of such that (i) and have the same predicate and, (ii) each variable in that occurs at a position of is marked in , then is marked in . *
We are now ready to recall when a set of tgds is sticky:
Definition 5**.**
*A set of tgds is sticky if, for each , and for each variable occurring in the body of , the following holds: if is marked in , then occurs only once in the body of . *
PROOFS OF SECTION 3
Proof of Proposition 5
Consider an OMQ , where is a class of tgds, an -database , and a tuple . We show that:
[TABLE]
() Assume that . This implies that there exists a -database , and a tuple of constants such that and . Due to the monotonicity of CQs, . Since, by construction, the instance satisfies , we conclude that .666This is the standard notation for the fact that , for every (possibly infinite) instance that satisfies . By exploiting the well-known characterization of CQ containment in terms of the chase, we get that , which is equivalent to , as needed.
() Conversely, assume that , or, equivalently, . This implies that . Observe that holds trivially, which in turn implies that . Therefore, , and the claim follows.
Proof of Proposition 9
The construction underlying Proposition 9 relies on the idea of encoding boolean operations (in our case the ‘or’ operator) using a set of atoms; this idea has been exploited in several other works; see, e.g., **[14, 21, 41]**. Let . Our goal is to construct in polynomial time such that . We assume, w.l.o.g., that the predicates of do not appear in the head of a tgd of ; we can copy the content of a relation into an auxiliary predicate , using the tgd , while staying inside , and then rename each predicate in and with . The set consists of the following tgds:
For every :
[TABLE]
These tgds are annotating the database atoms with the truth constant true, indicating that these are true atoms. 2. 2.
Assuming that , a tgd:
[TABLE]
where is the conjunction of atoms in , after replacing each atom with , and is the conjunction of atoms
[TABLE]
This tgd generates a “copy” of the atoms in , while annotating them with a null value that represents the truth constant false, indicating that are not necessarily true atoms. Moreover, the truth table of ‘or’ is generated. 3. 3.
Finally, for each tgd in , a tgd
[TABLE]
where and are obtained from and , respectively, by replacing each atom with . In fact, this is the actual set of tgds , with the difference that the value at the last position of each atom (which indicates whether it is true or false) is propagated to the inferred atoms.
Now, assuming that , the CQ is defined as follows; let and :
[TABLE]
where and are fresh variables not in , and is obtained from by replacing each atom with . This completes our construction.
It is not difficult to show that , or, equivalently, for every -database , . The key observation is that in order to satisfy in the CQ , at least one of the ’s must be mapped to , which means that at least one is satisfied by . Finally, it is easy to verify that, for each , implies , and Proposition 9 follows.
PROOFS OF SECTION 4
Proof of Proposition 10
We assume that is a UCQ rewriting of . Since, by hypothesis, , we conclude that , which in turn implies that there exists such that . Let be a tuple of constants obtained by replacing each variable in with the constant , and the -database obtained from after replacing each variable in with the constant . We show that:
Lemma 33
.
Proof.
Since , there exists an -database , and a tuple of constants such that and . Clearly, there exists a homomorphism such that and . Observe also that , where . Towards a contradiction, assume that . This implies that there exists a homomorphism such that and , where . It is not difficult to see that there exists an extension of such that and . Hence, , which implies that ; thus, . But this contradicts the fact that , and the claim follows. ∎
Observe that , which immediately implies that . Consequently, by Lemma 33, . The claim follows since, by construction, is an -database such that .
The Algorithm
In view of the fact that the rewriting algorithm is heavily used in our complexity analysis, we would like to recall its definition. This algorithm is based on resolution, and thus, before we proceed further, we need to recall the crucial notion of unification. A set of atoms , where , unifies if there exists a substitution , called unifier for , such that . A most general unifier (MGU) for is a unifier for , denoted as , such that for each other unifier for , there exists a substitution such that . Notice that if a set of atoms unify, then there exists a MGU. Furthermore, the MGU for a set of atoms is unique (modulo variable renaming).
The algorihtm proceeds by exhaustively applying two steps: rewriting and factorization, which in turn rely on the technical notions of applicability and factorizability, respectively. We assume, w.l.o.g., that tgds and CQs do not share variables. Given a CQ , a variable is called shared in if is a free variable of , or it occurs more than once in . In what follows, we assume, w.l.o.g., that tgds are in normal form, i.e., they have only one atom in the head, and only one occurrence of an existentially quantified variable **[27]**. We write for the position at which the existentially quantified variable of occurs; in case does not mention an existentially quantified variable, then . (Recall that a position identifies the -th attribute of a predicate .) We are now ready to recall applicability and factorizability; in what follows, we write for the set of atoms occurring in , and for the head-atom of .
Definition 6**.**
(Applicability) Consider a CQ and a tgd . Given a set of atoms , we say that is applicable to if the following conditions are satisfied:
the set unifies, and 2. 2.
*for each , if the term at position in is either a constant or a shared variable in , then . *
Roughly, whenever is applicable to , this means that the atoms of may be generated during the chase procedure by applying . Therefore, we are allowed to apply a rewriting step (which is essentially a resolution step) that resolves using , i.e., is replaced by , and a new CQ that is closer to the input database is obtained.
If we start applying rewriting steps blindly, without checking for applicability, then the soundness of the rewriting procedure is not guaranteed. However, it is possible that the applicability condition is not satisfied, but still we should apply a rewriting step. This may happen due to the presence of redundant atoms in a query. For example, given the CQ
[TABLE]
and the tgd
[TABLE]
the applicability condition fails since the shared variable in occurs at the position . However, is essentially the CQ , and now the applicability condition is satisfied. From the above informal discussion, we conclude that the applicability condition may prevent the algorithm from being complete since some valid rewriting steps are blocked. Because of this reason, we need the so-called factorization step, which aims at converting some shared variables into non-shared variables, and thus, satisfy the applicability condition. In general, this can be achieved by exhaustively unifying all the atoms that unify in the body of a CQ. However, some of these unifications do not contribute in any way to satisfying the applicability condition, and, as a result, many superfluous CQs are generated. It is thus better to apply a restricted form of factorization that generates a possibly small number of CQs that are vital for the completeness of the rewriting algorithm. This corresponds to the identification of all the atoms in the query whose shared existential variables come from the same atom in the chase, and they can be unified with no loss of information. Summing up, the key idea underlying the notion of factorizability is as follows: in order to apply the factorization step, there must exist a tgd that can be applied to its output.777Let us clarify that for the purposes of the present work we can rely on the naive approach of exhaustively unifying all the atoms that unify in the body of a CQ. However, we would like to be consistent with [40], where the algorithm is proposed, and thus, we stick on the slightly more involved notion of factorizability.
Definition 7**.**
(Factorizability) Consider a CQ and a tgd . Given a set of atoms , where , we say that is factorizable w.r.t. if the following conditions are satisfied:
* unifies,* 2. 2.
, and 3. 3.
*there exists a variable that occurs in every atom of only at position . *
Having the above key notions in place, we are now ready to recall the algorithm , which is depicted in Algorithm 1. As said above, the UCQ rewriting of an OMQ is computed by exhaustively applying (i.e., until a fixpoint is reached) the rewriting and the factorization steps. Notice that the CQs that are the result of the factorization step, are nothing else than auxiliary queries which are critical for the completeness of the final rewriting, but are not needed in the final rewriting. Thus, during the iterative procedure, the queries are labeled with (resp., ) in order to keep track which of them are generated by the rewriting (resp., factorization) step. The CQ that is part of the input OMQ, although is not a result of the rewriting step, is labeled by since it must be part of the final rewriting. Moreover, once the two crucial steps have been exhaustively applied on a CQ , it is not necessary to revisit since this will lead to redundant queries. Hence, the queries are also labeled with (resp., ) indicating that a query has been already explored (resp., is unexplored). Let us now describe the two main steps of the algorithm. In the sequel, consider a triple , where (this is how we indicate that is labeled by and ), and a tgd . We assume that is of the form .
Rewriting Step.
For each such that is applicable to , the -th application of the rewriting step generates the query , where is the tgd obtained from by replacing each variable with , is the MGU for the set (which is the identity on the variables that appear in the body but not in the head of ), and is obtained from be replacing with . By considering (instead of ) we basically rename, using the integer , the variables of . This renaming step is needed in order to avoid undesirable clutters among the variables introduced during different applications of the rewriting step. Finally, if there is no , i.e., an (explored or unexplored) query that is the result of the rewriting step, such that and are the same (modulo bijective variable renaming), denoted , then is added to .
Factorization Step.
For each that is factorizable w.r.t. , the factorization step generates the query , where is the MGU for . If there is no , i.e., a query that is the result of the rewriting or the factorization step, and is explored or unexplored, such that , then is added to .
Proof of Proposition 14
We assume, w.l.o.g., that the predicates of do not appear in the head of a tgd of . Since , by Lemma 32, admits a stratification with stratification function . Let us briefly explain how the rewriting tree of the OMQ is defined. is a rooted tree with being its root. The -th level of consists of the CQs obtained from the CQs of the -th level by applying rewriting steps (see the algorithm for details on the rewriting step) using only tgds from . It is easy to verify that the CQs of the -th level contain only predicates such that . It is now clear that the -th level of (i.e., the leaves of ) consists only of CQs obtained during the execution of that contain only predicates of . Thus, in order to obtained the desired upper bound, it suffices to show that the number of atoms that occur in a CQ that is a leaf of is at most . To this end, let us focus on one branch of from the root to a leaf . Such a branch can be naturally represented as a -ary forest , where the root nodes are the atoms of , and whenever an atom is resolved during the rewriting step using a tgd , the atoms of , after applying the appropriate MGU, are the child nodes of . Therefore, to obtain the desired upper bound, it suffices to show that the number of leaves of is at most . By construction, consists, in general, of -ary rooted trees, where , of depth . Hence, the number of leaves of is at most . Since , the claim follows.
Proof of Theorem 16
A proof sketch for the coNExpTime upper bound is given in the main body of the paper. We proceed to establish the P-hardness. Our proof is by reduction from a tiling problem that has been recently introduced in **[34]**, which in turn relies on the standard Exponential Tiling Problem. Let us first recall the latter problem.
An instance of the Exponential Tiling Problem is a tuple , where are numbers (in unary), are subsets of , and is a sequence of numbers of . Such a tuple specifies that we desire a grid, where each cell is tiled with a tile from . (resp., ) is the horizontal (resp., vertical) compatibility relation, while represents a constraint on the initial part of the first row of the grid. A solution to such an instance of the Exponential Tiling Problem is a function such that:
, for each ; 2. 2.
, for each and ; and 3. 3.
, for each and .
We will refer to as a grid, with the pairs in it being cells. A cell consists of two coordinates, the column-coordinate (for short col-coordinate) and the row-coordinate, and any function on a grid is a tiling. The Exponential Tiling Problem is defined as follows: given an instance as above, decide whether has a solution. It is known that this problem is NExpTime-hard (see, e.g., Section 3.2 of **[46]**).
We are now ready to recall the tiling problem introduced in **[34]**, called Extended Tiling Problem (ETP), which is P-hard. An instance of this problem is a tuple , where are numbers (in unary), and are subsets of . The question is as follows: is it the case that for every sequence , where , of numbers of , has no solution or has a solution?
We give a reduction from the ETP to . More precisely, given an instance of the ETP, our goal is to construct in polynomial time two queries , for , such that has a solution iff .
Data Schema
The data schema consists of:
- •
[math]-ary predicates , for each and ; the atom indicates that .
The Query
The goal of the query is twofold: (i) to check that the so-called existence property of the input database, i.e., for every , there exists at least one atom of the form , is satisfied, and (ii) to check whether , where is the sequence of tilings encoded in the input database, has a solution. To this end, the query will mention the following predicates:
- •
[math]-ary predicate , indicating that there exists at least one atom of the form in the input database.
- •
[math]-ary predicate , indicating that the input database enjoys the existence property.
- •
Unary predicate , for each ; the atom states that is the tile .
- •
Binary predicate ; the atom encodes the fact that .
- •
Binary predicate ; the atom encodes the fact that .
- •
-ary predicate , for each ; the atom states that is a tiling obtained from the tilings – details on the inductive construction of tilings from tilings are given below.
- •
Unary predicate , for each ; the atom states that , i.e., the -th element of the sequence is .
- •
Binary predicate , for each and ; the atom states that in the tiling the tile at position is .
- •
[math]-ary predicate , indicating that there exists a tiling that is compatible with the initial tiling encoded in the input database.
- •
[math]-ary predicate , which is derived whenever the predicates and are derived.
* is defined as the query , where consists of the following tgds:*
- •
Checking for the existence property of the input database
For each and :
[TABLE]
and the tgd that checks for the existence property
[TABLE]
- •
Generate the tiles
[TABLE]
- •
Generate the compatibility relations
For each :
[TABLE]
For each :
[TABLE]
- •
Generate the tilings. The key idea is to inductively construct tilings from tilings. It is easy to verify that the grid in Figure 2(a) is a tiling iff the nine subgrids of it, shown in Figure 2(b), are tilings. This has been already observed in **[33]**, where Datalog with complex values is studied.
First, we construct tilings of size (the base case of the inductive construction):
[TABLE]
Then, we inductively construct tilings of larger size until we get tilings of size . This is done using the following tgds. For each :
[TABLE]
- •
Extract from the tilings the tiles at positions . This is done using the following tgds:
[TABLE]
where . Moreover, for each :
[TABLE]
- •
Check whether there exists a tiling that is compatible with the sequence of tilings
For each and :
[TABLE]
and the tgd
[TABLE]
- •
Finally, we have the output tgd
[TABLE]
This concludes the construction of .
The Query
The goal of the query is twofold: (i) to check that the so-called uniqueness property of the input database, i.e., for every , there exists at most one atom of the form , is satisfied, and (ii) to check whether , where is the sequence of tilings encoded in the input database, has a solution. The query mentions the same predicates as , and is defined as , where consists of the following tgds:
- •
Checking the uniqueness property
For each and with :
[TABLE]
- •
The rest of encodes the tiling problem in exactly the same way as encodes .
This concludes the construction of .
Proof of Proposition 18
The set consists of the following tgds; for brevity, we write for :888A similar construction has been used in [40] for showing a lower bound on the size of a CQ in the UCQ rewriting of a OMQ.
[TABLE]
while . It can be verified that, for every -database , implies that
[TABLE]
and thus, . Let , where is a set of tgds and a Boolean CQ, and an -database. Clearly, iff and . This implies that , and the claim follows.
Proof of Theorem 19
The coNExpTime upper bound, as well as the -hardness in case of fixed-arity predicates, are discussed in the main body of the paper. Here, we show the coNExpTime-hardness. The proof proceeds in two steps:
First, we show that is coNExpTime-hard, where denotes the class of full non-recursive sets of tgds, i.e., non-recursive sets of tgds without existentially quantified variables. 2. 2.
Then, we reduce to by showing that (under some assumptions that are explained below) every query in can be rewritten as an query.
By Proposition 9, we immediately get that is coNExpTime-hard, as needed.
Step 1: is coNExpTime-hard
We show that is coNExpTime-hard, even if we focus on 0-1 queries, that is, queries with following property: for every database , , where is the restriction of on the binary domain , i.e., . The proof is by reduction from the Exponential Tiling Problem, and is a non-trivial adaptation of the one given in **[14]** for showing that containment of non-recursive Datalog queries is coNExpTime*-hard.*
Theorem 34
* is coNExpTime-hard, even for 0-1 queries.*
Proof.
Given an instance of the Exponential Tiling Problem, we are going to construct a 0-1 query and a 0-1 query such that has a solution iff .
Data Schema
The data schema consists of:
- •
-ary predicates , for each ; the atom indicates that the cell with coordinates is tiled by tile . Notice that we use -bit binary numbers to represent a coordinate; this is the key difference between our construction and the one of [14].
The Query
The goal of the query is to assert whether the input database encodes a candidate tiling, i.e., whether the entire grid is tiled, without taking into account the constraints, that is, the compatibility relations and the constraint on the initial part of the first row. To this end, the query will mention the following predicates:
- •
Unary predicate ; the atom simply says that is a bit, i.e., .
- •
-ary predicate , for each ; the atom says that for the row-coordinate there are tiled cells with coordinates for every col-coordinate that agrees with on the first bits. In other words, for the row corresponding to , every column extending the first bits of is tiled. In particular, says that the entire row is tiled.
- •
-ary predicate , for each ; the atom says that for every that agrees with on the first bits, the row is fully tiled.
- •
-ary predicate ; the atom says that the row is fully tiled.
- •
[math]-ary predicate , which asserts that the entire grid is tiled.
- •
[math]-ary predicate , which is derived whenever the predicate is derived.
is defined as the query , where consists of the following rules:
- •
Generate atoms
[TABLE]
- •
For each :
[TABLE]
For each :
[TABLE]
A row is fully tiled:
[TABLE]
- •
[TABLE]
For each :
[TABLE]
The entire grid is tiled:
[TABLE]
This concludes the construction of the query .
The Query
is defined in such a way that is non-empty exactly when the input database encodes an invalid tiling, i.e., when one of the constraints on the tiles is violated. The query will mention the following intensional predicates:
- •
Unary predicate ; as above, says that is a bit.
- •
-ary predicate , for each ; the atom says that and .
- •
-ary predicate , for each ; the atom says that the -bit binary number is the successor of the -bit binary number .
- •
[math]-ary predicate .
is defined as the query . The set consists of the following linear tgds:
- •
Generate atoms:
[TABLE]
- •
Generate the successor predicates:
[TABLE]
For each :
[TABLE]
The UCQ consists of the following (Boolean) CQs; for brevity, the existential quantifiers in front of the CQs are omitted:
- •
Tile Consistency
For each :
[TABLE]
- •
Tile Compatibility
For each :
[TABLE]
For each :
[TABLE]
- •
Tiling of First Row
For each , let be the function from into such that is the number in binary representation, and let other than ; recall that is a sequence of numbers of that represents a constraint on the initial part of the first row of the grid. Then, we have the CQ:
[TABLE]
where, for each , if , and if .
This concludes the definition of the query . ∎**
Step 2: is coNExpTime-hard
Our goal is show that every 0-1 query can be equivalently rewritten as a 0-1 query , where all the tgds of are lossless, i.e., all the body-variables appear also in the head, which in turn implies that is sticky.
Proposition 35
Consider a 0-1 query . We can construct in polynomial time a 0-1 query such that .
Proof.
Let , and assume that is the maximum number of variables occurring in the body of a tgd of . We are going to construct in polynomial time a 0-1 query such that .
The set consists of the following tgds:
- •
Initialization Rules
We first transform every database atom of the form into an atom . This is done as follows:
[TABLE]
and, for each -ary predicate , we have the lossless tgd
[TABLE]
Notice that we can safely force the variables to take only values from due to the 0-1 property.
- •
Transformation into Lossless Tgds
For each tgd of the form
[TABLE]
we have the lossless tgd
[TABLE]
where, if , for , is the set of variables occurring in the body of (the order is not relevant), then , for each , and , for each .
- •
Finalization Rules
Observe that each atom obtained during the chase due to one of the lossless tgds introduced above is of the form , where . If , then we need to ensure that eventually the atom
[TABLE]
will be inferred. This is achieved by adding to the following tgds: For each -ary predicate occurring in , and for each , we have the rule:
[TABLE]
This concludes the definition of .
The CQ is defined analogously. More precisely, assuming that is of the form (the existential quantifiers are omitted)
[TABLE]
the CQ is defined as
[TABLE]
It is easy to verify that consists of lossless tgds, and thus, . It also not difficult to see that, for every database over , ; thus, by the 0-1 property, , and the claim follows. ∎
By Theorem 34 and Proposition 35, we immediately get that is coNExpTime-hard, as needed.
PROOFS OF SECTION 5
Recall that, for the sake of technical clarity, we focus on constant-free tgds and CQs, but all the results can be extended to the general case at the price of more involved definitions and proofs. Moreover, we assume that tgds have only one atom in the head. This does not affect the generality of our proof since every set of guarded tgds can be transformed in polynomial time into a set of guarded tgds with the above property; see, e.g., **[24]**. Finally, for convenience of presentation, we also assume that the body of a tgd is non-empty, i.e., the body of a tgd is always an atom and not the symbol .
Proof of Proposition 21
Let us start by recalling the key notion of tree decomposition. Notice that the definition of the tree decomposition that we give here is slightly different than the one in the main body of the paper. The reason is because, for convenience of presentation, we prefer to employ a slightly different notation.
Definition 8**.**
Let be an instance. A tree decomposition of that omits , where , is a pair , where is a tree and a family of subsets of (called the bags of the decomposition) such that:
For every , the set is non-empty and connected. 2. 2.
For every atom , there is a such that .
*The width of a tree decomposition omitting is . The tree-width of is the minimum among the widths of all tree decompositions of that omit . We call a tree decomposition omitting simply tree decomposition of . For , we denote by the subinstance of induced by . *
Notation. We usually denote the strict partial order among the nodes of a tree of a tree decomposition by . Accordingly, we write iff or . For brevity, will usually denote the root of a tree decomposition at hand. If ambiguities could possibly arise, we shall use subscripts in these notations. Furthermore, when is clear from context, we shall omit it from the expression .
Let be a tree decomposition of and . Recall that is -guarded (or guarded except for ), if for every node , there is an atom such that . A -guarded tree decomposition of is simply called guarded tree decomposition.
Also recall the crucial notion of -tree:
Definition 9**.**
An -instance is a -tree, where , if there is a tree decomposition of such that
, i.e., the subinstance of induced by equals . 2. 2.
* is guarded except for .*
*If or is clear from context, we shall often refer to as the diameter of and to as the core of . *
Remark. The notion of -tree defined here refers to both instances and databases, i.e., a -tree may be a (finite) database or an instance. We often do not explicitly mention whether a -tree at hand is a database or an instance. However, it will be clear from context whether a -tree is a database or an instance.
We proceed to establish the following technical lemma, which in turn allows us to show Proposition 21. It is an adaption of a result in **[7]** to the case of guarded tgds. Henceforth, for brevity, given a query and an -database , we write for the fact that .
Lemma 36
Let be an OMQ from . Let be an -database and suppose . Then there is a finite -instance such that and:
* is a -tree such that .* 2. 2.
There is a homomorphism from to .
Before we proceed with its formal proof, let us explain why Proposition 21 is an easy consequence of Lemma 36. The fact that the first item implies the second is trivial. Conversely, suppose that , which implies that there exists an -database such that and . By Lemma 36, there exists a -tree , where , such that . Moreover, there is a homomorphism from to ; hence, since is closed under homomorphisms, it immediately follows that . Consequently, the -database obtained from after replacing each null with a distinct constant is a -tree such that , and Proposition 21 follows.
We now proceed with the proof of Lemma 36 which is our main task in this section. Before that, we introduce some additional auxiliary concepts.
The Guarded Chase Forest
Given a database and a set of guarded tgds, the guarded chase forest for and is a forest (whose edges and nodes are labeled) constructed as follows:
For each fact in , add a node labeled with . 2. 2.
For each node labeled with and for every atom resulting from a one-step application of a rule , if is the image of the guard in this application of , then add a node labeled with and introduce an arc from to labeled with .
We can assume that the guarded chase forest is always built inductively according to a fixed, deterministic version of the chase procedure. The non-root nodes are then totally ordered by a relation that reflects their order of generation. Furthermore, we can extend to database atoms by picking a lexicographic order among them. Notice that one atom can be the label of multiple nodes. Using the order we can, however, always refer to the -least node.
Guarded Unraveling
Let be an instance over . We say that is guarded in , if there are such that
- •
* and*
- •
there is an such that .
A tuple is guarded in if the set containing the elements of is guarded in .
In the following paragraph, we largely follow the notions introduced in **[2, 13]**. Fix an -instance and some . Let be the set of finite sequences of the form , where, for , is a guarded set in , and, for , for some , or . The sequences from can be arranged in a tree by their natural prefix order and each sequence identifies a unique node in this tree. In this context, we say that is represented at whenever . Two sequences are -equivalent, if is represented at each node on the unique shortest path between and . For represented at , we denote by the -equivalence class of . The guarded unraveling around is the instance over the elements \{[\pi]_{a}\mid\text{a\pi}\}, where
[TABLE]
for all .
Lemma 37
For every -instance and any , the guarded unraveling around is a -tree over , where is the subinstance of induced by the elements .
Proof.
Let , where is the natural tree that arises from ordering the sequences in by their prefixes. For , let X_{\pi}\coloneqq\{[\pi]_{a}\mid\text{a\pi}\}. Let denote the root of . We need to show that is an appropriate tree decomposition witnessing that is a -tree. First, note that it is clear that by construction. Let and consider the set . This set is certainly non-empty. Moreover, for , we know that , hence and are -connected in . Suppose for some . Then there is a such that , for all . Hence, are all represented at and so . It remains to show that is guarded except for . Let and consider the set . Since is a sequence of length greater than one, its last element is a guarded set in . Hence, there are such that and for some . Let and define . Then , as desired. ∎
Notice that this lemma implies that the tree-width of is bounded by .
We are now ready to prove Lemma 36:
Proof of Lemma 36.* Let and a homomorphism mapping to . Let exhaust all facts from that are the roots of those -least facts from in the guarded chase forest of and that have an element from as argument. Let and let be the unraveling of around , regarding all elements from as labeled nulls. Henceforth, for every , we denote by the element . We say that represents . Let be the substructure of induced by the set . Notice that is an infinite instance that is a -tree by Lemma 37. We will show later how to get a finite instance from that satisfies our constraints. We proceed to show that logically entails , denoted :*
Lemma 38
.
Proof.
We will first construct a universal model of and . Recall that an instance is a universal model of and , if it can be homomorphically mapped to every model of ; in particular, it is well-known and easy to prove that is always a universal model of and . Before constructing , we introduce some additional notions. In the following, given a guarded set in , a copy of in is a set which is guarded in such that, for , we have that for some sequences and iff for all and . Copies of guarded tuples are defined accordingly. Consider the structure . Let be a guarded set in and denote the subinstance of induced by . It is well-known and easy to prove that is acyclic (cf., e.g., [23]). Henceforth, we loosely call the tree attached to . The model is constructed as follows. Let be the instance . Furthermore, for each guarded set in and each copy of in , construct a new instance that is isomorphic to the tree attached to such that
(i) the elements of are renamed to in ,
(ii) , and
(iii) , for every copy of in .
The model is the union of and all the . If a guarded set in arises from renaming elements of a guarded set in , we also say that is a copy of in . Furthermore, the copies of that are contained in (i.e., in ) are also called copies in . Observe that is a model of by construction. We show that it is a model of . To this end, we show the following claim.
Claim 39
Let be a guarded tuple in and let be a guarded conjunctive query999By a guarded conjunctive query we mean here a CQ that contains an atom that contains all the variables occurring in the CQ as argument. over . Suppose is a copy of in , where is over and . Then iff .
Proof.
Suppose . Let be a copy of in . Since is guarded, there is a such that . Let be the guarded set in of which is a copy in . It clearly holds that , whence follows.
Suppose that . Let and and suppose that (). The set is guarded in . Hence, there is a guarded in such that . We show that there is a which is a copy of in . Suppose . Let be such that for all . For , define
[TABLE]
Then is represented at . For , let . We claim that is a copy of in . Let and suppose . Then we immediately obtain . Conversely, if , let , where . Take any . It is easy to see that and are -equivalent. Hence, and it follows that , as required. It follows that is a copy of in and so there is a structure contained in that is isomorphic to with respectively renamed to . Hence, as required. ∎
Now let be a guarded rule from . Suppose that . Since every guarded tuple in is a copy of some guarded tuple in , there is an , of which is a copy, such that . Since is a model of , we know that . It follows that by the above claim, as required. It remains to show that is universal:
Claim 40
* is universal.*
Proof.
It suffices to show that can be homomorphically mapped to via a homomorphism . We let be the homomorphism that maps every element of to itself. It remains to treat the structures . Consider a copy in of a set which is guarded in . It suffices to show that can be mapped to . To this end, it we show how to map to . We do so by induction on the number of rule applications of . For the base case, we map to as follows. Let , for . Suppose for some and , where . Recall that is guarded in . Reviewing the construction of , it is easy to see that this holds iff . Hence, is indeed a homomorphism from to . The induction step is obvious—we can easily obtain a homomorphism that maps to . The desired homomorphism is the union of the (). We then obtain a homomorphism from by appropriately renaming the elements from the domain of the latter as we did in the construction of —which is nothing else than an isomorphic copy of . Furthermore, each of these homomorphisms maps each element of to itself. The desired homomorphism that witnesses that is universal is the union of and the . ∎
In order to prove , it remains to show that there is a homomorphism that maps to . There are guarded sets in such that can be understood to map to . By construction, we know that can be chosen in such a way that . Since is guarded, can be understood to map to —assuming that the labeled nulls occurring in these instances are mutually new. Let . For every , let . Notice that is a copy of in . By construction, all the facts from that are mapped via to and which have an element from in their image under are already mapped to . For the other facts, the names of the constants in the databases do not matter.101010Here, it is of course essential to assume constant-free rules. Let be arbitrary copies of the sets in . It follows that we can find our desired match in the union of and . Notice that is isomorphic to with each represented by . ∎
Now the database has the desired form with being its core. However, is infinite. Since due to Lemma 38, by compactness, there is a finite such that . Consider a tree decomposition witnessing that is a -tree. There is a maximum such that contains all the subinstances induced by the bags of depth less or equal . Let be the instance that actually contains all the subinstances induced by the bags of level up to . Hence, is itself a -tree and , since .
*Now there is a natural homomorphism mapping to : we simply specify for all . The instance is the one we are looking for. *
**
Proof of Lemma 22
One can naturally encode instances of bounded tree-width into trees over a finite alphabet such that the alphabet’s size depends only on the tree-width. Our goal here is to appropriately encode -trees in order to make them accessible to tree automata techniques. Since the tree-width of a -tree over depends only on the size of and the maximum arity of , the alphabet of the encoding will depend on the same.
Labeled trees.* Let be an alphabet and be the set of finite sequences of positive integers, including the empty sequence .111111We specify that [math] is included in as well. Let us recall that a -labeled tree is a pair , where is the labeling function and is closed under prefixes, i.e., implies , for all and . The elements contained in identify the nodes of . For , nodes of the form are the children of . A path of length in from to is a sequence of nodes such that is a child of . A branch is a path from the root to a leaf node. For , we set , for all , and —notice that is not defined.*
Encoding.* Let and fix a schema . Let be the disjoint union of two sets and , respectively containing and elements. The elements from will be called names. Elements from the set will describe core elements, while those of will describe the others. Furthermore, neighboring nodes may describe overlapping pieces of the instance. In particular, if one name is used in neighboring nodes, this means that the name at hand refers to the same element—this is why we use elements for the non-root bags. Let be the finite schema capturing the following information:*
- •
For all , there is a unary relation .
- •
For all , there is a unary relation .
- •
For each and every -tuple , there is a unary relation .
Let be an alphabet and suppose that is a (finite) -tree over such that . Consider a tree decomposition witnessing that is indeed a -tree and let be the root of . Fix a function such that (i) is injective and (ii) different elements that occur in neighboring bags of are always assigned different names from . Using , we can encode and into a -labeled tree such that each node from corresponds to exactly one node in and vice versa. For a node from , we denote the corresponding node of by in the following and vice versa. In this light, the symbols from have the following intended meaning:
- •
* means that is used as a name for some element of the bag .*
- •
* indicates that is used as name for an element of the bag that also occurs in , i.e., names an element from the core of .*
- •
* indicates that holds in for the elements named by in bag .*
Under certain assumptions, we can decode a -labeled tree into a -tree whose width is bounded by . Let . We say that is consistent, if it satisfies the following properties:
For all nodes it holds that , except for the root whose number of names are accordingly bounded by . Furthermore, . 2. 2.
For all and all it holds that implies that . 3. 3.
For all and all it holds that iff . 4. 4.
*If , then for all on the unique shortest path between and the root. * 5. 5.
For all nodes , there is an and a node such that , , and, for all , and are -connected.
Decoding trees.* Suppose now that is consistent. We show how we can decode into a database which is a -tree whose diameter is bounded by . Let be a name used in . We say that two nodes of are -equivalent if for all nodes on the unique shortest path between and . Clearly, -equivalence defines an equivalence relation and we let [v]_{a}\coloneqq\{{(w,a)}\mid\text{wav}\} and . The domain of is the set and, for , we define*
[TABLE]
Lemma 41
Let be a consistent -labeled tree with root node . Then is well-defined and a -tree over , where is the subinstance of induced by the set . Moreover, is bounded by .
Proof.
Let be a consistent, -labeled tree. The fact that is well-defined is left to the reader. We are going to construct an appropriate decomposition for . The tree has the same structure as . Furthermore, for , we set . We need to show that is indeed a tree decomposition that satisfies the desired properties.
Let and consider two nodes such that and . Then and so and are -connected. Hence, for all which lie on the unique shortest path between and . Since for all such , it follows that , and so is contained in all bags on the unique path between and . Suppose . Then there is a such that . By consistency, . Moreover, we know that , for . It follows that . Now let . By consistency, there is an and a such that , , and and are -connected for . By construction, and . The claim follows now since for . It is immediate that is bounded by . ∎
Notation. Given a consistent -labeled tree and a label , in order to ease notation we often regard as a database consisting of the facts . Furthermore, we let .
Proof of Lemma 22.* The lemma is an easy consequence of Lemma 41 and the fact that, when encoding a -tree over , together with a tree decomposition witnessing that is a -tree, into a consistent -labeled tree , then and are isomorphic. *
**
Roughly, Lemma 22 states that containment among OMQs from can be semantically characterized via the decodings of consistent -labeled trees. This makes the problem of deciding containment amenable to tree automata techniques.
Proof of Lemma 23
Before proceeding to the proof of Lemma 23, we first introduce the relevant automata model.
Automata Techniques
For a set of propositional variables , we denote by the set of Boolean formulas using variables from , the connectives , and the constants . Let us now introduce our automata model.
Definition 10**.**
*A two-way alternating parity automaton (2WAPA) on trees is a tuple , where is a finite set of states, an alphabet (the input alphabet of ), the transition function, where we set , the initial state, and the parity condition that assigns to each a priority . Elements from are called transitions. *
Intuitively, a transition of the form means that a copy of the automaton should change to state and stay at the current node. A transition of the form means that a copy should be sent to the parent node, which is then required to exist, and proceed in state , while one of the form means that a copy of the automaton that assumes state is sent to some child node. The transition means the same as , while means that a copy of the automaton that assumes state should be sent to the parent node which is there not required to exist at all. Likewise, means that a copy of the automaton assuming state should be sent to all child nodes.
Notation. We write for , for , and simply for . Furthermore, for , we define
[TABLE]
Definition 11**.**
A run of a 2WAPA on a -labeled tree is a -labeled tree such that the following holds:
, 2. 2.
if , , and , then there is an such that holds and the following conditions are satisfied:
- •
If then there is a node and a child node of such that .
- •
If then for all , there is a child node of such that .
*We say that a run is accepting on , if on all infinite paths in , the maximum priority among that appears infinitely often is even. accepts a -labeled tree , if there is an accepting run on . We denote by the set of -labeled trees accepts, i.e., the language accepted by . *
Remark. The automaton model defined above resembles that in **[58]**. However, we explicitly provide transitions that allow the automaton move to the parent node, while the model defined in **[58]** provides transitions for moving to some neighboring node, including the parent node. Therefore, the automata in **[58]** offer transitions of the form , , and with their intended meaning as defined above. Using techniques as employed in **[57, 58]**, for a 2WAPA , one can show that the problem of deciding whether is feasible in exponential time with respect to the number of states of and in polynomial time with respect to the size of the input alphabet of .
Proof of Lemma 23.* We only give an intuitive explanation for the construction of the desired 2WAPA. To check whether a -labeled tree is consistent, we can check each condition for consistency separately by a dedicated 2WAPA and then take the intersection of all of them. Most of the consistency conditions are easy to check. We give here a more detailed verbal explanation for condition (5). A 2WAPA checking this condition can be constructed as follows. At the beginning of its run, the automaton branches universally to all nodes (except the root) in a state whose intended purpose is to find appropriate guards in the input tree for the names available at the current node. To this end, the automaton has to do a reachability analysis on the input tree and store, using exponentially many states in , the tuple it seeks to guard. By a guard for the node here, we mean a node with an such that *
(i) *contains all the names present at and * (ii) *is -connected to for all . * * Notice that such a reachability analysis can be easily performed once we have the means to store the information contained in in a single state. This is, however, possible since for this task we need somewhat states, i.e., polynomially many in the size of . *
**
Proof of Lemma 24
We first need to introduce some additional auxiliary notions.
Strictly Acyclic Queries
Let be a CQ over a schema . We denote by the free variables of ; the same notation is used for first-order formulas in general. We can naturally view as an instance whose domain is the set of variables of and contains the body atoms of as facts. In the following, we will often overload notation and write for both the query and the instance . The notions of tree-width, acyclicity, etc. then immediately extend to CQs. Given a tree decomposition of (i.e., of ), we say that is strict, if some bag of contains all variables that are free in (cf. also **[38]**). Accordingly, is called strictly acyclic if it has a guarded tree decomposition that is strict.
Strictly acyclic queries have the convenient property to be equivalent to guarded formulas of a special form. Recall that the set of guarded formulas over a schema is built inductively by including all atomic formulas, relativizing quantifiers by atomic formulas, and closing under Boolean connectives. More precisely, all quantifier occurrences have one of the forms
[TABLE]
such that the free variables of are among .
We are interested in the guarded formulas that are build up using conjunction and existential quantification; we restrict ourselves to such formulas in the following. We call a formula from this class strictly guarded, if it is of the form . We explicitly include the case where is the empty sequence of variables, i.e., if to formulas of the form with . Notice that every guarded sentence (i.e., a formula having no free variables) is strictly guarded, since it is equivalent to . Furthermore, notice that every usual guarded formula that uses only existential quantifiers and conjunction is equivalent to a conjunction of strictly guarded formulas. The following lemma is proved in **[38]**.
Lemma 42
Every strictly acyclic CQ can be rewritten in polynomial time into an equivalent strictly guarded formula that is built up using conjunction and existential quantification only. The converse holds as well.
Squid Decompositions
Let be a BCQ over a schema having body atoms. An -cover of is a BCQ that contains all the atoms from and may additionally contain other body atoms over . It is pretty straightforward that, for an -instance , it holds that iff there is an -cover of such that .
Definition 12**.**
*Let be an instance. For , we say that is -acyclic, if it has a guarded tree decomposition that omits . *
Definition 13**.**
Let be a BCQ over . A squid decomposition of is a tuple , where is an -cover of , a mapping, , and a partition of the atoms such that
- •
* is the set of atoms of induced by ,*
- •
* and is -acyclic. *
Intuitively, a squid decomposition specifies a way how a BCQ can be mapped to an instance that contains some “cyclic parts”—the set specifies those atoms that are mapped to such cyclic parts, while declares those atoms that are mapped to the acyclic parts of the instance at hand. We will make this more precise in Lemma 43 below, where we analyze matches in -trees.
Given a CQ and a set of variables , the -reduct of , denoted , is the conjunctive query that arises from by dropping all the existential quantifiers that bind variables in .
Lemma 43
Let be a -tree over and a BCQ over . Let be a witnessing tree decomposition of . It holds that iff there is a squid decomposition of and a homomorphism such that
* is witnessed by ,* 2. 2.
* is witnessed by , and* 3. 3.
there are strictly guarded formulas such that .
Proof.
For the direction from right to left, consider such a given squid decomposition and a homomorphism as in the hypothesis of the lemma. It is immediate that is a homomorphism mapping to . Since is an -cover of , we obtain as required.
For the other direction, suppose that is witnessed by a homomorphism . For each , let be an atom of such that and contains all domain elements from as arguments. Notice that the exist, since is guarded except for . Since maps to , for each atom of , there is a node such that . Let be the set of all these nodes and their closure under greatest lower bounds with respect to , excluding the root node of . Consider the set of atoms . Notice that at least half of the nodes of are of the form —hence, . Let be a BCQ constructed as follows. Take the conjunction of and for each (), add an atom , where each is a newly chosen variable. Then is obviously an -cover of . Furthermore, by construction, there is a mapping and an isomorphism such that . Now let be the greatest set of atoms of such that . Moreover, let and . We claim that is a squid decomposition of that satisfies together with the points mentioned in the statement of the lemma.
To see that is a squid decomposition of , the only nontrivial point to prove is that is indeed -acyclic. We will prove this below in the course of establishing the third item.
The first two items are immediate by construction. We prove the third item. Suppose and consider the -reduct of . By construction, the atoms are contained in . Now the set together with the order gives rise to a forest consisting of trees whose roots are descendants of , i.e., the root of (recall that is not contained in ). Moreover, we have
(i) ,
(ii) , for , and
(iii) , for .
For , let and, for , let be the set of atoms . Now it is easy to check using the facts stated before that each is acyclic and, hence, so is . Furthermore, denoting by the root of , it holds that —indeed, if for some , then, since and , it must be the case that by connectivity. It follows that the -reduct of (viewed as Boolean query), henceforth denoted , is strictly acyclic and is therefore equivalent to a strictly guarded formula . Hence, the query is equivalent to . Moreover, it follows that itself is -acyclic—notice that and that , for . Hence, , for . The claim now follows since every is acyclic. ∎
Derivation trees
Let be an -database and a set of guarded rules. Let be a strictly acyclic query whose free variables are exactly those from and let be a tuple from . A derivation tree for with respect to and is a finite tree whose nodes are labeled via a function with pairs of the form , where are constants from and is a strictly acyclic query over having exactly free, such that the following conditions are satisfied:
, where is the root node of . 2. 2.
If for some node , then one of the following conditions holds (let and ):
- (a)
* is a leaf node and , for some atomic formula such that .* 2. (b)
The node has a successor labeled by and it holds that
[TABLE] 3. (c)
The query is logically equivalent to and has successors respectively labeled by .
Lemma 44
Let be an atomic formula. Then iff there is a derivation tree for with respect to and .
Proof (sketch).* Let and . The direction from right to left is an easy induction on the construction of the derivation tree. We sketch the other direction. Consider the guarded chase forest for and , where is a function labeling the nodes and edges of . We construct a derivation tree for by induction on the number of chase steps required to derive from and .*
For the base case, if , the claim is obvious since we can apply rule 2.(a). Assume that is derived using a rule
[TABLE]
and a homomorphism such that , where is the guard of . If , the result immediately follows by the induction hypothesis. Otherwise, the image of under contains some labeled nulls as arguments. Assume that all the contain nulls as their arguments—for those that do not, the induction hypothesis would yield appropriate derivation trees again. Notice that all the nulls occurring in appear in . By construction of , there is a node that is an ancestor of the nodes having the atoms as labels and which has a label of the form which contains no nulls at all as arguments. There is a corresponding atomic formula whose image under an appropriate homomorphism equals . Furthermore, there are atoms such that and
[TABLE]
Now regard the (*) as atomic formulas with free variables among . The formula is then a strictly acyclic query that satisfies . An application of rule 2.(b) then requires us to find a derivation tree for , whence an application of rule 2.(c) reduces this task to finding derivation trees for the atoms and their corresponding tuples of constants. These trees exist by induction hypothesis and we can simply concatenate them appropriately in order to arrive at a derivation tree for . *
**
Given a guarded formula built up from conjunctions and existential quantification, we define the nesting depth of , denoted , inductively:
- •
If is an atomic formula, then .
- •
If , then .
- •
If and , then .
Lemma 45
Let be a database, a set of guarded rules, and a strictly acyclic conjunctive query. Then iff there is a derivation tree for with respect to and .
Proof (sketch).* We again sketch only the direction from left to right. Let be the strictly guarded formula corresponding to . We proceed by induction on the nesting depth of . If , then is quantifier free and thus a conjunction of atoms , where for . An application of rule 2.(c) reduces the problem of building a derivation tree for to the problem of building corresponding trees for the and their corresponding constants from . The existence of these trees is guaranteed by Lemma 44.*
Now suppose that . Let and . Assume, without loss of generality, that all the bound variables from are pairwise distinct. In the following, we will describe how to construct a derivation tree for . If , then there is a homomorphism mapping each atom of to such that . Furthermore, maps each atom of to a node of the guarded chase forest of and . Let denote the atom labeling the node of where is mapped to via . Let exhaust all elements from that are not from and exhaust those from that are from . Let be the formula . Clearly, . Hence, we can create a successor of that is labeled by . Assume now that none of the is from . Furthermore, assume that , since otherwise we can just simply apply rule 2.(c) to reduce to a conjunction of queries of the desired form. As in the proof of Lemma 44, there is a node in whose label contains only values from as arguments and such that is an ancestor of the node labeling . Furthermore, all the atoms from that contain an element from as argument are also located in the subtree rooted at . Let be the query that results from deleting all atoms from which are mapped via into the subtree rooted at . Notice that may be empty and has free variables among . Furthermore is equivalent to a conjunction of strictly acyclic queries. Let be the atomic formula whose image under an appropriate homomorphism equals . A similar line of reasoning as in the proof of Lemma 44 shows that there are atomic formulas such that and
[TABLE]
*whence an application of rule 2.(b) and rule 2.(c) reduces the problem of constructing a derivation tree for to that of constructing corresponding trees for and . Notice that is a conjunction of strictly guarded formulas of nesting depth at most . Hence, the induction hypothesis guarantees the existence of such derivation trees. *
**
Having the above results in place, it is easy to show the following statement:
Lemma 46
Let be a -tree over and an OMQ where is guarded and a BCQ. Then iff there is a squid decomposition of and a homomorphism such that:
* is witnessed by , where is the subinstance of induced by .* 2. 2.
There are strictly acyclic queries such that
- (a)
* and* 2. (b)
for and , there are derivation trees for with respect to and .
Proof.
We can easily prove by induction on the number of chase steps that is an -tree, where is the subinstance of induced by . Now the lemma at hand is immediate by combining this fact with Lemma 43 and Lemma 45. ∎
We are now ready to proceed with the proof of Lemma 24:
Proof of Lemma 24.* Lemma 46 will guide the construction of the 2WAPA we are now going to construct. Suppose is an OMQ from and let . We are going to construct a 2WAPA that accepts a consistent -labeled tree iff . In particular, the number of states of will be at most exponential in the size of and at most polynomial in , while the construction of will be feasible in 2ExpTime.*
The state set. Let denote the set of all Boolean acyclic queries over that are of size at most . Notice that each of these queries is equivalent to a strictly guarded formula. Furthermore, assume that is closed under -reducts, for , provided that they are strictly acyclic as well, i.e., if and , then also provided is strictly acyclic. For , let
[TABLE]
and let be the union of all the sets . Now the set of states consists of an initial state, denoted , plus the set factorized modulo logical equivalence. We denote by the equivalence class of a query . Furthermore, for a strictly guarded formula , we may abuse notation and write for the equivalence class of the strictly acyclic query that is equivalent to . Notice that the size of is exponential in the size of , since there are only exponentially many CQs of size at most that are mutually non-equivalent (cf. **[10]**).
The parity condition. We set , for all . This means that only finite trees are accepted.
The transition function. In the following, for each , we denote by the set of all pairs that are of the form for which there is a squid decomposition of the form and a function such that:
- •
, where all the are relational ground atoms.
- •
, where the are strictly acyclic queries.
Call two pairs and as above equivalent if and . Let be the set of equivalence classes under this relation and denote by the equivalence of a pair under this relation. Now we fix for each a strictly guarded formula that is equivalent to all queries from . Likewise, we fix a function such that , i.e., which picks a representative for each equivalence class .
Now let . Specify as follows:
For the initial state , set
[TABLE]
Intuitively, the automaton selects a squid decomposition where its components are instantiated by names occurring in the root node of the input tree. The automaton tries to verify the single compartments of the squid decomposition, i.e., it tries to match them to the chase expansion of the input database under . 2. 2.
Let . We define according to a case distinction:
- (a)
Suppose that . Then . 2. (b)
Suppose , where is an atomic formula (including equality), , and exhausts all names occuring in . If then . Otherwise,
[TABLE]
where
[TABLE]
We provide some intuitive explanation for this second case.
- (a)
If is the empty query, it can be satisfied at any input node and, hence, the automaton accepts unconditionally on this computation branch. 2. (b)
Otherwise, we first inspect the strictly guarded formula at hand. If the names occurring in the guard are not present at the current node, it rejects. Otherwise, it tries to satisfy with all possible assignments for at the current node and then proceed in state . Apart from these possibilities, the automaton can decide to move to any neighboring node (i.e., the parent or a child) while remaining in state . This amounts to an exhaustive search of the input tree that tries to satisfy in the input tree. Furthermore, the automaton may choose to construct derivation trees for . There, it uses the information provided by in order to find strictly acyclic queries that imply . Consequently, it tries to proceed its search with .
We shall now briefly comment on the running time needed to construct . The interesting part of the construction concerns the transition function , in particular point 2.(b) involving . We have seen that in the proofs of Lemma 44 and Lemma 45 that there are double-exponentially many candidates for the query that (possibly) implies under . Furthermore, consists of at most exponentially many atoms. Each check whether such a query at hand implies requires at most double-exponential time in the size of . This follows from the well-known fact that checking query implication under a set of guarded rules is feasible in 2ExpTime with respect to the size of the right-hand side query, and in polynomial time with respect to the size of the left-hand side query (cf. **[23*]**), i.e., the data complexity of query answering under guarded tgds is polynomial time. *
**
PROOFS OF SECTION 6
Proof of Theorem 26
A proof sketch is given in the main body of the paper. However, the fact that is in 2ExpTime* deserves a formal proof. Recall that to establish the latter result we need a more refined complexity analysis of the problem of deciding whether a guarded OMQ is contained in a UCQ; this is discussed in the main body of the paper. In fact, it suffices to show the following result. As in the previous section, we focus on constant-free tgds and CQs, but all the results can be extended to the general case at the price of more involved definitions and proofs. Moreover, we assume that tgds have only one atom in the head. Recall that we write for the variables of that appear in more than one atom, and we also write for the variables of that appear only in one atom. Then:*
Proposition 47
Consider and a Boolean CQ . The problem of deciding whether is feasible in
double-exponential time in ; and 2. 2.
exponential time in .
It is easy to verify that the above result, together with the algorithm devised in the main body of the paper, implies that is in 2ExpTime*. The rest of this section is devoted to show the above proposition. Our crucial task is, given a CQ , to devise an automaton that accepts consistent labeled trees which correspond to databases that make true.*
Lemma 48
Let be a Boolean CQ over . There is a 2WAPA , where , that accepts a consistent -labeled tree iff . The number of states of is exponential in and polynomial in . Furthermore, can be constructed in exponential time.
Proof.
We are going to construct . Let be the variables of and fix a total order among them. Define the state set to be
[TABLE]
Notice that . We set , where denotes the empty substitution. In the following, we treat as a set of relational atoms and let . For and , define as follows:
- •
If , distinguish the following cases:
If there is an atom such that and , then . 2. 2.
Otherwise, let
[TABLE]
- •
Suppose , for some . Let denote the unique atom such that . Set
[TABLE]
Set the parity condition to be for all . Intuitively, the automaton works in two passes. The first pass consists of the runs working on states of the form . In this pass, the automaton tries to find an assignment for the variables in the query that appear in at least two distinct atoms. When a candidate assignment is found, the automaton changes to state which is the beginning of the second pass. A state of the form means that the assignment can be extended to all variables and, in this state, the automaton tries to extend to cover the variable . The automaton accepts if it is able to extend the candidate assignment to all . ∎
Having the above result in place, we can now reduce the problem in question to the emptiness problem for 2WAPA.
Lemma 49
Consider and a Boolean CQ . We can construct in double-exponential time in and in exponential time in a 2WAPA , which has exponentially many states in and polynomially many states in , such that
[TABLE]
Proof.
Let and . Then is defined as:
[TABLE]
It is an easy task to verify that the claim follows from Lemmas 22, 23, 24, and 48. ∎
It is clear that Proposition 47 is an easy consequence of Lemma 49.
PROOFS OF SECTION 7
*Recall that we focus on unary and binary predicates. Moreover, we consider constant-free tgds and CQs, and we assume that tgds have only one atom in the head. *
Proof of Proposition 30
Basics.* Let be a -tree of width two. We say that a tree decomposition witnessing that is a -tree is lean, if it satisfies the following conditions:*
- •
The elements from occur only in the root of and its immediate successors.
- •
If is a child of in , then there are unique such that and . The element is called new at .
- •
It follows from the previous item that every node in has a unique new element . We additionally require that appears in the bag of each child of .
Intuitively, -trees that have lean tree decomposition represent the actual tree structure of . It is fairly straightforward to see that every -tree has a lean tree decomposition.
Recall that the Gaifman graph of is the graph with and if and coexist in some atom of . Given two nodes from , the distance from to in , denoted , is the minimum length of a path between and , and if such a path does not exist. For , we denote by the minimum distance among two nodes of that respectively have and in their bags. We call the distance from to in .
Notice that in a tree decomposition witnessing that is a -tree, any element , if appears in the bag of , then it occurs only at , at and its children, or at and its parent. Since furthermore the bag of the root node is uniquely determined by , each node in the tree has a uniquely determined set of child nodes whose bags are determined by the structure of alone. Therefore, the following two lemmas follow immediately.
Lemma 50
Let be a lean tree decomposition witnessing that is a -tree. Then for all .
Lemma 51
Let and be two lean tree decompositions witnessing that is a -tree. Then for all .
In the following, we denote by the subinstance of induced by the set of elements whose distance from any in any lean tree decomposition is bounded by . The subinstance is defined analogously.The branching degree of a lean tree decomposition is the maximum number of child nodes of any node contained in the tree of . Notice that two lean tree decompositions of a -tree always have the same branching degree; the argument is similar as for the two lemmas above. Hence, we can simply speak about the branching degree of .
Encodings.* Recall that a consistent -labeled tree encodes information on an -database and an appropriate tree decomposition of . It is clear that has a lean tree decomposition, but it is not guaranteed that this is reflected in as well. We call (the consistent) lean, if the tree decomposition is, where iff for some and . The following is easy to prove:*
Lemma 52
There is a 2WAPA on trees that accepts a consistent -labeled tree iff it is lean. The number of states of is bounded logarithmically in the size of and can be constructed in polynomial time in the size of .
Let be a labeled tree. The branching degree of a node is the cardinality of ; the branching degree of is the maximum over all branching degrees of its nodes and is this maximum does not exist. We also say that is -ary if the branching degree of is bounded by . A node is a leaf node of if it has branching degree zero. The depth of is the maximum length among the lengths of all branches and if this maximum does not exist. Let us remark that the branching degree of the lean -labeled tree as defined for labeled trees equals the branching degree of as defined above.
Lemma 53
Let be an OMQ from . There is an such that the following are equivalent:
There is an -database such that . 2. 2.
There is a -tree with and branching degree at most such that .
Proof.
Let and, let be the 2WAPA from Lemma 24. Take the intersection of with
(i) the 2WAPA from Lemma 23 and
(ii) the 2WAPA from Lemma 52 that checks leanness.
Call the resulting automaton . Then accepts a -labeled tree iff is lean and consistent and . We let be the number of states of and claim that this is the required bound on the branching degree.
First of all, notice that the first item of the lemma trivially implies the second independently from the choice of . For the other direction, suppose that for some -database . Then there is a -tree such that and . Being a -tree, has a lean tree decomposition and the encoding of together with corresponds to a lean -labeled tree . It follows that . By the results of [44], it follows that there is a whose branching degree is bounded by the number of states of , i.e., by . The tree is lean and consistent, therefore is a -tree of branching degree at most for some such that . Furthermore, , as required. ∎
We are now ready to prove Proposition 30:
Proof of Proposition 30.* We largely follow [16] here. Choose as in Lemma 53 above. Suppose first that is UCQ rewritable. Let be a corresponding UCQ rewriting. Since the query is connected, we can assume that is as well. We choose and suppose that for some -tree . Since is a UCQ rewriting, for some . Fix a homomorphism witnessing that . We distinguish cases. Suppose first that . Since is connected, it follows by Lemma 50 and so . On the other hand, if , then it is also easy to check that .*
*For the other direction, suppose that the second item of the proposition’s statement holds, i.e., there is a such that for all -trees over with and branching degree at most it holds that implies or . Let be the set of all -trees such that and that have branching degree at most such that . We regard as a set of BCQs and regard it as factorized modulo logical equivalence. It is clear that is finite then and we claim that is a UCQ rewriting of . We explicitly include the case where is empty, in which case is equivalent to the empty disjunction and there is no database at all such that . To see that is indeed a UCQ rewriting of , let be an -database such that . Then there is an such that . Furthermore, and so as well, since is closed under homomorphisms. Suppose now . We know that there is a -tree with and branching degree at most such that and—when we regard as an instance—there is a homomorphism from to . Let be a minimal connected subset of such that . is again a -tree for some . Therefore or . The latter is impossible by minimality of . Hence, and so there is a (logically equivalent) copy of contained in . Hence, , therefore , and hence . Recall that, when is regarded as an instance, there is a homomorphism from to . Therefore, . *
**
Proof of Proposition 31
Let be an OMQ from such that is connected. We are going to show that the desired 2WAPA can be constructed in 2ExpTime*. Notice that, using similar results as in [16], this gives us a decision procedure for deciding also for non-connected queries. Let us first introduce some auxiliary notions.*
2WAPAs on -ary trees.* A 2WAPA on -ary trees is just defined as a 2WAPA, except that its transitions are , where is the state set of . The notion of run is then defined on -ary trees only and its definition is modified in the obvious way so as to deal with the transitions . Intuitively, for , a transition means that the automaton should move to the -th child of the current node (which is then required to exist) and assume state . Correspondingly, means that the automaton should move to the -th child and assume state provided that this -th child exists at all. We remark that all 2WAPAs constructed in this paper so far can easily be modified to work on -ary trees as well and we shall assume in the following that they do so. Furthermore, deciding whether is feasible in exponential time in the number of states of and in polynomial time in the size of the input alphabet of (cf. [57]).*
Let be as in Proposition 30. In the following, we shall regard all trees mentioned in the following as -ary and let . Before proceeding to a proof of Proposition 31, we must make the notion of being an “extension” of a labeled tree more precise.
Extensions of trees.* Let be a 2WAPA that accepts a -labeled tree iff *
(i) is lean and consistent, * (ii) , and * (iii) . * * Notice that a 2WAPA that accepts a lean and consistent -labeled tree iff can be easily constructed using the construction in Lemma 24. Hence, can be constructed intersecting several 2WAPAs we have already encountered.
*Let be the set of all tuples of the form , where and are states of . We define a new alphabet . Notice that is of double-exponential size in the size of . For , we denote by the restriction of to , that is, . The restriction of a -labeled tree to , denoted , is the tree that arises from when we restrict the label of each node of to . We say that a -labeled tree is consistent if *
(i) *its restriction to is consistent and * (ii) symbols such that appear only in leaf nodes of . * * Likewise, we say that a consistent is lean if is. The decoding of is naturally extended to consistent -labeled trees by setting . The following lemma is a straightforward extension of Lemmas 23 and 52.
Lemma 54
There are 2WAPAs and that respectively accept a -labeled tree iff it is consistent and lean. Both have logarithmically many states in the size of and can be constructed in polynomial time in the size of .
Let be a lean and consistent -labeled tree. We say that is an extension of if is a -labeled tree that arises from by attaching -labeled trees to those leaves of that contain elements from . Furthermore, for such nodes, the labels of the corresponding nodes in are those of restricted to .
Definition 14**.**
*Let be the set of all lean and consistent -labeled trees such that , yet there is an extension of such that and . *
Lemma 55
* is infinite iff is not UCQ rewritable.*
Proof.
Suppose is infinite. Since the trees at hand have bounded branching degree, for every , there is a such that is a -tree (for some ) that contains individuals whose distance from any is greater than or equal to and , yet for some extension of , we have but . Suppose now that is UCQ rewritable. Let be such that for all -trees (of the appropriate dimensions), implies or . Choose and such that
(i) is an extension of ,
(ii) has depth greater than , and
(iii) but and .
Since , we know that or . The latter is impossible by assumption, the former contradicts the fact , since . This proves the direction from left to right. The other direction is immediate. ∎
We are now ready to establish Proposition 31:
Proof of Proposition 31.* We are now going to describe the construction of a 2WAPA such that , which will prove the claim by virtue of Lemma 55. This automaton is the intersection of several ones. First of all, we ensure that all the accepted -trees are lean and consistent (cf. Lemma 54). We additionally intersect the automaton with the complement of from Lemma 23 (more precisely, the version of it running on -labeled trees) and another automaton whose construction we shall describe in more detail here. On a high level, will be constructed so as to accept a lean and consistent -labeled tree if and only if there is an extension of such that accepts . Let be the set of states of , its transition function, and its parity function. For , let be the set of tuples such that the following holds:*
- •
There is a -labeled tree such that and a run of on such that
, i.e., starts from ;121212Strictly speaking, is, of course, not a run since it does not start in the initial state. 2. 2.
* and is accepting on , or there is a node such that .*
Now the set of states of is the same as of , i.e., . Accordingly, the initial state of is that of . Furthermore , for every . Given and , we let
[TABLE]
We are going to give an intuitive explanation of this construction in the following. Roughly, a pair indicates that there is a -labeled tree and run of on such that the root of is labeled with , the run starts in state , and either accepts , or it traverses the root again at some point, then being in state . The set can be computed a priori in 2ExpTime*; considering that is of double-exponential size in the size of , it follows that the collection can be computed in 2ExpTime. Now the input tree for comes with labels from of the form in its leaves. These “types” amount to guesses of possible extensions of the input tree. Utilizing the sets , thus explores the possible ways how the given input tree can be extended to a -labeled tree that is accepted by . *
**
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases . Addison-Wesley, 1995.
- 2[2] A. Amarilli, M. Benedikt, P. Bourhis, and M. Vanden Boom. Query answering with transitive and linear-ordered data. In IJCAI , pages 893–899, 2016.
- 3[3] T. J. Ameloot, B. Ketsman, F. Neven, and D. Zinn. Weaker forms of monotonicity for declarative networking: A more fine-grained answer to the calm-conjecture. In PODS , pages 64–75, 2014.
- 4[4] T. J. Ameloot, B. Ketsman, F. Neven, and D. Zinn. Datalog queries distributing over components. In ICDT , pages 308–323, 2015.
- 5[5] T. J. Ameloot, F. Neven, and J. V. den Bussche. Relational transducers for declarative networking. J. ACM , 60(2):15, 2013.
- 6[6] M. Arenas, R. Hull, W. Martens, T. Milo, and T. Schwentick. Foundations of Data Management (Dagstuhl perspectives workshop 16151). Dagstuhl Reports , 6(4):39–56, 2016.
- 7[7] F. Baader, M. Bienvenu, C. Lutz, and F. Wolter. Query and predicate emptiness in ontology-based data access. J. Artif. Intell. Res. (JAIR) , 56:1–59, 2016.
- 8[8] J.-F. Baget, M. Leclère, M.-L. Mugnier, and E. Salvat. On rules with existential variables: Walking the decidability line. Artif. Intell. , 175(9-10):1620–1654, 2011.
