Rewritability in Monadic Disjunctive Datalog, MMSNP, and Expressive Description Logics
Cristina Feier, Antti Kuusisto, Carsten Lutz

TL;DR
This paper investigates the decidability and complexity of rewriting problems for monadic disjunctive Datalog, MMSNP, and description logic-based queries, providing new constructions and complexity bounds.
Contribution
It establishes decidability and complexity results for rewritability into FO, Datalog, and MDLog, and introduces a new canonical Datalog construction applicable to formulas with free variables.
Findings
Rewritability into FO and monadic Datalog is decidable.
Rewritability into Datalog is decidable under certain equality conditions.
Complexity is 2NExpTime-complete for most cases, with some gaps remaining.
Abstract
We study rewritability of monadic disjunctive Datalog programs, (the complements of) MMSNP sentences, and ontology-mediated queries (OMQs) based on expressive description logics of the ALC family and on conjunctive queries. We show that rewritability into FO and into monadic Datalog (MDLog) are decidable, and that rewritability into Datalog is decidable when the original query satisfies a certain condition related to equality. We establish 2NExpTime-completeness for all studied problems except rewritability into MDLog for which there remains a gap between 2NExpTime and 3ExpTime. We also analyze the shape of rewritings, which in the MMSNP case correspond to obstructions, and give a new construction of canonical Datalog programs that is more elementary than existing ones and also applies to formulas with free variables.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\lmcsdoi
15215 \lmcsheadingLABEL:LastPageJun. 06, 2018May 23, 2019
\titlecomment\lsuper
- The current paper is the extended version of an invited conference abstract [FKL17]. It contains detailed proofs of all results and new material regarding dichotomies and deciding PTime query evaluation.
Rewritability in Monadic Disjunctive Datalog, MMSNP, and Expressive Description Logics
Cristina Feier\rsupera
\lsuperaUniversity of Bremen, Department of Computer Science, Germany
,
Antti Kuusisto\rsuperb
\lsuperbTampere University, Mathematics, Finland
and
Carsten Lutz\rsuperb
Abstract.
We study rewritability of monadic disjunctive Datalog programs, (the complements of) MMSNP sentences, and ontology-mediated queries (OMQs) based on expressive description logics of the family and on conjunctive queries. We show that rewritability into FO and into monadic Datalog (MDLog) are decidable, and that rewritability into Datalog is decidable when the original query satisfies a certain condition related to equality. We establish 2NExpTime-completeness for all studied problems except rewritability into MDLog for which there remains a gap between 2NExpTime and 3ExpTime. We also analyze the shape of rewritings, which in the case of MMSNP correspond to obstructions, and give a new construction of canonical Datalog programs that is more elementary than existing ones and also applies to non-Boolean queries.
Key words and phrases:
MDDLog, MMSNP, OMQ, FO-Rewritability, Monadic Datalog-Rewritability
\lsuper*The author was supported by ERC Consolidator Grant 647289 CODA
1. Introduction
In data access with ontologies, the premier aim is to answer queries over incomplete and heterogeneous data while taking advantage of the domain knowledge provided by an ontology [BtCLW14, CDL*+*09, BO15]. Since traditional database systems are often unaware of ontologies, it is common to rewrite the emerging ontology-mediated queries (OMQs) into more standard database query languages. For example, the DL-Lite family of description logics (DLs) was designed as an ontology language specifically so that any OMQ where is a DL-Lite ontology, a data signature, and a conjunctive query, can be rewritten into an equivalent first-order (FO) query that can then be executed using a standard SQL database system [CGL*+*07, ACKZ09]. In more expressive ontology languages, it is not guaranteed that for every OMQ, there is an equivalent FO query. For example, this is the case for DLs of the and Horn- families and for DLs of the expressive family; please see [BHLS17] for a general introduction to DLs. In many members of the and Horn- families, however, rewritability into monadic Datalog (MDLog) is guaranteed, thus enabling the use of Datalog engines for query answering. In and above, not even Datalog-rewritability is generally ensured. Since ontologies emerging from practical applications tend to be structurally simple, though, there is reason to hope that (FO-, MDLog-, and Datalog-) rewritings do exist in many practically relevant cases even when the ontology is formulated in an expressive language. This has in fact been experimentally confirmed for FO-rewritability in the family of DLs [HLISW15], and it has led to the implementation of rewriting tools that, although incomplete, are able to compute rewritings in many practical cases [PUMH10, KNG14, TSCS15].
Fundamental problems that emerge from this situation are to understand the exact limits of rewritability and to provide (complete) algorithms that decide the rewritability of a given OMQ and that compute a rewriting when it exists. These problems have been adressed in [BLW13, HLISW15, BHLW16, LS17] for DLs from the and Horn- families. For DLs from the family, first results were obtained in [BtCLW14] where a connection between OMQs and constraint satisfaction problems (CSPs) was established and then used to transfer decidability results from CSPs to OMQs. In fact, rewritability is an important topic in CSP (where it is called definability) as it constitutes a central tool for analyzing the complexity of CSPs [FV98, LLT07, ELT07, DL08]. In particular, it is known that deciding the rewritability of (the complement of) a given CSP into FO and into Datalog is NP-complete [LLT07, Bar16, CL17] and rewritability into MDLog is NP-hard and in ExpTime [CL17]. In [BtCLW14], these results were used to show that FO- and Datalog-rewritability of OMQs is decidable and in fact NExpTime-complete when is formulated in or a moderate extension thereof and is an atomic query (AQ), that is, a monadic query of the simple form . For MDLog-rewritability, one can show NExpTime-hardness and containment in 2ExpTime.
The aim of this paper is to study the above questions for OMQs where the ontology is formulated in an expressive DL from the family and where the actual query is a conjunctive query (CQ) or a union of conjunctive queries (UCQ). As observed in [BtCLW14], transitioning in OMQs from AQs to UCQs corresponds to the transition from CSP to its logical generalization MMSNP introduced by Feder and Vardi [FV98] and studied, for example, in [MS07, Mad09, Mad10, BCF12]. More precisely, while the OMQ language that consists of all OMQs where is formulated in and is an AQ has the same expressive power as the complement of CSP (with multiple templates and a single constant), the OMQ language has the same expressive power as the complement of MMSNP (with free variables)—which in turn is essentially a notational variant of monadic disjunctive Datalog (MDDLog). It should be noted, however, that while all these formalisms are equivalent in expressive power, they differ significantly in succinctness [BtCLW14]; in particular, the best known translation of OMQs from into MMSNP/MDDLog involves a double exponential blowup. In contrast to the CSP case, FO-, MDLog-, and Datalog-rewritability of (complemented) MMSNP sentences was not known to be decidable. In this paper, we establish decidability of FO- and MDLog-rewritability in and related OMQ languages, in MDDLog, and in complemented MMSNP. We show that FO-rewritability is 2NExpTime-complete in all three cases, and that MDLog-rewritability is in 3ExpTime; a 2NExpTime lower bound was established in [BL16]. Let us discuss our results on the complexity of FO-rewritability from three different perspectives. From the OMQ perspective, the transition from AQs to UCQs results in an increase of complexity from NExpTime to 2NExpTime. From the monadic Datalog perspective, adding disjunction (transitioning from monadic Datalog to MDDLog) results in a moderate increase of complexity from 2ExpTime [BtCCV15] to 2NExpTime. And from the CSP perspective, the transition from CSPs to MMSNP results in a rather dramatic complexity jump from NP to 2NExpTime.
For Datalog-rewritability, we obtain only partial results. In particular, we show that Datalog-rewritability is decidable and 2NExpTime-complete for MDDLog programs that, in a certain technical sense made precise in the paper, have equality. For the general case, we only obtain a potentially incomplete procedure. It is well possible that the procedure is in fact complete, but proving this remains an open issue for now. These results also apply to analogously defined classes of MMSNP sentences and OMQs that have equality.
While we mainly focus on deciding whether a rewriting exists rather than actually computing it, we also analyze the shape that rewritings can take. Since the shape turns out to be rather restricted, this is important information for algorithms (complete or incomplete) that seek to compute rewritings. In the CSP/MMSNP world, this corresponds to analyzing obstruction sets for MMSNP, in the style of CSP obstructions [Nes08, BKL08, Ats08] and not to be confused with colored forbidden patterns sometimes used to characterize MMSNP [MS07]. More precisely, we show that an OMQ from is FO-rewritable if and only if it is rewritable into a UCQ in which each CQ has treewidth , where is the size of ;111What we mean here is that has a tree decomposition in which every bag has at most elements and in which neighboring bags overlap in at most one element. similarly, the complement of an MMSNP sentence is FO-definable if and only if it admits a finite set of finite obstructions of treewidth where is the diameter of (the maximum number of variables in a negated conjunction in its body, in Feder and Vardi’s terminology). We also show that an OMQ is MDLog-rewritable if and only if it is rewritable into an MDLog program of diameter where the diameter of an MDLog program is the maximum number of variables in a rule; equivalently, the complemented of an MMSNP sentence is MDLog-definable if and only if it admits a (potentially infinite) set of finite obstructions of treewidth where is the diameter of . For the case of rewriting into Datalog, we give a new and direct construction of canonical Datalog-rewritings of MMSNP sentences. It has been observed in [FV98] that for every CSP and all , it is possible to construct a canonical Datalog program of width and diameter (the width is the maximum arity of IDB relations) in the sense that if any such program is a rewriting of the CSP, then so is ; moreover, even when there is no -Datalog rewriting, then is the best possible approximation of such a rewriting. The existence of canonical Datalog-rewritings for (complemented) MMSNP sentences was already known from [BD13]. However, the construction given there is quite complex, proceeding via an infinite template that is obtained by applying an intricate construction due to Cherlin, Shelah, and Shi [CSS99], and resulting in canonical programs that are rather difficult to understand and to analyze. In contrast, our construction is elementary and essentially parallels the CSP case; it also applies to MMSNP formulas with free variables, where the canonical program takes a rather special form that involves parameters, similar in spirit to the parameters to least fixed-point operators in FO(LFP) [BBV16].
Our main technical tool is the translation of MMSNP sentences into CSPs exhibited by Feder and Vardi [FV98]; actually, the target of the translation is a generalized CSP, meaning that there are multiple templates. The translation is not equivalence preserving and involves a double exponential blowup, but it was designed so as to preserve complexity up to polynomial time reductions. Here, we are primarily interested in the semantic relationship between the original MMSNP sentence and the constructed CSP. It turns out that the translation does not quite preserve rewritability. In particular, when the original MMSNP sentence has a rewriting, then the natural way of constructing from it a rewriting for the CSP is sound only on instances of high girth. However, FO- and MDLog-rewritings of CSPs that are sound on high girth (and unconditionally complete) can be converted into rewritings that are unconditionally sound (and complete). The same is true for Datalog-rewritings when the CSP is derived from an MMSNP sentence that has equality, but it remains open whether it is true for Datalog-rewritings of unrestricted CSPs.
With our translations in place, we can also make relevant observations regarding the (data) complexity of query evaluation in MMSNP, in MDDLog, and of OMQs. This is especially interesting in the light of the recently obtained breakthrough in CSPs that there is a dichotomy between PTime and NP in the complexity of CSPs [Bul17, Zhu17] and that it is decidable and NP-complete whether a CSP is in PTime [CL17]. We show that this implies a dichotomy between PTime and coNP for MDDLog and for all OMQ languages mentioned above. We also prove that PTime query evaluation is decidable and 2NExpTime-complete in the mentioned query languages, and that the same is true for MMSNP.
The structure of this paper is as follows. In Section 2, we introduce the essentials of disjunctive Datalog and its relevant fragments as well as CSP and MMSNP; in fact, we shall always work with Boolean MDDLog rather than with complemented MMSNP. In Section 3, we summarize the main properties of Feder and Vardi’s translation of MMSNP into CSP. We use this in Section 4 to show that FO- and MDLog-rewritability of Boolean MDDLog programs and of the complement of an MMSNP sentences is decidable, also establishing the announced complexity results. In Section 5, we analyze the shape of FO- and MDLog-rewritings and of obstructions for MMSNP sentences. We also establish an MMSNP analogue of an essential combinatorial lemma for CSPs which says that it is possible to replace a structure by a structure of high high girth while preserving certain homomorphisms; the MMSNP analogue achieves high ‘decomposition girth’ (defined in the paper) and preserves the truth of certain MMSNP sentences. In Section 6, we study Datalog-rewritability of MDDLog programs that have equality and construct canonical Datalog programs. Section 7 is concerned with lifting our results from the Boolean case to the general case, concerning the complexity of deciding rewritability, the shape of rewritings, and the construction of canonical Datalog programs. In this section, Datalog programs with parameters play a central role. In Section 8, we introduce OMQs and further lift our results to that setting, finally arriving at our goal to study fundamental rewritability questions for OMQ languages based on (unions of) conjunctive queries. Section 9 is then concerned with dichotomies and the complexity of deciding PTime query evaluation. We conclude in Section 10.
2. Preliminaries
We introduce disjunction Datalog, CSP, and MMSNP. To avoid overloading this section, the introduction of ontology languages and ontology-mediated queries is deferred to Section 8.
A schema is a finite collection of relation symbols with associated arity. An -fact is an expression of the form where is an -ary relation symbol, and are elements of some fixed, countably infinite set of constants. For an -ary relation symbol , is . An -instance is a finite set of -facts. The active domain of is the set of all constants that occur in the facts in . We use the symbols , , to denote tuples of constants and, slightly abusing notation, write to mean that is a tuple of constants from when we do not want to be precise about the length of the tuple. For an instance and a schema , we write to denote the restriction of to the relation symbols in .
A tree decomposition of an instance is a pair , where is an undirected tree and is a family of subsets of such that the following conditions are satisfied:
- (1)
for all , is nonempty and connected in ; 2. (2)
for every fact in , there is a such that .
Unlike in the traditional setup [FG06], we are interested in two parameters of tree decompositions instead of only one. We call an -tree decomposition if for all distinct , and . An instance has treewidth if it admits an -tree decomposition.
We now define the notion of girth. A finite structure has a cycle of length if there are distinct facts and positions such that
- •
for and
- •
the constant in the -th position of is identical to the constant in the -th position of for and with and .
The girth of is the length of the shortest cycle in it and if has no cycle (in which case we say that is a tree).
A constraint satisfaction problem (CSP) is defined by an instance over a schema , called template.222Adopting Datalog terminology, we generally call the schema in which the data is formulated the EDB schema and denote it with . The problem associated with , denoted , is to decide whether an input instance over admits a homomorphism to , denoted . We use coCSP to denote the complement problem, that is, deciding whether . A generalized CSP is defined by a set of templates over the same schema and asks for a homomorphism from the input to at least one of the templates , denoted . Note that a (generalized) CSP can be viewed as a Boolean query over -instances.
An MMSNP sentence over schema has the form with monadic second-order variables, first-order variables, and a conjunction of quantifier-free formulas of the form
[TABLE]
with and where each takes the form or with and each takes the form . The diameter of is the maximum number of variables in some implication in . This presentation is syntactically different from, but semantically equivalent to the original definition from [FV98], which does not use the implication symbol and instead restricts the allowed polarities of atoms. Both forms can be interconverted in polynomial time, see [BtCLW14]. The semantics of MMSNP is the standard semantics of second-order logic. More information can be found, e.g., in [BCF12, BD13]. Note that, just like CSPs, MMSNP sentences can be viewed as Boolean queries.
A conjunctive query (CQ) takes the form where is a conjunction of relational atoms and , denote tuples of variables; the equality relation may be used. Whenever convenient, we shall confuse with the set of atoms in . The variables in are the answer variables in . The arity of is the number of its answer variables and is Boolean if it has arity zero. We say that is over if uses only relation symbols from . An answer to on an -instance is a tuple of constants such that in the standard sense of first-order logic. It is well-known that if and only if there is a homomorphism from viewed as an instance to that takes to . A Boolean CQ is true on an instance , denoted , if the empty tuple is an answer to on . A CQ is a contraction of a CQ if it can be obtained from by identifying variables. A union of conjunctive queries (UCQ) is a disjunction of CQs with the same answer variables. The semantics of UCQs is defined in the expected way.
A disjunctive Datalog rule has the form
[TABLE]
with and . We refer to as the head of , and to as the body. Every variable that occurs in the head of is required to also occur in the body of . A disjunctive Datalog (DDLog) program is a finite set of disjunctive Datalog rules with a selected goal relation that does not occur in rule bodies and appears only in non-disjunctive goal rules . The arity of is the arity of the relation; we say that is Boolean if it has arity zero. Relation symbols that occur in the head of at least one rule of are intensional (IDB) relations, and all remaining relation symbols in are extensional (EDB) relations. Note that, by definition, is an IDB relation. When all relations in are from schema , then we say that is over EDB schema . The IDB schema of is the set of all IDB relations in .
We will sometimes use body atoms of the form that are vacuously true for all elements of the active domain. This is just syntactic sugar since any rule with body atom can equivalently be replaced by a set of rules obtained by replacing in all possible ways with an atom where is a relation symbol from and where for some and all other are fresh variables.
A DDLog program is called monadic or an MDDLog program if all its IDB relations with the possible exception of have arity at most one. The size of a DDLog program is the number of symbols needed to write it (where relation symbols and variable names count one), its width is the maximum arity of non- IDB relations used in it, and its diameter is the maximum number of variables that occur in a rule in .
A Datalog rule is a disjunctive Datalog rule in which the rule head contains exactly one disjunct. Datalog (DLog) programs and monadic Datalog (MDLog) programs are then defined in the expected way. We call a Datalog program an -Datalog program if its width is and its diameter is .
For an -ary DDLog program over schema , an -instance , and , we write if where the variables in all rules of are universally quantified and thus is a set of first-order (FO) sentences; please see [AHV95, EGM97] for more information on the semantics of (disjunctive) Datalog. A query over of arity is
- •
sound for if for all -instances and , implies ;
- •
complete for if for all -instances and , implies ;
- •
a rewriting of if it is sound for and complete for .
Note that Boolean programs are also covered by the above definitions. To additionally specify the syntactic shape of , we speak of a UCQ-rewriting, an MDLog-rewriting, and so on. An FO-rewriting takes the form of an FO-query that uses only relation symbols from the relevant EDB schema and possibly equality, but neither constants nor function symbols. We say that an MDDLog program is FO-rewritable if there is an FO-rewriting of , and likewise for UCQ-rewritability and for MDLog-rewritability. Since a generalized CSP defined by a set of templates can be viewed as a Boolean query, we can also speak of a query to be sound and complete for respectively a rewriting of coCSP. The definitions are as expected, paralleling the ones above.
It was shown in [BtCLW14] that the complement of an MMSNP sentence can be translated into an equivalent Boolean MDDLog program in polynomial time and vice versa; moreover, the transformations preserve diameter and all other parameters relevant for this paper. From now on, we will thus not explicitly distinguish between Boolean MDDLog and (complemented) MMSNP.
3. MDDLog, Simple MDDLog and CSP
Feder and Vardi show how to translate an MMSNP sentence into (the complemen of) a generalized CSP that has the same complexity up to polynomial time reductions [FV98]. The resulting CSP has a different schema than the original MMSNP sentence and is thus not equivalent to it. We are going to make use of this translation to reduce rewritability problems for MDDLog to the corresponding problems for CSPs. Consequently, our main interest is in the precise semantic relationship between the MMSNP sentence and the constructed CSP, rather than in their complexity. In this section, we sum up the properties of the results obtained in [FV98] that are relevant for us. These properties are all we need in later sections, that is, we do not need to go further into the details of the translation. For the reader’s convenience and information, we still describe the translations in full detail in Appendix A; these are based on the presentation given in [BL16], which is more detailed than the original presentation in [FV98].
Let be a schema. A schema is a -aggregation schema for if its relations have the form where is a CQ over without quantified variables and the arity of is identical to the length of , which is at most . The generalized CSP to be constructed later makes use of a schema of this form. What is important at this point is that there are natural translations of instances between the two schemas. To make this precise, let be an -instance. The corresponding -instance consists of all facts such that . Conversely, let be an -instance. The corresponding -instance consists of all facts such that and is a conjunct of .
{exa}
Let with a binary relation symbol,
[TABLE]
and let consist of where for brevity. Take the -instance defined by
[TABLE]
The corresponding -instance is
[TABLE]
Note that when we transition from back to and take the -instance corresponding to , we do not obtain , but rather a strict superset that contains additional facts such as . This is illustrated in Figure 1 which shows the instances , , and a subset of that contains all facts from plus one additional fact.
The translation in [FV98] consists of two steps. We describe them here using Boolean MDDLog instead of (complemented) MMSNP. The first step is to transform the given Boolean MDDLog program into a Boolean MDDLog program over a suitable aggregation schema such that is of a restricted syntactic form called simple. In the second step, one transforms into a generalized CSP whose complement is equivalent to .
We start with summing up the important aspects of the first step. A Boolean MDDLog program is simple if it satisfies the following conditions:
- (1)
every rule in contains at most one EDB atom and this atom contains all variables of the rule body, each variable exactly once; 2. (2)
rules without an EDB atom contain at most a single variable.
Now, the first step achieves the following. {thmC}[[FV98]]
Given a Boolean MDDLog program over EDB schema of diameter , one can construct a simple Boolean MDDLog program over a -aggregation EDB schema for and IDB schema such that
- (1)
If is an -instance and the corresponding -instance, then iff ; 2. (2)
If is an -instance and the corresponding -instance, then
- (a)
implies ; 2. (b)
implies if the girth of exceeds .
If is of size , then the size of and the cardinality of are bounded by , where is a polynomial. The construction takes time polynomial in the size of .
The translation underlying Theorem 1 consists of three steps itself: first saturate by adding all rules that can be obtained as a contraction of a rule in , that is, by identifying variables in the rule body and head in a consistent way. Then rewrite in an equivalence-preserving way so that all rule bodies are biconnected, introducing fresh unary and nullary IDB relations as needed. And finally replace the conjunction of all EDB atoms in each rule body with a single EDB atom , additionally taking care of interactions between the new EDB relations that arise e.g. when we have two relations and such that is contained in in the sense of query containment. Details are in Appendix A.1.
The following theorem summarizes the second step of the translation of Boolean MDDLog into a generalized CSP. {thmC}[[FV98]]
Let be a simple Boolean MDDLog program over EDB schema and with IDB schema , the maximum arity of relations in . Then there exists a set of templates over such that
- (1)
is equivalent to coCSP; 2. (2)
and for each ;
The construction takes time polynomial in .
We again sketch the idea underlying the proof of the theorem. The desired set of templates contains one template for every 0-type, that is, for every set of nullary IDB relations in that does not contain and that satisfies all rules in which use only nullary IDBs. Each template contains one constant for every 1-type , that is, for every set of unary IDBs that agrees on nullary IDBs with the 0-type for which the template was constructed and that satisfy all rules in which use only IDB relations that are at most unary. One then interprets all EDB relations in a maximal way so that all rules in are satisfied. The fact that is simple implies that no choices arise, that is, there is only one maximal interpretation of each EDB relation and the interpretations of different such relations do not interact. Details are given in Appendix A.2.
4. FO- and MDLog-Rewritability of Boolean MDDLog Programs
We exploit the translations described in the previous section and the known results that FO-rewritability of CSPs and MDLog-rewritability of coCSPs are decidable to obtain analogous results for Boolean MDDLog, and thus also for MMSNP. In the case of FO-rewritability, we obtain tight 2NExpTime complexity bounds. For MDLog-rewritability, the exact complexity remains open (as in the CSP case), between 2NExpTime and 3ExpTime.
We start with observing that FO-rewritability and MDLog-rewritability are more closely related than one might think at first glance. Recall that, by Rossman’s homomorphism preservation theorem [Ros08], a first-order formula is preserved under homomorphisms on finite structures if and only if it is equivalent to a UCQ. While every MDLog-rewriting can be viewed as an infinitary UCQ-rewriting, Rossman’s result implies that FO-rewritability of a Boolean MDDLog program coincides with (finitary) UCQ-rewritability. The latter is true also in the non-Boolean case.
Proposition 1**.**
Let be an MDDLog program. Then is FO-rewritable iff is UCQ-rewritable.
Proof 4.1**.**
It is well known and easy to show that truth of disjunctive Datalog programs is preserved under homomorphisms. Thus, the proposition immediately follows from Rossman’s theorem in the Boolean case. For the non-Boolean case, we observe that Rossman establishes his result also in the presence of constants. Let be an MDDLog program and a rewriting of . We can apply Rossman’s result to , where is a tuple of constants of the same length as , obtaining a UCQ equivalent to . Let be obtained from by replacing the constants in with the variables from . It can be verified that is a rewriting of .
For utilizing the translation of Boolean MDDLog programs to generalized CSPs in the intended way, the interesting aspect is to deal with the translation of a Boolean MDDLog program into a simple program stated in Theorem 1, since it is not equivalence preserving. The following proposition relates rewritings of to rewritings of . It also applies to Datalog-rewritings, which we will make use of in Section 6.
Lemma 2**.**
Let be a Boolean MDDLog program of diameter , as in Theorem 1, and . Then
- (1)
every -rewriting of can effectively be converted into a -rewriting of ; 2. (2)
every -rewriting of can effectively be converted into a -rewriting of that is (i) sound on instances of girth exceeding and (ii) complete.
Proof 4.2**.**
Let and be the EDB schema of and of , respectively. We start with the case .
For Point 1, let be a UCQ-rewriting of . Let be the UCQ obtained from by replacing every atom with , that is, with the result of replacing the variables in with the variables (which may lead to identifications). We show that is as required. Let be an -instance and the corresponding -instance. Then we have iff (by Point 1 of Theorem 1) iff (by choice of ) iff (by construction of and of ). Let us expand on the latter.
First assume that . Then there is a CQ in and a homomorphism from to . By construction, contains a CQ that is obtained from by replacing every atom with . Clearly, for every atom , we must have . The construction of yields . Consequently, is also a homomorphism from to . Conversely, assume that there is a CQ in and a homomorphism from to . Then there is a CQ in from which was obtained by the described replacement operation. For every atom , we must have . We obtain and thus is a homomorphism from to .
For Point 2, let be a UCQ-rewriting of . The UCQ consists of all CQs that can be obtained as follows:
- (1)
choose a CQ from , a contraction of , and a partition of ; 2. (2)
for each , choose a relation from and a tuple of variables (repeated occurrences allowed) that are either from or do not occur in such that ; then replace in with the single atom .
To establish that is as desired, we show that for every -instance
- (I)
* implies if is of girth exceeding (soundness) and* 2. (II)
* implies (completeness).*
Let be the -instance corresponding to .
For Point (I), we observe that implies (by construction of and of ) implies (by choice of ) implies (by Point 2b of Theorem 1 and if is of girth exceeding ). Let us zoom into the first implication. Assume that . Then there is a CQ in and a homomorphism from to . There must be some CQ in from which has been constructed in Steps 1 and 2 above. Let be as in this construction. It suffices to show that is a homomorphism from to , for each . Thus fix a . Then there is a relation and a tuple of variables that are either from or do not occur in such that and . Thus . By construction of , this yields and thus we are done.
For Point (II), we have that implies (by Point 2a of Theorem 1) implies (by choice of ). It thus remains to show that implies . Thus assume that there is a CQ in and a homomorphism from to . We use and to guide the choices in Step 1 and Step 2 of the construction of CQs in to exhibit a CQ in such that .
We start with Step 1. As , we use the contraction of obtained by identifying variables and whenever . Thus, is an injective homomorphism from to . We next need to choose a partition of . For every fact , choose a fact that was obtained from during the construction of and denote this fact with . Now let be the partition of obtained by grouping together two atoms and if and only if . Let denote the (unique) value of for all the atoms in .
Step 2 deals with each query separately. We choose the relation from , which clearly is in . We choose the tuple of variables based on the tuple of individuals . Let . Then the -th variable in is if (which is well-defined since is injective) and a fresh variable if there is no such . This finishes the guiding process and thus gives rise to a query in .
It remains to argue that can be extended to a homomorphism from to . Take a and consider the corresponding atom in . Then all the facts in were obtained from the fact during the construction of . By construction of from , we can extend to the fresh variables in so that and thus . Doing this for all yields the desired .
Now for the cases . We treat these cases in one since our construction preserves the width of Datalog-rewritings. In fact, this construction is very similar to the case , so we only give a sketch.
For Point 1, let be a Datalog-rewriting of . We construct a Datalog program of the same width over EDB schema by modifying the EDB part of each rule body in the same way in which we had modified the UCQ-rewriting in the case : replace every EDB-atom with . We then have iff (by Point 1 of Theorem 1) iff (by choice of ) iff . The latter is by construction of and of . To prove it in more detail, it suffices to show that for every extension of to the IDB relations in with corresponding extension of , and every rule body in which was translated into a rule body in , we have iff . The arguments needed are as in the case .
The proof of Point 2 can be adapted from UCQs to Datalog in an analogous way. Let be a Datalog-rewriting of . We construct a Datalog program of the same width over EDB schema . The rules in are obtained by taking a rule
[TABLE]
from , where the are IDB and is a CQ over schema , converting into a CQ over schema in two steps, in the same way in which a CQ over was converted into a CQ over in the case , and then including in the rule
[TABLE]
The crucial step in the correctness proof is to show that implies for any -instance and corresponding -instance . The arguments are again the same as in the case , the main difference being that we need to consider extensions of and to IDB relations from instead of working with and themselves.
Point 2 of Lemma 2 only yields a rewriting of on -instances of high girth. We next show that, for , the existence of a -rewriting on instances of high girth implies the existence of a -rewriting that works on instances of unrestricted girth. Whether the same is true for remains as an open problem. We need the following well-known lemma that goes back to Erdös and was adapted to CSPs by Feder and Vardi. Informally, it says that every instance can be ‘exploded’ into an instance of high girth that behaves similarly regarding homomorphisms.
Lemma 3**.**
For every instance and , there is an instance (over the same schema) such that , has girth exceeding , and for every instance of size at most , we have iff .
Feder and Vardi additionally show that can be constructed by a randomized polynomial time reduction that was later derandomized by Kun [Kun13], but here we do not rely on such computational properties. Every CQ can be viewed as an instance by using the variables as constants and the atoms as facts. It thus makes sense to speak about tree decompositions of CQs and about their treewidth, and it is clear what we mean by saying that a CQ is a tree (that is, has girth ).
Lemma 4**.**
Let be a set of templates over schema , , and . If coCSP() is -rewritable on instances of girth exceeding , then it is -rewritable.
Proof 4.3**.**
We start with . Let be a UCQ that defines coCSP on instances of girth exceeding , and let be the UCQ that consists of all contractions of a CQ in that are a tree CQ. We show that defines coCSP on unrestricted -instances.
Let be an -instance. First assume that . By Lemma 3, there is an -instance of girth exceeding and also exceeding the number of variables in each CQ in and satisfying and . Thus , that is, there is a CQ in and a homomorphism from to . Let be the contraction of obtained by identifying variables and if . Thus, is an injective homomorphism from to . Since the girth of exceeds the number of variables in , must be a tree. Consequently, is a CQ in and we have . From , we obtain .
Now assume that . Then, there is a tree CQ in such that . When we view as an -instance , then clearly and has girth exceeding . Thus, , and from we obtain .
Now for the case . Let be an MDLog program that defines coCSP on instances of girth exceeding . Let be the program obtained from by replacing every rule with all rules such that is a tree CQ that is a contraction of . We show that is an MDLog-definition of coCSP on instances of unrestricted girth.
Let be an -instance. First assume that . By Lemma 3, there is an -instance whose girth exceeds and also exceeds the diameter of and that satisfies and . The latter yields . It remains to show that this implies since with , this yields as required.
To show that follows from , it suffices to show that all IDB facts derived by starting from are also derived by . Thus let be an extension of to the IDBs in . It is enough to show that when a single application of a rule from in yields an IDB atom , then can derive the same atom. The former is the case only if contains a rule such that there is a homomorphism from to with . Let be the contraction of obtained by identifying variables and when . Since the girth of exceeds the diameter of , is a tree. Thus, contains the rule and the application of this rule in enabled by yields . We have thus shown and are done.
Now assume that . Then there is a proof tree for from and , see [AHV95] for details. From that tree, we can read off an -instance such that , , and, since is monadic and only comprises rules with tree-shaped bodies, is a tree. Thus, has girth exceeding and from we get . But with , this yields as required.
Putting together Theorems 1 and 1, Proposition 1, and Lemmas 2 and 4, we obtain the following reductions of rewritability of Boolean MDDLog programs to CSP rewritability.
Proposition 5**.**
Every Boolean MDDLog program can be converted into a set of templates such that
- (1)
* is -rewritable iff coCSP** is -rewritable for every ;* 2. (2)
every -rewriting of can be effectively translated into a -rewriting of coCSP* and vice versa, for every .* 3. (3)
* and for each , the size of and a polynomial.*
The construction takes time polynomial in .
FO-rewritability of CSPs (and their complements) is NP-complete [LLT07] and it was observed in [BtCLW14] that the upper bound lifts to generalized CSPs. MDLog-rewritability of coCSPs is NP-hard and in ExpTime [CL17]. We show in Appendix B that also this upper bound lifts to generalized coCSPs. Together with Proposition 5, this yields the upper bounds in the following theorem. The lower bounds are from [BL16].
Theorem 6**.**
For Boolean MDDLog programs and the complement of MMSNP sentences,
- (1)
FO-rewritability (equivalently: UCQ-rewritability) is 2NExpTime*-complete;* 2. (2)
MDLog-rewritability is in 3ExpTime* (and 2NExpTime-hard).*
5. Shape of Rewritings, Obstructions, Explosion
In the FO case, it is possible to extract from the approach in the previous section an algorithm that computes actual rewritings, if they exist. However, that algorithm is hardly practical. An important first step towards the design of more practical algorithms that compute rewritings (in an exact or in an approximative way) is to analyze the shape of rewritings. In fact, both FO- and MDLog-rewritings of coCSPs are known to be of a rather restricted shape, far from exploiting the full expressive power of the target languages. In this section, we establish corresponding results for Boolean MDDLog. This topic is closely related to the theory of obstructions, so we also establish connections between the rewritability of MMSNP sentences and natural obstruction sets. Finally, we observe an MMSNP counterpart of Lemma 3, the fundamental ‘explosion’ lemma for CSPs.
The following summarizes our results regarding the shape of rewritings.
Theorem 7**.**
Let be a Boolean MDDLog program of diameter . Then
- (1)
if is FO-rewritable, then it has a UCQ-rewriting in which each CQ has treewidth ; 2. (2)
if is MDLog-rewritable, then it has an MDLog-rewriting of diameter .
Proof 5.1**.**
We analyze the proof of Lemma 2 and use known results from CSP. In fact, any FO-rewritable coCSP has a UCQ-rewriting that consists of tree CQs [NT00], and thus the same holds for simple Boolean MDDLog programs. If we convert such a rewriting of into a rewriting of as in the proof of Lemma 2, we obtain a UCQ-rewriting in which each CQ has treewidth . For Point 2 of Theorem 7, one uses the proof of Lemma 2 and the known fact that every MDLog-rewritable CSP has an MDLog-rewriting in which every rule body comprises at most one EDB atom, see e.g. the proof of Theorem 19 in [FV98].
In a sense, the concrete bound in Points 1 and 2 of Theorem 7 is quite remarkable. Point 2 says, for example, that when eliminating disjunctions from a Boolean MDDLog program, it is never necessary to increase the diameter!
We now consider obstructions. An obstruction set for a CSP template over schema is a set of instances over the same schema such that for any -instance , we have iff for some . The elements of are called obstructions. A lot is known about CSP obstructions. For example, is FO-rewritable if and only if it has a finite obstruction set [Ats08] if and only if it has a finite obstruction set that consists of finite trees [NT00], and is MDLog-rewritable if and only if it has a (potentially infinite) obstruction set that consists of finite trees [FV98]. Here we consider obstruction sets for MMSNP, defined in the obvious way: an obstruction set for an MMSNP sentence over schema is a set of instances over the same schema such that for any -instance , we have iff for some . This should not be confused with colored forbidden patterns used to characterize MMSNP in [MS07]. The following characterizes FO-rewritability of MMSNP sentences in terms of obstruction sets.
Corollary 8**.**
For every MMSNP sentence , the following are equivalent:
- (1)
* is FO-rewritable;* 2. (2)
* has a finite obstruction set;* 3. (3)
* has a finite set of finite obstructions of treewidth .*
Corollary 8 follows from Point 1 of Theorem 7 and the straightfoward observations that an MMSNP sentence is FO-rewritable iff is (which is equivalent to a Boolean MDDLog program) and that every finite obstruction set for gives rise to a UCQ-rewriting of and vice versa. We now turn to MDLog-rewritability.
Proposition 9**.**
Let be an MMSNP sentence of diameter . Then is MDLog-rewritable iff has a set of obstructions (equivalently: finite obstructions) that are of treewidth .
Proof 5.2**.**
The “only if” direction is a consequence of Point 2 of Theorem 7 and the fact that, for any Boolean monadic Datalog program of diameter over EDB schema , a proof tree for from an -instance and gives rise to a finite -instance of treewidth with . The desired obstruction set for is then the set of all these . The “if” direction is a consequence of Theorem 5 in [BD13].
We remark that the results in [BD13] almost give Proposition 9, but do not seem to deliver any concrete bound on the parameter of the treewidth of obstruction sets.
We close with observing an MMSNP counterpart of the ‘explosion’ Lemma 3, first giving a preliminary. Let be an instance over some schema . A -decomposition of is a pair where is a set of indices and is a partition of such that for all distinct , and . Thus, a -decomposition decomposes into parts of size at most and with little overlap. These parts can be viewed as the facts of an instance over an aggregation schema defined by the relations where is viewed as a CQ, that is,
[TABLE]
where we assume some fixed (but otherwise irrelevant) order on the elements of each . Now, we say that has -decomposition girth if is the supremum of the girths of , for all -decompositions of . It can be shown that has -decomposition girth if and only if it has treewidth .
Here comes the announced MMSNP counterpart of Lemma 3.
Lemma 10**.**
For every instance and , and every MDDLog program of diameter at most , there is an instance (over the same schema) such that , has -decomposition girth exceeding , and iff .
Proof 5.3**.**
Let be a Boolean MDDLog program of diameter over EDB schema . By Theorems 1 and 1, there is a -aggregation schema and a set of templates over such that:
- (1)
for any -instance with corresponding -instance , iff ; 2. (2)
for any -instance whose girth exceeds with corresponding -instance , iff .
Let and be an -instance and its corresponding -instance. Furthermore, let be the -instance obtained from by applying Lemma 3 with and as given. Then , has girth exceeding , and iff iff . Let be the -instance corresponding to . As has girth exceeding , Point (2) above yields iff . In summary, we thus obtain iff .
It thus remains to show that has -decomposition girth exceeding and that . The former is witnessed by the -decomposition of obtained by using as the facts of and as the set of facts obtained from fact during the construction of .
As the last step, we argue that follows from , and that in fact any homomorphism from to is also a homomorphism from to . Thus let be such a homomorphism. For any fact in , there is a fact in such that . We have . By definition of , this means and we are done.
We believe that Lemma 10 can be useful in many contexts, saving a detour via CSPs. For example, it enables an alternative proof of Theorem 7. We illustrate this for Point 1. We can start with a UCQ-rewriting of an MDDLog program of diameter and show that the UCQ that consists of all CQs of treewidth that are a contraction of a CQ in must also be a rewriting of : take an instance that makes true, use Lemma 10 to transform to an of girth exceeding and also exceeding the size of any CQ in such that and , observe that and that a homomorphism from any CQ in to gives rise to a homomorphism from a CQ in to , and derive from .
6. Datalog-Rewritability of Boolean MDDLog Programs
and Canonical Datalog Programs
We consider rewriting Boolean MDDLog programs into Datalog programs, making two contributions. First, we show that Datalog-rewritability is decidable for programs that have equality, a condition that is defined in detail below. For programs that do not have equality, the same construction yields a procedure that is sound, but whose completeness remains an open problem. And second, we give a new and direct construction of canonical Datalog-rewritings of Boolean MDDLog programs (equivalently: the complements of MMSNP sentences), bypassing the construction of infinite templates [BD13] which involves the application of a non-trivial construction due to Cherlin, Shelah, and Shi [CSS99]. This construction is potentially useful even though it is yet unknown whether Datalog-rewritability of MDDLog programs is decidable (for programs that do not have equality): when is not rewritable, then the canonical Datalog-rewriting is the best possible approximation of in terms of a Datalog program (of given width and diameter).
6.1. Datalog-Rewritability of Boolean MDDLog Programs
A CSP template has equality if its EDB schema includes the distinguished binary relation and interprets as the relation . Thus, is an extremely natural kind of constraint: a fact in the input instance means that and must be mapped to the same template element; spoken from the perspective of constraint satisfaction, they are variables that must receive the same value.
In accordance with the above, we say that an MDDLog program has equality if its EDB schema includes the distinguished binary relation , contains the rules
[TABLE]
for each IDB relation , and these are the only rules that mention . Thus, a fact in the input instance says that the same IDB relations can be derived by for and for . It can be verified that when an MDDLog program that has equality is converted into a generalized CSP based on a set of templates according to Theorems 1 and 1 (using the concrete constructions in the appendix), then all templates in have equality.
We aim to show decidability of the Datalog-rewritability of MDDLog programs that have equality following the strategy that we have used for rewritability into FO and into MDLog in Section 4. We thus need a counterpart of Lemma 4, that is, we have to show that for all templates that have equality, Datalog-rewritability of coCSP on instances of high girth implies unrestricted Datalog-rewritability. It is here that having equality is an advantage. In particular, every input instance for coCSP can be made high girth preserving (non-)homomorphisms to by introducing additional -facts. This is similar in spirit to the explosion Lemma 3, but the construction is much simpler than in the proof of that lemma. We next make it explicit.
Let be an -instance and let . We use to denote the set of pairs such that and . In what follows, for any tuple of constants , we use to denote its -th component. Reserve fresh constants as follows:
- •
a constant , for all ;
- •
constants , for all with .
Define an instance that consists of the following facts:
- (1)
for every with of arity , the fact where ; 2. (2)
for all distinct with , the facts
[TABLE]
Observe that has girth exceeding . Moreover, it satisfies the following crucial property.
Lemma 11**.**
For every CSP template over that has equality, iff .
Proof 6.1**.**
Let be a template over that has equality. We have to show that there is a homomorphism from to iff there is a homomorphism from to . In fact, can be obtained from by setting when ; conversely, can be obtained from by setting when —the latter is well-defined by construction of and since is interpreted as the reflexive relation in .
We are now ready to establish the announced counterpart of Lemma 4.
Lemma 12**.**
Let be a set of templates over schema that have equality, and let . If coCSP() is DLog-rewritable on instances of girth exceeding , then it is DLog-rewritable.
Proof 6.2**.**
Assume that coCSP is DLog-rewritable on instances of girth exceeding and let be a concrete rewriting. We construct a Datalog program such that for any -instance , iff . Clearly, is then a rewriting of coCSP on instances of unrestricted girth.
We aim to construct such that it mimics the execution of on , despite being executed on . One challenge is that the domains of and are not identical. In , the IDB relations of need to be adapted to reflect this change of domain, and so do the rules. Let be the maximum arity of any relation in . Every IDB relation of gives rise to a set of IDB relations in . In fact, every position of can be replaced either with
- (1)
* positions, for some , reflecting the case that the position is filled with a constant where with -ary; or with* 2. (2)
* positions, for some , reflecting the case that the position is filled with a constant where and , with -ary and -ary.*
In Case 1, the positions store the constants in . The symbol and the number from also need to be stored, which is done as an annotation to the IDB relation. In Case 2, the first positions store the constants in while the latter positions store the constants in ; we additionally need to store the symbols and , the numbers and from and , and the number , which is again done by annotation of the IDB relation.
Let us make this formal. The IDB relations of take the form where is an IDB relation of and is a function from to
[TABLE]
such that if , then and if , then and . The arity of is where is the arity of if and is the sum of the arities of and if . In the construction of , we manipulate the rules of to account for this change in the IDB schema. We can assume w.l.o.g. that is closed under contractions of rules. Let
[TABLE]
be a rule in where are IDB and are EDB (possibly the distinguished relation), such that
- ()
every variable occurs at most once in .
Note that it might be possible to write a single rule from in the above form in more than one way because -atoms can be placed in the second line or in the third line; we then consider all possible ways. Informally, this choice corresponds to the decision whether the -atom is mapped to an -fact in that comes from an -fact in (Point 1 of the definition of ) or to a freh -fact (Point 2 of the definition of ). Also note that rules that do not satisfy () can be ignored since they never apply in .
Let be the variables in the rule, and let be such that the following conditions are satisfied:
- (1)
for each with , we have for all ; 2. (2)
for each , one of the following is true for some :
- (a)
* and ;* 2. (b)
* and ;* 3. (c)
* and .*
With each variable in , we associate a tuple of distinct variables. If is of the form , then the length of matches the arity of and is called a variable block. If is of the form , then the length of is the sum of the arities and of and ; the first variables in are then also called a variable block, and so are the last variables. Variable blocks will either be disjoint or identical. Identities are minimized such that the following conditions are satisfied:
- (I1)
if occurs in some , then ; 2. (I2)
if Case 2a applies to , then is identical to the first variable block in ; 3. (I3)
if Case 2b applies to , then is identical to the second variable block in ; 4. (I4)
if Case 2c applies to , then the first variable blocks of and are identical, and so are the second variable blocks.
Regarding (I1), note that cannot occur in more than one because of (), thus the condition can always be satisfied ‘without conflicts’. Then include in the rule
[TABLE]
such that
- (R1)
if the -th component in is , then ; 2. (R2)
* is obtained from by replacing each variable with ;* 3. (R3)
* contains the following atoms:*
- •
for each variable with of the form , an atom where the -th component of is and all other variables are distinct and fresh;
- •
for each variable with of the form , atoms where the -th component of and the -th component of is and all other variables are distinct and fresh.
As an example, consider the following rule in :
[TABLE]
where is EDB and IDB, and let , , and . Note that Case 2a applies to . We have and and thus obtain the following rule in :
[TABLE]
where the last line corresponds to above, and where and .
We have have to show that iff for any -instance . There is a correspondence between extensions of to the IDB relations in and extensions of to the IDB relations in . More precisely, a fact in an extension of represents a fact in an extension of as follows (and vice versa): for each , let be the subtuple of that starts at position and is of length (where, as before, is the arity of if and is the sum of the arities of and if ); the -th constant in is if and if and .
One essentially has to show that every application of a rule from in an extension of can be reproduced by an application of a rule from in the corresponding extension of , and vice versa. We only sketch the details. First let be an extension of to the IDB relations in and let be a rule in applicable in , and a homomorphism from to such that . Since is closed under contractions of rules, we can assume that is injective. Let
[TABLE]
such that all are IDB, all EDB, and an equality atom is included in the third line if and only if at least one of and is not of the form . Consequently, for all variables that occur in the second line, is of the form . One can now verify that Condition () is satisfied. Assume that this is not the case. The first case is that that there are distinct atoms and that share a variable . In , every constant of the form occurs in exactly one fact that only contains constants of the form . Thus, must take and to the same fact in . Since is injective, and must be identical which is a contradiction. The second case is that there is an atom in which a variable occurs more than once. This is in contradiction to being a homomorphism to .
Now define a map by putting if and if . It can be verified that the two conditions required of are satisfied. We thus obtain a corresponding rule in . It can be verified that applying this rule in the extension of corresponding to adds the fact that corresponds to .
Conversely, let be an extension of to the IDB relations in and let
[TABLE]
be a rule in and a homomorphism from the rule body to such that is not in . This rule was derived from a rule
[TABLE]
in and a map , the variables in the latter rule. We define a map from to , where is the extension of that corresponds to . Let . If and , then set . If and , then set . We argue that is a homomorphism from the body of the latter rule to . There are three cases:
- •
Consider an atom . Let . Then there is a corresponding atom in the former rule and thus . For each , let be the subtuple of that starts at position and is of length . Define the tuple by letting the -th constant be if and if and . By (R3), all constants in occur in the domain of . Moreover, . It thus remains to observe that , which follows from (R1) and (R2) and the definition of .
- •
Consider an atom . Let . Then the atom must also be in the former rule and thus , yielding . By Condition 1 imposed on , we have for each . Moreover, by (I1) we must have for each . Thus, the definition of yields and we are done.
- •
Consider an atom . We know that one of the Cases 2a to 2d apply to . We only treat the first case explicitly. Thus assume that and . By definition, and where . By (I2), is identical to the first variable block in and thus . By definition if , contains and we are done.
It can now be verified that the application of the latter rule adds to the fact that corresponds to .
DLog-rewritability of CSPs is NP-complete [Bar16, CL17] and it was observed in [BtCLW14] that this result lifts to generalized CSPs. It thus follows from Theorems 1 and 1 and Lemma 12 that DLog-rewritability of Boolean MDDLog programs that have equality is decidable in 2NExpTime. It is straightforward to verify that the 2NExpTime lower bound for Datalog-rewritability of MDDLog programs from [BL16] applies also to programs that have equality.
Theorem 13**.**
For Boolean MDDLog programs that have equality, Datalog-rewritability is 2NExpTime-complete.
Regarding MDDLog programs that do not have equality, the above yields a sound but possibly incomplete algorithm for deciding DLog-rewritability. Let us make this more precise. For an MDDLog program that does not have equality, we use to denote the extension of with the fresh EDB relation and the above rules. If has equality, then simply denotes .
Lemma 14**.**
For MDDLog programs , DLog-rewitability of implies DLog-rewritability of .
Lemma 14 follows from the trivial observation that any DLog-rewriting of can be converted into a DLog-rewriting of by dropping all rules that use the relation . It is an interesting open question whether the converse of Lemma 14 holds. Due to Lemma 14, a sound but possibly incomplete algorithm for unrestricted MDDLog programs can thus be formulated as follows: first replace with and then decide DLog-rewritability as per Theorem 13. We speculate that this algorithm is actually complete. In particular, for CSPs it is known that adding equality does preserve Datalog-rewritability [LZ07], and completeness of our algorithm is equivalent to an analogous result holding for MDDLog.
6.2. Canonical Datalog-Rewritings
For constructing actual DLog-rewritings instead of only deciding their existence, canonical Datalog programs play an important role. Feder and Vardi show that for every CSP template and all , one can construct an -Datalog program that is canonical for in the sense that if there is any -Datalog program which is equivalent to the complement of , then the canonical one is [FV98]. In this section, we show that there are similarly simple canonical Datalog programs for Boolean MDDLog. Note that the existence of canonical Datalog programs for MMSNP (and thus for Boolean MDDLog) is already known from [BD13]. However, the construction given there is more general and rather complex, proceeding via an infinite template and exploiting that it is -categorial. This makes it hard to analyze the exact structure and size of the resulting canonical programs. Here, we define canonical Datalog programs for Boolean MDDLog programs in a more elementary way. In contrast to the previous subsection, we do not assume that equality is available.
Let be a Boolean MDDLog program over EDB schema and with IDB relations from . Further let . We aim to construct a canonical -DLog program for . The most important properties of this program is that it is sound for and complete for on -instances of treewidth . We first convert into a DDLog program that is equivalent to on instances of treewidth and then construct the canonical program for rather than for . Unlike , the new program is not monadic. Informally, the canonical program simulates on -instances of treewidth proceeding in a bag-by-bag fashion. This is enabled by the additional non-monadic IDB relations introduced in which represent information that needs to be passed from bag to bag. We remark that the construction of is vaguely similar in spirit to the first step of converting an MDDLog program into simple form, c.f. Appendix A.1. To describe it, we need a preliminary.
With every MDDLog rule where is of treewidth and every -tree decomposition of , we associate a set of rewritten rules constructed as follows. Choose a root of the undirected tree , thus inducing a direction. We write if is a successor of in and use to denote . For all such that , introduce a fresh -ary IDB relation ; note that . Now, the set of rewritten rules contains one rule for each . For , the rule is
[TABLE]
where is the sub-disjunction of that contains all disjuncts with and is the restriction of to the atoms that contain only variables from . For , we include the same rule, but use only as the head. The set of rewritten rules associated with is obtained by taking the union of the rewritten rules associated with and any .
The DDLog program is constructed from as follows:
- (1)
first extend with all contractions of rules in ; 2. (2)
then delete all rules with not of treewidth and replace every rule with of treewidth with the rewritten rules associated with it.
To clarify the relation between and , we remark that it is possible to verify the following conditions; a detailed proof is omitted since these conditions are not going to be used in what follows:
- (I)
is sound for , that is, for all -instances , implies ; 2. (II)
is complete for on -instances of treewidth , that is, for all such instances , implies .
Note that is not complete for on instances of unrestricted treewidth. For example, if consists of only a goal rule whose rule body is a -clique (without reflexive loops), then returns false on the instance that consists of the same clique. {exa} Assume that contains the rule
[TABLE]
and consider the -tree decomposition of the rule body that consists of two nodes , successor of , with and . In , the rule is split into two rules
[TABLE]
Informally, these rules are supposed to cover homomorphisms from the body of the original rule to an -instance of treewidth such that the variables in are mapped to constants from some bag and variables from to constants from a neighboring bag. The IDB relation memorizes that we have already seen part of the rule body.
Let denote the additional IDB relations in . We now construct the canonical -DLog program for . Fix constants . For , we use to denote the set of all -instances with domain . The program uses -ary IDB relations , for all and all . It contains all rules , , that satisfy the following conditions:
- (1)
is over schema and contains at most variables; 2. (2)
for every extension of the -instance with -facts such that
- (a)
satisfies all rules of and does not contain and 2. (b)
for each , , there is an such that
there is an such that
where is viewed as an instance, denotes the result of replacing the constants in with the variables in (possibly resulting in identifications), and denotes the simultaneous restriction of to schema and constants .333We could additionally demand that is minimal so that Condition 2 is satisfied, but this is not strictly required. We also include in all rules of the form , of any arity from [math] to .
The intuition behind the construction of is as follows. When starting with an input -instance of treewidth and then chasing with , that is, exhaustively applying these rules in an unspecified order, then the resulting instance represents all extensions of to the relations in that satisfy all rules in and do not contain . A fact , , means that for every such there is an such that contains the facts in . Thus, the set in the index of should be read disjunctively. Note that then indicates that every extension of that satisfies all rules in must contain . The bodies of rules in are large enough to cover the restriction of to the constants from any single bag. This suffices only because we have transitioned from to before constructing .
The following are central properties of canonical DLog programs.
Lemma 15**.**
- (1)
* is sound for ;* 2. (2)
* is complete for on instances of treewidth .*
Proof 6.3**.**
*For Point 1, let be an -instance with . It suffices to show that . Let be the sequence of -instances obtained by chasing with . We first note that the following can be proved by induction on (and using the definition of ):
Claim. If , , then for every extension of to the relations in that satisfies all rules of and does not contain , there is an such that .
Since , there are and such that . By the claim, there is thus no extension of to the relations in that satisfies all rules of and does not contain . Consequently, .*
For Point 2, assume that and let be an -tree decomposition of , . Then there is an extension of to the IDB relations in such that all rules in are satisfied and contains no atom of the form .
We use to construct an extension of to the relations in . Choose a root of , thus inducing a direction on the undirected tree . For all and successors of , choose an ordering of the constants in and let denote the number of these constants. Let be all facts of this form in . By construction of , there must be at least one such fact, and the fact must also be in . Thus, we can associate with a unique minimal set so that .
The construction of proceeds top down over . At all points, we maintain the invariant that
- ()
for all nodes and successors of , there is an such that .
The construction of starts at the root of . There must be an extension of with -facts such that
- (i)
* satisfies all rules of and does not contain * 2. (ii)
for each , , there is an such that
as, otherwise, a rule of would create an atom of the form in . Start with putting . Note that for each successor of , () is satisfied because of Point (ii) and since .
We proceed top-down over . Assume that is a successor of and has already been treated. There must be an extension of with -facts such that
- (i)
* satisfies all rules of and does not contain ,* 2. (ii)
, and 3. (iii)
for each , , there is an such that
as, otherwise, because of () a rule of would create an atom of the form in with , in contradiction to being minimal with . Put . It can again be verified that () is satisfied.
By construction, the instance does not contain and is also a tree decomposition of , that is, each EDB atom and each IDB atom of falls within some bag . We aim to show that satisfies all rules of , thus as required.
Let be the result of closing under contractions of rules and recall that is obtained from by dropping and rewriting rules. Let be a rule in and let be a homomorphism from its body to . We have to show that one of the disjuncts in the head of is satisfied under . contains the rule obtained from by identifying all variables such that . It clearly suffices to show that one of the disjuncts in the head of is satisfied under . Note that is an injective homomorphism from the body of to which implies that is of treewidth . Moreover, we can read off an -tree decomposition of from and .
In , and are rewritten into rules such that no uses a fresh IDB relation from the head of any with (that is, an IDB relation that does not occur in , of arity at most ). Let be where are disjuncts that also occur in the head of and is a fresh IDB relation introduced by the rewriting in the case that and if (by which we mean: there is no disjunct in the latter case). One can show by induction on that for ,
- (1)
* for some and or* 2. (2)
.
To see this, assume that Point 1 is not satisfied for some . Then Point 2 holds for all . By choice of , there is a such that . Thus is a homomorphism from to , and consequently there is a disjunct in the head of such that . This implies that one of Points 1 or 2 is satisfied for .
Note that Point 2 cannot hold for because the disjunct is not present in . Thus there is an such that for some . Since occurs in the head of , we are done.
We are now ready to show that the canonical program is indeed canonical, as detailed by the following theorem. For two Boolean DLog programs over the same EDB schema , we write if for every -instance , implies .
Theorem 16**.**
Let be a Boolean MDDLog program, , and the canonical -DLog program for . Then
- (1)
* for every -DLog program that is sound for ;* 2. (2)
* is -DLog-rewritable iff is a DLog-rewriting of .*
Proof 6.4**.**
Let be the EDB schema of .
For Point 1, let be an -DLog program that is sound for and let be an -instance with . From the proof tree for from and , we can construct an -instance of treewidth such that and . It suffices to show that , which is easy: from , we obtain and Point 2 of Lemma 15 yields .
The “if” direction of Point 2 is trivial. For the “only if” direction, assume that is -DLog-rewritable and let be a concrete rewriting. We have to show that is sound and complete for . The former is Point 1 of Lemma 15. For the latter, we get since is a rewriting of and from Point 1, thus as required.
Note that by Point 1 of Theorem 16, the canonical -DLog program for an MDDLog program is interesting even if is not rewritable into an -DLog program as it is the strongest sound -DLog approximation of .
7. Non-Boolean MDDLog Programs
We lift the results about the complexity of rewritability, about canonical DLog programs, and about the shape of rewritings and obstructions from the case of Boolean MDDLog programs to the non-Boolean case. For all of this, a certain extension of -Datalog programs with parameters plays a central role. We thus begin by introducing these extended programs.
7.1. Deciding Rewritability
An -Datalog program with parameters is an -ary -Datalog program in which all IDBs have arity at least and where in every rule, all IDB atoms agree on the variables used in the last positions (both in rule bodies and heads and including the IDB). The last positions of IDBs are called parameter positions. To visually separate the parameter positions from the non-distinguished positions, we use “” as a delimiter to replace the usual comma, writing e.g.
[TABLE]
where are IDB, is EDB, and there are three parameter positions. Note that, by definition, all variable positions in atoms are parameter positions. {exa}
The following is an MDLog program with one parameter that returns all constants which are on an -cycle, a binary EDB relation:
[TABLE]
Parameters in Datalog programs play a similar role as parameters to least fixed-point operators in FO(LFP), see for example [BBV16] and references therein. The program in Example 7.1 is not definable in MDLog without parameters, which shows that adding parameters increases expressive power. Although -DLog programs with parameters are -DLog programs, one should think of them as a mild generalization of -programs.
A DLog program is an -DLog program if it is an -DLog program for some . To lift decidability and complexity results from the Boolean to the non-Boolean case, we show that rewritability of an -ary MDDLog program into -DLog with parameters can be reduced to rewritability of a Boolean MDDLog program into -DLog (without parameters). We believe that Datalog with parameters is a natural rewriting target for non-Boolean MDDLog programs since, in a sense, the parameters reflect the special role of the constants from the input instance that are returned as an answer. Note that the case is about UCQ-rewritability (and thus FO-rewritability) because [math]-DLog programs (with and without parameters) are an alternative presentation of UCQs. The reduction proceeds in two steps, described by subsequent Lemmas 17 and 18. {exa}
The following MDDLog program is rewritable into the MDLog program with parameters from Example 7.1, but not into an MDLog program without parameters:
[TABLE]
The following lemma shows that, by introducing constants, we can reduce the rewritability of non-Boolean MDDLog programs into Datalog with parameters to the rewritability of Boolean MDDLog programs with constants into Datalog with constants. Note that the presence of constants in an -DLog program is not reflected in the values of and . We will show in a second step that the rewritability of Boolean MDDLog programs with constants into Datalog with constants can be reduced to the rewritability of Boolean MDDLog programs without constants into Datalog without constants.
The diameter of an -DLog program with parameters is and the diameter of a DLog program with constants is defined as for DLog programs without constants, that is, only variables contribute to the diameter, but constants do not. The rule size of an MDDLog program is the maximum number of variable occurrences in a rule body.
Lemma 17**.**
Given an -ary MDDLog program , one can construct Boolean MDDLog programs with constants over the same EDB schema such that for all ,
- (1)
* is rewritable into an -DLog program with parameters iff each of is rewritable into an -DLog program with constants;* 2. (2)
* and the size (resp. diameter, rule size) of each program is bounded by the size (resp. diameter, rule size) of .*
The construction takes time polynomial in the size of .
Proof 7.1**.**
Let be an -ary MDDLog program over EDB schema . Fix a set of constants. For each , we construct from a Boolean MDDLog program such that for any , is -DLog rewritable iff all programs are.
Let . Given two -tuples of terms (constants or variables) and , we write if implies for . We write when . The program is obtained from as follows:
- •
replace every rule with by ;
- •
drop every rule with .
*Note that the non-goal rules in are identical to those in . By converting proof trees for into proof trees for and vice versa, one can show the following.
Claim. For all -instances and with , .
We show that is rewritable into an -DLog program with parameters iff all of the constructed programs are rewritable into an -DLog program with constants.*
Let be an -DLog program with parameters that is a rewriting of . For each , let be the Boolean -DLog program with constants obtained from as follows:
- •
replace every rule with (and where might be ) by , where is the result of replacing in each variable with ;
- •
drop every rule with .
By translating proof trees, it can be shown that () iff . It is now easy to show that is a rewriting of : for every -instance , iff (by the claim) iff (since is a rewriting of ) iff (by ()).
Conversely, for all let be a Boolean -DLog program with constants that is a rewriting of . We construct an -DLog program with parameters as follows. For each , fix a tuple of fresh variables such that . Let be the -DLog program with parameters obtained from as follows:
- (i)
replace each with ; 2. (ii)
replace each non-goal IDB atom with the atom (both in rule bodies and heads), a fresh IDB relation; 3. (iii)
replace with .
Then is defined as the union of all programs . We first argue that for every , -instance , and with ,
- (1)
* implies and* 2. (2)
, with , implies .
Point 1 can be proved by showing that, from a proof tree of from and , one can construct a proof tree of from and . For Point 2, assume with . Then can again be shown by manipulating proof trees. It can be verified that, by construction, . Consequently and since is a rewriting of , implies for all , that is, is contained in in the sense of query containment. Thus in particular , as required.
It remains to show that is a rewriting for . First assume that . Choose some with . Then by the claim and thus since is a rewriting of . Point 1 above yields .
Now assume that . Then by construction of , there is a such that and . To see this, note in particular that the different programs do not share any IDBs and thus do not interact in . Choose a with . From Point 2 above, we obtain which yields . This implies by the claim.
We next show that constants can be eliminated from Boolean programs.
Lemma 18**.**
Given a Boolean MDDLog program with constants over EDB schema , one can construct a Boolean MDDLog program over an EDB schema such that
- (1)
* is rewritable into -DLog with constants iff is rewritable into -DLog, for any ;* 2. (2)
If is of size and diameter , then the size of is ; moreover, the diameter of is bounded by the rule size of .
The construction takes time polynomial in the size of .
Proof 7.2**.**
Let be a Boolean MDDLog program over EDB schema that contains constants . The program will be over EDB schema where are fresh monadic relation symbols. contains all rules that can be obtained from a rule in by choosing a partial function that maps terms (variables or constants) in to elements of such that for each constant and then, for each term with ,
- (1)
replacing each occurrence of in the body of with a fresh variable and adding , and 2. (2)
replacing each occurrence of in the head of with one of the fresh variables introduced for in Step 1.
Additionally, contains the rule , for .
*Note that the rewriting presented above, which we call dejoining since it introduces different variables for each occurence of a term in a rule body, can be applied not only to MDDLog programs, but also to MDLog programs. Before we proceed, we make a basic observation about dejoining and its connection to a certain quotient construction. Let be an MDDLog program or an MDLog program, with constants , and let be the result of dejoining . Let be an -instance such that are disjoint whenever and which does not contain the constants . The quotient of is the -instance obtained from by replacing every with by the constant (which also results in the identification of elements in the active domain) and removing all atoms involving one of the relations. By converting proof trees of from into proof trees of from and vice versa, one can show the following.
Claim. iff .
We now show that is rewritable into -DLog iff is.*
First let be an -DLog rewriting of . Let be obtained from by dejoining all rules and adding the rule for . Clearly, is an -DLog program. We argue that is a rewriting of . Let be an -instance. W.l.o.g., we can assume that does not contain . If are not disjoint for some , then and . Otherwise, let be the quotient of . We have iff (by the claim) iff ( is rewriting of ) iff (again by the claim).
Let be an -DLog rewriting of . Let be the program constructed from by removing all rules that contain atoms of the form and with and replacing all variables that occur in a rule body in atoms of the form with and removing all -atoms from such rules. Clearly, is an -DLog program (with constants ). We argue that is a rewriting of . Let be an -instance that w.l.o.g. does not contain and let . Note that is the quotient of . Then iff (by the claim) iff ( is rewriting of ) iff (by construction of ).
We are now ready to lift the complexity results from Theorems 6 and 13 to the non-Boolean case, by putting them together with Lemmas 17 and 18.
Theorem 19**.**
For -ary MDDLog programs,
- (1)
FO-rewritability (equivalently: UCQ-rewritability) is 2NExpTime*-complete;* 2. (2)
rewritability into MDLog with parameters is in 3ExpTime* (and 2NExpTime-hard);* 3. (3)
DLog-rewritability is 2NExpTime*-complete for programs that have equality.*
Proof 7.3**.**
We remind that the upper bounds for rewritability of Boolean MDDLog programs stated in Theorems 6 and 13 are obtained by a (generalized) CSP and then deciding the rewritability of (the complement of) that CSP. One can trace the blowups stated in Lemmas 17 and 18 as well as in Theorems 1 and 1 to verify that the constructed CSP does not become significantly larger in the non-Boolean case, that is, it still satisfy the bounds stated in Point 3 of Proposition 1. Thus, we obtain the same upper bounds as in the Boolean case. Regarding Point 1, we additionally recall that Proposition 1 also covers non-Boolean MDDLog programs and thus it suffices to consider UCQ-rewritability. Regarding Point 3, we note that it can be verified that the constructions in the proofs of Lemmas 17 and 18 preserve the property of having equality.
In view of Point 2, we remark (once more) that for non-Boolean MDDLog programs , MDLog with parameters is in a sense a more natural target for rewriting than MDLog without parameters. The intuitive reason is that positions in the answer to can be thought of as constants, and constants correspond to parameters. To make this a bit more precise, consider the grounding of obtained by replacing, in every goal rule, each variable that occurs in the head by a constant. In contrast to the standard database setup (and in contrast to the proof of Lemma 17), we mean here constants that are interpreted according to the standard FO semantics, that is, different constants can denote the same element of an instance. When looking for an MDLog-rewriting of , it is clearly very natural to admit the constants from also in the rewriting. Now, one can verify that any such rewriting can be translated in a straightforward way into a rewriting of into MDLog with parameters, and vice versa.
We further note that MDLog with parameters enjoys similarly nice properties as standard MDLog. For example, containment is decidable. This follows from [RK13, BKR15] where generalizations of MDLog with parameters are studied, the actual parameters being represented by constants.
We also remark that Theorem 19 remains true when we admit constants in MDDLog programs. In fact, the proof of Lemma 17 goes through also when the original MDDLog program contains constants, and both the original and the newly introduced constants can then be removed by Lemma 18.
7.2. Canonical Datalog-Rewritings
We now turn our attention to canonical DLog-rewritings for non-Boolean MDDLog programs. Let be an -ary MDDLog program. We associate with a canonical -DLog program with parameters, for any . The construction is a refinement of the one from the Boolean case.
We start with some preliminaries. An -marked instance is an instance endowed with (not necessarily distinct) distinguished elements . An -tree decomposition with parameters of an -marked instance is an -tree decomposition of , the number of distinct constants in , in which every bag contains all constants from . An -marked instance has treewidth with parameters if it admits an -tree decomposition with parameters.
We first convert into a DDLog program that is equivalent to on instances of bounded treewidth. The construction is identical to the Boolean case (first variable identification, then rewriting) except that
- (1)
we use treewidth in place of treewidth ; consequently, the arity of the freshly introduced IDB relations may also be up to ; 2. (2)
for rules, all head variables must occur in the root bag of the tree decomposition (they can then be treated in the same way as a Boolean rule despite the -ary head relation).
It can be verified that is sound for and that it is complete for on -marked instances of treewidth with parameters in the sense that, for all such instances , implies . is not guaranteed to be complete for answers other than because of the way we treat goal rules in Point 2 above, for example when contains a rule of the form .
Let denote the additional IDB relations in the resulting program . We now construct the canonical -DLog program with parameters . Fix constants and let denote the set of all -instances with domain . The program uses -ary IDB relations , for all and all . It contains all rules , , that satisfy the following conditions:
- (1)
contains at most variables; 2. (2)
in every extension of the -instance with -facts such that
- (a)
satisfies all rules of and does not contain and 2. (b)
for each , , there is an such that
there is an such that
We also include all rules of the form . This finishes the construction of . It is straightforward to verify that is sound for . It is complete in the same sense as .
Lemma 20**.**
* is sound for . It is complete for on -marked instances of treewidth with parameters in the sense that for any such instance , implies .*
The proof of Lemma 20 is similar to that of Lemma 15, details are omitted. In analogy with Theorem 16, we can then obtain the following result about canonical DLog programs.
Theorem 21**.**
Let be an -ary MDDLog program, , and the canonical -DLog program with parameters associated with . Then
- (1)
* for every -DLog program that is sound for ;* 2. (2)
* is rewritable into -DLog with parameters iff is a rewriting of .*
Note that, as a consequence of Theorem 21, an -ary MDDLog program is DLog-rewritable (in the standard sense, without parameters) iff the canonical -DLog program with parameters is a rewriting, for some . In a sense, this exactly parallels the behaviour of canonical DLog programs in the Boolean case. As an important consequence, the reductions presented in Lemmas 17 and 18 show that if DLog-rewritability of Boolean programs turns out to be decidable (without assuming equality), then the same is true for DLog-rewritability of non-Boolean programs. Theorems 6 and 13
7.3. Shape of Rewritings and Obstructions
We now analyze the shape of rewritings of non-Boolean MDDLog programs. An -tree decomposition with parameters of an -ary CQ is an -tree decomposition of in which every bag contains all answer variables of . The treewidth with parameters of an -ary CQ is now defined in the expected way.
Theorem 22**.**
Let be an -ary MDDLog program of diameter . Then
- (1)
if is FO-rewritable, then it has a UCQ-rewriting in which each CQ has treewidth with parameters; 2. (2)
if is rewritable into MDLog with parameters, then it has an MDLog-rewriting with parameters of diameter .
Note that Theorem 22 is immediate from Theorem 7 and Lemmas 17 and 18 when denotes the rule size of instead of its diameter. To get the improved version, one needs to carefully trace the construction of rewritings, starting with rewritings for the CSPs ultimately constructed and then through the proofs of Lemmas 2, 18, and 17. In particular, the constructions in Lemmas 2 and 18 interplay in a subtle way that can be exploited to improve the bound. Details are given in Appendix C.
As in the Boolean case, rewritings are closely related to obstructions. We define obstruction sets for MMSNP formulas with free variables and summarize the results that we obtain for them. A set of marked obstructions for an MMSNP formula with free variables over schema is a set of -marked instances over the same schema such that for any -instance , we have iff for some , there is a homomorphism from to with . We obtain the following corollary from Point 1 of Theorem 22 in exactly the same way in which Corollary 8 is obtained from Point 1 of Theorem 7.
Corollary 23**.**
For every MMSNP formula with free variables, the following are equivalent:
- (1)
* is FO-rewritable;* 2. (2)
* has a finite marked obstruction set;* 3. (3)
* has a finite set of finite marked obstructions of treewidth with parameters.*
It is interesting to note that this result can be viewed as a generalization of the characterization of obstruction sets for CSP templates with constants in terms of ‘c-acyclicity’ in [AtCKT11]; our parameters correspond to constants in that paper. We now turn to MDLog-rewritability.
Proposition 24**.**
Let be an MMSNP formula of diameter with free variables. Then is rewritable into an MDLog program with parameters iff has a set of marked obstructions (equivalently: finite marked obstructions) that are of treewidth with parameters.
Proof 7.4**.**
The “only if” direction is a consequence of Point 2 of Theorem 22 and the fact that, for any MDLog program with parameters of diameter over EDB schema , a proof tree for from an -instance and gives rise to a finite -marked -instance of treewidth with parameters that satisfies . The “if” direction is a consequence of the fact that the canonical -DLog program with parameters associated with viewed as an MDDLog program is complete on inputs of treewidth with parameters in the sense of Lemma 20.
As an illustration, it might be interesting to reconsider Example 7.1. The unary MDDLog program shown there is the negation of a unary MMSNP formula that has as a set of marked obstructions the set of all -cycles on which one element is the marked element. Each of these obstructions has treewidth with one parameter, but not treewidth in the strict sense.
8. Ontology-Mediated Queries
While the results on disjunctive Datalog and on MMSNP obtained in the previous sections are interesting in their own right, our premier aim is to study fundamental question of rewritability in the context of ontology-mediated queries (OMQs). Such questions have received a lot of interest in the OMQ context, see for example [BtCLW14, BHLW16, LS17] and references therein. In particular, we settle an open question from [BtCLW14] by showing that in the OMQ language , introduced in detail below, FO-rewritability is decidable and 2NExpTime-complete. In what follows, we first introduce several prominent description logics to serve as ontology languages and, based on that, ontology-mediated queries. We then show how the results from the previous sections can be used to obtain results about ontology-mediated queries.
8.1. Preliminaries
In description logics, ontologies are defined by so-called TBoxes. A TBox, in turn, is a set of inclusions (that is, logical implications) between concepts (that is, logical formulas), and possibly also additional kinds of statements. Each description logic is determined by the constructors that are available to build up concepts and by the statements that are allowed in TBoxes. Here, we introduce the widely known description logics , , and , listed in the order of increasing expressive power. We refer the reader to [BHLS17] for a more thorough introduction to DLs.
An -concept is formed according to the syntax rule
[TABLE]
where ranges over a fixed countably infinite set of concept names and over a fixed countably infinite set of role names. An -concept is an -concept in which the constructors and are not used. An -TBox (resp. -TBox) is a finite set of concept inclusions , and -concepts (resp. -concepts). While extends with additional concept constructors, extends with additional types of TBox statements. There is thus no need to define -concepts as these are simply -concepts. A role is either a role name or an expression with a role name. A -TBox is a finite set of
- •
concept inclusions , and -concepts,
- •
role inclusions , and roles, and
- •
transitivity statements , a role name.
DL semantics is given in terms of interpretations. An interpretation takes that form where is a non-empty set called the domain and is the interpretation function which maps each concept name to a subset and each role name to a binary relation . Note that an interpretation is simply a notational variant of a relational FO-structure that interprets only unary and binary relations. The interpretation function is extended to compound concepts in the standard way, as given in Figure 2.
An intepretation is a model of a TBox if it satisfies all statements in , that is,
- •
implies ;
- •
implies ;
- •
implies that is transitive.
For roles , we write if every model of satisfies .
In description logic, data is typically stored in so-called ABoxes. For uniformity with MDDLog, we use instances instead, identifying unary relations with concept names, binary relations with role names, and disallowing relations of any other arity. An interpretation is a model of an instance if implies and implies . We say that an instance is consistent with a TBox if and have a joint model.
An ontology-mediated query (OMQ) over a schema is a triple where is a TBox formulated in a description logic and is a query. The TBox can introduce symbols that are not in , which allows it to enrich the schema available for formulating the query . In fact, can use symbols from , additional symbols from , and also completely fresh symbols (which is useful only in very rare cases). As the TBox language, we may use any of the description logics introduced above. Since all these logics admit only unary and binary relations, we assume that these are the only allowed arities in schemas throughout Section 8. As the actual query language, we use UCQs and CQs. The OMQ languages that these choices give rise to are denoted with , , and so on. In the actual query, we generally disallow the use of role names such that for some role name , and . In fact, admitting such roles in the query poses serious additional complications, which are outside the scope of this paper; see e.g. [BEL*+*10, GPT13]. To make the restriction explicit, we add a superscript to OMQ languages when the DL used permits transitivity statements in the TBox, such as in .
The semantics of an OMQ is given in terms of certain answers. Let be an -instance and a tuple of constants from . We write and call a certain answer to on if for all models of and , we have . The latter denotes satisfaction of in in the usual sense of first-order logic. {exa} Let be the following OMQ, formulated in :
[TABLE]
The TBox describes the risk of somebody having medullary thyroid cancer (MTC) in the presence of an abnormal calcitonin test (CTest). While abnormal calcitonin levels are a marker for MTC, there can also be false positives, for example due to smoking (Line 1 of ). However, in the presence of a high risk for the genetic syndrome MEN2, high calcitonin levels immediately raise an MTC suspicion (Line ). Pheochromocitoma patients (PCCPatient) have a high MEN2 risk (Line ). As MEN2 is caused by a genetic mutation, the risk carries within families (Line ). On the -instance
[TABLE]
the only certain answer to is .
An OMQ is FO-rewritable if there is an FO query over schema (and possibly involving equality), called an FO-rewriting of , such that for all -instances and , we have iff . Other notions of rewritability such as UCQ-rewritability and MDLog-rewritability are defined accordingly.
Note that the TBox can be inconsistent with the input instance , that is, there could be no joint model of and . It can thus be a sensible alternative to work with consistent FO-rewritability, considering only -instances that are consistent w.r.t. . This can then be complemented with rewritability of inconsistency for , that is, rewritability of the Boolean OMQ , a fresh concept name, which is true on an -instance iff is inconsistent with . It is not hard to prove, though, that consistent -rewritability can be reduced to -rewritability in polynomial time for all OMQ languages condidered in this paper and all ; see the corresponding proof for query containment in [BL16]. Moreover, rewritability of consistency was studied in [BtCLW14] and shown to be NExpTime-complete for all OMQ languages considered in this paper.
8.2. Rewritability of OMQs
We now lift the results from earlier sections to OMQs. There is a known equivalence-preserving translation from the relevant OMQ languages to MDDLog, but it involves a double exponential blowup [BtCLW14] that most likely is unavoidable.444It was shown in [BtCLW14] that a single exponential blowup is unavoidable. Whether the blowup has to be double exponential is an open problem. We refine this translation and carefully trace the parameters in which the blowup occurs to show that, despite these blowups, the complexity of the relevant problems does not increase. The following is our main result concerning OMQs.
Theorem 25**.**
In all OMQ languages between and , as well as between and ,
- (1)
FO-rewritability (equivalently: UCQ-rewritability) is 2NExpTime*-complete; in fact, there is an algorithm which, given an OMQ , decides in time whether is FO-rewritable;* 2. (2)
MDLog-rewritability is in 3ExpTime* (and 2NExpTime-hard); in fact, there is an algorithm which, given an OMQ , decides in time whether is MDLog-rewritable*
where and are the size of and and is a polynomial.
Note that the runtime for deciding FO-rewritability stated in Theorem 25 is double exponential only in the size of the actual query (which tends to be very small) while it is only single exponential in the size of the TBox (which can become large) and similarly for MDLog-rewritability, only one exponential higher.
The lower bounds in Theorem 25 are from [BL16]. To prove the upper bounds, we first give a refined translation from OMQs to MDDLog. A proof is provided in Appendix D.
Theorem 26**.**
For every OMQ from , one can construct an equivalent MDDLog program such that
- (1)
the size of is bounded by ; 2. (2)
the IDB schema of is of size at most ; 3. (3)
the rule size of is bounded by
where and are the size of and and is a polynomial. The construction takes time polynomial in the size of .
Let be an OMQ from . Instead of deciding FO- or MDLog-rewritability of , we can decide the same problem for the MDDLog program delivered by Theorem 26. The bounds stated in Theorem 26, Lemmas 17 and 18, and Theorems 1 and 1, though, only guarantee that we obtain a CSP template with 3-exponentially many elements, which does not yield 2NExpTime upper bounds. However, it is possible to combine the construction underlying Theorem 26 with those underlying Lemmas 17 and 18 and Theorem 1 to obtain the following.
Lemma 27**.**
Given an OMQ from with of size and of size , one can construct a simple MDDLog program over an aggregation EDB schema such that
- (1)
* is -rewritable iff is -rewritable for every ;* 2. (2)
the size of and the cardinality of are bounded by and the arity of relations in is bounded by ; 3. (3)
the IDB schema of is of size
where is a polynomial. The construction takes time polynomial in the size of .
A proof is provided in Appendix E. Constructing a CSP template from this refined simple program and applying the decision procedures for rewritability of CSP templates, we obtain the upper bounds stated in Theorem 1.
We remark that it is not possible to extend Theorem 25 to description logics with functional roles or number restrictions since, in such DLs, FO-rewritability of OMQs is undecidable [BtCLW14]. The proof can be adapted to MDLog-rewritability.
The results about the shape of rewritings stated in Theorem 22 (of course) also apply to the OMQ case. Note that, in Points 1 and 2 of that theorem, we can then replace with . Moreover, the canonical DLog programs introduced for MDDLog in Section 7 can also be utilized for OMQs via the translation underlying the proof of Theorem 26.
Regarding Datalog-rewritability of OMQs, we obtain a potentially incomplete decision procedure by combining Theorem 26 with Lemmas 17 and 18 and the algorithm from Section 6. It is possible to define a class of OMQs that have equality and for which this procedure is complete. Roughly, needs to contain a relation and enforces that for all models of and all , and satisfy exactly the same subconcepts of and exactly the same tree contractions of and then taking a subquery. We refrain from working out the details.
9. Dichotomy and Deciding PTime Query Evaluation
There was a recent breakthrough in research on CSPs, independently achieved by Bulatov and by Zhuk, who have proved the long standing Feder-Vardi conjecture thus establishing a dichotomy between PTime and NP for the complexity of CSPs [Bul17, Zhu17]. Together with results by Chen and Larose [CL17], this also implies that it is decidable and NP-complete whether the CSP defined by a given template has PTime complexity. We observe that, together with the translations given in this paper, we obtain several interesting results on MMSNP, MDDLog, and OMQs.
In particular, we consider the (data) complexity of query evaluation, which is defined in the expected way. For example, each OMQ gives rise to the following query evaluation problem: given an -instance and a tuple , decide whether . This problem is guaranteed to be in coNP when is from any of the OMQ languages studied in this paper [BtCLW14], but of course there are also OMQs for which it is in PTime or even simpler and from a practical perspective it is very important to understand the exact complexity of evaluating the concrete queries that are relevant for the application at hand. The definition of query evaluation is analogous for MDDLog and for MMSNP; note that MMSNP only gives rise to Boolean queries and that there is an NP upper bound for the complexity rather than a coNP one.
The question of PTime query evaluation also comes with an associated ‘meta problem’: given an OMQ (or a query from some other relevant language), decide whether admits PTime query evaluation. We remark that the data complexity of OMQs as well as the associated meta problem and dichotomy questions have received significant interest [BtCLW14, LW17, LS17, HLW17, LSW15, LSW13, CDL*+*13]. The following theorem summarizes our results regarding the complexity of query evaluation.
Theorem 28**.**
In MDDLog and all OMQ languages between and , as well as between and ,
- (1)
there is a dichotomy in the complexity of query evaluation between PTime and coNP; 2. (2)
deciding PTime-query evaluation is 2NExpTime*-complete.*
The same holds for MMSNP, with coNP replaced by NP in Point (1).
Proof 9.1**.**
For MMSNP, it is well-known that there is a dichotomy for query evaluation between PTime and NP iff there is such a dichotomy for the complexity of CSPs. This gives the MMSNP version of (1). In fact, it is even known that the constructions from the proofs of Theorems 1 and 1, which transform an MMSNP sentence into a CSP, preserve complexity up to polynomial time reductions [FV98, Kun13]. We thus obtain the upper bound in the MMSNP version of (2). The lower bound is obtained from the reduction used in [BL16] to show that the Datalog-rewritability of (the complement of) MMSNP sentences is 2NExpTime-hard. The proof is by reduction of a tiling problem. Given such a tiling problem, one constructs an MMSNP sentence such that is FO-rewritable if there is a tiling and equivalent to 3-colorability (thus not Datalog-rewritable and NP-hard) otherwise. Clearly, such a reduction also yields 2NExpTime-hardness of PTime query evaluation.
For the cases of MDDLog and for OMQs, lower bounds are obtained along the same lines, that is, by observing that the lower bound constructions from [BL16] are directly applicable. For the upper bounds and the dichotomies, we first observe that the construction in the proofs of Lemma 18 preserves complexity up to polynomial time reductions (which is implicit in the proof) and recall that the translation in Lemma 27 is even equivalence-preserving. It thus remains to deal with Lemma 17. There, an -ary MDDLog program is translated into a family of Boolean MDDLog programs (with constants) , where is fixed set of constants. The claim formulated in the proof of Lemma 17 provides
- (1)
a polynomial time reduction of evaluating to evaluating the programs : given an instance and an , to decide whether choose with and check whether ; 2. (2)
for each , a polynomial time reduction of evaluating to evaluating : given an instance , to decide whether check whether .
It follows that can be evaluated in PTime iff all of the programs can. This is enough to transfer the upper bound for deciding PTime query evaluation. For dichotomy, we additionally need that if one of the programs is coNP-hard, then so is , which follows from (2).
10. Discussion
We have clarified the decidability status and computational complexity of FO- and MDLog-rewritability in MMSNP, MDDLog, and various OMQ languages based on expressive description logics and conjunctive queries, and we also made several interesting observations regarding dichotomies and the decidability and complexity of PTime query evaluation. For Datalog-rewritability, we were only able to obtain partial results, namely a sound algorithm that is complete on a certain class of inputs and potentially incomplete in general. This raises several natural questions: is our algorithm actually complete in general? Does an analogue of Lemma 4 (that is, rewritability on instances of high girth implies rewritability) hold for Datalog as a target language? What is the complexity of deciding Datalog-rewritability in the afore-mentioned languages? From an OMQ perspective, it would also be important to work towards more practical approaches for computing (FO-, MDLog-, and DLog-) rewritings. Given the high computational complexities involved, such approaches might have to be incomplete to be practically feasible. However, the degree/nature of incompleteness should then be characterized, and we expect the results in this paper to be helpful in such an endeavour.
Appendix A Translating Boolean MDDLog to generalized CSP
A.1. From MDDLog to Simple MDDLog
Let be a Boolean MDDLog program over schema and of diameter . We first construct from an equivalent Boolean MDDLog program such that the following conditions are satisfied:
- (i)
all rule bodies are biconnected, that is, when any single variable is removed from the body (by deleting all atoms that contain it), then the resulting rule body is still connected; 2. (ii)
if occurs in a rule body with EDB, then the body contains no other EDB atoms.
A good way to think about what is achieved in this first step is that, when the resulting program is evaluated on an instance of bounded treewidth, then it suffices to map the rule bodies to individual bags while it is never necessary to cross ‘bag boundaries’.
To construct , we first extend with all contractions of rules in ; we will refer to this step as the collapsing step. We then split up rules that are not biconnected into multiple rules by exhaustively executing the following rewriting steps:
- •
replace every rule where and share exactly one variable but both contain also other variables with the rules and , where is a fresh monadic IDB relation and is the restriction of to atoms that are nullary or contain a variable from , ;
- •
replace every rule where and share no variables and are both non-empty with the rules and , where is a fresh nullary IDB relation and the are as above;
- •
replace every rule where is an EDB relation and contains at least one EDB atom and the variable , with the rules and , where is a fresh monadic IDB relation.
It is easy to see that the resulting program is equivalent to the original program and that all satisfies Conditions (i) and (ii) above.
We next construct from the desired simple program by replacing, in every rule, the EDB atoms in the rule body with a single EDB atom that represents the conjunction of all atoms replaced. We thus introduce fresh EDB relations that represent conjunctions of old EDB relations. Note that there can be implications between the new EDB relations that we will have to take care of in the construction of .
Let denote the set of CQs that can be obtained from a rule body in by consistently renaming variables, using only variables that occur in . Let be the IDB schema of . For every , we write to denote the restriction of to -atoms, and likewise for and IDB atoms. The EDB schema of consists of the relations , , whose arity is the number of variables in (which, by construction of , is identical to the number of variables in ). The program consists of the following rules:
whenever is a rule in , , and , then contains the rule
The case where is identical to corresponds to adapting rules in to the new EDB signature and the other cases take care of implications between EDB relations. {exa}
Assume that contains the following rules, where and are EDB relations:
[TABLE]
A new ternary EDB relation is introduced for the EDB body atoms of the lower rule, where , and a new ternary EDB relation is introduced for the upper rule, . In , the rules are replaced with
[TABLE]
Note that and thus logically implies , which results in two copies of the goal rule to be generated.
Proof details for the following lemma can be found in [BL16]. Recall that is the diameter of the original MDDLog program .
Lemma 29**.**
- (1)
If is an -instance and the corresponding -instance, then iff ; 2. (2)
If is an -instance and the corresponding -instance, then
- (a)
* implies ;* 2. (b)
* implies if the girth of exceeds .*
The following example demonstrates why the restriction to high girth instances in Point 2b of Lemma 29 is necessary, see also Example 3. {exa}
Consider the programs and from Example A.1. Take the -instance defined by
[TABLE]
It can be verified that . But the corresponding -instance is such that derives the IDB relation at , , and , and additionally contains the facts
[TABLE]
which are not covered by any fact in . Thus clearly .
A.2. From Simple MDDLog to Generalized CSP
Let be a simple MDDLog program over EDB schema and with IDB schema . For , an i-type is a set of relation symbols from of arity at most that does not contain and that satisfies all rules in which use only IDB relations of arity at most and do not involve any EDB relations.
We build a template for each 0-type . The elements of are exactly the 1-types that agree with on nullary IDB relations. consists of the following facts:
- (1)
for each nullary . 2. (2)
for each 1-type and each monadic ; 3. (3)
for each relation and all 1-types such that does not contain a rule
[TABLE]
such that for , and .
The following was observed in [FV98].
Lemma 30**.**
For any -instance , we have iff for all [math]-types .
Appendix B MDLog-Rewritability of Generalized CSP
In the proof of the subsequent theorem, we use obstructions of CSPs, which are defined in Section 5.
Theorem 31**.**
*Given a finite set of templates , it can be decided in ExpTime whether coCSP is MDLog-rewritable. *
Proof B.1**.**
Consider coCSP over schema with . We start with observing that we can assume that the templates in are mutually homomorphically incomparable: if this is not the case, we remove templates that are not homomorphically minimal and further remove templates so that none of the remaining templates are homomorphically equivalent. Clearly, this is equivalence preserving and can be done in ExpTime.
We aim to show that coCSP is MDLog-rewritable if and only if coCSP is for all , which gives the desired ExpTime upper bound. The “if” direction is immediate since the union of MDLog programs is expressible as an MDLog program. For the “only if” direction, assume that coCSP() is MDLog-rewritable, and let denote a concrete rewriting. Consider a template and let denote the set of all finite -instances of treewidth that do not homomorphically map to where is the maximum number of variables that occur in a single rule of . We will show that is an obstruction set for . It then follows from Theorem 23 of [FV98], which says that the existence of an obstruction set of treewidth for some fixed implies MDLog-rewritability, that coCSP is MDLog-rewritable.
By definition of , it is immediate that if for some and -instance , then . We now establish the converse. Assume that . Consider the disjoint union of and . Since the templates in are homomorphically incomparable, for all . Thus and there is a proof tree for from and . From that tree, we can read off an -instance such that , has treewidth , and . From the latter, we get . There must thus also be a connected component of with . We clearly have . Since , , and is connected, we moreover get which finishes the proof.
Appendix C Proof of Theorem 22
See 22
Proof C.1**.**
We treat the two cases, FO-rewritability and MDLog-rewritability with parameters, in parallel in a uniform way. To achieve uniformity, recall that FO-rewritability coincides with UCQ-rewritability by Proposition 1 and observe that a UCQ-rewriting of treewidth with parameters can be converted into a non-recursive MDLog-rewriting with parameters of diameter and vice versa. We work with the latter.
Assume that an -ary MDDLog program over EDB schema is rewritable into (non-recursive) MDLog with parameters. We can convert
- (1)
* into Boolean MDDLog programs with constants (Lemma 17),* 2. (2)
* into Boolean MDDLog programs without constants (Lemma 18),* 3. (3)
* into simple Boolean MDDLog programs (Theorem 1), and* 4. (4)
* into CSP templates (Theorem 1)*
such that all these programs and (complements of) templates are rewritable into (non-recursive) MDLog. Moreover, in the proofs of the mentioned lemmas and theorems, it is shown how to construct (non-recursive) MDLog-rewritings of from given ones of , for from given ones of , and so on. We are going to analyze these constructions in more detail.
We first note that for any (non-recursive) MDLog-rewritable CSP, there is a (non-recursive) MDLog-rewriting where every rule body has at most one EDB atom that contains all variables which occur in the rule body. Since each program is actually equivalent to the complement of the CSP template in Step 4, the same is true for the programs . Thus, there is a (non-recursive) MDLog-rewriting of in which
- ()
each rule body has at most one EDB atom that contains all variables.
The translation of into in Step 3 involves replacing the EDB schema with an aggregation schema . More precisely, consists of relations where is obtained from a rule body in by first contracting, then splitting up the body into biconnected components, and finally dropping all IDB relations. When translating the rewriting of into a rewriting of , this change in schema is reverted. By (), the diameter of is thus bounded by the arity of relations in and that arity, in turn, is bounded by the diameter of . What’s more important, though, is that we actually know what the rule bodies in look like:
- ()
every rule body in is obtained from a rule body in by first contracting, then splitting up the body into biconnected components, then dropping all IDB relations, and finally decorating with some fresh IDB relations without introducing fresh variables.
Now consider the translation of into in Step 2 and the corresponding translation of into a rewriting of . In the former, we dejoin rule bodies by (sometimes) replacing different occurrences of the same variable with different variables and adding the atoms and for some , thus increasing the diameter. In the latter, we rejoin the dejoined rules in in the sense that we replace variables with the same constant whenever the rule body contains the (EDB) atoms and . It can be verified that rejoining any rule body of the form () results in a rule body whose diameter is bounded by the diameter of . This gives the desired result since Step 1 preserves diameter.
Appendix D Proof of Theorem 26
See 26
Proof D.1**.**
Let be an OMQ from and let and be the size of and , respectively. We use to denote the set of subconcepts of (concepts occurring in) . Moreover, let be the set of all tree CQs that can be obtained from a CQ in by first existentially quantifying all answer variables, then contracting, and then taking a subquery. Every can be viewed as a -concept provided that we additionally choose a root of the tree. We denote this concept with . For example, the tree CQ and choice of as the root yields the -concept . Let be the set of all these concepts and let be the schema that consists of monadic relation symbols and for each and nullary relation symbols and for each . We are going to construct an MDDLog program over EDB schema and IDB schema that is equivalent to .
By a diagram, we mean a conjunction of atoms over the schema . For an interpretation , we write if there is a homomorphism from to , that is, a map such that:
- (1)
* with implies ;* 2. (2)
* with implies ;* 3. (3)
* implies and implies ;* 4. (4)
* implies and implies .*
We say that is realizable if there is an model of with . A diagram implies a CQ , with a tuple of variables from , if every homomorphism from to some model of is also a homomorphism from to . The MDDLog program consists of the following rules:
- (1)
the rule for each ; 2. (2)
the rule for each ; 3. (3)
the rule for each non-realizable diagram that contains a single variable and only atoms of the form , ; 4. (4)
the rule for each non-realizable connected diagram that contains at most two variables and at most three atoms; 5. (5)
the rule for each diagram that implies , has at most variable occurrences, and uses only relations of the following form: , with a concept name that occurs in , and role names from that occur in .
To understand , a good first intuition is that rules of type 1 and 2 guess an interpretation , rules of type 3 and 4 take care that the independent guesses are consistent with each other, with the facts in and with the inclusions in the TBox , and rules of type 5 ensure that returns the answers to in .
However, this description is an oversimplification. Guessing is not really possible since might have to contain additional domain elements to satisfy existential quantifiers in which may be involved in homomorphisms from (a CQ in) to , but new elements cannot be introduced by MDDLog rules. Instead of introducing new elements, rules of type 1 and 2 thus only guess the tree CQs that are satisfied by those elements. Tree CQs suffice because has a tree-like model property and since we have disallowed the use of roles in the query that have a transitive subrole. The notion of ‘diagram implies query’ used in the rules of type 5 takes care that the guessed tree CQs are taken into account when looking for homomorphisms from to the guessed model. This construction is identical to the one used in the proof of Theorem 1 of [BtCLW13], with two exceptions. First, we use predicates and for every concept while the mentioned proof uses a predicate for every subset . And second, our versions of Rules 3-5 are formulated more carefully. It can be verified that the correctness proof given in [BtCLW13] is not affected by these modifications. The modifications do make a difference regarding the size of , though, which we analyse next.
It is not hard to see that, for some polynomial , the number of rules of type 1 is bounded by , the number of rules of type 2 and of type 4 is bounded by , the number of rules of type 3 is bounded by , and the number of rules of type 5 is bounded by . Consequently, the overall number of rules is bounded by and so is the size of . The bounds on the size of the IDB schema and number of rules in stated in Theorem 26 are easily verified. It remains to argue that the construction can be carried out in double exponential time. It suffices to observe two facts. First, consistency of a given diagram can be decided in ExpTime since the satisfiability of concepts w.r.t. TBoxes is in ExpTime [Tob01]. And second, for a given diagram and CQ with a tuple of variables from , it can be decided in time single exponential in the size of and of and double exponential in the size of whether implies . This is a consequence of the fact that, in , given an ABox that may contain compound concepts (in place of concept names), a TBox , a CQ and a candidate answer , it can be decided in time single exponential in the size of and and double exponential in the size of whether is a certain answer to on w.r.t. [GLHS08].
Appendix E Proof of Lemma 27
See 27
Proof E.1**.**
We convert into an MDDLog program as per Theorem 26 and then remove the answer variables according to the constructions in the proofs of Lemmas 17 and 18, which gives programs and . Analyzing the latter constructions reveals that the number of rules on is bounded by rules where is the number of rules in and is its arity. Moreover, the rule size does not increase and neither the IDB schema nor the EDB schema changes. The latter construction produces a program with rules where is the number of rules in and is the rule size of . Moreover, the IDB schema is not changed and the rule size at most doubles. The EDB schema of the new program comprises fresh monadic relation symbols. It can thus be verified that the obtained Boolean MDDLog program still satisfies Conditions 1-3 of Theorem 26 except that in the last point has to be replaced by . We make this explicit for the reader’s convenience:
- (1)
the size of is bounded by ; 2. (2)
the IDB schema of is of size at most ; 3. (3)
the rule size of is bounded by .
We next convert into a simple Boolean MDDLog program according to Theorem 1. Let us analyze the construction in detail to understand the size of , of its EDB schema , and of its IDB schema .
The initial variable identification step can be ignored. In fact, we start with at most rules, each of size at most . Thus variable identification results in a factor of regarding the program size and rule number, which is absorbed by , and the other relevant parameters do not change; in particular, the IDB schema remains unchanged.
The next and central step is to make rules biconnected. Given that the rule size is at most , this can split up each rule into at most rules. This is absorbed by the bounds on program size and rule number. However, on first glance it might seem that we end up with a double exponentially large IDB schema since we might have to split up a double exponential number of rules, each time introducing at least one fresh IDB relation. To argue that this is actually not the case, we distinguish rules of type 1-2 and 4-5 from the construction of (proof of Theorem 26); note that the constructions in the proofs of Lemmas 17 and 18 modify the rules only in a very mild way and thus for every rule in it is still clear which type it has.
We need not worry about rules of Type 1-2 and 4-5 since there are only many such rules, each of size at most , and thus the number of additional IDB relations introduced for making them biconnected is also bounded by . Rules of type 3 in , on the other hand, are of a very restricted form, namely
[TABLE]
with . These rules are biconnected and thus we are done when is Boolean. In the non-Boolean case, rules of the above lead to the introduction of additional rules in the construction in the proof of Lemma 18. This results in rules in that are of the form
[TABLE]
where is one of the fresh IDB relations introduced in the mentioned construction. The latter rules have to be split up to be made biconnected. This will result in rules of the form
[TABLE]
Clearly, there are only many rule bodies of the latter form and thus it suffices to introduce at most the same number of fresh IDB relations . Thus, the size of the IDB schema of is bounded by . Also note that, at this Point, the rule size has (potentially) decreased and is bounded by . This is obvious for rules of Type 1-2 and 4-5, and also for the rules obtained from making rules of Type 3 biconnected, see above.
The last step is the change of EDB schema. It involves no blowups and we thus obtain the bounds stated in Lemma 27. In particular, the arity of relations in is bounded by since it is bounded by the rule size of the program that we had obtained before changing the EDB schema.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[ACKZ 09] Alessandro Artale, Diego Calvanese, Roman Kontchakov, and Michael Zakharyaschev. The DL-Lite family and relations. JAIR , 36:1–69, 2009.
- 2[AHV 95] Serge Abiteboul, Richard Hull, and Victor Vianu, editors. Foundations of Databases: The Logical Level . Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1995.
- 3[At CKT 11] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan. Characterizing schema mappings via data examples. ACM Trans. Database Syst. , 36(4):23:1–23:48, 2011.
- 4[Ats 08] Albert Atserias. On digraph coloring problems and treewidth duality. Eur. J. Comb. , 29(4):796–820, 2008.
- 5[Bar 16] Libor Barto. The collapse of the bounded width hierarchy. J. Log. Comput. , 26(3):923–943, 2016.
- 6[BBV 16] Michael Benedikt, Pierre Bourhis, and Michael Vanden Boom. A step up in expressiveness of decidable fixpoint logics. In Proc. of LICS 2016 , pages 817–826. IEEE Computer Society, 2016.
- 7[BCF 12] Manuel Bodirsky, Hubie Chen, and Tomás Feder. On the complexity of MMSNP. SIAM J. Discrete Math. , 26(1):404–414, 2012.
- 8[BD 13] Manuel Bodirsky and Víctor Dalmau. Datalog and constraint satisfaction with infinite templates. J. Comput. Syst. Sci. , 79(1):79–100, 2013.
