Reasoning about disclosure in data integration in the presence of source constraints
Michael Benedikt, Pierre Bourhis (CRIStAL, CNRS, SPIRALS), Louis, Jachiet (CRIStAL, CNRS, SPIRALS), Micha\"el Thomazo (DI-ENS, ENS Paris, CNRS,, PSL, VALDA )

TL;DR
This paper investigates how source constraints influence privacy disclosure analysis in data integration systems, providing bounds on what an attacker can infer considering source semantics and constraints.
Contribution
It introduces a formal framework for analyzing disclosure in data integration with source constraints, highlighting their significant impact on privacy assessments.
Findings
Source constraints significantly affect disclosure analysis.
The paper establishes bounds on source-aware disclosure.
Constraints can either limit or enable information disclosure.
Abstract
Data integration systems allow users to access data sitting in multiple sources by means of queries over a global schema, related to the sources via mappings. Data sources often contain sensitive information, and thus an analysis is needed to verify that a schema satisfies a privacy policy, given as a set of queries whose answers should not be accessible to users. Such an analysis should take into account not only knowledge that an attacker may have about the mappings, but also what they may know about the semantics of the sources. In this paper, we show that source constraints can have a dramatic impact on disclosure analysis. We study the problem of determining whether a given data integration system discloses a source query to an attacker in the presence of constraints, providing both lower and upper bounds on source-aware disclosure analysis.
| Unbounded arity | Bounded arity | |||||||
|---|---|---|---|---|---|---|---|---|
| PSpace | ExpTime | 2ExpTime | 2ExpTime | NPL=QEntail | NP | ExpTime | 2ExpTime | |
| ExpTime | ExpTime | 2ExpTime | 2ExpTime | NP | NP | ExpTime | 2ExpTime | |
| 2ExpTime | 2ExpTime | 2ExpTime | 2ExpTime | ExpTime | ExpTime | ExpTime | 2ExpTime | |
| 2ExpTime | 2ExpTime | 2ExpTime | 2ExpTime | 2ExpTimeL=QEntail | 2ExpTime | 2ExpTime | 2ExpTime | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Reasoning about Disclosure in Data Integration in the Presence of Source Constraints
Michael Benedikt1
Pierre Bourhis2
Louis Jachiet2&Michaël Thomazo3
1University of Oxford
2CNRS CRIStAL, Université Lille, Inria Lille
3Inria, DI ENS, ENS, CNRS, PSL University
{pierre.bourhis, louis.jachiet}@univ-lille.fr, [email protected], [email protected]
Abstract
Data integration systems allow users to access data sitting in multiple sources by means of queries over a global schema, related to the sources via mappings. Datasources often contain sensitive information, and thus an analysis is needed to verify that a schema satisfies a privacy policy, given as a set of queries whose answers should not be accessible to users. Such an analysis should take into account not only knowledge that an attacker may have about the mappings, but also what they may know about the semantics of the sources. In this paper, we show that source constraints can have a dramatic impact on disclosure analysis. We study the problem of determining whether a given data integration system discloses a source query to an attacker in the presence of constraints, providing both lower and upper bounds on source-aware disclosure analysis.
1 Introduction
In data integration, users are shielded from the heterogeneity of multiple datasources by querying via a global schema, which provides a unified vocabulary. The relationship between sources and the user-facing schema are specified declaratively via mapping rules. In data integration systems based on knowledge representation techniques, users pose queries against the global schema, and these queries are answered using data in the sources and background knowledge. The computation of the answers involves reasoning based on the query, the mappings, and any additional semantic information that is known on the global schema.
Data integration brings with it the danger of disclosing information that data owners wish to keep confidential. In declarative data integration, detection of privacy violations is complex: although explicit access to source information may be masked by the global schema, an attacker can infer source facts via reasoning with schema and mapping information.
Example 1**.**
We consider an information integration setting for a hospital, which internally stores the following data:
[TABLE]
The hospital publishes the following data: giving opening times for building , giving times when a given patient can be visited, and listing the doctors with their specialty and their building . Formally the data being exposed is given by the following mappings:
[TABLE]
Prior work Benedikt et al. (2018) has studied disclosure in knowledge-based data integration, with an emphasis on the role of semantic information on the global schema – in the form of ontological rules that relate the global schema vocabulary. The presence of an ontology can assist in privacy, since distinctions in the source data may become indistinguishable in the ontology. More dangerous from the point of view of protecting information is semantic information about sources. For example, the sources in a data integration setting will generally overlap: that is, they will satisfy referential integrity constraints, saying that data items in one source link to items in another source. Such constraints should be assumed as public knowledge, and with that knowledge the attacker may be able to infer information that was intended to be secret.
Example 2**.**
*Continuing Example 1, suppose that we know that each patient has a doctor specialized in their condition, which can be formalized as: *
[TABLE]
And that we also know that when a patient is in a building, they must have a doctor there:
[TABLE]
Due to the presence of these constraints, there can be a disclosure of the relationship of patient to speciality . Indeed, an attacker can see the for , and from this, along with , they can sometimes infer the building where is treated (e.g. if has a unique set of open hours). From this they may be able to infer, using , the specialty that has been treated for – for example, if all the doctors in share a specialty.
In this work, we perform a detailed examination of the role of source constraints in disclosing information in the context of data integration. We focus on mappings from the sources given by universal Horn rules, where the global schema comes with no constraints. Since our disclosure problem requires reasoning over all sources satisfying the constraints, we need a constraint formalism that admits effective reasoning. We will look at a variety of well-studied rule-based formalisms, with the simplest being referential constraints, and the most complex being the frontier-guarded rules Baget et al. (2011). While decidability of our disclosure problems will follow from prior work Benedikt et al. (2016), we will need new tools to analyze the complexity of the problem. In Section 3, we give reductions of disclosure problems to the query entailment problem that is heavily-studied in knowledge representation. While a naïve application of the reduction allows us only to conclude very pessimistic bounds, a more fine-grained analysis, combined with some recent results on CQ entailment, will allow us to get much better bounds, in some cases ensuring tractability. In Section 4, we complement these results with lower bounds. Both the upper and lower bounds revolve around a complexity analysis for reasoning with guarded existential rules and a restricted class of equality rules, where the rule head compares a variable and a distinguished constant. We believe this exploration of limited equality rules can be productive for other reasoning problems.
Overall we get a complete picture of the complexity of disclosure in the presence of source constraints for many natural classes: see Tables 1 in Section 6 for a summary of our bounds. Full proofs are available at the address https://hal.inria.fr/hal-02145369.
2 Preliminaries
We adopt standard notions from function-free first-order logic over a vocabulary of relational names. An instance is a finite set of facts. By a query we always mean a conjunctive query (CQ), which is a first-order formula of the form , where each is an atom. The arity of a CQ is the number of its free variables, and CQs of arity 0 are Boolean.
Data integration.
Assume that the relational names in the vocabulary are split into two disjoint subsets: source and global schema. The arity of such a schema is the maximal arity of its relational names. We consider a set of mapping rules between source relations and a global schema relation given. We focus on rules where is a conjunctive query, there are no repeated variables in , and where each global schema relation is associated with exactly one rule. Such rules are sometimes called “GAV mappings” in the database literature Lenzerini (2002), and the unique associated to a global relation is referred to as the definition of . The rules are guarded () if for every rule, there exists an atom in the antecedent that contains all the variables of . The rules are atomic () if each consists of a single atom, and they are projection maps () if each is a single atom with no repeated variables. Given an instance for the source relations, the image of under mapping , denoted , is the instance for the global schema consisting of all facts where is the definition of .
Source constraints.
We consider restrictions on the sources in the form of rules. A tuple-generating dependency (TGD) is a universally quantified sentence of the form , where the body and the head are conjunctions of atoms such that each term is either a constant or a variable in and , respectively. Variables , common to the head and body, are called the frontier variables. A frontier-guarded TGD () is a TGD in which there is an atom of the body that contains every frontier variable. We focus on s because they have been heavily studied in the database and knowledge representation community, and it is known that many computational problems involving s are decidable Baget et al. (2011). In particular this is true of the query entailment problem, which asks, given a finite collection of facts , a finite set of sentences, and a CQ , whether entails . We use to denote an instance of this problem and also say that “ entails w.r.t. constraints ”. A special case of s are Guarded TGDs (s), in which there is an atom containing all body variables. These specialize further to linear TGDs (s), whose body consists of a single atom; and even further to inclusion dependencies (s), a linear TGD with a single atom in the head, in which no variable occurs multiple times in the body, and no variable occurs multiple times in the head. Even s occur quite commonly: for example, the source constraints of Example 2 can be rewritten as s. The most specialized class we study are the unary s:(s), which are s with at most one frontier variable.
Queries and disclosure.
The sensitive information in a data integration setting is given by a CQ over the source schema, which we refer to as the policy. Intuitively, disclosure of sensitive information occurs in a source instance whenever the attacker can infer from the image that holds of a tuple in . Formally, we say an instance for the global schema is realizable, with respect to mappings and source constraints if there is some source instance that satisfies such that . For a realizable , the set of such are the possible source instances for . A query result is disclosed at if holds on all possible source instances for . A query admits a disclosure (for mappings and source constraints ) if there is some realizable instance and binding for the free variables of for which is disclosed. In this terminology, the conclusion of Example 2 was that policy admits a disclosure with respect to the constraints and mappings. For a class of constraints , a class of mappings , a class of policies , we write to denote the problem of determining whether a policy (a CQ, unless otherwise stated) admits a disclosure for a set of mappings in and a set of source constraints in . Given and a CQ , the corresponding instance of this problem is denoted by . In this paper we will focus on disclosure for queries and constraints without constants, although our techniques extend to the setting with constants, as long as distinct constants are not assumed to be unequal.
3 Reducing Disclosure to Query Entailment
Our first goal is to provide a reduction from to a finite collection of standard query entailment problems. For simplicity we will restrict to Boolean queries in stating the results, but it is straightforward to extend the reductions and results to the non-Boolean case. We first recall a prior reduction of to a more complex problem, the hybrid open and closed world query answering problem Lutz et al. (2013, 2015); Franconi et al. (2011), denoted . takes as input a set of facts , a collection of constraints , a Boolean query , and additionally a subset of the vocabulary. A possible world for such is any instance containing , satisfying , and such that for each relation , the -facts in are the same as the -facts in . holds if holds in every possible world. Note that the query entailment problem is a special case of , where is empty.
Given a set of mapping rules of the form , we let be the set of global schema predicates, and let be the mapping rules, considered as bi-directional constraints between global schema predicates and sources.
We now recall one of the main results of Benedikt et al. (2016):
Theorem 1**.**
There is an instance computable in linear time from , such that holds if and only if holds.
In fact, the arguments in Benedikt et al. (2016) show that can be taken to be a very simple instance, the critical instance over the global schema denoted where , for a set of predicates, denotes the instance that mentions only a single element , and contains, for each relation in of arity , the fact .
Corollary 1**.**
* is in 2ExpTime.*
Proof.
The non-classical aspect of comes into play with rules of of form . But in the context of , these can be rewritten as single-constant equality rules (s) . Such rules remain in the Guarded Negation Fragment of first-order logic, which also subsumes s, while having a query entailment problem in 2ExpTime Bárány et al. (2015). ∎
We now want to conduct a finer-grained analysis, looking for cases that give lower complexity. To do this we will transform further into a classical query entailment problem. This will require a transformation of our query , a transformation of our source constraints and mappings into a new set of constraints, and a transformation of the instance . The idea of the transformation is that we remove the s that are implicit in the problem, replacing them with constraints and queries that reflect all the possible impacts the rules might have on identifying two variables.
We first describe the transformation of the query and the constraints. They will involve introducing a new unary predicate ; informally this states that is equal to . Consider a CQ . An annotation of is a subset of ’s variables. Given an annotation of , we let be the query obtained from by performing the following operation for each in : for all occurrences of except the first one, replacing with a fresh variable ; and adding conjuncts as well as to . A critical-instance rewriting of a CQ is a CQ obtained by applying the above process to for any annotation. We write to indicate that is such a rewriting.
To transform the mapping rules and constraints to a new set of constraints using , we lift the notion of critical-instance rewriting to TGDs in the obvious way: a critical-instance rewriting of a TGD (either in or ), is the set of TGDs formed by applying the above process to the body of . We write to indicate that is a critical-instance rewriting for a , and similarly for mappings. For example, the second mapping rule in Example 1 has several rewritings; one of them will change the rule body to .
Our transformed constraints will additionally use the set of constraints , including all rules:
[TABLE]
where ranges over the global schema and . Informally states that all elements in the mapping image must be . We also need to transform the instance, using a source instance with “witnesses for the target facts”. Consider a fact in formed by applying a mapping rule in . The set of witness tuples for is the set , where contains in each position containing a variable and containing a constant in every position containing a variable . That is the witness tuples are witnesses for the fact , where each existential witness is chosen fresh. Let be the instance formed by taking the witness tuples for every fact .
We are now ready to state the reduction of the disclosure problem to query entailment:
Theorem 2**.**
* holds exactly when there is a such that entails w.r.t. constraints:*
[TABLE]
Note that Theorem 2 does not give a polynomial time reduction: both and can contain exponentially many rewritings, and further there can be exponentially many rewritings in .
However, the algorithm does give us a better bound in the case of Guarded TGDs with bounded arity.
Corollary 2**.**
If we bound the arity of schema relations, then is in ExpTime.
Proof.
First, by introducing additional intermediate relations and source constraints, we can assume that contains only projection mappings. Thus we can guarantee that just contains the rules in . By introducing intermediate relations and additional source constraints, we can also assume that each has a body with at most two atoms. Since the arity of relations is fixed, the size of such - or -atom bodies is fixed as well. From this we see that the number of constraints in any is polynomial. The reduction in Theorem 2 thus gives us exponentially many entailment problems of polynomial size. Since entailment over Guarded TGDs with bounded arity is in ExpTime Calì et al. (2013), we can conclude. ∎
3.1 Refinements of the Reduction to Identify Lower Complexity Cases
In order to lower the complexity to ExpTime without bounding the arity, we refine the construction of the function in the case where is a linear TGD, providing a function that constructs only polynomially many rewritten constraints.
Let be a linear TGD with relation of arity , and suppose contains distinct free variables . Let be the set of pairs with such that the same variable sits at positions and in . We order as ; for each that is not the initial pair , we let be its predecessor in the linear order.
We let denote new predicates of arity for each . Let be a set of distinct variables, and be formed from by replacing with . We begin the construction of with the constraints: and .
For each with a predecessor , we add to the following constraints: and .
Letting the final pair in , we add to the constraint where is obtained from by replacing all but the first occurrence of each variable by a fresh variable.
If consists of s, we let be the result of applying this process to every . Similarly, if consists of atomic mappings (implying that the associated rules are s), then we let the result of applying the process above to the rule going from source relation to global schema relation associated to . Then we have:
Theorem 3**.**
When consists of s, holds exactly when there is a such that entails w.r.t. to the constraints
[TABLE]
We can combine this result with recent work on fine-grained complexity of s to improve the doubly exponential upper bound of Corollary 1 for linear TGD source constraints and atomic mappings:
Theorem 4**.**
* is in ExpTime. If the arity of relations in the source schema is bounded, then the complexity drops to NP, while if further the policy is atomic, the problem is in PTime.*
Proof.
It is sufficient to get an ExpTime algorithm for the entailment problem produced by Theorem 3, since then we can apply it to each in ExpTime. The constraints in are Guarded TGDs that are not necessarily s. But the bodies of these guarded TGDs consist of a guard predicate and atoms over a fixed “side signature”, namely the unary predicate . It is known that the query entailment for s and guarded TGDs with a fixed side signature is in ExpTime, with the complexity dropping to NP (resp. PTime) when the arity is fixed (resp. fixed and the query is atomic) Amarilli and Benedikt (2018a). ∎
Can we do better than ExpTime? We can note that when the constraints are s, consists only of ; similarly if a mapping is a projection, then consists only of . This gives us a good upper bound in one of the most basic cases:
Corollary 3**.**
* is in PSpace. If further a bound is fixed on the arity of relations in the source schema, then the problem becomes NP, dropping to PTime when the policy is atomic.*
Proof.
Our algorithm will guess a in and checks the entailment of Theorem 2. This gives an entailment problem for s, known to be in PSpace in general, in NP for bounded arity, and in PTime for bounded arity and atomic queries Johnson and Klug (1984). ∎
3.2 Obtaining Tractability
Thus far we have seen cases where the complexity drops to PSpace in the general case and NP in the bounded arity case, and PTime for atomic queries. We now present a case where we obtain tractability for arbitrary queries and arity. Recall that a is an where at most one variable is exported. They are actually quite common, capturing referential integrity when data is identified by a single attribute. We can show that restricting to s while having only projection maps leads to tractability:
Theorem 5**.**
* is in PTime.*
Proof.
The first step is to refine the reduction of Theorem 2 to get an entailment problem with only s, over an instance consisting of a single unary fact . The main issue is avoid the constraints in , corresponding to the mapping rules. The intuition for this is that on , the only impact of the backward and forward implications of is to create new facts among the source relations. In these new facts only , is propagated. Rather than creating s (implicitly what happens in the reduction) or generating classical constraints where the impact of the equalities are “baked in” (as in the critical-instance rewritings of Theorems 2 and 3), we truncate the source relations to the positions where non-visible elements occur, while generating s on these truncated relations that simulate the impact of back-and-forth using .
The second step is to show that query entailment with s over the instance consisting only of is in PTime. This can be seen as an extension of the PTime inference algorithm for s Cosmadakis et al. (1990). The idea behind this result is to analyze the classical “chase procedure” for query entailment with TGDs Fagin et al. (2005). In the case of s over a unary fact, the shape of the chase model is very restricted; roughly speaking, it is a tree where only a single fact connects two values. Based on this, we can simplify the query dramatically, making it into an acyclic query where any two variables co-occur in at most one predicate. Once query simplification is performed, we can reduce query entailment to polynomial many entailment problems involving individual atoms in the query. This in turn can be solved using the inference procedure of Cosmadakis et al. (1990). ∎
4 Lower Bounds
We now focus on providing lower bounds for , showing in particular that the upper bounds provided in Section 3 can not be substantially improved. For many classes of constraints it is easy to see that the complexity of disclosure inherits the lower bounds for the classical entailment problem for the class. From this we get a number of matching lower bounds; e.g. 2ExpTime for constraints, PSpace for constraints. But note that in some cases the upper bounds we have provided for disclosure in Section 3 are higher than the complexity of entailment over the source constraints. For example, for s we have provided only a 2ExpTime upper bound for guarded mappings (from Corollary 1), and only an exponential bound for atomic mappings (from Theorem 4). This suggests that the form of the mappings influences the complexity as well, as we now show.
Most of our proofs for hardness above the entailment bound for source constraints rely on the encoding of a Turing machine. Source constraints are used to generate the underlying structures (tree of configurations, tape of a Turing machine) while mappings are used to ensure consistency (a universal configuration is accepting if and only if all its successor configurations are accepting, the content of the tape is consistently represented,…). To illustrate our approach, we sketch the proof of the following result.
Theorem 6**.**
* and are 2ExpTime-hard, and are ExpTime-hard even in bounded arity.*
Proof.
Recall that Theorem 1 relates disclosure to a problem on . Also recall from Section 3 the intuition that such a problem amounts to a classical entailment problem for a CQ over a very simple instance, using the source dependencies and s: of the form , where will be the body of a mapping. We will sketch how to simulate an alternating ExpSpace Turing machine using a problem using s and guarded s. This can in turn be simulated using our problem.
We first build a tree of configurations using s, such that each node has a type (existential or universal) and is the parent of two nodes (called -successor and -successor) of the opposite type. This tree structure is represented, together with additional information, by atoms such as:
[TABLE]
Intuitively, this states that is a universal configuration, parent of and . (resp. , resp. ) is the acceptance bit for (resp. , resp. ), which will be made equal to if and only if the configuration represented by (resp. , resp. ) is accepting. will be used to represent cell addresses, while is the identifier of the root of the configuration tree. The initial instance is such an atom, where the first position and the last position are the same constant, is a vector of [math]’s, is a vector of ’s, and all other arguments are distinct constants.
We use s to propagate acceptance information up in the tree. For instance, a universal configuration is accepting if both its successors are accepting. This is simulated by the following :
[TABLE]
To simulate , we need access to an exponential number of cells for each configuration. We identify a cell by the configuration it belongs to and an address, which is a vector, generated by s, of length whose arguments are either [math] or . The atom for representing a cell is thus , where is the parent configuration of , which is the configuration to which the represented cell belongs, is the address of the cell, its content, the content of the previous cell, and the content of the next cell. Note that this representation is redundant, and we need to use s to ensure its consistency.
Note that is a tuple of length the size of . Each position corresponds to an element of that set, and the content of a represented cell is the element which corresponds to the unique position in which appears.
We now explain how to build the representation of the initial tape, and simulate the transition function. Both steps are done by unifying some nulls with . W.l.o.g., we assume that the initial tape contains a in the first cell, on which points the head of in a state , and that corresponds to the first bit of . We thus use a to set this bit to in the first cell of the first configuration. We then set (w.l.o.g.) the second bit of all the other cells of that configuration to (assuming this represents ).
To simulate the transitions, we note that the content of a cell in a configuration depends only on the content of the same cell in the parent configuration, along with the content of parent’s previous and next cells. We thus add a that checks for the presence of specifying the content of three consecutive cells in a configuration, and unify a null with to specify the content of the corresponding cell of a child configuration.
The argument above uses s and s, but we can simplify the mappings to using s. ∎
A simple variation of the construction used for PSpace-hardness of entailment with s Casanova et al. (1984) shows that our upper bounds for source constraints and atomic maps are tight. The case of source constraints and projection maps can be done via reduction to that of source constraints and atomic maps:
Theorem 7**.**
* and are both ExpTime-hard.*
The above results, coupled with argument that the lower bounds for entailment are inherited by disclosure, show tightness of all upper bounds from Table 1 in the unbounded arity case. Another variation of the encoding in Theorem 6 shows that with no restriction on the mappings one can not do better than the 2ExpTime upper bound of Corollary 1 even for constraints in bounded arity,
Theorem 8**.**
* is 2ExpTime-hard in bounded arity.*
The theorem above, again combined with results showing that the lower bounds for entailment are inherited, suffice to show tightness of all upper bounds from Table 1 in the case of bounded arity.
We can also show that our tractability result for constraints and projection maps does not extend when either the maps or the constraints are broadened. Informally, this is because with these extensions we can generate an instance on which CQ querying is NP-hard.
5 Related Work
Disclosure analysis has been approached from many angles. We do not compare with the vast amount of work that analyzes probabilistic mechanisms for releasing information, providing probabilistic guarantees on disclosure Dwork (2006). Our work focuses on the impact of reasoning on mapping-based mechanisms used in knowledge-based information integration, which are deterministic; thus one would prefer, and can hope for, deterministic guarantees on disclosure. We deal here with the analysis of disclosure, while there is a complementary literature on how to enforce privacy Biskup and Weibert (2008); Bonatti et al. (1995); Bonatti and Sauro (2013); Studer and Werner (2014).
The problem of whether information is disclosed on a particular instance (variation of introduced in Section 3) has been studied in both the knowledge representation Lutz et al. (2013, 2015); Franconi et al. (2011); Ahmetaj et al. (2016); Amendola et al. (2018) and database community Abiteboul and Duschka (1998). The corresponding schema-level problem was defined in Benedikt et al. (2016), which allows arbitrary constraints relating the source and the global schema. However, results are provided only for constraints in guarded logics, which does not subsume the case of mappings given here. Our results clarify some issues in prior work: Benedikt et al. (2016) claimed that disclosure with source constraints and atomic maps is in PSpace, while our Theorem 7 shows that the problem is ExpTime-hard. Our notion of disclosure corresponds to the complement of Benedikt et al. (2018)’s “data-independent compliance”. The formal framework of Benedikt et al. (2018) is orthogonal to ours. On the one hand, source constraints are absent; on the other hand a more powerful mapping language is considered, with existentials in the head of rules, while constraints on the global schema, given by ontological axioms, are now allowed. Benedikt et al. (2018) assume that the attacker has an interface for posing queries against the global schema, with the queries being answered under entailment semantics. In general, the semantic information on the global schema makes disclosure harder, since the outputs of different mapping rules may be indistinguishable by an attacker who only sees the results of reasoning. In contrast, source constraints make disclosure of secrets easier, since they provide additional information to the attacker.
6 Summary and Conclusion
We have isolated the complexity of information disclosure from a schema in the presence of commonly-studied sets of source constraints. A summary of many combinations of mappings and source constraints is given in Table 1: note that all problems are complete for the complexity classes listed. We have shown tractability in the case of s and projection maps (omitted in the tables), while showing that lifting the restriction leads to intractability. But we leave open a finer-grained analysis of complexity for frontier-one constraints with more general mappings. Our results depend on a fine-grained analysis of reasoning with TGDs and s, a topic we think is of independent interest.
Acknowledgements
This work was partially funder by CNRS Momentum project “Managing Data without Leak”.
Appendix A Detailed Proofs from Section 3: Upper Bounds for Disclosure
A.1 Proof of Theorem 2: Correctness of the Basic Reduction
from Disclosure to Classical Entailment
Recall the statement of Theorem 2, which applies the algorithms to TGDs and to mappings.
holds exactly when there is a such that entails w.r.t. constraints:
[TABLE]
holds.
By Theorem 1 we know that is equivalent to .
This will immediately allow us to prove one direction of the equivalence. Suppose each of our entailments fails. From this, we see using Sagiv and Yannakakis [1980] that does not entail the disjunction of . Thus we have an instance extending with facts that may include the predicate, where satisfies all the rewritten constraints and no rewritten query . Note that since satisfies the constraints of as well as , we know that the element , if it occurs in , will be labeled with .
Form an instance by unifying all elements in satisfying into a single element , making inherit any fact that such an participates in. That is, we choose so that if is the mapping taking any element satisfying to and fixing every other element, then is a homomorphism from onto . We can easily verify that satisfies the original source constraints . For each homomorphism of the body of into , there is a homomorphism of some into . We know is satisfied in , and taking the -image of the tuples that witness this gives us the required witnesses for in . Now let be the restriction of to the source relations. We argue that the mapping image of under is exactly . To see that the image of must include all the facts in , note that includes all facts of , which contains witnesses for each such fact. Thus the -image, namely , contains witnesses for each such fact as well. Conversely, suppose the image of includes a fact ; we will argue that is in . Since satisfied , any such fact in must have all satisfying . Thus in each such fact must be of the form . Thus the -image of is exactly the same .
Finally, we claim that satisfies . If it satisfies , then would satisfy for some annotation , a contradiction. Putting this all together, we see that contradicts .
Before turning to the other direction, we will explain some other results that will be necessary. The first is the chase procedure for checking entailment of a query from a set of constraints and a set of facts . This proceeds by building a sequence of instances where each is formed from by “firing a rule” . Firing in means finding a homomorphism from the body of into , and adding facts to extend to the head, using fresh values for all existentially quantified variables. Such a homomorphism is called a trigger for the rule firing. The chase of under , denoted , is any instance formed as the union of such a sequence having the additional property that every rule that could fire in some fires in some later . The significance of the chase for query entailment is the following result Fagin et al. [2005]:
Theorem 9**.**
For an instance , set of TGDs , and UCQ , we have if and only if some chase model for under satisfies .
We will also need a variation of the chase for the problem , taken from Benedikt et al. [2016]. The visible chase is a sequence of source instances that begins with . is formed from by “chasing and merging”. The chase step applies the usual chase procedure described above to with constraints , creating new facts that possibly contain fresh values. In a merge step, we take a mapping and a homomorphism of the body of into , and for each free variable of , we replace by in all facts in which it appears. We say that this is a merge step with on . Since the process is monotone, it must reach a fixpoint, which we refer to as the visible chase of , denoted .
Proposition 1**.**
Benedikt et al. [2016]** holds exactly when satisfies .
We now prove the other direction, assuming that fails, but one of the entailments holds. By Theorem 9, this means that some chase of under the constraints satisfies for some annotation . Let denote such a chase sequence for under . We form another sequence , with , maintaining the invariant that there is a homomorphism from to mapping every element satisfying to . The inductive step is performed as follows:
- •
For every chase step with a rule of applied in , having trigger , we know that for some . We can apply the corresponding rule in , with a trigger that maps a variable to the -image of . Thus composed with is .
- •
For every chase step in with a rule of for and a trigger , we apply a merge step in with and .
Since some satisfies , one of the must satisfy . Since contains the image of under the homomorphism , and maps to , we see that must satisfy . But is a subinstance of the visible chase for our problem. Thus the assumption that fails and Proposition 1 imply that cannot hold in , a contradiction.
A.2 Simplifying Mappings
In this section, we will see that we can simplify mapping to be projection maps at the cost of moving to a richer class of source constraints.
Given a problem we consider and built in the following way: is composed of plus for each mapping we create a predicate and we add to the two constraints and . is composed of mappings .
Proposition 2**.**
We have if and only if .
Proof.
To prove the proposition, it is sufficient to prove that holds on if and only if holds on (see Proposition 1). Let be the instance obtained by removing all the facts in .
We recall that the visible chase works iteratively, at each step a database is created from by chasing all facts then merging some values with . For the sake of simplicity we suppose that each step is composed of either one rule firing or one merging.
- •
We start by proving that implies .
Let be a sequence corresponding to . We build a sequence corresponding to . We are trying to build such that there exists for all there exists such that , and implies .
We prove by induction:
- –
is composed of witnesses of and of witnesses of . We build such that each is obtained by firing the -th rule .
- –
Let us suppose that and is obtained by firing a rule ; could have been fired on and thus we can build such that .
- –
When is obtained by merging values then it means that we have holding in and thus holding in therefore we could use the rule followed by an unification on . Therefore we can build and such that .
- •
For the direction implies we start by noticing that, without loss of generality, we can suppose that the sequence of starts by firing each rule (it is always possible to generate more facts) and then we create such that for all big enough there exists such that
- –
Once all rules have been fired, we see that we obtain an instance isomorphic to .
- –
When and is obtained through a merge step, it means that we had but we easily see by induction that this means that we had and thus that we can also perform the merge step on
- –
When is obtained through a rule, it is either a rule in that we can reproduce in or it is a rule . In this latter case, we don’t have anything to do as will be discarded by .
Now, we also see that will grow as grows since except for rules , our increases. Therefore at the limit we have that implies .
∎
Corollary 4**.**
* reduces to .*
A.3 More Details for the Proof of Corollary 2
We recall the statement of Corollary 2:
If we fix the maximal arity of relations in the schema, then is in ExpTime.
We now fill in the details of the proof sketch in the body.
Reducing to .
Using Corollary 4, we can reduce the problem to . We now show that this latter problem is in ExpTime.
Reducing to two atoms in the body of TGDs.
Given a set of s and a set of maps we now reduce to where each in holds at most two conjuncts in the rule body.
is composed by applying the following process for each . The constraint is guarded, therefore we can select a guarding conjunct such that . When we simply add to . When , we rewrite this constraint by introducing predicates , while producing the following constraints and for : . Finally we also add . It is easy to see that this new problem is equivalent because each constraint in is implied by its corresponding constraints in and if we look at the result of the visible chase, the only fact derived from a are facts such that .
Rewriting in PTime.
Now that maps are s and each has at most two atoms in their body, we can apply the rewriting presented in Theorem 2. Notice that each will be rewritten to a bounded number of s, and the rewriting of the maps will be trivial. Since query entailment with s is ExpTime when the arity is bounded we can conclude the proof.
A.4 Proof of Theorem 3: More Efficient Reduction to Entailment for Source Constraints and Atomic Mappings
Recall the statement of Theorem 3, which concerns the application of the rewriting algorithms for source constraints , and the algorithm for atomic mappings :
holds exactly when there is a such that entails w.r.t. to the constraints
[TABLE]
Let and be the constraints posed in Theorem 3. By Theorem 2, it is enough to show that query entailment involving is equivalent to entailment involving .
In one direction, suppose that is a counterexample to entailment involving . We fire the rules generating atoms to get instance . We claim that the constraints of hold. Clearly, the rules generating atoms hold. Further, by construction, for any if holds exactly when there is an annotation We now consider the rule . Considering a such that holds, we want to claim that there is an annotation such that holds.
Recall that each is associated with some variable that occurs as both and in . If holds, we know that either or holds. If the latter happens, then we add the variable to our annotation. We can then verify that holds.
Since we are assuming that the corresponding constraint of holds in , we can conclude that . From this we see that is a counterexample to the entailment involving .
In the other direction, let be a counterexample to the entailment for the constraints in . We claim that the constraints of hold of . For constraints corresponding to source constraints with no repeated variables in the body, this is easy to verify, so we concentrate on constraints deriving from source constraints that do have repeated variables in the body.
Each of these constraints is of the form for some annotation . Fix a such that holds. We claim that holds for all . We prove this by induction on the position of in the ordering of pairs in . Each corresponds to some variable that is repeated. If is in , then implies that hold. Using the corresponding rule and the induction hypothesis we conclude that holds. If is not in then implies that . Using the other rule generating in , as well as the induction hypothesis, we conclude that holds. This completes the inductive proof that holds. Now using the corresponding constraint of we conclude that . Since the constraints of hold, is also a counterexample to the entailment involving .
A.5 More details in proof of Theorem 4: upper bounds
for source constraints and atomic maps
Recall the statement of Theorem 4
The problem is in ExpTime. If the arity of relations in the source schema is bounded, then the complexity drops to NP. If further the query is atomic, the problem is in PTime.
We now give more details on the proof. As mentioned in the body, is sufficient to get an ExpTime algorithm for the entailment problem produced by Theorem 3, since then we can apply it to each in ExpTime. The constraints in are Guarded TGDs that are not necessarily s. But the bodies of these guarded TGDs consist of a guard predicate and atoms over a fixed “side signature”, namely the unary predicate . We can apply now the linearization technique, originating in Gottlob et al. [2014] and refined in Amarilli and Benedikt [2018a]. Given a side signature this is an algorithm that converts an entailment problem involving ta set of non-full s and Guarded TGDs using , producing an equivalent entailment problem involving the same query, but only s. Further:
- •
The algorithm runs in ExpTime in general, and in PTime when the arity of the relations in the input is fixed
- •
The algorithm does not increase the arity of the signature, and thus the size of each output is polynomially-bounded in the input.
See also Appendix G of Amarilli and Benedikt [2018b] for a longer exposition of the linearization technique. Thus for general arity, we can use this algorithm to get an entailment problem with the same query, a data set exponentially bounded in the input data and a set of s, each polynomially-sized in the inputs. By applying a standard first-order query-rewriting algorithm to the query, we reduce this problem to evaluation of a union of conjunctive queries get a UCQ on . The size of each conjunct in is polynomially-bounded in the inputs, and so each conjunct can be evaluated in time , giving an ExpTime algorithm in total.
For fixed arity, we apply the same algorithm to get an entailment problem using s of bounded arity, which is known Johnson and Klug [1984] to be solvable in NP. Further, when the query is atomic, entailment with s is in PTime.
A.6 Proof of Theorem 5: Disclosure for Source Constraints and is PTime
We prove that when the source constraints are s and the mappings are projections, disclosure analysis is in PTime. By Theorem 1, it suffices to show that the problem is PTime. We will thus first reduce this problem a problem where is composed of constraints and is composed of a single unary fact .
Reachable predicates.
We define the entailment graph over a set of constraints . In this graph, nodes correspond to predicates and there is an edge for each constraint . Given an initial set of facts , one can compute the set of entailed predicates. This set is defined as the set of predicates reachable in the entailment graph starting from the predicates appearing in .
Visible position graph.
In studying tuple-generating dependencies, one often associates a set of dependencies with a graph whose edges represent the flow of data from one relation to another via the dependencies. See, for example the position graph used in defining the class of weakly acyclic sets of TGDs Fagin et al. [2005].
We develop another such graph, the visible position graph associated with a set of source constraints and mappings. The nodes are the pairs where is a predicate, and there is an edge when we have an (either a source constraint or a mapping rule) with . We refer to a node in this graph as a position. A position of a relation in the source schema is said to be visible if there is a path from to a node such that belongs to the global schema. Another other position is said to be invisible. We see that when a position is visible then for any fact that holds in a possible world for we must have .
Note that if we have , is exported to , and position of is visible, then position of is visible as well.
Reduction to entailment.
Let and . We will reduce the problem to the problem , where is a set of s, and is a CQ. Our reduction proceeds as follows:
- •
We transform the schema for sources creating a predicate for each source predicate , where the arity of is the arity of minus the number of positions that are visible.
- •
.
- •
is built as the set of constraints where are fresh distinct variables and .
- •
is formed from the set of constraints such that there is an exported variable lying in an invisible position of . For each such constraint, contains the constraint where denotes the projection of to the invisible positions of , and similarly for and .
- •
is formed from constraints such that and there is an exported variable lying in a visible position of , exported to an invisible position of . For each such constraint includes the constraint where denotes the projection of to the invisible positions of and similarly for .
- •
the query is built from by first replacing each conjunct with its corresponding predicate , projecting out the visible positions. After this, for every variable that occurred in within both a visible and an invisible position, is replaced by , while we add a conjunct .
Correctness of the reduction.
The correctness of the reduction is captured in the following result:
Proposition 3**.**
For any source constraints consisting of s and consisting of projection mappings, there is a disclosure over a schema with constraints mappings and secret query if and only if holds.
Proof.
We start with the argument for the left to right direction. We let be a counterexample to the entailment . By Theorem 9, we can assume that is formed by applying the chase procedure to . In particular, each fact in can be assumed to use a predicate in .
We show that there is an instance that is a counterexample to
[TABLE]
and thus (by Theorem 1) we cannot have a disclosure. We form by filling out each visible position with . We claim that satisfies each source constraint . Suppose that holds in . Then holds in , where projects on to the invisible positions.
- •
First, suppose there is a variable in an invisible position of exported to an invisible position in . Then since satisfies , we know that for some , holds in , By the definition of , we have that holds, where fills out each visible position with . We can see that is the required witness for .
- •
Next, suppose there is a variable in a visible position of exported to an invisible position in . Then we must have . Since is in and satisfies , we have holding in for some , and hence holding in for some tuple where fills all the visible positions. Thus holds in this case as well.
- •
Finally, note that a variable at an invisible position cannot be exported to a visible position. Therefore the only remaining case is the case where no variable has been exported. Since is reachable, then is also reachable therefore there is a constraints and thus holds in
We next claim that the image of under agrees with .
- •
For every global schema predicate , occurs in the the image of under . This follows easily from the fact that contains .
- •
If holds in the -image, then because each visible position was filled out with , we must have each . Thus the result follows.
Note that from the preceding claims, we know that is a possible world for . Finally, we claim that does not satisfy .
- •
Suppose with homomorphism as a witness. Since is a possible world for , for any variable occurring in a visible position, . Let be formed from the restriction of to variables that occur in , by mapping the additional variable to . Note that in , holds. For this, we see that is a homomorphism witnessing that . This is a contradiction to the fact that is a counterexample to the entailment.
We now have argued that is a counterexample to , which completes the proof of the left to right direction.
For the other direction, suppose that is a counterexample to . Note that for any fact over the source relations in , for any visible position of , we must have . Form by projecting each fact in to the invisible positions of the relation. We will argue that is a counterexample to the entailment produced by the reduction.
- •
should contain therefore extends .
- •
The fact that was a solution to also guarantees that for all reachable predicates we have and thus and thus all constraints in are satisfied.
- •
Let us show that the constraints in are satisfied: fix a constraint , derived from source constraint . Fix a fact in . By definition of , extends to a satisfying in . Thus, since , there is a fact that holds in with whenever . We can project to the invisible positions to get a fact in . We claim that is a witness for the satisfaction of with respect to . Consider any variable exported from to position of where is mapped to value in . Then in , was exported to the corresponding invisible position in , and from this we see that as required.
- •
Now consider a constraint . Since holds only for in , we only have to verify that holds for some such that (where is the position of in ). Let us suppose that was derived from source constraint where is the position of the exported in and is the position of the exported variable in . By the definition of , we know that is a reachable predicate, and hence must hold for some in and since is visible we have . Because we have holds in for some such that and thus is the required witness for .
- •
Finally, we argue that does not satisfy . Suppose by way of contradiction that satisfies via homomorphism . Note that the variables of that do not occur in are those that occur only in visible positions within an atom of . We extend to a mapping from the variables of to by mapping each such variable to . We argue that is a homomorphism of to . Consider an atom of , where correspond to the invisible positions. Suppose first that the corresponding atom of is of the form where is obtained from by replacing any variable shared with a visible position by . We know that holds in because is a homomorphism. Thus holds in for some . By the properties of visible positions and the fact that is a possible world for , we see that each . Thus not only preserves the atom , but it also preserves the additional atom , since holds in . thus is a homomorphism, contradicting the fact that is a counterexample to .
Since extends , satisfies the constraints , and does not satisfy the query , it is a counterexample to the entailment, completing this direction of the argument. ∎
Overview of PTime algorithm for entailment with s over a single fact.
At this point we have restricted to a CQ entailment problem for a set of s and a single fact. It was claimed in Kikot et al. [2011] that there is a polynomial time query rewriting for s, and from this it would easily follow that our entailment problem is in PTime (query evaluation is PTime when these is a single fact). However later work (footnote on page 38 of Bienvenu et al. [2018]) refers to flaws in this argument, and says that polynomial rewritability is open. We therefore give a direct proof that such an entailment problems are in PTime. This will proceed via several steps:
- •
A reduction to the case of “binary schemas”: those where the arity of each predicate is at most .
- •
Query simplification, which will reduce the query to a connected acyclic query.
- •
Reduction to atomic entailment.
Reduction to binary schemas.
We begin by using verbatim an idea of Kikot et al. [2011], reducing to the same problem but when the input schema is binary. We do this via a standard reduction of general arity reasoning to binary reasoning, introducing predicates for every relation of arity and each ; informally these state that is the value in position of -tuple . We also introduce a predicate for each predicate ; informally this states that there is some tuple in the predicate . We translate each exporting a variable from position to position to a . For each , that is not exporting a variable, we create a rule . We also create rules and for each predicate and where is the arity of . Finally the query is transformed into where each conjunct is transformed into the conjunction , for a fresh variable . Finally the database over the binary schema is built in the following way: for each fact of the initial database, we create a fresh value and we add the conjunct for where is the arity of and we also add . Further details can be found in Kikot et al. [2011]. Note that, in the resulting problem, each frontier-0 rule produced has a body with an atom over a unary predicate.
Proposition 4**.**
The transformation above preserves query entailment.
Special form of the chase: annotated chase forest.
In the case of s the chase process applied to our single-fact instance produces an in instance that will be infinite. However, it has a special shape that we can exploit. For the remainder of this section, by we consider an instance formed from a restricted chase sequence, in which a witness to a TGD is added to instance for binding to only if . It is known Fagin et al. [2005] that in Theorem 9 it suffices to consider such instances. The annotated chase is a node- and edge-labelled forest formed from as follows:
- •
the nodes are the values of
- •
the node label of a value is the collection of unary predicates holding at
- •
an edge labeled by fact mentioning and connects a value to a value if holds in and is generated in the chase step that produces .
We can see that this graph is a forest where he roots are (the value where holds) as well as some other trees rooted to reachable facts generated from frontier-[math] dependencies and thus rooted at elements where holds for some . Further, since the chase is restricted, we can see that this graph has the unique adjoining label property: for each , for each predicate , there cannot be two nodes adjacent to such that the edge from to and from to both are labelled with the same predicate and have in the same position. Furthermore, the restricted chase also ensures that the forest is composed of at most one tree per predicate since all the roots that are produced needs to be different.
First query simplification: eliminating forking pairs.
Given a CQ , a pair of distinct atoms and sharing the same predicate and a variable at the same position (i.e. and or and ) is a forking pair of . We say that a query is non-forking when there are no forking pairs.
Proposition 5**.**
If a CQ has a forking pair and and is the query where the variable is replaced with , then
Proof.
Let . If holds in , then clearly the same holds of . Conversely suppose holds in via homomorphism , and suppose . This gives us a violation of the unique adjoining label property. ∎
Applying the proposition above, we can assume that is non-forking. Without loss of generality, we can also assume that is connected (otherwise we can test the entailment of each connected part).
Second simplification: reducing to acyclic queries
The CQ-graph. of a CQ is the node- and edge-labelled graph whose nodes are the variables of and whose edges are labelled with atoms of such that:
- •
an edge between variables labelled with and is labelled with the binary atoms containing both and ;
- •
a node is labelled with the set of unary predicates in containing .
The CQ-graph said embedded in some annotated chase forest if there is a homomorphism preserving edges, i.e. if there is an edge to labeled with then should contain an to labeled with and nodes, i.e. if there is a predicate on the node then there should be in . The homomorphism is called an embedding of in .
It is immediate from the completeness of the chase procedure that for any annotated chase forest for , a query is entailed if and only if its CQ-graph is embedded in . Our reduction to the case of a CQ with acyclic CQ-graph will depend heavily on the following observation:
Proposition 6**.**
Any embedding of a connected and non-forking CQ into an annotated chase forest for must be injective.
Proof.
Let be a connected and non-forking and let be an embedding. Let us prove by induction on the size of the path between and that when .
Two neighboring nodes cannot be sent to the same value. For a path of size , if we have such that forms a path in the CQ-graph of then has to be different than otherwise the label from to and from to would be the same and there would be a forking pair in .
Let with be a path in the CQ-graph between and . By induction the for are all distinct and thus the distance between and is at least hence is at least at distance of . ∎
Our reduction to the acyclic case follows immediately:
Corollary 5**.**
If a connected non-forking CQ is entailed by over then the CQ-graph of is acyclic.
Proof.
is entailed The image of the CQ-graph through the injective homomorphism is a forest. ∎
Determining entailment for acyclic connected graphs.
We now give the final setp in our algorithm, which deals with deciding entailment of a connected, non-forking query , which by Corollary 5 must have an acyclic CQ-graph. Given an acyclic connected undirected graph and any vertex of the graph, we can direct it be a tree with as the root. Thus for such a having variables, the tree arrangements are the possible ways to root the CQ-graph of the query . We are particularly interested in arrangements of where the directionality from parent to child reflects the entailment structure relative to between atoms in the query. A tree arrangement of is faithfully entailed if for every variable in with parent in the tree, there is an atom containing and not containing such that entails , where is the conjunction of all atoms whose variables are contained in ; in the case that is the root, we require alone to entail .
In a faithfully-entailed tree arrangement, the conjunction of atoms holding at the root of the tree entails the existence of the whole tree. We can further find a single atom that entails the whole tree. A root-generating atom of a tree arrangement is an atom (not necessarily in ) containing the root variable , such that generates all atoms mentioning .
Proposition 7**.**
A faithfully entailed tree arrangement for must have a root-generating atom.
Proof.
We know that must hold in the chase of the initial fact under , and by Proposition 6 we know that there is an injective homomorphism from to the chase. Consider the point in the chase process where value is first generated. This occurs by firing some rule with an atom, where the head has either a binary atom or a unary atom . We consider the case where the atom is binary, and where the generated atom is . In this case the fact must generate every fact containing . Thus we can take the atom , where is a fresh variable, as a root-generating atom. The case of unary atoms and the case where is in the second position of the fact is similar. ∎
Given a tree arrangement of and variable of , denotes the the restriction of to the variables that are descendants of in .
The main idea of our PTime algorithm is that it suffices to descend through the tree arrangement, checking some entailments for each parent-child pair in isolation.
Proposition 8**.**
There is a PTime algorithm taking as input a variable in a CQ , a tree arrangement of , and an atom containing such that the existential quantification of is entailed by , and determining whether is faithfully entailed and is a root-generating atom.
Proof.
We first check whether is a root-generating atom, using a PTime inference algorithm for s Cosmadakis et al. [1990]. We then consider each child of in the tree arrangement. We know that there is exactly one conjunct containing and . We check whether entails , and then call the algorithm recursively for and . If each recursive call succeeds, the algorithm succeeds. ∎
From the prior proposition we get a PTime algorithm for the arrangement as a whole:
Proposition 9**.**
There is a PTime algorithm taking a tree arrangement of CQ , and an atom containing the root of the arrangement, and determines whether the whole tree arrangement can be faithfully entailed and is a root-generating atom.
Proof.
We first need to check that is entailed, which amounts to checking that . As before this can be done using Cosmadakis et al. [1990]. We then utilize the algorithm of Proposition 8. ∎
Note that Proposition 9 gives a polynomial time algorithm for checking whether a tree arrangement can be faithfully entailed. We can apply the algorithm of the proposition with every possible unary and binary atom containing the root variable. In the binary case, we consider all atoms containing the root variable and an additional fresh variable.
Putting it all together.
Putting together our reduction to -entailment (Proposition 3), our schema simplification (Proposition 4) the query simplifications (the reduction to connected CQs, Proposition 5, and Corollary 5), and our PTime algorithm for simplified queries (Proposition 9) we obtain the proof of Theorem 5.
Appendix B Detailed Proofs from Section 4: Lower Bounds for Disclosure
B.1 Proof of the first part: Theorem 6:
2ExpTime-hardness for and without arity bound
Recall the first part of Theorem 6:
is 2ExpTime-hard.
Recall that Theorem 1 relates disclosure to a problem on a very simple instance. Also recall from Section 3 the intuition that such a problem amounts to a classical entailment problem for a CQ over a very simple instance, using the source dependencies and s: of the form , where will be the body of a mapping. We show here how to simulate the run of an alternating ExpSpace Turing machine without explicitly using s, instead using inclusion dependencies as source constraints coupled with guarded mappings. An alternating Turing machine is a -tuple where:
- •
is the finite set of states
- •
is the finite tape alphabet
- •
and are functions from to
- •
is the initial state
- •
is a function from to that specifies the type of each state.
We assume that always alternates between existential and universal states, and that there is a unique final state, that can be reached only if the head is in the first cell and contains a specific symbol. All of these assumptions can be made without loss of generality. If is in a configuration where whose state is such that , the configuration is said to be accepting. If is in a configuration where whose state is such that , the configuration is said to be accepting if its and successors (obtained after applying or ) are accepting. If is in a configuration whose state is such that , the configuration is said to be accepting if its -successor or its -successor is accepting. A more thorough introduction to Turing machines can be found in Papadimitriou [1994].
We first present the reduction, and show its correctness in the next subsection.
B.2 The Reduction
We will create constraints and mappings that will serve to perform the following tasks:
- •
generate addresses for cells of in such a way that one can check whether two addresses are consecutive in a guarded way. The same addresses will be used for all the configurations. This will be done by a mapping creating copies of two individuals that represent [math] and , along with inclusion dependencies that perform permutations and generate addresses;
- •
encode the content of a cell, the position of the head, and the state of the head: for each cell, we store a vector whose length is the size of . Each position corresponds to an element of that set; we will arrange that the position contains if and only if the cell contains , and either the head is over that cell and is in state , or the head is not over that cell and . All values are first freshly instantiated by inclusion dependencies, and mappings are then responsible for unifying the correct positions with ;
- •
ensure that the tape that is associated with a successor of a configuration can be obtained by a transition of the Turing machine: this is also performed by using a mapping to enforce the correct positions of the cell to be unified with ;
- •
check that configurations are accepting: this is the case either when the corresponding tape is in a final accepting state, or when it is in an existential state and one of the two successor configurations is accepting, or it is in a universal state, and both successor configurations are accepting.
Let us describe the source signature. For each predicate, we will explain what feature of the ATM it should represent in the appropriate instance generated by the constraints. By “the appropriate instance”, we mean the visible chase of the initial instance over the source constraints and mappings: this was introduced after Theorem 9, and it was noted that it is the canonical instance for the source and targets to consider for disclosure.
We use to represent a tuple , and to represent the tuple of size .
- •
. The intended meaning is that a configuration is universal and has as children and , and that the acceptance bit of is , of is and of is . The last four positions are placeholders: for the root of the tree of configurations, for , for a value representing [math] and for a value representing .
- •
: same intended meaning, except that is existential.
- •
with intended meaning that the cell of address of the tape represented by has a content represented by , while the previous cell has a content represented by and the next cell has content represented by . The last four positions are placeholders for the root of the tree of configurations, , a value representing [math] and a value representing .
- •
with intended meaning that the cell of address in configuration contains at the position of the representation of its content. and play similar roles for the cell before and after the cell of address .
- •
is an auxiliary predicate used to generated an exponential number of addresses.
- •
states that is the -successor of (and similarly for )
Below we will always use the symbol to range over .
The structure generated by the inclusion dependencies is represented Figure 1. Atoms are represented by geometric shapes in the inside of which are arguments (some are omitted to ease the reading). The atoms form a tree shaped structure, and induce a tree structure on the configuration identifiers: for instance, is the parent of and . atoms are associated with a configuration identifier (for instance, those represented are associated with ), and has the parent configuration identifier to ensure guardedness of the mappings used in the following reduction. Note that the elements used to describe the cell’s addresses ( and ) also appear in the atoms, to ensure guardedness.
Initialization.
We first define a mapping , introducing some elements in the visible chase. The definition of this mapping is:
[TABLE]
Generation of the tree of configuration.
-successors have themselves - and -successors, and are existential if their parent is universal:
[TABLE]
And similarly for and for the -successor.
Universal and existential acceptance condition.
If both successors of a universal configuration are accepting, so is . We create a mapping with definition:
[TABLE]
If the -successor of an existential configuration is accepting, so is . We create a mapping with definition:
[TABLE]
We create a similar mapping for the -successor.
Tape representation and consistency of tapes.
We now focus on the representation of the tape and its consistency. We generate addresses and associated values:
[TABLE]
[TABLE]
will generate addresses to represent the tape associated with its fifth argument. To emphasize this, we use the letter (as node) at this position, while the fourth argument contains its parent configuration, denoted by .
[TABLE]
For each address, we initialize its content (as well as the content of the previous and next cells) by fresh values .
[TABLE]
Note that the values , and are vectors of length the size of . In particular, we use the notation to represent a vector of same length, composed of fresh variables, except for the position , that contains .
We now use mappings to force some of these values to be equal to . Each position of represents an element of , and we will enforce exactly one of these positions to contain . If the head of the Turing machine is on the cell represented, then the position of corresponding to where is the letter in the cell and the state of the Turing machine, will contain . Otherwise, the position of corresponding to will contain .
As we store the content of a cell in several atoms, we must ensure that the tape associated with a configuration is consistent, by checking that is consistent with from the next cell. To ensure guardedness, we first introduce auxiliary predicates and that define the content of the bit of the value of the current, previous and next cells:
[TABLE]
We now introduce the definition of a mapping which ensures the consistency of the tape content (note that the first atom is a guard):
[TABLE]
is defined similarly to deal with the previous cell.
We enforce the tape of the initial configuration to have the head of the Turing machine on the first cell (and assume w.l.o.g that this is represented by the first position of containing ) and all the other cells containing (and we assume w.l.o.g that this is represented by the second position of containing ). We thus create the mappings , for the the first cell, having definition:
[TABLE]
and we introduce the mappings , for all the other cells, having definition:
[TABLE]
Note that this data is associated with the children of the root (as is both in the fourth and last minus three positions of the atoms), and not with the root itself, due to the choice of keeping in the identifier of the parent of the considered configuration.
We then check that the tape associated with the -successor of a configuration is indeed obtained by applying an -transition. This is done by noticing that the value of each cell of the -successor is deterministically defined by the value of the cell and its two neighbors in the original configuration (the neighbors are necessary to know whether the head of the Turing machine is now in the considered cell). To ensure guardedness, we first define a predicate marking -successors (and similarly for -successors):
[TABLE]
Let us consider a cell of address in . We assume that its content is represented by , while the content of its left (resp. right) neighbor is represented by (resp. ). We represent the fact that this implies that the content of the cell of address is in the -successor of by the following mapping :
[TABLE]
Note that the above formulation requires the content of the previous and of the next cells, which makes this mappings not applicable when is the address of either the first or the last cell. We thus add rules to specifically deal with these two cases (that looks at the content of the current and next cell when is a vector of , and at the content of current and previous cell when is a vector of ). Note that there is only polynomially such mappings to be built. And we finally create a mapping enforcing that configurations whose tape is in an accepting state (which we assume w.l.o.g. corresponds to the case where the first cell contains the bit) are declared as accepting.
[TABLE]
The policy query is
[TABLE]
We will show that this policy query is disclosed if and only if the original Turing machine accepts on the empty tape.
B.3 Proof of Correctness
We show that the policy is disclosed if and only if accepts on the empty tape. By Theorem 1, the policy is disclosed if and only if the corresponding problem has a positive answer. Further, this holds if and only if the policy query holds on the result of the visible chase (introduced after Theorem 9). We thus focus on showing the equivalence of the acceptance of the empty tape by and the satisfaction of the policy in the visible chase.
Let us start by describing some relationships between the visible chase of and the run of . As contains , there is in the visible chase the atom
[TABLE]
where all individuals but are nulls.
Definition 1** (Tape Representation).**
Let be a tape (with head position and state included) of . A representation of is a set of atoms
[TABLE]
where ranges over the binary representations of the addresses of , and such that for any cell of the following holds:
- •
for any , contains fresh nulls except for the bit that represents the content of at address , where it contains
- •
for any except the representation of the leftmost cell, contains fresh nulls except for the bit that represents the content of at address ; in this bit it contains (* exclusively contains fresh nulls for the leftmost cell)*
- •
for any except the representation of the rightmost cell, contains fresh nulls except for the bit that represents the content of at address ; on this bit it contains (* exclusively contains fresh nulls for the rightmost cell)*
In that case, is called a representative of .
Lemma 1**.**
* and , as defined above Definition 1, are representatives of the initial tape.*
Proof.
We show the result for , the same reasoning being applicable to . As the atom
[TABLE]
belongs to the visible chase, atoms of the shape
[TABLE]
for any vector with for any , are generated, where all nulls from and are fresh (thanks to the rules involving ). As the first argument and the ante-ante-penultimate argument of such an atom are equal, the definition of maps to the atom of address , and the body of maps to all the other atoms. Applying and then ensures that is a representative for the initial tape, as no other mapping may merge a term of these atoms. ∎
Lemma 2**.**
If is a representative of a tape and if the visible chase contains
[TABLE]
then (resp. ) is a representative of the tape (resp. ) obtained by applying the -transition (resp. -transition) applicable to .
Proof.
We show the result for the -successor, the same reasoning being applicable for the -successor. As the visible chase contains
[TABLE]
it also contains atoms of the shape:
[TABLE]
for any vector , where all nulls from and are fresh. Note that is necessary distinct from (as it is the representative of a tape). Hence neither nor may unify a term with . As is a representative of , for any address, if the bit of represents the actual value in at address , then the visible chase contains where is the binary encoding of . Similarly, and also belong to the visible chase where applicable. Then for all addresses, an application of the relevant mapping of the shape merges the null at the position representing the content of with . Applying and then ensures that is a representative for . ∎
Wrapping up the previous two lemmas, we get that there is a tree structure in the visible chase that corresponds exactly to the tree of configurations of the run of : the two individuals and are representatives of the initial configuration, and their children (which are the individuals at the second and third individuals in the atom in which they appear at the first position) are representatives of the configurations that can be reached with an or transition. It remains to check that the argument representing the accepting status of a configuration are correctly set, which is the topic of the following lemma.
Lemma 3**.**
If is the representative of a tape, there is in the visible chase an atom of the shape
[TABLE]
if and only if accepts on .
Proof.
Let be a tape of representative . There are four cases in which accepts on :
- •
the state of is final in : this is the case if and only if merges the fourth argument of with
- •
the state of is universal in and both its successors are accepting: by induction assumption (on the number of transitions that need to be applied to prove acceptance of a tape), both accepting bits of and are unified with , and thus the accepting bit of is unified with thanks to
- •
the state of is existential in and its -successor is accepting: by induction assumption, the accepting bit of is unified with , and thus the accepting bit of is unified with thanks to
- •
similar case, with the -successor.
∎
Let us now use the above lemmas to show that accepts if and only if the policy query holds in the visible chase. If accepts, let us consider an accepting run of . From Lemmas 1 and 2, we can build a configuration tree that contains a representative for all the tapes that are involved in this run. From Lemma 3, the accepting bits of the representative are set adequately, and the policy query holds in the visible chase.
Conversely, let us consider a visible chase sequence such that the policy query holds in its result. Let us first remark that the argument of the policy query appearing in the first position is equal to the argument in the ante-ante-penultimate position. This implies that none of the witnesses of mappings other than need to be applied in order to entail the policy query, which can be seen from the following three facts: (i) none of the positions that may contain a configuration identifier may be unified with ; (ii) all mappings contain configuration identifiers (iii) only the witness associated with may generate an atom of the shape
[TABLE]
This implies that in the visible chase sequence entailing the policy query, we start by introducing and as in Lemma 1. Let us now consider the smallest set of configuration representatives that fulfills the following conditions:
- •
and are in
- •
if is in and the tape associated with is in a universal state, then both successors of are in
- •
if is in and the tape associated with is an existential state, then a successor of having its acceptance bit equal to is in .
By the previous lemmas, there exists an accepting run of going exactly through the represented configurations.
B.4 Second Part of Proof of Theorem 6: ExpTime-hardness for Inclusion Dependencies and Guarded Maps in Bounded Arity
Recall the statement of the second part of Theorem 6:
is ExpTime-hard even in bounded arity.
In the proof of Theorem 6, we used predicates of unbounded arity only to generate exponentially many cell addresses. Here, we use only addresses, and can encode their content through predicates to . However, the proof follows the same line of argumentation as in Theorem 6.
Let us describe the source signature.
- •
states that a configuration is universal and has as children and , and that the acceptance bit of is , of is and of is . The last two positions are placeholders for the root of the tree of configurations, and .
- •
: same meaning, except that is existential.
- •
states that the cell of address of the tape represented by has a content represented by . The last position is a placeholder for .
- •
states that the cell of address in configuration contains at the position of the representation of its content.
- •
states that is the -successor of (and similarly for )
The symbol always ranges over .
Initialization.
We first define a mapping , introducing some elements in the visible chase, whose definition is:
[TABLE]
Generation of the tree of configuration.
-successors have themselves - and -successors, and are existential if their parent is universal:
[TABLE]
And similarly for and for the -successor.
Universal and existential acceptance condition.
If both successors of a universal configuration are accepting, so is . We create a mapping having definition:
[TABLE]
If the -successor of an existential configuration is accepting, so is . We create a mapping having definition:
[TABLE]
We create a similar mapping for the -successor.
Tape representation and consistency of tapes.
We now focus on the representation of the tape and its consistency. For each configuration, we generate cells whose content is initialized freshly:
[TABLE]
and similarly for existential configurations and for the -successors. Note that the values , and are again vectors of length the size of . We again use the notation to represent a vector of same length, composed of fresh variables, except for the position , which contains .
To ensure guardedness, we first introduce auxiliary predicates and that define the content of the bit of the value of the current, previous and next cells:
[TABLE]
We enforce that the tape of the initial configuration has the head of the Turing machine on the first cell (and assume w.l.o.g that this is represented by the first position of containing ) with all the other cells containing . We also assume w.l.o.g that the other cells containing is represented by the second position of containing . We thus create the mappings , for the the first cell, having definition:
[TABLE]
and , for all the other cells (), with definition:
[TABLE]
Note that this data is associated with the children of the root (as is both in the first and the penultimate positions of the atoms), and not with the root itself, due to the choice of keeping in the identifier of the parent of the considered configuration.
We then check that the tape associated with the -successor of a configuration is indeed obtained by applying an -transition. This is done by noticing that the value of each cell of the -successor is determined by the value of the cell and its two neighbors in the original configuration (the neighbors are necessary to know whether the head of the Turing machine is now in the considered cell). To ensure guardedness, we first define a predicate marking -successors (and similarly for -successors):
[TABLE]
Let us consider a cell of address in . We assume that its content is represented by , while the content of its left (resp. right) neighbor is represented by (resp. ). We represent the fact that this implies that the content of the cell of address is in the -successor of by the following mapping :
[TABLE]
As in the non-bounded case, the first (resp. last) cell should be dealt with separately, as there is no content in the (non-existent) previous (resp. next) cell. And we finally create a mapping enforcing that configurations whose tape is in an accepting state (which we assume w.l.o.g. corresponds to the case where the first cell contains the bit) are declared as accepting.
[TABLE]
The policy is
[TABLE]
We can verify that this policy is disclosed if and only if the original Turing machine accepts on the empty tape, using a similar reasoning to the unbounded case.
B.5 Final Part of Proof of Theorem 6:
Reduction from and to and
Theorem 6 states a 2ExpTime lower bound for general arity and an ExpTime lower bound in bounded arity for two different cases. The first case was when the source constraints are s and the mappings are guarded. The previous sections of the appendix have gone through the proofs of this case in detail. We now finish the proof of Theorem 6 showing:
is 2ExpTime-hard, and is ExpTime-hard even in bounded arity.
The proof of both of these assertions follows directly from Corollary 4 (the general reduction of maps presented in section A.2).
We have seen that and reduces to and , therefore we have the lower bound for from the lower bound
B.6 Proof of Theorem 7: ExpTime-hardness for Inclusion Dependencies and Atomic Maps,
and for s with Projection Maps
Recall the statement of Theorem 7:
and are both ExpTime-hard.
We first focus on the case of . We adapt the construction used for PSpace-hardness of entailment with s Casanova et al. [1984] to show ExpTime-hardness for source constraints and atomic maps. We start with an alternating (rather than deterministic in Casanova et al. [1984]) Turing machine and an input , and consider the problem asking whether there exists a halting computation of that uses at most cells. As in the original reduction, we use inclusion dependencies to simulate the transition relation of . The adaptation lies in the additional use of a fresh position holding a configuration identifier, and the generation of a tree of configurations, as in the reduction presented in Theorem 6.
Let us describe the signature:
- •
states the configuration has quantification , has accepting bit , a tape represented by . the last argument will always hold in the visible chase;
- •
names two successors configurations and , with the configurations consisting of acceptance bits and , which are obtained from by applying transitions and .
Let us turn to the description of and subsequently . represents the content of the tape: for each position of the tape, there is an argument for each pair of . Intuitively, this argument is equal to if and only if the position contains the corresponding letter and head, and a fresh null otherwise.
We introduce a mapping that initializes the tape:
[TABLE]
As in the proof of Theorem 6, we propagate the acceptance information using mappings. For a universal state, we use a mapping with definition:
[TABLE]
For an existential state, we use two mappings with definitions:
[TABLE]
and
[TABLE]
As before, we notice that the state of a cell after applying a transition is deterministically defined by its content as well as the content of its left and right neighbor. The following inclusion dependency states that from any configuration, we can try to apply all possible transitions to generate the - and -successors:
[TABLE]
We now generate the tape associated with the -transition (and similarly for the -transition):
[TABLE]
where denotes the dual quantifier. Let us describe the vector . Suppose is the transition that checks whether position contains , position contains and the head in state , and position contains ; changes to , moves the head to the right and goes into state . Then is defined as follows:
- •
any argument that corresponds to a position distinct from or is chosen equal to the argument at the same position in ;
- •
the argument that corresponds to now contains the value of at position , and all other variables appearing in an argument corresponding to position are existentially quantified;
- •
the argument that corresponds to now contains the value of at position , and all other variables appearing in an argument corresponding to position are existentially quantified.
Note that here we have a distinction with the previous reduction: we do not check that a transition is applicable before applying it, as this would be out of the capabilities of . However, the same argument as in Casanova et al. [1984] proves that a configuration reached from simulating a non-applicable transition cannot lead to an accepting state. We choose as a policy:
[TABLE]
Proposition 10**.**
The policy is disclosed if and only if there is an accepting computation that uses at most cells.
The lower bound for follows by reduction:
Proposition 11**.**
There is a polynomial time reduction from to .
Proof.
Given a mapping where there may be repeated variables in the head atom, we replace it by a projection mapping
[TABLE]
where is a new predicate whose arity is the number of distinct variables in . has the same variables as , but with no repetition. For example, if the head of the original rule is , then the new rule has head .
We additionally add the source constraint:
[TABLE]
It is easy to see that this transformation preserves disclosure. ∎
B.7 Proof of Theorem 8: lower bounds for s in bounded arity
Recall the statement of Theorem 8:
is 2ExpTime-hard in bounded arity.
This proof will be very similar to the proof of Theorem 6. We will provide a reduction from an alternating ExpSpace Turing machine to and s. We show how to simulate the run of an alternating ExpSpace Turing machine with inclusion dependencies and s.
The main difference between the proof of Theorem 6 and the proof here is that in Theorem 6 each cell carried bits specifying the address of the cell. In this version we cannot use this trick as we are using a reduction where all predicates are bounded. For each configuration, the tape will represented in the leaves of a full binary tree of depth . For a cell , the bits specifying the address of a will scattered across the predicates in its lineage, each holding one bit of the address: an internal node has two descendants each carrying four values . We will have and when then node represents the addresses where the -th bit is [math] and , when it is .
Let us describe our source signature:
- •
states that a configuration is universal and has children and , and that the acceptance bit of is , of is and of is . The last four positions are placeholders for the root of the tree of configurations, two values representing [math] and two values representing .
- •
: same meaning, except that is existential.
- •
for , corresponds to a node of depth in the binary tree representing the tape of a configuration. In this predicate is the parent of the node, is the current node, will be equal to when the node if the first child of and equal to otherwise. will be the complement of (i.e. implies and implies ).
- •
states that the cell at position contains the data represented by . and play similar roles for the previous cell and the next cell.
Critical element.
We create a mapping defined as . The relation will allow us to test whether a variable is equal to .
Initialization.
We first define a mapping introducing some elements in the visible chase, whose definition is:
[TABLE]
Generating the tree of configuration.
-successors have themselves - and -successors, and are existential if their parent is universal:
[TABLE]
And similarly for and for the -successor.
Universal and Existential Acceptance Condition.
If both successors of a universal configuration are accepting, so is . We create a mapping with definition:
[TABLE]
If the -successor of an existential configuration is accepting, so is . We create a mapping of definition:
[TABLE]
We create a similar mapping for the -successor.
Generating the tape cells.
We now focus on the representation of the tape and its consistency. We generate addresses and associated values:
[TABLE]
[TABLE]
And for we have:
[TABLE]
[TABLE]
Finally for we have:
[TABLE]
Initialization of the tape.
For the case 0, we use the pattern and introduce a mapping defined as:
[TABLE]
For all others cases, with a first at the -th bit, we use the pattern and introduce :
[TABLE]
Ensuring the coherence between and .
We need to check the coherence between in an address and at the previous address. As usual when is at the address then needs to be checked against the at the address . We introduce the mapping :
[TABLE]
Encoding transitions.
As in previous reductions, we encode the transitions of as a set of (where is the value of the cell, is the value at the cell before and at the cell after and is the written value).
For our transition, we need to write at the same address s where lies in the -child of the configuration of . To be at the same address, we need to check that the path follows the same bits (that we note here ). We use the mapping defined as:
[TABLE]
Encoding final states.
Whenever the current state is we need to enforce that the bit is set. To enforce that the bit is set, we introduce the following mapping, for each value marking a final state:
[TABLE]
Policy.
The policy query is
[TABLE]
We can verify that the policy query is disclosed if and only if the original Turing machine accepts on the empty tape.
B.8 Maximality of our Tractability Conditions
Recall that Theorem 5 shows that we can get tractability by simultaneously restricting our constraints to be s and our mappings to be s. Recall also that a is an with at most one exported variable. Here we show that these restrictions are maximal in the following sense: if we increase from s to s with frontier one we get intractability. We also get intractability if we stick with s but we allow the mappings to be atomic. Let denote the s with at most one exported variable. In fact, we will show something stronger (here denotes no constraints):
Theorem 10**.**
* and are both NP-hard.*
In order to prove our results, we will rely again on Proposition 1, which states that testing for disclosure is equivalent to evaluating the policy query on the result of the visible chase process. The process starts with the instance , which has source witnesses for each tuple in . It proceeds by alternating traditional chase steps and merge steps, which are applications of a . It is well known that query evaluation is NP-hard on arbitrary instances. But the constraints that we are considering in this section do not allow us to generate arbitrary instances as a visible sections. In this section we will exhibit a instance on which query evaluation is NP-hard, but where can be the result of the visible chase using s but no constraints, or with a visible chase using constraints and s.
The instance .
will have one relation with atoms. We present the content of below. Empty cells are filled with fresh nulls, is the only value shared by two tuples and correspond to nulls that are shared inside a tuple:
[TABLE]
Note that this is a single-shared value instance: only one value, namely , is shared among multiple tuples. Such instances can be produced as the result of the visible chase over atomic mappings with no constraints. In this case:
[TABLE]
They can also be produced as the result of the visible chase over one projection mapping with s:
[TABLE]
The remainder of the argument is to show that CQ evaluation is NP-hard over this instance, via reduction from satisfiability of a propositional circuit (Circuit SAT).
General idea of the reduction.
The reduction that we provide will create a query for each instance of Circuit SAT. Without loss of generality, we suppose that is composed of wires , of negation gates and of binary OR gates . Wire that are not the output of any gate are the inputs of the circuit. We will suppose that the output corresponds to the wire .
We will build the query to contain conjuncts for each wire, each negation gate and each binary OR. Furthermore we will create a variable for each wire .
For the sake of readability, we present the conjuncts graphically, with each row representing an atom. A row with entries represents an atom where is a fresh existentially quantified variable when the cell is empty and the variable in the cell otherwise.
Wires.
For each wire , we will force the value of its associated variable to be either (when the wire carries the value true) or (when the wire carries false).
For each wire , we will have a conjunct:
[TABLE]
For the variable corresponding to the output wire we also add a conjunct:
[TABLE]
Negation.
For each negation gate , whose input is the wire and output is the wire , we will have the following conjuncts:
[TABLE]
Computing binary OR.
For the binary OR gate whose inputs are the wires and and the output is , we introduce the following conjuncts:
[TABLE]
Proof that this reduction captures Circuit-SAT.
Let us suppose that the circuit is satisfied. Towards showing that the query is satisfied in the instance , we first build a binding for the variables that are shared between multiple of the conjunct grouping above, which are exactly the “wire variables” . We do this by setting when and when . We now show that this binding extends to a valuation making the query true. Since all the other variables are not shared between the conjunct groups, it suffices to show satisfiability of each conjunct group in isolation.
- •
We see that all the conjuncts corresponding to wires are satisfied (even the special conjunct corresponding to the output).
- •
For the negation gate whose input is and output is . When and thus , we can set , and satisfy all 3 conjuncts. When and thus , we can set and and satisfy all 3 conjuncts.
- •
For an OR gate whose inputs are and , and whose output is . There are four cases:
- –
when and thus we can set
- –
when and and thus , we can set ,
- –
when and and thus , we can set ,
- –
when and thus we can set
In all cases, our conjuncts are satisfied.
Conversely, let us show that when the query is satisfied in our instance , then the circuit is satisfiable. Let be a homomorphism from the query variables to values. Since we have wire conjuncts constraining for each wire , we can see that or . We now consider the circuit assignment such that when and when . Let us show that this assignment witnesses the satisfiability of the circuit.
- •
The output wire is already constrained such that but it also has a special conjunct and the only remaining possibility for is and thus the output gate is set at .
- •
For each negation gate whose input is and output is :
- –
when then the conjunct holding and (i.e. the first row in the graphical representation) forces that . The conjunct holding and forces to be a fresh null or . But since appears in the column and in the column , we can only have and thus .
- –
when then the conjunct holding and forces that or . Then the conjunct holding and forces to be either a fresh null (when ) or (when ). Since appears in the column and in the column we cannot have fresh null, we conclude that, and thus .
In both cases, the semantics of the negation gate is respected.
- •
Consider each OR gate whose inputs are , and output is . First we have:
- –
when then necessarily
- –
when then or or
- –
when then necessarily
- –
when then or .
Therefore we see that:
- –
when then necessarily and thus
- –
when and then and thus
- –
when and then necessarily and thus
- –
when then and thus
in all cases we do have that the semantics of the OR gate is respected.
All in all, we have seen that the circuit is satisfiable if and only if the query has a solution on the visible chase.
B.9 Lower Bounds Inherited from Entailment
In the body of the paper we claimed that in several cases, we could show that the complexity of disclosure for a class was at least as hard as the complexity of query entailment for the class. We do not claim that there is a generic reduction from query entailment to disclosure. There is a simple reduction from entailment for special classes of instances to disclosure. More specifically, disclosure is easily seen to subsume entailment on instances of the form . But one needs to see that entailment on these specialized instances is as hard as entailment in general; this requires a separate argument for each class.
There are three cases of “lower bounds from entailment” that are used in the body of the paper: those whose lower bound is annotated with in Table 1. We give the details of each argument below.
B.9.1 In Bounded Arity, Disclosure with Source Constraints and Projection Maps is NP-hard
We begin by showing that disclosure for source constraints and projection maps inherits the NP-hardness that is known for query entailment with s. We do this via a direct reduction from -coloring. We make use again of the characterization of disclosure using the visible chase.
Let us take a graph that is an input to -coloring. In our reduction, the schema, the constraints and the mapping will not depend on this actual graph reduced. Only the query will depend on the graph.
We will have a single source relation and one mapping to create canonical values for in , which is the initial instance in the visible chase. Then we will use two constraints to create all permutations for these values: and .
Because of the mapping, will have three values with . Then the constraints ensure that the canonical model contains the six permutations of arguments for : , , , , , .
In our query variables will capture the coloring of each node, we note the variable associated with node . For each the query will include a conjunct .
We sketch the correctness of this reduction. The three values in encode the three possible colors in a coloring. The conjunct forbids the nodes and to be mapped to the same value (, or ) as has no solution in . Therefore if we have disclosure have a -coloring.
Conversely, if we have a -coloring, we can find a solution for the query in the visible chase.
B.9.2 In General Arity, Disclosure with Source Constraints and Projection Maps is PSpace-hard
Here we give a direct reduction111The original proof given here was faulty, many thanks to Balder ten Cate for noticing it and suggesting a fix. from the implication problem for s, or equivalently, the entailment problem for a single-atom instance and an atomic query. This is known to be PSpace-hard Casanova et al. [1984]. Given a problem (where has no repeated variables and is composed of IDs), we introduce a fresh predicate and we reduce it to the disclosure problem with query on the constraints plus and the mapping .
B.9.3 In Bounded Arity, Disclosure with Source Constraints and Projection Maps is 2ExpTime-hard
The last place where we claim that disclosure is at least as hard as entailment is for source constraints and projection maps in bounded arity. Here we will proceed by modifying the reduction used in Theorem 8. In this proof we used mappings for two distinct purposes. The initialization mapping was used to generate some values in the initial instance of the visible chase. In the proof, this mapping is an atomic map but not a projection map; but we can easily change this to use a projection map and an .
The remaining maps are used to ensure that certain values get merged with in the visible chase. Put another way, they are used to enforce certain s. But with the mappings and , we can ensure that the initial instance of the visible chase includes exactly one element satisfying . Once we have done this, we can mimic a
[TABLE]
by a source constraint
[TABLE]
This must be a , since the frontier has size one. Transforming the mappings according to this methodology, while leaving the query the same as in Theorem 8 gives us a modification of the hardness proof using source constraints and projection maps, as required.
Appendix C Refinements of our results
Atomic queries.
We have focused in the body of the paper on policy queries given as general CQs. But almost all of our lower bounds can be seen to hold for atomic queries. The only exceptions are stated in Theorem 4 and Corollary 3, where we claim PTime membership in bounded arity when restricting to atomic queries. Note that the NP-hardness bounds for general CQs corresponding to these upper bounds do not follow from our custom reductions, but using the simple reduction from entailment of CQs for the corresponding classes (e.g. s).
Non-Boolean queries.
In this appendix we have provided details of our upper bounds, assuming for simplicity that the queries are Boolean. But the proofs all extend to the non-Boolean case, as we now explain. To see this we need to go back to Theorem 1. We restate the theorem in a slightly different variant:
Theorem 11**.**
Benedikt et al. [2016]** When source constraints are TGDs and mapping rules are given by CQ definitions, then if a disclosure of a CQ (Boolean or non-Boolean) occurs, then the source instance which witnesses this can be taken to be .
The statement differs slightly from that of Theorem 1, since this version talks about getting an instance that agrees with on the mapping images, rather than having one that extends and satisfies the constraints.
The important point is that the result holds for non-Boolean queries as well as Boolean queries. Note that for a Non-Boolean query , all the facts that an attacker will see in the mapping image of will contain only the value . Thus the only query answers that can be disclosed to the attacker will involve the value . Inspection of each of the upper bound reductions will show that to detect such disclosures, it suffices to pre-process the query to add conjuncts for each variable , treating the result as a Boolean query.
Note that this transformation converts atomic queries to queries consisting of a single atom and an additional set of unary atoms. However, this will not impact the PTime claims in Theorem 4 and Corollary 3. For example in Theorem 4, we will need only to note that for atomic queries on a bounded arity schema, we will get an atomic query with a bounded number of additional atoms of the form . Entailment of such queries over a bounded arity schema with s is still in PTime.
Dependencies with multiple atoms in the head.
In some of our upper bound proofs, we assumed that the dependencies had a single atom in the head for simplicity, even when the classes in question (e.g. s) does not impose this. In our results that are stated for general arity, this assumption can be made without loss of generality, since one can simplify the heads by introducing intermediate predicates. In bounded arity, one must take some care, since one cannot polynomially reduce to the case of a single atom in the head. All of our results for bounded arity do in fact hold as stated, without any additional restrictions on the head. We explain how the argument needs to be customized for the most subtle case, Theorem 4.
Recall that the bounded arity case of Theorem 4 starts with the critical-instance rewriting, which reduces to reasoning with Guarded TGDs having a fixed side signature, the unary predicate . The linearization of Amarilli and Benedikt [2018a]; Amarilli and Benedikt [2018b], applied in this context, proceeds in two steps. First we generate all derived rules of the form:
[TABLE]
Notice that these are full-dependencies: no existentials in the head. This generation can be done inductively, via the dynamic programming steps in Amarilli and Benedikt [2018b]: one inductive steps composes a derived rule with one of the original non-full dependencies. A second step composes two derived full rules. This can be applied directly to the case of rules with multiple atoms in the head.
After this is done, the second step of linearization moves to an extended signature described as follows: for every relation of arity in the original signature (without ), and for each set of positions of , we introduce predicates of arity . Informally, stands in for . We lift every original dependency:
[TABLE]
to a linear TGD:
[TABLE]
where contains the positions corresponding to exported variables in . We lift every derived full dependency of the form:
[TABLE]
to a linear TGD:
[TABLE]
Finally we have linear TGD asserting that the semantics of become stronger as one adds to the set of positions :
[TABLE]
for .
We rewrite the query to the extended signature in the analogous way. The correctness of this transformation is given by an argument identical to that in the single-headed case in Amarilli and Benedikt [2018b].
Note that this transformation is in PTime when the arity is fixed. It reduces us to an entailment problem with s, still with bounded arity, but with multiple atoms in the head. Such an entailment problem can be shown to be in NP using a simple variation of the algorithm for s of Johnson and Klug [1984].
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abiteboul and Duschka [1998] Serge Abiteboul and Olivier Duschka. Complexity of answering queries using materialized views. In PODS , pages 254–263, 1998.
- 2Ahmetaj et al. [2016] Shqiponja Ahmetaj, Magdalena Ortiz, and Mantas Šimkus. Polynomial datalog rewritings for expressive description logics with closed predicates. In IJCAI , pages 878–885, 2016.
- 3Amarilli and Benedikt [2018 a] Antoine Amarilli and Michael Benedikt. When Can We Answer Queries Using Result-Bounded Data Interfaces? In PODS , pages 281–293, 2018.
- 4Amarilli and Benedikt [2018 b] Antoine Amarilli and Michael Benedikt. When Can We Answer Queries Using Result-Bounded Data Interfaces? In arxiv , 2018. available at https://arxiv.org/pdf/1706.07936.pdf .
- 5Amendola et al. [2018] Giovanni Amendola, Nicola Leone, Marco Manna, and Pierfrancesco Veltri. Enhancing existential rules by closed-world variables. In IJCAI , pages 1676–1682, 2018.
- 6Baget et al. [2011] Jean-François Baget, Michel Leclère, Marie-Laure Mugnier, and Eric Salvat. On rules with existential variables: Walking the decidability line. Artif. Intell. , 175(9-10), 2011.
- 7Bárány et al. [2015] Vince Bárány, Balder Ten Cate, and Luc Segoufin. Guarded negation. J. ACM , 62(3):356–367, 2015.
- 8Benedikt et al. [2016] Michael Benedikt, Pierre Bourhis, Balder ten Cate, and Gabriele Puppis. Querying visible and invisible information. In LICS , pages 297–306, 2016.
