New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou

TL;DR
This paper investigates the computational complexity of resilience in binary conjunctive queries with self-joins, providing new hardness results, structural characterizations, and a dichotomy for specific cases, advancing understanding of deletion problems in database queries.
Contribution
It introduces novel structural properties and complexity classifications for resilience in self-join queries, extending previous results and offering a dichotomy for certain restricted cases.
Findings
Identifies new structural properties affecting complexity.
Provides NP-hardness results for various query structures.
Establishes a dichotomy for queries with relations repeated up to twice.
Abstract
The resilience of a Boolean query is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects. In this paper, we give several novel results on the hardness of the resilience problem for binary conjunctive queries (i.e., conjunctive queries with relations of maximal arity 2) with one repeated relation. Unlike in the self-join free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences and permutations, which lead to various NP-hardness results. We also give novel involved reductions to network flow to show certain cases are in P. Overall, we give a dichotomy…
| Symbol | Query class |
| ● | all self-join conjunctive queries |
| ◐ | single-self-join (ssj) binary conjunctive queries |
| | ssj binary conjunctive queries with exactly 2 -atoms |
| | ssj binary conjunctive queries with exactly 3 -atoms |
New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins
Cibele Freire (Wellesley College), Wolfgang Gatterbauer (Northeastern University, Boston), Neil Immerman (University of Massachusetts Amherst), and Alexandra Meliou (University of Massachusetts Amherst)
Abstract.
The resilience of a Boolean query on a database is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects. In this paper, we give several novel results on the hardness of the resilience problem for conjunctive queries with self-joins, and, more specifically, we present a dichotomy result for the class of single-self-join binary queries with exactly two repeated relations occurring in the query. Unlike in the self-join free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences and permutations, which lead to various NP-hardness results. We also give novel involved reductions to network flow to show certain cases are in P. Although restricted, our results provide important insights into the problem of self-joins that we hope can help solve the general case of all conjunctive queries with self-joins in the future.
CCS Concepts: Theory of computation → Database theory; Information systems → Relational database model
1. Introduction
Various problems in database research, such as causality, explanations, and deletion propagation, examine how interventions in the input to a query impact the query’s output. An intervention constitutes a change (update, addition, or deletion) to the input tuples. In this paper, we study the resilience of a Boolean query with respect to tuple deletions. Resilience is a variant of deletion propagation that focuses on Boolean queries: it corresponds to the minimum number of tuples whose deletion causes the query to evaluate to false. In previous work (Freire et al., 2015), we provided a full characterization of the complexity of resilience for the family of self-join-free conjunctive queries (sj-free CQs) with functional dependencies. In this paper, we augment the previous results to account for a restricted class of self-joins.
Self-joins have long plagued the complexity study of many problems in database theory research: for example, on the topic of consistent query answering, Kolaitis and Pema (Kolaitis and Pema, 2012) proved a dichotomy into PTIME and coNP-complete cases for the family of queries with only two atoms and no self-joins. Koutris and Suciu (Koutris and Suciu, 2014) extended the dichotomy to the larger class of self-join-free conjunctive queries, where each atom has as primary key either a single attribute or all the attributes. Koutris and Wijsen (Koutris and Wijsen, 2017, 2018b) further extended the dichotomy to the full class of sj-free Boolean CQs, and queries with negated atoms (Koutris and Wijsen, 2018a). To the best of our knowledge, there is no known result on this problem for a query family that permits self-joins. As another example, complexity results on the problem of query-based pricing (Koutris et al., 2015) are also restricted to the class of sj-free CQs. On the closely related topic of deletion propagation with view side-effects, Kimelfeld et al. (Kimelfeld et al., 2012) used a characteristic of the query structure (head domination) to formalize a complexity dichotomy for the family of sj-free CQs, and indicated that self-joins can significantly harden approximation in the problem of deletion propagation. Extensions to the cases of functional dependencies (Kimelfeld, 2012) and multi-tuple deletions (Kimelfeld et al., 2013) also focused on the same query class. These examples offer strong indication that self-joins introduce significant hurdles in the study of a variety of problems, and progress in cases that account for self-joins is rare. [Footnote 1: While some prior work on related problems does allow for self-joins (Buneman et al., 2002; Cong et al., 2012; Amarilli et al., 2017), the complexity characterizations in those results are not specific to the queries, but rather to high-level operators (e.g., join, projection, etc.).]
In contrast, our work provides results that are fine-grained and identify elements of the query structure that render the resilience problem NP-complete or PTIME-computable.
In this paper, we give several novel results on the hardness of the resilience problem for CQs with self-joins. Some of our results hold for any CQ with self-joins, but we then focus on the class of binary CQs (those where relations are either unary or binary). We provide various complexity results for binary CQs where only one relation name can be repeated, which we call single-self-join (ssj) queries. We analyze ssj binary queries in general, and for the case with at most 2 instances of the repeated relation we prove that a P versus NP-complete dichotomy exists. We further provide a unifying criterion for hardness (a “proof template”), and we conjecture that it subsumes and generalizes the criterion of triads from sj-free queries, and that it provides a sufficient criterion of hardness for any CQ.
Contributions and outline.
- Contrasting with current knowledge about the resilience of CQs without self-joins (summarized in Section 2), we demonstrate how self-joins complicate the problem and invalidate several aspects and intuitions from the self-join-free case (Section 3).
- We establish foundations for tackling the resilience problem for conjunctive queries with self-joins by identifying important conditions on the minimality and connectedness of queries and by revising the fundamental notion of query domination (Section 4).
- We prove that resilience for queries that contain a triad (a structure that characterizes hardness in the sj-free case (Freire et al., 2015)) remains NP-complete in the presence of self-joins (Section 5.2).
- By narrowing our target class to the class of binary conjunctive queries (those where relations are either unary or binary) and single-self-join queries (i.e., only one relation can appear in multiple atoms of the query), we identify a new structure that implies hardness, thus expanding the NP-complete class compared to the sj-free case (Section 6).
- We identify and define the fundamental structures of chains, confluences, and permutations, and use them to prove a complete dichotomy between NP-complete and PTIME cases for the class of single-self-join binary conjunctive queries where exactly two atoms in a query correspond to the same relation (Section 7).
- We prove several involved results using the chains, confluences, and permutations structures in the case of single-self-join binary conjunctive queries where exactly 3 atoms correspond to the same relation. While a complete dichotomy for this class remains elusive, our work creates a roadmap and identifies remaining open problems (Section 8).
- We provide the novel concept of Independent Join Paths. This general “proof template” aims to (i) provide a sufficient criterion of hardness for any CQ, (ii) subsume the prior hardness criterion of triads for sj-free CQs, and (iii) provide a hint for an approach that could possibly automate the search for hardness reductions (Section 9).
Some of our results apply to the general class of self-join CQs, while others apply to more restricted query families. We annotate our theoretical results with the symbols detailed in Table 1 to indicate the relevant assumptions.
2. Background and Prior Results
This section introduces our notation, defines the resilience of a query, and summarizes prior complexity results for sj-free queries.
Standard database notations. We use boldface to denote tuples or ordered sets, (e.g., ) and use both subscripts and superscripts as indices (e.g., and ). We fix a relational vocabulary , and denote the arity of a relation . We call unary and binary those relations with arity 1 or 2, respectively. We call “binary queries” those queries that contain only unary or binary relations. A database instance over is , where each is a finite relation. We call the elements of tuples and write instead of when is clear from the context. With some abuse of notation we also denote as the set of all tuples, i.e. , where the union is understood to be a disjoint union (thus each tuple belongs to only one relation). The active domain is the set of all constants occurring in . The size of the database instance is , i.e. the number of tuples in the database. [Footnote 2: Notice that other work sometimes uses as the size of the database. Our different definition has no implication on our complexity results but simplifies the discussions of our reductions.]
A conjunctive query (CQ) is a first-order formula where the variables are called existential variables, are called the head variables (or free variables), and each atom (also called subgoal) represents a relation where . [Footnote 3: WLOG, we assume that is a tuple of only variables and don’t write the constants. Selections can always be directly pushed into the database before executing the query. In other words, for any constant in the query, we can first apply a selection on each relation and then consider the modified query with a column removed.] A self-join-free CQ (sj-free CQ) is one where no relation symbol occurs more than once and thus every atom represents a different relation. In turn, a self-join CQ is one where at least one relation symbol is repeated, and a single-self-join (ssj) CQ is one where only one relation symbol can be repeated in a query. We write for the set of variables occurring in atom . As usual, we abbreviate a non-Boolean query in Datalog notation by where has head variables and represents the body of the query.
Unless otherwise stated, a query in this paper denotes a Boolean CQ (i.e., ). We write to denote that the query evaluates to true over the database instance , and to denote that evaluates to false. For a Boolean query , we write to indicate that represents the set of all existentially quantified variables. We write as short notation for the set .
Additional notations.
We call a valuation of all existential variables that is permitted by and that makes true (i.e. ) a witness . [Footnote 4: Note that our notion of witness slightly differs from the one commonly seen in provenance literature where a “witness” refers to a subset of the input database records that is sufficient to ensure that a given output tuple appears in the result of a query (Cheney et al., 2009).] The set of witnesses is then
[TABLE]
Since every witness implies exactly one set of at most tuples from that make the query true, we will slightly abuse the notation and also refer to this set of tuples as “witnesses.” For example, consider the query with over the database . Then one can easily see that
[TABLE]
and their respective tuples are , , and .
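To make the notion of a witness concrete, the following minimal Python sketch enumerates all witnesses of a Boolean CQ by brute force. The query encoding (a list of `(relation, variables)` pairs), the function name `witnesses`, and the example query and database are assumptions chosen for illustration; they are not taken from the paper.

```python
from itertools import product

def witnesses(query, db):
    """Enumerate all valuations of the query variables that make every atom true.

    query: list of (relation_name, tuple_of_variables)   (hypothetical encoding)
    db:    dict mapping relation_name -> set of tuples of constants
    """
    variables = sorted({v for _, vs in query for v in vs})
    # Active domain: every constant that occurs in some tuple.
    domain = sorted({c for tuples in db.values() for t in tuples for c in t})
    result = []
    for values in product(domain, repeat=len(variables)):
        valuation = dict(zip(variables, values))
        if all(tuple(valuation[v] for v in vs) in db.get(rel, set())
               for rel, vs in query):
            result.append(valuation)
    return result

# Hypothetical example with one repeated relation: q :- R(x, y), R(y, z)
query = [("R", ("x", "y")), ("R", ("y", "z"))]
db = {"R": {(1, 2), (2, 3), (3, 3)}}
ws = witnesses(query, db)
```

In this example one of the witnesses maps both atoms to the same tuple `(3, 3)`, which illustrates why a witness corresponds to a set of at most as many tuples as there are atoms.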
In line with prior work (Freire et al., 2015; Meliou et al., 2010), relations may be specified as exogenous, meaning that tuples from these relations cannot be deleted. [Footnote 5: In other words, tuples in these atoms provide context and are outside the scope of possible “interventions” in the spirit of causality (Halpern and Pearl, 2005).] We specify the atoms corresponding to exogenous relations with a superscript “x”. The remaining atoms are endogenous.
Complexity theory. We write to mean . [Footnote 6: First-order reductions are not required, but it is the case that all reductions defined in the paper are expressible in first-order.] We say that two problems have equivalent complexity () iff they are inter-reducible, i.e., and .
2.1. Query resilience
In this paper, we focus on the problem of resilience, a variant of the problem of deletion propagation focusing on Boolean queries: Given , what is the minimum number of endogenous tuples that have to be removed from to make the query false? A large minimum set implies that the query is more “resilient” and requires the deletion of more tuples to change the query output. In order to study the complexity of resilience, we focus on the decision problem:
Definition 1 (Resilience Decision).
Given a query , database , and an integer . We say that if and only if and there exists a set with at most endogenous tuples s.t. . We define as the size of a minimum contingency set for input and .
In other words, means that there is a set of or fewer endogenous tuples whose removal makes the query false. We refer to such a set of tuples as a “contingency set.” Observe that, for a fixed , we can talk about data complexity and when is computable in PTIME.
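The definition can be illustrated with a deliberately naive sketch that tries all sets of endogenous tuples in order of increasing size. It runs in exponential time and is only meant to pin down what is being computed, not how the paper computes it. The encoding, the function names `holds` and `resilience`, and the example are assumptions; exogenous relations are modelled here simply as a set of relation names.

```python
from itertools import combinations, product

def holds(query, db):
    """True iff the Boolean CQ has at least one witness over db."""
    variables = sorted({v for _, vs in query for v in vs})
    domain = {c for tuples in db.values() for t in tuples for c in t}
    for values in product(domain, repeat=len(variables)):
        val = dict(zip(variables, values))
        if all(tuple(val[v] for v in vs) in db.get(rel, set())
               for rel, vs in query):
            return True
    return False

def resilience(query, db, exogenous=frozenset()):
    """Size of a minimum contingency set, or None if no contingency set exists."""
    if not holds(query, db):
        return None  # the definition requires that the database satisfies the query
    endogenous = [(rel, t) for rel, tuples in db.items()
                  if rel not in exogenous for t in tuples]
    for k in range(1, len(endogenous) + 1):
        for gamma in combinations(endogenous, k):
            removed = set(gamma)
            reduced = {rel: {t for t in tuples if (rel, t) not in removed}
                       for rel, tuples in db.items()}
            if not holds(query, reduced):
                return k  # first size at which some deletion set falsifies q
    return None

# Hypothetical example: q :- R(x, y), R(y, z)
query = [("R", ("x", "y")), ("R", ("y", "z"))]
db = {"R": {(1, 2), (2, 3), (3, 3)}}
rho = resilience(query, db)
```

Here no single deletion suffices: the tuple `(3, 3)` must be deleted because it forms a witness by itself, and one more tuple is needed to break the remaining witness.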
A central result of the prior work on resilience (Freire et al., 2015) is that the complexity of resilience of an sj-free CQ can be exactly characterized via a natural property of its dual hypergraph . The hypergraph of an sj-free query is usually defined with its vertices being the variables of and the hyperedges being the atoms (Abiteboul et al., 1995). The dual hypergraph, , has vertex set , and each variable determines the hyperedge consisting of all those atoms in which occurs: . A path in the graph is an alternating sequence of vertices and edges, , such that for all , , i.e., the hyperedge joins vertices and . We explicitly list the hyperedges in the path, because more than one hyperedge may join the same pair of vertices. Since we only consider dual hypergraphs, we use the shorter term “hypergraph” from now on.
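As a small illustration of the dual hypergraph, the sketch below maps every variable to the set of atoms in which it occurs. The encoding and the triangle-shaped example query are assumptions for illustration; atoms are identified by relation name, which is only unambiguous for sj-free queries.

```python
def dual_hypergraph(query):
    """Return (vertices, hyperedges) of the dual hypergraph of an sj-free CQ.

    vertices:   the atoms, identified by relation name
    hyperedges: dict mapping each variable to the set of atoms containing it
    """
    vertices = [rel for rel, _ in query]
    hyperedges = {}
    for rel, variables in query:
        for v in variables:
            hyperedges.setdefault(v, set()).add(rel)
    return vertices, hyperedges

# Hypothetical sj-free example: q :- R(x, y), S(y, z), T(z, x)
query = [("R", ("x", "y")), ("S", ("y", "z")), ("T", ("z", "x"))]
vertices, hyperedges = dual_hypergraph(query)
```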
Example 2 (Hypergraphs).
We illustrate the prior results with the following 4 queries and their hypergraphs shown in Fig. 1:
[TABLE]
In the remainder of this section, we summarize the intuition behind three main constructs—triads, domination, and linear queries—that lead to the result presented in Theorem 7. Then, in Section 3 we provide an exposition of how self-joins alter or completely invalidate these prior constructs.
2.2. Domination
We may mark some relations in an input database as exogenous, and the remaining relations are endogenous. However, some relations are “implicitly” exogenous. For example, the relation in is given as endogenous, but is never needed in minimum contingency sets. We next define a syntactic property, called domination, that captures when endogenous relations are implicitly exogenous.
Definition 3 (Domination).
If a query has endogenous atoms such that , we say that dominates .
For example, dominates in . Whenever a contingency set contains tuples from , they can always be replaced with an equal or smaller number of tuples from .
Proposition 4 (Domination for resilience (Meliou et al., 2010)).
Let be an sj-free CQ and the query resulting from labeling some dominated atoms as exogenous. Then .
When studying resilience, we follow the convention that all dominated atoms are made exogenous, and we consider that the normal form of a query. As we have seen, dominates in . Similarly, the atom dominates both and in . We thus transform the queries so that the dominated atoms are exogenous. Exogenous atoms have the superscript “x”.
[TABLE]
Proposition 4 implies that .
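The normalization step for sj-free queries can be sketched as follows. The sketch assumes the usual reading of Definition 3, namely that an atom A dominates an atom B when the variables of A are a subset of the variables of B; the formula itself is not legible in the text above, so this is an assumption. The function name, the tie-breaking rule for atoms with identical variable sets, and the example are also assumptions.

```python
def dominated_atoms(query, exogenous=frozenset()):
    """Return the indices of endogenous atoms dominated by another endogenous atom.

    Assumed reading of Definition 3: atom A dominates atom B when
    var(A) is a subset of var(B).
    """
    endogenous = [i for i in range(len(query)) if i not in exogenous]
    dominated = set()
    for a in endogenous:
        for b in endogenous:
            if a == b:
                continue
            var_a, var_b = set(query[a][1]), set(query[b][1])
            # Tie-break (our assumption): if both atoms have the same variables,
            # only the later one counts as dominated, so one stays endogenous.
            if var_a < var_b or (var_a == var_b and a < b):
                dominated.add(b)
    return dominated

# Hypothetical sj-free example: q :- A(x), R(x, y), S(y, z)
query = [("A", ("x",)), ("R", ("x", "y")), ("S", ("y", "z"))]
to_make_exogenous = dominated_atoms(query)
```

As Section 3.2 explains, this test is only sound for sj-free queries.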
2.3. Triads and hardness
We showed in (Freire et al., 2015) that and from Example 2 are NP-complete. While and appear to be quite different, they share a key common structural property which alone is responsible for hardness for sj-free CQs.
Definition 5 (Triad).
A triad is a set of three endogenous atoms, such that for every pair , there is a path from to in that uses no variable occurring in the other atom of .
Intuitively, a triad is a triple of points with “robust connectivity.” Observe that atoms form a triad in and atoms form a triad in (see Fig. 1). For example, there is a path from to in (across hyperedge ) that uses only variables (here ) that are not contained in the other atom (). We showed that triads are responsible for hardness (see Appendix B for proof):
Lemma 6 ( Triads make hard (Freire et al., 2015)).
Let be an sj-free CQ where all “dominated” atoms are exogenous. If has a triad, then is NP-complete.
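The triad condition of Definition 5 can be checked mechanically. In the minimal sketch below, atoms are indexed by position, two atoms are adjacent when they share a variable, and a path between two atoms of a candidate triple may only use variables that do not occur in the third atom. The function names and both example queries are assumptions for illustration; the caller is expected to pass only atoms that are still endogenous after dominated atoms have been made exogenous.

```python
from collections import deque
from itertools import combinations

def connected(query, start, goal, forbidden):
    """BFS over atoms; a step is allowed only along a shared variable
    that is not in `forbidden`."""
    seen, queue = {start}, deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            return True
        usable = set(query[i][1]) - forbidden
        for j in range(len(query)):
            if j not in seen and usable & set(query[j][1]):
                seen.add(j)
                queue.append(j)
    return False

def find_triads(query, endogenous):
    """Return all triples of endogenous atom indices that form a triad."""
    triads = []
    for s0, s1, s2 in combinations(sorted(endogenous), 3):
        pairs = ((s0, s1, s2), (s0, s2, s1), (s1, s2, s0))
        # For each pair (a, b), avoid every variable of the third atom c.
        if all(connected(query, a, b, set(query[c][1])) for a, b, c in pairs):
            triads.append((s0, s1, s2))
    return triads

# Hypothetical examples: a triangle-shaped query and a chain-shaped query.
triangle = [("R", ("x", "y")), ("S", ("y", "z")), ("T", ("z", "x"))]
chain = [("R", ("x", "y")), ("S", ("y", "z")), ("T", ("z", "w"))]
triangle_triads = find_triads(triangle, {0, 1, 2})
chain_triads = find_triads(chain, {0, 1, 2})
```

In the chain, every path from the first to the last atom must pass through a variable of the middle atom, so no triad exists.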
2.4. Linear queries
A query is linear if its atoms can be arranged in a linear order s.t. each variable occurs in a contiguous sequence of atoms. Geometrically, a query is linear if all of the vertices of its hypergraph can be drawn along a straight line and all of its hyperedges can be drawn as convex regions (thus the variables form intervals on a line of relations). For example is linear (see Fig. 1(d)).
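The definition of linearity translates directly into a brute-force test: try every ordering of the atoms and check that each variable occupies a contiguous block of positions. The sketch below takes factorial time in the number of atoms, which is acceptable only because queries are small; the function name and examples are assumptions for illustration.

```python
from itertools import permutations

def linear_order(query):
    """Return an ordering of atom indices witnessing linearity, or None."""
    variables = {v for _, vs in query for v in vs}
    for order in permutations(range(len(query))):
        ok = True
        for v in variables:
            positions = [p for p, i in enumerate(order) if v in query[i][1]]
            # Contiguous block: the span equals the number of occurrences.
            if positions[-1] - positions[0] + 1 != len(positions):
                ok = False
                break
        if ok:
            return order
    return None

# Hypothetical examples.
chain = [("R", ("x", "y")), ("S", ("y", "z")), ("T", ("z", "w"))]
triangle = [("R", ("x", "y")), ("S", ("y", "z")), ("T", ("z", "x"))]
chain_order = linear_order(chain)
triangle_order = linear_order(triangle)
```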
It was shown in (Meliou et al., 2010) that for any sj-free CQ that is linear, may be computed in a natural way using network flow. Thus all such queries are easy.
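For intuition, here is a minimal sketch of one standard flow construction, restricted to chain-shaped sj-free queries of the form R1(x0, x1), R2(x1, x2), ... in which every atom is endogenous. Each tuple becomes an edge of capacity 1 between consecutive layers, so witnesses correspond to source-to-sink paths and a minimum cut corresponds to a minimum contingency set. This is an illustrative assumption about the shape of the reduction, not necessarily the exact construction of (Meliou et al., 2010); the function names and the small augmenting-path max-flow routine are ours.

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink):
    """Augmenting-path max flow (BFS); `capacity` holds residual capacities."""
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in capacity[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left: flow equals the min cut
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, capacity[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            capacity[u][v] -= bottleneck
            capacity[v][u] += bottleneck
            v = u
        flow += bottleneck

def chain_resilience(relations):
    """relations: list of sets of pairs, one set per atom of the chain."""
    inf = float("inf")
    cap = defaultdict(lambda: defaultdict(int))
    for i, tuples in enumerate(relations):
        for a, b in tuples:
            cap[(i, a)][(i + 1, b)] = 1      # deleting a tuple costs 1
    for a, _ in relations[0]:
        cap["s"][(0, a)] = inf               # source edges cannot be cut
    for _, b in relations[-1]:
        cap[(len(relations), b)]["t"] = inf  # sink edges cannot be cut
    return max_flow(cap, "s", "t")

# Hypothetical instance for q :- R(x, y), S(y, z): all witnesses share y = 2.
R = {(1, 2), (5, 2)}
S = {(2, 4), (2, 6), (2, 7)}
rho = chain_resilience([R, S])
```

In this instance deleting the two R-tuples is cheaper than deleting the three S-tuples, and the minimum cut reflects that.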
If all sj-free CQs without a triad were linear, then this would complete the dichotomy theorem for resilience. While this is not the case, we completed the proof of Theorem 7 by showing that every triad-free sj-free CQ may be transformed to a linear query of equivalent resilience.
2.5. Dichotomy Theorem
Now we can present the full characterization of the complexity of sj-free CQs proved in (Freire et al., 2015).
Theorem 7 ( Dichotomy of resilience for sj-free CQs (Freire et al., 2015)).
Let be an sj-free CQ and let be the result of making all “dominated” atoms exogenous. If has a triad, then is NP-complete, otherwise it is in PTIME.
3. Self-joins change everything
Queries with self-joins are far more complicated than sj-free queries for at least 4 reasons: (1) For the sj-free case, triads alone were shown to determine hardness. Triads need at least 3 existential variables and at least 3 subgoals. Section 3.1 shows that already 2 atoms or 2 variables can be enough for hardness; (2) Linear sj-free queries can be solved using a natural reduction to network flow. For self-join queries, linear queries can be hard. Furthermore, Section 3.3 shows that we may need more elaborate reductions to network flow, even when they are easy. (3) The previous definition of domination does not lead to the desired properties in the presence of self-joins. Section 3.2 explains why dominated atoms may still be relevant when computing the minimum contingency set. (4) Our previous crucial concept of the dual hypergraph is no longer sufficient to characterize queries when relations appear multiple times. The position at which a variable appears in a subgoal may influence the complexity of resilience, including whether an atom has repeated variables, e.g., “.”
In the cases where the variable position is relevant and we are restricted to binary queries, we naturally represent queries as labeled directed graphs. This representation captures all relevant structural information of the binary queries, especially the relative position of variables, which the hypergraph representation does not reflect.
Definition 1 (Binary graph).
Let be a binary CQ. Its binary graph has vertex set and labeled edge sets defined by atoms , i.e. atom translates into labeled edge . For unary atoms, the edge will be a loop.
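A minimal sketch of this representation follows: variables become vertices, each binary atom becomes a directed edge labeled with its relation name, and each unary atom becomes a labeled loop. The encoding, the function name, and the example query (a self-join with two variables and three atoms) are assumptions for illustration.

```python
def binary_graph(query):
    """Return (vertices, edges) where edges are (source, target, label) triples."""
    vertices, edges = set(), []
    for rel, variables in query:
        if len(variables) == 1:
            (x,) = variables
            edges.append((x, x, rel))      # unary atom: loop on its variable
        elif len(variables) == 2:
            x, y = variables
            edges.append((x, y, rel))      # binary atom: edge from 1st to 2nd position
        else:
            raise ValueError("not a binary query: " + rel)
        vertices.update(variables)
    return vertices, edges

# Hypothetical example: q :- R(x), S(x, y), R(y)
query = [("R", ("x",)), ("S", ("x", "y")), ("R", ("y",))]
vertices, edges = binary_graph(query)
```

Unlike the dual hypergraph, the edge direction records the position of each variable, so atoms such as R(x, y) and R(y, x) are distinguished.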
3.1. Basic hard queries: and
We start by proving hardness for two queries that will play an important role in our later results. The first (for “vertex cover”) has only 2 variables and 3 atoms. The second (since it “chains” two binary relations together) has only 2 atoms and 3 variables:
[TABLE]
Figure 2 shows graphical representations of both queries while illustrating the differences between the dual hypergraph and the binary graph of a binary CQ.
Recall that in the sj-free case, a query needs a triad to be hard and all linear queries are easy. In particular, an sj-free query must have at least 3 variables and 3 atoms to be hard.
Proposition 2 ().
* is NP-complete.*
Proposition 3 ().
* is NP-complete.*
3.2. SJ-Free domination no longer works
We saw from Proposition 4 that in sj-free CQs, making all dominated atoms exogenous leaves the query resilience unchanged. In the presence of self-joins, this is no longer true.
Example 4.
Query is a self-join variation of with replaced by ’s. Similar to , we have , so dominates by Definition 3. Thus should become exogenous when searching for the minimal contingency set. But this is not the case. Consider the database instance
[TABLE]
Our query has 3 witnesses over this database: , , and . If was made exogenous, the only possible minimum contingency set would be . However, if is considered as endogenous, there is a smaller contingency set, with only .
This example shows that domination as defined in Definition 3 no longer implies that a relation can be made exogenous in the self-join case. This immediately raises the question of whether there is a set of conditions which implies that a relation can be made exogenous in the self-join setting, i.e. if there is a self-join version of domination. Additionally, does have a triad? The answer to both is yes, as we will see in Section 4.3 and Section 5.1, respectively.
3.3. Easy queries that use flow in a trickier way
As mentioned in the discussion of Theorem 7, resilience for linear sj-free CQs can be computed directly from network flow. As we have just seen, in the presence of self-joins, some linear queries are hard. For those that are easy, network flow can still help us compute resilience, but the arguments become trickier.
The following queries are two such examples, where modified versions of network flow are used to show resilience is easy in these cases.
[TABLE]
Proposition 5 ().
* is in P.*
Proposition 6 ().
* is in P.*
4. New general observations and plan of attack
We next give 3 new general observations before we describe our plan of attack in the remainder of the paper.
4.1. Minimal queries
Given queries and , we say that is contained in () if answers to over any database instance are always a subset of the answers to over . We say is equivalent to () if and (Abiteboul et al., 1995). We say a conjunctive query is minimal if for every other conjunctive query such that , has at least as many atoms as . For every query , there exists a minimal equivalent CQ that can be obtained from by removing zero or more atoms (Chandra and Merlin, 1977).
From now on, we focus only on minimal queries. This is WLOG, since any non-minimal query can always be minimized as a pre-processing step. The reason is that our hardness evaluation relies on identifying certain subqueries (or patterns) in a query that make this query hard. However, if a pattern is in a subquery that is removed during minimization, then this pattern has no effect on the resilience of the query.
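The minimization pre-processing step can be sketched with the classical homomorphism test: an atom of a Boolean CQ can be dropped when the whole query maps homomorphically into the remaining atoms. The brute-force search below is exponential in the number of variables and ignores exogenous labels; the function names and examples are assumptions for illustration.

```python
from itertools import product

def has_homomorphism(source, target):
    """True iff some variable mapping sends every atom of `source` to an atom of `target`."""
    src_vars = sorted({v for _, vs in source for v in vs})
    tgt_vars = sorted({v for _, vs in target for v in vs})
    target_atoms = set(target)
    for image in product(tgt_vars, repeat=len(src_vars)):
        h = dict(zip(src_vars, image))
        if all((rel, tuple(h[v] for v in vs)) in target_atoms
               for rel, vs in source):
            return True
    return False

def minimize(query):
    """Repeatedly drop an atom while the result stays equivalent to the query."""
    current = list(query)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if candidate and has_homomorphism(current, candidate):
                current = candidate
                changed = True
                break
    return current

# Hypothetical examples: the first query is redundant, the second is already minimal.
redundant = [("R", ("x", "y")), ("R", ("x", "z"))]
chain = [("R", ("x", "y")), ("R", ("y", "z"))]
minimized = minimize(redundant)
```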
4.2. Query components
A connected component of (or “component” in short) is a non-empty subset of atoms that are connected via existential variables. A query is disconnected if its atoms can be partitioned into two or more components that do not share any existential variables. For example,
[TABLE]
The resilience of a query is determined by taking the minimum of the resiliences of each of its components. In the following, let stand for the resilience of query over database , which is the size of the minimum contingency set for .
Lemma 1 ( Query components).
Let be a query that consists of components , . Then .
We can now show that the complexity of a query is determined by the hardest of its components if the query is minimal:
Lemma 2 ( Query components complexity).
Let be a minimal query that consists of query components. is NP-complete if there is at least one component for which is NP-complete. Conversely, if is in P for all , then is in P.
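Splitting a query into its components is a plain connectivity computation over shared variables. The sketch below returns the components as lists of atom indices; by Lemma 1, the resilience of the whole query would then be the minimum of the resiliences computed per component. The function name and the example are assumptions for illustration.

```python
def components(query):
    """Partition atom indices into groups connected via shared variables."""
    remaining = list(range(len(query)))
    result = []
    while remaining:
        seed = remaining.pop(0)
        component, seen_vars = [seed], set(query[seed][1])
        grew = True
        while grew:
            grew = False
            for i in remaining[:]:
                if seen_vars & set(query[i][1]):
                    component.append(i)
                    seen_vars |= set(query[i][1])
                    remaining.remove(i)
                    grew = True
        result.append(sorted(component))
    return result

# Hypothetical disconnected example: q :- A(x), B(z), R(x, y), S(z, w)
query = [("A", ("x",)), ("B", ("z",)), ("R", ("x", "y")), ("S", ("z", "w"))]
parts = components(query)
```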
In the remainder of the paper we assume queries to be connected.
4.3. SJ-domination
As discussed in Section 3.2, we need to consider the position of the variables in the attribute list of each atom in an sj-query. We write to express that the -th attribute of atom is variable for a query and omit when is clear from the context.
Definition 3 ( Domination with Self-Joins).
Let relations and be endogenous relations in query . We say that dominates if there exists a function
[TABLE]
such that for each atom , there exists an atom satisfying , . In other words, each atom that occurs in has a corresponding atom, and each of these pairs will have matching variables according to function .
Notice that when appears only once, the definition of domination is equivalent to the sj-free definition: .
Example 4.
To illustrate the new self-join domination, consider the following queries:
[TABLE]
By following the definition above, doesn’t dominate in but it does in , whereas is dominated in both queries. Notice that in , a tuple will always join with tuple so we can always choose instead to be in the contingency set. The same is not true for , where a tuple could join with or .
Proposition 5 ( Domination for resilience with Self-Join).
Let be a CQ and the result of labeling some dominated relations exogenous. Then .
4.4. Outline of our plan of attack
To obtain a dichotomy result for the resilience of binary queries in the presence of a single-self-join, we proceed as follows. (1) Section 5 shows that triads in any conjunctive queries with self-joins still imply hardness (Theorem 6) and furthermore, when triads are absent, the endogenous atoms are linearly connected. We call such queries pseudo-linear (Theorem 7). We conjecture that pseudo-linear queries may be transformed to linear queries of equivalent resilience (Conjecture 8). In any case, it suffices to study the criteria for hardness of pseudo-linear queries. (2) Section 6 generalizes the hardness pattern behind to a more general class of hard ssj binary queries that contain “paths” between repeated atoms. (3) We then focus on the complexity of the resilience of ssj binary CQs with at most a single repetition of a single relation. Section 7 gives a complete characterization of the complexity for the cases of 2 occurrences of a repeated relation. This is a dichotomy theorem: we show that for all such queries, , is either NP-complete or is reducible to network flow and is thus in P. Section 8 presents the remaining challenges that must be overcome in order to characterize all queries with 3 occurrences of a repeated relation. In Section 9, we present a “template” for hardness proofs that we believe will help us make progress in the general self-join case.
5. Non-linear Queries: NP-Complete
In this section we prove that queries containing triads remain hard in the presence of self-joins (Theorem 6). We then show that for any query that does not contain a triad, its endogenous atoms are arranged linearly. We call such a query pseudo-linear. Thus, we conclude that either a query contains a triad in which case its resilience problem is NP-complete, or it is pseudo-linear. In the following sections, we can thus safely restrict our attention to pseudo-linear queries.
Definition 1 (Self-join variation of a CQ).
Let be an sj-free CQ and let result from by replacing some atoms from with the atom , where the relation occurs elsewhere in . We say that is a self-join variation of .
Example 2 (Self-join variations).
Consider sj-free query . The following are all its possible self-join variations:
[TABLE]
We first observe that the resilience of self-join variations of a query can only be harder than their sj-free counterpart:
Lemma 3 ( SJ Can Only Make Resilience Harder).
Let be an sj-free CQ and let be a self-join variation of . If is minimal, then .
We need to rely on the fact that is minimal for the result to hold, as we see in the example below:
Example 4.
Consider query , and observe that is NP-complete because contains a triad. A possible self-join variation is . Note that is not minimal, and is equivalent to . So is trivially in P.
By Lemma 3, the resilience of the self-join variations of is NP-complete. Recall from Definition 5 that a triad is a set of three endogenous atoms, so we can say that the self-join variations of all have triads. However, it does not immediately follow from Lemma 3 that every sj-query with a triad is hard. The missing cases are when an sj-query includes a triad, but it is not a self-join variation of an sj-free query with a triad. We next explore this situation.
5.1. Self-join variations of and
Recall two important sj-free queries:
[TABLE]
is NP-complete because it contains the triad, . However, and are easy because dominates and dominates so they only have two endogenous atoms each and thus no triad.
The same does not hold for some self-join variations of and . Below we list two examples of variations that contain triads.
[TABLE]
In these examples, relation is now more robust and not dominated by or . Therefore, they still contain triads consisting of their three -atoms. The presence of a triad is a strong indication that these queries are hard, but we cannot use Lemma 3 to show this because their sj-free counterparts, and , are easy. We now show that these queries are indeed hard.
Proposition 5.
Let be a self-join variation of or . If has a triad, then is NP-complete.
Using Proposition 5, we now generalize the fact that triads make sj-free queries hard (Lemma 6) to the same result for general CQs.
5.2. Triads Make Queries Hard
Theorem 6 ( SJ-queries with triads).
If has a triad, then is NP-complete.
Proof Sketch.
We argue that there are only two cases to consider when an sj-query has a triad. Suppose that is a self-join variation of an sj-free query . The first case is when the resilience of is hard, i.e., is NP-complete. Then, we can use Lemma 3 to show is NP-complete. The second case is when is in P. Then, we argue that we can show a reduction from , where is a self-join variation of either or that contains a triad. From Proposition 5, is NP-complete, which thus implies that is also NP-complete. ∎
Thus, if a query contains a triad, it is hard. In the next section, we discuss queries that do not contain triads and how they are similar to linear queries, since their endogenous atoms are linearly connected.
5.3. No Triad Means Pseudo-Linear
In (Freire et al., 2015), we proved that if an sj-free CQ has no triad, then may be transformed to an sj-free CQ which is linear and such that . Since linear sj-free CQs are easy, it follows that is easy.
This argument no longer works in the presence of self-joins because linear queries can be easy or hard. However, we can extend the theorem from (Freire et al., 2015) to show the following:
Theorem 7 ( No Triad Means Pseudo-Linear).
Let be a CQ with no triad. Then all endogenous atoms in are connected linearly.
We conjecture that pseudo-linearity is equivalent to linearity when considering resilience. What makes a query pseudo-linear, instead of linear or containing a triad, is the presence of some exogenous atoms. However, the exogenous atoms of a query are mostly only connecting the endogenous atoms, and also, if necessary, ensuring that is a minimal query, so we believe they can be modified to obtain a linear query without altering the complexity of a query.
Conjecture 8 ( No Triad Means Linear).
Let be a CQ with no triad. Then we can transform to a linear CQ with .
6. Paths are hard
Section 3.1 presented two linear queries that are hard, unlike in the sj-free case where all linear queries are easy. Note that these queries are binary and, in both, only one relation is part of a self-join. In other words, these are single-self-join binary queries.
We now identify a pattern characteristic of that we call a path. The main result of this section is that every ssj binary query containing a path is hard. We start by showing the case where the self-join relation is unary.
Theorem 1 ( Unary path).
Let be a minimal ssj-CQ. If contains distinct atoms and , then is NP-complete.
Proof sketch.
Let and be the first two occurrences of the relation in . Since is connected, and are connected by at least one non-self-join relation, (see Fig. 4(a)). We prove that . Details are in Appendix A, but it is not hard to see that any database can be transformed to a database that exactly preserves resilience. Here in come from and in , and all the other atoms of (including any additional occurrences of the self-join relation, , to the right of ) are covered by multiple, extra values which complete the joins but are never chosen in minimum contingency sets. Note that this proof doesn’t make any assumption about the existence or not of triads. ∎
When the self-join relation is binary, if two consecutive atoms, , , are disjoint, then we call this a binary path. “Overlapping” consecutive atoms with shared variables, such as , in , can also cause hardness and are studied in later sections.
Theorem 2 ( Binary path).
Let be a minimal ssj-CQ. If has distinct consecutive sj atoms with , then is NP-complete.
Proof Sketch.
Given as in the statement of the theorem, there must be an atom , with on the path between them, and and . Now, as in the proof of Theorem 1, we reduce to . We map any database to a database , where contains $\bigl\{(a,a) \bigm| A(a)\in D\bigr\}$ plus other multiple, extra values for any other atoms of the relation in to the left of or the right of and $S' = \bigl\{(a,b) \bigm| R(a,b)\in D\bigr\}$. Same as in the unary case, there is no assumption about the linearity of the query. Details are in Appendix A. ∎
Unary and binary paths are the simplest of the hard patterns. By Theorem 1 and Theorem 2, they always force their queries to be hard.
Since we have established that an sj-query either has a triad or is pseudo-linear (Theorem 7), and because we have proved that triads imply hardness (Theorem 6), we can now focus on the pseudo-linear queries.
In the next sections we study the more subtle pseudo-linear ssj binary queries, which do not contain paths.
7. Queries with exactly two -atoms
In this section we cover the complexity of pseudo-linear ssj binary queries with exactly two atoms referring to the same relation. We will refer to this relation as . As always, we assume that our query is minimal and connected, and from now on also assume that does not contain a triad or a path as described in Theorem 1 and Theorem 2; otherwise we would already know that is NP-complete. Even in this restricted setting, we will see that there is a surprisingly rich variety of structures, requiring different strategies to determine their complexity.
Because there are no paths, must be a binary relation and the two -atoms must have at least one variable in common.
- Chains have one common variable and join in different attributes, e.g., ;
- Confluences have one common variable and join in the same attribute, e.g., ;
- Permutations share both variables but join in different attributes, e.g., ;
- Queries with repeated variables (REP) have repeated variables in at least one -atom, e.g.,
Figure 5 shows the binary graphs for each of these patterns, which help visualize the subtle variations in how the -atoms can join. We consider each of these possibilities in turn and characterize their complexity.
7.1. 2-Chains
The chain query is the simplest possible minimal sj-query with two atoms and we proved earlier that its resilience is NP-complete (Proposition 3). In this section we prove that the chain structure is quite robust and that any of its expansions remains NP-complete.
We call “expansions” of any query obtained by adding new relations to it, i.e. relations that do not self-join. We start by presenting the expansions obtained by adding unary relations and then generalize that to any expansion.
Figure 6(a) shows how unary relations can be added to . Each one can appear by itself or combined with others. While the proof involves several subcases, the important take-away is that all 8 of these expansions are hard.
Proposition 1 (Chains with unary relations).
Any expansion of with unary relations is NP-complete.
Proof Sketch.
We prove these expansions are hard by a reduction from 3SAT. The same idea used to prove that is hard will work here as long as we adapt the variable and clause gadgets to deal with the existence of the unary relations. Lemmas 3, 4 and 5 in Appendix A contain the details. ∎
Now we can generalize this hardness result to any chain expansion using a reduction idea similar to the ones used for the proofs of Theorems 1 and 2 for paths.
Proposition 2 ( Chains).
If a query contains a 2-chain as its only self-join, then is NP-complete.
7.2. 2-Confluences
Confluences are defined by relation joining only in the same attribute. We refer to this pattern as (Fig. 6(b)).
Note that as a stand-alone query is not minimal, so we need other atoms connected to both and . An example of a minimal query containing a confluence is .
We next show that the standard flow algorithm without any modifications works correctly for linear queries with no self-join other than one -confluence, thus generalizing the idea of Proposition 5.
Proposition 3 ( ).
for any linear query with as its only self-join pattern can be solved in P by standard network flow.
In Proposition 3 we assume that is linear, thus guaranteeing that every path in from to involves the variable , and therefore we are able to create a network flow to solve the problem. Note that this is not true in general for pseudo-linear queries containing . For example, consider . It is easy to see that is pseudo-linear but we have . Thus, we cover all possible cases for by observing,
Proposition 4 ().
Let be a pseudo-linear query with as its only self-join pattern. If contains an exogenous path from to not involving the variable , then is NP-complete; otherwise it is in P.
7.3. 2-Permutations
We call two -atoms sharing both variables a permutation. The smallest pattern that has this property is (Fig. 5). We show that permutations have both NP-complete and PTIME instances.
Easy permutations. We start with two easy permutations.
[TABLE]
Proposition 5.
and are in P.
Proof.
Given a database satisfying , each tuple that is part of a witness for is part of exactly one witness. Therefore the size of a minimum contingency set for is exactly the number of witnesses.
Given a database satisfying , for each join , we have 2 possible choices: either will be in the minimum contingency set, or one of and will be, but never both. Therefore we can reduce to vertex cover in a bipartite graph, which is in P. ∎
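The last step relies on vertex cover being polynomial on bipartite graphs. A minimal sketch of why, assuming each witness has already been encoded as an edge between a "left" tuple and a "right" tuple (the encoding, the function names, and the sample edges are illustrative assumptions, not taken from the paper): by Kőnig's theorem, the size of a minimum vertex cover in a bipartite graph equals the size of a maximum matching, which simple augmenting paths can compute.

```python
def max_bipartite_matching(left, adj):
    """Size of a maximum matching; adj maps each left node to its right neighbours."""
    match_of_right = {}  # right node -> left node currently matched to it

    def try_augment(u, seen):
        # Try to match u, possibly re-matching left nodes that were matched earlier.
        for v in adj.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            if v not in match_of_right or try_augment(match_of_right[v], seen):
                match_of_right[v] = u
                return True
        return False

    return sum(1 for u in left if try_augment(u, set()))


def min_vertex_cover_size_bipartite(edges):
    """Kőnig's theorem: minimum vertex cover size == maximum matching size.

    Each edge is (left_node, right_node); the two sides are separate namespaces.
    """
    left = sorted({u for u, _ in edges})
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    return max_bipartite_matching(left, adj)


# Hypothetical witness graph: three witnesses over two left and two right tuples.
example_edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
cover_size = min_vertex_cover_size_bipartite(example_edges)
```

For the sample edges, the cover `{a1, b1}` touches every edge, and no single node does, so the computed size is 2.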
Hard permutations. Surprisingly, adding another unary atom to , thus bounding it on both ends, leads to a hard query.
[TABLE]
It is still true that for any pair participating in a join, a minimum contingency set will only contain one tuple from the pair. This might lead to the wrong conclusion that network flow could solve this problem. We will next show that this is incorrect.
Proposition 6.
is NP-complete.
The criterion. The main structural difference between the hard and easy permutations defined above is whether or not there are relations that “bound” the permutation on both ends, i.e. whether there are endogenous relations , such that contains variable but not , and contains variable but not . Thus, the hard permutation, , is bound, but the easy ones, , are not bound. Using this characterization, we identify when 2-permutations are hard.
Proposition 7 ().
Let be a pseudo-linear query with as its only self-join. If is bound, then is NP-complete; otherwise, is in P.
7.4. Queries with REP
We call queries with repeated variables (or REP in short) those where atoms contain the same variable twice, e.g. occurrences of . Note that this is only relevant for the case where is part of a self-join, otherwise it could be considered as .
There are only three patterns to consider when we are restricted to two -atoms, either one or both atoms have repeated variables. The following queries are the smallest examples of this class of queries:
[TABLE]
Notice that queries and satisfy the condition for hardness of binary paths (Theorem 2), since their set of variables is disjoint. Therefore, we can conclude that and are NP-complete, as well as any expansion of those queries. We show that any REP queries that contain are in P.
Proposition 8 ().
Any pseudo-linear query with exactly two -atoms that contains is in P.
7.5. The dichotomy
Combining our results so far, with at most two occurrences of the self-join relation, we have proved a complete characterization of the complexity of resilience:
Theorem 9 ( Two-Atom Dichotomy).
Consider an ssj-CQ, with at most two occurrences of the self-join relation. If has any of the following
- (1) triad
- (2) path
- (3) chain
- (4) bounded permutation
- (5) confluence with exogenous path
then is NP-complete. Otherwise, is in PTIME via a reduction to network flow. In addition, there is a PTIME algorithm that on input determines which case occurs.
8. Queries with exactly three -atoms
In Theorem 9 we completely characterized the complexity of resilience of all CQs with at most one repetition of a single relation, thus extending the dichotomy for sj-free CQs into the land of self-joins.
In this section, we present an overview of what can happen when we allow a third -atom to self-join. Since we only have to consider pseudo-linear queries that do not have a path, all three -atoms must connect to each other directly or through the third -atom. Even though this is still a restrictive setting, we will see that it brings non-trivial complications to the characterization. We present some complexity results, but also some remaining open problems.
8.1. 3-Chains
We obtain a 3-chain by adding an extra -atom to a 2-chain in a way such that the new atom joins in a different attribute from the other two.
[TABLE]
Analogous to the 2-chain case, 3-chains are always hard. In fact this holds for 4-chains, 5-chains, etc.
Proposition 1 ().
For all , if contains a -chain as its only self-join, then is NP-complete.
8.2. 3-Confluences
Adding a third -atom to a 2-confluence and making sure that it joins in the same attribute with one of the two existing -atoms produces a 3-confluence.
[TABLE]
As in the 2-confluence case, is not minimal, so other atoms are required to make it minimal. Here are a few examples of minimal queries containing .
[TABLE]
These queries are very similar but one of them is hard, while the other one is easy.
Proposition 2.
is NP-complete.
Proposition 3 ().
Any variation of obtained by including unary relations is NP-complete.
Proof.
We define a reduction from Max 2SAT similar to the one used for by adding the appropriate tuples to obtain the same set of joins. The contingency set doesn’t change with the new tuples and therefore the properties of the reduction hold. ∎
Proposition 4.
is in P.
Open problem. There is a third variant of 3-confluences which somewhat mixes queries and (Fig. 7).
[TABLE]
The complexity of remains unknown.
8.3. 3-Chain-Confluence
With 3 -atoms, it is possible that different patterns will occur at the same time. This makes the queries harder to analyze, since the result of these interactions might diverge from what we expect when we see each pattern in isolation.
In this section we present some queries where a 2-chain and a 2-confluence occur at the same time.
[TABLE]
The resilience of these queries is hard but they require different reductions. If is bound, then we can use a reduction from . Otherwise we need a reduction from Max 2SAT.
Proposition 5.
and are NP-complete.
Proposition 6.
is NP-complete.
Open Problem. In this category of queries with chain and confluence, we don’t know the complexity of .
8.4. 3-Permutation plus R
It is not possible to obtain two permutations in a query with only 3 -atoms. In fact, there are only two ways that a new -atom can be connected to a permutation: either by joining with or , and those are equivalent.
[TABLE]
Similar to the case, is not a minimal query, so additional atoms are necessary. We list the main examples of how this query can be made minimal and discuss the complexity of their resilience.
First we start with a query we have already seen and another one that is a slight variation on the first (Fig. 3(b)).
[TABLE]
We proved in Proposition 6 that is in P by using network flow. A similar argument proves that is also in P.
Proposition 7.
is in P.
The next query we will see is . Although very similar to and , is hard. It is surprising that such a small difference can already change the complexity of the resilience problem. Moreover, the proof requires a new reduction instead of a reduction similar to the one used in Proposition 6.
[TABLE]
Proposition 8.
is NP-complete.
The following are further examples of hard queries, which are somewhat related to .
[TABLE]
Proposition 9.
, and are NP-complete.
Open Problems. Despite the similarities with the queries presented in this section, we were not able to determine the complexity of the following queries:
[TABLE]
8.5. Queries with REP
If all three occurrences of have repeated variables, then we are in the path case.
[TABLE]
Proposition 10.
and are NP-complete.
Open problems. We don’t know the complexity of other queries that fall in this category of having three -atoms with REP but the following open ones are intriguing.
[TABLE]
Query has a similar structure to but a similar reduction doesn’t seem to work. Similarly, a reduction from doesn’t work for .
9. Independent Join Paths: a unifying hardness criterion
Motivation. We now define a particular “template” for hardness reductions which we call Independent Join Paths or IJPs. The idea is that if we can construct a particular database that fulfills 5 criteria for a query , then we can conclude safely that is NP-complete.
This recent development is exciting for several reasons: 1) In our earlier attempts to prove hardness for queries, we amassed a plethora of different hardness proofs, with little immediate intuition of how one hardness proof immediately helps facilitate the hardness proof of another query. Now we expect that the task can be simplified to the task of searching for any particular database that serves as “proof” of hardness based on a generalized reduction from Vertex Cover. 2) We were able to look at our existing hardness proofs and post-hoc identify some part in some gadget that formed an IJP. In other words, IJPs were already present in our hardness proofs (we give examples in Appendix C). Thus IJPs are really a unifying common denominator for all hard queries known so far. 3) The search for hardness proofs could now, in theory, be automated. While we have not yet explored this idea, we give the intuition in Appendix C. 4) The hardness based on IJPs is not restricted to the particular fragment of CQs that we have analyzed in this paper; rather they are a universal criterion. Even the original criterion of triads for sj-free CQs can be subsumed under IJPs. 5) We conjecture that the inability to form IJPs for those queries that are in PTIME can be deduced from the structure of a query, and future work will discover the reason.
The intuition of IJPs. We have already seen that paths between two subgoals and that refer to the same relation are a sufficient condition for hardness under “certain circumstances”. Recall our simplest example for a path implying hardness: . The intuition of our construction is now as follows: Take any minimal VC problem for a graph (see Fig. 8(a)). Replace any existing arc with 3 arcs instead to create (see Fig. 8(b)). Then has a VC of size iff has a VC of size . Similarly, replace each arc with 5 instead of 3 arcs, then the condition for is . The key property we needed for this to work is the fact that 3 arcs form a particular path with the following “OR-property” (see Fig. 8(c)): As long as at least one end point of the path is removed, then the minimal VC is exactly one additional node per path.
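The arc-replacement step can be checked by brute force on small graphs. The sketch below is only an illustration under stated assumptions: the exact size expressions are not legible in the text above, so it relies on the standard graph-theoretic fact that replacing every edge by a path of 3 edges raises the minimum vertex cover by exactly one per edge (and by two per edge for 5-edge paths). The function names and the triangle example are hypothetical.

```python
from itertools import combinations


def min_vertex_cover(nodes, edges):
    """Brute-force size of a minimum vertex cover (small graphs only)."""
    nodes = list(nodes)
    for k in range(len(nodes) + 1):
        for cand in combinations(nodes, k):
            chosen = set(cand)
            if all(u in chosen or v in chosen for u, v in edges):
                return k
    return len(nodes)


def subdivide(nodes, edges, arcs=3):
    """Replace every edge by a path of `arcs` edges (`arcs` is assumed odd)."""
    new_nodes, new_edges = list(nodes), []
    for i, (u, v) in enumerate(edges):
        inner = [("mid", i, j) for j in range(arcs - 1)]  # fresh inner nodes
        path = [u, *inner, v]
        new_nodes.extend(inner)
        new_edges.extend(zip(path, path[1:]))
    return new_nodes, new_edges


# Hypothetical input graph: a triangle, whose minimum vertex cover has size 2.
tri_nodes = [1, 2, 3]
tri_edges = [(1, 2), (2, 3), (1, 3)]
base = min_vertex_cover(tri_nodes, tri_edges)
with_3_arcs = min_vertex_cover(*subdivide(tri_nodes, tri_edges, arcs=3))
with_5_arcs = min_vertex_cover(*subdivide(tri_nodes, tri_edges, arcs=5))
```

On the triangle, the 3-arc version is a 9-cycle and the 5-arc version is a 15-cycle, so the three values differ by the number of edges and by twice the number of edges, respectively. The "OR-property" can be checked the same way: on a single 3-arc path, once either end point (or both) is removed, one inner node covers what remains.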
Formalization of IJPs. We next use this idea to define a particular canonical database instance which we call “Independent Join Path.” We conjecture that whenever a query has such a canonical database, then resilience is hard by a proof that generalizes the idea from above. We give the formal definition here and provide intuition for each of the conditions in Appendix C. In the following, we write to denote the subvector of that retains only the entries indexed by . For example if and then
Definition 1 (Independent Join Path).
A database forms an Independent Join Path for query if the following conditions hold:
- (1) There is a relation containing at least two tuples and with and .
- (2) In , and each participate in exactly one witness of . Both and have exactly tuples, where is the number of atoms in .
- (3) There is no endogenous relation containing a tuple with or .
- (4) If there is an exogenous relation containing a tuple with for some , then also contains with .
- (5) Let be the resilience of on : . Then the resilience is in all 3 cases of removing either , or , or both.
Conjecture 2 (IJPs imply hardness).
If there is a database that forms an IJP for a query , then is NP-complete.
The conjecture. For the fragment of CQs we are considering in this paper, we have been able to simplify some hardness proofs, which at times use very different constructions (reductions from VC, 3-SAT, Max 2-SAT), by looking at our existing hardness proofs and identifying IJPs in our existing gadgets.
We conjecture that the existence of IJPs for a query is also a necessary condition for hardness, that there is an algorithm to verify whether a query can form IJPs or not, and that the fact that a query cannot form IJPs (such as linear SJ-free CQs) translates immediately into a PTIME algorithm for solving .
10. Related work
In prior work (Freire et al., 2015), we identified the concept of a triad, a novel structure that allowed us to fully characterize the complexity of resilience (and consequently of deletion propagation) for the class of self-join-free conjunctive queries with potential functional dependencies. Our work in this paper considers self-joins, which have long plagued the study of many problems in database theory; results for such queries have been few and far between.
Deletion propagation and view updates. The problem of resilience is a special case of deletion propagation, focusing on Boolean queries. Deletion propagation generally refers to non-Boolean queries. Given a non-Boolean query and database , the typical goal is to determine the minimum number of tuples that must be removed from , so that a tuple is no longer in the query result (Buneman et al., 2002; Dayal and Bernstein, 1982) (source side-effects). Variants of deletion propagation consider side-effects in the query result rather than the source (Kimelfeld et al., 2012; Kimelfeld, 2012), and multi-tuple deletions (Cong et al., 2012; Kimelfeld et al., 2013). Resilience and deletion propagation are special cases of the view update problem (Bancilhon and Spyratos, 1981; Cong et al., 2012; Cosmadakis and Papadimitriou, 1984; Dayal and Bernstein, 1982; Fagin et al., 1983; Gottlob et al., 1988; Keller, 1985), which consists of finding the set of operations that should be applied to the database in order to obtain a certain modification in the view.
Causality and explanations. Database causality is geared towards providing explanations for query results, but typically relies on the concept of responsibility (Meliou et al., 2010, 2011), which is harder than resilience. The idea of interventions appears in other explanation settings, but these often apply to queries instead of the data (Roy and Suciu, 2014; Wu and Madden, 2013; Roy et al., 2015). Finally, the problem of explaining missing query results (Chapman and Jagadish, 2009; Herschel and Hernández, 2010; Huang et al., 2008; Herschel et al., 2009; Tran and Chan, 2010) is a problem analogous to deletion propagation, but in this case, we want to add, rather than remove, tuples from the view.
Provenance and view updates. Data provenance studies formalisms that can characterize the relation between the input and the output of a given query (Buneman et al., 2001; Cheney et al., 2009; Cui et al., 2000; Green et al., 2007). “Why-provenance” is the provenance type most closely related to resilience. The motivation behind Why-provenance is to find the “witnesses” for the query answer, i.e., the tuples or group of tuples in the input that can produce the answer. Resilience, in contrast, seeks a minimum set of input tuples whose removal makes a query false.
11. Final Remarks
In this paper, we studied the problem of resilience for conjunctive queries with self-joins. We identified fundamental query structures that impact hardness, and proved a complete dichotomy for the restricted class of single-self-join binary CQs where exactly two atoms can correspond to the same relation.
We also present results for the case of binary CQs with a single self-join relation that appears in atoms, and identify some open problems and challenges towards completing the dichotomy for this class (Section 8).
Our work also presents a roadmap for tackling the analysis of more extended query families. Section 9 points towards a possible generalization of our results to all classes of self-join queries, by using a unifying criterion that we call Independent Join Paths.
Overall, our work in this paper contributes important progress in the theoretical analysis of self-joins, which has long been stalled for many related problems. We hope that our results, even though they apply to a restricted class, will provide the foundations to help solve the general case for CQs with self-joins in the future.
Appendix A Detailed proofs
A.1. Proofs for Section 3.1
Proof of Proposition 2.
A database with unary and binary is simply a directed graph. For a directed graph , we can create a database instance where for each node , we add tuple in , and for each edge , we add tuple in . Furthermore, iff graph has at least one edge. Note that any vertex cover of has a corresponding set of tuples in , and it is easy to see that .
More precisely,
[TABLE]
Therefore, is NP-complete.
∎
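The correspondence between vertex covers and contingency sets in this proof can be checked by brute force on a small graph. The sketch below makes several assumptions that are not visible in the text above: the query is taken to have the shape `q :- V(x), E(x,y), V(y)`, the relation names `V` and `E` are hypothetical stand-ins for the elided symbols, and all tuples are treated as endogenous. Deleting an edge tuple never beats deleting one of its end points, so the two minima coincide.

```python
from itertools import combinations


def witnesses(V, E):
    """Witnesses of the assumed query q :- V(x), E(x,y), V(y), as sets of tuples."""
    return [{("V", a), ("E", a, b), ("V", b)}
            for (a, b) in E if a in V and b in V]


def resilience(V, E):
    """Brute-force size of a minimum contingency set (small inputs only)."""
    tuples = [("V", a) for a in sorted(V)] + [("E", a, b) for (a, b) in E]
    ws = witnesses(V, E)
    for k in range(len(tuples) + 1):
        for cand in combinations(tuples, k):
            gamma = set(cand)
            if all(w & gamma for w in ws):  # every witness loses a tuple
                return k


def min_vertex_cover(nodes, edges):
    """Brute-force size of a minimum vertex cover, ignoring edge direction."""
    nodes = sorted(nodes)
    for k in range(len(nodes) + 1):
        for cand in combinations(nodes, k):
            chosen = set(cand)
            if all(u in chosen or v in chosen for u, v in edges):
                return k


# Hypothetical directed graph: a 4-cycle with one chord.
graph_nodes = {1, 2, 3, 4}
graph_edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
res = resilience(graph_nodes, graph_edges)
vc = min_vertex_cover(graph_nodes, graph_edges)
```

For this graph, the nodes 1 and 3 cover every edge and no single node does, so both values are 2.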
Proof of Proposition 3.
We reduce 3SAT to . Let be a 3CNF formula with variables and clauses . We map any such to a pair where is a database satisfying , and
[TABLE]
Figure 10 shows part of consisting of the gadgets for where in this example, . The nodes correspond to tuples in and there is a directed edge between any two nodes that form a witness for . The variable gadgets are cycles of length whose minimum contingency sets are the set of blue nodes indicating the variable is assigned true, or the set of red nodes, indicating the variable is assigned false. The 9-node clause gadgets have minimum contingency sets of size 5 when the clause is assigned true, and 6 otherwise. ∎
A.2. Proofs for Section 3.3
Proof of Proposition 5.
We first argue that -tuples are not the optimal choice for a contingency set. Let be a minimum contingency set containing tuple .
Case 1: contains only or but not both. WLOG, suppose it contains only . We can then obtain a contingency set of size . The argument is similar if it contains only .
Case 2: contains both and . Consider and , and suppose that neither of those is a contingency set. Then we have in and in . However, the existence of those witnesses implies that has the witness contradicting the fact that is a contingency set. Therefore, at least one of must be a contingency set and we can replace by or .
Since can be made exogenous, solving resilience for this query is the same as solving vertex cover in a bipartite graph, and therefore is in P. ∎
Proof of Proposition 6.
For a linear sj-free query, we can represent its resilience problem as a network flow making each endogenous tuple an edge of weight 1. Each flow is a witness and the min-cuts are exactly the minimum contingency sets (see (Meliou et al., 2010) for details). It is not clear what to do with repeated relations because there is no obvious way to add to a standard network flow algorithm an extra constraint that two or more edges represent the same tuple, and can thus be removed together at the reduced cost of only 1.
To handle , consider an input database with and tuples. We refer to -tuples that have an inverse as 2-way tuples, and the ones that don’t as 1-way tuples. We construct a flow graph by creating 1-weight edges for all tuples , and 1-weight edges for pairs of 2-way tuples. There are -weight edges for all tuples , where is the source, -weight edges if and only if or there is a 1-way tuple or , and -weight edges for pairs of 2-way tuples, where is the target. Note that 1-way tuples are never the optimal choice, since we can always pick an -tuple instead, so they have infinite weight in the flow graph. Below we refer to the tuple that the edges represent, instead of the edge itself.
We show that from the min-cut, , of the flow graph, we can construct a minimum contingency set, , as follows: contains all the ’s from . For each edge , we add one of or to as follows: If but then we add to . Symmetrically, if but then we add to ; otherwise, arbitrarily add one or the other.
We claim that the resulting is a minimum contingency set. Because it comes from a min-cut, it suffices to show that is a contingency set, i.e., . Suppose for the sake of a contradiction, that has a witness , , , , i.e., some tuple, , occurs twice in the join. This is impossible because since , at least one of or must be in .
The other possible witness is . Note that if is a 1-way tuple, then this witness would be a flow contradicting the fact that is a cut. Thus, is a 2-way tuple. Since can’t be a flow, the pair must be in .
Since was not chosen in , it must be that . This means that there is still a flow from to , so was not a cut. ∎
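The baseline mentioned at the start of this proof, where a linear sj-free query is solved by a standard min-cut, can be sketched as follows. This is an illustration under stated assumptions, not the construction for the self-join query handled above: the query `q :- A(x), R(x,y), B(y)`, its relation names, and the function names are hypothetical, all tuples are taken to be endogenous, and max-flow is computed with a plain Edmonds-Karp search. Every tuple becomes a unit-capacity edge, every source-to-sink path is a witness, and the max-flow value equals the size of a minimum contingency set.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp on a dict-of-dicts capacity map; returns the max-flow value."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in list(residual[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def resilience_linear(A, R, B):
    """Resilience of the assumed query q :- A(x), R(x,y), B(y) via min-cut."""
    cap = defaultdict(dict)
    for a in A:
        cap["s"][("x", a)] = 1          # tuple A(a)
    for a, b in R:
        cap[("x", a)][("y", b)] = 1     # tuple R(a,b)
    for b in B:
        cap[("y", b)]["t"] = 1          # tuple B(b)
    return max_flow(cap, "s", "t")


# Hypothetical database with three witnesses.
res_linear = resilience_linear({1, 2}, {(1, 1), (1, 2), (2, 2)}, {1, 2})
```

In the sample database no single tuple occurs in all three witnesses, while deleting `A(1)` and `A(2)` removes them all, so the value is 2. As the proof notes, this encoding breaks down once two edges must stand for the same tuple, which is why the self-join case needs the modified flow graph.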
A.3. Proofs for Section 4.2
Proof of Lemma 1.
First observe that disconnected components join as a cross-product, so for a query to be made false it is enough that at least one of its query components is made false. Hence, for each query component , if , then , which then implies . ∎
Proof of Lemma 2.
This is easy to see because the resilience problem for consists of the union of the independent resilience problems for its components. If is NP-complete, then we can take a database that has the relevant instance of and all the other components can be extremely resilient, so a minimum contingency set is always a subset of ’s component. Conversely, if each is in P, then to solve the minimum contingency set problem, we find the minimum contingency set of each component, and the global minimum is simply the minimum of these minima. ∎
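The "minimum of the minima" claim can be sanity-checked by brute force. In the sketch below, each component is given directly as its list of witnesses (sets of tuple identifiers); the identifiers, the function names, and the assumption that components share no tuples are illustrative choices rather than notation from the paper. Witnesses of the whole query are formed as the cross-product of the component witnesses.

```python
from itertools import combinations, product


def hitting_number(witnesses):
    """Brute-force size of a smallest tuple set that intersects every witness."""
    tuples = sorted(set().union(*witnesses)) if witnesses else []
    for k in range(len(tuples) + 1):
        for cand in combinations(tuples, k):
            chosen = set(cand)
            if all(w & chosen for w in witnesses):
                return k


def combined_witnesses(components):
    """Witnesses of a disconnected query: one witness from each component."""
    return [frozenset().union(*combo) for combo in product(*components)]


# Two hypothetical components with resilience 2 and 3, respectively.
comp1 = [{"a1"}, {"a2"}]
comp2 = [{"c1"}, {"c2"}, {"c3"}]
whole = hitting_number(combined_witnesses([comp1, comp2]))
per_component = [hitting_number(comp1), hitting_number(comp2)]
```

Here falsifying the first component alone costs 2 and the second costs 3, and the brute-force value for the combined query matches the smaller of the two.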
A.4. Proofs for Section 4.3
Proof of Proposition 5.
We show that tuples from dominated relations don’t need to be used in minimum contingency sets. Assume is a connected query and let be a minimum contingency set of in .
Suppose that relation dominates relation and there is some tuple that is in . Tuple can participate in joins as one or more of the -atoms in . Let’s call those atoms , for . Our definition of domination guarantees that there exists an atom for each atom such that the projection of onto always produces the same tuple . Then we can replace by and we remove at least as many witnesses if .
As a result we show the complexity of is the same if is made exogenous and therefore . ∎
A.5. Proofs for Section 5
Proof of Lemma 3.
Let be a database. We map to by marking all the tuples according to which variables they refer to in witnesses of . For each witness assigning the variables of to domain values (), we add the tuples to , where occurs in . In particular, if was replaced by to obtain , results in adding the tuple to .
For example, consider that atom was replaced by atom . If is part of a witness , we have . Then is included in .
Since the variables mark the tuples in , the new self-joins have no effect: if the subscripted variables are in a tuple of in , then it came from a tuple of in . It then follows that there is a 1:1 correspondence of contingency sets for and . We need the minimality of , because if there were an assignment where when , this would correspond to a reassignment of the variables, to a proper subset, so that some would be doing “double duty”. This would mean that a proper subset of implies , i.e, is not minimal.
∎
A.6. Proofs for Section 5.1
Proof of Proposition 5.
The proofs essentially follow the same strategy used to reduce 3SAT to with a few adjustments to handle the self-joining relation and also the variable order, which is relevant in some cases. See Lemma 1 and Lemma 2 for the details. ∎
Lemma 1.
and are NP-complete.
Proof of Lemma 1.
We first show that is NP-complete by a reduction from 3SAT, similar to the one used to prove is NP-complete (Proposition 1).
Let be a 3CNF formula with variables and clauses . Our reduction will map any such to a pair where is a database satisfying , and
[TABLE]
In our construction, if , then the size of each minimum contingency set for in will be , whereas if , then the size of all contingency sets for in will be greater than .
We construct by taking from the proof of Proposition 1, and adding the following tuples for each witness in :
[TABLE]
Notice that for each witness in we thus create 3 witnesses, , , in but they all use the same -tuples.
We know from Proposition 1 that some -tuples participate in 2 witnesses (triangles) and some only in 1 within a variable gadget. Thus, in these numbers are 6 witnesses or 3 witnesses. Observe that -tuples participate in at most 2 witnesses each, so it is never better to choose an -tuple instead of an tuple. Therefore it follows that the same choice of tuples for the minimum contingency set for will also work for by choosing the corresponding -tuples in based on the -tuples chosen from .
For the reduction is similar, but the final atom – instead of – must be handled. The solution is that for each witness in , we add the following tuples to :
[TABLE]
Now, each witness from leads to 6 witnesses in – the three from the above proof plus their reversals. Thus, the -tuples for solid edges from Figure 16 are used in 6 witnesses each, whereas -tuples are in at most 4 witnesses each. Thus, based on the minimum contingency sets for , we create minimum contingency sets for by including the corresponding -tuples and their reversals. ∎
Lemma 2.
, and are NP-complete.
Proof of Lemma 2.
The same idea used above to prove that is hard, works for query . When defining for this case, we just need to add the appropriate -tuples:
[TABLE]
Since -tuples have the same properties as the -tuples, they are never better choices than -tuples and we can obtain a minimum contingency set with only -tuples, as we saw in Lemma 1 above. Similar reductions thus follow for and . ∎
A.7. Proofs for Section 5.2
Proof of Theorem 6.
This mostly follows from the fact that triads make sj-free queries hard and adding self-joins to a hard query keeps it hard (Lemma 6, Lemma 3).
The case we haven’t covered yet is where the triad in involves self-join relations which would be dominated and thus exogenous in the corresponding sj-free query. Examples are self-join variations of and which are hard even though – because of domination – their sj-free cases are easy (Proposition 5).
We now follow and extend the proof of Lemma 6 when has a triad, , even though if did not include a self join, one or more of its members would be dominated. In Case 1, , , are pairwise disjoint. Here the reduction from to goes through exactly as in the proof of Lemma 6. We can choose a single relevant variable for each , so no domination is possible. Any minimum contingency set consists of elements of , or , and the reduction from goes through.
In Case 2, where are not pairwise disjoint, we have to consider a partition of the variables into 7 pieces (Eqn. 6 from the proof of Lemma 6). As argued there, there is still a 1:1 correspondence between witnesses of and witnesses of .
If there are no (endogenous) relations containing just the , or variables, then the reduction from goes through. If there is a relation containing just , then we instead use the same reduction but from the appropriate self-join variation of . If there are relations containing just and but not , then we get a reduction from the appropriate self-join variation of . If there are relations for , and , then these form an sj-free triad and thus we already know that is hard. ∎
A.8. Proofs for Section 5.3
Proof of Theorem 7.
We are given , a CQ with no triad. Let be the number of groups of endogenous atoms in , where we put two atoms in the same group iff they contain exactly the same variables, so and belong in the same group, but and do not. We refer to the groups of endogenous atoms as .
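The grouping step can be illustrated with a minimal sketch. It assumes that every atom passed in is endogenous and represents an atom as a pair of a relation name and a tuple of variables; the function name `group_atoms` and the example query are hypothetical and do not come from the paper.

```python
from collections import defaultdict

def group_atoms(atoms):
    """Group atoms that contain exactly the same set of variables.

    atoms: list of (relation_name, tuple_of_variables) pairs, all assumed
    to be endogenous (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for name, variables in atoms:
        # Variable order and repetition are ignored: only the set matters.
        groups[frozenset(variables)].append((name, variables))
    return list(groups.values())

# Hypothetical query: R(x,y), R(y,x), A(x), S(y,z)
atoms = [("R", ("x", "y")), ("R", ("y", "x")), ("A", ("x",)), ("S", ("y", "z"))]
groups = group_atoms(atoms)
group_sizes = sorted(len(g) for g in groups)
```

Here the two `R`-atoms fall into one group because both use exactly the variables `x` and `y`, while `A(x)` and `S(y,z)` each form their own group.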
Since is connected but has no triad, for any pair , either these atoms are connected directly in , or they are connected via at least one other group , but both cases cannot occur. If they are connected directly, then they must appear consecutively in an order of the endogenous atoms. Otherwise, must be placed between them. Note that in the latter case, removing the variables of separates the atoms of into two connected components, one containing and the other containing , so we call the separator of .
Now, for any set of endogenous atoms from different groups, when and are already placed along the line, say with to the right of , then it is easy to see where must go. If is the separator, goes to the left of , if is the separator, goes to the right of and if is the separator, then it goes between and , and that’s what guarantees the endogenous atoms are linearly connected. Looking at Figure 9, we see that the endogenous atoms of are arranged linearly.
∎
A.9. Proofs for Section 6
Proof of Theorem 1 (Unary Path).
We define a reduction from . Given a database we want to define a database such that
[TABLE]
We can assume that and are consecutive occurrences of so let be a subquery of consisting of a path from to with no intervening occurrences of . Thus, . Since is the only sj relation, the relations that occur in occur only in .
For each atom occurring in , we define
[TABLE]
where
[TABLE]
In other words, maps to , maps to , and any other variable maps to . Thus, we have made a faithful copy of capturing . For the other atoms, , not in , let
[TABLE]
where matches with as well as with a set of new values, where . It follows that there is always a minimum contingency set for with only -tuples, in particular, the sets $\{A(a) \mid V(a) \in \Gamma\}$ for any minimum contingency set for . ∎
Proof of Theorem 2 (Binary Path).
Similar to the unary case, we define a reduction from . Given a database we want to define a database such that
[TABLE]
Consider , and that is a subquery of consisting of a path from to with no intervening occurrences of . By assumption, there is no path of just ’s from to , so we may assume that and have such an -free path, , between them.
In order to define the reduction, we define an equivalence relation, , on the variables occurring in , namely iff has an -path from to , i.e., there is a path of -atoms occurring in that takes us from to . (For example, for the query , the equivalence classes of are .) Note that by assumption, for the equivalence relation defined by , .
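As a rough illustration, the equivalence classes can be computed with a union-find structure. The sketch assumes that the paths in question use binary atoms of the self-join relation (called `R` here) and that a path may traverse an atom in either direction, which is what makes the relation symmetric; the function name and the example query are hypothetical.

```python
def equivalence_classes(variables, r_atoms):
    """Partition variables into classes connected by binary R-atoms.

    variables: iterable of variable names.
    r_atoms: list of (u, v) pairs, one per atom R(u, v) of the query.
    Atoms are treated as undirected edges (an assumption of this sketch).
    """
    parent = {v: v for v in variables}

    def find(v):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in r_atoms:
        parent[find(u)] = find(v)  # merge the two classes

    classes = {}
    for v in parent:
        classes.setdefault(find(v), set()).add(v)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical query with atoms R(x,y), R(y,z) and one further variable w.
classes = equivalence_classes(["x", "y", "z", "w"], [("x", "y"), ("y", "z")])
```

In this example `x`, `y` and `z` end up in one class, and `w` stays in a class of its own.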
For any atom occurring in , we define
[TABLE]
where
[TABLE]
Additionally, for atoms occurring in , let
[TABLE]
where matches with as well as with a set of new values, where .
We have that all -tuples in will have the same value as first and second attributes, so can be seen as corresponding to relation in . Similar to the unary case, we have made a copy of capturing and there is always a minimum contingency set for with only -tuples, in particular, the sets $\{R(a,a) \mid V(a) \in \Gamma\}$ for any minimum contingency set for . ∎
A.10. Proofs for Section 7.1
These are the expansions of with unary relations:
[TABLE]
We next show all of them are hard queries.
Lemma 3.
 is NP-complete.
Proof of Lemma 3.
For this case we are going to use almost the same reduction as the one used for , just with the added -tuples. Then we argue that there is always a min that only uses -tuples.
Let be a 3CNF formula with variables and clauses . Our reduction will map any such to a pair where is a database satisfying , and
[TABLE]
In our construction, if , then the size of each minimum contingency set for in will be , whereas if , then the size of all contingency sets for in will be greater than .
First, include in all the same -tuples included in the proof of Proposition 3. In addition to that add the following -tuples:
(1) Variable gadget: For each variable and each insert the following two tuples into the database: and .
(2) Clause gadget: For each clause insert the following 6 tuples into the database: , , , , , .
By adding those tuples, we obtain the same structure and witnesses of the reduction for . Now suppose that is in a minimum contingency set . If (or ) for some , we know that must join with (or ) by our construction. Thus, we can exchange for and obtain contingency set . Similarly, if , then must join with tuple , since there is only one tuple of that kind for each possible value of .
This shows that there is a minimum contingency set for without -tuples, and the properties of the reduction in Proposition 3 also hold in this case. ∎
Lemma 4.
, , and are NP-complete.
Proof of Lemma 4.
We again define a reduction from 3SAT, using gadgets similar to the one in Proposition 3. The variable gadget remains such that a minimum cover will choose either blue nodes (variable is set to true), or red nodes (variable is set to false). The clause gadget (black nodes) is chosen as to enforce a clause: if one or more of the outermost black nodes are chosen, then the minimum cover is 5, otherwise 6.
We next reduce 3SAT to . Let be a 3CNF formula with variables and clauses . Our reduction will map any such to a pair where is a database satisfying , and
[TABLE]
In our construction, if , then the size of each minimum contingency set for in will be , whereas if , then the size of all contingency sets for in will be greater than .
(1) Variable gadget: For each variable and each insert the following tuples into the database: , and , . If , then make the superscript 1. The resulting witnesses between the tuples form a cycle of length . The minimum contingency sets are to either choose all tuples representing a variable to have assignment true, or all tuples representing a variable to have assignment false. Note that any -tuple only joins once, therefore it is better to choose an -tuple, since all of these join at least twice.
(2) Clause gadget: For each clause insert the following tuples into the database: , , , , , , , , , , , . The resulting witnesses form a triangle. If either of the is removed, then the remaining witnesses can be destroyed by choosing only 2 or more tuples, otherwise we need 3. Similar to the variable gadget, -tuples are not an optimal choice because they only participate in one witness each.
(3) Connecting the gadgets: For each variable that appears in clause at position 1, add the following tuples: and . If appears as positive add tuple , if it appears as negative add tuple . Analogously use or instead of for positions 2 and 3 instead of position 1.
Observe that if the clause is not satisfied, then we need to choose the -tuples (orange squares in Fig. 11), and not choose the outer black nodes (-tuples) in the clause gadget, resulting in choosing 6 tuples in total in order to delete all the witnesses, otherwise we just need 5 tuples.
The reduction for is very similar to the one presented above. First, use the same just adding the appropriate -tuples, i.e., -tuples that preserve the witnesses.
Now note that for any , there is only one -tuple such that , therefore must join with . Therefore, any occurrence of an -tuple in a contingency set can be exchanged for its corresponding -tuple, and we are guaranteed this reduction has the same properties as the one for . ∎
Lemma 5.
 and are NP-complete.
Proof of Lemma 5.
We define a reduction from 3SAT. As in the previous cases, the variable gadget remains such that a minimum cover will choose either blue nodes (variable is set to true), or red nodes (variable is set to false). The clause gadget (center black nodes) is chosen as to enforce a clause: if one or more of the outermost joins (black edges) are deleted by choosing the corresponding -tuple (orange square), then the minimum cover for the black subgraph is 2, otherwise 3.
We next reduce 3SAT to . Let be a 3CNF formula with variables and clauses . Our reduction will map any such to a pair where is a database satisfying , and
[TABLE]
In our construction, if , then the size of each minimum contingency set for in will be , whereas if , then the size of all contingency sets for in will be greater than .
(1) Variable gadget: For each variable and each insert the following tuples into the database: , and , and , . If , then make the superscript 1. The resulting witnesses between the tuples form a cycle of length . The minimum contingency sets are to either choose all tuples representing a variable to have assignment true, or all tuples representing a variable to have assignment false. If we only consider those tuples, note that - and -tuples participate in only one witness, so the optimal choice is to delete -tuples.
(2) Clause gadget: For each clause insert the following tuples into the database: , , , , , , , , , , , , , , . The resulting witnesses form a triangle. If either of the is removed, then the remaining witnesses can be destroyed by choosing only 2 or more tuples, otherwise we need 3. We later argue that these tuples only need be -tuples.
(3) Connecting the gadgets: For each variable that appears in clause at position 1, add the following tuples: and . If appears as positive add tuple , if it appears as negative add tuple . Analogously use or instead of for positions 2 and 3 instead of position 1.
With our gadget, if the clause cannot be satisfied, then we need to choose all the -tuples (orange diamonds in Fig. 12), since deleting each of them removes two witnesses. In that case, in order to delete the remaining witnesses we need to delete 3 tuples, namely the 3 black nodes in the triangle, resulting in a total deletion of 6 tuples.
We now need to argue that, besides the tuples depicted in Fig. 12, we don’t need other - or -tuples for a minimum contingency set. Assume there is a tuple in a min . Given that , our construction guarantees there is only one -tuple such that , therefore we can have , and is also a minimum contingency set. Similarly, if there is a tuple in , and assuming , there is only one -tuple , and therefore the same follows.
For use almost the same construction as above. We just add the appropriate -tuples and show that there is a minimum contingency set that does not contain those.
Consider as initially defined for . Now we include the appropriate -tuples:
(1) Variable gadget: For each variable and each insert the following tuples into the database: ,
(2) Clause gadget: For each clause insert the following tuples into the database: , , .
(3) Connecting the gadgets: For each variable that appears in clause at position 1, add tuple . Analogously and for positions 2 and 3, respectively.
By adding those -tuples we obtain the same witnesses we saw in the reduction for . With this construction we guarantee that for any tuple , there is either only one tuple or only one tuple , which means we can always choose one of those -tuples instead and obtain another minimum contingency set without -tuples. ∎
Proof of Proposition 2.
Suppose that are the unique -atoms in . Assume first that there are no unary atoms . We define a reduction from to as follows:
Consider a database with and we may assume that there are no loops , since those would have to be in any . We define a new database such that for each atom or occurring in , we define
[TABLE]
where
[TABLE]
Now we want to show
[TABLE]
Notice that this mapping from to preserves the witnesses in . Moreover, there are no new witnesses created where variables are mapped to values that did not correspond to witnesses before. Since is pseudo-linear, no endogenous atom of contains both and . Therefore, any minimum contingency set for is also a minimum contingency set for . This completes our reduction.
Now, if any subset of unary relations does appear in , then we define a reduction from the appropriate unary expansion of . The same mapping used above to define from preserves all minimum contingency sets, as desired. ∎
A.11. Proofs for Section 7.2
Proof of Proposition 3.
For , let be any database satisfying and let be a witness of satisfying . Note that if occurs in then, by linearity, it must be as an atom immediately to the left of . Furthermore, any such atom may be considered exogenous because it is never better to choose over . Furthermore, if occurs in , then it would be via an atom immediately to the right of . If so, we can assume it is immediately to the left of . In particular, we may assume that neither nor occurs in .
We can write where , , stand for the atoms of , respectively.
Let be a network flow for ignoring the fact that has a self-join. Thus has duplicate edges for its -tuples, i.e., for each there are two edges, in . Assume that each edge corresponding to an endogenous, resp. exogenous tuple has weight 1, resp. .
Let be a min cut for . Let be the corresponding set of atoms of , where any edges are replaced by the atom . Observe that since there is no flow through , is a contingency set for .
We claim that in fact is a minimum contingency set for . The key idea is the following:
Lemma 6.
Let be a minimal cut of . Then does not include more than one instance of any tuple.
Proof.
Suppose to the contrary, that is a minimal cut for and contains both and . Since is minimal, it follows that and both contain flows:
and
. But then contains the flow
, contradicting the fact that is a cut. See Fig. 13 for a depiction in the graph. ∎
Now, let be any contingency set. We claim that is the same size as some cut of . To see this, let us first let be the result of replacing each atom with both possible edges, in . Since is a contingency set, it follows that is a cut of . Now, let be a minimal subset of that is still a cut, where some of the extra -edges, i.e., either or have been removed.
By the proof of Lemma 6, we know that has only one edge for each atom . Thus, as claimed. It follows that the size of a min cut of is the same as the size of a minimum contingency set for . ∎
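The basic flow encoding that this argument builds on can be sketched for a much simpler setting. The example below assumes a hypothetical self-join-free linear query `q :- A(x), R(x,y), B(y)` in which every tuple is endogenous. It does not reproduce the duplicate-edge handling for the self-join relation from the proof above. Each tuple becomes a capacity-1 edge, each witness becomes a source-to-sink path, and the max-flow value (equal to the min-cut size) is compared against a brute-force search.

```python
from collections import defaultdict, deque
from itertools import combinations

def max_flow(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts of residual capacities (mutated)."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:  # recover the path edges
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck

def resilience_by_flow(A, R, B):
    """Min cut of the layered graph s -> x-values -> y-values -> t."""
    cap = defaultdict(lambda: defaultdict(int))
    for a in A:
        cap["s"][("x", a)] += 1          # tuple A(a)
    for a, b in R:
        cap[("x", a)][("y", b)] += 1     # tuple R(a, b)
    for b in B:
        cap[("y", b)]["t"] += 1          # tuple B(b)
    return max_flow(cap, "s", "t")

def resilience_brute_force(A, R, B):
    """Smallest set of tuples that hits every witness (exponential time)."""
    witnesses = [{("A", a), ("R", (a, b)), ("B", b)}
                 for (a, b) in R if a in A and b in B]
    tuples = sorted({t for w in witnesses for t in w}, key=repr)
    for k in range(len(tuples) + 1):
        for gamma in combinations(tuples, k):
            if all(w & set(gamma) for w in witnesses):
                return k

A, R, B = {1, 2}, {(1, 1), (1, 2), (2, 2)}, {1, 2}
flow_value = resilience_by_flow(A, R, B)
brute_value = resilience_brute_force(A, R, B)
```

An exogenous relation would get infinite capacity instead of 1, so that a minimum cut never selects its edges; the sketch omits this case to keep the arithmetic on plain integers.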
A.12. Proofs for Section 7.3
Proof of Proposition 6.
We define a reduction from 3SAT to , see Figure 14. Similar to the previous cases, we want to create variable gadgets such that a minimum cover will choose either blue nodes (variable is set to true), or red nodes (variable is set to false), and a clause gadget (black nodes) such that if the clause is satisfied, then the minimum cover is 5, otherwise 6.
Let be a 3CNF formula with variables and clauses . Our reduction will map any such to a pair where is a database satisfying , and
[TABLE]
In our construction, if , then the size of each minimum contingency set for in will be , whereas if , then the size of all contingency sets for in will be greater than .
(1) Variable gadget: For each variable and each insert the following tuples into the database: , , , and , , , . If , then make the superscript 1.
We want to join those tuples such that the minimum contingency sets are to either choose all tuples representing a variable to have assignment true, or all tuples representing a variable to have assignment false, plus some -tuples. To obtain that property, we need the following additional tuples: , , , and , , , .
With this construction we guarantee that we can “cover” the variable gadget by choosing either all positive -tuples plus the tuples , or all negative -tuples plus the tuples . In both cases, we choose tuples.
(2) Clause gadget: For each clause insert the following tuples into the database: , , , , , , , , , , , and , , , , , , , , , , , and
For this gadget, we have 3 options to choose only 5 tuples in order to delete all the witnesses. For example: , , , .
(3) Connecting the gadgets: For each variable that appears in clause at position 1, add the following tuples: if appears as positive, and if it appears as negative. Analogously use or instead of for positions 2 and 3 instead of position 1.
After connecting the variable gadgets with the clause gadgets, the witnesses are formed such that if a clause cannot be satisfied, then we need to pick all - and -tuples from the clause gadget (the black triangle), totaling 6 tuples. Otherwise, we can delete all witnesses by picking 5 tuples, namely 2 pairs of -tuples and one -tuple. ∎
Proof of Proposition 7.
There are 2 cases.
Case 1: is not bound. We can write where does not contain the variable . includes and may include exogenous atoms containing the variable . Think of as the rightmost group in Figure 9.
For any database, , is equivalent to the following Network Flow. As usual, each endogenous atom from the pseudo-linear becomes a 1-weight edge and each exogenous atom is an -weight edge. Whenever , we add -weight edges from the rightmost output of and to and a 1-weight edge from to the terminal node, .
Case 2: is bound. We can write where includes and may include an essentially exogenous atom if that occurs in . The relevant issues are that removing separates from and these contain at least one endogenous atom each.
We define a reduction from to . We say that variable , if occurs in . Otherwise, .
Now consider a database with . We define a new database such that for each atom or occurring in , we define
[TABLE]
where
[TABLE]
It is clear that the witnesses and minimum contingency sets of are exactly preserved in . ∎
A.13. Proof for Section 7.4
Proof of Proposition 8.
First consider . Given a database such that , witnesses can be of two forms:
[TABLE]
From that, we can conclude that no tuple with needs to be in a contingency set, since we can choose either or instead. Thus, we can construct a network flow that doesn’t include tuples and solve resilience for . Note that when we consider any expansion of that is pseudo-linear, we always have that with is not needed in a minimum contingency set. This property together with the assumption that query is pseudo-linear, allows for a construction of a network flow to solve resilience. Therefore, is in P. ∎
A.14. Proof for Section 7.5
Proof of Theorem 9.
If has a triad, then is NP-complete by Theorem 6. By Theorem 7, we only need to consider the cases where is pseudo-linear.
In this case, if has a path (Theorem 1, Theorem 2), then is NP-complete. Paths cover all the queries where -atoms do not share a variable, including cases with variable repetition. It remains to characterize the complexity of the queries where -atoms share at least one variable. Note that chain, permutation, and confluence are the only three possible patterns for a query with exactly two -atoms and no variable repetition.
If has a chain , then is NP-complete (Proposition 2). If has a permutation, then is NP-complete when the permutation is bounded, and it is in P, when the permutation is unbounded (Proposition 7). These are the only two possible ways a permutation can occur. If has a confluence, then is NP-complete when there is an exogenous path, and it is in P otherwise (Proposition 4).
Now we only have left the case where has variable repetition and the -atoms share a variable, which implies is in P (Proposition 8).
Since we have exhausted all the cases to consider, we show that there is a dichotomy for the class of ssj binary queries with only two -atoms. ∎
A.15. Proofs for Section 8.1
Proof of Proposition 1.
We define a reduction from to , using a strategy similar to the proof of Theorem 1. ∎
A.16. Proofs for Section 8.2
Proof of Proposition 2.
We reduce Max 2-SAT to . Given a 2CNF formula, , with variables and clauses, and a number , we produce a database, , and bound , such that has an assignment satisfying at least clauses iff . The construction is drawn in Figure 15. A sample variable gadget for variable is shown. The two minimum contingency sets consist of nodes, plus 2 helper nodes in the two crossover gadgets or nodes, plus 2 helper nodes, corresponding to variable being true or false, respectively. The reason for the crossover is so that each variable can be instantiated via diamonds and hexagons corresponding to the atoms , respectively.
The clause gadgets for clauses of size 1 and size 2 are also drawn. Clauses of size 1 need no nodes chosen when they are true and one node otherwise. Clauses of size 2 need 1 node chosen when they are true and 2 when they are false. Let be the number of clauses of size 2 in . Saying that at least clauses of are true means that at most clauses are false. Thus, the size of the minimum contingency set is . ∎
Proof of Proposition 4.
First observe that any contingency set contains only -tuples, since are dominated and therefore exogenous. For any tuple , if , then must be in all contingency sets, since those 3 tuples form a witness. Let be the set of all such tuples. We then proceed to create a flow with tuples and we claim that is a min contingency set for , where is a min cut found by flow.
Let be a min cut and suppose there is a such that and . That implies that there are at least 2 witnesses that can be broken by deleting one tuple but the min cut chose to delete 2 edges. Consider the tuple and these witnesses to be
[TABLE]
Note that with this set of tuples we also have witness
[TABLE]
which cannot be deleted by deleting , contradicting the assumption that it was possible. ∎
A.17. Proofs for Section 8.3
Proof of Proposition 5.
Reduction from . ∎
Proof of Proposition 6.
Reduction from Max 2SAT, similar to the one used for . ∎
A.18. Proofs for Section 8.4
Proof of Proposition 7.
This is similar to Proposition 6. The difference is that while “dominates” the 1-way tuple in , it is not the case that would dominate because there might be many ’s such that , in which case it might be advantageous to choose one instead of many ’s.
We thus modify the flow graph to include all the edges at cost 1 each on the left, all the pairs at cost 1 each on the right. We include -weight edges from any to plus cost 1 edges from to for any 1-way edges .
Let be a min-cost flow and form by including all the ’s and 1-way ’s from together with one of or whenever . Similar to Proposition 6, the rule for which to choose is that if some but no , then add to . Symmetrically, if but no , then add to ; otherwise, arbitrarily add one or the other.
The same argument as in Proposition 6 shows that the resulting is a minimum contingency set. ∎
Proof of Proposition 8.
We reduce 3SAT to . The idea for the variable gadgets is that for a database that contains the tuples , , , , we must choose exactly one or , the first of which will correspond to the assignment to 1, and the second of which, to 0. In full detail, the gadget consists of a chain of these choices, i.e., the union of , , together with all the tuples , , , . For a minimum contingency set over this gadget we may choose all of the and edges (corresponding to gets 1), or all the and edges (corresponding to gets 0).
The clause gadget is similar. If is , then the clause can eliminate two, but not all three pointers to the edges , , after removing 8 tuples. To simplify the explanation, let and for elements . The clause gadget contains the union of the following sets of tuples: , , , , , , , , . The idea is that for each full pair, , exactly one of or must be chosen in the minimum contingency set . is designed so that a contingency set of size 8 exists iff at least one pair from , , has been previously chosen, i.e., iff the clause is true. ∎
Proof of Proposition 9.
We reduce to . Given a database , construct as
[TABLE]
It then follows, that it is always at least as good to put into , rather than . Thus, the minimum contingency sets for correspond exactly to the minimum contingency sets for .
For , even though , there is no obvious reduction between and . However, the same reduction from 3SAT to in Proposition 6 also works for .
For , we can define a reduction from . ∎
A.19. Proofs for Section 8.5
Proof of Proposition 10.
For , a reduction from is enough. Note that tuples with do not need to be in a contingency set.
For , a reduction from Max 2SAT, similar to the one used in Proposition 2, can be used to show NP-hardness. ∎
Appendix B Relevant proofs from sj-free case
Proof of Proposition 4.
Let be a minimum contingency set of in . Suppose that atom dominates atom but there is some tuple . Let be the projection of onto . Then we can replace by and we remove at least as many witnesses that . It follows, as desired, that the complexity of is unchanged if is exogenous, i.e., . ∎
Proposition 1 (Triangle is hard).
 is NP-complete.
Proof of Proposition 1.
We reduce 3SAT to . It will then follow that is NP-complete. Let be a 3CNF formula with variables and clauses . Our reduction will map any such to a pair where is a database satisfying , and
[TABLE]
In our construction, if , then the size of each minimum contingency set for in will be , whereas if , then the size of all contingency sets for in will be greater than .
Note iff it contains three pairs , , . We visualize as a red edge, as a green edge and as a blue edge. Thus each witness that is an RGB triangle. (Notice that the edge direction drawn in Figure 16 corresponds to the variable order in , and analogously for and .) The job of a contingency set for is to remove all RGB triangles.
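To make the notion of removing all RGB triangles concrete, here is a minimal brute-force sketch. It assumes the triangle query has the form `R(x,y), S(y,z), T(z,x)`, with `R`, `S` and `T` standing for the red, green and blue edge relations; these relation names, the function names and the example database are assumptions of the sketch. The search is exponential and only meant for tiny instances.

```python
from itertools import combinations

def triangle_witnesses(R, S, T):
    """All RGB triangles: tuples R(a,b), S(b,c), T(c,a)."""
    witnesses = []
    for (a, b) in R:
        for (b2, c) in S:
            if b2 == b and (c, a) in T:
                witnesses.append(
                    frozenset({("R", (a, b)), ("S", (b, c)), ("T", (c, a))})
                )
    return witnesses

def resilience(R, S, T):
    """Smallest number of tuples whose deletion removes every triangle."""
    witnesses = triangle_witnesses(R, S, T)
    tuples = sorted({t for w in witnesses for t in w}, key=repr)
    for k in range(len(tuples) + 1):
        for gamma in combinations(tuples, k):
            chosen = set(gamma)
            # A contingency set must intersect every witness.
            if all(w & chosen for w in witnesses):
                return k, chosen

# Two triangles that share the red edge (1, 2).
R = {(1, 2)}
S = {(2, 3), (2, 4)}
T = {(3, 1), (4, 1)}
k, gamma = resilience(R, S, T)
```

In this example a single deletion suffices, because the shared red edge participates in both triangles.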
contains one circular gadget for each variable . The circle consists of solid edges, half of them marked and the other half marked (see 16(a) and 16(b)). Note that there are RGB triangles and they can be minimally broken by choosing the edges or the edges. Any other way would require more edges removed. Thus, each minimum contingency set for corresponds to a truth assignment to the variables of . And there will be a minimum contingency set of size iff .
We complete the construction of by adding one RGB triangle for each clause . For example, suppose . The RGB triangle we add consists of a red edge marked , a green edge marked and a blue edge marked (see 16(c)). Note that if the chosen assignment satisfies , then all edges are removed, or all edges are removed, or all edges are removed. Thus the triangle is automatically removed.
How do we create ’s RGB triangle? Remember that we have chosen to contain 2 segments for each clause. We use the th odd-numbered segment of to produce the or used in the clause- triangle. The even numbered segments are not used: they serve as buffers to prevent spurious RGB triangles from being created (In 16(b) we mark these even segments with frowns: they are sad because they are never used).
More precisely, the red -edge from is , the green -edge from is , and the blue -edge from is (see 16(c)).
Now to make this an RGB triangle in , we identify the two -vertices, the two vertices and the two vertices. In other words, ’s -vertex is equal to ’s -vertex , i.e., they are the same element of the domain of . We have thus constructed ’s RGB triangle (see 16(c)).
The key idea is that these identifications can only create this single new RGB triangle because there is no other way to get back to from in two steps. All other identifications involve different segments and so are at least six steps away. Recall that this is the reason why the even-numbered segments in the ’s are not used: this ensures that no additional RGB triangles are created.
Thus, as desired, Equation 3 holds and we have reduced to . ∎
Proposition 2 (Tripod is hard).
 is NP-complete.
Proof of Proposition 2.
We reduce to . It will then follow that is NP-complete. Let be an instance of . We construct an instance of by constructing relations as copies of from . Define as follows:
[TABLE]
Here, stands for a new unique domain value resulting from the concatenation of domain values and . Observe that there is a 1:1 correspondence between the witnesses of and the witnesses of . Thus, every contingency set for in corresponds to a contingency set of the same size for in . Furthermore no minimum from needs to choose tuples from . If were in , then we could replace it by , which suffices to remove all the witnesses removed by . As we will explain later, “dominates” (3). It follows that . ∎
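The 1:1 correspondence of witnesses can be checked on a small instance. The table defining the construction is not reproduced in this extract, so the sketch below rests on assumptions: the triangle query is taken to be `R(x,y), S(y,z), T(z,x)`, the tripod query is taken to be `A(x), B(y), C(z), W(x,y,z)`, the relations `A`, `B`, `C` are copies of `R`, `S`, `T` whose values are the concatenated pairs, and `W` holds one triple per choice of three domain values. The concatenated value is modelled as a Python pair, and `W` is treated as exogenous.

```python
from itertools import combinations, product

def triangle_witnesses(R, S, T):
    out = []
    for (a, b) in R:
        for (b2, c) in S:
            if b2 == b and (c, a) in T:
                out.append(frozenset({("R", (a, b)), ("S", (b, c)), ("T", (c, a))}))
    return out

def triangle_to_tripod(R, S, T):
    """Assumed construction: A, B, C copy R, S, T; W links matching pairs."""
    dom = {v for rel in (R, S, T) for tup in rel for v in tup}
    A, B, C = set(R), set(S), set(T)  # each pair acts as one domain value
    W = {((a, b), (b, c), (c, a)) for a, b, c in product(dom, repeat=3)}
    return A, B, C, W

def tripod_witnesses(A, B, C, W):
    # W-tuples are left out of the witness sets because W is assumed exogenous.
    return [frozenset({("A", x), ("B", y), ("C", z)})
            for (x, y, z) in W if x in A and y in B and z in C]

def min_hitting_set_size(witnesses):
    tuples = sorted({t for w in witnesses for t in w}, key=repr)
    for k in range(len(tuples) + 1):
        for gamma in combinations(tuples, k):
            if all(w & set(gamma) for w in witnesses):
                return k

R = {(1, 2), (1, 5)}
S = {(2, 3), (2, 4), (5, 3)}
T = {(3, 1), (4, 1)}
tri = triangle_witnesses(R, S, T)
trip = tripod_witnesses(*triangle_to_tripod(R, S, T))
```

Both databases have the same number of witnesses and the same minimum contingency set size on this instance, which is what the correspondence predicts.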
Proof of Lemma 6.
Let be a query with triad . We build a reduction from to . Given any that satisfies we will produce a database that satisfies such that for all :
[TABLE]
We will assume that no variable is shared by all three elements of (we can ignore any such variable by setting it to a constant). Our proof splits into two cases:
Case 1: are pairwise disjoint. Our reduction is similar to the reduction from to (2).
We first define the triad relations in :
[TABLE]
Thus, each tuple of, for example, consists of identical entries with value for each pair . Thus, mirror , respectively.
To define all the other atoms of , we first partition the variables of into 4 disjoint sets: . Now for each atom , arrange its variables in these four groups. Then define the atom of as follows:
[TABLE]
For example, all the variables are assigned the value and all the variables are assigned .
By the definition of triad, there is a path from to not using any edges (variables) from . Thus, any witness that which includes occurrences of and must have .
Similarly, a path from to guarantees that is preserved and a path from to guarantees that is preserved. It follows that the witnesses that are essentially identical to the witnesses that (See Fig. 17).
Furthermore, any minimum contingency set only needs tuples from or . For example, if a tuple contains or , then it can be replaced by a tuple from . Thus the sizes of minimum contingency sets are preserved, i.e., Equation 4 holds, as desired. Thus is NP-complete.
Case 2: for some : We generalize the construction from Case 1 as follows. Partition into those unshared, those shared with , and those shared with (Addition is mod 3).
We then assign the relations of the triad as follows:
[TABLE]
Since none of the ’s is dominated, in each case both possible values occur, e.g., and both occur in the tuples of . Thus, as in Case 1, capture , respectively. We now partition into 7 sets as follows. The key idea is that for each assignment of to values in , we will make assignments according to that partition.
[TABLE]
We then define each other atom in to be the following set of tuples, where the only difference between atoms is which of the 7 members of the partition of variables occurs in .
[TABLE]
By the definition of triad, there is a path from to not using any edges (variables) from , i.e., none from . Thus, any witness including occurrences of some of must have . Thus, as in Case 1, the witnesses of are essentially identical to the witnesses of and we have reduced to . ∎
Appendix C Independent Join Paths: details
We give more details on the concept of Independent Join Paths. We start with some intuition by providing examples (Section C.1), state our conjecture, and finish by pointing out how this concept could possibly allow an automated search for hardness proofs (Section C.2), a prospect we are especially excited about.
C.1. IJP Examples
We give here examples of IJPs for various queries and earlier hardness reductions, and provide the intuition for our 4 conditions.
Standard paths. The first example shows that IJPs contain standard paths (Theorem 1) as a special case.
Example 1 ().
Consider our simplest example for an SJ-path implying hardness: from Fig. 2(a). The following database of 3 tuples forms an IJP:
[TABLE]
- (1) We have and with and .
- (2) and each participate in only one witness, which in this case is the same one.
- (3) being unary, there can’t be any other relation with a strict subset of the constants.
- (4) No exogenous relation.
- (5) The resilience , but becomes 0 after removing either or or both.
Triads. The second example shows that any query with a triad can form IJPs. We illustrate with our favorite triangle query.
Example 2 ().
Consider the triangle query as the simplest example of a non-linear SJ-free query containing a triad (see Fig. 1(a)). The following database of 7 tuples forms an IJP:
[TABLE]
- (1) We have and with and .
- (2) only participates in witness , and only participates in witness .
- (3) No other relation has a strict subset of the constants from .
- (4) No exogenous relation.
- (5) The resilience , but becomes 1 after removing either , or , or both.
Figure 18 illustrates the 3 joins forming the IJP. The connection to our idea from Fig. 8(b) now becomes clearer. Also notice that this IJP forms the basic element of our prior hardness proof for triads.
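Condition (5) of this example can be checked mechanically. The following is a minimal sketch, assuming the triangle query has the form R(x,y), S(y,z), T(z,x) and that all tuples are endogenous. The 7-tuple database below is a hypothetical instance chosen to match the description (3 joins, two R-tuples that each occur in one witness); it is not necessarily the database used in the paper. The function names `triangle_witnesses`, `resilience`, and `without` are illustrative only.

```python
from itertools import combinations

def triangle_witnesses(db):
    """Return all tuple triples that satisfy R(x,y), S(y,z), T(z,x)."""
    return [
        (("R", x, y), ("S", y2, z), ("T", z2, x2))
        for (x, y) in db["R"]
        for (y2, z) in db["S"] if y2 == y
        for (z2, x2) in db["T"] if z2 == z and x2 == x
    ]

def resilience(witnesses):
    """Size of a smallest set of tuples that intersects every witness.

    Brute force over subsets of increasing size; exponential, so only
    suitable for gadget-sized databases.
    """
    tuples = sorted({t for w in witnesses for t in w})
    for k in range(len(tuples) + 1):
        for removed in combinations(tuples, k):
            gone = set(removed)
            if all(gone & set(w) for w in witnesses):
                return k

def without(witnesses, *deleted):
    """Witnesses that survive after the given tuples are deleted."""
    return [w for w in witnesses if not set(deleted) & set(w)]

# Hypothetical 7-tuple database (an assumption, see the text above).
db = {
    "R": {(1, 2), (4, 2), (4, 5)},
    "S": {(2, 3), (5, 3)},
    "T": {(3, 1), (3, 4)},
}
W = triangle_witnesses(db)
a, b = ("R", 1, 2), ("R", 4, 5)   # the two endpoint tuples

print(len(W), resilience(W))       # 3 joins, resilience 2
print(resilience(without(W, a)),   # 1
      resilience(without(W, b)),   # 1
      resilience(without(W, a, b)))  # 1
```

In this instance the three joins form a chain: consecutive joins share one tuple, and the two endpoint tuples each appear in exactly one join. Removing an endpoint therefore shortens the chain and lowers the resilience from 2 to 1.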
More complicated IJPs. The third example uses a more complicated IJP.
Example 3 (more complicated gadget).
Consider the query
[TABLE]
Then the following database forms an IJP:
[TABLE]
- (1) We have and .
- (2) only participates in witness and only participates in witness .
- (3) No other relation has a strict subset of the constants from .
- (4) No exogenous relation.
- (5) The resilience with
[TABLE]
but becomes 3 after () removing with
[TABLE]
or () removing with
[TABLE]
or () removing both with
[TABLE]
Figure 19 illustrates how these 21 tuples create 8 different joins, representing the IJP. It turns out that this IJP is “hidden” and can be spotted by the careful reader in the crossover part of the variable gadget used in Proposition 2.
Condition 4. We next give one example that illustrates why we need condition 4 of our definition for IJPs. In particular, this query is an example in which two relations (instead of only one) are repeated. We know through a dedicated proof that the complexity of this query is in PTIME. We illustrate a “failed attempt” to create an IJP and point out the problems that would arise if we ignored condition 4.
Example 4 (Independent paths).
Consider the following query which contains two repeated relations. We investigate the canonical database
[TABLE]
and its ability to form an IJP.
- (1) We have and .
- (2) and participate in only one witness .
- (3) No other relation has a strict subset of the constants from .
- (4) Condition 4 requires that and be added to the database, which is currently not the case, and which we ignore for a moment.
- (5) The resilience is 1, and becomes 0 if any tuple is removed.
The crucial condition 4 forces us to add and to the database. Then conditions 2 and 5 no longer hold. Adding these tuples forms 2 more joins and , which requires both tuples and to be removed to make the query false.
In other words, the canonical database is not enough to succeed with the reduction from VC (recall Fig. 8(b)): any two edges incoming and outgoing from vertex create additional joins.
C.2. Toward an automated proof construction
At its core, each IJP can be considered as a set of “canonical databases” or witnesses, which have been appropriately “aligned.” We give the intuition with the triangle query from Example 2 and Fig. 18.
Example 5.
Assume we construct three disjoint canonical databases:
[TABLE]
The total number of constants used is 9, three for each of the three joins.
We can now look at all the possible ways in which these constants can be partitioned into nonempty subsets. The number of such partitions is given by the Bell number B_n, which is 21147 for n = 9. Exhaustive enumeration over these 21147 cases will also lead to the partition
[TABLE]
which is isomorphic to the IJP from Fig. 18.
Our Definition 1 now provides a procedure to test that the resulting database indeed forms an IJP.
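The count of 21147 candidate partitions can be reproduced with a few lines of code. The sketch below computes Bell numbers with the Bell triangle and enumerates all partitions of the 9 constants; each block of a partition stands for a group of constants that are identified with each other. The function names `bell` and `set_partitions` are illustrative and not taken from the paper.

```python
def bell(n):
    """Bell number B_n, computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]              # each row starts with the last entry of the previous row
        for v in row:
            nxt.append(nxt[-1] + v)  # entry = left neighbour + entry above the left neighbour
        row = nxt
    return row[0]

def set_partitions(items):
    """Yield every partition of the list `items` into nonempty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # place `first` into one of the existing blocks ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + part

constants = list(range(1, 10))       # 9 constants from three disjoint joins
print(bell(9))                       # 21147
print(sum(1 for _ in set_partitions(constants)))  # 21147
```

Since the search space grows with the Bell numbers, exhaustive enumeration is only realistic for a small number of joins.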
The more general procedure is now as follows:
- (1) for an increasing number of joins
- (2) for all possible partitions
- (3) for all pairs of tuples of the same relation that are not dominated
- (4) if an exogenous tuple contains a subset of the constants, then possibly add a second tuple
- (5) calculate the minimal VC of the resulting hypergraph under the 4 cases , where 0 and 1 mean that a tuple is present or absent, respectively.
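Step (5) of this procedure can be sketched as follows. We assume that the witnesses of a candidate database are already available as sets of tuple labels (the hyperedges) and that a pair of candidate endpoint tuples has been chosen in step (3). The code is a brute-force illustration under these assumptions and does not check the remaining conditions of Definition 1; the names `min_vc_size` and `endpoint_cases` and the tuple labels are hypothetical.

```python
from itertools import combinations

def min_vc_size(hyperedges):
    """Size of a minimum vertex cover (hitting set) of a hypergraph, by brute force."""
    vertices = sorted(set().union(*hyperedges)) if hyperedges else []
    for k in range(len(vertices) + 1):
        for cover in combinations(vertices, k):
            chosen = set(cover)
            if all(chosen & e for e in hyperedges):
                return k

def endpoint_cases(witnesses, a, b):
    """Minimum VC size for the 4 presence/absence cases of the tuples a and b.

    Keys are pairs (flag_a, flag_b), where 0 means the tuple is present
    and 1 means it is absent, following the convention in the text.
    """
    result = {}
    for flag_a in (0, 1):
        for flag_b in (0, 1):
            absent = {t for t, flag in ((a, flag_a), (b, flag_b)) if flag}
            # a witness disappears as soon as one of its tuples is absent
            remaining = [w for w in witnesses if not (absent & w)]
            result[(flag_a, flag_b)] = min_vc_size(remaining)
    return result

# Hypothetical chain of three joins with endpoint tuples "a" and "b".
witnesses = [
    frozenset({"a", "s1", "t1"}),
    frozenset({"r2", "s1", "t2"}),
    frozenset({"b", "s2", "t2"}),
]
cases = endpoint_cases(witnesses, "a", "b")
print(cases)  # {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 1}
```

A candidate passes this step when the cover size drops by the same amount in the three cases where at least one endpoint is absent, as in the triangle example.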
