Computational Complexity of Queries Based on Itemsets
Nikolaj Tatti

HIIT Basic Research Unit, Laboratory of Computer and Information Science, Helsinki University of Technology, Finland
Abstract
We investigate determining the exact bounds of the frequencies of conjunctions based on frequent sets. Our scenario is an important special case of some general probabilistic logic problems that are known to be intractable. We show that despite the limitations our problems are also intractable, namely, we show that checking whether the maximal consistent frequency of a query is larger than a given threshold is NP-complete and that evaluating the Maximum Entropy estimate of a query is PP-hard. We also prove that checking consistency is NP-complete.
keywords:
Computational Complexity, Data Mining, Itemset
1 Introduction
Assume that we have two events, say $a$ and $b$. Assume further that their probabilities $P(a)$ and $P(b)$ are known. What can we say about the probability of $a \wedge b$? We know that the probability must lie within $[\max(0, P(a) + P(b) - 1), \min(P(a), P(b))]$. This interval is tight: For each $\alpha$ in the interval there is a distribution having $\alpha$ as the probability of $a \wedge b$. Also note that the Maximum Entropy estimate in this case is $P(a)P(b)$.
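The two-event case can be checked numerically. The following is a minimal sketch, assuming that only the two marginal probabilities are known; the function names and the numeric values are illustrative and do not come from the paper.

```python
from fractions import Fraction

def conjunction_bounds(pa, pb):
    """Tight interval for P(a and b) when only P(a) and P(b) are known."""
    lower = max(Fraction(0), pa + pb - 1)
    upper = min(pa, pb)
    return lower, upper

def maxent_estimate(pa, pb):
    """With only the two marginals fixed, Maximum Entropy treats a and b as independent."""
    return pa * pb

# Illustrative values (an assumption, not taken from the paper).
pa, pb = Fraction(1, 2), Fraction(3, 4)
lo, hi = conjunction_bounds(pa, pb)
estimate = maxent_estimate(pa, pb)
print(lo, hi, estimate)
```

The Maximum Entropy estimate always falls inside the interval, which is why it can serve as a point estimate when the interval itself is too wide to be informative.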
A more complicated example would be the following: Assume three events , , and . Assume that we know , , , and . What can we say about ?
Let us make these examples more general: A conjunctive query is a boolean formula having the form $a_{i_1} \wedge \cdots \wedge a_{i_L}$. Assume that we are given a set $\mathcal{F}$ of conjunctive queries along with their probabilities. Assume also that we are given a conjunctive query not belonging to $\mathcal{F}$. What can we tell about the probability of this query? We know that the possible probabilities of the query form an interval. In the paper we show that checking whether the right endpoint of this interval is larger than some threshold is NP-complete. We also show that estimating the probability of the query using Maximum Entropy is PP-hard.
In the paper we adopt the terminology used in data mining of 0–1 data: Conjunctive queries are represented by sets of items called itemsets and the probabilities of conjunctive queries are called itemset frequencies.
Our problems are special cases of much more general problems (see Section 6 for detailed comparison). These general problems are well-studied and they are all (at least) NP-hard. The difference is that in our work we concentrate on studying antimonotonic families of itemsets. We should point out that antimonotonic families are important since they tend to arise frequently in practice, for example, in mining of frequent itemsets [1, 2]. A similar technique is used in [7] to prove that inference of Belief Networks is NP-hard. The result of [7] is essentially Theorem 6 (in this paper) though it is in a different context. The general boolean query scenario is reduced to Linear Programming in [10]. A method worth mentioning is introduced in [15] where the authors estimate the frequencies using Maximum Entropy.
2 Preliminaries
In this section we give basic definitions used in mining of 0–1 data.
By a binary data set we mean a collection of binary vectors of length $K$ sampled from some distribution. We define a sample space $\Omega$ to be the collection of all possible binary vectors of length $K$. From now on $\Omega$ will always denote the sample space and $K$ will denote the dimension of the binary vectors. Any distribution given in this paper will be defined on $\Omega$.
It is customary to assign an attribute to each dimension of $\Omega$. Thus, when we speak of $a_i$ we mean the $i$th dimension. The set of all attributes is $A = \{a_1, \ldots, a_K\}$. An itemset is a subset of $A$. Let $B = \{a_{i_1}, \ldots, a_{i_L}\}$ be an itemset. We often use a condensed notation $B = a_{i_1} \cdots a_{i_L}$. A family $\mathcal{F}$ of itemsets is called antimonotonic if all the subsets of any member are also included.
Let $p$ be a distribution defined on $\Omega$. We use the following notation: Let $B = a_{i_1} \cdots a_{i_L}$ be an itemset and let $v$ be a binary vector of length $L$. Then we shorten the notation $p(a_{i_1} = v_1, \ldots, a_{i_L} = v_L)$ by $p(B = v)$. By $p(B = 1)$ we mean $p(B = v)$, where $v$ contains only ones. The probability $p(B = 1)$ is called the frequency of $B$.
Assume a family $\mathcal{F}$ of itemsets and a vector $\theta$ of length $|\mathcal{F}|$. We say that a distribution $p$ satisfies the frequencies $\theta$ if $p(B = 1) = \theta_B$ for $B \in \mathcal{F}$. We say that these frequencies are consistent if there is a distribution satisfying them.
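The definitions above can be made concrete with a small sketch. It assumes that a binary data set is given as a list of 0/1 tuples and that items are identified by their column index; the helper names `frequency` and `is_antimonotonic` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def frequency(data, itemset):
    """Fraction of rows in which every item of `itemset` equals 1."""
    hits = sum(all(row[i] for i in itemset) for row in data)
    return Fraction(hits, len(data))

def is_antimonotonic(family):
    """True if every subset of every member of `family` is also a member."""
    members = {frozenset(b) for b in family}
    return all(frozenset(s) in members
               for b in members
               for r in range(len(b) + 1)
               for s in combinations(b, r))

# Toy data set with three attributes (illustrative only).
data = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
print(frequency(data, {0, 1}))
print(is_antimonotonic([(), (0,), (1,), (0, 1)]))
```

Note that an antimonotonic family, as defined here, always contains the empty itemset, whose frequency is 1.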
3 Maximal Frequency Query is NP-complete
Assume that we want to find the frequency for an itemset $B$ based on some known family $\mathcal{F}$ of itemsets. We know that generally the frequency for $B$ is not unique: There may be distributions that produce different frequencies for $B$ but have the same frequencies for the members of $\mathcal{F}$. The set of all the consistent frequencies of $B$ is an interval [4]. In this section we focus on finding one endpoint of this interval:
Problem 1
(MaxQuery) Assume that we are given an antimonotonic family $\mathcal{F}$ having $N$ members along with rational and consistent frequencies $\theta$. Find the maximal frequency for a given itemset $B$ that can be produced by a distribution satisfying the frequencies $\theta$.
In other words, we ask: if we know the frequencies $\theta$, what is the largest consistent frequency for $B$? Note that the maximal frequency always exists since the frequencies are required to be consistent. Our goal in this section is to show that in general this problem is intractable. First let us give an example where the solution can be easily obtained.
Example 1
Assume that a family $\mathcal{F}$ contains only the itemsets of size one. Then the frequency $\theta_{a_i}$ is the mean of the attribute $a_i$. The maximal frequency for a non-empty itemset $B$ is $\min_{a_i \in B} \theta_{a_i}$.
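The easy case of Example 1 can be illustrated as follows. The sketch assumes that only singleton frequencies are known and builds one distribution that attains the minimum by driving every item from a single shared uniform variable; this particular construction is an illustration chosen here, not something stated in the paper, and all names are hypothetical.

```python
from fractions import Fraction

def max_frequency_singletons(theta, itemset):
    """Largest consistent frequency of a non-empty itemset given singleton frequencies."""
    return min(theta[a] for a in itemset)

def coupled_distribution(theta):
    """Distribution in which item a is 1 exactly when u < theta[a], u uniform on [0, 1)."""
    items = sorted(theta)
    cuts = sorted({Fraction(0), Fraction(1), *theta.values()})
    dist = {}
    for left, right in zip(cuts, cuts[1:]):
        # On [left, right) the value of every item is constant.
        vec = tuple(int(left < theta[a]) for a in items)
        dist[vec] = dist.get(vec, 0) + (right - left)
    return items, dist

def itemset_frequency(items, dist, itemset):
    """Probability that all items of `itemset` are 1 under `dist`."""
    idx = [items.index(a) for a in itemset]
    return sum(w for vec, w in dist.items() if all(vec[i] for i in idx))

theta = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(3, 4)}
items, dist = coupled_distribution(theta)
print(max_frequency_singletons(theta, "abc"), itemset_frequency(items, dist, "abc"))
```

The constructed distribution reproduces every singleton frequency, and the frequency of any itemset under it equals the smallest singleton frequency among its items, so the bound is attained.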
We know that MaxQuery can be solved by using Linear Programming [4] though the resulting program contains an exponential number of variables. This reduction along with some results from Linear Programming theory [14] has important consequences: There is a distribution, say $p$, producing the maximal frequency for $B$ and having a number of non-zero entries that is bounded linearly in $N$. Also, $p$ has rational entries, and if $s$ is the number of bits needed to specify the denominator of an element of the frequency vector $\theta$, then the number of bits needed to specify the denominator of an entry of $p$ is polynomial in $N$ and $s$. We call such a distribution canonical.
Since NP is defined for yes/no problems we need the decision version of MaxQuery:
Problem 2
(MaxQueryDec) Assume that we are given an antimonotonic family $\mathcal{F}$ having $N$ members along with rational and consistent frequencies $\theta$. Given an itemset $B$ and a rational threshold $r$, is there a distribution satisfying the frequencies $\theta$ such that the frequency of $B$ is larger than $r$?
The relation between MaxQuery and MaxQueryDec is the following: Assume that we can solve MaxQuery in polynomial time, then we can clearly solve MaxQueryDec in polynomial time. Assume now that we can solve MaxQueryDec in polynomial time. Let $\theta^*$ be the solution of MaxQuery. We can find $\theta^*$ using MaxQueryDec and dichotomous search. We know that $\theta^*$ is a rational number between $0$ and $1$ and that the denominator of $\theta^*$ can be expressed using polynomially many bits. Thus the number of required search steps is polynomial as well.
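The dichotomous search can be sketched as follows. The sketch assumes a hypothetical oracle `decide(t)` that answers the decision problem ("is the maximal frequency larger than `t`?") and a known bound `max_denominator` on the denominator of the answer; both names are illustrative. Once the search interval is shorter than $1/(2D^2)$ for denominator bound $D$, the answer is the only rational with denominator at most $D$ that is closest to the upper end, which `Fraction.limit_denominator` recovers.

```python
from fractions import Fraction

def recover_maximum(decide, max_denominator):
    """Recover the exact maximum from a strict-threshold decision oracle."""
    if not decide(Fraction(0)):
        return Fraction(0)                      # the maximum is not larger than 0
    lo, hi = Fraction(0), Fraction(1)           # invariant: lo < maximum <= hi
    tolerance = Fraction(1, 2 * max_denominator ** 2)
    while hi - lo >= tolerance:
        mid = (lo + hi) / 2
        if decide(mid):
            lo = mid
        else:
            hi = mid
    # The closest fraction with a bounded denominator is the maximum itself.
    return hi.limit_denominator(max_denominator)

# Stand-in oracle with a hidden answer (an assumption for the demonstration).
hidden = Fraction(3, 7)
print(recover_maximum(lambda t: hidden > t, 10))
```

The number of oracle calls grows with the logarithm of `max_denominator`, that is, linearly in the number of bits of the denominator.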
Theorem 2
MaxQueryDec is in NP.
Proof. Let $p$ be a canonical distribution for MaxQuery. We can represent this distribution in polynomial space, and hence we can use it as a certificate. To check the certificate we need to check that $p$ is a valid distribution (its entries are non-negative and sum to one), that it satisfies the frequencies $\theta$, and that its frequency for $B$ is larger than the threshold.
Our next step is to reduce 3SAT to MaxQueryDec. In order to do that we need the following lemma:
Lemma 3
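Assume that two distributions $p$ and $q$ satisfy the frequencies $\theta$ of an antimonotonic family $\mathcal{F}$ of itemsets. Let $B \in \mathcal{F}$. Then $p(B = v) = q(B = v)$ for any binary vector $v$.
Proof. Fix $B$ and $v$. Let $X \subseteq B$ be the set of items that $v$ sets to $1$ and let $Y = B \setminus X$. Denote the elements of $Y$ by $y_1, \ldots, y_L$. Let $r$ be the probability of $X$ being $1$ and at least one of the $y_i$ being $1$. We see that
$$p(B = v) = p(X = 1, Y = 0) = p(X = 1) - r. \tag{1}$$
Let $\mathcal{Y}$ be the collection of non-empty subsets of $Y$. We can express the last term of Eq. 1 by using the inclusion-exclusion principle
$$r = \sum_{Z \in \mathcal{Y}} (-1)^{|Z| + 1} \, p(X \cup Z = 1). \tag{2}$$
By combining Eqs. 1 and 2 we have expressed $p(B = v)$ as a linear combination of terms having the form $p(U = 1)$ where $U \subseteq B$. Antimonotonicity implies that all these frequencies are included in $\theta$. This makes $p(B = v)$ unique and the lemma follows.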
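The argument of Lemma 3 translates directly into a computation. The sketch below assumes the frequencies are stored in a dictionary `theta` keyed by frozensets of items (a representation chosen here for illustration) and evaluates the probability that an itemset takes a given 0/1 pattern by inclusion-exclusion over the items that the pattern sets to zero.

```python
from fractions import Fraction
from itertools import combinations

def prob_from_frequencies(theta, itemset, values):
    """Probability that each item a of `itemset` equals values[a], from subset frequencies."""
    ones = frozenset(a for a in itemset if values[a] == 1)
    zeros = [a for a in itemset if values[a] == 0]
    total = Fraction(0)
    for r in range(len(zeros) + 1):
        for z in combinations(zeros, r):
            # The term with r extra items enters with sign (-1)^r.
            total += (-1) ** r * theta[ones | frozenset(z)]
    return total

# Frequencies of all subsets of {a, b}, taken from an illustrative distribution with
# P(a=1,b=1) = 1/2, P(a=1,b=0) = 1/4, P(a=0,b=1) = 1/8, P(a=0,b=0) = 1/8.
theta_ab = {
    frozenset(): Fraction(1),
    frozenset("a"): Fraction(3, 4),
    frozenset("b"): Fraction(5, 8),
    frozenset("ab"): Fraction(1, 2),
}
print(prob_from_frequencies(theta_ab, "ab", {"a": 1, "b": 0}))
```

Because only frequencies of subsets of the itemset are read, the result is the same for every distribution that satisfies the frequencies, which is the content of the lemma.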
Theorem 4
3SAT is polynomial-time reducible to MaxQueryDec.
Proof. Let $\varphi$ be an instance of 3SAT having $n$ variables and $m$ clauses. We set the dimension of the sample space to be $K = n + m$. The first $n$ items correspond to the variables of $\varphi$ and the last $m$ items correspond to the clauses. We use the following notation: Let $t$ be a truth assignment and let $C_j$ be a clause, then $C_j(t)$ is a function returning $1$ if $C_j$ is satisfied by $t$, and $0$ otherwise. We denote the first $n$ items by $x_1, \ldots, x_n$ and the last $m$ items by $c_1, \ldots, c_m$. We also set $X = x_1 \cdots x_n$ and $C = c_1 \cdots c_m$.
We will now define an antimonotonic family $\mathcal{F}$ of itemsets. Let $C_j$ be some clause and let $c_j$ be its corresponding item. Assume that the items corresponding to the variables in $C_j$ are $x_{j_1}$, $x_{j_2}$, and $x_{j_3}$. We add the itemset $x_{j_1} x_{j_2} x_{j_3} c_j$ to the family $\mathcal{F}$ along with its subsets. We repeat this procedure for each clause in $\varphi$. The resulting family contains at most $16m$ members.
The next step is to define the frequencies $\theta$. In order to do this we define a distribution $p$ over the attributes to be
$$p(X = t, C = u) = \begin{cases} 2^{-n} & \text{if } u_j = C_j(t) \text{ for } j = 1, \ldots, m, \\ 0 & \text{otherwise.} \end{cases}$$
That is, the first $n$ items are distributed uniformly and the values of the last $m$ items are set to correspond to the truth values of the clauses.
We define the frequencies $\theta_B = p(B = 1)$, where $B \in \mathcal{F}$. We note that the frequencies are rational and consistent. There is a closed formula for evaluating these frequencies. For example, assume that we have a clause $C_j$ whose variables are $x_{j_1}$, $x_{j_2}$, and $x_{j_3}$. The frequency of the itemset $x_{j_1} c_j$ is then
$$\theta_{x_{j_1} c_j} = \sum_{t, u} p(X = t, C = u) = \sum_{t} 2^{-n} C_j(t),$$
where in the first summation $t$ ranges over truth assignments such that $t_{j_1} = 1$ and $u$ ranges over binary vectors of length $m$ such that $u_j = 1$. In the second summation $t$ ranges as in the first summation and $u$ has been set to correspond to the clauses, so that a term is non-zero only if $C_j(t) = 1$. Since $C_j(t)$ depends on three variables only, the sum can be evaluated by inspecting the $8$ assignments of these variables. The frequencies for the other members of $\mathcal{F}$ can be deduced in a similar way. Thus we can obtain the frequencies in polynomial time.
Let $\theta^*$ be the maximal frequency for the itemset $C$. We claim that the formula $\varphi$ is satisfiable if and only if $\theta^* > 0$.
Assume that $\varphi$ is satisfiable by a truth assignment $t$, then we have
$$\theta^* \geq p(C = 1) \geq p(X = t, C = 1) = 2^{-n} > 0.$$
Assume now that there is a distribution $q$ satisfying the frequencies $\theta$ and producing a positive frequency for $C$. Let $t$ be a truth assignment not satisfying the formula, that is, there is a clause, say $C_j$, that is not satisfied. Define $B = x_{j_1} x_{j_2} x_{j_3} c_j$ and $v = (t_{j_1}, t_{j_2}, t_{j_3}, 1)$. Lemma 3 implies that $q(B = v) = p(B = v) = 0$, and hence $q(X = t, C = 1) \leq q(B = v) = 0$. By reversing this property we get the following: If $t$ is such that
$$q(X = t, C = 1) > 0 \tag{3}$$
holds, then $t$ must satisfy $\varphi$.
By the assumption $q(C = 1) > 0$, so there exists a truth assignment $t$ such that Eq. 3 holds. Thus $\varphi$ is satisfiable. The reduction is complete if we set the query to $C$ and the threshold to $0$.
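The frequencies used in the reduction can be computed clause by clause. The sketch below assumes a clause is written as a tuple of three signed integers over distinct variables (for instance `(1, -2, 3)` for "variable 1 or not variable 2 or variable 3"), and labels items as `("x", i)` for variables and `("c", j)` for clauses; this encoding is an assumption made here for illustration. Under a distribution that is uniform on the variables and sets each clause item to the truth value of its clause, each frequency only depends on the 8 assignments of the clause's three variables.

```python
from fractions import Fraction
from itertools import combinations, product

def clause_frequencies(clause, j):
    """Frequencies of the 16 subsets of {x, y, z, c_j} for one 3-literal clause."""
    variables = [abs(lit) for lit in clause]
    items = [("x", v) for v in variables] + [("c", j)]
    theta = {}
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            count = 0
            for bits in product([0, 1], repeat=3):
                value = dict(zip(variables, bits))
                satisfied = any(value[abs(l)] == (l > 0) for l in clause)
                vars_ok = all(value[v] == 1 for kind, v in subset if kind == "x")
                clause_ok = satisfied or ("c", j) not in subset
                count += vars_ok and clause_ok
            theta[frozenset(subset)] = Fraction(count, 8)
    return theta

theta_clause = clause_frequencies((1, -2, 3), j=1)
print(theta_clause[frozenset({("x", 1), ("c", 1)})])
```

Running this once per clause gives at most 16 frequencies per clause, so the whole frequency vector is obtained in time linear in the number of clauses.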
Example 5
Consider the formula . We have two clauses, and , and three variables, , , and . The itemset family along with its frequencies (given in parentheses) is
[TABLE]
The maximal frequency of for this setup (solved by linear programming) is . Clearly, the formula is satisfiable.
4 MaxEnt Frequency Query is PP-hard
In the previous section we showed that searching for the maximal frequencies is a very hard problem. The maximal frequencies, however, are not so useful if our goal is to estimate boolean queries from a given set of itemsets. A much more useful approach is to use Maximum Entropy. Given a distribution $p$ defined on $\Omega$, the entropy of $p$ is $H(p) = -\sum_{\omega \in \Omega} p(\omega) \log p(\omega)$. It is customary to define $0 \log 0 = 0$ so that $H(p)$ is always defined.
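As a small illustration of the entropy definition, the sketch below assumes a distribution is stored as a dictionary from outcomes to probabilities and uses the natural logarithm (the base is a choice made here); zero-probability outcomes are skipped, which implements the convention that a zero term contributes nothing.

```python
import math

def entropy(dist):
    """Entropy of a finite distribution; zero-probability outcomes contribute 0."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

uniform = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
skewed = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.0, (1, 1): 0.0}
print(entropy(uniform), entropy(skewed))
```

The uniform distribution attains the largest value, which is the property the hardness proof for the Maximum Entropy query relies on.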
Problem 3
(EntrQuery) Assume that we are given an antimonotonic family $\mathcal{F}$ having $N$ members along with rational and consistent frequencies $\theta$. Find the frequency for a given itemset $B$ produced by the distribution $p^*$ satisfying the frequencies $\theta$ and maximising the entropy $H(p^*)$.
It has been empirically shown that EntrQuery results in a good approximation [15].
Again we need a decision version of the problem:
Problem 4
(EntrQueryDec) Assume that we are given an antimonotonic family $\mathcal{F}$ having $N$ members along with rational and consistent frequencies $\theta$. Let $\theta^*$ be the frequency for a given itemset $B$ produced by the distribution satisfying the frequencies $\theta$ and maximising entropy. Is $\theta^*$ larger than a given rational threshold $r$?
The following theorem shows that EntrQueryDec is NP-hard.
Theorem 6
3SAT is polynomial-time reducible to EntrQueryDec.
Proof. Let $\varphi$ be an instance of 3SAT. Let $\mathcal{F}$, $\theta$, and $C$ be the same as in the proof of Theorem 4, and let $X$ be the itemset containing the items that correspond to the variables. Let $\mathcal{P}$ be the set of distributions satisfying the frequencies $\theta$. Let $q \in \mathcal{P}$. A marginal distribution $q_X$ is obtained from $q$ by keeping only the items included in $X$. The distribution $q$ has the following property: The items corresponding to the clauses are completely determined by the items corresponding to the variables. This implies that the entropy $H(q) = H(q_X)$ [11, Theorem 4.2].
Let $p^*$ be the distribution maximising the entropy. Let $p$ be the distribution defined in the proof of Theorem 4. Note that $p \in \mathcal{P}$. We know that there is no distribution that has larger entropy than the uniform distribution [11, Theorem 3.1]. Since $p_X$ is uniform, we must have $H(p^*) = H(p^*_X) \leq H(p_X) = H(p)$. Hence $H(p^*) = H(p)$. We also know that the distribution maximising entropy is unique [8, Theorem 3.1]. This implies that $p^* = p$. To complete the proof we note that $p$ produces a positive frequency for $C$ if and only if $\varphi$ is satisfiable.
A problem P is in PP if there is a nondeterministic polynomial-time machine such that an input is a yes-instance of P iff more than half of the computation paths end up accepting [13]. The class PP is (believed to be) larger than NP. We can show that EntrQueryDec is PP-hard: In the proof the frequency of $C$ is exactly the number of satisfying assignments divided by $2^n$, where $n$ is the number of variables. Hence, if we choose the threshold to correspond to $2^{n/2}$ satisfying assignments, the instance will be in EntrQueryDec iff at least the square root of the number of all truth assignments satisfy the given 3SAT formula. This problem is known to be PP-complete [3].
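The counting question behind the PP-hardness argument can be sketched by brute force. The code assumes clauses are tuples of signed integers, with variables numbered from 1, and enumerates all assignments, so it takes exponential time and is only meant to make the threshold test concrete; the function names are hypothetical.

```python
from itertools import product

def count_satisfying(clauses, n):
    """Number of truth assignments over n variables satisfying every clause."""
    count = 0
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            count += 1
    return count

def at_least_sqrt_many(clauses, n):
    """True iff at least 2^(n/2) assignments satisfy the formula (integer arithmetic)."""
    return count_satisfying(clauses, n) ** 2 >= 2 ** n

loose = [(1, 2, 3), (-1, -2, -3)]
tight = [(1, 2, 3), (1, 2, -3), (1, -2, 3), (1, -2, -3),
         (-1, 2, 3), (-1, 2, -3), (-1, -2, 3)]
print(count_satisfying(loose, 3), at_least_sqrt_many(loose, 3))
print(count_satisfying(tight, 3), at_least_sqrt_many(tight, 3))
```

Dividing the count by the total number of assignments gives the frequency that the Maximum Entropy distribution of the reduction assigns to the query, so comparing that frequency with a threshold is the same as comparing the count with a bound.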
5 Checking Consistency is NP-complete
So far we have assumed that the itemset frequencies given in our problems are consistent. Let us remove this constraint and consider the following problem.
Problem 5
(Consistent) Assume that we are given an antimonotonic family $\mathcal{F}$ having $N$ members along with rational frequencies $\theta$. Are the frequencies consistent?
The following theorem proves that Consistent is a very hard problem.
Theorem 7
Consistent is NP-complete.
Proof. First, we need to show that Consistent is in NP. We know from Linear Programming theory that if the frequencies are consistent then there is a canonical distribution satisfying the frequencies. This is our certificate and thus Consistent is in NP.
We now prove that 3SAT is polynomial-time reducible to Consistent. We use the same construction as in the proof of Theorem 4 with some additions: We add one special attribute, say $z$, to the set of attributes. We add an itemset $z$ to $\mathcal{F}$, and we also add itemsets having the form $z c_j$ to $\mathcal{F}$. The frequencies for the new itemsets are set to be $2^{-n}$, where $n$ is the number of variables appearing in the 3SAT instance $\varphi$.
Assume that $\varphi$ is satisfiable by a truth assignment $t$. We define a distribution $q$ by extending the distribution $p$ to $z$. The extension is done such that $z$ is $1$ iff $X = t$. Clearly, $q$ satisfies the frequencies.
To prove the other direction, assume that there exists a distribution, say $q$, that satisfies the frequencies. To prove that $\varphi$ is satisfiable we must prove that $q(C = 1) > 0$. Select two attributes, say $c_1$ and $c_2$. Note that $q(z c_1 = 1) = q(z = 1)$ and $q(z c_2 = 1) = q(z = 1)$. This implies that $q(z c_1 c_2 = 1) = q(z = 1)$. We can prove in an iterative fashion that
$$q(C = 1) \geq q(z c_1 \cdots c_m = 1) = q(z = 1) = 2^{-n} > 0.$$
This proves the result.
6 Connections to Related Work
An NP-complete problem called FreqSat introduced in [5, 6] is a generalisation of Consistent — in FreqSat we are allowed to have non-antimonotonic families and inequality constraints. We can transform MaxQueryDec into FreqSat by changing the query into an inequality constraint. We should also point out that the proof of NP-hardness of FreqSat given in [5] is (although not explicitly mentioned) actually a valid proof for Consistent.
An even more general scenario is introduced in [12] in which we are allowed to have conditional first-order logic sentences as constraints/queries. This scenario can be emulated by itemsets [6]. Also, a famous problem called PSat in which we are given a CNF-formula, a frequency for each clause, and we are asked whether there is a distribution satisfying the frequencies is known to be NP-complete [9].
7 Conclusions
In this paper we studied certain boolean query problems. Our problems were special cases (frequently occurring in practice and thus important) of much more general scenarios, and we showed that despite the limitations our problems remained intractable. The crux of the paper lies within the construction in the proof of Theorem 4.
There are some open problems: For example, what is the exact complexity of MaxQuery? Is it FNP-complete or FP-complete? Also, what is the complexity of the opposite problem MinQuery? In addition, it is worthwhile to study the conditions under which the boolean query problems can be solved efficiently.
References
[1] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 1993.
[2] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and Aino Inkeri Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press/The MIT Press, 1996.
[3] Delbert D. Bailey, Victor Dalmau, and Phokion G. Kolaitis. Phase transitions of PP-complete satisfiability problems. In IJCAI, pages 183–192, 2001.
[4] Artur Bykowski, Jouni K. Seppänen, and Jaakko Hollmén. Model-independent bounding of the supports of Boolean formulae in binary data. In Pier Luca Lanzi and Rosa Meo, editors, Database Technologies for Data Mining. Springer Verlag, 2003.
[5] Toon Calders. Axiomatization and Deduction Rules for the Frequency of Itemsets. PhD thesis, University of Antwerp, Belgium, 2003.
[6] Toon Calders. Computational complexity of itemset frequency satisfiability. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2004.
[7] Gregory Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2–3):393–405, Mar. 1990.
[8] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, Feb. 1975.
