
| Name | Size | Threshold | Time | |
|---|---|---|---|---|
| Ind | ms | | | |
| Plant | ms | | | |
| Alon | ms | | | |
| Thalia | ms | | | |
| Yeast | ms | | | |
Itemsets for Real-valued Datasets
Nikolaj Tatti
HIIT, Department of Information and Computer Science Aalto University, Finland
Abstract
Pattern mining is one of the most well-studied subfields in exploratory data analysis. While there is a significant amount of literature on how to discover and rank itemsets efficiently from binary data, there is surprisingly little research done in mining patterns from real-valued data. In this paper we propose a family of quality scores for real-valued itemsets. We approach the problem by casting the dataset into binary data and computing the support from this data. This naive approach requires us to select thresholds. To remedy this, instead of selecting one set of thresholds, we treat thresholds as random variables and compute the average support. We show that we can compute this support efficiently, and we also introduce two normalisations, namely comparing the support against the independence assumption and, more generally, against the partition assumption. Our experimental evaluation demonstrates that we can discover statistically significant patterns efficiently.
Index Terms:
pattern mining, itemsets, real-valued itemsets
I Introduction
Pattern mining is one of the most well-studied subfields in exploratory data analysis. While there is a significant amount of literature on how to discover and rank itemsets efficiently from binary data, there is surprisingly little research done in mining patterns from real-valued data. In this paper we propose a family of quality scores for real-valued itemsets.
In order to motivate our approach, assume that we are given a dataset $D$ containing real numbers and a miner for mining itemsets from binary data. The most straightforward way to use the miner to find patterns from $D$ is to transform $D$ into binary data and apply the miner. More formally, assume that we have selected a threshold $t_i$ for every item $a_i$ in the dataset. Then we define a binary dataset $B$ by setting $b_i = 1$ if $d_i \geq t_i$, and $0$ otherwise, where $d$ ranges over all transactions of $D$.
This approach has two immediate drawbacks. Firstly, we have to select the thresholds $t_i$. In addition, such a measure is coarse: any intricate interaction between items is destroyed as data values are categorised into two coarse categories, 0s and 1s. Hence, instead of selecting just one set of thresholds, we will vary $t_i$, and instead of computing support only for one dataset, we will compute an average support. More formally, we will attach a distribution to each threshold $t_i$ and compute the mean $\operatorname{E}[\mathit{fr}(X; B)]$, where $\mathit{fr}(X; B)$ is the frequency (support) of an itemset $X$ in the binarized data $B$.
This approach has several benefits. First of all, the support is monotonically decreasing, which allows us to discover all frequent itemsets efficiently. Moreover, we will show that we can compute the support efficiently, even though it involves taking an average over a complex function.
We still need to choose the threshold distribution. In this work we focus on a specific distribution involved with copulas [1]: roughly speaking, we will define the distribution of the threshold $t_i$ through the rank of the $j$th transaction after the data is sorted w.r.t. the $i$th column. We will see that this distribution induces a support in which the actual values of individual items do not matter; instead, the support is based on the ranks of the values. Interestingly enough, several popular statistical tests, such as the Mann-Whitney U test or the Wilcoxon signed-rank test, are also based on the ranks of values.
A standard technique in pattern mining is to compare the observed support against the expected value under some null hypothesis, where the hypothesis is typically an independence assumption. Here we consider two approaches. In the first approach we do a $z$-normalisation by comparing the support against the independence assumption. In our second approach, we generalise the null hypothesis to a partition model, where we assume that items from different parts of the partition are independent. A particular difficulty with these approaches is that in order to compute them we need to compute the expected mean and the variance. While this is trivial when dealing with simple transactional data, it becomes intricate in our setting since the threshold distribution actually depends on the dataset. Nevertheless, we can compute the exact mean and variance for the independence assumption and exact mean and asymptotic variance for the partition assumption. Interestingly enough, the independence test is non-parametric, that is, the mean and the variance depend only on the number of datapoints, whereas in the partition assumption we need to estimate parameters from the dataset.
The rest of the paper is organized as follows. We introduce preliminary notation in Section II. We define our general measure in Section III and introduce copula support in Section IV. We present an independence test in Section V and a test based on partitions in Section VI. We discuss related work in Section VII and present our experiments in Section VIII. Finally, we conclude our paper with remarks in Section IX.
II Preliminaries and Notation
In this section we introduce the preliminary notation.
A dataset $D$ is a multiset of transactions $d_1, \ldots, d_N$, where $d_j$ is a vector of length $K$. We will often use $N$ as the number of datapoints and $K$ as the dimension of the dataset. We treat each vector as a sample from an unknown distribution over random variables $a_1, \ldots, a_K$. We refer to the random variables $a_i$ as items, or as features.
Let $A = \{a_1, \ldots, a_K\}$ be the set of all items. An itemset $X$ is a set of items $X \subseteq A$. Assume that you are given an itemset $X$ and a binary vector $b$ of length $K$. We say that $b$ covers $X$ if $b_i = 1$ for every $a_i \in X$. We will use standard notation, by writing $a_{i_1} \cdots a_{i_M}$ to mean $\{a_{i_1}, \ldots, a_{i_M}\}$.
Assume now that we are given a collection $B$ of binary vectors. We define the support or the frequency of an itemset $X$ as the proportion of transactions in $B$ covering $X$,
$$\mathit{fr}(X; B) = \frac{|\{b \in B : b \text{ covers } X\}|}{|B|}.$$
An important property of the support is that it is monotonically decreasing, that is, $\mathit{fr}(X; B) \geq \mathit{fr}(Y; B)$ if $X \subseteq Y$. This property allows us to use efficient techniques [2] to discover all itemsets whose frequency is higher than some given threshold.
III Itemset support for real-valued data
In this section we define our measure for real-valued data. In order to do so, let $D$ be a dataset over $K$ items, $a_1, \ldots, a_K$, and $N$ transactions. Assume that we are given a threshold $t_i$ for each item $a_i$. Let us write $T = (t_1, \ldots, t_K)$. Given a vector $d$ of length $K$, we define $b(d; T)$ to be a binary vector with $b_i = 1$ if $d_i \geq t_i$, and $0$ otherwise. We now define a binarized data to be
$$D_T = \{b(d; T) : d \in D\}.$$
Essentially, $D_T$ is a dataset where each value is binarized either to $0$ or to $1$, depending on the threshold. We can now compute a support for a given itemset $X$ by computing $\mathit{fr}(X; D_T)$.
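The fixed-threshold construction above can be sketched in a few lines. This is an illustrative sketch only: the function names `binarize` and `support`, the toy data, and the choice that a value equal to its threshold maps to 1 are assumptions, not taken from the paper.

```python
def binarize(data, thresholds):
    """Map each value to 1 if it is at least its item's threshold, else 0.

    Assumption: a value equal to the threshold counts as 1.
    """
    return [[1 if v >= t else 0 for v, t in zip(row, thresholds)] for row in data]


def support(binary_data, itemset):
    """Fraction of binary transactions that have a 1 for every item in itemset."""
    covered = sum(all(row[i] == 1 for i in itemset) for row in binary_data)
    return covered / len(binary_data)


# Hypothetical toy dataset: 3 transactions over 3 items.
data = [[0.9, 0.2, 0.7],
        [0.4, 0.8, 0.6],
        [0.1, 0.3, 0.95]]
thresholds = [0.5, 0.25, 0.65]

binary = binarize(data, thresholds)
print(binary)                   # [[1, 0, 1], [0, 1, 0], [0, 1, 1]]
print(support(binary, [1]))     # 2 of 3 transactions cover item 1
print(support(binary, [1, 2]))  # 1 of 3 transactions covers items 1 and 2
```

Moving any single threshold slightly can flip individual entries and change the support, which is the sensitivity the averaging approach is meant to remove.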
The problem with this approach is that we need to select a threshold set $T$. Additionally, once we have made this choice, the treatment of values in $D$ is coarse: a value slightly higher than the threshold contributes to the support as much as the values that are significantly higher.
To remedy this, we treat thresholds as random variables. That is, we have $K$ random variables, $t_1, \ldots, t_K$. We will assume that each threshold is assigned independently, that is, $t_1, \ldots, t_K$ are independent variables. We will go over some of the natural choices for distributions of $t_i$ later on. If we write $p_i$ to be the density function of the $i$th threshold, we can now define support as an average support, where the mean is taken over the possible thresholds, that is,
$$\mathit{fr}(X; D, p) = \int \mathit{fr}(X; D_T) \prod_{i=1}^{K} p_i(t_i) \, dT,$$
where $D_T$ denotes the dataset binarized with the threshold set $T = (t_1, \ldots, t_K)$.
The important property of this support is that it is monotonically decreasing. This allows us to mine all frequent itemsets using the standard pattern mining search.
Proposition 1
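Assume two itemsets $X$ and $Y$ such that $X \subseteq Y$. Then the average support of $X$ is at least the average support of $Y$.
Proof:
For any given threshold set $T$, the support of $X$ in the binarized data is at least the support of $Y$ in the same binarized data. Since the inequality holds for every threshold set, it also holds after averaging over the thresholds, which proves the proposition. ∎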
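Monotonicity is what makes the standard levelwise search applicable. The following is a minimal sketch of such a search, assuming a hypothetical callable `support` that is monotonically decreasing; the function name `mine_frequent` and the toy binary data are illustrative and not the paper's implementation.

```python
from itertools import combinations


def mine_frequent(items, support, min_support):
    """Levelwise (Apriori-style) search for all itemsets with support >= min_support.

    `support` maps a frozenset of items to a number and is assumed to be
    monotonically decreasing, so a superset of an infrequent set can be pruned.
    """
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        kept = []
        for itemset in level:
            value = support(itemset)
            if value >= min_support:
                frequent[itemset] = value
                kept.append(itemset)
        kept_set = set(kept)
        candidates = set()
        # Join two frequent sets that differ in one item, then keep the
        # candidate only if all of its one-smaller subsets are frequent.
        for a, b in combinations(kept, 2):
            c = a | b
            if len(c) == len(a) + 1 and all(c - {i} in kept_set for i in c):
                candidates.add(c)
        level = sorted(candidates, key=sorted)
    return frequent


# Hypothetical toy data: each transaction is the set of items it contains.
rows = [{0, 1, 2}, {0, 1}, {0, 2}, {1}]


def toy_support(itemset):
    return sum(itemset <= row for row in rows) / len(rows)


result = mine_frequent([0, 1, 2], toy_support, min_support=0.5)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

Here `{0, 1, 2}` is never evaluated because its subset `{1, 2}` is infrequent; the same pruning applies to any support that satisfies Proposition 1.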
Computing the support from the definition is awkward as it requires taking integrals. Fortunately, we can rewrite the support in a much more accessible form.
Proposition 2
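Assume a dataset $D$ with $N$ transactions and a distribution $p$ over the thresholds. Then the support of itemset $X$ is equal to
$$\frac{1}{N} \sum_{j=1}^{N} \prod_{a_i \in X} p(t_i \leq d_{ji}).$$
Proof:
We can rewrite the support as
$$\frac{1}{N} \sum_{j=1}^{N} p(\text{transaction } d_j \text{ covers } X \text{ after binarization}).$$
Transaction $d_j$ covers $X$ if and only if $t_i \leq d_{ji}$ for each $a_i \in X$. Since the thresholds $t_i$ are independent, it follows that
$$p(\text{transaction } d_j \text{ covers } X \text{ after binarization}) = \prod_{a_i \in X} p(t_i \leq d_{ji}).$$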
This completes the proof. ∎
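As a sanity check of Proposition 2, the closed form can be compared with a direct Monte Carlo average over sampled thresholds. The sketch below assumes, purely for illustration, that every threshold is uniform on $[0, 1]$, so that the probability of a threshold being at most a value $v \in [0, 1]$ is $v$; the function names and the toy data are hypothetical.

```python
import random


def average_support_closed_form(data, itemset, cdfs):
    """Mean over transactions of the product of P(threshold <= value) over the itemset."""
    total = 0.0
    for row in data:
        product = 1.0
        for i in itemset:
            product *= cdfs[i](row[i])
        total += product
    return total / len(data)


def average_support_monte_carlo(data, itemset, samplers, rounds, rng):
    """Sample thresholds, binarize, compute the ordinary support, and average."""
    total = 0.0
    for _ in range(rounds):
        t = {i: samplers[i](rng) for i in itemset}
        covered = sum(all(row[i] >= t[i] for i in itemset) for row in data)
        total += covered / len(data)
    return total / rounds


def uniform_cdf(v):
    # CDF of the uniform distribution on [0, 1] (an assumed threshold distribution).
    return min(1.0, max(0.0, v))


def uniform_sampler(rng):
    return rng.random()


data = [[0.9, 0.2], [0.4, 0.8], [0.5, 0.5]]
itemset = [0, 1]
exact = average_support_closed_form(data, itemset, [uniform_cdf, uniform_cdf])
approx = average_support_monte_carlo(
    data, itemset, [uniform_sampler, uniform_sampler], rounds=20000, rng=random.Random(0)
)
print(exact)   # (0.18 + 0.32 + 0.25) / 3, approximately 0.25
print(approx)  # close to the exact value, up to sampling noise
```

The closed form needs one pass over the data, whereas the sampled version needs one pass per sampled threshold set.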
IV Copula Support
Our measure depends on the threshold distribution. In this section we focus on a specific distribution related to copulas.
Assume that we are given a dataset $D$. Let us assume for simplicity that for each item, say $a_i$, the data points $d_{1i}, \ldots, d_{Ni}$ are unique. Fix an item $a_i$ and for notational simplicity let us assume that the datapoints are ordered according to the $i$th item, $d_{ji} < d_{(j+1)i}$ for $j = 1, \ldots, N - 1$. Let us define the probability of a threshold by requiring that the threshold $t_i$ will hit the interval $[d_{ji}, d_{(j+1)i}]$ with a probability of $1/(N - 1)$, where $j = 1, \ldots, N - 1$. In other words, the cumulative distribution is equal to
$$p(t_i \leq d_{ji}) = \frac{j - 1}{N - 1}.$$
This gives us a straightforward way of computing the support. Given a dataset of $N$ points, we compute the normalised rank of the $j$th transaction according to the $i$th column, for every transaction and every item. We can now define a copula support (a copula is a cumulative joint distribution of random variables that have gone through such a transformation [1]) by
[TABLE]
where .
Example 1
Consider that we are given a dataset with 4 items and 3 transactions
[TABLE]
The corresponding ranks are then
[TABLE]
For example, the copula support for is then
[TABLE]
As we see in the experiments, using as a filtering condition is not enough. Consequently, we also define by setting
[TABLE]
where , that is, the top items will be always above threshold and the bottom will be always below threshold.
Copula support has some peculiar features. First of all, the support does not depend on the actual values of $D$, only on their ranks. This makes this support excellent for cases where computing the difference between the values of $D$ does not make sense. In addition, the copula support of a single item is the same for every item, hence the support is not useful for selecting itemsets of size $1$. Even though we assume that $D$ has independent samples, the ranks are no longer independent. However, if we assume independence between the items, we can compute the mean and the variance as we will see in the next section.
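The rank-based features just described can be illustrated with a short sketch. It assumes that the normalised rank of a value is $(\text{rank} - 1)/(N - 1)$, with the smallest value having rank 1 and no ties, which is one reading of the interval construction in this section; the function names and the toy data are hypothetical.

```python
import math


def normalised_ranks(column):
    """Return (rank - 1) / (N - 1) per entry; the smallest value gets 0, the largest 1.

    Assumes the column has no ties.
    """
    n = len(column)
    order = sorted(range(n), key=lambda j: column[j])
    ranks = [0.0] * n
    for position, j in enumerate(order):
        ranks[j] = position / (n - 1)
    return ranks


def copula_support(data, itemset):
    """Average over transactions of the product of normalised ranks over the itemset."""
    n = len(data)
    columns = {i: normalised_ranks([row[i] for row in data]) for i in itemset}
    total = 0.0
    for j in range(n):
        product = 1.0
        for i in itemset:
            product *= columns[i][j]
        total += product
    return total / n


data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
pair = copula_support(data, [0, 1])
single = copula_support(data, [0])

# A strictly increasing transformation of the values leaves the ranks untouched.
transformed = [[math.exp(v) for v in row] for row in data]
pair_transformed = copula_support(transformed, [0, 1])

print(pair)              # (0 + 2/9 + 2/9 + 1) / 4 = 13/36
print(single)            # 0.5 under the assumed normalisation
print(pair_transformed)  # same as pair
```

Under this normalisation every single-item support equals 0.5 regardless of the data, which is why single items carry no information for ranking.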
V Copula support as a statistical test
A standard technique in pattern mining is to compare the observed support against the independence model. In this section we demonstrate how to do this comparison for copula support. More specifically, we are interested in the quantity
[TABLE]
where $\mu$ and $\sigma^2$ are the mean and the variance of the copula support under the null hypothesis.
We will now show how to compute the mean and the variance of the copula support. In fact, if we set , then we will show that and
[TABLE]
We will also show that approaches the Gaussian distribution as the number of data points goes to infinity.
To simplify the analysis we will make an assumption that the probability of a tie between two values of an item is $0$. This assumption is reasonable if the dataset is generated, for example, from sensor readings.
We will dedicate the remaining section to proving these results. Note that we cannot use the Central Limit Theorem to prove the normality because the ranks of individual rows are not independent. Case in point, the copula support of a single item always takes the same value, hence the variance will be $0$ for this case.
In order to prove the result, we will first need to establish some notation. Assume that we have samples, independent and identically distributed random variables, , each sample is a vector of size . Define
[TABLE]
where $[\cdot]$ returns $1$ if the statement is true, and $0$ otherwise. Note that the term comparing a sample with itself is always $0$; however, we keep it in the sum for notational convenience. Similarly, we can now define
[TABLE]
If we are given a dataset $D$, then the copula support computed from $D$ is an estimate of the random variable $U$. Our goal is to compute $\mu = \operatorname{E}[U]$ and $\sigma^{2} = \operatorname{Var}[\sqrt{N}U]$.
Note that since we assume that and are independent for , it follows also that and are also independent for . However, unlike and , and are not independent.
In order to continue we need the following lemma.
Lemma 3
Fix and let , , and be distinct integers. Then
[TABLE]
Proof:
Since the probability of having a tie between variables is $0$, a symmetry argument shows that the probability that one of two variables is larger than the other is $1/2$.
Similarly, if we sort the three variables based on their value, there are 6 possible permutations, and each permutation has a probability of $1/6$. Exactly two permutations satisfy the second event, which shows that the probability of the second event is equal to $1/3$. Finally, there is only one permutation that satisfies the third event, so its probability is $1/6$, which proves the lemma. ∎
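The counting argument in the proof can be replayed by brute force. The sketch below is illustrative: the three events are assumptions chosen to match the counts quoted in the proof (one of two orderings, two of six, one of six), and the helper name `probability` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def probability(event, size):
    """Probability of `event` when all orderings of `size` tie-free values are equally likely."""
    orderings = list(permutations(range(size)))
    hits = sum(1 for values in orderings if event(values))
    return Fraction(hits, len(orderings))


# Assumed event 1: the first variable is larger than the second.
p_pair = probability(lambda v: v[0] > v[1], 2)

# Assumed event 2: the first variable is larger than both of the other two.
p_both = probability(lambda v: v[0] > v[1] and v[0] > v[2], 3)

# Assumed event 3: the three variables form one specific chain.
p_chain = probability(lambda v: v[0] > v[1] > v[2], 3)

print(p_pair, p_both, p_chain)  # 1/2 1/3 1/6
```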
We will first compute the mean of .
Proposition 4
The average of is .
Proof:
According to Lemma 3, . Since and are independent for , we can write
[TABLE]
This proves the result. ∎
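The mean under independence can also be checked by simulation. The sketch below assumes normalised ranks of the form $(\text{rank} - 1)/(N - 1)$; under that assumption each normalised rank has mean $1/2$, so for $M$ independent items the expected copula support is $2^{-M}$. The closed-form variance is not reproduced here, so the code only estimates the variance of $\sqrt{N}U$ empirically. All names and parameter values are hypothetical.

```python
import random


def copula_support(data, itemset):
    """Average product of normalised ranks, with ranks scaled to [0, 1] (assumed form)."""
    n = len(data)
    products = [1.0] * n
    for i in itemset:
        order = sorted(range(n), key=lambda j: data[j][i])
        for position, j in enumerate(order):
            products[j] *= position / (n - 1)
    return sum(products) / n


def simulate(n, m, trials, rng):
    """Draw independent uniform data repeatedly; return the mean of U and the variance of sqrt(N) U."""
    values = []
    for _ in range(trials):
        data = [[rng.random() for _ in range(m)] for _ in range(n)]
        values.append(copula_support(data, range(m)))
    mean = sum(values) / trials
    var_scaled = n * sum((v - mean) ** 2 for v in values) / (trials - 1)
    return mean, var_scaled


rng = random.Random(0)
mean3, var3 = simulate(n=50, m=3, trials=400, rng=rng)
mean1, var1 = simulate(n=50, m=1, trials=20, rng=rng)

print(mean3, 2 ** -3)  # empirical mean versus 2^-M for M = 3
print(var3)            # empirical variance of sqrt(N) U, no closed form asserted here
print(mean1, var1)     # a single item gives 0.5 with zero variance
```

The single-item run shows why a plain Central Limit Theorem argument does not apply: the per-row terms are dependent enough to remove all variance.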
Our next step is to compute the variance of . Since the variables are not independent, we will have to compute them in two stages. Our first step is to compute the second moment of .
Lemma 5
The second moment of is equal to
[TABLE]
Proof:
Decompose the second moment into two sums,
[TABLE]
According to Lemma 3, the terms in the first sum are equal to while the terms in the second sum are equal to . This gives us
[TABLE]
This completes the proof. ∎
Our next step is to compute the cross-moment of .
Lemma 6
The cross-moment is equal to
[TABLE]
Proof:
Decompose the moment into four sums
[TABLE]
where
[TABLE]
The random variables in the term of the sum of are all independent, hence the probability is equal to . According to Lemma 3 the term in the sum of is equal to and the term in the sum for and is equal to . This gives us
[TABLE]
Grouping the terms gives us
[TABLE]
This completes the proof. ∎
We can now use both lemmas in order to compute the variance.
Proposition 7
The variance $\operatorname{Var}[\sqrt{N}U]$ is equal to
[TABLE]
Proof:
We begin by splitting $\operatorname{E}[(\sqrt{N}U)^{2}]$ into two sums and applying Lemma 5 and Lemma 6,
[TABLE]
We can now use this to express the variance as
[TABLE]
This proves the result. ∎
Finally, we show that approaches a Gaussian distribution. Note that this result does not depend on the assumption that items are independent. Hence, we will be able to use the same result in the next section.
Proposition 8
The quantity approaches a Gaussian distribution as approaches infinity.
We postpone the proof of this proposition to the Appendix.
VI Productive Itemsets and Copula Support
In the previous section we tested the support against the independence assumption. A natural extension of this is to assume a partition of the given itemset such that items are independent only when they belong to different blocks of the partition. In fact, an approach suggested in [3] mines itemsets from binary data whose support is substantially larger than the expectation given by the partition. In order to mimic this for real-valued data, we define
[TABLE]
where $P$ is a partition of $X$ and where the mean and the variance are computed under the assumption that items belonging to different blocks in $P$ are independent. Our final goal is to find a partition that produces the lowest score, that is, a partition that explains the support the best, where the minimum is taken over all partitions with at least $2$ blocks. Note that we are only interested in a one-sided test. However, we can easily adjust the formula for a symmetrical two-sided test. In addition, in [3] the authors were looking only at partitions of size $2$, whereas we go over all non-trivial partitions.
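The search over non-trivial partitions can be sketched as an exhaustive enumeration, which is feasible only for small itemsets because the number of partitions grows quickly. The callable `score` stands in for the normalised score of a partition and is hypothetical here, as are the function names and the toy score.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (each block is a list)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for index in range(len(partition)):
            yield partition[:index] + [[first] + partition[index]] + partition[index + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition


def best_partition(items, score):
    """Return the partition with at least two blocks that has the lowest score."""
    candidates = [p for p in set_partitions(items) if len(p) >= 2]
    return min(candidates, key=score)


def toy_score(partition):
    # Placeholder score for illustration only: the size of the largest block.
    return max(len(block) for block in partition)


all_three = list(set_partitions("abc"))
best = best_partition("abc", toy_score)
print(len(all_three))  # 5 partitions of a 3-item set
print(best)            # [['a'], ['b'], ['c']] under the toy score
```

Restricting the candidates to partitions with exactly two blocks would mimic the setting attributed to [3] in the text.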
In this section we show how we can compute the needed mean and the variance in order to normalise the support. Unlike with the independence model, the test is no longer non-parametric and we will have to estimate several parameters for each subitemset in the partition. Moreover, we will provide the variance only as $N$ approaches infinity, as the interactions between variables are complex and hard to compute exactly for finite $N$.
We proceed as follows: We will first show what statistics we need from each subitemset and how to compute them. Then we will show how to use these statistics in order to compute the mean and the variance.
VI-A Statistics needed to compute the rank
Assume that we are given an itemset . This itemset will eventually be a block in the partition. Let be data samples. Let us shorten . Let us define
[TABLE]
which is essentially a product of normalised ranks of the $j$th datapoint. Similar to Section V, let $U = \frac{1}{N} \sum_{j=1}^{N} T_j$, a random variable corresponding to the copula support of $X$.
Ultimately, we will need three statistics from $X$, namely $\mu = \operatorname{E}[T_{1}]$, $\alpha = \operatorname{E}[T_{1}^{2}]$, and $\beta = \operatorname{Var}[\sqrt{N}U]$. We will discuss how to estimate these statistics in the next subsection. If $T_j$ were distributed independently, then $\beta = \alpha - \mu^2$. However, $T_j$ are dependent. Fortunately, we know enough about the dependency so that we can compute $\beta$.
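The first two statistics, the expectation of $T_1$ and $\alpha = \operatorname{E}[T_1^2]$, have natural plug-in estimates: the sample averages of the rank products and of their squares. The sketch below shows only these two estimates and assumes normalised ranks of the form $(\text{rank} - 1)/(N - 1)$; it does not attempt $\beta$. The function names and the toy data are hypothetical.

```python
def rank_products(data, itemset):
    """Product of normalised ranks per transaction (assumed normalisation to [0, 1])."""
    n = len(data)
    products = [1.0] * n
    for i in itemset:
        order = sorted(range(n), key=lambda j: data[j][i])
        for position, j in enumerate(order):
            products[j] *= position / (n - 1)
    return products


def estimate_mean_and_alpha(data, itemset):
    """Sample average of the rank products and of their squares."""
    t = rank_products(data, itemset)
    mean = sum(t) / len(t)
    alpha = sum(v * v for v in t) / len(t)
    return mean, alpha


data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]
mean, alpha = estimate_mean_and_alpha(data, [0, 1])
print(mean)   # 13/36, which is also the copula support of the itemset
print(alpha)  # (0 + 4/81 + 4/81 + 1) / 4 = 89/324
```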
In order to compute we need to introduce several random variables. Let
[TABLE]
be the rank of the th transaction for an itemset . In addition, let us define . We can express the variance with , and . The benefit of this is that we can estimate these parameters, and by doing so estimate , as we will demonstrate in the next subsection.
Proposition 9
The variance approaches
[TABLE]
as approaches infinity.
We postpone the proof of this proposition to the Appendix.
VI-B Estimating statistics
Unlike with the independence model, the mean and the variance under the partition model depend on the underlying distribution, and we are forced to estimate the statistics, namely the mean $\mu$, $\alpha$, and $\beta$, described in the previous section. These estimates are given in Algorithm 1. Estimating $\mu$ and $\alpha$ is trivial. However, estimating $\beta$ is more intricate due to the last term given in Proposition 9.
Assume that we are given a dataset and itemset . Fix and assume that is sorted based on th column, largest first. Let . Note that is an estimate for . Hence, we can estimate as
[TABLE]
We can use the right-hand side to compute for every efficiently, and then use to estimate . We can assume that we have precomputed the order w.r.t. each item before the actual mining. Hence, the cost of estimating the parameters is .
We should stress that we use the same dataset to compute the estimates and to compute the score. This means that the score will be somewhat skewed and we cannot interpret it as a $p$-value. However, our main goal is not to interpret the obtained values as a statistical test; rather, our goal is to rank patterns.
VI-C Computing z-score
Now that we have computed statistics for each itemset occurring in a partition, we can combine them in order to compute the mean and the variance needed for .
Proposition 10
Assume that we are given an itemset and a partition of . Let be random data points. Let , and let . Let , $\alpha_{i} = \operatorname{E}[\mathit{rnk}(1; P_{i}, \mathcal{Y})^{2}]$, $\beta_{i} = \lim_{N \to \infty} \operatorname{Var}[\sqrt{N}U_{i}]$.
Under the assumption that are independent, we have and
[TABLE]
as approaches infinity.
We postpone the proof of this proposition to the Appendix.
VII Related Work
While pattern mining has been well researched for binary data, the problem of discovering patterns from real-valued data is open. The most straightforward approach to mine patterns is to discretize data using thresholds, see for example [4]. Among methods that do not use thresholds, Calders et al. [5] proposed 3 quality measures for itemsets from numerical attributes. The first two measures were based on the extrema values of the items in an itemset. The most related measure to our work is the third measure, which is a generalisation of Kendall's $\tau$, essentially the number of pairs in which all items are concordant. Interestingly enough, similar to the copula support, this measure also depends only on the order of values, not on the actual values. In this work we were able to define two normalisations for our approach, while the authors did not introduce any statistical normalisation for their measure. We conjecture that a similar normalisation can be done also for this measure.
Jaroszewicz and Korzen [6] suggested discovering polynomial itemsets, essentially cross-moments from real-valued data. We can show that for a certain threshold distribution, our support is equal to the support of polynomial itemsets. Steinbach et al. [7] considered several support functions for itemsets, such as taking the smallest value in a transaction among the items in the itemset.
Ranking and filtering patterns based on a statistical test has been well studied. Brin et al. compared likelihood-ratio against independence assumption [8]. Webb proposed, among many other criteria, to compare the observed support to an expected support of a partition of size $2$ that fits best [3]. More complex null hypotheses such as Bayesian networks [9] or Maximum Entropy models [10] have been also suggested.
Our approach has similarities with mining itemsets from uncertain data [11], where instead of binary data, we have real values between $0$ and $1$ expressing the likelihood of the entry being equal to 1. In fact, if we interpret the normalised ranks computed in Section IV as a probabilistic dataset, then the copula support will be the same as the expected support computed from the probabilistic dataset. However, in the probabilistic setting the entries are assumed to be independent, whereas in our case they have an intricate dependency. Consequently, the variances given by Propositions 7 and 9 do not hold for probabilistic datasets. In addition, we cannot compute the frequentness measure suggested by Bernecker et al. [12] in our case; however, we can estimate it by a normal distribution as suggested by Calders et al. [13].
Defining and computing a quality score for two real-valued variables, essentially an itemset of length 2, is a surprisingly open problem. The approach based on Information Theory was suggested in [14]. An interesting starting point is also a measure of concordance, see Definition 5.1.7 in [1]. These approaches are suitable only for itemsets of size 2 whereas we are interested in measuring the quality of itemsets of any size. Finally, Székely and Rizzo [15] suggested a measure based on how pair-wise distances correlate. This measure is symmetric while our measure was specifically designed to focus on large values.
VIII Experiments
In this section we present our experiments.
Datasets: We used 2 synthetic and 3 real-world datasets as our benchmark data. The first dataset Ind consists of data points, each of items, generated independently uniformly from the interval . The second dataset Plant has the same dimensions as the first dataset. In this dataset we planted 5 subspace clusters each having items: We generated independently boolean variables indicating whether a transaction belongs to the th cluster; a transaction can belong to multiple clusters. We set . If , then we set the corresponding items to . All other values were set to $0$. Finally, we added noise sampled uniformly from . As real-world benchmark datasets we used the following 3 gene expression datasets: Alon [16], Arabidopsis thaliana or Thalia, and Saccharomyces cerevisiae or Yeast (Thalia and Yeast are available at http://www.tik.ee.ethz.ch/~sop/bimax/). The sizes of the datasets are given in Table I.
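The two synthetic datasets can be generated along the lines sketched below. Every numeric value in the example call (number of rows and items, cluster size, membership probability, planted value, noise width) is a hypothetical placeholder, since the actual parameters are not available in this text. The use of the unit interval for Ind, of disjoint consecutive item blocks for the clusters, and of symmetric noise are likewise assumptions.

```python
import random


def make_ind(n_rows, n_items, rng):
    """Independent values, drawn uniformly from [0, 1) (assumed interval)."""
    return [[rng.random() for _ in range(n_items)] for _ in range(n_rows)]


def make_plant(n_rows, n_items, n_clusters, cluster_size,
               member_prob, planted_value, noise, rng):
    """Plant clusters on consecutive item blocks, then add uniform noise.

    Assumptions: clusters use disjoint consecutive items, membership is drawn
    independently per transaction and cluster, and noise is symmetric.
    """
    if n_clusters * cluster_size > n_items:
        raise ValueError("clusters do not fit into the item set")
    data = [[0.0] * n_items for _ in range(n_rows)]
    for row in data:
        for c in range(n_clusters):
            if rng.random() < member_prob:
                for i in range(c * cluster_size, (c + 1) * cluster_size):
                    row[i] = planted_value
    for row in data:
        for i in range(n_items):
            row[i] += rng.uniform(-noise, noise)
    return data


rng = random.Random(1)
ind = make_ind(n_rows=100, n_items=20, rng=rng)
plant = make_plant(n_rows=100, n_items=20, n_clusters=5, cluster_size=3,
                   member_prob=0.1, planted_value=1.0, noise=0.1, rng=rng)
print(len(ind), len(ind[0]), len(plant), len(plant[0]))  # 100 20 100 20
```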
Setup: For each dataset we computed frequent itemsets using as a support. We set the threshold such that we get roughly several hundred thousand itemsets, see Table I. We then ranked itemsets using and . The results are given in Figure VIII.
References
- [1] R. B. Nelsen, An Introduction to Copulas. Springer, 2006.
- [2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules,” in Advances in Knowledge Discovery and Data Mining, 1996, pp. 307–328.
- [3] G. I. Webb, “Self-sufficient itemsets: An approach to screening potentially interesting associations between items,” TKDD, vol. 4, no. 1, pp. 3:1–3:20, 2010.
- [4] R. Srikant and R. Agrawal, “Mining quantitative association rules in large relational tables,” in SIGMOD, 1996, pp. 1–12.
- [5] T. Calders, B. Goethals, and S. Jaroszewicz, “Mining rank-correlated sets of numerical attributes,” in KDD, 2006, pp. 96–105.
- [6] S. Jaroszewicz and M. Korzen, “Approximating representations for large numerical databases,” in SDM, 2007.
- [7] M. Steinbach, P.-N. Tan, H. Xiong, and V. Kumar, “Generalizing the notion of support,” in KDD, 2004, pp. 689–694.
- [8] S. Brin, R. Motwani, and C. Silverstein, “Beyond market baskets: Generalizing association rules to correlations,” in SIGMOD, 1997, pp. 265–276.
