Itemsets for Real-valued Datasets

Nikolaj Tatti

arXiv:1902.00804·cs.DS·February 5, 2019

TL;DR

This paper introduces a novel method for mining meaningful itemsets from real-valued datasets by averaging over threshold-based supports, enabling efficient discovery of statistically significant patterns.

Contribution

It proposes a new family of quality scores for real-valued itemsets, treating thresholds as random variables and normalizing support for better pattern significance assessment.

Findings

01

Efficient computation of average support for real-valued itemsets.

02

Normalizations against independence and partition assumptions.

03

Effective discovery of statistically significant patterns.

Abstract

Pattern mining is one of the most well-studied subfields in exploratory data analysis. While there is a significant amount of literature on how to discover and rank itemsets efficiently from binary data, there is surprisingly little research on mining patterns from real-valued data. In this paper we propose a family of quality scores for real-valued itemsets. We approach the problem by casting the dataset into binary data and computing the support from this data. This naive approach requires us to select thresholds. To remedy this, instead of selecting one set of thresholds, we treat thresholds as random variables and compute the average support. We show that we can compute this support efficiently, and we also introduce two normalisations, namely comparing the support against the independence assumption and, more generally, against the partition assumption. Our…

Tables

Table I: Basic statistics of datasets and experiments

Name	Size	Threshold	Time	Number of patterns
Ind	$10\,000 \times 100$	$0.1$	7 m 37 s	$166\,750$
Plant	$10\,000 \times 100$	$0.1$	7 m 14 s	$171\,303$
Alon	$2000 \times 62$	$0.26$	8 m 17 s	$393\,683$
Thalia	$734 \times 69$	$0.12$	2 m 10 s	$148\,334$
Yeast	$2993 \times 173$	$0.2$	19 m 50 s	$529\,872$

Equations

\mathit{fr}(X; D) = \frac{|\{1 \leq i \leq N ; d_{i} \text{ covers } X\}|}{N} .

D_{T} = \{x_{T} \mid x \in D\} .

\mathit{fr}(X; D, p) = \operatorname{E}[\mathit{fr}(X; D_{T})] = \int_{t_{1}} \cdots \int_{t_{K}} \mathit{fr}(X; D_{T}) \prod_{i=1}^{K} p(R_{i} = t_{i}) \, dt_{i} .

\mathit{fr}(X; D, p) = \frac{1}{N} \sum_{x \in D} \prod_{i \in X} p(R_{i} \leq x_{i}) .

\mathit{fr}(X; D, p) = \operatorname{E}[\mathit{fr}(X; D_{T})] = \frac{1}{N} \sum_{x \in D} p(x_{T} \text{ covers } X) .

p(x_{T} \text{ covers } X) = \prod_{i \in X} p(R_{i} \leq x_{i}) .

p(R_{j} < d_{ij}) = \frac{i - 1}{N - 1} .

\mathit{cp}(X; D) = \frac{1}{N} \sum_{i=1}^{N} \mathit{rnk}(i; X, D),

\{(1.2, 4.5, 3.8, 8.9), (4.4, 4.7, 1.9, 8.8), (8.2, 8.5, 3.0, 6.5)\} .

\{(0, 0, 1, 1), (0.5, 0.5, 0, 0.5), (1, 1, 0.5, 0)\} .

\mathit{cp}(a_{2} a_{3}) = \frac{1}{3} (0 \times 1 + 0.5 \times 0 + 1 \times 0.5) = \frac{1}{6} .

p(R_{j} < d_{ij}) = \max\left(\min\left(\frac{i - 1 - M}{N - 1 - 2M}, 1\right), 0\right),

z_{\textsc{ind}}(X; D) = \sqrt{N} \, \frac{\mathit{cp}(X; D) - \mu}{\sigma},

\sigma^{2} = \frac{(2N - 1)^{M}}{6^{M} (N - 1)^{M}} + \frac{(N - 2)^{M} (3N - 1)^{M}}{12^{M} (N - 1)^{2M - 1}} - \frac{N}{4^{M}} .

S_{ij} = \mathit{rnk}(i; j, \mathcal{Y}) = \frac{1}{N - 1} \sum_{k=1}^{N} \mathit{I}[Y_{ij} > Y_{kj}],

U = \mathit{cp}(X; \mathcal{Y}) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j \in X} S_{ij} .

\begin{aligned} p(Y_{ij} > Y_{kj}) &= 1/2, \\ p(Y_{ij} > Y_{lj}, Y_{kj} > Y_{lj}) &= 1/3, \\ p(Y_{ij} > Y_{kj}, Y_{kj} > Y_{lj}) &= 1/6 . \end{aligned}

\operatorname{E}[U] = \frac{1}{N} \sum_{i=1}^{N} \prod_{j \in X} \operatorname{E}[S_{ij}] = \frac{1}{N} \sum_{i=1}^{N} \prod_{j \in X} \frac{1}{2} = \frac{1}{2^{M}} .

\operatorname{E}[S_{ij}^{2}] = \frac{2N - 1}{6 (N - 1)} .

\begin{split}\operatorname{E}\mathopen{}\left[S_{ij}^{2}\right]&=\frac{1}{(N-1)^{2}}\operatorname{E}\mathopen{}\big{[}{\big{(}\sum_{k\neq i}\mathit{I}\mathopen{}\left[Y_{ij}>Y_{kj}\right]\big{)}^{2}}\big{]}\\ &=\frac{1}{(N-1)^{2}}\sum_{k\neq i}p(Y_{ij}>Y_{kj})\\ &\qquad+\frac{1}{(N-1)^{2}}\sum_{k\neq i}\sum_{l\neq k,i}p(Y_{ij}>Y_{kj},Y_{ij}>Y_{lj})\ .\end{split}

\begin{split}\operatorname{E}\mathopen{}\left[S_{ij}^{2}\right]&=\frac{1}{(N-1)^{2}}\big{(}(N-1)/2+(N-1)(N-2)/3\big{)}\\ &=\frac{1}{6(N-1)}(3+2(N-2))=\frac{2N-1}{6(N-1)}\quad.\\ \end{split}

\operatorname{E}[S_{ij} S_{kj}] = \frac{(N - 2)(3N - 1)}{12 (N - 1)^{2}} .

\begin{split}&\operatorname{E}\mathopen{}\left[S_{ij}S_{kj}\right]\\ &\ =\frac{1}{(N-1)^{2}}\operatorname{E}\mathopen{}\big{[}{\big{(}\sum_{m\neq i}\mathit{I}\mathopen{}\left[Y_{ij}>Y_{mj}\right]\big{)}\big{(}\sum_{n\neq k}\mathit{I}\mathopen{}\left[Y_{kj}>Y_{nj}\right]\big{)}}\big{]}\\ &\ =\frac{A+B+C+D}{(N-1)^{2}},\end{split}

\begin{aligned} A &= \sum_{m \neq i, k} p(Y_{ij} > Y_{mj}, Y_{kj} > Y_{mj}), \\ B &= \sum_{m \neq i, k} \sum_{n \neq m, i, k} p(Y_{ij} > Y_{mj}, Y_{kj} > Y_{nj}), \\ C &= \sum_{n \neq i, k} p(Y_{ij} > Y_{kj}, Y_{kj} > Y_{nj}), \text{ and} \\ D &= \sum_{m \neq i, k} p(Y_{ij} > Y_{mj}, Y_{kj} > Y_{ij}) . \end{aligned}

A = \frac{N - 2}{3}, \quad B = \frac{(N - 2)(N - 3)}{4}, \quad C = D = \frac{N - 2}{6} .

\operatorname{E}[S_{ij} S_{kj}] = \frac{4 (N - 2) + 3 (N - 2)(N - 3) + 4 (N - 2)}{12 (N - 1)^{2}} = \frac{(N - 2)(3N - 1)}{12 (N - 1)^{2}} .

\sigma^{2} = \frac{(2N - 1)^{M}}{6^{M} (N - 1)^{M}} + \frac{(N - 2)^{M} (3N - 1)^{M}}{12^{M} (N - 1)^{2M - 1}} - \frac{N}{4^{M}} .

\begin{split}\operatorname{E}\mathopen{}\big{[}{(\sqrt{N}U)^{2}}\big{]}&=\frac{1}{N}\big{(}\sum_{i=1}^{N}\prod_{j\in X}\operatorname{E}\mathopen{}\left[S_{ij}^{2}\right]+\sum_{i,k\atop i\neq k}\prod_{j\in X}\operatorname{E}\mathopen{}\left[S_{ij}S_{kj}\right]\big{)}\\ &=\frac{(2N-1)^{M}}{6^{M}(N-1)^{M}}+\frac{(N-2)^{M}(3N-1)^{M}}{12^{M}(N-1)^{2M-1}}\ .\end{split}

\begin{split}\sigma^{2}&=\operatorname{E}\mathopen{}\big{[}{(\sqrt{N}U)^{2}}\big{]}-N\operatorname{E}\mathopen{}\left[U\right]^{2}\\ &=\frac{(2N-1)^{M}}{6^{M}(N-1)^{M}}+\frac{(N-2)^{M}(3N-1)^{M}}{12^{M}(N-1)^{2M-1}}-\frac{N}{4^{M}}\quad.\end{split}

z_{\textsc{prt}}(X, P; D) = \frac{\mathit{cp}(X; D) - \mu}{\sigma},

Full text

HIIT, Department of Information and Computer Science Aalto University, Finland

Abstract

Pattern mining is one of the most well-studied subfields in exploratory data analysis. While there is a significant amount of literature on how to discover and rank itemsets efficiently from binary data, there is surprisingly little research on mining patterns from real-valued data. In this paper we propose a family of quality scores for real-valued itemsets. We approach the problem by casting the dataset into binary data and computing the support from this data. This naive approach requires us to select thresholds. To remedy this, instead of selecting one set of thresholds, we treat thresholds as random variables and compute the average support. We show that we can compute this support efficiently, and we also introduce two normalisations, namely comparing the support against the independence assumption and, more generally, against the partition assumption. Our experimental evaluation demonstrates that we can discover statistically significant patterns efficiently.

Index Terms:

pattern mining, itemsets, real-valued itemsets

I Introduction

Pattern mining is one of the most well-studied subfields in exploratory data analysis. While there is a significant amount of literature on how to discover and rank itemsets efficiently from binary data, there is surprisingly little research done in mining patterns from real-valued data. In this paper we propose a family of quality scores for real-valued itemsets.

In order to motivate our approach, assume that we are given a dataset $D$ containing real numbers and a miner for mining itemsets from binary data. The most straightforward way to use the miner to find patterns in $D$ is to transform $D$ into binary data and apply the miner. More formally, assume that we have selected a threshold $t_{i}$ for every item $i$ in the dataset. Then we define a binary dataset $B$ by setting $b_{ji}=1$ if $d_{ji}\geq t_{i}$, and $b_{ji}=0$ otherwise, where $j$ ranges over all transactions of $D$.

This approach has two immediate drawbacks. Firstly, we have to select the thresholds $t_{i}$. Secondly, such a measure is coarse: any intricate interaction between items is destroyed as data values are reduced to two categories, 0s and 1s. Hence, instead of selecting just one set of thresholds, we will vary $t_{i}$, and instead of computing support only for one dataset, we will compute an average support. More formally, we will attach a distribution $p(R_{i}=t_{i})$ to each threshold and compute the mean $\operatorname{E}\mathopen{}\left[{\mathit{fr}\mathopen{}\left(X;B\right)}\right]$, where ${\mathit{fr}\mathopen{}\left(X;B\right)}$ is the frequency (support) of an itemset $X$ in a binarized dataset $B$.

This approach has several benefits. First of all, the support is monotonically decreasing, which allows us to discover all frequent itemsets efficiently. Secondly, we will show that we can compute the support efficiently, even though it involves taking an average over a complex function.

We still need to choose the threshold distribution $p(R_{i}=t_{i})$. In this work we focus on a specific distribution related to copulas [1]: roughly speaking, we will define $p(R_{i}\leq d_{ji})=k/({\left|D\right|}-1)$, where $k$ is the rank of the $j$th transaction after the data is sorted w.r.t. the $i$th column. We will see that this distribution induces a support in which the actual values of individual items do not matter; instead, the support is based on the ranks of the values. Interestingly enough, several popular statistical tests, such as the Mann-Whitney U test or the Wilcoxon signed-rank test, are also based on the ranks of values.

A standard technique in pattern mining is to compare the observed support against the expected value under some null hypothesis, where the hypothesis is typically an independence assumption. Here we consider two approaches. In the first approach we do a $z$-normalisation by comparing the support against the independence assumption. In our second approach, we generalise the null hypothesis to a partition model, where we assume that items from different parts of the partition are independent. A particular difficulty with these approaches is that, in order to compute them, we need the mean and the variance of the support under the null hypothesis. While this is trivial when dealing with simple transactional data, it becomes intricate here since the threshold distribution actually depends on the dataset. Nevertheless, we can compute the exact mean and variance for the independence assumption and the exact mean and asymptotic variance for the partition assumption. Interestingly enough, the independence test is non-parametric, that is, the mean and the variance depend only on the number of datapoints, whereas in the partition assumption we need to estimate parameters from the dataset.

The rest of the paper is organized as follows. We introduce preliminary notation in Section II. We define our general measure in Section III and introduce copula support in Section IV. We present an independence test in Section V and a test based on partitions in Section VI. We discuss related work in Section VII and present our experiments in Section VIII. Finally, we conclude our paper with remarks in Section IX.

II Preliminaries and Notation

In this section we introduce the preliminary notation.

A dataset $D$ is a multiset of $N$ transactions $d_{1},\ldots,d_{N}$, where $d_{j}\in\mathbb{R}^{K}$ is a vector of length $K$. We will often use $N={\left|D\right|}$ as the number of datapoints and $K$ as the dimension of the dataset. We treat each vector $d_{i}$ as a sample from an unknown distribution, $p(a_{1},\ldots,a_{K})$. We refer to the random variables $a_{i}$ as items, or as features.

Let $A=\left\{a_{1},\ldots,a_{K}\right\}$ be the set of all items. An itemset $X$ is a set of items $X\subseteq A$ . Assume that you are given an itemset $X$ and a binary vector $d\in\left\{0,1\right\}^{K}$ . We say that $d$ covers $X$ if $d_{i}=1$ , for every $a_{i}\in X$ . We will use standard notation, by writing $x_{1}\cdots x_{M}$ to mean $\left\{x_{1},\ldots,x_{M}\right\}$ .

Assume now that we are given a collection of binary vectors $D=d_{1},\ldots,d_{N}$ . We define the support or the frequency of an itemset $X$ as the proportion of transactions in $D$ covering $X$ ,

$$\mathit{fr}(X; D) = \frac{|\{1 \leq i \leq N ; d_{i} \text{ covers } X\}|}{N} .$$

An important property of the support is that it is monotonically decreasing, that is, ${\mathit{fr}\mathopen{}\left(X;D\right)}\leq{\mathit{fr}\mathopen{}\left(Y;D\right)}$ , if $Y\subseteq X$ . This property allows us to use efficient techniques [2] to discover all itemsets whose frequency is higher than some given threshold.
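As a minimal sketch of the definition above, the following Python snippet computes the support of an itemset in a small binary dataset and illustrates the monotonicity property. The function names `covers` and `support` and the toy data are illustrative assumptions rather than part of the paper; items are referred to by 0-based column index.

```python
from typing import Iterable, Sequence


def covers(transaction: Sequence[int], itemset: Iterable[int]) -> bool:
    """Return True if the binary transaction has a 1 at every index in itemset."""
    return all(transaction[i] == 1 for i in itemset)


def support(itemset: Sequence[int], data: Sequence[Sequence[int]]) -> float:
    """fr(X; D): the fraction of transactions in data that cover itemset."""
    return sum(covers(d, itemset) for d in data) / len(data)


# Toy binary dataset with four transactions over three items (made-up values).
toy = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1)]

# Adding items to an itemset can only lower its support.
for itemset in ([0], [0, 1], [0, 1, 2]):
    print(itemset, support(itemset, toy))
```

Each superset in the loop has support no larger than the itemset before it, which is the property that level-wise search relies on for pruning.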

III Itemset support for real-valued data

In this section we define our measure for real-valued data. In order to do so, let $D$ be a dataset over $K$ items, ${a_{1}},\ldots,{a_{K}}$, and $N$ transactions. Assume that we are given a threshold $t_{i}\in\mathbb{R}$ for each item $a_{i}$. Let us write $T=\left(t_{1},\ldots,t_{K}\right)$. Given a vector $x\in\mathbb{R}^{K}$ of length $K$, we define $y=x_{T}$ to be a binary vector with $y_{i}=1$ if $x_{i}\geq t_{i}$, and $y_{i}=0$ otherwise. We now define a binarized dataset $D_{T}$ to be

$$D_{T} = \{x_{T} \mid x \in D\} .$$

Essentially, $D_{T}$ is a dataset where each value is binarized either to $0$ or to $1$, depending on the threshold. We can now compute a support for a given itemset $X$ by computing ${\mathit{fr}\mathopen{}\left(X;D_{T}\right)}$.

The problem with this approach is that we need to select a threshold set $T$ . Additionally, once we have made this choice, the treatment of values in $D$ is coarse: a value slightly higher than the threshold contributes to the support as much as the values that are significantly higher.

To remedy this, we treat thresholds as random variables. That is, we have $K$ random variables, ${R_{1}},\ldots,{R_{K}}$ . We will assume that each threshold is assigned independently, that is, $R_{i}$ are independent variables. We will go over some of the natural choices for distributions of $R_{i}$ later on. If we write $p(R_{i}=t_{i})$ to be the density function of the $i$ th threshold, we can now define support as an average support, where the mean is taken over the possible thresholds, that is,

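$$\mathit{fr}(X; D, p) = \operatorname{E}[\mathit{fr}(X; D_{T})] = \int_{t_{1}} \cdots \int_{t_{K}} \mathit{fr}(X; D_{T}) \prod_{i=1}^{K} p(R_{i} = t_{i}) \, dt_{i} .$$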

The important property of this support is that it is monotonically decreasing. This allows us to mine all frequent itemsets using the standard pattern mining search.

Proposition 1

Assume two itemsets $X,Y$ such that $X\subseteq Y$ . Then ${\mathit{fr}\mathopen{}\left(X;D,p\right)}\geq{\mathit{fr}\mathopen{}\left(Y;D,p\right)}$ .

Proof:

For any given threshold set $T$ , we have ${\mathit{fr}\mathopen{}\left(X;D_{T}\right)}\geq{\mathit{fr}\mathopen{}\left(Y;D_{T}\right)}$ . It follows immediately, that $\operatorname{E}\mathopen{}\left[{\mathit{fr}\mathopen{}\left(X;D_{T}\right)}\right]\geq\operatorname{E}\mathopen{}\left[{\mathit{fr}\mathopen{}\left(Y;D_{T}\right)}\right]$ , which proves the proposition. ∎

Computing the support from the definition is awkward as it requires taking ${\left|X\right|}$ integrals. Fortunately, we can rewrite the support in a much more accessible form.

Proposition 2

Assume a dataset $D$ with $N$ transactions and a distribution $p$ over the thresholds. Then the support of itemset $X$ is equal to

$$\mathit{fr}(X; D, p) = \frac{1}{N} \sum_{x \in D} \prod_{i \in X} p(R_{i} \leq x_{i}) .$$

Proof:

We can rewrite the support as

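$$\mathit{fr}(X; D, p) = \operatorname{E}[\mathit{fr}(X; D_{T})] = \frac{1}{N} \sum_{x \in D} p(x_{T} \text{ covers } X) .$$

Transaction $y=x_{T}$ covers $X$ if and only if $x_{i}\geq R_{i}$ for each $i\in X$. Since $R_{i}$ are independent, it follows that

$$p(x_{T} \text{ covers } X) = \prod_{i \in X} p(R_{i} \leq x_{i}) .$$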

This completes the proof. ∎
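Proposition 2 can be checked numerically. The sketch below assumes, purely for illustration, that every threshold $R_{i}$ is uniform on $[0,1]$; this is not the distribution the paper goes on to use. `average_support` evaluates the closed form of Proposition 2, while `monte_carlo_support` averages the support of the binarized dataset $D_{T}$ over sampled threshold vectors. All function names are hypothetical.

```python
import numpy as np


def average_support(data, itemset, cdf):
    """Closed form of Proposition 2: mean over rows of prod_{i in X} p(R_i <= x_i)."""
    probs = cdf(data[:, itemset])          # shape (N, |X|)
    return float(probs.prod(axis=1).mean())


def monte_carlo_support(data, itemset, sampler, draws, rng):
    """Average fr(X; D_T) over `draws` sampled threshold vectors T."""
    total = 0.0
    for _ in range(draws):
        t = sampler(rng, data.shape[1])    # one threshold per item
        binarized = data >= t              # the binarized dataset D_T
        total += binarized[:, itemset].all(axis=1).mean()
    return total / draws


def uniform_cdf(x):
    """p(R <= x) for R uniform on [0, 1] (an assumed threshold distribution)."""
    return np.clip(x, 0.0, 1.0)


def uniform_sampler(rng, k):
    """Draw k independent thresholds, each uniform on [0, 1]."""
    return rng.uniform(0.0, 1.0, size=k)


rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=(200, 3))   # synthetic data, not from the paper
exact = average_support(data, [0, 2], uniform_cdf)
approx = monte_carlo_support(data, [0, 2], uniform_sampler, 5000, rng)
print(exact, approx)  # the two values should agree up to sampling noise
```

The closed form needs one pass over the data, whereas the sampled estimate needs one pass per draw, which is the practical benefit of the proposition.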

IV Copula Support

Our measure depends on the threshold distribution. In this section we focus on a specific distribution related to copulas.

Assume that we are given a dataset $D=d_{1},\ldots,d_{N}$. Let us assume for simplicity that for each item, say $a_{j}$, the data points $d_{ij}$ are unique. Fix an item $a_{j}$ and for notational simplicity let us assume that the datapoints are ordered according to the $j$th item, $d_{ij}<d_{(i+1)j}$ for $i=1,\ldots,N-1$. Let us define the distribution of the threshold $R_{j}$ by requiring that the threshold hits the interval $[d_{ij},d_{(i+1)j}]$ with a probability of $1/(N-1)$, where $i=1,\ldots,N-1$. In other words, the cumulative distribution is equal to

$$p(R_{j} < d_{ij}) = \frac{i - 1}{N - 1} .$$

This gives us a straightforward way of computing the support. Given a dataset $D$ of $N$ points, we compute $r_{ij}=(c-1)/(N-1)$, where $c$ is the rank of the $i$th transaction according to the $j$th column. We can now define a copula support (a copula is the cumulative joint distribution of random variables that have gone through such a transformation [1]) by

$$\mathit{cp}(X; D) = \frac{1}{N} \sum_{i=1}^{N} \mathit{rnk}(i; X, D),$$

where ${\mathit{rnk}\mathopen{}\left(i;X,D\right)}=\prod_{j\in X}r_{ij}$ .

Example 1

Consider that we are given a dataset with 4 items and 3 transactions

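$$\{(1.2, 4.5, 3.8, 8.9), (4.4, 4.7, 1.9, 8.8), (8.2, 8.5, 3.0, 6.5)\} .$$

The corresponding ranks $\left\{r_{ij}\right\}$ are then

$$\{(0, 0, 1, 1), (0.5, 0.5, 0, 0.5), (1, 1, 0.5, 0)\} .$$

For example, the copula support for $\left\{a_{2}a_{3}\right\}$ is then

$$\mathit{cp}(a_{2}a_{3}) = \frac{1}{3}(0 \times 1 + 0.5 \times 0 + 1 \times 0.5) = \frac{1}{6} .$$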
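Example 1 can be reproduced in a few lines. The sketch below is an illustration with hypothetical function names (`normalised_ranks`, `copula_support`); it assumes that no column contains ties and uses 0-based column indices, so the itemset $a_{2}a_{3}$ becomes `[1, 2]`.

```python
import numpy as np


def normalised_ranks(data):
    """r_ij = (c - 1) / (N - 1), where c is the 1-based rank of row i in column j.

    Assumes that the values within each column are distinct.
    """
    n = data.shape[0]
    rank0 = data.argsort(axis=0).argsort(axis=0)   # 0-based rank, i.e. c - 1
    return rank0 / (n - 1)


def copula_support(data, itemset):
    """cp(X; D): mean over rows of the product of normalised ranks over X."""
    ranks = normalised_ranks(data)
    return float(ranks[:, itemset].prod(axis=1).mean())


# The three transactions of Example 1.
D = np.array([
    [1.2, 4.5, 3.8, 8.9],
    [4.4, 4.7, 1.9, 8.8],
    [8.2, 8.5, 3.0, 6.5],
])

print(normalised_ranks(D))
print(copula_support(D, [1, 2]))   # itemset a_2 a_3, equal to 1/6
```

Applying any strictly increasing transformation to a column leaves the output unchanged, since only the ranks enter the computation.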

As we see in the experiments, using ${\mathit{cp}\mathopen{}\left(X,D\right)}$ as a filtering condition is not enough. Consequently, we also define ${\mathit{cp}\mathopen{}\left(X;D,\alpha\right)}$ by setting

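$$p(R_{j} < d_{ij}) = \max\left(\min\left(\frac{i - 1 - M}{N - 1 - 2M}, 1\right), 0\right),$$

where $M=\lfloor\alpha N\rfloor$, that is, the top $\alpha N$ values of each item will always be above the threshold and the bottom $\alpha N$ values will always be below it.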
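The trimmed variant can be sketched in the same way. The code below is an illustration under stated assumptions: the names `trimmed_ranks` and `trimmed_copula_support` are hypothetical, columns are assumed to contain no ties, and $\alpha$ is assumed small enough that $N-1-2M>0$.

```python
import math

import numpy as np


def trimmed_ranks(data, alpha):
    """Per-column values max(min((i - 1 - M) / (N - 1 - 2M), 1), 0), M = floor(alpha * N).

    Assumes no ties within a column and N - 1 - 2M > 0.
    """
    n = data.shape[0]
    m = math.floor(alpha * n)
    rank0 = data.argsort(axis=0).argsort(axis=0)   # i - 1 for each entry
    return np.clip((rank0 - m) / (n - 1 - 2 * m), 0.0, 1.0)


def trimmed_copula_support(data, itemset, alpha):
    """cp(X; D, alpha): mean over rows of the product of trimmed ranks over X."""
    return float(trimmed_ranks(data, alpha)[:, itemset].prod(axis=1).mean())


# A single column with 11 distinct values; alpha = 0.2 gives M = 2.
column = np.arange(11, dtype=float).reshape(-1, 1)
print(trimmed_ranks(column, 0.2).ravel())
```

With $\alpha=0$ the function reduces to the plain normalised ranks, so the untrimmed copula support is a special case.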

Copula support has some peculiar features. First of all, the support does not depend on the actual values of $D$, only on their ranks. This makes the support well suited to cases where computing the difference between the values of $D$ does not make sense. In addition, ${\mathit{cp}\mathopen{}\left(a_{i};D\right)}=1/2$ for any item, hence the support is not useful for selecting itemsets of size $1$. Even though we assume that $D$ has independent samples, the ranks $r_{ij}$ are no longer independent. However, if we assume independence between the items, we can compute the mean and the variance, as we will see in the next section.

V Copula support as a statistical test

A standard technique in pattern mining is to compare the observed support against the independence model. In this section we demonstrate how to do this comparison for copula support. More specifically, we are interested in the quantity

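$$z_{\textsc{ind}}(X; D) = \sqrt{N} \, \frac{\mathit{cp}(X; D) - \mu}{\sigma},$$

where $\mu$ is the mean of the copula support and $\sigma^{2}$ is the variance of $\sqrt{N}$ times the copula support under the null hypothesis.

We will now show how to compute this mean and variance. In fact, if we set $M={\left|X\right|}$, then we will show that $\mu=1/2^{M}$ and

$$\sigma^{2} = \frac{(2N - 1)^{M}}{6^{M} (N - 1)^{M}} + \frac{(N - 2)^{M} (3N - 1)^{M}}{12^{M} (N - 1)^{2M - 1}} - \frac{N}{4^{M}} .$$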

We will also show that ${\mathit{z_{\textsc{ind}}}\mathopen{}\left(X;D\right)}$ approaches the Gaussian distribution $N(0,1)$ as the number of data points goes to infinity.
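The closed-form mean and variance make the independence score cheap to evaluate. The following sketch is an illustration, not the author's implementation: `independence_moments` and `z_ind` are hypothetical names, exact rational arithmetic is used for the variance because its terms nearly cancel, and the score is scaled by $\sqrt{N}$ to match the definition $\sigma^{2}=\operatorname{Var}[\sqrt{N}U]$ used in this section.

```python
import math
from fractions import Fraction

import numpy as np


def independence_moments(n, m):
    """Exact mu and sigma^2 = Var[sqrt(N) U] for an itemset of size m and n rows."""
    mu = Fraction(1, 2 ** m)
    var = (Fraction((2 * n - 1) ** m, 6 ** m * (n - 1) ** m)
           + Fraction((n - 2) ** m * (3 * n - 1) ** m,
                      12 ** m * (n - 1) ** (2 * m - 1))
           - Fraction(n, 4 ** m))
    return mu, var


def z_ind(data, itemset):
    """Copula support normalised against the independence model.

    Assumes no ties and at least two items (sigma^2 is zero for a single item).
    """
    n = data.shape[0]
    ranks = data.argsort(axis=0).argsort(axis=0) / (n - 1)
    cp = ranks[:, itemset].prod(axis=1).mean()
    mu, var = independence_moments(n, len(itemset))
    return math.sqrt(n) * (cp - float(mu)) / math.sqrt(float(var))


# Synthetic data (not from the paper): one independent pair, one dependent pair.
rng = np.random.default_rng(1)
independent = rng.normal(size=(500, 2))
x = rng.normal(size=500)
dependent = np.column_stack([x, x + 0.5 * rng.normal(size=500)])
print(z_ind(independent, [0, 1]), z_ind(dependent, [0, 1]))
```

The score for the dependent pair should be far larger than the score for the independent pair, which is expected to behave roughly like a standard normal draw.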

To simplify the analysis we will make the assumption that the probability of a tie between two values of an item is $0$. This assumption is reasonable if the dataset is generated, for example, from sensor readings.

We will dedicate the remainder of this section to proving these results. Note that we cannot use the Central Limit Theorem to prove normality because the ranks of individual rows are not independent. Case in point, ${\mathit{cp}\mathopen{}\left(x\right)}$ for a single item will always be $1/2$, hence the variance will be $0$ in this case.

In order to prove the result, we will first need to establish some notation. Assume that we have $N$ samples, independent and identically distributed random variables, $\mathcal{Y}=Y_{1},\ldots,Y_{N}$ , each sample is a vector of size $K$ . Define

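$$S_{ij} = \mathit{rnk}(i; j, \mathcal{Y}) = \frac{1}{N - 1} \sum_{k=1}^{N} \mathit{I}[Y_{ij} > Y_{kj}],$$

where $\mathit{I}\mathopen{}\left[B\right]$ returns $1$ if the statement $B$ is true, and $0$ otherwise. Note that the term $\mathit{I}\mathopen{}\left[Y_{ij}>Y_{ij}\right]=0$; however, we keep it in the sum for notational convenience. Similarly, we can now define

$$U = \mathit{cp}(X; \mathcal{Y}) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j \in X} S_{ij} .$$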

If we are given a dataset $D$ , then ${\mathit{cp}\mathopen{}\left(X;D\right)}$ is an estimate of the random variable $U$ . Our goal is to compute $\mu=\operatorname{E}\mathopen{}\left[U\right]$ and $\sigma^{2}=\operatorname{Var}\mathopen{}\big{[}{\sqrt{N}U}\big{]}$ .

Note that since we assume that $Y_{ij}$ and $Y_{kl}$ are independent for $j\neq l$ , it follows also that $S_{ij}$ and $S_{kl}$ are also independent for $j\neq l$ . However, unlike $Y_{ij}$ and $Y_{kj}$ , $S_{ij}$ and $S_{kj}$ are not independent.

In order to continue we need the following lemma.

Lemma 3

Fix $j$ and let $i$ , $k$ , and $l$ be distinct integers. Then

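$$\begin{aligned} p(Y_{ij} > Y_{kj}) &= 1/2, \\ p(Y_{ij} > Y_{lj}, Y_{kj} > Y_{lj}) &= 1/3, \\ p(Y_{ij} > Y_{kj}, Y_{kj} > Y_{lj}) &= 1/6 . \end{aligned}$$

Proof:

Since the probability of having a tie between variables is $0$, by symmetry the probability that $Y_{ij}$ is larger than $Y_{kj}$ is $1/2$.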

Similarly, if we sort the three variables based on their value, there are 6 possible permutations, each permutation has a probability of $1/6$ . There are two permutations that satisfy the second event, namely $Y_{ij}>Y_{kj}>Y_{lj}$ and $Y_{kj}>Y_{ij}>Y_{lj}$ . This shows that the probability of the second event is equal to $1/3$ . Finally, there is only one permutation that satisfies the third event, namely, $Y_{ij}>Y_{kj}>Y_{lj}$ , which proves the lemma. ∎
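The counting argument in this proof can be mirrored by brute force. The sketch below enumerates the six equally likely orderings of three tie-free values and counts those satisfying each event; `event_probability` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction
from itertools import permutations


def event_probability(event):
    """Probability of `event` when (Y_i, Y_k, Y_l) is a uniformly random ordering."""
    orders = list(permutations(range(3)))      # relative ranks of the three values
    hits = sum(1 for (yi, yk, yl) in orders if event(yi, yk, yl))
    return Fraction(hits, len(orders))


p_first = event_probability(lambda yi, yk, yl: yi > yk)
p_second = event_probability(lambda yi, yk, yl: yi > yl and yk > yl)
p_third = event_probability(lambda yi, yk, yl: yi > yk and yk > yl)
print(p_first, p_second, p_third)   # 1/2 1/3 1/6
```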

We will first compute the mean of $U$ .

Proposition 4

The average of $U$ is $\operatorname{E}\mathopen{}\left[U\right]=1/2^{M}$ .

Proof:

According to Lemma 3, $\operatorname{E}\mathopen{}\left[S_{ij}\right]=1/2$ . Since $S_{ij}$ and $S_{kl}$ are independent for $j\neq l$ , we can write

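$$\operatorname{E}[U] = \frac{1}{N} \sum_{i=1}^{N} \prod_{j \in X} \operatorname{E}[S_{ij}] = \frac{1}{N} \sum_{i=1}^{N} \prod_{j \in X} \frac{1}{2} = \frac{1}{2^{M}} .$$

This proves the result. ∎

Our next step is to compute the variance of $U$. Since the variables $S_{ij}$ are not independent, we will have to compute it in two stages. Our first step is to compute the second moment of $S_{ij}$.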

Lemma 5

The second moment of $S_{ij}$ is equal to

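$$\operatorname{E}[S_{ij}^{2}] = \frac{2N - 1}{6 (N - 1)} .$$

Proof:

Decompose the second moment into two sums,

$$\begin{aligned} \operatorname{E}[S_{ij}^{2}] &= \frac{1}{(N-1)^{2}} \operatorname{E}\Big[\Big(\sum_{k \neq i} \mathit{I}[Y_{ij} > Y_{kj}]\Big)^{2}\Big] \\ &= \frac{1}{(N-1)^{2}} \sum_{k \neq i} p(Y_{ij} > Y_{kj}) \\ &\qquad + \frac{1}{(N-1)^{2}} \sum_{k \neq i} \sum_{l \neq k, i} p(Y_{ij} > Y_{kj}, Y_{ij} > Y_{lj}) . \end{aligned}$$

According to Lemma 3, the terms in the first sum are equal to $1/2$ while the terms in the second sum are equal to $1/3$. This gives us

$$\begin{aligned} \operatorname{E}[S_{ij}^{2}] &= \frac{1}{(N-1)^{2}} \big((N-1)/2 + (N-1)(N-2)/3\big) \\ &= \frac{1}{6(N-1)} (3 + 2(N-2)) = \frac{2N-1}{6(N-1)} . \end{aligned}$$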

This completes the proof. ∎
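Lemma 5 can also be confirmed directly. Without ties, the 0-based rank of a fixed row within a column is uniform on $\{0,\ldots,N-1\}$, so the second moment is a finite average. The sketch below, with the hypothetical name `second_moment`, evaluates that average exactly and compares it with the closed form.

```python
from fractions import Fraction


def second_moment(n):
    """E[S^2] when S = rank0 / (n - 1) and rank0 is uniform on {0, ..., n - 1}."""
    return sum(Fraction(r, n - 1) ** 2 for r in range(n)) / n


for n in range(2, 8):
    closed_form = Fraction(2 * n - 1, 6 * (n - 1))
    print(n, second_moment(n), closed_form)   # the two columns coincide
```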

Our next step is to compute the cross-moment of $S_{ij}$ .

Lemma 6

The cross-moment is equal to

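$$\operatorname{E}[S_{ij}S_{kj}] = \frac{(N-2)(3N-1)}{12(N-1)^{2}} .$$

Proof:

Decompose the moment into four sums

$$\begin{aligned} \operatorname{E}[S_{ij}S_{kj}] &= \frac{1}{(N-1)^{2}} \operatorname{E}\Big[\Big(\sum_{m \neq i} \mathit{I}[Y_{ij} > Y_{mj}]\Big)\Big(\sum_{n \neq k} \mathit{I}[Y_{kj} > Y_{nj}]\Big)\Big] \\ &= \frac{A + B + C + D}{(N-1)^{2}}, \end{aligned}$$

where

$$\begin{aligned} A &= \sum_{m \neq i, k} p(Y_{ij} > Y_{mj}, Y_{kj} > Y_{mj}), \\ B &= \sum_{m \neq i, k} \sum_{n \neq m, i, k} p(Y_{ij} > Y_{mj}, Y_{kj} > Y_{nj}), \\ C &= \sum_{n \neq i, k} p(Y_{ij} > Y_{kj}, Y_{kj} > Y_{nj}), \text{ and} \\ D &= \sum_{m \neq i, k} p(Y_{ij} > Y_{mj}, Y_{kj} > Y_{ij}) . \end{aligned}$$

The random variables in each term of the sum $B$ are all independent, hence each probability is equal to $1/4$. According to Lemma 3, each term in the sum $A$ is equal to $1/3$ and each term in the sums $C$ and $D$ is equal to $1/6$. This gives us

$$A = \frac{N-2}{3}, \quad B = \frac{(N-2)(N-3)}{4}, \quad C = D = \frac{N-2}{6} .$$

Grouping the terms gives us

$$\operatorname{E}[S_{ij}S_{kj}] = \frac{4(N-2) + 3(N-2)(N-3) + 4(N-2)}{12(N-1)^{2}} = \frac{(N-2)(3N-1)}{12(N-1)^{2}} .$$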

This completes the proof. ∎
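The cross-moment of Lemma 6 can be verified exhaustively for small $N$. Without ties, the ranks within one column form a uniformly random permutation, so the sketch below averages the product of the first two normalised ranks over all permutations. The function name `cross_moment` is hypothetical, and the enumeration is only practical for small $N$.

```python
from fractions import Fraction
from itertools import permutations


def cross_moment(n):
    """E[S_i S_k] for two distinct rows, averaged over all n! rank permutations."""
    total = Fraction(0)
    count = 0
    for ranks in permutations(range(n)):
        # Rows i and k are taken to be positions 0 and 1 of the permutation.
        total += Fraction(ranks[0] * ranks[1], (n - 1) ** 2)
        count += 1
    return total / count


for n in range(2, 7):
    closed_form = Fraction((n - 2) * (3 * n - 1), 12 * (n - 1) ** 2)
    print(n, cross_moment(n), closed_form)   # the two columns coincide
```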

We can now use both lemmas in order to compute the variance.

Proposition 7

The variance $\operatorname{Var}\mathopen{}\big{[}{\sqrt{N}U}\big{]}$ is equal to

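$$\sigma^{2} = \frac{(2N - 1)^{M}}{6^{M} (N - 1)^{M}} + \frac{(N - 2)^{M} (3N - 1)^{M}}{12^{M} (N - 1)^{2M - 1}} - \frac{N}{4^{M}} .$$

Proof:

We begin by splitting $\operatorname{E}\mathopen{}\big{[}{(\sqrt{N}U)^{2}}\big{]}$ into two sums and applying Lemma 5 and Lemma 6,

$$\begin{aligned} \operatorname{E}\big[(\sqrt{N}U)^{2}\big] &= \frac{1}{N} \Big(\sum_{i=1}^{N} \prod_{j \in X} \operatorname{E}[S_{ij}^{2}] + \sum_{i, k \atop i \neq k} \prod_{j \in X} \operatorname{E}[S_{ij}S_{kj}]\Big) \\ &= \frac{(2N-1)^{M}}{6^{M}(N-1)^{M}} + \frac{(N-2)^{M}(3N-1)^{M}}{12^{M}(N-1)^{2M-1}} . \end{aligned}$$

We can now use this to express the variance as

$$\begin{aligned} \sigma^{2} &= \operatorname{E}\big[(\sqrt{N}U)^{2}\big] - N \operatorname{E}[U]^{2} \\ &= \frac{(2N-1)^{M}}{6^{M}(N-1)^{M}} + \frac{(N-2)^{M}(3N-1)^{M}}{12^{M}(N-1)^{2M-1}} - \frac{N}{4^{M}} . \end{aligned}$$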

This proves the result. ∎
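As a quick consistency check of Proposition 7 (a worked special case added for illustration, not part of the original derivation), setting $M=1$ makes the variance vanish, in line with the earlier remark that the copula support of a single item is always $1/2$:

```latex
\begin{aligned}
\sigma^{2}\big|_{M=1}
&= \frac{2N-1}{6(N-1)} + \frac{(N-2)(3N-1)}{12(N-1)} - \frac{N}{4} \\
&= \frac{2(2N-1) + (N-2)(3N-1) - 3N(N-1)}{12(N-1)} \\
&= \frac{(4N-2) + (3N^{2}-7N+2) - (3N^{2}-3N)}{12(N-1)} = 0 .
\end{aligned}
```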

Finally, we show that ${\mathit{z_{\textsc{ind}}}\mathopen{}\left(X;\mathcal{Y}\right)}$ approaches a Gaussian distribution. Note that this result does not depend on the assumption that items are independent. Hence, we will be able to use the same result in the next section.

Proposition 8

The quantity $\sqrt{N}(U-\operatorname{E}\mathopen{}\left[U\right])$ approaches a Gaussian distribution as $N$ approaches infinity.

We postpone the proof of this proposition to the Appendix.

VI Productive Itemsets and Copula Support

In the previous section we tested the support against the independence assumption. A natural extension of this is to assume a partition of the given itemset such that items are independent only when they belong to different blocks of the partition. In fact, an approach suggested in [3] mines itemsets from binary data whose support is substantially larger than the expectation given by the partition. In order to mimic this for real-valued data, we define

$$z_{\textsc{prt}}(X, P; D) = \frac{\mathit{cp}(X; D) - \mu}{\sigma},$$

where $P$ is a partition of $X$ and where $\mu$ and $\sigma^{2}$ are the mean and the variance under the assumption that items belonging to different blocks in $P$ are independent. Our final goal is to find a partition that produces the lowest score, that is, a partition that explains the support the best, ${\mathit{z_{\textsc{prt}}}\mathopen{}\left(X;D\right)}=\min_{P}{\mathit{z_{\textsc{prt}}}\mathopen{}\left(X,P;D\right)}$, where $P$ goes over all partitions with at least $2$ blocks. Note that we are only interested in a one-sided test. However, we can easily adjust the formula for a symmetrical two-sided test. In addition, in [3] the authors were looking only at partitions of size $2$, whereas we go over all non-trivial partitions.
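The minimisation over partitions can be sketched as a plain enumeration. In the code below, `set_partitions`, `nontrivial_partitions`, and `min_partition_score` are hypothetical names, and `score` is an assumed callable standing in for the per-partition score, whose mean and variance are only derived in the following subsections. The number of partitions grows as the Bell numbers, so this is only feasible for small itemsets.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for idx in range(len(partition)):
            yield partition[:idx] + [[first] + partition[idx]] + partition[idx + 1:]
        # ... or open a new block containing only `first`.
        yield [[first]] + partition


def nontrivial_partitions(items):
    """All partitions with at least two blocks."""
    return [p for p in set_partitions(items) if len(p) >= 2]


def min_partition_score(itemset, score):
    """Minimum of score(P) over all non-trivial partitions P of itemset."""
    return min(score(p) for p in nontrivial_partitions(itemset))


for partition in nontrivial_partitions(["a", "b", "c"]):
    print(partition)
```

A three-item itemset has five partitions in total, four of which have at least two blocks.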

In this section we show how we can compute the mean and the variance needed to normalise the support. Unlike with the independence model, the test is no longer non-parametric and we will have to estimate several parameters for each subitemset in the partition. Moreover, we will provide the variance only in the limit as $N$ approaches infinity, as the interactions between variables are complex and hard to compute exactly for finite $N$.

We proceed as follows: We will first show what statistics we need from each subitemset and how to compute them. Then we will show how to use these statistics in order to compute the mean and the variance.

VI-A Statistics needed to compute the rank

Assume that we are given an itemset $X=x_{1}\cdots x_{M}$ . This itemset will eventually be a block in the partition. Let $\mathcal{Y}=Y_{1},\ldots,Y_{N}$ be $N$ data samples. Let us shorten $O_{ijx}=\frac{1}{N-1}\mathit{I}\mathopen{}\left[Y_{ix}>Y_{jx}\right]$ . Let us define

[TABLE]

which is essentially a product of normalised ranks of the $i$ th datapoint. Similar to Section V, let $U=\frac{1}{N}\sum_{i=1}^{N}T_{i}$ , a random variable corresponding to the copula support ${\mathit{cp}\mathopen{}\left(X\right)}$ .

Ultimately, we will need three statistics from $X$ , namely $\mu=\operatorname{E}\mathopen{}\left[U\right]$ , $\alpha=\operatorname{E}\mathopen{}\big{[}{T_{1}^{2}}\big{]}$ , and $\beta=\operatorname{Var}\mathopen{}\big{[}{\sqrt{N}U}\big{]}$ . We will discuss how to estimate these statistics in the next subsection. If $T_{i}$ were distributed independently, then $\beta=\alpha-\mu^{2}$ . However, $T_{i}$ are dependent. Fortunately, we know enough about the dependency so that we can compute $\beta$ .

In order to compute $\beta$ we need to introduce several random variables. Let

[TABLE]

be the rank of the $i$ th transaction for an itemset $X\setminus\left\{x\right\}$ . In addition, let us define $C_{kx}=\sum_{i=1}^{N}T_{ix}O_{ikx}$ . We can express the variance $\beta$ with $\alpha$ , $\mu$ and $C_{kx}$ . The benefit of this is that we can estimate these parameters, and by doing so estimate $\beta$ , as we will demonstrate in the next subsection.

Proposition 9

The variance $\beta$ approaches

[TABLE]

as $N$ approaches infinity.

We postpone the proof of this proposition to the Appendix.

VI-B Estimating statistics

Unlike with ${\mathit{z_{\textsc{ind}}}\mathopen{}\left(X\right)}$, the mean and the variance of ${\mathit{z_{\textsc{prt}}}\mathopen{}\left(X,P\right)}$ depend on the underlying distribution, and we are forced to estimate the statistics $\alpha$, $\beta$, and $\mu$ described in the previous subsection. These estimates are given in Algorithm 1. Estimating $\mu$ and $\alpha$ is trivial. However, estimating $\beta$ is more intricate due to the last term given in Proposition 9.

Assume that we are given a dataset $D$ and itemset $X$ . Fix $x\in X$ and assume that $D$ is sorted based on $x$ th column, largest first. Let $Z=X\setminus\left\{x\right\}$ . Note that ${\mathit{rnk}\mathopen{}\left(k;Z,D\right)}$ is an estimate for $T_{kx}$ . Hence, we can estimate $C_{kx}$ as

[TABLE]

We can use the right-hand side to compute $c_{kl}$ for every $k$ efficiently, and then use $c_{kl}$ to estimate $\beta$ . We can assume that we have precomputed the order w.r.t. each item $x_{l}$ before the actual mining. Hence, the cost of estimating the parameters is $O(N{\left|X\right|})$ .

We should stress that we use the same dataset to compute the estimates and to compute ${\mathit{z_{\textsc{prt}}}\mathopen{}\left(X;D\right)}$ . This means that ${\mathit{z_{\textsc{prt}}}\mathopen{}\left(X,P;D\right)}$ will be somewhat skewed and we cannot interpret ${\mathit{z_{\textsc{prt}}}\mathopen{}\left(X,P;D\right)}$ as a $p$ -value. However, our main goal is not to interpret the obtained values as a statistical test but to rank patterns.

VI-C Computing z-score

Now that we have computed statistics for each itemset occurring in a partition, we can combine them in order to compute the mean and the variance needed for ${\mathit{z_{\textsc{prt}}}\mathopen{}\left(X\right)}$ .

Proposition 10

Assume that we are given an itemset $X$ and a partition $P_{1},\ldots,P_{L}$ of $X$ . Let $\mathcal{Y}=Y_{1},\ldots,Y_{N}$ be $N$ random data points. Let $U={\mathit{cp}\mathopen{}\left(X;\mathcal{Y}\right)}$ , and let $U_{i}={\mathit{cp}\mathopen{}\left(P_{i};\mathcal{Y}\right)}$ . Let $\mu_{i}=\operatorname{E}\mathopen{}\left[U_{i}\right]$ , $\alpha_{i}=\operatorname{E}\mathopen{}\big{[}{{\mathit{rnk}\mathopen{}\left(1;P_{i},\mathcal{Y}\right)}^{2}}\big{]}$ , $\beta_{i}=\lim_{N\to\infty}\operatorname{Var}\mathopen{}\big{[}{\sqrt{N}U_{i}}\big{]}$ .

Under the assumption that $P_{i}$ are independent, we have $\operatorname{E}\mathopen{}\left[U\right]=\prod_{i=1}^{L}\mu_{i}$ and

[TABLE]

as $N$ approaches infinity.

We postpone the proof of this proposition to the Appendix.
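
The way the per-part statistics are combined can be sketched in Python. Only the product formula for the mean comes from Proposition 10. The helper `z_score` is an assumption: it standardises an observed support using a limiting variance of $\sqrt{N}U$ that the caller supplies, because the variance formula of the proposition is not reproduced here. All function and parameter names are hypothetical.

```python
import math


def partition_mean(mus):
    """E[U] under independent parts: the product of the part means."""
    return math.prod(mus)


def z_score(observed, mean, limit_var, n):
    """Standardise an observed support.

    `limit_var` is assumed to be the limiting variance of sqrt(n) * U,
    supplied by the caller; it is not derived here.
    """
    if limit_var <= 0:
        raise ValueError("variance must be positive")
    return math.sqrt(n) * (observed - mean) / math.sqrt(limit_var)
```

For example, two parts with means 0.5 and 0.4 give an expected support of 0.2 under the partition assumption, and an observed support above that value yields a positive score.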

VII Related Work

While pattern mining has been well researched for binary data, the problem of discovering patterns from real-valued data is open. The most straightforward approach to mining patterns is to discretize the data using thresholds, see for example [4]. Among methods that do not use thresholds, Calders et al. [5] proposed three quality measures for itemsets from numerical attributes. The first two measures were based on the extreme values of the items in an itemset. The measure most closely related to our work is the third one, $\mathit{supp}_{\tau}$ , which is a generalisation of Kendall’s $\tau$ , essentially the number of pairs in which all items are concordant. Interestingly enough, similar to the copula support, $\mathit{supp}_{\tau}$ also depends only on the order of the values, not on the actual values. In this work we were able to define two normalisations ${\mathit{z_{\textsc{ind}}}\mathopen{}\left(X\right)}$ and ${\mathit{z_{\textsc{prt}}}\mathopen{}\left(X\right)}$ for our approach, while the authors did not introduce any statistical normalisation for $\mathit{supp}_{\tau}$ . We conjecture that a similar normalisation can also be done for $\mathit{supp}_{\tau}$ .
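
To make the idea of counting concordant pairs concrete, here is a minimal Python sketch. It follows only the informal description above, not the exact definition in [5]: the function name `concordant_pairs`, the treatment of ties as non-concordant, and the lack of normalisation are all assumptions.

```python
from itertools import combinations


def concordant_pairs(rows, items):
    """Count transaction pairs that are ordered the same way on every item.

    `rows` is a list of numeric tuples and `items` a list of column
    indices. A pair is counted when one transaction is strictly larger
    than the other on all items (ties break concordance; an assumption).
    """
    count = 0
    for a, b in combinations(rows, 2):
        below = all(a[j] < b[j] for j in items)
        above = all(a[j] > b[j] for j in items)
        if below or above:
            count += 1
    return count
```

Because only comparisons between values are used, any strictly increasing transformation of a column leaves the count unchanged. The loop is quadratic in the number of transactions, so this is meant for illustration only.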

Jaroszewicz and Korzen [6] suggested discovering polynomial itemsets, essentially cross-moments from real-valued data. We can show that for a certain threshold distribution, our support is equal to the support of polynomial itemsets. Steinbach et al. [7] considered several support functions for itemsets, such as, taking the smallest value in a transaction among the items in the itemset.
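
The minimum-based support mentioned above can be illustrated with a few lines of Python. This is a sketch under assumptions: the source only says that the smallest value among the items is taken per transaction, so summing these minima over the transactions, and the name `min_support`, are choices made here for illustration.

```python
def min_support(rows, items):
    """Sum, over all transactions, of the smallest value among `items`.

    `rows` is a list of numeric tuples and `items` a non-empty list of
    column indices. Aggregating by a sum is an illustrative assumption.
    """
    if len(items) == 0:
        raise ValueError("itemset must be non-empty")
    return sum(min(row[j] for j in items) for row in rows)
```

On binary data this reduces to the usual support count, since the minimum is 1 exactly when every item is present.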

Ranking and filtering patterns based on a statistical test has been well studied. Brin et al. compared the likelihood ratio against the independence assumption [8]. Webb proposed, among many other criteria, to compare the observed support to the expected support of the best-fitting partition of size $2$ [3]. More complex null hypotheses such as Bayesian networks [9] or Maximum Entropy models [10] have also been suggested.

Our approach has similarities with mining itemsets from uncertain data [11], where instead of binary data, we have real values in $[0,1]$ expressing the likelihood of the entry being equal to 1. In fact, if we interpret the $r_{ij}$ values computed in Section IV as a probabilistic dataset, then ${\mathit{cp}\mathopen{}\left(X\right)}$ will be the same as the expected support computed from that probabilistic dataset. However, in the probabilistic setting the entries are assumed to be independent, whereas in our case they have an intricate dependency. Consequently, the variances given by Propositions 7 and 9 do not hold for probabilistic datasets. In addition, we cannot compute the frequentness measure suggested by Bernecker et al. [12] in our case; however, we can estimate it by a normal distribution as suggested by Calders et al. [13].
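
For readers unfamiliar with the uncertain-data setting, the expected support of an itemset under independent entries is the sum over transactions of the product of the item probabilities. The Python sketch below computes it; the name `expected_support` and the data layout are hypothetical, and whether the result should be divided by the number of transactions to match ${\mathit{cp}\mathopen{}\left(X\right)}$ is left open here.

```python
import math


def expected_support(prob_rows, items):
    """Expected support of `items` in a probabilistic dataset.

    `prob_rows[t][j]` is the probability that entry (t, j) equals 1.
    Entries are assumed independent, as in the uncertain-data setting.
    """
    return sum(math.prod(row[j] for j in items) for row in prob_rows)
```

The independence of entries is exactly what fails for the $r_{ij}$ values, so the mean carries over while the variance does not.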

Defining and computing a quality score for two real-valued variables, essentially an itemset of length 2, is a surprisingly open problem. An approach based on information theory was suggested in [14]. An interesting starting point is also a measure of concordance, see Definition 5.1.7 in [1]. These approaches are suitable only for itemsets of size 2, whereas we are interested in measuring the quality of itemsets of any size. Finally, Székely and Rizzo [15] suggested a measure based on how pair-wise distances correlate. This measure is symmetric while our measure was specifically designed to focus on large values.

VIII Experiments

In this section we present our experiments.

Datasets: We used 2 synthetic and 3 real-world datasets as our benchmark data. The first dataset, Ind, consists of $10\,000$ data points, each of $100$ items, generated independently and uniformly from the interval $[0,1]$ . The second dataset, Plant, has the same dimensions as the first dataset. In this dataset we planted 5 subspace clusters, each having $4$ items: We generated independently $5\times 10\,000$ boolean variables $B_{ti}$ indicating whether a transaction $t$ belongs to the $i$ th cluster; a transaction can belong to multiple clusters. We set $p(B_{ti}=1)=0.4$ . If $B_{ti}=1$ , then we set the corresponding items to $0.5$ . All other values were set to [math]. Finally, we added noise sampled uniformly from $[0,1]$ . As real-world benchmark datasets we used the following 3 gene expression datasets: Alon [16], Arabidopsis thaliana or Thalia, and Saccharomyces cerevisiae or Yeast (Thalia and Yeast are available at http://www.tik.ee.ethz.ch/~sop/bimax/). The sizes of the datasets are given in Table I.
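
The two synthetic datasets can be sketched in Python as follows. This is not the original generator, and several details are assumptions: the function names, the seeds, and the placement of the planted clusters in the first consecutive columns are hypothetical. The value assigned to entries outside a planted cluster did not survive in the text above, so it is a required argument `base_value` rather than a guessed default.

```python
import random


def make_ind(n_rows=10_000, n_items=100, seed=0):
    """Independent uniform entries (random.random draws from [0, 1))."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_items)] for _ in range(n_rows)]


def make_plant(base_value, n_rows=10_000, n_items=100, n_clusters=5,
               cluster_size=4, p=0.4, planted_value=0.5, seed=0):
    """Planted subspace clusters plus uniform noise.

    `base_value` is the value of entries outside any cluster; it must be
    supplied by the caller because it is not recoverable from the text.
    Clusters occupy the first n_clusters * cluster_size columns, which is
    an illustrative assumption.
    """
    if n_clusters * cluster_size > n_items:
        raise ValueError("clusters do not fit into the item set")
    rng = random.Random(seed)
    clusters = [range(c * cluster_size, (c + 1) * cluster_size)
                for c in range(n_clusters)]
    data = []
    for _ in range(n_rows):
        row = [base_value] * n_items
        for items in clusters:
            if rng.random() < p:          # B_ti = 1 with probability p
                for j in items:
                    row[j] = planted_value
        data.append([v + rng.random() for v in row])  # add uniform noise
    return data
```

Because cluster membership is drawn independently per cluster, a transaction can belong to several clusters at once, as described above.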

Setup: For each dataset we computed frequent itemsets using ${\mathit{cp}\mathopen{}\left(X;D,0.25\right)}$ as the support. We set the threshold such that we obtained a few hundred thousand itemsets, see Table I. We then ranked the itemsets using ${\mathit{z_{\textsc{ind}}}\mathopen{}\left(X\right)}$ and ${\mathit{z_{\textsc{prt}}}\mathopen{}\left(X\right)}$ . The results are given in Figure VIII.

Bibliography


[1] R. B. Nelsen, An introduction to Copulas. Springer, 2006.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules,” in Advances in Knowledge Discovery and Data Mining, 1996, pp. 307–328.
[3] G. I. Webb, “Self-sufficient itemsets: An approach to screening potentially interesting associations between items,” TKDD, vol. 4, no. 1, pp. 3:1–3:20, 2010.
[4] R. Srikant and R. Agrawal, “Mining quantitative association rules in large relational tables,” in SIGMOD, 1996, pp. 1–12.
[5] T. Calders, B. Goethals, and S. Jaroszewicz, “Mining rank-correlated sets of numerical attributes,” in KDD, 2006, pp. 96–105.
[6] S. Jaroszewicz and M. Korzen, “Approximating representations for large numerical databases,” in SDM, 2007.
[7] M. Steinbach, P.-N. Tan, H. Xiong, and V. Kumar, “Generalizing the notion of support,” in KDD, 2004, pp. 689–694.
[8] S. Brin, R. Motwani, and C. Silverstein, “Beyond market baskets: Generalizing association rules to correlations,” in SIGMOD, 1997, pp. 265–276.