Discovering Episodes with Compact Minimal Windows
Nikolaj Tatti

TL;DR
This paper introduces a new measure for identifying significant episodes in pattern mining, focusing on their compactness, and provides an efficient method to compute this measure using finite state machines, demonstrating practical effectiveness.
Contribution
It proposes a novel quality measure for episodes based on their compactness and develops a finite state machine technique to efficiently compute the necessary statistics.
Findings
The measure effectively ranks interpretable episodes high in text data.
The ranking process is fast, capable of handling tens of thousands of episodes in seconds.
The score can be interpreted as a P-value asymptotically.
Author affiliations: ADReM, University of Antwerp, Belgium; DTAI, KU Leuven, Belgium; HIIT, Aalto University, Finland
Abstract
Discovering the most interesting patterns is the key problem in the field of pattern mining. While ranking or selecting patterns is well-studied for itemsets it is surprisingly under-researched for other, more complex, pattern types.
In this paper we propose a new quality measure for episodes. An episode is essentially a set of events with possible restrictions on the order of events. We say that an episode is significant if its occurrence is abnormally compact, that is, only few gap events occur between the actual episode events, when compared to the expected length according to the independence model. We can apply this measure as a post-pruning step by first discovering frequent episodes and then rank them according to this measure.
In order to compute the score we will need to compute the mean and the variance according to the independence model. As a main technical contribution we introduce a technique that allows us to compute these values. Such a task is surprisingly complex and in order to solve it we develop intricate finite state machines that allow us to compute the needed statistics. We also show that asymptotically our score can be interpreted as a $p$-value. In our experiments we demonstrate that despite its intricacy our ranking is fast: we can rank tens of thousands of episodes in seconds. Our experiments with text data demonstrate that our measure ranks interpretable episodes high.
Keywords:
episode mining; statistical test; independence model; minimal window
Journal: Data Mining and Knowledge Discovery
1 Introduction
Discovering the most interesting patterns is the key problem in the field of pattern mining. While ranking or selecting patterns is well-studied for itemsets, a canonical and arguably the easiest pattern type, it is surprisingly under-researched for other, more complex, pattern types.
Discovering episodes, frequent patterns from an event sequence, has been a fruitful and active field in pattern mining since their original introduction by Mannila et al (1997). Essentially, an episode is a set of events that should occur close to each other (gaps are allowed), possibly with some constraints on the order of the occurrences; see Section 2 for the full definition. While the concept of support for itemsets is straightforward, it is simply the number of transactions containing the pattern, defining a support for episodes is more complex. The most common way of defining a support is to slide a window of fixed size over the sequence and count in how many windows the pattern occurs. Such a measure is monotonically decreasing and hence all frequent episodes can be found using the APriori approach given by Mannila et al (1997). Alternatively we can consider counting minimal windows, that is, finding and counting the most compact windows that contain the episode.
The common wisdom is that finding frequent patterns is not enough. Discovering frequent patterns with a high threshold will result in trivial patterns, omitting many interesting patterns, while using a low threshold will result in a pattern explosion. This phenomenon has led to many ranking methods for itemsets, the most well-studied pattern type. Unlike for itemsets, ranking episodes is heavily under-developed. Existing statistical approaches for ranking episodes are mostly based on the number of fixed-size windows (see more detailed discussion in Section 6). However, a natural way of measuring the goodness of an episode is the average length of its instances: a good episode should have compact minimal windows. Hence, our goal and contribution is a measure based directly on the average length of minimal windows.
The most straightforward and common way to measure significance for itemsets is to compare the observed support, the number of transactions in which all attributes co-occur, against the independence model: if the observed support deviates a lot from the expectation, we consider the itemset important. In this paper we use the same principle and propose an interestingness measure for an episode by comparing the observed lengths of minimal windows of the episode against the expectation computed from the independence model. Given a set of episodes we can now apply our measure to each episode and rank the episodes, placing episodes with the most abnormal minimal windows on top. While this is an easy task for itemsets, computing statistics turns out to be complex for episodes.
We define our score as follows: given an episode, we assign a weight to each of its minimal windows based on how long the window is. The weight will be large for small windows and small for large windows. To compute the expected weight we assume that for each symbol we have a probability of its occurrence in the sequence. We then compute the expected weight based on a model in which the symbols are independent of each other. We say that the episode is significant if the observed average weight is abnormally large, that is, the minimal windows are abnormally short.
Example 1
Assume that we have an alphabet of size , . Assume that the probabilities for having a symbol are , , and . Let be a serial episode . Then is a minimal window for if and only if it has a form . Hence the probability of a random sequence of length to be a minimal window for is equal to
[TABLE]
We are interested in a probability of a minimal window having length . To get this we divide the joint probability by the probability
[TABLE]
Using this normalisation we get that the probability of a minimal window having length is equal to
[TABLE]
for , and $0$ otherwise. If we now weight minimal windows with an exponential decay, say, , then the expected weight is equal to . On the other hand, assume that we have a sequence . There are minimal windows of length and one minimal window of length . Hence, the observed average weight is suggesting that the minimal windows are more compact than what the independence model implies.
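The kind of calculation carried out in Example 1 can be checked numerically. The sketch below assumes a two-event serial episode whose minimal windows consist of the first event, a run of gap symbols, and the second event, and it assumes a weight of the form `rho**k` for a window of length `k`. The names `p_gap` and `rho` and the numeric values are illustrative assumptions, not the values used in the example.

```python
def expected_weight_closed_form(p_gap: float, rho: float) -> float:
    """Expected weight rho**k of a minimal window of a two-event serial episode.

    Assumes P(length = k | minimal window) = (1 - p_gap) * p_gap**(k - 2), k >= 2,
    where p_gap is the probability of a symbol that is neither episode event.
    """
    return rho ** 2 * (1.0 - p_gap) / (1.0 - rho * p_gap)


def expected_weight_series(p_gap: float, rho: float, max_len: int = 200) -> float:
    """Same quantity, summed term by term up to max_len as a sanity check."""
    return sum(
        rho ** k * (1.0 - p_gap) * p_gap ** (k - 2)
        for k in range(2, max_len + 1)
    )


# Hypothetical values: gap probability 1/4, decay parameter 1/2.
closed = expected_weight_closed_form(0.25, 0.5)
series = expected_weight_series(0.25, 0.5)
```

The closed form follows from summing a geometric series; the truncated series is included only to confirm that the two agree.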
Computing the needed statistics turns out to be a surprisingly complex problem. We attack this problem in Section 4 by introducing a certain finite state machine having episodes as the nodes. Then using this structure we are able to compute the statistics recursively, starting from simple episodes and moving towards more complex ones.
Our recipe for the mining process is as follows: Given the sequence we first split the sequence in two. The first sequence is used for discovering candidate episodes, in our case episodes that have a large number of minimal windows. Luckily, this condition is monotonically decreasing and we can mine these episodes using a standard APriori method. We also compute the needed probabilities of individual events from the first sequence. Once we have discovered candidate episodes and have computed the expectation, we compare the expected weight against the average observed weight from the second sequence using a simple $z$-score. This step allows us to prune uninteresting episodes, which in our case are episodes that obey the independence model.
The rest of the paper is structured as follows. In Section 2 we introduce the preliminary definitions and notation. We introduce our method for evaluating the difference between the observed windows and the independence model in Section 3. In Sections 4–5 we lay out our approach for computing the independence model. We present the related work in Section 6. Our experiments are given in Section 7 and we conclude our work with discussion in Section 8. All proofs are given in the Appendix.
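The first step of this recipe, splitting the sequence and estimating the probabilities of individual events, can be sketched as follows. Splitting into two equal halves and the function name `split_and_estimate` are assumptions made for illustration; the text only states that the sequence is split in two.

```python
from collections import Counter


def split_and_estimate(sequence):
    """Split a sequence in two and estimate symbol probabilities from the first part.

    Returns (train, test, probs), where probs maps each symbol seen in the
    training part to its relative frequency there.
    """
    half = len(sequence) // 2          # assumption: split at the midpoint
    train, test = sequence[:half], sequence[half:]
    counts = Counter(train)
    probs = {symbol: count / len(train) for symbol, count in counts.items()}
    return train, test, probs


# Toy usage on a hypothetical sequence.
train, test, probs = split_and_estimate(list("aabcabcc"))
```

Candidate episodes and the symbol probabilities would come from `train`, while the observed weights would be measured on `test`.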
2 Preliminaries and Notation
We begin by presenting preliminary concepts and notations that will be used throughout the rest of the paper.
A sequence $s = s_1 \cdots s_L$ is a string of symbols coming from a finite alphabet $\Sigma$, that is, we have $s_i \in \Sigma$. Given a sequence $s$ and two indices $i$ and $j$, such that $i \le j$, we denote by $s[i, j] = s_i \cdots s_j$ a sub-sequence of $s$.
An episode $G$ is represented by an acyclic directed graph with labelled nodes, that is $G = (V, E, \mathit{lab})$, where $V$ is the set of nodes, $E$ is the set of directed edges, and $\mathit{lab}$ is the function $\mathit{lab} : V \to \Sigma$, mapping each node to its label.
Given a sequence $s$ and an episode $G$ we say that $s$ covers the episode if there is an injective map $f$ mapping each node $v$ to a valid index such that the node and the corresponding sequence element have the same label, $s_{f(v)} = \mathit{lab}(v)$, and that if there is an edge $(v, w) \in E$, then we must have $f(v) < f(w)$. In other words, the parents of the node $w$ must occur in $s$ before $w$. Traditional episode mining is based on searching episodes that are covered by sufficiently many sub-windows of certain fixed size.
Example 2
Consider an episode given in Figure 1. This episode has 4 nodes labelled as , , , and , and requires that must come first, followed by and in arbitrary order, and finally followed by . Figure 1 also shows an example of a sequence that covers the episode.
An elementary theorem says that in a directed acyclic graph there exists a sink, a node with no outgoing edges. We denote the set of sinks by . Given an episode and a node , we define to be the sub-episode obtained from by removing , and the incident edges.
Given an episode we define a set of prefix episodes by
[TABLE]
that is, a prefix episode is a subepisode of such that if is contained in , then all parents (in ) of are also contained in .
Example 3
Episode given in Figure 1 has 6 prefix episodes. Of these 6 episodes, one is empty; the remaining 5 episodes are given in Figure 2.
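The count in Example 3 can be reproduced with a small enumeration. The sketch below represents a prefix episode by its node set and assumes the episode shape described in Example 2 (one node first, two nodes in arbitrary order, one node last); the integer node identifiers and the function name `prefix_episodes` are hypothetical.

```python
from itertools import combinations


def prefix_episodes(nodes, edges):
    """Return all node subsets that contain the parents of each of their nodes."""
    parents = {v: set() for v in nodes}
    for u, v in edges:
        parents[v].add(u)

    result = []
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            # A prefix episode must contain all parents of every node it contains.
            if all(parents[v] <= chosen for v in chosen):
                result.append(frozenset(chosen))
    return result


# Hypothetical node ids: 1 comes first, 2 and 3 in any order, 4 last.
diamond_nodes = [1, 2, 3, 4]
diamond_edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
prefixes = prefix_episodes(diamond_nodes, diamond_edges)
```

This brute-force enumeration is exponential in the number of nodes and is meant only to illustrate the definition on small episodes.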
3 Minimal Windows of Episodes
Traditionally, discovering episodes from a single long sequence can be done in two ways. The first approach is to slide a window of fixed size over the sequence and count the number of windows in which the episode occurs. The second approach is to count the number of minimal windows. The goal of this paper is to build a measure based on minimal windows. If the statistic is abnormal, then we consider this pattern important.
In order to make the preceding discussion more formal, consider an episode and a sequence. We say that a window of the sequence is a minimal window for the episode if the episode is covered by the window but not by any proper sub-window of it. In this paper we are interested in discovering episodes that have abnormally compact minimal windows, a natural way of defining the significance of an episode.
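For a serial episode, the definition of a minimal window can be sketched directly. The code below is an illustrative quadratic-time scan restricted to serial episodes, not the method developed in this paper; the function names and the toy sequences are assumptions.

```python
def latest_start(sequence, episode, end):
    """Latest index i such that sequence[i..end] contains episode as a subsequence."""
    k = len(episode) - 1
    for i in range(end, -1, -1):
        if sequence[i] == episode[k]:
            k -= 1
            if k < 0:
                return i
    return None


def minimal_windows(sequence, episode):
    """Return (start, end) index pairs, inclusive, of all minimal windows."""
    windows = []
    previous = None
    for end in range(len(sequence)):
        start = latest_start(sequence, episode, end)
        # The window is minimal only if the latest start moved forward:
        # otherwise the same start already worked with a shorter end.
        if start is not None and (previous is None or start > previous):
            windows.append((start, end))
        previous = start
    return windows


# Toy usage with a hypothetical serial episode "a then b".
found = minimal_windows("abcab", "ab")
```

The check relies on the fact that the latest feasible start never decreases as the end index grows.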
Example 4
Consider a toy episode given in Figure 1. The sequence given in Figure 1 covers the episode but it is not a minimal window. However, if we remove the last 2 symbols from the sequence, then the sequence becomes a minimal window.
Example 5
Consider a serial episode , that is, a pattern stating that event should be followed by an event , and two sequences ’’ and ’’. If we fix the length of a window to be (or larger), then the number of windows covering the episode will be the same for both sequences. In fact, in this case all windows will contain the episode. However, occurrences of the episode in these sequences are different. In the first sequence, all minimal windows are of length , while in the second sequence, we have 2 minimal windows of length and minimal windows of length . Our intuition is that should be considered more significant in the first sequence than in the second.
Our goal in this paper is to design a measure that will indicate if the minimal windows are significantly compact. One approach would be to measure the average length of minimal windows. However, this ratio is susceptible to the variance in large minimal windows: consider that we have two minimal windows: the first is of length and the other is of length . Then the length of the second window dominates the average length, even though the first window is more interesting. In order to counter this phenomenon we suggest using the following statistic. Assume that we are given a parameter . Let be a minimal window for . We define the weight of a window to be . Compact windows will have a large value whereas large windows will have a small value. Let be the average weight of all minimal windows for .
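The contrast between average length and average weight can be illustrated with a short sketch. It assumes an exponentially decaying weight `rho**length`, in line with the exponential decay mentioned in Example 1; the exact weight function, the window representation as inclusive index pairs, and the toy windows are assumptions.

```python
def average_length(windows):
    """Average length of windows given as inclusive (start, end) pairs."""
    return sum(end - start + 1 for start, end in windows) / len(windows)


def average_weight(windows, rho):
    """Average of rho**length over the windows; 0.0 if there are none."""
    if not windows:
        return 0.0
    return sum(rho ** (end - start + 1) for start, end in windows) / len(windows)


# One compact window of length 2 and one long window of length 100.
toy_windows = [(0, 1), (0, 99)]
mean_len = average_length(toy_windows)          # dominated by the long window
mean_weight = average_weight(toy_windows, 0.5)  # dominated by the compact window
```

With the decaying weight, the long window contributes almost nothing, so the statistic reflects the compact window instead.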
We are interested in testing whether is significantly large. In order to do that, let be a random sequence and define a random variable if is a minimal window, if there is no such we define . Define also to be the indicator whether has a minimal window of starting at th index.
We suggest using the following statistic. Given a parameter , we define . Then is an estimate of a statistic .
We will show that there is and such that
[TABLE]
approaches a normal distribution . This suggests defining a measure . This is simply a $Z$-normalisation of the statistic .
We can also compute , where is the cumulative distribution function of the standard normal distribution , and interpret this quantity as a $p$-value. However, this interpretation is problematic mainly because the normal distribution estimate is only accurate asymptotically.
Hence, we use this measure merely for ranking. Nevertheless, this measure makes a lot of sense: it measures how much the observed value deviates from the expectation, a common approach in ranking patterns, and it also takes into account the uncertainty of the measure.
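A generic version of this normalisation can be sketched as follows. The exact form of the statistic is not recoverable from the text here, so the scaling by the square root of a sample count `n`, the function names, and the use of the upper tail of the standard normal distribution are all assumptions made for illustration.

```python
import math


def z_score(observed_mean, expected_mean, variance, n):
    """Generic z-normalisation: sqrt(n) * (observed - expected) / std."""
    return math.sqrt(n) * (observed_mean - expected_mean) / math.sqrt(variance)


def upper_tail_p_value(z):
    """1 - Phi(z) for the standard normal distribution, via the error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical numbers: observed average weight 0.3, expected 0.2, variance 0.04.
score = z_score(0.3, 0.2, 0.04, 100)
p_value = upper_tail_p_value(score)
```

Because the upper tail is monotonically decreasing in the score, ranking by the score and ranking by this tail probability give the same order.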
In order to achieve our goal, we need to perform two steps:
1. We need to show that the normalised statistic converges to a normal distribution.
2. We need to compute the mean and the variance that are needed for the measure.
Both of these steps are non-trivial. Proving asymptotic normality is difficult because , , and are not independent, hence we will have to show that the sequence is mixing fast enough. Computing and will require a set of recursive equations. The remaining theoretical sections are devoted to proving asymptotic normality and computing the mean and the variance.
4 Detecting Minimal Windows
In this and the next section we establish our main theoretical contribution, which is how to compute .
We divide our task as follows: In Section 4.1 we build a finite state machine recognising when an episode is covered. In Section 4.2 we modify this machine so that we can use it for subsequent statistical calculations. Using this machine as a base we construct in Section 4.3 a machine that is able to recognise a minimal window of .
4.1 Constructing a finite state machine
We begin by constructing a finite state machine that recognises the coverage of an episode.
In this paper, a finite state machine (or simply a machine) is a DAG with labelled edges and a single source. We allow multiple edges between two nodes.
Given a state in we say that covers if there is a subsequence such that can be reached from the source node using as an input.
Given an episode , we define a machine to be a DAG containing prefix graphs as nodes . We add an edge if and only if there is a sink node such that . We label edge with the label of , .
Example 6
Consider an episode given in Figure 3(a). The corresponding machine is given in Figure 3(b). Sink state corresponds to episode and source state corresponds to the empty episode. Intermediate state corresponds to given in Figure 2, corresponds to , corresponds to , and corresponds to .
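The construction of the machine from prefix episodes can be sketched in a few lines. The sketch represents each state by the node set of a prefix episode and adds a labelled edge whenever a node can be appended as a new sink; the node identifiers, the labels, and the function name `build_machine` are hypothetical.

```python
from collections import deque


def build_machine(nodes, edges, label):
    """Build the prefix-episode machine of an episode DAG.

    States are frozensets of nodes closed under taking parents; an edge
    (H - {v}, H) labelled label[v] is added whenever v is a sink of H.
    """
    parents = {v: set() for v in nodes}
    for u, v in edges:
        parents[v].add(u)

    source = frozenset()
    states = {source}
    machine_edges = []
    queue = deque([source])
    while queue:
        state = queue.popleft()
        for v in nodes:
            # v can be appended as a new sink once all its parents are present.
            if v not in state and parents[v] <= state:
                bigger = state | {v}
                machine_edges.append((state, bigger, label[v]))
                if bigger not in states:
                    states.add(bigger)
                    queue.append(bigger)
    return states, machine_edges


# Hypothetical episode: node 1 first, nodes 2 and 3 in any order, node 4 last.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
label = {1: "a", 2: "b", 3: "c", 4: "d"}
states, machine_edges = build_machine(nodes, edges, label)
```

For this shape the machine has a source, a sink, and four intermediate states, matching the structure described in Example 6.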
Comparing the definition of coverage of a state in and the definition of a coverage for episodes gives immediately the following proposition.
Proposition 1
Given an episode , a sequence covers an episode if and only if covers the corresponding state in .
4.2 Making Simple Machines
In order to be able to compute the needed probabilities in subsequent sections, a machine needs to have a crucial property. We say that a machine is simple if no state in it has multiple incoming edges with the same label. If we reverse the direction of edges, then simplicity is equivalent to a finite state machine being deterministic.
In general, the machine constructed above is not simple. If an episode contains two nodes with the same label such that neither is an ancestor of the other, then the state corresponding to a prefix episode having both nodes as sinks will have (at least) two incoming edges with the same label (see Figures 4(a)–4(b)).
Luckily, we can transform it into a simple machine. This transformation is almost equivalent to the process of making a non-deterministic finite state machine deterministic.
In order to make this formal, let us first give some definitions. Assume that we are given a machine . Given a state in , we define
[TABLE]
to be the set of labels of all incoming edges. If is a subset of states in , then we write .
Let be a subset of states in and let be a label. We define
[TABLE]
to be the union set of parents of each connected with an edge having the label . We also define
[TABLE]
to be the set of states that have no incoming edge with a label .
Let be the (unique) source state in . We define
[TABLE]
Finally, we define a closure of inductively to be the collection of sets of states
[TABLE]
We are now ready to define a simple machine . The states of this machine are
[TABLE]
An edge with a label is in if and only if and . Since, for each , there is only one such that , it follows that is simple.
Example 7
A machine given in Figure 4(b) is not simple since the state has two incoming edges with , each edge corresponding to either one of . In order to obtain , we first observe that the nodes are
[TABLE]
This final machine is given in Figure 4(c). Note that is simple since parents of are grouped together.
The following proposition reveals the expected result between and .
Proposition 2
Let be a machine. Let be a state in . Then a sequence covers if and only if covers at least one .
The coverage of a machine is based on subsequences and working with subsequences is particularly difficult since there may be several subsequences that cover episode , which leads to difficulties when computing probabilities.
Instead of working with subsequences directly, we will define a greedy function. Assume that we are given a simple machine . Let be a state and let be a sequence. We define a greedy function recursively
[TABLE]
In other words, the greedy function descends to parent states as fast as possible.
Example 8
Consider a machine given in Figure 3(b) and sequence given in Figure 1. We have
[TABLE]
The example suggests that a sequence covers an episode if the greedy function reaches the source state in the corresponding machine. This holds in general: the following proposition shows that we can use the greedy function to test for coverage. Note that this crucial property is specific to machines induced from episodes. It will not hold for a general machine.
Proposition 3
Let be an episode, then a sequence covers , a state in , if and only if , the source state of .
Corollary 1
Let be an episode and let be the sink state of . A sequence covers if and only if , the source state of .
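The coverage test of Corollary 1 can be sketched with a small greedy function. The formal definition is not recoverable from the text here, so this sketch assumes that the greedy function scans the sequence from its last symbol backwards and moves to the parent state whenever an incoming edge carries the current symbol; the dictionary representation and the toy machine are hypothetical.

```python
def greedy(incoming, state, sequence):
    """Follow matching incoming edges while scanning the sequence backwards.

    incoming[state] maps an edge label to the unique parent state reached
    through that label; uniqueness is what simplicity guarantees.
    """
    for symbol in reversed(sequence):
        state = incoming[state].get(symbol, state)
    return state


# Hypothetical machine of a serial episode "a then b":
# state 0 is the source (empty episode), state 2 is the sink (full episode).
incoming = {0: {}, 1: {"a": 0}, 2: {"b": 1}}

covers_acb = greedy(incoming, 2, "acb") == 0  # a, gap, b: covered
covers_ba = greedy(incoming, 2, "ba") == 0    # wrong order: not covered
```

Starting from the sink and reaching the source corresponds to the sequence covering the whole episode.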
4.3 Machine recognising minimal windows
So far we have constructed and that recognise when a sequence covers . However, we are interested in finding out when a sequence is a minimal window for .
Assume that we are given an episode and let . Let be the source state of and let be the sink state of . We define two machines:
1. is obtained from by adding a new source state, say , and adding an edge for each possible label.
2. is obtained from by adding a new sink state, say , and adding an edge for each possible label.
Both and are simple.
Let us first consider . Assume that we are given a sequence such that . Then we know immediately that covers but does not. Now let us consider . Sequence covers if and only if . Consequently, we need to design a machine that simultaneously computes for and for .
In order to do so we need to define a special machine. Assume that we are given two simple machines and , and a set of pairs of states , where is a state in and is a state in . We will now define a join machine, , that is guaranteed to contain the states from . To define the states of this machine, let be a state in and let be a state in . We first define a set of pairs of states recursively
[TABLE]
We define the states of to be . Two states and are connected with an edge if and only if and . It follows immediately that is simple.
Proposition 4
Let and be two simple machines. Let be a set of pairs of states. Define . Let be a state in . Then .
We can now define a machine that we will use to test whether sequence is a minimal window of . Let , , and as defined above. Let . The following proposition demonstrates how we can use to characterise the minimal window.
Proposition 5
Let , , , and be as defined above. Let be a sink state of . Then, a sequence is a minimal window for if and only if , where .
For the purpose of recognising minimal windows, there are a lot of redundant states in . Any state that is not a child or part of can be removed and the outgoing edges reattached to the source state without affecting the validity of Proposition 5. This is true because once the greedy function reaches any such state then it will never reach . To optimise we remove two types of non-source states: any state of the form , where is the source state of , and any state of the form . We refer to the resulting machine as .
Example 9
Consider an episode given in Figure 5(a). The machine is given in Figure 5(b) and the augmented versions and are given in Figures 5(c)–5(d). These machines are then combined to , given in Figure 5(e).
The final, simplified, machine is given in Figure 5(f). In order for a sequence to be a minimal window for , the greedy function must land either in , , or in . Note that many states from are removed. For example, if we are in and we see any other symbol than , then we know that is not a minimal window since must end with in order to be one.
5 Computing Moments
Now that we have defined a machine for recognising a minimal window, we will use it to compute the needed probabilities. In Section 5.1 we demonstrate how to use the machine to compute the expected weight. In Section 5.2 we show the asymptotic normality and in Section 5.3 we demonstrate how to compute the variance. We finish the section by considering computational complexity.
5.1 Computing probabilities
Proposition 5 gives us means to express the minimal window using a machine and the greedy function. In this section we demonstrate how to compute probabilities that the greedy function lands in some particular state.
Let be a simple machine. Let be a set of states in and let be a state in . Let us first define
[TABLE]
to be the probability that a random sequence of length reaches one of the states in .
Proposition 6
Let be a simple machine. Let be a set of states in and let be a state in .
Then it holds that for ,
[TABLE]
For , we have
[TABLE]
Example 10
Consider a machine given in Figure 5(f). Assume that the individual probabilities are , , and . Then according to Proposition 1, and
[TABLE]
for , which implies that . We can verify this by observing that the sequence of events that leads from to must have events labelled as followed by one .
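A recursion of the kind stated in Proposition 6 can be sketched and checked against brute-force enumeration. The equations themselves are not recoverable from the text here, so the sketch assumes a greedy function that scans the sequence backwards and derives the recursion from that assumption: the last symbol either matches an incoming edge and moves to the parent, or leaves the state unchanged. The machine, the probabilities, and all names are hypothetical.

```python
from itertools import product


def greedy(incoming, state, sequence):
    """Scan the sequence backwards, moving to a parent on a matching label."""
    for symbol in reversed(sequence):
        state = incoming[state].get(symbol, state)
    return state


def landing_probability(incoming, probs, start, targets, n):
    """P(greedy(start, S) in targets) for a random independent S of length n."""
    table = {x: 1.0 if x in targets else 0.0 for x in incoming}  # length 0
    for _ in range(n):
        updated = {}
        for x, edges_in in incoming.items():
            stay = 1.0 - sum(probs[lab] for lab in edges_in)
            updated[x] = stay * table[x] + sum(
                probs[lab] * table[parent] for lab, parent in edges_in.items()
            )
        table = updated
    return table[start]


def brute_force(incoming, probs, start, targets, n):
    """Same probability by enumerating every sequence of length n."""
    total = 0.0
    for seq in product(probs, repeat=n):
        if greedy(incoming, start, seq) in targets:
            weight = 1.0
            for symbol in seq:
                weight *= probs[symbol]
            total += weight
    return total


# Hypothetical serial episode "a then b" over the alphabet {a, b, c}.
incoming = {0: {}, 1: {"a": 0}, 2: {"b": 1}}
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
dp_value = landing_probability(incoming, probs, 2, {0}, 4)
bf_value = brute_force(incoming, probs, 2, {0}, 4)
```

The dynamic program runs in time linear in the sequence length and the number of edges, while the enumeration is exponential and serves only as a check.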
To solve the needed quantities, we need to compute moments,
[TABLE]
Proposition 5 now immediately implies that we can express the needed statistics using moments.
Proposition 7
Assume an episode . Let and let and be as in Proposition 5. Let , and be defined as in Section 3. Then
[TABLE]
Note that the sum has an infinite number of terms, hence we cannot compute this by raw application of Proposition 6. Luckily, we can express moments in closed recursive form. First, we need to show that the moments we consider are finite.
Lemma 1
Let be a simple machine. Let be a set of states in . Assume that for all . Assume that we are given a function such that grows at polynomial rate. If the source node is not contained in , then is finite for any state .
Proposition 8
Let be a simple machine. Assume that we have a function mapping an integer to a real number. Assume also for , we have for some and a function . Assume that and grow at polynomial rate, at maximum. Let and set . Let . Then
[TABLE]
We can now use Proposition 8 to compute the moments given in Proposition 7.
Proposition 9
The identity holds for the following functions,
[TABLE]
Example 11
Consider machine given in Figure 5(e). Let us define . Assume also that the probabilities for the symbols are , , and . Let .
Then using Proposition 8 we see that
[TABLE]
and the moment for the remaining states is equal to $0$.
Proposition 8 gives us means for a straightforward algorithm Moments for computing moments (given in Algorithm 1). Moments takes as input a simple machine , a map for initial values, a map for update values, and a constant . Note that Moments is a linear function of and , that is,
[TABLE]
for any constants and . We will use this property later for speed-ups.
5.2 Asymptotic Normality
We will now prove that our statistic approaches the normal distribution. The proof is not trivial since the variables and are not independent. Hence we will use the Central Limit Theorem for strongly mixing sequences.
Our first step is to show that the central limit theorem holds for the sequence .
Proposition 10
Let be an episode. Sequence converges in distribution to , where , , and is a covariance matrix, , , , where
[TABLE]
Since the central limit theorem holds for , we can apply this to obtain the main result.
Proposition 11
Let be an episode. Let , and be as in Proposition 10. Define . Then
[TABLE]
converges to as , where .
These results suggest that we can use as a $p$-value, where is the cumulative distribution function of the normal distribution. However, in practice we have several problems:
- The result is accurate only asymptotically. Moreover, the distribution of can be heavily skewed, so we need a large number of samples in order for the estimate to become accurate.
- We do not have the probabilities of individual items directly; instead we estimate them from the training sequence. This will introduce some error in prediction, making the $p$-values smaller than they should be.
- We are computing a large number of statistical tests. In such a case, it is advisable to use some technique, for example, Bonferroni correction, to compensate for the multiple hypotheses problem. However, it is not obvious which technique we should use.
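For reference, the Bonferroni correction mentioned in the last item can be sketched as follows. This is the textbook procedure, shown only to make the multiple hypotheses issue concrete; it is not a step of the method in this paper, and the function name and the significance level are assumptions.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each test as significant if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Three hypothetical tests: only the first survives the corrected threshold.
flags = bonferroni([0.001, 0.02, 0.5])
```

With tens of thousands of candidate episodes the corrected threshold becomes very small, which is one reason a ranking can be more practical than a hard significance cut-off.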
Because of these problems, instead of interpreting the result as a $p$-value, we simply use the score to rank patterns and use it as a top-$k$ method. Note that the cumulative distribution function is monotonic, hence the larger the score, the smaller the $p$-value.
By studying the formulas in the above propositions we see that we can compute the necessary statistics and using Proposition 7, and consequently we can compute . However, in order to compute the variance we need to compute , , , and given in Proposition 10. We will demonstrate a technique for computing these statistics in the next section.
5.3 Computing Cross-moments
Our final step is to compute cross-moments given in Proposition 10. In order to do so we first need to prove a different formulation of these statistics. This formulation is more fruitful as we no longer have to deal with infinite sums.
Proposition 12
Let , , , , , and be as in Proposition 10. Define and . Then
[TABLE]
Our next step is to compute the moments. To that end, let be a machine recognising the minimal window of , let be a sink state in , and let be the states as in Proposition 5. We will study the probability , where and . Let and . The idea is to break the probability into a sum of probabilities based on the state and . These probabilities can be further decomposed into three factors which we can then turn into moments using Proposition 8.
Define a random variable . This variable is true if and only if . In addition, define and .
Let us write to be all proper intermediate states of between and . Since , implies that . Similarly, implies that . We can now write as
[TABLE]
The only non-trivial factor in Equation 2 that we cannot solve using is . To solve this we construct yet another machine. Let and let . Then Proposition 4 implies that
[TABLE]
This leads to
[TABLE]
Let us write . We now define a function by which we can express the missing cross-moments,
[TABLE]
This function is particularly useful since we can now apply Equation 3 and obtain a closed form using moments,
[TABLE]
Let us now express the cross-moments using . We see immediately that,
[TABLE]
As a final step we describe how we can optimise computation of . First recall that Moments, given in Algorithm 1, is linear with respect to its parameters and . Consider Equation 4. Instead of computing the sum over explicitly, we can compute , where is defined as . We can repeat this trick again to remove the explicit sum over . The pseudo-code taking into account these optimisations is given in Algorithm 2.
Example 12
Let us compute for an episode given in Figure 5(a). Let , given in Figure 6. Note that this machine is the same machine given in Figure 5(f). Let us define . Assume also that the probabilities for the symbols are , , and and assume that we selected . Define .
To compute we need to compute moments from three different machines. The moments obtained from a previous machine are fed as initial values to the next machine as shown in Figure 6. We use for the first and the third machine. The second machine is with redundant states removed. This machine is given in Figure 6.
We start with , and as initial values we set whenever a state is in , and $0$ otherwise. This is equivalent to Example 9. We need moments only for two states, and , which are
[TABLE]
We now use the moments of and as initial values for and , that is, we set and , and 0 for other states. We can now compute the moments,
[TABLE]
and $0$ for the remaining states. We feed these moments into initial values and compute the final moments,
[TABLE]
Consequently, .
5.4 Computational complexity
Let us now finish this section by discussing the computational complexity. Given a machine , evaluating moments will take time. Hence, we need to study the sizes of our machines. Given an episode with nodes, the first machine may have states. This happens if is a parallel episode. In practice, as we will see in the experiments, this is not a problem since is typically small.
Exponentiality is (most likely) unavoidable since testing whether a sequence covers an episode is known to be an NP-hard problem (Tatti and Cule, 2011), and since we can use to test coverage in polynomial time w.r.t. the states in we must have episodes for which we have an exponential number of states.
Simplifying may also lead to an exponential number of nodes. This may happen if we have a lot of unrelated nodes with the same labels. Typically, this will not happen, especially if the sequence has a large alphabet. Moreover, we can avoid this problem by mining only strict episodes (Tatti and Cule, 2012) in which we require that if there are two nodes with the same label, then one of the nodes must be an ancestor of the other. For such episodes, is already simple.
Computing a joint machine may result in a machine having states. In practice, the number of states in is much smaller since not all pairs are considered. Similarly, a machine needed for computing cross-moments may have nodes. We will see that in our experiments the number of states and edges remains small, making the method fast in practice.
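The growth for parallel episodes can be illustrated with a small sketch. It assumes that each state of the first machine corresponds to a prefix-closed subset of episode nodes (a subset that contains all ancestors of its members); this correspondence is an assumption of the sketch, not something stated in this section. The function name `count_prefix_closed_subsets` and the edge-list encoding are hypothetical.

```python
from itertools import combinations


def count_prefix_closed_subsets(n_nodes, edges):
    """Count node subsets that contain every parent of each member.

    Brute force over all subsets, so only suitable for small episodes.
    An edge (u, v) means that node u must occur before node v.
    """
    parents = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        parents[v].add(u)
    count = 0
    for size in range(n_nodes + 1):
        for subset in combinations(range(n_nodes), size):
            chosen = set(subset)
            # Closed under direct parents implies closed under all ancestors.
            if all(parents[v] <= chosen for v in chosen):
                count += 1
    return count


def parallel_episode_edges(n_nodes):
    """A parallel episode imposes no order constraints."""
    return []


def serial_episode_edges(n_nodes):
    """A serial episode is a chain 0 -> 1 -> ... -> n_nodes - 1."""
    return [(i, i + 1) for i in range(n_nodes - 1)]


for n in range(1, 6):
    par = count_prefix_closed_subsets(n, parallel_episode_edges(n))
    ser = count_prefix_closed_subsets(n, serial_episode_edges(n))
    print(n, par, ser)
```

Under this assumption the count doubles with every added node of a parallel episode, while it grows only linearly for a serial episode, which matches the observation that serial episodes yield the simplest machines.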
6 Related Work
Our approach can be seen as an extension of (Tatti, 2009) where we developed a statistical test based on the average length of minimal windows. We used a recursive update similar to the one given in Proposition 6; however, we capped the length of minimal windows and computed explicitly the probabilities of an episode having a minimal window of a certain length. In this work we avoid this by using Proposition 7. An additional limitation of (Tatti, 2009) is that we were forced to simulate cross-moments, whereas in this work we compute them analytically.
Statistical measures for ranking episodes have been considered by Gwadera et al (2005b, a), in which the authors considered an episode to be significant if it occurs too often or not often enough in windows of fixed size. As a background model the authors used the independence model in (Gwadera et al, 2005b) and a Markov-chain model in (Gwadera et al, 2005a). The authors’ approach in (Gwadera et al, 2005b) is similar to ours: first they construct a finite state machine, essentially , and use a recursive update similar to Proposition 6 in order to compute the mean, that is, the likelihood that the sequence of length covers the episode under the independence assumption. The main difference between our approach and theirs is that we base our measure directly on compactness, the average length of a minimal window, while they base their measure on occurrence, that is, in how many windows the episode occurs.
Working with general episodes is difficult for two main reasons. Firstly, general episodes are more prone to suffer from pattern explosion because there are so many directed acyclic graphs. Secondly, even the simplest task, such as testing whether a sequence contains an episode, is an NP-hard problem (Tatti and Cule, 2011). Several subclasses of general episodes have been suggested. Pei et al (2006) suggested mining episodes from a set of strings, sequences of unique symbols. Tatti and Cule (2012) suggested discovering closed strict episodes. An episode is strict if two nodes with the same label are always connected. Achar et al (2012) suggested discovering episodes with unique labels, possibly with some additional constraints, for example, on the number of paths in a DAG. The authors suggested a score based on how evenly unconnected nodes occur in front of each other. Tatti and Cule (2011) considered a broader class of episodes in which nodes are allowed to have multiple labels.
Casas-Garriga (2003) proposed a criterion for episodes by requiring that the consecutive symbols in a sequence should occur only within a specified bound. While this approach attacks the problem of fixed windows, it is still a frequency-based measure. This measure, however, is not monotonic, as pointed out by Méger and Rigotti (2004). It would be useful to see whether we can compute an expected value of this measure so that we can compute a $p$-value based on some background model.
In a related work, Cule et al (2009) considered parallel episodes significant if the smallest window containing each occurrence of a symbol of an episode had a small value. Their approach differs from ours since the smallest window containing a fixed occurrence of a symbol is not necessarily the minimal window. Also, they consider only parallel episodes whereas we consider more general DAG episodes. An interesting approach has also been taken by Calders et al (2007), where the authors define a windowless frequency measure of an itemset within a stream to be the frequency starting from a certain point. This point is selected so that the frequency is maximal. However, this method is defined for itemsets and it would be fruitful to see whether this idea can be extended to episodes.
Finite state machines have been used by Tronícek (2001); Hirao et al (2001) for discovering episodes. However, their goal differs from ours since the actual machine is built upon a sequence and not the episode set, and it is used for discovering episodes and not for computing the coverage.
7 Experiments
In this section we present our experiments with the quality measure using synthetic and real-world text sequences.
7.1 Datasets
We conducted our experiments with several synthetic and real-world sequences.
The first synthetic sequence, Ind, consists of events drawn independently and uniformly from an alphabet of symbols. The second synthetic sequence, Plant, also contains events drawn independently and uniformly from an alphabet of symbols, but in addition we planted 5 serial episodes. Each episode consisted of 5 nodes, each node with a unique label. We planted each episode times and we added a gap between two consecutive events with a probability.
Our third dataset, Moby, is the novel Moby Dick by Herman Melville (obtained from http://www.gutenberg.org/etext/15). Our fourth sequence, Nsf, consists of the first 739 NSF award abstracts from 1990 (obtained from http://kdd.ics.uci.edu/databases/nsfabs/nsfawards.html). Our fifth dataset, Address, consists of inaugural addresses of the presidents of the United States (obtained from http://www.bartleby.com/124/pres68). To avoid the historic concept drift, in which early speeches have a different vocabulary than the later ones, we entwined the speeches by first taking the odd ones and then the even ones. Our final dataset, Jmlr, consists of abstracts from the Journal of Machine Learning Research (obtained from http://jmlr.csail.mit.edu/). The sequences were processed using the Porter Stemmer and the stop words were removed. The basic characteristics of the sequences are summarised in Table 1.
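The construction of the two synthetic sequences can be sketched as follows. This is a minimal illustration, not the generator used for the experiments: the function names are hypothetical, planted occurrences are inserted at random positions (the text does not say whether they were inserted or overwrote existing events), and all concrete sizes and probabilities in the usage lines are placeholders chosen for the example.

```python
import random


def independent_sequence(length, alphabet_size, rng):
    """Events drawn independently and uniformly from {0, ..., alphabet_size - 1}."""
    return [rng.randrange(alphabet_size) for _ in range(length)]


def plant_serial_episode(sequence, episode, gap_prob, alphabet_size, rng):
    """Insert one occurrence of a serial episode into `sequence` in place.

    With probability `gap_prob`, a single random gap event is placed
    between two consecutive episode events (assumption: one gap at most).
    """
    occurrence = []
    for i, symbol in enumerate(episode):
        occurrence.append(symbol)
        is_last = i == len(episode) - 1
        if not is_last and rng.random() < gap_prob:
            occurrence.append(rng.randrange(alphabet_size))
    position = rng.randrange(len(sequence) + 1)
    sequence[position:position] = occurrence


# Placeholder parameters, chosen only for illustration.
rng = random.Random(0)
background = independent_sequence(1000, 50, rng)
episode = [50, 51, 52, 53, 54]  # 5 nodes with unique labels outside the background alphabet
for _ in range(10):
    plant_serial_episode(background, episode, 0.1, 50, rng)
print(len(background))
```

Because later insertions may land inside an earlier planted occurrence, a more careful generator would track planted positions; the sketch ignores this.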
7.2 Experimental Setup
Our experimental setup mimics the framework set up by Webb (2007), in which the data is divided into two parts: the first part is used for discovering the patterns and the second part for testing whether the discovered patterns are significant. We divided each sequence into two parts of equal length. We used the first sequence for discovering the candidate episodes and training the independence model. Then we tested the discovered episodes against the model using the second sequence. We set parameter to .
To generate candidate episodes we used a miner given by Tatti and Cule (2012). This miner discovers episodes in a breadth-first fashion, that is, an episode is tested if and only if all its sub-episodes are frequent. The miner outputs closed and strict episodes (an episode is closed if there is no superepisode with the same support). Requiring episodes to be closed reduces redundancy between candidates considerably, as there are typically many episodes describing the same set of minimal windows. The alphabet is large in our sequences, which implies that it is quite unlikely to see the same symbol twice within a short window. Consequently, there are only a few non-strict frequent episodes.
As a constraint we required that the number of non-overlapping minimal windows must exceed a certain threshold in the first sequence. This is a monotonic condition that allows us to discover all candidates efficiently. During mining we also put an upper limit on the length of minimal windows. The parameters and the numbers of candidates are given in Table 2.
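The split and the training of the independence model can be sketched as below. The sketch assumes that the model is parameterised by the relative frequency of each symbol in the first half; the function names are hypothetical.

```python
from collections import Counter


def split_in_half(sequence):
    """Return the first half (training) and the second half (testing)."""
    mid = len(sequence) // 2
    return sequence[:mid], sequence[mid:]


def estimate_symbol_probabilities(training):
    """Relative frequency of each symbol in the training part.

    Assumption: the independence model uses these frequencies as the
    probabilities of individual events.
    """
    counts = Counter(training)
    total = len(training)
    return {symbol: count / total for symbol, count in counts.items()}


train, test = split_in_half(list("abacabad"))
probabilities = estimate_symbol_probabilities(train)
print(train, test, probabilities)
```

Note that a symbol occurring only in the testing half receives no probability under this estimate, so a practical implementation would need to decide how to treat such symbols.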
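To make the support constraint concrete, the following sketch finds the minimal windows of a serial episode and counts non-overlapping ones under a length limit. It covers serial episodes only, whereas the miner handles general episodes, and it assumes that two windows overlap when they share at least one position. The function names are hypothetical.

```python
def minimal_windows_serial(sequence, episode):
    """Return minimal windows (start, end), inclusive, of a serial episode.

    latest[k] holds the latest start index of an occurrence of the first
    k episode symbols seen so far, or None if there is none yet.
    """
    m = len(episode)
    latest = [None] * (m + 1)
    windows = []
    for j, symbol in enumerate(sequence):
        # Descending k so that latest[k - 1] still refers to positions before j.
        for k in range(m, 0, -1):
            if symbol != episode[k - 1]:
                continue
            start = j if k == 1 else latest[k - 1]
            if start is None:
                continue
            # A full occurrence ending at j is minimal only if it starts
            # strictly later than every full occurrence that ended earlier.
            if k == m and (latest[m] is None or start > latest[m]):
                windows.append((start, j))
            latest[k] = start
    return windows


def count_non_overlapping(windows, max_length):
    """Greedily count disjoint windows no longer than max_length.

    Minimal windows never nest, so they arrive ordered by both start and
    end, and picking the earliest-ending window first is optimal.
    """
    count = 0
    last_end = -1
    for start, end in windows:
        if end - start + 1 > max_length:
            continue
        if start > last_end:
            count += 1
            last_end = end
    return count


windows = minimal_windows_serial(list("axbabb"), list("ab"))
print(windows, count_non_overlapping(windows, 3))
```

An episode would then be kept as a candidate when this count exceeds the chosen threshold in the first half of the sequence.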
7.3 Computational complexity
Let us first study computational complexity in practice. As we pointed out earlier it is possible that sizes of structures needed to compute the score become exponentially large. To demonstrate the sizes in practice we computed the average number of states and edges in machines used to compute the score. The results are given in Table 3.
From these results we see that the numbers of nodes and edges stay small. This is because the majority of episodes are small, typically with 2–3 nodes. Simplification does not add any new nodes or edges since we use strict episodes, where nodes with the same label must be connected; consequently, is simple. The numbers of nodes and edges are highest for , a machine needed to compute cross-moments for the Nsf data. This is because Nsf contains a lot of phrases in which the same words are repeated. As a consequence, we discover large episodes, which in turn generate large machines. The running times given in the last column of Table 3 imply that ranking is fast: ranking the discovered episodes takes a few seconds. For example, in Address ranking episodes takes less than 5 seconds.
We consider only closed and strict episodes as candidates. If we consider also non-closed episodes, then the distribution of episode types may change as long closed episodes tend to be serial. Consequently, we will have more general episodes. This may result in larger machines as serial episodes have the simplest machines.
7.4 Significant Episodes
Let us first consider the Plant dataset. The first 5 episodes according to our ranking were exactly the planted patterns. The scores of these patterns are between and . The following patterns are typically a combination of an original pattern with an additional parallel symbol or a subset of an original pattern. The scores of these patterns, though significant, drop fast: the score of the 6th pattern is , the score of the 7th pattern is . Note that if we used frequency (or any other monotonic measure) as a score, subsets of these planted patterns would have appeared first in the list.
Our next step is to see what types of episodes our score prefers. In order to do that, we first consider Figure 7, where we have plotted the number of nodes in an episode as a function of rank. We see that top patterns tend to have more nodes. This is especially prominent with the Address and Nsf datasets.
We continued our experiments by computing the proportion of episode types, that is, whether an episode is parallel, serial, or general, as a function of rank, given in Figure 8. From the figures we see that the distribution depends heavily on the sequence. Serial episodes tend to be distributed evenly, parallel episodes tend to be missing from the very top, and general episodes tend to be missing from the very bottom.
Finally, let us conclude by demonstrating some of the discovered top patterns from the Address and Jmlr datasets, given in Figure 9. The first three patterns represent phrases that are often said by the presidents. The episode in Figure 9(b) is particularly interesting since presidents tend to acknowledge vice president(s) and the chief justice at the beginning of their speeches but the order is not fixed. The remaining 3 patterns represent common phrases occurring in abstracts of machine learning articles.
7.5 Asymptotic normality
Proposition 11 implies that if the independence assumption holds in the testing sequence, then should behave like a sample from a standard normal distribution as the size of the sequence increases. In this section we test the rate of convergence.
To that end we generated several sequences with independent events, each event having equal probability to occur. We generated three training sequences from alphabets of , , and symbols. Each sequence contained events. For each training sequence we generated testing sequences of different lengths, namely , , and .
From each training sequence we mined frequent episodes. We selected the thresholds such that we got roughly episodes; more specifically, we used , , as thresholds for sequences with , , symbols respectively. We then tested the discovered non-singleton episodes on the testing sequences. Note that computing the score requires probabilities of individual events. We computed the scores both by using the true probabilities and by estimating the probabilities from the training sequence.
In Figure 10 we plotted the proportion of episodes for which is smaller than the threshold. Proposition 11 implies that ideally this plot should be the identity line between [math] and . We see that this is the case in Figure 10(a). As we increase the size of the alphabet, the estimate becomes more and more inaccurate. We believe that this is due to the high skewness of the actual distribution. When using true probabilities for individual events, longer testing sequences produce better results. Using estimated values introduces additional errors, as can be seen in Figure 10(d), where a testing sequence of length is less ideal than the sequence of . However, this phenomenon can be mitigated by dividing the sequence into training and testing portions more evenly, thus making the estimates more accurate.
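The calibration check behind such a plot can be sketched as follows. The sketch assumes that the plotted quantity is the standard normal cumulative distribution function evaluated at the score, which is one natural reading of the text but is not confirmed by this excerpt; the function names are hypothetical, and the scores here are simulated rather than taken from the experiments.

```python
import math
import random


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def proportion_below(scores, threshold):
    """Fraction of scores whose normal CDF value is smaller than threshold."""
    values = [normal_cdf(z) for z in scores]
    return sum(v < threshold for v in values) / len(values)


# Simulated scores that are exactly standard normal, for illustration only.
rng = random.Random(1)
simulated = [rng.gauss(0.0, 1.0) for _ in range(20000)]
for threshold in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(threshold, proportion_below(simulated, threshold))
```

If the scores are truly standard normal, the printed proportions lie close to the thresholds, which corresponds to the identity line; systematic deviations indicate that the normal approximation has not yet become accurate.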
8 Discussion and Conclusions
In this paper we proposed a new quality measure for episodes based on minimal windows. To do this, we computed the expected values based on the independence model and compared the expectations to the observed values by computing a $z$-score.
Our main technical contribution is a technique for computing the moments of minimal windows. In order to do so we created a series of elaborate finite state machines and demonstrated that we can compute the moments recursively. In this paper we chose to use a specific statistic, namely , where is a length of a minimal window and is a user-given parameter. However, the same principle can be applied also directly on the length of minimal windows.
While the actual computation of statistics is fairly complex, requires a great number of recursive updates, and may even be exponentially slow, our experiments demonstrate that the computation is fast in practice: we can rank tens of thousands of episodes in a matter of seconds.
Our technique has its limitations. In the synthetic dataset Plant, after finding the 5 true patterns, our method continued to score highly patterns that were either superpatterns or subpatterns of the first 5 patterns. All these patterns are significant in the sense that they deviate significantly from the independence model. Nevertheless, they provide no new information about the underlying structure in the data. This problem occurs in any pattern ranking scheme where the ranking method does not take other patterns into account.
Approaches to further reduce patterns by considering patterns as a set instead of individual patterns have been developed for itemsets. For example, one approach for itemsets involves partitioning itemsets into subitemsets and applying the independence assumption between the individual parts (Webb, 2010). Transforming this idea to episodes is not trivial. A more direct approach, although one using only serial episodes, in which episodes were selected using MDL techniques, was suggested in Tatti and Vreeken (2012). An extension of this work to general episodes would be interesting.
Proposition 11 implies that we can interpret our measure as a $p$-value. In practice, this can be problematic, as we demonstrate in Section 7.5. Since the distributions are heavily skewed, especially when dealing with a large alphabet, we require a lot of samples before the normality assumption becomes accurate. Nevertheless, our experiments with synthetic and text data demonstrate that our score produces interpretable rankings.
Acknowledgements
Nikolaj Tatti was partly supported by a Post-Doctoral Fellowship of the Research Foundation – Flanders (FWO).
Appendix A Proofs
Proof (Proof of Proposition 2)
We will prove this by induction. Let be the source state of . The proposition holds trivially when , a source state. Assume now that the proposition holds for all parent states of .
Assume that covers . Let be a subsequence of that leads from the source state to . Let be the last symbol of occurring in . Then a parent state is covered by . By the induction assumption at least one is covered by . If there is such that , then is covered by , otherwise there is that has as a parent state. The edge connecting and is labelled with . Hence covers also.
To prove the other direction assume that covers . Let be a sub-sequence that leads from to . Let be the last symbol occurring in . Let be the parent state of connected by an edge labelled with . Since , we must have as a parent state of such that . By the induction assumption, covers . Hence covers .
In order to prove Proposition 3 we need the following lemma.
Lemma 2
Let be an episode and assume a sequence that covers . Let . If is empty, then covers . Otherwise, there is an episode that is covered by .
Proof
Let be a valid mapping of to indices of corresponding to the coverage. If is empty, then is not in the range of , then covers . If is not empty but is not in the range of , then covers , and any episode in .
Assume now that is in range of , that is, there is a sink with a label . Episode is in . Moreover, restricted to provides the needed mapping in order to to cover .
Proof (Proof of Proposition 3)
If , then it is trivial to see that covers .
Assume that covers . We will prove this direction by induction over , the length of . The proposition holds for . Assume that and that proposition holds for all sequences of length .
Let . Note that . Hence, to prove the proposition we need to show that covers .
If , then covers . Hence, we can assume that , that is, .
Proposition 2 implies that one of the states of , say , is covered by . Proposition 1 states that the corresponding episode, say , is covered by .
Assume that . This is possible only if , that is, there is no sink node in labelled as . Lemma 2 implies that covers , and Propositions 1 and 2 imply that covers .
Assume that . Then contains all states of corresponding to the episodes of form , where is a sink node of with a label . According to Lemma 2, covers one of these episodes, and Propositions 1 and 2 imply that covers .
Proof (Proof of Proposition 4)
We will prove the proposition by induction over , the length of . The proposition holds when . Assume that and that proposition holds for sequence of length .
Let . Then, by definition of , . Write . Since
[TABLE]
and, because of induction assumption, , we have .
Proof (Proof of Proposition 5)
Assume that is a minimal window for . Since covers in , . This implies that or . The latter case implies that covers in , which is a contradiction. Hence, . Let . If , then covers in , which is a contradiction. Hence . Proposition 4 implies that .
Assume that such that . Proposition 4 implies that and . The former implication leads to which implies that covers .
If covers , then and so , which is a contradiction. Hence does not cover . The latter implication leads to which implies that does not cover . This proves the proposition.
Proof (Proof of Proposition 6)
If , then which immediately implies the proposition. Assume that . Note that .
[TABLE]
Since individual symbols in are independent, it follows that
[TABLE]
This proves the proposition.
Proof (Proof of Lemma 1)
Define . Note that . We claim that for each there is a constant such that which in turn proves the lemma. To prove the claim we use induction over parenthood of and .
Since the source node is not in , the first step follows immediately. Assume that the result holds for all parent states of . Define
[TABLE]
Since , the case of holds. Assume that the induction assumption holds for and for up to . Let . Note that . According to Proposition 6 we have
[TABLE]
This proves that decays at exponential rate.
Proof (Proof of Proposition 8)
The proposition follows by a straightforward manipulation of Equation 1. First note that
[TABLE]
Equation 1 implies that
[TABLE]
Combining Equations 5 and 6 and solving gives us the result.
To prove the asymptotic normality we will use the following theorem.
Theorem A.1 (Theorem 27.4 in (Billingsley, 1995))
Assume that is a stationary sequence with , , and is -mixing with , where is the strong mixing coefficient,
[TABLE]
where is an event depending only on and is an event depending only on . Let . Then exists and converges to and .
Proof (Proof of Proposition 10)
Let us write and . Assume that we are given a vector and write . We will first prove that converges to a normal distribution using Theorem A.1.
First note that and that
[TABLE]
Since every moment of and is finite, is also finite. We will prove now that is -mixing.
Fix and . Write to be an event that covers . If is true, then and (and hence ) for depends only , that is, either there is a minimal window , where or .
Let be an event depending only on and be an event depending only on . Then . We can rephrase this and bound . To bound the right side, let , let be its sink state and let be all states save the source state. Then the probability is equal to
[TABLE]
Since does not contain the source node, the moment is finite. Consequently, which implies that . Thus Theorem A.1 implies that converges to a normal distribution with the variance . Levy’s continuity theorem (Theorem 2.13 van der Vaart, 1998) now implies that the characteristic function of converges to a characteristic function of normal distribution ,
[TABLE]
The left side is a characteristic function of (with as a parameter). Similarly, the right side is a characteristic function of . Levy’s continuity theorem now implies that converges into .
Proof (Proof of Proposition 11)
Function is differentiable at . Since $1/\sqrt{L}\big(\sum_{k=1}^{L}(Z_{k},X_{k})-(q,p)\big)$ converges to a normal distribution, we can apply Theorem 3.1 in (van der Vaart, 1998) so that
[TABLE]
converges to , where . The gradient of is equal to . The proposition follows.
Proof (Proof of Proposition 12)
To prove all four cases simultaneously, let us write to be either or and let be either or . Let and . First note that , which allows us to ignore inside the mean.
Assume that we have . Then given that , and depends only on first symbols of sequence. Since does not depend on first symbols, this implies that
[TABLE]
which in turn implies that .
Note that for whenever . Consequently, we have
[TABLE]
where the second last equality holds because and the last equality follows since and for any .
References
- Achar A, Laxman S, Viswanathan R, Sastry PS (2012) Discovering injective episodes with general partial orders. Data Min Knowl Discov 25(1):67–108
- Billingsley P (1995) Probability and Measure, 3rd edn. John Wiley & Sons
- Calders T, Dexters N, Goethals B (2007) Mining frequent itemsets in a stream. In: Proceedings of the 7th IEEE International Conference on Data Mining (ICDM 2007), pp 83–92
- Casas-Garriga G (2003) Discovering unbounded episodes in sequential data. In: Knowledge Discovery in Databases: PKDD 2003, 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp 83–94
- Cule B, Goethals B, Robardet C (2009) A new constraint for mining sets in sequences. In: Proceedings of the SIAM International Conference on Data Mining (SDM 2009), pp 317–328
- Gwadera R, Atallah MJ, Szpankowski W (2005a) Markov models for identification of significant episodes. In: Proceedings of the SIAM International Conference on Data Mining (SDM 2005), pp 404–414
- Gwadera R, Atallah MJ, Szpankowski W (2005b) Reliable detection of episodes in event sequences. Knowledge and Information Systems 7(4):415–437
- Hirao M, Inenaga S, Shinohara A, Takeda M, Arikawa S (2001) A practical algorithm to find the best episode patterns. In: Discovery Science, pp 435–440
