Summarizing Event Sequences with Serial Episodes: A Statistical Model and an Application

Soumyajit Mitra; P S Sastry

arXiv:1904.00516·cs.LG·April 2, 2019

Open Access

TL;DR

This paper introduces a new MDL-based algorithm for summarizing sequential data with serial episodes, provides a statistical justification, and demonstrates its effectiveness in text classification by significantly reducing feature dimensions without accuracy loss.

Contribution

It presents a novel statistical justification for an MDL-based sequence summarization algorithm and applies it to text classification for feature reduction.

Findings

01. Over four-fold reduction in feature vector size achieved.

02. The summarization maintains classification accuracy.

03. First statistical justification for MDL-based sequence summarization.

Abstract

In this paper we address the problem of discovering a small set of frequent serial episodes from sequential data so as to adequately characterize or summarize the data. We discuss an algorithm based on the Minimum Description Length (MDL) principle; it is a slight modification of an earlier method called CSC-2. We present a novel generative model for sequence data containing prominent pairs of serial episodes and, using this, provide some statistical justification for the algorithm. We believe this is the first instance of such a statistical justification for an MDL-based algorithm for summarizing event sequence data. We then present a novel application of this data mining algorithm in text classification. By considering text documents as temporal sequences of words, the data mining algorithm can find a set of characteristic episodes for all the training data as a whole.…

Tables (7)

Table 1. Table I: Encoding of event sequence

Size of episode	Episode name	No. of occurrences	List of Occurrences
$3$	$(A \overset{2}{\to} B \overset{1}{\to} C)$	$2$	$< 2, 4 >$
$3$	$(D \overset{2}{\to} E \overset{2}{\to} C)$	$2$	$< 1, 5 >$
$2$	$(A \overset{1}{\to} B)$	$1$	$< 7 >$
$1$	$C$	$2$	$< 3, 8 >$

Table 2. Table II: Dictionary sizes for different datasets

Dataset	Number of discovered episodes	Size of Dict-I	Size of Dict-II
Reuters-21578	2261	14575	1560
WebKB	2423	7287	1884
20-Newsgroup	4703	54580	7361
Movie Review	2490	37714	3007

Table 3. Table III: Linear SVM accuracy and F-measure (macro) for VecAvg representation

Dataset	Accuracy (%), Dict-I	Accuracy (%), Dict-II	F-measure (%), Dict-I	F-measure (%), Dict-II
Reuters-21578	95.43	95.52	81.67	81.97
WebKB	89.11	88.90	87.84	87.63
20-Newsgroup	69.84	70.14	68.23	68.64
Movie Review	81.3	80.35	81.28	80.34

Table 4. Table IV: Naive-Bayes and Linear SVM classification accuracy and F-measure (macro) for BoW representation

Dataset	Classifier	Accuracy (%), Dict-I	Accuracy (%), Dict-II	F-measure (%), Dict-I	F-measure (%), Dict-II
Reuters-21578	NB	96.07	96.30	95.99	96.32
Reuters-21578	SVM	97.03	97.17	97.01	97.13
WebKB	NB	83.52	83.60	82.26	83.63
WebKB	SVM	91.04	90.62	90.54	90.61
20-Newsgroup	NB	81.03	79.41	79.89	79.33
20-Newsgroup	SVM	82.73	81.99	82.03	81.90
Movie Review	NB	82.50	82.85	82.48	82.34
Movie Review	SVM	86.75	84.50	87.64	84.74

Table 5. Table V: Mean (and standard deviation) of classification accuracy with Naive-Bayes using different dictionaries (BoW representation)

Dataset	Dictionary-I	Dictionary-II
Reuters-21578	95.36 ($\pm$ 0.883)	95.01 ($\pm$ 0.895)
WebKB	83.38 ($\pm$ 0.315)	83.23 ($\pm$ 0.328)
20-Newsgroup	81.35 ($\pm$ 0.412)	80.36 ($\pm$ 0.532)

Table 6. Table VI: Mean (and standard deviation) of classification accuracy with Linear SVM using different dictionaries (BoW representation)

Dataset	Dictionary-I	Dictionary-II
Reuters-21578	97.30 ($\pm$ 0.289)	97.31 ($\pm$ 0.214)
WebKB	89.97 ($\pm$ 0.798)	90.58 ($\pm$ 0.536)
20-Newsgroup	82.94 ($\pm$ 0.139)	82.67 ($\pm$ 0.473)

Table 7. Table VII: Sample words from the set of rejected and selected words for Dictionary-II

Dataset	Movie Review (labels = ‘positive sentiment’, ‘negative sentiment’)	WebKB (labels = ‘project’, ‘student’, ‘faculty’, ‘course’)
Rejected words	stunts, theatre, cinematographer, moviestar, directorship, producers, storyteller, scripts, spotlight, audition, auditorium, backstage, torrent, reviewer, performances, entertainment.	chemistry, cryptography, probabilistic, lagrangian, arithmetic, scholarship, bibtex, manuscript, newsletter, computer, interdisciplinary, mathematician, biotechnology, accuracy, baseline, neurocomputing, gaussian.
Selected words	enjoyable, funny, hilarious, entertaining, superb, boring, sleepy, disappointed, twists, clever, impressed, surprised, liked, interested, awful, pleasing, miserably, dumber, interesting, impressive, intelligent, fantastic.	syllabus, internet, introductory, prerequisite, research, bibliography, professor, student, quiz, exercise, credit, query, tutor, project, phd, fellowship, conference, curriculum, scientist, magazine, instructor, theorem, homework, examination, semester, journal, homepage.

Equations

$$\mathcal{D}=<(D,1),(A,2),(C,3),(E,3),(A,4),(B,4),(C,5),(D,5),(B,6),(C,7),(E,7),(A,7),(C,8),(B,8),(C,9)>$$

$$score(\alpha,\mathcal{D})=f_{\alpha}N-(2N+1+f_{\alpha})$$

$$overlap\text{-}score(\alpha,\mathcal{D},\mathcal{F}_{s})=score(\alpha,\mathcal{D})-\sum_{\beta\in\mathcal{F}_{s}}OM(\alpha,\beta)$$

$$\text{If } overlap\text{-}score(\alpha,\mathcal{D},\mathcal{F}_{s})>0, \text{ then } L(\mathcal{F}_{s}\cup\{\alpha\},\mathcal{D})<L(\mathcal{F}_{s},\mathcal{D})$$

$$t_{h_{i}(N)}\leq t_{h^{\prime}_{i}(N)}\quad\forall i\in\{1,2,\dots,p\}$$

$$t_{h_{i}(N)}\leq t_{h^{\prime}_{i}(N)}<t_{h^{\prime}_{i+1}(1)}<t_{h_{i+1}(1)}$$

$$P(\textbf{o},\textbf{q}|\Lambda)=\pi_{q_{1}}b_{q_{1}}(o_{1})\prod_{t=2}^{T}p_{q_{t-1}q_{t}}b_{q_{t}}(o_{t})$$

$$\textbf{q}^{*}=\arg\max_{\textbf{q}}P(\textbf{o},\textbf{q}|\Lambda)$$

$$P(\textbf{o},\textbf{q}|\Lambda)=\left(\frac{\eta}{M}\right)^{|\textbf{q}_{n}|}\left(\frac{1-\eta}{2}\right)^{|\textbf{q}_{e}|}$$

$$P(\textbf{o},\textbf{q}|\Lambda)=\left(\frac{\eta}{M}\right)^{T}\left(\frac{(1-\eta)M}{2\eta}\right)^{|\textbf{q}_{e}|}$$

$$P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})$$

$$P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})$$

$$\implies\frac{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})}{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})}$$

$$P(\textbf{o},\textbf{q}|\Lambda)=\left(\frac{\eta}{M}\right)^{T}\left(\frac{(1-\eta)M}{2\eta}\right)^{|\textbf{q}_{1}|}\left(\frac{(1-\eta)M}{\eta}\right)^{|\textbf{q}_{2}|}$$

$$P(\textbf{o},\textbf{q}|\Lambda_{\alpha\beta})$$

$$\left(\frac{(1-\eta)M}{\eta}\right)^{O_{\alpha\beta}}$$

$$=\left(\frac{\eta}{M}\right)^{T}\left(\frac{(1-\eta)M}{2\eta}\right)^{Nf_{\alpha}+Nf_{\beta}}\left(\frac{(1-\eta)M}{4\eta}\right)^{-O_{\alpha\beta}}$$

$$P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})$$

$$\left(\frac{(1-\eta)M}{4\eta}\right)^{-O^{*}_{\alpha\beta}}$$

$$\frac{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})}{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})}$$

$$\left(\frac{(1-\eta)M}{4\eta}\right)^{O^{*}_{\alpha\gamma}-O^{*}_{\alpha\beta}}$$

$$Overlap\text{-}score_{1}(\beta,\alpha)$$

$$Overlap\text{-}score_{2}(\beta,\alpha)$$

$$\frac{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})}{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})}=\frac{\left(\frac{(1-\eta)M}{2\eta}\right)^{N(f^{*}_{\beta}-f^{*}_{\gamma})}}{\left(\frac{(1-\eta)M}{4\eta}\right)^{O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma}}}=2^{N(f^{*}_{\beta}-f^{*}_{\gamma})}\left(\frac{(1-\eta)M}{4\eta}\right)^{N(f^{*}_{\beta}-f^{*}_{\gamma})-(O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma})}>1$$

$$\implies P(\textbf{o}|\Lambda_{\alpha\beta})>P(\textbf{o}|\Lambda_{\alpha\gamma})\quad\text{(under assumption A1)}$$

$$\frac{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})}{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})}=\frac{2^{x}}{\left(\frac{(1-\eta)M}{4\eta}\right)^{x+\xi}}=\frac{1}{\left(\frac{(1-\eta)M}{8\eta}\right)^{x}\left(\frac{(1-\eta)M}{4\eta}\right)^{\xi}}<1$$

$$\implies P(\textbf{o}|\Lambda_{\alpha\gamma})>P(\textbf{o}|\Lambda_{\alpha\beta})\quad\text{(under A1)}$$

$$Modified\text{-}wf(w,d)=wf(w,d)*idf(w)$$

$$idf(w)=\log\frac{1+n_{d}}{1+df(w)}+1$$

Taxonomy

Topics: Data Mining Algorithms and Applications · Time Series Analysis and Forecasting · Algorithms and Data Compression

Methods: Minimum Description Length

Full text

Summarizing Event Sequences with Serial Episodes: A Statistical Model and an Application

Soumyajit Mitra and P S Sastry

Soumyajit Mitra was at the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India. He is currently with Samsung R&D, Bangalore, India. E-mail: [email protected]

P.S. Sastry is with the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India. E-mail: [email protected]

Abstract

In this paper we address the problem of discovering a small set of frequent serial episodes from sequential data so as to adequately characterize or summarize the data. We discuss an algorithm based on the Minimum Description Length (MDL) principle; it is a slight modification of an earlier method called CSC-2. We present a novel generative model for sequence data containing prominent pairs of serial episodes and, using this, provide some statistical justification for the algorithm. We believe this is the first instance of such a statistical justification for an MDL-based algorithm for summarizing event sequence data. We then present a novel application of this data mining algorithm in text classification. By considering text documents as temporal sequences of words, the data mining algorithm can find a set of characteristic episodes for all the training data as a whole. The words that are part of these characteristic episodes could then be considered the only relevant words for the dictionary, thus resulting in a considerably reduced feature vector dimension. We show, through simulation experiments using benchmark data sets, that the discovered frequent episodes can be used to achieve a more than four-fold reduction in dictionary size without losing any classification accuracy.

Index Terms:

Frequent episodes, MDL principle, compressing frequent patterns, HMM models for episodes, dictionary learning, text classification.

1 Introduction

Frequent pattern mining is an important problem in data mining with applications in diverse domains [1]. Frequently occurring local patterns can capture useful aspects of the semantics of the data. However, in practice, the mined frequent patterns are often large in number and quite redundant, which makes it difficult to use them effectively. Isolating a small set of non-redundant informative frequent patterns that best describes the data is an interesting current research problem [2, 3, 4, 5, 6, 7, 8]. In this paper we are concerned with mining of sequential data in the framework of frequent episodes [9]. We address the problem of isolating a small set of non-redundant serial episodes that best characterize the data.

There have been many recent efforts to extract a small subset of non-redundant characteristic patterns. There are mainly two families of methods. One family of methods retains only those patterns which are, in some sense, statistically significant. The statistical significance is assessed using either a suitable null model in a hypothesis testing framework or by fitting a generative model for the data source [10, 11, 12, 13, 14, 4]. While this can reduce the number of frequent patterns to some extent, this approach cannot tackle redundancy in the discovered patterns.

Another prominent family of methods for deciding which subset of patterns best explains the data is based on an information theoretic approach called the Minimum Description Length (MDL) principle [15]. In the context of the problem of isolating a ‘best’ subset of frequent patterns, the use of the MDL principle can be explained as follows. We formulate a mechanism so that, given any subset of frequent patterns, we can use them as a ‘model’ to encode the data. Then, the subset that results in the overall best level of data compression is considered to be the subset that best characterizes the data. Such a view, motivated by the MDL principle, has been found effective for many frequent pattern mining algorithms [16].

The MDL principle views learning as data compression. If we are able to discover all the important regularities in data then we should be able to use these to compress the data well. In this view, the coding mechanism used should be lossless; that is, the original data should be exactly recoverable given the encoded compressed representation.

The Krimp algorithm [2] is one of the first methods that used the MDL principle to identify a small set of relevant patterns in the context of frequent itemset mining. As mentioned earlier, in this paper we are concerned with sequential data. For sequential data, unlike in the case of transaction data, the temporal ordering of data tuples is important, and our encoding mechanism must allow us to recover the original data sequence in the correct order along with all time stamps. This presents additional complications while encoding sequential data using frequent patterns. (See [6] for more discussion on this). There are many MDL-motivated algorithms proposed for characterizing sequence data through a subset of frequent patterns [3, 5, 6, 7]. Different algorithms use different strategies for coding data using frequent patterns. While the methods are motivated by the MDL principle, the coding strategies, and hence the computation of the compression achieved by a given subset of frequent patterns, are essentially arbitrary.

In this paper we consider a recently proposed algorithm called CSC-2 [6] which is an efficient method to discover a subset of serial episodes that best characterizes data of event sequences. It uses a novel pattern class consisting of injective serial episodes with fixed inter-event times. A similar pattern class was also used recently for learning association rules from temporal data [17]. The CSC-2 algorithm uses the number of distinct occurrences of an episode as its frequency. Here, we extend it to the case of non-overlapped occurrences as episode frequency and then provide some statistical justification for the algorithm based on a generative model.

The main contribution of this paper is an HMM-based generative model which provides some statistical justification for the CSC-2 algorithm. In all MDL-based approaches, a subset of patterns is selected based on the data compression it can achieve. This depends on the (arbitrary) coding scheme used by the algorithm, which is selected heuristically. In this paper we provide a justification for the coding scheme and the algorithm used in CSC-2 based on our proposed statistical generative model. This is the first time, to our knowledge, that such a formal connection is established between mining of episodes using the MDL principle and a generative model for the data source. Since this generative model is Markovian, and hence can handle only episode frequency based on non-overlapped occurrences, we extend CSC-2 to use non-overlapped occurrences as episode frequency.

Another major contribution of this paper is a novel application of this method of discovering a set of characteristic episodes from sequential data. The application is in text classification. Most text classification methods represent each document as a vector over a dictionary of words, which is often called the bag-of-words representation. The dictionary for this is taken to be all the words in the corpus (after appropriate stemming and dropping of stop words). Often, the dictionary sizes are large, resulting in high dimensionality of the feature vectors representing individual documents. A text document can be viewed as a sequence of events with event types being words. Hence, using our method, we can discover a subset of characteristic episodes that best represent the full corpus of document data. We can then use the words (event-types) in the subset of discovered episodes to form our dictionary. Since our method does not even need a frequency threshold, this constitutes a parameter-less unsupervised method of feature selection for this problem. We show, through empirical experiments, that this method results in a roughly four- to twelve-fold reduction of dictionary size (Table II) without any loss of classification accuracy.

The rest of the paper is organized as follows. Section 2 describes the episode mining algorithm. Section 3 presents our proposed generative model. Section 4 explains our method of finding a smaller sized dictionary in text classification problems and reports results obtained on different text datasets. Conclusions are presented in Section 5.

2 Discovering Best Subset of Serial Episodes

2.1 Episodes in Event Sequences

We begin with a brief informal description of the episodes framework. (See [9, 18] for more details). Here the data is (abstractly) viewed as an event sequence denoted as $\mathcal{D}=<(E_{1},t_{1}),(E_{2},t_{2}),...,(E_{n},t_{n})>$ where, in each tuple or event, $(E_{i},t_{i})$ , $E_{i}$ is the event-type and $t_{i}$ is the time of occurrence of that event. We have $E_{i}\in\mathcal{E}$ , a finite alphabet set and $t_{i}\leq t_{i+1}$ , $\forall i$ . An example event sequence is

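$$\mathcal{D}=<(D,1),(A,2),(C,3),(E,3),(A,4),(B,4),(C,5),(D,5),(B,6),(C,7),(E,7),(A,7),(C,8),(B,8),(C,9)>\qquad(1)$$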

The patterns of interest here are called episodes. In this paper we are concerned with only serial episodes. We represent an $N$ -node serial episode, $\alpha$ , as $\alpha[1]\rightarrow\cdots\rightarrow\alpha[N]$ where $\alpha[i]$ is the event-type of the $i^{th}$ event of the episode. An episode is said to be injective if all event types in the episode are distinct. For example, $A\rightarrow B\rightarrow C$ is a three node injective serial episode. An occurrence of the serial episode is constituted by events in the data sequence that have appropriate event types and their times of occurrence are in the correct order. In (1), $((A,2),(B,4),(C,5))$ constitutes an occurrence of $A\rightarrow B\rightarrow C$ while $((A,4),(B,4),(C,5))$ does not because $B$ does not occur after $A$ . (Note that the events constituting an occurrence need not be contiguous in the data).

The data mining problem is to discover all frequently occurring episodes. In the frequent episodes framework, many different frequency measures are defined based on counting different subsets of occurrences. We mention two such frequencies below which are relevant for this paper. There are efficient algorithms for discovering serial episodes under many frequency measures. (See [18] for more details on different frequencies and algorithms for discovering serial episodes).

Two occurrences of a serial episode are said to be non-overlapped if no event of one occurrence is in between events of the other. In (1), $((A,2),(B,4),(C,5))$ and $((A,7),(B,8),(C,9))$ are non-overlapped occurrences of $A\rightarrow B\rightarrow C$ while $((A,4),(B,6),(C,8))$ is another occurrence of this episode which overlaps with both the earlier ones. The non-overlapped frequency of an episode is defined as the maximum number of non-overlapped occurrences of the episode in the event sequence [14]. Two occurrences are said to be distinct if they do not share any event. All three occurrences above are distinct. The maximum number of distinct occurrences is another frequency of interest.

In our method here we use a special class of serial episodes called fixed-interval serial episodes [6]. A fixed interval serial episode can be denoted as $\alpha=(\alpha[1]\xrightarrow{\Delta_{1}}\alpha[2]\xrightarrow{\Delta_{2}}\cdot\cdot\cdot\xrightarrow{\Delta_{N-1}}\alpha[N])$ where $\Delta_{i}$ is the prescribed gap between the times of $i^{th}$ and $(i+1)^{st}$ events of any occurrence of $\alpha$ . For example, $A\xrightarrow{2}B\xrightarrow{1}C$ is a fixed interval injective serial episode. In (1), $((A,2),(B,4),(C,5))$ is an occurrence of this episode while $((A,7),(B,8),(C,9))$ is not.

As is easy to see, all events constituting an occurrence of a fixed interval serial episode are completely specified by giving only the time of occurrence of the first event. Also, two occurrences starting at different times would be distinct if the episode is injective (that is, all event types in the episode are different).
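The observation that a start time fully determines an occurrence can be illustrated with a minimal Python sketch. The helper names `occurrence_events` and `start_times` are hypothetical (they are not from the paper or from [6]), and the brute-force membership test is only meant for the small example sequence (1).

```python
from typing import List, Sequence, Tuple

Event = Tuple[str, int]

# Example event sequence (1) from the text.
DATA: List[Event] = [
    ("D", 1), ("A", 2), ("C", 3), ("E", 3), ("A", 4), ("B", 4), ("C", 5),
    ("D", 5), ("B", 6), ("C", 7), ("E", 7), ("A", 7), ("C", 8), ("B", 8),
    ("C", 9),
]


def occurrence_events(types: Sequence[str], gaps: Sequence[int], start: int) -> List[Event]:
    """Events implied by an occurrence of the episode that starts at `start`."""
    events, t = [], start
    for i, e in enumerate(types):
        if i > 0:
            t += gaps[i - 1]  # fixed gap between the (i)th and (i+1)th events
        events.append((e, t))
    return events


def start_times(data: Sequence[Event], types: Sequence[str], gaps: Sequence[int]) -> List[int]:
    """Sorted start times of all occurrences of a fixed-interval serial episode."""
    present = set(data)
    candidates = sorted({t for (e, t) in data if e == types[0]})
    return [s for s in candidates
            if all(ev in present for ev in occurrence_events(types, gaps, s))]


print(start_times(DATA, ["A", "B", "C"], [2, 1]))  # -> [2, 4]
print(start_times(DATA, ["D", "E", "C"], [2, 2]))  # -> [1, 5]
```

The start times found this way agree with the occurrence lists in Table I.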

2.2 Mining Algorithm for the Best Subset

Here our interest is in discovering a small set of fixed-interval serial episodes that best explains the data. We use the Minimum Description Length (MDL) principle for this. Hence we rank different subsets of episodes by the total encoding length that results when we use them as models to encode data. Under MDL, the encoding should be such that we should be able to recover the original data completely. Since we are considering sequential data, this means we should be able to recover the data in the original sequence with all time stamps.

We first explain the strategy for coding the data sequence using our episodes. The basic idea is that we can encode all events constituting an occurrence of a fixed-interval serial episode by giving just the start time of the occurrence. The encoding strategy is the same as that used in [6]. To obtain the best subset of episodes we essentially use the CSC-2 algorithm from [6]; the main difference is that we use the non-overlapped frequency, whereas that algorithm uses distinct occurrences as frequency. Below we first explain the encoding scheme through an example and then briefly explain the CSC-2 algorithm. (For more details on the encoding scheme and the CSC-2 algorithm, please see [6]).

Table I illustrates the coding scheme by encoding the event sequence in (1) using three arbitrarily selected episodes. Each row specifies the size and description of an episode, the number of occurrences of the episode and the start times of these occurrences. Thus, the first row of Table I specifies a three-node episode, namely $(A\xrightarrow{2}B\xrightarrow{1}C)$, which has two occurrences starting at time instants 2 and 4. This row therefore codes for six events in the data, namely those constituting the two occurrences of this 3-node episode. Similarly, the second row codes for six events by specifying two occurrences of a 3-node episode, and the third row codes for two events by specifying one occurrence of a 2-node episode. Suppose we want to ask how good this subset of three episodes is. These three episodes together, as specified through Table I, account for all but two events in the data. But coding under MDL should be lossless. Hence, in the last row of Table I we have used two occurrences of a 1-node episode to make sure that all events in the data sequence are covered. It is easy to see that, given this table, we can recreate the entire data sequence exactly. In this table we can think of the first two columns as coding the model, that is, the subset of episodes, and the last two columns as coding the data using this model. Thus the length or size of this table can be taken as the total encoded length for the subset of episodes. Given any subset of episodes (such as the three episodes in the first three rows of the table), we can find an encoding like this for the whole data by adding occurrences of a few 1-node episodes as needed (which is what is done in the fourth row of the table).
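The lossless property can be checked with a small Python sketch that decodes Table I back into events. The data layout (`TABLE_I` as a list of tuples) and the function `decode` are illustrative assumptions, not the paper's implementation. Because the table stores no order among simultaneous events, the comparison with sequence (1) is made on sets of (event type, time) pairs.

```python
from typing import List, Tuple

Event = Tuple[str, int]

# Example event sequence (1) from the text.
DATA: List[Event] = [
    ("D", 1), ("A", 2), ("C", 3), ("E", 3), ("A", 4), ("B", 4), ("C", 5),
    ("D", 5), ("B", 6), ("C", 7), ("E", 7), ("A", 7), ("C", 8), ("B", 8),
    ("C", 9),
]

# Rows of Table I as (event types, inter-event gaps, start times).
TABLE_I = [
    (["A", "B", "C"], [2, 1], [2, 4]),
    (["D", "E", "C"], [2, 2], [1, 5]),
    (["A", "B"], [1], [7]),
    (["C"], [], [3, 8]),
]


def decode(table) -> List[Event]:
    """Expand every occurrence in the table into its events."""
    events = set()  # a set, so events coded by two rows are kept only once
    for types, gaps, starts in table:
        for s in starts:
            t = s
            for i, e in enumerate(types):
                if i > 0:
                    t += gaps[i - 1]
                events.add((e, t))
    # Sort by time; ties are broken alphabetically (an arbitrary choice).
    return sorted(events, key=lambda ev: (ev[1], ev[0]))


decoded = decode(TABLE_I)
print(len(decoded))                # -> 15
print(set(decoded) == set(DATA))   # -> True
```

The four rows expand to 16 events in total, but $(C,5)$ is produced by both of the first two rows, so 15 distinct events remain.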

In this table, one can see that the event $(C,5)$ is coded for by both the first and the second episode. Intuitively, we get better data compression if such overlaps among the parts of the data encoded by different episodes in the selected set are minimized. Thus, we should get better compression of data if we can choose episodes with high frequency (so that they cover a large number of events) which are non-redundant (so that the overlaps mentioned above are reduced). This is the intuitive reason for using this coding scheme and looking for a subset of episodes that achieves the best compression of data.

Our objective is to find a subset of episodes to encode data like this so as to get the best data compression. For purposes of counting length/memory we assume that event types as well as times of occurrence are integers and that each integer takes one unit of memory. Let $\alpha$ be an $N$-node episode of frequency $f_{\alpha}$ used for encoding. Its row in the table would need $2N+1+f_{\alpha}$ units ($1$ unit to represent the size of the episode, $N$ units to represent the event-types of the episode, $N-1$ units for representing the inter-event gaps, $1$ unit for frequency and $f_{\alpha}$ units to represent the start times of the occurrences). Since non-overlapped or distinct occurrences do not share events, this episode encodes for $f_{\alpha}N$ events in the data and hence we need at least $f_{\alpha}N$ units of memory if we want to encode these events in the data using 1-node episodes. Define

$$score(\alpha,\mathcal{D})=f_{\alpha}N-(2N+1+f_{\alpha})$$

If $score(\alpha,\mathcal{D})>0$ , then we can conclude that $\alpha$ is a useful candidate, since, selecting it can improve encoding length (in comparison to the trivial encoding using only 1-node episodes). However, the true utility of $\alpha$ is to be assessed with respect to what it would add to compression given the other selected episodes.
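As a quick numerical illustration of this score, the following Python sketch evaluates it for a few sizes and frequencies. The function name `score` and its integer arguments are chosen here for illustration; only the formula itself comes from the text.

```python
def score(n_nodes: int, freq: int) -> int:
    """Events covered by the episode minus the cost of its table row."""
    events_covered = freq * n_nodes      # f_alpha * N
    row_cost = 2 * n_nodes + 1 + freq    # size + types + gaps + frequency + start times
    return events_covered - row_cost


print(score(3, 2))   # -> -3  (the first row of Table I: N = 3, two occurrences)
print(score(3, 4))   # -> 1
print(score(3, 10))  # -> 13
print(score(2, 5))   # -> 0
```

Rearranging $f_{\alpha}N-(2N+1+f_{\alpha})>0$ gives $f_{\alpha}>(2N+1)/(N-1)$ for $N>1$, so under this accounting a 3-node episode needs at least four counted occurrences, and a 2-node episode at least six, before its score becomes positive.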

Let $\mathcal{F}_{s}$ be a set of episodes of size greater than one. Given any such $\mathcal{F}_{s}$ , let $L(\mathcal{F}_{s},\mathcal{D})$ denote the total encoded length of $\mathcal{D}$ , when we encode all the events which are part of the occurrences of episodes in $\mathcal{F}_{s}$ , by using episodes in $\mathcal{F}_{s}$ and encode the remaining events in $\mathcal{D}$ , if any, by episodes of size one. Given any two episodes $\alpha\text{, }\beta$ , let $OM(\alpha,\beta)$ denote the number of events in the data that are covered by occurrences of both $\alpha$ and $\beta$ in the data sequence $\mathcal{D}$ . Define

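$$overlap\text{-}score(\alpha,\mathcal{D},\mathcal{F}_{s})=score(\alpha,\mathcal{D})-\sum_{\beta\in\mathcal{F}_{s}}OM(\alpha,\beta)$$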

$Overlap\text{-}score$ gives an estimate of how much extra encoding efficiency can be achieved by selecting $\alpha$ given the set $\mathcal{F}_{s}$ . It can be proved [6, Prop. 1] that

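$$\text{If } overlap\text{-}score(\alpha,\mathcal{D},\mathcal{F}_{s})>0, \text{ then } L(\mathcal{F}_{s}\cup\{\alpha\},\mathcal{D})<L(\mathcal{F}_{s},\mathcal{D})$$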

This means that, given a current set of episodes $\mathcal{F}_{s}$ , adding to $\mathcal{F}_{s}$ an episode $\alpha$ with positive $overlap\text{-}score$ , would only reduce the total encoded length. The CSC-2 algorithm in [6] is essentially a greedy algorithm that keeps adding episodes with highest $overlap\text{-}score$ . This greedy selection of best episode (based on $overlap\text{-}score$ ) is done from a set of candidate episodes, generated through a depth-first search of the lattice of all serial episodes. Each candidate episode is the ‘best’ episode in one of the paths of the depth-first search tree. For the sake of completeness we give the pseudocode of this algorithm as Algorithm 1 (For more details see [6]).
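The greedy selection step can be sketched in Python as follows. This is a simplified illustration, not CSC-2 itself: it assumes the candidate episodes and their occurrences are already given (the depth-first candidate generation of [6] is omitted), and the `Episode` container, the helper names and the three toy candidates are all hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple

Event = Tuple[str, int]


class Episode(NamedTuple):
    types: Tuple[str, ...]   # event types alpha[1..N]
    gaps: Tuple[int, ...]    # N-1 fixed inter-event gaps
    starts: Tuple[int, ...]  # start times of the counted occurrences


def covered(ep: Episode) -> Set[Event]:
    """All (event type, time) pairs encoded by the episode's occurrences."""
    events = set()
    for s in ep.starts:
        t = s
        for i, e in enumerate(ep.types):
            if i > 0:
                t += ep.gaps[i - 1]
            events.add((e, t))
    return events


def score(ep: Episode) -> int:
    n, f = len(ep.types), len(ep.starts)
    return f * n - (2 * n + 1 + f)


def overlap_score(ep: Episode, selected: List[Episode]) -> int:
    # OM(alpha, beta): number of events covered by both episodes.
    om = sum(len(covered(ep) & covered(b)) for b in selected)
    return score(ep) - om


def greedy_select(candidates: List[Episode], k: int) -> List[Episode]:
    """Repeatedly add the candidate with the highest positive overlap-score."""
    selected: List[Episode] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda c: overlap_score(c, selected))
        if overlap_score(best, selected) <= 0:
            break  # no remaining candidate improves the encoding length
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy candidates (made up for illustration).
x = Episode(("A", "B", "C"), (1, 1), (0, 10, 20, 30, 40))
y = Episode(("B", "C", "D"), (1, 1), (1, 11, 21, 31, 41))
z = Episode(("E", "F"), (2,), (100, 110, 120, 130, 140, 150, 160))

chosen = greedy_select([x, y, z], k=10)
print([ep.types for ep in chosen])  # -> [('A', 'B', 'C'), ('E', 'F')]
```

In this toy example `y` has the same score as `x`, but it shares two events with every occurrence of `x`; once `x` is selected its overlap-score drops to $3-10=-7$ and it is never added.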

We can run this algorithm to find the ‘top-$K$’ best episodes. If we give a very large value of $K$, the algorithm exits when it cannot find any more episodes (of size greater than 1) that improve coding efficiency. The algorithm needs no frequency threshold given by users. Our $overlap$-$score$ naturally prefers episodes with higher frequency, and we need no threshold because we pick episodes based on what they add to coding efficiency. Thus, the algorithm does not really have any hyperparameters (except for $T_{g}$, the maximum allowable inter-event gap, which is not a critical one).

While calculating $overlap$-$score$, we need to decide what type of occurrences we would count toward frequency. As mentioned earlier, CSC-2 uses distinct occurrences. In this paper we use non-overlapped occurrences for frequency of episodes. The reason for this is that the generative model we present in the next section is for non-overlapped occurrences. Also, in our application to text classification, non-overlapped occurrences are a more natural choice for frequency.

We obtain the sequence of non-overlapped occurrences from the distinct occurrences returned by CSC-2 using a simple algorithm. We take the first occurrence in the sequence of distinct occurrences as the first one in the sequence of non-overlapped occurrences. From then on, we take the first distinct occurrence that starts after the end of the last non-overlapped occurrence selected so far as the next one in our sequence of non-overlapped occurrences. The pseudocode for this algorithm is listed as Algorithm 2. Below, we prove the correctness of this algorithm. That is, we show that the sequence of non-overlapped occurrences we get is a maximal one and hence we get the correct non-overlapped frequency.
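The selection rule just described can be sketched in a few lines of Python. This is an illustration written from the prose description, not a transcription of Algorithm 2; the function name `non_overlapped` is hypothetical, and occurrences are represented by their start times, which suffices for fixed-interval episodes.

```python
from typing import List, Sequence


def non_overlapped(starts: Sequence[int], gaps: Sequence[int]) -> List[int]:
    """Greedily pick non-overlapped occurrences from distinct ones.

    `starts` holds the start times of the distinct occurrences and `gaps`
    the fixed inter-event gaps, so every occurrence spans sum(gaps) time units.
    """
    span = sum(gaps)
    chosen: List[int] = []
    last_end = None
    for s in sorted(starts):
        # Keep an occurrence only if it starts strictly after the end
        # of the previously kept occurrence.
        if last_end is None or s > last_end:
            chosen.append(s)
            last_end = s + span
    return chosen


# A -2-> B -1-> C in sequence (1) has distinct occurrences starting at 2 and 4.
print(non_overlapped([2, 4], [2, 1]))           # -> [2]
print(non_overlapped([1, 2, 5, 6, 9], [2, 1]))  # -> [1, 5, 9]
```

For the first episode of Table I the distinct frequency is 2 but the non-overlapped frequency is 1, because the occurrence starting at time 4 begins before the first occurrence ends at time 5.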

Let $\mathcal{H}=\{h_{1},h_{2},\dots,h_{l}\}$ denote the set of non-overlapped occurrences returned by Algorithm 2. Each occurrence $h_{i}$ can be thought of as a tuple of indices into the data sequence which give the positions of the events that constitute this occurrence. For example, in data sequence (1), the occurrence of the episode $A\rightarrow B\rightarrow C$ constituted by the events $<(A,2),(B,4),(C,5)>$ would be represented by the tuple $(2\;6\;7)$. Hence, as a notation, we use $t_{h_{i}(k)}$ to denote the time of the $k^{th}$ event of the episode in the occurrence $h_{i}$. On the set of occurrences $\mathcal{H}$ there is a natural order: occurrence $h_{i}$ is earlier than $h_{j}$ if $t_{h_{i}(1)}<t_{h_{j}(1)}$. Because of the way the occurrences in $\mathcal{H}$ are selected by our algorithm, the following property is easily seen to hold:

Property 1: $h_{1}$ is the earliest distinct occurrence of the episode $\alpha$ . For any $i$ , $h_{i}$ is the first distinct occurrence starting after $t_{h_{i-1}(N)}$ and there is no distinct occurrence which starts after $t_{h_{l}(N)}$ .

**Proposition 1:** $\mathcal{H}$ is a maximal set of non-overlapped occurrences of $\alpha$.

*Proof:* Note that for fixed interval injective serial episodes, occurrences having different start times are distinct. Consider any other set of non-overlapped occurrences of the episode, $\mathcal{H}^{{}^{\prime}}=\{h^{{}^{\prime}}_{1},h^{{}^{\prime}}_{2},\dots,h^{{}^{\prime}}_{m}\}$. Let $p=\min\{m,l\}$. We first show that

$$t_{h_{i}(N)}\leq t_{h^{\prime}_{i}(N)}\quad\forall i\in\{1,2,\dots,p\}\qquad(4)$$

We use induction on $i$ to prove this. Let us show this first for $i=1$ . Suppose, $t_{h_{1}(N)}>t_{h^{{}^{\prime}}_{1}(N)}$ . Since, the inter-event gaps are fixed, we have $t_{h_{1}(1)}>t_{h^{{}^{\prime}}_{1}(1)}$ . This means we have found a distinct occurrence of the episode which starts before $h_{1}$ . This contradicts the first statement of Property 1 that $h_{1}$ is the earliest distinct occurrence. Hence, $t_{h_{1}(N)}\leq t_{h^{{}^{\prime}}_{1}(N)}$ .

Suppose $t_{h_{i}(N)}\leq t_{h^{\prime}_{i}(N)}$ is true for some $i<p$. We show that $t_{h_{i+1}(N)}\leq t_{h^{\prime}_{i+1}(N)}$. Suppose, to the contrary, that $t_{h_{i+1}(N)}>t_{h^{\prime}_{i+1}(N)}$. Since the inter-event gaps are fixed, this implies $t_{h_{i+1}(1)}>t_{h^{\prime}_{i+1}(1)}$. Again, since $\mathcal{H}^{\prime}$ is a set of non-overlapped occurrences, we have $t_{h^{\prime}_{i}(N)}<t_{h^{\prime}_{i+1}(1)}$. Hence, we have

$$t_{h_{i}(N)}\leq t_{h^{\prime}_{i}(N)}<t_{h^{\prime}_{i+1}(1)}<t_{h_{i+1}(1)}.$$

But this contradicts Property 1, which states that $h_{i+1}$ is the first distinct occurrence starting after $t_{h_{i}(N)}$. Hence, $t_{h_{i+1}(N)}\leq t_{h^{\prime}_{i+1}(N)}$.

Now we prove the maximality of the set $\mathcal{H}$. Suppose $|\mathcal{H}^{\prime}|>|\mathcal{H}|$, i.e., $m>l$. From inequality (4) and the fact that $\mathcal{H}^{\prime}$ is non-overlapped, $t_{h_{l}(N)}\leq t_{h^{\prime}_{l}(N)}<t_{h^{\prime}_{l+1}(1)}$, so $h^{\prime}_{l+1}$ is an occurrence that starts after $t_{h_{l}(N)}$. But this contradicts the last statement of Property 1 that there is no distinct occurrence starting after $t_{h_{l}(N)}$. Hence, $|\mathcal{H}|\geq|\mathcal{H}^{\prime}|$ for every set of non-overlapped occurrences $\mathcal{H}^{\prime}$. This proves the maximality of the set $\mathcal{H}$.
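The selection rule described by Property 1 can be illustrated with a short sketch. Algorithm 2 itself is not reproduced in this section, so the code below is only an assumed reading of it: occurrences are represented as hypothetical `(start_time, end_time)` pairs, and the function keeps the earliest occurrence and then repeatedly the first one that starts strictly after the previously kept one ends.

```python
from typing import List, Tuple

# (start_time, end_time) of one distinct occurrence; this representation is
# an assumption made for illustration, not the paper's data structure.
Occurrence = Tuple[int, int]


def select_non_overlapped(occurrences: List[Occurrence]) -> List[Occurrence]:
    """Greedy earliest-first selection in the spirit of Property 1."""
    selected: List[Occurrence] = []
    last_end = float("-inf")
    # With fixed inter-event gaps, sorting by start time also sorts by end time.
    for start, end in sorted(occurrences):
        if start > last_end:  # starts strictly after the previous one ends
            selected.append((start, end))
            last_end = end
    return selected


# Hypothetical distinct occurrences of an episode spanning 3 time units.
occs = [(1, 4), (2, 5), (5, 8), (7, 10), (9, 12)]
chosen = select_non_overlapped(occs)
print(chosen)
```

On this toy input, three occurrences are kept, and by Proposition 1 no other non-overlapped set can contain more.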

We can now sum up our method of finding a subset of serial episodes that best characterizes the data sequence. We use the coding scheme as described here and use a greedy heuristic to find the subset that achieves the best compression. This is essentially the same as the CSC-2 algorithm of [6]. However, we use Algorithm 2 to get non-overlapped occurrences of episodes from distinct occurrences and then use that frequency in selecting episodes with best overlap-score. In the next section we present an interesting generative model that provides some statistical justification for our algorithm based on selecting an episode with best overlap-score.

3 A Generative model for Pairs of Episodes

In this section we present a class of generative models which is a specialized class of HMMs. (This model is motivated by an HMM-based model for single episodes proposed in [14].) An HMM contains a Markov chain over some state space, but the states are unobservable. In each state, a symbol is emitted from a finite symbol set according to a symbol probability distribution associated with that state. The stream of symbols is the observable output sequence of the model.

In our case, the symbol set would be the set of event-types and thus the observed output sequence would be a sequence of event-types. We think of this as an event sequence where the event-times are not explicitly specified. For occurrences, and hence for frequencies, of general serial episodes (without any inter-event times specified) only the time-ordering of the event-types in the data sequence is important; actual event times play no role. Hence in this section we consider serial episodes without any fixed inter-event times.

In our generative model, the state transition probability matrix of the Markov chain is parameterized by a single parameter, which is called the noise parameter. For every pair of serial episodes, we have one such generative model. For a small enough value of the noise parameter, the output from the model is an event sequence containing many non-overlapped occurrences of the two corresponding episodes. While occurrences of any one episode would be non-overlapped in the output event sequence, an occurrence of one episode may be arbitrarily interleaved with occurrences of the other episode. Thus this is a good class of generative models for a data source where a pair of episodes form the most prominent frequent patterns (under the frequency based on non-overlapped occurrences). This is the first instance of such a statistical generative model for multiple episodes.

Consider the family of such models containing a model for every possible pair of episodes. Let $\Lambda_{\alpha\beta}$ denote the model for the pair of episodes $\alpha$ and $\beta$. Given an event sequence we can now ask which model from this class is the maximum likelihood estimate. This would essentially tell us which pair of episodes best ‘explains’ the data sequence in the sense of maximizing the likelihood. We show that such a pair of episodes is not necessarily the two most frequent episodes. The data likelihood depends both on the frequencies of the episodes and on the number of events in the data that the occurrences of the two episodes share. Thus, we show, for example, that $\Lambda_{\alpha\beta}$ may have better likelihood than $\Lambda_{\alpha\gamma}$ even when $\beta$ has lower frequency than $\gamma$, if the overlap between $\alpha$ and $\beta$ is much less than that between $\alpha$ and $\gamma$. The results we present here provide some statistical justification for the coding scheme and the algorithm that we presented in the previous section.

3.1 The HMM model

An HMM is specified as $\Lambda=(\mathcal{P},\pi,b)$ where $\mathcal{P}=[p_{ij}]$ is the state transition probability matrix of the Markov chain with state space, say, $S$, $\pi$ is the vector of initial state probabilities and $b=(b_{q},\;q\in S)$ where $b_{q}$ denotes the symbol probability distribution in state $q$. Let $\textbf{o}=(o_{1},o_{2},\cdots,o_{T})$ be an observed symbol (or output) sequence. The joint probability of the output sequence $\textbf{o}$ and a state sequence $\textbf{q}=(q_{1},q_{2},\cdots,q_{T})$ given an HMM $\Lambda$ is

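$$P(\textbf{o},\textbf{q}|\Lambda)=\pi_{q_{1}}b_{q_{1}}(o_{1})\prod_{t=2}^{T}p_{q_{t-1}q_{t}}b_{q_{t}}(o_{t}).\qquad(5)$$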

To determine the model with maximum likelihood, we need to find $P(\textbf{o}|\Lambda)$. This data likelihood is often assessed by evaluating the joint probability (5) along a most likely state sequence, $\textbf{q}^{*}$, where

$$\textbf{q}^{*}=\arg\max_{\textbf{q}}P(\textbf{o},\textbf{q}|\Lambda).\qquad(6)$$

We also follow this simplification, which is often employed by methods using HMMs. Thus we assume $P(\textbf{o}|\Lambda_{1})>P(\textbf{o}|\Lambda_{2})$ if $P(\textbf{o},\textbf{q}^{*}|\Lambda_{1})>P(\textbf{o},\textbf{q}^{*}|\Lambda_{2})$, where $\textbf{q}^{*}$ denotes the most likely state sequence under the respective model. (We refer to this as assumption A1.)
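As a minimal sketch of how the joint probability of an output sequence and a state sequence is evaluated, the following function works in log space for a generic HMM given as plain dictionaries. The names `log_joint`, `pi`, `trans` and `emit`, and the toy two-state model, are illustrative assumptions and not part of the paper.

```python
import math


def _log(p):
    """Logarithm that maps probability zero to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def log_joint(obs, states, pi, trans, emit):
    """log P(o, q | Lambda): initial term plus one transition and one
    emission term per subsequent time step."""
    lp = _log(pi.get(states[0], 0.0)) + _log(emit[states[0]].get(obs[0], 0.0))
    for t in range(1, len(obs)):
        lp += _log(trans[states[t - 1]].get(states[t], 0.0))
        lp += _log(emit[states[t]].get(obs[t], 0.0))
    return lp


# Toy model (hypothetical): "E" always emits "A"; "Z" is a noise-like state
# that emits "A" or "B" uniformly. Noise is entered with probability 0.2.
pi = {"E": 0.8, "Z": 0.2}
trans = {s: {"E": 0.8, "Z": 0.2} for s in ("E", "Z")}
emit = {"E": {"A": 1.0}, "Z": {"A": 0.5, "B": 0.5}}

lp = log_joint(["A", "B"], ["E", "Z"], pi, trans, emit)
print(math.exp(lp))
```

Under assumption A1, two models would be compared by evaluating this quantity along each model's own most likely state sequence.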

Let $\Lambda_{\alpha\beta}$ denote the model corresponding to the pair of episodes $\alpha$ and $\beta$. We give a full description of this model below. For the sake of simplicity, we assume that both are $N$-node episodes. The model depends on whether or not the two episodes share any event types and hence we consider two separate cases (wherever necessary):

• Case-I: $\alpha$ and $\beta$ have no common event-types, i.e., $\alpha[i]\neq\beta[j]$ $\forall i,j\in\{1,2,\dots,N\}$.

• Case-II: $\alpha$ and $\beta$ have some common event-types.

The State Space

The number of states in the HMM is $4N^{2}+1$. The state space can be partitioned into two parts: episode states, $\mathcal{S}_{e}$, comprising $2N^{2}$ states, and noise states, $\mathcal{S}_{n}$, comprising $2N^{2}+1$ states. Episode states are denoted by $S^{k}_{i,j}$, $k\in\{1,2\},\;i,j\in\{1,2,\dots,N\}$. The noise states are given by $N^{k}_{i,j}$, $k\in\{1,2\},\;i,j\in\{1,2,\dots,N\}$, and the state $N^{0}_{1,1}$.

Emission structure

The symbol probability distribution for the episode states is a delta function. The episode state $S^{1}_{i,j}$ emits the symbol $\alpha[i]$ with probability 1, whereas $S^{2}_{i,j}$ emits the symbol $\beta[j]$ with probability 1. For each noise state, the symbol probability distribution is uniform over the alphabet set $\mathcal{E}$ . (We denote $|\mathcal{E}|=M$ ).
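The state space and emission structure described above can be enumerated directly. The sketch below is only illustrative: the tuple encoding `(kind, k, i, j)` and the function names are assumptions, and the transition structure of Figs. 1 to 3 is not modelled.

```python
from itertools import product


def build_states(alpha, beta):
    """Enumerate episode states S^k_{i,j} and noise states N^k_{i,j}, N^0_{1,1}."""
    assert len(alpha) == len(beta), "both episodes are assumed to have N nodes"
    idx = range(1, len(alpha) + 1)
    episode = [("S", k, i, j) for k in (1, 2) for i, j in product(idx, idx)]
    noise = [("N", 0, 1, 1)]
    noise += [("N", k, i, j) for k in (1, 2) for i, j in product(idx, idx)]
    return episode, noise


def emission(state, alpha, beta, alphabet):
    """Symbol distribution of a state, as a dict from symbol to probability."""
    kind, k, i, j = state
    if kind == "S":
        # S^1_{i,j} emits alpha[i]; S^2_{i,j} emits beta[j] (1-based indices).
        symbol = alpha[i - 1] if k == 1 else beta[j - 1]
        return {symbol: 1.0}
    # Every noise state is uniform over the alphabet.
    return {e: 1.0 / len(alphabet) for e in alphabet}


alpha = ["A", "B", "C"]
beta = ["D", "B", "E"]
alphabet = ["A", "B", "C", "D", "E", "F", "G"]
episode_states, noise_states = build_states(alpha, beta)
print(len(episode_states), len(noise_states))
```

For $N=3$ this gives $2N^{2}=18$ episode states and $2N^{2}+1=19$ noise states, i.e. $4N^{2}+1=37$ states in total.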

Transition structure

Under Case-I (where $\alpha$ and $\beta$ do not share any event types), the state transition probabilities out of episode states are given by Fig. 1. Under Case-II (where $\alpha$ and $\beta$ may share some event types), the transition probabilities out of episode states are also as given by the state transition structure of Fig. 1, except for the states $S^{1}_{i,(j\text{ mod }N)+1}$ and $S^{2}_{(i\text{ mod }N)+1,j}$ where $i,j$ are such that $\alpha[(i\text{ mod }N)+1]=\beta[(j\text{ mod }N)+1]$. For such $i,j$ the transition probabilities are as given in Fig. 2.

For all the noise states $N^{k}_{i,j}$, $k\in\{1,2\}$, the transition structure is as shown in Fig. 3. The noise state $N^{0}_{1,1}$ transits with probability $\frac{1-\eta}{2}$ to each of the episode states $S^{2}_{1,1}$ and $S^{1}_{1,1}$, or remains in $N^{0}_{1,1}$ with probability $\eta$.

It may be noted that all transition probabilities are determined by a single parameter, $\eta$, which is called the noise parameter. The values of individual transition probabilities are fixed in an intuitively simple manner. From any state, the transition into a noise state has probability $\eta$. The remaining probability is equally divided between all reachable episode states.

One can also see the logic of the state transition structure intuitively. Recall that in state $S^{1}_{i,j}$ we emit symbol $\alpha[i]$. So, after this we can either go to $S^{1}_{(i+1),j}$ to emit the next event type of $\alpha$ or go to $S^{2}_{(i+1),j}$ to now emit an event type from $\beta$. This allows for arbitrary overlap of occurrences of $\alpha$ and $\beta$. Similarly, from $S^{2}_{i,j}$ (after emitting $\beta[j]$) we can either go to $S^{2}_{i,(j+1)}$ or $S^{1}_{i,(j+1)}$. Since event types constituting an occurrence of an episode need not be contiguous, from the episode states we can go to the noise states and cycle there zero or more times before coming back to episode states. After emitting the last event type of, say, $\alpha$, the next event type of $\alpha$ that can be emitted is $\alpha[1]$. Hence, from $S^{1}_{N,j}$ we should go to either $S^{1}_{1,j}$ or $S^{2}_{1,j}$ (or a noise state). That is why, in the transition structure as given, whenever an index is incremented it is always modulo $N$.

All the above is fine when $\alpha$ and $\beta$ do not share event types. Suppose they share an event type. When that event type appears in the data it could be part of an occurrence of only $\alpha$ or that of only $\beta$ or neither. These possibilities are all accounted for by the above transition structure. However, there is one more possibility, namely, it is part of an occurrence of both $\alpha$ as well as $\beta$ ; that is, the two occurrences share an event. The transition structure given in Fig. 2 ensures that our generative model includes this possibility too.

Initial states

If $\alpha[1]\neq\beta[1]$, the initial state is $N^{0}_{1,1}$ with probability $\eta$, $S^{1}_{1,1}$ with probability $\frac{1-\eta}{2}$ and $S^{2}_{1,1}$ with probability $\frac{1-\eta}{2}$. If $\alpha[1]=\beta[1]$, the initial state is $N^{0}_{1,1}$ with probability $\eta$ and $S^{1}_{1,2}$ with probability $1-\eta$.

An Example

Consider a model $\Lambda_{\alpha\beta}$, where $\alpha=A\to B\to C$ and $\beta=D\to B\to E$. Let the alphabet set be $\mathcal{E}=\{A,B,C,D,E,F,G\}$. We show a few example state sequences and output sequences of length 10 that can be emitted by $\Lambda_{\alpha\beta}$ in Fig. 4. As can be seen from the figure, the output sequence contains occurrences of $\alpha$ and $\beta$ that may be arbitrarily interleaved. Here we have $\alpha[2]=\beta[2]$. Hence transitions out of episode states $S^{1}_{1,2},S^{2}_{2,1}$ are as given in Fig. 2 and for all other episode states they are as given in Fig. 1. The special transition structure for $S^{1}_{1,2},S^{2}_{2,1}$ allows some occurrences of $\alpha$ and $\beta$ to share an event (of event type $B$), as can be seen in row 3 of Fig. 4 (the transition from $S^{2}_{2,1}$ to $S^{2}_{3,2}$).

3.2 Analysis

In this section we derive expressions for the likelihood (of a joint state and output sequence) of our HMM model and use this to compare likelihoods of models corresponding to different pairs of episodes. The expressions depend on whether or not the pairs of episodes share event types and hence the two cases are dealt with separately.

In all our analysis we assume $\eta<\frac{M}{M+8}$ where $M=|\mathcal{E}|$ .
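The role of this bound becomes clear from a short rearrangement, which is used repeatedly in the analysis that follows. For $0<\eta<1$ we have

$$\eta<\frac{M}{M+8}\;\iff\;\eta(M+8)<M\;\iff\;8\eta<(1-\eta)M\;\iff\;\frac{(1-\eta)M}{8\eta}>1,$$

and therefore also

$$\frac{(1-\eta)M}{2\eta}>\frac{(1-\eta)M}{4\eta}>\frac{(1-\eta)M}{8\eta}>1.$$

For instance, with $M=7$ the bound reads $\eta<7/15$, and $\eta=0.1$ gives $\frac{(1-\eta)M}{8\eta}=7.875$.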

3.2.1 Case-I

Here, $\alpha[i]\neq\beta[j]\quad\forall i,j\in\{1,2,\dots,N\}$. Hence, every transition into an episode state has probability $\frac{1-\eta}{2}$. Decomposing any state sequence into two sub-sequences $\textbf{q}_{e}$ and $\textbf{q}_{n}$, corresponding to the episode and noise states, we have the following observation: in equation (5), whenever the transition probability $p_{q_{t-1}q_{t}}$ is $(1-\eta)/2$, the state $q_{t}$ has to be an episode state, and hence $b_{q_{t}}(o_{t})$ is either $1$ or $0$. Similarly, whenever $p_{q_{t-1}q_{t}}$ is $\eta$, the state $q_{t}$ is a noise state and the corresponding $b_{q_{t}}(o_{t})$ is $\frac{1}{M}$. Thus for any state sequence with non-zero probability, we can write the joint probability as

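$$P(\textbf{o},\textbf{q}|\Lambda)=\left(\frac{\eta}{M}\right)^{|\textbf{q}_{n}|}\left(\frac{1-\eta}{2}\right)^{|\textbf{q}_{e}|}.\qquad(7)$$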

Here, $|\textbf{q}_{n}|$ and $|\textbf{q}_{e}|$ denote lengths of the respective sub-sequences. Since $|\textbf{q}_{e}|+|\textbf{q}_{n}|=|\textbf{q}|=T$ (the length of output or event sequence), (7) can be written as

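$$P(\textbf{o},\textbf{q}|\Lambda)=\left(\frac{\eta}{M}\right)^{T}\left(\frac{(1-\eta)M}{2\eta}\right)^{|\textbf{q}_{e}|}.\qquad(8)$$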

Under our assumption we have $\eta<\frac{M}{M+8}$, and hence $\frac{(1-\eta)M}{2\eta}>1$. Then, $P(\textbf{o},\textbf{q}|\Lambda)$ is monotonically increasing in $|\textbf{q}_{e}|$. Thus, the most likely state sequence is the one that spends the longest time in episode states. Due to the constraints imposed by the state transition structure, in any state sequence of $\Lambda_{\alpha\beta}$ having non-zero probability, the episode states corresponding to a particular episode have to occur in sequence. Moreover, when a particular episode state $S^{1}_{i,j}$ or $S^{2}_{i,j}$ is revisited, it implies that one cycle of all the event-types corresponding to that episode has been emitted. Suppose $f^{*}_{\alpha}$ and $f^{*}_{\beta}$ are the maximum possible numbers of non-overlapped occurrences of $\alpha$ and $\beta$ respectively in $\textbf{o}$. Since $\alpha$ and $\beta$ do not share any event type and at each episode state we emit one symbol, the most likely state sequence has at least $N(f^{*}_{\alpha}+f^{*}_{\beta})$ episode states in it, i.e., $|\textbf{q}^{*}_{e}|\geq Nf^{*}_{\alpha}+Nf^{*}_{\beta}$.

For the sake of simplicity, we make the assumption (referred to as A2) that there is no state sequence with non-zero probability that includes any incomplete occurrence of either of the episodes. (We can ensure that assumption A2 always holds by modifying our model: we add an extra symbol, an “end-of-sequence” marker, at the end of the output sequence $\textbf{o}$ and modify the symbol probability distribution of the noise states $N^{1}_{N,1}$ and $N^{2}_{1,N}$, following the trick used with HMMs for single episodes in [19].)

Under the assumption A2, we have $|\textbf{q}^{*}_{e}|=Nf^{*}_{\alpha}+Nf^{*}_{\beta}$ . Consider two models $\Lambda_{\alpha\beta}$ and $\Lambda_{\alpha\gamma}$ .

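$$\frac{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})}{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})}=\left(\frac{(1-\eta)M}{2\eta}\right)^{N(f^{*}_{\beta}-f^{*}_{\gamma})},$$

where $\textbf{q}^{*}$ denotes the most likely state sequence under the respective model and both probabilities are evaluated using (8).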

Hence, under assumption A1, if $f^{*}_{\beta}>f^{*}_{\gamma}$ , we have $P(\textbf{o}|\Lambda_{\alpha\beta})>P(\textbf{o}|\Lambda_{\alpha\gamma})$ . This essentially implies that given an already selected episode (the episode $\alpha$ here), if we want to select the next one from the set of episodes that do not share any event type with the already selected one, we should choose the most frequent one from that set.
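The Case-I comparison can be checked numerically. The sketch below evaluates the logarithm of the joint probability when each noise state contributes $\eta/M$ and each episode state contributes $(1-\eta)/2$, written in the rearranged form with $T$ factored out. The function names and the numbers in the example are illustrative assumptions.

```python
import math


def log_joint_case1(T, M, eta, n_episode_states):
    """log of (eta/M)^T * ((1-eta)M / (2 eta))^{|q_e|}."""
    base = T * math.log(eta / M)
    gain_per_episode_state = math.log((1 - eta) * M / (2 * eta))
    return base + n_episode_states * gain_per_episode_state


def log_ratio_case1(T, M, eta, N, f_alpha, f_beta, f_gamma):
    """log P(o,q*|Lambda_ab) - log P(o,q*|Lambda_ag) under assumption A2."""
    lp_beta = log_joint_case1(T, M, eta, N * (f_alpha + f_beta))
    lp_gamma = log_joint_case1(T, M, eta, N * (f_alpha + f_gamma))
    return lp_beta - lp_gamma


# Hypothetical numbers: 7 event types, eta = 0.1 (below M/(M+8) = 7/15),
# 3-node episodes, sequence length 100.
T, M, eta, N = 100, 7, 0.1, 3
ratio = log_ratio_case1(T, M, eta, N, f_alpha=10, f_beta=8, f_gamma=5)
print(ratio > 0)
```

The terms in $T$ and $f^{*}_{\alpha}$ cancel in the difference, so only $N(f^{*}_{\beta}-f^{*}_{\gamma})$ matters, which is why the more frequent candidate wins in this case.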

3.2.2 Case-II

In this case, we have some episode states with a $(1-\eta)/2$ transition into them while some have $(1-\eta)$ transition into them. It should be noted that because of the transition structure, a symbol emitted from an episode state with $1-\eta$ transition into it is part of an occurrence of both the episodes. It means that the event is shared across occurrences of the two episodes. On the other hand, a symbol emitted from an episode state with $(1-\eta)/2$ transition into it is part of an occurrence of only one episode and hence is not shared. Now, we further decompose the episode states part of any state sequence, $\textbf{q}_{e}$ into two parts $\textbf{q}_{1},\textbf{q}_{2}$ . The episode states corresponding to event types that are not shared form $\textbf{q}_{1}$ while those corresponding to shared ones form $\textbf{q}_{2}$ . Since, every state emits one symbol, we have $|\textbf{q}_{n}|+|\textbf{q}_{1}|+|\textbf{q}_{2}|=T$ . Now, the joint probability of an output and state sequence is given by

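$$P(\textbf{o},\textbf{q}|\Lambda)=\left(\frac{\eta}{M}\right)^{|\textbf{q}_{n}|}\left(\frac{1-\eta}{2}\right)^{|\textbf{q}_{1}|}(1-\eta)^{|\textbf{q}_{2}|}.$$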

Let us consider a state sequence $\textbf{q}$ (having non-zero probability) that contains $f_{\alpha}$ and $f_{\beta}$ occurrences of the episodes $\alpha$ and $\beta$ respectively. Let the number of events shared between these occurrences be $O_{\alpha\beta}$. Then, the number of events covered by the occurrences of the episodes in the output sequence is $(Nf_{\alpha}+Nf_{\beta}-O_{\alpha\beta})$, out of which $(Nf_{\alpha}+Nf_{\beta}-2O_{\alpha\beta})$ events are not shared and $O_{\alpha\beta}$ events are shared. Under assumption A2, we have $|\textbf{q}_{2}|=O_{\alpha\beta}$ and $|\textbf{q}_{1}|=Nf_{\alpha}+Nf_{\beta}-2O_{\alpha\beta}$. So, for this state sequence,

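$$P(\textbf{o},\textbf{q}|\Lambda_{\alpha\beta})=\left(\frac{\eta}{M}\right)^{T}\left(\frac{(1-\eta)M}{2\eta}\right)^{Nf_{\alpha}+Nf_{\beta}-2O_{\alpha\beta}}\left(\frac{(1-\eta)M}{\eta}\right)^{O_{\alpha\beta}}=\left(\frac{\eta}{M}\right)^{T}\left(\frac{(1-\eta)M}{2\eta}\right)^{Nf_{\alpha}+Nf_{\beta}}\left(\frac{(1-\eta)M}{4\eta}\right)^{-O_{\alpha\beta}}.\qquad(9)$$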

For $\eta<\frac{M}{M+8}$, we have $\frac{(1-\eta)M}{4\eta}>1$. So, we see that the joint probability is an increasing function of the number of occurrences of the episodes and, for fixed $f_{\alpha}$ and $f_{\beta}$, a decreasing function of the number of events shared between the occurrences.
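The two monotonicity claims can be checked with a small sketch. It assumes, as described above, that each noise state contributes a factor $\eta/M$, each non-shared episode state $(1-\eta)/2$ and each shared episode state $1-\eta$, with the state counts given under assumption A2. The function name and the numbers are hypothetical.

```python
import math


def log_joint_case2(T, M, eta, N, f_alpha, f_beta, shared):
    """log joint probability of a state sequence with the given occurrence
    counts and number of shared events (Case-II, under assumption A2)."""
    n_unshared = N * f_alpha + N * f_beta - 2 * shared  # |q_1|
    n_shared = shared                                   # |q_2|
    n_noise = T - n_unshared - n_shared                 # |q_n|
    if n_unshared < 0 or n_noise < 0:
        raise ValueError("inconsistent counts for a sequence of length T")
    return (n_noise * math.log(eta / M)
            + n_unshared * math.log((1 - eta) / 2)
            + n_shared * math.log(1 - eta))


T, M, eta, N = 100, 7, 0.1, 3
base = log_joint_case2(T, M, eta, N, f_alpha=5, f_beta=5, shared=2)
fewer_occurrences = log_joint_case2(T, M, eta, N, f_alpha=5, f_beta=4, shared=2)
fewer_shared = log_joint_case2(T, M, eta, N, f_alpha=5, f_beta=5, shared=1)
print(base > fewer_occurrences, fewer_shared > base)
```

Each additional shared event changes the log probability by $\log\frac{4\eta}{(1-\eta)M}$, which is negative under the assumed bound on $\eta$.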

Let $f^{*}_{\alpha}$ and $f^{*}_{\beta}$ be the maximum possible numbers of non-overlapped occurrences of $\alpha$ and $\beta$ respectively in $\textbf{o}$. So, the most likely state sequence ($\textbf{q}^{*}$) is the one which emits all the $f^{*}_{\alpha}+f^{*}_{\beta}$ occurrences from the episode states and, among all such state sequences, it is the one which shares the minimum number of events between these occurrences. Let $O^{*}_{\alpha\beta}$ be the number of shared events corresponding to $\textbf{q}^{*}$. Then, from (9),

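$$P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})=\left(\frac{\eta}{M}\right)^{T}\left(\frac{(1-\eta)M}{2\eta}\right)^{Nf^{*}_{\alpha}+Nf^{*}_{\beta}}\left(\frac{(1-\eta)M}{4\eta}\right)^{-O^{*}_{\alpha\beta}}.\qquad(10)$$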

We will have a similar expression for the model $\Lambda_{\alpha\gamma}$ and hence

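$$\frac{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})}{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})}=\left(\frac{(1-\eta)M}{2\eta}\right)^{Nf^{*}_{\beta}-Nf^{*}_{\gamma}}\left(\frac{4\eta}{(1-\eta)M}\right)^{O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma}}.\qquad(11)$$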

Thus, under assumption A1, we see that if $f^{*}_{\beta}=f^{*}_{\gamma}$, the likelihood is higher for the pair of episodes that shares fewer events. In general, the relative likelihood of $\Lambda_{\alpha\beta}$ and $\Lambda_{\alpha\gamma}$ depends both on the frequencies of $\beta$ and $\gamma$ and on the difference in their overlaps with $\alpha$. To better understand this, let us define two metrics to rate any other episode with respect to episode $\alpha$.

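$$Overlap\text{-}score_{1}(\beta,\alpha)=Nf^{*}_{\beta}-O^{*}_{\alpha\beta},\qquad Overlap\text{-}score_{2}(\beta,\alpha)=Nf^{*}_{\beta}-\frac{1}{2}O^{*}_{\alpha\beta}.$$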

We will show that, given an episode $\alpha$, if the values of both metrics for an episode $\beta$ are higher than those for an episode $\gamma$, then $\Lambda_{\alpha\beta}$ has higher data likelihood than $\Lambda_{\alpha\gamma}$ (under our assumption on $\eta$ and under A1 and A2).

Case-a: $f^{*}_{\beta}>f^{*}_{\gamma},O^{*}_{\alpha\beta}<O^{*}_{\alpha\gamma}$

Under assumption A1, from (11), it is easily seen that $P(\textbf{o}|\Lambda_{\alpha\beta})>P(\textbf{o}|\Lambda_{\alpha\gamma})$. Also, it is easy to check that $f^{*}_{\beta}>f^{*}_{\gamma}$ and $-O^{*}_{\alpha\beta}>-O^{*}_{\alpha\gamma}$ imply $Overlap\text{-}score_{1}(\beta,\alpha)>Overlap\text{-}score_{1}(\gamma,\alpha)$ and $Overlap\text{-}score_{2}(\beta,\alpha)>Overlap\text{-}score_{2}(\gamma,\alpha)$.

Case-b: $f^{*}_{\beta}>f^{*}_{\gamma},O^{*}_{\alpha\beta}>O^{*}_{\alpha\gamma}$

In this scenario, depending on the values of overlaps, the two metrics for $\beta$ may be greater or smaller than those of $\gamma$ . Hence we consider these two sub-cases.

Case-b1: $Overlap\text{-}score_{1}(\beta,\alpha)>Overlap\text{-}score_{1}(\gamma,\alpha)$ and $Overlap\text{-}score_{2}(\beta,\alpha)>Overlap\text{-}score_{2}(\gamma,\alpha)$ .

Since, $Overlap\text{-}score_{1}(\beta,\alpha)>Overlap\text{-}score_{1}(\gamma,\alpha)\implies Nf^{*}_{\beta}-O^{*}_{\alpha\beta}>Nf^{*}_{\gamma}-O^{*}_{\alpha\gamma}\implies Nf^{*}_{\beta}-Nf^{*}_{\gamma}>O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma}$ , we have from (11),

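$$\frac{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})}{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})}=\left(\frac{(1-\eta)M}{2\eta}\right)^{Nf^{*}_{\beta}-Nf^{*}_{\gamma}}\left(\frac{4\eta}{(1-\eta)M}\right)^{O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma}}>\left(\frac{(1-\eta)M}{2\eta}\right)^{O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma}}\left(\frac{4\eta}{(1-\eta)M}\right)^{O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma}}=2^{\,O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma}}>1.$$

Hence, under assumption A1, $P(\textbf{o}|\Lambda_{\alpha\beta})>P(\textbf{o}|\Lambda_{\alpha\gamma})$.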

Case-b2: $Overlap\text{-}score_{1}(\gamma,\alpha)>Overlap\text{-}score_{1}(\beta,\alpha)$ and $Overlap\text{-}score_{2}(\gamma,\alpha)>Overlap\text{-}score_{2}(\beta,\alpha)$.

Here, $Overlap\text{-}score_{2}(\gamma,\alpha)>Overlap\text{-}score_{2}(\beta,\alpha)\implies Nf^{*}_{\gamma}-\frac{1}{2}O^{*}_{\alpha\gamma}>Nf^{*}_{\beta}-\frac{1}{2}O^{*}_{\alpha\beta}\implies Nf^{*}_{\beta}-Nf^{*}_{\gamma}<\frac{1}{2}(O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma})$. Let $Nf^{*}_{\beta}-Nf^{*}_{\gamma}$ be $x$; note that $x>0$ in Case-b. Then we can write $(O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma})=2x+\xi$, where $\xi>0$. Since we assume $\eta<\frac{M}{M+8}$, we have $\frac{(1-\eta)M}{8\eta}>1$. Now, from (11),

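$$\frac{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\beta})}{P(\textbf{o},\textbf{q}^{*}|\Lambda_{\alpha\gamma})}=\left(\frac{(1-\eta)M}{2\eta}\right)^{x}\left(\frac{4\eta}{(1-\eta)M}\right)^{2x+\xi}=\left(\frac{8\eta}{(1-\eta)M}\right)^{x}\left(\frac{4\eta}{(1-\eta)M}\right)^{\xi}<1.$$

Hence, under assumption A1, $P(\textbf{o}|\Lambda_{\alpha\gamma})>P(\textbf{o}|\Lambda_{\alpha\beta})$.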

The results presented here provide statistical justification for our algorithm presented in the previous section, where we select episodes based on their overlap score as given by (3). Suppose we have selected only $\alpha$ and want to choose either $\beta$ or $\gamma$ as our second episode. Based on (3), this choice depends on the sign of $(N-1)(f^{*}_{\beta}-f^{*}_{\gamma})-(O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma})$, which is a figure of merit motivated by considerations of coding efficiency. This is essentially the same as the difference in $Overlap\text{-}score_{1}$ between $\beta$ and $\gamma$, namely $N(f^{*}_{\beta}-f^{*}_{\gamma})-(O^{*}_{\alpha\beta}-O^{*}_{\alpha\gamma})$, which is a figure of merit that determines which pair of episodes maximizes the data likelihood.
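A small numerical sketch may help to see how the two metrics relate to the likelihood comparison. It takes $Overlap\text{-}score_{1}=Nf^{*}-O^{*}$ and $Overlap\text{-}score_{2}=Nf^{*}-\frac{1}{2}O^{*}$, as used in the implications of Case-b1 and Case-b2, and evaluates the logarithm of the likelihood ratio between $\Lambda_{\alpha\beta}$ and $\Lambda_{\alpha\gamma}$. The function names and all numbers are hypothetical.

```python
import math


def overlap_score_1(N, f, shared):
    return N * f - shared


def overlap_score_2(N, f, shared):
    return N * f - 0.5 * shared


def log_likelihood_ratio(M, eta, N, f_beta, o_beta, f_gamma, o_gamma):
    """log of P(o,q*|Lambda_ab) / P(o,q*|Lambda_ag); terms in T and in
    f_alpha cancel in the ratio."""
    gain = math.log((1 - eta) * M / (2 * eta))     # per covered event
    penalty = math.log((1 - eta) * M / (4 * eta))  # per shared event
    return N * (f_beta - f_gamma) * gain - (o_beta - o_gamma) * penalty


M, eta, N = 7, 0.1, 3

# Case-b1-like example: beta is more frequent and only slightly more overlapped.
b1 = log_likelihood_ratio(M, eta, N, f_beta=6, o_beta=4, f_gamma=5, o_gamma=2)

# Case-b2-like example: beta is more frequent but far more overlapped.
b2 = log_likelihood_ratio(M, eta, N, f_beta=6, o_beta=10, f_gamma=5, o_gamma=2)

print(b1 > 0, b2 < 0)
```

In the first example both scores favour $\beta$ and the log ratio is positive; in the second both favour $\gamma$ and it is negative, matching the two sub-cases.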

4 Application to Text Classification

In this section we present a novel application of our method of finding a ‘good’ subset of frequent episodes to characterize data. The application is in the domain of text classification. Most text classification techniques use a bag-of-words approach where each document (or data sample) is represented as a collection of words that belong to a dictionary. The dictionary is usually taken to be the set of all unique words present in the training corpus after preliminary preprocessing. This makes the dictionary large, leading to high dimensionality of the feature vector representation of each document. Other vector space representations of documents (e.g., word-averaging in [20]) also depend largely on the dictionary of words used.

One can think of a text document as a sequence of events with event types being the words. Then using all training data in an unsupervised fashion, we can use our method to find the ‘best’ subset of serial episodes that represent the data well. These episodes are likely to contain all specific words that are important for this document collection. Thus, a dictionary built using only the words (event-types) found in the subset of discovered episodes is likely to be useful. This is what we explore here.

Let the dictionary obtained by using all unique words (after the usual preprocessing) from the training data collection be termed Dictionary-I. We run our algorithm for discovering the ‘best’ subset of serial episodes (that achieve best data compression) on the entire training corpus. We form a new dictionary as the set of all the unique words (event-types) that are present in the non-singleton episodes (i.e., episodes of size 2 or more) discovered by our algorithm. We call this smaller dictionary Dictionary-II. In each case we represent documents as vectors over one of these dictionaries and investigate standard classifiers such as Naive-Bayes and SVM. Using simulations on some standard benchmark datasets we show that we get a large dimensionality reduction without any loss of classification accuracy.
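The construction of Dictionary-II can be sketched in a few lines. The example assumes that the mining algorithm returns each discovered episode as a tuple of words; the episodes, the function names and the sample document are hypothetical.

```python
from collections import Counter


def build_dictionary_ii(episodes):
    """Collect the unique words of all non-singleton episodes (size >= 2)."""
    words = set()
    for episode in episodes:
        if len(episode) >= 2:
            words.update(episode)
    return sorted(words)


def bow_vector(tokens, dictionary):
    """Raw word-frequency vector of a document over the given dictionary;
    words outside the dictionary are ignored."""
    counts = Counter(tokens)
    return [counts[w] for w in dictionary]


# Hypothetical output of the episode mining step.
episodes = [("interest", "rate"), ("oil", "price", "rise"), ("the",)]
dictionary_ii = build_dictionary_ii(episodes)
vector = bow_vector(["oil", "price", "the", "oil"], dictionary_ii)
print(dictionary_ii, vector)
```

The singleton episode contributes nothing to the dictionary, so its word is dropped from the feature vector as well.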

Typically, in training data for text classification, we have many documents but each document is short. Mining for episodes that can achieve significant compression individually for each document does not give any interesting episodes, mainly because each sequence is short. We string together all training data (of all classes) to make one long document and we mine for a set of frequent serial episodes that achieve best compression (using the algorithm discussed in this paper). We employ special symbols to denote the end of each training document and modify our mining algorithm so that no occurrence of an episode spans two different documents.
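A minimal sketch of this preprocessing step is given below. How the mining algorithm is modified is not detailed here, so the boundary check is only one assumed way of rejecting occurrences that cross a document boundary; the marker string `<eod>` and the function names are hypothetical.

```python
END = "<eod>"  # hypothetical end-of-document marker


def concatenate(docs):
    """String all documents together, closing each one with the marker."""
    sequence = []
    for doc in docs:
        sequence.extend(doc)
        sequence.append(END)
    return sequence


def spans_documents(sequence, positions):
    """True if an occurrence at these positions crosses a document boundary,
    i.e. a marker lies between its first and last event."""
    lo, hi = min(positions), max(positions)
    return END in sequence[lo:hi + 1]


docs = [["a", "b"], ["c", "d", "e"]]
seq = concatenate(docs)
print(seq)
print(spans_documents(seq, [1, 3]), spans_documents(seq, [3, 5]))
```

An occurrence candidate for which `spans_documents` returns true would simply be discarded before counting frequencies.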

4.1 Experimental Results

4.1.1 Datasets

We compared the classification accuracies on three standard benchmarks, 20-Newsgroup, Reuters-21578 and WebKB, downloaded from a publicly available repository of datasets for single-label text categorization (http://ana.cachopo.org/datasets-for-single-label-text-categorization). We used the preprocessed stemmed version of these datasets. For Reuters-21578, we use the 8-class stemmed version of the dataset; WebKB is a 4-class dataset while 20-Newsgroup is a 20-class dataset. For these, Dictionary-I is the set of all unique words present in the stemmed training data. Apart from these, we also used the movie-review dataset prepared by Pang and Lee (2004), specifically the polarity dataset v2.0 (http://www.cs.cornell.edu/people/pabo/movie-review-data/). This sentiment analysis dataset consists of 2000 movie reviews. As preprocessing steps, we converted all letters to lower case and removed all words less than 3 characters long. No stop words other than ‘and’ and ‘the’ were removed. Dictionary-I was created from this preprocessed training data.

4.1.2 Feature Vectors

We compared the text classification accuracies using two different models:

• Bag-of-words (BoW): Each data sample is converted into a feature vector whose dimension is the size of the dictionary used. Each feature represents the frequency of that word in that data sample, except for the Movie Review dataset where, as in [21], we used binary features denoting presence or absence of the word in the corresponding document. Further, tf-idf weighting along with cosine normalization was applied to these feature vectors, as explained in the next subsection.

• Average Embedding (VecAvg) [20]: Word2vec is used to produce the word embeddings and each text is then represented as the average of the embeddings of the words present in that text. In the case of Dictionary-II, word embeddings were averaged only over words that are part of the dictionary; the rest were ignored. For Movie Reviews and 20-Newsgroup, the pretrained GoogleNews vectors (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) were used, whereas for the other two datasets (since these were stemmed), the model was trained with the gensim library with parameters vector size=200 and window=5.
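The dictionary-restricted averaging can be illustrated without any embedding library. In the sketch below the embeddings are a plain dictionary from word to vector, the average is taken over word tokens (an assumption, since the text does not say whether repeated words are counted once), and all names and numbers are hypothetical.

```python
def average_embedding(tokens, embeddings, dictionary, dim):
    """Average the embeddings of the tokens that belong to the dictionary
    and have an embedding; return a zero vector if none qualify."""
    kept = [embeddings[w] for w in tokens if w in dictionary and w in embeddings]
    if not kept:
        return [0.0] * dim
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy 2-dimensional embeddings standing in for word2vec vectors.
embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "film": [4.0, 4.0]}
dictionary_ii = {"good", "bad"}

doc_vector = average_embedding(["good", "film", "bad", "good"],
                               embeddings, dictionary_ii, dim=2)
print(doc_vector)
```

The word "film" has an embedding but is not in the dictionary, so it does not influence the document vector.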

4.1.3 Tf-Idf

Term frequency-Inverse document frequency (tf-idf) is a numerical statistic which is good at quantifying the importance of a word to a document in a collection. Let $wf(w,d)$ denote the frequency (that is the number of occurrences) of a word $w$ in a document $d$ . Instead of using this raw frequency as the feature value, we use a modified word frequency defined by

[TABLE]

where the inverse-document frequency, ( $idf(w)$ ), is given as

[TABLE]

Here, $n_{d}$ is the total number of documents and $df(w)$ is the number of documents that contain the word $w$ . We use this modified frequency of each word ( $Modified\text{-}wf(w,d)$ ) as the feature value. The feature vectors were further cosine normalized.
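The weighting and normalization steps can be sketched as follows. The exact formulas used in the paper are not legible in this copy, so the code assumes one common variant, $wf(w,d)\cdot idf(w)$ with $idf(w)=\log(n_{d}/df(w))$; this choice is an assumption and may differ from the authors' definition. The cosine normalization step divides each vector by its Euclidean norm.

```python
import math
from collections import Counter


def idf_table(docs):
    """idf(w) = log(n_d / df(w)), an assumed common variant."""
    n_d = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_d / df[w]) for w in df}


def tfidf_vector(doc, dictionary, idf):
    """Weighted word frequencies over the dictionary, cosine normalized."""
    wf = Counter(doc)
    vec = [wf[w] * idf.get(w, 0.0) for w in dictionary]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


# Hypothetical tokenized training documents.
docs = [["oil", "price"], ["oil", "rate"], ["oil", "oil", "rise"]]
idf = idf_table(docs)
vec = tfidf_vector(docs[0], ["oil", "price", "rate"], idf)
print(vec)
```

Under this variant a word that occurs in every document, such as "oil" here, receives zero weight.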

4.1.4 Results

We compare the classification accuracies obtained using our proposed Dictionary-II with those obtained with Dictionary-I. For the BoW and VecAvg representations, we present results using linear SVM. For BoW, Naive Bayes (NB) results are also presented for comparison with accuracies reported in the literature. For the Movie Review dataset, we present the mean value over ten-fold cross validation on the original folds introduced in [21].

Table II shows sizes of the two dictionaries for different datasets. The number of episodes reported in Table II is the number of non-singleton episodes. As can be seen from the table, the size of Dictionary-II is almost a fourth of that of Dictionary-I in case of WebKB; for the other datasets it is about one eighth to one tenth. Thus, this method results in a very significant reduction in dictionary size (and hence in feature vector dimension).

The classification accuracies obtained with different dictionaries are shown in Tables III–IV. Table III shows accuracies and F-measure with the linear SVM classifier under the VecAvg representation while Table IV shows these for Naive Bayes and linear SVM classifiers under the BoW representation. We did not try any nonlinear SVM because all other studies on these benchmark datasets reported only accuracies with linear SVM. As is easy to see, the accuracies and F-measure scores (under both the BoW and the VecAvg representation) achieved by either classifier with different dictionaries are mostly very close. Thus we can conclude that our frequent episodes based method allows us to get a very large reduction in dictionary size without any significant change in the classification accuracy. (We also note that the accuracy of our Dictionary-I in Table IV is consistent with the bag-of-words accuracy reported in [22] and [23].)

The above are with the train-test split as given in the original datasets. For BoW representation, we also generated 3 random splits for the datasets Reuters-21578, WebKB, 20-Newsgroup having the same train-test distribution of each class as in the original split. The results (showing averages and standard deviations) are presented in Tables V–VI. Once again, the results clearly show that there are no significant differences between accuracies achieved with the two dictionaries.

For the BoW representation in this document classification application, our method of learning a dictionary results in a significant decrease in feature vector dimension. But this method is quite different from generic dimensionality reduction techniques such as PCA. With PCA we may get dimensionality reduction by choosing certain linear combinations of the original features. With the original feature vector dimension being in the tens of thousands, the new features obtained as such linear combinations would not be semantically interpretable. In contrast, our data mining method essentially decides which words of Dictionary-I are retained (and which are rejected). Thus this method is essentially a feature selection method rather than a generic dimensionality reduction method. Hence, the dimensionality reduction achieved here is semantically interpretable.

To give a feel for what the data mining does, we present in Table VII a sample of words that are retained and rejected by our method for the Movie Review and WebKB datasets. The words shown are hand-picked, but only from a set of 1000 randomly selected words. It is easy to see that the selection makes good semantic sense. For example, in Movie Review we reject many movie-related words like ‘stunts’, ‘theater’, ‘performances’ etc. which, while they may appear in the reviews, may not carry any information regarding the sentiment of the review. On the other hand, we retain words like ‘hilarious’, ‘boring’, ‘surprised’ etc. that can carry sentiment information. Similar comments apply to the WebKB dataset (e.g., selected words like ‘prerequisite’, ‘introductory’, ‘project’ can be commonly found on a project or course web-page and hence they may carry discriminative information). Thus, the data mining method (based on finding episodes for compressing data) seems to be effective in picking a dictionary that is relevant to the text corpus.

5 Conclusions

In this paper we considered the problem of discovering a small set of serial episodes to characterize sequential data. We extended the existing CSC-2 algorithm of [6] to work with non-overlapped frequency.

Our main contribution is a novel HMM-based generative model for pairs of episodes. The model generates very general output sequences where the two episodes are the most prominent frequent episodes (under non-overlapped frequency). The model is very intuitive. The symbols emitted from episode states constitute the ‘model-based’ occurrences of episodes. The noise states can emit any symbol and hence symbols emitted from the noise states can be thought of as the distracting signal that may mask real episodes and contribute spurious frequent episodes. The transition structure is also intuitively motivated. From any state, the transition into a noise state has probability $\eta$. The remaining probability is equally divided between all reachable episode states. For this model class we showed that the episode-pair model that has the best likelihood for the data sequence is determined both by the frequencies of the episodes and by the overlaps between their occurrences. The analytical expressions we derived for the data likelihoods provide statistical justification for our algorithm of selecting a subset of episodes.

The CSC-2 algorithm is motivated by the MDL principle. Using an intuitively appealing coding scheme to encode data using episodes, the algorithm finds a subset of episodes that maximizes the data compression achieved. It incrementally picks episodes based on the so-called overlap score, which depends both on the frequency of the episode and on the extent of overlap between its occurrences and those of already selected episodes. Our HMM-based model provides some statistical justification for this strategy.

Both the generative model for sequential data that captures interactions between two episodes, and its use to justify an MDL-based algorithm for frequent episodes, are novel contributions of this paper. As mentioned in Section 1, there have been many algorithms, motivated by the MDL philosophy, for succinctly characterizing data using a small set of frequent patterns. However, all such algorithms for sequential data are heuristic in nature. We believe that the HMM model presented here is a good first step in developing a statistical theory for MDL-based algorithms that find a good subset of frequent episodes.

Another important contribution of this paper is a novel application of frequent episode mining to text classification. We view a text document as a sequence of events, with the event types being the words. We then find the subset of episodes that best characterizes the entire text corpus in terms of data compression. The words appearing in this subset of frequent episodes are likely to be the most informative words for the corpus, and hence we use only these words to form the dictionary with which the documents are represented as vectors. Thus the method amounts to learning a context-sensitive dictionary using the idea of frequent pattern mining. Also, since our data mining method does not need any user-specified hyperparameters, the same is true of this method of dimensionality reduction. To the best of our knowledge, this is the first application of frequent pattern methods to dictionary learning. As we showed through extensive simulations, the method results in a many-fold decrease in the size of the dictionary without compromising classification accuracy. Also, as can be seen from the examples of retained and rejected words, the method seems to be quite effective in learning a good subset of words.
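The dictionary-learning step can be sketched as below, taking the selected episodes as given. This is a minimal sketch, not the pipeline used in the experiments: the function names are hypothetical, episodes are represented simply as tuples of words, and plain word counts are an assumed vector representation.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def build_dictionary(selected_episodes: Iterable[Sequence[str]]) -> List[str]:
    """Collect the distinct words that appear in any selected episode."""
    words = set()
    for episode in selected_episodes:
        words.update(episode)
    return sorted(words)


def vectorize(doc_tokens: Sequence[str], dictionary: Sequence[str]) -> List[int]:
    """Count vector of a document restricted to the learned dictionary.

    Assumption: raw counts; words outside the dictionary are ignored.
    """
    counts = Counter(doc_tokens)
    return [counts[w] for w in dictionary]


# Hypothetical episodes standing in for the output of the mining step.
episodes = [("very", "boring"), ("hilarious",)]
dictionary = build_dictionary(episodes)
vector = vectorize(["the", "movie", "was", "very", "very", "boring"], dictionary)
```

The length of each document vector equals the number of distinct words in the selected episodes, which is where the reduction in feature dimension comes from.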

The HMM model we presented is for pairs of episodes. While it is, in principle, extendable to any number of episodes, the notation would become very complex. A good extension of the work presented here would be to extend these analytical techniques to an arbitrary number of episodes. Generative models can, in general, be used for assessing the statistical significance of the frequency of an episode (e.g., [14]). Since the model introduced here also accounts for interactions among episodes, it should be usable for questions such as whether the observed frequencies of two episodes make both of them significant, given the extent of overlap between their occurrences. This is another useful direction in which the work presented here can be extended.

Bibliography

[1] C. C. Aggarwal and J. Han, Frequent pattern mining. Springer, 2014.
[2] J. Vreeken, M. Van Leeuwen, and A. Siebes, “Krimp: mining itemsets that compress,” Data Mining and Knowledge Discovery, vol. 23, no. 1, pp. 169–214, 2011.
[3] N. Tatti and J. Vreeken, “The long and the short of it: summarising event sequences with serial episodes,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012, pp. 462–470.
[4] M. Mampaey, N. Tatti, and J. Vreeken, “Tell me what I need to know: succinctly summarizing data with itemsets,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 573–581.
[5] H. T. Lam, F. Mörchen, D. Fradkin, and T. Calders, “Mining compressing sequential patterns,” Statistical Analysis and Data Mining, vol. 7, no. 1, pp. 34–52, 2014.
[6] A. Ibrahim, S. Sastry, and P. S. Sastry, “Discovering compressing serial episodes from event sequences,” Knowledge and Information Systems, vol. 47, no. 2, pp. 405–432, 2016.
[7] A. Bhattacharyya and J. Vreeken, “Efficiently summarizing event sequences with rich interleaving patterns,” in Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017.
[8] Q. Fan, Y. Li, D. Zhang, and K.-L. Tan, “Discovering newsworthy themes from sequenced data: A step towards computational journalism,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 7, pp. 1398–1411, 2017.