Reference-Based Sequence Classification

Zengyou He; Guangyao Xu; Chaohua Sheng; Bo Xu; Quan Zou

arXiv:1905.07188·cs.LG·December 15, 2020

TL;DR

This paper introduces a unified reference-based framework for sequence classification that consolidates existing pattern-based methods and facilitates the development of new algorithms with competitive accuracy.

Contribution

The paper presents a general framework unifying pattern-based sequence classification methods and enabling the creation of novel algorithms.

Findings

01. New algorithms achieve comparable accuracy to state-of-the-art methods.

02. Framework effectively unifies existing pattern-based approaches.

03. Experimental results validate the versatility and effectiveness of the proposed framework.

Tables

Table I: The categorization of some existing feature-based sequence classification algorithms under our framework

Algorithm	Construction of Candidate Reference Point Set	Selection of Reference Points	Selection of Similarity Function
SCIP[5]	SubTrainD	minsup, minint and maxsize constraints	SF 1/3
Ref.[8]	SubTrainD	minsup constraint	SF 1
Ref.[11]	SubTrainD	uniqueness, gap, mindisc (DF 2 and 4) constraints	SF 5
Ref.[12]	SubTrainD	gap, mindisc (DF 1 and 3) constraints	SF 2
MiSeRe[17]	SubTrainD	level constraint	SF 1
Ref.[18]	SubTrainD	minsup and gap constraints	SF 1/4
PSO-AB[19]	SubTrainD	minsup and closeness constraints	SF 6
FeatureMine[26]	SubTrainD	minsup, redundancy and mindisc (DF 6) constraints	SF 1
CDSPM[30]	SubTrainD	minsup and mindisc (DF 5) constraints	SF 1

Table II: Summary of the Sequential Data Sets Used in the Experiments

Dataset	$|D|$	#items	minl	maxl	avgl	#classes
Activity	35	10	12	43	21.14	2
Aslbu	424	250	2	54	13.05	7
Auslan2	200	16	2	18	5.53	10
Context	240	94	22	246	88.39	5
Epitope	2392	20	9	21	15	2
Gene	2942	5	41	216	86.53	2
News	4976	27884	1	6779	139.96	5
Pioneer	160	178	4	100	40.14	3
Question	1731	3612	4	29	10.17	2
Reuters	1010	6380	4	533	93.84	4
Robot	4302	95	24	24	24	2
Skating	530	82	18	240	48.12	7
Unix	5472	1697	1	1400	32.34	4
Webkb	3667	7736	1	20628	129.37	3

Table III: Performance comparison of different algorithms in terms of the classification accuracy

Dataset	Classifier	R-A	R-MHT	R-GAHC	MiSeRe	FSP	DSP	Sqn2VecSEP	Sqn2VecSIM	SCIP classifier	SCIP
Activity	NB	0.966	0.977	0.811	1.000	0.960	0.960	1.000	1.000	SCII_HAR	0.663
Activity	DT	0.931	0.931	0.794	0.960	1.000	1.000	0.900	0.800	SCII_MA	0.675
Activity	SVM	0.977	0.926	0.629	1.000	0.994	0.994	1.000	0.950	SCIS_HAR	0.967
Activity	KNN	0.983	0.811	0.800	1.000	0.886	0.886	1.000	0.950	SCIS_MA	1.000
Aslbu	NB	0.574	0.561	0.449	0.548	0.527	0.420	0.298	0.554	SCII_HAR	0.540
Aslbu	DT	0.523	0.527	0.480	0.565	0.542	0.459	0.405	0.484	SCII_MA	0.526
Aslbu	SVM	0.638	0.625	0.483	0.571	0.581	0.455	0.498	0.633	SCIS_HAR	0.553
Aslbu	KNN	0.563	0.310	0.479	0.574	0.531	0.464	0.544	0.591	SCIS_MA	0.536
Auslan2	NB	0.322	0.330	0.334	0.304	0.292	0.145	0.260	0.290	SCII_HAR	0.100
Auslan2	DT	0.317	0.308	0.308	0.314	0.330	0.170	0.270	0.270	SCII_MA	0.095
Auslan2	SVM	0.328	0.323	0.326	0.300	0.318	0.170	0.290	0.310	SCIS_HAR	0.200
Auslan2	KNN	0.327	0.304	0.333	0.302	0.309	0.166	0.310	0.200	SCIS_MA	0.175
Context	NB	0.780	0.778	0.772	0.938	0.812	0.572	0.900	0.900	SCII_HAR	0.613
Context	DT	0.791	0.798	0.745	0.841	0.873	0.575	0.600	0.542	SCII_MA	0.617
Context	SVM	0.939	0.937	0.755	0.927	0.868	0.577	0.933	0.900	SCIS_HAR	0.796
Context	KNN	0.871	0.853	0.813	0.896	0.839	0.585	0.900	0.858	SCIS_MA	0.867
Epitope	NB	0.675	0.663	0.751	0.588	0.696	0.671	0.779	0.761	SCII_HAR	0.684
Epitope	DT	0.839	0.842	0.815	0.842	0.814	0.750	0.813	0.800	SCII_MA	0.712
Epitope	SVM	0.855	0.838	0.769	0.834	0.758	0.716	0.802	0.800	SCIS_HAR	0.705
Epitope	KNN	0.932	0.925	0.918	0.924	0.898	0.801	0.863	0.841	SCIS_MA	0.721
Gene	NB	0.998	0.997	1.000	1.000	1.000	1.000	1.000	1.000	SCII_HAR	1.000
Gene	DT	0.997	0.997	0.998	1.000	1.000	1.000	0.967	0.976	SCII_MA	1.000
Gene	SVM	1.000	1.000	1.000	1.000	1.000	1.000	0.999	1.000	SCIS_HAR	1.000
Gene	KNN	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	SCIS_MA	1.000
News	NB	0.754	0.765	0.645	0.866	0.595	0.714	0.946	0.967	SCII_HAR	0.936
News	DT	0.719	0.724	0.654	0.837	0.793	0.767	0.600	0.555	SCII_MA	0.942
News	SVM	0.975	0.969	0.874	0.905	0.771	0.775	0.972	0.974	SCIS_HAR	0.910
News	KNN	0.856	0.350	0.756	0.732	0.544	0.761	0.918	0.912	SCIS_MA	0.918
Pioneer	NB	0.977	0.931	0.796	0.891	0.932	0.852	0.975	0.963	SCII_HAR	0.963
Pioneer	DT	0.883	0.870	0.826	0.988	0.964	0.926	0.788	0.825	SCII_MA	0.963
Pioneer	SVM	0.980	0.952	0.638	0.989	0.983	0.930	0.950	0.988	SCIS_HAR	0.975
Pioneer	KNN	0.990	0.477	0.828	0.975	0.933	0.926	0.975	0.988	SCIS_MA	0.975
Question	NB	0.846	0.828	0.840	0.870	0.763	0.763	0.736	0.762	SCII_HAR	0.833
Question	DT	0.889	0.881	0.877	0.885	0.822	0.759	0.717	0.809	SCII_MA	0.785
Question	SVM	0.949	0.947	0.868	0.902	0.814	0.763	0.789	0.881	SCIS_HAR	0.846
Question	KNN	0.895	0.879	0.889	0.897	0.819	0.762	0.828	0.874	SCIS_MA	0.837
Reuters	NB	0.892	0.893	0.808	0.903	0.831	0.765	0.921	0.905	SCII_HAR	0.951
Reuters	DT	0.878	0.878	0.843	0.912	0.903	0.897	0.826	0.741	SCII_MA	0.953
Reuters	SVM	0.976	0.970	0.858	0.962	0.933	0.915	0.984	0.974	SCIS_HAR	0.957
Reuters	KNN	0.958	0.452	0.918	0.899	0.894	0.900	0.960	0.960	SCIS_MA	0.956
Robot	NB	0.871	0.866	0.832	0.826	0.735	0.718	0.808	0.822	SCII_HAR	0.795
Robot	DT	0.880	0.879	0.871	0.900	0.843	0.742	0.811	0.778	SCII_MA	0.822
Robot	SVM	0.955	0.952	0.902	0.913	0.780	0.723	0.840	0.834	SCIS_HAR	0.817
Robot	KNN	0.947	0.947	0.942	0.937	0.860	0.743	0.945	0.949	SCIS_MA	0.819
Skating	NB	0.281	0.271	0.197	0.290	0.240	N/A	0.336	0.321	SCII_HAR	0.181
Skating	DT	0.258	0.215	0.204	0.258	0.272	N/A	0.226	0.230	SCII_MA	0.181
Skating	SVM	0.375	0.277	0.208	0.293	0.299	N/A	0.370	0.321	SCIS_HAR	0.189
Skating	KNN	0.290	0.203	0.245	0.241	0.191	N/A	0.302	0.340	SCIS_MA	0.191
Unix	NB	0.718	0.764	0.772	0.768	0.699	0.607	0.703	0.566	SCII_HAR	0.837
Unix	DT	0.887	0.883	0.874	0.899	0.819	0.750	0.776	0.756	SCII_MA	0.838
Unix	SVM	0.927	0.921	0.906	0.915	0.820	0.748	0.899	0.872	SCIS_HAR	0.857
Unix	KNN	0.869	0.822	0.865	0.873	0.803	0.745	0.892	0.891	SCIS_MA	0.842
Webkb	NB	0.710	0.720	0.641	0.845	0.701	0.833	0.858	0.886	SCII_HAR	0.897
Webkb	DT	0.820	0.821	0.788	0.874	0.869	0.862	0.629	0.635	SCII_MA	0.911
Webkb	SVM	0.954	0.952	0.880	0.927	0.895	0.869	0.934	0.940	SCIS_HAR	0.894
Webkb	KNN	0.887	0.544	0.851	0.779	0.843	0.858	0.772	0.691	SCIS_MA	0.901

Table IV: The average classification accuracies of different methods over all data sets used in the experiment

Classifier	R-A	R-MHT	R-GAHC	MiSeRe	FSP	DSP	Sqn2VecSEP	Sqn2VecSIM	SCIP classifier	SCIP
NB	0.740	0.739	0.689	0.760	0.699	0.694	0.751	0.764	SCII_HAR	0.714
DT	0.758	0.754	0.720	0.791	0.775	0.743	0.666	0.657	SCII_MA	0.716
SVM	0.845	0.828	0.721	0.817	0.772	0.741	0.804	0.813	SCIS_HAR	0.762
KNN	0.812	0.634	0.760	0.788	0.739	0.738	0.801	0.789	SCIS_MA	0.767
AVG	0.789	0.739	0.722	0.789	0.746	0.729	0.756	0.756	AVG	0.740

Table V: The average classification accuracies of different similarity functions over all data sets used in the experiment

Classifier	R-A-J	R-A-S	R-A-N	R-MHT-J	R-MHT-S	R-MHT-N	R-GAHC-J	R-GAHC-S	R-GAHC-N
NB	0.740	0.675	0.718	0.739	0.677	0.711	0.689	0.672	0.674
DT	0.758	0.733	0.747	0.754	0.723	0.744	0.720	0.702	0.703
SVM	0.845	0.793	0.837	0.828	0.779	0.829	0.721	0.746	0.762
KNN	0.812	0.752	0.802	0.634	0.738	0.773	0.760	0.706	0.749
AVG	0.789	0.738	0.776	0.739	0.730	0.764	0.722	0.706	0.722

Equations

$sup_{D_{c_{1}}}(t) > minsup,\quad sup_{D_{c_{2}}}(t) \leq minsup,$

$occ_{D_{c_{1}}}(t) > mincount,\quad occ_{D_{c_{2}}}(t) \leq mincount,$

$supdiff = sup_{D_{c_{1}}}(t) - sup_{D_{c_{2}}}(t).$

$F\text{-}ratio = \frac{Occ_{between}}{Occ_{within}},$

$Occ_{between} = \cdots + |D_{c_{2}}| \left( occ_{D_{c_{2}}}(t) - \frac{occ_{D_{c_{1}}}(t) + occ_{D_{c_{2}}}(t)}{2} \right)^{2},$

$Occ_{within} = \cdots + \sum_{j=1}^{|D_{c_{2}}|} \left( occount_{D_{c_{2j}}}(t) - occ_{D_{c_{2}}}(t) \right)^{2}.$

$GR(t, c_{1}, c_{2}) \geq minGR,\quad Sig_{con}(t, c_{1}, c_{2}) \geq minSig,$

$Sim(s,t) = \begin{cases} 1, & \text{if } t \subseteq s, \\ 0, & \text{otherwise}. \end{cases}$

$Sim(s,t) = \begin{cases} 1, & \text{if } \alpha \text{ and } t \text{ are similar}, \\ 0, & \text{otherwise}. \end{cases}$

$Sim(s,t) = \begin{cases} C(t,s), & \text{if } t \subseteq s, \\ 0, & \text{otherwise}, \end{cases}$

$Sim(s,t) = \begin{cases} occnum, & \text{if } t \subseteq s, \\ 0, & \text{otherwise}, \end{cases}$

$Sim(s,t) = \begin{cases} occount_{s}(t), & \text{if } t \subseteq s, \\ 0, & \text{otherwise}, \end{cases}$

$Sim(s,t) = \frac{|LCS(s,t)|}{Max\{|s|,|t|\}},$

$J(s,t) = \frac{|s \cap t|}{|s| + |t| - |s \cap t|},$

$J(s,t) = \frac{|LCS(s,t)|}{|s| + |t| - |LCS(s,t)|}.$

$J(abcde, ecdc)$

$K_{n}(s,t) = \sum_{u \in I^{n}} \phi_{u}(s) \cdot \phi_{u}(t) = \sum_{u \in I^{n}} \sum_{u \subseteq s} \lambda^{l_{s}(u)} \sum_{u \subseteq t} \lambda^{l_{t}(u)} = \sum_{u \in I^{n}} \sum_{u \subseteq s} \sum_{u \subseteq t} \lambda^{l_{s}(u) + l_{t}(u)},$

$\hat{K}_{n}(s,t) = \frac{K_{n}(s,t)}{\sqrt{K_{n}(s,s) K_{n}(t,t)}}.$

$\hat{K}_{1}(abcde, ecdc) = \frac{4\lambda^{2}}{\sqrt{5\lambda^{2} \times 6\lambda^{2}}} \approx 0.73.$

$Sim(s,t) = \frac{|LCS(s,t)|}{Min\{|s|,|t|\}},$

$Sim(abcde, ecdc)$


Full text

Corresponding author: Zengyou He (e-mail: [email protected]). This work was partially supported by the Natural Science Foundation of China under Grant Nos. 61972066 and 61572094, and the Fundamental Research Funds for the Central Universities (No. DUT20YG106).

Zengyou He (1), Guangyao Xu (1), Chaohua Sheng (1), Bo Xu (1), Quan Zou (2)

(1) School of Software, Dalian University of Technology, Tuqiang Road, Dalian, China

(2) Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology, Chengdu, China

Abstract

Sequence classification is an important data mining task in many real-world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms.

Index Terms:

Sequence classification, sequential data analysis, cluster analysis, hypothesis testing, sequence embedding

I Introduction

In many practical applications, we have to conduct data analysis on data sets that are composed of discrete sequences. Each sequence is an ordered list of elements. For instance, such a sequence can be a protein sequence, where each element corresponds to an amino acid. Due to the existence of a large number of discrete sequences in a wide range of applications, sequential data analysis has become an important issue in machine learning and data mining. Compared to non-sequential data mining, sequential data analysis is confronted with new challenges because of the ordering relationship between different elements in the sequences. Similar to the analysis of non-sequential data, there are different sequential data mining problems such as clustering, classification and pattern discovery. In this paper, we focus on the sequence classification problem.

The task of classification is to determine which predefined target class an unknown object should be assigned to [1]. As a specific case of the general classification problem, sequence classification assigns class labels to new sequences based on the classifier constructed in the training phase. In many real-world applications, we can formulate the data analysis task as a sequence classification problem. For instance, the essential task in numerous bioinformatics applications is to classify biological sequences into existing categories [2].

To tackle the sequence classification problem, many effective methods have been proposed from different aspects. Roughly, existing sequence classification methods can be divided into three categories [3]: feature-based methods, distance-based methods and model-based methods. Feature-based methods first transform sequences into feature vectors and then apply existing vectorial data classification methods. Distance-based methods apply classifiers such as KNN ( $k$ Nearest Neighbors) to solve the sequence classification problem, in which the key issue is to specify a proper distance function to measure the distance between two sequences [3]. Model-based methods generally assume that sequences from different classes are generated from different probability distributions, in which the key issue is to estimate the model parameters from the set of training sequences.

In this paper, we focus on the feature-based method since it has several advantages. First of all, various effective classifiers have been developed for vectorial data classification [4]. After transforming sequences into feature vectors, we can choose any one of these existing classification methods to fulfill the sequence classification task. Second, in some popular feature-based methods such as pattern-based methods, each feature has a good interpretability. Last but not least, the extraction of features from sequences has been extensively studied across different fields, making it feasible to generate sequence features in an effective manner.

The $k$ -mer (in bioinformatics) or $k$ -gram (in natural language processing) is a substring that is composed of $k$ consecutive elements, which is probably the most widely used feature in feature-based sequence classification. Such a $k$ -mer based feature construction method is further generalized by the pattern-based method, in which a feature is a sequential pattern (a subsequence) that satisfies some constraints (e.g. frequent pattern, discriminative pattern). Over the past few decades, a large number of pattern-based methods have been presented in the context of sequence classification [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30].

In this paper, we present a reference-based sequence classification framework, which can be considered as a non-trivial generalization of the pattern-based methods. This framework has several key steps: candidate set construction, reference point selection and feature value construction. In the first step, a set of sequences that serve as the candidate reference points are constructed. Then, some sequences from the candidate set are selected as the reference points according to certain criteria. The number of features in the transformed vectorial data will equal the number of selected reference points. In other words, each reference point will correspond to a transformed feature. Finally, a similarity function is used to calculate the similarity between each sequence in the data and every reference point. The similarity to each reference point will be used as the corresponding feature value.

The reference-based sequence classification framework is quite general and flexible since both the reference points and the similarity function can be chosen freely. Existing feature-based methods can be regarded as special variants under our framework that (1) use (frequent or discriminative) sequential patterns (subsequences) as reference points and (2) utilize a boolean function (output 1 if the reference point is contained in a given sequence and output 0 otherwise) as the similarity function. Besides unifying existing pattern-based methods under the same umbrella, the reference-based sequence classification framework can be used as a general platform for developing new feature-based sequence classification methods. To justify this point, we develop a new feature-based method in which a subset of training sequences is used as the reference points and the Jaccard coefficient is used as the similarity function. In particular, we present two instance selection methods to select a good set of reference points.
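The transformation described above can be illustrated with a minimal sketch. It assumes the LCS-based form of the Jaccard coefficient that appears in the equation list, $J(s,t)=|LCS(s,t)|/(|s|+|t|-|LCS(s,t)|)$; the function names and the choice of the first two training sequences as reference points are hypothetical and stand in for the selection methods of Section V.

```python
from typing import Callable, List, Sequence


def lcs_length(s: Sequence, t: Sequence) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_jaccard(s: Sequence, t: Sequence) -> float:
    """LCS-based Jaccard similarity (assumed form, taken from the equation list)."""
    common = lcs_length(s, t)
    denom = len(s) + len(t) - common
    return common / denom if denom else 0.0


def to_feature_vectors(
    sequences: List[Sequence],
    reference_points: List[Sequence],
    sim: Callable[[Sequence, Sequence], float] = lcs_jaccard,
) -> List[List[float]]:
    """One feature per reference point: the similarity of the sequence to it."""
    return [[sim(s, r) for r in reference_points] for s in sequences]


train = ["abcde", "ecdc", "abab"]
refs = train[:2]  # hypothetical selection; the paper uses dedicated selection methods
X = to_feature_vectors(train, refs)
```

Each row of `X` has as many entries as there are reference points, so any classifier for vectorial data can be trained on it. Swapping `sim` for a boolean containment test recovers the pattern-based special case.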

To demonstrate the feasibility and advantages of this new framework, we conduct a series of comprehensive performance studies on real sequential data sets. In the experiments, we compare several variants under our framework with some existing sequence classification methods in terms of classification accuracy. Experimental results show that new methods developed under the proposed framework are capable of achieving better classification accuracy than traditional sequence classification methods. This indicates that such a reference-based sequence classification framework is promising from a practical point of view.

The main contributions of this paper can be summarized as follows:

• We present a general reference-based framework for feature-based sequence classification. It offers a unified view for understanding and explaining many existing feature-based sequence classification methods in which different types of sequential patterns are used as features.

• The reference-based framework can be used as a general platform for developing new feature-based sequence classification algorithms. To verify this point, we design new feature-based sequence classification algorithms under this framework and demonstrate its advantages through extensive experimental results on real sequential data sets.

The rest of the paper is structured as follows. Section II gives a discussion on the related work. In Section III, we introduce the reference-based sequence classification framework in detail. In Section IV, we show that many existing feature-based sequence classification algorithms can be reformulated within the reference-based framework. In Section V, we present new feature-based sequence classification algorithms under this framework, which are effective and quite different from available solutions. We experimentally evaluate the proposed reference-based framework through a series of experiments on real-life data sets in Section VI. Finally, we summarise our research and give a discussion on the future work in Section VII.

II Related Work

In this section, we discuss previous research efforts that are closely related to our method. In Section II-A, we provide a categorization on existing feature-based sequence classification methods. In Section II-B, we discuss several instance-based feature generation methods in the literature of time series classification. In Section II-C, we present a concise discussion on reference-based sequence clustering algorithms. In Section II-D, we provide a short summary on dimension reduction and embedding methods based on landmark points.

II-A Feature-Based Methods

II-A1 Explicit Subsequence Representation without Selection

The naive approach in dealing with discrete sequences is to treat each element as a feature. However, the order information between different elements will be lost and the sequential nature cannot be captured in the classification. Short sequence segments of $k$ consecutive elements called $k$ -grams can be used as features to solve this problem. Given a set of $k$ -grams, a sequence can be represented as a vector of the presence or absence of the $k$ -grams or the frequencies of the $k$ -grams. In this feature representation method, all $k$ -grams (for a specified $k$ value) are explicitly used as the features without feature selection.
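As a rough illustration of this representation, the sketch below builds presence/absence and frequency vectors over all $k$ -grams observed in a small collection. The function names and the toy sequences are assumptions made for the example, not part of any cited method.

```python
from collections import Counter


def kgrams(seq, k):
    """Return every window of k consecutive elements of seq, in order."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def kgram_vectors(sequences, k, binary=False):
    """Map each sequence to a vector over all k-grams seen in the collection.

    With binary=False the entries are k-gram frequencies; with binary=True
    they record presence (1) or absence (0). No feature selection is applied.
    """
    counts = [Counter(kgrams(s, k)) for s in sequences]
    vocab = sorted(set().union(*counts))
    vectors = [[int(c[g] > 0) if binary else c[g] for g in vocab] for c in counts]
    return vocab, vectors


vocab, freq = kgram_vectors(["abab", "abc"], k=2)
_, presence = kgram_vectors(["abab", "abc"], k=2, binary=True)
```

The vocabulary grows with the number of distinct $k$ -grams, which is one reason large values of $k$ become impractical without selection.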

II-A2 Explicit Subsequence Representation with Selection (Classifier-Independent)

Lesh et al. [26] present a pattern-based classification method in which a sequential pattern is chosen as a feature. The selected pattern should satisfy the following criteria: (1) be frequent, (2) be distinctive of at least one class and (3) be non-redundant. Following this direction, many pattern-based classification methods have been subsequently proposed, in which different constraints are imposed on the patterns that should be selected as features [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 29, 30]. Note that any classifier designed for vectorial data can be applied to the transformed data generated from such pattern-based methods. In other words, such feature generation methods are classifier-independent.

II-A3 Explicit Subsequence Representation with Selection (Classifier-Dependent)

The above pattern-based methods are universal and classifier-independent. However, some patterns that are critical to the classifier may be filtered out during the selection process. Thus, several methods which can select pattern features from the entire pattern space for a specific classifier have been proposed [31, 32, 33].

In [31], a coordinate-wise gradient ascent technique is presented for learning the logistic regression function in the space of all $k$ -grams. The method exploits the inherent structure of the $k$ -gram feature space to automatically provide a compact set of highly discriminative $k$ -gram features. In [32], a framework is presented in which linear classifiers such as logistic regression and support vector machine can work directly in the explicit high-dimensional space of all subsequences. The key idea is a gradient-bounded coordinate-descent strategy to quickly retrieve features without explicitly enumerating all potential subsequences. In [33], a novel document classification method using all substrings as features is proposed, in which $L_{1}$ regularization is applied to a multi-class logistic regression model to fulfill the feature selection task automatically and efficiently.

II-A4 Implicit Subsequence Representation

In contrast to explicit subsequence representation, kernel-based methods employ an implicit subsequence representation strategy. A kernel function is the key ingredient for learning with support vector machines (SVMs) and it implicitly defines a high-dimensional feature space. Some kernel functions $K(x,y)$ have been presented for measuring the similarity between two sequences $x$ and $y$ (e.g. [34]).

There are a variety of string kernels which are widely used for sequence classification (e.g. [35, 36, 37, 38]). A sequence is transformed into a feature space and the kernel function is the inner product of two transformed feature vectors.

Leslie et al. [35] propose a $k$ -spectrum kernel for protein classification. Given a number $k\geq 1$ , the $k$ -spectrum of an input sequence is the set of all its $k$ -length (contiguous) subsequences.

Lodhi et al. [36] present a string kernel based on gapped $k$ -length subsequences for text classification. The subsequences are weighted by an exponentially decaying factor of their full length in the text.
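To make the weighting concrete, here is a brute-force sketch of a gapped subsequence kernel of the form $K_{n}(s,t)=\sum_{u\in I^{n}}\sum_{u\subseteq s}\sum_{u\subseteq t}\lambda^{l_{s}(u)+l_{t}(u)}$ listed among the equations, where $l_{s}(u)$ is taken to be the span of an occurrence of $u$ in $s$ . Enumerating all index tuples is an assumption made for readability and is only feasible for short sequences and small $n$ ; it is not the efficient recursion used in [36]. The square-root normalization is the standard one and reproduces the value 0.73 listed for $\hat{K}_{1}(abcde, ecdc)$ .

```python
from itertools import combinations
from math import sqrt


def subsequence_kernel(s, t, n, lam=0.5):
    """Brute-force gapped subsequence kernel K_n(s, t) for n >= 1."""

    def weighted(seq):
        # Map each length-n subsequence u to the sum of lam ** span over
        # all of its occurrences (span = last index - first index + 1).
        weights = {}
        for idx in combinations(range(len(seq)), n):
            u = tuple(seq[i] for i in idx)
            span = idx[-1] - idx[0] + 1
            weights[u] = weights.get(u, 0.0) + lam ** span
        return weights

    ws, wt = weighted(s), weighted(t)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)


def normalized_kernel(s, t, n, lam=0.5):
    """K_n(s, t) divided by sqrt(K_n(s, s) * K_n(t, t))."""
    denom = sqrt(subsequence_kernel(s, s, n, lam) * subsequence_kernel(t, t, n, lam))
    return subsequence_kernel(s, t, n, lam) / denom if denom else 0.0


value = normalized_kernel("abcde", "ecdc", n=1)  # equals 4 / sqrt(30)
```

For $n=1$ every occurrence has span 1, so the decay factor cancels in the normalized value and the result does not depend on `lam`.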

In [37], a mismatch string kernel is proposed, in which a certain number of mismatches are allowed in counting the occurrence of a subsequence. Several string kernels related to the mismatch kernel are presented in [38]: restricted gappy kernels, substitution kernels and wildcard kernels.

II-A5 Sequence Embedding

All the methods mentioned above use subsequences as features. Alternatively, the sequence embedding method generates a vector representation in which each feature does not have a clear interpretation. Most existing approaches for sequence embedding are proposed for texts in natural language processing, where word and document embeddings are used as an efficient way to encode the text (e.g. [39, 40]). The basic assumption in these methods is that words that appear in similar contexts have similar meanings.

The word2vec model [39] uses a two-layer neural network to learn a vector representation for each word. The sequence (text) embedding vector can be further generated by combining the feature vectors for words. The doc2vec model [40] extends word2vec by directly learning feature vectors for entire sentences, paragraphs, or documents.

Nguyen et al. [41] propose an unsupervised method (named Sqn2Vec) for learning sequence embedding by predicting its belonging singleton symbols and sequential patterns (SPs). The main objective of Sqn2Vec is to address the limitations of two existing approaches: pattern-based methods often produce sparse and high-dimensional feature vectors while sequence embedding methods in natural language processing may fail on data sets with a small vocabulary.

II-A6 Summary of Feature-Based Methods

Roughly, existing feature-based sequence classification methods can be divided into the above five categories. Each of these methods has its pros and cons, which we will discuss briefly next.

First, using $k$ -grams as features without feature selection is simple and effective in practice. However, the feature length $k$ cannot be large and many redundant features may be included.

Second, in the pattern-based method, the length of a feature is not restricted as long as the feature satisfies given constraints and redundant features can be filtered out in some formulations. However, it is a non-trivial task to efficiently mine patterns that can satisfy the constraints.

Third, sequence classification methods based on adaptive feature selection can automatically select features from the set of all subsequences. The basic idea is to integrate the feature selection and classifier construction into the same procedure. Hence, these methods are classifier-dependent in the sense that each algorithm is only applicable to a specific classifier.

Fourth, kernel-based methods can implicitly map the sequence into a high-dimensional feature space without explicit feature extraction. The major challenge is how to choose a proper string kernel function and how to handle large data sets efficiently.

Finally, sequence embedding methods generate a new vector representation for each sequence that may achieve better classification accuracy. Unfortunately, the semantic interpretation of each feature becomes a difficult issue.

II-B Instance-Based Feature Generation Methods

There are several instance-based feature generation methods for time series classification which are closely related to our method (e.g. [42, 43]).

Iosifidis et al. [42] propose a time series classification method based on a novel vector representation. The vector representation for each time series is generated by calculating its similarities to a subset of training instances. To find a good subset of representative instances, a clustering procedure is further presented. In [43], each time series is represented as a feature vector, where each feature value is its dynamic time warping similarity to one of the training instances. Note that all training instances are used for feature generation.

II-C Reference-Based Sequence Clustering

In the literature of sequence clustering, the idea of using reference/landmark points to accelerate the cluster analysis process has been widely studied (e.g. [44, 45]). In this type of sequence clustering algorithm, a reference point selection method is first employed to obtain a small set of landmark points and then the clustering process is conducted based on the similarities between input sequences and selected reference points. Here, we would like to highlight the following differences between our method and existing research efforts in this field: (1) The objective is different. We focus on the classification issue while these methods aim at the cluster analysis problem. Besides, their main concern is to improve the running efficiency of the sequence clustering procedure; (2) The method is different. We present two reference point selection methods: one unsupervised method and one supervised method (see Section V for the details). In existing reference-based sequence clustering methods, only the unsupervised reference point selection method is applicable since no class label information is available.

II-D Reference-Based Dimension Reduction

A number of research papers have presented the idea of using the distances to a set of reference points to fulfill the dimension reduction task (e.g. [46, 47]). Our method shares some similarities with these methods since the final objective is the same. However, most of these methods are not developed for the task of sequence classification. As a result, our method is quite different from these methods in both the reference point selection and the similarity computation.

III Reference-Based Sequence Classification Framework

Let $I=\left\{i_{1},i_{2},...,i_{m}\right\}$ be a finite set of $m$ distinct items, which is generally called the alphabet in the literature. A sequence $s$ over $I$ is an ordered list $s=\left\langle s_{1},s_{2},...,s_{l}\right\rangle$ , where $s_{i}\in I$ and $l$ is the length of the sequence $s$ . A sequence $t=\left\langle t_{1},t_{2},...,t_{r}\right\rangle$ is said to be a subsequence of $s$ if there exist integers $1\leq i_{1}<i_{2}<...<i_{r}\leq l$ such that $t_{1}=s_{i_{1}},t_{2}=s_{i_{2}},...,t_{r}=s_{i_{r}}$ , denoted as $t\subseteq s$ (if $t\neq s$ , written as $t\subset s$ ). We use $maxsize$ to denote the allowed maximum length of subsequences.
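The subsequence relation $t\subseteq s$ defined above can be checked with a single left-to-right scan. The sketch below is one possible implementation; the function names are illustrative, and `boolean_similarity` corresponds to the 0/1 containment function that pattern-based methods use as their similarity function.

```python
def is_subsequence(t, s) -> bool:
    """Return True if t is a subsequence of s (order kept, gaps allowed)."""
    it = iter(s)
    # `item in it` advances the iterator past the first match, so each
    # element of t must be found strictly after the previous one.
    return all(item in it for item in t)


def boolean_similarity(s, t) -> int:
    """Output 1 if reference point t is contained in sequence s, else 0."""
    return 1 if is_subsequence(t, s) else 0
```

For example, `is_subsequence("ace", "abcde")` holds, while `is_subsequence("aec", "abcde")` does not because the order of `e` and `c` is reversed.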

Let $C=\left\{c_{1},c_{2},...,c_{j}\right\}$ be a finite set of $j$ distinct classes. A labeled sequential data set $D$ over $I$ is a set of instances, where each instance $d$ is denoted by $(s,c_{k})$ , $s$ is a sequence and $c_{k}\in C$ is a class label; $|D|$ is the number of sequences in $D$ . The set $D_{c_{i}}\subseteq D$ contains all sequences that have the same class label $c_{i}$ (i.e., $D=\cup^{j}_{i=1}D_{c_{i}}$ ). $D_{c_{i}}(t)$ is the set of sequences in $D_{c_{i}}$ that contain $t$ , where $t$ is a given sequence. The sequences in $D$ ( $D_{c_{i}}$ ) are divided into a training set $TrainD$ ( $TrainD_{c_{i}}$ ) and a testing set $TestD$ ( $TestD_{c_{i}}$ ). The set of all subsequences of $TrainD$ is denoted by $SubTrainD=\left\{t|t\subseteq s,s\in TrainD\right\}$ .
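For very small inputs, $SubTrainD$ can be enumerated directly. The sketch below does so under two assumptions that the definition leaves open: the empty sequence is excluded, and the $maxsize$ bound is applied during enumeration. The function names are hypothetical, and the number of subsequences grows exponentially with sequence length, so this is only an illustration of the definition.

```python
from itertools import combinations


def subsequences(s, maxsize):
    """All distinct non-empty subsequences of s with length <= maxsize."""
    found = set()
    for r in range(1, min(maxsize, len(s)) + 1):
        for idx in combinations(range(len(s)), r):
            found.add(tuple(s[i] for i in idx))
    return found


def sub_train_d(train, maxsize):
    """Union of the bounded subsequence sets of all training sequences."""
    result = set()
    for s in train:
        result |= subsequences(s, maxsize)
    return result


candidates = sub_train_d(["abc", "ab"], maxsize=2)
```

Practical pattern-based methods avoid this enumeration by mining only the subsequences that satisfy their constraints.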

As shown in Fig. 1, we present a reference-based sequence classification framework. It is composed of three major phases: reference point selection, feature value generation, model construction and prediction. In the following, we will elaborate on each step in detail.

III-A Reference Point Selection

In the first stage of the presented framework, a reference point selection procedure is performed to generate a set of pivot sequences. As shown in Fig. 2, this procedure can be further divided into three steps: alphabet extraction, candidate set generation and pivot sequence selection.

In the first step, we scan the training set $TrainD$ to extract the alphabet $I$ that is composed of distinct items. Note that there can be some items that only appear in the testing set $TestD$ . In the forthcoming paragraphs, we will see that this extreme case does not affect our subsequent steps.

In the second step, we generate the set of candidate reference sequences $CR$ from the alphabet $I$ . Note that any sequence over $I$ can be a member of $CR$ . In other words, $CR$ can be an infinite set. In practice, some constraints will be imposed on the potential members of $CR$ . For instance, pattern-based methods only consider subsequences of $TrainD$ as members of $CR$ under our framework, which will be further discussed in Section IV. Furthermore, the use of different construction methods for building the candidate set $CR$ will lead to the generation of many new feature-based sequence classification methods.

In the third step, we select a subset of sequences $R$ from $CR$ as the landmark sequences for generating features. That is, each reference sequence will correspond to a transformed feature. The critical issue in this step is how to design an effective pivot sequence selection method. To date, existing pattern-based methods typically utilize some simple criteria to conduct the reference sequence selection task. For example, those methods based on frequent subsequences use the minimal support constraint as the criterion for reference sequence selection. Apparently, many new and interesting pivot sequence selection methods remain unexplored under our framework. In the subsequent paragraphs of this subsection, we will list some commonly used criteria for selecting reference sequences from the set of candidate pivot sequences.

Constraint 1.

( $Gap\ constraint$ [11]). Given two sequences $s=\left\langle s_{1},s_{2},...,s_{l}\right\rangle$ and $t=\left\langle t_{1},t_{2},...,t_{r}\right\rangle$ , if $t$ is the subsequence of $s$ such that $t_{1}=s_{i_{1}},t_{2}=s_{i_{2}},...,t_{r}=s_{i_{r}}$ , the $gap$ between $i_{k}$ and $i_{k+1}$ is defined as $Gap(s,i_{k},i_{k+1})=i_{k+1}-i_{k}-1$ . Given two thresholds $mingap$ and $maxgap$ ( $0\leq mingap\leq maxgap$ ), if $mingap\leq Gap(s,i_{k},i_{k+1})\leq maxgap$ ( $1\leq k\leq r-1$ ), then the occurrence of $t$ in $s$ fulfills the $gap\ constraint$ .
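To make the gap constraint concrete, the sketch below checks a single occurrence (given by its positions $i_{1}<...<i_{r}$ ) and also tests whether at least one occurrence of $t$ in $s$ fulfills the constraint. The function names are hypothetical, and the existence test is only one possible reading of how the constraint is applied to a whole sequence.

```python
def satisfies_gap(indices, mingap, maxgap):
    """Check one occurrence, given its positions i_1 < ... < i_r in s."""
    return all(mingap <= j - i - 1 <= maxgap
               for i, j in zip(indices, indices[1:]))


def has_gap_occurrence(t, s, mingap, maxgap):
    """Return True if some occurrence of t in s fulfills the gap constraint."""
    if not t:
        return True
    # ends: 0-based positions of s where a valid occurrence of the
    # current prefix of t can end.
    ends = {i for i, x in enumerate(s) if x == t[0]}
    for item in t[1:]:
        ends = {j for j, x in enumerate(s)
                if x == item
                and any(mingap <= j - i - 1 <= maxgap for i in ends)}
        if not ends:
            return False
    return bool(ends)


# <a, c, e> occurs in <a, b, c, d, e> at positions 1, 3, 5 (gaps 1 and 1).
print(satisfies_gap([1, 3, 5], mingap=0, maxgap=1))
print(has_gap_occurrence("ace", "abcde", mingap=0, maxgap=0))
```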

Constraint 2.

( $Minsup\ constraint$ [12]). Given a set of sequences $D_{c_{i}}$ with the class label $c_{i}$ and a sequence $t$ , $count_{D_{c_{i}}}(t)$ is used to denote the number of sequences in $D_{c_{i}}$ that contain $t$ as a subsequence. The $support$ of $t$ in $D_{c_{i}}$ is defined as $sup_{D_{c_{i}}}(t)=\frac{count_{D_{c_{i}}}(t)}{|D_{c_{i}}|}$ . Given a positive threshold $minsup$ , if $sup_{D_{c_{i}}}(t)\geq minsup$ , then $t$ satisfies the $minsup\ constraint$ and $t$ is a frequent sequential pattern in $D_{c_{i}}$ .
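As an illustration of the $minsup\ constraint$ , the following sketch computes $sup_{D_{c_{i}}}(t)$ for a small class-specific data set and compares it with a threshold. The helper names and the toy data are our own assumptions.

```python
def is_subsequence(t, s):
    """Return True if t is a (not necessarily contiguous) subsequence of s."""
    it = iter(s)
    return all(item in it for item in t)


def support(t, class_sequences):
    """Fraction of sequences in D_{c_i} that contain t as a subsequence."""
    count = sum(is_subsequence(t, s) for s in class_sequences)
    return count / len(class_sequences)


def is_frequent(t, class_sequences, minsup):
    """Check the minsup constraint for t in D_{c_i}."""
    return support(t, class_sequences) >= minsup


d_c1 = ["abc", "acd", "bd"]      # toy D_{c_1} with three sequences
print(support("ac", d_c1))       # two of the three sequences contain <a, c>
print(is_frequent("ac", d_c1, minsup=0.5))
```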

Constraint 3.

( $Mindisc\ constraint$ [48]). Given two class labels $c_{1}$ and $c_{2}$ , a sequence $t$ is said to be a discriminative pattern if it is over-expressed on $D_{c_{1}}$ against $D_{c_{2}}$ (or vice versa). To evaluate the discriminative power, many measures/functions have been proposed in the literature [48]. If the discriminative function value of $t$ passes the given threshold(s), then it satisfies the $mindisc\ constraint$ . Here we list some measures that have been used for selecting discriminative patterns in sequence classification.

•

Discriminative Function (DF) 1 [12]:

[TABLE]

where $minsup$ is a given $support$ threshold.

•

Discriminative Function (DF) 2 [11]:

[TABLE]

where $occ_{D_{c_{1}}}(t)=\frac{occount_{D_{c_{1}}}(t)}{|D_{c_{1}}|}$ and $mincount$ is a given threshold. The $occount_{D_{c_{1}}}(t)$ is the number of non-overlapping occurrences of $t$ in $D_{c_{1}}$ .

•

Discriminative Function (DF) 3 [12]:

[TABLE]

•

Discriminative Function (DF) 4 [11]:

[TABLE]

where

[TABLE]

and $Occ_{within}$ is defined as:

[TABLE]

•

Discriminative Function (DF) 5 [30]:

[TABLE]

where $GR(t,c_{1},c_{2})=\frac{sup_{c_{1}}(t)}{sup_{c_{2}}(t)}$ is the $GrowthRate$ of $t$ , $minGR$ is a given $GrowthRate$ threshold. $Sig_{con}(t,c_{1},c_{2})=min_{q\in Q}\left\{\frac{GR(t,c_{1},c_{2})}{GR(q,c_{1},c_{2})}\right\}$ is used to describe the conditional redundancy, where $Q$ is the set of discriminative sub-patterns of $t$ , $minSig$ is a given threshold.

•

Discriminative Function (DF) 6 [26]:

The chi-squared test is used as the discriminative function to check if the candidate sequence is correlated with at least one class that it is frequent in.

Constraint 4.

( $Uniqueness\ constraint$ [11]). A sequence is said to satisfy the $uniqueness\ constraint$ if all its items are unique.

Constraint 5.

( $Closedness\ constraint$ [19]). A sequence $t$ is said to satisfy the $closedness\ constraint$ if no proper supersequence of $t$ (i.e., no sequence $t^{\prime}$ with $t\subset t^{\prime}$ ) has the same $support$ as $t$ .

Constraint 6.

( $Redundancy\ constraint$ [26]). A sequence $t$ is said to satisfy the $redundancy\ constraint$ if $conf(t)\geq\frac{|D_{c_{i}}|}{|D|}$ , where $conf(t)=\frac{count_{D_{c_{i}}}(t)}{count_{D}(t)}$ is the $confidence$ of $t$ .

Constraint 7.

( $Interestingness\ constraint$ [5]). Given a set of sequences $D_{c_{i}}$ with class label $c_{i}$ , two sequences $s=\left\langle s_{1},s_{2},...,s_{l}\right\rangle$ and $t=\left\langle t_{1},t_{2},...,t_{r}\right\rangle$ , if $t$ is the subsequence of $s$ such that $t_{1}=s_{i_{1}},t_{2}=s_{i_{2}},...,t_{r}=s_{i_{r}}$ , $I_{c_{i}}(t)=sup_{D_{c_{i}}}(t)\times C_{c_{i}}(t)$ is used to denote the $interestingness$ of $t$ , where $C_{c_{i}}(t)=\frac{|t|}{\overline{W_{c_{i}}}(t)}$ is the $cohesion$ of $t$ in $D_{c_{i}}(t)$ , $\overline{W_{c_{i}}}(t)=\frac{\sum_{s\in D_{c_{i}}(t)}W(t,s)}{count_{D_{c_{i}}}(t)}$ and $W(t,s)=min\left\{i_{r}-i_{1}+1|i_{1}\leq i_{r}\right\}$ . And the $cohesion$ of $t$ in a sequence $s$ is $C(t,s)=\frac{|t|}{W(t,s)}$ . Given two thresholds $minsup$ and $minint$ , if $sup_{D_{c_{i}}}(t)\geq minsup$ and $I_{c_{i}}(t)\geq minint$ , then $t$ satisfies the $interestingness\ constraint$ .
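The quantities in the $interestingness\ constraint$ can be computed directly from their definitions. The sketch below is a straightforward, unoptimized illustration that assumes $t$ is non-empty; the function names are hypothetical.

```python
def min_window(t, s):
    """W(t, s): length of the shortest window of s that contains t as a
    subsequence, or None if t does not occur in s. Assumes t is non-empty."""
    best = None
    for start, x in enumerate(s):
        if x != t[0]:
            continue
        k, end = 1, start
        # Greedily match the remaining items as early as possible; for a
        # fixed start this yields the earliest possible end position.
        for j in range(start + 1, len(s)):
            if k == len(t):
                break
            if s[j] == t[k]:
                k, end = k + 1, j
        if k == len(t):
            width = end - start + 1
            best = width if best is None else min(best, width)
    return best


def interestingness(t, class_sequences):
    """I_{c_i}(t) = support of t in D_{c_i} times its cohesion."""
    widths = [w for w in (min_window(t, s) for s in class_sequences)
              if w is not None]
    if not widths:
        return 0.0
    sup = len(widths) / len(class_sequences)
    mean_width = sum(widths) / len(widths)      # average W over D_{c_i}(t)
    return sup * len(t) / mean_width


print(min_window("ab", "axxbab"))   # the occurrence at the end is tightest
print(interestingness("ac", ["abc", "ac", "bd"]))
```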

Constraint 8.

( $Level\ constraint$ [17]). Given a sequence $t$ and a set of sequences $D$ with $j$ classes, a sequential classification rule $\pi$ is denoted as $\pi:t\to count_{D_{c_{1}}}(t),count_{D_{c_{2}}}(t),...,count_{D_{c_{j}}}(t)$ , where $t$ is the body of the rule. From a Bayesian point of view, to choose the best rule is equivalent to maximizing $p(\pi|D)=\frac{p(\pi,D)}{p(D)}=\frac{p(\pi)\times p(D|\pi)}{p(D)}$ , where $p(D)$ is a constant, $cost(\pi)=-\log(p(\pi)\times p(D|\pi))$ is used as the evaluation criterion, and the normalized criterion $level$ is defined as $level(\pi)=1-\frac{cost(\pi)}{cost(\pi_{\emptyset})}$ , in which $cost(\pi_{\emptyset})$ is the cost of the null model when the sequence body is empty. If $0<level(\pi)\leq 1$ , then $t$ satisfies the $level\ constraint$ .

III-B Feature Value Generation

In the second stage of the presented framework, a similarity function is used to generate vectorial representations for all sequences in both training data and testing data. As shown in the left part of Fig. 3, this procedure can be further divided into two steps: (1) calculating the similarities between training instances and reference points; (2) calculating the similarities between testing instances and reference points.

In the first step, we utilize a similarity function to transform $TrainD$ into a vectorial training set $TrainD^{\prime}$ by calculating the similarity between each sequence in $TrainD$ and every reference point in $R$ . Each similarity value will be used as the corresponding feature value. The critical issue in this step is how to choose a suitable similarity function. Note that the selection of the similarity function is arbitrary. In other words, any feasible similarity function can be used in this step. In fact, many existing feature-based methods utilize a boolean function as the similarity function, which outputs 1 as the feature value if the reference point is a subsequence of the target sequence and 0 otherwise.

In the second step, we use the same similarity function to transform $TestD$ into a vectorial testing set $TestD^{\prime}$ . Note that the number of features in the transformed vectorial data set is $|R|$ , which is the number of reference points.
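The two transformation steps above amount to evaluating one similarity function on every (sequence, reference point) pair. A minimal sketch follows, using the boolean subsequence test as the similarity function; `to_feature_vectors` and `boolean_similarity` are illustrative names, not part of any published implementation.

```python
def boolean_similarity(s, t):
    """1 if the reference point t is a subsequence of s, and 0 otherwise."""
    it = iter(s)
    return int(all(item in it for item in t))


def to_feature_vectors(sequences, reference_points, similarity):
    """Map each sequence to a vector with one feature per reference point."""
    return [[similarity(s, r) for r in reference_points] for s in sequences]


R = ["ab", "cd", "e"]                       # toy reference points
train_d = ["abcd", "cde"]
test_d = ["abe"]
train_vectors = to_feature_vectors(train_d, R, boolean_similarity)
test_vectors = to_feature_vectors(test_d, R, boolean_similarity)
print(train_vectors)
print(test_vectors)
```

Because the same function and the same set $R$ are applied to $TrainD$ and $TestD$ , both transformed data sets have exactly $|R|$ columns.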

The similarity function plays an important role in generating feature values. Accordingly, it will have a great impact on the prediction result. For the purpose of summarizing existing research efforts under our framework with respect to the similarity function, here we list some similarity functions between two sequences $s$ and $t$ that have been deployed in the literature.

•

Similarity Function (SF) 1 [26]:

$$SF_{1}(s,t)=\begin{cases}1,&\text{if }t\subseteq s\\0,&\text{otherwise}\end{cases}$$

•

Similarity Function (SF) 2 [12]:

[TABLE]

In Equation (III.7), $similar$ means $ed(\alpha,t)\leq\gamma\times|t|$ ( $|s|\geq|t|$ ), where $ed(\alpha,t)$ is the $edit\ distance$ between $\alpha$ and $t$ (the minimum number of operations needed to transform $\alpha$ into $t$ , where an operation can be the insertion, deletion, or substitution of a single item), and $\alpha$ is a contiguous subsequence of $s$ with $|t|$ items, which is extracted by using a sliding window of length $|t|$ that starts from the first element of $s$ . If $\alpha$ and $t$ are not $similar$ , then the sliding window will be repeatedly shifted one position to the right until $|s|-|t|+1$ subsequences have been checked or a new subsequence $\alpha$ $similar$ to $t$ is encountered. $\gamma$ is a given $maximum\ difference$ threshold.
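The sliding-window test for $similar$ described above can be sketched as follows. Only the predicate is implemented here; how its outcome is mapped to a feature value is not assumed. The function names are our own.

```python
def edit_distance(a, b):
    """Minimum number of single-item insertions, deletions and substitutions
    needed to transform a into b (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def has_similar_window(s, t, gamma):
    """True if some contiguous window alpha of s with |t| items satisfies
    ed(alpha, t) <= gamma * |t|; windows are scanned from left to right."""
    if len(s) < len(t):
        return False
    for start in range(len(s) - len(t) + 1):
        alpha = s[start:start + len(t)]
        if edit_distance(alpha, t) <= gamma * len(t):
            return True  # stop at the first similar window
    return False


print(has_similar_window("abcdef", "bxd", gamma=0.34))
```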

•

Similarity Function (SF) 3 [5]:

[TABLE]

where $C(t,s)$ is the $cohesion$ of $t$ in the sequence $s$ .

•

Similarity Function (SF) 4 [18]:

[TABLE]

where $occnum$ is the number of occurrences of $t$ in $s$ .

•

Similarity Function (SF) 5 [11]:

[TABLE]

where $occount_{s}(t)$ is the number of non-overlapping occurrences of $t$ in $s$ .

•

Similarity Function (SF) 6 [19]:

[TABLE]

where $|LCS(s,t)|$ is the length of the longest common subsequence of $s$ and $t$ , and $|s|$ and $|t|$ are the lengths of $s$ and $t$ respectively.

III-C Model Construction and Prediction

In the third stage of the presented framework, we construct a prediction model to make predictions. As shown in the right part of Fig. 3, this procedure can be further divided into three steps: model construction, prediction and classification result generation.

In the first step, an existing vectorial data classification method is used to construct a prediction model from the vectorial training set $TrainD^{\prime}$ since we have transformed training sequences into feature vectors in the second stage. Numerous classification methods have been designed for classifying feature vectors (e.g. support vector machines and decision trees) [4, 49]. After training a classifier with $TrainD^{\prime}$ , the prediction model is ready for classifying unknown samples.

In the second step, we forward the vectorial testing set $TestD^{\prime}$ to the classifier to make predictions. In the third step, we output the prediction result and compute the classification accuracy by comparing the predicted class labels with the ground-truth labels.
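Once sequences have been turned into feature vectors, any vectorial classifier can be plugged in. As a dependency-free illustration, the sketch below uses a 1-nearest-neighbor rule as a stand-in for the classifiers mentioned above and computes the classification accuracy; the function names and the toy vectors are assumptions made for this example only.

```python
def predict_1nn(train_x, train_y, x):
    """Return the label of the training vector closest to x
    (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[distances.index(min(distances))]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test vectors whose predicted label matches the truth."""
    hits = sum(predict_1nn(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


train_x = [[1.0, 0.0], [0.0, 1.0]]          # toy TrainD' with two features
train_y = ["c1", "c2"]
test_x = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
test_y = ["c1", "c2", "c2"]
print(accuracy(train_x, train_y, test_x, test_y))
```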

IV General Framework for Feature-Based Classification

In this section, we show that many existing feature-based sequence classification algorithms can be reformulated within the presented reference-based framework. The differences between these algorithms mainly lie in the selection of reference points and similarity functions. As summarized in Table I, we can categorize these existing methods according to three criteria: (1) How to construct the candidate set of reference points? (2) How to choose a set of reference points? (3) Which similarity function should be used? Note that the definitions and notations for different constraints and similarity functions have been presented in Section III-A and Section III-B. From Table I, we have the following observations.

First of all, any sequence over the alphabet can be a potential member of the candidate set of reference points $CR$ . However, all feature-based sequence classification algorithms in Table I use $SubTrainD$ to construct $CR$ since the idea of using subsequences as features is quite natural with a good interpretability. Although $SubTrainD$ is a finite set, its size is still very large and most sequences in $SubTrainD$ are useless and redundant for classification. Therefore, it is necessary to explore alternative methods for constructing the set of candidate reference points. For instance, we may use all original sequences in $TrainD$ to construct $CR$ , so that the size of $CR$ will be greatly reduced and the corresponding features may be more representative.

Second, many sequence selection criteria have been proposed to select $R$ from $CR$ , such as $minsup$ and $mindisc$ . The main objective of applying these criteria is to select a subset of sequences that can generate good features for building the classifier. However, it is not an easy task to set suitable thresholds for these constraints to produce a set of reference sequences with moderate size. More importantly, most of these constraints are proposed from the literature of sequential pattern mining, which may be only applicable to the selection of reference sequences from $SubTrainD$ . In other words, more general reference point selection strategies should be developed.

Last, the most widely used similarity function in Table I is SF 1, which is a boolean function based on whether the reference point is a subsequence of the sequence in $TrainD$ . Although some non-boolean functions have been used, the potential of utilizing more elaborate similarity functions between two sequences still needs further investigation.

Overall, our reference-based sequence classification framework is quite generic, in which many existing pattern-based sequence classification methods can be reformulated as its special variants. Meanwhile, there are still many limitations in current research efforts under this framework. Hence, new and effective sequence classification methods should be developed towards this direction.

V New Variants under the Framework

In addition to encompassing existing pattern-based methods, this framework can also be used as a general platform to design new feature-based sequence classification methods.

As discussed in Section IV, there are three key ingredients in our framework: the construction of the candidate reference point set, the selection of reference points and the selection of similarity function. Obviously, we will generate a “new” sequence classification algorithm based on an unexplored combination of these three components. In view of the fact that the number of possible combinations is quite large, it is infeasible to enumerate all these variants. Instead, we will only present two variants that are quite different from existing algorithms to demonstrate the advantage of this framework.

V-A The Use of Training Set as the Candidate Set

Under our framework, all previous pattern-based sequence classification methods utilize the set $SubTrainD$ as the candidate reference point set $CR$ in the first step. One limitation of this strategy is that the actual size of $CR$ will be very large. As a result, it poses great challenges for the reference point selection task in the subsequent step. To alleviate these issues, we propose to use all original sequences in $TrainD$ to construct the set of candidate reference points. The rationale for this candidate set construction method is based on the following observations.

Firstly, all information given for building the classifier is contained in the original training set. In other words, we will not lose any relevant information for the classification task if $TrainD$ is used as the candidate set of reference sequences. In fact, the widely used candidate set $SubTrainD$ is derived from $TrainD$ .

Secondly, even if we use all the training sequences in $TrainD$ as the reference points, the transformed vectorial data will be a $|TrainD|\times|TrainD|$ table. That is, the number of features is still no larger than the number of samples. Therefore, we do not need to analyze an HDLSS (high-dimension, low-sample-size) data set during the classification stage. In contrast, the number of features may be much larger than the number of samples in the vectorial data obtained from $SubTrainD$ if the parameters are not properly specified during the reference point selection procedure. In fact, we have tested the performance when all training sequences are used as reference points. The experimental results show that this quite simple idea is able to achieve comparable performance in terms of classification accuracy.

Finally, the same idea has been employed in the literature of time series classification [42, 43]. Its success motivates us to investigate the feasibility and advantage in the context of discrete sequence classification.

V-B Two Reference Point Selection Methods

To select reference sequences from $TrainD$ , those existing constraints proposed in the context of sequential pattern mining are not applicable. Therefore, we have to develop new algorithms to choose a subset of representative reference sequences from $TrainD$ . To this end, two different reference sequence selection methods are presented. The first one is an unsupervised method, which selects reference sequences based on cluster analysis without considering the class label information. The second one is a supervised method, which evaluates each candidate sequence according to its discriminative ability across different classes. In the following two sub-sections, we will present the details of these two reference point selection algorithms.

V-B1 Unsupervised Reference Point Selection

As we have discussed in Section V-A, we may choose all sequences in the training set as reference points. However, the number of features in the transformed vectorial data can still be very large if the number of training instances is large. The selection of a small subset of representative training sequences as reference points will greatly reduce the computational burden in the subsequent stage. One natural idea is to divide the training sequences in $CR$ into different clusters using a clustering algorithm [50]. Then, we can select a representative sequence from each cluster as the reference point.

To date, many algorithms have been presented for clustering discrete sequences (e.g. [51]). We can simply adopt an existing sequence clustering algorithm in our pipeline. Here we choose the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm [52] to fulfill the sequence clustering task. This algorithm is used because it can often generate a high-quality clustering result and can handle any form of similarity measure.

In the following, we will describe the details of the reference point selection method based on GAHC.

In the first stage, the $i$ -th sequence in $CR$ will form a cluster $C_{i}$ .

In the second stage, a similarity function is used to calculate the similarity between each pair of clusters to produce a similarity matrix $Sim$ , where $Sim[i,j]$ is the similarity between the two clusters $C_{i}$ and $C_{j}$ . Many similarity measures have been presented for sequential data (e.g. [53]). Here we choose the Jaccard coefficient. More specific details on the similarity function will be discussed in Section V-C.

In the third stage, we first search the similarity matrix $Sim$ to identify the maximum value $maxSim$ , which corresponds to the most similar pair of clusters $C_{k}$ and $C_{l}$ . Then, these two clusters are merged to form a new cluster $C_{k}$ and the number of clusters in total is decreased by 1. Meanwhile, the entries related to $C_{l}$ in $Sim$ are set to be 0 and $Sim$ is updated by recalculating the similarity between $C_{k}$ and each of the remaining clusters. The similarity between the newly generated cluster and each of the remaining clusters is calculated as the average similarity between all members in the two clusters since we use the $group$ - $average$ method. We repeat the third stage until the number of clusters is equal to the number of reference points we want to select.

In the last stage, we select a representative sequence from each cluster. For each cluster, any sequence in this cluster can be used as a representative. To provide a consistent and deterministic output, we use the sequence with the minimum subscript in the cluster as the reference point.
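The four stages above can be sketched compactly as follows. This is a naive illustration that recomputes group-average similarities at every merge instead of maintaining the matrix $Sim$ incrementally; the function names are hypothetical, and the item-set Jaccard used in the usage example is only a placeholder similarity.

```python
def select_by_gahc(candidates, similarity, num_points):
    """Pick one reference sequence per cluster with group-average
    agglomerative clustering."""
    if num_points < 1:
        raise ValueError("num_points must be at least 1")
    n = len(candidates)
    # Pairwise similarities between the individual sequences.
    sim = [[similarity(candidates[i], candidates[j]) for j in range(n)]
           for i in range(n)]
    # Stage 1: each sequence forms its own cluster, keyed by its index.
    clusters = {i: [i] for i in range(n)}

    def group_average(a, b):
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(sim[i][j] for i, j in pairs) / len(pairs)

    # Stages 2-3: repeatedly merge the most similar pair of clusters.
    while len(clusters) > num_points:
        keys = sorted(clusters)
        k, l = max(((a, b) for pos, a in enumerate(keys)
                    for b in keys[pos + 1:]),
                   key=lambda pair: group_average(*pair))
        clusters[k].extend(clusters.pop(l))   # k < l, so key k is kept
    # Stage 4: the member with the minimum index represents each cluster.
    return [candidates[min(members)]
            for _, members in sorted(clusters.items())]


def item_jaccard(s, t):
    """Placeholder similarity: Jaccard coefficient on the sets of items."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b)


print(select_by_gahc(["aaa", "aab", "xyz", "xyy"], item_jaccard, 2))
```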

V-B2 Supervised Reference Point Selection

To choose a subset of representative reference sequences from $TrainD$ , we can also employ a supervised method in which the class label information is utilized. As we have discussed in Section IV, different $mindisc$ constraints have been widely used to evaluate the discriminative power of sequential patterns. Unfortunately, these constraints are only applicable to the selection of reference points from $SubTrainD$ . In addition, it is not an easy task to set suitable thresholds to control the number of selected reference points. In order to overcome these limitations, we present a reference point selection method based on hypothesis testing, in which the statistical significance in terms of $p$ -value is used to assess the discriminative power of each candidate sequence.

Hypothesis testing is a commonly used method in statistical inference. The usual line of reasoning is as follows: first, formulate the null hypothesis and the alternative hypothesis; second, select an appropriate test statistic; third, set a significance level threshold; finally, reject the null hypothesis if and only if the $p$ -value is less than the significance level threshold, where the $p$ -value is the probability of getting a value of the test statistic that is at least as extreme as what is actually observed on condition that the null hypothesis is true.

In order to assess the discriminative power of each candidate sequence in terms of $p$ -value, we can use the null hypothesis that this sequence does not belong to any class and all sequences from different classes are drawn from the same population. If the above null hypothesis is true, then the similarities between the candidate sequence and training sequences are drawn from the same population. Therefore, we can formulate the corresponding hypothesis testing problem as a two-sample testing problem [54], where one sample is the set of similarities between the candidate sequence and the training sequences from one target class and another sample is the set of similarities between the candidate sequence and the training sequences from the remaining classes.

Since we test all candidate sequences in $CR$ at the same time, it is actually a multiple hypothesis testing problem. If no multiple testing correction is conducted, then the number of false positives among reported reference sequences may be very high. To tackle this problem, we adopt the BH procedure to control the FDR (False Discovery Rate) [55], which is the expected proportion of false positives among all reported sequences.

The reference point selection method based on MHT (Multiple Hypothesis Testing) is shown in Algorithm 1. In the following, we will elaborate on this algorithm in detail.

In the first stage (step 1-4), we select a set of sequences $D_{c_{i}}$ with the class label $c_{i}$ from $CR$ , then we regard $D_{c_{i}}$ as the positive data set $D_{+}$ and use the set of all remaining sequences in $CR$ as the negative data set $D_{-}$ .

In the second stage (step 5-17), for each sequence $S_{k}$ in $D_{+}$ , a similarity function is used to calculate the similarity between $S_{k}$ and each sequence in $D_{+}$ and $D_{-}$ , where the similarity function is the same as that used in Section V-B1 and $Sim[k,j]$ is the similarity between the two sequences $S_{k}$ and $S_{j}$ . Then, the Mann-Whitney U test [56] is used to calculate the $p$ -value based on the two similarity sets $Sim_{+}$ and $Sim_{-}$ .

In the third stage (step 18-27), the BH method first sorts sequences in $D_{+}$ according to their corresponding $p$ -value in an ascending order, i.e., $D_{+}=\left\{S_{1},S_{2},...,S_{|D_{+}|}\right\}$ ( $S_{1}.pvalue\leq S_{2}.pvalue\leq...\leq S_{|D_{+}|}.pvalue$ ). Then, we sequentially search $D_{+}$ to identify the maximal sequence index $maxindex$ which satisfies the condition that $S_{k}.pvalue\leq\alpha\frac{k}{|D_{+}|}$ , where $\alpha$ is the significance level threshold. Those sequences whose indices are larger than $maxindex$ will be removed from $D_{+}$ .

In the last stage (step 28-30), we select all sequences from $D_{+}$ as reference points. The whole process will be terminated after each set of sequences from every class has been regarded as $D_{+}$ .
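The selection procedure described above can be sketched as follows. This is not a transcription of Algorithm 1: the two-sample test is passed in as a callable `two_sample_pvalue` (a hypothetical parameter; a Mann-Whitney U test implementation would be supplied in practice), each candidate is also compared with itself when building $Sim_{+}$ (an assumption), and all function names are our own.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of the hypotheses kept by the BH procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # ascending p-values
    max_index = 0
    for rank, i in enumerate(order, 1):
        if pvalues[i] <= alpha * rank / m:
            max_index = rank             # largest rank meeting the condition
    return sorted(order[:max_index])


def select_by_mht(train, similarity, two_sample_pvalue, alpha=0.05):
    """train is a list of (sequence, label) pairs; returns reference points."""
    selected = []
    for c in sorted({label for _, label in train}):
        positives = [s for s, label in train if label == c]    # D_+
        negatives = [s for s, label in train if label != c]    # D_-
        pvalues = []
        for candidate in positives:
            sim_pos = [similarity(candidate, s) for s in positives]
            sim_neg = [similarity(candidate, s) for s in negatives]
            pvalues.append(two_sample_pvalue(sim_pos, sim_neg))
        kept = benjamini_hochberg(pvalues, alpha)
        selected.extend(positives[i] for i in kept)
    return selected


print(benjamini_hochberg([0.001, 0.02, 0.03, 0.9], alpha=0.05))
```

Note that BH is a step-up procedure: a sequence whose $p$ -value exceeds its own threshold is still kept if a sequence with a larger rank satisfies the condition.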

V-C Similarity Function

In order to measure the similarity between two sequences, we choose the Jaccard coefficient as the similarity function in our method. The larger the Jaccard coefficient between the two sequences is, the more similar they are.

Given two sequences $s=\left\langle s_{1},s_{2},...,s_{l}\right\rangle$ and $t=\left\langle t_{1},t_{2},...,t_{r}\right\rangle$ , the Jaccard coefficient is defined as:

$$Jaccard(s,t)=\frac{|s\cap t|}{|s\cup t|}=\frac{|s\cap t|}{|s|+|t|-|s\cap t|}$$

where $|s\cap t|$ is the number of items in the intersection of $s$ and $t$ . However, this may lose the order information of sequences. To alleviate this issue, we use the LCS (Longest Common Subsequence) between $s$ and $t$ to replace $s\cap t$ . Then, the Jaccard coefficient is redefined as:

$$Jaccard(s,t)=\frac{|LCS(s,t)|}{|s|+|t|-|LCS(s,t)|}$$

Example 1.

Given two sequences $s=\left\langle a,b,c,d,e\right\rangle$ and $t=\left\langle e,c,d,c\right\rangle$ , the $LCS(s,t)$ is $\left\langle c,d\right\rangle$ , then the modified Jaccard coefficient is

$$Jaccard(s,t)=\frac{2}{5+4-2}=\frac{2}{7}\approx 0.286$$
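Example 1 can be reproduced with a short dynamic-programming routine. The sketch assumes that the modified Jaccard coefficient has the form $\frac{|LCS(s,t)|}{|s|+|t|-|LCS(s,t)|}$ , i.e., the usual Jaccard coefficient with the intersection size replaced by the LCS length; the function names are illustrative.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for x in s:
        cur = [0]
        for j, y in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if x == y
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_jaccard(s, t):
    """Assumed form of the modified Jaccard coefficient."""
    common = lcs_length(s, t)
    denominator = len(s) + len(t) - common
    # Convention chosen here: two empty sequences get similarity 0.
    return common / denominator if denominator else 0.0


s, t = "abcde", "ecdc"
print(lcs_length(s, t))     # 2, matching LCS(s, t) = <c, d>
print(lcs_jaccard(s, t))    # 2 / 7
```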

Note that we can also use other similarity functions in the literature, such as those methods summarized and reviewed in [53]. The choice of a more appropriate similarity function may yield better performance than the modified Jaccard coefficient. In order to check the effect of similarity function on the classification performance, we also consider the following two alternative similarity functions.

The first one is the String Subsequence Kernel (SSK) [36]. The main idea of SSK is to compare two sequences by means of the subsequences they contain in common. That is, the more subsequences in common, the more similar they are.

Given two sequences $s=\left\langle s_{1},s_{2},...,s_{l}\right\rangle$ and $t=\left\langle t_{1},t_{2},...,t_{r}\right\rangle$ and a parameter $n$ , the SSK is defined as:

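$$K_{n}(s,t)=\sum_{u\in I^{n}}\phi_{u}(s)\,\phi_{u}(t),\qquad\phi_{u}(s)=\sum_{\text{occurrences of }u\text{ in }s}\lambda^{l_{s}(u)}$$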

where $\phi_{u}(s)$ is the feature mapping for the sequence $s$ and each $u\in I^{n}$ , $I$ is a finite alphabet, $I^{n}$ is the set of all subsequences of length $n$ and $u$ is a subsequence of $s$ such that $u_{1}=s_{i_{1}},u_{2}=s_{i_{2}},...,u_{n}=s_{i_{n}}$ , $l_{s}(u)=i_{n}-i_{1}+1$ is the length of $u$ in $s$ , $\lambda\in(0,1)$ is a decay factor which is used to penalize the gap. The calculation steps are as follows: enumerate all subsequences of length $n$ , compute the feature vectors for the two sequences, and then compute the similarity. The normalized kernel value is given by

$$\hat{K}_{n}(s,t)=\frac{K_{n}(s,t)}{\sqrt{K_{n}(s,s)\,K_{n}(t,t)}}$$

Example 2.

Given two sequences $s=\left\langle a,b,c,d,e\right\rangle$ and $t=\left\langle e,c,d,c\right\rangle$ , the subsequences of length 1 ( $n$ =1) are $a,b,c,d,e$ . The corresponding feature vector for each of the sequences can be denoted as $\phi_{1}(s)=\left\langle\lambda,\lambda,\lambda,\lambda,\lambda\right\rangle$ and $\phi_{1}(t)=\left\langle 0,0,2\lambda,\lambda,\lambda\right\rangle$ , then the normalized kernel value is

$$\frac{4\lambda^{2}}{\sqrt{5\lambda^{2}\times 6\lambda^{2}}}=\frac{4}{\sqrt{30}}\approx 0.730$$

When this function is employed in our method, $n$ = 1 is used as the default parameter setting. Although the setting of $n$ = 1 may lose the order information, it will greatly reduce the computational cost and can provide satisfactory results in practice.
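For $n$ = 1 the kernel reduces to a comparison of item counts, since every occurrence of a single item has length 1 and contributes $\lambda$ to the feature value. The sketch below reproduces Example 2 under the assumption that the kernel is normalized as $K(s,t)/\sqrt{K(s,s)K(t,t)}$ ; the function names and the default value of `lam` are our own choices.

```python
import math
from collections import Counter


def ssk_n1(s, t, lam=0.5):
    """String subsequence kernel for n = 1: phi_u(s) = lam * (count of u in s)."""
    count_s, count_t = Counter(s), Counter(t)
    shared_items = count_s.keys() & count_t.keys()
    return sum((lam * count_s[u]) * (lam * count_t[u]) for u in shared_items)


def normalized_ssk_n1(s, t, lam=0.5):
    """Assumed normalization: K(s, t) / sqrt(K(s, s) * K(t, t))."""
    denominator = math.sqrt(ssk_n1(s, s, lam) * ssk_n1(t, t, lam))
    return ssk_n1(s, t, lam) / denominator if denominator else 0.0


s, t = "abcde", "ecdc"
print(normalized_ssk_n1(s, t))   # 4 / sqrt(30); the decay factor cancels
```

With $n$ = 1 the decay factor cancels in the normalized value, so the result does not depend on $\lambda$ .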

Another alternative similarity function is the normalized LCS. The larger the normalized LCS between two sequences is, the more similar they are.

Given two sequences $s=\left\langle s_{1},s_{2},...,s_{l}\right\rangle$ and $t=\left\langle t_{1},t_{2},...,t_{r}\right\rangle$ , the normalized LCS is defined as:

[TABLE]

Example 3.

Given two sequences $s=\left\langle a,b,c,d,e\right\rangle$ and $t=\left\langle e,c,d,c\right\rangle$ , the $LCS(s,t)$ is $\left\langle c,d\right\rangle$ , then the normalized LCS is

[TABLE]

VI Experiments

To demonstrate the feasibility and advantages of this new framework, we conducted experiments on fourteen real sequential data sets. We compared our two algorithms derived under the reference-based framework with other sequence classification algorithms in terms of classification accuracy. All experiments were conducted on a PC with an Intel(R) Xeon(R) 2.40GHz CPU and 12 GB of memory. All the reported accuracies in the experiments were the average accuracies obtained by repeating the 5-fold cross-validation 5 times, except for SCIP (accuracies for SCIP were obtained using 10-fold cross-validation because this is a fixed setting in the software package provided by its authors).

VI-A Data Sets

We choose fourteen benchmark data sets which are widely used for evaluating sequence classification algorithms: Activity [57], Aslbu [14], Auslan2 [14], Context [58], Epitope [12], Gene [59], News [5], Pioneer [14], Question [60], Reuters [5], Robot [5], Skating [14], Unix [5], Webkb [5]. The main characteristics of these data sets are summarized in Table II, where $|D|$ represents the number of sequences in the data set, #items denotes the number of distinct elements, minl, maxl and avgl are used to denote the minimum length, maximum length and average length of the sequences respectively, and #classes represents the number of distinct classes in the data set.

VI-B Parameter Settings

Our two algorithms are denoted by R-MHT (Reference Point Selection Based on MHT) and R-GAHC (Reference Point Selection Based on GAHC), respectively. In addition, the method that uses all sequences in $TrainD$ as reference points is denoted as R-A, which is also included in the performance comparison. We compare our algorithms with five existing sequence classification algorithms: MiSeRe (http://www.misere.co.nf) [17], Sqn2Vec (https://github.com/nphdang/Sqn2Vec) [41], SCIP (http://adrem.ua.ac.be/sites/adrem.ua.ac.be/files/SCIP.zip) [5], FSP (the algorithm based on frequent sequential patterns) and DSP (the algorithm based on discriminative sequential patterns).

In MiSeRe, $num\_of\_rules$ is specified to be 1024 and $execution\_time$ is set to be 5 minutes for all data sets.

Sqn2Vec is an unsupervised method for learning sequence embeddings from both singleton symbols and sequential patterns. It has two variants: Sqn2VecSEP and Sqn2VecSIM, where Sqn2VecSEP (Sqn2VecSIM) generates sequence representations from singleton symbols and sequential patterns separately (simultaneously). In these two variants, $minsup$ = 0.05, $maxgap$ = 4 and the embedding dimension $d$ is set to be 128 for all data sets.

SCIP is a sequence classification method based on interesting patterns, which has four different variants: SCII_HAR, SCII_MA, SCIS_HAR and SCIS_MA. In the experiments, the following parameter setting is used for all data sets: $minsup$ = 0.05, $minint$ = 0.02, $maxsize$ = 3, $conf$ = 0.5 and $topk$ = 11.

Frequent sequential patterns have been widely used as features in sequence classification. To include the algorithm based on frequent sequential patterns in the comparison (denoted by FSP), we employ the PrefixSpan algorithm [61] as the frequent sequential pattern mining algorithm. The parameters are specified as follows: $maxsize$ = 3 and $minsup$ = 0.3 for all data sets except Context (the $minsup$ in Context is set to be 0.9 in order to avoid the generation of too many patterns).
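To make the FSP baseline concrete, the following minimal sketch builds binary pattern features from frequent subsequences. It is an illustration only: it enumerates subsequences by brute force rather than running PrefixSpan, it assumes that gaps between matched items are allowed, and the function names (`subsequences_up_to`, `frequent_patterns`, `to_binary_features`) and the toy data are hypothetical.

```python
from itertools import combinations

def subsequences_up_to(seq, maxsize):
    """Distinct (possibly gapped) subsequences of seq with length 1..maxsize."""
    found = set()
    for k in range(1, maxsize + 1):
        for idx in combinations(range(len(seq)), k):
            found.add(tuple(seq[i] for i in idx))
    return found

def frequent_patterns(train, minsup, maxsize):
    """Patterns contained in at least a fraction minsup of the training sequences."""
    counts = {}
    for seq in train:
        for pat in subsequences_up_to(seq, maxsize):
            counts[pat] = counts.get(pat, 0) + 1
    threshold = minsup * len(train)
    return sorted(p for p, c in counts.items() if c >= threshold)

def to_binary_features(seq, patterns, maxsize):
    """1 if the sequence contains the pattern, 0 otherwise (one entry per pattern)."""
    present = subsequences_up_to(seq, maxsize)
    return [1 if p in present else 0 for p in patterns]

# Toy data, not taken from the benchmark data sets.
train = [list("abc"), list("abd"), list("acb"), list("bd")]
patterns = frequent_patterns(train, minsup=0.75, maxsize=3)
print(patterns)                                    # [('a',), ('a', 'b'), ('b',)]
print(to_binary_features(list("bd"), patterns, 3)) # [0, 0, 1]
```

The brute-force enumeration is only practical for short sequences and a small $maxsize$; a dedicated miner such as PrefixSpan would be needed for data sets of realistic size.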

Similarly, discriminative sequential patterns are widely used as features in many sequence classification algorithms and applications as well. To include the algorithm based on discriminative sequential patterns in the comparison (denoted by DSP), we first use the PrefixSpan algorithm to mine a set of frequent sequential patterns and then detect discriminative patterns from the frequent pattern set. The parameters for PrefixSpan are identical to those used in FSP and $minGR$ = 3 is used as the threshold for filtering discriminative sequential patterns.
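The filtering step in DSP can be sketched as follows. This section does not spell out how $minGR$ is applied, so the sketch assumes the usual growth-rate definition (support in one class divided by support in the remaining classes) and keeps a pattern if its growth rate reaches the threshold for at least one class. All function names and the toy data are hypothetical.

```python
def is_subsequence(pattern, seq):
    """True if pattern occurs in seq in order, with gaps allowed."""
    it = iter(seq)
    return all(item in it for item in pattern)

def support(pattern, seqs):
    """Fraction of sequences that contain the pattern."""
    if not seqs:
        return 0.0
    return sum(is_subsequence(pattern, s) for s in seqs) / len(seqs)

def growth_rate(pattern, seqs, labels, target):
    """Support inside the target class divided by support in all other classes."""
    inside = [s for s, y in zip(seqs, labels) if y == target]
    outside = [s for s, y in zip(seqs, labels) if y != target]
    sup_in, sup_out = support(pattern, inside), support(pattern, outside)
    if sup_out == 0:
        return float("inf") if sup_in > 0 else 0.0
    return sup_in / sup_out

def discriminative(patterns, seqs, labels, min_gr):
    """Keep patterns whose growth rate reaches min_gr for at least one class."""
    classes = sorted(set(labels))
    return [p for p in patterns
            if max(growth_rate(p, seqs, labels, c) for c in classes) >= min_gr]

# Toy data, not taken from the benchmark data sets.
seqs = [list("abc"), list("abd"), list("cb"), list("cd")]
labels = ["x", "x", "y", "y"]
candidates = [("a", "b"), ("b",), ("c",)]
print(discriminative(candidates, seqs, labels, min_gr=3))  # [('a', 'b')]
```

Under this assumed definition, a data set in which no frequent pattern reaches the threshold in any class yields an empty feature set.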

VI-C Results

Table III presents the detailed performance comparison in terms of classification accuracy. Note that the result of DSP on the Skating data set is N/A because no discriminative patterns could be found in this data set under the given parameter setting. In the experiments, $\alpha$ = 0.05 is used for R-MHT and $pointnum$ is set to 1/10 of the size of $TrainD$ for R-GAHC. After transforming sequences into feature vectors, we use NB (Naive Bayes), DT (Decision Tree), SVM (Support Vector Machine) and KNN ($k$ Nearest Neighbors) as the classifiers. The implementation of each classifier is obtained from WEKA [62], except for Sqn2Vec, for which all classifiers are obtained from scikit-learn [63] since its source code is written in Python.
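The transformation of sequences into feature vectors can be illustrated with a small sketch in which each feature is the similarity between a sequence and one reference point. The sketch assumes that the Jaccard coefficient is computed on the sets of distinct items of the two sequences and that $pointnum$ is rounded down with a minimum of one; both choices, the function names and the toy data are assumptions made for illustration.

```python
def jaccard(s, t):
    """Jaccard coefficient on the sets of distinct items (assumed definition)."""
    a, b = set(s), set(t)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def to_feature_vector(seq, reference_points, sim=jaccard):
    """One feature per reference point: similarity between seq and that point."""
    return [sim(seq, r) for r in reference_points]

# Toy training set, not taken from the benchmark data sets.
train = [list("abc"), list("abd"), list("xyz")]

# R-A style: every training sequence acts as a reference point.
reference_points = train
vec = to_feature_vector(list("abx"), reference_points)
print(vec)  # [0.5, 0.5, 0.2]

# Number of reference points for an R-GAHC style selection (rounding is assumed).
pointnum = max(1, len(train) // 10)
print(pointnum)  # 1
```

The resulting vectors can then be passed to any vectorial classifier, which is why the feature dimension in the R-A setting equals the number of training sequences.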

To obtain a global picture of the overall performance of the different algorithms, we calculate the average accuracy over all data sets for each classifier. The corresponding average accuracies are recorded in Table IV. The results show that, among our two methods, R-MHT achieves better performance than R-GAHC when NB, DT or SVM is used as the classifier. However, R-MHT performs poorly when KNN is used as the classifier. In R-GAHC, we select one representative sequence from each cluster and any sequence in a cluster can serve as the representative, so we may miss the most representative sequence. Meanwhile, the choice of clustering method and the specified number of clusters also influence the results. In addition, the R-A method outperforms both R-MHT and R-GAHC, since no information relevant to the classification task is lost when all training sequences are used as reference points. However, the feature dimension in R-A is very high, which incurs a high computational cost in practice.

Compared with the other classification methods, our methods achieve comparable performance. In particular, R-A and MiSeRe [17] achieve the highest average classification accuracy among all competitors. For R-A, this is because all information given for building the classifier is contained in its reference point set. This result is notable since R-A is a very simple algorithm derived from our framework, and it indicates that the proposed reference-based sequence classification framework is quite useful in practice. The reason why R-MHT and R-GAHC are slightly worse may be that their reference points are less distinct from each other across classes and that some sequences that are important for classification are missed. It can be expected that more accurate feature-based sequence classification methods will be developed under this framework in the future. From Table III and Table IV, it can also be observed that none of the algorithms in the comparison always achieves the best performance across all data sets. Therefore, more research effort should still be devoted to the development of effective sequence classification algorithms.

The use of different similarity functions may affect the performance of our algorithms. To investigate this issue, we use two additional similarity functions in the experiments for comparison: SSK and the normalized LCS, whose details have been introduced in Section V-C.

Table V presents the average classification accuracies of the different similarity functions over all data sets. The Jaccard coefficient, SSK and normalized LCS are denoted by J, S and N, respectively. In Table V, R-A-J means that the Jaccard coefficient is used as the similarity function in R-A; the other notations in this table can be interpreted in a similar manner. The results show that the choice of similarity function affects the performance of our algorithms. Among the three similarity functions, the Jaccard coefficient achieves better performance in most cases. However, R-MHT-J has unsatisfactory performance when KNN is used as the classifier. It can also be observed that none of the similarity functions is always the best performer. Therefore, more suitable similarity functions should be developed.
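As an illustration of how a similarity function can be swapped within the framework, the following sketch computes a normalized LCS similarity. The normalization used here (dividing the LCS length by the length of the longer sequence) is an assumption and may differ from the definition given in Section V-C; the function names are hypothetical.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(t) + 1)
    for x in s:
        cur = [0]
        for j, y in enumerate(t, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def normalized_lcs(s, t):
    """LCS length divided by the longer sequence length (assumed normalization)."""
    if not s or not t:
        return 0.0
    return lcs_length(s, t) / max(len(s), len(t))

print(lcs_length("abcbdab", "bdcaba"))  # 4
print(normalized_lcs("abc", "abc"))     # 1.0
print(normalized_lcs("abc", "xyz"))     # 0.0
```

Because the function takes two sequences and returns a value in [0, 1], it can replace the Jaccard coefficient in the feature generation step without any other change.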

The above experimental results and analysis show that the new methods based on our framework achieve performance comparable to that of state-of-the-art sequence classification algorithms, which demonstrates the feasibility and advantages of our framework. Moreover, our framework is quite general and flexible, since both the reference points and the similarity function can be chosen freely. However, since feature selection and classifier construction are separate in our framework and any existing vectorial data classification method can be used to tackle the sequence classification problem, some features that are critical to the classifier may be filtered out during the selection process.

VII Conclusion

In this paper, we present a reference-based sequence classification framework by generalizing the pattern-based methods. This framework is quite general and flexible, and can be used as a general platform to develop new algorithms for sequence classification. To verify this point, we present several new feature-based sequence classification algorithms under this new framework. A series of comprehensive experiments on real data sets show that our methods are capable of achieving classification accuracy comparable to that of existing sequence classification algorithms. Thus, the reference-based sequence classification framework is quite promising and useful in practice.

In future work, we intend to explore more appropriate reference sequence selection methods and similarity functions to improve the performance and reduce the computational cost. As a result, more accurate feature-based sequence classification methods would be derived under this framework.

Bibliography

[1] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.
[2] M. Deshpande and G. Karypis, “Evaluation of techniques for classifying biological sequences,” in Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2002, pp. 417–431.
[3] Z. Xing, J. Pei, and E. Keogh, “A brief survey on sequence classification,” ACM SIGKDD Explorations Newsletter, vol. 12, no. 1, pp. 40–48, 2010.
[4] E. Cernadas and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, 2014.
[5] C. Zhou, B. Cule, and B. Goethals, “Pattern based sequence classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, pp. 1285–1298, 2016.
[6] T. P. Exarchos, M. G. Tsipouras, C. Papaloukas, and D. I. Fotiadis, “A two-stage methodology for sequence classification based on sequential pattern mining and optimization,” Data & Knowledge Engineering, vol. 66, no. 3, pp. 467–487, 2008.
[7] D. Lo, H. Cheng, J. Han, S.-C. Khoo, and C. Sun, “Classification of software behaviors for failure detection: a discriminative pattern mining approach,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 557–566.
[8] R. She, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F. S. Brinkman, “Frequent-subsequence-based prediction of outer membrane proteins,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003, pp. 436–445.
