Reference-Based Sequence Classification
Zengyou He, Guangyao Xu, Chaohua Sheng, Bo Xu, Quan Zou

TL;DR
This paper introduces a unified reference-based framework for sequence classification that consolidates existing pattern-based methods and facilitates the development of new algorithms with competitive accuracy.
Contribution
The paper presents a general framework unifying pattern-based sequence classification methods and enabling the creation of novel algorithms.
Findings
New algorithms achieve comparable accuracy to state-of-the-art methods.
Framework effectively unifies existing pattern-based approaches.
Experimental results validate the versatility and effectiveness of the proposed framework.
| Algorithm | Construction of Candidate Reference Point Set | Selection of Reference Points | Selection of Similarity Function |
|---|---|---|---|
| SCIP[5] | SubTrainD | minsup, minint and maxsize constraints | SF 1/3 |
| Ref.[8] | SubTrainD | minsup constraint | SF 1 |
| Ref.[11] | SubTrainD | uniqueness, gap, mindisc (DF 2 and 4) constraints | SF 5 |
| Ref.[12] | SubTrainD | gap, mindisc (DF 1 and 3) constraints | SF 2 |
| MiSeRe[17] | SubTrainD | level constraint | SF 1 |
| Ref.[18] | SubTrainD | minsup and gap constraints | SF 1/4 |
| PSO-AB[19] | SubTrainD | minsup and closeness constraints | SF 6 |
| FeatureMine[26] | SubTrainD | minsup, redundancy and mindisc (DF 6) constraints | SF 1 |
| CDSPM[30] | SubTrainD | minsup and mindisc (DF 5) constraints | SF 1 |
| Dataset | #sequences | #items | minl | maxl | avgl | #classes |
|---|---|---|---|---|---|---|
| Activity | 35 | 10 | 12 | 43 | 21.14 | 2 |
| Aslbu | 424 | 250 | 2 | 54 | 13.05 | 7 |
| Auslan2 | 200 | 16 | 2 | 18 | 5.53 | 10 |
| Context | 240 | 94 | 22 | 246 | 88.39 | 5 |
| Epitope | 2392 | 20 | 9 | 21 | 15 | 2 |
| Gene | 2942 | 5 | 41 | 216 | 86.53 | 2 |
| News | 4976 | 27884 | 1 | 6779 | 139.96 | 5 |
| Pioneer | 160 | 178 | 4 | 100 | 40.14 | 3 |
| Question | 1731 | 3612 | 4 | 29 | 10.17 | 2 |
| Reuters | 1010 | 6380 | 4 | 533 | 93.84 | 4 |
| Robot | 4302 | 95 | 24 | 24 | 24 | 2 |
| Skating | 530 | 82 | 18 | 240 | 48.12 | 7 |
| Unix | 5472 | 1697 | 1 | 1400 | 32.34 | 4 |
| Webkb | 3667 | 7736 | 1 | 20628 | 129.37 | 3 |
| Dataset | Classifier | R-A | R-MHT | R-GAHC | MiSeRe | FSP | DSP | Sqn2VecSEP | Sqn2VecSIM | Classifier | SCIP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Activity | NB | 0.966 | 0.977 | 0.811 | 1.000 | 0.960 | 0.960 | 1.000 | 1.000 | SCII_HAR | 0.663 |
| Activity | DT | 0.931 | 0.931 | 0.794 | 0.960 | 1.000 | 1.000 | 0.900 | 0.800 | SCII_MA | 0.675 |
| Activity | SVM | 0.977 | 0.926 | 0.629 | 1.000 | 0.994 | 0.994 | 1.000 | 0.950 | SCIS_HAR | 0.967 |
| Activity | KNN | 0.983 | 0.811 | 0.800 | 1.000 | 0.886 | 0.886 | 1.000 | 0.950 | SCIS_MA | 1.000 |
| Aslbu | NB | 0.574 | 0.561 | 0.449 | 0.548 | 0.527 | 0.420 | 0.298 | 0.554 | SCII_HAR | 0.540 |
| Aslbu | DT | 0.523 | 0.527 | 0.480 | 0.565 | 0.542 | 0.459 | 0.405 | 0.484 | SCII_MA | 0.526 |
| Aslbu | SVM | 0.638 | 0.625 | 0.483 | 0.571 | 0.581 | 0.455 | 0.498 | 0.633 | SCIS_HAR | 0.553 |
| Aslbu | KNN | 0.563 | 0.310 | 0.479 | 0.574 | 0.531 | 0.464 | 0.544 | 0.591 | SCIS_MA | 0.536 |
| Auslan2 | NB | 0.322 | 0.330 | 0.334 | 0.304 | 0.292 | 0.145 | 0.260 | 0.290 | SCII_HAR | 0.100 |
| Auslan2 | DT | 0.317 | 0.308 | 0.308 | 0.314 | 0.330 | 0.170 | 0.270 | 0.270 | SCII_MA | 0.095 |
| Auslan2 | SVM | 0.328 | 0.323 | 0.326 | 0.300 | 0.318 | 0.170 | 0.290 | 0.310 | SCIS_HAR | 0.200 |
| Auslan2 | KNN | 0.327 | 0.304 | 0.333 | 0.302 | 0.309 | 0.166 | 0.310 | 0.200 | SCIS_MA | 0.175 |
| Context | NB | 0.780 | 0.778 | 0.772 | 0.938 | 0.812 | 0.572 | 0.900 | 0.900 | SCII_HAR | 0.613 |
| Context | DT | 0.791 | 0.798 | 0.745 | 0.841 | 0.873 | 0.575 | 0.600 | 0.542 | SCII_MA | 0.617 |
| Context | SVM | 0.939 | 0.937 | 0.755 | 0.927 | 0.868 | 0.577 | 0.933 | 0.900 | SCIS_HAR | 0.796 |
| Context | KNN | 0.871 | 0.853 | 0.813 | 0.896 | 0.839 | 0.585 | 0.900 | 0.858 | SCIS_MA | 0.867 |
| Epitope | NB | 0.675 | 0.663 | 0.751 | 0.588 | 0.696 | 0.671 | 0.779 | 0.761 | SCII_HAR | 0.684 |
| Epitope | DT | 0.839 | 0.842 | 0.815 | 0.842 | 0.814 | 0.750 | 0.813 | 0.800 | SCII_MA | 0.712 |
| Epitope | SVM | 0.855 | 0.838 | 0.769 | 0.834 | 0.758 | 0.716 | 0.802 | 0.800 | SCIS_HAR | 0.705 |
| Epitope | KNN | 0.932 | 0.925 | 0.918 | 0.924 | 0.898 | 0.801 | 0.863 | 0.841 | SCIS_MA | 0.721 |
| Gene | NB | 0.998 | 0.997 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | SCII_HAR | 1.000 |
| Gene | DT | 0.997 | 0.997 | 0.998 | 1.000 | 1.000 | 1.000 | 0.967 | 0.976 | SCII_MA | 1.000 |
| Gene | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 | 1.000 | SCIS_HAR | 1.000 |
| Gene | KNN | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | SCIS_MA | 1.000 |
| News | NB | 0.754 | 0.765 | 0.645 | 0.866 | 0.595 | 0.714 | 0.946 | 0.967 | SCII_HAR | 0.936 |
| News | DT | 0.719 | 0.724 | 0.654 | 0.837 | 0.793 | 0.767 | 0.600 | 0.555 | SCII_MA | 0.942 |
| News | SVM | 0.975 | 0.969 | 0.874 | 0.905 | 0.771 | 0.775 | 0.972 | 0.974 | SCIS_HAR | 0.910 |
| News | KNN | 0.856 | 0.350 | 0.756 | 0.732 | 0.544 | 0.761 | 0.918 | 0.912 | SCIS_MA | 0.918 |
| Pioneer | NB | 0.977 | 0.931 | 0.796 | 0.891 | 0.932 | 0.852 | 0.975 | 0.963 | SCII_HAR | 0.963 |
| Pioneer | DT | 0.883 | 0.870 | 0.826 | 0.988 | 0.964 | 0.926 | 0.788 | 0.825 | SCII_MA | 0.963 |
| Pioneer | SVM | 0.980 | 0.952 | 0.638 | 0.989 | 0.983 | 0.930 | 0.950 | 0.988 | SCIS_HAR | 0.975 |
| Pioneer | KNN | 0.990 | 0.477 | 0.828 | 0.975 | 0.933 | 0.926 | 0.975 | 0.988 | SCIS_MA | 0.975 |
| Question | NB | 0.846 | 0.828 | 0.840 | 0.870 | 0.763 | 0.763 | 0.736 | 0.762 | SCII_HAR | 0.833 |
| Question | DT | 0.889 | 0.881 | 0.877 | 0.885 | 0.822 | 0.759 | 0.717 | 0.809 | SCII_MA | 0.785 |
| Question | SVM | 0.949 | 0.947 | 0.868 | 0.902 | 0.814 | 0.763 | 0.789 | 0.881 | SCIS_HAR | 0.846 |
| Question | KNN | 0.895 | 0.879 | 0.889 | 0.897 | 0.819 | 0.762 | 0.828 | 0.874 | SCIS_MA | 0.837 |
| Reuters | NB | 0.892 | 0.893 | 0.808 | 0.903 | 0.831 | 0.765 | 0.921 | 0.905 | SCII_HAR | 0.951 |
| Reuters | DT | 0.878 | 0.878 | 0.843 | 0.912 | 0.903 | 0.897 | 0.826 | 0.741 | SCII_MA | 0.953 |
| Reuters | SVM | 0.976 | 0.970 | 0.858 | 0.962 | 0.933 | 0.915 | 0.984 | 0.974 | SCIS_HAR | 0.957 |
| Reuters | KNN | 0.958 | 0.452 | 0.918 | 0.899 | 0.894 | 0.900 | 0.960 | 0.960 | SCIS_MA | 0.956 |
| Robot | NB | 0.871 | 0.866 | 0.832 | 0.826 | 0.735 | 0.718 | 0.808 | 0.822 | SCII_HAR | 0.795 |
| Robot | DT | 0.880 | 0.879 | 0.871 | 0.900 | 0.843 | 0.742 | 0.811 | 0.778 | SCII_MA | 0.822 |
| Robot | SVM | 0.955 | 0.952 | 0.902 | 0.913 | 0.780 | 0.723 | 0.840 | 0.834 | SCIS_HAR | 0.817 |
| Robot | KNN | 0.947 | 0.947 | 0.942 | 0.937 | 0.860 | 0.743 | 0.945 | 0.949 | SCIS_MA | 0.819 |
| Skating | NB | 0.281 | 0.271 | 0.197 | 0.290 | 0.240 | N/A | 0.336 | 0.321 | SCII_HAR | 0.181 |
| Skating | DT | 0.258 | 0.215 | 0.204 | 0.258 | 0.272 | N/A | 0.226 | 0.230 | SCII_MA | 0.181 |
| Skating | SVM | 0.375 | 0.277 | 0.208 | 0.293 | 0.299 | N/A | 0.370 | 0.321 | SCIS_HAR | 0.189 |
| Skating | KNN | 0.290 | 0.203 | 0.245 | 0.241 | 0.191 | N/A | 0.302 | 0.340 | SCIS_MA | 0.191 |
| Unix | NB | 0.718 | 0.764 | 0.772 | 0.768 | 0.699 | 0.607 | 0.703 | 0.566 | SCII_HAR | 0.837 |
| Unix | DT | 0.887 | 0.883 | 0.874 | 0.899 | 0.819 | 0.750 | 0.776 | 0.756 | SCII_MA | 0.838 |
| Unix | SVM | 0.927 | 0.921 | 0.906 | 0.915 | 0.820 | 0.748 | 0.899 | 0.872 | SCIS_HAR | 0.857 |
| Unix | KNN | 0.869 | 0.822 | 0.865 | 0.873 | 0.803 | 0.745 | 0.892 | 0.891 | SCIS_MA | 0.842 |
| Webkb | NB | 0.710 | 0.720 | 0.641 | 0.845 | 0.701 | 0.833 | 0.858 | 0.886 | SCII_HAR | 0.897 |
| Webkb | DT | 0.820 | 0.821 | 0.788 | 0.874 | 0.869 | 0.862 | 0.629 | 0.635 | SCII_MA | 0.911 |
| Webkb | SVM | 0.954 | 0.952 | 0.880 | 0.927 | 0.895 | 0.869 | 0.934 | 0.940 | SCIS_HAR | 0.894 |
| Webkb | KNN | 0.887 | 0.544 | 0.851 | 0.779 | 0.843 | 0.858 | 0.772 | 0.691 | SCIS_MA | 0.901 |
| Classifier | R-A | R-MHT | R-GAHC | MiSeRe | FSP | DSP | Sqn2VecSEP | Sqn2VecSIM | Classifier | SCIP |
|---|---|---|---|---|---|---|---|---|---|---|
| NB | 0.740 | 0.739 | 0.689 | 0.760 | 0.699 | 0.694 | 0.751 | 0.764 | SCII_HAR | 0.714 |
| DT | 0.758 | 0.754 | 0.720 | 0.791 | 0.775 | 0.743 | 0.666 | 0.657 | SCII_MA | 0.716 |
| SVM | 0.845 | 0.828 | 0.721 | 0.817 | 0.772 | 0.741 | 0.804 | 0.813 | SCIS_HAR | 0.762 |
| KNN | 0.812 | 0.634 | 0.760 | 0.788 | 0.739 | 0.738 | 0.801 | 0.789 | SCIS_MA | 0.767 |
| AVG | 0.789 | 0.739 | 0.722 | 0.789 | 0.746 | 0.729 | 0.756 | 0.756 | AVG | 0.740 |
| Classifier | R-A-J | R-A-S | R-A-N | R-MHT-J | R-MHT-S | R-MHT-N | R-GAHC-J | R-GAHC-S | R-GAHC-N |
|---|---|---|---|---|---|---|---|---|---|
| NB | 0.740 | 0.675 | 0.718 | 0.739 | 0.677 | 0.711 | 0.689 | 0.672 | 0.674 |
| DT | 0.758 | 0.733 | 0.747 | 0.754 | 0.723 | 0.744 | 0.720 | 0.702 | 0.703 |
| SVM | 0.845 | 0.793 | 0.837 | 0.828 | 0.779 | 0.829 | 0.721 | 0.746 | 0.762 |
| KNN | 0.812 | 0.752 | 0.802 | 0.634 | 0.738 | 0.773 | 0.760 | 0.706 | 0.749 |
| AVG | 0.789 | 0.738 | 0.776 | 0.739 | 0.730 | 0.764 | 0.722 | 0.706 | 0.722 |
Corresponding author: Zengyou He (e-mail: [email protected]). This work was partially supported by the Natural Science Foundation of China under Grant Nos. 61972066 and 61572094, and the Fundamental Research Funds for the Central Universities (No. DUT20YG106).
ZENGYOU HE¹, GUANGYAO XU¹, CHAOHUA SHENG¹, BO XU¹, QUAN ZOU²
¹School of Software, Dalian University of Technology, Tuqiang Road, Dalian, China
²Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology, Chengdu, China
Abstract
Sequence classification is an important data mining task in many real-world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms.
Index Terms:
Sequence classification, sequential data analysis, cluster analysis, hypothesis testing, sequence embedding
I Introduction
In many practical applications, we have to conduct data analysis on data sets that are composed of discrete sequences. Each sequence is an ordered list of elements. For instance, such a sequence can be a protein sequence, where each element corresponds to an amino acid. Due to the existence of a large number of discrete sequences in a wide range of applications, sequential data analysis has become an important issue in machine learning and data mining. Compared to non-sequential data mining, sequential data analysis is confronted with new challenges because of the ordering relationship between different elements in the sequences. Similar to the analysis of non-sequential data, there are different sequential data mining problems such as clustering, classification and pattern discovery. In this paper, we focus on the sequence classification problem.
The task of classification is to determine which predefined target class one unknown object should be assigned to [1]. As a specific case of the general classification problem, sequence classification is to assign class labels to new sequences based on the classifier constructed in the training phase. In many real-world applications, we can formulate the data analysis task as a sequence classification problem. For instance, the essential task in numerous bioinformatics applications is to classify biological sequences into existing categories [2].
To tackle the sequence classification problem, many effective methods have been proposed from different perspectives. Roughly, existing sequence classification methods can be divided into three categories [3]: feature-based methods, distance-based methods and model-based methods. Feature-based methods first transform sequences into feature vectors and then apply existing vectorial data classification methods. Distance-based methods apply classifiers such as KNN ($k$-Nearest Neighbors) to solve the sequence classification problem, in which the key issue is to specify a proper distance function to measure the distance between two sequences [3]. Model-based methods generally assume that sequences from different classes are generated from different probability distributions, in which the key issue is to estimate the model parameters from the set of training sequences.
In this paper, we focus on the feature-based method since it has several advantages. First of all, various effective classifiers have been developed for vectorial data classification [4]. After transforming sequences into feature vectors, we can choose any one of these existing classification methods to fulfill the sequence classification task. Second, in some popular feature-based methods such as pattern-based methods, each feature has a good interpretability. Last but not least, the extraction of features from sequences has been extensively studied across different fields, making it feasible to generate sequence features in an effective manner.
The $k$-mer (in bioinformatics) or $k$-gram (in natural language processing) is a substring that is composed of $k$ consecutive elements, which is probably the most widely used feature in feature-based sequence classification. Such a $k$-mer based feature construction method is further generalized by the pattern-based method, in which a feature is a sequential pattern (a subsequence) that satisfies some constraints (e.g. frequent pattern, discriminative pattern). Over the past few decades, a large number of pattern-based methods have been presented in the context of sequence classification [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30].
In this paper, we present a reference-based sequence classification framework, which can be considered as a non-trivial generalization of the pattern-based methods. This framework has several key steps: candidate set construction, reference point selection and feature value construction. In the first step, a set of sequences that serve as the candidate reference points are constructed. Then, some sequences from the candidate set are selected as the reference points according to certain criteria. The number of features in the transformed vectorial data will equal the number of selected reference points. In other words, each reference point will correspond to a transformed feature. Finally, a similarity function is used to calculate the similarity between each sequence in the data and every reference point. The similarity to each reference point will be used as the corresponding feature value.
The reference-based sequence classification framework is quite general and flexible since both the reference points and the similarity function can be chosen freely. Existing pattern-based methods can be regarded as special variants of our framework obtained by (1) using (frequent or discriminative) sequential patterns (subsequences) as reference points and (2) utilizing a boolean function (output 1 if the reference point is contained in a given sequence and output 0 otherwise) as the similarity function. Besides unifying existing pattern-based methods under the same umbrella, the reference-based sequence classification framework can be used as a general platform for developing new feature-based sequence classification methods. To justify this point, we develop a new feature-based method in which a subset of training sequences is used as the set of reference points and the Jaccard coefficient is used as the similarity function. In particular, we present two instance selection methods to select a good set of reference points.
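The two similarity functions mentioned above can be sketched in a few lines of Python. This is only an illustration: the helper names `contains`, `jaccard` and `transform` are hypothetical, and computing the Jaccard coefficient on the sets of items of the two sequences is an assumption, since the exact definition used for sequences is not given at this point in the text.

```python
from typing import Callable, List, Sequence

Seq = Sequence[str]


def contains(reference: Seq, sequence: Seq) -> float:
    """Boolean similarity: 1.0 if `reference` is a subsequence of `sequence`, else 0.0."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator, so matches must appear in order.
    return 1.0 if all(item in remaining for item in reference) else 0.0


def jaccard(reference: Seq, sequence: Seq) -> float:
    """Jaccard coefficient between the item sets of two sequences (assumed definition)."""
    a, b = set(reference), set(sequence)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def transform(data: List[Seq], references: List[Seq],
              similarity: Callable[[Seq, Seq], float]) -> List[List[float]]:
    """Map every sequence to a vector with one feature per reference point."""
    return [[similarity(r, s) for r in references] for s in data]


# Pattern-based view: reference points are patterns, similarity is containment.
print(transform(["abcab", "bca"], ["ab", "ca"], contains))
# Instance-based view: reference points are whole sequences, similarity is Jaccard.
print(transform(["abd", "bca"], ["abc"], jaccard))
```

Swapping the `similarity` argument is the only change needed to move from the pattern-based special case to an instance-based variant, which is the flexibility the framework relies on.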
To demonstrate the feasibility and advantages of this new framework, we conduct a series of comprehensive performance studies on real sequential data sets. In the experiments, we compare several variants under our framework with some existing sequence classification methods in terms of classification accuracy. Experimental results show that new methods developed under the proposed framework are capable of achieving better classification accuracy than traditional sequence classification methods. This indicates that such a reference-based sequence classification framework is promising from a practical point of view.
The main contributions of this paper can be summarized as follows:
- We present a general reference-based framework for feature-based sequence classification. It offers a unified view for understanding and explaining many existing feature-based sequence classification methods in which different types of sequential patterns are used as features.
- The reference-based framework can be used as a general platform for developing new feature-based sequence classification algorithms. To verify this point, we design new feature-based sequence classification algorithms under this framework and demonstrate its advantages through extensive experimental results on real sequential data sets.
The rest of the paper is structured as follows. Section II gives a discussion on the related work. In Section III, we introduce the reference-based sequence classification framework in detail. In Section IV, we show that many existing feature-based sequence classification algorithms can be reformulated within the reference-based framework. In Section V, we present new feature-based sequence classification algorithms under this framework, which are effective and quite different from available solutions. We experimentally evaluate the proposed reference-based framework through a series of experiments on real-life data sets in Section VI. Finally, we summarise our research and give a discussion on the future work in Section VII.
II Related Work
In this section, we discuss previous research efforts that are closely related to our method. In Section II-A, we provide a categorization on existing feature-based sequence classification methods. In Section II-B, we discuss several instance-based feature generation methods in the literature of time series classification. In Section II-C, we present a concise discussion on reference-based sequence clustering algorithms. In Section II-D, we provide a short summary on dimension reduction and embedding methods based on landmark points.
II-A Feature-Based Methods
II-A1 Explicit Subsequence Representation without Selection
The naive approach in dealing with discrete sequences is to treat each element as a feature. However, the order information between different elements will be lost and the sequential nature cannot be captured in the classification. Short sequence segments of $k$ consecutive elements called $k$-grams can be used as features to solve this problem. Given a set of $k$-grams, a sequence can be represented as a vector of the presence or absence of the $k$-grams or the frequencies of the $k$-grams. In this feature representation method, all $k$-grams (for a specified value of $k$) are explicitly used as the features without feature selection.
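As a minimal sketch of this representation, the following Python code builds either a frequency vector or a presence/absence vector over all $k$-grams of a given alphabet. The function names and the lexicographic ordering of the features are illustrative choices, not part of the original description.

```python
from collections import Counter
from itertools import product


def kgram_counts(sequence, k):
    """Count every run of k consecutive elements in the sequence."""
    return Counter(tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1))


def kgram_vector(sequence, alphabet, k, binary=False):
    """Represent a sequence over all |alphabet|^k possible k-grams (no selection)."""
    counts = kgram_counts(sequence, k)
    features = product(sorted(alphabet), repeat=k)  # fixed feature order
    if binary:
        return [1 if counts[g] > 0 else 0 for g in features]
    return [counts[g] for g in features]


# Features for k = 2 over {a, b} are ordered as aa, ab, ba, bb.
print(kgram_vector("abab", {"a", "b"}, 2))               # frequencies
print(kgram_vector("abab", {"a", "b"}, 2, binary=True))  # presence/absence
```

The number of features grows as the alphabet size raised to the power $k$, which is why this representation is usually restricted to small $k$.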
II-A2 Explicit Subsequence Representation with Selection (Classifier-Independent)
Lesh et al. [26] present a pattern-based classification method in which a sequential pattern is chosen as a feature. The selected pattern should satisfy the following criteria: (1) be frequent, (2) be distinctive of at least one class and (3) not be redundant. Following this direction, many pattern-based classification methods have been subsequently proposed, in which different constraints are imposed on the patterns that should be selected as features [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 29, 30]. Note that any classifier designed for vectorial data can be applied to the transformed data generated from such pattern-based methods. In other words, such feature generation methods are classifier-independent.
II-A3 Explicit Subsequence Representation with Selection (Classifier-Dependent)
The above pattern-based methods are universal and classifier-independent. However, some patterns that are critical to the classifier may be filtered out during the selection process. Thus, several methods which can select pattern features from the entire pattern space for a specific classifier have been proposed [31, 32, 33].
In [31], a coordinate-wise gradient ascent technique is presented for learning the logistic regression function in the space of all $k$-grams. The method exploits the inherent structure of the $k$-gram feature space to automatically provide a compact set of highly discriminative $k$-gram features. In [32], a framework is presented in which linear classifiers such as logistic regression and support vector machine can work directly in the explicit high-dimensional space of all subsequences. The key idea is a gradient-bounded coordinate-descent strategy to quickly retrieve features without explicitly enumerating all potential subsequences. In [33], a novel document classification method using all substrings as features is proposed, in which regularization is applied to a multi-class logistic regression model to fulfill the feature selection task automatically and efficiently.
II-A4 Implicit Subsequence Representation
In contrast to explicit subsequence representation, kernel-based methods employ an implicit subsequence representation strategy. A kernel function is the key ingredient for learning with support vector machines (SVMs) and it implicitly defines a high-dimensional feature space. Some kernel functions have been presented for measuring the similarity between two sequences (e.g. [34]).
There are a variety of string kernels which are widely used for sequence classification (e.g. [35, 36, 37, 38]). A sequence is transformed into a feature space and the kernel function is the inner product of two transformed feature vectors.
Leslie et al. [35] propose a $k$-spectrum kernel for protein classification. Given a number $k$, the $k$-spectrum of an input sequence is the set of all its $k$-length (contiguous) subsequences.
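A small Python sketch of this idea follows. It assumes the usual formulation in which each sequence is mapped to the vector of occurrence counts of its $k$-length contiguous subsequences and the kernel value is the inner product of two such vectors; the function names are illustrative.

```python
from collections import Counter


def spectrum(sequence, k):
    """Occurrence counts of all k-length contiguous subsequences."""
    return Counter(tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1))


def spectrum_kernel(s, t, k):
    """Inner product of the two k-spectrum count vectors (only shared k-mers contribute)."""
    cs, ct = spectrum(s, k), spectrum(t, k)
    return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())


# "abab" has ab x2 and ba x1; "bab" has ba x1 and ab x1.
print(spectrum_kernel("abab", "bab", 2))
```

The feature space is never materialized: only $k$-mers that actually occur in both sequences are visited, which is what makes the representation implicit.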
Lodhi et al. [36] present a string kernel based on gapped $k$-length subsequences for text classification. The subsequences are weighted by an exponentially decaying factor of their full length in the text.
In [37], a mismatch string kernel is proposed, in which a certain number of mismatches are allowed in counting the occurrence of a subsequence. Several string kernels related to the mismatch kernel are presented in [38]: restricted gappy kernels, substitution kernels and wildcard kernels.
II-A5 Sequence Embedding
All the methods mentioned above use subsequences as features. Alternatively, the sequence embedding method generates a vector representation in which each feature does not have a clear interpretation. Most existing approaches for sequence embedding are proposed for texts in natural language processing, where word and document embeddings are used as an efficient way to encode the text (e.g. [39, 40]). The basic assumption in these methods is that words that appear in similar contexts have similar meanings.
The word2vec model [39] uses a two-layer neural network to learn a vector representation for each word. The sequence (text) embedding vector can be further generated by combining the feature vectors for words. The doc2vec model [40] extends word2vec by directly learning feature vectors for entire sentences, paragraphs, or documents.
Nguyen et al. [41] propose an unsupervised method (named Sqn2Vec) for learning sequence embedding by predicting its belonging singleton symbols and sequential patterns (SPs). The main objective of Sqn2Vec is to address the limitations of two existing approaches: pattern-based methods often produce sparse and high-dimensional feature vectors while sequence embedding methods in natural language processing may fail on data sets with a small vocabulary.
II-A6 Summary of Feature-Based Methods
Roughly, existing feature-based sequence classification methods can be divided into the above five categories. Each of these methods has its pros and cons, which we will discuss briefly next.
First, using $k$-grams as features without feature selection is simple and effective in practice. However, the feature length $k$ cannot be large and many redundant features may be included.
Second, in the pattern-based method, the length of a feature is not restricted as long as the feature satisfies given constraints and redundant features can be filtered out in some formulations. However, it is a non-trivial task to efficiently mine patterns that can satisfy the constraints.
Third, sequence classification methods based on adaptive feature selection can automatically select features from the set of all subsequences. The basic idea is to integrate the feature selection and classifier construction into the same procedure. Hence, these methods are classifier-dependent in the sense that each algorithm is only applicable to a specific classifier.
Fourth, kernel-based methods can implicitly map the sequence into a high-dimensional feature space without explicit feature extraction. The major challenge is how to choose a proper string kernel function and how to handle large data sets efficiently.
Finally, sequence embedding methods generate a new vector representation for each sequence that may achieve better classification accuracy. Unfortunately, the semantic interpretation of each feature becomes a difficult issue.
II-B Instance-Based Feature Generation Methods
There are several instance-based feature generation methods for time series classification which are closely related to our method (e.g. [42, 43]).
Iosifidis et al. [42] propose a time series classification method based on a novel vector representation. The vector representation for each time series is generated by calculating its similarities from a subset of training instances. To find a good subset of representative instances, a clustering procedure is further presented. In [43], each time series is represented as a feature vector, where the feature value is its dynamic time warping similarity from one of the training instances. Note that all training instances are used for feature generation.
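The instance-based representation described for [43] can be illustrated with a short Python sketch. It uses the textbook dynamic time warping recurrence with absolute difference as the local cost, and it reports distances rather than similarities; both choices, as well as the function names, are assumptions made for illustration and are not taken from the cited papers.

```python
def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance between two numeric series."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def dtw_features(series, training_instances):
    """One feature per training instance: the DTW distance to that instance."""
    return [dtw_distance(series, ref) for ref in training_instances]


print(dtw_features([1, 2, 3], [[1, 2, 2, 3], [2, 3, 4]]))
```

Using every training instance as a reference gives as many features as training examples, which is the cost that the clustering-based selection in [42] is meant to reduce.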
II-C Reference-Based Sequence Clustering
In the literature of sequence clustering, the idea of using reference/landmark points to accelerate the cluster analysis process has been widely studied (e.g. [44, 45]). In this type of sequence clustering algorithm, a reference point selection method is first employed to obtain a small set of landmark points and then the clustering process is conducted based on the similarities between input sequences and selected reference points. Here, we would like to highlight the following differences between our method and existing research efforts in this field: (1) The objective is different. We focus on the classification issue while these methods aim at the cluster analysis problem. Besides, their main concern is to improve the running efficiency of the sequence clustering procedure; (2) The method is different. We present two reference point selection methods: one unsupervised method and one supervised method (see Section V for the details). In existing reference-based sequence clustering methods, only the unsupervised reference point selection method is applicable since no class label information is available.
II-D Reference-Based Dimension Reduction
A number of research papers have presented the idea of using the distances to a set of reference points to fulfill the dimension reduction task (e.g. [46, 47]). Our method shares some similarities with these methods since the final objective is the same. However, most of these methods are not developed for the task of sequence classification. As a result, our method is quite different from these methods for both the reference point selection and the similarity computation.
III Reference-Based Sequence Classification Framework
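Let $I$ be a finite set of distinct items, which is generally called the alphabet in the literature. A sequence $s$ over $I$ is an ordered list $s = \langle s_1, s_2, \ldots, s_l \rangle$, where $s_i \in I$ and $l$ is the length of the sequence $s$. A sequence $t = \langle t_1, t_2, \ldots, t_r \rangle$ is said to be a subsequence of $s$ if there exist integers $1 \le i_1 < i_2 < \cdots < i_r \le l$ such that $t_1 = s_{i_1}, t_2 = s_{i_2}, \ldots, t_r = s_{i_r}$, denoted as $t \subseteq s$ (if $t \neq s$, written as $t \subset s$). We use $maxsize$ to denote the allowed maximum length of subsequences.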
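The subsequence relation can be checked with a simple greedy scan. The sketch below (with hypothetical function names and 0-based positions) returns the leftmost matching positions, which also makes the required increasing index sequence explicit.

```python
def find_embedding(t, s):
    """Return the leftmost 0-based positions at which t is embedded in s, or None."""
    positions = []
    start = 0
    for item in t:
        for i in range(start, len(s)):
            if s[i] == item:
                positions.append(i)
                start = i + 1  # later items must match strictly to the right
                break
        else:
            return None  # this item of t could not be matched
    return positions


def is_subsequence(t, s):
    """True if t can be obtained from s by deleting zero or more elements."""
    return find_embedding(t, s) is not None


print(find_embedding("ace", "abcde"))
print(is_subsequence("aec", "abcde"))
```

Greedy leftmost matching is sufficient for the plain containment test; additional constraints on the positions (such as gaps) require examining other embeddings as well.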
Let $C$ be a finite set of distinct classes. A labeled sequential data set $D$ over $I$ is a set of instances and each instance is denoted by $(s, c)$, where $s$ is a sequence and $c \in C$ is a class label; $|D|$ is the number of sequences in $D$. The set $D_c$ contains all sequences that have the same class label $c$ (i.e., $D_c = \{s \mid (s, c) \in D\}$). $D_t$ is the set of sequences in $D$ that contain $t$, where $t$ is a given sequence. Sequences in $D$ ($D_c$) are divided into a training set $TrainD$ ($TrainD_c$) and a testing set $TestD$ ($TestD_c$). The set of all subsequences of $TrainD$ is denoted by $SubTrainD$.
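To make the set of all training subsequences concrete, the following Python sketch enumerates every distinct subsequence of length at most a given bound for a toy labeled training set. The function names and the toy data are hypothetical, and the brute-force enumeration is exponential in the bound, so it is only meant for very small examples.

```python
from itertools import combinations


def subsequences(s, maxsize):
    """All distinct non-empty subsequences of s with length at most maxsize."""
    result = set()
    for r in range(1, min(maxsize, len(s)) + 1):
        for idx in combinations(range(len(s)), r):  # increasing index tuples
            result.add(tuple(s[i] for i in idx))
    return result


def sub_train_d(train, maxsize):
    """Union of the bounded-length subsequences of all training sequences."""
    out = set()
    for sequence, _label in train:
        out |= subsequences(sequence, maxsize)
    return out


train = [("aba", "x"), ("bc", "y")]
print(sorted(sub_train_d(train, 2)))
```

Practical pattern miners avoid this enumeration by pruning with constraints such as minimum support, which is why constraints play a central role in candidate selection.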
As shown in Fig. 1, we present a reference-based sequence classification framework. It is composed of three major phases: reference point selection, feature value generation, and model construction and prediction. In the following, we will elaborate on each phase in detail.
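The three phases can be sketched end to end in Python. Everything specific in this example is an assumption made for illustration: the reference points are simply the first training sequence of each class (a placeholder, not one of the selection methods proposed in this paper), the similarity is the Jaccard coefficient on item sets, and the model is a 1-nearest-neighbour classifier with Euclidean distance.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient on the item sets of two sequences (assumed similarity)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def select_references(train):
    """Phase 1 (placeholder): keep the first training sequence seen for each class."""
    chosen = {}
    for sequence, label in train:
        chosen.setdefault(label, sequence)
    return list(chosen.values())


def to_features(sequence, references):
    """Phase 2: one feature value per reference point."""
    return [jaccard(sequence, r) for r in references]


def predict_1nn(train, references, query):
    """Phase 3: 1-nearest-neighbour prediction in the transformed feature space."""
    q = to_features(query, references)
    nearest = min(train, key=lambda pair: math.dist(q, to_features(pair[0], references)))
    return nearest[1]


train = [("aab", "x"), ("abab", "x"), ("ccd", "y"), ("dcd", "y")]
references = select_references(train)
print(references, predict_1nn(train, references, "bab"), predict_1nn(train, references, "dc"))
```

Any vectorial classifier can replace the 1-nearest-neighbour step, since after phase 2 the data is an ordinary feature matrix.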
III-A Reference Point Selection
In the first stage of the presented framework, a reference point selection procedure is performed to generate a set of pivot sequences. As shown in Fig. 2, this procedure can be further divided into three steps: alphabet extraction, candidate set generation and pivot sequence selection.
In the first step, we scan the training set $TrainD$ to extract the alphabet $I$ that is composed of distinct items. Note that there can be some items that only appear in the testing set $TestD$. In the forthcoming paragraphs, we will see that this extreme case does not affect our subsequent steps.
In the second step, we generate the set of candidate reference sequences from the alphabet $I$. Note that any sequence over $I$ can be a member of the candidate set. In other words, the candidate set can be infinite. In practice, some constraints will be imposed on the potential members of the candidate set. For instance, pattern-based methods only consider subsequences of $TrainD$ as members of the candidate set under our framework, which will be further discussed in Section IV. Furthermore, the use of different construction methods for building the candidate set will lead to the generation of many new feature-based sequence classification methods.
In the third step, we select a subset of sequences from the candidate set as the landmark sequences for generating features. That is, each reference sequence will correspond to a transformed feature. The critical issue in this step is how to design an effective pivot sequence selection method. To date, existing pattern-based methods typically utilize some simple criteria to conduct the reference sequence selection task. For example, methods based on frequent subsequences use the minimal support constraint as the criterion for reference sequence selection. Apparently, many new and interesting pivot sequence selection methods remain unexplored under our framework. In the subsequent paragraphs of this subsection, we will list some commonly used criteria for selecting reference sequences from the set of candidate pivot sequences.
Constraint 1.
( [11]). Given two sequences and , if is the subsequence of such that , the between and is defined as . Given two thresholds and (), if (), then the occurrence of in fulfills the .
Constraint 2.
( [12]). Given a set of sequences with the class label and a sequence , is used to denote the number of sequences in that contain as a subsequence. The of in is defined as . Given a positive threshold , if , then satisfies the and is a frequent sequential pattern in .
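The support-based criterion above can be illustrated with a minimal sketch. Since the exact formula is not reproduced here, the sketch assumes the relative form (the fraction of sequences in a class that contain the candidate as a subsequence); the function names, the toy data and the threshold 0.5 are hypothetical.

```python
def is_subsequence(pattern, seq):
    """Return True if `pattern` occurs in `seq` in order (gaps allowed)."""
    it = iter(seq)
    # `item in it` consumes the iterator, so the order of items is respected.
    return all(item in it for item in pattern)


def support(pattern, sequences):
    """Fraction of `sequences` containing `pattern` as a subsequence.

    Assumption: relative support; an absolute count would drop the division.
    """
    if not sequences:
        return 0.0
    hits = sum(is_subsequence(pattern, s) for s in sequences)
    return hits / len(sequences)


# Hypothetical toy sequences sharing one class label.
class_c = ["abcd", "acd", "bd", "abd"]
min_support = 0.5  # hypothetical threshold
is_frequent = support("ad", class_c) >= min_support
```

Here "ad" is contained in three of the four sequences, so it passes the hypothetical threshold and would be kept as a frequent sequential pattern.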
Constraint 3.
( [48]). Given two class labels and , a sequence is said to be a discriminative pattern if it is over-expressed on against (or the vice versa). To evaluate the discriminative power, many measures/functions have been proposed in the literature [48]. If the discriminative function value of can pass certain constraints, then it satisfies the . Here we just list some measures that have been used for selecting discriminative patterns in sequence classification.
- Discriminative Function (DF) 1 [12]:
[TABLE]
where is a given threshold.
- Discriminative Function (DF) 2 [11]:
[TABLE]
where and is a given threshold. The is the number of non-overlapping occurrences of in .
- Discriminative Function (DF) 3 [12]:
[TABLE]
- Discriminative Function (DF) 4 [11]:
[TABLE]
where
[TABLE]
and is defined as:
[TABLE]
- Discriminative Function (DF) 5 [30]:
[TABLE]
where is the of , is a given threshold. is used to describe the conditional redundancy, where is the set of discriminative sub-patterns of , is a given threshold.
- Discriminative Function (DF) 6 [26]:
The chi-squared test is used as the discriminative function to check if the candidate sequence is correlated with at least one class that it is frequent in.
Constraint 4.
( [11]). A sequence is said to satisfy the if all its items are unique.
Constraint 5.
( [19]). A sequence is said to satisfy the if no sequences that contain as a subsequence have the same as .
Constraint 6.
( [26]). A sequence is said to satisfy the if , where is the of .
Constraint 7.
( [5]). Given a set of sequences with class label , two sequences and , if is the subsequence of such that , is used to denote the of , where is the of in , and . And the of in a sequence is . Given two thresholds and , if and , then satisfies the .
Constraint 8.
( [17]). Given a sequence and a set of sequences with classes, a sequential classification rule is denoted as , where is the body of the rule. From a Bayesian point of view, to choose the best rule is equivalent to maximizing , where is a constant, is used as the evaluation criterion, and the normalized criterion is defined as , in which is the cost of the null model when the sequence body is empty. If , then satisfies the .
III-B Feature Value Generation
In the second stage of the presented framework, a similarity function is used to generate vectorial representations for all sequences in both training data and testing data. As shown in the left part of Fig. 3, this procedure can be further divided into two steps: (1) calculating the similarities between training instances and reference points; (2) calculating the similarities between testing instances and reference points.
In the first step, we utilize a similarity function to transform the training set into a vectorial training set by calculating the similarity between each training sequence and every reference point. Each similarity value is used as the corresponding feature value. The critical issue in this step is how to choose a suitable similarity function. Note that the framework places no restriction on this choice: any feasible similarity function can be used in this step. In fact, many existing feature-based methods utilize a boolean function as the similarity function, which outputs 1 as the feature value if the reference point is a subsequence of the target sequence and 0 otherwise.
In the second step, we use the same similarity function to transform the testing set into a vectorial testing set. Note that the number of features in the transformed vectorial data set equals the number of reference points.
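The two transformation steps can be sketched as follows, using the boolean subsequence function described above as the similarity function. The function names and the toy reference points are hypothetical; any other similarity function with the same signature could be passed in instead.

```python
def boolean_similarity(reference, target):
    """1.0 if `reference` is a subsequence of `target`, else 0.0."""
    it = iter(target)
    return 1.0 if all(item in it for item in reference) else 0.0


def to_feature_vectors(sequences, reference_points, similarity=boolean_similarity):
    """One row per sequence, one column (feature) per reference point."""
    return [[similarity(r, s) for r in reference_points] for s in sequences]


# Hypothetical reference points and data.
reference_points = ["ab", "cd", "ba"]
train_sequences = ["abcd", "bacd"]
test_sequences = ["abe"]  # the item 'e' never occurs in the training data

train_vectors = to_feature_vectors(train_sequences, reference_points)
test_vectors = to_feature_vectors(test_sequences, reference_points)
```

The test sequence contains an item that is absent from the training data, yet the transformation still yields a well-defined vector, which illustrates why such items do not affect the later steps.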
The similarity function plays an important role in generating feature values. Accordingly, it has a great impact on the prediction result. To summarize existing research efforts under our framework with respect to the similarity function, we list here some similarity functions between two sequences that have been deployed in the literature.
- Similarity Function (SF) 1 [26]:
[TABLE]
- Similarity Function (SF) 2 [12]:
[TABLE]
In Equation (III.7), means (), is the between and (the minimum number of operations needed to transform into , where an operation can be the insertion, deletion, or substitution of a single item), is a contiguous subsequence of with items, which is extracted by using a sliding window of length that starts from the first element of . If and are not , then the sliding window will be repeatedly shifted one position to the right until subsequences have been checked or a new subsequence to is encountered. is a given threshold.
- Similarity Function (SF) 3 [5]:
[TABLE]
where is the of in the sequence .
- Similarity Function (SF) 4 [18]:
[TABLE]
where is the number of occurrences of in .
- Similarity Function (SF) 5 [11]:
[TABLE]
where is the number of non-overlapping occurrences of in .
- Similarity Function (SF) 6 [19]:
[TABLE]
where is the length of the longest common subsequence, and are the length of and respectively.
III-C Model Construction and Prediction
In the third stage of the presented framework, we construct a prediction model to make predictions. As shown in the right part of Fig. 3, this procedure can be further divided into three steps: model construction, prediction and classification result generation.
In the first step, an existing vectorial data classification method is used to construct a prediction model from the vectorial training set, since we have transformed training sequences into feature vectors in the second stage. Numerous classification methods have been designed for classifying feature vectors (e.g., support vector machines and decision trees) [4, 49]. After training a classifier on the vectorial training set, the prediction model is ready for classifying unknown samples.
In the second step, we forward the vectorial testing set to the classifier to make predictions. In the third step, we output the prediction result and compute the classification accuracy by comparing the predicted class labels with the ground-truth labels.
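To make the third stage concrete, the following minimal sketch trains nothing more than a 1-nearest-neighbour rule on already transformed feature vectors and computes the accuracy. The 1-NN rule is only a stand-in for whichever vectorial classifier is chosen, and the feature values and labels are hypothetical.

```python
import math


def predict_1nn(train_x, train_y, x):
    """Label of the training vector closest to `x` (Euclidean distance)."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[nearest]


def accuracy(predicted, truth):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical vectorial training and testing sets with two features.
train_x = [[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]]
train_y = ["A", "A", "B", "B"]
test_x = [[0.8, 0.1], [0.2, 0.9]]
test_y = ["A", "B"]

predictions = [predict_1nn(train_x, train_y, x) for x in test_x]
test_accuracy = accuracy(predictions, test_y)
```

Because the sequences have already been mapped to fixed-length vectors, the classifier never needs to see the original sequences.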
IV General Framework for Feature-Based Classification
In this section, we show that many existing feature-based sequence classification algorithms can be reformulated within the presented reference-based framework. The differences between these algorithms mainly lie in the selection of reference points and similarity functions. As summarized in Table I, we can categorize these existing methods according to three criteria: (1) How to construct the candidate set of reference points? (2) How to choose a set of reference points? (3) Which similarity function should be used? Note that the definitions and notations for different constraints and similarity functions have been presented in Section III-A and Section III-B. From Table I, we have the following observations.
First of all, any sequence over the alphabet can be a potential member of the candidate set of reference points. However, all feature-based sequence classification algorithms in Table I use SubTrainD (the set of subsequences of the training sequences) to construct the candidate set, since using subsequences as features is natural and offers good interpretability. Although SubTrainD is a finite set, its size is still very large, and most of its sequences are useless or redundant for classification. Therefore, it is necessary to explore alternative methods for constructing the set of candidate reference points. For instance, we may use all original sequences in the training set as the candidate set, so that its size is greatly reduced and the corresponding features may be more representative.
Second, many sequence selection criteria (the constraints listed in Section III-A) have been proposed to select reference points from the candidate set. The main objective of applying these criteria is to select a subset of sequences that can generate good features for building the classifier. However, it is not easy to set suitable thresholds for these constraints so that the resulting set of reference sequences has a moderate size. More importantly, most of these constraints originate from the sequential pattern mining literature and may only be applicable to the selection of reference sequences from SubTrainD. In other words, more general reference point selection strategies should be developed.
Last, the most widely used similarity function in Table I is SF 1, which is a boolean function based on whether the reference point is a subsequence of the target sequence. Although some non-boolean functions have been used, the potential of more elaborate similarity functions between two sequences still needs further investigation.
Overall, our reference-based sequence classification framework is quite generic, in which many existing pattern-based sequence classification methods can be reformulated as its special variants. Meanwhile, there are still many limitations in current research efforts under this framework. Hence, new and effective sequence classification methods should be developed towards this direction.
V New Variants under the Framework
In addition to encompassing existing pattern-based methods, this framework can also be used as a general platform to design new feature-based sequence classification methods.
As discussed in Section IV, there are three key ingredients in our framework: the construction of the candidate reference point set, the selection of reference points and the selection of similarity function. Obviously, we will generate a “new” sequence classification algorithm based on an unexplored combination of these three components. In view of the fact that the number of possible combinations is quite large, it is infeasible to enumerate all these variants. Instead, we will only present two variants that are quite different from existing algorithms to demonstrate the advantage of this framework.
V-A The Use of Training Set as the Candidate Set
Within our framework, all previous pattern-based sequence classification methods utilize the set SubTrainD as the candidate reference point set in the first step. One limitation of this strategy is that the actual size of SubTrainD is very large. As a result, it poses great challenges for the reference point selection task in the subsequent step. To alleviate these issues, we propose to use all original sequences in the training set to construct the set of candidate reference points. The rationale for this candidate set construction method is based on the following observations.
Firstly, all information given for building the classifier is contained in the original training set. In other words, we do not lose any relevant information for the classification task if the training set is used as the candidate set of reference sequences. In fact, the widely used candidate set SubTrainD is itself derived from the training set.
Secondly, even if we use all the training sequences as the reference points, the transformed vectorial training data will be a square table with one feature per training sequence. That is, the number of features is still no larger than the number of samples. Therefore, we do not need to analyze an HDLSS (high-dimension, low-sample-size) data set during the classification stage. In contrast, the number of features may be much larger than the number of samples in the vectorial data obtained from SubTrainD if the parameters are not properly specified during the reference point selection procedure. In fact, we have tested the performance when all training sequences are used as reference points. The experimental results show that this simple idea achieves comparable classification accuracy.
Finally, the same idea has been employed in the literature of time series classification [42, 43]. Its success motivates us to investigate the feasibility and advantage in the context of discrete sequence classification.
V-B Two Reference Point Selection Methods
To select reference sequences from the training set, the existing constraints proposed in the context of sequential pattern mining are not applicable. Therefore, we have to develop new algorithms to choose a subset of representative reference sequences from the training set. To this end, two different reference sequence selection methods are presented. The first one is an unsupervised method, which selects reference sequences based on cluster analysis without considering the class label information. The second one is a supervised method, which evaluates each candidate sequence according to its discriminative ability across different classes. In the following two sub-sections, we present the details of these two reference point selection algorithms.
V-B1 Unsupervised Reference Point Selection
As discussed in Section V-A, we may choose all sequences in the training set as reference points. However, the number of features in the transformed vectorial data can still be very large if the number of training instances is large. Selecting a small subset of representative training sequences as reference points greatly reduces the computational burden in the subsequent stage. One natural idea is to divide the training sequences into different clusters using a clustering algorithm [50]. Then, we can select a representative sequence from each cluster as the reference point.
To date, many algorithms have been presented for clustering discrete sequences (e.g. [51]). We can simply adopt an existing sequence clustering algorithm in our pipeline. Here we choose the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm [52] to fulfill the sequence clustering task. This algorithm is used because it often generates a high-quality clustering result and can handle any form of similarity measure.
In the following, we will describe the details of the reference point selection method based on GAHC.
In the first stage, each sequence in the training set forms its own singleton cluster.
In the second stage, a similarity function is used to calculate the similarity between each pair of clusters to produce a similarity matrix, in which each entry is the similarity between the two corresponding clusters. Many similarity measures have been presented for sequential data (e.g. [53]). Here we choose the Jaccard coefficient. More specific details on the similarity function will be discussed in Section V-C.
In the third stage, we first search the similarity matrix to identify its maximum value, which corresponds to the most similar pair of clusters. Then, these two clusters are merged to form a new cluster and the total number of clusters decreases by 1. Meanwhile, the matrix entries related to the merged clusters are set to 0 and the matrix is updated by recalculating the similarity between the new cluster and each of the remaining clusters. Since we use the group-average method, this similarity is calculated as the average similarity between all pairs of members drawn from the two clusters. We repeat the third stage until the number of clusters is equal to the number of reference points we want to select.
In the last stage, we select a representative sequence from each cluster. For each cluster, any sequence in this cluster can be used as a representative. To provide a consistent and deterministic output, we use the sequence with the minimum subscript in the cluster as the reference point.
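The four stages above can be sketched in a few lines. This is a naive illustration rather than an optimized implementation: instead of updating a matrix in place, it recomputes each group-average similarity from the pairwise values, which yields the same merges. The function names are hypothetical, and the item-set Jaccard coefficient used in the demonstration is only a simple stand-in for the similarity function of Section V-C.

```python
from itertools import combinations


def gahc_reference_points(sequences, similarity, k):
    """Group-average agglomerative clustering down to k clusters.

    Returns, for each final cluster, the smallest index of its members.
    """
    n = len(sequences)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of sequences")
    # Pairwise similarities between individual sequences, computed once.
    sim = [[similarity(sequences[i], sequences[j]) for j in range(n)]
           for i in range(n)]
    clusters = [[i] for i in range(n)]  # stage 1: singleton clusters

    def group_average(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Most similar pair of clusters; max() keeps the first one on ties.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: group_average(clusters[p[0]], clusters[p[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    # Last stage: the member with the minimum index represents its cluster.
    return sorted(min(c) for c in clusters)


def item_jaccard(s, t):
    """Stand-in similarity: Jaccard coefficient of the two item sets."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical training sequences.
training = ["abc", "abd", "xyz", "xyw", "abce"]
selected = gahc_reference_points(training, item_jaccard, k=2)
```

On this toy input the sequences over {a, b, c, d, e} end up in one cluster and those over {x, y, z, w} in the other, so the sequences with indices 0 and 2 are returned as reference points.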
V-B2 Supervised Reference Point Selection
To choose a subset of representative reference sequences from the training set, we can also employ a supervised method in which the class label information is utilized. As discussed in Section IV, different constraints have been widely used to evaluate the discriminative power of sequential patterns. Unfortunately, these constraints are only applicable to the selection of reference points from SubTrainD. In addition, it is not easy to set suitable thresholds to control the number of selected reference points. To overcome these limitations, we present a reference point selection method based on hypothesis testing, in which the statistical significance in terms of p-value is used to assess the discriminative power of each candidate sequence.
Hypothesis testing is a commonly used method in statistical inference. The usual line of reasoning is as follows: first, formulate the null hypothesis and the alternative hypothesis; second, select an appropriate test statistic; third, set a significance level threshold; finally, reject the null hypothesis if and only if the p-value is less than the significance level threshold, where the p-value is the probability of obtaining a value of the test statistic at least as extreme as the one actually observed, given that the null hypothesis is true.
To assess the discriminative power of each candidate sequence in terms of p-value, we use the null hypothesis that this sequence does not belong to any class and that all sequences from different classes are drawn from the same population. If this null hypothesis is true, then the similarities between the candidate sequence and the training sequences are drawn from the same population. Therefore, we can formulate the corresponding hypothesis testing problem as a two-sample testing problem [54], where one sample is the set of similarities between the candidate sequence and the training sequences from one target class, and the other sample is the set of similarities between the candidate sequence and the training sequences from the remaining classes.
Since we test all candidate sequences at the same time, this is actually a multiple hypothesis testing problem. If no multiple testing correction is conducted, then the number of false positives among the reported reference sequences may be very high. To tackle this problem, we adopt the BH procedure to control the FDR (False Discovery Rate) [55], which is the expected proportion of false positives among all reported sequences.
The reference point selection method based on MHT (Multiple Hypothesis Testing) is shown in Algorithm 1. In the following, we will elaborate on this algorithm in detail.
In the first stage (steps 1-4), we select the set of training sequences with a given class label and regard it as the positive data set; all remaining training sequences form the negative data set.
In the second stage (steps 5-17), for each candidate sequence, a similarity function is used to calculate the similarity between this sequence and each sequence in the positive and negative data sets, where the similarity function is the same as that used in Section V-B1. Then, the Mann-Whitney U test [56] is used to calculate the p-value from the two resulting sets of similarities.
In the third stage (steps 18-27), the BH method first sorts the candidate sequences according to their p-values in ascending order, i.e., $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$, where $m$ is the number of tested sequences. Then, we sequentially search for the maximal index $k$ that satisfies $p_{(k)} \le \frac{k}{m}\alpha$, where $\alpha$ is the significance level threshold. Those sequences whose indices are larger than $k$ are removed.
In the last stage (steps 28-30), all remaining sequences are selected as reference points. The whole process terminates after the set of sequences from every class has served as the positive data set.
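The correction step of the third stage can be sketched as below. The sketch implements the standard Benjamini-Hochberg step-up rule and assumes that one p-value per candidate sequence has already been computed (for instance by a Mann-Whitney U test, which is not reimplemented here). The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of the candidates kept by the BH procedure at level `alpha`."""
    m = len(p_values)
    # Candidate indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value is below its rank-dependent threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank
    # Everything ranked after k is discarded; the rest is kept.
    return sorted(order[:k])


# Hypothetical p-values for five candidate sequences.
p_values = [0.001, 0.035, 0.02, 0.5, 0.012]
kept = benjamini_hochberg(p_values, alpha=0.05)
```

Note that the rule searches for the maximal passing rank, so a candidate can be kept even if its own p-value exceeds its threshold, as long as a later-ranked candidate passes. With p-values 0.03, 0.035 and 0.04 at level 0.05, only the third one passes its threshold, yet all three are kept.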
V-C Similarity Function
In order to measure the similarity between two sequences, we choose the Jaccard coefficient as the similarity function in our method. The larger the Jaccard coefficient between the two sequences is, the more similar they are.
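Given two sequences $s$ and $t$, the Jaccard coefficient is defined as:
$$J(s,t) = \frac{|s \cap t|}{|s \cup t|} = \frac{|s \cap t|}{|s| + |t| - |s \cap t|},$$
where $|s \cap t|$ is the number of items in the intersection of $s$ and $t$. However, this may lose the order information of sequences. To alleviate this issue, we use the LCS (Longest Common Subsequence) between $s$ and $t$ to replace $s \cap t$. Then, the Jaccard coefficient is redefined as:
$$J(s,t) = \frac{|LCS(s,t)|}{|s| + |t| - |LCS(s,t)|}.$$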
Example 1.
Given two sequences and , the is , then the modified Jaccard coefficient is
[TABLE]
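The LCS-based Jaccard coefficient can be sketched as follows. The sketch assumes that the denominator is the sum of the two sequence lengths minus the LCS length, mirroring the set-based identity for the size of a union; the function names and the example sequences are hypothetical and are not those of Example 1.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(t) + 1)
    for x in s:
        curr = [0]
        for j, y in enumerate(t, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_jaccard(s, t):
    """Modified Jaccard coefficient with the LCS in place of the intersection.

    Assumption: denominator = len(s) + len(t) - LCS length.
    """
    if not s and not t:
        return 1.0  # convention chosen here for two empty sequences
    common = lcs_length(s, t)
    return common / (len(s) + len(t) - common)


# Hypothetical sequences: one LCS is "abd", of length 3.
value = lcs_jaccard("abcd", "acbd")
```

For these two sequences the LCS has length 3 and both sequences have length 4, so the coefficient is 3 / (4 + 4 - 3) = 0.6, whereas the item-set version would report 1.0 because the two sequences contain the same items.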
Note that we can also use other similarity functions in the literature, such as those methods summarized and reviewed in [53]. The choice of a more appropriate similarity function may yield better performance than the modified Jaccard coefficient. In order to check the effect of similarity function on the classification performance, we also consider the following two alternative similarity functions.
The first one is the String Subsequence Kernel (SSK) [36]. The main idea of SSK is to compare two sequences by means of the subsequences they contain in common. That is, the more subsequences in common, the more similar they are.
Given two sequences and and a parameter , the SSK is defined as:
[TABLE]
where is the feature mapping for the sequence and each , is a finite alphabet, is the set of all subsequences of length and is a subsequence of such that , is the length of in , is a decay factor which is used to penalize the gap. The calculation steps are as follows: enumerate all subsequences of length , compute the feature vectors for the two sequences, and then compute the similarity. The normalized kernel value is given by
[TABLE]
Example 2.
Given two sequences and , the subsequences of length 1 (=1) are . The corresponding feature vector for each of the sequences can be denoted as and , then the normalized kernel value is
[TABLE]
When this function is employed in our method, a subsequence length of 1 is used as the default parameter setting. Although a subsequence length of 1 may lose the order information, it greatly reduces the computational cost and can provide satisfactory results in practice.
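The calculation steps of the SSK described above (enumerate subsequences, build the feature vectors, compute the normalized similarity) can be sketched by brute force. The sketch follows the usual gap-weighted definition, in which each occurrence of a subsequence contributes the decay factor raised to the span it covers in the sequence. The function names, the parameter names `n` and `lam`, and the default decay value 0.5 are assumptions; the enumeration is exponential in the subsequence length and only meant for small values.

```python
import math
from collections import defaultdict
from itertools import combinations


def ssk_features(s, n, lam):
    """Map each length-n subsequence u of s to the sum of lam ** span
    over all of its occurrences, where span = last index - first index + 1."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = tuple(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi


def ssk(s, t, n=1, lam=0.5):
    """Unnormalized kernel: inner product of the two feature maps."""
    phi_s, phi_t = ssk_features(s, n, lam), ssk_features(t, n, lam)
    return sum(v * phi_t[u] for u, v in phi_s.items() if u in phi_t)


def normalized_ssk(s, t, n=1, lam=0.5):
    """Kernel value divided by the geometric mean of the self-kernels."""
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom else 0.0


# Hypothetical sequences that share two of their three items.
value = normalized_ssk("abc", "abd", n=1)
```

With a subsequence length of 1 every occurrence has span 1, so the decay factor cancels in the normalization and the value reduces to the cosine similarity of the item-count vectors (2/3 for the toy pair above).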
Another alternative similarity function is the normalized LCS. The larger the normalized LCS between two sequences is, the more similar they are.
Given two sequences and , the normalized LCS is defined as:
[TABLE]
where is the length of the longest common subsequence, is length of , and is the length of .
Example 3.
Given two sequences and , the is , then the normalized LCS is
[TABLE]
VI Experiments
To demonstrate the feasibility and advantages of this new framework, we conducted experiments on fourteen real sequential data sets. We compared our two algorithms derived under the reference-based framework with other sequence classification algorithms in terms of classification accuracy. All experiments were conducted on a PC with Intel(R) Xeon(R) CPU 2.40GHz and 12G Memory. All the reported accuracies in the experiments were the average accuracies obtained by repeating the 5-fold cross-validation 5 times except SCIP (accuracies in SCIP were obtained using 10-fold cross-validation because this is a fixed setting in software package provided by the author).
VI-A Data Sets
We choose fourteen benchmark data sets which are widely used for evaluating sequence classification algorithms: Activity [57], Aslbu [14], Auslan2 [14], Context [58], Epitope [12], Gene [59], News [5], Pioneer [14], Question [60], Reuters [5], Robot [5], Skating [14], Unix [5], Webkb [5]. The main characteristics of these data sets are summarized in Table II, which reports the number of sequences in each data set, the number of distinct items (#items), the minimum, maximum and average sequence lengths (minl, maxl and avgl), and the number of distinct classes (#classes).
VI-B Parameter Settings
Our two algorithms are denoted by R-MHT (Reference Point Selection Based on MHT) and R-GAHC (Reference Point Selection Based on GAHC), respectively. In addition, the method that uses all sequences in the training set as reference points is denoted as R-A, which is also included in the performance comparison. We compare our algorithms with five existing sequence classification algorithms: MiSeRe (http://www.misere.co.nf) [17], Sqn2Vec (https://github.com/nphdang/Sqn2Vec) [41], SCIP (http://adrem.ua.ac.be/sites/adrem.ua.ac.be/files/SCIP.zip) [5], FSP (the algorithm based on frequent sequential patterns) and DSP (the algorithm based on discriminative sequential patterns).
In MiSeRe, is specified to be 1024 and is set to be 5 minutes for all data sets.
Sqn2Vec is an unsupervised method for learning sequence embeddings from both singleton symbols and sequential patterns. It has two variants: Sqn2VecSEP and Sqn2VecSIM, where Sqn2VecSEP (Sqn2VecSIM) generates sequence representations from singleton symbols and sequential patterns separately (simultaneously). In these two variants, = 0.05, = 4 and the embedding dimension is set to be 128 for all data sets.
SCIP is a sequence classification method based on interesting patterns, which has four different variants: SCII_HAR, SCII_MA, SCIS_HAR and SCIS_MA. In the experiments, the following parameter setting is used in all data sets: = 0.05, = 0.02, = 3, = 0.5 and = 11.
Frequent sequential patterns have been widely used as features in sequence classification. To include the algorithm based on frequent sequential patterns in the comparison (denoted by FSP), we employ the PrefixSpan algorithm [61] as the frequent sequential pattern mining algorithm. The parameters are specified as follows: = 3 and = 0.3 for all data sets except Context (the in Context is set to be 0.9 in order to avoid the generation of too many patterns).
Similarly, discriminative sequential patterns are widely used as features in many sequence classification algorithms and applications as well. To include the algorithm based on discriminative sequential patterns in the comparison (denoted by DSP), we first use the PrefixSpan algorithm to mine a set of frequent sequential patterns and then detect discriminative patterns from the frequent pattern set. The parameters for PrefixSpan are identical to those used in FSP and = 3 is used as the threshold for filtering discriminative sequential patterns.
VI-C Results
In Table III, the detailed performance comparison results in terms of classification accuracies are presented. Note that the result of DSP on the Skating data set is N/A because we cannot find any discriminative patterns from this data set based on the given parameter setting. In the experiments, a significance level of 0.05 is used for R-MHT, and the number of reference points is specified to be 1/10 of the size of the training set for R-GAHC. After transforming sequences into feature vectors, we chose NB (Naive Bayes), DT (Decision Tree), SVM (Support Vector Machine) and KNN (k-Nearest Neighbors) as the classifiers. The implementation of each classifier was obtained from WEKA [62], except for Sqn2Vec. For Sqn2Vec, all classifiers were obtained from scikit-learn [63] since its source code is written in Python.
To obtain a global picture of the overall performance of different algorithms, we calculate the average accuracy over all data sets for each classifier. The corresponding average accuracies for different methods are recorded in Table IV. The results show that, of our two methods, R-MHT achieves better performance than R-GAHC when NB, DT and SVM are used as the classifier. However, R-MHT performs poorly when KNN is used as the classifier. Since we select a representative sequence from each cluster in R-GAHC and any sequence in a cluster can be used as a representative, we may miss the most representative sequence. Meanwhile, the choice of clustering method and the specification of the number of clusters influence the results. In addition, the R-A method outperforms R-MHT and R-GAHC since we do not lose any relevant information for the classification task when all training sequences are used as reference points. However, the feature dimension is very high in R-A, which incurs a high computational cost in practice.
Compared with other classification methods, our methods are able to achieve comparable performance. In particular, R-A and MiSeRe [17] achieve the highest average classification accuracy among all competitors; for R-A, this is because all information given for building the classifier is contained in its reference point set. This is remarkable, since R-A is a very simple algorithm derived from our framework, and it indicates that the proposed reference-based sequence classification framework is quite useful in practice. The reason why R-MHT and R-GAHC are slightly worse may be that their reference points are less distinct from each other across classes and that some sequences that are important for classification are missed. It can be expected that more accurate feature-based sequence classification methods will be developed under this framework in the future. From Table III and Table IV, it can also be observed that none of the algorithms in the comparison always achieves the best performance across all data sets. Therefore, more research effort should still be devoted to the development of effective sequence classification algorithms.
The use of different similarity functions may affect the performance of our algorithms. To investigate this issue, we use two additional similarity functions in the experiments for comparison: SSK and the normalized LCS, whose details have been introduced in Section V-C.
Table V presents the average classification accuracies of different similarity functions over all data sets. Jaccard coefficient, SSK and normalized LCS are denoted as J, S and N, respectively. In Table V, R-A-J means that the Jaccard coefficient is used as the similarity function in R-A. Other notations in this table can be interpreted in a similar manner. The results show that the use of different similarity functions can affect the performance of our algorithms. Among these three similarity functions, the use of the Jaccard coefficient as the similarity function can achieve better performance in most cases. However, R-MHT-J has unsatisfactory performance when KNN is used as the classifier. It can be also observed that none of the similarity functions is always the best performer. Therefore, more suitable similarity functions should be developed.
The above experimental results and analysis show that the proposed new methods based on our framework can achieve performance comparable to that of state-of-the-art sequence classification algorithms, which demonstrates the feasibility and advantages of our framework. Moreover, our framework is quite general and flexible, since it places no restriction on the selection of either reference points or similarity functions. However, since feature selection and classifier construction are separate in our framework and any existing vectorial data classification method can be used to tackle the sequence classification problem, some features that are critical to the classifier may be filtered out during the selection process.
VII Conclusion
In this paper, we present a reference-based sequence classification framework by generalizing the pattern-based methods. This framework is quite general and flexible, and can be used as a general platform to develop new algorithms for sequence classification. To verify this point, we present several new feature-based sequence classification algorithms under this new framework. A series of comprehensive experiments on real data sets show that our methods are capable of achieving classification accuracy comparable to that of existing sequence classification algorithms. Thus, the reference-based sequence classification framework is quite promising and useful in practice.
In future work, we intend to explore more appropriate reference sequence selection methods and similarity functions to improve the performance and reduce the computational cost. As a result, more accurate feature-based sequence classification methods would be derived under this framework.
- [1] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.
- [2] M. Deshpande and G. Karypis, “Evaluation of techniques for classifying biological sequences,” in Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2002, pp. 417–431.
- [3] Z. Xing, J. Pei, and E. Keogh, “A brief survey on sequence classification,” ACM SIGKDD Explorations Newsletter, vol. 12, no. 1, pp. 40–48, 2010.
- [4] E. Cernadas and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, 2014.
- [5] C. Zhou, B. Cule, and B. Goethals, “Pattern based sequence classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, pp. 1285–1298, 2016.
- [6] T. P. Exarchos, M. G. Tsipouras, C. Papaloukas, and D. I. Fotiadis, “A two-stage methodology for sequence classification based on sequential pattern mining and optimization,” Data & Knowledge Engineering, vol. 66, no. 3, pp. 467–487, 2008.
- [7] D. Lo, H. Cheng, J. Han, S.-C. Khoo, and C. Sun, “Classification of software behaviors for failure detection: a discriminative pattern mining approach,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 557–566.
- [8] R. She, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F. S. Brinkman, “Frequent-subsequence-based prediction of outer membrane proteins,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003, pp. 436–445.
