A Feature Selection Based on Perturbation Theory
Javad Rahimipour Anaraki, Hamid Usefi

TL;DR
This paper introduces a novel feature selection method using perturbation theory to detect feature correlations, especially effective in high-dimensional, singular datasets common in bioinformatics, outperforming traditional methods in feature reduction and accuracy.
Contribution
The paper presents a new perturbation-based approach for feature selection that effectively identifies important features in singular, high-dimensional datasets, improving over existing methods.
Findings
Selects fewer features while maintaining or improving accuracy.
Effective in high-dimensional, singular datasets.
Outperforms conventional feature selection methods.
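Table 1: Solutions $\mathbf{x}$ and $\tilde{\mathbf{x}}$ for SynthData and their difference (one row per feature).
| $\mathbf{x}$ | $\tilde{\mathbf{x}}$ | $\tilde{\mathbf{x}} - \mathbf{x}$ |
| 40.8401 | 40.8401 | 2.2115e-05 |
| -8.5981 | -8.5980 | -1.1532e-05 |
| 17.4601 | -5.9568e+03 | -5.9743e+03 |
| -3.7881 | -1.4436e+03 | -1.4398e+03 |
| 16.1273 | 6.1460e+03 | 6.1298e+03 |
| -8.5981 | -8.5980 | -1.8675e-05 |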
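Table 2: Angle between each feature of SynthData and b.
| | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Feature 6 |
| b | 37.104 | 112.981 | 47.897 | 87.030 | 48.270 | 112.981 |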
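Table 3: Angle between b and the vector obtained after removing each feature of SynthData.
| Config. | Feature 1 removed | Feature 2 removed | Feature 3 removed | Feature 4 removed | Feature 5 removed | Feature 6 removed |
| Angle with b | 40.390 | 7.748 | 14.574 | 3.507 | 13.330 | 7.748 |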
Table 4: Specifications of the datasets.
| Dataset | Samples | Features |
| LSVT Voice | 126 | 310 |
| Madelon | 2000 | 500 |
| Colon | 62 | 2000 |
| Lung | 203 | 3312 |
| Lymphoma | 96 | 4026 |
| GLIOMA | 50 | 4434 |
| Leukemia | 72 | 7070 |
| ALLAML | 72 | 7129 |
Table 5: Number of selected features. For PFS-DT, PFS-SVM, and PFS-NN each entry reads optimal / average over 10 runs.
| Dataset | GBM | LASSO | LARS | RLSR | HSIC-Lasso | PFS-DT | PFS-SVM | PFS-NN |
| LSVT Voice | 239 | 126 | 125 | 125 | 12 | 13 / 45.30 | 87 / 111.90 | 30 / 94.60 |
| Madelon | 467 | 89 | 89 | 89 | — | 34 / 100.80 | 6 / 24.80 | 25 / 64.60 |
| Colon | 656 | 62 | 61 | 61 | 9 | 7 / 29.80 | 22 / 39.30 | 18 / 30.60 |
| Lung | 1503 | 203 | 202 | 202 | 134 | 34 / 105.00 | 28 / 100.00 | 58 / 131.20 |
| Lymphoma | 1491 | 96 | 95 | 95 | 181 | 36 / 51.80 | 23 / 44.80 | 42 / 75.50 |
| GLIOMA | 535 | 50 | 49 | 49 | 17 | 7 / 25.60 | 17 / 36.50 | 28 / 37.50 |
| Leukemia | 1053 | 72 | 71 | 71 | 17 | 6 / 46.10 | 15 / 41.00 | 24 / 49.00 |
| ALLAML | 1200 | 72 | 71 | 71 | 8 | 15 / 41.20 | 24 / 53.40 | 8 / 43.00 |
Table 6: Classification accuracy (%). For PFS-DT, PFS-SVM, and PFS-NN each entry reads optimal / average over 10 runs.
| Dataset | GBM | LASSO | LARS | RLSR | HSIC-Lasso | PFS-DT | PFS-SVM | PFS-NN |
| LSVT Voice | 73.68 | 73.68 | 72.14 | 63.16 | 78.94 | 83.97 / 85.26 | 60.00 / 64.46 | 84.28 / 86.86 |
| Madelon | 77.67 | 53.16 | 62.00 | 49.34 | — | 76.18 / 81.45 | 62.15 / 61.62 | 83.67 / 81.97 |
| Colon | 78.95 | 83.33 | 79.49 | 68.42 | 84.21 | 100.00 / 91.58 | 89.20 / 92.61 | 84.66 / 89.20 |
| Lung | 75.41 | 51.17 | 63.58 | 75.41 | 83.60 | 96.20 / 94.10 | 100.00 / 99.95 | 100.00 / 99.84 |
| Lymphoma | 62.07 | 39.21 | 32.19 | 60.71 | 51.72 | 64.65 / 55.93 | 61.11 / 62.41 | 66.67 / 69.94 |
| GLIOMA | 60.00 | 52.50 | 53.75 | 53.33 | 80.00 | 85.42 / 79.33 | 95.00 / 90.08 | 95.00 / 85.58 |
| Leukemia | 95.46 | 96.88 | 96.88 | 95.46 | 100.00 | 96.88 / 95.45 | 97.06 / 99.71 | 97.06 / 98.23 |
| ALLAML | 90.91 | 90.83 | 90.83 | 62.38 | 90.90 | 93.33 / 89.09 | 93.33 / 96.29 | 85.71 / 90.95 |
Table 7: Results using fuzzy $c$-means. Each entry reads optimal / average over 10 runs.
| Dataset | PFS-DT (features) | PFS-SVM (features) | PFS-NN (features) | PFS-DT (accuracy) | PFS-SVM (accuracy) | PFS-NN (accuracy) |
| LSVT Voice | 15 / 55.70 | 2 / 87.70 | 67 / 86.40 | 89.74 / 82.43 | 50.00 / 56.00 | 81.07 / 86.11 |
| Madelon | 19 / 154.80 | 15 / 175.80 | 78 / 127.80 | 75.35 / 81.27 | 62.48 / 61.45 | 79.66 / 80.42 |
| Colon | 11 / 33.10 | 13 / 29.80 | 13 / 33.70 | 86.67 / 89.15 | 90.91 / 89.77 | 89.20 / 88.86 |
| Lung | 53 / 93.00 | 66 / 126.50 | 63 / 133.90 | 95.79 / 90.88 | 99.47 / 98.96 | |
| Lymphoma | 59 / 53.20 | 13 / 37.80 | 58 / 73.40 | 69.23 / 53.18 | 63.58 / 62.28 | 76.54 / 71.05 |
| GLIOMA | 5 / 30.40 | 15 / 31.50 | 17 / 31.60 | 89.58 / 79.00 | 90.00 / 88.67 | 86.67 / 87.25 |
| Leukemia | 7 / 31.60 | 18 / 42.60 | 17 / 44.70 | 100.00 / 97.65 | 94.12 / 97.35 | 94.12 / 96.06 |
| ALLAML | 27 / 44.60 | 32 / 58.10 | 8 / 51.60 | 86.09 / 89.81 | 82.86 / 86.90 | 90.00 / 87.29 |
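Table 8: Comparison between $k$-means and fuzzy $c$-means based on Formula 6.
| Dataset | PFS-DT ($k$-means) | PFS-SVM ($k$-means) | PFS-NN ($k$-means) | PFS-DT (fuzzy $c$-means) | PFS-SVM (fuzzy $c$-means) | PFS-NN (fuzzy $c$-means) |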
| LSVT Voice | 1.88 | 0.57 | 0.91 | 1.47 | 0.64 | 1.00 |
| Madelon | 0.81 | 2.54 | 1.26 | 0.52 | 0.34 | 0.62 |
| Colon | 3.95 | 2.35 | 2.96 | 2.69 | 3.06 | 2.66 |
| Lung | 0.89 | 0.99 | 0.75 | 0.96 | 0.77 | 0.73 |
| Lymphoma | 1.03 | 1.40 | 0.92 | 1.00 | 1.67 | 0.97 |
| GLIOMA | 3.16 | 2.50 | 2.29 | 2.63 | 2.83 | 2.80 |
| Leukemia | 4.52 | 2.41 | 2.00 | 3.12 | 2.30 | 2.18 |
| ALLAML | 2.17 | 1.81 | 2.09 | 2.02 | 1.48 | 1.70 |
Table 9: Number of training and test samples per class.
| Dataset | Train, Class 1 | Train, Class 2 | Test, Class 1 | Test, Class 2 |
| Leukemia | 24 | 13 | 23 | 12 |
| Lung | 9 | 70 | 8 | 69 |
| Prostate | 25 | 26 | 25 | 26 |
Table 10: Comparison of FS-SVM and PFS-DT. PFS-DT entries read optimal / average over 10 runs.
| Dataset | FS-SVM (features) | PFS-DT (features) | FS-SVM (accuracy) | PFS-DT (accuracy) |
| Leukemia | 142 | 24 / 20.4 | 80.00 | 85.15 / 77.34 |
| Lung | 20 | 3 / 29.90 | 97.00 | 100.00 / 99.28 |
| Prostate | 252 | 29 / 37.40 | 86.00 | 88.23 / 87.44 |
A Feature Selection Based on Perturbation Theory
Javad Rahimipour Anaraki
Hamid Usefi
Department of Computer Science, Memorial University of Newfoundland,
St. John’s, NL, A1B 3X5 Canada
Department of Mathematics and Statistics, Memorial University of Newfoundland,
St. John’s, NL, A1C 5S7 Canada
Abstract
Consider a supervised dataset $D = [A \mid \mathbf{b}]$, where $\mathbf{b}$ is the outcome column, rows of $D$ correspond to observations, and columns of $A$ are the features of the dataset. A central problem in machine learning and pattern recognition is to select the most important features from $D$ to be able to predict the outcome. In this paper, we provide a new feature selection method where we use perturbation theory to detect correlations between features. We solve $A\mathbf{x} = \mathbf{b}$ using the method of least squares and the singular value decomposition of $A$. In practical applications, such as in bioinformatics, the number of rows of $A$ (observations) is much smaller than the number of columns of $A$ (features). So we are dealing with singular matrices with large condition numbers. Although it is known that the solutions of least squares problems in the singular case are very sensitive to perturbations in $A$, our novel approach in this paper is to prove that the correlations between features can be detected by applying perturbations to $A$. The effectiveness of our method is verified by performing a series of comparisons with conventional and novel feature selection methods in the literature. It is demonstrated that in most situations, our method chooses considerably fewer features while attaining or exceeding the accuracy of the other methods.
Keywords: Feature selection, Perturbation theory, Least angle regression
Journal: Expert Systems With Applications
1 Introduction
In machine learning and pattern recognition, feature selection is the process of selecting the most important features of a problem while removing unnecessary ones. This process plays an important role in reducing the dimension of datasets. Feature selection methods are categorized into two main groups of feature ranking and feature subset selection [Hall et al., 2003]. The former is a set of methods that ranks the features based on some measured values, and selects the top features, accordingly. The latter screens the critical features using fitness value. Both groups can be implemented using filter-based or wrapper-based approaches [Kohavi & John, 1997]. In the filter-based approach, a merit evaluates the quality of every feature regardless of its impact on the outcome, while the wrapper-based approaches measure the effectiveness of the features based on the results of a (a set of) classifier(s). The wrapper-based methods are highly computationally-intensive and powerful in predicting the outcome compared to the filter-based methods which are faster but less accurate.
With the emergence of high dimensional data, for example in Genomics, sophisticated feature selection methods are required to remove noisy features and detect correlation between features. It is desired that a small subset of features are selected to predict the outcome with high accuracy. The traditional feature selection methods such as principal component analysis [Jolliffe, 2002] or Relief [Kira & Rendell, 1992] have shortcomings in terms of dimensionality reduction, accuracy, as well as running time. We shall review some of the breakthrough methods that are effective in these respects.
There have been numerous methods based on information theory, see for example Zhao et al. [2016], Sun et al. [2013], Bennasar et al. [2015]. These methods aim to minimize feature redundancy while maximizing feature relevancy. The most notable and widely used information-theoretic method is the minimal-redundancy-maximal-relevance criterion (mRMR) [Peng et al., 2005]. It is shown in various studies that mRMR effectively chooses a small subset of features to predict the outcome with high accuracy. However, as pointed out in [Yamada et al., 2018], the computational cost of mRMR on large datasets is high. In other words, it is not feasible to scale up mRMR for big datasets.
Feature selection is also referred to as variable selection in Statistics. Fundamental variable selection methods include least absolute shrinkage and selection operator (LASSO) and least angle regression (LARS). LASSO, introduced by Tibshirani [Tibshirani, 1996], is a subset selection method based on least squares regression. It minimizes the size of a regression model by removing those predictor variables with zero-valued coefficients. The LASSO estimate is given by Equation 1,
$$(\hat{\alpha}, \hat{\beta}) = \arg\min \Big\{ \sum_{i=1}^{m} \Big( b_i - \alpha - \sum_{j=1}^{n} \beta_j a_{ij} \Big)^2 \Big\} \qquad (1)$$
subject to $\sum_{j=1}^{n} |\beta_j| \le t$, where $\beta$ is a vector of coefficients and $t$ is a tuning parameter. The solution for $\alpha$ is $\hat{\alpha} = \bar{b}$ and $\hat{\beta}$ are the LASSO estimates, where $n$ is the total number of features, $\mathbf{b}$ represents responses, $A = (a_{ij})$ contains predictor variables and $m$ is the number of samples.
LARS, introduced by Efron et al. [Efron et al., 2004], is a linear regression model fitting method related to the LASSO algorithm which calculates all the LASSO estimates efficiently, in combination with a forward stage-wise linear regression method, within $n$ steps, where $n$ is the number of covariates and $m$ is the number of samples. LARS starts with selecting the most relevant feature and continues by adding the next feature with the highest correlation with the current residual. Then, it continues in a direction which has equal angle from the two already selected features until the next feature is met. The complexity of the LARS algorithm is $O(n^3 + mn^2)$.
In a novel work, Yamada et al. [Yamada et al., 2014] proposed a non-linear feature selection method for high-dimensional datasets called Hilbert-Schmidt independence criterion least absolute shrinkage and selection operator (HSIC-Lasso), in which the most informative non-redundant features are selected using a set of kernel functions, where the solutions are found by solving a LASSO problem. The complexity of the original Hilbert-Schmidt feature selection (HSFS) is . In a recent work [Yamada et al., 2018] called Least Angle Nonlinear Distributed (LAND), the authors have improved the computational power of HSIC-Lasso. They have demonstrated via some experiments that LAND and HSIC-Lasso attain similar classification accuracies and dimension reduction. However, LAND has the advantage that it can be deployed on parallel distributed computing.
A method proposed by Chen et al. [Chen et al., 2017] is a feature selection called rescaled linear square regression (RLSR), where a set of coefficients for least square regression is employed to scale and rank features. The advantage of their method is that it can be applied to both supervised and semi-supervised classification problems.
In this paper, we introduce a new linear feature selection method. Linear models usually outperform nonlinear models over high-dimensional datasets. Consider a dataset $D$, consisting of $m$ samples where each sample contains $n+1$ features. Let us denote by $A$ the first $n$ columns of $D$ and by $\mathbf{b}$ the last column. Our objective is to remove those columns of $A$ that do not have a significant impact on $\mathbf{b}$. So, we want to choose a subset of columns of $A$ to express (up to an error) $\mathbf{b}$ as a linear combination of this subset. We consider the linear system $A\mathbf{x} = \mathbf{b}$, where $\mathbf{x}$ is the vector of unknowns. In practical applications, the system may not have exact solutions. However, we want to find an $\mathbf{x}$ so that the distance between $A\mathbf{x}$ and $\mathbf{b}$ is as small as possible. That is, we want to minimize the distance $\|A\mathbf{x} - \mathbf{b}\|_2$ over all $\mathbf{x} \in \mathbb{R}^n$. To do so, we shall use the method of least squares and the singular value decomposition (SVD) of $A$. The Moore-Penrose inverse $A^{+}$ of $A$ is defined in terms of the SVD of $A$, and it is known that $\mathbf{x} = A^{+}\mathbf{b}$ is the unique solution with the smallest 2-norm that satisfies the least squares problem $\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2$, see Theorem 2.1.
There has been extensive literature, see [Golub & Van Loan, 2013], regarding the sensitivity of solutions of least squares problems when $A$ is full-rank. It is also known and rightfully cautioned that solutions of singular systems where the condition number of $A$ is bigger than one are sensitive to perturbations in $A$. However, we prove in Theorem 2.2 that one can use perturbations to reveal correlations between columns of $A$. To do so, we solve both $A\mathbf{x} = \mathbf{b}$ and $\tilde{A}\tilde{\mathbf{x}} = \mathbf{b}$ using SVD, where $\tilde{A} = A + E$ is a small perturbation of $A$. Writing $\Delta x_i = \tilde{x}_i - x_i$, it turns out that features $F_i$ and $F_j$ (the $i$-th and $j$-th columns of $A$) correlate if and only if $\Delta x_i$ and $\Delta x_j$ are close (in the magnitude of .). This allows us to cluster features based on the differences $\Delta x_i$.
Next, we consider the column vector whose values are $\Delta x_i = \tilde{x}_i - x_i$ and consider clustering features based on this single column. As we mentioned, features that correlate with each other fall into the same cluster. However, within a cluster there might be features that do not correlate (but have the same value for $\Delta x_i$). To break down some big clusters that contain independent features, we use a simple but efficient method based on the angle between features. In Section 2.2, we consider the projection of $\mathbf{b}$ into each of the hyperplanes obtained by removing one feature at a time. We construct a column that consists of the angles between each feature and the corresponding hyperplane. The third column in our clustering process consists of angles between each feature and $\mathbf{b}$.
We note that often in classification problems and real-world datasets, for example cancer datasets, the column $\mathbf{b}$ contains nominal values (classes). One can then assign numerical values to each class. Although this assignment is not unique, our method is insensitive to the way in which the classes are numbered. The reason is that correlations between columns of $A$ are independent of $\mathbf{b}$. Indeed, by Theorem 2.2, the vector consisting of the $\Delta x_i = \tilde{x}_i - x_i$ is proportional to correlations between columns of $A$ and as such is insensitive to changes in $\mathbf{b}$. Also, if $\mathbf{b}$ changes, then all the angles between columns of $A$ and $\mathbf{b}$ will be shifted by a fixed amount (the difference of old and new $\mathbf{b}$). This shows that our $n \times 3$ matrix is insensitive to the way in which we convert classes to numerical values.
After arriving at the $n \times 3$ matrix, we use a clustering algorithm and cluster it into $k$ clusters where $k$ is at most the numerical rank of $A$. Since we do not know the optimal $k$, we take the output feature subset for each $k$ and use a classifier to get an accuracy with respect to that feature subset. Alternatively, our algorithm can take as input an integer $k$ to represent the number of desired features, and this way we can just cluster with respect to the input $k$ and return the centroids as the selected subset of features. The final algorithm is presented in Section 2.3.
To the best of our knowledge, this is the first work to report on using perturbation theory in feature selection. Specifically, the fact that correlations can be detected via perturbations has not been explored before. As we can see through numerous experiments in Section 3, our method on average chooses a smaller number of features while attaining or exceeding the classification accuracy of other methods. Also, the complexity of our algorithm is dominated by that of computing the SVD of an $m \times n$ matrix, which can be done in $O(\min\{mn^2, m^2n\})$ and even faster as explained in [Holmes et al., 2007]. In particular, in datasets where we have hundreds of samples and thousands of features ($m \ll n$), the complexity of PFS is close to quadratic. It is also worth noting that our proposed method can be applied to both regression and classification problems. We present some further insights in Section 4, and conclude the paper and suggest possible future paths in Section 5.
2 Proposed Approach
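Consider the system $A\mathbf{x} = \mathbf{b}$. Since we want to know the smallest subset of columns of $A$ such that $\mathbf{b}$ can be expressed as a linear combination of elements of that subset, we can normalize the columns of $A$. So, we can assume each column of $A$ has length 1.
In real world applications, the system $A\mathbf{x} = \mathbf{b}$ may not have a solution. In other words, if $\mathbf{b}$ is not in the column space of $A$, there is no $\mathbf{x}$ such that $A\mathbf{x} = \mathbf{b}$. Instead, we can find an $\mathbf{x}$ so that the distance between $A\mathbf{x}$ and $\mathbf{b}$ is as small as possible. That is, we want to minimize the distance $\|A\mathbf{x} - \mathbf{b}\|_2$ over all $\mathbf{x} \in \mathbb{R}^n$. This minimization problem is known as the method of least squares and its solution is defined via the SVD of $A$. Recall that the SVD of an $m \times n$ matrix $A$ is of the form $A = U \Sigma V^{T}$, where $U$ is an $m \times m$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, and $\Sigma$ is an $m \times n$ diagonal matrix whose diagonal entries are the singular values $\sigma_1 \ge \dots \ge \sigma_r > 0$ followed by zeros. Also recall that the Moore-Penrose inverse of $A$ is the $n \times m$ matrix $A^{+} = V \Sigma^{+} U^{T}$, where $\Sigma^{+}$ is the $n \times m$ diagonal matrix with diagonal entries $1/\sigma_1, \dots, 1/\sigma_r, 0, \dots, 0$.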
It is well-known that the least squares solutions can be given in terms of the Moore-Penrose inverse, see [Golub & Van Loan, 2013].
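Theorem 2.1 (All Least Squares Solutions)
Let $A$ be an $m \times n$ matrix and $\mathbf{b} \in \mathbb{R}^m$. Then all the solutions of $\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2$ are of the form $\mathbf{y} = A^{+}\mathbf{b} + \mathbf{q}$, where $\mathbf{q} \in \ker(A)$. Furthermore, the unique solution whose 2-norm is the smallest is given by $\mathbf{z} = A^{+}\mathbf{b}$.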
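As a minimal sketch of Theorem 2.1, the minimum-norm least squares solution can be computed directly from the SVD. The helper name `min_norm_lstsq`, the relative threshold `tol`, and the random test data are assumptions made for illustration; they are not taken from the authors' MATLAB implementation.

```python
import numpy as np

def min_norm_lstsq(A, b, tol=1e-10):
    """Minimum 2-norm least squares solution x = A^+ b, built from the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Invert only singular values above a relative threshold; the rest count as zero.
    keep = s > tol * s.max()
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ b))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))        # fewer observations than features
b = rng.standard_normal(5)
x = min_norm_lstsq(A, b)

# Any other solution differs from x by an element q of ker(A) and has a larger norm.
q = (np.eye(8) - np.linalg.pinv(A) @ A) @ rng.standard_normal(8)
y = x + q
print(np.linalg.norm(x), np.linalg.norm(y))
```

Because the system here is underdetermined, both `x` and `y` reproduce `b`, and `x` is the shorter of the two, which is the situation the theorem describes.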
In our method, each dataset with $m$ samples and $n$ features is divided into two parts: coefficients and constants. The coefficient matrix $A$ contains all the feature values except for the outcome, while the constant vector $\mathbf{b}$ contains only the classification outcome. In the next section we employ perturbation theory to detect redundant features.
2.1 Detecting correlations via perturbation
To demonstrate how the perturbation can reveal different aspects of features, a synthetic dataset called SynthData is generated with 100 samples and six features based on the following setup:
[TABLE]
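where generates 100 random numbers with uniform probability in the interval . So, $D = [A \mid \mathbf{b}]$, where $A$ is a $100 \times 6$ matrix. Now let $\tilde{A} = A + E$ be a small perturbation of $A$ and solve $A\mathbf{x} = \mathbf{b}$ and $\tilde{A}\tilde{\mathbf{x}} = \mathbf{b}$ using SVD. We have listed the solutions $\mathbf{x}$ and $\tilde{\mathbf{x}}$ as well as their differences in Table 1. As we expected, $\mathbf{x}$ and $\tilde{\mathbf{x}}$ differ significantly. However, our interest is focused on the last column of Table 1, where we have recorded the difference between $\mathbf{x}$ and $\tilde{\mathbf{x}}$.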
Before we state the main theorem, we shall need to recall some facts and definitions which can be found in [Golub & Van Loan, 2013].
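Let $\tilde{A} = A + E$ be a perturbation of $A$. Denote by $\sigma_i$ and $\tilde{\sigma}_i$ the singular values of $A$ and $\tilde{A}$, respectively. The smallest non-zero singular value of $A$ is denoted by $\sigma_{\min}$ and the greatest of the $\sigma_i$ is denoted by $\sigma_{\max}$. It is well-known that $\|A\|_2 = \sigma_{\max}$. It has been of great interest to compare the $\sigma_i$ and $\tilde{\sigma}_i$. In this regard, we use a classical bound on the difference between $\sigma_i$ and $\tilde{\sigma}_i$ due to Weyl:
$$|\tilde{\sigma}_i - \sigma_i| \le \|E\|_2, \qquad i = 1, \dots, \min\{m, n\}. \qquad (2)$$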
We need to determine the type of perturbations we use. Indeed, we choose $E$ to be a random matrix such that , for some . We set where our estimates are correct up to a magnitude of . We are now ready to prove the main theorem of this paper.
Theorem 2.2
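Let $\mathbf{x}$ and $\tilde{\mathbf{x}}$ be solutions of $A\mathbf{x} = \mathbf{b}$ and $\tilde{A}\tilde{\mathbf{x}} = \mathbf{b}$, where $\tilde{A} = A + E$ is a small enough perturbation, and write $\Delta\mathbf{x} = \tilde{\mathbf{x}} - \mathbf{x}$. Denote the columns (features) of $A$ by $F_1, \dots, F_n$. If a feature $F_i$ is independent of the rest of the features then $\Delta x_i \approx 0$. Furthermore, suppose that $\{F_1, \dots, F_k\}$ is a subset of the columns of $A$ such that $\sum_{i=1}^{k} c_i F_i = 0$, for some non-zero $c_1, \dots, c_k$. If
1. any proper subset of $\{F_1, \dots, F_k\}$ is linearly independent,
2. $F_1, \dots, F_k$ are linearly independent from the rest of features in $A$,
then the vectors $(\Delta x_1, \dots, \Delta x_k)$ and $(c_1, \dots, c_k)$ are proportional.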
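Proof. From $A\mathbf{x} = \mathbf{b}$ and $(A + E)\tilde{\mathbf{x}} = \mathbf{b}$, we get $A(\tilde{\mathbf{x}} - \mathbf{x}) = -E\tilde{\mathbf{x}}$. We claim that $\|E\tilde{\mathbf{x}}\|_2 \approx 0$. To prove the claim, we consider the SVD of $\tilde{A}$ which is of the form $\tilde{A} = \tilde{U}\tilde{\Sigma}\tilde{V}^{T}$. So, $\tilde{\mathbf{x}} = \tilde{V}\tilde{\Sigma}^{+}\tilde{U}^{T}\mathbf{b}$. Since $\tilde{U}$ and $\tilde{V}$ are orthogonal and for an orthogonal matrix $Q$ we have $\|Q\mathbf{y}\|_2 = \|\mathbf{y}\|_2$, it follows that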
[TABLE]
by Equation (2). Hence,
[TABLE]
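It follows from the claim that, writing $\Delta\mathbf{x} = \tilde{\mathbf{x}} - \mathbf{x}$ and $F_i$ for the $i$-th column of $A$,
$$A\,\Delta\mathbf{x} = \sum_{i=1}^{n} \Delta x_i F_i \approx 0. \qquad (3)$$
Now, if a feature, say $F_i$, is independent of the rest of features, then it follows from Equation (3) that $\Delta x_i \approx 0$. Suppose now that $\{F_1, \dots, F_k\}$ is a linearly dependent subset of the columns of $A$ such that $\sum_{i=1}^{k} c_i F_i = 0$, for some coefficients $c_1, \dots, c_k$. Since $F_1, \dots, F_k$ are linearly independent from the rest of features in $A$, we get
$$\sum_{i=1}^{k} \Delta x_i F_i \approx 0. \qquad (4)$$
Now, if $(\Delta x_1, \dots, \Delta x_k)$ and $(c_1, \dots, c_k)$ are not proportional, we can use Equation (4) and $\sum_{i=1}^{k} c_i F_i = 0$ to get a dependence relation of a shorter length between the elements of $\{F_1, \dots, F_k\}$, which would contradict our assumption (1). The proof is complete.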
Consider now the correlation in the SynthData dataset. As we mentioned earlier, we normalize the columns of and replace with , where . Note that . We have
[TABLE]
So, correlation vector between is . On the other hand, we have . Note that in this example, weights (norms) of and are very close to each other compared to weight of . In general, when a dependence relation exists between a set of features, Theorem 2.2 along with normalization detect the two features whose weights are closest to each other compared to the others. In particular, if features and correlate with each other then the differences and are almost the same. The converse may not be necessarily true.
We can now consider a column vector whose values are $\Delta x_i = \tilde{x}_i - x_i$ and use a clustering algorithm to cluster this single column. Clearly, features that correlate with each other fall into the same cluster. However, within a cluster there might be features that do not correlate (but have the same value for $\Delta x_i$). For this reason, we want to further refine the clustering process by computing two more characteristics of data. We shall explain this in the next section.
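The effect described in this section can be reproduced on a small synthetic example. The sketch below does not use the SynthData setup, which is not recoverable here; instead it assumes a hypothetical dependence in which the fifth feature is the sum of the third and fourth, a noisy outcome, and a perturbation with entries in $[-10^{-6}, 10^{-6}]$. All of these choices, and the helper `min_norm_lstsq`, are assumptions made for illustration.

```python
import numpy as np

def min_norm_lstsq(A, b, tol=1e-10):
    """Minimum 2-norm least squares solution via the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol * s.max()           # treat tiny singular values as zero
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ b))

rng = np.random.default_rng(1)
m = 100
F = rng.uniform(0.0, 1.0, size=(m, 6))
F[:, 4] = F[:, 2] + F[:, 3]            # hypothetical dependence: F5 = F3 + F4
norms = np.linalg.norm(F, axis=0)
A = F / norms                          # unit-length columns, as in Section 2
b = A[:, 0] + 2.0 * A[:, 2] + 0.1 * rng.standard_normal(m)   # noisy outcome

E = rng.uniform(-1e-6, 1e-6, size=A.shape)   # perturbation scale is an assumption
delta = min_norm_lstsq(A + E, b) - min_norm_lstsq(A, b)

dependent, independent = [2, 3, 4], [0, 1, 5]
# Dependence among the normalized columns: |F3| F3n + |F4| F4n - |F5| F5n = 0.
c = np.array([norms[2], norms[3], -norms[4]])
cosine = delta[dependent] @ c / (np.linalg.norm(delta[dependent]) * np.linalg.norm(c))
print(np.abs(delta))   # entries for the dependent features dominate
print(abs(cosine))     # close to 1 when delta is proportional to c
```

In this sketch the entries of `delta` for the three dependent features are orders of magnitude larger than the rest, and their direction matches the dependence coefficients, in line with Theorem 2.2.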
2.2 Refining the clustering process
One way to compare the similarity between vectors is by calculating the angle between them. Features that have smaller angles with the outcome $\mathbf{b}$ are informative and predictive. So we construct another column whose values are the angles between the features $F_i$ (the columns of $A$) and $\mathbf{b}$. The angle of each feature with $\mathbf{b}$ in SynthData is calculated and shown in Table 2.
Our third column in the clustering process is obtained as follows. We remove each feature $F_i$ from the matrix $A$ along with its corresponding coefficient in $\mathbf{x}$. Then, the angle between the resulting vector (the product of the reduced matrix and the reduced coefficient vector) and the actual outcome $\mathbf{b}$ will be considered as a measure of the relevancy of feature $F_i$. Note that the closer $\mathbf{b}$ and this vector are, the less significant the feature $F_i$ is. The result of applying this process to SynthData is shown in Table 3.
Now we set up an $n \times 3$ matrix where the first column consists of the differences $\tilde{x}_i - x_i$, the second column is the angles between the $F_i$'s and $\mathbf{b}$, and the third column is the angles between the vectors obtained by removing each $F_i$ and $\mathbf{b}$. Next we use a clustering algorithm to cluster this matrix into $k$ clusters. The centroids of the clusters will be chosen as our selected features. Since we do not know the optimal number of clusters, we take the output feature subset for each $k$ and use a classifier to get an accuracy with respect to that feature subset. Alternatively, our algorithm can take as input an integer $k$ to represent the number of desired features, and this way we can just cluster with respect to the input $k$ and return the centroids as the selected subset of features. The upper bound for the number of clusters is $r$, where $r$ is the numerical rank of $A$.
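The three characteristics described in this section can be assembled as in the following sketch. The function names `angle_deg` and `characteristics`, the use of degrees, and the random test data are assumptions for illustration; `np.linalg.pinv` stands in for the SVD-based solve.

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def characteristics(A, b, x, x_tilde):
    """Rows: features. Columns: x_tilde - x, angle(F_i, b), angle(reduced product, b)."""
    n = A.shape[1]
    M = np.empty((n, 3))
    for i in range(n):
        keep = np.arange(n) != i
        reduced = A[:, keep] @ x[keep]      # drop feature i and its coefficient
        M[i] = (x_tilde[i] - x[i], angle_deg(A[:, i], b), angle_deg(reduced, b))
    return M

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4))
A /= np.linalg.norm(A, axis=0)              # unit-length columns
b = rng.standard_normal(20)
E = rng.uniform(-1e-6, 1e-6, size=A.shape)  # perturbation scale is an assumption
x = np.linalg.pinv(A) @ b
x_tilde = np.linalg.pinv(A + E) @ b
M = characteristics(A, b, x, x_tilde)
print(M)
```

Here the columns of `A` are independent, so the first column of `M` stays near zero and only the two angle columns distinguish the features.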
2.3 Algorithm
The PFS running time is the sum of two terms: the complexity of calculating the SVD of an $m \times n$ matrix, $O(\min\{mn^2, m^2n\})$ [Holmes et al., 2007], and the time complexity of the $k$-means clustering algorithm to cluster a dataset of size $n \times 3$ into $k$ clusters. Therefore, the time complexity of PFS is dominated by the complexity of SVD.
The flowchart of PFS is depicted in Figure 1 and the procedure is given in Algorithm 1. The MATLAB® implementation of PFS is publicly available on GitHub (https://github.com/jracp/PerturbationFeatureSelection).
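The clustering and selection step can be sketched as follows. This is not the authors' MATLAB code: it assumes a plain Lloyd-style $k$-means with random initial centroids, standardizes the columns of the characteristic matrix (an assumption not stated in the paper), and returns the feature closest to each centroid. The function name `select_features` and the toy matrix are hypothetical.

```python
import numpy as np

def select_features(M, k, n_iter=50, seed=0):
    """Cluster the rows of M into k groups and return the row index nearest each centroid."""
    rng = np.random.default_rng(seed)
    std = M.std(axis=0)
    Z = (M - M.mean(axis=0)) / np.where(std > 0, std, 1.0)   # standardize columns
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old centroid if a cluster empties
                centroids[j] = Z[labels == j].mean(axis=0)
    dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(set(int(i) for i in dist.argmin(axis=0)))

# Toy characteristic matrix with two well-separated groups of features.
M = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
              [10.0, 10.0, 10.0], [10.1, 10.0, 10.0], [10.0, 10.1, 10.0]])
selected = select_features(M, k=2)
print(selected)
```

Running this for every $k$ up to the numerical rank of $A$ and scoring each subset with a classifier would mirror the outer loop described above; a fixed `k` corresponds to the variant where the number of desired features is given as input.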
3 Experimental Results
We generate the perturbation matrix $E$ such that the entries of $E$ are randomly chosen in the range and .
Referring to Tran et al. [Tran et al., 2017], the classification accuracy of imbalanced datasets should be calculated using Equation 5,
$$\text{Accuracy} = \frac{1}{c} \sum_{i=1}^{c} \frac{TP_i}{S_i} \qquad (5)$$
where $c$ is the number of classes in the dataset, $TP_i$ is the number of correctly classified instances within class $i$, and $S_i$ is the total number of samples in class $i$.
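To illustrate why a per-class average matters for imbalanced data, the sketch below computes the mean of the per-class fractions of correctly classified samples, which is how the accuracy measure above is read here. The function name `balanced_accuracy` and the toy labels are assumptions for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Average over classes of (correctly classified in class) / (samples in class)."""
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        per_class.append(correct / total)
    return sum(per_class) / len(classes)

# Imbalanced toy example: 8 samples of class 0 and 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]          # one minority sample is misclassified
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

The plain accuracy is 0.9 while the class-averaged value is 0.75, since the single error removes half of the minority class.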
When comparing two feature selection methods, three quantities matter: 1) the accuracy, 2) the number of selected features, and 3) the complexity and running time.
We quantify the relative effectiveness of a feature selection method in terms of its accuracy and number of selected features as follows:
$$\text{Effectiveness} = \frac{\text{Classification accuracy}}{\text{Number of selected features}} \qquad (6)$$
Formula (6) means that a feature selection method with a smaller number of features and higher classification accuracy is favourable.
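As a rough numerical check, the sketch below assumes that the effectiveness measure is the ratio of classification accuracy (in percent) to the number of selected features. That form is an inference from the reported numbers, not a confirmed statement of the original formula. The inputs are the average accuracies and average feature counts reported for LSVT Voice with $k$-means, and the function name `effectiveness` is hypothetical.

```python
def effectiveness(accuracy, n_features):
    """Assumed form of the measure: accuracy (in percent) per selected feature."""
    return accuracy / n_features

# (average accuracy, average number of features) for LSVT Voice with k-means
lsvt = {
    "PFS-DT": (85.26, 45.30),
    "PFS-SVM": (64.46, 111.90),
    "PFS-NN": (86.86, 94.60),
}
scores = {name: effectiveness(acc, nf) for name, (acc, nf) in lsvt.items()}
for name, score in scores.items():
    print(name, round(score, 3))
```

The resulting values agree to within 0.01 with the 1.88, 0.57, and 0.91 reported for LSVT Voice in the $k$-means columns of the comparison table, which supports the assumed ratio without proving it.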
All the computations have been done on an Ubuntu 14.04 LTS machine with an Intel® Core™ i5-4570 processor and 24 GB of RAM, using MATLAB® 9.2.0.556344 (R2017a), R version 3.4.4 (2018-03-15), and Java™ SE Runtime Environment (build 1.8.0_151-b12).
3.1 Comparisons with conventional methods
In this section, we compare PFS with Friedman’s gradient boosting machine (GBM) [Friedman, 2001]; least absolute shrinkage and selection operator (LASSO) [Tibshirani, 1996]; least angle regression (LARS) [Efron et al., 2004]; rescaled linear square regression (RLSR) [Chen et al., 2017] with , where is the minimum number of selected features using GBM, LASSO and LARS; and Hilbert-Schmidt independence criterion least absolute shrinkage and selection operator (HSIC-Lasso) [Yamada et al., 2014]. We used gbm package in R [Ridgeway, 2007] for running GBM, and MATLAB® implementations of LASSO and LARS by Sjöstrand [Sjöstrand, 2005], RLSR and HSIC-Lasso.
In Section 3.1.1, we have used $k$-means to cluster our $n \times 3$ matrix where the upper bound for $k$ is the numerical rank of $A$. To find the best subset, we have experimented with three different classifiers, namely decision tree (DT) [Breiman et al., 1984], support vector machine (SVM) [Allwein et al., 2000], and $k$-nearest neighbour ($k$-NN) [Altman, 1992] in the inner layer. Once we find the $k$ and corresponding subset of features that gives us the best accuracy, we output that subset as the selected features. At the outer layer of our algorithm, we always use DT for classification. To demonstrate a fair and robust result, we run the algorithm 10 times, where each time a subset of features is output and then classified by DT. The average of the accuracies as well as the average size of the feature subsets are reported. We have carried out similar experiments using fuzzy $c$-means in Section 3.1.2.
We perform a series of tests on various datasets: one medical dataset, LSVT Voice [Tsanas et al., 2014], one artificial dataset, Madelon, and six biological datasets, namely Colon, Lung, Lymphoma, GLIOMA, Leukemia and ALLAML, selected from the ASU dataset repository [Li et al., 2017] and the UCI repository of machine learning [Lichman, 2013]. The specifications of all datasets are given in Table 4.
Note that for the experiments in this section, the decision tree classifier is applied with MATLAB®, using 70% of the data for training and 30% for testing and validating. This setup is applied to all methods including GBM, LASSO, LARS, RLSR, HSIC-Lasso, and PFS. Since PFS uses a clustering algorithm, the selected subset of features in PFS can change in each run. So, we run PFS 10 times on randomly shuffled data, where the testing and training sets vary accordingly in each run.
3.1.1 Evaluation results using $k$-means
In this section, we use $k$-means to cluster our $n \times 3$ matrix where the upper bound for $k$ is the numerical rank of $A$. To find the best subset, we have experimented with three different classifiers, namely DT, SVM and $k$-NN in the inner layer. Once we find the $k$ and corresponding subset of features that gives us the best accuracy, we output that subset as the selected features. At the outer layer of our algorithm, we always use DT for classification for all the methods.
In Tables 5 and 6, we have reported the number of selected features and the classification accuracies, respectively. Note that PFS-DT, PFS-SVM, and PFS-NN mean that we have used DT, SVM, and $k$-NN as the inner classifier in PFS, respectively. In all the methods we have used DT to report the classification accuracy.
To demonstrate a fair and robust result, we run our algorithm 10 times, where each time the dataset is randomly shuffled and a subset of features is output. The average of the accuracies as well as the average size of the feature subsets are reported. Also, we use Formula 6 to find the optimal accuracy and subset of features amongst the 10 runs. In the columns corresponding to PFS-DT, PFS-SVM, and PFS-NN, each entry reports both the optimal value with respect to Formula 6 and the average over the 10 runs (typeset in the original layout as a superscript and a subscript, respectively).
We can see from Table 6 that, overall, the classification accuracies of PFS-based methods compare favourably with the other methods and only HSIC-Lasso sometimes attains similar accuracies. On the other hand, HSIC-Lasso chooses fewer features on average compared to PFS-based methods. We remark that the number of features in PFS depends on the upper bound we set for the number of clusters when we cluster our intermediate matrix. We have taken the numerical rank of $A$ as an upper bound, but this bound is just a crude estimate and in the next phases of this project we shall improve this bound. Hence, it is possible to still decrease the average number of features in PFS.
We can also observe from Table 6 that when $k$-NN is used as the inner classifier, the average classification accuracies are slightly better than when DT or SVM are used. In contrast, the average number of features is slightly lower when DT is used as the inner classifier.
3.1.2 Evaluation results using fuzzy $c$-means
To investigate the effect of the clustering method, we have also experimented with the fuzzy $c$-means clustering algorithm, for which the results are shown in Table 7. We can observe from Table 7 that, all in all, there is very little difference in average classification accuracies regardless of which classifier is used. In contrast, the average number of features is slightly lower when DT is used as the inner classifier.
3.1.3 A quantified measure
In Sections 3.1.1 and 3.1.2, we have used $k$-means and fuzzy $c$-means, respectively, as our clustering algorithm. It seems that using fuzzy $c$-means, our method in general chooses more features. To amalgamate the results of Tables 5, 6, and 7, we apply Formula 6 using the average classification accuracy and the average number of features to obtain a comparison in Table 8 between $k$-means and fuzzy $c$-means. We can conclude that, based on the measure given by Formula 6, our algorithm has a better performance when $k$-means is used for clustering.
3.2 Comparison with methods based on SVM & optimization
A recent paper by Ghaddar and Naoum-Sawaya [Ghaddar & Naoum-Sawaya, 2018] proposed a feature selection method using support vector machines (FS-SVM) for binary-class datasets, in which, a pre-defined percentage of features is selected through adjusting norm of the classifier.
Ghaddar et al. applied their method to a set of cancer datasets (# of samples × # of features), namely Leukemia (72 × 7130), Lung cancer (139 × 1000), and Prostate cancer (102 × 12,601), adopted from the Cancer Program at Broad Institute (http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi), which are different from those in Table 4. For each dataset, a subset of positive and negative classes has been selected for training and testing purposes (see Table 9).
We have used PFS with DT as the inner classifier and followed the same setup to compare PFS-DT with the method proposed in [Ghaddar & Naoum-Sawaya, 2018]. To get unbiased results, we run PFS-DT 10 times where each time we shuffled and constructed test and train datasets based on the configuration in Table 9. The optimal and average results are reported in Table 10.
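The repeated shuffle-and-split protocol described above can be sketched as follows. This is only an illustration under stated assumptions: a 1-nearest-neighbour classifier stands in for the DT inner classifier, the data are synthetic, the split is a plain random split rather than the per-class configuration of Table 9, and the names `one_nn_accuracy` and `repeated_evaluation` are hypothetical.

```python
import numpy as np


def one_nn_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a 1-nearest-neighbour classifier (stand-in for the DT)."""
    # Pairwise Euclidean distances between test and training samples.
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    pred = y_train[np.argmin(d, axis=1)]
    return float(np.mean(pred == y_test))


def repeated_evaluation(X, y, n_train, n_runs=10, seed=0):
    """Shuffle, split, and score n_runs times; return (best, mean) accuracy."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        order = rng.permutation(len(y))
        train, test = order[:n_train], order[n_train:]
        scores.append(one_nn_accuracy(X[train], y[train], X[test], y[test]))
    return max(scores), float(np.mean(scores))


# Synthetic data (assumption): two well-separated classes, 20 samples each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 3)), rng.normal(5.0, 0.1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)

best, mean = repeated_evaluation(X, y, n_train=30)
print(f"best accuracy: {best:.3f}, mean accuracy: {mean:.3f}")
```

Reporting both the best and the mean score mirrors the "optimal and average" columns mentioned in the text; the mean is the less optimistic of the two, since the best run is selected after the fact.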
In order to find the highest classification accuracy, the authors in [Ghaddar & Naoum-Sawaya, 2018] have applied their method FS-SVM and limited the selected subset of features to range from 2% to 20% of total number of features. In turn, the running time of FS-SVM is very high.
4 Discussions
The upper bound for the number of clusters in Algorithm 1 is the numerical rank of the matrix $A$, which indicates the largest number of independent features. There exist various clustering algorithms, and one way to improve the proposed method is to cluster the generated characteristics dataset more efficiently. The number of clusters in PFS can also be set manually, which adds great flexibility in selecting a certain number of features. It is worth noting that irrelevant features can be excluded right away, before the clustering process starts. Irrelevant features can be detected by their corresponding coefficients in the solution of the least squares problem.
Since the k-means and fuzzy c-means clustering methods choose the initial centroids randomly, the final outcome of PFS can differ from run to run, which raises a valid concern about the reproducibility of the results. To remedy this, the proposed algorithm is iterated multiple times to provide more robust and reproducible results. An alternative approach is to use a deterministic clustering algorithm, which we shall examine in the future.
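As a rough illustration of the two quantities mentioned above, the sketch below computes a numerical rank and a minimum-norm least squares solution from one SVD, then flags features whose coefficients are negligible. It is not Algorithm 1 itself: the helper `svd_least_squares`, the tolerances `rel_tol` and `coef_tol`, the toy data, and the reading of "irrelevant" as "near-zero coefficient" are all assumptions made for this example.

```python
import numpy as np


def svd_least_squares(A, b, rel_tol=1e-10):
    """Minimum-norm least squares solution of Ax = b via a truncated SVD.

    Returns the solution and the numerical rank of A. Singular values below
    rel_tol times the largest one are treated as zero (assumed tolerance).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > rel_tol * s[0]
    rank = int(keep.sum())
    x = Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])
    return x, rank


# Toy data (assumption): 8 observations, 5 features.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
A[:, 4] = A[:, 3]                               # feature 4 duplicates feature 3
x_true = np.array([2.0, 0.0, -1.0, 0.5, 0.5])   # feature 1 does not affect b
b = A @ x_true

x, rank = svd_least_squares(A, b)

coef_tol = 1e-8                                 # assumed cut-off for "irrelevant"
irrelevant = [j for j in range(A.shape[1]) if abs(x[j]) < coef_tol]
max_clusters = rank                             # upper bound on the cluster count

print("numerical rank:", rank)
print("features flagged as irrelevant:", irrelevant)
```

In this toy case the duplicated column lowers the numerical rank from 5 to 4, and the feature that plays no role in producing `b` receives a coefficient that is zero up to rounding.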
The complexity of our proposed method is dominated by the complexity of calculating SVD.
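For orientation, and assuming a dense $m \times n$ matrix $A$ together with a standard direct SVD algorithm (an assumption, since the text does not name one), the cost scales as:

```latex
\[
  \operatorname{cost}\bigl(\mathrm{SVD}(A)\bigr)
    = O\bigl(\min(m n^{2},\; m^{2} n)\bigr),
  \qquad A \in \mathbb{R}^{m \times n}.
\]
```

When the number of observations is much smaller than the number of features ($m \ll n$), this becomes $O(m^{2} n)$, which is linear in the number of features.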
5 Conclusions and future work
In this paper, we proposed a novel feature selection method. We divide a dataset into a matrix $A$ consisting of the features and the vector $b$ of the classification outcome, hence $D = [A \mid b]$. We solve the least squares problem $Ax = b$ using the singular value decomposition of $A$. We have proved and demonstrated how perturbation theory can be used to detect correlations between features. Through this process, irrelevant features can be identified and filtered out at the very first stages of the algorithm. The main ingredient of our approach is perturbation theory, and the experimental results show how effective this method is at detecting and removing correlations. We have compared our method with several other methods, and it is shown that PFS always chooses a fraction of the number of features selected by other methods. Furthermore, we believe PFS is robust against noise. A noisy dataset can be viewed as a perturbed system, so we can consider the corresponding perturbed system and apply Theorem 2.2. We shall investigate the noise-robustness of PFS in future work.
We compared the results of our method with the well-known LASSO and LARS methods and their descendants RLSR and HSIC-Lasso, as well as GBM, on several datasets. Moreover, we compared our method with the recently proposed method based on optimizing support vector machines (FS-SVM) [Ghaddar & Naoum-Sawaya, 2018]. The overall performance of PFS, in terms of the number of selected features and the resulting classification accuracies, shows its applicability and effectiveness compared to conventional and recent feature selection methods.
The advantage of the proposed method is its modularity. It can be seen as a framework for future feature selection methods, in which different characteristics of the features are extracted using a set of measures. Then, the results are grouped using a user-specified clustering method. Finally, each cluster is evaluated by an arbitrary classifier, and the best subset is selected based on the size of the selected subset, the resulting classification accuracy, or a combination of both, as suggested in Formula 6.
In future work, we shall also investigate the effect of using different parametric and non-parametric clustering methods to compare the results and decrease the complexity of PFS. We are also looking at designing a version of PFS applicable to gene datasets through a multi-stage process.
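The idea of using perturbation to expose correlated features can be illustrated on a toy problem. The sketch below is not the PFS algorithm: it solves a small system before and after a tiny random perturbation of the matrix and compares the two solutions. The perturbation size, the truncation tolerance, the helper name `svd_least_squares`, and the choice of more rows than columns (for readability, unlike the singular setting targeted here) are all assumptions.

```python
import numpy as np


def svd_least_squares(A, b, rel_tol=1e-10):
    """Minimum-norm least squares solution of Ax = b via a truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > rel_tol * s[0]          # discard numerically zero singular values
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])


# Toy data (assumption): 6 observations, 4 features.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 2]                      # features 2 and 3 are perfectly correlated
b = A @ np.array([1.0, -2.0, 3.0, 0.0])

# Small random perturbation of A (assumed magnitude 1e-6).
E = 1e-6 * rng.standard_normal(A.shape)

x = svd_least_squares(A, b)            # solution of the original system
x_tilde = svd_least_squares(A + E, b)  # solution of the perturbed system
delta = np.abs(x_tilde - x)

# Indices of the two coefficients that moved the most under perturbation.
sensitive = np.argsort(delta)[-2:]

print("coefficient changes:", delta)
print("most sensitive features:", sorted(int(j) for j in sensitive))
```

The coefficients of the independent features barely move, while those of the two correlated features change by many orders of magnitude more, which is the same pattern visible in the paper's table of original and perturbed solutions.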
Acknowledgements
The research of the second author was supported by NSERC of Canada under grant # RGPIN 418201. The authors would like to thank the anonymous reviewers for valuable comments and feedback that helped with the exposition and clarity of results.
References
- Allwein, E. L., Schapire, R. E., & Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
- Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46, 175–185.
- Bennasar, M., Hicks, Y., & Setchi, R. (2015). Feature selection using joint mutual information maximisation. Expert Systems with Applications, 42, 8520–8532.
- Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. CRC Press.
- Chen, X., Yuan, G., Nie, F., & Huang, J. Z. (2017). Semi-supervised feature selection via rescaled linear regression. In IJCAI (pp. 1525–1531).
- Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. et al. (2004). Least angle regression. The Annals of Statistics, 32, 407–499.
- Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, (pp. 1189–1232).
- Ghaddar, B., & Naoum-Sawaya, J. (2018). High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research, 265, 993–1004.
