A Feature Selection Based on Perturbation Theory

Javad Rahimipour Anaraki; Hamid Usefi

arXiv:1902.09938·cs.LG·February 27, 2019


TL;DR

This paper introduces a novel feature selection method using perturbation theory to detect feature correlations, especially effective in high-dimensional, singular datasets common in bioinformatics, outperforming traditional methods in feature reduction and accuracy.

Contribution

The paper presents a new perturbation-based approach for feature selection that effectively identifies important features in singular, high-dimensional datasets, improving over existing methods.

Findings

01

Selects fewer features while maintaining or improving accuracy.

02

Effective in high-dimensional, singular datasets.

03

Outperforms conventional feature selection methods.


Tables

Table 1: Perturbation of SynthData

	$X$	$\tilde{X}$	$X - \tilde{X}$
$x_{1}$	40.8401	40.8401	2.2115e-05
$x_{2}$	-8.5981	-8.5980	-1.1532e-05
$x_{3}$	17.4601	-5.9568e+03	-5.9743e+03
$x_{4}$	-3.7881	-1.4436e+03	-1.4398e+03
$x_{5}$	16.1273	6.1460e+03	6.1298e+03
$x_{6}$	-8.5981	-8.5980	-1.8675e-05

Table 2: Angle (in degrees) of each feature to $\textbf{b}$ in SynthData

	$f_{1}$	$f_{2}$	$f_{3}$	$f_{4}$	$f_{5}$	$f_{6}$
b	37.104	112.981	47.897	87.030	48.270	112.981

Table 3: Angles (in degrees) of the calculated $\hat{\textbf{b}}_{i}$ to $\textbf{b}$ for SynthData

Config.	${\hat{b}}_{1}$	${\hat{b}}_{2}$	${\hat{b}}_{3}$	${\hat{b}}_{4}$	${\hat{b}}_{5}$	${\hat{b}}_{6}$
$θ$	40.390	7.748	14.574	3.507	13.330	7.748

Table 4: Dataset specifications

Dataset	Samples	Features
LSVT Voice	126	310
Madelon	2000	500
Colon	62	2000
Lung	203	3312
Lymphoma	96	4026
GLIOMA	50	4434
Leukemia	72	7070
ALLAML	72	7129

Table 5: Number of selected features using GBM, LASSO, LARS, RLSR, HSIC-Lasso, PFS based on decision tree classifier (PFS-DT), PFS based on support vector machine classifier (PFS-SVM) and PFS based on $k$-nearest neighbour classifier (PFS-$k$NN). For each version of PFS, the mean number of selected features over 10 runs is reported in parentheses.

Dataset	GBM	LASSO	LARS	RLSR	HSIC-Lasso	PFS-DT	PFS-SVM	PFS-$k$NN
LSVT Voice	239	126	125	125	12	13 (45.30)	87 (111.90)	30 (94.60)
Madelon	467	89	89	89	-	34 (100.80)	6 (24.80)	25 (64.60)
Colon	656	62	61	61	9	7 (29.80)	22 (39.30)	18 (30.60)
Lung	1503	203	202	202	134	34 (105.00)	28 (100.00)	58 (131.20)
Lymphoma	1491	96	95	95	181	36 (51.80)	23 (44.80)	42 (75.50)
GLIOMA	535	50	49	49	17	7 (25.60)	17 (36.50)	28 (37.50)
Leukemia	1053	72	71	71	17	6 (46.10)	15 (41.00)	24 (49.00)
ALLAML	1200	72	71	71	8	15 (41.20)	24 (53.40)	8 (43.00)

Table 6: Classification accuracies (%) of GBM, LASSO, LARS, RLSR, HSIC-Lasso, PFS based on decision tree classifier (PFS-DT), PFS based on support vector machine classifier (PFS-SVM) and PFS based on $k$-nearest neighbour classifier (PFS-$k$NN). For each version of PFS, the mean classification accuracy over 10 runs is reported in parentheses.

Dataset	GBM	LASSO	LARS	RLSR	HSIC-Lasso	PFS-DT	PFS-SVM	PFS-$k$NN
LSVT Voice	73.68	73.68	72.14	63.16	78.94	83.97 (85.26)	60.00 (64.46)	84.28 (86.86)
Madelon	77.67	53.16	62.00	49.34	-	76.18 (81.45)	62.15 (61.62)	83.67 (81.97)
Colon	78.95	83.33	79.49	68.42	84.21	100.00 (91.58)	89.20 (92.61)	84.66 (89.20)
Lung	75.41	51.17	63.58	75.41	83.60	96.20 (94.10)	100.00 (99.95)	100.00 (99.84)
Lymphoma	62.07	39.21	32.19	60.71	51.72	64.65 (55.93)	61.11 (62.41)	66.67 (69.94)
GLIOMA	60.00	52.50	53.75	53.33	80.00	85.42 (79.33)	95.00 (90.08)	95.00 (85.58)
Leukemia	95.46	96.88	96.88	95.46	100.00	96.88 (95.45)	97.06 (99.71)	97.06 (98.23)
ALLAML	90.91	90.83	90.83	62.38	90.90	93.33 (89.09)	93.33 (96.29)	85.71 (90.95)

Table 7: The number of selected features and the resulting classification accuracies (%) using the fuzzy $c$-means version of PFS based on decision tree classifier (PFS-DT), PFS based on support vector machine classifier (PFS-SVM) and PFS based on $k$-nearest neighbour classifier (PFS-$k$NN). For each version of PFS, the mean number of selected features and the mean classification accuracy are reported in parentheses.

Dataset	Features: PFS-DT	Features: PFS-SVM	Features: PFS-$k$NN	Accuracy: PFS-DT	Accuracy: PFS-SVM	Accuracy: PFS-$k$NN
LSVT Voice	15 (55.70)	2 (87.70)	67 (86.40)	89.74 (82.43)	50.00 (56.00)	81.07 (86.11)
Madelon	19 (154.80)	15 (175.80)	78 (127.80)	75.35 (81.27)	62.48 (61.45)	79.66 (80.42)
Colon	11 (33.10)	13 (29.80)	13 (33.70)	86.67 (89.15)	90.91 (89.77)	89.20 (88.86)
Lung	53 (93.00)	66 (126.50)	63 (133.90)	95.79 (90.88)	99.47 (98.96)	100.00 (98.42)
Lymphoma	59 (53.20)	13 (37.80)	58 (73.40)	69.23 (53.18)	63.58 (62.28)	76.54 (71.05)
GLIOMA	5 (30.40)	15 (31.50)	17 (31.60)	89.58 (79.00)	90.00 (88.67)	86.67 (87.25)
Leukemia	7 (31.60)	18 (42.60)	17 (44.70)	100.00 (97.65)	94.12 (97.35)	94.12 (96.06)
ALLAML	27 (44.60)	32 (58.10)	8 (51.60)	86.09 (89.81)	82.86 (86.90)	90.00 (87.29)

Table 8: The resulting measure calculated using Equation 6 for the $k$-means and $c$-means versions of PFS based on decision tree classifier (PFS-DT), PFS based on support vector machine classifier (PFS-SVM) and PFS based on $k$-nearest neighbour classifier (PFS-$k$NN).

Dataset	$k$-means: PFS-DT	$k$-means: PFS-SVM	$k$-means: PFS-$k$NN	$c$-means: PFS-DT	$c$-means: PFS-SVM	$c$-means: PFS-$k$NN
LSVT Voice	1.88	0.57	0.91	1.47	0.64	1.00
Madelon	0.81	2.54	1.26	0.52	0.34	0.62
Colon	3.95	2.35	2.96	2.69	3.06	2.66
Lung	0.89	0.99	0.75	0.96	0.77	0.73
Lymphoma	1.03	1.40	0.92	1.00	1.67	0.97
GLIOMA	3.16	2.50	2.29	2.63	2.83	2.80
Leukemia	4.52	2.41	2.00	3.12	2.30	2.18
ALLAML	2.17	1.81	2.09	2.02	1.48	1.70

Table 9: Number of samples of each class for each dataset in FS-SVM

Dataset	Train: Class 1	Train: Class 2	Test: Class 1	Test: Class 2
Leukemia	24	13	23	12
Lung	9	70	8	69
Prostate	25	26	25	26

Table 10: Comparison of PFS based on decision tree classifier (PFS-DT) and FS-SVM. For PFS-DT, the mean is reported in parentheses, as in Tables 5 and 6.

Dataset	Features: FS-SVM	Features: PFS-DT	Accuracy: FS-SVM	Accuracy: PFS-DT
Leukemia	142	24 (20.4)	80.00	85.15 (77.34)
Lung	20	3 (29.90)	97.00	100.00 (99.28)
Prostate	252	29 (37.40)	86.00	88.23 (87.44)

Equations

$$(\hat{\alpha},\hat{\beta})=\arg\min\left\{\sum_{i=1}^{N}\Big(b_{i}-\alpha-\sum_{j}\beta_{j}x_{ij}\Big)^{2}\right\} \tag{1}$$

$$\textbf{f}_{1}=rand(100),\quad \textbf{f}_{2}=rand(100),\quad \textbf{f}_{3}=rand(100),\quad \textbf{f}_{4}=rand(100),$$

$$\textbf{f}_{5}=8\times\textbf{f}_{3}+2\times\textbf{f}_{4},\quad \textbf{f}_{6}=5\times\textbf{f}_{2},$$

$$\textbf{b}=7\times\textbf{f}_{1}-3\times\textbf{f}_{2}+6\times\textbf{f}_{3}$$

$$|\sigma_{i}-\sigma^{\prime}_{i}|\leq||E||_{2},\quad i=1,2,\dots \tag{2}$$

$$||\tilde{X}||_{2}=||V\Sigma^{-1}U^{T}\textbf{b}||_{2}\leq||\Sigma^{-1}||_{2}\,||\textbf{b}||_{2}=\frac{1}{\sigma_{\text{min}}(A+E)}\leq\frac{1}{\sigma_{\text{min}}(A)-||E||_{2}}$$

$$||E\tilde{X}||_{2}\leq||E||_{2}\,||\tilde{X}||_{2}=\frac{10^{-s}}{1-10^{-s}}=\frac{1}{10^{s}-1}\approx 10^{-s}$$

$$(x_{1}-\tilde{x}_{1})\textbf{f}_{1}+\dots+(x_{t}-\tilde{x}_{t})\textbf{f}_{t}+\dots+(x_{n}-\tilde{x}_{n})\textbf{f}_{n}\approx 0 \tag{3}$$

$$(x_{1}-\tilde{x}_{1})\textbf{f}_{1}+\dots+(x_{t}-\tilde{x}_{t})\textbf{f}_{t}\approx 0 \tag{4}$$

$$\textbf{f}^{\prime}_{5}=\frac{\textbf{f}_{5}}{||\textbf{f}_{5}||}=\frac{8\textbf{f}_{3}+2\textbf{f}_{4}}{45.38}$$

$$\frac{1}{s}\sum_{i=1}^{s}\frac{CC_{i}}{M_{i}} \tag{5}$$

$$\frac{\text{classification accuracy}}{\#\ \text{selected features}} \tag{6}$$


Taxonomy

Topics: Face and Expression Recognition · Blind Source Separation Techniques · Gene expression and cancer classification

Full text

A Feature Selection Based on Perturbation Theory

Javad Rahimipour Anaraki

Hamid Usefi

Department of Computer Science, Memorial University of Newfoundland,

St. John’s, NL, A1B 3X5 Canada

Department of Mathematics and Statistics, Memorial University of Newfoundland,

St. John’s, NL, A1C 5S7 Canada

Abstract

Consider a supervised dataset $D=[A\mid\textbf{b}]$, where $\textbf{b}$ is the outcome column, rows of $D$ correspond to observations, and columns of $A$ are the features of the dataset. A central problem in machine learning and pattern recognition is to select the most important features from $D$ to be able to predict the outcome. In this paper, we provide a new feature selection method where we use perturbation theory to detect correlations between features. We solve $AX=\textbf{b}$ using the method of least squares and the singular value decomposition of $A$. In practical applications, such as in bioinformatics, the number of rows of $A$ (observations) is much smaller than the number of columns of $A$ (features). So we are dealing with singular matrices with large condition numbers. Although it is known that the solutions of least squares problems in the singular case are very sensitive to perturbations in $A$, our novel approach in this paper is to prove that the correlations between features can be detected by applying perturbations to $A$. The effectiveness of our method is verified by performing a series of comparisons with conventional and novel feature selection methods in the literature. It is demonstrated that in most situations, our method chooses considerably fewer features while attaining or exceeding the accuracy of the other methods.

Keywords: Feature selection, Perturbation theory, Least angle regression

Journal: Expert Systems With Applications

1 Introduction

In machine learning and pattern recognition, feature selection is the process of selecting the most important features of a problem while removing unnecessary ones. This process plays an important role in reducing the dimension of datasets. Feature selection methods are categorized into two main groups: feature ranking and feature subset selection [Hall et al., 2003]. The former is a set of methods that ranks the features based on some measured values and selects the top features accordingly. The latter screens the critical features using a fitness value. Both groups can be implemented using filter-based or wrapper-based approaches [Kohavi & John, 1997]. In the filter-based approach, a merit evaluates the quality of every feature regardless of its impact on the outcome, while the wrapper-based approaches measure the effectiveness of the features based on the results of a classifier (or a set of classifiers). The wrapper-based methods are computationally intensive but powerful in predicting the outcome, whereas the filter-based methods are faster but less accurate.

With the emergence of high dimensional data, for example in Genomics, sophisticated feature selection methods are required to remove noisy features and detect correlation between features. It is desired that a small subset of features are selected to predict the outcome with high accuracy. The traditional feature selection methods such as principal component analysis [Jolliffe, 2002] or Relief [Kira & Rendell, 1992] have shortcomings in terms of dimensionality reduction, accuracy, as well as running time. We shall review some of the breakthrough methods that are effective in these respects.

There have been numerous methods based on information theory, see for example Zhao et al. [2016], Sun et al. [2013], Bennasar et al. [2015]. These methods aim to minimize the feature redundancy while maximizing the features' relevancy. The most notable and widely used information theory based method is the minimal-redundancy-maximal-relevance criterion (mRMR) Peng et al. [2005]. It is shown in various studies that mRMR effectively chooses a small subset of features to predict the outcome with high accuracy. However, as pointed out in [Yamada et al., 2018], the computational cost of mRMR on large datasets is high. In other words, it is not feasible to scale up mRMR for big datasets.

Feature selection is also referred to as variable selection in Statistics. Fundamental variable selection methods include least absolute shrinkage and selection operator (LASSO) and least angle regression (LARS). LASSO, introduced by Tibshirani [Tibshirani, 1996], is a subset selection method based on least squares regression. It reduces the size of a regression model by removing those predictor variables with zero-valued coefficients. The LASSO estimate is given by Equation 1, subject to $\sum_{j}|\beta_{j}|\leq t$, where $\beta$ is a vector of coefficients and $t\geq 0$ is a tuning parameter:

$$(\hat{\alpha},\hat{\beta})=\arg\min\left\{\sum_{i=1}^{N}\Big(b_{i}-\alpha-\sum_{j}\beta_{j}x_{ij}\Big)^{2}\right\}, \tag{1}$$

Here the solution for $\alpha$ is $\hat{\alpha}=\bar{b}$, the entries of $\hat{\beta}=(\hat{\beta}_{1},\dots,\hat{\beta}_{n})^{T}$ are the LASSO estimates, $n$ is the total number of features, $\textbf{b}$ represents the responses, $x$ contains the predictor variables, and $N$ is the number of samples.

LARS, introduced by Efron et al. [Efron et al., 2004], is a linear regression model fitting method based on the LASSO algorithm which calculates all the LASSO estimates efficiently, in combination with a forward stage-wise linear regression method, within $n$ steps, where $n$ is the number of covariates and $m$ is the number of samples. LARS starts by selecting the most relevant feature and continues by adding the next feature with the highest correlation with the current residual. Then, it continues in a direction which has equal angle from the two already selected features until the next feature is met. The complexity of the LARS algorithm is $O(n^{3}+mn^{2})$.

In a novel work, Yamada et al. [Yamada et al., 2014] proposed a non-linear feature selection method for high-dimensional datasets called Hilbert-Schmidt independence criterion least absolute shrinkage and selection operator (HSIC-Lasso), in which the most informative non-redundant features are selected using a set of kernel functions, where the solutions are found by solving a LASSO problem. The complexity of the original Hilbert-Schmidt feature selection (HSFS) is $O(n^{4})$. In a recent work [Yamada et al., 2018] called Least Angle Nonlinear Distributed (LAND), the authors have improved the computational power of the HSIC-Lasso. They have demonstrated via some experiments that LAND and HSIC-Lasso attain similar classification accuracies and dimension reduction. However, LAND has the advantage that it can be deployed in parallel distributed computing environments.

A method proposed by Chen et al. [Chen et al., 2017] is a feature selection called rescaled linear square regression (RLSR), where a set of coefficients for least square regression is employed to scale and rank features. The advantage of their method is that it can be applied to both supervised and semi-supervised classification problems.

In this paper, we introduce a new linear feature selection method. Linear models usually outperform nonlinear models over high-dimensional datasets. Consider a dataset $D$, consisting of $m$ samples where each sample contains $n+1$ features. Let us denote by $A$ the first $n$ columns of $D$ and by $\textbf{b}$ the last column. Our objective is to remove those columns of $A$ that do not have a significant impact on $\textbf{b}$. So, we want to choose a subset of columns of $A$ to express (up to an error) $\textbf{b}$ as a linear combination of this subset. We consider the linear system $AX=\textbf{b}$, where $X=[x_{1},\ldots,x_{n}]^{T}$ is the vector of unknowns. In practical applications, the system $AX=\textbf{b}$ may not have exact solutions. However, we want to find an $X$ so that the distance between $AX$ and $\textbf{b}$ is as small as possible. That is, we want to minimize the distance $||AX-\textbf{b}||_{2}$ over all $X$. To do so, we shall use the method of least squares and the singular value decomposition (SVD) of $A$. The Moore-Penrose inverse $A^{+}$ of $A$ is defined in terms of the SVD of $A$ and it is known that $X=A^{+}\textbf{b}$ is the unique solution with the smallest 2-norm that solves the least squares problem $\text{min}_{X}||AX-\textbf{b}||_{2}$, see Theorem 2.1.

There has been extensive literature, see [Golub & Van Loan, 2013], regarding the sensitivity of solutions of least squares problems when $A$ is full-rank. It is also known and rightfully cautioned that solutions of singular systems where the condition number of $A$ is bigger than one are sensitive to perturbations in $A$. However, we prove in Theorem 2.2 that one can use perturbations to reveal correlations between columns of $A$. To do so, we solve both $AX=\textbf{b}$ and $(A+E)\tilde{X}=\textbf{b}$ using SVD, where $E$ is a small perturbation of $A$. It turns out that features $\textbf{f}_{i}$ and $\textbf{f}_{j}$ correlate if and only if $\mid x_{i}-\tilde{x}_{i}\mid$ and $\mid x_{j}-\tilde{x}_{j}\mid$ are close (within the order of magnitude of $||E||_{2}$). This allows us to cluster features based on the differences $\mid x_{i}-\tilde{x}_{i}\mid$.

Next, we consider the column vector $|X-\tilde{X}|$ whose values are $\mid x_{i}-\tilde{x_{i}}\mid$ and consider clustering features based on this single column. As we mentioned, features that correlate with each other fall into the same cluster. However, within a cluster there might be features that do not correlate (but have the same value for $\mid x_{i}-\tilde{x_{i}}\mid$ ). To break down some big clusters that contain independent features, we use a simple but efficient method based on the angle between features. In Section 2.2, we consider the projection of b into each of the hyperplanes obtained by removing one feature at a time. We construct a column that consists of the angles between each feature and the corresponding hyperplane. The third column in our clustering process consists of angles between each feature and b.

We note that often in classification problems and real-world datasets, for example cancer datasets, the column $\mathbf{b}$ contains nominal values (classes). One can then assign numerical values to each class. Although this assignment is not unique, our method is insensitive to the way in which the classes are numbered. The reason is that correlations between columns of $A$ are independent of $\mathbf{b}$. Indeed, by Theorem 2.2, the vector $X-\tilde{X}$ consisting of the $x_{i}-\tilde{x}_{i}$ is proportional to correlations between columns of $A$ and as such $X-\tilde{X}$ is insensitive to changes in $\mathbf{b}$. Also, if $\mathbf{b}$ changes, then all the angles between columns of $A$ and $\mathbf{b}$ will be shifted by a fixed amount (the difference between the old $\mathbf{b}$ and the new $\mathbf{b}$). This shows that our $n\times 3$ matrix is insensitive to the way in which we convert classes to numerical values.

After arriving at the $n\times 3$ matrix, we use a clustering algorithm and cluster our $n\times 3$ matrix into $k$ clusters where $k$ is at most $\textbf{rank}(A)$ . Since we do not know the optimal $k$ , we take the output feature subset for each $k$ and use a classifier to get an accuracy with respect to that feature subset. Alternatively, our algorithm can take as input an integer $k$ to represent the number of desired features and this way we can just cluster with respect to the input $k$ and return the centroids as the selected subset of features. The final algorithm is presented in Section 2.3.

To the best of our knowledge, this is the first work to report on using perturbation theory in feature selection. Specifically, the fact that correlations can be detected via perturbations has not been explored before. As we can see through numerous experiments in Section 3, our method on average chooses smaller number of features while attaining or exceeding the classification accuracy of other methods. Also, the complexity of our algorithm is dominated by that of computing the SVD of an $m\times n$ matrix which can be done in $O(\min\{mn^{2},m^{2}n\})$ and even faster as explained in [Holmes et al., 2007]. In particular, in datasets where we have hundreds of samples and thousands of features ( $m^{2}\leq n$ ), the complexity of PFS is close to quadratic. It is also worth noting that our proposed method can be applied to both regression and classification problems. We present some further insights in Section 4, and conclude the paper and suggest possible future paths in Section 5.

2 Proposed Approach

Consider the system $AX=\textbf{b}$. Since we want to find the smallest subset of columns of $A$ whose linear combinations can express $\textbf{b}$, we can normalize the columns of $A$. So, we can assume each column of $A$ has length 1.

In real-world applications, the system $AX=\textbf{b}$ may not have a solution. In other words, if $\textbf{b}$ is not in the column space of $A$, there is no $X$ such that $AX=\textbf{b}$. Instead, we can find an $X$ so that the distance between $AX$ and $\textbf{b}$ is as small as possible. That is, we want to minimize the distance $||AX-\textbf{b}||_{2}$ over all $X$. This minimization problem is known as the method of least squares and its solution is defined via the SVD of $A$. Recall that the SVD of an $m\times n$ matrix $A$ is of the form $A=USV^{T}$, where $U$ is an $m\times m$ orthogonal matrix, $V$ is an $n\times n$ orthogonal matrix, and $S=\text{diag}(\sigma_{1},\ldots,\sigma_{r},0,\ldots,0)$ is an $m\times n$ diagonal matrix. Also recall that the Moore-Penrose inverse of $A$ is the $n\times m$ matrix $A^{+}=VS^{-1}U^{T}$, where $S^{-1}=\text{diag}(\sigma_{1}^{-1},\ldots,\sigma_{r}^{-1},0,\ldots,0)$.

It is well-known that the least squares solutions can be given in terms of the Moore-Penrose inverse, see [Golub & Van Loan, 2013].

Theorem 2.1 (All Least Squares Solutions)

Let $A$ be an $m\times n$ matrix and $\textbf{b}\in\mathbb{R}^{m}$ . Then all the solutions of $\text{min}_{X}||AX-\textbf{b}||_{2}$ are of the form $y=A^{+}\textbf{b}+q$ , where $q\in\ker(A)$ . Furthermore, the unique solution whose 2-norm is the smallest is given by $z=A^{+}\textbf{b}$ .
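Theorem 2.1 can be checked numerically on a small example. The sketch below assumes NumPy; the matrix sizes, the random data and the tolerance `1e-10` are illustrative choices, not values from the paper. It builds $A^{+}$ from the SVD, forms the minimum-norm solution $z=A^{+}\textbf{b}$, and adds a vector $q\in\ker(A)$ to obtain another least squares solution with the same residual but a larger norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# More columns than rows, so ker(A) is non-trivial (illustrative sizes).
A = rng.standard_normal((4, 6))
b = rng.standard_normal(4)

# Full SVD: A = U @ S @ Vt, with Vt of shape 6 x 6.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = int(np.sum(s > 1e-10 * s[0]))          # numerical rank

# Moore-Penrose inverse: invert only the r non-zero singular values.
A_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

z = A_pinv @ b                              # minimum-norm least squares solution

# The remaining right singular vectors span ker(A).
null_basis = Vt[r:].T
q = null_basis @ np.ones(null_basis.shape[1])
y = z + q                                   # another least squares solution

res_z = np.linalg.norm(A @ z - b)
res_y = np.linalg.norm(A @ y - b)
print(res_z, res_y, np.linalg.norm(z), np.linalg.norm(y))
```

Since $z$ lies in the row space of $A$ and $q$ in its kernel, the two are orthogonal, which is why adding $q$ can only increase the norm.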

In our method, each dataset with $m$ samples and $n+1$ features is divided into two matrices: coefficients and constants. Coefficients matrix $A$ involves all the feature values except for the outcome, the constant vector b only contains the classification outcome. In the next section we employ perturbation theory to detect redundant features.

2.1 Detecting correlations via perturbation

To demonstrate how the perturbation can reveal different aspects of features, a synthetic dataset called SynthData is generated with 100 samples and six features based on the following setup:

$$\textbf{f}_{1}=rand(100),\quad \textbf{f}_{2}=rand(100),\quad \textbf{f}_{3}=rand(100),\quad \textbf{f}_{4}=rand(100),$$

$$\textbf{f}_{5}=8\times\textbf{f}_{3}+2\times\textbf{f}_{4},\quad \textbf{f}_{6}=5\times\textbf{f}_{2},\quad \textbf{b}=7\times\textbf{f}_{1}-3\times\textbf{f}_{2}+6\times\textbf{f}_{3},$$

where $rand(100)$ generates 100 random numbers with uniform probability in the interval $(0,1)$. So, $D=[A\mid\textbf{b}]$, where $A=[\textbf{f}_{1}\mid\cdots\mid\textbf{f}_{6}]$ is a $100\times 6$ matrix. Now let $E$ be a small perturbation of $A$ and solve $AX=\textbf{b}$ and $(A+E)\tilde{X}=\textbf{b}$ using SVD. We report the solutions $X$ and $\tilde{X}$ as well as their differences in Table 1. As expected, $X$ and $\tilde{X}$ differ significantly. However, our interest is focused on the last column of Table 1, where we have recorded the difference between $X$ and $\tilde{X}$.
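The experiment can be sketched in a few lines, assuming NumPy. The seed, the Gaussian form of $E$, its rescaling to $||E||_{2}=10^{-3}\sigma_{\text{min}}(A)$ and the rank tolerance are assumptions of this sketch, so the printed numbers will not match Table 1; only the setup of SynthData follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100

# SynthData as described above; rng.random draws uniformly from [0, 1).
f1, f2, f3, f4 = (rng.random(m) for _ in range(4))
f5 = 8 * f3 + 2 * f4
f6 = 5 * f2
b = 7 * f1 - 3 * f2 + 6 * f3

A = np.column_stack([f1, f2, f3, f4, f5, f6])
A = A / np.linalg.norm(A, axis=0)            # unit-length columns

# Smallest non-zero singular value, using a tolerance for the numerical rank.
s = np.linalg.svd(A, compute_uv=False)
rank = int(np.sum(s > 1e-10 * s[0]))
sigma_min = s[rank - 1]

# Random perturbation rescaled so that ||E||_2 = 1e-3 * sigma_min (assumed form).
E = rng.standard_normal(A.shape)
E *= 1e-3 * sigma_min / np.linalg.norm(E, 2)

# lstsq returns the minimum-norm least squares solution.
X = np.linalg.lstsq(A, b, rcond=1e-10)[0]
X_tilde = np.linalg.lstsq(A + E, b, rcond=None)[0]
diff = X - X_tilde

for i, (x, xt, d) in enumerate(zip(X, X_tilde, diff), start=1):
    print(f"x_{i}: {x: .4f}  {xt: .4e}  {d: .4e}")
```

The quantity of interest is the last printed column: entries belonging to features that take part in a dependence relation can be compared with the entry of the independent feature $\textbf{f}_{1}$.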

Before we state the main theorem, we shall need to recall some facts and definitions which can be found in [Golub & Van Loan, 2013].

Let $\tilde{A}=A+E$ be a perturbation of $A$. Denote by $\sigma_{1}\geq\sigma_{2}\geq\cdots$ and $\sigma^{\prime}_{1}\geq\sigma^{\prime}_{2}\geq\cdots$ the singular values of $A$ and $\tilde{A}$, respectively. The smallest non-zero singular value of $A$ is denoted by $\sigma_{\text{min}}$ and the greatest of the $\sigma_{i}$ is denoted by $\sigma_{\text{max}}$. It is well-known that $||A||_{2}=\sigma_{\text{max}}$. It has been of great interest to compare the $\sigma_{i}$ and $\sigma^{\prime}_{i}$. In this regard, we use a classical bound on the difference between $\sigma_{i}$ and $\sigma^{\prime}_{i}$ due to Weyl:

$$|\sigma_{i}-\sigma^{\prime}_{i}|\leq||E||_{2},\quad i=1,2,\dots \tag{2}$$

We need to determine the type of perturbations we use. Indeed, we choose $E$ to be a random matrix such that $||E||_{2}\approx 10^{-s}\sigma_{\text{min}}(A)$ , for some $s\geq 0$ . We set $s=3$ where our estimates are correct up to a magnitude of $10^{-3}$ . We are now ready to prove the main theorem of this paper.
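Weyl's bound and the choice of $E$ can be illustrated numerically. This is a minimal sketch assuming NumPy; the test matrix, the seed and the Gaussian form of $E$ are hypothetical, and only the scaling $||E||_{2}=10^{-s}\sigma_{\text{min}}(A)$ with $s=3$ comes from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# A full-rank test matrix with unit-length columns (illustrative data).
A = rng.random((100, 6))
A = A / np.linalg.norm(A, axis=0)

sigma = np.linalg.svd(A, compute_uv=False)   # sorted in decreasing order
s_exp = 3                                    # the exponent s in the text

# Random perturbation rescaled so that ||E||_2 = 10^{-s} * sigma_min(A).
E = rng.standard_normal(A.shape)
E *= 10.0 ** (-s_exp) * sigma[-1] / np.linalg.norm(E, 2)

sigma_pert = np.linalg.svd(A + E, compute_uv=False)
norm_E = np.linalg.norm(E, 2)
max_shift = np.max(np.abs(sigma - sigma_pert))

print("largest singular value shift:", max_shift)
print("||E||_2:                     ", norm_E)
```

No singular value moves by more than $||E||_{2}$, so in particular $\sigma_{\text{min}}(A+E)\geq\sigma_{\text{min}}(A)-||E||_{2}$, which is the form in which the bound is used in the proof of Theorem 2.2.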

Theorem 2.2

Let $X$ and $\tilde{X}$ be solutions of $AX=\textbf{b}$ and $(A+E)\tilde{X}=\textbf{b}$, where $E$ is a small enough perturbation, and let $S$ denote the set of all features. If a feature $\textbf{f}_{i}$ is independent of the rest of the features then $|x_{i}-\tilde{x}_{i}|\approx 0$. Furthermore, suppose that $S^{\prime}=\{\textbf{f}_{1},\ldots,\textbf{f}_{t}\}$ is a subset of $S$ such that $\sum_{i=1}^{t}c_{i}\textbf{f}_{i}=0$, for some non-zero $c_{i}$. If

1. any proper subset of $S^{\prime}$ is linearly independent,

2. $\textbf{f}_{1},\ldots,\textbf{f}_{t}$ are linearly independent from the rest of the features in $S$,

then the vectors $\begin{pmatrix}c_{1}\\ \vdots\\ c_{t}\end{pmatrix}$ and $\begin{pmatrix}x_{1}-\tilde{x}_{1}\\ \vdots\\ x_{t}-\tilde{x}_{t}\end{pmatrix}$ are proportional.

Proof. From $AX=\textbf{b}$ and $(A+E)\tilde{X}=\textbf{b}$ , we get $A(X-\tilde{X})=E\tilde{X}$ . We claim that $||E\tilde{X}||\approx 10^{-s}$ . To prove the claim, we consider the SVD of $A+E$ which is of the form $A+E=U\Sigma V^{T}$ . So, $\tilde{X}=V\Sigma^{-1}U^{T}b$ . Since $U$ and $V$ are orthogonal and for orthogonal matrices we have $||U\mathbf{v}||_{2}=||\mathbf{v}||_{2}$ , it follows that

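$$||\tilde{X}||_{2}=||V\Sigma^{-1}U^{T}\textbf{b}||_{2}\leq||\Sigma^{-1}||_{2}\,||\textbf{b}||_{2}=\frac{1}{\sigma_{\text{min}}(A+E)}\leq\frac{1}{\sigma_{\text{min}}(A)-||E||_{2}},$$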

by Equation (2). Hence,

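$$||E\tilde{X}||_{2}\leq||E||_{2}\,||\tilde{X}||_{2}\leq\frac{||E||_{2}}{\sigma_{\text{min}}(A)-||E||_{2}}=\frac{10^{-s}}{1-10^{-s}}=\frac{1}{10^{s}-1}\approx 10^{-s},$$

where the first equality uses $||E||_{2}\approx 10^{-s}\sigma_{\text{min}}(A)$.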

It follows from the claim that

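$$(x_{1}-\tilde{x}_{1})\textbf{f}_{1}+\dots+(x_{t}-\tilde{x}_{t})\textbf{f}_{t}+\dots+(x_{n}-\tilde{x}_{n})\textbf{f}_{n}\approx 0. \tag{3}$$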

Now, if a feature, say $\textbf{f}_{n}$ , is independent of the rest of features, then it follows from Equation (3) that $|x_{n}-\tilde{x}_{n}|\approx 0$ . Suppose now that $S^{\prime}=\{\textbf{f}_{1},\ldots,\textbf{f}_{t}\}$ is a linearly dependent subset of $S$ such that $\sum_{i=1}^{t}c_{i}\textbf{f}_{i}=0$ , for some coefficients $c_{1},\ldots,c_{t}$ . Since $\textbf{f}_{1},\ldots,\textbf{f}_{t}$ are linearly independent from the rest of features in $S$ , we get

$$(x_{1}-\tilde{x}_{1})\textbf{f}_{1}+\dots+(x_{t}-\tilde{x}_{t})\textbf{f}_{t}\approx 0. \tag{4}$$

Now, if $\begin{pmatrix}c_{1}\\ \vdots\\ c_{t}\end{pmatrix}$ and $\begin{pmatrix}x_{1}-\tilde{x}_{1}\\ \vdots\\ x_{t}-\tilde{x}_{t}\end{pmatrix}$ are not proportional, we can use Equation (4) and $\sum_{i=1}^{t}c_{i}\textbf{f}_{i}=0$ to get a dependence relation of a shorter length between the elements of $S^{\prime}$ , which would contradict our assumption (1). The proof is complete. $\Box$

Consider now the correlation $\textbf{f}_{5}=8\times\textbf{f}_{3}+2\times\textbf{f}_{4}$ in the SynthData dataset. As we mentioned earlier, we normalize the columns of $A$ and replace $A$ with $[\textbf{f}^{\prime}_{1}\mid\cdots\mid\textbf{f}^{\prime}_{6}]$, where $\textbf{f}^{\prime}_{i}=\frac{\textbf{f}_{i}}{||\textbf{f}_{i}||}$. Note that $||\textbf{f}_{3}||=5.52,||\textbf{f}_{4}||=5.33,||\textbf{f}_{5}||=45.38$. We have

$$\textbf{f}^{\prime}_{5}=\frac{\textbf{f}_{5}}{||\textbf{f}_{5}||}=\frac{8\textbf{f}_{3}+2\textbf{f}_{4}}{45.38}=\frac{8\times 5.52}{45.38}\textbf{f}^{\prime}_{3}+\frac{2\times 5.33}{45.38}\textbf{f}^{\prime}_{4}\approx 0.97\,\textbf{f}^{\prime}_{3}+0.23\,\textbf{f}^{\prime}_{4}.$$

So, the correlation vector between $\textbf{f}^{\prime}_{3},\textbf{f}^{\prime}_{4},\textbf{f}^{\prime}_{5}$ is $\begin{bmatrix}0.97\\ 0.23\\ -1\end{bmatrix}$. On the other hand, we have $\begin{bmatrix}x_{3}-\tilde{x}_{3}\\ x_{4}-\tilde{x}_{4}\\ x_{5}-\tilde{x}_{5}\\ \end{bmatrix}=(-6.1298\times 10^{3})\begin{bmatrix}0.97\\ 0.23\\ -1\end{bmatrix}$. Note that in this example, the weights (norms) of $8\times\textbf{f}_{3}$ and $\textbf{f}_{5}$ are very close to each other compared to the weight of $2\times\textbf{f}_{4}$. In general, when a dependence relation exists between a set of features, Theorem 2.2 along with normalization detects the two features whose weights are closest to each other compared to the others. In particular, if features $\textbf{f}_{i}$ and $\textbf{f}_{j}$ correlate with each other then the differences $\mid x_{i}-\tilde{x_{i}}\mid$ and $\mid x_{j}-\tilde{x_{j}}\mid$ are almost the same. The converse is not necessarily true.
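The proportionality claimed by Theorem 2.2 can be checked by hand from the numbers already reported. The sketch below uses only the norms quoted in the text and the last column of Table 1; the variable names are illustrative.

```python
# Norms quoted in the text and the last column of Table 1.
norm_f3, norm_f4, norm_f5 = 5.52, 5.33, 45.38
diff = {"x3": -5.9743e+03, "x4": -1.4398e+03, "x5": 6.1298e+03}

# Coefficients of f'_3 and f'_4 in the normalized relation
# f'_5 = c3 * f'_3 + c4 * f'_4.
c3 = 8 * norm_f3 / norm_f5
c4 = 2 * norm_f4 / norm_f5

# Rescale the differences so that the f_5 entry equals -1.
scale = -diff["x5"]
ratios = [diff["x3"] / scale, diff["x4"] / scale, diff["x5"] / scale]

print("correlation vector:  ", [round(c3, 2), round(c4, 2), -1])
print("rescaled differences:", [round(r, 2) for r in ratios])
```

Both vectors agree to two decimal places, which is the agreement expected when the perturbation is of relative size $10^{-3}$.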

We can now consider a column vector whose values are $\mid x_{i}-\tilde{x_{i}}\mid$ and use a clustering algorithm to cluster this single column. Clearly, features that correlate with each other fall into the same cluster. However, within a cluster there might be features that do not correlate (but have the same value for $\mid x_{i}-\tilde{x_{i}}\mid$ ). For this reason, we want to further refine the clustering process by computing two more characteristics of data. We shall explain this in the next section.

2.2 Refining the clustering process

One way to compare the similarity between vectors is by calculating the angle between them. Features that have smaller angles with the outcome $\textbf{b}$ are informative and predictive. So we construct another column whose values are the angles between the $\textbf{f}_{i}$ and $\textbf{b}$. The angles of each feature with $\textbf{b}$ in SynthData are shown in Table 2.
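A minimal sketch of this column, assuming NumPy; the helper names `angle_deg` and `feature_angles` are hypothetical, and reporting the angle in degrees is an assumption based on the magnitudes in Table 2.

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two non-zero vectors in degrees, in [0, 180]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def feature_angles(A, b):
    """Angle between every column of A and the outcome vector b."""
    return np.array([angle_deg(A[:, j], b) for j in range(A.shape[1])])

# Tiny illustration: two orthogonal features and an outcome halfway between them.
A = np.eye(2)
b = np.array([1.0, 1.0])
print(feature_angles(A, b))
```

Because the angle is invariant to positive rescaling, it does not matter whether the columns of $A$ have already been normalized.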

Our third column in the clustering process is obtained as follows. We remove each feature $\textbf{f}_{i}$ from the matrix $A$ along with its corresponding coefficient $x_{i}$ in $X$ . Then, the angle of resulting vector $A\setminus\{\textbf{f}_{i}\}\times X\setminus\{x_{i}\}=\hat{\textbf{b}}_{i}$ and the actual outcome b will be considered as a measure of the relevancy for feature $\textbf{f}_{i}$ . Note that the closer b and $\hat{\textbf{b}}_{i}$ are, the less significant the vector $x_{i}\textbf{f}_{i}$ is. Applying this process to SynthData is shown in Table 3.
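The leave-one-out construction can be sketched as follows, assuming NumPy. The function names and the toy data are hypothetical, and the sketch assumes that no $\hat{\textbf{b}}_{i}$ is the zero vector.

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two non-zero vectors in degrees, in [0, 180]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def leave_one_out_angles(A, X, b):
    """Angle between b and b_hat_i, obtained by dropping feature i and x_i."""
    n = A.shape[1]
    angles = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i             # boolean mask without column i
        b_hat_i = A[:, keep] @ X[keep]
        angles[i] = angle_deg(b_hat_i, b)
    return angles

# Toy example: the second feature has a zero coefficient, so removing it
# leaves the prediction unchanged.
A = np.eye(3)
X = np.array([3.0, 0.0, 4.0])
b = A @ X
angles = leave_one_out_angles(A, X, b)
print(angles)
```

In the toy example the angle for the second feature is zero, matching the remark that a small angle indicates an insignificant term $x_{i}\textbf{f}_{i}$.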

Now we set up an $n\times 3$ matrix where the first column consists of $|x_{i}-\tilde{x}_{i}|$, the second column is the angles between the $\textbf{f}_{i}$'s and $\mathbf{b}$, and the third column is the angles between the $\hat{\textbf{b}}_{i}$'s and $\mathbf{b}$. Next we use a clustering algorithm to cluster our $n\times 3$ matrix into $k$ clusters. The centroids of clusters will be chosen as our selected features. Since we do not know the optimal number of clusters, we take the output feature subset for each $k$ and use a classifier to get an accuracy with respect to that feature subset. Alternatively, our algorithm can take as input an integer $k$ to represent the number of desired features and this way we can just cluster with respect to the input $k$ and return the centroids as the selected subset of features. The upper bound for the number of clusters is $\textbf{rank}(A)$, where $\textbf{rank}(A)$ is the numerical rank of $A$.
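The clustering step might look like the sketch below, assuming NumPy. Several choices here are assumptions of the sketch and are not taken from the paper: a plain Lloyd's $k$-means written inline, standardization of the three columns before clustering, and reading "the centroids ... will be chosen as our selected features" as picking the feature closest to each centroid. The matrix `M` is assembled from Tables 1 to 3 for SynthData.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels of the last assignment)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:             # keep the old centroid if a cluster empties
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

def select_features(M, k, seed=0):
    """Indices of the rows of M closest to each of the k centroids."""
    # Standardize the columns so that no single characteristic dominates
    # the Euclidean distance (a choice made for this sketch).
    Z = (M - M.mean(axis=0)) / (M.std(axis=0) + 1e-12)
    centroids, _ = kmeans(Z, k, seed=seed)
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(set(int(i) for i in dists.argmin(axis=0)))

# Rows: f_1..f_6. Columns: |x_i - x~_i| (Table 1), angle(f_i, b) (Table 2),
# angle(b_hat_i, b) (Table 3).
M = np.array([
    [2.2115e-05,  37.104, 40.390],
    [1.1532e-05, 112.981,  7.748],
    [5.9743e+03,  47.897, 14.574],
    [1.4398e+03,  87.030,  3.507],
    [6.1298e+03,  48.270, 13.330],
    [1.8675e-05, 112.981,  7.748],
])
selected = select_features(M, k=3)
print("selected feature indices (0-based):", selected)
```

The result of $k$-means depends on the initialization, which is consistent with the tables reporting a mean over 10 runs for each version of PFS.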

2.3 Algorithm

The PFS running time is $t\times(\min(m\times n^{2},m^{2}\times n)+k\times(3\times n\times k))$ , where $\min(m\times n^{2},m^{2}\times n)$ is the complexity of calculating SVD for a $m\times n$ matrix [Holmes et al., 2007], and $(3\times n\times k)$ is the time complexity of the $k$ -means clustering algorithm to cluster a dataset of size $n\times 3$ into $k$ clusters. Therefore, the time complexity of PFS is dominated by the complexity of SVD.

The flowchart of PFS is depicted in Figure 1 and its steps are listed in Algorithm 1. The MATLAB® implementation of PFS is publicly available on GitHub (https://github.com/jracp/PerturbationFeatureSelection).

3 Experimental Results

We generate the perturbation matrix $E$ such that the entries of $E$ are randomly chosen in the range $c_{l}=10^{6}$ and $c_{u}=10^{5}$ .

Following Tran et al. [Tran et al., 2017], the classification accuracy for imbalanced datasets should be calculated using Equation 5.

$$\text{Accuracy}=\frac{1}{s}\sum_{i=1}^{s}\frac{CC_{i}}{M_{i}}\qquad(5)$$

where $s$ is the number of classes in the dataset, $CC_{i}$ is the number of correctly classified instances within class $i$, and $M_{i}$ is the total number of samples in class $i$.
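A short sketch of this accuracy measure follows, assuming Equation 5 is the mean over the $s$ classes of the per-class ratio $CC_{i}/M_{i}$ (the usual balanced accuracy). The function name and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean over classes of (correctly classified in class i) / (samples in class i)."""
    totals = Counter(y_true)                                         # M_i per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)   # CC_i per class
    return sum(correct[c] / totals[c] for c in totals) / len(totals)

# Hypothetical imbalanced example: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]   # one class-1 sample is misclassified

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

Here the plain accuracy is 0.9, while the class-averaged value is 0.75, since half of the minority class is misclassified.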

When comparing two feature selection methods, three quantities matter: 1) the accuracy, 2) the number of selected features, and 3) the complexity and running time.

We quantify the relative effectiveness of a feature selection method, based on its accuracy and its number of selected features, as follows:

[TABLE]

Formula (6) means that a feature selection method with a smaller number of features and a higher classification accuracy is favourable.

All computations were done on an Ubuntu 14.04 LTS machine with an Intel® Core™ i5-4570 processor and 24 GB of RAM, using MATLAB® 9.2.0.556344 (R2017a), R version 3.4.4 (2018-03-15), and Java™ SE Runtime Environment (build 1.8.0_151-b12).

3.1 Comparisons with conventional methods

In this section, we compare PFS with Friedman’s gradient boosting machine (GBM) [Friedman, 2001]; least absolute shrinkage and selection operator (LASSO) [Tibshirani, 1996]; least angle regression (LARS) [Efron et al., 2004]; rescaled linear square regression (RLSR) [Chen et al., 2017] with $k=minSelF$, where $minSelF$ is the minimum number of features selected by GBM, LASSO and LARS; and Hilbert-Schmidt independence criterion least absolute shrinkage and selection operator (HSIC-Lasso) [Yamada et al., 2014]. We used the gbm package in R [Ridgeway, 2007] to run GBM, the MATLAB® implementations of LASSO and LARS by Sjöstrand [Sjöstrand, 2005], and MATLAB® implementations of RLSR and HSIC-Lasso.

In Section 3.1.1, we use $k$-means to cluster our $n\times 3$ matrix, where the upper bound for $k$ is the numerical rank of $A$. To find the best subset, we have experimented with three different classifiers in the inner layer, namely decision tree (DT) [Breiman et al., 1984], support vector machine (SVM) [Allwein et al., 2000], and $k$-nearest neighbour ($k$-NN) [Altman, 1992]. Once we find the $k$ and the corresponding subset of features that give the best accuracy, we output that subset as the selected features. At the outer layer of our algorithm, we always use DT for classification. To obtain a fair and robust result, we run the algorithm 10 times, where each time a subset of features is output and then classified by DT. The average accuracy as well as the average size of the feature subsets are reported. We report similar experiments using fuzzy $c$-means in Section 3.1.2.
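The inner-layer search over $k$ can be sketched as a simple loop. Here `select` and `score` are hypothetical callables standing in for the clustering step and the inner classifier, and breaking ties in favour of the smaller subset is an assumption of this sketch, not a rule stated in the paper.

```python
def best_subset(candidate_ks, select, score):
    """Try every k, score its feature subset, and keep the best (accuracy, subset, k)."""
    best = None
    for k in candidate_ks:
        subset = select(k)    # feature indices returned by clustering with k clusters
        acc = score(subset)   # inner-classifier accuracy on that subset
        better = best is None or acc > best[0]
        # Assumed tie-break: prefer the smaller subset at equal accuracy.
        tie_smaller = best is not None and acc == best[0] and len(subset) < len(best[1])
        if better or tie_smaller:
            best = (acc, subset, k)
    return best

# Hypothetical stand-ins: k clusters yield the first k features,
# and accuracy depends only on the subset size.
toy_scores = {1: 0.6, 2: 0.8, 3: 0.8}
result = best_subset(
    range(1, 4),
    select=lambda k: list(range(k)),
    score=lambda subset: toy_scores[len(subset)],
)
```

With these stand-ins the search returns the two-feature subset, since adding a third feature does not raise the accuracy.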

We perform a series of tests on various datasets: one medical dataset, LSVT Voice [Tsanas et al., 2014]; one artificial dataset, Madelon; and six biological datasets, namely Colon, Lung, Lymphoma, GLIOMA, Leukemia and ALLAML. The datasets have been selected from the ASU dataset repository [Li et al., 2017] and the UCI repository of machine learning [Lichman, 2013]. The specifications of all datasets are given in Table 4.

Note that for the experiments in this section, the decision tree classifier is applied with MATLAB®, using 70% of the data for training and 30% for testing and validating. This setup is applied to all methods, including GBM, LASSO, LARS, RLSR, HSIC-Lasso, and PFS. Since PFS uses a clustering algorithm, the selected subset of features in PFS can change from run to run. So we run PFS 10 times on randomly shuffled data, where the training and test sets vary accordingly in each run.
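The repeated shuffle-and-split protocol can be sketched as follows. The function names, the use of the run index as the random seed, and the placeholder `evaluate` callable are assumptions for illustration; the paper's experiments were run in MATLAB®.

```python
import random

def shuffled_split(n_samples, train_fraction=0.7, seed=None):
    """Shuffle sample indices and split them into training and test index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = round(train_fraction * n_samples)
    return idx[:cut], idx[cut:]

def repeated_evaluation(n_samples, evaluate, runs=10):
    """Average the value of evaluate(train_idx, test_idx) over several shuffled splits."""
    results = [evaluate(*shuffled_split(n_samples, seed=r)) for r in range(runs)]
    return sum(results) / len(results), results

train_idx, test_idx = shuffled_split(10, seed=0)

# Placeholder evaluation: returns the test fraction instead of a real accuracy.
mean_value, per_run = repeated_evaluation(10, lambda tr, te: len(te) / 10)
```

In practice `evaluate` would select features on the training indices, fit the classifier, and return the accuracy on the test indices.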

3.1.1 Evaluation results using $k$-means

In this section, we use $k$-means to cluster our $n\times 3$ matrix, where the upper bound for $k$ is the numerical rank of $A$. To find the best subset, we have experimented with three different classifiers in the inner layer, namely DT, SVM and $k$NN. Once we find the $k$ and the corresponding subset of features that give the best accuracy, we output that subset as the selected features. At the outer layer, we always use DT for classification, for all the methods.

In Tables 5 and 6, we report the number of selected features and the classification accuracies, respectively. Note that PFS-DT, PFS-SVM, and PFS-$k$NN mean that we have used DT, SVM, and $k$NN as the inner classifier in PFS, respectively. For all the methods we have used DT to report the classification accuracy.

To obtain a fair and robust result, we run our algorithm 10 times, where each time the dataset is randomly shuffled and a subset of features is output. The average accuracy as well as the average size of the feature subsets are reported. Also, we use Formula 6 to find the optimal accuracy and subset of features amongst the 10 runs. In the columns corresponding to PFS-DT, PFS-SVM, and PFS-$k$NN, the optimal number of features and the optimal classification accuracy with respect to Formula 6 are shown in the superscript, whereas the average number of features and the average classification accuracy are shown in the subscript.

We can see from Table 6 that, overall, the classification accuracies of the PFS-based methods compare favourably with those of the other methods, and only HSIC-Lasso sometimes attains similar accuracies. On the other hand, HSIC-Lasso chooses fewer features on average than the PFS-based methods. We remark that the number of features in PFS depends on the upper bound we set for the number of clusters when we cluster our intermediate $n\times 3$ matrix. We have taken $\mathbf{rank}(A)$ as an upper bound, but this bound is just a crude estimate and in the next phases of this project we shall improve it. Hence, it is possible to further decrease the average number of features in PFS.

We can also observe from Table 6 that when $k$NN is used as the inner classifier, the average classification accuracies are slightly better than when DT or SVM is used. In contrast, the average number of features is slightly lower when DT is used as the inner classifier.

3.1.2 Evaluation results using fuzzy $c$-means

To investigate the effect of the clustering method, we have also experimented with the fuzzy $c$-means clustering algorithm; the results are shown in Table 7. We can observe from Table 7 that, all in all, there is very little difference in average classification accuracy regardless of which inner classifier is used. In contrast, the average number of features is slightly lower when DT is used as the inner classifier.

3.1.3 A quantified measure

In Sections 3.1.1 and 3.1.2, we used $k$-means and fuzzy $c$-means, respectively, as our clustering algorithm. It seems that with fuzzy $c$-means our method in general chooses more features. To amalgamate the results of Tables 5, 6, and 7, we apply Formula 6 using the average classification accuracy and the average number of features to obtain a comparison between $k$-means and fuzzy $c$-means in Table 8. We can conclude that, based on the measure given by Formula 6, our algorithm performs better when $k$-means is used for clustering.

3.2 Comparison with methods based on SVM & optimization

A recent paper by Ghaddar and Naoum-Sawaya [Ghaddar & Naoum-Sawaya, 2018] proposed a feature selection method using support vector machines (FS-SVM) for binary-class datasets, in which a pre-defined percentage of features is selected by adjusting the $l_{1}$-norm of the classifier.

Ghaddar and Naoum-Sawaya applied their method to a set of cancer datasets (# of samples $\times$ # of features), namely Leukemia (72 $\times$ 7130), Lung cancer (139 $\times$ 1000) and Prostate cancer (102 $\times$ 12,601), adopted from the Cancer Program at the Broad Institute (http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi) and different from those in Table 4. For each dataset, a subset of the positive and negative classes has been selected for training and testing purposes (see Table 9).

We have used PFS with DT as the inner classifier and followed the same setup to compare PFS-DT with the method proposed in [Ghaddar & Naoum-Sawaya, 2018]. To get unbiased results, we ran PFS-DT 10 times, where each time we shuffled the data and constructed the test and training datasets based on the configuration in Table 9. The optimal and average results are reported in Table 10.

In order to find the highest classification accuracy, the authors of [Ghaddar & Naoum-Sawaya, 2018] applied their method FS-SVM while limiting the size of the selected subset to between 2% and 20% of the total number of features. As a result, the running time of FS-SVM is very high.

4 Discussions

The upper bound for the number of clusters in Algorithm 1 is the numerical rank of the matrix $A$, which indicates the largest number of linearly independent features. There exist various clustering algorithms, and one way to improve the proposed method is to cluster the generated characteristics dataset more efficiently. Of course, the number of clusters in PFS can be set manually, which adds great flexibility in selecting a certain number of features. It is worth noting that some irrelevant features can be excluded right away, before the clustering process starts. Irrelevant features can be detected by their corresponding coefficients in the solution of the least squares problem.

Since the $k$-means and fuzzy $c$-means clustering methods choose the initial centroids randomly, the final outcome of PFS can differ from run to run, which raises a valid concern about the reproducibility of the results. To remedy this, the proposed algorithm is iterated $t$ times to provide more robust and reproducible results. An alternative approach is to use a deterministic clustering algorithm, which we shall examine in the future.

The complexity of our proposed method is dominated by the complexity of calculating SVD.

5 Conclusions and future work

In this paper, we proposed a novel feature selection method. We divide a dataset $D$ into a matrix $A$ consisting of features and the vector $\mathbf{b}$ of the classification outcome, hence $D=[A\mid\textbf{b}]$. We solve the least squares problem $\text{min}_{X}||AX-\textbf{b}||_{2}$ using the singular value decomposition of $A$. We have proved and demonstrated how perturbation theory can be used to detect correlations between features. Through this process, irrelevant features can be identified and filtered out at the very first stages of the algorithm. The main ingredient of our approach is perturbation theory, and the experimental results show how effective this method is at detecting and removing correlations. We have compared our method with several other methods, and the results show that PFS generally selects only a fraction of the number of features selected by most of the other methods. Furthermore, we believe PFS is robust against noise. A noisy dataset can be viewed as a perturbed system, so we can consider a system of the form $\tilde{A}X=\tilde{\textbf{b}}$ and apply Theorem 2.2. We shall investigate the noise-robustness of PFS in future work.

We compared the results of our method with those of the well-known LASSO and LARS methods and their descendants RLSR and HSIC-Lasso, as well as GBM, on several datasets. Moreover, we compared our method with the recently proposed method based on optimizing support vector machines (FS-SVM) [Ghaddar & Naoum-Sawaya, 2018]. The overall performance of PFS, in terms of the number of selected features and the resulting classification accuracies, shows its applicability and effectiveness compared to conventional and recent feature selection methods.

The advantage of the proposed method is its modularity. It can be seen as a framework for future feature selection methods, in which different characteristics of the features are extracted using a set of measures. Then, the results are grouped using a user-specified clustering method. Finally, each resulting subset is evaluated by an arbitrary classifier and the best subset is selected based on the size of the selected subset, the resulting classification accuracy, or a combination of both, as suggested in Equation 6.

In future work, we shall also investigate the effect of using different parametric and non-parametric clustering methods to compare the results and decrease the complexity of PFS. Also, we are looking at designing a version of PFS applicable to gene datasets through a multi-stage process.

Acknowledgements

The research of the second author was supported by NSERC of Canada under grant # RGPIN 418201. The authors would like to thank the anonymous reviewers for valuable comments and feedback that helped with the exposition and clarity of results.

Bibliography

Allwein et al. [2000] Allwein, E. L., Schapire, R. E., & Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of machine learning research, 1, 113–141.
Altman [1992] Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46, 175–185.
Bennasar et al. [2015] Bennasar, M., Hicks, Y., & Setchi, R. (2015). Feature selection using joint mutual information maximisation. Expert Systems with Applications, 42, 8520–8532.
Breiman et al. [1984] Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. CRC press.
Chen et al. [2017] Chen, X., Yuan, G., Nie, F., & Huang, J. Z. (2017). Semi-supervised feature selection via rescaled linear regression. In IJCAI (pp. 1525–1531).
Efron et al. [2004] Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. et al. (2004). Least angle regression. The Annals of statistics, 32, 407–499.
Friedman [2001] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, (pp. 1189–1232).
Ghaddar & Naoum-Sawaya [2018] Ghaddar, B., & Naoum-Sawaya, J. (2018). High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research, 265, 993–1004.
